## Scrape IMDB Movies with Sequels/Franchise (Part I)

As part of revamping Project 2, and potentially for Springboard, I am re-scraping the movie sequels list on IMDB. This should give better content going forward.

This notebook only scrapes the movie list at:

https://www.imdb.com/list/ls003495084/?st_dt=&mode=detail&page=1&sort=list_order,asc

## Load Libraries

In [110]:
from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

import pandas as pd
import numpy as np

from tqdm import tqdm

## Start Scraping

Main page url is: https://www.imdb.com/list/ls003495084/?st_dt=&mode=detail&page=1&sort=list_order,asc

The idea here is as follows:

1. Scrape the urls of all movies on the sequels page
2. Use those urls to go to their respective pages and parse necessary data

The challenge here is to page through the list and grab all the data. Not every movie is available there: for example, the James Bond movies are missing, and I found out that Transformers: The Last Knight is not on that list at all.
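
As a minimal sketch of the paging part of step 1, the per-page URLs can be generated up front from the main page URL shown above. The names `LIST_URL_TEMPLATE` and `build_page_urls` are hypothetical helpers introduced here for illustration; the scraping function below builds the same string inline instead.

```python
# Hypothetical helper: builds one URL per page of the sequels list.
# The template is the main page URL with the page number left as a placeholder.
LIST_URL_TEMPLATE = ('https://www.imdb.com/list/ls003495084/'
                     '?st_dt=&mode=detail&page={page}&sort=list_order,asc')


def build_page_urls(num_pages):
    """Return the list-page URLs for pages 1..num_pages (inclusive)."""
    return [LIST_URL_TEMPLATE.format(page=p) for p in range(1, num_pages + 1)]


page_urls = build_page_urls(12)
print(len(page_urls))
print(page_urls[0])
```

Keeping the URL construction separate from the download step makes it easy to check the page range before any request is sent.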

In [2]:
def get_movie_sequels_urls_n_more(num_pages=12):
    
    # Initialize lists
    movie_url        = []
    movie_title      = []
    movie_year       = []
    movie_imdb_score = []
    movie_meta_score = []
    movie_rating     = []
    movie_genre      = []
    movie_runtime    = []
    
    # Function to get main page urls and extra information
    for ipage in range(1,num_pages+1):
        
        print('Start: ', ipage)
        
        # This is the list of movie sequels page
        main_page_url =  'https://www.imdb.com/list/ls003495084/?st_dt=&mode=detail&page='\
                      +   str(ipage) + '&sort=list_order,asc'
        
        # Opens the connection and downloads html page from url
        uClient = uReq(main_page_url)
        
        # Parses html into a soup data structure to traverse html
        # as if it were a json data type.
        page_soup = soup(uClient.read(), "html.parser")
        uClient.close()
        
        # Now get all movie containers (typically should total 100)
        containers = page_soup.find_all('div', class_ = 'lister-item-content')
        
        # Loop through all the containers
        
        for container in containers:
            # Grab url link
            try:
                m_url = 'http://www.imdb.com' + container.a['href']
                movie_url.append(m_url)
            except Exception as e:
                movie_url.append('None')
            
            # Grab movie title
            try:
                movie_title.append(container.a.text)
            except Exception as e:
                movie_title.append('None')
                
            # Grab the movie year which will also include whether or not this was a video release
            try:
                m_year = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').text.strip()
                m_year = m_year.replace('(','')
                m_year = m_year.replace(')','')
                m_year = m_year.replace('I','')
                
                if m_year != '':
                    movie_year.append(m_year.strip())
                else:
                    movie_year.append('None')
            except Exception as e:
                movie_year.append('None')
            
            # Grab IMDB score
            try:
                m_imdb_score = container.find('span', class_ = 'ipl-rating-star__rating').text
                movie_imdb_score.append(m_imdb_score)
            except Exception as e:
                movie_imdb_score.append('None')
                
            # Grab Metacritic score
            try:
                m_meta_score = container.find('div', class_ = 'inline-block ratings-metascore').span.text
                movie_meta_score.append(m_meta_score.strip())
            except Exception as e:
                movie_meta_score.append('None')
                
            # Now extract rating, runtime and genre
            top_banner = container.find('p', class_ = 'text-muted text-small')
            
            try:
                m_rating = top_banner.find('span', class_='certificate').text.strip()
                movie_rating.append(m_rating)
            except Exception as e:
                movie_rating.append('None')
    
            try:
                runtime = top_banner.find('span', class_='runtime').text.strip()
                runtime = int(runtime.split(' ')[0])
                movie_runtime.append(runtime)
            except Exception as e:
                movie_runtime.append('None')
    
            try:
                m_genre = top_banner.find('span', class_='genre').text.strip()
                movie_genre.append(''.join(m_genre.split(',')))
            except Exception as e:
                movie_genre.append('None')
                
        print('End: ', ipage)
                
    # Create a dictionary and dump it to a dataframe
    movie_dict = {'Title'            : movie_title,
                  'url'              : movie_url, 
                  'Year-Rel-Type'    : movie_year,
                  'IMDB Score'       : movie_imdb_score,
                  'Metacritic Score' : movie_meta_score,
                  'Rating'           : movie_rating,
                  'Genre'            : movie_genre,
                  'Runtime'          : movie_runtime}
        
    df_movies = pd.DataFrame(movie_dict)
                
    return df_movies
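
The year cleanup in the function above removes the parentheses and every capital `I`, which leaves the year and the release type together in one string (hence the `Year-Rel-Type` column). A possible alternative, sketched here with a regular expression, splits the two apart. The function name `split_year_and_type` and the sample strings are assumptions for illustration; the exact text IMDB renders in that span may differ.

```python
import re

# Optional Roman-numeral disambiguator such as "(I)" or "(II)",
# then "(YYYY)" or "(YYYY <release type>)".
_YEAR_RE = re.compile(r"^(?:\([IVXLC]+\)\s*)?\((\d{4})(?:\s+([^)]+))?\)$")


def split_year_and_type(raw):
    """Split a year string into (year, release_type).

    release_type is '' when only a year is present. Strings that do not
    match the pattern (empty text, year ranges, ...) give ('None', 'None'),
    mirroring the 'None' placeholder used in the scraper.
    """
    match = _YEAR_RE.match(raw.strip())
    if match is None:
        return ('None', 'None')
    year, rel_type = match.group(1), match.group(2)
    return (year, rel_type or '')


# Assumed sample inputs, not scraped values
for text in ['(2002)', '(I) (2004)', '(2005 Video)', '']:
    print(repr(text), '->', split_year_and_type(text))
```

Storing the year on its own would allow the column to be converted to an integer later without further string handling.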



In [4]:
num_page_test = 12
dfm = get_movie_sequels_urls_n_more(num_page_test)

Start:  1
End:  1
Start:  2
End:  2
Start:  3
End:  3
Start:  4
End:  4
Start:  5
End:  5
Start:  6
End:  6
Start:  7
End:  7
Start:  8
End:  8
Start:  9
End:  9
Start:  10
End:  10
Start:  11
End:  11
Start:  12
End:  12


In [5]:
dfm.iloc[499,:]

Title               Lara Croft Tomb Raider: The Cradle of Life
url                       http://www.imdb.com/title/tt0325703/
Year-Rel-Type                                             2003
IMDB Score                                                 5.5
Metacritic Score                                            43
Rating                                                   PG-13
Genre                                 Action Adventure Fantasy
Runtime                                                    117
Name: 499, dtype: object

In [109]:
# Dump to a csv file
dfm.to_csv('./data/movies_with_sequels_imdb_first_pass.csv',index=False)

In [7]:
dfm1 = pd.read_csv('./data/movies_with_sequels_imdb_first_pass.csv')

dfm1.head()

|   | Title | url | Year-Rel-Type | IMDB Score | Metacritic Score | Rating | Genre | Runtime |
|---|---|---|---|---|---|---|---|---|
| 0 | Spider-Man | http://www.imdb.com/title/tt0145487/ | 2002 | 7.3 | 73 | PG-13 | Action Adventure Sci-Fi | 121 |
| 1 | Spider-Man 2 | http://www.imdb.com/title/tt0316654/ | 2004 | 7.3 | 83 | PG-13 | Action Adventure Sci-Fi | 127 |
| 2 | Spider-Man 3 | http://www.imdb.com/title/tt0413300/ | 2007 | 6.2 | 59 | PG-13 | Action Adventure Sci-Fi | 139 |
| 3 | The Matrix | http://www.imdb.com/title/tt0133093/ | 1999 | 8.7 | 73 | R | Action Sci-Fi | 136 |
| 4 | The Matrix Reloaded | http://www.imdb.com/title/tt0234215/ | 2003 | 7.2 | 62 | R | Action Sci-Fi | 138 |


In [96]:
# Need a way to break-up the categorical data in Genre
genre = dfm['Genre']

#print(list(genre))

master = []

for item in genre:
    
    list_genre = item.split(' ')
    
    master.extend(list_genre)
    
unique_genres = sorted(list(set(master)))

print(len(unique_genres))
print(unique_genres)

dfm_test = dfm.iloc[:10,:]

#dfm_test_copy = dfm_test.copy()

#dfm_test_copy['Action']    = 0
#dfm_test_copy['Adventure'] = 0
#dfm_test_copy['Sci-Fi']    = 0

dfm_test.head(10)

dfm_new = dfm_test.drop(['Title','url'],axis=1)

dfm_new_copy = dfm_new.copy()

for gg in unique_genres:
    dfm_new_copy[gg] = 0
    
dfm_new_copy.head(10)

21
['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Short', 'Sport', 'Thriller', 'War', 'Western']


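|   | Year-Rel-Type | IMDB Score | Metacritic Score | Rating | Genre | Runtime | Action | Adventure | Animation | Biography | ... | Music | Musical | Mystery | Romance | Sci-Fi | Short | Sport | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002 | 7.3 | 73 | PG-13 | Action Adventure Sci-Fi | 121 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2004 | 7.3 | 83 | PG-13 | Action Adventure Sci-Fi | 127 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2007 | 6.2 | 59 | PG-13 | Action Adventure Sci-Fi | 139 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1999 | 8.7 | 73 | R | Action Sci-Fi | 136 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2003 | 7.2 | 62 | R | Action Sci-Fi | 138 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2003 | 6.8 | 47 | R | Action Sci-Fi | 129 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 2001 | 8.8 | 92 | PG-13 | Action Adventure Drama | 178 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 2002 | 8.7 | 87 | PG-13 | Adventure Drama Fantasy | 179 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 2003 | 8.9 | 94 | PG-13 | Adventure Drama Fantasy | 201 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1985 | 8.5 | 87 | PG | Adventure Comedy Sci-Fi | 116 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |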
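
The genre columns above are only initialized to 0. One way to fill them, sketched here on a small sample frame rather than on `dfm` itself, is pandas' `Series.str.get_dummies`, which splits each string on a separator and returns one indicator column per distinct token. The `sample` frame and its three rows (genre strings taken from the table above) are for illustration only.

```python
import pandas as pd

# Small stand-in for dfm; genre strings copied from the rows shown above
sample = pd.DataFrame({
    'Genre': ['Action Adventure Sci-Fi',
              'Action Sci-Fi',
              'Adventure Drama Fantasy'],
})

# One 0/1 column per genre token; tokens are separated by a single space.
# Note: a placeholder value such as 'None' would become its own column.
genre_flags = sample['Genre'].str.get_dummies(sep=' ')

# Attach the indicator columns next to the original data
sample_encoded = pd.concat([sample, genre_flags], axis=1)

print(sample_encoded)
```

This replaces both the manual collection of unique genres and the loop that creates the zero-filled columns, since the indicator columns are derived directly from the strings.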


## Scrape James Bond URLs

In [None]:
# Bond movies are on a separate list page on IMDB

In [112]:
def get_movie_sequels_urls_n_more_bond(num_pages=1):
    
    # Initialize lists
    movie_url        = []
    movie_title      = []
    movie_year       = []
    movie_imdb_score = []
    movie_meta_score = []
    movie_rating     = []
    movie_genre      = []
    movie_runtime    = []
    
    # Function to get main page urls and extra information
    for ipage in range(1,num_pages+1):
        
        print('Start: ', ipage)
        
        # This is the James Bond page (the URL has no page parameter, so ipage is not used)
        main_page_url =  'https://www.imdb.com/list/ls006405458/'
        
        # Opens the connection and downloads html page from url
        uClient = uReq(main_page_url)
        
        # Parses html into a soup data structure to traverse html
        # as if it were a json data type.
        page_soup = soup(uClient.read(), "html.parser")
        uClient.close()
        
        # Now get all movie containers (typically should total 100)
        containers = page_soup.find_all('div', class_ = 'lister-item-content')
        
        # Loop through all the containers
        
        for container in containers:
            # Grab url link
            try:
                m_url = 'http://www.imdb.com' + container.a['href']
                movie_url.append(m_url)
            except Exception as e:
                movie_url.append('None')
            
            # Grab movie title
            try:
                movie_title.append(container.a.text)
            except Exception as e:
                movie_title.append('None')
                
            # Grab the movie year which will also include whether or not this was a video release
            try:
                m_year = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').text.strip()
                m_year = m_year.replace('(','')
                m_year = m_year.replace(')','')
                m_year = m_year.replace('I','')
                
                if m_year != '':
                    movie_year.append(m_year.strip())
                else:
                    movie_year.append('None')
            except Exception as e:
                movie_year.append('None')
            
            # Grab IMDB score
            try:
                m_imdb_score = container.find('span', class_ = 'ipl-rating-star__rating').text
                movie_imdb_score.append(m_imdb_score)
            except Exception as e:
                movie_imdb_score.append('None')
                
            # Grab Metacritic score
            try:
                m_meta_score = container.find('div', class_ = 'inline-block ratings-metascore').span.text
                movie_meta_score.append(m_meta_score.strip())
            except Exception as e:
                movie_meta_score.append('None')
                
            # Now extract rating, runtime and genre
            top_banner = container.find('p', class_ = 'text-muted text-small')
            
            try:
                m_rating = top_banner.find('span', class_='certificate').text.strip()
                movie_rating.append(m_rating)
            except Exception as e:
                movie_rating.append('None')
    
            try:
                runtime = top_banner.find('span', class_='runtime').text.strip()
                runtime = int(runtime.split(' ')[0])
                movie_runtime.append(runtime)
            except Exception as e:
                movie_runtime.append('None')
    
            try:
                m_genre = top_banner.find('span', class_='genre').text.strip()
                movie_genre.append(''.join(m_genre.split(',')))
            except Exception as e:
                movie_genre.append('None')
                
        print('End: ', ipage)
                
    # Create a dictionary and dump it to a dataframe
    movie_dict = {'Title'            : movie_title,
                  'url'              : movie_url, 
                  'Year-Rel-Type'    : movie_year,
                  'IMDB Score'       : movie_imdb_score,
                  'Metacritic Score' : movie_meta_score,
                  'Rating'           : movie_rating,
                  'Genre'            : movie_genre,
                  'Runtime'          : movie_runtime}
        
    df_movies = pd.DataFrame(movie_dict)
                
    return(df_movies)


In [113]:
df_mo_bond = get_movie_sequels_urls_n_more_bond(1)

Start:  1
End:  1


In [117]:
df_mo_bond.to_csv('./data/movies_with_sequels_imdb_first_pass_james_bond.csv',index=False)

## Scrape Madea URLs

In [156]:
def get_movie_sequels_urls_n_more_madea(num_pages=1):
    
    # Initialize lists
    movie_url        = []
    movie_title      = []
    movie_year       = []
    movie_imdb_score = []
    movie_meta_score = []
    movie_rating     = []
    movie_genre      = []
    movie_runtime    = []
    
    # Function to get main page urls and extra information
    for ipage in range(1,num_pages+1):
        
        print('Start: ', ipage)
        
        # This is the Madea page (keyword search results); the page number follows ipage
        main_page_url =  'https://www.imdb.com/search/keyword/?keywords=madea-series&ref_=fn_al_kw_1'\
                      +  '&sort=release_date,asc&mode=detail&page=' + str(ipage)
        
        # Opens the connection and downloads html page from url
        uClient = uReq(main_page_url)
        
        # Parses html into a soup data structure to traverse html
        # as if it were a json data type.
        page_soup = soup(uClient.read(), "html.parser")
        uClient.close()
        
        # Now get all movie containers (typically should total 100)
        containers = page_soup.find_all('div', class_ = 'lister-item-content')
        
        # Loop through all the containers
        
        for container in containers:
            # Grab url link
            try:
                m_url = 'http://www.imdb.com' + container.a['href']
                movie_url.append(m_url)
            except Exception as e:
                movie_url.append('None')
            
            # Grab movie title
            try:
                movie_title.append(container.a.text)
            except Exception as e:
                movie_title.append('None')
                
            # Grab the movie year which will also include whether or not this was a video release
            try:
                m_year = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').text.strip()
                m_year = m_year.replace('(','')
                m_year = m_year.replace(')','')
                m_year = m_year.replace('I','')
                
                if m_year != '':
                    movie_year.append(m_year.strip())
                else:
                    movie_year.append('None')
            except Exception as e:
                movie_year.append('None')
                
            # Grab IMDB score (the keyword search page uses different markup than the list pages)
            try:
                m_imdb_score = container.find('div', class_ = 'inline-block ratings-imdb-rating').strong.text
                movie_imdb_score.append(m_imdb_score)
            except Exception as e:
                movie_imdb_score.append('None')
                
            # Grab Metacritic score
            try:
                m_meta_score = container.find('div', class_ = 'inline-block ratings-metascore').span.text
                movie_meta_score.append(m_meta_score.strip())
            except Exception as e:
                movie_meta_score.append('None')
                
            # Now extract rating, runtime and genre
            top_banner = container.find('p', class_ = 'text-muted text-small')
            
            try:
                m_rating = top_banner.find('span', class_='certificate').text.strip()
                movie_rating.append(m_rating)
            except Exception as e:
                movie_rating.append('None')
    
            try:
                runtime = top_banner.find('span', class_='runtime').text.strip()
                runtime = int(runtime.split(' ')[0])
                movie_runtime.append(runtime)
            except Exception as e:
                movie_runtime.append('None')
    
            try:
                m_genre = top_banner.find('span', class_='genre').text.strip()
                movie_genre.append(''.join(m_genre.split(',')))
            except Exception as e:
                movie_genre.append('None')
                
        print('End: ', ipage)
                
    # Create a dictionary and dump it to a dataframe
    movie_dict = {'Title'            : movie_title,
                  'url'              : movie_url, 
                  'Year-Rel-Type'    : movie_year,
                  'IMDB Score'       : movie_imdb_score,
                  'Metacritic Score' : movie_meta_score,
                  'Rating'           : movie_rating,
                  'Genre'            : movie_genre,
                  'Runtime'          : movie_runtime}
        
    df_movies = pd.DataFrame(movie_dict)
                
    return(df_movies)

In [157]:
df_mo_madea = get_movie_sequels_urls_n_more_madea(1)

Start:  1
End:  1


In [158]:
df_mo_madea.to_csv('./data/movies_with_sequels_imdb_first_pass_madea.csv',index=False)