## Scrape IMDB Movies with Sequels/Franchise/Universe (Part II)

This notebook continues from Part I, which scraped the movie data from the sequel list.

Part II takes the movie URLs scraped in Part I and extracts more data, such as budgets, release dates, and potentially the cast. Additional CSV files covering movie franchises such as James Bond, Madea, and others are concatenated with the Part I data.

## Load Libraries

In [1]:
from bs4            import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen       as uReq  # Web client
from tqdm import tqdm

import pandas as pd
import numpy  as np

import time

## Load Data 

In [2]:
df_mo = pd.read_csv('./data/movies_with_sequels_imdb_first_pass_part_01_raw.csv')

df_mo_bond  = pd.read_csv('./data/movies_with_sequels_imdb_first_pass_james_bond.csv')
df_mo_madea = pd.read_csv('./data/movies_with_sequels_imdb_first_pass_madea.csv')

# Concatenate
df_mo = pd.concat([df_mo,df_mo_bond], ignore_index=True)
df_mo = pd.concat([df_mo,df_mo_madea],ignore_index=True)

df_mo.shape

(1150, 8)

## Some Cleaning

The cleaning step removes movies released on video or TV and movies that have not been released yet.

In [3]:
# Convert Year-Rel-Type to String first
df_mo['Year-Rel-Type'] = df_mo['Year-Rel-Type'].astype(str)

In [4]:
# Remove unwanted movies that were released on Videos and TV
df_no_videos = df_mo[~df_mo['Year-Rel-Type'].str.contains('Video')]

df_no_tv_videos = df_no_videos[~df_no_videos['Year-Rel-Type'].str.contains('TV')]

# Remove any movies that have not been released yet.
# .copy() creates an independent DataFrame, so the column assignment below
# does not trigger pandas' SettingWithCopyWarning
df_mo_clean_1 = df_no_tv_videos[df_no_tv_videos['Year-Rel-Type'] != 'None'].copy()

# Now convert the column back to integer
df_mo_clean_1['Year-Rel-Type'] = pd.to_numeric(df_mo_clean_1['Year-Rel-Type'], downcast='integer')

# Remove again those that have not been released
df_mo_clean_1 = df_mo_clean_1[df_mo_clean_1['Year-Rel-Type'] < 2021]

# Remove movies with no runtime information
df_mo_clean_2 = df_mo_clean_1[df_mo_clean_1['Runtime'] != 'None']


In [5]:
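# reset_index must be called and its result assigned; the bare attribute does nothing
df_mo_clean_2 = df_mo_clean_2.reset_index(drop=True)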

df_mo_clean_2.shape

(976, 8)

In [6]:
df_mo_clean_2.to_csv('./data/movies_with_sequels_imdb_clean.csv',index=False)

In [8]:
# Those that are missing Metacritic scores
print('Missing Metacritic Scores:', len(df_mo_clean_2[df_mo_clean_2['Metacritic Score'] == 'None']))


Missing Metacritic Scores: 112


In [9]:
print('Missing IMDB Scores: ', len(df_mo_clean_2[df_mo_clean_2['IMDB Score'] == 'None']))

Missing IMDB Scores:  0


In [10]:
# Let's check the average and std of IMDB scores
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
df_mo_clean_2 = df_mo_clean_2.copy()
df_mo_clean_2['IMDB Score'] = df_mo_clean_2['IMDB Score'].astype(float)


In [11]:
# Hold on to these values for now
print('Mean  IMDB Score: ', df_mo_clean_2['IMDB Score'].mean() )
print('Stdev IMDB Score: ',  df_mo_clean_2['IMDB Score'].std() )

Mean  IMDB Score:  6.361987704918028
Stdev IMDB Score:  1.129292832434149


In [12]:
# Extract Metacritic Scores
df_meta = df_mo_clean_2[df_mo_clean_2['Metacritic Score'] != 'None']

df_meta = df_meta['Metacritic Score'].astype(float)

print('Mean  Meta Score: ', df_meta.mean() )
print('Stdev Meta Score: ', df_meta.std()  )

Mean  Meta Score:  53.99421296296296
Stdev Meta Score:  17.421067514775395
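
The Metacritic cell above filters out the `'None'` strings before casting to float. An alternative, sketched here on a few invented scores rather than the real data, is to let `pd.to_numeric` with `errors='coerce'` turn unparseable entries into `NaN`, which `mean()` and `std()` skip by default.

```python
import pandas as pd

# Invented toy values, not taken from the scraped data
scores = pd.Series(['71', 'None', '45', 'None'], name='Metacritic Score')

# Strings that cannot be parsed as numbers (here 'None') become NaN
numeric = pd.to_numeric(scores, errors='coerce')

print('Missing:', int(numeric.isna().sum()))
print('Mean:   ', numeric.mean())  # NaN entries are ignored
```

This keeps the missing-score count and the statistics in one column, at the cost of replacing the `'None'` marker with `NaN`.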


## Load Supplementary Data

This spreadsheet contains extra movies that were searched for and curated manually. The CSV file contains only a basic title and link for each movie.

In [13]:
df_mo_extra = pd.read_csv('./data/movies_sequels_extra_manual_input.csv')

#df_mo_clean_title_url = df_mo_clean_2[['Title','url']]

# Now concatenate
df_mo_clean = pd.concat([df_mo_clean_2,df_mo_extra], ignore_index=True)

df_mo_clean = df_mo_clean[['Title','url']]

df_mo_clean.to_csv('./data/movies_sequels_clean_extra_title_url.csv',index=False)

print('Number of Rows: ', len(df_mo_clean))

Number of Rows:  1089


## Start Web Scraping

The function below goes through each movie webpage and extracts the following items:

- Title
- IMDB Score
- Metacritic Score
- Genres
- Runtime
- MPAA rating
- Budget
- The opening weekend income
- Gross in the USA
- World Wide Gross
- Exact Release Date
- Country of Origin

In the future, the cast, director, producer, studio, and writers also need to be extracted.

In [14]:
def get_movie_details(movie_urls):
    
    # This determines the size of 'None' array to be later used 
    # in extending the empty list. If more information is needed, this 
    # number will change
    list_size_temp = 7
    
    # Add a sleep time (seconds) to not overload site
    sleep_time = 5
    
    count = 0    # Initialize Counter
    step  = 100  # Sleep every 100 items
    
    # Initialize arrays
    movie_title     = []
    imdb_score      = []
    meta_score      = []
    mpaa_rating     = []
    genres          = []
    
    budget          = []
    opening_weekend = []
    gross_usa       = []
    gross_world     = []
    rel_date        = []
    country         = []
    runtime         = []    
    
    # Now go through each movie's URL page and extract financial
    # and other information
    for i_url in tqdm(range(0,len(movie_urls))):
    #for i_url in tqdm(range(515,521)):
    
        count += 1
                
        url = movie_urls[i_url]
        
        # Opens the connection and downloads html page from url
        uClient = uReq(url)
        
        # Parses html into a soup data structure to traverse html
        # as if it were a json data type.
        page_soup = soup(uClient.read(), "html.parser")
        uClient.close()
        
        # Get Title
        try:
            m_title = page_soup.find('div', class_ = 'title_wrapper').h1.text.strip()
            movie_title.append(m_title)
        except Exception as e:
            movie_title.append('None')
        
        
        # Get IMDB Score
        try:
            im_score = page_soup.find('div', class_ = 'ratingValue').text.split('/')[0].strip()
            #print('Score: ', im_score)
            imdb_score.append(im_score)
        except Exception as e:
            imdb_score.append('None')
        
        
        # Get MPAA Rating
        try:
            rating = page_soup.find('div', class_ = 'subtext').text.split('|')[0].strip()
            mpaa_rating.append(rating)
            #print('Rated: ',rating)
        except Exception as e:
            mpaa_rating.append('None')
        
        
        # Get Genre
        try:
            genre = []
            gen = page_soup.find('div', class_ = 'subtext').text.split('|')[2]
            
            for i_gen in gen.split(','):
                genre.append(i_gen.strip())
            
            genres.append(' '.join(genre))
                        
        except Exception as e:
            genres.append('None')
        
        
        # Get Metacritic Score
        try:
            meta = page_soup.find('div', class_ = 'titleReviewBarItem').a.text.strip()
            meta_score.append(int(meta))
        except Exception as e:
            meta_score.append('None')
            
        
        # Further information is stored in h4 tags
        # (find_all is the current name; findAll is a deprecated alias)
        h4s = page_soup.find_all('h4')
        
        # Initialize None array 
        detail = list_size_temp*['None']
        
        for h4 in h4s:
            if 'Budget:' in h4:
                # Some movies are quoted in Euros, Francs, GBP, HKD, etc. Best is to just
                # parse the entire string and remove the commas. 
                # Deal with the conversion rate later
                #budg = int(''.join(h4.next_sibling.strip().replace('EUR','').replace('GBP','').replace('FRF','').replace('$','').split(',')))
                budg = ''.join(h4.next_sibling.strip().split(','))
                detail[0] = budg

            if 'Opening Weekend USA:' in h4:
                op_wknd = int(''.join(h4.next_sibling.strip().replace('$','').split(',')))
                detail[1] = op_wknd

            if 'Gross USA:' in h4:
                grs_usa = int(''.join(h4.next_sibling.strip().replace('$','').split(',')))
                detail[2] = grs_usa

            if 'Cumulative Worldwide Gross:' in h4:
                grs_ww = int(''.join(h4.next_sibling.strip().replace('$','').split(',')))
                #print('WorldWide: ',int(''.join(h4.next_sibling.strip().replace('$','').split(','))))
                detail[3] = grs_ww

            # Get the release date
            if 'Release Date:' in h4:
                r_date = ' '.join(h4.next_sibling.strip().split()[:-1])
                detail[4] = r_date
                #print(temp5)

            # Get Country
            if 'Country:' in h4:
                cntry = h4.next_sibling.next_sibling.text
                detail[5] = cntry
                
            # Runtime
            if 'Runtime:' in h4:
                runt = h4.next_sibling.next_sibling.text.strip().split(' ')[0]
                detail[6] = runt
    
                
        budget.append(detail[0])
        opening_weekend.append(detail[1])
        gross_usa.append(detail[2])
        gross_world.append(detail[3])
        rel_date.append(detail[4])
        country.append(detail[5])
        runtime.append(detail[6])
        
        # Reprieve
        if (count%step == 0):
            time.sleep(sleep_time)
        
    movie_dict = {'Title'           : movie_title,
                  'url'             : list(movie_urls),  # one URL per row; `url` alone held only the last URL visited
                  'IMDB Score'      : imdb_score,
                  'Metacritic'      : meta_score,
                  'Runtime (mins)'  : runtime,
                  'Budget'          : budget,
                  'Opening Weekend' : opening_weekend,
                  'Gross USA'       : gross_usa,
                  'Gross World'     : gross_world,
                  'Release Date'    : rel_date,
                  'Rating'          : mpaa_rating,
                  'Genres'          : genres,
                  'Country'         : country}
    
    dfm = pd.DataFrame(movie_dict)
    
    return(dfm)
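
The three dollar-amount fields in the function above (opening weekend, US gross, worldwide gross) repeat the same strip/replace/split chain, and `int()` raises if the text is not a plain dollar figure. A small helper could centralize that step. The sketch below is an assumption about how it might look; `parse_usd` is a hypothetical name, not part of the notebook.

```python
def parse_usd(text):
    """Convert a string such as '$200,000,000' to an int.

    Returns the string 'None' (the notebook's marker for missing values)
    when the text is not a plain dollar amount.
    """
    cleaned = text.strip().replace('$', '').replace(',', '')
    return int(cleaned) if cleaned.isdigit() else 'None'


print(parse_usd(' $585,366,247 '))
print(parse_usd('EUR 20,000,000'))
```

Inside the loop, each field would then reduce to something like `detail[1] = parse_usd(h4.next_sibling)`.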

In [15]:
# Extract urls only as input
m_urls = df_mo_clean['url']

# Run the scraper; the run shown below took about 22 minutes for 1,089 URLs
df_movie_details = get_movie_details(m_urls)

df_movie_details.head()

print(len(df_movie_details))

100%|██████████| 1089/1089 [21:46<00:00,  1.20s/it]

1089





In [16]:
df_movie_details.to_csv('./data/movies_with_sequels_imdb_details_raw.csv',index=False)
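
The scraping loop calls `uReq(url)` without any error handling, so a single failed request would abort a run that takes over 20 minutes. A minimal sketch of a retry wrapper follows. `fetch_html` and its parameters are hypothetical names, not part of the notebook, and the `opener` argument exists only so the function can be exercised without network access.

```python
import time
from urllib.request import urlopen


def fetch_html(url, opener=urlopen, retries=3, delay=5.0):
    """Return the raw page bytes, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            client = opener(url)
            try:
                return client.read()
            finally:
                client.close()
        except OSError as err:  # URLError, HTTPError, and timeouts are OSError subclasses
            print(f'Attempt {attempt} failed for {url}: {err}')
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    return None
```

In the loop, a `None` result could be handled by appending `'None'` to every list for that movie, so that all columns keep the same length.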

## Next Step

After reviewing the last generated CSV file, I took several cleaning steps:

- For series with an odd number of installments, I removed the last installment. Example: the Iron Man series is numbered 1 through 3. I removed the 3rd installment since the only target variable I need is Iron Man 2. Likewise, if there are 5 movies in a series, I remove the 5th movie. This was a painful manual process that could have been eased by using nested dictionaries in JSON format.
- In addition, I removed movies that were missing budget information. These will be added back later, with their budgets predicted by regression.
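
The nested-dictionary idea mentioned in the first bullet could look roughly like the sketch below. The layout (franchise, then installment number, then a film record), the function name `drop_last_if_odd`, and the "Example Series" entry are all assumptions made for illustration; only the Iron Man titles come from the text.

```python
import json

# Hypothetical layout: franchise -> installment number -> film record
franchises = {
    "Iron Man": {
        "1": {"title": "Iron Man"},
        "2": {"title": "Iron Man 2"},
        "3": {"title": "Iron Man 3"},
    },
    "Example Series": {  # placeholder franchise with an even count
        "1": {"title": "Example"},
        "2": {"title": "Example 2"},
    },
}


def drop_last_if_odd(series):
    """Return a copy of `series` without its last installment when the count is odd."""
    keys = sorted(series, key=int)
    if len(keys) % 2 == 1:
        keys = keys[:-1]
    return {k: series[k] for k in keys}


trimmed = {name: drop_last_if_odd(films) for name, films in franchises.items()}
print(json.dumps(trimmed, indent=2))
```

Keys are strings because JSON object keys must be strings; `sorted(..., key=int)` keeps installment 10 after installment 9.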

Next Steps:

- The budget information contains several different currencies. These need to be converted to USD.
- Convert the MPAA rating to a numerical value if needed. This needs testing to see how closely an ML model matches the box office returns. The same may apply to the genre.
- Models in the pipeline are Linear Regression, Random Forest Regressor, XGBoost Regressor, and LightGBM.
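
For the MPAA rating step, one option is an ordinal encoding, sketched below. The mapping and the name `encode_rating` are assumptions, and whether an ordering helps the models is exactly what the planned testing would have to show; one-hot encoding is an alternative if it does not.

```python
# Assumed ordering from least to most restrictive
RATING_ORDER = {'G': 0, 'PG': 1, 'PG-13': 2, 'R': 3, 'NC-17': 4}


def encode_rating(rating):
    """Map an MPAA rating string to an ordinal, or None for anything else."""
    return RATING_ORDER.get(rating.strip().upper())


print(encode_rating('PG-13'))
print(encode_rating('Not Rated'))
```

Returning `None` for unknown values matters here because the scraper takes the first `|`-separated field of the subtext as the rating, so the column may contain text that is not a rating at all.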

In [None]:
# Load the budget data used by the cells below
df_mo_finances = pd.read_csv('./data/movies_with_sequels_imdb_first_pass_part_02_budgets_n_extra_raw.csv')

In [None]:
# Extract budgets that are not in USD
df_non_usd_temp = df_mo_finances[~df_mo_finances['Budget'].str.startswith('$')]
# Build the second mask from the filtered frame so its index matches the rows being selected
df_non_usd      = df_non_usd_temp[~df_non_usd_temp['Budget'].str.match('None')]

df_non_usd.head()

In [None]:
df_non_usd.shape
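
Budgets are stored as strings with the commas removed, so a non-USD entry carries a currency prefix (the commented-out line in the scraper mentions `EUR`, `GBP`, and `FRF`). One possible shape for the conversion step is sketched below. The function name `budget_to_usd` is hypothetical, and the rates in the example are made-up numbers for illustration only, not real exchange rates.

```python
import re

# Currency prefix (non-digit, non-space characters) followed by the amount
BUDGET_RE = re.compile(r'^\s*([^\d\s]+)\s*([\d,]+)\s*$')


def budget_to_usd(text, rates):
    """Convert e.g. 'EUR20000000' to USD using `rates`; return None if unparseable."""
    match = BUDGET_RE.match(text)
    if match is None:
        return None
    currency, digits = match.groups()
    if currency not in rates:
        return None  # unknown currency: leave it for manual review
    return int(digits.replace(',', '')) * rates[currency]


# Made-up rates for illustration only
example_rates = {'$': 1.0, 'EUR': 2.0}

print(budget_to_usd('$200000000', example_rates))
print(budget_to_usd('EUR20000000', example_rates))
print(budget_to_usd('None', example_rates))
```

Since the films span many release years, a single rate per currency is a simplification; historical rates would be more faithful.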

In [None]:
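'''
# Let's test selenium
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.bitbar.com/enterprise/")
# Selenium 4 removed find_elements_by_css_selector; use find_elements with By.CSS_SELECTOR
lines = driver.find_elements(By.CSS_SELECTOR, ".b-cta__content > h2:nth-child(1)")

for line in lines:
    print(line.text.strip())

driver.quit()
''';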