masumrumi/imdb_rotten_web-scraping

In this project, I am going to start learning how to do web scraping. I am planning to scrape IMDb and Rotten Tomatoes to get info about each movie. My aim is to collect as much data as possible. This will be a continuous project with the implementation of machine learning and NLP. We will start with the machine learning section and try to predict the movie score based on all the info we collect from those two websites. Let's get started.

Objectives

Importing all the necessary modules

We start out by importing a bunch of libraries that will help us with the task.

## Importing necessary modules
import csv  ## For working with CSV files.
import re  ## Regular expressions.
import time  ## For timing the code.

import numpy as np  ## Helps with most of the linear algebra work.
import pandas as pd
import requests  ## requests fetches the web pages we want to scrape.
from bs4 import BeautifulSoup  ## Library that parses HTML objects nicely.

pd.set_option('display.max_columns', 500)  ## To see all the columns.
pd.set_option('display.max_rows', 500)  ## To see all the rows.

import warnings  ## Importing the warnings library.
warnings.filterwarnings('ignore')  ## Ignore warnings.

Collecting Data

IMDb web scraping part #1

## Some functions to extract data
def metascore_getter_class(css_class):
    return css_class in ("metascore mixed",
                         "metascore favorable",
                         "metascore unfavorable")
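BeautifulSoup accepts a function as the class_ filter: find_all keeps only the tags whose class string makes the function return True. A quick illustration on a made-up HTML snippet:

html = '<span class="metascore favorable">81</span><span class="score">3</span>'
demo = BeautifulSoup(html, "lxml")

## only the first span matches, via its class string "metascore favorable"
print(demo.find_all(class_=metascore_getter_class))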

    
%%time
## Looping through each page of the IMDb search results, which shows 50
## movies per page, picking only movies with more than 1,000 votes
## and a rating either above 8 or below 5.


## Nested for loops increase execution time dramatically.
## I wrote this code when I started out learning Python, and it was an amazing time.
## Hopefully I have come a long way since then, but this code hasn't. However, it
## gets the needed job done, at the expense of resources and time. I am hoping to
## replace this code very soon. Until then, enjoy!!
movies = {}
## a steps through the result pages, 50 titles at a time, up to 1,500 movies
for a in range(1, 1500, 50): 
    ## b does 2 iterations to switch the value of r
    for b in range(2): 
        
        # r with rating above 8
        r = requests.get("https://www.imdb.com/search/title/?title_type=feature,\
        tv_movie,documentary,short&user_rating=8.0,&num_votes=1000,&start="+str(a)+"&ref_=adv_nxt")
       
        ## altering the value of r, in order to scrape movies with rating lower than 5.
        if b == 1:
         
            # r with rating below 5
            r = requests.get("https://www.imdb.com/search/title/?title_type=feature,\
            tv_movie,documentary,short&user_rating=,5.0&num_votes=1000,&start="+str(a)+"&ref_=adv_nxt")
       
        ## use BeautifulSoup to parse whichever response r currently holds
        soup = BeautifulSoup(r.content, "lxml")
        
        for i in soup.findAll(class_='lister-item-content'):
            
            # Getting title_id
            movie_id = re.findall(r'tt.+\d', str(i.find("a")))[0].split("/")[0]
            
            # Getting title
            title = i.find('a').text.strip()
            
            
            # Getting genre
            try:
                genre = i.find('span', class_ = "genre").text.strip()
            except:
                genre = None
            
            # Getting runtime
            try:
                runtime = re.findall(r'\d+', i.find('span', class_ = "runtime").text)[0]
            except:
                runtime = None
            
            # Getting certificate    
            try:
                certificate = i.find("span", class_ ="certificate").text
            except:
                certificate = None
            
            # Getting imdb_rating
            imdb_rating = float(i.find("strong").text)
            
            # Getting year
            year = i.find("span", class_="lister-item-year text-muted unbold").text
            
            # Getting votes
            votes = (i.find("span", attrs={"name":"nv"}).text).replace(",","")
            
            # Getting gross
            try:
                gross = re.findall(r'\d.+\d', i.find("span", attrs={"name":"nv"}
                                                    ).find_next_sibling("span", attrs={"name":"nv"}).get_text())[0]
            except:
                gross = None
            
            ## Getting director
            try:
                director = i.find_all('p', class_="")[0].a.text.strip()
            except:
                director = None
                
                        
            ## Getting actors
            try:
                actors = i.find_all("p", class_="")[0].find_all('a')[1:]
                actors = [string_value.text.strip() for string_value in actors]
            except:
                actors = None
            
            ## getting summary
            try:
                summary = i.find_all('p', class_='text-muted')[1].text.strip()
            except:
                summary = None
            
            ## Getting metascore
            try:
                metascore = i.find_all("span", class_= metascore_getter_class)[0].text.strip()
            except:
                metascore = None
                
            movies[movie_id] = (title, runtime, genre, certificate,imdb_rating,
                                gross,year,votes,director, actors, metascore, summary)
CPU times: user 15.6 s, sys: 194 ms, total: 15.8 s
Wall time: 1min 17s
df = pd.DataFrame.from_dict(movies).T.reset_index()
header =["movie_id", "title", "runtime", "genre", "certificate", "imdb_rating", 
         "gross", "year", "votes", "director","actors", "metascore", 'summary'] 
df.columns = header
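As one possible direction for that rewrite, here is a minimal sketch of how both rating bands could go through a single helper instead of nested loops. The scrape_band helper, the requests.Session, and the one-second delay are my additions, not part of the original notebook:

def scrape_band(session, rating_filter, max_start=1500, page_size=50):
    """Yield one parsed result page per request for a user_rating band."""
    base = ("https://www.imdb.com/search/title/?title_type=feature,"
            "tv_movie,documentary,short&num_votes=1000,"
            "&user_rating=" + rating_filter + "&ref_=adv_nxt")
    for start in range(1, max_start, page_size):
        r = session.get(base + "&start=" + str(start))
        r.raise_for_status()
        yield BeautifulSoup(r.content, "lxml")
        time.sleep(1)  ## be polite to the server

## usage: one flat loop per band ("8.0," is above 8, ",5.0" is below 5)
with requests.Session() as session:
    for band in ("8.0,", ",5.0"):
        for soup in scrape_band(session, band):
            pass  ## extract fields from soup exactly as in the loop above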

IMDb web scraping part #2

## Some functions to extract data
def metascore_getter_class_2(css_class):
    return css_class in ("metacriticScore score_favorable titleReviewBarSubItem",
                         "metacriticScore score_mixed titleReviewBarSubItem",
                         "metacriticScore score_unfavorable titleReviewBarSubItem")
%%time

import sys
sys.stdout = open('/dev/stdout', 'w') ## Prints out in the terminal as code executes

movies2 = {}
for movie_id in movies.keys():
    
    ## this just prints each movie_id in the terminal so I can make sure my code is working
    sys.stdout.write(movie_id + '\n')
    
    r = requests.get("http://www.imdb.com/title/"+str(movie_id)+"/?ref_=nv_sr_1")
    soup = BeautifulSoup(r.content, "lxml")
    
    # Getting User_review_no
    try:
        user_review = soup.find(class_="titleReviewBarItem titleReviewbarItemBorder"
                               ).find("a", href=re.compile("reviews")).text
    except:
        user_review = None
        
    # Getting Critic_no
    try:
        critic_review = soup.find(class_="titleReviewBarItem titleReviewbarItemBorder"
                                 ).find("a", href=re.compile("externalreviews")).text
    except:
        critic_review = None
        

    # Getting Writer
    try:
        writer = soup.find_all(class_="credit_summary_item")[1].find_all("a", href=re.compile('name'))
        writer = [string_value.text.strip() for string_value in writer]
    except:
        writer = None

    
    # Getting Language
    language = ""
    try:
        for i in soup.find_all("h4", class_ = 'inline'):
            if i.text.strip() == "Language:":
                language = i.find_next().text
                #language.append(i.find_next().text)
    except:
        language = None
        
        
    # Getting Country
    country = ""
    try:
        for i in soup.find_all('h4',class_ = "inline"):
            if i.text.strip() == "Country:":
                country = i.parent
        country = [temp.text for temp in country.find_all("a")]
    except:
        country = None
            
    # Getting Budget
    budget = ""
    
    try:
        for i in soup.find_all("h4", class_ = "inline"):
            if i.text.strip() == "Budget:":
                budget = (i.next_sibling.strip()).replace(",","")
               # budget.append((i.next_sibling.strip()).replace(",",""))
    except:
        budget = None
            
    # Getting Gross
    gross_1 = ""
    try:
        for i in soup.find_all("h4", class_ = "inline"):
            if i.text.strip() == "Gross USA:":    
                gross_1 = (i.next_sibling.strip()).replace(",","")
                #gross_1.append((i.next_sibling.strip()).replace(",",""))
    except:
        gross_1 = None
        
    # Getting Opening Weekend
    opening_week = ""
    try:
        for i in soup.find_all("h4", class_ = "inline"):
            if i.text.strip() == "Opening Weekend USA:":
                opening_week = (i.next_sibling.strip()).replace(",","")
                #opening_week.append((i.next_sibling.strip()).replace(",",""))
    except:
        opening_week = None

    
    
    # Getting oscar_win, oscar_nom, other_win, other_nom
    # getting oscar
    for i in soup.find_all("span", class_="awards-blurb"):
        if "Won" in i.text and ("Oscar." in i.text or "Oscars." in i.text):
            oscar_win = re.findall(r'\d+', i.text.strip(""))
            break
        else:
            oscar_win = None

        
    # find nominations for oscar
    
    for i in soup.find_all("span", class_="awards-blurb"):
        if "Nominated" in i.text and ("Oscar." in i.text or "Oscars." in i.text):
            oscar_nom = re.findall(r'\d+', i.text.strip(""))
            break
        else:
            oscar_nom = None
            
 
    
    # Getting other wins
    other_win = ""
    try:
        for i in soup.find_all("span", class_= "awards-blurb"):
            #print i.text
            if ("wins" in i.text or "win" in i.text) and ("nominations" in i.text 
                                                          or "nomination." in i.text):
                other_win = re.findall(r'\d+', i.text.strip(""))[0]
            elif ("wins" in i.text or "win" in i.text) and ("nominations" not in i.text 
                                                            or "nomination." not in i.text):
                other_win = re.findall(r'\d+', i.text.strip(""))[0]
            elif ("wins" not in i.text or "win" not in i.text) and ("nominations" in i.text 
                                                                    or "nomination." in i.text):
                other_win = None
    except:
        other_win = None

        
        
    # Getting other nominations
    other_nom = ""
    try:
        for i in soup.find_all("span", class_ = "awards-blurb"):
            #print i.text
            if ("wins" in i.text or "win" in i.text) and ("nominations" in i.text 
                                                          or "nomination." in i.text):
                #other_nom.append(re.findall(r'\d+', i.text.strip(""))[1])
                other_nom = re.findall(r'\d+', i.text.strip(""))[1]
            elif ("wins" in i.text or "win" in i.text) and ("nominations" not in i.text 
                                                            or "nomination." not in i.text):
                other_nom = None
            elif ("wins" not in i.text or "win" not in i.text) and ("nominations" in i.text 
                                                                    or "nomination." in i.text):
                other_nom = re.findall(r'\d+', i.text.strip(""))[0]
    except:
        other_nom = None
        
    ## getting metascore again
    meta_score = 0
    try:
        meta_score = soup.find_all(class_ = metascore_getter_class_2)[0].text.strip()
    except:
        meta_score = None

    
    movies2[movie_id] = (user_review,critic_review,writer,language,country,
                         budget,gross_1,opening_week,oscar_win,oscar_nom,other_win,other_nom, meta_score)
header = ["movie_id","user_review", "critic_review", "writer", "language", 
           "country", "budget", "gross_1", "opening_week", "oscar_win", 
           "oscar_nom", "other_win", "other_nom", "meta_score"]

df2 = pd.DataFrame.from_dict(movies2).T.reset_index()
df2.columns = header
## The code above ran in AWS, and once the scraping was completed, the scraped data was stored as a CSV file in EC2.
df2.to_csv('input/imdb_2.csv', encoding="utf-8")
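The per-title loop above issues one request per movie, thousands in sequence, which is part of why it had to run for so long on AWS. As a sketch (my addition, with an illustrative worker count), the fetching could be parallelized with a thread pool while the parsing logic stays unchanged:

from concurrent.futures import ThreadPoolExecutor

def fetch(movie_id):
    r = requests.get("http://www.imdb.com/title/" + str(movie_id) + "/?ref_=nv_sr_1")
    return movie_id, BeautifulSoup(r.content, "lxml")

## fetch pages concurrently; parse each soup as in the loop above
with ThreadPoolExecutor(max_workers=8) as pool:
    for movie_id, soup in pool.map(fetch, movies.keys()):
        pass  ## extract user_review, critic_review, etc. from soup as above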

Rotten Tomatoes web scraping

# %%time
# import sys
# df = pd.read_csv('input/imdb_1.csv')
# ## fixing year
# df.year = df.year.apply(lambda x: int(re.findall("\d{4}",x)[0]))

# df.title = df.title.apply(lambda x:x.replace(" ", "_"))
# df.title = df.title.apply(lambda x:x.replace(".", ""))
# df.title = df.title.apply(lambda x:x.replace(":", ""))
# df.title = df.title.apply(lambda x:x.replace("-", ""))
# df.title = df.title.apply(lambda x:x.replace(",", ""))
# df.title = df.title.apply(lambda x:x.replace("'", ""))
# df.title = df.title.apply(lambda x:x.replace("__", "_"))

# ## changes the names of the spotted movies to fit the Rotten Tomatoes URL. 
# df.title = df.title.apply(lambda x:x.replace("Joker", "joker_2019"))
# df.title = df.title.apply(lambda x:x.replace("1917", "1917_2019"))
# df.title = df.title.apply(lambda x:x.replace("Parasite", "parasite_2019"))
# df.title = df.title.apply(lambda x:x.replace("The_Wolf_of_Wall_Street", "the_wolf_of_wall_street_2013"))
# df.title = df.title.apply(lambda x:x.replace("The_Shining", "shining"))
# df.title = df.title.apply(lambda x:x.replace("Shutter_Island", "1198124_shutter_island"))
# df.title = df.title.apply(lambda x:x.replace("The_Prestige", "prestige"))
# df.title = df.title.apply(lambda x:x.replace("Prisoners", "seven"))
# df.title = df.title.apply(lambda x:x.replace("Se7en", "seven"))
# df.title = df.title.apply(lambda x:x.replace("Star_Wars_Episode_IV_A_New_Hope", "star_wars"))
# df.title = df.title.apply(lambda x:x.replace("The_Last_Airbender", "last_airbender"))
# df.title = df.title.apply(lambda x:x.replace("Casablanca", "1003707_casablanca"))
# df.title = df.title.apply(lambda x:x.replace("The_Usual_Suspects", "usual_Suspects"))
# df.title = df.title.apply(lambda x:x.replace("The_Godfather_Part_II", "godfather_part_ii"))
# df.title = df.title.apply(lambda x:x.replace("Fantasy_Island", "fantasy_island_2020"))
# df.title = df.title.apply(lambda x:x.replace("survive_the_night", "survive_the_night_2020"))
# df.title = df.title.apply(lambda x:x.replace("The_room", "the_room_1998"))
# df.title = df.title.apply(lambda x:x.replace("John_Henry", "john_henry_2020"))
# df.title = df.title.apply(lambda x:x.replace("Batman_&_Robin", "1077027_batman_and_robin"))
# df.title = df.title.apply(lambda x:x.replace("First_Kill", "First_Kill_2017"))
# df.title = df.title.apply(lambda x:x.replace("The_Adventures_of_Sharkboy_and_Lavagirl_3D", "the_adventures_of_sharkboy_and_lavagirl_in_3d"))
# df.title = df.title.apply(lambda x:x.replace("Dragonball_Evolution", "dragonball"))
# df.title = df.title.apply(lambda x:x.replace("The_Happening", "10007985_happening"))
# df.title = df.title.apply(lambda x:x.replace("Zapped!", "zapped"))
# df.title = df.title.apply(lambda x:x.replace("LĂ©on_The_Professional", "leon_the_professional"))
# df.title = df.title.apply(lambda x:x.replace("SpiderMan_Into_the_SpiderVerse", "spider_man_into_the_spider_verse"))
# df.title = df.title.apply(lambda x:x.replace("The_Avengers", "marvels_the_avengers"))
# df.title = df.title.apply(lambda x:x.replace("Braveheart", "1065684_braveheart"))
# df.title = df.title.apply(lambda x:x.replace("Star_Wars_Episode_V_The_Empire_Strikes_Back", "empire_strikes_back"))
# df.title = df.title.apply(lambda x: x.lower())
# df.title = df.title.apply(lambda x:x.replace("the_best_years_of_our_lives", "best_years_of_our_lives"))
# df.title = df.title.apply(lambda x:x.replace("come_and_see", "1036052_come_and_see"))
# df.title = df.title.apply(lambda x:x.replace("a_clockwork_orange", "clockwork_orange"))
# df.title = df.title.apply(lambda x:x.replace("logan", "logan_2017"))
# df.title = df.title.apply(lambda x:x.replace("12_angry_men", "1000013_12_angry_men"))
# df.title = df.title.apply(lambda x:x.replace("the_human_centipede_(first_sequence)", "human_centipede"))
# df.title = df.title.apply(lambda x:x.replace("the_cat_in_the_hat", "cat_in_the_hat"))
# df.title = df.title.apply(lambda x:x.replace("holmes_&_watson", "holmes_and_watson_2018"))
# df.title = df.title.apply(lambda x:x.replace("death_note", "death_note_2017"))
# df.title = df.title.apply(lambda x:x.replace("cool_world", "1040378_cool_world"))
# df.title = df.title.apply(lambda x:x.replace("jaws_3d", "jaws_3"))
# df.title = df.title.apply(lambda x:x.replace("the_three_stooges", "1205530_three_stooges"))
# df.title = df.title.apply(lambda x:x.replace("kim_possible", "kim_possible_2019"))
# df.title = df.title.apply(lambda x:x.replace("the_island_of_dr_moreau", "1072156_island_of_dr_moreau"))
# df.title = df.title.apply(lambda x:x.replace("the_unborn", "10010150_unborn"))
# df.title = df.title.apply(lambda x:x.replace("aliens", "1000617_aliens"))
# df.title = df.title.apply(lambda x:x.replace("the_sixth_sense", "sixth_sense"))
# df.title = df.title.apply(lambda x:x.replace("spotlight", "spotlight_2015"))
# df.title = df.title.apply(lambda x:x.replace("life_is_beautiful", "1084398_life_is_beautiful"))
# df.title = df.title.apply(lambda x:x.replace("your_name", "your_name_2017"))
# df.title = df.title.apply(lambda x:x.replace("the_truman_show", "truman_show"))
# df.title = df.title.apply(lambda x:x.replace("the_thing", "1021244_thing"))
# df.title = df.title.apply(lambda x:x.replace("capharnaĂĽm", "capernaum"))
# df.title = df.title.apply(lambda x:x.replace("the_wizard_of_oz", "the_wizard_of_oz_1939"))
# df.title = df.title.apply(lambda x:x.replace("zeroville", "zeroville_2019"))
# df.title = df.title.apply(lambda x:x.replace("aliens_vs_predator_requiem", "avp2"))
# df.title = df.title.apply(lambda x:x.replace("cell", "cell_2016"))
# df.title = df.title.apply(lambda x:x.replace("the_haunting", "1090789_haunting"))
# df.title = df.title.apply(lambda x:x.replace("megan_is_missing", "megan_is_missing_2011"))
# df.title = df.title.apply(lambda x:x.replace("the_dinner", "the_dinner_2017"))
# df.title = df.title.apply(lambda x:x.replace("high_school_musical_3", "1194524_high_school_musical_3_senior_year"))
# df.title = df.title.apply(lambda x:x.replace("the_saint", "the_saint_2017"))
# df.title = df.title.apply(lambda x:x.replace("the_human_centipede_iii_(final_sequence)", "the_human_centipede_iii"))
# df.title = df.title.apply(lambda x:x.replace("the_human_centipede_ii_(full_sequence)", "the_human_centipede_ii"))
# df.title = df.title.apply(lambda x:x.replace("jack_and_jill", "jack_and_jill_2011"))
# df.title = df.title.apply(lambda x:x.replace("oh_ramona!", "oh_ramona"))
# df.title = df.title.apply(lambda x:x.replace("princess_mononoke", "princess_mononoke_1999"))
# df.title = df.title.apply(lambda x:x.replace("warrior", "1212910_warrior"))
# df.title = df.title.apply(lambda x:x.replace("a_streetcar_named_desire", "1020333_streetcar_named_desire"))

# df.title = df.title.apply(lambda x: x.lower())
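The long chain of per-title replace calls above could be collapsed into a single dictionary lookup. A sketch of the idea, using a few of the mappings above (rt_slug_fixes is my name for it, and unlike .replace this matches whole titles only):

rt_slug_fixes = {
    "joker": "joker_2019",
    "parasite": "parasite_2019",
    "the_shining": "shining",
    ## ...remaining special cases from the list above
}

## map each slug in one pass, leaving unlisted titles unchanged
df.title = df.title.apply(lambda t: rt_slug_fixes.get(t, t))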
# movies3 = {}
# sys.stdout = open('/dev/stdout', 'w')

# for movie, year in zip(df["title"], df["year"]):
    
#     r = requests.get("https://www.rottentomatoes.com/m/"+str(movie))
#     soup = BeautifulSoup(r.content, "lxml")
    
#     if r.status_code != 200:
#         movie = str(movie)+"_"+str(year)
#         r = requests.get("https://www.rottentomatoes.com/m/"+movie)
    
#     #soup.find('section', class_="mop-ratings-wrap__row js-scoreboard-container")
#     ## everything is within the section above. 
    
#     # Getting tomato meter score.
#     tomatometer = 0
#     try:
#         tomatometer = soup.find_all("span", class_="mop-ratings-wrap__percentage")[0].text.strip()
#     except:
#         tomatometer = None
        
#     # getting audience score  
#     audience_score = 0
#     try:
#         audience_score = soup.find_all("span", class_="mop-ratings-wrap__percentage")[1].text.strip()
#     except:
#         audience_score = None
        
#     ## getting total count
#     total_count = 0
#     try:
#         total_count = soup.find_all("small", class_="mop-ratings-wrap__text--small")[0].text.strip()
#     except:
#         total_count = None
        
        
#     # getting verified ratings
#     verified_ratings = 0
#     try:
#         verified_ratings = soup.find_all(
#             "strong", class_="mop-ratings-wrap__text--small")[1].text.strip().split(":")[1].strip()
#     except:
#         verified_ratings = None
        
#     movies3[movie] = (tomatometer, audience_score, total_count, verified_ratings)
    
#     try:
#         sys.stdout.write(tomatometer + " ")
#         sys.stdout.write(audience_score + " " )
#         sys.stdout.write(total_count + " ")
#         sys.stdout.write(verified_ratings + " ")
#         sys.stdout.write(movie + '\n')
#     except:
#         sys.stdout.write(movie + '\n')
# header3 = ["rt_score", "rt_avg_rating", "rt_audience_score", "rt_user_rating", "rt_avg_aud_rating", "rt_fresh", "rt_rotten"]

# df3 = pd.DataFrame.from_dict(movies3).T.reset_index()
# df3.columns = header3

# ## The code above ran in AWS, and once the scraping was completed, the scraped data was stored as a CSV file in EC2.
# df3.to_csv('rt.csv', encoding="utf-8")

# header = ["metascores"]

# df_mt.to_csv("input/meta.csv", encoding="utf-8")

Cleaning all the datasets

## This is the second part of this project. From here on, I don't have to worry about web scraping;
## all the scraped data was saved in CSV files and is imported fresh with the following code.
import pandas as pd
import requests
import numpy as np
import re
import csv
from bs4 import BeautifulSoup
from sklearn.metrics import roc_curve, auc, precision_recall_curve


pd.set_option('display.max_columns', 500)  ## To see all the columns.
pd.set_option('display.max_rows', 500)  ## To see all the rows.
## getting the scraped csv files. 
df1 = pd.read_csv('input/imdb_1.csv')
df2 = pd.read_csv('input/imdb_2.csv')
## some functions to help us with the cleaning. 
def cleaner(df, function):
    df.loc[:, function] = df.loc[:, function].apply(lambda x: x.replace("[", "").replace("]", "").replace("'", ""))
    return df.loc[:, function]

def unique_values(df, feature):
    unique_feature = []
    for i in df.loc[:, feature]:
        a = i.split(",")
        for b in a:
            if b.strip() not in unique_feature:
                unique_feature.append(b.strip())
    return unique_feature



def custom_encoder(df, feature):
    unique_feature = []
    for i in df.loc[:, feature]:
        a = i.split(",")
        for b in a:
            if b.strip() not in unique_feature:
                unique_feature.append(b.strip())
    cleaned_feature = []
    for i in df.loc[:,feature]:
        a = i.split(",")
        c =[]
        for b in a:
            c.append(b.strip())
        cleaned_feature.append(c)
    for i in unique_feature:
        df[i] = [1 if i in x else 0 for x in cleaned_feature]
    df.drop(columns=[feature], inplace = True)
    return df
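To make the behavior of custom_encoder concrete, here is what it does to a small made-up dataframe:

toy = pd.DataFrame({"genre": ["Crime, Drama", "Drama, War", "Comedy"]})
print(custom_encoder(toy, "genre"))
##    Crime  Drama  War  Comedy
## 0      1      1    0       0
## 1      0      1    1       0
## 2      0      0    0       1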
## Dropping the first column of each dataset
df1.drop(df1.columns[0], axis=1, inplace=True)
df2.drop(df2.columns[0], axis=1, inplace=True)

## combining the two dataframes on movie_id
df = df1.merge(df2, on = 'movie_id')

## dropping redundant columns
df.drop(columns=['gross', 'metascore'], inplace=True)

df = df.replace('N/A',np.nan)

df.rename(columns={'gross_1':'gross'}, inplace=True)
# Dropping duplicates 
df.drop_duplicates(inplace = True)
df.head()
movie_id title runtime genre certificate imdb_rating year votes director actors summary user_review critic_review writer language country budget gross opening_week oscar_win oscar_nom other_win other_nom meta_score
0 tt7286456 Joker 122.0 Crime, Drama, Thriller R 8.5 (2019) 804730 Todd Phillips ['Joaquin Phoenix', 'Robert De Niro', 'Zazie B... In Gotham City, mentally troubled comedian Art... 10,188 user 682 critic ['Todd Phillips', 'Scott Silver'] English ['USA', 'Canada'] $55000000 $335451311 $96202337 ['2'] NaN 92.0 199.0 59.0
1 tt6751668 Parasite 132.0 Comedy, Drama, Thriller R 8.6 (2019) 423230 Bong Joon Ho ['Kang-ho Song', 'Sun-kyun Lee', 'Yeo-jeong Jo... Greed and class discrimination threaten the ne... 2,507 user 515 critic ['Bong Joon Ho', 'Bong Joon Ho'] Korean ['South Korea'] $11400000 $53369749 $393216 ['4'] NaN 262.0 227.0 96.0
2 tt8579674 1917 119.0 Drama, War R 8.3 (2019) 324071 Sam Mendes ['Dean-Charles Chapman', 'George MacKay', 'Dan... April 6th, 1917. As a regiment assembles to wa... 2,674 user 461 critic ['Sam Mendes', 'Krysty Wilson-Cairns'] English ['USA', 'UK', 'India', 'Spain', 'Canada'] $95000000 $159227644 $576216 ['3'] NaN 110.0 161.0 78.0
3 tt4154796 Avengers: Endgame 181.0 Action, Adventure, Drama PG-13 8.4 (2019) 726032 Anthony Russo ['Joe Russo', 'Robert Downey Jr.', 'Chris Evan... After the devastating events of Avengers: Infi... 8,764 user 550 critic ['Christopher Markus', 'Stephen McFeely'] English ['USA'] $356000000 $858373000 $357115007 NaN ['1'] 65.0 103.0 78.0
4 tt0068646 The Godfather 175.0 Crime, Drama R 9.2 (1972) 1551490 Francis Ford Coppola ['Marlon Brando', 'Al Pacino', 'James Caan', '... The aging patriarch of an organized crime dyna... 3,824 user 249 critic ['Mario Puzo', 'Francis Ford Coppola'] English ['USA'] $6000000 $134966411 $302393 ['3'] NaN 26.0 30.0 100.0
Year
df.year = df.year.apply(lambda x: int(re.findall(r"\d{4}", x)[0]))
Runtime

# fill the 4 missing values with the average. 
df.runtime.fillna(df.runtime.mean(), inplace=True)
Awards
# getting only award amount
df.oscar_win = df.oscar_win.str.extract(r'(\d+)')
df.oscar_nom = df.oscar_nom.str.extract(r'(\d+)')

# fill missing with 0
df.oscar_win.fillna(0, inplace=True)
df.oscar_nom.fillna(0, inplace=True)
df.other_win.fillna(0, inplace=True)
df.other_nom.fillna(0, inplace=True)

# converting to int
df.oscar_win = df.oscar_win.astype(int)
df.oscar_nom = df.oscar_nom.astype(int)
df.other_win = df.other_win.apply(lambda x: int(x))
df.other_nom = df.other_nom.apply(lambda x: int(x))
Genre
df = custom_encoder(df, "genre")
df.shape
(2834, 47)
Country
df.country.isnull().sum()
1
## filling the one missing country with 'USA'
df.country.fillna('USA', inplace = True)

## getting rid of unwanted bits in the string value
df.country = cleaner(df, "country")

# df = custom_encoder(df, "country")
df.shape
(2834, 47)
Meta Score
df.meta_score.isnull().sum()
1664
df.meta_score.fillna(df.meta_score.mean(), inplace=True)
Language
df.language.isnull().sum()
19
df.language.fillna("English", inplace=True)

# df = custom_encoder(df, "language")
df.shape
(2834, 47)
User review
df.user_review.isnull().sum()
1
df.user_review.fillna('0', inplace = True)

df.user_review = df.user_review.apply(lambda x: int(x.replace(",", "").split(" ")[0].strip()))
Critic review
df.critic_review.isnull().sum()
102
df.critic_review = df.critic_review.str.extract(r'(\d+)')

df.critic_review.fillna('0', inplace=True)

df.critic_review = df.critic_review.astype(int)
Budget
df.budget.isnull().sum()
1483
df.budget = df.budget.str.extract(r'(\d+)')

df.budget.fillna('0', inplace = True)

df.budget = df.budget.apply(lambda x: int(x))

## replacing the 0's with budget median value not including 0's
df.budget = df.budget.replace(0, df[df.budget != 0].budget.median())
Gross
df.gross.isnull().sum()
1629
df.gross = df.gross.str.extract(r'(\d+)')

df.gross.fillna('0', inplace = True)

df.gross = df.gross.apply(lambda x: int(x))

## replacing the 0's with the gross median value, not including 0's
df.gross = df.gross.replace(0, df[df.gross != 0].gross.median())
Opening week
df.opening_week.isnull().sum()
1747
df.opening_week = df.opening_week.str.extract(r'(\d+)')

df.opening_week.fillna('0', inplace = True)

df.opening_week = df.opening_week.apply(lambda x: int(x))

## replacing the 0's with the opening week median value, not including 0's
df.opening_week = df.opening_week.replace(0, df[df.opening_week != 0].opening_week.median())
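This fill-with-the-nonzero-median pattern now appears three times (budget, gross, opening_week), so it could be factored into a small helper; fill_zero_with_median is my name for it, not something from the original notebook:

def fill_zero_with_median(series):
    """Replace zeros with the median of the nonzero values."""
    return series.replace(0, series[series != 0].median())

df.budget = fill_zero_with_median(df.budget)
df.gross = fill_zero_with_median(df.gross)
df.opening_week = fill_zero_with_median(df.opening_week)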
Certificate
df.certificate.isnull().sum()
494
df.certificate.fillna('Not Rated', inplace = True)
df.certificate.replace("Not Rated", "Unrated", inplace=True)
df.head()
movie_id title runtime certificate imdb_rating year votes director actors summary user_review critic_review writer language country budget gross opening_week oscar_win oscar_nom other_win other_nom meta_score Crime Drama Thriller Comedy War Action Adventure Sci-Fi Western Biography Horror Fantasy Romance Mystery Family History Music Animation Sport Musical Film-Noir Documentary News Short
0 tt7286456 Joker 122.0 R 8.5 2019 804730 Todd Phillips ['Joaquin Phoenix', 'Robert De Niro', 'Zazie B... In Gotham City, mentally troubled comedian Art... 10188 682 ['Todd Phillips', 'Scott Silver'] English USA, Canada 55000000 335451311 96202337 2 0 92 199 59.0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 tt6751668 Parasite 132.0 R 8.6 2019 423230 Bong Joon Ho ['Kang-ho Song', 'Sun-kyun Lee', 'Yeo-jeong Jo... Greed and class discrimination threaten the ne... 2507 515 ['Bong Joon Ho', 'Bong Joon Ho'] Korean South Korea 11400000 53369749 393216 4 0 262 227 96.0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 tt8579674 1917 119.0 R 8.3 2019 324071 Sam Mendes ['Dean-Charles Chapman', 'George MacKay', 'Dan... April 6th, 1917. As a regiment assembles to wa... 2674 461 ['Sam Mendes', 'Krysty Wilson-Cairns'] English USA, UK, India, Spain, Canada 95000000 159227644 576216 3 0 110 161 78.0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 tt4154796 Avengers: Endgame 181.0 PG-13 8.4 2019 726032 Anthony Russo ['Joe Russo', 'Robert Downey Jr.', 'Chris Evan... After the devastating events of Avengers: Infi... 8764 550 ['Christopher Markus', 'Stephen McFeely'] English USA 356000000 858373000 357115007 0 1 65 103 78.0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 tt0068646 The Godfather 175.0 R 9.2 1972 1551490 Francis Ford Coppola ['Marlon Brando', 'Al Pacino', 'James Caan', '... The aging patriarch of an organized crime dyna... 3824 249 ['Mario Puzo', 'Francis Ford Coppola'] English USA 6000000 134966411 302393 3 0 26 30 100.0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Rotten Tomatoes scores
# df.rt_score = df.rt_score.str.extract(r'(\d+)')

# df.rt_score = [None if type(x) is float else int(x) for x in df.rt_score]

# df.rt_avg_rating = [None if type(x) is float else x.split("/")[0] for x in df.rt_avg_rating]


# df.rt_avg_rating = [int(x) if x else None for x in df.rt_avg_rating]

# df.rt_audience_score = df.rt_audience_score.str.extract(r'(\d+)')

# df.rt_avg_aud_rating = [None if type(x) is float else x.split("/")[0] for x in df.rt_avg_aud_rating]

# df.rt_user_rating = [i.replace(',',"") for i in df.rt_user_rating]
Actors
df.actors.isnull().sum()
0
df.actors = cleaner(df, "actors")

# df = custom_encoder(df, 'actors')

## unique values
len(unique_values(df, 'actors'))
8447
## too many unique values, therefore dropping actors.
df.drop(columns="actors", inplace=True)

df.shape
(2834, 46)
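An alternative to dropping the column outright would be to one-hot encode only the most frequent actors. A sketch of that idea, assuming the column was saved first (actors_series and the cutoff of 20 are my choices):

from collections import Counter

## assumes the column was saved before dropping: actors_series = df.actors.copy()
actor_lists = actors_series.apply(lambda row: [a.strip() for a in row.split(",")])
actor_counts = Counter(a for row in actor_lists for a in row)

## keep an indicator column only for actors appearing in 20+ movies
for actor, n in actor_counts.items():
    if n >= 20:
        df["actor_" + actor] = actor_lists.apply(lambda row, a=actor: int(a in row))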
Writer
df.writer.isnull().sum()
4
writer = []
for x in df.writer:
    try:
        a = x.split("(")[0].strip()
        a = a.strip(",")
        a = a.replace(" ", "_")
    except:
        a = None
    writer.append(a)
df.writer = writer
unique_writer = []
for i in df.writer:
    if i not in unique_writer:
        unique_writer.append(i)
len(unique_writer)
2585
## Dropping writer since there are too many unique values. 
df.drop("writer", axis=1, inplace=True)
Directors
df.director.isnull().sum()
0
len(df.director.value_counts())
2073
## Dropping "director" columns since there are too many unique values. 
df.drop("director", axis=1, inplace=True)

Visualization

import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize = (15,8))
sns.heatmap(df.corr(), 
            center = 0, 
           cmap = 'RdBu');

[plot: correlation heatmap]

fig, ax = plt.subplots(figsize = (15,8))
sns.heatmap(df.corr()**2,
            center=0
           );
plt.title("Correlation between features", size = 20);

[plot: Correlation between features]

fig, ax = plt.subplots(figsize = (15,8))
sns.distplot(df.imdb_rating,bins = 50);
plt.title("Distribution of imdb ratings among movies", size = 20);

[plot: Distribution of imdb ratings among movies]

The distribution of imdb_rating clearly shows how I chose two sets of movies. This could give me a result with inflated scores. However, it is important to keep in mind that in the real world distributions do not behave like this; often you get datasets where the target is far less cleanly separated.

fig, ax = plt.subplots(figsize = (15,8))
sns.distplot(df.year,bins = 50);
plt.title("Distribution of years among movies", size = 20);

[plot: Distribution of years among movies]

Most of the movies are pretty recent.

Feature Engineering

## using my custom function to encode country
df = custom_encoder(df, 'country')

## Set year to str
df.year = df.year.astype(str)

df = pd.get_dummies(df, columns=["language", "certificate", "year"], drop_first=True)

df['label'] = df.imdb_rating.apply(lambda x : 1 if x > 7.5 else 0)

y = df.label
X = df.drop(["movie_id", "title", "imdb_rating","label", "summary" ],axis = 1)
X.shape
(2834, 315)
len(y)
2834

Train-test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.25)

Feature scaling

from sklearn.preprocessing import minmax_scale, StandardScaler

scale = ["runtime", "votes", "user_review", "critic_review", "budget", "gross","opening_week","meta_score", "other_win", "other_nom"]

scaler = StandardScaler()
X_train[scale] = scaler.fit_transform(X_train[scale])
## transform (not fit_transform) the test set with the statistics learned
## on the training set, to avoid data leakage
X_test[scale] = scaler.transform(X_test[scale])
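To rule out train/test leakage by construction, the scaling could also live inside a scikit-learn Pipeline, so the scaler is only ever fitted on training data. This is a sketch of the idea, not how the notebook was actually run:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

preprocess = ColumnTransformer(
    [("scale", StandardScaler(), scale)],  ## `scale` is the column list above
    remainder="passthrough")               ## leave the dummy columns untouched

pipe = Pipeline([("prep", preprocess),
                 ("model", RandomForestClassifier())])
pipe.fit(X_train, y_train)  ## scaler statistics come from X_train only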

Creating Models

from sklearn.ensemble import RandomForestClassifier,BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report,precision_recall_curve
from sklearn.tree import DecisionTreeClassifier
Finding the best max_depth
# finding the best max depth for the DecisionTreeClassifier. 

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

max_depth_counter = {}
for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None]:
    decision_tree = DecisionTreeClassifier(max_depth=i)
    
    ## cross-validated mean accuracy for this depth
    cv_mean = cross_val_score(decision_tree, X_train, y_train, cv=cv, n_jobs=-1).mean()

    ## storing the cv mean for each max depth in a dict
    max_depth_counter[i] = cv_mean
    print("Max depth: {}".format(i))
    print(cv_mean)
    
  
GridSearch on Random Forest
grid = {
    'n_estimators': [50, 95,100,110],
    'max_features': [1,2,3,4,5,6,'auto'],
    'criterion': ['gini','entropy'],
    'class_weight': ["balanced","balanced_subsample",None]
}



## reuse the best max_depth found above for the trees in the forest
best_depth = max(max_depth_counter, key=max_depth_counter.get)

random_forest = RandomForestClassifier(max_depth=best_depth)
grid_model = GridSearchCV(random_forest, grid)

grid_model.fit(X_train, y_train)
grid_model.best_params_
{'class_weight': 'balanced_subsample',
 'criterion': 'gini',
 'max_features': 'auto',
 'n_estimators': 110}
print (grid_model.best_params_)
print (grid_model.best_score_)
grid_model
GridSearchCV(cv=None, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=DecisionTreeC...
                                              n_jobs=None, oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'class_weight': ['balanced', 'balanced_subsample',
                                          None],
                         'criterion': ['gini', 'entropy'],
                         'max_features': [1, 2, 3, 4, 5, 6, 'auto'],
                         'n_estimators': [50, 95, 100, 110]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
Create the Random Forest model using the best params from GridSearch
random_forest_model = RandomForestClassifier(max_features = grid_model.best_params_["max_features"],\
                            n_estimators = grid_model.best_params_["n_estimators"],\
                            criterion = grid_model.best_params_["criterion"],\
                            class_weight = grid_model.best_params_["class_weight"])
random_forest_model.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                       class_weight='balanced_subsample', criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       max_samples=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=110, n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False)
## Predict labels for X_test
y_pred = random_forest_model.predict(X_test)

Result Metrics

# printing the confusion matrix; confusion_matrix orders classes ascending (0 = low, 1 = high)
pd.DataFrame(confusion_matrix(y_test, y_pred),\
            columns=["Predicted low", "Predicted high"],\
            index=["is low", "is high"])
Predicted low Predicted high
is low 320 25
is high 3 361
print (classification_report(y_test, y_pred, labels=random_forest_model.classes_))
# Printing accuracy Score
print (accuracy_score(y_test, y_pred))

Feature Importance

# Get feature importance (Gini) scores
feature_importances = pd.DataFrame(random_forest_model.feature_importances_, 
                                   index = X_train.columns, 
                                   columns=['importance'])

feature_importances[feature_importances['importance']!=0].sort_values(by='importance', ascending=False).head(10)
importance
runtime 0.096204
meta_score 0.091236
other_win 0.088065
user_review 0.053687
language_English 0.053065
votes 0.052834
critic_review 0.045019
Documentary 0.040162
USA 0.035724
Horror 0.035355

ROC/AUC Curve

Y_score = grid_model.best_estimator_.predict_proba(X_test)[:,1]


# For class 1, find the area under the curve
FPR, TPR, _ = roc_curve(y_test, Y_score)
ROC_AUC = auc(FPR, TPR)
"""
PREC, REC, _ = precision_recall_curve(y_test, Y_score)
PR_AUC = auc(REC, PREC)"""

# Plot of a ROC curve for class 1 (has_cancer)
plt.figure(figsize=[11,9])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth=4)
#plt.plot(REC, PREC, label='PR curve (area = %0.2f)' % PR_AUC, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive', fontsize=18)
plt.ylabel('True Positive', fontsize=18)
plt.title('Random Forest for imdb rating', fontsize=18)
plt.legend(loc="lower right")
plt.show()

[plot: ROC curve]

Y_score = grid_model.best_estimator_.predict_proba(X_test)[:,1]


# For class 1, find the area under the curve
PREC, REC, _ = precision_recall_curve(y_test, Y_score)
PR_AUC = auc(REC, PREC)

# Plot of the PR curve for class 1 (high imdb_rating)
plt.figure(figsize=[11,9])
plt.plot(REC, PREC, label='PR curve (area = %0.2f)' % PR_AUC, linewidth=4)
## the no-skill baseline for a PR curve is the positive-class prevalence
plt.plot([0, 1], [y_test.mean(), y_test.mean()], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall', fontsize=18)
plt.ylabel('Precision', fontsize=18)
plt.title('Random Forest for imdb rating', fontsize=18)
plt.legend(loc="lower right")
plt.show()

[plot: precision-recall curve]

Wrapping Up

My model seems to perform too well. One of the reasons could be the way I collected the data. My hypothesis is that I was biased while collecting it: the way I web scraped and collected my data points left a huge gap between the two types of movies, which is unlikely in real-world data problems. This is something to think about for the upcoming projects.

Next..

I believe one of the reasons my model gave such good results is the way I collected data from the beginning. While web scraping, I basically collected two types of movies: first those with imdb_rating over 8, and then those with imdb_rating below 5. Since there is a big gap in the ratings, my model was able to predict too well. This is unlikely in the real world. Therefore, my next step will be to web scrape again, but this time with more variety in the rating, since it is my target variable. I would also like to work with actors, directors, and writers as part of the NLP work in the next trial, as sketched below.
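A minimal sketch of what that broader scrape might look like, stepping through the whole rating scale in one-point bands instead of just the two extremes (the band boundaries are arbitrary choices of mine):

## cover the full scale: "1.0,2.0", "2.0,3.0", ..., "9.0,10.0"
rating_bands = ["{:.1f},{:.1f}".format(lo, lo + 1) for lo in range(1, 10)]

for band in rating_bands:
    url = ("https://www.imdb.com/search/title/?title_type=feature,"
           "tv_movie,documentary,short&num_votes=1000,"
           "&user_rating=" + band + "&start=1&ref_=adv_nxt")
    ## request and parse each band exactly as in the scraping loop above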
