# Casts' Interest Over Time

## Pytrends API

We scrape each actor's interest score from Google Trends to measure how the popularity of a movie's actors affects the popularity of the movie. For each actor, we compute an average interest score from the search volume as an indicator of that actor's popularity.

We used Pytrends, an unofficial Python API for Google Trends, to automate the download of reports.
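
To make the averaging concrete, here is a minimal sketch of how a single score could be derived from a weekly interest series. The values and the helper name `average_interest` are made up for illustration; real series come from Google Trends, which scales interest from 0 to 100.

```python
import pandas as pd

def average_interest(weekly: pd.Series) -> float:
    """Return the mean of a weekly interest series (hypothetical helper)."""
    return float(weekly.mean())

# Illustrative values only, not real Google Trends data
weekly = pd.Series(
    [10, 25, 100, 40, 25],
    index=pd.date_range("2015-01-04", periods=5, freq="W"),
    name="Some Actor",
)
score = average_interest(weekly)
print(score)  # 40.0
```

A single spike (the 100 above) pulls the average up only modestly, which is why a long-run mean can serve as a rough popularity indicator.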

In [1]:
import pandas as pd
import numpy as np
import json
import re
from time import sleep
from datetime import date, timedelta

#Importing Pytrends
from pytrends.request import TrendReq

## Exploratory Data Analysis on Casts

We performed EDA on the `cast` column of the original dataset and made a few observations:

1) Some movies have no cast listed. We remove them from the dataset.

2) Some names contain accented characters and symbols that the API cannot read. We remove those names from the dataset.

3) Some movies have fewer than 3 cast members. We remove them as well.

In [2]:
# Load the dataset and keep only movies released in 2007 or later
df = pd.read_csv("tmdb_movies_data.csv")
df = df[df["release_year"] >= 2007]

# Parse release_date as datetime, then store it as a "YYYY-MM-DD" string
df['release_date'] = pd.to_datetime(df['release_date'], utc = False)
df['release_date'] = df['release_date'].astype(str)
df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,Unnamed: 21
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,2015-06-09,5562,6.5,2015,137999939.3,1392446000.0,0.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,2015-05-13,6185,7.1,2015,137999939.3,348161300.0,1.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2015-03-18,2480,6.3,2015,101199955.5,271619000.0,3.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,2015-12-15,5292,7.5,2015,183999919.0,1902723000.0,4.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2015-04-01,2947,7.3,2015,174799923.1,1385749000.0,


### 1) Removing movies without casts

In [3]:
# Checking for null values in casts
len(df[df["cast"].isnull()])

53

In [4]:
# Remove rows with null values in cast
df = df[df["cast"].notnull()]

### 2) Removing casts with accented characters and symbols

We found that some cast names contain accented characters and symbols. Since the API could not read them, we remove those names from our dataset.

In [5]:
# Count cast names that contain accented characters or symbols (non-ASCII)
accented_casts = []
for row, col in df.iterrows():
    names = col["cast"].split("|")
    for i in names:
        if re.search(r'[^\x00-\x7F]', i):
            accented_casts.append(i)
len(accented_casts)

728

In [6]:
# Drop non-ASCII names; the cast column becomes a list of names
def remove_casts(string):
    names = []
    for name in string.split("|"):
        if not re.search(r'[^\x00-\x7F]', name):
            names.append(name)
    return names

df["cast"] = df["cast"].apply(remove_casts)
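
The same filter can be expressed without a regular expression. This is a minimal sketch, assuming Python 3.7 or later, where `str.isascii()` is available; the function name `keep_ascii_names` and the sample string are illustrative.

```python
def keep_ascii_names(cast: str) -> list:
    """Split a pipe-separated cast string and keep only pure-ASCII names."""
    return [name for name in cast.split("|") if name.isascii()]

sample = "Tom Hardy|Penélope Cruz|Zoë Kravitz|Charlize Theron"
print(keep_ascii_names(sample))  # ['Tom Hardy', 'Charlize Theron']
```

Both versions drop the whole name as soon as it contains a single non-ASCII character.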

### 3) Removing movies with fewer than 3 cast members

Some movies have fewer than 3 cast members. We remove them from our dataset because such a short cast list does not represent the movie accurately.

In [7]:
# Number of cast per movie
for row,col in df.iterrows():
    names = col["cast"]
    df.at[row, "cast_count"] = len(names)
print(df["cast_count"].value_counts())

5.0    4033
4.0     620
3.0     146
1.0     115
2.0     100
0.0       6
Name: cast_count, dtype: int64


In [8]:
# Removing movies with fewer than 3 cast members
df = df[df["cast_count"] > 2]
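
The count-and-filter step can also be done without a row loop. This is a minimal sketch on a made-up frame (`toy` and its contents are illustrative). It relies on `Series.str.len()`, which also returns the length of list-valued cells.

```python
import pandas as pd

toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "cast": [["x", "y", "z"], ["x"], ["x", "y", "z", "w"]],
})
# Series.str.len() also works on list-valued cells
toy["cast_count"] = toy["cast"].str.len()
kept = toy[toy["cast_count"] >= 3]
print(kept["original_title"].tolist())  # ['A', 'C']
```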

In [9]:
# Final shape
df.shape

(4799, 23)

## Retrieving Casts Interest Over Time

Google Trends reports interest over time on a weekly basis. For each cast member, we retrieved the interest over time via the API from 2004 to the date on which the movie was released.

Cast members whose interest over time was 0 surfaced as a `KeyError` in our code, which means their search volume was very low. They are still relevant to us, so we record them separately and concatenate them with the main results later on.
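
The retrieval step relies on two small details: the timeframe string passed to Pytrends and the way a missing keyword column shows up as a `KeyError`. Below is a minimal offline sketch, assuming that a query with no data yields an empty DataFrame. `build_timeframe` is a hypothetical helper and no request is sent.

```python
import pandas as pd

def build_timeframe(start: str, end: str) -> str:
    """Join two ISO dates into the 'start end' string used as the timeframe."""
    return f"{start} {end}"

timeframe = build_timeframe("2004-01-01", "2015-06-09")
print(timeframe)  # 2004-01-01 2015-06-09

# Stand-in for a query that returned no data: an empty DataFrame
empty_result = pd.DataFrame()
try:
    empty_result["Some Actor"].mean()
    outcome = "has data"
except KeyError:
    outcome = "no data"
print(outcome)  # no data
```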

In [10]:
pytrend = TrendReq()

In [None]:
# Use Pytrends to gather each cast member's interest over time
start_date = '2004-01-01'
wait_time = 5.0  # seconds to pause between requests

def getIOT():
    cols = ["id", "movie_name", "cast", "average_interest"]
    error_cols = ["movie_name", "cast"]
    master_rows = []  # one row per cast member with search data
    error_rows = []   # cast members for whom no data came back

    def save():
        # Write whatever has been collected so far to disk
        pd.DataFrame(error_rows, columns = error_cols).to_csv("ERROR.csv")
        pd.DataFrame(master_rows, columns = cols).to_csv("IOT.csv")

    total = 0
    for _, movie in df.iterrows():
        _id = movie["id"]
        title = movie["original_title"]
        names = movie["cast"]  # already a list of names after In [6]
        # Timeframe string: "<start date> <release date>"
        timeframe = start_date + " " + movie["release_date"]
        for name in names:
            try:
                pytrend.build_payload(kw_list=[name], timeframe = timeframe)
                results = pytrend.interest_over_time()
                # Mean interest score over the whole timeframe
                master_rows.append([_id, title, name, results[name].mean()])
                sleep(wait_time)
            except KeyError: # cast with 0 popularity
                print(name)
                error_rows.append([title, name])
                print("errorcount: ", len(error_rows))
                continue
            except Exception: # HTTP error, too many requests being made.
                print("ERROR!!!")
                return save()
        total += 1
        print(total)
    print("COMPLETE!!")
    return save()
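
The scraping function stops at the first HTTP error. One common alternative is to retry with an increasing delay. The sketch below is not part of the original workflow: `call_with_retry` and `flaky_request` are hypothetical names, and the failing request is simulated so the example runs offline.

```python
import time

def call_with_retry(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)

# Demonstration with a fake request that fails twice, then succeeds
calls = {"n": 0}
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated HTTP 429")
    return "ok"

delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky_request, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the demonstration instantaneous. In real use the default `time.sleep` would apply.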

## Concatenate

We concatenate the main CSV, which contains the cast members' interest over time, with the error CSV, which contains the cast members that have 0 interest over time.

We then export the data to merge it with the rest of our data.

In [11]:
df_Error = pd.read_csv("ERROR.csv")
df_IOT = pd.read_csv("IOT.csv")
df_tmdb = pd.read_excel("tmdb_movies_data.xlsx")

In [12]:
del df_Error['Unnamed: 0']
print(f'Shape of error csv: {df_Error.shape}')
print(f'Shape of main csv: {df_IOT.shape}')

Shape of error csv: (547, 3)
Shape of main csv: (27527, 5)


In [13]:
# Inspecting main csv
del df_IOT['Unnamed: 0']
df_IOT.head()

Unnamed: 0,id,movie_name,cast,average_interest
0,674,Harry Potter and the Goblet of Fire,Daniel Radcliffe,26.927083
1,674,Harry Potter and the Goblet of Fire,Rupert Grint,27.677083
2,674,Harry Potter and the Goblet of Fire,Emma Watson,22.8125
3,674,Harry Potter and the Goblet of Fire,Ralph Fiennes,15.479167
4,674,Harry Potter and the Goblet of Fire,Michael Gambon,11.541667


In [14]:
# Concatenate both CSVs
df = pd.concat([df_Error, df_IOT], ignore_index = True)

In [15]:
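# Average the interest scores of all cast members for each movie
# (select the numeric column explicitly; recent pandas versions raise an
# error when mean() meets text columns such as movie_name and cast)
df = df.groupby("id")["average_interest"].mean()
df = df.reset_index()
df.head()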

Unnamed: 0,id,average_interest
0,17.0,16.098148
1,25.0,20.425
2,26.0,2.777778
3,27.0,21.017172
4,35.0,11.783871
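
One detail is worth checking when combining the two files: rows from the error CSV have no `average_interest` value, and the pandas `mean()` skips missing values by default. The sketch below, on made-up data, shows the difference between skipping those rows and counting them as 0. It assumes the error rows carry an `id` column, which the scraping code as shown does not write.

```python
import pandas as pd

main = pd.DataFrame({"id": [1, 1], "cast": ["A", "B"],
                     "average_interest": [30.0, 10.0]})
# Hypothetical error rows that also carry the movie id
errors = pd.DataFrame({"id": [1], "cast": ["C"]})

combined = pd.concat([errors, main], ignore_index=True)

# Missing scores are skipped by mean(): (30 + 10) / 2
skipped = combined.groupby("id")["average_interest"].mean()
# Treating missing scores as 0 lowers the average: (0 + 30 + 10) / 3
filled = (combined.fillna({"average_interest": 0})
                  .groupby("id")["average_interest"].mean())

print(skipped.loc[1])           # 20.0
print(round(filled.loc[1], 2))  # 13.33
```

Which behaviour is appropriate depends on whether a cast member with no search data should pull the movie's average down.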


In [16]:
# Export df out for compiling
# df.to_csv("average_cast_interest.csv")