# Which actors / actresses have starred in the most of IMDb's Top 250 rated films?
## *Scraping movie and cast information from IMDb.com and visualising most featured actors/actresses, with BeautifulSoup, Pandas and Plotly*

So here it is...my first data science article (hopefully not my last). I decided to kick off the series by looking at IMDb's top rated films, to identify the actors/actresses who feature across multiple top films - the 'real movie stars'. To be brutally honest about *why* I've dived into this relatively random topic: primarily, I wanted to use web scraping (Python with BeautifulSoup, in this case) to collect data that isn't otherwise readily available in the desired format. The secondary reason was simply that I enjoy watching movies - hardly a unique interest.

Whilst IMDb doesn't readily offer APIs for accessing movie information (which seems a little surprising to me), they do offer a number of static datasets - https://datasets.imdbws.com/. I chose not to use these datasets and instead scraped the required data directly from IMDb.com.

In this blog:

(1) We start by scraping IMDb film and actor/actress data (using BeautifulSoup)

(2) We process and clean the captured data (using Pandas)

(3) Then (more interestingly) we start to explore the data by:

    (a) Looking at the distribution of Top 250 film ratings
    (b) Understanding which film genres are more likely to have higher ratings, and
    (c) Identifying which actors/actresses appear in the most top rated films
    
**Ultimately, this blog culminates in identifying *"Which actors/actresses feature in the most Top 250 films?"*. I began this project under the naive assumption that these actors/actresses would be popular household names; the likes of Katharine Hepburn, Robert De Niro and Jack Nicholson. IMDb themselves provide a ranking of the '100 greatest actors & actresses' - https://www.imdb.com/list/ls053085147/ - but this analysis actually provides some surprising outcomes...enjoy!**


In [1]:

# Import required Python libraries and setup workspace

import certifi
import urllib3
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import csv

from numpy import polyfit, polyval # these live in NumPy; older SciPy releases only re-exported them
from scipy.interpolate import CubicSpline

import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
%matplotlib inline



## DATA COLLECTION Pt1 : Scrape Top 250 rated movies from IMDb website
Data captured on 27th July 2018

We start by using the *urllib3* and *BeautifulSoup* libraries to pull the Top 250 movies from IMDb's https://www.imdb.com/chart/top web page. We capture each movie title, along with its official IMDb ranking, release year and rating. In the code cell below, we create a list of lists, stored in the **table_data** variable.

In [2]:

table_data = [] # empty list. We'll be adding results to this

# IMDb page url for all top 250 rated films
url_imdbTop250 = 'https://www.imdb.com/chart/top'

# run web page request
page_imdbTop250 = http.request('GET', url_imdbTop250)

# allow for page exploration using BeautifulSoup (i.e. soup-ify returned webpage)
soup_Top250 = BeautifulSoup(page_imdbTop250.data, "lxml")

# create tabulated data with top 250 film information
table = soup_Top250.find('table', attrs={'class':'chart full-width'})# select the <table ...> that contains the ranked movies
table_body = table.find('tbody') # further subset the page. select only the table body
rows = table_body.find_all('tr') # find all rows within top 250 rated movies table

# for each row in table extract the: rank, movie name, year published and Imdb rating
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    movie_name_string = cols[1]
    movie_rank = movie_name_string[:movie_name_string.index('.')] 
    movie_year = int(movie_name_string[-5:-1])
    movie_name = movie_name_string[movie_name_string.index('.')+1:movie_name_string.index('(')].strip()
    movie_rating = cols[2]
    table_data.append([movie_rank, movie_name, movie_year, movie_rating])

table_data[:5] # print the top 5 results

[['1', 'The Shawshank Redemption', 1994, '9.2'],
 ['2', 'The Godfather', 1972, '9.2'],
 ['3', 'The Godfather: Part II', 1974, '9.0'],
 ['4', 'The Dark Knight', 2008, '9.0'],
 ['5', '12 Angry Men', 1957, '8.9']]
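
The loop above slices each title cell by the position of the first `.` and the first `(`. As a hedged alternative, here is a minimal sketch that parses the same kind of string with a regular expression. It assumes the cell text looks like `1. The Shawshank Redemption (1994)`; the pattern and the `parse_title_cell` helper are my own illustration, not something IMDb or BeautifulSoup provides.

```python
import re

# Assumed cell format: "<rank>. <title> (<year>)"
# The year is anchored to the END of the string, so titles that contain
# their own brackets are still parsed correctly.
TITLE_CELL = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*\((\d{4})\)\s*$", re.DOTALL)

def parse_title_cell(cell_text):
    """Split a title cell into (rank, name, year); raise ValueError on a mismatch."""
    match = TITLE_CELL.match(cell_text)
    if match is None:
        raise ValueError("Unexpected title cell: {!r}".format(cell_text))
    rank, name, year = match.groups()
    # collapse any line breaks / repeated spaces left over from the html
    name = " ".join(name.split())
    return int(rank), name, int(year)

print(parse_title_cell("1.\n      The Shawshank Redemption\n(1994)"))
print(parse_title_cell("2. Some Film (With Brackets) (2014)"))
```

Raising an error on an unexpected cell is a deliberate choice here: if the page layout changes, the scrape fails loudly instead of silently storing a mangled title.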

## DATA COLLECTION Pt2 : Scrape movie Genre & Cast/Crew

Each movie has its own title landing page, covering a summary of information, and a separate page for viewing the full cast/crew list per feature. Provided that you use the IMDb movie ID (i.e. an ID specific to IMDb), it's very simple to manipulate standard URLs and pull the information needed:

- https://www.imdb.com/title/{film_id} : used to retrieve film genre
- https://www.imdb.com/title/{film_id}/fullcredits : used to retrieve full cast / crew

The film IDs are first retrieved from the initial page scrape, as they are provided in the html in the 'wlb_ribbon' class. With these 250 x film IDs we proceed to scrape the information we require. I have clearly commented the code, below, to ensure that it can be intuitively understood.

In [3]:

# return all of the imdb film ids for crawling film casts
film_ids = []
base_castAndCrew = []

links_class  = soup_Top250.findAll("div", {"class":"wlb_ribbon"}) # table/html tag containing each IMDb movie ID

# for each movie, using the film ID:
for idx, link in enumerate(links_class):    
    
    counter = idx + 1
    if counter % 25 == 0: # print to log every n iterations
        print("[INFO] Scraping film no. {0} of {1}".format(counter,len(links_class)))
        
    # Part 1 ----------------------------------------
    # FOR EACH MOVIE, USE MOVIE ID TO CONSTRUCT MOVIE PAGE URL AND RETURN FILM GENRE
    
    filmID = link.attrs['data-tconst']
    filmID_castURL = "https://www.imdb.com/title/{0}/fullcredits".format(filmID) # used to retrieve full cast
    filmID_homePageURL = "https://www.imdb.com/title/{0}".format(filmID) # used to retrieve genre
    
    # request movie home page to return genre
    html_filmHomePage = http.request('GET', filmID_homePageURL)
    soup_filmHomePage = BeautifulSoup(html_filmHomePage.data, "lxml")
    soup_getGenre = soup_filmHomePage.find('div', {'itemprop':'genre'})
    genre_clean = str(soup_getGenre.text).replace('\n' , '').replace('Genres: ', '')
    
    # append derived data to output list
    film_ids.append([filmID , filmID_castURL , filmID_homePageURL, genre_clean])
    
    # Part 2 ----------------------------------------
    # FOR EACH MOVIE, PULL FULL CAST FROM MOVIE CAST/CREW PAGE URL
    
    # request the full cast/crew page and store the returned html in html_castAndCrew
    html_castAndCrew = http.request('GET', filmID_castURL)
    soup_castAndCrew = BeautifulSoup(html_castAndCrew.data, "lxml")
    table_castAndCrew = soup_castAndCrew.find('table', {'class':'cast_list'})
    
    # select cast list
    list_castAndCrew = []
    for cast in table_castAndCrew.findAll('span', {'class':"itemprop"}):
        list_castAndCrew.append(cast.text)
    base_castAndCrew.append(list_castAndCrew)  
    
    del filmID, filmID_castURL, filmID_homePageURL, genre_clean, html_filmHomePage, soup_filmHomePage, soup_getGenre, \
    list_castAndCrew, html_castAndCrew, soup_castAndCrew, table_castAndCrew


[INFO] Scraping film no. 25 of 250
[INFO] Scraping film no. 50 of 250
[INFO] Scraping film no. 75 of 250
[INFO] Scraping film no. 100 of 250
[INFO] Scraping film no. 125 of 250
[INFO] Scraping film no. 150 of 250
[INFO] Scraping film no. 175 of 250
[INFO] Scraping film no. 200 of 250
[INFO] Scraping film no. 225 of 250
[INFO] Scraping film no. 250 of 250


In [6]:

# This code block exports the scraped data sets to csv files
# you must first create a Data/ directory within your repository

# movie rankings: one row per film (rank, name, year, rating)
with open("Data/movie_ratings.csv", 'w') as f:
    wr = csv.writer(f, quoting=csv.QUOTE_ALL)
    wr.writerows(table_data) # writerows (plural) writes each inner list as its own row
    
# cast and crew: one row per film, one cell per cast member
with open("Data/movie_cast_and_crew.csv", 'w') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerows(base_castAndCrew)
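
A quick way to sanity-check an export like this is to write a couple of rows and read them straight back. The sketch below does that with made-up rows and a temporary file (so it doesn't touch the real `Data/` directory). Note that `writerows` (plural) writes one film per line, whereas `writerow` treats its whole argument as a single row. It assumes Python 3, where the csv docs recommend opening files with `newline=''`.

```python
import csv
import os
import tempfile

# toy stand-in for table_data: [rank, name, year, rating]
rows = [['1', 'The Shawshank Redemption', 1994, '9.2'],
        ['2', 'The Godfather', 1972, '9.2']]

# temporary location, purely for this illustration
path = os.path.join(tempfile.mkdtemp(), "movie_ratings.csv")

# newline='' lets the csv module control the line endings itself
with open(path, 'w', newline='') as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)

with open(path, newline='') as f:
    reloaded = [row for row in csv.reader(f)]

# every field comes back as a string, so numeric columns need converting again
print(reloaded)
```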
    

## ANALYSIS & PLOTS 

### Pt1 : Exploring distribution of movie ratings

Looking at the two ratings charts, we deduce that across the Top 250 movies:
- The median movie rating is 8.2.
- The box plot identifies a number of outliers (note: Plotly typically considers outliers to be data points lying more than 1.5 x the interquartile range below the first quartile (Q1) or above the third quartile (Q3)). These seven outliers are movies rated very highly compared with the others. With no surprise, these top rated films include some of my (and everyone else's) favourites - **The Shawshank Redemption (1), The Godfather (2), The Godfather: Part II (3), The Dark Knight (4) and 12 Angry Men (5)**.
- Extending the above point, ratings do not have a purely (inverse) linear relationship with movie ranking. 95% of movies have a rating between 8.2 and 8.7 (range: 0.5), whilst the remaining 5% have much higher scores, between 8.7 and 9.2 (range: 0.5).

- The genre boxplot illustrates the rating distribution of the Top 250 rated films, split by film genre.
- I've ordered the x-axis from highest to lowest median rating, by genre.
- Perhaps surprisingly, Music and Horror movies have the highest median ratings; but, unsurprisingly, each of these genres has just five contributing movies.
- X% of top movies are either dramas, x or y.
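
To make the 1.5 x IQR outlier rule mentioned above concrete, here is a minimal sketch using NumPy on a handful of made-up ratings (not the scraped data). The `iqr_fences` helper is hypothetical, and Plotly computes its quartiles with its own method, so the exact fences on the real chart may differ slightly from what `np.percentile` gives.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences; points outside them count as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# made-up ratings, purely for illustration
ratings = [8.0, 8.1, 8.1, 8.2, 8.2, 8.3, 8.4, 8.5, 9.2]

lower, upper = iqr_fences(ratings)
outliers = [r for r in ratings if r < lower or r > upper]
print(lower, upper, outliers)
```

Because the Top 250 ratings are bunched tightly together, the interquartile range is small, which is why a rating only a few tenths above the pack is already flagged as an outlier.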


## Looking at the (boxplot) distribution of movie ratings

In [4]:

# we first unlist our rankings, ratings and movie names into three variables
ranking = [float(el[0]) for el in table_data]
rating = [float(el[3]) for el in table_data]
movie_names = [nm[1] for nm in table_data]


In [15]:

# we use Plotly to create a boxplot to represent the distribution of movie ratings
# across the Top 250 IMDb movies

trace0 = go.Box(
    x = rating,
    name = " ",
    jitter = 1,
    pointpos = 0.4,
    boxpoints = 'all',
    marker = dict(color = 'rgb(32, 51, 155)', opacity=0.6, size=9),
    line = dict(color = 'rgb(162, 160, 165)'),
    text = movie_names
)

data = [trace0]
layout = go.Layout(  title="Box plot of IMDb's Top 250 movie ratings"
                   , boxgap = 0.5
                   , xaxis = dict(title='IMDb movie rating'))

fig = go.Figure(data=data,layout=layout)

# export boxplot visualisation to .html file
plot(fig, filename = "../Plots/Ratings_boxplot.html")


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/Ratings_boxplot.html'

## Looking at the relationship between movie rank and rating

In [16]:

# here we fit a set of models / relationships between rank vs rating
# useful youtube video for using scipy package: https://www.youtube.com/watch?v=ro5ftxuD6is

p1 = polyfit(ranking , rating, 1) # linear regression
p2 = polyfit(ranking , rating, 2) # quadratic
p3 = polyfit(ranking , rating, 3) # cubic


In [17]:

# Create and style traces
trace0 = go.Scatter(
    x = ranking, y = rating,
    name = 'Movies (rating vs. rank)' ,mode='markers'    
    , marker = dict(color = ('rgb(0, 0, 0)'), size=5, opacity=0.2)
)

trace1 = go.Scatter(
    x = ranking, y = polyval(p1 , ranking),
    name = 'Linear model' , line = dict(color = ('rgb(122, 122, 122)') , width = 3, dash='dash'), opacity=0.75)

trace2 = go.Scatter(
    x = ranking, y = polyval(p2 , ranking),
    name = 'Quadratic model' , line = dict(color = ('rgb(32, 51, 155)') , width = 6), opacity=0.75)

trace3 = go.Scatter(
    x = ranking, y = polyval(p3 , ranking),
    name = 'Cubic model' , line = dict(color = ('rgb(249, 126, 32)') , width = 6), opacity=0.75)

data = [trace0, trace1, trace2, trace3]

# Edit the layout
layout = dict(title = 'How movie rating changes with ranking',
              xaxis = dict(title = 'Movie Ranking'),
              yaxis = dict(title = 'IMDb rating'),
              annotations=[
                dict(x=120,y=8.28,xref='x',yref='y',text='0.82 Rsq',showarrow=True,arrowhead=7,ax=0,ay=-40
                     , font=dict(family='Arial black',size=15,color='rgb(122, 122, 122)')),
                dict(x=240,y=8.046,xref='x',yref='y',text='0.92 Rsq',showarrow=True,arrowhead=7,ax=0,ay=-40
                     , font=dict(family='Arial black',size=15,color='rgb(32, 51, 155)')), 
                dict(x=200,y=8.08713,xref='x',yref='y',text='0.96 Rsq',showarrow=True,arrowhead=7,ax=0,ay=-40
                     , font=dict(family='Arial black',size=15,color='rgb(249, 126, 32)'))
              ]
              )



fig = dict(data=data, layout=layout)
plot(fig, filename='../Plots/MovieRating_vs_Ranking.html', auto_open=True)


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/MovieRating_vs_Ranking.html'

In [104]:

# Calculate the Rsquared values for each model fit (linear, quadratic, cubic)
# Rsq = 1 - SSresid / SStotal

x = np.array(ranking)
y = np.array(rating)
SStotal = len(y) * np.var(y) # total sum of squares (np.var is the population variance)

# linear model
yfit = p1[0] * x + p1[1]
yresid = y - yfit
SSresid = np.sum(yresid ** 2)
rsq = 1 - SSresid / SStotal
print("[INFO] Linear model RSquared: "+ str(rsq))

# quadratic model
yfit = p2[0] * x**2 + p2[1] * x + p2[2]
yresid = y - yfit
SSresid = np.sum(yresid ** 2)
rsq = 1 - SSresid / SStotal
print("[INFO] Quadratic model RSquared: "+ str(rsq))

# cubic model
yfit = polyval(p3 , x)
yresid = y - yfit
SSresid = np.sum(yresid ** 2)
rsq = 1 - SSresid / SStotal
print("[INFO] Cubic model RSquared: "+ str(rsq))


[INFO] Linear model RSquared: 0.826708647607
[INFO] Quadratic model RSquared: 0.922868641639
[INFO] Cubic model RSquared: 0.955138111232
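
The three calculations above all follow the same recipe, Rsq = 1 - SSresid / SStotal, so they can be folded into one function. This is a minimal sketch on synthetic data (a made-up, gently curved rank vs rating relationship, not the scraped ratings); `r_squared` is a hypothetical helper of mine.

```python
import numpy as np

def r_squared(x, y, degree):
    """Fit a polynomial of the given degree and return its R-squared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    residuals = y - np.polyval(coeffs, x)
    ss_resid = np.sum(residuals ** 2)          # unexplained variation
    ss_total = np.sum((y - y.mean()) ** 2)     # total variation around the mean
    return 1 - ss_resid / ss_total

# synthetic data: an exactly quadratic relationship
x = np.arange(1, 51)
y = 9.0 - 0.02 * x + 0.0001 * x ** 2

for degree in (1, 2, 3):
    print(degree, r_squared(x, y, degree))
```

One caveat worth keeping in mind: on the same data, a higher-degree polynomial can never fit worse than a lower-degree one, so a rising Rsq from linear to quadratic to cubic does not by itself show that the cubic is the 'right' model.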


## Boxplot distributions, ratings by genre

In [62]:

film_genres = [gen[3] for gen in film_ids]
film_ratings = [float(rat[3]) for rat in table_data] 
film_names = [mov[1] for mov in table_data] 

film_genres_split = []
film_ratings_split = []
film_name_split = []

for each_row, each_rating, each_name in zip(film_genres , film_ratings, film_names) :
    for each_genre in each_row.split("|"):
        film_name_split.append(each_name)
        film_genres_split.append(each_genre.strip())
        film_ratings_split.append(each_rating)
        
df_genres_ratings = pd.DataFrame( {'Ratings': film_ratings_split, 'Genres': film_genres_split} )

# order by median value
sorted_by_value = dict(sorted(dict(df_genres_ratings.groupby(['Genres'])['Ratings'].median()).items(), key=lambda kv: -kv[1]))
genre_order = list(sorted_by_value.keys())

order = pd.DataFrame(df_genres_ratings.groupby(['Genres'])['Ratings'].median())
order.reset_index(inplace = True)
order.sort_values(['Ratings'], ascending=False, inplace=True)


data = []
for i in range(0,len(pd.unique(df_genres_ratings['Genres']))):
    trace = {
                "type": 'box',
                "jitter" :1,
                    "pointpos" :0,
                    "boxpoints" :'all',        
        
                "x": df_genres_ratings['Genres'][df_genres_ratings['Genres'] == pd.unique(df_genres_ratings['Genres'])[i]],
                "y": df_genres_ratings['Ratings'][df_genres_ratings['Genres'] == pd.unique(df_genres_ratings['Genres'])[i]],
                "name": pd.unique(df_genres_ratings['Genres'])[i],
                "showlegend": False,
                "marker" : {'color':'rgb(32, 51, 155)'},
                "box": {"visible": True},
                "meanline": {"visible": True}
            }
    data.append(trace)
    

    
fig = {
    "data": data,
    "layout" : {
        "title": "Boxplot of Movie Ratings, by Genre",
        "yaxis": {"zeroline": False},
        "xaxis":{"categoryarray": list(order.Genres)}
                }
}


plot(fig, filename='../Plots/Boxplot_Dist_By_Genre.html', validate = False)


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/Boxplot_Dist_By_Genre.html'
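
The nested loop above turns one row per film into one row per film/genre pair. For comparison, here is a hedged sketch of the same reshaping done in Pandas with `str.split` and `explode` (which needs pandas 0.25 or later). The three films are made up, and the `Genre | Genre` format is assumed to match what the scrape returns.

```python
import pandas as pd

# made-up films in the assumed "Genre | Genre" format
films = pd.DataFrame({
    "Name": ["Film A", "Film B", "Film C"],
    "Genres": ["Crime | Drama", "Drama", "Crime | Thriller"],
    "Rating": [9.0, 8.5, 8.0],
})

# split the genre string into a list, then give each genre its own row
long_df = films.assign(Genres=films["Genres"].str.split("|")).explode("Genres")
long_df["Genres"] = long_df["Genres"].str.strip()

# median rating per genre, highest first (the x-axis order used for the boxplot)
median_by_genre = (long_df.groupby("Genres")["Rating"]
                          .median()
                          .sort_values(ascending=False))
print(median_by_genre)
```

Note that a film with several genres contributes its rating to each of them, so the per-genre distributions are not independent of one another.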

## ANALYSIS & PLOTS 

### Pt2 : Who are the best actors / actresses?

- So this was the bit I was most interested in. Here we take a look at which actors and actresses appear (in the cast & crew listings) across all 250 movies, how many movies they appeared in and what those films were.
- In the code snippets below we create an interactive Plotly chart of the actors/actresses with the most film features. The film stars with the highest number of features appear on the far left of the chart, with the rest following in descending order.
- Perhaps I'm not quite the film buff I first thought but, to my surprise, the first actor I had heard of was in position five - **Robert De Niro** - who has appeared in eight of the Top 250 films and is undoubtedly a household name.
- Interestingly, **John Ratzenberger** has 'appeared' in more of the Top 250 movies than any other actor - a whopping 12 x movies. However, 10 x of these were animation films - so he didn't even 'appear' in them at all. **Bess Flowers**, in position two, featured in 10 of the Top 250 movies, but all before the 1970s.
- In third position, Joseph Oliveira was a quirky find - whilst he's featured in 9 of the Top 250 movies, he's only played 'supporting' or uncredited roles in each. For example:
    - The Dark Knight - a walk-on officer, https://www.imdb.com/title/tt0468569/fullcredits?ref_=tt_cl_sm#cast
    - Goodfellas - an uncredited role as 'Marciano', https://www.imdb.com/title/tt0099685/fullcredits?ref_=tt_cl_sm#cast
    - The Departed - again, an uncredited officer
    - The Wolf of Wall Street - Court Room Attendant (uncredited), https://www.imdb.com/title/tt0993846/fullcredits
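
The counting behind these rankings boils down to "how many cast lists does each name appear in?". As a minimal sketch of that idea (with made-up cast lists, not the scraped ones), `collections.Counter` does most of the work. One assumption here differs from a plain tally of every listing: each name is counted at most once per film, via a set, so a person listed twice in one cast isn't double counted.

```python
from collections import Counter

# made-up cast lists, one inner list per film
casts = [
    ["Actor A", "Actor B"],
    ["actor a", "Actor C"],
    ["Actor A", "Actor B", "Actor D", "Actor A"],
]

appearances = Counter()
for cast in casts:
    # lower-case so capitalisation variants merge; the set de-duplicates within a film
    appearances.update({name.lower() for name in cast})

top = appearances.most_common(2)  # the two most featured names
print(top)
```

Matching purely on name is a known weakness of this approach: two different people who share a name would be merged into one 'star'.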


## Pie Chart - Movie count per cast member

In [18]:

# prepare data for charts

movies_list = [mv[1] for mv in table_data]

def actorsAllMovies(base_castAndCrew , movies_list , topNCount):

    
    cast_dictionary = {}
    cast_movie_dict = {}    
    ordered_appearance = []
    ordered_app_num = []    
    
    # (1) CREATE ACTOR/ACTRESS DICTIONARY with MOVIE COUNTER (i.e. ACTORNAME : MOVIECOUNT)
    
    # for each cast member, across all movies, append to dictionary
    # create counter for each cast member (i.e. counter = number of movies featured in)
    for film_cast in base_castAndCrew:
        for cast_member in film_cast:
            if not str(cast_member).lower() in cast_dictionary:
                cast_dictionary[cast_member.lower()] = 1
            else:
                cast_dictionary[cast_member.lower()] += 1
                
    # sort dictionary by movie count           
    cast_dictionary = sorted(cast_dictionary.items(), key=lambda kv: -kv[1])

    
    # (2) CREATE DICTIONARY WITH ACTOR NAME AND MOVIES FEATURED IN (i.e. ACTORNAME : MOVIES)
    
    for movie, cast in zip(movies_list, base_castAndCrew):
        for cast_member in cast:
            cast_member_lower = cast_member.lower()
            if not str(cast_member).lower() in cast_movie_dict:
                cast_movie_dict[cast_member_lower] = [movie]
            else:
                cast_movie_dict[cast_member_lower].append( movie )
                
    # (3) CREATE LISTS OF ACTORS/MOVIES ABOVE A THRESHOLD MOVIE COUNT           

    for idx, name in enumerate(cast_dictionary):
        if cast_dictionary[idx][1] > topNCount:
            ordered_appearance.append(  str(cast_dictionary[idx][0]) )
            ordered_app_num.append( cast_dictionary[idx][1] ) 
            
    return( cast_dictionary ,  cast_movie_dict ,  ordered_appearance , ordered_app_num   )


In [24]:

cast_dict1 , cast_movie_dict1 , ordered_appearance1 , ordered_app_number1 = \
        actorsAllMovies(base_castAndCrew , movies_list , 0 )
    
df_ActorsNumber = pd.DataFrame({'Name': ordered_appearance1 , 'MovieCount': ordered_app_number1})

summ_movie_count = pd.DataFrame(df_ActorsNumber.MovieCount.value_counts())
summ_movie_count.reset_index(inplace = True)
summ_movie_count.columns = ['MovieCount', 'ActorCount']

summ_movie_count['MoveCountBanded'] = np.where(summ_movie_count['MovieCount'] > 3, 4, summ_movie_count['MovieCount'])
summ_movie_count['MoveCountBanded_str'] = np.where(summ_movie_count['MoveCountBanded'] == 4, '4+ movies'
                                                    , np.where(summ_movie_count['MoveCountBanded'] == 1, '1 movie'
                                                     , np.where(summ_movie_count['MoveCountBanded'] == 2, '2 movies'
                                                      , np.where(summ_movie_count['MoveCountBanded'] == 3, '3 movies', 'Missing'))))                                              
                                                  
labels = list(summ_movie_count.MoveCountBanded_str)
values = list(summ_movie_count.ActorCount)
colors = ['rgb(32, 51, 155)', 'rgb(249, 126, 32)', 'rgb(34, 155, 38)', 'rgb(153, 26, 142)']

trace = go.Pie(    labels = labels
                  , values = values
                  , hoverinfo='label+percent'
                  , textinfo='value'
                  , textfont=dict(size=15)
                  , opacity = 0.8
                  , textposition='outside'
                  , marker = dict(colors = colors,
                           line=dict(color='#f7f7f7'
                                     , width=1)))

data = [trace]
layout = go.Layout(title='Number of Top 250 movie features, per actor')
fig = dict(data = data, layout=layout)
plot(fig , filename = '../Plots/PieChart_MovieCount.html', auto_open=True)


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/PieChart_MovieCount.html'
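
The pie chart groups actors into '1 movie', '2 movies', '3 movies' and '4+ movies' bands. Here is a small, hedged sketch of that banding step on its own, using only the standard library and made-up counts; the `band` helper and its `cap` parameter are hypothetical names for illustration.

```python
from collections import Counter

def band(movie_count, cap=4):
    """Map a movie count to its pie chart label, lumping everything >= cap together."""
    if movie_count >= cap:
        return "{}+ movies".format(cap)
    return "1 movie" if movie_count == 1 else "{} movies".format(movie_count)

# made-up number of Top 250 films per actor
movie_counts = [1, 1, 1, 2, 3, 5, 12]

actors_per_band = Counter(band(c) for c in movie_counts)
print(actors_per_band)
```

Summing the counts per band before plotting means each label appears exactly once in the pie chart's `labels` list, which keeps the slice colours and values easy to reason about.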

## Bar Chart - Number of movie features per actor/actress (5+ movies)

In [28]:
# plot number of films starred in descending
# labels / hover to display list of films per star

cast_dict2 , cast_movie_dict2 , ordered_appearance2, ordered_app_number2 = \
        actorsAllMovies(base_castAndCrew , movies_list , 4 )
    
film_list = []
for name in ordered_appearance2:
    film_list.append( cast_movie_dict2[name] )
    
ordered_appearance_final = []
for cast in ordered_appearance2:
    ordered_appearance_final.append(cast.title())
     
trace0 = go.Bar(
    x = ordered_appearance_final ,
    y = ordered_app_number2,
    text = film_list,
    marker=dict(
        color='rgb(249, 126, 32)',
        line=dict(color='rgb(86, 38, 1)',width=1.5,)),opacity=0.5)

data = [trace0]
layout = go.Layout(title='Number of IMDb Top 250 film features'
                   , xaxis=dict(tickangle=45))

fig = dict(data = data, layout=layout)
plot(fig , filename = '../Plots/TopActors.html', auto_open=True)


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/TopActors.html'