# Exploring IMDb’s most featured actors

## Scraping, exploring & visualising most featured actors in IMDb’s Top 250 movies, with BeautifulSoup, Pandas & Plotly

### Introduction & Purpose


So here it is… my first article (hopefully not my last). I decided to kick off my series with a look at IMDb’s Top 250 rated movies. More specifically, in this article we scrape IMDb’s top rated movies, along with their corresponding cast & crew listings, and explore who the “real movie superstars” are: essentially, which actors feature in the most top rated movies?

I began this project with the naive assumption that these “superstars” would all be well-known household names, the likes of Katharine Hepburn, Robert De Niro and Jack Nicholson, similar to IMDb’s list of 100 greatest actors & actresses of all time. BUT this analysis resulted in some surprising finds!

For this project we used Python, and a selection of libraries, for data collection, exploration and visualisation. This article includes a selection of Python code blocks and visualisations that help us to explore and draw insights from the data. The full codeset is available on GitHub as a Jupyter notebook.

#### Motivation

So why did I choose to explore this random and niche topic? Well, primarily I wanted to demonstrate the use of web scraping technology to collect data that wasn’t otherwise readily available in the desired format (yawn!). The second reason was merely that I enjoy watching movies (hardly a unique interest), so I was innately interested in the data itself.

#### What we’ll be looking at

- Part 1 (of 5): Importing Required Python Libraries
- Part 2 (of 5): Data collection - Scraping IMDb’s Top 250 Rated Movies, using BeautifulSoup
- Part 3 (of 5): Data collection - Scraping Movie Genre & Full Cast + Crew, using BeautifulSoup
- Part 4 (of 5): Data Exploration - Visualising Movie Ratings, with Pandas and Plotly
- Part 5 (of 5): Data Exploration - Who Really Are The Best Actors?, with Pandas and Plotly

Additional note: whilst IMDb does NOT readily offer APIs for accessing movie information (which seemed a little surprising to me), they do offer a number of static datasets. I chose not to use these datasets and scraped what I needed directly from IMDb.com.


## Part 1 (of 5): Importing Required Python Libraries


This step is fairly trivial, so little explanation is needed beyond the commented code block below. It is perhaps useful to note, however, that in order to use Plotly within a Jupyter notebook, I had to use the plotly.offline configuration and call init_notebook_mode(connected=True).

In [1]:

# libraries for requesting and scraping web pages
import certifi
import urllib3
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
from bs4 import BeautifulSoup

# libraries for structuring data
import pandas as pd
import numpy as np
import csv

# libraries for fitting models/relationships (exploratory analysis)
# note: polyfit/polyval are NumPy functions; the old top-level scipy aliases are deprecated
from numpy import polyfit, polyval
from scipy.interpolate import CubicSpline

# libraries (and configuration) for visualisation with Plotly
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
%matplotlib inline


## Part 2 (of 5): Data collection - Scraping IMDb’s Top 250 Rated Movies

Data captured on 27th July 2018

We began collecting data using the urllib3 and BeautifulSoup libraries to scrape IMDb’s Top 250 movie chart (https://www.imdb.com/chart/top). We captured each movie title, along with its official IMDb ranking and rating. All the data we required could be found within the HTML table with class = ‘chart full-width’ on the IMDb web page (highlighted in the screenshot below).


In [14]:

ds_top250Movies = [] # empty list object. We'll be storing scraped data in this

# IMDb page url for Top 250 rated movies
url_imdbTop250 = 'https://www.imdb.com/chart/top'

# Execute web page request
page_imdbTop250 = http.request('GET', url_imdbTop250)

# Allow for page exploration using BeautifulSoup (i.e.'soupify' returned webpage)
soup_Top250 = BeautifulSoup(page_imdbTop250.data, "lxml")

# Subset returned webpage - select Top Rated Movies 'table' only
table = soup_Top250.find('table', attrs={'class':'chart full-width'})
table_body = table.find('tbody') # further subset the page. select only the table body
rows = table_body.find_all('tr') # find all rows within top 250 rated movies table

# For each row in table extract the: rank, movie name, year published and Imdb rating
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    movie_name_string = cols[1]
    movie_rank = movie_name_string[:movie_name_string.index('.')] 
    movie_year = int(movie_name_string[-5:-1])
    movie_name = movie_name_string[movie_name_string.index('.')+1:movie_name_string.index('(')].strip()
    movie_rating = cols[2]

    # retrieve the IMDb movieId (bespoke) for each movie
    movieIdHTML = row.find("div", {"class":"wlb_ribbon"})
    movieId = movieIdHTML.attrs['data-tconst']
    
    # for each row, append selected attributes to the ds_top250Movies list
    ds_top250Movies.append([movieId , movie_rank, movie_name, movie_year, movie_rating]) 
    
    # delete transient variables
    del movieId , movie_rank, movie_name, movie_year, movie_rating

ds_top250Movies[:5] # print the top 5 results


[['tt0111161', '1', 'The Shawshank Redemption', 1994, '9.2'],
 ['tt0068646', '2', 'The Godfather', 1972, '9.2'],
 ['tt0071562', '3', 'The Godfather: Part II', 1974, '9.0'],
 ['tt0468569', '4', 'The Dark Knight', 2008, '9.0'],
 ['tt0050083', '5', '12 Angry Men', 1957, '8.9']]
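The slicing above relies on the title cell reading as a rank, a full stop, the title, and then the year in brackets. As an alternative, here is a minimal sketch that parses the same shape of string with a regular expression. The helper name `parse_title_cell` and the sample string are my own assumptions for illustration (they are not part of the notebook), and the pattern only covers that one layout.

```python
import re

# Hypothetical helper (not from the original notebook): parses text shaped like
# "1.\n      The Shawshank Redemption\n        (1994)".
TITLE_CELL = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*\((\d{4})\)\s*$", re.DOTALL)

def parse_title_cell(cell_text):
    """Return (rank, title, year), or raise ValueError if the layout differs."""
    match = TITLE_CELL.match(cell_text)
    if match is None:
        raise ValueError("unexpected title cell: {!r}".format(cell_text))
    rank, title, year = match.groups()
    return int(rank), title, int(year)

# Assumed sample text, whitespace included, in the layout described above
sample = "1.\n      The Shawshank Redemption\n        (1994)"
print(parse_title_cell(sample))
```

Unlike the loop above, this sketch returns the rank as an integer rather than a string, and it fails loudly if the page layout changes instead of silently slicing the wrong characters.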

## Part 3 (of 5): Data collection - Scraping Movie Genre & Full Cast + Crew

Having captured all of the top rated movie names, and some additional information, I then focused on capturing the genre and full cast and crew, for each movie. This data allowed us to explore actors that featured across multiple top rated movies.

On IMDb.com, each listed movie has its own landing page, covering a summary of movie information, and a separate page for viewing the corresponding full cast & crew. Providing that you use the IMDb ‘film_id’ (i.e. a bespoke ID that IMDb have created to uniquely store movie-level data) it’s very simple to manipulate a couple of IMDb page URLs to retrieve the information we’re after:

- https://www.imdb.com/title/{film_id} : used to retrieve film genre. Replace {film_id} with the movie ID string (example: https://www.imdb.com/title/tt0111161 for The Shawshank Redemption).

- https://www.imdb.com/title/{film_id}/fullcredits : used to retrieve full cast & crew. Replace {film_id} with the movie ID string (example: https://www.imdb.com/title/tt0111161/fullcredits for The Shawshank Redemption full cast & crew).

We actually already had the movieIDs in the output dataset we collected in Part 2. These IDs were found in the HTML table structure, in the data-tconst attribute of the element with class = ‘wlb_ribbon’. With these movieIDs, we iteratively constructed movie-specific URLs and scraped the data we were after, as illustrated in the code block below.


In [30]:

# Empty list objects. We'll be storing scraped data in this
ds_movieGenre = [] # will store the movieID and movie genre(s)
ds_castAndCrew = [] # will store the full cast & crew per movie

# movieIds for each of the 250 x movies (in ranked order by default)
lst_movieIds = [movieId[0] for movieId in ds_top250Movies]

# for each movie, using the film ID:
for idx, movieID in enumerate(lst_movieIds):    
    
    counter = idx + 1
    if counter % 25 == 0: # print to log every n iterations
        print("[INFO] Scraping film no. {0} of {1}".format(counter,len(lst_movieIds)))
        
    # Part 1 ----------------------------------------
    # construct the movie title page and cast/crew page URLs (using the movieID)
    
    movieID_homePageURL = "https://www.imdb.com/title/{0}".format(movieID) # used to retrieve genre
    movieID_castURL = "https://www.imdb.com/title/{0}/fullcredits".format(movieID) # used to retrieve full cast
    
    # Part 2 ----------------------------------------
    # Retrieve the movie Genre from the movie title page
    
    # request movie home page to return genre
    html_movieHomePage = http.request('GET', movieID_homePageURL)
    soup_movieHomePage = BeautifulSoup(html_movieHomePage.data, "lxml")
    soup_getGenre = soup_movieHomePage.findAll('div', {'class':'see-more inline canwrap'})
    genre_clean = str(soup_getGenre[1].getText()).replace('\n' , '').replace('Genres: ', '')
    
    # append derived data to output list
    ds_movieGenre.append([movieID , movieID_castURL , movieID_homePageURL, genre_clean])
    
    # Part 3 ----------------------------------------
    # Retrieve the full cast & crew from the movie cast/crew page
    
    # request the cast & crew page and parse the returned HTML
    html_castAndCrew = http.request('GET', movieID_castURL)
    soup_castAndCrew = BeautifulSoup(html_castAndCrew.data, "lxml")
    table_castAndCrew = soup_castAndCrew.find('table', {'class':'cast_list'})
    
    # select cast list
    # here we are retrieving the second column (i.e. index 1) from each row within the table
    list_castAndCrew = []
    for cast in table_castAndCrew.findAll('tr'):
        # col_idx avoids shadowing the outer loop's idx
        for col_idx, td in enumerate(cast.findAll('td')):
            if col_idx == 1:
                list_castAndCrew.append(td.getText().strip())  # strip any surrounding whitespace
    ds_castAndCrew.append(list_castAndCrew)  
    
    # delete transient variables
    del movieID, movieID_castURL, movieID_homePageURL, genre_clean, html_movieHomePage, soup_movieHomePage, soup_getGenre, \
    list_castAndCrew, html_castAndCrew, soup_castAndCrew, table_castAndCrew


[INFO] Scraping film no. 25 of 250
[INFO] Scraping film no. 50 of 250
[INFO] Scraping film no. 75 of 250
[INFO] Scraping film no. 100 of 250
[INFO] Scraping film no. 125 of 250
[INFO] Scraping film no. 150 of 250
[INFO] Scraping film no. 175 of 250
[INFO] Scraping film no. 200 of 250
[INFO] Scraping film no. 225 of 250
[INFO] Scraping film no. 250 of 250
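
The two URL patterns described earlier can also be wrapped in a small helper, which keeps the string formatting in one place. This is only a sketch: `build_movie_urls` is a hypothetical name, and the check that an ID is "tt" followed by digits is an assumption based on the IDs shown in the Part 2 output.

```python
IMDB_TITLE_URL = "https://www.imdb.com/title/{0}"

def build_movie_urls(movie_id):
    """Return (home_page_url, full_credits_url) for an ID such as 'tt0111161'."""
    # Assumption: title IDs look like 'tt' + digits, as in the scraped output
    if not (movie_id.startswith("tt") and movie_id[2:].isdigit()):
        raise ValueError("not an IMDb title ID: {!r}".format(movie_id))
    home_page_url = IMDB_TITLE_URL.format(movie_id)
    return home_page_url, home_page_url + "/fullcredits"

home_url, cast_url = build_movie_urls("tt0111161")
print(home_url)  # page used for the genre
print(cast_url)  # page used for the full cast & crew
```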


Having completed our data collection, we had the following local datasets:

1. **ds_top250Movies**: contains the following attributes for each movie ([movieId, movie_rank, movie_name, movie_year, movie_rating])
2. **ds_movieGenre**: contains the following attributes for each movie ([movieID, movieID_castURL, movieID_homePageURL, genre_clean])
3. **ds_castAndCrew**: contains the full name of each cast & crew member, for each movie (e.g. [['Robert De Niro', 'Julia Roberts']])

Then we were able to start exploring!

In [None]:

# This code block exports the scraped datasets to csv files
# (you must first create a Data/ directory within your repository)

# movie rankings: one csv row per movie
with open("Data/movie_ratings.csv", 'w', newline='', encoding='utf-8') as f:
    wr = csv.writer(f, quoting=csv.QUOTE_ALL)
    wr.writerows(ds_top250Movies)
    
# cast and crew: one csv row per movie, one column per cast member
with open("Data/movie_cast_and_crew.csv", 'w', newline='', encoding='utf-8') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerows(ds_castAndCrew)
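
To read the saved data back later, a round trip through the `csv` module is enough. The sketch below assumes one CSV row per movie and uses an in-memory buffer in place of a real file, with two rows copied from the Part 2 output.

```python
import csv
import io

# Two rows in the same shape as ds_top250Movies (ID, rank, name, year, rating)
rows = [
    ["tt0111161", "1", "The Shawshank Redemption", 1994, "9.2"],
    ["tt0068646", "2", "The Godfather", 1972, "9.2"],
]

buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerows(rows)  # writerows: one CSV line per inner list

buffer.seek(0)
loaded = list(csv.reader(buffer))
print(len(loaded))
print(loaded[0])
```

Note that everything comes back as a string (the year included), so numeric columns need converting again after loading.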
    

## Part 4 (of 5): Data Exploration - Visualising Movie Ratings

I began by exploring the movie ratings themselves, asking questions of the data such as **Is there a linear relationship between movie ranking and movie rating?** and **Which movie genres typically have higher ratings?**

The boxplot below illustrates the distribution of IMDb movie ratings. With a median movie rating of 8.2 and an upper fence of 8.8, the boxplot identifies seven 'outliers' that have anomalously high ratings. Unsurprisingly, these top rated films include well-known favourites, such as [1] The Shawshank Redemption, [2] The Godfather, [3] The Godfather: Part II, [4] The Dark Knight, and [5] 12 Angry Men. Take a look for yourself...


In [34]:

# we first unlist our rankings, ratings and movie names into three variables
ranking = [float(rank[1]) for rank in ds_top250Movies]
rating = [float(rating[4]) for rating in ds_top250Movies]
movie_names = [name[2] for name in ds_top250Movies]


### Boxplot: distribution of IMDb movie ratings

In [35]:

# we use Plotly to create a boxplot to represent the distribution of movie ratings across the Top 250 IMDb movies

trace0 = go.Box(
    x = rating,
    name = " ",
    jitter = 1,
    pointpos = 0.4,
    boxpoints = 'all',
    marker = dict(color = 'rgb(32, 51, 155)', opacity=0.6, size=9),
    line = dict(color = 'rgb(162, 160, 165)'),
    text = movie_names
)

data = [trace0]
layout = go.Layout(  title="Box plot of IMDb's Top 250 movie ratings"
                   , boxgap = 0.5
                   , xaxis = dict(title='IMDb movie rating'))

fig = go.Figure(data = data , layout = layout)

# export boxplot visualisation to .html file
plot(fig, filename = "../Plots/Ratings_boxplot.html")


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/Ratings_boxplot.html'
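
For readers wondering where the 'outliers' come from: the conventional (Tukey) rule flags points above Q3 + 1.5 × IQR. The sketch below applies that rule to made-up ratings, not the scraped data. I'm assuming the standard rule here; quartile conventions differ between tools, and Plotly's reported "upper fence" may follow a slightly different convention (for example, the largest point inside that limit), so the numbers need not match the chart exactly.

```python
import statistics

# Made-up ratings, purely for illustration (not the scraped data)
sample_ratings = [8.0, 8.1, 8.1, 8.2, 8.2, 8.3, 8.4, 8.5, 9.3]

# Quartiles with linear interpolation between data points (Python 3.8+)
q1, median, q3 = statistics.quantiles(sample_ratings, n=4, method="inclusive")

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # Tukey's upper limit for non-outliers
outliers = [r for r in sample_ratings if r > upper_fence]

print(round(upper_fence, 2), outliers)
```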

## Looking at the relationship between movie rank and rating

The chart below (left) plots movie rating versus movie ranking, for all 250 movies. The adjacent chart (below right) attempts to identify a model that best describes the relationship between movie rating and rank (note: I used `scipy` to fit these models). We can see that a linear model explains the 'general' relationship (as you'd expect, with an R-squared of 0.83); however, it particularly understates the sharp increase in ratings for top ranked movies. Again as you'd expect, the quadratic (R-squared of 0.92) and cubic (R-squared of 0.96) functions provide better representations of the relationship, as they provide a much better fit for top ranked movies.


In [36]:

# here we fit a set of models / relationships between rank vs rating
# we use scipy to fit our models
# useful youtube video for using scipy package: https://www.youtube.com/watch?v=ro5ftxuD6is

p1 = polyfit(ranking , rating, 1) # linear regression
p2 = polyfit(ranking , rating, 2) # quadratic
p3 = polyfit(ranking , rating, 3) # cubic


In [37]:

# we use Plotly to illustrate the relationships using scatter plots

trace0 = go.Scatter(
    x = ranking, y = rating,
    name = 'Movies (rating vs. rank)' ,mode='markers'    
    , marker = dict(color = ('rgb(0, 0, 0)'), size=5, opacity=0.2)
)

trace1 = go.Scatter(
    x = ranking, y = polyval(p1 , ranking),
    name = 'Linear model' , line = dict(color = ('rgb(122, 122, 122)') , width = 3, dash='dash'), opacity=0.75)

trace2 = go.Scatter(
    x = ranking, y = polyval(p2 , ranking),
    name = 'Quadratic model' , line = dict(color = ('rgb(32, 51, 155)') , width = 6), opacity=0.75)

trace3 = go.Scatter(
    x = ranking, y = polyval(p3 , ranking),
    name = 'Cubic model' , line = dict(color = ('rgb(249, 126, 32)') , width = 6), opacity=0.75)

data = [trace0, trace1, trace2, trace3]

# Layout : include annotations for r-squared
layout = dict(title = 'How movie rating changes with ranking',
              xaxis = dict(title = 'Movie Ranking'),
              yaxis = dict(title = 'IMDb Rating'),
              annotations=[
                dict(x=120,y=8.28,xref='x',yref='y',text='0.83 Rsq',showarrow=True,arrowhead=7,ax=0,ay=-40
                     , font=dict(family='Arial black',size=15,color='rgb(122, 122, 122)')),
                dict(x=240,y=8.046,xref='x',yref='y',text='0.92 Rsq',showarrow=True,arrowhead=7,ax=0,ay=-40
                     , font=dict(family='Arial black',size=15,color='rgb(32, 51, 155)')), 
                dict(x=200,y=8.08713,xref='x',yref='y',text='0.96 Rsq',showarrow=True,arrowhead=7,ax=0,ay=-40
                     , font=dict(family='Arial black',size=15,color='rgb(249, 126, 32)'))
              ]
              )

fig = dict(data=data, layout=layout)
plot(fig, filename='../Plots/MovieRating_vs_Ranking.html', auto_open=True)


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/MovieRating_vs_Ranking.html'

In [104]:

# Calculate the R-squared value for each model fit (linear, quadratic, cubic)

y = np.array(rating)
SStotal = len(y) * np.var(y)  # total sum of squares

# linear model
yfit = np.polyval(p1, ranking)
SSresid = np.sum((y - yfit) ** 2)  # residual sum of squares
rsq = 1 - SSresid / SStotal
print("[INFO] Linear model RSquared: " + str(rsq))

# quadratic model
yfit = np.polyval(p2, ranking)
SSresid = np.sum((y - yfit) ** 2)
rsq = 1 - SSresid / SStotal
print("[INFO] Quadratic model RSquared: " + str(rsq))

# cubic model
yfit = np.polyval(p3, ranking)
SSresid = np.sum((y - yfit) ** 2)
rsq = 1 - SSresid / SStotal
print("[INFO] Cubic model RSquared: " + str(rsq))


[INFO] Linear model RSquared: 0.826708647607
[INFO] Quadratic model RSquared: 0.922868641639
[INFO] Cubic model RSquared: 0.955138111232
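
The same calculation can be expressed as one small reusable function. This is a sketch in plain Python with made-up numbers (not the IMDb data); `r_squared` is a hypothetical helper name rather than something from the notebook.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total

# Made-up numbers for illustration only
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, fitted), 2))
```

With the notebook's variables, a call such as `r_squared(rating, polyval(p1, ranking))` should give the linear model's value, since the function only needs two equal-length sequences.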


## Boxplot distributions, ratings by genre

In [45]:

# extract the arrays for genre, ratings and movie names
film_genres = [gen[3] for gen in ds_movieGenre]
film_ratings = [float(rat[4]) for rat in ds_top250Movies] 
film_names = [mov[2] for mov in ds_top250Movies] 


film_genres_split = []
film_ratings_split = []
film_name_split = []

for each_row, each_rating, each_name in zip(film_genres , film_ratings, film_names) :
    for each_genre in each_row.split("|"):
        film_name_split.append(each_name)
        film_genres_split.append(each_genre.strip())
        film_ratings_split.append(each_rating)
        
df_genres_ratings = pd.DataFrame( {'Ratings': film_ratings_split, 'Genres': film_genres_split} )

# order genre by median movie rating
sorted_by_value = dict(sorted(dict(df_genres_ratings.groupby(['Genres'])['Ratings'].median()).items(), key=lambda kv: -kv[1]))
genre_order = list(sorted_by_value.keys())
order = pd.DataFrame(df_genres_ratings.groupby(['Genres'])['Ratings'].median())
order.reset_index(inplace = True)
order.sort_values(['Ratings'], ascending=False, inplace=True)

data = []
for i in range(0,len(pd.unique(df_genres_ratings['Genres']))):
    trace = {
                "type": 'box',
                "jitter" :1,
                    "pointpos" :0,
                    "boxpoints" :'all',        
        
                "x": df_genres_ratings['Genres'][df_genres_ratings['Genres'] == pd.unique(df_genres_ratings['Genres'])[i]],
                "y": df_genres_ratings['Ratings'][df_genres_ratings['Genres'] == pd.unique(df_genres_ratings['Genres'])[i]],
                "name": pd.unique(df_genres_ratings['Genres'])[i],
                "showlegend": False,
                "marker" : {'color':'rgb(32, 51, 155)'},
                "box": {"visible": True},
                "meanline": {"visible": True}
            }
    data.append(trace)
    
fig = {"data": data,
       "layout" : {
        "title": "Boxplot of Movie Ratings, by Genre",
        "yaxis": {"zeroline": False},
        "xaxis":{"categoryarray": list(order.Genres)}
                }}

plot(fig, filename='../Plots/Boxplot_Dist_By_Genre.html', validate = False)


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/Boxplot_Dist_By_Genre.html'
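
The key step in the cell above is expanding each movie into one row per genre before grouping. Here is a stripped-down sketch of that idea in plain Python, using made-up (genre string, rating) pairs and assuming the same "|" separator.

```python
from collections import defaultdict
from statistics import median

# Made-up (genre string, rating) pairs, for illustration only
movies = [
    ("Crime | Drama", 9.2),
    ("Action | Crime | Drama", 9.0),
    ("Drama", 8.9),
]

# One entry per (movie, genre) pair: a movie counts once in each of its genres
ratings_by_genre = defaultdict(list)
for genre_string, rating in movies:
    for genre in genre_string.split("|"):
        ratings_by_genre[genre.strip()].append(rating)

# Median rating per genre, then genres ordered from highest to lowest median
median_by_genre = {g: median(r) for g, r in ratings_by_genre.items()}
genre_order = sorted(median_by_genre, key=median_by_genre.get, reverse=True)
print(genre_order)
```

Because a multi-genre movie appears in several boxes, the genre distributions are not independent of each other, which is worth keeping in mind when comparing them.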

## Part 5 (of 5): Data Exploration - *Who Really Are The Best Actors?*

So this was the bit I was most interested in. Here we take a look at how many top rated movies each actor appeared in.

Firstly, I looked at the frequency of movie features per actor. Each actor was assigned to a frequency bucket (i.e. 1, 2, 3, 4+) depending on the number of top rated movies they'd featured in. 

The pie chart below illustrates this split, for the full **15,013 (distinct) cast & crew members** that featured across the 250 movies. Here we see that **90% of actors featured in just one movie**, with **just under 1% of actors featuring in 4 or more movies**. Interesting? Perhaps not, but I then started to put names to these numbers.


## Pie Chart - Movie count per cast member

In [46]:

# prepare data for charts

movies_list = [mv[2] for mv in ds_top250Movies]

def actorsAllMovies(base_castAndCrew , movies_list , topNCount):

    cast_dictionary = {}
    cast_movie_dict = {}    
    ordered_appearance = []
    ordered_app_num = []    
    
    # (1) CREATE ACTOR/ACTRESS DICTIONARY with MOVIE COUNTER (i.e. ACTORNAME : MOVIECOUNT)
    
    # for each cast member, across all movies, append to dictionary
    # create counter for each cast member (i.e. counter = number of movies featured in)
    for film_cast in base_castAndCrew:
        for cast_member in film_cast:
            if not str(cast_member).lower() in cast_dictionary:
                cast_dictionary[cast_member.lower()] = 1
            else:
                cast_dictionary[cast_member.lower()] += 1
                
    # sort dictionary by movie count           
    cast_dictionary = sorted(cast_dictionary.items(), key=lambda kv: -kv[1])

    
    # (2) CREATE DICTIONARY WITH ACTOR NAME AND MOVIES FEATURED IN (i.e. ACTORNAME : MOVIES)
    
    for movie, cast in zip(movies_list, base_castAndCrew):
        for cast_member in cast:
            cast_member_lower = cast_member.lower()
            if not str(cast_member).lower() in cast_movie_dict:
                cast_movie_dict[cast_member_lower] = [movie]
            else:
                cast_movie_dict[cast_member_lower].append( movie )
                
    # (3) CREATE LISTS OF ACTORS/MOVIE COUNTS ABOVE A THRESHOLD MOVIE COUNT

    # cast_dictionary is now a list of (name, count) tuples, sorted by count descending
    for name, count in cast_dictionary:
        if count > topNCount:
            ordered_appearance.append(str(name))
            ordered_app_num.append(count)
            
    return( cast_dictionary ,  cast_movie_dict ,  ordered_appearance , ordered_app_num   )


In [48]:

cast_dict1 , cast_movie_dict1 , ordered_appearance1 , ordered_app_number1 = \
        actorsAllMovies(ds_castAndCrew , movies_list , 0 )
    
df_ActorsNumber = pd.DataFrame({'Name': ordered_appearance1 , 'MovieCount': ordered_app_number1})

summ_movie_count = pd.DataFrame(df_ActorsNumber.MovieCount.value_counts())
summ_movie_count.reset_index(inplace = True)
summ_movie_count.columns = ['MovieCount', 'ActorCount']

summ_movie_count['MoveCountBanded'] = np.where(summ_movie_count['MovieCount'] > 3, 4, summ_movie_count['MovieCount'])
summ_movie_count['MoveCountBanded_str'] = np.where(summ_movie_count['MoveCountBanded'] == 4, '4+ movies'
                                                    , np.where(summ_movie_count['MoveCountBanded'] == 1, '1 movie'
                                                     , np.where(summ_movie_count['MoveCountBanded'] == 2, '2 movies'
                                                      , np.where(summ_movie_count['MoveCountBanded'] == 3, '3 movies', 'Missing'))))                                              
                                                  
labels = list(summ_movie_count.MoveCountBanded_str)
values = list(summ_movie_count.ActorCount)
colors = ['rgb(32, 51, 155)', 'rgb(249, 126, 32)', 'rgb(34, 155, 38)', 'rgb(153, 26, 142)']

trace = go.Pie(    labels = labels
                  , values = values
                  , hoverinfo='label+percent'
                  , textinfo='value'
                  , textfont=dict(size=15)
                  , opacity = 0.8
                  , textposition='outside'
                  , marker = dict(colors = colors,
                           line=dict(color='#f7f7f7'
                                     , width=1)))

data = [trace]
layout = go.Layout(title='Number of Top 250 movie features, per actor')
fig = dict(data = data, layout=layout)
plot(fig , filename = '../Plots/PieChart_MovieCount.html', auto_open=True)


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/PieChart_MovieCount.html'
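
The counting and banding behind the pie chart can be condensed with `collections.Counter`. The sketch below uses made-up cast lists in the same shape as ds_castAndCrew. One deliberate difference, labelled in the code, is that it counts a name at most once per movie, whereas the notebook's function counts every listing.

```python
from collections import Counter

# Made-up cast lists, one inner list per movie (same shape as ds_castAndCrew)
casts = [
    ["Ann Lee", "Bo Diaz"],
    ["ann lee", "Cy Ng"],
    ["Ann Lee", "Cy Ng"],
    ["Ann Lee", "Di Wu"],
]

# Count movies per person, case-insensitively. The set means a name listed
# twice in one movie counts once (this differs from the notebook's function).
movie_counts = Counter(
    name for cast in casts for name in {member.lower() for member in cast}
)

def band(count):
    """Map a movie count to the pie chart labels: 1, 2, 3 or 4+."""
    if count >= 4:
        return "4+ movies"
    return "1 movie" if count == 1 else "{} movies".format(count)

# Number of people in each band
band_sizes = Counter(band(count) for count in movie_counts.values())
print(band_sizes)
```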

## Bar Chart - Number of movie features per actor/actress (5+ movies)

The interactive Plotly chart allows us to see the actors with the most movie features, with the list of movies available on hover. I chose to plot only actors that had featured in five or more movies, as any more and the plot would be unreadable (and I was too lazy to create a user-defined input range or value filter). The plot begins with the highest number of movie features on the far left of the chart, ranked in descending order.

So what did I deduce from this chart? Well, perhaps I'm not quite the film buff I first thought but:

- to my surprise, the first actor I had heard of was in position five - **Robert De Niro** - who has appeared in eight of the top 250 movies (and is undoubtedly a household name & living legend).

- in position one, **John Ratzenberger** has 'appeared' in more of the Top 250 movies than any other 'actor' - a whopping 12 movies. Interestingly, **10 of these were animation films**, so he didn't even 'appear' in them at all.

- **Bess Flowers**, in position two, featured in 10 movies all published before the 1970s.

- in third position, **Joseph Oliveira**, was a peculiar find. Whilst he'd featured in nine of the Top 250 movies, he'd only played **supporting** or **'uncredited'** roles in each. To list a few examples, Joseph featured as a [walk on officer in Dark Knight (2008)](https://www.imdb.com/title/tt0468569/fullcredits?ref_=tt_cl_sm#cast); held an uncredited role as ['Marciano' in Goodfellas (1990)](https://www.imdb.com/title/tt0099685/fullcredits?ref_=tt_cl_sm#cast); and **again** an [uncredited Officer Court Room Attendant in Wolf of Wall Street (2013)](https://www.imdb.com/title/tt0993846/fullcredits). He's been in so many modern classics but there's no chance I'd recognise him if he passed me on the street.

- there are also plenty of popularly recognised names, including Harrison Ford (6th), Gino Corrado (7th) and Morgan Freeman (10th), to name a few.

Why not take a look yourself!



In [49]:
# plot number of films starred in descending
# labels / hover to display list of films per star

cast_dict2 , cast_movie_dict2 , ordered_appearance2, ordered_app_number2 = \
        actorsAllMovies(ds_castAndCrew , movies_list , 4 )
    
    
# build the hover text: one string per actor, with one movie per line
movie_list_breaks = []
for name in ordered_appearance2:
    movie_list_breaks.append('<br>'.join(cast_movie_dict2[name]))
    
ordered_appearance_final = []
for cast in ordered_appearance2:
    ordered_appearance_final.append(cast.title())
     
trace0 = go.Bar(
    x = ordered_appearance_final ,
    y = ordered_app_number2,
    hovertext = movie_list_breaks,
    marker=dict(
        color='rgb(249, 126, 32)',
        line=dict(color='rgb(86, 38, 1)',width=1.5,)),opacity=0.5)

data = [trace0]
layout = go.Layout(title="Number of features in IMDb's Top 250 movies, per actor (5+ features)"
                   , xaxis=dict(tickangle=45)
                   , yaxis=dict(title="# of movie features"))

fig = dict(data = data, layout=layout)
plot(fig , filename = '../Plots/TopActors.html', auto_open=True)


'file:///Users/jeremyirving/Codesets/Imdb_Top_Actors/Plots/TopActors.html'