<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing IMDb Data

_Author: Kevin Markham (DC)_

---

For project two, you will complete a series of exercises that use basic exploratory data analysis on IMDb's movie rating data to answer questions such as:

- What is the average rating per genre?
- How many different actors are in a movie?

This process will help you practice your data analysis skills while becoming comfortable with Pandas.

## Basic level

In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline

#### Read in 'imdb_1000.csv' and store it in a DataFrame named movies.

In [64]:
movies = pd.read_csv('./data/imdb_1000.csv')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


#### Check the number of rows and columns.

In [65]:
# Answer:
movies.shape

(979, 6)

#### Check the data type of each column.

In [66]:
# Answer:
movies.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

#### Calculate the average movie duration.

In [67]:
# Answer:
movies['duration'].mean()

120.97957099080695

#### Sort the DataFrame by duration to find the shortest and longest movies.

In [68]:
# Answer: longest movie is Hamlet
movies.sort_values(by=['duration'], ascending=False)

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
476,7.8,Hamlet,PG-13,Drama,242,"[u'Kenneth Branagh', u'Julie Christie', u'Dere..."
157,8.2,Gone with the Wind,G,Drama,238,"[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit..."
78,8.4,Once Upon a Time in America,R,Crime,229,"[u'Robert De Niro', u'James Woods', u'Elizabet..."
142,8.3,Lagaan: Once Upon a Time in India,PG,Adventure,224,"[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell..."
445,7.9,The Ten Commandments,APPROVED,Adventure,220,"[u'Charlton Heston', u'Yul Brynner', u'Anne Ba..."
...,...,...,...,...,...,...
293,8.1,Duck Soup,PASSED,Comedy,68,"[u'Groucho Marx', u'Harpo Marx', u'Chico Marx']"
88,8.4,The Kid,NOT RATED,Comedy,68,"[u'Charles Chaplin', u'Edna Purviance', u'Jack..."
258,8.1,The Cabinet of Dr. Caligari,UNRATED,Crime,67,"[u'Werner Krauss', u'Conrad Veidt', u'Friedric..."
338,8.0,Battleship Potemkin,UNRATED,History,66,"[u'Aleksandr Antonov', u'Vladimir Barsky', u'G..."


In [69]:
# Answer: shortest movie is Freaks
movies.sort_values(by=['duration'], ascending=True)

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
389,8.0,Freaks,UNRATED,Drama,64,"[u'Wallace Ford', u'Leila Hyams', u'Olga Bacla..."
338,8.0,Battleship Potemkin,UNRATED,History,66,"[u'Aleksandr Antonov', u'Vladimir Barsky', u'G..."
258,8.1,The Cabinet of Dr. Caligari,UNRATED,Crime,67,"[u'Werner Krauss', u'Conrad Veidt', u'Friedric..."
293,8.1,Duck Soup,PASSED,Comedy,68,"[u'Groucho Marx', u'Harpo Marx', u'Chico Marx']"
88,8.4,The Kid,NOT RATED,Comedy,68,"[u'Charles Chaplin', u'Edna Purviance', u'Jack..."
...,...,...,...,...,...,...
445,7.9,The Ten Commandments,APPROVED,Adventure,220,"[u'Charlton Heston', u'Yul Brynner', u'Anne Ba..."
142,8.3,Lagaan: Once Upon a Time in India,PG,Adventure,224,"[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell..."
78,8.4,Once Upon a Time in America,R,Crime,229,"[u'Robert De Niro', u'James Woods', u'Elizabet..."
157,8.2,Gone with the Wind,G,Drama,238,"[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit..."
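
Sorting the whole DataFrame works, but the extremes can also be pulled out directly. The following is a minimal sketch, assuming a small stand-in DataFrame named `sample` (a hypothetical name, built from three rows shown in the output above) in place of `movies`:

```python
import pandas as pd

# Hypothetical stand-in for `movies`, using three rows from the output above
sample = pd.DataFrame({
    'title': ['Hamlet', 'Freaks', 'Duck Soup'],
    'duration': [242, 64, 68],
})

# idxmax / idxmin return the index label of the longest / shortest movie
longest = sample.loc[sample['duration'].idxmax(), 'title']
shortest = sample.loc[sample['duration'].idxmin(), 'title']
print(longest, shortest)

# nlargest returns the top-n rows by duration without sorting the whole frame
top2 = sample.nlargest(2, 'duration')
print(top2)
```

On the full dataset the same calls would be made on `movies` instead of `sample`.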


#### Create a histogram of duration, choosing an "appropriate" number of bins.

In [70]:
# Answer:
px.histogram(movies, x="duration", nbins=30)
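
One way to pick an "appropriate" number of bins, rather than guessing, is to let NumPy apply a standard rule. This is only a sketch: the `durations` array below is synthetic data generated for illustration, not the IMDb durations, and the Freedman-Diaconis rule is just one of several reasonable choices.

```python
import numpy as np

# Synthetic durations for illustration only (not the IMDb data)
rng = np.random.default_rng(0)
durations = rng.normal(loc=120, scale=25, size=979)

# 'fd' is the Freedman-Diaconis rule; 'sturges' and 'auto' are other built-in options
edges = np.histogram_bin_edges(durations, bins='fd')
n_bins = len(edges) - 1
print(n_bins)
```

The resulting count could serve as a starting point for `nbins`, to be adjusted by eye.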

#### Use a box plot to display that same data.

In [71]:
# Answer:
px.box(movies, x="duration")

## Intermediate level

#### Count how many movies have each of the content ratings.

In [72]:
# Answer:
content_ratings = movies.groupby('content_rating')['title'].count().reset_index()
content_ratings

Unnamed: 0,content_rating,title
0,APPROVED,47
1,G,32
2,GP,3
3,NC-17,7
4,NOT RATED,65
5,PASSED,7
6,PG,123
7,PG-13,189
8,R,460
9,TV-MA,1
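
The same counts can be obtained with `value_counts()`. A minimal sketch, assuming a hypothetical `ratings` Series with one missing value standing in for `movies['content_rating']`:

```python
import pandas as pd

# Hypothetical ratings column with one missing value
ratings = pd.Series(['R', 'PG-13', 'R', None, 'PG', 'R'], name='content_rating')

# Missing values are dropped by default
counts = ratings.value_counts()

# dropna=False counts missing values as their own row
counts_with_nan = ratings.value_counts(dropna=False)

print(counts)
print(counts_with_nan)
```

Both `groupby(...).count()` and `value_counts()` skip rows with a missing `content_rating` by default, so passing `dropna=False` is a quick way to see whether any rows are being left out.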


#### Use a visualization to display that same data, including a title and x and y labels.

In [73]:
# Answer:
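px.bar(content_ratings, x='content_rating', y='title', title='Number of Movies by Content Rating', labels={'content_rating': 'Content rating', 'title': 'Number of movies'})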

#### Convert the following content ratings to "UNRATED": NOT RATED, APPROVED, PASSED, GP.

In [74]:
# Answer:
#update the ratings
movies['content_rating'] = np.where(movies['content_rating'].isin(['NOT RATED', 'APPROVED', 'PASSED', 'GP']), 'UNRATED', movies['content_rating'])

# Check that they have all been updated. The query below should return 0 rows
movies.loc[movies['content_rating'].isin(['NOT RATED', 'APPROVED', 'PASSED', 'GP']), :]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list


#### Convert the following content ratings to "NC-17": X, TV-MA.

In [75]:
# Answer:
#update the ratings
movies['content_rating'] = np.where(movies['content_rating'].isin(['X' , 'TV-MA']), 'NC-17', movies['content_rating'])

# Check that they have all been updated. The query below should return 0 rows
movies.loc[movies['content_rating'].isin(['X' , 'TV-MA']), :]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
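
As an alternative to two separate `np.where` calls, both conversions can be expressed as a single mapping passed to `Series.replace`. A minimal sketch, assuming a hypothetical `ratings` Series in place of `movies['content_rating']`:

```python
import pandas as pd

# Hypothetical ratings column standing in for movies['content_rating']
ratings = pd.Series(['R', 'NOT RATED', 'X', 'GP', 'TV-MA', 'PG'])

# Map each old label to its new label; labels not in the dict are left unchanged
rating_map = {
    'NOT RATED': 'UNRATED',
    'APPROVED': 'UNRATED',
    'PASSED': 'UNRATED',
    'GP': 'UNRATED',
    'X': 'NC-17',
    'TV-MA': 'NC-17',
}
cleaned = ratings.replace(rating_map)
print(cleaned.tolist())
```

Keeping the mapping in one dictionary makes it easy to review which labels are being merged.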


#### Count the number of missing values in each column.

In [76]:
# Answer:
movies.isnull().sum()

star_rating       0
title             0
content_rating    3
genre             0
duration          0
actors_list       0
dtype: int64

#### If there are missing values: examine them, then fill them in with "reasonable" values.

In [77]:
# Answer:
# Fill in the missing values
movies['content_rating'] = movies['content_rating'].fillna('UNKNOWN')

#Check to make sure they got filled in
movies.loc[movies['content_rating'] == 'UNKNOWN', :]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
187,8.2,Butch Cassidy and the Sundance Kid,UNKNOWN,Biography,110,"[u'Paul Newman', u'Robert Redford', u'Katharin..."
649,7.7,Where Eagles Dare,UNKNOWN,Action,158,"[u'Richard Burton', u'Clint Eastwood', u'Mary ..."
936,7.4,True Grit,UNKNOWN,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']"
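
To examine the missing values before overwriting them, the rows can be selected with `isnull()` first. A minimal sketch, assuming a hypothetical two-row `sample` DataFrame in place of `movies`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `movies` with one missing content rating
sample = pd.DataFrame({
    'title': ['True Grit', 'Pulp Fiction'],
    'content_rating': [np.nan, 'R'],
})

# Look at the rows with a missing rating before changing anything
missing = sample[sample['content_rating'].isnull()]
print(missing)

# Then fill them in with a placeholder label
filled = sample['content_rating'].fillna('UNKNOWN')
print(filled.tolist())
```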


#### Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.

In [78]:
# Answer:
# Create a column called 'length' indicating whether each movie is '2 hours or more' or 'Less than 2 hours'
movies['length'] = np.where(movies['duration'] >= 120, '2 hours or more', 'Less than 2 hours')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,length
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...",2 hours or more
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']",2 hours or more
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv...",2 hours or more
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E...",2 hours or more
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L....",2 hours or more


In [79]:
# Calculate the average star rating, grouped by the 'length' column (created in the cell above)
movies.groupby('length')['star_rating'].mean()

length
2 hours or more      7.948899
Less than 2 hours    7.838667
Name: star_rating, dtype: float64
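
The comparison can also be made without adding a column, by grouping directly on a boolean Series. A minimal sketch with made-up values; `sample` and `is_long` are hypothetical names:

```python
import pandas as pd

# Made-up durations and ratings for illustration
sample = pd.DataFrame({
    'duration': [142, 96, 120, 110],
    'star_rating': [9.0, 8.0, 8.5, 7.5],
})

# Group on a boolean Series: True means 2 hours or longer
is_long = (sample['duration'] >= 120).rename('two_hours_or_more')
means = sample.groupby(is_long)['star_rating'].mean()
print(means)
```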

#### Use a visualization to detect whether there is a relationship between duration and star rating.

In [80]:
# Answer: The OLS trend line slopes upward, suggesting a positive relationship between duration and star rating
# (trendline='ols' requires the statsmodels package to be installed)
px.scatter(movies, x='duration', y='star_rating', trendline='ols',  title='Duration vs Star Rating')
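
A trend line shows direction but not strength, so it can help to compute a correlation coefficient alongside the plot. A minimal sketch with made-up values (not the IMDb data); `sample` is a hypothetical name:

```python
import pandas as pd

# Made-up durations and ratings for illustration
sample = pd.DataFrame({
    'duration': [90, 100, 120, 150, 200],
    'star_rating': [7.5, 7.9, 7.6, 8.1, 8.3],
})

# Pearson correlation between the two columns (ranges from -1 to 1)
r = sample['duration'].corr(sample['star_rating'])
print(round(r, 3))
```

On the real data, the equivalent call would be `movies['duration'].corr(movies['star_rating'])`.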

#### Calculate the average duration for each genre.

In [81]:
# Answer:
df = movies.groupby('genre')['duration'].mean().reset_index()
df

Unnamed: 0,genre,duration
0,Action,126.485294
1,Adventure,134.84
2,Animation,96.596774
3,Biography,131.844156
4,Comedy,107.602564
5,Crime,122.298387
6,Drama,126.539568
7,Family,107.5
8,Fantasy,112.0
9,Film-Noir,97.333333


## Advanced level

#### Visualize the relationship between content rating and duration.

In [82]:
#Answer: As the content ratings increase, the star ratings trend down 

# Find the unique content ratings (this will be used in the next cell)
movies['content_rating'].unique()

array(['R', 'PG-13', 'UNRATED', 'PG', 'G', 'NC-17', 'UNKNOWN'],
      dtype=object)

In [83]:
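# Create a dataframe containing the average star rating grouped by content rating
# (select 'star_rating' before calling mean so the non-numeric columns are not averaged)
Avg_star_rating = movies.groupby('content_rating')['star_rating'].mean()
Avg_star_rating = Avg_star_rating.reset_index()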

# Create a column for numeric content rating, that will be used for sorting
numeric_rating = {'NC-17': 6, 'R': 5, 'PG-13': 4, 'PG': 3, 'G': 2, 'UNRATED': 1, 'UNKNOWN': 0}
Avg_star_rating['numeric_rating'] = Avg_star_rating['content_rating'].map(numeric_rating).astype(float)

# Sort the Avg_star_rating dataframe by numeric content rating
# This ensures the graph will display data in order, from the lowest content rating to the highest
Avg_star_rating.sort_values(by=['numeric_rating'], ascending=True, inplace=True)

# Use a bar graph to visualize the relationship between content rating and average star rating
fig = px.bar(Avg_star_rating, x='star_rating', y='content_rating',  title='Content Rating vs Star Rating')
fig.update_xaxes(range=[7.6, 8.1])
fig
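
Instead of a separate numeric column, the sort order can be attached to the ratings themselves with an ordered categorical. A minimal sketch; the `avg` DataFrame and its star ratings are made up for illustration, and the ordering in `order` simply mirrors the numeric mapping used above:

```python
import pandas as pd

# Made-up averages for illustration
avg = pd.DataFrame({
    'content_rating': ['R', 'G', 'UNKNOWN', 'PG-13'],
    'star_rating': [7.9, 8.0, 7.8, 7.7],
})

# Lowest to highest, mirroring the numeric mapping in the cell above
order = ['UNKNOWN', 'UNRATED', 'G', 'PG', 'PG-13', 'R', 'NC-17']
avg['content_rating'] = pd.Categorical(avg['content_rating'], categories=order, ordered=True)

# sort_values now follows the category order rather than alphabetical order
avg_sorted = avg.sort_values('content_rating')
print(avg_sorted)
```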

#### Determine the top rated movie (by star rating) for each genre.

In [84]:
# Answer: See table below

# Make a list of genres
genres = movies['genre'].unique().tolist()

# Loop through the genres, and find the id of the movie with the highest star rating for each genre. Add the ids to a list
ids = []
for x in genres:
    y = movies[movies['genre'] == x]['star_rating'].idxmax()
    ids.append(y)

# Return the rows that correspond to the ids in the 'ids' list (built in the loop above)
movies.loc[movies.index.isin(ids), :]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,length
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...",2 hours or more
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E...",2 hours or more
5,8.9,12 Angry Men,UNRATED,Drama,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...",Less than 2 hours
6,8.9,"The Good, the Bad and the Ugly",UNRATED,Western,161,"[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...",2 hours or more
7,8.9,The Lord of the Rings: The Return of the King,PG-13,Adventure,201,"[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...",2 hours or more
8,8.9,Schindler's List,R,Biography,195,"[u'Liam Neeson', u'Ralph Fiennes', u'Ben Kings...",2 hours or more
25,8.6,Life Is Beautiful,PG-13,Comedy,116,"[u'Roberto Benigni', u'Nicoletta Braschi', u'G...",Less than 2 hours
30,8.6,Spirited Away,PG,Animation,125,"[u'Daveigh Chase', u'Suzanne Pleshette', u'Miy...",2 hours or more
38,8.6,Rear Window,UNRATED,Mystery,112,"[u'James Stewart', u'Grace Kelly', u'Wendell C...",Less than 2 hours
39,8.6,Psycho,R,Horror,109,"[u'Anthony Perkins', u'Janet Leigh', u'Vera Mi...",Less than 2 hours
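
The loop can also be written as a single `groupby` followed by `idxmax`. A minimal sketch, assuming a hypothetical `sample` DataFrame built from five rows shown in the earlier output:

```python
import pandas as pd

# Hypothetical stand-in for `movies`, using rows shown earlier in the notebook
sample = pd.DataFrame({
    'title': ['The Shawshank Redemption', 'The Godfather', 'The Dark Knight',
              'Pulp Fiction', '12 Angry Men'],
    'genre': ['Crime', 'Crime', 'Action', 'Crime', 'Drama'],
    'star_rating': [9.3, 9.2, 9.0, 8.9, 8.9],
})

# For each genre, idxmax gives the index label of the highest-rated movie
best_idx = sample.groupby('genre')['star_rating'].idxmax()

# Select those rows; the result is ordered by genre name
top_per_genre = sample.loc[best_idx]
print(top_per_genre)
```

When two movies in a genre share the top rating, `idxmax` keeps only the first one it encounters, just as the loop version does.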


#### Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.

In [85]:
# Answer: None of the 4 repeated titles are true duplicates. Within each pair the other columns differ (for example duration and actors_list), so they are different movies that share a title.

# Find the repeated movie titles and store them in a Series called dups
dups = movies['title'][movies.duplicated(subset=['title'])]

# Return the rows with a movie title in dups
movies.loc[movies['title'].isin(dups), :].sort_values(by='title')

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,length
703,7.6,Dracula,UNRATED,Horror,85,"[u'Bela Lugosi', u'Helen Chandler', u'David Ma...",Less than 2 hours
905,7.5,Dracula,R,Horror,128,"[u'Gary Oldman', u'Winona Ryder', u'Anthony Ho...",2 hours or more
678,7.7,Les Miserables,PG-13,Drama,158,"[u'Hugh Jackman', u'Russell Crowe', u'Anne Hat...",2 hours or more
924,7.5,Les Miserables,PG-13,Crime,134,"[u'Liam Neeson', u'Geoffrey Rush', u'Uma Thurm...",2 hours or more
466,7.9,The Girl with the Dragon Tattoo,R,Crime,158,"[u'Daniel Craig', u'Rooney Mara', u'Christophe...",2 hours or more
482,7.8,The Girl with the Dragon Tattoo,R,Crime,152,"[u'Michael Nyqvist', u'Noomi Rapace', u'Ewa Fr...",2 hours or more
662,7.7,True Grit,PG-13,Adventure,110,"[u'Jeff Bridges', u'Matt Damon', u'Hailee Stei...",Less than 2 hours
936,7.4,True Grit,UNKNOWN,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']",2 hours or more
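
The two steps above can be collapsed with `keep=False`, and true duplicates can be checked directly by comparing every column. A minimal sketch, assuming a hypothetical three-row `sample` DataFrame:

```python
import pandas as pd

# Hypothetical stand-in for `movies`
sample = pd.DataFrame({
    'title': ['Dracula', 'Dracula', 'Hamlet'],
    'duration': [85, 128, 242],
})

# keep=False flags every row whose title appears more than once
same_title = sample[sample.duplicated(subset=['title'], keep=False)]

# With no subset, rows are flagged only when all columns match
exact_dups = sample[sample.duplicated(keep=False)]

print(same_title)
print(len(exact_dups))
```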


#### Calculate the average star rating for each genre, but only include genres with at least 10 movies


#### Option 1: manually create a list of relevant genres, then filter using that list

In [86]:
# Answer: please see option 2

#### Option 2: automatically create a list of relevant genres by saving the value_counts and then filtering

In [87]:
# Answer:

# Count the movies in each genre (value_counts returns a Series indexed by genre)
movie_count = movies['genre'].value_counts()

# Filter for the popular genres, i.e. those with at least 10 movies
popular_genres = movie_count[movie_count >= 10]

# add the popular genres to a list
popular_genres_list = popular_genres.index.tolist()
popular_genres_list

['Drama',
 'Comedy',
 'Action',
 'Crime',
 'Biography',
 'Adventure',
 'Animation',
 'Horror',
 'Mystery']

In [88]:
# return the rows that correspond to popular genres and save it to a dataframe
df = movies.loc[movies['genre'].isin(popular_genres_list), :]

# Calculate the average star rating, grouped by genre
df.groupby('genre')['star_rating'].mean()

genre
Action       7.884559
Adventure    7.933333
Animation    7.914516
Biography    7.862338
Comedy       7.822436
Crime        7.916935
Drama        7.902518
Horror       7.806897
Mystery      7.975000
Name: star_rating, dtype: float64

#### Option 3: calculate the average star rating for all genres, then filter using a boolean Series

In [89]:
# Answer: please see option 2
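
For reference, Option 3 could look something like the sketch below. It uses a hypothetical toy `sample` DataFrame and a hypothetical threshold `MIN_MOVIES = 2` so the filter has a visible effect; the exercise itself uses 10.

```python
import pandas as pd

# Toy data for illustration only
sample = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Western'],
    'star_rating': [8.0, 7.0, 9.0, 8.5],
})
MIN_MOVIES = 2  # hypothetical threshold for the toy data; the exercise uses 10

# Average star rating for every genre
genre_means = sample.groupby('genre')['star_rating'].mean()

# Boolean Series, aligned to genre_means, marking genres with enough movies
genre_counts = sample['genre'].value_counts()
keep = genre_counts.reindex(genre_means.index) >= MIN_MOVIES

result = genre_means[keep]
print(result)
```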

#### Option 4: aggregate by count and mean, then filter using the count

In [90]:
# Answer: please see option 2
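
For reference, Option 4 could look something like the sketch below. It uses a hypothetical toy `sample` DataFrame and a hypothetical threshold `MIN_MOVIES = 2`; the exercise itself uses 10.

```python
import pandas as pd

# Toy data for illustration only
sample = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Western'],
    'star_rating': [8.0, 7.0, 9.0, 8.5],
})
MIN_MOVIES = 2  # hypothetical threshold for the toy data; the exercise uses 10

# Compute the count and the mean in one pass, one row per genre
stats = sample.groupby('genre')['star_rating'].agg(['count', 'mean'])

# Keep only the genres with enough movies
result = stats[stats['count'] >= MIN_MOVIES]
print(result)
```

This version keeps the count next to each mean, which makes the filter easy to verify.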

## Bonus

#### Figure out something "interesting" using the actors data!

In [91]:
# Didn't get to the bonus this time
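
One possible starting point for the bonus is sketched below. The `actors_list` column stores each list as a string, so it has to be parsed before actors can be counted. The `sample` DataFrame is a hypothetical stand-in: its first row is copied from the output near the top of the notebook, and its second row (including 'Example Actor') is made up for illustration.

```python
import ast
import pandas as pd

# Hypothetical stand-in for `movies`; the second row is made up
sample = pd.DataFrame({
    'title': ['The Godfather', 'Made-Up Movie'],
    'actors_list': [
        "[u'Marlon Brando', u'Al Pacino', u'James Caan']",
        "[u'Al Pacino', u'Example Actor']",
    ],
})

# Parse each string into a real Python list
sample['actors'] = sample['actors_list'].apply(ast.literal_eval)

# Number of actors listed for each movie
sample['n_actors'] = sample['actors'].apply(len)

# One row per (movie, actor), then count appearances per actor
actor_counts = sample['actors'].explode().value_counts()
print(actor_counts)
```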