# IMDB Movie Cleaning Total Score
![ImdbIcon](../images/imdbheader.jpg)

### Notebook Overview

For my capstone project, I want to see if I can accurately predict a movie's score based on its actors, actresses, and directors. I also want to be able to predict how well a future movie would do based on the roles that a user chooses. For example, one might be able to see how well Leonardo DiCaprio and Anne Hathaway would do in a comedy, or in any movie of their choosing. Some actors and actresses may be one-hit wonders, and there are actors and actresses not in this list who will have successful careers, but with the data I have chosen to work with, I will not account for those factors.

In this notebook I read in four separate datasets and merge them on their shared id columns. I remove columns with too many null values to work with and impute a very small number of values (the mean, for 18 rows). Where necessary, I fill NaN values such as a missing plot with "Unknown" so that I don't remove too many rows of valuable data. The location is very important, and movies outside the USA have too many null values, so I chose to work only with USA movies. I focus on three roles (actors, actresses, and directors) and remove the roles that are not as important for this specific project. I did not want to recommend a role for someone who has passed away; in the code below, the `date_of_death` column is dropped without being filtered on, and the only date restriction is a year cutoff that keeps movies released in 1970 or later. I also split the genre column because some movies have multiple genres, and by splitting it I can account for a person's impact in each specific genre instead of removing those movies.

### Imports

In [57]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.sentiment.vader import SentimentIntensityAnalyzer

### Import Data

In [58]:
movies = pd.read_csv('../data/IMDb_movies.csv')
names = pd.read_csv('../data/IMDb_names.csv')
ratings = pd.read_csv('../data/IMDb_ratings.csv')
titles = pd.read_csv('../data/IMDb_title_principals.csv')
budgets_df = pd.read_csv('../data/tmdb_movies_data.csv')


### Merge Data

In [59]:
# Merging movies with ratings
movies = movies.merge(ratings, on = 'imdb_title_id')
# Merging movies with title principals (one row per person credited on a movie)
movies = movies.merge(titles, on = 'imdb_title_id')
# Merging movies with names (actors/actresses/directors)
movies = movies.merge(names, on = 'imdb_name_id')
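
Because the title principals table holds one row per credited person, these merges turn each movie into several rows. The sketch below illustrates that on hypothetical toy tables (only the key column names mirror the notebook) and uses the `validate` argument of `merge` as an optional sanity check on the key relationships.

```python
import pandas as pd

# Hypothetical toy tables; only the key column names mirror the notebook.
toy_movies = pd.DataFrame({
    "imdb_title_id": ["tt1", "tt2"],
    "original_title": ["Movie A", "Movie B"],
})
toy_principals = pd.DataFrame({
    "imdb_title_id": ["tt1", "tt1", "tt2"],
    "imdb_name_id": ["nm1", "nm2", "nm1"],
    "category": ["actor", "director", "actress"],
})
toy_names = pd.DataFrame({
    "imdb_name_id": ["nm1", "nm2"],
    "name": ["Person One", "Person Two"],
})

# validate= raises a MergeError if the keys do not have the expected relationship.
merged = toy_movies.merge(toy_principals, on="imdb_title_id", validate="one_to_many")
merged = merged.merge(toy_names, on="imdb_name_id", validate="many_to_one")

# One row per (movie, person) pair: Movie A now appears twice.
print(merged[["original_title", "name", "category"]])
```

This row multiplication is why later cells average back down to one row per movie or per person.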

### Dropping Columns

In [60]:
# Dropping columns that are not needed for this analysis
movies.drop(columns = ['imdb_title_id', 'title', 'language', 'production_company', 'budget', 'metascore', 'reviews_from_critics', 
                       'usa_gross_income', 'worlwide_gross_income', 'allgenders_0age_avg_vote', 'allgenders_0age_votes', 
                       'allgenders_18age_avg_vote','allgenders_18age_votes', 'males_0age_avg_vote', 'males_0age_votes',
                       'males_18age_avg_vote', 'males_18age_votes', 'females_0age_avg_vote', 'females_0age_votes', 
                       'females_18age_avg_vote', 'females_18age_votes', 'females_30age_avg_vote', 'females_30age_votes', 
                       'females_45age_avg_vote', 'females_45age_votes', 'job', 'characters', 'imdb_name_id', 'birth_name',
                       'reviews_from_users', 'writer', 'allgenders_30age_avg_vote', 'allgenders_30age_votes', 
                       'allgenders_45age_avg_vote', 'allgenders_45age_votes', 'males_allages_avg_vote', 'males_allages_votes', 
                       'males_30age_avg_vote',  'males_30age_votes', 'males_45age_avg_vote', 'males_45age_votes', 
                       'females_allages_avg_vote', 'females_allages_votes', 'top1000_voters_rating', 'top1000_voters_votes', 
                       'date_published', 'ordering', 'director', 'non_us_voters_rating', 'non_us_voters_votes'], inplace=True)

### Cleaning

In [61]:
# Removing rows that were not movies
movies = movies[movies['year'] != 'TV Movie 2019']
# Movies outside the United States had a lot of missing data.
movies = movies[movies['country'] == 'USA']
# Removing movies whose only genre is Reality-TV or News
movies = movies[(movies['genre'] != 'Reality-TV') & (movies['genre'] != 'News')]

In [62]:
# A small number of movies had no description, so I fill those with "Unknown" to keep valuable data.
movies['description'] = movies['description'].fillna("Unknown")

In [63]:
# Fewer than twenty rows were missing the US voters rating, so I felt comfortable imputing the mean.
movies['us_voters_rating'] = movies['us_voters_rating'].fillna(movies['us_voters_rating'].mean())
# Fewer than twenty rows were missing the US voters vote count, so I felt comfortable imputing the mean.
movies['us_voters_votes'] = movies['us_voters_votes'].fillna(movies['us_voters_votes'].mean())
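
As a minimal sketch of what mean imputation does, the toy column below (hypothetical values, not taken from the dataset) has one missing rating that is replaced by the mean of the observed ones.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value.
toy = pd.DataFrame({"us_voters_rating": [6.0, np.nan, 8.0]})

# Series.mean() skips NaN by default, so this is (6.0 + 8.0) / 2.
col_mean = toy["us_voters_rating"].mean()

# Assigning the result back avoids relying on inplace=True on a column selection.
toy["us_voters_rating"] = toy["us_voters_rating"].fillna(col_mean)
print(toy)
```

With so few rows affected, the imputed values barely move the column's distribution, which is the reasoning given in the comments above.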

In [64]:
# Only looking to work with actors, actresses, and directors.
jobs_list = ['actor', 'actress', 'director']

movies = movies[movies['category'].isin(jobs_list)]

In [65]:
movies.reset_index(drop=True, inplace=True)

In [66]:
# Removing date_of_death and date_of_birth, no longer necessary
movies.drop(columns = ['date_of_death', 'date_of_birth'], inplace=True)

In [67]:
# Converting column year to integer
movies['year'] = movies['year'].astype('int64')
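
The cast to `int64` only works because the stray `'TV Movie 2019'` value was filtered out earlier. A hedged sketch of how such non-numeric entries could be located on a toy column (the values are illustrative) uses `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Hypothetical year column read as strings, with one malformed entry.
toy = pd.DataFrame({"year": ["2019", "TV Movie 2019", "1987"]})

# errors="coerce" turns anything that cannot be parsed into NaN.
parsed = pd.to_numeric(toy["year"], errors="coerce")
bad_rows = toy[parsed.isna()]

# Keep only parseable rows, then cast as the notebook does.
clean = toy[parsed.notna()].copy()
clean["year"] = clean["year"].astype("int64")
print(bad_rows)
print(clean)
```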

In [68]:
# Renaming original title, category, and description.
movies.rename(columns = {'original_title':'movie_title',
                         'category':'role',
                         'description':'plot'}, inplace=True)

### Dummify Genre

In [69]:
# Using str.get_dummies(", ") lets a single row have a 1 in several dummy columns
# For genre, if a movie is Horror AND Action, a 1 is placed in both of those columns
genre_dummies = movies['genre'].str.get_dummies(", ")
# Merging the dummified columns back to the movie dataframe
movies = pd.merge(movies, genre_dummies, left_index =True, right_index=True)
# Dropping genre and country. Only USA movies and genre is now dummified.
movies.drop(columns = ['genre', 'country'], inplace=True)
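
To make the multi-genre behaviour concrete, here is a small sketch on a hypothetical `genre` column. `Series.str.get_dummies` splits each string on the separator and sets a 1 in every matching column.

```python
import pandas as pd

# Hypothetical genre strings in the same ", "-separated format.
toy = pd.DataFrame({"genre": ["Horror, Action", "Comedy", "Action"]})

genre_dummies = toy["genre"].str.get_dummies(", ")

# The first row is flagged in both the Action and Horror columns.
print(genre_dummies)
```

A plain `pd.get_dummies(toy["genre"])` would instead treat `"Horror, Action"` as one combined category.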

In [70]:
# Dummify roles (actors, actresses, directors)
movies = pd.get_dummies(movies, columns = ['role'])

In [71]:
# Averaging each movie's numeric columns across its rows (one row per actor, actress, or director)
movies_one = movies.groupby(['movie_title']).mean(numeric_only=True)
movies_one = movies_one.reset_index()

In [72]:
# Getting the average score for each name in the dataframe, then changing the column to 'average_role_score'
movies_one = movies.groupby(['name']).mean(numeric_only=True)
movies_one = movies_one.reset_index()
movies_one = movies_one.round(3)
movies_one = movies_one[['name', 'weighted_average_vote']]

movies_one.rename(columns = {'weighted_average_vote':'average_role_score'}, inplace=True)
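
The per-person score is an ordinary group average: the sum of a person's movie ratings divided by the number of their movies. The sketch below checks, on hypothetical names and ratings, that this matches `groupby(...).mean()`.

```python
import pandas as pd

# Hypothetical people and ratings.
toy = pd.DataFrame({
    "name": ["Person One", "Person One", "Person Two"],
    "weighted_average_vote": [6.0, 8.0, 5.0],
})

grouped = toy.groupby("name")["weighted_average_vote"]
by_sum_count = grouped.sum() / grouped.count()
by_mean = grouped.mean()

# Both give 7.0 for Person One and 5.0 for Person Two.
print(by_mean)
```

Restricting the calculation to numeric columns matters on recent pandas versions, where summing text columns concatenates strings instead of dropping them.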

### Scores of Actors, Actresses, and Directors

In [73]:
# Averaging every numeric column per person; each role_* mean is the share of that person's rows in that role
actors_role = movies.groupby(['name']).mean(numeric_only=True)
actors_role = actors_role.reset_index()

In [74]:
actors_role = actors_role[['name', 'role_actor', 'role_actress', 'role_director', 'weighted_average_vote']]
actress_role = actors_role
directors_role = actors_role

# Setting dataframes based on the dummified columns.
# A mean of 1 keeps only people whose every row is in that role.
actors_role = actors_role[actors_role['role_actor'] >= 1]
actress_role = actress_role[actress_role['role_actress'] >= 1]
directors_role = directors_role[directors_role['role_director'] >= 1]
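
Since the role dummies were averaged per person, a value of 1 means the person appears only in that role. The sketch below uses hypothetical people to show what the `>= 1` filter keeps, next to a `> 0` variant that is shown purely for comparison and is not what the notebook does.

```python
import pandas as pd

# Hypothetical rows: Person One both acts and directs, Person Two only acts.
toy = pd.DataFrame({
    "name": ["Person One", "Person One", "Person Two"],
    "role_actor": [1, 0, 1],
    "role_director": [0, 1, 0],
})

# Per-person mean of each dummy = share of that person's rows in the role.
role_share = toy.groupby("name").mean()

only_actors = role_share[role_share["role_actor"] >= 1]  # the notebook's filter
any_actor = role_share[role_share["role_actor"] > 0]     # comparison only

print(list(only_actors.index))  # Person Two only
print(list(any_actor.index))    # both people
```

Someone who both acts and directs therefore falls out of every role-specific table under the `>= 1` rule.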

In [75]:
# Create new dataframe of each individual actor and their average score
actors_role = actors_role[['name', 'weighted_average_vote']].rename(columns = {'weighted_average_vote':'actor_score'})

# Create new dataframe of each individual actress and their average score
actress_role = actress_role[['name', 'weighted_average_vote']].rename(columns = {'weighted_average_vote':'actress_score'})

# Create new dataframe of each individual director and their average score
directors_role = directors_role[['name', 'weighted_average_vote']].rename(columns = {'weighted_average_vote':'director_score'})

In [76]:
# Visualizing each actor and their average score
actors_role

| | name | actor_score |
|---|---|---|
| 0 | 'Big' LeRoy Mobley | 6.40 |
| 1 | 'Ducky' Louie | 6.40 |
| 3 | 'Lee' George Quinones | 7.10 |
| 4 | 'Little Billy' Rhodes | 3.90 |
| 5 | 'Philthy' Phil Phillips | 3.70 |
| ... | ... | ... |
| 54428 | Zuher Khan | 4.40 |
| 54434 | Álex Nova | 6.80 |
| 54440 | Íce Mrozek | 4.05 |
| 54442 | Óscar Jaenada | 6.30 |


In [77]:
# Visualizing each actress and their average score
actress_role

| | name | actress_score |
|---|---|---|
| 9 | A Leslie Kies | 5.100 |
| 28 | A.J. Cook | 5.325 |
| 34 | A.J. Langer | 5.250 |
| 47 | AJ Michalka | 5.860 |
| 49 | Aaliyah | 6.100 |
| ... | ... | ... |
| 54430 | Zully Montero | 6.400 |
| 54431 | Zuzanna Surowy | 5.700 |
| 54435 | Ángela Molina | 6.100 |
| 54437 | Élodie Bouchez | 5.450 |


In [78]:
# Visualizing each director and their average score
directors_role

| | name | director_score |
|---|---|---|
| 2 | 'Evil' Ted Smith | 4.000000 |
| 11 | A. Blaine Miller | 5.300000 |
| 12 | A. Dean Bell | 4.650000 |
| 15 | A. Raven Cruz | 2.100000 |
| 21 | A.D. Calvo | 5.016667 |
| ... | ... | ... |
| 54433 | Àlex Pastor | 6.000000 |
| 54436 | Édouard Molinaro | 5.900000 |
| 54439 | Éric Rochat | 5.600000 |
| 54441 | Ómar Örn Hauksson | 5.000000 |


### Merge Role Scores to DataFrame

In [79]:
# Merging movies with the average role scores, then averaging them per movie.
final_movies = pd.merge(movies, movies_one, on = 'name')
final_movies = final_movies.groupby(['movie_title']).mean(numeric_only=True)
final_movies = final_movies.reset_index()
final_movies.rename(columns = {'average_role_score':'casting_score'}, inplace=True)

In [80]:
# Merging actor scores onto movies, then averaging them per movie
final_movies_actors = pd.merge(movies, actors_role, on = 'name')
final_movies_actors = final_movies_actors.groupby(['movie_title']).mean(numeric_only=True)
final_movies_actors = final_movies_actors.reset_index()

In [81]:
# Merging actress scores onto movies, then averaging them per movie
final_movies_actresses = pd.merge(movies, actress_role, on = 'name')
final_movies_actresses = final_movies_actresses.groupby(['movie_title']).mean(numeric_only=True)
final_movies_actresses = final_movies_actresses.reset_index()

In [82]:
# Merging director scores onto movies, then averaging them per movie
final_movie_directors = pd.merge(movies, directors_role, on = 'name')
final_movie_directors = final_movie_directors.groupby(['movie_title']).mean(numeric_only=True)
final_movie_directors = final_movie_directors.reset_index()

In [83]:
final_df = movies

In [84]:
# Rename columns
final_df.rename(columns = {'plot_y':'plot',
                             'duration_y':'duration',
                             'actors_y':'cast'}, inplace=True)

# Create total_score column which is an average of weighted_average_vote, us_voters_rating, mean_vote, and median_vote
final_df['total_score'] = (final_df['weighted_average_vote'] + final_df['us_voters_rating'] + final_df['mean_vote'] + final_df['median_vote']) / 4

# Dropping duplicates
final_df = final_df.drop_duplicates()
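
As a worked example of the `total_score` formula, the sketch below applies it to two hypothetical rows. It also computes a row-wise `mean(axis=1)` for comparison; that variant is an assumption of this sketch rather than the notebook's method, and it differs when one of the four inputs is missing.

```python
import numpy as np
import pandas as pd

score_cols = ["weighted_average_vote", "us_voters_rating", "mean_vote", "median_vote"]

# Hypothetical rows; the second one is missing its US voters rating.
toy = pd.DataFrame(
    [[7.0, 7.5, 6.5, 8.0],
     [6.0, np.nan, 6.0, 6.0]],
    columns=score_cols,
)

# The notebook's formula: sum of the four columns divided by 4.
toy["total_score"] = (toy["weighted_average_vote"] + toy["us_voters_rating"]
                      + toy["mean_vote"] + toy["median_vote"]) / 4

# Comparison only: a row mean skips NaN instead of propagating it.
toy["row_mean"] = toy[score_cols].mean(axis=1)
print(toy[["total_score", "row_mean"]])
```

For the first row both columns give (7.0 + 7.5 + 6.5 + 8.0) / 4 = 7.25. For the second row the explicit formula yields NaN, while the row mean averages the three available values.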

In [85]:
# Keeping only movie_title and the score in each of the director, actor, and actress dataframes so I can merge on movie_title
final_movie_directors = final_movie_directors[['movie_title', 'director_score']]
final_movies_actors = final_movies_actors[['movie_title', 'actor_score']]
final_movies_actresses = final_movies_actresses[['movie_title', 'actress_score']]

final_df = pd.merge(final_df, final_movie_directors, left_on = 'movie_title', right_on = 'movie_title')
final_df = pd.merge(final_df, final_movies_actors, left_on = 'movie_title', right_on = 'movie_title')
final_df = pd.merge(final_df, final_movies_actresses, left_on = 'movie_title', right_on = 'movie_title')

final_df = final_df.round(3)
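
`pd.merge` defaults to an inner join, so a movie stays in `final_df` only if it has a director score, an actor score, and an actress score. The sketch below shows that effect on hypothetical titles, with a left join alongside it purely for comparison.

```python
import pandas as pd

# Hypothetical per-movie score tables; Movie B has no actress score.
base = pd.DataFrame({"movie_title": ["Movie A", "Movie B"]})
director_scores = pd.DataFrame({"movie_title": ["Movie A", "Movie B"],
                                "director_score": [6.0, 5.0]})
actress_scores = pd.DataFrame({"movie_title": ["Movie A"],
                               "actress_score": [6.5]})

# Default how="inner": Movie B is dropped because it lacks an actress score.
inner = (base.merge(director_scores, on="movie_title")
             .merge(actress_scores, on="movie_title"))

# Comparison only: how="left" keeps Movie B with NaN in actress_score.
left = (base.merge(director_scores, on="movie_title", how="left")
            .merge(actress_scores, on="movie_title", how="left"))

print(inner)
print(left)
```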

In [86]:
# Found duplicate values. Dropping to ensure data is valid
final_df = final_df.drop_duplicates(subset = ['movie_title'])
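
One caveat with de-duplicating on `movie_title` alone is that two different films sharing a title collapse into one row. The sketch below shows this on hypothetical titles and years; the title-plus-year variant is an alternative for comparison, not the notebook's choice.

```python
import pandas as pd

# Hypothetical rows: two distinct films share the title "Same Title".
toy = pd.DataFrame({
    "movie_title": ["Same Title", "Same Title", "Other Title"],
    "year": [1978, 2018, 1976],
})

# The notebook's approach: the first row per title is kept.
by_title = toy.drop_duplicates(subset=["movie_title"])

# Comparison only: including year keeps both films.
by_title_year = toy.drop_duplicates(subset=["movie_title", "year"])

print(by_title)
print(by_title_year)
```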

In [94]:
# Keeping only movies released in 1970 or later
final_df = final_df[final_df['year'] >= 1970]

### Sentiment Analysis of Plot

In [96]:
plot_desc = final_df['plot'].tolist()

analyzer = SentimentIntensityAnalyzer()

def get_polarity_plot(plots):
    """Return the VADER compound score (-1 to 1) for each plot description."""
    polarity = []
    for plot in plots:
        vs = analyzer.polarity_scores(plot)
        polarity.append(vs['compound'])
    return polarity

polarity = get_polarity_plot(plot_desc)

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered frame
final_df = final_df.assign(plot_sentiment=polarity)


### Ordering of Columns in DataFrame

In [98]:
final_df = final_df[['movie_title', 'year', 'actors', 'plot', 'duration', 'Action', 'Adventure', 'Animation', 'Biography', 
          'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror', 'Music', 
          'Musical', 'Mystery', 'News', 'Reality-TV', 'Romance', 'Sci-Fi', 'Sport', 'Thriller', 'War', 'Western',
          'avg_vote', 'votes', 'weighted_average_vote', 'total_votes', 'mean_vote', 'median_vote', 'votes_1', 'votes_2', 
          'votes_3', 'votes_4', 'votes_5', 'votes_6', 'votes_7', 'votes_8', 'votes_9', 'votes_10', 'us_voters_rating', 
          'us_voters_votes', 'plot_sentiment', 'director_score', 'actor_score', 'actress_score', 'total_score']]

In [99]:
# Dropping rows with any missing value and renumbering the index from 0
final_df = final_df.dropna().reset_index(drop=True)

In [100]:
final_df

| | movie_title | year | actors | plot | duration | Action | Adventure | Animation | Biography | Comedy | ... | votes_8 | votes_9 | votes_10 | us_voters_rating | us_voters_votes | plot_sentiment | director_score | actor_score | actress_score | total_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hambone and Hillie | 1983 | Lillian Gish, Candy Clark, O.J. Simpson, Rober... | A dog (Hambone) treks from New York City to Lo... | 90 | 0 | 0 | 0 | 0 | 1 | ... | 10 | 4 | 25 | 4.8 | 58.0 | 0.2732 | 5.200 | 5.670 | 6.150 | 5.425 |
| 1 | The Whales of August | 1987 | Bette Davis, Lillian Gish, Vincent Price, Ann ... | Two aged sisters reflect on life and the past ... | 90 | 0 | 0 | 0 | 0 | 0 | ... | 894 | 422 | 894 | 7.3 | 1476.0 | 0.0000 | 7.300 | 6.621 | 6.675 | 7.550 |
| 2 | Family Plot | 1976 | Karen Black, Bruce Dern, Barbara Harris, Willi... | A phony psychic/con artist and her taxi driver... | 120 | 0 | 0 | 0 | 0 | 1 | ... | 3515 | 1055 | 1041 | 6.8 | 4670.0 | -0.2960 | 7.493 | 5.860 | 5.724 | 6.850 |
| 3 | Disconnected | 1984 | Frances Raines, Mark Walker, Carl Koch, Profes... | Alicia has started getting these very noisy, a... | 82 | 0 | 0 | 0 | 0 | 0 | ... | 13 | 9 | 22 | 4.5 | 165.0 | -0.6764 | 4.600 | 4.200 | 3.333 | 4.325 |
| 4 | Islands in the Stream | 1977 | George C. Scott, David Hemmings, Gilbert Rolan... | An isolated sculptor is visited by his three s... | 104 | 0 | 0 | 0 | 0 | 0 | ... | 184 | 80 | 152 | 6.5 | 435.0 | -0.3182 | 6.786 | 6.434 | 6.250 | 6.700 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12934 | A Room Full of Nothing | 2019 | Ivy Meehan, Duncan Coe, Kat Albert, Austin Ale... | An artistic couple living in Austin, Texas, wa... | 85 | 0 | 0 | 0 | 0 | 1 | ... | 5 | 14 | 15 | 4.2 | 32.0 | 0.0000 | 3.300 | 3.300 | 3.300 | 4.375 |
| 12935 | The Nomads | 2019 | Andrea Barnes, Erik Blachford, Jennifer Butler... | Amidst the chaos of massive budget cuts and sc... | 97 | 0 | 0 | 0 | 0 | 0 | ... | 50 | 19 | 860 | 6.2 | 72.0 | -0.7096 | 8.200 | 8.200 | 8.200 | 8.325 |
| 12936 | Saint Frances | 2019 | Kelly O'Sullivan, Charin Alvarez, Braden Croth... | After an accidental pregnancy turned abortion,... | 106 | 0 | 0 | 0 | 0 | 1 | ... | 263 | 98 | 88 | 7.0 | 220.0 | 0.2023 | 6.900 | 6.900 | 6.900 | 6.950 |
| 12937 | Xane: The Vampire God | 2020 | Parker Boles, Jenna Farden, Robere Kazadi, Zoë... | Xane, an immortal vampire, returns to the past... | 118 | 0 | 1 | 0 | 0 | 0 | ... | 30 | 48 | 86 | 3.6 | 10.0 | 0.6652 | 5.000 | 5.000 | 5.000 | 6.225 |


### Export Final DataFrame

In [101]:
final_df.to_csv('../data/totalscore_df.csv', index=False)