# IMDB Movie Cleaning Total Score
![ImdbIcon](../images/imdbheader.jpg)

### Notebook Overview

In this notebook I read in four separate datasets and merge them on their shared ID columns. I drop columns with too many null values to work with and impute a very small number of values (the mean, for 18 rows). Where necessary, I fill missing values such as the plot and director with "Unknown" so that I don't lose too many rows of valuable data.

Location matters: movies from outside the USA have too many null values, so I chose to work only with USA movies. I focus on three roles (actors, actresses, and directors) and remove the others, which are less important for this project. I also exclude individuals who have passed away so that no role is recommended for someone who is deceased, and for the same reason I keep only movies from the past 50 years. Finally, I split the genre column because some movies have multiple genres; splitting lets me account for a person's impact in each specific genre instead of removing those movies.

### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.sentiment.vader import SentimentIntensityAnalyzer



### Import Data

In [2]:
movies = pd.read_csv('../data/IMDb_movies.csv')
names = pd.read_csv('../data/IMDb_names.csv')
ratings = pd.read_csv('../data/IMDb_ratings.csv')
titles = pd.read_csv('../data/IMDb_title_principals.csv')
budgets_df = pd.read_csv('../data/tmdb_movies_data.csv')



### Merge Data

In [3]:
# Merging movies with ratings
movies = movies.merge(ratings, left_on = 'imdb_title_id', right_on = 'imdb_title_id')
# Merging movies with titles (title principals)
movies = movies.merge(titles, left_on = 'imdb_title_id', right_on = 'imdb_title_id')
# Merging movies with names (actors/actresses/directors)
movies = movies.merge(names, left_on = 'imdb_name_id', right_on = 'imdb_name_id')
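
Because the title principals file holds one row per credited person per film, these merges repeat each movie once for every principal. A minimal sketch on made-up toy frames (the IDs, names, and values below are hypothetical, not taken from the dataset) shows the row count growing, and how the `validate` argument of `merge` can guard the key relationship I expect:

```python
import pandas as pd

# Toy stand-ins for the real files; every value here is invented for illustration.
movies_toy = pd.DataFrame({"imdb_title_id": ["tt1", "tt2"], "year": [1999, 2005]})
titles_toy = pd.DataFrame({
    "imdb_title_id": ["tt1", "tt1", "tt2"],
    "imdb_name_id": ["nm1", "nm2", "nm1"],
    "category": ["actor", "director", "actor"],
})
names_toy = pd.DataFrame({"imdb_name_id": ["nm1", "nm2"],
                          "name": ["A. Actor", "D. Director"]})

# `on=` is shorthand for identical left_on/right_on keys.
# `validate` raises a MergeError if the keys do not have the stated relationship.
merged = movies_toy.merge(titles_toy, on="imdb_title_id", validate="one_to_many")
merged = merged.merge(names_toy, on="imdb_name_id", validate="many_to_one")

print(len(movies_toy), "movies ->", len(merged), "movie/person rows")
```

This repetition is why later cells average by `movie_title` or by `name` to get back to one row per movie or per person.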

### Dropping Columns

In [4]:
# Dropping columns that are not needed for this analysis
movies.drop(columns = ['imdb_title_id', 'title', 'language', 'production_company', 'budget', 'metascore', 'reviews_from_critics', 
                       'usa_gross_income', 'worlwide_gross_income', 'allgenders_0age_avg_vote', 'allgenders_0age_votes', 
                       'allgenders_18age_avg_vote','allgenders_18age_votes', 'males_0age_avg_vote', 'males_0age_votes',
                       'males_18age_avg_vote', 'males_18age_votes', 'females_0age_avg_vote', 'females_0age_votes', 
                       'females_18age_avg_vote', 'females_18age_votes', 'females_30age_avg_vote', 'females_30age_votes', 
                       'females_45age_avg_vote', 'females_45age_votes', 'job', 'characters', 'imdb_name_id', 'birth_name',
                       'reviews_from_users', 'writer', 'allgenders_30age_avg_vote', 'allgenders_30age_votes', 
                       'allgenders_45age_avg_vote', 'allgenders_45age_votes', 'males_allages_avg_vote', 'males_allages_votes', 
                       'males_30age_avg_vote',  'males_30age_votes', 'males_45age_avg_vote', 'males_45age_votes', 
                       'females_allages_avg_vote', 'females_allages_votes', 'top1000_voters_rating', 'top1000_voters_votes', 
                       'date_published', 'ordering', 'director', 'non_us_voters_rating', 'non_us_voters_votes'], inplace=True)

### Cleaning

In [5]:
# Removing rows that were not movies
movies = movies[movies['year'] != 'TV Movie 2019']
# Movies outside the United States had a lot of missing data.
movies = movies[movies['country'] == 'USA']
# Removing both Reality-TV and News
movies = movies[(movies['genre'] != 'Reality-TV') & (movies['genre'] != 'News')]

In [6]:
# A small number of movies had no description, so I fill those with "Unknown" to keep valuable data.
movies['description'] = movies['description'].fillna("Unknown")

In [7]:
# Fewer than twenty rows were missing the US voters rating, so I felt comfortable imputing the mean.
movies['us_voters_rating'] = movies['us_voters_rating'].fillna(movies['us_voters_rating'].mean())
# Fewer than twenty rows were missing the US voters vote count, so I felt comfortable imputing the mean.
movies['us_voters_votes'] = movies['us_voters_votes'].fillna(movies['us_voters_votes'].mean())
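
To make the imputation step concrete, here is a small sketch on invented ratings (the frame `toy` is hypothetical). `mean()` skips missing values by default, so the fill value is the average of the observed ratings only:

```python
import pandas as pd

# Hypothetical ratings with one missing value.
toy = pd.DataFrame({"us_voters_rating": [6.0, None, 8.0]})

# mean() ignores NaN by default, so the fill value is (6.0 + 8.0) / 2 = 7.0.
fill_value = toy["us_voters_rating"].mean()

# Assign the filled column back so the change is explicit.
toy["us_voters_rating"] = toy["us_voters_rating"].fillna(fill_value)

print(toy["us_voters_rating"].tolist())  # [6.0, 7.0, 8.0]
```

With fewer than twenty affected rows, the fill barely shifts the column's distribution, which is why I was comfortable with it here.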

In [8]:
# Only looking to work with actors, actresses, and directors.
jobs_list = ['actor', 'actress', 'director']

movies = movies[movies['category'].isin(jobs_list)]

In [9]:
movies.reset_index(drop=True, inplace=True)

In [10]:
# Removing date_of_death and date_of_birth, no longer necessary
movies.drop(columns = ['date_of_death', 'date_of_birth'], inplace=True)
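
The overview describes excluding people who have passed away, while the cell above only drops the two date columns. If that filter is not applied elsewhere, one way it could look is sketched below. It assumes `date_of_death` is null for living people, which is an assumption about the data, and it runs on a made-up frame rather than the real one:

```python
import pandas as pd

# Invented rows; the real names file is not used here.
toy = pd.DataFrame({
    "name": ["Living Person", "Deceased Person"],
    "date_of_death": [None, "1990-01-01"],
    "date_of_birth": ["1960-05-05", "1920-02-02"],
})

# Keep only rows with no recorded date of death (assumed to mean "still living"),
# then drop the date columns, which are no longer needed.
living = toy[toy["date_of_death"].isna()].drop(columns=["date_of_death", "date_of_birth"])

print(living["name"].tolist())
```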

In [11]:
# Converting column year to integer
movies['year'] = movies['year'].astype('int64')
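
The cast is needed because the stray `'TV Movie 2019'` entry keeps pandas from reading `year` as a number. A small sketch using an in-memory CSV (the rows are invented) shows the column arriving as text, then the same filter-and-cast sequence used in the cells above:

```python
import io
import pandas as pd

csv_text = "original_title,year\nFilm A,1999\nFilm B,TV Movie 2019\nFilm C,2005\n"

# One non-numeric entry forces the whole column to be parsed as strings.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy["year"].tolist())  # ['1999', 'TV Movie 2019', '2005']

# Remove the bad row first; after that the cast to int64 succeeds.
toy = toy[toy["year"] != "TV Movie 2019"].astype({"year": "int64"})
print(toy["year"].tolist())  # [1999, 2005]
```

Casting before removing the bad row would raise a `ValueError`, so the order of the two steps matters.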

In [12]:
# Renaming original title, category, and description.
movies.rename(columns = {'original_title':'movie_title',
                          'category':'role',
                         'description':'plot'}, inplace=True)

### Dummify Genre

In [13]:
# Using str.get_dummies(", "), a single row can have a 1 in multiple dummy columns
# For genre, if a movie is Horror AND Action, a 1 is placed in both of those columns
genre_dummies = movies['genre'].str.get_dummies(", ")
# Merging the dummified columns back to the movie dataframe
movies = pd.merge(movies, genre_dummies, left_index =True, right_index=True)
# Dropping genre and country. Only USA movies and genre is now dummified.
movies.drop(columns = ['genre', 'country'], inplace=True)
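
As a quick illustration of the multi-genre split, here is `str.get_dummies` on three invented genre strings (the frame `toy` is hypothetical). Each distinct genre becomes its own 0/1 column, and a movie listed under two genres gets a 1 in both:

```python
import pandas as pd

toy = pd.DataFrame({"genre": ["Horror, Action", "Comedy", "Action"]})

# Split each string on ", " and emit one indicator column per distinct genre.
dummies = toy["genre"].str.get_dummies(", ")

print(dummies)
#    Action  Comedy  Horror
# 0       1       0       1
# 1       0       1       0
# 2       1       0       0
```

Note that the separator includes the space; splitting on `","` alone would produce separate columns such as `" Action"` and `"Action"`.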

In [14]:
# Dummify roles (actors, actresses, directors)
movies = pd.get_dummies(movies, columns = ['role'])

In [15]:
# Averaging each column per movie_title (sum / count), since every movie appears once per actor, actress, and director.
# Note: movies_one is reassigned in the next cell.
movies_one = movies.groupby(['movie_title']).sum() / movies.groupby(['movie_title']).count()
movies_one = movies_one.reset_index()

In [16]:
# Getting the average score for each name in the dataframe, then changing the column to 'average_role_score'
movies_one = movies.groupby(['name']).sum() / movies.groupby(['name']).count()
movies_one = movies_one.reset_index()
movies_one = movies_one.round(3)
movies_one = movies_one[['name', 'weighted_average_vote']]

movies_one.rename(columns = {'weighted_average_vote':'average_role_score'}, inplace=True)
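
Dividing a grouped sum by a grouped count gives the per-group mean for any column without missing values. A toy check (names and scores invented) comparing it against `groupby(...).mean()`:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "A", "B"],
    "weighted_average_vote": [6.0, 8.0, 5.0],
})

grouped = toy.groupby("name")["weighted_average_vote"]

# sum / count, as in the cells above.
by_division = grouped.sum() / grouped.count()
# The built-in mean gives the same result.
by_mean = grouped.mean()

print(by_division.to_dict())  # {'A': 7.0, 'B': 5.0}
```

So `average_role_score` is each person's mean `weighted_average_vote` across all of their rows in the dataset.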

### Scores of Actors, Actresses, and Directors

In [17]:
# Grouping roles by name and dividing their scores by the number of times they appear in dataset
actors_role = movies.groupby(['name']).sum() / movies.groupby(['name']).count()
actors_role = actors_role.reset_index()

In [18]:
actors_role = actors_role[['name', 'movie_title', 'role_actor', 'role_actress', 'role_director', 'weighted_average_vote']]
actress_role = actors_role
directors_role = actors_role

# Setting dataframes based on the dummified columns
actors_role = actors_role[actors_role['role_actor'] >= 1]
actress_role = actress_role[actress_role['role_actress'] >= 1]
directors_role = directors_role[directors_role['role_director'] >= 1]

In [19]:
# Create a new dataframe of each individual actor and their average score
actors_role = actors_role[['name', 'weighted_average_vote']].rename(columns = {'weighted_average_vote':'actor_score'})

# Create a new dataframe of each individual actress and their average score
actress_role = actress_role[['name', 'weighted_average_vote']].rename(columns = {'weighted_average_vote':'actress_score'})

# Create a new dataframe of each individual director and their average score
directors_role = directors_role[['name', 'weighted_average_vote']].rename(columns = {'weighted_average_vote':'director_score'})

In [20]:
# Visualizing each actor and their average score
actors_role

| | name | actor_score |
|---|---|---|
| 0 | 'Big' LeRoy Mobley | 6.40 |
| 1 | 'Ducky' Louie | 6.40 |
| 3 | 'Lee' George Quinones | 7.10 |
| 4 | 'Little Billy' Rhodes | 3.90 |
| 5 | 'Philthy' Phil Phillips | 3.70 |
| ... | ... | ... |
| 54428 | Zuher Khan | 4.40 |
| 54434 | Álex Nova | 6.80 |
| 54440 | Íce Mrozek | 4.05 |
| 54442 | Óscar Jaenada | 6.30 |


In [21]:
# Visualizing each actress and their average score
actress_role

| | name | actress_score |
|---|---|---|
| 9 | A Leslie Kies | 5.100 |
| 28 | A.J. Cook | 5.325 |
| 34 | A.J. Langer | 5.250 |
| 47 | AJ Michalka | 5.860 |
| 49 | Aaliyah | 6.100 |
| ... | ... | ... |
| 54430 | Zully Montero | 6.400 |
| 54431 | Zuzanna Surowy | 5.700 |
| 54435 | Ángela Molina | 6.100 |
| 54437 | Élodie Bouchez | 5.450 |


In [22]:
# Visualizing each director and their average score
directors_role

| | name | director_score |
|---|---|---|
| 2 | 'Evil' Ted Smith | 4.000000 |
| 11 | A. Blaine Miller | 5.300000 |
| 12 | A. Dean Bell | 4.650000 |
| 15 | A. Raven Cruz | 2.100000 |
| 21 | A.D. Calvo | 5.016667 |
| ... | ... | ... |
| 54433 | Àlex Pastor | 6.000000 |
| 54436 | Édouard Molinaro | 5.900000 |
| 54439 | Éric Rochat | 5.600000 |
| 54441 | Ómar Örn Hauksson | 5.000000 |


### Merge Role Scores to DataFrame

In [23]:
# Merging movies with the average role scores.
final_movies = pd.merge(movies, movies_one, left_on = 'name', right_on = 'name')
final_movies = final_movies.groupby(['movie_title']).sum() / final_movies.groupby(['movie_title']).count()
final_movies = final_movies.reset_index()
final_movies.rename(columns = {'average_role_score':'casting_score'}, inplace=True)

In [24]:
# Merging in each actor's average score, then averaging those scores per movie
final_movies_actors = pd.merge(movies, actors_role, left_on = 'name', right_on = 'name')
final_movies_actors = final_movies_actors.groupby(['movie_title']).sum() / final_movies_actors.groupby(['movie_title']).count()
final_movies_actors = final_movies_actors.reset_index()

In [25]:
# Merging in each actress's average score, then averaging those scores per movie
final_movies_actresses = pd.merge(movies, actress_role, left_on = 'name', right_on = 'name')
final_movies_actresses = final_movies_actresses.groupby(['movie_title']).sum() / final_movies_actresses.groupby(['movie_title']).count()
final_movies_actresses = final_movies_actresses.reset_index()

In [26]:
# Merging in each director's average score, then averaging those scores per movie
final_movie_directors = pd.merge(movies, directors_role, left_on = 'name', right_on = 'name')
final_movie_directors = final_movie_directors.groupby(['movie_title']).sum() / final_movie_directors.groupby(['movie_title']).count()
final_movie_directors = final_movie_directors.reset_index()

In [27]:
# Setting final_df to movies for easier reference and testing
final_df = movies

In [28]:
# Rename columns
final_df.rename(columns = {'plot_y':'plot',
                           'duration_y':'duration',
                           'actors_y':'cast',
                           'avg_vote': 'imdb_score'}, inplace=True)

# Dropping duplicates
final_df = final_df.drop_duplicates()

In [29]:
# Keeping only movie_title and the score in each of the director, actor, and actress dataframes so I can merge on movie_title
final_movie_directors = final_movie_directors[['movie_title', 'director_score']]
final_movies_actors = final_movies_actors[['movie_title', 'actor_score']]
final_movies_actresses = final_movies_actresses[['movie_title', 'actress_score']]

In [30]:
# I had to set how='outer' because some movies had no actors or no actresses.
# An outer merge keeps those movies, with null values for the missing scores.
final_df = pd.merge(final_df, final_movie_directors, how = 'outer', on = 'movie_title')
final_df = pd.merge(final_df, final_movies_actors, how = 'outer', on = 'movie_title')
final_df = pd.merge(final_df, final_movies_actresses, how = 'outer', on = 'movie_title')
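
The effect of `how='outer'` can be seen on a two-movie toy example (titles and scores invented), where one movie has no actress score. An inner merge drops that movie, while an outer merge keeps it with a null score:

```python
import pandas as pd

films = pd.DataFrame({"movie_title": ["Film A", "Film B"]})
# Film B has no actress score in this made-up example.
actress_scores = pd.DataFrame({"movie_title": ["Film A"], "actress_score": [6.5]})

inner = films.merge(actress_scores, how="inner", on="movie_title")
outer = films.merge(actress_scores, how="outer", on="movie_title")

print(len(inner), len(outer))  # 1 2
print(outer)
```

Since every title in the score tables comes from `movies` itself, `how='left'` should keep the same rows here; I used `outer` to be safe.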

In [31]:
# Found duplicate values. Dropping to ensure data is valid
final_df = final_df.drop_duplicates(subset = ['movie_title'])
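
One thing worth keeping in mind about this step: `drop_duplicates(subset=...)` keeps the first row it finds for each title by default. A sketch on invented rows (the `name` values are hypothetical) shows that any per-person column surviving in the result simply reflects whichever person was listed first:

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_title": ["Film A", "Film A", "Film B"],
    "name": ["Person X", "Person Y", "Person Z"],
    "director_score": [6.0, 6.0, 5.5],
})

# keep='first' is the default: one row per title, taken in existing row order.
deduped = toy.drop_duplicates(subset=["movie_title"])

print(deduped["name"].tolist())  # ['Person X', 'Person Z']
```

The per-movie score columns are identical across a title's rows, so they are unaffected by which row is kept.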

In [32]:
# Only keeping movies from the past 50 years 
final_df = final_df[final_df['year'] >= 1970]

In [33]:
# Keeping only movies with more than 1,000 US voter votes
final_df = final_df[final_df['us_voters_votes'] > 1000]

In [34]:
# Some movies had no actor score or no actress score, so I fill each missing score from the other column.
final_df['actress_score'] = final_df['actress_score'].fillna(final_df['actor_score'])
final_df['actor_score'] = final_df['actor_score'].fillna(final_df['actress_score'])
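
A small sketch of this cross-fill on invented scores (the frame `toy` is hypothetical). Rows missing one score borrow the other; rows missing both stay null and are removed by the later `dropna`:

```python
import pandas as pd

toy = pd.DataFrame({
    "actor_score": [6.0, None, None],
    "actress_score": [None, 7.0, None],
})

# Fill each missing score from the other column, in the same order as the cell above.
toy["actress_score"] = toy["actress_score"].fillna(toy["actor_score"])
toy["actor_score"] = toy["actor_score"].fillna(toy["actress_score"])

print(toy)
```

After the fill, the first row holds 6.0 in both columns, the second holds 7.0 in both, and the third is still null in both.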

In [35]:
# Dropping movies where director score, actors, actor_score, and actress_score were null
final_df.dropna(subset = ['director_score', 'actors', 'actor_score', 'actress_score'], inplace=True)

### Sentiment Analysis of Plot and Tagline

In [36]:
# Sentiment analysis of plot
plot_desc = final_df['plot'].tolist()

analyzer = SentimentIntensityAnalyzer()

# Returns the VADER compound score for each plot description, in order, so it can be added as a column
def get_polarity_plot(plot_desc):
    polarity = []
    for post in plot_desc:
        vs = analyzer.polarity_scores(post)
        polarity.append(vs['compound'])
    return polarity

polarity = get_polarity_plot(plot_desc)

final_df['plot_sentiment'] = polarity
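
For context on the `compound` value: as I understand the VADER reference implementation, it sums the adjusted valence of the words in a text and then squashes that sum into the range [-1, 1]. The sketch below re-implements only that normalization step in plain Python for illustration. The function name `normalize_compound` is hypothetical, and `alpha=15` is the default I believe the reference implementation uses, so treat both as assumptions:

```python
import math

def normalize_compound(valence_sum, alpha=15):
    """Squash a raw valence sum into [-1, 1] (illustrative re-implementation)."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp as a safeguard so the result never leaves [-1, 1].
    return max(-1.0, min(1.0, score))

print(normalize_compound(0.0))             # 0.0
print(round(normalize_compound(3.0), 4))   # 0.6124
print(round(normalize_compound(-3.0), 4))  # -0.6124
```

Under this reading, a plot with no sentiment-bearing words has a valence sum of zero and therefore a `plot_sentiment` of 0.0, while strongly worded plots approach -1 or 1.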

### Ordering of Columns in DataFrame

In [37]:
# Organizing columns of final dataframe before exporting to ensure proper layout and easy readability
final_df = final_df[['movie_title', 'year', 'actors', 'plot', 'duration', 'Action', 'Adventure', 'Animation', 'Biography', 
                     'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 
                     'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller', 'War', 'Western', 'votes', 'weighted_average_vote', 
                     'total_votes', 'mean_vote', 'median_vote', 'votes_1', 'votes_2', 'votes_3', 'votes_4', 'votes_5', 
                     'votes_6', 'votes_7', 'votes_8', 'votes_9', 'votes_10', 'us_voters_rating', 'us_voters_votes',
                     'plot_sentiment', 'director_score', 'actor_score', 'actress_score', 'imdb_score']]

In [38]:
# Ensuring all null values were dropped before exporting
final_df = final_df.dropna()
# Resetting the index before export; drop=True discards the old index instead of adding it as a column
final_df = final_df.reset_index(drop=True)

In [39]:
# Visualizing final total score dataframe before exporting to use in modeling process
final_df

| | movie_title | year | actors | plot | duration | Action | Adventure | Animation | Biography | Comedy | ... | votes_8 | votes_9 | votes_10 | us_voters_rating | us_voters_votes | plot_sentiment | director_score | actor_score | actress_score | imdb_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Whales of August | 1987 | Bette Davis, Lillian Gish, Vincent Price, Ann ... | Two aged sisters reflect on life and the past ... | 90 | 0 | 0 | 0 | 0 | 0 | ... | 894 | 422 | 894 | 7.3 | 1476.0 | 0.0000 | 7.300000 | 6.621154 | 6.674631 | 7.3 |
| 1 | Family Plot | 1976 | Karen Black, Bruce Dern, Barbara Harris, Willi... | A phony psychic/con artist and her taxi driver... | 120 | 0 | 0 | 0 | 0 | 1 | ... | 3515 | 1055 | 1041 | 6.8 | 4670.0 | -0.2960 | 7.492593 | 5.859890 | 5.723913 | 6.8 |
| 2 | Love Story | 1970 | Ali MacGraw, Ryan O'Neal, John Marley, Ray Mil... | A boy and a girl from different backgrounds fa... | 100 | 0 | 0 | 0 | 0 | 0 | ... | 5497 | 2506 | 4013 | 6.6 | 6692.0 | -0.4019 | 6.137037 | 6.249722 | 6.166667 | 6.9 |
| 3 | Frogs | 1972 | Ray Milland, Sam Elliott, Joan Van Ark, Adam R... | A group of helpless victims celebrate a birthd... | 91 | 0 | 0 | 0 | 0 | 0 | ... | 216 | 94 | 357 | 4.5 | 2288.0 | -0.7096 | 5.050000 | 6.111858 | 4.400000 | 4.4 |
| 4 | Escape to Witch Mountain | 1975 | Eddie Albert, Ray Milland, Donald Pleasence, K... | Two mysterious orphan children have extraordin... | 97 | 0 | 1 | 0 | 0 | 0 | ... | 1095 | 425 | 816 | 6.5 | 4033.0 | 0.0000 | 5.720000 | 6.291811 | 5.525000 | 6.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4151 | The Vast of Night | 2019 | Sierra McCormick, Jake Horowitz, Gail Cronauer... | In the twilight of the 1950s, on one fateful n... | 91 | 0 | 0 | 0 | 0 | 0 | ... | 4011 | 1501 | 914 | 6.8 | 3612.0 | -0.2023 | 4.900000 | 6.375000 | 6.500000 | 6.7 |
| 4152 | The Rider | 2017 | Brady Jandreau, Mooney, Tim Jandreau, Lilly Ja... | After suffering a near fatal head injury, a yo... | 104 | 0 | 0 | 0 | 0 | 0 | ... | 3834 | 1969 | 1021 | 7.6 | 2418.0 | -0.8555 | 7.150000 | 7.400000 | 7.400000 | 7.4 |
| 4153 | Fourth Man Out | 2015 | Parker Young, Evan Todd, Chord Overstreet, Jon... | A car mechanic in a small, working class town ... | 86 | 0 | 0 | 0 | 0 | 1 | ... | 1702 | 708 | 843 | 6.9 | 1280.0 | 0.8074 | 6.700000 | 6.225000 | 6.225000 | 6.7 |
| 4154 | The Fits | 2015 | Royalty Hightower, Alexis Neblett, Da'Sean Min... | While training at the gym 11-year-old tomboy T... | 72 | 0 | 0 | 0 | 0 | 0 | ... | 830 | 378 | 241 | 6.8 | 1118.0 | -0.8957 | 6.700000 | 6.700000 | 6.700000 | 6.7 |


### Export Final DataFrame

In [40]:
# Exporting total score dataframe to data folder
final_df.to_csv('../data/totalscore_df.csv', index=False)