# IMDB Movie Cleaning Revenue
![ImdbIcon](../images/imdbheader.jpg)

### Notebook Overview

This notebook builds the revenue dataset used for predicting movie revenue. The cleaning process is similar to the one for my total score dataset, but it also incorporates budgets and revenue. I started by merging my datasets by movie title and cleaned from there. Movies from outside the United States had a large number of null values, so I kept only United States movies released in 1970 or later. I kept only actors, actresses, and directors: when I tried to include writers, producers, and other roles, I lost even more data. For future research, I would love to use all of the roles, since I found they had a significant impact on total score, which I believe correlates directly with revenue. I also imputed budget and revenue values in several places where the data provided was incorrect, in some cases by a wide margin. I ran sentiment analysis on both the plot and the tagline to see whether a movie being upbeat or dark had an impact on predicting revenue. Lastly, I ordered the columns for a final dataframe and exported it to the data folder for use in modeling.

### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.sentiment.vader import SentimentIntensityAnalyzer



### Import Data

In [2]:
movies = pd.read_csv('../data/IMDb_movies.csv')
names = pd.read_csv('../data/IMDb_names.csv')
ratings = pd.read_csv('../data/IMDb_ratings.csv')
titles = pd.read_csv('../data/IMDb_title_principals.csv')
budgets_df = pd.read_csv('../data/tmdb_movies_data.csv')



### Merge Data

In [3]:
# Merging movies with ratings
movies = movies.merge(ratings, left_on = 'imdb_title_id', right_on = 'imdb_title_id')
# Merging movies with titles (title principals)
movies = movies.merge(titles, left_on = 'imdb_title_id', right_on = 'imdb_title_id')
# Merging movies with names (actors/actresses/directors)
movies = movies.merge(names, left_on = 'imdb_name_id', right_on = 'imdb_name_id')
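
Because the title principals table holds one row per credited person, these merges turn each movie into several rows, one per (movie, person) pairing. A minimal sketch of that effect on made-up frames (every ID, title, and name below is invented for illustration and is not taken from the IMDb files):

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are invented.
toy_movies = pd.DataFrame({"imdb_title_id": ["tt1", "tt2"],
                           "original_title": ["Movie A", "Movie B"]})
toy_titles = pd.DataFrame({"imdb_title_id": ["tt1", "tt1", "tt2"],
                           "imdb_name_id": ["nm1", "nm2", "nm1"],
                           "category": ["actor", "director", "actress"]})
toy_names = pd.DataFrame({"imdb_name_id": ["nm1", "nm2"],
                          "name": ["Person One", "Person Two"]})

# Same pattern as above: inner joins on the shared ID columns.
merged = (toy_movies
          .merge(toy_titles, on="imdb_title_id")
          .merge(toy_names, on="imdb_name_id"))

# One row per (movie, person) pairing, so Movie A appears twice.
rows_per_movie = merged.groupby("original_title").size().to_dict()
print(rows_per_movie)
```

This row multiplication is why later cells group by `movie_title` or `name` before computing scores.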

### Dropping Columns

In [4]:
# Dropping columns that add no value to the analysis
movies.drop(columns = ['imdb_title_id', 'title', 'language', 'production_company', 'budget', 'metascore', 'reviews_from_critics', 
                       'usa_gross_income', 'worlwide_gross_income', 'allgenders_0age_avg_vote', 'allgenders_0age_votes', 
                       'allgenders_18age_avg_vote','allgenders_18age_votes', 'males_0age_avg_vote', 'males_0age_votes',
                       'males_18age_avg_vote', 'males_18age_votes', 'females_0age_avg_vote', 'females_0age_votes', 
                       'females_18age_avg_vote', 'females_18age_votes', 'females_30age_avg_vote', 'females_30age_votes', 
                       'females_45age_avg_vote', 'females_45age_votes', 'job', 'characters', 'imdb_name_id', 'birth_name',
                       'reviews_from_users', 'writer', 'allgenders_30age_avg_vote', 'allgenders_30age_votes', 
                       'allgenders_45age_avg_vote', 'allgenders_45age_votes', 'males_allages_avg_vote', 'males_allages_votes', 
                       'males_30age_avg_vote',  'males_30age_votes', 'males_45age_avg_vote', 'males_45age_votes', 
                       'females_allages_avg_vote', 'females_allages_votes', 'top1000_voters_rating', 'top1000_voters_votes', 
                       'date_published', 'ordering', 'director', 'non_us_voters_rating', 'non_us_voters_votes'], inplace=True)

In [5]:
# Keeping only columns I want to use in dataframe in budgets_df
budgets_df = budgets_df[['original_title', 'popularity', 'budget', 'revenue', 'tagline', 'budget_adj', 'revenue_adj']]

In [6]:
# Merging movies with budgets_df
movies = movies.merge(budgets_df, left_on = 'original_title', right_on = 'original_title')
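
Merging on a title string rather than an ID can match different films that share a title, which multiplies rows. A small sketch of how one might check for that before merging, using invented data (the titles and budgets are hypothetical):

```python
import pandas as pd

# Invented data: "Movie A" appears twice in the budgets table (e.g. a remake).
toy_movies = pd.DataFrame({"original_title": ["Movie A", "Movie B"],
                           "year": [1990, 2005]})
toy_budgets = pd.DataFrame({"original_title": ["Movie A", "Movie A", "Movie B"],
                            "budget": [1_000_000, 5_000_000, 2_000_000]})

# Titles that occur more than once on the right-hand side of the merge.
is_repeated = toy_budgets["original_title"].duplicated(keep=False)
dupe_titles = toy_budgets.loc[is_repeated, "original_title"].drop_duplicates().tolist()

# The inner merge pairs the single "Movie A" row with both budget rows.
merged = toy_movies.merge(toy_budgets, on="original_title")
print(dupe_titles, len(merged))
```

Duplicates of this kind are one plausible source of the repeated titles dropped later in the notebook.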

### Cleaning

In [7]:
# Removing rows that were not movies
movies = movies[movies['year'] != 'TV Movie 2019']
# Movies outside the United States had a lot of missing data.
movies = movies[movies['country'] == 'USA']
# Removing both the Reality-TV and News genres
movies = movies[(movies['genre'] != 'Reality-TV') & (movies['genre'] != 'News')]

In [8]:
# A small amount of movies had no description, so I could fill those with Unknown to keep valuable data.
movies['description'] = movies['description'].fillna("Unknown")

In [9]:
# Fewer than twenty rows were missing voters rating, so I felt comfortable imputing the mean.
movies['us_voters_rating'] = movies['us_voters_rating'].fillna(movies['us_voters_rating'].mean())
# Fewer than twenty rows were missing voters votes, so I felt comfortable imputing the mean.
movies['us_voters_votes'] = movies['us_voters_votes'].fillna(movies['us_voters_votes'].mean())
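
As a quick illustration of mean imputation, here is a sketch on a made-up series (the ratings are invented). Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a column selection, is the form that behaves consistently across pandas versions.

```python
import pandas as pd

# Invented ratings with one missing value.
ratings = pd.Series([6.0, None, 8.0, 7.0], name="us_voters_rating")

# mean() skips the missing value: (6 + 8 + 7) / 3 = 7.0
filled = ratings.fillna(ratings.mean())
print(filled.tolist())
```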

In [10]:
# Only looking to work with actors, actresses, and directors.
jobs_list = ['actor', 'actress', 'director']

# Only keeping actors, actresses, and directors
movies = movies[movies['category'].isin(jobs_list)]

In [11]:
# Reset index so ordered
movies.reset_index(drop=True, inplace=True)

In [12]:
# Removing date_of_death and date_of_birth, no longer necessary
movies.drop(columns = ['date_of_death', 'date_of_birth'], inplace=True)

In [13]:
# Converting column year to integer
movies['year'] = movies['year'].astype('int64')

In [14]:
# Renaming original title, category, and description.
movies.rename(columns = {'original_title':'movie_title',
                          'category':'role',
                         'description':'plot'}, inplace=True)

### Dummify Genre

In [15]:
# str.get_dummies(", ") splits on the separator, so one movie can have a 1 in several dummy columns
# For genre, if a movie is Horror AND Action, a 1 is placed in both of those columns
genre_dummies = movies['genre'].str.get_dummies(", ")
# Merging the dummified columns back to the movie dataframe
movies = pd.merge(movies, genre_dummies, left_index =True, right_index=True)
# Dropping genre and country. Only USA movies and genre is now dummified.
movies.drop(columns = ['genre', 'country'], inplace=True)
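
A small sketch of what `str.get_dummies` does with multi-genre strings, using invented genre values:

```python
import pandas as pd

# Invented genre strings in the same "A, B" format as the genre column.
genres = pd.Series(["Action, Horror", "Comedy", "Action, Comedy"])

# Each distinct genre becomes a column; a row gets a 1 for every genre it lists.
dummies = genres.str.get_dummies(", ")
print(dummies)
```

The first row gets a 1 under both `Action` and `Horror`, which `pd.get_dummies` on the raw string would not do, since it would treat `"Action, Horror"` as a single category.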

In [16]:
# Dummify roles (actors, actresses, directors)
movies = pd.get_dummies(movies, columns = ['role'])

In [17]:
# Averaging per movie: the sum of each column divided by the number of rows the movie occupies
# (one row per actor, actress, or director). Note that movies_one is reassigned in the next cell.
movies_one = movies.groupby(['movie_title']).sum() / movies.groupby(['movie_title']).count()
movies_one = movies_one.reset_index()

In [18]:
# Getting the average score for each name in the dataframe, then changing the column to 'average_role_score'
movies_one = movies.groupby(['name']).sum() / movies.groupby(['name']).count()
movies_one = movies_one.reset_index()
movies_one = movies_one.round(3)
movies_one = movies_one[['name', 'weighted_average_vote']]

movies_one.rename(columns = {'weighted_average_vote':'average_role_score'}, inplace=True)
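
For a numeric column, dividing a grouped sum by a grouped count is the same as taking the group mean. A minimal sketch on invented votes:

```python
import pandas as pd

# Invented votes: Person One appears in two movies, Person Two in one.
toy = pd.DataFrame({"name": ["Person One", "Person One", "Person Two"],
                    "weighted_average_vote": [6.0, 8.0, 5.0]})

votes_by_name = toy.groupby("name")["weighted_average_vote"]
ratio = votes_by_name.sum() / votes_by_name.count()   # pattern used in the notebook
mean = votes_by_name.mean()                           # equivalent shortcut

print(ratio.to_dict())
```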

### Scores of Actors, Actresses, and Directors

In [19]:
# Dividing actors total score by number of movies they occur in
actors_role = movies.groupby(['name']).sum() / movies.groupby(['name']).count()
actors_role = actors_role.reset_index()

In [20]:
# Setting actors_role columns to only those I wish to use going forward
actors_role = actors_role[['name', 'movie_title', 'role_actor', 'role_actress', 'role_director', 'weighted_average_vote']]
# Actress_role and directors_role similar to actors role, will filter them differently
actress_role = actors_role
directors_role = actors_role

# Setting dataframes based on the dummified columns
actors_role = actors_role[actors_role['role_actor'] >= 1]
actress_role = actress_role[actress_role['role_actress'] >= 1]
directors_role = directors_role[directors_role['role_director'] >= 1]
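
One consequence of filtering on the averaged dummy columns is worth spelling out. After the sum/count division, `role_actor` holds the share of a person's rows in which they were an actor, so `>= 1` keeps only people credited exclusively in that role. The sketch below uses invented names and is my reading of the cell, not something the notebook states:

```python
import pandas as pd

# Invented credits: Person One both acts and directs, Person Two only acts.
toy = pd.DataFrame({
    "name": ["Person One", "Person One", "Person Two"],
    "role_actor": [1, 0, 1],
    "role_director": [0, 1, 0],
})

# Share of each person's rows spent in each role.
share = toy.groupby("name").sum() / toy.groupby("name").count()

# Person One has role_actor == 0.5, so the >= 1 filter leaves them out.
only_actors = share[share["role_actor"] >= 1].index.tolist()
print(share)
print(only_actors)
```

If people with mixed credits should be included, a threshold of `> 0` would be the corresponding change.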

In [21]:
# Create new dataframe of each individual actor and their average score
actors_role = actors_role[['name', 'weighted_average_vote']]
actors_role.rename(columns = {'weighted_average_vote':'actor_score'}, inplace=True)

# Create new dataframe of each individual actress and their average score
actress_role = actress_role[['name', 'weighted_average_vote']]
actress_role.rename(columns = {'weighted_average_vote':'actress_score'}, inplace=True)

# Create new dataframe of each individual director and their average score
directors_role = directors_role[['name', 'weighted_average_vote']]
directors_role.rename(columns = {'weighted_average_vote':'director_score'}, inplace=True)

In [22]:
# Visualizing each actor and their average score
actors_role

| | name | actor_score |
|---|---|---|
| 0 | 'Ducky' Louie | 6.400000 |
| 1 | 'Weird Al' Yankovic | 7.000000 |
| 2 | 50 Cent | 5.166667 |
| 4 | A. Michael Baldwin | 6.200000 |
| 5 | A.J. Buckley | 4.700000 |
| ... | ... | ... |
| 12432 | Zakes Mokae | 6.500000 |
| 12433 | Zakk Wylde | 6.700000 |
| 12437 | Zane Holtz | 5.000000 |
| 12438 | Zane Pais | 6.000000 |


In [23]:
# Visualizing each actress and their average score
actress_role

| | name | actress_score |
|---|---|---|
| 6 | A.J. Cook | 6.40 |
| 8 | A.J. Langer | 6.40 |
| 10 | AJ Michalka | 6.55 |
| 11 | Aaliyah | 6.10 |
| 34 | Aarti Mann | 6.60 |
| ... | ... | ... |
| 12459 | Zoë Bell | 5.06 |
| 12460 | Zoë Kravitz | 6.75 |
| 12461 | Zoë Lund | 6.80 |
| 12462 | Zulay Henao | 5.90 |


In [24]:
# Visualizing each director and their average score
directors_role

| | name | director_score |
|---|---|---|
| 3 | A. Edward Sutherland | 6.1 |
| 7 | A.J. Kparr | 4.7 |
| 13 | Aaron Blaise | 6.8 |
| 17 | Aaron Hann | 6.0 |
| 18 | Aaron Harvey | 4.6 |
| ... | ... | ... |
| 12430 | Zackary Adler | 5.7 |
| 12434 | Zal Batmanglij | 6.6 |
| 12435 | Zalman King | 4.7 |
| 12455 | Zoltan Korda | 7.5 |


### Merge Role Scores to DataFrame

In [25]:
# Merging movies with the average role scores.
final_movies = pd.merge(movies, movies_one, left_on = 'name', right_on = 'name')
final_movies = final_movies.groupby(['movie_title']).sum() / final_movies.groupby(['movie_title']).count()
final_movies = final_movies.reset_index()
final_movies.rename(columns = {'average_role_score':'casting_score'}, inplace=True)

In [26]:
# Merging average actors scores for each movie by movie
final_movies_actors = pd.merge(movies, actors_role, left_on = 'name', right_on = 'name')
final_movies_actors = final_movies_actors.groupby(['movie_title']).sum() / final_movies_actors.groupby(['movie_title']).count()
final_movies_actors = final_movies_actors.reset_index()

In [27]:
# Merging average actresses scores for each movie by movie
final_movies_actresses = pd.merge(movies, actress_role, left_on = 'name', right_on = 'name')
final_movies_actresses = final_movies_actresses.groupby(['movie_title']).sum() / final_movies_actresses.groupby(['movie_title']).count()
final_movies_actresses = final_movies_actresses.reset_index()

In [28]:
# Merging average directors scores for each movie by movie
final_movie_directors = pd.merge(movies, directors_role, left_on = 'name', right_on = 'name')
final_movie_directors = final_movie_directors.groupby(['movie_title']).sum() / final_movie_directors.groupby(['movie_title']).count()
final_movie_directors = final_movie_directors.reset_index()

In [29]:
# Used to merge cast, duration, and plot back to the dataframe
movies_details = movies[['movie_title', 'actors', 'duration', 'plot', 'tagline', 'popularity', 'budget', 'revenue']]

In [30]:
# Merging final_movies (roles) with movie details
final_df = pd.merge(final_movies, movies_details, left_on = 'movie_title', right_on = 'movie_title')
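
When both frames in a merge share a non-key column, pandas keeps both copies and appends `_x` (left) and `_y` (right) to the names. A minimal sketch with invented data; the empty left-hand `plot` is an assumption about what text columns look like after the numeric groupby, not something the notebook shows:

```python
import pandas as pd

# Invented frames that share the non-key column "plot".
left = pd.DataFrame({"movie_title": ["Movie A"], "plot": [None]})
right = pd.DataFrame({"movie_title": ["Movie A"], "plot": ["A made-up plot."]})

merged = left.merge(right, on="movie_title")
cols = merged.columns.tolist()

# Dropping columns that contain missing values removes the empty left copy.
kept = merged.dropna(axis="columns").columns.tolist()
print(cols, kept)
```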

In [31]:
# Rename columns
final_df.rename(columns = {'plot_y':'plot',
                           'duration_y':'duration',
                           'actors_y':'cast',
                           'avg_vote': 'imdb_score'}, inplace=True)


# Dropping duplicates
final_df = final_df.drop_duplicates()

# Rounding casting_score
final_df['casting_score'] = final_df['casting_score'].round(3)

In [32]:
# Ensuring each directors, actors, and actresses dataframe only contains their score and I can merge on movie_title
final_movie_directors = final_movie_directors[['movie_title', 'director_score']]
final_movies_actors = final_movies_actors[['movie_title', 'actor_score']]
final_movies_actresses = final_movies_actresses[['movie_title', 'actress_score']]

# Merging director, actors, and actress scores
final_df = pd.merge(final_df, final_movie_directors, how = 'outer', on = 'movie_title')
final_df = pd.merge(final_df, final_movies_actors, how = 'outer', on = 'movie_title')
final_df = pd.merge(final_df, final_movies_actresses, how = 'outer', on = 'movie_title')

In [33]:
# If a movie has no actress score, fall back to its actor score, and vice versa
final_df['actress_score'] = final_df['actress_score'].fillna(final_df['actor_score'])
final_df['actor_score'] = final_df['actor_score'].fillna(final_df['actress_score'])
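
Passing a Series to `fillna` fills each missing value from the same row of that Series, so the two score columns act as fallbacks for each other. A sketch with invented scores:

```python
import pandas as pd

# Invented scores: row 0 lacks an actress score, row 1 lacks an actor score.
scores = pd.DataFrame({"actor_score": [6.5, None, 7.0],
                       "actress_score": [None, 6.0, 5.5]})

# Row-aligned fallback in both directions.
scores["actress_score"] = scores["actress_score"].fillna(scores["actor_score"])
scores["actor_score"] = scores["actor_score"].fillna(scores["actress_score"])
print(scores)
```

A row where both scores are missing stays missing after both fills.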

In [34]:
# Dropping movies that are still missing a director score, cast, or actor/actress score
final_df.dropna(subset = ['director_score', 'cast', 'actor_score', 'actress_score'], inplace=True)

In [35]:
# Filling tagline where missing with Unknown
final_df['tagline_y'] = final_df['tagline_y'].fillna("Unknown")

In [36]:
# Found duplicate values. Dropping to ensure data is valid
final_df = final_df.drop_duplicates(subset = ['movie_title'])

In [37]:
# Converting year to integer (whole number) from float (decimal)
final_df['year'] = final_df['year'].astype('int64')

In [38]:
# Renaming duplicate columns that were merged
final_df.rename(columns = {'tagline_y':'tagline',
                           'popularity_y':'popularity',
                           'budget_y': 'budget',
                          'revenue_y': 'revenue'}, inplace=True)

In [39]:
# Removing movies where either budget or revenue is 0
final_df = final_df[(final_df['budget'] != 0) & (final_df['revenue'] != 0)]
# Only looking at movies in the past 50 years
final_df = final_df[final_df['year'] >= 1970]
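
The budget and revenue filter keeps a row only when both values are non-zero, so a zero in either column removes the movie. A sketch on invented numbers:

```python
import pandas as pd

# Invented values covering all four zero / non-zero combinations.
toy = pd.DataFrame({"budget":  [0, 100, 100, 0],
                    "revenue": [0, 0, 200, 300]})

# Same condition as the notebook: both columns must be non-zero.
kept = toy[(toy["budget"] != 0) & (toy["revenue"] != 0)]
print(kept.index.tolist())
```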

In [40]:
# Only keeping movies with revenue above ten thousand
final_df = final_df[final_df['revenue'] > 10000]

In [41]:
final_df

| | movie_title | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | ... | cast | duration | plot | tagline | popularity | budget | revenue | director_score | actor_score | actress_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (500) Days of Summer | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | Joseph Gordon-Levitt, Zooey Deschanel, Geoffre... | 95 | An offbeat romantic comedy about a woman who d... | It was almost like falling in love. | 3.244139 | 7500000 | 60722734 | 7.066667 | 6.745588 | 6.308081 |
| 5 | 10 Things I Hate About You | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | Heath Ledger, Julia Stiles, Joseph Gordon-Levi... | 97 | A pretty, popular teenager can't go out on a d... | How do I loathe thee? Let me count the ways. | 1.769152 | 16000000 | 53478166 | 6.100000 | 6.820588 | 6.319444 |
| 10 | 10th & Wolf | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | James Marsden, Giovanni Ribisi, Brad Renfro, P... | 107 | A former street tough returns to his Philadelp... | The Intersection Where Family, Honor and Betra... | 0.384988 | 8000000 | 143451 | 6.300000 | 6.258333 | 5.975000 |
| 12 | 12 Rounds | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | John Cena, Aidan Gillen, Ashley Scott, Steve H... | 108 | Detective Danny Fisher discovers his girlfrien... | Survive all 12 | 0.826039 | 20000000 | 17280326 | 6.112500 | 5.552778 | 6.025000 |
| 18 | 1408 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | John Cusack, Tony Shalhoub, Len Cariou, Isiah ... | 104 | A man who specialises in debunking paranormal ... | The only demons in room 1408 are those within ... | 0.917818 | 25000000 | 94679598 | 6.750000 | 6.615192 | 6.080000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5927 | Zombieland | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | Jesse Eisenberg, Woody Harrelson, Emma Stone, ... | 88 | A shy student trying to reach his family in Oh... | This place is so dead | 2.041804 | 23600000 | 102391382 | 6.800000 | 6.550362 | 6.664848 |
| 5928 | Zookeeper | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | Kevin James, Rosario Dawson, Leslie Bibb, Ken ... | 102 | A group of zoo animals decide to break their c... | Welcome to his jungle. | 1.643140 | 80000000 | 169852759 | 6.028571 | 5.638636 | 6.114706 |
| 5929 | Zoom | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Tim Allen, Courteney Cox, Chevy Chase, Spencer... | 93 | Former superhero Jack is called back to work t... | They're going to save the world... as long as ... | 0.529881 | 35000000 | 12506188 | 5.250000 | 5.808333 | 5.808333 |
| 5932 | xXx | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Vin Diesel, Asia Argento, Marton Csokas, Samue... | 124 | An extreme sports athlete, Xander Cage, is rec... | A New Breed Of Secret Agent. | 1.936728 | 70000000 | 277448382 | 5.850000 | 6.578363 | 5.600000 |


In [42]:
# Dropping any column that still contains missing values
final_df.dropna(axis = 'columns', inplace=True)

In [43]:
# Adjusting budget_adj and revenue_adj to integers instead of floats
final_df['budget_adj'] = final_df['budget_adj'].astype(int)
final_df['revenue_adj'] = final_df['revenue_adj'].astype(int)

In [44]:
# Creating net_profit so that I can create a profitable column
final_df['net_profit'] = final_df['revenue'] - final_df['budget']
final_df['profitable'] = [1 if x >= 0 else 0 for x in final_df['net_profit']]
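
The profitability flag can also be written as a vectorized comparison instead of a list comprehension. A sketch on invented figures, offered as an equivalent alternative rather than a required change:

```python
import pandas as pd

# Invented budgets and revenues: a profit, a break-even, and a loss.
toy = pd.DataFrame({"budget": [100, 200, 300],
                    "revenue": [150, 200, 50]})

toy["net_profit"] = toy["revenue"] - toy["budget"]
# Boolean comparison cast to 0/1; breaking even counts as profitable, as above.
toy["profitable"] = (toy["net_profit"] >= 0).astype(int)
print(toy)
```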

In [45]:
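# Only keeping movies with a budget above ten thousand.
# .copy() makes final_df an independent dataframe, so adding columns later does not raise SettingWithCopyWarning.
final_df = final_df[final_df['budget'] > 10000].copy()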

### Sentiment Analysis of Plot and Tagline

In [46]:
# Sentiment analysis of plot
plot_desc = final_df['plot'].tolist()

analyzer = SentimentIntensityAnalyzer()

# Function to return compound score and append column by each movie
def get_polarity_plot(plot_desc):
    polarity = []
    for post in plot_desc:
        vs = analyzer.polarity_scores(post)
        polarity.append(vs['compound'])
    return polarity

polarity = get_polarity_plot(plot_desc)

final_df['plot_sentiment'] = polarity



In [47]:
# Sentiment analysis of tagline
tagline_desc = final_df['tagline'].tolist()

analyzer_two = SentimentIntensityAnalyzer()

# Function to return compound score and append column by each movie
def get_polarity_tag(tagline_desc):
    polarity_tag = []
    for tag in tagline_desc:
        vs = analyzer_two.polarity_scores(tag)
        polarity_tag.append(vs['compound'])
    return polarity_tag

polarity_tag = get_polarity_tag(tagline_desc)

final_df['tagline_sentiment'] = polarity_tag
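
The plot and tagline helpers follow the same pattern, so they could share one function that takes the scorer as an argument. The sketch below is one possible refactor, not the notebook's code: `compound_scores` and `toy_scorer` are hypothetical names, and `toy_scorer` is a stand-in so the example runs without the VADER lexicon. In the notebook, the scorer would be `analyzer.polarity_scores`.

```python
from typing import Callable, Dict, Iterable, List


def compound_scores(texts: Iterable[str],
                    scorer: Callable[[str], Dict[str, float]]) -> List[float]:
    """Return the 'compound' value of scorer(text) for every text."""
    return [scorer(text)["compound"] for text in texts]


# Hypothetical stand-in scorer (NOT VADER): 1.0 if the text mentions "love".
def toy_scorer(text: str) -> Dict[str, float]:
    return {"compound": 1.0 if "love" in text.lower() else 0.0}


scores = compound_scores(
    ["It was almost like falling in love.", "Survive all 12"],
    toy_scorer,
)
print(scores)
```

With the real analyzer, the same call would look like `compound_scores(final_df['plot'], analyzer.polarity_scores)`, and one `SentimentIntensityAnalyzer` instance could serve both columns.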



### Ordering of Columns in DataFrame

In [48]:
# Organizing columns
final_df = final_df[['movie_title', 'year', 'tagline', 'plot', 'cast', 'duration', 'Action',
                     'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Drama', 'Family',
                     'Fantasy', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'Romance', 
                     'Sci-Fi', 'Sport', 'Thriller', 'War', 'Western',  'total_votes', 
                     'us_voters_votes', 'votes', 'votes_1', 'votes_2', 'votes_3', 'votes_4', 
                     'votes_5', 'votes_6', 'votes_7', 'votes_8', 'votes_9', 'votes_10', 'popularity', 
                     'director_score', 'actor_score', 'actress_score','tagline_sentiment', 'plot_sentiment', 
                     'imdb_score', 'profitable', 'budget', 'revenue', 'budget_adj', 'revenue_adj',]]

In [49]:
# Drop any rows that still contain missing values (dropna returns a new dataframe, so assign it back)
final_df = final_df.dropna()
# Reset index for ordering purposes; drop=True discards the old index instead of adding an index column
final_df = final_df.reset_index(drop=True)
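
`DataFrame.dropna()` returns a new dataframe rather than modifying the original, so its result only takes effect when it is assigned. Likewise, `reset_index(drop=True)` renumbers the rows without creating an `index` column. A sketch on an invented frame:

```python
import pandas as pd

# Invented frame with one missing value and a non-contiguous index.
toy = pd.DataFrame({"a": [1.0, None, 3.0]}, index=[5, 7, 9])

toy.dropna()                 # result discarded, toy is unchanged
unchanged_len = len(toy)

# Assign the result, then renumber rows from 0 without keeping the old index.
cleaned = toy.dropna().reset_index(drop=True)
print(unchanged_len, cleaned.index.tolist())
```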

### Export Final DataFrame

In [50]:
# Export final revenue dataframe to use for modeling
final_df.to_csv('../data/revenue_df.csv', index=False)