<h1><center>DS 3000 - Fall 2021</center></h1>
<h1><center>DS Report</center></h1>

<h1><center>Predicting Movie Box Office Collections Worldwide From the Top 200 Grossing Movies per Year</center></h1>
<h3><center>Nicholas Gjuraj, Dhivas Sugumar, Avinash Makkena</center></h3>

### **Executive Summary**: 
 
The film industry is one of the largest and most profitable industries in the world; however, it is known for its volatility, and there is no sure-fire way to predict how a film will perform at the box office. One can make general predictions from the cast, crew, and distributor attached to a film, such as expecting a Disney film to perform well at the box office. In this project, we wanted to create a more accurate prediction system that uses data from past films to predict how well a film will perform at the worldwide box office.
 
To tackle this goal, we first needed data to train our prediction model. We obtained this data by scraping the Box Office Mojo and The Numbers sites for film information. From this data, we treated fields such as budget, distributor, and opening earnings as features, with worldwide collections as our target variable. Once we had this data, we tested multiple prediction models and optimized them through feature selection and tuning. We found that Linear Regression, K Neighbors Regressor, Lasso, and Ridge performed consistently and accurately in estimating the overall worldwide box office collection of a film.

## **Outline**
1. [<u>INTRODUCTION</u>](attachment:./#**1.-INTRODUCTION**)
2. [<u>METHOD</u>](attachment:./#**2.-METHOD**)
3. [<u>RESULTS</u>](attachment:./#**3.-RESULTS**)
4. [<u>DISCUSSION</u>](attachment:./#**4.-DISCUSSION**)

# **1. INTRODUCTION**
**Problem Statement**

How are movie box office collections faring at theaters in a pandemic world with
streaming platforms on the rise? With movie theaters and films back, we try to develop a
predictive model that estimates box office collections for a film based on past film
collections.

**Significance of the Problem**

The film industry, specifically the movie theater box office, has been hit hard by the
pandemic at a time when theater attendance was already declining due to the rise of streaming
services. We find it important to inspect this problem and properly understand the box office
performance of films in theaters, especially since so many jobs and livelihoods depend
upon the film industry and how films perform at the box office. From this project, we hope to gain insights into how films are performing today, given current circumstances, and produce an accurate model of how much a given film can earn at the box office.

**Questions**

Given the aforementioned problem and its importance, we set out to tackle the following questions:
- How much does the distributor attached to a film correlate with the overall worldwide box office performance? 
- What are the most important features that determine how much a film makes given the dataset we scraped? 
- Does the release date of a film in our dataset affect how much it makes at the box office? 
- Between Multiple Linear Regression, K Neighbors Regressor, Lasso, Ridge, and Support Vector Machine, which model best suits our problem? 
- How accurately can we actually predict the box office performance of a film? We know that we have a lot of data (over 6000 samples) to build a predictive model from, but can we produce consistent and precise estimates?




# **2. METHOD**

## **2.1 Data Acquisition**
- We obtained our data from Box Office Mojo's worldwide box office collections: https://www.boxofficemojo.com/year/world/. 
    - Contains film box office performance for the top 200 grossing films from 1990 to the present time. The dataset also includes production details for each film. 
- We also used The Numbers site to obtain movie budgets: https://www.the-numbers.com/movie/budgets/all. 
    - Contains the budgets for 6000+ movies over time. It also contains the release date for the movie, the gross worldwide, and gross domestic. Only the name of the movie and the budget were extracted from this website for our dataset. 

<u>Scraping Code:</u>

Link to Scraping Code for BoxOfficeMojo: https://github.com/gjuraj/ds3000-theaters/blob/master/fpwebscrape.ipynb

Link to Scraping Code for Numbers site: https://github.com/gjuraj/ds3000-theaters/blob/1ef3564bf717cebca6baa905397f6dffb8ff1a31/Dataset%20Budget.ipynb


<u>Scraped Data:</u> 

Link to Scraped Data from BoxOfficeMojo: https://github.com/gjuraj/ds3000-theaters/blob/8d96d638d9565dec027eba25c3e8f6cbf8cb32bd/bud_in_df.csv?raw=true

Link to Scraped Data from Numbers site: https://github.com/gjuraj/ds3000-theaters/blob/8d96d638d9565dec027eba25c3e8f6cbf8cb32bd/budget_movies-1.csv?raw=true

Initially, our dataset has 6,400 rows, one per film. We have 13 variables that we collect and use in our models. The variables are described in detail below. 

**Variables**

Feature Variables: 
- Rank: Rank per year (1-200) by gross earnings.
- Title: Title of the movie. 
- Domestic: Domestic Earnings.
- % from Domestic: % of total earnings from domestic. 
- Foreign: Foreign earnings. 
- % from Foreign: % of total earnings from foreign. 
- Release Date: Release date of the film. 
- Opening Earnings: Money made on the opening day of the film. 
- Domestic Distributor: The domestic distributor of the film. 
- Running Time: The run time of the film. 
- Genres: The genre of the film. 
- MPAA: MPAA rating of the film. 
- Budget: The budget of the movie. 
- Budget per minute of runtime: The budget per minute of runtime (A calculated feature)

Outcome Variables:
- Worldwide: Worldwide earnings (gross)


Our dataset contains a wide variety of features, ranging from monetary variables such as opening earnings to categorical variables such as genre and domestic distributor. We believe this gives our model a wide range of variables to explore when predicting the worldwide gross of a past movie or, more excitingly, that of a movie yet to be released. 



Here we read the BoxOfficeMojo csv from the GitHub link that holds the BoxOfficeMojo dataset. This csv is stored as a pandas dataframe under the name df_first_full. This dataset describes the title of the movie, the worldwide gross, the domestic gross, the release date, the opening earnings, the domestic distributor, the running time, the genres, the MPAA rating, and the budget. 

In [1]:
import pandas as pd
df_first_full = pd.read_csv("https://github.com/gjuraj/ds3000-theaters/blob/8d96d638d9565dec027eba25c3e8f6cbf8cb32bd/bud_in_df.csv?raw=true")
df_first_full.head()

Unnamed: 0.1,Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget
0,0,1,Home Alone,"$285,761,244","$285,761,243",100%,$1,<0.1%,"Nov 16, 1990","$17,081,997",Twentieth Century Fox,1 hr 43 min,Comedy Family,-,"$18,000,000"
1,1,2,Ghost,"$217,631,306","$217,631,306",100%,-,-,"Jul 13, 1990","$12,191,540",Paramount Pictures,2 hr 7 min,Drama Fantasy Romance Thriller,-,"$22,000,000"
2,2,3,Dances with Wolves,"$184,208,848","$184,208,848",100%,-,-,"Nov 9, 1990","$598,257",Orion Pictures,3 hr 1 min,Adventure Drama Western,-,"$22,000,000"
3,3,4,Pretty Woman,"$178,406,268","$178,406,268",100%,-,-,"Mar 23, 1990","$11,280,591",Walt Disney Studios Motion Pictures,1 hr 59 min,Comedy Romance,R,"$14,000,000"
4,4,5,Teenage Mutant Ninja Turtles,"$135,265,915","$135,265,915",100%,-,-,"Mar 30, 1990","$25,398,367",New Line Cinema,1 hr 33 min,Action Adventure Comedy Family Sci-Fi,-,"$13,500,000"


Here we read the budgets csv from the github link that holds The Numbers Budget dataset. This csv is stored as a pandas dataframe under the name df_budgets. This dataset holds the movie name, and the budget. 

In [2]:
df_budgets = pd.read_csv("https://github.com/gjuraj/ds3000-theaters/blob/8d96d638d9565dec027eba25c3e8f6cbf8cb32bd/budget_movies-1.csv?raw=true")
df_budgets.head()

Unnamed: 0.1,Unnamed: 0,Movie name,budget
0,0,Avengers: Endgame,"$400,000,000"
1,1,Pirates of the Caribbean: On Stranger Tides,"$379,000,000"
2,2,Avengers: Age of Ultron,"$365,000,000"
3,3,Star Wars Ep. VII: The Force Awakens,"$306,000,000"
4,4,Avengers: Infinity War,"$300,000,000"


## **2.2. Data Analysis**


**Predictive Model**

We are going to predict the worldwide gross of a movie from information such as its opening earnings, genre, budget, running time, and budget per minute of runtime. In other words, we should be able to predict the worldwide gross of a movie from information that is known before release or, at the very least, on the first day of release.

We chose the features listed above as important predictors because they are all values or properties a film has going into a full release. Strong opening earnings could indicate how well a film will perform in the long term, while weak opening earnings can indicate a lack of audience interest and much lower overall box office collections. Features such as genre and running time can determine the size of the audience a film can cater to, or in other words the number of people who would go to watch it. Budget and running time are both important indicators of the scale of the film. A high budget and a long running time could equate to a grander scale of cinema, and therefore either attract or discourage people from watching the film. 

**A Supervised Learning Problem**

This is a supervised learning problem, specifically a regression problem. We attempt to predict the worldwide gross of a recently or previously released movie, which is a continuous output variable. More specifically, since we use multiple variables to predict one continuous variable, this is a multiple regression problem. 

**Machine Learning Algorithms**
We are going to use five regression machine learning algorithms: 

1. Multiple Linear Regression: a linear model fit by minimizing the residual sum of squares between the observed targets and the targets predicted by the linear approximation. 

2. Ridge: a linear least-squares regression model that adds L2-norm regularization to its loss function. 

3. Lasso: a linear model that is trained with an L1 prior as regularizer. 

4. K-Neighbors Regressor: a model that predicts a target from the targets of the k nearest neighbors in the training set.

5. Support Vector Regressor: a model that extends support vector machines to regression and depends only on a subset of the training data to make predictions. 

We chose the above models because we wanted to use regression to solve our prediction problem. These five models are major regression models in the scikit-learn library that seemed to fit our goal. We did not expect any particular model to outperform the others; instead, we wanted to test all of them to observe which ones are best for our specific data and problem. 
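As a rough illustration of how these five models can be compared side by side, the sketch below cross-validates each one with scikit-learn's default hyperparameters. The data is synthetic (an assumption made so the example runs on its own), and the scaling step and 5-fold setup are illustrative choices rather than the configuration used in this report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the scraped films):
# three numeric features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Default hyperparameters for every model.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "K Neighbors Regressor": KNeighborsRegressor(),
    "Support Vector Regressor": SVR(),
}

scores = {}
for name, model in models.items():
    # Standardize features so distance-based and regularized models
    # see comparable scales.
    pipeline = make_pipeline(StandardScaler(), model)
    # Mean R^2 over 5 cross-validation folds
    scores[name] = cross_val_score(pipeline, X, y, cv=5, scoring="r2").mean()

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

On real film data the ranking of the models may look quite different, since the synthetic target here is linear by construction.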



# **3. RESULTS**

## **3.1 Data Wrangling**

Below, we drop the "Unnamed: 0" column, which is an unnecessary column in df_budgets. 

We also rename "Movie name" to "Title" and "budget" to "Budget". We did this to give the columns more descriptive names and consistent capitalization in df_budgets. This also matches the column names to those of the Box Office Mojo dataset.  

In [3]:
df_budgets = df_budgets.drop(columns=['Unnamed: 0'])
df_budgets = df_budgets.rename(columns={'Movie name':'Title', 'budget':'Budget'})
df_budgets

Unnamed: 0,Title,Budget
0,Avengers: Endgame,"$400,000,000"
1,Pirates of the Caribbean: On Stranger Tides,"$379,000,000"
2,Avengers: Age of Ultron,"$365,000,000"
3,Star Wars Ep. VII: The Force Awakens,"$306,000,000"
4,Avengers: Infinity War,"$300,000,000"
...,...,...
6095,Sing,"$84,000"
6096,The Foot Fist Way,"$79,000"
6097,Dawn of the Crescent Moon,"$75,000"
6098,Queen Crab,"$75,000"


Below, we are dropping the "Unnamed: 0" column which is an unnecessary column in df_first_full.

In [4]:
df_first_full = df_first_full.drop(columns=['Unnamed: 0'])
df_first_full

Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget
0,1,Home Alone,"$285,761,244","$285,761,243",100%,$1,<0.1%,"Nov 16, 1990","$17,081,997",Twentieth Century Fox,1 hr 43 min,Comedy Family,-,"$18,000,000"
1,2,Ghost,"$217,631,306","$217,631,306",100%,-,-,"Jul 13, 1990","$12,191,540",Paramount Pictures,2 hr 7 min,Drama Fantasy Romance Thriller,-,"$22,000,000"
2,3,Dances with Wolves,"$184,208,848","$184,208,848",100%,-,-,"Nov 9, 1990","$598,257",Orion Pictures,3 hr 1 min,Adventure Drama Western,-,"$22,000,000"
3,4,Pretty Woman,"$178,406,268","$178,406,268",100%,-,-,"Mar 23, 1990","$11,280,591",Walt Disney Studios Motion Pictures,1 hr 59 min,Comedy Romance,R,"$14,000,000"
4,5,Teenage Mutant Ninja Turtles,"$135,265,915","$135,265,915",100%,-,-,"Mar 30, 1990","$25,398,367",New Line Cinema,1 hr 33 min,Action Adventure Comedy Family Sci-Fi,-,"$13,500,000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6395,196,Harry Potter and the Sorcerer's Stone 2021 Re-...,"$5,566,768",-,-,"$5,566,768",100%,"Nov 12, 2021","$582,125",Warner Bros.,2 hr 32 min,Adventure Family Fantasy,PG,"$125,000,000"
6396,197,Miracle,"$5,553,998",-,-,"$5,553,998",100%,"Sep 15, 2021",-,-,1 hr 58 min,Drama,-,-
6397,198,Octonauts: The Ring of Fire,"$5,512,875",-,-,"$5,512,875",100%,"Apr 28, 2021","$73,988",-,1 hr 12 min,Animation,-,-
6398,199,Hell's Garden,"$5,450,978",-,-,"$5,450,978",100%,"May 21, 2021","$1,096,206",-,1 hr 42 min,Action Comedy,-,-


Here df_first_full is assigned to most_malleable and df_budgets is assigned to budgets_malleable, giving the dataframes more recognizable names for data wrangling. Plain assignment does not copy a dataframe, so each new name refers to the same object as the old one. 

In [5]:
most_malleable = df_first_full
budgets_malleable = df_budgets

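Here the function rr is used to normalize movie titles so that the same film has the same title in both datasets. It removes every character that is not a letter or digit, lowercases the result, and stores the original title in the dictionary saved under the normalized title.
rr is first mapped over the main BoxOfficeMojo dataset using .map(), which stores those titles in saved. It is then mapped over The Numbers dataset, normalizing its titles the same way so that the two dataframes can be merged on Title.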

In [6]:
saved = {}
def rr(x):
    name = ''.join(filter(str.isalnum, x)).lower()
    saved[name] = x
    return name

most_malleable['Title'] = most_malleable['Title'].map(rr)
budgets_malleable['Title'] = budgets_malleable['Title'].map(rr)
print(most_malleable)
print(budgets_malleable)

      Rank                                         Title     Worldwide  \
0        1                                     homealone  $285,761,244   
1        2                                         ghost  $217,631,306   
2        3                              danceswithwolves  $184,208,848   
3        4                                   prettywoman  $178,406,268   
4        5                     teenagemutantninjaturtles  $135,265,915   
...    ...                                           ...           ...   
6395   196  harrypotterandthesorcerersstone2021rerelease    $5,566,768   
6396   197                                       miracle    $5,553,998   
6397   198                        octonautstheringoffire    $5,512,875   
6398   199                                   hellsgarden    $5,450,978   
6399   200                                  sooryavanshi    $5,373,730   

          Domestic % from Domestic     Foreign % from Foreign  Release Date  \
0     $285,761,243            10
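To make the normalization rule concrete, here is a minimal standalone sketch. The helper name `normalize_title` is hypothetical; it applies the same rule as rr but does not write to the saved dictionary. The second half illustrates a caveat of title-only keys: two different films with the same name collapse to a single key (the release years are illustrative labels, not columns used in the merge).

```python
def normalize_title(title):
    """Keep only letters and digits, then lowercase (same rule as rr)."""
    return "".join(filter(str.isalnum, title)).lower()

# Titles that differ only in punctuation, spacing, or case collapse to one key.
print(normalize_title("Hell's Garden"))   # hellsgarden
print(normalize_title("HELLS GARDEN"))    # hellsgarden

# Caveat: different films that share a title also collapse to one key.
films = [("Teenage Mutant Ninja Turtles", 1990),
         ("Teenage Mutant Ninja Turtles", 2014)]
keys = {normalize_title(name) for name, _ in films}
print(len(keys))  # 1
```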

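Here the two dataframes most_malleable and budgets_malleable are merged on the Title column with a left join. Every film from Box Office Mojo is kept, and films whose normalized title also appears in The Numbers dataset gain that budget. The result has two budget columns: Budget_x from Box Office Mojo and Budget_y from The Numbers. Because some normalized titles appear more than once in The Numbers dataset, the left join repeats those films, so the merged dataframe has more rows (index up to 6517) than the original 6,400. Matching on title alone can also pair a film with the budget of a different film that has the same name. 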

In [7]:
sample_true = pd.merge(most_malleable, budgets_malleable, on='Title', how='left')
sample_true

Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget_x,Budget_y
0,1,homealone,"$285,761,244","$285,761,243",100%,$1,<0.1%,"Nov 16, 1990","$17,081,997",Twentieth Century Fox,1 hr 43 min,Comedy Family,-,"$18,000,000","$15,000,000"
1,2,ghost,"$217,631,306","$217,631,306",100%,-,-,"Jul 13, 1990","$12,191,540",Paramount Pictures,2 hr 7 min,Drama Fantasy Romance Thriller,-,"$22,000,000","$22,000,000"
2,3,danceswithwolves,"$184,208,848","$184,208,848",100%,-,-,"Nov 9, 1990","$598,257",Orion Pictures,3 hr 1 min,Adventure Drama Western,-,"$22,000,000","$19,000,000"
3,4,prettywoman,"$178,406,268","$178,406,268",100%,-,-,"Mar 23, 1990","$11,280,591",Walt Disney Studios Motion Pictures,1 hr 59 min,Comedy Romance,R,"$14,000,000","$14,000,000"
4,5,teenagemutantninjaturtles,"$135,265,915","$135,265,915",100%,-,-,"Mar 30, 1990","$25,398,367",New Line Cinema,1 hr 33 min,Action Adventure Comedy Family Sci-Fi,-,"$13,500,000","$125,000,000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6514,196,harrypotterandthesorcerersstone2021rerelease,"$5,566,768",-,-,"$5,566,768",100%,"Nov 12, 2021","$582,125",Warner Bros.,2 hr 32 min,Adventure Family Fantasy,PG,"$125,000,000",
6515,197,miracle,"$5,553,998",-,-,"$5,553,998",100%,"Sep 15, 2021",-,-,1 hr 58 min,Drama,-,-,"$28,000,000"
6516,198,octonautstheringoffire,"$5,512,875",-,-,"$5,512,875",100%,"Apr 28, 2021","$73,988",-,1 hr 12 min,Animation,-,-,
6517,199,hellsgarden,"$5,450,978",-,-,"$5,450,978",100%,"May 21, 2021","$1,096,206",-,1 hr 42 min,Action Comedy,-,-,
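The row growth caused by repeated titles can be reproduced on a toy example. The sketch below uses made-up frames (the second "miracle" budget is an invented toy value) and shows one possible guard, dropping duplicate titles before merging; this is an illustration, not a step taken in this report.

```python
import pandas as pd

# Toy frames standing in for most_malleable and budgets_malleable
films = pd.DataFrame({"Title": ["homealone", "ghost", "miracle"],
                      "Worldwide": [285761244, 217631306, 5553998]})
budgets = pd.DataFrame({"Title": ["homealone", "miracle", "miracle"],
                        "Budget": [15000000, 28000000, 9000000]})

# A left join keeps every film, but repeats a film once per matching budget row
merged = pd.merge(films, budgets, on="Title", how="left", indicator=True)
print(len(films), len(merged))          # 3 4
print(merged["_merge"].value_counts())  # how many rows found a budget

# One possible guard: keep a single budget row per title before merging.
# validate="many_to_one" raises if the right-hand titles are not unique.
deduped = budgets.drop_duplicates(subset="Title", keep="first")
merged_once = pd.merge(films, deduped, on="Title", how="left",
                       validate="many_to_one")
print(len(merged_once))                 # 3
```

Keeping the first duplicate is an arbitrary choice; matching on title plus release year would be a more careful alternative when the year is available in both sources.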


Here we replace pandas NaN values with Python's None. This makes it easier to work with null values in the dataframe. 

In [8]:
sample_true = sample_true.where(pd.notnull(sample_true), None)
sample_true

Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget_x,Budget_y
0,1,homealone,"$285,761,244","$285,761,243",100%,$1,<0.1%,"Nov 16, 1990","$17,081,997",Twentieth Century Fox,1 hr 43 min,Comedy Family,-,"$18,000,000","$15,000,000"
1,2,ghost,"$217,631,306","$217,631,306",100%,-,-,"Jul 13, 1990","$12,191,540",Paramount Pictures,2 hr 7 min,Drama Fantasy Romance Thriller,-,"$22,000,000","$22,000,000"
2,3,danceswithwolves,"$184,208,848","$184,208,848",100%,-,-,"Nov 9, 1990","$598,257",Orion Pictures,3 hr 1 min,Adventure Drama Western,-,"$22,000,000","$19,000,000"
3,4,prettywoman,"$178,406,268","$178,406,268",100%,-,-,"Mar 23, 1990","$11,280,591",Walt Disney Studios Motion Pictures,1 hr 59 min,Comedy Romance,R,"$14,000,000","$14,000,000"
4,5,teenagemutantninjaturtles,"$135,265,915","$135,265,915",100%,-,-,"Mar 30, 1990","$25,398,367",New Line Cinema,1 hr 33 min,Action Adventure Comedy Family Sci-Fi,-,"$13,500,000","$125,000,000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6514,196,harrypotterandthesorcerersstone2021rerelease,"$5,566,768",-,-,"$5,566,768",100%,"Nov 12, 2021","$582,125",Warner Bros.,2 hr 32 min,Adventure Family Fantasy,PG,"$125,000,000",
6515,197,miracle,"$5,553,998",-,-,"$5,553,998",100%,"Sep 15, 2021",-,-,1 hr 58 min,Drama,-,-,"$28,000,000"
6516,198,octonautstheringoffire,"$5,512,875",-,-,"$5,512,875",100%,"Apr 28, 2021","$73,988",-,1 hr 12 min,Animation,-,-,
6517,199,hellsgarden,"$5,450,978",-,-,"$5,450,978",100%,"May 21, 2021","$1,096,206",-,1 hr 42 min,Action Comedy,-,-,


The below cell produces a single budget from the two budget columns. Budget_x is from BoxOfficeMojo, while Budget_y is from The Numbers. We go through the zipped Budget_x and Budget_y columns: if y is not None, we append the value from Budget_y to the list budgets. If y is None and x is not equal to "-", we append the value from Budget_x. If neither source has a budget, the None from Budget_y is appended. The Budget_x and Budget_y columns are then dropped from the combined dataframe, and a new column "Budget" is created in sample_true from the list budgets. This merges the budgets from the two datasets, with precedence given to The Numbers.

In [9]:
budgets = []
for x, y in zip(sample_true['Budget_x'],sample_true['Budget_y']):
    if y is not None:
        budgets.append(y)
    elif y is None and x != "-":
        budgets.append(x)
    else:
        budgets.append(y)
sample_true = sample_true.drop(columns=['Budget_x','Budget_y'])
sample_true['Budget'] = budgets
sample_true

Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget
0,1,homealone,"$285,761,244","$285,761,243",100%,$1,<0.1%,"Nov 16, 1990","$17,081,997",Twentieth Century Fox,1 hr 43 min,Comedy Family,-,"$15,000,000"
1,2,ghost,"$217,631,306","$217,631,306",100%,-,-,"Jul 13, 1990","$12,191,540",Paramount Pictures,2 hr 7 min,Drama Fantasy Romance Thriller,-,"$22,000,000"
2,3,danceswithwolves,"$184,208,848","$184,208,848",100%,-,-,"Nov 9, 1990","$598,257",Orion Pictures,3 hr 1 min,Adventure Drama Western,-,"$19,000,000"
3,4,prettywoman,"$178,406,268","$178,406,268",100%,-,-,"Mar 23, 1990","$11,280,591",Walt Disney Studios Motion Pictures,1 hr 59 min,Comedy Romance,R,"$14,000,000"
4,5,teenagemutantninjaturtles,"$135,265,915","$135,265,915",100%,-,-,"Mar 30, 1990","$25,398,367",New Line Cinema,1 hr 33 min,Action Adventure Comedy Family Sci-Fi,-,"$125,000,000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6514,196,harrypotterandthesorcerersstone2021rerelease,"$5,566,768",-,-,"$5,566,768",100%,"Nov 12, 2021","$582,125",Warner Bros.,2 hr 32 min,Adventure Family Fantasy,PG,"$125,000,000"
6515,197,miracle,"$5,553,998",-,-,"$5,553,998",100%,"Sep 15, 2021",-,-,1 hr 58 min,Drama,-,"$28,000,000"
6516,198,octonautstheringoffire,"$5,512,875",-,-,"$5,512,875",100%,"Apr 28, 2021","$73,988",-,1 hr 12 min,Animation,-,
6517,199,hellsgarden,"$5,450,978",-,-,"$5,450,978",100%,"May 21, 2021","$1,096,206",-,1 hr 42 min,Action Comedy,-,
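The same precedence rule can also be written without an explicit loop. The following is a sketch on toy rows, assuming the column names Budget_x and Budget_y from the merge above; it is an alternative formulation, not the code used in this report.

```python
import pandas as pd

# Toy rows covering the three cases:
# both sources, only Box Office Mojo, and neither source
toy = pd.DataFrame({
    "Budget_x": ["$18,000,000", "$125,000,000", "-"],
    "Budget_y": ["$15,000,000", None, None],
})

# Treat the "-" placeholder as missing, then prefer Budget_y
# and fall back to Budget_x where Budget_y is missing
mojo_budget = toy["Budget_x"].where(toy["Budget_x"] != "-")
toy["Budget"] = toy["Budget_y"].fillna(mojo_budget)
print(toy["Budget"].tolist())
```

The first row takes The Numbers value, the second falls back to Box Office Mojo, and the third stays missing.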


The below cell drops all rows where "Release Date", "Opening Earnings", "Domestic Distributor", "Running Time", or "Genres" is "-", a placeholder for a missing value. It also drops all rows that contain null values, which includes films with no budget from either source.

In [10]:
sample_true = sample_true[sample_true['Release Date'] != "-"]
sample_true = sample_true[sample_true['Opening Earnings'] != "-"]
sample_true = sample_true[sample_true["Domestic Distributor"] != "-"]
sample_true = sample_true[sample_true["Running Time"] != "-"]
sample_true = sample_true[sample_true["Genres"] != "-"]
sample_true = sample_true.dropna()
sample_true

Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget
0,1,homealone,"$285,761,244","$285,761,243",100%,$1,<0.1%,"Nov 16, 1990","$17,081,997",Twentieth Century Fox,1 hr 43 min,Comedy Family,-,"$15,000,000"
1,2,ghost,"$217,631,306","$217,631,306",100%,-,-,"Jul 13, 1990","$12,191,540",Paramount Pictures,2 hr 7 min,Drama Fantasy Romance Thriller,-,"$22,000,000"
2,3,danceswithwolves,"$184,208,848","$184,208,848",100%,-,-,"Nov 9, 1990","$598,257",Orion Pictures,3 hr 1 min,Adventure Drama Western,-,"$19,000,000"
3,4,prettywoman,"$178,406,268","$178,406,268",100%,-,-,"Mar 23, 1990","$11,280,591",Walt Disney Studios Motion Pictures,1 hr 59 min,Comedy Romance,R,"$14,000,000"
4,5,teenagemutantninjaturtles,"$135,265,915","$135,265,915",100%,-,-,"Mar 30, 1990","$25,398,367",New Line Cinema,1 hr 33 min,Action Adventure Comedy Family Sci-Fi,-,"$125,000,000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6420,102,bogia,"$17,480,489","$350,000",2%,"$17,130,489",98%,"May 28, 2021","$350,000",3388 Films,2 hr 8 min,Comedy Drama Family,-,"$1,000,000"
6428,110,reminiscence,"$15,800,193","$3,900,193",24.7%,"$11,900,000",75.3%,"Aug 20, 2021","$1,950,793",Warner Bros.,1 hr 56 min,Mystery Romance Sci-Fi Thriller,PG-13,"$67,972,729"
6430,112,minari,"$15,372,376","$3,110,580",20.2%,"$12,261,796",79.8%,"Feb 12, 2021","$193,460",A24,1 hr 55 min,Drama,PG-13,"$2,000,000"
6490,172,judasandtheblackmessiah,"$6,798,511","$5,446,607",80.1%,"$1,351,904",19.9%,"Feb 12, 2021","$2,027,076",Warner Bros.,2 hr 6 min,Biography Drama History,R,"$26,000,000"
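The five filters above can be condensed into a single mask. This sketch uses a toy dataframe with only two of the required columns, so the column list is a shortened assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Release Date": ["Nov 16, 1990", "Sep 15, 2021", "Apr 28, 2021"],
    "Opening Earnings": ["$17,081,997", "-", "$73,988"],
    "Budget": ["$15,000,000", "$28,000,000", None],
})

required = ["Release Date", "Opening Earnings"]
# True for rows where any required column holds the "-" placeholder
has_placeholder = toy[required].eq("-").any(axis=1)
# Drop placeholder rows, then rows with any remaining null
# (here, the film with a missing budget)
cleaned = toy[~has_placeholder].dropna()
print(cleaned)
```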


Here we map the function rr onto the columns Worldwide, Domestic, Opening Earnings, and Budget to remove every character that is not alphanumeric, such as dollar signs and commas. We did this so that each value in these columns contains only digits, which can then be converted to numbers that our machine learning models can use. 

In [11]:
sample_true['Worldwide'] = sample_true['Worldwide'].map(rr)
sample_true['Domestic'] = sample_true['Domestic'].map(rr)
sample_true['Opening Earnings'] = sample_true['Opening Earnings'].map(rr)
sample_true['Budget'] = sample_true['Budget'].map(rr)
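For comparison, a dedicated parser for monetary strings might look like the sketch below. The helper `money_to_int` is hypothetical and not part of the original notebook; unlike rr, it returns an integer directly, maps the "-" placeholder to None, and does not add entries to the saved dictionary.

```python
def money_to_int(value):
    """Convert a string like "$285,761,244" to an int; return None for placeholders."""
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return int(digits) if digits else None

print(money_to_int("$285,761,244"))  # 285761244
print(money_to_int("-"))             # None
```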

Release fix takes a release date (formatted "Month Day, Year", for example "Nov 16, 1990"), replaces the comma with a space, and splits on whitespace (splitting on commas alone was causing issues). A dictionary maps the month abbreviations used on BoxOfficeMojo to month numbers. Finally, the date is converted to a single number that approximates the day count since year 0 by treating every month as 30 days and every year as 365 days: day + month x 30 + year x 365. For "Nov 16, 1990" this gives 16 + 11 x 30 + 1990 x 365 = 726696.

In [12]:
def release_fix(x):
    # "Nov 16, 1990" -> ['Nov', '16', '1990']; split() with no argument drops empty strings
    L = str(x).strip().replace(',', " ").split()
    transf = {'Jan':1,
             'Feb':2,
             'Mar':3,
             'Apr':4,
             'May':5,
             'Jun':6,
             'Jul':7,
             'Aug':8,
             'Sep':9,
             'Oct':10,
             'Nov':11,
             'Dec':12}
    try:
        R = transf[L[0]] * 30 + int(L[1]) + int(L[2]) * 365
    except (KeyError, IndexError):
        # unrecognized month or missing date parts: keep the placeholder
        return "-"
    return R
sample_true['Release Date'] = sample_true['Release Date'].map(release_fix)
sample_true

Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget
0,1,homealone,285761244,285761243,100%,$1,<0.1%,726696,17081997,Twentieth Century Fox,1 hr 43 min,Comedy Family,-,15000000
1,2,ghost,217631306,217631306,100%,-,-,726573,12191540,Paramount Pictures,2 hr 7 min,Drama Fantasy Romance Thriller,-,22000000
2,3,danceswithwolves,184208848,184208848,100%,-,-,726689,598257,Orion Pictures,3 hr 1 min,Adventure Drama Western,-,19000000
3,4,prettywoman,178406268,178406268,100%,-,-,726463,11280591,Walt Disney Studios Motion Pictures,1 hr 59 min,Comedy Romance,R,14000000
4,5,teenagemutantninjaturtles,135265915,135265915,100%,-,-,726470,25398367,New Line Cinema,1 hr 33 min,Action Adventure Comedy Family Sci-Fi,-,125000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6420,102,bogia,17480489,350000,2%,"$17,130,489",98%,737843,350000,3388 Films,2 hr 8 min,Comedy Drama Family,-,1000000
6428,110,reminiscence,15800193,3900193,24.7%,"$11,900,000",75.3%,737925,1950793,Warner Bros.,1 hr 56 min,Mystery Romance Sci-Fi Thriller,PG-13,67972729
6430,112,minari,15372376,3110580,20.2%,"$12,261,796",79.8%,737737,193460,A24,1 hr 55 min,Drama,PG-13,2000000
6490,172,judasandtheblackmessiah,6798511,5446607,80.1%,"$1,351,904",19.9%,737737,2027076,Warner Bros.,2 hr 6 min,Biography Drama History,R,26000000
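Because the conversion treats every month as 30 days, some distinct dates map to the same number. The sketch below reproduces the approximation next to an exact day count from Python's standard `datetime` module; the function names are hypothetical, and the exact count is shown only as a possible alternative, not as what this report used.

```python
from datetime import datetime

def approx_days(date_string):
    """The report's approximation: month * 30 + day + year * 365."""
    parsed = datetime.strptime(date_string, "%b %d, %Y")
    return parsed.month * 30 + parsed.day + parsed.year * 365

def exact_days(date_string):
    """Proleptic Gregorian ordinal: days since 0001-01-01, counting that day as 1."""
    return datetime.strptime(date_string, "%b %d, %Y").date().toordinal()

print(approx_days("Nov 16, 1990"))  # 726696
# The approximation cannot tell these two consecutive days apart
print(approx_days("Jan 31, 2021") == approx_days("Feb 1, 2021"))  # True
print(exact_days("Feb 1, 2021") - exact_days("Jan 31, 2021"))     # 1
```

For a regression feature the approximation still preserves the overall ordering of films across years, so its effect on the models is likely small.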


Takes the running time in BoxOfficeMojo format (x hr y min) and converts it to minutes.

In [13]:
import re

def run_fix(x):
    # find the hour and minute counts; either may be absent ("2 hr" or "45 min")
    hours = re.search(r'(\d+)\s*hr', x)
    minutes = re.search(r'(\d+)\s*min', x)
    # fail loudly instead of silently returning a runtime of 0
    if hours is None and minutes is None:
        raise ValueError(f"Unrecognized running time: {x!r}")
    R = 0
    if hours:
        R += int(hours.group(1)) * 60
    if minutes:
        R += int(minutes.group(1))
    return R
sample_true['Running Time'] = sample_true['Running Time'].map(run_fix)
sample_true 

Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget
0,1,homealone,285761244,285761243,100%,$1,<0.1%,726696,17081997,Twentieth Century Fox,103,Comedy Family,-,15000000
1,2,ghost,217631306,217631306,100%,-,-,726573,12191540,Paramount Pictures,127,Drama Fantasy Romance Thriller,-,22000000
2,3,danceswithwolves,184208848,184208848,100%,-,-,726689,598257,Orion Pictures,181,Adventure Drama Western,-,19000000
3,4,prettywoman,178406268,178406268,100%,-,-,726463,11280591,Walt Disney Studios Motion Pictures,119,Comedy Romance,R,14000000
4,5,teenagemutantninjaturtles,135265915,135265915,100%,-,-,726470,25398367,New Line Cinema,93,Action Adventure Comedy Family Sci-Fi,-,125000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6420,102,bogia,17480489,350000,2%,"$17,130,489",98%,737843,350000,3388 Films,128,Comedy Drama Family,-,1000000
6428,110,reminiscence,15800193,3900193,24.7%,"$11,900,000",75.3%,737925,1950793,Warner Bros.,116,Mystery Romance Sci-Fi Thriller,PG-13,67972729
6430,112,minari,15372376,3110580,20.2%,"$12,261,796",79.8%,737737,193460,A24,115,Drama,PG-13,2000000
6490,172,judasandtheblackmessiah,6798511,5446607,80.1%,"$1,351,904",19.9%,737737,2027076,Warner Bros.,126,Biography Drama History,R,26000000


In the below cell we run the .replace() function on sample_true using the saved dictionary. The saved dictionary maps each normalized title (lowercase, alphanumeric characters only) back to a title as it appeared before the rr function was run. By running .replace() on "Title" with the saved dictionary, we recover readable movie names. Because rr was mapped over both datasets, a title present in both is restored to the spelling from The Numbers, which was stored last. 

In [14]:
final_df = sample_true.replace({"Title": saved})

In the below cell we create a new dataframe, visualization_df, from sample_true with the original titles restored. We run .map(int) on the "Worldwide" and "Running Time" columns to convert the values to integers, and .map(rr) on the "Domestic" column to strip any remaining non-alphanumeric characters such as dollar signs. We keep a separate dataframe for visualizations so that plots made with plotly can show readable values, such as the names of domestic distributors, before those values are converted to integers for modeling. 

In [15]:
visualization_df = sample_true.replace({"Title": saved})
visualization_df['Worldwide'] = visualization_df['Worldwide'].map(int)
visualization_df['Domestic'] = visualization_df['Domestic'].map(rr)
visualization_df['Running Time'] = visualization_df['Running Time'].map(int)

Goes over the dataframe and pulls all unique values out from the Domestic Distributor column. Each distributor is assigned an integer code in a dictionary, from 0 up to one less than the number of unique distributors.

In [16]:
Distrib_map = {}
i=0
for x in final_df['Domestic Distributor'].unique():
    Distrib_map[x] = i
    i+=1
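The same mapping can be written more compactly. The sketch below uses a toy list of distributor names and hypothetical variable names; like the loop above, it assigns codes in order of first appearance, starting at 0.

```python
distributors = ["Warner Bros.", "A24", "Warner Bros.", "Paramount Pictures"]

# dict.fromkeys keeps the first occurrence of each name, in order
unique_names = list(dict.fromkeys(distributors))
distrib_codes = {name: code for code, name in enumerate(unique_names)}
print(distrib_codes)  # {'Warner Bros.': 0, 'A24': 1, 'Paramount Pictures': 2}

encoded = [distrib_codes[name] for name in distributors]
print(encoded)  # [0, 1, 0, 2]
```

These integers are labels only: the order in which distributors happen to appear carries no meaning, yet models that treat the codes as magnitudes will see code 2 as "larger" than code 0.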

Iterates through all values in Genres, splits each entry (multiple genres, formatted as something like "Comedy Action Western"), and appends each genre to a list. The list is turned into a set (which drops duplicate values), and each genre in the set g is mapped to an integer code from 0 up to one less than the size of the set. Because the iteration order of a set of strings can differ between Python sessions, the code assigned to a given genre may change from run to run.
Then, for each movie, the string in the Genres column is split into a list, each genre is mapped to its code from Genre_map, and the first code in the list is kept as the genre that represents the movie (as the first genre on BoxOfficeMojo is the most descriptive). The index position in the list full matches the row position in the dataframe.

In [17]:
G_list = []
for x in final_df['Genres']:
    for y in x.split(" "):
        G_list.append(y)

j=0
Genre_map = {}
g = set(G_list)
for x in g:
    Genre_map[x] = j
    j+=1
    

full = []
for x in final_df['Genres']:
    S = x.split(" ")
    genres_num = []
    for y in S:
        genres_num.append(Genre_map[y])
    full.append(genres_num[0])
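
One thing to keep in mind with this approach is that the iteration order of a set of strings can change between Python sessions (string hashing is randomized by default), so the code assigned to each genre may differ from run to run. A minimal sketch of a reproducible variant, assuming we sort the genres before numbering them, is shown below; `genre_strings` is a made-up sample, not our data.

```python
# Hypothetical genre strings in the same space-separated format
genre_strings = ["Comedy Family", "Drama Fantasy Romance", "Adventure Drama Western"]

# Sort the unique genres so the numbering is identical on every run
all_genres = sorted({g for s in genre_strings for g in s.split(" ")})
genre_map_demo = {genre: code for code, genre in enumerate(all_genres)}

# Keep only the code of the first listed genre for each movie
first_genre_codes = [genre_map_demo[s.split(" ")[0]] for s in genre_strings]
print(genre_map_demo)
print(first_genre_codes)
```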

Replace the Domestic Distributor values with the dictionary value assigned to it.

In [18]:
final_df = final_df.replace({"Domestic Distributor": Distrib_map})

The cell below assigns the list full to the "Genres" column. The list contains one integer genre code per movie, so each row now holds a single genre. We did this because the machine learning models and the PCA reduction did not seem to work with a list of integers per row.

In [19]:
final_df['Genres'] = full

Cast each value in Budget, Worldwide, and Opening Earnings to an integer.

In [20]:
final_df['Budget'] = final_df['Budget'].map(int)
final_df['Worldwide'] = final_df['Worldwide'].map(int)
final_df['Opening Earnings'] = final_df['Opening Earnings'].map(int)

Returns either an integer or the original value. Because this is called after rr is mapped to the Domestic column, it only deals with pure numbers in string format or "-" values.

In [21]:
def modified_int(x):
    try:
        return int(x)
    except ValueError:
        return x

final_df['Domestic'] = final_df['Domestic'].map(modified_int)
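
For comparison, pandas offers a built-in conversion that handles the same two cases differently. The sketch below is an alternative we did not use: with errors="coerce", values that cannot be parsed (such as "-") become NaN instead of staying as strings, which keeps the column numeric. The series `domestic_demo` is a made-up sample.

```python
import pandas as pd

# Hypothetical Domestic values after the dollar signs and commas are stripped
domestic_demo = pd.Series(["285761243", "-", "350000"])

# Unparseable entries become NaN rather than remaining as the string "-"
domestic_numeric = pd.to_numeric(domestic_demo, errors="coerce")
print(domestic_numeric)
```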

Here we produce a new column, Budget per Minute of Runtime, by dividing the Budget column by the Running Time column. We did this to create a new feature variable that our model can use to make predictions. We thought this would be an interesting feature because it shows how much budget was put into each minute of a movie. For instance, a million-dollar budget spread over a three-hour movie is not much compared to a million dollars put into a 30-minute movie. This feature helps us distinguish such cases.

In [22]:
final_df["Budget per Minute of Runtime"] = final_df["Budget"] / final_df["Running Time"]
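
To make the arithmetic concrete, the short sketch below reproduces the calculation for the first row of our dataframe (Home Alone, with a budget of 15,000,000 and a running time of 103 minutes) and for the two hypothetical one-million-dollar movies mentioned above. The variable names are only for illustration.

```python
# First row of the dataframe: Home Alone
home_alone_per_minute = 15_000_000 / 103
print(round(home_alone_per_minute, 2))   # matches the 1.456311e+05 shown in the table

# The hypothetical comparison from the text: the same budget, very different runtimes
three_hour_movie = 1_000_000 / 180   # about 5,556 dollars per minute
half_hour_movie = 1_000_000 / 30     # about 33,333 dollars per minute
print(round(three_hour_movie), round(half_hour_movie))
```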

The below cell contains the function convert_no_rating which converts "-" values for MPAA to NR. The reason this is being done is so that it is more representative of not having a rating. This function is run on the "MPAA" column using .map() to convert all "-" values to "NR".

In [23]:
def convert_no_rating(x):
    if x == "-":
        return "NR"
    else:
        return x

final_df["MPAA"] = final_df["MPAA"].map(convert_no_rating)
final_df

Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget,Budget per Minute of Runtime
0,1,Home Alone,285761244,285761243,100%,$1,<0.1%,726696,17081997,0,103,22,NR,15000000,1.456311e+05
1,2,Ghost,217631306,217631306,100%,-,-,726573,12191540,1,127,11,NR,22000000,1.732283e+05
2,3,Dances with Wolves,184208848,184208848,100%,-,-,726689,598257,2,181,3,NR,19000000,1.049724e+05
3,4,Pretty Woman,178406268,178406268,100%,-,-,726463,11280591,3,119,22,R,14000000,1.176471e+05
4,5,Teenage Mutant Ninja Turtles,135265915,135265915,100%,-,-,726470,25398367,4,93,10,NR,125000000,1.344086e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6420,102,Bo Gia,17480489,350000,2%,"$17,130,489",98%,737843,350000,113,128,22,NR,1000000,7.812500e+03
6428,110,Reminiscence,15800193,3900193,24.7%,"$11,900,000",75.3%,737925,1950793,8,116,2,PG-13,67972729,5.859718e+05
6430,112,Minari,15372376,3110580,20.2%,"$12,261,796",79.8%,737737,193460,81,115,11,PG-13,2000000,1.739130e+04
6490,172,Judas and the Black Messiah,6798511,5446607,80.1%,"$1,351,904",19.9%,737737,2027076,8,126,15,R,26000000,2.063492e+05


The code snippet below creates a dictionary called rating_map that maps each type of rating to an integer. We do this by iterating through the unique values of the "MPAA" column, which produces a dictionary where each rating has a corresponding integer. For example, "NR" is mapped to 0. This lets us represent the ratings as integers rather than strings so that the model can use the rating as a feature.

In [24]:
rating_map = {}
i=0
for x in final_df['MPAA'].unique():
    rating_map[x] = i
    i+=1
rating_map

{'NR': 0, 'R': 1, 'PG': 2, 'PG-13': 3, 'Approved': 4, 'G': 5, 'NC-17': 6}

Here the ratings are replaced with the .replace() method using the rating_map dictionary made in the above snippet. 

In [25]:
final_df = final_df.replace({"MPAA": rating_map})
final_df

Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget,Budget per Minute of Runtime
0,1,Home Alone,285761244,285761243,100%,$1,<0.1%,726696,17081997,0,103,22,0,15000000,1.456311e+05
1,2,Ghost,217631306,217631306,100%,-,-,726573,12191540,1,127,11,0,22000000,1.732283e+05
2,3,Dances with Wolves,184208848,184208848,100%,-,-,726689,598257,2,181,3,0,19000000,1.049724e+05
3,4,Pretty Woman,178406268,178406268,100%,-,-,726463,11280591,3,119,22,1,14000000,1.176471e+05
4,5,Teenage Mutant Ninja Turtles,135265915,135265915,100%,-,-,726470,25398367,4,93,10,0,125000000,1.344086e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6420,102,Bo Gia,17480489,350000,2%,"$17,130,489",98%,737843,350000,113,128,22,0,1000000,7.812500e+03
6428,110,Reminiscence,15800193,3900193,24.7%,"$11,900,000",75.3%,737925,1950793,8,116,2,3,67972729,5.859718e+05
6430,112,Minari,15372376,3110580,20.2%,"$12,261,796",79.8%,737737,193460,81,115,11,3,2000000,1.739130e+04
6490,172,Judas and the Black Messiah,6798511,5446607,80.1%,"$1,351,904",19.9%,737737,2027076,8,126,15,1,26000000,2.063492e+05
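
One caveat of integer codes is that they imply an order (for example, that "G" = 5 is somehow greater than "R" = 1), which linear and distance-based models may pick up on. A common alternative, which we did not use in this report, is one-hot encoding. The sketch below shows the idea on a made-up sample; `ratings_demo` is hypothetical.

```python
import pandas as pd

# Hypothetical ratings, for illustration only
ratings_demo = pd.DataFrame({"MPAA": ["NR", "R", "PG-13", "R"]})

# One indicator column per rating instead of a single integer code
one_hot = pd.get_dummies(ratings_demo["MPAA"], prefix="MPAA")
print(one_hot)
```

The trade-off is a wider feature matrix: one column per category instead of one column in total.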


Finally, drop values from the dataframe where the release date is unaccounted for (implying the original data was not manageable from BoxOfficeMojo), as well as drop re-released movies (data outliers)

In [26]:
final_df = final_df[final_df['Release Date'] != "-"]
final_df = final_df[final_df["Title"].str.contains("Re-release") == False]
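
The two filters above can also be written as a single boolean mask. The sketch below is a minimal illustration on made-up rows (`rows_demo` is hypothetical); it uses ~ to negate the condition instead of comparing with False.

```python
import pandas as pd

# Hypothetical rows: one normal movie, one re-release, one with a missing release date
rows_demo = pd.DataFrame({
    "Title": ["Jaws", "Jaws (Re-release)", "Ghost"],
    "Release Date": ["727000", "737000", "-"],
})

# Keep rows that have a release date and are not re-releases
keep = (rows_demo["Release Date"] != "-") & ~rows_demo["Title"].str.contains("Re-release", regex=False)
kept_demo = rows_demo[keep]
print(kept_demo["Title"].tolist())
```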

Our final dataframe.

In [27]:
final_df



Unnamed: 0,Rank,Title,Worldwide,Domestic,% from Domestic,Foreign,% from Foreign,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget,Budget per Minute of Runtime
0,1,Home Alone,285761244,285761243,100%,$1,<0.1%,726696,17081997,0,103,22,0,15000000,1.456311e+05
1,2,Ghost,217631306,217631306,100%,-,-,726573,12191540,1,127,11,0,22000000,1.732283e+05
2,3,Dances with Wolves,184208848,184208848,100%,-,-,726689,598257,2,181,3,0,19000000,1.049724e+05
3,4,Pretty Woman,178406268,178406268,100%,-,-,726463,11280591,3,119,22,1,14000000,1.176471e+05
4,5,Teenage Mutant Ninja Turtles,135265915,135265915,100%,-,-,726470,25398367,4,93,10,0,125000000,1.344086e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6407,89,The Marksman,22988549,15566093,67.7%,"$7,422,456",32.3%,737710,3104204,77,108,10,3,23000000,2.129630e+05
6420,102,Bo Gia,17480489,350000,2%,"$17,130,489",98%,737843,350000,113,128,22,0,1000000,7.812500e+03
6428,110,Reminiscence,15800193,3900193,24.7%,"$11,900,000",75.3%,737925,1950793,8,116,2,3,67972729,5.859718e+05
6430,112,Minari,15372376,3110580,20.2%,"$12,261,796",79.8%,737737,193460,81,115,11,3,2000000,1.739130e+04


Saves our final dataframe as a CSV file named "df_updated1" (the filename passed to to_csv has no .csv extension).

In [28]:
final_df.to_csv("df_updated1")

## **3.2 Data Exploration**

Function to convert a release date stored as a number of days since year 0 (using 365-day years and 30-day months) into a year-month-day string that Plotly Express can read as a date.

In [29]:
import math

def modify_release_date(x):
    # cast release date as integer
    x = int(x)
    # floor the value divided by 365, this is the year, cast from float to int
    year = int(math.floor(x/365))
    # floor the remainder from the year division divided by 30, this is our month release
    month = int(math.floor((x%365)/30))
    # get the remainder from the year, with the remainder from the month, this is our days.
    day = (x%365)%30
    # append together in proper format
    return str(year) + "-" + str(month) + "-" + str(day)
    
visualization_df = visualization_df[visualization_df['Release Date'] != "-"]
visualization_df["Release Date"] = visualization_df["Release Date"].map(modify_release_date)   
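
To see how this decoding behaves, the sketch below repeats the same arithmetic with divmod on two Release Date values from our table: 726696 (Home Alone) and 726470 (Teenage Mutant Ninja Turtles). The function name `decode_release_date` and the zero padding are our own additions for the example. The second value exposes a limitation of the 30-day-month scheme: a remainder that is an exact multiple of 30 decodes to day 0 of the following month.

```python
def decode_release_date(value):
    # Whole 365-day years, then whole 30-day months, then leftover days
    year, remainder = divmod(int(value), 365)
    month, day = divmod(remainder, 30)
    return f"{year:04d}-{month:02d}-{day:02d}"

print(decode_release_date(726696))  # 1990-11-16
print(decode_release_date(726470))  # 1990-04-00, day 0 is not a valid calendar day
```

A date library would reject the second string, so values like this may need special handling before they are plotted or parsed.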

Here we produce a graph that shows the relationship between Release Date and worldwide gross using plotly's scatter plot.

In [30]:
import plotly.express as px


fig = px.scatter(visualization_df, x="Release Date", y="Worldwide", color = "Domestic Distributor",
                    hover_name = "Title", size_max = 50,
                    title='Relationship between Release Date and Worldwide gross and the effect of Domestic Distributor.')

#modify axis label properties
fig.update_xaxes(title_font={"size":18, "family": "Courier", "color":"gray"}, 
                 tickfont = {"size":16, "family": "Courier", "color":"gray"})
fig.update_yaxes(title_font={"size":18, "family": "Courier", "color":"gray"}, 
                            tickfont = {"size":16, "family": "Courier", "color":"gray"})

fig.show()


<img src = "https://i.ibb.co/qdVGj67/newplot-1.png" width=900/>

The scatter plot above shows the relationship between Release Date and Worldwide gross. It lets us see immediately that there is little correlation between worldwide earnings and release date, although as time goes on there are more outlier movies with very high grosses. We can also see that a majority of these high-earning films come from Universal Pictures and Walt Disney Studios Motion Pictures, especially closer to 2021.

The below cell is used to extract just the features from the final_df dataframe. This is used for dimensional reduction.  

In [31]:

list_columns = ["Rank", "Release Date","Opening Earnings","Domestic Distributor","Running Time","Genres","MPAA","Budget","Budget per Minute of Runtime"]

features_only = final_df[list_columns]
features_only = features_only[features_only['Release Date'] != "-"]
# genres = features_only.loc[:,"Genres"]
# genres_num = genres.values
# for

# for column in list_columns:
#     print(features_only[features_only[column] == "-"])
#     # print(type(features_only[column].iloc[0]))
features_only

Unnamed: 0,Rank,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget,Budget per Minute of Runtime
0,1,726696,17081997,0,103,22,0,15000000,1.456311e+05
1,2,726573,12191540,1,127,11,0,22000000,1.732283e+05
2,3,726689,598257,2,181,3,0,19000000,1.049724e+05
3,4,726463,11280591,3,119,22,1,14000000,1.176471e+05
4,5,726470,25398367,4,93,10,0,125000000,1.344086e+06
...,...,...,...,...,...,...,...,...,...
6407,89,737710,3104204,77,108,10,3,23000000,2.129630e+05
6420,102,737843,350000,113,128,22,0,1000000,7.812500e+03
6428,110,737925,1950793,8,116,2,3,67972729,5.859718e+05
6430,112,737737,193460,81,115,11,3,2000000,1.739130e+04


Here we use the PCA class from sklearn to reduce the dimensionality of our dataset. The result is a two-dimensional NumPy array that contains the component 1 and component 2 values for each movie.

In [46]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler



# instantiate the PCA object and request two components
pca = PCA(n_components= 2, random_state=3000)

# standardize the features so they are all on the same scale
features_standardized = StandardScaler().fit_transform(features_only)

# transform the standardized features using the PCA algorithm 
reduced_data = pca.fit_transform(features_standardized)

reduced_data
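
A useful follow-up check, which we did not run in this report, is how much of the total variance the two components retain. The sketch below shows the idea on synthetic data (`toy_features` is randomly generated, not our dataset) using the explained_variance_ratio_ attribute of a fitted PCA object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 rows, 5 features, with feature 1 strongly tied to feature 0
rng = np.random.default_rng(3000)
toy_features = rng.normal(size=(50, 5))
toy_features[:, 1] = 2 * toy_features[:, 0] + rng.normal(scale=0.1, size=50)

# Standardize, then project onto two principal components
toy_standardized = StandardScaler().fit_transform(toy_features)
toy_pca = PCA(n_components=2, random_state=3000)
toy_reduced = toy_pca.fit_transform(toy_standardized)

# Fraction of the total variance kept by the two components
retained = toy_pca.explained_variance_ratio_.sum()
print(toy_reduced.shape, round(float(retained), 3))
```

A low retained fraction would mean the two-dimensional plot hides much of the structure in the original features.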

The reduced_data array is then converted to a dataframe, reduced_df, with columns "Component1" and "Component2", which can be used to visualize the data.

In [33]:
reduced_df = pd.DataFrame(reduced_data, columns = ["Component1", "Component2"])
reduced_df

Unnamed: 0,Component1,Component2
0,-1.000280,-2.701418
1,-0.425798,-2.659443
2,-0.001735,-2.651066
3,-0.874100,-2.492413
4,1.998162,-2.356608
...,...,...
3699,-0.792446,3.231872
3700,-2.438122,3.606674
3701,0.839781,1.481670
3702,-1.469417,3.410265


Using plotly's scatter plot we produce the PCA graph for our dataset.

In [47]:
import plotly.express as px

graph = px.scatter(reduced_df, x="Component1", y="Component2", title="PCA Graph")
graph.show()


<img src="https://i.ibb.co/MfzcXG8/newplot-2.png" width=900/>

The PCA simplifies our complex, multi-feature dataset into a two-dimensional dataset with components 1 and 2, derived from the principal component vectors of our feature columns. Because we are performing regression rather than classification on this dataset, the PCA plot does not separate the data into classes, nor does it show an obvious distribution of our data, besides a mildly positive correlation.

In the below cell we produce a scatter plot between the opening Earnings and the Worldwide with Title on hover data. We use plotly's scatter plot function to do so. 

In [35]:
figure0 = px.scatter(final_df, x="Opening Earnings", y="Worldwide", hover_data=["Title"], title="Opening Earnings vs Worldwide Earnings")
figure0.show()


<img src="https://i.ibb.co/x359TZd/newplot-3.png" width=900/>

The graph above is a scatter plot with Opening Earnings as the x variable and Worldwide as the y variable. It shows a positive correlation between the opening earnings and the worldwide gross of a movie: the higher the opening earnings, the higher the worldwide gross tends to be. For example, Avengers: Endgame had opening earnings of 375.115 million dollars and the highest worldwide gross, 2.79 billion dollars. This is expected, since high opening earnings indicate strong audience interest in the movie. A scatter plot makes sense for these variables because it visualizes the correlation between two continuous variables.
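
The strength of such a relationship can be quantified with Pearson's correlation coefficient. The sketch below is only an illustration: the two arrays are made-up values, not rows from our dataset, and the calculation was not part of our analysis.

```python
import numpy as np

# Made-up opening and worldwide earnings (in dollars), for illustration only
opening_demo = np.array([5e6, 20e6, 60e6, 150e6, 350e6])
worldwide_demo = np.array([30e6, 90e6, 400e6, 900e6, 2500e6])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(opening_demo, worldwide_demo)[0, 1]
print(round(float(r), 3))
```

A value close to 1 indicates a strong positive linear relationship, a value close to 0 indicates little linear relationship.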

In [36]:
figure1 = px.box(final_df, x="Domestic Distributor", y="Worldwide", points=False, title="Domestic Distributor vs Worldwide")
figure1.show()


<img src="https://i.ibb.co/Xpjyc1j/newplot-4.png" width=900/>

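This box-and-whisker plot shows the relationship between a distributor and how much it grosses worldwide per movie. The box represents the middle 50% of a distributor's worldwide earnings (the interquartile range, with the median marked inside), and the lines stemming from the box (the whiskers) show how far the remaining values extend. Individual outlier points are not drawn because the plot is created with points=False.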
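
As a rough sketch of what the box encodes, the example below computes the three quartiles for a made-up list of grosses using Python's statistics module. The values in `grosses_demo` are hypothetical, and Plotly may use a slightly different interpolation rule for its quartiles, so treat this as an illustration of the concept rather than a reproduction of the chart.

```python
import statistics

# Made-up worldwide grosses for one distributor, in millions of dollars
grosses_demo = [10, 20, 30, 40, 50]

# Quartiles: lower edge of the box, median line, upper edge of the box
q1, median, q3 = statistics.quantiles(grosses_demo, n=4, method="inclusive")
iqr = q3 - q1
print(q1, median, q3, iqr)
```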

In [37]:
figure2 = px.scatter(final_df, x="Budget", y="Worldwide", hover_data=["Title"], title="Budget vs Worldwide Earnings")
figure2.show()


<img src="https://i.ibb.co/qmxthy9/newplot-5.png" width=900/>

The general trend for budget vs. worldwide earnings seems to be positive, with a large cluster near (0, 0), indicating that movies with low budgets generally do not perform well in terms of worldwide earnings.

In [38]:
figure3 = px.box(final_df, x="Genres", y="Worldwide", points = False, title="Genre vs Worldwide Average by Genre")
figure3


<img src="https://i.ibb.co/RSgfrjP/newplot-6.png" width=900/>

This box-and-whisker plot suggests a relationship between genre and expected earnings: Adventure (3) seems to have the highest 75th percentile, followed closely by Mystery (2) and Action (10), while Action and Drama (11) seem to have the highest potential for becoming true blockbusters.

## **3.3 Model Training**

**Preparing Data for Model Fitting**

In [39]:
from sklearn.model_selection import train_test_split

features = final_df.drop(["Worldwide", "Rank", "Title", "Domestic", "% from Domestic", "Foreign", "% from Foreign"], axis=1)
features_opening = final_df[["Opening Earnings", "Budget", "Domestic Distributor", "MPAA", "Release Date", "Running Time", "Budget per Minute of Runtime", "Genres"]]
target = final_df["Worldwide"]

def split_dataset(): 
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)
    
    return X_train, X_test, y_train, y_test
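
The sketch below illustrates two properties of train_test_split that our split relies on: with no test_size given, 25% of the rows go to the test set, and fixing random_state makes the split repeatable. The arrays `X_demo` and `y_demo` are toy data, not our features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 samples with 2 features each
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.arange(8)

# Default split is 75% train / 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=3000)

# The same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=3000)
print(len(X_tr), len(X_te), (X_te == X_te2).all())
```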

In [40]:
features

Unnamed: 0,Release Date,Opening Earnings,Domestic Distributor,Running Time,Genres,MPAA,Budget,Budget per Minute of Runtime
0,726696,17081997,0,103,22,0,15000000,1.456311e+05
1,726573,12191540,1,127,11,0,22000000,1.732283e+05
2,726689,598257,2,181,3,0,19000000,1.049724e+05
3,726463,11280591,3,119,22,1,14000000,1.176471e+05
4,726470,25398367,4,93,10,0,125000000,1.344086e+06
...,...,...,...,...,...,...,...,...
6407,737710,3104204,77,108,10,3,23000000,2.129630e+05
6420,737843,350000,113,128,22,0,1000000,7.812500e+03
6428,737925,1950793,8,116,2,3,67972729,5.859718e+05
6430,737737,193460,81,115,11,3,2000000,1.739130e+04


**Initial Testing Regression Algorithms: Multiple Linear Regression, Ridge, Lasso, k-Neighbors, Linear SVR**

In [41]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error


import warnings
warnings.filterwarnings('ignore')

X_train, X_test, y_train, y_test = split_dataset()


lnr = LinearRegression()
rid = Ridge() 
lass = Lasso() 
knn = KNeighborsRegressor()
svr = LinearSVR(max_iter=100000)


estimators = {"Multiple Linear Regression": lnr, "Ridge": rid, "Lasso": lass, "KNeighborsRegressor": knn, "Support Vector Machine": svr}

def test_regression_algorithms(): 


    for estimator_name, estimator in estimators.items():
        model = estimator.fit(X=X_train, y=y_train)

        print(estimator_name + ":")
        print("\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train)))
        print("\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test)))

        print("\tMean Absolute Error value for training set: ", mean_absolute_error(y_train, model.predict(X_train)))
        print("\tMean Absolute Error value for testing set: ", mean_absolute_error(y_test, model.predict(X_test)))

test_regression_algorithms()

Multiple Linear Regression:
	R-squared value for training set:  0.7967874105749124
	R-squared value for testing set:  0.7467647574237505
	Mean Absolute Error value for training set:  53696056.51393392
	Mean Absolute Error value for testing set:  56038135.814426005
Ridge:
	R-squared value for training set:  0.7967874105544415
	R-squared value for testing set:  0.7467648381172691
	Mean Absolute Error value for training set:  53696066.05135578
	Mean Absolute Error value for testing set:  56038114.80853845
Lasso:
	R-squared value for training set:  0.7967874105749124
	R-squared value for testing set:  0.7467647574948949
	Mean Absolute Error value for training set:  53696056.51318199
	Mean Absolute Error value for testing set:  56038135.78593249
KNeighborsRegressor:
	R-squared value for training set:  0.843565287165408
	R-squared value for testing set:  0.7320903184535146
	Mean Absolute Error value for training set:  43754780.49712023
	Mean Absolute Error value for testing set:  57454512.73
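
For reference, the two metrics reported above can be computed by hand. The sketch below uses toy numbers chosen only to keep the arithmetic easy to follow: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and the mean absolute error is the average size of the prediction errors.

```python
# Toy values, for illustration only
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 11]

mean_true = sum(y_true) / len(y_true)                        # 6.0
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # 1 + 0 + 1 + 4 = 6
ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # 9 + 1 + 1 + 9 = 20

r_squared = 1 - ss_res / ss_tot                              # 1 - 6/20 = 0.7
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)  # 4 / 4 = 1.0
print(r_squared, mae)
```

Unlike R-squared, the MAE is expressed in the units of the target, which is why our values are in the tens of millions of dollars.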

**Feature Selection on our best models**

In [42]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

best_estimators = {"Multiple Linear Regression": lnr, "Ridge": rid, "Lasso": lass, "KNeighborsRegressor": knn}

# Runs recursive feature elimination (RFE) with a decision tree to keep the 4 most important features,
# then refits each of the best models on only those selected features.
def RFE_feature_selection():
    select = RFE(DecisionTreeRegressor(random_state = 3000), n_features_to_select = 4)
    select.fit(X_train, y_train)
    
    X_train_selected = select.transform(X_train)
    X_test_selected = select.transform(X_test)
    
    #model = estimators["KNeighborsRegressor"].fit(X=X_train_selected, y=y_train)

    for estimator_name, estimator in best_estimators.items():
        model = best_estimators[estimator_name].fit(X=X_train_selected, y=y_train)
        print(estimator_name +":\n")


    
        predicted_training = model.predict(X_train_selected)
        expected_training = y_train
        r2_training_score = r2_score(expected_training, predicted_training)
        
        predicted_test = model.predict(X_test_selected)
        expected_test = y_test
        r2_test_score = r2_score(expected_test, predicted_test)     
            
        print("\tSelected features after RFE:")
          
        selected_features = [] 
        for feature_name, selected_status in zip(features.columns, select.get_support()):
            if(selected_status): 
                print("\t\t" + feature_name)
                selected_features.append(feature_name)
                 
        print(f"\t\t \n \t Performance with selected features: \n \t\t R-squared value for training set: {r2_training_score} \n \t\t R-squared value for testing set: {r2_test_score}")
    
    return X_train_selected, X_test_selected    

X_train_selected, X_test_selected  = RFE_feature_selection()

Multiple Linear Regression:

	Selected features after RFE:
		Release Date
		Opening Earnings
		Running Time
		Budget
		 
 	 Performance with selected features: 
 		 R-squared value for training set: 0.79479428827808 
 		 R-squared value for testing set: 0.7501914806744027
Ridge:

	Selected features after RFE:
		Release Date
		Opening Earnings
		Running Time
		Budget
		 
 	 Performance with selected features: 
 		 R-squared value for training set: 0.7947942882780747 
 		 R-squared value for testing set: 0.7501914862589587
Lasso:

	Selected features after RFE:
		Release Date
		Opening Earnings
		Running Time
		Budget
		 
 	 Performance with selected features: 
 		 R-squared value for training set: 0.79479428827808 
 		 R-squared value for testing set: 0.7501914806918868
KNeighborsRegressor:

	Selected features after RFE:
		Release Date
		Opening Earnings
		Running Time
		Budget
		 
 	 Performance with selected features: 
 		 R-squared value for training set: 0.8438691575790949 
 		 R-squ
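
Note that the selection itself is made once, by the decision tree inside RFE, which is why every model above lists the same four features. The sketch below shows the RFE mechanics in isolation on synthetic data (`X_toy` and `y_toy` are randomly generated, with only two of the five features driving the target); it is an illustration, not a rerun of our experiment.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends only on features 0 and 2
rng = np.random.default_rng(3000)
X_toy = rng.normal(size=(200, 5))
y_toy = 4 * X_toy[:, 0] - 3 * X_toy[:, 2] + rng.normal(scale=0.1, size=200)

# Repeatedly drop the least important feature until 2 remain
selector = RFE(DecisionTreeRegressor(random_state=3000), n_features_to_select=2)
selector.fit(X_toy, y_toy)

mask = selector.get_support()            # boolean mask over the 5 input features
X_toy_selected = selector.transform(X_toy)
print(mask, X_toy_selected.shape)
```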

## **3.4 Model Optimization**

We want to optimize our models in order to reduce the overfitting shown in the above R-squared values. 

In [43]:
from sklearn.model_selection import GridSearchCV

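# Performs a grid search with 5-fold cross-validation over the hyperparameters of the named estimator
# (any name other than the three linear models falls through to the kNN regressor grid).
def grid_search(estimator_name):
    if estimator_name == "Multiple Linear Regression" :
        # note: LinearRegression's 'normalize' parameter was deprecated in scikit-learn 1.0 and removed in 1.2
        param_grid = {"fit_intercept": [True,False], 'normalize':[True,False]}  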
        grid_search = GridSearchCV(estimators["Multiple Linear Regression"], param_grid, cv=5)  
    elif estimator_name == "Ridge" : 
        param_grid = {"alpha":[0.001, 0.01, 0.1, 1, 10, 100],"max_iter":[1,10,100,1000,10000,100000],
        "solver": ["svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga", "lbfgs"]}
        grid_search = GridSearchCV(estimators["Ridge"], param_grid, cv=5 )
    elif estimator_name == "Lasso":
        param_grid = {"alpha":[0.001, 0.01, 0.1, 1, 10, 100], "max_iter":[1,10,100,1000,10000,100000],
        "selection":["cyclic","random"]}
        grid_search = GridSearchCV(estimators["Lasso"], param_grid, cv=5)
    else:
        param_grid = {"n_neighbors": [1, 5, 10], "metric": ["euclidean", "manhattan", "minkowski"]}  
        grid_search = GridSearchCV(estimators["KNeighborsRegressor"], param_grid, cv=5)


    grid_search.fit(X=X_train_selected, y=y_train)  
    print(estimator_name + " :")
    print(f"Best parameters:  {grid_search.best_params_}")
    print(f"Training set score with best parameters: {grid_search.score(X_train_selected, y_train)}")
    print(f"Test set score with best parameters: {grid_search.score(X_test_selected, y_test)}")

    return grid_search

LNR_tuned = grid_search("Multiple Linear Regression")
Ridge_tuned = grid_search("Ridge")
Lasso_tuned = grid_search("Lasso")
KNN_tuned = grid_search("KNeighborsRegressor")

tuned_estimators = {"Multiple Linear Regression": LNR_tuned, "Ridge": Ridge_tuned, "Lasso": Lasso_tuned, "KNeighborsRegressor": KNN_tuned}

Multiple Linear Regression :
Best parameters:  {'fit_intercept': True, 'normalize': False}
Training set score with best parameters: 0.79479428827808
Test set score with best parameters: 0.7501914806744027
Ridge :
Best parameters:  {'alpha': 100, 'max_iter': 1, 'solver': 'svd'}
Training set score with best parameters: 0.7947942882260691
Test set score with best parameters: 0.7501920391251091
Lasso :
Best parameters:  {'alpha': 0.1, 'max_iter': 10, 'selection': 'random'}
Training set score with best parameters: 0.7913221785902639
Test set score with best parameters: 0.759091593745693
KNeighborsRegressor :
Best parameters:  {'metric': 'euclidean', 'n_neighbors': 10}
Training set score with best parameters: 0.8096241967449999
Test set score with best parameters: 0.7514826229268529
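
To give a sense of what the grid search does, note that every combination in a grid is evaluated on every fold: the Ridge grid above has 6 × 6 × 7 = 252 combinations, so with cv=5 it fits 1,260 models before refitting the best one. The sketch below runs a much smaller search on synthetic data; `X_toy`, `y_toy`, and `toy_grid` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data
rng = np.random.default_rng(3000)
X_toy = rng.normal(size=(60, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.2, size=60)

# Three candidate settings, each scored with 3-fold cross-validation
toy_grid = {"n_neighbors": [1, 3, 5]}
toy_search = GridSearchCV(KNeighborsRegressor(), toy_grid, cv=3)
toy_search.fit(X_toy, y_toy)

n_candidates = len(toy_search.cv_results_["params"])
print(toy_search.best_params_)
print(toy_search.best_score_)   # mean cross-validated R-squared of the best setting
```

For regressors, the score reported by GridSearchCV defaults to R-squared, which is why the output above is comparable to the R-squared values in the earlier sections.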


## **3.5 Model Testing**

In [44]:
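def test_tunedModels(): 
    for estimator_name, estimator in tuned_estimators.items(): 
        # calling fit on a GridSearchCV object re-runs the whole search, so scores for
        # randomized settings (e.g. Lasso with selection="random") can differ from section 3.4
        model = estimator.fit(X=X_train_selected, y=y_train)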

        print(estimator_name + ":")
        print("\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train_selected)))
        print("\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test_selected)))

        print("\tMean Absolute Error value for training set: ", mean_absolute_error(y_train, model.predict(X_train_selected)))
        print("\tMean Absolute Error value for testing set: ", mean_absolute_error(y_test, model.predict(X_test_selected)))

test_tunedModels()

Multiple Linear Regression:
	R-squared value for training set:  0.79479428827808
	R-squared value for testing set:  0.7501914806744027
	Mean Absolute Error value for training set:  54085431.19913053
	Mean Absolute Error value for testing set:  56346736.720943846
Ridge:
	R-squared value for training set:  0.7947942882260691
	R-squared value for testing set:  0.7501920391251091
	Mean Absolute Error value for training set:  54085326.36419574
	Mean Absolute Error value for testing set:  56346641.84350096
Lasso:
	R-squared value for training set:  0.7944263261088715
	R-squared value for testing set:  0.7527997051313
	Mean Absolute Error value for training set:  54183454.19942743
	Mean Absolute Error value for testing set:  56158171.55373882
KNeighborsRegressor:
	R-squared value for training set:  0.8096241967449999
	R-squared value for testing set:  0.7514826229268529
	Mean Absolute Error value for training set:  47232756.22483802
	Mean Absolute Error value for testing set:  54846268.640064

# **4. DISCUSSION**

**Which algorithms did you compare?**

Multiple Linear Regression, Ridge Regression, Lasso Regression, K-Neighbors Regressor, and Linear SVR.

**Which algorithm(s) revealed best performance? With what parameters?**

All algorithms performed similarly, with the exception of Linear SVR. The top-performing model varied from run to run, usually alternating between K-Neighbors Regressor and Lasso Regression. The best parameters found by the grid search were:

- K-Neighbors Regressor: 'metric': 'euclidean', 'n_neighbors': 10
- Lasso: 'alpha': 0.1, 'max_iter': 10, 'selection': 'random'
- Ridge: 'alpha': 100, 'max_iter': 1, 'solver': 'svd'
- Multiple Linear Regression: 'fit_intercept': True, 'normalize': False

**Which algorithm(s) should be used for your predictive model?**

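In general, the K-Neighbors Regressor has the highest average performance on the test sets. In the run shown in section 3.5, it has the lowest test MAE (about 54.8 million dollars), while its test R-squared (0.7515) is nearly identical to that of Lasso (0.7528).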

**Based on your findings, can we use the features in your dataset to predict the outcome variable you identified using the algorithms you've applied? (It is okay if the answer is no. We're interested in the process, not the performance of the model.)**

Yes. The trained models have reasonably high R-squared values (about 0.75 on the test set) with relatively low MAEs (about 55 to 56 million dollars, compared to worldwide grosses that reach into the billions). The models seem reliable.

**Discuss the ethical implications of your project.** 

The model has the potential to be used by, say, theaters to determine what movies they want to screen on a given day to maximize their revenue. Because of this, running this model on a movie that has a low opening day due to unfair circumstances (say, movies that released during lockdown or a country-encompassing snowstorm) would return an expected lower gross earnings for that movie, which would limit the screenings of that movie by theaters, hence further limiting their profits. This implies our model has the potential, if implemented, to limit the profits of certain films.

**Should your results be accepted at face value, why or why not? (e.g. any dataset bias or methodological issues?)**

In most cases, yes. However, there is bias in this dataset, as it already is expecting the movie to be popular (top 200 global grossing films per year), so if you tried to feed in a small indie film, chances are it would not handle it as expected.

**End this section with a conclusion paragraph containing some pointers for future work**

We could have included some form of popularity metric, such as trailer views on YouTube or a worldwide "popularity score", which would likely have a heavy influence on the model. There are many other interesting ways we could analyze movies. For example, we could obtain a movie's script, perform sentiment analysis on it, and generalize the result to a positive or negative emotion, or we could incorporate user reviews of the movie.

## **CONTRIBUTIONS**

The entirety of the report was written together by the group during in-person meetings. 

The idea and concept of the project (problem to solve, type of model, feature/target variables) were thought of by Avi. Web scraping was split between Nicholas (Box Office Mojo) and Dhivas (The Numbers).