<a href="https://colab.research.google.com/github/princetonds/PDS-Movie-Competition/blob/main/PDS_Competition_Starter_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the PDS Data Bowl!
### An introductory workshop to Princeton's first-ever data science competition
Authors: Joyce Luo (joyceluo@princeton.edu) & Nab Kar (nkar@princeton.edu)

This workshop aims to:

* help attendees get familiar with the competition format
* introduce the dataset and task
* get started with building their models

## Competition logistics
**Timeline**: The competition will run until May 1, at which point we will stop accepting submissions.

**Group size**: Students can work individually or in teams of two. Finding a partner is recommended!

**Dataset**: The dataset consists of over 3,000 movies with pertinent information (“features”) about each movie. Some features are:

* Movie Budget
* Genre(s)
* Cast
* Crew
* Production company
* Tagline for the movie (if one exists)

**Task**: Your task is to predict the movie’s revenue (in $) from all other features.

We will provide training data (*movies_data_public_train.csv*), which contains all features, including movie revenues. We will also provide testing data (*movies_data_test.csv*), which contains all features *except* revenues. Your job will be to submit predicted revenues for this test set on the [EvalAI competition page](https://eval.ai/web/challenges/challenge-page/871/overview).

**Submission and Evaluation**:
Each movie in the dataset is associated with an ID. We ask you to submit your predictions for the test set in a CSV file with the format:

    MovieID, prediction
    MovieID, prediction

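We’ll be evaluating your predictions using RMSLE, or Root Mean Squared Log Error. For each movie, this metric takes the log of one plus your prediction and the log of one plus the actual value and squares their difference; it then averages these squared differences over all movies and takes the square root. Here’s the equation for RMSLE, where $\hat{y}_i$ is your prediction, $y_i$ is the actual value, and $n$ is the number of movies: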
$$RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^n \left(\log(\hat{y}_i+1) - \log(y_i+1)\right)^2}$$
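
As a worked illustration of the formula above, here is a minimal pure-Python sketch. The function name `rmsle` and the revenue figures are made up for this example; they are not part of the competition code.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared log error, following the formula above."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2  # log1p(x) computes log(x + 1)
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical revenues (in dollars) for a small and a large movie
actual = [1_000_000, 100_000_000]

# Overshooting each movie by a factor of two gives nearly the same error,
# because the metric compares logs (relative size) rather than raw dollars.
small_only = rmsle(actual[:1], [2_000_000])
large_only = rmsle(actual[1:], [200_000_000])
print(small_only, large_only)  # both are close to log(2), about 0.693
```

In other words, being off by a factor of two costs about the same for a small movie as for a blockbuster, so large-revenue movies do not dominate the score.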

*This next part is very important.* On the backend we’ve divided up the test set into a public and private test set. When you submit your predictions for the test set, you can either submit to the public or private leaderboard. Submitting to the public leaderboard allows you to see how you measure up to other participants, and you can submit up to ten times per day. However, your final score will be evaluated on the private leaderboard, for which you only have three submissions total. Do not forget to submit to the private leaderboard, but also use your submissions sparingly! The reason we do this is so that participants can’t “overfit” to the public leaderboard by constantly tweaking parameters of their model and resubmitting.

More details can be found on the [EvalAI competition site](https://eval.ai/web/challenges/challenge-page/871/overview).

**Prizes**: Prizes can be found at the [PDS competition page](https://princetonds.io/competition.html).

## Loading in the Data
We provide a training set and a testing set that you should load into your notebook; both can be found on [GitHub](https://github.com/princetonds/PDS-Movie-Competition). You should also import the packages needed for data cleaning here.

In [None]:
!git clone https://github.com/princetonds/PDS-Movie-Competition

In [None]:
# imports important packages for data cleaning
import pandas as pd
import numpy as np

In [None]:
# loads in the training set and test set
df_train = pd.read_csv("./PDS-Movie-Competition/movies_data_public_train.csv")
df_test = pd.read_csv("./PDS-Movie-Competition/movies_data_test.csv")

Taking a quick look at the training data:

In [None]:
print(df_train.columns)
df_train.head()

The testing data looks similar but lacks the *revenue* column.

## Cleaning the Data

We provide some data cleaning examples to illustrate one possible way to clean the categorical data. Make sure to clean both the training and test sets in the same way. We only cleaned certain columns; if you want to use other columns, you should clean those as you see fit!
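
One way to keep the training and test cleaning identical is to wrap the steps in a single function and apply it to both dataframes. The sketch below is only an illustration: `split_list_string` and `max_items` are hypothetical names, and the raw string is assumed to look like the list-style entries that the cells below handle. Unlike the genre cleaning below, it also strips double quotes.

```python
def split_list_string(raw, max_items=None):
    """Turn a string such as "['Drama', 'Comedy']" into ['Drama', 'Comedy']."""
    # remove the same excess characters that the cells below remove
    for ch in ("'", '"', "[", "]"):
        raw = raw.replace(ch, "")
    # split on commas and strip surrounding white space from each item
    items = [item.strip() for item in raw.split(",")]
    # optionally keep only the first few items
    return items if max_items is None else items[:max_items]

# Hypothetical raw value, assumed to resemble an entry of the "genres" column
example = "['Drama', 'Comedy', 'Romance']"
print(split_list_string(example))               # ['Drama', 'Comedy', 'Romance']
print(split_list_string(example, max_items=2))  # ['Drama', 'Comedy']
```

With a helper like this, a call such as `df["genres"].apply(split_list_string)` could be run on both `df_train` and `df_test`, so the two sets cannot drift apart.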


In [None]:
df_train["cast"]

In [None]:
### TRAINING CLEANING
# gets first member of cast and puts in diff col
cast_list = []
# for each entry in the column "cast", get rid of excess characters
# and split the long string of cast members into an array
for i in range(len(df_train["cast"])): 
    cast = df_train["cast"].iloc[i]
    cast = cast.replace("'", "")
    cast = cast.replace('"', "")
    cast = cast.replace("[", "")
    cast = cast.replace("]", "")
    cast = cast.split(',')
    # get only the first member of the cast list
    cast_list.append(cast[0])

# put top cast members in new column
df_train["main_cast"] = cast_list

# Note: if you want to completely clean the cast column, you will need to do 
# some cleaning similar to "genres" shown below

In [None]:
df_train["main_cast"]

We clean other textual features similarly.

In [None]:
# gets director (first member of crew) and puts in diff col
crew_list = []
# for each entry in the column "crew", get rid of excess characters
# and split the long string of crew members into an array
for i in range(len(df_train["crew"])): 
    crew = df_train["crew"].iloc[i]
    crew = crew.replace("'", "") 
    crew = crew.replace('"', "")
    crew = crew.replace("[", "")
    crew = crew.replace("]", "")
    crew = crew.split(',')
    # get only first member of crew list, which is director
    crew_list.append(crew[0])
# put director in new column
df_train["director"] = crew_list

# cleans genres 
genres_list = []
# for each entry in the column "genres", get rid of excess characters
# and split the long string of genres into an array
for i in range(len(df_train["genres"])): 
    genres = df_train["genres"].iloc[i]
    genres = genres.replace("'", "") 
    genres = genres.replace("[", "")
    genres = genres.replace("]", "")
    genres = genres.split(',')
    # for each genre in the list, strip white spaces and add to new array
    array = []
    for gr in genres:
        gr = gr.strip()
        array.append(gr)
    genres_list.append(array) 
# replace "genres" column with cleaned list
df_train["genres"] = genres_list

# cleans production companies
comp_list = []
# for each entry in the column "production_companies", get rid of excess 
# characters and split the long string of companies into an array
for i in range(len(df_train["production_companies"])): 
    comp = df_train["production_companies"].iloc[i]
    comp = comp.replace("'", "") 
    comp = comp.replace("[", "")
    comp = comp.replace("]", "")
    comp = comp.split(',')
    # for each of the first five companies in the list, strip white spaces
    # and add to new array
    array = []
    for idx, co in enumerate(comp):
        if idx < 5:
            co = co.strip()
            array.append(co)
    comp_list.append(array) 
# replace "production_companies" column with cleaned list
df_train["production_companies"] = comp_list

We also clean the testing data in a similar manner.

In [None]:
### TESTING CLEANING 
# gets first member of cast and puts in diff col
cast_list = []
# for each entry in the column "cast", get rid of excess characters
# and split the long string of cast members into an array
for i in range(len(df_test["cast"])): 
    cast = df_test["cast"].iloc[i]
    cast = cast.replace("'", "")
    cast = cast.replace('"', "")
    cast = cast.replace("[", "")
    cast = cast.replace("]", "")
    cast = cast.split(',')
    # get only the first member of the cast list
    cast_list.append(cast[0])
# put top cast members in new column
df_test["main_cast"] = cast_list
# Note: if you want to completely clean the cast column, you will need to do 
# some cleaning similar to "genres" shown below

# gets director and puts in diff col
crew_list = []
# for each entry in the column "crew", get rid of excess characters
# and split the long string of crew members into an array
for i in range(len(df_test["crew"])): 
    crew = df_test["crew"].iloc[i]
    crew = crew.replace("'", "") 
    crew = crew.replace('"', "")
    crew = crew.replace("[", "")
    crew = crew.replace("]", "")
    crew = crew.split(',')
    # get only first member of crew list, which is director
    crew_list.append(crew[0])
# put director in new column
df_test["director"] = crew_list

# cleans genres 
genres_list = []
# for each entry in the column "genres", get rid of excess characters
# and split the long string of genres into an array
for i in range(len(df_test["genres"])): 
    genres = df_test["genres"].iloc[i]
    genres = genres.replace("'", "") 
    genres = genres.replace("[", "")
    genres = genres.replace("]", "")
    genres = genres.split(',')
    # for each genre in the list, strip white spaces and add to new array
    array = []
    for gr in genres:
        gr = gr.strip()
        array.append(gr)
    genres_list.append(array) 
# replace "genres" column with cleaned list
df_test["genres"] = genres_list

# cleans production companies
comp_list = []
# for each entry in the column "production_companies", get rid of excess 
# characters and split the long string of companies into an array
for i in range(len(df_test["production_companies"])): 
    comp = df_test["production_companies"].iloc[i]
    comp = comp.replace("'", "") 
    comp = comp.replace("[", "")
    comp = comp.replace("]", "")
    comp = comp.split(',')
    # for each of the first five companies in the list, strip white spaces
    # and add to new array
    array = []
    for idx, co in enumerate(comp):
        if idx < 5:
            co = co.strip()
            array.append(co)
    comp_list.append(array) 
# replace "production_companies" column with cleaned list
df_test["production_companies"] = comp_list

## Basic Exploratory Data Analysis
Here we have done some basic EDA, so you can visualize the data and make decisions about your models based on this EDA. Feel free to do more EDA to visualize other variables. 

In [None]:
# imports data visualization packages
import seaborn
import matplotlib.pyplot as plt

In [None]:
df_train.describe()

In [None]:
seaborn.histplot(df_train["revenue"])
plt.xlabel('revenue')
plt.ylabel('number of movies')
plt.title('Histogram of Revenue and Movies')
plt.show()

seaborn.histplot(df_train["runtime"])
plt.xlabel('runtime')
plt.ylabel('number of movies')
plt.title('Histogram of Runtime and Movies')
plt.show()

In [None]:
arr2 = df_train["genres"].values
from collections import Counter
c = Counter()
for xs in arr2:
    for x in set(xs):
        c[x] += 1
   
categories = []
vals = []
for i in c:
    categories.append(i)
    vals.append(c[i])

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(categories,vals, label='genres')
ax.set_title('Distribution of Genres across Movies')
plt.xticks(rotation = 90)
plt.show()

In [None]:
# numeric_only=True skips the text and list columns, which cannot be correlated
seaborn.heatmap(df_train.drop(columns=['Unnamed: 0', 'movie_id']).corr(numeric_only=True), annot=True, vmin=-1, vmax=1)
plt.show()

## Preparing certain textual variables to use in the model

Many of the given features are textual. Here we illustrate a simple example of how the textual data could be used in your model to predict revenue. Specifically, we define two dummy variables indicating whether a movie was directed by one of two high-profile directors.

In [None]:
df_train['MS_dummy'] = df_train['director'] == 'Martin Scorsese'
df_train['SS_dummy'] = df_train['director'] == 'Steven Spielberg'

df_test['MS_dummy'] = df_test['director'] == 'Martin Scorsese'
df_test['SS_dummy'] = df_test['director'] == 'Steven Spielberg'

In [None]:
df_train['MS_dummy'].sum(), df_train['SS_dummy'].sum()

In [None]:
df_test['MS_dummy'].sum(), df_test['SS_dummy'].sum()

We will evaluate two models, one which does not include these dummy variables and one which does.

## Creating and evaluating models

Here we provide a very simple example for modeling revenue. In order to analyze model performance, we suggest creating an additional validation set from the training data. We evaluate how good the model is based on the root mean squared log error (RMSLE) and $R^2$ value. If you predict negative numbers, you should threshold your predictions at 0 before calculating the RMSLE or you will get an error. When we evaluate your predictions, we will also be thresholding at 0, so it is okay to have negative values in your submission. 
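
The thresholding step described above can be sketched as follows. The arrays are hypothetical values chosen for illustration, and the RMSLE is computed directly from its definition with NumPy rather than with the competition's scoring code.

```python
import numpy as np

# Hypothetical actual revenues and model output containing a negative prediction
y_true = np.array([1_000_000.0, 5_000_000.0, 250_000.0])
y_pred = np.array([1_200_000.0, 4_000_000.0, -50_000.0])

# threshold predictions at 0 so that log(prediction + 1) is always defined
y_pred_clipped = np.clip(y_pred, 0, None)

# RMSLE computed directly from its definition
rmsle = np.sqrt(np.mean((np.log1p(y_pred_clipped) - np.log1p(y_true)) ** 2))
print(y_pred_clipped)
print(rmsle)
```

The clipped array can be passed to `mean_squared_log_error` in the same way as the predictions in the cells below.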

In [None]:
# imports necessary packages for modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error, r2_score

#### Without textual dummy variables

In [None]:
# creates your feature dataframe and outcome column
X = df_train[["budget", "runtime", "popularity"]]
y = df_train["revenue"]

# creates a training and validation set from the overall training data (20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# standardize features
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

# creates a random forest
rf = RandomForestRegressor(max_depth=7, random_state=0)

# fits the model
rf.fit(X_train, y_train)

# predicts outcomes for the training data
y_train_pred = rf.predict(X_train)

# calculates the RMSLE and R^2 for the training data
rf_train_rmsle = np.sqrt(mean_squared_log_error(y_train, y_train_pred))
rf_train_r2 = r2_score(y_train, y_train_pred)
print("RF training: RMSLE \t= %f & R^2 \t= %f" % (rf_train_rmsle, rf_train_r2))

# predicts outcomes for the validation data (this is important to see how your
# model extends to unseen data)
y_val_pred = rf.predict(X_val)

# calculates the RMSLE and R^2 for the validation data
rf_val_rmsle = np.sqrt(mean_squared_log_error(y_val, y_val_pred))
rf_val_r2 = r2_score(y_val, y_val_pred)
print("RF validation: RMSLE \t= %f & R^2 \t= %f" % (rf_val_rmsle, rf_val_r2))

#### With textual dummy variables

In [None]:
# creates your feature dataframe and outcome column
X = df_train[["budget", "runtime", "popularity", "MS_dummy", "SS_dummy"]]
y = df_train["revenue"]

# creates a training and validation set from the overall training data (20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# standardize features
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

# creates a random forest
rf2 = RandomForestRegressor(max_depth=7, random_state=0)

# fits the model
rf2.fit(X_train, y_train)

# predicts outcomes for the training data
y_train_pred = rf2.predict(X_train)

# calculates the RMSLE and R^2 for the training data
rf2_train_rmsle = np.sqrt(mean_squared_log_error(y_train, y_train_pred))
rf2_train_r2 = r2_score(y_train, y_train_pred)
print("RF2 training: RMSLE \t= %f & R^2 = %f" % (rf2_train_rmsle, rf2_train_r2))

# predicts outcomes for the validation data (this is important to see how your
# model extends to unseen data)
y_val_pred = rf2.predict(X_val)

# calculates the RMSLE and R^2 for the validation data
rf2_val_rmsle = np.sqrt(mean_squared_log_error(y_val, y_val_pred))
rf2_val_r2 = r2_score(y_val, y_val_pred)
print("RF2 validation: RMSLE \t= %f & R^2 = %f" % (rf2_val_rmsle, rf2_val_r2))

It appears that we can squeeze out marginal performance improvements by using the textual data! In fact, we suspect that the best models will use the textual features in creative ways to maximize performance.
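
As one possible direction, the same dummy-variable idea can be applied to the cleaned *genres* column. The sketch below uses a hypothetical three-row dataframe and a hand-picked genre list; both are assumptions for illustration, and we have not checked whether such features help on the real data.

```python
import pandas as pd

# Hypothetical miniature version of the cleaned "genres" column
toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "genres": [["Drama", "Romance"], ["Action"], ["Action", "Drama"]],
})

# genres to encode; in practice this list would be chosen from the training set
chosen_genres = ["Action", "Drama"]

for genre in chosen_genres:
    # True when the movie's genre list contains this genre
    toy[genre + "_dummy"] = toy["genres"].apply(lambda gs, g=genre: g in gs)

print(toy[["movie_id", "Action_dummy", "Drama_dummy"]])
```

As with the director dummies, the same columns would need to be created on both the training and test sets.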

#### Make sure to train on the entire training + validation set
Since the model was developed on a subset of the training data, you need to use all of the official training data to train the model that is submitted to the leaderboard.

In [None]:
# train new model on all of the training data that was provided
X = df_train[["budget", "runtime", "popularity", "MS_dummy", "SS_dummy"]]
y = df_train["revenue"]

rf_final = RandomForestRegressor(max_depth=7, random_state=0)
rf_final.fit(X, y)

## How to upload your output to EvalAI
The required submission format is a .csv file with two columns, *movie_id* and *predicted revenue*. You should submit a file like this when submitting to both the Public phase and Private phase of the competition on EvalAI.

In [None]:
# generate predictions on the official provided test set
X_test = df_test[["budget", "runtime","popularity", "MS_dummy", "SS_dummy"]]

# predict on the testing data
y_test_pred = rf_final.predict(X_test)

# get the movie ids and corresponding revenues you predicted and put in a new dataframe
results = df_test["movie_id"].copy()
results = pd.DataFrame(results)
results["predicted revenue"] = y_test_pred 

# convert dataframe into csv that you will submit to the leaderboard
results.to_csv(r'test_competition_entry.csv', index = False, header=True)

In [None]:
results

You can now upload these predictions to EvalAI to see how well your model performs against other models.
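
Before uploading, it may be worth sanity-checking the file. The sketch below is a hypothetical helper, not EvalAI's own validation: `check_submission` and the specific checks are our assumptions about what a well-formed submission looks like, based on the format described above.

```python
import pandas as pd

def check_submission(results, expected_ids):
    """Return a list of problems found in a submission dataframe."""
    problems = []
    if results.shape[1] != 2:
        problems.append("expected exactly two columns")
    if results.isna().any().any():
        problems.append("found missing values")
    ids = results.iloc[:, 0]  # first column is assumed to hold the movie ids
    if ids.duplicated().any():
        problems.append("found duplicate movie ids")
    if set(ids) != set(expected_ids):
        problems.append("movie ids do not match the test set")
    return problems

# Hypothetical miniature submission and test-set ids
toy_results = pd.DataFrame({"movie_id": [10, 11, 12],
                            "predicted revenue": [1.5e6, 2.0e7, 3.2e5]})
print(check_submission(toy_results, expected_ids=[10, 11, 12]))  # []
```

On the real data, the call would look like `check_submission(results, df_test["movie_id"])`; an empty list means none of these checks failed.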

## Good luck!