# Predict the winner! Part 1
---

## Content Overview

After exploring the confusion matrices and classification accuracies of 10 models in our previous notebook, here we try to predict the actual Oscar winner directly: we use each classifier's `predict_proba` method to estimate every movie's probability of winning and take the movie with the highest probability as the predicted winner. We also use the `feature_importances_` attribute to visualise how much weight each feature carries in different classifiers (though not all classifiers expose `feature_importances_`).
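
As a minimal sketch of this idea on toy data (not the movie dataset; `X_toy`, `y_toy`, `X_new` and `titles` are hypothetical names): `predict_proba` returns one column per class, ordered as in `clf.classes_`, so column 1 holds the estimated probability of class `1` (winner). The predicted winner is the candidate with the largest value in that column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): a single feature, label 1 = "winner"
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# Candidates to rank (hypothetical titles and feature values)
titles = np.array(["Film A", "Film B", "Film C"])
X_new = np.array([[0.5], [4.5], [2.0]])

proba = clf.predict_proba(X_new)   # shape (n_samples, n_classes)
win_proba = proba[:, 1]            # column for class 1, see clf.classes_
predicted_winner = titles[np.argmax(win_proba)]
print(predicted_winner)
```

The same `[:, 1]` indexing is what the per-classifier cells later in this notebook rely on.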

A bit of introduction on what we are trying to do: we train on our movie dataset from 1999-2014 (16 years of data) and test on 2015-2019 (the most recent 5 years of data). We split the train and test data manually for two reasons. First, people tend to be more familiar with movies from recent years. Second, the `train_test_split` function would cause a problem for our goal: it splits the dataset **RANDOMLY** according to a given ratio, so we would not control which movies end up in the training set and which in the test set. Since we ultimately want to predict the actual Oscar winner of each recent year, all movies from those years must be held out together, and hence we decided not to use it.

Also, just to clarify: X_train is simply the ***features*** (x variables) of the train dataset, and y_test is the ***response*** (the y value, 'Oscar_winner') of the test dataset.


We considered 16 classifiers this time, categorized into 4 groups (for presentation purposes) as follows:

### #1 - The Trees

- Decision Tree 
- Random Forest
- Extremely Randomized Trees (Extra Trees)

### #2 - Boosting Classifiers

- AdaBoost 
- XGBoost
- Gradient Boosting
- Light Gradient Boosting Machine

### #3 - Naive Bayes

- Multinomial Naive Bayes
- Gaussian Naive Bayes
- Bernoulli Naive Bayes

### #4 - Others

- Linear Support Vector Classifier (Linear SVC)
- Support Vector Classifier (SVC)
- K-Nearest Neighbors Classifier (KNN)
- Logistic Regression
- Bagging Classifier
- Multi Layer Perceptron Classifier (Neural Network)

***Every classifier will be run with its default settings, for example RandomForestClassifier(). We will not tune the hyperparameters of any model in this notebook.***

In [186]:
import numpy as np
import pandas as pd
import pandas_profiling
import plotly.express as px
import pickle 
import graphviz
import matplotlib.pyplot as plt
import time
pd.set_option("display.max_colwidth", 200)

# Classifiers
from sklearn import tree 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier 
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Extra
from sklearn.preprocessing import normalize, scale, Normalizer, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline

In [2]:
data = pd.read_csv('data/movie_dataset_final.csv')
data.head()

Unnamed: 0,Year,Movie,Oscar_winner,Oscar_nominee,Runtime (min),Certificate,Directors,Actors,Metascore,IMDb_rating,...,Golden_Bear_winner,Golden_Bear_nominee,Golden_Lion_winner,Golden_Lion_nominee,PCA_winner,PCA_nominee,NYFCC_winner,NYFCC_nominee,OFCS_winner,OFCS_nominee
0,1999,Fight Club,0,0,139,R(A),David Fincher,"['Brad Pitt', 'Edward Norton', 'Meat Loaf', 'Zach Grenier']",66,8.8,...,0,0,0,0,0,0,0,0,0,1
1,1999,The Matrix,0,0,136,PG,Lana Wachowski Lilly Wachowski,"['Keanu Reeves', 'Laurence Fishburne', 'Carrie-Anne Moss', 'Hugo Weaving']",73,8.7,...,0,0,0,0,0,0,0,0,0,0
2,1999,The Green Mile,0,1,189,R(A),Frank Darabont,"['Tom Hanks', 'Michael Clarke Duncan', 'David Morse', 'Bonnie Hunt']",61,8.6,...,0,0,0,0,0,0,0,0,0,0
3,1999,American Beauty,1,1,122,R(A),Sam Mendes,"['Kevin Spacey', 'Annette Bening', 'Thora Birch', 'Wes Bentley']",84,8.3,...,0,0,0,0,1,1,0,1,1,1
4,1999,The Sixth Sense,0,1,107,PG,M. Night Shyamalan,"['Bruce Willis', 'Haley Joel Osment', 'Toni Collette', 'Olivia Williams']",64,8.1,...,0,0,0,0,0,0,0,0,0,0


In [3]:
# Train on 16 years, predict on recent 5 years
train = data[data['Year'] < 2015]
test = data[data['Year'] >= 2015]

In [4]:
train['Oscar_winner'].value_counts()

0    1584
1      16
Name: Oscar_winner, dtype: int64

In [5]:
movie_name = np.array(test["Movie"])
year = np.array(test["Year"])
oscar_w = np.array(test["Oscar_winner"])
oscar_n = np.array(test["Oscar_nominee"])

In [6]:
# feat. = feature
feat_film_elements = ['Runtime (min)', 'Action', 'Adventure', 'Animation', 'Biography', 
                      'Comedy', 'Crime', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Musical', 'Mystery', 
                      'Romance', 'Sci-Fi', 'Sport','Thriller', 'War', 'Western']  # Genre - binary  #20

feat_movie_critics = ['IMDb_rating', 'IMDb_votes','RT_rating','RT_review','Metascore']  #5

feat_commercial = ['Budget','Domestic (US) gross', 'International gross', 'Worldwide gross'] #4

feat_awards = ['GG_drama_winner', 'GG_drama_nominee', 'GG_comedy_winner', 'GG_comedy_nominee',
               'BAFTA_winner', 'BAFTA_nominee', 'DGA_winner', 'DGA_nominee',
               'PGA_winner', 'PGA_nominee', 'CCMA_winner', 'CCMA_nominee',
               'Golden_Palm_winner', 'Golden_Palm_nominee', 'Golden_Bear_winner', 'Golden_Bear_nominee',
               'Golden_Lion_winner', 'Golden_Lion_nominee', 'PCA_winner', 'PCA_nominee',
               'NYFCC_winner', 'NYFCC_nominee', 'OFCS_winner', 'OFCS_nominee'] #24

# Combine all feature groups (20 + 5 + 4 + 24 = 53 features)
all_features = feat_film_elements + feat_movie_critics + feat_commercial + feat_awards

X_train = train[all_features]
X_test = test[all_features]
y_train = train['Oscar_winner']
y_test = test['Oscar_winner']

In [7]:
# transform the data to standardize the values in the data 
preprocessor = ColumnTransformer(transformers=[('scale', StandardScaler(), all_features)])

In [8]:
def get_scores(model, X_train, y_train, X_test, y_test, show = True):
    # Error = 1 - accuracy; compute each score once and reuse it
    train_err = 1 - model.score(X_train, y_train)
    valid_err = 1 - model.score(X_test, y_test)
    if show: 
        print("Training error:   %.2f" % train_err)
        print("Validation error: %.2f" % valid_err)
        print('\n')
    return train_err, valid_err

In [173]:
def diff_class_ml(X_train, X_test, y_train, y_test):
    
    # Let's create an empty dictionary to store all the results
    results_dict = {}
    
    models = {
            # The Trees
            'Decision Tree': DecisionTreeClassifier(),
            'Random Forest' : RandomForestClassifier(),
            'Extra Trees' : ExtraTreesClassifier(),
        
            # Boosting
            'AdaBoost Classifier' : AdaBoostClassifier(),
            'XGBoost Classifier' : XGBClassifier(),
            'Gradient Boosting Classifier' : GradientBoostingClassifier(),
            'Light Gradient Boosting Machine': LGBMClassifier(),
        
            # Naive Bayes
            # 'Multinomial Naive Bayes' : MultinomialNB(),  # excluded: MultinomialNB rejects the negative values produced by StandardScaler
            'Gaussian Naive Bayes' : GaussianNB(),
            'Bernoulli Naive Bayes' : BernoulliNB(),
        
            # Others
            'Linear Support Vector Classifier' : LinearSVC(dual=False),
            'Support Vector Classifier' : SVC(),
            'K-Nearest Neighbors Classifier' : KNeighborsClassifier(),
            'Logistic Regression': LogisticRegression(), 
            'Bagging Classifier' : BaggingClassifier(),
            'Multi Layer Perceptron Classifier' : MLPClassifier(),
              }

    for model_name, model in models.items():
        t = time.time() 
        clf = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', model)])
        clf.fit(X_train, y_train);
        tr_err, valid_err = get_scores(clf, X_train, y_train, X_test, y_test, show = False)
        elapsed_time = time.time() - t
        results_dict[model_name] = [round(tr_err,3), round(valid_err,3), round(elapsed_time,3)]
    
    results_df = pd.DataFrame(results_dict).T
    results_df.columns = ["Train error", "Validation error", "Elapsed Time (s)"]
    return results_df

In [174]:
diff_class_ml(X_train, X_test, y_train, y_test)

Unnamed: 0,Train error,Validation error,Elapsed Time (s)
Decision Tree,0.0,0.018,0.014
Random Forest,0.0,0.014,0.162
Extra Trees,0.0,0.012,0.115
AdaBoost Classifier,0.0,0.014,0.165
XGBoost Classifier,0.001,0.014,0.155
Gradient Boosting Classifier,0.0,0.016,0.33
Light Gradient Boosting Machine,0.0,0.012,0.141
Gaussian Naive Bayes,0.024,0.034,0.023
Bernoulli Naive Bayes,0.028,0.044,0.021
Linear Support Vector Classifier,0.0,0.016,0.038


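This table only reports the train and validation error (1 - accuracy). With just 16 winners among 1,600 training movies, a low error says little about whether the winners themselves are identified, so we still need to examine the true positive rate ourselves.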
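
To illustrate why error alone can mislead here, the following sketch builds synthetic labels with the same class ratio as the training set (1584 zeros, 16 ones) and scores a baseline that never predicts a winner. `DummyClassifier` and `recall_score` come from scikit-learn; the feature matrix is a placeholder, not the movie data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with the same imbalance as the training set
y = np.array([0] * 1584 + [1] * 16)
X = np.zeros((len(y), 1))  # placeholder features; this baseline ignores them

# Baseline that always predicts the majority class (0 = not a winner)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

error = 1 - accuracy_score(y, y_pred)
tpr = recall_score(y, y_pred)  # true positive rate = recall for class 1
print(f"error = {error:.3f}, true positive rate = {tpr:.3f}")
# prints: error = 0.010, true positive rate = 0.000
```

An error of about 0.01 is therefore achievable without identifying a single winner, which is why the per-year rankings below are more informative than the table.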

---

# #1 - The Trees

<img src="img\tree.jpg" width="550" align="center"/>

## #1.1 - Decision Tree

In [41]:
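# Note: max_depth=8 departs from the default settings used for the other models.
# The estimator is named `dtree` to avoid shadowing the `tree` module imported from sklearn.
dtree = DecisionTreeClassifier(max_depth = 8)

my_tree = dtree.fit(X_train, y_train)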

In [42]:
tree_importances = pd.DataFrame(my_tree.feature_importances_.round(3), all_features, columns=["Importances Weightage"])

print(tree_importances)
print()
print('Score = ', my_tree.score(X_train, y_train))

                     Importances Weightage
Runtime (min)                        0.000
Action                               0.000
Adventure                            0.000
Animation                            0.000
Biography                            0.000
Comedy                               0.000
Crime                                0.000
Drama                                0.000
Family                               0.000
Fantasy                              0.000
History                              0.000
Horror                               0.000
Musical                              0.000
Mystery                              0.000
Romance                              0.000
Sci-Fi                               0.000
Sport                                0.000
Thriller                             0.000
War                                  0.000
Western                              0.000
IMDb_rating                          0.046
IMDb_votes                           0.000
RT_rating  

In [43]:
tree_importances['Features'] = all_features
fig = px.bar(tree_importances, x='Features', y='Importances Weightage', 
             title='Features Importances', height=600)
fig.show()
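
Because the importances are listed in feature order, the dominant features can be hard to spot. The sketch below, on synthetic data with hypothetical feature names, shows one way to sort the values of `feature_importances_` so the most influential features come first.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
feature_names = ["signal", "noise_1", "noise_2"]  # hypothetical names
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=feature_names)
y = (X["signal"] > 0).astype(int)                 # label depends on "signal" only

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ is available on fitted tree-based models; values sum to 1
importances = pd.Series(model.feature_importances_, index=feature_names)
top = importances.sort_values(ascending=False)
print(top.round(3))
```

Applied to the notebook's models, the same pattern would use `all_features` as the index and `.head(10)` to keep the ten largest values.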

In [44]:
pred_tree = my_tree.predict_proba(X_test)[:, 1]

tree_prediction = pd.DataFrame(year, columns=["Year"])
tree_prediction["Movie"] = movie_name
tree_prediction["Oscar_nominee"] = oscar_n
tree_prediction["Oscar_winner"] = oscar_w
tree_prediction['Predicted Win Rate'] = pred_tree

tree_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.0
1,2015,Mad Max: Fury Road,1,0,0.0
2,2015,The Martian,1,0,0.0
3,2015,Avengers: Age of Ultron,0,0,0.0
4,2015,The Revenant,1,0,1.0


In [45]:
normalized_prediction = tree_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / tree_prediction["Predicted Win Rate"][tree_prediction["Year"] == row["Year"]].sum()).round(3)


invalid value encountered in double_scalars
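
The warning above appears when every movie in a year receives a predicted probability of 0: the yearly sum is then 0 and the division produces NaN. A vectorised sketch that avoids this is shown below. The helper name `normalize_by_year` is hypothetical, and mapping zero-sum years to 0 is an assumption about the desired behaviour, not the notebook's original choice.

```python
import numpy as np
import pandas as pd

def normalize_by_year(df, col="Predicted Win Rate"):
    """Return a copy of df where `col` is divided by its per-year total."""
    out = df.copy()
    totals = out.groupby("Year")[col].transform("sum")
    # Zero totals become NaN so the division is well defined; resulting NaN are set to 0
    out[col] = (out[col] / totals.replace(0, np.nan)).fillna(0.0).round(3)
    return out

# Toy frame mimicking the prediction tables (hypothetical values)
toy = pd.DataFrame({
    "Year": [2015, 2015, 2017, 2017],
    "Movie": ["A", "B", "C", "D"],
    "Predicted Win Rate": [0.3, 0.1, 0.0, 0.0],
})
normalized = normalize_by_year(toy)
print(normalized)
```

Using `groupby(...).transform("sum")` also removes the row-by-row `iterrows` loop, since the yearly totals are computed once per group.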



In [46]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
4,2015,The Revenant,1,0,1.0
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.0
63,2015,Bãhubali: The Beginning,0,0,0.0
73,2015,Hardcore Henry,0,0,0.0
72,2015,Hitman: Agent 47,0,0,0.0
71,2015,Demolition,0,0,0.0
70,2015,Self/less,0,0,0.0
69,2015,Insidious: Chapter 3,0,0,0.0
68,2015,Home,0,0,0.0
67,2015,The Last Witch Hunter,0,0,0.0


In [47]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,1.0
100,2016,Deadpool,0,0,0.0
163,2016,The Founder,0,0,0.0
173,2016,Swiss Army Man,0,0,0.0
172,2016,Gods of Egypt,0,0,0.0
171,2016,Allegiant,0,0,0.0
170,2016,Dag II,0,0,0.0
169,2016,Dirty Grandpa,0,0,0.0
168,2016,Lights Out,0,0,0.0
167,2016,Bad Moms,0,0,0.0


In [48]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
200,2017,Logan,0,0,
201,2017,Thor: Ragnarok,0,0,
202,2017,Guardians of the Galaxy Vol. 2,0,0,
203,2017,Star Wars: Episode VIII - The Last Jedi,0,0,
204,2017,Wonder Woman,0,0,
205,2017,Dunkirk,1,0,
206,2017,Spider-Man: Homecoming,0,0,
207,2017,Get Out,1,0,
208,2017,It,0,0,
209,2017,Blade Runner 2049,0,0,


In [49]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
330,2018,Roma,1,0,1.0
300,2018,Avengers: Infinity War,0,0,0.0
364,2018,The Spy Who Dumped Me,0,0,0.0
373,2018,Enes Batur Hayal mi Gerçek mi?,0,0,0.0
372,2018,Eighth Grade,0,0,0.0
371,2018,Mandy,0,0,0.0
370,2018,Johnny English Strikes Again,0,0,0.0
369,2018,Robin Hood,0,0,0.0
368,2018,Death Wish,0,0,0.0
367,2018,Andhadhun,0,0,0.0


In [50]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
408,2019,1917,1,0,1.0
400,2019,Joker,1,0,0.0
463,2019,Gemini Man,0,0,0.0
473,2019,The Lego Movie 2: The Second Part,0,0,0.0
472,2019,Happy Death Day 2U,0,0,0.0
471,2019,Velvet Buzzsaw,0,0,0.0
470,2019,Cold Pursuit,0,0,0.0
469,2019,Crawl,0,0,0.0
468,2019,Bombshell,0,0,0.0
467,2019,Fighting with My Family,0,0,0.0


## #1.2 - Random Forest

In [175]:
forest = RandomForestClassifier()

my_forest = forest.fit(X_train, y_train)

In [176]:
forest_importances = pd.DataFrame(my_forest.feature_importances_.round(3), all_features, columns=["Importances Weightage"])

print(forest_importances)
print()
print('Score = ', my_forest.score(X_train, y_train))

                     Importances Weightage
Runtime (min)                        0.017
Action                               0.001
Adventure                            0.005
Animation                            0.000
Biography                            0.002
Comedy                               0.002
Crime                                0.006
Drama                                0.003
Family                               0.001
Fantasy                              0.003
History                              0.001
Horror                               0.000
Musical                              0.002
Mystery                              0.001
Romance                              0.002
Sci-Fi                               0.003
Sport                                0.009
Thriller                             0.009
War                                  0.001
Western                              0.000
IMDb_rating                          0.024
IMDb_votes                           0.029
RT_rating  

In [177]:
forest_importances['Features'] = all_features
fig = px.bar(forest_importances, x='Features', y='Importances Weightage', 
             title='Features Importances', height=600)
fig.show()

In [178]:
pred_forest = my_forest.predict_proba(X_test)[:, 1]

forest_prediction = pd.DataFrame(year, columns=["Year"])
forest_prediction["Movie"] = movie_name
forest_prediction["Oscar_nominee"] = oscar_n
forest_prediction["Oscar_winner"] = oscar_w
forest_prediction['Predicted Win Rate'] = pred_forest

forest_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.0
1,2015,Mad Max: Fury Road,1,0,0.12
2,2015,The Martian,1,0,0.0
3,2015,Avengers: Age of Ultron,0,0,0.0
4,2015,The Revenant,1,0,0.42


In [179]:
normalized_prediction = forest_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / forest_prediction["Predicted Win Rate"][forest_prediction["Year"] == row["Year"]].sum()).round(3)

In [180]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
4,2015,The Revenant,1,0,0.429
9,2015,Spotlight,1,1,0.235
14,2015,The Big Short,1,0,0.153
1,2015,Mad Max: Fury Road,1,0,0.122
36,2015,Straight Outta Compton,0,0,0.01
13,2015,Room,1,0,0.01
11,2015,Sicario,0,0,0.01
56,2015,Carol,0,0,0.01
6,2015,Jurassic World,0,0,0.01
33,2015,The Witch: A New-England Folktale,0,0,0.01


In [181]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,0.669
122,2016,Manchester by the Sea,1,0,0.068
105,2016,Arrival,1,0,0.042
116,2016,Moonlight,1,1,0.034
100,2016,Deadpool,0,0,0.025
131,2016,Ghostbusters,0,0,0.025
134,2016,Hidden Figures,1,0,0.017
104,2016,Doctor Strange,0,0,0.017
125,2016,Sully,0,0,0.017
130,2016,Lion,1,0,0.017


In [182]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
213,2017,The Shape of Water,1,1,0.47
211,2017,"Three Billboards Outside Ebbing, Missouri",1,0,0.104
207,2017,Get Out,1,0,0.104
224,2017,Lady Bird,1,0,0.07
230,2017,Call Me by Your Name,1,0,0.043
205,2017,Dunkirk,1,0,0.035
236,2017,"I, Tonya",0,0,0.026
257,2017,Phantom Thread,1,0,0.017
232,2017,Mother!,0,0,0.017
246,2017,The Post,1,0,0.017


In [183]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
330,2018,Roma,1,0,0.227
309,2018,Green Book,1,1,0.22
321,2018,BlacKkKlansman,1,0,0.091
310,2018,A Star Is Born,1,0,0.083
325,2018,The Favourite,1,0,0.061
303,2018,Bohemian Rhapsody,1,0,0.03
344,2018,Vice,1,0,0.03
308,2018,Spider-Man: Into the Spider-Verse,0,0,0.03
301,2018,Black Panther,1,0,0.023
334,2018,Crazy Rich Asians,0,0,0.023


In [184]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
408,2019,1917,1,0,0.426
402,2019,Once Upon a Time in Hollywood,1,0,0.168
404,2019,Parasite,1,1,0.11
400,2019,Joker,1,0,0.065
418,2019,Jojo Rabbit,1,0,0.039
407,2019,The Irishman,1,0,0.032
413,2019,Marriage Story,1,0,0.032
419,2019,The Lion King,0,0,0.019
464,2019,Dumbo,0,0,0.013
443,2019,Yesterday,0,0,0.013


## #1.3 - Extremely Randomized Trees (Extra Trees)

In [31]:
extratree = ExtraTreesClassifier()

extree = extratree.fit(X_train, y_train)

In [32]:
extree_importances = pd.DataFrame(extree.feature_importances_.round(3), all_features, columns=["Importances Weightage"])

print(extree_importances)
print()
print('Score = ', extree.score(X_train, y_train))

                     Importances Weightage
Runtime (min)                        0.011
Action                               0.005
Adventure                            0.007
Animation                            0.000
Biography                            0.004
Comedy                               0.005
Crime                                0.008
Drama                                0.003
Family                               0.000
Fantasy                              0.002
History                              0.002
Horror                               0.000
Musical                              0.001
Mystery                              0.001
Romance                              0.005
Sci-Fi                               0.005
Sport                                0.005
Thriller                             0.016
War                                  0.002
Western                              0.000
IMDb_rating                          0.018
IMDb_votes                           0.019
RT_rating  

In [33]:
extree_importances['Features'] = all_features
fig = px.bar(extree_importances, x='Features', y='Importances Weightage', 
             title='Features Importances', height=600)
fig.show()

In [34]:
pred_extree = extree.predict_proba(X_test)[:, 1]

extree_df = pd.DataFrame(year, columns=["Year"])
extree_df["Movie"] = movie_name
extree_df["Oscar_nominee"] = oscar_n
extree_df["Oscar_winner"] = oscar_w
extree_df['Predicted Win Rate'] = pred_extree

extree_df.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.0
1,2015,Mad Max: Fury Road,1,0,0.03
2,2015,The Martian,1,0,0.0
3,2015,Avengers: Age of Ultron,0,0,0.0
4,2015,The Revenant,1,0,0.29


In [35]:
normalized_prediction = extree_df.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / extree_df["Predicted Win Rate"][extree_df["Year"] == row["Year"]].sum()).round(3)   

In [36]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
9,2015,Spotlight,1,1,0.395
4,2015,The Revenant,1,0,0.382
14,2015,The Big Short,1,0,0.105
17,2015,Bridge of Spies,1,0,0.053
1,2015,Mad Max: Fury Road,1,0,0.039
96,2015,Anomalisa,0,0,0.013
13,2015,Room,1,0,0.013
68,2015,Home,0,0,0.0
65,2015,Vacation,0,0,0.0
66,2015,Paper Towns,0,0,0.0


In [37]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,0.921
116,2016,Moonlight,1,1,0.022
105,2016,Arrival,1,0,0.022
109,2016,Hacksaw Ridge,1,0,0.022
130,2016,Lion,1,0,0.011
100,2016,Deadpool,0,0,0.0
174,2016,Hunt for the Wilderpeople,0,0,0.0
173,2016,Swiss Army Man,0,0,0.0
172,2016,Gods of Egypt,0,0,0.0
171,2016,Allegiant,0,0,0.0


In [38]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
213,2017,The Shape of Water,1,1,0.705
211,2017,"Three Billboards Outside Ebbing, Missouri",1,0,0.227
224,2017,Lady Bird,1,0,0.034
207,2017,Get Out,1,0,0.034
200,2017,Logan,0,0,0.0
265,2017,Geostorm,0,0,0.0
274,2017,Going in Style,0,0,0.0
273,2017,Reis,0,0,0.0
272,2017,The Death of Stalin,0,0,0.0
271,2017,It Comes at Night,0,0,0.0


In [39]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
330,2018,Roma,1,0,0.474
309,2018,Green Book,1,1,0.329
325,2018,The Favourite,1,0,0.079
303,2018,Bohemian Rhapsody,1,0,0.053
321,2018,BlacKkKlansman,1,0,0.039
310,2018,A Star Is Born,1,0,0.013
344,2018,Vice,1,0,0.013
367,2018,Andhadhun,0,0,0.0
375,2018,Sorry to Bother You,0,0,0.0
374,2018,Outlaw King,0,0,0.0


In [40]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
408,2019,1917,1,0,0.47
402,2019,Once Upon a Time in Hollywood,1,0,0.322
400,2019,Joker,1,0,0.087
404,2019,Parasite,1,1,0.078
413,2019,Marriage Story,1,0,0.017
425,2019,Uncut Gems,0,0,0.009
418,2019,Jojo Rabbit,1,0,0.009
416,2019,Ford v Ferrari,1,0,0.009
475,2019,Scary Stories to Tell in the Dark,0,0,0.0
474,2019,Annabelle Comes Home,0,0,0.0


### Quick Comments on #1 - The Trees:

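> 1. Decision Tree is bad: in the run shown above it only outputs probabilities of 0 or 1, its top pick missed the actual winner in 2015, 2016, 2018 and 2019, and it produced no pick at all for 2017.
> 2. Random Forest is good: the actual winner is among its top 4 in every test year, although it is the top pick only in 2017.
> 3. Extra Trees is great!! The actual winner is among its top 4 in every test year and is the top pick in 2015 and 2017.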
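
One way to make such comparisons concrete is to take, for each year, the movie with the highest predicted win rate and check whether it actually won. The sketch below uses a toy frame with the same column layout as the prediction tables. The helper name `top_pick_hits` is hypothetical, and rows with NaN scores would need to be filled before calling it.

```python
import pandas as pd

def top_pick_hits(pred_df, col="Predicted Win Rate"):
    """For each year, select the highest-scored movie and keep its actual outcome."""
    idx = pred_df.groupby("Year")[col].idxmax()  # row label of the top pick per year
    picks = pred_df.loc[idx, ["Year", "Movie", "Oscar_winner", col]]
    return picks.reset_index(drop=True)

# Toy predictions (hypothetical values)
toy = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016],
    "Movie": ["A", "B", "C", "D"],
    "Oscar_winner": [0, 1, 1, 0],
    "Predicted Win Rate": [0.2, 0.7, 0.1, 0.6],
})
picks = top_pick_hits(toy)
print(picks)
print("Correct picks:", int(picks["Oscar_winner"].sum()), "of", len(picks))
# last line prints: Correct picks: 1 of 2
```

Running the helper on each model's prediction table would give a single hit count per classifier that can be compared directly.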

---
# #2 - Boosting Classifiers

<img src="img\rocket.png" width="500" align="center"/>

## #2.1 - AdaBoost Classifier

In [51]:
ABC = AdaBoostClassifier()

my_ABC = ABC.fit(X_train, y_train)

In [52]:
pred_ABC = my_ABC.predict_proba(X_test)[:, 1]

ABC_prediction = pd.DataFrame(year, columns=["Year"])
ABC_prediction["Movie"] = movie_name
ABC_prediction["Oscar_nominee"] = oscar_n
ABC_prediction["Oscar_winner"] = oscar_w
ABC_prediction['Predicted Win Rate'] = pred_ABC

ABC_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.186088
1,2015,Mad Max: Fury Road,1,0,0.339933
2,2015,The Martian,1,0,0.329134
3,2015,Avengers: Age of Ultron,0,0,0.247149
4,2015,The Revenant,1,0,0.452504


In [53]:
normalized_prediction = ABC_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / ABC_prediction["Predicted Win Rate"][ABC_prediction["Year"] == row["Year"]].sum()).round(3)

In [54]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
4,2015,The Revenant,1,0,0.02
9,2015,Spotlight,1,1,0.02
14,2015,The Big Short,1,0,0.017
2,2015,The Martian,1,0,0.015
8,2015,The Hateful Eight,0,0,0.015
1,2015,Mad Max: Fury Road,1,0,0.015
25,2015,Southpaw,0,0,0.014
39,2015,The Danish Girl,0,0,0.013
23,2015,Chappie,0,0,0.013
74,2015,Concussion,0,0,0.013


In [55]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,0.031
105,2016,Arrival,1,0,0.019
130,2016,Lion,1,0,0.018
122,2016,Manchester by the Sea,1,0,0.014
147,2016,Inferno,0,0,0.013
120,2016,Warcraft,0,0,0.012
195,2016,The Neon Demon,0,0,0.012
172,2016,Gods of Egypt,0,0,0.012
124,2016,Nocturnal Animals,0,0,0.012
152,2016,Allied,0,0,0.012


In [56]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
211,2017,"Three Billboards Outside Ebbing, Missouri",1,0,0.019
213,2017,The Shape of Water,1,1,0.018
205,2017,Dunkirk,1,0,0.015
207,2017,Get Out,1,0,0.014
224,2017,Lady Bird,1,0,0.013
261,2017,Downsizing,0,0,0.013
239,2017,Valerian and the City of a Thousand Planets,0,0,0.012
225,2017,Murder on the Orient Express,0,0,0.012
281,2017,Shot Caller,0,0,0.012
259,2017,What Happened to Monday,0,0,0.012


In [57]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
344,2018,Vice,1,0,0.019
309,2018,Green Book,1,1,0.019
321,2018,BlacKkKlansman,1,0,0.017
310,2018,A Star Is Born,1,0,0.015
330,2018,Roma,1,0,0.014
339,2018,A Simple Favor,0,0,0.013
303,2018,Bohemian Rhapsody,1,0,0.013
338,2018,Sicario: Day of the Soldado,0,0,0.013
377,2018,Suspiria,0,0,0.012
381,2018,The House That Jack Built,0,0,0.012


In [58]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
408,2019,1917,1,0,0.024
402,2019,Once Upon a Time in Hollywood,1,0,0.021
404,2019,Parasite,1,1,0.016
418,2019,Jojo Rabbit,1,0,0.016
407,2019,The Irishman,1,0,0.015
499,2019,In the Shadow of the Moon,0,0,0.013
441,2019,The Gentlemen,0,0,0.013
470,2019,Cold Pursuit,0,0,0.013
459,2019,The Highwaymen,0,0,0.012
443,2019,Yesterday,0,0,0.012


## #2.2 - XGBoost Classifier

In [59]:
XGB = XGBClassifier()

my_XGB = XGB.fit(X_train, y_train)

In [60]:
pred_XGB = my_XGB.predict_proba(X_test)[:, 1]

XGB_prediction = pd.DataFrame(year, columns=["Year"])
XGB_prediction["Movie"] = movie_name
XGB_prediction["Oscar_nominee"] = oscar_n
XGB_prediction["Oscar_winner"] = oscar_w
XGB_prediction['Predicted Win Rate'] = pred_XGB

XGB_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.000654
1,2015,Mad Max: Fury Road,1,0,0.0136
2,2015,The Martian,1,0,0.006137
3,2015,Avengers: Age of Ultron,0,0,0.000754
4,2015,The Revenant,1,0,0.551049


In [61]:
normalized_prediction = XGB_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / XGB_prediction["Predicted Win Rate"][XGB_prediction["Year"] == row["Year"]].sum()).round(3)

In [62]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
4,2015,The Revenant,1,0,0.707
9,2015,Spotlight,1,1,0.208
1,2015,Mad Max: Fury Road,1,0,0.017
14,2015,The Big Short,1,0,0.01
2,2015,The Martian,1,0,0.008
8,2015,The Hateful Eight,0,0,0.002
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.001
33,2015,The Witch: A New-England Folktale,0,0,0.001
35,2015,Jupiter Ascending,0,0,0.001
36,2015,Straight Outta Compton,0,0,0.001


In [63]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,0.903
105,2016,Arrival,1,0,0.02
122,2016,Manchester by the Sea,1,0,0.018
116,2016,Moonlight,1,1,0.01
130,2016,Lion,1,0,0.009
109,2016,Hacksaw Ridge,1,0,0.002
100,2016,Deadpool,0,0,0.002
170,2016,Dag II,0,0,0.001
174,2016,Hunt for the Wilderpeople,0,0,0.001
191,2016,Sing Street,0,0,0.001


In [64]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
213,2017,The Shape of Water,1,1,0.863
207,2017,Get Out,1,0,0.026
224,2017,Lady Bird,1,0,0.015
211,2017,"Three Billboards Outside Ebbing, Missouri",1,0,0.013
205,2017,Dunkirk,1,0,0.011
230,2017,Call Me by Your Name,1,0,0.005
236,2017,"I, Tonya",0,0,0.003
247,2017,Molly's Game,0,0,0.002
201,2017,Thor: Ragnarok,0,0,0.002
200,2017,Logan,0,0,0.002


In [65]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
330,2018,Roma,1,0,0.751
309,2018,Green Book,1,1,0.093
321,2018,BlacKkKlansman,1,0,0.024
344,2018,Vice,1,0,0.019
303,2018,Bohemian Rhapsody,1,0,0.008
310,2018,A Star Is Born,1,0,0.007
325,2018,The Favourite,1,0,0.004
300,2018,Avengers: Infinity War,0,0,0.003
367,2018,Andhadhun,0,0,0.002
334,2018,Crazy Rich Asians,0,0,0.002


In [66]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
408,2019,1917,1,0,0.581
402,2019,Once Upon a Time in Hollywood,1,0,0.336
404,2019,Parasite,1,1,0.019
418,2019,Jojo Rabbit,1,0,0.01
407,2019,The Irishman,1,0,0.009
400,2019,Joker,1,0,0.004
413,2019,Marriage Story,1,0,0.003
487,2019,Uri: The Surgical Strike,0,0,0.001
488,2019,Dolor y gloria,0,0,0.001
498,2019,Richard Jewell,0,0,0.001


## #2.3 - Gradient Boosting Classifier

In [67]:
GBC = GradientBoostingClassifier()

my_GBC = GBC.fit(X_train, y_train)

In [68]:
GBC_importances = pd.DataFrame(my_GBC.feature_importances_.round(3), all_features, columns=["Importances Weightage"])

print(GBC_importances)
print()
# Note: this is the accuracy on the training data, not on the 2015-2019 test set.
print('Score = ', my_GBC.score(X_train, y_train))

                     Importances Weightage
Runtime (min)                        0.000
Action                               0.000
Adventure                            0.000
Animation                            0.000
Biography                            0.000
Comedy                               0.000
Crime                                0.001
Drama                                0.000
Family                               0.000
Fantasy                              0.000
History                              0.000
Horror                               0.000
Musical                              0.000
Mystery                              0.000
Romance                              0.000
Sci-Fi                               0.000
Sport                                0.001
Thriller                             0.019
War                                  0.000
Western                              0.000
IMDb_rating                          0.037
IMDb_votes                           0.017
RT_rating  

In [69]:
GBC_importances['Features'] = all_features
fig = px.bar(GBC_importances, x='Features', y='Importances Weightage', 
             title='Features Importances', height=600)
fig.show()

In [70]:
pred_GBC = my_GBC.predict_proba(X_test)[:, 1]

GBC_prediction = pd.DataFrame(year, columns=["Year"])
GBC_prediction["Movie"] = movie_name
GBC_prediction["Oscar_nominee"] = oscar_n
GBC_prediction["Oscar_winner"] = oscar_w
GBC_prediction['Predicted Win Rate'] = pred_GBC

GBC_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,9e-06
1,2015,Mad Max: Fury Road,1,0,3.4e-05
2,2015,The Martian,1,0,5.7e-05
3,2015,Avengers: Age of Ultron,0,0,1e-05
4,2015,The Revenant,1,0,0.967508
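
The `[:, 1]` slice in the cell above relies on `predict_proba` returning one column per class, ordered as in the fitted model's `classes_` attribute. A minimal sketch with made-up numbers (the arrays below are illustrative stand-ins, not notebook output) shows how the "winner" column could be picked explicitly rather than by position:

```python
import numpy as np

# Illustrative stand-ins for my_GBC.classes_ and my_GBC.predict_proba(X_test).
classes = np.array([0, 1])            # 0 = did not win, 1 = Oscar winner
proba = np.array([[0.97, 0.03],
                  [0.10, 0.90]])      # one row per film, each row sums to 1

# Locate the column that belongs to class 1 instead of hard-coding the index.
win_col = int(np.flatnonzero(classes == 1)[0])
win_proba = proba[:, win_col]
print(win_proba)
```

With the labels 0 and 1 used here, the winner column is index 1, which matches the slice used throughout the notebook.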


In [71]:
normalized_prediction = GBC_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / GBC_prediction["Predicted Win Rate"][GBC_prediction["Year"] == row["Year"]].sum()).round(3)
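
The loop above divides each film's probability by the sum of the probabilities of all films from the same year, so the rates within a year add up to 1. The same computation can be sketched with a `groupby`/`transform`, which avoids iterating row by row. The small frame below is made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for GBC_prediction (values are illustrative).
prediction = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016],
    "Movie": ["A", "B", "C", "D"],
    "Predicted Win Rate": [0.6, 0.2, 0.1, 0.3],
})

# Sum of raw probabilities per year, broadcast back to every row.
year_total = prediction.groupby("Year")["Predicted Win Rate"].transform("sum")

normalized = prediction.copy()
normalized["Predicted Win Rate"] = (
    prediction["Predicted Win Rate"] / year_total
).round(3)
print(normalized)
```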

In [72]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
4,2015,The Revenant,1,0,0.993
9,2015,Spotlight,1,1,0.006
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.0
63,2015,Bãhubali: The Beginning,0,0,0.0
73,2015,Hardcore Henry,0,0,0.0
72,2015,Hitman: Agent 47,0,0,0.0
71,2015,Demolition,0,0,0.0
70,2015,Self/less,0,0,0.0
69,2015,Insidious: Chapter 3,0,0,0.0
68,2015,Home,0,0,0.0


In [73]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,0.999
100,2016,Deadpool,0,0,0.0
163,2016,The Founder,0,0,0.0
173,2016,Swiss Army Man,0,0,0.0
172,2016,Gods of Egypt,0,0,0.0
171,2016,Allegiant,0,0,0.0
170,2016,Dag II,0,0,0.0
169,2016,Dirty Grandpa,0,0,0.0
168,2016,Lights Out,0,0,0.0
167,2016,Bad Moms,0,0,0.0


In [74]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
213,2017,The Shape of Water,1,1,0.97
207,2017,Get Out,1,0,0.008
211,2017,"Three Billboards Outside Ebbing, Missouri",1,0,0.003
230,2017,Call Me by Your Name,1,0,0.001
205,2017,Dunkirk,1,0,0.001
200,2017,Logan,0,0,0.0
265,2017,Geostorm,0,0,0.0
274,2017,Going in Style,0,0,0.0
273,2017,Reis,0,0,0.0
272,2017,The Death of Stalin,0,0,0.0


In [75]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
330,2018,Roma,1,0,0.597
321,2018,BlacKkKlansman,1,0,0.048
310,2018,A Star Is Born,1,0,0.048
309,2018,Green Book,1,1,0.039
325,2018,The Favourite,1,0,0.019
372,2018,Eighth Grade,0,0,0.008
344,2018,Vice,1,0,0.006
377,2018,Suspiria,0,0,0.006
318,2018,Hereditary,0,0,0.006
314,2018,Annihilation,0,0,0.006


In [76]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
408,2019,1917,1,0,0.376
402,2019,Once Upon a Time in Hollywood,1,0,0.375
400,2019,Joker,1,0,0.249
463,2019,Gemini Man,0,0,0.0
473,2019,The Lego Movie 2: The Second Part,0,0,0.0
472,2019,Happy Death Day 2U,0,0,0.0
471,2019,Velvet Buzzsaw,0,0,0.0
470,2019,Cold Pursuit,0,0,0.0
469,2019,Crawl,0,0,0.0
468,2019,Bombshell,0,0,0.0


## #2.4 - Light Gradient Boosting Machine (LGBM) Classifier

In [77]:
LGBM = LGBMClassifier()

my_LGBM = LGBM.fit(X_train, y_train)

In [78]:
LGBM_importances = pd.DataFrame(my_LGBM.feature_importances_.round(3), all_features, columns=["Importances Weightage"])

print(LGBM_importances)
print()
# Note: this is the accuracy on the training data, not on the 2015-2019 test set.
print('Score = ', my_LGBM.score(X_train, y_train))

                     Importances Weightage
Runtime (min)                          276
Action                                   6
Adventure                                4
Animation                                0
Biography                               33
Comedy                                   7
Crime                                    5
Drama                                    7
Family                                   0
Fantasy                                  5
History                                  0
Horror                                   0
Musical                                  2
Mystery                                  1
Romance                                  3
Sci-Fi                                   0
Sport                                   32
Thriller                                13
War                                      0
Western                                  0
IMDb_rating                            207
IMDb_votes                             287
RT_rating  

In [79]:
LGBM_importances['Features'] = all_features
fig = px.bar(LGBM_importances, x='Features', y='Importances Weightage', 
             title='Features Importances', height=600)
fig.show()
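
The LightGBM numbers are not on the same scale as the Gradient Boosting ones. scikit-learn reports importances normalised to sum to 1, whereas `LGBMClassifier` by default (`importance_type='split'`) reports how many times each feature was used in a split. To compare the two charts, the counts can be rescaled to fractions. A minimal sketch, using a few of the counts printed above (only a subset, so the resulting fractions are illustrative):

```python
import pandas as pd

# A subset of the split counts from the LGBM table above.
split_counts = pd.Series({
    "Runtime (min)": 276,
    "Biography": 33,
    "IMDb_rating": 207,
    "IMDb_votes": 287,
})

# Rescale so the values sum to 1, like scikit-learn's feature_importances_.
fractions = (split_counts / split_counts.sum()).round(3)
print(fractions.sort_values(ascending=False))
```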

In [80]:
pred_LGBM = my_LGBM.predict_proba(X_test)[:, 1]

LGBM_prediction = pd.DataFrame(year, columns=["Year"])
LGBM_prediction["Movie"] = movie_name
LGBM_prediction["Oscar_nominee"] = oscar_n
LGBM_prediction["Oscar_winner"] = oscar_w
LGBM_prediction['Predicted Win Rate'] = pred_LGBM

LGBM_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,6.302638e-07
1,2015,Mad Max: Fury Road,1,0,0.02867063
2,2015,The Martian,1,0,0.0002920566
3,2015,Avengers: Age of Ultron,0,0,1.185882e-06
4,2015,The Revenant,1,0,0.002923325


In [81]:
normalized_prediction = LGBM_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / LGBM_prediction["Predicted Win Rate"][LGBM_prediction["Year"] == row["Year"]].sum()).round(3)

In [82]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
1,2015,Mad Max: Fury Road,1,0,0.844
4,2015,The Revenant,1,0,0.086
9,2015,Spotlight,1,1,0.049
2,2015,The Martian,1,0,0.009
14,2015,The Big Short,1,0,0.005
11,2015,Sicario,0,0,0.002
50,2015,Brooklyn,1,0,0.001
57,2015,The Visit,0,0,0.0
58,2015,Daddy's Home,0,0,0.0
74,2015,Concussion,0,0,0.0


In [83]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,0.99
105,2016,Arrival,1,0,0.003
122,2016,Manchester by the Sea,1,0,0.003
130,2016,Lion,1,0,0.002
116,2016,Moonlight,1,1,0.002
100,2016,Deadpool,0,0,0.0
166,2016,Neighbors 2: Sorority Rising,0,0,0.0
174,2016,Hunt for the Wilderpeople,0,0,0.0
173,2016,Swiss Army Man,0,0,0.0
172,2016,Gods of Egypt,0,0,0.0


In [84]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
207,2017,Get Out,1,0,0.857
213,2017,The Shape of Water,1,1,0.058
224,2017,Lady Bird,1,0,0.034
211,2017,"Three Billboards Outside Ebbing, Missouri",1,0,0.027
205,2017,Dunkirk,1,0,0.017
230,2017,Call Me by Your Name,1,0,0.006
266,2017,Fifty Shades Darker,0,0,0.0
275,2017,Death Note,0,0,0.0
274,2017,Going in Style,0,0,0.0
273,2017,Reis,0,0,0.0


In [85]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
309,2018,Green Book,1,1,0.355
310,2018,A Star Is Born,1,0,0.321
325,2018,The Favourite,1,0,0.13
330,2018,Roma,1,0,0.106
304,2018,A Quiet Place,0,0,0.034
303,2018,Bohemian Rhapsody,1,0,0.02
321,2018,BlacKkKlansman,1,0,0.009
308,2018,Spider-Man: Into the Spider-Verse,0,0,0.008
334,2018,Crazy Rich Asians,0,0,0.007
344,2018,Vice,1,0,0.005


In [86]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
404,2019,Parasite,1,1,0.535
402,2019,Once Upon a Time in Hollywood,1,0,0.381
408,2019,1917,1,0,0.069
400,2019,Joker,1,0,0.014
464,2019,Dumbo,0,0,0.0
474,2019,Annabelle Comes Home,0,0,0.0
473,2019,The Lego Movie 2: The Second Part,0,0,0.0
472,2019,Happy Death Day 2U,0,0,0.0
471,2019,Velvet Buzzsaw,0,0,0.0
470,2019,Cold Pursuit,0,0,0.0


### Quick Comments on #2 - Boosting Classifiers:

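> 1. AdaBoost performs poorly.
> 2. XGBoost is somewhat better, but still poor.
> 3. Gradient Boosting is biased: it puts almost all of a year's probability on a single film (0.993 on The Revenant in 2015, 0.999 on La La Land in 2016) and ranks the actual winner first only in 2017.
> 4. LGBM is great: it ranks the actual winner first in 2018 (Green Book) and 2019 (Parasite), although it misses 2015 to 2017.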
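
One way to back these comments with a number is to check, for each year, whether the film with the highest predicted win rate is the actual winner. A minimal sketch, assuming a frame shaped like `normalized_prediction`; the rows are the top two entries of the LGBM tables for 2017 to 2019 above, and the helper name `top1_hits` is ours, not part of the notebook:

```python
import pandas as pd

rows = [
    (2017, "Get Out", 0, 0.857),
    (2017, "The Shape of Water", 1, 0.058),
    (2018, "Green Book", 1, 0.355),
    (2018, "A Star Is Born", 0, 0.321),
    (2019, "Parasite", 1, 0.535),
    (2019, "Once Upon a Time in Hollywood", 0, 0.381),
]
df = pd.DataFrame(rows, columns=["Year", "Movie", "Oscar_winner", "Predicted Win Rate"])

def top1_hits(prediction):
    """Return, per year, whether the top-ranked film is the actual winner."""
    # Index label of the highest predicted win rate within each year.
    top_idx = prediction.groupby("Year")["Predicted Win Rate"].idxmax()
    top = prediction.loc[top_idx].set_index("Year")
    return top["Oscar_winner"] == 1

hits = top1_hits(df)
print(hits)
print("hit rate:", hits.mean())
```

Running the same helper on each classifier's full table would give one comparable hit rate per model.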


---
# #3 - Naive Bayes

<img src="img\naive bayes.png" width="500" align="center"/>

## #3.1 - Multinomial Naive Bayes

In [88]:
MNB = MultinomialNB()

my_MNB = MNB.fit(X_train, y_train)

In [89]:
pred_MNB = my_MNB.predict_proba(X_test)[:, 1]

MNB_prediction = pd.DataFrame(year, columns=["Year"])
MNB_prediction["Movie"] = movie_name
MNB_prediction["Oscar_nominee"] = oscar_n
MNB_prediction["Oscar_winner"] = oscar_w
MNB_prediction['Predicted Win Rate'] = pred_MNB

MNB_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,1.0
1,2015,Mad Max: Fury Road,1,0,0.0
2,2015,The Martian,1,0,1.0
3,2015,Avengers: Age of Ultron,0,0,0.0
4,2015,The Revenant,1,0,0.0


In [90]:
normalized_prediction = MNB_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / MNB_prediction["Predicted Win Rate"][MNB_prediction["Year"] == row["Year"]].sum()).round(3)

In [91]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.038
57,2015,The Visit,0,0,0.038
28,2015,Focus,0,0,0.038
33,2015,The Witch: A New-England Folktale,0,0,0.038
36,2015,Straight Outta Compton,0,0,0.038
39,2015,The Danish Girl,0,0,0.038
40,2015,Cinderella,0,0,0.038
45,2015,Pitch Perfect 2,0,0,0.038
46,2015,The Gift,0,0,0.038
60,2015,Hotel Transylvania 2,0,0,0.038


In [92]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
100,2016,Deadpool,0,0,0.029
166,2016,Neighbors 2: Sorority Rising,0,0,0.029
130,2016,Lion,1,0,0.029
132,2016,Me Before You,0,0,0.029
134,2016,Hidden Figures,1,0,0.029
137,2016,The Secret Life of Pets,0,0,0.029
139,2016,Sausage Party,0,0,0.029
142,2016,Kimi no na wa.,0,0,0.029
153,2016,Busanhaeng,0,0,0.029
157,2016,Sing,0,0,0.029


In [93]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
200,2017,Logan,0,0,0.026
263,2017,You Were Never Really Here,0,0,0.026
230,2017,Call Me by Your Name,1,0,0.026
233,2017,The Hitman's Bodyguard,0,0,0.026
238,2017,Darkest Hour,1,0,0.026
243,2017,Wonder,0,0,0.026
201,2017,Thor: Ragnarok,0,0,0.026
252,2017,The Killing of a Sacred Deer,0,0,0.026
253,2017,Happy Death Day,0,0,0.026
254,2017,Despicable Me 3,0,0,0.026


In [94]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
300,2018,Avengers: Infinity War,0,0,0.024
331,2018,Isle of Dogs,0,0,0.024
333,2018,Bumblebee,0,0,0.024
334,2018,Crazy Rich Asians,0,0,0.024
342,2018,The Ballad of Buster Scruggs,0,0,0.024
343,2018,The Nun,0,0,0.024
345,2018,Maze Runner: The Death Cure,0,0,0.024
301,2018,Black Panther,1,0,0.024
353,2018,"Love, Simon",0,0,0.024
357,2018,To All the Boys I've Loved Before,0,0,0.024


In [95]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
400,2019,Joker,1,0,0.024
469,2019,Crawl,0,0,0.024
441,2019,The Gentlemen,0,0,0.024
443,2019,Yesterday,0,0,0.024
445,2019,The Two Popes,0,0,0.024
447,2019,Ready or Not,0,0,0.024
449,2019,Escape Room,0,0,0.024
401,2019,Avengers: Endgame,0,0,0.024
453,2019,The King,0,0,0.024
454,2019,Brightburn,0,0,0.024


## #3.2 - Gaussian Naive Bayes

In [96]:
GNB = GaussianNB()

my_GNB = GNB.fit(X_train, y_train)

In [97]:
pred_GNB = my_GNB.predict_proba(X_test)[:, 1]

GNB_prediction = pd.DataFrame(year, columns=["Year"])
GNB_prediction["Movie"] = movie_name
GNB_prediction["Oscar_nominee"] = oscar_n
GNB_prediction["Oscar_winner"] = oscar_w
GNB_prediction['Predicted Win Rate'] = pred_GNB

GNB_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.9999999
1,2015,Mad Max: Fury Road,1,0,0.02480482
2,2015,The Martian,1,0,0.5887907
3,2015,Avengers: Age of Ultron,0,0,2.769628e-09
4,2015,The Revenant,1,0,0.03902907


In [98]:
normalized_prediction = GNB_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / GNB_prediction["Predicted Win Rate"][GNB_prediction["Year"] == row["Year"]].sum()).round(3)

In [99]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.205
6,2015,Jurassic World,0,0,0.205
12,2015,Fast & Furious 7,0,0,0.205
28,2015,Focus,0,0,0.204
2,2015,The Martian,1,0,0.121
16,2015,Fifty Shades of Grey,0,0,0.015
5,2015,Inside Out,0,0,0.009
4,2015,The Revenant,1,0,0.008
1,2015,Mad Max: Fury Road,1,0,0.005
7,2015,Ant-Man,0,0,0.003


In [100]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
100,2016,Deadpool,0,0,0.409
137,2016,The Secret Life of Pets,0,0,0.252
108,2016,Zootopia,0,0,0.203
107,2016,La La Land,1,0,0.028
106,2016,Rogue One,0,0,0.024
157,2016,Sing,0,0,0.02
105,2016,Arrival,1,0,0.009
117,2016,The Jungle Book,0,0,0.008
103,2016,Suicide Squad,0,0,0.004
110,2016,Split,0,0,0.004


In [101]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
254,2017,Despicable Me 3,0,0,0.261
217,2017,Beauty and the Beast,1,0,0.256
216,2017,Jumanji: Welcome to the Jungle,0,0,0.234
208,2017,It,0,0,0.146
204,2017,Wonder Woman,0,0,0.036
200,2017,Logan,0,0,0.023
206,2017,Spider-Man: Homecoming,0,0,0.007
201,2017,Thor: Ragnarok,0,0,0.005
207,2017,Get Out,1,0,0.004
210,2017,Baby Driver,0,0,0.002


In [102]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
300,2018,Avengers: Infinity War,0,0,0.185
301,2018,Black Panther,1,0,0.174
316,2018,Jurassic World: Fallen Kingdom,0,0,0.171
303,2018,Bohemian Rhapsody,1,0,0.171
306,2018,Aquaman,0,0,0.13
302,2018,Deadpool 2,0,0,0.058
307,2018,Venom,0,0,0.051
317,2018,Incredibles 2,0,0,0.03
310,2018,A Star Is Born,1,0,0.005
304,2018,A Quiet Place,0,0,0.003


In [103]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
400,2019,Joker,1,0,0.22
435,2019,Frozen II,0,0,0.22
401,2019,Avengers: Endgame,0,0,0.22
406,2019,Spider-Man: Far from Home,0,0,0.121
403,2019,Captain Marvel,0,0,0.091
419,2019,The Lion King,0,0,0.089
428,2019,Jumanji: The Next Level,0,0,0.009
414,2019,Aladdin,0,0,0.007
402,2019,Once Upon a Time in Hollywood,1,0,0.002
421,2019,It Chapter Two,0,0,0.002


## #3.3 - Bernoulli Naive Bayes

In [104]:
BNB = BernoulliNB()

my_BNB = BNB.fit(X_train, y_train)

In [105]:
pred_BNB = my_BNB.predict_proba(X_test)[:, 1]

BNB_prediction = pd.DataFrame(year, columns=["Year"])
BNB_prediction["Movie"] = movie_name
BNB_prediction["Oscar_nominee"] = oscar_n
BNB_prediction["Oscar_winner"] = oscar_w
BNB_prediction['Predicted Win Rate'] = pred_BNB

BNB_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,5.267954e-12
1,2015,Mad Max: Fury Road,1,0,0.08395476
2,2015,The Martian,1,0,0.085475
3,2015,Avengers: Age of Ultron,0,0,2.953273e-14
4,2015,The Revenant,1,0,1.0


In [112]:
normalized_prediction = BNB_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / BNB_prediction["Predicted Win Rate"][BNB_prediction["Year"] == row["Year"]].sum()).round(3)

In [113]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
4,2015,The Revenant,1,0,0.248
9,2015,Spotlight,1,1,0.248
14,2015,The Big Short,1,0,0.246
56,2015,Carol,0,0,0.213
2,2015,The Martian,1,0,0.021
1,2015,Mad Max: Fury Road,1,0,0.021
13,2015,Room,1,0,0.003
70,2015,Self/less,0,0,0.0
67,2015,The Last Witch Hunter,0,0,0.0
68,2015,Home,0,0,0.0


In [114]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,0.247
116,2016,Moonlight,1,1,0.247
122,2016,Manchester by the Sea,1,0,0.238
130,2016,Lion,1,0,0.15
105,2016,Arrival,1,0,0.119
100,2016,Deadpool,0,0,0.0
166,2016,Neighbors 2: Sorority Rising,0,0,0.0
174,2016,Hunt for the Wilderpeople,0,0,0.0
173,2016,Swiss Army Man,0,0,0.0
172,2016,Gods of Egypt,0,0,0.0


In [115]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
224,2017,Lady Bird,1,0,0.206
211,2017,"Three Billboards Outside Ebbing, Missouri",1,0,0.206
213,2017,The Shape of Water,1,1,0.206
205,2017,Dunkirk,1,0,0.202
230,2017,Call Me by Your Name,1,0,0.167
207,2017,Get Out,1,0,0.013
266,2017,Fifty Shades Darker,0,0,0.0
275,2017,Death Note,0,0,0.0
274,2017,Going in Style,0,0,0.0
273,2017,Reis,0,0,0.0


In [116]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
330,2018,Roma,1,0,0.239
321,2018,BlacKkKlansman,1,0,0.239
309,2018,Green Book,1,1,0.239
310,2018,A Star Is Born,1,0,0.238
325,2018,The Favourite,1,0,0.044
300,2018,Avengers: Infinity War,0,0,0.0
366,2018,Mile 22,0,0,0.0
374,2018,Outlaw King,0,0,0.0
373,2018,Enes Batur Hayal mi Gerçek mi?,0,0,0.0
372,2018,Eighth Grade,0,0,0.0


In [117]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
408,2019,1917,1,0,0.167
404,2019,Parasite,1,1,0.167
407,2019,The Irishman,1,0,0.167
418,2019,Jojo Rabbit,1,0,0.167
402,2019,Once Upon a Time in Hollywood,1,0,0.167
400,2019,Joker,1,0,0.114
413,2019,Marriage Story,1,0,0.049
474,2019,Annabelle Comes Home,0,0,0.0
473,2019,The Lego Movie 2: The Second Part,0,0,0.0
472,2019,Happy Death Day 2U,0,0,0.0


### Quick Comments on #3 - Naive Bayes: 

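> 1. MultinomialNB is bad: its raw probabilities are effectively all 0 or 1, so after normalisation every flagged film in a year gets the same rate (0.038 in 2015).
> 2. GaussianNB is bad: in most years it ranks non-nominated blockbusters (Star Wars, Deadpool, Avengers: Infinity War) at the top.
> 3. BernoulliNB is also bad, because it spreads the probability almost equally across the nominees, but it is not as bad as the two above.

Naive Bayes classifiers generally perform poorly on our prediction task.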
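
A likely reason for the all-or-nothing `MultinomialNB` output (our reading, not something the notebook verifies) is that the multinomial model multiplies each feature's log-probability by the feature value. With a feature as large as `IMDb_votes`, even a tiny difference between the two classes is blown up into a huge log-odds, and the posterior saturates at 0 or 1. The numbers below are hypothetical and only illustrate the effect:

```python
import math

def posterior(log_odds):
    """Numerically stable logistic function: P(class 1) from the log-odds."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

# Hypothetical per-unit difference in log-probability between the two classes.
delta = 1e-3

for votes in (10, 1_000, 500_000):
    # Contribution of one feature in multinomial NB:
    # feature value * (log theta_win - log theta_lose)
    in_favour = posterior(votes * delta)
    against = posterior(-votes * delta)
    print(votes, in_favour, against)
```

For small feature values the posterior stays close to 0.5, while a vote count in the hundreds of thousands pushes it to the extremes.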

---
# #4 - Others <img src="img\anime.gif" width="200" align="center"/>

## #4.1 - Linear Support Vector Classifier (Linear SVC)

In [118]:
from sklearn.calibration import CalibratedClassifierCV
linearSVC = LinearSVC(dual=False)
clf = CalibratedClassifierCV(linearSVC) 

my_LSVC = clf.fit(X_train, y_train) 
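
`LinearSVC` does not provide `predict_proba`, which is why the cell above wraps it in `CalibratedClassifierCV`: the wrapper fits the SVC on cross-validation folds and learns a mapping from its decision scores to probabilities. A minimal sketch on synthetic data (the dataset and the explicit `method`/`cv` values are assumptions for illustration; the notebook uses the defaults on `X_train`):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic binary classification data, standing in for X_train / y_train.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

raw_svc = LinearSVC(dual=False)
print(hasattr(raw_svc, "predict_proba"))   # the bare model has no probabilities

# Sigmoid (Platt) calibration with 5-fold cross-validation.
calibrated = CalibratedClassifierCV(LinearSVC(dual=False), method="sigmoid", cv=5)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:3])
print(proba)   # one column per class, rows sum to 1
```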

In [119]:
pred_LSVC = my_LSVC.predict_proba(X_test)[:, 1]

LSVC_prediction = pd.DataFrame(year, columns=["Year"])
LSVC_prediction["Movie"] = movie_name
LSVC_prediction["Oscar_nominee"] = oscar_n
LSVC_prediction["Oscar_winner"] = oscar_w
LSVC_prediction['Predicted Win Rate'] = pred_LSVC

LSVC_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.028125
1,2015,Mad Max: Fury Road,1,0,0.02333
2,2015,The Martian,1,0,0.011832
3,2015,Avengers: Age of Ultron,0,0,0.127343
4,2015,The Revenant,1,0,0.015393


In [120]:
normalized_prediction = LSVC_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / LSVC_prediction["Predicted Win Rate"][LSVC_prediction["Year"] == row["Year"]].sum()).round(3)

In [121]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
3,2015,Avengers: Age of Ultron,0,0,0.09
10,2015,Spectre,0,0,0.05
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.02
64,2015,The Good Dinosaur,0,0,0.018
35,2015,Jupiter Ascending,0,0,0.018
1,2015,Mad Max: Fury Road,1,0,0.017
37,2015,Tomorrowland,0,0,0.017
5,2015,Inside Out,0,0,0.014
6,2015,Jurassic World,0,0,0.013
18,2015,The Hunger Games: Mockingjay - Part 2,0,0,0.012


In [122]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
102,2016,Batman v Superman: Dawn of Justice,0,0,0.046
101,2016,Captain America: Civil War,0,0,0.03
106,2016,Rogue One,0,0,0.02
126,2016,Star Trek Beyond,0,0,0.02
145,2016,The Legend of Tarzan,0,0,0.017
103,2016,Suicide Squad,0,0,0.016
123,2016,Finding Dory,0,0,0.016
112,2016,X-Men: Apocalypse,0,0,0.015
149,2016,Deepwater Horizon,0,0,0.015
186,2016,Alice Through the Looking Glass,0,0,0.015


In [123]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
212,2017,Justice League,0,0,0.065
202,2017,Guardians of the Galaxy Vol. 2,0,0,0.064
209,2017,Blade Runner 2049,0,0,0.021
203,2017,Star Wars: Episode VIII - The Last Jedi,0,0,0.019
220,2017,Pirates of the Caribbean: Dead Men Tell No Tales,0,0,0.019
242,2017,Transformers: The Last Knight,0,0,0.018
231,2017,King Arthur: Legend of the Sword,0,0,0.017
239,2017,Valerian and the City of a Thousand Planets,0,0,0.016
234,2017,The Mummy,0,0,0.016
227,2017,The Fate of the Furious,0,0,0.014


In [124]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
313,2018,Solo: A Star Wars Story,0,0,0.07
300,2018,Avengers: Infinity War,0,0,0.028
369,2018,Robin Hood,0,0,0.025
301,2018,Black Panther,1,0,0.018
319,2018,Fantastic Beasts: The Crimes of Grindelwald,0,0,0.016
317,2018,Incredibles 2,0,0,0.015
336,2018,The Equalizer 2,0,0,0.014
327,2018,The Meg,0,0,0.014
347,2018,Pacific Rim: Uprising,0,0,0.012
312,2018,Mission: Impossible - Fallout,0,0,0.012


In [125]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
405,2019,Star Wars: Episode IX - The Rise of Skywalker,0,0,0.046
401,2019,Avengers: Endgame,0,0,0.027
427,2019,Dark Phoenix,0,0,0.023
431,2019,Terminator: Dark Fate,0,0,0.019
407,2019,The Irishman,1,0,0.018
419,2019,The Lion King,0,0,0.016
432,2019,6 Underground,0,0,0.015
412,2019,Alita: Battle Angel,0,0,0.014
430,2019,Godzilla: King of the Monsters,0,0,0.014
463,2019,Gemini Man,0,0,0.014


## #4.2 - Support Vector Classifier (SVC)

In [126]:
from sklearn import svm

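# probability=True is required for predict_proba; SVC then fits an extra
# calibration step with internal cross-validation, which slows down training.
SVM = svm.SVC(probability=True)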

my_SVM = SVM.fit(X_train, y_train)
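
Kernel SVCs are sensitive to feature scale. If `X_train` was not rescaled earlier in the notebook (this section does not show it), a feature such as `IMDb_votes` could dominate the distance computation. A minimal sketch of one common remedy, standardising inside a pipeline, on synthetic data; this is an assumption about a possible improvement, not what the notebook does:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for X_train / y_train.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 100_000   # mimic one feature on a much larger scale, like a vote count

# The scaler is fitted on the training data only and reused at prediction time.
scaled_svc = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
scaled_svc.fit(X, y)

proba = scaled_svc.predict_proba(X[:3])[:, 1]
print(proba)
```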

In [127]:
pred_SVM = my_SVM.predict_proba(X_test)[:, 1]

SVM_prediction = pd.DataFrame(year, columns=["Year"])
SVM_prediction["Movie"] = movie_name
SVM_prediction["Oscar_nominee"] = oscar_n
SVM_prediction["Oscar_winner"] = oscar_w
SVM_prediction['Predicted Win Rate'] = pred_SVM

SVM_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.005649
1,2015,Mad Max: Fury Road,1,0,0.01044
2,2015,The Martian,1,0,0.003242
3,2015,Avengers: Age of Ultron,0,0,0.000505
4,2015,The Revenant,1,0,0.00874


In [128]:
normalized_prediction = SVM_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / SVM_prediction["Predicted Win Rate"][SVM_prediction["Year"] == row["Year"]].sum()).round(3)

In [129]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
28,2015,Focus,0,0,0.146
66,2015,Paper Towns,0,0,0.01
17,2015,Bridge of Spies,1,0,0.01
27,2015,The Intern,0,0,0.01
58,2015,Daddy's Home,0,0,0.01
81,2015,Goosebumps,0,0,0.01
9,2015,Spotlight,1,1,0.01
22,2015,Spy,0,0,0.01
69,2015,Insidious: Chapter 3,0,0,0.01
14,2015,The Big Short,1,0,0.01


In [130]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
108,2016,Zootopia,0,0,0.013
125,2016,Sully,0,0,0.012
105,2016,Arrival,1,0,0.012
107,2016,La La Land,1,0,0.012
199,2016,The Boy,0,0,0.011
181,2016,Fences,1,0,0.011
154,2016,Jack Reacher: Never Go Back,0,0,0.011
153,2016,Busanhaeng,0,0,0.011
152,2016,Allied,0,0,0.011
151,2016,London Has Fallen,0,0,0.011


In [131]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
254,2017,Despicable Me 3,0,0,0.059
217,2017,Beauty and the Beast,1,0,0.033
250,2017,The Big Sick,0,0,0.011
261,2017,Downsizing,0,0,0.011
273,2017,Reis,0,0,0.011
272,2017,The Death of Stalin,0,0,0.011
271,2017,It Comes at Night,0,0,0.011
270,2017,Gerald's Game,0,0,0.011
269,2017,The Circle,0,0,0.011
267,2017,The Florida Project,0,0,0.011


In [132]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
306,2018,Aquaman,0,0,0.021
317,2018,Incredibles 2,0,0,0.021
350,2018,Creed II,0,0,0.011
362,2018,Blockers,0,0,0.011
372,2018,Eighth Grade,0,0,0.011
371,2018,Mandy,0,0,0.011
368,2018,Death Wish,0,0,0.011
367,2018,Andhadhun,0,0,0.011
366,2018,Mile 22,0,0,0.011
365,2018,12 Strong,0,0,0.011


In [133]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
400,2019,Joker,1,0,0.169
406,2019,Spider-Man: Far from Home,0,0,0.033
403,2019,Captain Marvel,0,0,0.026
461,2019,Maleficent: Mistress of Evil,0,0,0.009
472,2019,Happy Death Day 2U,0,0,0.009
471,2019,Velvet Buzzsaw,0,0,0.009
470,2019,Cold Pursuit,0,0,0.009
469,2019,Crawl,0,0,0.009
467,2019,Fighting with My Family,0,0,0.009
466,2019,Isn't It Romantic,0,0,0.009


## #4.3 - K-Nearest Neighbors Classifier

In [134]:
KNN = KNeighborsClassifier()

my_KNN = KNN.fit(X_train, y_train)

In [135]:
pred_KNN = my_KNN.predict_proba(X_test)[:, 1]

KNN_prediction = pd.DataFrame(year, columns=["Year"])
KNN_prediction["Movie"] = movie_name
KNN_prediction["Oscar_nominee"] = oscar_n
KNN_prediction["Oscar_winner"] = oscar_w
KNN_prediction['Predicted Win Rate'] = pred_KNN

KNN_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.0
1,2015,Mad Max: Fury Road,1,0,0.0
2,2015,The Martian,1,0,0.0
3,2015,Avengers: Age of Ultron,0,0,0.0
4,2015,The Revenant,1,0,0.0


In [136]:
normalized_prediction = KNN_prediction.copy()

for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Predicted Win Rate"] = \
        (row["Predicted Win Rate"] / KNN_prediction["Predicted Win Rate"][KNN_prediction["Year"] == row["Year"]].sum()).round(3)

In [137]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
26,2015,Insurgent,0,0,0.143
45,2015,Pitch Perfect 2,0,0,0.143
77,2015,Eddie the Eagle,0,0,0.143
30,2015,San Andreas,0,0,0.143
17,2015,Bridge of Spies,1,0,0.143
28,2015,Focus,0,0,0.143
58,2015,Daddy's Home,0,0,0.143
59,2015,Burnt,0,0,0.0
57,2015,The Visit,0,0,0.0
74,2015,Concussion,0,0,0.0


In [138]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
146,2016,Central Intelligence,0,0,0.25
107,2016,La La Land,1,0,0.25
132,2016,Me Before You,0,0,0.25
169,2016,Dirty Grandpa,0,0,0.25
100,2016,Deadpool,0,0,0.0
164,2016,Nerve,0,0,0.0
173,2016,Swiss Army Man,0,0,0.0
172,2016,Gods of Egypt,0,0,0.0
171,2016,Allegiant,0,0,0.0
170,2016,Dag II,0,0,0.0


In [139]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
254,2017,Despicable Me 3,0,0,0.111
286,2017,A Dog's Purpose,0,0,0.111
233,2017,The Hitman's Bodyguard,0,0,0.111
222,2017,The Greatest Showman,0,0,0.111
245,2017,The Lego Batman Movie,0,0,0.111
210,2017,Baby Driver,0,0,0.111
283,2017,Jigsaw,0,0,0.111
213,2017,The Shape of Water,1,1,0.111
217,2017,Beauty and the Beast,1,0,0.111
272,2017,The Death of Stalin,0,0,0.0


In [140]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
343,2018,The Nun,0,0,0.25
342,2018,The Ballad of Buster Scruggs,0,0,0.25
325,2018,The Favourite,1,0,0.125
304,2018,A Quiet Place,0,0,0.125
306,2018,Aquaman,0,0,0.125
323,2018,Ocean's Eight,0,0,0.125
300,2018,Avengers: Infinity War,0,0,0.0
365,2018,12 Strong,0,0,0.0
373,2018,Enes Batur Hayal mi Gerçek mi?,0,0,0.0
372,2018,Eighth Grade,0,0,0.0


In [141]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
400,2019,Joker,1,0,0.111
406,2019,Spider-Man: Far from Home,0,0,0.111
416,2019,Ford v Ferrari,1,0,0.111
421,2019,It Chapter Two,0,0,0.111
491,2019,Charlie's Angels,0,0,0.111
410,2019,John Wick: Chapter 3 - Parabellum,0,0,0.111
409,2019,Knives Out,0,0,0.111
403,2019,Captain Marvel,0,0,0.111
474,2019,Annabelle Comes Home,0,0,0.111
471,2019,Velvet Buzzsaw,0,0,0.0


## #4.4 - Logistic Regression 

In [142]:
LR = LogisticRegression()

my_LR = LR.fit(X_train, y_train)

In [143]:
pred_LR = my_LR.predict_proba(X_test)[:, 1]

LR_prediction = pd.DataFrame(year, columns=["Year"])
LR_prediction["Movie"] = movie_name
LR_prediction["Oscar_nominee"] = oscar_n
LR_prediction["Oscar_winner"] = oscar_w
LR_prediction['Predicted Win Rate'] = pred_LR

LR_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,7.091252e-13
1,2015,Mad Max: Fury Road,1,0,1.420343e-10
2,2015,The Martian,1,0,6.974571e-07
3,2015,Avengers: Age of Ultron,0,0,4.947518e-20
4,2015,The Revenant,1,0,1.031686e-08


In [144]:
# Rescale the win rates so that they sum to 1 within each year
normalized_prediction = LR_prediction.copy()

yearly_total = LR_prediction.groupby("Year")["Predicted Win Rate"].transform("sum")
normalized_prediction["Predicted Win Rate"] = (LR_prediction["Predicted Win Rate"] / yearly_total).round(3)

In [145]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
71,2015,Demolition,0,0,0.061
63,2015,Bãhubali: The Beginning,0,0,0.057
75,2015,The Invitation,0,0,0.056
84,2015,Knock Knock,0,0,0.054
73,2015,Hardcore Henry,0,0,0.053
80,2015,Bone Tomahawk,0,0,0.051
92,2015,Bajrangi Bhaijaan,0,0,0.045
83,2015,No Escape,0,0,0.041
57,2015,The Visit,0,0,0.039
90,2015,Beasts of No Nation,0,0,0.036


In [146]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
142,2016,Kimi no na wa.,0,0,0.119
150,2016,Dangal,0,0,0.101
191,2016,Sing Street,0,0,0.054
184,2016,The Autopsy of Jane Doe,0,0,0.052
170,2016,Dag II,0,0,0.052
167,2016,Bad Moms,0,0,0.048
183,2016,Money Monster,0,0,0.046
174,2016,Hunt for the Wilderpeople,0,0,0.044
153,2016,Busanhaeng,0,0,0.041
168,2016,Lights Out,0,0,0.038


In [147]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
259,2017,What Happened to Monday,0,0,0.052
292,2017,Brawl in Cell Block 99,0,0,0.048
291,2017,A Ghost Story,0,0,0.048
288,2017,The Babysitter,0,0,0.048
281,2017,Shot Caller,0,0,0.048
280,2017,The Ritual,0,0,0.047
263,2017,You Were Never Really Here,0,0,0.045
270,2017,Gerald's Game,0,0,0.045
252,2017,The Killing of a Sacred Deer,0,0,0.043
296,2017,The Square,0,0,0.042


In [148]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
391,2018,Sanju,0,0,0.064
398,2018,The Christmas Chronicles,0,0,0.063
387,2018,Les frères Sisters,0,0,0.048
399,2018,Extinction,0,0,0.046
373,2018,Enes Batur Hayal mi Gerçek mi?,0,0,0.045
397,2018,Capharnaum,0,0,0.045
396,2018,Leave No Trace,0,0,0.045
332,2018,Searching,0,0,0.045
376,2018,Mowgli,0,0,0.044
367,2018,Andhadhun,0,0,0.042


In [149]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
499,2019,In the Shadow of the Moon,0,0,0.045
484,2019,The Dead Don't Die,0,0,0.045
494,2019,The Laundromat,0,0,0.044
490,2019,In the Tall Grass,0,0,0.044
481,2019,Fractured,0,0,0.044
455,2019,Polar,0,0,0.044
497,2019,The Dirt,0,0,0.044
480,2019,Dolemite Is My Name,0,0,0.043
457,2019,Klaus,0,0,0.041
453,2019,The King,0,0,0.041


## #4.5 - Bagging Classifier

In [150]:
BC = BaggingClassifier()

my_BC = BC.fit(X_train, y_train)

In [151]:
pred_BC = my_BC.predict_proba(X_test)[:, 1]

BC_prediction = pd.DataFrame(year, columns=["Year"])
BC_prediction["Movie"] = movie_name
BC_prediction["Oscar_nominee"] = oscar_n
BC_prediction["Oscar_winner"] = oscar_w
BC_prediction['Predicted Win Rate'] = pred_BC

BC_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.0
1,2015,Mad Max: Fury Road,1,0,0.0
2,2015,The Martian,1,0,0.0
3,2015,Avengers: Age of Ultron,0,0,0.0
4,2015,The Revenant,1,0,0.6


In [152]:
# Rescale the win rates so that they sum to 1 within each year
normalized_prediction = BC_prediction.copy()

yearly_total = BC_prediction.groupby("Year")["Predicted Win Rate"].transform("sum")
normalized_prediction["Predicted Win Rate"] = (BC_prediction["Predicted Win Rate"] / yearly_total).round(3)

In [153]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
4,2015,The Revenant,1,0,0.545
9,2015,Spotlight,1,1,0.273
14,2015,The Big Short,1,0,0.182
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.0
64,2015,The Good Dinosaur,0,0,0.0
74,2015,Concussion,0,0,0.0
73,2015,Hardcore Henry,0,0,0.0
72,2015,Hitman: Agent 47,0,0,0.0
71,2015,Demolition,0,0,0.0
70,2015,Self/less,0,0,0.0


In [154]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,0.833
116,2016,Moonlight,1,1,0.167
100,2016,Deadpool,0,0,0.0
164,2016,Nerve,0,0,0.0
173,2016,Swiss Army Man,0,0,0.0
172,2016,Gods of Egypt,0,0,0.0
171,2016,Allegiant,0,0,0.0
170,2016,Dag II,0,0,0.0
169,2016,Dirty Grandpa,0,0,0.0
168,2016,Lights Out,0,0,0.0


In [155]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
213,2017,The Shape of Water,1,1,0.9
230,2017,Call Me by Your Name,1,0,0.1
275,2017,Death Note,0,0,0.0
273,2017,Reis,0,0,0.0
272,2017,The Death of Stalin,0,0,0.0
271,2017,It Comes at Night,0,0,0.0
270,2017,Gerald's Game,0,0,0.0
269,2017,The Circle,0,0,0.0
268,2017,xXx: Return of Xander Cage,0,0,0.0
267,2017,The Florida Project,0,0,0.0


In [156]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
330,2018,Roma,1,0,0.571
309,2018,Green Book,1,1,0.286
377,2018,Suspiria,0,0,0.143
300,2018,Avengers: Infinity War,0,0,0.0
364,2018,The Spy Who Dumped Me,0,0,0.0
372,2018,Eighth Grade,0,0,0.0
371,2018,Mandy,0,0,0.0
370,2018,Johnny English Strikes Again,0,0,0.0
369,2018,Robin Hood,0,0,0.0
368,2018,Death Wish,0,0,0.0


In [157]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
408,2019,1917,1,0,0.5
400,2019,Joker,1,0,0.25
402,2019,Once Upon a Time in Hollywood,1,0,0.187
418,2019,Jojo Rabbit,1,0,0.062
465,2019,Angel Has Fallen,0,0,0.0
474,2019,Annabelle Comes Home,0,0,0.0
473,2019,The Lego Movie 2: The Second Part,0,0,0.0
472,2019,Happy Death Day 2U,0,0,0.0
471,2019,Velvet Buzzsaw,0,0,0.0
470,2019,Cold Pursuit,0,0,0.0


## #4.6 - Multi Layer Perceptron Classifier (Neural Network)

In [158]:
MLP = MLPClassifier(max_iter=10000)

my_MLP = MLP.fit(X_train, y_train)

In [159]:
pred_MLP = my_MLP.predict_proba(X_test)[:, 1]

MLP_prediction = pd.DataFrame(year, columns=["Year"])
MLP_prediction["Movie"] = movie_name
MLP_prediction["Oscar_nominee"] = oscar_n
MLP_prediction["Oscar_winner"] = oscar_w
MLP_prediction['Predicted Win Rate'] = pred_MLP

MLP_prediction.head(5)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,6.3e-05
1,2015,Mad Max: Fury Road,1,0,0.0
2,2015,The Martian,1,0,0.0
3,2015,Avengers: Age of Ultron,0,0,0.0
4,2015,The Revenant,1,0,0.0


In [160]:
# Rescale the win rates so that they sum to 1 within each year
normalized_prediction = MLP_prediction.copy()

yearly_total = MLP_prediction.groupby("Year")["Predicted Win Rate"].transform("sum")
normalized_prediction["Predicted Win Rate"] = (MLP_prediction["Predicted Win Rate"] / yearly_total).round(3)

In [161]:
normalized_prediction[normalized_prediction["Year"] == 2015].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
0,2015,Star Wars: Episode VII - The Force Awakens,0,0,0.059
14,2015,The Big Short,1,0,0.059
60,2015,Hotel Transylvania 2,0,0,0.059
65,2015,Vacation,0,0,0.059
66,2015,Paper Towns,0,0,0.059
44,2015,Fantastic Four,0,0,0.059
69,2015,Insidious: Chapter 3,0,0,0.059
40,2015,Cinderella,0,0,0.059
73,2015,Hardcore Henry,0,0,0.059
31,2015,The Lobster,0,0,0.059


In [162]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
107,2016,La La Land,1,0,0.999
100,2016,Deadpool,0,0,0.0
163,2016,The Founder,0,0,0.0
173,2016,Swiss Army Man,0,0,0.0
172,2016,Gods of Egypt,0,0,0.0
171,2016,Allegiant,0,0,0.0
170,2016,Dag II,0,0,0.0
169,2016,Dirty Grandpa,0,0,0.0
168,2016,Lights Out,0,0,0.0
167,2016,Bad Moms,0,0,0.0


In [163]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
255,2017,Annabelle: Creation,0,0,0.5
211,2017,"Three Billboards Outside Ebbing, Missouri",1,0,0.5
200,2017,Logan,0,0,0.0
264,2017,Okja,0,0,0.0
273,2017,Reis,0,0,0.0
272,2017,The Death of Stalin,0,0,0.0
271,2017,It Comes at Night,0,0,0.0
270,2017,Gerald's Game,0,0,0.0
269,2017,The Circle,0,0,0.0
268,2017,xXx: Return of Xander Cage,0,0,0.0


In [164]:
normalized_prediction[normalized_prediction["Year"] == 2018].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
350,2018,Creed II,0,0,0.071
393,2018,The Grinch,0,0,0.071
318,2018,Hereditary,0,0,0.071
317,2018,Incredibles 2,0,0,0.071
384,2018,The First Purge,0,0,0.071
323,2018,Ocean's Eight,0,0,0.071
325,2018,The Favourite,1,0,0.071
339,2018,A Simple Favor,0,0,0.071
310,2018,A Star Is Born,1,0,0.071
308,2018,Spider-Man: Into the Spider-Verse,0,0,0.071


In [165]:
normalized_prediction[normalized_prediction["Year"] == 2019].sort_values("Predicted Win Rate", ascending=False).head(10)

Unnamed: 0,Year,Movie,Oscar_nominee,Oscar_winner,Predicted Win Rate
449,2019,Escape Room,0,0,0.998
463,2019,Gemini Man,0,0,0.0
473,2019,The Lego Movie 2: The Second Part,0,0,0.0
472,2019,Happy Death Day 2U,0,0,0.0
471,2019,Velvet Buzzsaw,0,0,0.0
470,2019,Cold Pursuit,0,0,0.0
469,2019,Crawl,0,0,0.0
468,2019,Bombshell,0,0,0.0
467,2019,Fighting with My Family,0,0,0.0
466,2019,Isn't It Romantic,0,0,0.0


### Quick Comments on #4 - Others:

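> 1. Linear SVC is bad
> 2. SVC is bad
> 3. KNN is bad (many movies tie on the same probability every year, so no single winner stands out)
> 4. Logistic Regression is bad (not a single nominee appears in the top 10 of any test year)
> 5. Bagging is okay... (the actual winner is ranked first in 2017 and second in 2015, 2016 and 2018)
> 6. MLP is bad (the rankings are erratic, e.g. Escape Room gets 0.998 in 2019)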

Although the performance of the Bagging Classifier is not excellent, it is still far better than the rest in this category.
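To make comments like the ones above less of an eyeball judgement, one option is to look up where the actual winner lands in each year's ranking. The snippet below is only a minimal sketch: the helper `winner_rank_per_year` is a hypothetical name that does not exist elsewhere in this notebook, and the toy frame simply reuses a few rows from the Bagging Classifier outputs above. Ties are given the worst rank of the tied group (`method="max"`), which is an assumption chosen so that a classifier handing the same probability to many movies does not get credit for a lucky tie.

```python
import pandas as pd

def winner_rank_per_year(prediction, score_col="Predicted Win Rate"):
    """Return, for each year, the rank of the actual winner (1 = top pick).

    Hypothetical helper: tied movies all receive the worst rank of their
    group, so a nine-way tie for first place counts as rank 9.
    """
    ranked = prediction.copy()
    ranked["Rank"] = ranked.groupby("Year")[score_col].rank(method="max", ascending=False)
    winners = ranked[ranked["Oscar_winner"] == 1]
    return winners.set_index("Year")["Rank"].astype(int)

# Toy frame built from a few rows of the Bagging Classifier outputs above
toy = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2017, 2017],
    "Movie": ["The Revenant", "Spotlight", "The Big Short",
              "The Shape of Water", "Call Me by Your Name"],
    "Oscar_winner": [0, 1, 0, 1, 0],
    "Predicted Win Rate": [0.545, 0.273, 0.182, 0.9, 0.1],
})

ranks = winner_rank_per_year(toy)
print(ranks)
```

In the notebook itself the same call could be made on any `normalized_prediction` frame, assuming it still carries the `Year` and `Oscar_winner` columns.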

---
# Conclusion:

We started with 16 classifiers. We eliminated the poorly performing ones (usually because they spread the winning probability equally across many movies) and were left with the top 4 finalists:

### #1 - The Trees
- Random Forest
- Extremely Randomized Trees (the best)

### #2 - Boosting
- Light Gradient Boosting Machine (LGBM) Classifier

### #4 - Others
- Bagging Classifier
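The elimination criterion mentioned above (equally distributed winning probabilities) can also be checked mechanically. The following is a rough sketch, not the procedure actually used in this notebook: `top_tie_size` is a hypothetical helper that counts how many movies share the highest predicted win rate in each year. The two toy frames copy rows from the 2016 KNN and Bagging Classifier outputs shown earlier. Exact float comparison is used here on the assumption that the win rates were already rounded to 3 decimals.

```python
import pandas as pd

def top_tie_size(prediction, score_col="Predicted Win Rate"):
    """Count, per year, how many movies share the highest predicted win rate."""
    top = prediction.groupby("Year")[score_col].transform("max")
    is_top = prediction[score_col] == top
    return is_top.groupby(prediction["Year"]).sum()

# Rows taken from the 2016 KNN output: four movies tie at 0.25
knn_2016 = pd.DataFrame({
    "Year": [2016] * 5,
    "Movie": ["Central Intelligence", "La La Land", "Me Before You",
              "Dirty Grandpa", "Deadpool"],
    "Predicted Win Rate": [0.25, 0.25, 0.25, 0.25, 0.0],
})

# Rows taken from the 2016 Bagging Classifier output: one clear top pick
bagging_2016 = pd.DataFrame({
    "Year": [2016] * 3,
    "Movie": ["La La Land", "Moonlight", "Deadpool"],
    "Predicted Win Rate": [0.833, 0.167, 0.0],
})

knn_ties = top_tie_size(knn_2016)
bagging_ties = top_tie_size(bagging_2016)
print(knn_ties, bagging_ties, sep="\n")
```

A tie size of 1 means the classifier commits to a single predicted winner for that year. Larger values indicate the "equally distributed" behaviour that led to a classifier being dropped.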

We shall now proceed to the <<Predict the Winner! Part 2>> notebook with our top 4 candidates.  d(`･∀･)b