# Movie Oscar Win Prediction

![](oscar1.jpg)

### Contents
1. Abstract
2. Dataset
3. Goal
4. Importing the dataset and the required libraries
5. Finding out the correlation between the attributes
6. Splitting the data
7. Prediction Model creation
    - Logistic Regression
    - Decision Tree Classifier
    - Random forest classifier
    - Gaussian NB
    - K-Nearest Neighbours
    - Support Vector Machine (SVM)
    - Gradient Boosting
    - MLP Classifier
    - Stochastic Gradient Descent (SGD)
    - Linear Discriminant Analysis (LDA)
8. Model Comparison
9. Conclusion

### Abstract
The Academy Awards, popularly known as the Oscars, are awards for artistic and technical merit in the film industry. They are regarded as one of the most significant and prestigious awards in the entertainment industry. Given annually by the Academy of Motion Picture Arts and Sciences (AMPAS), the awards are an international recognition of excellence in cinematic achievements, as assessed by the Academy's voting membership. The various category winners are awarded a copy of a golden statuette as a trophy, officially called the "Academy Award of Merit", although more commonly referred to by its nickname, the "Oscar". The statuette depicts a knight rendered in the Art Deco style.

### Dataset
The dataset was collected from Kaggle: https://www.kaggle.com/balakrishcodes/others?select=Movie_classification.csv

### Goal
The goal of this project is to build a prediction model that estimates a movie's chances of winning an Oscar.



### Importing required libraries and dataset

In [46]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

In [47]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from time import time
from sklearn.metrics import f1_score
from os import path, makedirs, walk
from joblib import dump, load
import json

In [49]:
data = pd.read_csv('Movie_classification.csv')

**Finding out the correlation between the attributes of the dataset**

In [50]:
data.corr(numeric_only=True)  # skip the text columns (3D_available, Genre); required on pandas >= 2.0

| |Marketing expense|Production expense|Multiplex coverage|Budget|Movie_length|Lead_ Actor_Rating|Lead_Actress_rating|Director_rating|Producer_rating|Critic_rating|Trailer_views|Time_taken|Twitter_hastags|Avg_age_actors|Num_multiplex|Collection|Start_Tech_Oscar|
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|Marketing expense|1.0|0.406583|-0.420972|-0.219247|0.352734|0.38005|0.379813|0.380069|0.376462|-0.184985|-0.443457|0.026019|0.013518|0.059204|0.383298|-0.389582|-0.013417|
|Production expense|0.406583|1.0|-0.763651|-0.391676|0.644779|0.706481|0.707956|0.707566|0.705819|-0.251565|-0.591657|0.015888|-0.000839|0.05581|0.707559|-0.484754|-0.024404|
|Multiplex coverage|-0.420972|-0.763651|1.0|0.302188|-0.73147|-0.768589|-0.769724|-0.769157|-0.764873|0.145555|0.581386|0.035922|0.004882|-0.092104|-0.915495|0.4293|-0.004017|
|Budget|-0.219247|-0.391676|0.302188|1.0|-0.240265|-0.208464|-0.203981|-0.201907|-0.205397|0.232361|0.602536|0.040773|0.030674|-0.064694|-0.282796|0.696304|-0.027148|
|Movie_length|0.352734|0.644779|-0.73147|-0.240265|1.0|0.746904|0.746493|0.747021|0.746707|-0.21783|-0.589318|-0.019984|0.00938|0.075198|0.673896|-0.377999|0.016291|
|Lead_ Actor_Rating|0.38005|0.706481|-0.768589|-0.208464|0.746904|1.0|0.997905|0.997735|0.994073|-0.169978|-0.490267|0.038494|0.014463|0.036794|0.706331|-0.251355|-0.035309|
|Lead_Actress_rating|0.379813|0.707956|-0.769724|-0.203981|0.746493|0.997905|1.0|0.998097|0.994003|-0.165992|-0.487536|0.038432|0.010239|0.038005|0.708257|-0.249459|-0.040356|
|Director_rating|0.380069|0.707566|-0.769157|-0.201907|0.747021|0.997735|0.998097|1.0|0.994126|-0.166638|-0.486452|0.0363|0.010077|0.04147|0.709364|-0.24665|-0.035768|
|Producer_rating|0.376462|0.705819|-0.764873|-0.205397|0.746707|0.994073|0.994003|0.994126|1.0|-0.167003|-0.487911|0.028988|0.00585|0.032542|0.703518|-0.2482|-0.043612|
|Critic_rating|-0.184985|-0.251565|0.145555|0.232361|-0.21783|-0.169978|-0.165992|-0.166638|-0.167003|1.0|0.228641|-0.015033|-0.023655|-0.049797|-0.128769|0.341288|-0.001084|
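
In the rows shown above, the four rating columns are almost perfectly correlated with each other (0.99 and higher), while every listed feature has a correlation with `Start_Tech_Oscar` below 0.05 in magnitude. A minimal sketch of how the features could be ranked by their correlation with the target follows. The helper name `rank_by_target_correlation` is hypothetical, the small frame holds five rows of the dataset for illustration only, and `numeric_only` assumes pandas 1.5 or newer.

```python
import pandas as pd


def rank_by_target_correlation(df, target):
    """Absolute Pearson correlation of each numeric column with `target`, largest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Five rows of the dataset, used here for illustration only
sample = pd.DataFrame({
    "Collection": [48000, 43200, 69400, 66800, 72400],
    "Twitter_hastags": [223.84, 243.456, 2022.4, 225.344, 225.792],
    "Genre": ["Thriller", "Drama", "Comedy", "Drama", "Drama"],  # text column, skipped
    "Start_Tech_Oscar": [1, 0, 1, 1, 1],
})

ranking = rank_by_target_correlation(sample, "Start_Tech_Oscar")
print(ranking)
```

On the full dataset the same call would be `rank_by_target_correlation(data, 'Start_Tech_Oscar')`.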


In [51]:
data.head()

| |Marketing expense|Production expense|Multiplex coverage|Budget|Movie_length|Lead_ Actor_Rating|Lead_Actress_rating|Director_rating|Producer_rating|Critic_rating|Trailer_views|3D_available|Time_taken|Twitter_hastags|Genre|Avg_age_actors|Num_multiplex|Collection|Start_Tech_Oscar|
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|20.1264|59.62|0.462|36524.125|138.7|7.825|8.095|7.91|7.995|7.94|527367|YES|109.6|223.84|Thriller|23|494|48000|1|
|1|20.5462|69.14|0.531|35668.655|152.4|7.505|7.65|7.44|7.47|7.44|494055|NO|146.64|243.456|Drama|42|462|43200|0|
|2|20.5458|69.14|0.531|39912.675|134.6|7.485|7.57|7.495|7.515|7.44|547051|NO|147.88|2022.4|Comedy|38|458|69400|1|
|3|20.6474|59.36|0.542|38873.89|119.3|6.895|7.035|6.92|7.02|8.26|516279|YES|185.36|225.344|Drama|45|472|66800|1|
|4|21.381|59.36|0.542|39701.585|127.7|6.92|7.07|6.815|7.07|8.26|531448|NO|176.48|225.792|Drama|55|395|72400|1|


**Information about the dataset**

In [53]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64  
 11  3D_available         506 non-null    object 
 12  Time_taken           494 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object 
 15  Avg_age_actors       506 non-null    int

In [54]:
data.columns

Index(['Marketing expense', 'Production expense', 'Multiplex coverage',
       'Budget', 'Movie_length', 'Lead_ Actor_Rating', 'Lead_Actress_rating',
       'Director_rating', 'Producer_rating', 'Critic_rating', 'Trailer_views',
       '3D_available', 'Time_taken', 'Twitter_hastags', 'Genre',
       'Avg_age_actors', 'Num_multiplex', 'Collection', 'Start_Tech_Oscar'],
      dtype='object')
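
The `data.info()` output shows that `Time_taken` has 494 non-null values out of 506 and that `3D_available` and `Genre` are text columns. The models in this notebook use only `Collection` and `Twitter_hastags`, so neither issue affects them. If more columns were used as features, they would need some preparation first. The sketch below shows one possible approach on a tiny hypothetical frame; median imputation and one-hot encoding are assumptions, not steps taken in the notebook.

```python
import pandas as pd

# Tiny hypothetical frame with the same column names as the dataset
sample = pd.DataFrame({
    "Time_taken": [109.6, None, 147.88, 185.36],
    "3D_available": ["YES", "NO", "NO", "YES"],
    "Genre": ["Thriller", "Drama", "Comedy", "Drama"],
})

prepared = sample.copy()
# Fill the missing durations with the column median
prepared["Time_taken"] = prepared["Time_taken"].fillna(prepared["Time_taken"].median())
# Encode the YES/NO flag as 1/0
prepared["3D_available"] = prepared["3D_available"].map({"YES": 1, "NO": 0})
# Expand the genre into one indicator column per category
prepared = pd.get_dummies(prepared, columns=["Genre"])

print(prepared)
```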

**Creating the dataset for the training and testing of the model**

In [55]:
x = data[['Collection','Twitter_hastags']]

In [56]:
y = data['Start_Tech_Oscar']

<a id="train-test-split"></a>
**Splitting the data into training and testing sets with `train_test_split`**
  
  * Import `train_test_split` from `sklearn.model_selection`
  * Split the dataset in an 80:20 ratio
  * `x_train1` and `y_train1` are the training sets
  * `x_test1` and `y_test1` are the testing sets
  * After the split, the data is ready for model training!

In [57]:
x_train1, x_test1, y_train1, y_test1 = train_test_split(x,y, test_size = 0.2)
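
Because the call above passes no `random_state`, each run draws a different split, so the scores reported later change from run to run. A minimal sketch of a reproducible split that also keeps the class ratio equal in both parts is shown below. The synthetic arrays and the seed value 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features, 40/60 class split
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 2))
y_demo = np.array([0] * 40 + [1] * 60)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo,
    test_size=0.2,
    random_state=42,   # fixed seed: the same split on every run
    stratify=y_demo,   # keep the 40/60 class ratio in both parts
)

print(len(x_tr), len(x_te), y_te.mean())
```

With the notebook's data this would read `train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)`.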

## Prediction Model Creation

In short, predictive modeling is a statistical technique using machine learning and data mining to predict and forecast likely future outcomes with the aid of historical and existing data. It works by analyzing current and historical data and projecting what it learns on a model generated to forecast likely outcomes.

Here we are going to predict whether a movie wins an Oscar, based on the provided dataset.

To do so, we fit several algorithms and then compare the models by their accuracy scores.

**We are going to use the following models:**
  * **Logistic Regression** : Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).
  
  
  * **Decision Tree Classifier** : Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
  
  
  * **Random Forest Classifier** : Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
  
  
  * **Gaussian NB** : Gaussian Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent given the class and that each feature follows a normal (Gaussian) distribution within each class. It is fast to train and serves well as a simple baseline.
  
  
  * **KNN algorithm** : K-Nearest Neighbours is one of the simplest Machine Learning algorithms based on the Supervised Learning technique. The K-NN algorithm stores all the available data and classifies a new data point by its similarity to the stored cases: the new point is assigned to the category most common among its K nearest neighbours. This means that when new data appears, it can easily be placed in a well-suited category.
  
  
  * **Support Vector Machine Algorithm** : Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
  

  * **Stochastic Gradient Descent Classifier** : Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the parameters/coefficients that minimize a cost function. In scikit-learn it is used for discriminative learning of linear classifiers under convex loss functions, such as linear SVM and logistic regression.
  
  
  * **Linear Discriminant Analysis (LDA)** : Linear Discriminant Analysis (LDA) is a linear classifier that models each class with a Gaussian distribution sharing a common covariance matrix, which yields a linear decision boundary. It can also be used as a supervised dimensionality reduction technique that projects the data onto the directions that best separate the classes. Here it is used as a classifier.
  
  
  * **Gradient Boosting** : Gradient boosting is a machine learning technique for regression, classification and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
  
  
  * **MLP Classifier** : Multi-layer perceptrons (MLP) make powerful classifiers that may provide superior performance compared with other classifiers, but are often criticized for the number of free parameters. Parameter selection for optimal performance is performed using measures that correlate well with generalisation error.


 
 
We are going to use these ten algorithms, and the best-performing one will be chosen based on the models' scores. Now let's check out the algorithms.

### Logistic Regression 

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).

In [59]:
logReg = LogisticRegression(max_iter = 1000)
logReg.fit(x_train1, y_train1)

LogisticRegression(max_iter=1000)

In [60]:
logReg.score(x_test1, y_test1)

0.5980392156862745
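
To make the role of the logistic function concrete, the sketch below fits a model on synthetic data (an assumption, not the movie dataset) and recomputes the predicted probabilities by hand as `1 / (1 + exp(-z))`, where `z` is the linear score built from the fitted coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = w . x + b, then the logistic (sigmoid) function
z = X_demo @ model.coef_.ravel() + model.intercept_[0]
manual_probability = 1.0 / (1.0 + np.exp(-z))

# The hand-computed values match the probability of class 1 from scikit-learn
print(np.allclose(manual_probability, model.predict_proba(X_demo)[:, 1]))
```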

### Decision Tree Classifier Algorithm

Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

In [61]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train1, y_train1)
dtc.score(x_test1, y_test1)

0.47058823529411764
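
The tree structure described above can be inspected directly. The sketch below trains a one-level tree on six hypothetical points and prints its decision rule with `export_text`; the feature name `feature_demo` and the data are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Six hypothetical one-feature samples: small values are class 0, large values class 1
X_demo = [[1], [2], [3], [10], [11], [12]]
y_demo = [0, 0, 0, 1, 1, 1]

# max_depth=1 allows a single internal node (one decision rule) and two leaves
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

# Text view of the internal node (the rule) and the leaf nodes (the outcomes)
print(export_text(tree, feature_names=["feature_demo"]))
print(tree.predict([[4], [9]]))
```

The split threshold lands halfway between the two groups, so 4 is assigned to class 0 and 9 to class 1.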

### Random Forest Classifier Algorithm

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.

In [62]:
rfc = RandomForestClassifier()
rfc.fit(x_train1, y_train1)
rfc.score(x_test1, y_test1)

0.5098039215686274
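
The ensemble idea can be checked in a few lines: in scikit-learn, the forest's class probabilities are the average of the probabilities predicted by its individual trees. The sketch below verifies this on synthetic data; the data and the choice of 25 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Collect each tree's class probabilities, then average over the trees
per_tree = np.stack([t.predict_proba(X_demo) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(np.allclose(averaged, forest.predict_proba(X_demo)))
```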

### K-Nearest Neighbours Algorithm

K-Nearest Neighbours is one of the simplest Machine Learning algorithms based on the Supervised Learning technique. The K-NN algorithm stores all the available data and classifies a new data point by its similarity to the stored cases: the new point is assigned to the category most common among its K nearest neighbours. This means that when new data appears, it can easily be placed in a well-suited category.

In [63]:
from sklearn.neighbors import KNeighborsClassifier  
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 ) 

In [64]:
classifier.fit(x_train1, y_train1)  

KNeighborsClassifier()

In [65]:
classifier.score(x_test1, y_test1)

0.6078431372549019
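
Because K-NN relies on distances, the scale of each feature matters: `Collection` is in the tens of thousands while `Twitter_hastags` is mostly in the hundreds, so the larger column can dominate the Euclidean distance. The sketch below uses four hypothetical points where the class depends only on the small-scale feature, and compares a plain 1-NN model with one that standardizes the features first. Adding `StandardScaler` is an assumption, not a step from the notebook.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical points: the class follows the second (small-scale) feature
X_demo = [[1000, 0.0], [5000, 0.1], [1100, 1.0], [5100, 1.1]]
y_demo = [0, 0, 1, 1]
query = [[1090, 0.05]]  # second feature says class 0

raw_knn = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)
scaled_knn = make_pipeline(
    StandardScaler(),                     # zero mean, unit variance per feature
    KNeighborsClassifier(n_neighbors=1),
).fit(X_demo, y_demo)

print(raw_knn.predict(query))     # nearest raw neighbour is [1100, 1.0]
print(scaled_knn.predict(query))  # nearest scaled neighbour is [1000, 0.0]
```

On raw values the large-scale feature decides the neighbour, so the query is assigned to class 1; after scaling, both features contribute comparably and the query is assigned to class 0.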

### Gaussian NB Algorithm

Gaussian Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent given the class and that each feature follows a normal (Gaussian) distribution within each class. It is fast to train and serves well as a simple baseline.

In [66]:
clf = GaussianNB()
clf.fit(x_train1, y_train1)

GaussianNB()

In [67]:
clf.score(x_test1, y_test1)

0.5

### Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.

In [68]:
svm = SVC()
svm.fit(x_train1, y_train1)

SVC()

In [69]:
svm.score(x_test1, y_test1)

0.5
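
The hyperplane can be made explicit with a linear kernel. Note that `SVC()` as used above defaults to the RBF kernel, which has no explicit weight vector, so `kernel="linear"` and the synthetic data in this sketch are assumptions for illustration. The score `w . x + b` is positive on one side of the hyperplane and negative on the other, and its sign gives the predicted class.

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic clusters, for illustration only
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-2.0, 1.0, size=(40, 2)),
    rng.normal(2.0, 1.0, size=(40, 2)),
])
y_demo = np.array([0] * 40 + [1] * 40)

svm_demo = SVC(kernel="linear").fit(X_demo, y_demo)

# Hyperplane parameters: weight vector w and offset b
w, b = svm_demo.coef_[0], svm_demo.intercept_[0]
scores = X_demo @ w + b

print(np.allclose(scores, svm_demo.decision_function(X_demo)))
print(np.array_equal((scores > 0).astype(int), svm_demo.predict(X_demo)))
```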

### Gradient Boosting
Gradient boosting is a machine learning technique for regression, classification and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

In [103]:
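from sklearn.ensemble import GradientBoostingClassifier
# Note: recent scikit-learn releases validate hyperparameters and require max_depth
# to be an integer >= 1 (or None), so max_depth=0.5 only runs on older versions.
clf = GradientBoostingClassifier(n_estimators=150, learning_rate=3.0,max_depth=0.5, random_state=0)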

In [104]:
clf.fit(x_train1, y_train1)

GradientBoostingClassifier(learning_rate=3.0, max_depth=0.5, n_estimators=150,
                           random_state=0)

In [105]:
clf.score(x_test1, y_test1)

0.6372549019607843
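
A sketch of the same model with settings that current scikit-learn releases accept is shown below. The values `max_depth=1` (decision stumps as weak learners) and `learning_rate=0.1` are assumptions, not the author's tuned choices, and the data is synthetic. `staged_predict` yields the ensemble's prediction after each added tree, which shows how accuracy develops as weak learners accumulate.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic non-linear binary problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] ** 2 > 1).astype(int)

gbc = GradientBoostingClassifier(
    n_estimators=150, learning_rate=0.1, max_depth=1, random_state=0
)
gbc.fit(X_demo, y_demo)

# Training accuracy after 1, 2, ..., 150 trees
staged_accuracy = [np.mean(pred == y_demo) for pred in gbc.staged_predict(X_demo)]

print(staged_accuracy[0], staged_accuracy[-1])
```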

### MLP Classifier
Multi-layer perceptrons (MLP) make powerful classifiers that may provide superior performance compared with other classifiers, but are often criticized for the number of free parameters. Parameter selection for optimal performance is performed using measures that correlate well with generalisation error.


In [127]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=2, max_iter=300)

In [128]:
clf.fit(x_train1, y_train1)

MLPClassifier(max_iter=300, random_state=2)

In [126]:
clf.score(x_test1, y_test1)

0.6372549019607843
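
To put a number on the "free parameters" mentioned above, the sketch below counts the weights and biases of an `MLPClassifier` with the default single hidden layer of 100 units, fitted on two input features as in this notebook. The data is synthetic and only serves to initialize the network.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic two-feature binary problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

mlp = MLPClassifier(random_state=2, max_iter=300).fit(X_demo, y_demo)

# One weight matrix and one bias vector per layer
print([w.shape for w in mlp.coefs_])
n_params = sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)
print(n_params)  # 2*100 + 100 + 100*1 + 1 = 401
```

Even this small default network has 401 trainable parameters, compared with three (two coefficients and an intercept) for logistic regression on the same two features.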

### Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the parameters/coefficients that minimize a cost function. In scikit-learn it is used for discriminative learning of linear classifiers under convex loss functions, such as linear SVM and logistic regression.


In [129]:
from sklearn.linear_model import SGDClassifier
sdg = SGDClassifier()


In [130]:
sdg.fit(x_train1, y_train1)

SGDClassifier()

In [131]:
sdg.score(x_test1, y_test1)

0.6372549019607843
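
The update rule behind SGD can be written out in plain Python. The sketch below trains a one-feature logistic model by visiting one sample at a time in random order and stepping against the gradient of the log loss. The six data points, the learning rate of 0.1 and the 50 epochs are assumptions for illustration; `SGDClassifier` adds refinements such as regularization and learning-rate schedules.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(w, b, xs, ys):
    """Average logistic loss of the model p = sigmoid(w * x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


# Hypothetical one-feature data: negative values are class 0, positive values class 1
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, learning_rate = 0.0, 0.0, 0.1
initial_loss = mean_log_loss(w, b, xs, ys)

order = list(range(len(xs)))
shuffler = random.Random(0)
for epoch in range(50):
    shuffler.shuffle(order)                      # "stochastic": random sample order
    for i in order:
        error = sigmoid(w * xs[i] + b) - ys[i]   # gradient of the loss w.r.t. the score
        w -= learning_rate * error * xs[i]       # step against the gradient
        b -= learning_rate * error

final_loss = mean_log_loss(w, b, xs, ys)
print(initial_loss, final_loss)
```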

### Linear Discriminant Analysis (LDA)
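Linear Discriminant Analysis (LDA) is a linear classifier that models each class with a Gaussian distribution sharing a common covariance matrix, which yields a linear decision boundary. It can also be used as a supervised dimensionality reduction technique that projects the data onto the directions that best separate the classes. Here it is used as a classifier.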

In [132]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()

In [133]:
lda.fit(x_train1, y_train1)

LinearDiscriminantAnalysis()

In [134]:
lda.score(x_test1, y_test1)

0.5098039215686274
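
The same fitted LDA object supports both uses. In the sketch below (synthetic data, for illustration only), `transform` projects the two features onto a single discriminant axis, since two classes allow at most one component, while `predict` returns class labels.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic clusters, for illustration only
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 2)),
    rng.normal(2.0, 1.0, size=(60, 2)),
])
y_demo = np.array([0] * 60 + [1] * 60)

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

projected = lda_demo.transform(X_demo)  # dimensionality reduction: 2 features -> 1 axis
predicted = lda_demo.predict(X_demo)    # classification

print(projected.shape, predicted.shape)
```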

**********

### Model Comparison
We have trained ten machine learning algorithms, and every one of them ran successfully. Each model was evaluated by its accuracy score on the test set. Now let's take a look at the scores of each model.

|Name of the Model|Accuracy Score|
|:---:|:---:|
|Logistic Regression|0.60|
|Decision Tree Classifier|0.47|
|Random Forest Classifier|0.51|
|Gaussian NB Algorithm|0.50|
|KNN Algorithm|0.61|
|Support Vector Machine Algorithm|0.50|
|Gradient Boosting|0.64|
|MLP Classifier|0.64|
|Stochastic Gradient Descent|0.64|
|Linear Discriminant Analysis|0.51|
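
Three models report exactly the same score, 0.6372549..., which equals 65/102. A model that predicted a single class for every test row would also score this if 65 of the 102 test rows belonged to that class; the notebook does not show the class balance, so this is only a possibility worth checking. One way to check is to score a majority-class baseline next to the real models and to use cross-validation instead of a single split. The sketch below does this on synthetic data whose features are pure noise, so no model should be expected to beat the baseline by much; the chosen models and `cv=5` are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: random features, 40/60 class split
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = np.array([0] * 40 + [1] * 60)

candidates = {
    "Majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Mean accuracy over 5 stratified folds for every candidate
mean_accuracy = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, acc in sorted(mean_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```

A model is only informative if it clearly beats the baseline row.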


### Conclusion
**Comparing the scores of all the algorithms, Gradient Boosting, MLP and SGD tie for the highest accuracy on this dataset (0.64). KNN (0.61) comes next, which is also a good score compared with the other algorithms.**

Best Fitted Models ranking - 
1. Gradient Boosting
2. MLP
3. SGD
4. KNN
5. Logistic Regression
6. Random Forest Classifier
7. LDA
8. SVM
9. Gaussian Naive Bayes
10. Decision Tree Classifier

Hooray! All ten models were trained and evaluated successfully!

## Hope this project will help you! Thank you!