# Project 3


# Movie Genre Classification

Classify a movie genre based on its plot.

<img src="moviegenre.png"
     style="float: left; margin-right: 10px;" />




https://www.kaggle.com/c/miia4201-202019-p3-moviegenreclassification/overview

### Data

Input:
- movie plot

Output:
- Probability that the movie belongs to each genre


### Evaluation

- 20% API
- 30% Report with all the details of the solution, the analysis, and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format, and must be self-contained.
- 50% Performance in the Kaggle competition. The grade for each group is proportional to its ranking in the competition: the group in first place obtains 5 points, and 0.25 points are subtracted for each position below it (first place: 5 points, second place: 4.75 points, third place: 4.50 points, ..., eleventh place: 2.50 points, twelfth place: 2.25 points).
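
The ranking rule for the Kaggle component can be written as a one-line formula. The sketch below is only an illustration: the function name `competition_grade` is hypothetical, and flooring the grade at 0 is an assumption, since the rule does not say what happens beyond the positions listed.

```python
def competition_grade(rank: int) -> float:
    """Grade for the Kaggle component given a 1-based leaderboard rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    # First place earns 5 points; each position below it loses 0.25 points.
    # Flooring at 0 is an assumption, not part of the stated rule.
    return max(0.0, 5.0 - 0.25 * (rank - 1))


for rank in (1, 2, 3, 11, 12):
    print(rank, competition_grade(rank))
```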

- The project must be carried out in the groups assigned for module 4.
- Use clear and rigorous procedures.
- The project is due on July 12, 2020, at 11:59 pm, through Sicua + (upload the API and the report in PDF format).
- No projects will be accepted after the deadline or through any channel other than the one established.




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Sample Submission

In [1]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [2]:
dataTraining = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [3]:
dataTraining.head()

| | year | title | plot | genres | rating |
|---|---|---|---|---|---|
| 3107 | 2003 | Most | most is the story of a single father who takes... | ['Short', 'Drama'] | 8.0 |
| 900 | 2008 | How to Be a Serial Killer | a serial killer decides to teach the secrets o... | ['Comedy', 'Crime', 'Horror'] | 5.6 |
| 6724 | 1941 | A Woman's Face | in sweden , a female blackmailer with a disfi... | ['Drama', 'Film-Noir', 'Thriller'] | 7.2 |
| 4704 | 1954 | Executive Suite | in a friday afternoon in new york , the presi... | ['Drama'] | 7.4 |
| 2582 | 1990 | Narrow Margin | in los angeles , the editor of a publishing h... | ['Action', 'Crime', 'Thriller'] | 6.6 |


In [4]:
dataTesting.head()

| | year | title | plot |
|---|---|---|---|
| 1 | 1999 | Message in a Bottle | who meets by fate , shall be sealed by fate .... |
| 4 | 1978 | Midnight Express | the true story of billy hayes , an american c... |
| 5 | 1996 | Primal Fear | martin vail left the chicago da ' s office to ... |
| 6 | 1950 | Crisis | husband and wife americans dr . eugene and mr... |
| 7 | 1959 | The Tingler | the coroner and scientist dr . warren chapin ... |


In [5]:
dataTesting.shape[0]

3383

In [6]:
dataTraining.shape[0]

7895

### Create count vectorizer


In [7]:
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)
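
To make the document-term matrix concrete, here is a minimal sketch on a toy corpus. The two sentences and the `toy_` names are invented for illustration and are not part of the dataset. With `max_features=5` only the five most frequent tokens are kept as columns, just as `max_features=1000` caps the real matrix at 1000 columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (invented for illustration, not taken from the dataset).
toy_plots = [
    "a detective hunts a killer",
    "a killer robot hunts a detective in space",
]

# Keep only the 5 most frequent tokens across the corpus.
toy_vect = CountVectorizer(max_features=5)
toy_dtm = toy_vect.fit_transform(toy_plots)

print(toy_dtm.shape)                 # one row per plot, one column per kept token
print(sorted(toy_vect.vocabulary_))  # the tokens that survived the cut
print(toy_dtm.toarray())             # raw counts per plot and token
```

The default tokenizer ignores single-character tokens, so the article "a" never becomes a feature.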

In [8]:
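# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vect.get_feature_names_out()[:50].tolist())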

['abandoned', 'able', 'about', 'accepts', 'accident', 'accidentally', 'across', 'act', 'action', 'actor', 'actress', 'actually', 'adam', 'adult', 'adventure', 'affair', 'after', 'again', 'against', 'age', 'agent', 'agents', 'ago', 'agrees', 'air', 'alan', 'alex', 'alice', 'alien', 'alive', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'america', 'american', 'among', 'an', 'and', 'angeles', 'ann', 'anna', 'another', 'any', 'anyone', 'anything']


### Create y

In [9]:
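import ast

# The genres column stores each list as a string, e.g. "['Short', 'Drama']".
# ast.literal_eval parses Python literals only, so it is safer than eval.
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)

# One binary column per genre.
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])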

In [10]:
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])
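
A small sketch of what the binarizer does, using three genre strings in the same format as the `genres` column. The `raw`, `parsed`, `mlb`, and `y` names are illustrative only. Each row of the result has a 1 in the column of every genre the movie belongs to, and the columns follow the alphabetical order of `classes_`.

```python
import ast
from sklearn.preprocessing import MultiLabelBinarizer

# Genre lists stored as strings, as in the training CSV.
raw = ["['Short', 'Drama']", "['Drama']", "['Action', 'Crime', 'Thriller']"]

# Parse each string into a real Python list (literals only, no code execution).
parsed = [ast.literal_eval(s) for s in raw]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(parsed)

print(list(mlb.classes_))  # ['Action', 'Crime', 'Drama', 'Short', 'Thriller']
print(y)
# [[0 0 1 1 0]
#  [0 0 1 0 0]
#  [1 1 0 0 1]]
```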

In [11]:
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [12]:
X_train

<5289x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 267805 stored elements in Compressed Sparse Row format>

### Train multi-class multi-label model

In [13]:
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))

In [14]:
clf.fit(X_train, y_train_genres)

OneVsRestClassifier(estimator=RandomForestClassifier(max_depth=10, n_jobs=-1,
                                                     random_state=42))

In [15]:
y_pred_genres = clf.predict_proba(X_test)

In [16]:
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7668587460707162
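
With `average='macro'`, the score is the unweighted mean of the ROC AUC computed separately for each genre column. The sketch below checks this on a toy problem with two labels; the arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy multi-label ground truth and predicted probabilities (2 labels, 4 samples).
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.2, 0.7]])

# AUC of each label on its own.
per_label = [roc_auc_score(y_true[:, j], y_score[:, j])
             for j in range(y_true.shape[1])]

# Macro average as computed by scikit-learn.
macro = roc_auc_score(y_true, y_score, average='macro')

print(per_label)  # [1.0, 0.75]
print(macro)      # 0.875
```

Because every genre counts equally, rare genres influence the macro score as much as frequent ones.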

### Predict the testing dataset

In [17]:
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

y_pred_test_genres = clf.predict_proba(X_test_dtm)


In [18]:
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
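
The hard-coded `cols` list has to stay in the same order as the columns of `y_genres`. `MultiLabelBinarizer` sorts its `classes_` alphabetically, so one alternative is to derive the names from the fitted binarizer instead of typing them. The sketch below shows the idea on toy data; the `toy_` names and probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre lists standing in for dataTraining['genres'].
toy_genres = [['Drama', 'Short'], ['Action', 'Crime'], ['Drama']]
toy_le = MultiLabelBinarizer().fit(toy_genres)

# Column names derived from the binarizer, so they always match the label order.
toy_cols = ['p_' + genre for genre in toy_le.classes_]

# Made-up probabilities for a single movie, one value per genre.
toy_proba = np.array([[0.1, 0.2, 0.7, 0.3]])
toy_res = pd.DataFrame(toy_proba, index=[1], columns=toy_cols)
print(toy_res)
```

In the notebook itself this would correspond to `['p_' + g for g in le.classes_]`, assuming the competition expects exactly the genres present in the training data.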

In [19]:
res.head()

| | p_Action | p_Adventure | p_Animation | p_Biography | p_Comedy | p_Crime | p_Documentary | p_Drama | p_Family | p_Fantasy | ... | p_Musical | p_Mystery | p_News | p_Romance | p_Sci-Fi | p_Short | p_Sport | p_Thriller | p_War | p_Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.13581 | 0.094424 | 0.022989 | 0.041449 | 0.363532 | 0.120802 | 0.029025 | 0.495286 | 0.063293 | 0.114314 | ... | 0.025804 | 0.063368 | 0.0 | 0.334928 | 0.060684 | 0.010041 | 0.021845 | 0.184245 | 0.023033 | 0.018117 |
| 4 | 0.138673 | 0.086203 | 0.024403 | 0.068438 | 0.365237 | 0.194511 | 0.070734 | 0.509517 | 0.065259 | 0.06691 | ... | 0.024402 | 0.06278 | 0.001286 | 0.14807 | 0.058957 | 0.013059 | 0.020885 | 0.197897 | 0.031537 | 0.018771 |
| 5 | 0.187752 | 0.1275 | 0.015475 | 0.074642 | 0.327476 | 0.446904 | 0.010038 | 0.670284 | 0.069855 | 0.105623 | ... | 0.033038 | 0.271484 | 0.0 | 0.408606 | 0.15386 | 0.021106 | 0.047782 | 0.47805 | 0.1075 | 0.027802 |
| 6 | 0.164374 | 0.123858 | 0.01892 | 0.0964 | 0.342519 | 0.133598 | 0.008242 | 0.585736 | 0.061884 | 0.065952 | ... | 0.059458 | 0.096792 | 0.0 | 0.230547 | 0.121861 | 0.011248 | 0.054109 | 0.263776 | 0.088359 | 0.020532 |
| 7 | 0.181536 | 0.20189 | 0.033084 | 0.030767 | 0.328605 | 0.243938 | 0.011962 | 0.453543 | 0.080004 | 0.16733 | ... | 0.023848 | 0.090356 | 0.0 | 0.207816 | 0.247667 | 0.002448 | 0.022119 | 0.271938 | 0.022129 | 0.01667 |


In [20]:
res.to_csv('pred_genres_text_RF.csv', index_label='ID')

In [21]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly.
import joblib
joblib.dump(clf, 'model_movie_clf.pkl', compress=3)


['model_movie_clf.pkl']
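
Only the classifier is written to disk here, yet any new plot must pass through the same fitted `CountVectorizer` before it reaches the model. One possible way to keep the two together, shown as a sketch and not as the approach used in this project, is to wrap them in a scikit-learn `Pipeline` and dump that single object. The toy plots, labels, and file name below are invented for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy training data: 4 plots, 2 binary genre labels (invented for illustration).
toy_plots = ["cop chases robber", "robot invades earth",
             "cop fights robot", "family picnic in park"]
toy_y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

# Vectorizer and classifier bundled into one object.
pipe = make_pipeline(
    CountVectorizer(),
    OneVsRestClassifier(RandomForestClassifier(n_estimators=10, random_state=42)),
)
pipe.fit(toy_plots, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, path, compress=3)
    loaded = joblib.load(path)

# The reloaded pipeline accepts raw text and reproduces the original predictions.
original_proba = pipe.predict_proba(toy_plots)
loaded_proba = loaded.predict_proba(toy_plots)
print(np.allclose(original_proba, loaded_proba))
```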

In [22]:
from mov_model_deployment import predict_proba

In [23]:
predict_proba('Derek Vineyard is paroled after serving 3 years in prison for brutally killing two black men who tried to break into/steal his truck. Through his brothers, Danny Vineyard, narration, we learn that before going to prison, Derek was a skinhead and the leader of a violent white supremacist gang that committed acts of racial crime throughout L.A. and his actions greatly influenced Danny.')

| | p_Action | p_Adventure | p_Animation | p_Biography | p_Comedy | p_Crime | p_Documentary | p_Drama | p_Family | p_Fantasy | ... | p_Musical | p_Mystery | p_News | p_Romance | p_Sci-Fi | p_Short | p_Sport | p_Thriller | p_War | p_Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.129847 | 0.089613 | 0.024788 | 0.040049 | 0.3812 | 0.127926 | 0.137551 | 0.441072 | 0.065148 | 0.064388 | ... | 0.024402 | 0.062073 | 0.000253 | 0.146422 | 0.058553 | 0.016504 | 0.021391 | 0.202271 | 0.023974 | 0.018815 |
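
The `mov_model_deployment` module is not shown in this notebook. As a rough idea of what such a helper might look like, the sketch below builds a function that vectorizes one or more plots and returns a DataFrame of per-genre probabilities. Everything here is an assumption: the `make_predictor` name, its arguments, and the toy model are hypothetical and do not describe the actual module.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer


def make_predictor(vect, clf, cols):
    """Hypothetical helper: bind a fitted vectorizer and classifier into one function."""
    def predict_proba(plots):
        # Accept a single plot or a list of plots.
        if isinstance(plots, str):
            plots = [plots]
        X = vect.transform(plots)
        return pd.DataFrame(clf.predict_proba(X), columns=cols)
    return predict_proba


# Toy model standing in for the trained vectorizer and classifier.
toy_plots = ["cop chases robber", "robot invades earth",
             "cop fights robot", "family picnic in park"]
toy_genres = [['Crime'], ['Sci-Fi'], ['Crime', 'Sci-Fi'], ['Drama']]

toy_le = MultiLabelBinarizer()
toy_y = toy_le.fit_transform(toy_genres)
toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_plots)
toy_clf = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=10, random_state=42)
).fit(toy_X, toy_y)

predict = make_predictor(toy_vect, toy_clf, ['p_' + g for g in toy_le.classes_])
single = predict("a cop fights a robot")
batch = predict(["a robot in the park", "the robber escapes"])
print(single)
```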


### Create API

In [24]:
from flask import Flask
from flask_restx import Api, Resource, fields
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly.
import joblib

In [25]:
app = Flask(__name__)

api = Api(
    app, 
    version='1.0', 
    title='Genre Movie Prediction API',
    description='Genre Movie Prediction API')

ns = api.namespace('predict', 
     description='Movie Genre Classifier')
   
parser = api.parser()

parser.add_argument(
    'PLOT', 
    type=str, 
    required=True, 
    help='PLOT to be analyzed', 
    location='args')

resource_fields = api.model('Resource', {
    'result': fields.String,
})

In [26]:
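from mov_model_deployment import predict_proba

@ns.route('/')
class PredictApi(Resource):

    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()

        # Flask has already percent-decoded the query string, so a '%3B' sent
        # by the client arrives here as ';'. Split on ';' to accept several plots.
        plots = args['PLOT'].split(';')
        print(plots)

        # fields.String in resource_fields renders the prediction as text.
        return {
         "result": predict_proba(plots)
        }, 200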

In [27]:
app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5050)

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on


 * Running on http://0.0.0.0:5050/ (Press CTRL+C to quit)


['Derek Vineyard is paroled after serving 3 years in prison for brutally killing two black men who tried to break into/steal his truck. Through his brothers, Danny Vineyard, narration, we learn that before going to prison, Derek was a skinhead and the leader of a violent white supremacist gang that committed acts of racial crime throughout L.A. and his actions greatly influenced Danny']


127.0.0.1 - - [14/Jul/2020 20:41:52] "GET /predict/?PLOT=Derek%20Vineyard%20is%20paroled%20after%20serving%203%20years%20in%20prison%20for%20brutally%20killing%20two%20black%20men%20who%20tried%20to%20break%20into/steal%20his%20truck.%20Through%20his%20brothers,%20Danny%20Vineyard,%20narration,%20we%20learn%20that%20before%20going%20to%20prison,%20Derek%20was%20a%20skinhead%20and%20the%20leader%20of%20a%20violent%20white%20supremacist%20gang%20that%20committed%20acts%20of%20racial%20crime%20throughout%20L.A.%20and%20his%20actions%20greatly%20influenced%20Danny HTTP/1.1" 200 -


['uiqjr k oprkahji ']


127.0.0.1 - - [14/Jul/2020 20:42:35] "GET /predict/?PLOT=uiqjr%20k%20oprkahji%20 HTTP/1.1" 200 -


Test the API at http://localhost:5050/predict/?PLOT=Derek%20Vineyard%20is%20paroled%20after%20serving%203%20years%20in%20prison%20for%20brutally%20killing%20two%20black%20men%20who%20tried%20to%20break%20into/steal%20his%20truck.%20Through%20his%20brothers,%20Danny%20Vineyard,%20narration,%20we%20learn%20that%20before%20going%20to%20prison,%20Derek%20was%20a%20skinhead%20and%20the%20leader%20of%20a%20violent%20white%20supremacist%20gang%20that%20committed%20acts%20of%20racial%20crime%20throughout%20L.A.%20and%20his%20actions%20greatly%20influenced%20Danny
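
Typing a percent-encoded URL by hand is error-prone. A minimal sketch of building the same kind of request URL with the standard library follows; the shortened plot text is only an example, and the endpoint is assumed to be the one started by `app.run` above. The request itself is left commented out because it needs the server to be running.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "http://localhost:5050/predict/"
plot = "Derek Vineyard is paroled after serving 3 years in prison."

# quote_via=quote encodes spaces as %20 (the default would use '+').
url = BASE_URL + "?" + urlencode({"PLOT": plot}, quote_via=quote)
print(url)

# Round trip: decoding the query string recovers the original plot text.
decoded = parse_qs(urlsplit(url).query)["PLOT"][0]
print(decoded == plot)

# To call the running API (requires the Flask server to be up):
# from urllib.request import urlopen
# print(urlopen(url).read().decode("utf-8"))
```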