![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Proyecto 2 - Clasificación de género de películas

El propósito de este proyecto es que puedan poner en práctica, en sus respectivos grupos de trabajo, sus conocimientos sobre técnicas de preprocesamiento, modelos predictivos de NLP, y la disponibilización de modelos. Para su desarrollo tengan en cuenta las instrucciones dadas en la "Guía del proyecto 2: Clasificación de género de películas"

**Entrega**: La entrega del proyecto deberán realizarla durante la semana 8. Sin embargo, es importante que avancen en la semana 7 en el modelado del problema y en parte del informe, tal y como se les indicó en la guía.

Para hacer la entrega, deberán adjuntar el informe autocontenido en PDF a la actividad de entrega del proyecto que encontrarán en la semana 8, y subir el archivo de predicciones a la [competencia de Kaggle](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Datos para la predicción de género en películas

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

En este proyecto se usará un conjunto de datos de géneros de películas. Cada observación contiene el título de una película, su año de lanzamiento, la sinopsis o plot de la película (resumen de la trama) y los géneros a los que pertenece (una película puede pertenercer a más de un género). Por ejemplo:
- Título: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Generos: 'Comedy', 'Crime', 'Horror'

La idea es que usen estos datos para predecir la probabilidad de que una película pertenezca, dada la sinopsis, a cada uno de los géneros.

Agradecemos al profesor Fabio González, Ph.D. y a su alumno John Arevalo por proporcionar este conjunto de datos. Ver https://arxiv.org/abs/1702.01992

## Ejemplo predicción conjunto de test para envío a Kaggle
En esta sección encontrarán el formato en el que deben guardar los resultados de la predicción para que puedan subirlos a la competencia en Kaggle.

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [31]:
# Importación librerías
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import xgboost as xgb
import lightgbm as lgb

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.naive_bayes import MultinomialNB

# Pipeline
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer

# NLP
import keras
from keras import initializers
from keras import optimizers
from keras.optimizers import Adam
from keras import backend as K
from keras.regularizers import l2
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import Embedding

from livelossplot import PlotLossesKeras

# metrics
from sklearn.metrics import r2_score
from sklearn.metrics import roc_auc_score

# setup
plt.style.use('seaborn-v0_8')
plt.rcParams["figure.figsize"] = (5, 4)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('max_colwidth', None)


#Optuna
import optuna

from scipy.stats import uniform, randint

In [5]:
# Carga de datos de archivo .csv
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [6]:
# Visualización datos de entrenamiento
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,"most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender . a day before , the boy meets a woman boarding a train , a drug abuser . at the bridge , the father goes into the engine room , and tells his son to stay at the edge of the nearby lake . a ship comes , and the bridge is lifted . though it is supposed to arrive an hour later , the train happens to arrive . the son sees this , and tries to warn his father , who is not able to see this . just as the oncoming train approaches , his son falls into the drawbridge gear works while attempting to lower the bridge , leaving the father with a horrific choice . the father then lowers the bridge , the gears crushing the boy . the people in the train are completely oblivious to the fact a boy died trying to save them , other than the drug addict woman , who happened to look out her train window . the movie ends , with the man wandering a new city , and meets the woman , no longer a drug addict , holding a small baby . other relevant narratives run in parallel , namely one of the female drug - addict , and they all meet at the climax of this tumultuous film .","['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets of his satisfying career to a video store clerk .,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfiguring facial scar meets a gentleman who lives beyond his means . they become accomplices in blackmail , and she falls in love with him , bitterly resigned to the impossibility of his returning her affection . her life changes when one of her victims proves to be the wife of a plastic surgeon , who catches her in his apartment , but believes her to be a jewel thief rather than a blackmailer . he offers her the chance to look like a normal woman again , and she accepts , despite the agony of multiple operations . meanwhile , her gentleman accomplice forms an evil scheme to rid himself of the one person who stands in his way to a fortune - his four - year - old - nephew .","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the president of the tredway corporation avery bullard has just had a meeting with investment bankers and sends a telegram scheduling a meeting at the furniture factory in millburgh , pennsylvania , at six pm with his executives . bullard has never appointed an executive vice - president for the corporation after the death of the previous one but when he is getting a taxi , he has a stroke and dies on the street . a thief steals his wallet to get his money and his body goes to the morgue without identification . the investment banker george nyle caswell sees bullard ' s body from his window and decides to use the information to make money , asking a broker to sell as much tredway stocks as possible until the end of the day , with the intention of buying them back monday morning by a lower price making profit . meanwhile the executives unsuccessfully wait for bullard in the meeting room . when they learn that bullard is dead , the ambitions accountant vp and controller loren phineas shaw releases to the press the balance of tredway showing profit and assumes temporarily the leadership of the company , expecting to be elected the next president by the seven - member board . however , the vp for design and development mcdonald "" don "" walling and the vp and treasurer frederick y . alderson oppose to shaw . there is a struggle in the corporation for the position of president and shaw blackmails the vp for sales josiah walter dudley that is married and has a mistress , his secretary eva bardeman , to get his vote . caswell needs to cover the N , N stocks he sold and shaw promises to give to him the stocks for the price he sold if he is elected president . the vp for manufacturing jesse q . grimm is near to retire but is a close friend of frederick and supports him . therefore the heir of tredway and bullard ' s mistress julia o . tredway will be responsible to give the casting vote . but she is disenchanted with the corporation . who will be elected the next president ?",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing house carol hunnicut goes to a blind date with the lawyer michael tarlow , who has embezzled the powerful mobster leo watts . carol accidentally witnesses the murder of michel by leo ' s hitman . the scared carol sneaks out of michael ' s room and hides in an isolated cabin in canada . meanwhile the deputy district attorney robert caulfield and sgt . dominick benti discover that carol is a witness of the murder and they report the information to caulfield ' s chief martin larner and they head by helicopter to canada to convince carol to testify against leo . however they are followed and the pilot and benti are murdered by the mafia . caulfield and carol flees and they take a train to vancouver . caulfield hides carol in his cabin and he discloses that there are three hitman in the train trying to find carol and kill her . but they do not know her and caulfield does not know who might be the third killer from the mafia and who has betrayed him in his office .","['Action', 'Crime', 'Thriller']",6.6


In [7]:
# Visualización datos de test
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate . theresa osborne is running along the beach when she stumbles upon a bottle washed up on the shore . inside is a message , reading the letter she feels so moved and yet she felt as if she has violated someone ' s thoughts . in love with a man she has never met , theresa tracks down the author of the letter to a small town in wilmington , two lovers with crossed paths . but yet one can ' t let go of their past ."
4,1978,Midnight Express,"the true story of billy hayes , an american college student who is caught smuggling drugs out of turkey and thrown into prison ."
5,1996,Primal Fear,"martin vail left the chicago da ' s office to become a successful criminal lawyer , that success predicated on working on high profile cases . as such , he fights to get the case of naive nineteen year old rural kentuckian aaron stampler , an altar boy accused of the vicious bludgeoning death of archbishop rushman of chicago . the story that aaron tells marty is that he , abused by his father , was in the room when the murder was committed by a third party , a shadowy figure he did not see , before he blacked out , which commonly happens to him . not remembering anything during the blackout period , he awoke covered in the archbishop ' s blood , his fright the reason he ran from the police . he also states that he had no reason to kill the archbishop , who he loved as the father he wished he had . marty doesn ' t care if he is guilty or innocent , but needs to know the truth to defend him adequately . unlike the rest of the world , marty does believe his story , he who hopes he can use aaron ' s general appearance of being an innocent to his advantage . the powerful state attorney , john shaughnessy , who marty has had many a moral run - in , wants a first degree murder conviction and the death penalty in this case . he appoints to the case janet venable , who still has bad feelings toward marty , an ex - lover , their six month relationship which ended badly . although the case looks to be a slam dunk for janet , her career may be made or broken by its outcome . in building his case , marty comes across some major pieces of information , some pertaining to the archbishop himself , and one uncovered by dr . molly arrington about aaron , she a psychiatrist hired by marty to assess aaron ' s mental state . these pieces of information as a collective pose a problem for marty in how to mount a credible and legitimate defense for his client . it is more of a moral dilemma for marty if only because he believes the life of a young man , who he believes in , is at stake ."
6,1950,Crisis,"husband and wife americans dr . eugene and mrs . helen ferguson - he a renowned neurosurgeon - are traveling through latin america for a vacation . when they make the decision to return to new york earlier than expected , they find they are being detained by the military in the country they are in . ultimately , they learn the reason is that president raoul farrago , the tyrannical military dictator of the country , has been diagnosed with a brain tumor and will die without an operation to remove it , farrago choosing gene as the doctor to lead the surgical team . because of the volatile politics within the country and for his own safety as revolutionary forces would like to see him dead , farrago refuses to go to a hospital for the operation , instead it to be done at his home . despite not particularly liking farrago or his ways , gene agrees purely in his oath as a doctor . however , he ends up being caught in the middle between farrago / his brutal regime and the revolutionaries , each side who is willing to use him and helen to get what they want , namely the life or death of farrago ."
7,1959,The Tingler,"the coroner and scientist dr . warren chapin is researching the shivering effect of fear with his assistant david morris . dr . warren is introduced to ollie higgins , the relative of a criminal sentenced to the electric chair , while making the autopsy of the corpse , and he makes a comment about the tingler - effect to him . ollie asks for a lift to dr . warner , and introduces his deaf - mute wife martha higgins , who manages a theater of their own . dr . warner returns home , where he lives with his unfaithful and evil wife isabel stevens chapin and her sweet sister lucy stevens . dr . warner , upset with the situation with his wife , threatens and uses her as a subject of his experiment . when martha dies of fear , dr . warner makes her autopsy and finds a creature that lives inside every human being , feeds with fear and is controlled by the scream . once martha was not able to scream , the tingler was not rendered harmless and became enormous . when the living being escapes , dr . warner and ollie chase it in a crowded movie theater ."


In [8]:
# Definición de variable de interés (y)
dataTraining['genres'] = dataTraining['genres'].map(lambda x: eval(x))
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [9]:
# https://www.nltk.org/howto/stem.html
import re
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# https://stackoverflow.com/questions/41610543/corpora-stopwords-not-found-when-import-nltk-library
nltk.download('stopwords')
nltk.download('punkt')
english_stopwords = nltk.corpus.stopwords.words('english')

ps = PorterStemmer()
def clean_text(text):
    # remove backslash-apostrophe
    text = re.sub("\'", "", text)
    # remove everything except alphabets
    text = re.sub("[^a-zA-Z]"," ",text)
    # convert text to lowercase
    text = text.lower()
    # Steming the text
    text = [ps.stem(word) for word in text.split() if word not in english_stopwords]
    # join the stemed words
    text = ' '.join(text)

    return text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\WD\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\WD\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
X_dtm = dataTraining['plot'].apply(lambda x: clean_text(x))

In [11]:
pd.DataFrame(X_dtm).head()

Unnamed: 0,plot
3107,stori singl father take eight year old son work railroad drawbridg bridg tender day boy meet woman board train drug abus bridg father goe engin room tell son stay edg nearbi lake ship come bridg lift though suppos arriv hour later train happen arriv son see tri warn father abl see oncom train approach son fall drawbridg gear work attempt lower bridg leav father horrif choic father lower bridg gear crush boy peopl train complet oblivi fact boy die tri save drug addict woman happen look train window movi end man wander new citi meet woman longer drug addict hold small babi relev narr run parallel name one femal drug addict meet climax tumultu film
900,serial killer decid teach secret satisfi career video store clerk
6724,sweden femal blackmail disfigur facial scar meet gentleman live beyond mean becom accomplic blackmail fall love bitterli resign imposs return affect life chang one victim prove wife plastic surgeon catch apart believ jewel thief rather blackmail offer chanc look like normal woman accept despit agoni multipl oper meanwhil gentleman accomplic form evil scheme rid one person stand way fortun four year old nephew
4704,friday afternoon new york presid tredway corpor averi bullard meet invest banker send telegram schedul meet furnitur factori millburgh pennsylvania six pm execut bullard never appoint execut vice presid corpor death previou one get taxi stroke die street thief steal wallet get money bodi goe morgu without identif invest banker georg nyle caswel see bullard bodi window decid use inform make money ask broker sell much tredway stock possibl end day intent buy back monday morn lower price make profit meanwhil execut unsuccess wait bullard meet room learn bullard dead ambit account vp control loren phinea shaw releas press balanc tredway show profit assum temporarili leadership compani expect elect next presid seven member board howev vp design develop mcdonald wall vp treasur frederick alderson oppos shaw struggl corpor posit presid shaw blackmail vp sale josiah walter dudley marri mistress secretari eva bardeman get vote caswel need cover n n stock sold shaw promis give stock price sold elect presid vp manufactur jess q grimm near retir close friend frederick support therefor heir tredway bullard mistress julia tredway respons give cast vote disench corpor elect next presid
2582,lo angel editor publish hous carol hunnicut goe blind date lawyer michael tarlow embezzl power mobster leo watt carol accident wit murder michel leo hitman scare carol sneak michael room hide isol cabin canada meanwhil deputi district attorney robert caulfield sgt dominick benti discov carol wit murder report inform caulfield chief martin larner head helicopt canada convinc carol testifi leo howev follow pilot benti murder mafia caulfield carol flee take train vancouv caulfield hide carol cabin disclos three hitman train tri find carol kill know caulfield know might third killer mafia betray offic


In [12]:
# Separación de variables predictoras (X) y variable de interés (y) en set de entrenamiento y test usandola función train_test_split
# kfolds = 5 => 1/5 = 0.2
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)
xTrain, xVal, yTrain, yVal = train_test_split(X_train, y_train_genres, test_size=0.20, random_state=42)

In [13]:
# Definición de variables predictoras (X)
vect = TfidfVectorizer(
    max_features=3000,
    ngram_range=(1,5)
)

In [14]:
vect.fit_transform(X_train)
xTrain_tdf = vect.transform(xTrain)
X_test_tdf = vect.transform(X_test)
xVal_tdf = vect.transform(xVal)

In [15]:
# Para efectuar CV en el Clasificador
xTr = np.concatenate((xTrain_tdf.toarray(), xVal_tdf.toarray()), axis=0)
yTr = np.concatenate((yTrain, yVal), axis=0)

In [16]:
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression

In [None]:
# Modelo LGBM

param_dist = {
    'stackingclassifier__lgbmR__reg_alpha': uniform(1e-3, 10.0 - 1e-3),  # Uniform distribution between 1e-3 and 10.0
    'stackingclassifier__lgbmR__reg_lambda': uniform(1e-3, 10.0 - 1e-3),  # Uniform distribution between 1e-3 and 10.0
    'stackingclassifier__lgbmR__colsample_bytree': uniform(0.3, 1.0 - 0.3),  # Uniform distribution between 0.3 and 1.0
    'stackingclassifier__lgbmR__subsample': uniform(0.4, 1.0 - 0.4),  # Uniform distribution between 0.4 and 1.0
    'stackingclassifier__lgbmR__learning_rate': uniform(0.006, 0.02 - 0.006),  # Uniform distribution between 0.006 and 0.02
    'stackingclassifier__lgbmR__max_depth': [10, 20, 100],
    'stackingclassifier__lgbmR__num_leaves': randint(1, 1001),  # Random integer between 1 and 1000
    'stackingclassifier__lgbmR__min_child_samples': randint(1, 301),  # Random integer between 1 and 300
    'stackingclassifier__lgbmR__cat_smooth': randint(1, 101)  # Random integer between 1 and 100
}

vect_plot = make_pipeline(
    TfidfVectorizer(max_features=3000, ngram_range=(1,5))
)

modelo = GridSearchCV(
    estimator=OneVsRestClassifier( MultinomialNB(alpha=0.099)),
    param_grid=param_dist,
    scoring='accuracy',
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    verbose=0
)
# canalizacion
pipe = make_pipeline(
    vect_plot,
    modelo
)

pipe.fit(X_train, y_train_genres)
pipe.get_params()

In [40]:
# Modelo MultinomialNB

vect_plot = make_pipeline(
    TfidfVectorizer(max_features=3000, ngram_range=(1,5))
)

modelo = GridSearchCV(
    estimator=OneVsRestClassifier( MultinomialNB(alpha=0.099)),
    param_grid={},
    scoring='accuracy',
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    verbose=0
)
# canalizacion
pipe = make_pipeline(
    vect_plot,
    modelo
)

pipe.fit(X_train, y_train_genres)
pipe.get_params()

{'memory': None,
 'steps': [('pipeline', Pipeline(steps=[('tfidfvectorizer',
                    TfidfVectorizer(max_features=3000, ngram_range=(1, 5)))])),
  ('gridsearchcv',
   GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
                estimator=OneVsRestClassifier(estimator=MultinomialNB(alpha=0.099)),
                param_grid={}, scoring='accuracy'))],
 'verbose': False,
 'pipeline': Pipeline(steps=[('tfidfvectorizer',
                  TfidfVectorizer(max_features=3000, ngram_range=(1, 5)))]),
 'gridsearchcv': GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
              estimator=OneVsRestClassifier(estimator=MultinomialNB(alpha=0.099)),
              param_grid={}, scoring='accuracy'),
 'pipeline__memory': None,
 'pipeline__steps': [('tfidfvectorizer',
   TfidfVectorizer(max_features=3000, ngram_range=(1, 5)))],
 'pipeline__verbose': False,
 'pipeline__tfidfvectorizer': TfidfVectorizer(max_features=3000, ngram_range=(1, 5)),
 'pipeline_

In [None]:
# preprocesamiento
vect_plot = make_pipeline(
    TfidfVectorizer(max_features=3000, ngram_range=(1,5))
)

# Definición y entrenamiento
stack_clf = OneVsRestClassifier(
    StackingClassifier(
        estimators=[
            # ('xgbC', xgb.XGBClassifier(n_jobs=1, random_state=42)),
            ('lgbmR', lgb.LGBMClassifier(n_jobs=1, verbose=-1, random_state=42)),
            ('nb', MultinomialNB(alpha=0.099)),
            ('rf', RandomForestClassifier(n_jobs=1, random_state=42))
        ],
        cv=KFold(n_splits=5, shuffle=True, random_state=42),
        final_estimator=LogisticRegression(random_state=42)
    )
)

# canalizacion
pipe = make_pipeline(
    vect_plot,
    stack_clf
)

In [17]:
pipe.fit(X_train, y_train_genres)

In [17]:
import optuna
from optuna.integration import OptunaSearchCV


In [18]:
# Predicción del modelo de clasificación
y_pred_genres = pipe.predict_proba(X_test)

# Impresión del desempeño del modelo
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8504275029776873

In [None]:
import joblib

In [None]:
joblib.dump(pipe, 'model/stackClassifierML.pkl', compress=3)

['model/stackClassifierML.pkl']

In [None]:
cols = [
    'p_Action', 'p_Adventure', 'p_Animation',
    'p_Biography', 'p_Comedy', 'p_Crime',
    'p_Documentary', 'p_Drama', 'p_Family',
    'p_Fantasy', 'p_Film-Noir', 'p_History',
    'p_Horror', 'p_Music', 'p_Musical',
    'p_Mystery', 'p_News', 'p_Romance',
    'p_Sci-Fi', 'p_Short', 'p_Sport',
    'p_Thriller', 'p_War', 'p_Western'
]

In [None]:
# Guardar predicciones en formato exigido en la competencia de kaggle
res = pd.DataFrame(y_pred_genres, columns=cols)
res.to_csv('stackClassifierML.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,p_Film-Noir,p_History,p_Horror,p_Music,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
0,0.051146,0.055507,0.026109,0.042806,0.071972,0.338956,0.115019,0.740985,0.039411,0.056977,0.020586,0.035972,0.06466,0.028727,0.029293,0.051128,0.000793,0.081316,0.037662,0.011731,0.020612,0.233745,0.024043,0.023018
1,0.134053,0.266369,0.029917,0.070725,0.207396,0.049153,0.022008,0.650474,0.06868,0.048365,0.021008,0.141579,0.031529,0.023874,0.033938,0.067562,0.000793,0.252737,0.037911,0.011746,0.019692,0.145611,0.993465,0.021429
2,0.844228,0.148653,0.027122,0.035224,0.073895,0.051361,0.02467,0.157,0.037411,0.046807,0.020357,0.029416,0.677924,0.023372,0.029375,0.151272,0.000793,0.060223,0.970466,0.01165,0.017829,0.602955,0.022488,0.021593
3,0.531914,0.556689,0.026803,0.038905,0.164683,0.094126,0.021654,0.479321,0.040567,0.053483,0.020363,0.039246,0.05061,0.024954,0.030941,0.047521,0.000793,0.086633,0.074041,0.011652,0.018148,0.290442,0.059696,0.042022
4,0.747859,0.63722,0.027857,0.047798,0.214848,0.045272,0.038213,0.31417,0.062448,0.356831,0.020288,0.035787,0.322967,0.026079,0.029877,0.043892,0.000793,0.07478,0.374086,0.011698,0.018671,0.097386,0.025269,0.021462


In [None]:
clf = OneVsRestClassifier(lgb.LGBMClassifier(n_jobs=1, random_state=42))
clf.fit(xTr, yTr)
y_predict_clf = clf.predict_proba(X_test_tdf)
roc_auc_score(y_test_genres, y_predict_clf, average='macro')

0.8225126165101792

In [None]:
rclf = OneVsRestClassifier(RandomForestClassifier(n_jobs=1, random_state=42))
rclf.fit(xTr, yTr)
y_predict_rclf = rclf.predict_proba(X_test_tdf)
roc_auc_score(y_test_genres, y_predict_rclf, average='macro')

In [None]:
from sklearn.naive_bayes import MultinomialNB, ComplementNB, GaussianNB

In [None]:
nb_clf = OneVsRestClassifier(MultinomialNB(alpha=0.099))
nb_clf.fit(xTr, yTr)
y_predict_nb_clf = nb_clf.predict_proba(X_test_tdf)
roc_auc_score(y_test_genres, y_predict_nb_clf, average='macro')


0.8667309326731999

In [None]:
clf = OneVsRestClassifier(ComplementNB(alpha=0.099))
clf.fit(xTr, yTr)
y_predict_clf = clf.predict_proba(X_test_tdf)
roc_auc_score(y_test_genres, y_predict_clf, average='macro')

0.8667309326731999

In [None]:
from sklearn.linear_model import LogisticRegressionCV

In [None]:
clf = OneVsRestClassifier(LogisticRegressionCV(cv=5, random_state=42))
clf.fit(vect.transform(X_train).toarray(), y_train_genres)
y_predict_clf = clf.predict_proba(X_test_tdf)
roc_auc_score(y_test_genres, y_predict_clf, average='macro')

0.8623614069231925

In [None]:
clf = OneVsRestClassifier(GaussianNB())
clf.fit(xTr, yTr)
y_predict_clf = clf.predict_proba(X_test_tdf.toarray())
roc_auc_score(y_test_genres, y_predict_clf, average='macro')

0.588779693264699

In [None]:
cols = [
    'p_Action', 'p_Adventure', 'p_Animation',
    'p_Biography', 'p_Comedy', 'p_Crime',
    'p_Documentary', 'p_Drama', 'p_Family',
    'p_Fantasy', 'p_Film-Noir', 'p_History',
    'p_Horror', 'p_Music', 'p_Musical',
    'p_Mystery', 'p_News', 'p_Romance',
    'p_Sci-Fi', 'p_Short', 'p_Sport',
    'p_Thriller', 'p_War', 'p_Western'
]

In [None]:
# Guardar predicciones en formato exigido en la competencia de kaggle
res = pd.DataFrame(y_predict_nb_clf, columns=cols)
#res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,p_Film-Noir,p_History,p_Horror,p_Music,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
0,0.030488,0.041772,0.005131,0.028553,0.086778,0.152461,0.181141,0.614601,0.014092,0.049949,0.005659,0.02809,0.149269,0.030239,0.001829,0.064035,0.001282,0.070728,0.039767,0.007677,0.022586,0.210468,0.013984,0.01143
1,0.268552,0.251455,0.033639,0.136191,0.111511,0.018639,0.003788,0.739428,0.049949,0.023304,0.008689,0.356197,0.004821,0.006349,0.033158,0.054464,0.000442,0.318319,0.021172,0.013184,0.010769,0.215446,0.932475,0.00108
2,0.372282,0.151275,0.010886,0.006608,0.079254,0.079128,0.010038,0.319391,0.007503,0.027077,0.003629,0.010145,0.285648,0.003275,0.00493,0.119737,0.000155,0.029927,0.353894,0.001427,0.001227,0.436286,0.009488,0.004662
3,0.521798,0.346447,0.009761,0.023747,0.18633,0.117851,0.001922,0.479081,0.018669,0.053748,0.002889,0.055431,0.066361,0.014752,0.019436,0.025297,5.9e-05,0.126475,0.130889,0.001532,0.003325,0.335381,0.125042,0.075039
4,0.447401,0.414775,0.021816,0.034111,0.233162,0.022433,0.02898,0.311585,0.046697,0.320706,0.000889,0.023567,0.197409,0.016027,0.003527,0.011467,0.000562,0.047661,0.376088,0.006699,0.005679,0.117078,0.012316,0.002888


In [None]:
nb_clf = OneVsRestClassifier(MultinomialNB(alpha=0.99))
model_cv_nb_clf = GridSearchCV(
    estimator=nb_clf,
    param_grid={},
    scoring='accuracy',
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    verbose=1
)

In [None]:
model_cv_nb_clf.fit(xTr, yTr)
roc_auc_score(y_test_genres, model_cv_nb_clf.predict_proba(X_test_tdf.toarray()), average='macro')

Fitting 5 folds for each of 1 candidates, totalling 5 fits


0.8151338492428954

In [None]:
output_var = yTrain.shape[1]
print(output_var, ' output variables')

24  output variables


In [None]:
dims = xTrain_tdf.shape[1]
print(dims, 'input variables')

2000 input variables


In [None]:
# Reproducibility in Keras Models
# https://keras.io/examples/keras_recipes/reproducibility_recipes/
keras.utils.set_random_seed(22)

In [None]:
K.clear_session()

model = Sequential([
    Embedding(input_dim=dims, output_dim=128),
    LSTM(256),
    Dropout(0.5),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(output_var, activation='softmax')
])

model.compile(
    optimizer='Adam', #'adam',
    loss='categorical_crossentropy', #"categorical_crossentropy",
    metrics = ["accuracy"],
    sample_weight_mode='temporal'
)

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 128)         256000    
                                                                 
 lstm (LSTM)                 (None, 256)               394240    
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense (Dense)               (None, 128)               32896     
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 24)                3096      
                                                                 
Total params: 686232 (2.62 MB)
Trainable params: 686232 

In [None]:
model.fit(
    xTrain_tdf.toarray(), yTrain,
    verbose=1,
    batch_size=32,
    epochs=10,
    validation_data=(xVal_tdf.toarray(), yVal),
    callbacks=[PlotLossesKeras()]
)

Epoch 1/10

KeyboardInterrupt: 

In [None]:
y_pred_class = model.predict(X_test_tdf.toarray())
y_test_class = np.argmax(X_test_tdf.toarray(), axis=1)



In [None]:
# Impresión del desempeño del modelo
roc_auc_score(y_test_genres, y_pred_class, average='macro')

0.4930746364251313

In [None]:
cols = [
    'p_Action', 'p_Adventure', 'p_Animation',
    'p_Biography', 'p_Comedy', 'p_Crime',
    'p_Documentary', 'p_Drama', 'p_Family',
    'p_Fantasy', 'p_Film-Noir', 'p_History',
    'p_Horror', 'p_Music', 'p_Musical',
    'p_Mystery', 'p_News', 'p_Romance',
    'p_Sci-Fi', 'p_Short', 'p_Sport',
    'p_Thriller', 'p_War', 'p_Western'
]

In [None]:
# transformación variables predictoras X del conjunto de test
X_test_dtm = vect.transform(dataTesting['plot'])

# Predicción del conjunto de test
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [None]:
# Guardar predicciones en formato exigido en la competencia de kaggle
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
#res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,p_Film-Noir,p_History,p_Horror,p_Music,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.178666,0.111486,0.021162,0.040451,0.412487,0.136443,0.025503,0.486787,0.063274,0.091445,0.009869,0.028732,0.080358,0.02849,0.039942,0.060797,0.0,0.520442,0.065661,0.103077,0.019148,0.182945,0.023956,0.036228
4,0.141989,0.087454,0.020934,0.099489,0.339891,0.215064,0.064928,0.54,0.061722,0.069519,0.010302,0.027411,0.087357,0.036983,0.022385,0.061874,6.4e-05,0.165986,0.055703,0.007257,0.01849,0.207937,0.079976,0.017317
5,0.172132,0.124699,0.021659,0.038556,0.253822,0.556154,0.019145,0.597389,0.063644,0.085029,0.088108,0.03163,0.117146,0.027422,0.024473,0.337297,0.0,0.327024,0.069038,0.007037,0.020403,0.558126,0.029693,0.017246
6,0.18941,0.137246,0.021502,0.031718,0.29475,0.149847,0.026721,0.531383,0.061621,0.066046,0.030969,0.036475,0.200914,0.027675,0.02389,0.100082,0.0,0.181014,0.096859,0.007037,0.019201,0.306362,0.03957,0.017371
7,0.205589,0.134627,0.02202,0.032584,0.341892,0.216454,0.021318,0.430259,0.077072,0.145261,0.027939,0.043913,0.197059,0.066935,0.022559,0.068007,0.0,0.214391,0.201836,0.009802,0.019559,0.229932,0.02397,0.01718
