![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie genre classification

The purpose of this project is for you to put into practice, within your working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. As you work on it, follow the instructions given in the "Project 2 guide: Movie genre classification".

**Submission**: The project must be submitted during week 8. However, it is important to make progress in week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report as a PDF to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for predicting movie genres

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains a movie's title, its release year, its plot synopsis, and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The idea is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.
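Because a movie can carry several genres at once, the target is multi-label: each genre becomes its own 0/1 column. The following is a minimal pure-Python sketch of that encoding, using three genre lists that appear in the training data. The helper name `multi_hot` is hypothetical; in practice a library encoder such as scikit-learn's `MultiLabelBinarizer` does the same job.

```python
def multi_hot(label_lists):
    """Return (classes, matrix): sorted class names and one 0/1 row per item."""
    classes = sorted({label for labels in label_lists for label in labels})
    matrix = [[1 if c in labels else 0 for c in classes] for labels in label_lists]
    return classes, matrix

genres = [
    ['Comedy', 'Crime', 'Horror'],
    ['Short', 'Drama'],
    ['Drama'],
]
classes, y = multi_hot(genres)
print(classes)   # column order of the matrix
for row in y:
    print(row)   # one row per movie, one column per genre
```

Each column can then be treated as a separate binary problem, which is the idea behind a one-vs-rest setup.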

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

In [22]:
import warnings
warnings.filterwarnings('ignore')

In [23]:
# Library imports
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [24]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [25]:
# Preview the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [26]:
# Preview the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


In [27]:
import nltk
nltk.download("stopwords")
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USUARIO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USUARIO\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [28]:
import pandas as pd
import numpy as np
import json
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import ToktokTokenizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

tokenizer = ToktokTokenizer() 
STOPWORDS = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def limpiar_texto(texto):
    """
    Clean a given text.
    """
    # Replace special (non-word) characters with spaces
    texto = re.sub(r'\W', ' ', str(texto))
    # Remove single-character words
    texto = re.sub(r'\s+[a-zA-Z]\s+', ' ', texto)
    # Collapse runs of whitespace into a single space
    texto = re.sub(r'\s+', ' ', texto)
    # Convert the text to lowercase
    texto = texto.lower()
    return texto

def filtrar_stopword_digitos(tokens):
    """
    Filter stopwords and digits out of a list of tokens.
    """
    return [token for token in tokens if token not in STOPWORDS 
            and not token.isdigit()]

def stem_palabras(tokens):
    """
    Reduce each word in a given list to its stem.
    """
    return [stemmer.stem(token) for token in tokens]

def tokenize(texto):
    """
    Clean and preprocess a text, returning its list of stems.
    """
    text_cleaned = limpiar_texto(texto)
    tokens = [word for word in tokenizer.tokenize(text_cleaned) if len(word) > 1]
    tokens = filtrar_stopword_digitos(tokens)
    stems = stem_palabras(tokens)
    return stems
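To see what the cleaning step does to a real synopsis, the sketch below repeats the regular expressions of `limpiar_texto` on the example plot from the introduction. It leaves out the stopword filter and the stemmer, and a plain whitespace split stands in for `ToktokTokenizer`; both simplifications are assumptions made so the snippet runs with the standard library only.

```python
import re

def limpiar_texto(texto):
    """Same cleaning steps as the notebook's limpiar_texto."""
    texto = re.sub(r'\W', ' ', str(texto))           # non-word characters -> space
    texto = re.sub(r'\s+[a-zA-Z]\s+', ' ', texto)    # single letters between spaces
    texto = re.sub(r'\s+', ' ', texto)               # collapse whitespace
    return texto.lower()

plot = ('A serial killer decides to teach the secrets of his satisfying '
        'career to a video store clerk.')
cleaned = limpiar_texto(plot)
# Whitespace split stands in for ToktokTokenizer (an assumption of this sketch)
tokens = [word for word in cleaned.split() if len(word) > 1]
print(repr(cleaned))
print(tokens)
```

Note that the leading "A" survives the single-letter regex, because that pattern requires whitespace on both sides; it is only dropped later by the `len(word) > 1` filter.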

In [29]:
# Define the predictor variables (X)
vect = TfidfVectorizer(tokenizer=tokenize,sublinear_tf=True,max_features=15000)
X_dtm = vect.fit_transform(dataTraining['plot']).toarray()
X_dtm.shape
print(X_dtm)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
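Two settings in the vectorizer cell are worth unpacking. With `sublinear_tf=True` the raw term count is replaced by 1 + ln(count), and scikit-learn's documented default smoothed IDF is ln((1 + n) / (1 + df)) + 1. The sketch below evaluates both formulas by hand and also estimates the memory taken by the dense `.toarray()` matrix, assuming 8-byte floats and that the vocabulary reaches the `max_features=15000` cap; the helper names are hypothetical.

```python
import math

def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

for count in (1, 2, 4, 16):
    print(count, round(sublinear_tf(count), 3))

# Dense matrix size: 7895 training rows x up to 15000 features x 8 bytes
dense_bytes = 7895 * 15000 * 8
print(round(dense_bytes / 1e9, 2), "GB")
```

A word repeated 16 times therefore weighs under four times as much as a word seen once, and the dense matrix approaches 1 GB, which is one reason to consider keeping the sparse output of `fit_transform` when the downstream model accepts it.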


In [30]:
dataTraining['genres']

3107                                   ['Short', 'Drama']
900                         ['Comedy', 'Crime', 'Horror']
6724                   ['Drama', 'Film-Noir', 'Thriller']
4704                                            ['Drama']
2582                      ['Action', 'Crime', 'Thriller']
                              ...                        
8417                                ['Comedy', 'Romance']
1592                   ['Action', 'Adventure', 'Fantasy']
1723    ['Adventure', 'Musical', 'Fantasy', 'Comedy', ...
7605    ['Animation', 'Adventure', 'Drama', 'Family', ...
215       ['Animation', 'Adventure', 'Family', 'Fantasy']
Name: genres, Length: 7895, dtype: object

In [31]:
# Define the target variable (y)
import ast
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)  # parse the list literal without eval
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [32]:
# Split the predictors (X) and the target (y) into training and test sets using the train_test_split function
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [None]:
# # Define the predictor variables (X)
# vect = CountVectorizer(max_features=1000)
# X_dtm = vect.fit_transform(dataTraining['plot'])
# X_dtm.shape


## Training the random forest model

In [12]:
# Definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [13]:
# Prediction with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Print the model's performance
roc_auc_score(y_test_genres, y_pred_genres, average='macro')


0.8167568769904546
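The score above is a macro-averaged ROC AUC: one AUC is computed per genre column and the results are averaged with equal weight. As a rough illustration, the sketch below computes it from the pairwise-ranking definition of AUC (the fraction of positive/negative pairs ranked correctly, ties counting half) on made-up toy labels and scores. The helper names are hypothetical, and the notebook itself relies on `roc_auc_score`.

```python
def binary_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-column AUCs."""
    n_labels = len(Y_true[0])
    aucs = [
        binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(aucs) / n_labels

# Toy data: 4 movies, 2 genres (values invented for illustration)
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]
print(macro_auc(Y_true, Y_score))
```

Because every genre counts equally, rare genres influence the macro average as much as frequent ones.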

In [33]:
# Transform the predictor variables X of the test set
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Prediction on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

NameError: name 'clf' is not defined

In [15]:
# Save predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.133566,0.107824,0.023475,0.038133,0.367979,0.157483,0.041389,0.542698,0.068654,0.09705,...,0.043395,0.075135,9.5e-05,0.272629,0.066538,0.027403,0.021615,0.230212,0.031224,0.020178
4,0.142036,0.117511,0.036646,0.046974,0.350519,0.195915,0.046406,0.515236,0.084135,0.081001,...,0.034955,0.075471,0.020266,0.187379,0.066975,0.007499,0.031419,0.244723,0.043972,0.033989
5,0.152087,0.121257,0.023475,0.038222,0.312578,0.330165,0.043085,0.549945,0.067185,0.087345,...,0.025526,0.141149,3.5e-05,0.243097,0.066538,0.007662,0.021615,0.345025,0.051351,0.029918
6,0.152259,0.111502,0.023475,0.040748,0.360597,0.160701,0.056535,0.53747,0.070001,0.071004,...,0.025526,0.096842,7e-05,0.273555,0.082657,0.007662,0.022105,0.304202,0.052783,0.020162
7,0.173368,0.134173,0.023475,0.038133,0.348132,0.16987,0.038902,0.406225,0.068692,0.116166,...,0.025526,0.098532,0.000175,0.20058,0.151139,0.007662,0.021615,0.232384,0.031224,0.020162
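The hard-coded `cols` list has to match the column order of the probability matrix, which follows the sorted class order of the label encoder. One way to avoid a silent mismatch is to derive the names from the encoder itself. In the sketch below a plain list stands in for `list(le.classes_)` so that the snippet runs on its own; that stand-in is an assumption of this example.

```python
# Stand-in for list(le.classes_): the 24 genres named in the notebook's cols list
classes = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
           'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History',
           'Horror', 'Music', 'Musical', 'Mystery', 'News', 'Romance',
           'Sci-Fi', 'Short', 'Sport', 'Thriller', 'War', 'Western']

# Prefix each genre to obtain the submission column names
cols = ['p_' + genre for genre in classes]

print(len(cols))
print(cols[:3])
print(classes == sorted(classes))  # the hard-coded order is already the sorted order
```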


## Training the Logistic Regression model

In [34]:
from sklearn.linear_model import LogisticRegression

# Definition and training
lr = OneVsRestClassifier(LogisticRegression())
lr.fit(X_train, y_train_genres)

# Prediction with the classification model
y_pred_genres_lr = lr.predict_proba(X_test)

# Print the model's performance
roc_auc_score(y_test_genres, y_pred_genres_lr, average='macro')

0.8858701291884202

## Gradient Boosting

In [18]:
from sklearn.ensemble import GradientBoostingClassifier

# Definition and training
gb = OneVsRestClassifier(GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42))
gb.fit(X_train, y_train_genres)

# Prediction with the classification model
y_pred_genres_gb = gb.predict_proba(X_test)

# Print the model's performance
roc_auc_score(y_test_genres, y_pred_genres_gb, average='macro')

0.7675902816408949

## Neural networks

In [19]:
from sklearn.neural_network import MLPClassifier

# Definition and training
rn = OneVsRestClassifier(MLPClassifier(max_iter=1000, random_state=42))
rn.fit(X_train, y_train_genres)

# Prediction with the classification model
y_pred_genres_rn = rn.predict_proba(X_test)

# Print the model's performance
roc_auc_score(y_test_genres, y_pred_genres_rn, average='macro')

0.8312762738916275

## XGBoost

In [20]:
from xgboost import XGBClassifier

# Definition and training
xgb = XGBClassifier(learning_rate= 0.01,n_estimators= 500,subsample=0.7,max_depth=15)
xgb.fit(X_train, y_train_genres)

# Prediction with the classification model
y_pred_genres_xgb = xgb.predict_proba(X_test)

# Print the model's performance
roc_auc_score(y_test_genres, y_pred_genres_xgb, average='macro')

0.8412382329788834

## AdaBoost

In [21]:
from sklearn.ensemble import AdaBoostClassifier

# Definition and training
adb = OneVsRestClassifier(AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42))
adb.fit(X_train, y_train_genres)

# Prediction with the classification model
y_pred_genres_adb = adb.predict_proba(X_test)

# Print the model's performance
roc_auc_score(y_test_genres, y_pred_genres_adb, average='macro')

0.7874306209069912

## Hyperparameter tuning

In [22]:
# Dictionary with parameter names as keys and lists
# of parameter settings to try as values
param_grid = {
    'estimator__max_iter': [1000, 5000],  # Values for max_iter of the LogisticRegression estimator
    'estimator__tol': [1e-3, 1e-5],      # Values for tol of the LogisticRegression estimator
    'estimator__solver': ['lbfgs', 'liblinear']  # Values for solver of the LogisticRegression estimator
}

# Logistic regression model
estimator = LogisticRegression(random_state=42)
LR = OneVsRestClassifier(estimator)
# Run the hyperparameter search with GridSearchCV
grid_search = GridSearchCV(estimator=LR, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train_genres)



KeyError: 'max_iter'

In [23]:
# Get the best hyperparameters found
best_params = grid_search.best_params_
print("Best hyperparameters found:", best_params)

Best hyperparameters found: {'estimator__max_iter': 1000, 'estimator__solver': 'liblinear', 'estimator__tol': 0.001}
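The `estimator__` prefix routes each setting to the `LogisticRegression` wrapped inside `OneVsRestClassifier`. To get a feel for the cost of the search, the sketch below expands the same grid with `itertools.product`, which mirrors how an exhaustive grid is enumerated; it does not call scikit-learn, and the count of 24 genres is taken from the submission columns.

```python
from itertools import product

param_grid = {
    'estimator__max_iter': [1000, 5000],
    'estimator__tol': [1e-3, 1e-5],
    'estimator__solver': ['lbfgs', 'liblinear'],
}
cv_folds = 5
n_genres = 24  # one binary classifier per genre column

# Every combination of the listed values is one candidate setting
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
cv_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", cv_fits, "cross-validation fits")
print(cv_fits * n_genres, "binary logistic regressions trained during the search")
```

Also worth keeping in mind: with no `scoring` argument, `GridSearchCV` ranks candidates by the estimator's default score rather than by the macro ROC AUC reported for the individual models, so passing an explicit `scoring` is an option to consider.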


In [24]:
best_max_iter = grid_search.best_params_['estimator__max_iter']
best_tol = grid_search.best_params_['estimator__tol']
best_solver = grid_search.best_params_['estimator__solver']

print("Best value for max_iter:", best_max_iter)
print("Best value for tol:", best_tol)
print("Best value for solver:", best_solver)

Best value for max_iter: 1000
Best value for tol: 0.001
Best value for solver: liblinear


## Logistic regression

In [25]:
from sklearn.linear_model import LogisticRegression

# Definition and training
lr_1 = OneVsRestClassifier(LogisticRegression(max_iter=1000,solver='liblinear',tol=0.001))
lr_1.fit(X_train, y_train_genres)

# Prediction with the classification model
y_pred_genres_lr = lr_1.predict_proba(X_test)

# Print the model's performance
roc_auc_score(y_test_genres, y_pred_genres_lr, average='macro')

0.8877110086352514

In [26]:
# Prediction on the test set
y_pred_test_genres_lr = lr_1.predict_proba(X_test_dtm)

# Save predictions in the format required by the Kaggle competition
res_vc = pd.DataFrame(y_pred_test_genres_lr, index=dataTesting.index, columns=cols)
res_vc.to_csv('pred_genres_text_RL.csv', index_label='ID')
res_vc.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.084325,0.08498,0.024958,0.031261,0.365746,0.128829,0.032204,0.568852,0.048948,0.107991,...,0.037304,0.102305,0.001847,0.534524,0.050489,0.012506,0.027298,0.179724,0.023206,0.028459
4,0.157144,0.043064,0.028468,0.133374,0.210425,0.322069,0.063876,0.748243,0.031767,0.028609,...,0.031806,0.038345,0.002295,0.082746,0.027583,0.012959,0.027842,0.257118,0.067064,0.033881
5,0.079557,0.027628,0.013896,0.051718,0.111866,0.622504,0.025251,0.845699,0.020772,0.036739,...,0.017997,0.33238,0.001747,0.177013,0.051279,0.008527,0.022934,0.529714,0.035253,0.017658
6,0.137927,0.093056,0.018533,0.041184,0.202068,0.097321,0.029809,0.724823,0.050779,0.046309,...,0.034808,0.090985,0.001873,0.24975,0.092061,0.008683,0.032097,0.392009,0.064157,0.020676
7,0.082837,0.066391,0.027729,0.035479,0.227235,0.096232,0.050172,0.32872,0.057529,0.118586,...,0.02225,0.097509,0.001948,0.140456,0.365632,0.012492,0.01805,0.257824,0.023513,0.02272


## Model export

In [48]:
lr_1

In [37]:
import joblib
import os

# Create the directory if it doesn't exist
os.makedirs('deployment_taller2', exist_ok=True)

# Now you can safely save the model
joblib.dump(lr_1, 'deployment_taller2/pred_genres_text_RL.pkl', compress=3)


In [38]:
# Export the TF-IDF vectorizer
os.chdir('D:/MIAD/ML-PNL/GIT/MIAD_NLP_2024/deployment_taller2') 
# Save the fitted vectorizer
joblib.dump(vect, 'pred_genres_vect.pkl', compress=3)

['pred_genres_vect.pkl']

In [146]:
os.chdir('D:/MIAD/ML-PNL/GIT/MIAD_NLP_2024/deployment_taller2')  # Absolute path
current_directory = os.getcwd()
print(f"The current working directory is: {current_directory}")


The current working directory is: D:\MIAD\ML-PNL\GIT\MIAD_NLP_2024\deployment_taller2


In [149]:
# Import the model and the prediction function
from pred_genres_text_RL import predict_genero

# Predict the genre probabilities for a plot synopsis
descri= 'most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender .  a day before ,  the boy meets a woman boarding a train ,  a drug abuser .  at the bridge ,  the father goes into the engine room ,  and tells his son to stay at the edge of the nearby lake .  a ship comes ,  and the bridge is lifted .  though it is supposed to arrive an hour later ,  the train happens to arrive .  the son sees this ,  and tries to warn his father ,  who is not able to see this .  just as the oncoming train approaches ,  his son falls into the drawbridge gear works while attempting to lower the bridge ,  leaving the father with a horrific choice .  the father then lowers the bridge ,  the gears crushing the boy .  the people in the train are completely oblivious to the fact a boy died trying to save them ,  other than the drug addict woman ,  who happened to look out her train window .  the movie ends ,  with the man wandering a new city ,  and meets the woman ,  no longer a drug addict ,  holding a small baby .  other relevant narratives run in parallel ,  namely one of the female drug - addict ,  and they all meet at the climax of this tumultuous film .'

predict_genero(descri)

AttributeError: 'OneVsRestClassifier' object has no attribute 'vect'
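The `AttributeError` above suggests that `predict_genero` looks for the vectorizer as an attribute of the loaded classifier, whereas the export cells save the classifier and the TF-IDF vectorizer as two separate files. The `pred_genres_text_RL` module is not shown here, so the following is only a hedged sketch of one way such a function could be organized: load both artifacts, vectorize the text, then call `predict_proba`. The signature, the `load_artifacts` helper and the stub classes are hypothetical; only the two file names come from the export cells.

```python
def load_artifacts(model_path='pred_genres_text_RL.pkl',
                   vect_path='pred_genres_vect.pkl'):
    """Load the classifier and the vectorizer saved in the export cells."""
    import joblib  # imported lazily so the rest of the sketch runs without it
    return joblib.load(model_path), joblib.load(vect_path)

def predict_genero(plot, model, vect, labels):
    """Return {genre: probability} for a single synopsis."""
    X = vect.transform([plot])           # the vectorizer is a separate object
    probs = model.predict_proba(X)[0]    # first (and only) row of probabilities
    return {label: float(p) for label, p in zip(labels, probs)}

# Tiny stand-ins so the sketch can be run without the saved files (hypothetical)
class StubVect:
    def transform(self, docs):
        return [[len(doc.split())] for doc in docs]

class StubModel:
    def predict_proba(self, X):
        return [[0.1, 0.9] for _ in X]

result = predict_genero('a serial killer decides to teach',
                        StubModel(), StubVect(), ['Comedy', 'Horror'])
print(result)
```

With the real files, `model, vect = load_artifacts()` would supply the first two objects. Since pickle stores functions by reference, the custom `tokenize` function used by the vectorizer also needs to be importable under the same module name wherever the vectorizer is loaded.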