![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 -- Group 18 -- Movie Genre Classification

The purpose of this project is for you to put into practice, in your respective working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. While working on it, follow the instructions given in the "Guía del proyecto 2: Clasificación de género de películas".

**Submission**: The project must be submitted during week 8. However, it is important to make progress during week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained PDF report to the project submission activity in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

#### Project description

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains the title of a movie, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The goal is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.
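
Because a movie can belong to several genres at once, this is a multi-label problem: the target is one binary indicator per genre rather than a single class. A minimal sketch of that encoding with scikit-learn's `MultiLabelBinarizer` follows. The first row is the example above; the second row is a made-up movie added only for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# First row is the example from the text; the second row is hypothetical.
genres = [
    ['Comedy', 'Crime', 'Horror'],
    ['Drama'],
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)  # one column per genre, 1 if the movie has it

print(list(mlb.classes_))  # genre names, sorted alphabetically
print(y)
```

Each row of `y` can contain several ones, which is what allows a model to output an independent probability per genre.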

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Data preprocessing

#### Example: test-set prediction for Kaggle submission
This section shows the format in which the prediction results must be saved so that they can be uploaded to the Kaggle competition.
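
As a rough sketch of that format: the submission file has an `ID` index column followed by one probability column per genre, named `p_<Genre>`. The snippet below builds such a file in memory for three genres and two hypothetical test IDs with made-up probabilities; the real file uses all 24 genre columns and the index of `dataTesting`.

```python
import io
import pandas as pd

# Hypothetical subset of genre columns and made-up probabilities, for illustration only.
cols = ['p_Action', 'p_Comedy', 'p_Drama']
probs = [[0.10, 0.30, 0.60],
         [0.70, 0.20, 0.40]]

res = pd.DataFrame(probs, index=[1, 4], columns=cols)

# Write to an in-memory buffer instead of disk; index_label names the ID column.
buffer = io.StringIO()
res.to_csv(buffer, index_label='ID')
csv_text = buffer.getvalue()
print(csv_text)
```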

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Library imports
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [None]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [None]:
# Preview the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [None]:
# Preview the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


Looking at the training and test data, both sets contain the release year (`year`), the title (`title`), and the synopsis (`plot`) of each movie. The training set additionally contains the `rating` variable and the prediction target, which is the list of genres each movie belongs to (`genres`).

In [None]:
# Define the predictor variables (X)
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)

In [None]:
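# Define the target variable (y)
import ast

# Parse string-encoded lists into Python lists; leave existing lists as they are.
# ast.literal_eval only evaluates literals, so it is safer than eval on file contents.
dataTraining['genres'] = dataTraining['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])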

### Stopword removal

In [None]:
!pip install neattext



In [None]:
!pip install scikit-multilearn


In [None]:
import neattext as nt
import neattext.functions as nfx

In [None]:
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.metrics import accuracy_score,hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset


In [None]:
dataTraining['plot'].apply(lambda x:nt.TextFrame(x).noise_scan())

3107    {'text_noise': 11.650485436893204, 'text_lengt...
900     {'text_noise': 8.51063829787234, 'text_length'...
6724    {'text_noise': 11.533242876526458, 'text_lengt...
4704    {'text_noise': 10.256410256410255, 'text_lengt...
2582    {'text_noise': 10.321324245374878, 'text_lengt...
                              ...                        
8417    {'text_noise': 12.313432835820896, 'text_lengt...
1592    {'text_noise': 9.106239460370995, 'text_length...
1723    {'text_noise': 9.045226130653267, 'text_length...
7605    {'text_noise': 11.129476584022038, 'text_lengt...
215     {'text_noise': 7.627118644067797, 'text_length...
Name: plot, Length: 7895, dtype: object

In [None]:
dataTraining['plot'].apply(lambda x:nt.TextExtractor(x).extract_stopwords())


3107    [most, is, the, of, a, who, his, eight, to, wi...
900                          [a, to, the, of, his, to, a]
6724    [in, a, with, a, a, who, beyond, his, they, be...
4704    [in, a, in, the, of, the, has, just, had, a, w...
2582    [in, the, of, a, to, a, with, the, who, has, t...
                              ...                        
8417    [our, their, it, s, one, for, any, and, and, a...
1592    [the, his, are, with, and, her, to, a, they, m...
1723    [a, by, the, and, of, that, a, it, all, in, on...
7605    [a, in, a, with, her, on, the, she, is, to, mo...
215                [of, never, to, up, with, her, and, a]
Name: plot, Length: 7895, dtype: object

In [None]:
dataTraining['plot'].apply(nfx.remove_stopwords)

3107    story single father takes year - old son work ...
900     serial killer decides teach secrets satisfying...
6724    sweden , female blackmailer disfiguring facial...
4704    friday afternoon new york , president tredway ...
2582    los angeles , editor publishing house carol hu...
                              ...                        
8417    " marriage , wedding . " ' lesson number newly...
1592    wandering barbarian , conan , alongside goofy ...
1723    like tale spun scheherazade , kismet follows r...
7605    mrs . brisby , widowed mouse , lives cinder bl...
215     tinker bell journey far north land patch thing...
Name: plot, Length: 7895, dtype: object

In [None]:
corpus = dataTraining['plot'].apply(nfx.remove_stopwords)

In [None]:
corpus

3107    story single father takes year - old son work ...
900     serial killer decides teach secrets satisfying...
6724    sweden , female blackmailer disfiguring facial...
4704    friday afternoon new york , president tredway ...
2582    los angeles , editor publishing house carol hu...
                              ...                        
8417    " marriage , wedding . " ' lesson number newly...
1592    wandering barbarian , conan , alongside goofy ...
1723    like tale spun scheherazade , kismet follows r...
7605    mrs . brisby , widowed mouse , lives cinder bl...
215     tinker bell journey far north land patch thing...
Name: plot, Length: 7895, dtype: object

### Initializing the TF-IDF vectorizer

In [None]:
tfidf = TfidfVectorizer()


### Fitting the vectorizer to the corpus

In [None]:
Xfeat = tfidf.fit(corpus)
# Xfeat is the fitted vectorizer (fit returns the vectorizer itself, not the feature matrix)
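
`fit` only learns the vocabulary and the IDF weights and returns the vectorizer itself; the feature matrix comes from a later `transform` call. A small sketch on a made-up three-document corpus (the documents are illustrative, not taken from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only.
toy_corpus = [
    'serial killer teaches secrets',
    'single father takes son work',
    'killer hunts father',
]

toy_tfidf = TfidfVectorizer()
fitted = toy_tfidf.fit(toy_corpus)   # learns vocabulary + IDF, returns the same object
X_new = fitted.transform(['killer father unknownword'])  # unseen words are ignored

print(fitted is toy_tfidf)
print(X_new.shape)   # (1, size of the learned vocabulary)
print(X_new.nnz)     # number of known terms found in the new document
```

This is why the fitted vectorizer, rather than a feature matrix, is the object worth persisting for later use.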

### Saving the model

In [None]:
import os
import joblib

# Create the directory if it doesn't exist
os.makedirs('model_deployment', exist_ok=True)

joblib.dump(Xfeat, 'model_deployment/Xfeat.pkl', compress=3)
# Save the fitted vectorizer as a .pkl file in the repository's model_deployment folder

['model_deployment/Xfeat.pkl']
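
At serving time the saved vectorizer can be loaded back and applied to new text without refitting. A sketch of that round trip, using a temporary directory and a toy corpus rather than the project's `model_deployment/Xfeat.pkl` (both are assumptions for illustration):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on a toy corpus (illustrative only).
vectorizer = TfidfVectorizer().fit(['a serial killer', 'a single father'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Xfeat.pkl')
    joblib.dump(vectorizer, path, compress=3)   # same call shape as in the cell above
    restored = joblib.load(path)                # reload without refitting

# The restored object carries the same vocabulary and can transform new text.
same_vocab = restored.vocabulary_ == vectorizer.vocabulary_
features = restored.transform(['serial father'])
print(same_vocab, features.shape)
```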

In [None]:
# Vectorize with TfidfVectorizer
tfidf = TfidfVectorizer()
Xfeatures = tfidf.fit_transform(corpus).toarray()

In [None]:
Xfeatures

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## Token-based processing

In [None]:
import nltk

In [None]:
import pandas as pd
import numpy as np
import json
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import ToktokTokenizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix


In [None]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:

tokenizer = ToktokTokenizer()
STOPWORDS = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

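# Clean the text: strip non-word characters, drop isolated single letters,
# collapse whitespace, and convert to lowercase
def limpiar_texto(texto):
    # Replace every non-word character with a space
    texto = re.sub(r'\W', ' ', str(texto))
    # Remove single letters surrounded by whitespace
    texto = re.sub(r'\s+[a-zA-Z]\s+', ' ', texto)
    # Collapse runs of whitespace into a single space
    texto = re.sub(r'\s+', ' ', texto)
    texto = texto.lower()
    return texto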

def filtrar_stopword_digitos(tokens):
    return [token for token in tokens if token not in STOPWORDS
            and not token.isdigit()]

def stem_palabras(tokens):
    return [stemmer.stem(token) for token in tokens]

def tokenize(texto):
    text_cleaned = limpiar_texto(texto)
    tokens = [word for word in tokenizer.tokenize(text_cleaned) if len(word) > 1]
    tokens = filtrar_stopword_digitos(tokens)
    stems = stem_palabras(tokens)
    return stems
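
To see what the cleaning step does before stemming, the sketch below re-implements the same three regular expressions on a made-up sentence. It is a simplified stand-in: it splits on whitespace instead of using `ToktokTokenizer`, uses a tiny hypothetical stopword set instead of NLTK's English list, and skips stemming.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for illustration).
TOY_STOPWORDS = {'a', 'in', 'the', 'of'}

def clean(text):
    text = re.sub(r'\W', ' ', str(text))           # non-word characters -> space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)    # drop isolated single letters
    text = re.sub(r'\s+', ' ', text)               # collapse repeated whitespace
    return text.lower()

def toy_tokenize(text):
    tokens = [w for w in clean(text).split() if len(w) > 1]
    return [t for t in tokens if t not in TOY_STOPWORDS and not t.isdigit()]

sample = "A serial-killer's career, in 2008!"
cleaned = clean(sample)
tokens = toy_tokenize(sample)
print(repr(cleaned))
print(tokens)
```

Punctuation, the stray `s` left by the apostrophe, the stopword `in`, and the year are all removed, leaving only content words for the vectorizer.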

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Define the predictor variables (X)
vect = TfidfVectorizer(tokenizer=tokenize,sublinear_tf=True,max_features=15000)
X_dtm = vect.fit_transform(dataTraining['plot']).toarray()
X_dtm.shape

(7895, 15000)

### Train/validation split

In [None]:
# Split the predictors (X) and the target (y) into training and test sets with train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

### RandomForestClassifier model

In [134]:
# Define and train
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [135]:
# Predict with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Report model performance
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8167568769904546
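
The `average='macro'` option computes one ROC AUC per genre column and then takes their unweighted mean, so rare genres count as much as frequent ones. A sketch on made-up labels and scores (all values below are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth (3 genres) and predicted probabilities for 4 movies.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.6],
                    [0.3, 0.8, 0.4],
                    [0.6, 0.4, 0.7],
                    [0.2, 0.5, 0.9]])

# One AUC per genre column, then the plain average.
per_genre = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(y_true.shape[1])]
macro_manual = float(np.mean(per_genre))
macro_sklearn = roc_auc_score(y_true, y_score, average='macro')
print(per_genre, macro_manual, macro_sklearn)
```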

In [136]:
# Transform the predictors (X) of the test set
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [None]:
# Save the predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

### Logistic regression model

In [None]:
from sklearn.linear_model import LogisticRegression

MRL = OneVsRestClassifier(LogisticRegression(random_state=42))
MRL.fit(X_train, y_train_genres)  # OneVsRestClassifier is used because this is a multi-label problem

In [126]:
# Export the model to a binary .pkl file
joblib.dump(MRL, 'movie_genre_MRL3.pkl', compress=3)

['movie_genre_MRL3.pkl']

In [None]:
y_pred_genres = MRL.predict_proba(X_test)

# Report model performance
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8872001870449445
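
With a multi-label indicator matrix as target, `OneVsRestClassifier` fits one independent binary logistic regression per genre, and `predict_proba` returns one column of probabilities per genre; rows need not sum to 1. A sketch on random toy data (the features and labels are synthetic assumptions, not the movie dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(42)

# Synthetic features and a 3-label indicator target (illustrative only).
X_toy = rng.normal(size=(60, 5))
Y_toy = (X_toy[:, :3] > 0).astype(int)   # each label depends on one feature

ovr = OneVsRestClassifier(LogisticRegression(random_state=42))
ovr.fit(X_toy, Y_toy)

proba = ovr.predict_proba(X_toy[:4])
print(len(ovr.estimators_))   # one fitted binary model per label
print(proba.shape)            # (4 samples, 3 labels)
```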

In [None]:
# Transform the predictors (X) of the test set with the same vectorizer used to build X_train
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict on the test set
y_pred_test_genres = MRL.predict_proba(X_test_dtm)



In [None]:

# Save the predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_MRL4_26.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.118906,0.09194,0.028154,0.040843,0.338875,0.118409,0.0398,0.544789,0.062132,0.120078,...,0.039186,0.078791,0.000681,0.54891,0.062577,0.012104,0.025324,0.182757,0.030925,0.030185
4,0.144365,0.058738,0.029806,0.102112,0.256749,0.307222,0.056382,0.738892,0.043368,0.037717,...,0.034057,0.045585,0.000696,0.113574,0.030402,0.011689,0.029225,0.222783,0.050337,0.035587
5,0.115259,0.050894,0.019201,0.052019,0.186162,0.556576,0.034139,0.770691,0.033442,0.041781,...,0.023579,0.300214,0.000677,0.151438,0.073453,0.010014,0.023401,0.495128,0.031696,0.025169
6,0.11699,0.097831,0.020185,0.042219,0.234858,0.087335,0.042486,0.710475,0.054216,0.0568,...,0.033132,0.10446,0.000675,0.243425,0.08785,0.008402,0.031762,0.335164,0.065932,0.020419
7,0.079485,0.071038,0.025683,0.028432,0.210374,0.129143,0.03544,0.397787,0.059006,0.123337,...,0.027689,0.158501,0.000678,0.193445,0.28966,0.010824,0.018056,0.374319,0.027744,0.022711


### Neural network model

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import roc_auc_score




In [None]:


# Split the predictors (X) and the target (y)
X_train, X_test, y_train_genres, y_test_genres = train_test_split(Xfeatures, y_genres, test_size=0.33, random_state=42)



In [None]:
# Define the neural network model
def create_nn_model(input_dim, output_dim, neurons=64, activation='relu', optimizer='adam', loss='binary_crossentropy'):
    model = Sequential()
    model.add(Dense(neurons, input_dim=input_dim, activation=activation))
    model.add(Dense(neurons, activation=activation))
    model.add(Dense(output_dim, activation='sigmoid'))
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
    return model

# Create the model
input_dim = X_train.shape[1]
output_dim = y_train_genres.shape[1]
model = create_nn_model(input_dim, output_dim)

# Define the EarlyStopping callback
early_stopping = EarlyStopping(monitor="val_loss", patience=2)

# Train the neural network
history = model.fit(X_train, y_train_genres,
                    validation_data=(X_test, y_test_genres),
                    epochs=20,
                    batch_size=64,
                    callbacks=[early_stopping],
                    verbose=True)

# Predict on the held-out split
y_pred_genres = model.predict(X_test)

# Report model performance
roc_auc = roc_auc_score(y_test_genres, y_pred_genres, average='macro')
print("ROC AUC Score: ", roc_auc)

# Generate predictions for the Kaggle test set
X_test_dtm = tfidf.transform(dataTesting['plot'])  # Vectorize with the fitted TfidfVectorizer
y_pred_test_genres = model.predict(X_test_dtm.toarray())  # Predict on the Kaggle test set

# Save the predictions to a CSV file in the format required by Kaggle
cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('kaggle_submissionRN.csv', index_label='ID')

Epoch 1/20
83/83 ━━━━━━━━━━━━━━━━━━━━ 8s 71ms/step - accuracy: 0.1334 - loss: 0.5879 - val_accuracy: 0.1926 - val_loss: 0.2998
Epoch 2/20
83/83 ━━━━━━━━━━━━━━━━━━━━ 10s 71ms/step - accuracy: 0.2057 - loss: 0.2905 - val_accuracy: 0.1926 - val_loss: 0.2910
Epoch 3/20
83/83 ━━━━━━━━━━━━━━━━━━━━ 4s 51ms/step - accuracy: 0.2235 - loss: 0.2704 - val_accuracy: 0.2084 - val_loss: 0.2752
Epoch 4/20
83/83 ━━━━━━━━━━━━━━━━━━━━ 5s 51ms/step - accuracy: 0.3403 - loss: 0.2319 - val_accuracy: 0.3258 - val_loss: 0.2492
Epoch 5/20
83/83 ━━━━━━━━━━━━━━━━━━━━ 6s 72ms/step - accuracy: 0.4452 - loss: 0.1805 - val_accuracy: 0.3603 - val_loss: 0.2343
Epoch 6/20
83/83 ━━━━━━━━━━━━━━━━━━━━ 9s 55ms/step - accuracy: 0.5158 - loss: 0.1342 - val_accuracy: 0.3853 - val_loss: 0.2327
Epoch 7/20
83/83 ━━━

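The network above ends in a sigmoid layer trained with binary cross-entropy, which suits multi-label targets: each genre gets its own independent probability. A softmax layer would instead force the genre probabilities to sum to 1. A NumPy sketch of the difference, on made-up logits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical output-layer logits for three genres of one movie.
logits = np.array([2.0, 1.5, -3.0])

p_sigmoid = sigmoid(logits)   # independent per-genre probabilities
p_softmax = softmax(logits)   # probabilities competing for a total of 1

print(p_sigmoid, p_sigmoid.sum())
print(p_softmax, p_softmax.sum())
```

With sigmoid, the first two genres can both receive a probability above 0.5, which matches movies that belong to several genres.
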
### Tuning the model with the highest AUC

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
 !pip install livelossplot



Collecting livelossplot
  Downloading livelossplot-0.5.5-py3-none-any.whl (22 kB)
Installing collected packages: livelossplot
Successfully installed livelossplot-0.5.5


In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras import backend as K
from keras.callbacks import EarlyStopping
from livelossplot import PlotLossesKeras  # Import after installing the package

def nn_model_params(optimizer, neurons, batch_size, epochs, activation, patience, loss):
    K.clear_session()

    model = Sequential()
    model.add(Dense(neurons, input_shape=(dims,), activation=activation))
    model.add(Dense(neurons, activation=activation))
    model.add(Dense(output_var, activation='sigmoid'))

    model.compile(optimizer=optimizer, loss=loss)

    early_stopping = EarlyStopping(monitor="val_loss", patience=patience)

    model.fit(X_train, y_train_genres,
              validation_data=(X_test, y_test_genres),
              epochs=epochs,
              batch_size=batch_size,
              callbacks=[early_stopping, PlotLossesKeras()],
              verbose=True)

    return model

In [None]:
nn_params = {
    'optimizer': ['adam', 'sgd'],
    'activation': ['relu'],
    'batch_size': [64, 128],
    'neurons': [64, 256],
    'epochs': [20, 50],
    'patience': [2, 5],
    'loss': ['binary_crossentropy']
}
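
The dictionary above describes a grid of candidate settings for `nn_model_params`. One way to enumerate it, sketched here with scikit-learn's `ParameterGrid` (an assumption; the notebook does not show how the grid is consumed), is to expand it into individual keyword-argument dictionaries:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
nn_params = {
    'optimizer': ['adam', 'sgd'],
    'activation': ['relu'],
    'batch_size': [64, 128],
    'neurons': [64, 256],
    'epochs': [20, 50],
    'patience': [2, 5],
    'loss': ['binary_crossentropy'],
}

grid = list(ParameterGrid(nn_params))
print(len(grid))   # number of distinct combinations
print(grid[0])     # one combination, usable as nn_model_params(**grid[0])
```

Every combination trains a full network, so the size of the grid directly determines the cost of an exhaustive search.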


In [None]:
dims = X_train.shape[1]
print(dims, 'input variables')

output_var = y_train_genres.shape[1]
print(output_var, 'output variables')


38358 input variables
24 output variables


In [None]:
!pip install tensorflow scikit-learn livelossplot


Collecting keras<2.16,>=2.15.0 (from tensorflow)
  Using cached keras-2.15.0-py3-none-any.whl (1.7 MB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 3.3.3
    Uninstalling keras-3.3.3:
      Successfully uninstalled keras-3.3.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scikeras 0.13.0 requires keras>=3.2.0, but you have keras 2.15.0 which is incompatible.
Successfully installed keras-2.15.0


In [None]:
!pip install scikeras


Collecting keras>=3.2.0 (from scikeras)
  Using cached keras-3.3.3-py3-none-any.whl (1.1 MB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 2.15.0
    Uninstalling keras-2.15.0:
      Successfully uninstalled keras-2.15.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.3.3 which is incompatible.
Successfully installed keras-3.3.3


In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score

In [None]:
# Model architecture

num_classes = y_train_genres.shape[1]  # one output unit per genre

def create_model(input_dim, output_dim, optimizer='adam', dropout_rate=0.0, neurons=64, activation='relu'):
    model = Sequential()
    model.add(Dense(neurons, input_dim=input_dim, activation=activation))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neurons, activation=activation))
    model.add(Dropout(dropout_rate))
    # Multi-label output: independent sigmoid per genre with binary cross-entropy
    model.add(Dense(output_dim, activation='sigmoid'))
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# KerasClassifier instance that exposes the scikit-learn API
# (model= replaces the deprecated build_fn= argument)
nn_model = KerasClassifier(model=create_model, input_dim=X_train.shape[1], output_dim=num_classes, verbose=0)



In [None]:
# Parameter distributions for RandomizedSearchCV

nn_params = {
    'model__optimizer': ['adam', 'rmsprop'],
    'model__dropout_rate': [0.0, 0.2, 0.4],
    'model__neurons': [32, 64, 128],
    'model__activation': ['relu', 'tanh'],
    'epochs': [10, 20],
    'batch_size': [32, 64]
}


In [None]:

# RandomizedSearchCV

rs = RandomizedSearchCV(estimator=nn_model,
                        param_distributions=nn_params,
                        n_iter=5,
                        scoring='neg_mean_squared_error',
                        cv=5,
                        n_jobs=-1)

In [None]:
rs

In [None]:
from sklearn.metrics import roc_auc_score

### Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

# Define and train
lr = OneVsRestClassifier(LogisticRegression(max_iter=5000, solver='liblinear', tol=0.000001))
lr.fit(X_train, y_train_genres)

# Predict with the classification model
y_pred_genres_lr = lr.predict_proba(X_test)

# Report model performance
roc_auc_score(y_test_genres, y_pred_genres_lr, average='macro')

0.878230957783155

### Neural network

In [None]:
from sklearn.neural_network import MLPClassifier

# Define and train
rn = OneVsRestClassifier(MLPClassifier(max_iter=1000, random_state=42))
rn.fit(X_train, y_train_genres)

# Predict with the classification model
y_pred_genres_rn = rn.predict_proba(X_test)

# Report model performance
roc_auc_score(y_test_genres, y_pred_genres_rn, average='macro')

### XGBoost

In [None]:
from xgboost import XGBClassifier

# Define and train
xgb = XGBClassifier(learning_rate=0.01, colsample_bytree=0.7999999999999999, n_estimators=500, subsample=0.7999999999999999, colsample_bylevel=0.7, max_depth=15)
xgb.fit(X_train, y_train_genres)

# Predict with the classification model
y_pred_genres_xgb = xgb.predict_proba(X_test)

# Report model performance
roc_auc_score(y_test_genres, y_pred_genres_xgb, average='macro')

0.8045429016347838

### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# Define and train
gb = OneVsRestClassifier(GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42))
gb.fit(X_train, y_train_genres)

# Predict with the classification model
y_pred_genres_gb = gb.predict_proba(X_test)

# Report model performance
roc_auc_score(y_test_genres, y_pred_genres_gb, average='macro')

0.7752809591506488

### CatBoost

In [None]:
!pip install catboost

from catboost import CatBoostClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score


Collecting catboost
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.2/98.2 MB 4.7 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.2.5


In [None]:


# Define and train
cbt = OneVsRestClassifier(CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, random_seed=42))
cbt.fit(X_train, y_train_genres)

# Predict with the classification model
y_pred_genres_cbt = cbt.predict_proba(X_test)

# Report model performance
roc_auc_score(y_test_genres, y_pred_genres_cbt, average='macro')

0:	learn: 0.6421282	total: 111ms	remaining: 11s
1:	learn: 0.6017279	total: 165ms	remaining: 8.08s
2:	learn: 0.5678153	total: 242ms	remaining: 7.83s
3:	learn: 0.5426274	total: 303ms	remaining: 7.27s
4:	learn: 0.5214551	total: 357ms	remaining: 6.78s
5:	learn: 0.5047903	total: 422ms	remaining: 6.61s
6:	learn: 0.4899505	total: 471ms	remaining: 6.26s
7:	learn: 0.4767658	total: 524ms	remaining: 6.02s
8:	learn: 0.4660916	total: 601ms	remaining: 6.07s
9:	learn: 0.4574811	total: 705ms	remaining: 6.35s
10:	learn: 0.4483967	total: 784ms	remaining: 6.34s
11:	learn: 0.4422234	total: 866ms	remaining: 6.35s
12:	learn: 0.4369218	total: 936ms	remaining: 6.26s
13:	learn: 0.4322315	total: 990ms	remaining: 6.08s
14:	learn: 0.4268778	total: 1.07s	remaining: 6.08s
15:	learn: 0.4208529	total: 1.13s	remaining: 5.93s
16:	learn: 0.4168932	total: 1.19s	remaining: 5.82s
17:	learn: 0.4142480	total: 1.25s	remaining: 5.69s
18:	learn: 0.4094563	total: 1.29s	remaining: 5.49s
19:	learn: 0.4058498	total: 1.36s	remaining

0.7915483115827975

### AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Define and train
adb = OneVsRestClassifier(AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42))
adb.fit(X_train, y_train_genres)

# Predict with the classification model
y_pred_genres_adb = adb.predict_proba(X_test)

# Report model performance
roc_auc_score(y_test_genres, y_pred_genres_adb, average='macro')

0.7701094207998874

### Serving the model through an API

In [None]:
pip install flask_restx

Collecting flask_restx
  Downloading flask_restx-1.3.0-py2.py3-none-any.whl (2.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.8/2.8 MB 11.7 MB/s eta 0:00:00
Collecting aniso8601>=0.82 (from flask_restx)
  Downloading aniso8601-9.0.1-py2.py3-none-any.whl (52 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.8/52.8 kB 3.0 MB/s eta 0:00:00
Installing collected packages: aniso8601, flask_restx
Successfully installed aniso8601-9.0.1 flask_restx-1.3.0


In [None]:
pip install flask



In [None]:
from flask import Flask
from flask_restx import Api, Resource, fields

In [None]:
pip install joblib



In [None]:
# Imports for exporting the model and serving it through the API
import os
import sys

import joblib
import pandas as pd
from flask import Flask
from flask_restx import Api, Resource, fields

import neattext as nt
import neattext.functions as nfx
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
MRL

In [124]:
# Export the model to a binary .pkl file
joblib.dump(MRL, 'movie_genre_MRL2.pkl', compress=3)

['movie_genre_MRL2.pkl']



### Serving the API with the model locally

In [114]:
!pip install flask-restx

from flask import Flask
from flask_restx import Api, Resource, fields
import pandas as pd
import joblib



In [115]:
app = Flask(__name__)
api = Api(app, version='1.0', title='Movie Genre API', description='Predicts movie genres from the plot synopsis')
ns = api.namespace('predicción', description='Movie genre')
parser = api.parser()

In [116]:
parser.add_argument(
    'Plot',  # must match the key read in the endpoint via args['Plot']
    type=str,
    required=True,
    help='Plot synopsis of the movie to classify',
    location='args'
)

resource_fields = api.model('Resource', {
    'result': fields.String,
})

In [117]:
@ns.route('/')
class MovieGenreApi(Resource):
    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        plot = args['Plot']
        prediction = predict_genre(plot)
        return {"result": prediction.to_dict()}, 200

In [132]:
# In practice the API is better deployed to the cloud from an application, with a serverless configuration file and an endpoint
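
The endpoint above calls `predict_genre`, which is not defined in this notebook. A hypothetical sketch of such a helper follows: it vectorizes one plot and returns a pandas Series of per-genre probabilities, so that `.to_dict()` works as in the handler. To stay self-contained it trains a tiny model on made-up data; in the project, the vectorizer and model would presumably be loaded from the saved `.pkl` files instead.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up training data, for illustration only.
plots = [
    'a killer stalks the city at night',
    'two friends open a funny pizza shop',
    'a funny killer clown scares a town',
    'a detective hunts a killer',
]
genres = [['Horror'], ['Comedy'], ['Comedy', 'Horror'], ['Crime']]

toy_vect = TfidfVectorizer()
toy_mlb = MultiLabelBinarizer()
toy_model = OneVsRestClassifier(LogisticRegression())
toy_model.fit(toy_vect.fit_transform(plots), toy_mlb.fit_transform(genres))

def predict_genre(plot):
    """Hypothetical helper: per-genre probabilities for a single plot string."""
    features = toy_vect.transform([plot])
    proba = toy_model.predict_proba(features)[0]
    return pd.Series(proba, index=['p_' + g for g in toy_mlb.classes_])

prediction = predict_genre('a funny detective story')
print(prediction.to_dict())
```

The key design point is that the same fitted vectorizer used at training time is applied to the incoming plot, so the feature columns line up with what the model expects.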