![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Proyecto 2 - Clasificación de género de películas

El propósito de este proyecto es que puedan poner en práctica, en sus respectivos grupos de trabajo, sus conocimientos sobre técnicas de preprocesamiento, modelos predictivos de NLP, y la disponibilización de modelos. Para su desarrollo tengan en cuenta las instrucciones dadas en la "Guía del proyecto 2: Clasificación de género de películas"

**Entrega**: La entrega del proyecto deberán realizarla durante la semana 8. Sin embargo, es importante que avancen en la semana 7 en el modelado del problema y en parte del informe, tal y como se les indicó en la guía.

Para hacer la entrega, deberán adjuntar el informe autocontenido en PDF a la actividad de entrega del proyecto que encontrarán en la semana 8, y subir el archivo de predicciones a la [competencia de Kaggle](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Datos para la predicción de género en películas

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

En este proyecto se usará un conjunto de datos de géneros de películas. Cada observación contiene el título de una película, su año de lanzamiento, la sinopsis o plot de la película (resumen de la trama) y los géneros a los que pertenece (una película puede pertenercer a más de un género). Por ejemplo:
- Título: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Generos: 'Comedy', 'Crime', 'Horror'

La idea es que usen estos datos para predecir la probabilidad de que una película pertenezca, dada la sinopsis, a cada uno de los géneros.

Agradecemos al profesor Fabio González, Ph.D. y a su alumno John Arevalo por proporcionar este conjunto de datos. Ver https://arxiv.org/abs/1702.01992

## Ejemplo predicción conjunto de test para envío a Kaggle
En esta sección encontrarán el formato en el que deben guardar los resultados de la predicción para que puedan subirlos a la competencia en Kaggle.

In [1]:
# Importación librerías

import pandas as pd
import re
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, SpatialDropout1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, LearningRateScheduler
from tensorflow.keras.metrics import AUC
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\userml\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\userml\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# Carga de datos de archivo .csv Se descargo la informacion del Github ya que el servidor no nos dio permisos

#dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
#dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)
dataTraining = pd.read_csv('dataTraining.csv', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('dataTesting.csv', encoding='UTF-8', index_col=0)

In [3]:
# Visualización datos de entrenamiento
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [4]:
# Visualización datos de test
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


In [5]:
# Convertir todas las palabras a minúsculas
#dataTraining["plot"]

In [6]:
# Definición de la función para limpiar los datos
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words])
    return text

In [7]:
# Invocar la función para limpiar los datos 'plot'
dataTraining['plot_cleaned'] = dataTraining['plot'].apply(clean_text)
dataTesting['plot_cleaned'] = dataTesting['plot'].apply(clean_text)

In [196]:
# Tokenizar los datos
tokenizer = Tokenizer(num_words=30000)
tokenizer.fit_on_texts(dataTraining['plot_cleaned'])
X_train_seq = tokenizer.texts_to_sequences(dataTraining['plot_cleaned'])
X_test_seq = tokenizer.texts_to_sequences(dataTesting['plot_cleaned'])

max_seq_len = 500
X_train_padded = pad_sequences(X_train_seq, maxlen=max_seq_len, padding='post')
X_test_padded = pad_sequences(X_test_seq, maxlen=max_seq_len, padding='post')

In [9]:
# Cargue de datos GloVe embeddings
embeddings_index = {}
glove_path = r'C:\Users\userml\Documents\Proyecto 2\glove.6B.300d.txt'  

In [10]:
with open(glove_path, encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

embedding_dim = 300  # Cambiado a 300 dimensiones
word_index = tokenizer.word_index
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [12]:
# Definir la variable objetivo
dataTraining['genres'] = dataTraining['genres'].map(lambda x: eval(x) if isinstance(x, str) else x)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])


In [13]:
# Separación de variables predictoras (X) y variable de interés (y) en set de entrenamiento y test usandola función train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_train_padded, y_genres, test_size=0.2, random_state=42)

In [14]:
# Definir el modelo
model = Sequential()
model.add(Embedding(input_dim=len(word_index) + 1, output_dim=embedding_dim, weights=[embedding_matrix], input_length=max_seq_len, trainable=True))
model.add(SpatialDropout1D(0.3))
model.add(Bidirectional(LSTM(256, return_sequences=True)))  
model.add(Dropout(0.5))
model.add(Bidirectional(LSTM(128)))  
model.add(Dropout(0.5))
model.add(Dense(len(le.classes_), activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=[AUC(name='auc')])



In [16]:
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, min_lr=0.00001)

def scheduler(epoch, lr):
    if epoch < 5:
        return lr
    else:
        return float(lr * tf.math.exp(-0.1))

lr_scheduler = LearningRateScheduler(scheduler)

In [17]:
# Entrenar el modelo
history = model.fit(X_train, y_train, epochs=20, batch_size=128, validation_data=(X_val, y_val), callbacks=[early_stopping, reduce_lr, lr_scheduler])

Epoch 1/20
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m344s[0m 7s/step - auc: 0.6684 - loss: 0.4018 - val_auc: 0.7919 - val_loss: 0.2964 - learning_rate: 0.0010
Epoch 2/20
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m409s[0m 8s/step - auc: 0.7787 - loss: 0.2988 - val_auc: 0.8217 - val_loss: 0.2798 - learning_rate: 0.0010
Epoch 3/20
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m469s[0m 9s/step - auc: 0.8221 - loss: 0.2778 - val_auc: 0.8658 - val_loss: 0.2535 - learning_rate: 0.0010
Epoch 4/20
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m519s[0m 10s/step - auc: 0.8607 - loss: 0.2548 - val_auc: 0.8855 - val_loss: 0.2380 - learning_rate: 0.0010
Epoch 5/20
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m536s[0m 11s/step - auc: 0.8871 - loss: 0.2344 - val_auc: 0.8949 - val_loss: 0.2299 - learning_rate: 0.0010
Epoch 6/20
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m577s[0m 11s/step - auc: 0.9057 - loss: 0.2195 - val_au

In [18]:
# Evaluar el modelo
y_val_pred = model.predict(X_val)
roc_auc = roc_auc_score(y_val, y_val_pred, average='macro')
print(f'ROC AUC score: {roc_auc}')

[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m33s[0m 658ms/step
ROC AUC score: 0.8752441973979918


In [19]:
# Predicción con los datos de prueba
y_test_pred = model.predict(X_test_padded)

[1m106/106[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m67s[0m 630ms/step


In [20]:
# Guardar los datos de la predicción
cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

res = pd.DataFrame(y_test_pred, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_lstm_07.csv', index_label='ID')
print(res.head())


   p_Action  p_Adventure  p_Animation  p_Biography  p_Comedy   p_Crime  \
1  0.001352     0.007599     0.000637     0.004125  0.761713  0.014265   
4  0.110071     0.010819     0.000622     0.237064  0.053731  0.743492   
5  0.060138     0.004059     0.000157     0.033374  0.062364  0.928365   
6  0.077048     0.038337     0.000165     0.007262  0.029274  0.089453   
7  0.023518     0.015545     0.000509     0.006084  0.031476  0.060377   

   p_Documentary   p_Drama  p_Family  p_Fantasy  ...  p_Musical  p_Mystery  \
1       0.000753  0.838573  0.008699   0.014334  ...   0.016757   0.008411   
4       0.023588  0.930460  0.001160   0.001941  ...   0.008084   0.038446   
5       0.005234  0.918653  0.000474   0.002217  ...   0.001635   0.295490   
6       0.001951  0.884827  0.000706   0.004850  ...   0.000669   0.116021   
7       0.007203  0.275910  0.001985   0.051811  ...   0.002000   0.506919   

     p_News  p_Romance  p_Sci-Fi   p_Short   p_Sport  p_Thriller     p_War  \
1  0.000

**Disponibilización de la API**

In [243]:
Plot = 'two drifters are passing through a western town ,  when news comes in that a local farmer has been murdered and his cattle stolen .  the townspeople ,  joined by the drifters ,  form a posse to catch the perpetrators .  they find three men in possession of the cattle ,  and are determined to see justice done on the spot .'

In [245]:
df['Plot_Cleaned'] = df['plot'].apply(clean_text)

In [247]:
Text = tokenizer.texts_to_sequences(df['Plot_Cleaned'])

In [249]:
max_seq_len = 500
Plot_Ejemplo = pad_sequences(Text, maxlen=max_seq_len, padding='post')

In [250]:
valores= model.predict(Plot_Ejemplo)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 121ms/step


In [251]:
# Combinar los dos arrays
valores= model.predict(Plot_Ejemplo)
probabilidad= valores[0]
generos= dataTraining_exploded['genres'].unique()
combinados = list(zip(generos, probabilidad))
# Encontrar el género con la probabilidad más alta
generos_max_probabilidades = sorted(combinados, key=lambda x: x[1], reverse=True)[:3]

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 114ms/step


In [253]:
generos_max_probabilidades

[('Action', 0.58583987), ('News', 0.3208802), ('Horror', 0.16855434)]

In [49]:
# Exportar modelo a archivo binario .pkl
import joblib

#joblib.dump(clf, r'C:\Users\userml\Documents\proyecto2\clf.pkl', compress=3)

import os

# Crear el directorio si no existe
output_dir = r'C:\Users\userml\Documents\Proyecto 2\model.pkl'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    
# Guardar el modelo en el archivo .pkl
joblib.dump(model, os.path.join(output_dir, 'model.pkl'), compress=3)

['C:\\Users\\userml\\Documents\\Proyecto 2\\model.pkl\\model.pkl']

In [None]:
from flask import Flask
from flask_restx import Api, Resource, fields
import joblib

app = Flask(__name__)

api = Api(
    app, 
    version='1.0', 
    title='API Clasificador de Generos',
    description='API clasificador de genero de peliculas basado en la descripcion de la misma'
)

ns = api.namespace('predict', 
     description='Clasificador de Generos')
   
parser = api.parser()

parser.add_argument('Plot', type=str, required=True, help='Description of the movie')


output_fields = api.model('Output', {
    'Generos': fields.List(fields.String, description='Predicciones de géneros para la película.')
})



# Función de limpieza de texto
def clean_text(text):
    # Convertir todas las palabras a minúsculas
    text = text.lower()
    # Eliminar los signos de puntuación y los números
    text = re.sub(r'\s+([^\w\s])', r'\1', text)
    return text
    
@ns.route('/')
class peliculasApi(Resource):

    @api.doc(parser=parser)
    @api.marshal_with(output_fields)
    def get(self):
        
        # Obtener los datos de entrada de la solicitud
        data = parser.parse_args()

        
        # Extraer y limpiar el valor del parámetro
        plot = data['Plot']
        df = pd.DataFrame({'plot': [plot]})
        df['Plot_Cleaned'] = df['plot'].apply(clean_text)
        

        # Transformar el texto usando el tokenize
        Text = tokenizer.texts_to_sequences(df['Plot_Cleaned'])
        max_seq_len = 500
        Plot_Final = pad_sequences(Text, maxlen=max_seq_len, padding='post')

        
        # Predecir el género usando el clasificador
        valores= model.predict(Plot_Final)[0]
        dataTraining_exploded = dataTraining.explode('genres')
        generos= dataTraining_exploded['genres'].unique()
        combinados = list(zip(generos, valores))
        generos_max_probabilidades = sorted(combinados, key=lambda x: x[1], reverse=True)[:3]

        generos_max_probabilidades_nombres = [genero for genero, probabilidad in generos_max_probabilidades]

        

        return {
            "Generos": generos_max_probabilidades_nombres
        }, 200

if __name__ == '__main__':
    app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)

        

 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://10.0.0.4:5000
Press CTRL+C to quit
10.0.0.4 - - [27/May/2024 05:42:08] "GET / HTTP/1.1" 200 -
10.0.0.4 - - [27/May/2024 05:42:08] "GET /swaggerui/droid-sans.css HTTP/1.1" 304 -
10.0.0.4 - - [27/May/2024 05:42:08] "GET /swaggerui/swagger-ui.css HTTP/1.1" 304 -
10.0.0.4 - - [27/May/2024 05:42:08] "GET /swaggerui/swagger-ui-bundle.js HTTP/1.1" 304 -
10.0.0.4 - - [27/May/2024 05:42:08] "GET /swaggerui/swagger-ui-standalone-preset.js HTTP/1.1" 304 -
10.0.0.4 - - [27/May/2024 05:42:09] "GET /swagger.json HTTP/1.1" 200 -


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 119ms/step


10.0.0.4 - - [27/May/2024 05:43:15] "GET /predict/?Plot=a%20serial%20killer%20decides%20to%20teach%20the%20secrets%20of%20his%20satisfying%20career%20to%20a%20video%20store%20clerk. HTTP/1.1" 200 -


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 118ms/step


10.0.0.4 - - [27/May/2024 05:44:55] "GET /predict/?Plot=who%20meets%20by%20fate%20,%20%20shall%20be%20sealed%20by%20fate%20.%20%20theresa%20osborne%20is%20running%20along%20the%20beach%20when%20she%20stumbles%20upon%20a%20bottle%20washed%20up%20on%20the%20shore%20.%20%20inside%20is%20a%20message%20,%20%20reading%20the%20letter%20she%20feels%20so%20moved%20and%20yet%20she%20felt%20as%20if%20she%20has%20violated%20someone%20'%20s%20thoughts%20.%20%20in%20love%20with%20a%20man%20she%20has%20never%20met%20,%20%20theresa%20tracks%20down%20the%20author%20of%20the%20letter%20to%20a%20small%20town%20in%20wilmington%20,%20%20two%20lovers%20with%20crossed%20paths%20.%20%20but%20yet%20one%20can%20'%20t%20let%20go%20of%20their%20past. HTTP/1.1" 200 -


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 113ms/step


10.0.0.4 - - [27/May/2024 05:47:17] "GET /predict/?Plot=the%20true%20story%20of%20billy%20hayes%20,%20%20an%20american%20college%20student%20who%20is%20caught%20smuggling%20drugs%20out%20of%20turkey%20and%20thrown%20into%20prison.%20 HTTP/1.1" 200 -


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 111ms/step


200.118.62.8 - - [27/May/2024 05:48:04] "GET /predict/?Plot=a%20young%20woman%20who%20lives%20in%20a%20desert%20trailer%20park%20must%20choose%20between%20caring%20for%20her%20hapless%20father%20and%20sick%20friend%20or%20fulfilling%20her%20own%20destiny%20. HTTP/1.1" 200 -
200.118.62.8 - - [27/May/2024 05:48:15] "GET / HTTP/1.1" 200 -
200.118.62.8 - - [27/May/2024 05:48:15] "GET /swaggerui/droid-sans.css HTTP/1.1" 304 -
200.118.62.8 - - [27/May/2024 05:48:15] "GET /swaggerui/swagger-ui.css HTTP/1.1" 304 -
200.118.62.8 - - [27/May/2024 05:48:15] "GET /swaggerui/swagger-ui-bundle.js HTTP/1.1" 304 -
200.118.62.8 - - [27/May/2024 05:48:15] "GET /swaggerui/swagger-ui-standalone-preset.js HTTP/1.1" 304 -
200.118.62.8 - - [27/May/2024 05:48:16] "GET /swagger.json HTTP/1.1" 200 -
200.118.62.8 - - [27/May/2024 05:48:16] "GET /swaggerui/favicon-32x32.png HTTP/1.1" 304 -
186.155.19.235 - - [27/May/2024 05:48:19] "GET / HTTP/1.1" 200 -
186.155.19.235 - - [27/May/2024 05:48:19] "GET /swaggerui/dro

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 111ms/step


200.118.62.8 - - [27/May/2024 05:48:36] "GET /predict/?Plot=a%20young%20woman%20who%20lives%20in%20a%20desert%20trailer%20park%20must%20choose%20between%20caring%20for%20her%20hapless%20father%20and%20sick%20friend%20or%20fulfilling%20her%20own%20destiny%20. HTTP/1.1" 200 -


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 108ms/step


200.118.62.8 - - [27/May/2024 05:49:16] "GET /predict/?Plot=a%20young%20woman%20who%20lives%20in%20a%20desert%20trailer%20park%20must%20choose%20between%20caring%20for%20her%20hapless%20father%20and%20sick%20friend%20or%20fulfilling%20her%20own%20destiny%20. HTTP/1.1" 200 -
200.118.62.8 - - [27/May/2024 05:49:24] "GET / HTTP/1.1" 200 -
200.118.62.8 - - [27/May/2024 05:49:24] "GET /swaggerui/droid-sans.css HTTP/1.1" 304 -
200.118.62.8 - - [27/May/2024 05:49:25] "GET /swaggerui/swagger-ui.css HTTP/1.1" 304 -
200.118.62.8 - - [27/May/2024 05:49:25] "GET /swaggerui/swagger-ui-bundle.js HTTP/1.1" 304 -
200.118.62.8 - - [27/May/2024 05:49:40] "GET /swaggerui/swagger-ui-standalone-preset.js HTTP/1.1" 304 -
200.118.62.8 - - [27/May/2024 05:49:55] "GET /swagger.json HTTP/1.1" 200 -
200.118.62.8 - - [27/May/2024 05:49:55] "GET /swaggerui/favicon-32x32.png HTTP/1.1" 304 -
