## Model Development: Movie Genre Classification

This notebook develops models that classify movie genres. It is part of the final project for the Machine Learning and Natural Language Processing course of the Maestría en Inteligencia Analítica de Datos (MIAD).


### 0 - Library imports and function definitions

In [2]:
# Libraries
#import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#import seaborn as sns

#%matplotlib inline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split, KFold, cross_validate, cross_val_predict, cross_val_score, StratifiedKFold, RepeatedStratifiedKFold


### 1 - Data loading


In [3]:
# Load data from the .csv files
dataTraining = pd.read_csv('dataTraining.csv')
dataTesting = pd.read_csv('dataTesting.csv', index_col=0)

In [4]:
# Preview the training data
dataTraining.head()

Unnamed: 0.1,Unnamed: 0,year,title,plot,genres,rating
0,3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
1,900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
2,6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
3,4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
4,2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [5]:
# Preview the test data
dataTesting.head(3)

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...


In [6]:
# Summary of dtype and non-null count per column:
dataTraining.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7895 entries, 0 to 7894
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  7895 non-null   int64  
 1   year        7895 non-null   int64  
 2   title       7895 non-null   object 
 3   plot        7895 non-null   object 
 4   genres      7895 non-null   object 
 5   rating      7895 non-null   float64
dtypes: float64(1), int64(2), object(3)
memory usage: 370.2+ KB


### 2 - Data preparation

In [7]:
# Drop columns that add no value
columns_to_drop = ['Unnamed: 0', 'rating']
data = dataTraining.drop(columns=columns_to_drop)

In [8]:
# Check for duplicated observations

duplicates = data[data.duplicated()]
duplicate_count = duplicates.shape[0]

print("Duplicated observations:", duplicate_count)

Duplicated observations: 1


In [9]:
duplicates

Unnamed: 0,year,title,plot,genres
5983,1999,Gekijô-ban poketto monsutâ: Maboroshi no pokem...,an evil genius in a flying fortress is trying ...,"['Animation', 'Action', 'Adventure', 'Family',..."


In [10]:
# Remove duplicated observations

data = data.drop_duplicates()

print(data.shape)

(7894, 4)


In [11]:
# Statistical summary per column:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,7894.0,1989.725234,22.661912,1894.0,1980.0,1997.0,2007.0,2015.0


In [12]:
# Split numeric and categorical columns

columnas_numericas = data.select_dtypes(include=['number']).columns

# Categorical columns (object dtype only)
columnas_categoricas = data.select_dtypes(include=['object']).columns

# Optionally, split the DataFrame as well:
data_numericas = data[columnas_numericas]
data_categoricas = data[columnas_categoricas]

print("Numeric columns: ", columnas_numericas)

print("Categorical columns: ", columnas_categoricas)

Numeric columns:  Index(['year'], dtype='object')
Categorical columns:  Index(['title', 'plot', 'genres'], dtype='object')


#### Genres

In [13]:
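import ast
from collections import Counter

# Parse each stringified list (e.g. "['Short', 'Drama']") as a literal instead of executing it with eval
all_genres = data['genres'].apply(ast.literal_eval).explode()
genre_counts = Counter(all_genres)
genre_counts = pd.Series(genre_counts).sort_values(ascending=False)
print(genre_counts)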

Drama          3965
Comedy         3046
Thriller       2024
Romance        1892
Crime          1447
Action         1302
Adventure      1023
Horror          954
Mystery         759
Sci-Fi          723
Fantasy         706
Family          681
Documentary     419
Biography       373
War             348
Music           341
History         273
Musical         271
Sport           261
Animation       259
Western         237
Film-Noir       168
Short            92
News              7
dtype: int64


In [14]:
len(genre_counts)

24
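The `genres` column stores each list as a string, so it has to be parsed before counting. The following is a minimal sketch of that step using only the standard library; the sample values are copied from the `dataTraining.head()` output above, and the names `raw_genres`, `parsed` and `sample_counts` are illustrative.

```python
import ast
from collections import Counter

# Sample values copied from the `genres` column shown in dataTraining.head()
raw_genres = [
    "['Short', 'Drama']",
    "['Comedy', 'Crime', 'Horror']",
    "['Drama', 'Film-Noir', 'Thriller']",
    "['Drama']",
    "['Action', 'Crime', 'Thriller']",
]

# literal_eval only accepts Python literals, so a string containing a
# function call raises ValueError instead of being executed
parsed = [ast.literal_eval(s) for s in raw_genres]

# Flatten the per-movie lists and count how often each genre appears
sample_counts = Counter(g for genres in parsed for g in genres)
print(sample_counts.most_common(3))
```

On the full dataset the same idea is what `apply(...).explode()` followed by `Counter` does in the cell above.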

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score
#from xgboost import XGBClassifier
#import optuna

In [16]:
# Target Preprocessing
# Define the target variable (y): one binary column per genre
import ast
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(data['genres'].apply(ast.literal_eval))

# Feature Preprocessing
# Combine title and plot into a single text field
data['text'] = data['title'] + ' ' + data['plot']
dataTesting['text'] = dataTesting['title'] + ' ' + dataTesting['plot']

In [17]:
y.shape

(7894, 24)
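The shape `(7894, 24)` means one row per movie and one binary column per genre. As a small illustration of what `MultiLabelBinarizer` produces, the sketch below fits it on three genre lists taken from the preview above; `sample_genres`, `mlb_demo`, `y_demo` and `decoded` are illustrative names, not part of the notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three genre lists taken from the training preview
sample_genres = [
    ['Short', 'Drama'],
    ['Comedy', 'Crime', 'Horror'],
    ['Drama'],
]

mlb_demo = MultiLabelBinarizer()
y_demo = mlb_demo.fit_transform(sample_genres)

# classes_ holds the column order; it is sorted alphabetically
print(list(mlb_demo.classes_))
# Each row has a 1 in the columns of the genres assigned to that movie
print(y_demo)

# inverse_transform maps the indicator rows back to genre tuples
decoded = mlb_demo.inverse_transform(y_demo)
```

The column order in `classes_` is what any later probability matrix from a one-vs-rest classifier follows.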

In [18]:
X = data[['text', 'year']]
Testing = dataTesting[['text', 'year']]
# --- Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [19]:
# ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('text', TfidfVectorizer(
            max_features=12000,
            ngram_range=(1,2),
            stop_words='english'
        ), 'text'),
        
        ('year', Pipeline([
            ('reshape', FunctionTransformer(lambda x: x.values.reshape(-1,1))), 
            ('scaler', StandardScaler())
        ]), 'year')
    ]
)
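The `reshape` step exists because selecting a column by string hands the transformer a 1-D Series, which `StandardScaler` does not accept. One possible alternative, sketched here on a toy frame, is to select the column with a list (`['year']`), which passes a 2-D DataFrame and makes the lambda unnecessary. The names `toy`, `preprocessor_alt` and `features` are illustrative, and this is not the configuration used to produce the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy frame with the same two feature columns as X
toy = pd.DataFrame({
    'text': ['a serial killer teaches his secrets',
             'a single father takes his son to work',
             'the editor of a publishing house flees'],
    'year': [2008, 2003, 1990],
})

preprocessor_alt = ColumnTransformer([
    # A string selector passes a 1-D Series, which TfidfVectorizer expects
    ('text', TfidfVectorizer(stop_words='english'), 'text'),
    # A list selector passes a 2-D DataFrame, so no reshape step is needed
    ('year', StandardScaler(), ['year']),
])

features = preprocessor_alt.fit_transform(toy)
# The output may be sparse or dense depending on overall density
dense = features.toarray() if hasattr(features, 'toarray') else np.asarray(features)
print(dense.shape)
```

A side effect of dropping the lambda is that the fitted transformer contains only importable objects, which matters when the model is serialized later.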


#### Incorporando SVD

In [20]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def make_preprocessor(trial):
    tfidf_max_features = trial.suggest_int('tfidf_max_features', 11000, 15000)
    tfidf_ngram_upper = 2

    if tfidf_max_features > 12000:
        use_svd = trial.suggest_categorical('use_svd', [True, False])
    else:
        use_svd = False
    
    if use_svd:
        svd_n = trial.suggest_int('svd_n', 100, 500)
        text_pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(
                max_features=tfidf_max_features,
                ngram_range=(1, tfidf_ngram_upper),
                stop_words='english')),
            ('svd', TruncatedSVD(n_components=svd_n))
        ])
    else:
        text_pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(
                max_features=tfidf_max_features,
                ngram_range=(1, tfidf_ngram_upper),
                stop_words='english'))
        ])

    preprocessor = ColumnTransformer([
        ('text', text_pipeline, 'text'),
        ('year', Pipeline([
            ('reshape', FunctionTransformer(lambda x: x.values.reshape(-1,1))),
            ('scaler', StandardScaler())
        ]), 'year')
    ])
    
    return preprocessor
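To make the SVD branch concrete, here is a minimal sketch of a TF-IDF plus `TruncatedSVD` text pipeline on a toy corpus. The corpus, `n_components=3` and the `random_state` value are assumptions for illustration; note that the notebook's own `TruncatedSVD` calls do not set `random_state`, so their output can vary slightly between runs.

```python
import numpy as np
from sklearn.base import clone
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the combined title + plot text
corpus = [
    'a serial killer decides to teach his secrets',
    'a single father takes his son to work',
    'a female blackmailer with a disfigured face',
    'the president of a furniture company dies',
    'the editor of a publishing house witnesses a murder',
    'an evil genius in a flying fortress',
]

text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    # Project the sparse TF-IDF matrix onto a few dense components
    ('svd', TruncatedSVD(n_components=3, random_state=42)),
])

reduced = text_pipeline.fit_transform(corpus)
explained = text_pipeline.named_steps['svd'].explained_variance_ratio_.sum()

# With a fixed random_state, refitting an identical pipeline gives the same result
reduced_again = clone(text_pipeline).fit_transform(corpus)
same = np.allclose(reduced, reduced_again)
print(reduced.shape, same)
```

The reduced matrix is dense, which tends to suit tree-based learners such as LightGBM better than a very wide sparse matrix, at the cost of discarding some variance.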

#### LightGBM

In [21]:
from lightgbm import LGBMClassifier

# def objective_lgb(trial):
#     preprocessor = make_preprocessor(trial)

#     lgb_params = {
#         'learning_rate': trial.suggest_float('learning_rate', 0.02, 0.08),
#         'max_depth': trial.suggest_int('max_depth', 6, 9),
#         'n_estimators': trial.suggest_int('n_estimators', 350, 450),
#         'num_leaves': trial.suggest_int('num_leaves', 20, 80),
#         'min_child_samples': trial.suggest_int('min_child_samples', 10, 60),
#         'subsample': trial.suggest_float('subsample', 0.6, 0.75),
#         'colsample_bytree': trial.suggest_float('colsample_bytree', 0.55, 0.72),
#         'reg_alpha': trial.suggest_float('reg_alpha', 0.001, 0.6),
#         'reg_lambda': trial.suggest_float('reg_lambda', 0.4, 0.65),
#         'random_state': 42
#     }

#     pipeline = Pipeline([
#         ('preprocessor', preprocessor),
#         ('classifier', OneVsRestClassifier(LGBMClassifier(**lgb_params)))
#     ])

#     pipeline.fit(X_train, y_train)
#     y_pred_proba = pipeline.predict_proba(X_test) # no X_test 

#     mcauc = roc_auc_score(y_test, y_pred_proba, average="macro")
#     return mcauc
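The objective scores each trial with the macro-averaged ROC AUC over all genre columns. One caveat: a rare genre such as News (7 movies) can end up with no positive examples in a 20% hold-out split, and the AUC for that column is then undefined (depending on the scikit-learn version this raises an error or yields NaN). The sketch below shows one way to handle that by averaging only over labels that contain both classes; the helper `macro_auc_skip_degenerate` and the toy arrays are hypothetical, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def macro_auc_skip_degenerate(y_true, y_score):
    """Mean per-label ROC AUC over labels with both classes present in y_true."""
    aucs = []
    for j in range(y_true.shape[1]):
        col = y_true[:, j]
        if col.min() == col.max():
            # Only one class present: AUC is undefined for this label
            continue
        aucs.append(roc_auc_score(col, y_score[:, j]))
    return float(np.mean(aucs)), len(aucs)

# Toy multilabel targets: the third label never occurs
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 0]])
y_score = np.array([[0.9, 0.7, 0.10],
                    [0.1, 0.8, 0.05],
                    [0.7, 0.6, 0.20],
                    [0.3, 0.1, 0.01]])

score, n_used = macro_auc_skip_degenerate(y_true, y_score)
print(score, n_used)
```

Because the trials are scored on `X_test`, the best value found by the study is an optimistic estimate; a separate validation split or cross-validation inside the objective would give a less biased number.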

In [26]:
#!pip install optuna

In [24]:
import optuna
# study_lgb = optuna.create_study(direction='maximize')
# study_lgb.optimize(objective_lgb, n_trials=30)


# print("Best trial LGB:")
# print(study_lgb.best_params)
# print(f"Best MCAUC LGB: {study_lgb.best_value:.4f}")

In [None]:
# trials3_df = study_lgb.trials_dataframe(attrs=("number", "value", "params", "state"))
# top_trials3 = trials3_df.sort_values(by="value", ascending=False)
# display(top_trials3.head(5))

#### Training the LightGBM model (including SVD) with the best parameters found

In [23]:
# Best params
lgb_best_params = {
    'tfidf_max_features': 13025,
    'use_svd': True,
    'svd_n': 438,
    'learning_rate': 0.07231808145093328,
    'max_depth': 9,
    'n_estimators': 430,
    'num_leaves': 20,
    'min_child_samples': 18,
    'subsample': 0.6319075336650705,
    'colsample_bytree': 0.5500254570669165,
    'reg_alpha': 0.38441817549774887,
    'reg_lambda': 0.4264209131455222,
    'verbosity': -1,
    'force_col_wise': True,  # suppress threading warning
    'random_state': 42
}

In [25]:
# pre_processor
preprocessor3 = ColumnTransformer([
    ('text', Pipeline([
        ('tfidf', TfidfVectorizer(
            max_features=lgb_best_params['tfidf_max_features'],
            ngram_range=(1, 2),
            stop_words='english')),
        ('svd', TruncatedSVD(n_components=lgb_best_params['svd_n']))
    ]), 'text'),
    ('year', Pipeline([
        ('reshape', FunctionTransformer(lambda x: x.values.reshape(-1, 1))),
        ('scaler', StandardScaler())
    ]), 'year')
])
# Remove non-LGBM keys
lgb_model_params = {k: v for k, v in lgb_best_params.items() if k not in ['tfidf_max_features', 'use_svd', 'svd_n']}

# Full pipeline
model_lgb = Pipeline([
    ('preprocessor', preprocessor3),
    ('classifier', OneVsRestClassifier(LGBMClassifier(**lgb_model_params)))
])

# Train on all data
model_lgb.fit(data[['text', 'year']], y)

In [26]:
cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict on the test set
y_pred_test_genres2 = model_lgb.predict_proba(Testing)
# Save predictions in the format required by the Kaggle competition
res2 = pd.DataFrame(y_pred_test_genres2, index=dataTesting.index, columns=cols)
res2.to_csv('pred_genres_text_lgb.csv', index_label='ID')
res2.head()



Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.015316,0.003947,0.000116,7e-05,0.336363,0.024425,0.000339,0.773513,0.000285,0.017364,...,4.5e-05,0.012904,2.9e-05,0.741934,0.000889,0.000237,7.6e-05,0.1276,1.3e-05,0.000108
4,0.007748,0.000541,0.000344,0.064693,0.084115,0.741502,0.005481,0.838846,0.000232,0.000178,...,0.00044,0.000244,5.8e-05,0.031621,7.8e-05,0.000133,0.002067,0.147624,0.010017,0.000343
5,0.004925,0.001101,5.5e-05,0.001934,0.027128,0.96282,2.9e-05,0.848151,0.00068,0.000276,...,8.1e-05,0.821523,5.8e-05,0.01637,0.001209,8.1e-05,3.5e-05,0.891884,5.5e-05,3.3e-05
6,0.028749,0.088285,0.000241,0.003444,0.053295,0.008165,4.8e-05,0.481579,0.000593,0.000282,...,0.000402,0.006844,5.8e-05,0.113748,0.008792,2.8e-05,0.000113,0.136219,0.015411,0.00024
7,0.003058,0.005454,0.000785,2.4e-05,0.07687,0.014264,3e-05,0.155736,0.009903,0.073905,...,0.004527,0.005245,3.8e-05,0.005869,0.889338,4.6e-05,1.6e-05,0.455097,0.000222,0.000229
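The hard-coded `cols` list has to match the column order of `predict_proba`, which follows `mlb.classes_`. A small sketch of deriving the names from the binarizer instead is shown below; it refits a throwaway `MultiLabelBinarizer` on the 24 genre names from the counts above so that it runs on its own. In the notebook the fitted `mlb` could presumably be used directly, and `genres`, `mlb_demo` and `derived_cols` are illustrative names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# The 24 genres listed in the genre counts, in frequency order
genres = ['Drama', 'Comedy', 'Thriller', 'Romance', 'Crime', 'Action',
          'Adventure', 'Horror', 'Mystery', 'Sci-Fi', 'Fantasy', 'Family',
          'Documentary', 'Biography', 'War', 'Music', 'History', 'Musical',
          'Sport', 'Animation', 'Western', 'Film-Noir', 'Short', 'News']

# Fitting on a single "movie" that has every genre is enough to learn the classes
mlb_demo = MultiLabelBinarizer().fit([genres])

# classes_ is sorted, so the derived names follow the probability columns
derived_cols = ['p_' + g for g in mlb_demo.classes_]
print(derived_cols[:5])
```

Deriving the names this way avoids a silent mismatch if the set of genres ever changes.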


In [82]:
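# Rebuild and fit both models
# NOTE: build_model_from_params and best_params_xgb1 are not defined in the cells shown in this notebook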
xgb_model_r = build_model_from_params(best_params_xgb1, model_type='xgb')
lgb_model_r = build_model_from_params(lgb_best_params, model_type='lgb')

xgb_model_r.fit(X_train, y_train)
lgb_model_r.fit(X_train, y_train)

# Predict on validation set
y_proba_xgb_r = xgb_model_r.predict_proba(X_test)
y_proba_lgb_r = lgb_model_r.predict_proba(X_test)



## Exporting the model to .pkl for the API

In [29]:
import joblib

In [33]:
#!pip install cloudpickle


In [34]:
# Save the model
import cloudpickle

with open('lgbm_simple_model.pkl', 'wb') as f:
    cloudpickle.dump(model_lgb, f)
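`cloudpickle` is presumably used here instead of the imported `joblib` because the pipeline contains a `lambda` inside `FunctionTransformer`, and the standard pickling machinery cannot serialize lambdas. The sketch below illustrates that limitation with the standard library only; `reshape_like` and `double` are stand-ins invented for the example, not objects from the pipeline.

```python
import functools
import operator
import pickle

# Stand-in for the lambda used inside FunctionTransformer
reshape_like = lambda x: x

# Standard pickle stores functions by importable name, which a lambda lacks
try:
    pickle.dumps(reshape_like)
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    lambda_pickles = False

# A callable built from importable pieces round-trips without extra libraries
double = functools.partial(operator.mul, 2)
restored = pickle.loads(pickle.dumps(double))

print(lambda_pickles, restored(21))
```

Whichever serializer is used, the API that loads `lgbm_simple_model.pkl` should run with library versions compatible with those used for training.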
