## Contents:
- [1. Load Libraries](#librerias)
- [2. Load Data](#datos)
- [3. Build the Model Using Naive Bayes](#modelo)
- [4. Model Optimization](#optimizar)
    - [4.1 Save the Model](#guardarmodelo)
- [5. Combine Datasets](#combinar)
- [6. Save Dataset](#guardar)


# 1. Load Libraries <a class="anchor" id="librerias"></a>

In [1]:
import pandas as pd
import numpy as np
import pickle

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder

# 2. Load Data <a class="anchor" id="datos"></a>

In [13]:
movies = pd.read_csv("../data/processed/movies.csv")
critics = pd.read_csv("../data/processed/critics.csv")
c_df_copy = pd.read_csv("../data/processed/c_df_copy.csv")

In [3]:
X = c_df_copy['review_content']  
y = c_df_copy['review_label']    

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [4]:
vec_model = CountVectorizer(stop_words = "english", max_features=5000)
X_train_vec = vec_model.fit_transform(X_train)
X_test_vec = vec_model.transform(X_test)
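
The vectorizer is fitted on the training split only and then reused on the test split, so words that appear only in test reviews are ignored. A minimal sketch of that behaviour, assuming a few hypothetical toy reviews (`train_docs` and `test_docs` are illustrative and not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, used only for illustration
train_docs = ["a brilliant and touching film", "a dull and boring film"]
test_docs = ["a brilliant unseen masterpiece"]

toy_vec = CountVectorizer(stop_words="english")
train_counts = toy_vec.fit_transform(train_docs)  # learns the vocabulary from the training text
test_counts = toy_vec.transform(test_docs)        # reuses that vocabulary; unseen words are dropped

print(sorted(toy_vec.vocabulary_))
print(train_counts.shape, test_counts.shape)
```

Both matrices share the same number of columns, which is what lets a model trained on `train_counts` score `test_counts`.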

# 3. Build the Model Using Naive Bayes <a class="anchor" id="modelo"></a>

In [5]:
nb_model = MultinomialNB()
nb_model.fit(X_train_vec, y_train)

y_pred_nb = nb_model.predict(X_test_vec)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))

Naive Bayes Accuracy: 0.771371048287898
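
Accuracy alone does not show which class the errors fall on. As a rough sketch, a confusion matrix can be read next to the accuracy; the labels `"Fresh"` and `"Rotten"` and the toy predictions below are assumptions for illustration, since the actual encoding of `review_label` is not shown here:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, used only for illustration
y_true_toy = ["Fresh", "Fresh", "Fresh", "Rotten", "Rotten", "Fresh"]
y_pred_toy = ["Fresh", "Fresh", "Rotten", "Rotten", "Fresh", "Fresh"]

toy_acc = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes, in the order given by `labels`
toy_cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["Fresh", "Rotten"])

print(f"Accuracy: {toy_acc:.3f}")
print(toy_cm)
```

In the notebook, the same two calls could be applied to `y_test` and `y_pred_nb`.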


# 4. Model Optimization <a class="anchor" id="optimizar"></a>

In [6]:
def evaluar_modelo(model, X_train_vec, X_test_vec, y_train, y_test, nombre="Model"):
    """Fit `model` on the training features, print its test accuracy, and return (model, y_pred)."""
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)
    acc = accuracy_score(y_test, y_pred)
    print(f"{nombre} Accuracy: {acc:.5f}")
    return model, y_pred

In [8]:
# Plain Naive Bayes (vectorizer without a max_features limit)
vec_model = CountVectorizer(stop_words="english")
X_train_vec = vec_model.fit_transform(X_train)
X_test_vec  = vec_model.transform(X_test)

nb_model = MultinomialNB()
nb_model, y_pred_nb = evaluar_modelo(nb_model, X_train_vec, X_test_vec, y_train, y_test, "Naive Bayes")

Naive Bayes Accuracy: 0.79712


In [9]:
# Naive Bayes tuned with Pipeline + RandomizedSearchCV
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('nb', MultinomialNB())
])

hyperparams = {
    'vectorizer__max_features': [1000, 2000, 3000, None],
    'vectorizer__ngram_range': [(1,1), (1,2)],
    'nb__alpha': np.linspace(0.1, 2.0, 20)
}

grid = RandomizedSearchCV(pipeline, hyperparams, scoring="accuracy", n_iter=20, random_state=42)
grid.fit(X_train, y_train)

best_nb_model = grid.best_estimator_
y_pred_best_nb = best_nb_model.predict(X_test)

print("Tuned Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_best_nb))
print("Best parameters:", grid.best_params_)

Tuned Naive Bayes Accuracy: 0.8029742317356519
Best parameters: {'vectorizer__ngram_range': (1, 2), 'vectorizer__max_features': None, 'nb__alpha': np.float64(0.7)}
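
Beyond the single best combination, the search object keeps the cross-validated score of every sampled candidate in `cv_results_`. The following sketch, on a hypothetical toy corpus (`toy_docs`, `toy_labels` and `toy_search` are illustrative names, and the search space is a reduced version of the one above), shows one way to rank the candidates:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, repeated so that each CV fold contains both classes
toy_docs = ["great film", "wonderful acting", "loved it", "superb story",
            "terrible film", "awful acting", "hated it", "boring story"] * 3
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

toy_search = RandomizedSearchCV(
    Pipeline([("vectorizer", CountVectorizer()), ("nb", MultinomialNB())]),
    {
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "nb__alpha": np.linspace(0.1, 2.0, 20),
    },
    n_iter=5, cv=3, scoring="accuracy", random_state=42,
)
toy_search.fit(toy_docs, toy_labels)

# One row per sampled candidate, best-ranked first
toy_results = (
    pd.DataFrame(toy_search.cv_results_)
    [["param_nb__alpha", "param_vectorizer__ngram_range", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(toy_results)
```

Looking at the full ranking helps judge whether the best parameters are clearly better or only marginally ahead of the other candidates.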


In [10]:
# Logistic Regression
log_model = LogisticRegression(max_iter=1000)
log_model, y_pred_log = evaluar_modelo(log_model, X_train_vec, X_test_vec, y_train, y_test, "Logistic Regression")

Logistic Regression Accuracy: 0.80569


In [11]:
acc_nb = accuracy_score(y_test, y_pred_best_nb)
acc_log = accuracy_score(y_test, y_pred_log)

if acc_log > acc_nb:
    modelo_final = log_model
    nombre_final = "Logistic Regression"
else:
    modelo_final = best_nb_model
    nombre_final = "Tuned Naive Bayes"

print(f"Best model: {nombre_final} ({max(acc_nb, acc_log):.5f})")

Best model: Logistic Regression (0.80569)


The Logistic Regression model performed better on the test set than the tuned Naive Bayes model (accuracy 0.80569 vs. 0.80297), so we will use Logistic Regression as the final model.
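
The two accuracies are close, and both were measured on the same test split that is used to pick the winner. One possible complement, sketched here on a hypothetical toy corpus and not part of the original workflow, is to compare the candidates with cross-validation on the training data (`toy_docs`, `toy_labels` and `candidates` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, used only for illustration
toy_docs = ["great film", "wonderful acting", "loved it", "superb story",
            "terrible film", "awful acting", "hated it", "boring story"] * 3
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

candidates = {
    "Naive Bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "Logistic Regression": make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
}

# Mean accuracy over 3 folds for each candidate
cv_scores = {
    name: cross_val_score(model, toy_docs, toy_labels, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_name)
```

With this approach the test set is only used once, to report the accuracy of the model that was already chosen.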

In [12]:
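# Wrap the selected classifier with a vectorizer so the pipeline accepts raw text.
# This assumes modelo_final is a bare classifier (here, LogisticRegression);
# best_nb_model is already a Pipeline with its own vectorizer and would not need wrapping.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("classifier", modelo_final)
])

pipeline.fit(X_train, y_train)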

Pipeline(steps=[('vectorizer', CountVectorizer(stop_words='english')),
                ('classifier', LogisticRegression(max_iter=1000))])


## 4.1 Save the Model <a class="anchor" id="guardarmodelo"></a>

In [None]:
with open("model/rotten_pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)

print("Pipeline with the final model saved successfully.")
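
A saved pipeline can later be restored with `pickle.load` and used directly on raw text. The sketch below round-trips a small pipeline through a temporary file; the toy data and the file name `toy_pipeline.pkl` are hypothetical and only stand in for the real model file:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, used only for illustration
toy_docs = ["great film", "loved it", "terrible film", "hated it"]
toy_labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(toy_docs, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions
same_predictions = bool((restored.predict(toy_docs) == toy_pipeline.predict(toy_docs)).all())
print(same_predictions)
```

Two practical points: the target directory must exist before `open(..., "wb")` is called, and pickle files should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.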

# 5. Combine Datasets <a class="anchor" id="combinar"></a>

The recommendation system needs to work with a single dataset, so we add a column to the **movies** dataset that holds, for each movie, the average score obtained from the sentiment analysis of its reviews.

In [14]:
critics["review_content"] = critics["review_content"].fillna("")

critics["sentiment_prob"] = pipeline.predict_proba(critics["review_content"])[:, 1]
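
The index `[:, 1]` selects the probability of the second entry in the classifier's `classes_`, so whether it is the "positive" probability depends on how `review_label` is encoded. A small sketch of how this could be checked, assuming hypothetical string labels `"Fresh"` and `"Rotten"` (the real encoding is not shown here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data with string labels, used only for illustration
toy_docs = ["great film", "loved it", "terrible film", "hated it"]
toy_labels = ["Fresh", "Fresh", "Rotten", "Rotten"]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(toy_docs, toy_labels)

# classes_ is sorted, so with these labels column 0 is "Fresh" and column 1 is "Rotten"
toy_classes = list(toy_pipeline.named_steps["classifier"].classes_)
positive_col = toy_classes.index("Fresh")

toy_proba = toy_pipeline.predict_proba(["great acting", "terrible story"])
fresh_prob = toy_proba[:, positive_col]
print(toy_classes, fresh_prob)
```

If the labels are already encoded as 0 and 1 with 1 meaning a positive review, `[:, 1]` is the positive-class probability as used above.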

In [15]:
sentiment_scores = critics.groupby("rotten_tomatoes_link")["sentiment_prob"].mean().reset_index()
sentiment_scores.rename(columns={"sentiment_prob": "avg_sentiment_score"}, inplace=True)

movies = movies.merge(sentiment_scores, on="rotten_tomatoes_link", how="left")
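
The aggregation and left merge above can be traced on a tiny example. The frames `toy_critics` and `toy_movies` are hypothetical; they only reuse the column names from the notebook:

```python
import pandas as pd

# Hypothetical per-review probabilities for two movies
toy_critics = pd.DataFrame({
    "rotten_tomatoes_link": ["m/a", "m/a", "m/b"],
    "sentiment_prob": [0.9, 0.7, 0.2],
})
# Hypothetical movie table; "m/c" has no reviews
toy_movies = pd.DataFrame({
    "rotten_tomatoes_link": ["m/a", "m/b", "m/c"],
    "movie_title": ["A", "B", "C"],
})

# Mean probability per movie, renamed to the final column name
toy_scores = (
    toy_critics.groupby("rotten_tomatoes_link")["sentiment_prob"]
    .mean()
    .reset_index()
    .rename(columns={"sentiment_prob": "avg_sentiment_score"})
)

# how="left" keeps every movie; movies without reviews get NaN
toy_merged = toy_movies.merge(toy_scores, on="rotten_tomatoes_link", how="left")
print(toy_merged)
```

Because the merge is a left join, movies with no reviews keep their row and end up with a missing `avg_sentiment_score`, which may need handling before the recommendation step.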

In [16]:
movies.head()

| Unnamed: 0 | rotten_tomatoes_link | movie_title | movie_info | critics_consensus | content_rating | genres | runtime | production_company | tomatometer_status | tomatometer_rating | ... | audience_count | tomatometer_top_critics_count | tomatometer_fresh_critics_count | tomatometer_rotten_critics_count | directors | authors | actors | release_year | streaming_release_year | avg_sentiment_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | m/0814255 | Percy Jackson & the Olympians: The Lightning T... | Always trouble-prone, the life of teenager Per... | Though it may seem like just another Harry Pot... | PG | ['Action & Adventure', 'Comedy', 'Drama', 'Sci... | 119.0 | 20th Century Fox | Rotten | 49.0 | ... | 254421.0 | 43 | 73 | 76 | ['Chris Columbus'] | ['Craig Titley', 'Chris Columbus', 'Rick Riord... | ['Logan Lerman', 'Brandon T. Jackson', 'Alexan... | 2010.0 | 2015 | 0.552701 |
| 1 | m/0878835 | Please Give | Kate (Catherine Keener) and her husband Alex (... | Nicole Holofcener's newest might seem slight i... | R | ['Comedy'] | 90.0 | Sony Pictures Classics | Certified-Fresh | 87.0 | ... | 11574.0 | 44 | 123 | 19 | ['Nicole Holofcener'] | ['Nicole Holofcener'] | ['Catherine Keener', 'Amanda Peet', 'Oliver Pl... | 2010.0 | 2012 | 0.81182 |
| 2 | m/10 | 10 | A successful, middle-aged Hollywood songwriter... | Blake Edwards' bawdy comedy may not score a pe... | R | ['Comedy', 'Romance'] | 122.0 | Waner Bros. | Fresh | 67.0 | ... | 14684.0 | 2 | 16 | 8 | ['Blake Edwards'] | ['Blake Edwards'] | ['Dudley Moore', 'Bo Derek', 'Julie Andrews', ... | 1979.0 | 2014 | 0.597555 |
| 3 | m/1000013-12_angry_men | 12 Angry Men (Twelve Angry Men) | Following the closing arguments in a murder tr... | Sidney Lumet's feature debut is a superbly wri... | NR | ['Classics', 'Drama'] | 95.0 | Criterion Collection | Certified-Fresh | 100.0 | ... | 105386.0 | 6 | 54 | 0 | ['Sidney Lumet'] | ['Reginald Rose'] | ['Martin Balsam', 'John Fiedler', 'Lee J. Cobb... | 1957.0 | 2017 | 0.814403 |
| 4 | m/1000079-20000_leagues_under_the_sea | 20,000 Leagues Under The Sea | In 1866, Professor Pierre M. Aronnax (Paul Luk... | One of Disney's finest live-action adventures,... | G | ['Action & Adventure', 'Drama', 'Kids & Family'] | 127.0 | Disney | Fresh | 89.0 | ... | 68918.0 | 5 | 24 | 3 | ['Richard Fleischer'] | ['Earl Felton'] | ['James Mason', 'Kirk Douglas', 'Paul Lukas', ... | 1954.0 | 2016 | 0.766097 |


# 6. Save Dataset <a class="anchor" id="guardar"></a>

In [17]:
movies.to_csv("dataset/movies.csv", index=False)