#### Deliverable - Model and Prediction

> Explain how you would predict the IMDB rating from the data. Which variables and/or transformations of them did you use, and why? What type of problem are we solving (regression, classification)? Which model best fits the data, and what are its pros and cons? Which performance measure was chosen for the model, and why?

First, this is a regression problem: we are predicting the value of a numeric variable (IMDB_Rating) from the independent variables. Because the dataset covers only a limited range of IMDB_Rating values, whether due to a filter or to how the dataset was sourced, it could also be framed as a classification problem by creating low, medium and high rating categories, but I will keep treating it as a regression problem.
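
As a rough illustration of the classification framing mentioned above, the sketch below bins ratings into three classes. The cut points (7.8 and 8.2) and the helper name `rating_category` are assumptions chosen only for illustration; they do not come from the dataset.

```python
from bisect import bisect_right

# Hypothetical cut points: below 7.8 is "low", 7.8 up to (but excluding) 8.2 is "medium",
# and 8.2 or above is "high".
THRESHOLDS = [7.8, 8.2]
LABELS = ["low", "medium", "high"]

def rating_category(rating: float) -> str:
    """Map a numeric IMDB rating to a coarse class label."""
    return LABELS[bisect_right(THRESHOLDS, rating)]

ratings = [7.6, 7.8, 8.1, 8.2, 9.2]
categories = [rating_category(r) for r in ratings]
print(categories)
```

With labels like these, the target would be categorical and a classifier with classification metrics would replace the regressors used below.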

To make the prediction I will apply transformations and feature engineering to extract as much information as possible from the data, for example:
- From Overview I used an NLP process (NLTK + CountVectorizer) to turn the text into token-count vectors and obtain the 300 keywords, which gave the dataset a new No_keywords feature.
- Also from Overview I created the Overview_length feature, the total number of words in the Overview.
- I transformed Released_Year -> Released_Year_Group by integer-dividing by 10 and multiplying by 10, which groups release years into decades.
- I split Runtime into Runtime_time and Runtime_category. Since every film has its Runtime in minutes, the unit can be discarded; from Runtime_time I derived Runtime_category, which labels films as short, medium or long duration.
- For Genre I will apply a transformation that depends on the algorithm in use, such as Count or Target Encoding, and I also created a No_Genres feature that counts the number of genres of the film.
- Director also needs a transformation, such as Label Encoding, and I also used it to create the Director_No_Movies feature, which counts how many films the director made.
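
The derived features listed above could be computed roughly as follows. This is a minimal sketch on made-up rows, not the preprocessing actually used to build the dataset: the toy values, the `"142 min"` runtime format and the runtime bin edges (90 and 150 minutes) are all assumptions.

```python
import pandas as pd

# Toy rows standing in for the real dataset (all values are invented).
toy = pd.DataFrame({
    "Released_Year": [1994, 1972, 2008, 1999],
    "Genre": ["Drama", "Crime, Drama", "Action, Crime, Drama", "Drama"],
    "Director": ["A", "B", "C", "A"],
    "Overview": [
        "Two men bond in prison",
        "A crime family saga",
        "A vigilante faces chaos",
        "An office worker rebels",
    ],
    "Runtime": ["142 min", "175 min", "152 min", "139 min"],
})

# Decade bucket: integer-divide by 10, then multiply by 10.
toy["Released_Year_Group"] = (toy["Released_Year"] // 10) * 10
# Number of genres listed for each film.
toy["No_Genres"] = toy["Genre"].str.split(", ").str.len()
# Number of whitespace-separated words in the overview.
toy["Overview_length"] = toy["Overview"].str.split().str.len()
# Number of films per director within this table.
toy["Director_No_Movies"] = toy.groupby("Director")["Director"].transform("count")
# Numeric runtime, dropping the (assumed) " min" suffix.
toy["Runtime_time"] = toy["Runtime"].str.replace(" min", "", regex=False).astype(int)
# Hypothetical duration buckets: (0, 90], (90, 150], (150, inf).
toy["Runtime_category"] = pd.cut(
    toy["Runtime_time"],
    bins=[0, 90, 150, float("inf")],
    labels=["short", "medium", "long"],
)

print(toy.drop(columns=["Overview"]))
```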

For the other numeric variables I will test transformations such as normalization and scaling.
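
To make the two options concrete, the sketch below applies z-score standardization (what `StandardScaler` does) and min-max normalization to a handful of made-up vote counts. The helper names and the sample values are assumptions for illustration only.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-max normalization: rescale linearly so the values span [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

votes = [25_000, 100_000, 400_000, 2_000_000]  # invented No_of_Votes values

print([round(z, 3) for z in standardize(votes)])
print([round(m, 3) for m in min_max(votes)])
```

Both rescalings are linear, so neither changes the shape of a skewed distribution; they only change its location and spread.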

I will discuss the models and metrics in more detail in the full report (PDF).


In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MultiLabelBinarizer, FunctionTransformer, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn import metrics
from sklearn.impute import SimpleImputer


dataset = pd.read_csv('./desafio_indicium_imdb_after_process.csv', index_col=0, sep=';')
dataset.head(3)
dataset.isna().sum()

print(dataset.groupby('Certificate').size())

Certificate
16            1
A           196
Approved     11
G            12
PG           39
PG-13        43
Passed       34
R           146
TV-14         1
TV-MA         1
TV-PG         3
U           233
UA          176
Unrated     102
dtype: int64


In [2]:
string_var = ['Series_Title', 'Overview']
cat_var = ['Certificate', 'Genre', 'Director', 'Star1', 'Star2', 'Star3', 'Star4']
numeric_var = ['Released_Year', 'Runtime_time', 'Meta_score', 'No_of_Votes','Gross', 'No_keywords', 'Released_Year_Group', 'No_Genres', 'Director_No_Movies', 'Overview_length']

X = dataset[string_var + cat_var + numeric_var]
y = dataset['IMDB_Rating']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

In [3]:
# Baseline model with Lasso regression (L1).
# MultiLabelBinarizer for Genre and Actors.
# OrdinalEncoder for Certificate and Director.

# I tried to run the MultiLabelBinarizer inside the pipeline, but could not get it to work:
# it lost its fit to the training data when I called .predict().

from sklearn.linear_model import LassoCV

X_train_l = X_train.copy()
X_test_l = X_test.copy()
y_train_l = y_train.copy()
y_test_l = y_test.copy()

X_train_l['Actors'] = X_train_l[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").agg(', '.join, axis=1)
X_test_l['Actors'] = X_test_l[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").agg(', '.join, axis=1)

multilabel_binarizer_genres = MultiLabelBinarizer()
X_train_l['Genre'] = multilabel_binarizer_genres.fit_transform(X_train_l['Genre'].str.split(', '))
X_test_l['Genre'] = multilabel_binarizer_genres.transform(X_test_l['Genre'].str.split(', '))

multilabel_binarizer_actors = MultiLabelBinarizer()
X_train_l['Actors'] = multilabel_binarizer_actors.fit_transform(
    X_train_l[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").values.tolist()
)
X_test_l['Actors'] = multilabel_binarizer_actors.transform(
     X_test_l[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").values.tolist()
)

num_pipeline = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)

preprocessor = ColumnTransformer(transformers=[
    ('cat_var', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), [cat_var[0],cat_var[2]]),
    ('num_var', num_pipeline, numeric_var)
])

lasso_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lasso', LassoCV(max_iter=10000, cv=5, n_jobs=-1, random_state=42))
])

lasso_pipeline.fit(X_train_l, y_train_l)

y_pred = lasso_pipeline.predict(X_test_l)

r2 = metrics.r2_score(y_test_l, y_pred)
rmsle = metrics.root_mean_squared_log_error(y_test_l, y_pred)  # RMSLE (log-scale error), not plain RMSE

print(f'R2 score : {r2:.2f} \nRMSLE : {rmsle}')

R2 score : 0.35 
RMSLE : 0.023976626249230786
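
The second figure printed above comes from `metrics.root_mean_squared_log_error`, which is the root mean squared logarithmic error, so it is measured on a log scale rather than in rating points. The sketch below contrasts the two definitions on made-up ratings; the helper names and the sample values are assumptions, not results from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error: RMSE computed on log(1 + y)."""
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )

y_true_demo = [7.6, 8.0, 8.4, 9.0]  # invented ratings
y_pred_demo = [7.8, 7.9, 8.1, 8.6]  # invented predictions

print(f"RMSE : {rmse(y_true_demo, y_pred_demo):.4f}")
print(f"RMSLE: {rmsle(y_true_demo, y_pred_demo):.4f}")
```

For ratings in a narrow band around 8, the log-scale error is much smaller than the error in rating points, so the two numbers should not be compared directly.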


In [24]:
# Trying to improve on LassoCV with RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Lasso
from scipy.stats import uniform

X_train_rl = X_train.copy()
X_test_rl = X_test.copy()
y_train_rl = y_train.copy()
y_test_rl = y_test.copy()

rs_lasso_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lasso', Lasso(max_iter=10000, random_state=42))
])

params = {
    'lasso__alpha': uniform(0.0001, 10)
}

random_search = RandomizedSearchCV(
    estimator=rs_lasso_pipeline,
    param_distributions=params,
    n_iter=100,
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=42,
    n_jobs=3
)

random_search.fit(X_train_rl, y_train_rl)

y_pred = random_search.predict(X_test_rl)

r2 = metrics.r2_score(y_test_rl, y_pred)
rmsle = metrics.root_mean_squared_log_error(y_test_rl, y_pred)  # RMSLE (log-scale error), not plain RMSE

print(f'R2 score : {r2:.2f} \nRMSLE : {rmsle}')

R2 score : 0.13 
RMSLE : 0.02760988479350257
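
One plausible, unverified reason this search scores below LassoCV is the sampling distribution: `uniform(0.0001, 10)` spreads draws evenly over roughly [0.0001, 10], so very few candidates are small alphas. The standard-library sketch below compares that with log-uniform sampling over the same range; the 0.1 cutoff is an arbitrary reference point chosen for illustration.

```python
import random

rng = random.Random(42)
n = 10_000

# Uniform draws on [0.0001, 10.0001), matching scipy's uniform(loc=0.0001, scale=10).
uniform_draws = [0.0001 + 10 * rng.random() for _ in range(n)]
# Log-uniform draws on [1e-4, 10]: sample the exponent uniformly, then exponentiate.
log_uniform_draws = [10 ** rng.uniform(-4, 1) for _ in range(n)]

share_small_uniform = sum(a < 0.1 for a in uniform_draws) / n
share_small_log = sum(a < 0.1 for a in log_uniform_draws) / n

print(f"alpha < 0.1 under uniform:     {share_small_uniform:.1%}")
print(f"alpha < 0.1 under log-uniform: {share_small_log:.1%}")
```

If this turns out to matter, `scipy.stats.loguniform(1e-4, 10)` offers the same distribution with the `rvs` interface that `RandomizedSearchCV` accepts in `param_distributions`.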


In [None]:
# Next I evaluate TargetEncoder for the categorical variables (Director and Genre), still with Lasso.
# I keep OrdinalEncoder for Certificate because it has few categories.

from sklearn.preprocessing import TargetEncoder

X_train_lt = X_train.copy()
X_test_lt = X_test.copy()
y_train_lt = y_train.copy()
y_test_lt = y_test.copy()

encoder_genres = TargetEncoder()
X_train_lt['Genre'] = encoder_genres.fit_transform(X_train_lt[['Genre']], y_train_lt)
X_test_lt['Genre'] = encoder_genres.transform(X_test_lt[['Genre']])

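encoder_directors = TargetEncoder()
# Use the dedicated Director encoder (the original reused encoder_genres here).
X_train_lt['Director'] = encoder_directors.fit_transform(X_train_lt[['Director']], y_train_lt)
X_test_lt['Director'] = encoder_directors.transform(X_test_lt[['Director']])
# Note: the preprocessor selects only Certificate, Director and the numeric columns,
# so the encoded Genre column does not reach the model.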

lasso_pipeline_te = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lasso_te', LassoCV(max_iter=10000, cv=5, n_jobs=-1, random_state=42))
])

lasso_pipeline_te.fit(X_train_lt, y_train_lt)

y_pred = lasso_pipeline_te.predict(X_test_lt)

r2 = metrics.r2_score(y_test_lt, y_pred)
rmsle = metrics.root_mean_squared_log_error(y_test_lt, y_pred)  # RMSLE (log-scale error), not plain RMSE

print(f'R2 score : {r2:.2f} \nRMSLE : {rmsle}')

R2 score : 0.33 
RMSLE : 0.024327543937649153
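
For readers unfamiliar with target encoding, the idea is to replace each category with a smoothed mean of the target for that category. The sketch below is a simplified, fixed-smoothing variant written from scratch; it is not scikit-learn's exact algorithm, and the function names, the `smoothing` parameter and the sample data are assumptions.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Return (mapping, global_mean) with a smoothed mean target per category."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    # Blend each category mean with the global mean; rare categories shrink more.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in sums
    }
    return mapping, global_mean

def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]

train_directors = ["A", "A", "B"]  # invented directors
train_ratings = [8.0, 9.0, 7.0]    # invented ratings

mapping, global_mean = fit_target_encoding(train_directors, train_ratings, smoothing=1.0)
encoded = apply_target_encoding(["A", "B", "C"], mapping, global_mean)
print(encoded)
```

scikit-learn's `TargetEncoder` additionally uses internal cross fitting in `fit_transform` and by default chooses the amount of smoothing automatically, which reduces leakage of the target into the training features.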


In [29]:
# Next I evaluate TargetEncoder (Genre and Director) together with OneHotEncoder for Certificate

X_train_lt_oh = X_train.copy()
X_test_lt_oh = X_test.copy()
y_train_lt_oh = y_train.copy()
y_test_lt_oh = y_test.copy()

encoder_genres = TargetEncoder()
X_train_lt_oh['Genre'] = encoder_genres.fit_transform(X_train_lt_oh[['Genre']], y_train_lt_oh)
X_test_lt_oh['Genre'] = encoder_genres.transform(X_test_lt_oh[['Genre']])

encoder_directors = TargetEncoder()
# Use the dedicated Director encoder (the original reused encoder_genres here).
X_train_lt_oh['Director'] = encoder_directors.fit_transform(X_train_lt_oh[['Director']], y_train_lt_oh)
X_test_lt_oh['Director'] = encoder_directors.transform(X_test_lt_oh[['Director']])

preprocessor_lt_oh = ColumnTransformer(transformers=[
    ('cat_var', OneHotEncoder(handle_unknown='ignore'), [cat_var[0],cat_var[2]]),
    ('num_var', num_pipeline, numeric_var)
])

lasso_pipeline_lt_oh = Pipeline(steps=[
    ('preprocessor', preprocessor_lt_oh),
    ('lasso_te', LassoCV(max_iter=10000, cv=5, n_jobs=-1, random_state=42))
])

lasso_pipeline_lt_oh.fit(X_train_lt_oh, y_train_lt_oh)

y_pred = lasso_pipeline_lt_oh.predict(X_test_lt_oh)

r2 = metrics.r2_score(y_test_lt_oh, y_pred)
rmsle = metrics.root_mean_squared_log_error(y_test_lt_oh, y_pred)  # RMSLE (log-scale error), not plain RMSE

print(f'R2 score : {r2:.2f} \nRMSLE : {rmsle}')

R2 score : 0.35 
RMSLE : 0.02386144965081919


In [11]:
# LassoCV + OneHotEncoder for Certificate and MultiLabelBinarizer for Genre and Actors.

X_train_l_mlb_oh = X_train.copy()
X_test_l_mlb_oh = X_test.copy()
y_train_l_mlb_oh = y_train.copy()
y_test_l_mlb_oh = y_test.copy()

X_train_l_mlb_oh['Actors'] = X_train_l_mlb_oh[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").agg(', '.join, axis=1)
X_test_l_mlb_oh['Actors'] = X_test_l_mlb_oh[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").agg(', '.join, axis=1)

multilabel_binarizer_genres = MultiLabelBinarizer()
X_train_l_mlb_oh['Genre'] = multilabel_binarizer_genres.fit_transform(X_train_l_mlb_oh['Genre'].str.split(', '))
X_test_l_mlb_oh['Genre'] = multilabel_binarizer_genres.transform(X_test_l_mlb_oh['Genre'].str.split(', '))

multilabel_binarizer_actors = MultiLabelBinarizer()
X_train_l_mlb_oh['Actors'] = multilabel_binarizer_actors.fit_transform(
    X_train_l_mlb_oh[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").values.tolist()
)
X_test_l_mlb_oh['Actors'] = multilabel_binarizer_actors.transform(
     X_test_l_mlb_oh[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").values.tolist()
)

preprocessor = ColumnTransformer(transformers=[
    ('cat_var', OneHotEncoder(handle_unknown='ignore'), [cat_var[0],cat_var[2]]),
    ('num_var', num_pipeline, numeric_var)
])

lasso_pipeline_mlb_oh = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lasso', LassoCV(max_iter=10000, cv=5, n_jobs=-1, random_state=42))
])

lasso_pipeline_mlb_oh.fit(X_train_l_mlb_oh, y_train_l_mlb_oh)

y_pred = lasso_pipeline_mlb_oh.predict(X_test_l_mlb_oh)

r2 = metrics.r2_score(y_test_l_mlb_oh, y_pred)
rmsle = metrics.root_mean_squared_log_error(y_test_l_mlb_oh, y_pred)  # RMSLE (log-scale error), not plain RMSE

# Training-set R2, to compare against the test score and gauge overfitting.
y_pred_train = lasso_pipeline_mlb_oh.predict(X_train_l_mlb_oh)
r2_train = metrics.r2_score(y_train_l_mlb_oh, y_pred_train)

print(f'R2 score : {r2:.2f} \nRMSLE : {rmsle}')
print(f'R2 score train : {r2_train:.2f}')

R2 score : 0.40 
RMSLE : 0.023033328003620106
R2 score train : 0.66
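
The comment in the baseline cell notes that `MultiLabelBinarizer` did not work inside the pipeline. One possible workaround, sketched here under assumptions, is a thin transformer that wraps it so a `ColumnTransformer` can fit it on the training data and reuse the fitted classes at prediction time. The class name `MultiLabelColumn` is hypothetical, the toy frames are invented, and the sketch assumes the column has no missing values.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MultiLabelBinarizer

class MultiLabelColumn(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: binarize one separator-delimited string column."""

    def __init__(self, sep=", "):
        self.sep = sep

    def _split(self, X):
        # ColumnTransformer passes a one-column DataFrame when the column is given as a list.
        column = pd.DataFrame(X).iloc[:, 0]
        return column.str.split(self.sep)

    def fit(self, X, y=None):
        # Learn the label vocabulary from the training rows only.
        self.mlb_ = MultiLabelBinarizer().fit(self._split(X))
        return self

    def transform(self, X):
        # Reuse the fitted vocabulary; labels unseen during fit are ignored with a warning.
        return self.mlb_.transform(self._split(X))

train_demo = pd.DataFrame({"Genre": ["Crime, Drama", "Drama", "Action, Crime"]})
test_demo = pd.DataFrame({"Genre": ["Drama", "Action, Drama"]})

ct = ColumnTransformer([("genre", MultiLabelColumn(), ["Genre"])])
train_matrix = ct.fit_transform(train_demo)
test_matrix = ct.transform(test_demo)

print(list(ct.named_transformers_["genre"].mlb_.classes_))
print(train_matrix)
print(test_matrix)
```

A transformer like this could sit alongside the numeric pipeline in the preprocessor, so the binarized genre columns would actually be passed to the estimator.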


In [83]:
# Evaluating how DecisionTreeRegressor behaves on this problem.

from sklearn.tree import DecisionTreeRegressor
from scipy.stats import randint

X_train_dt = X_train.copy()
X_test_dt = X_test.copy()
y_train_dt = y_train.copy()
y_test_dt = y_test.copy()

X_train_dt['Actors'] = X_train_dt[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").agg(', '.join, axis=1)
X_test_dt['Actors'] = X_test_dt[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").agg(', '.join, axis=1)

multilabel_binarizer_genres = MultiLabelBinarizer()
X_train_dt['Genre'] = multilabel_binarizer_genres.fit_transform(X_train_dt['Genre'].str.split(', '))
X_test_dt['Genre'] = multilabel_binarizer_genres.transform(X_test_dt['Genre'].str.split(', '))

multilabel_binarizer_actors = MultiLabelBinarizer()
X_train_dt['Actors'] = multilabel_binarizer_actors.fit_transform(
    X_train_dt[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").values.tolist()
)
X_test_dt['Actors'] = multilabel_binarizer_actors.transform(
     X_test_dt[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").values.tolist()
)

preprocessor_dt = ColumnTransformer(transformers=[
    ('cat_var', TargetEncoder(random_state=42), [cat_var[0],cat_var[2]]),
    ('num_var', num_pipeline, numeric_var)
])

pipeline_dt = Pipeline(steps=[
    ('preprocessor', preprocessor_dt),
    ('decision_tree', DecisionTreeRegressor(random_state=42))
])

params_dt = {
    'decision_tree__criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'decision_tree__splitter': ['best', 'random'],
    'decision_tree__max_depth': randint(1, 50),
    'decision_tree__min_samples_split': randint(2, 20),
    'decision_tree__min_samples_leaf': randint(1, 20),
    'decision_tree__min_weight_fraction_leaf': uniform(0, 0.5),
    'decision_tree__max_features': ['sqrt', 'log2', None],
    'decision_tree__max_leaf_nodes': randint(10, 200),
}
rs_decision_tree = RandomizedSearchCV(
    estimator=pipeline_dt,
    param_distributions=params_dt,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42,
    cv=5,
    n_iter=10000
)

rs_decision_tree.fit(X_train_dt, y_train_dt)

y_pred = rs_decision_tree.predict(X_test_dt)
r2 = metrics.r2_score(y_test_dt, y_pred)
rmsle = metrics.root_mean_squared_log_error(y_test_dt, y_pred)  # RMSLE (log-scale error), not plain RMSE

print(f'R2 score : {r2:.2f} \nRMSLE : {rmsle}')

R2 score : 0.25 
RMSLE : 0.0256027099528172


In [None]:
from sklearn.ensemble import GradientBoostingRegressor

param_dist = {
    "gradient_boosting__n_estimators":       [300, 500, 800, 1000],
    "gradient_boosting__learning_rate":      [0.01, 0.02, 0.03, 0.05, 0.08],
    "gradient_boosting__max_depth":          [2, 3, 4],               
    "gradient_boosting__min_samples_leaf":   [5, 10, 20, 30, 50],
    "gradient_boosting__min_samples_split":  [10, 20, 50, 100],
    "gradient_boosting__subsample":          [0.6, 0.7, 0.8, 0.9, 1.0],
    "gradient_boosting__max_features":       [None, 0.5, 0.7, 0.9],
    "gradient_boosting__validation_fraction":[0.1, 0.15, 0.2],
    "gradient_boosting__n_iter_no_change":   [10, 20],                
}

X_train_gbr = X_train.copy()
X_test_gbr = X_test.copy()
y_train_gbr = y_train.copy()
y_test_gbr = y_test.copy()

X_train_gbr['Actors'] = X_train_gbr[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").agg(', '.join, axis=1)
X_test_gbr['Actors'] = X_test_gbr[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").agg(', '.join, axis=1)

multilabel_binarizer_genres = MultiLabelBinarizer()
X_train_gbr['Genre'] = multilabel_binarizer_genres.fit_transform(X_train_gbr['Genre'].str.split(', '))
X_test_gbr['Genre'] = multilabel_binarizer_genres.transform(X_test_gbr['Genre'].str.split(', '))

multilabel_binarizer_actors = MultiLabelBinarizer()
X_train_gbr['Actors'] = multilabel_binarizer_actors.fit_transform(
    X_train_gbr[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").values.tolist()
)
X_test_gbr['Actors'] = multilabel_binarizer_actors.transform(
     X_test_gbr[['Star1', 'Star2', 'Star3', 'Star4']].fillna("Unknown").values.tolist()
)

preprocessor_gbr = ColumnTransformer(transformers=[
    ('cat_var', OneHotEncoder(handle_unknown='ignore'), [cat_var[0],cat_var[2]]),
    ('num_var', num_pipeline, numeric_var)
])

pipeline_gbr = Pipeline(steps=[
    ('preprocessor_gbr', preprocessor_gbr),
    ('gradient_boosting',GradientBoostingRegressor(random_state=42))
])

gbr_cv = RandomizedSearchCV(
    estimator=pipeline_gbr,
    param_distributions=param_dist,  # gradient-boosting grid defined above, not the Lasso `params`
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
    n_iter=10000,
)

gbr_cv.fit(X_train_gbr, y_train_gbr)

y_pred = gbr_cv.predict(X_test_gbr)
r2 = metrics.r2_score(y_test_gbr, y_pred)
rmsle = metrics.root_mean_squared_log_error(y_test_gbr, y_pred)  # RMSLE (log-scale error), not plain RMSE

# Training-set R2, to compare against the test score and gauge overfitting.
y_pred_train = gbr_cv.predict(X_train_gbr)
r2_train = metrics.r2_score(y_train_gbr, y_pred_train)

print(f'R2 score : {r2:.2f} \nRMSLE : {rmsle}')
print(f'R2 score train: {r2_train:.2f}')

R2 score : 0.47 
RMSLE : 0.02164106788927329
R2 score train: 0.86


In [9]:
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(gbr_cv, f)