# Movie Rating Prediction
This tutorial predicts movie ratings with machine learning models trained on the IMDB dataset.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Load the dataset
movie_df = pd.read_csv('movies.csv')

In [3]:
movie_df.shape

(2589747, 14)

In [4]:
movie_df.isna().sum().sum()

0

In [5]:
movie_df.duplicated().sum()

0

In [6]:
movie_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,ordering,nconst,category,primaryName,primaryProfession
0,tt0000009,movie,Miss Jerry,Miss Jerry,1894.0,45,Romance,5.3,204,1,nm0063086,actress,Blanche Bayliss,actress
1,tt0000009,movie,Miss Jerry,Miss Jerry,1894.0,45,Romance,5.3,204,2,nm0183823,actor,William Courtenay,actor
2,tt0019859,movie,Evidence,Evidence,1929.0,70,"Crime,Drama,Romance",7.3,27,3,nm0183823,actor,William Courtenay,actor
3,tt0020403,movie,Show of Shows,The Show of Shows,1929.0,128,"Comedy,Music",5.8,454,2,nm0183823,actor,William Courtenay,actor
4,tt0000009,movie,Miss Jerry,Miss Jerry,1894.0,45,Romance,5.3,204,3,nm1309758,actor,Chauncey Depew,"actor,writer"


# Modeling

## Splitting the data
This step separates the data into features and labels for building the supervised models, where:
- y: vector with the labels
- X: feature matrix

For the feature matrix, redundant or highly correlated columns were removed.

In [7]:
y = movie_df['averageRating']
X = movie_df.drop(labels=['tconst','titleType', 'primaryTitle', 'averageRating', 'primaryProfession'], axis=1)
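
The notebook does not show how redundancy or correlation was assessed. One possible check, sketched below on a small hypothetical frame (the values are invented; only the column names mirror the dataset), is to compute pairwise correlations between numeric columns and list the pairs above a chosen threshold. The threshold of 0.95 is an assumption for illustration.

```python
import pandas as pd

# Toy data with invented values; column names mirror the dataset.
toy = pd.DataFrame({
    "startYear": [1990, 2000, 2010, 2020],
    "runtimeMinutes": [90, 100, 110, 120],   # perfectly linear in startYear
    "numVotes": [500, 20, 3000, 150],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()

# Collect column pairs whose absolute correlation exceeds the threshold.
threshold = 0.95  # hypothetical cut-off
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Only one column of each flagged pair would then need to stay in `X`.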

## Building the preprocessing pipeline
This step builds the transformations that handle categorical and numeric features.

*Note: although this tutorial uses tree-based models, this step is kept so that it can be reused with future models.*


In [8]:
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer, ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline

# Categorical features
cat_features = make_pipeline(
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
)

# Numeric features
num_features = make_pipeline(
    StandardScaler()
)

# Preprocessing pipeline for the features
preproc = ColumnTransformer(transformers=[
    ('cat_feat', cat_features, make_column_selector(dtype_include=object)),
    ('num_feat', num_features, make_column_selector(dtype_include=np.number))
])
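
To make the effect of this transformer concrete, the sketch below applies the same kind of `ColumnTransformer` to a tiny hypothetical frame. The data and the `toy_` names are assumptions for illustration. It shows that a category not seen during fitting is encoded as `-1` and that numeric columns are standardized with the training mean and standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy training data; the string column is forced to object dtype so that
# the dtype-based selector below picks it up.
toy_train = pd.DataFrame({
    "genres": ["Drama", "Comedy", "Drama"],
    "runtimeMinutes": [90.0, 100.0, 110.0],
}).astype({"genres": object})

# New sample with a genre that was not present during fit.
toy_new = pd.DataFrame({
    "genres": ["Horror"],
    "runtimeMinutes": [100.0],
}).astype({"genres": object})

toy_preproc = ColumnTransformer(transformers=[
    ("cat_feat",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     make_column_selector(dtype_include=object)),
    ("num_feat", StandardScaler(), make_column_selector(dtype_include=np.number)),
])

toy_preproc.fit(toy_train)
encoded = toy_preproc.transform(toy_new)

# Columns come out in transformer order: categorical first, then numeric.
print(encoded)
```

The unseen genre maps to `-1`, and a runtime equal to the training mean maps to `0`.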

## Train-test split

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.4,
                                                    shuffle=True,
                                                    random_state=42)
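
One thing to keep in mind: the `head()` output shows one row per movie and cast or crew member, so the same `tconst` appears in several rows, and a row-level random split can place rows of one movie in both sets. A group-aware split is one possible alternative. The sketch below uses scikit-learn's `GroupShuffleSplit` on hypothetical toy data and assumes `tconst` is the grouping key (in the notebook it would have to be taken from `movie_df`, since it is dropped from `X`).

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data with invented values: several rows share the same movie id.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt1", "tt2", "tt2", "tt3", "tt4", "tt5"],
    "averageRating": [5.3, 5.3, 5.3, 7.3, 7.3, 5.8, 6.1, 6.9],
})

# Split by group so that every row of a movie lands on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["tconst"]))

train_ids = set(toy.iloc[train_idx]["tconst"])
test_ids = set(toy.iloc[test_idx]["tconst"])

# No movie id is shared between the two sets.
print(train_ids & test_ids)
```

Here `test_size` is a fraction of the groups, not of the rows, so the row proportions can differ from 60/40.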

## Training

#### Building the model
A new Pipeline joins the preprocessing step with the regression model, and a small hyperparameter search is run over it.

In [10]:
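# Instantiate the regression model
from xgboost import XGBRegressor

# 'gpu_hist' needs a CUDA-enabled XGBoost build; XGBoost 2.0+ deprecates it
# in favor of tree_method='hist' with device='cuda'.
regressor = XGBRegressor(tree_method='gpu_hist',
                         random_state=42)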

In [11]:
model_pipeline = Pipeline(steps=[
    ('preprocessing', preproc),
    ('regressor', regressor)
])

In [12]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

param_grid = {'regressor__n_estimators':[50,100],
              'regressor__max_depth':[3,5],
              'regressor__learning_rate':[0.1, 0.5]           
}

search = HalvingGridSearchCV(model_pipeline,
                             param_grid=param_grid,
                             cv=3,
                             random_state=42,
                             n_jobs=-1,
                             refit=True,
                             scoring='neg_mean_squared_error')
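
Successive halving starts by scoring every candidate on a small share of the samples, keeps roughly the best third (the default `factor` is 3), and repeats with more samples until one candidate remains. The sketch below illustrates this on synthetic data. It swaps in scikit-learn's `GradientBoostingRegressor` for `XGBRegressor` so that it runs without XGBoost or a GPU, and the `toy_` names and grid values are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the movie data.
X_toy, y_toy = make_regression(n_samples=300, n_features=5,
                               noise=10.0, random_state=42)

toy_pipeline = Pipeline(steps=[
    ("preprocessing", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
toy_grid = {
    "regressor__n_estimators": [20, 40],
    "regressor__max_depth": [2, 3],
}

toy_search = HalvingGridSearchCV(
    toy_pipeline,
    param_grid=toy_grid,
    cv=3,
    random_state=42,
    scoring="neg_mean_squared_error",
)
toy_search.fit(X_toy, y_toy)

# Candidates and samples used at each halving iteration.
print(toy_search.n_candidates_)
print(toy_search.n_resources_)
print(toy_search.best_params_)
```

The `n_candidates_` and `n_resources_` attributes show how the candidate count shrinks while the sample budget grows.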

### Fit

In [13]:
search.fit(X_train, y_train)

#### Best parameters selected and metrics during k-fold cross-validation

In [14]:
print(search.best_params_)

{'regressor__learning_rate': 0.5, 'regressor__max_depth': 5, 'regressor__n_estimators': 50}


In [15]:
index = search.best_index_
results = search.cv_results_

mean_score = results['mean_test_score'][index]
std_score  = results['std_test_score'][index]

print(f"Validation score: {mean_score:.5f} +- {std_score:.5f}")

Validation score: -1.20033 +- 0.00402
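
Because the search uses `scoring='neg_mean_squared_error'`, the validation score printed above is the negated MSE, which is why it is negative. A small sketch of converting it back to an MSE and an RMSE on the rating scale, using the value from the output above:

```python
import numpy as np

# Mean validation score from the output above (negated MSE).
mean_score = -1.20033

cv_mse = -mean_score        # flip the sign to recover the MSE
cv_rmse = np.sqrt(cv_mse)   # RMSE is in rating points

print(f"CV MSE: {cv_mse:.3f}, CV RMSE: {cv_rmse:.3f}")
```

This gives a cross-validated RMSE of about 1.10 rating points.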


## Performance evaluation

In [16]:
predicted = search.predict(X_test)

In [17]:
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, mean_absolute_error

In [21]:
from tabulate import tabulate

# Note: scikit-learn metrics are defined as metric(y_true, y_pred). MAE, MSE and RMSE are
# symmetric, but MAPE is not: called as below, it divides by the predictions instead of the
# true ratings. Swap the arguments to get the conventional MAPE.
print(tabulate([['Mean absolute percentage error (MAPE):', mean_absolute_percentage_error(predicted, y_test)],
                ['Mean absolute error (MAE):', mean_absolute_error(predicted, y_test)],
                ['Mean squared error (MSE):', mean_squared_error(predicted, y_test)],
                ['Root mean squared error (RMSE):', np.sqrt(mean_squared_error(predicted, y_test))]],
               headers=['Model metric', 'Value'], floatfmt=".2f"))

Model metric                              Value
--------------------------------------  -------
Mean absolute percentage error (MAPE):     0.14
Mean absolute error (MAE):                 0.83
Mean squared error (MSE):                  1.19
Root mean squared error (RMSE):            1.09
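
As a small worked example of what these four metrics measure, the sketch below evaluates them on two hypothetical ratings that are easy to check by hand. It also shows that the argument order matters for MAPE but not for the other metrics.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

# Two invented ratings and predictions.
y_true = np.array([2.0, 10.0])
y_pred = np.array([4.0, 10.0])

mae = mean_absolute_error(y_true, y_pred)    # (2 + 0) / 2 = 1.0
mse = mean_squared_error(y_true, y_pred)     # (4 + 0) / 2 = 2.0
rmse = np.sqrt(mse)                          # sqrt(2.0)

# MAPE divides each absolute error by the first argument.
mape = mean_absolute_percentage_error(y_true, y_pred)          # (2/2 + 0/10) / 2 = 0.5
mape_swapped = mean_absolute_percentage_error(y_pred, y_true)  # (2/4 + 0/10) / 2 = 0.25

print(mae, mse, rmse, mape, mape_swapped)
```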


# Model Persistence
Here we save the fitted search object, which wraps the best pipeline found.

In [22]:
import joblib

joblib.dump(search, 'my_saved_pipeline.joblib')

['my_saved_pipeline.joblib']

#### Loading saved model pipeline

In [23]:
saved_pipeline = joblib.load('my_saved_pipeline.joblib')
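
A quick way to gain confidence in a saved model is to check that the reloaded object gives the same predictions as the original. The sketch below does this round trip for a small hypothetical pipeline (`toy_model` and the file name are assumptions) inside a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit a small stand-in pipeline on invented data.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])
toy_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_toy, y_toy)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print(same)
```

Since joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.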

#### Predicting the rating from movie features (single sample)

In [40]:
y_test.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)

In [85]:
# selecting a random movie from test set
from random import sample

idx = sample(sorted(y_test.index.values), 1)

In [86]:
# Movie features
X_test.iloc[idx]

Unnamed: 0,originalTitle,startYear,runtimeMinutes,genres,numVotes,ordering,nconst,category,primaryName
337073,The Postman Always Rings Twice,1946.0,113,"Crime,Drama,Film-Noir",21817,10,nm0005956,composer,George Bassman


In [87]:
# Prediction
pred_rate = saved_pipeline.predict(X_test.iloc[idx])

In [108]:
movie_name = X_test.originalTitle.loc[idx].values[0]
print(f'Movie Name: {movie_name}\nMovie Rating: {y_test[idx].values[0]:.2f}\nPredicted Rating: {pred_rate[0]:.2f}')

Movie Name: The Postman Always Rings Twice
Movie Rating: 7.40
Predicted Rating: 7.63
