# XGBoost Classifier Cliengo

This notebook develops a classification model (using the **XGBoost** library) whose goal is to predict whether a *review* is **good** (`buena`) or **bad** (`mala`).

## Reading the data

Following the exploratory data analysis carried out in the **EDA** notebook, this notebook uses the preprocessed version of the dataset produced there. In short, that notebook performed the following text-cleaning tasks:

* Conversion of the text to lowercase
* Removal of **stopwords**
* Removal of punctuation marks
* Removal of **domain stopwords**
* Lemmatization of the terms
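
The cleaning steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the stopword sets and the lemma lookup below are hypothetical stand-ins, since the actual lists and lemmatizer used in the EDA notebook are not shown here.

```python
import string

# Hypothetical, tiny stopword lists for illustration only; the lists used
# in the EDA notebook are not shown in this notebook.
STOPWORDS = {"el", "la", "de", "y", "que", "en", "un", "una"}
DOMAIN_STOPWORDS = {"película", "director"}

# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {"esperaba": "esperar", "ganas": "gana"}


def clean_review(text: str) -> str:
    """Apply the listed cleaning steps to a single review."""
    text = text.lower()
    # Remove ASCII punctuation plus the Spanish opening marks.
    text = text.translate(str.maketrans("", "", string.punctuation + "¿¡"))
    # Drop general stopwords, then domain stopwords.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    tokens = [t for t in tokens if t not in DOMAIN_STOPWORDS]
    # Replace each token by its lemma when one is known.
    return " ".join(LEMMAS.get(t, t) for t in tokens)


print(clean_review("Esperaba con curiosidad y ciertas ganas el estreno de la película."))
# esperar con curiosidad ciertas gana estreno
```

Punctuation is stripped before the stopword filter here so that a token such as `película.` still matches its stopword entry.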

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_dataset = pd.read_csv('training_data_preprocessed.csv')
df_dataset.head()

Unnamed: 0,review,score,review cleaned,reviews_length,number_of_words,reviews_avg_length,review cleaned stopwords,lemmatized
0,Era necesario mucho coraje para abordar aconte...,buena,era necesario mucho coraje para abordar aconte...,3812,649,4.873652,era necesario mucho coraje para abordar aconte...,necesario coraje abordar acontecimiento recien...
1,Esperaba con curiosidad y ciertas ganas el est...,mala,esperaba con curiosidad y ciertas ganas el est...,2259,405,4.577778,esperaba con curiosidad y ciertas ganas el est...,esperar curiosidad y gana estreno antonio band...
2,"Wes Craven, convertido en factoría, nos vuelve...",mala,wes craven convertido en factoría nos vuelve a...,1816,317,4.728707,wes craven convertido en factoría nos vuelve a...,wes cravir convertido factoría volver a contar...
3,Va la gente y se rasga las vestiduras con 'Caó...,mala,va la gente y se rasga las vestiduras con caót...,3598,624,4.766026,va la gente y se rasga las vestiduras con caót...,gente y rasgar vestidura caótico án julio mede...
4,Director: Mariano Ozores.Duración: 77 minutos....,buena,director mariano ozoresduración 77 minutosestr...,2271,395,4.749367,mariano ozoresduración minutosestreno de dic...,mariano ozoresduración minutosestreno dici...


## Feature generation

The ***features*** are built from a ***Document-Term Matrix*** weighted with TF-IDF over unigrams and bigrams.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = TfidfVectorizer(analyzer='word', ngram_range=(1,2))
data = cv.fit_transform(df_dataset['lemmatized'])
df_dtm = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
df_dtm.index = df_dataset.index
df_dtm.head()

Unnamed: 0,aa,aa luz,aar,aar eckhart,aar eckhartir,abadés,abadés imaginar,abajo,abajo barriga,abajo casa,...,útil,útil absurdo,útil exposición,útil humanidad,útil osear,útil permanente,útimo,útimo moda,útlimo,útlimo década
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
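
To make the weighting behind the matrix above concrete, here is a minimal pure-Python sketch of TF-IDF over unigrams and bigrams. It assumes the smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The whitespace tokenisation and the two toy documents are simplifications introduced for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """Compute L2-normalised TF-IDF rows using the smoothed IDF formula."""
    counts = [Counter(ngrams(doc.split())) for doc in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {term: sum(term in c for c in counts) for term in vocab}
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for c in counts:
        row = [c[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return vocab, rows


vocab, rows = tfidf(["gran coraje", "coraje aburrido"])
print(vocab)
# ['aburrido', 'coraje', 'coraje aburrido', 'gran', 'gran coraje']
```

In this toy corpus, `coraje` appears in both documents and so receives a lower weight than `gran`, which appears in only one.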


Once the ***features*** have been generated, the **buena** and **mala** labels are encoded as 1 and 0 so they can be fed to the classifier.

In [5]:
df_dataset['label'] = df_dataset['score'].apply(lambda x: 1 if x == 'buena' else 0)
df_dataset.head()

Unnamed: 0,review,score,review cleaned,reviews_length,number_of_words,reviews_avg_length,review cleaned stopwords,lemmatized,label
0,Era necesario mucho coraje para abordar aconte...,buena,era necesario mucho coraje para abordar aconte...,3812,649,4.873652,era necesario mucho coraje para abordar aconte...,necesario coraje abordar acontecimiento recien...,1
1,Esperaba con curiosidad y ciertas ganas el est...,mala,esperaba con curiosidad y ciertas ganas el est...,2259,405,4.577778,esperaba con curiosidad y ciertas ganas el est...,esperar curiosidad y gana estreno antonio band...,0
2,"Wes Craven, convertido en factoría, nos vuelve...",mala,wes craven convertido en factoría nos vuelve a...,1816,317,4.728707,wes craven convertido en factoría nos vuelve a...,wes cravir convertido factoría volver a contar...,0
3,Va la gente y se rasga las vestiduras con 'Caó...,mala,va la gente y se rasga las vestiduras con caót...,3598,624,4.766026,va la gente y se rasga las vestiduras con caót...,gente y rasgar vestidura caótico án julio mede...,0
4,Director: Mariano Ozores.Duración: 77 minutos....,buena,director mariano ozoresduración 77 minutosestr...,2271,395,4.749367,mariano ozoresduración minutosestreno de dic...,mariano ozoresduración minutosestreno dici...,1
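
The lambda above maps every value other than `buena` to 0, so a missing or misspelled score would silently become a negative label. A stricter variant, sketched here as a hypothetical helper (`encode_label` is not part of the original notebook), rejects unexpected values instead:

```python
# Explicit mapping between the dataset's score values and binary labels.
LABELS = {"buena": 1, "mala": 0}


def encode_label(score):
    """Map a review score to 1 (buena) or 0 (mala); reject anything else."""
    try:
        return LABELS[score]
    except KeyError:
        raise ValueError(f"unexpected score: {score!r}") from None


print([encode_label(s) for s in ["buena", "mala", "mala"]])
# [1, 0, 0]
```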


## Splitting the dataset into training and test sets

Next, the dataset is split in two with `train_test_split`: 75% for training the model and 25% for evaluating it.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, df_dataset['label'], test_size=.25)
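
The call above sets neither a seed nor stratification, so the split changes between runs and the class balance of the two parts may drift. `train_test_split` accepts `random_state` and `stratify` arguments for this purpose. The idea can be sketched in pure Python as follows; the helper name, the seed value of 42, and the toy label list are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx), keeping each class's share close to the original."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 60 + [0] * 40  # toy labels: 60% positive, 40% negative
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
# 75 25
```

Both parts keep the 60/40 class ratio of the toy labels, and repeating the call with the same seed returns the same indices.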

## Training the XGBoost model

Next, a classifier based on the **XGBoost** algorithm is trained. A grid search (`GridSearchCV`) with 5-fold cross-validation, scored by F1, looks for the best hyperparameter combination.

In [8]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

In [9]:
param_tuning = {
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5, 7, 10],
        'min_child_weight': [1, 3, 5],
        'subsample': [0.5, 0.7],
        'colsample_bytree': [0.3, 0.5, 0.7],
        'n_estimators' : [100, 200, 500],
        'objective': ['binary:logistic']
    }

In [None]:
xg_clf = xgb.XGBClassifier()

gsearch = GridSearchCV(estimator = xg_clf,
                           param_grid = param_tuning,                        
                           scoring = 'f1', #F1 score
                           cv = 5,
                           n_jobs = -1,
                           verbose = 1)

gsearch.fit(X_train,y_train)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits
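
The figures in the log above follow directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A short check, repeating the grid so that the block is self-contained:

```python
from itertools import product

# Same grid as in the notebook, repeated here so the block runs on its own.
param_tuning = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7, 10],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.5, 0.7],
    'colsample_bytree': [0.3, 0.5, 0.7],
    'n_estimators': [100, 200, 500],
    'objective': ['binary:logistic'],
}

# Every combination of one value per hyperparameter is one candidate.
n_candidates = sum(1 for _ in product(*param_tuning.values()))
cv_folds = 5
print(n_candidates, n_candidates * cv_folds)
# 432 2160
```

Adding one more value to any list multiplies the total number of fits, which is worth keeping in mind before widening the grid.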


In [None]:
# DMatrix is XGBoost's internal data structure, required by the native xgb.train API
data_dmatrix = xgb.DMatrix(data=data,label=df_dataset['label'])
clf = xgb.XGBClassifier(colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 5, alpha = 10, n_estimators = 10)

## Model evaluation

In [None]:
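param = {'objective': 'binary:logistic', 'colsample_bytree': 0.3, 'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}  # was undefined; mirrors clf above
xg_clf = xgb.train(params=param, dtrain=data_dmatrix, num_boost_round=10)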

In [None]:
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10,10]
xgb.plot_importance(xg_clf)
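
The importance plot shows which terms the trees rely on, but it does not measure predictive quality. A model trained with the `binary:logistic` objective outputs probabilities, so a held-out score needs a decision threshold first. The sketch below computes F1 from probabilities in pure Python; the 0.5 threshold, the helper name, and the toy values are assumptions for illustration, not results from this dataset.

```python
def f1_from_probabilities(y_true, y_prob, threshold=0.5):
    """Threshold probabilities into 0/1 predictions and return the F1 score."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy values for illustration only.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.1]
print(round(f1_from_probabilities(y_true, y_prob), 4))
# 0.6667
```

In this notebook the inputs would presumably be `y_test` and the probabilities predicted for `X_test`. The `f1_score` function imported earlier computes the same metric from already-thresholded labels.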