## 20N - Dataset - Naive Bayes and Logistic Regression Models

This notebook covers the preprocessing, training, and evaluation of Naive Bayes and Logistic Regression classifiers trained on the 20N dataset, which contains news articles from 20 different categories.

## 1. Import libraries

In [1]:
from glob import glob
from functions import read_file, build_preprocess_pipeline
from pandas import DataFrame

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.pipeline import Pipeline

## 2. Load data

In [2]:
cates_dir = glob('data/20news-18828/*') # Directory where the data is stored; change if needed

topics_path = {c.split('/')[-1]: glob(f'{c}/*') for c in cates_dir}

# The directory structure gives the class each article belongs to
print("Total topics (classes): ", len(topics_path)) 

df = DataFrame([(k, v) for k, v in topics_path.items()], columns=['topics', 'files'])

df = df.explode('files')

print("Total files: ", len(df))

df['text'] = df['files'].apply(read_file)

# The classes are not heavily imbalanced
df.value_counts('topics')

Total topics (classes):  20
Total files:  18828


topics
20news-18828\rec.sport.hockey            999
20news-18828\soc.religion.christian      997
20news-18828\rec.sport.baseball          994
20news-18828\rec.motorcycles             994
20news-18828\sci.crypt                   991
20news-18828\rec.autos                   990
20news-18828\sci.med                     990
20news-18828\sci.space                   987
20news-18828\comp.os.ms-windows.misc     985
20news-18828\comp.sys.ibm.pc.hardware    982
20news-18828\sci.electronics             981
20news-18828\comp.windows.x              980
20news-18828\comp.graphics               973
20news-18828\misc.forsale                972
20news-18828\comp.sys.mac.hardware       961
20news-18828\talk.politics.mideast       940
20news-18828\talk.politics.guns          910
20news-18828\alt.atheism                 799
20news-18828\talk.politics.misc          775
20news-18828\talk.religion.misc          628
Name: count, dtype: int64
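The class labels above carry a `20news-18828\` prefix, which is what `c.split('/')[-1]` produces when the path returned by `glob` has a backslash before the topic directory (as happens on Windows). A minimal sketch of a separator-independent alternative follows, assuming the label should be only the topic directory name. `label_from_dir` is a hypothetical helper, not part of the notebook, and the example paths are assumed shapes rather than values taken from the run.

```python
from pathlib import PureWindowsPath


def label_from_dir(path: str) -> str:
    """Return the last path component, treating both '/' and '\\' as separators."""
    # PureWindowsPath accepts either separator, so it handles paths produced on any OS
    return PureWindowsPath(path).name


windows_style = "data/20news-18828\\sci.med"  # assumed shape of a path returned by glob on Windows
posix_style = "data/20news-18828/sci.med"     # assumed shape of the same path on Linux/macOS

print(windows_style.split("/")[-1])   # 20news-18828\sci.med
print(label_from_dir(windows_style))  # sci.med
print(label_from_dir(posix_style))    # sci.med
```

Switching to such a helper would change the class names printed in every report in this notebook, so it is shown only as an option.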

## 3. Preprocessing

In [3]:
# Training and test sets
x_train, x_test, y_train, y_test = train_test_split(df['text'], df['topics'], 
                                                    train_size=0.7, random_state=42)

print("Train: ", len(x_train))
print("Test: ", len(x_test))

Train:  13179
Test:  5649


In [4]:
# Build the pipeline with two different vectorizations (tf and tfidf)
cnt_pipeline = build_preprocess_pipeline('count').fit(x_train)
tfidf_pipeline = build_preprocess_pipeline('tfidf').fit(x_train)
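The implementation of `build_preprocess_pipeline` is not shown in this notebook, so the following is only a library-free sketch of how the two representations differ, not a description of what that function does. It assumes whitespace tokenization and the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`; the toy documents and the names `tf_vector` and `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "the goalie stopped the puck",
    "the pitcher threw the ball",
    "the puck hit the post",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tf_vector(tokens):
    """Raw term counts (the 'count' representation)."""
    return dict(Counter(tokens))


def tfidf_vector(tokens):
    """Term counts scaled by a smoothed inverse document frequency."""
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in Counter(tokens).items()
    }


tf = tf_vector(tokenized[0])
tfidf = tfidf_vector(tokenized[0])
print(tf["the"], tf["goalie"])
print(round(tfidf["the"], 3), round(tfidf["goalie"], 3))
```

In this toy corpus "the" appears in every document, so its weight stays at its raw count of 2, while the rarer "goalie" is raised from 1 to about 1.69.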

In [5]:
# Cross-validation with 10 stratified random splits (shuffle-split, not k-fold)
# See https://scikit-learn.org/stable/modules/cross_validation.html
cv = StratifiedShuffleSplit(n_splits=10, random_state=42,
                            test_size=1/7) # Validation is about 10% of the original dataset
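The "about 10%" figure follows from taking one seventh of the 70% training split. A quick arithmetic check, assuming a fractional validation size is rounded up to a whole number of samples:

```python
import math

n_total = 18828        # total files reported when loading the data
n_train = 13179        # training-set size reported after the split
val_fraction = 1 / 7   # test_size passed to StratifiedShuffleSplit

# Assumption: the fractional validation size is rounded up to whole samples
n_val = math.ceil(val_fraction * n_train)
n_fit = n_train - n_val

print(n_val, n_fit)               # 1883 11296
print(round(n_val / n_total, 3))  # 0.1
```

Each of the 10 splits therefore fits on roughly 60% of the full dataset and validates on roughly 10%.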

In [6]:
# Run the preprocessing to produce the input vectors for the models
X_train_tfidf_transformed = tfidf_pipeline.transform(x_train)
X_train_cnt_transformed = cnt_pipeline.transform(x_train)

## 4. TF-IDF Models

#### Logistic Regression

In [7]:
# Train a classifier using logistic regression
logistic_estimator = LogisticRegression(n_jobs=-1, random_state=42, 
                                        class_weight='balanced', solver='saga',
                                        max_iter=1000, penalty='l2',
                                        tol=1e-2,
                                        )
# Hyperparameter grid
logistic_param_grid = {
    'C': [1, 10],
}

# GridSearchCV runs the hyperparameter search (C is the inverse of the regularization strength)
grid_search_best_tfidf_lr_estimator = GridSearchCV(
    estimator=logistic_estimator,
    param_grid=logistic_param_grid,
    cv=cv,
    scoring='f1_macro',
    n_jobs=-1,
    return_train_score=False,
    refit=True
).fit(X_train_tfidf_transformed, y_train)

grid_search_best_tfidf_lr_estimator.cv_results_

{'mean_fit_time': array([5.03403425, 6.28133607]),
 'std_fit_time': array([0.54272684, 1.05229602]),
 'mean_score_time': array([0.06741891, 0.04984288]),
 'std_score_time': array([0.03131618, 0.01892294]),
 'param_C': masked_array(data=[1, 10],
              mask=[False, False],
        fill_value=999999),
 'params': [{'C': 1}, {'C': 10}],
 'split0_test_score': array([0.8862346 , 0.89944589]),
 'split1_test_score': array([0.88497246, 0.89965614]),
 'split2_test_score': array([0.87082913, 0.89151214]),
 'split3_test_score': array([0.88603784, 0.90479766]),
 'split4_test_score': array([0.87324393, 0.89457976]),
 'split5_test_score': array([0.88328331, 0.90944104]),
 'split6_test_score': array([0.87003422, 0.89189012]),
 'split7_test_score': array([0.87826901, 0.89497397]),
 'split8_test_score': array([0.88490615, 0.90367546]),
 'split9_test_score': array([0.88438184, 0.9020709 ]),
 'mean_test_score': array([0.88021925, 0.89920431]),
 'std_test_score': array([0.00621035, 0.0056258 ]),
 'r
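The raw `cv_results_` dictionary is hard to read at a glance. As a sketch, the per-split scores printed above can be reduced by hand to one mean per candidate; the numbers below are copied from that output, and the variable names are illustrative. On a fitted `GridSearchCV` the same information is available through its `best_params_` and `best_score_` attributes.

```python
from statistics import mean

# Per-split macro-F1 scores copied from the cv_results_ output above, keyed by C
split_scores = {
    1: [0.8862346, 0.88497246, 0.87082913, 0.88603784, 0.87324393,
        0.88328331, 0.87003422, 0.87826901, 0.88490615, 0.88438184],
    10: [0.89944589, 0.89965614, 0.89151214, 0.90479766, 0.89457976,
         0.90944104, 0.89189012, 0.89497397, 0.90367546, 0.9020709],
}

# Average over the 10 splits and pick the candidate with the highest mean
mean_scores = {C: mean(scores) for C, scores in split_scores.items()}
best_C = max(mean_scores, key=mean_scores.get)

for C, score in mean_scores.items():
    print(f"C={C}: mean macro F1 = {score:.4f}")
print("best C:", best_C)
```

The recomputed means agree with the `mean_test_score` entry, and `C=10` is the candidate the search refits on the full training set.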

In [8]:
# Evaluate the model on the test set
X_test_transformed_tfidf = tfidf_pipeline.transform(x_test)
y_pred = grid_search_best_tfidf_lr_estimator.predict(X_test_transformed_tfidf)
print(classification_report(y_test, y_pred))

                                       precision    recall  f1-score   support

             20news-18828\alt.atheism       0.90      0.88      0.89       220
           20news-18828\comp.graphics       0.83      0.84      0.84       303
 20news-18828\comp.os.ms-windows.misc       0.82      0.87      0.85       280
20news-18828\comp.sys.ibm.pc.hardware       0.78      0.80      0.79       286
   20news-18828\comp.sys.mac.hardware       0.91      0.87      0.89       275
          20news-18828\comp.windows.x       0.88      0.86      0.87       300
            20news-18828\misc.forsale       0.82      0.88      0.85       287
               20news-18828\rec.autos       0.92      0.92      0.92       302
         20news-18828\rec.motorcycles       0.98      0.95      0.97       317
      20news-18828\rec.sport.baseball       0.94      0.93      0.94       300
        20news-18828\rec.sport.hockey       0.96      0.96      0.96       297
               20news-18828\sci.crypt       0.97   

Precision and recall are good, above 80% for most classes. The hardest class to classify is `comp.sys.ibm.pc.hardware` (precision 0.78, recall 0.80).
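The grid search is scored with `f1_macro`, which averages the per-class F1 scores without weighting by class size, so small classes count as much as large ones. A minimal pure-Python sketch of that metric on made-up labels (the helper `macro_f1` and the toy data are illustrative, not taken from the notebook):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall (0 when both are 0)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Here classes `a` and `b` each get an F1 of 0.8 and class `c` gets 1.0, so the macro average is about 0.867.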

#### Naive Bayes

In [9]:
# Multinomial Naive Bayes classifier
nb_estimator = MultinomialNB()

# Hyperparameter grid
nb_param_grid = {
    'alpha': [0.01, 0.1, 1],
}

grid_search_best_tfidf_nb_estimator = GridSearchCV(
    estimator=nb_estimator,
    param_grid=nb_param_grid,
    cv=cv,
    scoring='f1_macro',
    n_jobs=-1,
    return_train_score=False,
    refit=True
).fit(X_train_tfidf_transformed, y_train)

grid_search_best_tfidf_nb_estimator.cv_results_

{'mean_fit_time': array([0.15810995, 0.15114057, 0.16162992]),
 'std_fit_time': array([0.01629736, 0.0134238 , 0.01140076]),
 'mean_score_time': array([0.03784821, 0.04039466, 0.04132943]),
 'std_score_time': array([0.00349535, 0.00542403, 0.00787505]),
 'param_alpha': masked_array(data=[0.01, 0.1, 1.0],
              mask=[False, False, False],
        fill_value=1e+20),
 'params': [{'alpha': 0.01}, {'alpha': 0.1}, {'alpha': 1}],
 'split0_test_score': array([0.89184862, 0.88557447, 0.86684112]),
 'split1_test_score': array([0.89199304, 0.88576377, 0.85274573]),
 'split2_test_score': array([0.88793753, 0.88714995, 0.84367105]),
 'split3_test_score': array([0.88342205, 0.88106166, 0.86128463]),
 'split4_test_score': array([0.89438796, 0.89554142, 0.85899888]),
 'split5_test_score': array([0.8961432 , 0.89240712, 0.86818379]),
 'split6_test_score': array([0.88042153, 0.88263514, 0.85373031]),
 'split7_test_score': array([0.89472547, 0.88768814, 0.85013738]),
 'split8_test_score': array([
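The `alpha` values in the grid control additive smoothing of the per-class term probabilities: each count is increased by `alpha` so that a term never seen in a class does not get probability zero. A small sketch of the formula `(count + alpha) / (total + alpha * |V|)` on made-up counts (the function name and the toy counts are illustrative):

```python
def smoothed_probability(term_count, total_count, vocab_size, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * |V|)."""
    return (term_count + alpha) / (total_count + alpha * vocab_size)


# Toy class with 4 observed tokens over a 3-term vocabulary; 'ball' never appears
counts = {"puck": 3, "goal": 1, "ball": 0}
total = sum(counts.values())

for alpha in (0.01, 0.1, 1):
    p_unseen = smoothed_probability(counts["ball"], total, len(counts), alpha)
    print(f"alpha={alpha}: P(ball | class) = {p_unseen:.4f}")
```

Smaller `alpha` keeps the estimates closer to the observed frequencies, while `alpha=1` moves more probability mass to unseen terms.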

In [10]:
# Evaluate the model on the test set
y_pred = grid_search_best_tfidf_nb_estimator.predict(X_test_transformed_tfidf)
print(classification_report(y_test, y_pred))

                                       precision    recall  f1-score   support

             20news-18828\alt.atheism       0.86      0.90      0.88       220
           20news-18828\comp.graphics       0.79      0.84      0.82       303
 20news-18828\comp.os.ms-windows.misc       0.79      0.81      0.80       280
20news-18828\comp.sys.ibm.pc.hardware       0.76      0.79      0.78       286
   20news-18828\comp.sys.mac.hardware       0.87      0.88      0.88       275
          20news-18828\comp.windows.x       0.86      0.85      0.86       300
            20news-18828\misc.forsale       0.82      0.84      0.83       287
               20news-18828\rec.autos       0.92      0.91      0.91       302
         20news-18828\rec.motorcycles       0.95      0.95      0.95       317
      20news-18828\rec.sport.baseball       0.96      0.94      0.95       300
        20news-18828\rec.sport.hockey       0.96      0.96      0.96       297
               20news-18828\sci.crypt       0.97   

This model is slightly worse: its overall accuracy is lower (`0.89`), and it has more trouble recovering several classes, such as `20news-18828\talk.religion.misc` with a recall of `0.67`. Values this low did not appear with the logistic regression model.

## 5. TF Models

#### Logistic Regression

In [11]:
# Logistic regression classifier that takes raw term frequencies as model input
grid_search_best_cnt_lr_estimator = GridSearchCV(
    estimator=logistic_estimator,
    param_grid=logistic_param_grid,
    cv=cv,
    scoring='f1_macro',
    n_jobs=-1,
    return_train_score=False,
    refit=True
).fit(X_train_cnt_transformed, y_train)

grid_search_best_cnt_lr_estimator.cv_results_

{'mean_fit_time': array([20.13087564, 21.2608979 ]),
 'std_fit_time': array([1.62307125, 1.19520256]),
 'mean_score_time': array([0.05210958, 0.05029557]),
 'std_score_time': array([0.00648786, 0.01122549]),
 'param_C': masked_array(data=[1, 10],
              mask=[False, False],
        fill_value=999999),
 'params': [{'C': 1}, {'C': 10}],
 'split0_test_score': array([0.83018866, 0.8356691 ]),
 'split1_test_score': array([0.83429819, 0.83732594]),
 'split2_test_score': array([0.82859833, 0.83082365]),
 'split3_test_score': array([0.83313852, 0.83758565]),
 'split4_test_score': array([0.83239731, 0.83553954]),
 'split5_test_score': array([0.83469407, 0.83935271]),
 'split6_test_score': array([0.81415871, 0.8197063 ]),
 'split7_test_score': array([0.82660366, 0.83136315]),
 'split8_test_score': array([0.82915861, 0.83149625]),
 'split9_test_score': array([0.8365562 , 0.84058577]),
 'mean_test_score': array([0.82997923, 0.8339448 ]),
 'std_test_score': array([0.00603552, 0.00573314]),
 

In [12]:
# Evaluate the model
X_test_transformed_cnt = cnt_pipeline.transform(x_test)
y_pred = grid_search_best_cnt_lr_estimator.predict(X_test_transformed_cnt)
print(classification_report(y_test, y_pred))

                                       precision    recall  f1-score   support

             20news-18828\alt.atheism       0.88      0.86      0.87       220
           20news-18828\comp.graphics       0.57      0.71      0.63       303
 20news-18828\comp.os.ms-windows.misc       0.77      0.79      0.78       280
20news-18828\comp.sys.ibm.pc.hardware       0.71      0.70      0.70       286
   20news-18828\comp.sys.mac.hardware       0.80      0.79      0.80       275
          20news-18828\comp.windows.x       0.70      0.75      0.72       300
            20news-18828\misc.forsale       0.69      0.91      0.78       287
               20news-18828\rec.autos       0.88      0.85      0.87       302
         20news-18828\rec.motorcycles       0.88      0.92      0.90       317
      20news-18828\rec.sport.baseball       0.89      0.88      0.89       300
        20news-18828\rec.sport.hockey       0.97      0.94      0.95       297
               20news-18828\sci.crypt       0.97   

The result is decent, but clearly worse than the same model trained on the TF-IDF representation (mean cross-validated macro F1 of about 0.83 versus 0.90 at `C=10`).

#### Naive Bayes

In [13]:
# Naive Bayes classifier
grid_search_best_cnt_nb_estimator = GridSearchCV(
    estimator=nb_estimator,
    param_grid=nb_param_grid,
    cv=cv,
    scoring='f1_macro',
    n_jobs=-1,
    return_train_score=False,
    refit=True
).fit(X_train_cnt_transformed, y_train)

grid_search_best_cnt_nb_estimator.cv_results_

{'mean_fit_time': array([0.1663542 , 0.1699923 , 0.14066293]),
 'std_fit_time': array([0.02165891, 0.0341979 , 0.01088923]),
 'mean_score_time': array([0.04326825, 0.04706988, 0.03491297]),
 'std_score_time': array([0.00944919, 0.01133113, 0.00447589]),
 'param_alpha': masked_array(data=[0.01, 0.1, 1.0],
              mask=[False, False, False],
        fill_value=1e+20),
 'params': [{'alpha': 0.01}, {'alpha': 0.1}, {'alpha': 1}],
 'split0_test_score': array([0.82596001, 0.84655526, 0.83167098]),
 'split1_test_score': array([0.83776982, 0.84333162, 0.8376375 ]),
 'split2_test_score': array([0.83462112, 0.83514511, 0.82215158]),
 'split3_test_score': array([0.83636907, 0.83453099, 0.82466715]),
 'split4_test_score': array([0.83395827, 0.83684251, 0.81838189]),
 'split5_test_score': array([0.83578677, 0.84402088, 0.83207091]),
 'split6_test_score': array([0.81588106, 0.82154142, 0.80862033]),
 'split7_test_score': array([0.84196559, 0.8506871 , 0.82780732]),
 'split8_test_score': array([

In [14]:
# Evaluate the model
y_pred = grid_search_best_cnt_nb_estimator.predict(X_test_transformed_cnt)
print(classification_report(y_test, y_pred))

                                       precision    recall  f1-score   support

             20news-18828\alt.atheism       0.83      0.86      0.84       220
           20news-18828\comp.graphics       0.66      0.75      0.70       303
 20news-18828\comp.os.ms-windows.misc       0.80      0.62      0.70       280
20news-18828\comp.sys.ibm.pc.hardware       0.65      0.76      0.70       286
   20news-18828\comp.sys.mac.hardware       0.77      0.82      0.79       275
          20news-18828\comp.windows.x       0.80      0.77      0.79       300
            20news-18828\misc.forsale       0.79      0.79      0.79       287
               20news-18828\rec.autos       0.82      0.86      0.84       302
         20news-18828\rec.motorcycles       0.86      0.92      0.89       317
      20news-18828\rec.sport.baseball       0.94      0.89      0.91       300
        20news-18828\rec.sport.hockey       0.95      0.94      0.94       297
               20news-18828\sci.crypt       0.94   

The resulting model is very similar to the previous one and still performs worse than the models based on `tf-idf`.

### Best Model

As observed, the best model combines `tf-idf` with logistic regression.

In [15]:
best_model_pipeline = Pipeline([
                            ('preprocess', tfidf_pipeline),
                            ('classifier', grid_search_best_tfidf_lr_estimator)
                        ])

In [16]:
best_model_pipeline.predict(['hi! I suffer very painful stomach pains'])

array(['20news-18828\\sci.med'], dtype=object)
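The pipeline above chains an already-fitted preprocessing step with an already-fitted grid search. For comparison, here is a self-contained sketch of the same raw-text-to-label idea on toy data. It assumes a plain `TfidfVectorizer` in place of the notebook's `build_preprocess_pipeline` (whose implementation is not shown) and a fixed `C=10`; the texts, labels, and the name `toy_pipeline` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real articles (assumption, not notebook data)
texts = [
    "the doctor prescribed medicine for the pain",
    "stomach pain and fever need a doctor",
    "the team scored a goal in the hockey game",
    "the hockey player skated down the ice",
]
labels = ["sci.med", "sci.med", "rec.sport.hockey", "rec.sport.hockey"]

# Vectorizer and classifier are fitted together from raw strings
toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(C=10, max_iter=1000)),
])
toy_pipeline.fit(texts, labels)

prediction = toy_pipeline.predict(["I suffer very painful stomach pains"])
print(prediction)
```

Keeping both steps inside one `Pipeline` means a single object accepts raw strings at prediction time, which is the property the notebook's `best_model_pipeline` relies on.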

In [17]:
# Evaluate the model
y_pred = best_model_pipeline.predict(x_test)
print(classification_report(y_test, y_pred))

                                       precision    recall  f1-score   support

             20news-18828\alt.atheism       0.90      0.88      0.89       220
           20news-18828\comp.graphics       0.83      0.84      0.84       303
 20news-18828\comp.os.ms-windows.misc       0.82      0.87      0.85       280
20news-18828\comp.sys.ibm.pc.hardware       0.78      0.80      0.79       286
   20news-18828\comp.sys.mac.hardware       0.91      0.87      0.89       275
          20news-18828\comp.windows.x       0.88      0.86      0.87       300
            20news-18828\misc.forsale       0.82      0.88      0.85       287
               20news-18828\rec.autos       0.92      0.92      0.92       302
         20news-18828\rec.motorcycles       0.98      0.95      0.97       317
      20news-18828\rec.sport.baseball       0.94      0.93      0.94       300
        20news-18828\rec.sport.hockey       0.96      0.96      0.96       297
               20news-18828\sci.crypt       0.97   