# Practical project

## Unit 3 - Supervised learning

The practical project addresses a text document classification problem. We have a dataset of Spanish-language news articles published by the outlet "CNN Chile".

The articles are divided into 7 thematic categories: *'pais','deportes','tendencias','tecnologias','cultura','economia','mundo'*

The project has two parts:

- Use at least 3 strategies to train classification models able to classify the articles by thematic category.

- Explore which features explain the decisions of your model.

## 0. Evaluation

The project is done individually. It is due no later than **Monday, November 30** in your GitHub repository.

**Grading rubric:**

Competency 1: Apply a supervised learning protocol to solve a standard classification problem, using a Python programming environment

- < 2 : The supervised learning protocol used is incomplete and/or has major errors
- 2 to 3.9 : The supervised learning protocol used is incomplete or has one major error
- 4 to 5.5 : The learning protocol is complete and error-free, but the strategies used are relatively simple and the performance of the models could be improved.
- 5.6 to 7.0 : The learning protocol is complete and error-free, and at least one of the strategies used required more advanced work and/or achieves better performance.

Competency 2: Explain the performance of a classification model by applying a Precision/Recall/F-Score evaluation protocol

- < 2 : The work gives no explanation of the performance of the classification models
- 2 to 3.9 : The work gives some explanations, but they contain errors.
- 4 to 5.5 : The work gives correct explanations of the performance of the models
- 5.6 to 7 : The work gives correct explanations of the performance of the models and also presents a method for explaining decisions/errors
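
As a reminder of what the Precision/Recall/F-Score protocol measures, here is a minimal sketch that computes the three values for a single category from raw counts. The function name `precision_recall_f1` and the counts are hypothetical and are not taken from the dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its raw counts."""
    # Precision: share of the articles assigned to the class that really belong to it
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of the articles of the class that the model actually found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical counts for one category:
# 80 articles correctly labelled, 20 wrongly assigned to it, 40 missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.67 f1=0.73
```

These are the same per-class quantities that `classification_report` prints in the cells further down.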


# TODO
* Model explanations
* Implement a confusion matrix to see the correct and incorrect cases more visually
* Label encoder

## 1. Dataset

In [2]:
import pandas as pd
from pandasql import sqldf
import spacy
from sklearn.model_selection import train_test_split

df = pd.read_csv('cnnchile_7000.csv')
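# Drop the metadata columns; the positional axis argument is deprecated, so name the columns explicitly
df = df.drop(columns=["country", "media_outlet", "url", "date", "title"])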
df

Unnamed: 0,text,category
0,La Federación de Estudiantes de la Universidad...,pais
1,La Defensoría de la Niñez emitió este domingo ...,pais
2,El monto del bono es de dos tercios de Unidad ...,pais
3,Una nueva polémica tiene esta carrera presiden...,pais
4,Especialistas recomiendan no consumir más de 2...,pais
...,...,...
6995,Las compañías ya han revelado muchos detalles ...,tecnologias
6996,Se proyecta que tras un virtual empate en 2012...,tecnologias
6997,Tablets y smartphones fueron los regalos tecno...,tecnologias
6998,Crecí jugando clásicos de naves como Terminal ...,tecnologias


In [125]:
q="""SELECT category, count(*) FROM df GROUP BY category ORDER BY count(*) DESC;"""
result=sqldf(q)
result

Unnamed: 0,category,count(*)
0,tendencias,1000
1,tecnologias,1000
2,pais,1000
3,mundo,1000
4,economia,1000
5,deportes,1000
6,cultura,1000


In [126]:
df.shape

(7000, 2)

In [127]:
categories = result["category"].astype("str").tolist()
categories

['tendencias',
 'tecnologias',
 'pais',
 'mundo',
 'economia',
 'deportes',
 'cultura']

In [3]:
# Draw 300 random articles per category to build a small, balanced subset for quick tests
# (DataFrameGroupBy.sample requires pandas >= 1.1)
df_train = (
    df.groupby("category", group_keys=False)
      .sample(n=300)
      .reset_index(drop=True)
)
df_train.shape


(2100, 2)

In [4]:
q="""SELECT category, count(*) FROM df_train GROUP BY category ORDER BY count(*) DESC;"""
test=sqldf(q)
test

Unnamed: 0,category,count(*)
0,tendencias,300
1,tecnologias,300
2,pais,300
3,mundo,300
4,economia,300
5,deportes,300
6,cultura,300


In [5]:
nlp = spacy.load("es_core_news_md")

### Determining parameters (Cross Validation)

In [6]:
## Determine how much data goes to TRAIN and TEST (60/40 here; the actual split further down uses test_size=0.25)

total = df.shape[0]
train = int(total * 0.6)
test = total - train
print("Train: ", train,"\nTest: ",test)

Train:  4200 
Test:  2800


In [7]:
## FULL RUN

X = df['text'].astype(str)
ylabels = df['category'].astype(str)

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.25, random_state=0)

## SMALL RUN (uncomment to use the 2100-row df_train subset instead)
# X = df_train['text'].astype(str)
# ylabels = df_train['category'].astype(str)
#
# X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.25, random_state=0)

In [8]:
print("Train size: ",X_train.shape, "\nTest size: ", X_test.shape)

Train size:  (5250,) 
Test size:  (1750,)
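
The dataset has exactly 1000 articles per category, but a plain random split does not preserve that balance in the test set (the `support` column of the reports below ranges from 232 to 273). A possible alternative, shown here as a sketch on hypothetical toy labels rather than on the real data, is the `stratify` argument of `train_test_split`:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 3 categories with 8 documents each
toy_texts = [f"doc {i}" for i in range(24)]
toy_labels = ["pais"] * 8 + ["mundo"] * 8 + ["cultura"] * 8

# stratify keeps the class proportions identical in the train and test parts
tr_X, te_X, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

test_counts = Counter(te_y)
print(test_counts)  # every category contributes 2 documents to the test part
```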


### spaCy function

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from pandas import DataFrame

In [9]:
def feature_extraction(text):
    
    mytokens = nlp(text)

    # Keep only the tokens whose part of speech is noun, adjective, or verb
    mytokens = [ word for word in mytokens if word.pos_ in ["NOUN", "ADJ", "VERB"] ]
    
    # Replace each token with its lemma, lowercased and stripped of surrounding whitespace
    mytokens = [ word.lemma_.lower().strip() for word in mytokens ]

    # return preprocessed list of tokens
    return mytokens

In [11]:
bow_vector = CountVectorizer(tokenizer = feature_extraction, min_df=0., max_df=1.0)
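
To see what a `CountVectorizer` with a custom tokenizer produces, here is a small sketch. It uses a hypothetical whitespace tokenizer, `simple_tokenizer`, as a stand-in for `feature_extraction` so that it runs without the spaCy model; the two sentences are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer


def simple_tokenizer(text):
    """Hypothetical stand-in for feature_extraction: lowercase and split on whitespace."""
    return text.lower().split()


docs = ["el equipo gana el partido", "el banco sube la tasa"]

# token_pattern=None avoids the warning that the default pattern is unused
vectorizer = CountVectorizer(tokenizer=simple_tokenizer, token_pattern=None)
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column in the document-term matrix
col = vectorizer.vocabulary_["el"]
print(counts.shape)               # (2, 8): 2 documents, 8 distinct tokens
print(counts.toarray()[:, col])   # [2 1]: occurrences of "el" in each document
```

Each row of the matrix is one document and each column one token, which is the representation the classifiers below receive.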

### Training


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

### Logistic regression and CountVectorizer

In [159]:
## TRAINING WITH A PIPELINE

model_1 = LogisticRegression(max_iter=1000)
pipe = Pipeline([('vectorizing', bow_vector),
                 ('learning', model_1)])

# model generation
# Fit the vectorizer (bow_vector) and then the classifier (model_1) on the training data
pipe.fit(X_train,y_train)

Pipeline(steps=[('vectorizing',
                 CountVectorizer(min_df=0.0,
                                 tokenizer=<function feature_extraction at 0x7f28b5e91e60>)),
                ('learning', LogisticRegression(max_iter=1000))])

In [160]:
predicted = pipe.predict(X_test) # Vectorize the test data and predict a category for each article.
#predicted_proba = pipe.predict_proba(X_test) # Probability of each possible category for each article


In [161]:
# Model accuracy.
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))

Logistic Regression Accuracy: 0.736


In [162]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

     cultura       0.84      0.83      0.84       250
    deportes       0.81      0.86      0.84       255
    economia       0.75      0.80      0.78       273
       mundo       0.70      0.68      0.69       238
        pais       0.72      0.65      0.68       250
 tecnologias       0.70      0.67      0.68       252
  tendencias       0.61      0.64      0.62       232

    accuracy                           0.74      1750
   macro avg       0.73      0.73      0.73      1750
weighted avg       0.74      0.74      0.74      1750
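
The TODO list mentions a confusion matrix as a more visual way to see which categories get mixed up. A minimal sketch with `sklearn.metrics.confusion_matrix` follows; the label lists are hypothetical toy values, and in the notebook one would pass `y_test` and `predicted` instead.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical); in the notebook these would be y_test and predicted
labels = ["deportes", "pais", "tendencias"]
y_true = ["pais", "pais", "tendencias", "deportes", "tendencias", "pais"]
y_pred = ["pais", "tendencias", "tendencias", "deportes", "pais", "pais"]

# Rows are the true categories, columns the predicted ones, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)

for name, row in zip(labels, cm):
    print(f"{name:>12}", row)
```

The diagonal holds the correct predictions; an off-diagonal cell such as (pais, tendencias) counts the `pais` articles that were labelled `tendencias`.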



### Logistic regression and TfidfVectorizer

In [14]:
## Use TF-IDF weights: the rarer a word is across documents, the more weight it gets.

tfidf_vector = TfidfVectorizer(tokenizer = feature_extraction, min_df=0., max_df=1.0)
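
To make the weighting concrete, the sketch below reproduces the inverse document frequency that `TfidfVectorizer` applies with its default `smooth_idf=True`, namely `ln((1 + n) / (1 + df)) + 1`. The helper name `smooth_idf` and the corpus size are hypothetical.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """IDF of a term that appears in doc_freq out of n_docs documents (smoothed variant)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 4  # hypothetical corpus size
common = smooth_idf(n_docs, doc_freq=4)  # term present in every document
rare = smooth_idf(n_docs, doc_freq=1)    # term present in a single document

print(f"common={common:.2f} rare={rare:.2f}")
# prints: common=1.00 rare=1.92
```

The vectorizer multiplies each term count by this factor and, by default, rescales every document vector to unit L2 norm.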

In [143]:
model_2 = LogisticRegression()

pipe2 = Pipeline([('vectorizing', tfidf_vector),
                 ('learning', model_2)])


In [144]:
pipe2.fit(X_train,y_train)

Pipeline(steps=[('vectorizing',
                 TfidfVectorizer(min_df=0.0,
                                 tokenizer=<function feature_extraction at 0x7f28b5e91e60>)),
                ('learning', LogisticRegression())])

In [145]:
predicted = pipe2.predict(X_test)

In [146]:
# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))

Logistic Regression Accuracy: 0.772


In [147]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

     cultura       0.87      0.90      0.89       250
    deportes       0.85      0.87      0.86       255
    economia       0.77      0.78      0.78       273
       mundo       0.72      0.72      0.72       238
        pais       0.78      0.68      0.72       250
 tecnologias       0.72      0.72      0.72       252
  tendencias       0.69      0.72      0.70       232

    accuracy                           0.77      1750
   macro avg       0.77      0.77      0.77      1750
weighted avg       0.77      0.77      0.77      1750
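
For the second part of the project (explaining the decisions of a model), one option with a linear model is to look at the largest coefficients per category. The sketch below assumes a pipeline with the same step names as above (`vectorizing`, `learning`), but it is fitted on a hypothetical toy corpus with the default tokenizer so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical, not taken from the CNN Chile dataset)
texts = [
    "gol partido equipo", "equipo gol estadio",
    "banco dolar mercado", "mercado dolar inflacion",
    "pelicula actor cine", "cine actor estreno",
]
labels = ["deportes", "deportes", "economia", "economia", "cultura", "cultura"]

toy_pipe = Pipeline([("vectorizing", TfidfVectorizer()),
                     ("learning", LogisticRegression(max_iter=1000))])
toy_pipe.fit(texts, labels)

vec = toy_pipe.named_steps["vectorizing"]
clf = toy_pipe.named_steps["learning"]

# vocabulary_ maps token -> column; invert it to look tokens up by column index
index_to_token = {idx: tok for tok, idx in vec.vocabulary_.items()}

top_words = {}
for class_idx, class_name in enumerate(clf.classes_):
    weights = clf.coef_[class_idx]          # one weight per token for this category
    best = weights.argsort()[::-1][:2]      # columns of the 2 largest weights
    top_words[class_name] = [index_to_token[i] for i in best]
    print(class_name, top_words[class_name])
```

Tokens with large positive weights push an article towards that category, so they give a first, rough explanation of the predictions of the model.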



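### MultinomialNB and TfidfVectorizer (Naive Bayes Classifier)

* The initial expectation was that it would not work very well on text; the accuracy obtained below (0.752) is nonetheless close to the logistic regression results.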

In [148]:
model_3 = MultinomialNB()

pipe3 = Pipeline([('vectorizing', tfidf_vector),
                 ('learning', model_3)])


In [149]:
pipe3.fit(X_train,y_train)

Pipeline(steps=[('vectorizing',
                 TfidfVectorizer(min_df=0.0,
                                 tokenizer=<function feature_extraction at 0x7f28b5e91e60>)),
                ('learning', MultinomialNB())])

In [150]:
predicted = pipe3.predict(X_test)

In [151]:
# Model Accuracy
print("MultiNomialNB Accuracy:",metrics.accuracy_score(y_test, predicted))

MultiNomialNB Accuracy: 0.752


In [152]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

     cultura       0.73      0.94      0.82       250
    deportes       0.91      0.85      0.88       255
    economia       0.75      0.82      0.78       273
       mundo       0.70      0.75      0.73       238
        pais       0.81      0.66      0.73       250
 tecnologias       0.64      0.73      0.69       252
  tendencias       0.77      0.48      0.59       232

    accuracy                           0.75      1750
   macro avg       0.76      0.75      0.74      1750
weighted avg       0.76      0.75      0.75      1750



### SGDClassifier and TfidfVectorizer (Gradient descent)

* Stochastic Gradient Descent (SGD): this estimator implements regularized linear models with stochastic gradient descent learning. The gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (the learning rate).

In [153]:
model_4 = SGDClassifier(loss='hinge', 
              penalty='l2', 
              alpha=1e-3, 
              random_state=42,
              max_iter=5, 
              tol=None)
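
To illustrate what "one sample at a time" means, here is a simplified pure-Python sketch of SGD with hinge loss and an L2 penalty for a two-class problem. It is not the scikit-learn implementation: the function names, the learning-rate schedule, and the toy data are assumptions made for the example.

```python
import random


def sgd_hinge(X, y, alpha=0.01, epochs=20, seed=0):
    """Fit weights w and bias b of a linear classifier; labels must be -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    t = 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)                   # visit the samples in random order
        for i in order:
            t += 1
            eta = 1.0 / (1.0 + 0.1 * t)      # illustrative decreasing learning rate
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # L2 penalty: shrink the weights slightly on every step
            w = [wj * (1.0 - eta * alpha) for wj in w]
            # Hinge loss: only samples with margin below 1 move the weights
            if margin < 1:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (hypothetical)
X_toy = [[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, -2.0]]
y_toy = [1, 1, -1, -1]

w, b = sgd_hinge(X_toy, y_toy)
preds = [predict(w, b, x) for x in X_toy]
print(preds)
```

In `SGDClassifier`, `alpha` plays the role of the penalty strength used here, and `max_iter=5` corresponds to five passes (epochs) over the training data.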

In [154]:
pipe4 = Pipeline([('vectorizing', tfidf_vector),
                 ('learning', model_4)])


In [155]:
pipe4.fit(X_train,y_train)

Pipeline(steps=[('vectorizing',
                 TfidfVectorizer(min_df=0.0,
                                 tokenizer=<function feature_extraction at 0x7f28b5e91e60>)),
                ('learning',
                 SGDClassifier(alpha=0.001, max_iter=5, random_state=42,
                               tol=None))])

In [156]:
predicted = pipe4.predict(X_test)

In [157]:
# Model Accuracy
print("SGDC Accuracy:",metrics.accuracy_score(y_test, predicted))

SGDC Accuracy: 0.7611428571428571


In [163]:
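# Caution: this cell ran as In [163], right after the CountVectorizer logistic regression cells
# (In [159] to In [162]). The report below is identical to that model's report (accuracy 0.74)
# and does not match the SGD accuracy of 0.761 printed above, so it is likely stale.
print(classification_report(y_test, predicted))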

              precision    recall  f1-score   support

     cultura       0.84      0.83      0.84       250
    deportes       0.81      0.86      0.84       255
    economia       0.75      0.80      0.78       273
       mundo       0.70      0.68      0.69       238
        pais       0.72      0.65      0.68       250
 tecnologias       0.70      0.67      0.68       252
  tendencias       0.61      0.64      0.62       232

    accuracy                           0.74      1750
   macro avg       0.73      0.73      0.73      1750
weighted avg       0.74      0.74      0.74      1750



### Decision Tree

In [169]:
model_5 = tree.DecisionTreeClassifier()

pipe5 = Pipeline([('vectorizing', tfidf_vector),
                 ('learning', model_5)])

In [170]:
pipe5.fit(X_train, y_train)

Pipeline(steps=[('vectorizing',
                 TfidfVectorizer(min_df=0.0,
                                 tokenizer=<function feature_extraction at 0x7f28b5e91e60>)),
                ('learning', DecisionTreeClassifier())])

In [171]:
predicted = pipe5.predict(X_test)

In [172]:
# Model accuracy.
print("Decision Tree Accuracy:",metrics.accuracy_score(y_test, predicted))


In [173]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

     cultura       0.69      0.66      0.68       250
    deportes       0.74      0.74      0.74       255
    economia       0.58      0.54      0.56       273
       mundo       0.40      0.45      0.42       238
        pais       0.38      0.37      0.37       250
 tecnologias       0.54      0.54      0.54       252
  tendencias       0.44      0.45      0.44       232

    accuracy                           0.54      1750
   macro avg       0.54      0.54      0.54      1750
weighted avg       0.54      0.54      0.54      1750



### Random Forest

In [174]:
model_6 = RandomForestClassifier(n_estimators=100)

pipe6 = Pipeline([('vectorizing', tfidf_vector),
                 ('learning', model_6)])

In [175]:
pipe6.fit(X_train, y_train)

Pipeline(steps=[('vectorizing',
                 TfidfVectorizer(min_df=0.0,
                                 tokenizer=<function feature_extraction at 0x7f28b5e91e60>)),
                ('learning', RandomForestClassifier())])

In [176]:
predicted = pipe6.predict(X_test)

In [177]:
# Model accuracy.
print("Random Forest Accuracy:",metrics.accuracy_score(y_test, predicted))

Random Forest Accuracy: 0.7177142857142857


In [178]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

     cultura       0.79      0.91      0.85       250
    deportes       0.80      0.86      0.83       255
    economia       0.71      0.75      0.73       273
       mundo       0.71      0.62      0.66       238
        pais       0.63      0.64      0.63       250
 tecnologias       0.71      0.62      0.67       252
  tendencias       0.64      0.60      0.62       232

    accuracy                           0.72      1750
   macro avg       0.71      0.72      0.71      1750
weighted avg       0.71      0.72      0.71      1750



### KNN (K-nearest Neighbor)

In [15]:
vecinos = 18 
model_7 = KNeighborsClassifier(vecinos)

pipe7 = Pipeline([('vectorizing', tfidf_vector),
                 ('learning', model_7)])

In [16]:
pipe7.fit(X_train, y_train)

Pipeline(steps=[('vectorizing',
                 TfidfVectorizer(min_df=0.0,
                                 tokenizer=<function feature_extraction at 0x7fb9273db9e0>)),
                ('learning', KNeighborsClassifier(n_neighbors=18))])

In [18]:
predicted = pipe7.predict(X_test)

In [19]:
# Model accuracy.
print("KNN Accuracy:",metrics.accuracy_score(y_test, predicted))

KNN Accuracy: 0.7022857142857143


In [20]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

     cultura       0.85      0.86      0.85       250
    deportes       0.75      0.89      0.81       255
    economia       0.81      0.68      0.74       273
       mundo       0.76      0.61      0.67       238
        pais       0.47      0.81      0.59       250
 tecnologias       0.75      0.60      0.67       252
  tendencias       0.75      0.44      0.56       232

    accuracy                           0.70      1750
   macro avg       0.73      0.70      0.70      1750
weighted avg       0.73      0.70      0.70      1750
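
The value `vecinos = 18` above is fixed by hand. One way to choose it, in line with the cross-validation heading earlier in the notebook, would be a grid search over the pipeline. The sketch below uses a hypothetical toy corpus and a tiny grid so that it runs quickly; the candidate values and `cv=2` are assumptions, not tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical)
texts = ["gol partido", "equipo gol", "partido equipo", "gol equipo",
         "banco dolar", "mercado dolar", "banco mercado", "dolar banco"]
labels = ["deportes"] * 4 + ["economia"] * 4

knn_pipe = Pipeline([("vectorizing", TfidfVectorizer()),
                     ("learning", KNeighborsClassifier())])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"learning__n_neighbors": [1, 3]}

# cv=2 splits the data into 2 stratified folds and averages the accuracy
search = GridSearchCV(knn_pipe, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 2))
```

On the real data, the grid would cover a wider range of neighbour counts, and the search would be fitted on `X_train` and `y_train` only, so that the test set stays untouched.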

