# TODO
* Explicaciones de los modelos
* Implementar matriz de confusion para poder ver de forma más visual los casos buenos y malos
* Label encoder
* Posibles implementaciones: 
> * Regresión logística con CountVectorizer y TFid
> * Decision Tree vs Random forest
> * Word Embedding text vs titles?
> * Algún modelo de Deep learning (GPT-2, GPT-3 (IMPOSIBLE), BERT (BETO EN ESPAÑOL))

### Imports

In [43]:
import pandas as pd
from pandasql import sqldf

##Spacy function
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from pandas import DataFrame

##
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

##ver rendimiento
from sklearn import metrics
from sklearn.metrics import classification_report



### Dataset original

In [12]:
df = pd.read_csv('cnnchile_7000.csv')
df = df.drop(["country","media_outlet", "url","date","title"],1)

q="""SELECT category, count(*) FROM df GROUP BY category ORDER BY count(*) DESC;"""
result=sqldf(q)
result

Unnamed: 0,category,count(*)
0,tendencias,1000
1,tecnologias,1000
2,pais,1000
3,mundo,1000
4,economia,1000
5,deportes,1000
6,cultura,1000


### Dataset Tests textos

In [13]:
q="""SELECT * FROM df WHERE category = "tendencias";"""
df_tend=sqldf(q)

df_tend = df_tend.sample(n=300)

q="""SELECT * FROM df WHERE category = "tecnologias";"""
df_tech = sqldf(q)

df_tech = df_tech.sample(n=300)


q="""SELECT * FROM df WHERE category = "pais";"""
df_pais=sqldf(q)

df_pais = df_pais.sample(n=300)

q="""SELECT * FROM df WHERE category = "mundo";"""
df_mundo=sqldf(q)

df_mundo = df_mundo.sample(n=300)

q="""SELECT * FROM df WHERE category = "economia";"""
df_eco=sqldf(q)

df_eco = df_eco.sample(n=300)

q="""SELECT * FROM df WHERE category = "deportes";"""
df_dep = sqldf(q)

df_dep = df_dep.sample(n=300)

q="""SELECT * FROM df WHERE category = "cultura";"""
df_cult = sqldf(q)

df_cult = df_cult.sample(n=300)

df_train = pd.concat([df_tend, df_tech, df_pais, df_mundo,df_eco,df_dep,df_cult], ignore_index=True)
df_train.shape

(2100, 2)

In [14]:
q="""SELECT category, count(*) FROM df_train GROUP BY category ORDER BY count(*) DESC;"""
test=sqldf(q)
test

Unnamed: 0,category,count(*)
0,tendencias,300
1,tecnologias,300
2,pais,300
3,mundo,300
4,economia,300
5,deportes,300
6,cultura,300


### Seleccionar Dataset

In [15]:
## PRUEBAS REALES

X = df['text'].astype(str)
ylabels = df['category'].astype(str)

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.25, random_state=0)

## PRUEBAS PEQUEÑAS
''' 
X = df_train['text'].astype(str)
ylabels = df_train['category'].astype(str)

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.25, random_state=0)
'''

" \nX = df_train['text'].astype(str)\nylabels = df_train['category'].astype(str)\n\nX_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.25, random_state=0)\n"

### Spacy function

In [34]:
nlp = spacy.load("es_core_news_md")
def feature_extraction(text):
    
    mytokens = nlp(text)

    #Guardamos las palabras como características si corresponden a ciertas categorias gramaticaless
    mytokens = [ word for word in mytokens if word.pos_ in ["NOUN", "ADJ", "VERB"] ]
    
    #Transformamos las palabras en minusculas
    mytokens = [ word.lemma_.lower().strip() for word in mytokens ]

    # return preprocessed list of tokens
    return mytokens

### Regresión logistica utilizando CountVectorizer vs TFid 


##### Mejorar



Al usar **countVectorizer**, utilizamos un enfoque `bag of word` por lo que creamos una matriz la cual tiene la siguente estructura:

* Columnas: Todas las palabras que existen en el dataset
* Filas: Textos del dataset

Cada fila tiene un 1 si contiene la palabra indicada y un 0 en casos donde no.

Al usar **TFid** cambiamos el enfoque bag of words a un enfoque `Term frequency (tf)` y `Inverse data frequency (idf)` por lo cual a la estructura anterior se le remplazaan los valores 1 y 0 por la frecuencia que tenga esa palabra en el texto de la fila.
- A media qeu la palabra se repite más veces en la fila, tendrá un valor más alto y si no existe la palabra en la fila, tendrá un valor de 0.
- Si la palabra se repite muchas veces en el dataset, el valor tambien aumenta.
- Los valores difieren dependiendo de la cantidad de palabras que tenga la fila.

In [25]:
bow_vector = CountVectorizer(tokenizer = feature_extraction, min_df=0., max_df=1.0)

tfidf_vector = TfidfVectorizer(tokenizer = feature_extraction, min_df=0., max_df=1.0)

In [32]:
model_1 = LogisticRegression(max_iter=1000)

pipe1 = Pipeline([('vectorizing', bow_vector),
                 ('learning', model_1)])


pipe2 = Pipeline([('vectorizing', tfidf_vector),
                 ('learning', model_1)])

In [35]:
pipe1.fit(X_train,y_train)

pipe2.fit(X_train,y_train)

Pipeline(steps=[('vectorizing',
                 TfidfVectorizer(min_df=0.0,
                                 tokenizer=<function feature_extraction at 0x7fd30808d5f0>)),
                ('learning', LogisticRegression(max_iter=1000))])

In [39]:
predicted_model_1 = pipe1.predict(X_test) 

In [40]:
predicted_model_2 = pipe2.predict(X_test)

In [41]:
print("Logistic Regression using CountVectorizer:",metrics.accuracy_score(y_test, predicted_model_1))

print("Logistic Regression using TfidfVectorizer:",metrics.accuracy_score(y_test, predicted_model_2))

Logistic Regression using CountVectorizer: 0.7222857142857143
Logistic Regression using TfidfVectorizer: 0.772


In [44]:
print("Matriz de confusión para CountVectorizer: ")
print(classification_report(y_test, predicted_model_1))

Matriz de confusión para CountVectorizer: 
              precision    recall  f1-score   support

     cultura       0.59      0.97      0.74       250
    deportes       0.86      0.78      0.82       255
    economia       0.74      0.82      0.78       273
       mundo       0.72      0.71      0.72       238
        pais       0.80      0.58      0.68       250
 tecnologias       0.69      0.72      0.70       252
  tendencias       0.79      0.43      0.56       232

    accuracy                           0.72      1750
   macro avg       0.74      0.72      0.71      1750
weighted avg       0.74      0.72      0.72      1750



In [45]:
print("Matriz de confusión para TfidfVectorizer: ")
print(classification_report(y_test, predicted_model_2))

Matriz de confusión para TfidfVectorizer: 
              precision    recall  f1-score   support

     cultura       0.87      0.90      0.89       250
    deportes       0.85      0.87      0.86       255
    economia       0.77      0.78      0.78       273
       mundo       0.72      0.72      0.72       238
        pais       0.78      0.68      0.72       250
 tecnologias       0.72      0.72      0.72       252
  tendencias       0.69      0.72      0.70       232

    accuracy                           0.77      1750
   macro avg       0.77      0.77      0.77      1750
weighted avg       0.77      0.77      0.77      1750



#### Comentarios
* Comentar matriz de confusión, precision, recall, 