# Text Classification
## Jahzeel Ulises Mendez Diaz

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer

### Stemming

Stemming is a method for reducing a word to its root, or stem. Several stemming algorithms are used in information retrieval systems. Stemming increases recall, a measure of how many of the relevant documents a query actually retrieves. For example, a query for "bibliotecas" (libraries) also finds documents that contain only "bibliotecario" (librarian), because both words share the same stem ("bibliotec"). [(ref)](https://es.wikipedia.org/wiki/Stemming)
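To make the recall argument concrete, here is a minimal sketch of a search with and without stemming. The suffix list, the `toy_stem` function, and the three sample documents are all hypothetical and exist only for illustration; this is not the Snowball algorithm used in the notebook.

```python
# Toy suffix list, longest first. A real stemmer such as Snowball uses far more careful rules.
SUFFIXES = ("arios", "ario", "as", "os", "a", "o", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-collection of documents.
documents = {
    1: "el bibliotecario ordena libros",
    2: "las bibliotecas abren hoy",
    3: "el museo cierra",
}

def search(query, stem=False):
    """Return the ids of documents that share at least one term with the query."""
    normalize = toy_stem if stem else (lambda w: w)
    query_terms = {normalize(w) for w in query.lower().split()}
    return {
        doc_id
        for doc_id, text in documents.items()
        if query_terms & {normalize(w) for w in text.lower().split()}
    }

exact_hits = search("bibliotecas")               # only the literal word matches
stemmed_hits = search("bibliotecas", stem=True)  # "bibliotecario" matches as well
print(exact_hits, stemmed_hits)
```

With stemming, the query retrieves one more relevant document, which is the increase in recall described above.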

In [78]:
stemmer = SnowballStemmer('english')
def tokenize_and_stem(text):
  # Lowercase and tokenize, keep only alphabetic tokens, and stem each one
  tokens = word_tokenize(text.lower())
  stems = [stemmer.stem(token) for token in tokens if token.isalpha()]
  return ' '.join(stems)

### Lemmatization

Lemmatization is a linguistic process that, given an inflected form (plural, feminine, conjugated, etc.), finds the corresponding lemma. The lemma is the form accepted by convention as the representative of all inflected forms of the same word. In other words, the lemma of a word is the form found as a headword in a traditional dictionary: singular for nouns, masculine singular for adjectives, infinitive for verbs. For example, in Spanish, "decir" is the lemma of "dije", but also of "diré" and "dijéramos"; "guapo" is the lemma of "guapas"; "mesa" is the lemma of "mesas". [(ref)](https://es.wikipedia.org/wiki/Lematizaci%C3%B3n)
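As a rough illustration, lemmatization can be thought of as a dictionary lookup from inflected forms to their headword. The `LEMMAS` table and `toy_lemmatize` function below are hypothetical and cover only the examples from the paragraph above; a real lemmatizer such as the WordNet one relies on a full lexical database.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {
    "dije": "decir",
    "diré": "decir",
    "dijéramos": "decir",
    "guapas": "guapo",
    "mesas": "mesa",
}

def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned lowercased and unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)

lemmas = [toy_lemmatize(w) for w in ["Dije", "dijéramos", "guapas", "mesas", "libro"]]
print(lemmas)
```

Unlike a stemmer, the result is always a real dictionary word, because it is looked up rather than produced by cutting off suffixes.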

In [79]:
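lemmatizer = WordNetLemmatizer()
def tokenize_and_lematize(text):
  # Lowercase and tokenize, keep only alphabetic tokens, and lemmatize each one.
  # lemmatize() treats every token as a noun by default (pos='n'),
  # so verb forms such as "coming" are left unchanged.
  tokens = word_tokenize(text.lower())
  lemmas = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]
  return ' '.join(lemmas)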

### Dataframe

In [80]:
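# Load the dataframe. The CSV has no header row, so pandas takes the first record
# as column names (that record is therefore not part of the data); rename those columns.
data = pd.read_csv("twitter_training.csv")
data = data.rename(columns={"2401":"No","Borderlands":"Game","Positive":"Class","im getting on borderlands and i will murder you all ,":"Text"})
data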

| | No | Game | Class | Text |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 1 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 2 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 3 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |
| 4 | 2401 | Borderlands | Positive | im getting into borderlands and i can murder y... |
| ... | ... | ... | ... | ... |
| 74676 | 9200 | Nvidia | Positive | Just realized that the Windows partition of my... |
| 74677 | 9200 | Nvidia | Positive | Just realized that my Mac window partition is ... |
| 74678 | 9200 | Nvidia | Positive | Just realized the windows partition of my Mac ... |
| 74679 | 9200 | Nvidia | Positive | Just realized between the windows partition of... |


In [81]:
# Drop rows that contain NaN; copy the result so that adding columns later
# does not trigger a SettingWithCopyWarning
data = data.dropna().copy()

In [82]:
# Download the NLTK resources, then apply the stemming and lemmatization functions
nltk.download("punkt")
nltk.download("stopwords")
nltk.download('wordnet')

data["text_stem"] = data["Text"].apply(tokenize_and_stem)
data["text_lem"] = data["Text"].apply(tokenize_and_lematize)
data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| | No | Game | Class | Text | text_stem | text_lem |
|---|---|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... | i am come to the border and i will kill you all | i am coming to the border and i will kill you all |
| 1 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... | im get on borderland and i will kill you all | im getting on borderland and i will kill you all |
| 2 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... | im come on borderland and i will murder you all | im coming on borderland and i will murder you all |
| 3 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... | im get on borderland and i will murder you me all | im getting on borderland and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting into borderlands and i can murder y... | im get into borderland and i can murder you all | im getting into borderland and i can murder yo... |
| ... | ... | ... | ... | ... | ... | ... |
| 74676 | 9200 | Nvidia | Positive | Just realized that the Windows partition of my... | just realiz that the window partit of my mac i... | just realized that the window partition of my ... |
| 74677 | 9200 | Nvidia | Positive | Just realized that my Mac window partition is ... | just realiz that my mac window partit is year ... | just realized that my mac window partition is ... |
| 74678 | 9200 | Nvidia | Positive | Just realized the windows partition of my Mac ... | just realiz the window partit of my mac is now... | just realized the window partition of my mac i... |
| 74679 | 9200 | Nvidia | Positive | Just realized between the windows partition of... | just realiz between the window partit of my ma... | just realized between the window partition of ... |


### First models

In [83]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data["text_lem"],data["Class"])

In [84]:
# Transform the text into a bag-of-words representation
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
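For intuition, a bag-of-words matrix can be sketched with the standard library alone. The three-sentence `corpus` is hypothetical, and the sketch splits on whitespace and keeps every token, whereas `CountVectorizer` by default discards single-character tokens such as "i".

```python
from collections import Counter

# Hypothetical mini-corpus; each string plays the role of one tweet.
corpus = ["i will kill you all", "i will murder you all", "you will kill"]

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# One row per document, one column per vocabulary word, holding the word count.
matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is lost in this representation: only the counts per word reach the classifier.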

#### Naive Bayes

In [85]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train_counts,y_train)

In [86]:
from sklearn.metrics import confusion_matrix
X_test_counts = count_vect.transform(X_test)
print(clf.score(X_test_counts,y_test))
# Called as (predicted, true): rows are predicted classes, columns are true classes
confusion_matrix(clf.predict(X_test_counts),y_test)

0.7160927617709065


array([[1853,  177,  199,  165],
       [ 575, 4663,  857,  673],
       [ 215,  318, 2709,  286],
       [ 554,  455,  778, 4022]])

In [87]:
from sklearn.metrics import classification_report
print(classification_report(y_test,clf.predict(X_test_counts)))

              precision    recall  f1-score   support

  Irrelevant       0.77      0.58      0.66      3197
    Negative       0.69      0.83      0.75      5613
     Neutral       0.77      0.60      0.67      4543
    Positive       0.69      0.78      0.73      5146

    accuracy                           0.72     18499
   macro avg       0.73      0.70      0.71     18499
weighted avg       0.72      0.72      0.71     18499
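The Naive Bayes numbers above can be cross-checked by hand from the confusion matrix. The sketch below copies that matrix and assumes the label order shown in the report; because the notebook calls `confusion_matrix` with the predictions first, rows are predicted classes and columns are true classes.

```python
labels = ["Irrelevant", "Negative", "Neutral", "Positive"]

# Confusion matrix copied from the Naive Bayes output (rows: predicted, columns: true).
matrix = [
    [1853, 177, 199, 165],
    [575, 4663, 857, 673],
    [215, 318, 2709, 286],
    [554, 455, 778, 4022],
]

n = len(labels)
total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(n)) / total

# Precision: correct predictions of a class divided by all predictions of that class (row sum).
precision = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(n)}

# Recall: correct predictions of a class divided by all true members of that class (column sum).
recall = {labels[j]: matrix[j][j] / sum(row[j] for row in matrix) for j in range(n)}

print(round(accuracy, 4))
for name in labels:
    print(name, round(precision[name], 2), round(recall[name], 2))
```

The column sums reproduce the `support` column of the report, which is a quick way to confirm which axis holds the true labels.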



#### SVM

In [88]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)
clf.fit(X_train_counts,y_train)

In [89]:
print(clf.score(X_test_counts,y_test))
# Called as (predicted, true): rows are predicted classes, columns are true classes
confusion_matrix(clf.predict(X_test_counts),y_test)

0.6566841450889237


array([[1377,  174,  254,  175],
       [ 620, 4346,  802,  634],
       [ 315,  411, 2471,  383],
       [ 885,  682, 1016, 3954]])

In [90]:
print(classification_report(y_test,clf.predict(X_test_counts)))

              precision    recall  f1-score   support

  Irrelevant       0.70      0.43      0.53      3197
    Negative       0.68      0.77      0.72      5613
     Neutral       0.69      0.54      0.61      4543
    Positive       0.60      0.77      0.68      5146

    accuracy                           0.66     18499
   macro avg       0.67      0.63      0.64     18499
weighted avg       0.66      0.66      0.65     18499



### Testing with n-grams

In [91]:
count_vect = CountVectorizer(ngram_range=(1,4))
X_train_ngram = count_vect.fit_transform(X_train)
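To show what `ngram_range=(1,4)` adds to the vocabulary, here is a simplified sketch that lists every word n-gram of length 1 to 4 in one sentence. The `word_ngrams` helper is hypothetical and splits on whitespace, so it approximates the vectorizer's tokenization but does not reproduce it.

```python
def word_ngrams(text, n_min=1, n_max=4):
    """Return all contiguous word sequences with between n_min and n_max words."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

features = word_ngrams("just realized the window partition")
print(len(features))
print(features)
```

Each n-gram becomes its own column in the count matrix, so short phrases are counted as features in addition to single words.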

#### Naive Bayes

In [92]:
clf = MultinomialNB()
clf.fit(X_train_ngram,y_train)

In [93]:
X_test_ngram = count_vect.transform(X_test)
print(clf.score(X_test_ngram,y_test))
# Called as (predicted, true): rows are predicted classes, columns are true classes
confusion_matrix(clf.predict(X_test_ngram),y_test)

0.9023731012487162


array([[2676,   20,   20,   32],
       [ 238, 5385,  360,  317],
       [  88,  100, 3924,   89],
       [ 195,  108,  239, 4708]])

In [94]:
print(classification_report(y_test,clf.predict(X_test_ngram)))

              precision    recall  f1-score   support

  Irrelevant       0.97      0.84      0.90      3197
    Negative       0.85      0.96      0.90      5613
     Neutral       0.93      0.86      0.90      4543
    Positive       0.90      0.91      0.91      5146

    accuracy                           0.90     18499
   macro avg       0.91      0.89      0.90     18499
weighted avg       0.91      0.90      0.90     18499



#### SVM

In [95]:
clf = SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)
clf.fit(X_train_ngram,y_train)

In [96]:
X_test_ngram = count_vect.transform(X_test)
print(clf.score(X_test_ngram,y_test))
# Called as (predicted, true): rows are predicted classes, columns are true classes
confusion_matrix(clf.predict(X_test_ngram),y_test)

0.8657224714849451


array([[2489,   36,   52,   52],
       [ 222, 5159,  333,  309],
       [ 106,  123, 3703,  121],
       [ 380,  295,  455, 4664]])

In [97]:
print(classification_report(y_test,clf.predict(X_test_ngram)))

              precision    recall  f1-score   support

  Irrelevant       0.95      0.78      0.85      3197
    Negative       0.86      0.92      0.89      5613
     Neutral       0.91      0.82      0.86      4543
    Positive       0.80      0.91      0.85      5146

    accuracy                           0.87     18499
   macro avg       0.88      0.85      0.86     18499
weighted avg       0.87      0.87      0.87     18499



## FastText

FastText is a lightweight, free, open-source library that lets users learn text representations and text classifiers. It runs on standard, generic hardware. Models can later be reduced in size so that they fit even on mobile devices.


### Word Embedding

A word embedding is a vector representation of words in a continuous, low-dimensional space. This representation lets words with similar meanings have similar representations in the vector space.
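Similarity between embeddings is commonly measured with cosine similarity. The sketch below uses three hypothetical 3-dimensional vectors chosen by hand for illustration; real embeddings are learned from data and usually have tens to hundreds of dimensions.

```python
import math

# Hypothetical embeddings: "good" and "great" point in a similar direction, "table" does not.
vectors = {
    "good": [0.8, 0.1, 0.3],
    "great": [0.7, 0.2, 0.4],
    "table": [-0.2, 0.9, -0.5],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means a similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(vectors["good"], vectors["great"])
sim_far = cosine_similarity(vectors["good"], vectors["table"])
print(sim_close, sim_far)
```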

#### How FastText classifies
1. Inputs and embeddings:

* For each document (line of text), FastText converts the words and subwords into their corresponding embeddings. If n-grams are used, they are also converted into embeddings.

2. Averaging the embeddings:

* FastText averages the embeddings of the words (and n-grams) in the document to obtain a fixed-size vector representation of the document.

3. Multinomial logistic regression:

* The document vector is passed through a softmax layer that computes the probability of each class. The class with the highest probability is predicted as the document's label.
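The three steps above can be sketched in a few lines of plain Python. Everything here is a simplification under stated assumptions: the 2-dimensional `embeddings`, the `weights` of the output layer, and the two labels are hypothetical values picked by hand, whereas FastText learns them during training and also uses subword and n-gram embeddings.

```python
import math

# Hypothetical word embeddings (step 1); a trained model learns these values.
embeddings = {
    "this": [0.1, 0.0],
    "is": [0.0, 0.1],
    "awful": [-0.9, 0.8],
    "great": [0.9, -0.7],
}

# Hypothetical output layer: one weight row per class.
labels = ["Negative", "Positive"]
weights = [[-2.0, 2.0], [2.0, -2.0]]

def document_vector(tokens):
    """Average the embeddings of the known tokens (step 2)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(text):
    """Score each class with a linear layer, apply softmax, pick the best class (step 3)."""
    doc = document_vector(text.lower().split())
    scores = [sum(w * x for w, x in zip(row, doc)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs

label, probs = classify("This is awful")
print(label, probs)
```

Because the document vector is a plain average, its size does not depend on the length of the text, which is what lets a single linear layer handle tweets of any length.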

In [99]:
# Prepare the files in fastText format: "__label__<class> <text>", one example per line
X_train, X_test,_, y_test = train_test_split(data,data["Class"])

with open("data_fst.txt", 'w', encoding='utf-8') as archivo:
  for label, text in zip(X_train["Class"], X_train["Text"]):
    archivo.write(f'__label__{label} {text}\n')

with open("data_fst_t.txt", 'w', encoding='utf-8') as archivo:
  for label, text in zip(X_test["Class"], X_test["Text"]):
    archivo.write(f'__label__{label} {text}\n')

In [100]:
import fasttext
# Training
modelo = fasttext.train_supervised(input="data_fst.txt", epoch=40, lr=0.1, wordNgrams=5, verbose=2, minCount=1)

In [101]:
modelo.predict("This is awful")

(('__label__Negative',), array([0.99979228]))

In [102]:
resultados = modelo.test("data_fst_t.txt")
resultados

(18499, 0.8853451537920969, 0.8853451537920969)

In [103]:
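def get_data():
  # Predict the label of every test text and strip the 9-character "__label__" prefix
  predicted_labels = []
  for i in range(len(X_test)):
    prediction = modelo.predict(X_test.iloc[i,3])
    predicted_labels.append(prediction[0][0][9:])
  return predicted_labels

pred = get_data()
# Called as (predicted, true): rows are predicted classes, columns are true classes
confusion_matrix(pred,y_test)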

array([[2681,   68,   94,   77],
       [ 167, 5102,  197,  245],
       [ 181,  219, 3964,  237],
       [ 199,  211,  226, 4631]])

In [104]:
from sklearn.metrics import classification_report
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

  Irrelevant       0.92      0.83      0.87      3228
    Negative       0.89      0.91      0.90      5600
     Neutral       0.86      0.88      0.87      4481
    Positive       0.88      0.89      0.89      5190

    accuracy                           0.89     18499
   macro avg       0.89      0.88      0.88     18499
weighted avg       0.89      0.89      0.89     18499

