# Introduction

When working with text, there are many ways to represent the information.

<img src=https://miro.medium.com/max/904/1*DocMTV7nTAomKxcu3m-tyw.jpeg>

# 1. One-Hot Encoding

First, we introduce the concept of **Bag-of-Words**.

It is perhaps the simplest way to represent text: each document becomes a vector. The steps are as follows:

1. Define a **vocabulary** (it can be extracted from the corpus).
2. Assign an integer index to each word, so that every vector has a length equal to the number of words (cardinality) of the vocabulary. **Each position in the vector represents one word of the vocabulary**.
3. For each document, fill in the position corresponding to each of its words with a value. That value can indicate whether the word appears (**Term Presence**) or how many times it appears (**Term Frequency**).
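The three steps above can be sketched by hand in plain Python. This is a minimal illustration, not how any library implements it: the two toy sentences, the whitespace tokenization, and the helper name `vectorize` are assumptions made for the example.

```python
from collections import Counter

# Toy corpus (illustrative sentences only)
docs = ["me gustan los perros", "hay perros y perros"]

# Step 1: build the vocabulary from the corpus
vocabulary = sorted({token for doc in docs for token in doc.split()})

# Step 2: assign an integer index (vector position) to each word
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def vectorize(doc, binary=False):
    """Return the Term Frequency vector of `doc` (Term Presence if binary=True)."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(doc.split()).items():
        vector[word_to_index[token]] = 1 if binary else count
    return vector


# Step 3: one vector per document
tf_vectors = [vectorize(doc) for doc in docs]
tp_vectors = [vectorize(doc, binary=True) for doc in docs]

print(vocabulary)
print(tf_vectors)  # "perros" counts 2 in the second document
print(tp_vectors)  # with Term Presence it is capped at 1
```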

In its simplest form, **one-hot encoding**, the encoding is done at the token level. A document is then described by N vectors (one per token), each with a 1 at the position of its token in the vocabulary and 0 everywhere else (Term Presence).
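As a small sketch of token-level encoding, the snippet below turns one sentence into one vector per token. The four-word vocabulary and the helper name `one_hot` are hypothetical, chosen only to keep the output readable.

```python
# Hypothetical vocabulary and document, for illustration only
vocabulary = ["gustan", "los", "me", "perros"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(token):
    """Vector of vocabulary length with a single 1 at the token's position."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[token]] = 1
    return vector


document = "me gustan los perros"
encoded = [one_hot(token) for token in document.split()]

# One row per token, in the order the tokens appear in the document
for token, vector in zip(document.split(), encoded):
    print(f"{token:>7} -> {vector}")
```

Unlike a bag-of-words vector, this list of vectors still preserves token order, at the cost of one full-length vector per token.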

<img src=https://miro.medium.com/max/1800/1*ArM6Z5jeptCQ082DYn9nDQ.png width=600px>

# 2. Count Vectorizer

Converts a collection of documents into a document-term matrix. The encoding is therefore done at the document level rather than at the token level.

Because it is a bag-of-words model, **it does not encode the position of the tokens or their context, only whether they appear and how often**.
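To make the loss of word order concrete, here is a short sketch that uses `collections.Counter` as a stand-in for a bag of words (an assumption for brevity; it is not `CountVectorizer` itself). The two sentences are invented for the example and mean opposite things.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
a = "el perro muerde al hombre"  # "the dog bites the man"
b = "el hombre muerde al perro"  # "the man bites the dog"

bag_a = Counter(a.split())
bag_b = Counter(b.split())

print(a == b)          # the sentences differ
print(bag_a == bag_b)  # but their bags of words are identical
```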

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [1]:

from sklearn.feature_extraction.text import CountVectorizer


In [2]:
sent_1 = 'me gustan los perros'
sent_2 = 'hay perros y perros'
sent_3 = 'hay muchas razas de perros'

In [3]:
corpus = [sent_1, sent_2, sent_3]

### Basic example

In [4]:
vectorizer = CountVectorizer()

In [5]:
X = vectorizer.fit_transform(corpus)

In [6]:

print(vectorizer.get_feature_names_out())



['de' 'gustan' 'hay' 'los' 'me' 'muchas' 'perros' 'razas']


In [7]:
import pandas as pd
doc_term_matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
doc_term_matrix

Unnamed: 0,de,gustan,hay,los,me,muchas,perros,razas
0,0,1,0,1,1,0,1,0
1,0,0,1,0,0,0,2,0
2,1,0,1,0,0,1,1,1


### Stop words

The `stop_words` parameter accepts:
- 'english'
- a list of stop words
- None (default): no stop words are filtered

In [8]:
vectorizer = CountVectorizer(stop_words=['de', 'hay', 'los', 'me'])
X = vectorizer.fit_transform(corpus)

In [9]:
print(vectorizer.get_feature_names_out())

['gustan' 'muchas' 'perros' 'razas']


In [10]:
doc_term_matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
doc_term_matrix

Unnamed: 0,gustan,muchas,perros,razas
0,1,0,1,0
1,0,0,2,0
2,0,1,1,1


### Maximum number of words

The `max_features` parameter sets the maximum number of features (vocabulary size) to extract. Only the top `max_features` terms, ordered by term frequency across the corpus, are kept.

In [11]:
vectorizer = CountVectorizer(max_features=4)
X = vectorizer.fit_transform(corpus)

In [12]:
print(vectorizer.get_feature_names_out())

['de' 'gustan' 'hay' 'perros']


In [13]:
doc_term_matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
doc_term_matrix

Unnamed: 0,de,gustan,hay,perros
0,0,1,0,1
1,0,0,1,2
2,1,0,1,1


### N-grams as features

The `ngram_range` parameter (a tuple) sets the minimum and maximum values of `n` for the n-grams to extract. The default is `ngram_range=(1, 1)` (single words only).

In [14]:
vectorizer = CountVectorizer(ngram_range=(2, 3))  # Try different values
X = vectorizer.fit_transform(corpus)

In [15]:
print(vectorizer.get_feature_names_out())

['de perros' 'gustan los' 'gustan los perros' 'hay muchas'
 'hay muchas razas' 'hay perros' 'hay perros perros' 'los perros'
 'me gustan' 'me gustan los' 'muchas razas' 'muchas razas de'
 'perros perros' 'razas de' 'razas de perros']


In [16]:
doc_term_matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
doc_term_matrix

Unnamed: 0,de perros,gustan los,gustan los perros,hay muchas,hay muchas razas,hay perros,hay perros perros,los perros,me gustan,me gustan los,muchas razas,muchas razas de,perros perros,razas de,razas de perros
0,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0
1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0
2,1,0,0,1,1,0,0,0,0,0,1,1,0,1,1


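### max_df and min_df

Upper (`max_df`) and lower (`min_df`) bounds on the document frequency of a term: terms that appear in more documents than `max_df`, or in fewer than `min_df`, are dropped from the vocabulary. Each bound can be given as a `float` (from 0.0 to 1.0) or as an `int`:
- `float`: maximum / minimum proportion of documents that contain the term
- `int`: maximum / minimum number of documents that contain the term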

In [17]:
vectorizer = CountVectorizer(max_df=0.95, min_df=1)  # Try different values
X = vectorizer.fit_transform(corpus)

In [18]:
print(vectorizer.get_feature_names_out())

['de' 'gustan' 'hay' 'los' 'me' 'muchas' 'razas']


In [19]:
doc_term_matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
doc_term_matrix

Unnamed: 0,de,gustan,hay,los,me,muchas,razas
0,0,1,0,1,1,0,0
1,0,0,1,0,0,0,0
2,1,0,1,0,0,1,1
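The reason `perros` disappears from the vocabulary above can be checked by computing document frequencies by hand. This is a rough sketch: splitting on whitespace is a simplification of `CountVectorizer`'s tokenizer, which by default also discards single-character tokens such as `y`.

```python
corpus = [
    "me gustan los perros",
    "hay perros y perros",
    "hay muchas razas de perros",
]

# Count in how many documents each token appears (not how many times)
n_docs = len(corpus)
doc_count = {}
for doc in corpus:
    for token in set(doc.split()):
        doc_count[token] = doc_count.get(token, 0) + 1

# Document frequency as a proportion of the corpus
doc_freq = {token: count / n_docs for token, count in doc_count.items()}

# Terms whose document frequency is strictly above max_df are dropped
max_df = 0.95
dropped = sorted(t for t, df in doc_freq.items() if df > max_df)
print(dropped)
```

Only `perros` appears in all three documents (document frequency 1.0), so it is the only term above the 0.95 threshold.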


### TF-IDF Vectorizer

TF-IDF (Term Frequency - Inverse Document Frequency) is a feature-weighting measure that expresses **how relevant a word is to a document** that is part of a corpus.

It takes into account the number of times the word (or token) appears in that document, but also the number of documents in the corpus that contain it.

- **Tokens that are very frequent both within the document and across the corpus** (possible stop words) get a **low TF-IDF** value.
- Tokens that appear **only in certain documents of the corpus** have a **higher IDF** than those that appear in a larger number of documents.

<img src=https://3.bp.blogspot.com/-u928a3xbrsw/UukmRVX_JzI/AAAAAAAAAKE/wIhuNmdQb7E/s1600/td-idf-graphic.png width=700px>

In a simple Information Retrieval system, the document-ranking module can be built by scoring each document as the sum of the TF-IDF values of the query words it contains.
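A toy version of such a ranking module might look like the sketch below. Several things here are assumptions rather than part of the original material: the score is summed over the terms of a query, the weight is the classic unsmoothed `tf * ln(N / df)` (not scikit-learn's variant), and the helper names `tf_idf` and `rank` are made up for the example.

```python
import math
from collections import Counter

docs = [
    "me gustan los perros",
    "hay perros y perros",
    "hay muchas razas de perros",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(term, tokens):
    """Classic tf * ln(N / df); an unsmoothed variant chosen for brevity."""
    if term not in doc_freq:
        return 0.0
    return tokens.count(term) * math.log(n_docs / doc_freq[term])


def rank(query):
    """Return document indices sorted by the summed TF-IDF of the query terms."""
    scores = [
        sum(tf_idf(term, tokens) for term in query.split())
        for tokens in tokenized
    ]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


ranking = rank("razas de perros")
print(ranking)  # index of the best-matching document comes first
```

Because `perros` occurs in every document, its IDF is zero in this variant and it does not influence the ranking; the rarer terms `razas` and `de` decide the result.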

In [20]:
from sklearn.feature_extraction.text import TfidfTransformer

In [21]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

In [22]:
transformer = TfidfTransformer()
tf_idf = transformer.fit_transform(X)

In [23]:
vectorizer.get_feature_names_out()

array(['de', 'gustan', 'hay', 'los', 'me', 'muchas', 'perros', 'razas'],
      dtype=object)

In [24]:
doc_term_matrix = pd.DataFrame(tf_idf.toarray(), columns=vectorizer.get_feature_names_out())
doc_term_matrix

Unnamed: 0,de,gustan,hay,los,me,muchas,perros,razas
0,0.0,0.546454,0.0,0.546454,0.546454,0.0,0.322745,0.0
1,0.0,0.0,0.541343,0.0,0.0,0.0,0.840802,0.0
2,0.504611,0.0,0.38377,0.0,0.0,0.504611,0.298032,0.504611
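The first row of the table above can be reproduced by hand, assuming the `TfidfTransformer` defaults (`smooth_idf=True`, `norm='l2'`): the IDF is `ln((1 + n) / (1 + df)) + 1`, and each row is then divided by its Euclidean norm. The document frequencies below are read off the count matrix computed earlier.

```python
import math

# Row 0 of the count matrix: "me gustan los perros"
# Each term appears once; document frequencies come from the 3-document corpus.
n_docs = 3
doc_freq = {"gustan": 1, "los": 1, "me": 1, "perros": 3}
counts = {"gustan": 1, "los": 1, "me": 1, "perros": 1}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Raw TF-IDF, then L2 normalisation of the row
raw = {t: counts[t] * idf[t] for t in counts}
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf_row = {t: v / norm for t, v in raw.items()}

for term, value in tfidf_row.items():
    print(f"{term}: {value:.6f}")
```

The rare terms come out at roughly 0.546 and `perros` at roughly 0.323, in line with the first row of the table.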


### TF-IDF Vectorizer (direct approach)

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [26]:
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(corpus)

In [27]:
doc_term_matrix = pd.DataFrame(tf_idf.toarray(), columns=vectorizer.get_feature_names_out())
doc_term_matrix

Unnamed: 0,de,gustan,hay,los,me,muchas,perros,razas
0,0.0,0.546454,0.0,0.546454,0.546454,0.0,0.322745,0.0
1,0.0,0.0,0.541343,0.0,0.0,0.0,0.840802,0.0
2,0.504611,0.0,0.38377,0.0,0.0,0.504611,0.298032,0.504611


## Example: Spam Detection



#### Reading the data

In [28]:
import pandas as pd
import numpy as np

In [29]:
df = pd.read_csv('./spam.csv', encoding='latin-1')

df = df[['v1', 'v2']]
df.rename(columns={'v1': 'label', 'v2': 'sms'}, inplace=True)

In [30]:
df.head()

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [31]:
df.shape

(5572, 2)

In [32]:
df['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
ham,4825
spam,747


#### Preprocessing

In [33]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

In [34]:
df.head()

Unnamed: 0,label,sms
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [35]:
df['sms'] = df['sms'].str.lower()

In [36]:
df.head()

Unnamed: 0,label,sms
0,0,"go until jurong point, crazy.. available only ..."
1,0,ok lar... joking wif u oni...
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor... u c already then say...
4,0,"nah i don't think he goes to usf, he lives aro..."


#### Train / Test set

In [37]:
import numpy as np
msk = np.random.rand(len(df)) < 0.75

In [38]:
df_train = df[msk]
df_test = df[~msk]

In [39]:
df_train.shape

(4173, 2)

In [40]:
df_test.shape

(1399, 2)
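The split above draws an unseeded random mask, so the exact train and test sizes change from run to run. A possible way to make it reproducible is to use a seeded NumPy generator, as sketched here. The seed value `42` is an arbitrary choice, and the resulting sizes will not match the ones printed above.

```python
import numpy as np

# Seeded generator so the same rows land in train/test on every run.
# The seed (42) is an arbitrary, illustrative choice.
rng = np.random.default_rng(42)
n_rows = 5572  # number of rows reported by df.shape

# True -> row goes to train (about 75%), False -> row goes to test
mask = rng.random(n_rows) < 0.75

print(mask.sum(), (~mask).sum())
```

With a DataFrame `df` of that length, the mask would be applied exactly as before: `df[mask]` for train and `df[~mask]` for test.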

#### Features

In [41]:
# CountVectorizer simple
cv_simple = CountVectorizer()
X_train_cv_simple = cv_simple.fit_transform(df_train['sms'])
X_test_cv_simple = cv_simple.transform(df_test['sms'])

# CountVectorizer with ngrams, max_features, min_df and max_df
cv_complex = CountVectorizer(ngram_range=(1, 2), max_features=1000, max_df=0.95, min_df=5)
X_train_cv_complex = cv_complex.fit_transform(df_train['sms'])
X_test_cv_complex = cv_complex.transform(df_test['sms'])

# TfIdfVectorizer simple
tfidf_simple = TfidfVectorizer()
X_train_tfidf_simple = tfidf_simple.fit_transform(df_train['sms'])
X_test_tfidf_simple = tfidf_simple.transform(df_test['sms'])

# TfIdfVectorizer complex
tfidf_complex = TfidfVectorizer(ngram_range=(1, 2), max_features=1000, max_df=0.95, min_df=5)
X_train_tfidf_complex = tfidf_complex.fit_transform(df_train['sms'])
X_test_tfidf_complex = tfidf_complex.transform(df_test['sms'])

#### Binary classification model

In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

In [43]:
lr_cv_simple = LogisticRegression()
lr_cv_complex = LogisticRegression()
lr_tfidf_simple = LogisticRegression()
lr_tfidf_complex = LogisticRegression()

In [44]:
lr_cv_simple.fit(X_train_cv_simple, df_train['label'])  # train
y_pred_cv_simple = lr_cv_simple.predict(X_test_cv_simple)  # test

In [45]:
lr_cv_complex.fit(X_train_cv_complex, df_train['label'])  # train
y_pred_cv_complex = lr_cv_complex.predict(X_test_cv_complex)  # test

In [46]:
lr_tfidf_simple.fit(X_train_tfidf_simple, df_train['label'])  # train
y_pred_tfidf_simple = lr_tfidf_simple.predict(X_test_tfidf_simple)  # test

In [47]:
lr_tfidf_complex.fit(X_train_tfidf_complex, df_train['label'])  # train
y_pred_tfidf_complex = lr_tfidf_complex.predict(X_test_tfidf_complex)  # test

In [48]:
print('CountVectorizer simple\n')
print(confusion_matrix(df_test['label'], y_pred_cv_simple))
print(classification_report(df_test['label'], y_pred_cv_simple))

CountVectorizer simple

[[1205    6]
 [  21  167]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1211
           1       0.97      0.89      0.93       188

    accuracy                           0.98      1399
   macro avg       0.97      0.94      0.96      1399
weighted avg       0.98      0.98      0.98      1399
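The figures for class `1` (spam) in the report above follow directly from the confusion matrix. The sketch below recomputes them by hand, relying on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
tn, fp = 1205, 6   # true ham: correctly kept / wrongly flagged as spam
fn, tp = 21, 167   # true spam: missed / correctly flagged

precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)     # of all real spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall is the weakest of the three here: 21 of the 188 spam messages in the test set were classified as ham.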



In [49]:
print('CountVectorizer complex\n')
print(confusion_matrix(df_test['label'], y_pred_cv_complex))
print(classification_report(df_test['label'], y_pred_cv_complex))

CountVectorizer complex

[[1206    5]
 [  19  169]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1211
           1       0.97      0.90      0.93       188

    accuracy                           0.98      1399
   macro avg       0.98      0.95      0.96      1399
weighted avg       0.98      0.98      0.98      1399



In [50]:
print('TfIdfVectorizer simple\n')
print(confusion_matrix(df_test['label'], y_pred_tfidf_simple))
print(classification_report(df_test['label'], y_pred_tfidf_simple))

TfIdfVectorizer simple

[[1208    3]
 [  44  144]]
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      1211
           1       0.98      0.77      0.86       188

    accuracy                           0.97      1399
   macro avg       0.97      0.88      0.92      1399
weighted avg       0.97      0.97      0.96      1399



In [51]:
print('TfIdfVectorizer complex\n')
print(confusion_matrix(df_test['label'], y_pred_tfidf_complex))
print(classification_report(df_test['label'], y_pred_tfidf_complex))

TfIdfVectorizer complex

[[1209    2]
 [  35  153]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      1211
           1       0.99      0.81      0.89       188

    accuracy                           0.97      1399
   macro avg       0.98      0.91      0.94      1399
weighted avg       0.97      0.97      0.97      1399



# 3. Word Embeddings

They encode the semantic information of tokens based on the context (preceding and following tokens) in which they appear.

Each word is represented by a vector carrying that semantic information. Vector operations, together with the notion of distance, let us find tokens that are semantically similar or different.
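As a rough illustration of the distance idea, the sketch below compares words with cosine similarity. The three-dimensional vectors are invented for the example; real embeddings are learned from data and have many more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only
embeddings = {
    "perro": [0.9, 0.8, 0.1],
    "gato": [0.8, 0.9, 0.2],
    "coche": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_close = cosine_similarity(embeddings["perro"], embeddings["gato"])
sim_far = cosine_similarity(embeddings["perro"], embeddings["coche"])

# With these toy vectors, "perro" is closer to "gato" than to "coche"
print(sim_close > sim_far)
```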

We will look at this in more detail in the next session.

<img src=https://blog.enzymeadvisinggroup.com/hs-fs/hubfs/Word%20Embeddings%20en%20el%20Natural%20Language%20Processing.png>