<a href="https://colab.research.google.com/github/gmauricio-toledo/NLP-LCC/blob/main/Notebooks/06-Vectores_sem%C3%A1nticos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Vector Semantics</h1>

In this notebook we use two vector-semantics models for several NLP tasks. The models are:

* Bag of Words (BoW). [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
* Term Frequency - Inverse Document Frequency (TF-IDF). [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

With these two models we will carry out tasks such as:

* Nearest neighbors
* Information Retrieval
* Clustering
* Classification


In [24]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import numpy as np
import re
import matplotlib.pyplot as plt
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

# Corpus 0: Spam Detection

## Reading the corpus

In [23]:
!gdown 1-pMLSTkJ3ZPCKQU8oXA3uI-swvXw7DmJ

Downloading...
From: https://drive.google.com/uc?id=1-pMLSTkJ3ZPCKQU8oXA3uI-swvXw7DmJ
To: /content/Spam_SMS.csv
  0% 0.00/487k [00:00<?, ?B/s]100% 487k/487k [00:00<00:00, 27.6MB/s]


In [25]:
import pandas as pd

df = pd.read_csv('Spam_SMS.csv')
df

Unnamed: 0,Class,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...
5570,ham,Will ü b going to esplanade fr home?
5571,ham,"Pity, * was in mood for that. So...any other s..."
5572,ham,The guy did some bitching but I acted like i'd...


In [26]:
for idx in df.sample(10).index.to_list():
    print(df.loc[idx,'Message'])

Why must we sit around and wait for summer days to celebrate. Such a magical sight when the worlds dressed in white. Oooooh let there be snow.
I guess it is useless calling u 4 something important.
Well, I was about to give up cos they all said no they didn‘t do one nighters. I persevered and found one but it is very cheap so i apologise in advance. It is just somewhere to sleep isnt it?
How come it takes so little time for a child who is afraid of the dark to become a teenager who wants to stay out all night?
If india win or level series means this is record:)
Yes. Please leave at  &lt;#&gt; . So that at  &lt;#&gt;  we can leave
I'm leaving my house now...
:)
URGENT! Your Mobile No was awarded a £2,000 Bonus Caller Prize on 1/08/03! This is our 2nd attempt to contact YOU! Call 0871-4719-523 BOX95QU BT National Rate
Was it something u ate?


## Text cleaning

In [27]:
import re

def limpiar_texto(texto):
    texto = texto.lower()
    texto = re.sub(r'[^\w\s]', '', texto)
    texto = re.sub(r'\d+', ' ', texto)
    return texto

df['texto limpio'] = df['Message'].apply(limpiar_texto)
df

Unnamed: 0,Class,Message,texto limpio
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...
...,...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...,this is the nd time we have tried contact u...
5570,ham,Will ü b going to esplanade fr home?,will ü b going to esplanade fr home
5571,ham,"Pity, * was in mood for that. So...any other s...",pity was in mood for that soany other suggest...
5572,ham,The guy did some bitching but I acted like i'd...,the guy did some bitching but i acted like id ...
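
To make the three cleaning steps concrete, here is a small sketch that applies the same function to a single made-up message. The `sample` string is an assumption for illustration; the function body repeats the one defined above so the block runs on its own.

```python
import re

def limpiar_texto(texto):
    # Same three steps as the cleaning cell above.
    texto = texto.lower()                   # 1. lowercase
    texto = re.sub(r'[^\w\s]', '', texto)   # 2. drop everything that is not a word character or whitespace
    texto = re.sub(r'\d+', ' ', texto)      # 3. replace each run of digits with a single space
    return texto

# Hypothetical message, for illustration only.
sample = "Don't miss it: WIN £2,000 now!"
cleaned = limpiar_texto(sample)
print(repr(cleaned))
```

Two side effects are worth noticing: contractions are fused ("don't" becomes "dont"), and removing numbers leaves extra spaces behind. The vectorizers used later split on word boundaries, so the extra spaces should not affect the resulting tokens.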


In [28]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['Class'].values)
print(y[:5])

docs = df['texto limpio'].values

[0 0 1 0 0]


## Train/test split

In [29]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(docs, y,
                                                    test_size=0.2,
                                                    random_state=12)
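
The classes in this corpus are imbalanced (far more ham than spam), so one optional variant is a stratified split, which keeps the class proportions equal in train and test. The sketch below uses made-up labels (`y_toy`, `X_toy` are hypothetical stand-ins for `y` and `docs`) and only illustrates the `stratify` argument of `train_test_split`; it is not what the notebook runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (13% positives), standing in for `y`.
y_toy = np.array([1] * 130 + [0] * 870)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=12,
    stratify=y_toy,   # keep the positive rate the same in both parts
)

print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test:  {y_te.mean():.3f}")
```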

## Feature extraction (vectorization)

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# Use a separate name so the `stopwords` corpus reader is not overwritten
# (otherwise re-running this cell fails).
english_stopwords = stopwords.words('english')

cv = CountVectorizer(stop_words=english_stopwords)
cv.fit(X_train)

# `.toarray()` returns a dense ndarray directly.
X_train_bow = cv.transform(X_train).toarray()
X_test_bow = cv.transform(X_test).toarray()

Note the dimensions of the BoW matrices.

In [34]:
X_train_bow.shape, X_test_bow.shape

((4459, 7355), (1115, 7355))

Let's look at the proportion of zeros.

In [35]:
number_of_zero_entries = np.count_nonzero(X_train_bow == 0)
number_of_entries = X_train_bow.shape[0] * X_train_bow.shape[1]

print(f"Proportion of zero entries: {number_of_zero_entries/number_of_entries}")

Proportion of zero entries: 0.9988902896379416
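
With this many zeros, the same figure can be obtained from the sparse matrix without converting it to a dense array. The sketch below uses a small hand-made matrix (`X_sparse_toy` is a hypothetical stand-in for the output of `cv.transform(...)`) and SciPy's `count_nonzero` method.

```python
import numpy as np
from scipy import sparse

# Hypothetical small count matrix standing in for `cv.transform(X_train)`.
X_sparse_toy = sparse.csr_matrix(np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
]))

n_entries = X_sparse_toy.shape[0] * X_sparse_toy.shape[1]
n_nonzero = X_sparse_toy.count_nonzero()   # only the stored non-zero values are inspected
zero_fraction = 1 - n_nonzero / n_entries

print(f"Zero fraction: {zero_fraction:.4f}")
```

This avoids allocating the full dense matrix, which matters once the vocabulary or the number of documents grows.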


Interpretability of the features

In [None]:
cv.vocabulary_

## Training and inference

In [37]:
from sklearn.svm import SVC

svm = SVC(kernel='linear')
svm.fit(X_train_bow, y_train)

y_pred_train = svm.predict(X_train_bow)
y_pred_test = svm.predict(X_test_bow)

## Evaluation

In [40]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print(f"Accuracy train: {accuracy_score(y_train, y_pred_train)}")
print(f"Accuracy test: {accuracy_score(y_test, y_pred_test)}")
print(f"F1 train: {f1_score(y_train, y_pred_train)}")
print(f"F1 test: {f1_score(y_test, y_pred_test)}")
print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred_test)}")

Accuracy train: 0.9993272034088361
Accuracy test: 0.9838565022421525
F1 train: 0.9975103734439834
F1 test: 0.9343065693430657
Confusion matrix:
[[969   3]
 [ 15 128]]
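
The test metrics can be recovered by hand from the confusion matrix printed above, which is a useful check on how accuracy and F1 relate. The sketch below hard-codes that matrix and assumes scikit-learn's convention (rows are true classes, columns are predicted classes, with spam as class 1).

```python
import numpy as np

# Test-set confusion matrix of the linear SVM (rows: true class, columns: predicted class).
cm = np.array([[969, 3],
               [15, 128]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)     # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)        # of the spam messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Accuracy is dominated by the majority class (ham), whereas F1 reflects the 15 spam messages that were missed; this is why F1 is noticeably lower than accuracy here.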


## About interpretability

In [44]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

dt = DecisionTreeClassifier(max_depth=20)
dt.fit(X_train_bow, y_train)

y_pred_train = dt.predict(X_train_bow)
y_pred_test = dt.predict(X_test_bow)

print(f"Accuracy train: {accuracy_score(y_train, y_pred_train)}")
print(f"Accuracy test: {accuracy_score(y_test, y_pred_test)}")
print(f"F1 train: {f1_score(y_train, y_pred_train)}")
print(f"F1 test: {f1_score(y_test, y_pred_test)}")
print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred_test)}")

Accuracy train: 0.9854227405247813
Accuracy test: 0.9704035874439462
F1 train: 0.9431321084864392
F1 test: 0.8764044943820225
Confusion matrix:
[[965   7]
 [ 26 117]]


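We get the words with the highest importance in the decision tree. These impurity-based importances show which words the tree relies on most to separate spam from ham, but not which class each word points to.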

In [45]:
most_important_words_idxs = np.argsort(dt.feature_importances_)[::-1]
# Fetch the vocabulary once instead of once per index.
feature_names = cv.get_feature_names_out()
most_important_words = feature_names[most_important_words_idxs].tolist()

In [46]:
most_important_words[:10]

['call',
 'txt',
 'free',
 'reply',
 'text',
 'ill',
 'claim',
 'pmsg',
 'im',
 'mobile']

# Corpus 1: Wikipedia

In [None]:
from nltk import word_tokenize

nltk.download('punkt_tab')
nltk.download('stopwords')

Now let's try another corpus: part of a 2006 Wikipedia dump ([details](https://www.cs.upc.edu/~nlp/wikicorpus/)).

In [None]:
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/gmauricio-toledo/NLP-MCD/main/data/spanish-wikipedia-dataframe.csv"
df = pd.read_csv(url,index_col=0)
df

We preprocess and clean the text.

⭕ What are we doing to the text?

In [None]:
docs_raw = df['Texto'].tolist()
# Build the stopword set once; the original rebuilt the list for every token.
spanish_stopwords = set(nltk.corpus.stopwords.words('spanish'))
docs = [re.sub(r'\d+', ' ', doc) for doc in docs_raw]
tokenized_docs = [word_tokenize(doc) for doc in docs]
docs = [[token for token in doc if token not in spanish_stopwords] for doc in tokenized_docs]
docs = [' '.join(doc) for doc in docs]
docs[:3]

## BoW model

Note how we pass the list of Spanish stopwords.

In [None]:
stop_words = nltk.corpus.stopwords.words('spanish')

cv = CountVectorizer(stop_words=stop_words, max_features=1000)
X_bow = cv.fit_transform(docs)
X_bow.shape

In [None]:
X_bow[:3,:7].todense()

In [None]:
X_bow[140:143,750:756].todense()

How *sparse* is the matrix?

In [None]:
total_entradas = X_bow.shape[0] * X_bow.shape[1]
# Count zeros from the sparse matrix itself instead of densifying it several times.
num_ceros = total_entradas - X_bow.count_nonzero()

print(f"Number of entries: {total_entradas}")
print(f"Percentage of zero entries: {round(100*num_ceros/total_entradas,2)} %")

In [None]:
vocabulary = cv.get_feature_names_out()

## Document vectors

Document representations

In [None]:
doc_vectors = X_bow.toarray()

Let's inspect the nearest neighbors of some documents.

In [None]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, metric='cosine')
nn.fit(doc_vectors)

In [None]:
doc_number = 189
# doc_number = np.random.randint(0, len(docs_raw))

print(f"Query:\n\t{docs_raw[doc_number]}\n")

v_doc = doc_vectors[doc_number,:].reshape(-1,)
nns = nn.kneighbors([v_doc])
print(f"Nearest neighbors: {[idx for idx in nns[1][0]]}\n")

for idx,dist in zip(nns[1][0],nns[0][0]):
    print(f"Distance: {round(dist,3)}")
    print(f"{docs_raw[idx]}\n")
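
The distances reported by `NearestNeighbors` with `metric='cosine'` are cosine distances, that is, one minus the cosine of the angle between two vectors. The following sketch checks this by hand on three made-up count vectors (`toy_vectors` and `cosine_distance` are hypothetical names used only here).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical count vectors for three tiny documents.
toy_vectors = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],   # same direction as the first row, twice as long
    [0.0, 3.0, 0.0],   # shares no terms with the first row
])

def cosine_distance(u, v):
    # 1 - cos(angle between u and v)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

toy_nn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(toy_vectors)
dists, idxs = toy_nn.kneighbors(toy_vectors[[0]])

# Recompute the same distances by hand, in the order returned by kneighbors.
manual = [cosine_distance(toy_vectors[0], toy_vectors[i]) for i in idxs[0]]
print(np.round(dists[0], 3), np.round(manual, 3))
```

Because only the direction matters, a document that repeats the same words twice as often is at distance 0, while a document with no terms in common is at distance 1. The document used as the query is part of the index, so it shows up among its own neighbors at distance 0.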

### Information Retrieval

Let's inspect the nearest neighbors of a query.

In [None]:
query = "sello discográfico de artistas de pop"

query_vector = cv.transform([query]).toarray().reshape(-1,)
print(query_vector)

responses = nn.kneighbors([query_vector])
for idx,dist in zip(responses[1][0],responses[0][0]):
    print(f"Distance: {round(dist,3)}")
    print(f"{docs_raw[idx]}\n")

In [None]:
#@title Plot the 3D t-SNE dimensionality reduction

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go

tsne = TSNE(n_components=3, metric='cosine')
X_tsne = tsne.fit_transform(doc_vectors)

plotly.offline.init_notebook_mode()

trace = go.Scatter3d(
    x=X_tsne[:,0],
    y=X_tsne[:,1],
    z=X_tsne[:,2],
    mode='markers',
    marker={
        'size': 3,
        'opacity': 0.75,
        'color': 'black'
    },
    hovertemplate='%{text}<extra></extra>',
    text = [f"{docs_raw[j][:75]}" for j in range(X_tsne.shape[0])]
)

layout = go.Layout(
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0}
)

data = [trace]

plot_figure = go.Figure(data=data, layout=layout)

plot_figure.update_layout(
    title = 'Wikipedia Docs',
    scene = dict(
        xaxis = dict(visible=False),
        yaxis = dict(visible=False),
        zaxis =dict(visible=False)
        )
    )

plotly.offline.plot(plot_figure, filename='wiki-bow-tsne3d-docs.html')

### Clustering: Topic Modelling

Let's cluster the documents, using a density-based method instead of a partitioning one.

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.1, min_samples=3, metric='cosine')
dbscan.fit(doc_vectors)
num_doc_clusters = np.max(dbscan.labels_)+1
print(f"There are {num_doc_clusters} clusters")
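
One detail of DBSCAN that is easy to miss: points that do not belong to any dense region get the label `-1` (noise), which is why the number of clusters above is computed as the maximum label plus one. The sketch below shows this on a few made-up 2-D points (`toy_points` is a hypothetical example, not the document vectors).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: two tight groups of directions plus one isolated direction.
toy_points = np.array([
    [1.0, 0.0], [1.0, 0.01], [1.0, 0.02],    # group pointing along x
    [0.0, 1.0], [0.01, 1.0], [0.02, 1.0],    # group pointing along y
    [-1.0, -1.0],                             # far from both groups in cosine distance
])

toy_dbscan = DBSCAN(eps=0.1, min_samples=3, metric='cosine').fit(toy_points)
labels = toy_dbscan.labels_

n_clusters_toy = len(set(labels) - {-1})   # clusters, not counting the noise label
n_noise = int(np.sum(labels == -1))        # points left unassigned

print(labels, n_clusters_toy, n_noise)
```

On a real corpus it is worth printing the number of noise points as well, since a small `eps` can leave most documents unassigned.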

In [None]:
!pip install -qq wordcloud

In [None]:
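#@title Helper to factor an integer
import math

def factor_int(n):
    """Return (a, b) with a * b == n, where a is the largest divisor of n not above ceil(sqrt(n)). Requires n >= 1."""
    val = math.ceil(math.sqrt(n))
    val2 = n // val
    while val2 * val != n:
        val -= 1
        val2 = n // val
    return val, val2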

Let's explore the most frequent terms in each cluster.

In [None]:
from wordcloud import WordCloud

idxs_per_cluster = {j: np.where(dbscan.labels_==j)[0] for j in range(num_doc_clusters)}
docs_per_cluster = {j: [docs[idx] for idx in idxs_per_cluster[j]] for j in idxs_per_cluster.keys()}

wc = WordCloud(background_color="white", max_words=1000)

w, h = factor_int(num_doc_clusters)
# w rows and h columns; figsize is (width, height). squeeze=False keeps `axs` 2-D even with one subplot.
fig, axs = plt.subplots(w, h, figsize=(6*h, 3*w), squeeze=False)

for j,ax in zip(idxs_per_cluster.keys(),axs.flatten()):
    wc.generate(' '.join(docs_per_cluster[j]))
    ax.imshow(wc, interpolation='bilinear')
    ax.axis("off")
    ax.set_title(f"Cluster {j}")
fig.tight_layout()
plt.show()

## Word vectors

Now let's look at the words:

In [None]:
def get_word_vector(word):
    idx = np.where(vocabulary==word)[0][0]
    return X_bow[:, idx].toarray().flatten()

In [None]:
word_vectors = [get_word_vector(word) for word in vocabulary]
word_vectors = np.array(word_vectors)
word_vectors.shape

In [None]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, metric='cosine')
nn.fit(word_vectors)

Let's try the nearest neighbors of the words: cine, equipo, guerra, música, mayores

In [None]:
word = 'guerra'
v = get_word_vector(word)
nns = nn.kneighbors([v])
print(f"Nearest neighbors: {[vocabulary[idx] for idx in nns[1][0]]}")
print(f"Distances: {[round(dist,3) for dist in nns[0][0]]}")

In [None]:
#@title Plot the 3D t-SNE dimensionality reduction

tsne = TSNE(n_components=3, metric='cosine')
X_tsne = tsne.fit_transform(word_vectors)

plotly.offline.init_notebook_mode()

trace = go.Scatter3d(
    x=X_tsne[:,0],
    y=X_tsne[:,1],
    z=X_tsne[:,2],
    mode='markers',
    marker={
        'size': 3,
        'opacity': 0.75,
        'color': 'black'
    },
    hovertemplate='%{text}<extra></extra>',
    text = [f"{vocabulary[j]}" for j in range(X_tsne.shape[0])]
)

layout = go.Layout(
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0}
)

data = [trace]

plot_figure = go.Figure(data=data, layout=layout)

plot_figure.update_layout(
    title = 'Wikipedia Words',
    scene = dict(
        xaxis = dict(visible=False),
        yaxis = dict(visible=False),
        zaxis =dict(visible=False)
        )
    )

plotly.offline.plot(plot_figure, filename='wiki-bow-tsne3d-words.html')

In [None]:
!pip install -qq umap-learn

import umap

### Clustering

Let's analyze some word clusters.

In [None]:
from sklearn.cluster import KMeans

n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init='auto')
kmeans.fit(word_vectors)

In [None]:
for j in range(n_clusters):
    print(f"Cluster {j}:")
    print([vocabulary[idx] for idx in np.where(kmeans.labels_==j)[0]])

Let's look at the cases of *familia* and *campeón*.

In [None]:
word = 'campeón'
v = get_word_vector(word)
nns = nn.kneighbors([v])
print(f"Nearest neighbors: {[vocabulary[idx] for idx in nns[1][0]]}")
print(f"Distances: {[round(dist,3) for dist in nns[0][0]]}")

⭕ Why do we get these results, which do not match the plot?

In [None]:
from sklearn.cluster import AgglomerativeClustering

agglom = AgglomerativeClustering(n_clusters=n_clusters, metric='cosine',linkage='average')
agglom.fit(word_vectors)

In [None]:
for j in range(n_clusters):
    print(f"Cluster {j}:")
    print([vocabulary[idx] for idx in np.where(agglom.labels_==j)[0]])

In [None]:
word = 'abril'
v = get_word_vector(word)
nns = nn.kneighbors([v])
print(f"Nearest neighbors: {[vocabulary[idx] for idx in nns[1][0]]}")
print(f"Distances: {[round(dist,3) for dist in nns[0][0]]}")

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.35, min_samples=2, metric='cosine')
dbscan.fit(word_vectors)

In [None]:
for j in np.unique(dbscan.labels_):
    print(f"Cluster {j}:")
    print([vocabulary[idx] for idx in np.where(dbscan.labels_==j)[0]])

## TF-IDF model

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = nltk.corpus.stopwords.words('spanish')

tfv = TfidfVectorizer(stop_words=stop_words, max_features=1000)
X_tfidf = tfv.fit_transform(docs)
print(X_tfidf.shape)

Look at the matrix: is it sparser? The BoW matrix had 94.78% of its entries equal to 0.

In [None]:
total_entradas = X_tfidf.shape[0]*X_tfidf.shape[1]
num_ceros = np.where(X_tfidf.toarray()==0)[0].shape[0]

print(f"Number of entries: {total_entradas}")
print(f"Percentage of zero entries: {round(100*num_ceros/total_entradas,2)} %")
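
A hint for the question above: TF-IDF reweights the counts but does not create or remove non-zero entries. With scikit-learn's default settings the IDF factor is always positive, so an entry is zero in the TF-IDF matrix exactly when it is zero in the BoW matrix built with the same vocabulary. The sketch below checks this on a made-up corpus (`toy_docs` is hypothetical).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
toy_docs = ["guerra mundial", "guerra civil guerra", "cine mudo"]

X_bow_toy = CountVectorizer().fit_transform(toy_docs)
X_tfidf_toy = TfidfVectorizer().fit_transform(toy_docs)

# Compare where the non-zero entries are, ignoring their values.
same_pattern = np.array_equal(X_bow_toy.toarray() != 0, X_tfidf_toy.toarray() != 0)

print(X_bow_toy.nnz, X_tfidf_toy.nnz, same_pattern)
```

What changes between the two models is the relative weight of the entries, not the sparsity.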

### Word vectors

In [None]:
vocabulary = tfv.get_feature_names_out()

def get_word_vector(word):
    idx = np.where(vocabulary==word)[0][0]
    return X_tfidf[:, idx].toarray().flatten()

In [None]:
word_vectors = [get_word_vector(word) for word in vocabulary]
word_vectors = np.array(word_vectors)
word_vectors.shape

In [None]:
nn = NearestNeighbors(n_neighbors=5, metric='cosine')
nn.fit(word_vectors)

word = 'años'
v = get_word_vector(word)
nns = nn.kneighbors([v])
print(f"Nearest neighbors: {[vocabulary[idx] for idx in nns[1][0]]}")
print(f"Distances: {[round(dist,3) for dist in nns[0][0]]}")

In [None]:
#@title Plot the 3D t-SNE dimensionality reduction

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go

tsne = TSNE(n_components=3,metric='cosine')
X_tsne = tsne.fit_transform(word_vectors)

plotly.offline.init_notebook_mode()

trace = go.Scatter3d(
    x=X_tsne[:,0],
    y=X_tsne[:,1],
    z=X_tsne[:,2],
    mode='markers',
    marker={
        'size': 3,
        'opacity': 0.75,
        'color': 'black'
    },
    hovertemplate='%{text}<extra></extra>',
    text = [f"{vocabulary[j]}" for j in range(X_tsne.shape[0])]
)

layout = go.Layout(
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0}
)

data = [trace]

plot_figure = go.Figure(data=data, layout=layout)

plot_figure.update_layout(
    title = 'Wikipedia Words',
    scene = dict(
        xaxis = dict(visible=False),
        yaxis = dict(visible=False),
        zaxis =dict(visible=False)
        )
    )

plotly.offline.plot(plot_figure, filename='wiki-tfidf-tsne3d-words.html')

### Document vectors

In [None]:
doc_vectors = X_tfidf.toarray()

In [None]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, metric='cosine')
nn.fit(doc_vectors)

### Information Retrieval

In [None]:
# query = "sello discográfico de artistas de pop"
query = "acontecimientos importantes en abril o nacido en abril"

query_vector = tfv.transform([query]).toarray().reshape(-1,)

# If no query term is in the vocabulary, the vector is all zeros and cosine distance is undefined.
if np.sum(query_vector)==0:
    print("Invalid query (OOV)")
else:
    responses = nn.kneighbors([query_vector])
    for idx,dist in zip(responses[1][0],responses[0][0]):
        print(f"Distance: {round(dist,3)}")
        print(f"{docs_raw[idx]}\n")
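
The retrieval step above can be wrapped in a small function, which makes the out-of-vocabulary guard easier to reuse. This is a minimal sketch on a made-up corpus; `search`, `toy_docs`, and the other `toy_*` names are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "sello discográfico de música pop",
    "equipo de fútbol campeón de liga",
    "guerra civil y guerra mundial",
]

toy_tfv = TfidfVectorizer()
toy_doc_vectors = toy_tfv.fit_transform(toy_docs).toarray()
toy_nn = NearestNeighbors(n_neighbors=2, metric='cosine').fit(toy_doc_vectors)

def search(query):
    """Return (document, cosine distance) pairs, or [] if no query term is in the vocabulary."""
    q = toy_tfv.transform([query]).toarray()
    if not q.any():          # all-zero vector: every query term is out of vocabulary
        return []
    dists, idxs = toy_nn.kneighbors(q)
    return [(toy_docs[i], round(float(d), 3)) for i, d in zip(idxs[0], dists[0])]

print(search("artistas de pop"))
print(search("astronomía"))
```

Note that a query is only rejected when all of its terms are unknown; unknown terms in an otherwise valid query ("artistas" here) are silently ignored.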

In [None]:
#@title 3D t-SNE dimensionality reduction

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go

tsne = TSNE(n_components=3, metric='cosine')
X_tsne = tsne.fit_transform(doc_vectors)

plotly.offline.init_notebook_mode()

trace = go.Scatter3d(
    x=X_tsne[:,0],
    y=X_tsne[:,1],
    z=X_tsne[:,2],
    mode='markers',
    marker={
        'size': 3,
        'opacity': 0.75,
        'color': 'black'
    },
    hovertemplate='%{text}<extra></extra>',
    text = [f"{docs_raw[j][:75]}" for j in range(X_tsne.shape[0])]
)

layout = go.Layout(
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0}
)

data = [trace]

plot_figure = go.Figure(data=data, layout=layout)

plot_figure.update_layout(
    title = 'Wikipedia Docs',
    scene = dict(
        xaxis = dict(visible=False),
        yaxis = dict(visible=False),
        zaxis =dict(visible=False)
        )
    )

plotly.offline.plot(plot_figure, filename='wiki-tfidf-tsne3d-docs.html')

In [None]:
!pip install -qq umap-learn

In [None]:
#@title 3D UMAP dimensionality reduction

from umap import UMAP
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go

umap = UMAP(n_components=3, metric='cosine')
X_umap = umap.fit_transform(doc_vectors)

plotly.offline.init_notebook_mode()

trace = go.Scatter3d(
    x=X_umap[:,0],
    y=X_umap[:,1],
    z=X_umap[:,2],
    mode='markers',
    marker={
        'size': 3,
        'opacity': 0.75,
        'color': 'black'
    },
    hovertemplate='%{text}<extra></extra>',
    text = [f"{docs_raw[j][:75]}" for j in range(X_umap.shape[0])]
)

layout = go.Layout(
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0}
)

data = [trace]

plot_figure = go.Figure(data=data, layout=layout)

plot_figure.update_layout(
    title = 'Wikipedia Docs',
    scene = dict(
        xaxis = dict(visible=False),
        yaxis = dict(visible=False),
        zaxis = dict(visible=False)
        )
    )

plotly.offline.plot(plot_figure, filename='wiki-tfidf-umap3d-docs.html')

### Clustering

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.1, min_samples=3, metric='cosine')
dbscan.fit(doc_vectors)
num_doc_clusters = np.max(dbscan.labels_)+1
print(f"There are {num_doc_clusters} clusters")

idxs_per_cluster = {j: np.where(dbscan.labels_==j)[0] for j in range(num_doc_clusters)}
docs_per_cluster = {j: [docs[idx] for idx in idxs_per_cluster[j]] for j in idxs_per_cluster.keys()}

In [None]:
from wordcloud import WordCloud

wc = WordCloud(background_color="white", max_words=1000)

w, h = factor_int(num_doc_clusters)
# w rows and h columns; figsize is (width, height). squeeze=False keeps `axs` 2-D even with one subplot.
fig, axs = plt.subplots(w, h, figsize=(6*h, 3*w), squeeze=False)

for j,ax in zip(idxs_per_cluster.keys(),axs.flatten()):
    wc.generate(' '.join(docs_per_cluster[j]))
    ax.imshow(wc, interpolation='bilinear')
    ax.axis("off")
    ax.set_title(f"Cluster {j}")
fig.tight_layout()
plt.show()

In [None]:
#@title Visualize the clusters in the dimensionality reduction

from umap import UMAP
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go

umap = UMAP(n_components=3, metric='cosine')
X_umap = umap.fit_transform(doc_vectors)

plotly.offline.init_notebook_mode()

trace = go.Scatter3d(
    x=X_umap[:,0],
    y=X_umap[:,1],
    z=X_umap[:,2],
    mode='markers',
    marker={
        'size': 3,
        'opacity': 0.25,
        'color': 'gray'
    },
    hovertemplate='%{text}<extra></extra>',
    text = [f"{docs_raw[j][:75]}" for j in range(X_umap.shape[0])]
)

layout = go.Layout(
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0}
)

data = [trace]

plot_figure = go.Figure(data=data, layout=layout)

for j in idxs_per_cluster.keys():
    Xs = X_umap[idxs_per_cluster[j],0]
    Ys = X_umap[idxs_per_cluster[j],1]
    Zs = X_umap[idxs_per_cluster[j],2]
    plot_figure.add_trace(
        go.Scatter3d(
            x=Xs,
            y=Ys,
            z=Zs,
            mode='markers',
            marker={
                'size': 3,
                'opacity': 0.75
            },
            hovertemplate='%{text}<extra></extra>',
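            # Hover text must come from the documents of this cluster, in the same order as Xs/Ys/Zs.
            text = [f"{docs_raw[i][:75]}" for i in idxs_per_cluster[j]],
            name = f"Cluster {j}"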
        )
    )

plot_figure.update_layout(
    title = 'Wikipedia Docs',
    scene = dict(
        xaxis = dict(visible=False),
        yaxis = dict(visible=False),
        zaxis = dict(visible=False)
        )
    )

plotly.offline.plot(plot_figure, filename='wiki-tfidf-umap3d-docs-clusters.html')

# Corpus 2: 20newsgroups
<h2>Features for classification</h2>

Finally, we use the document vectors as features for a classifier. Until now, we only *knew* how to do classification with $n$-grams.

In [None]:
from sklearn.datasets import fetch_20newsgroups

# train_docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
# test_docs = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

train_docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=['sci.med', 'sci.space'])
test_docs = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=['sci.med', 'sci.space'])

y_train = train_docs.target
y_test = test_docs.target

y_train.shape, y_test.shape

In [None]:
train_docs.data[0]

## BoW

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1000,stop_words='english')
X_train = cv.fit_transform(train_docs.data)
X_test = cv.transform(test_docs.data)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred))
print(f"F1 score: {f1_score(y_test, y_pred, average='weighted')}")

In [None]:
#@title Visualize the classes in the dimensionality reduction
!pip install -qq umap-learn

from umap import UMAP
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go

umap = UMAP(n_components=3, metric='cosine')
X_umap = umap.fit_transform(X_train)

plotly.offline.init_notebook_mode()


layout = go.Layout(
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0}
)

plot_figure = go.Figure(layout=layout)

for j in np.unique(y_train):
    Xs = X_umap[y_train==j,0].reshape(-1,)
    Ys = X_umap[y_train==j,1].reshape(-1,)
    Zs = X_umap[y_train==j,2].reshape(-1,)
    plot_figure.add_trace(
        go.Scatter3d(
            x=Xs,
            y=Ys,
            z=Zs,
            mode='markers',
            marker={
                'size': 3,
                'opacity': 0.75
            },
            hovertemplate='%{text}<extra></extra>',
            # Hover text must come from the documents of class j, in the same order as Xs/Ys/Zs.
            text = [f"{train_docs.data[i][:75]}" for i in np.where(y_train==j)[0]],
            name = train_docs.target_names[j]
        )
    )

plot_figure.update_layout(
    scene = dict(
        xaxis = dict(visible=False),
        yaxis = dict(visible=False),
        zaxis = dict(visible=False)
        )
    )

plotly.offline.plot(plot_figure, filename='20ng-bow-umap3d-docs-classes.html')

## Tf-idf

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

cv = TfidfVectorizer(max_features=1000,stop_words='english')
X_train = cv.fit_transform(train_docs.data)
X_test = cv.transform(test_docs.data)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report

lr = LogisticRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred))
print(f"F1 score: {f1_score(y_test, y_pred, average='weighted')}")

When we use an interpretable method such as logistic regression, we can obtain the importance of each feature, in this case the words in the vocabulary.

In [None]:
lr.coef_.shape

Let's look at the words that most push the prediction toward the *positive* class, that is, the class with label 1 (`train_docs.target_names[1]`).

In [None]:
word_importance = zip(cv.get_feature_names_out() ,lr.coef_.reshape(-1,))
word_importance = sorted(word_importance, key=lambda x: x[1], reverse=True)

word_importance[:10]

In [None]:
#@title Visualize the classes in the dimensionality reduction

from umap import UMAP
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go

umap = UMAP(n_components=3, metric='cosine')
X_umap = umap.fit_transform(X_train)

plotly.offline.init_notebook_mode()


layout = go.Layout(
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0}
)

plot_figure = go.Figure(layout=layout)

for j in np.unique(y_train):
    Xs = X_umap[y_train==j,0].reshape(-1,)
    Ys = X_umap[y_train==j,1].reshape(-1,)
    Zs = X_umap[y_train==j,2].reshape(-1,)
    plot_figure.add_trace(
        go.Scatter3d(
            x=Xs,
            y=Ys,
            z=Zs,
            mode='markers',
            marker={
                'size': 3,
                'opacity': 0.75
            },
            hovertemplate='%{text}<extra></extra>',
            # Hover text must come from the documents of class j, in the same order as Xs/Ys/Zs.
            text = [f"{train_docs.data[i][:75]}" for i in np.where(y_train==j)[0]],
            name = train_docs.target_names[j]
        )
    )

plot_figure.update_layout(
    scene = dict(
        xaxis = dict(visible=False),
        yaxis = dict(visible=False),
        zaxis = dict(visible=False)
        )
    )

plotly.offline.plot(plot_figure, filename='20ng-tfidf-umap3d-docs-classes.html')

# 🟥 Additional Exercises

**Classification task with the `20newsgroups` corpus**. Try the following strategies and measure the F1 score in each case:

1. All classes, without removing *headers*, *quotes*, *footers*. Compare:
 * BOW
 * TF-IDF
 * BOW + PCA
 * TF-IDF + PCA
 * BOW + t-SNE
 * TF-IDF + t-SNE
2. The same 6 strategies from the previous step, removing *headers*, *quotes*, *footers*.
3. Choose two classes that you think these models will separate very well. Which classes did you choose and why? Compare BOW and TF-IDF for the binary classification.
4. Compare your classifier from the previous assignment with the best classifier from step 3.
5. Now choose two classes that you think these models will not separate well. Which classes did you choose and why? Compare BOW and TF-IDF for the binary classification. How much did performance drop relative to step 3?
6. With your best classifier from step 3, try lowering and raising the `max_features` parameter. What effect does this have on the classification task?
7. What effect does lemmatizing the text have on the classification task? Try it with your best binary classifier.

For each BOW/TF-IDF model you build, you can tune the `max_features` hyperparameter.

**Information Retrieval with the `20newsgroups` corpus**. Train a BOW model and a TF-IDF model on all the documents from `train` and `test` together. Run some queries against each model to retrieve the most relevant documents for your search. Report some cases you find interesting and explain why you find them interesting.

**Sentiment analysis with BOW/TF-IDF**. Using the corpus from the previous assignment (the tourism one), train a Machine Learning classifier on the BOW/TF-IDF embeddings. Does performance improve over the one you presented in class?
