# Natural Language Processing (NLP) Lab Exercise

Topic: Partitional clustering with K-means

## Objective

The objective of this exercise is to process the input documents in the corpus, run partitional clustering on them with `K-means`, and inspect the resulting groups.



In [1]:
import pandas as pd
import numpy as np
import nltk
import re

pd.options.display.max_colwidth = 200
%matplotlib inline

## Document corpus

In [2]:
corpus = ['El cielo es azul y luminoso.',
          '¡Me encanta este cielo azul y luminoso!',
          'El zorro marrón es rápido y salta sobre el perro que es dormilón.',
          'Un desayuno real tiene salchichas, jamón, bacon, huevos, tostadas y queso',
          '¡Me encanta el jamón, los huevos, las salchichas y el bacon!',
          '¡El zorro marrón es rápido, y el perro azul es un dormilón!',
          'El cielo es azul intenso y hoy está muy luminoso',
          '¡El perro es un dormilón, pero el zorro es mu rápido!'
]

corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Documento': corpus}) 

corpus_df

,Documento
0,El cielo es azul y luminoso.
1,¡Me encanta este cielo azul y luminoso!
2,El zorro marrón es rápido y salta sobre el perro que es dormilón.
3,"Un desayuno real tiene salchichas, jamón, bacon, huevos, tostadas y queso"
4,"¡Me encanta el jamón, los huevos, las salchichas y el bacon!"
5,"¡El zorro marrón es rápido, y el perro azul es un dormilón!"
6,El cielo es azul intenso y hoy está muy luminoso
7,"¡El perro es un dormilón, pero el zorro es mu rápido!"


# Preprocessing

To make the texts easier to cluster, keep only the words that carry meaning in each one. Preprocess the texts to remove special characters, convert everything to lowercase, trim extra whitespace (use `strip()` for this), and remove stopwords.

In [3]:
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('spanish')

def normalize_document(doc):
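    # remove special characters; flags must be passed by keyword, because the
    # fourth positional argument of re.sub is count, not flags.
    # NOTE: this ASCII-only class also strips accented letters (marrón -> marrn)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    # lower case and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()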
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc



In [4]:
norm_corpus=[]

for document in corpus:
    norm_corpus.append(normalize_document(document))

norm_corpus

['cielo azul luminoso',
 'encanta cielo azul luminoso',
 'zorro marrn rpido salta perro dormiln',
 'desayuno real salchichas jamn bacon huevos tostadas queso',
 'encanta jamn huevos salchichas bacon',
 'zorro marrn rpido perro azul dormiln',
 'cielo azul intenso hoy est luminoso',
 'perro dormiln zorro mu rpido']
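
The normalized output above has lost its accented letters (`marrn`, `rpido`, `dormiln`) because the character class `[^a-zA-Z\s]` is ASCII-only. The token `est` (from `está`) also survives, presumably because the stripped form no longer matches the stopword list. A minimal sketch of an accent-preserving variant follows. The names `normalize_keep_accents` and `STOP_WORDS_ES` are hypothetical and introduced only for illustration; the small stopword set stands in for NLTK's Spanish list so the example runs without downloads. Swapping this in would change the vocabulary and every result shown later in the notebook.

```python
import re

# Hypothetical mini stopword set for illustration only;
# the notebook itself uses nltk.corpus.stopwords.words('spanish').
STOP_WORDS_ES = {'el', 'la', 'los', 'las', 'un', 'es', 'y', 'que', 'me',
                 'este', 'sobre', 'pero', 'muy', 'está', 'tiene'}

def normalize_keep_accents(doc, stop_words=STOP_WORDS_ES):
    # \w is Unicode-aware for str patterns, so accented letters are kept;
    # punctuation, digits and underscores are removed
    doc = re.sub(r'[^\w\s]|[\d_]', '', doc)
    doc = doc.lower().strip()
    # a plain whitespace split is enough once punctuation is gone
    tokens = doc.split()
    return ' '.join(t for t in tokens if t not in stop_words)

example = normalize_keep_accents(
    'El zorro marrón es rápido y salta sobre el perro que es dormilón.')
print(example)  # zorro marrón rápido salta perro dormilón
```

Filtering stopwords after accents are preserved matters here: a word such as `está` can only match its stopword entry if the accent is still present.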

# Building the feature matrices

Generate the BOW, TF-IDF, and bag-of-n-grams representations; they will be used to compare results.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
# get bag of words features in sparse format
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix

<8x22 sparse matrix of type '<class 'numpy.int64'>'
	with 43 stored elements in Compressed Sparse Row format>

In [6]:
# convert to a dense array
# warning: this may raise a memory error if the data is too big
cv_matrix = cv_matrix.toarray()

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer

tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True)
tt_matrix = tt.fit_transform(cv_matrix)

tt_matrix = tt_matrix.toarray()
vocab = cv.get_feature_names_out()
pd.DataFrame(np.round(tt_matrix, 2), columns=vocab)

,azul,bacon,cielo,desayuno,dormiln,encanta,est,hoy,huevos,intenso,...,marrn,mu,perro,queso,real,rpido,salchichas,salta,tostadas,zorro
0,0.53,0.0,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.43,0.0,0.49,0.0,0.0,0.57,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.37,0.0,0.0,0.0,0.0,0.0,...,0.43,0.0,0.37,0.0,0.0,0.37,0.0,0.51,0.0,0.37
3,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.0,0.32,0.0,...,0.0,0.0,0.0,0.38,0.38,0.0,0.32,0.0,0.38,0.0
4,0.0,0.45,0.0,0.0,0.0,0.45,0.0,0.0,0.45,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.45,0.0,0.0,0.0
5,0.35,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,...,0.47,0.0,0.4,0.0,0.0,0.4,0.0,0.0,0.0,0.4
6,0.3,0.0,0.34,0.0,0.0,0.0,0.47,0.47,0.0,0.47,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.41,0.0,0.0,0.0,0.0,0.0,...,0.0,0.57,0.41,0.0,0.0,0.41,0.0,0.0,0.0,0.41
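
As a sanity check on the table above, its first row can be reproduced by hand. With `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and, with `norm='l2'`, scales each row to unit length. The sketch below assumes document frequencies counted by hand from `norm_corpus` (`azul` appears in 4 documents, `cielo` and `luminoso` in 3 each); the variable names are illustrative.

```python
import numpy as np

n_docs = 8
# document frequencies of the three terms in document 0 ('cielo azul luminoso')
df = {'azul': 4, 'cielo': 3, 'luminoso': 3}

# smoothed idf, as used by TfidfTransformer(smooth_idf=True)
idf = {term: np.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

# every term occurs once in document 0, so tf * idf is just the idf value
row = np.array([idf['azul'], idf['cielo'], idf['luminoso']])

# L2 normalization, as used by TfidfTransformer(norm='l2')
tfidf_row = row / np.linalg.norm(row)
print(np.round(tfidf_row, 2))
```

The first two values match the `azul` and `cielo` entries of row 0 (0.53 and 0.6); `luminoso` falls in the columns elided by `...` in the displayed table.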


# K-means

Use KMeans to cluster the texts into groups, with the number of clusters set to 4. Do you notice any difference from hierarchical clustering?

In [10]:
from sklearn.cluster import KMeans

num_clusters = 4

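# n_init is passed explicitly: its default changed from 10 to 'auto' in scikit-learn 1.4.
# No random_state is fixed, so cluster numbering can differ between runs.
km = KMeans(n_clusters=num_clusters, n_init=10)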
km.fit(tt_matrix)

cluster_labels = km.labels_.tolist()
pd.concat([corpus_df, pd.DataFrame(cluster_labels)], axis=1)

,Documento,0
0,El cielo es azul y luminoso.,0
1,¡Me encanta este cielo azul y luminoso!,0
2,El zorro marrón es rápido y salta sobre el perro que es dormilón.,1
3,"Un desayuno real tiene salchichas, jamón, bacon, huevos, tostadas y queso",2
4,"¡Me encanta el jamón, los huevos, las salchichas y el bacon!",2
5,"¡El zorro marrón es rápido, y el perro azul es un dormilón!",1
6,El cielo es azul intenso y hoy está muy luminoso,3
7,"¡El perro es un dormilón, pero el zorro es mu rápido!",1
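
To explore the question about hierarchical clustering, one option is to run both algorithms on the same TF-IDF matrix and compare the two partitions. The sketch below is an assumption-laden illustration, not the notebook's own method: it uses Ward linkage for the agglomerative model, fixes `random_state=42` (an arbitrary value) so the K-means run is repeatable, and builds the matrix with `TfidfVectorizer`, which with default settings should be equivalent to `CountVectorizer` followed by `TfidfTransformer`. The adjusted Rand index is used because cluster numbers are arbitrary and cannot be compared directly.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# normalized documents copied from the output of the normalization cell
norm_corpus = ['cielo azul luminoso',
               'encanta cielo azul luminoso',
               'zorro marrn rpido salta perro dormiln',
               'desayuno real salchichas jamn bacon huevos tostadas queso',
               'encanta jamn huevos salchichas bacon',
               'zorro marrn rpido perro azul dormiln',
               'cielo azul intenso hoy est luminoso',
               'perro dormiln zorro mu rpido']

# dense TF-IDF matrix (AgglomerativeClustering needs a dense array)
X = TfidfVectorizer().fit_transform(norm_corpus).toarray()

# K-means: result depends on the random initialization, hence the fixed seed
km_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# agglomerative (hierarchical) clustering: deterministic for a given linkage
hc_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X)

# ARI ignores how clusters are numbered; 1.0 means identical partitions
ari = adjusted_rand_score(km_labels, hc_labels)
print(km_labels, hc_labels, round(ari, 2))
```

One practical difference to look for: K-means needs the number of clusters up front and can change between runs, while the hierarchical model builds a full merge tree that can be cut at any level afterwards.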
