DiploDatos 2018 / Aprendizaje no supervizado / Clustering Demo*

# Aplicación de técnicas de *clustering* a documentos de texto

**Objetivos:**

En este ejemplo mostraremos cómo utilizar técnicas de clustering para aprender la estructura subyacente de un conjunto de documentos de texto.

In [57]:
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3

from __future__ import print_function

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.externals import joblib

### DATOS: Top 100 Greatest Movies of All Time (The Ultimate List), by ChrisWalczyk55

https://www.imdb.com/list/ls055592025/

El problema consiste en agrupar un conjunto de películas en base a sus críticas en inglés, 
usando para ello procesamiento del texto


Lo primero que haremos es leer los datos, disponibles en:
https://github.com/brandomr/document_cluster.git

In [58]:
# Lectura de los titulos

with open("data/document_cluster/title_list.txt") as file:
    titles = [line.strip() for line in file]
    
# Lectura de las criticas

synopses = []
with open("data/document_cluster/synopses_list_wiki.txt") as file:
    i = True
    l = ' '
    for line in file:            
        if 'BREAKS HERE' in line:
            synopses.append(l) # append the previously collected lines
            l = ' '       
        l = l + line.decode('utf-8').strip()
        
# Lectura de los generos

with open("data/document_cluster/genres_list.txt") as file:
    genres = [line.strip() for line in file]

### Análisis del texto


Para analizar el texto debemos estudiar la frecuencia de las palabras, es decir, separar el texto en unidades sintácticas o *tokens*.

In [59]:
def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [60]:
# e.g.:
from nltk.tokenize import word_tokenize
text = "Computer science is no more about computers than astronomy is about telescopes. Edsger Dijkstra"
tokens = tokenize_only(text)
print(tokens)

['computer', 'science', 'is', 'no', 'more', 'about', 'computers', 'than', 'astronomy', 'is', 'about', 'telescopes', 'edsger', 'dijkstra']


In [61]:
totalvocab_tokenized = []

for i in synopses:
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

In [62]:
print('Hay en total ' + str(len(totalvocab_tokenized)) + ' tokens \n')
len(totalvocab_tokenized)
print (totalvocab_tokenized[0:50])

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_tokenized)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

Hay en total 164243 tokens 

[u'plot', u'edit', u'edit', u'edit', u'on', u'the', u'day', u'of', u'his', u'only', u'daughter', u"'s", u'wedding', u'vito', u'corleone', u'hears', u'requests', u'in', u'his', u'role', u'as', u'the', u'godfather', u'the', u'don', u'of', u'a', u'new', u'york', u'crime', u'family', u'vito', u"'s", u'youngest', u'son', u'michael', u'in', u'a', u'marine', u'corps', u'uniform', u'introduces', u'his', u'girlfriend', u'kay', u'adams', u'to', u'his', u'family', u'at']
there are 164243 items in vocab_frame


In [84]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_only, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses
print(tfidf_matrix.shape)

terms = tfidf_vectorizer.get_feature_names()

(100, 143)


### Buscar clusters con Kmeans

Primero tenemos que hacer el *embeding*:

In [64]:
from sklearn.cluster import KMeans

num_clusters = 5

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

In [65]:
print (clusters)

# Recuento del número de elementos en cada cluster
for i in range(num_clusters):
    print ('El cluster %i tiene %i elementos' % (i, clusters.count(i)))

[0, 3, 3, 2, 4, 4, 1, 0, 1, 4, 3, 0, 4, 4, 2, 1, 4, 0, 4, 1, 1, 2, 4, 4, 3, 4, 3, 1, 2, 3, 1, 3, 2, 3, 2, 2, 3, 3, 3, 2, 0, 2, 1, 4, 1, 1, 0, 3, 3, 3, 1, 2, 3, 3, 4, 3, 3, 3, 0, 0, 1, 3, 3, 4, 1, 0, 4, 2, 4, 1, 1, 2, 1, 1, 2, 1, 4, 0, 0, 2, 2, 3, 1, 2, 1, 4, 4, 3, 3, 2, 4, 4, 2, 4, 4, 4, 4, 4, 4, 0]
El cluster 0 tiene 12 elementos
El cluster 1 tiene 20 elementos
El cluster 2 tiene 18 elementos
El cluster 3 tiene 24 elementos
El cluster 4 tiene 26 elementos


In [66]:
films = { 'title': titles, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['title', 'genre'])

In [67]:
frame[1:10]

Unnamed: 0,title,genre
3,The Shawshank Redemption,"[u' Crime', u' Drama']"
3,Schindler's List,"[u' Biography', u' Drama', u' History']"
2,Raging Bull,"[u' Biography', u' Drama', u' Sport']"
4,Casablanca,"[u' Drama', u' Romance', u' War']"
4,One Flew Over the Cuckoo's Nest,[u' Drama']
1,Gone with the Wind,"[u' Drama', u' Romance', u' War']"
0,Citizen Kane,"[u' Drama', u' Mystery']"
1,The Wizard of Oz,"[u' Adventure', u' Family', u' Fantasy', u' Mu..."
4,Titanic,"[u' Drama', u' Romance']"


In [68]:
frame.ix[1]

Unnamed: 0,title,genre
1,Gone with the Wind,"[u' Drama', u' Romance', u' War']"
1,The Wizard of Oz,"[u' Adventure', u' Family', u' Fantasy', u' Mu..."
1,On the Waterfront,"[u' Crime', u' Drama']"
1,Star Wars,"[u' Action', u' Adventure', u' Fantasy', u' Sc..."
1,E.T. the Extra-Terrestrial,"[u' Adventure', u' Family', u' Sci-Fi']"
1,Some Like It Hot,[u' Comedy']
1,Amadeus,"[u' Biography', u' Drama', u' Music']"
1,To Kill a Mockingbird,[u' Drama']
1,The Best Years of Our Lives,"[u' Drama', u' Romance', u' War']"
1,My Fair Lady,"[u' Drama', u' Family', u' Musical', u' Romance']"


In [88]:
dist = 1 - cosine_similarity(tfidf_matrix)

num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

In [91]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

In [93]:
order_centroids

array([[ 86,  37,  76,  58,  36, 141, 110,  72, 135,  15,   2, 102, 127,
         38,  87,  63,  57, 111,  32, 122, 120,  39,  29,  42,  94, 125,
        109,  84, 129, 123,  69, 136,  46,  50, 114, 131, 126,  10,  17,
         66,  35,   0,  91, 116, 133, 137,   5,   1, 104,  98,  30,  75,
         28, 101,  60,  19,  24, 138,  53,  70, 119,  74,  92,  22, 103,
        115,  56,  40, 140,  47, 100,  59,  73,  16, 105,   8,  20, 106,
         78,  51, 132, 121,  77, 139,  81,  62,  45,  11,  85, 130,  83,
         71,   6,  12,  89, 142,  93,  49,  43,  61,  25,  21,  44, 108,
         55,  33,  52,  54,  23,  96,  80,  88,  79,  68,  14,  48,   9,
         90, 134,  27, 112, 107,  13,  95, 128,  34, 117,  31,  64,  67,
        118,   3,  99,   7,  18,  41,  26,  82, 124,  65,   4, 113,  97],
       [ 34, 134,  14,  35, 141, 118,  53,  86,  84,  58,  69, 125, 103,
        111,  74, 135,  95,  94,  64,  61,  63, 139,  51, 117,  54,  52,
          5,  83,  76,  39,  67,  19, 127,  26,   

In [17]:
joblib.dump(km,  'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

In [22]:
print("Top terms per cluster:")
print()
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]     

Top terms per cluster:



In [52]:
order_centroids

array([[ 53,  35,  22, 128,  76,  38, 127,  54,  77,  19,  43, 136,  61,
        126, 103,  23,  65,   9,  15,  10,  95,  74,  31, 138, 133,  39,
         24, 137,   5,  67,  97,  29, 108, 123,  57,  59, 129, 125,  14,
          4,  66, 104,  18,  79,  11,  20,  63,  41, 122,  28,   0,  60,
         69,  98,  86,   8,  52, 142,  82,  70,  87, 132,  37,  68,  25,
        100,  47,  46, 135, 140,  32,  73, 112,  27,  84, 118, 117, 102,
          7,  78,  21, 111, 107, 105,  89, 101, 110,  90,  51,  17,  50,
         93, 139,  99,  72,  64,  88,  36, 124, 109, 120,  83, 121,  33,
         96,  42,  81,  45,  34,  94, 116,   6,  75,  30, 131, 106,  71,
        115,   1,  16,  12, 119, 134,  56,  92,  49, 130,  62, 113,  80,
          3,  40,  58,  13, 114,  85,  91,  44,  48,  55, 141,   2,  26],
       [  2, 134,  80, 113,  61,   1,   6,  92,  22,  53,  21,  87, 120,
          8,  63,  19, 125,  59, 128,  71,  41,   3,  76, 127,   9,  32,
         60, 124,  73,  86,  95,  65,   4,   5,  3

In [39]:
print("Top terms per cluster:")
print()
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]     
        
for i in range(num_clusters):
    print("*** Cluster %d:" % i, end='\n\n')

    print("CENTROID /// ", end='')
    print(titles[order_centroids[i][0]])
    print(titles[order_centroids[i][1]])
    
    print("WORDS /// ", end='')
    
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=' / ')
    print() #add whitespace
    print() #add whitespace
    
    print("TITLES /// ", end='')
    for title in frame.ix[i]['title'].values.tolist():
        print(' %s / ' % title, end='')
    print() #add whitespace
    print() #add whitespace
    
print()
print()        
        

Top terms per cluster:

*** Cluster 0:

CENTROID /// The Treasure of the Sierra Madre
From Here to Eternity
WORDS ///  home /  father /  death /  town /  man /  finally / 

TITLES ///  The Wizard of Oz /  Vertigo /  Forrest Gump /  The Sound of Music /  E.T. the Extra-Terrestrial /  The Philadelphia Story /  Ben-Hur /  Doctor Zhivago /  The Exorcist /  Mr. Smith Goes to Washington /  Out of Africa /  Good Will Hunting /  Terms of Endearment /  Close Encounters of the Third Kind /  The Graduate /  Wuthering Heights /  Yankee Doodle Dandy / 

*** Cluster 1:

CENTROID /// Schindler's List


IndexError: list index out of range

### Ahora lo limpiamos un poco más: STOPWORDS, STEMMING & TOKENIZING

In [57]:
# STOPWORDS

# la primera vez hay que descargar la lista de 'stopwords': nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

In [58]:
#print (stopwords)

In [59]:
# STEMMING

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [60]:
# e.g.:
tokenize_and_stem('cats are running')

[u'cat', u'are', u'run']

In [61]:
totalvocab_stemmed = []

for i in synopses:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)
    
vocab_frame = pd.DataFrame({'words': totalvocab_stemmed}, index=totalvocab_stemmed)

In [62]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses

terms = tfidf_vectorizer.get_feature_names()

In [63]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

from sklearn.externals import joblib

joblib.dump(km,  'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

films = { 'title': titles, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['title', 'cluster', 'genre'])

In [64]:
terms = tfidf_vectorizer.get_feature_names()
dist = 1 - cosine_similarity(tfidf_matrix)
num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

In [65]:
print("Top terms per cluster:")
print()
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]     
        
for i in range(num_clusters):
    print("*** Cluster %d:" % i, end='\n\n')
    
    print("WORDS /// ", end='')
    
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=' / ')
    print() #add whitespace
    print() #add whitespace
    
    print("TITLES /// ", end='')
    for title in frame.ix[i]['title'].values.tolist():
        print(' %s / ' % title, end='')
    print() #add whitespace
    print() #add whitespace
    
print()
print()  

Top terms per cluster:

*** Cluster 0:

WORDS ///  armi /  kill /  soldier /  command /  order /  attack / 

TITLES ///  Schindler's List /  Lawrence of Arabia /  Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb /  Apocalypse Now /  The Lord of the Rings: The Return of the King /  Patton /  Braveheart /  Platoon /  Dances with Wolves /  All Quiet on the Western Front / 

*** Cluster 1:

WORDS ///  famili /  john /  new /  apart /  york /  father / 

TITLES ///  The Shawshank Redemption /  Casablanca /  One Flew Over the Cuckoo's Nest /  The Wizard of Oz /  Psycho /  West Side Story /  Star Wars /  The Silence of the Lambs /  The Bridge on the River Kwai /  Gladiator /  From Here to Eternity /  Saving Private Ryan /  Raiders of the Lost Ark /  Jaws /  The Treasure of the Sierra Madre /  The Pianist /  The Deer Hunter /  The African Queen /  Stagecoach /  Mutiny on the Bounty /  The Maltese Falcon /  Rebel Without a Cause /  Rear Window /  The Third Man /  North by No