DiploDatos 2018 / Aprendizaje no supervizado / Clustering Demo*

# Aplicación de técnicas de *clustering* a documentos de texto

**Objetivos:**

En este ejemplo mostraremos cómo utilizar técnicas de clustering para aprender la estructura subyacente de un conjunto de documentos de texto.

In [195]:
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3



### DATOS: Top 100 Greatest Movies of All Time (The Ultimate List), by ChrisWalczyk55

https://www.imdb.com/list/ls055592025/

El problema consiste en agrupar un conjunto de películas en base a sus críticas en inglés, 
usando para ello procesamiento del texto


Lo primero que haremos es leer los datos, disponibles en:
https://github.com/brandomr/document_cluster.git


### Leer el conjunto de *títulos* de las películas

In [203]:
with open("document_cluster/title_list.txt") as file:
    titles = [line.strip() for line in file]
    
print (titles)

['The Godfather', 'The Shawshank Redemption', "Schindler's List", 'Raging Bull', 'Casablanca', "One Flew Over the Cuckoo's Nest", 'Gone with the Wind', 'Citizen Kane', 'The Wizard of Oz', 'Titanic', 'Lawrence of Arabia', 'The Godfather: Part II', 'Psycho', 'Sunset Blvd.', 'Vertigo', 'On the Waterfront', 'Forrest Gump', 'The Sound of Music', 'West Side Story', 'Star Wars', 'E.T. the Extra-Terrestrial', '2001: A Space Odyssey', 'The Silence of the Lambs', 'Chinatown', 'The Bridge on the River Kwai', "Singin' in the Rain", "It's a Wonderful Life", 'Some Like It Hot', '12 Angry Men', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', 'Amadeus', 'Apocalypse Now', 'Gandhi', 'The Lord of the Rings: The Return of the King', 'Gladiator', 'From Here to Eternity', 'Saving Private Ryan', 'Unforgiven', 'Raiders of the Lost Ark', 'Rocky', 'A Streetcar Named Desire', 'The Philadelphia Story', 'To Kill a Mockingbird', 'An American in Paris', 'The Best Years of Our Lives', 'My Fair

### Leer el conjunto de *críticas*

In [204]:
synopses = []

with open("document_cluster/synopses_list_wiki.txt") as file:
    i = True
    l = ' '
    for line in file:            
        if 'BREAKS HERE' in line:
            synopses.append(l) # append the previously collected lines
            l = ' '
        
        l = l + line.strip()
        
print len(synopses)

SyntaxError: invalid syntax (<ipython-input-204-db08515e1440>, line 13)

### Leer conjunto de *géneros*

In [205]:
with open("document_cluster/genres_list.txt") as file:
    genres = [line.strip() for line in file]

In [207]:
print (titles[0])
print (synopses[0])

The Godfather
 Plot  [edit]  [  [  edit  edit  ]  ]On the day of his only daughter's wedding, Vito Corleone hears requests in his role as the Godfather, the Don of a New York crime family. Vito's youngest son, Michael, in a Marine Corps uniform, introduces his girlfriend, Kay Adams, to his family at the sprawling reception. Vito's godson Johnny Fontane, a popular singer, pleads for help in securing a coveted movie role, so Vito dispatches his consigliere, Tom Hagen, to Los Angeles to influence the abrasive studio head, Jack Woltz. Woltz is unmoved until the morning he wakes up in bed with the severed head of his prized stallion.  On the day of his only daughter's wedding,   Vito Corleone  Vito Corleone   hears requests in his role as the Godfather, the   Don  Don   of a New York crime family. Vito's youngest son,   Michael  Michael  , in a   Marine Corps  Marine Corps   uniform, introduces his girlfriend,   Kay Adams  Kay Adams  , to his family at the sprawling reception. Vito's godson




Para analizar el texto debemos estudiar la frecuencia de las palabras, es decir, separar el texto en unidades sintácticas o *tokens*.

In [208]:
# e.g.:
from nltk.tokenize import word_tokenize
text1 = "Computer science is no more about computers than astronomy is about telescopes. Edsger Dijkstra"
tokens = word_tokenize(text1)
print(tokens)

['Computer', 'science', 'is', 'no', 'more', 'about', 'computers', 'than', 'astronomy', 'is', 'about', 'telescopes', '.', 'Edsger', 'Dijkstra']


In [209]:
def tokenize(text):
    # Separar primero por oraciones y luego por palabras
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # Eliminar los tokens que no son letras (e.g. 35, ';', '#', etc.)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

totalvocab_tokenized = []
for i in synopses:
    allwords_tokenized = tokenize(i.decode('utf-8').strip())
    totalvocab_tokenized.extend(allwords_tokenized)

In [211]:
print('Hay en total ' + str(len(totalvocab_tokenized)) + ' tokens \n')
len(totalvocab_tokenized)
print (totalvocab_tokenized[0:50])

Hay en total 164243 tokens 

[u'plot', u'edit', u'edit', u'edit', u'on', u'the', u'day', u'of', u'his', u'only', u'daughter', u"'s", u'wedding', u'vito', u'corleone', u'hears', u'requests', u'in', u'his', u'role', u'as', u'the', u'godfather', u'the', u'don', u'of', u'a', u'new', u'york', u'crime', u'family', u'vito', u"'s", u'youngest', u'son', u'michael', u'in', u'a', u'marine', u'corps', u'uniform', u'introduces', u'his', u'girlfriend', u'kay', u'adams', u'to', u'his', u'family', u'at']


### Buscar clusters con Kmeans

Primero tenemos que hacer el embeding:

In [212]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses

print(tfidf_matrix.shape)

(100, 143)


In [213]:
from sklearn.cluster import KMeans

num_clusters = 5

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

In [214]:
print (clusters)

[2, 0, 0, 1, 0, 4, 3, 2, 3, 3, 0, 2, 4, 4, 4, 1, 3, 2, 4, 3, 3, 1, 4, 3, 0, 4, 2, 1, 1, 0, 1, 0, 1, 0, 0, 4, 0, 0, 0, 1, 2, 3, 1, 4, 1, 3, 2, 2, 0, 0, 0, 1, 0, 0, 1, 0, 3, 0, 2, 2, 3, 0, 0, 4, 1, 2, 1, 1, 1, 3, 4, 1, 1, 4, 1, 1, 4, 2, 2, 4, 4, 0, 0, 4, 3, 4, 4, 0, 4, 4, 3, 4, 4, 1, 4, 2, 4, 4, 3, 2]


In [215]:
# Recuento del número de elementos en cada cluster

for i in range(num_clusters):
    print ('El cluster %i tiene %i elementos' % (i, clusters.count(i)))

El cluster 0 tiene 24 elementos
El cluster 1 tiene 21 elementos
El cluster 2 tiene 15 elementos
El cluster 3 tiene 15 elementos
El cluster 4 tiene 25 elementos


In [216]:
films = { 'title': titles, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['title', 'genre'])

In [217]:
frame[1:10]

Unnamed: 0,title,genre
0,The Shawshank Redemption,"[u' Crime', u' Drama']"
0,Schindler's List,"[u' Biography', u' Drama', u' History']"
1,Raging Bull,"[u' Biography', u' Drama', u' Sport']"
0,Casablanca,"[u' Drama', u' Romance', u' War']"
4,One Flew Over the Cuckoo's Nest,[u' Drama']
3,Gone with the Wind,"[u' Drama', u' Romance', u' War']"
2,Citizen Kane,"[u' Drama', u' Mystery']"
3,The Wizard of Oz,"[u' Adventure', u' Family', u' Fantasy', u' Mu..."
3,Titanic,"[u' Drama', u' Romance']"


In [218]:
frame.ix[0]

Unnamed: 0,title,genre
0,The Shawshank Redemption,"[u' Crime', u' Drama']"
0,Schindler's List,"[u' Biography', u' Drama', u' History']"
0,Casablanca,"[u' Drama', u' Romance', u' War']"
0,Lawrence of Arabia,"[u' Adventure', u' Biography', u' Drama', u' H..."
0,The Bridge on the River Kwai,"[u' Adventure', u' Drama', u' War']"
0,Dr. Strangelove or: How I Learned to Stop Worr...,"[u' Comedy', u' War']"
0,Apocalypse Now,"[u' Drama', u' War']"
0,The Lord of the Rings: The Return of the King,"[u' Adventure', u' Fantasy']"
0,Gladiator,"[u' Action', u' Drama']"
0,Saving Private Ryan,"[u' Action', u' Drama', u' War']"


In [219]:
for i in range(num_clusters):
    print("*** Cluster %d:" % i, end='\n\n')
    
    print("WORDS /// ", end='')    
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=' / ')
    print('\n') 

    print("GENRES /// ", end='')    
    for title in frame.ix[i]['genre'].values.tolist():
         print(' %s / ' % title, end='')
    print('\n') 
        
    print("TITLES /// ", end='')
    for title in frame.ix[i]['title'].values.tolist():
         print(' %s / ' % title, end='')
    print('\n\n')

*** Cluster 0:

WORDS ///  john /  girl /  lets /  car /  sister /  lined / 

GENRES ///  [u' Crime', u' Drama'] /  [u' Biography', u' Drama', u' History'] /  [u' Drama', u' Romance', u' War'] /  [u' Adventure', u' Biography', u' Drama', u' History', u' War'] /  [u' Adventure', u' Drama', u' War'] /  [u' Comedy', u' War'] /  [u' Drama', u' War'] /  [u' Adventure', u' Fantasy'] /  [u' Action', u' Drama'] /  [u' Action', u' Drama', u' War'] /  [u' Western'] /  [u' Action', u' Adventure'] /  [u' Biography', u' Drama', u' War'] /  [u' Drama', u' Thriller'] /  [u' Action', u' Biography', u' Drama', u' History', u' War'] /  [u' Biography', u' Crime', u' Western'] /  [u' Action', u' Adventure', u' Drama', u' Western'] /  [u' Drama', u' War'] /  [u' Adventure', u' Drama', u' Western'] /  [u' Drama', u' War'] /  [u' Drama', u' War'] /  [u' Drama', u' Sci-Fi'] /  [u' Drama'] /  [u' Adventure', u' Romance', u' War'] / 

TITLES ///  The Shawshank Redemption /  Schindler's List /  Casablanca /  Law

Ahora lo limpiamos un poco más: STOPWORDS, STEMMING & TOKENIZING

In [220]:
# STOPWORDS

# la primera vez hay que descargar la lista de 'stopwords': nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

In [221]:
print (stopwords)

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u"you're", u"you've", u"you'll", u"you'd", u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u"she's", u'her', u'hers', u'herself', u'it', u"it's", u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u"that'll", u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'eac

In [222]:
# STEMMING

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# e.g.:
stemmer.stem('fishes are running')

u'fishes are run'

In [223]:
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [224]:
tokenize_and_stem('fishes are running')

[u'fish', u'are', u'run']

In [225]:
# descargar diccionarios: nltk.download('punkt')
#  iterate over the list of synopses to create two vocabularies

totalvocab_stemmed = []
totalvocab_tokenized = []
for i in synopses:
    allwords_stemmed = tokenize_and_stem(i.decode('utf-8').strip()) #for each item in 'synopses', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list
    
    allwords_tokenized = tokenize_only(i.decode('utf-8').strip())
    totalvocab_tokenized.extend(allwords_tokenized)

In [226]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print(vocab_frame.head())

there are 164243 items in vocab_frame
     words
plot  plot
edit  edit
edit  edit
edit  edit
on      on


In [227]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses

print(tfidf_matrix.shape)

CPU times: user 5.42 s, sys: 172 ms, total: 5.59 s
Wall time: 5.4 s
(100, 217)


In [232]:
terms = tfidf_vectorizer.get_feature_names()
print (len(terms))
print (terms[0:50])

217
[u'accept', u'agre', u'allow', u'alon', u'american', u'ani', u'anoth', u'apart', u'appear', u'approach', u'arm', u'armi', u'arrang', u'arriv', u'ask', u'attack', u'attempt', u'away', u'becaus', u'becom', u'befor', u'begin', u'believ', u'bodi', u'bring', u'build', u'car', u'carri', u'caus', u'chang', u'charg', u'children', u'citi', u'claim', u'close', u'come', u'command', u'commit', u'confess', u'confront', u'continu', u'convinc', u'daughter', u'day', u'dead', u'death', u'decid', u'despit', u'did', u'die']


In [233]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

In [234]:
num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

In [235]:
from sklearn.externals import joblib

joblib.dump(km,  'doc_cluster.pkl')

km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

In [236]:
films = { 'title': titles, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['title', 'cluster', 'genre'])

In [237]:
from __future__ import print_function

print("Top terms per cluster:")
print()
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

for i in range(num_clusters):
    print("*** Cluster %d:" % i, end='\n\n')
    
    print("WORDS /// ", end='')
    
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=' / ')
    print() #add whitespace
    print() #add whitespace
    
    print("TITLES /// ", end='')
    for title in frame.ix[i]['title'].values.tolist():
        print(' %s / ' % title, end='')
    print() #add whitespace
    print() #add whitespace
    
print()
print()

Top terms per cluster:

*** Cluster 0:

WORDS ///  town /  killed /  build /  remaining /  arrive /  gun / 

TITLES ///  Schindler's List /  It's a Wonderful Life /  Unforgiven /  To Kill a Mockingbird /  Jaws /  Butch Cassidy and the Sundance Kid /  High Noon /  Shane / 

*** Cluster 1:

WORDS ///  returns /  home /  father /  tells /  mother /  love / 

TITLES ///  Gone with the Wind /  Citizen Kane /  The Wizard of Oz /  Titanic /  Vertigo /  Forrest Gump /  The Sound of Music /  E.T. the Extra-Terrestrial /  2001: A Space Odyssey /  The Silence of the Lambs /  Some Like It Hot /  A Streetcar Named Desire /  The Best Years of Our Lives /  My Fair Lady /  Doctor Zhivago /  The Exorcist /  City Lights /  The King's Speech /  It Happened One Night /  Midnight Cowboy /  Mr. Smith Goes to Washington /  Rain Man /  Annie Hall /  Out of Africa /  Good Will Hunting /  Terms of Endearment /  Tootsie /  Network /  Nashville /  The Graduate /  Pulp Fiction /  A Clockwork Orange /  North by Nor