DiploDatos 2018 / Aprendizaje no supervizado / Clustering Demo*

# Aplicación de técnicas de *clustering* a documentos de texto

**Objetivos:**

En este ejemplo mostraremos cómo utilizar técnicas de clustering para aprender la estructura subyacente de un conjunto de documentos de texto.

In [93]:
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3

from __future__ import print_function

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



### DATOS: Top 100 Greatest Movies of All Time (The Ultimate List), by ChrisWalczyk55

https://www.imdb.com/list/ls055592025/

El problema consiste en agrupar un conjunto de películas en base a sus críticas en inglés, 
usando para ello procesamiento del texto


Lo primero que haremos es leer los datos, disponibles en:
https://github.com/brandomr/document_cluster.git


### Leer el conjunto de *títulos* de las películas

In [94]:
with open("data/document_cluster/title_list.txt") as file:
    titles = [line.strip() for line in file]
    
print (titles)

['The Godfather', 'The Shawshank Redemption', "Schindler's List", 'Raging Bull', 'Casablanca', "One Flew Over the Cuckoo's Nest", 'Gone with the Wind', 'Citizen Kane', 'The Wizard of Oz', 'Titanic', 'Lawrence of Arabia', 'The Godfather: Part II', 'Psycho', 'Sunset Blvd.', 'Vertigo', 'On the Waterfront', 'Forrest Gump', 'The Sound of Music', 'West Side Story', 'Star Wars', 'E.T. the Extra-Terrestrial', '2001: A Space Odyssey', 'The Silence of the Lambs', 'Chinatown', 'The Bridge on the River Kwai', "Singin' in the Rain", "It's a Wonderful Life", 'Some Like It Hot', '12 Angry Men', 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', 'Amadeus', 'Apocalypse Now', 'Gandhi', 'The Lord of the Rings: The Return of the King', 'Gladiator', 'From Here to Eternity', 'Saving Private Ryan', 'Unforgiven', 'Raiders of the Lost Ark', 'Rocky', 'A Streetcar Named Desire', 'The Philadelphia Story', 'To Kill a Mockingbird', 'An American in Paris', 'The Best Years of Our Lives', 'My Fair

### Leer el conjunto de *críticas*

In [96]:
synopses = []

with open("data/document_cluster/synopses_list_wiki.txt") as file:
    i = True
    l = ' '
    for line in file:            
        if 'BREAKS HERE' in line:
            synopses.append(l) # append the previously collected lines
            l = ' '
        
        l = l + line.strip()
        
print (len(synopses))

100


### Leer conjunto de *géneros*

In [97]:
with open("data/document_cluster/genres_list.txt") as file:
    genres = [line.strip() for line in file]

In [98]:
print (titles[0])
print (genres[0])
print (synopses[0])

The Godfather
[u' Crime', u' Drama']
 Plot  [edit]  [  [  edit  edit  ]  ]On the day of his only daughter's wedding, Vito Corleone hears requests in his role as the Godfather, the Don of a New York crime family. Vito's youngest son, Michael, in a Marine Corps uniform, introduces his girlfriend, Kay Adams, to his family at the sprawling reception. Vito's godson Johnny Fontane, a popular singer, pleads for help in securing a coveted movie role, so Vito dispatches his consigliere, Tom Hagen, to Los Angeles to influence the abrasive studio head, Jack Woltz. Woltz is unmoved until the morning he wakes up in bed with the severed head of his prized stallion.  On the day of his only daughter's wedding,   Vito Corleone  Vito Corleone   hears requests in his role as the Godfather, the   Don  Don   of a New York crime family. Vito's youngest son,   Michael  Michael  , in a   Marine Corps  Marine Corps   uniform, introduces his girlfriend,   Kay Adams  Kay Adams  , to his family at the sprawling r

### Análisis del texto


Para analizar el texto debemos estudiar la frecuencia de las palabras, es decir, separar el texto en unidades sintácticas o *tokens*.

In [99]:
# e.g.:
from nltk.tokenize import word_tokenize
text1 = "Computer science is no more about computers than astronomy is about telescopes. Edsger Dijkstra"
tokens = word_tokenize(text1)
print(tokens)

['Computer', 'science', 'is', 'no', 'more', 'about', 'computers', 'than', 'astronomy', 'is', 'about', 'telescopes', '.', 'Edsger', 'Dijkstra']


In [159]:
def tokenize(text):
    # Separar primero por oraciones y luego por palabras
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # Eliminar los tokens que no son letras (e.g. 35, ';', '#', etc.)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

totalvocab_tokenized = []
for i in synopses:
    allwords_tokenized = tokenize(i.decode('utf-8').strip())
    totalvocab_tokenized.extend(allwords_tokenized)

In [163]:
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [164]:
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in synopses:
    allwords_stemmed = tokenize_and_stem(i) #for each item in 'synopses', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

In [165]:
print('Hay en total ' + str(len(totalvocab_tokenized)) + ' tokens \n')
len(totalvocab_tokenized)
print (totalvocab_tokenized[0:50])

Hay en total 3014 tokens 

['plot', 'edit', 'edit', 'edit', 'on', 'the', 'day', 'of', 'his', 'only', 'daughter', "'s", 'wedding', 'vito', 'corleone', 'hears', 'requests', 'in', 'his', 'role', 'as', 'the', 'godfather', 'the', 'don', 'of', 'a', 'new', 'york', 'crime', 'family', 'vito', "'s", 'youngest', 'son', 'michael', 'in', 'a', 'marine', 'corps', 'uniform', 'introduces', 'his', 'girlfriend', 'kay', 'adams', 'to', 'his', 'family', 'at']


### Buscar clusters con Kmeans

Primero tenemos que hacer el *embeding*:

In [166]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses

print(tfidf_matrix.shape)

(100, 143)


In [167]:
from sklearn.cluster import KMeans

num_clusters = 5

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

In [168]:
print (clusters)

[2, 4, 4, 1, 3, 1, 0, 2, 0, 1, 3, 2, 1, 1, 1, 4, 0, 2, 4, 4, 0, 1, 1, 0, 3, 1, 4, 4, 1, 3, 2, 3, 1, 3, 3, 3, 3, 4, 3, 3, 2, 2, 4, 4, 4, 0, 0, 0, 3, 3, 3, 1, 4, 3, 1, 3, 4, 3, 2, 2, 0, 4, 3, 2, 4, 2, 4, 1, 1, 0, 4, 1, 0, 4, 2, 1, 4, 2, 2, 3, 2, 0, 1, 1, 0, 2, 1, 3, 4, 3, 1, 1, 1, 1, 0, 2, 1, 2, 1, 0]


In [169]:
# Recuento del número de elementos en cada cluster

for i in range(num_clusters):
    print ('El cluster %i tiene %i elementos' % (i, clusters.count(i)))

El cluster 0 tiene 15 elementos
El cluster 1 tiene 26 elementos
El cluster 2 tiene 18 elementos
El cluster 3 tiene 21 elementos
El cluster 4 tiene 20 elementos


In [170]:
films = { 'title': titles, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['title', 'genre'])

In [171]:
frame[1:10]

Unnamed: 0,title,genre
4,The Shawshank Redemption,"[u' Crime', u' Drama']"
4,Schindler's List,"[u' Biography', u' Drama', u' History']"
1,Raging Bull,"[u' Biography', u' Drama', u' Sport']"
3,Casablanca,"[u' Drama', u' Romance', u' War']"
1,One Flew Over the Cuckoo's Nest,[u' Drama']
0,Gone with the Wind,"[u' Drama', u' Romance', u' War']"
2,Citizen Kane,"[u' Drama', u' Mystery']"
0,The Wizard of Oz,"[u' Adventure', u' Family', u' Fantasy', u' Mu..."
1,Titanic,"[u' Drama', u' Romance']"


In [172]:
frame.ix[0]

Unnamed: 0,title,genre
0,Gone with the Wind,"[u' Drama', u' Romance', u' War']"
0,The Wizard of Oz,"[u' Adventure', u' Family', u' Fantasy', u' Mu..."
0,Forrest Gump,"[u' Drama', u' Romance']"
0,E.T. the Extra-Terrestrial,"[u' Adventure', u' Family', u' Sci-Fi']"
0,Chinatown,"[u' Drama', u' Mystery', u' Thriller']"
0,My Fair Lady,"[u' Drama', u' Family', u' Musical', u' Romance']"
0,Ben-Hur,"[u' Adventure', u' Drama']"
0,Doctor Zhivago,"[u' Drama', u' Romance', u' War']"
0,The Exorcist,[u' Horror']
0,Mr. Smith Goes to Washington,[u' Drama']


In [173]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_tokenized)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

there are 3014 items in vocab_frame


In [174]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses

In [183]:
terms = tfidf_vectorizer.get_feature_names()

dist = 1 - cosine_similarity(tfidf_matrix)

num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

from sklearn.externals import joblib

joblib.dump(km,  'doc_cluster.pkl')

km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

In [185]:
films = { 'title': titles, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['title', 'cluster', 'genre'])

In [188]:
frame[0:10]

Unnamed: 0,title,cluster,genre
3,The Godfather,3,"[u' Crime', u' Drama']"
1,The Shawshank Redemption,1,"[u' Crime', u' Drama']"
2,Schindler's List,2,"[u' Biography', u' Drama', u' History']"
0,Raging Bull,0,"[u' Biography', u' Drama', u' Sport']"
1,Casablanca,1,"[u' Drama', u' Romance', u' War']"
1,One Flew Over the Cuckoo's Nest,1,[u' Drama']
3,Gone with the Wind,3,"[u' Drama', u' Romance', u' War']"
0,Citizen Kane,0,"[u' Drama', u' Mystery']"
3,The Wizard of Oz,3,"[u' Adventure', u' Family', u' Fantasy', u' Mu..."
0,Titanic,0,"[u' Drama', u' Romance']"


In [189]:
frame.ix[0]

Unnamed: 0,title,cluster,genre
0,Raging Bull,0,"[u' Biography', u' Drama', u' Sport']"
0,Citizen Kane,0,"[u' Drama', u' Mystery']"
0,Titanic,0,"[u' Drama', u' Romance']"
0,Sunset Blvd.,0,"[u' Drama', u' Film-Noir']"
0,2001: A Space Odyssey,0,"[u' Mystery', u' Sci-Fi']"
0,Singin' in the Rain,0,"[u' Comedy', u' Musical', u' Romance']"
0,12 Angry Men,0,[u' Drama']
0,Gandhi,0,"[u' Biography', u' Drama', u' History']"
0,From Here to Eternity,0,"[u' Drama', u' Romance', u' War']"
0,Rocky,0,"[u' Drama', u' Sport']"


### Ahora lo limpiamos un poco más: STOPWORDS, STEMMING & TOKENIZING

In [190]:
# STOPWORDS

# la primera vez hay que descargar la lista de 'stopwords': nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

In [191]:
print (stopwords)

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u"you're", u"you've", u"you'll", u"you'd", u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u"she's", u'her', u'hers', u'herself', u'it', u"it's", u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u"that'll", u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'eac

In [192]:
# STEMMING

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# e.g.:
stemmer.stem('fishes are running')

u'fishes are run'

In [193]:
def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [194]:
tokenize_and_stem('cats are running')

[u'cat', u'are', u'run']

In [195]:
totalvocab_stemmed = []
for i in synopses:
    allwords_stemmed = tokenize_and_stem(i.decode('utf-8').strip()) #for each item in 'synopses', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list

In [196]:
vocab_frame = pd.DataFrame({'words': totalvocab_stemmed})
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print(vocab_frame.head())

there are 162054 items in vocab_frame
  words
0  plot
1  edit
2  edit
3  edit
4    on


In [197]:
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses

print(tfidf_matrix.shape)

CPU times: user 5.54 s, sys: 64 ms, total: 5.6 s
Wall time: 5.53 s
(100, 212)


In [198]:
terms = tfidf_vectorizer.get_feature_names()
print (len(terms))
print (terms[0:50])

212
[u'accept', u'agre', u'allow', u'alon', u'american', u'ani', u'anoth', u'apart', u'appear', u'approach', u'arm', u'armi', u'arrang', u'arriv', u'ask', u'attack', u'attempt', u'away', u'becaus', u'becom', u'befor', u'begin', u'believ', u'bodi', u'bring', u'car', u'carri', u'caus', u'chang', u'charg', u'children', u'citi', u'claim', u'close', u'come', u'command', u'commit', u'confess', u'confront', u'continu', u'convinc', u'daughter', u'day', u'dead', u'death', u'decid', u'despit', u'did', u'die', u'discov']


In [199]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

from sklearn.externals import joblib

joblib.dump(km,  'doc_cluster.pkl')

km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

films = { 'title': titles, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['title', 'cluster', 'genre'])