# 03. Clustering

unsupervised/text

## The problem of post similarity

The **bag-of-words** approach uses simple word counts as its basis.
For each post, the occurrences of every word are counted and stored in a vector
(*vectorization*).
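
As a minimal sketch of the idea (plain Python, not scikit-learn's implementation), the counting can be done with `collections.Counter`. The helper name `bag_of_words` and the two-or-more-character token rule are assumptions chosen to resemble the default `token_pattern` printed further down.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text and count every word of two or more characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

posts = ["How to format my hard disk", "Hard disk format problems"]
counts = [bag_of_words(p) for p in posts]

# A shared, sorted vocabulary gives every post a vector of the same length.
vocabulary = sorted(set().union(*counts))
vectors = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, so two posts can be compared position by position.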

### Clustering steps

1. Extract the salient features from each post and store them as one vector per post.
2. Compute the clustering on the vectors.
3. Determine the cluster for the post in question.
4. From this cluster, fetch a handful of posts that are different from the post in question.

## Preprocessing

`min_df` determines how `CountVectorizer` treats rarely used words:

- if it is an integer, words that occur in fewer documents than that value are dropped
- if it is a fraction, words that occur in less than that fraction of all documents are dropped
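
The thresholding rule can be sketched in plain Python. This is an illustration under assumptions, not scikit-learn's code: the function name `filter_by_min_df` and the pre-tokenized toy documents are hypothetical.

```python
from collections import Counter

def filter_by_min_df(tokenized_docs, min_df):
    """Return the sorted terms whose document frequency reaches min_df.

    An int min_df is an absolute number of documents;
    a float min_df is a fraction of all documents.
    """
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    if isinstance(min_df, int):
        threshold = min_df
    else:
        threshold = min_df * len(tokenized_docs)
    return sorted(t for t, df in doc_freq.items() if df >= threshold)

docs = [["hard", "disk", "format"],
        ["hard", "disk", "problems"],
        ["disk", "image"]]

print(filter_by_min_df(docs, 1))    # keeps every term
print(filter_by_min_df(docs, 2))    # terms found in at least 2 documents
print(filter_by_min_df(docs, 1.0))  # terms found in all documents
```

With `min_df=1`, as used in the cells below, no word is dropped.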

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

print(vectorizer)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [2]:
content = ["How to format my hard disk",
           "Hard disk format problems"]

X = vectorizer.fit_transform(content)
vectorizer.get_feature_names()

[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']

In [3]:
print(X.toarray().transpose())

[[1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]


In [4]:
import os

DIR = os.path.join(os.getcwd(), "data")
posts = [open(os.path.join(DIR, f)).read() for f in os.listdir(DIR)]

print(posts)

['Imaging databases provide storage capabilities.', 'This is a toy post about machine learning. Actually, it contains not much interesting stuff.', 'Imaging databases store data. Imaging databases store data. Imaging databases store data.', 'Most imaging databases save images permanently.\n', 'Imaging databases store data.']


In [94]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

X_train = vectorizer.fit_transform(posts)

num_samples, num_features = X_train.shape

print("#samples: %d, #features: %d" % (num_samples, num_features))

#samples: 5, #features: 25


In [95]:
print(vectorizer.get_feature_names())

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'save', u'storage', u'store', u'stuff', u'this', u'toy']


In [96]:
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])
print(new_post_vec)
print(new_post_vec.toarray())

  (0, 5)	1
  (0, 7)	1
[[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


### Naive similarity measurement

In [97]:
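import scipy as sp
import scipy.linalg  # make sure the linalg submodule is loaded

def dist_raw(v1, v2):
    # Euclidean distance between two sparse row vectors
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())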

In [107]:
def calculate_similarity(dist_lambda):
    # Find the post closest to new_post under the given distance function.
    # Uses the current global vectorizer, posts and new_post.
    best_dist = float("inf")
    best_i = None

    X_train = vectorizer.fit_transform(posts)
    new_post_vec = vectorizer.transform([new_post])
    for i, post in enumerate(posts):
        if post == new_post:
            continue
        post_vec = X_train.getrow(i)
        d = dist_lambda(post_vec, new_post_vec)
        print("Post %i with dist=%.2f: %s" % (i, d, post))
        if d < best_dist:
            best_dist = d
            best_i = i

    print("Best post is %i with dist=%.2f" % (best_i, best_dist))

In [108]:
calculate_similarity(dist_raw)

Post 0 with dist=1.73: Imaging databases provide storage capabilities.
Post 1 with dist=3.16: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
Post 2 with dist=5.10: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Post 3 with dist=1.73: Most imaging databases save images permanently.

Post 4 with dist=1.41: Imaging databases store data.
Best post is 4 with dist=1.41


In [100]:
print(X_train.getrow(2).toarray())
print(X_train.getrow(4).toarray())

[[0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]
[[0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]


In [101]:
def dist_norm(v1, v2):
    v1_normalized = v1/sp.linalg.norm(v1.toarray())
    v2_normalized = v2/sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

In [103]:
calculate_similarity(dist_norm)

Post 0 with dist=0.86: Imaging databases provide storage capabilities.
Post 1 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
Post 2 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Post 3 with dist=0.92: Most imaging databases save images permanently.

Post 4 with dist=0.77: Imaging databases store data.
Best post is 2 with dist=0.77
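
To see why normalization makes posts 2 and 4 equally close, the two distances can be recomputed by hand. The sketch below uses plain lists instead of sparse matrices and keeps only the four features (data, databases, imaging, store) that are non-zero in these vectors; the function names `raw_distance` and `normalized_distance` are illustrative stand-ins for `dist_raw` and `dist_norm`.

```python
import math

def norm(v):
    """Euclidean length of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))

def raw_distance(v1, v2):
    """Euclidean distance between the raw count vectors."""
    return norm([a - b for a, b in zip(v1, v2)])

def normalized_distance(v1, v2):
    """Euclidean distance after scaling both vectors to unit length."""
    n1, n2 = norm(v1), norm(v2)
    return norm([a / n1 - b / n2 for a, b in zip(v1, v2)])

# Counts for (data, databases, imaging, store); all other features are zero.
new_post = [0, 1, 1, 0]
post_2 = [3, 3, 3, 3]   # the sentence of post 4 repeated three times
post_4 = [1, 1, 1, 1]

print("raw:  %.2f vs %.2f" % (raw_distance(post_2, new_post),
                              raw_distance(post_4, new_post)))
print("norm: %.2f vs %.2f" % (normalized_distance(post_2, new_post),
                              normalized_distance(post_4, new_post)))
```

Both posts point in the same direction, so after scaling to unit length they become the same vector and only the word proportions matter, not how often the text is repeated.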


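### Removing less important words

Stop words are very frequent words such as "a" or "about" that carry little information for telling posts apart. `CountVectorizer` can drop them via its `stop_words` parameter.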

In [104]:
vectorizer = CountVectorizer(min_df=1, stop_words='english')

In [105]:
sorted(vectorizer.get_stop_words())[0:20]

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst']

### Stemming

So far we count 'imaging' and 'images' as different words. A stemmer reduces such variants to a common stem; the **Natural Language Toolkit (NLTK)** provides several.

In [26]:
import nltk.stem

# For English, we can use SnowballStemmer
s = nltk.stem.SnowballStemmer('english')
print(
    s.stem("graphics"),
    s.stem("imaging"),
    s.stem("image")
)

(u'graphic', u'imag', u'imag')


### Extending the vectorizer with NLTK's stemmer


In [109]:
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
        
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
vectorizer.fit_transform(posts)
vectorizer.get_feature_names()

[u'actual',
 u'capabl',
 u'contain',
 u'data',
 u'databas',
 u'imag',
 u'interest',
 u'learn',
 u'machin',
 u'perman',
 u'post',
 u'provid',
 u'save',
 u'storag',
 u'store',
 u'stuff',
 u'toy']

In [110]:
calculate_similarity(dist_norm)

Post 0 with dist=0.86: Imaging databases provide storage capabilities.
Post 1 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
Post 2 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Post 3 with dist=0.63: Most imaging databases save images permanently.

Post 4 with dist=0.77: Imaging databases store data.
Best post is 3 with dist=0.63


### Stop words on steroids

We want to give a high weight to a term that occurs often in a particular post and very rarely anywhere else.
This is what **term frequency - inverse document frequency (TF-IDF)** measures.

In [111]:
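import math

def tfidf(term, doc, docset):
    # doc is a list of tokens; docset is the list of all such docs
    tf = float(doc.count(term)) / len(doc)
    # assumes the term occurs in at least one document
    docs_with_term = len([d for d in docset if term in d])
    idf = math.log(float(len(docset)) / docs_with_term)
    return tf * idf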
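
A small worked example may help here. The sketch below is self-contained and rests on two assumptions: each document is a list of tokens, and term frequency is the term's count divided by the document length. The name `tfidf_sketch` and the toy documents are illustrative only; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalization by default, so its numbers will differ.

```python
import math

def tfidf_sketch(term, doc, docset):
    # Term frequency: share of the tokens in `doc` that equal `term`.
    tf = doc.count(term) / float(len(doc))
    # Inverse document frequency: log(number of docs / docs containing term).
    docs_with_term = sum(1 for d in docset if term in d)
    idf = math.log(len(docset) / float(docs_with_term))
    return tf * idf

a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
D = [a, abb, abc]

print(tfidf_sketch("a", a, D))    # 0.0, because "a" occurs in every document
print(tfidf_sketch("b", abb, D))  # about 0.27
print(tfidf_sketch("c", abc, D))  # about 0.37
```

The term "a" carries no weight because it appears everywhere, while "c" scores highest because it appears in a single document.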

In [114]:
from sklearn.feature_extraction.text import TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (
            english_stemmer.stem(w) for w in analyzer(doc)
        )
    
vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')

### To be aware

- it does not capture relations between words
- it does not handle negations correctly (possible solution: also count bigrams or trigrams)
- it fails on misspelled words
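
The negation caveat can be made concrete. With single-word counts, "not" is detached from the word it negates, whereas pairs of consecutive words keep the two together. The helper `ngrams` below is a hypothetical plain-Python sketch; in scikit-learn the `ngram_range` parameter shown in the `CountVectorizer` printout above serves this purpose, for example `ngram_range=(1, 2)`.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this post is not interesting".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # "not" and "interesting" are separate features
print(bigrams)   # "not interesting" is a feature of its own
```

The price is a much larger vocabulary, since every distinct word pair becomes a feature.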