# Simple Text Clustering

Clustering is about grouping objects so as to maximize the similarity between objects within the same cluster (intra-cluster similarity) while maximizing the dissimilarity between different clusters (inter-cluster dissimilarity).

Clustering is an example of unsupervised learning: the groups are inferred from the data itself, without labelled examples.

In the following cells I will explore clustering applied to text/sentences. In this context, similarity should target the semantic and pragmatic meaning of the text: sentences with the same or closely similar meaning should fall into the same cluster.

In [106]:
import itertools 
import csv
import numpy as np
import pandas as pd

## Data Preprocessing

In [107]:
# Dummy example data.
vocabulary_size = 1000
sentences = ["A brown fox jumped on the lazy dog", 
            "A brown fox jumped on the brown duck",
            "A brown fox jumped on the lazy elephant",
            "An elephant is eating green grass near the alpaca",
            "A green alpaca tried to jump over an elephant",
            "May you rest in a deep and dreamless slumber"]
df = pd.DataFrame(sentences, columns=['sentences'])
df

Unnamed: 0,sentences
0,A brown fox jumped on the lazy dog
1,A brown fox jumped on the brown duck
2,A brown fox jumped on the lazy elephant
3,An elephant is eating green grass near the alpaca
4,A green alpaca tried to jump over an elephant
5,May you rest in a deep and dreamless slumber


## Text Vectorization

Common ways to vectorize your sentences are based on word counts.
Each sentence is represented by a vector of length N, where N is the size of your vocabulary. Each element of the vector is associated with a word (or N-gram), and its value depends on the vectorization technique used:
* count (how many times the term occurs in the sentence)
* tf-idf (term frequency * inverse document frequency)
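To make the difference between the two weightings concrete, here is a minimal sketch that vectorizes a few of the example sentences both ways. It assumes scikit-learn's default settings (smoothed idf, L2-normalized rows); the names `docs`, `count_vec` and `tfidf_vec` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three of the example sentences, reused here for illustration.
docs = ["A brown fox jumped on the lazy dog",
        "A brown fox jumped on the brown duck",
        "May you rest in a deep and dreamless slumber"]

# Both vectorizers use the same default tokenization, so they build
# the same vocabulary and their column indices line up.
count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
counts = count_vec.fit_transform(docs)
tfidf = tfidf_vec.fit_transform(docs)

brown = count_vec.vocabulary_["brown"]
fox = count_vec.vocabulary_["fox"]
duck = count_vec.vocabulary_["duck"]

# Raw counts: "brown" occurs twice in the second sentence.
print("count of 'brown':", counts[1, brown])

# "fox" and "duck" both occur once in the second sentence, but "duck"
# appears in fewer documents, so tf-idf gives it a larger weight.
print("tf-idf of 'fox': ", tfidf[1, fox])
print("tf-idf of 'duck':", tfidf[1, duck])
```

With plain counts, "fox" and "duck" are indistinguishable in the second sentence; tf-idf rewards the term that is rarer across the corpus.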

In [108]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [109]:
# This class accepts custom functions for preprocessing and tokenization,
# so you can perform your data cleaning directly at this step.
vectorizer = CountVectorizer(analyzer="word", max_features=vocabulary_size, stop_words="english", ngram_range=(1,2))
X = vectorizer.fit_transform(df["sentences"].values)
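As a sketch of what passing such functions could look like, the example below plugs a cleaning function and a tokenizer into `CountVectorizer`. The helpers `clean_text` and `simple_tokenizer` and the two sample strings are hypothetical and are not part of the notebook; the regular expressions are one possible choice for tweet-like text.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(text):
    """Hypothetical cleaner: lowercase, then drop URLs and @mentions."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return text

def simple_tokenizer(text):
    """Hypothetical tokenizer: keep alphabetic tokens of 2+ letters."""
    return re.findall(r"[a-z]{2,}", text)

# A custom preprocessor replaces the built-in lowercasing step, so
# clean_text lowercases explicitly. token_pattern=None signals that
# the default pattern is intentionally unused.
clean_vectorizer = CountVectorizer(preprocessor=clean_text,
                                   tokenizer=simple_tokenizer,
                                   token_pattern=None)
X_demo = clean_vectorizer.fit_transform(
    ["@user Check https://t.co/abc for FOX news", "The fox is back"])

print(sorted(clean_vectorizer.vocabulary_))
```

Doing the cleaning inside the vectorizer means the same steps are applied automatically when `transform` is later called on new text.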

In [110]:
X.shape

(6, 37)

In [111]:
vectorizer.vocabulary_

{'brown': 2,
 'fox': 15,
 'jumped': 24,
 'lazy': 27,
 'dog': 7,
 'brown fox': 4,
 'fox jumped': 16,
 'jumped lazy': 26,
 'lazy dog': 28,
 'duck': 10,
 'jumped brown': 25,
 'brown duck': 3,
 'elephant': 13,
 'lazy elephant': 29,
 'eating': 11,
 'green': 19,
 'grass': 17,
 'near': 30,
 'alpaca': 0,
 'elephant eating': 14,
 'eating green': 12,
 'green grass': 21,
 'grass near': 18,
 'near alpaca': 31,
 'tried': 35,
 'jump': 22,
 'green alpaca': 20,
 'alpaca tried': 1,
 'tried jump': 36,
 'jump elephant': 23,
 'rest': 32,
 'deep': 5,
 'dreamless': 8,
 'slumber': 34,
 'rest deep': 33,
 'deep dreamless': 6,
 'dreamless slumber': 9}

In [112]:
X[4].toarray()

array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
        1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]])
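The raw array is hard to read on its own. A small sketch of how the non-zero positions can be mapped back to their terms, by inverting `vocabulary_`, is shown below. It rebuilds the vectorizer on the same six sentences so that it runs on its own; `index_to_term` and `active_terms` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["A brown fox jumped on the lazy dog",
             "A brown fox jumped on the brown duck",
             "A brown fox jumped on the lazy elephant",
             "An elephant is eating green grass near the alpaca",
             "A green alpaca tried to jump over an elephant",
             "May you rest in a deep and dreamless slumber"]

vectorizer = CountVectorizer(analyzer="word", max_features=1000,
                             stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

# vocabulary_ maps term -> column index; invert it to go the other way.
index_to_term = {int(idx): term for term, idx in vectorizer.vocabulary_.items()}

# X[4] is a sparse row; .indices holds the columns with non-zero counts.
row = X[4]
active_terms = [index_to_term[int(i)] for i in sorted(row.indices)]
print(active_terms)
```

This makes it easy to check that stop words such as "to", "over" and "an" were dropped and that bigrams were built from the remaining words.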

## Clustering

In [113]:
from sklearn.cluster import KMeans

In [114]:
# Specify number of clusters and fit the data
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(X)
kmeans.labels_

array([0, 0, 0, 2, 2, 1], dtype=int32)
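The number of clusters was fixed by hand above. One common way to compare candidate values is the silhouette score; the sketch below is an assumption about how that could be done here, not part of the original notebook. It also sets `random_state`, since K-means starts from random centroids and the cluster IDs can otherwise change between runs.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

sentences = ["A brown fox jumped on the lazy dog",
             "A brown fox jumped on the brown duck",
             "A brown fox jumped on the lazy elephant",
             "An elephant is eating green grass near the alpaca",
             "A green alpaca tried to jump over an elephant",
             "May you rest in a deep and dreamless slumber"]

# Dense array for simplicity; the data set is tiny.
X = CountVectorizer(stop_words="english",
                    ngram_range=(1, 2)).fit_transform(sentences).toarray()

scores = {}
# The silhouette score needs between 2 and n_samples - 1 clusters.
for k in range(2, len(sentences)):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

The score ranges from -1 to 1, with higher values indicating tighter, better separated clusters. On six sentences it is only a rough guide.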

In [115]:
# Retrieve the cluster ID assigned to each training sentence
df['Cluster'] = kmeans.labels_
df

Unnamed: 0,sentences,Cluster
0,A brown fox jumped on the lazy dog,0
1,A brown fox jumped on the brown duck,0
2,A brown fox jumped on the lazy elephant,0
3,An elephant is eating green grass near the alpaca,2
4,A green alpaca tried to jump over an elephant,2
5,May you rest in a deep and dreamless slumber,1


## Inference

In [116]:
new_text = ["This sentence describes a new fox",  "A random sentence without any animal", "Deep learning"]
new_X = vectorizer.transform(new_text)

In [117]:
outputs = kmeans.predict(new_X)
outputs

array([0, 2, 1], dtype=int32)
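These predictions should be read with care: a new sentence is only described by the terms it shares with the training vocabulary. The sketch below counts, for each new sentence, how many known terms it contains. It refits a vectorizer on the same six sentences so that it runs on its own; `known_terms` is an illustrative name.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["A brown fox jumped on the lazy dog",
             "A brown fox jumped on the brown duck",
             "A brown fox jumped on the lazy elephant",
             "An elephant is eating green grass near the alpaca",
             "A green alpaca tried to jump over an elephant",
             "May you rest in a deep and dreamless slumber"]
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
vectorizer.fit(sentences)

new_text = ["This sentence describes a new fox",
            "A random sentence without any animal",
            "Deep learning"]
new_X = vectorizer.transform(new_text)

# Sum of each row = number of known terms (unigrams and bigrams)
# found in the sentence. Terms unseen during fit are silently ignored.
known_terms = np.asarray(new_X.sum(axis=1)).ravel()
for text, n in zip(new_text, known_terms):
    print(f"{n} known term(s): {text}")
```

A sentence with zero known terms becomes an all-zero vector, so `predict` assigns it to whichever centroid lies closest to the origin. That assignment says nothing about the sentence's meaning.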

## Practice

Practice text clustering using the Corona dataset.

In [118]:
train_file = "Corona_NLP_train.csv"
train_df = pd.read_csv(train_file, encoding='latin-1')[:1000]
train_df

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative
...,...,...,...,...,...,...
995,4794,49746,"Washington, DC",17-03-2020,Connectivity is essential during times of cris...,Negative
996,4795,49747,"San Francisco, CA",17-03-2020,@standwithPrager Wells Fargo is committed to h...,Extremely Positive
997,4796,49748,"San Francisco, CA",17-03-2020,@KariLeeAK907 Wells Fargo is committed to help...,Extremely Positive
998,4797,49749,"San Francisco, CA",17-03-2020,@TheIndigoAuthor Wells Fargo is committed to h...,Extremely Positive


In [119]:
set(train_df["Sentiment"].values)

{'Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral', 'Positive'}

In [120]:
test_file = "Corona_NLP_test.csv"
test_df = pd.read_csv(test_file, encoding='latin-1')[:100]
test_df

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,1,44953,NYC,02-03-2020,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative
1,2,44954,"Seattle, WA",02-03-2020,When I couldn't find hand sanitizer at Fred Me...,Positive
2,3,44955,,02-03-2020,Find out how you can protect yourself and love...,Extremely Positive
3,4,44956,Chicagoland,02-03-2020,#Panic buying hits #NewYork City as anxious sh...,Negative
4,5,44957,"Melbourne, Victoria",03-03-2020,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral
...,...,...,...,...,...,...
95,96,45048,Ireland,10-03-2020,The government must provide hand sanitizer in ...,Extremely Positive
96,97,45049,United States,10-03-2020,What You Need If Quarantined at Home | #Corona...,Neutral
97,98,45050,"Indianapolis, IN",10-03-2020,See the new @FujifilmX_US X-T4 and X100V at Ro...,Extremely Positive
98,99,45051,"San Diego, CA",10-03-2020,Spiking prices during a state of emergency is ...,Extremely Negative


In [121]:
vectorizer = CountVectorizer(analyzer="word", max_features=vocabulary_size, stop_words="english", ngram_range=(1,2))
X = vectorizer.fit_transform(train_df["OriginalTweet"].values)

### Question 1: Train a K-means model on the training data, then use the trained model to cluster the test data.

### Question 2: In your opinion, is this a good result?
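One possible way to approach this question is to compare the cluster IDs with the `Sentiment` column. The sketch below uses a small made-up table instead of the real tweets: the values in `toy` are invented for illustration and are not results from the dataset. Note that K-means has no reason to recover sentiment classes, so a low agreement would not by itself mean the clustering is useless.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Invented labels and cluster IDs, standing in for the real columns.
toy = pd.DataFrame({
    "Sentiment": ["Positive", "Positive", "Negative",
                  "Negative", "Neutral", "Neutral"],
    "Cluster": [0, 0, 1, 1, 1, 0],
})

# Contingency table: how the sentiment labels spread over the clusters.
table = pd.crosstab(toy["Cluster"], toy["Sentiment"])
print(table)

# Adjusted Rand index: 1.0 for identical groupings, close to 0 for
# agreement no better than chance. Cluster IDs need not match label names.
ari = adjusted_rand_score(toy["Sentiment"], toy["Cluster"])
print("adjusted Rand index:", ari)
```

Reading a handful of tweets from each cluster is a useful complement to any single score.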