# 2. Clustering: Introduction

Our objective in this Solution is to use the tweets in our dataset to map words into a space where similar words, or words carrying similar sentiment, lie close to each other. This is useful in any application where handling a large text corpus is the main problem: the volume of text (data) can be reduced and mapped to meaning (information/knowledge) that can then be used to extract and extrapolate sentiments, feelings, opinions, etc.


We will be using open-source Python packages and well-known methods, namely Word2Vec and K-Means clustering, to achieve this task.

In [0]:
import pandas as pd
csv = 'clean_tweet.csv'
df = pd.read_csv(csv, index_col=0)
df['text'] = df['text'].astype('str')
df.head()
print(df.columns.values)

['text' 'target']


Here we import the cleaned version of the tweets. The dataset was cleaned earlier in the project.

The output shows the columns of the data we will be clustering.

In [0]:
data = pd.DataFrame()
data['text'] = df.text
del df

We copy only the `text` column into a new pandas DataFrame for ease of handling and delete the original DataFrame.

In [0]:
tweets = data['text']

In [0]:
import gensim
from gensim.models import Word2Vec, KeyedVectors
# model = Word2Vec([t.split() for t in tweets], min_count=2)  # Word2Vec expects tokenized sentences
model = KeyedVectors.load('model.w2v')

**Gensim** is a production-ready open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. It is implemented in Python and Cython for performance and scalability.

Here we load a previously trained Word2Vec model. Its word vectors are what we cluster, so that words with similar meaning or sentiment end up in the same group.
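
Gensim's Word2Vec expects each training example to be a list of tokens rather than a raw string. The following is a minimal sketch of that preparation step, assuming plain whitespace tokenization is adequate for the already-cleaned tweets. The `sample_tweets` list and the `tokenize` helper are hypothetical stand-ins, not part of the original notebook.

```python
# Hypothetical sample tweets standing in for the 'text' column of clean_tweet.csv.
sample_tweets = [
    "my sister loves pizza",
    "so hungry need food",
    "rain again today",
]

def tokenize(tweet):
    """Split a cleaned tweet into lowercase word tokens on whitespace."""
    return tweet.lower().split()

# One list of tokens per tweet: the shape Word2Vec expects as input.
sentences = [tokenize(t) for t in sample_tweets]
print(sentences[0])

# With gensim installed, training would then look like this (not run here):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, min_count=2)
```

Passing raw strings instead would make each character, not each word, a training token.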


---



But, what is Word2Vec?

> Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

> It is basically a shallow neural network that is trained to convert words to vectors of several dimensions such that similar words are close together in this space. It was created and patented at Google in 2013.
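
One way to picture "reconstructing linguistic contexts" is the skip-gram variant of Word2Vec, where the network is trained on (center word, context word) pairs drawn from a sliding window. The sketch below only generates such pairs. The `context_pairs` helper and the fixed window are illustrative assumptions, and real implementations add refinements such as subsampling and negative sampling.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical tokenized tweet.
pairs = context_pairs(["my", "sister", "loves", "pizza"], window=1)
print(pairs)
```

Words that keep appearing with the same context words receive similar training signals, which is why their vectors end up close together.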

In [0]:
print(model.wv.similar_by_word('sister'))

[('brother', 0.7047790288925171), ('cousin', 0.6642760634422302), ('sis', 0.6568150520324707), ('mom', 0.5664997100830078), ('dad', 0.5584393739700317), ('bro', 0.5507373809814453), ('sisters', 0.5271422863006592), ('niece', 0.5188810229301453), ('cousins', 0.5045449733734131), ('sissy', 0.5023375749588013)]


For example, when we query the words most similar to "sister" in the Twitter dataset, the closest words returned by the model are brother, cousin, sis, mom, dad, etc., which matches everyday language usage.
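
The scores in the output are cosine similarities between word vectors. As a rough illustration of how such a ranking can be computed, here is a sketch in plain Python. The three-dimensional `toy_vectors` are invented for the example (real vectors come from `model.wv`), and `most_similar` here is a hypothetical helper, not Gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, made up for illustration only.
toy_vectors = {
    "sister": [0.9, 0.1, 0.0],
    "brother": [0.8, 0.2, 0.1],
    "pizza": [0.0, 0.1, 0.9],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("sister", toy_vectors)
print(ranking)
```

Cosine similarity ignores vector length and compares direction only, so a score near 1 means two words point the same way in the embedding space.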

In [0]:
word_vectors = model.wv.vectors
n_words = word_vectors.shape[0]
vector_size = word_vectors.shape[1]

In [0]:
from sklearn.cluster import KMeans
n_clusters = 150
kmeans = KMeans(n_clusters=n_clusters)  # the n_jobs argument was removed from KMeans in scikit-learn 1.0
idx = kmeans.fit_predict(word_vectors)

We use these word vectors as the input to K-Means clustering from the open-source scikit-learn package.

---
What is K-Means Clustering?

> k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. 

> Hence, words that are used similarly in the tweets should fall into the same cluster. This helps us identify related words and automate some opinion-mining tasks.
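
To make the "nearest mean" idea concrete, here is a minimal sketch of the classic assign-then-update loop (Lloyd's algorithm) in plain Python. It is an illustration only, assuming the caller supplies the starting centroids. scikit-learn's `KMeans` uses k-means++ initialization by default and is far more optimized. All function names and the toy `points` below are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate between assigning points and updating centroids until stable."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # leave a centroid in place if it lost all its points
                centroids[k] = mean_point(members)
    return labels, centroids

# Hypothetical 2-D points forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[2]])
print(labels)
print(centroids)
```

In the notebook the points are the word vectors and k is 150, but the loop is the same in principle.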

In [0]:
# See if there's any semantic meaning captured in the clustering
words = ['tea', 'water','good',  'excellent', 'beautiful', 'bad',
         'ugly', 'pen', 'chocolate',
         'food', 'pizza', 'hungry', 'company', 'water', 'rain',
         'hurricane', 'brother', 'sister', 'father', 'niece',
         'school', 'college', 'university', 'institute', 'harvard', 'cambridge', 'oxford']
data = {
    'Word': words,
    'Cluster': [kmeans.predict(model.wv[word].reshape(1,-1)) for word in words]
}
# data = [[word, kmeans.predict(model.wv[word].reshape(1,-1))] for word in words]
print(pd.DataFrame(data).sort_values(by=['Cluster']))

          Word Cluster
12     company     [3]
23   institute    [15]
22  university    [15]
6         ugly    [27]
0          tea    [39]
8    chocolate    [39]
7          pen    [55]
13       water    [55]
1        water    [55]
21     college    [59]
20      school    [59]
17      sister    [73]
19       niece    [73]
18      father    [73]
16     brother    [73]
14        rain    [79]
24     harvard    [80]
26      oxford    [80]
25   cambridge    [80]
15   hurricane    [91]
11      hungry    [94]
10       pizza    [94]
9         food    [94]
5          bad   [147]
3    excellent   [148]
2         good   [148]
4    beautiful   [148]
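
The table is easier to read when words are grouped under their cluster id. The sketch below does this for a subset of the (word, cluster) pairs copied from the output above. The `group_by_cluster` helper is a hypothetical name introduced for illustration.

```python
from collections import defaultdict

# A subset of the (word, cluster id) pairs copied from the printed table.
assignments = [
    ("harvard", 80), ("oxford", 80), ("cambridge", 80),
    ("hungry", 94), ("pizza", 94), ("food", 94),
    ("pen", 55), ("water", 55),
    ("institute", 15), ("university", 15),
]

def group_by_cluster(pairs):
    """Collect words under their cluster id, preserving input order."""
    groups = defaultdict(list)
    for word, cluster_id in pairs:
        groups[cluster_id].append(word)
    return dict(groups)

groups = group_by_cluster(assignments)
for cluster_id in sorted(groups):
    print(cluster_id, groups[cluster_id])
```

Note that the cluster ids are arbitrary labels: numerically adjacent ids do not indicate that two clusters are close in the vector space.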


## Result

The results above are a sample of what this clustering solution produces. We looked at an assortment of words that may or may not be similar to one another, and we were surprised at what the clustering showed us. All the university names, **Oxford**, **Harvard**, and **Cambridge**, were found in the same cluster.

The words **Hungry**, **Food**, and **Pizza** were grouped together: they are quite different words, but they are used in similar contexts!

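**Institute** and **University** shared a cluster, as did **School** and **College**, and the family words **Brother**, **Sister**, **Father**, and **Niece** all fell into one cluster. Not every grouping is intuitive: **Pen** landed in the same cluster as **Water**.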

These simple observations suggest that this Solution maps well to everyday English usage without any human intervention, and they illustrate the power of Data Analytics in today's context.