# Project 5: Semantic Analysis (Part 2)
## Recap:

In my previous presentation, I covered how to find the right dataset and why it matters whether the data is labeled or unlabeled. I also went over scrubbing and cleaning the data: removing invalid entries with missing data, converting data formats, and removing stop-words (words that we do not care about) to form the corpus used in this project.

I also went over some of the useful Python libraries that helped me process my dataset. The ones that stood out to me were NLTK and scikit-learn (sklearn).

Lastly, I went over analyzing my corpus and obtaining the TF-IDF scores for each document. This gave me a vectorized dataset in which each document is a point in Euclidean space, so I could compute the cosine similarity between each pair of documents to determine how close they are to one another in terms of their TF-IDF scores.
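
As a reminder of what that step looks like, here is a minimal sketch using sklearn's `TfidfVectorizer` and `cosine_similarity`. The three-sentence corpus `docs` is a made-up stand-in, not the project's data, and the vectorizer settings are assumptions rather than the exact ones used in Part 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; the real project corpus is not reproduced here.
docs = [
    "the cat sat on the mat",
    "a cat lay on a mat",
    "stock prices fell sharply today",
]

# Each row of V is the TF-IDF vector of one document (sparse matrix).
vectorizer = TfidfVectorizer(stop_words="english")
V = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between all documents (n_docs x n_docs).
sim = cosine_similarity(V)
print(sim.round(2))
```

The first two documents share terms, so their similarity is greater than zero, while the third shares no terms with them.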

## Part 2-1: Data Visualization

Having placed the data in Euclidean space (the TF-IDF scores of each document), I felt it would be useful to visualize it in 2 or 3 dimensions. There are a few data visualization tools that we can use to do that.

#### t-SNE

t-Distributed Stochastic Neighbor Embedding, available in the sklearn library.
Developed by Geoffrey Hinton and Laurens van der Maaten, it is a machine learning algorithm for dimensionality reduction.

 - Dimensionality reduction algorithm
 - Maps higher-dimensional data (n > 3 dimensions) into 2- or 3-dimensional space
 - Reducing the data to 2 or 3 dimensions allows us to plot and visualize it
 - Retains much of the neighborhood structure of the higher-dimensional data
 
#### Naive projection:
 
<img src="files/tsne1.png">

As you can see, projecting the data naively causes the points to bunch together, so they lose their original relative distances to one another.

#### t-SNE projection:

<img src="files/tsne2.png">

Ideally, the points projected into the lower dimension should retain their original relationships, so that the plot reflects the actual structure of the data.

t-SNE uses a machine learning algorithm to calculate the probability of each point being a neighbor of each other point. It bases this similarity score on a Gaussian (normal) distribution centered on each point. Once these scores are obtained, it places the points in the lower-dimensional space at random and then iteratively adjusts their positions to match the similarity scores obtained above:

<img src="files/tsnelearning.png">
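
To make the Gaussian similarity scores concrete, here is a simplified sketch in NumPy. It assumes a single fixed bandwidth `sigma` for all points, whereas t-SNE itself tunes a separate bandwidth per point from the perplexity setting. The function name `gaussian_affinities` and the three sample points are mine, for illustration only.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    """Conditional probabilities p(j|i) from a Gaussian centered on each point.

    Simplification: one shared sigma instead of a per-point bandwidth.
    """
    # Squared Euclidean distance between every pair of rows.
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative rounding errors

    P = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)  # a point is not its own neighbor
    return P / P.sum(axis=1, keepdims=True)  # each row sums to 1

# Two nearby points and one far away (made-up data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = gaussian_affinities(X)
print(P.round(3))
```

Row `i` of `P` says how likely point `i` is to pick each other point as its neighbor, so the two nearby points assign almost all of their probability to each other.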

Lastly, when measuring the similarity between points in the lower dimension, it uses a Student's t-distribution instead of the Gaussian distribution used earlier. This is where the "t" in t-SNE comes from:

<img src="files/tdistribution.png">

Source: https://www.youtube.com/watch?v=NEaUSP4YerM
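
The sketch below illustrates why the t-distribution is useful in the low-dimensional space: its tails are much heavier than the Gaussian's, so moderately distant points still get a noticeable similarity and are not squeezed together. It assumes the one-degree-of-freedom kernel `1 / (1 + d^2)` from the t-SNE paper. The helper name `student_t_affinities` and the sample points are illustrative, not taken from the project code.

```python
import numpy as np

def student_t_affinities(Y):
    """Joint probabilities q_ij for low-dimensional points Y."""
    sq_norms = (Y ** 2).sum(axis=1)
    d2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T, 0.0)
    W = 1.0 / (1.0 + d2)      # Student's t kernel with 1 degree of freedom
    np.fill_diagonal(W, 0.0)  # exclude self-similarity
    return W / W.sum()        # normalize over all pairs

# Made-up 2-D embedding of three points.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
Q = student_t_affinities(Y)
print(Q.round(3))

# Compare the unnormalized kernels at a few distances.
for d in (0.5, 2.0, 4.0):
    gauss = np.exp(-d ** 2 / 2.0)
    student = 1.0 / (1.0 + d ** 2)
    print(f"d={d}: gaussian={gauss:.5f}, student_t={student:.5f}")
```

At larger distances the Gaussian value collapses toward zero much faster than the t-distribution value.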

In [None]:
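import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Reduce the TF-IDF matrix V (documents x terms) to 2 dimensions.
V_tsne = TSNE(learning_rate=100).fit_transform(V.toarray())

print("Shape of V:", V.shape)
print("Shape of V_tsne: ", V_tsne.shape)

x = V_tsne[:, 0]
y = V_tsne[:, 1]

# Color each point by its distance from the origin, scaled to [0, 1].
points = V_tsne[:, 0:2]
radius = np.sqrt((points**2).sum(axis=1))
color = radius / radius.max()
rgb = plt.get_cmap('jet')(color)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(x, y, color=rgb)
plt.show()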

The code above reduces the original matrix V to a 2-dimensional matrix that we can then plot:

In [None]:
Shape of V: (99, 1430)
Shape of V_tsne:  (99, 2)

Plotting the V_tsne matrix above, we get a somewhat evenly distributed plot:

<img src="files/plot1.png">


## Part 2-2: Performing K-Means Clustering

For the next part of the project, I decided to perform K-Means clustering to start grouping my documents into similar topics.

For this, I decided to continue using the sklearn library and its KMeans functionality to perform K-Means clustering on the TF-IDF matrix I obtained earlier for all the documents in my corpus.

#### How Does K-Means Clustering Work?

Input: a set of points in Euclidean space (x1...xn)
Place k initial centroids (c1...ck)
Repeat until convergence (the cluster assignments no longer change):
    - for each point x:
        * Find the nearest centroid
        * Assign the point to that centroid's cluster
    - for each cluster:
        * Move the centroid to the mean (average position) of all points in the cluster


<img src="files/kmeans_info.png">
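
The steps above can be sketched in plain NumPy as follows. This is a teaching sketch, not what sklearn does internally: it assumes random initial centroids drawn from the data (sklearn defaults to k-means++) and a single run. The function name `kmeans` and the six sample points are made up for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=300, seed=0):
    """Basic K-Means (Lloyd's algorithm) on the rows of X."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: no assignment changed
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[j] = members.mean(axis=0)

    return labels, centroids

# Two obvious groups of made-up points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 or 1 depends on the random initialization; only the grouping itself is meaningful.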

In [None]:
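#
# K-Means Clustering
#
from sklearn.cluster import KMeans

# Cluster the TF-IDF matrix V into 7 groups.
km = KMeans(n_clusters=7, init='k-means++', n_init=10, max_iter=300)
km.fit(V)

# Cluster index (0-6) assigned to each document.
print(km.predict(V))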

The KMeans class from the sklearn library lets us set parameters such as the number of clusters and the maximum number of iterations allowed to reach convergence. Below are the results of the clustering above, with the number of clusters set to 7.

We can see that the 99 documents are grouped into 7 distinct clusters (0...6). On different runs, I noticed that the documents were sometimes clustered slightly differently, which is expected because the initial centroids are chosen randomly. I believe I will need to further fine-tune the number of clusters I pick and compare the results by checking the documents within each cluster for similarities.

In [None]:
[5 5 5 5 5 3 3 3 3 3 0 3 3 3 4 3 5 3 6 1 6 1 4 1 4 6 1 0 2 1 6 1 6 1 1 4 2
 6 6 4 6 4 1 6 2 2 1 4 5 6 5 6 3 4 1 6 6 6 2 0 2 1 6 2 2 1 1 1 0 1 4 4 0 0
 1 0 1 3 4 0 3 6 0 4 1 1 1 6 6 2 1 1 4 1 4 6 1 1 1]
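
One possible way to approach that tuning, which I have not applied to the corpus yet, is to compare silhouette scores across several values of k and to fix `random_state` so that runs are repeatable. The sketch below uses synthetic data from `make_blobs` as a stand-in for the TF-IDF matrix V. The blob centers and the range of k are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for V: three well-separated synthetic clusters.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    # A fixed random_state makes the clustering reproducible between runs.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

On real TF-IDF data the scores are usually much less clear-cut than on this synthetic example, so reading the documents in each cluster is still a necessary check.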