# Project 5: Semantic Analysis (Part 3)
## Recap:

In my last presentation, I focused on visualizing my data in a 2-D space. Using the t-SNE algorithm to reduce the dimensions of our TF-IDF vectors to a more manageable number, we were able to plot the corpus in 2-D and see which documents clustered closely together based on their TF-IDF scores.
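As a rough sketch of that earlier pipeline (not the exact code from the previous part), the snippet below builds TF-IDF vectors with scikit-learn and projects them to 2-D with t-SNE. The toy documents, the `stop_words` setting, and the `perplexity` and `random_state` values are assumptions chosen so the example runs on six short texts; `V_tsne` is the name the later cells use for the 2-D coordinates.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Toy stand-ins for the review texts (assumption: the real corpus is much larger).
docs = [
    "great bracket for small tvs",
    "mounting plate blocks some connections",
    "sturdy wall mount for a small tv",
    "battery life on this player is short",
    "the player syncs music quickly",
    "headphones sound great with this player",
]

# One TF-IDF row per document.
V = TfidfVectorizer(stop_words="english").fit_transform(docs)

# perplexity must be smaller than the number of samples, so it is set very low here.
tsne = TSNE(n_components=2, perplexity=2, init="random", random_state=0)
V_tsne = tsne.fit_transform(V.toarray())

print(V_tsne.shape)  # one (x, y) pair per document
```

On a real corpus the perplexity would normally be much higher; the value here only reflects the tiny sample size.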

## Next step + update

For the next step, we needed to make sense of what we were visualizing. Seeing the documents plotted was not enough; we needed to study why certain points were closer to others and determine whether documents that were close together were indeed close in semantic meaning.

I also needed to expand my training set to get better results when comparing TF-IDF scores and when running K-Means clustering.

*Note: For this part of the project, I had to first switch over to Python 3.6, as some of the features needed in our libraries were not supported in Python 2.7.*

### Step 1: Labeling the plots

We first needed to label the points. This helps us determine which document each point represents. Technically, we could take a point's coordinates and look them up in the vector array stored in memory to find the document number, but luckily matplotlib lets us display the index in an annotation whenever we hover over a point.


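```python
# Hover-over annotation/labeling.
# Assumes fig, ax, sc (the scatter plot), and annot (a hidden annotation)
# already exist from an earlier cell.
def update_annot(ind):
    # Move the annotation to the first point under the cursor.
    pos = sc.get_offsets()[ind['ind'][0]]
    annot.xy = pos
    # The point's index doubles as the review number.
    text = 'Review#: {}'.format(ind['ind'][0])
    annot.set_text(text)
    annot.get_bbox_patch().set_alpha(1)


def hover(event):
    vis = annot.get_visible()
    if event.inaxes == ax:
        # cont is True when the cursor is over at least one point.
        cont, ind = sc.contains(event)
        if cont:
            update_annot(ind)
            annot.set_visible(True)
            fig.canvas.draw_idle()
        else:
            # Hide the label once the cursor leaves the point.
            if vis:
                annot.set_visible(False)
                fig.canvas.draw_idle()

# The handler only fires once it is registered, e.g. with
# fig.canvas.mpl_connect('motion_notify_event', hover).
```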


<img src="31_85.png">
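The hover handlers rely on `fig`, `ax`, `sc`, and `annot` existing already, and on the handler being registered with the canvas. A minimal sketch of that setup is below. It uses synthetic points in place of the t-SNE coordinates, a compact version of the handler, and the non-interactive `Agg` backend so it runs without a display; all of these are assumptions for illustration rather than the original cell.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; switch to an interactive one to actually hover
import matplotlib.pyplot as plt
import numpy as np

# Synthetic 2-D points stand in for the t-SNE coordinates (assumption).
points = np.random.default_rng(0).normal(size=(20, 2))

fig, ax = plt.subplots()
sc = ax.scatter(points[:, 0], points[:, 1])

# Hidden annotation that the hover handler moves and fills in.
annot = ax.annotate(
    "", xy=(0, 0), xytext=(15, 15), textcoords="offset points",
    bbox=dict(boxstyle="round", fc="w"), arrowprops=dict(arrowstyle="->"),
)
annot.set_visible(False)


def hover(event):
    """Show the index of the point under the cursor, hide the label otherwise."""
    if event.inaxes != ax:
        return
    hit, info = sc.contains(event)
    if hit:
        idx = info["ind"][0]
        annot.xy = sc.get_offsets()[idx]
        annot.set_text("Review#: {}".format(idx))
    annot.set_visible(hit)
    fig.canvas.draw_idle()


# Without this registration the handler is never called.
cid = fig.canvas.mpl_connect("motion_notify_event", hover)
# With an interactive backend, plt.show() would now open the hoverable plot.
```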

### Step 2: Checking

Comparing reviews #31 and #85

#### 31
Great bracket for small TVs.  I used it on a 19" LCD and it worked well.  The large mounting plate may block some connections on some TVs.  Installed easily and could be adjusted to make level.  The links were very tight and need to be loosened for initial installation.  When the TV is in the desired position, you can tighten them up.

#### 85
Used this with a Panasonic 22 inch LCD. Very durable and works as advertised.It was advertised as fitting 22"-37" televisions, but the size of the t.v mounting plate is so large that it covers some of the plug connections on the back of the television. I will have to cut the metal plate to get to the P.C connector if i choose to connect a computer.Haven't decided if i will keep or send back to get something with a smaller mounting plate.Other than that, it's a great product and meets my need for a hanging television.

#### Analysis:

An initial analysis reveals that the two reviews share words such as "mounting" and "connections", which leads me to believe that both reviews are about how easy the mounts were to install and how they affected access to the connection ports on each device.
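One quick way to check this kind of overlap is to intersect the vocabularies of the two reviews. The sketch below uses only the standard library, abridged excerpts of the two reviews quoted above, and a small hand-picked stop list; the stop list and the `content_terms` helper are assumptions for illustration, not part of the project code.

```python
import re

# Abridged excerpts from reviews #31 and #85 quoted above.
review_31 = ('Great bracket for small TVs. I used it on a 19" LCD and it worked well. '
             'The large mounting plate may block some connections on some TVs.')
review_85 = ('Used this with a Panasonic 22 inch LCD. [...] the size of the t.v mounting '
             'plate is so large that it covers some of the plug connections on the back '
             'of the television.')

# Hypothetical mini stop list; the project itself uses the NLTK English list.
STOP = {"a", "and", "for", "i", "is", "it", "may", "of", "on", "so",
        "some", "that", "the", "this", "with"}


def content_terms(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return {tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOP}


shared = content_terms(review_31) & content_terms(review_85)
print(sorted(shared))
```

Terms that appear in both sets are the ones pulling the two TF-IDF vectors, and so the two plotted points, toward each other.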

*Note also the other "clusters" that occur in the plot.*

### Step 3: K-Means clustering

K-Means clustering lets us group these reviews into k clusters. Each cluster is defined by a center point (its centroid): the algorithm starts from randomly seeded centers, assigns every review to its nearest center, moves each center to the mean of the reviews assigned to it, and repeats until the assignments settle.
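The assign-then-update loop behind K-Means can be sketched in a few lines of NumPy. This is an illustration of plain Lloyd's iteration, not scikit-learn's implementation: it seeds the centers with randomly chosen data points instead of k-means++, and `lloyd_kmeans` and the two synthetic blobs are hypothetical stand-ins for the real t-SNE coordinates.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (simpler than k-means++).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignments are stable
        centers = new_centers
    return labels, centers


# Two well-separated synthetic blobs stand in for the 2-D t-SNE output.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(30, 2))
points = np.vstack([blob_a, blob_b])

labels, centers = lloyd_kmeans(points, k=2)
print(np.bincount(labels))  # number of points assigned to each cluster
```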



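```python
from sklearn.cluster import KMeans

# Five clusters, k-means++ seeding, 10 restarts, up to 1000 iterations per run.
km = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=1000)
km.fit(V_tsne)
kmeanV = km.predict(V_tsne)
```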

Below is the color-coded K-Means clustering of the original 100 reviews that I used while developing the data-processing code:

<img src="kmean_100.png">

I realized that I had to add more data to get a better feel for the semantics that each cluster represented. Here is the K-Means clustering with the data increased to about 1000 reviews:

<img src="kmean_1000.png">

In addition to performing clustering, I also wanted to know the possible semantic meaning of each cluster. For that, I computed TF-IDF over each cluster's reviews and printed out the top 5 features (words) for each cluster:

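```python
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def print_top_features(cluster_number, dataframe):
    # NLTK English stop words plus two extra tokens.
    stoplist = set(nltk.corpus.stopwords.words('english'))
    stoplist.update(['-', 'yet'])

    # Keep only the reviews assigned to this cluster.
    cluster = dataframe.loc[dataframe['cluster'] == cluster_number]
    cluster_tfidf = TfidfVectorizer(stop_words=list(stoplist), use_idf=True)

    # Fit the vectorizer on this cluster's review text only.
    V_cluster = cluster_tfidf.fit_transform(cluster.reviewText)

    # idf_ holds one inverse-document-frequency weight per term, so this
    # ordering puts the terms that are rarest within the cluster first.
    indices = np.argsort(cluster_tfidf.idf_)[::-1]
    # Removed in scikit-learn 1.2; newer releases use get_feature_names_out().
    features = cluster_tfidf.get_feature_names()
    top_n = 5
    top_features = [features[i] for i in indices[:top_n]]
    # kmeanColors (defined in another cell) maps a cluster number to its plot color.
    print('{} - {}'.format(kmeanColors[cluster_number], top_features))
```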

<img src="kmean_topword.png">
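Sorting by `idf_` orders terms by how rare they are within the cluster, since IDF grows as a term appears in fewer documents. One alternative, shown here only as a sketch, is to rank terms by their mean TF-IDF weight across the cluster's reviews. The function name `top_terms_by_mean_tfidf` and the three sample texts are hypothetical, and scikit-learn's built-in `'english'` stop list is used in place of the NLTK one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_terms_by_mean_tfidf(texts, top_n=5):
    """Return (term, mean TF-IDF weight) pairs, highest mean weight first."""
    vectorizer = TfidfVectorizer(stop_words="english")
    weights = vectorizer.fit_transform(texts)
    # Average each term's weight over all documents in the cluster.
    mean_weights = np.asarray(weights.mean(axis=0)).ravel()
    # get_feature_names_out exists in scikit-learn 1.0+; older releases use get_feature_names.
    if hasattr(vectorizer, "get_feature_names_out"):
        terms = vectorizer.get_feature_names_out()
    else:
        terms = vectorizer.get_feature_names()
    order = np.argsort(mean_weights)[::-1][:top_n]
    return [(str(terms[i]), float(mean_weights[i])) for i in order]


# Hypothetical mini cluster of mount-related reviews.
sample_cluster = [
    "mounting plate blocks some connections",
    "large mounting plate covers the connections",
    "mounting bracket installed easily",
]
top = top_terms_by_mean_tfidf(sample_cluster, top_n=3)
for term, weight in top:
    print("{}: {:.3f}".format(term, weight))
```

With this ranking, a word that appears in most of a cluster's reviews rises to the top, whereas the IDF ordering favors words that appear in only one review.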


I noticed that the term 'zune' showed up in multiple clusters. I believe this is because the device is discussed in several different semantic contexts. A larger dataset should give a better clustering outcome, which may help us better pin down the semantic meaning around this word. I also tried increasing and decreasing K for the K-Means clustering, but the word still shows up in multiple groups.

#### Closing Analysis

For the next step, I need to look into effective methods of running this modeling process over larger datasets, perhaps by using better hardware, pre-processing the datasets, and removing unnecessary data-formatting steps.