# Assignment 5: Document similarity

In this assignment we will explore some ways to characterize the similarity of documents.  We will be using the file `california_wines.csv`, which contains information about 30,000 or so wines, taken from Wine Enthusiast magazine.

At many points during the coding exercises, you will be asked to "make a good choice" about how to do something, and explain why you made the choice you did.  You can either write text in a Markdown cell before or after your code, or use comments in the code.  In many cases, it is okay to say "I used the default options because I had no reason not to".  But keep in mind that in some cases, you may do things a certain way, and then later discover a problem which requires you to go back and change how you did an earlier part.  If so, include such reasoning in your description (e.g., "At first I did X, but then later I encountered problem Y, so I went back and changed X to Z").

First, load in the file and display the first few rows.

In [125]:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import mplcursors
import warnings
from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.cluster import MeanShift, SpectralClustering
from sklearn.metrics import silhouette_score
warnings.filterwarnings('ignore')

In [4]:
reviews = pd.read_csv('california_wines.csv')
reviews.head(5)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,US,"Soft, supple plum envelopes an oaky structure ...",Mountain Cuvée,87,19.0,California,Napa Valley,Napa,Virginie Boone,@vboone,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon,Kirkland Signature
1,US,"Slightly reduced, this wine offers a chalky, t...",,87,34.0,California,Alexander Valley,Sonoma,Virginie Boone,@vboone,Louis M. Martini 2012 Cabernet Sauvignon (Alex...,Cabernet Sauvignon,Louis M. Martini
2,US,Building on 150 years and six generations of w...,,87,12.0,California,Central Coast,Central Coast,Matt Kettmann,@mattkettmann,Mirassou 2012 Chardonnay (Central Coast),Chardonnay,Mirassou
3,US,This wine from the Geneseo district offers aro...,Signature Selection,87,22.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Bianchi 2011 Signature Selection Merlot (Paso ...,Merlot,Bianchi
4,US,Oak and earth intermingle around robust aromas...,King Ridge Vineyard,87,69.0,California,Sonoma Coast,Sonoma,Virginie Boone,@vboone,Castello di Amorosa 2011 King Ridge Vineyard P...,Pinot Noir,Castello di Amorosa


## Nearest neighbors

Our first goal will be to create a nearest-neighbors model that can find the wines most similar to a given wine.  To begin, create a TfidfVectorizer, fit it on the `description` column of the table, and use it to create a sparse matrix of word features.  (Since we are doing exploratory, unsupervised learning here, there is no need to do a train/test split.)

Choose appropriate arguments to pass to your TfidfVectorizer.  Explain, in a Markdown cell or in comments in your code, why you passed the arguments you did.

In [14]:
vectorizer = TfidfVectorizer(max_df=0.95, min_df = 5, max_features=2000, stop_words = 'english')
vectorizer.fit(reviews.description)

features = vectorizer.transform(reviews['description'])

The reason I chose the following parameters:
- max_df: The numbers I usually see for max_df when looking at this online is usually around .90-.95, hence I just chose 0.95.
- min_df = 5: Again, the numbers that I usually see for this online is 1-10 so I just chose something in the middle. 
- max_features: Using something around the 2000 range seems pretty good as I don't want too much overfitting.
- stop_words: I need to include stop words as a parameter to remove unnecessary noise.

Now create a NearestNeighbor model using scikit, and fit it on the features you created.  Again, describe any choices you made about how to set up the model.

In [16]:
nn = NearestNeighbors(metric='cosine')
nn.fit(features)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

I chose cosine for metric because in the slides, it said that cosine would be much better for text similarity. 

Use your model to find the 10 wines most similar to the wine whose `title` is `'Grassini 2014 Merlot (Happy Canyon of Santa Barbara)'` (other than that wine itself).  Your code should give a DataFrame containing the rows of the original table that correspond to these most-similar wines.

In [17]:
reviews[reviews.title == 'Grassini 2014 Merlot (Happy Canyon of Santa Barbara)']

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
22034,US,"Tremendously delicious, this bottling combines...",,95,95.0,California,Happy Canyon of Santa Barbara,Central Coast,Matt Kettmann,@mattkettmann,Grassini 2014 Merlot (Happy Canyon of Santa Ba...,Merlot,Grassini


In [87]:
top10_similar = reviews.iloc[nn.kneighbors(features[22034, :], 11)[1][0,1:]].sort_values('points')

top10_similar

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,SVDX,SVDY
23356,US,This bottling offers much of the richness and ...,Indie Noir,88,28.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,ONX 2016 Indie Noir Red (Paso Robles),Red Blend,ONX,0.19163,-0.049049
31159,US,Wild cherry and sagebrush show on the nose of ...,Ex Anima,88,29.0,California,Monterey,Central Coast,Matt Kettmann,@mattkettmann,Wrath 2015 Ex Anima Pinot Noir (Monterey),Pinot Noir,Wrath,0.23684,-0.043591
20519,US,Nose-tickling brown spice coats a black cherry...,Bendicion Estate,91,36.0,California,Adelaida District,Central Coast,Matt Kettmann,@mattkettmann,Oso Libre 2011 Bendicion Estate Mourvèdre (Ade...,Mourvèdre,Oso Libre,0.206492,-0.045857
26906,US,Aromas of maple and sweet hickory smoke emerge...,Osiris,91,38.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Kaleidos 2013 Osiris Red (Paso Robles),Rhône-style Red Blend,Kaleidos,0.1971,-0.035526
5266,US,Richness is on full display in this bottling b...,,93,36.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Kaleidos 2014 Syrah (Paso Robles),Syrah,Kaleidos,0.158574,0.016169
24699,US,This upper-end bottling from the metalworks-mi...,Bentley Ironworks,93,60.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Sculpterra 2011 Bentley Ironworks Cabernet Sau...,Cabernet Sauvignon,Sculpterra,0.206356,-0.127696
5262,US,"Dark red cherry, crushed carnations, rose peta...",Sierra Mar Vineyard,93,55.0,California,Santa Lucia Highlands,Central Coast,Matt Kettmann,@mattkettmann,Bernardus 2015 Sierra Mar Vineyard Pinot Noir ...,Pinot Noir,Bernardus,0.248627,-0.070473
10655,US,"Delicious aromas of cherry-crusted beef, cocon...",Bilancio,94,36.0,California,Monterey County,Central Coast,Matt Kettmann,@mattkettmann,Pianetta 2011 Bilancio Syrah-Cabernet Sauvigno...,Syrah-Cabernet Sauvignon,Pianetta,0.200469,-0.108447
4517,US,There's an elegant weave to the aromas of this...,Reserve,94,65.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Daou 2013 Reserve Cabernet Sauvignon (Paso Rob...,Cabernet Sauvignon,Daou,0.201638,-0.145226
2460,US,There is impressive depth to the nose of this ...,Mia's Vineyard,95,50.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Falcone 2013 Mia's Vineyard Cabernet Sauvignon...,Cabernet Sauvignon,Falcone,0.23534,-0.12755


Now take that table and sort it by the `points` column to find, among these 10 similar wines, which one was highest rated by the wine reviewer.  What was the name (or `title`) of that wine?

In [34]:
top10_similar.sort_values('points').tail(1)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
2460,US,There is impressive depth to the nose of this ...,Mia's Vineyard,95,50.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Falcone 2013 Mia's Vineyard Cabernet Sauvignon...,Cabernet Sauvignon,Falcone


Suppose a person likes that particular Grassini wine.  They are looking for similar wines.  However, they also obviously care about the quality of the wines, which we assume is accurately measured by the `points` column of the table.  A person's choice of which wine to get might be based on a combination of these two features: similarity to their favorite wine, and overall quality.  They may not want the most similar wine, if it is not as high-quality, but they may also not want just any high-quality wine if it's not similar to their favorite, because even a good wine might not align with someone's personal tastes.

To help our hypothetical oenophile, let's make a plot that combines those two pieces of information.  Use your nearest-neighbor model to get the 50 wines most similar to our target wine (that Grassini 2014 from above).  Then, make a scatter plot where each point is one wine.  One axis should show the `points` rating of the wine, and the other should show its similarity to (or "distance from", which is the opposite of similarity) the target wine.  (Do *not* include the target wine on your graph, as doing so will cause the graph to look odd.)

Each point should be labeled with the wine's`title`.  If you were able to install the `mplcursors` library, you can use that.  Otherwise, use the `text` function from `matplotlib` to plot the text directly on the graph.  (If you do the latter, you can shorten the title to the first 20 characters or something, since otherwise the graph may become very cluttered.)

In [61]:
%matplotlib notebook

top50_similar = reviews.iloc[nn.kneighbors(features[22034, :], 51)[1][0,1:]]

top50_similar.loc[:,'distance'] = nn.kneighbors(features[22034, :], 51)[0][0][1:]

ax = top50_similar.plot.scatter(x="points", y="distance")
cursor = mplcursors.cursor(ax)
cursor.connect('add', lambda sel: sel.annotation.set_text(top50_similar.title.iloc[sel.target.index]))

<IPython.core.display.Javascript object>

1

**Question**

Explain in words how a person could make use of this graph to find a wine they might like.  Where on the graph should they look to find wines they should consider trying?  Why?  If you were going to use this graph to recommend a wine to someone who liked the Grassini 2014, which wine would you recommend?  Why?

**Answer**
Because we are interested in wine that has a high score from the sommeliers and a low "similarity" distance from the Grassini wine, we would need to be looking at the lower right quadrant. Hence, we should be recommending wines like Kaleidos 2014 Syrah, Sculpterra 2011 Bentley Ironworks Cabernet Sauvignon, and Pianetta 2011 Bilancio Syrah-Cabernet Sauvignon. 

## Dimensionality reduction and clustering

Now we will take the same data and try to glean some information about clusters of similar wines.

First, let's transform the data to a two-dimensional space and see what we can see.

Create a TruncatedSVD object, fit it to the features you created from your TfidfVectorizer above, and use it to transform those features into two new dimensions.  Add the new dimensions as columns called "SVDX" and "SVDY" in the DataFrame that contains all the wine info.

In [64]:
svd = TruncatedSVD(n_components=2)
svd.fit(features)
svd_features = svd.transform(features)

reviews["SVDX"] = svd_features[:, 0]
reviews["SVDY"] = svd_features[:, 1]

There are too many wines in this dataset to effectively visualize in a detailed way, so we will restrict ourselves for the moment to wines in a particular growing region, namely the "Happy Canyon of Santa Barbara" region.  The dataset has a column called `region_1` that indicates the region where each wine is  from.  Make a new DataFrame (called `happy` perhaps) that contains only the wines in the original dataset whose `region_1` is equal to `"Happy Canyon of Santa Barbara"`;

In [66]:
happy = reviews[reviews['region_1'] == 'Happy Canyon of Santa Barbara']

How many wines are in our Happy Canyon set?

In [68]:
happy.shape[0]

150

Make a scatter plot of these Happy Canyon wines using the SVDX and SVDY dimension you just created.  Again, use either `mplcursors` or plain `text` to label each point.  (`mplcursors` is definitely preferred, since otherwise the display will be quite cluttered.)

In [92]:
%matplotlib notebook

ax = happy.plot.scatter(x="SVDX", y="SVDY")
cursor = mplcursors.cursor(ax)
cursor.connect('add', lambda sel: sel.annotation.set_text(happy.title.iloc[sel.target.index]))

<IPython.core.display.Javascript object>

1

**Question**

Based on your plot, are there any obvious clusters of similar wines?

**Answer:**
I'd say that there's some clear obvious clusters of similar wines but the number of clusters can be pretty subjective. For me, I'd say that there's two clear clusters with one at the top and one at the bottom.

Next we would like to use some clustering algorithm to find clusters in the data.  Unfortunately, many of the clustering algorithms available in scikit do not work on sparse data, which is what we get from our TfidfVectorizer.  Since we are now working with a smaller data set (only Happy Canyon wines), it should be okay to "densify" the data.  Take the feature matrix you got from your vectorizer, select out only the rows corresponding to Happy Canyon wines, and convert this to a dense array using `.toarray()`.  These are our "happy features".

In [73]:
happy_features = features[happy.index.values].toarray()

Go to [the scikit documentation](https://scikit-learn.org/stable/modules/clustering.html) on clustering methods and choose one of the methods that requires the number of clusters as a parameter, other than K-means (which we did in class).  We also used agglomerative clustering in class; if you choose that one, you must pass some arguments to it to alter the way it works; in the Shakespeare example we just used the default behavior, but if you look in the documentation for `AgglomerativeClustering` you'll find it has various options to tweak.

Write a loop in which you vary the number of clusters from 2 up to 9.  For each possible number of clusters, fit your chosen method on your "happy features" and compute the silhouette score of the resulting clusters.  Collect this information into a DataFrame and show a plot with the number of clusters on the X axis and the corresponding silhouette score on the Y axis.  (If you like, you can play around with the parameters to your chosen algorithm a bit, you don't have to use the default settings.)

In [128]:
%matplotlib notebook

list_score = []

for i in range(2,10):
    spec = SpectralClustering(n_clusters = i)
    spec.fit(happy_features)
    
    print()
    score = silhouette_score(happy_features, spec.labels_, metric='cosine')
    list_score.append(score)
    
sil_df = pd.DataFrame({'score':list_score, 'n_clusters': list(range(2,10))})

plt.plot(sil_df['n_clusters'], sil_df['score'], linewidth=2.0)
plt.show()











<IPython.core.display.Javascript object>

**Question**

Using your silhouette-score plot, determine what you think is the "right" number of clusters for this dataset.  How did you decide this?

Re-fit your chosen algorithm on the same data, using your chosen number of clusters.  Add a column called `Clust1` to your DataFrame of Happy Canyon wines, containing the cluster label for each wine.

In [131]:
spec = SpectralClustering(n_clusters = 4)
spec.fit(happy_features)

happy['Clust1'] = spec.labels_

Now choose one of the clustering algorithms that does *not* require you to specify the number of clusters ahead of time.  Fit the model on the same data and compute the silhouette score for this new set of clusters.

In [118]:
shift = MeanShift()
shift.fit(happy_features)

silhouette_score(happy_features, hierarchical.labels_, metric='euclidean')

0.02469187910801374

Now add a new column `Clust2` to your Happy Canyon table that contains the cluster labels from this new approach.

In [119]:
happy['Clust2'] = hierarchical.labels_

Now you're going to plot each cluster.  Make two plots.  Each plot will be similar to the plot you did earlier, with the X and Y axes being the SVDX and SVDY columns of your DataFrame.  But now, you'll color the points according to what cluster they're in.

So, do your first plot.  Color the points according to the cluster labels you stored in the `Clust1` column.  (Use `.pyplot.get_cmap` to create an appropriate colormap.)  As usual, use `mplcursors` or `text` to label each point.

In [120]:
%matplotlib notebook

ax = happy.plot.scatter(x="SVDX", y="SVDY", c='Clust1', colormap=pyplot.get_cmap('jet', 7))
cursor = mplcursors.cursor(ax)
cursor.connect('add', lambda sel: sel.annotation.set_text(happy.title.iloc[sel.target.index]))

<IPython.core.display.Javascript object>

1

Now do the same plot, but this time color the points according to the clusters in `Clust2`.  (You'll have to figure out how many clusters the algortihm chose in order to generate a colormap with the right number of colors.)

In [121]:
%matplotlib notebook

ax = happy.plot.scatter(x="SVDX", y="SVDY", c='Clust2', colormap=pyplot.get_cmap('jet', 7))
cursor = mplcursors.cursor(ax)
cursor.connect('add', lambda sel: sel.annotation.set_text(happy.title.iloc[sel.target.index]))

<IPython.core.display.Javascript object>

1

Spend some time looking at the information in your DataFrame, including the wine descriptions, varieties, wineries, etc., as well as the cluster labels.  Show any code you used to select interesting slices of data to ponder.

**Questions**

1. Do the results of your clustering align with the results of your SVD transformation --- that is, are points of the same color close to each other on one or both graphs?  Whether they are or not, explain why you think this happened.
2. Which clustering method do you think gave a better result?  How did you come to this conclusion?
3. Can you explain what you think these clusters "mean"?  That is, can you give a simple characterization like "the cluster of red points represents such-and-such kinds of wine, green represents so-and-so, ..." and so on?  Give the best characterization you can, but explain if you think there are some points that seem to be in the "wrong" cluster.
4. Do you think these clusters would be useful for someone trying to find new wines similar to existing ones they like?  Why or why not?
5. Do you think you could make the clusters more useful by making changes to the process we just went through?  If so, how?

**Answers:**
1. Not really. 

Because evaluating unsupervised learning models (like cluster algorithms) is a more subjective task than evaluating supervised ones, these questions will probably not have simple answers.  Be sure to give a full explanation of your thought process, with reference to any patterns you noticed when looking at the data.

## Topic modeling

For the last part of the assignment, we will try to model "topics" in the wine reviews.  Of course, the topic of all the reviews is wine!  But our goal is to see if we can uncover some subtler dimensions of what the reviews are getting at.  For this part we will go back to using the entire California wines dataset (not just the Happy Canyon wines we were using for clustering in the previous section).

In the "similarity tour" notebook that we did in class, we used Latent Dirichlet Allocation to identify topics.  This time we will use an alternative algorithm called Non-negative Matrix Factorization.  The `NMF` class is also in the `sklearn.decomposition` module, and the process for using it essentially the same as for LDA.  Unlike LDA, NMF works okay with Tf-Idf features, so you don't need to create a separate CountVectorizer; you can use the TfidfVectorizer you created above.

Like LDA, NMF requires us to specify up front the number of topics we are looking for.  That means the first thing to do is think of your favorite number.  If your favorite number isn't between 3 and 10, think of your second-favorite number.  If that's still not between 3 and 10 --- okay, okay, just think of a number between 3 and 10.  (We can choose any number of topics, but it becomes difficult to practically evaluate the topics if there are more than about 10.)

Create an NMF model using the number of topics you chose, and fit it on your wines data (all the wines, not just Happy Canyon).

In [100]:
nmf = NMF(n_components=3).fit(features)

We would like to see which words are associated with each topic.  To do that, we'll make a DataFrame that will hold the topic weight for each word.  Use the `.components_` attribute of your fitted model to get the topic-word association weights, and wrap them into a DataFrame with the word as the index, as we did in class.  Each topic should be represented as one column in your DataFrame.

In [102]:
topics = pd.DataFrame(nmf.components_.T, index=vectorizer.get_feature_names())
topics.head()

Unnamed: 0,0,1,2
0,0.042334,0.020113,0.0
4,0.025277,0.0,0.0
5,0.042461,0.0,0.0
6,0.030399,0.009406,0.0
10,0.091265,0.000647,0.050331


Iterate over the columns of your topic DataFrame, and for each one, sort the words by their weight associated with that topic, and print out the 10 most "important" words for the topic.  Your output should look something like this:

```
Topic1: ['blah', 'these', 'words', 'are', 'important', 'for', 'topic', 'one', 'yeah', 'man']
Topic2: ['and', 'these', 'others', 'are', 'vital', 'to', 'topic', 'two', 'seriously', 'critical']
...
```

In [113]:
for i in range(3):
    print('Topic{}:'.format(i), list(topics[i].sort_values().tail(10).index))

Topic0: ['soft', 'oak', 'blackberry', 'wine', 'cabernet', 'drink', 'dry', 'cherry', 'tannins', 'flavors']
Topic1: ['sweet', 'pineapple', 'lemon', 'wine', 'apple', 'chardonnay', 'vanilla', 'crisp', 'flavors', 'acidity']
Topic2: ['cherry', 'red', 'dried', 'bottling', 'plum', 'fruit', 'aromas', 'nose', 'palate', 'black']


**Questions**

1. Can you explain what each topic "means", or give a description of what each topic is "about"?  Give an explanation as best you can for each topic.
2. Do you think the number of topics you chose was too small, too big, or just right?  Why do you think so?
3. How does the output of the topic modeling here differ from what we saw in class with topic modeling on Wikipedia articles (besides the obvious difference that it has different words because the texts are different)?  Why do you think this is?
4. Do you think this topic modeling would be useful in navigating or making use of wine reviews?  Why or why not?