**Let's learn about wine**

Admittedly, I know nothing about wine. That being said, when I hear experts pontificate over the subtle undertones of a glass of booze, I can't help but wonder whether they really know what they are talking about or if they are simply full of hot air (and cold wine). Through a little bit of exploratory analysis and ML, maybe we can all see whether there are patterns in these reviews and better understand the nuances of fermented grape juice.

The end goal of this project is to help you more accurately choose an excellent bottle of wine.

![](https://pics.me.me/wine-tasting-me-mmmm-firm-robust-flavor-complex-citrusy-but-27060408.png)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
data = pd.read_csv('../input/winemag-data_first150k.csv') #Load the 150k reviews

We see the data is quite simple. We can focus on a couple of important things. 

Points: This represents the score of the wine, out of 100, given by experts. We can use this as our measuring stick for the quality of wines. Sure, it will be subjective, but it's one of the best measures we can get. 

Price: Obviously less important than points, but not negligible. We can see whether pricey wines are best and perhaps find a sweet spot of highly scored wines which are affordable. You can then use that list to impress your friends, family and local sommeliers. 

In [None]:
data.head(5) #Observe the format

In [None]:
import matplotlib.pyplot as plt

In [None]:
from scipy.stats import norm  # matplotlib.mlab.normpdf was removed in newer Matplotlib releases

num_bins = 20
# density=True replaces the removed `normed` argument and scales the histogram to integrate to 1
n, bins, patches = plt.hist(data['points'], num_bins, density=True, facecolor='blue', alpha=0.5)
plt.title('Distribution of Wine Scores')
plt.xlabel('Score out of 100')
plt.ylabel('Density')

mu = 88 # mean of distribution
sigma = 3 # standard deviation of distribution

y = norm.pdf(bins, mu, sigma) # normal density evaluated at the bin edges

plt.plot(bins, y, 'r--')



**Distributions**

The scores follow a roughly normal distribution with a mean of about 88 and a standard deviation of about 3. That suggests a reasonable sample: we don't have polarized reviews or any other odd amalgamation of scores that would be uncharacteristic of the wine selection at your local store.
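The mean and standard deviation in the plotting cell are typed in by hand. A minimal sketch of estimating them from the scores instead is below. The `points` series here is randomly generated stand-in data (an assumption for illustration, not the real reviews); in the notebook you would use `data['points']`.

```python
import numpy as np
import pandas as pd

# Stand-in for data['points']: synthetic scores, NOT the real review data.
rng = np.random.default_rng(0)
points = pd.Series(rng.normal(loc=88, scale=3, size=1000).round().clip(80, 100))

mu = points.mean()      # sample mean
sigma = points.std()    # sample standard deviation (pandas uses ddof=1)
skew = points.skew()    # close to 0 when the distribution is symmetric

print(f"mean={mu:.2f}, std={sigma:.2f}, skew={skew:.2f}")
```

A skew far from zero would be a hint that the bell-curve overlay is flattering the data.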

That all being said, we see that these are all very highly scored wines. They're all above 80. I'm going to go ahead and assume that those are some good to terrific wines. Knowing this, we should understand that some of the words used to describe all of these wines will be similar, seeing as they all seemingly have some merit to them. Or perhaps these wine reviewers are generally positive and tend to at least tolerate whatever is put in front of them. Either way, it's noteworthy. 

In [None]:
plt.scatter(data['points'], data['price'])

**Will a five buck chuck cut it?**

Pricey booze looks like it *is* a little bit better. We would certainly hope so. 

We also see that plenty of fantastic wines sit at reasonable price points. We will make a list of these later so that you know exactly which wines are a 'best value'.

You can thank me later
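To put a number on "a little bit better", one option is a correlation between points and price. This is only a sketch: the rows below are made up for illustration, and in the notebook you would pass `data[['points', 'price']].dropna()` instead. The rank-based (Spearman) version is computed by ranking first, which makes it less sensitive to a handful of very expensive bottles.

```python
import pandas as pd

# Made-up rows for illustration only, not taken from the data set.
sample = pd.DataFrame({
    "points": [82, 85, 87, 88, 90, 92, 95],
    "price":  [10, 12, 18, 15, 30, 45, 120],
})

# Pearson measures linear association on the raw values.
pearson = sample["points"].corr(sample["price"])

# Spearman is Pearson applied to the ranks of each column.
spearman = sample["points"].rank().corr(sample["price"].rank())

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```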

In [None]:
df = data.dropna(subset=['description'])  # drop rows with a missing description

df_sorted = df.sort_values(by='points', ascending=True)  # sort by points

num_of_wines = df_sorted.shape[0]  # number of wines
worst = df_sorted.head(int(0.25*num_of_wines))  # 25 % of worst wines listed
best = df_sorted.tail(int(0.25*num_of_wines))  # 25 % of best wines listed

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',analyzer='word')

In [None]:
X1 = vectorizer.fit_transform(best['description'])
idf = vectorizer.idf_  # inverse document frequency of each term
goodlist = vectorizer.vocabulary_  # dict mapping each term to its column index in X1

**Good and Bad Descriptors** 

We've split off the wines below the 25th percentile and above the 75th percentile of scores. We can now see which words are used to describe good wines and bad ones.

Some of the good words used are bodied, pretty, polished, ripe and complexity, amongst others. There are also some other descriptors which are more literal, such as acidity, cabernet and napa. Those also seem to indicate good wines. However, there are also shared words between good and bad, such as tannins. It should be noted that there are far more mentions of tannins in the good wines though.

Scroll through the list to see which descriptors you find most appealing within the 'good' category, or even test hypotheses of your own.

Also take a look at the bad list to see some words which you should avoid. Mealy, canned, fermented, foamy, excessive and gaseous are some of the words used to describe these wines. I don't know about you, but these all sound pretty yucky to me. 

In [None]:
goodlist

In [None]:
X = vectorizer.fit_transform(worst['description'])
idf = vectorizer.idf_
not_so_good_list = vectorizer.vocabulary_

In [None]:
not_so_good_list

In [None]:
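import operator

# vocabulary_ maps term -> column index, and the indices follow the sorted order of the terms,
# so both lists below come out alphabetical (the bad list in reverse).
sorted_good = sorted(goodlist.items(), key=operator.itemgetter(0))
sorted_bad = sorted(not_so_good_list.items(), key=operator.itemgetter(1), reverse=True)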

*Sorted*

Feel free to read your own wine reviews and then examine the similarities with the sorted list. Use it like a dictionary to look up exactly the word you're after.

In [None]:
sorted_bad

In [None]:
sorted_good

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(best['variety'])

**Let's make a model!**

When you read a wine review, do you struggle to understand if tannins are good or bad? Me too! 

Let's put our minds together and see if we can create a quick model to understand not only what type of wine a review is describing, but also how good it tastes. 

In [None]:
from sklearn.model_selection import train_test_split #get in all our sklearn modules
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(X1, y, test_size=0.25, random_state=10) #split data

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression() # Logistic regression to predict the wine variety from the review text
clf.fit(x_train, y_train)
pred = clf.predict(x_test)

In [None]:
accuracy_score(y_test, pred)

Hey, 69% accuracy isn't bad at all.

It should be noted that we should do a better job of cross-validating and cleaning the data set. First of all, there are duplicates throughout the set. More concerning, some words shouldn't be considered when trying to classify these wines: often the reviewer directly names the type of wine in the review.

Does it take a genius to know a reviewer is discussing a pinot noir when they specifically say 'pinot' in the review?

No. No it does not. 

As a result, our model's real ability to 'understand' which wine is being reviewed is probably somewhat worse than the score suggests.

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

**Scores**

We're going to use ridge regression for this linear model, fitted with the stochastic average gradient ('sag') solver. The ridge penalty adds some regularization, which helps with the sparsity of the massive matrix of words used in the reviews.
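For reference, the objective that ridge regression minimizes can be written as below, following the form given in the scikit-learn documentation. Here $X$ is the matrix of TF-IDF features (one row per review), $y$ is the vector of scores, $w$ is the vector of word coefficients, and $\alpha$ is the regularization strength (0.5 in the cell that follows).

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

A larger $\alpha$ shrinks the coefficients more strongly toward zero, which keeps rare words from getting extreme weights.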


In [None]:
reg = linear_model.Ridge(alpha = 0.5, solver = 'sag')

In [None]:
y = data['points']
x = vectorizer.fit_transform(data['description'])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=32)
reg.fit(x_train, y_train)

In [None]:
pred = reg.predict(x_test)

In [None]:
r2_score(y_test, pred)

Oh baby! Seems like it's actually easier to predict the score.

Maybe this makes sense. It is easier to understand if a wine is good when one says 'perfect' vs 'vulgar.' However, it could be tougher to discern exactly which putrid red wine a reviewer may have had the unfortunate task of tasting, especially when they aren't specific.

In [None]:
best_sorted = best.sort_values(by='price', ascending=True)  # sort by price, cheapest first

num_best = best.shape[0]  # number of wines in the top quartile
cheapestngood = best_sorted.head(int(0.25*num_best))  # cheapest 25 % of the best wines
cheapngoodest = cheapestngood.sort_values(by = 'points', ascending = False)  # highest scores first

**Sweet Spot**

We now are going to show you a few wines that you should love. They are scored highly and priced low. 

We divide into cheapest and best within the category. As you can see, there's a bunch of fluctuation in price. I'd love to fit a formula to show where you can find an even sweeter spot, but that formula would be entirely arbitrary and I'll let you put in your own preferences if you choose to do so.
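If you do want to plug in your own preferences, one possible shape for such a formula is "points minus a penalty for each doubling of price". Everything in this sketch is an assumption for illustration: the bottles are made up, and `points_per_doubling` is a hypothetical knob you would tune to taste.

```python
import numpy as np
import pandas as pd

# Made-up bottles for illustration; use the notebook's `best` DataFrame instead.
wines = pd.DataFrame({
    "name":   ["A", "B", "C", "D"],
    "points": [94, 92, 95, 91],
    "price":  [60.0, 15.0, 250.0, 12.0],
})

# Hypothetical preference: how many points you would give up for each doubling of price.
# Larger values favor cheaper bottles.
points_per_doubling = 1.5

wines["value_score"] = wines["points"] - points_per_doubling * np.log2(wines["price"])
ranked = wines.sort_values("value_score", ascending=False)
print(ranked)
```

Using the logarithm of price means that going from 10 to 20 dollars costs the same penalty as going from 100 to 200, which is one arbitrary but defensible choice.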

In [None]:
cheapestngood.head(10)

In [None]:
cheapngoodest.head(10)

In [None]:
cheapestngood['region_1'].value_counts()

In [None]:
topareas = cheapestngood['region_1'].value_counts().head(10)

**Where should these wines come from?**

California. 

Napa and Sonoma have some of the best wines in the world. Furthermore, it seems like the entire west coast really dominates the global landscape of these top wines. Some Italian and French regions sneak in there, but it is still dominated by local wines. 

Who's got it better than us? Nobody.

In [None]:
topareas

**Takeaways**

Now we know what to look for in terms of descriptors for our wines, as well as great areas and reasonable price points. Awesome. Hopefully now you can go to the liquor store with a bit of a better idea of what you want. 

I understand this isn't very deep insight on wines, but it's a great way to get an introduction into analyzing wines with the confidence of data behind you. 

Cheers

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import euclidean_distances
pd.set_option('display.max_colwidth', 1500)

vectorizer = TfidfVectorizer(stop_words='english',
                     binary=False,
                     max_df=0.95, 
                     min_df=0.15,
                     ngram_range = (1,2),use_idf = False, norm = None)
doc_vectors = vectorizer.fit_transform(data['description'])
print(doc_vectors.shape)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn releases

In [None]:

def comp_description(query, results_number=20):
    """Print the stored description most similar to `query` (results_number is currently unused)."""
    q_vector = vectorizer.transform([query])
    print("Comparable Description: ", query)
    # cosine_similarity accepts sparse input, so doc_vectors does not need to be densified
    similarities = cosine_similarity(q_vector, doc_vectors).ravel()
    best_idx = similarities.argmax()  # position of the closest description
    print("The Review Most similar to the Comparable Description is Description #", best_idx)
    print("Similarity: ", similarities[best_idx])
    if similarities.mean() == 0.0:
        # every similarity is zero: the query shares no vocabulary with any review
        print("No similar descriptions")
    else:
        print(data['description'].iloc[best_idx:best_idx + 1])
                

In [None]:
comp_description("Bright, fresh fruit aromas of cherry, raspberry, and blueberry.Youthfully with lots of sweet fruit on the palate with hints of spice and vanilla.")


In [None]:
comp_description("Delicate pink hue with strawberry flavors; easy to drink and very refreshing. Perfect with lighter foods. Serve chilled.")

In [None]:
comp_description("This wine highlights how the power of Lake County’s Red Hills seamlessly compliments the elegance and aromatic freshness of the High Valley. Aromas of plum, allspice and clove develop into flavors of fresh dark cherry and cedar on the palate. The Red Hills’ fine tannins provide a smoothly textured palate sensation from start to finish. Fresh acidity from the High Valley culminates in a bright finish of cherry with a gentle note of French oak.")

In [None]:
comp_description("On the nose are those awful love-heart candies, but the palate is nothing but Nesquik strawberry powder. This alcoholic Powerade is what gives box wine a bad name. Pair with BBQ chicken")

In [None]:
comp_description("This wine is very bad, do not drink.")

In [None]:
comp_description("This is the best wine I have ever drank")
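`comp_description` accepts a `results_number` argument but only ever reports the single closest review. A minimal sketch of returning the top few matches is below. It uses a made-up list of descriptions and a hypothetical helper named `top_k_similar`, so treat it as one way the idea could be extended rather than part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up descriptions standing in for data['description'].
descriptions = [
    "bright cherry and raspberry fruit with vanilla spice",
    "crisp strawberry flavors, refreshing and light",
    "dark plum, cedar and firm tannins",
    "fresh cherry fruit with a hint of spice",
]

vec = TfidfVectorizer(stop_words="english")
doc_matrix = vec.fit_transform(descriptions)

def top_k_similar(query, k=2):
    """Return (row position, similarity) pairs for the k closest descriptions."""
    sims = cosine_similarity(vec.transform([query]), doc_matrix).ravel()
    top = np.argsort(sims)[::-1][:k]  # positions sorted from most to least similar
    # Skip rows with zero similarity: they share no vocabulary with the query.
    return [(int(i), float(sims[i])) for i in top if sims[i] > 0]

matches = top_k_similar("cherry fruit and spice")
for idx, sim in matches:
    print(f"{sim:.2f}  {descriptions[idx]}")
```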