# Machine learning on wine

**Topics:** Text analysis, linear regression, logistic regression, classification

**Datasets**

- **wine-reviews.csv** Wine reviews scraped from https://www.winemag.com/
- **Data dictionary:** just go [here](https://www.winemag.com/buying-guide/tenuta-dellornellaia-2007-masseto-merlot-toscana/) and look at the page

## The background

You work in the **worst newsroom in the world**, and you've had a hard few weeks at work - a couple stories killed, a few scoops stolen out from under you. It's not going well.

And because things just can't get any worse: your boss shows up, carrying a huge binder. She slams it down on your desk.

"You know some machine learning stuff, right?"

You say "no," but she isn't listening. She's giving you an assignment, the _worst assignment_...

> Machine learning is the new maps. Let's get some hits!
>
> **Do some machine learning on this stuff.**

"This stuff" is wine reviews.

## A tiny, meagre bit of help

You have a dataset. It has some stuff in it:

* **Numbers:**
    - Year published
    - Alcohol percentage
    - Price
    - Score
    - Bottle size
* **Categories:**
    - Red vs white
    - Different countries
    - Importer
    - Designation
    - Taster
    - Variety
    - Winery
* **Free text:**
    - Wine description

# Cleaning up your data

Many of these pieces - the alcohol, the year produced, the bottle size, the country the wine is from - aren't in a format you can use. Convert the numeric ones into actual numbers, and extract the others from the appropriate strings.

In [5]:
import pandas as pd
import re
import statsmodels.formula.api as smf
import numpy as np

In [7]:
df = pd.read_csv('wine-reviews.csv')
df.head()


Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review]
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review]
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13%,750 ml,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review]


In [8]:
df['price'] = df['price'].astype(str)
df['alcohol'] = df['alcohol'].astype(str)
df['date published'] = df['date published'].astype(str)
df['bottle size'] = df['bottle size'].astype(str)
df['appellation'] = df['appellation'].astype(str)
df['wine_name'] = df['wine_name'].astype(str)

In [9]:
df['percentage'] = df.alcohol.str.replace('%', '', regex=False).astype(float)
df['size'] = df['bottle size'].str.replace('ml', '', regex=False).str.strip()
# The country is the last piece of the comma-separated appellation
df['origin'] = df['appellation'].str.split(',').str[-1].str.strip()
# regex=False so '$' is treated as a literal dollar sign, not a regex anchor
df['cost'] = df.price.str.split(',').str[0].str.replace('$', '', regex=False)

In [None]:
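def find_year(text):
    # Take the first 4-digit number starting with 19 or 20, so "No. 11" isn't picked up
    match = re.search(r'\b(?:19|20)[0-9]{2}\b', text)
    return int(match.group()) if match else None
df['year'] = df['wine_name'].apply(find_year)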

In [10]:
df.columns

Index(['url', 'wine_points', 'wine_name', 'wine_desc', 'taster', 'price',
       'designation', 'variety', 'appellation', 'winery', 'alcohol',
       'bottle size', 'category', 'importer', 'date published',
       'user avg rating', 'percentage', 'size', 'origin', 'cost'],
      dtype='object')

In [11]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating,percentage,size,origin,cost
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],14.5,750,Spain,25
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review],13.5,750,US,65
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review],13.5,750,US,25
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review],13.0,750,US,65
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13%,750 ml,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review],13.0,750,Spain,17


In [12]:
winedf = df.filter(['wine_name','wine_desc','designation', 'variety', 'winery','category','percentage','size','origin', 'cost', 'year'], axis=1)

In [13]:
winedf

Unnamed: 0,wine_name,wine_desc,designation,variety,winery,category,percentage,size,origin,cost
0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Viñas de Gain,Tempranillo,Artadi,Red,14.5,750,Spain,25
1,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Stoller Vineyard,Chardonnay,Adelsheim,White,13.5,750,US,65
2,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Ribbon Springs Vineyard,"Auxerrois, Other White",Adelsheim,White,13.5,750,US,25
3,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,No. 11,Pinot Noir,JCB,Red,13.0,750,US,65
4,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",,Albariño,Pazo Pondal,White,13.0,750,Spain,17
...,...,...,...,...,...,...,...,...,...,...
42290,Concannon 2002 Stampmaker's Red Wine Red (Live...,Very fruit forward in cherries and pomegranate...,Stampmaker's Red Wine,Rhône-style Red Blend,Concannon,Red,,750ML,US,24
42291,San Simeon 2001 Merlot (Paso Robles),"Very dry and robust in the mouth, a clean wine...",,Merlot,San Simeon,Red,,750ML,US,22
42292,Torres de Anguix 2003 Tinto (Ribera del Duero),"Black in color and saturated with plum, fruit ...",Tinto,"Tinto del Pais, Tempranillo",Torres de Anguix,Red,13.4,750ML,Spain,10
42293,Villacezan 2001 Doce Meses Red (Vino Tierra de...,"Muddled to start, with chocolate and earth aro...",Doce Meses,"Red Blends, Red Blends",Villacezan,Red,13.5,750ML,Spain,17
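
The `size` column above still mixes values like `750` and `750ML`, so it isn't a number yet. Here is a minimal sketch of one way to normalize it, using only the standard library. The helper name `bottle_size_ml` is made up for illustration, and the assumption that some bottles are listed in liters (for example `1.5 L`) is a guess, not something confirmed by the rows shown here.

```python
import re

# Hypothetical helper: accepts strings such as "750 ml", "750ML" or "1.5 L".
# The liter format is an assumption about what the full dataset may contain.
_SIZE_PATTERN = re.compile(r'^\s*([0-9]*\.?[0-9]+)\s*(ml|l)\s*$', re.IGNORECASE)

def bottle_size_ml(text):
    """Return the bottle size in milliliters, or None if the text isn't recognized."""
    match = _SIZE_PATTERN.match(str(text))
    if match is None:
        return None
    amount = float(match.group(1))
    unit = match.group(2).lower()
    # Convert liters to milliliters; milliliters pass through unchanged
    return amount * 1000 if unit == 'l' else amount

sizes = [bottle_size_ml(s) for s in ['750 ml', '750ML', '1.5 L', 'nan']]
print(sizes)
```

If this fits your data, something like `df['size_ml'] = df['bottle size'].apply(bottle_size_ml)` would give you a numeric column, with `None` wherever the text didn't match.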


## What might be interesting in this dataset?

Maybe start out playing around _without_ machine learning. Here are some thoughts to get you started:

* I've heard that since the 90's wine has gone through [Parkerization](https://www.estatewinebrokers.com/blog/the-parkerization-of-wine-in-the-1990s-and-beyond/), an increase in production of high-alcohol, fruity red wines thanks to the influence of wine critic Robert Parker.
* Red and white wines taste different, obviously, but people always use [goofy words to describe them](https://winefolly.com/tutorial/40-wine-descriptions/)
* Once upon a time in 1976 [California wines proved themselves against France](https://en.wikipedia.org/wiki/Judgment_of_Paris_(wine)) and France got very angry about it
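
The Parkerization idea, for example, can be poked at with a plain `groupby` before any machine learning shows up. This is only a sketch: the rows below are invented for illustration, and it assumes you have already produced a numeric `year` column and the `percentage` column during cleanup.

```python
import pandas as pd

# Made-up rows for illustration only; the real values live in wine-reviews.csv
sample = pd.DataFrame({
    'year': [1998, 1998, 2005, 2005, 2012, 2012],
    'category': ['Red', 'White', 'Red', 'White', 'Red', 'White'],
    'percentage': [13.0, 12.5, 13.5, 12.5, 14.0, 13.0],
})

# Average alcohol per vintage year, with one column per wine category
alcohol_by_year = (
    sample
    .groupby(['year', 'category'])['percentage']
    .mean()
    .unstack('category')
)
print(alcohol_by_year)
```

On the real data you would swap `sample` for your cleaned dataframe and then look at whether the red column drifts upward over time.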

In [14]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating,percentage,size,origin,cost
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],14.5,750,Spain,25
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review],13.5,750,US,65
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review],13.5,750,US,25
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review],13.0,750,US,65
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13%,750 ml,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review],13.0,750,Spain,17


## But machine learning?

Well, you can usually break machine learning down into a few different things. These aren't necessarily perfect ways of categorizing things, but eh, close enough.

* **Predicting a number**
    - Linear regression
    - For example, how does a change in unemployment translate into a change in life expectancy?
* **Predicting a category** (aka classification)
    - Lots of algorithm options: logistic regression, random forest, etc.
    - For example, predicting cuisines based on ingredients
* **Seeing what influences a numeric outcome**
    - Linear regression since the output is a number
    - For example, minority and poverty status on test scores 
* **Seeing what influences a categorical outcome**
    - Logistic regression since the output is a category
    - For example, race and car speed for whether you get a warning vs a ticket
    - For example, wet/dry pavement and car weight for whether you survive a car crash

We have numbers, we have categories, we have all sorts of stuff. **What are some ways we can mash them together and use machine learning?**
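
As a sketch of the "predicting a number" case, here is a tiny linear regression of score on price. The numbers are invented so that the answer is known in advance, and it uses scikit-learn's `LinearRegression` rather than the statsmodels formula API imported earlier; either would do for a quick look.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented numbers: the score rises by exactly 0.1 points per dollar
price = np.array([10, 20, 30, 40, 50], dtype=float).reshape(-1, 1)
score = 85 + 0.1 * price.ravel()

reg = LinearRegression()
reg.fit(price, score)

# The slope says how much the score changes for each extra dollar
slope = reg.coef_[0]
intercept = reg.intercept_
print(slope, intercept)
```

With the real data you would pass a numeric price column and `wine_points` instead, and the slope would tell you how many points one more dollar is associated with.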

### Brainstorm some ideas

Use the categories above to try to come up with some ideas. Be sure to scroll up to where I break down categories vs numbers vs text!

**I'll give you one idea for free:** if you don't have any ideas, start off by creating a classifier that determines whether a wine is white or red based on the wine's description.
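
One possible shape for that classifier, sketched on four invented descriptions: let `CountVectorizer` turn every word into a yes/no column instead of picking words by hand. The example sentences and the variable names are made up for illustration; this is not necessarily the approach the assignment expects.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented descriptions, only to show the mechanics
descriptions = [
    'dark cherry and chocolate with firm tannins',
    'blackberry plum and oak spice',
    'crisp citrus and green apple',
    'floral peach and lemon zest',
]
labels = ['Red', 'Red', 'White', 'White']

# binary=True records whether a word appears, not how many times
vectorizer = CountVectorizer(binary=True)
X_words = vectorizer.fit_transform(descriptions)

text_clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
text_clf.fit(X_words, labels)

vocabulary = sorted(vectorizer.vocabulary_)
train_predictions = text_clf.predict(X_words)
print(vocabulary)
```

On the real dataset you would fit the vectorizer on `df['wine_desc']` (after filling in missing descriptions) and use `df['category']` as the labels.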

In [15]:
# Is the wine red or white based on percentage of alcohol?

In [16]:
# Using description to predict if the wine is from the US

In [17]:
# Predicting who was the taster by the score?

In [18]:
# What year the wine was made based on the score?

You can also go to https://library.columbia.edu and see if you can find some academic papers about wine. I'm sure they'll inspire you! (and they might even have some ML ideas in them you can steal, too)

# Implement 2 of your machine learning ideas

In [19]:
from sklearn.linear_model import LogisticRegression

# White or red based on the wine's description.

In [20]:
# Is the wine white or red based on description
redorwhite = df.filter(['wine_desc','category'])
import numpy as np

In [21]:
redorwhite['red'] = np.where(df['category'].str.contains("Red"), '1', '0')
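
One thing to watch: if `category` has missing values, `str.contains` can hand `NaN` on to `np.where`, and depending on your pandas version those rows may end up labelled as red. A hedged alternative, shown on a made-up series, is to pass `na=False` and convert to integers directly. Note that this gives integer labels rather than the `'1'`/`'0'` strings used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up categories, including one missing value
categories = pd.Series(['Red', 'White', np.nan, 'Red'])

# na=False makes a missing category count as "not red"
is_red = categories.str.contains('Red', na=False).astype(int)
print(is_red.tolist())
```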

In [22]:
redorwhite[redorwhite['category']=='Red']

Unnamed: 0,wine_desc,category,red
0,"Inky, minerally aromas of blackberry, black pl...",Red,1
3,Light in color and lilting floral aromas of ro...,Red,1
6,The two-acre Clos du Chapitre vineyard is in t...,Red,1
7,"Spice, licorice and herbal notes complement re...",Red,1
8,"Full-bodied and fresh, this offfers attractive...",Red,1
...,...,...,...
42290,Very fruit forward in cherries and pomegranate...,Red,1
42291,"Very dry and robust in the mouth, a clean wine...",Red,1
42292,"Black in color and saturated with plum, fruit ...",Red,1
42293,"Muddled to start, with chocolate and earth aro...",Red,1


In [23]:
train_df = pd.DataFrame({
    'red': redorwhite.red,
    'dark': redorwhite.wine_desc.str.contains("dark", na=False).astype(int),
    'full': redorwhite.wine_desc.str.contains("full", na=False).astype(int),
    'dry': redorwhite.wine_desc.str.contains("dry", na=False).astype(int),
    'oak': redorwhite.wine_desc.str.contains("oak", na=False).astype(int),
    'fruit': redorwhite.wine_desc.str.contains("fruit", na=False).astype(int),
    'chocolate': redorwhite.wine_desc.str.contains("chocolate", na=False).astype(int),
    'rich': redorwhite.wine_desc.str.contains("rich", na=False).astype(int),
    'floral': redorwhite.wine_desc.str.contains("floral", na=False).astype(int),

})
train_df

Unnamed: 0,red,dark,full,dry,oak,fruit,chocolate,rich,floral
0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,1,0
2,0,0,0,0,0,1,0,0,0
3,1,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
42290,1,0,0,1,0,1,0,1,0
42291,1,0,0,1,0,0,0,0,0
42292,1,0,0,0,0,1,0,0,0
42293,1,0,0,0,0,1,1,0,0


In [24]:
# features
X = train_df.drop(columns='red')
# labels
y = train_df.red

In [25]:
# our features
X.head()

Unnamed: 0,dark,full,dry,oak,fruit,chocolate,rich,floral
0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,1,0
2,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0


In [26]:
# our labels
y.head()

0    1
1    0
2    0
3    1
4    0
Name: red, dtype: object

In [27]:
# Build a new classifier
# C=1e9 is a magic secret I don't want to talk about
# If we don't say solver='lbfgs' it complains that it's the new default
clf = LogisticRegression(C=1e9, solver='lbfgs')

# Teach the classifier about the wine descriptions we read
clf.fit(X, y)

LogisticRegression(C=1000000000.0)
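
About that `C=1e9`: in scikit-learn, `C` is the inverse of the regularization strength, so a huge value effectively switches the penalty off and leaves the coefficients unshrunk. The sketch below uses synthetic data (nothing to do with wine) to show the effect of a small `C` on coefficient size.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic yes/no features; the label mostly follows the first feature,
# with some noise so the two classes overlap
X_demo = rng.integers(0, 2, size=(200, 3))
noise = rng.random(200) < 0.2
y_demo = np.where(noise, 1 - X_demo[:, 0], X_demo[:, 0])

weak_penalty = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000).fit(X_demo, y_demo)
strong_penalty = LogisticRegression(C=0.01, solver='lbfgs', max_iter=1000).fit(X_demo, y_demo)

# Total coefficient size under each setting
weak_size = np.abs(weak_penalty.coef_).sum()
strong_size = np.abs(strong_penalty.coef_).sum()
print(weak_size, strong_size)
```

The strongly penalized model ends up with much smaller coefficients, which is why a near-infinite `C` is handy when you want to read the coefficients rather than tune predictions.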

In [28]:
# The words we were looking for,
# X were our features, X.columns is the column names
feature_names = X.columns

# Coefficients! Remember this from linear regression?
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values(by='coefficient', ascending=False)

Unnamed: 0,feature,coefficient
5,chocolate,2.646985
0,dark,2.491206
3,oak,0.652921
1,full,-0.042424
4,fruit,-0.051655
2,dry,-0.218562
6,rich,-0.314858
7,floral,-0.879638
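
Logistic regression coefficients are on the log-odds scale, which is hard to read. Exponentiating them gives odds ratios: a value above 1 means the word pushes toward red, below 1 toward not red. The sketch below does that with the coefficients copied from the table above.

```python
import math

# Coefficients copied from the table above
coefficients = {
    'chocolate': 2.646985,
    'dark': 2.491206,
    'oak': 0.652921,
    'full': -0.042424,
    'fruit': -0.051655,
    'dry': -0.218562,
    'rich': -0.314858,
    'floral': -0.879638,
}

# exp(coefficient) turns a log-odds change into an odds ratio
odds_ratios = {word: math.exp(value) for word, value in coefficients.items()}

for word, ratio in odds_ratios.items():
    print(f'{word:>10}: {ratio:.2f}')
```

Read this way, a description mentioning "chocolate" has roughly 14 times the odds of being a red wine, while "floral" cuts the odds by more than half.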


In [29]:
clf.score(X, y)

0.618300035465185

In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
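
By default `train_test_split` shuffles differently on every run, so the scores below will wobble a little each time. A small sketch on made-up data of two optional arguments that may help: `random_state` makes the split repeatable, and `stratify` keeps the class balance the same in both pieces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 60 of class 1 and 40 of class 0
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 60 + [0] * 40)

# Same random_state means the same split both times
split_a = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(len(X_train_demo), len(X_test_demo))
```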

In [31]:
# I'm only letting you learn from 75% of my data
clf.fit(X_train, y_train)

LogisticRegression(C=1000000000.0)

In [32]:
# And I'm going to test you on the other 25%
clf.score(X_test, y_test)

0.6114053338377151

In [33]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not red', 'red'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not red,Predicted red
Is not red,1332,2927
Is red,1182,5133
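
An accuracy of 0.61 sounds okay until you compare it with the laziest possible classifier: one that always says "red". The arithmetic below uses the counts copied from the confusion matrix above.

```python
# Counts copied from the confusion matrix above
not_red_predicted_not_red = 1332
not_red_predicted_red = 2927
red_predicted_not_red = 1182
red_predicted_red = 5133

total = (not_red_predicted_not_red + not_red_predicted_red
         + red_predicted_not_red + red_predicted_red)

# Correct predictions sit on the diagonal of the matrix
accuracy = (not_red_predicted_not_red + red_predicted_red) / total

# How often would "always guess the bigger class" be right?
actual_red = red_predicted_not_red + red_predicted_red
actual_not_red = not_red_predicted_not_red + not_red_predicted_red
majority_baseline = max(actual_red, actual_not_red) / total

print(round(accuracy, 4), round(majority_baseline, 4))
```

Always guessing red would be right about 59.7% of the time on this test set, so the eight hand-picked words only buy a point or two over doing nothing.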


In [34]:
import eli5

feature_names=list(X.columns)
eli5.show_weights(clf, feature_names=feature_names, show=['description', 'feature_importances'])

# Using description to predict if the wine is from the US

In [35]:
points = df.filter(['wine_desc','origin'])

In [36]:
points['fromUS'] = np.where(df['origin'].str.contains("US"), '1', '0')

In [None]:
points.head()

In [38]:
points[points['fromUS']=='1']

Unnamed: 0,wine_desc,origin,fromUS
1,"A tiny production wine, this is rich, tart and...",US,1
2,This is another fine vintage for this rare win...,US,1
3,Light in color and lilting floral aromas of ro...,US,1
5,"Pretty peach in color, this 50-50 sparkling bl...",US,1
9,"Round, savory aromas of orange-cranberry with ...",US,1
...,...,...,...
42286,It's no insult to say this is the perfect fast...,US,1
42289,"Dry, soft and a little rustic, with earthy, be...",US,1
42290,Very fruit forward in cherries and pomegranate...,US,1
42291,"Very dry and robust in the mouth, a clean wine...",US,1


In [39]:
train_df2 = pd.DataFrame({
    'fromUS': points.fromUS,
    'vintage': points.wine_desc.str.contains("vintage", na=False).astype(int),
    'rustic': points.wine_desc.str.contains("rustic", na=False).astype(int),
    'dry': points.wine_desc.str.contains("dry", na=False).astype(int),
    'production': points.wine_desc.str.contains("production", na=False).astype(int),
    'perfect': points.wine_desc.str.contains("perfect", na=False).astype(int),
    'soil': points.wine_desc.str.contains("soil", na=False).astype(int),
    'savory': points.wine_desc.str.contains("savory", na=False).astype(int),
    'sweet': points.wine_desc.str.contains("sweet", na=False).astype(int)
})
train_df2

Unnamed: 0,fromUS,vintage,rustic,dry,production,perfect,soil,savory,sweet
0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,0,0,0,0
2,1,1,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
42290,1,0,0,1,0,0,0,0,1
42291,1,0,0,1,0,0,0,0,0
42292,0,0,0,0,0,0,0,0,0
42293,0,0,0,0,0,0,0,0,0


In [40]:
# features
X = train_df2.drop(columns='fromUS')
# labels
y = train_df2.fromUS

In [41]:
# Build a new classifier
# C=1e9 is a magic secret I don't want to talk about
# If we don't say solver='lbfgs' it complains that it's the new default
clf = LogisticRegression(C=1e9, solver='lbfgs')

# Teach the classifier about the wine descriptions we read
clf.fit(X, y)

LogisticRegression(C=1000000000.0)

In [42]:
# The words we were looking for,
# X were our features, X.columns is the column names
feature_names = X.columns

# Coefficients! Remember this from linear regression?
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values(by='coefficient', ascending=False)

Unnamed: 0,feature,coefficient
3,production,0.920737
1,rustic,0.843966
4,perfect,0.749492
2,dry,0.55701
7,sweet,0.406869
6,savory,0.274701
0,vintage,0.103872
5,soil,-0.78177


In [43]:
clf.score(X, y)

0.5777042203570162

In [44]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [45]:
# I'm only letting you learn from 75% of my data
clf.fit(X_train, y_train)

LogisticRegression(C=1000000000.0)

In [46]:
# And I'm going to test you on the other 25%
clf.score(X_test, y_test)

0.576602988462266

In [47]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not US', 'from US'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not US,Predicted from US
Is not US,4477,1409
Is from US,3068,1620
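
Accuracy hides how lopsided these mistakes are. Precision and recall for the "from US" class, computed from the counts in the matrix above, make it clearer:

```python
# Counts copied from the confusion matrix above
true_negative = 4477   # not US, predicted not US
false_positive = 1409  # not US, predicted from US
false_negative = 3068  # from US, predicted not US
true_positive = 1620   # from US, predicted from US

# Of the wines we called American, how many really were?
precision = true_positive / (true_positive + false_positive)

# Of the wines that really were American, how many did we catch?
recall = true_positive / (true_positive + false_negative)

print(round(precision, 3), round(recall, 3))
```

The classifier only catches about a third of the American wines, so these eight words are not telling it much about where a wine comes from.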
