# Machine learning on wine

**Topics:** Text analysis, linear regression, logistic regression, classification

**Datasets**

- **wine-reviews.csv** Wine reviews scraped from https://www.winemag.com/
- **Data dictionary:** just go [here](https://www.winemag.com/buying-guide/tenuta-dellornellaia-2007-masseto-merlot-toscana/) and look at the page

## The background

You work in the **worst newsroom in the world**, and you've had a hard few weeks at work - a couple stories killed, a few scoops stolen out from under you. It's not going well.

And because things just can't get any worse: your boss shows up, carrying a huge binder. She slams it down on your desk.

"You know some machine learning stuff, right?"

You say "no," but she isn't listening. She's giving you an assignment, the _worst assignment_...

> Machine learning is the new maps. Let's get some hits!
>
> **Do some machine learning on this stuff.**

"This stuff" is wine reviews.

## A tiny, meagre bit of help

You have a dataset. It has some stuff in it:

* **Numbers:**
    - Year published
    - Alcohol percentage
    - Price
    - Score
    - Bottle size
* **Categories:**
    - Red vs white
    - Different countries
    - Importer
    - Designation
    - Taster
    - Variety
    - Winery
* **Free text:**
    - Wine description

# Cleaning up your data

Many of these pieces - the alcohol, the year produced, the bottle size, the country the wine is from - aren't in a format you can use. Convert the ones that are really numbers into numeric types, and extract the others from the strings that contain them.

In [49]:
import pandas as pd
pd.options.display.max_columns = 999

In [2]:
df = pd.read_csv('wine-reviews.csv')

In [3]:
df.head(30)

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review]
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review]
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13%,750 ml,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review]
5,https://www.winemag.com/buying-guide/mumm-napa...,90.0,Mumm Napa 2008 DVX Rosé Sparkling (Napa Valley),"Pretty peach in color, this 50-50 sparkling bl...",Virginie Boone,"$70, Buy Now",DVX Rosé,"Sparkling Blend, Sparkling","Napa Valley, Napa, California, US",Mumm Napa,12.5%,750 ml,Sparkling,,12/1/2014,Not rated yet [Add Your Review]
6,https://www.winemag.com/buying-guide/nuiton-be...,90.0,Nuiton-Beaunoy 2011 Clos du Chapitre Premier C...,The two-acre Clos du Chapitre vineyard is in t...,Roger Voss,"N/A, Buy Now",Clos du Chapitre Premier Cru,Pinot Noir,"Gevrey-Chambertin, Burgundy, France",Nuiton-Beaunoy,13%,750 ml,Red,"Fruit of the Vines, Inc",12/1/2014,Not rated yet [Add Your Review]
7,https://www.winemag.com/buying-guide/trapiche-...,90.0,Trapiche 2012 Broquel Cabernet Sauvignon (Mend...,"Spice, licorice and herbal notes complement re...",Michael Schachner,"$15, Buy Now",Broquel,Cabernet Sauvignon,"Mendoza, Mendoza Province, Argentina",Trapiche,14%,750 ml,Red,The Wine Group,12/1/2014,Not rated yet [Add Your Review]
8,https://www.winemag.com/buying-guide/zonin-201...,90.0,Zonin 2010 Amarone della Valpolicella,"Full-bodied and fresh, this offfers attractive...",Kerin O’Keefe,"$50, Buy Now",,"Red Blends, Red Blends","Amarone della Valpolicella, Veneto, Italy",Zonin,15%,750 ml,Red,Zonin USA,12/1/2014,Not rated yet [Add Your Review]
9,https://www.winemag.com/buying-guide/pali-2012...,90.0,Pali 2012 Cargasacchi Vineyard Pinot Noir (Sta...,"Round, savory aromas of orange-cranberry with ...",Matt Kettmann,"$56, Buy Now",Cargasacchi Vineyard,Pinot Noir,"Sta. Rita Hills, Central Coast, California, US",Pali,13.8%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review]


In [4]:
df.dtypes

url                 object
wine_points        float64
wine_name           object
wine_desc           object
taster              object
price               object
designation         object
variety             object
appellation         object
winery              object
alcohol             object
bottle size         object
category            object
importer            object
date published      object
user avg rating     object
dtype: object

In [5]:
df['alc_percent'] = df.alcohol.str.extract(r'(.+)%$')

In [6]:
df['alc_percent'] = df.alc_percent.astype(float)

In [7]:
df.wine_points = df.wine_points.astype(float)

In [8]:
df['price_wine'] = df.price.str.extract(r'\$(\d*)')

In [9]:
df.price_wine = df.price_wine.astype(float)

In [10]:
# Grab the leading number (e.g. "750" from "750 ml", "1.5" from "1.5 L")
df['bottlesize'] = df['bottle size'].str.extract(r'^([\d.]+)', expand=False)

In [11]:
# df['bottle size'].str.extract(r'^([\d\W]*)L')

In [12]:
df.bottlesize = df.bottlesize.astype(float)

In [13]:
df['bottle size'].value_counts()

750 ml    35457
750ML      6157
375 ml      363
500 ml      160
375ML        52
1.5 L        31
3 L          22
500ML        21
1 L          20
1.5L          4
3L            4
187 ml        3
1L            1
Name: bottle size, dtype: int64

In [14]:
df['date published'] = pd.to_datetime(df['date published'])

In [15]:
# Sizes under 100 are in liters, so convert them to milliliters.
# Assigning through .loc avoids pandas' chained-assignment warning.
df.loc[df.bottlesize < 100, 'bottlesize'] *= 1000


In [16]:
df.bottlesize.value_counts()

750.0     41614
375.0       415
500.0       181
1500.0       35
3000.0       26
1000.0       21
187.0         3
Name: bottlesize, dtype: int64
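
The `< 100` cutoff works here because every liter value in the data is small and every milliliter value is large. A sketch of an alternative that reads the unit explicitly, assuming the only units are `ml` and `L` in any case (the sample values below are made up to mirror the formats in `bottle size`):

```python
import pandas as pd

# Hypothetical sample mirroring the formats seen in `bottle size`
sizes = pd.Series(['750 ml', '750ML', '1.5 L', '3L', '187 ml', None])

# Capture the number and the unit separately, ignoring case and optional spaces
parts = sizes.str.extract(r'(?i)^\s*(?P<amount>[\d.]+)\s*(?P<unit>ml|l)\s*$')

amount = parts['amount'].astype(float)

# Liters become milliliters; milliliter values are left unchanged
is_liters = parts['unit'].str.lower() == 'l'
bottle_ml = amount.where(~is_liters, amount * 1000)

print(bottle_ml.tolist())
```

Unparseable or missing sizes stay as `NaN` instead of being silently scaled.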

In [17]:
df['country'] = df.appellation.str.extract(r'(\w*)$')

In [18]:
df.country.isna().value_counts()

False    42295
Name: country, dtype: int64
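
One catch: the regex above keeps only the last *word* of the appellation, so multi-word countries get cut short (you can see `Zealand` and `Africa` in the price table further down). A minimal sketch of one alternative, assuming the country is always whatever follows the last comma. The New Zealand appellation below is a made-up example; the others appear in the data:

```python
import pandas as pd

appellations = pd.Series([
    'Rioja, Northern Spain, Spain',
    'Marlborough, New Zealand',          # hypothetical example value
    'Franschhoek, South Africa',
    'Sta. Rita Hills, Central Coast, California, US',
])

# Take everything after the last comma rather than the last word
country = appellations.str.split(',').str[-1].str.strip()

print(country.tolist())
```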

## What might be interesting in this dataset?

Maybe start out playing around _without_ machine learning. Here are some thoughts to get you started:

* I've heard that since the 1990s wine has gone through [Parkerization](https://www.estatewinebrokers.com/blog/the-parkerization-of-wine-in-the-1990s-and-beyond/), an increase in the production of high-alcohol, fruity red wines thanks to the influence of wine critic Robert Parker.
* Red and white wines taste different, obviously, but people always use [goofy words to describe them](https://winefolly.com/tutorial/40-wine-descriptions/).
* Once upon a time, in 1976, [California wines proved themselves against France](https://en.wikipedia.org/wiki/Judgment_of_Paris_(wine)), and France got very angry about it.
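
To poke at the Parkerization idea without any machine learning, one option is to track the average alcohol of red wines over time. This is only a sketch on made-up rows: it assumes the cleaned `alc_percent` column and a datetime `date published` column, and it uses the review date as a stand-in for the vintage, which is an assumption rather than something the dataset promises.

```python
import pandas as pd

# Hypothetical rows using the cleaned column names from this notebook
toy = pd.DataFrame({
    'category': ['Red', 'Red', 'White', 'Red', 'Red'],
    'alc_percent': [13.0, 13.5, 12.5, 14.5, 15.0],
    'date published': pd.to_datetime(
        ['2002-06-01', '2002-06-01', '2002-06-01', '2014-12-01', '2014-12-01']),
})

# Keep the reds, then average alcohol within each publication year
reds = toy[toy.category == 'Red']
by_year = reds.groupby(reds['date published'].dt.year)['alc_percent'].mean()

print(by_year)
```

If the yearly averages drift upward on the real data, that would be consistent with the story, though not proof of it.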

In [19]:
df.wine_name[(df.category == 'Red') & (df.alc_percent >= 15)].count()
#& (df['date published'].dt.year <= 1990)

1978

In [20]:
# df[df.alc_percent == 4920]
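
That commented-out lookup hints that at least one row may parse to an impossible alcohol value. A hedged sketch of one way to blank such values out before averaging anything; the 0 to 25 window is my own guess at "plausible", not a rule from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up values, including one obviously broken reading
alc = pd.Series([13.5, 14.0, 4920.0, np.nan, 8.0])

# Assumed plausibility window; the 25% upper bound is a guess
plausible = alc.between(0, 25)

# Values outside the window (and missing ones) become NaN
alc_clean = alc.where(plausible)

print(alc_clean.tolist())
```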

In [21]:
df.wine_name[(df.category == 'White') & (df.alc_percent >= 15)].count()

104

In [22]:
df.wine_points.mean()

88.7770894904835

In [23]:
df.country[df.wine_points > 95].value_counts()

# This is pretty arbitrary, though: the US and France are very close, depending on which wine points threshold you choose

France       198
US           187
Italy         48
Austria       31
Portugal      28
Australia     11
Germany        9
Spain          9
Hungary        4
Argentina      2
England        1
Name: country, dtype: int64
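
Since the comment above admits the threshold is arbitrary, it may help to look at several thresholds side by side. A small sketch on made-up scores; the thresholds and the toy rows are illustrative only:

```python
import pandas as pd

# Hypothetical scores, not taken from the dataset
toy = pd.DataFrame({
    'country': ['France', 'France', 'US', 'US', 'US', 'Italy'],
    'wine_points': [97, 93, 96, 94, 92, 95],
})

# Count wines above each threshold, per country
rows = {}
for threshold in (90, 93, 95):
    top = toy[toy.wine_points > threshold]
    rows[threshold] = top.country.value_counts()

# One row per threshold, one column per country; missing counts become 0
comparison = pd.DataFrame(rows).T.fillna(0).astype(int)

print(comparison[['France', 'US']])
```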

In [24]:
# Average wine price per country

df.price_wine.groupby(df.country).mean().sort_values(ascending=False)

country
Switzerland    116.666667
Hungary         57.285714
England         55.435897
France          55.200147
Canada          41.319149
Germany         39.568142
Italy           38.913871
Armenia         38.500000
US              37.547314
Israel          33.900000
Australia       32.408135
Slovenia        32.000000
Austria         31.654432
Morocco         31.000000
Portugal        30.774132
Spain           30.589208
Turkey          29.200000
Croatia         28.000000
Zealand         27.653527
Mexico          27.409091
Lebanon         26.307692
Africa          25.065678
Uruguay         24.894737
Argentina       24.421794
Chile           23.423633
Luxembourg      22.000000
Greece          21.576923
Georgia         20.250000
Cyprus          19.500000
Brazil          18.250000
India           16.000000
Slovakia        16.000000
Macedonia       15.000000
Peru            15.000000
Moldova         13.750000
Bulgaria        13.600000
Ukraine         13.000000
Romania         12.285714
Koso

In [25]:
# Which has more expensive wines, France or the US?

df.price_wine.median()

# df.price_wine[df.price_wine >= 100].groupby(df.country).count().sort_values(ascending=False)
df.price_wine[df.country == 'US'].mean()

37.547314442799745

In [26]:
df.price_wine[df.country == 'France'].mean()

55.20014716703459
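
Means like these can be pulled around by a handful of very expensive bottles, or by countries with only a few reviews. A sketch of a fuller comparison, assuming the `country` and `price_wine` columns built earlier (the prices below are invented):

```python
import pandas as pd

# Made-up prices: one pricey French bottle, one lone Swiss review
toy = pd.DataFrame({
    'country': ['France', 'France', 'France', 'US', 'US', 'Switzerland'],
    'price_wine': [20.0, 30.0, 400.0, 25.0, 45.0, 120.0],
})

# Count, mean and median together make outliers and tiny samples easier to spot
summary = (toy.groupby('country')['price_wine']
              .agg(['count', 'mean', 'median'])
              .sort_values('median', ascending=False))

print(summary)
```

In this toy table France's mean is far above its median, and Switzerland tops the list on the strength of a single bottle.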

In [63]:
df.taster.value_counts(dropna=False)

NaN                    12489
Roger Voss              7986
Michael Schachner       5119
Paul Gregutt            3646
Joe Czerwinski          2532
Virginie Boone          2058
Kerin O’Keefe           1814
Matt Kettmann           1496
Sean P. Sullivan        1111
Anne Krebiehl MW         957
Jim Gordon               844
Anna Lee C. Iijima       813
Lauren Buzzeo            528
Susan Kostrzewa          492
Alexander Peartree       132
Jeff Jenssen              85
Mike DeSimone             61
Christina Pickard         59
Carrie Dykes              39
Fiona Adams               24
Marshall Tilden III       10
Name: taster, dtype: int64

In [71]:
# Label each taster by first name (rows with no taster stay NaN)
male_names = ['Roger', 'Michael', 'Paul', 'Joe', 'Matt', 'Sean',
              'Jim', 'Alexander', 'Jeff', 'Mike', 'Marshall']
female_names = ['Virginie', 'Kerin', 'Anne', 'Anna', 'Lauren',
                'Susan', 'Christina', 'Carrie', 'Fiona']

first_name = df.taster.str.split().str[0]
df.loc[first_name.isin(male_names), 'gender'] = 'Male'
df.loc[first_name.isin(female_names), 'gender'] = 'Female'

df.sample(10)

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating,alc_percent,price_wine,bottlesize,country,male,gender
14433,https://www.winemag.com/buying-guide/talamonti-2005-rose-cerasuolo-italian-red-montepulciano-abruzzo/,86.0,Talamonti 2005 Rosé Cerasuolo Montepulciano (Abruzzo),"Red berry fruit, raspberry and almonds characterize the nose but there's also a hint of sherry that may denote overripe fruit. The wine is lean but vibrant in the mouth and would pair well with informal foods.",,"$9, Buy Now",Rosé Cerasuolo,"Montepulciano, Italian Red","Abruzzo, Central Italy, Italy",Talamonti,13%,750 ml,Rose,"Slocum & Sons, Inc",2007-07-01,Not rated yet [Add Your Review],13.0,9.0,750.0,Italy,,
4432,https://www.winemag.com/buying-guide/bridgman-1999-merlot-columbia-valley-wa/,84.0,Bridgman 1999 Merlot (Columbia Valley (WA)),"A big, aggressively oaky aroma, smelling like a stroll through a sawmill, leads into a wine whose simple fruit is simply overmatched by the wood. Tough, dry and tannic through the finish.",Paul Gregutt,"$17, Buy Now",,Merlot,"Columbia Valley (WA), Columbia Valley, Washington, US",Bridgman,,750ML,Red,,2002-06-01,Not rated yet [Add Your Review],,17.0,750.0,US,True,Male
41830,https://www.winemag.com/buying-guide/banfi-2003-centine-sangiovese-toscana/,87.0,Banfi 2003 Centine Sangiovese (Toscana),"A succulent, chewy Sangiovese, Merlot and Cabernet Sauvignon blend with a pretty ruby color and intense aromas of coffee, tar, leather and toasted wood. The tannins are still a bit raw and beg for hearty meat. Tightly packed cherry and blackberry linger over a long finish. Castello Banfi performs the extraordinary vintage after vintage: It's almost a one million case per year winery and it continues to offer excellent quality on its lowest priced products.",,"$11, Buy Now",Centine,Sangiovese,"Toscana, Tuscany, Italy",Banfi,12.8%,750ML,Red,Banfi Vintners,2005-11-01,Not rated yet [Add Your Review],12.8,11.0,750.0,Italy,,
1674,https://www.winemag.com/buying-guide/south-coast-2014-winemakers-signature-collection-sauvignon-blanc-south-coast-temecula-valley/,86.0,South Coast 2014 Winemaker's Signature Collection Sauvignon Blanc (Temecula Valley),"Tropical ripeness shows strong on the nose of this wine, which is made from the Musque clone out of the Carter Estate Vineyard. It shows aromas of Juicy Fruit gum, papaya-orange-guava juice and daffodils. The palate offers a honeyed red-apple flavor.",Matt Kettmann,"$16, Buy Now",Winemaker's Signature Collection,Sauvignon Blanc,"Temecula Valley, South Coast, California, US",South Coast,13.6%,750 ml,White,,2015-08-01,Not rated yet [Add Your Review],13.6,16.0,750.0,US,True,Male
41565,https://www.winemag.com/buying-guide/leeuwin-estate-2007-art-series-cabernet-sauvignon-margaret-river/,91.0,Leeuwin Estate 2007 Art Series Cabernet Sauvignon (Margaret River),"A textbook example of Margaret River Cabernet Sauvignon, the 2007 Art Series offers perfumed cassis fruit just tinged with mint and tobacco. There's ample body yet classic Cabernet restraint, so it's not jammy or overdone. The tannins are supple enough to make it enjoyable now, yet sufficient to see it through 2018, at least.",Joe Czerwinski,"$45, Buy Now",Art Series,Cabernet Sauvignon,"Margaret River, Western Australia, Australia",Leeuwin Estate,14%,750 ml,Red,Old Bridge Cellars,2012-07-01,Not rated yet [Add Your Review],14.0,45.0,750.0,Australia,True,Male
20207,https://www.winemag.com/buying-guide/matetic-2007-shiraz-syrah-syrah-san-antonio/,93.0,Matetic 2007 Syrah (San Antonio),"Inky, penetrating and bursting with black-fruit aromas as well as coffee, mocha and pastry notes. The palate is perfectly lush and deep, with smooth tannins and proper acidity propelling jammy, lovely berry flavors. Dark, smoky and rubbery on the finish, with a streamlined tail. One of the best Chilean Syrahs. Drink now through 2014; only 150 cases made.",Michael Schachner,"$86, Buy Now",,Syrah,"San Antonio, Chile",Matetic,14%,750 ml,Red,Quintessential Wines,2010-12-31,Not rated yet [Add Your Review],14.0,86.0,750.0,Chile,True,Male
4573,https://www.winemag.com/buying-guide/bortoli-nv-emeri-moscato-sparkling-south-eastern/,83.0,De Bortoli NV Emeri Moscato Sparkling (South Eastern Australia),"Rather restrained for a Muscat-based wine, with modest citrus notes dominating this soft, off-dry sparkler. Simple, but clean and well made.",Joe Czerwinski,"$13, Buy Now",Emeri Moscato,"Sparkling Blend, Sparkling","South Eastern Australia, Australia Other, Australia",De Bortoli,8%,750 ml,Sparkling,De Bortoli Wines USA Inc,2010-12-31,Not rated yet [Add Your Review],8.0,13.0,750.0,Australia,True,Male
7114,https://www.winemag.com/buying-guide/bava-2003-piano-alto-barbera-barbera-dasti-superiore/,89.0,Bava 2003 Piano Alto (Barbera d'Asti Superiore),"Despite the infamous heat of the 2003 vintage, Piano Alto Barbera d'Asti Superiore is a surprising and impressive wine. There are pretty linear notes of black stone and granite backed by some ripe berry and blackberry. The wine seems younger than it really is.",,"$33, Buy Now",Piano Alto,Barbera,"Barbera d'Asti Superiore, Piedmont, Italy",Bava,14%,750 ml,Red,Wine Wave,2010-09-01,Not rated yet [Add Your Review],14.0,33.0,750.0,Italy,,
14942,https://www.winemag.com/buying-guide/arissa-jane-2007-pinot-noir-sonoma-coast/,85.0,Arissa Jane 2007 Pinot Noir (Sonoma Coast),"Too sharp in acidity, with an aggressive mouthfeel. Other than that, the cherry and sandalwood flavors are fine, and the wine is dry and silky. But that tartness is a turnoff.",,"$35, Buy Now",,Pinot Noir,"Sonoma Coast, Sonoma, California, US",Arissa Jane,14.5%,750 ml,Red,,2010-08-01,Not rated yet [Add Your Review],14.5,35.0,750.0,US,,
30140,https://www.winemag.com/buying-guide/boekenhoutskloof-2004-chocolate-block-red-franschhoek/,88.0,Boekenhoutskloof 2004 Chocolate Block Red (Franschhoek),"The name is enticing enough--and in the glass, this red blend from innovator Marc Kent delivers with style and grace. Spice and, appropriately, chocolate waft on the nose, and on the palate, the wine is rich, structured and full of red berry, cocoa and pepper flavors. Syrah, Grenache Noir, Cabernet Sauvignon, Viognier and Cinsault offer ripeness and a juicy character.",Susan Kostrzewa,"$39, Buy Now",Chocolate Block,"Red Blends, Red Blends","Franschhoek, South Africa",Boekenhoutskloof,,750 ml,Red,Vineyard Brands,2007-07-01,Not rated yet [Add Your Review],,39.0,750.0,Africa,,Female


In [64]:
df.gender

## But machine learning?

Well, you can usually break machine learning down into a few different things. These aren't necessarily perfect ways of categorizing things, but eh, close enough.

* **Predicting a number**
    - Linear regression
    - For example, how does a change in unemployment translate into a change in life expectancy?
* **Predicting a category** (aka classification)
    - Lots of algorithm options: logistic regression, random forest, etc.
    - For example, predicting cuisines based on ingredients
* **Seeing what influences a numeric outcome**
    - Linear regression since the output is a number
    - For example, minority and poverty status on test scores 
* **Seeing what influences a categorical outcome**
    - Logistic regression since the output is a category
    - For example, race and car speed for whether you get a warning vs a ticket
    - Or wet/dry pavement and car weight for whether you survive a car crash

We have numbers, we have categories, we have all sorts of stuff. **What are some ways we can mash them together and use machine learning?**

### Brainstorm some ideas

Use the categories above to try to come up with some ideas. Be sure to scroll up to where I break down categories vs numbers vs text!

**I'll give you one idea for free:** if you don't have any ideas, start off by creating a classifier that determines whether a wine is white or red based on the wine's description.

----------------------------------------------------------------------

Ideas:

- Classifier for whether wine is white or red based on description
- Cost of the wine based on alcohol content (linear regression?)
- Whether wines from European countries get more wine points than wines from African/Middle Eastern countries
- Relationship between gender of taster and words in description (fruity vs smoky, etc.)
- Relationship between gender of taster and wine points (logistic regression?)
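
As a rough illustration of the "cost based on alcohol content" idea, here is a minimal sketch with scikit-learn's `LinearRegression`. It assumes the cleaned `alc_percent` and `price_wine` columns; the toy numbers are invented so the relationship is perfectly linear, which real prices will not be.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up rows; the last one has a missing alcohol value
toy = pd.DataFrame({
    'alc_percent': [12.0, 13.0, 14.0, 15.0, None],
    'price_wine': [10.0, 20.0, 30.0, 40.0, 25.0],
})

# The regression cannot handle missing values, so drop incomplete rows first
clean = toy.dropna(subset=['alc_percent', 'price_wine'])

model = LinearRegression()
model.fit(clean[['alc_percent']], clean['price_wine'])

# Estimated change in price for one extra percentage point of alcohol
slope = model.coef_[0]
print(round(slope, 2))
```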




In [27]:

pd.options.display.max_colwidth = 200000

In [73]:
df[['wine_desc','category']].sample(50)

Unnamed: 0,wine_desc,category
6445,"The reserve from this well-known winery is, as you would expect, a textbook version of Washington Syrah, the state's newest love affair. Dense and dark, with smooth cherry/chocolate flavors across the palate, it comes on full and soft, seductive. There are spicy berry highlights, very dry tannins, and lots of chocolatey oak.",Red
12179,"Red Chassagne-Montrachet used to be found more than white in the past. This wine has a ripe, earthy character that brings out structure and tannin as well as strawberry fruits. It has a juicy as well as smoky character with spice and solid tannins in the background. Drink from 2019.",Red
39702,"The winery's high-end cuvée, this is a five-barrel blend, three of the barrels new. Dancing in cardamom and dark cherry, the background ripe and toasty, the flavors are exotic and long lingering. There's a tension and intensity to this wine that bests speaks to aging, or a full decant; either way, it'll be delicious.",Red
30453,"Scents of roses and white pepper are wrapped into a dark, tannic, sturdy red that opens out and miraculously expresses itself with a strong floral component despite its muscularity. This is unique, distinctive and delicious, with tight spice. It has a big front and a quick finish.",Red
34059,"Aromas suggest pastry crust, almonds, spice, and lemon zest. The vibrant palate doles out Golden Delicious apple, white grapefruit, clove and toasted walnut alongside smooth, persistent bubbles.",Sparkling
3984,"This wine leads with enticing floral and earthy fragrances, and just a hint of meaty character. The palate is a bit closed, but reveals sweet black cherry, plum and spicy pepper notes. This is still very tannic and will be better after 2015.",Red
15817,"Steel and stones are brightened by tones of lemon and grapefruit in this dry-style wine. It's medium bodied but texturally quite rich, lending a satisfying sensation of volume that adds impact to the finish.",White
21962,"All from Rosebud Vineyard, this moderately aromatic wine brings notes of herbs and pineapple, with Sémillon (15%) adding fig to the mix. The concentration doesn't seem all there and it needs a bit more acid to stand it up.",White
26550,"This wine is half Viognier with the rest Grenache Blanc and Roussanne. It offers aromas of almond, lees and peach. It's unctuous in feel but lighter in style with a touch of bitterness that distracts.",White
23043,"Cab Franc from Southern Oregon may not be on your radar, but this bottle suggests it should be. Violets and leather perfume the nose, introducing soft, round purple fruits highlighted with cinnamon and licorice. Maybe not technically perfect, but what a mouthful of delicious flavors.",Red


In [74]:
df.category.value_counts(dropna=False)

Red            25392
White          11922
Sparkling       2519
Rose            1329
Dessert          744
Port/Sherry      358
Fortified         31
Name: category, dtype: int64

In [30]:
# df['words'] = df.wine_desc.str.split(' ')
# df.words.value_counts().head(30)

You can also go to https://library.columbia.edu and see if you can find some academic papers about wine. I'm sure they'll inspire you! (and they might even have some ML ideas in them you can steal, too)

# Implement 2 of your machine learning ideas

1. Red or white based on description

In [78]:
# Treat Port/Sherry as red, alongside the Red category itself
df['is_red'] = df.category.isin(['Red', 'Port/Sherry']).astype(int)

In [80]:
df.head(1)

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating,alc_percent,price_wine,bottlesize,country,male,gender,is_red
0,https://www.winemag.com/buying-guide/artadi-2011-vinas-gain-tempranillo-rioja/,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black plum and coconut filter into a round, fluffy palate that's friendly and pure but not very dense or structured. Baked flavors of molasses and gamy berry finish mild and easy.",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,2014-12-01,Not rated yet [Add Your Review],14.5,25.0,750.0,Spain,True,Male,1


In [91]:
df.wine_desc = df.wine_desc.fillna('')

In [92]:
df.wine_desc.isna().value_counts()

False    42295
Name: wine_desc, dtype: int64

In [None]:
# train_df = pd.DataFrame({
#     'is_red': df.is_read,
#     'knife': df.DO_NARRATIVE.str.contains("KNIFE", na=False).astype(int) | df.DO_NARRATIVE.str.contains("KNIVE", na=False).astype(int),
#     'gun': df.DO_NARRATIVE.str.contains("GUN", na=False).astype(int),
#     'shootingshot': df.DO_NARRATIVE.str.contains("SHOOT", na=False).astype(int) | df.DO_NARRATIVE.str.contains("SHOT", na=False).astype(int),
#     'struck': df.DO_NARRATIVE.str.contains("STRUCK", na=False).astype(int) | df.DO_NARRATIVE.str.contains("STRIKE", na=False).astype(int),
#     'stab': df.DO_NARRATIVE.str.contains("STAB", na=False).astype(int),

In [93]:
from sklearn.feature_extraction.text import CountVectorizer

# Make a vectorizer
vectorizer = CountVectorizer()

# Learn and count the words in df.wine_desc
matrix = vectorizer.fit_transform(df.wine_desc)

# Convert the matrix of counts to a dataframe
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
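
A dense dataframe is handy for peeking at the vocabulary, but scikit-learn models can also be trained straight on the sparse counts. Here is a minimal sketch of the red-vs-white classifier, assuming a tiny made-up corpus in place of `df.wine_desc` and `df.is_red`; pairing `CountVectorizer` with `LogisticRegression` is just one reasonable choice, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus; real descriptions would come from df.wine_desc
descriptions = [
    'dark cherry and firm tannins with smoky oak',
    'blackberry, plum and chewy tannins',
    'crisp lemon and grapefruit with steely minerality',
    'peach and green apple, bright citrus acidity',
]
is_red = [1, 1, 0, 0]

# The pipeline counts words and fits the classifier on the sparse counts
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(descriptions, is_red)

# Classify a new, unseen description
prediction = clf.predict(['ripe plum with grippy tannins'])
print(prediction)
```

On the real data you would hold out some reviews for testing before trusting any accuracy number.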

In [94]:
words_df.head()

Unnamed: 0,000,002,01,01s,02,02s,03,030,03s,04,04s,05,05s,06,061,064,06s,07,07s,08,09,09s,10,100,1000,100g,100th,101,103,104,105,107,10th,11,110,114,115,117,12,123,125,126,127,1290,12th,13,130,130th,132,134,136,1396,13g,13th,14,140,145,148,14th,15,150,1500,150th,151,1522,153,154,155,159g,15th,16,160,1610,1622,165,1667,166th,1674,1695,16g,16th,17,170,1700,1700s,1709,170g,172,174,175,1756,1763,1787,1789,179,1791,17th,18,180,1800,1800s,1806,1845,1847,185,1850s,1855,1858,1860s,1865,1880,1882,1886,1889,1890s,1892,1893,1894,18th,19,190,1900,1900s,1901,1904,1905,1908,1909,1910,1912,1913,1914,1915,1918,192,1920,1920s,1922,1924,1927,1929,1930,1930s,1932,1934,1935,1936,1939,1940,1940s,1943,1944,1945,1946,1947,194g,1950,1950s,1951,1954,1955,1957,1958,1960s,1964,1965,1967,1969,197,1970,1970s,1971,1972,1973,1974,1975,1976,1978,1979,1980,1980s,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1990s,1991,1992,1993,1994,1995,1996,1997,1998,1999,19th,20,200,2000,2000s,2001,2001s,2002,2002s,2003,2003s,2004,2004s,2005,2005s,2006,2006s,2007,2007s,2008,2008s,2009,2009s,2010,2010s,2011,2011s,2012,2012s,2013,2014,2014s,2015,2015s,2016,2016s,2017,2018,2019,2020,20206,2020s,2021,2022,2023,2024,2025,2026,2027,2028,2029,2030,2031,2032,2033,2034,2035,2036,2037,2038,2039,204,2040,2041,2043,2044,2045,2048,2050,208,20th,21,210,212,215,217,21st,22,220,225,23,230,235,24,240,246,25,250,2500,252,25th,26,260,266,27,270,274,275,277,28,280,283,287,288,29,2923,299,2a,2l,2t,30,300,301,30th,31,312,32,320,325,33,34,34south,35,350,35th,36,360,365,37,375,38,380,380th,387,389,39,3bb,3g,3rd,40,400,400l,407,40th,41,416,42,420,422,424,425,43,430,44,45,450,46,47,471,474,476,48,486,49,496,50,500,500l,500th,50th,51,510,52,53,536,54,544,55,550,56,563,57,58,59,5l,60,600,60th,61,62,63,64,65,650,66,667,67,68,69,70,700,705,707,70g,71,72,73,74,75,750,76,77,777,78,780,79,7g,80,800,800th,80s,81,82,83,84,85,850,86,867,87,88,89,8th,90,900,90s,90th,91,92,93,94,95,950,95s,96,97,98,984,99,9g,_mocha,a1,aa,aand,aaron,aas,abacela,a
badal,abandon,abandoning,abbey,abbinare,abbott,abbreviated,abc,abeille,abeilles,abeja,abel,abelé,abernathy,abeyance,ability,able,ably,abnormal,abnormally,aboard,abound,abounding,abounds,abouriou,about,abovde,above,abraham,abrasive,abrasively,abreu,abrie,abrupt,abruptly,abruzzo,absence,absent,absolute,absolutely,absorb,...,west,westerly,western,westernmost,westland,westside,wet,wetness,wets,wetstone,weygandt,weyrich,whack,whacked,whafts,whallop,wham,whammy,what,whatever,whatsoever,wheat,wheaty,wheel,wheeler,when,where,whereas,wherever,whet,whether,whets,whew,which,whidbey,whie,whiff,whiffs,whiile,while,whimsical,whimsically,whip,whipped,whipping,whirl,whirls,whirlwind,whiskey,whisp,whisper,whisperer,whispering,whispers,whispy,whistle,whistles,whistling,white,whitecraft,whitepepper,whites,who,whoever,whole,wholemeal,wholesome,wholesomely,wholesomeness,wholly,whom,whopper,whopping,whose,why,wich,wichmann,wicked,wickedly,wide,widely,widen,widening,widens,wider,widespread,widest,widow,widowmaker,width,wields,wiemer,wiener,wieninger,wiese,wiest,wife,wild,wildbacher,wildcat,wilder,wilderness,wildfire,wildflower,wildflowers,wildland,wildly,wildman,wildness,wilds,wildwood,wilhelm,wilkins,will,willakenzie,willamette,willard,willed,willenborg,willful,willi,william,williams,willing,willow,willows,willowy,wills,wilson,wilted,wilting,wimbledon,wimpy,win,winc,wind,windblown,winderlea,winding,windmill,windmills,window,windows,windrow,winds,windswept,windthrow,windy,wine,winebow,winegrape,winegrower,winelink,winemaker,winemakers,winemaking,winemonger,wineries,winery,wines,winesellers,wings,winner,winners,winning,wins,winsome,winter,wintergreen,winters,wintertime,wintery,winy,winzer,wipe,wiped,wipes,wire,wires,wiry,wisdom,wise,wisely,wish,wishes,wishing,wishy,wisp,wisps,wispy,wisteria,wit,with,withdrawing,withdrawn,wither,withered,withholding,within,without,withstand,witness,witnessing,wits,wizard,wizardry,wne,wobbles,wohler,woken,wolf,wolfe,wolff,wolffer,wolfgang,wolves,woman,women,
won,wonder,wondered,wonderful,wonderfully,wondering,wonders,wondrous,wondrously,wonï,wood,woodbridge,wooded,wooden,woodiness,woodland,woodpile,woods,woodsap,woodshop,woodsmoke,woodspice,woodspices,woodsy,woodward,woody,wool,woolly,wooly,woop,worcestershire,word,wordenon,words,work,workaday,worked,worker,workers,workhorse,workhorses,working,workmanship,workmen,works,world,worldly,worlds,worldwide,worn,worries,worry,worse,worshipping,worst,worth,worthiness,worthwhile,worthy,would,wouldn,wound,woven,wow,wowing,wows,wowy,wrangle,wrangled,wrap,wraparound,wrapped,wrapper,wrapping,wraps,wrath,wrattonbully,wreath,wreathed,wreck,wrecking,wrest,wrestle,wrestles,wright,wrinkle,wrinkles,write,writer,writers,written,wrong,wrote,wrought,wth,wunderkind,wurtele,wvv,wwith,wyeast,wädenswil,wölffer,würzgarten,xanadu,xarel,xarello,xavier,xil,ximenez,ximénez,xinomavro,xiv,xix,xy,yadkin,yakima,yalumba,yam,yamhill,yang,yangarra,yannick,yard,yards,yarra,yarrow,yattarna,yauquén,yeah,yealands,year,yearly,yearning,yearns,years,yeas,yeast,yeastiness,yeasts,yeasty,yecla,yellow,yellowbird,yellowing,yellowish,yellowwood,yeoman,yep,yering,yes,yesterday,yesteryear,yet,yianni,yiannis,yield,yielded,yielding,yields,yikes,yin,ynez,yogev,yogurt,yogurty,yoking,yonne,yorba,york,yorker,yorkville,yost,you,young,younger,youngest,youngs,youngster,yountville,your,yours,yourself,youth,youthful,youthfully,youthfulness,yquem,ysios,yum,yumminess,yummy,yung,yup,yuzu,yves,yvon,zabaco,zac,zaca,zachary,zactly,zaftig,zamora,zancanella,zantho,zanzibar,zap,zaps,zd,zealand,zellenberg,zeller,zelma,zemmer,zerba,zerbina,zero,zest,zestier,zestiness,zesty,zibibbo,zieregg,zierfandler,zigzags,zimmermann,zin,zincks,zinfanatics,zinfandal,zinfandel,zinfandels,zinfully,zing,zingarelli,zinger,zinginess,zinging,zings,zingy,zinniness,zinny,zins,zio,zip,zipolo,zippiest,zippiness,zipping,zippy,zips,zlahtina,zocker,zone,zones,zonin,zonked,zooms,zoppega,zork,zorzettig,zotovich,zuccardi,zucchini,zull,zuri,zweigelt,zwerithaler,zwiegelt,zédé,
zéro,½seasoningï,½t,àmaurice,élevage,émilion,épernay,étalon,über,überbest,ürziger
[Output truncated: first five rows (index 0 to 4) of the word-feature table, almost entirely zeros]


In [108]:
# df[df.wine_desc.str.contains('Wrath')]

In [109]:
X = words_df
y = df.is_red

In [110]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [111]:
from sklearn.tree import DecisionTreeClassifier 
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=5)

In [112]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not red', 'red'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not red,Predicted red
Is not red,4024,164
Is red,1515,4871


In [113]:
import eli5

# scikit-learn 1.0+ uses get_feature_names_out(); get_feature_names() was removed in 1.2
eli5.show_weights(clf, feature_names=list(vectorizer.get_feature_names_out()))

Weight,Feature
0.3327,tannins
0.2240,cherry
0.1752,black
0.1155,berry
0.0999,blackberry
0.0359,rosé
0.0056,peach
0.0033,pink
0.0031,pear
0.0019,honey


Using LinearSVC

In [114]:
from sklearn.svm import LinearSVC 
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

LinearSVC(max_iter=10000)

In [115]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not red', 'red'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not red,Predicted red
Is not red,3938,250
Is red,219,6167
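
The two confusion matrices are easier to compare once each is reduced to a few summary numbers. This is a minimal sketch with the counts copied by hand from the outputs above; `summarize` is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np

# Counts copied from the two confusion matrices above.
# Rows are the true class (not red, red); columns are the predicted class.
tree_matrix = np.array([[4024, 164],
                        [1515, 4871]])
svc_matrix = np.array([[3938, 250],
                       [219, 6167]])

def summarize(matrix):
    """Return accuracy plus recall and precision for the 'red' class."""
    tn, fp = matrix[0]
    fn, tp = matrix[1]
    return {
        'accuracy': (tp + tn) / matrix.sum(),
        'red recall': tp / (tp + fn),
        'red precision': tp / (tp + fp),
    }

for name, matrix in [('decision tree', tree_matrix), ('LinearSVC', svc_matrix)]:
    stats = summarize(matrix)
    print(name, {key: round(float(value), 3) for key, value in stats.items()})
```

On these counts the decision tree labels about 84% of the test reviews correctly but misses roughly a quarter of the reds (1,515 of 6,386), while LinearSVC gets about 96% right overall.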


In [116]:
import eli5

# scikit-learn 1.0+ uses get_feature_names_out(); get_feature_names() was removed in 1.2
eli5.show_weights(clf, feature_names=list(vectorizer.get_feature_names_out()))

Weight?,Feature
+3.117,sublte
+1.704,legs
+1.680,fold
+1.616,weaken
+1.611,ansonica
+1.545,eleganty
+1.516,tawny
+1.501,madeira
+1.480,zone
+1.404,sleekly


In [117]:
## I don't know how to remove irrelevant words here, like "predominantly"?

2. I'm using logistic regression here because I think it makes the most sense: gender ~ wine points

In [128]:
# 1 if the taster is male, 0 otherwise; rows with unknown gender stay NaN
df['is_male'] = (df.gender.dropna() == 'Male').astype(int)

In [129]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating,alc_percent,price_wine,bottlesize,country,male,gender,is_red,is_male
0,https://www.winemag.com/buying-guide/artadi-2011-vinas-gain-tempranillo-rioja/,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black plum and coconut filter into a round, fluffy palate that's friendly and pure but not very dense or structured. Baked flavors of molasses and gamy berry finish mild and easy.",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,2014-12-01,Not rated yet [Add Your Review],14.5,25.0,750.0,Spain,True,Male,1,1.0
1,https://www.winemag.com/buying-guide/adelsheim-2012-stoller-vineyard-chardonnay-willamette-valley-dundee-hills/,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Dundee Hills),"A tiny production wine, this is rich, tart and vividly fruity. The generous mix of citrus, apple and peach fruit is augmented by barrel fermentation flavors of toasted hazelnuts, caramel and baking spices. Though made with traditional Burgundian techniques, the flavors are New Worldly, bright and fruit-driven.",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,2014-12-01,Not rated yet [Add Your Review],13.5,65.0,750.0,US,True,Male,0,1.0
2,https://www.winemag.com/buying-guide/adelsheim-2013-ribbon-springs-vineyard-other-white-auxerrois-willamette-valley-ribbon-ridge/,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerrois (Ribbon Ridge),"This is another fine vintage for this rare wine. It's loaded with cool climate, mineral-laced scents of grapefruit, kiwi and melon. A whiff of fennel adds further interest. Super refreshing and a nice change from the ordinary sipping whites.",Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,2014-12-01,Not rated yet [Add Your Review],13.5,25.0,750.0,US,True,Male,0,1.0
3,https://www.winemag.com/buying-guide/jcb-2011-no-11-pinot-noir-sonoma-coast/,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),"Light in color and lilting floral aromas of rose, this is an inviting cool-climate Pinot Noir swirling in equal parts strawberry and spice, subtle and sophisticated.",Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13%,750 ml,Red,,2014-12-01,Not rated yet [Add Your Review],13.0,65.0,750.0,US,,Female,1,0.0
4,https://www.winemag.com/buying-guide/pazo-pondal-2013-albarino-rias-baixas/,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, melon and peach are pure as stream water. This feels round and juicy, with flavors of green herbs, lettuce, lime and orange. Tangerine notes carry the finish, which is linear and racy in feel before turning slightly pithy in flavor.",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13%,750 ml,White,Vinaio Imports,2014-12-01,Not rated yet [Add Your Review],13.0,17.0,750.0,Spain,True,Male,0,1.0


In [130]:
import statsmodels.formula.api as smf

# What effect does a wine's score have on whether the taster is male?
model = smf.logit(formula='is_male ~ wine_points', data=df)
results = model.fit()

Optimization terminated successfully.
         Current function value: 0.534477
         Iterations 5


In [131]:
results.summary()

0,1,2,3
Dep. Variable:,is_male,No. Observations:,29806.0
Model:,Logit,Df Residuals:,29804.0
Method:,MLE,Df Model:,1.0
Date:,"Thu, 01 Apr 2021",Pseudo R-squ.:,0.003522
Time:,23:16:28,Log-Likelihood:,-15931.0
converged:,True,LL-Null:,-15987.0
Covariance Type:,nonrobust,LLR p-value:,2.634e-26

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,5.2680,0.385,13.697,0.000,4.514,6.022
wine_points,-0.0453,0.004,-10.543,0.000,-0.054,-0.037


In [133]:
import numpy as np

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'name': results.params.index
})
coefs

Unnamed: 0,coef,odds ratio,name
0,5.268042,194.035596,Intercept
1,-0.045343,0.95567,wine_points
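
To make the coefficients easier to read, the fitted log-odds can be converted into predicted probabilities. This is a minimal sketch using the two coefficients printed above; the scores 85 and 95 are arbitrary illustrative inputs, and `predicted_probability` is a hypothetical helper.

```python
import numpy as np

# Coefficients copied from the logit output above
intercept = 5.268042
slope = -0.045343  # change in the log-odds of is_male per wine point

def predicted_probability(points):
    """Modeled probability that the taster is male for a given score."""
    log_odds = intercept + slope * points
    return 1 / (1 + np.exp(-log_odds))

for points in (85, 95):
    print(points, round(float(predicted_probability(points)), 3))

# Odds ratio for a 10-point difference in score
print(round(float(np.exp(slope * 10)), 3))
```

Because the odds ratio for `wine_points` is below 1, the modeled odds that a review was written by a male taster fall by about 4.4% with each additional point. The pseudo R-squared of 0.0035 in the summary suggests that the score alone says very little about the taster's gender.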


In [134]:
# I don't think I did this correctly... I'm trying to see whether one gender is more
# likely to give more (or fewer) points than the other.
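
The question in the comment above can also be asked the other way round, with the score as the outcome and gender as the predictor. This is a minimal sketch, assuming a frame shaped like `df` with `wine_points` and `is_male` columns. The eight-row `toy` frame is made-up illustration data, not values from wine-reviews.csv, and the `ols_` names are chosen only to avoid overwriting the logit results.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up illustration data; swap in df[['wine_points', 'is_male']].dropna()
toy = pd.DataFrame({
    'wine_points': [88, 90, 86, 91, 92, 89, 93, 90],
    'is_male':     [1,  1,  1,  1,  0,  0,  0,  0],
})

# Average score given by each group
group_means = toy.groupby('is_male').wine_points.mean()
print(group_means)

# Linear regression: the is_male coefficient is the difference in average points
ols_model = smf.ols(formula='wine_points ~ is_male', data=toy)
ols_results = ols_model.fit()
print(ols_results.params)
```

With a single 0/1 predictor, the `is_male` coefficient equals the gap between the two group means, and its p-value in `ols_results.summary()` indicates whether that gap is distinguishable from zero.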

