# Preprocessing
- Stem the data (run/running/runs --> run)
- Identify the most commonly used words in the description
- One-hot encode the top 10(ish) nontrivial words
- Scale the Data
- PCA (95% of variation captured)
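
As a rough sketch of the last three steps, the snippet below builds presence/absence indicators for the most frequent words, scales them, and keeps enough principal components to explain 95% of the variance. The toy descriptions and the names `toy_descriptions`, `top_words`, `indicators`, `scaled`, and `reduced` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the stemmed descriptions (hypothetical data).
toy_descriptions = [
    "ripe fruiti smooth tannin",
    "tart lime flavor acid",
    "ripe cherri flavor tannin oak",
    "fresh acid fruiti finish",
    "rich oak spice finish",
    "cherri spice flavor ripe",
]

# binary=True records presence/absence instead of counts;
# max_features=10 keeps only the 10 most frequent terms in the corpus.
top_words = CountVectorizer(binary=True, max_features=10)
indicators = top_words.fit_transform(toy_descriptions).toarray()

# Standardize each indicator column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(indicators)

# A float n_components asks PCA for the fewest components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(scaled)

print(indicators.shape, reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

On the real data the scaler and PCA would presumably be fit on the training split only and then applied to the test split.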

# Models

- Linear Regression
- Decision Tree Regression
- [Plot Tree Regression](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html)
- Perceptron
- Neural Network
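
A minimal sketch of how these candidates might be set up in scikit-learn, fit on synthetic data. Note that scikit-learn's `Perceptron` is a classifier, so this sketch assumes `MLPRegressor` as a stand-in for both the perceptron and neural network entries when the target is a continuous price. The data, the `candidates` dictionary, and all hyperparameter values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (hypothetical stand-in for the wine features).
rng = np.random.default_rng(334)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=334),
    "mlp": MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=334),
}

# Fit every candidate and report its training R^2.
for name, model in candidates.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))
```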

# Set Hyperparameters

Use 10-fold cross validation to tune hyperparameters
- using RSS as an error metric?
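
One way this could look, assuming a decision tree and a small `max_depth` grid (both hypothetical choices), is a grid search over 10 folds scored by residual sum of squares. `rss` is a helper defined here, not a scikit-learn function; `make_scorer(..., greater_is_better=False)` negates it so that the search minimizes RSS.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor


def rss(y_true, y_pred):
    """Residual sum of squares between targets and predictions."""
    return float(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


# Synthetic data (hypothetical stand-in for the wine features and prices).
rng = np.random.default_rng(334)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=334),
    param_grid={"max_depth": [1, 2, 4, 8]},
    scoring=make_scorer(rss, greater_is_better=False),  # lower RSS is better
    cv=KFold(n_splits=10, shuffle=True, random_state=334),
)
search.fit(X, y)

# best_score_ is the negated mean RSS per held-out fold.
print(search.best_params_, -search.best_score_)
```

Because RSS is a sum rather than a mean, it grows with fold size; with equally sized folds it ranks hyperparameters the same way mean squared error would.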

In [33]:
import pandas as pd
import numpy as np
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords  # stop words are words like "is" and "the"
from sklearn.model_selection import train_test_split
from tqdm import tqdm  # pip install tqdm; progress bar for long loops
import string  # provides a convenient list of punctuation
import re  # to replace numbers in strings
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

In [15]:

df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)

In [16]:
df

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,Citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,Well-drained gravel soil gives this wine its c...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss


In [17]:
for attr in df:
    print(f"{attr}: {round((sum(df[attr].isnull())/df.shape[0])*100,4)}%")

country: 0.0485%
description: 0.0%
designation: 28.8257%
points: 0.0%
price: 6.9215%
province: 0.0485%
region_1: 16.3475%
region_2: 61.1367%
taster_name: 20.1922%
taster_twitter_handle: 24.0154%
title: 0.0%
variety: 0.0008%
winery: 0.0%
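
The same per-column percentages can also be computed without an explicit loop, since the mean of a boolean column is the fraction of `True` values. A small sketch on a toy frame (the `toy` data and the `missing_pct` name are hypothetical):

```python
import pandas as pd

# Toy frame standing in for the wine data (hypothetical values).
toy = pd.DataFrame({
    "country": ["Italy", "US", None, "France"],
    "price": [None, 15.0, None, 30.0],
    "points": [87, 87, 90, 90],
})

# isnull() yields booleans; their column-wise mean is the fraction missing.
missing_pct = (toy.isnull().mean() * 100).round(4)
print(missing_pct.sort_values(ascending=False))
```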


In [18]:

def text_stemming(descriptions):
    """Clean and stem a Series of description strings; returns a Series with the same index."""
    stop_words = set(stopwords.words("english"))
    unhelpful_words = {"wine", "drink"}  # stems too common in this corpus to be informative
    stemmer = PorterStemmer()
    strip_punctuation = str.maketrans('', '', string.punctuation)  # built once, reused for every row
    newDescription = []
    for text in tqdm(descriptions, desc="Stemming"):
        # lowercase, delete punctuation, replace digit runs with "number", then split into tokens
        wordList = word_tokenize(re.sub(r'\d+', 'number', text.lower().translate(strip_punctuation)))
        stems = []
        for item in wordList:
            if item not in stop_words:  # drop stop words such as "is", "a", "the"
                candidate = stemmer.stem(item)
                if candidate not in unhelpful_words:
                    stems.append(candidate)  # replace each word with its stem
        newDescription.append(" ".join(stems))
    # keep the input index so the result aligns with the original rows on assignment
    return pd.Series(newDescription, index=descriptions.index)
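
To make the cleaning step inside the loop easier to follow, here is a standard-library-only sketch of just the normalization (lowercasing, punctuation removal, digit replacement), without tokenizing, stop-word filtering, or stemming. The helper name `normalize` and the constant `PUNCTUATION_TABLE` are hypothetical; the sample strings are the visible beginnings of two descriptions from the table above.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, delete ASCII punctuation, and replace digit runs with 'number'."""
    return re.sub(r"\d+", "number", text.lower().translate(PUNCTUATION_TABLE))


print(normalize("Much like the regular bottling from 2012, this"))
# → much like the regular bottling from number this
print(normalize("Well-drained gravel soil gives this wine"))
# → welldrained gravel soil gives this wine
```

Deleting the hyphen rather than replacing it with a space is why "well-drained" ends up as the single stem `welldrain` in the output. `string.punctuation` covers ASCII only, so characters such as the typographic apostrophe are left in place.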

In [19]:
df["description"] = text_stemming(df["description"])

Stemming: 100%|██████████| 129971/129971 [03:24<00:00, 634.79it/s]


In [20]:
df.dropna(subset=['price'], inplace=True)

In [21]:
df

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1,Portugal,ripe fruiti smooth still structur firm tannin ...,Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,tart snappi flavor lime flesh rind domin green...,,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,pineappl rind lemon pith orang blossom start a...,Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,much like regular bottl number come across rat...,Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
5,Spain,blackberri raspberri aroma show typic navarran...,Ars In Vitro,87,15.0,Northern Spain,Navarra,,Michael Schachner,@wineschach,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,Germany,note honeysuckl cantaloup sweeten delici feath...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,citat given much decad bottl age prior releas ...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,welldrain gravel soil give crisp dri charact r...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,dri style pinot gri crisp acid also weight sol...,,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss


In [22]:
df.to_csv("data/data_stemmed.csv", index = False)

In [25]:
stemmed = pd.read_csv("data/data_stemmed.csv")
stemmed

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Portugal,ripe fruiti smooth still structur firm tannin ...,Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
1,US,tart snappi flavor lime flesh rind domin green...,,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
2,US,pineappl rind lemon pith orang blossom start a...,Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
3,US,much like regular bottl number come across rat...,Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
4,Spain,blackberri raspberri aroma show typic navarran...,Ars In Vitro,87,15.0,Northern Spain,Navarra,,Michael Schachner,@wineschach,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem
...,...,...,...,...,...,...,...,...,...,...,...,...,...
120970,Germany,note honeysuckl cantaloup sweeten delici feath...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
120971,US,citat given much decad bottl age prior releas ...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
120972,France,welldrain gravel soil give crisp dri charact r...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
120973,France,dri style pinot gri crisp acid also weight sol...,,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss


In [35]:
vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
print("Vectorizing...")
splitWords = vectorizer.fit_transform(stemmed['description'])
print("Transforming...")
vocabMap = vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
# Taken from http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = splitWords.sum(axis=0).A1.tolist()  # column sums as plain Python ints
finalMap = Counter(dict(zip(vocabMap, counts)))
print(finalMap.most_common(10))

Vectorizing...
Transforming...
[('flavor', 66947), ('fruit', 53469), ('number', 42263), ('finish', 38962), ('aroma', 38838), ('palat', 36574), ('acid', 35694), ('cherri', 30136), ('tannin', 29951), ('ripe', 25838)]


In [36]:
print(finalMap.most_common(20))

[('flavor', 66947), ('fruit', 53469), ('number', 42263), ('finish', 38962), ('aroma', 38838), ('palat', 36574), ('acid', 35694), ('cherri', 30136), ('tannin', 29951), ('ripe', 25838), ('note', 23849), ('black', 23809), ('dri', 22821), ('spice', 21245), ('rich', 19644), ('fresh', 19229), ('red', 17784), ('show', 17208), ('oak', 16760), ('berri', 16548)]
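
Since `number` is only the placeholder inserted for digits, it probably should not count as one of the nontrivial words. A minimal sketch of picking the top words while skipping it, and of encoding one description against them, using only the standard library. The counts are copied from the `most_common(10)` output above; `placeholder_tokens`, `top_words`, and `encode` are hypothetical names, and the cutoff of 5 is arbitrary.

```python
from collections import Counter

# Counts copied from the most_common(10) output.
finalMap = Counter({
    "flavor": 66947, "fruit": 53469, "number": 42263, "finish": 38962,
    "aroma": 38838, "palat": 36574, "acid": 35694, "cherri": 30136,
    "tannin": 29951, "ripe": 25838,
})

placeholder_tokens = {"number"}  # assumption: treat the digit placeholder as trivial

# Walk the words from most to least common, skip placeholders, keep the first 5.
top_words = [w for w, _ in finalMap.most_common() if w not in placeholder_tokens][:5]


def encode(description):
    """Return one 0/1 indicator per top word for a stemmed description."""
    tokens = set(description.split())
    return [int(w in tokens) for w in top_words]


print(top_words)
# → ['flavor', 'fruit', 'finish', 'aroma', 'palat']
print(encode("tart snappi flavor lime flesh rind"))
# → [1, 0, 0, 0, 0]
```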


In [24]:
xTrain, xTest, yTrain, yTest = train_test_split(df.loc[:, df.columns != 'price'], df['price'], test_size=0.3, random_state=334)