# Preprocessing
- Stem the data (run/running/runs --> run)
    - Data is stemmed
    - Lowercase
    - Lemmatizer: reduces an inflected word to its base form rather than chopping off the suffix
- Identify the most commonly used words in the description
- One-hot encode the top 10 (ish) nontrivial words
- Scale the data
- PCA (95% of variance captured)
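
The encode, scale, and PCA steps in the list above could be sketched roughly as follows. This is only an illustration on a hypothetical toy corpus (`toy_descriptions` stands in for the stemmed `description` column), and using `CountVectorizer` with scikit-learn's built-in English stop-word list to pick the "nontrivial" words is an assumption, not a decision made in these notes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the stemmed `description` column.
toy_descriptions = [
    "ripe fruiti wine with smooth tannin",
    "tart snappi lime flavor and crisp acid",
    "ripe cherri flavor with soft tannin",
    "crisp dri white wine with lime and appl",
    "smooth rich red wine with cherri and oak",
    "dri crisp acid and green appl flavor",
]

# Keep only the 10 most frequent non-stop-words; binary=True records
# presence/absence (0/1) of each word instead of its count.
vectorizer = CountVectorizer(max_features=10, stop_words="english", binary=True)
word_flags = vectorizer.fit_transform(toy_descriptions).toarray()

# Standardize each indicator column, then keep as many principal
# components as are needed to explain at least 95% of the variance.
scaled = StandardScaler().fit_transform(word_flags)
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(scaled)

print(sorted(vectorizer.vocabulary_))  # the selected words
print(reduced.shape, pca.explained_variance_ratio_.sum())
```

On the real data, the scaler and PCA would be fit on the training split only and then applied to the test split.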

# Models
- [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
   - There are not many parameters to set
   - I don't think feature extraction will work well on our categorical dataset
- [Decision Tree Regression](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
   - Parameters
       - `max_features`: int, float or {“auto”, “sqrt”, “log2”}
             - This one might not be as useful
   - [Plot Tree Regression](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html)
- [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html)
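    - Parameters
      - `penalty`: {‘l2’, ‘l1’, ‘elasticnet’}, default=None
      - `verbose`: the verbosity level
           - Controls how much progress output is printed while fitting
      - `validation_fraction`: share of the training data held out for early stopping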
- Neural Network
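
As a rough sketch of how the first two candidates above might be fit and compared, the block below uses synthetic data (the arrays `X` and `y` are made up, not the wine dataset). `Perceptron` is left out because scikit-learn's `Perceptron` is a classifier, so it would need a discretized target. The `max_depth=5` setting is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(334)
# Synthetic stand-in data: 200 rows, 5 numeric features, noisy linear target.
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

models = {
    "linear": LinearRegression(),
    # max_features limits how many features are considered at each split.
    "tree": DecisionTreeRegressor(max_features="sqrt", max_depth=5, random_state=334),
}

r2_scores = {}
for name, model in models.items():
    model.fit(X[:150], y[:150])                      # train on the first 150 rows
    r2_scores[name] = model.score(X[150:], y[150:])  # R^2 on the held-out rows
print(r2_scores)
```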

# Set Hyperparameters
- Use 10-fold cross validation to tune hyperparameters
- Use RSS (residual sum of squares) as the error metric?
- [Grid Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
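
A minimal sketch of the tuning plan above, assuming a decision tree as the model and synthetic data in place of the wine features. The parameter grid is hypothetical. Scikit-learn has no built-in RSS scorer, so the sketch uses `neg_mean_squared_error`; with equal-sized folds, MSE is RSS divided by the fold size, so both metrics rank the candidates identically.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(334)
# Synthetic stand-in data: 300 rows, so each of the 10 folds has 30 rows.
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=300)

# Hypothetical grid: 3 depths x 3 max_features settings = 9 candidates.
param_grid = {"max_depth": [2, 4, 8], "max_features": [None, "sqrt", "log2"]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=334),
    param_grid,
    cv=KFold(n_splits=10, shuffle=True, random_state=334),  # 10-fold CV
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

# best_score_ is the negated mean MSE across folds, so flip the sign.
print(search.best_params_, -search.best_score_)
```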


In [4]:
import pandas as pd
import numpy as np
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from tqdm import tqdm #pip install tqdm

In [5]:

df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)

In [6]:
df

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,Citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,Well-drained gravel soil gives this wine its c...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss


In [7]:
for attr in df:
    print(f"{attr}: {round((sum(df[attr].isnull())/df.shape[0])*100,4)}%")

country: 0.0485%
description: 0.0%
designation: 28.8257%
points: 0.0%
price: 6.9215%
province: 0.0485%
region_1: 16.3475%
region_2: 61.1367%
taster_name: 20.1922%
taster_twitter_handle: 24.0154%
title: 0.0%
variety: 0.0008%
winery: 0.0%


In [8]:

def text_stemming(descriptions):
    """Lowercase, tokenize, and Porter-stem every description in a Series."""
    ps = PorterStemmer()
    stemmed = []
    for text in tqdm(descriptions, desc="Stemming"):
        tokens = word_tokenize(text.lower())  # split into lowercase tokens
        # Replace each token with its stem and rejoin with single spaces
        stemmed.append(" ".join(ps.stem(token) for token in tokens))
    # Keep the original index so the result aligns with the source rows
    return pd.Series(stemmed, index=descriptions.index, name=descriptions.name)

In [9]:
df["description"] = text_stemming(df["description"])

Stemming: 100%|██████████| 129971/129971 [01:56<00:00, 1115.26it/s]
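
To see what the Porter stemmer does to individual words, here is a small check (the word list is chosen only for illustration). Suffix stripping handles regular inflections such as "running", but it leaves the irregular form "ran" untouched and turns "fruity" into the non-word "fruiti". A lemmatizer such as NLTK's `WordNetLemmatizer` should handle the irregular case, at the cost of needing the WordNet data and part-of-speech hints.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["run", "running", "runs", "ran", "fruity"]

# Map each word to its Porter stem.
stems = {word: ps.stem(word) for word in words}
print(stems)
```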


In [10]:
df.dropna(subset=['price'], inplace=True)

In [11]:
df

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1,Portugal,"thi is ripe and fruiti , a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"tart and snappi , the flavor of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"pineappl rind , lemon pith and orang blossom s...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"much like the regular bottl from 2012 , thi co...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
5,Spain,blackberri and raspberri aroma show a typic na...,Ars In Vitro,87,15.0,Northern Spain,Navarra,,Michael Schachner,@wineschach,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,Germany,note of honeysuckl and cantaloup sweeten thi d...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,citat is given as much as a decad of bottl age...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,well-drain gravel soil give thi wine it crisp ...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,"a dri style of pinot gri , thi is crisp with s...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss


In [12]:
# Hold out 30% of the rows for testing; price is the target
xTrain, xTest, yTrain, yTest = train_test_split(df.drop(columns='price'), df['price'], test_size=0.3, random_state=334)

In [13]:
xTrain.head()

Unnamed: 0,country,description,designation,points,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
125322,US,"thi pink blend of 64 % grenach , 31 % syrah an...",,90,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Kaleidos 2015 Rosé (Paso Robles),Rosé,Kaleidos
108443,France,pesquié 's red quintess is a blend of 80 % syr...,Quintessence,90,Rhône Valley,Ventoux,,Joe Czerwinski,@JoeCz,Château Pesquié 2009 Quintessence Red (Ventoux),Rhône-style Red Blend,Château Pesquié
97506,France,"a richli textur wine , show some attract fresh...",,88,Bordeaux,Montagne-Saint-Émilion,,Roger Voss,@vossroger,Château Plaisance 2006 Montagne-Saint-Émilion,Bordeaux-style Red Blend,Château Plaisance
70175,Croatia,"thi red blend of 34 % babic , 33 % plavina and...",Riserva R6,89,North Dalmatia,,,Jeff Jenssen,@worldwineguys,Bibich 2010 Riserva R6 Red (North Dalmatia),Red Blend,Bibich
17614,US,"savori and tart , thi is a dri , express white...",Mayacamas Mountains,87,California,Sonoma County,Sonoma,Virginie Boone,@vboone,Viluko 2016 Mayacamas Mountains Sauvignon Blan...,Sauvignon Blanc,Viluko


# Neural Network

Strongly considering using an RNN (recurrent neural network).



In [16]:
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, Dataset, DataLoader, random_split
from torch.nn import functional as F

if not torch.cuda.is_available():
    raise EnvironmentError("CUDA not available, skipping neural network")

OSError: I need CUDA