# Kaggle Competition

Now it's your turn to decide which machine learning model to fit to the data! You may use any machine learning model you like, including ones that we did not cover in class. Remember, your goal is to win the [Kaggle competition](https://inclass.kaggle.com/c/beer2), so try to get your prediction error down, any way you can!

In [1]:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

data_train = pd.read_csv("/data/beer_train.csv")
data_test = pd.read_csv("/data/beer_test.csv")

In [2]:
data_train.columns

Index(['id', 'abv', 'available', 'description', 'glass', 'ibu', 'isOrganic',
       'name', 'originalGravity', 'srm'],
      dtype='object')

In [3]:
scores = []

## Question 1

Fit _at least_ 5 different models to the training data (`/data/beer_train.csv`). Each model must include at least one categorical and one quantitative input variable. At least one model must use linear regression, and at least one model must use $k$-nearest neighbors. Other than that, you are free to fit any machine learning model you like, with any input variables you like, in your pursuit of the model with the best prediction accuracy. (_Hint:_ You might find it worthwhile to create new input variables out of the descriptions of the beers, which are rich in information.)

Estimate the test error of each of the models using cross-validation. Determine which of the models you tried is the best.
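As a rough illustration of the comparison asked for here, the sketch below scores two candidate models with the same cross-validation folds and picks the one with the lowest mean squared error. The data frame `toy` is synthetic (made-up values standing in for the beer data), and the names `candidates`, `cv_mse`, and `best_name` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the beer data (made-up values, illustration only).
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "abv": rng.uniform(3, 10, n),
    "isOrganic": rng.choice(["N", "Y"], n),
})
toy["ibu"] = 8 * toy["abv"] + rng.normal(0, 5, n)

# One quantitative input plus one dummy-encoded categorical input.
X_toy = pd.get_dummies(toy[["abv", "isOrganic"]], drop_first=True).astype(float)
y_toy = toy["ibu"]

candidates = {
    "linear": LinearRegression(),
    "knn_10": KNeighborsRegressor(n_neighbors=10),
}

# cross_val_score reports negative MSE, so flip the sign before averaging.
cv_mse = {
    name: -cross_val_score(est, X_toy, y_toy, cv=5,
                           scoring="neg_mean_squared_error").mean()
    for name, est in candidates.items()
}
best_name = min(cv_mse, key=cv_mse.get)
print(cv_mse, best_name)
```

Keeping the scores in a dictionary makes it easy to add more candidates and compare them on equal footing.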

## Clean Data

In [4]:
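# Sort by IBU and drop the three rows with the highest IBU values.
data_clean = data_train.sort_values(by='ibu', ascending=False).reset_index().drop([0,1,2])
# Squared ABV term, so the linear models can capture curvature in ABV.
data_clean['abv2'] = data_clean['abv']**2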

## Model 1

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

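# Dummy-encode the categorical inputs and combine them with the quantitative ones.
X = pd.get_dummies(data_clean['isOrganic'])
Y = pd.get_dummies(data_clean['srm'])
X = pd.concat([X, Y, data_clean[['abv', 'abv2']]], axis=1)
# Drop one baseline level per categorical variable.
X = X.drop('1', axis=1).drop('N', axis=1)
model = LinearRegression()
model.fit(X, data_clean['ibu'])

# Encode the test data the same way.
Z = pd.get_dummies(data_test['isOrganic'])
W = pd.get_dummies(data_test['srm'])
data_test['abv2'] = data_test['abv']**2
Z = pd.concat([Z, W, data_test[['abv', 'abv2']]], axis=1)
# Align the test columns (names and order) with the training matrix.
Z = Z.reindex(columns=X.columns, fill_value=0)
y_predict = model.predict(Z)
submish = ({
        'id':data_test['id'],
        'ibu':y_predict
    })
sub = pd.DataFrame(submish)
# index=False keeps the row index out of the submission file.
sub.to_csv('sub.csv', index=False)
print(
    -cross_val_score(model, X, data_clean['ibu'], cv=15, scoring='neg_mean_squared_error').mean()
    )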

514.019482683


## Model 2

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Extend the Model 1 inputs with availability dummies and gravity terms.
data_clean['originalGravity2'] = data_clean['originalGravity']**2
srm_dum = pd.get_dummies(data_clean['srm']).drop("1",axis=1)
org_dum = pd.get_dummies(data_clean['isOrganic']).drop("N", axis=1)
ava_dum = pd.get_dummies(data_clean['available']).drop("Available at the same time of year, every year.",axis=1)
X = pd.concat([srm_dum, org_dum, ava_dum, data_clean[["abv","abv2","originalGravity","originalGravity2"]]], axis=1)

model = LinearRegression()

# cross_val_score fits a fresh copy of the model on each fold,
# so no separate train/test split or fit call is needed here.
print(
    -cross_val_score(model, X, data_clean['ibu'], cv=15, scoring='neg_mean_squared_error').mean()
    )

500.227371234


## Model 3

In [7]:
from sklearn.neighbors import KNeighborsRegressor

# k-nearest neighbors on the same feature matrix X as Model 2.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X, data_clean['ibu'])
print(-cross_val_score(model, X, data_clean['ibu'], cv=15,
            scoring="neg_mean_squared_error").mean())

536.213512352
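Model 3 feeds the raw columns to $k$-nearest neighbors, whose distances are dominated by whichever inputs have the largest numeric range. One common variation, not something the notebook above does, is to standardize the inputs inside a pipeline so that the scaling is re-fit on each training fold. The sketch below uses synthetic data; `X_demo`, `y_demo`, and `scaled_knn` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic inputs on very different scales (made-up values).
rng = np.random.default_rng(1)
n = 150
abv = rng.uniform(3, 10, n)            # spans several units
gravity = rng.uniform(1.03, 1.09, n)   # spans a few hundredths
X_demo = np.column_stack([abv, gravity])
y_demo = 6 * abv + 400 * (gravity - 1.03) + rng.normal(0, 3, n)

# The scaler is part of the pipeline, so each CV fold is scaled
# using statistics from its own training portion only.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=25))
scaled_mse = -cross_val_score(scaled_knn, X_demo, y_demo, cv=5,
                              scoring="neg_mean_squared_error").mean()
print(scaled_mse)
```

Whether scaling helps on the beer data would have to be checked with the same cross-validation used for the other models.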


## Model 4

In [8]:
data_clean.iloc[-10]['description']

'A light beer with a pleasant aroma and after taste of, you guessed it, strawberries.  How to we do it?  10 pounds of strawberries per 10 gallon batch!'

In [9]:
data_clean['originalGravity3'] = data_clean['originalGravity']**3
srm_dum = pd.get_dummies(data_clean['srm']).drop("1",axis=1)
org_dum = pd.get_dummies(data_clean['isOrganic']).drop("N", axis=1)
ava_dum = pd.get_dummies(data_clean['available']).drop("Available at the same time of year, every year.",axis=1)
X = pd.concat([srm_dum, org_dum, ava_dum, data_clean[["abv","abv2","originalGravity","originalGravity2", "originalGravity3"]]], axis=1)

X['BerryName'] = [1 if 'berry' in name or 'Berry' in name else 0 for name in data_clean['name'].fillna(" ")]
X['Fruity'] = [1 if 'fruit' in desc or 'Fruit' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['NameLight'] = [1 if 'light' in name else 0 for name in data_clean['name'].fillna(" ")]
X['DescWheat'] = [1 if 'wheat' in desc or 'light' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['DescDouble'] = [1 if 'double' in desc or '2x' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['ContainsBitter'] = [1 if 'bitter' in desc or 'Bitter' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['IPA'] = [1 if 'IPA' in name or 'I.P.A' in name or 'India' in name else 0 for name in data_clean['name'].fillna(" ")]
X['Pilsner'] = [1 if 'Pilsner' in name or 'pilsner' in name else 0 for name in data_clean['name'].fillna(" ")]
X['Lager'] = [1 if 'Lager' in name or 'lager' in name else 0 for name in data_clean['name'].fillna(" ")]
X['Amber'] = [1 if 'Amber' in name or 'amber' in name else 0 for name in data_clean['name'].fillna(" ")]
X['Blonde'] = [1 if 'blonde' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['Dark'] = [1 if 'Dark' in desc or 'dark' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['Black'] = [1 if 'Black' in name or 'black' in name else 0 for name in data_clean['name'].fillna(" ")]
X['Red'] = [1 if 'Red' in desc or 'red' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['German'] = [1 if 'German' in desc or 'german' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['ContainsHops'] = [1 if 'hop' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['DescIPA'] = [1 if 'IPA' in desc or 'I.P.A' in desc or 'India' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['3X'] = [1 if '3x' in name or '3X' in name else 0 for name in data_clean['name'].fillna(" ")]
X['Hops2'] = [1 if 'Hops' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['Malty'] = [1 if 'Malt' in desc or 'malt' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['USA'] = [1 if 'American' in desc or 'america' in desc or 'USA' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['Belgian'] = [1 if 'Belgian' in desc or 'belgian' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['Russian'] = [1 if 'Russia' in desc or 'russia' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['England'] = [1 if 'Engl' in desc or 'engl' in desc or 'Brit' in desc or 'brit' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['Imp'] = [1 if 'Imperial' in desc or 'imperial' in desc else 0 for desc in data_clean['description'].fillna(" ")]
X['NameImp'] = [1 if 'Imperial' in name or 'imperial' in name else 0 for name in data_clean['name'].fillna(" ")]
X['Japanese'] = [1 if 'Japan' in desc or 'japan' in desc else 0 for desc in data_clean['description'].fillna(" ")]

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
model.fit(X, data_clean['ibu'])

print(
    -cross_val_score(model, X, data_clean['ibu'], cv=15, scoring='neg_mean_squared_error').mean()
    )

338.483707753
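The keyword indicators in Model 4 all follow the same pattern, so they could be generated by one small helper. The sketch below is a variation rather than the notebook's exact logic: it matches case-insensitively, which the original checks do only for some keywords, so it would not reproduce the score above exactly. The function name `keyword_flag` and the sample names are hypothetical.

```python
import re

def keyword_flag(texts, *keywords):
    """Return a list of 1/0 flags: 1 if any keyword occurs in the text (ignoring case)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Non-string entries (None, NaN) count as "no match".
    return [1 if isinstance(t, str) and pattern.search(t) else 0 for t in texts]

# Made-up beer names for illustration.
names = ["Hop Devil IPA", "Strawberry Blonde", None, "india pale ale"]

ipa_flags = keyword_flag(names, "IPA", "I.P.A", "India")
berry_flags = keyword_flag(names, "berry")
print(ipa_flags)    # [1, 0, 0, 1]
print(berry_flags)  # [0, 1, 0, 0]
```

Calling the same helper on both the training and the test columns would also guarantee that the two sets of indicators are built identically. Note that substring matching can over-match (for example, `"ipa"` also occurs inside unrelated words), so each keyword is worth spot-checking.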


In [10]:
data_test['originalGravity2'] = data_test['originalGravity']**2
data_test['originalGravity3'] = data_test['originalGravity']**3
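# Drop the same baseline levels that were dropped from the training dummies.
srm_dum = pd.get_dummies(data_test['srm']).drop("1", axis=1, errors="ignore")
org_dum = pd.get_dummies(data_test['isOrganic']).drop("N", axis=1, errors="ignore")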
ava_dum = pd.get_dummies(data_test['available']).drop('Available at the same time of year, every year.', axis=1)
Z = pd.concat([srm_dum, org_dum, ava_dum, data_test[['abv', 'abv2','originalGravity','originalGravity2', 'originalGravity3']]], axis=1)

Z['BerryName'] = [1 if 'berry' in name or 'Berry' in name else 0 for name in data_test['name'].fillna(" ")]
Z['Fruity'] = [1 if 'fruit' in desc or 'Fruit' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['NameLight'] = [1 if 'light' in name else 0 for name in data_test['name'].fillna(" ")]
Z['DescWheat'] = [1 if 'wheat' in desc or 'light' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['DescDouble'] = [1 if 'double' in desc or '2x' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['ContainsBitter'] = [1 if 'bitter' in desc or 'Bitter' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['IPA'] = [1 if 'IPA' in name or 'I.P.A' in name or 'India' in name else 0 for name in data_test['name'].fillna(" ")]
Z['Pilsner'] = [1 if 'Pilsner' in name or 'pilsner' in name else 0 for name in data_test['name'].fillna(" ")]
Z['Lager'] = [1 if 'Lager' in name or 'lager' in name else 0 for name in data_test['name'].fillna(" ")]
Z['Amber'] = [1 if 'Amber' in name or 'amber' in name else 0 for name in data_test['name'].fillna(" ")]
Z['Blonde'] = [1 if 'blonde' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['Dark'] = [1 if 'Dark' in desc or 'dark' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['Black'] = [1 if 'Black' in name or 'black' in name else 0 for name in data_test['name'].fillna(" ")]
Z['Red'] = [1 if 'Red' in desc or 'red' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['German'] = [1 if 'German' in desc or 'german' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['ContainsHops'] = [1 if 'hop' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['DescIPA'] = [1 if 'IPA' in desc or 'I.P.A' in desc or 'India' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['3X'] = [1 if '3x' in name or '3X' in name else 0 for name in data_test['name'].fillna(" ")]
Z['Hops2'] = [1 if 'Hops' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['Malty'] = [1 if 'Malt' in desc or 'malt' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['USA'] = [1 if 'American' in desc or 'america' in desc or 'USA' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['Belgian'] = [1 if 'Belgian' in desc or 'belgian' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['Russian'] = [1 if 'Russia' in desc or 'russia' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['England'] = [1 if 'Engl' in desc or 'engl' in desc or 'Brit' in desc or 'brit' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['Imp'] = [1 if 'Imperial' in desc or 'imperial' in desc else 0 for desc in data_test['description'].fillna(" ")]
Z['NameImp'] = [1 if 'Imperial' in name or 'imperial' in name else 0 for name in data_test['name'].fillna(" ")]
Z['Japanese'] = [1 if 'Japan' in desc or 'japan' in desc else 0 for desc in data_test['description'].fillna(" ")]


In [11]:
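# Align the test columns (names and order) with the training matrix X,
# then write the predictions to CSV.
Z = Z.reindex(columns=X.columns, fill_value=0)
y_predict = model.predict(Z)
submish = ({
        'id':data_test['id'],
        'ibu':y_predict
    })
sub = pd.DataFrame(submish)
# index=False keeps the row index out of the submission file.
sub.to_csv('sub5.csv', index=False)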
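By default `to_csv()` writes the data frame's row index as an extra, unnamed first column, which a submission checker may not expect. The sketch below writes a tiny made-up submission to an in-memory buffer to show the effect of `index=False`. It assumes the expected columns are `id` and `ibu`, as used in the cells above; the authoritative format is the sample submission file.

```python
import io
import pandas as pd

# Made-up ids and predictions, for illustration only.
submission = pd.DataFrame({"id": [101, 102, 103], "ibu": [35.2, 61.0, 18.7]})

# Write to an in-memory buffer instead of a file so the result can be inspected.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The header line is then just `id,ibu`, followed by one row per beer.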

## Model 5

In [12]:
# Try k = 1, ..., 29 on the Model 4 feature matrix and print the CV MSE for each k.
for num in range(1,30):
    model = KNeighborsRegressor(n_neighbors=num)
    model.fit(X, data_clean['ibu'])
    print(
        -cross_val_score(model, X, data_clean['ibu'], cv=15, scoring='neg_mean_squared_error').mean()
        )

698.524051156
551.763869208
500.764579426
470.687241541
453.941204087
442.834918126
440.567953152
438.998738725
438.47950312
434.769758801
434.374729934
436.436251708
436.540750331
437.168550187
436.865632819
437.19364642
437.95598401
437.850524256
437.93280823
439.57087604
440.490045014
441.611244621
442.915966085
443.998511607
445.416072182
445.942566454
447.293119433
447.810288532
447.926335531
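Reading the best $k$ off a column of printed numbers is error-prone. A small sketch of picking it programmatically is shown below; the list holds the values printed above, in order $k = 1, \dots, 29$, and the names `cv_mse_by_k`, `best_k`, and `best_mse` are hypothetical. In the loop itself, the same list could be built by appending each score instead of only printing it.

```python
# CV MSE values copied from the output above, in order k = 1, 2, ..., 29.
cv_mse_by_k = [
    698.524051156, 551.763869208, 500.764579426, 470.687241541,
    453.941204087, 442.834918126, 440.567953152, 438.998738725,
    438.47950312, 434.769758801, 434.374729934, 436.436251708,
    436.540750331, 437.168550187, 436.865632819, 437.19364642,
    437.95598401, 437.850524256, 437.93280823, 439.57087604,
    440.490045014, 441.611244621, 442.915966085, 443.998511607,
    445.416072182, 445.942566454, 447.293119433, 447.810288532,
    447.926335531,
]

# enumerate(..., start=1) pairs each score with its k; min picks the lowest MSE.
best_k, best_mse = min(enumerate(cv_mse_by_k, start=1), key=lambda kv: kv[1])
print(best_k, best_mse)  # 11 434.374729934
```

The error drops quickly up to about $k = 10$, bottoms out at $k = 11$, and then rises slowly as more distant neighbors are averaged in.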


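Model 4, the linear regression with polynomial gravity terms and keyword indicators built from the beer names and descriptions, has the lowest cross-validated MSE (about 338.5). The other models score about 514.0 (Model 1), 500.2 (Model 2), 536.2 (Model 3), and 434.4 for the best Model 5 setting ($k = 11$). Model 4 is therefore the best of the models tried.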

### Grader Comments

- 
- 

[This question is worth 30 points]

In [13]:
scores.append(None)

## Question 2

Use the model that you determined to be optimal in Question 1, and predict the IBU for the test data. Export your predictions to a CSV file (using `.to_csv()`) in the format expected by Kaggle (see `/data/beer_test_sample_submission.csv`). Then, upload your predictions to [Kaggle](https://inclass.kaggle.com/c/beer2). You'll be able to see how well you did on the Leaderboard. You can upload as often as twice a day until the contest ends on Tuesday, June 6.

The top 5 teams will earn up to 5 bonus points. In addition, the team that wins the competition will get another prize!

_Hint:_ Be extra careful when encoding the categorical variables. Make sure your encoding for the test data matches the encoding you used for the training data **exactly**.
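One way to follow this hint, sketched below under stated assumptions, is to build the test dummies as usual and then reindex them to the training columns. Levels that appear only in the test data are dropped, and levels that appear only in the training data are added as all-zero columns. The toy frames and the `srm` levels are made up, and treating `srm` values as strings is an assumption.

```python
import pandas as pd

# Made-up training and test rows; "12" appears only in test, "Over 40" only in train.
train_toy = pd.DataFrame({"srm": ["1", "5", "Over 40"], "abv": [4.5, 6.0, 9.0]})
test_toy = pd.DataFrame({"srm": ["5", "12"], "abv": [5.5, 7.0]})

# Training matrix: dummies with the baseline level "1" dropped, plus abv.
X_train_toy = pd.concat(
    [pd.get_dummies(train_toy["srm"]).drop("1", axis=1), train_toy[["abv"]]],
    axis=1,
)

# Test matrix built the naive way: its columns do not match the training matrix.
X_test_raw = pd.concat([pd.get_dummies(test_toy["srm"]), test_toy[["abv"]]], axis=1)

# Force the same columns in the same order; missing dummy columns become 0.
X_test_toy = X_test_raw.reindex(columns=X_train_toy.columns, fill_value=0)
print(list(X_test_toy.columns))
```

After the reindex, the test matrix can be passed to a model fitted on the training matrix without column mismatches.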

The code that writes the submission CSV is at the end of the Model 4 section above (cell `In [11]`).

### Grader Comments

- 
- 

[This question is worth 20 points]

In [14]:
scores.append(None)