# ML Competition

Tried several regression models:

    • Linear Regression
    • Lasso Regression
    • Ridge Regression
    👎 These linear models did not perform well (high RMSE, low R2)

    • Decision Tree Regression

Focused on:
    
    • Random Forest
    • KNeighbors
    
Used:
    
    • Loops over candidate parameter values (k and depth) to find the best R2, RMSE and cross-validation score
    • Different scalers (StandardScaler, MinMaxScaler, RobustScaler)
    • GridSearchCV for parameter tuning
    
Final and best Model:
    
    • KNN with n_neighbors = 27, neighbors weighted by distance, and StandardScaler
    
    • Scores: 
      • RMSE = 0.641
      • R2 = 0.656

**Importing libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

**Importing data**

In [2]:
cookies = pd.read_csv('../Data/cookies_lisa_test.csv')

In [3]:
cookies.head(3)

Unnamed: 0.1,Unnamed: 0,sugar to flour ratio,sugar index,bake temp,chill time,calories,density,pH,grams baking soda,bake time,quality,butter type,weight,chocolate,raisins,oats,nuts,peanut butter
0,0,0.25,9.5,300,15.0,136.0,0.99367,8.1,0.44,12.1,8,1,15.2,0,1,0,0,0
1,1,0.23,3.3,520,34.0,113.0,0.99429,8.16,0.48,8.4,7,1,12.4,0,1,0,0,0
2,2,0.18,1.9,360,33.0,106.0,0.98746,8.21,0.83,14.0,9,1,9.4,1,0,0,1,0


In [4]:
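# Drop leftover index columns from earlier CSV exports so they are not used as features
cookies.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore', inplace=True)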

## KNN 

In [18]:
range_list = [*range(1, 50, 1)]

scaler_list = [StandardScaler(), RobustScaler(), MinMaxScaler()]

X = cookies.drop(['quality'], axis=1)
y = cookies['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Try every scaler with every k
for scaler in scaler_list:
    print(scaler)
    # Fit the scaler on the training split only, then apply it to both splits
    scaler.fit(X_train)
    X_train_scale = scaler.transform(X_train)
    X_test_scale = scaler.transform(X_test)
    for k in range_list:
        Knn = KNeighborsRegressor(n_neighbors=k, weights='distance')
        Knn.fit(X_train_scale, y_train)
        y_pred = Knn.predict(X_test_scale)
        print('For K = ', k)
        print('R2:', round(r2_score(y_test, y_pred), 3))
        # RMSE as the square root of the MSE (does not rely on the `squared` argument)
        print('RMSE:', round(np.sqrt(mean_squared_error(y_test, y_pred)), 3))
        # Note: cross-validation runs on the unscaled X, so this score does not depend on the scaler
        print('Cross Validation Score: ', cross_val_score(Knn, X, y, cv=10).mean())
        print('')

StandardScaler()
For K =  1
R2: 0.583
RMSE: 0.828
Cross Validation Score:  0.40596102413212565

For K =  2
R2: 0.675
RMSE: 0.731
Cross Validation Score:  0.533928742310089

For K =  3
R2: 0.698
RMSE: 0.705
Cross Validation Score:  0.5818187552702068

For K =  4
R2: 0.714
RMSE: 0.686
Cross Validation Score:  0.6049906590329603

For K =  5
R2: 0.721
RMSE: 0.677
Cross Validation Score:  0.619048424784719

For K =  6
R2: 0.726
RMSE: 0.671
Cross Validation Score:  0.6302595800154712

For K =  7
R2: 0.728
RMSE: 0.669
Cross Validation Score:  0.6354495520975506

For K =  8
R2: 0.732
RMSE: 0.664
Cross Validation Score:  0.6405964944636715

For K =  9
R2: 0.737
RMSE: 0.658
Cross Validation Score:  0.6451485451265992

For K =  10
R2: 0.735
RMSE: 0.66
Cross Validation Score:  0.6498890097961112

For K =  11
R2: 0.738
RMSE: 0.657
Cross Validation Score:  0.6524647890920362

For K =  12
R2: 0.739
RMSE: 0.655
Cross Validation Score:  0.6552651118020297

For K =  13
R2: 0.739
RMSE: 0.655
[... output truncated ...]

Cross Validation Score:  0.6354495520975506

For K =  8
R2: 0.736
RMSE: 0.659
Cross Validation Score:  0.6405964944636715

For K =  9
R2: 0.738
RMSE: 0.657
Cross Validation Score:  0.6451485451265992

For K =  10
R2: 0.74
RMSE: 0.654
Cross Validation Score:  0.6498890097961112

For K =  11
R2: 0.741
RMSE: 0.652
Cross Validation Score:  0.6524647890920362

For K =  12
R2: 0.741
RMSE: 0.652
Cross Validation Score:  0.6552651118020297

For K =  13
R2: 0.742
RMSE: 0.651
Cross Validation Score:  0.6561851706262832

For K =  14
R2: 0.744
RMSE: 0.649
Cross Validation Score:  0.657754118257676

For K =  15
R2: 0.745
RMSE: 0.648
Cross Validation Score:  0.6598542460629663

For K =  16
R2: 0.746
RMSE: 0.647
Cross Validation Score:  0.6613778429649301

For K =  17
R2: 0.747
RMSE: 0.645
Cross Validation Score:  0.6626382272326852

For K =  18
R2: 0.746
RMSE: 0.646
Cross Validation Score:  0.6634755316447312

For K =  19
R2: 0.745
RMSE: 0.648
Cross Validation Score:  0.6636767997248071

For K =  20
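In the loop above, `cross_val_score` receives the unscaled `X`, which is why the cross-validation score for a given k is identical under every scaler. One way to let the scaler take part in cross-validation is to wrap scaler and model in a pipeline. The sketch below uses synthetic data (`X_demo`, `y_demo` are stand-ins, not the cookie data) and k = 5 purely for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in for the cookie features (assumption: not the real data).
# The columns deliberately live on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 0.1])
y_demo = X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

cv_scores = {}
for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler()):
    # The pipeline refits the scaler on the training part of every fold,
    # so each scaler gets its own cross-validation score.
    model = make_pipeline(scaler, KNeighborsRegressor(n_neighbors=5, weights='distance'))
    cv_scores[type(scaler).__name__] = cross_val_score(model, X_demo, y_demo, cv=10).mean()

for name, score in cv_scores.items():
    print(f'{name}: mean CV R2 = {score:.3f}')
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`, and the inner loop over k could stay as it is.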

**Tuning Hyperparameters**

In [5]:
range_list = [*range(10, 30, 1)]

X = cookies.drop(['quality'], axis=1)
y = cookies['quality']

scaler = StandardScaler()
scaler.fit(X)

X_scale = scaler.transform(X)

# Select an algorithm
algorithm = KNeighborsRegressor()

# Create 10 shuffled folds
seed = 13
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

# Define our candidate hyperparameters
hp_candidates = [{'n_neighbors': range_list, 'weights': ['uniform','distance']}]

# Search for best hyperparameters
grid = GridSearchCV(estimator=algorithm, param_grid=hp_candidates, cv=kfold, scoring='r2')
grid.fit(X_scale, y)

# Get the results
print(grid.best_score_)
print(grid.best_estimator_)
print(grid.best_params_)

0.7617068608088978
KNeighborsRegressor(n_neighbors=27, weights='distance')
{'n_neighbors': 27, 'weights': 'distance'}
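The search above scales all of `X` once before the folds are created, so each validation fold has already influenced the scaler. A possible variant, sketched here on synthetic data, puts the scaler inside a `Pipeline` so that `GridSearchCV` refits it per fold. The step names `scaler` and `knn`, the smaller grid and the demo arrays are assumptions chosen for this example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the cookie features)
rng = np.random.default_rng(13)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=150)

# Scaler and model in one estimator; parameters are addressed as <step>__<param>
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsRegressor())])

param_grid = {
    'knn__n_neighbors': list(range(2, 12)),
    'knn__weights': ['uniform', 'distance'],
}

kfold = KFold(n_splits=5, shuffle=True, random_state=13)
demo_grid = GridSearchCV(pipe, param_grid=param_grid, cv=kfold, scoring='r2')
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(round(demo_grid.best_score_, 3))
```

Because the fitted `best_estimator_` contains the scaler, it can predict directly on unscaled rows.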


In [24]:
grid.predict(X_test_scale)

In [6]:
#with Standard Scaler
from sklearn.neighbors import KNeighborsRegressor

X = cookies.drop('quality', axis=1)
y = cookies['quality']

X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.05, random_state=0)

# Fit the scaler on the training split only
scaler = StandardScaler()
scaler.fit(X_train)

# Scaled copy of the full data set, used for cross-validation below
X_scale = scaler.transform(X)

X_train_scale = scaler.transform(X_train)
X_test_scale = scaler.transform(X_test)

# Best parameters from the grid search: k = 27, distance-weighted neighbors
Knn = KNeighborsRegressor(n_neighbors=27, weights='distance')

Knn.fit(X_train_scale, y_train)

y_pred = Knn.predict(X_test_scale)

print('R2:', round(r2_score(y_test, y_pred), 3))
# RMSE as the square root of the MSE (does not rely on the `squared` argument)
print('RMSE:', round(np.sqrt(mean_squared_error(y_test, y_pred)), 3))
print('Cross Validation Score: ', cross_val_score(Knn, X_scale, y, cv=10).mean())

R2: 0.782
RMSE: 0.601
Cross Validation Score:  0.7573852789203195
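For reference, the two metrics printed above can be written out with NumPy. This is a sketch on made-up numbers; `rmse` and `r2` are helper names introduced here and are not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up quality scores and predictions
y_true_demo = [7, 8, 9, 8]
y_pred_demo = [7.5, 8.0, 8.5, 8.0]

print('RMSE:', round(rmse(y_true_demo, y_pred_demo), 3))  # 0.354
print('R2:', round(r2(y_true_demo, y_pred_demo), 3))      # 0.75
```

RMSE is in the same units as `quality`, so a value around 0.6 means predictions are typically off by a bit more than half a quality point.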


## Saving Data 

In [16]:
#transforming test data frame

test = pd.read_csv('../Data/cookies_validate.csv')

#dropping
test.drop('id', axis=1, inplace=True)
test.drop('quality', axis=1, inplace=True)
test.drop(['crunch factor', 'aesthetic appeal', 'diameter'], axis=1, inplace=True)

#encoding categorical columns as 0/1
#butter type: melted = 1, cubed = 0
test['butter type'] = test['butter type'].replace('melted', 1).replace('cubed', 0)

#mixins: one 0/1 column per ingredient found in the 'mixins' string
mixins_list = ['chocolate', 'raisins', 'oats', 'nuts', 'peanut butter']

for x in mixins_list:
    test[x] = test['mixins'].str.contains(x).astype(int)

test.drop('mixins', axis=1, inplace=True)
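The encoding step above can be tried on a toy frame. The rows below are hypothetical, and the comma-separated format of `mixins` is an assumption. This sketch also passes `na=False` and `regex=False` to `str.contains`, a small variation on the cell above that keeps rows with a missing `mixins` value from breaking the integer cast.

```python
import pandas as pd

# Hypothetical rows; the real 'mixins' values may be formatted differently
demo = pd.DataFrame({
    'butter type': ['melted', 'cubed', 'melted'],
    'mixins': ['chocolate, nuts', 'raisins', 'peanut butter, oats'],
})

# Binary encoding of the butter type
demo['butter type'] = demo['butter type'].map({'melted': 1, 'cubed': 0})

# One indicator column per mix-in, based on plain substring matching
for mixin in ['chocolate', 'raisins', 'oats', 'nuts', 'peanut butter']:
    demo[mixin] = demo['mixins'].str.contains(mixin, regex=False, na=False).astype(int)

demo = demo.drop(columns='mixins')
print(demo)
```

Substring matching works here because no mix-in name is contained in another one.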

In [17]:
test.columns

Index(['sugar to flour ratio', 'sugar index', 'bake temp', 'chill time',
       'calories', 'density', 'pH', 'grams baking soda', 'bake time',
       'butter type', 'weight', 'chocolate', 'raisins', 'oats', 'nuts',
       'peanut butter'],
      dtype='object')

In [18]:
test = scaler.transform(test)

In [19]:
quality_pred = Knn.predict(test)
quality_pred

array([ 7.66784549,  8.23066736,  8.05326928,  8.18728978,  7.        ,
        7.31999433,  7.94424376,  7.03949042,  7.        ,  8.        ,
        8.05870803,  8.04144623,  7.32384152,  7.98417089,  8.        ,
        7.77608308,  7.91836586,  7.38023618,  8.22638756,  8.33009362,
        8.10645423,  9.        ,  8.32815766,  7.59013309,  7.        ,
        7.18132015,  7.55193455,  7.75617873,  7.        ,  7.        ,
        7.94779872,  7.63339783,  7.51943428,  8.01257349,  8.42293306,
        8.38622638, 10.        ,  7.94721908,  7.2532355 ,  7.76027049,
        8.37170846,  8.16951216,  7.776395  ,  7.41721626,  7.84518555,
        7.5331206 ,  7.51278604,  8.16985694,  7.        ,  7.9041964 ,
        7.        ,  7.8921116 ,  7.57184598,  7.75931992,  8.        ,
        7.        ,  7.77364658,  8.        ,  8.20531341,  8.76159865,
        7.31431404,  7.8969384 ,  7.23458379,  8.50674841,  8.03088414,
        8.11608955,  7.55889137,  7.44943139,  7.5129231 ,  8.  
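Several predictions above are exact integers (7., 8., 9., 10.). One possible explanation, not confirmed by the notebook, is the behaviour of `weights='distance'`: when a query point sits at distance zero from a training point, scikit-learn returns that point's target instead of a weighted average. The toy data below is invented to show the effect.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented one-dimensional training set
X_train_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train_demo = np.array([7.0, 8.0, 9.0, 10.0])

knn_demo = KNeighborsRegressor(n_neighbors=3, weights='distance')
knn_demo.fit(X_train_demo, y_train_demo)

# Query identical to a training point: the zero-distance neighbor decides alone
exact = knn_demo.predict([[1.0]])[0]

# Query between training points: inverse-distance weighted average of 3 neighbors
between = knn_demo.predict([[1.4]])[0]

print(exact, round(between, 3))
```

If this is what happens in the notebook, the integer predictions would correspond to validation rows whose scaled features match a training row exactly.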

In [20]:
test = pd.DataFrame(test, columns = ['sugar to flour ratio', 'sugar index', 'bake temp', 'chill time',
       'calories', 'density', 'pH', 'grams baking soda', 'bake time',
       'butter type', 'weight', 'chocolate', 'raisins', 'oats', 'nuts',
       'peanut butter'])

In [21]:
test['quality_pred'] = quality_pred

In [23]:
test.to_csv('../Data/fifthtry_sarahlisa.csv')