## Capstone 2 - Abalone Age Prediction
### Modeling
**Context**:

The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

_Credit: https://www.kaggle.com/rodolfomendes/abalone-dataset_

**Goal**: The goal of this capstone project is to build a regression model that can predict the age of an abalone shell by accurately predicting its ring count.


**Pre-processing & Training Data Development Objective**: Build two to three different models and identify the best one to predict the age of an abalone. 

In [26]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve, cross_val_score

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression

In [27]:
#Import abalone dataset
abalone_data = pd.read_csv('/Users/joyopsvig/github/springboard/2-CapstoneAbalone/Notebooks/abaloneDW_cleaned.csv')

In [28]:
#One hot encode the 'Sex' column since it is categorical
one_hot = pd.get_dummies(abalone_data['Sex'])

# Drop 'Sex' column as it is now encoded
abalone_data = abalone_data.drop('Sex',axis = 1)

# Join the encoded df
abalone_data = abalone_data.join(one_hot)

#Confirm Sex is one hot encoded
abalone_data.head()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Age,F,I,M
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,16.5,0,0,1
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,8.5,0,0,1
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,10.5,1,0,0
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,11.5,0,0,1
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,8.5,0,1,0


In [29]:
#Drop response variable
X = abalone_data.drop('Age', axis = 1)
y = abalone_data['Age']

In [30]:
#Transform data so that it has a mean of 0 and std of 1
standardScale = StandardScaler()
standardScale.fit_transform(X)

#Use SelectKBest to extract best features of given dataset aka features that contribute most to target variable (age)
selectkBest = SelectKBest(k=2)
X_new = selectkBest.fit_transform(X, y)

#Split data in to train and test data
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.25)

In [31]:
np.random.seed(10)
def rmse_cv(model, X_train, y):
    rmse =- (cross_val_score(model, X_train, y, scoring='neg_mean_squared_error', cv=5))
    return(rmse*100)

models = [LinearRegression(),
             Ridge(),
             SVR(),
             RandomForestRegressor(),
             GradientBoostingRegressor(),
             KNeighborsRegressor(n_neighbors = 4),]

names = ['LR','Ridge','svm','GNB','RF','GB','KNN']

for model,name in zip(models,names):
    score = rmse_cv(model,X_train,y_train)
    print("{}    : {:.6f}, {:4f}".format(name,score.mean(),score.std()))

LR    : 700.135873, 24.017946
Ridge    : 707.443800, 24.043746
svm    : 746.717807, 18.654078
GNB    : 878.375574, 63.029332
RF    : 720.064097, 26.481160
GB    : 870.425925, 48.693007
