## Capstone 2 - Abalone Age Prediction
### Modeling
**Context**:

The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

_Credit: https://www.kaggle.com/rodolfomendes/abalone-dataset_

**Goal**: The goal of this capstone project is to build a regression model that can predict the age of an abalone shell by accurately predicting its ring count.
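The link between ring count and age can be made explicit. The abalone dataset conventionally takes age in years as the ring count plus 1.5, which is consistent with the half-year `Age` values in this notebook. A minimal sketch of that conversion follows; the function names and the constant name are illustrative and not part of the notebook.

```python
# Assumed convention for the abalone dataset: age (years) = rings + 1.5
RING_OFFSET_YEARS = 1.5

def rings_to_age(rings):
    """Convert a ring count to an age in years."""
    return rings + RING_OFFSET_YEARS

def age_to_rings(age):
    """Convert an age in years back to a ring count."""
    return age - RING_OFFSET_YEARS

# Illustrative ring counts, not read from the data file
ages = [rings_to_age(r) for r in (15, 7, 9)]
print(ages)
```

Because the offset is a constant, a model that predicts `Age` well also predicts the ring count equally well; the error metrics are identical in both units.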


**Modeling Objective**: Build several candidate regression models, compare their performance on held-out test data, and identify the best one for predicting the age of an abalone.

In [134]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

from math import sqrt
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve, cross_val_score

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.dummy import DummyRegressor
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression

In [135]:
#Import abalone dataset
abalone_data = pd.read_csv('/Users/joyopsvig/github/springboard/2-CapstoneAbalone/Notebooks/abaloneDW_cleaned.csv')

In [136]:
#One hot encode the 'Sex' column since it is categorical
one_hot = pd.get_dummies(abalone_data['Sex'])

# Drop 'Sex' column as it is now encoded
abalone_data = abalone_data.drop('Sex',axis = 1)

# Join the encoded df
abalone_data = abalone_data.join(one_hot)

#Confirm Sex is one hot encoded
abalone_data.head()

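| Unnamed: 0 | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Age | F | I | M |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 16.5 | 0 | 0 | 1 |
| 1 | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 8.5 | 0 | 0 | 1 |
| 2 | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 10.5 | 1 | 0 | 0 |
| 3 | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 11.5 | 0 | 0 | 1 |
| 4 | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 8.5 | 0 | 1 | 0 |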
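With all three indicator columns (`F`, `I`, `M`) kept, the indicators sum to 1 in every row, so they are perfectly collinear with the intercept of a linear model. One common option is to drop one level. The sketch below uses a small made-up frame, not the abalone file, and the `Sex_` column prefix comes from passing `columns=` to `pd.get_dummies`.

```python
import pandas as pd

# Toy frame standing in for the abalone data (values are illustrative)
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Length": [0.455, 0.35, 0.53, 0.44, 0.33],
})

# Full encoding: one indicator column per category
full = pd.get_dummies(toy, columns=["Sex"], dtype=int)

# Reduced encoding: drop the first category (F) to avoid redundant columns
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Tree-based models and KNN are not harmed by the redundant column, so this mainly matters for the unregularized linear model.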


In [137]:
#Drop response variable
X = abalone_data.drop('Age', axis = 1)
y = abalone_data['Age']

In [138]:
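#Fit a StandardScaler (mean 0, std 1) to the features.
#Note: the scaled array returned here is not assigned, so X is unchanged
#and the models below are trained on the unscaled features.
standardScale = StandardScaler()
standardScale.fit_transform(X)

#Split data into training (75%) and test (25%) sets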
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
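In the cell above, the return value of `fit_transform` is never assigned, so the split operates on unscaled features, and no `random_state` is set, so each run produces a different split. A hedged sketch of one alternative: split first with a fixed seed, then let a pipeline fit the scaler on the training portion only. The data here is synthetic, and the `_demo`, `_tr` and `_te` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=[1.0, 100.0], size=(200, 2))
y_demo = X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Split first, with a fixed seed so the result is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then applies the same transform at prediction time
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=4))
pipe.fit(X_tr, y_tr)

X_tr_scaled = pipe.named_steps["standardscaler"].transform(X_tr)
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
print("Test R^2:", pipe.score(X_te, y_te))
```

Scaling matters most for distance- and kernel-based models such as KNN and SVR; ordinary linear regression and tree ensembles are largely insensitive to it.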

## Linear Regression Model

_In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. Wikipedia_
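As a small worked example of the idea, the coefficients of a simple linear regression can be recovered by least squares. This sketch uses NumPy directly on made-up points that lie exactly on the line y = 2x + 1; it illustrates the technique rather than how scikit-learn implements `LinearRegression` internally.

```python
import numpy as np

# Four points that lie exactly on y = 2x + 1 (illustrative data)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution minimizes the sum of squared residuals
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef
print(intercept, slope)
```

With more than one explanatory variable, the design matrix simply gains one column per feature, which is the multiple linear regression case used in this notebook.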

In [139]:
#Create the linear regression model and fit it to training data
model_lin = linear_model.LinearRegression()
model_lin.fit(X_train, y_train)

LinearRegression()

In [140]:
#Predict on the training set
y_pred_train_lin = model_lin.predict(X_train)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_train, y_pred_train_lin)) #Mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_train, y_pred_train_lin)))
print('Coefficient of determination (R^2): %.2f' #Coefficient of determination
      % r2_score(y_train, y_pred_train_lin))

Mean squared error (MSE): 4.69
Root mean squared error (RMSE): 2.16
Coefficient of determination (R^2): 0.55


In [141]:
#Apply model to test set
y_pred_test_lin = model_lin.predict(X_test)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_test, y_pred_test_lin)) #mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_test, y_pred_test_lin)))
print('Coefficient of determination (R^2): %.2f'
      % r2_score(y_test, y_pred_test_lin)) 

Mean squared error (MSE): 5.18
Root mean squared error (RMSE): 2.28
Coefficient of determination (R^2): 0.51
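The three reported metrics are closely related: RMSE is the square root of MSE (here sqrt(5.18) is about 2.28), and R^2 compares the model's squared error with that of always predicting the mean. A minimal pure-Python sketch of the definitions, with a hypothetical helper name and made-up values:

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and R^2 from two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error of the model
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": sqrt(mse), "r2": 1 - ss_res / ss_tot}

# Illustrative ages and predictions, not taken from the model above
metrics = regression_metrics([8.5, 10.5, 11.5, 16.5], [9.5, 10.5, 12.5, 14.5])
print(metrics)
```

RMSE is in the same units as the target, so a test RMSE of 2.28 means the predictions are typically off by a little over two years.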


## Ridge Regression Model

_Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. Wikipedia_
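The effect of the ridge penalty on correlated features can be seen from the closed-form solution, which adds `alpha` to the diagonal of the normal equations. The sketch below is a simplified illustration on synthetic data without an intercept; scikit-learn's `Ridge` additionally fits an unpenalized intercept by default, and the function name here is hypothetical.

```python
import numpy as np

# Two nearly identical features (illustrative synthetic data)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # almost a copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = x1 + x2 + rng.normal(scale=0.1, size=100)

def ridge_coefficients(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y; alpha=0 gives ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

ols = ridge_coefficients(X_demo, y_demo, alpha=0.0)
ridge = ridge_coefficients(X_demo, y_demo, alpha=1.0)
print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
```

With highly correlated columns, the unpenalized coefficients can swing far apart while their sum stays stable; the penalty pulls them toward similar, smaller values.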

In [142]:
#Create the ridge regression model and fit it to training data
model_ridge = Ridge()
model_ridge.fit(X_train, y_train)

Ridge()

In [143]:
#Predict on the training set
y_pred_train_ridge = model_ridge.predict(X_train)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_train, y_pred_train_ridge)) #Mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_train, y_pred_train_ridge)))
print('Coefficient of determination (R^2): %.2f' #Coefficient of determination
      % r2_score(y_train, y_pred_train_ridge))

Mean squared error (MSE): 4.73
Root mean squared error (RMSE): 2.17
Coefficient of determination (R^2): 0.54


In [144]:
#Apply model to test set
y_pred_test_ridge = model_ridge.predict(X_test)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_test, y_pred_test_ridge)) #mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_test, y_pred_test_ridge)))
print('Coefficient of determination (R^2): %.2f'
      % r2_score(y_test, y_pred_test_ridge)) 

Mean squared error (MSE): 5.22
Root mean squared error (RMSE): 2.28
Coefficient of determination (R^2): 0.50


## Random Forest Model

_Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For regression tasks, the mean or average prediction of the individual trees is returned. Wikipedia_
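The averaging step described above can be checked directly: for a fitted `RandomForestRegressor`, the mean of the individual trees' predictions matches the forest's own prediction. This sketch uses synthetic data and an arbitrary choice of 25 trees, both assumptions made only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# One row of predictions per tree, for the first five samples
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

print(manual_mean)
print(forest.predict(X_demo[:5]))
```

Because each fully grown tree can nearly memorize its bootstrap sample, a forest's training error is usually far lower than its test error, so the training metrics should not be read as a measure of generalization.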

In [145]:
#Create the random forest model and fit it to training data
model_rf = RandomForestRegressor()
model_rf.fit(X_train, y_train)

RandomForestRegressor()

In [146]:
#Predict on the training set
y_pred_train_rf = model_rf.predict(X_train)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_train, y_pred_train_rf)) #Mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_train, y_pred_train_rf)))
print('Coefficient of determination (R^2): %.2f' #Coefficient of determination
      % r2_score(y_train, y_pred_train_rf))

Mean squared error (MSE): 0.66
Root mean squared error (RMSE): 0.81
Coefficient of determination (R^2): 0.94


In [147]:
#Apply model to test set
y_pred_test_rf = model_rf.predict(X_test)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_test, y_pred_test_rf)) #mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_test, y_pred_test_rf)))
print('Coefficient of determination (R^2): %.2f'
      % r2_score(y_test, y_pred_test_rf)) 

Mean squared error (MSE): 4.95
Root mean squared error (RMSE): 2.23
Coefficient of determination (R^2): 0.53


## Support Vector Machine (SVM) / Support Vector Regression (SVR)

_Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. Wikipedia_ 

_Support Vector Regression is a supervised learning algorithm that is used to predict continuous values. Support Vector Regression uses the same principle as SVMs. The basic idea behind SVR is to find a best-fit function such that as many training points as possible lie within a margin of tolerance (epsilon) around it; only points outside that margin contribute to the loss. Adapted from Towards Data Science_
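The margin idea behind SVR corresponds to the epsilon-insensitive loss: errors smaller than epsilon cost nothing, and larger errors are penalized linearly. A minimal pure-Python sketch with a hypothetical function name and made-up values; the default of 0.1 mirrors the default `epsilon` of scikit-learn's `SVR`.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Three predictions for a true age of 10.0 (illustrative values):
# one inside the tube, two outside it
losses = epsilon_insensitive_loss([10.0, 10.0, 10.0], [10.05, 10.5, 8.0])
print(losses)
```

Since the default `epsilon` and the RBF kernel both depend on the scale of the data, SVR is one of the models most likely to benefit from standardized features.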

In [148]:
#Create the SVR model and fit it to training data
model_svr = SVR()
model_svr.fit(X_train, y_train)

SVR()

In [149]:
#Predict on the training set
y_pred_train_svr = model_svr.predict(X_train)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_train, y_pred_train_svr)) #Mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_train, y_pred_train_svr)))
print('Coefficient of determination (R^2): %.2f' #Coefficient of determination
      % r2_score(y_train, y_pred_train_svr))

Mean squared error (MSE): 5.05
Root mean squared error (RMSE): 2.25
Coefficient of determination (R^2): 0.51


In [150]:
#Apply model to test set
y_pred_test_svr = model_svr.predict(X_test)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_test, y_pred_test_svr)) #mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_test, y_pred_test_svr)))
print('Coefficient of determination (R^2): %.2f'
      % r2_score(y_test, y_pred_test_svr)) 

Mean squared error (MSE): 5.69
Root mean squared error (RMSE): 2.39
Coefficient of determination (R^2): 0.46


## Gradient Boost Model

_Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest. Wikipedia_
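For squared-error loss, the ensemble of weak learners can be built by repeatedly fitting a small tree to the current residuals and adding a damped version of its predictions. The loop below is a simplified sketch of that idea on synthetic data; it is not the implementation of `GradientBoostingRegressor`, and the learning rate, tree depth and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    # Each shallow tree is fit to what the ensemble still gets wrong
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Add a damped correction to the running prediction
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after 50 rounds:", mse_history[-1])
```

The training error keeps falling as rounds are added, which is why the number of trees and the learning rate are normally tuned against held-out data rather than the training set.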

In [151]:
#Create the gradient boost model and fit it to training data
model_gb = GradientBoostingRegressor()
model_gb.fit(X_train, y_train)

GradientBoostingRegressor()

In [152]:
#Predict on the training set
y_pred_train_gb = model_gb.predict(X_train)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_train, y_pred_train_gb)) #Mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_train, y_pred_train_gb)))
print('Coefficient of determination (R^2): %.2f' #Coefficient of determination
      % r2_score(y_train, y_pred_train_gb))

Mean squared error (MSE): 3.42
Root mean squared error (RMSE): 1.85
Coefficient of determination (R^2): 0.67


In [153]:
#Apply model to test set
y_pred_test_gb = model_gb.predict(X_test)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_test, y_pred_test_gb)) #mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_test, y_pred_test_gb)))
print('Coefficient of determination (R^2): %.2f'
      % r2_score(y_test, y_pred_test_gb)) 

Mean squared error (MSE): 5.04
Root mean squared error (RMSE): 2.24
Coefficient of determination (R^2): 0.52


## K Nearest Neighbors (KNN)

_The KNN (K Nearest Neighbours) model is popularly used for non-linear regression in Machine Learning and is easy to implement. KNN assumes that a new data point is similar to the existing data points closest to it. The new data point is compared to the existing points, and the average value of its k nearest neighbors is taken as the prediction. The neighbors in KNN models can be given a weight that defines their contribution to the average value. Adapted from Jigsaw Academy_
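The prediction rule is simple enough to write by hand: compute the distance from the new point to every training point, keep the k closest, and average their targets. This sketch uses uniform weights and Euclidean distance, which match the defaults of `KNeighborsRegressor`; the function name and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=4):
    """Predict by averaging the targets of the k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                   # indices of the k closest
    return y_train[nearest].mean()                        # uniform-weight average

# Toy one-feature data with one distant outlier (illustrative only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([8.0, 9.0, 10.0, 11.0, 30.0])

pred = knn_predict(X_toy, y_toy, np.array([1.4]), k=4)
print(pred)  # 9.5: the outlier at x=10 is not among the 4 nearest
```

Because the rule relies entirely on distances, features with large numeric ranges dominate the neighbor search unless the features are scaled first.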

In [154]:
#Create the KNN model and fit it to training data
model_knn = KNeighborsRegressor(n_neighbors = 4)
model_knn.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=4)

In [155]:
#Predict on the training set
y_pred_train_knn = model_knn.predict(X_train)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_train, y_pred_train_knn)) #Mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_train, y_pred_train_knn)))
print('Coefficient of determination (R^2): %.2f' #Coefficient of determination
      % r2_score(y_train, y_pred_train_knn))

Mean squared error (MSE): 3.11
Root mean squared error (RMSE): 1.76
Coefficient of determination (R^2): 0.70


In [156]:
#Apply model to test set
y_pred_test_knn = model_knn.predict(X_test)

#Evaluate the performance
print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_test, y_pred_test_knn)) #mean squared error (MSE)
print('Root mean squared error (RMSE): %.2f' #Root mean squared error (RMSE)
      % sqrt(mean_squared_error(y_test, y_pred_test_knn)))
print('Coefficient of determination (R^2): %.2f'
      % r2_score(y_test, y_pred_test_knn)) 

Mean squared error (MSE): 5.36
Root mean squared error (RMSE): 2.31
Coefficient of determination (R^2): 0.49
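To compare the six models side by side, the test-set metrics printed in the sections above can be collected and ranked. The sketch below copies the reported RMSE and R^2 values by hand; the dictionary and variable names are illustrative.

```python
# Test-set metrics as reported in the outputs above
reported_test_metrics = {
    "Linear Regression": {"rmse": 2.28, "r2": 0.51},
    "Ridge": {"rmse": 2.28, "r2": 0.50},
    "Random Forest": {"rmse": 2.23, "r2": 0.53},
    "SVR": {"rmse": 2.39, "r2": 0.46},
    "Gradient Boosting": {"rmse": 2.24, "r2": 0.52},
    "KNN": {"rmse": 2.31, "r2": 0.49},
}

# Rank models from lowest to highest test RMSE
ranking = sorted(
    reported_test_metrics,
    key=lambda name: reported_test_metrics[name]["rmse"],
)
best_model = ranking[0]

for name in ranking:
    m = reported_test_metrics[name]
    print(f"{name:<18} RMSE={m['rmse']:.2f}  R^2={m['r2']:.2f}")
```

On these numbers the random forest has the lowest test RMSE, but the gaps between models are small and come from a single unseeded split with default hyperparameters, so cross-validation would be needed before treating the ranking as reliable.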
