## Predicting age profile of Abalone

Predict the age profile of abalone from physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a tedious and time-consuming task. Other measurements that are easier to obtain are therefore used to predict the age.
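
The dataset stores ring counts rather than ages. The UCI description of the data takes age in years to be the ring count plus 1.5. A minimal sketch of that conversion follows; `rings_to_age` is a hypothetical helper name and is not used elsewhere in this notebook.

```python
import numpy as np

def rings_to_age(rings, offset=1.5):
    """Convert ring counts to approximate age in years (rings + offset)."""
    return np.asarray(rings, dtype=float) + offset

# Hypothetical ring counts, for illustration only
print(rings_to_age([15, 7, 9]))
```

Because the offset is constant, a model's error in rings is also its error in years.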

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import time
from math import sqrt

## Exploratory analysis

Load the dataset and do some quick exploratory analysis.

In [None]:
column_names = ['Sex', 'Length', 'Diameter', 'Height', 'Whole_weight','Shucked_weight', 
                 'Viscera_weight', 'Shell_weight', 'Rings']
data = pd.read_csv('abalone.txt', index_col=False, delimiter = ",", names=column_names)
data.head(5)

In [None]:
print(data.shape)

In [None]:
data.describe()

## Data visualisation and pre-processing


Let's take a look at the number of Male, Female and Infant samples in the dataset. As the output below shows, the three categories are roughly balanced.

In [None]:
print(data.groupby('Sex').size())
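
Raw counts can be harder to compare than shares. As a small sketch, the same check can be expressed as proportions with `value_counts(normalize=True)`. The helper name `class_proportions` and the toy series are assumptions for illustration; in the notebook the input would be `data['Sex']`.

```python
import pandas as pd

def class_proportions(series):
    """Return the share of each category, sorted by category label."""
    return series.value_counts(normalize=True).sort_index()

# Hypothetical toy sample standing in for data['Sex']
toy_sex = pd.Series(['M', 'F', 'I', 'M', 'I', 'F', 'M', 'F', 'I'])
print(class_proportions(toy_sex))
```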

Next, we visualise the data using density plots to get a sense of each attribute's distribution. The plots below suggest that the attributes are roughly Gaussian in shape.

In [None]:
data.plot(kind='density', subplots=True, layout=(4,4), sharex=False, legend=False, fontsize=1)
plt.show()
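
A visual impression of "roughly Gaussian" can be backed up with a number. One simple option is sample skewness, which is near 0 for a symmetric distribution. The sketch below uses a hypothetical helper `skew_report` and synthetic data; in the notebook it would be called on `data`.

```python
import numpy as np
import pandas as pd

def skew_report(df):
    """Return the sample skewness of each numeric column, largest magnitude first."""
    skew = df.select_dtypes(include='number').skew()
    order = skew.abs().sort_values(ascending=False).index
    return skew.reindex(order)

# Synthetic columns for illustration: one symmetric, one right-skewed
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'symmetric': rng.normal(size=2000),
    'right_skewed': rng.exponential(size=2000),
})
print(skew_report(toy))
```

Columns with a large absolute skewness may benefit from a transformation before fitting models that assume symmetric inputs.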

It is good practice to check the correlations between the attributes. In the heatmap below, the red cells around the diagonal indicate attributes that are strongly correlated with each other. The yellow and green patches suggest moderate correlation, and the blue cells show negative correlation.

In [None]:
import seaborn as sns

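# 'Sex' is still a string column at this point, so exclude it from the Pearson correlation
correlation_map = data.drop('Sex', axis=1).corr(method='pearson')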
sns.set(font_scale=1.0)
sns.heatmap(correlation_map, cbar=True, annot=True, square=True, fmt='.2f', 
            yticklabels=correlation_map.columns.values, 
            xticklabels=correlation_map.columns.values)

plt.show()

In [None]:
sex = {'M': 1, 'F': 2, 'I': 0}
# Use the dictionary to map the categorical 'Sex' column to integer codes
data['Sex'] = data.Sex.map(sex)
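
Mapping `Sex` to 0, 1 and 2 implies an ordering (I < M < F) that linear and distance-based models will treat as meaningful. One alternative, shown here only as a sketch on a hypothetical toy frame, is one-hot encoding with `pd.get_dummies`. The notebook itself keeps the integer mapping above.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    'Sex': ['M', 'F', 'I', 'M'],
    'Length': [0.455, 0.530, 0.330, 0.440],
})

# Replace 'Sex' with one indicator column per category
encoded = pd.get_dummies(toy, columns=['Sex'], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models such as gradient boosting are usually less sensitive to this choice than linear or nearest-neighbour models.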

In [None]:
Y = data['Rings'].values
X = data.drop('Rings', axis=1).values

X_train, X_test, Y_train, Y_test = train_test_split (X, Y, test_size = 0.20, random_state=21)

## Baseline algorithm checking


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))
models.append(('GBR', GradientBoostingRegressor()))
models.append(('AB', AdaBoostRegressor()))
models.append(('RF', RandomForestRegressor()))
models.append(('ET', ExtraTreesRegressor()))

results = []
names = []
for name, model in models:
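    # random_state only takes effect (and is only accepted by recent scikit-learn) when shuffle=True
    kfold = KFold(n_splits=10, shuffle=True, random_state=21)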
    start = time.time()
    cv_results = abs(cross_val_score(model, X_train, Y_train, cv=kfold, scoring='neg_mean_squared_error'))
    end = time.time()
    results.append(cv_results)
    names.append(name)
    # RMSE is the square root of the mean fold MSE; the reported STD is of the fold MSE values
    print("%s: RMSE %f (MSE STD %f) (run time: %f s)" % (name, sqrt(cv_results.mean()), cv_results.std(), end-start))
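
The loop above takes the square root of the mean MSE but reports the spread of the MSE values themselves, so the two numbers are in different units (rings versus rings squared). A possible variant, sketched here on synthetic data with a hypothetical helper `cv_rmse`, converts each fold's MSE to RMSE first so that the mean and the standard deviation are both in rings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def cv_rmse(model, X, y, n_splits=10, seed=21):
    """Return (mean, std) of the per-fold RMSE from shuffled K-fold cross-validation."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # scikit-learn returns negated MSE, so flip the sign before taking the root
    mse = -cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
    rmse = np.sqrt(mse)
    return rmse.mean(), rmse.std()

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
mean_rmse, std_rmse = cv_rmse(LinearRegression(), X_demo, y_demo)
print("RMSE %.3f (STD %.3f)" % (mean_rmse, std_rmse))
```

The mean of per-fold RMSE values is generally slightly lower than the root of the mean MSE, so results from the two approaches should not be mixed in one comparison.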

In [None]:
fig = plt.figure()
fig.suptitle('Performance Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

From the initial run, gradient boosting (GBR) appears to perform best on this dataset, so we tune its number of estimators next.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = dict(n_estimators=np.array([50,100,200,300,400, 500]))
model = GradientBoostingRegressor(random_state=21)
kfold = KFold(n_splits=10, shuffle=True, random_state=21)  # shuffle so that random_state is used
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=kfold)
start = time.time()
grid_result = grid.fit(X_train, Y_train)
end = time.time()

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (sqrt(abs(mean)), stdev, param))

print("Best: %f using %s (run time :%f)" % (sqrt(abs(grid_result.best_score_)), grid_result.best_params_, end-start))

The best n_estimators value is 100, which gives the lowest cross-validated root mean square error.

In [None]:
from sklearn.metrics import mean_squared_error

model = GradientBoostingRegressor(random_state=21, n_estimators=100)
model.fit(X_train, Y_train)

# Predict on the held-out test set and round to whole ring counts
predictions = np.round(model.predict(X_test), 0)
print(sqrt(mean_squared_error(Y_test, predictions)))

In [None]:
compare = pd.DataFrame({'Prediction': predictions, 'Test Data' : Y_test})
compare.head(10)

## Feature Importance

Let's take a look at which features the GBR model relied on most.

In [None]:
train = data.drop('Rings', axis = 1)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
priority = pd.DataFrame({'Attribute Label': train.columns, 'Feature Importances': model.feature_importances_})
priority = priority.sort_values('Feature Importances', ascending=False)

In [None]:
fig = plt.figure()
fig.suptitle('Feature Importance Comparison')
ax = fig.add_subplot(111)
plt.bar(range(len(priority)), priority['Feature Importances'])
ax.set_xticks(np.arange(len(priority['Attribute Label'])))
ax.set_xticklabels(priority['Attribute Label'], rotation=70)
plt.ylabel('Importance Score')
plt.xlabel('Attribute Labels')
plt.show()

## Using Neural Network to predict the age

In [None]:
from keras.layers import Dense
from keras.models import Sequential

In [None]:
#Initialise Model
n_cols = X_train.shape[1] # Save the number of columns in predictors: n_cols
model = Sequential()
model.add(Dense(2*n_cols, activation='relu', input_shape=(n_cols,))) # Add the first layer
model.add(Dense(32, activation='relu')) #Add the second layer
model.add(Dense(1)) # Add the output layer, 1 neuron
model.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
from keras import optimizers

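# Initialise an SGD optimizer with Nesterov momentum; recompiling replaces the Adam optimizer set above.
# The legacy `lr` and `decay` arguments are omitted because recent Keras releases do not accept them.
sgd = optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)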

In [None]:
print("[INFO] training the network...")
model.fit(X_train, Y_train, epochs=500, validation_split=0.2, verbose=True)

In [None]:
loss = sqrt(model.evaluate(X_test, Y_test, verbose=True))
print("[INFO] test RMSE={:.3f}".format(loss))

In [None]:
# Flatten the (n_samples, 1) network output to 1-D and round to whole ring counts
predictions = np.round(model.predict(X_test).flatten(), 0)
compare = pd.DataFrame({'Prediction': predictions, 'Test Data' : Y_test})
compare.Prediction = compare.Prediction.astype(int)
compare.head(10)