# Final Project - Abalone Age Prediction 

## 1) Abalone data
Source: Machine Learning Repository, University of California, Irvine (UCI)
* A biological study for the Department of Primary Industry and Fisheries, Tasmania, Australia.
* Abalone are a very common type of shellfish. Their flesh is served as a delicacy and their shells are popular in jewellery.
* 4177 observations
* 9 attributes in data:
1. Sex 
2. Length 
3. Diameter
4. Height
5. Whole Weight
6. Shucked Weight
7. Viscera Weight
8. Shell Weight
9. Ring

## 2) URLs for data acquisition
UCI http://mlr.cs.umass.edu/ml/datasets/Abalone

Abalone data http://mlr.cs.umass.edu/ml/machine-learning-databases/abalone/abalone.data

Data description http://mlr.cs.umass.edu/ml/machine-learning-databases/abalone/abalone.names

## 3) Import data from local file

In [2]:
# read in the data; by default the first row is used as the attribute names
from pandas import read_csv
data=read_csv('C:/Users/Min-Chin/Downloads/abalone.csv',sep=',')
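
The raw `abalone.data` file listed in section 2 has no header row, so the attribute names have to be supplied when it is read directly. The sketch below shows one way to do that. It is an illustration only: the `COLUMNS` list reuses the names from this notebook, and the in-memory `raw` text is a stand-in built from the five rows displayed in section 4, not the full file.

```python
from io import StringIO

import pandas as pd

# Attribute names as used in this notebook (assumed; the raw file carries no header).
COLUMNS = ['Sex', 'Length', 'Diameter', 'Height', 'WholeWeight',
           'ShuckedWeight', 'VisceraWeight', 'ShellWeight', 'Ring']

# Stand-in for the raw file: the five rows shown by data.head() in this notebook.
raw = StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
    "M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10\n"
    "I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7\n"
)

# header=None tells pandas not to treat the first data row as column names.
sample = pd.read_csv(raw, header=None, names=COLUMNS)
print(sample.shape)
```

Replacing `raw` with a path or URL to `abalone.data` should give the full table with the same column names.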

## 4) The head of abalone data

In [3]:
print(data.head())
#data.info()

  Sex  Length  Diameter  Height  WholeWeight  ShuckedWeight  VisceraWeight  \
0   M   0.455     0.365   0.095       0.5140         0.2245         0.1010   
1   M   0.350     0.265   0.090       0.2255         0.0995         0.0485   
2   F   0.530     0.420   0.135       0.6770         0.2565         0.1415   
3   M   0.440     0.365   0.125       0.5160         0.2155         0.1140   
4   I   0.330     0.255   0.080       0.2050         0.0895         0.0395   

   ShellWeight  Ring  
0        0.150    15  
1        0.070     7  
2        0.210     9  
3        0.155    10  
4        0.055     7  


## 5) The shape of abalone data

In [4]:
data.shape

(4177, 9)

## 6) Missing observations for each column of the data

The data set has no missing values.

In [5]:
data.isnull().sum()

Sex              0
Length           0
Diameter         0
Height           0
WholeWeight      0
ShuckedWeight    0
VisceraWeight    0
ShellWeight      0
Ring             0
dtype: int64

## 7) A problem statement.
The goal is to predict the age of an abalone from the available attributes.

## 8) y-variable
y is the age of the abalone, which is represented by the number of rings (per the UCI data description, rings + 1.5 gives the age in years).

Statistical model: Random Forest
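
According to the UCI data description, adding 1.5 to the ring count gives the age in years. A minimal sketch of that conversion, using the ring counts of the five rows shown in section 4 (the names `rings` and `age_years` are illustrative, not part of the notebook):

```python
import pandas as pd

# Ring counts of the first five rows shown in section 4.
rings = pd.Series([15, 7, 9, 10, 7], name='Ring')

# Per the UCI data description: age in years = rings + 1.5.
age_years = rings + 1.5
print(age_years.tolist())  # [16.5, 8.5, 10.5, 11.5, 8.5]
```

Since the offset is a constant, predicting rings and predicting age are equivalent regression problems.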

##### Data Overview

In [6]:
from pandas.plotting import scatter_matrix
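# pairwise Pearson correlations between the numeric columns (Sex is categorical and is excluded)
data.select_dtypes('number').corr()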
#data.Diameter.std()

| | Length | Diameter | Height | WholeWeight | ShuckedWeight | VisceraWeight | ShellWeight | Ring |
|---|---|---|---|---|---|---|---|---|
| Length | 1.0 | 0.986812 | 0.827554 | 0.925261 | 0.897914 | 0.903018 | 0.897706 | 0.55672 |
| Diameter | 0.986812 | 1.0 | 0.833684 | 0.925452 | 0.893162 | 0.899724 | 0.90533 | 0.57466 |
| Height | 0.827554 | 0.833684 | 1.0 | 0.819221 | 0.774972 | 0.798319 | 0.817338 | 0.557467 |
| WholeWeight | 0.925261 | 0.925452 | 0.819221 | 1.0 | 0.969405 | 0.966375 | 0.955355 | 0.54039 |
| ShuckedWeight | 0.897914 | 0.893162 | 0.774972 | 0.969405 | 1.0 | 0.931961 | 0.882617 | 0.420884 |
| VisceraWeight | 0.903018 | 0.899724 | 0.798319 | 0.966375 | 0.931961 | 1.0 | 0.907656 | 0.503819 |
| ShellWeight | 0.897706 | 0.90533 | 0.817338 | 0.955355 | 0.882617 | 0.907656 | 1.0 | 0.627574 |
| Ring | 0.55672 | 0.57466 | 0.557467 | 0.54039 | 0.420884 | 0.503819 | 0.627574 | 1.0 |
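
To see at a glance which attributes track the ring count most closely, the `Ring` column of the correlation matrix can be sorted. The sketch below copies the reported correlations into a Series by hand so that it runs on its own; `ring_corr` and `ranked` are illustrative names.

```python
import pandas as pd

# Correlations with Ring, copied from the matrix reported above.
ring_corr = pd.Series({
    'Length': 0.55672,
    'Diameter': 0.57466,
    'Height': 0.557467,
    'WholeWeight': 0.54039,
    'ShuckedWeight': 0.420884,
    'VisceraWeight': 0.503819,
    'ShellWeight': 0.627574,
})

# Strongest linear association with Ring first.
ranked = ring_corr.sort_values(ascending=False)
print(ranked)
```

ShellWeight has the strongest linear association with Ring and ShuckedWeight the weakest. With the full frame loaded, an expression along the lines of `data.select_dtypes('number').corr()['Ring'].drop('Ring').sort_values(ascending=False)` should give the same ranking.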


##### Data Pre-processing
* Combined three features:
  - **Length × Diameter × Height** as the abalone **Volume**, to represent the size of the shellfish
  - this makes the size measure more comparable to the other four weight-related attributes

In [7]:
data['Volume']=data['Length']*data['Diameter']*data['Height']
data.head()

| | Sex | Length | Diameter | Height | WholeWeight | ShuckedWeight | VisceraWeight | ShellWeight | Ring | Volume |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 | 0.015777 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 | 0.008347 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 | 0.030051 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 | 0.020075 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 | 0.006732 |
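
As a quick sanity check, the Volume values can be reproduced by hand from the three dimensions. The sketch below recomputes them for the first three rows shown above; `head3` is an illustrative name, and the product is a bounding-box style size measure rather than a true physical volume.

```python
import numpy as np
import pandas as pd

# Dimensions of the first three rows shown above.
head3 = pd.DataFrame({
    'Length':   [0.455, 0.350, 0.530],
    'Diameter': [0.365, 0.265, 0.420],
    'Height':   [0.095, 0.090, 0.135],
})

# Element-wise product of the three dimensions, as in the notebook cell.
head3['Volume'] = head3['Length'] * head3['Diameter'] * head3['Height']

# Compare against the Volume values displayed above (which are shown to 6 decimals).
reported = [0.015777, 0.008347, 0.030051]
print(np.allclose(head3['Volume'], reported, atol=1e-6))
```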


##### Normalization
Simple standardization: each column is rescaled to mean 0 and variance 1.

In [8]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

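scaler=StandardScaler()
# the slice 'WholeWeight':'Volume' covers the four weights, Ring and Volume,
# so the response (Ring) is standardized too; Sex, Length, Diameter and Height are left out
scaledData=scaler.fit_transform(data.loc[:,'WholeWeight':'Volume'])
df=pd.DataFrame(scaledData)
df.head()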

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | -0.641898 | -0.607685 | -0.726212 | -0.638217 | 1.571544 | -0.895386 |
| 1 | -1.230277 | -1.17091 | -1.205221 | -1.212987 | -0.910013 | -1.246658 |
| 2 | -0.309469 | -0.4635 | -0.35669 | -0.207139 | -0.289624 | -0.22052 |
| 3 | -0.637819 | -0.648238 | -0.6076 | -0.602294 | 0.020571 | -0.692184 |
| 4 | -1.272086 | -1.215968 | -1.287337 | -1.320757 | -0.910013 | -1.323039 |
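
Standardization replaces each value x by z = (x - mean) / std, computed per column. The sketch below checks on a small illustrative column that `StandardScaler` matches this formula when the population standard deviation (NumPy's default, `ddof=0`) is used. The values are the five ring counts from section 4, so the resulting z-scores differ from the full-data ones shown above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One illustrative column: the ring counts of the first five rows.
x = np.array([[15.0], [7.0], [9.0], [10.0], [7.0]])

# Manual z-score; np.std uses the population formula (ddof=0) by default.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation via scikit-learn.
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))  # True
```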


In [9]:
import numpy as np

##### Divide data into predictors and the response

In [10]:
# columns 0-3 and 5 are the scaled weights and Volume; column 4 is the scaled Ring
X=df.iloc[:,[0,1,2,3,5]]
#X.head()
y=df.iloc[:,4]
y.shape

(4177,)

##### Train and Test Split

In [11]:
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
kf=KFold(n_splits=5,shuffle=True,random_state=35)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=35)

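##### Random Forest (Ensemble)
* a collection of individually weak (poorly performing) trees is averaged into a strong and robust model
* subsampling the features considered at each split (`max_features`)
* tree depth is controlled through `max_depth` (scikit-learn grows trees fully when it is left at the default `None`)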

In [12]:
from sklearn.ensemble import RandomForestRegressor
regressor=RandomForestRegressor()
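# max_depth must be at least 1, so the grid starts at 1; gscv is only defined here and is not fitted below
gscv=GridSearchCV(regressor,{'max_depth':range(1,20),'n_estimators':range(2,20)},cv=kf,n_jobs=-1)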
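
The grid search object above is constructed but never fitted in this notebook. A minimal sketch of how such a search could be run and its best model evaluated is given below. It uses synthetic data from `make_regression` and a deliberately small grid so that it runs quickly; the `_demo` names, the grid values and the `neg_mean_squared_error` scoring choice are assumptions for illustration, not the author's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in for the abalone predictors and response (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=35)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=35)

# 5-fold cross-validation on the training split only.
kf_demo = KFold(n_splits=5, shuffle=True, random_state=35)

# Small illustrative grid; max_depth values must be >= 1 (or None).
param_grid = {'max_depth': [2, 4, 8], 'n_estimators': [5, 10, 20]}

search = GridSearchCV(RandomForestRegressor(random_state=35), param_grid,
                      cv=kf_demo, scoring='neg_mean_squared_error')
search.fit(X_tr, y_tr)

# Best hyperparameters and predictions from the refitted best model.
print(search.best_params_)
best_forest = search.best_estimator_
test_predictions = best_forest.predict(X_te)
```

With the default `refit=True`, `best_estimator_` is retrained on the whole training split using the best parameter combination, so it can be used directly on the held-out test set.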

In [18]:
# fit the default RandomForestRegressor (not the grid search gscv) on the training split
rf=regressor.fit(X_train,y_train)

In [19]:
ypredict=regressor.predict(X_test)

In [20]:
from sklearn.metrics import mean_squared_error
# y is the standardized Ring column, so this MSE is in standardized (unit-variance) units
mean_squared_error(y_test, ypredict)

0.49367994863197856
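
Because Ring was standardized together with the other columns (it is column 4 of the scaled frame), the MSE above is expressed in squared standard deviations of Ring rather than in squared rings. Multiplying by the variance of Ring converts it back. The sketch below illustrates the relationship on a toy example; the predictions in `rings_pred` are hypothetical and the names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Toy ring counts and hypothetical predictions, in original units.
rings_true = np.array([[15.0], [7.0], [9.0], [10.0], [7.0]])
rings_pred = np.array([[12.0], [8.0], [9.0], [11.0], [6.0]])

# Standardize with a scaler fitted on the true values.
ring_scaler = StandardScaler().fit(rings_true)
mse_scaled = mean_squared_error(ring_scaler.transform(rings_true),
                                ring_scaler.transform(rings_pred))

# scale_ holds the per-column standard deviation; squaring it gives the variance.
mse_rings = mse_scaled * ring_scaler.scale_[0] ** 2

print(np.isclose(mse_rings, mean_squared_error(rings_true, rings_pred)))  # True
```

In the notebook, the Ring entry of `scaler.scale_` (index 4 of the fitted scaler) would play the role of `ring_scaler.scale_[0]`, and the square root of the converted value gives an error in rings.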