In [1]:
import pandas as pd 
import numpy as np 

## Abalone Dataset description
 
* The dataset used is Abalone Data set. This dataset is from UCI Machine Learning Repository. The source of the data is from original study in this link https://archive.ics.uci.edu/ml/datasets/Abalone.

* In this task with the dataset, the age of the abalone should be predicted with the physical measurements. In general, the age is determined by cutting the shell through the cone, staining after that, and counting the number of rings present through the microscope.

* This dataset has 4177 samples. There are no missing values in this data set as per the description given. The missing values are already removed from the original samples and the ranges of continuous values are scaled with ANN.

## Attributes

* There are 9 attributes. These attributes are the columns of the data set. The age column is not included in the dataset. But in the description it is given that age = Rings + 1.5.

    * Sex – The data type is categorical and there are three types in this data. M, F and I.
    * Length – It is the continuous datatype and the units are in mm. It is the longest shell measurement.
    * Diameter – It is also continuous datatype and the units are in mm. It is the perpendicular to length.
    * Height – It is a continuous data type. It is numerical just like diameter and length. Its units are in mm and it is the meat in the shell.
    * Whole weight – A continuous data measured in grams. It is the weight of whole abalone.
    * Shucked weight – It is same as whole weight, continuous data, measured in grams and it is the weight of the meat.
    * Viscera weight – It is also continuous data, measured in grams and it is the git weight after bleeding.
    * Shell weight – It is the weight of the shell after being dried. It is continuously varying data.
    * Rings – It is an integer and adding 1.5 to rings gives the age of the abalone.

## 2. Preprocess your dataset. Indicate which steps worked and which didn’t. Include your thoughts on why certain steps worked and certain steps didn’t. 

In [2]:
column_list = ['Sex','Length','Diameter','Height','Whole Weight','Shucked Weight','Viscera Weight','Shell Weight','Rings']
abalone_df = pd.read_csv("abalone.data",names = column_list)
abalone_df

Unnamed: 0,Sex,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight,Rings
0,M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15
1,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7
2,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
3,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
4,I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7
...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4173,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4174,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4175,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10


In [3]:
# Checking Nans in the dataset
abalone_df.isnull().sum()

Sex               0
Length            0
Diameter          0
Height            0
Whole Weight      0
Shucked Weight    0
Viscera Weight    0
Shell Weight      0
Rings             0
dtype: int64

In [4]:
abalone_df.describe()

Unnamed: 0,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight,Rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139516,0.828742,0.359367,0.180594,0.238831,9.933684
std,0.120093,0.09924,0.041827,0.490389,0.221963,0.109614,0.139203,3.224169
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0
50%,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0


In [30]:
# It seems like we have some zeros in Height column. Because minimum height value is 0.
abalone_df[abalone_df['Height'] == 0]

Unnamed: 0,Sex,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight,Rings,Age
1257,1,0.43,0.34,0.0,0.428,0.2065,0.086,0.115,8,9.5
3996,1,0.315,0.23,0.0,0.134,0.0575,0.0285,0.3505,6,7.5


In [31]:
# replacing 0s with it mean
abalone_df['Height'] = abalone_df['Height'].replace(0,np.mean(abalone_df['Height']))

In [32]:
abalone_df.describe()

Unnamed: 0,Sex,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight,Rings,Age
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,1.052909,0.523992,0.407881,0.139583,0.828742,0.359367,0.180594,0.238831,9.933684,11.433684
std,0.82224,0.120093,0.09924,0.041715,0.490389,0.221963,0.109614,0.139203,3.224169,3.224169
min,0.0,0.075,0.055,0.01,0.002,0.001,0.0005,0.0015,1.0,2.5
25%,0.0,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0,9.5
50%,1.0,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0,10.5
75%,2.0,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0,12.5
max,2.0,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0,30.5


In [5]:
# Ie have replaced 0s with that column mean value. Inorder to use machine learning techniques we need to convert 
# categorical columns into numeric. Our Sex column is categorical. We can encode it using scikit library label encoder.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
abalone_df['Sex'] = le.fit_transform(abalone_df['Sex'])

#### Age column

* In the dataset they didn't add age column. Information provided for age column is Rings + 1.5 gives the age in years. Adding output column age in the dataframe 

In [33]:
abalone_df['Age'] = abalone_df['Rings'] + 1.5
abalone_df

Unnamed: 0,Sex,Length,Diameter,Height,Whole Weight,Shucked Weight,Viscera Weight,Shell Weight,Rings,Age
0,2,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15,16.5
1,2,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7,8.5
2,0,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9,10.5
3,2,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10,11.5
4,1,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7,8.5
...,...,...,...,...,...,...,...,...,...,...
4172,0,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11,12.5
4173,2,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10,11.5
4174,2,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9,10.5
4175,0,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10,11.5


In [34]:
# checking correlation with age and other columns
correlation = abalone_df.corr()
correlation['Age']

Sex              -0.034627
Length            0.556720
Diameter          0.574660
Height            0.557502
Whole Weight      0.540390
Shucked Weight    0.420884
Viscera Weight    0.503819
Shell Weight      0.627574
Rings             1.000000
Age               1.000000
Name: Age, dtype: float64

## 3.Create a decision tree model tuned to the best of your abilities. Explain how you tuned it.


Our predicted column is a continuos variable. So this prediction will come under **regression supervised learning**. So I will be using Decision Tree Regressor. My performance metric is Root Mean Squared Error(RMSE)

In [11]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.preprocessing import StandardScaler

In [35]:
X = abalone_df.drop('Age',axis = 1)
y = abalone_df['Age']
              
# splitting the data 
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 6)

# standardScaler
sc = StandardScaler()
X_tr_scale = sc.fit_transform(X_train)
X_te_scale = sc.fit_transform(X_test)

# creating an object for Regressor
DTR = DecisionTreeRegressor(random_state = 6)

# fitting the model
DTR.fit(X_tr_scale,y_train)

# predicting output
y_pred = DTR.predict(X_te_scale)

#calculating MSE
mse_DTR = MSE(y_test,y_pred)
RMSE_DTR = np.sqrt(mse_RFR)
print("RMSE from Decision Tree Regressor: ", RMSE_DTR)

RMSE from Decision Tree Regressor:  0.010444659357341877


## 4.Create a random forest model tuned to the best of your abilities. Explain how you tuned it.

In [36]:
from sklearn.ensemble import RandomForestRegressor
RFR = RandomForestRegressor(random_state = 6)
RFR.fit(X_tr_scale,y_train)
y_pred_RFR = RFR.predict(X_te_scale)
MSE_RFR = MSE(y_test,y_pred_RFR)
np.sqrt(MSE_RFR)

0.010444659357341877

## 5.Create an xgboost model tuned to the best of your abilities. Explain how you tuned it. 


In [37]:
import xgboost as xgb
xgb_regressor = xgb.XGBRegressor()
XG = xgb.XGBRegressor(random_state = 6)
XG.fit(X_tr_scale,y_train)
y_pred_XG = XG.predict(X_te_scale)
MSE_XG = MSE(y_test,y_pred_XG)
np.sqrt(MSE_XG)

0.002223235655209661