Picking up where we left off in the exploratory data analysis, let's load the finalized crab_df_cleaned dataset, from which outliers were removed. In this next step of the capstone, we'll prepare the dataset for modelling to help elucidate relationships between the features in the dataset and Age.

In [41]:
#Import necessary modules
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [42]:
#Import crab_df
crab_df = pd.read_csv('./cleaned_data/crab_df_eda.csv', index_col= [0]).drop(columns = ['index'])

In [43]:
#Check first few observations of dataset to ensure import was successful
crab_df.head()

Unnamed: 0,id,Sex,Length,Diameter,Height,Weight,Shucked_Weight,Viscera_Weight,Shell_Weight,Age
0,0,I,0.8125,0.625,0.1625,5.244657,2.423882,1.119805,1.13398,5.0
1,1,M,1.175,0.925,0.2875,13.366789,5.556502,2.806601,4.2666,8.0
2,2,I,0.85,0.65,0.25,5.301356,2.679028,1.190679,1.644271,6.0
3,3,M,1.3375,1.0875,0.4125,26.322511,10.999606,6.562909,7.654365,10.0
4,4,M,1.7,1.3,0.425,44.22522,24.67824,9.043491,10.517665,9.0


In [44]:
#Check datatypes of crab_df
crab_df.dtypes

id                  int64
Sex                object
Length            float64
Diameter          float64
Height            float64
Weight            float64
Shucked_Weight    float64
Viscera_Weight    float64
Shell_Weight      float64
Age               float64
dtype: object

Before we dive into modelling, we will need to do something about Sex (the crab's gender), our only categorical variable. String variables like this cannot be handled directly by most modelling algorithms, so we will need to represent the different categories of Sex (Male, Female, and Indeterminate) as numeric values.

One common option for handling categorical variables is Label Encoding, where each value of a categorical variable is represented as a number. For example, Male = 0, Female = 1, and Indeterminate = 2. This approach is straightforward to read, however some algorithms can misinterpret the ordering and the difference in magnitude between the numbers as meaningful (e.g. that Indeterminate is "greater than" Female), which can impact their performance.

To mitigate this, One Hot Encoding is a common alternative for unordered categories like this one. One Hot Encoding converts each category value into a new column and assigns a 1 or 0 (True/False) value to each newly created column. This avoids the improper weighting of values seen in the Label Encoding approach but does have the downside of adding additional columns to the data set.

An example of Label Encoding vs One Hot Encoding is shown below.

Image Source : KDNuggets

**Example of Label Encoding Categorical Variables**
<p align="right">
<img src="./figures/Label_Encoding_Example.jpg" alt="alt text" width="500" height="300" class="blog-image align">
</p>

**Example of One Hot Encoding Categorical Variables**
<p align="center">
<img src="./figures/One_Hot_Encoding_Example.jpg" alt="alt text" width="600" height="400" class="blog-image">
</p>
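The same contrast can be sketched in pandas. This is a minimal illustration on a toy column, not the crab data itself; the `toy` dataframe and the `label_map` dictionary are assumptions made up for the example.

```python
import pandas as pd

# Toy data using the same three category codes as the Sex column
toy = pd.DataFrame({"Sex": ["M", "F", "I", "M"]})

# Label encoding: map each category to an arbitrary integer
label_map = {"M": 0, "F": 1, "I": 2}
toy_label = toy["Sex"].map(label_map)

# One hot encoding: one 0/1 indicator column per category
toy_onehot = pd.get_dummies(toy, columns=["Sex"], dtype=float)

print(toy_label.tolist())
print(toy_onehot)
```

In the label-encoded version a model could read I (2) as "twice" F (1), whereas in the one hot version every row simply has a single 1.0 in the column for its category.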

In [45]:
#One Hot Encoding the Sex categorical variable (get_dummies only converts object/categorical columns)
crab_df_modelling = pd.get_dummies(crab_df, dtype = float)

In [46]:
#Review modelling dataframe
crab_df_modelling.head()

Unnamed: 0,id,Length,Diameter,Height,Weight,Shucked_Weight,Viscera_Weight,Shell_Weight,Age,Sex_F,Sex_I,Sex_M
0,0,0.8125,0.625,0.1625,5.244657,2.423882,1.119805,1.13398,5.0,0.0,1.0,0.0
1,1,1.175,0.925,0.2875,13.366789,5.556502,2.806601,4.2666,8.0,0.0,0.0,1.0
2,2,0.85,0.65,0.25,5.301356,2.679028,1.190679,1.644271,6.0,0.0,1.0,0.0
3,3,1.3375,1.0875,0.4125,26.322511,10.999606,6.562909,7.654365,10.0,0.0,0.0,1.0
4,4,1.7,1.3,0.425,44.22522,24.67824,9.043491,10.517665,9.0,0.0,0.0,1.0


Before proceeding with the modelling, we will need to split our data into our independent variables, i.e. the features like length and diameter that will be used to predict age, and our dependent variable, age. Following this, the independent and dependent variable values will be split into training and testing sets. Training sets are used to create the model, and the testing data is a portion of the data purposefully held out of training to validate the performance of the model once it has been made. If a model is evaluated on the same data used to create it, overfitting can go undetected. Overfitting is a situation where a model accurately predicts the outcome on data it has seen during training but is not accurate on unseen data. Typically, 20-30% of all data is withheld for testing when developing a model.
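To make the role of the held-out set concrete, here is a small sketch on synthetic data (not the crab dataset). The decision tree and every name in it (`X_demo`, `y_demo`, `tree`) are assumptions chosen for illustration, not the models used later in this capstone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature plus random noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(500, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=2.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unconstrained tree can memorize the training rows, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)  # R^2 on data the model has seen
test_r2 = tree.score(X_te, y_te)   # R^2 on held-out data
print(f"train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

Scoring only on the training rows would make this model look near perfect; the gap between the two scores is what the testing set is there to reveal.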

In [47]:
#Splitting crab_df_modelling into independent variables, X, & dependent variable, y (Age)
#id values will be dropped as this is not a physical attribute or feature that we want to incorporate in our model
X = crab_df_modelling.drop(['id', 'Age'], axis = 1)
y = crab_df_modelling.Age

In [48]:
#Review X
X

Unnamed: 0,Length,Diameter,Height,Weight,Shucked_Weight,Viscera_Weight,Shell_Weight,Sex_F,Sex_I,Sex_M
0,0.8125,0.6250,0.1625,5.244657,2.423882,1.119805,1.133980,0.0,1.0,0.0
1,1.1750,0.9250,0.2875,13.366789,5.556502,2.806601,4.266600,0.0,0.0,1.0
2,0.8500,0.6500,0.2500,5.301356,2.679028,1.190679,1.644271,0.0,1.0,0.0
3,1.3375,1.0875,0.4125,26.322511,10.999606,6.562909,7.654365,0.0,0.0,1.0
4,1.7000,1.3000,0.4250,44.225220,24.678240,9.043491,10.517665,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...
45350,1.0625,0.8125,0.2250,9.128539,3.855532,1.729319,2.735727,0.0,1.0,0.0
45351,1.0875,0.8750,0.2750,11.594945,4.294949,2.537280,3.756309,1.0,0.0,0.0
45352,1.3125,1.0375,0.3500,22.963095,9.738053,3.983105,7.229122,0.0,0.0,1.0
45353,0.9625,0.7125,0.2125,8.689122,3.217668,1.828543,2.267960,0.0,1.0,0.0


In [49]:
#Review y
y

0         5.0
1         8.0
2         6.0
3        10.0
4         9.0
         ... 
45350     8.0
45351    10.0
45352    11.0
45353     6.0
45354    12.0
Name: Age, Length: 45355, dtype: float64

Now that the independent variable and dependent variable values have been saved separately, we will split X and y into training and testing sets, withholding 30% of the data for testing.

In [50]:
#Split data into training/testing sets, withhold 30% of data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

In [51]:
#Review X split
X_train.shape, X_test.shape

((31748, 10), (13607, 10))

In [52]:
#Review y split
y_train.shape, y_test.shape

((31748,), (13607,))
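The row counts above follow directly from the 30% test fraction. As a quick sanity check, the sketch below reproduces them on a placeholder array of the same length; it assumes the rounding behaviour of `train_test_split` when only `test_size` is given (test count rounded up, remainder used for training).

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 45355  # number of rows in X and y above
test_size = 0.3

# Expected sizes: test count rounded up, remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Check against train_test_split on a placeholder array of the same length
train_idx, test_idx = train_test_split(
    np.arange(n_samples), test_size=test_size, random_state=47
)
print(n_train, n_test, len(train_idx), len(test_idx))
```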

Lastly, before modelling, the independent variable values (the X training and testing sets) will be normalized. Many machine learning algorithms, particularly those that rely on distances or gradient-based optimization, perform best when features are on comparable scales. The terms scaling and normalizing are often used interchangeably but refer to different things. Both have a place in data science. Tree-based algorithms such as XGBoost, which will be employed in the subsequent modelling steps, are largely insensitive to feature scale, but normalizing here lets the same inputs be used for any scale-sensitive algorithm as well. For normalization, we will use the sklearn StandardScaler class, fit on the training set only so that no information from the test set leaks into the transformation.

   - **Scaling** : Transforming data so all values are within a specific range, like 0-100 or 0-1; scaling data is valuable when you’re using methods based on measures of how far apart data points are (like k-Nearest Neighbors or k-means clustering)
   <br>
   <br>
   - **Normalization** : Changing the values of your data so that each feature has a mean of 0 and a standard deviation of 1 (also called standardization). This changes the location and spread of a feature but not the shape of its distribution, so a skewed feature stays skewed; normalize data if you’re going to be using a machine learning or statistics technique that is sensitive to the scale of its inputs.
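The difference between the two transforms can be sketched on a toy feature. The values below are hypothetical and not taken from the crab data; `MinMaxScaler` stands in for range scaling and `StandardScaler` for the zero-mean, unit-variance transform used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy, right-skewed feature (hypothetical values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [20.0]])

# Range scaling: every value is squeezed into [0, 1]
scaled = MinMaxScaler().fit_transform(values)

# StandardScaler: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(values)

print(scaled.ravel())
print(standardized.ravel())
print(standardized.mean(), standardized.std())
```

In both outputs the largest value still sits far from the other four: either transform changes the units of the feature, not the shape of its distribution.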

In [53]:
#Normalizing X_train & X_test using StandardScaler(); y values will be left as is
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [54]:
#Review normalization was performed appropriately
X_train_scaled.mean(axis = 0)

array([-4.22995392e-16,  6.93801966e-17, -4.86332797e-16,  6.62468974e-17,
       -3.57196109e-16, -1.24436740e-16,  8.95228343e-17,  9.98179602e-17,
       -1.56664960e-17, -8.07943579e-17])

In [55]:
#Review normalization was performed appropriately
X_train_scaled.std(axis = 0, ddof = 0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [56]:
#Review test set: means should be close to, but not exactly, 0 because the scaler was fit on X_train only
X_test_scaled.mean(axis = 0)

array([-0.00146365, -0.00171342,  0.00043667, -0.00077406,  0.00191731,
       -0.00596893,  0.00052852,  0.00638273, -0.00523258, -0.00095411])

In [57]:
#Review test set: standard deviations should be close to, but not exactly, 1 for the same reason
X_test_scaled.std(axis = 0, ddof = 0)

array([0.99737105, 1.00040994, 0.99846115, 1.00040343, 1.00679183,
       0.99698801, 0.99566923, 1.00276741, 0.99811418, 0.99973975])

In [58]:
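#Save training & testing data as dataframes
#Pass X.columns directly; wrapping it in a list would create MultiIndex column labels
#Note: the scaled X dataframes get a fresh 0..n-1 index while y keeps its original row labels; rows still align by position
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns = X.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns = X.columns)
y_train_df = pd.DataFrame(y_train, columns = ['Age'])
y_test_df = pd.DataFrame(y_test, columns = ['Age'])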

In [59]:
#Check head of each df to ensure df creation was correct
X_train_scaled_df.head()

Unnamed: 0,Length,Diameter,Height,Weight,Shucked_Weight,Viscera_Weight,Shell_Weight,Sex_F,Sex_I,Sex_M
0,0.873989,0.860293,0.937882,0.944023,0.606259,0.905845,0.989585,1.528832,-0.704519,-0.764144
1,0.460982,0.693374,0.500199,0.51169,0.439462,0.609758,0.360296,-0.654094,-0.704519,1.308655
2,-0.043803,0.13698,0.646093,0.165359,-0.46381,0.287699,0.693946,-0.654094,-0.704519,1.308655
3,0.552762,0.526456,0.500199,0.597692,0.421499,0.698065,0.398307,-0.654094,-0.704519,1.308655
4,0.369203,0.359538,0.354304,-0.012455,-0.119951,-0.029166,-0.06627,-0.654094,-0.704519,1.308655


In [60]:
X_test_scaled_df.head()

Unnamed: 0,Length,Diameter,Height,Weight,Shucked_Weight,Viscera_Weight,Shell_Weight,Sex_F,Sex_I,Sex_M
0,0.231534,0.248259,0.500199,-0.027563,-0.350901,-0.138251,-0.028259,-0.654094,-0.704519,1.308655
1,1.470554,1.305408,1.521461,2.071342,1.494134,2.355112,1.623098,1.528832,-0.704519,-0.764144
2,0.919879,1.027211,1.083777,1.469331,1.437679,1.233099,1.158522,-0.654094,-0.704519,1.308655
3,0.415093,0.470817,0.500199,0.230442,0.390705,0.313671,0.187136,1.528832,-0.704519,-0.764144
4,0.231534,0.303898,0.062515,0.11771,-0.022439,-0.14864,0.313838,-0.654094,1.419408,-0.764144


In [61]:
y_train_df.head()

Unnamed: 0,Age
11735,12.0
21925,10.0
12389,12.0
12361,9.0
2357,12.0


In [62]:
y_test_df.head()

Unnamed: 0,Age
25384,10.0
37878,10.0
39380,10.0
43279,10.0
4762,13.0


In [63]:
#Save each dataframe to csv for use in subsequent modelling steps
X_train_scaled_df.to_csv('./cleaned_data/X_train_scaled.csv')
X_test_scaled_df.to_csv('./cleaned_data/X_test_scaled.csv')
y_train_df.to_csv('./cleaned_data/y_train.csv')
y_test_df.to_csv('./cleaned_data/y_test.csv')
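When these files are read back in a later notebook, the saved index comes back as an extra column unless `index_col=0` is passed. The round trip below is a minimal sketch using small stand-in frames written to a temporary directory; the names `X_demo`, `y_demo` and the file names are hypothetical, and re-aligning y by position is one possible choice rather than a required step.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the saved frames: X has a fresh 0..n-1 index,
# y keeps the original row labels from before the split
X_demo = pd.DataFrame({"Length": [0.87, 0.46], "Sex_M": [-0.76, 1.31]})
y_demo = pd.DataFrame({"Age": [12.0, 10.0]}, index=[11735, 21925])

with tempfile.TemporaryDirectory() as tmp:
    X_path = Path(tmp) / "X_demo.csv"
    y_path = Path(tmp) / "y_demo.csv"
    X_demo.to_csv(X_path)
    y_demo.to_csv(y_path)

    # index_col=0 restores the saved index instead of adding an 'Unnamed: 0' column
    X_loaded = pd.read_csv(X_path, index_col=0)
    y_loaded = pd.read_csv(y_path, index_col=0)

# Give y the same positional index as X so the two line up label for label
y_aligned = y_loaded["Age"].reset_index(drop=True)
print(X_loaded.index.tolist(), y_aligned.index.tolist())
```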