# Modelling Experiments 

In this notebook we begin some basic modelling for our problem of predicting fish weight based on various measurements of fish length. 

We will compare a few regression methods and decide which type(s) of model to use, with a rough idea of their potential performance. We will not train the final model in this notebook.

This can be seen as setting a baseline, which is a crucial part of an ML workflow.

In [36]:
#imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error,  mean_squared_error

In [29]:
#Read in
df = pd.read_csv("data/Fish.csv")
df.head()

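|   | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---------|--------|---------|---------|---------|--------|-------|
| 0 | Bream | 242.0 | 23.2 | 25.4 | 30.0 | 11.52 | 4.02 |
| 1 | Bream | 290.0 | 24.0 | 26.3 | 31.2 | 12.48 | 4.3056 |
| 2 | Bream | 340.0 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | Bream | 363.0 | 26.3 | 29.0 | 33.5 | 12.73 | 4.4555 |
| 4 | Bream | 430.0 | 26.5 | 29.0 | 34.0 | 12.444 | 5.134 |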


The Species column contains strings, so it is a categorical variable and must be encoded numerically before modelling. There is no ordinal relationship between species of fish, so ordinal encoding is not appropriate. Instead we use one-hot encoding, which creates one indicator column per species.

In [30]:
df = pd.get_dummies(df)
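
As a small illustration of what one-hot encoding does, the sketch below applies `pd.get_dummies` to a tiny hand-made frame. The rows and the names `toy` and `toy_encoded` are made up for illustration and are not taken from Fish.csv.

```python
import pandas as pd

# Toy frame with made-up values (hypothetical, not from Fish.csv).
toy = pd.DataFrame({
    "Species": ["Bream", "Pike", "Bream", "Smelt"],
    "Weight": [242.0, 200.0, 290.0, 6.7],
})

# Each distinct string in Species becomes its own indicator column;
# numeric columns such as Weight pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["Species"])
print(toy_encoded.columns.tolist())
```

Exactly one indicator column is set in each row, so no ordering between species is implied.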

In [31]:
y = df['Weight']
X = df.drop(['Weight'], axis=1)

In [32]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

In [33]:
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

In [34]:
lin_reg_r2 = r2_score(y_test, y_pred)
lin_reg_mae = mean_absolute_error(y_test, y_pred)
lin_reg_mse = mean_squared_error(y_test, y_pred)

print("Lin Reg... r2: {}, MAE: {}, MSE: {}".format(lin_reg_r2, lin_reg_mae, lin_reg_mse))

r2: 0.9012475959051086, MAE: 87.04429070346575, MSE: 11107.667607870422
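
For reference, the three metrics can be computed by hand. The sketch below uses four made-up weights (hypothetical values, not the notebook's predictions) and plain Python, following the usual definitions of MAE, MSE and R² that the scikit-learn functions above implement.

```python
# Made-up true and predicted weights (hypothetical values for illustration).
y_true = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 320.0, 380.0]

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_hat)]

# Mean absolute error: average size of the residuals.
mae = sum(abs(r) for r in residuals) / n

# Mean squared error: average squared residual (penalises large errors more).
mse = sum(r ** 2 for r in residuals) / n

# R^2: 1 minus the ratio of residual variation to total variation around the mean.
mean_true = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
```

On these toy numbers the MAE is 15, the MSE is 250 and R² is 0.98. Because MSE is in squared units, its square root is often easier to compare against MAE.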


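These metrics suggest good performance from linear regression: on the test set it explains roughly 90% of the variance in weight, with a mean absolute error of about 87 (in the units of Weight). A simple model may also have advantages in terms of interpretability.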
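
To make the interpretability point concrete, here is a minimal sketch of reading coefficients off a fitted `LinearRegression`. The data is made up so that the target is an exact linear function of two columns; the names `toy_X`, `toy_y` and `toy_reg` are hypothetical and separate from the model fitted above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data in which the target is exactly 3*Length1 + 2*Height + 5.
toy_X = pd.DataFrame({
    "Length1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Height": [2.0, 1.0, 4.0, 3.0, 5.0],
})
toy_y = 3 * toy_X["Length1"] + 2 * toy_X["Height"] + 5

toy_reg = LinearRegression().fit(toy_X, toy_y)

# Pair each coefficient with its column name for readability.
coefs = pd.Series(toy_reg.coef_, index=toy_X.columns)
print(coefs)
print("intercept:", toy_reg.intercept_)
```

The same pattern, `pd.Series(reg.coef_, index=X_train.columns)`, should work for the model fitted in this notebook, giving the change in predicted weight per unit change in each feature with the others held fixed.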

We will experiment with a more complicated model.

In [38]:
gb_reg = GradientBoostingRegressor().fit(X_train, y_train)
y_pred_gb = gb_reg.predict(X_test)

In [50]:
gb_reg_r2 = r2_score(y_test, y_pred_gb)
gb_reg_mae = mean_absolute_error(y_test, y_pred_gb)
gb_reg_mse = mean_squared_error(y_test, y_pred_gb)

print("GradientBoost Reg... r2: {}, MAE: {}, MSE: {}".format(gb_reg_r2, gb_reg_mae, gb_reg_mse))

GradientBoost Reg... r2: 0.9511407473156664, MAE: 46.467677894439426, MSE: 5495.687354254525


We see that the more complicated model (GradientBoostingRegressor) performs better than linear regression on this split: R² rises from about 0.90 to 0.95 and MAE falls from about 87 to 46. At this stage we have not done any tuning or optimisation; we are simply trying to get some baseline numbers and use them to guide where we should spend effort in future.

I am interested to see the performance if we remove some of the variables, in particular Height and Width. The reduced set used here also drops Length2, keeping only Length1, Length3 and the one-hot Species columns.

In [55]:
reduced_columns = [
    'Length1', 'Length3',
    'Species_Bream', 'Species_Parkki', 'Species_Perch', 'Species_Pike',
    'Species_Roach', 'Species_Smelt', 'Species_Whitefish',
]
gb_reg = GradientBoostingRegressor().fit(X_train[reduced_columns], y_train)
y_pred_gb = gb_reg.predict(X_test[reduced_columns])


gb_reg_r2 = r2_score(y_test, y_pred_gb)
gb_reg_mae = mean_absolute_error(y_test, y_pred_gb)
gb_reg_mse = mean_squared_error(y_test, y_pred_gb)

print("GradientBoost Reg Reduced Columns... r2: {}, MAE: {}, MSE: {}".format(gb_reg_r2, gb_reg_mae, gb_reg_mse))

GradientBoost Reg Reduced Columns... r2: 0.9537481798006965, MAE: 41.78235851858404, MSE: 5202.403422393519


On this split the reduced model actually performs slightly better (R² 0.954 vs 0.951, MAE 41.8 vs 46.5). The difference is small and comes from a single train/test split, so it should not be over-interpreted.

Using only Length1, Length3 and Species, we get similar performance to using all 6 variables. My preference is to use fewer variables, partly for aesthetic reasons: the final part of this project will be an interactive web app where the user submits fish measurements and gets a predicted weight. Showing three lengths along with Height and Width would be too cluttered.

The two measurements we are keeping also seem to be the most natural and easiest to interpret. 
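
Since the comparison above rests on a single split, one way to check it is k-fold cross-validation. The sketch below is only an illustration under stated assumptions: it uses synthetic data (the frame `demo` and target `weight` are made up and are not Fish.csv), leaves out Species for brevity, and fixes `random_state` so the run is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the fish measurements (hypothetical values).
rng = np.random.default_rng(0)
n = 150
length1 = rng.uniform(10, 40, n)
demo = pd.DataFrame({
    "Length1": length1,
    "Length2": length1 * 1.08 + rng.normal(0, 0.3, n),
    "Length3": length1 * 1.20 + rng.normal(0, 0.5, n),
    "Height": length1 * 0.40 + rng.normal(0, 1.0, n),
    "Width": length1 * 0.15 + rng.normal(0, 0.5, n),
})
# Made-up target: roughly cubic in length, plus noise.
weight = 0.02 * demo["Length1"] ** 3 + rng.normal(0, 20, n)

feature_sets = {
    "all": list(demo.columns),
    "reduced": ["Length1", "Length3"],
}

# The same folds are used for both feature sets so the comparison is like for like.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
cv_mae = {}
for name, cols in feature_sets.items():
    model = GradientBoostingRegressor(random_state=1)
    scores = cross_val_score(
        model, demo[cols], weight, cv=cv, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores  # scikit-learn reports negated MAE; flip the sign
    print(name, "MAE per fold:", cv_mae[name].round(1), "mean:", cv_mae[name].mean().round(1))
```

If the per-fold spread is larger than the gap between the two means, the two feature sets are effectively indistinguishable at this sample size, which would support choosing the smaller set on usability grounds.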

