
For the exercises we will use the Abalone Dataset, which can be downloaded from Monash Gitlab. It consists of physical measurements of abalones collected off the Tasmanian coast in the 1990s, in an effort to determine their age. Previously, the age had to be determined in the laboratory by counting the number of rings in the shell. The dataset is complete, so we will randomly remove entries from two columns in order to perform imputation.

In [3]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [4]:
abalone = pd.read_csv("abalone.csv")
abalone.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


The Sex field takes three categorical values: Male (M), Female (F) and Infant (I). So we need to one-hot encode this field to create three binary columns.

In [5]:
# dtype=int keeps the indicator columns as 0/1 (pandas 2.0+ defaults to bool)
dummy = pd.get_dummies(abalone['Sex'], dtype=int)
abalone = pd.concat([abalone, dummy], axis=1)
abalone.drop(columns=['Sex'], inplace=True)
abalone.head()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,F,I,M
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,0,0,1
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,0,0,1
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,1,0,0
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,0,0,1
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,0,1,0


Lastly, we create a features array (Xf) and a label array (Yf). We then randomly remove 33% of the Height samples and 25% of the Shell weight samples from a copy of the features array (X).

In [28]:
Xf = abalone.drop(columns=['Rings'])
Yf = abalone[['Rings']]

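X = Xf.copy()
# sample() keeps a random fraction of the values; assigning the result back
# aligns on the index, so the rows that were not sampled become NaN
X['Height'] = X['Height'].sample(frac=0.67)
X['Shell weight'] = X['Shell weight'].sample(frac=0.75)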

X.describe()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,F,I,M
count,4177.0,4177.0,2799.0,4177.0,4177.0,4177.0,3133.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139557,0.828742,0.359367,0.180594,0.239572,0.312904,0.321283,0.365813
std,0.120093,0.09924,0.038447,0.490389,0.221963,0.109614,0.139797,0.463731,0.467025,0.481715
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,0.0,0.0,0.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,0.0,0.0,0.0
50%,0.545,0.425,0.145,0.7995,0.336,0.171,0.233,0.0,0.0,0.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.33,1.0,1.0,1.0
max,0.815,0.65,0.25,2.8255,1.488,0.76,1.005,1.0,1.0,1.0
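
The count row above already shows fewer entries for Height and Shell weight. A more direct check is to compute the fraction of missing values per column. The sketch below is illustrative only: it uses a hypothetical synthetic frame (`demo`) rather than the abalone data, and fixes `random_state` so the result is repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the features array: 1000 rows, two columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Height": rng.uniform(0.05, 0.25, size=1000),
    "Shell weight": rng.uniform(0.01, 1.0, size=1000),
})

# Same masking idiom as in the cell above: keep a random fraction of each
# column and let index alignment turn the remaining rows into NaN.
masked = demo.copy()
masked["Height"] = masked["Height"].sample(frac=0.67, random_state=0)
masked["Shell weight"] = masked["Shell weight"].sample(frac=0.75, random_state=0)

# isna() gives a boolean frame; its column means are the missing fractions.
missing_fraction = masked.isna().mean()
print(missing_fraction)
```

Applied to the notebook's own data, the same check would be `X.isna().mean()`.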


Fill in the missing values of X using KNNImputer with 10 neighbours. Then calculate the score (R²) of a Random Forest regressor trained on this imputed dataset.

In [10]:
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

In [29]:
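copy_x = X.copy()
# Despite the name, y holds all ten feature columns, not the labels
y = copy_x.iloc[:,0:10]
# Standardise each column so that no single feature dominates the neighbour distances
Y = (y-y.mean())/y.std()
# Each missing value is replaced by the mean of that feature over the 10 nearest rows
Yt = KNNImputer(n_neighbors=10).fit_transform(Y)
copy_x.iloc[:,0:10] = Yt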
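
The columns are standardised first because KNNImputer finds neighbours with a Euclidean distance that skips missing entries, so a feature on a larger scale would otherwise dominate the search. As an alternative to the manual arithmetic, the sketch below uses scikit-learn's `StandardScaler`, which ignores NaNs when fitting and preserves them when transforming. It is a minimal illustration on hypothetical toy data, not the notebook's method, and `StandardScaler` divides by the population standard deviation whereas pandas `.std()` uses the sample one, so the scaled values differ slightly.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales,
# with one missing entry in the second column.
data = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, np.nan],
    [4.0, 400.0],
    [5.0, 500.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)  # NaN stays NaN after scaling

# Impute in the standardised space, using the 2 nearest rows.
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Map the result back to the original units.
imputed = scaler.inverse_transform(imputed_scaled)
print(imputed)
```

Keeping the fitted scaler object around means the inverse transform uses exactly the statistics that were used for scaling.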

In [30]:
copy_x

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,F,I,M
0,-0.574489,-0.432097,-1.158923,-0.641821,-0.607613,-0.726125,-0.642517,-0.674753,-0.687936,1.316520
1,-1.448812,-1.439757,-1.288973,-1.230130,-1.170770,-1.205077,-1.212989,-0.674753,-0.687936,1.316520
2,0.050027,0.122116,-0.053502,-0.309432,-0.463444,-0.356647,-0.211534,1.481669,-0.687936,-0.759397
3,-0.699393,-0.432097,-0.586705,-0.637743,-0.648160,-0.607527,-0.604962,-0.674753,-0.687936,1.316520
4,-1.615350,-1.540523,-1.497052,-1.271933,-1.215822,-1.287183,-1.320288,-0.674753,1.453277,-0.759397
...,...,...,...,...,...,...,...,...,...,...
4172,0.341468,0.424414,0.661770,0.118799,0.047902,0.532836,0.067443,1.481669,-0.687936,-0.759397
4173,0.549640,0.323648,0.245612,0.279896,0.358765,0.309325,0.149706,-0.674753,-0.687936,1.316520
4174,0.632909,0.676328,1.702167,0.708127,0.748470,0.975296,0.489485,-0.674753,-0.687936,1.316520
4175,0.841081,0.777094,0.531721,0.541933,0.773248,0.733540,0.403646,1.481669,-0.687936,-0.759397


In [33]:
# Undo the standardisation to return the features to their original units
imp_x = (copy_x * y.std()) + y.mean()
imp_x

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,F,I,M
0,0.455,0.365,0.0950,0.5140,0.2245,0.1010,0.14975,0.0,0.0,1.0
1,0.350,0.265,0.0900,0.2255,0.0995,0.0485,0.07000,0.0,0.0,1.0
2,0.530,0.420,0.1375,0.6770,0.2565,0.1415,0.21000,1.0,0.0,0.0
3,0.440,0.365,0.1170,0.5160,0.2155,0.1140,0.15500,0.0,0.0,1.0
4,0.330,0.255,0.0820,0.2050,0.0895,0.0395,0.05500,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
4172,0.565,0.450,0.1650,0.8870,0.3700,0.2390,0.24900,1.0,0.0,0.0
4173,0.590,0.440,0.1490,0.9660,0.4390,0.2145,0.26050,0.0,0.0,1.0
4174,0.600,0.475,0.2050,1.1760,0.5255,0.2875,0.30800,0.0,0.0,1.0
4175,0.625,0.485,0.1600,1.0945,0.5310,0.2610,0.29600,1.0,0.0,0.0
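
Because the entries were removed from a complete dataset, the quality of the imputation can be measured directly by comparing the filled-in values against the originals. The sketch below is one possible way to do this, using a hypothetical helper `imputation_rmse` and synthetic data in which Height depends on Length. It also compares against a simple column-mean fill as a baseline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical synthetic data: Height is roughly proportional to Length.
rng = np.random.default_rng(0)
length = rng.uniform(0.1, 0.8, size=200)
truth = pd.DataFrame({
    "Length": length,
    "Height": 0.3 * length + rng.normal(0, 0.01, size=200),
})

# Remove 33% of the Height values, as in the exercise.
masked = truth.copy()
masked["Height"] = masked["Height"].sample(frac=0.67, random_state=0)

# KNN imputation, wrapped back into a DataFrame with the original labels.
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=10).fit_transform(masked),
    columns=masked.columns,
    index=masked.index,
)

def imputation_rmse(truth, masked, imputed, column):
    """RMSE between imputed and true values, over the removed entries only."""
    gaps = masked[column].isna()
    errors = imputed.loc[gaps, column] - truth.loc[gaps, column]
    return float(np.sqrt((errors ** 2).mean()))

knn_rmse = imputation_rmse(truth, masked, imputed, "Height")

# Baseline: fill every gap with the column mean of the observed values.
mean_filled = masked.fillna(masked.mean())
mean_rmse = imputation_rmse(truth, masked, mean_filled, "Height")

print(f"KNN RMSE:  {knn_rmse:.4f}")
print(f"Mean RMSE: {mean_rmse:.4f}")
```

In the notebook, the equivalent comparison would be between `imp_x` and `Xf` on the rows where `X` was missing, assuming the masked frame was saved under a separate name before `X` is reassigned.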


In [39]:
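# Use the imputed features and the original ring counts as labels
X = imp_x
Y = abalone[['Rings']]
# Hold out 20% of the rows for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)
rfr = RandomForestRegressor()
# ravel() passes the labels as a 1D array, which is what the regressor expects
rfr.fit(X_train, Y_train.values.ravel())
Y_pred = rfr.predict(X_test)
# r2_score is the coefficient of determination (R²), not a classification accuracy
acc = r2_score(Y_test, Y_pred)
print(f"Testing Score: {np.round(acc,3)}")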

Testing Score: 0.535
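
The score above will change slightly from run to run, because the forest is not seeded and the entries removed from X are chosen at random. If a repeatable number is needed, one option is to fix `random_state` on the regressor as well as on the split. The sketch below shows this with a hypothetical helper `fit_and_score` on synthetic data; it is an illustration of the idea, not a replacement for the cell above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic regression problem with three features.
rng = np.random.default_rng(0)
features = rng.uniform(0, 1, size=(500, 3))
target = 10 * features[:, 0] + 5 * features[:, 1] + rng.normal(0, 0.5, size=500)

def fit_and_score(features, target, seed=42):
    """Train a seeded random forest and return its R² on a held-out 20% split."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(random_state=seed)
    model.fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

first = fit_and_score(features, target)
second = fit_and_score(features, target)
print(f"Run 1: {first:.3f}, Run 2: {second:.3f}")
```

With both seeds fixed, the two runs give the same score, which makes it easier to compare different imputation strategies on an equal footing.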
