# Data preprocessing

The book recommends the wonderful pandas library for loading and preprocessing real-world data before it is fed into machine learning models.

In [1]:
import os

In [2]:
os.makedirs(os.path.join("..", "data"), exist_ok=True)
data_file = os.path.join("..", "data", "house_tiny.csv")
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

In [3]:
os.listdir(os.path.join("..", "data"))

['house_tiny.csv']

In [5]:
# Now we can open the file with pandas
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


In supervised learning, we predict a target value from a set of input (feature) values, so the first step is to separate the targets from the features in the dataset. We can do this by selecting columns, either by name or by integer position using the pandas.DataFrame.iloc indexer.
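The two ways of selecting columns can be compared side by side. The following is a minimal sketch that rebuilds the same toy table in memory (rather than reading the CSV) so that it runs on its own; the variable names are only illustrative.

```python
import pandas as pd

# Same toy table as above, rebuilt in memory so the snippet is self-contained.
data = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "RoofType": [None, None, "Slate", None],
    "Price": [127500, 106000, 178100, 140000],
})

# Select by name: keeps working if the column order changes.
inputs_by_name = data[["NumRooms", "RoofType"]]
targets_by_name = data["Price"]

# Select by position: the first two columns are features, the third is the target.
inputs_by_pos = data.iloc[:, 0:2]
targets_by_pos = data.iloc[:, 2]

# Both approaches return the same data for this table.
print(inputs_by_name.equals(inputs_by_pos))
print(targets_by_name.equals(targets_by_pos))
```

Selecting by name is usually the safer choice when the column order is not guaranteed; selecting by position is convenient when the target is known to be the last column.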

The dataset contains several missing values, which is common in real-world applications. These can be handled either by deletion or by imputation, where imputation means estimating the missing values using some heuristic.
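The notes below use imputation, so here is a rough sketch of what the deletion option could look like on the same toy table. It assumes two common variants: dropping rows that contain any missing value, and dropping the column with the most missing values. The table is rebuilt in memory so the snippet is self-contained.

```python
import pandas as pd

# Same toy table as above, rebuilt in memory.
data = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "RoofType": [None, None, "Slate", None],
    "Price": [127500, 106000, 178100, 140000],
})

# Count the missing entries in each column.
missing_per_column = data.isna().sum()
print(missing_per_column)

# Deletion, variant 1: drop every row that has at least one missing value.
rows_dropped = data.dropna(axis=0)

# Deletion, variant 2: drop the column with the most missing values.
worst_column = missing_per_column.idxmax()
column_dropped = data.drop(columns=worst_column)

print(rows_dropped)
print(column_dropped)
```

On a table this small, row deletion leaves a single example, which illustrates why imputation is often preferred when data is scarce.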

For categorical values, like the roof type, we can treat NaN as a category in its own right. pandas can do this automatically with the get_dummies function and its dummy_na=True argument.

In [33]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]

In [34]:
inputs

   NumRooms RoofType
0       NaN      NaN
1       2.0      NaN
2       4.0    Slate
3       NaN      NaN


In [35]:
inputs = pd.get_dummies(inputs, dummy_na=True)

inputs

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True


For missing numerical values, a common approach is to impute them by replacing each missing entry with the mean of its column.

In [36]:
print(inputs.mean())
inputs = inputs.fillna(inputs.mean())

NumRooms          3.00
RoofType_Slate    0.25
RoofType_nan      0.75
dtype: float64


In [37]:
inputs

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True
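The cell above computes the mean of every column, including the indicator columns, although only NumRooms actually has missing entries. A slightly narrower variant, sketched here as an alternative rather than as the book's method, imputes the numerical column by name and then one-hot encodes the categorical column. The frame is rebuilt in memory so the snippet runs on its own.

```python
import pandas as pd

# Feature columns of the toy table, rebuilt in memory.
inputs = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "RoofType": [None, None, "Slate", None],
})

# Impute only the numerical column with its own mean.
room_mean = inputs["NumRooms"].mean()
inputs["NumRooms"] = inputs["NumRooms"].fillna(room_mean)

# Then one-hot encode the categorical column, keeping NaN as its own category.
inputs = pd.get_dummies(inputs, dummy_na=True)

print(inputs)
```

The result matches the table above; the difference is only that the imputation step is explicit about which column it touches.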


## Conversion to the tensor format

Now that all inputs and targets are numerical and contain no missing values, they can be converted into tensors. The conversion goes through a NumPy array as an intermediate step.

In [38]:
import torch

In [41]:
X = torch.tensor(inputs.to_numpy(dtype="float"))
y = torch.tensor(targets.to_numpy(dtype="float"))

In [42]:
X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))
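The tensors above have dtype torch.float64 because the string "float" maps to 64-bit floats in NumPy, whereas PyTorch uses float32 as its default floating-point type. The following sketch shows one way to choose the precision at the NumPy stage; it uses only pandas and NumPy, and the frame is a hand-written copy of the imputed inputs above.

```python
import pandas as pd

# Hand-written copy of the imputed, one-hot encoded inputs.
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "RoofType_Slate": [False, False, True, False],
    "RoofType_nan": [True, True, False, True],
})

# "float" is the Python float type, which NumPy stores as float64.
as_float64 = inputs.to_numpy(dtype="float")

# Requesting float32 explicitly halves the memory per element.
as_float32 = inputs.to_numpy(dtype="float32")

print(as_float64.dtype, as_float32.dtype)
```

Passing the float32 array to torch.tensor would then produce a float32 tensor, since torch.tensor infers its dtype from the NumPy array it is given.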

## Experimentation

In [44]:
from ucimlrepo import fetch_ucirepo 

In [45]:
abalone = fetch_ucirepo(id=1) 

In [50]:
X = abalone.data.features
y = abalone.data.targets

In [54]:
metadata = abalone.metadata
variables = abalone.variables

In [55]:
variables

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | Sex | Feature | Categorical | | M, F, and I (infant) | | no |
| 1 | Length | Feature | Continuous | | Longest shell measurement | mm | no |
| 2 | Diameter | Feature | Continuous | | perpendicular to length | mm | no |
| 3 | Height | Feature | Continuous | | with meat in shell | mm | no |
| 4 | Whole_weight | Feature | Continuous | | whole abalone | grams | no |
| 5 | Shucked_weight | Feature | Continuous | | weight of meat | grams | no |
| 6 | Viscera_weight | Feature | Continuous | | gut weight (after bleeding) | grams | no |
| 7 | Shell_weight | Feature | Continuous | | after being dried | grams | no |
| 8 | Rings | Target | Integer | | +1.5 gives the age in years | | no |


In [63]:
features = pd.get_dummies(X)

In [64]:
X_abalone = torch.tensor(features.to_numpy(dtype=float))
y_abalone = torch.tensor(y.to_numpy(dtype=float))

In [67]:
X_abalone.shape

torch.Size([4177, 10])

In [68]:
y_abalone.shape

torch.Size([4177, 1])
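The shapes above can be explained with a small stand-in. The feature tensor has 10 columns because the 7 continuous features are kept as they are and the categorical Sex column is expanded into one indicator column per value (assuming the three values M, F and I listed in the variables table). The target has shape (4177, 1) rather than (4177,) because the targets are a one-column DataFrame. The rows below are made up and only mimic the column layout; they are not real abalone data.

```python
import pandas as pd

# Made-up rows that only mimic the abalone column layout (not real data).
X_toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.45, 0.53, 0.33, 0.44],
    "Diameter": [0.36, 0.42, 0.25, 0.36],
})
y_toy = pd.DataFrame({"Rings": [15, 9, 7, 10]})

# The categorical column is replaced by one indicator column per category.
features_toy = pd.get_dummies(X_toy)
print(list(features_toy.columns))

# A one-column DataFrame converts to a 2-D array of shape (n, 1) ...
y_2d = y_toy.to_numpy(dtype=float)
# ... while selecting the column first gives a 1-D array of shape (n,).
y_1d = y_toy["Rings"].to_numpy(dtype=float)
print(y_2d.shape, y_1d.shape)
```

Whether the 2-D or the 1-D target is preferable depends on what the downstream loss function expects, so this is a choice to make deliberately rather than a fix.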