## 2.2. Data Preprocessing

*Studying and coding along with the printed book __"Dive into Deep Learning"__ by Aston Zhang, Zachary C. Lipton, Mu Li & Alexander J. Smola. The accompanying website for the chapter Preliminaries > Data Preprocessing can be found at [d2l.ai](https://d2l.ai/chapter_preliminaries/pandas.html).*

In [18]:
import os
import pandas as pd
import torch

### 2.2.1. Reading the Dataset

In [5]:
# creating an example csv file
os.makedirs(os.path.join('..', 'assets/data'), exist_ok=True)
data_file = os.path.join('..', 'assets/data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

In [12]:
# loading the dataset with pandas' read_csv
# pandas parses every CSV entry equal to the string NA as the special NaN (not a number) value
data = pd.read_csv(data_file)
data

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000
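
Before preparing the data, it can help to check how many entries pandas actually parsed as missing. The following is a minimal sketch, assuming the same four rows as above; it reads them from an in-memory string (`csv_text` is a name chosen for this illustration) so that it runs without the CSV file on disk.

```python
import io

import pandas as pd

# same content as house_tiny.csv, kept in memory for this sketch
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""

data = pd.read_csv(io.StringIO(csv_text))

# isna() marks every missing entry, sum() counts them per column
missing_per_column = data.isna().sum()
print(missing_per_column)
```

The column with the most missing entries (here `RoofType`) is usually the first candidate for imputation or deletion.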


### 2.2.2. Data Preparation

In supervised learning, a model is trained on many examples of input data paired with their correct answers (labels). From these examples it learns patterns that it can then apply to make predictions on new, unseen data, similar to how humans learn from examples with feedback (*from Claude.ai*).

__Processing the dataset:__

1. Separating out columns corresponding to input versus target values.

In [13]:
# selecting columns via integer-location based indexing (iloc)
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
# inputs: NumRooms, RoofType
# targets: Price

# missing values can be handled via imputation or deletion
# imputation replaces missing values with estimates of their values
# imputation heuristic for categorical input fields: NaN is treated as its own category
# pandas converts the RoofType column into the two columns RoofType_Slate and RoofType_nan
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True
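
The indicator columns above are printed as `True`/`False`. If 0/1 values are preferred right away, `pd.get_dummies` accepts a `dtype` argument. The following is a minimal sketch, assuming a small frame rebuilt by hand (`raw_inputs` and `encoded` are names chosen for this illustration):

```python
import pandas as pd

# hand-built copy of the input columns for this sketch
raw_inputs = pd.DataFrame({
    "NumRooms": [float("nan"), 2.0, 4.0, float("nan")],
    "RoofType": [None, None, "Slate", None],
})

# dummy_na=True adds an indicator column for missing values,
# dtype=float stores the indicators as 0.0 / 1.0 instead of booleans
encoded = pd.get_dummies(raw_inputs, dummy_na=True, dtype=float)
print(encoded)
```

Only the categorical column `RoofType` is encoded; the numerical column `NumRooms` is passed through unchanged, including its NaN entries.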


In [15]:
# common heuristic for missing numerical values:
# replace the NaN entries with the mean value of the corresponding column
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True
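
The imputed value 3.0 comes from the two known entries only, since pandas skips NaN when computing a mean: (2 + 4) / 2 = 3. A minimal sketch of the same step on a single column (`num_rooms`, `col_mean` and `filled` are names chosen for this illustration):

```python
import pandas as pd

num_rooms = pd.Series([float("nan"), 2.0, 4.0, float("nan")], name="NumRooms")

# mean() ignores NaN entries by default (skipna=True)
col_mean = num_rooms.mean()

# fillna() only touches the missing entries, known values stay as they are
filled = num_rooms.fillna(col_mean)
print(col_mean)
print(filled.tolist())
```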


In [16]:
print(targets)

0    127500
1    106000
2    178100
3    140000
Name: Price, dtype: int64
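
The comments above mention deletion as the alternative to imputation. The following is a minimal sketch of two possible deletion strategies, assuming the same four rows read from an in-memory string (`csv_text`, `complete_rows`, `worst_column` and `reduced` are names chosen for this illustration):

```python
import io

import pandas as pd

csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

# option 1: drop every row that contains at least one missing value
complete_rows = data.dropna()

# option 2: drop the column with the most missing values
worst_column = data.isna().sum().idxmax()
reduced = data.drop(columns=worst_column)

print(complete_rows)
print(reduced)
```

On a dataset this small, row-wise deletion keeps only a single example, which illustrates why imputation is often the more practical choice.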


### 2.2.3. Conversion to the Tensor Format

Once all entries in `inputs` and `targets` are numerical, the dataset can be loaded into a tensor.

In [22]:
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
print("Inputs:")
print(X)
print("----------------------------------------------------------------------")
print("Targets:")
print(y)

Inputs:
tensor([[3., 0., 1.],
        [2., 0., 1.],
        [4., 1., 0.],
        [3., 0., 1.]], dtype=torch.float64)
----------------------------------------------------------------------
Targets:
tensor([127500., 106000., 178100., 140000.], dtype=torch.float64)
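
The tensors above are `float64` because `torch.tensor` keeps the dtype of the NumPy array it receives. Many PyTorch models work with `float32` by default, so an explicit `dtype` can be passed during conversion. A minimal sketch, assuming the preprocessed input values from above are typed in by hand (`features`, `X64` and `X32` are names chosen for this illustration):

```python
import numpy as np
import torch

# preprocessed inputs from above, as a float64 NumPy array
features = np.array([[3., 0., 1.],
                     [2., 0., 1.],
                     [4., 1., 0.],
                     [3., 0., 1.]])

# without a dtype argument the NumPy dtype (float64) is kept
X64 = torch.tensor(features)

# with an explicit dtype the values are converted to float32
X32 = torch.tensor(features, dtype=torch.float32)

print(X64.dtype, X32.dtype)
```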
