
# Data Preprocessing
**Source**: [Dive into Deep Learning](https://d2l.ai/)

In [31]:
# create a data file
import os
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
  f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

# read the data file
import pandas as pd
data = pd.read_csv(data_file)
data.head()

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


## Data Preparation
In supervised learning, we train models to predict a designated target value, given some set of input values.

Our first step in processing the dataset is to separate out columns corresponding to input versus target values.

In [34]:
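# every column except the last is an input; the last column (Price) is the target
# .copy() makes inputs an independent DataFrame, so later edits do not touch data
inputs, targets = data.iloc[:, :-1].copy(), data.iloc[:, -1]
inputs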

   NumRooms RoofType
0       NaN      NaN
1       2.0      NaN
2       4.0    Slate
3       NaN      NaN
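
Positional slicing with `iloc` assumes the target is the last column. As a minimal sketch, the same split can be made by column name, assuming the target column is `Price` as in the toy file above. The names `csv_text`, `target_col`, `inputs_by_name`, and `targets_by_name` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# Same toy data as above, read from an in-memory string instead of a file
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

# Assumption: the target is the column named 'Price'
target_col = 'Price'

# Inputs are all remaining columns; the target is selected by name
inputs_by_name = data.drop(columns=[target_col])
targets_by_name = data[target_col]

print(inputs_by_name.columns.tolist())
print(targets_by_name.tolist())
```

Selecting by name keeps working if the column order of the file changes, whereas the positional version silently picks whatever column happens to be last.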


The dataset contains missing values (shown as `NaN`), which can be handled by either imputation or deletion. **Imputation** replaces missing values with estimates of their values, while **deletion** discards the rows or columns that contain missing values.

In [35]:
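# fill missing values in the numerical column with that column's mean
inputs = inputs.fillna({'NumRooms': inputs['NumRooms'].mean()})

# one-hot encode RoofType, treating NaN as its own category (RoofType_nan)
# dtype=int keeps the indicator columns as 0/1 across pandas versions
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)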

print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1
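
The cell above uses imputation. For comparison, here is a minimal sketch of the deletion strategy using `DataFrame.dropna`, rebuilding the same toy data in memory. The names `csv_text`, `complete_rows`, and `complete_cols` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# Same toy data as above, read from an in-memory string instead of a file
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

# Row-wise deletion: keep only rows that have no missing entries
complete_rows = data.dropna(axis=0)

# Column-wise deletion: keep only columns that have no missing entries
complete_cols = data.dropna(axis=1)

print(complete_rows)
print(complete_cols)
```

On this dataset, row-wise deletion keeps only one of the four rows and column-wise deletion keeps only `Price`, which illustrates why imputation is often preferred when data is scarce.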


## Convert to Tensor Format

In [36]:
import torch
# dtype=float forces a numeric array even if the indicator columns are boolean
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy())
X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))