# Data Preprocessing

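This section uses pandas (https://pandas.pydata.org/docs/) to load a raw dataset, handle its missing values, and convert the result to tensors.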

## Reading the Dataset

First, create a small artificial dataset and write it to a CSV file:

In [1]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row represents a data example
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

Then load the raw dataset from the CSV file with pandas:

In [2]:
# If pandas is not installed, just uncomment the following line:
# !pip install pandas
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
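
Before deciding how to handle the missing entries, it can help to count them per column. The following is a minimal sketch: it rebuilds the same table in memory with io.StringIO (an assumption made here so the snippet runs without the file on disk) and uses the name demo, which is not part of the original notebook.

```python
import io

import pandas as pd

# Same content as house_tiny.csv, held in memory so the snippet is self-contained
csv_text = (
    'NumRooms,Alley,Price\n'
    'NA,Pave,127500\n'
    '2,NA,106000\n'
    '4,NA,178100\n'
    'NA,NA,140000\n'
)
demo = pd.read_csv(io.StringIO(csv_text))

# isna() marks missing entries; sum() then counts them per column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

Columns with many missing entries are candidates for deletion, while columns with only a few are usually better suited to imputation.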


## Handling Missing Data

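Using integer-location based indexing (iloc), we split data into inputs (the first two columns) and outputs (the last column). For numerical values in inputs that are missing, we replace the “NaN” entries with the mean value of the same column.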

In [3]:
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
# numeric_only=True skips the string column Alley, which has no mean
inputs = inputs.fillna(inputs.mean(numeric_only=True))
print(inputs)

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
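
Imputation is one way to handle missing data; deletion is the other. The sketch below contrasts the two on a hypothetical toy table (the name rooms and its two-column layout are assumptions for illustration, not part of the original notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy table mirroring NumRooms and Price from the example above
rooms = pd.DataFrame({'NumRooms': [np.nan, 2.0, 4.0, np.nan],
                      'Price': [127500, 106000, 178100, 140000]})

# Deletion: drop every row that contains a missing value
deleted = rooms.dropna()

# Imputation: replace each missing value with the mean of its column
imputed = rooms.fillna(rooms.mean())

print(deleted)
print(imputed)
```

Deletion keeps only fully observed rows and so shrinks the dataset, whereas imputation keeps every row at the cost of introducing estimated values.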


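For categorical or discrete values in inputs, we treat “NaN” as a category. Since the Alley column takes only two types of values, “Pave” and “NaN”, pandas converts it into two indicator columns, Alley_Pave and Alley_nan.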

In [4]:
# dtype=int keeps the indicators as 0/1 (newer pandas versions default to bool)
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)

   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
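
To see what dummy_na contributes, the following sketch encodes a hypothetical one-column table (the name alley is an assumption for illustration) with and without the flag.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Alley column
alley = pd.DataFrame({'Alley': ['Pave', np.nan, np.nan, np.nan]})

# Without dummy_na, rows with a missing value become all zeros
without_nan = pd.get_dummies(alley, dtype=int)

# With dummy_na=True, missing values get their own indicator column
with_nan = pd.get_dummies(alley, dummy_na=True, dtype=int)

print(without_nan)
print(with_nan)
```

Without the extra column, a missing Alley value is indistinguishable from "no category set", so the flag is useful when the absence of a value may itself carry information.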


## Conversion to the Tensor Format

Now that all the entries in inputs and outputs are numerical, they can be converted to the tensor format.

In [5]:
import torch

X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y

(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
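
The output above shows dtype=torch.float64 for X, because the underlying NumPy array stores 64-bit floats, whereas PyTorch's default floating-point type is float32. If float32 is preferred, the dtype can be requested at conversion time. A minimal sketch, using a hypothetical array named features in place of inputs.values:

```python
import numpy as np
import torch

# Hypothetical stand-in for inputs.values; NumPy defaults to float64
features = np.array([[3., 1., 0.],
                     [2., 0., 1.]])

X64 = torch.tensor(features)                       # keeps float64
X32 = torch.tensor(features, dtype=torch.float32)  # casts to float32

print(X64.dtype, X32.dtype)
```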

## Summary

- Like many other extension packages in the vast ecosystem of Python, pandas can work together with tensors.
- Imputation and deletion can be used to handle missing data.

## Exercises

Create a raw dataset with more rows and columns.

In [6]:
import pandas as pd
import numpy as np
data = np.arange(20, dtype=float).reshape((5, 4))
data[0:2, 1] = np.nan
data[2, 2] = np.nan
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
df

      a     b     c     d
0   0.0   NaN   2.0   3.0
1   4.0   NaN   6.0   7.0
2   8.0   9.0   NaN  11.0
3  12.0  13.0  14.0  15.0
4  16.0  17.0  18.0  19.0


1. Delete the column with the most missing values.

In [7]:
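# Find the column with the most missing values, then drop it
most_missing = df.isna().sum().idxmax()
data = df.drop(columns=most_missing)
data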

      a     c     d
0   0.0   2.0   3.0
1   4.0   6.0   7.0
2   8.0   NaN  11.0
3  12.0  14.0  15.0
4  16.0  18.0  19.0


2. Convert the preprocessed dataset to the tensor format.

In [8]:
import torch
torch.tensor(data.values)

tensor([[ 0.,  2.,  3.],
        [ 4.,  6.,  7.],
        [ 8., nan, 11.],
        [12., 14., 15.],
        [16., 18., 19.]], dtype=torch.float64)
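
The resulting tensor still contains one nan, carried over from column c. One possible follow-up, sketched here under the assumption that mean imputation is acceptable for this column, is to fill the gap before converting. The names reduced, filled, and tensor are introduced for illustration only.

```python
import numpy as np
import pandas as pd
import torch

# Rebuild the reduced table from the exercise (columns a, c, d)
reduced = pd.DataFrame({'a': [0., 4., 8., 12., 16.],
                        'c': [2., 6., np.nan, 14., 18.],
                        'd': [3., 7., 11., 15., 19.]})

# Impute the remaining gap with the column mean, then convert
filled = reduced.fillna(reduced.mean())
tensor = torch.tensor(filled.to_numpy())

# Confirm that no missing values are left
print(torch.isnan(tensor).any())
```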