### Data Preprocessing

Let's begin by reading the dataset. We use `pandas` to read a small artificial CSV dataset. A good introduction to pandas is the "10 minutes to pandas" guide at `pandas.pydata.org/pandas-docs/stable/user_guide/10min.html`.

In [1]:
import os
import pandas as pd

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''County,Rooms,Price,Size
               Santa Clara,NA,127500,1050
               NA,3,106000,2000
               Santa Cruz,4,178100,1570
               NA,NA,140000,3700''')

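Let's load the dataset using `read_csv`. Note that the indented rows written above keep their leading spaces, so the `County` entries (including `NA`) are read as strings with leading whitespace, whereas the bare `NA` entries in `Rooms` are parsed as missing values (`NaN`).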

In [2]:
data = pd.read_csv(data_file)
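
The effect of leading whitespace on missing-value parsing can be checked in isolation. The following is a minimal sketch on a hypothetical two-row CSV held in memory (the strings `indented_csv` and `flush_csv` are illustrative, not part of the dataset above):

```python
import io
import pandas as pd

# Rows indented inside the string literal keep their leading spaces.
indented_csv = "County,Rooms\n    Santa Clara,NA\n    NA,3\n"
# The same rows without indentation.
flush_csv = "County,Rooms\nSanta Clara,NA\nNA,3\n"

indented = pd.read_csv(io.StringIO(indented_csv))
flush = pd.read_csv(io.StringIO(flush_csv))

# "    NA" is an ordinary string, so no County value counts as missing.
print(indented['County'].isna().sum())
# A bare NA token is one of the default missing-value markers.
print(flush['County'].isna().sum())
```

If real missing values are wanted in `County`, one option is to write the rows without indentation.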

### Data Frames

A data frame behaves like a matrix whose columns may hold entries of different types. Let's look at our tiny house dataset.

In [3]:
data

                       County  Rooms   Price  Size
0                 Santa Clara    NaN  127500  1050
1                          NA    3.0  106000  2000
2                  Santa Cruz    4.0  178100  1570
3                          NA    NaN  140000  3700


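We can select columns by name, e.g. `data['County']` selects the county column, whereas `data[1:3]` selects rows 1 and 2. Row and column selection can be combined using the `loc` indexer. Note that `loc` slices by label and includes the end point, so `data.loc[1:3, ...]` returns rows 1, 2, and 3. Even better, columns can be accessed as attributes, e.g. `data.County`.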

In [4]:
print(data.loc[1:3,['Rooms','Price']])
print(data.County[0])

   Rooms   Price
1    3.0  106000
2    4.0  178100
3    NaN  140000
               Santa Clara
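
The difference between positional slicing and label-based `loc` slicing is easy to miss. A minimal sketch, using a hypothetical in-memory frame `df` with the same `Rooms` and `Price` values:

```python
import pandas as pd

df = pd.DataFrame({
    'Rooms': [float('nan'), 3.0, 4.0, float('nan')],
    'Price': [127500, 106000, 178100, 140000],
})

# Plain slicing is positional and excludes the end point: rows 1 and 2.
by_position = df[1:3]
# loc slices by label and includes the end point: rows 1, 2, and 3.
by_label = df.loc[1:3, ['Rooms', 'Price']]
# iloc is positional as well and matches plain slicing on the rows.
by_iloc = df.iloc[1:3, [0, 1]]

print(list(by_position.index), list(by_label.index), list(by_iloc.index))
```

With the default integer index the labels coincide with the positions, which is why the two styles are easy to confuse.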


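There are a few more useful operations: the `dtypes` attribute lists the data type of each column, and `describe()` prints summary statistics. By default `describe()` covers only the numeric columns, so `County` is omitted.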

In [5]:
print(data.dtypes)
print(data.describe())

County     object
Rooms     float64
Price       int64
Size        int64
dtype: object
          Rooms         Price         Size
count  2.000000       4.00000     4.000000
mean   3.500000  137900.00000  2080.000000
std    0.707107   30255.68817  1147.722382
min    3.000000  106000.00000  1050.000000
25%    3.250000  122125.00000  1440.000000
50%    3.500000  133750.00000  1785.000000
75%    3.750000  149525.00000  2425.000000
max    4.000000  178100.00000  3700.000000
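
The `count` row of `describe()` reports only non-missing entries, which is why `Rooms` shows 2 rather than 4. A minimal sketch of counting missing values directly, on a hypothetical frame `df` with the same numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Rooms': [np.nan, 3.0, 4.0, np.nan],
    'Price': [127500, 106000, 178100, 140000],
})

# Number of missing entries per column.
missing_per_column = df.isna().sum()
# Number of observed (non-missing) entries per column.
observed_per_column = df.count()
# The mean skips NaN by default, so it averages only 3.0 and 4.0.
rooms_mean = df['Rooms'].mean()

print(missing_per_column)
print(observed_per_column)
print(rooms_mean)
```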


### Handling Missing Data

`NaN` entries denote missing values. We replace each one with the mean of its column, a simple form of *imputation*.
First we split `data` into `inputs` and `outputs`. Note that we could also use `iloc` to index columns by position rather than by name.

In [6]:
inputs, outputs = data.loc[:, ['County','Rooms','Size']], data.loc[:, 'Price']
print(inputs)

                       County  Rooms  Size
0                 Santa Clara    NaN  1050
1                          NA    3.0  2000
2                  Santa Cruz    4.0  1570
3                          NA    NaN  3700
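
The same split can be written with `iloc`. This is a minimal sketch on a hypothetical frame `df` that assumes the column order `County, Rooms, Price, Size` used in this section:

```python
import pandas as pd

df = pd.DataFrame({
    'County': ['Santa Clara', 'NA', 'Santa Cruz', 'NA'],
    'Rooms': [float('nan'), 3.0, 4.0, float('nan')],
    'Price': [127500, 106000, 178100, 140000],
    'Size': [1050, 2000, 1570, 3700],
})

# Columns 0, 1, and 3 are the features; column 2 (Price) is the target.
inputs_by_pos = df.iloc[:, [0, 1, 3]]
outputs_by_pos = df.iloc[:, 2]

print(list(inputs_by_pos.columns), outputs_by_pos.name)
```

Positional indexing is shorter but silently selects the wrong columns if the file's column order changes, so selecting by name is the more robust choice.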


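For categorical or discrete columns we treat a missing entry (`NA` or `NaN`) as a category of its own. For the `County` column this is done by one-hot encoding, where the string `NA` gets its own indicator column. For `Rooms` we fill missing entries with the column mean and append an indicator column that records which entries were actually observed. The indicator reuses the name `Rooms`, so the result has two columns with that name.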

In [7]:
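inputs = pd.get_dummies(inputs, columns=['County'])  # one-hot encode County
# Indicator that is True where Rooms was observed and False where it was NaN.
rooms_observed = inputs.Rooms.notna()
inputs = inputs.fillna(inputs.mean())  # replace each NaN with its column mean
inputs = pd.concat([inputs, rooms_observed], axis=1).astype('float64')
print(inputs)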

   Rooms    Size  County_               NA  County_               Santa Clara  \
0    3.5  1050.0                       0.0                                1.0   
1    3.0  2000.0                       1.0                                0.0   
2    4.0  1570.0                       0.0                                0.0   
3    3.5  3700.0                       1.0                                0.0   

   County_               Santa Cruz  Rooms  
0                               0.0    0.0  
1                               0.0    1.0  
2                               1.0    1.0  
3                               0.0    0.0  
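
When missing categories are stored as real `NaN` values rather than as the string `NA`, `get_dummies` needs `dummy_na=True` to give them their own column. The following is a minimal sketch under that assumption; the frame `df` and the column name `Rooms_observed` are hypothetical and chosen to avoid the duplicate `Rooms` name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'County': ['Santa Clara', np.nan, 'Santa Cruz', np.nan],
    'Rooms': [np.nan, 3.0, 4.0, np.nan],
    'Size': [1050, 2000, 1570, 3700],
})

# Record which Rooms entries were observed before they are filled in.
rooms_observed = df['Rooms'].notna().astype(float).rename('Rooms_observed')

# dummy_na=True adds an extra indicator column for missing County values.
encoded = pd.get_dummies(df, columns=['County'], dummy_na=True, dtype=float)

# Mean imputation: each NaN in Rooms becomes the mean of the observed values.
encoded['Rooms'] = encoded['Rooms'].fillna(encoded['Rooms'].mean())

result = pd.concat([encoded, rooms_observed], axis=1)
print(result)
```

Keeping the indicator under a distinct name makes later selections such as `result['Rooms']` unambiguous.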


### Conversion to the Tensor Format

Now that all the entries in `inputs` and `outputs` are numerical, they can be converted to the tensor format.

In [8]:
import torch

X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y

(tensor([[3.5000e+00, 1.0500e+03, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00],
         [3.0000e+00, 2.0000e+03, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00],
         [4.0000e+00, 1.5700e+03, 0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00],
         [3.5000e+00, 3.7000e+03, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]],
        dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
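
The output above shows that `X` inherits `float64` from pandas while `y` stays an integer tensor. Many PyTorch layers expect `float32`, so a cast is often added at this point. A minimal sketch on hypothetical arrays `features` and `prices` standing in for `inputs.values` and `outputs.values`:

```python
import numpy as np
import torch

features = np.array([[3.5, 1050.0], [3.0, 2000.0]])  # float64, like inputs.values
prices = np.array([127500, 106000])                  # integer, like outputs.values

# Without an explicit dtype the tensor keeps NumPy's float64.
X64 = torch.tensor(features)
# Passing dtype converts during construction.
X32 = torch.tensor(features, dtype=torch.float32)
y32 = torch.tensor(prices, dtype=torch.float32)

print(X64.dtype, X32.dtype, y32.dtype)
```

Whether `float32` is the right choice depends on the model; this is only one common convention.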