### Reading the Dataset

In [69]:
import os

os.makedirs("data", exist_ok=True)
data_file = "data/house_tiny.csv"

with open(data_file, "w") as f:
    # Keep the CSV rows flush left inside the string: a leading tab or space becomes
    # part of the first field, so "NA" is no longer recognized as a missing value.
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')
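
The comment about row structure can be checked directly. The sketch below is an illustration, not part of the original notebook: it parses two in-memory copies of the same rows through `io.StringIO`, one flush left and one with four leading spaces on each data row. The names `clean_csv` and `indented_csv` are made up for the example.

```python
import io

import pandas

header = "NumRooms,RoofType,Price"
rows = ["NA,NA,127500", "2,NA,106000", "4,Slate,178100", "NA,NA,140000"]

# Same data twice: rows flush left, and rows indented by four spaces.
clean_csv = "\n".join([header] + rows)
indented_csv = "\n".join([header] + ["    " + row for row in rows])

clean = pandas.read_csv(io.StringIO(clean_csv))
indented = pandas.read_csv(io.StringIO(indented_csv))

# Count how many NumRooms entries pandas treats as missing in each case.
clean_missing = int(clean["NumRooms"].isna().sum())
indented_missing = int(indented["NumRooms"].isna().sum())

print(clean_missing)
print(indented_missing)
```

With the flush-left text, the two `NA` entries in `NumRooms` are counted as missing. With the indented text, the first field is read as the string `"    NA"`, which does not match the default missing-value markers, so the count should drop to zero.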

In [70]:
import pandas

data = pandas.read_csv(data_file, keep_default_na=True)
data

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


### Data Preparation

Converting RoofType into the indicator columns RoofType_Slate and RoofType_nan (one way of dealing with missing values).

In [81]:
# iloc[:, 0:2] selects by integer position:
#    rows    ":"   - all rows
#    columns "0:2" - positions 0 and 1 (the stop position, 2, is NOT included)
# Selecting several columns returns a DataFrame, so the column names are printed too;
# selecting a single column (iloc[:, 2]) returns a Series.
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]

print(inputs)
print("\n")
print(targets)

   NumRooms RoofType
0       NaN      NaN
1       2.0      NaN
2       4.0    Slate
3       NaN      NaN


0    127500
1    106000
2    178100
3    140000
Name: Price, dtype: int64
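
To make the half-open slice concrete, here is a small sketch that rebuilds the same table in memory (the variable `frame` is introduced only for this illustration) and compares positional selection with selection by column name.

```python
import pandas

frame = pandas.DataFrame(
    {
        "NumRooms": [None, 2.0, 4.0, None],
        "RoofType": [None, None, "Slate", None],
        "Price": [127500, 106000, 178100, 140000],
    }
)

# Positions 0 and 1; the stop position 2 is excluded.
by_position = frame.iloc[:, 0:2]
# The same two columns, selected by label instead of position.
by_name = frame[["NumRooms", "RoofType"]]
# A single integer position returns a Series rather than a DataFrame.
price = frame.iloc[:, 2]

print(by_position.equals(by_name))
print(type(price).__name__)
```

Selecting by name is more robust if the column order might change; positional slicing is shorter when the layout is fixed, as it is here.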


For categorical input fields, we can treat NaN as a category.

Since the RoofType column takes values Slate and NaN, pandas can convert this column into two columns RoofType_Slate and RoofType_nan.

In [88]:
# get_dummies converts categorical columns into dummy/indicator columns;
# dummy_na=True adds an extra indicator column for missing values.
inputs = pandas.get_dummies(inputs, dummy_na=True)
inputs

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True
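
As a rough illustration of what `dummy_na=True` changes, the sketch below encodes the same two input columns with and without it. The frame is rebuilt in memory and the names `without_nan` and `with_nan` are chosen only for this example.

```python
import pandas

frame = pandas.DataFrame(
    {
        "NumRooms": [None, 2.0, 4.0, None],
        "RoofType": [None, None, "Slate", None],
    }
)

# Default: missing RoofType values get no indicator column of their own.
without_nan = pandas.get_dummies(frame)
# dummy_na=True: missing values are treated as one more category.
with_nan = pandas.get_dummies(frame, dummy_na=True)

print(list(without_nan.columns))
print(list(with_nan.columns))
```

The numeric column NumRooms passes through unchanged in both cases; only the categorical column is expanded.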


For missing numerical values, one common heuristic is to replace the NaN entries with the mean value of the corresponding column.

In [90]:
# fillna() replaces NA/NaN entries; here each column's NaN values are filled with
# that column's mean (mean() skips NaN by default).
inputs = inputs.fillna(inputs.mean())
inputs

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True
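
The value 3.0 comes from averaging only the observed entries. A minimal sketch of that arithmetic on the NumRooms column alone (the Series `num_rooms` is rebuilt here for illustration):

```python
import pandas

num_rooms = pandas.Series([None, 2.0, 4.0, None], name="NumRooms")

# mean() skips NaN by default, so this is (2.0 + 4.0) / 2.
column_mean = num_rooms.mean()
# Only the NaN entries are replaced; observed values are left as they are.
filled = num_rooms.fillna(column_mean)

print(column_mean)
print(filled.tolist())
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is worth keeping in mind when many entries are missing.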
