# 2.2.2 Data Preparation

### Handling Input, Target Values, and Missing Data in Supervised Learning

In supervised learning, we train models to predict a **target value** based on a set of **input values**.  
- The first step in processing a dataset is to separate the columns into input and target values.  
- Columns can be selected by name or using integer-location-based indexing (`iloc`).  
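
As a minimal sketch of the two selection styles, the snippet below builds a small hypothetical table (the column names and values are illustrative assumptions, not the dataset used in this section) and splits it into inputs and targets both by name and by position.

```python
import pandas as pd

# Hypothetical toy table; names and values are for illustration only.
df = pd.DataFrame({
    "NumRooms": [2, 3, 4],
    "RoofType": ["Flat", "Slate", "Flat"],
    "Price": [106000, 178100, 140000],
})

# Selection by column name.
inputs_by_name = df[["NumRooms", "RoofType"]]
targets_by_name = df["Price"]

# Selection by integer position: all rows, first two columns vs. third column.
inputs_by_pos = df.iloc[:, 0:2]
targets_by_pos = df.iloc[:, 2]

# Both approaches pick out the same data.
print(inputs_by_name.equals(inputs_by_pos))
print(targets_by_name.equals(targets_by_pos))
```

Selecting by name tends to be more robust when the column order may change, while `iloc` is convenient when only the layout is known.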

#### Missing Values in Data
- `pandas` replaces missing or empty entries (e.g., `"3,,270000"`) with a special `NaN` (Not a Number) value.  
- Missing values, sometimes called the **"bed bugs" of data science**, are a persistent problem that cannot be avoided entirely and must be handled carefully.  
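
To illustrate, the sketch below parses a small hypothetical CSV string (an assumption for illustration, not the file used in this section) in which one row leaves a field empty, then counts the resulting `NaN` entries per column.

```python
import io
import pandas as pd

# Hypothetical CSV text; the second data row leaves RoofType empty.
csv_text = "NumRooms,RoofType,Price\n2,Flat,106000\n3,,270000\n"

# read_csv accepts any file-like object, so StringIO stands in for a file.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# isna() flags the missing entries; summing counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```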

#### Handling Missing Values
1. **Imputation**: Replace missing values with estimated values.  
   - For **categorical fields**, treat `NaN` as a separate category.  
   - Example: If the `RoofType` column has values like `Slate` and `NaN`, `pandas` can create two new columns: `RoofType_Slate` and `RoofType_nan`.  
     - A row with `Slate` will have `RoofType_Slate = 1` and `RoofType_nan = 0`.  
     - A row with a missing value will have `RoofType_Slate = 0` and `RoofType_nan = 1`.  
2. **Deletion**: Remove rows or columns containing missing values.
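
The two strategies above can be sketched as follows. The data frame is a hypothetical example with one gap in each column. `pd.get_dummies(..., dummy_na=True)` adds a `NaN` indicator for the categorical column, and `dropna` performs deletion by row or by column.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one numerical and one categorical column, each with a gap.
inputs = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0],
    "RoofType": ["Slate", np.nan, "Slate"],
})

# Strategy 1: treat NaN as its own category in the categorical column.
# The numerical column NumRooms is passed through unchanged.
encoded = pd.get_dummies(inputs, dummy_na=True)
print(encoded.columns.tolist())

# Strategy 2: deletion.
rows_dropped = inputs.dropna()        # drop every row that contains a NaN
cols_dropped = inputs.dropna(axis=1)  # drop every column that contains a NaN
print(rows_dropped)
print(cols_dropped.shape)
```

Deletion is simple but can discard a lot of data: in this toy example only one row survives row-wise deletion, and no columns survive column-wise deletion.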

In [16]:
import pandas as pd
import os

In [17]:
# Create the data directory if it does not exist yet.
os.makedirs('data', exist_ok=True)
# Path of the CSV file that is loaded in the next cell.
data_file = os.path.join('data', 'house_tiny.csv')

In [18]:
data = pd.read_csv(data_file)

In [27]:
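# First two columns are the inputs; the third column is the target.
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
# One-hot encode non-numerical columns; dummy_na=True adds a NaN indicator column.
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)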

   Numrooms_            2  Numrooms_            3  Numrooms_            4  \
0                    True                   False                   False   
1                   False                    True                   False   
2                   False                   False                   False   
3                   False                   False                    True   

   Numrooms_            NA  Numrooms_nan  Rooftype_Flat  Rooftype_nan  
0                    False         False           True         False  
1                    False         False           True         False  
2                     True         False          False          True  
3                    False         False           True         False  
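
The padded column names in the output above (for example `Numrooms_            2`) suggest that the room counts were read as whitespace-padded strings, so `get_dummies` one-hot encoded them and the padded `NA` entry was not recognized as missing. A minimal sketch of one possible fix, using hypothetical padded values rather than the actual file contents, strips the whitespace and coerces the column to numeric before encoding.

```python
import pandas as pd

# Hypothetical stand-in for a column that was read as padded strings.
inputs = pd.DataFrame({
    "NumRooms": ["   2", "   3", "   NA", "   4"],
    "RoofType": ["Flat", "Flat", None, "Flat"],
})

# Strip the padding, then coerce: anything non-numeric (here "NA") becomes NaN.
inputs["NumRooms"] = pd.to_numeric(inputs["NumRooms"].str.strip(), errors="coerce")

# Now only RoofType is one-hot encoded; NumRooms stays numerical.
encoded = pd.get_dummies(inputs, dummy_na=True)
print(encoded.columns.tolist())
```

With the column stored as numbers, the missing room count remains a `NaN` that can later be imputed instead of becoming its own category.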


For missing numerical values, one common heuristic is to replace the `NaN` entries with the mean value of the corresponding column.

In [23]:
inputs = inputs.fillna(inputs.mean())
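
As a small worked example of mean imputation, the sketch below uses a hypothetical numerical column (values chosen for illustration). `DataFrame.mean()` skips `NaN` by default, so the column mean is computed from the observed values only.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with two missing entries.
inputs = pd.DataFrame({"NumRooms": [np.nan, 2.0, 4.0, np.nan]})

# The mean ignores NaN, so it is (2 + 4) / 2 = 3.
column_means = inputs.mean()

# fillna with a Series fills each column using the value at the matching label.
filled = inputs.fillna(column_means)
print(filled)
```

Mean imputation keeps every row but shrinks the column's variance, so it is best viewed as a quick heuristic rather than a general remedy.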