## Predicting House Prices - Preprocessing

In this tutorial we walk through several data preprocessing steps that are frequently needed in practice:

- Encoding (ordinal and nominal features)
- Imputation (missing values)
- Feature scaling (numerical features)

We do this on a small example modeled on the house price prediction dataset, so that every intermediate step can be followed in detail.

The procedure is presented in the following videos:
- [Preprocessing - Introduction](https://youtu.be/cT8ffI4U8-E)
- [Preprocessing - Missing Values](https://youtu.be/2Nd4tPhophc)
- [Preprocessing - Encoding](https://youtu.be/PA0Bykxn4_w)
- [Preprocessing - Feature Scaling](https://youtu.be/QxiUUpnglnk)

![Train Data](https://raw.githubusercontent.com/layerwise/training/main/assets/house_prices_test_example_image.png)

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [11]:
df_train = pd.DataFrame(
    {
        "Lot Shape": ["Reg", "Reg", "IR1", "IR2", "IR3", "IR2"],
        "Street": ["Grvl", "Grvl", np.nan, "Pave", "Pave", "Grvl"],
        "Lot Frontage": [64.0, 81.0, np.nan, 80.0, 96.0, np.nan],
        "Yr Sold": [2008, 2007, 2009, 2007, 2006, 2009]
    }
)

df_train

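| | Lot Shape | Street | Lot Frontage | Yr Sold |
|---|---|---|---|---|
| 0 | Reg | Grvl | 64.0 | 2008 |
| 1 | Reg | Grvl | 81.0 | 2007 |
| 2 | IR1 | NaN | NaN | 2009 |
| 3 | IR2 | Pave | 80.0 | 2007 |
| 4 | IR3 | Pave | 96.0 | 2006 |
| 5 | IR2 | Grvl | NaN | 2009 |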


## 1. Imputation

In [20]:
# TODO: imputation
# SimpleImputer is a transformer with the same fit/transform interface as PolynomialFeatures
from sklearn.impute import SimpleImputer

### 1.1. Imputation of categorical features

In [40]:
categorical_columns = ["Lot Shape", "Street"]

categorical_imputer = SimpleImputer(strategy="most_frequent")
categorical_imputer.fit(df_train[categorical_columns])

# TODO
df_train.loc[:, categorical_columns] = categorical_imputer.transform(df_train[categorical_columns])
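
To see what the fitted imputer has actually learned, it helps to inspect its `statistics_` attribute. The following is a minimal sketch on a copy of the two categorical columns; the names `demo` and `demo_imputer` are illustrative and not part of the tutorial. Note the tie in `Lot Shape`: `Reg` and `IR2` both occur twice, and with `strategy="most_frequent"` scikit-learn returns the smallest of the tied values, which for strings means the alphabetically first one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Copy of the categorical training columns (hypothetical name: demo)
demo = pd.DataFrame(
    {
        "Lot Shape": ["Reg", "Reg", "IR1", "IR2", "IR3", "IR2"],
        "Street": ["Grvl", "Grvl", np.nan, "Pave", "Pave", "Grvl"],
    }
)

demo_imputer = SimpleImputer(strategy="most_frequent")
demo_imputer.fit(demo)

# One learned fill value per column, in column order.
# "Lot Shape" has a tie between "Reg" and "IR2"; the smaller string wins.
learned_fill_values = list(demo_imputer.statistics_)
print(learned_fill_values)
```

This explains why a missing `Lot Shape` is later filled with `IR2` rather than `Reg`.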

### 1.2. Imputation of numerical features

In [41]:
# TODO: Imputation
numeric_columns = ["Lot Frontage", "Yr Sold"]

numeric_imputer = SimpleImputer(strategy="median")
numeric_imputer.fit(df_train[numeric_columns])

df_train.loc[:, numeric_columns] = numeric_imputer.transform(df_train[numeric_columns])
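
As a sanity check, the medians learned by the imputer can be compared with the medians pandas computes while skipping missing values. This is a small sketch on a copy of the numeric columns; `demo`, `demo_imputer`, and `manual_medians` are names chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Copy of the numeric training columns (hypothetical name: demo)
demo = pd.DataFrame(
    {
        "Lot Frontage": [64.0, 81.0, np.nan, 80.0, 96.0, np.nan],
        "Yr Sold": [2008, 2007, 2009, 2007, 2006, 2009],
    }
)

demo_imputer = SimpleImputer(strategy="median").fit(demo)

# pandas ignores NaN by default, so this reproduces the learned statistics
manual_medians = demo.median()

# transform returns a NumPy array with the missing entries filled in
filled = demo_imputer.transform(demo)

print(demo_imputer.statistics_)
print(manual_medians.to_numpy())
```

The median of the four observed `Lot Frontage` values (64, 80, 81, 96) is 80.5, which is the value that appears in the imputed rows.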

## 2. Encoding

### 2.1. Encoding of ordinal features

In [43]:
from sklearn.preprocessing import OrdinalEncoder

# TODO: identify the ordinal columns
ordinal_columns = ["Lot Shape"]

# TODO: identify the ordinal categories
for column in ordinal_columns:
    ordinal_categories = df_train[column].unique()
    print(ordinal_categories)

# One list of categories per ordinal column, ordered from lowest to highest
ordinal_categories = [
    ['Reg', 'IR1', 'IR2', 'IR3']
]

# TODO: encoding
ordinal_encoder = OrdinalEncoder(categories=ordinal_categories)
ordinal_encoder.fit(df_train[ordinal_columns])


df_train.loc[:, ordinal_columns] = ordinal_encoder.transform(df_train[ordinal_columns])

['Reg' 'IR1' 'IR2' 'IR3']
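
Passing `categories` explicitly matters because `OrdinalEncoder` otherwise sorts the categories alphabetically, which would not reflect the intended order from regular to increasingly irregular lot shapes. The sketch below contrasts both variants; `demo`, `auto_encoder`, and `explicit_encoder` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# One row per category, in the intended order (hypothetical name: demo)
demo = pd.DataFrame({"Lot Shape": ["Reg", "IR1", "IR2", "IR3"]})

# Default: categories are sorted alphabetically, so "Reg" ends up last
auto_encoder = OrdinalEncoder()
auto_codes = auto_encoder.fit_transform(demo).ravel()

# Explicit order: "Reg" is mapped to 0, "IR3" to 3
explicit_encoder = OrdinalEncoder(categories=[["Reg", "IR1", "IR2", "IR3"]])
explicit_codes = explicit_encoder.fit_transform(demo).ravel()

print(auto_encoder.categories_[0])
print(auto_codes)
print(explicit_codes)
```

With the default setting, `Reg` would receive the highest code, which is the opposite of the order used in this tutorial.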


In [44]:
df_train

| | Lot Shape | Street | Lot Frontage | Yr Sold |
|---|---|---|---|---|
| 0 | 0.0 | Grvl | 64.0 | 2008.0 |
| 1 | 0.0 | Grvl | 81.0 | 2007.0 |
| 2 | 1.0 | Grvl | 80.5 | 2009.0 |
| 3 | 2.0 | Pave | 80.0 | 2007.0 |
| 4 | 3.0 | Pave | 96.0 | 2006.0 |
| 5 | 2.0 | Grvl | 80.5 | 2009.0 |


### 2.2. Encoding of nominal features

In [45]:
from sklearn.preprocessing import OneHotEncoder

# TODO: identify the nominal columns
nominal_columns = ["Street"]

# TODO: identify the nominal categories
nominal_categories = ["Grvl", "Pave"]

# TODO: encoding
# Note: newer scikit-learn releases rename the `sparse` parameter to `sparse_output`
nominal_encoder = OneHotEncoder(sparse=False)
nominal_encoder.fit(df_train[nominal_columns])


OneHotEncoder(sparse=False)
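
The fitted encoder maps each category to its own indicator column. The following sketch shows the result on a few rows and, as an aside, what happens with a category that was not present during fitting. It assumes `handle_unknown="ignore"`, which the tutorial does not use, and `"Dirt"` is a made-up category for illustration. The sparse default output is converted with `.toarray()` so the example does not depend on the name of the sparse parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_street = pd.DataFrame({"Street": ["Grvl", "Grvl", "Pave", "Pave", "Grvl"]})
# "Dirt" is a hypothetical category that never occurs in the training data
new_street = pd.DataFrame({"Street": ["Grvl", "Pave", "Dirt"]})

# Assumption: unknown categories should be encoded as all zeros instead of raising
demo_encoder = OneHotEncoder(handle_unknown="ignore")
demo_encoder.fit(train_street)

# Default output is a sparse matrix; convert it to a dense array for display
encoded = demo_encoder.transform(new_street).toarray()

print(demo_encoder.categories_[0])
print(encoded)
```

Without `handle_unknown="ignore"`, the encoder raises an error when it meets an unseen category, which can be the safer choice if new categories should not pass unnoticed.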

## 3. Feature Scaling

In [57]:
# TODO: Feature Scaling
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

scaler = MinMaxScaler()
scaler.fit(df_train[numeric_columns])

MinMaxScaler()
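
`MinMaxScaler` rescales each column with the minimum and maximum seen during `fit`, i.e. `(x - min) / (max - min)`. The sketch below reproduces this by hand for the imputed `Lot Frontage` values; the array names are illustrative. It also shows that new values outside the training range are mapped outside the interval [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Imputed "Lot Frontage" values from the training data
train_values = np.array([[64.0], [81.0], [80.5], [80.0], [96.0], [80.5]])
# New values; 56.0 and 58.0 lie below the training minimum of 64.0
new_values = np.array([[56.0], [58.0], [80.5]])

demo_scaler = MinMaxScaler().fit(train_values)

# Manual min-max scaling with the training minimum and maximum
col_min = train_values.min(axis=0)
col_max = train_values.max(axis=0)
manual_scaled = (new_values - col_min) / (col_max - col_min)

sklearn_scaled = demo_scaler.transform(new_values)

print(manual_scaled.ravel())
print(sklearn_scaled.ravel())
```

Negative scaled values are therefore not an error: they indicate that a new observation is smaller than anything seen during training.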

## 4. Predictions on unseen data

In [60]:
df_new = pd.DataFrame(
    {
        "Lot Shape": ["Reg", "IR1", np.nan],
        "Street": ["Pave", np.nan, "Grvl"],
        "Lot Frontage": [56.0, 58.0, np.nan],
        "Yr Sold": [2005, 2008, 2007]
    }
)

df_new

| | Lot Shape | Street | Lot Frontage | Yr Sold |
|---|---|---|---|---|
| 0 | Reg | Pave | 56.0 | 2005 |
| 1 | IR1 | NaN | 58.0 | 2008 |
| 2 | NaN | Grvl | NaN | 2007 |


In [61]:
df_new.loc[:, categorical_columns] = categorical_imputer.transform(df_new[categorical_columns])
df_new

| | Lot Shape | Street | Lot Frontage | Yr Sold |
|---|---|---|---|---|
| 0 | Reg | Pave | 56.0 | 2005 |
| 1 | IR1 | Grvl | 58.0 | 2008 |
| 2 | IR2 | Grvl | NaN | 2007 |


In [62]:
df_new.loc[:, numeric_columns] = numeric_imputer.transform(df_new[numeric_columns])
df_new

| | Lot Shape | Street | Lot Frontage | Yr Sold |
|---|---|---|---|---|
| 0 | Reg | Pave | 56.0 | 2005.0 |
| 1 | IR1 | Grvl | 58.0 | 2008.0 |
| 2 | IR2 | Grvl | 80.5 | 2007.0 |
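
Note that only `transform` is called on the new data: the fill values come from the training set. The sketch below contrasts this with refitting on the new data, which would produce a different fill value. The variable names are illustrative and not part of the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_frontage = pd.DataFrame({"Lot Frontage": [64.0, 81.0, np.nan, 80.0, 96.0, np.nan]})
new_frontage = pd.DataFrame({"Lot Frontage": [56.0, 58.0, np.nan]})

# Intended usage: fit on the training data, then transform the new data
train_fitted = SimpleImputer(strategy="median").fit(train_frontage)
fill_from_train = train_fitted.transform(new_frontage)[2, 0]

# For contrast only: refitting on the new data changes the fill value
refit_on_new = SimpleImputer(strategy="median").fit(new_frontage)
fill_from_new = refit_on_new.transform(new_frontage)[2, 0]

print(fill_from_train, fill_from_new)
```

Reusing the training statistics keeps the preprocessing identical to what a model fitted on the training data has seen.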


In [63]:
df_new.loc[:, ordinal_columns] = ordinal_encoder.transform(df_new[ordinal_columns])
df_new

| | Lot Shape | Street | Lot Frontage | Yr Sold |
|---|---|---|---|---|
| 0 | 0.0 | Pave | 56.0 | 2005.0 |
| 1 | 1.0 | Grvl | 58.0 | 2008.0 |
| 2 | 2.0 | Grvl | 80.5 | 2007.0 |


In [64]:
df_new_nominal = nominal_encoder.transform(df_new[nominal_columns])

df_new_nominal

array([[0., 1.],
       [1., 0.],
       [1., 0.]])

In [65]:
# Note: newer scikit-learn releases replace this method with get_feature_names_out()
nominal_encoder.get_feature_names()

array(['x0_Grvl', 'x0_Pave'], dtype=object)

In [66]:
df_new_nominal = pd.DataFrame(
    df_new_nominal,
    columns=nominal_encoder.get_feature_names()
)

In [67]:
df_new_nominal

| | x0_Grvl | x0_Pave |
|---|---|---|
| 0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |


In [68]:
df_new = df_new.drop(columns=["Street"])

In [69]:
df_new = pd.concat(
    (df_new_nominal, df_new),
    axis=1
)
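
One detail to keep in mind with `pd.concat(..., axis=1)` is that rows are aligned by index, not by position. In this tutorial both frames use the default index 0, 1, 2, so the concatenation works as intended. The sketch below uses made-up data and names to show what happens otherwise, and how passing `index=` when wrapping the encoded array avoids the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-default index, e.g. after filtering rows
features = pd.DataFrame({"Lot Frontage": [56.0, 58.0, 80.5]}, index=[10, 11, 12])
encoded = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])

# Default index 0..2 does not match 10..12: concat produces extra rows with NaN
misaligned = pd.concat(
    (pd.DataFrame(encoded, columns=["x0_Grvl", "x0_Pave"]), features),
    axis=1,
)

# Reusing the original index keeps the rows aligned
aligned = pd.concat(
    (pd.DataFrame(encoded, columns=["x0_Grvl", "x0_Pave"], index=features.index), features),
    axis=1,
)

print(misaligned.shape, aligned.shape)
```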

In [70]:
df_new

| | x0_Grvl | x0_Pave | Lot Shape | Lot Frontage | Yr Sold |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 56.0 | 2005.0 |
| 1 | 1.0 | 0.0 | 1.0 | 58.0 | 2008.0 |
| 2 | 1.0 | 0.0 | 2.0 | 80.5 | 2007.0 |


In [71]:
df_new.loc[:, numeric_columns] = scaler.transform(df_new[numeric_columns])

In [72]:
df_new

| | x0_Grvl | x0_Pave | Lot Shape | Lot Frontage | Yr Sold |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | -0.25 | -0.333333 |
| 1 | 1.0 | 0.0 | 1.0 | -0.1875 | 0.666667 |
| 2 | 1.0 | 0.0 | 2.0 | 0.515625 | 0.333333 |
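
The manual steps above can also be bundled into a single object. The following is a sketch of one possible arrangement using scikit-learn's `Pipeline` and `ColumnTransformer`, not the approach taken in this tutorial; the step names and the variable `preprocessor` are illustrative. Under these assumptions it reproduces the final table for the new data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

df_train = pd.DataFrame(
    {
        "Lot Shape": ["Reg", "Reg", "IR1", "IR2", "IR3", "IR2"],
        "Street": ["Grvl", "Grvl", np.nan, "Pave", "Pave", "Grvl"],
        "Lot Frontage": [64.0, 81.0, np.nan, 80.0, 96.0, np.nan],
        "Yr Sold": [2008, 2007, 2009, 2007, 2006, 2009],
    }
)
df_new = pd.DataFrame(
    {
        "Lot Shape": ["Reg", "IR1", np.nan],
        "Street": ["Pave", np.nan, "Grvl"],
        "Lot Frontage": [56.0, 58.0, np.nan],
        "Yr Sold": [2005, 2008, 2007],
    }
)

# Each sub-pipeline imputes first and then encodes or scales
nominal_pipeline = Pipeline(
    [("impute", SimpleImputer(strategy="most_frequent")), ("encode", OneHotEncoder())]
)
ordinal_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[["Reg", "IR1", "IR2", "IR3"]])),
    ]
)
numeric_pipeline = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", MinMaxScaler())]
)

# sparse_threshold=0 forces a dense array as the combined output
preprocessor = ColumnTransformer(
    [
        ("nominal", nominal_pipeline, ["Street"]),
        ("ordinal", ordinal_pipeline, ["Lot Shape"]),
        ("numeric", numeric_pipeline, ["Lot Frontage", "Yr Sold"]),
    ],
    sparse_threshold=0,
)

preprocessor.fit(df_train)           # learn all statistics from the training data
X_new = preprocessor.transform(df_new)  # apply them to the new data
print(X_new)
```

The output columns follow the order of the transformers: the two one-hot columns for `Street`, then `Lot Shape`, then the scaled numeric columns.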
