# Preprocessing Data

## Regression with categorical features

### Categorical Features

* Models expect numeric input, so categorical features must be encoded numerically
* Convert each category into a binary "dummy variable"
    * 0: Observation was **not** in that category
    * 1: Observation was in that category

## Dummy Variables

From:

|Origin|
|---|
|US|
|Europe|
|Asia|

To:

|origin_Asia|origin_Europe|origin_US|
|---|---|---|
|0|0|1|
|0|1|0|
|1|0|0|
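
As a minimal sketch, the conversion above can be reproduced with `pd.get_dummies()` on a three-row toy frame. The name `df_toy` and the lowercase column name `origin` are assumptions chosen to match the `origin_*` column names in the table; `dtype=int` is passed only so the result prints as 0/1.

```python
import pandas as pd

# Toy frame mirroring the Origin table (hypothetical, not the auto dataset)
df_toy = pd.DataFrame({"origin": ["US", "Europe", "Asia"]})

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(df_toy, columns=["origin"], dtype=int)
print(dummies)
```

The columns come out in sorted category order, which is why `origin_Asia` appears first.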

If an observation is neither Asia nor US, it must be Europe, so the `origin_Europe` column is redundant and can be dropped:

|origin_Asia|origin_US|
|---|---|
|0|1|
|0|0|
|1|0|

> If we keep every dummy column, the columns duplicate information (each row sums to 1), and this can be an issue for some models, for example linear regression with an intercept.
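
One way to see the duplicated information is to check the rank of the feature matrix once an intercept column is added. The sketch below uses a hypothetical five-row toy frame; the helper `with_intercept` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the three categories from the table
df_toy = pd.DataFrame({"origin": ["US", "Europe", "Asia", "US", "Asia"]})

full = pd.get_dummies(df_toy, columns=["origin"], dtype=int)
reduced = pd.get_dummies(df_toy, columns=["origin"], drop_first=True, dtype=int)

def with_intercept(frame):
    """Prepend a column of ones, as an intercept term would."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])

# With all dummies, every row sums to 1, which copies the intercept column
row_sums = full.sum(axis=1).tolist()

# A rank below the column count means the columns are linearly dependent
rank_full = np.linalg.matrix_rank(with_intercept(full))
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))

print(row_sums)
print(rank_full, "of", full.shape[1] + 1, "columns")
print(rank_reduced, "of", reduced.shape[1] + 1, "columns")
```

With all three dummies the matrix has four columns but only rank 3; after dropping one dummy the rank equals the number of columns.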

Code examples:

![](../images/automobil_dataset.png)

![](../images/cars_eda.png)

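In [None]:
import pandas as pd

# Two common ways to create dummy variables:
#   scikit-learn: OneHotEncoder()
#   pandas:       pd.get_dummies()

# Load the data and encode every non-numeric column as dummy variables
df = pd.read_csv("auto.csv")
df_origin = pd.get_dummies(df)
print(df_origin)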


Output with all dummy columns:
![](../images/dummies_print.png)

In [None]:
df_origin = df_origin.drop("origin_Asia", axis=1)
print(df_origin)

> Alternatively, `pd.get_dummies(df, drop_first=True)` drops the first category of each encoded column (here `origin_Asia`) in a single step.

Output without `origin_Asia`:
![](../images/dummies_print_without_asia.png)

## Ridge regression on the encoded features

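In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X (features, including the dummy columns) and y (target) are assumed to exist already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Ridge's `normalize` argument was removed in scikit-learn 1.2, so the
# features are scaled in a pipeline instead (similar, but not identical)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.5)).fit(X_train, y_train)

ridge.score(X_test, y_test)  # R^2 on the held-out test set
# 0.719064519022 (from the original run with normalize=True)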
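
For an end-to-end view, the encoding and the model can be combined in one pipeline. This is only a sketch on synthetic data: the frame `toy`, its columns `weight`, `origin` and `target`, and the coefficients used to generate them are all made up for illustration and are not taken from `auto.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns and coefficients)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "weight": rng.normal(3000, 500, n),
    "origin": rng.choice(["Asia", "Europe", "US"], n),
})
offset = toy["origin"].map({"Asia": 3.0, "Europe": 1.0, "US": 0.0})
toy["target"] = 40 - 0.005 * toy["weight"] + offset + rng.normal(0, 1, n)

X_toy = toy[["weight", "origin"]]
y_toy = toy["target"]

# Scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["weight"]),
    ("cat", OneHotEncoder(drop="first"), ["origin"]),
])
model = make_pipeline(preprocess, Ridge(alpha=0.5))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
model.fit(X_tr, y_tr)

# R^2 on the held-out rows
test_score = model.score(X_te, y_te)
print(test_score)
```

Keeping the encoder inside the pipeline means it is fitted on the training rows only, so the test rows go through exactly the same encoding.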

## Cross Validation

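In [None]:
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Instantiate a ridge regressor with feature scaling: ridge
# (replaces normalize=True, which was removed in scikit-learn 1.2)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.5))

# Perform 5-fold cross-validation (the score is R^2 for regressors): ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)

# Print the cross-validated scores
print(ridge_cv)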
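
The five fold scores are usually summarised by their mean and spread. A possible sketch follows, using synthetic data from `make_regression` as a stand-in because `X` and `y` are not defined in these notes; the names `X_demo`, `y_demo` and `scores` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data as a stand-in for X and y
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# One R^2 score per fold
scores = cross_val_score(Ridge(alpha=0.5), X_demo, y_demo, cv=5)

print(scores)
print("mean R^2: {:.3f}, std: {:.3f}".format(np.mean(scores), np.std(scores)))
```

A large standard deviation across folds suggests the score depends heavily on which rows end up in the held-out fold.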