# Preprocessing Data

- Impute missing values
- Convert categorical data to numeric values
- Scale data
- Evaluate multiple supervised learning models simultaneously
- Build pipelines

We will be using the music dataset throughout this course:

In [77]:
import pandas as pd

music_df = pd.read_csv("music.csv")
music_df.head()

Unnamed: 0.1,Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,genre
0,36506,60.0,0.896,0.726,214547.0,0.177,2e-06,0.116,-14.824,0.0353,92.934,0.618,1
1,37591,63.0,0.00384,0.635,190448.0,0.908,0.0834,0.239,-4.795,0.0563,110.012,0.637,1
2,37658,59.0,7.5e-05,0.352,456320.0,0.956,0.0203,0.125,-3.634,0.149,122.897,0.228,1
3,36060,54.0,0.945,0.488,352280.0,0.326,0.0157,0.119,-12.02,0.0328,106.063,0.323,1
4,35710,55.0,0.245,0.667,273693.0,0.647,0.000297,0.0633,-7.787,0.0487,143.995,0.3,1


## Dealing with Categorical Features

---

When dealing with categorical features, we convert them to binary features called dummy variables (0 and 1). For example:

If we have a feature called `genre` with the values Alternative, Anime, Blues, Classical, Country, Electronic, Hip-Hop, Jazz, Rap, and Rock, we can create dummy features like this:

| Alternative | Anime | Blues | Classical | Country | Electronic | Hip-Hop | Jazz | Rap | Rock |
|-------------|-------|-------|-----------|---------|------------|---------|------|-----|------|
| 1           | 0     | 0     | 0         | 0       | 0          | 0       | 0    | 0   | 0    |
| 0           | 1     | 0     | 0         | 0       | 0          | 0       | 0    | 0   | 0    |
| 0           | 0     | 1     | 0         | 0       | 0          | 0       | 0    | 0   | 0    |
| 0           | 0     | 0     | 1         | 0       | 0          | 0       | 0    | 0   | 0    |
| 0           | 0     | 0     | 0         | 1       | 0          | 0       | 0    | 0   | 0    |
| 0           | 0     | 0     | 0         | 0       | 1          | 0       | 0    | 0   | 0    |
| 0           | 0     | 0     | 0         | 0       | 0          | 1       | 0    | 0   | 0    |
| 0           | 0     | 0     | 0         | 0       | 0          | 0       | 1    | 0   | 0    |
| 0           | 0     | 0     | 0         | 0       | 0          | 0       | 0    | 1   | 0    |
| 0           | 0     | 0     | 0         | 0       | 0          | 0       | 0    | 0   | 1    |


Notice that the `Rock` feature is redundant: a row with 0 in every column from `Alternative` to `Rap` must be `Rock`, so we can omit it:

| Alternative | Anime | Blues | Classical | Country | Electronic | Hip-Hop | Jazz | Rap |
|-------------|-------|-------|-----------|---------|------------|---------|------|-----|
| 1           | 0     | 0     | 0         | 0       | 0          | 0       | 0    | 0   |
| 0           | 1     | 0     | 0         | 0       | 0          | 0       | 0    | 0   |
| 0           | 0     | 1     | 0         | 0       | 0          | 0       | 0    | 0   |
| 0           | 0     | 0     | 1         | 0       | 0          | 0       | 0    | 0   |
| 0           | 0     | 0     | 0         | 1       | 0          | 0       | 0    | 0   |
| 0           | 0     | 0     | 0         | 0       | 1          | 0       | 0    | 0   |
| 0           | 0     | 0     | 0         | 0       | 0          | 1       | 0    | 0   |
| 0           | 0     | 0     | 0         | 0       | 0          | 0       | 1    | 0   |
| 0           | 0     | 0     | 0         | 0       | 0          | 0       | 0    | 1   |
| 0           | 0     | 0     | 0         | 0       | 0          | 0       | 0    | 0   |


This approach is called One-Hot Encoding (OHE), and omitting one category (Rock in this case) avoids multicollinearity, as the omitted category can be inferred from the others.

To be able to do One-Hot-Encoding, here are the tools you can use:

## One-Hot-Encoding with Pandas

---

You can pass the entire DataFrame if it has only one categorical column. To be specific, pass only the Series of the categorical column, or pass the DataFrame together with the columns to encode, as seen below:

**Functions (pandas)**  

| Function               | Description                                        | Syntax                           |
|------------------------|----------------------------------------------------|----------------------------------|
| `get_dummies`         | Converts categorical variables into dummy/indicator variables | `pd.get_dummies(df, columns=[col])` |

**Arguments (pandas)**  

| Argument              | Description                                        | Syntax                           |
|----------------------|--------------------------------------------------|---------------------------------|
| `df`                | The DataFrame containing categorical columns      | `pd.get_dummies(df, columns=[col])` |
| `columns`           | Specifies which columns to encode                 | `pd.get_dummies(df, columns=[col])` |
| `drop_first`        | Drops the first category to avoid multicollinearity | `pd.get_dummies(df, drop_first=True)` |
| `dtype`             | Specifies data type for the output                 | `pd.get_dummies(df, dtype=int)` |
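
As a minimal sketch of the arguments above, the following encodes one categorical column of a small made-up DataFrame. The `toy_df` name and its values are assumptions for illustration and are not taken from `music.csv`. Note that `drop_first=True` drops the first category in sorted order, not the last one as in the Rock example earlier.

```python
import pandas as pd

# Hypothetical toy data; the genre values are made up for illustration
toy_df = pd.DataFrame({
    "popularity": [60, 63, 59, 54],
    "genre": ["Rock", "Jazz", "Rock", "Blues"],
})

# Encode only the `genre` column. drop_first=True drops the first category
# in sorted order ("Blues" here); dtype=int gives 0/1 instead of booleans.
dummies_df = pd.get_dummies(toy_df, columns=["genre"], drop_first=True, dtype=int)
print(dummies_df)
```

A row with 0 in both `genre_Jazz` and `genre_Rock` is then understood to be `Blues`.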

## One-Hot-Encoding with Sklearn

---



**Functions (scikit-learn)**  

| Function               | Description                                         | Syntax                                      |
|------------------------|-----------------------------------------------------|---------------------------------------------|
| `OneHotEncoder`       | Encodes categorical features as a one-hot numeric array | `OneHotEncoder()`                           |
| `fit`                 | Learns the categories from the data                 | `encoder.fit(X)`                            |
| `transform`           | Transforms categorical data into one-hot encoded format | `encoder.transform(X)`                      |
| `fit_transform`       | Combines `fit` and `transform`                      | `encoder.fit_transform(X)`                  |

**Arguments (scikit-learn)**  

| Argument              | Description                                        | Syntax                                      |
|----------------------|--------------------------------------------------|---------------------------------------------|
| `categories`        | Specifies categories manually or auto-detects them | `OneHotEncoder(categories='auto')`         |
| `sparse_output`    | Returns a sparse matrix if `True`, dense array if `False` (named `sparse` before scikit-learn 1.2) | `OneHotEncoder(sparse_output=False)`       |
| `handle_unknown`   | Determines how to handle unknown categories         | `OneHotEncoder(handle_unknown='ignore')`   |
| `drop`            | Specifies which category to drop to avoid redundancy | `OneHotEncoder(drop='first')`              |
| `dtype`           | Specifies data type for the encoded output           | `OneHotEncoder(dtype=int)`                 |
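
The following is a small sketch of `fit_transform` and `transform` with `handle_unknown='ignore'`. The arrays are made-up genre labels, and the variable names are assumptions for illustration. It relies on the default sparse output and converts it with `.toarray()`, so it does not depend on how the sparse argument is named in your scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical genre labels, shaped as a single-column 2D array
X_train_genre = np.array([["Rock"], ["Jazz"], ["Blues"], ["Rock"]])
X_test_genre = np.array([["Jazz"], ["Anime"]])  # "Anime" is never seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")

# Learn the categories from the training data only, then encode both sets
train_encoded = encoder.fit_transform(X_train_genre).toarray()
test_encoded = encoder.transform(X_test_genre).toarray()

print(encoder.categories_)  # the categories learned during fit
print(test_encoded)         # the unknown "Anime" row is encoded as all zeros
```

Fitting on the training data and only transforming the test data mirrors the split-first rule used for imputing and scaling later in this notebook.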



## Handling Missing Data

---

### Dropping
When we find missing data, a common approach is **dropping** rows with null values when they amount to less than 5% of the rows in our DataFrame.

**Functions**

| Function          | Description                                                   | Syntax                                      |
|-------------------|---------------------------------------------------------------|---------------------------------------------|
| dropna            | Removes missing values from the DataFrame                     | `df.dropna(axis=, how=, thresh=, subset=, inplace=)` |

**Arguments**

| Argument          | Description                                                   | Syntax                                      |
|-------------------|---------------------------------------------------------------|---------------------------------------------|
| axis              | Determines whether to drop rows or columns                    | `df.dropna(axis=0)` (rows) or `df.dropna(axis=1)` (columns) |
| how               | Specifies the condition for dropping                          | `df.dropna(how='any')` (default) or `df.dropna(how='all')` |
| thresh            | Requires a minimum number of non-null values to retain        | `df.dropna(thresh=n)` |
| subset            | Specifies which columns to check for null values              | `df.dropna(subset=['col1', 'col2'])` |
| inplace           | Whether to modify the DataFrame in place                      | `df.dropna(inplace=True)` |
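
One possible way to apply the 5% rule of thumb is sketched below: compute the share of missing values per column, then pass the columns under the threshold to `subset`. The data and the names `toy_df`, `missing_frac`, and `cols_to_drop_rows` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 40 rows, 1 missing `tempo` (2.5%), 10 missing `valence` (25%)
toy_df = pd.DataFrame({
    "tempo": [120.0] * 39 + [np.nan],
    "valence": [np.nan] * 10 + [0.5] * 30,
})

# Fraction of missing values in each column
missing_frac = toy_df.isna().mean()

# Columns that have some missing values, but less than 5% of the rows
cols_to_drop_rows = missing_frac[(missing_frac > 0) & (missing_frac < 0.05)].index.tolist()

# Drop only the rows that are null in those columns
cleaned_df = toy_df.dropna(subset=cols_to_drop_rows)
print(cols_to_drop_rows, cleaned_df.shape)
```

Columns above the threshold (`valence` here) keep their nulls and are candidates for imputation instead.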

**Examples**

1. **Drop Rows with Any Null Values**:
   ```python
   df.dropna()
   ```

### Imputing

Imputation means using subject-matter expertise to replace missing data with educated guesses.
- For numeric features, it is common to use the **mean** or the **median**
- For categorical features, we use the mode (the most frequent value)
- We must perform the train-test split before imputing, so that the imputer is fitted only on the training set. Otherwise, test set information leaks into our model, a problem known as data leakage.

**Arguments**

| Arguments       | Description                                                   | Possible Value(s) | Syntax                                      |
|-----------------|---------------------------------------------------------------|-------------------|---------------------------------------------|
| strategy        | The imputation strategy to use                                | 'mean', 'median', 'most_frequent', 'constant' | `SimpleImputer(strategy='')`                |
| fill_value      | The value to use for imputation when strategy is 'constant'   | Any constant value | `SimpleImputer(fill_value='')`              |
| missing_values  | The placeholder for the missing values                        | `np.nan`, None, etc. | `SimpleImputer(missing_values='')`          |
| add_indicator   | Whether to add a missing indicator column                     | True, False       | `SimpleImputer(add_indicator=)`             |
| copy            | Whether to copy the input data                                | True, False       | `SimpleImputer(copy=)`                      |
| verbose         | Whether to print progress messages (deprecated in scikit-learn 1.1 and removed in 1.3) | Integer (0 or 1)  | `SimpleImputer(verbose=)`                   |

**Functions**

| Functions       | Description                                                   | Syntax                                      |
|-----------------|---------------------------------------------------------------|---------------------------------------------|
| SimpleImputer   | Impute missing values using a specified strategy              | `SimpleImputer(strategy='', fill_value='')` |
| fit_transform   | Fit the imputer on the training set and transform the data    | `imputer.fit_transform(X_train)`            |
| transform       | Transform the test set using the fitted imputer               | `imputer.transform(X_test)`                 |
| append          | Combine arrays along a specified axis                         | `np.append(arr1, arr2, axis=)`              |


**Imputation workflow in scikit-learn:**

- Import `SimpleImputer`
- Split the data into training and test sets; if numeric and categorical columns need different imputation techniques, also separate them
- Fit the imputer on the training data, then transform both the training and test data
- Combine the data again if you separated it for different imputation techniques

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# We apply two different imputation techniques, so we separate the
# categorical column from the numeric columns first.
# `df` is a DataFrame with nulls; 'genre' and 'target' are placeholder column names.
X_category = df['genre'].values.reshape(-1, 1)
X_number = df.drop(['target', 'genre'], axis = 1).values
y = df['target'].values

# Using the same random_state keeps the rows of both splits aligned
X_train_category, X_test_category, y_train, y_test = train_test_split(X_category, y, test_size = 0.2, random_state = 42)
X_train_number, X_test_number, y_train, y_test = train_test_split(X_number, y, test_size = 0.2, random_state = 42)

# To impute missing values, we instantiate an imputer
imp_cat = SimpleImputer(strategy = "most_frequent") # the mode
X_train_category = imp_cat.fit_transform(X_train_category)
X_test_category = imp_cat.transform(X_test_category)

imp_num = SimpleImputer(strategy = "mean") # the mean
X_train_number = imp_num.fit_transform(X_train_number)
X_test_number = imp_num.transform(X_test_number)

# Combine the data
X_train = np.append(X_train_number, X_train_category, axis = 1)
X_test = np.append(X_test_number, X_test_category, axis = 1)
```

Another example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Create a sample DataFrame with missing values
data = {
    'Feature1': [1, 2, None, 4, 5],
    'Feature2': [None, 2, 3, 4, 5],
    'Target': [1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

# Split the data into features (X) and target (y)
X = df[['Feature1', 'Feature2']]
y = df['Target']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the imputer
imputer = SimpleImputer(strategy='mean')

# Fit the imputer on the training set and transform the training set
X_train_imputed = imputer.fit_transform(X_train)

# Transform the test set using the imputer fitted on the training set
X_test_imputed = imputer.transform(X_test)

print("Imputed Training Set:\n", X_train_imputed)
print("Imputed Test Set:\n", X_test_imputed)
```


## Pipeline and ColumnTransformer
---
In a `Pipeline`, every step except the last must be a transformer; the last step is typically the model. The complete guide is in another notebook.
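
Until then, here is a hedged sketch of how `Pipeline` and `ColumnTransformer` might combine the imputing, encoding, and scaling steps from this notebook. The toy data, the column names (`tempo`, `mood`, `target`), and the step names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and one categorical feature
toy_df = pd.DataFrame({
    "tempo": [90.0, 110.0, np.nan, 130.0, 100.0, 125.0],
    "mood": pd.Series(["calm", "upbeat", "calm", np.nan, "calm", "upbeat"], dtype=object),
    "target": [0, 1, 0, 1, 0, 1],
})
X = toy_df[["tempo", "mood"]]
y = toy_df["target"]

# Numeric columns: impute with the mean, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute with the mode, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own preprocessing steps
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["tempo"]),
    ("cat", categorical_steps, ["mood"]),
])

# Every step but the last is a transformer; the last step is the model
model = Pipeline([
    ("preprocess", preprocessor),
    ("logreg", LogisticRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

The sketch fits and predicts on the same six rows only to stay short; with real data you would fit on the training split and score on the test split.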

## Scaling and Centering
---

- Many models use some form of distance between data points to make predictions
- Features on a larger scale can disproportionately influence the model, hence we normalize or standardize our features (scaling and centering)

### How to Scale our Data?

- Given a column, we can subtract the mean and divide by the standard deviation so that the feature is centered around zero and has a variance of one (standardization)
- We can also subtract the minimum and divide by the range so that the column has a minimum of zero and a maximum of one (min-max normalization)
- Or we can normalize our data so that it ranges from -1 to +1
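
The three options above can be written out by hand with NumPy. This is only a sketch on a made-up column, and the variable names are hypothetical; in practice scikit-learn's scalers do the same arithmetic for every column.

```python
import numpy as np

# Hypothetical single feature column
column = np.array([200.0, 250.0, 300.0, 350.0, 400.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (column - column.mean()) / column.std()

# Min-max normalization: subtract the minimum, divide by the range -> [0, 1]
min_max = (column - column.min()) / (column.max() - column.min())

# Rescale the min-max result so the values range from -1 to +1
centered = 2 * min_max - 1

print(standardized.mean(), standardized.std())
print(min_max)
print(centered)
```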

The standardization workflow is as follows:
1. Import `StandardScaler`
2. Get X and y
3. Split the data to avoid leakage
4. Instantiate a scaler
5. Fit the scaler on the training data and transform it
6. Transform the test data with the fitted scaler

In [78]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

X = music_df.drop("genre", axis = 1).values
y = music_df['genre'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(np.mean(X), np.std(X))
print(np.mean(X_train_scaled), np.std(X_train_scaled))

20666.582585618085 68890.98734103922
3.5971225997855074e-16 0.9999999999999996


You can also place the scaler inside a pipeline, since it is a transformer:

In [79]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier 

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors = 6))])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.2)

knn_scaled = pipeline.fit(X_train, y_train)
y_pred = knn_scaled.predict(X_test)
knn_scaled

In [80]:
print("The predictions of scaled data: ", y_pred)
print("The accuracy of scaled data:", knn_scaled.score(X_test, y_test))

The predictions of scaled data:  [0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 0
 0 0 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 0
 1 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 0 1 0 0 1 0 0
 0 0 0 1 1 1 0 1 0 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 0 0 0 1
 0 1 1 1 0 1 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 1 0 1 0 0
 1 1 1 0 0 0 0 0 1 1 1 1 1 0 1]
The accuracy of scaled data: 0.89


**It's weird that unscaled data is more accurate LOL!**

In [81]:
model = KNeighborsClassifier(n_neighbors = 6)
model.fit(X_train, y_train)
y_pred_unscaled = model.predict(X_test)
print("The predictions of unscaled data: ", y_pred_unscaled)
print("The accuracy of unscaled data:", model.score(X_test, y_test))

The predictions of unscaled data:  [0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 1 0 0 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 1 1 0
 0 0 1 1 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 0
 1 1 0 1 1 1 0 1 1 1 0 0 1 0 0 1 1 1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 0 0 1 0 0
 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 0 0 0 1
 0 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 0 0
 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1]
The accuracy of unscaled data: 0.925


## Implementing Scaling with CV (Grid Search)

---

In [82]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, KFold
import numpy as np

X = music_df.drop("genre", axis = 1).values
y = music_df['genre'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)


pipeline = Pipeline([('scaler', StandardScaler()),
                    ('knn', KNeighborsClassifier())])

# Note: fit() returns the fitted estimator itself, so assigning its result
# (e.g. knn_scaled = pipeline.fit(X_train, y_train)) is optional.
# The pipeline does not need to be fitted separately here: GridSearchCV fits it on each fold.

kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

cv = GridSearchCV(pipeline, {'knn__n_neighbors' : np.arange(1, 50)}, cv = kf)

cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
print(cv.best_params_)
df = pd.DataFrame(cv.cv_results_)
df

{'knn__n_neighbors': 2}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_knn__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.002704,0.00117,0.010642,0.002748,1,{'knn__n_neighbors': 1},0.9375,0.875,0.91875,0.94375,0.95,0.925,0.027099,2
1,0.002384,0.000434,0.009752,0.000823,2,{'knn__n_neighbors': 2},0.94375,0.89375,0.93125,0.94375,0.91875,0.92625,0.018708,1
2,0.002197,0.000266,0.009004,0.001189,3,{'knn__n_neighbors': 3},0.93125,0.85625,0.925,0.91875,0.9375,0.91375,0.029422,6
3,0.002611,0.001863,0.009973,0.003279,4,{'knn__n_neighbors': 4},0.94375,0.875,0.9125,0.93125,0.95,0.9225,0.026984,3
4,0.001152,0.000944,0.009202,0.001085,5,{'knn__n_neighbors': 5},0.9375,0.86875,0.9125,0.8875,0.94375,0.91,0.028668,7
5,0.003098,0.003067,0.010086,0.001874,6,{'knn__n_neighbors': 6},0.94375,0.875,0.925,0.925,0.9375,0.92125,0.024238,4
6,0.002217,0.00049,0.009973,0.000439,7,{'knn__n_neighbors': 7},0.925,0.8375,0.93125,0.89375,0.925,0.9025,0.035045,11
7,0.002326,0.00062,0.010059,0.001294,8,{'knn__n_neighbors': 8},0.9375,0.85625,0.9375,0.91875,0.93125,0.91625,0.030771,5
8,0.001925,0.000669,0.010933,0.001232,9,{'knn__n_neighbors': 9},0.93125,0.85,0.9125,0.89375,0.9125,0.9,0.02767,12
9,0.002485,0.000424,0.010892,0.001027,10,{'knn__n_neighbors': 10},0.9375,0.8625,0.9125,0.90625,0.925,0.90875,0.025495,8


In [83]:
cv.best_score_

0.9262499999999999

In [84]:
cv.score(X_test, y_test)

0.88

## Evaluating Multiple Models

---

The question still remains: **Which model should we choose?**

- We can try different models in a loop and compare their accuracy scores, as in the following
- Remember that we need to preprocess the data first and then fit. We delay model tuning here because we have not selected a specific model yet

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Reload the data so that this cell also runs in a fresh kernel
music_df = pd.read_csv("music.csv")
X = music_df.drop("genre", axis = 1).values
y = music_df['genre'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {"knn": KNeighborsClassifier(), "logreg": LogisticRegression(), "DTC": DecisionTreeClassifier()}
results = []

# Use the same folds for every model so the scores are comparable
kf = KFold(n_splits = 6, random_state = 42, shuffle = True)
for model in models.values():
    results.append(cross_val_score(model, X_train_scaled, y_train, cv = kf, scoring = 'accuracy'))

plt.boxplot(results, labels = list(models.keys()))
plt.show()
print(results)

# Compare the models on the held-out test set
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    print(name + ": " + str(test_score))
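
A possible variation on the loop above is to wrap each model in its own pipeline, so the scaler is refit on the training part of every fold instead of once on the whole training set. The sketch below uses synthetic data from `make_classification` as a stand-in, which is an assumption so that the block runs without `music.csv`; the names `X_demo`, `y_demo`, `candidate_models`, and `mean_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; swap in the music features and labels)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

candidate_models = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(),
    "DTC": DecisionTreeClassifier(random_state=42),
}

kf = KFold(n_splits=6, shuffle=True, random_state=42)
mean_scores = {}
for name, estimator in candidate_models.items():
    # The scaler is part of the pipeline, so it is fitted inside each fold
    pipe = Pipeline([("scaler", StandardScaler()), (name, estimator)])
    fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=kf, scoring="accuracy")
    mean_scores[name] = fold_scores.mean()

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```

This keeps the validation part of each fold out of the scaler's statistics, which is the same leakage concern that motivates splitting before imputing.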