**<center><font color='#023F7C' size="6.5">Course 1: Data Preprocessing </font>** <br>
<font color=#023F7C size=4>**Hi!ckathon #6**</font> <br>
<font color=#023F7C size=2> 6:15PM-8:00PM </font> <br>

</center>

<img src = https://www.hi-paris.fr/wp-content/uploads/2020/09/logo-hi-paris-retina.png width = "300" height = "200" >

<font color="#023F7C">**Authors**:</font>
- Laurène DAVID, Machine Learning Research Engineer @ Hi! PARIS <br>
- Pierre-Antoine AMIAND-LEROY, Machine Learning Research Engineer @ Hi! PARIS <br>
- Nathan NEIKE, Machine Learning Research Engineer @ Hi! PARIS <br>
- Damien NGO, Machine Learning Research Engineer @ Hi! PARIS <br>
- Nassima OULD OUALI, Machine Learning Research Engineer @ Hi! PARIS <br>

# **III. Feature Engineering** ⛏️

An essential step before creating a Machine Learning model is Data Preprocessing. <br> It consists of transforming features of the dataset into a proper format for a model. <br>



In [None]:
# Columns of dataset
dataset.columns

Index(['artists', 'popularity', 'duration_ms', 'explicit', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre'],
      dtype='object')

In [None]:
# Shape of dataset
dataset.shape

(5926, 17)

In [None]:
X = dataset.drop(columns=["track_genre"]) # features
y = dataset["track_genre"] # target variable

In [None]:
# Shapes of features and targets
print(X.shape, y.shape)

(5926, 16) (5926,)


**<font size=4> <u>Train, test split</u></font>** <br>
To test the performance of a model on unseen data, the original dataset is split into a **training set** and a **test set**. <br> In most cases, no separate set of new observations is available for testing, so part of the original data is held out for that purpose.
- The training set is used to train the model and learn its parameters (70%-80% of the data)
- The test set is used to evaluate the model on unseen data once training is done (20%-30% of the data)


<br>

<img src = https://ugc.futurelearn.com/uploads/assets/05/8a/058ad514-cb51-4107-855e-23aeace3d0e3.png width = "600" height = "300" >



**Tips**:
- The size of the test split mostly depends on the size of the original data. Small datasets usually require the test set to be small (20%), whereas large datasets can include a larger test set.

- If the **target variable is imbalanced** (skewed distribution of classes), you can use the `stratify=y` parameter. This ensures that the class proportions are preserved in both splits.

- It is recommended to fit preprocessing steps on the training set only and then apply them to each set separately, so as to avoid **Data Leakage**. The training set shouldn't contain information from the test set, so that the test set simulates unseen data.

- If the dataset is small, you can use **cross validation** to get a more stable estimate of the model's performance.
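
The sketch below illustrates the `stratify` and cross validation tips on a small synthetic dataset. The data, the variable names (`X_toy`, `y_toy`) and the choice of `LogisticRegression` are assumptions made for illustration only; they are not part of the course dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical imbalanced data: 90 samples of class 0, 10 samples of class 1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy keeps the 90/10 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1
)
print(np.bincount(y_tr), np.bincount(y_te))

# 5-fold cross validation on the training set only:
# each fold is used once for validation, and the scores are averaged
scores = cross_val_score(LogisticRegression(), X_tr, y_tr, cv=5)
print(scores.mean(), scores.std())
```

The standard deviation of the fold scores gives a rough idea of how much the performance estimate depends on the particular split.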


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape,  y_test.shape)


(4740, 16) (4740,)
(1186, 16) (1186,)


## **1. Categorical encoding**
Categorical encoding refers to the process of **converting categorical data into numerical format**, so that it can be used as input for algorithms to process. Most Machine Learning algorithms work with numerical data rather than text, so this step is essential.

Many types of encoders can be used on categorical variables. Here are important questions you should ask before selecting one.
- *What are the categorical variables in the dataset?* (excluding the target)
- *How many unique values does the variable have?* (low or high cardinality)
- *Is the variable ordinal or nominal?* (ordered or unordered categories)




<img src = https://miro.medium.com/v2/resize:fit:756/1*MAr4rWj6zw0Rdo01ecZu1A.png width = "650" height = "400" >

<br>

Here are a few common categorical encoding methods:
- Low cardinality and nominal: **One-hot encoding**
- High cardinality and nominal: **Frequency Encoding, Target Encoding, Hash Encoding**
- Ordinal variables: **Label encoding**

**What are the categorical features in the dataset?** <br>
Our dataset has five categorical features: three are integers, one is a string and one is a boolean. <br>
`mode` and `time_signature` don't need to be encoded as they are already in a proper numerical format.

<br>

| Variable | Description | dtype |
|----------| ----------- | ----- |
| artists | Nominal, High cardinality (1925 unique values) | string |
| explicit | Nominal, Low cardinality (True/False) | boolean |
| key | Nominal, Low cardinality (12 unique values, 0 to 11) | int |
| mode | Nominal, Low cardinality (0/1) | int |
| time_signature | Ordinal, Low cardinality (4 unique values) | int |



<font size=4> <b><u>One Hot Encoding</u> </b> </font> <br>
One Hot Encoding consists of creating a binary variable for each category. <br>
It is the preferred method of encoding when a categorical variable is nominal.


<img src = https://miro.medium.com/v2/resize:fit:1200/1*ggtP4a5YaRx6l09KQaYOnw.png width = "600" height = "200">

One Hot Encoding shouldn't be used on variables with a high cardinality. <br>
Otherwise, it can add a huge number of variables, which makes the model overly complex and slows down training.<br>







In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Select the variables to OneHotEncode in the train and test set
var_onehot = ["explicit", "key", "artists"]
onehot_df_train = X_train[var_onehot]
onehot_df_test = X_test[var_onehot]

In [None]:
# Fit the OneHotEncoder to the training set
onehot_enc = OneHotEncoder(sparse_output=False)
onehot_enc.fit(onehot_df_train)

In [None]:
# Get the new feature names created
onehot_features = onehot_enc.get_feature_names_out()

onehot_df_train_t = pd.DataFrame(onehot_enc.transform(onehot_df_train).astype("int"), columns=onehot_features)
onehot_df_test_t = pd.DataFrame(onehot_enc.transform(onehot_df_test).astype("int"), columns=onehot_features)

In [None]:
onehot_df_train_t.head()

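| | explicit_False | explicit_True | key_0 | key_1 | key_2 | key_3 | key_4 | key_5 | key_6 | key_7 | ... | key_9 | key_10 | key_11 | artists_Dean Martin | artists_Ella Fitzgerald | artists_Feid | artists_J Balvin | artists_Nat King Cole | artists_Other | artists_Wolfgang Amadeus Mozart |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |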


In [None]:
# Recreate the train and test sets with the newly encoded variables
X_train = pd.concat([X_train.drop(columns=var_onehot).reset_index(drop=True), onehot_df_train_t], axis=1)
X_test = pd.concat([X_test.drop(columns=var_onehot).reset_index(drop=True), onehot_df_test_t], axis=1)

<font size=4> <b><u>Label Encoding</u> </b> </font> <br>
For ordinal variables, you can try Label Encoding, which assigns a numerical value to each category. <br> It isn't recommended to use this encoder on nominal variables, as the model might interpret the numerical values as an order between categories that doesn't exist.

<img src = https://www.statology.org/wp-content/uploads/2022/08/labelencode2-1.jpg width = "400" height = "250">

In [None]:
from sklearn.preprocessing import LabelEncoder
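
The cell above only imports the encoder, so here is a minimal sketch of how it could be used. In scikit-learn, `LabelEncoder` is intended for the target variable and sorts classes alphabetically, while `OrdinalEncoder` is its counterpart for features and lets you specify the category order. The `size` column and the genre list are hypothetical examples, not columns of the course dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature (not a column of the course dataset)
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# OrdinalEncoder takes the category order explicitly: small < medium < large
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes["size_encoded"] = ord_enc.fit_transform(sizes[["size"]]).ravel()
print(sizes)

# LabelEncoder works on a 1D target and orders classes alphabetically
label_enc = LabelEncoder()
genres_encoded = label_enc.fit_transform(["rock", "jazz", "rock", "latin"])
print(label_enc.classes_, genres_encoded)
```

As with the other encoders, the encoder would be fitted on the training set and then used to transform the test set.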

<font size=4> <b><u>Other methods:</u></b> </font> <br>
Other categorical encoding techniques can be used on features with a high cardinality. <br>

**a) Frequency encoding**: <br>
This method replaces each category in a variable by its frequency in the data.
<br>


<img src = https://www.elastic.co/guide/en/machine-learning/current/images/frequency-encoding.jpg width = "600" height = "250" >
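
Frequency encoding can be sketched with pandas alone. The toy `train` and `test` dataframes below are assumptions for illustration; the frequencies are learned on the training set only and then reused on the test set.

```python
import pandas as pd

# Hypothetical toy data standing in for a high-cardinality column
train = pd.DataFrame({"artists": ["A", "B", "A", "C", "A", "B"]})
test = pd.DataFrame({"artists": ["A", "C", "D"]})

# Relative frequency of each category, learned on the training set only
freq_map = train["artists"].value_counts(normalize=True)

train["artists_freq"] = train["artists"].map(freq_map)
# Categories unseen during training (here "D") get a frequency of 0
test["artists_freq"] = test["artists"].map(freq_map).fillna(0.0)

print(train)
print(test)
```

Note that two categories with the same frequency receive the same encoded value, so the model can no longer tell them apart.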






**b) Target encoding**: <br>
This method replaces each category with the average target value (or target class frequency) observed for that category. <br> If the target variable is multi-class, a variable should be created for each class to predict. <br>

<u>Warning </u>: Target encoding can sometimes lead to **Target Leakage**. <br> This occurs when the model is trained with information that will not be available in unseen data, such as information about the target variable. This could lead to misleadingly high scores on the training set that don't carry over to the test set.

<br>

<!-- <img src = https://miro.medium.com/v2/resize:fit:419/1*W77md1OC9HSuAFy9b0LEIw.png width = "300" height = "400" > -->

<img src = https://miro.medium.com/v2/resize:fit:1016/1*5F7LDKbf1qRw9yvbJ2WJyw.png width = "450" height = "300">
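
Here is a minimal sketch of target encoding for a binary target, using pandas only. The toy data and the column name `is_popular` are hypothetical. The mapping is computed on the training set and reused on the test set, with the global training mean as a fallback for unseen categories.

```python
import pandas as pd

# Hypothetical training data with a binary target
train = pd.DataFrame({
    "artists": ["A", "A", "B", "B", "B", "C"],
    "is_popular": [1, 0, 1, 1, 0, 0],
})
test = pd.DataFrame({"artists": ["A", "B", "D"]})

# Average target value per category, learned on the training set only
global_mean = train["is_popular"].mean()
target_map = train.groupby("artists")["is_popular"].mean()

train["artists_te"] = train["artists"].map(target_map)
# Unseen categories (here "D") fall back to the global training mean
test["artists_te"] = test["artists"].map(target_map).fillna(global_mean)

print(train)
print(test)
```

In this simple version, each training row is still encoded with a statistic that includes its own target value, which is one source of the leakage mentioned above. Recent scikit-learn versions (1.3 and later) provide a `TargetEncoder` class that limits this by encoding the training data with an internal cross-fitting scheme.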



**c) Hash encoding** <br>
Hash Encoding encodes categorical data into numerical values using a **"hashing function"**. The hashing function **maps each category to a pre-determined and fixed number of numerical columns**, instead of creating a column for each category.

This method reduces high cardinality if you set the number of numerical columns much lower than the number of categories.

<u>Warning</u>:
- This method can lead to <u>information loss</u> and reduce interpretability since we transform the data into fewer features.
- Since a high number of categorical values are mapped to a smaller number of features, different categorical values could be represented by the same hash values. This is called a <u>collision</u>.
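
To make the idea concrete, here is a small sketch of hash encoding written with the standard library only. The helper `hash_encode` and the choice of MD5 as hashing function are assumptions for illustration; dedicated implementations exist, for example `FeatureHasher` in scikit-learn.

```python
import hashlib


def hash_encode(category: str, n_features: int = 4) -> list[int]:
    """Map a category to a one-hot vector of fixed length n_features."""
    # A stable hash (unlike Python's built-in hash(), which is salted per run)
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_features
    vector = [0] * n_features
    vector[bucket] = 1
    return vector


artists = ["Dean Martin", "Ella Fitzgerald", "Feid", "J Balvin", "Nat King Cole"]
encoded = {artist: hash_encode(artist) for artist in artists}

for artist, vector in encoded.items():
    print(f"{artist:20s} -> {vector}")
```

Since five categories are mapped to four columns here, at least two artists necessarily share the same column, which is exactly the collision described above.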

## **2. Feature scaling**

Feature scaling consists of transforming all of the (continuous) variables in a dataset to a **similar scale**. <br>
This prevents features with larger values from dominating the others, so that all features can contribute to the model on a comparable footing. It is particularly important for models that use Euclidean distances, such as k-nearest neighbors or k-means.


In [None]:
# Select continuous variables
var_scaling = X_train.select_dtypes(include=["float64"]).columns.to_list()
var_scaling.extend(["popularity", "duration_ms", 'time_signature'])

In [None]:
dataset[var_scaling].head()

| | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | popularity | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.643 | 0.268 | -15.073 | 0.09 | 0.593 | 2e-06 | 0.316 | 0.62 | 143.813 | 58 | 298266 | 4 |
| 1 | 0.484 | 0.898 | -4.132 | 0.164 | 0.365 | 0.0 | 0.091 | 0.68 | 91.975 | 59 | 482586 | 4 |
| 2 | 0.608 | 0.638 | -6.008 | 0.0292 | 0.581 | 0.0172 | 0.448 | 0.439 | 140.109 | 54 | 219437 | 4 |
| 3 | 0.695 | 0.293 | -16.278 | 0.0431 | 0.596 | 0.0158 | 0.132 | 0.637 | 143.804 | 68 | 299146 | 4 |
| 4 | 0.583 | 0.308 | -18.303 | 0.0465 | 0.581 | 0.0106 | 0.257 | 0.241 | 118.226 | 59 | 387716 | 4 |


Here are two common methods for feature scaling:
- **Normalization**  (`MinMaxScaler`): Scale features to a given range, usually between 0 and 1. <br>
Normalization is recommended for variables without a Gaussian distribution or with a small standard deviation. <br>
It isn't recommended for variables with outliers.

- **Standardization**  (`StandardScaler`): Scale features by removing the mean and scaling to unit variance. <br>
Standardization doesn't squeeze values into a fixed range, so it is usually less distorted by outliers than Normalization, but it doesn't guarantee balanced feature scales in the presence of outliers. <br>
Outliers still influence the empirical mean and standard deviation computed for Standardization.

<img src = https://miro.medium.com/v2/resize:fit:744/1*HW7-kYjj6RKwrO-5WTLkDA.png width = "600" height = "300" >
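
The two methods correspond to simple formulas: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ for Normalization and $z = \frac{x - \mu}{\sigma}$ for Standardization. The sketch below computes them by hand on a toy feature (an assumption for illustration) and compares the result with scikit-learn's scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy feature with one large value
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Normalization: (x - min) / (max - min), values end up in [0, 1]
x_minmax_manual = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population std (ddof=0)
x_std_manual = (x - x.mean()) / x.std()

x_minmax = MinMaxScaler().fit_transform(x)
x_std = StandardScaler().fit_transform(x)

print(np.allclose(x_minmax, x_minmax_manual))
print(np.allclose(x_std, x_std_manual))
```

With Normalization, the large value is mapped to 1 and the other values are compressed into the lower part of the range, which illustrates why this method is sensitive to outliers.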





In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [None]:
scaler_std = StandardScaler() # Standardization
#scaler_minmax = MinMaxScaler() # Normalization

scaler_std.fit(X_train[var_scaling])

In [None]:
X_train[var_scaling] = scaler_std.transform(X_train[var_scaling])
X_test[var_scaling] = scaler_std.transform(X_test[var_scaling])

In [None]:
X_train.head()

| | popularity | duration_ms | danceability | energy | loudness | mode | speechiness | acousticness | instrumentalness | liveness | ... | key_9 | key_10 | key_11 | artists_Dean Martin | artists_Ella Fitzgerald | artists_Feid | artists_J Balvin | artists_Nat King Cole | artists_Other | artists_Wolfgang Amadeus Mozart |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.11632 | -0.295761 | 0.497158 | 0.381649 | 0.070392 | 1 | -0.455825 | -0.874162 | -0.607004 | 1.935737 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | -1.019619 | -0.205809 | -1.06795 | 1.2618 | 0.373036 | 1 | -0.190406 | -0.985223 | -0.057398 | -0.458076 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -0.986245 | 0.506245 | -1.645153 | 1.268927 | 0.389884 | 1 | 1.318981 | -0.983842 | -0.264816 | 0.409681 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | -1.019619 | -0.872038 | 0.74691 | 1.333067 | 0.75451 | 0 | 2.179405 | -0.650954 | -0.609577 | 0.230145 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | -1.019619 | 2.085031 | -1.756154 | -0.23125 | 0.030205 | 1 | -0.403324 | 1.615346 | 1.971928 | 4.292146 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |


For variables with a **very skewed distribution**, it is recommended to use other types of scalers:
- `RobustScaler`: Scale features using statistics that are robust to outliers (median and interquartile range). <br>
The scaling is thus not influenced by a small number of very large marginal outliers and transformed variables tend to have a similar range.

- `PowerTransformer`: Apply a power transform (yeo-johnson or box-cox) to make features more Gaussian-like. <br> These non-linear transformations can reduce the scale of outliers in a variable.

To learn about the impact of different scalers on variables, you can read this [page](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html).
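
As a rough illustration, the sketch below applies both scalers to a synthetic right-skewed variable. Note that the scikit-learn class for power transforms is named `PowerTransformer`. The log-normal data and the small `skewness` helper are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, RobustScaler


def skewness(a: np.ndarray) -> float:
    """Sample skewness: mean of the cubed standardized values."""
    a = a.ravel()
    return float(np.mean(((a - a.mean()) / a.std()) ** 3))


# Hypothetical right-skewed variable (log-normal distribution)
rng = np.random.default_rng(0)
x_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# RobustScaler: subtract the median, divide by the interquartile range
x_robust = RobustScaler().fit_transform(x_skewed)

# PowerTransformer: non-linear transform, then standardization by default
x_power = PowerTransformer(method="yeo-johnson").fit_transform(x_skewed)

print("skewness before:", skewness(x_skewed))
print("skewness after RobustScaler:", skewness(x_robust))
print("skewness after PowerTransformer:", skewness(x_power))
```

`RobustScaler` is a linear transformation, so it changes the scale but not the shape of the distribution, whereas `PowerTransformer` changes the shape itself.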

<font size=4> <b>Bonus: <u>Feature selection</u></b> </font> <br>
Another popular preprocessing technique is **Feature selection**. <br>
Feature selection is the process of selecting a subset of relevant features for your model.

Scikit-learn has many functions to perform Feature selection (`VarianceThreshold`, `SelectKBest`, `RFE`...)  <br>
You can find more information about these methods [here](https://scikit-learn.org/stable/modules/feature_selection.html).

<img src = https://miro.medium.com/v2/resize:fit:1100/format:webp/1*Qmyx3_UX9QuSyLzw9X8GXw.png width = "600" height = "400" >
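
Here is a short sketch of two of these selectors on synthetic data. The three toy features (one informative, one pure noise, one constant) and the choice of `f_classif` as scoring function are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Hypothetical data: one informative feature, one noise feature, one constant
rng = np.random.default_rng(0)
y_toy = np.array([0] * 50 + [1] * 50)
informative = y_toy + rng.normal(scale=0.3, size=100)
noise = rng.normal(size=100)
constant = np.ones(100)
X_toy = np.column_stack([informative, noise, constant])

# Step 1: drop features with zero variance (here, the constant column)
var_selector = VarianceThreshold(threshold=0.0)
X_var = var_selector.fit_transform(X_toy)
print(var_selector.get_support())

# Step 2: keep the k features most related to the target (ANOVA F-test)
kbest_selector = SelectKBest(score_func=f_classif, k=1)
X_best = kbest_selector.fit_transform(X_var, y_toy)
print(kbest_selector.get_support())
```

Like encoders and scalers, selectors that use the target (such as `SelectKBest`) should be fitted on the training set only.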


## **3. Scikit-learn pipeline**

Pipelines can be useful to test different strategies to replace missing values as well as multiple data pre-processing methods. <br>
They also let you fit the transformations on the training set only and then apply them to the training and test sets separately, which limits the risk of Data Leakage.

Scikit-learn has two classes to create preprocessing pipelines:
- `Pipeline`: Apply multiple transformations to the same columns, one after the other.
- `ColumnTransformer`: Transform each set of columns separately, then combine the results.

<img src = https://miro.medium.com/v2/resize:fit:1200/1*LdwXVtec9-Byt-lOO7Csyg.png width = "600" height = "300" >

Any scikit-learn transformer (encoder, scaler,...) can be used in a pre-processing pipeline. <br>
You can also add a missing value imputer (for example `SimpleImputer`) and Machine Learning models to a pipeline.


Let's build a pipeline to predict the popularity variable.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
# Train, test split for predicting "popularity"
X_ = dataset.drop(columns=["popularity"]) # features
y_ = dataset["popularity"] # target

X_train_, X_test_, y_train_, y_test_ = train_test_split(X_, y_)

In [None]:
# columns for categorical pipeline
categorical_pipeline = X_.select_dtypes(include=["object"]).columns.to_list()
categorical_pipeline.append("key")

# columns for continuous pipeline
continuous_pipeline = X_.drop(columns=categorical_pipeline + ["mode", "time_signature"]).columns.to_list()

In [None]:
# Build the pipeline then fit it to training data
pipe1 = ColumnTransformer(
    [("categorical", OneHotEncoder(sparse_output=False), categorical_pipeline),
     ("continuous", StandardScaler(), continuous_pipeline)],
    sparse_threshold=0)

pipe1.fit(X_train_)

In [None]:
# Transform training and test set with pipeline
X_train_1 = pipe1.transform(X_train_)
X_test_1 = pipe1.transform(X_test_)

In [None]:
# Get new feature names
new_columns = [column.split("__")[1] for column in pipe1.get_feature_names_out()]

# Rebuild a dataframe with new features
X_train_1 = pd.DataFrame(X_train_1, columns=new_columns)
X_test_1 = pd.DataFrame(X_test_1, columns=new_columns)

**Tip**: If you need to apply more than one pre-processing step to a set of columns, you can wrap these steps in a `Pipeline` and pass it to `ColumnTransformer` as a transformer.

<font size='5'>The course ends here, thank you for listening ! </font><br><br>

Upcoming sessions of the Pre-Hi!ckathon training:

The next course on Machine Learning – Part 1 will be on November 13th. <br>

See you on November 18th at 6:00 PM for the Hi! PARIS Career Fair at Télécom Paris.<br>

The course on Machine Learning – Part 2 will be on November 20th. <br>

You will learn how to design an impressive Pitch presentation on November 25th. <br>

You will complete the Pre-Hi!ckathon training with a course on Deep Learning and Explainability on November 26th. <br><br>

Before you leave – we need your help for our research project 👇<br>
As generative video models (such as OpenAI’s Sora or Google’s Veo) become more powerful, synthetic videos are increasingly realistic, and even experts struggle to distinguish real from fake. At Hi! PARIS Center – AI for Science, Business & Society, the FakeParts project (with researchers Vicky Kalogeiton, Gianni Franchi, and Xi Wang) investigates how both humans and algorithms detect fake videos in a deepfake video dataset, and how these models evolve over time.<br><br>

We would be very grateful if you could take 5 minutes to participate in a short online study:
you will watch 20 short videos and simply decide for each one whether it is real or fake. Your answers will directly support ongoing research on AI safety, trust in generative models, and the detection of deepfakes.<br><br>

👉 Take the survey here:
<a href="https://lnkd.in/eKqEbNhZ">https://lnkd.in/eKqEbNhZ
</a><br><br>

Thank you again for your participation and for contributing to our research!