<img src="https://github.com/FarzadNekouee/Flight-EDA-to-Preprocessing/blob/master/image.jpg?raw=true" width="1800">


<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:115%; text-align:left">

<h3 align="left"><font color=#8502d1>Problem:</font></h3>

Welcome to our journey through a cool dataset all about flights! This dataset is like a big treasure chest, full of information about when flights leave, when they arrive, how long they're delayed, how far they go, and lots more. We'll be playing detective to spot patterns and find clues that help us figure out what makes flights late. So, let's buckle up and get ready for a fun ride through this flight data!

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:115%; text-align:left">

<h3 align="left"><font color=#8502d1>Objectives:</font></h3>
    
1. Data Understanding
2. Exploratory Data Analysis (EDA)
   - Univariate Analysis
   - Bivariate Analysis
   - Multivariate Analysis
3. Data Preprocessing
   - Irrelevant Features Removal
   - Missing Value Treatment
   - Outlier Treatment
   - Encoding Categorical Features
   - Time Feature Transformation
   - Feature Scaling
   - Transforming Skewed Features

<a id="contents_tabel"></a>    
<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:115%; text-align:left">

<h3 align="left"><font color=#8502d1>Table of Contents:</font></h3>

* [Step 1 | Import Libraries](#import)
* [Step 2 | Read Dataset](#read)
* [Step 3 | Dataset Overview](#overview)
    - [Step 3.1 | Dataset Basic Information](#basic)
    - [Step 3.2 | Summary Statistics for Numerical Variables](#num_statistics)
    - [Step 3.3 | Summary Statistics for Categorical Variables](#cat_statistics)
* [Step 4 | EDA](#eda)
    - [Step 4.1 | Univariate Analysis](#univariate)
    - [Step 4.2 | Bivariate Analysis](#bivariate)
    - [Step 4.3 | Multivariate Analysis](#multivariate)
* [Step 5 | Data Preprocessing](#preprocessing)
    - [Step 5.1 | Irrelevant Features Removal](#removal)
    - [Step 5.2 | Missing Value Treatment](#missing)
    - [Step 5.3 | Outlier Treatment](#outlier)
    - [Step 5.4 | Categorical Features Encoding](#encoding)
    - [Step 5.5 | Time Feature Transformation](#time)
    - [Step 5.6 | Feature Scaling](#scaling)
    - [Step 5.7 | Transforming Skewed Features](#boxcox)

<h2 align="left"><font color=#8502d1>Let's get started:</font></h2>

<a id="import"></a>
# <p style="background-color:#8502d1; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 1 | Import Libraries</p>

⬆️ [Table of Contents](#contents_tabel)

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from scipy import stats
%matplotlib inline

In [None]:
# Set the resolution of the plotted figures
plt.rcParams['figure.dpi'] = 120

# Configure Seaborn plot styles: Set background color and use dark grid
sns.set(rc={'axes.facecolor': '#F3E8FF'}, style='darkgrid')

<a id="read"></a>
# <p style="background-color:#8502d1; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 2 | Read Dataset</p>

⬆️ [Table of Contents](#contents_tabel)

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

First, I am going to load the dataset:

In [None]:
# Read dataset
df = pd.read_csv('https://raw.githubusercontent.com/hakimyameen/DS_ML_Work/main/flights_data.csv')
df.head()

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

<h2 align="left"><font color=#8502d1>Dataset Description:</font></h2>
    
| __Variable__ | __Description__ |
|     :---      |       :---      |      
| __id__ | A unique identifier assigned to each flight record in this dataset. |                
| __year__ | The year in which the flight took place. The dataset includes flights from the year 2013 |                        
| __month__ | The month of the year in which the flight occurred, represented by an integer ranging from 1 (January) to 12 (December) |
| __day__ | The day of the month on which the flight took place, represented by an integer from 1 to 31 |
| __dep_time__ | The actual departure time of the flight, represented in 24-hour format (hhmm) |                     
| __sched_dep_time__ | The locally scheduled departure time of the flight, presented in a 24-hour format (hhmm) |
| __dep_delay__ | The delay in flight departure, calculated as the difference (in minutes) between the actual and scheduled departure times. Positive values indicate a delay, while negative values indicate an early departure. |  
| __arr_time__ | The actual arrival time of the flight, represented in 24-hour format (hhmm) |                      
| __sched_arr_time__ | The locally scheduled arrival time of the flight, presented in a 24-hour format (hhmm) |
| __arr_delay__ |  The delay in flight arrival, calculated as the difference (in minutes) between the actual and scheduled arrival times. Positive values indicate a delay, while negative values indicate an early arrival |
| __carrier__ |  A two-letter code representing the airline carrier responsible for the flight |                      
| __flight__ | The designated number of the flight |              
| __tailnum__ | A unique identifier associated with the aircraft used for the flight |                      
| __origin__ | A three-letter code signifying the airport from which the flight departed |
| __dest__ | A three-letter code representing the airport at which the flight arrived |
| __air_time__ | The duration of the flight, measured in minutes |                 
| __distance__ | The total distance (in miles) between the origin and destination airports |
| __hour__ | The hour component of the scheduled departure time, expressed in local time |
| __minute__ | The minute component of the scheduled departure time, expressed in local time |
| __time_hour__ | The scheduled departure time of the flight, represented in local time and formatted as "yyyy-mm-dd hh:mm:ss" |
| __name__ | The full name of the airline carrier responsible for the flight |

<a id="overview"></a>
# <p style="background-color:#8502d1; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 3 | Dataset Overview</p>

⬆️ [Table of Contents](#contents_tabel)

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

Now, I am going to get a comprehensive overview of the dataset:

<a id="basic"></a>
# <b><span style='color:darkorange'>Step 3.1 |</span><span style='color:#8502d1'> Dataset Basic Information</span></b>

In [None]:
# Display a concise summary of the dataframe.
df.info()

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

<h2 align="left"><font color=#8502d1>Inferences:</font></h2>

* The dataset contains __215,427 entries__ (rows) and __21 columns__.
    
    
* The columns are of different data types:
    - integer (int64)
    - float (float64)
    - object (usually representing string or categorical data).
    
    
* The dataset contains some __missing values__. Specifically, the columns `dep_time`, `dep_delay`, `arr_time`, `arr_delay`, `tailnum`, and `air_time` have fewer non-null entries than the total number of rows, which means some values in these columns are missing.

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

Based on the data types and the feature explanations we had earlier, we can see that the `id` and `flight` features are indeed numerical in terms of data type, but categorical in terms of their semantics. These two features should be converted to string (__object__) data type for proper analysis and interpretation:

In [None]:
# Convert 'id' and 'flight' to object data type
df['id'] = df['id'].astype(str)
df['flight'] = df['flight'].astype(str)

<a id="num_statistics"></a>
# <b><span style='color:darkorange'>Step 3.2 |</span><span style='color:#8502d1'> Summary Statistics for Numerical Variables</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

Now let's look at the summary statistics of the numerical features:

In [None]:
# Get the summary statistics for numerical variables
df.describe().T

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

<h2 align="left"><font color=#8502d1>Inferences:</font></h2>
    
    
* __`year`__: All records are from the year 2013, hence there is no variation.
* __`month`__, __`day`__, __`hour`__, __`minute`__: These features show the scheduled departure date and time. They have a good range and seem to be evenly distributed throughout the year and day.
* __`dep_time`__, __`sched_dep_time`__, __`arr_time`__, __`sched_arr_time`__: These are the actual and scheduled departure and arrival times of the flights. They are in the 24-hour format and cover all possible values.
* __`dep_delay`__, __`arr_delay`__: These are the delay variables, measured in minutes; `arr_delay` is the target used later in preprocessing. The values range from negative (early departure or arrival) to positive (late departure or arrival).
* __`air_time`__: This is the flight duration in minutes. It varies from 20 to 695 minutes.
* __`distance`__: This is the total distance between the origin and destination airports. It varies from 17 to 4983 miles.  

<a id="cat_statistics"></a>
# <b><span style='color:darkorange'>Step 3.3 |</span><span style='color:#8502d1'> Summary Statistics for Categorical Variables</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">
    
Afterward, let's look at the summary statistics of the categorical features:

In [None]:
# Get the summary statistics for categorical variables
df.describe(include='object')

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

<h2 align="left"><font color=#8502d1>Inferences:</font></h2>
    
    
* __`id`__, __`flight`__: These are unique identifiers and have a large number of unique values.
* __`carrier`__, __`name`__: These are airline carrier codes and names. There are 16 unique carriers in the dataset.
* __`tailnum`__: This is a unique identifier associated with the aircraft used for the flight. It also has a large number of unique values.
* __`origin`__, __`dest`__: These are the airport codes from which the flight departed and at which it arrived. There are 3 unique origin airports and 105 unique destination airports in the dataset.
* __`time_hour`__: This is the scheduled departure time of the flight, represented in local time and formatted as "yyyy-mm-dd hh:mm:ss". There are 6936 unique times in the dataset.

<a id="eda"></a>
# <p style="background-color:#8502d1; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 4 | EDA</p>

⬆️ [Table of Contents](#contents_tabel)

<a id="univariate"></a>
# <b><span style='color:darkorange'>Step 4.1 |</span><span style='color:#8502d1'> Univariate Analysis</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">
    
We can perform univariate analysis on these columns based on their datatype:

* For __numerical__ data, we can use a histogram to visualize the data distribution. The number of bins should be chosen appropriately to represent the data well.
* For __categorical__ data, we can use a bar plot to visualize the frequency of each category.

In [None]:
# Set color for the plots
color = '#8502d1'

# Define function to plot histograms
def plot_hist(column, bins, title, xlabel, fontsize=8, rotation=0):
    plt.figure(figsize=(15,5))
    # Keep the returned bin edges under their own name so the `bins` argument is not overwritten
    counts, bin_edges, _ = plt.hist(column, bins=bins, color=color, edgecolor='white')
    plt.title(title, fontsize=15)
    plt.xlabel(xlabel, fontsize=12)
    plt.ylabel('Frequency', fontsize=12)

    # Add text annotation for frequencies at the center of each bin
    bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    for count, x in zip(counts, bin_centers):
        if count > 0:
            plt.text(x, count, str(int(count)), fontsize=fontsize, ha='center', va='bottom', rotation=rotation)
    plt.show()

# Define function to plot bar plots
def plot_bar(column, title, xlabel, fontsize=8, rotation=0):
    plt.figure(figsize=(15,5))
    counts = column.value_counts()
    counts.plot(kind='bar', color=color, edgecolor='white')
    plt.title(title, fontsize=15)
    plt.xlabel(xlabel, fontsize=12)
    plt.ylabel('Frequency', fontsize=12)

    # Add text annotation for frequencies with rotation and larger font size
    for i, v in enumerate(counts):
        plt.text(i, v, str(v), fontsize=fontsize, ha='center', va='bottom', rotation=rotation)
    plt.show()

### <b><span style='color:darkorange'>Step 4.1.1 |</span><span style='color:#8502d1'> year</span></b>

In [None]:
# The year in which the flight took place. The dataset includes flights from the year 2013.
plot_bar(df['year'], 'Year', 'Year of Flight')

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

<h2 align="left"><font color=#8502d1>Inference:</font></h2>
    
The dataset contains flight data for only one year (__2013__), so the bar plot only has one bar.

### <b><span style='color:darkorange'>Step 4.1.2 |</span><span style='color:#8502d1'> month</span></b>

In [None]:
# The month of the year in which the flight occurred, represented by an integer ranging from 1 (January) to 12 (December).
plot_hist(df['month'], bins=12, title='Month', xlabel='Month of Flight')

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

<h2 align="left"><font color=#8502d1>Inference:</font></h2>

The histogram shows that flights are distributed approximately uniformly across the months, with __a slight decrease in February__, which is likely due to the smaller number of days in that month.
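
One way to check the February explanation is to divide the monthly counts by the number of days in each month. The sketch below does this on a synthetic calendar with exactly one flight per day (an assumption made purely for illustration; `toy` is not the flights dataset), so the raw count dips in February while the per-day rate stays flat:

```python
import pandas as pd

# Synthetic stand-in for the flights table: exactly one flight per day of 2013
dates = pd.date_range("2013-01-01", "2013-12-31", freq="D")
toy = pd.DataFrame({"month": dates.month})

# Raw number of flights per month (what the histogram displays)
flights_per_month = toy["month"].value_counts().sort_index()

# Number of calendar days in each month of 2013, in the same order as the counts
days_in_month = [
    pd.Period(f"2013-{int(m):02d}", freq="M").days_in_month
    for m in flights_per_month.index
]

# Flights per day removes the month-length effect
flights_per_day = flights_per_month / days_in_month
print(flights_per_day)
```

Applied to the real data, the same normalization would show whether February is still low once the month length is accounted for.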

### <b><span style='color:darkorange'>Step 4.1.3 |</span><span style='color:#8502d1'> day</span></b>

In [None]:
# The day of the month on which the flight took place, represented by an integer from 1 to 31.
plot_hist(df['day'], bins=31, title='Day', xlabel='Day of Flight', fontsize=7, rotation=45)

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

<h2 align="left"><font color=#8502d1>Inference:</font></h2>
    
The histogram reveals a mostly uniform distribution of flights across the days of the month, with slight decreases at the end of the month. These decreases are due to some months having fewer than 31 days.

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

<h2 align="left"><font color=#8502d1>Conclusion:</font></h2>

Based on the bivariate analysis, the features that have a __noticeable impact on arrival delay__ are:

- Month
- Departure Time and Scheduled Departure Time
- Departure Delay
- Arrival Time and Scheduled Arrival Time
- Carrier
- Origin
- Destination
- Hour

On the other hand, the following features __do not__ seem to significantly __influence arrival delay__:

- Day
- Air Time
- Distance
- Minute

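<a id="preprocessing"></a>
# <p style="background-color:#8502d1; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 5 | Data Preprocessing</p>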


<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

Data preprocessing is a crucial step in any machine learning project. It involves cleaning and transforming raw data into a format that can be understood by machine learning algorithms.

<a id="removal"></a>
# <b><span style='color:darkorange'>Step 5.1 |</span><span style='color:#8502d1'> Irrelevant Feature Removal</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

Based on our careful review and exploratory data analysis so far, here's a rundown of each feature I am going to take out of the picture:
    
- __`id`__: This is a unique identifier assigned to each flight record in the dataset. It carries no informational value for the model, as it does not reflect any characteristic of the flights.

- __`year`__: All the flights took place in 2013, so this feature is a constant for all records. A constant feature cannot improve the model's performance, because it does not contribute any information that can help distinguish one record from another.

- __`flight`__: This feature represents the designated number of the flight. However, there are many unique flight numbers (3844), which could lead to overfitting. Each flight number corresponds to a specific route, and while it's true that some routes may be more prone to delays than others, the high dimensionality of this feature may be more harmful than helpful.

- __`tailnum`__: This feature is a unique identifier associated with each aircraft used for the flight. There are even more unique tail numbers (4043) than there are unique flight numbers. Although certain aircraft may be more prone to delays (e.g., older planes that require more maintenance), again, the high dimensionality of this feature may lead to overfitting.

- __`time_hour`__: This feature represents the scheduled departure time of the flight, formatted as "yyyy-mm-dd hh:mm:ss". Since we have separate features for the year, month, day, and scheduled departure time (in the form of `sched_dep_time`), this feature is redundant and should be removed.

- __`minute`__: This feature represents the minute component of the scheduled departure time. We already have `sched_dep_time` that includes this information. Hence, `minute` can be removed.

- __`hour`__: Similar to `minute`, this feature is also redundant as we already have `sched_dep_time`. So `hour` should be removed as well.

- __`carrier`__: This is a two-letter code representing the airline carrier responsible for the flight. We have another feature, `name`, which represents the same information but in a more descriptive form (the full name of the airline carrier). To avoid redundancy, we can remove `carrier` and keep `name`.
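
The redundancy arguments above can be checked before dropping anything. The following is a minimal sketch on a hypothetical four-row sample (the `toy` frame and its values are made up for illustration); on the real data the same checks would be run against `df`:

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the flights data
toy = pd.DataFrame({
    "carrier": ["UA", "UA", "DL", "AA"],
    "name": ["United Air Lines Inc.", "United Air Lines Inc.",
             "Delta Air Lines Inc.", "American Airlines Inc."],
    "sched_dep_time": [515, 1340, 600, 2359],
    "hour": [5, 13, 6, 23],
    "minute": [15, 40, 0, 59],
})

# carrier and name are redundant if the mapping is one-to-one in both directions
one_name_per_carrier = (toy.groupby("carrier")["name"].nunique() == 1).all()
one_carrier_per_name = (toy.groupby("name")["carrier"].nunique() == 1).all()

# hour and minute are redundant if they can be rebuilt from the hhmm integer
hour_matches = (toy["sched_dep_time"] // 100 == toy["hour"]).all()
minute_matches = (toy["sched_dep_time"] % 100 == toy["minute"]).all()

print(one_name_per_carrier, one_carrier_per_name, hour_matches, minute_matches)
```

If any of these checks came back `False` on the full dataset, the corresponding column would carry information of its own and dropping it would deserve a second look.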

In [None]:
df.drop(['id', 'year', 'flight', 'tailnum', 'time_hour', 'minute', 'hour', 'carrier'], axis=1, inplace=True)

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">
    
Let us check the list of remaining features:

In [None]:
df.columns

____
<a id="missing"></a>
# <b><span style='color:darkorange'>Step 5.2 |</span><span style='color:#8502d1'>  Missing Value Treatment</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">
    
__Missing data__ can disrupt many machine learning algorithms. It's crucial to handle these appropriately. Depending on the nature of the data and the percentage of missing values, we can:

* Drop the rows or columns with missing data, especially if the percentage of missing data is very high.
* Fill the missing data with a central tendency measure (mean, median, or mode).
* Predict the missing values using a machine learning algorithm like KNN.
* Use algorithms that can handle missing values.

In [None]:
# Check the percentage of missing values in each column
missing_percent = df.isnull().mean().sort_values(ascending=False) * 100
print("Missing Value Percentage by Columns:\n", round(missing_percent,2))

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

Some of the missing values belong to the __target column__ (__`arr_delay`__). So, first I am going to drop rows with missing target values (`arr_delay`) to avoid introducing bias into our model. This is because we want our model to learn from actual observations, not from imputed values:

In [None]:
df.dropna(subset=['arr_delay'], inplace=True)
df.reset_index(drop=True, inplace=True)

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">
    
Then I use a __K-Nearest Neighbors (KNN) Imputer__ to fill in missing values in the other columns. The KNN imputer is a more advanced imputation method that fills missing values based on similar observations, rather than just using the mean or median. This allows us to capture more complex patterns in the data, potentially leading to more accurate imputations:

In [None]:
# Separate features and target
X = df.drop(columns=['arr_delay'])
y = df['arr_delay']

# Initialize the imputer
imputer = KNNImputer(n_neighbors=5)

# Apply the imputer
columns_to_impute = ['dep_time', 'dep_delay', 'arr_time', 'air_time']
X[columns_to_impute] = imputer.fit_transform(X[columns_to_impute])

# Check missing values again
X.isnull().sum().sum()

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

<h2 align="left"><font color=#8502d1>Note:</font></h2>

The separation of features and target before imputation ensures that our imputation process is not influenced by the target values, thereby preventing data leakage.
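
To make the behaviour of the KNN imputer more concrete, here is a small sketch on a made-up array (the values are illustrative, not taken from the flights data). With `n_neighbors=2`, the missing entry is filled with the average of that column over the two rows that are closest on the observed column:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: the second row is missing its second feature
toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [100.0, 1000.0],
])

# Neighbors are found with a NaN-aware Euclidean distance on the observed columns
toy_imputer = KNNImputer(n_neighbors=2)
filled = toy_imputer.fit_transform(toy)

print(filled[1])
```

The two nearest rows are the first and third, so the gap is filled with the mean of 10 and 30, while the distant fourth row is ignored. Because the distance is Euclidean, columns with large numeric ranges tend to dominate the neighbor search, which is worth keeping in mind when imputing columns on different scales.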

____
<a id="outlier"></a>
# <b><span style='color:darkorange'>Step 5.3 |</span><span style='color:#8502d1'>  Outlier Treatment</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

__Outliers__ are values that stand out from the rest because they're very different. They can sometimes cause problems, especially when we're doing something like __regression__, where outliers can have a big impact.

__In our flight delay data, these outliers represent the really long delays. These aren't errors or mistakes, they're a real part of flying that we want our model to learn from. So, we don't want to just throw these values away.__

But we also don't want these outliers to have too much influence. So, we use something called a __Box-Cox transformation__ later. __This is a way of adjusting our data to make the outliers less extreme, without getting rid of them.__

This way, our model can still learn from the outliers (the really long delays) without them having an outsized impact. This matters because regression models are usually sensitive to outliers, yet we still want the model to learn from all parts of our data.
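
One practical caveat: `scipy.stats.boxcox` only accepts strictly positive inputs, whereas delay columns contain zero and negative values (early flights). The sketch below uses the closely related Yeo-Johnson transformation instead, which is my substitution for illustration and not necessarily what the later step uses. The data is synthetic and `toy_delays` is a hypothetical name:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "delays" in minutes, including negative (early) values
toy_delays = rng.exponential(scale=30.0, size=1000) - 15.0

# Yeo-Johnson accepts zero and negative inputs; lambda is estimated by maximum likelihood
transformed, fitted_lambda = stats.yeojohnson(toy_delays)

print("skewness before:", stats.skew(toy_delays))
print("skewness after: ", stats.skew(transformed))
```

All observations are kept; only their spacing changes, so the long right tail is compressed rather than removed.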

____
<a id="encoding"></a>
# <b><span style='color:darkorange'>Step 5.4 |</span><span style='color:#8502d1'> Categorical Features Encoding</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">
    
Categorical variables need to be encoded because most machine learning algorithms work with numerical data and cannot directly handle text or categorical data. First, let's identify the categorical columns:

In [None]:
# Identify categorical columns
cat_columns = X.select_dtypes(include=['object']).columns

# Check the number of unique categories in each categorical feature
X[cat_columns].nunique()

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">
    
We identified `origin`, `dest`, and `name` as __nominal categorical features__ (containing categories without any inherent order). Before proceeding with encoding, however, it would be helpful to check how balanced the categories are within each feature. For example, if a feature has a category that is very rarely present in the data, one-hot encoding could result in a column with mostly zeros, which might not be very informative for the model:

In [None]:
# Check the distribution of categories within each feature
for col in cat_columns:
    print(f"\nDistribution of categories in {col}:")
    print(X[col].value_counts())

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">
    
Here's what we found:

* __`name`__: The `name` column has 16 unique categories, corresponding to different airline carriers. The distribution of categories is fairly balanced.
    
* __`origin`__: The `origin` column represents the airport from which the flight departed. There are 3 unique categories in this column, corresponding to three different airports. The distribution of categories is also quite balanced.

* __`dest`__ : The `dest` column represents the airport at which the flight arrived. There are 105 unique categories in this column, which is quite high. Some destinations have many flights (like ORD, ATL, LAX), while others have very few (like LEX, LGA).

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

Based on the above observations, here's our plan for encoding:

* __One-hot Encoding__: We can apply one-hot encoding to the `origin` and `name` columns. These columns have relatively few categories and are fairly balanced.


* __Frequency Encoding__: This method replaces each category in the feature with its frequency (i.e., the proportion of the total number of instances it represents). It's suitable for high-cardinality categorical features and does not introduce an arbitrary order. We will use this method for the `dest` feature, as it has a large number of unique categories, and the distribution is skewed, with some categories appearing much more frequently than others.
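
As a toy illustration of frequency encoding (the four-row `toy` frame is made up for this example), each airport code is replaced by the share of rows it accounts for:

```python
import pandas as pd

# Hypothetical destination column with one frequent and two rare airports
toy = pd.DataFrame({"dest": ["ORD", "ORD", "ATL", "LEX"]})

# Share of rows per category
freq = toy["dest"].value_counts(normalize=True)

# Replace each category with its share
toy["dest_freq"] = toy["dest"].map(freq)
print(toy)
```

Note that `ATL` and `LEX` end up with the same encoded value here: categories that happen to be equally frequent become indistinguishable, which is the main trade-off of this encoding.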

In [None]:
X['dest'].unique()

In [None]:
# Create a copy of the dataset for encoding
X_encoded = X.copy()

# Apply one-hot encoding to 'origin' and 'name' ('carrier' was dropped in Step 5.1)
X_encoded = pd.get_dummies(X_encoded, columns=['origin', 'name'], drop_first=True)

# Apply frequency encoding to 'dest'
dest_freq = X_encoded['dest'].value_counts() / len(X_encoded)  # calculate the frequencies
print(dest_freq)
X_encoded['dest'] = X_encoded['dest'].map(dest_freq)  # map frequencies to the feature

# Show the result
X_encoded.head()

____
<a id="scaling"></a>

# <b><span style='color:darkorange'>Step 5.6 |</span><span style='color:#8502d1'> Feature Scaling</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

__Feature Scaling__, also known as __standardization__ or __normalization__, is a crucial preprocessing step for many machine learning algorithms. It adjusts the range of feature values so that they can be compared on a common scale. __This adjustment is particularly important for algorithms that rely on the magnitude or distance of the features__, such as __k-nearest neighbors (KNN)__, __support vector machines (SVMs)__, and __neural networks__.

When we perform feature scaling, we must avoid __data leakage__. This means that we should not let our scaling process be influenced by any data that isn't part of the training set. However, in our current scenario, we are using the entire dataset for training. Therefore, we can fit the scaler on the whole dataset.
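
If the data were later split into training and test sets, the leakage-safe pattern would be to fit the scaler on the training rows only and reuse its statistics for the test rows. A minimal sketch, assuming a random synthetic matrix `X_toy` and an arbitrary 75/25 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic feature matrix standing in for the continuous columns
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X_toy, test_size=0.25, random_state=42)

toy_scaler = StandardScaler()
X_train_scaled = toy_scaler.fit_transform(X_train)  # mean and std come from the training rows only
X_test_scaled = toy_scaler.transform(X_test)        # test rows reuse the training statistics

print(toy_scaler.mean_)
```

The test set then never influences the mean and standard deviation used for scaling.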

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

Now, let's discuss our data. We have different types of features in our dataset: __continuous features__, __categorical features that have been one-hot encoded or frequency encoded__, and __cyclic features that have been transformed from time data__.

For the continuous features, we'll use standard scaling (or Z-score normalization). This puts the features on a common scale by centering each one at its mean with a unit standard deviation. Note that standard scaling does not bound the values, so extreme observations remain extreme after scaling; it makes features comparable rather than removing the influence of outliers.

For the __categorical features__ that have been transformed through frequency encoding, we should also apply standard scaling. Even though these features originated as categorical data, the encoding has transformed them into continuous features that can take on a range of values.

As for the __binary features resulting from one-hot encoding, and cyclic features resulting from time transformation__, we don't need to apply scaling. Binary features already have values of 0 or 1, which are within the range of scaled data. Furthermore, applying scaling to binary features could distort their clear, interpretable structure. Cyclic features, on the other hand, have been engineered to capture the cyclical nature of time data, and scaling these could distort this cyclical pattern.
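
The time transformation itself is not shown in this section, so the following is only an assumed sketch of how an `hhmm` integer could be turned into cyclic features. The `_sin` and `_cos` suffixes mirror the column-name pattern used to select cyclic columns when scaling; the `toy` frame and the exact formula are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical scheduled departure times in hhmm format
toy = pd.DataFrame({"sched_dep_time": [0, 600, 1200, 1800, 2359]})

# Convert hhmm to minutes since midnight
minutes = (toy["sched_dep_time"] // 100) * 60 + toy["sched_dep_time"] % 100

# Map the 24-hour day onto a full circle
angle = 2 * np.pi * minutes / (24 * 60)
toy["sched_dep_time_sin"] = np.sin(angle)
toy["sched_dep_time_cos"] = np.cos(angle)

print(toy)
```

Both new columns already lie in the range from -1 to 1, and 23:59 ends up next to 00:00 on the circle, which is why these features are left unscaled.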

In [None]:
# Define binary, cyclic, and continuous columns
binary_cols     = [col for col in X_encoded.columns if X_encoded[col].value_counts().index.isin([0,1]).all()]
cyclic_cols     = [col for col in X_encoded.columns if col.endswith('_cos') or col.endswith('_sin')]
continuous_cols = [col for col in X_encoded.columns if col not in binary_cols + cyclic_cols]

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the continuous features
X_encoded[continuous_cols] = scaler.fit_transform(X_encoded[continuous_cols])

# Show the result
X_encoded.head()

In [None]:
continuous_cols

<div style="border-radius:10px; padding: 15px; background-color: #e2c9ff; font-size:110%; text-align:left">

We've tidied up our data and it's now looking great! It's prepped, primed, and ready to dive into the world of machine learning models.

<h2 align="left"><font color='#8502d1'>Best Regards - Yameen Hakim</font></h2>