___
<h1> Machine Learning </h1>
<h2> Systems Engineering and Computer Technologies / Engenharia de Sistemas e Tecnologias Informáticas
(LESTI)</h2>
<h3> Instituto Superior de Engenharia / Universidade do Algarve </h3>

[LESTI](https://ise.ualg.pt/curso/1941) / [ISE](https://ise.ualg.pt) / [UAlg](https://www.ualg.pt)

Pedro J. S. Cardoso (pcardoso@ualg.pt)

___

# Feature engineering

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself. Examples of feature engineering include:
- deriving new features from existing data,
- selecting only the most relevant features,
- creating features from images, text, and sensor data,
- normalizing numerical features,
- encoding categorical features,
- transforming features into a more suitable format for machine learning algorithms.
- ...

In this notebook, we will explore some of these techniques. For that, let us consider the Seoul Bike Sharing Demand dataset. The dataset contains the hourly count of rental bikes between 2017 and 2018 in Seoul, Korea, with the corresponding weather and seasonal information. The dataset can be downloaded from https://archive.ics.uci.edu/dataset/560/seoul+bike+sharing+demand but we have already downloaded it and saved it in the `Datasets` folder.

So, we can start by loading the dataset into a pandas dataframe. 


In [None]:
import pandas as pd
df = pd.read_csv('./../../Datasets/SeoulBikeData.csv')
df.head()

By calling the dataframe's `info` method, we can see that there are no missing values but there are some categorical columns.
(For treating missing values, please refer to the `12-Missing-Data.ipynb` notebook, where some techniques are studied.)

In [None]:
df.info()

## Categorical data transformation

Most machine learning algorithms cannot handle categorical data. Therefore, categorical data must be transformed into numerical data. There are several ways to do this, like:
- One-hot encoding -- transform each category into a binary column
- Ordinal encoding -- transform each category into a number
- Binary encoding -- transform each category into a binary number
- Hash encoding -- transform each category into a hash number
- ...

Let us see how to perform the first two techniques.

### One hot encoding

One hot encoding is a technique used to transform categorical features to binary features. The idea is to create a new column for each category and assign a 1 or 0 to the column. For example, the season column has four categories: Spring, Summer, Autumn, and Winter. We can convert this column into 3 columns and use 0/False or 1/True to indicate if the sample belongs to that category or not. We only need 3 columns because if the sample is not in the first three categories, then it must be in the fourth category.


To achieve this, we can use the pandas `get_dummies` function (we'll do it for the 'Holiday' and 'Functioning Day' columns).

In [None]:
df = pd.get_dummies(df, columns=['Holiday', 'Functioning Day'], drop_first=True)
df
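To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy dataframe. The `toy` data is hypothetical and only mirrors the 'Seasons' example from the text; it is not read from the dataset file.

```python
import pandas as pd

# Toy data (hypothetical, only for illustration)
toy = pd.DataFrame({'Seasons': ['Spring', 'Summer', 'Autumn', 'Winter', 'Summer']})

# One binary column per category (4 columns)
full = pd.get_dummies(toy, columns=['Seasons'])

# drop_first=True drops the first category in sorted order ('Autumn');
# an 'Autumn' row is then the one where all remaining columns are 0/False
reduced = pd.get_dummies(toy, columns=['Seasons'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Note that the dropped category is chosen by sorted order of the labels, not by order of appearance in the data.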

### Ordinal encoding
Ordinal encoding is a technique used to transform categorical features to ordinal features. The idea is to assign a number to each category. For example, the season column has four categories: Spring, Summer, Autumn, and Winter. We can convert this column into a single column with values 1, 2, 3, and 4. To achieve this, we can use the pandas replace method.

In [None]:

df['Seasons'] = df['Seasons'].replace({'Spring': 1, 'Summer': 2, 'Autumn': 3, 'Winter': 4})
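As a possible alternative to a hand-written dictionary, sklearn's `OrdinalEncoder` can produce the same kind of encoding when the category order is passed explicitly. The sketch below uses a hypothetical toy dataframe; note that the codes it produces start at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical, only for illustration)
toy = pd.DataFrame({'Seasons': ['Winter', 'Spring', 'Summer', 'Autumn']})

# Explicit order of the categories; without it, categories are sorted alphabetically
order = ['Spring', 'Summer', 'Autumn', 'Winter']
encoder = OrdinalEncoder(categories=[order])

# fit_transform expects a 2D input and returns a float array of shape (n_samples, 1)
codes = encoder.fit_transform(toy[['Seasons']]).ravel().astype(int)
toy['Seasons_code'] = codes
print(toy)
```

Passing the order explicitly matters because the numeric codes imply an ordering that some models will use.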

## Dates' transformation

Dates are usually represented as strings or a specific data type. However, machine learning algorithms cannot handle strings. Therefore, dates must be transformed into numerical data. There are several ways to do this, like extracting the year, month, day, day of week, etc. from the date.

In our case, we can split this column into three columns: month, day, and day of week. To achieve this, we first convert the column with the pandas `to_datetime` function as follows:

In [None]:
# make sure the date column is in datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df.info()

And now, extract the wanted data (month, day, etc.).

In [None]:
# create new columns for month, day, and day of week
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day
df['day_of_week'] = df['Date'].dt.day_of_week

# drop the original date column
df.drop('Date', axis=1, inplace=True)
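A small self-contained check of the same steps, on two hypothetical date strings in the dataset's day/month/year format, shows what each extracted column contains. With the explicit `format`, '01/12/2017' is read as 1 December 2017 and not as 12 January.

```python
import pandas as pd

# Toy data (hypothetical), using the same dd/mm/YYYY layout as the 'Date' column
toy = pd.DataFrame({'Date': ['01/12/2017', '25/12/2017']})
toy['Date'] = pd.to_datetime(toy['Date'], format='%d/%m/%Y')

toy['month'] = toy['Date'].dt.month
toy['day'] = toy['Date'].dt.day
# day_of_week counts from Monday=0 to Sunday=6
toy['day_of_week'] = toy['Date'].dt.day_of_week
print(toy)
```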

Let us now recheck the dataframe with the `info` method.

In [None]:
df.info()

We should now be able to apply machine learning algorithms to this dataset. We can still improve the performance of the algorithms by applying further feature engineering techniques, such as scaling and polynomial features. But first, let us see how the algorithms perform without them.

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import NuSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

def get_X_and_y(df):
    # separate the features (X) from the target (y)
    X = df.drop(['Rented Bike Count'], axis=1)
    y = df['Rented Bike Count']
    return X, y

def run(df):
    # get X and y
    X, y = get_X_and_y(df)

    # split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, shuffle=True)

    models = {
        'LinearRegression': LinearRegression(),
        'Ridge': Ridge(),
        'Lasso': Lasso(),
        'NuSVR': NuSVR(),
        'KNeighborsRegressor': KNeighborsRegressor(),
        'RandomForestRegressor': RandomForestRegressor(),
        # 'MLPRegressor': MLPRegressor(max_iter=10000)
    }
 
    fig, ax = plt.subplots(len(models), 1, figsize=(10, 40))
    scores = {}
    for idx, (name, model) in enumerate(models.items()):
        model.fit(X_train, y_train)
        # for regressors, score() returns the coefficient of determination (R^2) on the test set
        score = model.score(X_test, y_test)
        pred = model.predict(X_test)
        print(f'{name}: score = {score}')

        # plot pred vs actual
        ax[idx].plot(y_test.values, pred, c='g', marker='o', linestyle='None')
        ax[idx].plot(y_test.values, y_test.values, c='r')
        ax[idx].set_ylabel('Predicted')
        ax[idx].set_xlabel('Actual')
        ax[idx].set_title(f'{name} / Score =  {score}')   
        
        scores[name] = score
    
    return scores 

In [None]:
all_scores = pd.DataFrame()
all_scores['without scaling or poly'] = run(df)
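The value returned by the `score` method of sklearn regressors is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the target from its mean. A value of 1 is a perfect fit, and the value can be negative when the model does worse than always predicting the mean. The sketch below checks the formula by hand on synthetic data (the data and coefficients are made up for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): a noisy linear relation
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```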


## Feature scaling
Feature scaling is the process of transforming numerical features to a common scale. There are several ways to do this, like:
- Normalization -- transform each feature to a range between 0 and 1
- Standardization -- transform each feature so that it has mean 0 and standard deviation 1
- etc.

The original dataset has the following summary statistics:

In [None]:
df.describe()

Box plots also help to visualize the distribution of each feature:

In [None]:
df.drop(['Rented Bike Count'], axis=1).plot(kind='box', figsize=(20,10))

### Standardization (or Z-score normalization)

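Standardization is a technique used to rescale numerical features so that each one has mean 0 and standard deviation 1. It shifts and scales the values but does not change the shape of the distribution, so the result is not necessarily normally distributed. The idea is to subtract the mean and divide by the standard deviation. The formula is given by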
$$ X'_{ij} = \frac{X_{ij}-\mu_j}{\sigma_j}$$
where $X_{ij}$ is the observation $i$ for the feature $j$, $\mu_j$ is the mean and $\sigma_j$ is the standard deviation.


To achieve this, we can use the pandas `mean` and `std` methods or use the sklearn `StandardScaler` class.

Let us now apply the standardization technique.

In [None]:
from sklearn.preprocessing import StandardScaler

# get X and y
X, y = get_X_and_y(df)

# set and fit the scaler
standard_scaler = StandardScaler().fit(X)

# standardize the data
df_std = pd.DataFrame(standard_scaler.transform(X), columns = X.columns)
df_std.plot(kind='box', figsize=(20,5))

df_std['Rented Bike Count'] = y
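The formula can also be applied directly with pandas. One detail to keep in mind is that the pandas `std` method uses the sample standard deviation (`ddof=1`) by default, whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, on a hypothetical toy dataframe, shows that both approaches agree once `ddof=0` is used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical, only for illustration)
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 50.0]})

# Manual standardization; ddof=0 matches what StandardScaler computes
manual = (toy - toy.mean()) / toy.std(ddof=0)

# Same transformation with sklearn
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), scaled))
```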

So, let us create the models again, but now using the standardized data.

In [None]:
all_scores['with standardization'] =  run(df_std)

### MinMaxScaler

Another usual solution is to normalize the distribution by subtracting the minimum and dividing by the difference between the maximum and the minimum,

$$ X'_{ij} = \frac{X_{ij}-\min_j}{\max_j-\min_j}$$

where $X_{ij}$ is the observation $i$ for the feature $j$, $\min_j$ is the minimum and $\max_j$ is the maximum of feature $j$. Returned values are in the range $[0, 1]$.

This can be done by coding it directly or simply by using sklearn:

In [None]:
from sklearn.preprocessing import MinMaxScaler

X, y = get_X_and_y(df)

# set and fit the scaler
minmax_scaler = MinMaxScaler().fit(X)

df_minmax = pd.DataFrame(minmax_scaler.transform(X), columns = X.columns)
df_minmax.plot(kind='box', figsize=(20,5))

df_minmax['Rented Bike Count'] = y
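For comparison, the min-max formula can be written in one line of pandas. The sketch below uses a hypothetical toy dataframe and checks that the manual result matches `MinMaxScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical, only for illustration)
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 50.0]})

# Manual min-max scaling, column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Same transformation with sklearn
scaled = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), scaled))
print(manual)
```

Because the minimum and maximum define the scale, a single extreme value in a feature compresses all the other values into a narrow part of the range.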

So, let us create a model but now using the scaled data 

In [None]:
all_scores['with minmax'] = run(df_minmax)

## Polynomial features

Another approach is to create polynomial features. In this case, if the original set of features is $(x_1, x_2, \ldots, x_n)$, then the polynomial features of degree 2 are $(1, x_1, x_2, \ldots, x_n, x_1^2, x_1x_2, \ldots, x_1x_n, x_2^2, \ldots, x_2x_n, \ldots, x_n^2)$.

This can be done by coding it directly or simply by using sklearn:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

X, y = get_X_and_y(df)

# set and fit the polynomial feature transformer
poly = PolynomialFeatures(degree=2).fit(X)

df_poly = pd.DataFrame(poly.transform(X), columns = poly.get_feature_names_out(X.columns))

df_poly.plot(kind='box', figsize=(20,5))

df_poly['Rented Bike Count'] = y

df_poly

Train the models using the polynomial features.

In [None]:
all_scores['with poly'] = run(df_poly)

## Standardization + Polynomial features

Now, let us combine both standardization and polynomial features.

In [None]:
# get X and y
X, y = get_X_and_y(df)

# set and fit the scaler
standard_scaler = StandardScaler().fit(X)

# standardize the data
df_std = pd.DataFrame(standard_scaler.transform(X), columns = X.columns)

# set and fit the polynomial feature transformer
poly = PolynomialFeatures(degree=2, include_bias=False).fit(df_std)

df_std_poly = pd.DataFrame(poly.transform(df_std), columns = poly.get_feature_names_out(df_std.columns))

df_std_poly.plot(kind='box', figsize=(20,5))

df_std_poly['Rented Bike Count'] = y

And run the model

In [None]:
all_scores['with standardization and poly'] =  run(df_std_poly)
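In the cells above, the scaler and the polynomial transformer are fitted on the whole dataset before `run` splits it into train and test sets, so statistics from the test samples take part in the preprocessing. One possible way to avoid this is to chain the steps in a sklearn `Pipeline`, which fits every step on the training data only. The sketch below is an illustration on synthetic data (the data, the choice of `Ridge` and the variable names are assumptions, not part of the notebook's experiments).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (hypothetical): the target is quadratic in the features
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 3))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=200)

# Split first, so that the test set is never seen during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standardization -> polynomial features -> model, fitted on the training set only
pipe = make_pipeline(StandardScaler(),
                     PolynomialFeatures(degree=2, include_bias=False),
                     Ridge())
pipe.fit(X_train, y_train)

# The same fitted transformations are applied to the test set before scoring
print(pipe.score(X_test, y_test))
```

The pipeline object behaves like a single model, so it could be placed in a dictionary of models such as the one used in `run`.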

In [None]:
all_scores

In [None]:
all_scores.plot(figsize=(20,8))