_____

<table align="left" width=100%>
    <td>
        <div style="text-align: center;">
          <img src="./images/bar.png" alt="entidades financiadoras"/>
        </div>
    </td>
    <td>
        <p style="text-align: center; font-size:24px;"><b>Introduction to Data Science</b></p>
        <p style="text-align: center; font-size:18px;"><b>Master in Electrical and Computer Engineering</b></p>
        <p style="text-align: center; font-size:14px;"><b>Pedro Cardoso (pcardoso@ualg.pt)</b></p>
    </td>
</table>

_____

__Short Lesson Title:__ Advanced Data Transformation and Feature Engineering

*__Summary:__ This lesson delves into advanced data manipulation techniques within the realm of Exploratory Data Analysis (EDA), building upon previous lessons. It focuses on data type management, feature encoding, feature scaling, feature transformation, and feature splitting. Students will learn how to effectively manage different data types in pandas, including converting between numerical, categorical, and boolean types, and understanding the performance implications of each. The lesson covers one-hot encoding and label encoding for categorical variables, as well as standard, min-max, and robust scaling for numerical features. It also explores feature transformation techniques like log and square root transformations to address skewness. Finally, the lesson introduces feature splitting, demonstrating how to extract meaningful information from existing features. Through practical examples, students will gain skills in preparing data for advanced analysis and modeling.*

# Exploratory Data Analysis with Pandas (part 3)

Let us continue the exploratory data analysis with Pandas, again using the Titanic dataset. We start by loading the necessary libraries and the dataset, and by recalling its structure.

In [None]:
# load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_bokeh

pandas_bokeh.output_notebook()
%matplotlib inline

Load the titanic dataset.

In [None]:
df_titanic = pd.read_excel('./data/titanic/Titanic.xls')
df_titanic.head()

## Data types in Pandas

Pandas has a `dtypes` attribute that returns the data types of the columns in the data frame. Possible data types include
- `int64`, 
- `float64`, 
- `object`, 
- `bool`, 
- `datetime64`, 
- `timedelta64[ns]`, and 
- `category`. 

By default, the data types are inferred from the data. If the data type is numeric, numpy data types are used. If the data type is non-numeric, the data type is inferred as `object`. The `object` data type is used for string values and for other mixed data types. (see https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes for more details)

For example, let us check the data types of the titanic dataset using the `dtypes` attribute. 

In [None]:
df_titanic.dtypes

The same information, together with the non-null counts and the memory usage, is given by the `info` method:

In [None]:
df_titanic.info()

We can get the data type of a specific column using the `dtype` attribute.

In [None]:
df_titanic['survived'].dtype

And get all columns of a specific data type using the `select_dtypes` method.

Such as all columns of type `int64`.

In [None]:
df_titanic.select_dtypes(include=['int64'])

Or all columns of type `float64`.

As a note, `body` is considered as a float because it contains missing values, i.e., `NaN` (because `NaN` is a float, this forces an array of integers with any missing values to become floating point). 

In [None]:
df_titanic.select_dtypes(include=['float64'])
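The note above about `NaN` forcing integers to become floats can be reproduced on a small series. The following is a minimal sketch with made-up values (not taken from the Titanic dataset). It also shows the nullable `Int64` extension type of pandas, which is one possible way to keep integers alongside missing values.

```python
import pandas as pd

# made-up values, for illustration only
complete = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, None])

print(complete.dtype)       # int64
print(with_missing.dtype)   # float64, because the missing value is stored as NaN

# the nullable integer type keeps the integers and marks the gap as <NA>
nullable = with_missing.astype('Int64')
print(nullable.dtype)       # Int64
print(nullable)
```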

Or all columns of type `object`.
Object data type is used for string values and for other mixed data types.

In [None]:
df_titanic.select_dtypes(include=['object'])

Or, all numeric columns.

In [None]:
df_titanic.select_dtypes(include=['number'])

The conversion of the data types can be done using the `astype` method. For example, let us convert the `survived` feature (containing 0 and 1) to `bool`.

In [None]:
# copy the data frame
df = df_titanic.copy()

# convert the survived feature to bool
df['survived'] = df['survived'].astype(bool)

# check the data types
df.dtypes

Looking at the results, we can see that the `survived` feature is now of type `bool`, i.e., `True` and `False`.

In [None]:
df['survived']

Other transformations are natural, such as converting the `embarked` and `sex` features to `category`, as follows:

In [None]:
df['embarked_category'] = df['embarked'].astype('category')
df['sex_category'] = df['sex'].astype('category')
df.dtypes

This does not change the values of the features, only their data type: the new `embarked_category` and `sex_category` columns are of type `category`.

In [None]:
df[['embarked_category', 'sex_category']]

The `pclass` feature can also be considered as a categorical feature. Let us convert it to the `category` data type.

In [None]:
df['pclass_category'] = df['pclass'].astype('category')

### Performance of the `category` data type vs. `object` data type (optional)

Converting a feature to `category` can be useful for performance reasons, as in some cases the `category` data type is more efficient than the `object` data type. For example, let us compare the performance of the `category` and `object` data types when counting the number of unique values.

In [None]:
def timeit_cat_vs_other(df, feature):
    print()
    feature_category = f'{feature}_category'
    print(f'Time to count the number of unique values in the {feature} feature (dtype = {df[feature].dtype} vs. {df[feature_category].dtype})')
    %timeit df[feature].nunique()
    %timeit df[feature_category].nunique()

timeit_cat_vs_other(df, 'sex')
timeit_cat_vs_other(df, 'embarked')

Also in terms of memory usage, the `category` data type is more efficient than `object` data type.

In [None]:
df[['embarked', 'embarked_category', 'sex', 'sex_category', 'pclass', 'pclass_category']].memory_usage(deep=True)

Transformations of categorical columns can also have significant impact on the performance of some methods.

In [None]:
print('Time to convert the sex feature to upper case')
%timeit df['sex'].str.upper()
%timeit df['sex_category'].str.upper()

print()
print('Time to convert the embarked feature to upper case')
%timeit df['embarked'].str.upper()
%timeit df['embarked_category'].str.upper()

For other operations, the `category` data type is **not** more efficient than the `object` data type. For example, let us compare the performance of the `category`, `object`, and `int64` data types when counting the number of occurrences of each value. 

Recall that a `category` column is stored as an array of integer codes plus a table of the distinct categories, whereas an `object` column is stored as an array of pointers to the strings. Maintaining this extra structure has a cost, which for some operations can outweigh the benefit of working with integers. 

In [None]:
print('Time to count the number of occurrences of each value in the embarked feature')
%timeit df['embarked'].value_counts()
%timeit df['embarked_category'].value_counts()

print()
print('Time to count the number of occurrences of each value in the pclass (!int64!) feature')
%timeit df['pclass'].value_counts()
%timeit df['pclass_category'].value_counts()
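To make the storage difference more concrete, the sketch below inspects the two parts of a categorical column: the table of categories and the integer codes. The column is made up for illustration (three port letters repeated many times), so the exact memory figures will differ from those of the Titanic dataset.

```python
import pandas as pd

# made-up column with three distinct values, repeated many times
ports = pd.Series(['S', 'C', 'S', 'Q', 'S'] * 1000)
ports_cat = ports.astype('category')

# table of distinct categories (sorted)
print(ports_cat.cat.categories.tolist())     # ['C', 'Q', 'S']

# each value is stored as a small integer pointing into that table
print(ports_cat.cat.codes.head().tolist())   # [2, 0, 2, 1, 2]

# memory used by the string column vs. the categorical column (in bytes)
print(ports.memory_usage(deep=True), ports_cat.memory_usage(deep=True))
```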

Or when doing a group by operation.

In [None]:
print('Time to group by the embarked feature')
%timeit df.groupby('embarked').size()
%timeit df.groupby('embarked_category').size()

In [None]:
print('Time to group by pclass feature (!int64!)')
%timeit df.groupby('pclass').size()
%timeit df.groupby('pclass_category').size()

So, the conversion of the data types can have significant impact on the performance. In general, the `category` data type is more efficient than `object` data type, but not always. This must be evaluated on a case by case basis.

## Feature encoding
Feature encoding is the process of transforming the features to have a more machine learning friendly format. For example, categorical features are transformed to have integer values. This is useful for some machine learning algorithms.

### One-hot encoding

One-hot encoding replaces a categorical feature with one binary feature per category: for each observation, the feature corresponding to its category is set to 1 and all the others are set to 0.

In [None]:
df = df_titanic.copy()

encoded_data = pd.get_dummies(
    df[['embarked', 'sex']],
    dtype=int # default is bool
)

encoded_data

In [None]:
# add the encoded data to the original data frame and drop the original features
df = pd.concat([df, encoded_data], axis=1).drop(['embarked', 'sex'], axis=1)
df
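To see exactly what `pd.get_dummies` produces, here is a minimal sketch on a made-up data frame with a single column. It also illustrates the `drop_first` parameter: since the binary columns of one feature always sum to 1, one of them is redundant, and dropping it is a common choice for models that are sensitive to such redundancy (whether this is needed depends on the model).

```python
import pandas as pd

# made-up data, for illustration only
toy = pd.DataFrame({'sex': ['male', 'female', 'male']})

# one binary column per category
one_hot = pd.get_dummies(toy, dtype=int)
print(one_hot.columns.tolist())      # ['sex_female', 'sex_male']
print(one_hot['sex_male'].tolist())  # [1, 0, 1]

# drop the first category: 'sex_male' alone carries the same information
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)
print(reduced.columns.tolist())      # ['sex_male']
```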

### Label encoding
Label encoding replaces each category of a feature with an integer code. `LabelEncoder` assigns the codes following the sorted order of the classes, so the resulting integers carry no intrinsic meaning or order.

Let us encode the `embarked` feature.

In [None]:
from sklearn.preprocessing import LabelEncoder

# copy the data frame
df = df_titanic.copy()

# encode the embarked feature
le = LabelEncoder()
df['embarked'] = le.fit_transform(df['embarked'])

df

The `LabelEncoder` class has a `classes_` attribute that contains the list of classes that were encoded.

In [None]:
le.classes_

As an alternative to `LabelEncoder`, for categorical features we can use the `cat.codes` accessor. Note that, in this case, missing values are encoded as `-1`.

In [None]:
df = df_titanic.copy()
df['embarked'] = df['embarked'].astype('category')
df['embarked'] = df['embarked'].cat.codes
df['embarked']

### Ordinal encoding

Ordinal encoding is similar to label encoding, but the classes are ordered. For example, the classes `low`, `medium`, and `high` can be encoded as `0`, `1`, and `2`, respectively.

In the Titanic dataset, we can consider the `pclass` feature as an ordinal feature.   

In [None]:
df_titanic.pclass
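The `low`, `medium`, `high` example can be sketched with an ordered categorical, where we state the order of the categories explicitly instead of relying on the alphabetical order. The values below are made up for illustration, and an explicit dictionary mapping is shown as an equivalent alternative.

```python
import pandas as pd

# made-up values, for illustration only
levels = pd.Series(['low', 'high', 'medium', 'low'])

# state the order explicitly: low < medium < high
ordered = pd.Categorical(levels, categories=['low', 'medium', 'high'], ordered=True)
codes = pd.Series(ordered.codes)
print(codes.tolist())          # [0, 2, 1, 0]

# ordered categories support comparisons
print((pd.Series(ordered) > 'low').tolist())   # [False, True, True, False]

# equivalent alternative: an explicit mapping
mapped = levels.map({'low': 0, 'medium': 1, 'high': 2})
print(mapped.tolist())         # [0, 2, 1, 0]
```

Without the explicit `categories` argument the codes would follow the alphabetical order (`high`, `low`, `medium`), which does not match the intended order.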

## Feature scaling

In practical scenarios, features often have distinct ranges, magnitudes, and units. For instance, age may vary between 0 and 120, while salary can fluctuate between zero and thousands or even millions. This raises the question of how data analysts or scientists can compare such features, given that they are on different scales. It is worth noting that high-magnitude features tend to have a more significant impact on machine learning models than lower magnitude ones. Fortunately, feature scaling or normalization can help address these issues.

So, feature scaling refers to the process of bringing all features to the same magnitude level. It is not mandatory for all algorithms, but some algorithms necessitate scaled data, such as those that depend on Euclidean distance measures, like K-nearest neighbor and K-means clustering algorithms.

Scaling can also be used to obscure the original values of the data. For example, the age of a person can be scaled to the range [0, 1] by dividing by 100. This way, the age of a person is not directly available in the data.

### Standard scaling
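Standard scaling (standardization) transforms each feature to have zero mean and unit standard deviation. It is performed by subtracting the mean and dividing by the standard deviation, i.e., $$x_{scaled} = \frac{x - \mu}{\sigma}.$$ Note that this is a linear transformation: it changes the location and the scale of the feature, but not the shape of its distribution.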

In [None]:
from sklearn.preprocessing import StandardScaler

# copy the data frame
df = df_titanic.copy()

# select the features
features = ['age', 'fare']

# standard scaling
scaler = StandardScaler()
scaler.fit(df[features])


df[features] = scaler.transform(df[features])

df
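To connect the formula with the code, the following sketch applies standard scaling by hand to a small made-up array using only NumPy. It is an illustration of the formula, not a replacement for `StandardScaler`.

```python
import numpy as np

# made-up values, for illustration only
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()       # 5.0
sigma = x.std()     # 2.0 (population standard deviation, ddof=0)

x_scaled = (x - mu) / sigma
print(x_scaled.tolist())   # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# after scaling, the mean is 0 and the standard deviation is 1
print(x_scaled.mean(), x_scaled.std())
```

As a side note, `StandardScaler` uses the population standard deviation (`ddof=0`), as NumPy does by default, while the `std` method of pandas uses `ddof=1` by default. Scaling by hand with pandas therefore gives slightly different values unless `ddof=0` is passed.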

Let us see the distribution of the original data and the scaled data.


In [None]:
fig, ax = plt.subplots(2, 1, figsize=(15, 5))

df_titanic[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='original data', ax=ax[0])
df[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='standard scaled data', ax=ax[1])

plt.show()

The original and the scaled data differ in scale. The scaled data has a mean of zero and a standard deviation of one, so it is centered around zero and most values lie between -3 and 3 (outliers, such as very high fares, fall outside this range). Standard scaling does not make the data more normally distributed.

Indeed, plotting the sorted values, we can see that the shape of the data is the same; only the scale is different.

In [None]:
ax = df_titanic['age'].sort_values().reset_index().drop('index', axis=1)\
    .plot(style='o', title='original data')
    
df['age'].sort_values().reset_index().drop('index', axis=1)\
    .plot(style=".",ax=ax, secondary_y=True)
    
ax.set_ylabel('original data')
ax.right_ax.set_ylabel('standard scaled data')
ax.set_xlabel('index')
ax.set_title('original data vs. standard scaled data')
ax.legend(['original data'], loc='upper left')
ax.right_ax.legend(['standard scaled data'], loc='upper right')

plt.show()

### Min-max scaling

Min-max scaling rescales the features to a fixed interval. The min-max scaling to the [0, 1] interval is performed by subtracting the minimum and dividing by the range, i.e., $$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}.$$

To transform the data to the $[a, b]$ interval, we can use the following formula: $$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}  (b - a) + a.$$

In [None]:
from sklearn.preprocessing import MinMaxScaler

# copy the data frame
df = df_titanic.copy()

# select the features
features = ['age', 'fare']

# min-max scaling
scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])

df
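The two formulas above can be checked by hand on a few made-up values. The helper `min_max_scale` below is a hypothetical function written for this illustration; in scikit-learn, the target interval is set with the `feature_range` parameter of `MinMaxScaler`.

```python
import numpy as np

def min_max_scale(x, a=0.0, b=1.0):
    """Hypothetical helper: rescale x linearly to the interval [a, b]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (b - a) + a

# made-up values, for illustration only
values = [1, 2, 3, 5]

print(min_max_scale(values).tolist())          # [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(values, -1, 1).tolist())   # [-1.0, -0.5, 0.0, 1.0]
```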

Let us see the distribution of the original data and the scaled data.

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(15, 5))

df_titanic[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='original data', ax=ax[0])
df[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='min-max scaled data', ax=ax[1])

plt.show()

The original and the scaled data differ only in scale: the scaled data lies in the [0, 1] interval, but the shape of the distribution is unchanged.

### Robust scaling
Robust scaling scales the features in a way that is robust to outliers. The method achieves this by first removing the median and then scaling the data based on the quantile range. The default quantile range used is the Interquartile Range (IQR), although it can be customized if needed.

During the scaling process, each feature is centered and scaled independently by computing relevant statistics from the training set. Outliers can often skew the sample mean and variance in undesirable ways.

So, robust scaling is computed as follows:
 $$x_{scaled} = \frac{x - \text{median}(x)}{\text{IQR}(x)},$$
where $\text{median}(x)$ is the median of the feature $x$, and $\text{IQR}(x)$ is the interquartile range of the feature $x$.

In [None]:
from sklearn.preprocessing import RobustScaler

# copy the data frame
df = df_titanic.copy()

# select the features
features = ['age', 'fare']

# robust scaling
scaler = RobustScaler()
df[features] = scaler.fit_transform(df[features])

df
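To illustrate why the median and the IQR are preferred when outliers are present, the sketch below scales a small made-up array containing one extreme value, first with the robust formula and then with the standard one. It uses only NumPy and is meant as an illustration of the formula above.

```python
import numpy as np

# made-up values; 100 plays the role of an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)                  # 3.0
q1, q3 = np.percentile(x, [25, 75])    # 2.0 and 4.0
iqr = q3 - q1                          # 2.0

robust = (x - median) / iqr
print(robust.tolist())                 # [-1.0, -0.5, 0.0, 0.5, 48.5]

# standard scaling of the same data: the outlier inflates the mean and the
# standard deviation, so the four regular values are squeezed together
standard = (x - x.mean()) / x.std()
print(standard.round(2).tolist())
```

With robust scaling, the regular values keep a spread that does not depend on the outlier, while the outlier itself remains clearly visible.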

Let us see the distribution of the original data and the scaled data. 

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(15, 5))

df_titanic[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='original data', ax=ax[0])
df[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='robust scaled data', ax=ax[1])

plt.show()

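The original and the scaled data differ in scale. The scaled data is centered around zero (the median is mapped to zero) and the shape of the distribution is unchanged. Because the median and the IQR are barely affected by extreme values, the scaling itself is more robust to outliers than standard or min-max scaling; the outliers themselves remain in the data.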


## Feature transformation
Feature transformation is the process of applying a function to the features in order to, for example, reduce their skewness or the effect of outliers. Examples of feature transformations are the log transformation, the square root transformation, the square transformation, etc.


This kind of transformation is useful for:
- some **machine learning algorithms and for some statistical tests**. For example, the t-test assumes that the data is normally distributed. If the data is not normally distributed, then the t-test may not be valid. In this case, we can apply a transformation to the data to make it more normally distributed.
- **to reduce the effect of outliers**. For example, if we have a feature that has a few very large values, then the mean and standard deviation of the feature will be affected by these outliers. In this case, we can apply a transformation to the data to reduce the effect of these outliers.
- **to reduce the skewness of the data**. For example, if we have a feature that is right-skewed, then the mean will be larger than the median. In this case, we can apply a transformation to the data to make it more normally distributed. If the feature is right-skewed or positively skewed or grouped at lower values, then we can apply the square root, cube root, and logarithmic transformations. If the feature is left-skewed or negative skewed or grouped at higher values, then common practice is to reflect the data first (e.g., subtract each value from a constant larger than the maximum value), making it right-skewed, and then apply the same transformations (log, square root, etc.) as you would for right-skewed data.
- **to reduce the effect of heteroscedasticity**. In many datasets, the variance increases with the mean (heteroscedasticity). The log transformation helps to stabilize the variance across the range of the data, making statistical modeling more robust and reliable. This is critical for meeting the assumptions of linear models.
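The effect on skewness described in the list above can be sketched on synthetic data. The example below draws a right-skewed sample from a log-normal distribution (an arbitrary choice made for this illustration), measures the skewness with the `skew` method of pandas before and after a log transformation, and then repeats the exercise for a left-skewed feature using the reflection trick.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed data (log-normal sample, chosen for illustration)
rng = np.random.default_rng(0)
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# log(x + 1) reduces the right skew
log_transformed = np.log1p(right_skewed)
print('right-skewed:', right_skewed.skew(), '-> after log:', log_transformed.skew())

# build a left-skewed feature by mirroring the data
left_skewed = right_skewed.max() - right_skewed

# reflect it (subtract from a constant larger than the maximum) so that it
# becomes right-skewed with all values >= 1, then apply the log
reflected = (left_skewed.max() + 1) - left_skewed
reflected_log = np.log(reflected)
print('left-skewed:', left_skewed.skew(), '-> after reflect + log:', reflected_log.skew())
```

Keep in mind that, after reflecting, large values of the transformed feature correspond to small values of the original one, which matters when interpreting the results.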


### Log transformation

The log transformation is performed by taking the logarithm of the feature, i.e., $$x_{log} = \log(x).$$ Since the logarithm is not defined at zero, the code below uses $\log(x + 1)$ instead.

In [None]:
# copy the data frame
df = df_titanic.copy()

# select the features
features = ['age', 'fare']

# log transformation
df[features] = df[features].apply(lambda x: np.log(x + 1))

df

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(15, 5))

df_titanic[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='original data', ax=ax[0])
df[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='log scaled data', ax=ax[1])

plt.show()

### Square root transformation
The square root transformation is performed by taking the square root of the feature, i.e., $$x_{sqrt} = \sqrt{x}.$$

In [None]:
# copy the data frame
df = df_titanic.copy()

# select the features
features = ['age', 'fare']

# square root transformation
df[features] = df[features].apply(lambda x: np.sqrt(x))

df

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(15, 5))

df_titanic[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='original data', ax=ax[0])
df[['age', 'fare']].plot.hist(alpha=0.5, bins=20, title='square root transformed data', ax=ax[1])

plt.show()

## Discretization transformation
Discretization transformation is performed by transforming numerical features to categorical features. This is useful for some machine learning algorithms. For example, the following code transforms the age feature to a categorical feature.

To do this, we use the `pd.cut` function. The `pd.cut` function takes as input the feature to be transformed, the bins, and the labels. The bins are the intervals in which the feature will be transformed. The labels are the names of the categories.

In [None]:
# copy the data frame
df = df_titanic.copy()

# select the features
features = ['age', 'fare']

# discretization transformation
df['age category'] = pd.cut(df['age'], bins=[0, 18, 30, 65, 100], labels=['child', 'young', 'adult', 'senior'])

df

In [None]:
df[['age', 'age category']].groupby('age category', observed=True).count().plot(kind='bar')
plt.title('Age categories')
plt.xlabel('Age category')
plt.ylabel('Count')
plt.show()
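One detail of `pd.cut` that is easy to miss is how the bin edges are treated. By default the intervals are closed on the right, i.e., `(0, 18]`, `(18, 30]`, and so on, so an age of exactly 18 falls in the first bin. The sketch below uses made-up ages to show the default behavior and the effect of `right=False`.

```python
import pandas as pd

# made-up ages, some of them exactly on a bin edge
ages = pd.Series([5, 18, 25, 30, 64, 80])
bins = [0, 18, 30, 65, 100]
labels = ['child', 'young', 'adult', 'senior']

# default: intervals closed on the right, (0, 18], (18, 30], ...
right_closed = pd.cut(ages, bins=bins, labels=labels)
print(right_closed.tolist())   # ['child', 'child', 'young', 'young', 'adult', 'senior']

# right=False: intervals closed on the left, [0, 18), [18, 30), ...
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)
print(left_closed.tolist())    # ['child', 'young', 'young', 'adult', 'adult', 'senior']

# values outside all bins become missing, e.g. 0 with right-closed intervals
outside = pd.cut(pd.Series([0]), bins=bins, labels=labels)
print(outside.isna().tolist())  # [True]
```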

## Feature splitting
Feature splitting is the process of splitting a feature into multiple features. For example, it is sometimes possible to split a `name` feature into two features: `first name` and `last name`. Another example is the splitting of a `date` feature into three features: `year`, `month`, and `day`.


### Dates example

Consider the dataset with energy consumption of a house. The dataset contains the date and the energy consumption of the house. The date feature can be split into multiple features, such as year, month, day, hour, minute, and second. This can be done using the `dt` accessor.

In [None]:
df_consumption = pd.read_csv('./data/house_consumption_TS/house_consumption.csv')
df_consumption.head()

In [None]:
df_consumption.info()

In [None]:
df_consumption['date'] = pd.to_datetime(df_consumption['date'])
df_consumption.info()

In [None]:
df_consumption['year'] = df_consumption['date'].dt.year
df_consumption['month'] = df_consumption['date'].dt.month
df_consumption['day'] = df_consumption['date'].dt.day
df_consumption['hour'] = df_consumption['date'].dt.hour
df_consumption['minute'] = df_consumption['date'].dt.minute
df_consumption['second'] = df_consumption['date'].dt.second
df_consumption
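Besides the basic components, the `dt` accessor also gives access to derived information that is often useful for consumption data, such as the day of the week. The sketch below uses two hypothetical timestamps (not taken from the consumption dataset) and the column names `day_of_week`, `day_name`, and `is_weekend` are chosen for this illustration.

```python
import pandas as pd

# hypothetical timestamps, for illustration only
stamps = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-06 13:45:00', '2024-01-08 08:00:00'])
})

# day of the week as an integer: Monday = 0, ..., Sunday = 6
stamps['day_of_week'] = stamps['date'].dt.dayofweek
stamps['day_name'] = stamps['date'].dt.day_name()

# flag Saturdays and Sundays
stamps['is_weekend'] = stamps['day_of_week'] >= 5

print(stamps)
```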

### Titanic example

In the Titanic dataset, we can split the `cabin` feature into two features: `cabin number` and `cabin letter`. The latter corresponds to the deck of the Titanic. Further, some passengers have more than one cabin. In this case, we can split the `cabin` feature into multiple features, one for each cabin.

In [None]:
df_titanic

The `str.extract` method can be used to extract the deck from the `cabin` feature. It takes as input a regular expression whose capture group defines the pattern to be extracted.

See notebook [./04_d_exploratory_data_analysis.ipynb](./04_d_exploratory_data_analysis.ipynb)

For example, the following code extracts the deck from the cabin feature.

In [None]:
# copy the data frame
df = df_titanic.copy()

# select the features
features = ['cabin']

# feature splitting
# ([0-9]+): one or more digits
df['cabin number'] = df['cabin'].str.extract('([0-9]+)')

# ([A-Z]): one upper case letter
df['deck'] = df['cabin'].str.extract('([A-Z])')

# count the number of cabins, because the "cabin" column may contain multiple cabins
df['number of cabins'] = df['cabin'].str.split().str.len()

df.head(15)
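The code above counts the cabins and extracts the number and the deck of the first one. To obtain one feature per cabin, a possible approach is `str.split` with `expand=True`, sketched below on a few made-up cabin strings in the same format. The column names `cabin_1`, `cabin_2`, etc. are chosen for this illustration.

```python
import pandas as pd

# made-up cabin values in the same format as the Titanic dataset
cabins = pd.Series(['C85', 'C23 C25 C27', None, 'B57 B59'])

# one column per cabin; passengers with fewer cabins get missing values
split = cabins.str.split(expand=True)
split.columns = [f'cabin_{i + 1}' for i in range(split.shape[1])]
print(split)

# the deck of the first cabin, as a Series
deck = split['cabin_1'].str.extract(r'([A-Z])', expand=False)
print(deck.tolist())
```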

And now it can be interesting to see the distribution of the passenger class in different decks. This can be done using a pivot table.

In [None]:
passenger_class_by_deck = df[['pclass', 'deck']].pivot_table(index='deck', columns='pclass', aggfunc='size', fill_value=0)

passenger_class_by_deck

In [None]:
passenger_class_by_deck.plot(kind="bar")
plt.title('Decks')
plt.xlabel('Deck')
plt.ylabel('Count')
plt.show()

Another thing we can check is the correlation between the number of cabins and the passenger class or the fare. We can see that the number of cabins is highly correlated with the fare but not with the passenger class.

In [None]:
df[['number of cabins', 'pclass', 'fare']].corr()

We can also see the distribution of the mean fare in different passenger classes and number of cabins.

In [None]:
df.pivot_table(index='pclass', columns='number of cabins', values='fare', aggfunc='mean', fill_value=0)#.plot.bar()


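Names can also be split using regular expressions. The following examples show how different patterns extract parts of a name written in the format used in the Titanic dataset.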

In [None]:
import re
name = 'Braund, Mr. Owen Harris & Bradley, Mrs. Anna Michaela'

def print_regex_result(explanation, name, regex):
    print(re.findall(regex, name), ":", explanation)

print_regex_result('title, extracted considering the dot ', name, r'([A-Za-z]+)\.')
print_regex_result('title, extracted considering the dot ', name, r'(\w+)\.')
print_regex_result('title, extracted considering the comma', name, r',\s([A-Za-z]+)')

In [None]:
print_regex_result('last name, extracted considering the comma', name, r'([A-Za-z]+),')
print_regex_result( 'first name, extracted considering the dot', name, r'\.\s([A-Za-zç]+)')
print_regex_result('first name and posterior name, extracted considering the space between words', name, r'([A-Za-zç]+)\s([A-Za-z]+)')


In [None]:

print_regex_result('list of words', name, r'([A-Za-z]+)')
print_regex_result('list of words', name, r'(\w+)')
print_regex_result('list of names', name, r'([A-Za-z]+),\s\w+\.\s([A-Za-z]+)\s([A-Za-z]+)')

## Feature engineering

Feature engineering includes:
- the creation of new features from the existing features
- the selection of features
- the extraction of features 
- the reduction of features, and 
- the aggregation of features. 
                
We already saw how to create new features from the existing features, e.g., by extracting the deck from the cabin feature.

As another example, we can create new columns with the title and the last name of the passenger.

In [None]:
# copy the data frame
df = df_titanic.copy()

# select the features
features = ['name']

# feature engineering
df['title'] = df['name'].str.extract(r'([A-Za-z]+)\.')
df['last name'] = df['name'].str.extract(r'([A-Za-z]+),')

df
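The pattern `([A-Za-z]+),` only captures the last word before the comma, so last names made of several words are truncated. As a possible refinement, the sketch below extracts both columns in a single call using named groups, where `[^,]+` captures everything up to the comma. The second name is hypothetical and the group names `last_name` and `title` are chosen for this illustration.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'van der Berg, Mrs. Anna',   # hypothetical name with a multi-word last name
])

# named groups become the column names of the resulting data frame
pattern = r'(?P<last_name>[^,]+),\s*(?P<title>[A-Za-z]+)\.'
parts = names.str.extract(pattern)
print(parts)

# for comparison, the simpler pattern keeps only the last word before the comma
simple = names.str.extract(r'([A-Za-z]+),', expand=False)
print(simple.tolist())   # ['Braund', 'Berg']
```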

## Feature selection

### Correlation as a feature selection technique
In feature selection we select the features that are useful for the analysis. We can select the features using the correlation with the target variable. For example, we can select the features that have a correlation with the target variable greater than 0.1 to predict the survival of the passenger. 

Using features with high correlation with the target variable can help to improve the performance of the model. 

On the other hand, using features that are correlated with each other (excluding the target) might not help to improve the performance of the model. For example, if we have two features that are highly correlated, then we can select one of them and discard the other, because the two features contain nearly the same information. In this case, we can use the `drop` method to drop one of the features.

The use of correlation as a feature selection technique is not always the best approach. For example, the Pearson correlation only measures linear relationships, so a feature with a strong non-linear relation to the target can still have a correlation close to zero.

In [None]:
# select the numerical features
df = df_titanic.select_dtypes(include=np.number)

# compute the correlation with the target variable
df.corr()['survived'].abs().sort_values(ascending=False)
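The two ideas discussed above (keep the features correlated with the target, then discard one feature of each highly correlated pair) can be sketched as follows. The data frame is made up for illustration, the threshold of 0.1 follows the text, and the threshold of 0.9 for "highly correlated" features is an assumption of this example.

```python
import numpy as np
import pandas as pd

# made-up data: 'a_double' duplicates the information in 'a',
# and 'unrelated' has no linear relation with the target
toy = pd.DataFrame({
    'a': np.arange(10),
    'a_double': 2 * np.arange(10),
    'unrelated': [1, -1, -1, 1, 0, 1, -1, -1, 1, 0],
    'target': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
features = toy.drop(columns='target')

# step 1: keep the features whose absolute correlation with the target is > 0.1
target_corr = features.corrwith(toy['target']).abs()
relevant = target_corr[target_corr > 0.1].index.tolist()

# step 2: among those, find pairs with absolute correlation > 0.9;
# the upper triangle avoids looking at each pair twice
corr_matrix = features[relevant].corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]

selected = [col for col in relevant if col not in redundant]
print(relevant, redundant, selected)
```

Here `a` and `a_double` both pass the first step, and `a_double` is then discarded because it is (perfectly) correlated with `a`.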

### Feature selection using the variance
We can also select the features using the variance. For example, we can select the features that have a variance greater than 1 to predict the survival of the passenger.

Features with high variance may help to improve the performance of the model, because a feature that hardly varies carries little information to distinguish the observations.

Note that, due to the magnitude of the values, the variance of the features is not always a good measure of the information contained in the feature. In that case, we can use other methods such as feature importance, which is a model-based ranking associated with models such as Random Forest, XGBoost, etc. Principal Component Analysis (PCA), a dimensionality reduction technique, can also be used to reduce the number of features in the dataset. 

In [None]:
# select the numerical features
df = df_titanic.select_dtypes(include=np.number)

# compute the variance
df.var().sort_values(ascending=False)

Note that the variance is magnitude dependent. This means that the variance of a feature is affected by the scale of the feature. For example, if we have a feature that has a range of [0, 1], then the variance will be small. If we have a feature that has a range of [0, 100], then the variance will be large.

In [None]:
# select the features with variance greater than 1
df = df.loc[:, df.var() > 1]

df
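The dependence of the variance on the magnitude of the values can be illustrated by expressing the same made-up feature in two different units. Both columns below contain exactly the same information, yet only one of them would pass a threshold of 1. Scaling the features first (here with min-max scaling, done by hand) is one possible way to make the variances comparable.

```python
import pandas as pd

# the same made-up heights, in meters and in centimeters
heights = pd.DataFrame({'height_m': [1.5, 1.6, 1.7, 1.8, 1.9]})
heights['height_cm'] = heights['height_m'] * 100

# the variance in centimeters is 100**2 times the variance in meters
variances = heights.var()
print(variances)

# after min-max scaling, both columns have the same variance
scaled = (heights - heights.min()) / (heights.max() - heights.min())
scaled_variances = scaled.var()
print(scaled_variances)
```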

# References

- Campesato, O. (2018). Regular expressions: Pocket primer. Mercury Learning and Information.
- https://www.kaggle.com/learn/pandas
- Navlani, A., Fandango, A., Idris, I. (2021). Python Data Analysis: Perform data collection, data processing, wrangling, visualization, and model building using Python. Packt. 3rd Edition.
- Brandt, S. (2014). Data Analysis: Statistical and Computational Methods for Scientists and Engineers. Springer. 4th Edition.
- https://eugenelohh.medium.com/data-analysis-on-the-titanic-dataset-using-python-7593633135f2