<table style="background-color:#F5F5F5;" width="100%">
<tr><td style="background-color:#F5F5F5;"><img src="../images/logo.png" width="150" align='right'/></td></tr>     <tr><td>
            <h2><center>Aprendizagem Automática em Engenharia Biomédica</center></h2>
            <h3><center>1st Semester - 2025/2026</center></h3>
            <h4><center>Universidade Nova de Lisboa - Faculdade de Ciências e Tecnologia</center></h4>
</td></tr>
    <tr><td><h2><b><center>Lab 2 - Data Preparation</center></b></h2>
    <h4><i><b><center>Loading, Visualizing, Describing, Split and Encoding</center></b></i></h4>
    <h5><b>Version Control:</b></h5>
    <h5>Created by: Hugo Gamboa to AAEB</h5>
    <h5>Modified by: Pedro Vieira to AAEF (25/26)</h5>
</td></tr>
</table>

## 1. Introduction to Data Preparation for Machine Learning
Data preparation is a crucial step in the machine learning pipeline. Raw data often contains inconsistencies, missing values, and outliers, which can negatively impact the performance of machine learning models. To build robust and reliable models, it's essential to clean and transform the data so that it is suitable for analysis.

In today's class, we will discuss the importance of data preparation and introduce pandas, a powerful Python library used for data manipulation and analysis. We will also cover the common steps involved in preparing data for machine learning, such as handling missing values, normalizing data, and encoding categorical variables.

### 1.1 Why is Data Preparation Important?
Machine learning algorithms rely on high-quality data to make accurate predictions. __If the data is noisy, incomplete, or poorly formatted, the model's performance can degrade significantly__. Some reasons why data preparation is important include:

* __Improving Model Accuracy__: Clean and well-structured data allows machine learning models to generalize better and make more accurate predictions.
* __Handling Missing or Inconsistent Data__: Missing data and inconsistencies can skew results and lead to poor model performance. Proper handling ensures data integrity.
* __Feature Engineering__: Creating new features or transforming existing ones helps the model better understand the underlying patterns in the data.
* __Reducing Model Complexity__: Removing irrelevant or redundant data simplifies the model, reducing the risk of overfitting and improving interpretability.

### 1.2 Pandas: A Python Module for Data Analysis and Data Manipulation 
Pandas is an open-source Python library that provides fast, flexible, and expressive data structures designed to make data manipulation and analysis easy. It is widely used in data science and machine learning due to its rich set of features for handling data, such as:

* __Data Reading and Writing__: Reading and writing data from various file formats (CSV, Excel, SQL, etc.)
* __Data Cleaning__: Handling missing data, outliers, duplicates, etc.
* __Data Transformation__: Applying mathematical functions, reshaping data, etc.
* __Data Visualization__: Integrating well with libraries like Matplotlib and Seaborn for plotting

## 2. Getting Started with Pandas

### 2.1 Installing and Importing Pandas
Before we can use pandas, we need to make sure it is installed. Depending on your setup, there are two common ways to install it: using __pip__ or __Anaconda__.

_Note_: __This step is only needed if you haven't installed pandas yet!__

__Installing with pip__

If you are working in a standard Python environment (e.g., in a Jupyter notebook or directly from Python), you can install pandas using __pip__:

In [None]:
# YOU ONLY NEED TO RUN THIS LINE OF CODE IF YOU HAVEN'T INSTALLED PANDAS YET
# you can uncomment the line below and run it if you want to install pandas using pip directly within the notebook
# !pip install pandas

__Installing with Anaconda__

If you are using __Anaconda__, you will need to run the following command in the __Anaconda Prompt__ (__not in the Jupyter notebook__):

1. Open the __Anaconda Prompt__ from your start menu.
2. Run the following command:

_conda install pandas_

Alternatively, you can use __Anaconda Navigator to install pandas via the graphical interface__.

Once pandas is installed (either with pip or conda), you can import it into your notebook as shown below.
We will also import some other packages that we need for today. If you haven't installed these packages yet, you can use the same approach described above.

In [None]:
import pandas as pd  # data science
import numpy as np # mathematics
import matplotlib.pyplot as plt # plotting

### 2.2 Basic Data Structures in Pandas

Pandas provides two primary data structures for working with data: [__pandas.Series__](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) and [__pandas.DataFrame__](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

These structures are highly efficient for manipulating and analyzing large datasets, allowing for operations like filtering, grouping, and merging with minimal code.

#### 2.2.1 Series

A __pandas.Series__ is a one-dimensional labeled array that can hold any data type (e.g., integers, floats, strings, or even Python objects). It is similar to a list or array, but with an associated index, which makes it easier to access specific elements based on a label. Think of a Series as a single column of data.

Here is how you can create a simple pandas.Series:

In [None]:
# define data
data = [10, 20, 30, 40, 50]

# pass the data into a pandas.Series object
s = pd.Series(data)

# Displaying the Series
print(s)

In the example above, the numbers __0 to 4 represent the index of each element__, and the values are the data stored in the Series.
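
The default index is a simple running number, but a Series can also carry custom labels. The following is a minimal sketch of that idea; the patient identifiers and heart-rate values are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# hypothetical heart-rate readings, labeled by an invented patient id
heart_rate = pd.Series([72, 85, 64], index=["p01", "p02", "p03"], name="heart_rate")

# access by label
print(heart_rate["p02"])

# access by position (independent of the labels)
print(heart_rate.iloc[0])

# the labels themselves
print(heart_rate.index.tolist())
```

Label-based access keeps working even if the order of the elements changes, which is one reason the index is useful.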

#### 2.2.2 DataFrame

A __pandas.DataFrame__ is a two-dimensional, tabular data structure that contains rows and columns, similar to a spreadsheet or SQL table. Each column in a DataFrame is a Series, and the rows can be indexed by labels. DataFrames are flexible and can store data of different types (integers, floats, strings) across various columns. A DataFrame can be created from dictionaries, lists, or by reading in a file (e.g., a CSV file).

Here’s how you can create a pandas.DataFrame with three columns:

In [None]:
# define data as dictionary where each key-value pair represents a column
# the key of the dictionary will be used as the column name.
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

# pass the dictionary to the DataFrame constructor
df = pd.DataFrame(data)

# a Jupyter Notebook automatically displays the value of the last expression in a cell.
# Thus we can display the DataFrame in a nice format by just writing the variable name
df

In the example above, the numbers __0 to 3 represent the index of each row__, and the values are the data stored in the DataFrame.

#### 2.2.3 Accessing Data in a DataFrame
Once you have created a DataFrame, you can access the data using column labels or row indices. Here are a few examples:

__1. Accessing a single column__:

_Note_: When accessing a column of a DataFrame a pandas.Series is returned. The index of this series is the index of the DataFrame.

In [None]:
# to access a column of a DataFrame you use the same notation as for a dictionary. The column name functions as the key. Be careful with typos!
# a column is a pandas.Series
names_series = df['Name']

# print the column
print("The names_series contains: \n{}".format(names_series))

# print the type
print("\nThe type of names_series is: {}".format(type(names_series)))

__2. Accessing multiple columns__:

_Note_: When accessing multiple columns of a DataFrame, a pandas.DataFrame is returned. The index of the DataFrame remains unchanged.

In [None]:
# Accessing the 'Name' and 'City' columns
# This will return a DataFrame object
sub_df = df[['Name', 'City']]

# printing the type of sub_df
print("The type of the sub_df is: {}".format(type(sub_df)))

# printing the sub_df using jupyter notebook automatic print
sub_df

__3. Accessing a row__:

_Note_: When accessing a row of a DataFrame, a pandas.Series is returned. __The index of this Series consists of the column names__.

In [None]:
# Accessing the first row
# This will return a pandas.Series
first_row = df.iloc[0]

print("The first row of the DataFrame is: \n{}".format(first_row))

# print the type
print("\nThe type of first_row is: {}".format(type(first_row)))
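
Rows can be selected by position with iloc, as above, or by index label with loc. The sketch below contrasts the two on a small DataFrame; the row labels "a", "b", "c" are an assumption made for illustration, since the example DataFrame above uses the default integer index.

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["a", "b", "c"],  # hypothetical row labels
)

# second row, selected by its position
row_by_position = people.iloc[1]

# the same row, selected by its index label
row_by_label = people.loc["b"]

# a single value: row label first, then column name
age_of_charlie = people.loc["c", "Age"]

print(row_by_position.equals(row_by_label))
print(age_of_charlie)
```

With the default integer index both notations look alike, which is why it helps to keep in mind that iloc counts positions while loc looks up labels.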

## 3. Loading, Visualizing and Grouping

With these basics covered, you’re ready to start using pandas for more advanced data preparation tasks.

Before we start looking into biomedical datasets, we will first use a simpler dataset to become more familiar with pandas. We will have a look at the __Titanic dataset__.

The dataset contains the following data:

1. __PassengerId__: A unique identifier for each passenger.
2. __Survived__: Indicates whether the passenger survived (1) or not (0).
3. __Pclass__: The passenger's ticket class (1st, 2nd, or 3rd class), which is a proxy for socio-economic status.
4. __Name__: The full name of the passenger.
5. __Sex__: The gender of the passenger (male or female).
6. __Age__: The age of the passenger. Some values are missing in this column.
7. __SibSp__: The number of siblings or spouses the passenger had aboard the Titanic.
8. __Parch__: The number of parents or children the passenger had aboard the Titanic.
9. __Ticket__: The ticket number of the passenger.
10. __Fare__: The fare paid by the passenger for the journey.
11. __Cabin__: The cabin number where the passenger stayed. This column has many missing values.
12. __Embarked__: The port where the passenger boarded the ship. It can take three values: C (Cherbourg), Q (Queenstown), or S (Southampton).

The Titanic dataset is relatively simple and helps beginners understand key data manipulation techniques without being too overwhelming. It includes:

* __Categorical data__: e.g., Sex, Embarked
* __Numerical data__: e.g., Age, Fare
* __Missing values__: which are common in real-world datasets
* __Label data__: e.g., Survived, useful for machine learning tasks like classification

By exploring this dataset, you will learn how to load data into pandas, inspect it, and clean it, which are crucial first steps before applying machine learning models.

### 3.1 Loading Data 

Pandas provides multiple ways of loading data files, specialized in different formats.
* [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
* [read_excel()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
* [read_pickle()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html)
* [read_json()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)
* [read_html()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html)

For CSV files, read_csv() returns a pandas.DataFrame, even when the file contains only a single column. The Titanic dataset is therefore loaded into a pandas.DataFrame.

The dataset is stored in the __"Data" folder__ of your project and is stored in __.csv__ format. Thus, we will use the __read_csv()__ function.

In [None]:
# load csv file into a pandas.DataFrame
df_titanic = pd.read_csv("../Data/Lab2_titanic.csv")

# printing the type of the entire DataFrame
print("The type of a DataFrame is: {}".format(type(df_titanic)))

# show the first 5 rows of the DataFrame
df_titanic.head(5)

### 3.2. Getting to know the Data

Once the data is loaded, it's important to explore it to understand the structure and contents. You can use a few pandas functions to get a quick overview of the dataset. The full pandas function library is quite extensive. You can check the [documentation](https://pandas.pydata.org/docs/) if you want.

Pandas DataFrame and Series objects include a set of useful __attributes__ and __functions__, which we can use to explore the data.

#### 3.2.1 Some Attributes

__1. Getting the columns of the DataFrame__:

In [None]:
# Get the columns of the DataFrame
df_columns = df_titanic.columns

# print the columns
print("The column names of the DataFrame are: \n{}".format(df_columns))

__2. Getting the types of each column in the dataset__:

In [None]:
# Get the data types of each column
df_titanic.dtypes

__3. Getting the shape of the DataFrame__:

In [None]:
# get the shape of a DataFrame
df_shape = df_titanic.shape

# print the shape
print('DataFrame shape: {} | number of rows: {} | number of columns: {}'.format(df_shape, df_shape[0], df_shape[1]))

# alternatively you can get the number of rows by applying python's len() function to the DataFrame
print('DataFrame number of rows: {} '.format(len(df_titanic)))

#### 3.2.2 Some Functions

__1. Displaying the first N rows of a DataFrame using the function:__
* [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)

In [None]:
# print the first 5 rows of a DataFrame
df_titanic.head(5)

__2. Setting the index of the DataFrame with a column of the DataFrame using the function:__
* [set_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html)

In [None]:
# Set a column as the index of the DataFrame (you just need to pass the name of the column. Be careful with typos!)
df_titanic = df_titanic.set_index('PassengerId')

# print the first 10 rows of the DataFrame
df_titanic.head(10)

__3. Get the number of unique values and the corresponding unique values of a column using the functions:__
* [nunique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html)
* [unique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html)

In [None]:
# get the number of unique values of column 'Sex'
print('Number of unique values of \'Sex\' column: {}'.format(df_titanic['Sex'].nunique()))

# get corresponding unique values
print("Corresponding unique values of \'Sex\' column: {}".format(df_titanic['Sex'].unique()))

__4. Get an overview of the dataset, including the data types and any missing values using the function:__
* [info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)

_Note_: you can see that there are missing values for the columns 'Age', 'Cabin', and 'Embarked' as they do not reach the full number of rows (891).

In [None]:
# get the information of the DataFrame in a neat print format.
df_titanic.info()

__5. Get the overall statistics (e.g., mean, min, max, etc.) using the function:__
* [describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)

_Note_: using describe() you can easily get an overall understanding of your data. For example, you can see that the youngest passenger was less than 1 year old, while the oldest was 80 years old.

In [None]:
# get the general statistics of the DataFrame (for each numerical column)
df_titanic.describe()

__6. Getting the number of counts for each unique value inside a column using the function:__
* [value_counts()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html)

_Note_: In the example below we use value_counts() on the 'Sex' column. As the column has two unique values (male and female) the value_counts() function will count the number of male and female passengers in the dataset.

In [None]:
# get the number of counts for each unique element inside a column
# value_counts() sums up the number of values for each distinct entry of a column
# the function returns a pandas.Series object
df_titanic['Sex'].value_counts()
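
Absolute counts are sometimes less informative than proportions. As a small sketch on made-up values (not the Titanic data), value_counts() accepts normalize=True to return relative frequencies instead:

```python
import pandas as pd

# toy column with invented entries
sex = pd.Series(["male", "female", "male", "male"])

# absolute counts per unique value
counts = sex.value_counts()

# relative frequencies (each count divided by the total number of entries)
shares = sex.value_counts(normalize=True)

print(counts)
print(shares)
```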

### 3.3. Visualization

Data visualization is an essential part of data analysis, as it allows you to understand your dataset more intuitively. With pandas, you can quickly create visualizations to get a deeper sense of your data before further processing or modeling. For this purpose, the pandas library includes some visualization tools. All pandas.Series and pandas.DataFrame objects have a [plot() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) that you can use to visualize the data.

In this section, we will cover how to:

* Visualize distributions and relationships in the dataset
* Plot basic charts such as histograms, bar plots, and scatter plots

There are plenty of other plots that are not discussed in this notebook. You can explore the entire plot library by checking out the documentation.

_Note_: For plotting you need to import _matplotlib.pyplot_, which we did above in the import section

#### 3.3.1 Visualizing Distributions

Visualizing the distribution of numerical data is often the first step in understanding the spread and central tendency of a variable. A common way to visualize distributions is with a histogram. A histogram shows the distribution of a numerical variable by dividing it into "bins" and counting how many observations fall into each bin. You can plot a histogram by using:
* [plot(kind='hist',...)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html)

In [None]:
# Plotting a histogram of the 'Age' column
df_titanic['Age'].plot(kind='hist', bins=20, title='Age Distribution')

# adding labels
plt.ylabel('Counts')
plt.xlabel('Age')
plt.show()

#### 3.3.2 Bar Plots for Categorical Data
For categorical data, such as gender or passenger class, a bar plot is often more appropriate. Bar plots visualize the count of occurrences for each category. You can generate bar plots by using:

* [plot(kind='bar',...)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html)

In [None]:
# plotting the number of counts within the Pclass column

# get the value counts for the three classes
passenger_class_counts = df_titanic['Pclass'].value_counts()

# plot the data 
passenger_class_counts.plot(kind='bar', title='Class Count')


# add axis labels
plt.xlabel('Pclass')
plt.ylabel('No. Passengers')
plt.show()

_Note_: The code above orders the bars according to their number of occurrences (from highest to lowest). If you want the bars ordered by class, you need to sort the index of the Series we obtained above.

In [None]:
# to order the bars by class, sort the index of the Series we obtained above
p_class_counts_sorted = passenger_class_counts.sort_index()

# the Series is already sorted, so it can be plotted directly
p_class_counts_sorted.plot(kind='bar', title='Class Count')
plt.xlabel('Pclass')
plt.ylabel('No. Passengers')
plt.show()

### 3.4. Grouping Variables

The Pandas [groupby()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function is a powerful tool for grouping data by a specific column and performing aggregate calculations, such as computing the mean, sum, or count of a column for each group. This is particularly useful when you want to break down the data into subsets to observe patterns or trends within categories.

#### 3.4.1 Grouping Data by a Single Column

The most common use of  [groupby()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) is to group data by a single column and then perform aggregation. For example, you might want to know the average fare paid by passengers in each class. You can do this by applying the following function to the _groupby object_:

* [mean()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)

In [None]:
# Grouping by 'Pclass' and calculating the average fare
avg_fare_by_class = df_titanic.groupby('Pclass')['Fare'].mean()

# print the result
print(avg_fare_by_class)

#### 3.4.2 Grouping by Multiple Columns

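You can also group data by more than one column by passing a list of column names to groupby(). The example below is a simpler first step: it selects the 'Pclass' and 'Survived' columns, groups them by 'Pclass', and calculates the survival rate for each ticket class.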

In [None]:
# group the sub DataFrame by the passenger ticket class and calculate the mean
# as the 'Survived' column is binary encoded, calculating the mean over it gives you the fraction of survivors (survival rate)
survival_rate_by_pclass = df_titanic[['Pclass', 'Survived']].groupby('Pclass').mean()

# printing the result
print(survival_rate_by_pclass)

You can also plot these results if you want:

In [None]:
# plot the result
survival_rate_by_pclass.plot(kind='bar', title='Survival rate by Passenger Ticket Class')

# add labels
plt.xlabel("Passenger Ticket Class")
plt.ylabel("Survival Rate")
plt.show()
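
Grouping by several columns at once works by passing a list of column names to groupby(). The following is a minimal sketch on a small invented table that only reuses the Titanic column names; the values are not taken from the dataset.

```python
import pandas as pd

# toy data with the same column names as the Titanic dataset (values are made up)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "female", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 0, 0],
})

# one group per (Pclass, Sex) combination; the mean of a 0/1 column is a rate
survival_by_class_and_sex = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

print(survival_by_class_and_sex)
```

The result is a Series with a two-level index, so a single group can be read with a tuple such as `survival_by_class_and_sex.loc[(3, "female")]`.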

#### 3.4.3 Applying Multiple Aggregate Functions
Sometimes, you might want to apply multiple aggregate functions to a grouped dataset. For instance, you could calculate both the mean and the standard deviation of passenger fares for each class. This returns both the average fare and the standard deviation of fares for each class.

You can do this by applying the following function to the _groupby object_:
* [agg()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html)

In [None]:
# Applying multiple aggregate functions
fare_stats = df_titanic.groupby('Pclass')['Fare'].agg(['mean', 'std'])
print(fare_stats)

## 4. Logical Operations on DataFrames

Logical operations in pandas allow you to filter your DataFrame based on conditions. This is a powerful tool for data analysis, as it enables you to focus on specific rows of data that meet certain criteria. You can think of it as asking questions about your data, like "Which passengers paid more than $100 for their fare?" or "Which patients have an abnormal heart rate?"

### 4.1 Filtering Data Using a Single Condition
You can filter rows of a DataFrame by specifying a condition. The available __comparison operators__ include:

* <b>></b>: greater than
* <b><</b>: less than
* <b>>=</b>: greater than or equal to
* <b><=</b>: less than or equal to
* <b>==</b>: equal to
* <b>!=</b>: not equal to

You can apply these operators to any column to filter data.

For example, to find all passengers who are older than 50 years:

In [None]:
# Filtering rows where 'Age' is greater than 50. This will return a DataFrame 
df_older_passengers = df_titanic[df_titanic['Age'] > 50]

# printing the first 5 rows of the DataFrame (You can see that the age column only contains values above 50)
df_older_passengers.head(5)

### 4.2 Filter Data Using Multiple Conditions
For combining multiple conditions you can use the following logical operators:

* __AND (&)__: Both conditions must be true.
* __OR (|)__: At least one condition must be true.
* __NOT (~)__: Negates the condition.

For example, to find passengers who were male __and__ paid more than $100 for their ticket:

In [None]:
# Filtering rows where 'Sex' is 'male' and 'Fare' is greater than 100
df_male_and_high_fare = df_titanic[(df_titanic['Sex'] == 'male') & (df_titanic['Fare'] > 100)]

# print the number of passengers that were male and paid more than $100 for their ticket
print("Nr. of passengers that were male and paid more than $100: {}".format(df_male_and_high_fare.shape[0]))
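
The OR and NOT operators work the same way as AND. The sketch below uses a small invented table (values are not from the dataset) and stores each condition in its own variable first, which avoids nested parentheses. When conditions are written inline, each comparison needs its own parentheses, because in Python `&` and `|` bind more tightly than comparison operators such as `==` or `>`.

```python
import pandas as pd

# toy data reusing Titanic column names (values are made up)
toy = pd.DataFrame({
    "Sex":  ["male", "female", "male", "female"],
    "Fare": [120.0, 80.0, 15.0, 200.0],
})

# boolean masks, one value per row
is_male = toy["Sex"] == "male"
high_fare = toy["Fare"] > 100

# OR: rows where at least one condition holds
male_or_high_fare = toy[is_male | high_fare]

# NOT: rows where the condition does not hold
not_male = toy[~is_male]

print(male_or_high_fare)
print(not_male)
```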

### 4.3 Using isin() for Filtering with Lists of Values
The [isin()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) function allows you to filter rows based on whether a value in a column belongs to a list of values. For example, you might want to find all passengers who embarked from either "C" or "Q":

In [None]:
# Filtering rows where 'Embarked' is either 'C' or 'Q'
df_embarked_c_or_q = df_titanic[df_titanic['Embarked'].isin(['C', 'Q'])]

# print the result
print("Nr. of people that embarked in Cherbourg or Queenstown: {}".format(df_embarked_c_or_q.shape[0]))

## 5. Handling Missing Data

One of the most common challenges in data preparation is dealing with missing values. __Missing data can occur for many reasons such as data entry errors, lost records, or simply because some information was unavailable__. Properly handling missing data is crucial because many machine learning algorithms cannot handle them directly.

### 5.1 Identifying Missing Data
To begin, it’s important to know where the missing values are in the dataset. pandas provides several functions to help with this.

__1. Checking for missing values using the function:__
* [isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html)

In the example below we can see that the first, third, and fifth values of the 'Cabin' column are missing, as the returned value is _True_.

_Note_: The term used for missing (or non-defined) data in programming is __null__. The isnull() function returns the boolean _True_ for missing values and _False_ for non-missing values.

In [None]:
# Checking for missing values
df_titanic.isnull().head(5)

__2. Summing missing values using the function:__
* [sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html)

From the output below you can see again that the columns 'Age', 'Cabin', and 'Embarked' have missing values.

_Note_: Using sum on booleans returns the number of _True_ values.

In [None]:
# Counting missing values in each column
# this will return a pandas.Series whose index consists of the column names of the DataFrame
df_titanic.isnull().sum()
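
Since summing booleans counts the _True_ values, taking their mean gives the fraction of missing values per column, which is often easier to judge than the raw count. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# toy data with some missing entries (values are made up)
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 8.05, 53.10],
})

# fraction of missing values per column (mean of the True/False mask)
missing_fraction = toy.isnull().mean()

print(missing_fraction)
```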

### 5.2 What to do with Missing Data

Once we’ve identified missing values, we need to decide how to handle them. The two main approaches are:

1. __Dropping missing data__: Removing rows or columns with missing values
2. __Imputing missing data__: Filling in missing values

#### 5.2.1 Dropping Missing Data

If a column or row has too many missing values, or the missing values are not essential, you might want to drop them from the dataset entirely. You can do this using the dropna() function.

__1. Dropping rows with missing values, where at least one column in the row contains a missing value using:__
* [dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

_Note_: In cases where a certain column contains a lot of missing values using dropna() along the rows will result in a highly reduced dataset. As you can see below before dropping we had 891 rows of data and after dropping we only have 183. In these cases it might make more sense to remove the column with the missing data instead of dropping the rows.

In [None]:
# printing the number of rows before dropping rows with missing values
print("Number of rows before dropping rows containing missing values: {}".format(df_titanic.shape[0]))

# Dropping rows that have any missing values
df_titanic_dropped_rows = df_titanic.dropna()

# printing the number of rows after dropping rows with missing values
print("Number of rows after dropping rows containing missing values: {}".format(df_titanic_dropped_rows.shape[0]))

__2. Dropping columns with missing values:__

* [dropna(axis=1)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

_Note_: to drop columns you need to pass 1 to the _axis_ parameter of the dropna() function. The value 1 refers to the axis (dimension) of the DataFrame. In line with programming conventions, we start counting at zero, meaning that the __rows are dimension 0 while the columns are dimension 1__.

In [None]:
# printing the number of columns before dropping columns containing missing values
print("Number of columns before dropping columns containing missing values: {}".format(df_titanic.shape[1]))

# Dropping columns that have missing values
titanic_cleaned_cols = df_titanic.dropna(axis=1)

# printing the number of columns after dropping columns containing missing values
print("Number of columns after dropping columns containing missing values: {}".format(titanic_cleaned_cols.shape[1]))
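
Dropping every row or every column with a missing value is often too coarse. The sketch below shows two more targeted options on an invented table: removing one mostly empty column before dropping rows, and restricting the row check to selected columns with the subset parameter of dropna(). The data is made up for illustration.

```python
import numpy as np
import pandas as pd

# toy data: 'Cabin' is mostly missing, 'Age' has a single gap (values are made up)
toy = pd.DataFrame({
    "Age":   [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.92, 53.10],
})

# dropping all incomplete rows leaves only the single complete row
rows_all = toy.dropna()

# removing the sparse column first keeps far more rows
without_cabin = toy.drop(columns=["Cabin"]).dropna()

# alternatively, only require 'Age' to be present and keep 'Cabin' as it is
rows_with_age = toy.dropna(subset=["Age"])

print(len(rows_all), len(without_cabin), len(rows_with_age))
```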

#### 5.2.2 Imputing Missing Data
Instead of dropping rows or columns, a better strategy is often to fill in missing values with a placeholder or a calculated value. This is called imputation. 

> However, __imputation should always be done with great care and only for variables where it actually makes sense as imputation can introduce bias into your dataset if not done carefully, especially if a large proportion of values are missing__.

> Furthermore, __to ensure that no data leakage can occur, data imputation should only be performed AFTER the data has been split into training and test sets__. See sections 10-11 for more information.
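
To make the leakage point concrete, here is a minimal sketch, assuming a toy 'Age' column and a fixed split by row position (in practice the split would be random). The imputation value is computed from the training rows only and then reused for the test rows, so no information from the test set enters the statistic.

```python
import numpy as np
import pandas as pd

# toy data (values are made up)
toy = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 40.0, np.nan, 60.0]})

# fixed split for illustration: first four rows for training, the rest for testing
train = toy.iloc[:4]
test = toy.iloc[4:]

# the median is computed on the training set only
train_median = train["Age"].median()

# the same training statistic fills the gaps in both sets
train_filled = train.fillna({"Age": train_median})
test_filled = test.fillna({"Age": train_median})

print(train_median)
print(test_filled)
```

Computing the median on the full table instead would let the test value 60 influence the fill value, which is exactly the kind of leakage the note above warns about.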

__1. Filling missing values with a constant: You can replace missing values with a specific constant (e.g., 0, or "Unknown" for categorical variables) using:__

* [fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)

_Note_: In the case of the titanic dataset it is not really useful to replace the age with 0 as done below. The code sample below just presents how to perform this type of imputation.

In [None]:
# Filling missing 'Age' values with 0
df_titanic_filled_zero = df_titanic.fillna({'Age': 0})

__2. Filling missing values with the mean/median/mode: For numerical columns, in specific cases it might be useful to replace missing values with the mean or median of the column using:__
* [fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)

_Note_: In the case of the titanic dataset it is not really useful to replace the age with the median as done below. The code sample below just presents how to perform this type of imputation.

In [None]:
# Filling missing 'Age' values with the median
median_age = df_titanic['Age'].median()
# passing a dictionary fills only the 'Age' column and returns a DataFrame
df_titanic_filled_median = df_titanic.fillna({'Age': median_age})
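
The mean and median only apply to numerical columns. For a categorical column, the mode (most frequent value) plays the same role. The sketch below uses an invented 'Embarked' column; note that mode() returns a Series, because several values can be tied for most frequent, so the first entry is taken here.

```python
import numpy as np
import pandas as pd

# toy categorical column with one missing entry (values are made up)
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q"]})

# mode() ignores missing values by default and returns a Series
most_common_port = toy["Embarked"].mode().iloc[0]

# fill the gap with the most frequent category
toy_filled = toy.fillna({"Embarked": most_common_port})

print(most_common_port)
print(toy_filled)
```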

## 6. Data Cleaning
Once you've dealt with missing data, the next step in preparing your data for analysis or machine learning is __data cleaning__. Data cleaning involves correcting or removing data that is incorrect, out-of-date, redundant, or formatted inconsistently. Clean data ensures that models can learn effectively from the data without being misled by errors or noise.

### 6.1 Removing Duplicates

Duplicate rows can distort your analysis and cause bias in machine learning models. Duplicates can occur due to errors in data entry, repeated records, or other issues. pandas provides an easy way to identify and remove duplicates.

_Note_: The titanic dataset does not have any duplicated rows. The code below just shows how to search for and delete duplicated rows.

__1. Finding duplicate rows by using the function:__
* [duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html)

In [None]:
# Checking for duplicate rows
duplicates = df_titanic.duplicated()

# print the sum of the boolean mask. This equals the number of duplicated rows
print(duplicates.sum())

__2. Removing duplicate rows by using the function:__
* [drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)

In [None]:
# Dropping duplicate rows
titanic_no_duplicates = df_titanic.drop_duplicates()
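
Because the Titanic dataset contains no duplicated rows, the effect of these two functions is easier to see on a small invented table in which one row is repeated:

```python
import pandas as pd

# toy data: the third row repeats the first one (values are made up)
toy = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Dana"],
    "Age":  [25, 30, 25, 40],
})

# True marks a row that is identical to an earlier row
dup_mask = toy.duplicated()

# keeps the first occurrence of each row and drops the repeats
deduped = toy.drop_duplicates()

print(dup_mask.tolist())
print(deduped)
```

By default the first occurrence is kept, and the remaining rows retain their original index labels.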

### 6.2 Fixing Inconsistent Data Formats
Data can be entered in various formats, especially when working with categorical or textual data. For instance, one column might contain values like "Male" and "male," which pandas would treat as different categories. It's essential to standardize these formats to avoid errors in analysis.

__1. Standardizing text data: Convert text data to a consistent format (e.g., lowercase) to avoid inconsistencies by using the accessor:__
* [str](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html)

_Note_: The .str accessor provides the usual string methods (e.g., lower(), split(), replace(), etc.) and applies them to every element of the Series.

In [None]:
# Converting 'Sex' column to lowercase
df_titanic['Sex'] = df_titanic['Sex'].str.lower()

__2. Fixing categorical data inconsistencies: If you notice inconsistencies in categorical variables (e.g., both "M" and "male" are used to represent the same category), you can use the following function for standardization__:
* [replace()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html)

In [None]:
# Replacing 'M' with 'male' and 'F' with 'female'
df_titanic['Sex'] = df_titanic['Sex'].replace({'M': 'male', 'F': 'female'})
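
The 'Sex' column of the Titanic dataset is already consistent, so the two steps above do not change it. As a minimal sketch of how they work together, the example below uses a hypothetical messy column (these values are invented for illustration) and additionally strips surrounding whitespace with `str.strip()`.

```python
import pandas as pd

# Hypothetical messy column; these values do not come from the Titanic dataset
sex_raw = pd.Series(["Male", " male", "M", "F", "female ", "FEMALE"])

# 1) remove surrounding whitespace and convert everything to lowercase
sex_clean = sex_raw.str.strip().str.lower()

# 2) map the remaining abbreviations onto the full category names
#    (the keys are lowercase because step 1 already lowercased the values)
sex_clean = sex_clean.replace({"m": "male", "f": "female"})

print(sex_clean.unique())
```

The order of the steps matters: after lowercasing, the abbreviations are 'm' and 'f', so the replacement dictionary has to use lowercase keys.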

## 7. Feature Engineering

Feature Engineering is the process of transforming raw data into features that better represent the underlying patterns and relationships for machine learning models. The quality and quantity of the features you provide to a model can significantly affect its performance.

Models often benefit from new inferred features, created from the original data. Pandas allows you to apply mathematical operations to its objects (pandas.Series and pandas.DataFrame).

In this section, we will cover:

* Why feature engineering is important
* Creating new features
* Transforming existing features

### 7.1 Why Feature Engineering is Important
__Raw data often contains noise, irrelevant information, or incomplete variables that don't fully capture the relationships needed for machine learning__. Feature engineering helps create more useful features that can make patterns more visible and machine learning models more effective.

In the context of biomedical data, well-engineered features can represent important medical concepts like patient risk scores, treatment histories, or vitals trends, which can directly improve prediction accuracy.

### 7.2 Creating New Features
Creating new features from existing data can reveal important relationships that weren’t obvious in the raw data.

For example, let's get the number of family members each passenger had by adding the following two columns:
* __SibSp__: The number of siblings or spouses the passenger had aboard the Titanic.
* __Parch__: The number of parents or children the passenger had aboard the Titanic.

As the variable __SibSp__ encodes the number of siblings and spouses, and the variable __Parch__ encodes the number of parents and children, we can infer the size of the family.

We can add a new column to the DataFrame by just defining a new column name and assigning it a value. Be aware that the newly assigned values must have the same number of rows as the original DataFrame.

In [None]:
# adding a new column to the DataFrame by mathematically adding two columns
df_titanic['FamilySize'] = df_titanic["SibSp"] + df_titanic["Parch"]

### 7.3 Transforming Existing Features
Feature transformations can often improve the predictive power of a dataset by converting non-linear relationships into linear ones or scaling features to a common range.

__1. Log transformations: Taking the logarithm of a feature can help normalize data and reduce the influence of outliers__.

For example, we can log-transform the "Fare" column to deal with large outliers.

In [None]:
# log1p is used to avoid log(0)
df_titanic['LogFare'] = np.log1p(df_titanic['Fare'])
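
To make the effect of the transformation more concrete, here is a minimal sketch on a handful of hypothetical fare values (invented for illustration). It shows that `np.log1p` computes log(1 + x), so a fare of 0 is mapped to 0, and that `np.expm1` can be used to get the original values back.

```python
import numpy as np

# Hypothetical fares, including a free ticket (0) and one large outlier
fares = np.array([0.0, 7.25, 8.05, 26.0, 512.33])

# log1p computes log(1 + x), so a fare of 0 is mapped to 0 instead of -inf
log_fares = np.log1p(fares)

# expm1 computes exp(x) - 1 and therefore inverts log1p
restored = np.expm1(log_fares)

print(log_fares.round(2))
print(np.allclose(restored, fares))
```

After the transformation the outlier is still the largest value, but it is much closer to the rest of the data.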

__2. Binning numerical variables: Sometimes, converting continuous variables into categories or bins can make relationships clearer. For example, we could group passengers into age ranges. You can do this by using the function__:

* [cut()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html)

_Note_: The example below categorizes the data into right-inclusive bins. The bins and the corresponding categories are the following:
* (0, 12]: Values greater than 0 and up to and including 12 are categorized as 'Child'.
* (12, 18]: Values greater than 12 and up to and including 18 are categorized as 'Teen'.
* (18, 40]: Values greater than 18 and up to and including 40 are categorized as 'Adult'.
* (40, 60]: Values greater than 40 and up to and including 60 are categorized as 'Middle-Aged'.
* (60, 100]: Values greater than 60 and up to and including 100 are categorized as 'Senior'.

In [None]:
# Binning 'Age' into categories
df_titanic['AgeGroup'] = pd.cut(df_titanic['Age'], bins=[0, 12, 18, 40, 60, 100], labels=['Child', 'Teen', 'Adult', 'Middle-Aged', 'Senior'])

# printing the first 10 rows of the DataFrame. Take a look at the new columns 'FamilySize','LogFare', and 'AgeGroup'
df_titanic.head(10)
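
The behaviour at the bin edges can be checked on a few hand-picked values. The sketch below uses hypothetical ages (not taken from the dataset) with the same bins and labels as above. Values that fall outside all bins, such as exactly 0, and missing values are returned as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0.5, 12, 12.5, 18, 40, 61, 100, 0, np.nan])

age_groups = pd.cut(
    ages,
    bins=[0, 12, 18, 40, 60, 100],
    labels=["Child", "Teen", "Adult", "Middle-Aged", "Senior"],
)

# show the ages next to the category they were assigned to
print(pd.DataFrame({"Age": ages, "AgeGroup": age_groups}))
```

If the lowest edge should be part of the first bin, `pd.cut()` offers the parameter `include_lowest=True`.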

## 8. Pandas Profiling

[Pandas Profiling](https://pandas-profiling.ydata.ai/docs/master/index.html) (now distributed under the name `ydata-profiling`) is a Python library that generates insightful reports on pandas DataFrames.

Note: You (may) need to install it!

In [None]:
# uncomment this line to install the package through pip
# !pip install ydata-profiling

# alternatively you can also use the conda prompt as described above

In [None]:
#from ydata_profiling import ProfileReport

In [None]:
#profile = ProfileReport(df_titanic, title="Titanic Profiling", vars={"num": {"low_categorical_threshold": 0}})

In [None]:
#profile

In [None]:
#profile.to_file("Data/titanic_dataset.html")

## 9. Exercises on Pandas Functionalities

In the following exercises we will explore a dataset commonly used for binary classification tasks in materials science. The dataset contains measurements collected from a set of 699 material samples, each representing a different specimen analyzed in a laboratory environment. Every sample is described by 10 quantitative attributes, each representing a physical property obtained through standard characterization techniques such as microscopy, spectroscopy, or optical measurements.

These attributes include parameters such as density, grain size uniformity, crystal symmetry, impurity count, optical scattering, and defect density. All features are encoded as integer values ranging from 1 to 10, representing normalized or discretized measurements.

The goal of the classification task is to determine the material class of each sample. The dataset includes two categories, representing two distinct types of materials with different structural or optical characteristics.

The dataset has the following columns:

1. __Sample_ID__: A unique ID for each instance.
2. __Density__: mass/volume.
3. __Grain Size Uniformity__: Measured with SEM (Scanning Electron Microscopy).
4. __Crystal Symmetry Index__: Measured with X-ray diffraction.
5. __Surface Adhesion Coefficient__: Measured with contact angle measurements.
6. __Average Grain Size__: Measured with SEM.
7. __Impurity Count__: Mass spectrometry (this feature may have missing values).
8. __Optical Scattering Index__: Measured with laser scattering experiments using an integrating sphere.
9. __Reflectivity__: Measured with a spectrophotometer.
10. __Defect Count__: Measured with SEM.
11. __Material Class__: (2) low quality material; (4) high quality material

__Exercise 1__:
* Load the Lab2_Material_Dataset.csv from UCI Repository
* Display the first 10 rows of the Dataset. 

__Hints__:
* Hint: use pandas [pd.read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) 

__Exercise 2__: As you can see, the dataset comes without any column names. Therefore you need to set the column names to the names defined above.

* Store the columns defined above into a list of strings.
* Override the columns of the DataFrame using the list.
* Display the first 5 rows of the DataFrame to ensure that the column names were actually updated.

__Hints__:
* Hint: The __.columns__ attribute can also be used to set new column names.

__Exercise 3__: There are some rows that are duplicated in the DataFrame, meaning that the entire row of these instances is exactly the same. Having multiple instances of the same data may introduce bias when training a model. Therefore, these need to be removed. But before:

* Find how many duplicated rows exist in the DataFrame.

__Hints__:
* Hint 1: pandas.Series and pandas.DataFrame objects have a [sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) function that you can use to get the number of duplicates.

__Exercise 4__: As you now know that there are duplicate rows in the DataFrame, you need to remove these lines.

* Print the number of rows that the DataFrame has at the moment.
* Drop duplicate lines in the DataFrame.
* Print the number of lines that the DataFrame has after dropping the duplicates.

__Exercise 5__: Even though we removed duplicate lines, there are still some instances where the __Sample ID__ column has the same number, but the rest of the data (the other columns) is different. These instances can be considered __noisy data__, because they point to the same sample but have different values for this sample. __Noisy data__ should be removed, as keeping it can lead to sub-optimal training of a model.

* Check if Sample ID is actually unique

__Hints__: 
* Hint 1: You can do that by looking at the __number of unique__ values inside the 'Sample ID' column and compare that to the total number of samples in the DataFrame. 

__Exercise 6__: Now that you know that there are instances where the Sample_ID is duplicated, you need to find out which of the rows have the same Sample_ID. For that you need to get the index of these rows.

* Get the indices of the rows where 'Sample ID' column has duplicated values

__Hints__:
* Check the [DataFrame.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. It can be used with a certain parameter so that it only looks for duplicates in a certain column.
* The returned boolean pandas.Series can be used to retrieve only the rows containing duplicates from the DataFrame. 
* In the introduction section there is an example on how to get the __index__ of a DataFrame.

__Exercise 7__: Now you need to remove those rows containing the __noisy data__

* Print the number of rows that the DataFrame has at the moment.
* Drop rows containing the duplicated data.
* Print the number of rows after dropping the rows to ensure you actually dropped the rows.

__Hints__:
* Hint 1: You can either use the indices you obtained by passing them to the [drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function
* Hint 2: Or you can use [drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) and use a particular parameter of the function.

__Exercise 8__: Set Sample ID as dataframe index

__Exercise 9__: For some reason the __Impurity Count__ column is read as an _object_ even though it should contain only integer values. The reason is that the researcher compiling the dataset used a specific value to mark missing data.

* Check column types using `df.info()` and confirm that the type for __Impurity Count__ is actually _object_.
* Find out which value the researcher used for missing data.

__Hints__: 
* There is a code example provided above that might help you find this out

__Exercise 10__: Now that you know which is the value the researcher used for missing values you can clean your data by removing rows that contain the missing value in the __Impurity Count__ column. 
* Remove Sample IDs with missing values
* Convert Impurity Count to Type int
* Get the info() of the DataFrame after converting the column to int and check if all types are now _int64_ or _int32_.

__Hints__:
* Hint 1: This can be solved using a logical operation
* Hint 2: Check the function [astype()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) for converting the column to int.

## 10. Features, Target Class and Splitting data

Before building a machine learning model, it’s important to split your dataset into separate training and testing sets. This allows you to evaluate how well your model performs on new, unseen data. But before we split our data we need to define the following:


1. The features/variables we are going to pass to the model as input
2. The target class we want to predict using our model (i.e., the output of the model)

For this section we will use the [scikit-learn library](https://scikit-learn.org/stable/index.html). It provides functions for all the steps we need to perform.

_Note_: If you haven't installed scikit-learn yet, you can install it using pip or conda as described at the beginning of this notebook.

In [None]:
# importing the necessary functions from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

import sklearn
print(sklearn.__version__)

### 10.1 Understanding Features and the Target Class
In a machine learning model, we work with two key components: __features__ (also called predictors or independent variables) and the __target class__ (also called the dependent variable or label). Understanding the distinction between these is crucial for framing and solving a predictive task.

#### 10.1.1 What are Features?
Features are the input variables or attributes that describe the data. These can be numerical values (e.g., age, salary) or categorical values (e.g., gender, city). In the context of machine learning, the model uses these features to learn patterns and make predictions.

For example, in the Titanic dataset, some common features we might use to predict whether a passenger survived or not include:

* __Pclass__: The class of the ticket (1st, 2nd, or 3rd class)
* __Age__: The age of the passenger
* __Fare__: The amount paid for the ticket
* __Sex__: Gender of the passenger (which we'll encode numerically)
* __Embarked__: The port from which the passenger boarded the Titanic (which we'll also encode numerically)


#### 10.1.2 What is the Target Class?
The target class (or simply target) is the variable we aim to predict based on the features. In classification tasks, the target is often a categorical value representing different classes. In regression tasks, the target is typically a continuous numerical value.

For the Titanic dataset, the target class is:

* __Survived__: A binary variable indicating whether a passenger survived (1) or not (0).


#### 10.1.3 Defining Features and the Target in pandas
When preparing data for a machine learning model, we separate the features and the target. Using pandas, this is done by creating two variables:

* __X__: Contains the features (predictor variables).
* __y__: Contains the target class (the value we want to predict).

In [None]:
# dropping rows where column 'Age' is NaN
df_titanic = df_titanic.dropna(subset=['Age'])

# dropping rows where 'Embarked' is NaN
df_titanic = df_titanic.dropna(subset=['Embarked'])

# Define the features (X) and the target (y)
X_titanic = df_titanic[['Pclass', 'Age', 'Fare', 'Sex', 'Embarked']]  # Add other features as needed 
y_titanic = df_titanic['Survived']

### 10.2 Splitting the Data into Different Sets
When building machine learning models, it's essential to split your dataset into separate parts to ensure you can evaluate your model's performance properly. The most common splits are:

* __Training Set__: The portion of the data the model learns from.
* __Validation Set__: A smaller part of the training data used for tuning model parameters and preventing overfitting.
* __Testing Set__: The final portion of data used to evaluate the model's performance on unseen data.

<div>
<img src="../images/train_test.png" width="300"/>
</div>

Depending on the use case and the characteristics of the dataset, the data scientist (aka you!) must decide which approach to use to ensure the most reliable results.

#### 10.2.1 Train-Test Split

In simpler workflows, data is often split into just training and testing sets. This ensures the model is evaluated on data it has not seen during training. Typical proportions of the train and test sets are:
* __Training Set (70 - 90 %)__: Used to fit the model.
* __Testing Set (10 - 30 %)__: Used to evaluate the final model performance.

#### 10.2.2 Train-Validation-Test Split
If you also need to fine-tune hyperparameters, you can use the following split instead:

* __Training Set (e.g., 60% of the data)__: Used to fit the model.
* __Validation Set (e.g., 20% of the data)__: Used for tuning hyperparameters and selecting the best model. Evaluating the model on this set during development helps to detect overfitting.
* __Testing Set (e.g., 20% of the data)__: Used to evaluate the final model performance after tuning is complete.

This ensures that you have a fair estimate of how the model performs on new data without overfitting or bias from the validation step.
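
A 60/20/20 split like the one described above can be obtained by splitting twice. The sketch below is only one possible way to do it. It uses scikit-learn's `train_test_split()` on hypothetical toy arrays, and the variable names and the `random_state` value are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, binary target
X = np.arange(300).reshape(100, 3)
y = np.array([0, 1] * 50)

# Step 1: hold out 20% of the data as the test set
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into training and validation sets.
# 25% of the remaining 80% corresponds to 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The second `test_size` is relative to the data that is left after the first split, which is why 0.25 is used instead of 0.2.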

### 10.3 A Common Pitfall: Data Leakage
When splitting our data into their respective sets we need to be careful not to cause __data leakage__. Data leakage occurs when information from outside the training dataset is used to train the model, leading to overly optimistic performance estimates. This can happen when the model is inadvertently trained on data or information that it should not have access to, thereby invalidating the model's ability to generalize to unseen data. Data leakage can occur in several ways:

1. __Data splitting__: If the data is not split properly, for example in such a way that some test data leaks into the training set, it can result in a model that seems to perform well during validation but fails in practical applications.

2. __Data scaling__: When scaling the data (see section 11) before splitting it into the respective sets, the calculated scaling parameters contain information from the test data, thus giving the model access to information it shouldn't have.

3. __Data encoding__: When encoding data (e.g., categorical data to numerical data, see section 10.5), the encoder should be fit on the training set only and then applied to the test set. While the potential for data leakage with respect to encoding is in most cases low, it can still occur when the encoding is more complex (e.g., encoding of variables in timeseries forecasting tasks).

4. __Data imputation__: Data imputation is mostly done using information from the data points contained in the dataset (e.g., replacing missing data with the median, mean, etc.). In the same vein as scaling, imputation should be done after splitting the data.

>__Generally speaking, it is always a good practice to view the training, validation, and testing datasets as separate datasets. All transformations should always be fit on the training set, and then the parameters of the fit should be used to transform the other sets (validation and test)__. 
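
As a minimal sketch of this principle, the example below imputes missing values with the median. It uses scikit-learn's `SimpleImputer` on hypothetical toy arrays (neither the imputer nor the data are used elsewhere in this notebook). The median is learned from the training set only and then reused for the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single feature with missing values (np.nan)
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")

# fit on the training set only: the median is computed from 1.0, 2.0 and 3.0
X_train_imputed = imputer.fit_transform(X_train)

# reuse the learned median for the test set;
# the test value 100.0 has no influence on the imputed value
X_test_imputed = imputer.transform(X_test)

print(imputer.statistics_)
print(X_test_imputed.ravel())
```

Had the imputer been fit on all six values together, the large test value would have shifted the median that is also used for the training data, which is exactly the kind of leakage described above.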

### 10.4 How to Split Data in scikit-learn
For splitting the data into train and test sets we can use scikit-learn. Luckily, scikit-learn performs the data splitting for us in such a way that data leakage related to data splitting does not occur. For this the following function can be used:
* [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

_Note_: In the example below we use an 80/20 split for train/test

_Note_: The train_test_split() function returns 2 pairs of variables (4 in total), the X and y for training and the X and y for testing.

In [None]:
# split the data into training and testing sets: 80% train, 20% test (it is only necessary to define the test size)
X_train_titanic, X_test_titanic, y_train_titanic, y_test_titanic = train_test_split(X_titanic, y_titanic, test_size=0.2)
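
The call above produces a different random split every time it is executed. The sketch below shows two optional parameters of `train_test_split()` that are often useful: `random_state` makes the split reproducible and `stratify` keeps the class proportions the same in both sets. The toy data and the chosen values are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy target: 80 samples of class 0 and 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# random_state fixes the shuffling; stratify=y keeps the class ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# a second call with the same random_state returns exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# fraction of class 1 in the training and test targets
print(y_tr.mean(), y_te.mean())
print(np.array_equal(X_te, X_te2))
```

Whether stratification is appropriate depends on the task. It is mainly helpful for classification problems with imbalanced classes.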

### 10.5 Encoding of Categorical Variables
Encoding of categorical variables is a part of __Feature Engineering__. Machine learning models often require categorical variables to be represented numerically.

>__To avoid data leakage during the encoding process we fit the encoder on the training set and then use the same encoder to transform the validation and test sets__.

In scikit-learn all objects that apply some kind of transformation (e.g., encoding, scaling, imputation, etc.) to the data have convenience functions that help us avoid data leakage. These are:

* __fit()__: for fitting the object (e.g., encoder, scaler, imputer, etc.) to the data. __This function should only be used on the training set__.
* __fit_transform()__: for fitting the object to the data and transforming the data in a single function call. __This function should only be used on the training set__.
* __transform()__: for transforming data using the parameters that were obtained during the fitting process. __This function should be used on the validation and test sets, or on the training set if only the fit() function was used before__.
* __inverse_transform()__: for performing the inverse transformation, i.e., obtaining the original data. This function (if needed) can be used on all datasets (training, validation, and test).
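
The functions listed above can be sketched on a small example. The snippet uses scikit-learn's `LabelEncoder` on hypothetical toy labels (the lists are invented for illustration) and calls `fit()` and `transform()` separately to make the individual steps visible.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels
train_labels = ["S", "C", "Q", "S"]
test_labels = ["Q", "S"]

encoder = LabelEncoder()

# fit(): learn the categories from the training data only
encoder.fit(train_labels)

# transform(): apply the learned mapping to the training and to the test data
train_encoded = encoder.transform(train_labels)
test_encoded = encoder.transform(test_labels)

# the learned categories are stored in sorted order;
# the position of a category in this list is its integer code
print(list(encoder.classes_))
print(train_encoded, test_encoded)

# inverse_transform(): recover the original labels from the integer codes
print(list(encoder.inverse_transform(test_encoded)))
```

Calling `fit_transform()` on the training labels would give the same result as calling `fit()` followed by `transform()`.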

The scikit-learn package offers a variety of different encoder types. We are going to have a look at the two most commonly used.

#### 10.5.1 Label Encoding
This replaces each category with a unique integer. For example, in the Titanic dataset, the column 'Sex' needs to be encoded as integers. The LabelEncoder assigns the integers in sorted order of the categories, so the result is 'female': 0 and 'male': 1.

<div>
<img src="../images/ordinal.png" width="400"/>
</div>

To perform this encoding we will use scikit-learn's:
* [LabelEncoder()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

In [None]:
# initialize the LabelEncoder
l_encoder = LabelEncoder()

# get the "Sex" column before encoding 
# (this line is not needed it is just to show you the difference before and after encoding at the end of this code block)
orig_sex_col = X_train_titanic['Sex']

# fit the encoder using the training data and transform the 'Sex' column (here we are directly overriding the column with the encoded values)
X_train_titanic['Sex'] = l_encoder.fit_transform(X_train_titanic['Sex'])

# transform the test data using the label encoder that was fit on the training data
X_test_titanic['Sex'] = l_encoder.transform(X_test_titanic['Sex'])

# displaying difference before and after
print("first 5 values of \'Sex\' column before encoding {}".format(orig_sex_col.head(5)))
print("first 5 values of \'Sex\' column after encoding {}".format(X_train_titanic['Sex'].head(5)))

#### 10.5.2 One-Hot Encoding
One-hot encoding is a process used to convert categorical variables into a format that can be provided to machine learning algorithms. Instead of assigning a single number to each category, one-hot encoding creates new binary columns—one for each unique category. Each column contains a 1 if the category is present and a 0 otherwise. This avoids giving numerical meaning or rank to the categories, which could mislead machine learning models.

<div>
<img src="../images/one_hot.png" width="500"/>
</div>

For example, in the Titanic dataset, the column 'Embarked' needs to be one-hot encoded. This will result in three new columns 'Embarked_C', 'Embarked_Q', 'Embarked_S'.

To perform this encoding we will use scikit-learn's:
* [OneHotEncoder()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

_Note_: When we use one-hot encoding to convert categorical variables into a numerical format, each unique category is represented by a separate binary column. For example, the 'Embarked' column in the Titanic dataset has three categories: 'C' (Cherbourg), 'Q' (Queenstown), and 'S' (Southampton).

With one-hot encoding, we create new columns for each category, where each column contains a 1 if the category is present and a 0 if it is not. However, to avoid a problem called the __dummy variable trap__, we often choose to drop one of these columns. The dummy variable trap occurs when the model receives redundant information from all the encoded columns, which can lead to multicollinearity and inflated variance in our model’s estimates.

For simplicity, __we do not consider the dummy variable trap in the code below__.

In [None]:
# Initialize the OneHotEncoder without dropping the first category
# oh_encoder = OneHotEncoder(sparse=False)  # use this line instead if your scikit-learn version is < 1.2
oh_encoder = OneHotEncoder(sparse_output=False)  # use this line if your scikit-learn version is >= 1.2

# fit the encoder on the training data and transform it (this will return a numpy.array)
embarked_encoded_train = oh_encoder.fit_transform(X_train_titanic[['Embarked']])

# transform the test data using the encoder that was fit on the training data
embarked_encoded_test = oh_encoder.transform(X_test_titanic[['Embarked']])

# convert the result into a DataFrame for easier handling
df_encoded_train = pd.DataFrame(embarked_encoded_train, columns=oh_encoder.get_feature_names_out(['Embarked']))
df_encoded_test = pd.DataFrame(embarked_encoded_test, columns=oh_encoder.get_feature_names_out(['Embarked']))

# concatenate the new one-hot encoded columns to the original DataFrames
# (the index is reset so that the rows are aligned by position)
X_train_titanic = pd.concat([X_train_titanic.reset_index(drop=True), df_encoded_train], axis=1)
X_test_titanic = pd.concat([X_test_titanic.reset_index(drop=True), df_encoded_test], axis=1)

# display the first 5 rows of the training data
# (This line is not needed, it is just to show the difference before and after encoding)
print(X_train_titanic[['Embarked', 'Embarked_C', 'Embarked_Q', 'Embarked_S']].head(5))

# drop the original 'Embarked' column as it is not needed anymore
X_train_titanic = X_train_titanic.drop(columns='Embarked')
X_test_titanic = X_test_titanic.drop(columns='Embarked')

In [None]:
# showing the first 10 rows of the DataFrame containing the training data after finishing encoding
X_train_titanic.head(10)

### 10.6. Exercises

__Exercise 1__:  Select the X and y data for the Material Dataset



__Exercise 2__: Divide the Material dataset into training and testing set using a 70/30 train/test split.

__Exercise 3__: Encode the target variable of the Material Dataset. 

As the Material Dataset uses the following values:
* __low quality__: 2
* __high quality__: 4

A __label encoding__ of the data is necessary.

## 11. Scaling

__Feature scaling__ is crucial in machine learning because it prevents any single feature with a larger range of values from dominating the learning process. Distance-based algorithms, such as K-Nearest Neighbors (KNN) and Support Vector Machines (SVM), rely on the distance between data points. If features are on different scales, the distances are distorted, which can lead to poor performance. For models trained with gradient descent (e.g., Deep Learning models), scaling also speeds up the convergence of the optimization, making training more efficient. Overall, feature scaling often improves model accuracy and training stability.

Scikit-learn also provides the functionality for scaling data. The two main scalers are listed below (other scalers exist, but they are not discussed in this notebook):

* [MinMaxScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
* [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

When scaling data for machine learning, it's crucial to apply the transformation to the train and test sets carefully to avoid introducing data leakage. The key principle is that the test set should represent unseen data, simulating how the model will perform in real-world scenarios. In real-world scenarios we do not know the statistics of new data, thus we have to assume that the distribution of the training data is representative enough to reflect any new unseen data.

>__Therefore, in most cases, the scaling parameters (e.g., mean and standard deviation for StandardScaler, or min and max for MinMaxScaler) must be computed from the training set and then applied to both the training and test sets. If you scale the data before splitting (i.e. the entire dataset together), this can lead to data leakage, where information from the test set influences the model and results in overly optimistic performance metrics.__

Fortunately, the MinMaxScaler() and StandardScaler() handle most of this for us.

_Note 1_: Applying a scaler to a pandas.DataFrame will return a numpy.array object. In case you want a pandas.DataFrame object again, you have to convert the output back to a pandas.DataFrame. This, however, is not strictly necessary. Either way, the code snippets below show how you can obtain a pandas.DataFrame again after applying the scaler.

_Note 2_: The code snippets below do not check whether or not it makes sense to use Min-Max or Standard Scaling on the whole dataset. They just show how you can use the scalers to scale your data.

### 11.1. Min-Max Scaling

MinMax scaling is a normalization technique that transforms features to a specific range, typically between 0 and 1. It rescales each feature based on the formula:

$$X_{min-max} = {X - X_{min} \over X_{max} - X_{min}}$$

where $X_{min}$ and $X_{max}$ are the minimum and maximum values of the feature, respectively. This ensures all features have the same scale, preserving the relationships between values while eliminating distortions caused by differing ranges.
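
The formula can be checked by hand. The sketch below applies it with NumPy to a hypothetical single feature (values invented for illustration) and compares the result with scikit-learn's `MinMaxScaler`. The minimum and maximum are taken from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature
X_train = np.array([[10.0], [20.0], [30.0], [50.0]])
X_test = np.array([[5.0], [40.0], [60.0]])

# manual computation using the minimum and maximum of the training set
x_min, x_max = X_train.min(), X_train.max()
manual_train = (X_train - x_min) / (x_max - x_min)
manual_test = (X_test - x_min) / (x_max - x_min)

# the same computation with scikit-learn
scaler = MinMaxScaler()
sk_train = scaler.fit_transform(X_train)
sk_test = scaler.transform(X_test)

# both approaches give the same result
print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))
print(sk_test.ravel())
```

Test values that lie outside the range seen during training end up outside the interval [0, 1]. This is expected, because the scaling parameters come from the training set.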

#### 11.1.1 When to use MinMax scaling:
* For distance-based algorithms like K-Nearest Neighbors (KNN), or for Neural Networks.
* When features are not normally distributed.
* In scenarios where feature ranges are bounded, such as pixel intensities in image processing.

#### 11.1.2 Downsides of MinMax scaling:
* Sensitive to outliers, as they can significantly affect the scaling range.
* Does not change the shape of the data distribution, which may be a drawback for features with skewed distributions.

In [None]:
# importing scaler
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [None]:
# defining min max scaler and its range
min_max_scaler = MinMaxScaler(feature_range=(0, 1))

# fit the scaler to train set and scale the train data (this will return a numpy.array)
X_train_titanic_minmax = min_max_scaler.fit_transform(X_train_titanic)

# scale the test data using the scaler that was fitted on the training data
X_test_titanic_minmax = min_max_scaler.transform(X_test_titanic)

# print the first 5 rows of the numpy.array containing the training data
print("The first 5 rows of the X_train_titanic_minmax numpy.array are: \n{}".format((X_train_titanic_minmax[:5, :])))

# converting the training and test data back to a pandas.DataFrame
df_X_train_titanic_minmax = pd.DataFrame(X_train_titanic_minmax, columns=X_train_titanic.columns)
df_X_test_titanic_minmax = pd.DataFrame(X_test_titanic_minmax, columns=X_test_titanic.columns)

# print the first 5 rows of the DataFrame containing the training data
df_X_train_titanic_minmax.head(5)

_Note_: As you can see in the output above, applying the MinMaxScaler to the entire DataFrame does not make a lot of sense as the ordinal information of the 'Pclass' column gets lost. Instead it would make more sense to apply the MinMaxScaler only to the columns where scaling is appropriate.

### 11.2. Standard Scaling

Standard scaling is a technique that centers features by subtracting the mean and scales them by dividing by the standard deviation. The formula is:

$$X_{standard} = {X - \mu \over \sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature. This process results in a distribution with a mean of 0 and a standard deviation of 1, making the features comparable in terms of variance.
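
As a quick check of the formula, the sketch below computes the standardization by hand with NumPy and compares it with scikit-learn's `StandardScaler`. The toy values are hypothetical. The mean and the standard deviation are computed from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[3.0], [6.0]])

# manual computation using the mean and standard deviation of the training set
# (np.std uses ddof=0 by default, i.e. the population standard deviation)
mu = X_train.mean()
sigma = X_train.std()
manual_train = (X_train - mu) / sigma
manual_test = (X_test - mu) / sigma

# the same computation with scikit-learn
scaler = StandardScaler()
sk_train = scaler.fit_transform(X_train)
sk_test = scaler.transform(X_test)

# both approaches give the same result
print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))

# the scaled training data has mean 0 and standard deviation 1
print(sk_train.mean().round(6), sk_train.std().round(6))
```

Only the training data is guaranteed to have a mean of 0 and a standard deviation of 1 after scaling. The test data is transformed with the training parameters and will usually deviate slightly from these values.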

#### 11.2.1 When to use Standard scaling:
* For algorithms that assume features are normally distributed, such as Linear Regression, Logistic Regression, and Support Vector Machines (SVM).
* When features have different units or large variance differences.
* In cases where preserving outlier relationships is important, as standard scaling does not compress the range like MinMax scaling.

#### 11.2.2 Downsides of Standard scaling:
* May not perform well if the data is not normally distributed.
* Outliers can still affect the scaling, as they contribute to the calculation of the mean and standard deviation.

In [None]:
# define scaler
std_scaler = StandardScaler()

# fit the scaler to train set and scale the train data (this will return a numpy.array)
X_train_titanic_std = std_scaler.fit_transform(X_train_titanic)

# scale the test data using the scaler that was fitted on the training data
X_test_titanic_std = std_scaler.transform(X_test_titanic)

# print the first 5 rows of the numpy.array containing the training data
print("The first 5 rows of the X_train_titanic_std numpy.array are: \n{}".format((X_train_titanic_std[:5, :])))

# converting the training and test data back to a pandas.DataFrame
df_X_train_titanic_std = pd.DataFrame(X_train_titanic_std, columns=X_train_titanic.columns)
df_X_test_titanic_std = pd.DataFrame(X_test_titanic_std, columns=X_test_titanic.columns)

# print the first 5 rows of the DataFrame containing the training data
df_X_train_titanic_std.head(5)

_Note_: As you can see in the output above, applying the StandardScaler to the entire DataFrame does not make a lot of sense as the encodings we generated before for the 'Sex' and 'Embarked' columns are lost, as well as the ordinal information contained in 'Pclass'. Instead it would make more sense to apply the StandardScaler only to the columns where scaling is appropriate.
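
One possible way to scale only some of the columns is sketched below. The DataFrames are hypothetical stand-ins for the encoded Titanic features, and treating 'Age' and 'Fare' as the only columns to scale is an assumption made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames mimicking some of the encoded Titanic features
X_train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex":    [0, 1, 1, 0],
    "Age":    [22.0, 38.0, 26.0, 35.0],
    "Fare":   [7.25, 71.28, 7.92, 53.10],
})
X_test = pd.DataFrame({
    "Pclass": [2, 1],
    "Sex":    [1, 0],
    "Age":    [30.0, 54.0],
    "Fare":   [13.0, 51.86],
})

# assumption for this sketch: only the continuous columns are scaled
cols_to_scale = ["Age", "Fare"]

# work on copies so that the original DataFrames stay untouched
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

scaler = StandardScaler()

# fit on the selected training columns, then reuse the scaler for the test columns
X_train_scaled[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
X_test_scaled[cols_to_scale] = scaler.transform(X_test[cols_to_scale])

print(X_train_scaled)
```

The columns 'Pclass' and 'Sex' keep their original values. For larger projects scikit-learn also provides `sklearn.compose.ColumnTransformer`, which bundles such column-wise transformations into a single object.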

### 11.3. Exercise

__Exercise 1__: Although the Material dataset is already normalized between 1 and 10, apply Min-Max scaling to the data.

_Note_: you don't need to transform the output back to a pandas.DataFrame.