_____

<table align="left" width=100%>
    <td>
        <div style="text-align: center;">
          <img src="./images/bar.png" alt="funding entities"/>
        </div>
    </td>
    <td>
        <p style="text-align: center; font-size:24px;"><b>Introduction to Data Science</b></p>
        <p style="text-align: center; font-size:18px;"><b>Master in Electrical and Computer Engineering</b></p>
        <p style="text-align: center; font-size:14px;"><b>Pedro Cardoso (pcardoso@ualg.pt)</b></p>
    </td>
</table>

_____

# Exploratory Data Analysis with Pandas (part 2)

Let us continue the exploratory data analysis with Pandas, still working mostly with the Titanic dataset. We start by loading the necessary libraries.

In [None]:
# load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_bokeh
import tensorflow as tf
from tensorflow.keras import layers

pandas_bokeh.output_notebook()
%matplotlib inline

## Data visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

Let us see how we can use data visualization to explore the titanic dataset.

### Line plot

Line plots are useful to visualize the trend of a numerical column, usually, over time. We can use the `plot()` with the `kind='line'` argument from Pandas to visualize the trend of a numerical column.

Since the Titanic dataset does not have a time column, we will use a COVID-19 dataset to visualize the trend of the number of confirmed cases over time. The data is available at the "COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University" (https://github.com/CSSEGISandData/COVID-19).

In [None]:
# url with the data
url_confirmed= 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
df_covid_all = pd.read_csv(url_confirmed, header=0)
df_covid_all.head()

In [None]:
# select some countries
mask = df_covid_all['Country/Region']\
    .isin(['Portugal', 'Spain', 'Italy', 'France', 'Germany', 'United Kingdom'])

# drop the lat, long and province columns, group by country and sum the values, and transpose the dataframe
df_covid = df_covid_all[mask]\
    .drop(['Lat', 'Long', 'Province/State'], axis=1)\
    .groupby('Country/Region').sum()\
    .T

# set the index to datetime
df_covid.set_index(pd.to_datetime(df_covid.index, format='%m/%d/%y'), inplace=True)

df_covid.head()

In [None]:
df_covid.plot(kind='line', figsize=(15, 5))

In [None]:
df_covid.plot(kind='area', figsize=(15, 5), alpha=0.5)

In [None]:
df_covid.plot_bokeh(figsize=(800, 400), 
                    title='COVID-19')

In [None]:
df_covid.plot_bokeh(figsize=(800, 400), 
                    kind='area', 
                    title='COVID-19')
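
The confirmed-case series in this repository are cumulative counts, so the line plots above can only go up. A minimal sketch of how daily new cases and a smoothed trend could be derived with `diff()` and `rolling()` follows; the frame `df_cum` and its numbers are synthetic and only stand in for `df_covid`.

```python
import pandas as pd

# hypothetical cumulative counts for two countries (synthetic, not real data)
dates = pd.date_range("2020-03-01", periods=10, freq="D")
df_cum = pd.DataFrame(
    {"A": [1, 3, 6, 10, 15, 21, 28, 36, 45, 55],
     "B": [0, 0, 2, 2, 5, 9, 9, 12, 20, 30]},
    index=dates,
)

# daily new cases: difference between consecutive days (first row is NaN)
df_daily = df_cum.diff()

# 7-day rolling mean to smooth day-to-day fluctuations
df_smooth = df_daily.rolling(window=7).mean()

print(df_daily.head())
print(df_smooth.tail(3))
```

Applied to `df_covid`, the same two calls followed by `.plot(kind='line')` should give a plot of new cases per day instead of the running total.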

### Scatter plot

A scatter plot is useful to visualize the relationship between two numerical columns. We can use the `plot()` with the `kind='scatter'` argument from Pandas to visualize the relationship between two numerical columns.

Let us return to the Titanic dataset and visualize the relationship between the age and the number of siblings and spouses aboard the Titanic.

In [None]:
df_titanic = pd.read_excel('./data/titanic/Titanic.xls')
df_titanic.head()

In [None]:
df_titanic.plot(kind='scatter', x='age', y='sibsp')

In [None]:
df_covid.plot(kind="scatter", x="Portugal", y="Spain")

### Pie chart and donut chart

Pie charts are useful to visualize the distribution of a categorical column. We can use the `plot()` with the `kind='pie'` argument from Pandas.

In [None]:
df_titanic.pivot_table(index='sex',
                       columns='pclass',
                       values='survived',
                       aggfunc='sum').plot(kind='pie', 
                                           subplots=True, 
                                           figsize=(10, 10))
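
To make clear what is being plotted, here is a small sketch of the table that `pivot_table()` produces before `plot()` is called. The frame `toy` is made up for illustration and is not the Titanic data.

```python
import pandas as pd

# tiny made-up passenger list
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "female", "male"],
    "pclass": [1, 2, 1, 2, 1, 1],
    "survived": [1, 1, 0, 1, 1, 0],
})

# rows: sex, columns: class, cells: number of survivors (sum of the 0/1 flag)
table = toy.pivot_table(index="sex",
                        columns="pclass",
                        values="survived",
                        aggfunc="sum")
print(table)
```

With `subplots=True`, Pandas draws one pie per column of this table (one per class), and the slices of each pie are the rows (the sexes).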

### Histograms
We have already seen how to visualize the distribution of a numerical column using the `hist()` method from Pandas. The same kind of plot is available through `plot()` with the `kind='hist'` argument, which returns the Matplotlib axes so that the labels and the title can be set.


In [None]:
ax = df_titanic['age'].plot(kind='hist', 
                            alpha=0.1, 
                            bins=20, 
                            figsize=(10, 4), 
                            color='green')
ax.set_xlabel('Age')
ax.set_ylabel('Count')
ax.set_title('Age distribution')

Seaborn provides the `histplot()` function to visualize the distribution of a numerical column, with more options, such as cumulative counts and an overlaid kernel density estimate (https://seaborn.pydata.org/generated/seaborn.histplot.html).

In [None]:
sns.histplot(data=df_titanic[['age']].dropna(), 
             x='age', 
             cumulative=True, 
             bins=20, 
             kde=True, 
             element='bars' # {'bars', 'step', 'poly'}
             )

### Bar plot

Bar plots are useful to visualize the distribution of a categorical column. We can use the `plot()` with the `kind='bar'` argument from Pandas.

In [None]:
df_titanic.pivot_table(index='sex',
                       values='survived',
                       columns='pclass',
                       aggfunc='sum').plot(kind='bar', figsize=(10, 4))

In [None]:
df_titanic.pivot_table(index='sex',
                       values='survived',
                       columns='pclass',
                       aggfunc='sum').plot(kind='barh', figsize=(10, 4))

### Box plot

Box plots are useful to visualize the distribution of a numerical column. We can use the `boxplot()` method from Pandas.

In [None]:
df_titanic[['age', 'fare']].boxplot(figsize=(10, 4))

The seaborn library also provides the `boxplot()` method to visualize the distribution of a numerical column.

In [None]:
sns.boxplot(data=df_titanic['age'])

In [None]:
ax = sns.boxplot(data=df_titanic, 
                 x='pclass', 
                 y='age',
                 hue='sex')

ax.set_title('Age distribution by class')
ax.set_xlabel('Passenger Class')
ax.set_ylabel('Age')

### Kernel density estimation (KDE)
Kernel density estimation is a non-parametric approach to estimating the distribution of data. Instead of assuming a particular distribution, we use a continuous representation of the data. For example, let's say we have a set of data measurements but we don't know their underlying distribution. We can use a Gaussian kernel to estimate the density around the data.

We can use the `kdeplot()` method from Seaborn.

In [None]:
sns.kdeplot(data=df_titanic, x='age', hue='pclass')
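
The idea behind the estimate can be sketched by hand: with a Gaussian kernel $\phi$ and bandwidth $h$, the density at $x$ is $\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} \phi\left(\frac{x - x_i}{h}\right)$. The code below is a minimal illustration of this formula, not Seaborn's implementation; the function name `gaussian_kde_1d`, the synthetic sample and the bandwidth of 3.0 are assumptions made for the example (Seaborn chooses the bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # standardized distance between every grid point and every sample
    u = (grid[:, None] - samples[None, :]) / bandwidth
    # standard normal kernel evaluated at each distance
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    # average over the samples and rescale by the bandwidth
    return kernel.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
ages = rng.normal(30, 10, size=500)   # synthetic "ages", not the Titanic data
grid = np.linspace(-30, 90, 601)
density = gaussian_kde_1d(ages, grid, bandwidth=3.0)

# a density should integrate to approximately one
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual data points more closely.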

### Violin plot

Violin plots are useful to visualize the distribution of a numerical column. Violin plots are similar to box plots, but they also show the probability density of the data at different values.

We can use the `violinplot()` method from Seaborn.

In [None]:
sns.violinplot(data=df_titanic['age'])

In [None]:
ax = sns.violinplot(data=df_titanic, 
               x='pclass', 
               y='age',
               hue='sex')

ax.set_title('Age distribution by class')
ax.set_xlabel('Passenger Class')
ax.set_ylabel('Age')

### Heatmap

Heatmaps are useful to visualize the correlation between numerical columns. We can use the `heatmap()` method from Seaborn.

In [None]:
# numerical columns
numerical_columns = df_titanic.select_dtypes(include=np.number).columns

sns.heatmap(df_titanic[numerical_columns].corr(), annot=True, cmap='coolwarm')
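
As a small illustration of what the heatmap colors encode, the sketch below builds a synthetic frame (`toy`, not the Titanic data) with a known structure and prints the matrix returned by `corr()`, which uses the Pearson correlation by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_plus_noise": x + rng.normal(scale=0.5, size=200),  # positively related to x
    "minus_x": -x,                                         # exact negative relation
})

# pairwise Pearson correlation between all numerical columns
corr = toy.corr()
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so the heatmap is mirrored along its diagonal; values near 1 or -1 indicate a strong linear relationship.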

### Pairplot

Pairplots are useful to visualize the relationship between multiple numerical columns. We can use the `pairplot()` function from Seaborn. Several options are available to customize the pairplot; try some of them by changing the `kind` and `diag_kind` arguments.

In [None]:
sns.pairplot(df_titanic,
             kind='reg', # {'scatter', 'kde', 'hist', 'reg'}
             diag_kind='kde', # {'hist', 'kde'}
             hue='survived'
             )

### Jointplot
Jointplots are useful to visualize the relationship between two numerical columns. We can use the `jointplot()` method from Seaborn which shows a scatter plot and the distribution of each variable. See the documentation for more options (https://seaborn.pydata.org/generated/seaborn.jointplot.html).


In [None]:
sns.jointplot(data=df_titanic,
              x='age', y='fare',
              hue='survived',
              kind='scatter' # {'scatter', 'kde', 'hist', 'hex', 'reg', 'resid'}
)

### Hexbin plot

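Hexbin plots are useful to visualize the relationship between two numerical columns when many points overlap, since each hexagonal bin is colored by the number of points it contains. We can use the `jointplot()` function from Seaborn with `kind='hex'`; Pandas offers a similar plot through `plot.hexbin()`.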

In [None]:
sns.jointplot(data=df_titanic,
              x='age', 
              y='fare',
              kind='hex' # { "scatter" | "kde" | "hist" | "hex" | "reg" | "resid"}
)

### Bubble plot

Bubble plots are useful to visualize the relationship between two numerical columns and a third numerical column, which is encoded in the size of the markers. We can use the `plot.scatter()` method from Pandas with the `s` argument.

In [None]:
df_titanic.plot.scatter(x='age', 
                        y='fare', 
                        s='pclass', 
                        figsize=(10, 8)
                        )

In [None]:
ax = df_titanic.plot.scatter(x='age', y='fare',
                        s=4**(4-df_titanic['pclass']),
                        c=df_titanic['survived'].apply(lambda x: 'red' if x == 1 else 'blue'),
                        figsize=(10, 8),
                        alpha=0.5
                        )

ax.set_title('Age vs Fare')
ax.set_xlabel('Age')
ax.set_ylabel('Fare')

## Data cleaning

Real-world datasets are typically characterized by their messy and noisy nature, which often results in numerous faulty or missing values. Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

Data cleaning is a fundamental step in the data analysis process. The data cleaning process involves many steps, such as:
- **Missing values treatment** - remove or impute the missing values with values that are more representative of the data. E.g., the mean, the median, the mode, etc.

- **Outliers treatment** - remove or transform the outliers.

- **Feature engineering** - create new features from the existing features. E.g., create a new feature that is the sum or product of two other features.

- **Feature selection** - remove features that are not useful for the analysis. E.g., remove features that have a low correlation with the target variable if the goal is to predict the target variable.

- **Feature scaling** - scale the features to have the same range. E.g., scale the features to have values between 0 and 1. This is useful for some machine learning algorithms.

- **Feature transformation** - transform the features to have a more normal distribution. E.g., apply a logarithm to a strongly right-skewed feature. This is useful for some machine learning algorithms.

- **Feature discretization** - convert continuous features into a small number of discrete values. E.g., group ages into age ranges. This is useful for some machine learning algorithms.

- **Feature extraction** - extract features from the data. This is useful for some machine learning algorithms. E.g., extract features from text data, such as, the number of words or the number of characters.

- **Feature encoding** - encode the features to have a more machine learning friendly format. E.g., encode the categorical features to have integer values.

- **Feature reduction** - reduce the number of features. E.g., reduce the number of features using PCA. This is useful for some machine learning algorithms.
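
Three of the steps listed above (scaling, discretization and encoding) can be sketched in a few lines of Pandas. This is only an illustration on a made-up frame; the column values, the bin edges and the group labels are assumptions chosen for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [2.0, 18.0, 35.0, 60.0, 80.0],
    "embarked": ["S", "C", "S", "Q", "C"],
})

# feature scaling: min-max scaling to the range [0, 1]
age_min, age_max = toy["age"].min(), toy["age"].max()
toy["age_scaled"] = (toy["age"] - age_min) / (age_max - age_min)

# feature discretization: bin ages into labeled groups (arbitrary bin edges)
toy["age_group"] = pd.cut(toy["age"],
                          bins=[0, 12, 18, 65, 120],
                          labels=["child", "teen", "adult", "senior"])

# feature encoding: one indicator column per category
encoded = pd.get_dummies(toy["embarked"], prefix="embarked")

print(toy)
print(encoded)
```

Note that `pd.cut()` uses right-closed intervals by default, so an age of exactly 18 falls in the (12, 18] bin.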

### Missing values treatment

Values that are not present in the data are referred to as missing values. They may be absent due to a variety of reasons, including human error, privacy concerns, or the failure of the survey respondent to complete the value. Missing values pose a significant challenge in data science and are typically addressed during data preprocessing. E.g., missing values can negatively impact the performance of machine learning models.

There are various methods to handle missing values, including **dropping** the records that contain missing values, manually **filling** in the missing values, using **measures of central tendency** such as mean, median, or mode to fill in the missing values, or employing **machine learning models** such as regression, decision trees, or KNNs to predict and fill in the missing values with the most probable value.

Let's see how many missing values we have in each column. First, we can use the `isnull()` method from Pandas to get a boolean dataframe with True for the missing values and False for the non-missing values. Then, we can use the sum() method to get the number of missing values in each column.

In [None]:
df = df_titanic.copy()

df.isnull().sum()

The `isna()` method from Pandas is an alias of `isnull()`, so it returns the same boolean dataframe, and `sum()` gives the same number of missing values in each column.

In [None]:
df.isna().sum()
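
Absolute counts are easier to judge relative to the size of the dataset. As a small sketch on a made-up frame (`toy`), the mean of the boolean dataframe gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["male", "female", "female", "male"],
})

# the mean of True/False values is the fraction of True values
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

The same expression applied to `df` should show at a glance which columns are mostly empty and which have only a few gaps.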

#### Drop missing values

If we have a lot of data, we can drop the rows or columns that have missing values. We can use the `dropna()` method from Pandas to drop the rows or columns that have missing values. By default, the `dropna()` method drops the rows that have missing values. This will not work for the titanic dataset because every row has at least one missing value, as we can see from the next cell.

In [None]:
df.dropna()

Adding `axis=1` will drop the columns that have missing values.

In [None]:
df.dropna(axis=1)
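
Dropping every row or column with any missing value is often too aggressive. The sketch below, on a made-up frame, shows three more selective uses of `dropna()` through its `subset`, `thresh` and `how` parameters.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age":  [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, np.nan],
    "boat": [np.nan, "4", np.nan, np.nan],
})

# drop rows only when 'age' is missing; other columns may still contain NaN
rows_with_age = toy.dropna(subset=["age"])

# keep only the rows that have at least 2 non-missing values
rows_thresh = toy.dropna(thresh=2)

# drop a column only if all of its values are missing (none in this frame)
cols_kept = toy.dropna(axis=1, how="all")

print(rows_with_age)
print(rows_thresh)
print(cols_kept.columns.tolist())
```

In this toy frame, as in the Titanic data, a plain `toy.dropna()` returns an empty dataframe because every row has at least one missing value.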

#### Fill missing values

If we have a small dataset, we can fill the missing values with values that are more representative of the data. 

We can use the `fillna()` method from Pandas to fill the missing values with values that are more representative of the data. E.g., the mean, the median, the mode, etc.

For the age column (263 unknown values) and the fare column, we can fill the missing values with the mean age and the median fare. If the data is approximately normally distributed, the mean and the median are close and either can be used; when the data is skewed, the median is the safer choice because it is less affected by extreme values. Have a look at the histograms and skew values of the age and fare columns to see that age is "more normally distributed" than fare.

So, empirically, we can use the mean for the age and the median for the fare.

In [None]:
df['age'] = df['age'].fillna(df['age'].mean())

df['fare'] = df['fare'].fillna(df['fare'].median())
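
The effect of skewness on the mean and the median can be checked on synthetic data. In the sketch below the two columns are random samples generated for illustration (they are not the Titanic columns): one roughly symmetric and one with a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "symmetric": rng.normal(30, 10, size=2000),           # roughly bell-shaped
    "skewed": rng.lognormal(mean=3, sigma=1, size=2000),  # long right tail
})

# compare the two measures of central tendency and the skewness
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary.round(2))
```

For the symmetric column the mean and the median nearly coincide, while for the skewed column the mean is pulled toward the tail. Running `df_titanic[['age', 'fare']].skew()` is one way to make the same comparison on the real data.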

The embarked column is a categorical column, so we can fill the missing values with the most frequent value which is given by the mode.

In [None]:
df['embarked'].mode()

In [None]:
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

For the boat column we can see that there are 823 missing values. This is because the boat column is only filled in for passengers who survived. We can fill the missing values with -1, a value that represents that the passenger did not survive. We can also remove the boat column if we think it is not useful for the analysis.

In [None]:
df['boat'] = df['boat'].fillna(-1)

##### Shapiro-Wilk test for normality (Optional)

**To test the normality of the distribution we can use the Shapiro-Wilk test**. The null hypothesis of the Shapiro-Wilk test is that the data is normally distributed. If the p-value is less than a significance level $\alpha$ (a typical value is $\alpha = 0.05$), we reject the null hypothesis and conclude that the data is not normally distributed. Otherwise, we **cannot** reject the null hypothesis, i.e., the data is compatible with a normal distribution (which is weaker than proving that it is normal). Setting $\alpha = 0.05$ (5%) means accepting a 5% probability of incorrectly rejecting a true null hypothesis. Further, $p$ is, roughly, the probability of observing data at least as extreme as ours if the null hypothesis is true.

For our numerical variables, we can compute the p-values using the `shapiro()` function from the `scipy.stats` module. The `shapiro()` function returns the test statistic and the p-value. We can use the `select_dtypes()` method from Pandas to select only the numerical columns. Then, we can use a for loop to iterate over the numerical columns and print the p-value for each column. Further note the difference between original data and the data after filling the missing values.

In [None]:
from scipy.stats import shapiro

def test_shapiro(df):
    # for the numerical columns
    for col in df.select_dtypes(include=np.number).columns:
        p = shapiro(df[col].dropna())[1]
        if p < 0.05:
            print(f"p-value for {col}: {p} (reject the null hypothesis, i.e., the sample does not look like a normal distribution)")
        else:
            print(f"p-value for {col}: {p}  (null hypothesis is not rejectable, i.e., sample looks like a normal distribution)")

In [None]:
test_shapiro(df_titanic)

In [None]:
test_shapiro(df)

See the Shapiro-Wilk test applied to some distributions in the next cell.

In [None]:
import numpy as np

k = 5000
df_NUBP =  pd.DataFrame({
    'N':np.random.normal(0, 1, k),
    'U':np.random.uniform(0, 1, k),
    'B':np.random.binomial(100, 0.5, k),
    'P':np.random.poisson(10, k),
})

# plot the histograms
df_NUBP.hist(bins=20)

# test the normality
test_shapiro(df_NUBP)


### Outliers treatment

**Outliers are data points that are significantly different from the rest of the data**. Outliers can be caused by measurement errors or by human errors. Outliers can have a significant impact on the analysis, so it is important to treat them. The outliers treatment can be done in two ways:
- **Remove the outliers**.
- **Transform the outliers**.

To identify outliers we can use, e.g., the following methods:

- **Scatter plot** - the scatter plot is a graphical representation of the relationship between two variables and it is used to identify outliers. The outliers are the data points that are far away from the rest of the data.

- **Interquartile range / Boxplot** - the boxplot is a graphical representation of the distribution of the data. The boxplot shows the median, the first quartile, the third quartile, the minimum, the maximum, and the outliers. The outliers are the data points that are outside the range $[Q_1 - 1.5 \cdot IQR, Q_3 + 1.5 \cdot IQR]$, where $Q_1$ is the first quartile, $Q_3$ is the third quartile, and $IQR = Q_3 - Q_1$ is the interquartile range.

- **Percentile** - the outliers are the data points that are outside the range $[P_{\alpha/2}, P_{100-\alpha/2}]$, where $P_{x}$ is the $x$-th percentile.

- **Z-score** - the z-score is a measure of how many standard deviations away from the mean a data point is. The outliers are the data points that have a z-score greater than 3 or less than -3.

- **Isolation forest** - the isolation forest is an unsupervised machine learning algorithm that is used to identify outliers. The outliers are the data points that are isolated from the rest of the data.

- **Autoencoder** - the autoencoder is a neural network, trained without labels, that is used to identify outliers. The outliers are the data points that are not reconstructed well by the autoencoder.

Let us see some examples of outliers treatment.

#### Scatter plot

If we draw the scatter plot of `sibsp` (number of siblings/spouses aboard) against `parch` (number of parents/children aboard), we can see that there is a possible outlier with `parch` equal to 9 and `sibsp` equal to 1.

In [None]:
# copy the original dataframe
df = df_titanic.copy()

_ = df.plot_bokeh.scatter(x='sibsp', y='parch')

We can see that there is a single family that fits the description of the outlier.

In [None]:
df[(df.parch == 9) & (df.sibsp == 1)]

By the ticket number, we can see the family's composition.

In [None]:
df[df.ticket == 'CA. 2343']

####  Interquartile range / Boxplot

The boxplot is a graphical representation of the distribution of the data. The boxplot shows the median, the first quartile, the third quartile, the minimum, the maximum, and the outliers. The outliers are the data points that are outside the range 
$$[Q_1 - 1.5 \cdot IQR, Q_3 + 1.5 \cdot IQR],$$ 
where $Q_1$ is the first quartile, $Q_3$ is the third quartile, and 
$$IQR = Q_3 - Q_1$$ 
is the interquartile range.

Using the boxplot, we can see that there are outliers in the `age`, `sibsp`, `parch`, and `fare`.


In [None]:
_ = df.boxplot()

For example, we can see those outliers in the `fare` feature.

In [None]:
def get_outliers_mask_IQR(df, feature):
    q1 = df[feature].quantile(0.25)
    q3 = df[feature].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    mask_outliers = (df[feature] < lower_bound) | (df[feature] > upper_bound)
    return mask_outliers

mask_outliers = get_outliers_mask_IQR(df, 'fare')
outliers = df[mask_outliers]
outliers.sort_values(by='fare', ascending=False)

Removing the outliers, we can see that the boxplot is more compact, although outliers are still visible.

In [None]:
df.loc[~mask_outliers, ['fare']].boxplot()

The process of removing the outliers can be done iteratively, i.e., removing the outliers and then removing the outliers of the remaining data.

In [None]:
mask_outliers_it_2 = get_outliers_mask_IQR(df[~mask_outliers], 'fare')

while (mask_outliers_it_2.sum() > 0):
    # update the outliers mask
    mask_outliers = mask_outliers | mask_outliers_it_2

    # plot the boxplot of the remaining data
    df.loc[~mask_outliers, ['fare']].boxplot()
    plt.show()

    # get the outliers of the remaining data
    mask_outliers_it_2 = get_outliers_mask_IQR(df[~mask_outliers], 'fare')
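
Removing rows is only one of the two treatments mentioned earlier; the other is to transform the outliers. One common transformation, sketched below as an assumption rather than as the approach taken in this notebook, is to cap the values at the IQR bounds with `clip()`. The series `fares` contains toy values, not the Titanic fares.

```python
import pandas as pd

fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)  # toy values

# same bounds as in get_outliers_mask_IQR
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# values outside the bounds are replaced by the nearest bound
fares_capped = fares.clip(lower=lower_bound, upper=upper_bound)

print(lower_bound, upper_bound)
print(fares_capped.tolist())
```

Capping keeps all the rows, which matters when the other columns of an outlying row still carry useful information.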


#### Percentile

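The outliers are the data points that are outside the range $[P_{\alpha/2}, P_{100-\alpha/2}]$, where $P_{x}$ is the $x$-th percentile and $\alpha$ is the percentage of data points to flag, split evenly between the two tails. In the next cell `alpha` is given as a fraction (e.g., 0.02 for 2%), since `quantile()` expects values between 0 and 1.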

In [None]:
def get_outliers_mask_percentile(df, feature, alpha=0.05):
    lower_bound = df[feature].quantile(alpha/2)
    upper_bound = df[feature].quantile(1-alpha/2)
    mask_outliers = (df[feature]<lower_bound) | (df[feature]>upper_bound)
    return mask_outliers

mask_outliers = get_outliers_mask_percentile(df, 'fare', 0.02)
outliers = df[mask_outliers]
outliers.sort_values(by='fare', ascending=False)

#### Z-score

The z-score is a measure of how many standard deviations away from the mean a data point is. The outliers are the data points with $z > 3$ or $z < -3$, where $$z = \frac{x - \mu}{\sigma},$$ $x$ is the data point, $\mu$ is the mean, and $\sigma$ is the standard deviation.

Let's see if there are outliers in the `fare` feature using the z-score.

In [None]:
def get_outliers_mask_z_score(df, feature):
    lower_bound = df[feature].mean() - 3 * df[feature].std()
    upper_bound = df[feature].mean() + 3 * df[feature].std()
    mask_outliers = (df[feature]<lower_bound) | (df[feature]>upper_bound)
    return mask_outliers

mask_outliers = get_outliers_mask_z_score(df, 'fare')
outliers = df[mask_outliers]
outliers.sort_values(by='fare', ascending=False)
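
The function above compares the values with bounds; an equivalent view is to compute the z-score of every point explicitly. The sketch below does so on synthetic data (200 values around 50 plus one extreme value added on purpose), not on the Titanic fares.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 200 ordinary values plus a single extreme one appended at the end
values = pd.Series(np.append(rng.normal(50, 5, size=200), 500.0))

# z-score of each point, using the sample mean and standard deviation
z = (values - values.mean()) / values.std()

# flag the points more than 3 standard deviations away from the mean
mask = z.abs() > 3
print(values[mask])
```

Note that the extreme value itself inflates the mean and the standard deviation used in the computation, which is one reason why the iterative procedure can keep finding new outliers after the first ones are removed.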

In [None]:
ax_0 = df['fare'].plot(kind='hist', bins=20)
ax_0.set_title('Fare distribution ')

ax_1 = df.loc[~mask_outliers, ['fare']].plot(kind='hist', bins=20)
ax_1.set_title('Fare distribution without outliers')

ax_2 = df.loc[mask_outliers, ['fare']].plot(kind='hist')
ax_2.set_title('Fare distribution of the outliers')

As before, the process of removing the outliers can be done iteratively, i.e., removing the outliers and then removing the outliers of the remaining data.

In [None]:
mask_outliers_it_2 = get_outliers_mask_z_score(df[~mask_outliers], 'fare')
while (mask_outliers_it_2.sum() > 0):
    # update the outliers mask
    mask_outliers = mask_outliers | mask_outliers_it_2
    print('By now, we have removed {} outliers'.format(mask_outliers.sum()))
    # plot the histogram of the remaining data
    df.loc[~mask_outliers, ['fare']].hist()
    plt.show()
    # get the outliers of the remaining data
    mask_outliers_it_2 = get_outliers_mask_z_score(df[~mask_outliers], 'fare')



Maybe it was too much to remove all the outliers!

#### Isolation forest

Isolation forest is an unsupervised machine learning algorithm that is used to identify outliers. The outliers are the data points that are isolated from the rest of the data. The algorithm recursively splits the data into smaller and smaller subsets until the data points are isolated. The fewer splits it takes to isolate a point, the more anomalous the point is.

![images/IF.png](images/IF.png)

Isolation forest is implemented in the `sklearn` library. Let's see how it works. We will use the `age` and `fare` features to identify the outliers, helping to graphically visualize the results. However, isolation forest can be used with any number of features.

In [None]:
# !pip install scikit-learn
from sklearn.ensemble import IsolationForest

df_for_if = df[['age', 'fare']].dropna().copy()

# fit the model where the contamination is the percentage of outliers
clf = IsolationForest(random_state=0, contamination=0.02)
clf.fit(df_for_if)

# predict the outliers
y_pred = clf.predict(df_for_if)

# add a column to the dataframe used for the model, identifying the outliers (-1) and the inliers (1)
df_for_if['outlier'] = y_pred

# show the outliers
df_for_if[df_for_if.outlier == -1].sort_values(by='fare', ascending=False)

Let's see how the outliers are distributed in the `age` and `fare` features.

In [None]:
df[['age', 'fare']].dropna().plot.scatter(x='age', 
                                          y='fare', 
                                          c=df_for_if['outlier'], 
                                          colormap='viridis')

#### Autoencoder

An autoencoder is a neural network trained to reproduce its input at its output, i.e., to approximate the identity function, after passing the data through a narrow bottleneck layer. Training minimizes the reconstruction error, i.e., the difference between the input data and the reconstructed data. The outliers are the data points that are reconstructed with a high error.

![images/AE.png](images/AE.png)

In [None]:
list_of_features = ['age', 'fare', 'pclass', 'sibsp', 'parch', 'survived']
df_for_AE = df[list_of_features].dropna().copy()


# reproducibility with tensorflow
tf.keras.utils.set_random_seed(42)  # sets seeds for base-python, numpy and tf

# define the model
model = tf.keras.Sequential()
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(8, activation='relu'))
model.add(layers.Dense(2, activation='relu'))
model.add(layers.Dense(8, activation='relu'))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(len(list_of_features), activation='linear'))

# compile the model
model.compile(optimizer='adam', loss='mse')

#normalize the data to have zero mean and unit variance -- this is important for the autoencoder and the reconstruction error
mu = df_for_AE.mean()
std = df_for_AE.std()
df_for_AE = (df_for_AE - mu) / std

# fit the model where the input and the output are the same
model.fit(df_for_AE, df_for_AE, epochs=500, verbose=1)

# predict the reconstructed data
y_pred = model.predict(df_for_AE)

Now, compute the reconstruction error as the Euclidean distance between the original data and the reconstructed data.

In [None]:
reconstruction_error = df_for_AE - y_pred
reconstruction_error['error'] = reconstruction_error.apply(lambda x: np.sqrt(np.sum(np.square(x))), axis=1)
reconstruction_error
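
The row-wise `apply()` above can also be written as a vectorized norm. The sketch below uses made-up numbers: `X` stands in for the standardized inputs and `X_hat` for the model output, so no trained model is needed to run it.

```python
import numpy as np
import pandas as pd

# stand-ins for the standardized inputs and the reconstruction (made-up numbers)
X = pd.DataFrame({"a": [0.0, 1.0, -1.0], "b": [0.0, 2.0, 1.0]})
X_hat = np.array([[0.0, 0.0], [1.0, -2.0], [2.0, 5.0]])

# Euclidean norm of each row of residuals
error = pd.Series(np.linalg.norm(X.to_numpy() - X_hat, axis=1),
                  index=X.index,
                  name="error")
print(error)
```

Keeping the index of the input frame makes it straightforward to build a boolean mask that can later be used with `loc` on the original rows.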

Then, identify the outliers by computing the quantile of the reconstruction error. Let us say that we want to remove the top 5% of the data points with the highest reconstruction error.

In [None]:
mask_outliers = reconstruction_error['error'] > reconstruction_error['error'].quantile(0.95)
reconstruction_error[mask_outliers].sort_values(by='error', ascending=False)

We can add a column to the original data frame identifying the outliers.

In [None]:
df_for_AE['outlier'] = 1
df_for_AE.loc[mask_outliers, 'outlier'] = -1
df_for_AE

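Now we can plot the outliers in the `age` and `fare` features, as before. Note that the axes now show the standardized values, since `df_for_AE` was normalized before training.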

In [None]:
df_for_AE[['age', 'fare']].dropna().plot.scatter(x='age', y='fare', c=df_for_AE['outlier'], colormap='viridis')

# Exercises

[05_exercise_adult_part_2.ipynb](05_exercise_adult_part_2.ipynb)

# References

- Campesato, O. (2018). Regular expressions: Pocket primer. Mercury Learning and Information.
- https://www.kaggle.com/learn/pandas
- Navlani, A., Fandango, A., Idris, I. (2021). Python Data Analysis: Perform data collection, data processing, wrangling, visualization, and model building using Python. Packt. 3rd Edition.
- Brandt, S. (2014). Data Analysis: Statistical and Computational Methods for Scientists and Engineers. Springer. 4th Edition.
- https://eugenelohh.medium.com/data-analysis-on-the-titanic-dataset-using-python-7593633135f2