## Hands-On Data Preprocessing in Python
Learn how to effectively prepare data for successful data analytics

## Data Transformation and Massaging
Data transformation comes at the very last stage of data preprocessing, right before using the analytic tools. At this stage of data preprocessing, the dataset already has the following characteristics.
- **Data cleaning**: The dataset is cleaned at all three cleaning levels.
- **Data integration**: All the potentially beneficial data sources are recognized and a dataset that includes the necessary information is created.
- **Data reduction**: If needed, the size of the dataset has been reduced.

At this stage of data preprocessing, we may have to make some changes to the data before moving to the analysis stage. The dataset will undergo the changes for one of the following reasons:
- **Necessity**: The analytic method cannot work with the current state of the data. For instance, many data-mining algorithms, such as Multi-Layered Perceptron (MLP) and K-means, only work with numbers; when there are categorical attributes, those attributes need to be transformed before the analysis is possible.
- **Correctness**: Without the proper data transformation, the resulting analytics will be misleading or wrong. For instance, if we use K-means clustering without normalizing the data, we may assume that all the attributes carry equal weight in the clustering result, but that is incorrect; the attributes that happen to have a larger scale will carry more weight.
- **Effectiveness**: If the data goes through some prescribed changes, the analytics will be more effective.

## Data Transformation versus Data Massaging
**Data transformation** refers to the process of converting data from one format or structure to another, in order to make it more useful or usable for a specific purpose. This can include things like cleaning and formatting data, as well as aggregating and summarizing it.

**Data massaging** is a less formal term that is often used to refer to the process of manipulating data in order to clean it, correct errors, or make it more useful. This can include things like removing duplicate records, filling in missing values, or converting data from one format to another.

Both data transformation and data massaging involve manipulating data in order to make it more useful or usable, but the specific techniques and processes used may differ depending on the context and the specific data being worked with.

<img src="https://drive.google.com/uc?id=1fubPkuIjBhZ3xLIaL4z4uTv6lETrSCwq" width="400"/>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Normalization and Standardization
Here is the general rule of when we need normalization or standardization. We need **normalization** when we need the range of all the attributes in a dataset to be equal. This will be needed especially for algorithmic data analytics that uses the distance between the data objects. Examples of such algorithms are K-means and KNN.

$NA_i = \cfrac{A_i - min(A)}{max(A) - min(A)}$

On the other hand, we need **standardization** when we need the variance and/or the standard deviation of all the attributes to be equal. We saw an example of needing standardization when learning about PCA as a data reduction method. Standardization was necessary there because PCA essentially operates by examining the total variation in a dataset; an attribute with more variation has more say in the operation of PCA.

$SA_i = \cfrac{A_i - mean(A)}{std(A)}$
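The two formulas above can be applied column by column with plain pandas arithmetic. The following is a minimal sketch on a made-up DataFrame; the column names and values are assumptions for illustration only and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical two-attribute dataset with very different scales.
df = pd.DataFrame({'income': [20_000, 35_000, 50_000, 120_000],
                   'age': [22, 35, 47, 61]})

# Normalization (min-max): every column ends up in the range [0, 1].
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): every column ends up with mean 0 and std 1.
# Note: pandas' std() uses the sample standard deviation (ddof=1) by default.
standardized = (df - df.mean()) / df.std()

print(normalized.round(2))
print(standardized.round(2))
```

After normalization both columns share the same range, whereas after standardization they share the same standard deviation; which one is appropriate depends on the analytic tool that follows.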

## Binary Coding, Ranking Transformation, and Discretization

<img src="https://drive.google.com/uc?id=1xfVckidbD7pTyNXLbShyOfrV-x6tRuAp" width="400"/>

### Example 1 – Binary coding of nominal attribute
---
In this example, we will use the **WH Report.csv** dataset. This dataset contains the **Continent** categorical attribute, which carries information that can make our clustering analysis more interesting.

In [None]:
report_df = pd.read_csv('WH Report.csv')
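# Boolean mask that selects the 2019 rows
BM = report_df.year == 2019

# .copy() gives an independent DataFrame that is safe to modify in place
report2019_df = report_df[BM].copy()
report2019_df.set_index('Name',inplace=True)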

As the Continent attribute is nominal, binary coding is our only option. The following code uses the **pd.get_dummies()** pandas function to binary-code the Continent attribute.

In [None]:
bc_Continent = pd.get_dummies(report2019_df.Continent)
bc_Continent.head(5)

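The key line in the following code is **Xs = Xs.join(bc_Continent/7)**, which adds the binary-coded version of the Continent attribute (bc_Continent) to Xs after Xs is **normalized** and before it is fed into **kmeans.fit()**. Dividing by 7 scales the 0/1 columns down, so the binary-coded attribute has less pull on the distance calculation.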

In [None]:
from sklearn.cluster import KMeans
dimensions = ['Life_Ladder', 'Log_GDP_per_capita', 'Social_support',
              'Healthy_life_expectancy_at_birth', 'Freedom_to_make_life_choices',
              'Generosity', 'Perceptions_of_corruption', 'Positive_affect', 'Negative_affect']

Xs = report2019_df[dimensions]
Xs = (Xs - Xs.min())/(Xs.max()-Xs.min())
Xs = Xs.join(bc_Continent/7)
kmeans = KMeans(n_clusters=3)
kmeans.fit(Xs)

In [None]:
clusters = ['Cluster {}'.format(i) for i in range(3)]

Centroids = pd.DataFrame(0.0, index= clusters, columns=Xs.columns)
for i,clst in enumerate(clusters):
    BM = kmeans.labels_ == i
    Centroids.loc[clst] = Xs[BM].mean(axis=0)

plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
sns.heatmap(Centroids[dimensions], linewidths=.5, annot=True, cmap='binary')
plt.subplot(1,2,2)
sns.heatmap(Centroids[bc_Continent.columns], linewidths=.5, annot=True, cmap='binary')
plt.show()

📌 To see the impact of this scaling, **remove the division by 7**, rerun the clustering analysis, and recreate the heatmaps of the centroid analysis.

In [None]:
dimensions = ['Life_Ladder', 'Log_GDP_per_capita', 'Social_support',
              'Healthy_life_expectancy_at_birth', 'Freedom_to_make_life_choices',
              'Generosity', 'Perceptions_of_corruption', 'Positive_affect', 'Negative_affect']

Xs = report2019_df[dimensions]
Xs = (Xs - Xs.min())/(Xs.max()-Xs.min())
Xs = Xs.join(bc_Continent)
kmeans = KMeans(n_clusters=3)
kmeans.fit(Xs)

clusters = ['Cluster {}'.format(i) for i in range(3)]

Centroids = pd.DataFrame(0.0, index=clusters, columns=Xs.columns)

for i,clst in enumerate(clusters):
    BM = kmeans.labels_==i
    Centroids.loc[clst] = Xs[BM].mean(axis=0)

plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
sns.heatmap(Centroids[dimensions], linewidths=.5, annot=True, cmap='binary')
plt.subplot(1,2,2)
sns.heatmap(Centroids[bc_Continent.columns], linewidths=.5, annot=True, cmap='binary')
plt.show()

### Example 2 – Binary coding or ranking transformation of ordinal attributes
---
<img src="https://drive.google.com/uc?id=1gFsSq7zRP2sqrlR2j8DQ9MrscbArkNuK" width="600"/>

With **Binary Coding**, the transformation does not inject any assumed information into the result, but it strips the attribute of its ordinal information. If we use the binary-coded values instead of the original attribute in our analysis, the data no longer shows the order of the possible values. <u>For example</u>, while the binary-coded values distinguish between High School and Bachelor, they do not show that Bachelor comes after High School, as we know it does.

**Ranking Transformation** does not have this shortcoming, but it has a drawback of its own. To preserve the order of the possible values, it replaces them with numbers, and that goes a little too far: the numbers not only encode the order of the possible values but also assume information that does not exist in the original attribute. <u>For example</u>, the rank-transformed attribute assumes there is exactly one unit of difference between High School and Bachelor.

**Attribute Construction** is only possible if we have a good understanding of the attribute. It tries to fix the gross assumptions added by Ranking Transformation by using knowledge about the original attribute to build more accurate information into the transformed data. <u>For example</u>, we know that each degree in the Education Level attribute takes a different number of years of education, so Attribute Construction can use those numbers of years instead of arbitrary ranks.
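The three options can be compared side by side on a small ordinal attribute. This is a minimal sketch: the level names, their order, and the years of education assigned to each level are assumptions made for illustration, not values taken from the figure or from a dataset.

```python
import pandas as pd

# Hypothetical ordinal attribute; the level names and years are illustrative.
levels = ['High School', 'Bachelor', 'Master', 'PhD']
edu = pd.Series(['Bachelor', 'High School', 'PhD', 'Master', 'Bachelor'],
                name='Education Level')

# 1) Binary coding: one 0/1 column per level; the order is lost.
binary_coded = pd.get_dummies(edu)

# 2) Ranking transformation: order is kept, but equal one-unit gaps are assumed.
rank_map = {level: rank for rank, level in enumerate(levels, start=1)}
ranked = edu.map(rank_map)

# 3) Attribute construction: replace each level with assumed years of education.
years_map = {'High School': 12, 'Bachelor': 16, 'Master': 18, 'PhD': 22}
years = edu.map(years_map)

print(binary_coded)
print(pd.DataFrame({'level': edu, 'rank': ranked, 'years': years}))
```

In the ranked column every step is one unit, while in the constructed column the gaps differ from step to step, which is the extra knowledge Attribute Construction brings in.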

### Example 3 – Discretization of Numerical attributes
---

In [None]:
adult_df = pd.read_csv('Adult.csv')
adult_df

The following box plot shows the interaction between three attributes, **sex**, **income**, and **hoursPerWeek**, from *adult_df* (**Adult.csv**). We had to use a box plot because hoursPerWeek is a numerical attribute.

In [None]:
plt.figure(figsize=(5,4))
sns.boxplot(data=adult_df, y='sex', x='hoursPerWeek', hue='income')
plt.show()

In [None]:
plt.figure(figsize=(4,3))
adult_df.hoursPerWeek.plot.hist()
plt.show()

In [None]:
adult_df['discretized_hoursPerWeek'] = adult_df.hoursPerWeek.apply(lambda v: '>40' if v>40 else ('40' if v==40 else '<40'))

The following bar chart shows the interaction between the same three attributes, except that the hoursPerWeek numerical attribute has been **discretized**. With the attribute discretized, the bar chart tells the story of the data far more clearly than the box plot.

In [None]:
adult_df.groupby(['sex','income']).discretized_hoursPerWeek.value_counts().unstack()[['<40','40', '>40']].plot.barh()
plt.show()

## Types of Discretization
While the best tool to guide us through finding the best way to discretize an attribute is a histogram, there are a few different approaches one might adopt. These approaches are called **equal width**, **equal frequency**, and **ad hoc**.

1. The **equal width** approach makes sure that the cut-off points lead to equal intervals of the numerical attribute. For instance, the following code applies the *pd.cut()* function to create 5 equal-width bins from *adult_df.age*.

In [None]:
plt.figure(figsize=(4,3))
pd.cut(adult_df.age, bins=5).value_counts().sort_index().plot.bar()
plt.show()

2. The **equal frequency** approach aims to have an equal number of data objects in each bin. For instance, the following code applies the *pd.qcut()* function to create 5 equal-frequency bins from *adult_df.age*.

In [None]:
plt.figure(figsize=(4,3))
pd.qcut(adult_df.age, q=5, duplicates='drop').value_counts().sort_index().plot.bar()
plt.show()

3. The **ad hoc** approach sets the **cut-off** points based on the numerical attribute itself and on circumstantial knowledge about it. For instance, in *Example 3 – Discretization of Numerical attributes* we cut adult_df.hoursPerWeek ad hoc, after consulting the histogram of the attribute and using the circumstantial knowledge that most employees in the US work 40 hours a week.
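An ad hoc discretization can also be written with *pd.cut()* by passing explicit cut-off points instead of a number of bins. The following is a minimal sketch on made-up integer hours; the data and the choice of bin edges are assumptions for illustration, and the approach relies on the values being whole numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly working hours (integers), standing in for hoursPerWeek.
hours = pd.Series([20, 35, 40, 40, 45, 60, 40, 38])

# Intervals are right-closed: (-inf, 39], (39, 40], (40, inf).
# For integer hours this yields the groups '<40', '40', and '>40'.
discretized = pd.cut(hours, bins=[-np.inf, 39, 40, np.inf],
                     labels=['<40', '40', '>40'])

print(discretized.value_counts().sort_index())
```

Because *pd.cut()* returns an ordered categorical, the groups keep their natural order in tables and plots without having to reorder the columns by hand.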

## Attribute Construction
### Example 1 – Construct one transformed attribute from two attributes
---

In [None]:
person_df = pd.read_csv('500 Person.csv')
person_df

In [None]:
person_df.Index = person_df.Index.replace({0:'Extremely Weak', 1: 'Weak',2: 'Normal',3:'Overweight', 4:'Obesity',5:'Extreme Obesity'})
person_df.columns = ['Gender', 'Height', 'Weight', 'Condition']
person_df

In [None]:
plt.figure(figsize=(4,3))
sns.scatterplot(data=person_df, x='Height', y='Weight', hue='Condition', style='Gender')
plt.legend(bbox_to_anchor=(1.05, 1))
plt.show()

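**BMI** is a function that factors in both weight and height to create a healthiness index. The formula is as follows. Be careful: in this formula, weight is in kilograms and height is in meters. The code after the formula divides Height by 100, which assumes the dataset records height in centimeters.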

$BMI = \frac{Weight}{Height^2}$

In [None]:
# Vectorized BMI: Weight / (Height/100)^2, computed for all rows at once
person_df['BMI'] = person_df.Weight / (person_df.Height/100)**2
person_df

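**np.random.random(*size=None*)**

Returns random floats in the half-open interval [0.0, 1.0). It is an alias for random_sample to ease forward-porting to the new random API. In the following code, the random values are used only to spread the points vertically so that they do not overlap; the vertical position carries no meaning.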

In [None]:
person_df['Random'] = np.random.random(len(person_df))

plt.figure(figsize=(9,2))
sns.scatterplot(data=person_df, x='BMI',y='Random', hue='Condition')

plt.ylim([-0.25,1.25])
plt.xticks(np.linspace(10,80,15))
plt.yticks([])
plt.grid()
plt.legend(bbox_to_anchor=(1.01, 1))
plt.show()

## Feature Extraction

<img src="https://drive.google.com/uc?id=1GQbJpVlcnwmcBeDhPiifpTN4R7X0tJ1d" width="500"/>

## Log Transformation
Attributes with **exponential growth or decline** may be problematic for data visualization and clustering analysis; furthermore, they can be problematic for some prediction and classification algorithms where the method uses the distance between the data objects, such as KNN, or where the method derives its performance from collective performance metrics, such as linear regression.

In [None]:
country_df = pd.read_csv('GDP 2019 2020.csv')
country_df.set_index('Country Name',inplace=True)
country_df

In [None]:
n_countries = len(country_df)
# Label every second country on the x-axis
intervals = list(range(0, n_countries, 2))

wdf = country_df[['2019','2020']].sort_values('2020')
wdf['2020'].plot(figsize=(13,3))

plt.xticks(intervals,wdf.iloc[intervals].index,rotation=90)
plt.show()

wdf['2020'].plot.box(vert=False,figsize=(13,1))
plt.show()

In [None]:
n_countries = len(country_df)
# Label every second country on the x-axis
intervals = list(range(0, n_countries, 2))

wdf = country_df[['2019','2020']].sort_values('2020')
wdf['2020'].plot(figsize=(13,3),logy=True)

plt.xticks(intervals,wdf.iloc[intervals].index,rotation=90)
plt.show()

wdf['2020'].plot.box(vert=False,figsize=(13,1),logx=True)
plt.show()
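The *logy* and *logx* arguments above only change the scale of the plot axes. When the attribute is going to be fed to a distance-based algorithm, the values themselves can be log-transformed. The following is a minimal sketch on made-up values (not real GDP figures); it assumes all values are strictly positive, since the logarithm is undefined otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical values spanning several orders of magnitude (not real GDP data).
values = pd.Series([1e6, 1e7, 1e8, 1e9, 1e10], index=list('ABCDE'))

# Base-10 log transformation: multiplicative steps become additive steps.
log_values = np.log10(values)

# Min-max normalization before and after the log transformation.
normalized_raw = (values - values.min()) / (values.max() - values.min())
normalized_log = (log_values - log_values.min()) / (log_values.max() - log_values.min())

print(pd.DataFrame({'raw': normalized_raw.round(4), 'log': normalized_log.round(4)}))
```

Without the log transformation, the normalized values are crowded near 0 and the largest value dominates; after it, the values are spread evenly across the range.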

## Smoothing

In [None]:
signal_df = pd.read_csv('Noise Data.csv')
signal_df

In [None]:
signal_df.drop(columns='t',inplace=True)

In [None]:
signal_df.Signal.plot(figsize=(10,3))
plt.show()

When we used **Functional Data Analysis (FDA)** to reduce the size of the data, we were interested in replacing the data with the parameters of a function that simulates the data well. When smoothing, however, we want to keep the data at the same size and only remove the noise. In other words, FDA is applied in much the same way for data reduction and for smoothing; what differs is the output we keep.

> For **smoothing**, we expect to have the same size data as the output, whereas for **data reduction**, we expect to only have the parameters of the fitting function.

There are many functions and modules in the Python data analysis environment that use FDA to smooth data. A few of them are savgol_filter from scipy.signal; CubicSpline, UnivariateSpline, splrep, and splev from scipy.interpolate; and KernelReg from *statsmodels.nonparametric.kernel_regression*.

However, none of these functions works as well as it should, and I believe there is much more room for improvement of the smoothing tools in the Python data analytics space. For instance, the following code compares the performance of KernelReg on part of the data (the first 50 numbers) with its performance on the whole Noise Data.csv file (200 numbers).

In [None]:
from statsmodels.nonparametric.kernel_regression import KernelReg

x = np.linspace(0,50,50)
y = signal_df.Signal.iloc[:50]

plt.figure(figsize=(5,3))
plt.plot(x, y, '+')

kr = KernelReg(y,x,'c')
y_pred, y_std = kr.fit(x)

plt.plot(x, y_pred)
plt.show()

In [None]:
from statsmodels.nonparametric.kernel_regression import KernelReg

x = np.linspace(0,200,200)
y = signal_df.Signal

plt.figure(figsize=(5,3))
plt.plot(x, y, '+')

kr = KernelReg(y,x,'c')
y_pred, y_std = kr.fit(x)

plt.plot(x, y_pred)
plt.show()
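For comparison, here is a minimal sketch of savgol_filter from scipy.signal, one of the other functions listed earlier. It runs on a synthetic noisy sine wave rather than on Noise Data.csv, and the window_length and polyorder values are illustrative assumptions that would need tuning for real data.

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical noisy signal: a sine wave plus Gaussian noise (not Noise Data.csv).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
clean = np.sin(t)
noisy = clean + rng.normal(scale=0.3, size=t.size)

# Fit a cubic polynomial inside a sliding 21-point window.
# window_length and polyorder are illustrative choices, not recommended defaults.
smoothed = savgol_filter(noisy, window_length=21, polyorder=3)

# The output has the same length as the input, as expected from smoothing.
print(len(noisy), len(smoothed))

# Mean absolute distance from the clean signal, before and after smoothing.
print(np.abs(noisy - clean).mean(), np.abs(smoothed - clean).mean())
```

The smoothed output keeps all 200 points, which matches the expectation that smoothing returns data of the same size rather than just the parameters of the fitted function.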

## Rolling Data Smoothing
The biggest difference between **functional data smoothing** and **rolling data smoothing** is that functional data smoothing looks at the whole data as one piece and then tries to find the function that fits the data. In contrast, rolling data smoothing works on incremental windows of the data. The following figure shows what a rolling calculation and its incremental windows look like, using the first 10 rows of *signal_df*.

<img src="https://drive.google.com/uc?id=16bV5VmJ7Ovm1YocxCuAW3KDAYFtSMubI" width="600"/>

In [None]:
signal_df.Signal.iloc[:10].plot(figsize=(10,3))

plt.xticks([i for i in range(10)])
plt.show()

In [None]:
signal_df.Signal.plot(figsize=(10,3),label='Signal')
signal_df.Signal.rolling(window =5).mean().plot(label='Moving Average Smoothed')

plt.legend()
plt.show()

Note that the first four values of Moving Average Smoothed are NaN. This is due to the nature of rolling window calculations: when the window width is k, the first k-1 rows do not have a full window behind them, so by default they are NaN.

In [None]:
pd.DataFrame({'Signal':signal_df.Signal.iloc[:50],
              'Moving Average Smoothed':signal_df.Signal.iloc[:50].rolling(window=5).mean()}).head(10)
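If the leading NaN values are a problem, the *center* and *min_periods* parameters of *rolling()* offer one way around them. The following is a minimal sketch on a short made-up series; whether averaging partial windows at the edges is acceptable depends on the analysis, so treat it as an option rather than a recommendation.

```python
import pandas as pd

# Hypothetical short signal (not taken from Noise Data.csv).
signal = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0])

# Default trailing window: the first k-1 = 2 values are NaN.
trailing = signal.rolling(window=3).mean()

# Centered window that also accepts partial windows at the edges.
centered = signal.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({'signal': signal, 'trailing': trailing, 'centered': centered}))
```

With *min_periods=1*, the edge values are averaged over fewer points than the rest, so they are smoothed less than the values in the middle of the series.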

## Binning
Binning and discretization are the same operation used for different purposes. When it is done to transform a numerical attribute into a categorical one, it is referred to as discretization; when it is used as a way to combat noise in numerical data, we call the same data transformation binning.

In [None]:
plt.figure(figsize=(5,3))
adult_df.age.value_counts().sort_index().plot.bar()
plt.show()

In [None]:
plt.figure(figsize=(5,3))

adult_df['age_binned']=pd.cut(adult_df.age,10)
adult_df.age_binned.value_counts().sort_index().plot.bar()
plt.show()
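When the goal is to reduce noise while keeping the attribute numerical, one common option is smoothing by bin means, where each value is replaced by the mean of its bin. This technique is not used in the cells above; the following is a minimal sketch on made-up ages, with the number of bins chosen arbitrarily.

```python
import pandas as pd

# Hypothetical ages (not taken from Adult.csv).
ages = pd.Series([18, 19, 22, 31, 33, 38, 47, 52, 58, 60])

# Three equal-width bins over the observed range.
bins = pd.cut(ages, bins=3)

# Smoothing by bin means: each value is replaced by the mean of its bin.
smoothed = ages.groupby(bins, observed=True).transform('mean')

print(pd.DataFrame({'age': ages, 'bin': bins, 'smoothed': smoothed}))
```

The result has the same length as the input but only as many distinct values as there are non-empty bins.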