## Hands-On Data Preprocessing in Python
Learn how to effectively prepare data for successful data analytics

## Data Reduction
Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form. The purpose of data reduction can be two-fold: reduce the number of data records by eliminating invalid data or produce summary data and statistics at different aggregation levels for various applications.

Data reduction does not necessarily mean loss of information. For example, the body mass index reduces two dimensions (body mass and height) into a single measure, without any information being lost in the process.

Data reduction is about reducing the size of data due to one of the following three reasons:
- **High-Dimensional Visualizations**: When we have to pack more than three to five dimensions into one visual, we reach the limit of human comprehension.
- **Computational Cost**: Datasets that are too large may require too much computation. This might be the case for algorithmic approaches.
- **Curse of Dimensionality**: Some of the statistical approaches become incapable of finding meaningful patterns in the data because there are too many attributes.
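
One way to see the curse of dimensionality is that distances between random points become nearly indistinguishable as attributes are added. The following is a minimal sketch on synthetic uniform data; the function name `nearest_to_farthest_ratio` and the chosen dimensions are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def nearest_to_farthest_ratio(n_points, n_dims):
    """Ratio of the nearest to the farthest distance from one query point."""
    points = rng.random((n_points, n_dims))   # synthetic data in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

# A ratio close to 1 means "near" and "far" neighbours look almost the same
ratios = {d: nearest_to_farthest_ratio(500, d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"{d:>4} dimensions: nearest/farthest = {ratio:.3f}")
```

As the ratio approaches 1, distance-based methods have less and less contrast to work with, which is one reason to reduce the number of attributes first.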

## The objectives of Data Reduction
<img src="https://drive.google.com/uc?id=1QvClEQnqHYOUroXgTgVPaOP2V2Z-I-Vt" width="400"/>

## Types of Data Reduction
<img src="https://drive.google.com/uc?id=1uKL1W2RoE09t5HxX00WT4Ly2INYHAPGX" width="700"/>



| Dimensionality Reduction 	| Numerosity Reduction 	|
|---	|---	|
| In dimensionality reduction, data encoding or data transformations are applied to obtain a reduced or compressed form of the original data. 	| In numerosity reduction, data volume is reduced by choosing suitable alternative forms of data representation. 	|
| It can be used to remove irrelevant or redundant attributes. 	| It is merely a representation technique of original data into smaller form. 	|
| In this method, some data that is irrelevant can be lost. 	| In this method, there is no loss of data. 	|
| Methods for dimensionality reduction are:<br><br>1. Wavelet transformations.<br><br>2. Principal Component Analysis. 	| Methods for Numerosity reduction are:<br><br>1. Regression or log-linear model (parametric).<br><br>2. Histograms, clustering, sampling (non-parametric). 	|
| The components of dimensionality reduction are feature selection and feature extraction. 	| It has no components but methods that ensure reduction of data volume. 	|
| It leads to less misleading data and more model accuracy. 	| It preserves the integrity of data and the data volume is also reduced. 	|

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Numerosity Data Reduction
In numerosity reduction, the data volume is decreased by selecting an alternative, smaller form of data representation. These techniques can be parametric or non-parametric.

For **parametric** methods, a model is fitted to estimate the data, so that only the model parameters need to be stored instead of the actual data. Log-linear models are one example.
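
As a rough illustration of the parametric idea, the sketch below fits a straight line to synthetic data and keeps only the two fitted parameters. It uses simple linear regression through `np.polyfit` as a stand-in for a log-linear model; the data and variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(1000, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)  # synthetic data

# Fit a line; these two numbers are all that needs to be stored
slope, intercept = np.polyfit(x, y, deg=1)

# The stored parameters give an approximate reconstruction of y
y_hat = slope * x + intercept

print(f"stored values: 2 instead of {y.size}")
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"mean absolute reconstruction error: {np.abs(y - y_hat).mean():.3f}")
```

The reconstruction is only as good as the model, so this kind of reduction suits data that the chosen model describes well.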

**Non-parametric** methods store a reduced representation of the data; they include histograms, clustering, and sampling.
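
A histogram is perhaps the simplest non-parametric reduction: the raw values are replaced by bin edges and counts. The sketch below, on a synthetic attribute with an arbitrarily chosen 20 bins, shows how much is stored and how a summary such as the mean can still be approximated from the bins.

```python
import numpy as np

rng = np.random.default_rng(7)
values = rng.normal(loc=50, scale=10, size=10_000)  # synthetic attribute

# Reduce 10,000 values to 20 counts and 21 bin edges
counts, bin_edges = np.histogram(values, bins=20)

# Approximate the mean using only the histogram (bin centers weighted by counts)
centers = (bin_edges[:-1] + bin_edges[1:]) / 2
approx_mean = (counts * centers).sum() / counts.sum()

print("original values stored:", values.size)
print("histogram values stored:", counts.size + bin_edges.size)
print(f"true mean={values.mean():.3f}, histogram mean={approx_mean:.3f}")
```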

### **1. Random Sampling**
A simple random sample is a randomly selected subset of a population. In this sampling method, each member of the population has an exactly equal chance of being selected.

<img src="https://drive.google.com/uc?id=1IhAiiKFU5qNkTPCrkww96b0QrV522odk" width="400"/>

### Example 1 – Random sampling to speed up tuning
---
One of the standard ways of tuning an algorithm is a brute-force approach: try all the possible combinations of hyperparameter values and see which one leads to the best outcome. The following code uses `GridSearchCV()` from `sklearn.model_selection` to experiment with all the combinations of the listed values for the `criterion`, `max_depth`, `min_samples_split`, and `min_impurity_decrease` hyperparameters of the `DecisionTreeClassifier()` model from `sklearn.tree`.

In [None]:
customer_df = pd.read_csv('Customer Churn.csv')
customer_df

- When the grid search code below runs, it reports **360** candidate models (2 × 6 × 5 × 6 hyperparameter combinations), each fitted three times on different subsets of the input dataset, for a total of **1,080** fits.
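
The candidate and fit counts can be checked without training anything. This small sketch enumerates the same hyperparameter grid with `itertools.product`; the names `cv_folds`, `n_candidates`, and `n_fits` are introduced here only for illustration.

```python
from itertools import product

# Same grid as the tuning code in this example
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30, 40, 50, 60],
    'min_samples_split': [10, 20, 30, 40, 50],
    'min_impurity_decrease': [0, 0.001, 0.005, 0.01, 0.05, 0.1],
}
cv_folds = 3  # matches cv=3 in GridSearchCV

# Every combination of one value per hyperparameter is one candidate model
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 360 1080
```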

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

y = customer_df['Churn']
Xs = customer_df.drop(columns=['Churn'])

param_grid = {
      'criterion': ['gini','entropy'],
      'max_depth': [10,20,30,40,50,60],
      'min_samples_split': [10,20,30,40,50],
      'min_impurity_decrease': [0,0.001, 0.005, 0.01, 0.05, 0.1]}

gridSearch = GridSearchCV(DecisionTreeClassifier(),
                          param_grid, cv=3,
                          scoring='recall',verbose=1)
gridSearch.fit(Xs, y)

print('Best score: ', gridSearch.best_score_)
print('Best parameters: ', gridSearch.best_params_)

- In the following code, **1,000** of the data objects are randomly selected and the same tuning code is then applied to that sample. After running it, you will see that the time it takes to finish drops significantly.

In [None]:
########## Random Sampling ##########
customer_df_rs = customer_df.sample(1000,random_state=1)

y = customer_df_rs['Churn']
Xs = customer_df_rs.drop(columns=['Churn'])

gridSearch = GridSearchCV(DecisionTreeClassifier(),
                          param_grid, cv=3,
                          scoring='recall',verbose=1)
gridSearch.fit(Xs, y)

print('Initial score: ', gridSearch.best_score_)
print('Initial parameters: ', gridSearch.best_params_)

### **2. Stratified Sampling**
In a stratified sample, researchers divide a population into homogeneous subpopulations called **strata** (the plural of stratum) based on specific characteristics (e.g., race, gender identity, location, etc.). Every member of the population studied should be in exactly one stratum.

Each stratum is then sampled using another probability sampling method, such as cluster sampling or simple random sampling, allowing researchers to estimate statistical measures for each sub-population.

Researchers rely on stratified sampling when a population’s characteristics are diverse and they want to ensure that every characteristic is properly represented in the sample. This helps with the generalizability and validity of the study, as well as avoiding research biases like undercoverage bias.

<img src="https://drive.google.com/uc?id=1GP0Qgwh9o4q1R0eNfnw2wX9YnaGngrq_" width="400"/>



### Example 2 – Stratified sampling for imbalanced dataset

---
In the previous example, we saw that *customer_df* is imbalanced as 15.7% of its cases are churn, while the rest, which is 84.3%, are non-churn. Now, we want to come up with some code that can perform stratified sampling.

The following code will be able to get a stratified sample of *customer_df* that contains 1000 data objects out of the 3,150 data objects. In the end, the code will print the ratios of churn and non-churn data objects in the sample using *.value_counts(normalize=True)*. Run the code a few times. You will see that even though the process is completely random, it will always lead to the same ratios of churn and non-churn cases.
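
The same idea can also be sketched with pandas' built-in `GroupBy.sample()`, which avoids `apply()` altogether (recent pandas releases deprecate having `apply()` operate on the grouping column). The example below uses a synthetic stand-in for *customer_df*; `demo_df`, `usage`, and `strat_df` are hypothetical names, and the 15.7% churn rate is taken from the text above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for customer_df: 3,150 rows, about 15.7% churn
rng = np.random.default_rng(1)
n_rows = 3150
n_churn = round(n_rows * 0.157)
demo_df = pd.DataFrame({
    'Churn': np.r_[np.ones(n_churn, dtype=int),
                   np.zeros(n_rows - n_churn, dtype=int)],
    'usage': rng.random(n_rows),  # placeholder attribute
})

# Sample the same fraction from every Churn group
frac = 1000 / len(demo_df)
strat_df = demo_df.groupby('Churn').sample(frac=frac, random_state=1)

print(demo_df['Churn'].value_counts(normalize=True))
print(strat_df['Churn'].value_counts(normalize=True))
```

Because each group is sampled at the same rate, the class ratios in the sample match the population up to rounding, regardless of the random seed.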

In [None]:
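n, s = len(customer_df), 1000  # population size and target sample size
r = s / n                      # sampling ratio applied to every stratum

# Sample the same fraction from each Churn group so the class ratios are preserved
sample_df = customer_df.groupby('Churn', group_keys=False).apply(
    lambda sdf: sdf.sample(round(len(sdf) * r)))
print(sample_df.Churn.value_counts(normalize=True))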

In [None]:
plt.figure(figsize=(5,3))
customer_df.Churn.value_counts(normalize=True).plot.bar()

### **3. Random Over/Under Sampling**
✏️ Random <font color='blue'>oversampling</font> duplicates examples from the minority class in the training dataset and can result in overfitting for some models.

✏️ Random <font color='blue'>undersampling</font> deletes examples from the majority class and can result in losing information invaluable to a model.
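
Both ideas can be sketched in a few lines of pandas. The example below uses a synthetic imbalanced dataset (100 churn rows, 900 non-churn rows); all names and sizes are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'Churn': np.r_[np.ones(100, dtype=int), np.zeros(900, dtype=int)],
    'usage': rng.random(1000),  # placeholder attribute
})
minority = demo_df[demo_df['Churn'] == 1]
majority = demo_df[demo_df['Churn'] == 0]

# Undersampling: shrink the majority class down to the minority size
under_df = pd.concat(
    [minority, majority.sample(n=len(minority), random_state=0)])

# Oversampling: draw minority rows with replacement up to the majority size
over_df = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)])

print(under_df['Churn'].value_counts())
print(over_df['Churn'].value_counts())
```

Note that the oversampled frame contains repeated minority rows, which is where the overfitting risk comes from, while the undersampled frame has discarded 800 majority rows.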

<img src="https://drive.google.com/uc?id=1DdGEWA6oejSsEhvZh0B-O-4ZP9PXU_aG" width="500"/>




### Example 3 – Random over/under sampling
---
The following code will be able to get a sample of *customer_df* that contains 500 data objects out of the 3,150 data objects. There will be 250 data objects from both the churning and non-churning customers. In the end, the code will print the ratios of the churn and non-churn data objects in the sample using *.value_counts(normalize=True)*.

Then, run the code a few times. You will see that even though the process is completely random, it will always lead to the same and equal ratios of churn and non-churn cases.


In [None]:
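s = 500  # total sample size: 250 churn + 250 non-churn

# Draw the same number of rows from each Churn group to get a balanced sample
sample_df = customer_df.groupby('Churn', group_keys=False).apply(
    lambda sdf: sdf.sample(s // 2))
print(sample_df.Churn.value_counts(normalize=True))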

## Dimensionality Data Reduction
In dimensionality reduction, data encoding or transformations are used to obtain a reduced or **compressed** representation of the original data. If the original data can be regenerated from the compressed data without any loss of information, the data reduction is known as lossless. If the reconstructed data is only an approximation of the original data, the data reduction is called lossy.
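
The difference between lossless and lossy reduction can be illustrated with a small, generic sketch. It uses byte-level compression from the standard library (`zlib`) for the lossless case and rounding for the lossy case; neither is a dimensionality reduction method as such, they only demonstrate the two definitions on synthetic data.

```python
import zlib
import numpy as np

rng = np.random.default_rng(3)
signal = np.round(rng.normal(size=1000), 3)  # synthetic data

# Lossless: compress the raw bytes, then recover every value exactly
compressed = zlib.compress(signal.tobytes())
restored = np.frombuffer(zlib.decompress(compressed), dtype=signal.dtype)

# Lossy: keep one decimal place; the original can only be approximated
approx = np.round(signal, 1)

print("lossless round trip exact:", np.array_equal(signal, restored))
print("lossy max absolute error:", np.abs(signal - approx).max())
```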

### Example 1 – Dimension Reduction using Linear Regression
---
The following is the linear regression equation for *amzn_df*:

<font color='green'>
$changeP = \beta_{0}
+ \beta_{1} \times \textrm{pd_changeP}
+ \beta_{2} \times \textrm{pw_changeP}
+ \beta_{3} \times \textrm{dow_pd_changeP}
+ \beta_{4} \times \textrm{dow_pw_changeP}
+ \beta_{5} \times \textrm{nasdaq_pd_changeP}
+ \beta_{6} \times \textrm{nasdaq_pw_changeP}
$
</font>

In [None]:
amzn_df = pd.read_csv('Amzn Stock.csv')

amzn_df.set_index('t',drop=True,inplace=True)
amzn_df.columns = ['pd_changeP', 'pw_changeP', 'dow_pd_changeP',
       'dow_pw_changeP', 'nasdaq_pd_changeP', 'nasdaq_pw_changeP',
       'changeP']
amzn_df

In the summary table that the OLS code in this example produces, the **P>|t|** column lists the p-values of the hypothesis tests of each independent attribute's significance for predicting the dependent attribute. Most of the p-values are far larger than the **cut-off point of 0.05**; the exception is <font color='blue'>dow_pd_changeP</font>, whose p-value is only slightly larger than the cut-off point. For most of the independent attributes, then, we do not have enough evidence to reject the null hypothesis that the attribute is not related to the dependent attribute. For dow_pd_changeP, the probability of seeing such a result if the attribute were unrelated is rather small. So, if we were going to keep any attribute, we would keep dow_pd_changeP and remove the rest.

> <font color='green'>$changeP = \beta_{0}
+ \beta_{1} \times \textrm{dow_pd_changeP}
$
</font>

What is the purpose of **Xs = sm.add_constant(Xs)**? This line of code adds a column whose value for all the rows is 1. The reason for this addition is to make sure **OLS()** estimates a constant coefficient (the intercept, $\beta_{0}$), which linear regression models have but which **OLS()** does not include on its own.
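
To see why the column of 1s matters, the sketch below fits least squares with and without it using plain NumPy (`np.linalg.lstsq`) rather than statsmodels. The synthetic data, with a true intercept of 4 and slope of 2, is an assumption made for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 4.0 + 2.0 * x + rng.normal(scale=0.1, size=200)  # intercept 4, slope 2

# Design matrix without and with a leading column of 1s
X_no_const = x.reshape(-1, 1)
X_const = np.column_stack([np.ones_like(x), x])

coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("without constant (slope only):", coef_no_const)
print("with constant (intercept, slope):", coef_const)
```

Without the constant column the fitted line is forced through the origin, so the slope estimate has to absorb the missing intercept.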

> Ordinary Least Squares (OLS)



In [None]:
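import statsmodels.api as sm

# Independent attributes: every column except changeP, without the 2021-01-12 row
Xs = amzn_df.drop(columns=['changeP'], index=['2021-01-12'])
Xs = sm.add_constant(Xs)  # add a column of 1s so OLS fits an intercept

# Dependent attribute, with the same row removed
y = amzn_df.drop(index=['2021-01-12']).changeP

sm.OLS(y, Xs).fit().summary()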

### Example 2 – Dimension Reduction using Random Forest
---
The following code uses *RandomForestClassifier()* from *sklearn.ensemble* to train a random forest model that uses **1000** weak decision trees.



In [None]:
from sklearn.ensemble import RandomForestClassifier

y = customer_df['Churn']
Xs = customer_df.drop(columns=['Churn'])

rf = RandomForestClassifier(n_estimators=1000)
rf.fit(Xs, y)

Print <font color='blue'>rf.feature_importances_</font> and look at the numerical values that show the importance of the independent attributes. The code that follows creates a pandas Series, sorts the attributes based on their importance, and then creates a bar chart that shows the relative importance of each attribute for classifying customer churn.

In [None]:
rf.feature_importances_

In [None]:
plt.figure(figsize=(5,3))

importance_sr = pd.Series(rf.feature_importances_, index=Xs.columns)
importance_sr.sort_values(ascending=False).plot.barh()
plt.show()
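
Once the importances are available, the reduction step itself is just a column selection. The sketch below shows one possible way to do it on synthetic data, where only the `signal` column determines the label; the dataset, the column names, and the choice to keep a single top attribute are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
demo_X = pd.DataFrame({
    'signal': rng.normal(size=n),    # the only informative attribute
    'noise_1': rng.normal(size=n),
    'noise_2': rng.normal(size=n),
})
demo_y = (demo_X['signal'] > 0).astype(int)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(demo_X, demo_y)

# Rank attributes by importance and keep only the top one
importance = pd.Series(rf_demo.feature_importances_, index=demo_X.columns)
top_features = importance.sort_values(ascending=False).head(1).index.tolist()
reduced_X = demo_X[top_features]

print(importance.sort_values(ascending=False))
print("kept:", top_features)
```

How many attributes to keep is a judgment call; a bar chart like the one above is usually inspected first to look for a clear drop in importance.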

### Example 3 – Dimension Reduction using Principal Component Analysis (PCA)
---
Principal component analysis, or PCA, is a dimensionality reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

<font size='2'>*Ref: https://builtin.com/data-science/step-step-explanation-principal-component-analysis*</font>

<font size='2'>*Ref: https://lengyi.medium.com/principal-components-analysis-pca-e97c976ff130*</font>

In [None]:
toy_df = pd.read_excel('Toy Dataset.xlsx')
toy_df

In [None]:
toy_df.plot.scatter(x='Dimension_1',y='Dimension_2',s=40) # s: The marker size
plt.show()

In [None]:
var_df = pd.DataFrame(toy_df.var())
var_df.columns = ['Variance']
var_df.reset_index(inplace=True)
var_df

In [None]:
new_row = pd.Series({'index':'Total','Variance':var_df.Variance.sum()})

var_df = pd.concat([var_df, new_row.to_frame().T], ignore_index=True)
var_df.set_index('index')

In [None]:
toy_df.corr()

⬆️ The preceding outputs give us a lot of insight into *toy_df*. What jumps out right away is that **Dimension_1** and **Dimension_2** are **strongly** correlated.

⬆️ We can see this both in the scatterplot and the correlation matrix; the correlation coefficient between Dimension_1 and Dimension_2 is 0.859195. We can also see that the total variance in toy_df is 1026.989474; Dimension_1 contributes 415.315789 of that total, while Dimension_2 contributes the rest (611.673685).

⬇️ The following code uses **PCA()** from *sklearn.decomposition* to transform toy_df.

⬇️ We call the new columns of a PCA-transformed dataset **principal components (PCs)**. Here, you can see that since toy_df has two attributes, toy_t_df has two PCs called **PC1** and **PC2**.

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(toy_df)

toy_t_df = pd.DataFrame(pca.transform(toy_df))
toy_t_df.columns = ['PC1','PC2']
toy_t_df

In [None]:
toy_t_df.plot.scatter(x='PC1',y='PC2',s=40)
plt.show()

In [None]:
var_df = pd.DataFrame(toy_t_df.var())
var_df.columns=['Variance']
var_df.reset_index(inplace=True)
var_df

In [None]:
new_row = pd.Series({'index':'Total','Variance':var_df.Variance.sum()})

var_df = pd.concat([var_df, new_row.to_frame().T], ignore_index=True)
var_df.set_index('index')

In [None]:
toy_t_df.corr()

- Look at the total variance in the two variance tables, before and after the transformation. Both are exactly 1026.989474. So, PCA neither adds information to nor removes information from the dataset; it just moves the variation from one attribute to another.

- The first PC, PC1, carries the maximum possible variation, and the correlation between the PCs (here, PC1 and PC2) is zero.

- While Dimension_1 contributes only 415.315789 of the total variance of 1026.989474, PC1 contributes 957.53716. So, we can see that the PCA transformation has successfully pushed most of the variation into the first PC, PC1.

- Moreover, looking at the scatterplot and the correlation matrix, we can see that PC1 and PC2 have no relationship with one another and that the correlation coefficient is effectively zero (-2.682793e-17, which is floating-point rounding error).

- However, we do remember that the relationship between Dimension_1 and Dimension_2 was rather strong (0.859195). So, we can see that PCA has been successful in making sure there is no correlation between PC1 and PC2 in this example.
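
The properties listed above can be checked on any dataset. The following is a NumPy-only sketch that performs PCA through an eigendecomposition of the covariance matrix on synthetic two-dimensional data; it is meant to illustrate the properties, not to reproduce how `sklearn.decomposition.PCA` is implemented, and all names and data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d1 = rng.normal(scale=20, size=200)
d2 = 0.9 * d1 + rng.normal(scale=10, size=200)  # correlated with d1
data = np.column_stack([d1, d2])

# Center the data and decompose its covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # largest variance first
pcs = centered @ eigvecs[:, order]       # columns are PC1, PC2

var_before = data.var(axis=0, ddof=1)    # same ddof as pandas .var()
var_after = pcs.var(axis=0, ddof=1)
corr_after = np.corrcoef(pcs, rowvar=False)[0, 1]

print("total variance before:", var_before.sum())
print("total variance after: ", var_after.sum())
print("variance per PC:", var_after)
print("correlation between PC1 and PC2:", corr_after)
```

The total variance is unchanged, PC1 holds at least as much variance as either original attribute, and the correlation between the PCs is zero up to rounding error.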
