## Hands-On Data Preprocessing in Python
Learn how to effectively prepare data for successful data analytics
    
## Data Cleaning Level Ⅲ – Missing values, outliers, and errors

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Missing values

Missing values, as the name suggests, are values we expect to have but we don't. In the simplest terms, missing values are empty cells in a dataset that we want to use for analytic goals.

In pandas and NumPy, missing values are not represented by empty cells; they are represented by **NaN**, which is short for **Not a Number**. Although the literal meaning of Not a Number does not capture every situation in which a value can be missing, NaN is the standard marker for missing values in Python's data libraries.

Dataset authors sometimes follow their own conventions and mark missing values with alternatives such as MV, None, 99999, or N/A. If missing values are not represented in a standard way, the first step in dealing with them is to rectify that. In such cases, we detect the values that the author of the dataset meant as missing values and replace them with **np.nan**.
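
As a minimal sketch of that first step, the cell below builds a small hypothetical DataFrame (`raw_df` and its columns are made up for illustration) in which missing values were recorded as `MV`, `N/A`, and `99999`, and replaces them with `np.nan`. Which markers really stand for missing values is an assumption that has to be confirmed for each dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the author used 'MV', 'N/A', and 99999 for missing values
raw_df = pd.DataFrame({
    'Temperature': [21.5, 'MV', 19.0, 'N/A'],
    'Humidity': [40, 99999, 55, 60],
})

# Assumed list of non-standard markers; adjust it to the dataset at hand
markers = ['MV', 'N/A', 99999]
clean_df = raw_df.replace(markers, np.nan)

# Convert the columns back to numeric types now that the text markers are gone
clean_df = clean_df.apply(pd.to_numeric)

print(clean_df.isna().sum())
```

A sentinel such as 99999 is only safe to replace when it cannot be a legitimate value of the attribute.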

## Detecting missing values


In [None]:
air_df = pd.read_csv('Air Data.csv')
air_df

In [None]:
air_df.info()

In [None]:
print('Number of missing values:')
for col in air_df.columns:
  n_MV = air_df[col].isna().sum()  # count the NaN entries in this column
  print('{}:{}'.format(col,n_MV))

## Causes of missing values

*   Human error.
*   Respondents may refuse to answer a survey question.
*   The person taking the survey does not understand the question.
*   The provided value is an obvious error, so it was deleted.
*   Not enough time to respond to questions.
*   Lost records due to lack of effective database management.
*   Intentional deletion and skipping of data collection (probably with fraudulent intent).
*   Participant exiting in the middle of the study.
*   Third-party tampering with or blocking data collection.
*   Missed observations.
*   Sensor malfunctions.
*   Programming bugs.

## Types of missing values

1. **Missing Completely at Random (MCAR)**: The missing values occur at random across the whole dataset, and whether a value is missing does not depend on any variable.
2. **Missing at Random (MAR)**: The missing values are not random across the whole dataset, but they are random within a subgroup of the data (a sub-dataset or a sampled subset).
3. **Missing Not at Random (MNAR, also written NMAR)**: The missingness is directly related to the very data being collected. For example, people with a lower level of education are often less willing to report the highest level they completed. This kind of missing data needs careful consideration because the missingness is related to the variable itself.

MNAR is said to be **non-ignorable**, whereas MCAR and MAR may be called ignorable because they occur at random.
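
To make the three types more concrete, here is a small simulation sketch. The DataFrame `demo_df`, its `Age` and `Income` attributes, and all probabilities are hypothetical and chosen only for illustration; real missingness mechanisms are rarely this clean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical survey data
demo_df = pd.DataFrame({
    'Age': rng.integers(18, 70, size=n),
    'Income': rng.normal(50_000, 15_000, size=n),
})

# MCAR: every Income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends on another, observed attribute (Age)
mar_mask = rng.random(n) < np.where(demo_df['Age'] < 30, 0.4, 0.1)

# MNAR: the chance depends on the unobserved Income value itself
mnar_mask = rng.random(n) < np.where(demo_df['Income'] > 60_000, 0.5, 0.05)

for name, mask in [('MCAR', mcar_mask), ('MAR', mar_mask), ('MNAR', mnar_mask)]:
    observed = demo_df['Income'].mask(mask)  # set the masked entries to NaN
    print(name, 'missing share:', round(observed.isna().mean(), 3),
          '| mean of observed Income:', round(observed.mean(), 1))
```

In the MNAR case, high incomes are removed more often, so the mean of the observed values is pulled below the true mean. This is why MNAR cannot simply be ignored.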

<font size='2'>*Ref: https://bigdatarpg.com/2022/05/23/missing-values-คืออะไร/*</font>

<img src="https://drive.google.com/uc?id=1f6r3ckOiFkbadivxv2-PiMssupyCGGoM" width="700"/>




## Diagnosis of missing values
Diagnosing the missing values in **NO2_Location_A** based on Temperature

In [None]:
# Boolean mask: True where NO2_Location_A is missing
BM_MV = air_df.NO2_Location_A.isna()
MV_labels = ['With Missing Values','Without Missing Values']

# Temperature values of each group, in the same fixed order as MV_labels
box_data = [air_df.loc[BM_MV,'Temperature'].dropna(),
            air_df.loc[~BM_MV,'Temperature'].dropna()]

plt.figure(figsize=(5,2))
plt.boxplot(box_data,vert=False)
plt.yticks([1,2],MV_labels)
plt.show()

## Dealing with missing values
1.   Keep them as is.
2.   Remove the data objects (rows) with missing values.
3.   Remove the attributes (columns) with missing values.
4.   Estimate and impute a value.
<p style="text-indent: 40px">
- Impute with the general central tendency (mean, median, or mode). This is better for MCAR missing values.
- Impute with the central tendency of a more relevant group of data to the missing values. This is better for MAR missing values.
- Regression analysis. Not ideal, but if we have to proceed with a dataset that has MNAR missing values, this method is better for such a dataset.
- Interpolation. When the dataset is a time series dataset and the missing values are of the MCAR type.
</p>
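
The imputation options listed above can be sketched on a toy example. The DataFrame `toy_df`, the `Site` and `NO2` attributes, and the series `ts` are hypothetical; regression-based imputation is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two sensor sites
toy_df = pd.DataFrame({
    'Site': ['A', 'A', 'A', 'B', 'B', 'B'],
    'NO2': [10.0, np.nan, 14.0, 30.0, 34.0, np.nan],
})

# General central tendency (here the median of the whole column)
overall = toy_df['NO2'].fillna(toy_df['NO2'].median())

# Central tendency of a more relevant group (median within each Site)
by_group = toy_df['NO2'].fillna(toy_df.groupby('Site')['NO2'].transform('median'))

# Linear interpolation between neighboring records of a time series
ts = pd.Series([10.0, np.nan, 14.0, 16.0])
interpolated = ts.interpolate()

print(overall.tolist())
print(by_group.tolist())
print(interpolated.tolist())
```

The group-wise version fills each gap with a value typical for its own site, which is the idea behind using a more relevant group for MAR missing values.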

To choose the right approach for dealing with missing values, we need to understand and consider the following items:
- Our analytic goals
- Our analytic tools
- The cause of the missing values
- The type of the missing values (MCAR, MAR, MNAR)

<img src="https://drive.google.com/uc?id=1tjXpzx-5HiZCQUDG8ilDs4wV36nNot_m" width="700"/>







## Outliers
Outliers, a.k.a. extreme points, are data objects whose values are very different from the rest of the population. Being able to recognize and deal with them is important from the following three perspectives:
- Outliers may be data errors and should be detected and removed.
- Outliers that are not errors can skew the results of analytic tools that are sensitive to the existence of outliers.
- Outliers may be fraudulent entries.

## Detecting Outliers
### (1) Univariate outlier detection
The tools we will use for univariate outlier detection depend on the attribute's type. For numerical attributes, we can use a boxplot or the `[Q1-1.5*IQR, Q3+1.5*IQR]` statistical range. The concept of outliers does not have much meaning for a single categorical attribute.

<img src="https://drive.google.com/uc?id=1bqrDqqT2n1VaWdjEV2ZhGk-3cS9Vg9_a" width="500"/>

In [None]:
response_df = pd.read_csv('Responses.csv')
response_df.head(2)

#### (1.1) Example of detecting outliers across one numerical attribute

In [None]:
plt.figure(figsize=(5,2))
plt.boxplot(response_df.Weight.dropna(),vert=False)
plt.show()

In [None]:
Q1 = response_df.Weight.quantile(0.25)
Q3 = response_df.Weight.quantile(0.75)
IQR = Q3-Q1

BM = (response_df.Weight > (Q3+1.5*IQR)) | (response_df.Weight < (Q1-1.5*IQR))
response_df[BM]

#### (1.2) Example of detecting outliers across one categorical attribute

In [None]:
response_df.Education.value_counts()

In [None]:
response_df.Education.value_counts().plot.bar(figsize=(5,3))

### (2) Bivariate outlier detection
The tools we will use for bivariate outlier detection depend on the attributes' type.
- For numerical-numerical attributes, it is best to use a scatterplot.
- For categorical-categorical attributes, the tool we use is a color-coded contingency table.
- For numerical-categorical attributes, it is best to use multiple boxplots.

#### (2.1) Example of detecting outliers across two numerical attributes


In [None]:
response_df.plot.scatter(x='Height',y='Weight',figsize=(5,3))

In [None]:
BM = (response_df.Weight>130) | (response_df.Height<70)
response_df[BM]

#### (2.2) Example of detecting outliers across two categorical attributes

In [None]:
pd.crosstab(response_df['Education'],response_df['God'])

In [None]:
cont_table = pd.crosstab(response_df['Education'],response_df['God'])

sns.set(rc={"figure.figsize":(6, 4)})
sns.heatmap(cont_table,annot=True,center=0.5,cmap="Greys")

> The **.query()** method, as its name suggests, can also help us filter a DataFrame based on the values of its attributes.

In [None]:
response_df.query('Education== "currently a primary school pupil" & God==2')

In [None]:
response_df.query('Education== "currently a primary school pupil" & God==4')

#### (2.3) Example of detecting outliers across two attributes one categorical and the other numerical

In [None]:
sns.boxplot(x=response_df.Age,y=response_df.Education)

In [None]:
BM1 = (response_df.Education=='college/bachelor degree') & (response_df.Age>26)
BM2 = (response_df.Education == 'secondary school') & ((response_df.Age>24) | (response_df.Age<16))
BM3 = (response_df.Education == 'primary school') & ((response_df.Age>19) | (response_df.Age<16))
BM = BM1 | BM2 | BM3
response_df[BM]

### (3) Multivariate Outlier detection
Detecting outliers across more than two attributes is called multivariate outlier detection. The best way to go about multivariate outlier detection is through **clustering** analysis.

In this example, we would like to see whether we have outliers based on the following four attributes: **Country, Musical, Metal or Hardrock, and Folk**. If you check the complete description of these attributes on *columns_df*, you will realize these attributes describe the liking level of data objects for each of four kinds of music.

- First, we will create an Xs attribute, which includes the attributes we want to be used for clustering analysis.

In [None]:
dimensions = ['Country', 'Metal or Hardrock','Folk','Musical']
Xs = response_df[dimensions]

- Second, we need to check whether there are any missing values. You may use Xs.info() for the quick detection of missing values.

In [None]:
Xs.info()

- In this case, the missing values are spread across the data objects and the dimensions of Xs. So, we can use the following lines of code to impute the missing values of each attribute with its upper cap, `Q3+IQR*1.5`.

In [None]:
Q3 = Xs.quantile(0.75)
Q1 = Xs.quantile(0.25)
IQR = Q3 - Q1
Xs = Xs.fillna(Q3+IQR*1.5)

- Next, of course, we will not forget to **normalize** the dataset with min-max scaling, using `Xs = (Xs-Xs.min())/(Xs.max()-Xs.min())`.

In [None]:
Xs = (Xs-Xs.min())/(Xs.max()-Xs.min())

- Lastly, we can use a loop to perform clustering analysis for different Ks and report its results.
- Once the following code has run successfully, you can scroll through its output to see that under none of the Ks has K-Means grouped one data object or a handful of data objects in one cluster. This allows us to conclude that there is no multivariate outlier in Xs.

In [None]:
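from sklearn.cluster import KMeans

# Try several numbers of clusters and list the members of each cluster;
# a cluster holding only one or a handful of data objects suggests outliers
for k in range(2,8):
  kmeans = KMeans(n_clusters=k, n_init=10)  # explicit n_init avoids version-dependent defaults
  kmeans.fit(Xs)
  print('k={}'.format(k))
  for i in range(k):
    BM = kmeans.labels_==i
    print('Cluster {}: {}'.format(i,Xs[BM].index.values))
  print('--------- Divider ----------')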

### (4) Time series outlier detection
Outliers in time series data are best detected using **line plots**, the reason being that between consecutive records of a time series there is a close relationship, and using the close relationship is the best way to check the correctness of a record. All you need is to evaluate the value of the record against its closest consecutive records, and that is easily done using line plots.
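
Alongside a line plot, the same idea of checking a record against its closest consecutive records can be sketched numerically. The series `ts`, the window of 5 records, and the threshold of 10 are all hypothetical choices for illustration, not a recommended setting.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with one suspicious spike on the sixth day
ts = pd.Series([20, 21, 19, 22, 20, 95, 21, 20, 22, 19],
               index=pd.date_range('2020-10-01', periods=10, freq='D'))

# Median of a centered 5-record window serves as the "neighborhood" value
neighborhood = ts.rolling(window=5, center=True, min_periods=3).median()

# Flag records far from their neighborhood; the threshold of 10 is an assumption
deviation = (ts - neighborhood).abs()
suspects = ts[deviation > 10]
print(suspects)
```

Calling `ts.plot()` on the same series draws the line plot in which the spike stands out visually.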

## Dealing with outliers
1. **Do nothing**
2. **Replace with the upper cap or the lower cap**
- If the criteria are met, in this approach the outliers are replaced with the correct upper or lower cap.
- We replace the univariate outliers that are much smaller than the rest of the data objects with the attribute's lower cap, `Q1-1.5*IQR`, and replace the univariate outliers that are much larger than the rest of the data objects with the attribute's upper cap, `Q3+1.5*IQR`.
3. **Perform log transformation**
- This approach is not just a method to deal with outliers but is also an effective data transformation technique. When an attribute follows an exponential distribution, it is typical for some of the data objects to be very different from the rest of the population. In those situations, applying a log transformation will be the best approach.
4. **Remove data objects with outliers**
- This is our least favorite approach and should only be used when absolutely necessary. The reason that we would like to avoid this approach is that the data is not incorrect; the values of the outliers are correct but happen to be too different from the rest of the population. It is our analytic tool that is incapable of dealing with the actual population.
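
Approaches 2 and 3 can be sketched on a toy attribute. The series `values` is hypothetical, and using `clip` to apply both caps at once is just one convenient way to express the replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed attribute with one extreme value
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 100.0])

# Approach 2: replace values outside [Q1-1.5*IQR, Q3+1.5*IQR] with the caps
Q1, Q3 = values.quantile(0.25), values.quantile(0.75)
IQR = Q3 - Q1
capped = values.clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)

# Approach 3: log transformation (requires strictly positive values)
logged = np.log(values)

print(capped.max(), logged.max())
```

Capping keeps every data object but pulls the extreme value back to the cap, while the log transformation keeps the ordering of all values and compresses the long right tail.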


### Example 1
---
For instance, if we are interested in seeing the frequency changes where most of the population is between 40 and 100, then a histogram without outliers would be better. On the other hand, if a true representation of the population is our end goal, then a histogram with outliers would be ideal.

In [None]:
plt.figure(figsize=(5,3))
response_df.Weight.plot.hist(histtype='step')
plt.show()

plt.figure(figsize=(5,3))
BM = response_df.Weight<105
response_df.Weight[BM].plot.hist(histtype='step')
plt.show()

### Example 2
---
In this example, we would like to use regression to capture the linear relationship that Height and Gender have with Weight, so that we can predict Weight. In other words, we would like to find the β0, β1, and β2 values in the following equation: <font color='green'>$Weight = \beta_{0} + \beta_{1} \times Height + \beta_{2} \times Gender$</font>

> Regression analysis is sensitive to outliers.

#### 2.1 Dealing with missing values

In [None]:
select_attributes = ['Weight','Height','Gender']
pre_process_df = pd.DataFrame(response_df[select_attributes])
pre_process_df.info()

In [None]:
pre_process_df.dropna(inplace=True)

In [None]:
pre_process_df.info()

#### 2.2 Detecting univariate outliers and dealing with them

In [None]:
num_attributes = ['Weight','Height']
for i,att in enumerate(num_attributes):
    plt.subplot(1,3,i+1)
    pre_process_df[att].plot.box()

plt.subplot(1,3,3)
pre_process_df.Gender.value_counts().plot.bar()
plt.tight_layout()
plt.show()

- When the data objects are **univariate outliers**, it is better to **replace them with their statistical upper cap or lower cap**, as replacing them with the statistical upper or lower cap keeps the data objects in the dataset while mitigating the negative effect of their outlying values.

- On the other hand — and this also applies generally — when the data objects are **bivariate or multivariate outliers**, it would be better to **remove** them. This is because these outliers will not allow the regression model to capture the patterns among the non-outlier data objects.

- In the special case of bivariate outliers whereby the pair of attributes is **categorical-numerical**, it might also be sensible to replace the outlier values with the upper or lower caps of the specific population.

In [None]:
Q3 = pre_process_df.Weight.quantile(0.75)
Q1 = pre_process_df.Weight.quantile(0.25)
IQR = Q3 - Q1

upper_cap = Q3+IQR*1.5

BM = pre_process_df.Weight > upper_cap
pre_process_df.loc[pre_process_df[BM].index,'Weight'] = upper_cap

In [None]:
pre_process_df.Weight.plot.box(figsize=(4,3))

In [None]:
Q3 = pre_process_df.Height.quantile(0.75)
Q1 = pre_process_df.Height.quantile(0.25)
IQR = Q3 - Q1

lower_cap = Q1-IQR*1.5
upper_cap = Q3+IQR*1.5

BM = pre_process_df.Height < lower_cap
pre_process_df.loc[pre_process_df[BM].index,'Height'] = lower_cap

BM = pre_process_df.Height > upper_cap
pre_process_df.loc[pre_process_df[BM].index,'Height'] = upper_cap

In [None]:
pre_process_df.Height.plot.box(figsize=(4,3))

#### 2.3 Detecting bivariate outliers and dealing with them


In [None]:
pre_process_df.plot.scatter(x='Height',y='Weight')

In [None]:
plt.subplot(1,2,1)
sns.boxplot(y=pre_process_df.Height,x=pre_process_df.Gender)

plt.subplot(1,2,2)
sns.boxplot(y=pre_process_df.Weight, x=pre_process_df.Gender)
plt.tight_layout()

- As these outliers are **bivariate** in a pair of **categorical-numerical** attributes, we may **replace** them with the specific population's upper or lower caps.

In [None]:
for poss in pre_process_df.Gender.unique():
  BM = pre_process_df.Gender == poss
  wdf = pre_process_df[BM]
  Q3 = wdf.Height.quantile(0.75)
  Q1 = wdf.Height.quantile(0.25)
  IQR = Q3 - Q1

  lower_cap = Q1-IQR*1.5
  upper_cap = Q3+IQR*1.5

  BM = wdf.Height > upper_cap
  pre_process_df.loc[wdf[BM].index,'Height'] = upper_cap

  BM = wdf.Height < lower_cap
  pre_process_df.loc[wdf[BM].index,'Height'] = lower_cap

In [None]:
for poss in pre_process_df.Gender.unique():
  BM = pre_process_df.Gender == poss
  wdf = pre_process_df[BM]
  Q3 = wdf.Weight.quantile(0.75)
  Q1 = wdf.Weight.quantile(0.25)
  IQR = Q3 - Q1

  lower_cap = Q1-IQR*1.5
  upper_cap = Q3+IQR*1.5

  BM = wdf.Weight > upper_cap
  pre_process_df.loc[wdf[BM].index,'Weight'] = upper_cap

  BM = wdf.Weight < lower_cap
  pre_process_df.loc[wdf[BM].index,'Weight'] = lower_cap

In [None]:
plt.subplot(1,2,1)
sns.boxplot(y=pre_process_df.Height,x=pre_process_df.Gender)

plt.subplot(1,2,2)
sns.boxplot(y=pre_process_df.Weight, x=pre_process_df.Gender)
plt.tight_layout()

#### 2.4 Detecting multivariate outliers and dealing with them
To detect multivariate outliers, the standard method is to use clustering analysis; however, when two of the three attributes are numerical and the other is categorical, we can do outlier detection using a specific visualization technique.

In [None]:
cat_attribute_poss = pre_process_df.Gender.unique()
for i,poss in enumerate(cat_attribute_poss):
  BM = pre_process_df.Gender == poss
  pre_process_df[BM].plot.scatter(x='Height',y='Weight')
  plt.title(poss)
  plt.show()

- Based on the preceding scatterplots, we can conclude that there are **no multivariate outliers** in the data. If there were any, our only choice would be to remove them, as outliers can negatively impact the performance of linear regression (LR). Also, replacing the outliers with upper and lower caps is not an option for multivariate outliers.

#### 2.5 Applying linear regression

In [None]:
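# Encode Gender numerically (male = 0, female = 1) and assign the result back;
# any value other than 'male' or 'female' would become NaN
pre_process_df['Gender'] = pre_process_df['Gender'].map({'male':0,'female':1})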

In [None]:
from sklearn.linear_model import LinearRegression

X = ['Height','Gender']
y = 'Weight'

data_X = pre_process_df[X]
data_y = pre_process_df[y]

lm = LinearRegression()
lm.fit(data_X, data_y)

In [None]:
print('intercept (b0) ', lm.intercept_)
coef_names = ['b1','b2']
print(pd.DataFrame({'Predictor': data_X.columns,
                    'coefficient Name':coef_names,
                    'coefficient Value': lm.coef_}))

- The equation can now predict the individual Weight value based on their Height and Gender values: <font color='green'>$Weight = -51.1038 + 0.7040 \times Height - 8.6020 \times Gender$</font>
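
As a worked example of using this equation, the sketch below plugs in a Height value of 180 for both Gender codes (0 for male, 1 for female, as encoded earlier). The helper `predict_weight` and the chosen Height value are hypothetical and only illustrate the arithmetic; the coefficients are copied from the equation above.

```python
# Coefficients copied from the fitted equation above
b0, b1, b2 = -51.1038, 0.7040, -8.6020

def predict_weight(height, gender):
    """Predicted Weight for a Height value and Gender coded as 0 (male) or 1 (female)."""
    return b0 + b1 * height + b2 * gender

# Hypothetical individuals with Height = 180
print(round(predict_weight(180, 0), 4))  # male
print(round(predict_weight(180, 1), 4))  # female
```

For the same Height, the two predictions differ by exactly the Gender coefficient, 8.6020.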



## Errors
- Errors are an inevitable part of any data collection and measurement. The following formula best captures this fact: <font color='blue'>*Data = True Signal + Error*</font>
- The *True Signal* is the reality we are trying to measure and present in the form of *Data*, but due to the incapability of our measurement system or data presentation, we cannot capture the *True Signal*. Therefore, Error is the difference between the *True Signal* and the recorded *Data*.

## Types of errors
1. **Systematic Errors**: Systematic errors are errors that have a clear cause and can be eliminated for future experiments.
2. **Random Errors**: Random errors occur randomly, and sometimes have no source/cause.
3. **Blunders**: Blunders are simply a clear mistake that causes an error in the experiment.

<font size='2'>*Ref: https://www.expii.com/t/types-of-error-overview-comparison-8112*</font>

<img src="https://drive.google.com/uc?id=1tqjmQ-5ivaatPoWKatNlJgjOrVSeqQZo" width="700"/>

## Dealing with errors
We will deal with errors differently based on their types. **Random errors** are unavoidable and, at best, we may be able to mitigate them using smoothing or aggregation.
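
To illustrate how smoothing or aggregation can mitigate random errors, here is a small simulation sketch following *Data = True Signal + Error*. The constant true signal of 50, the error spread, and the window and block sizes are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical measurements: a constant true signal of 50 plus random error
true_signal = 50.0
data = pd.Series(true_signal + rng.normal(0, 5, size=240))

# Smoothing: moving average over 12 consecutive records
smoothed = data.rolling(window=12).mean()

# Aggregation: average of each block of 24 records (e.g. hourly to daily)
aggregated = data.groupby(data.index // 24).mean()

print('std of raw data:       ', round(data.std(), 2))
print('std after smoothing:   ', round(smoothed.std(), 2))
print('std after aggregation: ', round(aggregated.std(), 2))
```

Averaging lets random errors partly cancel each other out, so the smoothed and aggregated series vary less around the true signal than the raw records do.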

However, **systematic errors** are avoidable, and once recognized, we should always take the following steps in dealing with them:
1. Adjust and improve the data collection so that systematic errors will not happen in the future.
2. Try to use other data resources if available to find the correct value, and if there are none, we will regard the systematic error as a missing value.

## Detecting systematic errors
Detecting systematic errors is not easy, and they are likely to go unnoticed and negatively influence our analysis. Our best chance of detecting systematic errors lies in the techniques we learned in the detecting outliers section. When outliers are detected and there is no explanation for why their values are correct, we can conclude that they are systematic errors.

### Example of systematic error and correct outlier
In this example, we would like to analyze **Customer Entries.xlsx**. The dataset contains about 2 months of customer-visiting data from a local coffee shop between October 1, 2020, and November 24, 2020. The goal of the analysis is to profile the hours of the day to see at which times and days peak customer visits happen.

In [None]:
hour_df = pd.read_excel('Customer Enteries.xlsx')
hour_df.info()

In [None]:
hour_df.head(5)

In [None]:
hour_df.N_Customers.plot()
plt.show()

In [None]:
hour_df[hour_df.N_Customers>20]

To check whether this outlier is a case of a **systematic error** or not, we investigate using our other sources and we realize that nothing out of the ordinary had happened during that day, and this record could simply be a manual data entry error. This shows us that this is a systematic error, and therefore we need to take the following two steps in dealing with systematic errors:
1. *Step 1*: We inform the entity who is in charge of data collection about this mistake and ask them to take appropriate measures to prevent such a mistake from happening in the future.
2. *Step 2*: If we do not have ways to find the correct value using other resources within a reasonable time and effort, we regard the data entry as a missing value and replace it with **np.nan**.

In [None]:
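# Select the erroneous record by label and column name rather than by position
err_index = hour_df[hour_df.N_Customers>20].index
hour_df['N_Customers'] = hour_df['N_Customers'].astype(float)  # NaN needs a float column
hour_df.loc[err_index,'N_Customers'] = np.nan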

In [None]:
hour_df.N_Customers.plot()
plt.show()

In this dataset, time and date have already been separated, so we can perform the following bivariate outlier detection. The best way to perform bivariate outlier detection for a pair of numerical-categorical attributes is to use multiple boxplots.

In [None]:
sns.boxplot(y=hour_df.N_Customers,x=hour_df.Time)
plt.show()

In [None]:
hour_df.query("Time==17 and N_Customers>12")

Looking at the preceding boxplots, we see two other outliers that could be systematic errors. The first one is the **smallest** value of N_Customers, which is **zero**, under the Time value of 17. This value is consistent with the rest of the data: the Time value of 17 (or 5 P.M.) seems to get the fewest customers, and we can imagine occasionally having no customers at that hour.
<br></br>
However, the second flier at the same hour (5 P.M.) seems more troubling. After running hour_df.query("Time==17 and N_Customers>12"), which filters the flier, we can see the outlier has happened on November 17, 2020. After investigation, it turns out that on November 17, 2020 at 4:25, a biking club made a half hour stop for refreshment, which was out of the ordinary for the store. Therefore the data entry was **not erroneous** and just a **correct outlier**.

In [None]:
hour_df.groupby('Time').N_Customers.median().plot.bar()
plt.show()

Drawing a bar chart that shows and compares the **central tendency** of N_Customers per working hour of the coffee shop (Time) will be the visualization we need for this analysis.
<br></br>
This bar chart handles missing values naturally, because the data is aggregated to calculate the central tendencies and missing entries are skipped in that calculation. As we have outliers in the dataset, we chose to use **median** over **mean** as the central tendency for this analysis.
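
The effect of that choice can be seen on a toy example. The DataFrame `toy_df` below is hypothetical; it mimics one missing entry and one correct but extreme record.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts: one correct outlier (25) and one missing value
toy_df = pd.DataFrame({
    'Time': [16, 16, 16, 17, 17, 17, 17],
    'N_Customers': [6, 7, np.nan, 2, 3, 4, 25],
})

# Both aggregations skip the NaN entry by default
grouped = toy_df.groupby('Time')['N_Customers']
summary = pd.DataFrame({'mean': grouped.mean(), 'median': grouped.median()})
print(summary)
```

For the hour with the outlier, the mean is pulled well above the typical values while the median stays close to them, which is why the median is the safer summary here.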