# Dealing with outliers

This COVID-19 example of outlier identification is adapted from [Python Data Cleaning Cookbook: Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI 2nd ed](https://www.amazon.com/Python-Data-Cleaning-Cookbook-insights/dp/1803239875), by Michael Walker.
## Identifying outliers
### Summary statistics

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

covid_totals = pd.read_csv("data/covidtotals.csv")

# Identifier columns plus the cumulative case and death columns
total_vars = ['iso_code','location', 'total_cases', 'total_deaths', 'total_cases_pm', 'total_deaths_pm']

covid_totals.info()

In [None]:
# Get descriptive statistics for the COVID-19 case data.
# Create a DataFrame with just the key case data:
covid_totals_only = covid_totals[total_vars].copy()
covid_totals_only.describe()

In [None]:
# Show more detailed percentile data.
# We indicate that we only want to do this for numeric values so that the location column is skipped:
covid_totals_only.quantile(np.arange(0.0, 1.1, 0.1),
                           numeric_only=True)

In [None]:
# After sort_values(), head() shows the smallest values and tail() shows the largest
covid_totals_only['total_cases'].sort_values().head()

In [None]:
# Skewness describes how symmetrical the distribution is; kurtosis describes how fat its tails are.
# Both measures, for total_cases and total_deaths, are significantly higher than we would expect
# if our variables were normally distributed:
covid_totals_only.skew(numeric_only=True)

In [None]:
# Note about kurtosis: there are two common definitions:
# (1) Pearson kurtosis: a normal distribution has kurtosis = 3
# (2) Fisher (excess) kurtosis: a normal distribution has kurtosis = 0. This subtracts 3 from Pearson’s value
# pandas returns Fisher’s (excess) kurtosis.
covid_totals_only.kurt(numeric_only=True)
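To make the two kurtosis conventions concrete, here is a minimal sketch on synthetic data (not the COVID data). The sample sizes, the seed, and the variable names are assumptions chosen for illustration. A large normal sample should give skewness and excess kurtosis close to 0, while a log-normal sample should be clearly right-skewed with fat tails.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

# Synthetic samples (assumptions for illustration, not the COVID data)
normal_sample = pd.Series(rng.normal(size=100_000))
skewed_sample = pd.Series(rng.lognormal(size=100_000))

# pandas reports Fisher (excess) kurtosis; add 3 to get Pearson kurtosis
excess_kurtosis = normal_sample.kurt()
pearson_kurtosis = excess_kurtosis + 3

print(f"normal:    skew={normal_sample.skew():.2f}, "
      f"excess kurtosis={excess_kurtosis:.2f}, "
      f"Pearson kurtosis={pearson_kurtosis:.2f}")
print(f"lognormal: skew={skewed_sample.skew():.2f}, "
      f"excess kurtosis={skewed_sample.kurt():.2f}")
```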

### Visual detection
Two options: histograms or box plots.
#### Histogram


In [None]:
# Plot one histogram
covid_totals_only['total_cases'].plot.hist(bins=10, rwidth=0.9, figsize=(6, 4))

In [None]:
# Or plot all four variables in one figure

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

for col, ax in zip(total_vars[2:], axes.flat):
    covid_totals_only[col].plot.hist(bins=10, rwidth=0.9, ax=ax)
    ax_title = (f"{col}\n"
                f"skew={covid_totals_only[col].skew():.2f}\n"
                f"kurtosis={covid_totals_only[col].kurt():.2f}")
    ax.text(x=0.95, y=0.95, s=ax_title,
            ha='right', va='top', transform=ax.transAxes)


#### Boxplot
[Pandas box plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html):
Make a box-and-whisker plot from DataFrame columns, optionally grouped by some other columns. A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of box to show the range of the data. By default, they extend no more than 1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest data point within that interval. Outliers are plotted as separate dots.
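The whisker rule described above can be checked by hand. The following is a minimal sketch on a made-up series (the values and variable names are assumptions for illustration). It assumes the default linear-interpolation quantiles in pandas. It computes the quartiles, the 1.5 * IQR fences, the whisker ends (the farthest data points inside the fences), and the points that would be drawn as separate dots.

```python
import pandas as pd

# Made-up values for illustration; 30 is far from the rest
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the edges of the box
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Whiskers end at the farthest data points that are still inside the fences
inside = values[(values >= low_fence) & (values <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are plotted as separate dots
fliers = values[(values < low_fence) | (values > high_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: {low_fence} <--> {high_fence}")
print(f"whiskers: {whisker_low} <--> {whisker_high}")
print(f"points beyond the whiskers: {fliers.tolist()}")
```

Note that the whisker ends are actual data points, so they usually sit inside the fences rather than exactly on them.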

In [None]:

covid_totals_only.plot.box(figsize=(8, 5))

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

for col, ax in zip(total_vars[2:], axes.flat):
    covid_totals_only[col].plot.box(ax=ax)
    ax.text(x=0.95, y=0.95, s=col,
            ha='right', va='top', transform=ax.transAxes)

## Interquartile Range (IQR) Method
### Identify outliers using IQR with one column

In [None]:
# Q1 and Q3 for total_cases
first_quartile = covid_totals_only['total_cases'].quantile(0.25)
third_quartile = covid_totals_only['total_cases'].quantile(0.75)
# Distance from the box edges to the fences: 1.5 times the interquartile range (Q3 - Q1)
fence_distance = 1.5 * (third_quartile - first_quartile)
outlier_low, outlier_high = first_quartile - fence_distance, third_quartile + fence_distance
print(f"The low and high fences for total_cases are:\n{outlier_low} <--> {outlier_high}")


### Write utility functions to identify the outliers in each column

In [None]:
def find_outliers_in_one_column(s: pd.Series) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the 1.5 * IQR fences."""
    q3, q1 = s.quantile(0.75), s.quantile(0.25)
    iqr = q3 - q1
    high_bound, low_bound = q3 + 1.5 * iqr, q1 - 1.5 * iqr
    return (s < low_bound) | (s > high_bound)

In [None]:
find_outliers_in_one_column(covid_totals_only['total_cases'])

In [None]:

print("Number of outliers using IQR method:")
for col in total_vars[2:]:
    # All four columns are numeric, so no exception handling is needed;
    # silently swallowing errors would hide real problems.
    outliers = find_outliers_in_one_column(covid_totals_only[col]).sum()
    print(f"{col}: {outliers}")

In [None]:
# Flag a row if at least one of the four numerical variables is an outlier
from functools import reduce

numerical_vars = total_vars[2:]

outlier_labels = [find_outliers_in_one_column(covid_totals_only[var]) for var in numerical_vars]

covid_totals_only['outlier_label'] = reduce(lambda x, y: x | y, outlier_labels)
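The same row-level label can also be built without `reduce`, by applying the mask function to every column and then calling `any(axis=1)`. This is a minimal sketch on a toy DataFrame; the helper name `iqr_outlier_mask`, the column names, and the values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series) -> pd.Series:
    """Hypothetical helper: True where a value lies outside the 1.5 * IQR fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)


# Toy data for illustration: one high outlier in "a", one low outlier in "c"
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [10, 11, 12, 13, 14],
    "c": [-50, 5, 6, 7, 8],
})

# One boolean column per variable, then combine across columns row by row
masks = toy.apply(iqr_outlier_mask)
row_has_outlier = masks.any(axis=1)

print(masks.sum())  # number of outliers per column
print(toy[row_has_outlier])  # rows with at least one outlier
```

Keeping the per-column `masks` DataFrame around also makes it easy to see which variable caused a row to be flagged.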


In [None]:
covid_totals_only['outlier_label'].sum()

In [None]:
# inspect the outliers

outliers_df = covid_totals_only[covid_totals_only['outlier_label']].copy()

## Z-score method
After inspecting the variables, we found that life_expectancy and hosp_beds are relatively close to normally distributed, so let's use those two variables as an example of finding outliers with z-scores.

You can inspect the distributions in PyCharm's data view, which already shows a histogram at the top of each column.

In [None]:
vars_used = ['hosp_beds', 'life_expectancy']

# Both columns contain missing values. pandas operations such as mean() and std() skip NaN by default,
# so we can leave them alone here: this section deals with outliers, not missing values.


def find_outliers_using_z_score(s: pd.Series) -> pd.Series:
    mean_value = s.mean()
    std_value = s.std()
    z_scores = (s-mean_value) / std_value
    return z_scores.abs() > 3


In [None]:
find_outliers_using_z_score(covid_totals['hosp_beds'])
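To see how the z-score rule treats extreme and missing values, here is a minimal sketch on synthetic data. The helper name `z_score_outlier_mask`, the `threshold` parameter, the seed, and the generated values are assumptions for illustration. The series holds 200 roughly normal values, one extreme value, and one NaN.

```python
import numpy as np
import pandas as pd


def z_score_outlier_mask(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Hypothetical helper: True where |z-score| exceeds the threshold."""
    z_scores = (s - s.mean()) / s.std()  # mean() and std() skip NaN by default
    return z_scores.abs() > threshold


rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only

# Synthetic data: 200 roughly normal values, one extreme value, one missing value
normal_values = rng.normal(loc=70, scale=5, size=200)
toy = pd.Series(np.concatenate([normal_values, [150.0, np.nan]]))

mask = z_score_outlier_mask(toy)

print(f"flagged values: {toy[mask].tolist()}")
print(f"NaN flagged as outlier: {bool(mask.iloc[-1])}")
```

A NaN z-score compares as False against the threshold, so missing values are never flagged. Note also that the extreme value inflates the mean and standard deviation it is measured against, which is one reason the z-score method suits roughly normal variables best.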

In [None]:
# Keep the rows where either variable is a z-score outlier
outliers = covid_totals[find_outliers_using_z_score(covid_totals['hosp_beds']) | find_outliers_using_z_score(covid_totals['life_expectancy'])]