# Distribution and Outlier Detection

## Objective

### The key statistical concepts explored in this notebook are:

- Confidence Interval
- Outlier detection using the standard deviation approach
- Outlier detection using the box plot approach

### Problem Description

A retail chain, ABC, wants to understand the demographic distribution of its customers so that it can design and run a few promotional campaigns. One of the demographic attributes is age. To understand which age groups buy most from its stores, it runs a survey and collects the ages of hundreds of randomly selected customers.

1. Which age group do most of the customers belong to?
2. How many different campaigns might need to be designed?
3. Is any unusual behavior observed (i.e., are customers from rarely expected age groups visiting the stores)?

### Dataset

We will use customer demographics data (only age)

- customer_age_abc.csv

In [None]:
import pandas as pd
import matplotlib as mplot
import matplotlib.pyplot as plt
import seaborn as sn

## Age distribution of company ABC

In [None]:
age_abc_df = pd.read_csv("https://raw.githubusercontent.com/manaranjanp/MLCourseV1/main/Session_1/customer_age_abc.csv")

In [None]:
age_abc_df.sample(5)

### Find min and max values

In [None]:
age_abc_df.age.min(), age_abc_df.age.max()

### Histogram  and Density Plots

In [None]:
plt.hist(age_abc_df.age, bins = range(30, 50, 1));

In [None]:
plt.figure(figsize = (12, 4))
plt.title("Distribution of Customer Ages")
sn.kdeplot(age_abc_df.age);

### Finding Confidence Interval and Outliers

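- Which age span covers 95% of customers?
- Are there any outliers?

Note: with `loc` set to the sample mean and `scale` set to the sample standard deviation, `stats.norm.interval` returns the range expected to contain the given share of individual ages, assuming ages are roughly normally distributed. It is not a confidence interval for the mean, which would use the standard error instead.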

In [None]:
age_abc_df.age.mean(), age_abc_df.age.std()

In [None]:
from scipy import stats

In [None]:
abc_ci_95 = stats.norm.interval(0.95,
                                loc=age_abc_df.age.mean(),
                                scale=age_abc_df.age.std())

In [None]:
abc_ci_95

In [None]:
abc_ci_99_7 = stats.norm.interval(0.997,
                                loc=age_abc_df.age.mean(),
                                scale=age_abc_df.age.std())

In [None]:
abc_ci_99_7
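The two intervals above can be related to the familiar "mean plus or minus k standard deviations" rule. The following is a minimal sketch on a hypothetical, randomly generated sample (not the survey data); the helper name `normal_interval` is an assumption introduced only for illustration. It shows that `stats.norm.interval` with a sample mean and standard deviation is equivalent to stepping about 1.96 standard deviations (95%) or about 3 standard deviations (99.7%) on each side of the mean.

```python
import numpy as np
from scipy import stats

def normal_interval(mean, std, coverage):
    """Return (low, high) covering `coverage` of a normal distribution."""
    # z is the number of standard deviations on each side of the mean
    z = stats.norm.ppf(0.5 + coverage / 2)
    return mean - z * std, mean + z * std

# Hypothetical sample, generated only for illustration
rng = np.random.default_rng(42)
sample_ages = rng.normal(loc=40, scale=3, size=500)

mean = sample_ages.mean()
std = sample_ages.std(ddof=1)  # ddof=1 matches the default of pandas' .std()

for coverage in (0.95, 0.997):
    manual = normal_interval(mean, std, coverage)
    from_scipy = stats.norm.interval(coverage, loc=mean, scale=std)
    z = stats.norm.ppf(0.5 + coverage / 2)
    print(f"coverage={coverage}: z={z:.2f}, manual={manual}, scipy={from_scipy}")
```

Both computations should agree up to floating-point error, which is why the 99.7% interval is often called the 3 standard deviation rule.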

#### Are there any outliers?

In [None]:
# Flag ages outside the 99.7% interval on either side
age_abc_df[(age_abc_df.age < abc_ci_99_7[0]) | (age_abc_df.age > abc_ci_99_7[1])]

## Age distribution of company XYZ

Let's look at a different scenario where the distribution looks very different.

### Dataset

We will use customer demographics data (only age)

- customer_age_xyz.csv

In [None]:
age_xyz_df = pd.read_csv("https://raw.githubusercontent.com/manaranjanp/MLCourseV1/main/Session_1/customer_age_xyz.csv")

In [None]:
age_xyz_df.sample(10)

### Histogram and Density Plots

In [None]:
age_xyz_df.age.min(), age_xyz_df.age.max()

In [None]:
plt.hist(age_xyz_df.age, bins = range(30, 50, 1));

In [None]:
plt.figure(figsize = (12, 4))
plt.title("Distribution of Customer Ages")
sn.kdeplot(age_abc_df.age, label = 'ABC');
sn.kdeplot(age_xyz_df.age, label = 'XYZ');
plt.legend();  # the labels are only shown once a legend is drawn

### Finding Confidence Interval and Outliers

In [None]:
age_xyz_df.age.mean(), age_xyz_df.age.std()

In [None]:
from scipy import stats

In [None]:
xyz_ci_95 = stats.norm.interval(0.95,
                                loc=age_xyz_df.age.mean(),
                                scale=age_xyz_df.age.std())

In [None]:
xyz_ci_95

In [None]:
xyz_ci_99_7 = stats.norm.interval(0.997,
                                loc=age_xyz_df.age.mean(),
                                scale=age_xyz_df.age.std())

In [None]:
xyz_ci_99_7

In [None]:
# Flag ages outside the 99.7% interval on either side
age_xyz_df[(age_xyz_df.age < xyz_ci_99_7[0]) | (age_xyz_df.age > xyz_ci_99_7[1])]

### Note:

- Are there any problems with finding outliers using the 3 standard deviation approach?
- This approach is not very robust, because the mean and standard deviation are themselves very sensitive to outliers.
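The sensitivity mentioned in the note can be illustrated with a small sketch. The ages below are hypothetical values made up for illustration (not the survey data), and the helper name `summarize` is likewise an assumption. The sketch compares the mean and standard deviation against the median and interquartile range before and after a single extreme value is added.

```python
import numpy as np
from scipy import stats

def summarize(values):
    """Return location and spread statistics for a 1-D array."""
    return {
        "mean": values.mean(),
        "std": values.std(ddof=1),   # sample standard deviation
        "median": np.median(values),
        "iqr": stats.iqr(values),    # Q3 - Q1
    }

# Hypothetical ages, made up for illustration
ages = np.array([36, 37, 38, 38, 39, 39, 40, 40, 41, 42], dtype=float)
ages_with_outlier = np.append(ages, 95.0)  # one implausibly old customer

before = summarize(ages)
after = summarize(ages_with_outlier)

for name in before:
    print(f"{name:>6}: before={before[name]:.2f}  after={after[name]:.2f}")
```

In this example the mean and standard deviation move substantially while the median stays put and the IQR changes only slightly, which is the motivation for the quartile-based approach in the next section.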


## Finding outliers using Box Plot

- A boxplot displays the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).

- It can reveal outliers and their values. It can also show whether the data is symmetrical, how tightly the data is grouped, and whether the data is skewed.

- The minimum or lowest value of the dataset


Source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

<img src="box.png" alt="Normal Distribution" width="500"/>
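As a rough illustration of the values a box plot draws, the sketch below computes the quartiles and the whisker ends for a hypothetical set of ages (made up for illustration, not the survey data). The helper name `box_plot_summary` is an assumption. It follows the default convention in matplotlib and seaborn (`whis=1.5`), where each whisker ends at the most extreme data point that still lies within 1.5 x IQR of the box.

```python
import numpy as np

def box_plot_summary(values, whis=1.5):
    """Return the quartiles and whisker ends a default box plot would draw."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers stop at the furthest data points inside the fences
    inside = values[(values >= q1 - whis * iqr) & (values <= q3 + whis * iqr)]
    return {
        "whisker_low": inside.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside.max(),
    }

# Hypothetical ages, made up for illustration
ages = [36, 37, 38, 38, 39, 39, 40, 40, 41, 42, 95]
summary = box_plot_summary(ages)
print(summary)
```

Points beyond the whisker ends (here, the age 95) are the ones drawn as individual markers on the plot.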

In [None]:
plt.figure(figsize=(12,4))
sn.boxplot(x = age_xyz_df.age, orient = 'h' );

#### Finding median, IQR, min and max values

A measure of spread that is less sensitive to outliers than the range is the interquartile range. The interquartile range is calculated by subtracting the first quartile from the third quartile:

**IQR = Q3 – Q1**

Although the interquartile range itself is not much affected by outliers, it can be used to detect them. This is done using these steps:


- Calculate the interquartile range for the data.
- Multiply the interquartile range (IQR) by 1.5 (a constant used to discern outliers).
- Add 1.5 x (IQR) to the third quartile. Any number greater than this is a suspected outlier.
- Subtract 1.5 x (IQR) from the first quartile. Any number less than this is a suspected outlier.

Source: https://www.thoughtco.com/what-is-the-interquartile-range-rule-3126244
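The steps above can be sketched as a small function. This is a minimal illustration on hypothetical ages (made up, not the survey data); the function name `iqr_fences` and the parameter `k` are assumptions introduced for the example.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits beyond which values are suspected outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1                 # step 1: interquartile range
    margin = k * iqr              # step 2: multiply the IQR by 1.5
    return q1 - margin, q3 + margin  # steps 3 and 4: fences around Q1 and Q3

# Hypothetical ages, made up for illustration
ages = np.array([36, 37, 38, 38, 39, 39, 40, 40, 41, 42, 95], dtype=float)

lower, upper = iqr_fences(ages)
outliers = ages[(ages < lower) | (ages > upper)]
print(f"lower={lower}, upper={upper}, suspected outliers={outliers}")
```

Checking both fences matters: a data set can contain unusually low values as well as unusually high ones.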

In [None]:
age_xyz_df.age.median()

In [None]:
age_stats = age_xyz_df.age.describe()
age_stats

In [None]:
from scipy import stats

In [None]:
iqr = stats.iqr(age_xyz_df.age)
iqr

In [None]:
min_age = age_stats['25%'] - 1.5 * iqr
max_age = age_stats['75%'] + 1.5 * iqr

In [None]:
min_age, max_age

### Any outliers?

In [None]:
# Flag ages below the lower limit or above the upper limit
age_xyz_df[(age_xyz_df.age < min_age) | (age_xyz_df.age > max_age)]