## 2021: Week 20 Controlling Complaints

This week's challenge continues the focus on calculations, this time with numbers. When working with measures, it is very easy to make mistakes if you don't check that the values are realistic, especially when entering data or writing calculations. By creating your calculations in your data preparation tool, you can save the users of your data set a lot of work and reduce the skills required to use the data.

### Challenge
Control charts are a really useful way to visualise data, but in Tableau Desktop they often require a few Table Calculations, which puts people off creating them. This week you will build the calculations needed for a control chart without using table calculations in Desktop.

Different people use different numbers of standard deviations to assess whether a data point falls outside the normal range of the distribution. Common choices are:

- 1 standard deviation either side of the mean - in a normal distribution this covers 34.1% of the data on each side of the mean, so 68.2% in total. At this level you start to see interesting outliers.

- 2 standard deviations either side of the mean - this adds another 13.6% of your data on each side, so two standard deviations either side of the mean cover 95.4% of a normal distribution.

- 3 standard deviations either side of the mean - this is the version used in Six Sigma process improvement, which was founded in manufacturing but has spread to other industries too.
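
The coverage figures above can be checked with a few lines of Python. This is a minimal sketch that assumes a perfectly normal distribution; `coverage_within` is a hypothetical helper name, not part of the challenge.

```python
import math


def coverage_within(k: float) -> float:
    """Share of a normal distribution lying within k standard deviations of the mean."""
    # For a normal distribution, P(|X - mean| <= k * sd) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"{k} standard deviation(s): {coverage_within(k):.1%}")
```

The first line prints 68.3% rather than 68.2% because the text doubles the already rounded 34.1%.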

This challenge will ask you to create 3 main calculations after some initial data preparation that will help you form the control chart:

![img](https://1.bp.blogspot.com/-Ue7ROL3TjUE/YKI4m1zF7mI/AAAAAAAACK0/8fCf___V-GoJaMK3LPQ2-WKOcrhbILB9QCLcBGAsYHQ/w640-h248/5.31%2BHow%2Bto%2Bread%2Ba%2Bcontrol%2Bchart.png)

- Mean - the value that shows the average of all the data points being assessed. This can be broken down into different partitions in the data, often by dates or dimensions. 
- Upper Control Limit - this value is 1, 2 or 3 standard deviations above the mean.
- Lower Control Limit - this value is 1, 2 or 3 standard deviations below the mean.

Measuring the difference between the Upper and Lower Control Limits (in this challenge called the Variation) shows how much control there is in the process you are measuring. The smaller the variation, the more controlled the process is.
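
As a rough illustration of how the three values relate, the sketch below computes them for a made-up list of numbers. The function name `control_limits` and the sample values are assumptions for illustration only, and the sample standard deviation is used.

```python
from statistics import mean, stdev


def control_limits(values, n):
    """Return (mean, upper limit, lower limit, variation) for n standard deviations."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (divides by len - 1)
    upper = centre + n * spread
    lower = centre - n * spread
    return centre, upper, lower, upper - lower


values = [10, 12, 14, 16, 18]  # hypothetical data, not from the challenge file
for n in (1, 2, 3):
    centre, upper, lower, variation = control_limits(values, n)
    print(f"n={n}: mean={centre}, upper={upper:.2f}, lower={lower:.2f}, variation={variation:.2f}")
```

Because the mean cancels out, the Variation is always 2 × n × standard deviation, so it depends only on the spread of the data and not on where the data is centred.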

With all the recent delays in Prep Air's projects, we need to focus on the complaints those delays might generate.

### Input
One csv input file:

![img](https://1.bp.blogspot.com/-v9Fs6j-vRg4/YKI84VyZL5I/AAAAAAAACK8/SsGsGekUTtgCKyw_RZ1s3BWtCEo44BlxACLcBGAsYHQ/w640-h270/Screenshot%2B2021-05-17%2Bat%2B10.52.16.png)

### Requirements
- Input the data file
- Create the mean and standard deviation for each Week
- Create the following calculations for each of 1, 2 and 3 standard deviations:
    - The Upper Control Limit (mean+(n*standard deviation))
    - The Lower Control Limit (mean-(n*standard deviation))
    - Variation (Upper Control Limit - Lower Control Limit)
- Join the original data set back on to these results 
- Assess whether each of the complaint values for each Department, Week and Date is within or outside of the control limits
- Output only Outliers
- Produce a separate output worksheet (or csv) for each of 1, 2 and 3 standard deviations, and remove the fields that are irrelevant for that output.
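
The outlier check in the requirements could look something like the sketch below. It assumes a value counts as an outlier when it is strictly above the upper limit or strictly below the lower limit, and the `"Outside"` / `"Inside"` labels and the sample rows are hypothetical; the challenge text does not specify them.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that already carry their week's control limits
joined = pd.DataFrame({
    "Department": ["A", "A", "B"],
    "Complaints": [10, 50, 30],
    "Lower Control Limit": [20.0, 20.0, 20.0],
    "Upper Control Limit": [40.0, 40.0, 40.0],
})

# Assumption: strictly above the upper limit or strictly below the lower limit
outside = (joined["Complaints"] > joined["Upper Control Limit"]) | (
    joined["Complaints"] < joined["Lower Control Limit"]
)

# Label every row, then keep only the outliers for the output
joined["Outlier"] = np.where(outside, "Outside", "Inside")
outliers = joined[outside].reset_index(drop=True)
print(outliers)
```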

### Output
![img](https://1.bp.blogspot.com/-gDjMrX0CWaE/YKJC01vCWDI/AAAAAAAACLM/jf8Aau1bX4Yb1n7HKBjZLRLnQVkRyVujACLcBGAsYHQ/w640-h114/Screenshot%2B2021-05-17%2Bat%2B11.17.36.png)

1 file containing 3 worksheets (if you are using an older version of Prep or another tool, please feel free to output CSVs instead).

Each worksheet should contain 10 fields:

- Variation
- Outlier
- Lower Control Limit
- Upper Control Limit
- Standard Deviation
- Mean
- Date
- Week
- Complaints
- Department

For each of the outputs, here is the number of rows it should have:
- 1 Standard Deviation - 24 rows (25 including headers)
- 2 Standard Deviations - 5 rows (6 including headers)
- 3 Standard Deviations - 2 rows (3 including headers)

![img](https://1.bp.blogspot.com/-MqoFdPegdjg/YKJCFfFqkYI/AAAAAAAACLE/Oa4YzwTCyJU0xTFMK_GzunzcJO0kKw_wgCLcBGAsYHQ/w480-h640/Control%2BChart.png)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Input the data file

In [2]:
df = pd.read_csv("./data/Prep Air Complaints - Complaints per Day.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Date        105 non-null    object
 1   Week        105 non-null    int64 
 2   Complaints  105 non-null    int64 
 3   Department  105 non-null    object
dtypes: int64(2), object(2)
memory usage: 3.4+ KB


In [5]:
df

|     | Date       | Week | Complaints | Department         |
|-----|------------|------|------------|--------------------|
| 0   | 19/04/2021 | 16   | 42         | Ticketing          |
| 1   | 20/04/2021 | 16   | 32         | Ticketing          |
| 2   | 21/04/2021 | 16   | 51         | Ticketing          |
| 3   | 22/04/2021 | 16   | 48         | Ticketing          |
| 4   | 23/04/2021 | 16   | 34         | Ticketing          |
| ... | ...        | ...  | ...        | ...                |
| 100 | 19/05/2021 | 20   | 14         | Airport Experience |
| 101 | 20/05/2021 | 20   | 19         | Airport Experience |
| 102 | 21/05/2021 | 20   | 23         | Airport Experience |
| 103 | 22/05/2021 | 20   | 20         | Airport Experience |


### Create the mean and standard deviation for each Week

In [8]:
mean_week = df.groupby(["Week"])["Complaints"].mean()
mean_week

Week
16    29.714286
17    37.809524
18    53.333333
19    72.476190
20    28.904762
Name: Complaints, dtype: float64

### Create the following calculations for each of 1, 2 and 3 standard deviations:
- The Upper Control Limit (mean+(n*standard deviation))
- The Lower Control Limit (mean-(n*standard deviation))
- Variation (Upper Control Limit - Lower Control Limit)

In [10]:
# Check the calculation for 1 standard deviation
standard_deviation = df.groupby(["Week"])["Complaints"].std()
standard_deviation

Week
16    12.958174
17    16.621128
18     9.446340
19    37.661146
20    12.234806
Name: Complaints, dtype: float64
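
Note that pandas' `.std()` defaults to the sample standard deviation (`ddof=1`); pass `ddof=0` if the population version is wanted. Which one the challenge expects is not stated here, so treat that choice as an assumption. The sketch below computes the mean and standard deviation in one step on a small made-up frame (`sample` and `weekly` are hypothetical names).

```python
import pandas as pd

# Hypothetical miniature of the complaints data
sample = pd.DataFrame({
    "Week": [16, 16, 16, 17, 17, 17],
    "Complaints": [10, 12, 14, 20, 30, 40],
})

# Mean and sample standard deviation (ddof=1) per Week in a single pass
weekly = (
    sample.groupby("Week")["Complaints"]
    .agg(["mean", "std"])
    .rename(columns={"mean": "Mean", "std": "Standard Deviation"})
    .reset_index()
)
print(weekly)

# Population standard deviation for comparison
population_std = sample.groupby("Week")["Complaints"].std(ddof=0)
```

Keeping both statistics in one frame keyed by `Week` makes the later join back to the daily rows simpler than carrying two separate Series.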

In [11]:
upper_limit = mean_week + standard_deviation
upper_limit

Week
16     42.672460
17     54.430652
18     62.779673
19    110.137336
20     41.139568
Name: Complaints, dtype: float64

In [12]:
lower_limit = mean_week - standard_deviation
lower_limit

Week
16    16.756111
17    21.188396
18    43.886994
19    34.815045
20    16.669956
Name: Complaints, dtype: float64

In [13]:
# Helper: upper limit, lower limit and variation per Week for n standard deviations
def calculate_three_values(df_, n):
    mean_week = df_.groupby(["Week"])["Complaints"].mean()
    standard_deviation = df_.groupby(["Week"])["Complaints"].std()
    upper_limit = mean_week + (standard_deviation * n)
    lower_limit = mean_week - (standard_deviation * n)
    variation = upper_limit - lower_limit
    return upper_limit, lower_limit, variation

In [15]:
upper, lower, variation = calculate_three_values(df, 1)
upper_2, lower_2, variation_2 = calculate_three_values(df, 2)
upper_3, lower_3, variation_3 = calculate_three_values(df, 3)
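
An alternative to unpacking nine separate Series is to return one table per value of `n` and keep the tables in a dictionary. The sketch below is one possible variant, not the notebook's method; `limits_table`, `limits_by_n` and the sample data are hypothetical.

```python
import pandas as pd


def limits_table(df_, n):
    """Hypothetical variant of calculate_three_values: one tidy frame per n."""
    stats = df_.groupby("Week")["Complaints"].agg(["mean", "std"])
    out = pd.DataFrame({
        "Mean": stats["mean"],
        "Standard Deviation": stats["std"],
        "Upper Control Limit": stats["mean"] + n * stats["std"],
        "Lower Control Limit": stats["mean"] - n * stats["std"],
    })
    out["Variation"] = out["Upper Control Limit"] - out["Lower Control Limit"]
    return out.reset_index()  # turn the Week index into a column, ready to join on


# Made-up data standing in for the real complaints file
sample = pd.DataFrame({
    "Week": [16, 16, 16, 17, 17, 17],
    "Complaints": [10, 12, 14, 20, 30, 40],
})

limits_by_n = {n: limits_table(sample, n) for n in (1, 2, 3)}
print(limits_by_n[2])
```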

### Join the original data set back on to these results
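
One way to do this join is `DataFrame.merge` on `Week`. The sketch below uses made-up rows and limit values purely for illustration; `complaints`, `limits` and `joined` are hypothetical names, and it assumes the limits are held in a frame with one row per week.

```python
import pandas as pd

# Hypothetical daily rows in the same shape as the input file
complaints = pd.DataFrame({
    "Date": ["19/04/2021", "20/04/2021", "26/04/2021"],
    "Week": [16, 16, 17],
    "Complaints": [42, 32, 51],
    "Department": ["Ticketing", "Ticketing", "Ticketing"],
})

# Hypothetical per-week limits (values are made up for illustration)
limits = pd.DataFrame({
    "Week": [16, 17],
    "Mean": [30.0, 40.0],
    "Standard Deviation": [5.0, 10.0],
    "Upper Control Limit": [35.0, 50.0],
    "Lower Control Limit": [25.0, 30.0],
    "Variation": [10.0, 20.0],
})

# Left join keeps every daily row; validate raises if a week appears twice in limits
joined = complaints.merge(limits, on="Week", how="left", validate="many_to_one")
print(joined)
```

Using `validate="many_to_one"` is a cheap guard against accidentally duplicating daily rows if the limits table ever contains more than one row per week.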