### Importing modules

In [None]:
import numpy as np
import pandas as pd
from configs.paths import FILE_PATH_RAW
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

### Configurations

In [None]:
target_col = 'Revenue' # categorical variable

num_cols = ['Administrative',
            'Administrative_Duration', 
            'Informational',
            'Informational_Duration', 
            'ProductRelated', 
            'ProductRelated_Duration',
            'BounceRates', 
            'ExitRates', 
            'PageValues', 
            'SpecialDay']

cat_cols = ['Month',
            'OperatingSystems',
            'Browser',
            'Region',
            'TrafficType',
            'VisitorType',
            'Weekend']

### Loading and preparing data

In [None]:
df = pd.read_csv(FILE_PATH_RAW)
df.head()

We'll split the original dataset into two parts:
* `df_current`: we'll pretend this is the dataset we used to train the model currently in production
* `df_new`: we'll pretend this is the new data arriving

In [None]:
df_current = df.sample(frac=0.7, random_state=1)
df_new = df.drop(df_current.index)

print(f'Original df shape: {df.shape}')
print(f'Current df shape: {df_current.shape}')
print(f'New data df shape: {df_new.shape}')

### Introduction to model drift simulation

**Scenario:**
1. During the upstream data cleansing process, visitors with certain `OperatingSystems` values (1, 2) lose their `VisitorType` entries.
2. During the upstream data generation procedure, the month "June" in `Month` is wrongly inserted as "Jul".
3. Pretending that the new data refers to a pandemic period, we imagine that `Informational_Duration` and `ProductRelated_Duration` double on normal days (`SpecialDay`=0).
  
**What are we simulating here?**
* Feature drift
* Upstream data errors

### Simulating model drift

In [None]:
df_1_err = df_new.copy()

# 1) null VisitorType for OperatingSystems like 1,2:
df_1_err.loc[df_1_err["OperatingSystems"].isin([1,2]), "VisitorType"] = np.nan

# 2) "June" --> "Jul":
df_1_err.loc[df_1_err["Month"]=="June", "Month"] = "Jul"

# 3) double Informational_Duration and ProductRelated_Duration when SpecialDay == 0:
normal_days = df_1_err["SpecialDay"] == 0
duration_cols = ["Informational_Duration", "ProductRelated_Duration"]
df_1_err.loc[normal_days, duration_cols] *= 2

df_1_err.head()

### Feature checks prior to model training

**1. All features**
* Null checks

**2. Numeric features**
* Summary statistic checks: mean, median, standard deviation, minimum, maximum
* Distribution checks

**3. Categorical features**
* Check expected count for each level
* Check the mode

#### Set threshold
You can fine-tune these thresholds to control the number of (false) alarms.

In [None]:
null_proportion_threshold = 0.3     # maximum proportion of nulls allowed per feature
stats_threshold_limit = 0.5         # maximum shift allowed in the basic summary stats
p_threshold = 0.05                  # p-value below which to reject the null hypothesis

### Task 1 - Null checks

Write a piece of code that returns any features of `df_1_err` that exceed the specified null threshold (`null_proportion_threshold`).

In [None]:
print("CHECKING PROPORTION OF NULLS.....\n")

# INSERT YOUR CODE HERE
#
#
#
#
#
#
#
#
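
One possible way to approach the null check, shown as a minimal sketch rather than the reference solution. The helper name `null_proportion_alarms` and the toy frame `toy_new` are hypothetical and only stand in for `df_1_err`:

```python
import numpy as np
import pandas as pd


def null_proportion_alarms(df: pd.DataFrame, threshold: float) -> pd.Series:
    """Return the null proportion of every column that exceeds `threshold`."""
    null_share = df.isna().mean()  # fraction of missing values per column
    return null_share[null_share > threshold]


# Toy stand-in for df_1_err (hypothetical values, not the real dataset)
toy_new = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", np.nan, np.nan, "New_Visitor"],
    "BounceRates": [0.0, 0.2, 0.1, 0.05],
})

for col, share in null_proportion_alarms(toy_new, threshold=0.3).items():
    print(f"ALARM: {col} has {share:.0%} nulls (threshold {0.3:.0%})")
```

In the notebook itself, the same helper could be called as `null_proportion_alarms(df_1_err, null_proportion_threshold)`.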

### Task 2 - Numeric features --> Summary statistic checks

* Check whether the summary stats of the new data (`df_1_err`) deviate significantly from those of the production data (`df_current`), using the threshold `stats_threshold_limit`.
* Consider all the metrics defined below in the `statistic_list` variable.


For example, the code should print an alarm if, for a numeric variable "var1":<br>
mean(new_data.var1) < or > mean(prod_data.var1) x `stats_threshold_limit` <br>
The printed message should also contain the absolute value of that metric.



In [None]:
statistic_list = ["mean", "median", "std", "min", "max"]

# INSERT YOUR CODE HERE
#
#
#
#
#
#
#
#
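
A minimal sketch of the summary statistic check. It assumes one particular reading of the rule above: a statistic raises an alarm when the new value falls outside the band `ref * (1 ± limit)`. The helper `summary_stat_alarms` and the toy frames are hypothetical:

```python
import pandas as pd


def summary_stat_alarms(df_ref, df_new, cols, statistic_list, limit):
    """List (column, statistic, ref value, new value) where the shift exceeds `limit`."""
    ref_stats = df_ref[cols].agg(statistic_list)  # rows: statistics, columns: features
    new_stats = df_new[cols].agg(statistic_list)
    alarms = []
    for col in cols:
        for stat in statistic_list:
            ref_val = ref_stats.loc[stat, col]
            new_val = new_stats.loc[stat, col]
            # Assumed rule: tolerate values within ref * (1 - limit) .. ref * (1 + limit).
            # sorted() keeps the band valid when ref_val is negative.
            low, high = sorted([ref_val * (1 - limit), ref_val * (1 + limit)])
            if not (low <= new_val <= high):
                alarms.append((col, stat, ref_val, new_val))
    return alarms


# Toy stand-ins for df_current and df_1_err (hypothetical values)
toy_ref = pd.DataFrame({"ProductRelated_Duration": [10.0, 20.0, 30.0],
                        "ExitRates": [0.1, 0.2, 0.3]})
toy_new = pd.DataFrame({"ProductRelated_Duration": [20.0, 40.0, 60.0],
                        "ExitRates": [0.1, 0.2, 0.3]})

alarms = summary_stat_alarms(toy_ref, toy_new, list(toy_ref.columns),
                             ["mean", "median", "std", "min", "max"], limit=0.5)
for col, stat, ref_val, new_val in alarms:
    print(f"ALARM: {stat} of {col} moved from {ref_val:.2f} to {new_val:.2f}")
```

Note that with this rule a reference statistic of exactly 0 tolerates no change at all, so features whose minimum is 0 may need an absolute tolerance instead.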

For each feature that exceeds the `stats_threshold_limit` for at least one metric, plot its distribution (e.g. boxplot) for both datasets in order to compare them.

In [None]:
print(f"Let's look at the box plots of the features that exceed the stats_threshold_limit of {stats_threshold_limit}:\n")

# INSERT YOUR CODE HERE
#
#
#
#
#
#
#
#

### Task 2 - Numeric features --> Distribution checks

* Perform the Levene test (https://en.wikipedia.org/wiki/Levene%27s_test) to check whether each column's variance in the new data (`df_1_err`) is significantly different from the production data (`df_current`)
* Also perform the Kolmogorov-Smirnov test (https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test)

For both tests, print all the features that differ significantly between the two datasets, comparing the obtained p-value with `p_threshold`. <br>
You can use the Python module `scipy.stats` to implement the tests.


In [None]:
print("\nCHECKING VARIANCES WITH LEVENE TEST.....\n")

# INSERT YOUR CODE HERE
#
#
#
#
#
#
#
#
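
A sketch of the variance check using `scipy.stats.levene`, under the assumption that nulls are simply dropped before testing. The helper `levene_alarms` and the synthetic columns `stable` and `wider` are hypothetical:

```python
import numpy as np
import pandas as pd
from scipy import stats


def levene_alarms(df_ref, df_new, cols, p_threshold):
    """Map each column whose variance differs significantly to its Levene p-value."""
    alarms = {}
    for col in cols:
        _, p_value = stats.levene(df_ref[col].dropna(), df_new[col].dropna())
        if p_value < p_threshold:
            alarms[col] = p_value
    return alarms


# Synthetic data: "stable" is identical in both frames, "wider" triples its spread
rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 500)
toy_ref = pd.DataFrame({"stable": baseline, "wider": rng.normal(0, 1, 500)})
toy_new = pd.DataFrame({"stable": baseline, "wider": rng.normal(0, 3, 500)})

for col, p_value in levene_alarms(toy_ref, toy_new, ["stable", "wider"], 0.05).items():
    print(f"ALARM: variance of {col} differs (p-value = {p_value:.3g})")
```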


In [None]:
print("\nCHECKING KS TEST.....\n")

# INSERT YOUR CODE HERE
#
#
#
#
#
#
#
#
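
The two-sample Kolmogorov-Smirnov check can be sketched with `scipy.stats.ks_2samp`, which compares the empirical distributions of the two samples. The helper `ks_alarms` and the synthetic columns are hypothetical:

```python
import numpy as np
import pandas as pd
from scipy import stats


def ks_alarms(df_ref, df_new, cols, p_threshold):
    """Map each column whose distribution differs significantly to its KS p-value."""
    alarms = {}
    for col in cols:
        result = stats.ks_2samp(df_ref[col].dropna(), df_new[col].dropna())
        if result.pvalue < p_threshold:
            alarms[col] = result.pvalue
    return alarms


# Synthetic data: "same" is identical in both frames, "shifted" moves its mean by 2
rng = np.random.default_rng(1)
shared = rng.normal(0, 1, 500)
ks_ref = pd.DataFrame({"same": shared, "shifted": rng.normal(0, 1, 500)})
ks_new = pd.DataFrame({"same": shared, "shifted": rng.normal(2, 1, 500)})

for col, p_value in ks_alarms(ks_ref, ks_new, ["same", "shifted"], 0.05).items():
    print(f"ALARM: distribution of {col} differs (p-value = {p_value:.3g})")
```

With many rows, both tests tend to flag even small shifts, so it can help to read the alarms alongside the summary statistic check.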

### Task 3 - Categorical features

* Check whether the number of levels of each categorical variable differs between the incoming data and the data in production.
* Check whether the mode of each categorical variable has changed.
* Perform a chi-square test with `p_threshold` (Python module `scipy.stats`).

In [None]:
print("\nCHECKING if different number of levels.....\n")

# INSERT YOUR CODE HERE
#
#
#
#
#
#
#
#
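
A sketch of the level check. It compares the sets of levels rather than only their counts, which is a slightly stricter assumption than the task states: it also catches a level being replaced by another one. The helper `level_changes` and the toy frames are hypothetical:

```python
import pandas as pd


def level_changes(df_ref, df_new, cols):
    """Report, per column, levels that disappeared or newly appeared."""
    changes = {}
    for col in cols:
        ref_levels = set(df_ref[col].dropna().unique())
        new_levels = set(df_new[col].dropna().unique())
        if ref_levels != new_levels:
            changes[col] = {
                "n_ref": len(ref_levels),
                "n_new": len(new_levels),
                "missing": ref_levels - new_levels,      # in production, not in new data
                "unexpected": new_levels - ref_levels,   # in new data, not in production
            }
    return changes


# Toy stand-ins for df_current and df_1_err (hypothetical values)
lvl_ref = pd.DataFrame({"Month": ["May", "June", "Jul"], "Weekend": [True, False, True]})
lvl_new = pd.DataFrame({"Month": ["May", "Jul", "Jul"], "Weekend": [False, True, True]})

for col, info in level_changes(lvl_ref, lvl_new, ["Month", "Weekend"]).items():
    print(f"ALARM: {col} has {info['n_new']} levels instead of {info['n_ref']}; "
          f"missing={info['missing']}, unexpected={info['unexpected']}")
```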

In [None]:
print("\nCHECKING if mode has changed.....\n")

# INSERT YOUR CODE HERE
#
#
#
#
#
#
#
#
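
A sketch of the mode check. `Series.mode` can return several values when there is a tie, so this version compares the full sets of modes; that handling of ties is an assumption. The helper `mode_changes` and the toy frames are hypothetical:

```python
import pandas as pd


def mode_changes(df_ref, df_new, cols):
    """Map each column whose mode(s) changed to a (reference modes, new modes) pair."""
    changes = {}
    for col in cols:
        ref_modes = set(df_ref[col].mode(dropna=True))
        new_modes = set(df_new[col].mode(dropna=True))
        if ref_modes != new_modes:
            changes[col] = (ref_modes, new_modes)
    return changes


# Toy stand-ins for df_current and df_1_err (hypothetical values)
mode_ref = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "Returning_Visitor", "New_Visitor"],
    "Weekend": [False, False, True],
})
mode_new = pd.DataFrame({
    "VisitorType": ["New_Visitor", "New_Visitor", "Returning_Visitor"],
    "Weekend": [False, True, False],
})

for col, (old, new) in mode_changes(mode_ref, mode_new, ["VisitorType", "Weekend"]).items():
    print(f"ALARM: mode of {col} changed from {old} to {new}")
```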

In [None]:
print("\nCHECKING CHI-SQUARED TEST.....\n")

# INSERT YOUR CODE HERE
#
#
#
#
#
#
#
#
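
One way to sketch the chi-squared check is a test of homogeneity: build a contingency table with one column per dataset and one row per level, then pass it to `scipy.stats.chi2_contingency`. This formulation is an assumption (a goodness-of-fit test against the production frequencies would be another option), and the helper `chi2_alarms` and toy frames are hypothetical:

```python
import pandas as pd
from scipy import stats


def chi2_alarms(df_ref, df_new, cols, p_threshold):
    """Map each column whose level frequencies differ significantly to its p-value."""
    alarms = {}
    for col in cols:
        # Level counts side by side; levels absent from one dataset count as 0.
        # value_counts() ignores nulls, which the null check covers separately.
        counts = pd.concat(
            [df_ref[col].value_counts(), df_new[col].value_counts()],
            axis=1, keys=["current", "new"],
        ).fillna(0)
        p_value = stats.chi2_contingency(counts.to_numpy())[1]
        if p_value < p_threshold:
            alarms[col] = p_value
    return alarms


# Toy stand-ins for df_current and df_1_err (hypothetical values)
chi_ref = pd.DataFrame({"Month": ["May"] * 50 + ["June"] * 50,
                        "Weekend": [True] * 30 + [False] * 70})
chi_new = pd.DataFrame({"Month": ["May"] * 50 + ["Jul"] * 50,
                        "Weekend": [True] * 30 + [False] * 70})

for col, p_value in chi2_alarms(chi_ref, chi_new, ["Month", "Weekend"], 0.05).items():
    print(f"ALARM: level frequencies of {col} differ (p-value = {p_value:.3g})")
```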

For each categorical feature flagged by the previous checks, plot the frequency distribution of its categories (e.g. barplot) for both datasets in order to compare them.

In [None]:
print("\nVisualizing categories frequency distribution for features found in the previous checks:\n")

# INSERT YOUR CODE HERE
#
#
#
#
#
#
#
#