## Statistical Anomaly Detection
This is a framework for statistical anomaly detection on quantitative historical data, demonstrated here on randomized web traffic data. In this instance, I break down the data by site and calculate each site's outlier bounds as its mean plus or minus a user-provided Z-Score multiple of its standard deviation.
Edited and compiled by Kennon Stewart.

In [1]:
# importing necessary data and statistical packages
import pandas as pd
import numpy as np
from datetime import datetime
%autosave 30

Autosaving every 30 seconds


In [2]:
# paths will vary by user
path = r'C:\Users\stewart\Downloads\data.csv'

### Data Cleaning
The only data cleaning we'll do is straightening out the CVR column, since the .csv stores its values as strings with a trailing percent sign. Converting them to floats allows us to perform our calculations.

In [3]:
# I use the clean function to get rid of the percentages and return a float
def clean(n):
    return float(n[:-1])

In [4]:
# Maybe put a graph here to better visualize your point?
# But make it Amazon-specific.
cvr = pd.read_csv(path)
cvr.CVR = cvr.CVR.apply(clean)
# Initialized as a float so the standard deviations assigned later match the column type
cvr['Std Dev'] = 0.0
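
As a quick check of the cleaning step, the sketch below applies the same conversion to a small made-up table. The `toy` DataFrame and its values are hypothetical and are not drawn from data.csv. The vectorized `str.rstrip` form is one possible alternative to `clean`, assuming every CVR entry is a string ending in a percent sign.

```python
import pandas as pd

# Hypothetical sample rows; the real file is read from `path`
toy = pd.DataFrame({
    'Site': ['a.com', 'a.com', 'b.com'],
    'CVR': ['12.5%', '30.9%', '7%'],
})

# Strip the trailing percent sign and convert the column to floats
toy['CVR'] = toy['CVR'].str.rstrip('%').astype(float)
print(toy['CVR'].tolist())
```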

### Calculations
The calculations have four parts. First, the threshold function calculates and returns the upper and lower bounds for the site selected in its first indented line. Next, the calc function stores those bounds as key-value pairs in a dictionary keyed by site. The comp function then compares a CVR value with its site's thresholds and returns the QC code. Finally, the outliers function takes a Z-Score as input, adds a QC column to the original CVR table as a binary categorical variable, saves the table to a .csv file, and returns the outlier rows. A QC value of 0 codes to outlier.
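
Written out, and assuming the population standard deviation (the default of `np.std`), the bounds for a site $s$ with mean CVR $\mu_s$, standard deviation $\sigma_s$, and user-provided Z-Score $n$ are:

$$\text{lower}_s = \mu_s - n\,\sigma_s, \qquad \text{upper}_s = \mu_s + n\,\sigma_s$$

A CVR value $x$ from site $s$ has Z-Score $z = (x - \mu_s)/\sigma_s$, so $x$ falls outside the bounds exactly when $|z| > n$.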

In [5]:
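# A first, minimal version of the bounds calculation (note the spelling: treshold)
# calc uses the fuller threshold function defined in the next cell
# INPUTS:
# i: site for calculation
# n: Z-Score (input)
def treshold(i,n):
    df = cvr[cvr.Site==i]
    me = np.mean(df.CVR)
    std = np.std(df.CVR)
    lower_bound = me-(n*std)
    upper_bound = me+(n*std)
    return lower_bound, upper_bound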

In [6]:
# INPUTS:
# i: site for calculation
# n: Z-Score (input)
def threshold(i,n):
    # Assigns a temporary dataframe, df, to the appropriate subset of data
    df = cvr[cvr.Site==i]
    # Calculates descriptive statistics for the subset
    # np.std defaults to the population standard deviation (ddof=0)
    std = np.std(df['CVR'])
    me = np.mean(df['CVR'])
    # The subset rows are filled with each row's Z-Score and the subset's population standard deviation
    # This is more for manual QC in the generated file
    cvr.loc[cvr.Site==i,'Z']=(cvr.loc[cvr.Site==i,'CVR']-me)/std
    cvr.loc[cvr.Site==i,'Std Dev']=std
    # The upper and lower bounds of the subset QC threshold are calculated and returned
    lower_bound = me-(n*std)
    upper_bound = me+(n*std)
    return lower_bound, upper_bound

In [7]:
def calc(n):
    d = { }
    url = cvr.Site.unique()
    for i in url:
        lb, ub = threshold(i,n)
        d[i] = [lb, ub]
    return d
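
For comparison, the per-site bounds can also be computed without an explicit loop over sites. The sketch below is one possible approach using `groupby` on made-up data; `bounds_by_site` and the `toy` table are hypothetical names introduced for illustration, and the code assumes the same population standard deviation that `np.std` uses by default.

```python
import pandas as pd

# Hypothetical data: one site with four CVR values, another with two
toy = pd.DataFrame({
    'Site': ['a.com'] * 4 + ['b.com'] * 2,
    'CVR': [10.0, 12.0, 14.0, 16.0, 5.0, 7.0],
})

def bounds_by_site(df, n):
    grouped = df.groupby('Site')['CVR']
    means = grouped.mean()
    # ddof=0 gives the population standard deviation, matching np.std's default
    stds = grouped.std(ddof=0)
    # Same {site: [lower_bound, upper_bound]} layout that calc returns
    return {
        site: [means[site] - n * stds[site], means[site] + n * stds[site]]
        for site in means.index
    }

toy_bounds = bounds_by_site(toy, 2)
print(toy_bounds)
```

Unlike `threshold`, this sketch does not write the Z and Std Dev columns back to the table; it only returns the bounds.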

In [8]:
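# INPUTS
# x: value for comparison
# j: site for calculation
# d: dictionary of {site: [lower_bound, upper_bound]} returned by calc
# Returns 0 if x falls outside the site's bounds (outlier), 1 otherwise
def comp(x,j,d):
    lower_bound, upper_bound = d[j]
    if x < lower_bound or x > upper_bound:
        return 0
    return 1

In [9]:
def outliers():
    n = float(input('Please enter a Z-Score for measurement: '))
    now = datetime.now()
    dt_string = now.strftime("%d%m%Y %H%M%S")
    # The thresholds are calculated once and reused for every row
    d = calc(n)
    cvr['QC'] = cvr.apply(lambda x: comp(x['CVR'],x['Site'],d),axis=1)
    cvr.to_csv(r'C:\Users\stewart\Downloads\CRV_QC '+dt_string+'.csv')
    # Returns the outliers from the most recent date, followed by all outliers
    return cvr.loc[(cvr.Date==cvr.Date.max())&(cvr.QC==0)],cvr.loc[cvr.QC==0]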

In [10]:
outliers()

Please enter a Z-Score for measurement: 2.5


(Empty DataFrame
 Columns: [Date, Site, CVR, Std Dev, Z, QC]
 Index: [],
        Date             Site   CVR   Std Dev         Z  QC
 17  2020-04  officedepot.com  30.9  3.713517  2.632276   0)