# Capstone Project 1 Inferential Statistics

The goal of this notebook is to present inferential statistics about the Walmart dataset.

## The Issue at Hand

This particular dataset does not lend itself well to statistical analysis.  Recall that of the 6 original predictive variables, only one is numeric: `scan_count`.  Furthermore, this value is limited to integers.  It is not even possible to use this value in a linear regression, since the target variable is categorical with 39 different possibilities.  The `visit_number` variable is simply an identifier and provides no insight on its own.  Consequently, the usual statistics for numeric data, such as Pearson correlations, standard deviations, and confidence intervals, cannot be meaningfully computed here.

Instead, we are largely limited to exploring variables individually and simply counting how often they occur.  To this end, I refer the reader back to the data wrangling and storytelling notebooks for this project, which include many such summary descriptions and further discussion of their implications as well as graphics.

## The Chi Square Statistic

What can be computed, however, is the chi square statistic.  Given a null hypothesis regarding an expected distribution of occurrences, the chi square statistic measures how far the observed distribution is from the expected one.  Namely,

$\chi_c^2=\sum_{i\in C}\frac{(O_i-E_i)^2}{E_i}$

where $C$ is the set of all values of a categorical feature, $O_i$ and $E_i$ are the observed and expected counts for value $i$, and $c$ is the degrees of freedom.  This can then be used to compute a $p$-value and either reject or fail to reject the null hypothesis.
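As a quick illustration of the formula, here is a minimal sketch in plain Python. The counts are made up for the example, and `chi_square_statistic` is a hypothetical helper name, not part of any library.

```python
# Made-up counts for a categorical feature with four values
observed = [18, 22, 20, 40]
# Null hypothesis: all four values are equally likely
expected = [sum(observed) / len(observed)] * len(observed)

def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

statistic = chi_square_statistic(observed, expected)
# With no parameters estimated from the data, the degrees of freedom
# are one less than the number of categories.
degrees_of_freedom = len(observed) - 1
print(statistic, degrees_of_freedom)
```

The larger the statistic relative to the degrees of freedom, the smaller the resulting $p$-value.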

### Load previous data

First we load the data and rerun the earlier transformation code, without further comment.

In [1]:
cd ~/Desktop/Springboard/Capstone_1/Original_Data

/Users/nick/Desktop/Springboard/Capstone_1/Original_Data


In [2]:
import pandas as pd
import numpy as np

train_df=pd.read_csv('train.csv')
test_df=pd.read_csv('test.csv')
dfs=[train_df,test_df]

col=['trip_type','visit_number','weekday','upc','scan_count',
     'department_description','fineline_number']
train_df.columns=col
test_df.columns=col[1:]  # the test set has no trip_type column

train_df['trip_type']=train_df['trip_type'].astype('category')
categoricals=['weekday','department_description','fineline_number']
for df in dfs:
    for category in categoricals:
        df[category]=df[category].astype('category')
        
# Split scan_count into non-negative purchase and return counts, then drop it
for df in dfs:
    df['purchase_count']=df['scan_count'].clip(lower=0)
    df['return_count']=(-df['scan_count']).clip(lower=0)
    df.drop(columns='scan_count',inplace=True)
    
train_df_grouped=train_df.groupby('visit_number')
test_df_grouped=test_df.groupby('visit_number')

### Days of the Week

Many of the variables in this data set still do not lend themselves well to this statistic because they take on so many values, and it is not clear what a sensible null hypothesis to test would be.  As a demonstration, though, we will test the null hypothesis that shoppers shop equally often on all days of the week, with a significance level of $\alpha=.05$.

In [3]:
# Create a dictionary of the day of the week for each unique trip,
# read from the first row of the visit (column 2 is weekday)
day_by_visit={}
for visit in train_df.visit_number.unique():
    day_by_visit[visit]=train_df_grouped.get_group(visit).iloc[0,2]

In [4]:
# Count how often each day occurs and compute the average
unique_daily_count=pd.Series(list(day_by_visit.values())).value_counts()
print(unique_daily_count)
print("Average:" + str(np.mean(list(unique_daily_count))))

Sunday       17124
Saturday     16904
Friday       15234
Monday       12027
Wednesday    11612
Tuesday      11530
Thursday     11243
dtype: int64
Average:13667.714285714286
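The same per-visit counts can likely be obtained without an explicit loop. The sketch below uses a tiny made-up frame (not the Walmart data) and assumes, as the loop above does, that the first row of each visit carries that visit's weekday.

```python
import pandas as pd

# Tiny made-up stand-in for train_df: three visits, six rows
toy_df = pd.DataFrame({
    'visit_number': [1, 1, 2, 3, 3, 3],
    'weekday': ['Friday', 'Friday', 'Sunday', 'Friday', 'Friday', 'Friday'],
})

# Keep the first row of each visit, then count visits per weekday
toy_daily_count = toy_df.drop_duplicates('visit_number')['weekday'].value_counts()
print(toy_daily_count)
```

Here each visit is counted once regardless of how many rows it spans.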


In [5]:
# Compute the chi square statistic and the associated p value
from scipy.stats import chisquare
chisquare(list(unique_daily_count))

Power_divergenceResult(statistic=3090.470911637435, pvalue=0.0)

As one would anticipate from visual inspection of the distribution with such a large sample size, we have $p\approx0$.  A higher-precision calculation shows that $p<.00001$.  We reject the null hypothesis that customers shop equally often throughout the week.
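To make the calculation concrete, the sketch below recomputes the statistic from the counts printed above using plain Python. Because the $p$-value underflows to zero in floating point, it also evaluates the logarithm of $p$ through the closed form of the chi square tail probability for 6 degrees of freedom, $p=e^{-x/2}\left(1+\frac{x}{2}+\frac{(x/2)^2}{2}\right)$. The variable names are illustrative.

```python
import math

# Unique visits per day, copied from the output above
daily_counts = {
    'Sunday': 17124, 'Saturday': 16904, 'Friday': 15234, 'Monday': 12027,
    'Wednesday': 11612, 'Tuesday': 11530, 'Thursday': 11243,
}

# Under the null hypothesis every day has the same expected count
expected = sum(daily_counts.values()) / len(daily_counts)
statistic = sum((o - expected) ** 2 / expected for o in daily_counts.values())

# Tail probability for 6 degrees of freedom, computed on a log scale
half = statistic / 2
log_p = -half + math.log(1 + half + half ** 2 / 2)
log10_p = log_p / math.log(10)

print(statistic)  # should agree with the scipy result above
print(log10_p)    # base-10 logarithm of the p-value
```

The base-10 logarithm comes out far below $-5$, consistent with $p<.00001$.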

More interestingly, let us test the null hypothesis that the number of unique items shoppers buy on a given day is proportional to how often they shop on that day.  This means that the $E_i$ are the proportions obtained in the previous step times the total number of unique items purchased, and the $O_i$ are the total number of unique items actually purchased each day.  The thought is that if more people shop on Saturdays for instance, perhaps Saturday is the day to not just go shopping but to do lots of shopping.

In [6]:
chisquare(list(train_df['weekday'].value_counts()),         
          f_exp=np.array(unique_daily_count)*len(train_df)/sum(list(unique_daily_count)))

Power_divergenceResult(statistic=5783.841850078856, pvalue=0.0)
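Both series passed to `chisquare` above are sorted by their own counts, so their positions line up only if the days rank in the same order in both. Aligning by day label avoids relying on that. A minimal sketch with made-up numbers (the `toy_` names are hypothetical):

```python
import pandas as pd
from scipy.stats import chisquare

# Made-up counts; note that the two series rank the days differently
toy_visits = pd.Series({'Sunday': 50, 'Saturday': 30, 'Friday': 20})
toy_rows = pd.Series({'Saturday': 260, 'Sunday': 240, 'Friday': 100})

# Put the observed counts in the same day order as the visit counts
observed = toy_rows.reindex(toy_visits.index)
# Expected counts: visit proportions scaled to the total number of rows
expected = toy_visits / toy_visits.sum() * observed.sum()

result = chisquare(observed.to_numpy(), f_exp=expected.to_numpy())
print(result)
```

With `reindex`, each observed count is compared against the expected count for the same day, whatever order `value_counts` happened to return.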

This null hypothesis is again easily rejected.  Perhaps, contrary to the thought above, many people buy only an item or two on popular days.

Something to bear in mind is that neither of these tests gives any indication of what the distributions actually are.  They merely allow us to reject the null hypothesis.  For instance, the first test does not by itself establish that people shop on weekend days more than weekdays, even though that appears to be the case.
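One descriptive way to see where the deviations come from, without claiming anything the test itself does not, is to look at the per-day Pearson residuals $(O_i-E_i)/\sqrt{E_i}$, whose squares sum to the chi square statistic. The sketch below uses the visit counts printed earlier; it is a descriptive aid rather than a formal test of weekend versus weekday shopping.

```python
import math

# Unique visits per day, copied from the earlier output
daily_counts = {
    'Sunday': 17124, 'Saturday': 16904, 'Friday': 15234, 'Monday': 12027,
    'Wednesday': 11612, 'Tuesday': 11530, 'Thursday': 11243,
}
expected = sum(daily_counts.values()) / len(daily_counts)

# Positive residual: more visits than the uniform null predicts; negative: fewer
residuals = {day: (count - expected) / math.sqrt(expected)
             for day, count in daily_counts.items()}

for day, residual in residuals.items():
    print(f"{day:<10}{residual:+.1f}")
```

The sign of each residual shows the direction of the deviation for that day, which the statistic alone does not.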

### Days of the Week and Trip Type

It would be more constructive toward our end goal of predicting trip types to compute chi square statistics from a table of values for trip types against one of the other variables.  There are too many UPCs and fineline numbers to compare them to trip types and get a meaningful result.  Instead, we will compute the statistic for how many times each trip type occurs on each day of the week.  Specifically, the null hypothesis is that the day of the week and the trip type are independent of each other.

In [7]:
# Build a list of lists: one sub-list per day of the week, whose entries are the
# number of unique visit numbers for each trip type on that day.
counts=[]
for day in train_df['weekday'].unique():
    daycounts=[]
    day_df=train_df[train_df['weekday']==day]
    for ttype in train_df['trip_type'].unique():
        daytype_df=day_df[day_df['trip_type']==ttype]
        x=daytype_df.loc[:,'visit_number'].nunique()
        daycounts.append(x)
    counts.append(daycounts)

In [8]:
freedom=(train_df['weekday'].nunique()-1)*(train_df['trip_type'].nunique()-1)
chisquare(counts,axis=None,ddof=freedom)

Power_divergenceResult(statistic=143674.27020925225, pvalue=0.0)

With $p\approx0$, we reject the null hypothesis: the day of the week and the type of trip do not appear to be independent of each other.  This is encouraging for a future machine learning model, as it suggests that the day of the week may be a useful predictive feature.  However, we are still limited in our interpretability of the result, as it does not give any indication of what the actual distribution is.
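For a test of independence on a two-way table, `scipy.stats.chi2_contingency` derives the expected counts from the row and column totals and uses $(r-1)(c-1)$ degrees of freedom. By contrast, `chisquare` with `axis=None` and no `f_exp` compares every cell against a single overall mean, and its `ddof` argument is subtracted from $k-1$ rather than setting the degrees of freedom directly. A minimal sketch with a made-up 2x3 table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x3 table of counts: rows are two days, columns are three trip types
toy_counts = np.array([[10, 20, 30],
                       [20, 20, 20]])

# Expected counts come from the row and column totals under independence
statistic, p_value, dof, expected = chi2_contingency(toy_counts)
print(statistic, p_value, dof)
print(expected)
```

Applied to this notebook's data, the call would presumably be `chi2_contingency(counts)` on the table built above.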