# Data Challenge

# Summary

### Problem Statement

### Outline of Analysis Steps

# Data Prep

### Load data and libraries

### First glimpse of what's going on

### Data Cleaning

Based on the initial glimpse from above and looking at the unique values for each column, it looks like the following datatypes need to be modified:
- Channel and Customer should not be treated as ints
- All other columns should be ints
- Convert Year column values of '08/01/2016' to just 2016

Additionally, although there are 801 rows, not every column has 801 non-null values. We'll drop the rows with missing values, check for duplicates, and see how many rows we lose in total.

In [1]:
cust = cust.dropna()
cust = cust.drop_duplicates()
# drop rows where any spending column holds the 'unrecorded' placeholder
for col in ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']:
    cust = cust[cust[col] != 'unrecorded']
print("Number of rows: ", str(cust.shape[0]))
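
The rows lost at each cleaning step can also be tracked explicitly. The following is a minimal sketch on a small made-up DataFrame: the values and the `clean_with_report` helper are illustrative assumptions, not part of the original data or notebook. It applies the same three steps as the cell above and records how many rows each one removes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `cust`; values are invented for illustration only.
toy = pd.DataFrame({
    'Channel': [1, 2, 1, 1, 2],
    'Fresh':   ['100', '200', 'unrecorded', '100', np.nan],
    'Milk':    ['10', '20', '30', '10', '50'],
})

def clean_with_report(df, spend_cols, sentinel='unrecorded'):
    """Hypothetical helper: drop nulls, duplicates, and sentinel rows,
    returning the cleaned frame and the rows removed by each step."""
    report = {}
    n = len(df)
    df = df.dropna()
    report['nulls'] = n - len(df)
    n = len(df)
    df = df.drop_duplicates()
    report['duplicates'] = n - len(df)
    n = len(df)
    # keep rows where no spending column holds the sentinel string
    df = df[~df[spend_cols].eq(sentinel).any(axis=1)]
    report['sentinel'] = n - len(df)
    return df, report

cleaned, report = clean_with_report(toy, ['Fresh', 'Milk'])
print(report)
print("Number of rows:", cleaned.shape[0])
```

Applied to the real data, the same report would show which step accounts for most of the lost rows.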

In [None]:
# formatting the Year column to be more uniform
def fixYear(year):
    """
    a simple function to convert formatting for the year column
    """
    if year == '08/01/2016':
        new_year = '2016'
    else:
        new_year = year
    return new_year

cust['Year'] = cust.Year.apply(fixYear)
cust.Year.unique()
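
`fixYear` handles only the single literal `'08/01/2016'`. If other date-like strings turned up in the column, a more general option would be to extract the four-digit year. A minimal sketch, assuming every entry contains a four-digit year; the sample values below are invented.

```python
import pandas as pd

# Invented sample values; the real column may contain other formats.
years = pd.Series(['2016', '08/01/2016', '2017', '2017'])

# Cast to str first in case some entries are stored as ints,
# then pull out the first run of four digits from each entry.
normalized = years.astype(str).str.extract(r'(\d{4})', expand=False)

print(normalized.unique())
```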

In [None]:
# datatype conversions
cust.Channel = cust.Channel.astype(str)
cust.Customer = cust.Customer.astype(str)
cust.Fresh = cust.Fresh.astype(int)
cust.Milk = cust.Milk.astype(int)
cust.Grocery = cust.Grocery.astype(int)
cust.Frozen = cust.Frozen.astype(int)
cust.Detergents_Paper = cust.Detergents_Paper.astype(int)
cust.Delicassen = cust.Delicassen.astype(int)
cust.info()
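
An alternative to filtering `'unrecorded'` column by column and then casting is to let `pd.to_numeric` coerce non-numeric entries to NaN and drop them in one pass. This is a sketch on invented data, not the approach used in this notebook. Note that `errors='coerce'` silently turns any non-numeric string into NaN, not just `'unrecorded'`.

```python
import pandas as pd

# Invented rows; 'unrecorded' marks a missing measurement as in the data above.
toy = pd.DataFrame({
    'Channel': [1, 2, 1],
    'Fresh':   ['120', 'unrecorded', '300'],
    'Milk':    ['15', '25', '35'],
})
spend_cols = ['Fresh', 'Milk']

# Non-numeric strings become NaN instead of raising an error.
toy[spend_cols] = toy[spend_cols].apply(pd.to_numeric, errors='coerce')
toy = toy.dropna(subset=spend_cols)

# Cast identifiers to str and spending to int in a single astype call.
toy = toy.astype({'Channel': str, 'Fresh': int, 'Milk': int})
print(toy.dtypes)
```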

### New Feature Creation

In [1]:
# total spending (note: Frozen is not part of this sum)
spend_cols = ['Fresh', 'Milk', 'Grocery', 'Detergents_Paper', 'Delicassen']
cust['Total'] = cust[spend_cols].sum(axis=1)
cust.head()

# Exploratory Data Analysis

### Descriptive stats


A little more detail here

In [None]:
# describe() for every spending column, split by channel
# (Channel was cast to str above, hence the string comparisons)
sections = [('TOTAL SPENDING', 'Total'), ('FRESH', 'Fresh'), ('MILK', 'Milk'),
            ('GROCERY', 'Grocery'), ('FROZEN', 'Frozen'),
            ('DETERGENTS & PAPER', 'Detergents_Paper'), ('DELICASSEN', 'Delicassen')]
channels = [('1', 'Hotels & Restaurants'), ('2', 'Retail')]

for i, (title, col) in enumerate(sections):
    # two blank lines between sections, none before the first
    print(('\n\n' if i else '') + title)
    for code, label in channels:
        print('\n' + label + ':')
        print(cust.loc[cust.Channel == code, col].describe())

There are 503 records for hotels/restaurants versus 267 for retail (2016 and 2017 combined). However, mean total spending is higher for retail (38698 vs 21681; `Total` as computed above does not include Frozen).

Hotels/restaurants have higher means for:
- Fresh (12898 vs 7495)
- Frozen (3332 vs 1443)

Retail has higher means for:
- Milk (9223 vs 3083)
- Grocery (14065 vs 3645)
- Detergents_Paper (6454 vs 758)
- Delicassen (1460 vs 1295)
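
The per-channel means listed above can also be read off a single table rather than from separate `describe()` calls. A minimal sketch with invented numbers (the real figures come from `cust`):

```python
import pandas as pd

# Invented data; Channel '1' = hotels/restaurants, '2' = retail.
toy = pd.DataFrame({
    'Channel': ['1', '1', '2', '2'],
    'Fresh':   [1000, 3000, 500, 1500],
    'Milk':    [200, 400, 900, 1100],
})

# One row per channel, one column per spending category.
means = toy.groupby('Channel')[['Fresh', 'Milk']].mean()
print(means)

# For each category, the channel with the larger mean.
print(means.idxmax())
```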

### Pairplots

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Paired plot comparing Channel (hotels/restaurants versus retail)
sns.set(style = 'ticks', color_codes = True)
sns.pairplot(cust, 
             hue = 'Channel', 
             palette = "Paired", 
             vars = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen'])

In [3]:
# Plots to look for outliers

In [None]:
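# box plots of each spending category, split by channel
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)

# reshape to long format so all spending categories share one plot
spend_cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
cust_long = cust.melt(id_vars='Channel', value_vars=spend_cols,
                      var_name='Category', value_name='Spend')
sns.boxplot(x='Category', y='Spend', hue='Channel', data=cust_long, palette='Set3')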

In [None]:
# Paired grid/paired plots

In [4]:
# Scatterplots

In [5]:
# Histograms
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)

# distribution of total spending; histplot + rugplot replace the deprecated distplot
ax = sns.histplot(x=cust['Total'], bins=20, kde=False)
sns.rugplot(x=cust['Total'], ax=ax)
# Tweak using Matplotlib
plt.ylim(0, None)
plt.xlim(0, None)

In [6]:
# Barplot

In [7]:
# Heatmap

In [8]:
# Collinearity

# Basic Feature Engineering

# Basic Modelling

In [None]:
# scipy t-test
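
For the t-test planned in the cell above, one option is Welch's unequal-variance test comparing total spending between the two channels. This is a sketch with invented data and assumes `scipy` is installed; whether the test's assumptions hold for the real spending data is not checked here.

```python
import pandas as pd
from scipy import stats

# Invented totals; Channel '1' = hotels/restaurants, '2' = retail.
toy = pd.DataFrame({
    'Channel': ['1'] * 5 + ['2'] * 5,
    'Total':   [10, 12, 11, 13, 9, 20, 22, 19, 25, 21],
})
hotel = toy.loc[toy.Channel == '1', 'Total']
retail = toy.loc[toy.Channel == '2', 'Total']

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(hotel, retail, equal_var=False)
print(t_stat, p_value)
```

A negative statistic here means the first group (hotels/restaurants) has the lower mean.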

In [None]:
# statsmodels for regression

# Validation

In [None]:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

# Results and Recommendations