# Wholesale Customer Loyalty Data Challenge

# Summary

### Problem Statement

You work at a small food service supply wholesaler that principally services the
hotel, restaurant, and retail channels. Your company is designing a
customer loyalty program, but your CEO is unsure of the best way to proceed. The
current thinking is a loyalty program that incentivizes customers to purchase across
multiple offering categories (e.g., if a customer purchases from the grocery, frozen, and
deli categories, they receive a discount).

Your CEO would like you to **examine annual spending by the company’s current
customers to understand if such a program would be attractive to the largest subgroup
of customers.**

Available Data:
- CUST_ID: Customer ID
- YEAR: Year
- FRESH: annual spending on fresh products
- DAIRY: annual spending on dairy products
- GROCERY: annual spending on grocery products
- FROZEN: annual spending on frozen products
- DETERGENTS_PAPER: annual spending on detergents and paper products
- DELI: annual spending on delicatessen products
- CHANNEL: HoReCa (hotel/restaurant) or Retail

### Outline of Analysis Steps

Initial ideas:
- compute basic summary statistics on the raw data
- clean the data (quantify and remove missing values; quantify and remove duplicates; fix datatypes)
- look for correlations between categories
- fit and plot k-means clusters

Additional steps taken with new information:

# Data Prep

### Load libraries, data, and get first glimpse

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

In [8]:
cust = pd.read_csv('wholesale_data.csv')
cust.head()

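| | Channel | Customer | Year | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 6048141 | 2017 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 2 | 9336325 | 2017 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| 2 | 2 | 6272942 | 2017 | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
| 3 | 1 | 7856217 | 2017 | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
| 4 | 2 | 6179511 | 2017 | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |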


In [13]:
# Row count
print("Number of rows: ", str(cust.shape[0]))

# Column count
print("Number of columns: ", str(cust.shape[1]))

# Column names
print("Column names: ", str(cust.columns))

# Indexing method
print("Index method: ", cust.index)

# Data types of all columns
print("Data types for entire dataframe: ")
cust.info()

Number of rows:  801
Number of columns:  9
Column names:  Index(['Channel', 'Customer', 'Year', 'Fresh', 'Milk', 'Grocery', 'Frozen',
       'Detergents_Paper', 'Delicassen'],
      dtype='object')
Index method:  RangeIndex(start=0, stop=801, step=1)
Data types for entire dataframe: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 9 columns):
Channel             801 non-null int64
Customer            801 non-null int64
Year                801 non-null object
Fresh               799 non-null object
Milk                798 non-null object
Grocery             798 non-null object
Frozen              798 non-null object
Detergents_Paper    799 non-null object
Delicassen          797 non-null object
dtypes: int64(2), object(7)
memory usage: 56.4+ KB


### Data Cleaning

Based on the initial glimpse above, the following datatypes need to be modified:
- Channel and Customer are identifiers rather than quantities, so they should not be treated as ints
- Year should be converted to a date format
- All other columns (the six spending categories) should be ints

Additionally, even though there are 801 rows, the six spending columns each have only 797 to 799 non-null values. We'll drop the rows with missing values and count how many rows we lose in total, and also check for duplicates.
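The cleaning steps described above could be sketched roughly as follows. This is a minimal sketch on a small made-up frame that mimics the column names and object dtypes reported by `cust.info()`; the sample values, the `raw`/`clean` names, and the shortened `SPEND_COLS` list are assumptions for illustration, not the real data.

```python
import pandas as pd

# Toy frame mimicking the raw data (all values are made up): spending columns
# arrive as strings, with one missing value, one non-numeric entry, and one
# fully duplicated row.
raw = pd.DataFrame({
    "Channel": [2, 1, 1, 2, 2],
    "Customer": [101, 102, 102, 103, 104],
    "Year": ["2017", "2017", "2017", "2017", "2017"],
    "Fresh": ["12669", "7057", "7057", None, "22615"],
    "Milk": ["9656", "9810", "9810", "8808", "n/a"],
})
SPEND_COLS = ["Fresh", "Milk"]  # assumption: extend to all six spending columns

clean = raw.copy()

# Coerce spending to numeric; anything unparseable becomes NaN
clean[SPEND_COLS] = clean[SPEND_COLS].apply(pd.to_numeric, errors="coerce")
# Parse the year into a datetime (January 1st of that year)
clean["Year"] = pd.to_datetime(clean["Year"], format="%Y", errors="coerce")

# Quantify, then remove, rows with missing values
n_missing = int(clean.isna().any(axis=1).sum())
clean = clean.dropna()

# Quantify, then remove, exact duplicate rows
n_dupes = int(clean.duplicated().sum())
clean = clean.drop_duplicates().reset_index(drop=True)

# Identifiers are labels, not quantities; spending becomes integer
clean = clean.astype({"Channel": "category", "Customer": str})
clean = clean.astype({col: "int64" for col in SPEND_COLS})

print(f"Dropped {n_missing} rows with missing values and {n_dupes} duplicate rows")
print(clean.dtypes)
```

Coercing before dropping means non-numeric strings are counted as missing too, which may or may not be the desired behavior for the real file.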



In [1]:
# Check for duplicates
# Check for NaNs
# Check for data types
# Check for unique values

# Exploratory Data Analysis

In [3]:
# Plots to look for outliers

In [None]:
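# Box plots of annual spending per category, split by channel
spend_cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']

sns.set(color_codes=True)

# Reshape to long format: one row per (customer, category) pair
long = cust.melt(id_vars=['Channel'], value_vars=spend_cols,
                 var_name='Category', value_name='Spend')
# The raw spending columns load as object dtype, so coerce to numeric first
long['Spend'] = pd.to_numeric(long['Spend'], errors='coerce')

sns.boxplot(x='Category', y='Spend', hue='Channel', data=long, palette='Set3')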

In [None]:
# Paired grid/paired plots

In [4]:
# Scatterplots

In [5]:
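# Histograms
sns.set(color_codes=True)

# 'Fresh' is used as an example column; the raw column loads as object dtype,
# so coerce to numeric and drop missing values before plotting
fresh = pd.to_numeric(cust['Fresh'], errors='coerce').dropna()

# sns.distplot is deprecated; histplot plus rugplot draws the same picture
sns.histplot(fresh, bins=20, kde=False)
sns.rugplot(fresh)
# Tweak using Matplotlib
plt.ylim(0, None)
plt.xlim(0, None)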

In [6]:
# Barplot

In [7]:
# Heatmap

In [8]:
# Collinearity
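One way to look for correlations between categories is to compute the pairwise correlation matrix and rank the category pairs. The sketch below uses synthetic spending values (made up so that one pair is strongly related); the `spend` frame and its numbers are assumptions, not results from the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic spending: Detergents_Paper is built to track Grocery,
# while Fresh is generated independently of both.
grocery = rng.gamma(shape=2.0, scale=4000.0, size=n)
spend = pd.DataFrame({
    "Fresh": rng.gamma(shape=2.0, scale=6000.0, size=n),
    "Grocery": grocery,
    "Detergents_Paper": 0.4 * grocery + rng.normal(0, 300, size=n),
})

# Pairwise Pearson correlations between categories
corr = spend.corr()

# Keep each pair once (upper triangle, excluding the diagonal) and rank them
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```

The same `corr` matrix can be passed to `sns.heatmap(corr, annot=True, vmin=-1, vmax=1)` to draw the heatmap. Pairs with very high correlation carry largely redundant information, which matters for both regression and clustering.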

# Basic Feature Engineering
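Since the proposed program rewards purchasing across multiple categories, one candidate feature is the number of categories each customer buys from. The sketch below is only an illustration: the `spend` values are made up, and treating a category as "purchased" when spend is at least `MIN_SPEND` is an assumption that would need to be agreed with the business.

```python
import pandas as pd

# Made-up annual spending for three hypothetical customers
spend = pd.DataFrame(
    {
        "Grocery": [7561, 0, 4221],
        "Frozen": [214, 1762, 0],
        "Delicassen": [1338, 0, 0],
    },
    index=["A", "B", "C"],
)

MIN_SPEND = 1  # assumption: any positive spend counts as a purchase

# Boolean frame: did the customer buy from each category?
purchased = spend >= MIN_SPEND

# Number of categories purchased per customer
n_categories = purchased.sum(axis=1)

# Would the customer qualify for a grocery + frozen + deli discount?
qualifies = purchased[["Grocery", "Frozen", "Delicassen"]].all(axis=1)

print(pd.DataFrame({"n_categories": n_categories, "qualifies": qualifies}))
```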

# Basic Modelling

In [None]:
# scipy t-test
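A t-test could be used here to check whether average spending in a category differs between the two channels. The sketch below uses Welch's t-test (`equal_var=False`) on synthetic numbers; the group means, the sample sizes, and the choice of Grocery as the category are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic annual grocery spending per customer for each channel (made up)
horeca_grocery = rng.normal(loc=4000, scale=1000, size=100)
retail_grocery = rng.normal(loc=16000, scale=3000, size=100)

# Welch's t-test: does not assume the two groups share the same variance
result = stats.ttest_ind(horeca_grocery, retail_grocery, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

Real spending data tends to be right-skewed, so it may be worth testing on log-transformed values or using a rank-based test such as `stats.mannwhitneyu` instead.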

In [None]:
# statsmodels for regression

In [None]:
# Unsupervised learning Kmeans
from sklearn import cluster, datasets
iris = datasets.load_iris()
X_iris = iris.data

k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris) 
print(k_means.labels_[::10])
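The same call pattern could be applied to the spending columns to find the largest customer subgroup. The sketch below is run on synthetic data with two made-up customer groups; the group sizes, the log transform, the scaling step, and `n_clusters=2` are all assumptions rather than findings from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic groups: 60 customers heavy on Fresh, 40 heavy on Grocery
group_a = pd.DataFrame({
    "Fresh": rng.lognormal(9.5, 0.3, 60),
    "Grocery": rng.lognormal(7.5, 0.3, 60),
})
group_b = pd.DataFrame({
    "Fresh": rng.lognormal(7.5, 0.3, 40),
    "Grocery": rng.lognormal(9.5, 0.3, 40),
})
spend = pd.concat([group_a, group_b], ignore_index=True)

# Spending is right-skewed, so log-transform and standardize before clustering
X = StandardScaler().fit_transform(np.log1p(spend))

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
spend["cluster"] = k_means.labels_

# Cluster sizes, the largest cluster, and its average spending profile
sizes = spend["cluster"].value_counts()
largest = sizes.idxmax()
profile = spend.groupby("cluster").mean()

print(sizes)
print(profile.loc[largest])
```

The average spending profile of the largest cluster is what would indicate whether those customers already buy across several categories or concentrate on one.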

# Validation

In [None]:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
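`classification_report` needs true labels, which a clustering of customers does not have. One label-free alternative is the silhouette score, which can also help choose the number of clusters. The sketch below uses three synthetic, well-separated blobs; the data and the range of `k` values tried are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three synthetic, well-separated 2-D blobs of 50 points each
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

# Silhouette score for each candidate number of clusters (higher is better)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k:", best_k)
```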

# Results and Recommendations