# Task 2
**What we will learn**
- Understand experimentation and uplift testing, comparing trial and control stores.
- Learn control store selection based on defined metrics.
- Gain experience in data visualization.
- Perform statistical analysis to assess sales differences and formulate recommendations.

**What we will do**
- Define metrics to select control stores.
- Analyze trial stores against controls.
- Use Python for data analysis and visualization, then summarise the findings and provide recommendations.

In [33]:
# import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings
warnings.filterwarnings('ignore')

In [34]:
# Load the data into a DataFrame
data = pd.read_csv(r"/content/QVI_data.csv")

In [35]:
# Check the first 5 rows of the data
data.head()

|   | LYLTY_CARD_NBR | DATE | STORE_NBR | TXN_ID | PROD_NBR | PROD_NAME | PROD_QTY | TOT_SALES | PACK_SIZE | BRAND | LIFESTAGE | PREMIUM_CUSTOMER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 2018-10-17 | 1 | 1 | 5 | Natural Chip Compny SeaSalt175g | 2 | 6.0 | 175 | NATURAL | YOUNG SINGLES/COUPLES | Premium |
| 1 | 1002 | 2018-09-16 | 1 | 2 | 58 | Red Rock Deli Chikn&Garlic Aioli 150g | 1 | 2.7 | 150 | RRD | YOUNG SINGLES/COUPLES | Mainstream |
| 2 | 1003 | 2019-03-07 | 1 | 3 | 52 | Grain Waves Sour Cream&Chives 210G | 1 | 3.6 | 210 | GRNWVES | YOUNG FAMILIES | Budget |
| 3 | 1003 | 2019-03-08 | 1 | 4 | 106 | Natural ChipCo Hony Soy Chckn175g | 1 | 3.0 | 175 | NATURAL | YOUNG FAMILIES | Budget |
| 4 | 1004 | 2018-11-02 | 1 | 5 | 96 | WW Original Stacked Chips 160g | 1 | 1.9 | 160 | WOOLWORTHS | OLDER SINGLES/COUPLES | Mainstream |


In [36]:
# Summary information about the data (columns, dtypes, non-null counts)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264834 entries, 0 to 264833
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   LYLTY_CARD_NBR    264834 non-null  int64  
 1   DATE              264834 non-null  object 
 2   STORE_NBR         264834 non-null  int64  
 3   TXN_ID            264834 non-null  int64  
 4   PROD_NBR          264834 non-null  int64  
 5   PROD_NAME         264834 non-null  object 
 6   PROD_QTY          264834 non-null  int64  
 7   TOT_SALES         264834 non-null  float64
 8   PACK_SIZE         264834 non-null  int64  
 9   BRAND             264834 non-null  object 
 10  LIFESTAGE         264834 non-null  object 
 11  PREMIUM_CUSTOMER  264834 non-null  object 
dtypes: float64(1), int64(6), object(5)
memory usage: 24.2+ MB


In [37]:
# Missing Values
data.isnull().sum()

LYLTY_CARD_NBR      0
DATE                0
STORE_NBR           0
TXN_ID              0
PROD_NBR            0
PROD_NAME           0
PROD_QTY            0
TOT_SALES           0
PACK_SIZE           0
BRAND               0
LIFESTAGE           0
PREMIUM_CUSTOMER    0
dtype: int64

- No column contains null values, which is a good start.

## Select control stores
- The client has selected stores 77, 86 and 88 as trial stores and wants the control stores to be established stores that were operational for the entire observation period.
- We want to match each trial store to a control store that was similar to it before the trial period of February 2019 in terms of:
  - Monthly overall sales revenue
  - Monthly number of customers
  - Monthly number of transactions per customer

- Let's first create the metrics of interest and keep only the stores that are present throughout the pre-trial period.

In [39]:
# Derive month and year/month columns from DATE
data["DATE"] = pd.to_datetime(data["DATE"]) # Convert DATE from object (string) to datetime
data["MONTH"] = data["DATE"].dt.month # Calendar month number (1-12)
data["YEAR/MONTH"] = data["DATE"].dt.to_period("M") # Monthly period, e.g. 2018-10
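With a `YEAR/MONTH` column available, the three matching metrics can be computed per store and month. The following is a minimal sketch on a small made-up DataFrame, not the notebook's own implementation: the `toy` data, the metric column names (`total_sales`, `n_customers`, `n_txn`, `txn_per_customer`) and the helper variables are all assumptions chosen for illustration, and the pre-trial cutoff is assumed to be any month before February 2019.

```python
import pandas as pd

# Hypothetical toy transactions; in the notebook, `data` would take the place of `toy`.
toy = pd.DataFrame({
    "STORE_NBR":      [1, 1, 1, 1, 2, 2],
    "LYLTY_CARD_NBR": [1000, 1000, 1002, 1000, 2000, 2001],
    "TXN_ID":         [1, 2, 3, 4, 5, 6],
    "TOT_SALES":      [6.0, 2.7, 3.6, 3.0, 1.9, 4.2],
    "DATE": pd.to_datetime([
        "2018-07-03", "2018-07-15", "2018-07-20",
        "2018-08-01", "2018-07-09", "2018-07-30",
    ]),
})
toy["YEAR/MONTH"] = toy["DATE"].dt.to_period("M")

# One row per store and month with the three metrics of interest
monthly = (
    toy.groupby(["STORE_NBR", "YEAR/MONTH"])
       .agg(total_sales=("TOT_SALES", "sum"),
            n_customers=("LYLTY_CARD_NBR", "nunique"),
            n_txn=("TXN_ID", "nunique"))
       .reset_index()
)
monthly["txn_per_customer"] = monthly["n_txn"] / monthly["n_customers"]

# Assumed pre-trial window: every month before February 2019
pre_trial = monthly[monthly["YEAR/MONTH"] < pd.Period("2019-02", freq="M")]

# Keep only stores that have sales in every pre-trial month
n_months = pre_trial["YEAR/MONTH"].nunique()
months_per_store = pre_trial.groupby("STORE_NBR")["YEAR/MONTH"].nunique()
full_obs_stores = months_per_store[months_per_store == n_months].index
pre_trial_full = pre_trial[pre_trial["STORE_NBR"].isin(full_obs_stores)]

print(pre_trial_full)
```

In the toy data, store 2 has no sales in August 2018, so it is dropped and only store 1 remains as a candidate. Counting distinct months per store is one simple way to express "operational for the entire observation period"; other definitions are possible.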

In [40]:
data.head()

|   | LYLTY_CARD_NBR | DATE | STORE_NBR | TXN_ID | PROD_NBR | PROD_NAME | PROD_QTY | TOT_SALES | PACK_SIZE | BRAND | LIFESTAGE | PREMIUM_CUSTOMER | MONTH | YEAR/MONTH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 2018-10-17 | 1 | 1 | 5 | Natural Chip Compny SeaSalt175g | 2 | 6.0 | 175 | NATURAL | YOUNG SINGLES/COUPLES | Premium | 10 | 2018-10 |
| 1 | 1002 | 2018-09-16 | 1 | 2 | 58 | Red Rock Deli Chikn&Garlic Aioli 150g | 1 | 2.7 | 150 | RRD | YOUNG SINGLES/COUPLES | Mainstream | 9 | 2018-09 |
| 2 | 1003 | 2019-03-07 | 1 | 3 | 52 | Grain Waves Sour Cream&Chives 210G | 1 | 3.6 | 210 | GRNWVES | YOUNG FAMILIES | Budget | 3 | 2019-03 |
| 3 | 1003 | 2019-03-08 | 1 | 4 | 106 | Natural ChipCo Hony Soy Chckn175g | 1 | 3.0 | 175 | NATURAL | YOUNG FAMILIES | Budget | 3 | 2019-03 |
| 4 | 1004 | 2018-11-02 | 1 | 5 | 96 | WW Original Stacked Chips 160g | 1 | 1.9 | 160 | WOOLWORTHS | OLDER SINGLES/COUPLES | Mainstream | 11 | 2018-11 |


- Next, let's total TOT_SALES by calendar month and see which months have the highest sales.

In [41]:
high_sale = data.groupby(["MONTH"])["TOT_SALES"].sum().sort_values(ascending=False)
high_sale

MONTH
12    167913.40
3     166265.20
7     165275.30
10    164415.70
1     162642.30
6     160538.60
9     160522.00
11    160233.70
4     159845.10
8     158081.05
5     156717.65
2     150665.00
Name: TOT_SALES, dtype: float64

- Based on the results above, December has the highest total sales, followed by March and July, while February has the lowest.
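The monthly totals can also be shown as a bar chart, which makes the ranking easier to read than the sorted table. This is a minimal sketch: the values are copied from the output above so the block runs on its own, and the variable `by_month`, the figure size and the axis labels are illustrative choices rather than part of the original analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Monthly totals copied from the groupby output above (MONTH -> TOT_SALES)
high_sale = pd.Series(
    {12: 167913.40, 3: 166265.20, 7: 165275.30, 10: 164415.70,
     1: 162642.30, 6: 160538.60, 9: 160522.00, 11: 160233.70,
     4: 159845.10, 8: 158081.05, 5: 156717.65, 2: 150665.00},
    name="TOT_SALES",
)
high_sale.index.name = "MONTH"

# Order by calendar month so the x-axis reads 1..12
by_month = high_sale.sort_index()

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(by_month.index, by_month.values)
ax.set_xticks(list(by_month.index))
ax.set_xlabel("Month")
ax.set_ylabel("Total sales")
ax.set_title("Total sales by calendar month")
fig.tight_layout()
```

In the notebook itself, `by_month` could be built directly from `high_sale.sort_index()` without retyping the numbers. Because the totals differ by only about 10% between the best and worst month, the bars will look similar in height unless the y-axis is zoomed.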