# Mid-Course Project

Hi There, and thanks for your help. If you're reading this you've been selected to help on a secret initiative.

You will be helping us analyze a portion of data from a company we want to acquire, which could greatly improve the fortunes of Maven Mega Mart.

We'll be working with `project_transactions.csv` and briefly take a look at `product.csv`.

First, read in the transactions data and explore it.

* Take a look at the raw data, the datatypes, and cast `DAY`, `QUANTITY`, `STORE_ID`, and `WEEK_NO` columns to the smallest appropriate datatype. Check the memory reduction by doing so.
* Is there any missing data?
* How many unique households and products are there in the data? The fields household_key and Product_ID will help here.

In [1]:
import pandas as pd
import numpy as np

In [12]:
# load project transaction csv
transactions = pd.read_csv("../project_data/project_transactions.csv")
transactions.head()

Unnamed: 0,household_key,BASKET_ID,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,STORE_ID,RETAIL_DISC,WEEK_NO,COUPON_DISC,COUPON_MATCH_DISC
0,1364,26984896261,1,842930,1,2.19,31742,0.0,1,0.0,0.0
1,1364,26984896261,1,897044,1,2.99,31742,-0.4,1,0.0,0.0
2,1364,26984896261,1,920955,1,3.09,31742,0.0,1,0.0,0.0
3,1364,26984896261,1,937406,1,2.5,31742,-0.99,1,0.0,0.0
4,1364,26984896261,1,981760,1,0.6,31742,-0.79,1,0.0,0.0


In [13]:
# see memory usage for transaction DataFrame
transactions.info(memory_usage="deep")
# Key Values are
# dtypes: float64(4), int64(7)
# memory usage: 180.1 MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2146311 entries, 0 to 2146310
Data columns (total 11 columns):
 #   Column             Dtype  
---  ------             -----  
 0   household_key      int64  
 1   BASKET_ID          int64  
 2   DAY                int64  
 3   PRODUCT_ID         int64  
 4   QUANTITY           int64  
 5   SALES_VALUE        float64
 6   STORE_ID           int64  
 7   RETAIL_DISC        float64
 8   WEEK_NO            int64  
 9   COUPON_DISC        float64
 10  COUPON_MATCH_DISC  float64
dtypes: float64(4), int64(7)
memory usage: 180.1 MB


In [14]:
# Get summary Statistics to see datapoints
transactions.describe().round()

Unnamed: 0,household_key,BASKET_ID,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,STORE_ID,RETAIL_DISC,WEEK_NO,COUPON_DISC,COUPON_MATCH_DISC
count,2146311.0,2146311.0,2146311.0,2146311.0,2146311.0,2146311.0,2146311.0,2146311.0,2146311.0,2146311.0,2146311.0
mean,1056.0,34048970000.0,390.0,2884715.0,101.0,3.0,3268.0,-1.0,56.0,-0.0,-0.0
std,605.0,4723748000.0,190.0,3831949.0,1152.0,4.0,9122.0,1.0,27.0,0.0,0.0
min,1.0,26984900000.0,1.0,25671.0,0.0,0.0,1.0,-130.0,1.0,-56.0,-8.0
25%,548.0,30407980000.0,229.0,917231.0,1.0,1.0,330.0,-1.0,33.0,0.0,0.0
50%,1042.0,32811760000.0,392.0,1027960.0,1.0,2.0,372.0,0.0,57.0,0.0,0.0
75%,1581.0,40128040000.0,555.0,1132771.0,1.0,3.0,422.0,0.0,80.0,0.0,0.0
max,2099.0,42305360000.0,711.0,18316298.0,89638.0,840.0,34280.0,4.0,102.0,0.0,0.0


In [15]:

# for Transactions cast `DAY`, `QUANTITY`, `STORE_ID`, and `WEEK_NO` columns to the smallest appropriate datatype
transactions['DAY'] = transactions['DAY'].astype('int8')
transactions['QUANTITY'] = transactions['QUANTITY'].astype('int8')
transactions['STORE_ID'] = transactions['STORE_ID'].astype('int16')
transactions['WEEK_NO'] = transactions['WEEK_NO'].astype('int8')
transactions.info(memory_usage="deep")
# previous Memory 
# memory usage: 180.1 MB
# updated Memory 
# memory usage: 122.8 MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2146311 entries, 0 to 2146310
Data columns (total 11 columns):
 #   Column             Dtype  
---  ------             -----  
 0   household_key      int64  
 1   BASKET_ID          int64  
 2   DAY                int8   
 3   PRODUCT_ID         int64  
 4   QUANTITY           int8   
 5   SALES_VALUE        float64
 6   STORE_ID           int16  
 7   RETAIL_DISC        float64
 8   WEEK_NO            int8   
 9   COUPON_DISC        float64
 10  COUPON_MATCH_DISC  float64
dtypes: float64(4), int16(1), int64(3), int8(3)
memory usage: 124.9 MB


In [37]:
# Is there missing data
# check nulls
transactions.isnull().sum()

household_key           0
BASKET_ID               0
DAY                     0
PRODUCT_ID              0
QUANTITY                0
SALES_VALUE             0
STORE_ID                0
RETAIL_DISC             0
WEEK_NO                 0
COUPON_DISC             0
COUPON_MATCH_DISC       0
total_discount          0
percent_discount     8426
dtype: int64

In [17]:
# unique households and products are there in the data?
# use .nunique() to find number of unique values
print('Unique Households')
print(transactions['household_key'].nunique())
print('Unique Products')
print(transactions['PRODUCT_ID'].nunique())

Unique Households
2099
Unique Products
84138


## Column Creation

Create two columns:

* A column that captures the `total_discount` by row (sum of `RETAIL_DISC`, `COUPON_DISC`)
* The percentage discount (`total_discount` / `SALES_VALUE`). Make sure this is positive (try `.abs()`).
* If the percentage discount is greater than 1, set it equal to 1. If it is less than 0, set it to 0. 
* Drop the individual discount columns (`RETAIL_DISC`, `COUPON_DISC`, `COUPON_MATCH_DISC`).

Feel free to overwrite the existing transaction DataFrame after making the modifications above.

In [33]:
# create column that for  `total_discount` by row (sum of `RETAIL_DISC`, `COUPON_DISC`)
transactions['total_discount'] = (transactions['RETAIL_DISC'] + transactions['COUPON_DISC'])

In [50]:
# create column for `percentage_discount` (`total_discount` / `SALES_VALUE`). Make sure this is positive (try `.abs()`).
# If the percentage discount is greater than 1, set it equal to 1. If it is less than 0, set it to 0. 
transactions['percent_discount'] = np.select(0,0,default=(transactions['total_discount']/transactions['SALES_VALUE']).abs())

TypeError: 'int' object is not iterable

In [86]:
# Drop the individual discount columns (`RETAIL_DISC`, `COUPON_DISC`, `COUPON_MATCH_DISC`).

In [40]:
transactions.head()

Unnamed: 0,household_key,BASKET_ID,DAY,PRODUCT_ID,QUANTITY,SALES_VALUE,STORE_ID,RETAIL_DISC,WEEK_NO,COUPON_DISC,COUPON_MATCH_DISC,total_discount,percent_discount
0,1364,26984896261,1,842930,1,2.19,31742,0.0,1,0.0,0.0,0.0,0.0
1,1364,26984896261,1,897044,1,2.99,31742,-0.4,1,0.0,0.0,-0.4,0.133779
2,1364,26984896261,1,920955,1,3.09,31742,0.0,1,0.0,0.0,0.0,0.0
3,1364,26984896261,1,937406,1,2.5,31742,-0.99,1,0.0,0.0,-0.99,0.396
4,1364,26984896261,1,981760,1,0.6,31742,-0.79,1,0.0,0.0,-0.79,1.316667


In [47]:
transactions.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2146311 entries, 0 to 2146310
Data columns (total 13 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   household_key      2146311 non-null  int64  
 1   BASKET_ID          2146311 non-null  int64  
 2   DAY                2146311 non-null  int8   
 3   PRODUCT_ID         2146311 non-null  int64  
 4   QUANTITY           2146311 non-null  int8   
 5   SALES_VALUE        2146311 non-null  float64
 6   STORE_ID           2146311 non-null  int16  
 7   RETAIL_DISC        2146311 non-null  float64
 8   WEEK_NO            2146311 non-null  int8   
 9   COUPON_DISC        2146311 non-null  float64
 10  COUPON_MATCH_DISC  2146311 non-null  float64
 11  total_discount     2146311 non-null  float64
 12  percent_discount   2137885 non-null  float64
dtypes: float64(6), int16(1), int64(3), int8(3)
memory usage: 157.6 MB


## Overall Statistics

Calculate:

* The total sales (sum of `SALES_VALUE`), 
* Total discount (sum of `total_discount`)
* Overall percentage discount (sum of total_discount / sum of sales value)
* Total quantity sold (sum of `QUANTITY`).
* Max quantity sold in a single row. Inspect the row as well. Does this have a high discount percentage?
* Total sales value per basket (sum of sales value / nunique basket_id).
* Total sales value per household (sum of sales value / nunique household_key). 

## Household Analysis

* Plot the distribution of total sales value purchased at the household level. 
* What were the top 10 households by quantity purchased?
* What were the top 10 households by sales value?
* Plot the total sales value for our top 10 households by value, ordered from highest to lowest.


## Product Analysis

* Which products had the most sales by sales_value? Plot  a horizontal bar chart.
* Did the top 10 selling items have a higher than average discount rate?
* What was the most common `PRODUCT_ID` among rows with the households in our top 10 households by sales value?
* Look up the names of the  top 10 products by sales in the `products.csv` dataset.
* Look up the product name of the item that had the highest quantity sold in a single row.