# The KPMG Bicycle Problem
A classic example of customer segmentation analysis.
> This is part of the KPMG Job Simulation Program. It used to be available on [The Forage](https://www.theforage.com/) but is now archived.

> This notebook is a summarised and updated version of three notebooks I created earlier, which can be found here:
> - [Task 1](https://www.kaggle.com/code/notcostheta/kpmg-virtual-internship-task1)
> - [Task 2](https://www.kaggle.com/code/notcostheta/kpmg-virtual-internship-task2)
> - [Task 3](https://www.kaggle.com/code/notcostheta/kpmg-virtual-internship-task3)

## Background :

Apparently I’m finally working at KPMG (please hire me) and I just got my first client: Sprocket Central Pty Ltd, a medium-sized bikes and cycling accessories company.

My Analytics, Information and Modelling team and I are supposed to help them skyrocket their business 📈👌

They provided me with three datasets:

- Customer Demographic
- Customer Addresses
- Transactions data in the past 3 months

But there are issues with their data, of course, so it cannot be used to build some magical model right away. My Associate Director also suggested that I optimise the quality of the customer data before deriving any insights for the company's growth.

Which brings us to our first task!

# Task 1 : Data Quality Assessment
Fix the issues we encounter in all three datasets, and draft an email to the client identifying the data quality issues and ways to mitigate them.

At the end of this task, the data will be cleaned, accurate, consistent, and ready to be used for our analysis.

## Data Quality Framework

| Dimension    | Description                                                                               |
|--------------|-------------------------------------------------------------------------------------------|
| Accuracy     | The closeness of a value to the correct representation of the real-life phenomenon        |
| Completeness | The extent to which data are of sufficient breadth, depth, and scope for the task at hand |
| Consistency  | The extent to which data are uniform in format, use, and meaning across a data collection |
| Currency     | The freshness of the data                                                                 |
| Volatility   | The length of time the data remains valid                                                 |
| Relevancy    | The extent to which data are appropriate for the task at hand                             |
| Validity     | The extent to which data conform to defined business rules or constraints                 |
| Uniqueness   | The extent to which data are unique within the dataset                                    |
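
Some of these dimensions can be measured directly with pandas. The following is a minimal sketch on a small made-up frame. The `quality_report` helper, the toy values, and the "price must be positive" rule are assumptions for illustration only and are not part of the client brief.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, key: str) -> dict:
    """Hypothetical helper: summarise a few data quality dimensions."""
    return {
        # Completeness: share of non-null cells per column
        "completeness": df.notna().mean().round(3).to_dict(),
        # Uniqueness: number of rows whose key repeats an earlier row
        "duplicate_keys": int(df[key].duplicated().sum()),
        # Validity: rows breaking an assumed business rule (price must be positive)
        "invalid_prices": int((df["list_price"] <= 0).sum()),
    }


# Toy data, invented for this example
toy = pd.DataFrame(
    {
        "transaction_id": [1, 2, 2, 4],
        "brand": ["Solex", None, "Trek Bicycles", "Solex"],
        "list_price": [71.49, 2091.47, -5.0, 1198.46],
    }
)

report = quality_report(toy, key="transaction_id")
print(report)
```

Each entry maps to one row of the table, which makes it easier to list concrete issues in the email to the client.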


## Importing the data
We will use the pandas library to import the data and perform our analysis.
You can also use Excel, Google Sheets, or whatever tool you're comfortable with.


In [111]:
# Importing the libraries
import numpy as np
import pandas as pd

# Importing the dataset
xls = pd.ExcelFile(
    "/home/meow/Desktop/internship-speedrun/kpmg/KPMG_VI_New_raw_data_update_final.xlsx"
)

# I'm using black formatting for the code
# Keep note of the naming convention
# We use PascalCase to name all our original datasets

Transactions = pd.read_excel(xls, "Transactions", skiprows=1)
NewCustomerList = pd.read_excel(xls, "NewCustomerList", skiprows=1)
CustomerDemographic = pd.read_excel(xls, "CustomerDemographic", skiprows=1)
CustomerAddress = pd.read_excel(xls, "CustomerAddress", skiprows=1)

### Analysing Transactions Data

#### Shape and Cardinality
- Shape of the data is 20000 rows and 13 columns
- Columns with a cardinality of 2 should be converted to boolean
- This would be useful for non-machine-learning approaches

In [112]:
Transactions.head()

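|   | transaction_id | product_id | customer_id | transaction_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date |
|---|----------------|------------|-------------|------------------|--------------|--------------|-------|--------------|---------------|--------------|------------|---------------|-------------------------|
| 0 | 1 | 2 | 2950 | 2017-02-25 | 0.0 | Approved | Solex | Standard | medium | medium | 71.49 | 53.62 | 41245.0 |
| 1 | 2 | 3 | 3120 | 2017-05-21 | 1.0 | Approved | Trek Bicycles | Standard | medium | large | 2091.47 | 388.92 | 41701.0 |
| 2 | 3 | 37 | 402 | 2017-10-16 | 0.0 | Approved | OHM Cycles | Standard | low | medium | 1793.43 | 248.82 | 36361.0 |
| 3 | 4 | 88 | 3135 | 2017-08-31 | 0.0 | Approved | Norco Bicycles | Standard | medium | medium | 1198.46 | 381.1 | 36145.0 |
| 4 | 5 | 78 | 787 | 2017-10-01 | 1.0 | Approved | Giant Bicycles | Standard | medium | large | 1765.3 | 709.48 | 42226.0 |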


In [113]:
Transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   transaction_id           20000 non-null  int64         
 1   product_id               20000 non-null  int64         
 2   customer_id              20000 non-null  int64         
 3   transaction_date         20000 non-null  datetime64[ns]
 4   online_order             19640 non-null  float64       
 5   order_status             20000 non-null  object        
 6   brand                    19803 non-null  object        
 7   product_line             19803 non-null  object        
 8   product_class            19803 non-null  object        
 9   product_size             19803 non-null  object        
 10  list_price               20000 non-null  float64       
 11  standard_cost            19803 non-null  float64       
 12  product_first_sold_date  19803 n

In [114]:
# List the number of unique values in each column, in column order
# Cardinality is the number of unique values in a column
# This is a good way to check for categorical columns

for col in Transactions.columns:
    print(f"{col}: {Transactions[col].nunique()}")

print("------------------------------------")

# Print the unique values of every column with fewer than 10 distinct values
for col in Transactions.columns:
    if Transactions[col].nunique() < 10:
        print(f"{col}: {Transactions[col].unique()}")

transaction_id: 20000
product_id: 101
customer_id: 3494
transaction_date: 364
online_order: 2
order_status: 2
brand: 6
product_line: 4
product_class: 3
product_size: 3
list_price: 296
standard_cost: 103
product_first_sold_date: 100
------------------------------------
online_order: [ 0.  1. nan]
order_status: ['Approved' 'Cancelled']
brand: ['Solex' 'Trek Bicycles' 'OHM Cycles' 'Norco Bicycles' 'Giant Bicycles'
 'WeareA2B' nan]
product_line: ['Standard' 'Road' 'Mountain' 'Touring' nan]
product_class: ['medium' 'low' 'high' nan]
product_size: ['medium' 'large' 'small' nan]
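
The loop above prints the columns in their original order. If a listing sorted by cardinality is wanted, `DataFrame.nunique()` returns a Series that can be sorted directly. This is a small sketch on an invented toy frame standing in for `Transactions`; the `binary_columns` name is mine.

```python
import pandas as pd

# Toy frame standing in for Transactions (values invented for this example)
toy = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4],
        "online_order": [0.0, 1.0, None, 1.0],
        "brand": ["Solex", "Trek Bicycles", "OHM Cycles", "Solex"],
    }
)

# nunique() ignores NaN by default, so online_order counts 2 distinct values
cardinality = toy.nunique().sort_values()
print(cardinality)

# Columns with exactly two distinct values are candidates for a boolean type
binary_columns = cardinality[cardinality == 2].index.tolist()
print(binary_columns)
```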


#### Missing Values
Missing values are rare in this dataframe (under 2% of rows in any column), and the 197 missing product fields appear to come from the same rows, so we can drop those rows.
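
Before dropping anything, it is worth checking that the gaps really do share rows. A minimal sketch of one way to do that, on an invented toy frame (the column subset and the `same_rows` name are assumptions for illustration):

```python
import pandas as pd

# Toy frame, invented for this example: rows 2 and 4 lack all product details
toy = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4],
        "brand": ["Solex", None, "OHM Cycles", None],
        "product_line": ["Standard", None, "Road", None],
        "standard_cost": [53.62, None, 248.82, None],
    }
)

product_columns = ["brand", "product_line", "standard_cost"]
missing = toy[product_columns].isna()

# If the gaps coincide, every row missing any product field is missing all of them
same_rows = bool((missing.any(axis=1) == missing.all(axis=1)).all())
print(same_rows)

# Dropping on that subset then removes exactly those rows
cleaned = toy.dropna(subset=product_columns)
print(len(toy) - len(cleaned))
```

If `same_rows` comes back `False` on the real data, the gaps are scattered and dropping rows would lose more than the null counts suggest.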

In [115]:
# Function to analyse the missing data


def analyze_null_values(dataframe):
    """Return null counts and the percentage of rows missing, per column."""
    total_null_values = dataframe.isnull().sum()
    non_null_values = dataframe.count()
    # The percentage is taken over all rows, not only the non-null ones
    null_values_percentage = (total_null_values / len(dataframe)) * 100
    missing_data = pd.concat(
        {
            "Null Values": total_null_values,
            "Percentage of Missing Values": null_values_percentage,
            "Non-Null Values": non_null_values,
        },
        axis=1,
    )

    missing_data = missing_data.sort_values(
        by="Percentage of Missing Values", ascending=False
    )

    return missing_data


analyze_null_values(Transactions)

|   | Null Values | Percentage of Missing Values | Non-Null Values |
|---|-------------|------------------------------|-----------------|
| online_order | 360 | 1.8 | 19640 |
| brand | 197 | 0.985 | 19803 |
| product_line | 197 | 0.985 | 19803 |
| product_class | 197 | 0.985 | 19803 |
| product_size | 197 | 0.985 | 19803 |
| standard_cost | 197 | 0.985 | 19803 |
| product_first_sold_date | 197 | 0.985 | 19803 |
| transaction_id | 0 | 0.0 | 20000 |
| product_id | 0 | 0.0 | 20000 |
| customer_id | 0 | 0.0 | 20000 |


#### Date Time Adjustments
- `transaction_date` is already parsed as `datetime64[ns]`; we convert it again only as a safeguard
- `product_first_sold_date` is stored as a number and should be converted to datetime format
- Later we will create a new `product_age` column by subtracting the converted `product_first_sold_date` from `today_date`

In [116]:
# Define start and end date of the dataset
start_date = pd.to_datetime("2017-01-01")
end_date = pd.to_datetime("2017-12-31")

# Convert transaction_date column to standard datetime format
Transactions["transaction_date"] = pd.to_datetime(Transactions["transaction_date"])

# Convert product_first_sold_date column to standard datetime format
# The values are assumed to be Excel serial day numbers (the data comes from an
# .xlsx workbook). Excel's day 1 is 1900-01-01 and it counts a non-existent
# 1900-02-29, so the matching origin is 1899-12-30 rather than 1900-01-01
# Assuming the dataset was given to us on 2018-01-01

today_date = pd.Timestamp("2018-01-01")
Transactions["product_first_sold_date"] = pd.to_timedelta(
    Transactions["product_first_sold_date"], unit="D"
) + pd.Timestamp("1899-12-30")

# Shift every date back by the time elapsed since today_date
# Note: after this step the stored dates depend on when the notebook is run,
# and today_date minus the shifted date measures the age as of the run time
date_difference = pd.Timestamp.today() - today_date
Transactions["product_first_sold_date"] = (
    Transactions["product_first_sold_date"] - date_difference
)
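
As an alternative to the conversion above, here is a sketch that keeps the real first-sold dates and measures age against a fixed reference date, so the result does not change from run to run. It assumes `product_first_sold_date` holds Excel serial day numbers (the values look like it, but the source data does not say so). The `reference_date` and `product_age_days` names are mine.

```python
import pandas as pd

# Serial numbers copied from the Transactions.head() output, plus a missing value
serials = pd.Series([41245.0, 41701.0, 36361.0, None])

# Excel's day 1 is 1900-01-01 and Excel counts a non-existent 1900-02-29,
# so 1899-12-30 is the origin that lines up for dates after February 1900
first_sold = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(first_sold)

# Assumed reference date, matching today_date in the cell above
reference_date = pd.Timestamp("2018-01-01")

# Age in whole days as of the reference date; missing dates stay missing
product_age_days = (reference_date - first_sold).dt.days
print(product_age_days)
```

With this origin, the first serial number (41245) maps to 2012-12-02.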

In [117]:
Transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   transaction_id           20000 non-null  int64         
 1   product_id               20000 non-null  int64         
 2   customer_id              20000 non-null  int64         
 3   transaction_date         20000 non-null  datetime64[ns]
 4   online_order             19640 non-null  float64       
 5   order_status             20000 non-null  object        
 6   brand                    19803 non-null  object        
 7   product_line             19803 non-null  object        
 8   product_class            19803 non-null  object        
 9   product_size             19803 non-null  object        
 10  list_price               20000 non-null  float64       
 11  standard_cost            19803 non-null  float64       
 12  product_first_sold_date  19803 n

### Fixing the Transactions Data
- We already fixed the date time format earlier
- We will map the `online_order` and `order_status` columns to True/False
- `online_order` keeps the object dtype even though its values are boolean, because it still contains `NaN`; `order_status` has no missing values and becomes a plain bool column
- Don't worry about the `NaN` values; the machine learning pipelines will take care of them

In [118]:
# Mapping the online_order column to True and False
Transactions["online_order"] = Transactions["online_order"].map({1: True, 0: False})

# Mapping the order_status column to True and False
Transactions["order_status"] = Transactions["order_status"].map(
    {"Approved": True, "Cancelled": False}
)

# Print unique values of online_order and order_status columns
print("Unique values of online_order column:", Transactions["online_order"].unique())
print("Unique values of order_status column:", Transactions["order_status"].unique())

Unique values of online_order column: [False True nan]
Unique values of order_status column: [ True False]
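
If an object column holding `True`, `False`, and `NaN` becomes awkward to work with, pandas also offers a nullable `"boolean"` extension dtype. The notebook itself does not use it; this is only a sketch of the option, on invented values shaped like `online_order`.

```python
import pandas as pd

# Values shaped like online_order: floats with a missing entry
online_order = pd.Series([0.0, 1.0, None, 1.0])

# A plain dict map leaves the missing entry in place
mapped = online_order.map({1: True, 0: False})
print(mapped.dtype)

# The nullable "boolean" extension dtype stores True, False, and <NA> together
nullable = mapped.astype("boolean")
print(nullable.dtype)
print(nullable.isna().sum())
```

Whether this is worth doing depends on the downstream tooling, since some libraries expect NumPy dtypes rather than pandas extension types.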
