# Targeting high value customers based on customer demographics and attributes.
> This project was done as part of the KPMG internship experience. I was provided with data sets from a client organisation that wants feedback on the quality of its data and on how that quality can be improved.

### Background
- Sprocket Central Pty Ltd is a medium-sized bikes and cycling accessories organisation.
- It needs help with its customer and transactions data.
- It wants to know how to analyse this data to optimise its marketing strategy effectively.

### Datasets
- New Customer List
- Customer Demographic
- Customer Addresses
- Transactions data in the past 3 months


### Task
- Exploratory Data Analysis to understand the data and its quality
- Model building to predict the high value customers
- Results and recommendations

In [347]:
# Importing libraries

import pandas as pd
import numpy as np

In [348]:
# Importing data
xls = pd.ExcelFile('KPMG_VI_New_raw_data_update_final.xlsx')

Transactions = pd.read_excel(xls, 'Transactions', skiprows=1)
NewCustomerList = pd.read_excel(xls, 'NewCustomerList', skiprows=1)
Demographic = pd.read_excel(xls, 'CustomerDemographic', skiprows=1)
Address = pd.read_excel(xls, 'CustomerAddress', skiprows=1)

### Checking correlation and common columns among the sheets

In [349]:
# Making variables to store the columns of each dataframe

transactions_columns = Transactions.columns
demographic_columns = Demographic.columns
newcustomerlist_columns = NewCustomerList.columns
address_columns = Address.columns

transactions_columns.name = 'transactions_columns'
demographic_columns.name = 'demographic_columns'
newcustomerlist_columns.name = 'newcustomerlist_columns'
address_columns.name = 'address_columns'

In [350]:
# Helper code I generated via a prompt: it builds a sheet-vs-column presence table
# The table shows whether each column header is present in each of the dataframes

def generate_presence_dataframe(*columns):
    """
    Generate a DataFrame to show the presence of attributes in each column.

    Args:
        *columns: Variable number of pandas DataFrame columns.

    Returns:
        A pandas DataFrame with the presence of attributes in each column.
        The column headers are based on the names of the passed columns,
        or generic names if the columns don't have names.
        The displayed columns are in the same order of the passed column arguments.
        The DataFrame is sorted based on the number of '1' values horizontally (across the rows).

    """
    # Step 1: Convert the column(s) to set(s)
    column_sets = [set(col) for col in columns]

    # Step 2: Create a set of all unique attributes from the column(s)
    all_attributes = sorted(list(set().union(*column_sets)))

    # Step 3: Create a dictionary to store the presence of attributes in each column
    presence_dict = {'Attributes': all_attributes}
    for i, col in enumerate(columns):
        column_name = col.name if col.name else f'Column {i+1}'
        presence_dict[column_name] = [1 if attr in col else 0 for attr in all_attributes]

    # Step 4: Create a DataFrame from the presence dictionary
    presence_df = pd.DataFrame(presence_dict)

    # Step 5: Sort the dataframe based on the number of '1' values horizontally (across the rows)
    presence_df = presence_df.iloc[presence_df.iloc[:, 1:].sum(axis=1).sort_values(ascending=False).index]

    # Reset the index
    presence_df = presence_df.reset_index(drop=True)

    return presence_df


In [351]:
columns_presence_df = generate_presence_dataframe(transactions_columns, demographic_columns, address_columns,newcustomerlist_columns)
columns_presence_df

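| | Attributes | transactions_columns | demographic_columns | address_columns | newcustomerlist_columns |
|---|---|---|---|---|---|
| 0 | customer_id | 1 | 1 | 1 | 0 |
| 1 | DOB | 0 | 1 | 0 | 1 |
| 2 | first_name | 0 | 1 | 0 | 1 |
| 3 | tenure | 0 | 1 | 0 | 1 |
| 4 | state | 0 | 0 | 1 | 1 |
| 5 | property_valuation | 0 | 0 | 1 | 1 |
| 6 | postcode | 0 | 0 | 1 | 1 |
| 7 | past_3_years_bike_related_purchases | 0 | 1 | 0 | 1 |
| 8 | owns_car | 0 | 1 | 0 | 1 |
| 9 | job_title | 0 | 1 | 0 | 1 |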


### Highlights of the correlation analysis
We are supposed to combine the data from the three sheets (Customer Demographic, Customer Addresses, Transactions) into a Master Sheet, which will serve as the training set for a model that predicts the high value customers.
- **customer_id** can be used as the primary key to combine the data
- some columns are plainly irrelevant and can be dropped
- rows with missing data can be dropped as well, since they would affect the model building
- the DOB can be converted to age, which allows an analysis by age group
- a **polynomial regression** model can be used to predict the high value customers
- values for the states, DOB and gender can be made uniform.
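
The age-group idea from the list above could be sketched with `pd.cut`. This is only an illustration: the sample ages, the bin edges and the labels are assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample ages; the notebook derives the real ages from the DOB column.
ages = pd.Series([19, 34, 47, 62, None], name="customer_age")

# Assumed bin edges and labels, chosen only for illustration.
bins = [0, 25, 40, 55, 120]
labels = ["<25", "25-39", "40-54", "55+"]

# right=False makes each bin include its left edge, e.g. [25, 40).
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Missing ages stay missing instead of being forced into a group.
print(age_group.value_counts(sort=False))
```

Keeping the groups as a categorical column preserves their order, which helps when the groups are later tabulated or plotted.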

### Building a master Dataframe to train the model
- The master dataframe is built by combining the three sheets using the customer_id as the primary key
- The irrelevant columns are dropped
- reference for [pandas merge](https://www.youtube.com/watch?v=h4hOPGo4UVU)

In [352]:
# Merging the dataframes using customer_id as the key
merged_df = pd.merge(Demographic, Address, on='customer_id', how='outer')
master_df = pd.merge(merged_df, Transactions, on='customer_id', how='outer')
master_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20510 entries, 0 to 20509
Data columns (total 30 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   customer_id                          20510 non-null  int64         
 1   first_name                           20504 non-null  object        
 2   last_name                            19849 non-null  object        
 3   gender                               20504 non-null  object        
 4   past_3_years_bike_related_purchases  20504 non-null  float64       
 5   DOB                                  20047 non-null  object        
 6   job_title                            18027 non-null  object        
 7   job_industry_category                17180 non-null  object        
 8   wealth_segment                       20504 non-null  object        
 9   deceased_indicator                   20504 non-null  object        
 10  default   
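
Because both merges use `how='outer'`, customers that appear in only one sheet are kept and show up as rows with missing values. A minimal sketch of how this could be checked with the `indicator` option of `pd.merge` follows; the two small frames and the `list_price` column are hypothetical stand-ins for the real sheets.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the Demographic and Transactions sheets.
demographic = pd.DataFrame({"customer_id": [1, 2, 3], "gender": ["F", "M", "F"]})
transactions = pd.DataFrame({"customer_id": [1, 1, 4], "list_price": [100.0, 250.0, 80.0]})

# indicator=True adds a "_merge" column recording which side each row came from.
checked = pd.merge(demographic, transactions, on="customer_id", how="outer", indicator=True)

# Count rows found in both frames, only in the left one, or only in the right one.
print(checked["_merge"].value_counts())

# customer_ids that are missing from one of the two frames.
unmatched_ids = checked.loc[checked["_merge"] != "both", "customer_id"].tolist()
print(unmatched_ids)
```

A check like this makes it easier to decide whether an outer or an inner join is the better basis for the training set.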

In [353]:
# List of columns to drop
columns_to_drop = ['transaction_id', 'product_id', 'first_name', 'last_name', 'default', 'country','address', 'deceased_indicator']
# Create a new DataFrame by dropping the specified columns
master_stripped = master_df.drop(columns=columns_to_drop)
master_stripped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20510 entries, 0 to 20509
Data columns (total 22 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   customer_id                          20510 non-null  int64         
 1   gender                               20504 non-null  object        
 2   past_3_years_bike_related_purchases  20504 non-null  float64       
 3   DOB                                  20047 non-null  object        
 4   job_title                            18027 non-null  object        
 5   job_industry_category                17180 non-null  object        
 6   wealth_segment                       20504 non-null  object        
 7   owns_car                             20504 non-null  object        
 8   tenure                               20047 non-null  float64       
 9   postcode                             20478 non-null  float64       
 10  state     

### Making the dates uniform and adding the age column

In [354]:
# Making the DOB and transaction_date columns into datetime objects
master_stripped['DOB'] = pd.to_datetime(master_stripped['DOB'])
master_stripped['transaction_date'] = pd.to_datetime(master_stripped['transaction_date'])

In [355]:
# Define the start and end dates of the dataset
start_date = pd.to_datetime('2017-01-01')  # not used below; kept for reference
end_date = pd.to_datetime('2017-12-31')

# Convert "DOB" column to datetime format (already done in the previous cell; repeating it is harmless)
master_stripped['DOB'] = pd.to_datetime(master_stripped['DOB'])

# Approximate customer age in whole years as of end_date (floor of days / 365.25)
master_stripped['customer_age'] = (end_date - master_stripped['DOB']).dt.days // 365.25
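
Dividing the day count by 365.25 is an approximation that can be off by one year close to a birthday. As a hedged alternative, the sketch below compares calendar fields instead; the sample birth dates are hypothetical and the variable names are chosen only for this example.

```python
import pandas as pd

# Hypothetical birth dates, including one missing value.
dob = pd.Series(pd.to_datetime(["1980-06-15", "1999-12-31", None]))
end_date = pd.Timestamp("2017-12-31")

# True when the birthday has already occurred in the year of end_date.
had_birthday = (dob.dt.month < end_date.month) | (
    (dob.dt.month == end_date.month) & (dob.dt.day <= end_date.day)
)

# Year difference, minus one when the birthday is still to come.
# Missing birth dates propagate as <NA> in the nullable Int64 result.
age = (end_date.year - dob.dt.year - (~had_birthday).astype(int)).astype("Int64")
print(age)
```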


In [356]:
# Define the reference date
reference_date = pd.to_datetime('2018-01-01')

# Convert "product_first_sold_date" column to timedelta format
master_stripped['product_first_sold_date'] = pd.to_timedelta(master_stripped['product_first_sold_date'], unit='D')

# Calculate product age based on "product_first_sold_date" column
master_stripped['product_age'] = (reference_date - pd.to_datetime(0)).days - master_stripped['product_first_sold_date'].dt.days
master_stripped['product_age'] = (master_stripped['product_age'] / 365.25).abs().fillna(np.nan).round().astype('Int64')
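
The cell above measures the day counts against the Unix epoch (`pd.to_datetime(0)`, that is 1970-01-01). If `product_first_sold_date` actually holds Excel serial day numbers, which is an assumption the source does not confirm, the counts start at 1899-12-30 instead and the resulting ages differ considerably. A sketch under that assumption, with hypothetical serial values:

```python
import pandas as pd

# Hypothetical serial values; assumes the column stores Excel-style day counts.
serials = pd.Series([41245.0, 37823.0, None], name="product_first_sold_date")

# Excel's 1900 date system corresponds to counting days from 1899-12-30
# (valid for dates after February 1900).
first_sold = pd.to_datetime(serials, unit="D", origin="1899-12-30")

# Same reference date as in the notebook cell above.
reference_date = pd.Timestamp("2018-01-01")

# Age in years, rounded to the nearest integer; missing values remain <NA>.
product_age = ((reference_date - first_sold).dt.days / 365.25).round().astype("Int64")
print(pd.DataFrame({"first_sold": first_sold, "product_age": product_age}))
```

With this origin the subtraction does not go negative for past dates, so no `.abs()` call is needed.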


### Making the States and Genders uniform

In [357]:
# Replace 'New South Wales' with 'NSW' and Victoria with 'VIC'
master_stripped['state'] = master_stripped['state'].replace('New South Wales', 'NSW')
master_stripped['state'] = master_stripped['state'].replace('Victoria', 'VIC')

# Replace 'Female' and 'Femal' with 'F', and 'Male' with 'M'
master_stripped['gender'] = master_stripped['gender'].replace('Female', 'F')
master_stripped['gender'] = master_stripped['gender'].replace('Femal', 'F')
master_stripped['gender'] = master_stripped['gender'].replace('Male', 'M')

In [358]:
# Obtain the unique values of the "state" and the gender columns
states = master_stripped['state'].unique()
genders = master_stripped['gender'].unique()

In [359]:
print(states)
print(genders)

['NSW' nan 'QLD' 'VIC']
['F' 'M' 'U' nan]
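
The same normalisation can be written more compactly with one mapping per column. The following is a sketch on hypothetical sample values, not the notebook's data; labels that are not listed in a mapping, such as `'U'`, are left unchanged.

```python
import pandas as pd

# Hypothetical sample of raw labels similar to those handled above.
raw = pd.DataFrame({
    "state": ["New South Wales", "NSW", "Victoria", "QLD", None],
    "gender": ["Female", "Femal", "Male", "U", None],
})

# One mapping per column; values that are not listed stay as they are.
replacements = {
    "state": {"New South Wales": "NSW", "Victoria": "VIC"},
    "gender": {"Female": "F", "Femal": "F", "Male": "M"},
}
cleaned = raw.replace(replacements)

# dropna=False also counts the missing values, which unique() only lists.
print(cleaned["state"].value_counts(dropna=False))
print(cleaned["gender"].value_counts(dropna=False))
```

Counting the values, rather than only listing them, shows how many rows still carry `'U'` or a missing value before deciding how to treat them.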


### Converting the non Boolean values to Boolean values

In [360]:
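# Fixing the owns_car column

# Step 1: Replace blank values with NA
# (assigned back, because inplace=True on a selected column may not update the DataFrame in recent pandas)
master_stripped['owns_car'] = master_stripped['owns_car'].replace('', pd.NA)

# Step 2: Convert 'Yes' to True and 'No' to False
mapping = {'Yes': True, 'No': False}
master_stripped['owns_car'] = master_stripped['owns_car'].map(mapping)

# Step 3: Fill the remaining missing values with False and store the column as bool
master_stripped['owns_car'] = master_stripped['owns_car'].fillna(False).astype(bool)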

In [361]:
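# Fixing the order_status column

# Step 1: Replace blank values with NA
# (assigned back, because inplace=True on a selected column may not update the DataFrame in recent pandas)
master_stripped['order_status'] = master_stripped['order_status'].replace('', pd.NA)

# Step 2: Convert 'Approved' to True and 'Cancelled' to False
mapping = {'Approved': True, 'Cancelled': False}
master_stripped['order_status'] = master_stripped['order_status'].map(mapping)

# Step 3: Fill the remaining missing values with False and store the column as bool
master_stripped['order_status'] = master_stripped['order_status'].fillna(False).astype(bool)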

In [362]:
# Fixing the online_order column
# Step 1: Replace blank values with NA (assigned back rather than modified in place)
master_stripped['online_order'] = master_stripped['online_order'].replace('', pd.NA)

# Step 2: Convert the column to the nullable boolean type, which keeps missing values as <NA>
master_stripped['online_order'] = master_stripped['online_order'].astype('boolean')

### Fixing bad floats

In [363]:
master_stripped['tenure'] = master_stripped['tenure'].astype('Int64')
master_stripped['past_3_years_bike_related_purchases'] = master_stripped['past_3_years_bike_related_purchases'].astype('Int64')
master_stripped['postcode'] = master_stripped['postcode'].astype('Int64')
master_stripped['property_valuation'] = master_stripped['property_valuation'].astype('Int64')
master_stripped['customer_age'] = master_stripped['customer_age'].astype('Int64')

In [364]:
master_stripped.to_csv('master_stripped.csv', index=False)