# Targeting high-value customers based on customer demographics and attributes
> This project was completed as part of the KPMG internship experience. I was given data sets from a client organisation that wants our feedback on the quality of its data and on how that quality can be improved.

### Background
- Sprocket Central Pty Ltd is a medium-sized bikes and cycling accessories organisation.
- It needs help with its customer and transactions data.
- It wants to know how to analyse that data to optimise its marketing strategy.

### Datasets
- New Customer List
- Customer Demographic
- Customer Addresses
- Transactions data for the past 3 months


### Task
- Exploratory Data Analysis to understand the data and its quality
- Model building to predict high-value customers
- Results and recommendations

In [115]:
# Importing libraries

import pandas as pd
import numpy as np

In [116]:
# Importing data
xls = pd.ExcelFile('KPMG_VI_New_raw_data_update_final.xlsx')

Transactions = pd.read_excel(xls, 'Transactions', skiprows=1)
NewCustomerList = pd.read_excel(xls, 'NewCustomerList', skiprows=1)
Demographic = pd.read_excel(xls, 'CustomerDemographic', skiprows=1)
Address = pd.read_excel(xls, 'CustomerAddress', skiprows=1)

### Checking correlation and common columns among the sheets

In [117]:
# Making variables to store the columns of each dataframe

transactions_columns = Transactions.columns
demographic_columns = Demographic.columns
newcustomerlist_columns = NewCustomerList.columns
address_columns = Address.columns

transactions_columns.name = 'transactions_columns'
demographic_columns.name = 'demographic_columns'
newcustomerlist_columns.name = 'newcustomerlist_columns'
address_columns.name = 'address_columns'

In [118]:
# Helper code, generated from a prompt, that builds a sheet-vs-column presence table
# The table shows whether each column header is present in each of the dataframes

def generate_presence_dataframe(*columns):
    """
    Generate a DataFrame showing which column headers appear in which sheet.

    Args:
        *columns: Variable number of pandas Index objects (e.g. ``df.columns``),
            one per sheet.

    Returns:
        A pandas DataFrame with one row per column header and one 0/1 column
        per sheet (1 = the header is present in that sheet).
        The sheet columns are named after the ``name`` of each passed Index,
        or 'Column N' if the Index has no name.
        The sheet columns are in the same order as the passed arguments.
        The rows are sorted by the number of sheets containing the header,
        in descending order.

    """
    # Step 1: Convert the column(s) to set(s)
    column_sets = [set(col) for col in columns]

    # Step 2: Create a set of all unique attributes from the column(s)
    all_attributes = sorted(list(set().union(*column_sets)))

    # Step 3: Create a dictionary to store the presence of attributes in each column
    presence_dict = {'Attributes': all_attributes}
    for i, col in enumerate(columns):
        column_name = col.name if col.name else f'Column {i+1}'
        presence_dict[column_name] = [1 if attr in col else 0 for attr in all_attributes]

    # Step 4: Create a DataFrame from the presence dictionary
    presence_df = pd.DataFrame(presence_dict)

    # Step 5: Sort the dataframe based on the number of '1' values horizontally (across the rows)
    # (the sorted index holds row labels, so select with .loc rather than .iloc)
    presence_df = presence_df.loc[presence_df.iloc[:, 1:].sum(axis=1).sort_values(ascending=False).index]

    # Reset the index
    presence_df = presence_df.reset_index(drop=True)

    return presence_df


In [119]:
columns_presence_df = generate_presence_dataframe(transactions_columns, demographic_columns, address_columns, newcustomerlist_columns)
columns_presence_df

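| | Attributes | transactions_columns | demographic_columns | address_columns | newcustomerlist_columns |
|---|---|---|---|---|---|
| 0 | customer_id | 1 | 1 | 1 | 0 |
| 1 | DOB | 0 | 1 | 0 | 1 |
| 2 | first_name | 0 | 1 | 0 | 1 |
| 3 | tenure | 0 | 1 | 0 | 1 |
| 4 | state | 0 | 0 | 1 | 1 |
| 5 | property_valuation | 0 | 0 | 1 | 1 |
| 6 | postcode | 0 | 0 | 1 | 1 |
| 7 | past_3_years_bike_related_purchases | 0 | 1 | 0 | 1 |
| 8 | owns_car | 0 | 1 | 0 | 1 |
| 9 | job_title | 0 | 1 | 0 | 1 |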
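
The presence-table idea can also be sketched more compactly on toy data. This is an illustrative sketch, not the notebook's own helper: the function name `presence_table`, the sheet names, and the headers `product_id` and `postcode` in the toy input are assumptions chosen for the example.

```python
import pandas as pd


def presence_table(named_columns):
    """Build a 0/1 table: rows are column headers, columns are sheets.

    `named_columns` is a hypothetical mapping of sheet name -> iterable of headers.
    """
    # Every distinct header across all sheets, in sorted order
    all_attributes = sorted(set().union(*named_columns.values()))

    # One 0/1 column per sheet: 1 if the header exists in that sheet
    table = pd.DataFrame(
        {
            name: [int(attr in set(cols)) for attr in all_attributes]
            for name, cols in named_columns.items()
        },
        index=pd.Index(all_attributes, name="Attributes"),
    )

    # Put the headers shared by the most sheets first
    order = table.sum(axis=1).sort_values(ascending=False, kind="stable").index
    return table.loc[order]


# Toy input (made up for illustration, not the real sheets)
toy = {
    "transactions": ["customer_id", "product_id"],
    "demographic": ["customer_id", "DOB"],
    "address": ["customer_id", "postcode"],
}
result = presence_table(toy)
print(result)
```

Headers that end up at the top of such a table are the natural candidates for join keys, because they exist in the most sheets.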


### Highlights of the correlation analysis
The goal is to combine the data from the three sheets (Customer Demographic, Customer Addresses, Transactions) into a master sheet, which will serve as the training set for a model that predicts high-value customers.
- **customer_id** can be used as the primary key to combine the data, since it appears in all three of these sheets
- some columns are clearly irrelevant and can be dropped
- rows with missing data can be dropped as well, since they would affect model building
- the DOB can be converted to age, so that the analysis can be done by age group
- a **polynomial regression** model can be used to predict the high-value customers
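
The data-preparation steps listed above could look roughly like the following. This is a minimal sketch on toy data, not the project's actual pipeline: the three small dataframes stand in for the real sheets, the `list_price` column, the `total_spend` aggregate, the reference date, and the age-group boundaries are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the three sheets (values are made up)
demographic = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "DOB": ["1980-05-01", "1995-11-20", None],
    "past_3_years_bike_related_purchases": [40, 12, 77],
})
address = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "state": ["NSW", "VIC", "QLD"],
    "property_valuation": [10, 7, 9],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "list_price": [100.0, 250.0, 80.0, 60.0],  # hypothetical column name
})

# Collapse transactions to one row per customer before joining
spend = (
    transactions.groupby("customer_id", as_index=False)["list_price"]
    .sum()
    .rename(columns={"list_price": "total_spend"})
)

# Combine the sheets using customer_id as the key
master = (
    demographic.merge(address, on="customer_id", how="inner")
    .merge(spend, on="customer_id", how="inner")
)

# Drop rows with missing data (customer 3 has no DOB in this toy data)
master = master.dropna().reset_index(drop=True)

# Convert DOB to age in whole years at an assumed reference date
reference_date = pd.Timestamp("2018-01-01")  # assumption, not from the data
dob = pd.to_datetime(master["DOB"])
had_birthday = (dob.dt.month < reference_date.month) | (
    (dob.dt.month == reference_date.month) & (dob.dt.day <= reference_date.day)
)
master["age"] = reference_date.year - dob.dt.year - (~had_birthday).astype(int)

# Bucket ages into groups; the boundaries are illustrative only
master["age_group"] = pd.cut(
    master["age"],
    bins=[0, 30, 50, 120],
    labels=["under 30", "30-49", "50+"],
    right=False,
)

print(master[["customer_id", "total_spend", "age", "age_group"]])
```

An inner join keeps only customers present in every sheet, which matches the idea of dropping incomplete records. A left join from the demographic sheet would be the alternative if customers without transactions should be kept.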