# Data Modelling for Medium Size Bikes & Cycling Accessories Organization
> This project was completed as part of the KPMG internship experience. I was provided with data sets from a client organization that wanted our feedback on the quality of its data and how it could be improved.

### Background
- Sprocket Central Pty Ltd is a medium-size bikes & cycling accessories organisation.
- It needs help with its customer and transactions data.
- It wants to know how to analyse that data to optimise its marketing strategy effectively.

### Datasets
- New Customer List
- Customer Demographic
- Customer Addresses
- Transactions data for the past 3 months

### Task
- Build a model to predict which new customers will convert into paying customers.
- Use RFM-based cluster segmentation to identify the target customers.
- Build Dashboards in Tableau/PowerBI to present your findings.
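
For readers unfamiliar with RFM, here is a minimal sketch of how Recency, Frequency and Monetary values can be derived per customer before any clustering. The toy data and the column names (`transaction_date`, `list_price`) are assumptions for illustration and may not match the actual Transactions sheet. The snapshot date (one day after the last transaction) is also just one common convention.

```python
import pandas as pd

# Toy transactions; the column names are assumptions, not the real sheet layout
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "transaction_date": pd.to_datetime([
        "2017-10-01", "2017-12-15", "2017-11-20",
        "2017-10-05", "2017-11-11", "2017-12-30",
    ]),
    "list_price": [100.0, 250.0, 80.0, 40.0, 60.0, 500.0],
})

# Reference point for recency: the day after the most recent transaction
snapshot = tx["transaction_date"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("customer_id").agg(
    # Days since the customer's last purchase
    recency=("transaction_date", lambda d: (snapshot - d.max()).days),
    # Number of purchases
    frequency=("transaction_date", "count"),
    # Total amount spent
    monetary=("list_price", "sum"),
)

print(rfm)
```

The resulting three columns can then be scaled and passed to a clustering algorithm, or binned into scores, to form the customer segments.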

In [1]:
# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import core excel sheet
xls = pd.ExcelFile('KPMG_VI_New_raw_data_update_final.xlsx')

Transactions = pd.read_excel(xls, 'Transactions', skiprows=1)
NewCustomerList = pd.read_excel(xls, 'NewCustomerList', skiprows=1)
Demographic = pd.read_excel(xls, 'CustomerDemographic', skiprows=1)
Address = pd.read_excel(xls, 'CustomerAddress', skiprows=1)

# Import Reference sheet
Reference = pd.read_csv('reference.csv')

In [3]:
common_columns = set(Reference.columns) & set(NewCustomerList.columns)

print("Common columns between Reference and NewCustomerList:")
print(common_columns)

Common columns between Reference and NewCustomerList:
{'job_industry_category', 'postcode', 'DOB', 'past_3_years_bike_related_purchases', 'property_valuation', 'wealth_segment', 'owns_car', 'state', 'tenure', 'gender', 'job_title'}


We will try to use most of these features, but I don't think postcode will be of much use, so we may as well drop it.
We will also need to engineer some features, such as customer_age, as we did in the previous task.
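
As a rough illustration of this step, the sketch below derives `customer_age` from `DOB` and drops `postcode` on a toy frame. The reference date and the sample rows are assumptions, and this is not necessarily how the previous task computed age.

```python
import pandas as pd

# Toy customer data; the values are made up for illustration
customers = pd.DataFrame({
    "DOB": ["1980-06-15", "1995-01-02", None],
    "postcode": [2000, 3000, 4000],
})

# Assumed reference date for computing age; pick one that suits the data
reference_date = pd.Timestamp("2018-01-01")

# Parse DOB; unparseable or missing values become NaT
dob = pd.to_datetime(customers["DOB"], errors="coerce")

# True when the birthday has already occurred in the reference year
had_birthday = (dob.dt.month < reference_date.month) | (
    (dob.dt.month == reference_date.month) & (dob.dt.day <= reference_date.day)
)

# Age in whole years; rows with a missing DOB stay NaN
customers["customer_age"] = (
    reference_date.year - dob.dt.year - (~had_birthday).astype(int)
)

# Drop the raw DOB and the postcode column we decided not to use
customers = customers.drop(columns=["DOB", "postcode"])

print(customers)
```

Missing birth dates are left as NaN here so that an imputation strategy can be chosen separately.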

In [4]:
# Let's have a look at the reference dataset
Reference.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17015 entries, 0 to 17014
Data columns (total 32 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   customer_id                          17015 non-null  int64  
 1   gender                               17015 non-null  object 
 2   past_3_years_bike_related_purchases  17015 non-null  int64  
 3   DOB                                  17015 non-null  object 
 4   job_title                            17015 non-null  object 
 5   job_industry_category                17015 non-null  object 
 6   wealth_segment                       17015 non-null  object 
 7   owns_car                             17015 non-null  bool   
 8   tenure                               17015 non-null  int64  
 9   postcode                             17015 non-null  int64  
 10  state                                17015 non-null  object 
 11  property_valuation          