<a href="https://colab.research.google.com/github/jeremysb1/data_analysis_projects/blob/main/modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Modeling

This project seeks to answer the question:

How many customers does this business have?

I will combine multiple data sources into a single dataset representing a customer data model.

## Part 1: Exploring, extracting, and combining customer data

In [1]:
import pandas as pd
import numpy as np

sales = pd.read_csv('/content/drive/MyDrive/Data Analysis Projects/Data Modeling/purchases.csv')
print(sales.shape)

(71519, 11)


In [2]:
sales.isnull().sum()

event_time              0
product_id              0
category_id             0
category_code       16739
brand                5707
price                   0
session_id              0
customer_id         18448
guest_first_name    53071
guest_surname       53071
guest_postcode      53071
dtype: int64

I create a new column, is_guest, to flag guest checkouts, which occur when no customer ID is provided.

In [3]:
sales['is_guest'] = sales['customer_id'].isnull()

In [4]:
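# Sanity check: rows where the flag and customer_id disagree (expect none).
# Both conditions are combined so the notebook displays a single result.
sales[
    (sales["is_guest"] & sales["customer_id"].notnull())
    | (~sales["is_guest"] & sales["customer_id"].isnull())
]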

Unnamed: 0,event_time,product_id,category_id,category_code,brand,price,session_id,customer_id,guest_first_name,guest_surname,guest_postcode,is_guest


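The check returns no rows, so every purchase was made either through a guest checkout or by a registered customer. The null counts are consistent with this: 18,448 rows lack a customer ID, and 18,448 rows (71,519 - 53,071) carry guest details.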
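
Because is_guest is derived directly from customer_id, comparing the two cannot reveal much on its own. A more informative check might compare the flag against the guest detail columns instead. The following is a minimal sketch on hypothetical toy data (the rows and the `has_guest_details` name are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Toy purchases (hypothetical values) using the same column names as `sales`.
toy_sales = pd.DataFrame({
    "customer_id": [101.0, None, None, 102.0],
    "guest_first_name": [None, "RUBY", None, "KIAN"],
    "guest_surname": [None, "OWEN", None, "MILLS"],
    "guest_postcode": [None, "PO377YS", None, "SW332TF"],
})
toy_sales["is_guest"] = toy_sales["customer_id"].isnull()

guest_columns = ["guest_first_name", "guest_surname", "guest_postcode"]

# True where at least one guest detail is present on the row.
has_guest_details = toy_sales[guest_columns].notnull().any(axis=1)

# Rows where the flag and the guest details disagree:
# a guest purchase without details, or a registered purchase with guest details.
mismatches = toy_sales[toy_sales["is_guest"] != has_guest_details]
print(mismatches.index.tolist())
```

On the real data, an empty result from this kind of check would confirm that guest details appear exactly on the rows without a customer ID.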

In [5]:
sales["is_guest"].value_counts(normalize=True)

is_guest
False    0.742055
True     0.257945
Name: proportion, dtype: float64

About 74% of purchases come from registered customers and about 26% from guest checkouts.

In [6]:
guest_columns = ["guest_first_name", "guest_surname", "guest_postcode"]

unique_guests = sales[guest_columns].drop_duplicates()
print(len(unique_guests))

unique_customers = sales["customer_id"].unique()
cust_total = len(unique_customers) + len(unique_guests)

print(len(unique_guests) / (cust_total - 1))

8301
0.2495640671036017


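Both counts above include one null placeholder: the 8,301 guest combinations contain one all-null row contributed by registered purchases, and the array of customer IDs contains one NaN contributed by guest purchases. Excluding these, there are 24,961 unique customer IDs, which represent registered customers, and 8,300 unique combinations of guest information, so guests make up roughly 25% of the total. From purchases alone, I estimate the upper bound of the number of customers to be around 33,000. It is an upper bound because the same person may appear both as a guest and as a registered customer.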
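
The counting step can be sketched so that null placeholders never enter the totals. This is an illustrative version on hypothetical toy data, not the code that produced the output above. It relies on `Series.nunique()`, which ignores NaN by default, and on `dropna(how="all")` to discard rows with no guest details:

```python
import pandas as pd

# Toy purchases (hypothetical values): two registered customers, two distinct guests.
toy_sales = pd.DataFrame({
    "customer_id": [7466.0, 7466.0, 1035.0, None, None, None],
    "guest_first_name": [None, None, None, "RUBY", "RUBY", "KIAN"],
    "guest_surname": [None, None, None, "OWEN", "OWEN", "MILLS"],
    "guest_postcode": [None, None, None, "PO377YS", "PO377YS", "SW332TF"],
})
guest_columns = ["guest_first_name", "guest_surname", "guest_postcode"]

# nunique() skips NaN, so guest rows do not add a phantom customer ID.
n_registered = toy_sales["customer_id"].nunique()

# Drop rows with no guest details at all before de-duplicating.
n_guests = len(toy_sales[guest_columns].dropna(how="all").drop_duplicates())

total = n_registered + n_guests
guest_share = n_guests / total
print(n_registered, n_guests, guest_share)
```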

In [7]:
guest_columns = ["guest_first_name", "guest_surname", "guest_postcode", "is_guest"]
guests = sales.loc[sales["is_guest"], guest_columns]
guests = guests.drop_duplicates()
guests.head()

Unnamed: 0,guest_first_name,guest_surname,guest_postcode,is_guest
0,MICHAEL,MASON,RG497ZQ,True
2,COLE,WILKINSON,SW75TQ,True
3,MOHAMMED,RICHARDS,RG150RE,True
7,KIAN,MILLS,SW332TF,True
13,RUBY,OWEN,PO377YS,True


In [8]:
non_guests = pd.DataFrame(
    sales.loc[sales["customer_id"].notnull(), "customer_id"].unique().astype(int),
    columns=["customer_id"],
)

non_guests.head()

Unnamed: 0,customer_id
0,7466
1,31266
2,534142828
3,1035
4,6985


Combining guest and non-guest data:

In [9]:
sales_customers = pd.concat([non_guests, guests], axis=0, ignore_index=True)

In [10]:
new_col_names = ["customer_id", "first_name", "surname", "postcode", "is_guest"]
sales_customers = sales_customers.set_axis(new_col_names, axis=1)

In [11]:
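# Registered customers have no is_guest value after the concat; mark them False
# and cast explicitly so the column ends up with a plain boolean dtype.
sales_customers["is_guest"] = sales_customers["is_guest"].fillna(False).astype(bool)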

In [12]:
sales_customers["in_purchase_data"] = True

In [13]:
for col in ["first_name", "surname"]:
    sales_customers[col] = sales_customers[col].str.lower().str.strip()

sales_customers["postcode"] = sales_customers["postcode"].str.strip()
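
Normalizing case and whitespace matters if these name fields are ever compared with another source that formats them differently. The sketch below wraps the same steps in a helper so they can be applied to both sides of a comparison. The `normalize_names` function and the toy frames are hypothetical, introduced only for illustration:

```python
import pandas as pd

def normalize_names(df, name_cols=("first_name", "surname"), postcode_col="postcode"):
    """Return a copy with lower-cased, stripped names and a stripped postcode."""
    out = df.copy()
    for col in name_cols:
        out[col] = out[col].str.lower().str.strip()
    out[postcode_col] = out[postcode_col].str.strip()
    return out

# Hypothetical records for the same person, formatted differently in two sources.
toy_sales_side = pd.DataFrame(
    {"first_name": ["HOLLY "], "surname": [" ROGERS"], "postcode": ["LS475RT "]}
)
toy_other_side = pd.DataFrame(
    {"first_name": ["Holly"], "surname": ["Rogers"], "postcode": ["LS475RT"]}
)

cols = ["first_name", "surname", "postcode"]

# Without normalization the records do not match; with it they do.
raw_match = toy_sales_side.merge(toy_other_side, on=cols)
clean_match = normalize_names(toy_sales_side).merge(
    normalize_names(toy_other_side), on=cols
)
print(len(raw_match), len(clean_match))
```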

## Explore CRM data

In [15]:
crm = pd.read_csv('/content/drive/MyDrive/Data Analysis Projects/Data Modeling/crm_export.csv')
print(crm.shape)
crm.head()

(7825, 5)


Unnamed: 0,customer_id,first_name,surname,postcode,age
0,29223,Holly,Rogers,LS475RT,12
1,27826,Daniel,Owen,M902XX,5
2,7432,Eleanor,Russell,HR904ZA,34
3,2569,Paige,Roberts,DE732EP,61
4,9195,Matilda,Young,LS670FU,78


In [16]:
crm.isnull().sum()

customer_id    0
first_name     0
surname        0
postcode       0
age            0
dtype: int64

In [17]:
crm.groupby("customer_id").size().loc[lambda x: x > 1]

Series([], dtype: int64)

In [18]:
print(len(crm))
print(len(crm.drop(columns = "customer_id").drop_duplicates()))

7825
7419
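
No customer ID repeats in the CRM export, but the two counts above differ by 406, which suggests that some rows share the same name, postcode, and age under different customer IDs. One way to inspect such rows is `DataFrame.duplicated` with `keep=False`, which marks every member of a repeated group. The sketch below uses hypothetical toy data with the same columns as crm:

```python
import pandas as pd

# Toy CRM export (hypothetical values): IDs 1 and 3 share an identical profile.
toy_crm = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "first_name": ["Holly", "Daniel", "Holly", "Paige"],
    "surname": ["Rogers", "Owen", "Rogers", "Roberts"],
    "postcode": ["LS475RT", "M902XX", "LS475RT", "DE732EP"],
    "age": [12, 5, 12, 61],
})
profile_cols = ["first_name", "surname", "postcode", "age"]

# keep=False flags all rows in a repeated group, not just the later copies.
repeated = toy_crm[toy_crm.duplicated(subset=profile_cols, keep=False)]

# Collect the customer IDs attached to each repeated profile.
ids_per_profile = repeated.groupby(profile_cols)["customer_id"].apply(list)
print(ids_per_profile)
```

Whether such rows represent one person with several accounts or several people with identical details cannot be determined from these columns alone.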


Next, I align the CRM data with the schema of the customers extracted from the purchase table, and I enrich the registered customers in the purchase history with their details from the CRM data.

In [19]:
sales_and_crm_customers = sales_customers.merge(crm, on="customer_id", how="left", suffixes=("_sales", "_crm"))
print(len(sales_and_crm_customers))
sales_and_crm_customers.isnull().sum()

33261


customer_id          8300
first_name_sales    24961
surname_sales       24961
postcode_sales      24961
is_guest                0
in_purchase_data        0
first_name_crm      26147
surname_crm         26147
postcode_crm        26147
age                 26147
dtype: int64

In [20]:
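# A row counts as matched to the CRM when it has a customer ID
# and at least one CRM name field was filled in by the merge.
merged_customers_filter = (
    sales_and_crm_customers["customer_id"].notnull()
    & (
        sales_and_crm_customers["first_name_crm"].notnull()
        | sales_and_crm_customers["surname_crm"].notnull()
    )
)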

In [21]:
# The filter is already a boolean Series, so it can be assigned directly.
sales_and_crm_customers["in_crm_filter"] = merged_customers_filter
sales_and_crm_customers["in_crm_filter"].value_counts()

in_crm_filter
False    26147
True      7114
Name: count, dtype: int64
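
As an alternative to inferring CRM membership from null name fields, pandas can record where each merged row came from through the `indicator=True` option of `merge`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below uses hypothetical toy frames, and the `in_crm` column name is an assumption chosen for illustration:

```python
import pandas as pd

# Toy customers from purchases (hypothetical): two registered, one guest without an ID.
toy_customers = pd.DataFrame({
    "customer_id": [1.0, 2.0, None],
    "is_guest": [False, False, True],
})

# Toy CRM export (hypothetical): only customer 1 also appears in the purchases.
toy_crm = pd.DataFrame({
    "customer_id": [1.0, 9.0],
    "first_name": ["Holly", "Paige"],
})

# indicator=True adds a "_merge" column describing the origin of each row.
merged = toy_customers.merge(toy_crm, on="customer_id", how="left", indicator=True)

# A row found in both frames is a purchase customer that also exists in the CRM.
merged["in_crm"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")
print(merged)
```

This avoids relying on the CRM name columns being non-null, which matters if a CRM record could legitimately have missing names.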