<a href="https://colab.research.google.com/github/jeremysb1/data_analysis_projects/blob/main/modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Modeling

This project seeks to answer the question:

How many customers does this business have?

I will combine multiple data sources into a single dataset representing a customer data model.

## Part 1: Exploring, extracting, and combining customer data

In [1]:
import pandas as pd
import numpy as np

sales = pd.read_csv('/content/drive/MyDrive/Data Analysis Projects/Data Modeling/purchases.csv')
print(sales.shape)

(71519, 11)


In [2]:
sales.isnull().sum()

event_time              0
product_id              0
category_id             0
category_code       16739
brand                5707
price                   0
session_id              0
customer_id         18448
guest_first_name    53071
guest_surname       53071
guest_postcode      53071
dtype: int64

I create a new column, is_guest, to flag guest checkouts: purchases made without a customer ID.

In [3]:
sales['is_guest'] = sales['customer_id'].isnull()

In [4]:
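# Combine both consistency checks so the cell displays a single result; it should be empty.
inconsistent = (sales["is_guest"] & sales["customer_id"].notnull()) | (~sales["is_guest"] & sales["customer_id"].isnull())
sales[inconsistent]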

| | event_time | product_id | category_id | category_code | brand | price | session_id | customer_id | guest_first_name | guest_surname | guest_postcode | is_guest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|


The empty result shows that every purchase is flagged consistently: each one was made either through guest checkout or by a registered customer.
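Because is_guest is derived directly from customer_id, the check above cannot fail. A complementary check compares the flag against the guest detail columns instead. The null counts reported earlier are consistent with this: 18,448 rows lack a customer ID and 53,071 lack guest details, which together account for all 71,519 rows. Below is a minimal sketch on a small made-up frame; the rows are hypothetical and only the column names come from the purchases data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names mirror purchases.csv.
toy = pd.DataFrame({
    "customer_id": [101.0, np.nan, 102.0, np.nan],
    "guest_first_name": [np.nan, "MICHAEL", np.nan, "RUBY"],
    "guest_surname": [np.nan, "MASON", np.nan, "OWEN"],
    "guest_postcode": [np.nan, "RG497ZQ", np.nan, "PO377YS"],
})
toy["is_guest"] = toy["customer_id"].isnull()

guest_columns = ["guest_first_name", "guest_surname", "guest_postcode"]

# True where all three guest fields are filled in.
has_guest_info = toy[guest_columns].notnull().all(axis=1)

# Rows where the flag and the guest fields disagree; expected to be empty.
mismatches = toy[toy["is_guest"] != has_guest_info]
print(len(mismatches))
```

On the real data, a non-empty mismatches frame would point to purchases with neither or both kinds of identification.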

In [5]:
sales["is_guest"].value_counts(normalize=True)

is_guest
False    0.742055
True     0.257945
Name: proportion, dtype: float64

About 74% of purchases were made by registered customers and about 26% through guest checkout.
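Note that this is a share of purchases (rows), not of customers: a buyer who purchases several times is counted once per purchase. The sketch below illustrates the difference on a hypothetical purchase log; the buyer column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical purchase log: one registered customer buys three times,
# one guest buys once.
purchases = pd.DataFrame({
    "buyer": ["cust_101", "cust_101", "cust_101", "guest_mason"],
    "is_guest": [False, False, False, True],
})

# Share of purchases (rows) by flag, as in the cell above.
row_shares = purchases["is_guest"].value_counts(normalize=True).to_dict()

# Equivalent shortcut for the guest share: True counts as 1, False as 0.
purchase_share = purchases["is_guest"].mean()

# Share of distinct buyers who are guests.
buyers = purchases.drop_duplicates(subset="buyer")
buyer_share = buyers["is_guest"].mean()

print(row_shares, purchase_share, buyer_share)
```

Here guests account for a quarter of the purchases but half of the distinct buyers, which is why the customer count needs unique identifiers rather than row counts.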

In [6]:
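guest_columns = ["guest_first_name", "guest_surname", "guest_postcode"]

# Unique guest name/postcode combinations. Registered purchases have NaN in all
# three columns, so one all-NaN row is included in this count.
unique_guests = sales[guest_columns].drop_duplicates()
print(len(unique_guests))

# unique() also returns NaN (from the guest purchases) as one of its values.
unique_customers = sales["customer_id"].unique()
cust_total = len(unique_customers) + len(unique_guests)

# Guests as a share of all unique customers; the 1 subtracted offsets one NaN entry.
print(len(unique_guests) / (cust_total - 1))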

8301
0.2495640671036017


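There are around 25,000 unique customer IDs, which represent registered customers, and roughly 8,000 unique combinations of guest details, so guests make up about 25% of the total. From purchases alone, I estimate the number of customers to be at most around 33,000. It is an upper bound because the same person could be counted more than once, for example as both a registered customer and a guest.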
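In the counting cell above, unique() treats NaN as a value and drop_duplicates() keeps one all-NaN row, so each count includes one placeholder for missing data. This does not change the rounded estimate, but a NaN-safe variant avoids the manual correction. The following is a sketch on hypothetical rows, not the notebook's own method; only the column names come from the purchases data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: 2 registered customers and 2 distinct guests.
toy = pd.DataFrame({
    "customer_id": [101.0, 101.0, np.nan, 102.0, np.nan, np.nan],
    "guest_first_name": [np.nan, np.nan, "MICHAEL", np.nan, "RUBY", "MICHAEL"],
    "guest_surname": [np.nan, np.nan, "MASON", np.nan, "OWEN", "MASON"],
    "guest_postcode": [np.nan, np.nan, "RG497ZQ", np.nan, "PO377YS", "RG497ZQ"],
})
guest_columns = ["guest_first_name", "guest_surname", "guest_postcode"]

# Naive counts: each includes one entry that stands for missing values.
naive_registered = len(toy["customer_id"].unique())
naive_guests = len(toy[guest_columns].drop_duplicates())

# NaN-safe counts: nunique() skips NaN by default, and dropna(how="all")
# removes rows where every guest column is missing.
n_registered = toy["customer_id"].nunique()
n_guests = len(toy[guest_columns].dropna(how="all").drop_duplicates())

upper_bound = n_registered + n_guests
print(naive_registered, naive_guests)
print(n_registered, n_guests, n_guests / upper_bound)
```

Using dropna(how="all") rather than the default how="any" keeps guests with partially missing details, which is a design choice that would need checking against the real data.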

In [8]:
guest_columns = ["guest_first_name", "guest_surname", "guest_postcode", "is_guest"]
guests = sales.loc[sales["is_guest"], guest_columns]
guests = guests.drop_duplicates()
guests.head()

| | guest_first_name | guest_surname | guest_postcode | is_guest |
|---|---|---|---|---|
| 0 | MICHAEL | MASON | RG497ZQ | True |
| 2 | COLE | WILKINSON | SW75TQ | True |
| 3 | MOHAMMED | RICHARDS | RG150RE | True |
| 7 | KIAN | MILLS | SW332TF | True |
| 13 | RUBY | OWEN | PO377YS | True |
