# E-commerce Customer Profiling and Data Consolidation Project

## Problem Statement

An e-commerce startup, after a successful first month, wants to better understand its customer base and purchasing patterns. It aims to answer fundamental questions such as "Who are our customers?", "What do they buy?", and "What drives their purchasing behavior?" However, it struggles to identify and count customers accurately because the data is spread across disparate sources, including:

1. A customer database recording online account sign-ups,
2. A CRM system tracking phone and non-online customer interactions, and
3. Raw transaction data that includes guest purchases with no formal customer record.

These sources of data may have overlapping entries or duplicate records, often due to customers engaging with multiple systems (e.g., making a purchase as a guest, then creating an account later). Furthermore, customer information may vary across entries due to potential discrepancies like typos or alternative identifiers. The startup seeks the assistance of an analyst to address these complexities and provide an accurate count and profile of their customers.


## Project Goals

1. **Data Consolidation:**  
   Integrate customer information across all available sources to create a unified, comprehensive dataset representing the complete customer base.

2. **Data Deduplication:**  
   Identify and resolve duplicate records, maintaining data lineage to track original sources and identifiers for each customer where possible.

3. **Customer Identification:**  
   Develop a data model that uniquely identifies each customer, regardless of which data source(s) they appear in, and can be easily queried for counts and profiles.

4. **Data Validation:**  
   Ensure data integrity by confirming assumptions, identifying inconsistencies, and validating data completeness to maximize reliability.

5. **Solution Documentation:**  
   Create a flexible and traceable solution, with a well-defined schema that supports future analysis of customer demographics and purchasing behavior.

This project will ultimately enable the startup to answer critical questions about their customer demographics and purchasing trends, creating a foundation for data-driven decision-making.

To build the data model, we take the following steps:

1. First, explore all three datasets and check that the values in every column make sense.

2. Then trim each dataset down to unique customers before merging.

3. Merge the trimmed datasets and carefully remove duplicates.

4. During deduplication, go beyond exact-match duplicates and use fuzzy matching to catch near matches such as typos.

5. Finally, clean and present the data model.
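
To make step 4 concrete, here is a minimal sketch of fuzzy name matching using only the standard library's `difflib`. The helper `name_similarity` and the `THRESHOLD` value are illustrative assumptions, not part of the project code; a real threshold would need tuning against known duplicates.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# Hypothetical cut-off above which two names are treated as a likely match.
THRESHOLD = 0.85

pairs = [
    ("Mohammed Richards", "Mohammad Richards"),  # one-letter typo
    ("Ruby Owen", "Kian Mills"),                 # unrelated names
]
for left, right in pairs:
    score = name_similarity(left, right)
    print(f"{left!r} vs {right!r}: {score:.2f}, match={score >= THRESHOLD}")
```

In practice the name score would be combined with other fields such as postcode, since names alone can collide for distinct people.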

## Exploring our Purchases Dataset

In [1]:
import pandas as pd
import numpy as np

sales = pd.read_csv("../data/purchases.csv")
print(sales.shape)

(71519, 11)


In [2]:
sales.head()


Unnamed: 0,event_time,product_id,category_id,category_code,brand,price,session_id,customer_id,guest_first_name,guest_surname,guest_postcode
0,2022-10-01 02:26:08+00:00,32701106,2055156924466332447,,shimano,95.21,64c68405-7002-4ce0-9604-a4c2e1f7384b,,MICHAEL,MASON,RG497ZQ
1,2022-10-01 02:28:32+00:00,9400066,2053013566067311601,,jaguar,164.2,3b7d6741-3c82-4c75-8015-6f54b52612e0,7466.0,,,
2,2022-10-01 02:31:01+00:00,1004238,2053013555631882655,electronics.smartphone,apple,1206.4,38c6d3f7-6c32-4fed-bca6-ef98e1746386,,COLE,WILKINSON,SW75TQ
3,2022-10-01 02:33:31+00:00,11300059,2053013555531219353,electronics.telephone,texet,17.48,3398c966-7846-4186-89be-323daad735b9,,MOHAMMED,RICHARDS,RG150RE
4,2022-10-01 02:40:18+00:00,17300751,2053013553853497655,,versace,77.22,11e3a573-01b9-4794-b513-e7d8a4fcac83,31266.0,,,


In [3]:
sales.isnull().sum()

event_time              0
product_id              0
category_id             0
category_code       16739
brand                5707
price                   0
session_id              0
customer_id         18448
guest_first_name    53071
guest_surname       53071
guest_postcode      53071
dtype: int64

The 18448 rows with a missing customer ID (guest checkouts) and the 53071 rows with missing guest details (registered customers) add up to all 71519 rows. We should still verify that there is no overlap, e.g. a row with both a customer ID and guest details, or a row with neither.
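
One way to run that check is to compare the customer ID against the guest columns directly. The sketch below uses a small toy frame standing in for the purchases data (the values are illustrative, and the names `has_id`, `has_guest`, `both` and `neither` are assumptions for this example).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the purchases data, with the same column names.
toy = pd.DataFrame({
    "customer_id": [np.nan, 7466.0, np.nan, 31266.0],
    "guest_first_name": ["MICHAEL", None, "COLE", None],
    "guest_surname": ["MASON", None, "WILKINSON", None],
    "guest_postcode": ["RG497ZQ", None, "SW75TQ", None],
})

guest_cols = ["guest_first_name", "guest_surname", "guest_postcode"]
has_id = toy["customer_id"].notnull()
has_guest = toy[guest_cols].notnull().any(axis=1)

# Rows carrying both a customer ID and guest details, and rows carrying neither.
both = toy[has_id & has_guest]
neither = toy[~has_id & ~has_guest]
print(len(both), len(neither))
```

If both counts are zero on the real data, every row belongs to exactly one of the two groups.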

Let's create a column to track guest checkouts:

In [4]:
sales["is_guest"] = sales["customer_id"].isnull()

Let's check for cases where a guest checkout also has a customer ID filled in:

In [5]:
sales[sales["is_guest"] & sales["customer_id"].notnull()]

Unnamed: 0,event_time,product_id,category_id,category_code,brand,price,session_id,customer_id,guest_first_name,guest_surname,guest_postcode,is_guest


No such case. Let's also check for rows that are neither flagged as a guest checkout nor have a customer ID:

In [6]:
sales[(sales["is_guest"] == False) & sales["customer_id"].isnull()]

Unnamed: 0,event_time,product_id,category_id,category_code,brand,price,session_id,customer_id,guest_first_name,guest_surname,guest_postcode,is_guest


This tells us that every row is either a guest checkout or a purchase made by a registered customer. Let's now check what percentage of records are guest checkouts.

In [8]:
sales['is_guest'].value_counts(normalize=True)

False    0.742055
True     0.257945
Name: is_guest, dtype: float64

Now we know that about 26% of our records are guest checkouts. Keep in mind, though, that each row in our dataset represents a purchased item, not a customer record, so this is not the actual proportion of guest customers. To get closer to that, we count the unique guests and unique registered customers:

In [11]:
guest_columns = ["guest_first_name", "guest_surname", "guest_postcode"]
unique_guests = sales[guest_columns].drop_duplicates()
print(len(unique_guests))

unique_customers = sales["customer_id"].unique()
cust_total = len(unique_customers) + len(unique_guests) - 2 # Subtracting 2: the all-null guest row and the null customer_id are each counted once
print(len(unique_customers))
print(cust_total)

8301
24962
33261
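
To turn such counts into the share of unique customers who checked out as guests, the null placeholders can be excluded before counting. This is a minimal sketch on toy data (the rows and the variable names `n_guests`, `n_registered` and `guest_share` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy purchases: one guest buys twice, one registered customer buys twice.
toy = pd.DataFrame({
    "customer_id": [np.nan, 7466.0, np.nan, 7466.0, 31266.0, np.nan],
    "guest_first_name": ["MICHAEL", None, "COLE", None, None, "MICHAEL"],
    "guest_surname": ["MASON", None, "WILKINSON", None, None, "MASON"],
    "guest_postcode": ["RG497ZQ", None, "SW75TQ", None, None, "RG497ZQ"],
})

guest_cols = ["guest_first_name", "guest_surname", "guest_postcode"]

# Count guests only among rows without a customer ID, so no all-null row sneaks in.
n_guests = len(toy.loc[toy["customer_id"].isnull(), guest_cols].drop_duplicates())
# nunique() ignores NaN by default.
n_registered = toy["customer_id"].nunique()

guest_share = n_guests / (n_guests + n_registered)
print(n_guests, n_registered, guest_share)
```

Note that this still treats two guests with the same name and postcode as one person, and one guest with a typo as two, which is what the later deduplication step is meant to address.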


## Exporting Customer Data from Purchases

We need a schema in which to store our customer records. Looking at our datasets, the customer database and the CRM data both contain names, postcodes and age. Age is not available for guest customers, but we should not discard the column just because it's missing for guests. We can also track where each record came from by adding one indicator column per data source. A single column called sources, with values like 'purchases', 'CRM' or 'customer database', would be simpler, but it could not represent a customer who appears in more than one source.

Our data model schema will contain these columns:
customer_id, first_name, surname, postcode, age, is_guest, in_purchase_data, in_crm_data, in_customer_data
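
As a rough illustration, the schema could be declared up front as an empty, typed DataFrame. The dtypes below are assumptions chosen for this sketch (pandas nullable types, so that missing IDs and ages do not force a float column); the source only fixes the column names.

```python
import pandas as pd

# Column names come from the schema above; the dtypes are illustrative assumptions.
SCHEMA = {
    "customer_id": "Int64",        # nullable integer: guests have no ID
    "first_name": "string",
    "surname": "string",
    "postcode": "string",
    "age": "Int64",                # nullable integer: unknown for guests
    "is_guest": "boolean",
    "in_purchase_data": "boolean",
    "in_crm_data": "boolean",
    "in_customer_data": "boolean",
}

# Build an empty frame whose columns already carry the intended dtypes.
customers = pd.DataFrame({col: pd.Series(dtype=dtype) for col, dtype in SCHEMA.items()})
print(customers.dtypes)
```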

To extract customer data from purchases, we should first extract the guests and non-guests separately and then combine both.

In [15]:
guest_columns = ['guest_first_name', 'guest_surname', 'guest_postcode', 'is_guest']
guests = sales.loc[sales['is_guest'], guest_columns]
guests = guests.drop_duplicates()
guests.head()

Unnamed: 0,guest_first_name,guest_surname,guest_postcode,is_guest
0,MICHAEL,MASON,RG497ZQ,True
2,COLE,WILKINSON,SW75TQ,True
3,MOHAMMED,RICHARDS,RG150RE,True
7,KIAN,MILLS,SW332TF,True
13,RUBY,OWEN,PO377YS,True


In [16]:
non_guests = (pd.DataFrame(sales.loc[sales["customer_id"].notnull(), "customer_id"]
                          .unique()
                          .astype(int),
                          columns = ['customer_id']
                          ))
non_guests.head()

Unnamed: 0,customer_id
0,7466
1,31266
2,534142828
3,1035
4,6985


In [17]:
sales_customers = pd.concat([non_guests, guests], axis=0, ignore_index=True)

In [18]:
new_col_names = ["customer_id", "first_name", "surname","postcode", "is_guest"]
sales_customers = sales_customers.set_axis(new_col_names, axis=1)

In [19]:
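# Registered customers have no is_guest value after the concat; fill and cast to a proper bool column
sales_customers["is_guest"] = sales_customers["is_guest"].fillna(False).astype(bool)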

In [20]:
sales_customers["in_purchase_data"] = True

In [None]:
for col in ["first_name", "surname"]:
    sales_customers[col] = sales_customers[col].str.lower().str.strip()

sales_customers["postcode"] = sales_customers["postcode"].str.strip()
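
The extraction steps in this section could also be collected into a single function. The sketch below is one possible arrangement, not the notebook's own code: `extract_purchase_customers` is a hypothetical helper, it renames columns by name rather than by position, and it is run here on a toy frame.

```python
import numpy as np
import pandas as pd


def extract_purchase_customers(sales: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per customer seen in the purchases data."""
    is_guest = sales["customer_id"].isnull()

    # Unique guests, with the guest_ prefix dropped from the column names.
    guests = (
        sales.loc[is_guest, ["guest_first_name", "guest_surname", "guest_postcode"]]
        .drop_duplicates()
        .rename(columns={
            "guest_first_name": "first_name",
            "guest_surname": "surname",
            "guest_postcode": "postcode",
        })
        .assign(is_guest=True)
    )

    # Unique registered customers, identified by ID only.
    registered = pd.DataFrame({
        "customer_id": sales.loc[~is_guest, "customer_id"].unique().astype(int),
        "is_guest": False,
    })

    out = pd.concat([registered, guests], ignore_index=True).reindex(
        columns=["customer_id", "first_name", "surname", "postcode", "is_guest"]
    )
    out["in_purchase_data"] = True

    # Normalise text fields so later matching is not thrown off by case or spacing.
    for col in ["first_name", "surname"]:
        out[col] = out[col].str.lower().str.strip()
    out["postcode"] = out["postcode"].str.strip()
    return out


# Toy input with the same column names as the purchases data.
toy_sales = pd.DataFrame({
    "customer_id": [np.nan, 7466.0, np.nan, 7466.0],
    "guest_first_name": [" MICHAEL", None, " MICHAEL", None],
    "guest_surname": ["MASON", None, "MASON", None],
    "guest_postcode": ["RG497ZQ ", None, "RG497ZQ ", None],
})
result = extract_purchase_customers(toy_sales)
print(result)
```

Renaming by name avoids relying on the column order that `pd.concat` happens to produce.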