## Assignment

SpreeMart, a retailer, wants to improve the service experience for its customers, whose needs differ according to their profiles.

As a data scientist at SpreeMart, you are asked to develop a model in Python that divides the broad customer base into sub-groups based on shared characteristics, and to extract useful insights from both the data and your model. A sample of customer data is provided in the attached files:

- sm_customers.csv describes each customer and their location.
- sm_items.csv lists the items purchased within each order.
- sm_orders.csv contains the data for each order.
- sm_payments.csv details the payment options for each order.

SpreeMart expects you to follow good data science practice, including exploratory data analysis, model development, and insight interpretation. In addition, you should share the results with stakeholders by preparing a presentation, for an audience of mixed technical abilities, on your results and thinking process.

---

## Solution overview

SpreeMart is seeking to enhance its customer service experience by better understanding the diverse needs of its customers. Our objective is to analyze customer data and develop a model that segments customers into distinct groups based on shared characteristics. This segmentation will enable SpreeMart to tailor its services more effectively to each group.

### Expectations and solution approach

#### 1. Data exploration: 

- **Data cleaning and preparation:** We'll start by loading and examining the data. We'll identify and handle missing values, and perform any necessary data transformations.
- **Exploratory Data Analysis (EDA):** We'll look deep into the data, examining distributions, patterns, and relationships. This will help us understand the nature of the data and choose a modeling approach.

#### 2. Model development:
- **Customer Segmentation Model:** We will apply clustering techniques to segment the customers. The choice of technique will depend on the data's characteristics and the insights we wish to derive.
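
To make the idea of clustering concrete, here is a minimal sketch under stated assumptions. The clustering algorithm has not been chosen at this point, so k-means from scikit-learn is used purely as an illustration, and the two features (`order_count`, `total_spend`) and their values are hypothetical toy numbers, not taken from the SpreeMart files.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer features: [order_count, total_spend].
# Toy values for illustration only, not real SpreeMart data.
features = np.array([
    [1, 50.0], [1, 60.0], [2, 80.0],        # occasional, low-spend customers
    [8, 900.0], [9, 950.0], [10, 1100.0],   # frequent, high-spend customers
])

# Scale the features so that total_spend does not dominate the distance metric.
scaled = StandardScaler().fit_transform(features)

# k=2 is an arbitrary choice for this toy example.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(scaled)

print(labels)  # one cluster label per customer
```

In practice the number of clusters would be chosen from the data, for example by comparing several values of `n_clusters`, rather than fixed in advance.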

#### 3. Insight Generation:
- **Analysis and Insight Generation:** We'll analyze each segment to discover its unique traits and behaviors. This will lead to actionable insights that SpreeMart can use to enhance customer experience and personalize its service.

#### 4. Presentation of Results:
- **Presentation and Reporting:** We'll prepare a presentation that summarizes our findings and recommendations. It will be tailored to an audience of mixed technical abilities and will communicate our insights and their implications for business strategy.

---

### 1. Data exploration

In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

customers = pd.read_csv('./data/sm_customers.csv')
items = pd.read_csv('./data/sm_items.csv')
orders = pd.read_csv('./data/sm_orders.csv')
payments = pd.read_csv('./data/sm_payments.csv')


In [25]:
# Display a summary of each DataFrame

# Named `dataframes` rather than `dict` to avoid shadowing the built-in type
dataframes = {'customers': customers, 'items': items, 'orders': orders, 'payments': payments}

for key, df in dataframes.items():
    print(f"=========== Info: {key} ===========")
    print(df.info())
    print("\n")

    # Identify missing values
    print(f"--- Missing Values: {key} ---")
    print(df.isnull().sum())

    # Identify duplicate rows
    print(f"--- Duplicate Rows: {key} ---")
    print(df.duplicated().sum())

    # Identify unique values
    print(f"--- Unique Values: {key} ---")
    print(df.nunique())

    print("\n")




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         99441 non-null  object
 1   customer_unique_id  99441 non-null  object
 2   customer_city       99441 non-null  object
 3   customer_state      99441 non-null  object
 4   customer_zip_code   99441 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 3.8+ MB
None


--- Missing Values: customers ---
customer_id           0
customer_unique_id    0
customer_city         0
customer_state        0
customer_zip_code     0
dtype: int64
--- Duplicate Rows: customers ---
0
--- Unique Values: customers ---
customer_id           99441
customer_unique_id    96096
customer_city         38038
customer_state           51
customer_zip_code     50545
dtype: int64


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns)

----

##### Customer investigation

We found that the customers table has both customer_id and customer_unique_id. We want to understand how they relate to each other and which one is more useful for our analysis.


In [37]:

# total number of rows
print("Total customer rows count:", customers.shape[0])

# Check unique counts
print("Unique customer_id count:", customers['customer_id'].nunique())
print("Unique customer_unique_id count:", customers['customer_unique_id'].nunique())
print("\n")

# Check if there are multiple customer_ids for a single customer_unique_id
customer_id_counts = customers.groupby('customer_unique_id')['customer_id'].nunique()

# Customer_id counts per customer_unique_id
print("Distribution of customer_id counts per customer_unique_id:")
print(customer_id_counts.value_counts())
print("\n")

# Analyzing any customer_unique_id with more than one customer_id
multi_customer_ids = customer_id_counts[customer_id_counts > 1]

# rowcount multi_customer_ids
print("Count number of customer_unique_ids which have multiple customer_id:", multi_customer_ids.count())
print("Examples:")
print(multi_customer_ids.tail(5))


Total customer rows count: 99441
Unique customer_id count: 99441
Unique customer_unique_id count: 96096


Distribution of customer_id counts per customer_unique_id:
customer_id
1     93099
2      2745
3       203
4        30
5         8
6         6
7         3
9         1
17        1
Name: count, dtype: int64


Count number of customer_unique_ids which have multiple customer_id: 2997
Examples:
customer_unique_id
ff36be26206fffe1eb37afd54c70e18b    3
ff44401d0d8f5b9c54a47374eb48c1b8    2
ff8892f7c26aa0446da53d01b18df463    2
ff922bdd6bafcdf99cb90d7f39cea5b3    3
ffe254cc039740e17dd15a5305035928    2
Name: customer_id, dtype: int64
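
As a quick sanity check, the distribution printed above can be reconciled with the row counts using plain arithmetic. The figures below are copied from that output; the variable names are only illustrative.

```python
# Distribution copied from the output above:
# {customer_ids per customer_unique_id: number of customer_unique_ids}
distribution = {1: 93099, 2: 2745, 3: 203, 4: 30, 5: 8, 6: 6, 7: 3, 9: 1, 17: 1}

# Every customer_unique_id falls into exactly one bucket.
unique_customers = sum(distribution.values())

# Each bucket contributes (ids per customer) x (number of customers) rows.
total_customer_ids = sum(k * n for k, n in distribution.items())

# Customers linked to more than one customer_id.
repeat_customers = sum(n for k, n in distribution.items() if k > 1)

print(unique_customers)    # 96096
print(total_customer_ids)  # 99441
print(repeat_customers)    # 2997
print(f"{repeat_customers / unique_customers:.1%}")  # 3.1%
```

The totals match the row count (99441), the unique customer_unique_id count (96096), and the multi-ID count (2997), so roughly 3% of individual customers appear under more than one customer_id.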


##### Customer_id and customer_unique_id interpretation

- **customer_id** seems to be an order-specific identifier, unique to each transaction: every one of the 99441 rows has a distinct value. We can use this ID when analyzing at the order level.
- **customer_unique_id** appears to be a unique identifier for each **individual customer** across multiple transactions: 96096 distinct values cover the 99441 rows. This ID is crucial for customer-level analysis, especially for understanding customer behavior over time.

**Assumption:** 

- We will use **customer_unique_id** for our analysis, as it will allow us to group customers based on their overall behavior 
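
A minimal sketch of how this assumption could be applied: map each order to its customer_unique_id, then aggregate per individual customer. It uses toy data only and assumes the orders table carries a `customer_id` column and an `order_id` column, which has not been confirmed here. The names `customers_demo`, `orders_demo`, and `order_count` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; not taken from the SpreeMart files.
customers_demo = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
})

# Assumption: each order references a customer_id.
orders_demo = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
})

# Attach customer_unique_id to each order. validate="many_to_one" raises
# an error if customer_id is not unique in the customers table.
orders_with_customer = orders_demo.merge(
    customers_demo[["customer_id", "customer_unique_id"]],
    on="customer_id",
    how="left",
    validate="many_to_one",
)

# Aggregate at the level of the individual customer.
orders_per_customer = (
    orders_with_customer.groupby("customer_unique_id")["order_id"]
    .nunique()
    .rename("order_count")
)

print(orders_per_customer)
```

In this toy example, customer `u1` placed two orders under two different customer_id values. Grouping by customer_id would instead count them as two separate one-order customers.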

----

### xxx