# **1. Predict <u>churn probability</u> per user**

Predict <u>churn probability</u> per user at the last date in the dataset, using purchase history.

**GitHub**: 

#### **Assumptions**:
- Two approaches: heuristic + unsupervised clustering.

#### **Constraints**:
- No churn labels exist, so supervised training is not possible.
- Purchases are irregular events; there are no sequences over fixed time windows.

#### **Output**:
- One probability per user.


In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# **2. Load & inspect data**

In [6]:
df = pd.read_csv("./data/dataset.csv")
df.head()

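|   | UserId | Type | PurchasedAt | PurchasedAmount |
|---|--------|------|-------------|-----------------|
| 0 | 9      | 1    | 3/9/2019    | 974460          |
| 1 | 17     | 1    | 3/12/2019   | 3248200         |
| 2 | 20     | 1    | 3/13/2019   | 974460          |
| 3 | 28     | 1    | 3/19/2019   | 974460          |
| 4 | 29     | 1    | 3/23/2019   | 974460          |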


In [8]:
df.describe()

|       | UserId        | Type     | PurchasedAmount |
|-------|---------------|----------|-----------------|
| count | 278166.0      | 278166.0 | 278166.0        |
| mean  | 100482.220829 | 1.655044 | 6179461.0       |
| std   | 52613.138455  | 1.239377 | 13571640.0      |
| min   | 1.0           | 1.0      | 1000.0          |
| 25%   | 53044.0       | 1.0      | 1992000.0       |
| 50%   | 98232.0       | 1.0      | 3289875.0       |
| 75%   | 147804.75     | 1.0      | 5636834.0       |
| max   | 202320.0      | 4.0      | 957228500.0     |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278166 entries, 0 to 278165
Data columns (total 4 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   UserId           278166 non-null  int64 
 1   Type             278166 non-null  int64 
 2   PurchasedAt      278166 non-null  object
 3   PurchasedAmount  278166 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 8.5+ MB


# **3. Preprocessing**

In [19]:
unique_user_count = df["UserId"].nunique()
print("unique_user_count : ",unique_user_count)

unique_user_count :  60396


### 3-1. Duplication

In [None]:
print("duplicated num : ", df.duplicated().sum())

duplicated num :  5069

### 3-2. "PurchasedAt" feature

- Needs to be converted to a pandas datetime so that time-based features (recency, inter-purchase gaps, customer tenure, and so on) can be computed.
- Conversion: object (str) → datetime64[ns]

In [24]:
df['PurchasedAt'] = pd.to_datetime(df['PurchasedAt'], errors='coerce')
# df.info()
min_date = df['PurchasedAt'].min()
max_date = df['PurchasedAt'].max()
print("min date: {},\n max date : {}".format(min_date,max_date))

min date: 2019-03-09 00:00:00,
max date : 2025-12-06 00:00:00


### 3-3. "Type" feature

`Type` is categorical.\
After user-level aggregation, each `UserId` will have:
- num_A_purchases
- num_B_purchases
- share_A = `num_A / total_purchases`
- share_B = `num_B / total_purchases`
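
The per-type counts and shares listed above can be sketched with `pd.crosstab`. This is a minimal illustration on made-up rows, not the notebook's actual cell. The `describe()` output shows `Type` ranging from 1 to 4, so the sketch builds one column per observed type value; the column names `num_type_*` and `share_type_*` are hypothetical.

```python
import pandas as pd

# Toy purchases; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 2, 3],
    "Type":   [1, 1, 2, 2, 2, 4],
})

# Count purchases per (user, type); combinations that never occur become 0.
type_counts = pd.crosstab(toy["UserId"], toy["Type"])
type_counts.columns = [f"num_type_{t}" for t in type_counts.columns]

# Divide each row by the user's total purchases so the shares sum to 1.
type_shares = type_counts.div(type_counts.sum(axis=1), axis=0)
type_shares.columns = [c.replace("num_", "share_") for c in type_counts.columns]

type_features = type_counts.join(type_shares)
print(type_features)
```

Shares are often more useful than raw counts here, because counts are already correlated with purchase frequency.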


### 3-4. "PurchasedAmount" feature

- Zero or negative amounts are invalid and are dropped.
- Extremely large amounts are treated as outliers.

In [None]:
df = df[df['PurchasedAmount'] > 0]  
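
The notebook does not say how the large-amount outliers are handled. One common option, shown here purely as an assumption, is to cap amounts at a high percentile and then log-transform them so a few very large purchases do not dominate distance-based models. The 99th-percentile threshold and the `AmountCapped` / `AmountLog` column names are illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy amounts with one extreme value; real values come from PurchasedAmount.
toy = pd.DataFrame(
    {"PurchasedAmount": [1_000, 2_000_000, 3_000_000, 5_000_000, 950_000_000]}
)

# Assumed threshold: cap everything above the 99th percentile.
cap = toy["PurchasedAmount"].quantile(0.99)
toy["AmountCapped"] = toy["PurchasedAmount"].clip(upper=cap)

# log1p compresses the long right tail while keeping the ordering.
toy["AmountLog"] = np.log1p(toy["AmountCapped"])
print(toy)
```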


# **4. Feature engineering (RFM + extras)**
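
As a rough sketch of what this step could look like, the code below derives recency, frequency, and monetary features per user, plus the tenure and inter-purchase gap mentioned in the preprocessing notes. The snapshot date is taken as the last date in the data, as the task statement requires. The feature names (`recency_days`, `mean_gap_days`, and so on) are assumptions, and the rows are made up.

```python
import pandas as pd

# Toy purchase log with the dataset's column names.
toy = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 2, 3],
    "PurchasedAt": pd.to_datetime(
        ["2025-01-01", "2025-01-11", "2025-01-31",
         "2025-02-01", "2025-03-03", "2025-03-13"]
    ),
    "PurchasedAmount": [100, 200, 300, 400, 500, 600],
})

snapshot = toy["PurchasedAt"].max()  # last date in the dataset

rfm = toy.groupby("UserId").agg(
    first_purchase=("PurchasedAt", "min"),
    last_purchase=("PurchasedAt", "max"),
    frequency=("PurchasedAt", "count"),
    monetary_total=("PurchasedAmount", "sum"),
    monetary_mean=("PurchasedAmount", "mean"),
)
rfm["recency_days"] = (snapshot - rfm["last_purchase"]).dt.days
rfm["tenure_days"] = (snapshot - rfm["first_purchase"]).dt.days

# Days between consecutive purchases of the same user.
gaps = (
    toy.sort_values(["UserId", "PurchasedAt"])
       .groupby("UserId")["PurchasedAt"]
       .diff()
       .dt.days
)
# Mean gap per user; NaN for users with a single purchase.
rfm["mean_gap_days"] = gaps.groupby(toy["UserId"]).mean()
print(rfm)
```

Single-purchase users have no gap, so any downstream model needs a fallback for the resulting NaN values.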

# **5. Model Training** 
### **5-1. Approach 1 – Heuristic churn model (no ML)**
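
The heuristic itself is not spelled out in the notebook. One possible rule, given only as an assumption, compares how long a user has been silent with that user's typical gap between purchases and maps the ratio to a probability with `1 - exp(-recency / typical_gap)`. Users with a single purchase fall back to the median gap across users. The input column names are hypothetical and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical per-user features; user 4 has only one purchase (no gap).
feats = pd.DataFrame(
    {
        "recency_days": [5, 60, 400, 90],
        "mean_gap_days": [30.0, 30.0, 45.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="UserId"),
)

# Fallback for single-purchase users: median gap over users that have one.
fallback_gap = feats["mean_gap_days"].median()
typical_gap = feats["mean_gap_days"].fillna(fallback_gap).clip(lower=1.0)

# The longer the silence relative to the usual gap, the closer to 1.
ratio = feats["recency_days"] / typical_gap
feats["churn_p_heuristic"] = 1.0 - np.exp(-ratio)
print(feats)
```

The output is a monotone score in [0, 1), not a calibrated probability, since no labels are available to calibrate against.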

### **5-2. Approach 2 – K-Means churn model (unsupervised ML)**
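
A minimal sketch of how K-Means could be turned into a churn score, under stated assumptions: features are standardized, clusters are ranked by their mean recency, and each user's score is a distance-weighted blend of the cluster scores. The choice of two clusters, the soft weighting, and the feature names are all illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-user features: three active users, three lapsed ones.
feats = pd.DataFrame(
    {
        "recency_days": [3, 7, 10, 300, 350, 420],
        "frequency": [12, 9, 15, 1, 2, 1],
        "monetary_total": [900, 700, 1200, 50, 80, 40],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="UserId"),
)

# K-Means is distance based, so put all features on the same scale first.
X = StandardScaler().fit_transform(feats)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
feats["cluster"] = km.labels_

# Score each cluster by its mean recency, rescaled to [0, 1].
cluster_recency = feats.groupby("cluster")["recency_days"].mean()
span = cluster_recency.max() - cluster_recency.min()
cluster_score = (cluster_recency - cluster_recency.min()) / span

# Soft assignment: closer centroids get more weight.
dist = km.transform(X)  # shape (n_users, n_clusters)
weights = np.exp(-dist)
weights /= weights.sum(axis=1, keepdims=True)
feats["churn_p_kmeans"] = weights @ cluster_score.sort_index().to_numpy()
print(feats[["cluster", "churn_p_kmeans"]])
```

The soft weighting avoids giving every user in a cluster the same value, which a hard cluster label would do.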

# **6. Comparing / combining the two approaches**

Evaluation?
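
Without labels, one workable check is whether the two approaches rank users similarly. The sketch below computes a Spearman rank correlation (as a Pearson correlation on ranks) and blends the two scores with an equal weight. The weight `w = 0.5` and the column names are assumptions, and the scores are made up.

```python
import pandas as pd

# Hypothetical scores from the two approaches for four users.
scores = pd.DataFrame(
    {
        "churn_p_heuristic": [0.10, 0.40, 0.85, 0.95],
        "churn_p_kmeans": [0.05, 0.60, 0.70, 0.90],
    },
    index=pd.Index([1, 2, 3, 4], name="UserId"),
)

# Spearman correlation = Pearson correlation of the rank-transformed scores.
rank_agreement = (
    scores["churn_p_heuristic"].rank().corr(scores["churn_p_kmeans"].rank())
)
print("rank agreement:", rank_agreement)

# Assumed blend: equal weight on both approaches.
w = 0.5
scores["churn_p"] = (
    w * scores["churn_p_heuristic"] + (1 - w) * scores["churn_p_kmeans"]
)
print(scores)
```

A low rank agreement would suggest inspecting the users on whom the two scores disagree most before trusting either one.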

# **7. Build final churn probabilities**

# **8. Export output file `(churn_p.csv)`**
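
The export step could look roughly like this: one row per user, a sanity check that every value is a valid probability, then a write to `churn_p.csv`. The column name `churn_p` and the toy values are assumptions; only the file name comes from the heading above.

```python
import pandas as pd

# Hypothetical final scores, indexed by user.
final = pd.DataFrame(
    {"churn_p": [0.075, 0.5, 0.775]},
    index=pd.Index([1, 2, 3], name="UserId"),
)

# Sanity checks before writing: valid probabilities, one row per user.
assert final["churn_p"].between(0, 1).all()
assert final.index.is_unique

# Write UserId as a regular column so the file has no unnamed index column.
final.reset_index().to_csv("churn_p.csv", index=False)

# Read the file back to confirm its shape.
check = pd.read_csv("churn_p.csv")
print(check)
```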

# **9. Conclusion & possible future work**

Final results summary at the bottom