**Enhancing Customer Retention through Machine Learning**: <br> Segmentation, Churn Prediction, and Product Recommendation – A Case Study on Flipkart Data.
<br><br>
Phase: **Data Preprocessing** <br>
By: Gia-My Nguyen <br>
Last updated (dd/mm/yyyy): 26/09/2025
<br><br>

**Completed:**
- Sampling: a 1% sample of the full sales dataset, drawn per city so that each city keeps its proportional share.
- Created the master dataset by joining the sampled sales data with the products data.
<br><br>

**Tasks remaining:**
- None



In [2]:
import pandas as pd

In [3]:
# Load the cleaned sales data

df_sales = pd.read_csv('data/cleaned/sales_cleaned.csv')

In [4]:
# Load the cleaned products data

df_products = pd.read_csv('data/cleaned/products_cleaned.csv')

# Data Sampling

In [5]:
# Stratified sample: draw 1% of the rows from each city (fixed seed for reproducibility)

df_sampled_1pct = df_sales.groupby("city_name", group_keys=False).apply(
    lambda x: x.sample(frac=0.01, random_state=42)
)
# df_sampled_1pct.to_csv("data/sampling/sales_1pct.csv", index=False)

In [6]:
# Check how many sampled rows each city contributes

print("1% Sample - City distribution:")
print(df_sampled_1pct["city_name"].value_counts())

1% Sample - City distribution:
city_name
Delhi        213610
HR-NCR       117723
Bengaluru     94854
Mumbai        38295
Name: count, dtype: int64
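
The counts above show the sample size per city, but not whether each city's share matches its share in the full dataset. A minimal sketch of such a check follows. It runs on a small toy frame (`toy_sales` and its values are hypothetical, not the real sales data) and uses `DataFrameGroupBy.sample`, available in pandas 1.1 and later, as a compact alternative to the `apply`-based call; with the same seed it may pick different rows than the cell above.

```python
import pandas as pd

# Toy stand-in for df_sales (hypothetical values, not the real dataset).
toy_sales = pd.DataFrame({
    "city_name": ["Delhi"] * 600 + ["Mumbai"] * 400,
    "order_id": range(1000),
})

# Draw 10% of the rows from each city, mirroring the per-city sampling idea.
toy_sample = toy_sales.groupby("city_name").sample(frac=0.10, random_state=42)

# Compare each city's share of rows before and after sampling.
share_check = pd.DataFrame({
    "population_share": toy_sales["city_name"].value_counts(normalize=True),
    "sample_share": toy_sample["city_name"].value_counts(normalize=True),
})
share_check["abs_diff"] = (
    share_check["population_share"] - share_check["sample_share"]
).abs()

print(share_check)
```

On the real data, the same comparison could be run with `df_sales` and `df_sampled_1pct`; small differences are expected there because each city's sample size is rounded to a whole number of rows.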


# Sampled Master Data

In [7]:
df_master_1pct = pd.merge(
    df_sampled_1pct, df_products,
    on="product_id",
    how="left",  # keep every sampled sales row, even if no product matches
)
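
A left join like the one above can silently duplicate sales rows if `product_id` is not unique in the products table. The sketch below shows one way to guard against that, using the `validate` and `indicator` parameters of `pd.merge`. The toy frames and their values are hypothetical, and whether `product_id` is unique in `products_cleaned.csv` is an assumption this check would test rather than a known fact.

```python
import pandas as pd

# Toy stand-ins for the sampled sales and products tables (hypothetical values).
toy_sales = pd.DataFrame({"order_id": [1, 2, 3], "product_id": [10, 11, 99]})
toy_products = pd.DataFrame({
    "product_id": [10, 11],
    "product_type": ["Almonds", "Shampoo"],
})

toy_master = pd.merge(
    toy_sales, toy_products,
    on="product_id",
    how="left",
    validate="many_to_one",  # raise MergeError if product_id repeats in toy_products
    indicator=True,          # add a _merge column recording each row's match status
)

# A valid many-to-one left join never changes the number of sales rows.
assert len(toy_master) == len(toy_sales)

# Rows marked "left_only" found no matching product.
unmatched = toy_master[toy_master["_merge"] == "left_only"]
print(unmatched["product_id"].tolist())
```

The `_merge` column is only a diagnostic and would normally be dropped before saving the master dataset.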

In [12]:
# Save the sampled master dataset to CSV

df_master_1pct.to_csv("data/sampling/master_1pct.csv", index=False)

In [9]:
df_master_1pct.head(5)

Unnamed: 0,date,city_name,order_id,cart_id,dim_customer_key,procured_quantity,unit_selling_price,total_discount_amount,product_id,total_weighted_landing_price,...,unit,product_type,brand_name,manufacturer_name,l0_category,l1_category,l2_category,l0_category_id,l1_category_id,l2_category_id
0,2022-04-25,Bengaluru,117671443,187611165,3642084,1,165.0,0.0,389717,137.11838,...,200 g,Almonds,GMC,HOT,"Dry Fruits, Masala & Oil",Dry Fruits,Almonds & Cashews,1557,1160,1162
1,2022-05-19,Bengaluru,123650254,203274833,1718448,1,17.0,0.0,217614,10.0,...,1 piece (400 g - 600 g),Bottle Gourd,Unknown,HOT,Vegetables & Fruits,Fresh Vegetables,Fresh Vegetables,1487,1489,1489
2,2022-04-12,Bengaluru,114662318,179282119,1751755,1,182.0,0.0,277307,187.78612,...,355 ml,Shampoo,Clinic Plus,Hindustan Unilever Ltd.,Personal Care,Shampoo & Conditioner,Shampoo & Conditioner,163,166,166
3,2022-06-01,Bengaluru,127037446,211916548,17913985,1,143.0,0.0,485664,112.686012,...,2 x 500 g,Arhar Dal,GMC,HOT,"Atta, Rice & Dal","Toor, Urad & Chana",Arhar,16,1010,1195
4,2022-04-29,Bengaluru,118492357,189776642,1319906,1,30.0,0.0,477123,19.505,...,50 g,ORS,Enerzal,FDC Ltd.,Pharma & Wellness,Digestive Care,Digestive Care,287,298,298


In [10]:
# Count the rows whose product_type is 'Unknown'
print(df_master_1pct[df_master_1pct['product_type'] == 'Unknown'].shape[0])

5267
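
A raw count is easier to judge as a share of all rows. Taking the city counts printed earlier (464,482 sampled rows in total) and assuming the merge preserved that row count, 5,267 unknown rows would be roughly 1.1% of the sample. The sketch below shows how the count, the share, and the affected product IDs could be computed together; `toy_master` and its values are hypothetical stand-ins for `df_master_1pct`.

```python
import pandas as pd

# Toy stand-in for df_master_1pct (hypothetical values).
toy_master = pd.DataFrame({
    "product_id": [1, 2, 2, 3, 4],
    "product_type": ["Almonds", "Unknown", "Unknown", "Bottle Gourd", "Unknown"],
})

# Boolean mask of rows with unknown product metadata.
is_unknown = toy_master["product_type"] == "Unknown"

unknown_count = int(is_unknown.sum())  # number of affected rows
unknown_share = is_unknown.mean()      # fraction of all rows

# Which product IDs are affected, and how many rows each one accounts for.
unknown_ids = toy_master.loc[is_unknown, "product_id"].value_counts()

print(f"{unknown_count} rows ({unknown_share:.1%}) have an unknown product_type")
print(unknown_ids)
```

Listing the affected IDs helps decide whether to drop those rows or to trace them back to the products table before modeling.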


In [11]:
# Preview the rows with unknown product metadata
df_master_1pct[df_master_1pct['product_type'] == 'Unknown'].head(5)

Unnamed: 0,date,city_name,order_id,cart_id,dim_customer_key,procured_quantity,unit_selling_price,total_discount_amount,product_id,total_weighted_landing_price,...,unit,product_type,brand_name,manufacturer_name,l0_category,l1_category,l2_category,l0_category_id,l1_category_id,l2_category_id
6,2022-05-21,Bengaluru,124336417,193737438,12937945,1,121.0,0.0,481912,79.906,...,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
55,2022-05-17,Bengaluru,123293714,202358258,16604244,1,50.0,0.0,448553,79.00399,...,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
170,2022-04-20,Bengaluru,116440387,169140378,5270752,3,335.0,0.0,13971,982.4403,...,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
190,2022-06-03,Bengaluru,127705043,204822341,1807344,1,594.0,0.0,483343,563.99793,...,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
240,2022-07-03,Bengaluru,135724981,221180941,3749978,3,29.0,0.0,484639,80.43651,...,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
