# Supervised Segmentation

## Dataset Analysis
The dataset consists of 2,500 records with the following columns:
- customer_id (irrelevant for modeling)
- customer_name, email, phone, address (likely not useful for prediction)
- area, pincode (location-based features)
- registration_date (can be used to derive features like "tenure in days")
- customer_segment (this is the target variable for classification)
- total_orders, avg_order_value (important behavioral features)

The goal is to predict customer_segment from the remaining customer attributes, which makes this a multi-class classification problem.
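Before modeling, it can help to check how the target classes are distributed. The sketch below is a minimal illustration on a toy frame built from the first five rows displayed later in this notebook; the names `toy` and `segment_share` are placeholders, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the customers table, using the first five displayed rows.
toy = pd.DataFrame({
    "customer_segment": ["Premium", "Inactive", "Regular", "New", "Inactive"],
    "total_orders": [13, 4, 17, 4, 14],
    "avg_order_value": [451.92, 825.48, 1969.81, 220.09, 578.14],
})

# Share of each class in the target column.
# A strongly skewed split would change how plain accuracy should be read.
segment_share = toy["customer_segment"].value_counts(normalize=True)
print(segment_share)
```

On the real data the same call would be `df_customers["customer_segment"].value_counts(normalize=True)`.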

In [2]:
import pandas as pd

# Load the dataset
df_customers = pd.read_csv('/Users/a12345/Desktop/DATA PT /ML Project/02_data/Blinkit/blinkit_customers.csv')

## Data Preprocessing

In [3]:
#Drop unnecessary columns (ID, names, email, phone, address)
df_clean = df_customers.drop(columns=["customer_id", "customer_name", "email", "phone", "address"])

In [4]:
#load datetime
from datetime import datetime

# Convert registration_date to datetime and create tenure feature
df_clean["registration_date"] = pd.to_datetime(df_clean["registration_date"])
df_clean["tenure_days"] = (datetime.today() - df_clean["registration_date"]).dt.days

# Drop the original date column
df_clean = df_clean.drop(columns=["registration_date"])
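
Because `datetime.today()` is evaluated at run time, `tenure_days` changes every time the notebook is re-run. One possible alternative, sketched here on toy dates, is to measure tenure against a fixed snapshot date. The value of `reference_date` below is a hypothetical choice for illustration; the original notebook does not state when the data was extracted.

```python
import pandas as pd

# Toy registration dates (made up for illustration).
toy = pd.DataFrame({"registration_date": ["2023-01-15", "2024-03-01"]})
toy["registration_date"] = pd.to_datetime(toy["registration_date"])

# Hypothetical snapshot date: replace with the actual extraction date if known.
reference_date = pd.Timestamp("2025-01-01")

# Tenure in whole days relative to the fixed snapshot, so reruns give the same values.
toy["tenure_days"] = (reference_date - toy["registration_date"]).dt.days
print(toy)
```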

In [5]:
df_clean

Unnamed: 0,area,pincode,customer_segment,total_orders,avg_order_value,tenure_days
0,Udupi,321865,Premium,13,451.92,656
1,Aligarh,149394,Inactive,4,825.48,254
2,Begusarai,621411,Regular,17,1969.81,155
3,Kozhikode,826054,New,4,220.09,512
4,Ichalkaranji,730539,Inactive,14,578.14,342
...,...,...,...,...,...,...
2495,Mumbai,45238,Inactive,17,754.33,399
2496,Udupi,688100,Regular,4,1540.81,249
2497,Kavali,528749,Regular,1,1541.22,346
2498,Alwar,586734,Premium,12,1185.50,174


In [6]:
from sklearn.preprocessing import LabelEncoder

# Encode the target variable as integer class labels
label_encoder = LabelEncoder()
df_clean["customer_segment"] = label_encoder.fit_transform(df_clean["customer_segment"])

In [7]:
df_clean

Unnamed: 0,area,pincode,customer_segment,total_orders,avg_order_value,tenure_days
0,Udupi,321865,2,13,451.92,656
1,Aligarh,149394,0,4,825.48,254
2,Begusarai,621411,3,17,1969.81,155
3,Kozhikode,826054,1,4,220.09,512
4,Ichalkaranji,730539,0,14,578.14,342
...,...,...,...,...,...,...
2495,Mumbai,45238,0,17,754.33,399
2496,Udupi,688100,3,4,1540.81,249
2497,Kavali,528749,3,1,1541.22,346
2498,Alwar,586734,2,12,1185.50,174


Key for customer_segment (LabelEncoder assigns codes in alphabetical order of the labels):

- 0 = Inactive
- 1 = New
- 2 = Premium
- 3 = Regular
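
Rather than writing the key by hand, it can be read back from the fitted encoder through its `classes_` attribute. The sketch below refits a fresh encoder on a small toy list so it runs on its own; `encoder`, `codes`, and `segment_key` are illustrative names, and in the notebook the fitted `label_encoder` would be used instead.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, the same four segments that appear in the dataset.
labels = ["Premium", "Inactive", "Regular", "New", "Inactive"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted labels; the position of each label is its integer code.
segment_key = {code: str(label) for code, label in enumerate(encoder.classes_)}
print(segment_key)

# inverse_transform maps integer codes back to the original strings,
# which is useful later for reading model predictions.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```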

In [8]:
#saving dataset
df_clean.to_csv("preprocessed_dataset.csv", index=False)

In [9]:
# The location columns pose a problem: they need to be encoded somehow.
# I'm going to use frequency encoding to give each area a meaningful numeric value.
# Not the happiest with this at the moment, but let's see how the model performs.

# Frequency Encoding for 'area'
area_counts = df_clean['area'].value_counts()
df_clean['area_encoded'] = df_clean['area'].map(area_counts)
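
To make the frequency-encoding step concrete, here is a small self-contained sketch on a toy `area` column. It mirrors the `value_counts()` plus `map()` pattern used above and also shows a relative-frequency variant; the `area_share` column is an added illustration, not something the original notebook computes.

```python
import pandas as pd

# Toy area column (area names taken from the displayed rows, counts made up).
toy = pd.DataFrame({"area": ["Udupi", "Mumbai", "Udupi", "Alwar", "Udupi", "Mumbai"]})

# Raw counts: each area is replaced by how many rows share it.
area_counts = toy["area"].value_counts()
toy["area_encoded"] = toy["area"].map(area_counts)

# Optional variant: relative frequency, which does not depend on dataset size.
area_share = toy["area"].value_counts(normalize=True)
toy["area_share"] = toy["area"].map(area_share)

print(toy)
```

One limitation of this encoding is that two different areas with the same count receive the same value, so the model cannot tell them apart.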

In [10]:
df_clean

Unnamed: 0,area,pincode,customer_segment,total_orders,avg_order_value,tenure_days,area_encoded
0,Udupi,321865,2,13,451.92,656,8
1,Aligarh,149394,0,4,825.48,254,9
2,Begusarai,621411,3,17,1969.81,155,12
3,Kozhikode,826054,1,4,220.09,512,10
4,Ichalkaranji,730539,0,14,578.14,342,12
...,...,...,...,...,...,...,...
2495,Mumbai,45238,0,17,754.33,399,11
2496,Udupi,688100,3,4,1540.81,249,8
2497,Kavali,528749,3,1,1541.22,346,7
2498,Alwar,586734,2,12,1185.50,174,9


In [11]:
#dropping original column
df_clean.drop(columns=['area'], inplace=True)

In [12]:
df_clean

Unnamed: 0,pincode,customer_segment,total_orders,avg_order_value,tenure_days,area_encoded
0,321865,2,13,451.92,656,8
1,149394,0,4,825.48,254,9
2,621411,3,17,1969.81,155,12
3,826054,1,4,220.09,512,10
4,730539,0,14,578.14,342,12
...,...,...,...,...,...,...
2495,45238,0,17,754.33,399,11
2496,688100,3,4,1540.81,249,8
2497,528749,3,1,1541.22,346,7
2498,586734,2,12,1185.50,174,9


In [13]:
#saving dataset
df_clean.to_csv("preprocessed_dataset_v2.csv", index=False)