# Customer Behaviour Insight Engine - Session & Customer Features 

This notebook builds *analysis-ready features* from the cleaned clickstream dataset.

## Goals

Starting from the cleaned tables in `data/processed/`, this notebook does the following:

1. **Load cleaned data**
   - `events_clean.csv`
   - `sessions_clean.csv`
   - `orders_clean.csv`
   - `customers_clean.csv`
   - `order_items_clean.csv`
   - `products_clean.csv`
   - `reviews_clean.csv`

2. **Create session-level features**
    - Session start/end timestamps
    - Session duration
    - Number of events per session
    - Counts of key behaviours:
        - page views
        - add_to_cart
        - checkout
        - purchase
    - Cart and revenue indicators per session

3. **Create customer-level features**
    - Number of sessions
    - Number of orders
    - Total revenue (lifetime value)
    - First and most recent order dates
    - Basic recency/frequency metrics

4. **Save feature tables**
    - `sessions_enriched.csv`
    - `customers_analytics.csv`

These tables feed later exploratory analysis and simple customer behaviour models.

In [9]:
import pandas as pd
import numpy as np
import os

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

In [11]:
# 1. Load cleaned tables from data/processed

processed_path = "data/processed/"

events = pd.read_csv(processed_path + "events_clean.csv", parse_dates=["timestamp"])
sessions = pd.read_csv(processed_path + "sessions_clean.csv")
orders = pd.read_csv(processed_path + "orders_clean.csv", parse_dates=["order_time"])
customers = pd.read_csv(processed_path + "customers_clean.csv", parse_dates=["signup_date"])
order_items = pd.read_csv(processed_path + "order_items_clean.csv")
products = pd.read_csv(processed_path + "products_clean.csv")
reviews = pd.read_csv(processed_path + "reviews_clean.csv")

tables = {
    "events": events,
    "sessions": sessions,
    "orders": orders,
    "customers": customers,
    "order_items": order_items,
    "products": products,
    "reviews": reviews,
}

for name, df in tables.items():
    print(f"{name:12} -> shape = {df.shape}")

events       -> shape = (760958, 10)
sessions     -> shape = (120000, 6)
orders       -> shape = (33580, 10)
customers    -> shape = (20000, 7)
order_items  -> shape = (59163, 5)
products     -> shape = (1197, 6)
reviews      -> shape = (10780, 6)


## 2. Confirm cleaned tables load as expected and row counts appear reasonable

In [15]:
# Quick shapes + a couple of basic checks 

print("Table shapes (rows, columns):")
print("  events:  ", events.shape)
print("  sessions: ", sessions.shape)
print("  orders:  ", orders.shape)
print("  customers:", customers.shape)
print("  order_items:", order_items.shape)
print("  products:   ", products.shape)
print("  reviews:   ", reviews.shape)

print("\nEvent types:", events["event_type"].unique())
print("Payment methods:", events["payment"].dropna().unique())

Table shapes (rows, columns):
  events:   (760958, 10)
  sessions:  (120000, 6)
  orders:   (33580, 10)
  customers: (20000, 7)
  order_items: (59163, 5)
  products:    (1197, 6)
  reviews:    (10780, 6)

Event types: ['page_view' 'add_to_cart' 'checkout' 'purchase']
Payment methods: ['card' 'paypal' 'wallet' 'cod']
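
Shapes alone do not show whether the keys line up across tables. As a minimal sketch, the check below reports key uniqueness, null keys and orphan rows between a child table and its parent. `check_keys` is a hypothetical helper, and the toy frames only stand in for `events` and `sessions`.

```python
import pandas as pd


def check_keys(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> dict:
    """Summarise key health between a child table and its parent table."""
    return {
        # The parent should have exactly one row per key
        "parent_key_unique": bool(parent[key].is_unique),
        # Child rows with a missing key cannot be attributed to any parent row
        "child_null_keys": int(child[key].isna().sum()),
        # Child rows whose key has no match in the parent (null keys count here too)
        "orphan_child_rows": int((~child[key].isin(parent[key])).sum()),
    }


# Toy stand-ins for `sessions` and `events` (illustrative data only)
toy_sessions = pd.DataFrame({"session_id": [1, 2, 3]})
toy_events = pd.DataFrame(
    {"event_id": [10, 11, 12, 13], "session_id": [1, 1, 2, 9]}
)

report = check_keys(toy_events, toy_sessions, "session_id")
print(report)
```

On the real tables the call would look like `check_keys(events, sessions, "session_id")`. A non-zero orphan count would mean some events are dropped by the left join in section 4.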


## 3. Build session-level behaviour features 

Aggregate the `events` table to one row per `session_id`.

For each session compute:

- `session_start`, `session_end`, `session_duration_min`
- `n_events` (total events)
- `n_page_views`
- `n_add_to_cart`
- `n_checkout`
- `n_purchase`
- `total_qty` (sum of `qty` across events)
- `session_revenue` (sum of `amount_usd` across events)
- `made_purchase` flag (1 if the session includes a purchase, else 0)

This turns the raw clickstream log into a compact session-level behaviour table.

In [20]:
# 3. Build session-level aggregates from events

# Make sure numeric fields don't break aggregations due to NaN
events_for_agg = events.copy()
events_for_agg["qty"] = events_for_agg["qty"].fillna(0)
events_for_agg["amount_usd"] = events_for_agg["amount_usd"].fillna(0)

session_agg = (
    events_for_agg
    .groupby("session_id")
    .agg(
        session_start=("timestamp", "min"),
        session_end=("timestamp", "max"),
        session_duration_min=(
            "timestamp",
            lambda x: (x.max() - x.min()).total_seconds() / 60.0
        ),
        n_events=("event_id", "count"),
        n_page_views=("event_type", lambda x: (x == "page_view").sum()),
        n_add_to_cart=("event_type", lambda x: (x == "add_to_cart").sum()),
        n_checkout=("event_type", lambda x: (x == "checkout").sum()),
        n_purchase=("event_type", lambda x: (x == "purchase").sum()),
        total_qty=("qty", "sum"),
        session_revenue=("amount_usd", "sum")
    )
    .reset_index()
)

# Purchase flag
session_agg["made_purchase"] = (session_agg["n_purchase"] > 0).astype(int)

session_agg.head()

Unnamed: 0,session_id,session_start,session_end,session_duration_min,n_events,n_page_views,n_add_to_cart,n_checkout,n_purchase,total_qty,session_revenue,made_purchase
0,1,2021-12-27 00:08:36,2021-12-27 01:59:36,111.0,10,7,3,0,0,3.0,0.0,0
1,2,2025-01-31 21:48:42,2025-01-31 23:07:42,79.0,8,5,1,1,1,1.0,85.72,1
2,3,2024-02-19 00:57:50,2024-02-19 01:17:50,20.0,5,2,1,1,1,1.0,116.17,1
3,4,2024-08-04 20:24:31,2024-08-04 20:47:31,23.0,2,2,0,0,0,0.0,0.0,0
4,5,2022-06-28 14:19:08,2022-06-28 15:27:08,68.0,6,6,0,0,0,0.0,0.0,0
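
The per-event-type lambdas in the aggregation above call a Python function once per session, which can be slow on a table of this size. One possible alternative is `pd.crosstab`, which counts event types per session in a single vectorised pass. The sketch below uses a toy frame with the same column names as `events`. It is an illustration, not the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for `events` (illustrative data only)
toy_events = pd.DataFrame(
    {
        "event_id": [1, 2, 3, 4, 5],
        "session_id": [1, 1, 1, 2, 2],
        "event_type": [
            "page_view", "add_to_cart", "purchase", "page_view", "page_view",
        ],
    }
)

event_types = ["page_view", "add_to_cart", "checkout", "purchase"]

type_counts = (
    # One row per session, one column per event type seen in the data
    pd.crosstab(toy_events["session_id"], toy_events["event_type"])
    # Guarantee all four columns exist, even if a type never occurs
    .reindex(columns=event_types, fill_value=0)
    .rename(
        columns={
            "page_view": "n_page_views",
            "add_to_cart": "n_add_to_cart",
            "checkout": "n_checkout",
            "purchase": "n_purchase",
        }
    )
    .reset_index()
)

print(type_counts)
```

The result can be merged on `session_id` with the timestamp, quantity and revenue aggregates, which already use built-in reducers (`"min"`, `"max"`, `"sum"`).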


## 4. Enrich `sessions` with behaviour metrics 

Join the aggregated behaviour features back onto the original `sessions` table. This keeps metadata such as `customer_id`, `device`, `country` and `source` together with the clickstream behaviour for each session.

In [22]:
# 4. Join behavioural features into sessions table

sessions_enriched = sessions.merge(
    session_agg,
    on="session_id",
    how="left"
)

sessions_enriched.head()

Unnamed: 0,session_id,customer_id,start_time,device,source,country,session_start,session_end,session_duration_min,n_events,n_page_views,n_add_to_cart,n_checkout,n_purchase,total_qty,session_revenue,made_purchase
0,1,12360,2021-12-27T00:01:36,mobile,email,DE,2021-12-27 00:08:36,2021-12-27 01:59:36,111.0,10,7,3,0,0,3.0,0.0,0
1,2,13917,2025-01-31T21:29:42,desktop,organic,PL,2025-01-31 21:48:42,2025-01-31 23:07:42,79.0,8,5,1,1,1,1.0,85.72,1
2,3,1022,2024-02-19T00:52:50,tablet,organic,FR,2024-02-19 00:57:50,2024-02-19 01:17:50,20.0,5,2,1,1,1,1.0,116.17,1
3,4,2882,2024-08-04T19:54:31,mobile,direct,GB,2024-08-04 20:24:31,2024-08-04 20:47:31,23.0,2,2,0,0,0,0.0,0.0,0
4,5,1286,2022-06-28T13:58:08,desktop,email,ES,2022-06-28 14:19:08,2022-06-28 15:27:08,68.0,6,6,0,0,0,0.0,0.0,0


In [23]:
print("Original sessions rows:", len(sessions))
print("Enriched sessions rows:", len(sessions_enriched))

Original sessions rows: 120000
Enriched sessions rows: 120000
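
Comparing row counts after the join catches duplicated keys, but only after the fact. As a hedged sketch, `merge` can enforce the same guarantee itself with `validate`, and `indicator=True` shows how many sessions found no matching events. The toy frames below are illustrative stand-ins for `sessions` and `session_agg`.

```python
import pandas as pd

# Toy stand-ins for `sessions` and `session_agg` (illustrative data only)
toy_sessions = pd.DataFrame(
    {"session_id": [1, 2, 3], "device": ["mobile", "desktop", "tablet"]}
)
toy_agg = pd.DataFrame({"session_id": [1, 2], "n_events": [10, 8]})

merged = toy_sessions.merge(
    toy_agg,
    on="session_id",
    how="left",
    validate="one_to_one",  # raises MergeError if session_id repeats on either side
    indicator=True,         # adds a `_merge` column: "both" or "left_only"
)

# Sessions that exist in the sessions table but have no events at all
n_without_events = int((merged["_merge"] == "left_only").sum())
print("Sessions without events:", n_without_events)
```

Rows marked `left_only` carry NaN in every behaviour column, which matters for the customer-level aggregates built next.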


## 5. Customer-level purchase and behaviour metrics

Aggregate to one row per `customer_id` using two sources: 

1. **Orders table** -> purchase metrics
   - `n_orders`
   - `first_order_date`
   - `last_order_date`
   - `total_revenue`
   - `avg_order_value`

2. **Sessions (enriched)** -> browsing + conversion metrics
   - `n_sessions`
   - `sessions_with_purchase`
   - `sessions_without_purchase`
   - `avg_session_duration_min`
   - `total_session_revenue`

These metrics power CLV-style and funnel analyses.

In [28]:
# 5.1 Order-based customer metrics

customer_orders = (
    orders
    .groupby("customer_id")
    .agg(
        n_orders=("order_id", "nunique"),
        first_order_date=("order_time", "min"),
        last_order_date=("order_time", "max"),
        total_revenue=("total_usd", "sum"),
        avg_order_value=("total_usd", "mean")
    )
    .reset_index()
)

customer_orders.head()

Unnamed: 0,customer_id,n_orders,first_order_date,last_order_date,total_revenue,avg_order_value
0,1,2,2022-03-18 04:16:29,2025-06-25 16:02:53,115.39,57.695
1,2,2,2023-12-16 17:48:30,2025-01-02 02:48:29,68.52,34.26
2,3,1,2020-07-04 07:39:11,2020-07-04 07:39:11,66.72,66.72
3,4,2,2020-09-29 03:07:16,2023-08-01 00:50:26,279.86,139.93
4,5,3,2024-06-15 21:36:57,2025-01-30 02:03:28,271.29,90.43
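
The first and last order dates are enough to derive simple recency metrics. The sketch below assumes the snapshot date is the latest order in the data rather than today's date. `recency_days` and `active_span_days` are hypothetical column names that this notebook does not produce, and the toy frame stands in for `customer_orders`.

```python
import pandas as pd

# Toy stand-in for `customer_orders` (illustrative data only)
toy_customer_orders = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "n_orders": [2, 1],
        "first_order_date": pd.to_datetime(["2024-01-01", "2024-03-01"]),
        "last_order_date": pd.to_datetime(["2024-03-11", "2024-03-01"]),
    }
)

# Assumption: measure recency against the most recent order in the dataset
snapshot = toy_customer_orders["last_order_date"].max()

# Days since each customer's most recent order
toy_customer_orders["recency_days"] = (
    snapshot - toy_customer_orders["last_order_date"]
).dt.days

# Days between each customer's first and last order (0 for one-time buyers)
toy_customer_orders["active_span_days"] = (
    toy_customer_orders["last_order_date"]
    - toy_customer_orders["first_order_date"]
).dt.days

print(toy_customer_orders[["customer_id", "recency_days", "active_span_days"]])
```

Together with `n_orders` as the frequency measure, these columns cover the basic recency/frequency metrics listed in the goals.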


In [34]:
# 5.2 Session-based customer metrics

# Sessions with no events have NaN behaviour metrics after the left join.
# The aggregations below skip those NaNs, so such sessions count towards
# n_sessions but towards neither purchase bucket.

customer_sessions = (
    sessions_enriched
    .groupby("customer_id")
    .agg(
        n_sessions=("session_id", "nunique"),
        sessions_with_purchase=("made_purchase", "sum"),
        sessions_without_purchase=(
            "made_purchase",
            lambda x: (x == 0).sum()
        ),
        avg_session_duration_min=("session_duration_min", "mean"),
        total_session_revenue=("session_revenue", "sum")
    )
    .reset_index()
)
customer_sessions.head()        

Unnamed: 0,customer_id,n_sessions,sessions_with_purchase,sessions_without_purchase,avg_session_duration_min,total_session_revenue
0,1,5,2,3,75.216667,115.39
1,2,3,2,1,80.333333,68.52
2,3,5,1,4,75.196667,66.72
3,4,9,2,7,35.555556,279.86
4,5,9,3,6,51.335185,271.29
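
For funnel analysis, the session counts can be turned into a per-customer conversion rate. This is a minimal sketch on a toy frame. `session_conversion_rate` is a hypothetical column name, and leaving the rate as NaN for customers with no sessions is an assumption made here, not a choice taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `customer_sessions` (illustrative data only)
toy_customer_sessions = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "n_sessions": [5, 3, 0],
        "sessions_with_purchase": [2, 3, 0],
    }
)

# Share of sessions that included a purchase.
# Zero sessions are replaced with NaN first, so the rate is undefined (NaN)
# for those customers instead of a division-by-zero result.
toy_customer_sessions["session_conversion_rate"] = (
    toy_customer_sessions["sessions_with_purchase"]
    / toy_customer_sessions["n_sessions"].replace(0, np.nan)
)

print(toy_customer_sessions)
```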


## 6. Combine customer demographics + metrics 

Join the base `customers` table with the order-based and session-based aggregates to create one analytics-ready customer table: 

- One row per customer
- Includes demographic fields (country, age, signup_date, ...)
- Plus purchase behaviour (orders, revenue, AOV)
- Plus browsing behaviour (sessions, conversion to purchase)

This table is to be used for CLV-style analysis and segmentation. 

In [36]:
# 6. Build final customer analytics table

customers_analytics = (
    customers
    .merge(customer_orders, on="customer_id", how="left")
    .merge(customer_sessions, on="customer_id", how="left")
)

# Fill NaNs for customers who never ordered / never had sessions 
count_cols = [
    "n_orders",
    "n_sessions",
    "sessions_with_purchase",
    "sessions_without_purchase"
]
value_cols = [
    "total_revenue",
    "avg_order_value",
    "avg_session_duration_min",
    "total_session_revenue"
]

customers_analytics[count_cols] = customers_analytics[count_cols].fillna(0)
customers_analytics[value_cols] = customers_analytics[value_cols].fillna(0)

customers_analytics.head()
    

Unnamed: 0,customer_id,name,email,country,age,signup_date,marketing_opt_in,n_orders,first_order_date,last_order_date,total_revenue,avg_order_value,n_sessions,sessions_with_purchase,sessions_without_purchase,avg_session_duration_min,total_session_revenue
0,1,Jennifer Salinas,nicholas59@example.org,JP,71,2020-09-04,True,2.0,2022-03-18 04:16:29,2025-06-25 16:02:53,115.39,57.695,5.0,2.0,3.0,75.216667,115.39
1,2,Phillip Ramos,christinarubio@example.com,IN,26,2020-04-05,False,2.0,2023-12-16 17:48:30,2025-01-02 02:48:29,68.52,34.26,3.0,2.0,1.0,80.333333,68.52
2,3,Dawn Fowler,jessica03@example.org,BR,21,2023-08-31,True,1.0,2020-07-04 07:39:11,2020-07-04 07:39:11,66.72,66.72,5.0,1.0,4.0,75.196667,66.72
3,4,Mario Butler,paula27@example.org,FR,63,2022-06-30,True,2.0,2020-09-29 03:07:16,2023-08-01 00:50:26,279.86,139.93,9.0,2.0,7.0,35.555556,279.86
4,5,Amber Brown,kevin85@example.net,BR,19,2022-07-22,True,3.0,2024-06-15 21:36:57,2025-01-30 02:03:28,271.29,90.43,9.0,3.0,6.0,51.335185,271.29
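
After `fillna(0)` the count columns are still floats (`2.0` rather than `2`), and a zero `avg_order_value` for a customer who never ordered can be mistaken for a real average. One possible refinement, sketched on toy data: cast the counts to integers, leave the average as NaN where there are no orders, and add an explicit flag. `has_ordered` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for the merged customer table (illustrative data only)
toy_customers = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "n_orders": [2.0, None],          # NaN: customer 2 never ordered
        "avg_order_value": [57.695, None],
    }
)

count_cols = ["n_orders"]

# Fill and cast only the count columns; averages keep NaN for "no orders"
toy_customers = toy_customers.fillna({c: 0 for c in count_cols}).astype(
    {c: "int64" for c in count_cols}
)

# Explicit flag, so later analyses need not infer it from zeros
toy_customers["has_ordered"] = toy_customers["n_orders"] > 0

print(toy_customers.dtypes)
print(toy_customers)
```

Whether a missing average should be 0 or NaN depends on the downstream use. Some models need a filled value, while summary statistics are usually more accurate with NaN.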


## 7. Save session and customer-level tables 

Save the aggregated tables so that they can be used by: 

- future notebooks (exploratory analysis, modelling),
- SQL scripts in the warehouse layer,
- or BI tools.

Outputs:

- `data/processed/sessions_enriched.csv`
- `data/processed/customers_analytics.csv`

In [38]:
# 7. Save aggregated tables 

sessions_enriched.to_csv("data/processed/sessions_enriched.csv", index=False)
customers_analytics.to_csv("data/processed/customers_analytics.csv", index=False)

print("Saved:")
print("  data/processed/sessions_enriched.csv")
print("  data/processed/customers_analytics.csv")

Saved:
  data/processed/sessions_enriched.csv
  data/processed/customers_analytics.csv
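
CSV files do not store dtypes, so datetime columns come back as plain strings unless `parse_dates` is passed on reload. The sketch below does a round trip through a temporary directory to show the pattern. The toy frame and the temporary path are illustrative; the real files live in `data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for `sessions_enriched` (illustrative data only)
toy_sessions = pd.DataFrame(
    {
        "session_id": [1, 2],
        "session_start": pd.to_datetime(
            ["2024-02-19 00:57:50", "2024-08-04 20:24:31"]
        ),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "sessions_enriched.csv"

    # index=False avoids writing the row index as an extra unnamed column
    toy_sessions.to_csv(out_path, index=False)

    # Without parse_dates, session_start would be read back as text
    reloaded = pd.read_csv(out_path, parse_dates=["session_start"])

print(reloaded.dtypes)
```

Downstream notebooks that read `sessions_enriched.csv` would need the same `parse_dates` list for `session_start` and `session_end`, and likewise for the order date columns in `customers_analytics.csv`.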
