# Use case - Session Duration Analysis in E-commerce

# 1. Create the synthetic dataset

Synthetic Data Creation Process:

1. Initialize: Start with an empty list to hold event records and a base datetime for timestamps.
2. Generate Users: Loop through 20 users.
3. Generate Sessions per User: For each user, create a random number of sessions (1-5), each with a unique session_id.
4. Generate Events per Session: For each session, pick a random start offset (0-120 minutes) and create a random number of events (2-15).

Assign Event Details:

1. Randomly pick an event_type (e.g., 'page_view', 'add_to_cart').
2. Advance the session clock by a small random increment (5-60 seconds) so that events within a session are sequential.
3. Store the timestamp as a string, as it would appear in raw logs.

Advance Time: After each session, move the global timestamp forward (10-60 minutes) so that sessions are chronologically distinct.

Form DataFrame: Convert the collected list of event dictionaries into a Pandas DataFrame.

In [39]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

print("--- Session Duration Analysis in E-commerce ---")

# Step 1: Create a synthetic dataset with 20 users
print("\nStep 1: Creating a synthetic dataset...")

num_users = 20
data = []
current_time = datetime(2023, 10, 26, 9, 0, 0) # Start time for the dataset

for user_id in range(1, num_users + 1):
    num_sessions = random.randint(1, 5) # Each user has 1 to 5 sessions

    for session_num in range(1, num_sessions + 1):
        session_id = f"user_{user_id}_session_{session_num}"
        num_events = random.randint(2, 15) # Each session has 2 to 15 events

        session_start_offset = timedelta(minutes=random.randint(0, 120)) # Start session at random offset
        session_current_time = current_time + session_start_offset

        for event_num in range(num_events):
            event_type = random.choice(['page_view', 'add_to_cart', 'product_view', 'checkout', 'search', 'wishlist'])
            # Add random time to simulate events within a session (e.g., 5-60 seconds)
            event_offset = timedelta(seconds=random.randint(5, 60))
            session_current_time += event_offset

            # Store timestamp as a string initially, as it often comes from raw logs
            data.append({
                'user_id': user_id,
                'session_id': session_id,
                'event_type': event_type,
                'timestamp': session_current_time.strftime('%Y-%m-%d %H:%M:%S')
            })

        # Advance global time to ensure next session is later
        current_time = session_current_time + timedelta(minutes=random.randint(10, 60))

# Create DataFrame
df = pd.DataFrame(data)

--- Session Duration Analysis in E-commerce ---

Step 1: Creating a synthetic dataset...


In [40]:
# Visualize the dataset created

df.head()

Unnamed: 0,user_id,session_id,event_type,timestamp
0,1,user_1_session_1,wishlist,2023-10-26 09:33:37
1,1,user_1_session_1,checkout,2023-10-26 09:34:22
2,1,user_1_session_1,add_to_cart,2023-10-26 09:35:22
3,1,user_1_session_1,product_view,2023-10-26 09:36:15
4,1,user_1_session_1,wishlist,2023-10-26 09:36:41


In [41]:
df.shape

(441, 4)
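
The generator above is not seeded, so the row count (441 in this run) changes every time the cell is re-executed. The following is a minimal sketch of a reproducible variant. The helper name `make_events`, the `EVENT_TYPES` constant, and the seed value 42 are assumptions introduced for illustration; they are not part of the original notebook.

```python
import random
from datetime import datetime, timedelta

import pandas as pd

EVENT_TYPES = ['page_view', 'add_to_cart', 'product_view', 'checkout', 'search', 'wishlist']

def make_events(num_users=20, seed=42):
    """Hypothetical helper: mirrors the generation loop above, driven by a seeded RNG."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    rows = []
    current_time = datetime(2023, 10, 26, 9, 0, 0)
    for user_id in range(1, num_users + 1):
        for session_num in range(1, rng.randint(1, 5) + 1):
            session_id = f"user_{user_id}_session_{session_num}"
            num_events = rng.randint(2, 15)
            ts = current_time + timedelta(minutes=rng.randint(0, 120))
            for _ in range(num_events):
                event_type = rng.choice(EVENT_TYPES)
                ts += timedelta(seconds=rng.randint(5, 60))
                rows.append({
                    'user_id': user_id,
                    'session_id': session_id,
                    'event_type': event_type,
                    'timestamp': ts.strftime('%Y-%m-%d %H:%M:%S'),
                })
            # Push the global clock past the end of this session
            current_time = ts + timedelta(minutes=rng.randint(10, 60))
    return pd.DataFrame(rows)

df_a = make_events(seed=42)
df_b = make_events(seed=42)
print(df_a.equals(df_b))  # the same seed yields the same dataset
```

A different seed gives a different dataset, so the numbers shown in the rest of this notebook would not be reproduced by this sketch.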

In [42]:
# Example of a session

df[df['session_id']=='user_1_session_1']

Unnamed: 0,user_id,session_id,event_type,timestamp
0,1,user_1_session_1,wishlist,2023-10-26 09:33:37
1,1,user_1_session_1,checkout,2023-10-26 09:34:22
2,1,user_1_session_1,add_to_cart,2023-10-26 09:35:22
3,1,user_1_session_1,product_view,2023-10-26 09:36:15
4,1,user_1_session_1,wishlist,2023-10-26 09:36:41
5,1,user_1_session_1,search,2023-10-26 09:37:05
6,1,user_1_session_1,checkout,2023-10-26 09:37:10
7,1,user_1_session_1,page_view,2023-10-26 09:37:42
8,1,user_1_session_1,product_view,2023-10-26 09:38:11
9,1,user_1_session_1,wishlist,2023-10-26 09:38:22


In [43]:
print(f"\nInitial DataFrame info (note 'timestamp' dtype):")

df.info()


Initial DataFrame info (note 'timestamp' dtype):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 441 entries, 0 to 440
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     441 non-null    int64 
 1   session_id  441 non-null    object
 2   event_type  441 non-null    object
 3   timestamp   441 non-null    object
dtypes: int64(1), object(3)
memory usage: 13.9+ KB


# 2. Convert 'timestamp' to datetime objects

In [44]:
# Step 2: Convert the 'timestamp' column from object (string) to datetime

print("\nStep 2: Converting 'timestamp' to datetime objects...")
df['timestamp'] = pd.to_datetime(df['timestamp'])

print("\nDataFrame after timestamp conversion (first 5 rows):")
print(df.head())
print(f"\nDataFrame info after timestamp conversion (note 'timestamp' dtype):")
df.info()


Step 2: Converting 'timestamp' to datetime objects...

DataFrame after timestamp conversion (first 5 rows):
   user_id        session_id    event_type           timestamp
0        1  user_1_session_1      wishlist 2023-10-26 09:33:37
1        1  user_1_session_1      checkout 2023-10-26 09:34:22
2        1  user_1_session_1   add_to_cart 2023-10-26 09:35:22
3        1  user_1_session_1  product_view 2023-10-26 09:36:15
4        1  user_1_session_1      wishlist 2023-10-26 09:36:41

DataFrame info after timestamp conversion (note 'timestamp' dtype):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 441 entries, 0 to 440
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   user_id     441 non-null    int64         
 1   session_id  441 non-null    object        
 2   event_type  441 non-null    object        
 3   timestamp   441 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(2)
memor
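
Raw logs are not always as clean as this synthetic data. As a hedged sketch, `pd.to_datetime` can be given an explicit `format` and `errors='coerce'` so that malformed values become `NaT` instead of raising. The `raw` frame and its bad value below are made up for illustration.

```python
import pandas as pd

# Illustrative input: two valid timestamps and one malformed value
raw = pd.DataFrame({
    'timestamp': ['2023-10-26 09:33:37', '2023-10-26 09:34:22', 'not-a-timestamp']
})

# An explicit format avoids per-value format inference;
# errors='coerce' turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw['timestamp'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

print(parsed.dtype)
print(parsed.isna().sum())  # number of rows that failed to parse
print((parsed.iloc[1] - parsed.iloc[0]).total_seconds())  # datetime arithmetic now works
```

Counting the `NaT` values before any grouping makes it clear how many events would silently drop out of the duration calculations.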

# 3. Perform session duration analysis - calculate derived metrics

### A. Utility function to derive average time between events for each session

How calculate_avg_time_between_events Works

This function takes a DataFrame containing the events of a single session (the session_df argument) and returns the mean gap, in seconds, between consecutive events. Here's a breakdown of its logic:

* Handle Trivial Sessions: It first checks whether the session has one or zero events. If so, there are no intervals between events, so it immediately returns 0.
* Order Events by Time: It then sorts the timestamp column so that the events are in chronological order. This is crucial for correctly calculating the time elapsed between consecutive actions.
* Calculate Time Differences: It computes the difference between each pair of consecutive timestamps: event 1 to event 2, event 2 to event 3, and so on. These differences are converted into total seconds, and the empty first value (the first event has no predecessor) is dropped.
* Handle Edge Case (No Differences): If no valid time differences remain (e.g., because timestamps are missing), it returns 0.
* Compute Average: Finally, it calculates the mean of these time differences, giving the typical time a user waits before taking their next action in that session.

In [18]:
# Let's derive average time between events for each session

def calculate_avg_time_between_events(session_df):
    if len(session_df) <= 1:
        return 0 # No time between events for 0 or 1 events

    # Sort by timestamp to ensure correct order
    sorted_timestamps = session_df['timestamp'].sort_values()

    # Calculate differences between consecutive timestamps
    time_diffs = sorted_timestamps.diff().dt.total_seconds().dropna()

    if time_diffs.empty:
        return 0
    return time_diffs.mean()
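
To make the behaviour concrete, here is a small worked example on a made-up three-event session. The function is repeated so the cell runs on its own; `toy_session` and `single_event` are illustrative names, not part of the dataset.

```python
import pandas as pd

def calculate_avg_time_between_events(session_df):
    # Same logic as the function above, repeated so this cell is self-contained
    if len(session_df) <= 1:
        return 0
    sorted_timestamps = session_df['timestamp'].sort_values()
    time_diffs = sorted_timestamps.diff().dt.total_seconds().dropna()
    if time_diffs.empty:
        return 0
    return time_diffs.mean()

# Toy session, deliberately out of order: after sorting, the gaps are 30 s and 60 s
toy_session = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2023-10-26 09:01:30',
        '2023-10-26 09:00:00',
        '2023-10-26 09:00:30',
    ])
})
single_event = toy_session.iloc[:1]

print(calculate_avg_time_between_events(toy_session))   # (30 + 60) / 2 = 45.0
print(calculate_avg_time_between_events(single_event))  # 0, no gaps to average
```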

### B. Group by session_id to get session start, end, and duration

In [45]:
df[df['session_id']=='user_1_session_1']

Unnamed: 0,user_id,session_id,event_type,timestamp
0,1,user_1_session_1,wishlist,2023-10-26 09:33:37
1,1,user_1_session_1,checkout,2023-10-26 09:34:22
2,1,user_1_session_1,add_to_cart,2023-10-26 09:35:22
3,1,user_1_session_1,product_view,2023-10-26 09:36:15
4,1,user_1_session_1,wishlist,2023-10-26 09:36:41
5,1,user_1_session_1,search,2023-10-26 09:37:05
6,1,user_1_session_1,checkout,2023-10-26 09:37:10
7,1,user_1_session_1,page_view,2023-10-26 09:37:42
8,1,user_1_session_1,product_view,2023-10-26 09:38:11
9,1,user_1_session_1,wishlist,2023-10-26 09:38:22


In [46]:
df['session_id'].nunique()

56

In [47]:

# Group by session_id to get session start, end, and duration

session_summary = df.groupby('session_id').agg(
    session_start_time=('timestamp', 'min'),
    session_end_time=('timestamp', 'max'),
    num_events=('event_type', 'size') # Count of events in each session
).reset_index()

In [49]:
session_summary.head()

Unnamed: 0,session_id,session_start_time,session_end_time,num_events
0,user_10_session_1,2023-10-28 08:29:51,2023-10-28 08:32:06,5
1,user_10_session_2,2023-10-28 10:30:44,2023-10-28 10:32:04,4
2,user_10_session_3,2023-10-28 12:08:32,2023-10-28 12:16:05,14
3,user_10_session_4,2023-10-28 14:04:25,2023-10-28 14:11:24,13
4,user_10_session_5,2023-10-28 16:25:47,2023-10-28 16:29:21,7


In [50]:
# Validate the groupby operation using a unique session id

session_summary[session_summary['session_id'] == 'user_1_session_1']

Unnamed: 0,session_id,session_start_time,session_end_time,num_events
26,user_1_session_1,2023-10-26 09:33:37,2023-10-26 09:39:21,13


### C. Calculate session duration

In [51]:
# Calculate session duration

session_summary['session_duration_seconds'] = (session_summary['session_end_time'] - session_summary['session_start_time']).dt.total_seconds()
session_summary.head()

Unnamed: 0,session_id,session_start_time,session_end_time,num_events,session_duration_seconds
0,user_10_session_1,2023-10-28 08:29:51,2023-10-28 08:32:06,5,135.0
1,user_10_session_2,2023-10-28 10:30:44,2023-10-28 10:32:04,4,80.0
2,user_10_session_3,2023-10-28 12:08:32,2023-10-28 12:16:05,14,453.0
3,user_10_session_4,2023-10-28 14:04:25,2023-10-28 14:11:24,13,419.0
4,user_10_session_5,2023-10-28 16:25:47,2023-10-28 16:29:21,7,214.0


### D. Calculate average time between events within a session (if more than one event)

In [52]:
# Calculate the average time between events within each session (0 for sessions with a single event).
# calculate_avg_time_between_events sorts the timestamps within a session
# and takes the mean of the consecutive differences, so it needs a group-level apply.

# Compute one value per session. Selecting only the 'timestamp' column keeps the
# grouping column out of the frame passed to the function, which silences the
# pandas warning about apply operating on grouping columns.

df_with_avg_event_time = df.groupby('session_id')[['timestamp']].apply(calculate_avg_time_between_events).reset_index(name='avg_time_between_events_seconds')

# Merge this back into session_summary
session_summary = pd.merge(session_summary, df_with_avg_event_time, on='session_id')

print("\nSession Summary DataFrame (first 5 rows):")
session_summary.head()


Session Summary DataFrame (first 5 rows):

Unnamed: 0,session_id,session_start_time,session_end_time,num_events,session_duration_seconds,avg_time_between_events_seconds
0,user_10_session_1,2023-10-28 08:29:51,2023-10-28 08:32:06,5,135.0,33.75
1,user_10_session_2,2023-10-28 10:30:44,2023-10-28 10:32:04,4,80.0,26.666667
2,user_10_session_3,2023-10-28 12:08:32,2023-10-28 12:16:05,14,453.0,34.846154
3,user_10_session_4,2023-10-28 14:04:25,2023-10-28 14:11:24,13,419.0,34.916667
4,user_10_session_5,2023-10-28 16:25:47,2023-10-28 16:29:21,7,214.0,35.666667
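
The gaps within a session add up to its duration, so the average gap should equal session_duration_seconds / (num_events - 1). The first row above agrees: 135.0 / (5 - 1) = 33.75. The sketch below checks this on a made-up `events` frame and also shows a vectorized alternative to the group-level apply. All variable names here are illustrative assumptions.

```python
import pandas as pd

# Toy data: session s1 has gaps of 30 s and 60 s, session s2 has a single 20 s gap
events = pd.DataFrame({
    'session_id': ['s1', 's1', 's1', 's2', 's2'],
    'timestamp': pd.to_datetime([
        '2023-10-26 09:00:00', '2023-10-26 09:00:30', '2023-10-26 09:01:30',
        '2023-10-26 10:00:00', '2023-10-26 10:00:20',
    ]),
})

# Sort once, then diff within each session; the first event of a session gets NaN
ordered = events.sort_values(['session_id', 'timestamp'])
gaps = ordered.groupby('session_id')['timestamp'].diff().dt.total_seconds()
avg_gap = gaps.groupby(ordered['session_id']).mean()  # mean() skips the NaN values

# Cross-check against duration / (num_events - 1)
summary = events.groupby('session_id').agg(
    start=('timestamp', 'min'),
    end=('timestamp', 'max'),
    num_events=('timestamp', 'size'),
)
duration = (summary['end'] - summary['start']).dt.total_seconds()
expected = duration / (summary['num_events'] - 1)

print(avg_gap)
print(expected)
```

One difference from the utility function: a session with a single event yields NaN here rather than 0. Every session in this dataset has at least two events, so the distinction only matters for other data.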


# 4. Analyze the central tendencies of the dataset and derive conclusions

### A. Descriptive Statistics for Session Duration (in seconds)

In [53]:
# Descriptive Statistics for Session Duration (in seconds)

print("\nDescriptive Statistics for Session Duration (in seconds):")

session_summary['session_duration_seconds'].describe()



Descriptive Statistics for Session Duration (in seconds):


Unnamed: 0,session_duration_seconds
count,56.0
mean,222.089286
std,130.930277
min,6.0
25%,119.0
50%,205.5
75%,331.0
max,481.0


### B. Descriptive Statistics for Number of Events per Session

In [54]:
# Descriptive Statistics for Number of Events per Session

print("\nDescriptive Statistics for Number of Events per Session:")

session_summary['num_events'].describe()


Descriptive Statistics for Number of Events per Session:


Unnamed: 0,num_events
count,56.0
mean,7.875
std,3.936138
min,2.0
25%,4.75
50%,8.0
75%,11.25
max,15.0


### C. Descriptive Statistics for Average Time Between Events (in seconds)

In [55]:
# Descriptive Statistics for Average Time Between Events (in seconds)

print("\nDescriptive Statistics for Average Time Between Events (in seconds):")

session_summary['avg_time_between_events_seconds'].describe()


Descriptive Statistics for Average Time Between Events (in seconds):


Unnamed: 0,avg_time_between_events_seconds
count,56.0
mean,31.707465
std,7.688303
min,6.0
25%,28.375
50%,31.1
75%,36.09375
max,48.75


### D. Conclusions / Key insights

In [56]:
# Conclusion 1: Overall Engagement

mean_duration = session_summary['session_duration_seconds'].mean()
median_duration = session_summary['session_duration_seconds'].median()
std_duration = session_summary['session_duration_seconds'].std()

print(f"\n1. Overall Engagement Overview:")
print(f"   - Average Session Duration: {mean_duration:.2f} seconds ({mean_duration / 60:.2f} minutes)")
print(f"   - Median Session Duration: {median_duration:.2f} seconds ({median_duration / 60:.2f} minutes)")
print(f"   - Standard Deviation of Duration: {std_duration:.2f} seconds")
print(f"   This indicates the typical time users spend interacting with the site/app per session.")
print(f"   A high standard deviation suggests a wide range of session lengths, possibly indicating different user behaviors (quick visits vs. deep dives).")

# Conclusion 2: User Activity Level

mean_events = session_summary['num_events'].mean()
median_events = session_summary['num_events'].median()
std_events = session_summary['num_events'].std()

print(f"\n2. User Activity Level (Events per Session):")
print(f"   - Average Events per Session: {mean_events:.2f}")
print(f"   - Median Events per Session: {median_events:.2f}")
print(f"   - Standard Deviation of Events: {std_events:.2f}")
print(f"   This metric reflects how many actions users perform within a session. Higher numbers suggest more interaction. Large deviation could mean some sessions are very active while others are minimal.")

# Conclusion 3: Pace of Interaction

mean_avg_event_time = session_summary['avg_time_between_events_seconds'].mean()
median_avg_event_time = session_summary['avg_time_between_events_seconds'].median()

print(f"\n3. Pace of Interaction (Average Time Between Events):")
print(f"   - Average Time Between Events: {mean_avg_event_time:.2f} seconds")
print(f"   - Median Time Between Events: {median_avg_event_time:.2f} seconds")
print(f"   This indicates how quickly users are performing consecutive actions. A very high average might point to users getting stuck or leaving the tab open without interaction.")

# Further Insights (Derived, non-quantitative for this example):

print("\nFurther Potential Insights:")
print("- **Engaging Content:** Sessions with high duration and many events might point to highly engaging content or successful user flows (e.g., product discovery leading to purchase intent).")
print("- **Friction Points:** Sessions with long durations but few events (or long average time between events) could indicate users are encountering friction, confusion, or simply leaving the tab open. Further analysis (e.g., looking at `event_type` sequences for these sessions) is needed.")
print("- **Quick Visits:** Very short sessions with few events might be users quickly finding what they need or immediately bouncing.")
print("- **Optimization:** Identifying anomalies (e.g., outlier sessions that are extremely long or short) can lead to investigating specific user journeys for improvement.")

print("\n--- Analysis Complete ---")


1. Overall Engagement Overview:
   - Average Session Duration: 222.09 seconds (3.70 minutes)
   - Median Session Duration: 205.50 seconds (3.42 minutes)
   - Standard Deviation of Duration: 130.93 seconds
   This indicates the typical time users spend interacting with the site/app per session.
   A high standard deviation suggests a wide range of session lengths, possibly indicating different user behaviors (quick visits vs. deep dives).

2. User Activity Level (Events per Session):
   - Average Events per Session: 7.88
   - Median Events per Session: 8.00
   - Standard Deviation of Events: 3.94
   This metric reflects how many actions users perform within a session. Higher numbers suggest more interaction. Large deviation could mean some sessions are very active while others are minimal.

3. Pace of Interaction (Average Time Between Events):
   - Average Time Between Events: 31.71 seconds
   - Median Time Between Events: 31.10 seconds
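   This indicates how quickly users are performing consecutive actions. A very high average might point to users getting stuck or leaving the tab open without interaction.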
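
The "Friction Points" idea can be turned into a simple filter. This is only a sketch: `toy_summary` is a made-up stand-in for session_summary, and the cut-offs (top quartile of average gap, at or below the median event count) are assumptions chosen for illustration, not thresholds derived from the analysis above.

```python
import pandas as pd

# Toy stand-in for session_summary; the values are illustrative, not from the run above
toy_summary = pd.DataFrame({
    'session_id': ['a', 'b', 'c', 'd', 'e'],
    'num_events': [3, 12, 4, 10, 2],
    'session_duration_seconds': [100.0, 400.0, 150.0, 300.0, 10.0],
    'avg_time_between_events_seconds': [50.0, 36.4, 50.0, 33.3, 10.0],
})

# Assumed cut-offs: slow pace = top quartile of average gap,
# few events = at or below the median event count
slow_cutoff = toy_summary['avg_time_between_events_seconds'].quantile(0.75)
few_cutoff = toy_summary['num_events'].median()

possible_friction = toy_summary[
    (toy_summary['avg_time_between_events_seconds'] >= slow_cutoff)
    & (toy_summary['num_events'] <= few_cutoff)
]
print(possible_friction['session_id'].tolist())
```

Sessions flagged this way are candidates for a closer look at their `event_type` sequences, not confirmed problems.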

# COMPLETED