# Project Overview
## Session Intent Analysis Pipeline


This notebook explores the outputs of a session-level NLP pipeline.
User event logs are split into sessions, each session is aggregated into a single text document, and the documents are vectorized and clustered to surface latent user intents.
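
As a rough orientation, the vectorize-and-cluster stage can be sketched as follows. This is a minimal sketch under stated assumptions: the notebook does not say which clustering algorithm the pipeline uses, so `KMeans` and the choice of `n_clusters=2` are illustrative, and the three documents are toy examples rather than project data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy session documents (illustrative only, not the project's data).
docs = [
    "search hybrid suv deals view kona hybrid",
    "search ev charging station view ioniq battery info",
    "search running shoes discount view nike zoom fly",
]

# Represent each session document as a TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Assumption: KMeans with a fixed k; the real pipeline may use another method.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster id per session document
```

Each entry of `labels` plays the role of the `cluster_id` column inspected later in this notebook.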

### 1. Load Pipeline Outputs

In [1]:
import pandas as pd

df = pd.read_csv("../data/raw/raw_events.csv")
df["event_time"] = pd.to_datetime(df["event_time"])

df.head()

Unnamed: 0,user_id,event_time,event_text
0,1,2025-12-01 09:00:00,search hybrid suv deals
1,1,2025-12-01 09:01:30,view kona hybrid 2024
2,1,2025-12-01 09:03:10,click car detail page
3,1,2025-12-01 10:10:00,search ev charging station
4,1,2025-12-01 10:12:20,view ioniq 6 battery info


In [None]:
from feature_store.sessionizer import Sessionizer

sessionizer = Sessionizer()
out = sessionizer.assign_sessions(df)

out.groupby(["user_id", "session_id"]).size()
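
The internals of `Sessionizer` are not shown in this notebook. One common approach, sketched here as an assumption, is to start a new session whenever a user has been inactive longer than a threshold. The helper name `assign_sessions_by_gap` and the 30-minute default are hypothetical and not taken from `feature_store`.

```python
import pandas as pd

def assign_sessions_by_gap(events, gap_minutes=30):
    """Hypothetical helper: start a new session after a long inactivity gap."""
    out = events.sort_values(["user_id", "event_time"]).copy()
    # Time since the same user's previous event (NaT for the user's first event).
    gap = out.groupby("user_id")["event_time"].diff()
    # A session starts at a user's first event or after a gap above the threshold.
    new_session = gap.isna() | (gap > pd.Timedelta(minutes=gap_minutes))
    # Counting session starts cumulatively per user yields 1-based session ids.
    out["session_id"] = new_session.astype(int).groupby(out["user_id"]).cumsum()
    return out

# Timestamps for user 1, copied from the raw events preview above.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1],
    "event_time": pd.to_datetime([
        "2025-12-01 09:00:00", "2025-12-01 09:01:30", "2025-12-01 09:03:10",
        "2025-12-01 10:10:00", "2025-12-01 10:12:20",
    ]),
})
sessions = assign_sessions_by_gap(events)
```

With this threshold, the three events around 09:00 fall into session 1 and the two events after 10:10 fall into session 2.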

In [3]:
# Load aggregated session documents
session_docs = pd.read_csv("../data/processed/session_docs.csv")

session_docs.head()

Unnamed: 0,user_id,session_id,raw_text,global_session_id,cleaned_text,cluster_id
0,1,1,search hybrid suv deals view kona hybrid 2024 ...,1_1,search hybrid suv deals view kona hybrid 0 cli...,0
1,1,2,search ev charging station view ioniq 6 batter...,1_2,search ev charging station view ioniq battery...,1
2,2,1,search running shoes discount view nike zoom f...,2_1,search running shoes discount view nike zoom f...,2
3,3,1,search iphone 16 cases view clear iphone 16 ca...,3_1,search iphone cases view clear iphone case s...,0
4,4,1,vehicle start signal vehicle stop signal searc...,4_1,vehicle start signal vehicle stop signal searc...,1
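
The `raw_text` and `global_session_id` columns look like the result of joining each session's event texts and concatenating the two ids. The sketch below shows one way such documents could be built. It is an assumption about the aggregation step, not the pipeline's actual code, and the session ids in the toy frame are assigned by hand.

```python
import pandas as pd

# Toy events with hand-assigned session ids (illustrative only).
events = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "session_id": [1, 1, 2, 2],
    "event_text": [
        "search hybrid suv deals", "view kona hybrid 2024",
        "search ev charging station", "view ioniq 6 battery info",
    ],
})

# Join the event texts of each (user, session) pair into one document.
docs = (
    events.groupby(["user_id", "session_id"])["event_text"]
    .agg(" ".join)
    .reset_index(name="raw_text")
)

# Build a key that is unique across users, e.g. "1_2".
docs["global_session_id"] = (
    docs["user_id"].astype(str) + "_" + docs["session_id"].astype(str)
)
```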


### 2. Cluster Distribution

In [4]:
# Check whether particular intents dominate
# and whether the cluster sizes are imbalanced
session_docs["cluster_id"].value_counts()

cluster_id
0    2
1    2
2    1
Name: count, dtype: int64
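
Raw counts are easier to compare across runs when expressed as shares. The following sketch uses the five `cluster_id` values from the preview above. The max-to-min ratio is just one possible imbalance indicator, chosen here for illustration.

```python
import pandas as pd

# cluster_id values of the five sessions shown in the preview above.
cluster_ids = pd.Series([0, 1, 2, 0, 1], name="cluster_id")

# Share of sessions per cluster instead of raw counts.
shares = cluster_ids.value_counts(normalize=True).sort_index()

# Ratio of the largest to the smallest cluster share (1.0 means perfectly balanced).
imbalance_ratio = shares.max() / shares.min()
```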

### 3. Top Keywords per Cluster (most important)

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

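def top_keywords_per_cluster(df, n_words=5):
    """Return the n_words terms with the highest mean TF-IDF score in each cluster."""
    result = {}

    for cid in sorted(df["cluster_id"].unique()):
        texts = df[df["cluster_id"] == cid]["cleaned_text"].tolist()

        # Fit a separate vectorizer per cluster, so IDF is computed within the cluster only
        vectorizer = TfidfVectorizer(stop_words="english")
        X = vectorizer.fit_transform(texts)
        terms = vectorizer.get_feature_names_out()

        # Average each term's score over the cluster's documents, then rank descending
        scores = np.asarray(X.mean(axis=0)).ravel()
        top_idx = scores.argsort()[::-1][:n_words]

        result[cid] = [terms[i] for i in top_idx]

    return result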

In [6]:
top_keywords_per_cluster(session_docs)

{np.int64(0): ['hybrid', 'search', 'view', 'iphone', 'magsafe'],
 np.int64(1): ['vehicle', 'signal', 'view', 'charging', 'station'],
 np.int64(2): ['zoom', 'view', 'fly', 'nike', 'search']}

### 4. Interpretation
#### Cluster Interpretation
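Each cluster is read as a user intent pattern, based on the top keywords above. With only five sessions, these labels are tentative:<br />
    &emsp; - Cluster 0: A mix of hybrid vehicle research and iPhone accessory browsing (hybrid, iphone, magsafe)<br />
    &emsp; - Cluster 1: Vehicle signals and EV charging lookups (vehicle, signal, charging, station)<br />
    &emsp; - Cluster 2: Running shoe shopping (nike, zoom, fly)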