# Project Overview
## Session Intent Analysis Pipeline


This notebook explores the outputs of a session-level NLP pipeline.
User event logs are transformed into session documents, vectorized, and clustered to identify latent user intents.
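
The notebook does not say which vectorizer or clustering algorithm the pipeline uses. As a rough sketch of the "vectorize, then cluster" step, the cell below assumes TF-IDF features and k-means; both choices, the toy documents, and the `toy_*` names are assumptions for illustration rather than the pipeline's actual configuration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy session documents (made up for illustration, not pipeline output).
toy_docs = [
    "hybrid suv charging range",
    "suv hybrid price compare",
    "laptop headphones review",
    "headphones laptop battery",
    "cart checkout payment",
    "checkout payment shipping",
]

# Vectorize: one TF-IDF row per session document.
toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

# Cluster: assign each session document to one of k groups (k=3 is assumed).
toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy_cluster_ids = toy_kmeans.fit_predict(toy_X)

print(toy_cluster_ids)
```

The integer labels themselves are arbitrary; only the grouping of documents carries meaning.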

### 1. Load Pipeline Outputs

In [None]:
import pandas as pd

df = pd.read_csv("../data/raw/raw_events.csv")
df["event_time"] = pd.to_datetime(df["event_time"])

df.head()

In [None]:
from feature_store.sessionizer import Sessionizer

sessionizer = Sessionizer()
out = sessionizer.assign_sessions(df)

out.groupby(["user_id", "session_id"]).size()
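
The internals of `Sessionizer` are not shown in this notebook. A common approach is to start a new session whenever a user has been inactive for longer than some threshold. The sketch below illustrates that idea with a hypothetical helper, `assign_sessions_by_gap`, and an assumed 30-minute gap; the real class may use a different rule.

```python
import pandas as pd

def assign_sessions_by_gap(events, gap_minutes=30):
    """Hypothetical stand-in for Sessionizer.assign_sessions (the gap rule is assumed)."""
    out = events.sort_values(["user_id", "event_time"]).copy()
    # Time since the same user's previous event; NaT for each user's first event.
    gap = out.groupby("user_id")["event_time"].diff()
    # A session starts at a user's first event or after a long pause.
    new_session = gap.isna() | (gap > pd.Timedelta(minutes=gap_minutes))
    # Running count of session starts per user gives a 1-based session number.
    out["session_id"] = new_session.astype(int).groupby(out["user_id"]).cumsum()
    return out

# Made-up events: user "a" pauses for 50 minutes before the third event.
toy_events = pd.DataFrame({
    "user_id": ["a", "a", "a", "b"],
    "event_time": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:10",
        "2024-01-01 11:00", "2024-01-01 10:05",
    ]),
})
toy_sessions = assign_sessions_by_gap(toy_events)
print(toy_sessions.groupby(["user_id", "session_id"]).size())
```

In this sketch session numbers restart at 1 for every user, so a session is identified by the pair (`user_id`, `session_id`), which matches the grouping used above.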

In [None]:
# Load aggregated session documents
session_docs = pd.read_csv("../data/processed/session_docs.csv")

session_docs.head()

### 2. Cluster Distribution

In [None]:
# Check whether any single intent dominates
# Check whether the clusters are imbalanced
session_docs["cluster_id"].value_counts()
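
Raw counts can be hard to compare at a glance. One way to quantify imbalance is to look at each cluster's share of sessions together with the ratio of the largest to the smallest cluster. The helper name `cluster_balance` and the toy labels below are illustrative assumptions.

```python
import pandas as pd

def cluster_balance(cluster_ids):
    """Return per-cluster shares and the largest-to-smallest size ratio."""
    counts = cluster_ids.value_counts()
    shares = counts / counts.sum()
    imbalance_ratio = counts.max() / counts.min()
    return shares, imbalance_ratio

toy_ids = pd.Series([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])  # made-up labels
shares, ratio = cluster_balance(toy_ids)
print(shares.round(2))
print(f"largest/smallest cluster size ratio: {ratio:.1f}")
```

On the real data this would be called as `cluster_balance(session_docs["cluster_id"])`.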

### 3. Top Keywords per Cluster (most important)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

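def top_keywords_per_cluster(df, n_words=5):
    """Return the n_words terms with the highest mean TF-IDF score in each cluster."""
    result = {}

    for cid in sorted(df["cluster_id"].unique()):
        # Missing text (NaN after a CSV round trip) would make the vectorizer fail.
        texts = df.loc[df["cluster_id"] == cid, "cleaned_text"].fillna("").tolist()

        # TF-IDF is fit on this cluster's documents only.
        vectorizer = TfidfVectorizer(stop_words="english")
        X = vectorizer.fit_transform(texts)
        terms = vectorizer.get_feature_names_out()

        # Mean TF-IDF weight of each term across the cluster's documents.
        scores = np.asarray(X.mean(axis=0)).ravel()
        top_idx = scores.argsort()[::-1][:n_words]

        result[cid] = [terms[i] for i in top_idx]

    return result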

In [None]:
top_keywords_per_cluster(session_docs)

### 4. Interpretation
#### Cluster Interpretation
Each cluster represents a distinct user intent pattern:<br />
    &emsp; - Cluster 0: Vehicle-related research (hybrid, SUV, charging)<br />
    &emsp; - Cluster 1: Consumer electronics browsing<br />
    &emsp; - Cluster 2: E-commerce purchase flow
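
To carry these interpretations into later analysis, the labels can be attached to each session with a simple mapping. The dictionary below copies the three interpretations listed above; the names `intent_labels`, `toy_assignments`, and the `intent` column are assumptions, and the mapping would need revisiting whenever the clustering is re-run, since cluster ids are not stable across runs.

```python
import pandas as pd

# Labels taken from the interpretation above (cluster id -> intent).
intent_labels = {
    0: "vehicle research",
    1: "consumer electronics browsing",
    2: "e-commerce purchase flow",
}

toy_assignments = pd.DataFrame({"cluster_id": [2, 0, 1, 0]})  # made-up rows
toy_assignments["intent"] = toy_assignments["cluster_id"].map(intent_labels)
print(toy_assignments)
```

On the real data the same mapping would be applied as `session_docs["cluster_id"].map(intent_labels)`.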