# Flora Chat Data Cleaning

This notebook parses and cleans the flora-chats export.

**Key challenges addressed:**
- CSV contains embedded JSON with commas (breaks naive pandas parsing)
- Input column contains both instructions AND data payloads
- Need to separate user intent from structured data

In [1]:
import csv
import pandas as pd
import numpy as np
import json
import re
from pathlib import Path
from datetime import datetime

DATA_DIR = Path("../data")
RAW_FILE = DATA_DIR / "flora-chats-01-09-26.csv"

## 1. Load Raw Data (Proper CSV Parsing)

This step uses Python's csv module instead of pandas so that fields containing embedded JSON and quotes are parsed accurately.
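To illustrate why quote-aware parsing matters, the sketch below compares a naive comma split with `csv.reader` on a single line whose last field holds JSON. The sample line and its values are invented for illustration and are not taken from the export.

```python
import csv
import io
import json

# Hypothetical CSV line: the third field is JSON, quoted with doubled inner quotes.
line = 'abc123,production,"{""points"": 5, ""grade"": null}"\n'

# A naive split treats the comma inside the JSON as a field separator.
naive_fields = line.strip().split(",")

# csv.reader honors the quoting rules and un-doubles the inner quotes.
parsed_fields = next(csv.reader(io.StringIO(line), quotechar='"', doublequote=True))

print(len(naive_fields))             # 4 fields: the JSON was cut in two
print(len(parsed_fields))            # 3 fields: the JSON stays intact
print(json.loads(parsed_fields[2]))  # the payload parses as JSON
```

Because the reader already removes the surrounding quotes, a later quote-stripping pass only matters for values that were quoted twice in the export.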

In [2]:
def strip_quotes(val):
    """Remove surrounding quotes from a string value."""
    if isinstance(val, str):
        # Strip leading/trailing whitespace first
        val = val.strip()
        # Remove surrounding quotes (handles "" and escaped quotes)
        while len(val) >= 2 and val.startswith('"') and val.endswith('"'):
            val = val[1:-1]
        # Handle escaped quotes like \"
        val = val.replace('\\"', '"')
    return val

def load_raw_csv(filepath):
    """Load CSV with proper handling of embedded quotes and JSON."""
    rows = []
    columns = ['id', 'timestamp', 'name', 'userId', 'sessionId', 
               'release', 'version', 'environment', 'tags', 'input', 'output']
    
    with open(filepath, 'r', encoding='utf-8') as f:
        reader = csv.reader(f, quotechar='"', doublequote=True)
        next(reader)  # Skip the header row; column names are hard-coded above
        
        for row in reader:
            row_data = {}
            for i, col in enumerate(columns):
                val = row[i] if i < len(row) else ''
                # Strip surrounding quotes from each value
                row_data[col] = strip_quotes(val)
            rows.append(row_data)
    
    return pd.DataFrame(rows)

df_raw = load_raw_csv(RAW_FILE)
print(f"Loaded {len(df_raw)} rows")
print(f"Columns: {list(df_raw.columns)}")
print(f"\nSample userId (should be clean UUID): {df_raw['userId'].iloc[0]}")
df_raw.head()

Loaded 96 rows
Columns: ['id', 'timestamp', 'name', 'userId', 'sessionId', 'release', 'version', 'environment', 'tags', 'input', 'output']

Sample userId (should be clean UUID): c1d2a96d-da46-404b-97ed-ffcbd76f1dab


| | id | timestamp | name | userId | sessionId | release | version | environment | tags | input | output |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 206abeb3502f897134951d0e9c989dbc | 2026-01-09T17:04:42.739Z | flora-thinking | c1d2a96d-da46-404b-97ed-ffcbd76f1dab | chat-history_4410630c-cd0d-44cb-9177-878383aca50c | | | production | ["askLLMAsync","flora-thinking","openai","org-... | For work period @work_period:ef237015-9250-4b1... | # Sprint Report: REBEL-2025 SP26 \n2025-12-10... |
| 1 | 2bbde341c2c32604a8ea3095d490f68c | 2026-01-09T17:00:42.509Z | flora-thinking | 1927df39-3746-4a83-bf6d-527e91d72ee6 | chat-history_ec79e713-967f-4c20-a0c2-f51fd694c7a2 | | | production | ["flora-thinking","org-7a24ea6d-262e-4e04-a1b6... | | |
| 2 | 5afb40311aa5651533d87821df0bed21 | 2026-01-09T17:00:52.905Z | flora-thinking | 1927df39-3746-4a83-bf6d-527e91d72ee6 | chat-history_df8a1de7-b7fe-40db-a3b8-80fb57d5b860 | | | production | ["flora-thinking","org-7a24ea6d-262e-4e04-a1b6... | Analyze the following data and provide a conci... | Key insights for work period STAR-2025 SP26 (1... |
| 3 | afc00d1fecece5bce4484e040aa921a7 | 2026-01-09T16:57:51.458Z | flora-thinking | c1d2a96d-da46-404b-97ed-ffcbd76f1dab | chat-history_12e07917-515a-462a-8f13-923eea51147f | | | production | ["askLLMAsync","flora-thinking","openai","org-... | tell me about this initiative | # Initiative: Additional payment methods 2025 ... |
| 4 | 00aa93e2a3edd0eede03b95874dcf279 | 2026-01-05T17:21:05.265Z | flora-thinking | c1d2a96d-da46-404b-97ed-ffcbd76f1dab | chat-history_498fb4ef-1603-43ec-bf54-97795a6bd067 | | | production | ["askLLMAsync","flora-thinking","openai","org-... | provide sprint reto analysis for work period '... | Data retrieval was unsuccessful. You requested... |


## 2. Data Validation

In [3]:
print("=== DATA VALIDATION ===")
print(f"Total rows: {len(df_raw)}")
print(f"Unique users (before validation): {df_raw['userId'].nunique()}")
print(f"Unique sessions: {df_raw['sessionId'].nunique()}")
print()

# Check for valid UUIDs (now without surrounding quotes)
uuid_pattern = re.compile(r'^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$')
valid_users = df_raw['userId'].apply(lambda x: bool(uuid_pattern.match(str(x))))
print(f"Valid UUID userIds: {valid_users.sum()} / {len(df_raw)}")

if not valid_users.all():
    print(f"\n{(~valid_users).sum()} rows with invalid userIds (will be filtered):")
    invalid_df = df_raw[~valid_users]
    for idx, row in invalid_df.head(5).iterrows():
        print(f"  Row {idx}: {str(row['userId'])[:60]}...")

=== DATA VALIDATION ===
Total rows: 96
Unique users (before validation): 13
Unique sessions: 76

Valid UUID userIds: 93 / 96

3 rows with invalid userIds (will be filtered):
  Row 17: ""tasks_in_usd"":3918.67...
  Row 75: ""grade"":null...
  Row 87: ""tasks_in_usd"":1593.36...
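
The invalid values above look like fragments of a JSON payload, which would be consistent with some rows not splitting into the expected fields. The notebook does not diagnose the cause. As one possible check, the sketch below counts fields per row and reports rows whose width differs from the header. The helper name `find_misaligned_rows` and the sample text are hypothetical.

```python
import csv
import io

def find_misaligned_rows(text):
    """Return (row_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text))
    expected = len(next(reader))
    return [(n, len(row)) for n, row in enumerate(reader, start=1) if len(row) != expected]

# Made-up sample: the second data row carries an unquoted payload, so the
# comma inside it produces an extra field.
sample = (
    "id,userId,input\n"
    '1,aaaa,"{""tasks_in_usd"": 1.5, ""grade"": null}"\n'
    '2,bbbb,{"tasks_in_usd": 1.5, "grade": null}\n'
)

print(find_misaligned_rows(sample))
```

A check like this would flag shifted rows before the UUID validation, which only catches them indirectly through the `userId` column.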


In [4]:
# Filter to valid rows only
df = df_raw[valid_users].copy()
print(f"Rows after filtering invalid userIds: {len(df)}")
print(f"Unique users: {df['userId'].nunique()}")

Rows after filtering invalid userIds: 93
Unique users: 10


In [5]:
print("=== USER DISTRIBUTION ===")
user_counts = df['userId'].value_counts()
print(user_counts)
print()
print(f"Top 4 users account for {user_counts.head(4).sum()}/{len(df)} messages ({user_counts.head(4).sum()/len(df)*100:.1f}%)")

=== USER DISTRIBUTION ===
userId
c1d2a96d-da46-404b-97ed-ffcbd76f1dab    33
66453e68-c05c-46b8-9e7a-71615370e110    27
1927df39-3746-4a83-bf6d-527e91d72ee6    17
0c4eee93-be4b-4afb-8049-31c2095fbd23    10
98149010-b355-44ff-877a-c391a74da7fd     1
e5c89a3b-b2db-428b-9ca5-2d59b9975f46     1
fe5e1b7a-8b0e-4031-800d-3f6760f49558     1
56e682af-dd66-4762-82d7-f3660d27af28     1
39fb5e0b-6b41-4db0-b0c8-cfadcf61c343     1
eb446d4d-edde-4618-8385-6b914fabd145     1
Name: count, dtype: int64

Top 4 users account for 87/93 messages (93.5%)


## 3. Data Cleaning & Type Conversion

In [6]:
# Convert timestamp to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'], format='ISO8601', errors='coerce')

# Parse tags JSON array
def parse_tags(tags_str):
    """Parse JSON array of tags."""
    if pd.isna(tags_str) or tags_str == '':
        return []
    try:
        return json.loads(tags_str)
    except (json.JSONDecodeError, TypeError):
        # Malformed or non-string tags are treated as "no tags"
        return []

df['tags_list'] = df['tags'].apply(parse_tags)

# Extract model from tags (openai, anthropic, etc.)
def extract_model(tags):
    """Extract AI model from tags."""
    for tag in tags:
        tag_lower = str(tag).lower()
        if 'openai' in tag_lower:
            return 'openai'
        if 'anthropic' in tag_lower or 'claude' in tag_lower:
            return 'anthropic'
    return 'unknown'

df['model'] = df['tags_list'].apply(extract_model)

print("Model distribution:")
print(df['model'].value_counts())

Model distribution:
model
unknown    60
openai     33
Name: count, dtype: int64


## 4. Input Analysis & Separation

Separate the user's **instruction** from any **data payload** (JSON) they're passing to Flora.

In [7]:
def extract_instruction_and_payload(text):
    """
    Separate user instruction from data payload.
    
    Returns: (instruction, has_json_payload, has_reference)
    """
    if pd.isna(text) or text == '':
        return ('', False, False)
    
    # Check for JSON payload (matches {...} with at most one level of nested braces)
    json_match = re.search(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', text)
    has_json = json_match is not None
    
    # Check for @references (Flora-specific syntax like @work_period:uuid[Name])
    ref_pattern = r'@\w+:[a-f0-9-]+\[[^\]]+\]'
    has_reference = bool(re.search(ref_pattern, text))
    
    # Extract instruction (text before JSON or the whole text)
    if has_json:
        instruction = text[:json_match.start()].strip()
    else:
        instruction = text
    
    # Clean up instruction - replace @references with [REF] for cleaner analysis
    instruction_clean = re.sub(ref_pattern, '[REF]', instruction)
    
    return (instruction_clean.strip(), has_json, has_reference)

# Apply extraction
extraction = df['input'].apply(extract_instruction_and_payload)
df['instruction'] = extraction.apply(lambda x: x[0])
df['has_json_payload'] = extraction.apply(lambda x: x[1])
df['has_reference'] = extraction.apply(lambda x: x[2])

print(f"Inputs with JSON payload: {df['has_json_payload'].sum()} ({df['has_json_payload'].mean()*100:.1f}%)")
print(f"Inputs with @references: {df['has_reference'].sum()} ({df['has_reference'].mean()*100:.1f}%)")

Inputs with JSON payload: 39 (41.9%)
Inputs with @references: 4 (4.3%)
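
The regular expression in the cell above matches braces nested at most one level deep, so for a more deeply nested payload the match may begin at an inner object rather than at the start of the payload. As an alternative sketch, `json.JSONDecoder.raw_decode` can find the first position where a valid JSON object begins, regardless of nesting depth. The helper name `split_instruction_and_json` and the sample input are hypothetical, and this is not the method that produced the counts above.

```python
import json

def split_instruction_and_json(text):
    """Split text at the first position where a valid JSON object starts.

    Returns (instruction, payload); payload is None when no object is found.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            payload, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
            continue
        return text[:start].strip(), payload
    return text.strip(), None

# Invented input with three levels of nesting.
instruction, payload = split_instruction_and_json(
    'Analyze the following data {"sprint": {"stats": {"velocity": 21}}}'
)
print(instruction)
print(payload)
```

Unlike the regex, this approach only counts a brace as a payload when the text from that brace onward parses as JSON, so stray braces in prose are ignored.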


In [8]:
# Preview extracted instructions
print("=== SAMPLE INSTRUCTIONS (cleaned) ===")
sample = df[df['instruction'] != ''].head(15)
for i, (idx, row) in enumerate(sample.iterrows()):
    instr = row['instruction'][:120]
    flags = []
    if row['has_json_payload']: flags.append('JSON')
    if row['has_reference']: flags.append('REF')
    flag_str = f" [{', '.join(flags)}]" if flags else ""
    print(f"{i+1}. {instr}...{flag_str}" if len(row['instruction']) > 120 else f"{i+1}. {instr}{flag_str}")
    print()

=== SAMPLE INSTRUCTIONS (cleaned) ===
1. For work period [REF]@parent:board:b1192a4c-f0fb-4fac-aa9a-17b284d95b8e[REBEL-Scrum] provide Executive Summary, Sprint R... [REF]

2. Analyze the following data and provide a concise summary of the key findings. Keep your response in a short paragraph of... [JSON]

3. tell me about this initiative

4. provide sprint reto analysis for work period 'STAR-2025 SP23'

5. Analyze the following data and provide a concise summary of the key findings. Keep your response in a short paragraph of... [JSON]

6. what is the velocity for CS-Federal Bureau of Iteration for sprint 26

7. Analyze the following data and provide a concise summary of the key findings. Keep your response in a short paragraph of... [JSON]

8. what is the velocity for the last 6 sprints for CS-Federal Bureau of Iteration

9. Analyze the following data and provide a concise summary of the key findings. Keep your response in a short paragraph of... [JSON]

10. what is the velocity for th
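
The first sample above still contains a reference of the form `@parent:board:<uuid>[Name]`, because `ref_pattern` expects a single `@word:` prefix before the identifier. A possible broader pattern, assuming references may carry several colon-separated prefix segments, is sketched below. The name `broad_ref_pattern` and the first identifier in the sample text are invented; the second reference is copied from the sample output.

```python
import re

# Pattern from the notebook: one "@word:" prefix, then a hex/dash identifier.
ref_pattern = r'@\w+:[a-f0-9-]+\[[^\]]+\]'

# Assumed variant: one or more "word:" segments before the identifier.
broad_ref_pattern = r'@(?:\w+:)+[a-f0-9-]+\[[^\]]+\]'

text = (
    "For work period @work_period:0000aaaa-1111[Sprint A]"
    "@parent:board:b1192a4c-f0fb-4fac-aa9a-17b284d95b8e[REBEL-Scrum] provide a summary"
)

original_result = re.sub(ref_pattern, "[REF]", text)
broad_result = re.sub(broad_ref_pattern, "[REF]", text)

print(original_result)  # the @parent:board:... reference is left in place
print(broad_result)     # both references are replaced
```

Swapping in the broader pattern would change the cleaned instructions and possibly the `has_reference` counts, so the figures reported in this notebook apply to the original pattern only.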

## 5. Compute Metrics

In [9]:
# Text length metrics
df['input_length'] = df['input'].str.len().fillna(0).astype(int)
df['output_length'] = df['output'].str.len().fillna(0).astype(int)
df['instruction_length'] = df['instruction'].str.len().fillna(0).astype(int)

# Sort by session and timestamp
df = df.sort_values(['sessionId', 'timestamp']).reset_index(drop=True)

# Session metrics
session_sizes = df.groupby('sessionId').size()
df['session_message_count'] = df['sessionId'].map(session_sizes)

# Is this the first message in session?
df['is_first_in_session'] = ~df['sessionId'].duplicated(keep='first')

# Message position within session (1st, 2nd, 3rd...)
df['message_position'] = df.groupby('sessionId').cumcount() + 1

# User metrics
user_msg_counts = df.groupby('userId').size()
df['user_total_messages'] = df['userId'].map(user_msg_counts)

user_session_counts = df.groupby('userId')['sessionId'].nunique()
df['user_session_count'] = df['userId'].map(user_session_counts)

print("=== METRICS SUMMARY ===")
print(f"Avg input length: {df['input_length'].mean():.0f} chars")
print(f"Avg instruction length: {df['instruction_length'].mean():.0f} chars")
print(f"Avg messages per session: {session_sizes.mean():.2f}")
print(f"Avg sessions per user: {user_session_counts.mean():.2f}")
print(f"Single-message sessions: {(session_sizes == 1).sum()} / {len(session_sizes)} ({(session_sizes == 1).mean()*100:.1f}%)")

=== METRICS SUMMARY ===
Avg input length: 3283 chars
Avg instruction length: 153 chars
Avg messages per session: 1.27
Avg sessions per user: 7.70
Single-message sessions: 60 / 73 (82.2%)


## 6. Export Cleaned Data

In [10]:
# Select and order columns for export
export_columns = [
    'id', 'timestamp', 'userId', 'sessionId', 'environment',
    'model', 'input', 'output', 'instruction',
    'has_json_payload', 'has_reference',
    'input_length', 'output_length', 'instruction_length',
    'session_message_count', 'message_position', 'is_first_in_session',
    'user_total_messages', 'user_session_count'
]

df_export = df[export_columns].copy()

# Save cleaned data
output_file = DATA_DIR / "flora-chats-cleaned.csv"
df_export.to_csv(output_file, index=False)
print(f"Exported {len(df_export)} rows to {output_file}")

# Save data summary
summary = {
    'total_messages': len(df_export),
    'unique_users': int(df_export['userId'].nunique()),
    'unique_sessions': int(df_export['sessionId'].nunique()),
    'avg_session_depth': round(float(session_sizes.mean()), 2),
    'single_message_sessions_pct': round(float((session_sizes == 1).mean() * 100), 1),
    'pct_with_json_payload': round(float(df_export['has_json_payload'].mean() * 100), 1),
    'pct_with_reference': round(float(df_export['has_reference'].mean() * 100), 1),
    'first_message_date': str(df_export['timestamp'].min()),
    'last_message_date': str(df_export['timestamp'].max()),
    'model_distribution': df_export['model'].value_counts().to_dict()
}

with open(DATA_DIR / 'data_summary.json', 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2)

print("\n=== DATA SUMMARY ===")
for k, v in summary.items():
    print(f"{k}: {v}")

Exported 93 rows to ../data/flora-chats-cleaned.csv

=== DATA SUMMARY ===
total_messages: 93
unique_users: 10
unique_sessions: 73
avg_session_depth: 1.27
single_message_sessions_pct: 82.2
pct_with_json_payload: 41.9
pct_with_reference: 4.3
first_message_date: 2025-12-10 19:58:44.193000+00:00
last_message_date: 2026-01-09 17:10:56.745000+00:00
model_distribution: {'unknown': 60, 'openai': 33}


In [11]:
# Final data preview
print("=== CLEANED DATA PREVIEW ===")
df_export[['timestamp', 'userId', 'instruction', 'has_json_payload', 'session_message_count', 'message_position']].head(10)

=== CLEANED DATA PREVIEW ===


| | timestamp | userId | instruction | has_json_payload | session_message_count | message_position |
|---|---|---|---|---|---|---|
| 0 | 2026-01-09 16:29:38.196000+00:00 | c1d2a96d-da46-404b-97ed-ffcbd76f1dab | Tell me about this initiative | False | 1 | 1 |
| 1 | 2026-01-07 19:14:37.562000+00:00 | c1d2a96d-da46-404b-97ed-ffcbd76f1dab | For work period 'REBEL-2025 SP26' provide Exec... | False | 1 | 1 |
| 2 | 2025-12-15 19:05:37.654000+00:00 | 66453e68-c05c-46b8-9e7a-71615370e110 | Analyze the following data and provide a conci... | True | 1 | 1 |
| 3 | 2026-01-05 16:43:30.135000+00:00 | 1927df39-3746-4a83-bf6d-527e91d72ee6 | Analyze the following data and provide a conci... | True | 1 | 1 |
| 4 | 2026-01-07 19:26:13.581000+00:00 | 1927df39-3746-4a83-bf6d-527e91d72ee6 | Analyze the following data and provide a conci... | True | 1 | 1 |
| 5 | 2026-01-09 16:57:51.458000+00:00 | c1d2a96d-da46-404b-97ed-ffcbd76f1dab | tell me about this initiative | False | 1 | 1 |
| 6 | 2025-12-10 19:58:44.193000+00:00 | 66453e68-c05c-46b8-9e7a-71615370e110 | Explain to me for BDC Tools the cycle time tre... | False | 3 | 1 |
| 7 | 2025-12-10 19:59:06.440000+00:00 | 66453e68-c05c-46b8-9e7a-71615370e110 | Throughput and Velocity Comparisons | False | 3 | 2 |
| 8 | 2025-12-10 20:02:47.843000+00:00 | 66453e68-c05c-46b8-9e7a-71615370e110 | Explain to me for the DSE board team the cycle... | False | 3 | 3 |
| 9 | 2025-12-16 22:23:22.235000+00:00 | 66453e68-c05c-46b8-9e7a-71615370e110 | Analyze the following data and provide a conci... | True | 1 | 1 |
