# CodexContinue Data Analysis

This notebook demonstrates how to analyze and visualize data from the CodexContinue project.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import sys
import os

# Set up plotting styles
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

print("Libraries imported successfully!")

## Connect to Project Environment

Let's ensure we have access to the CodexContinue project modules:

In [None]:
# Add the project root to Python path to access project modules
project_root = '/app'
if project_root not in sys.path:
    sys.path.append(project_root)

# List available modules in the project
print("Available directories in the project:")
for item in os.listdir(project_root):
    if os.path.isdir(os.path.join(project_root, item)) and not item.startswith('.'):
        print(f"- {item}")

## Simulated Data Generation

For demonstration purposes, we'll generate simulated data that represents user interactions with the CodexContinue system. In a real scenario, you would load this data from a database or API.

In [None]:
# Create a timestamp range for our simulated data
import datetime as dt

# Generate dates for the past 30 days
end_date = dt.datetime.now()
start_date = end_date - dt.timedelta(days=30)
dates = pd.date_range(start=start_date, end=end_date, freq='h')  # 'H' is deprecated since pandas 2.2; use 'H' on older versions

# Create simulated user activity data
np.random.seed(42)  # For reproducibility

# Generate random user IDs (100 users)
user_ids = [f"user_{i:03d}" for i in range(1, 101)]

# Create a DataFrame with simulated user activities
n_samples = 5000
data = {
    'timestamp': np.random.choice(dates, n_samples),
    'user_id': np.random.choice(user_ids, n_samples),
    'action': np.random.choice(['query', 'document_view', 'code_generation', 'code_execution', 'system_config'], n_samples, 
                             p=[0.4, 0.2, 0.2, 0.15, 0.05]),
    'duration_seconds': np.random.exponential(scale=60, size=n_samples),
    'success': np.random.choice([True, False], n_samples, p=[0.9, 0.1])
}

# Create the DataFrame
activity_df = pd.DataFrame(data)
activity_df['date'] = activity_df['timestamp'].dt.date

# Display the first few rows
activity_df.head()
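
In a real deployment the same DataFrame would come from storage rather than a random generator. The following is a minimal sketch, assuming a hypothetical SQLite table named `user_activity` with the same columns as the simulated data. The table name, schema, sample rows, and in-memory database are illustrative and not part of CodexContinue.

```python
import sqlite3

import pandas as pd

# Hypothetical schema mirroring the simulated columns above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_activity ("
    "timestamp TEXT, user_id TEXT, action TEXT, "
    "duration_seconds REAL, success INTEGER)"
)
conn.executemany(
    "INSERT INTO user_activity VALUES (?, ?, ?, ?, ?)",
    [
        ("2024-01-01 09:00:00", "user_001", "query", 12.5, 1),
        ("2024-01-01 10:00:00", "user_002", "code_generation", 48.0, 0),
    ],
)
conn.commit()

# parse_dates converts the text column to datetime64 so the .dt accessor works.
loaded_df = pd.read_sql_query(
    "SELECT * FROM user_activity", conn, parse_dates=["timestamp"]
)
loaded_df["success"] = loaded_df["success"].astype(bool)
loaded_df["date"] = loaded_df["timestamp"].dt.date
conn.close()
```

Once loaded this way, the frame has the same columns as `activity_df`, so the analysis cells should apply with little or no change.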

## Data Overview and Basic Statistics

Let's explore the dataset to understand its structure and basic statistics:

In [None]:
# Basic information about the dataset
print("Dataset shape:", activity_df.shape)
print("\nData types:")
print(activity_df.dtypes)

print("\nBasic statistics:")
activity_df.describe()

In [None]:
# Check for missing values
print("Missing values per column:")
activity_df.isnull().sum()

## User Activity Analysis

Let's analyze patterns in user activity:

In [None]:
# Count of different actions
action_counts = activity_df['action'].value_counts()

# Plot the counts
plt.figure(figsize=(10, 6))
sns.barplot(x=action_counts.index, y=action_counts.values)
plt.title('Distribution of User Actions')
plt.xlabel('Action Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Success rate by action type
success_by_action = activity_df.groupby('action')['success'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=success_by_action.index, y=success_by_action.values)
plt.title('Success Rate by Action Type')
plt.xlabel('Action Type')
plt.ylabel('Success Rate')
plt.xticks(rotation=45)
plt.ylim(0, 1)  # Set y-axis limits for better visualization
plt.tight_layout()
plt.show()
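
A success rate is easier to interpret next to the number of observations behind it. The sketch below uses a small hand-made frame (`sample_df`, an illustrative stand-in for `activity_df`) and named aggregation to return both values in one table.

```python
import pandas as pd

# Illustrative stand-in for activity_df.
sample_df = pd.DataFrame({
    "action": ["query", "query", "query", "code_execution", "code_execution"],
    "success": [True, True, False, True, False],
})

# Named aggregation returns the count and the success rate side by side.
action_summary = (
    sample_df.groupby("action")["success"]
    .agg(n_actions="count", success_rate="mean")
    .sort_values("success_rate", ascending=False)
)
print(action_summary)
```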

## Temporal Analysis

Let's analyze how user activity changes over time:

In [None]:
# Activity by date
daily_activity = activity_df.groupby('date').size()

# Plot daily activity
plt.figure(figsize=(12, 6))
daily_activity.plot(kind='line', marker='o')
plt.title('Daily User Activity')
plt.xlabel('Date')
plt.ylabel('Number of Actions')
plt.grid(True)
plt.tight_layout()
plt.show()

# Activity by hour of day
activity_df['hour'] = activity_df['timestamp'].dt.hour
hourly_activity = activity_df.groupby('hour').size()

plt.figure(figsize=(12, 6))
hourly_activity.plot(kind='bar')
plt.title('User Activity by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Actions')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()
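
Grouping by the `date` column only yields days that contain at least one action, so gaps are silently skipped in the line plot. As a sketch of one alternative, `resample("D")` emits a row for every calendar day, and a rolling mean smooths the daily noise. The frame `sample_ts` is illustrative.

```python
import pandas as pd

# Illustrative data with no activity on 2024-01-02.
sample_ts = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 15:00", "2024-01-03 11:00",
    ]),
    "user_id": ["user_001", "user_002", "user_001"],
})

# resample("D") produces one row per calendar day, including empty days.
daily_counts = sample_ts.set_index("timestamp").resample("D").size()

# min_periods=1 keeps the first days instead of returning NaN for them.
rolling_mean = daily_counts.rolling(window=7, min_periods=1).mean()
print(daily_counts)
```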

## Interactive Visualization with Plotly

Let's create an interactive visualization to explore the data:

In [None]:
# Create an interactive heatmap of user activity by day and hour
activity_df['day_of_week'] = activity_df['timestamp'].dt.day_name()
day_hour_activity = activity_df.groupby(['day_of_week', 'hour']).size().reset_index(name='count')

# Ensure days of week are in correct order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_hour_activity['day_of_week'] = pd.Categorical(day_hour_activity['day_of_week'], categories=day_order, ordered=True)
day_hour_activity = day_hour_activity.sort_values(['day_of_week', 'hour'])

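# Create the heatmap
fig = px.density_heatmap(
    day_hour_activity, 
    x='hour', 
    y='day_of_week',
    z='count',
    histfunc='sum',   # Sum the pre-aggregated counts within each cell
    nbinsx=24,        # One bin per hour instead of automatic binning
    category_orders={'day_of_week': day_order},  # Keep Monday-to-Sunday order on the axis
    title='User Activity Heatmap by Day and Hour',
    labels={'hour': 'Hour of Day', 'day_of_week': 'Day of Week', 'count': 'Number of Actions'},
    color_continuous_scale='Viridis'
)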

fig.update_layout(width=900, height=600)
fig.show()
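
The heatmap is a picture of a day-by-hour matrix, and it can help to build that matrix explicitly, for example to export it or to confirm that empty cells are treated as zero. A minimal sketch, using an illustrative `counts` frame in the same shape as `day_hour_activity`:

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Illustrative counts in the same shape as day_hour_activity.
counts = pd.DataFrame({
    "day_of_week": ["Monday", "Monday", "Wednesday"],
    "hour": [9, 14, 9],
    "count": [5, 3, 7],
})

# Pivot to a day x hour grid, then reindex so every day and hour appears.
matrix = (
    counts.pivot_table(index="day_of_week", columns="hour",
                       values="count", aggfunc="sum", fill_value=0)
    .reindex(index=day_order, columns=range(24), fill_value=0)
)
print(matrix.shape)
```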

## User Engagement Metrics

Let's calculate some key user engagement metrics:

In [None]:
# Daily active users (DAU)
dau = activity_df.groupby('date')['user_id'].nunique()

# Monthly active users (assuming we have 30 days of data)
mau = activity_df['user_id'].nunique()

# Average actions per user
actions_per_user = activity_df.groupby('user_id').size().mean()

# Average duration per action (each row is one action, not a full session)
avg_duration = activity_df['duration_seconds'].mean()

# Display metrics
metrics = {
    'Daily Active Users (Average)': dau.mean(),
    'Monthly Active Users': mau,
    'Average Actions per User': actions_per_user,
    'Average Action Duration (seconds)': avg_duration
}

for metric, value in metrics.items():
    print(f"{metric}: {value:.2f}")

# Plot DAU trend
plt.figure(figsize=(12, 6))
dau.plot(kind='line', marker='o')
plt.title('Daily Active Users Trend')
plt.xlabel('Date')
plt.ylabel('Number of Active Users')
plt.grid(True)
plt.tight_layout()
plt.show()
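
DAU and MAU are often combined into a single "stickiness" ratio, average DAU divided by MAU, which estimates the share of monthly users who are active on a typical day. A small worked sketch with illustrative data:

```python
import pandas as pd

# Illustrative activity log: two users over two days.
sample = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "user_id": ["user_001", "user_002", "user_001", "user_001"],
})

dau_sample = sample.groupby("date")["user_id"].nunique()  # 2 users, then 1 user
mau_sample = sample["user_id"].nunique()                  # 2 distinct users overall

# Mean DAU is (2 + 1) / 2 = 1.5, so stickiness is 1.5 / 2 = 0.75.
stickiness = dau_sample.mean() / mau_sample
print(f"Stickiness (avg DAU / MAU): {stickiness:.2f}")
```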

## User Segmentation

Let's segment users based on their activity levels:

In [None]:
# Calculate activity metrics per user
user_activity = activity_df.groupby('user_id').agg({
    'timestamp': 'count',              # Total actions
    'duration_seconds': 'mean',        # Average duration per action
    'success': 'mean'                  # Success rate
}).rename(columns={'timestamp': 'total_actions'})

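# Create user segments based on activity using fixed thresholds.
# With 5000 actions spread over 100 users (about 50 each), most simulated users land in 'Medium'.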
def categorize_activity(actions):
    if actions < 30:
        return 'Low'
    elif actions < 70:
        return 'Medium'
    else:
        return 'High'

user_activity['activity_level'] = user_activity['total_actions'].apply(categorize_activity)

# Display distribution of user segments
segment_counts = user_activity['activity_level'].value_counts()

plt.figure(figsize=(10, 6))
sns.barplot(x=segment_counts.index, y=segment_counts.values)
plt.title('Distribution of User Segments')
plt.xlabel('Activity Level')
plt.ylabel('Number of Users')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

# Compare metrics across segments
segment_metrics = user_activity.groupby('activity_level').agg({
    'total_actions': 'mean',
    'duration_seconds': 'mean',
    'success': 'mean'
})

segment_metrics
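
Fixed cut-offs such as 30 and 70 can leave a segment nearly empty when most users have similar totals. One alternative is quantile-based binning with `pd.qcut`, which splits users into equally sized groups. The sketch below uses illustrative per-user totals, and the three-way split is an assumption rather than a project requirement.

```python
import pandas as pd

# Illustrative per-user totals; all fall between the fixed 30 and 70 cut-offs.
totals = pd.Series(
    [41, 44, 47, 49, 50, 52, 53, 56, 60],
    index=[f"user_{i:03d}" for i in range(1, 10)],
    name="total_actions",
)

# qcut places roughly the same number of users in each bin.
levels = pd.qcut(totals, q=3, labels=["Low", "Medium", "High"])
print(levels.value_counts().sort_index())
```

With the fixed thresholds every one of these users would be labelled `Medium`. The quantile split instead describes each user relative to the rest of the population.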

## Integration with CodexContinue API

Let's demonstrate how to connect to the CodexContinue API to fetch real data.

> Note: This requires the backend service to be running. If it is not available, the cell reports the connection error and the notebook continues with the simulated data.

In [None]:
# API interaction example
import requests

# Backend service URL (using Docker Compose network name)
backend_url = 'http://backend:8000'

# Function to check if a service is available
def check_service(url):
    try:
        response = requests.get(f"{url}/health", timeout=3)
        if response.status_code == 200:
            print(f"✅ Service at {url} is available")
            return True
        else:
            print(f"❌ Service at {url} returned status code {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Cannot connect to service at {url}: {e}")
        return False

# Check if backend service is available
print("Checking backend service availability...")
backend_available = check_service(backend_url)

# If backend is available, try to fetch some data
if backend_available:
    try:
        # This is a placeholder - modify based on your actual API endpoints
        response = requests.get(f"{backend_url}/api/stats/summary", timeout=5)
        if response.status_code == 200:
            api_data = response.json()
            print("\nData from API:")
            print(api_data)
        else:
            print(f"Could not fetch data from API. Status code: {response.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data from API: {e}")
else:
    print("\nUsing simulated data as backend service is not available.")
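
The try/except pattern above can be folded into a small reusable helper. The sketch below is hypothetical: `fetch_json` is not part of CodexContinue, and it takes the HTTP getter as an argument (for example `requests.get`) so it can be exercised without a running backend. The stub response and its payload are invented for illustration only.

```python
from typing import Any, Callable, Optional


def fetch_json(get: Callable[..., Any], url: str, timeout: float = 5.0) -> Optional[dict]:
    """Return parsed JSON from url, or None on an error or non-200 status."""
    try:
        response = get(url, timeout=timeout)
    except Exception as exc:  # requests raises RequestException subclasses here
        print(f"Request failed: {exc}")
        return None
    if response.status_code != 200:
        print(f"Unexpected status code: {response.status_code}")
        return None
    return response.json()


class _StubResponse:
    """Stand-in for a successful HTTP response (illustrative payload)."""
    status_code = 200

    def json(self):
        return {"total_actions": 5000}


def _stub_get(url, timeout):
    return _StubResponse()


summary = fetch_json(_stub_get, "http://backend:8000/api/stats/summary")
print(summary)
```

In the notebook the call would pass `requests.get` as the first argument, and a `None` result would signal that the simulated data should be used.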

## ML Model Example

Let's demonstrate a simple machine learning model using the simulated data:

In [None]:
# Prepare data for modeling
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Use action duration, hour of day, and action type to predict success.
# Copy the selection so later edits do not trigger SettingWithCopyWarning.
X = activity_df[['duration_seconds', 'hour']].copy()

# One-hot encode the action type
action_dummies = pd.get_dummies(activity_df['action'], prefix='action')
X = pd.concat([X, action_dummies], axis=1)

y = activity_df['success']  # Target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Get feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance)
plt.title('Feature Importance for Predicting Success')
plt.tight_layout()
plt.show()
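
Because about 90% of the simulated actions succeed, accuracy alone can be misleading: a model that always predicts success already scores about 0.9. The sketch below computes that majority-class baseline on illustrative labels drawn with the same success probability. The array `y_sim` stands in for `y_test`.

```python
import numpy as np

# Illustrative labels with a 90% success probability, standing in for y_test.
rng = np.random.default_rng(42)
y_sim = rng.random(1500) < 0.9

# Always predicting the more common class yields this accuracy.
success_share = y_sim.mean()
baseline_accuracy = max(success_share, 1 - success_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Comparing the classifier's accuracy with this number, and reading the recall for the `False` class in the classification report, gives a more honest picture of what the model has learned.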

## Saving Results and Exporting

Let's demonstrate how to save our analysis results and export them for use in other parts of the project:

In [None]:
# Create a directory for our exports if it doesn't exist
import os
os.makedirs("/notebooks/exports", exist_ok=True)

# Export the user activity data to CSV
user_activity.to_csv("/notebooks/exports/user_activity_summary.csv")

# Export the daily activity data
daily_activity_df = daily_activity.reset_index()
daily_activity_df.columns = ['date', 'activity_count']
daily_activity_df.to_csv("/notebooks/exports/daily_activity.csv", index=False)

# Export the model feature importance
feature_importance.to_csv("/notebooks/exports/feature_importance.csv", index=False)

print("Results exported to the '/notebooks/exports' directory.")
print("You can access these files from your local machine in the 'notebooks/exports' directory.")
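
A quick round trip confirms that an exported file reads back as expected. The sketch below writes an illustrative summary frame to a temporary directory (instead of `/notebooks/exports`) and reloads it. The sample values are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative summary in the same shape as user_activity.
summary = pd.DataFrame(
    {"total_actions": [48, 55], "success": [0.92, 0.88]},
    index=pd.Index(["user_001", "user_002"], name="user_id"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "user_activity_summary.csv"
    summary.to_csv(out_path)  # The index is written as the first column
    restored = pd.read_csv(out_path, index_col="user_id")

print(restored)
```

Passing `index_col` on read restores `user_id` as the index. Without it the column comes back as ordinary data.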

## Summary and Next Steps

We've analyzed simulated user activity data modeled on the CodexContinue project and:

1. **Explored user activity patterns** across different dimensions (time, action types)
2. **Calculated key engagement metrics** such as DAU, MAU, and average action duration
3. **Segmented users** based on their activity levels
4. **Built a demonstration model** that predicts whether an action succeeds; because the data is randomly generated, its feature importances carry no real signal
5. **Demonstrated API integration** with the backend service

### Next Steps

- Connect to real data sources when they become available
- Create automated analysis pipelines
- Develop more sophisticated ML models for user behavior prediction
- Build interactive dashboards for real-time monitoring