# LLM Training Data Curation Pipeline: Data Analysis

This notebook demonstrates the analysis of legal text data processed through our LLM Training Data Curation Pipeline. We'll explore the data quality, distribution, and characteristics that make it suitable for language model training.

## Overview

In this notebook, we will:
1. Load and explore processed legal documents from our MongoDB database
2. Analyze data quality metrics and distribution
3. Examine text characteristics relevant for LLM training
4. Visualize key insights about the dataset
5. Generate sample statistics for model training considerations

## Setup and Configuration

First, let's import the necessary libraries and set up our environment.

In [None]:
import os
import sys
import json
from pathlib import Path
from datetime import datetime
import random
from typing import Dict, List, Optional, Union, Any

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pymongo import MongoClient
from wordcloud import WordCloud
from collections import Counter

# Add the project root to the Python path
project_root = Path("..")
sys.path.append(str(project_root))

# Import our pipeline modules
from src.data_processing.base import ProcessedDocument
from src.data_storage.mongodb import MongoDBStorage

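# Set up plotting. seaborn's theme call resets matplotlib's style, so the
# whitegrid style is requested through seaborn rather than plt.style.use().
sns.set_theme(style="whitegrid", font_scale=1.2)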
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100

## Connect to MongoDB and Load Data

Now, let's connect to our MongoDB database and load the processed documents.

In [None]:
# MongoDB connection parameters
MONGODB_CONNECTION_STRING = "mongodb://localhost:27017/"
MONGODB_DATABASE = "llm_data_pipeline"
MONGODB_COLLECTION = "processed_documents"

# Create MongoDB storage instance
mongodb_storage = MongoDBStorage(
    connection_string=MONGODB_CONNECTION_STRING,
    database_name=MONGODB_DATABASE,
    collection_name=MONGODB_COLLECTION
)

# Connect to MongoDB
if mongodb_storage.connect():
    print(f"Connected to MongoDB: {MONGODB_DATABASE}.{MONGODB_COLLECTION}")
else:
    print("Failed to connect to MongoDB. Using sample data instead.")
    # We'll create sample data later if connection fails

In [None]:
# Function to load documents from MongoDB or generate sample data
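def load_documents(limit=1000, use_sample=False):
    # Compare with None explicitly: assuming `collection` is a pymongo
    # Collection, using it in a boolean context raises NotImplementedError.
    if not use_sample and mongodb_storage.collection is not None:
        # Query documents from MongoDB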
        print(f"Loading up to {limit} documents from MongoDB...")
        documents = mongodb_storage.query_documents({}, limit=limit)
        print(f"Loaded {len(documents)} documents")
        return documents
    else:
        # Generate sample data
        print("Generating sample data...")
        return generate_sample_documents(count=min(limit, 100))

# Function to generate sample documents for demonstration
def generate_sample_documents(count=100):
    documents = []
    
    # Sample legal phrases and terms
    legal_phrases = [
        "The Court finds that", "It is hereby ordered", "The plaintiff argues",
        "The defendant contends", "According to precedent", "The statute provides",
        "Under the law", "The evidence shows", "The jury concluded", "In this case"
    ]
    
    legal_terms = [
        "jurisdiction", "tort", "liability", "damages", "plaintiff", "defendant",
        "appeal", "motion", "injunction", "testimony", "evidence", "ruling",
        "statute", "precedent", "contract", "negligence", "remedy", "violation"
    ]
    
    # Sample court names
    courts = [
        "Supreme Court", "Circuit Court", "District Court", "Court of Appeals",
        "Federal Court", "State Court", "Municipal Court", "Bankruptcy Court"
    ]
    
    # Generate documents
    for i in range(count):
        # Generate random text length
        text_length = random.randint(5, 20)  # paragraphs
        
        # Generate random text
        paragraphs = []
        for _ in range(text_length):
            # Generate a paragraph with 3-8 sentences
            sentences = []
            for _ in range(random.randint(3, 8)):
                # Start with a legal phrase
                phrase = random.choice(legal_phrases)
                
                # Add 5-15 random legal terms
                terms = random.sample(legal_terms, random.randint(5, 15))
                
                # Construct a sentence
                sentence = f"{phrase} {' '.join(terms)}."
                sentences.append(sentence)
            
            # Join sentences into a paragraph
            paragraph = " ".join(sentences)
            paragraphs.append(paragraph)
        
        # Join paragraphs into a document
        text = "\n\n".join(paragraphs)
        
        # Create a ProcessedDocument
        doc = ProcessedDocument(
            id=f"sample-{i+1}",
            source="sample",
            source_id=f"sample-{i+1}",
            text=text
        )
        
        # Tokenize on whitespace
        doc.tokens = text.split()
        doc.token_count = len(doc.tokens)
        
        # Add a random quality score and simple text metrics
        words = text.split()
        sentences_in_text = [s.strip() for s in text.split(".") if s.strip()]
        doc.quality_score = random.uniform(0.5, 1.0)
        doc.quality_metrics = {
            "text_length": len(text),
            "word_count": len(words),
            "avg_word_length": sum(len(word) for word in words) / max(1, len(words)),
            "sentence_count": len(sentences_in_text),
            "alphanumeric_ratio": sum(1 for c in text if c.isalnum()) / max(1, len(text))
        }
        
        # Add metadata
        doc.original_metadata = {
            "court": random.choice(courts),
            "year": random.randint(2000, 2023),
            "case_type": random.choice(["Civil", "Criminal", "Administrative", "Constitutional"])
        }
        
        # Add processing metadata (empty fragments after the final period are dropped)
        doc.processing_metadata = {
            "filtered": False,
            "duplicate": False,
            "sentence_count": doc.quality_metrics["sentence_count"],
            "sentences": [s.strip() for s in text.split(".") if s.strip()]
        }
        
        # Add to documents list
        documents.append(doc)
    
    return documents
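
Because `generate_sample_documents` draws from the module-level `random` state, each run produces different fallback data and therefore different plots. One way to make the sample reproducible is to pass a seeded `random.Random` instance through the generator. The sketch below is a reduced illustration of that idea; `make_sentence`, `make_sentences`, the shortened word lists, and the seed value are hypothetical and not part of the pipeline.

```python
import random
from typing import List

# Shortened lists for illustration only
LEGAL_PHRASES = ["The Court finds that", "It is hereby ordered", "The plaintiff argues"]
LEGAL_TERMS = ["jurisdiction", "tort", "liability", "damages", "appeal", "motion"]


def make_sentence(rng: random.Random) -> str:
    """Build one synthetic sentence using only the supplied generator."""
    phrase = rng.choice(LEGAL_PHRASES)
    # sample() draws without replacement, so k must not exceed len(LEGAL_TERMS)
    terms = rng.sample(LEGAL_TERMS, rng.randint(2, len(LEGAL_TERMS)))
    return f"{phrase} {' '.join(terms)}."


def make_sentences(seed: int, count: int) -> List[str]:
    """Generate `count` sentences from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [make_sentence(rng) for _ in range(count)]


run_a = make_sentences(seed=42, count=3)
run_b = make_sentences(seed=42, count=3)
print(run_a == run_b)  # identical seeds give identical data
```

A private `random.Random` instance also keeps the sample generator from disturbing other code that relies on the global random state.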

In [None]:
# Load documents
documents = load_documents(limit=1000)

# If no documents were loaded, use sample data
if not documents:
    documents = load_documents(limit=100, use_sample=True)

## Convert to DataFrame for Analysis

Let's convert our documents to a pandas DataFrame for easier analysis.

In [None]:
def documents_to_dataframe(documents):
    """Convert a list of ProcessedDocument objects to a pandas DataFrame."""
    data = []
    
    for doc in documents:
        # Extract basic document info
        doc_dict = {
            "id": doc.id,
            "source": doc.source,
            "text_length": len(doc.text),
            "token_count": doc.token_count,
            "quality_score": doc.quality_score
        }
        
        # Extract quality metrics
        if doc.quality_metrics:
            for key, value in doc.quality_metrics.items():
                if isinstance(value, (int, float, str, bool)):
                    doc_dict[f"quality_{key}"] = value
        
        # Extract original metadata
        if doc.original_metadata:
            for key, value in doc.original_metadata.items():
                if isinstance(value, (int, float, str, bool)):
                    doc_dict[f"metadata_{key}"] = value
        
        # Extract processing metadata
        if doc.processing_metadata:
            for key, value in doc.processing_metadata.items():
                if isinstance(value, (int, float, str, bool)):
                    doc_dict[f"processing_{key}"] = value
        
        data.append(doc_dict)
    
    return pd.DataFrame(data)

# Convert documents to DataFrame
df = documents_to_dataframe(documents)

# Display basic information
print(f"DataFrame shape: {df.shape}")
df.head()
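
`documents_to_dataframe` keeps only scalar values and prefixes each key with its origin, so list-valued entries such as `sentences` never become columns. The following stand-alone sketch isolates that flattening step on a plain dictionary; `flatten_scalars` is a hypothetical helper written for illustration.

```python
from typing import Any, Dict

SCALAR_TYPES = (int, float, str, bool)


def flatten_scalars(prefix: str, mapping: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only scalar entries of `mapping`, renaming keys to `<prefix>_<key>`."""
    return {
        f"{prefix}_{key}": value
        for key, value in mapping.items()
        if isinstance(value, SCALAR_TYPES)
    }


processing_metadata = {
    "filtered": False,
    "duplicate": False,
    "sentence_count": 2,
    "sentences": ["The Court finds that", "It is hereby ordered"],  # list: dropped
}

row = flatten_scalars("processing", processing_metadata)
print(row)
# {'processing_filtered': False, 'processing_duplicate': False, 'processing_sentence_count': 2}
```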

## Data Quality Analysis

Let's analyze the quality of our dataset based on various metrics.

In [None]:
# Basic statistics
print("Basic statistics for numerical columns:")
df.describe()

In [None]:
# Quality score distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['quality_score'], kde=True, bins=20)
plt.title('Distribution of Quality Scores')
plt.xlabel('Quality Score')
plt.ylabel('Count')
plt.axvline(x=0.7, color='r', linestyle='--', label='Minimum Threshold (0.7)')
plt.legend()
plt.show()

In [None]:
# Document length distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['text_length'], kde=True, bins=20)
plt.title('Distribution of Document Lengths')
plt.xlabel('Text Length (characters)')
plt.ylabel('Count')
plt.show()

In [None]:
# Token count distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['token_count'], kde=True, bins=20)
plt.title('Distribution of Token Counts')
plt.xlabel('Token Count')
plt.ylabel('Count')
plt.show()
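
Documents in the long tail of this distribution may exceed a model's context window and are commonly split before training. A minimal sketch of fixed-size chunking with overlap over a token list follows; `chunk_tokens`, the chunk size, and the overlap are illustrative values rather than pipeline settings.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split `tokens` into windows of at most `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

The overlap repeats a few tokens across chunk boundaries so that a sentence cut by one boundary still appears whole in at least one chunk, at the cost of a slightly larger token count.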

In [None]:
# Quality metrics correlation
# Select only numeric columns for correlation
numeric_cols = df.select_dtypes(include=['number']).columns
quality_cols = [
    col for col in numeric_cols
    if col.startswith('quality_') or col in ('text_length', 'token_count')
]

# Calculate correlation matrix
corr_matrix = df[quality_cols].corr()

# Plot correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Between Quality Metrics')
plt.tight_layout()
plt.show()
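
`DataFrame.corr()` computes the Pearson coefficient by default. The plain-Python sketch below spells that coefficient out, which helps when reading the heatmap: two columns that hold the same values (in the generated sample data, `text_length` and `quality_text_length`) will always show 1.00. `pearson` is a hypothetical helper and the numbers are made up.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


text_length = [1200, 3400, 560, 2100]
word_count = [200, 560, 95, 340]

print(round(pearson(text_length, text_length), 2))  # a column against itself
print(round(pearson(text_length, word_count), 2))   # two closely related columns
```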

## Text Characteristics Analysis

Let's analyze the characteristics of the text data that are relevant for LLM training.

In [None]:
# Function to extract and analyze tokens from documents
def analyze_tokens(documents, max_docs=100):
    """Analyze tokens from a list of documents."""
    # Limit to max_docs for performance
    docs_to_analyze = documents[:max_docs]
    
    # Collect all tokens
    all_tokens = []
    for doc in docs_to_analyze:
        if doc.tokens:
            all_tokens.extend(doc.tokens)
    
    # Count token frequencies
    token_counts = Counter(all_tokens)
    
    # Calculate vocabulary size
    vocab_size = len(token_counts)
    
    # Get most common tokens
    most_common = token_counts.most_common(30)
    
    # Calculate token length distribution
    token_lengths = [len(token) for token in all_tokens]
    
    return {
        "total_tokens": len(all_tokens),
        "vocab_size": vocab_size,
        "most_common": most_common,
        "token_counts": token_counts,
        "token_lengths": token_lengths
    }

# Analyze tokens
token_analysis = analyze_tokens(documents)

print(f"Total tokens analyzed: {token_analysis['total_tokens']}")
print(f"Vocabulary size: {token_analysis['vocab_size']}")
print(f"Type-token ratio (vocabulary size / total tokens): {token_analysis['vocab_size'] / max(1, token_analysis['total_tokens']):.4f}")
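
The last ratio printed above divides the number of distinct tokens by the total number of tokens, a quantity usually called the type-token ratio. It falls as a corpus grows, so it is only comparable between samples of similar size. The sketch below shows that effect on a toy example; `type_token_ratio` and `hapax_count` are hypothetical helpers.

```python
from collections import Counter
from typing import Iterable


def type_token_ratio(tokens: Iterable[str]) -> float:
    """Distinct tokens divided by total tokens (0.0 for an empty input)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def hapax_count(tokens: Iterable[str]) -> int:
    """Number of tokens that occur exactly once."""
    return sum(1 for count in Counter(tokens).values() if count == 1)


short = "the court finds that the defendant is liable".split()
longer = short * 10  # same vocabulary, ten times the text

print(f"short:  TTR={type_token_ratio(short):.3f}, hapaxes={hapax_count(short)}")
print(f"longer: TTR={type_token_ratio(longer):.4f}, hapaxes={hapax_count(longer)}")
```

The vocabulary is identical in both cases, yet the ratio drops by a factor of ten, which is why this number on its own says little about how well legal terminology is covered.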

In [None]:
# Plot most common tokens
plt.figure(figsize=(12, 8))
most_common_df = pd.DataFrame(token_analysis['most_common'], columns=['Token', 'Count'])
sns.barplot(x='Count', y='Token', data=most_common_df)
plt.title('30 Most Common Tokens')
plt.xlabel('Count')
plt.ylabel('Token')
plt.tight_layout()
plt.show()

In [None]:
# Plot token length distribution
plt.figure(figsize=(10, 6))
sns.histplot(token_analysis['token_lengths'], kde=True, bins=20)
plt.title('Distribution of Token Lengths')
plt.xlabel('Token Length (characters)')
plt.ylabel('Count')
plt.show()

In [None]:
# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=200).generate_from_frequencies(token_analysis['token_counts'])

plt.figure(figsize=(16, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Tokens')
plt.show()

## Metadata Analysis

Let's analyze the metadata associated with our documents to understand the composition of our dataset.

In [None]:
# Extract metadata columns
metadata_cols = [col for col in df.columns if col.startswith('metadata_')]

if metadata_cols:
    print(f"Found {len(metadata_cols)} metadata columns: {metadata_cols}")
    
    # Analyze categorical metadata
    for col in metadata_cols:
        if df[col].dtype == 'object' or df[col].nunique() < 20:  # Categorical or few unique values
            plt.figure(figsize=(10, 6))
            value_counts = df[col].value_counts()
            sns.barplot(x=value_counts.index, y=value_counts.values)
            plt.title(f'Distribution of {col}')
            plt.xlabel(col.replace('metadata_', ''))
            plt.ylabel('Count')
            plt.xticks(rotation=45)
            plt.tight_layout()
            plt.show()
else:
    print("No metadata columns found in the DataFrame.")
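
If the bar charts show that a few courts or case types dominate, one common option is to sample documents with weights inversely proportional to the frequency of their category. The sketch below computes such weights; `inverse_frequency_weights` is a hypothetical helper and the label counts are made up.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Per-document weight for each category so that all categories carry equal total weight."""
    counts = Counter(labels)
    n_categories = len(counts)
    return {label: 1.0 / (n_categories * count) for label, count in counts.items()}


case_types = ["Civil"] * 6 + ["Criminal"] * 3 + ["Administrative"] * 1
weights = inverse_frequency_weights(case_types)

# count * weight equals 1 / n_categories for every category,
# so the weights over all documents sum to 1.
for label, weight in weights.items():
    print(f"{label}: {weight:.4f}")
```

The per-document weights can be passed to `random.choices(documents, weights=...)` to draw a balanced sample with replacement. Heavy upweighting of a rare category repeats the same few documents, so it is worth checking how many distinct documents each category actually has.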

## Document Filtering Analysis

Let's analyze the filtering results to understand how many documents were filtered out and why.

In [None]:
# Check if filtering information is available
if 'processing_filtered' in df.columns:
    # Count filtered documents
    filtered_count = df['processing_filtered'].sum()
    total_count = len(df)
    
    print(f"Filtered documents: {filtered_count} ({filtered_count/total_count:.2%})")
    print(f"Retained documents: {total_count - filtered_count} ({(total_count - filtered_count)/total_count:.2%})")
    
    # Plot filtering results
    plt.figure(figsize=(8, 8))
    plt.pie([total_count - filtered_count, filtered_count], 
            labels=['Retained', 'Filtered'], 
            autopct='%1.1f%%',
            colors=['#66b3ff', '#ff9999'],
            explode=(0.1, 0))
    plt.title('Document Filtering Results')
    plt.show()
    
    # Check if filter reason is available
    if 'processing_filter_reason' in df.columns:
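        # Count filter reasons; .eq(True) yields a clean boolean mask even if
        # some documents lack the flag (NaN), which plain boolean indexing rejects
        filter_reasons = df.loc[df['processing_filtered'].eq(True), 'processing_filter_reason'].value_counts()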
        
        if not filter_reasons.empty:
            plt.figure(figsize=(12, 6))
            sns.barplot(x=filter_reasons.values, y=filter_reasons.index)
            plt.title('Filter Reasons')
            plt.xlabel('Count')
            plt.ylabel('Reason')
            plt.tight_layout()
            plt.show()
else:
    print("No filtering information available in the DataFrame.")

## Deduplication Analysis

Let's analyze the deduplication results to understand how many documents were identified as duplicates.

In [None]:
# Check if deduplication information is available
if 'processing_duplicate' in df.columns:
    # Count duplicate documents
    duplicate_count = df['processing_duplicate'].sum()
    total_count = len(df)
    
    print(f"Duplicate documents: {duplicate_count} ({duplicate_count/total_count:.2%})")
    print(f"Unique documents: {total_count - duplicate_count} ({(total_count - duplicate_count)/total_count:.2%})")
    
    # Plot deduplication results
    plt.figure(figsize=(8, 8))
    plt.pie([total_count - duplicate_count, duplicate_count], 
            labels=['Unique', 'Duplicate'], 
            autopct='%1.1f%%',
            colors=['#66b3ff', '#ff9999'],
            explode=(0.1, 0))
    plt.title('Document Deduplication Results')
    plt.show()
    
    # Check if similarity information is available
    if 'processing_similarity' in df.columns:
        # Plot similarity distribution for duplicates; .eq(True) yields a clean
        # boolean mask even if some documents lack the flag (NaN)
        similarities = df.loc[df['processing_duplicate'].eq(True), 'processing_similarity']
        
        if not similarities.empty:
            plt.figure(figsize=(10, 6))
            sns.histplot(similarities, kde=True, bins=20)
            plt.title('Similarity Distribution for Duplicates')
            plt.xlabel('Similarity')
            plt.ylabel('Count')
            plt.show()
else:
    print("No deduplication information available in the DataFrame.")
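
The `processing_similarity` column, when present, holds a score produced upstream; this notebook does not show how the pipeline computes it. Purely as an illustration, one widely used measure for near-duplicate text is Jaccard similarity over word shingles. `shingles` and `jaccard` below are hypothetical helpers, not pipeline functions, and the shingle size of 3 is an arbitrary choice.

```python
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, size: int = 3) -> Set[Shingle]:
    """Set of overlapping word n-grams (shingles) in the lowercased text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "The Court finds that the defendant is liable for damages"
doc_b = "The Court finds that the defendant is liable for negligence"
doc_c = "It is hereby ordered that the motion is denied"

print(round(jaccard(shingles(doc_a), shingles(doc_b)), 3))  # 0.778: one word differs
print(round(jaccard(shingles(doc_a), shingles(doc_c)), 3))  # 0.0: no shared shingles
```

Comparing every pair of documents this way is quadratic in the number of documents, which is why large pipelines usually approximate it, for example with MinHash.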

## Sample Document Analysis

Let's examine a few sample documents to get a better understanding of the data.

In [None]:
# Function to display a sample document
def display_sample_document(doc_index=0):
    """Display a sample document with its metadata and quality metrics."""
    if doc_index < 0 or doc_index >= len(documents):
        print(f"Invalid document index. Must be between 0 and {len(documents)-1}.")
        return
    
    doc = documents[doc_index]
    
    print(f"Document ID: {doc.id}")
    print(f"Source: {doc.source}")
    print(f"Quality Score: {doc.quality_score:.4f}")
    print(f"Token Count: {doc.token_count}")
    print("\nOriginal Metadata:")
    for key, value in doc.original_metadata.items():
        print(f"  {key}: {value}")
    
    print("\nQuality Metrics:")
    for key, value in doc.quality_metrics.items():
        print(f"  {key}: {value}")
    
    print("\nText Sample (first 500 characters):")
    print(doc.text[:500] + "...")
    
    if doc.tokens:
        print("\nTokens Sample (first 20):")
        print(doc.tokens[:20])

# Display a sample document
display_sample_document(0)

In [None]:
# Display another sample document with high quality score
if len(documents) > 1:
    # Find a document with high quality score
    high_quality_indices = [i for i, doc in enumerate(documents) if doc.quality_score > 0.8]
    if high_quality_indices:
        print("Sample document with high quality score:")
        display_sample_document(high_quality_indices[0])
    else:
        print("No documents with high quality score found.")

## LLM Training Considerations

Based on our analysis, let's discuss some considerations for using this dataset for LLM training.

In [None]:
# Calculate dataset statistics relevant for LLM training
total_docs = len(documents)
total_tokens = sum(doc.token_count for doc in documents if doc.token_count)
avg_tokens_per_doc = total_tokens / total_docs if total_docs > 0 else 0

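# Estimate training compute. The throughput below is a placeholder: real values
# vary by orders of magnitude with model size, hardware, and number of epochs.
tokens_per_gpu_day = 5_000_000  # Assumed tokens processed per GPU-day
estimated_gpu_days = total_tokens / tokens_per_gpu_day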

print("Dataset Statistics for LLM Training:")
print(f"Total documents: {total_docs:,}")
print(f"Total tokens: {total_tokens:,}")
print(f"Average tokens per document: {avg_tokens_per_doc:.2f}")
print(f"Estimated GPU days for training: {estimated_gpu_days:.2f}")

# Calculate quality distribution
quality_thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
quality_counts = [sum(1 for doc in documents if doc.quality_score >= threshold) for threshold in quality_thresholds]
quality_percentages = [count / total_docs * 100 for count in quality_counts]

# Plot quality threshold impact
plt.figure(figsize=(10, 6))
plt.plot(quality_thresholds, quality_percentages, marker='o', linewidth=2)
plt.title('Impact of Quality Threshold on Dataset Size')
plt.xlabel('Quality Score Threshold')
plt.ylabel('Percentage of Documents Retained')
plt.grid(True)
plt.show()
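
The curve above counts documents, but training cost and coverage depend on tokens. If low-scoring documents tend to be shorter or longer than average, the share of tokens retained at a threshold differs from the share of documents. The sketch below reports both for a list of `(quality_score, token_count)` pairs; `retention_by_threshold` is a hypothetical helper and the example values are made up.

```python
from typing import Dict, List, Sequence, Tuple

ScoredDoc = Tuple[float, int]  # (quality_score, token_count)


def retention_by_threshold(
    docs: Sequence[ScoredDoc], thresholds: Sequence[float]
) -> List[Dict[str, float]]:
    """Share of documents and of tokens kept at each quality threshold."""
    total_docs = len(docs)
    total_tokens = sum(tokens for _, tokens in docs)
    rows = []
    for threshold in thresholds:
        kept = [(score, tokens) for score, tokens in docs if score >= threshold]
        kept_tokens = sum(tokens for _, tokens in kept)
        rows.append({
            "threshold": threshold,
            "docs_retained": len(kept) / total_docs if total_docs else 0.0,
            "tokens_retained": kept_tokens / total_tokens if total_tokens else 0.0,
        })
    return rows


docs = [(0.55, 4000), (0.65, 1000), (0.75, 3000), (0.85, 1500), (0.95, 500)]
rows = retention_by_threshold(docs, [0.5, 0.7, 0.9])
for row in rows:
    print(f"{row['threshold']:.1f}: docs={row['docs_retained']:.0%}, tokens={row['tokens_retained']:.0%}")
```

In this toy example a threshold of 0.7 keeps 60% of the documents but only 50% of the tokens, because the longest document has the lowest score.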

## Recommendations for LLM Training

Based on our analysis, here are some recommendations for using this dataset for LLM training:

1. **Quality Filtering**: Apply a quality score threshold of at least 0.7 to ensure high-quality training data.

2. **Token Length Considerations**: Compare the token count distribution with the target model's context window, and chunk documents that exceed it to improve training efficiency.

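3. **Vocabulary Coverage**: Use the vocabulary size and the most-common-token plot to confirm that legal terminology is well represented. The ratio of vocabulary size to total tokens is a type-token ratio, so compare it only between samples of similar size.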

4. **Data Balancing**: Consider balancing the dataset across different courts and case types to ensure the model learns a diverse range of legal language.

5. **Deduplication**: Review the duplicate share reported in the deduplication analysis before training. Removing near-duplicate documents reduces the risk that the model memorizes repeated content.

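6. **Training Resources**: Use the GPU-day estimate computed above as a rough starting point for allocating resources. It scales linearly with the total token count and rests on an assumed throughput of 5,000,000 tokens per GPU-day, which should be replaced with a figure measured for the target model and hardware.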

7. **Evaluation**: Set aside a portion of the dataset (10-15%) for evaluation to measure model performance on legal text understanding and generation tasks.
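
For the held-out evaluation set in item 7, a split that depends only on the document ID stays stable when the dataset is reloaded or extended. The sketch below assigns each ID to a split by hashing it; `assign_split` is a hypothetical helper, and the 10% fraction and the `sample-N` IDs are illustrative.

```python
import hashlib


def assign_split(doc_id: str, eval_fraction: float = 0.10) -> str:
    """Deterministically map a document ID to 'train' or 'eval'."""
    if not 0.0 <= eval_fraction <= 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "eval" if bucket < eval_fraction else "train"


doc_ids = [f"sample-{i + 1}" for i in range(1000)]
splits = {doc_id: assign_split(doc_id) for doc_id in doc_ids}
eval_share = sum(1 for split in splits.values() if split == "eval") / len(splits)
print(f"eval share: {eval_share:.3f}")  # close to 0.10, but not exactly
```

Hashing the ID does not keep near-duplicate documents on the same side of the split, so the split is best applied after deduplication.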

## Conclusion

In this notebook, we've analyzed the legal text dataset processed through our LLM Training Data Curation Pipeline. We've examined data quality, text characteristics, metadata distribution, and filtering results to gain insights into the dataset's suitability for LLM training.

Where the quality scores, token distributions, and vocabulary statistics above meet the recommended thresholds, this dataset can serve as a valuable resource for training a specialized legal language model once the recommended filtering and preprocessing steps are applied. These conclusions hold only for runs that loaded documents from MongoDB; the generated sample data is synthetic and serves only to exercise the notebook.

The pipeline has successfully curated a dataset that balances quality, diversity, and domain-specificity, making it well-suited for the intended LLM training task.