# LLM Research Data Comparative Analysis Demo

This notebook demonstrates how to use the tools in this repository to analyze research data with Large Language Models and visualize the results.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pathlib import Path

# Import our custom modules
import sys
sys.path.append('../code/')
from visualize import create_comparative_analysis, visualize_publication_trends

## 1. Loading Scopus Data

First, let's load the Scopus data file and explore its structure.

In [None]:
# Load the data
try:
    df = pd.read_csv('../data/scopus_27_Feb.csv')
    print(f"Loaded {len(df)} records from Scopus")
    
    # Display the first few rows
    display(df.head())
    
    # Display basic information about the dataset
    print("\nDataset information:")
    print(f"Number of publications: {len(df)}")
    print(f"Date range: {df['Year'].min()} - {df['Year'].max()}")
    
    # Display column information
    print("\nColumns in the dataset:")
    for col in df.columns:
        print(f"- {col}")
except Exception as e:
    print(f"Error: {e}")
    print("Using sample data for demonstration...")
    # Create sample data for demonstration (already aggregated: one row per year)
    years = range(2017, 2025)
    publications = [5, 12, 28, 45, 98, 187, 342, 421]
    df = pd.DataFrame({'Year': years, 'Count': publications})
    display(df)
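
Real exports sometimes contain missing or non-numeric year values, which can make the date range printed above misleading. The following is a minimal sketch of a more defensive summary. It assumes only that the table has a `Year` column, and `summarize_years` is a hypothetical helper, not part of this repository.

```python
import pandas as pd


def summarize_years(frame, year_col="Year"):
    """Return (first_year, last_year, n_valid) for a publication table.

    Hypothetical helper: values that cannot be parsed as numbers are ignored.
    """
    # Coerce strings such as "2023" to numbers; unparseable entries become NaN
    years = pd.to_numeric(frame[year_col], errors="coerce").dropna().astype(int)
    if years.empty:
        return None, None, 0
    return int(years.min()), int(years.max()), int(len(years))


# Small illustrative table with a string year, a missing value, and a bad entry
example = pd.DataFrame({"Year": [2021, "2023", None, "n/a", 2019]})
first_year, last_year, n_valid = summarize_years(example)
print(f"Date range: {first_year} - {last_year} ({n_valid} rows with a valid year)")
```

Calling `summarize_years(df)` on the loaded data would work the same way, provided the column is named `Year`.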

## 2. Analyzing Publication Trends

Let's visualize the publication trends over time to identify patterns.

In [None]:
# Set plotting style
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

# Plot publication trends by year
fig, ax = plt.subplots(figsize=(12, 6))
if 'Year' in df.columns and 'Count' not in df.columns:
    # Scopus export: one row per publication, so count the rows for each year
    year_counts = df['Year'].value_counts().sort_index()
    title = 'Number of LLM Publications by Year'
else:
    # Either the fallback sample is loaded (already aggregated per year, so
    # value_counts() would report 1 for every year) or 'Year' is missing
    print("Using sample data")
    year_counts = pd.Series(
        [5, 12, 28, 45, 98, 187, 342, 421],
        index=range(2017, 2025)
    )
    title = 'Sample: Number of LLM Publications by Year'

# Draw on the axes created above so no empty extra figure is left behind
year_counts.plot(kind='bar', color='skyblue', ax=ax)
ax.set_title(title, fontsize=16)
ax.set_xlabel('Year', fontsize=14)
ax.set_ylabel('Number of Publications', fontsize=14)
ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
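
One simple way to quantify the pattern in a bar chart like this is year-over-year growth. The sketch below assumes the counts are held in a Series indexed by year. It reuses the sample counts from this notebook, and `yearly_growth` is a hypothetical helper name.

```python
import pandas as pd


def yearly_growth(counts):
    """Return year-over-year growth in percent for a Series indexed by year."""
    counts = counts.sort_index()
    # pct_change() compares each year with the previous one; the first is NaN
    return (counts.pct_change() * 100).round(1)


# Sample publication counts used elsewhere in this notebook
sample_counts = pd.Series(
    [5, 12, 28, 45, 98, 187, 342, 421],
    index=range(2017, 2025),
    name="Count",
)
growth = yearly_growth(sample_counts)
print(growth)
```

The first year has no predecessor, so its growth is reported as `NaN` rather than zero.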

## 3. LLM Comparative Analysis

Now, let's compare different LLMs on a set of benchmark metrics.

In [None]:
# Sample performance data for different LLMs
models = ['GPT-4', 'Claude-3', 'Llama-3', 'Gemini']
metrics = ['MMLU', 'HumanEval', 'TruthfulQA', 'GSM8K']

# Create a comparative analysis
comparison_df = create_comparative_analysis(models=models, metrics=metrics)

# Display the performance data
display(comparison_df)
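
A table of per-metric scores is easier to read once it is reduced to an overall ordering. The sketch below shows one possible approach, averaging each model's rank across metrics. It assumes a DataFrame with one row per model and one column per metric. Whether `create_comparative_analysis` returns that layout is not confirmed here, so the example builds its own table from random placeholder scores that are not real benchmark results. `rank_models` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

models = ['GPT-4', 'Claude-3', 'Llama-3', 'Gemini']
metrics = ['MMLU', 'HumanEval', 'TruthfulQA', 'GSM8K']

# Synthetic placeholder scores for illustration only, NOT real benchmark results
rng = np.random.default_rng(0)
placeholder_df = pd.DataFrame(
    rng.uniform(50, 95, size=(len(models), len(metrics))).round(1),
    index=models,
    columns=metrics,
)


def rank_models(scores):
    """Return each model's mean rank across metrics (1 = best), best first."""
    # Rank within each metric column; higher scores get lower (better) ranks
    ranks = scores.rank(axis=0, ascending=False)
    return ranks.mean(axis=1).sort_values()


mean_rank = rank_models(placeholder_df)
print(mean_rank)
```

Mean rank treats every metric as equally important, which is a design choice. A weighted average would be a reasonable alternative if some metrics matter more for the task at hand.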

## 4. Analyzing Citation Impact

Let's analyze the citation impact of publications in our dataset.

In [None]:
# Use real citation counts if available; otherwise create synthetic data
if 'Cited by' in df.columns:
    citation_data = df['Cited by'].dropna()
else:
    print("Using synthetic citation data: 'Cited by' column not found")
    # Generate synthetic citation data with a right-skewed distribution
    np.random.seed(42)
    citation_data = np.random.exponential(scale=10, size=500)
    citation_data = citation_data.astype(int)
    # Drop any values of 100 or more (this filters them out rather than capping them)
    citation_data = citation_data[citation_data < 100]

# Plot citation distribution
plt.figure(figsize=(10, 6))
sns.histplot(citation_data, bins=30, kde=True, color='coral')
plt.title('Distribution of Citation Counts', fontsize=16)
plt.xlabel('Number of Citations', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.tight_layout()
plt.show()

# Display summary statistics
citation_stats = pd.Series(citation_data).describe()
print("Citation Statistics:")
print(citation_stats)
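
Because citation counts are heavily skewed, the mean in the summary above can be dominated by a few highly cited papers. An h-index style summary is one common complement. The sketch below is a minimal implementation on a small hand-made list, and `h_index` is a hypothetical helper, not part of this repository.

```python
import numpy as np


def h_index(citations):
    """Largest h such that at least h publications have >= h citations each."""
    # Sort citation counts from highest to lowest
    counts = np.sort(np.asarray(citations, dtype=int))[::-1]
    ranks = np.arange(1, len(counts) + 1)
    # With counts descending, counts >= ranks is True for exactly the first h items
    return int(np.sum(counts >= ranks))


# Hand-made example: four papers have at least 4 citations each
example_citations = [10, 8, 5, 4, 3, 0]
print(h_index(example_citations))  # prints 4
```

The same function could be applied to `citation_data` from the cell above, since missing values have already been dropped there.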

## 5. Topic Analysis

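Let's look at how often common topics appear in the publications using a simple frequency chart. The frequencies below are randomly generated placeholders, so the chart illustrates the plotting step rather than results from the dataset.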

In [None]:
# Sample topics related to construction and AI
topics = [
    'Building Information Modeling', 'Digital Twin', 'Machine Learning',
    'Natural Language Processing', 'Construction Management', 'Automation',
    'Deep Learning', 'Smart Buildings', 'Construction Safety', 'Robotics',
    'IoT', 'Knowledge Management', 'Computer Vision', 'Sustainable Construction',
    'Augmented Reality', 'Virtual Reality', 'Data Analytics'
]

# Generate sample frequency data
np.random.seed(42)
frequencies = np.random.randint(10, 100, size=len(topics))
topic_df = pd.DataFrame({'Topic': topics, 'Frequency': frequencies})
topic_df = topic_df.sort_values('Frequency', ascending=False)

# Plot topic frequencies
plt.figure(figsize=(14, 8))
sns.barplot(x='Frequency', y='Topic', data=topic_df, palette='viridis')
plt.title('Frequency of Topics in Construction AI Research', fontsize=16)
plt.xlabel('Number of Publications', fontsize=14)
plt.ylabel('Topic', fontsize=14)
plt.tight_layout()
plt.show()
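
To replace sample frequencies with counts from real data, keywords can be tallied per record. The sketch below assumes the export has an `Author Keywords` column with entries separated by semicolons. That column name and separator are assumptions to check against the actual file, and `count_keywords` is a hypothetical helper.

```python
from collections import Counter

import pandas as pd


def count_keywords(frame, column="Author Keywords", sep=";"):
    """Count how many records mention each keyword (case-insensitive)."""
    counter = Counter()
    for cell in frame[column].dropna():
        # Normalise case and whitespace; the set counts a keyword once per record
        keywords = {kw.strip().lower() for kw in str(cell).split(sep) if kw.strip()}
        counter.update(keywords)
    return pd.DataFrame(counter.most_common(), columns=["Topic", "Frequency"])


# Small illustrative table; the column name is an assumption about the export
example = pd.DataFrame({"Author Keywords": [
    "Digital Twin; Machine Learning",
    "machine learning; Robotics",
    None,
    "Digital twin; BIM; Machine Learning",
]})
keyword_df = count_keywords(example)
print(keyword_df)
```

The resulting frame has the same `Topic` and `Frequency` columns as `topic_df`, so it could be passed to the same bar plot.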

## 6. Conclusion

In this notebook, we've demonstrated:
- Loading and exploring Scopus research data
- Visualizing publication trends
- Comparing different LLMs on research tasks
- Analyzing citation impact
- Examining topic frequency

These tools can be used to conduct in-depth analyses of research in the construction technology domain and evaluate how different LLMs perform on related tasks.