# Data Analyst Agent - Interactive Notebook

This notebook provides an interactive environment for the Data Analyst Agent.
You can upload files, process them, and ask intelligent questions using AI.

## Features:
- Multi-modal file processing (CSV, Excel, PDF, Images, Text)
- AI-powered data analysis using Together AI's LLaMA model
- Exploratory Data Analysis (EDA)
- Conversation logging
- Interactive visualizations

## Requirements:
- Python 3.8+
- Together AI API key
- Required packages (see requirements.txt)


## 1. Setup and Imports

In [None]:
# Core imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Environment and logging
from dotenv import load_dotenv
import logging

# Add src directory to path
src_path = Path("../src").resolve()
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

# Import our modules
from data_processor import DataProcessor
from ai_agent import AIAgent
from utils import setup_logging, validate_api_key

# Load environment variables
load_dotenv(dotenv_path="../.env")

# Setup
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

# Setup logging
setup_logging()
logger = logging.getLogger(__name__)

print("✅ Setup complete!")

## 2. API Configuration

In [None]:
# Get API key from environment, or prompt for it without echoing it
from getpass import getpass

api_key = os.getenv("TOGETHER_API_KEY")

if not api_key or not validate_api_key(api_key):
    print("⚠️ API key not found or invalid in environment variables.")
    # getpass keeps the typed key out of the saved notebook output
    api_key = getpass("Please enter your Together AI API key: ").strip()
    os.environ["TOGETHER_API_KEY"] = api_key

if validate_api_key(api_key):
    print("✅ API key validated successfully!")
else:
    print("❌ Invalid API key. Please check and try again.")

## 3. Initialize Components

In [None]:
try:
    # Initialize data processor
    data_processor = DataProcessor()
    print("✅ Data processor initialized")
    
    # Initialize AI agent
    ai_agent = AIAgent(
        api_key=api_key,
        model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
        max_tokens=500,
        temperature=0.7
    )
    print("✅ AI agent initialized and connected")
    
except Exception as e:
    print(f"❌ Error initializing components: {str(e)}")
    logger.error(f"Component initialization failed: {str(e)}")

## 4. File Processing Functions

In [None]:
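def load_and_process_file(file_path: str):
    """
    Load and process a file from the local filesystem.

    Returns a (processed_data, file_info) tuple, or (None, None) on failure.
    """
    if not os.path.exists(file_path):
        print(f"❌ File not found: {file_path}")
        # Callers unpack two values, so return a pair here as well
        return None, None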
    
    print(f"📁 Processing file: {file_path}")
    
    # Create a mock uploaded file object
    class MockUploadedFile:
        def __init__(self, file_path):
            self.name = os.path.basename(file_path)
            self.size = os.path.getsize(file_path)
            self._file_path = file_path
            self._content = None
        
        def read(self):
            if self._content is None:
                with open(self._file_path, 'rb') as f:
                    self._content = f.read()
            return self._content
        
        def seek(self, position):
            # No-op: read() always returns the full cached content,
            # so there is no read position to reset
            pass
    
    try:
        mock_file = MockUploadedFile(file_path)
        processed_data = data_processor.process_file(mock_file)
        
        print("✅ File processed successfully!")
        
        # Display basic info
        if isinstance(processed_data, pd.DataFrame):
            print(f"📊 Dataset shape: {processed_data.shape}")
            print(f"📋 Columns: {list(processed_data.columns)}")
        elif isinstance(processed_data, str):
            print(f"📝 Text length: {len(processed_data)} characters")
        
        return processed_data, {
            'name': mock_file.name,
            # Path.suffix yields '' for names without an extension
            'type': Path(file_path).suffix.lstrip('.').upper(),
            'size': mock_file.size
        }
        
    except Exception as e:
        print(f"❌ Error processing file: {str(e)}")
        logger.error(f"File processing error: {str(e)}")
        return None, None

def display_data_info(data, file_info=None):
    """
    Display information about the processed data
    """
    if isinstance(data, pd.DataFrame):
        print("\n📊 Data Overview:")
        print(f"Shape: {data.shape}")
        # deep=True also counts the contents of string/object columns
        print(f"Memory usage: {data.memory_usage(deep=True).sum() / 1024:.1f} KB")
        print("\nColumn types:")
        print(data.dtypes)
        
        print("\n📋 First 5 rows:")
        display(data.head())
        
        # Check for missing values
        missing = data.isnull().sum()
        if missing.sum() > 0:
            print("\n⚠️ Missing values:")
            print(missing[missing > 0])
        
        # Basic statistics for numeric columns
        numeric_cols = data.select_dtypes(include=['number']).columns
        if len(numeric_cols) > 0:
            print("\n📈 Basic statistics:")
            display(data[numeric_cols].describe())
    
    elif isinstance(data, str):
        print(f"\n📝 Text data: {len(data)} characters")
        print("\nPreview:")
        print(data[:500] + "..." if len(data) > 500 else data)

print("✅ File processing functions ready!")
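
`MockUploadedFile` caches the file's bytes and its `seek()` does nothing, so every `read()` call returns the whole file. If a parser expects real file semantics (a read position that `seek(0)` rewinds), `io.BytesIO` is one alternative. The sketch below is only an illustration: `open_as_upload` is a hypothetical helper, and it assumes that `DataProcessor.process_file` needs nothing beyond `.name`, `.size`, `.read()` and `.seek()`, which this notebook does not confirm.

```python
import io
import os
import tempfile


def open_as_upload(file_path: str) -> io.BytesIO:
    """Read a file into memory and return a seekable, file-like object.

    The .name and .size attributes are set by hand to mimic an uploaded-file
    object; they are not attributes that io.BytesIO provides on its own.
    """
    with open(file_path, "rb") as f:
        buffer = io.BytesIO(f.read())
    buffer.name = os.path.basename(file_path)
    buffer.size = os.path.getsize(file_path)
    return buffer


# Demonstration with a throwaway temporary file
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n")
    tmp_path = tmp.name

upload = open_as_upload(tmp_path)
first_read = upload.read()
upload.seek(0)  # rewind so the next read starts from the beginning again
second_read = upload.read()
os.remove(tmp_path)
```

Without the `seek(0)` call, the second `read()` on a `BytesIO` object would return empty bytes, which is the behaviour most parsers are written against.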

## 5. Load Sample Data

Let's load some sample data to demonstrate the capabilities.

In [None]:
# List available sample files
sample_data_dir = Path("../data/samples")
sample_files = []

if sample_data_dir.exists():
    # Sort so that "the first sample file" is the same on every run
    for file_path in sorted(sample_data_dir.glob("*")):
        if file_path.is_file():
            sample_files.append(str(file_path))

if sample_files:
    print("📁 Available sample files:")
    for i, file_path in enumerate(sample_files):
        print(f"{i+1}. {os.path.basename(file_path)}")
else:
    print("📁 No sample files found. You can add files to ../data/samples/")
    
    # Create a sample dataset
    print("\n📊 Creating a sample dataset...")
    np.random.seed(42)
    sample_df = pd.DataFrame({
        'date': pd.date_range('2023-01-01', periods=100),
        'sales': np.random.normal(1000, 200, 100),
        'category': np.random.choice(['A', 'B', 'C'], 100),
        'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
        'customer_satisfaction': np.random.uniform(3, 5, 100)
    })
    
    # Save sample data
    os.makedirs(sample_data_dir, exist_ok=True)
    sample_file_path = sample_data_dir / "sample_sales_data.csv"
    sample_df.to_csv(sample_file_path, index=False)
    print(f"✅ Sample data created: {sample_file_path}")
    
    # Update sample files list
    sample_files = [str(sample_file_path)]

In [None]:
# Process the first sample file
if sample_files:
    selected_file = sample_files[0]
    print(f"🔄 Processing: {os.path.basename(selected_file)}")
    
    current_data, current_file_info = load_and_process_file(selected_file)
    
    if current_data is not None:
        display_data_info(current_data, current_file_info)
else:
    print("❌ No files to process")
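
The generated sample dataset has a `date` column, and CSV files store dates as plain text. A minimal sketch of the round trip follows, assuming the file is read with a plain `pd.read_csv` call (how `DataProcessor` actually reads CSV files is not shown in this notebook). It compares the default behaviour with `parse_dates`.

```python
import io

import pandas as pd

# A small frame shaped like the sample data (values are made up)
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=3),
    "sales": [100.0, 120.5, 98.0],
})

csv_text = df.to_csv(index=False)

# Default read: the date column comes back as text
plain = pd.read_csv(io.StringIO(csv_text))

# Explicit parsing: the date column is restored as datetime64
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(plain["date"].dtype)
print(parsed["date"].dtype)
```

If the loaded data shows `date` as a text column, converting it with `pd.to_datetime` before any time-based analysis is one way to restore it.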

## 6. AI-Powered Analysis

Now let's use the AI agent to analyze our data.

In [None]:
def ask_ai_question(question: str, data=None, file_info=None):
    """
    Ask the AI agent a question about the data
    """
    if data is None:
        data = current_data
    if file_info is None:
        file_info = current_file_info
    
    if data is None:
        print("❌ No data loaded. Please process a file first.")
        return None
    
    try:
        print(f"🤔 Analyzing: {question}")
        print("⏳ Please wait...")
        
        response = ai_agent.analyze_data(data, question, file_info)
        
        print("\n🤖 AI Response:")
        print("=" * 60)
        print(response)
        print("=" * 60)
        
        return response
        
    except Exception as e:
        print(f"❌ Error getting AI response: {str(e)}")
        logger.error(f"AI analysis error: {str(e)}")
        return None

def generate_summary_report(data=None, file_info=None):
    """
    Generate a comprehensive summary report
    """
    if data is None:
        data = current_data
    if file_info is None:
        file_info = current_file_info
    
    return ai_agent.generate_summary_report(data, file_info)

def suggest_analysis_questions(data=None, file_info=None):
    """
    Get AI suggestions for analysis questions
    """
    if data is None:
        data = current_data
    if file_info is None:
        file_info = current_file_info
    
    return ai_agent.suggest_questions(data, file_info)

print("✅ AI analysis functions ready!")
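
How `AIAgent.analyze_data` turns a DataFrame into prompt text is not shown in this notebook. As a rough illustration only, the sketch below shows one way a DataFrame could be condensed into a short text block before being sent to a language model. `build_data_context` and its `max_rows` parameter are hypothetical names, not part of the `ai_agent` module.

```python
import pandas as pd


def build_data_context(df: pd.DataFrame, max_rows: int = 5) -> str:
    """Condense a DataFrame into a short text block (hypothetical helper)."""
    lines = [
        f"Rows: {len(df)}, Columns: {len(df.columns)}",
        "Column types:",
    ]
    # One line per column with its dtype
    lines += [f"- {name}: {dtype}" for name, dtype in df.dtypes.items()]
    # A few leading rows as CSV give the model concrete values to refer to
    lines.append(f"First {min(max_rows, len(df))} rows (CSV):")
    lines.append(df.head(max_rows).to_csv(index=False).strip())
    return "\n".join(lines)


df = pd.DataFrame({"sales": [100.0, 120.5], "region": ["North", "South"]})
context = build_data_context(df)
print(context)
```

Keeping the context small matters when the response budget is tight, as with the `max_tokens=500` setting used earlier; sending an entire dataset as text is rarely practical.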

In [None]:
# Example AI questions
if 'current_data' in locals() and current_data is not None:
    
    # Generate suggested questions
    print("🔍 Getting suggested analysis questions...")
    suggestions = suggest_analysis_questions()
    if suggestions:
        print("\n💡 Suggested Questions:")
        print("=" * 60)
        print(suggestions)
        print("=" * 60)
else:
    print("⚠️ Please load data first before running AI analysis")

In [None]:
# Interactive analysis - ask your own questions
if 'current_data' in locals() and current_data is not None:
    
    # Example questions you can ask:
    example_questions = [
        "What are the main trends in this dataset?",
        "Which category performs the best?",
        "Are there any notable patterns or anomalies?",
        "What insights can help improve business performance?",
        "How does customer satisfaction vary across regions?"
    ]
    
    print("💭 Example questions you can ask:")
    for i, q in enumerate(example_questions, 1):
        print(f"{i}. {q}")
    
    print("\n🤖 Ask a question about your data:")
    user_question = input("Your question: ")
    
    if user_question.strip():
        response = ask_ai_question(user_question)
    else:
        print("No question provided.")
        
else:
    print("⚠️ Please load data first before running AI analysis")
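
Answers to questions such as "Which category performs the best?" can be cross-checked directly with pandas. The sketch below rebuilds a dataset with the same columns as the generated sample (the random values will differ from the saved file) and computes the aggregates that the AI response should agree with.

```python
import numpy as np
import pandas as pd

# Synthetic data with the same columns as the sample dataset
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "sales": rng.normal(1000, 200, 100),
    "category": rng.choice(["A", "B", "C"], 100),
    "region": rng.choice(["North", "South", "East", "West"], 100),
    "customer_satisfaction": rng.uniform(3, 5, 100),
})

# Mean sales and row count per category, best-performing first
sales_by_category = (
    df.groupby("category")["sales"]
      .agg(["mean", "count"])
      .sort_values("mean", ascending=False)
)
best_category = sales_by_category.index[0]

# Average satisfaction per region
satisfaction_by_region = df.groupby("region")["customer_satisfaction"].mean()

print(sales_by_category)
print(satisfaction_by_region.round(2))
```

With `current_data` loaded, replacing `df` with `current_data` gives figures to compare against what the model reports.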

## 7. Exploratory Data Analysis (EDA)

In [None]:
def perform_eda(data):
    """
    Perform comprehensive EDA on the dataset
    """
    if not isinstance(data, pd.DataFrame):
        print("❌ EDA can only be performed on structured data (CSV/Excel)")
        return
    
    print("📊 Performing Exploratory Data Analysis...")
    
    # Basic info
    print(f"\nDataset shape: {data.shape}")
    print(f"Memory usage: {data.memory_usage(deep=True).sum() / 1024:.1f} KB")
    
    # Data types
    print("\n📋 Data Types:")
    print(data.dtypes.value_counts())
    
    # Missing values
    missing = data.isnull().sum()
    if missing.sum() > 0:
        print("\n⚠️ Missing Values:")
        missing_pct = (missing / len(data)) * 100
        missing_df = pd.DataFrame({
            'Count': missing[missing > 0],
            'Percentage': missing_pct[missing > 0]
        })
        display(missing_df)
    else:
        print("\n✅ No missing values found")
    
    # Numerical columns analysis
    numeric_cols = data.select_dtypes(include=['number']).columns
    if len(numeric_cols) > 0:
        print(f"\n📈 Numerical Columns ({len(numeric_cols)}):")
        display(data[numeric_cols].describe())
        
        # Visualizations
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        fig.suptitle('Numerical Data Analysis', fontsize=16)
        
        # Histograms
        if len(numeric_cols) >= 1:
            col = numeric_cols[0]
            axes[0, 0].hist(data[col].dropna(), bins=30, alpha=0.7, edgecolor='black')
            axes[0, 0].set_title(f'Distribution of {col}')
            axes[0, 0].set_xlabel(col)
            axes[0, 0].set_ylabel('Frequency')
        
        # Box plot of the second numeric column
        if len(numeric_cols) >= 2:
            col = numeric_cols[1]
            axes[0, 1].boxplot(data[col].dropna())
            axes[0, 1].set_title(f'Box Plot of {col}')
            axes[0, 1].set_ylabel(col)
        
        # Correlation heatmap
        if len(numeric_cols) > 1:
            corr = data[numeric_cols].corr()
            im = axes[1, 0].imshow(corr, cmap='coolwarm', aspect='auto')
            axes[1, 0].set_title('Correlation Heatmap')
            axes[1, 0].set_xticks(range(len(corr.columns)))
            axes[1, 0].set_yticks(range(len(corr.columns)))
            axes[1, 0].set_xticklabels(corr.columns, rotation=45)
            axes[1, 0].set_yticklabels(corr.columns)
            
            # Add correlation values
            for i in range(len(corr.columns)):
                for j in range(len(corr.columns)):
                    axes[1, 0].text(j, i, f'{corr.iloc[i, j]:.2f}', 
                                   ha='center', va='center', fontsize=8)
        
        # Scatter plot (if we have at least 2 numeric columns)
        if len(numeric_cols) >= 2:
            col1, col2 = numeric_cols[0], numeric_cols[1]
            axes[1, 1].scatter(data[col1], data[col2], alpha=0.6)
            axes[1, 1].set_xlabel(col1)
            axes[1, 1].set_ylabel(col2)
            axes[1, 1].set_title(f'{col1} vs {col2}')
        
        plt.tight_layout()
        plt.show()
    
    # Categorical columns analysis
    categorical_cols = data.select_dtypes(include=['object', 'category']).columns
    if len(categorical_cols) > 0:
        print(f"\n📊 Categorical Columns ({len(categorical_cols)}):")
        
        for col in categorical_cols[:3]:  # Show first 3 categorical columns
            print(f"\n{col} - Unique values: {data[col].nunique()}")
            value_counts = data[col].value_counts().head(10)
            print(value_counts)
            
            # Visualization
            plt.figure(figsize=(10, 6))
            value_counts.plot(kind='bar')
            plt.title(f'Distribution of {col}')
            plt.xlabel(col)
            plt.ylabel('Count')
            plt.xticks(rotation=45)
            plt.tight_layout()
            plt.show()

# Perform EDA on current data
if 'current_data' in locals() and current_data is not None:
    perform_eda(current_data)
else:
    print("⚠️ Please load data first before performing EDA")
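
The box plot in `perform_eda` shows outliers visually. To list them as values, the same 1.5 × IQR rule that Matplotlib uses by default for box plot whiskers can be applied directly. This is a minimal sketch; `iqr_outliers` is a hypothetical helper and not part of the project's modules.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up values with one obvious outlier and one missing entry
sample = pd.Series([10, 12, 11, 13, 12, 11, 95, 12, 10, None])
outliers = iqr_outliers(sample)
print(outliers.tolist())  # → [95.0]
```

For the sample data, `iqr_outliers(current_data["sales"])` would return the rows whose sales fall outside the whisker range.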

## 8. Export and Save Results

In [None]:
def save_analysis_results(analysis_history, filename=None):
    """
    Save analysis results to a file
    """
    from datetime import datetime
    import json
    
    if not filename:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"analysis_results_{timestamp}.json"
    
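    # Create outputs directory (including any missing parent directories)
    output_dir = Path("../data/outputs")
    output_dir.mkdir(parents=True, exist_ok=True)
    
    filepath = output_dir / filename
    
    # current_data and current_file_info are notebook-level variables, so look
    # them up in globals(); locals() here only sees this function's own names
    file_info = globals().get('current_file_info')
    data = globals().get('current_data')
    
    # Prepare data for export
    export_data = {
        'timestamp': datetime.now().isoformat(),
        'file_info': file_info,
        'analysis_history': analysis_history,
        'data_summary': data_processor.get_data_summary(data) if data is not None else None
    }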
    
    try:
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(export_data, f, indent=2, ensure_ascii=False)
        
        print(f"✅ Analysis results saved to: {filepath}")
        return str(filepath)
        
    except Exception as e:
        print(f"❌ Error saving results: {str(e)}")
        return None

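def create_analysis_report():
    """
    Create a comprehensive analysis report
    """
    # Needed for the report timestamp below
    from datetime import datetime
    
    # current_data lives at notebook level, so check globals() rather than locals()
    if globals().get('current_data') is None:
        print("❌ No data loaded for report generation")
        return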
    
    print("📝 Generating comprehensive analysis report...")
    
    # Generate comprehensive summary
    summary = generate_summary_report()
    
    if summary:
        print("\n📋 Comprehensive Data Analysis Report")
        print("=" * 80)
        print(summary)
        print("=" * 80)
        
        # Save the report
        report_data = [{
            'type': 'comprehensive_summary',
            'question': 'Generate comprehensive analysis report',
            'response': summary,
            'timestamp': datetime.now().isoformat()
        }]
        
        save_analysis_results(report_data, "comprehensive_report.json")

print("✅ Export functions ready!")

In [None]:
# Generate comprehensive report
if 'current_data' in locals() and current_data is not None:
    create_analysis_report()
else:
    print("⚠️ Please load data first before generating a report")
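
`json.dump` raises `TypeError` when it meets a value it cannot serialize, such as a `datetime` or a `Path`. Whether the output of `data_processor.get_data_summary` contains such values is not known from this notebook, so the following is only a sketch of one possible safeguard: passing `default=str` so that unknown objects are written as their string form.

```python
import json
from datetime import datetime
from pathlib import Path

# Example payload with values the json module cannot encode by itself
export_data = {
    "timestamp": datetime(2023, 1, 1, 12, 0, 0),
    "source": Path("data") / "samples" / "sample_sales_data.csv",
    "rows": 100,
}

# Without a fallback, serialization fails
try:
    json.dumps(export_data)
    strict_ok = True
except TypeError:
    strict_ok = False

# With default=str, unsupported objects are converted via str()
payload = json.dumps(export_data, indent=2, default=str)
restored = json.loads(payload)
print(restored["timestamp"])  # → 2023-01-01 12:00:00
```

The trade-off is that the original types are lost: everything converted this way comes back as a plain string when the file is loaded again.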

## 9. Summary and Next Steps

This notebook demonstrated the capabilities of the Data Analyst Agent:

### What we accomplished:
- ✅ Loaded and processed data files
- ✅ Performed exploratory data analysis
- ✅ Used AI to generate insights and answer questions
- ✅ Created visualizations and reports
- ✅ Saved analysis results

### Next Steps:
1. **Upload your own data**: Place files in `../data/samples/` and process them
2. **Ask specific questions**: Use the AI agent to get detailed insights
3. **Explore visualizations**: Create custom plots for your data
4. **Export results**: Save your analysis for future reference
5. **Use the Streamlit app**: Try the web interface for a more interactive experience

### Tips for better analysis:
- Be specific in your questions to the AI
- Combine AI insights with statistical analysis
- Validate AI responses with domain knowledge
- Use visualizations to verify trends and patterns

Happy analyzing! 🚀