# Week 2 Interactive Workshop: Advanced Data Processing & LangChain Fundamentals

Welcome to Week 2! This interactive workshop pairs advanced data preprocessing with LangChain fundamentals so you can build hybrid ML + LLM pipelines.

## 🎯 Workshop Objectives

By the end of this session, you'll be able to:
- Master advanced Metaflow preprocessing patterns
- Understand and use LangChain Expression Language (LCEL)
- Set up and work with local LLMs using Ollama
- Build hybrid workflows combining traditional ML with LLM capabilities
- Process text data with sophisticated NLP techniques

## ⏰ Workshop Timeline (90 minutes)

### Part 1: Advanced Data Preprocessing (45 minutes)
1. **Missing Data Strategies** (10 min) - Advanced imputation techniques
2. **Feature Engineering** (15 min) - Creating predictive features
3. **Scaling and Validation** (10 min) - Pipeline robustness
4. **Text Data Handling** (10 min) - NLP preprocessing fundamentals

### Part 2: LangChain Introduction (30 minutes)
1. **Installation and Setup** (5 min) - LangChain and Ollama
2. **First LCEL Chain** (10 min) - prompt | model | output_parser
3. **Local LLM Integration** (10 min) - Working with Ollama models
4. **Chain Composition** (5 min) - Building complex workflows

### Part 3: Integration Workshop (15 minutes)
1. **Hybrid Pipelines** (10 min) - Combining Metaflow + LangChain
2. **Text Analysis** (5 min) - LLM-powered data insights

## 📋 Prerequisites Check

Let's verify your environment is ready for the workshop!

In [None]:
import subprocess
import sys
import os

def check_environment():
    """Comprehensive environment verification for Week 2 workshop"""
    print("🔍 WEEK 2 ENVIRONMENT CHECK")
    print("=" * 40)
    
    # Check Python packages
    required_packages = {
        'metaflow': ['metaflow', '2.7+'],
        'pandas': ['pandas', '1.3+'],
        'numpy': ['numpy', '1.20+'],
        'sklearn': ['scikit-learn', '1.0+'],
        'langchain': ['langchain', '0.1+'],
        'langchain_community': ['langchain-community', '0.0.10+'],
        'matplotlib': ['matplotlib', '3.3+'],
        'seaborn': ['seaborn', '0.11+']
    }
    
    print("📦 Checking Python packages...")
    missing_packages = []
    
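    for lib, package in required_packages.items():
        try:
            __import__(lib)
            # Only importability is checked here; the installed version is not verified
            print(f"   ✅ {package[0]} (need {package[1]}) - installed")
        except ImportError:
            print(f"   ❌ {package[0]} {package[1]} - MISSING")
            # Store the bare package name so the pip command printed below is valid
            missing_packages.append(package[0])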
    
    if missing_packages:
        print(f"\n⚠️  Install missing packages: pip install {' '.join(missing_packages)}")
        return False
    
    # Check data files
    print("\n📊 Checking data files...")
    data_files = [
        '../data/titanic.csv',
        '../data/customer_reviews.csv',
        '../data/financial_data.json'
    ]
    
    for file_path in data_files:
        if os.path.exists(file_path):
            print(f"   ✅ {os.path.basename(file_path)}")
        else:
            print(f"   ⚠️  {os.path.basename(file_path)} - Not found (will use sample data)")
    
    # Check Ollama
    print("\n🧠 Checking Ollama installation...")
    ollama_ok = False
    try:
        result = subprocess.run(['ollama', '--version'], 
                              capture_output=True, text=True, timeout=5)
        if result.returncode == 0:
            ollama_ok = True
            print("   ✅ Ollama installed")
            print("   💡 Download model with: ollama pull llama3.2")
        else:
            print("   ❌ Ollama command failed")
    except (subprocess.TimeoutExpired, FileNotFoundError):
        print("   ❌ Ollama not found - install from ollama.com")
        print("      This is required for LLM exercises")
    
    print("\n🎯 Environment check complete!")
    if ollama_ok:
        print("   Ready to start the workshop!")
    else:
        print("   Part 1 can run now; Parts 2 and 3 need a working Ollama install")
    return True

check_environment()

## Part 1: Advanced Data Preprocessing with Metaflow (45 minutes)

Let's start by setting up our imports and loading sample data for preprocessing exercises.

In [None]:
# Complete imports for advanced data preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder
)
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from metaflow import FlowSpec, step, Parameter, IncludeFile, catch
import re
import string
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("viridis")
plt.rcParams['figure.figsize'] = (12, 8)

print("📊 Data preprocessing environment ready!")
print("🎯 Let's build advanced preprocessing pipelines!")

### 1.1 Loading and Exploring Our Datasets

We'll work with three datasets to practice different preprocessing techniques:

In [None]:
# Load datasets (with fallback to sample data)
def load_workshop_data():
    """Load datasets for preprocessing workshop"""
    
    print("📥 Loading workshop datasets...")
    
    # Dataset 1: Titanic (structured data with missing values)
    try:
        titanic = pd.read_csv('../data/titanic.csv')
        print("   ✅ Loaded Titanic dataset")
    except FileNotFoundError:
        # Create sample titanic-like data
        np.random.seed(42)
        n_samples = 800
        titanic = pd.DataFrame({
            'PassengerId': range(1, n_samples + 1),
            'Survived': np.random.choice([0, 1], n_samples, p=[0.6, 0.4]),
            'Pclass': np.random.choice([1, 2, 3], n_samples, p=[0.2, 0.3, 0.5]),
            'Name': [f'Passenger {i}' for i in range(1, n_samples + 1)],
            'Sex': np.random.choice(['male', 'female'], n_samples, p=[0.65, 0.35]),
            'Age': np.random.normal(30, 12, n_samples).clip(1, 80),  # clip so sampled ages stay plausible
            'SibSp': np.random.poisson(0.5, n_samples),
            'Parch': np.random.poisson(0.3, n_samples),
            'Ticket': [f'TICKET{i}' for i in range(1, n_samples + 1)],
            'Fare': np.random.lognormal(3, 1, n_samples),
            'Cabin': [f'C{i}' if np.random.random() > 0.7 else None for i in range(n_samples)],
            'Embarked': np.random.choice(['C', 'Q', 'S'], n_samples, p=[0.2, 0.1, 0.7])
        })
        # Introduce missing values
        titanic.loc[np.random.choice(titanic.index, 150, replace=False), 'Age'] = np.nan
        titanic.loc[np.random.choice(titanic.index, 50, replace=False), 'Embarked'] = np.nan
        print("   ✅ Created sample Titanic-like dataset")
    
    # Dataset 2: Customer Reviews (text data)
    try:
        reviews = pd.read_csv('../data/customer_reviews.csv')
        print("   ✅ Loaded customer reviews dataset")
    except FileNotFoundError:
        # Create sample review data
        sample_reviews = [
            "Great product! Highly recommend to everyone.",
            "Terrible quality. Broke after one day.",
            "Average product, nothing special but works fine.",
            "Amazing customer service and fast delivery!",
            "Not worth the money. Poor build quality.",
            "Excellent value for money. Very satisfied!",
            "Disappointed with the purchase. Returns policy unclear.",
            "Perfect for my needs. Would buy again."
        ]
        
        np.random.seed(42)
        n_reviews = 500
        reviews = pd.DataFrame({
            'review_id': range(1, n_reviews + 1),
            'review_text': np.random.choice(sample_reviews, n_reviews),
            'rating': np.random.choice([1, 2, 3, 4, 5], n_reviews, p=[0.1, 0.1, 0.2, 0.3, 0.3]),
            'product_category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books'], n_reviews)
        })
        print("   ✅ Created sample customer reviews dataset")
    
    return titanic, reviews

# Load the data
titanic_df, reviews_df = load_workshop_data()

print(f"\n📊 Dataset Overview:")
print(f"   🚢 Titanic: {titanic_df.shape[0]} rows, {titanic_df.shape[1]} columns")
print(f"   💬 Reviews: {reviews_df.shape[0]} rows, {reviews_df.shape[1]} columns")

# Quick preview
print("\n🔍 Titanic Preview:")
display(titanic_df.head())

print("\n🔍 Reviews Preview:")
display(reviews_df.head())

### 🎯 Exercise 1.1: Missing Data Analysis (10 minutes)

**Your Task**: Analyze and handle missing data in the Titanic dataset using advanced imputation techniques.

**Instructions**:
1. Identify columns with missing values and their percentages
2. Visualize missing data patterns
3. Implement different imputation strategies:
   - Simple imputation (mean, median, mode)
   - KNN imputation
   - Custom domain-specific imputation
4. Compare the results and choose the best strategy
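
As a starting point for step 3, here is a minimal sketch that compares the three strategies on a tiny made-up frame. The `toy` data, the choice of `n_neighbors=2`, and the "median age within passenger class" rule are illustrative assumptions, not the workshop's reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Tiny illustrative frame (hypothetical values, not the workshop data)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, np.nan, 44.0, 20.0, 22.0, np.nan],
    "Fare":   [80.0, 90.0, 85.0, 8.0, 7.5, 9.0],
})

# Strategy 1: simple imputation with the global median of Age
median_age = SimpleImputer(strategy="median").fit_transform(toy[["Age"]]).ravel()

# Strategy 2: KNN imputation, using Fare to find similar passengers
knn_age = KNNImputer(n_neighbors=2).fit_transform(toy[["Age", "Fare"]])[:, 0]

# Strategy 3: domain-specific rule (assumed here): median Age within each Pclass
group_age = toy["Age"].fillna(toy.groupby("Pclass")["Age"].transform("median"))

comparison = pd.DataFrame({
    "original": toy["Age"],
    "median": median_age,
    "knn": knn_age,
    "by_class": group_age,
})
print(comparison)
```

On this toy frame the global median pulls both missing ages toward the middle, while the KNN and per-class variants keep them close to similar passengers. On real data, compare the imputed distributions and downstream model scores before choosing.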

In [None]:
# 🎯 EXERCISE 1.1: Your solution here!
print("💻 Exercise 1.1: Missing Data Analysis")
print("=" * 40)

# Step 1: Identify missing values
missing_info = titanic_df.isnull().sum()
missing_percent = (missing_info / len(titanic_df)) * 100

print("📊 Missing Value Analysis:")
for col in missing_info[missing_info > 0].index:
    print(f"   {col}: {missing_info[col]} ({missing_percent[col]:.1f}%)")

# Step 2: Visualize missing patterns
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
missing_info[missing_info > 0].plot(kind='bar')
plt.title('Missing Values Count')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
# Heatmap of missing values
plt.imshow(titanic_df.isnull(), cmap='viridis', aspect='auto')
plt.title('Missing Values Pattern')
plt.xlabel('Columns')
plt.ylabel('Rows')
plt.tight_layout()
plt.show()

# Your turn: Implement different imputation strategies!
print("\n🔧 Implement your imputation strategies below:")
print("   Hint: Try SimpleImputer and KNNImputer from sklearn")
print("   Consider domain knowledge for better imputation")

### 🎯 Exercise 1.2: Feature Engineering (15 minutes)

**Your Task**: Create meaningful features from the Titanic dataset to improve model performance.

**Instructions**:
1. Extract titles from passenger names
2. Create family size features
3. Bin numerical features (Age, Fare)
4. Create interaction features
5. Evaluate feature importance
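
The first four steps can be sketched with plain pandas. The three-row `toy` frame, the age bin edges, and the `Age_x_Pclass` interaction are assumptions chosen for illustration; adapt them to `titanic_df` as needed.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with Titanic-style columns (hypothetical rows)
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 8.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
})

fe = toy.copy()

# 1. Title: the word that precedes a period in the name
fe["Title"] = fe["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# 2. Family size and an "is alone" flag
fe["FamilySize"] = fe["SibSp"] + fe["Parch"] + 1
fe["IsAlone"] = (fe["FamilySize"] == 1).astype(int)

# 3. Binned age (bin edges are an assumption, not a standard)
fe["AgeGroup"] = pd.cut(fe["Age"], bins=[0, 12, 60, np.inf],
                        labels=["Child", "Adult", "Senior"])

# 4. Ratio and interaction features
fe["FarePerPerson"] = fe["Fare"] / fe["FamilySize"]
fe["Age_x_Pclass"] = fe["Age"] * fe["Pclass"]

# Deck: first letter of the cabin, with a placeholder when Cabin is missing
fe["Deck"] = fe["Cabin"].str[0].fillna("Unknown")

print(fe[["Title", "FamilySize", "IsAlone", "AgeGroup", "FarePerPerson", "Deck"]])
```

For step 5, one option is to fit a `RandomForestClassifier` on the encoded features and inspect its `feature_importances_` attribute.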

In [None]:
# 🎯 EXERCISE 1.2: Your solution here!
print("💻 Exercise 1.2: Feature Engineering")
print("=" * 40)

# Create a copy for feature engineering
titanic_fe = titanic_df.copy()

# Example: Extract titles from names
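def extract_title(name):
    """Extract the title (e.g. 'Mr', 'Mrs') from a passenger name"""
    # Raw string avoids the invalid-escape warning for '\.'; str() guards against non-string names
    title = re.search(r' ([A-Za-z]+)\.', str(name))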
    if title:
        return title.group(1)
    return 'Unknown'

titanic_fe['Title'] = titanic_fe['Name'].apply(extract_title)
print(f"📝 Extracted titles: {titanic_fe['Title'].value_counts().head()}")

# Your turn: Create more features!
print("\n🔧 Create additional features:")
print("   1. Family size (SibSp + Parch + 1)")
print("   2. Is alone (family size == 1)")
print("   3. Age groups (Child, Adult, Senior)")
print("   4. Fare per person (Fare / family size)")
print("   5. Deck from Cabin (first letter)")

# Add your feature engineering code here!

print("\n📊 Feature engineering complete!")
print(f"   Original columns: {titanic_df.shape[1]}")
print(f"   Columns after feature engineering: {titanic_fe.shape[1]}")

### 🎯 Exercise 1.3: Scaling and Pipeline Validation (10 minutes)

**Your Task**: Build a robust preprocessing pipeline with proper scaling and validation.

**Instructions**:
1. Compare different scaling techniques
2. Handle categorical variables properly
3. Create a complete preprocessing pipeline
4. Validate pipeline robustness with cross-validation
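
One way to cover steps 2 to 4 is scikit-learn's `Pipeline` and `ColumnTransformer`. The sketch below runs on synthetic data; the column names mirror the Titanic set, but the target `y` is generated artificially so that cross-validation has some signal. Because imputation and scaling live inside the pipeline, they are re-fit on each training fold, which keeps validation data from leaking into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic features (assumed, not the real data)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.lognormal(3, 1, n),
    "Sex": rng.choice(["male", "female"], n),
    "Embarked": rng.choice(["C", "Q", "S"], n),
})
X.loc[rng.choice(n, 30, replace=False), "Age"] = np.nan

# Artificial target: mostly determined by Sex, with 10% of labels flipped
y = (X["Sex"] == "female").astype(int).to_numpy()
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - y, y)

# Numeric columns: impute the median, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute the mode, then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", categorical, ["Sex", "Embarked"]),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation; preprocessing is re-fit inside every fold
scores = cross_val_score(clf, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping `StandardScaler` for `MinMaxScaler` or `RobustScaler` in the `numeric` pipeline is a one-line change, which makes the scaler comparison from step 1 easy to repeat under cross-validation.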

In [None]:
# 🎯 EXERCISE 1.3: Your solution here!
print("💻 Exercise 1.3: Scaling and Pipeline Validation")
print("=" * 50)

# Prepare sample data for scaling comparison
numerical_features = ['Age', 'Fare', 'SibSp', 'Parch']
sample_data = titanic_df[numerical_features].dropna()

# Compare scaling techniques
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

plt.figure(figsize=(15, 5))
for i, (name, scaler) in enumerate(scalers.items(), 1):
    scaled_data = scaler.fit_transform(sample_data)
    
    plt.subplot(1, 3, i)
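    plt.boxplot(scaled_data)
    plt.title(f'{name}')
    # Set tick labels explicitly; the boxplot `labels` keyword is deprecated in newer matplotlib
    plt.xticks(range(1, len(numerical_features) + 1), numerical_features, rotation=45)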

plt.tight_layout()
plt.show()

print("\n🔧 Build your complete preprocessing pipeline:")
print("   1. Handle missing values")
print("   2. Encode categorical variables")
print("   3. Scale numerical features")
print("   4. Add feature engineering")
print("   5. Validate with cross-validation")

# Add your pipeline code here!

print("\n✅ Pipeline validation complete!")

### 🎯 Exercise 1.4: Text Data Preprocessing (10 minutes)

**Your Task**: Process the customer reviews dataset using NLP techniques.

**Instructions**:
1. Clean and normalize text data
2. Remove stopwords and special characters
3. Apply stemming/lemmatization
4. Create TF-IDF features
5. Analyze text patterns

In [None]:
# 🎯 EXERCISE 1.4: Your solution here!
print("💻 Exercise 1.4: Text Data Preprocessing")
print("=" * 42)

# Basic text preprocessing function
def clean_text(text):
    """Clean and normalize text data"""
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Apply basic cleaning
reviews_df['clean_text'] = reviews_df['review_text'].apply(clean_text)

print("📝 Text Cleaning Example:")
for i in range(3):
    print(f"   Original: {reviews_df['review_text'].iloc[i]}")
    print(f"   Cleaned:  {reviews_df['clean_text'].iloc[i]}")
    print()

# Your turn: Add more sophisticated preprocessing!
print("🔧 Add advanced text preprocessing:")
print("   1. Remove stopwords")
print("   2. Apply stemming or lemmatization")
print("   3. Create TF-IDF features")
print("   4. Extract sentiment features")
print("   5. Analyze word frequencies")

# Create TF-IDF features
vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
tfidf_features = vectorizer.fit_transform(reviews_df['clean_text'])

print(f"\n📊 TF-IDF Matrix: {tfidf_features.shape}")
print(f"   Features: {len(vectorizer.get_feature_names_out())}")
print(f"   Top features: {vectorizer.get_feature_names_out()[:10]}")

# Add your advanced text processing here!

print("\n✅ Text preprocessing complete!")

## Part 2: LangChain Introduction (30 minutes)

Now let's dive into LangChain and build our first chains using LCEL (LangChain Expression Language)!

In [None]:
# LangChain imports and setup
try:
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_community.llms import Ollama
    from langchain_community.chat_models import ChatOllama
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.messages import HumanMessage, SystemMessage
    
    print("✅ LangChain imports successful!")
    
    # Test Ollama connection
    try:
        llm = ChatOllama(model="llama3.2", temperature=0.1)
        test_response = llm.invoke([HumanMessage(content="Hello! Just testing connection.")])
        print("✅ Ollama connection successful!")
        print(f"   Model: llama3.2")
        print(f"   Test response: {test_response.content[:50]}...")
        
    except Exception as e:
        print(f"⚠️  Ollama connection failed: {e}")
        print("   Make sure Ollama is running and llama3.2 is downloaded")
        print("   Run: ollama pull llama3.2")
        
except ImportError as e:
    print(f"❌ LangChain import failed: {e}")
    print("   Install with: pip install langchain langchain-community")

print("\n🧠 LangChain environment ready!")
print("🎯 Let's build some chains!")

### 🎯 Exercise 2.1: Your First LCEL Chain (10 minutes)

**Your Task**: Build a simple prompt | model | output_parser chain using LCEL syntax.

**Instructions**:
1. Create a chat prompt template
2. Set up the Ollama model
3. Add an output parser
4. Chain them together with LCEL
5. Test with different inputs

In [None]:
# 🎯 EXERCISE 2.1: Your solution here!
print("💻 Exercise 2.1: Building Your First LCEL Chain")
print("=" * 50)

# Step 1: Create a prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful data science assistant. Provide clear, concise explanations."),
    ("human", "{question}")
])

# Step 2: Set up the model
model = ChatOllama(model="llama3.2", temperature=0.1)

# Step 3: Add output parser
output_parser = StrOutputParser()

# Step 4: Create the chain using LCEL
chain = prompt | model | output_parser

print("🔗 Chain created: prompt | model | output_parser")

# Step 5: Test the chain
test_questions = [
    "What is the difference between bias and variance in machine learning?",
    "Explain what cross-validation is in simple terms.",
    "What are the main steps in a data preprocessing pipeline?"
]

print("\n🧪 Testing the chain:")
for i, question in enumerate(test_questions[:1], 1):  # Test first question
    print(f"\n   Question {i}: {question}")
    try:
        response = chain.invoke({"question": question})
        print(f"   Answer: {response}")
    except Exception as e:
        print(f"   Error: {e}")

# Your turn: Try the other questions and create your own!
print("\n🔧 Your turn:")
print("   1. Test the remaining questions")
print("   2. Create your own data science questions")
print("   3. Experiment with different temperature values")
print("   4. Modify the system prompt")

print("\n✅ First LCEL chain complete!")

### 🎯 Exercise 2.2: Advanced Chain Composition (5 minutes)

**Your Task**: Build more complex chains with multiple steps and data processing.

**Instructions**:
1. Create a data analysis chain
2. Add data preprocessing steps
3. Combine statistical analysis with LLM insights
4. Build a chain that processes our customer reviews
5. Create a summary and recommendations chain

In [None]:
# 🎯 EXERCISE 2.2: Your solution here!
print("💻 Exercise 2.2: Advanced Chain Composition")
print("=" * 45)

# Create a data analysis chain
analysis_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a data analyst. Analyze the provided data and give insights."),
    ("human", "Analyze this dataset summary: {data_summary}\n\nProvide key insights and recommendations.")
])

# Create a function to summarize our review data
def summarize_reviews(reviews_df):
    """Create a statistical summary of reviews"""
    summary = {
        'total_reviews': len(reviews_df),
        'avg_rating': reviews_df['rating'].mean(),
        'rating_distribution': reviews_df['rating'].value_counts().to_dict(),
        'categories': reviews_df['product_category'].value_counts().to_dict(),
        'common_words': ' '.join(reviews_df['review_text']).lower().split()
    }
    
    # Get most common words (simple approach)
    from collections import Counter
    word_counts = Counter(summary['common_words'])
    summary['top_words'] = dict(word_counts.most_common(10))
    del summary['common_words']  # Remove the large list
    
    return summary

# Create the analysis chain
analysis_chain = (
    RunnablePassthrough()
    | analysis_prompt
    | model
    | output_parser
)

# Test the chain with our review data
print("📊 Analyzing customer reviews...")
review_summary = summarize_reviews(reviews_df)
print(f"   Summary: {review_summary}")

try:
    insights = analysis_chain.invoke({"data_summary": str(review_summary)})
    print(f"\n🧠 LLM Insights:\n{insights}")
except Exception as e:
    print(f"   Error: {e}")

# Your turn: Build more complex chains!
print("\n🔧 Build your own advanced chains:")
print("   1. Sentiment analysis chain for individual reviews")
print("   2. Product recommendation chain")
print("   3. Multi-step analysis pipeline")
print("   4. Combine numerical and text analysis")

# Add your advanced chain code here!

print("\n✅ Advanced chain composition complete!")

### 🎯 Exercise 2.3: LLM Model Comparison (10 minutes)

**Your Task**: Compare different local LLM models and their outputs.

**Instructions**:
1. Set up multiple Ollama models
2. Create a comparison chain
3. Test the same prompt across models
4. Analyze differences in responses
5. Choose the best model for your use case
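
For steps 2 to 4, a small harness that records length and latency per model can help. The sketch below is model-agnostic: `compare_models` is a hypothetical helper that accepts any callables, and the stub entries stand in for chains such as `ChatOllama(model=...) | StrOutputParser()` (pass `chain.invoke` as the callable).

```python
import time

def compare_models(models, prompt):
    """Run one prompt through several callables; collect size, latency, and errors."""
    rows = []
    for name, generate in models.items():
        start = time.perf_counter()
        try:
            text = generate(prompt)
            error = None
        except Exception as exc:
            # Record the failure instead of aborting the whole comparison
            text, error = "", str(exc)
        rows.append({
            "model": name,
            "chars": len(text),
            "words": len(text.split()),
            "seconds": time.perf_counter() - start,
            "error": error,
        })
    return rows

def broken(prompt):
    """Stub that simulates a model that has not been pulled."""
    raise RuntimeError("model not pulled")

# Hypothetical stand-ins for real model chains
stub_models = {
    "short-stub": lambda p: "Bias is underfitting; variance is overfitting.",
    "long-stub": lambda p: ("Bias is error from overly simple assumptions. "
                            "Variance is error from sensitivity to the training sample."),
    "missing-stub": broken,
}

results = compare_models(stub_models, "Explain the bias-variance tradeoff.")
for row in results:
    print(row)
```

Length and latency are only rough proxies. Judging accuracy still requires reading the responses or scoring them against reference answers.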

In [None]:
# 🎯 EXERCISE 2.3: Your solution here!
print("💻 Exercise 2.3: LLM Model Comparison")
print("=" * 40)

# Available models to test (install with: ollama pull <model>)
available_models = [
    "llama3.2:1b",   # Lightweight
    "llama3.2",      # Standard
    "phi3",          # Alternative
]

# Test prompt
test_prompt = "Explain the bias-variance tradeoff in machine learning in 2-3 sentences."

print(f"🧪 Testing prompt: {test_prompt}")
print("\n📋 Model Comparison:")

# Test each available model
model_responses = {}

for model_name in available_models:
    try:
        print(f"\n🤖 Testing {model_name}...")
        
        # Create model instance
        test_model = ChatOllama(model=model_name, temperature=0.1)
        
        # Simple chain
        simple_chain = test_model | StrOutputParser()
        
        # Get response
        response = simple_chain.invoke([HumanMessage(content=test_prompt)])
        model_responses[model_name] = response
        
        print(f"   ✅ Response: {response}")
        
    except Exception as e:
        print(f"   ❌ Error with {model_name}: {e}")
        print(f"      Try: ollama pull {model_name}")

# Your turn: Analyze the differences!
print("\n🔧 Analysis tasks:")
print("   1. Compare response quality and accuracy")
print("   2. Measure response length and detail")
print("   3. Test with different types of prompts")
print("   4. Consider speed vs. quality tradeoffs")
print("   5. Choose the best model for your use case")

# Simple comparison
if model_responses:
    print("\n📊 Quick Analysis:")
    for model, response in model_responses.items():
        print(f"   {model}: {len(response)} characters")

print("\n✅ Model comparison complete!")

## Part 3: Integration Workshop - Hybrid Pipelines (15 minutes)

Now let's combine everything: Metaflow data processing with LangChain analysis!

### 🎯 Exercise 3.1: Building a Hybrid Pipeline (10 minutes)

**Your Task**: Create a Metaflow pipeline that incorporates LangChain for intelligent data analysis.

**Instructions**:
1. Design a Metaflow flow with data preprocessing
2. Add LangChain analysis steps
3. Combine statistical analysis with LLM insights
4. Generate automated reports
5. Test the complete pipeline

In [None]:
# 🎯 EXERCISE 3.1: Your solution here!
print("💻 Exercise 3.1: Building a Hybrid Pipeline")
print("=" * 45)

# Define a hybrid Metaflow + LangChain pipeline
class HybridAnalysisPipeline(FlowSpec):
    """
    A hybrid pipeline combining Metaflow data processing 
    with LangChain LLM analysis
    """
    
    dataset_type = Parameter('dataset', default='reviews',
                            help='Dataset to analyze: reviews or titanic')
    
    @step
    def start(self):
        """
        Initialize the pipeline and load data
        """
        print("🚀 Starting hybrid analysis pipeline")
        print(f"   Dataset: {self.dataset_type}")
        
        # Load appropriate dataset
        if self.dataset_type == 'reviews':
            self.data = reviews_df.copy()
            print(f"   Loaded {len(self.data)} reviews")
        else:
            self.data = titanic_df.copy()
            print(f"   Loaded {len(self.data)} passenger records")
        
        self.next(self.preprocess_data)
    
    @step
    def preprocess_data(self):
        """
        Apply data preprocessing
        """
        print("🔧 Preprocessing data...")
        
        if self.dataset_type == 'reviews':
            # Text preprocessing
            self.data['clean_text'] = self.data['review_text'].apply(clean_text)
            
            # Create summary statistics
            self.stats = {
                'total_reviews': len(self.data),
                'avg_rating': self.data['rating'].mean(),
                'rating_distribution': self.data['rating'].value_counts().to_dict()
            }
        else:
            # Handle missing values
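            # Assign the result back; inplace fillna on a column selection is unreliable in recent pandas
            self.data['Age'] = self.data['Age'].fillna(self.data['Age'].median())
            self.data['Embarked'] = self.data['Embarked'].fillna(self.data['Embarked'].mode()[0])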
            
            # Create summary statistics
            self.stats = {
                'survival_rate': self.data['Survived'].mean(),
                'avg_age': self.data['Age'].mean(),
                'class_distribution': self.data['Pclass'].value_counts().to_dict()
            }
        
        print(f"   Preprocessing complete: {self.stats}")
        self.next(self.llm_analysis)
    
    @step 
    def llm_analysis(self):
        """
        Use LangChain for intelligent analysis
        """
        print("🧠 Running LLM analysis...")
        
        try:
            # Create analysis prompt
            analysis_prompt = ChatPromptTemplate.from_messages([
                ("system", "You are an expert data analyst. Provide insights and recommendations."),
                ("human", "Analyze this data summary and provide 3 key insights: {data_summary}")
            ])
            
            # Create analysis chain
            analysis_chain = analysis_prompt | model | output_parser
            
            # Generate insights
            self.llm_insights = analysis_chain.invoke({"data_summary": str(self.stats)})
            print(f"   LLM Analysis: {self.llm_insights[:100]}...")
            
        except Exception as e:
            print(f"   LLM analysis failed: {e}")
            self.llm_insights = "LLM analysis unavailable"
        
        self.next(self.generate_report)
    
    @step
    def generate_report(self):
        """
        Generate final analysis report
        """
        print("📊 Generating final report...")
        
        self.report = {
            'dataset': self.dataset_type,
            'data_shape': self.data.shape,
            'statistics': self.stats,
            'llm_insights': self.llm_insights,
            'timestamp': pd.Timestamp.now().isoformat()
        }
        
        print("   Report generated successfully!")
        self.next(self.end)
    
    @step
    def end(self):
        """
        Pipeline completion
        """
        print("🎉 Hybrid pipeline complete!")
        print(f"   Final report: {len(str(self.report))} characters")

print("✅ Hybrid pipeline class defined!")
print("\n🔧 Your turn:")
print("   1. Add more sophisticated preprocessing steps")
print("   2. Include multiple LLM analysis stages")
print("   3. Add data visualization generation")
print("   4. Implement error handling and logging")
print("   5. Create automated report formatting")

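# Test the pipeline (requires Ollama for the LLM step; the flow falls back if it is unavailable)
print("\n🧪 Testing hybrid pipeline...")
print("   Metaflow flows are normally run from a script via the command line.")
print("   Save the flow to a .py file together with everything it uses")
print("   (imports, reviews_df/titanic_df loading, clean_text, model, output_parser).")

# In that script, end with:
# if __name__ == '__main__':
#     HybridAnalysisPipeline()
#
# Then run it from a terminal, for example:
#     python <your_flow_file>.py run --dataset reviews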


### 🎯 Exercise 3.2: Production Considerations (5 minutes)

**Your Task**: Discuss and implement production-ready patterns for hybrid pipelines.

**Instructions**:
1. Error handling and fallback strategies
2. Model versioning and updates
3. Monitoring and logging
4. Scalability considerations
5. Cost optimization

In [None]:
# 🎯 EXERCISE 3.2: Your solution here!
print("💻 Exercise 3.2: Production Considerations")
print("=" * 45)

print("🏭 Production-Ready Patterns:")

production_patterns = {
    "Error Handling": [
        "Graceful LLM failures with fallback analysis",
        "Retry logic for API calls",
        "Data validation checkpoints",
        "Circuit breaker patterns"
    ],
    "Monitoring": [
        "Track LLM response times and quality",
        "Monitor data drift in inputs",
        "Log pipeline execution metrics",
        "Alert on analysis anomalies"
    ],
    "Scalability": [
        "Batch processing for large datasets",
        "Async LLM calls for parallel processing",
        "Caching for repeated analyses",
        "Resource management and limits"
    ],
    "Cost Optimization": [
        "Local models vs. API costs",
        "Smart prompt engineering",
        "Result caching strategies",
        "Model size vs. accuracy tradeoffs"
    ]
}

for category, items in production_patterns.items():
    print(f"\n   🎯 {category}:")
    for item in items:
        print(f"      • {item}")

# Example production-ready pattern
print("\n🔧 Example: Robust LLM Analysis Function")

def robust_llm_analysis(data_summary, max_retries=3, timeout=30):
    """
    Production-ready LLM analysis with error handling.

    Note: `timeout` is not enforced in this demo; pass it to your LLM client.
    """
    import time
    
    last_error = "Response failed validation"
    for attempt in range(max_retries):
        try:
            # Your LLM call here
            # result = llm_chain.invoke({"data": data_summary})
            
            # Simulated response for demo
            time.sleep(0.1)  # Simulate processing time
            result = f"Analysis attempt {attempt + 1}: Key insights from data summary"
            
            # Validate response quality
            if len(result) > 10:  # Basic validation
                return {
                    'success': True,
                    'analysis': result,
                    'attempt': attempt + 1
                }
                
        except Exception as e:
            last_error = str(e)
            print(f"   Attempt {attempt + 1} failed: {e}")
        
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff before the next attempt
    
    # Every attempt raised or failed validation: return the fallback instead of None
    return {
        'success': False,
        'analysis': 'Automated statistical analysis only',
        'error': last_error
    }

# Test the robust function
test_result = robust_llm_analysis("Sample data summary")
print(f"   Test result: {test_result}")

print("\n🔧 Your turn: Implement production patterns:")
print("   1. Add comprehensive logging")
print("   2. Implement result caching")
print("   3. Add performance monitoring")
print("   4. Create configuration management")
print("   5. Design automated testing")

print("\n✅ Production considerations complete!")

## 🎉 Workshop Summary & Next Steps

Congratulations! You've completed the Week 2 interactive workshop!

In [None]:
print("🎓 WEEK 2 WORKSHOP COMPLETE!")
print("=" * 35)

print("🏆 What You've Accomplished:")
accomplishments = [
    "✅ Mastered advanced data preprocessing techniques",
    "✅ Built sophisticated feature engineering pipelines",
    "✅ Learned LangChain Expression Language (LCEL)",
    "✅ Set up and used local LLMs with Ollama",
    "✅ Created hybrid ML + LLM workflows",
    "✅ Processed text data with NLP techniques",
    "✅ Built production-ready patterns"
]

for achievement in accomplishments:
    print(f"   {achievement}")

print("\n🛠️ Key Skills Developed:")
skills = {
    "Data Processing": ["Advanced imputation", "Feature engineering", "Text preprocessing", "Pipeline validation"],
    "LangChain/LLMs": ["LCEL syntax", "Chain composition", "Local model setup", "Prompt engineering"],
    "Integration": ["Hybrid pipelines", "Error handling", "Production patterns", "Monitoring strategies"]
}

for category, skill_list in skills.items():
    print(f"   🎯 {category}: {', '.join(skill_list)}")

print("\n🚀 Coming in Week 3:")
week3_topics = [
    "Advanced LangChain patterns and agents",
    "Vector databases and retrieval systems",
    "RAG (Retrieval-Augmented Generation)",
    "Building intelligent applications",
    "End-to-end AI system deployment"
]

for topic in week3_topics:
    print(f"   🔮 {topic}")

print("\n💡 Practice Recommendations:")
practice_items = [
    "🔄 Apply preprocessing techniques to your own datasets",
    "⚙️ Experiment with different LLM models and prompts",
    "📊 Build domain-specific analysis chains",
    "🏗️ Create production-ready pipeline templates",
    "📚 Explore advanced LangChain documentation"
]

for item in practice_items:
    print(f"   {item}")

print("\n📋 Self-Study Checklist:")
checklist = [
    "□ Complete the additional exercises in /exercises/",
    "□ Set up additional Ollama models for comparison",
    "□ Build a custom preprocessing pipeline for your domain",
    "□ Create a LangChain chain for a specific business problem",
    "□ Review vector database concepts for Week 3"
]

for item in checklist:
    print(f"   {item}")

print("\n🎖️ Workshop Feedback:")
print("   💭 What was your favorite part of today's workshop?")
print("   🤔 Which concepts need more practice?")
print("   💡 What real-world applications are you excited to build?")

print("\n🎉 Excellent work! You're ready for advanced AI applications!")
print("🏆 - INRIVA AI Academy Team")

# Summarize progress (built in memory; nothing is written to disk here)
import json
progress = {
    'workshop': 'week2_interactive',
    'completed': True,
    'timestamp': pd.Timestamp.now().isoformat(),
    'skills_learned': [skill for skill_list in skills.values() for skill in skill_list],
    'next_steps': week3_topics
}

print(f"\n💾 Progress summary ready: {len(json.dumps(progress))} characters of JSON")