# Week 2 Workshop: Data Preprocessing and LangChain Introduction

Welcome to Week 2! Today we'll master advanced data preprocessing with Metaflow and take our first steps into the world of Large Language Models with LangChain.

## 🎯 Workshop Objectives

By the end of this session, you'll be able to:
- Build sophisticated data preprocessing pipelines in Metaflow
- Understand and use LangChain Expression Language (LCEL)
- Work with local LLMs using Ollama
- Create hybrid workflows combining traditional ML with LLM capabilities

## 📋 Prerequisites Check

Let's make sure your environment is ready!

In [None]:
# Environment verification for Week 2
import sys
import subprocess
import warnings
warnings.filterwarnings('ignore')

def check_environment():
    """Comprehensive environment check for Week 2"""
    print("🔍 Week 2 Environment Check")
    print("=" * 30)
    
    # Core imports
    try:
        import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt
        import seaborn as sns
        from sklearn.preprocessing import StandardScaler, LabelEncoder
        from sklearn.feature_extraction.text import TfidfVectorizer
        from metaflow import FlowSpec, step, Parameter
        print("✅ Core ML libraries imported successfully")
    except ImportError as e:
        print(f"❌ Core import failed: {e}")
        return False
    
    # LangChain imports
    try:
        import langchain
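        # Import from langchain_core: the langchain.schema / langchain.prompts
        # paths are deprecated aliases in recent LangChain releases
        from langchain_core.output_parsers import BaseOutputParser
        from langchain_core.prompts import PromptTemplate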
        from langchain_community.llms import Ollama
        print("✅ LangChain imported successfully")
        print(f"   LangChain version: {langchain.__version__}")
    except ImportError as e:
        print(f"❌ LangChain import failed: {e}")
        print("   Install with: pip install langchain langchain-community")
        return False
    
    # Check Ollama installation
    try:
        result = subprocess.run(['ollama', '--version'], 
                              capture_output=True, text=True, timeout=5)
        if result.returncode == 0:
            print("✅ Ollama is installed")
            print(f"   Version: {result.stdout.strip()}")
            
            # Check for available models
            models_result = subprocess.run(['ollama', 'list'], 
                                         capture_output=True, text=True, timeout=5)
            if models_result.returncode == 0:
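                # The first line of `ollama list` is the column header;
                # every remaining line is an installed model (any family)
                model_lines = models_result.stdout.strip().splitlines()[1:]
                if model_lines:
                    print(f"✅ Local models available: {len(model_lines)}")
                else:
                    print("⚠️  No models found. Download with: ollama pull llama3.2")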
        else:
            print("❌ Ollama command failed")
            return False
    except (subprocess.TimeoutExpired, FileNotFoundError):
        print("❌ Ollama not found or not responding - install from ollama.com")
        print("   This is required for LLM exercises")
        return False
    
    print("\n🎯 Environment check complete!")
    return True

check_environment()

## Part 1: Advanced Data Preprocessing with Metaflow

Let's start by importing everything we need and loading our datasets.

In [None]:
# Complete imports for data preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder
)
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from metaflow import FlowSpec, step, Parameter, IncludeFile, catch
import re
import string
from scipy import stats

# Set up plotting
plt.style.use('default')
sns.set_palette("viridis")
plt.rcParams['figure.figsize'] = (12, 8)

print("📊 Data preprocessing environment ready!")
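
Before wiring `SimpleImputer` and `StandardScaler` into a flow, it can help to see what they compute. The sketch below is a dependency-free illustration, not scikit-learn's implementation: the helper names `impute_median` and `standardize` and the `ages` sample are made up for this example. It mirrors median imputation followed by standardization with the population standard deviation, which is the convention `StandardScaler` uses.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit variance using the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample column with two missing entries
ages = [20, None, 40, 30, None]

imputed = impute_median(ages)   # missing values filled with the median
scaled = standardize(imputed)   # centered on 0, population std of 1

print("Imputed:", imputed)
print("Scaled: ", [round(v, 3) for v in scaled])
```

In a real pipeline the same two steps would be fitted on the training split only and then applied to the test split, so that statistics from held-out data do not leak into training.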

## 🎯 Your Exercise Space

Use the cells below to work on the exercises. Feel free to add more cells as needed!

In [None]:
# Exercise workspace - Add your solutions here!
print("💻 Ready for your solutions!")
print("   Copy one of the exercise prompts above and start coding.")
print("   Remember: experimentation is key to learning!")

## 🎉 Workshop Complete!

You've successfully completed the Week 2 workshop on advanced data preprocessing and LangChain integration!

### 🏆 What You've Accomplished:

1. **🔧 Built sophisticated preprocessing pipelines** with Metaflow
2. **🦜 Mastered LangChain fundamentals** and LCEL syntax
3. **🌊 Created hybrid workflows** combining ML + LLM capabilities
4. **📝 Processed text data** using modern LLM approaches
5. **🎯 Implemented production patterns** for scalable AI systems

### 📚 Key Takeaways:

- **LCEL is powerful**: The `prompt | model | parser` pattern enables flexible chain composition
- **Local LLMs matter**: Ollama runs models on your own machine, so your data does not have to leave it
- **Hybrid approaches work**: Combining traditional ML with LLMs creates powerful systems
- **Error handling is crucial**: Production systems need robust fallback mechanisms
- **Integration patterns matter**: Metaflow orchestrates the pipeline while LangChain handles the LLM calls, bringing MLOps and LLMOps practices together
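
To make the first and fourth takeaways concrete, here is a toy sketch of how a `prompt | model | parser` pipeline composes and how a fallback can wrap it. It runs without LangChain or Ollama installed: `Step`, `with_fallback`, `fake_model`, and `broken_model` are hypothetical stand-ins written for this illustration, not LangChain classes. Real LCEL runnables offer the same `|` composition and an `invoke` method, plus a built-in `with_fallbacks` method for the fallback part.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # `a | b` builds a new Step that feeds a's output into b
        return Step(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x):
        return self.fn(x)


def with_fallback(primary, backup):
    """Hypothetical helper: run `primary`, use `backup` if it raises."""
    def run(x):
        try:
            return primary.invoke(x)
        except Exception:
            return backup.invoke(x)
    return Step(run)


def _unreachable(prompt_text):
    # Simulates a local model server that is not running
    raise ConnectionError("model server unavailable")


# Three stages that mimic prompt -> model -> parser
prompt = Step(lambda d: f"Summarize in one word: {d['text']}")
fake_model = Step(lambda p: f"  RESPONSE({p})  ")  # pretend LLM call
parser = Step(lambda s: s.strip())                 # tidy the raw output

chain = prompt | fake_model | parser
result = chain.invoke({"text": "Metaflow"})

# Same pipeline with a failing model, guarded by a rule-based fallback
broken_model = Step(_unreachable)
fallback = Step(lambda d: "LLM unavailable, using default summary")
safe_chain = with_fallback(prompt | broken_model | parser, fallback)
safe_result = safe_chain.invoke({"text": "Metaflow"})

print(result)
print(safe_result)
```

The fallback receives the original input rather than a half-processed value, which keeps the backup path independent of whichever stage failed.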

### 🚀 Next Steps:

1. **Complete the exercises** in your own time
2. **Experiment with different models** using Ollama
3. **Try the advanced patterns** in the resources section
4. **Join Friday's showcase** to share your results
5. **Prepare for Week 3** - Supervised Learning with Metaflow

### 📞 Need Help?

- **📁 Check `/solutions/`** for complete exercise answers
- **💬 Ask in Google Chat** for quick questions
- **🕒 Use Friday office hours** for detailed help
- **📖 Review resources** in `/resources/` directory

### 🎯 Week 3 Preview:

Next week we'll dive into supervised learning with Metaflow, building on everything you've learned about data preprocessing and adding sophisticated model training, evaluation, and comparison capabilities.

#### 📚 **What You'll Learn:**
- **🤖 Multiple Algorithm Training**: Random Forest, XGBoost, SVM, Logistic Regression
- **⚡ Parallel Processing**: Use Metaflow's `foreach` branching (`self.next(..., foreach=...)`) to train models in parallel
- **📊 Advanced Evaluation**: ROC curves, precision-recall, feature importance
- **🔧 Hyperparameter Tuning**: Grid search and Bayesian optimization
- **🦜 LLM Model Interpretation**: Use LangChain to explain model results
- **📋 Cross-Validation**: Robust model validation strategies

#### 🛠️ **Technical Skills:**
- Building scalable ML training pipelines
- Implementing model comparison frameworks
- Creating automated hyperparameter optimization
- Integrating LLM explanations for model insights
- Deploying model selection workflows

#### 🎯 **Week 3 Deliverables:**
- Multi-algorithm comparison pipeline
- Hyperparameter optimization system
- LLM-powered model explanation tool
- Production-ready model selection workflow

**Great work today! You're building real production AI skills! 🌟**