# Week 1: AI Engineering Mindset & Python Foundations

## Overview
Welcome to Week 1 of the AI Engineering curriculum. This week focuses on establishing an **engineering mindset** for AI development, moving beyond notebook-only approaches to building production-quality systems.

### Learning Objectives
By the end of this week, you will be able to:
- Distinguish between AI, ML, DL, and Agentic AI
- Understand the AI system lifecycle: data → model → system → production
- Write clean, production-quality Python code with OOP, modularity, and typing
- Apply NumPy for efficient numerical computations
- Process data using Pandas
- Implement data validation and logging

### Real-World Outcome
Build a **Production Data Pipeline** that transforms raw data into clean, validated datasets with proper metrics and logging.

---

## Part 1: AI vs ML vs DL vs Agentic AI

### Understanding the Landscape

These terms are often used interchangeably, but each has a distinct meaning:

**Artificial Intelligence (AI)**
- Broad field of building systems that perform tasks that typically require human intelligence
- Includes rule-based systems, expert systems, search algorithms, and modern ML

**Machine Learning (ML)**
- Subset of AI that learns patterns from data
- Includes supervised, unsupervised, and reinforcement learning
- Examples: Linear regression, decision trees, random forests

**Deep Learning (DL)**
- Subset of ML using neural networks with multiple layers
- Excels at learning hierarchical representations
- Examples: CNNs for vision, Transformers for language

**Agentic AI**
- AI systems that can act autonomously to achieve goals
- Can plan, use tools, maintain memory, and adapt
- Examples: Autonomous research agents, multi-agent systems
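
To make the first two categories concrete, here is a minimal sketch contrasting a hand-written rule with a rule learned from labeled data. The messages, the helper names (`rule_based_is_spam`, `count_exclamations`, `learn_threshold`), and the exclamation-mark feature are all invented for illustration; real spam filters use far richer features and models.

```python
from typing import List, Tuple

def rule_based_is_spam(message: str) -> bool:
    """Rule-based AI: a human writes the decision logic by hand."""
    return "free money" in message.lower()

def count_exclamations(message: str) -> int:
    """A single hand-picked feature for the learned classifier."""
    return message.count("!")

def learn_threshold(examples: List[Tuple[str, bool]]) -> int:
    """ML in miniature: choose the threshold that best fits labeled data."""
    best_threshold, best_correct = 0, -1
    for threshold in range(0, 6):
        # Count how many examples this threshold classifies correctly
        correct = sum(
            (count_exclamations(text) >= threshold) == label
            for text, label in examples
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

# Hypothetical labeled examples: (message, is_spam)
training_data = [
    ("Meeting moved to 3pm", False),
    ("Lunch tomorrow?", False),
    ("WIN NOW!!! Click here!!!", True),
    ("Limited offer!!! Act fast!!", True),
]
threshold = learn_threshold(training_data)
print(f"Learned threshold: {threshold} exclamation mark(s)")
```

The rule in `rule_based_is_spam` never changes unless someone edits the code, while `learn_threshold` would pick a different cutoff if the training data changed. That dependence on data is the core distinction between rule-based AI and ML.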

### TODO 1.1: Map Use Cases to AI Categories

Complete the function below to categorize different AI systems:

In [None]:
from typing import Dict, List
from enum import Enum

class AICategory(Enum):
    RULE_BASED_AI = "rule_based_ai"
    TRADITIONAL_ML = "traditional_ml"
    DEEP_LEARNING = "deep_learning"
    AGENTIC_AI = "agentic_ai"

def categorize_ai_system(description: str) -> AICategory:
    """
    Categorize an AI system based on its description.
    
    Args:
        description: Description of the AI system
    
    Returns:
        AICategory enum value
    """
    # TODO: Implement logic to categorize AI systems
    # Hint: Look for keywords like "rules", "neural network", "autonomous", etc.
    pass

# Test cases
test_cases = [
    "A spam filter using logistic regression",
    "A chess program with if-then rules",
    "An image classifier using convolutional neural networks",
    "An autonomous agent that plans and executes research tasks"
]

# TODO: Uncomment and test
# for case in test_cases:
#     print(f"{case}: {categorize_ai_system(case)}")

---

## Part 2: AI System Lifecycle

### The Pipeline: Data → Model → System → Production

Real AI systems follow a structured lifecycle:

1. **Data**: Collection, cleaning, validation, versioning
2. **Model**: Training, evaluation, selection, tuning
3. **System**: Integration, API design, error handling
4. **Production**: Deployment, monitoring, maintenance, updates
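
The four stages can be sketched end to end on a toy problem. This is only an illustration under simplifying assumptions: the data is synthetic, the "model" is a straight-line fit, and "production" is reduced to recording each prediction in a list. The names `predict` and `prediction_log` are hypothetical.

```python
import numpy as np

# 1. Data: collect and validate (synthetic data stands in for a real source)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)
assert not np.isnan(x).any() and not np.isnan(y).any()

# 2. Model: fit a straight line and evaluate it with mean squared error
slope, intercept = np.polyfit(x, y, deg=1)
mse = float(np.mean((slope * x + intercept - y) ** 2))

# 3. System: wrap the model behind a function that checks its input
def predict(value: float) -> float:
    if not np.isfinite(value):
        raise ValueError("input must be a finite number")
    return float(slope * value + intercept)

# 4. Production: record each request and response so behavior can be monitored
prediction_log = []
for request in [1.0, 5.0, 9.0]:
    prediction_log.append((request, predict(request)))

print(f"slope={slope:.2f}, intercept={intercept:.2f}, mse={mse:.3f}")
```

Each stage consumes the output of the one before it, which is why a data problem found late (for example, in production) usually forces a return to stage 1.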

### TODO 2.1: Design a System Lifecycle Tracker

In [None]:
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional

@dataclass
class LifecycleStage:
    """Represents a stage in the AI system lifecycle."""
    name: str
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    status: str = "pending"  # pending, in_progress, completed, failed
    
class AISystemLifecycle:
    """Tracks the lifecycle of an AI system."""
    
    def __init__(self, project_name: str):
        self.project_name = project_name
        self.stages = {
            "data": LifecycleStage("Data Collection & Preparation"),
            "model": LifecycleStage("Model Development"),
            "system": LifecycleStage("System Integration"),
            "production": LifecycleStage("Production Deployment")
        }
    
    def start_stage(self, stage_name: str) -> None:
        """Mark a stage as started."""
        # TODO: Implement this method
        # Set started_at to current time and status to "in_progress"
        pass
    
    def complete_stage(self, stage_name: str) -> None:
        """Mark a stage as completed."""
        # TODO: Implement this method
        # Set completed_at to current time and status to "completed"
        pass
    
    def get_current_stage(self) -> Optional[str]:
        """Get the current in-progress stage."""
        # TODO: Implement this method
        # Return the name of the stage that is "in_progress", or None
        pass
    
    def get_progress_report(self) -> Dict[str, str]:
        """Generate a progress report of all stages."""
        # TODO: Implement this method
        # Return a dict mapping stage names to their status
        pass

# TODO: Test the lifecycle tracker
# lifecycle = AISystemLifecycle("Customer Churn Predictor")
# lifecycle.start_stage("data")
# print(lifecycle.get_progress_report())

---

## Part 3: Production-Quality Python

### Object-Oriented Programming for AI Systems

Production AI code should be:
- **Modular**: Organized into classes and functions
- **Typed**: Using type hints for clarity and IDE support
- **Testable**: Easy to unit test
- **Maintainable**: Clear naming and documentation
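
As a small illustration of these four properties together, the sketch below defines a typed, self-contained class that is easy to test in isolation. `MinMaxScaler` here is a hypothetical teaching example, not the scikit-learn class of the same name.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class MinMaxScaler:
    """Scales numbers from [minimum, maximum] to [0, 1]."""
    minimum: float
    maximum: float

    def __post_init__(self) -> None:
        # Fail fast on a configuration that would cause division by zero
        if self.maximum <= self.minimum:
            raise ValueError("maximum must be greater than minimum")

    def scale(self, value: float) -> float:
        """Scale a single value; values outside the range are not clipped."""
        return (value - self.minimum) / (self.maximum - self.minimum)

    def scale_all(self, values: List[float]) -> List[float]:
        """Scale every value in a list."""
        return [self.scale(v) for v in values]

scaler = MinMaxScaler(minimum=0.0, maximum=10.0)
scaled = scaler.scale_all([0.0, 5.0, 10.0])
print(scaled)
```

Because the class has no hidden state and validates its own configuration, a unit test only needs to construct it and compare outputs.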

### TODO 3.1: Build a Data Validator Class

In [None]:
from typing import Any, Callable, List, Tuple
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class ValidationRule:
    """Represents a single validation rule."""
    
    def __init__(self, name: str, validator: Callable[[Any], bool], error_message: str):
        self.name = name
        self.validator = validator
        self.error_message = error_message
    
    def validate(self, value: Any) -> Tuple[bool, str]:
        """Run validation and return (is_valid, message)."""
        # TODO: Implement validation logic
        # Call self.validator(value) and return appropriate tuple
        pass

class DataValidator:
    """Validates data against a set of rules."""
    
    def __init__(self):
        self.rules: List[ValidationRule] = []
    
    def add_rule(self, rule: ValidationRule) -> None:
        """Add a validation rule."""
        # TODO: Implement this method
        pass
    
    def validate(self, value: Any) -> Tuple[bool, List[str]]:
        """Validate value against all rules.
        
        Returns:
            Tuple of (is_valid, list of error messages)
        """
        # TODO: Implement validation against all rules
        # Return True and empty list if all pass, False and error messages if any fail
        pass

# TODO: Create validation rules for email addresses
# Example: email must contain '@', must have domain, etc.
# Test with valid and invalid emails

---

## Part 4: NumPy Fundamentals & Vectorization

### Why NumPy?
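- **Speed**: Vectorized operations on large arrays are often 10-100x faster than equivalent Python loops (the exact speedup depends on the operation, array size, and hardware)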
- **Memory**: Efficient array storage
- **Foundation**: Basis for pandas, scikit-learn, TensorFlow, PyTorch
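
A quick way to see the speed difference is to time the same computation both ways. The sketch below uses a sum of squares as an arbitrary example; the function names are made up, and the timings it prints will vary by machine.

```python
import time
import numpy as np

values = np.arange(200_000, dtype=np.float64)

def sum_of_squares_loop(arr: np.ndarray) -> float:
    """Accumulate x*x one element at a time in the Python interpreter."""
    total = 0.0
    for x in arr:
        total += x * x
    return total

def sum_of_squares_vectorized(arr: np.ndarray) -> float:
    """Let NumPy do the whole computation in compiled code."""
    return float(np.dot(arr, arr))

start = time.perf_counter()
loop_result = sum_of_squares_loop(values)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized_result = sum_of_squares_vectorized(values)
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")
```

Both functions return the same number; only the time taken to get there differs.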

### TODO 4.1: Vectorize Data Processing

In [None]:
import numpy as np
import time

# Generate sample data
np.random.seed(42)
data = np.random.randn(1000000)

# TODO 4.1a: Implement using Python loop
def normalize_loop(arr: np.ndarray) -> np.ndarray:
    """
    Normalize array to range [0, 1] using a loop.
    Formula: (x - min) / (max - min)
    """
    # TODO: Implement using a for loop
    pass

# TODO 4.1b: Implement using NumPy vectorization
def normalize_vectorized(arr: np.ndarray) -> np.ndarray:
    """
    Normalize array to range [0, 1] using vectorized operations.
    """
    # TODO: Implement using NumPy operations (no loops)
    pass

# TODO: Compare performance
# Measure time for both implementations and print the speedup

### TODO 4.2: Statistical Analysis with NumPy

In [None]:
def compute_statistics(arr: np.ndarray) -> Dict[str, float]:
    """
    Compute comprehensive statistics for an array.
    
    Returns dict with: mean, median, std, min, max, q25, q75
    """
    # TODO: Implement statistical computations
    # Use NumPy functions: np.mean, np.median, np.std, np.percentile, etc.
    pass

# TODO: Test with sample data
# test_data = np.random.randn(1000)
# stats = compute_statistics(test_data)
# print(stats)

---

## Part 5: Pandas for Data Processing

### Why Pandas?
- **DataFrames**: Table-like data structures
- **Missing Data**: Built-in handling of NaN values
- **Grouping**: SQL-like operations
- **Time Series**: Powerful date/time functionality
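
The sketch below touches each of these features once on a tiny, invented `orders` table. The column names and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# DataFrame: a small table with one missing amount
orders = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [100.0, np.nan, 300.0, 50.0],
    "order_date": pd.to_datetime(
        ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"]
    ),
})

# Missing data: count NaNs, then fill them with the column mean
missing_count = int(orders["amount"].isna().sum())
orders["amount"] = orders["amount"].fillna(orders["amount"].mean())

# Grouping: total amount per region (similar to SQL GROUP BY)
totals_by_region = orders.groupby("region")["amount"].sum()

# Time series: total amount per calendar day
daily_totals = orders.set_index("order_date")["amount"].resample("D").sum()

print(totals_by_region)
print(daily_totals)
```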

### TODO 5.1: Load and Explore Data

In [None]:
import pandas as pd

# Create sample dataset (named sample_data so it does not overwrite the NumPy `data` array from Part 4)
sample_data = {
    'user_id': range(1, 101),
    'age': np.random.randint(18, 70, 100),
    'income': np.random.randint(20000, 150000, 100),
    'signup_date': pd.date_range('2023-01-01', periods=100),
    'is_active': np.random.choice([True, False], 100),
    'total_purchases': np.random.randint(0, 50, 100)
}
df = pd.DataFrame(sample_data)

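# Introduce some missing values
# Cast to float first so the column can hold NaN; replace=False guarantees 10 distinct rows
df['income'] = df['income'].astype(float)
df.loc[np.random.choice(df.index, 10, replace=False), 'income'] = np.nan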

def explore_dataframe(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Generate comprehensive exploration report for a DataFrame.
    
    Returns dict with:
    - shape: (rows, columns)
    - dtypes: dict of column types
    - missing: dict of missing value counts per column
    - numeric_summary: summary stats for numeric columns
    """
    # TODO: Implement exploration logic
    pass

# TODO: Test the function
# report = explore_dataframe(df)
# print(report)

### TODO 5.2: Clean and Transform Data

In [None]:
class DataCleaner:
    """Handles data cleaning operations."""
    
    @staticmethod
    def handle_missing_values(df: pd.DataFrame, strategy: str = 'mean') -> pd.DataFrame:
        """
        Handle missing values in DataFrame.
        
        Args:
            df: Input DataFrame
            strategy: 'mean', 'median', 'mode', or 'drop'
        
        Returns:
            Cleaned DataFrame
        """
        # TODO: Implement missing value handling
        # Use different strategies based on the strategy parameter
        pass
    
    @staticmethod
    def remove_outliers(df: pd.DataFrame, column: str, n_std: float = 3.0) -> pd.DataFrame:
        """
        Remove outliers from a specific column using standard deviation method.
        
        Args:
            df: Input DataFrame
            column: Column name to check for outliers
            n_std: Number of standard deviations for outlier threshold
        
        Returns:
            DataFrame with outliers removed
        """
        # TODO: Implement outlier removal
        # Remove rows where |value - mean| > n_std * std for the given column
        pass
    
    @staticmethod
    def create_features(df: pd.DataFrame) -> pd.DataFrame:
        """
        Create derived features from existing columns.
        
        For the sample dataset:
        - purchase_frequency: total_purchases / days_since_signup
          (derive days_since_signup from signup_date and guard against division by zero)
        - income_bracket: categorize income into low/medium/high
        - age_group: categorize age into groups
        """
        # TODO: Implement feature engineering
        pass

# TODO: Test the data cleaner
# cleaner = DataCleaner()
# df_cleaned = cleaner.handle_missing_values(df, strategy='median')
# df_cleaned = cleaner.create_features(df_cleaned)
# print(df_cleaned.head())

---

## Part 6: Data Validation & Logging

### Production Data Pipelines Need:
- **Validation**: Ensure data quality at every stage
- **Logging**: Track operations and errors
- **Metrics**: Measure pipeline health
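
All three needs can show up in a single small function. The sketch below validates a hypothetical `age` column, logs the outcome, and returns metrics; the 0 to 120 range, the logger name, and the metric names are illustrative assumptions, and the function assumes a non-empty DataFrame.

```python
import logging
import pandas as pd

logger = logging.getLogger("pipeline_sketch")

def validate_ages(df: pd.DataFrame) -> dict:
    """Check the 'age' column and return simple health metrics."""
    # Validation: count missing and implausible values
    missing = int(df["age"].isna().sum())
    out_of_range = int(((df["age"] < 0) | (df["age"] > 120)).sum())

    # Metrics: summarize pipeline health in a plain dict
    metrics = {
        "rows": len(df),
        "missing_age": missing,
        "out_of_range_age": out_of_range,
        "valid_fraction": 1.0 - (missing + out_of_range) / len(df),
    }

    # Logging: record the outcome at a level that matches its severity
    if missing or out_of_range:
        logger.warning("Age validation found problems: %s", metrics)
    else:
        logger.info("Age validation passed: %s", metrics)
    return metrics

sample = pd.DataFrame({"age": [25, 40, None, 150]})
metrics = validate_ages(sample)
print(metrics)
```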

### TODO 6.1: Implement Pipeline Validation

In [None]:
from typing import Any, Dict, List
import logging

class DataQualityChecker:
    """Validates data quality in a pipeline."""
    
    def __init__(self):
        self.logger = logging.getLogger(self.__class__.__name__)
        self.validation_results = []
    
    def check_missing_values(self, df: pd.DataFrame, max_missing_pct: float = 0.1) -> bool:
        """
        Check if missing values are within acceptable threshold.
        
        Args:
            df: DataFrame to check
            max_missing_pct: Maximum allowed fraction of missing values (0.1 means 10%)
        
        Returns:
            True if validation passes
        """
        # TODO: Implement validation
        # Log results and return boolean
        pass
    
    def check_schema(self, df: pd.DataFrame, expected_columns: List[str]) -> bool:
        """
        Check if DataFrame has expected columns.
        """
        # TODO: Implement schema validation
        pass
    
    def check_data_types(self, df: pd.DataFrame, expected_types: Dict[str, str]) -> bool:
        """
        Check if columns have expected data types.
        
        Args:
            df: DataFrame to check
            expected_types: Dict mapping column names to expected types
        """
        # TODO: Implement type validation
        pass
    
    def check_value_ranges(self, df: pd.DataFrame, column: str, min_val: float, max_val: float) -> bool:
        """
        Check if values in a column are within expected range.
        """
        # TODO: Implement range validation
        pass
    
    def get_validation_report(self) -> Dict[str, Any]:
        """
        Generate a comprehensive validation report.
        """
        # TODO: Compile and return validation results
        pass

# TODO: Test the quality checker
# qc = DataQualityChecker()
# qc.check_missing_values(df)
# qc.check_schema(df, ['user_id', 'age', 'income'])
# print(qc.get_validation_report())

---

## Part 7: Week 1 Project - Production Data Pipeline

### Project Overview
Build an end-to-end data pipeline that:
1. Loads raw data
2. Validates data quality
3. Cleans and transforms data
4. Generates metrics and logs
5. Exports clean data
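
Before building the class-based version, it can help to see the five steps as one short function. This is a deliberately simplified sketch: the CSV content is invented, validation only checks column names, cleaning only drops incomplete rows, and the export writes to an in-memory string rather than a file. `run_mini_pipeline` and `RAW_CSV` are hypothetical names.

```python
import io
import logging
import pandas as pd

logger = logging.getLogger("mini_pipeline")

RAW_CSV = """user_id,age,income
1,34,52000
2,,61000
3,29,
4,29,48000
"""

def run_mini_pipeline(raw_csv: str) -> tuple:
    # 1. Load raw data
    raw = pd.read_csv(io.StringIO(raw_csv))

    # 2. Validate data quality (here: required columns only)
    missing_columns = {"user_id", "age", "income"} - set(raw.columns)
    if missing_columns:
        raise ValueError(f"missing columns: {sorted(missing_columns)}")

    # 3. Clean and transform: drop rows with any missing value
    clean = raw.dropna().reset_index(drop=True)

    # 4. Generate metrics and logs
    metrics = {
        "rows_in": len(raw),
        "rows_out": len(clean),
        "rows_removed": len(raw) - len(clean),
    }
    logger.info("Pipeline metrics: %s", metrics)

    # 5. Export clean data (to a string here; a file path would also work)
    buffer = io.StringIO()
    clean.to_csv(buffer, index=False)
    return buffer.getvalue(), metrics

clean_csv, metrics = run_mini_pipeline(RAW_CSV)
print(metrics)
```

The project version should separate these steps into methods so each one can be tested, logged, and replaced independently.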

### TODO 7.1: Implement Complete Pipeline

In [None]:
from datetime import datetime
from pathlib import Path

class ProductionDataPipeline:
    """
    A production-grade data pipeline with validation, logging, and metrics.
    """
    
    def __init__(self, pipeline_name: str):
        self.pipeline_name = pipeline_name
        self.logger = self._setup_logging()
        self.cleaner = DataCleaner()
        self.quality_checker = DataQualityChecker()
        self.metrics = {}
    
    def _setup_logging(self) -> logging.Logger:
        """Set up pipeline logging."""
        # TODO: Configure logging with file handler
        pass
    
    def load_data(self, data_source: Any) -> pd.DataFrame:
        """
        Load data from source (file, database, API, etc.).
        """
        # TODO: Implement data loading with error handling and logging
        pass
    
    def validate_raw_data(self, df: pd.DataFrame) -> bool:
        """
        Validate raw data before processing.
        """
        # TODO: Run multiple validation checks
        pass
    
    def clean_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Clean and transform data.
        """
        # TODO: Apply cleaning operations
        pass
    
    def compute_metrics(self, df_raw: pd.DataFrame, df_clean: pd.DataFrame) -> Dict[str, Any]:
        """
        Compute pipeline metrics.
        
        Metrics to compute:
        - rows_processed
        - rows_removed
        - missing_values_handled
        - processing_time
        - data_quality_score
        """
        # TODO: Implement metrics computation
        pass
    
    def export_data(self, df: pd.DataFrame, output_path: str) -> None:
        """
        Export cleaned data to file.
        """
        # TODO: Implement data export with error handling
        pass
    
    def run(self, data_source: Any, output_path: str) -> Dict[str, Any]:
        """
        Execute the complete pipeline.
        
        Returns:
            Pipeline execution report
        """
        # TODO: Orchestrate the complete pipeline
        # 1. Load data
        # 2. Validate raw data
        # 3. Clean data
        # 4. Compute metrics
        # 5. Export data
        # 6. Return execution report
        pass

# TODO: Test the complete pipeline
# pipeline = ProductionDataPipeline("user_data_pipeline")
# report = pipeline.run(df, "cleaned_data.csv")
# print(report)

---

## Summary & Next Steps

### What You've Learned
- Distinctions between AI, ML, DL, and Agentic AI
- AI system lifecycle from data to production
- Production-quality Python with OOP and typing
- NumPy for efficient numerical computations
- Pandas for data manipulation
- Data validation and logging practices
- Building a complete production data pipeline

### Real-World Applications
- Data engineering pipelines
- ETL processes
- ML data preparation
- Data quality monitoring systems

### Next Week Preview
**Week 2: Probability, Uncertainty & Decision Systems**
- Build explainable decision-making systems
- Apply probability in real-world scenarios
- Create a risk scoring engine

---

## Additional Practice

### Challenge 1: Enhanced Pipeline
Extend the pipeline to:
- Handle multiple data sources
- Implement data versioning
- Add pipeline scheduling

### Challenge 2: Real Dataset
Apply your pipeline to a real dataset:
- Download a dataset from Kaggle or UCI ML Repository
- Process it through your pipeline
- Generate a quality report

### Challenge 3: Unit Tests
Write unit tests for:
- DataValidator class
- DataCleaner methods
- DataQualityChecker validations
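
As a starting point, here is a minimal `unittest` sketch. Because the classes above are left as exercises, it tests a hypothetical stand-in function, `fraction_missing`; the same pattern applies to your own methods. The suite is run through a loader and runner instead of `unittest.main()` so that it also works inside a notebook.

```python
import unittest
from typing import List, Optional

def fraction_missing(values: List[Optional[float]]) -> float:
    """Hypothetical function under test: share of None entries in a list."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)

class TestFractionMissing(unittest.TestCase):
    def test_empty_list_has_no_missing_values(self) -> None:
        self.assertEqual(fraction_missing([]), 0.0)

    def test_counts_none_entries(self) -> None:
        self.assertEqual(fraction_missing([1.0, None, 3.0, None]), 0.5)

    def test_complete_list_has_no_missing_values(self) -> None:
        self.assertEqual(fraction_missing([1.0, 2.0]), 0.0)

# Build and run the suite explicitly (notebook-friendly)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFractionMissing)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```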

---