# Housing V4 - Modular Implementation

This notebook demonstrates the V4 modular implementation of the housing price prediction project. Unlike V3, which was a single monolithic notebook, V4 is structured as a Python application with clear module boundaries.

## Key Improvements in V4:
- **Modular Design**: Separated into logical components (data loading, preprocessing, modeling, evaluation)
- **Configuration Management**: YAML-based configuration for easy experimentation
- **CLI Interface**: Can be run from the command line in different modes (train, predict, evaluate)
- **Model Persistence**: Save/load trained models and preprocessing artifacts
- **Better Error Handling**: Robust validation and informative error messages
- **Production Ready**: Structured for deployment and maintenance

## Setup and Configuration

In [None]:
import sys
from pathlib import Path

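# The `src.*` imports below resolve from the project root, so put it on the path;
# `src` itself is kept on the path as well, as in the original setup
project_root = Path.cwd()
sys.path.append(str(project_root))
sys.path.append(str(project_root / 'src'))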

# Import our modular components
from src.config import Config
from src.data_loader import HousingDataLoader
from src.preprocessor import HousingPreprocessor
from src.model import HousingPriceModel, ModelTrainer
from src.evaluator import ModelEvaluator
from src.utils import set_random_seed, create_directories

# Standard imports
import pandas as pd
import numpy as np
import torch
import matplotlib.pyplot as plt

## 1. Initialize Configuration

In [None]:
# Create necessary directories
create_directories()

# Load configuration
config = Config('config.yaml')

# Set random seed for reproducibility
set_random_seed(config.SEED)

print("Configuration loaded successfully!")
print(f"Device: {config.device}")
print(f"Model architecture: {config.MODEL['hidden_sizes']}")
print(f"Training epochs: {config.MODEL['epochs']}")
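
The `Config` class lives in `src/config.py` and is not shown in this notebook. As a rough sketch of the pattern the cell above relies on (upper-case sections such as `SEED` and `MODEL`, plus a `device` attribute), something like the following could work. The class name `SketchConfig`, the `from_mapping` constructor, the validation rule, and all values are hypothetical; in the real project the mapping would presumably come from parsing `config.yaml`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping


@dataclass
class SketchConfig:
    """Hypothetical stand-in for src.config.Config (the real class is not shown)."""

    SEED: int = 42  # assumed default
    MODEL: Dict[str, Any] = field(default_factory=dict)
    device: str = "cpu"  # assumed default

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "SketchConfig":
        """Build a config from an already-parsed mapping (e.g. parsed YAML)."""
        model = dict(raw.get("MODEL", {}))
        # Validate early so a bad config fails with a clear message.
        for key in ("hidden_sizes", "epochs"):
            if key not in model:
                raise KeyError(f"MODEL section is missing required key: {key!r}")
        return cls(
            SEED=int(raw.get("SEED", 42)),
            MODEL=model,
            device=str(raw.get("device", "cpu")),
        )


# Illustrative values only; they are not taken from the project's config.yaml.
example = SketchConfig.from_mapping(
    {"SEED": 0, "MODEL": {"hidden_sizes": [64, 32], "epochs": 10}}
)
print(example.MODEL["hidden_sizes"], example.MODEL["epochs"], example.device)
```

Validating required keys at load time is one way to get the informative error messages mentioned earlier: a missing setting fails immediately instead of partway through training.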

## 2. Data Loading

In [None]:
# Initialize data loader
data_loader = HousingDataLoader(config)

# Load training and test data
train_data, test_data = data_loader.load_data()

# Display basic information
print("\nData Overview:")
print(f"Training samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")
print(f"Features in training: {train_data.shape[1] - 2}")  # Exclude Id and SalePrice

# Preview the data
display(train_data.head())

## 3. Data Preprocessing

In [None]:
# Initialize preprocessor
preprocessor = HousingPreprocessor(config)

# Run complete preprocessing pipeline
X_train, X_test, y_train = preprocessor.preprocess(train_data, test_data)

print("\nPreprocessing Results:")
print(f"Training features shape: {X_train.shape}")
print(f"Test features shape: {X_test.shape}")
print(f"Target shape: {y_train.shape}")
print(f"Target range (log): [{y_train.min():.3f}, {y_train.max():.3f}]")
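
The `(log)` label above and the later `np.expm1` call suggest that the preprocessor applies `np.log1p` to `SalePrice`, although its code is not shown here. A minimal sketch of that transform and its inverse, using made-up prices:

```python
import numpy as np

# Made-up sale prices in dollars, for illustration only.
prices = np.array([50_000.0, 180_000.0, 755_000.0])

# log1p(x) = log(1 + x) compresses the long right tail of the price distribution.
log_prices = np.log1p(prices)

# expm1(y) = exp(y) - 1 is the inverse, so predictions made on the
# log scale can be mapped back to dollars.
recovered = np.expm1(log_prices)

print(np.allclose(recovered, prices))  # True
```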

## 4. Model Training

In [None]:
# Initialize model and trainer
model = HousingPriceModel(X_train.shape[1], config)
trainer = ModelTrainer(model, config)

print("Model Architecture:")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
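
As a cross-check on the printed total, the parameter count of a plain fully connected network can be computed by hand. The sketch below assumes `hidden_sizes` describes a stack of linear layers with biases, followed by a single-output linear layer. The actual `HousingPriceModel` is not shown and may contain other parameterized layers (batch normalization, for example), which would change the total. The function name and the sizes used are hypothetical.

```python
from typing import Sequence


def count_mlp_parameters(
    n_features: int, hidden_sizes: Sequence[int], n_outputs: int = 1
) -> int:
    """Count weights and biases of a fully connected network with the given widths."""
    widths = [n_features, *hidden_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        total += fan_in * fan_out  # weight matrix
        total += fan_out           # bias vector
    return total


# Hypothetical sizes: 10 input features, hidden layers of 8 and 4 units.
# (10*8 + 8) + (8*4 + 4) + (4*1 + 1) = 88 + 36 + 5 = 129
print(count_mlp_parameters(10, [8, 4]))  # 129
```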

In [None]:
# Train the model
print("Starting training...")
trained_model = trainer.train(X_train, y_train)

## 5. Model Evaluation

In [None]:
# Initialize evaluator
evaluator = ModelEvaluator(config)

# Evaluate on the training data (in-sample metrics, not a held-out estimate)
metrics = evaluator.evaluate_model(trained_model, X_train, y_train)

print("\nFinal Training Metrics:")
for metric, value in metrics.items():
    if metric == 'rmse':
        print(f"  {metric.upper()}: ${value:,.2f}")
    else:
        print(f"  {metric.upper()}: {value:.5f}")
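
Because the target is log-transformed, an RMSE can be reported on either scale, and the two numbers are not interchangeable. The `$` formatting above is only meaningful if `evaluate_model` converts predictions back to dollars first, which this notebook does not show. The sketch below uses made-up values and a hypothetical `rmse` helper to compute both.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up prices and predictions in dollars.
actual = [100_000.0, 200_000.0, 300_000.0]
predicted = [110_000.0, 190_000.0, 330_000.0]

rmse_dollars = rmse(actual, predicted)
rmse_log = rmse(
    [math.log1p(v) for v in actual],
    [math.log1p(v) for v in predicted],
)

print(f"RMSE (dollars):   ${rmse_dollars:,.2f}")
print(f"RMSE (log scale): {rmse_log:.5f}")
```

The log-scale value penalizes relative errors, so a $10,000 miss on a $100,000 house counts for more than the same miss on a $300,000 house.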

## 6. Cross-Validation

In [None]:
# Perform cross-validation
print("Performing cross-validation...")
cv_scores = evaluator.cross_validate(X_train, y_train)

print("\nCross-Validation Summary:")
print(f"  Individual fold scores: {[f'{score:.5f}' for score in cv_scores]}")
print(f"  Mean ± Std: {np.mean(cv_scores):.5f} ± {np.std(cv_scores):.5f}")
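
`cross_validate` is implemented in `src/evaluator.py` and is not reproduced here. Assuming it performs standard k-fold cross-validation, the index bookkeeping looks roughly like the sketch below. The function name, fold count, and shuffling choice are illustrative and not taken from the project.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, n_folds: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) pairs for shuffled k-fold cross-validation."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    # Spread the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val
        start = stop


# Each sample lands in exactly one validation fold.
for train_idx, val_idx in k_fold_indices(n_samples=10, n_folds=3):
    print(len(train_idx), len(val_idx))
```

A fresh model would be trained on each `train` split and scored on the matching `val` split, giving one score per fold as in the summary above.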

## 7. Model Persistence

In [None]:
# Save the trained model
model_path = 'models/best_model_v4.pth'
trainer.save_model(model_path)

# Demonstrate loading (optional)
# new_model = HousingPriceModel(X_train.shape[1], config)
# new_trainer = ModelTrainer(new_model, config)
# loaded_model = new_trainer.load_model(model_path)

## 8. Generate Predictions

In [None]:
# Make predictions on test set
test_predictions = trainer.predict(X_test)

# Convert back to the original dollar scale (expm1 inverts a log1p target transform)
test_predictions_original = np.expm1(test_predictions.flatten())

print("Test Predictions Summary:")
print(f"  Number of predictions: {len(test_predictions_original)}")
print(f"  Price range: ${test_predictions_original.min():,.0f} - ${test_predictions_original.max():,.0f}")
print(f"  Mean price: ${test_predictions_original.mean():,.0f}")

# Create submission file
submission = pd.DataFrame({
    'Id': test_data['Id'],
    'SalePrice': test_predictions_original
})

submission.to_csv('submissions/submission_v4_notebook.csv', index=False)
print("\nSubmission saved: submissions/submission_v4_notebook.csv")

# Display first few predictions
display(submission.head(10))

## 9. Comparison with V3

### V3 Characteristics:
- Single monolithic notebook
- All code in one place
- Manual configuration via variables
- Limited reusability
- Harder to maintain and debug

### V4 Improvements:
- **Modular Design**: Clear separation of concerns
- **Configuration Management**: YAML-based settings
- **Reusability**: Components can be used independently
- **Error Handling**: Better validation and error messages
- **Testing**: Each module can be tested separately
- **CLI Interface**: Can be run from the command line
- **Production Ready**: Suitable for deployment

### Benefits:
1. **Maintainability**: Easier to update and debug individual components
2. **Scalability**: Can easily add new features or models
3. **Collaboration**: Multiple developers can work on different modules
4. **Testing**: Each component can be unit tested
5. **Deployment**: Ready for production environments
6. **Experimentation**: Easy to try different configurations

## 10. Using V4 from Command Line

You can also run V4 directly from the command line:

```bash
# Train the model
python main.py --mode train

# Make predictions with a saved model
python main.py --mode predict --model-path models/best_model.pth

# Run only cross-validation
python main.py --mode evaluate

# Use custom configuration
python main.py --config custom_config.yaml --mode train
```

This makes V4 suitable for:
- **Automated pipelines**: Can be scheduled or triggered automatically
- **Batch processing**: Process multiple datasets or configurations
- **Production deployment**: Integrate into larger ML systems
- **Experimentation**: Easy A/B testing with different configurations
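
`main.py` is not shown in this notebook. A minimal sketch of how its argument parsing could look with `argparse`, using only the flags that appear in the commands above, follows. The default values, the help strings, and the rule that `predict` requires `--model-path` are assumptions.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags used in the commands above (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Housing price prediction (sketch)")
    parser.add_argument(
        "--mode",
        choices=["train", "predict", "evaluate"],
        default="train",
        help="pipeline stage to run",
    )
    parser.add_argument("--config", default="config.yaml", help="path to a YAML config file")
    parser.add_argument("--model-path", default=None, help="path to a saved model checkpoint")
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse arguments and apply cross-flag validation."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rule: predicting needs a saved model, so fail fast with a clear message.
    if args.mode == "predict" and args.model_path is None:
        parser.error("--model-path is required when --mode is 'predict'")
    return args


args = parse_args(["--mode", "predict", "--model-path", "models/best_model.pth"])
print(args.mode, args.config, args.model_path)
```

Restricting `--mode` with `choices` means a mistyped mode is rejected by the parser with a usage message rather than surfacing later in the pipeline.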

## Summary

V4 represents a significant evolution from the notebook-based approach of V3:

- **Structure**: Organized into logical, reusable modules
- **Configuration**: Externalized settings for easy experimentation
- **Interface**: Both notebook and CLI interfaces available
- **Maintenance**: Much easier to update, debug, and extend
- **Production**: Ready for real-world deployment scenarios

The modular design makes it easy to:
- Swap out different models or preprocessing steps
- Add new evaluation metrics or visualizations
- Integrate with MLOps tools and pipelines
- Scale to larger datasets or distributed training

This architecture provides a solid foundation for further development and production use.