A comprehensive collection of Jupyter notebooks demonstrating how to debug common machine learning problems. Each notebook provides hands-on examples, visualizations, and practical solutions to help you identify and fix issues in your ML models.
Learn to identify and fix bias-variance tradeoff problems:
- Topics Covered:
- Detecting overfitting vs underfitting using learning curves
- Impact of model complexity on performance
- Regularization techniques (L1/Lasso, L2/Ridge)
- Proper model selection and validation
- Key Techniques: Learning curves, cross-validation, regularization (see the sketch below)
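As a quick illustration of the learning-curve diagnosis, here is a minimal sketch using a synthetic dataset and a deliberately overfit-prone decision tree; it is illustrative only, not the notebook's exact code:

```python
# Minimal sketch: diagnosing overfitting with learning curves.
# Dataset and model are illustrative stand-ins for the notebook's examples.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# An unconstrained tree tends to overfit: training accuracy stays near 1.0
# while the cross-validated accuracy lags well behind it.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5), scoring="accuracy",
)

plt.plot(sizes, train_scores.mean(axis=1), "o-", label="training")
plt.plot(sizes, val_scores.mean(axis=1), "o-", label="cross-validation")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```

A large, persistent gap between the two curves points to overfitting; two curves that plateau at a low score point to underfitting.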
Understand and prevent data leakage that inflates model performance:
- Topics Covered:
- Target leakage identification
- Train-test contamination
- Temporal leakage in time-series
- Proper data splitting and preprocessing
- Key Techniques: Correlation analysis, pipeline usage, temporal validation (see the sketch below)
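A minimal sketch of the pipeline approach, assuming a synthetic dataset: wrapping preprocessing in a scikit-learn `Pipeline` ensures the scaler is fit on the training folds only during cross-validation, which prevents train-test contamination.

```python
# Minimal sketch: fit preprocessing inside a Pipeline so scaler statistics
# come from the training folds only. Dataset is a synthetic placeholder.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Leaky approach (avoid): calling scaler.fit_transform(X) before splitting
# lets validation-fold statistics influence the training data.
# Correct approach: the pipeline refits the scaler inside each CV split.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leakage-free CV accuracy: {scores.mean():.3f}")
```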
Debug gradient problems in deep neural networks:
- Topics Covered:
- Identifying vanishing/exploding gradients
- Weight initialization strategies
- Batch normalization
- Gradient clipping
- Key Techniques: Gradient monitoring, proper initialization (He/Xavier), ReLU activation (see the sketch below)
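A minimal sketch of gradient monitoring with TensorFlow/Keras, assuming a small illustrative architecture rather than the notebook's exact model:

```python
# Minimal sketch: inspect per-layer gradient norms in a ReLU network that
# uses He initialization. Architecture and data are illustrative.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()

X = np.random.randn(32, 20).astype("float32")
y = np.random.randint(0, 2, size=(32, 1)).astype("float32")

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(X, training=True))
grads = tape.gradient(loss, model.trainable_variables)

# Norms collapsing toward 0 suggest vanishing gradients; very large norms
# suggest exploding gradients (consider clipping, e.g. Adam(clipnorm=1.0)).
for var, g in zip(model.trainable_variables, grads):
    print(f"{var.name}: {tf.norm(g).numpy():.2e}")
```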
Handle imbalanced datasets effectively:
- Topics Covered:
- Detecting class imbalance problems
- Resampling techniques (SMOTE, undersampling)
- Class weights and cost-sensitive learning
- Appropriate metrics for imbalanced data
- Key Techniques: SMOTE, class weights, ROC-AUC, precision-recall curves (see the sketch below)
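A minimal sketch contrasting SMOTE resampling with class weighting on synthetic imbalanced data; the model and metric choices are illustrative, not the notebook's exact setup:

```python
# Minimal sketch: two common fixes for class imbalance on synthetic data --
# SMOTE oversampling (imbalanced-learn) and cost-sensitive class weights.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Option 1: resample the training set only (never the test set).
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Option 2: cost-sensitive learning via class weights, no resampling needed.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_weighted.fit(X_train, y_train)

# Evaluate with a metric suited to imbalance (average precision ~ PR curve).
for name, clf in [("SMOTE", clf_smote), ("class_weight", clf_weighted)]:
    ap = average_precision_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: average precision = {ap:.3f}")
```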
Master feature scaling for better model performance:
- Topics Covered:
- When and why to scale features
- Different scaling techniques (StandardScaler, MinMaxScaler, RobustScaler)
- Impact on various algorithms
- Common scaling mistakes
- Key Techniques: StandardScaler, MinMaxScaler, RobustScaler, proper pipeline usage (see the sketch below)
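A minimal sketch of leakage-free scaling, assuming synthetic data with a few injected outliers to hint at why RobustScaler can help:

```python
# Minimal sketch: fit the scaler on the training split only, then apply the
# same transform to the test split. Data and scaler choices are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X[:5] *= 10  # a few outliers, where RobustScaler is less distorted
X_train, X_test = train_test_split(X, random_state=0)

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaler.fit(X_train)                        # learn statistics from train only
    X_test_scaled = scaler.transform(X_test)   # never fit_transform on the test split
    print(type(scaler).__name__, np.round(X_test_scaled.mean(axis=0), 2))
```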
- Python 3.8 or higher
- pip package manager
- Clone this repository: `git clone https://github.com/macanderson/MLLanguageModels.git`, then `cd MLLanguageModels`
- Install required dependencies: `pip install -r requirements.txt`
- Launch Jupyter Notebook: `jupyter notebook`
- Open any notebook from the `notebooks/` directory and start learning!
The notebooks use the following main libraries:
- NumPy - Numerical computing
- Pandas - Data manipulation
- Scikit-learn - Machine learning algorithms
- TensorFlow - Deep learning (for gradient problems)
- Matplotlib & Seaborn - Visualization
- imbalanced-learn - Handling imbalanced datasets
See requirements.txt for the complete list.
Recommended order for beginners:
- Start with Overfitting/Underfitting to understand model performance basics
- Learn Feature Scaling to prepare data properly
- Study Data Leakage to avoid common pitfalls
- Tackle Class Imbalance for real-world scenarios
- Explore Gradient Problems for deep learning applications
For experienced practitioners:
- Jump to any notebook based on your current debugging needs
- Each notebook is self-contained with complete examples
Each notebook follows a consistent structure:
- Problem Overview - What the issue is and why it matters
- Symptoms - How to recognize the problem
- Hands-on Examples - Code demonstrating the problem
- Solutions - Multiple approaches to fix the issue
- Best Practices - Guidelines to prevent future occurrences
- Debugging Checklist - Quick reference for troubleshooting
- Exercises - Practice problems to reinforce learning
These notebooks are perfect for:
- Students learning machine learning fundamentals
- Data Scientists debugging model performance issues
- ML Engineers implementing production-ready models
- Researchers understanding common pitfalls
- Interview Preparation for ML/DS roles
Contributions are welcome! If you have suggestions for:
- Additional debugging scenarios
- Improved explanations
- New visualization techniques
- Bug fixes
Please open an issue or submit a pull request.
This project is open source and available under the MIT License.
For questions or feedback, please open an issue on GitHub.
Happy Debugging! 🐛🔧