# **<ins>Module 1: Data Collection - The Foundation of Data Science</ins>**
* <ins>Data collection</ins> is the first and most crucial step in the Data Science lifecycle.
* It serves as the foundation for every subsequent stage, as the <ins>quality</ins>, <ins>accuracy</ins>, and <ins>reliability</ins> of your data directly impact the results of your analysis and machine-learning models.
* Without good data, even the most advanced algorithms and models will fail to deliver meaningful insights.

### **<ins>What is Data Collection?</ins>**
* Data collection is the systematic process of gathering raw data from various sources (databases, APIs, websites, surveys, *etc.*) in order to analyze and extract valuable insights.
* The goal is to ensure that the collected data is <ins>relevant</ins>, <ins>accurate</ins>, and <ins>usable</ins> for analysis or training machine-learning models.

### **<ins>Why is Data Collection important?</ins>**
* <ins>Foundation for Decision-Making</ins>: Reliable data allows businesses and organizations to make informed, data-driven decisions.
* <ins>Model Performance</ins>: Inaccurate or incomplete data can result in poorly performing machine-learning models.
* <ins>Understanding Trends</ins>: Data helps identify patterns, behaviors, and market trends.
* <ins>Problem-Solving</ins>: Proper data collection helps identify areas for improvement or optimization in processes.
* <ins>Accountability</ins>: Transparent data collection practices ensure credibility and reproducibility in research and business analytics.

### **<ins>Types of Data in Data Collection</ins>**
* <ins>Structured Data</ins>: Organized data stored in rows and columns, often in spreadsheets or relational databases (Excel, PostgreSQL, *etc.*).
* <ins>Unstructured Data</ins>: Raw data without a predefined format, such as text, images, audio, and videos.
* <ins>Semi-Structured Data</ins>: Data that has some level of organization but isn't fully structured (*e.g.* JSON, XML files, emails, *etc.*).
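
The difference between semi-structured and structured data can be sketched in a few lines. The example below is a minimal illustration, not a prescribed workflow: the JSON records and the column names (`id`, `name`, `email`, `tags`) are hypothetical, and only the Python standard library is used.

```python
import csv
import io
import json

# Hypothetical semi-structured input: records share some keys but not all.
raw_json = """
[
  {"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}},
  {"id": 2, "name": "Lin", "tags": ["vip"]}
]
"""

records = json.loads(raw_json)

# Flatten each record into a fixed set of columns; missing fields become "".
columns = ["id", "name", "email", "tags"]
rows = []
for rec in records:
    rows.append({
        "id": rec["id"],
        "name": rec["name"],
        "email": rec.get("contact", {}).get("email", ""),
        "tags": ";".join(rec.get("tags", [])),
    })

# Write the structured result as CSV text (rows and columns).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
structured_csv = buffer.getvalue()
print(structured_csv)
```

Every row now has the same four columns, which is what makes the result "structured" in the sense used above.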

### **<ins>Data Collection Methods</ins>**
* <ins>Manual Data Collection</ins>: Data is manually gathered via surveys, interviews, or direct observation. Common in research and customer feedback analysis.
* <ins>Automated Data Collection</ins>: Data is collected automatically via web scraping, APIs, IoT devices, or automated tools.
* <ins>Web Scraping</ins>: Extracting data from websites using libraries like BeautifulSoup or Scrapy in Python.
* <ins>APIs (Application Programming Interfaces)</ins>: APIs allow systems to communicate and exchange data seamlessly.
* <ins>Sensor Data Collection</ins>: IoT devices gather real-time data, such as temperature sensors or fitness trackers.
* <ins>Transaction Data</ins>: Data from e-commerce systems, financial transactions, and point-of-sale systems.
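
As a rough sketch of the web scraping method listed above, the example below extracts items from an HTML fragment. It is an illustration under assumptions: the HTML string and the `product` class are made up, the page is inlined rather than downloaded, and the standard library's `html.parser` stands in for the higher-level libraries named above so that the snippet stays self-contained.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
html_doc = """
<ul>
  <li class="product">Keyboard</li>
  <li class="product">Mouse</li>
  <li class="ad">Sponsored</li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of every <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "li" and dict(attrs).get("class") == "product":
            self.in_product = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a product item.
        if self.in_product and data.strip():
            self.products.append(data.strip())


parser = ProductParser()
parser.feed(html_doc)
print(parser.products)
```

Before scraping a real site, check its terms of use and the privacy rules discussed under the challenges below.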

### **<ins>Common Data Sources</ins>**
* Databases, APIs, Web Scraping, Public Datasets, Logs, Surveys and Questionnaires

### **<ins>Challenges in Data Collection</ins>**
* <ins>Data Quality</ins>: Ensuring data is clean, relevant, and error-free.
* <ins>Data Privacy</ins>: Complying with laws like GDPR and CCPA to protect user data.
* <ins>Scalability</ins>: Collecting and managing large volumes of data efficiently.
* <ins>Data Integration</ins>: Merging data from multiple sources into a consistent format.
* <ins>Real-Time Data Collection</ins>: Capturing and processing live data streams.

### **<ins>Best Practices for Data Collection</ins>**
* <ins>Define Objectives</ins>: Be clear about what data you need and why you need it.
* <ins>Ensure Data Accuracy</ins>: Validate and cross-check data sources.
* <ins>Use Reliable Sources</ins>: Trust verified datasets and APIs.
* <ins>Automate Where Possible</ins>: Use scripts or APIs to reduce manual errors.
* <ins>Follow Ethical Guidelines</ins>: Always respect user privacy and comply with regulations.
* <ins>Backup Your Data</ins>: Regularly back up collected data to prevent loss.

# **<ins>Module 2: Data Cleaning and Preprocessing - Turning Raw Data into Usable Insights</ins>**
* <ins>Data Cleaning and Preprocessing</ins> is the second critical stage in the data science workflow.
* Raw data is often messy, inconsistent, and filled with errors, missing values, or duplicate entries.

### **<ins>What is Data Cleaning and Preprocessing?</ins>**
* Data Cleaning and Preprocessing involve identifying, correcting, and preparing raw data to make it usable for analysis and modeling.
    * This process ensures that the data is accurate, consistent, and complete, removing any biases or errors that might mislead analysis or affect the performance of machine learning models.
* Real-world data is rarely perfect - it may have missing values, outliers, duplicates, incorrect formats, or inconsistencies. Cleaning and preprocessing aim to handle these problems systematically.

### **<ins>Why is Data Cleaning important?</ins>**
* <ins>Improves Model Performance</ins>: Clean data ensures accurate predictions and prevents misleading results.
* <ins>Reduces Bias</ins>: Eliminates errors that could create unintended biases in machine-learning models.
* <ins>Enhances Data Usability</ins>: Structured data is easier to interpret and analyze.
* <ins>Reduces Noise</ins>: Outliers and irrelevant data points are removed to ensure clarity.
* <ins>Saves Resources</ins>: Working with clean data reduces computational load and prevents unnecessary complexity in analysis.

### **<ins>Key Concepts in Data Cleaning and Preprocessing</ins>**
1. <ins>Handling Missing Values</ins>: Missing data is one of the most common issues in datasets.
    * Methods to handle missing values include:
        * <ins>Imputation</ins>: Replacing missing values with the **mean**, **median**, or **mode**.
        * <ins>Dropping Missing Values</ins>: Removing rows or columns with excessive missing data.
2. <ins>Removing Duplicates</ins>: Duplicate entries can skew analysis and lead to misleading insights.
3. <ins>Outlier Detection and Treatment</ins>: Outliers can distort statistical measures. Techniques include:
    * Z-Score Analysis
    * IQR (Interquartile Range) Analysis
4. <ins>Data Normalization and Standardization</ins>: Scaling numerical features ensures consistency across data points, especially for algorithms sensitive to magnitude (*e.g.* KNN, Gradient Descent, *etc.*).
    * <ins>Normalization</ins>: Scale data to a [0, 1] range.
    * <ins>Standardization</ins>: Transform data to have a mean of 0 and a standard deviation of 1.
5. <ins>Handling Inconsistent Data</ins>: Standardizing formats, fixing typos, and ensuring uniform conventions (*e.g.* date formats, categorical values, *etc.*).
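
Several of the concepts above can be sketched together on a tiny dataset. This is a minimal illustration, assuming hypothetical records with `name`, `age`, and `city` fields and using only the standard library. The choice of median imputation is an assumption for the example, not a general recommendation.

```python
import statistics

# Hypothetical raw records: one missing age, one exact duplicate, messy city names.
raw = [
    {"name": "Ada", "age": 36, "city": "london "},
    {"name": "Lin", "age": None, "city": "Paris"},
    {"name": "Sam", "age": 50, "city": "PARIS"},
    {"name": "Ada", "age": 36, "city": "london "},
]

# Inconsistent data: trim whitespace and unify capitalization.
cleaned = [{**r, "city": r["city"].strip().title()} for r in raw]

# Duplicates: keep the first occurrence of each record.
seen = set()
deduped = []
for r in cleaned:
    key = (r["name"], r["age"], r["city"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Missing values: impute with the median of the observed ages.
observed = [r["age"] for r in deduped if r["age"] is not None]
median_age = statistics.median(observed)
for r in deduped:
    if r["age"] is None:
        r["age"] = median_age

ages = [r["age"] for r in deduped]

# Normalization: rescale to the [0, 1] range.
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]

# Standardization: mean 0 and standard deviation 1 (population formula).
mean = statistics.fmean(ages)
std = statistics.pstdev(ages)
standardized = [(a - mean) / std for a in ages]

print(deduped)
print(normalized)
```

On real data, a constant column would make `hi - lo` or `std` zero, so production code needs a guard for that case.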

### **<ins>Best Practices for Data Cleaning and Preprocessing</ins>**
* <ins>Understand the Dataset</ins>: Start with exploratory data analysis (EDA).
* <ins>Document Every Step</ins>: Keep track of the changes you make to the data.
* <ins>Handle Missing Values Wisely</ins>: Choose imputation techniques based on the nature of the data.
* <ins>Beware of Over-Cleaning</ins>: Don't remove too much data (it may result in losing valuable information).
* <ins>Automate with Pipelines</ins>: Create reusable preprocessing pipelines for consistent results.
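
One way to read "reusable preprocessing pipeline" is as a fixed sequence of small steps applied in order. The sketch below assumes that reading. The helper names (`drop_missing`, `clip_to_range`, `min_max_scale`, `make_pipeline`) are hypothetical and written for this example only.

```python
from functools import reduce


def drop_missing(values):
    """Remove None entries."""
    return [v for v in values if v is not None]


def clip_to_range(low, high):
    """Return a step that clips values into [low, high]."""
    def step(values):
        return [min(max(v, low), high) for v in values]
    return step


def min_max_scale(values):
    """Rescale values to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def make_pipeline(*steps):
    """Compose steps left to right into a single callable."""
    def run(values):
        return reduce(lambda data, step: step(data), steps, values)
    return run


# The same pipeline object can be reused on every new batch of data.
preprocess = make_pipeline(drop_missing, clip_to_range(0, 100), min_max_scale)
print(preprocess([20, None, 250, 60, -5]))
```

Because the steps are defined once and applied in the same order every time, two batches cleaned with `preprocess` are treated identically, which is the consistency the bullet above asks for.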

# **<ins>Module 3: Data Exploration and Analysis (EDA)</ins>**
* <ins>Data Exploration and Analysis (EDA)</ins> is one of the most critical stages in the Data Science workflow.
* EDA serves as a bridge between raw data and actionable insights, allowing data scientists to understand data patterns, relationships, and anomalies before building models.
    * Involves summarizing data, visualizing trends, and forming hypotheses that guide the rest of the analysis or machine-learning process.

### **<ins>What is Exploratory Data Analysis (EDA)?</ins>**
* EDA is the process of examining datasets to summarize their key characteristics using statistical techniques and visualization tools.
* It's about asking questions, identifying patterns, uncovering relationships between variables, and detecting anomalies or outliers.
* EDA is iterative and investigative, often revealing insights that might not be obvious at first glance. At its core, EDA aims to: 
    * Understand the structure and quality of the data.
    * Identify patterns, trends, and anomalies.
    * Validate assumptions and hypotheses.
    * Decide on the best preprocessing techniques and model choices.

### **<ins>Why is EDA important?</ins>**
* <ins>Understand Data Distribution</ins>: Identify how variables are distributed (normal, skewed, *etc.*).
* <ins>Identify Outliers and Anomalies</ins>: Detect extreme or unusual values that could impact modeling.
* <ins>Spot Missing Values</ins>: Understand where and why data might be missing.
* <ins>Form Hypotheses</ins>: Generate assumptions about relationships between variables.
* <ins>Feature Selection</ins>: Identify the most important features for analysis.
* <ins>Prevent Costly Mistakes</ins>: Ensure that data is well-prepared before building predictive models.

### **<ins>Key Concepts in EDA</ins>**
* <ins>Data Summary and Descriptive Statistics</ins>
    * <ins>Statistical Measures</ins>: Mean, median, mode, variance, standard deviation
    * <ins>Data Distribution</ins>: Histograms, density plots, and box plots to visualize variable distributions
* <ins>Data Visualization</ins>:
    * <ins>Univariate Analysis</ins>: Analyzing one variable at a time (*e.g.* bar plots, histograms)
    * <ins>Bivariate Analysis</ins>: Exploring relationships between two variables (*e.g.* scatter plots, heatmaps)
    * <ins>Multivariate Analysis</ins>: Analyzing relationships among multiple variables
* <ins>Outlier Detection</ins>
    * Outliers can distort analysis. Techniques to detect outliers:
        * Z-Score Analysis
        * IQR (Interquartile Range) Method
* <ins>Correlation Analysis</ins>
    * <ins>Correlation Matrix</ins>: Understand relationships between numerical features.
    * <ins>Heatmap</ins>: Visualize correlations graphically.
* <ins>Missing Data Analysis</ins>
    * Understand where data is missing and decide on strategies: drop, impute, or flag.
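
The statistical side of these concepts can be sketched without any plotting. The example below is a minimal illustration on hypothetical data (hours studied versus exam score). The z-score cutoff of 2.0 is an assumption chosen for this small sample, and the Pearson correlation is written out by hand so the snippet needs only the standard library.

```python
import math
import statistics

# Hypothetical sample: hours studied vs. exam score, with one unusual score.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 73, 79, 20]

# Descriptive statistics for one variable.
summary = {
    "mean": statistics.fmean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),
}

# IQR method: flag points more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [s for s in scores if s < lower or s > upper]

# Z-score method: flag points far from the mean in standard-deviation units.
# The cutoff 2.0 is an assumption for this tiny sample.
z_outliers = [
    s for s in scores
    if abs((s - summary["mean"]) / summary["stdev"]) > 2.0
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Correlation with and without the flagged point.
kept = [(h, s) for h, s in zip(hours, scores) if s not in iqr_outliers]
r_all = pearson(hours, scores)
r_without = pearson([h for h, _ in kept], [s for _, s in kept])

print(summary)
print(iqr_outliers, z_outliers)
print(round(r_all, 3), round(r_without, 3))
```

Comparing `r_all` with `r_without` shows how a single outlier can hide an otherwise strong relationship, which is why outlier checks usually come before correlation analysis.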

### **<ins>Best Practices for EDA</ins>**
* <ins>Ask Clear Questions</ins>: Know the objective behind the analysis.
* <ins>Start Simple</ins>: Begin with descriptive statistics before moving to complex visualizations.
* <ins>Document Your Findings</ins>: Keep detailed notes and visualizations.
* <ins>Iterate Frequently</ins>: Go back and forth between visualizations and summaries.
* <ins>Focus on Storytelling</ins>: Translate data insights into actionable business recommendations.

# **<ins>Module 4: Feature Engineering - Transforming Data into Insights</ins>**
* <ins>Feature Engineering</ins> is often considered the heart of data science and machine learning.
    * It bridges the gap between raw data and model performance by creating, selecting, and optimizing features that enable algorithms to make accurate predictions.
    * In essence, better features mean better models.

### **<ins>What is Feature Engineering?</ins>**
* Feature Engineering is the process of selecting, transforming, or creating new features (variables) from raw data to improve the performance of machine-learning models.
* Features are the input variables that an algorithm uses to make predictions, and their quality directly affects the model's accuracy and reliability.
* Imagine building a house: data is the raw material, the algorithm is the architect, and features are the building blocks.
* Well-engineered features ensure a solid foundation for your model.

### **<ins>Why is Feature Engineering important?</ins>**
* <ins>Improves Model Accuracy</ins>: Well-crafted features can significantly boost model performance.
* <ins>Reduces Noise</ins>: Eliminate irrelevant or redundant information.
* <ins>Handles Complex Relationships</ins>: Create features that capture hidden patterns in data.
* <ins>Simplifies Models</ins>: Better features can reduce the need for overly complex models.
* <ins>Boosts Interpretability</ins>: Meaningful features make it easier to understand model predictions. 

### **<ins>Key Concepts in Feature Engineering</ins>**
* <ins>Feature Creation</ins>
    * Combine or extract information from existing features to create new ones.
    * *Ex.*: From a date column, create day, month, and year as separate features.
* <ins>Handling Categorical Features</ins>
    * <ins>One-Hot Encoding</ins>: Create binary columns for each category.
    * <ins>Label Encoding</ins>: Assign a unique integer to each category.
* <ins>Handling Numerical Features</ins>
    * <ins>Scaling</ins>: Adjust numerical values to a specific range (*e.g.* 0 to 1).
    * <ins>Standardization</ins>: Center data around zero with unit variance.
* <ins>Handling Missing Data in Features</ins>
    * Impute missing values with statistical measures like mean, median, or mode.
* <ins>Feature Transformation</ins>
    * <ins>Log Transformation</ins>: Reduces the effect of extreme values.
    * <ins>Polynomial Features</ins>: Add powers and interaction terms of existing features so a model can capture non-linear relationships.
* <ins>Feature Selection Techniques</ins>
    * <ins>Filter Methods</ins>: Correlation, Chi-Square test
    * <ins>Wrapper Methods</ins>: Recursive Feature Elimination (RFE)
    * <ins>Embedded Methods</ins>: LASSO Regression, Tree-based Importance
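
A few of these techniques fit in one short sketch: feature creation from a date, label encoding, one-hot encoding, and a log transformation. The rows and field names (`order_date`, `channel`, `amount`) are hypothetical, and `log1p` is one possible choice of log transform.

```python
import math
from datetime import date

# Hypothetical raw rows: an order date, a category, and a skewed amount.
rows = [
    {"order_date": "2024-01-15", "channel": "web", "amount": 10.0},
    {"order_date": "2024-03-02", "channel": "store", "amount": 1000.0},
    {"order_date": "2024-03-30", "channel": "web", "amount": 100.0},
]

# Label encoding: assign one integer per category (sorted for repeatability).
categories = sorted({r["channel"] for r in rows})
label_codes = {cat: i for i, cat in enumerate(categories)}

features = []
for r in rows:
    d = date.fromisoformat(r["order_date"])
    feat = {
        # Feature creation: split the date into separate parts.
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "channel_code": label_codes[r["channel"]],
        # Log transformation: log1p compresses large values and is defined at 0.
        "log_amount": math.log1p(r["amount"]),
    }
    # One-hot encoding: one binary column per category.
    for cat in categories:
        feat[f"channel_{cat}"] = 1 if r["channel"] == cat else 0
    features.append(feat)

for feat in features:
    print(feat)
```

Label encoding implies an order between categories (here `store` < `web`), which some models will treat as meaningful. One-hot encoding avoids that at the cost of extra columns.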

### **<ins>Best Practices for Feature Engineering</ins>**
* <ins>Understand Your Data</ins>: Know what each feature represents and how it impacts the target variable.
* <ins>Avoid Data Leakage</ins>: Ensure that target-related information doesn't leak into features during training.
* <ins>Iterate and Experiment</ins>: Try different transformations and observe model performance.
* <ins>Keep It Interpretable</ins>: Ensure features are meaningful and easy to understand.
* <ins>Use Domain Knowledge</ins>: Sometimes, the best features come from subject matter expertise.
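
Leakage can also enter through preprocessing. The sketch below shows one common form, assuming a simple train/test split: scaling statistics computed on all data let the held-out rows influence the training features. It illustrates leakage of held-out information rather than of the target itself, and the numbers are hypothetical.

```python
import statistics

# Hypothetical feature values, already split into training and test portions.
train = [10.0, 12.0, 14.0, 16.0]
test = [18.0, 40.0]

# Fit: learn scaling statistics from the training data only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)


def standardize(values, mean, std):
    """Apply previously learned statistics; nothing is re-fitted here."""
    return [(v - mean) / std for v in values]


# Transform both portions with the training statistics.
train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)

# Leaky variant, shown for comparison only: the mean is computed on
# train + test together, so the test rows shift the training features.
leaky_mean = statistics.fmean(train + test)

print(train_mean, leaky_mean)
print(test_scaled)
```

The gap between `train_mean` and `leaky_mean` is the information that would have leaked from the test set into training.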

# **<ins>Module 5: Data Visualization - Communicating Insights Effectively</ins>**
* <ins>Data Visualization</ins> is the art of representing data visually to identify patterns, trends, and insights that are otherwise hidden in raw numbers.
* Whether you're presenting findings to stakeholders, building dashboards, or exploring data for analysis, visualization bridges the gap between raw data and actionable insights.

### **<ins>What is Data Visualization?</ins>**
* Data Visualization is the graphical representation of information and data.
* Using visual elements like charts, graphs, maps, and dashboards, it simplifies complex data into easily digestible insights.
* The goal of data visualization is to:
    * Simplify complex data.
    * Identify patterns, relationships, and outliers.
    * Communicate results effectively to both technical and non-technical audiences.
    * Support data-driven decision-making.

### **<ins>Why is Data Visualization important?</ins>**
* <ins>Improved Understanding</ins>: Visuals simplify complex datasets for better comprehension.
* <ins>Quick Insights</ins>: Patterns and trends are immediately apparent.
* <ins>Enhanced Decision-Making</ins>: Clear visual insights drive informed business strategies.
* <ins>Storytelling with Data</ins>: Visuals tell compelling stories that resonate with stakeholders.
* <ins>Error Detection</ins>: Spot anomalies and inconsistencies quickly.

### **<ins>Key Concepts in Data Visualization</ins>**
* Types of Data Visualizations
    * <ins>Line Chart</ins>: For showing trends over time.
    * <ins>Bar Chart</ins>: For comparing categories.
    * <ins>Scatter Plot</ins>: For showing relationships between two numerical variables.
    * <ins>Histogram</ins>: For understanding the distribution of numerical data.
    * <ins>Heatmap</ins>: For showing correlations in matrix form.
* Data Visualization Tools
    * <ins>Matplotlib</ins>: The foundational Python library for static plots.
    * <ins>Seaborn</ins>: Built on Matplotlib, ideal for advanced statistical visualizations.
    * <ins>Plotly</ins>: For interactive and dynamic visualizations.
    * <ins>Tableau and Power BI</ins>: Tools for enterprise-level dashboards and interactive reporting.
* Exploratory vs. Explanatory Visualization
    * <ins>Exploratory Visualization</ins>: Used for analyzing datasets to uncover insights (*e.g.* scatter plots, heatmaps).
    * <ins>Explanatory Visualization</ins>: Used for presenting insights to an audience (*e.g.* dashboards, pie charts).
* Principles of Effective Visualization
    * <ins>Clarity</ins>: Ensure your visuals are easy to interpret.
    * <ins>Accuracy</ins>: Represent data truthfully without distortion.
    * <ins>Simplicity</ins>: Avoid unnecessary elements or clutter.
    * <ins>Storytelling</ins>: Build a narrative around your visualizations.
    * <ins>Audience Awareness</ins>: Tailor visuals to your audience's level of expertise.
* Dashboard Design
    * Dashboards combine multiple visualizations to provide a comprehensive view of data.
        * Interactive Filters
        * Drill-Down Options
        * Real-Time Data Updates
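
Of the chart types listed above, the histogram is easy to sketch without a plotting library, since it is just a count of values per interval. The example below is a plain-text illustration on hypothetical values. A library such as Matplotlib would draw the same counts as bars. The `histogram` helper is written for this example only.

```python
# Hypothetical measurements to summarize.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]


def histogram(data, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(data), max(data)
    if hi == lo:
        # All values are identical: a single bin holds everything.
        return [len(data)], [lo, hi]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in data:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges


counts, edges = histogram(values, bins=4)

# Render one text bar per bin.
for i, count in enumerate(counts):
    print(f"{edges[i]:4.1f} to {edges[i + 1]:4.1f} | {'#' * count}")
```

Even in text form, the long second bar and the isolated last bar show a distribution concentrated at low values with one distant point.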

### **<ins>Best Practices for Data Visualization</ins>**
* <ins>Know Your Audience</ins>: Tailor the complexity of visuals based on your audience.
* <ins>Choose the Right Chart</ins>: Select visuals that best represent your data.
* <ins>Simplify and Focus</ins>: Remove clutter and emphasize key insights.
* <ins>Add Context</ins>: Use titles, labels, and legends to make your visual self-explanatory.
* <ins>Validate Your Visualizations</ins>: Ensure accuracy before presenting results.

# **<ins>Module 6: Machine Learning and Modeling - Building Intelligent Systems</ins>**
* <ins>Machine Learning and Modeling</ins> form the backbone of modern artificial intelligence, enabling systems to analyze data, identify patterns, and make data-driven predictions or decisions without explicit programming.

### **<ins>What is Machine Learning and Modeling?</ins>**
* At its core, Machine Learning (ML) is about enabling machines to learn from data and improve their performance over time.
* Modeling refers to building mathematical representations of real-world problems using machine learning algorithms.
* In simple terms:
    * <ins>Data</ins>: The raw information provided to the algorithm.
    * <ins>Model</ins>: A mathematical function representing the relationship between inputs and outputs.
    * <ins>Training</ins>: Feeding data into the model to enable it to learn patterns.
    * <ins>Prediction</ins>: Using the trained model to make predictions on new data.
* The key goal of machine learning is to find patterns and insights in data to solve problems like classification, regression, clustering, and anomaly detection.
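
The four terms above (data, model, training, prediction) map directly onto a few lines of code. The sketch below assumes the simplest possible model, a straight line fitted by ordinary least squares, and uses hypothetical house sizes and prices. The function names `train` and `predict` are chosen for this example.

```python
import statistics

# Data: hypothetical (size in square metres, price in thousands) pairs.
sizes = [50, 60, 80, 100, 120]
prices = [150, 180, 240, 300, 360]


def train(xs, ys):
    """Training: fit y = slope * x + intercept by ordinary least squares."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def predict(model, x):
    """Prediction: apply the learned function to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Model: the learned (slope, intercept) pair.
model = train(sizes, prices)
print(model)
print(predict(model, 90))
```

Here the "model" is nothing more than two numbers. More complex algorithms learn more parameters, but the train-then-predict pattern is the same.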

### **<ins>Why is Machine Learning and Modeling important?</ins>**
* <ins>Automation</ins>: Automates repetitive and complex tasks with high accuracy.
* <ins>Improved Decision-Making</ins>: Data-driven insights enhance strategic planning.
* <ins>Personalization</ins>: Powers recommendation engines and tailored user experiences.
* <ins>Efficiency</ins>: Optimizes workflows and reduces operational overhead.
* <ins>Real-Time Insights</ins>: Processes large volumes of data in real-time.
* From detecting fraud in banking to predicting diseases in healthcare, machine learning is revolutionizing every industry.

### **<ins>Key Concepts in Machine Learning and Modeling</ins>**
* Types of Machine Learning
    * <ins>Supervised Learning</ins>: The algorithm learns from labeled data. *Ex.*: Predicting house prices
    * <ins>Unsupervised Learning</ins>: The algorithm identifies patterns in unlabeled data. *Ex.*: Customer segmentation.
    * <ins>Reinforcement Learning</ins>: The model learns through trial-and-error, receiving rewards for optimal decisions. *Ex.*: Robotics, game agents
* Machine Learning Workflow
    * <ins>Data Collection</ins>: Gather relevant datasets.
    * <ins>Data Cleaning and Preprocessing</ins>: Handle missing values, standardize data, and remove outliers.
    * <ins>Feature Engineering</ins>: Create meaningful features from raw data.
    * <ins>Model Selection</ins>: Choose the right algorithm for the problem.
    * <ins>Model Training</ins>: Train the model on labeled data.
    * <ins>Evaluation</ins>: Measure performance using metrics like accuracy, precision, and recall.
    * <ins>Optimization</ins>: Fine-tune the model for better performance.
    * <ins>Deployment</ins>: Deploy the trained model into a production environment.
* Common Machine Learning Algorithms
    * <ins>Regression Algorithms</ins>: Linear Regression, Ridge Regression, Lasso Regression
    * <ins>Classification Algorithms</ins>: Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM)
    * <ins>Clustering Algorithms</ins>: K-Means, DBSCAN, Hierarchical Clustering
    * <ins>Ensemble Methods</ins>: Bagging (*e.g.* Random Forest), Boosting (*e.g.* Gradient Boosting, XGBoost)
* Model Evaluation Metrics
    * <ins>Regression Metrics</ins>: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-Squared (R<sup>2</sup>)
    * <ins>Classification Metrics</ins>: Accuracy, Precision, Recall, F1 Score, ROC-AUC Score
* Overfitting and Underfitting
    * <ins>Overfitting</ins>: The model fits the training data too closely, including its noise, and performs poorly on unseen data.
    * <ins>Underfitting</ins>: The model is too simplistic to capture data patterns.
    * **Solutions**: Cross-validation, Regularization techniques (*e.g.* L1, L2), Hyperparameter tuning
* Model Optimization
    * <ins>Hyperparameter Tuning</ins>: Adjust hyperparameters such as learning rate, tree depth, *etc.*
    * <ins>Grid Search</ins>: Exhaustively searches for the best combination of hyperparameters.
    * <ins>Random Search</ins>: Randomly tests a subset of hyperparameters.

### **<ins>Best Practices for Machine Learning and Modeling</ins>**
* <ins>Understand the Problem</ins>: Choose the right algorithm for the task.
* <ins>Clean and Preprocess Data</ins>: Ensure data quality before modeling.
* <ins>Avoid Overfitting</ins>: Use regularization and cross-validation.
* <ins>Choose Relevant Metrics</ins>: Use appropriate evaluation metrics for your task.
* <ins>Iterate and Experiment</ins>: Test multiple models and configurations.

# **<ins>Module 7: Model Evaluation and Validation - Ensuring Reliable Predictions</ins>**
* <ins>Model Evaluation and Validation</ins> is a critical phase in the machine learning pipeline where we measure a model's performance, reliability, and ability to generalize to unseen data.
* A model might perform exceptionally well on training data but fail in real-world scenarios if it isn't validated properly.

### **<ins>What is Model Evaluation and Validation?</ins>**
* Model Evaluation and Validation refer to the processes used to assess a model's performance and ensure its reliability.
* Evaluation measures how well a model performs on test data, while validation ensures it generalizes effectively to unseen data.
* <ins>Evaluation</ins>: Quantitative assessment using metrics like accuracy, precision, and recall.
* <ins>Validation</ins>: Techniques to ensure the model is not overfitting or underfitting.
* <ins>Goal</ins>: Build a model that performs consistently across different datasets.

### **<ins>Why is Model Evaluation and Validation important?</ins>**
* <ins>Prevents Overfitting and Underfitting</ins>: Ensures the model generalizes well to unseen data.
* <ins>Measures Accuracy and Reliability</ins>: Quantifies how well the model performs.
* <ins>Informs Model Selection</ins>: Helps compare multiple models and choose the best-performing one.
* <ins>Identifies Weaknesses</ins>: Highlights areas where the model struggles (*e.g.* class imbalance).
* <ins>Improves Trustworthiness</ins>: Stakeholders can trust models backed by robust evaluation techniques.
* Without proper evaluation, even a powerful model can become a liability in real-world scenarios.

### **<ins>Key Concepts in Model Evaluation and Validation</ins>**
* Train-Test Split
    * Split data into training and testing datasets (*e.g.* 80% for training, 20% for testing).
    * The training set is used to train the model, while the test set evaluates its performance.
* Evaluation Metrics
    * For Regression Models:
        * <ins>Mean Absolute Error (MAE)</ins>: Measures average absolute differences.
        * <ins>Mean Squared Error (MSE)</ins>: Penalizes larger errors.
        * <ins>R-Squared (R<sup>2</sup>)</ins>: Measures the proportion of variance in the target that is explained by the model.
    * For Classification Models:
        * <ins>Accuracy</ins>: Ratio of correct predictions to total predictions.
        * <ins>Precision</ins>: Proportion of positive predictions that were correct.
        * <ins>Recall</ins>: Proportion of actual positives correctly predicted.
        * <ins>F1-Score</ins>: Harmonic mean of precision and recall.
        * <ins>ROC-AUC Score</ins>: Measures how well the model distinguishes between classes.
* Confusion Matrix
    * A confusion matrix provides insights into how a classification model performs: True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
* Cross-Validation
    * Cross-validation ensures that your model's performance is consistent across different subsets of the data.
    * <ins>K-Fold Cross-Validation</ins>: Splits the data into K subsets (folds); each fold serves once as the test set while the remaining folds are used for training.
    * <ins>Stratified K-Fold</ins>: Ensures each fold maintains the same class distribution as the entire dataset.
* Overfitting and Underfitting
    * <ins>Overfitting</ins>: The model performs well on training data but poorly on unseen data.
    * <ins>Underfitting</ins>: The model is too simplistic to capture relationships in the data.
    * **Solutions**:
        * Use Regularization techniques (L1, L2).
        * Reduce model complexity or increase training data.
        * Apply cross-validation to validate model performance.
* Bias-Variance Tradeoff
    * <ins>Bias</ins>: Error due to overly simplistic assumptions in the model.
    * <ins>Variance</ins>: Error due to sensitivity to small fluctuations in the training set.
    * <ins>Goal</ins>: Find a balance to minimize total error.
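
The metric definitions above can be checked by hand on a small example. The sketch below computes the confusion-matrix counts and the derived classification metrics, then the three regression metrics. All labels and predictions are hypothetical.

```python
# Hypothetical binary labels (1 = positive) and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# Confusion matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

# Classification metrics derived from the counts.
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical regression targets and predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
errors = [a - p for a, p in zip(actual, predicted)]

# Regression metrics.
mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e * e for e in errors) / len(errors)
mean_actual = sum(actual) / len(actual)
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(tp, tn, fp, fn)
print(accuracy, precision, recall, round(f1, 3))
print(mae, mse, r2)
```

In this example precision (0.6) is lower than recall (0.75) because the model makes more false-positive than false-negative errors. Accuracy alone would not reveal that.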

### **<ins>Best Practices for Model Evaluation and Validation</ins>**
* <ins>Use Appropriate Metrics</ins>: Choose metrics based on the problem type (*e.g.* MSE for regression, F1-Score for classification).
* <ins>Perform Cross-Validation</ins>: Validate models on multiple subsets of the data.
* <ins>Avoid Data Leakage</ins>: Ensure training data doesn't contain future information.
* <ins>Monitor Bias-Variance Tradeoff</ins>: Balance model complexity and generalizability.
* <ins>Evaluate on Unseen Data</ins>: Always test on unseen datasets before deployment.
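
Both the train-test split and K-fold cross-validation come down to choosing which rows are held out. The sketch below is a hand-rolled illustration using only the standard library. The helper names `split_train_test` and `k_fold_indices` are hypothetical, and the folds are taken in order without shuffling or stratification.

```python
import random


def split_train_test(items, test_ratio=0.2, seed=0):
    """Shuffle a copy of the data and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


data = list(range(10))

# An 80% / 20% split of the hypothetical data.
train, test = split_train_test(data)

# Five folds: every sample appears in exactly one test fold.
folds = list(k_fold_indices(len(data), k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A model would be trained and scored once per fold, and the scores averaged. For imbalanced classes, a stratified variant that preserves class proportions in each fold is usually preferable.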

# **<ins>Module 8: </ins>**