## 17 March Assignment

## Feature Engineering-1

### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values in a dataset refer to the absence of data entries for certain observations or variables. These missing values can occur due to various reasons, such as errors during data collection, data corruption, or intentional omission. Handling missing values is crucial because they can adversely affect the quality and reliability of the data analysis and machine learning processes. They can lead to biased insights, inaccurate predictions, and unreliable model performance.

Importance of Handling Missing Values:

1. **Biased Analysis**: If missing values are not handled properly, any analysis or model built on the dataset can be biased and provide inaccurate results.

2. **Reduced Model Performance**: Many machine learning algorithms cannot handle missing values directly. If not addressed, they can cause errors during model training and evaluation.

3. **Inaccurate Predictions**: If a predictive model encounters missing values in new data during prediction, it might fail to provide accurate results.

4. **Distorted Patterns**: Missing values can distort the underlying patterns and relationships within the data, leading to incorrect conclusions.

5. **Incomplete Insights**: Missing values can lead to incomplete insights, as they might prevent the analysis of certain variables' impact on the outcome.

Algorithms Not Affected by Missing Values:

Few algorithms are truly unaffected by missing values. Some implementations handle them natively, while others need the gaps imputed first:

1. **Tree-Based Algorithms**: Decision trees can handle missing values in principle, for example through surrogate splits (as in CART) or by sending missing values down a default branch. Whether this works in practice depends on the implementation: recent scikit-learn releases accept NaN in decision trees and Random Forests, while older releases require imputation.

2. **Ensemble Methods**: An ensemble inherits its behavior from its base learners. Boosting or bagging over trees that support missing values can be trained on data with gaps; AdaBoost and similar methods have no mechanism of their own for filling them in.

3. **K-Nearest Neighbors (KNN)**: A plain KNN classifier cannot compute distances across missing entries, but the same idea is widely used for imputation: a missing value is estimated from the values of the nearest neighbors.

4. **XGBoost and LightGBM**: These gradient boosting frameworks handle missing values natively by learning, at each split, which branch missing values should follow.

5. **Support Vector Machines (SVM)**: SVMs do not handle missing values natively. Kernel functions need complete feature vectors, so missing values must be imputed before training.


### Q2: List down techniques used to handle missing data. Give an example of each with python code.

Handling missing data is essential for accurate analysis and modeling. Here are some common techniques to handle missing data along with examples using Python:

1. **Deletion of Missing Data:**
   This approach involves removing rows or columns with missing values. It's suitable when the missing data is random and doesn't impact the overall analysis significantly.
   
   ```python
   import pandas as pd
   
   # Create a DataFrame with missing values
   data = {'A': [1, 2, None, 4, 5],
           'B': [None, 2, 3, 4, None]}
   df = pd.DataFrame(data)
   
   # Drop rows with any missing values
   df_cleaned_rows = df.dropna()
   print(df_cleaned_rows)
   
   # Drop columns with any missing values
   df_cleaned_columns = df.dropna(axis=1)
   print(df_cleaned_columns)
   ```

2. **Imputation using Mean/Median/Mode:**
   In this method, missing values are replaced with the mean (for continuous data), median (for skewed data), or mode (for categorical data) of the non-missing values.
   
   ```python
   import pandas as pd
   
   # Create a DataFrame with missing values
   data = {'A': [1, 2, None, 4, 5],
           'B': [None, 2, 3, 4, None]}
   df = pd.DataFrame(data)
   
   # Impute missing values with mean
   df_mean_imputed = df.fillna(df.mean())
   print(df_mean_imputed)
   
   # Impute missing values with median
   df_median_imputed = df.fillna(df.median())
   print(df_median_imputed)
   
   # Impute missing values with mode
   df_mode_imputed = df.fillna(df.mode().iloc[0])
   print(df_mode_imputed)
   ```

3. **Imputation using Interpolation:**
   Interpolation involves estimating missing values based on the values of neighboring data points. This method is suitable for time series or sequential data.
   
   ```python
   import pandas as pd
   
   # Create a DataFrame with missing values
   data = {'A': [1, 2, None, 4, 5],
           'B': [None, 2, 3, 4, None]}
   df = pd.DataFrame(data)
   
   # Interpolate linearly between neighboring values; the leading NaN in 'B'
   # has no earlier value to interpolate from, so it remains NaN
   df_interpolated = df.interpolate()
   print(df_interpolated)
   ```

4. **Imputation using Machine Learning Algorithms:**
   You can use machine learning algorithms to predict missing values based on other features. For example, you can use regression or K-Nearest Neighbors to impute missing values.
   
   ```python
   import pandas as pd
   from sklearn.impute import KNNImputer
   
   # Create a DataFrame with missing values
   data = {'A': [1, 2, None, 4, 5],
           'B': [None, 2, 3, 4, None]}
   df = pd.DataFrame(data)
   
   # Impute missing values using KNN imputer
   imputer = KNNImputer(n_neighbors=2)
   df_knn_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
   print(df_knn_imputed)
   ```

5. **Creating Indicator Variables (Flagging):**
   In this method, you create a binary indicator variable that indicates whether a value is missing or not. This approach is useful when the fact that a value is missing is informative.
   
   ```python
   import pandas as pd
   
   # Create a DataFrame with missing values
   data = {'A': [1, 2, None, 4, 5],
           'B': [None, 2, 3, 4, None]}
   df = pd.DataFrame(data)
   
   # Create indicator variables for missing values
   df_indicator = df.copy()
   for col in df.columns:
       df_indicator[col + '_missing'] = df[col].isnull().astype(int)
   print(df_indicator)
   ```

These techniques offer various ways to handle missing data, each with its advantages and limitations. The choice of method depends on the nature of the dataset, the extent of missing values, and the specific analysis or modeling goals.

### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data in the context of machine learning refers to a situation where the distribution of classes or target variables in a dataset is highly skewed. In other words, one class (the minority class) has significantly fewer instances than another class (the majority class). This imbalance can occur in various types of classification tasks, such as fraud detection, disease diagnosis, and customer churn prediction.

For example, consider a binary classification problem to predict whether an online transaction is fraudulent or not. If only a small fraction of transactions are fraudulent, the dataset might have a large number of non-fraudulent transactions (majority class) and only a few fraudulent ones (minority class). This creates an imbalanced dataset.

Consequences of Not Handling Imbalanced Data:

1. **Biased Model Performance**: Most machine learning algorithms are designed to maximize overall accuracy. In an imbalanced dataset, a model might achieve high accuracy by simply predicting the majority class all the time. This can give a false sense of good performance while actually being ineffective for the minority class.

2. **Poor Generalization**: Imbalanced data can lead to poor generalization to new, unseen data. The model becomes biased towards the majority class, failing to capture the patterns in the minority class.

3. **Misclassification of Minority Class**: Due to the scarcity of data for the minority class, the model might misclassify or ignore instances from the minority class altogether.

4. **Low Sensitivity to Anomalies**: In applications like fraud detection or medical diagnosis, the minority class often represents critical cases. Ignoring this class can lead to false negatives, missing important instances.

5. **Loss of Important Information**: Imbalanced data might result in the loss of valuable insights and patterns present in the minority class.
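
The first consequence can be made concrete with a small sketch. The labels below are hypothetical (950 legitimate and 50 fraudulent transactions), and the "model" is just a rule that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 950 legitimate (0) and 50 fraudulent (1) transactions
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_positives / (y_true == 1).sum()

print(f"accuracy = {accuracy:.2f}")  # 0.95
print(f"recall   = {recall:.2f}")    # 0.00
```

The accuracy looks strong even though not a single fraudulent transaction is detected, which is why accuracy alone is a poor guide on imbalanced data.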

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

**Up-sampling** and **down-sampling** are two techniques used to address the issue of imbalanced data in machine learning, where one class is significantly more or less frequent than the other class. These techniques aim to balance the class distribution, which can lead to improved model performance.

1. **Up-sampling**:
Up-sampling involves increasing the number of instances in the minority class by either duplicating existing instances or generating synthetic data points. This technique aims to give the minority class more representation in the dataset, making the class distribution more balanced. Common methods for up-sampling include SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling).

**Example of Up-sampling**:
Suppose you're working on a credit card fraud detection problem, where the majority of transactions are legitimate (non-fraudulent) and only a small portion are fraudulent. The dataset has a class distribution of 95% legitimate transactions and 5% fraudulent transactions. In this case, up-sampling can be applied to increase the number of fraudulent transactions to achieve a more balanced distribution. This helps the model better learn the patterns of both classes and improves its ability to detect fraud.

2. **Down-sampling**:
Down-sampling involves reducing the number of instances in the majority class by randomly removing data points. This technique aims to reduce the influence of the majority class and create a more balanced distribution. However, down-sampling can potentially lead to loss of information if done excessively.

**Example of Down-sampling**:
Consider a medical diagnosis problem where you're predicting whether a patient has a rare disease. The dataset contains a large number of healthy patients (majority class) and a small number of patients with the disease (minority class). The class distribution is imbalanced with 90% healthy patients and 10% patients with the disease. In this scenario, down-sampling can be used to reduce the number of healthy patients, creating a more balanced dataset that allows the model to better focus on learning the disease-related patterns.

When to Use Up-sampling and Down-sampling:

- **Up-sampling**:
  - When the minority class is under-represented and the model is biased towards the majority class.
  - When improving the model's ability to correctly predict the minority class is crucial.
  - When there's a risk of false negatives (missing important instances) in the minority class.

- **Down-sampling**:
  - When the majority class is over-represented and the model's performance on the minority class is unsatisfactory.
  - When the presence of a large majority class overwhelms the model's ability to learn from the minority class.
  - When computational efficiency is a concern, as down-sampling reduces the dataset size.

Both up-sampling and down-sampling have their pros and cons. It's important to consider the specific problem, dataset, and potential trade-offs before applying these techniques. In some cases, a combination of both techniques or other strategies like adjusting class weights might be more effective in achieving a balanced class distribution and improving model performance.
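
A minimal sketch of both techniques using `sklearn.utils.resample` on a made-up fraud dataset with the 95%/5% split from the example above. The column names `amount` and `is_fraud` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical dataset: 95 legitimate rows and 5 fraudulent rows
df = pd.DataFrame({
    "amount": range(100),
    "is_fraud": [0] * 95 + [1] * 5,
})
majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Up-sampling: draw minority rows with replacement until both classes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_up = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
df_down = pd.concat([majority_down, minority])

up_counts = df_up["is_fraud"].value_counts().to_dict()
down_counts = df_down["is_fraud"].value_counts().to_dict()
print("After up-sampling:  ", up_counts)
print("After down-sampling:", down_counts)
```

Up-sampling keeps every original row but repeats minority rows; down-sampling keeps only 10 of the 100 rows here, which shows how much information it can discard.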

### Q5: What is data Augmentation? Explain SMOTE.

**Data Augmentation** is a technique used in machine learning and deep learning to artificially increase the diversity and size of a dataset by applying various transformations to the existing data. This is particularly useful when working with limited data, as it helps improve model generalization by exposing the model to a wider range of variations present in the real-world data.

Data augmentation involves applying operations like rotation, translation, scaling, flipping, cropping, and noise addition to the original data to create new, slightly altered instances. These augmented instances are then used alongside the original data for training the model.

**SMOTE (Synthetic Minority Over-sampling Technique)** is a specific technique for data augmentation that focuses on addressing the imbalance in class distribution. It generates synthetic samples for the minority class by creating new instances that are combinations of existing instances in the same class.

Here's how SMOTE works:

1. For each instance in the minority class, SMOTE selects k-nearest neighbors from the same class. The value of k is a parameter chosen by the user.

2. It then generates a new instance by interpolating between the selected instance and one of its k nearest neighbors, chosen at random. For each feature, SMOTE computes the difference between the neighbor's value and the selected instance's value, multiplies this difference by a random number between 0 and 1, and adds the result to the selected instance's value. The new instance therefore lies on the line segment between the two points.

3. SMOTE repeats this process for a specified number of times to generate a desired number of synthetic instances.

SMOTE creates synthetic instances that are plausible representations of the minority class, helping to balance the class distribution without duplicating existing data. Compared with plain duplication, this reduces the risk of overfitting to individual minority samples and allows the model to better capture the underlying patterns of both classes.
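
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch` and the toy points are made up, and the base instance is picked at random rather than by looping over every minority instance.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic points from minority samples X_min."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest samples

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))   # pick a minority sample
        nb = rng.choice(neighbors[base])  # pick one of its k neighbors
        gap = rng.random()                # random number in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy minority class with two features
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
X_new = smote_sketch(X_min, n_new=5)
print(X_new)
```

Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies. For real projects, the `SMOTE` class in the imbalanced-learn library is the usual choice.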

Example of SMOTE:

Suppose you're working on a medical diagnosis problem to predict whether a patient has a rare disease. The dataset is imbalanced, with a large number of healthy patients (majority class) and a small number of patients with the disease (minority class). Applying SMOTE would involve selecting instances from the minority class, choosing k-nearest neighbors from the same class, and then generating synthetic instances by interpolating between these instances. The result is a set of new instances that better represent the characteristics of the minority class, improving the model's ability to correctly predict cases of the disease.

Overall, SMOTE is a valuable technique for handling imbalanced datasets and improving the performance of machine learning models on minority classes.

### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

**Outliers** in a dataset are data points that significantly deviate from the rest of the data points. They are observations that lie far away from the bulk of the data distribution. Outliers can be caused by various factors, such as measurement errors, data entry mistakes, or genuine extreme values in the data.

Outliers can be categorized into two types:

1. **Univariate Outliers**: These outliers occur in a single variable or feature and are typically detected by considering the distribution of that variable alone.

2. **Multivariate Outliers**: These outliers are identified when considering multiple variables or features together. They might not be detected as outliers in individual variables but stand out when considering their combination.

Importance of Handling Outliers:

Handling outliers is essential for several reasons:

1. **Distorted Analysis and Insights**: Outliers can distort the statistical measures and distribution of the data, leading to inaccurate insights and conclusions. They can exaggerate the spread of data and affect measures like mean and standard deviation.

2. **Incorrect Model Assumptions**: Outliers can violate the assumptions of many statistical and machine learning algorithms, affecting their performance and validity.

3. **Bias in Regression Models**: Outliers can significantly impact the coefficients and fit of regression models, leading to biased predictions.

4. **Increased Model Variability**: Outliers can lead to increased variability in model predictions, causing instability in model performance.

5. **Misinterpretation of Results**: If not handled, outliers can lead to misinterpretation of results, potentially leading to wrong decisions and actions based on data analysis.

6. **Sensitive to Noise**: Some algorithms, like k-means clustering, are sensitive to outliers and can result in suboptimal clustering.
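
A small sketch of the first point, using made-up values. It flags outliers with the common 1.5 × IQR rule (one convention among several, chosen here as an assumption) and compares the mean and median with and without the flagged value.

```python
import numpy as np

# Made-up measurements; 95 is the suspicious value
values = np.array([10, 12, 11, 13, 12, 11, 10, 95])

# 1.5 * IQR rule: flag anything far outside the middle 50% of the data
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier].tolist()

print("Outliers:", outliers)                        # [95]
print("Mean with outlier:   ", values.mean())       # 21.75
print("Median with outlier: ", np.median(values))   # 11.5
print("Mean without outlier:", values[~is_outlier].mean())
```

A single extreme value nearly doubles the mean while the median barely moves, which is why robust statistics or explicit outlier handling matter before modeling.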

### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

Handling missing data in a customer data analysis project is essential to ensure accurate insights and reliable results. Here are some techniques that can be used to handle the missing data:

1. **Imputation using Mean/Median/Mode**:
   Replace missing values with the mean (for continuous data), median (for skewed data), or mode (for categorical data) of the non-missing values within the same variable. This is a simple method and works well when the missing values are randomly distributed.

2. **Imputation using Regression**:
   If there's a relationship between the missing variable and other variables, you can use regression models to predict the missing values based on the values of other variables. This approach is effective when the missing data follows a pattern.

3. **Imputation using K-Nearest Neighbors (KNN)**:
   For numerical data, you can impute missing values by using the values of the k-nearest neighbors. This approach is particularly useful when there's a local relationship between data points.

4. **Imputation using Machine Learning Models**:
   You can train a machine learning model to predict missing values based on other features. Algorithms like decision trees, Random Forests, and XGBoost can be used for this purpose.

5. **Forward/Backward Fill**:
   In time series data, you can use forward fill (replacing missing value with the previous value) or backward fill (replacing with the next value) if missing values occur sequentially.

6. **Interpolation**:
   Interpolation involves estimating missing values based on the neighboring data points. It's effective for time series or sequential data where values tend to follow a trend.

7. **Creating Indicator Variables**:
   You can create binary indicator variables that indicate whether a value is missing or not. This approach helps the model consider the missingness as a separate category.

8. **Deletion**:
   If missing values are very few and randomly distributed, you might consider deleting the corresponding rows or columns. However, this should be done cautiously to avoid loss of valuable information.

9. **Domain Expertise**:
   Consulting domain experts can provide insights into the nature of missing data and help decide on appropriate imputation techniques.

10. **Multiple Imputations**:
   This technique involves creating several imputed datasets, analyzing each one separately, and then pooling the results. The variation between the imputed datasets reflects the uncertainty of the imputation, which a single imputed value hides.

11. **Use of External Data**:
   In some cases, external data sources might provide information that can be used to impute missing values.

The choice of technique depends on the nature of the data, the extent of missing values, the analysis goals, and the assumptions about the missing data mechanism. It's often a good practice to explore multiple techniques and assess their impact on the analysis to ensure robust results.
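
Forward and backward fill are easy to show with pandas. The series below is hypothetical (daily order counts for one customer), and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical daily order counts for one customer, with gaps
orders = pd.Series([5, None, None, 8, None],
                   index=pd.date_range("2024-01-01", periods=5, freq="D"))

forward = orders.ffill()   # carry the last observed value forward
backward = orders.bfill()  # pull the next observed value backward

print(pd.DataFrame({"original": orders,
                    "ffill": forward,
                    "bfill": backward}))
```

Forward fill leaves no gaps here because the series starts with an observed value, while backward fill cannot fill the final entry since nothing comes after it.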

### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

Missing data is usually described by one of three mechanisms: missing completely at random (MCAR), where missingness is unrelated to any variable; missing at random (MAR), where missingness depends only on observed variables; and missing not at random (MNAR), where it depends on the unobserved values themselves. Knowing which mechanism applies is crucial for understanding the potential bias and its implications for your analysis. Here are some strategies you can use to assess the randomness or patterns in missing data:

1. **Descriptive Statistics**:
   Calculate summary statistics of the other variables separately for rows where a value is missing and rows where it is present. If the distributions are similar, the data is consistent with MCAR; clear differences suggest that missingness depends on those variables.

2. **Visualization**:
   Create visualizations, such as histograms or box plots, of the other variables, split by whether the value of interest is missing. If the two groups look similar, the data is consistent with MCAR.

3. **Missing Data Heatmap**:
   Create a heatmap that displays the presence or absence of data across different variables. This can help identify if certain variables tend to have more missing data than others, suggesting patterns.

4. **Pattern Tests**:
   Conduct statistical tests to compare characteristics of complete and missing data. For continuous variables, you can use t-tests or Mann-Whitney U tests. For categorical variables, use chi-squared tests or Fisher's exact tests.

5. **Correlation Analysis**:
   Investigate correlations between variables with missing data and other variables. If missingness is correlated with specific variables, there might be a pattern.

6. **Time Analysis**:
   For time series data, analyze whether the missingness follows a temporal pattern. If there are specific time periods with higher missingness, it could indicate non-randomness.

7. **Domain Knowledge**:
   Consult experts in the field to understand if there's a logical reason behind the missing data. They might provide insights into potential patterns.

8. **Missing Data Mechanism Tests**:
   Run specific tests designed to assess the missing data mechanism, such as Little's MCAR test, which tests the null hypothesis that the data is missing completely at random.

9. **Multiple Imputations**:
   Create multiple imputations of the missing data using different methods and compare the results. If the conclusions change substantially from one imputation model to another, the analysis is sensitive to the assumed missingness mechanism and deserves a closer look.

10. **Exploring Subgroups**:
    Analyze whether the missing data pattern varies across subgroups. If certain demographic or categorical groups have different levels of missingness, it could suggest a pattern.

11. **Check External Data Sources**:
    If available, check external data or sources that might provide insights into why the data is missing for specific cases.

Remember that these strategies are not mutually exclusive and can be used in combination to gain a comprehensive understanding of the missing data pattern. Whether the data is missing completely at random, missing at random, or missing not at random influences the choice of imputation methods and the interpretation of analysis results, so it's important to invest effort in this investigation.
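
The comparison of rows with and without a missing value can be sketched as follows. The data is simulated, and the mechanism (older customers skip the income field more often) is an assumption built into the simulation so that the test has something to find.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(20, 70, size=n)
income = rng.normal(50_000, 10_000, size=n)

# Simulated mechanism (an assumption): customers over 50 omit income more often
p_missing = np.where(age > 50, 0.6, 0.05)
income[rng.random(n) < p_missing] = np.nan

df = pd.DataFrame({"age": age, "income": income})
income_missing = df["income"].isna()

# Compare an observed variable (age) between the two groups
mean_age_missing = df.loc[income_missing, "age"].mean()
mean_age_observed = df.loc[~income_missing, "age"].mean()

# Welch's t-test: does mean age differ between the groups?
t_stat, p_value = stats.ttest_ind(df.loc[income_missing, "age"],
                                  df.loc[~income_missing, "age"],
                                  equal_var=False)

print(f"mean age, income missing:  {mean_age_missing:.1f}")
print(f"mean age, income observed: {mean_age_observed:.1f}")
print(f"p-value: {p_value:.3g}")
```

A small p-value indicates that missingness is related to age, so the data is not missing completely at random. A large p-value would not prove randomness; it only means this particular check found no pattern.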

### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Working with imbalanced datasets, such as in a medical diagnosis project where the majority of patients do not have the condition of interest, requires careful evaluation to ensure that the model's performance is not biased or misleading. Here are some strategies to evaluate your machine learning model's performance on such imbalanced datasets:

1. **Confusion Matrix and Class Imbalance Metrics**:
   Use a confusion matrix to visualize the model's performance. Pay special attention to metrics like precision, recall (sensitivity), specificity, and the F1-score. These metrics provide insights into how well the model is identifying both positive and negative cases.

2. **ROC Curve and AUC**:
   Plot the Receiver Operating Characteristic (ROC) curve, which shows the trade-off between true positive rate (recall) and false positive rate at various threshold settings. Calculate the Area Under the Curve (AUC) to summarize the model's ability to discriminate between classes in a single, threshold-independent number. Under severe imbalance, ROC AUC can look optimistic because the false positive rate is computed over a very large number of negatives, so it is best read alongside the precision-recall curve.

3. **Precision-Recall Curve**:
   The precision-recall curve is useful for imbalanced datasets, as it focuses on the positive class. It plots precision against recall at different threshold values. A model with high precision and recall is desirable for imbalanced datasets.

4. **Balanced Accuracy**:
   The balanced accuracy takes into account the imbalance in class distribution. It calculates the average of sensitivity (recall) for the positive class and specificity for the negative class.

5. **Cost-sensitive Learning**:
   Adjust your model's cost function to account for the class imbalance. Assign different misclassification costs to different classes, putting more emphasis on the minority class.

6. **Resampling Techniques**:
   Implement resampling techniques like oversampling (increasing the minority class) or undersampling (decreasing the majority class) on the training folds only during cross-validation. The validation folds should keep the original class distribution so that the reported metrics reflect real-world performance.

7. **Ensemble Methods**:
   Ensemble methods like Random Forest, AdaBoost, or Gradient Boosting can handle imbalanced data better by combining multiple models. They can give more weight to the minority class during training.

8. **Stratified Cross-Validation**:
   Use stratified cross-validation to ensure that each fold maintains the class distribution found in the entire dataset. This guarantees that every fold contains minority-class cases, which keeps the per-fold metrics meaningful and comparable.

9. **Area Under the Precision-Recall Curve (AUC-PR)**:
   This metric focuses on the positive class and provides a better measure of the model's performance on the minority class.

10. **Use Domain Knowledge**:
    Understand the importance of false positives and false negatives in your specific application. Adjust the decision threshold of your model accordingly.

11. **Anomaly Detection Techniques**:
    Treat the problem as an anomaly detection task, where the positive class is treated as an anomaly. Use techniques like Isolation Forest or One-Class SVM.

Remember that selecting the appropriate evaluation strategy depends on the specific objectives and constraints of your project. The goal is to ensure that the model's performance is not skewed by the class imbalance and that it can effectively identify cases of interest, even in the presence of a small percentage of positive cases.
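
A minimal sketch of the label-based metrics using `sklearn.metrics`. The predictions are hand-made for illustration: 100 hypothetical patients, 10 of whom have the condition.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Hypothetical ground truth: 90 healthy patients (0), 10 with the condition (1)
y_true = np.array([0] * 90 + [1] * 10)
# Hypothetical predictions: 5 false alarms, 6 of the 10 true cases detected
y_pred = np.array([0] * 85 + [1] * 5 + [1] * 6 + [0] * 4)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / len(y_true)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
balanced_acc = balanced_accuracy_score(y_true, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy          = {accuracy:.3f}")
print(f"precision         = {precision:.3f}")
print(f"recall            = {recall:.3f}")
print(f"F1-score          = {f1:.3f}")
print(f"balanced accuracy = {balanced_acc:.3f}")
```

Accuracy is 0.91 even though 4 of the 10 patients with the condition are missed; recall, precision, and balanced accuracy expose that gap. Curve-based metrics such as ROC AUC and AUC-PR additionally require predicted scores or probabilities rather than hard labels.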

### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

Balancing an unbalanced dataset for customer satisfaction estimation is important to ensure that your machine learning model doesn't become biased towards the majority class. Down-sampling the majority class is a common way to achieve this balance. Here are some methods you can use:

1. **Random Under-sampling**:
   Randomly remove instances from the majority class until the desired balance is achieved. This can help prevent the model from being dominated by the majority class.

2. **Cluster-Based Under-sampling**:
   Use clustering algorithms to group similar instances from the majority class and then randomly remove instances from each cluster. This ensures that you retain diversity within the majority class.

3. **Tomek Links**:
   Tomek links are pairs of instances from different classes that are each other's nearest neighbors. You can remove the majority class instance from each Tomek link, which can help in separating the decision boundary between classes.

4. **NearMiss Algorithm**:
   NearMiss is an under-sampling algorithm that selects a subset of majority class instances based on their distance to minority class instances. It aims to maintain the distribution of the minority class while reducing the number of majority class instances.

5. **Edited Nearest Neighbors (ENN)**:
   ENN is a technique that removes majority class instances that are misclassified by their k-nearest neighbors. It aims to remove noisy instances from the majority class.

6. **Instance Hardness Threshold (IHT)**:
   IHT assigns a hardness score to each instance based on its likelihood of being misclassified. It then removes instances with scores above a certain threshold.

7. **Down-sampling using imbalanced-learn Library**:
   The imbalanced-learn library in Python provides various under-sampling techniques like RandomUnderSampler, ClusterCentroids, TomekLinks, and more. These methods can make the implementation of down-sampling easier and more structured.

8. **Stratified Sampling**:
   When sub-sampling the majority class, stratify on important attributes (for example, customer segment or region) so that the reduced majority class keeps the same mix of subgroups as the original.

9. **Cross-Validation and Ensemble Methods**:
   Use cross-validation techniques along with ensemble methods like Random Forest or Gradient Boosting. Cross-validation does not balance the classes by itself; apply down-sampling inside each training fold and keep the validation folds untouched so that performance estimates remain realistic.

10. **Evaluation of Down-sampling Methods**:
    Experiment with different down-sampling methods and evaluate their impact on model performance using appropriate evaluation metrics.

Remember that down-sampling the majority class comes with the trade-off of reducing the available data for training, which might lead to loss of information. Experiment with different techniques to find the balance between achieving class balance and maintaining sufficient data for the model to learn effectively.
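
Tomek link removal is less obvious than random under-sampling, so here is a rough sketch built on scikit-learn's `NearestNeighbors`. The helper name `drop_tomek_majority` and the one-feature toy data are made up, and the code assumes there are no duplicate rows. The `TomekLinks` class in imbalanced-learn is the usual choice in practice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def drop_tomek_majority(X, y, majority_label):
    """Remove majority-class points that take part in a Tomek link."""
    # Column 0 of the result is the point itself (assuming no duplicate rows),
    # column 1 is its nearest other point
    nearest = (NearestNeighbors(n_neighbors=2)
               .fit(X)
               .kneighbors(X, return_distance=False)[:, 1])
    idx = np.arange(len(X))
    mutual = nearest[nearest] == idx   # each point is the other's nearest neighbor
    opposite = y != y[nearest]         # and the two labels differ
    drop = mutual & opposite & (y == majority_label)
    return X[~drop], y[~drop]

# Hypothetical one-feature data: 1 = satisfied (majority), 0 = dissatisfied
X = np.array([[0.0], [1.1], [2.3], [3.6], [3.9], [5.5], [6.3]])
y = np.array([1, 1, 1, 1, 0, 0, 0])

X_res, y_res = drop_tomek_majority(X, y, majority_label=1)
print(X_res.ravel(), y_res)
```

Only the satisfied customer at 3.6, which sits right next to a dissatisfied one at 3.9, is removed. Tomek links clean the class boundary rather than fully balancing the classes, so they are often combined with another sampler.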

### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

When dealing with a dataset that has a low percentage of occurrences for a rare event, it's important to balance the dataset to ensure that your machine learning model can effectively learn from the minority class. Up-sampling the minority class is a common technique to achieve this balance. Here are some methods you can employ to up-sample the minority class:

1. **Random Over-sampling**:
   Randomly duplicate instances from the minority class to increase its size. This can help ensure that the model has sufficient data to learn from.

2. **SMOTE (Synthetic Minority Over-sampling Technique)**:
   SMOTE is a popular technique that creates synthetic instances by interpolating between existing instances of the minority class. This technique helps address the imbalance while avoiding duplication of existing data points.

3. **ADASYN (Adaptive Synthetic Sampling)**:
   ADASYN is an extension of SMOTE that assigns different weights to instances in the minority class based on their level of difficulty in classification. It focuses on generating instances in regions of the feature space that are harder to classify.

4. **Borderline-SMOTE**:
   Borderline-SMOTE is a variant of SMOTE that focuses on generating synthetic instances near the decision boundary between classes. This can improve the generalization of the model.

5. **SMOTE-NC (SMOTE for Nominal and Continuous Features)**:
   SMOTE-NC is an extension of SMOTE that works with datasets containing both nominal and continuous features.

6. **SMOTEENN (SMOTE combined with Edited Nearest Neighbors)**:
   This technique combines over-sampling using SMOTE with under-sampling using the Edited Nearest Neighbors algorithm. It helps in generating synthetic instances while removing noisy instances from the majority class.

7. **SMOTETomek (SMOTE combined with Tomek Links)**:
   This technique first over-samples the minority class with SMOTE and then removes Tomek links, cleaning up overlapping instances near the class boundary to achieve a better class balance.

8. **Using Synthetic Data Generators**:
   Some libraries offer synthetic data generators that create new instances using probabilistic methods or generative models. These generated instances can help balance the dataset.

9. **Cost-sensitive Learning**:
   Adjust the algorithm's cost function to give more importance to the minority class during training. This can help the model pay more attention to the rare event.

10. **Ensemble Methods**:
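    Use ensemble methods like EasyEnsemble or BalanceCascade, which train multiple models, each on all minority instances plus a different subset of the majority class, and then combine their predictions. Strictly speaking these rely on under-sampling, but they are a common alternative when up-sampling alone is not enough.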

11. **Cross-Validation and Evaluation**:
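    When up-sampling, apply it inside each cross-validation training fold only and evaluate on validation folds that keep the original distribution. Up-sampling before splitting leaks copies or near-copies of the same instances into the validation data and inflates the scores.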
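
The last point can be sketched as follows: random over-sampling is applied to each training fold while the validation fold is left untouched. The data is synthetic (190 common cases and 10 rare events, an assumption for illustration), and logistic regression is used only as a placeholder model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Synthetic data (an assumption): 190 common cases and 10 rare events
X = np.vstack([rng.normal(0.0, 1.0, size=(190, 2)),
               rng.normal(2.0, 1.0, size=(10, 2))])
y = np.array([0] * 190 + [1] * 10)

recalls = []
fold_checks = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Up-sample the minority class in the training fold only
    X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
    X_min_up = resample(X_min, replace=True,
                        n_samples=len(X_maj), random_state=0)
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

    model = LogisticRegression().fit(X_bal, y_bal)

    # The validation fold keeps the original, imbalanced distribution
    recalls.append(recall_score(y[val_idx], model.predict(X[val_idx])))
    fold_checks.append((int(y_bal.sum()), len(y_bal), int(y[val_idx].sum())))

print("Recall per fold:", recalls)
print("(rare events in training fold, training size, rare events in validation):",
      fold_checks)
```

Each training fold ends up balanced at 152 rows per class, while each validation fold still contains only 2 rare events out of 40 rows, so the recall estimate is not inflated by duplicated samples.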
