# Core Module | Assignment 5

## Naive Bayes Approach

# Q1.
### What is the Naive Approach in machine learning?

The Naive Approach in machine learning refers to a simple method that assumes the features in a dataset are independent of each other given the class label. This assumption is called naive because it rarely holds in real-world data. The most common example is the Naive Bayes classifier, which assumes that, within a class, the value of one feature does not affect the value of another. Despite this simplifying assumption, the Naive Approach can be surprisingly effective, especially for large datasets, text classification, and certain types of data distributions. However, its performance may suffer when features are highly correlated or when the relationships between variables are complex.

# Q2.
### Explain the assumptions of feature independence in the Naive Approach.


The Naive Approach makes a fundamental assumption known as the "feature independence assumption." This assumption states that the presence or absence of one feature is independent of the presence or absence of any other feature in the dataset. 

- Conditional Independence: Given the class label, all features are assumed to be conditionally independent of each other. This means that, once the class is known, the value of one feature provides no additional information about the values of the other features.
- Independent Contribution: Each feature contributes to the classification decision independently, through its own likelihood term. The contributions are not necessarily equal, but no interactions or dependencies between features are modeled.
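
The assumption can be written compactly. The notation below is an illustrative choice, not taken from the assignment: $c$ is a class label and $x_1, \dots, x_n$ are the feature values of one instance. Under conditional independence the posterior factors into one term per feature:

$$
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The predicted class is the one that maximizes the right-hand side.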

# Q3.
### How does the Naive Approach handle missing values in the data?

The Naive Approach handles missing values in a straightforward manner:

- During Training: When calculating probabilities for each feature given a class label, the Naive Bayes classifier ignores instances with missing values for that particular feature. This means that missing values do not contribute to the probability estimation of the feature.
- During Prediction: When making predictions for new data with missing values, the Naive Bayes classifier skips the feature(s) with missing values in the probability calculation. The prediction is still made based on the available features.

It's important to note that handling missing values by ignoring instances or features can lead to biased or inaccurate predictions.

# Q4.
### What are the advantages and disadvantages of the Naive Approach?

Advantages of the Naive Approach:

- Simplicity: The Naive Approach is easy to understand, implement, and computationally efficient. It requires minimal tuning and can be applied quickly to large datasets.
- Low Data Requirements: Naive Bayes can perform well even with small training datasets, making it suitable for scenarios with limited data.
- Effective for Text Classification: It performs well in text classification tasks (spam filtering, sentiment analysis), where it often gives useful results even though word occurrences are not truly independent.

Disadvantages of the Naive Approach:

- Strong Independence Assumption: The feature independence assumption is often unrealistic in real-world data, leading to suboptimal performance when features are highly correlated.
- Limited Expressiveness: Naive Bayes may not capture complex relationships between variables, limiting its effectiveness in certain tasks.
- Sensitivity to Irrelevant Features: It can be sensitive to irrelevant features, potentially affecting the classification performance.

# Q5.
### Can the Naive Approach be used for regression problems? If yes, how?

The Naive Approach, particularly the Naive Bayes classifier, is primarily designed for classification tasks and is not directly applicable to regression problems.
Note that "Gaussian Naive Bayes" is still a classifier: it models each continuous feature with a Gaussian (normal) distribution within each class, but it predicts a discrete class label. Naive Bayes can be adapted to simple regression problems by discretizing the continuous target into bins and treating each bin as a class. The features are then assumed to be conditionally independent given the bin, and each continuous feature is assumed to follow a Gaussian distribution within it.

The steps to use Gaussian Naive Bayes in this way are as follows:

1. Discretize the target variable: Split the target range into bins (e.g., equal-width or quantile bins). If the target is heavily skewed, applying a transformation first (e.g., log transformation) can give more balanced bins.
2. Model feature distributions: For each feature, estimate the mean and variance of its distribution within each target bin.
3. Calculate the conditional probabilities: Use the Gaussian probability density function to calculate the likelihood of each feature value given each bin.
4. Make predictions: Combine the likelihoods with the bin priors using Bayes' theorem, then report a representative value (e.g., the midpoint of the most probable bin, or the probability-weighted average of the bin midpoints) as the prediction for a new data point.

# Q6.
### How do you handle categorical features in the Naive Approach?

Handling categorical features in the Naive Approach involves converting these categorical attributes into a numerical format. Two common methods are:

1. Label Encoding: In this method, each category in a categorical feature is assigned a unique integer label. For example, if you have a feature "Color" with categories "Red," "Blue," and "Green," you can map them to 0, 1, and 2, respectively. However, be cautious with this approach, as it implies an ordinal relationship between the categories, which may not be the case for all categorical variables.
2. One-Hot Encoding: This method creates binary columns for each category in the original feature. Each category becomes a new feature, and its presence is denoted by a 1, while all other columns are set to 0. This way, there is no implied ordinal relationship between the categories. For example, if "Color" has three categories, three binary features (e.g., "Red," "Blue," "Green") are created, each representing the presence of the corresponding color.
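
The two encodings above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation; the helper names `label_encode` and `one_hot_encode` are hypothetical, and integer codes are assigned in order of first appearance.

```python
def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one binary column per category (columns follow first-appearance order)."""
    _, mapping = label_encode(values)
    rows = []
    for v in values:
        row = [0] * len(mapping)
        row[mapping[v]] = 1  # exactly one column is set per row
        rows.append(row)
    return rows, list(mapping)


colors = ["Red", "Blue", "Green", "Blue"]
codes, mapping = label_encode(colors)      # integer codes per row
onehot, columns = one_hot_encode(colors)   # binary rows plus column names
```

With the "Color" example, label encoding yields a single integer column, while one-hot encoding yields three binary columns with exactly one 1 per row.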

# Q7.
### What is Laplace smoothing and why is it used in the Naive Approach?

Laplace smoothing, also known as add-one smoothing or additive smoothing, is a technique used to address the problem of zero probabilities in the Naive Approach. 

In the Naive Bayes algorithm, when calculating the probability of a feature given a class label, there is a possibility of encountering zero probabilities. This occurs when a particular feature value never appears with a specific class in the training data, resulting in an estimated probability of zero. Because the class score is a product of per-feature probabilities, a single zero makes the whole product zero, so one unseen feature value can rule out a class regardless of all the other evidence.

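Laplace smoothing solves this problem by adding a small constant (typically 1) to every count in the numerator and adding the number of possible feature values to the denominator, so that the smoothed probabilities still sum to 1.

The formula for Laplace smoothed probability for a feature (x) given a class (c) is:

**P(x|c) = (count(x, c) + 1) / (count(c) + |V|)**

Here count(x, c) is the number of times x occurs with class c, count(c) is the total count over all values of the feature in class c, and |V| is the number of distinct values the feature can take (e.g., the vocabulary size in text classification).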
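
As a rough sketch of the formula in a text setting, the snippet below estimates smoothed word likelihoods for one class. The function name `smoothed_likelihoods`, the `alpha` parameter (1 gives add-one smoothing), and the toy word lists are assumptions made for illustration.

```python
from collections import Counter


def smoothed_likelihoods(words_in_class, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    words_in_class: every word token seen in documents of one class.
    vocabulary: set of all distinct words across classes (|V| in the formula).
    alpha: smoothing constant; alpha=1 is Laplace (add-one) smoothing.
    """
    counts = Counter(words_in_class)
    denominator = len(words_in_class) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denominator for w in vocabulary}


vocab = {"free", "money", "meeting"}
spam_words = ["free", "money", "free"]
probs = smoothed_likelihoods(spam_words, vocab)
```

The word "meeting" never occurs in the toy spam data, yet it receives a small non-zero probability, and the three probabilities still sum to 1.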

# Q8.
### How do you choose the appropriate probability threshold in the Naive Approach?

Choosing the appropriate probability threshold in the Naive Approach depends on the specific requirements of the application and the trade-off between precision and recall.

Here's how we can choose an appropriate probability threshold:

1. **Understanding the Problem**: Consider the specific problem and the consequences of false positives and false negatives. For example, in medical diagnosis, false positives may lead to unnecessary treatments, while false negatives can be life-threatening. Adjust the threshold to prioritize the more critical outcome.
2. **Precision-Recall Trade-off**: Lowering the threshold increases recall (sensitivity) but decreases precision. Conversely, raising the threshold increases precision but decreases recall. Assess the balance needed based on the application.
3. **Receiver Operating Characteristic (ROC) Curve**: Plot the ROC curve and analyze the trade-off between true positive rate (recall) and false positive rate. The optimal threshold is often associated with the point closest to the top-left corner (maximizing true positive rate while minimizing false positive rate).
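4. **F1 Score or Area Under the Curve (AUC)**: Compute the F1 score (harmonic mean of precision and recall) at each candidate threshold and choose the threshold that maximizes it. AUC summarizes performance across all thresholds, so it is useful for comparing models rather than for selecting a single threshold.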
5. **Cross-Validation**: Use cross-validation to evaluate different thresholds and choose the one that provides the best balance of performance on the validation data.
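
The threshold search described in the steps above can be sketched as follows. This is a minimal example under stated assumptions: the labels and scores are made-up validation data, the candidate grid is arbitrary, and F1 is used as the selection criterion purely for illustration.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall and F1 when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold_by_f1(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda t: precision_recall_f1(y_true, scores, t)[2])


# Hypothetical validation labels (1 = positive) and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.55]
candidates = [0.2, 0.3, 0.5, 0.6, 0.7]
best = best_threshold_by_f1(y_true, scores, candidates)
```

On this toy data, lowering the threshold to 0.2 gives perfect recall at the cost of precision, which mirrors the trade-off in step 2. The search should be run on validation data rather than on the final test set.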

# Q9.
### Give an example scenario where the Naive Approach can be applied.

A common example scenario where the Naive Approach can be applied is in text classification, particularly for spam email filtering.

**Scenario: Spam Email Filtering**

In this scenario, the task is to classify incoming emails as either "spam" or "non-spam" (ham). The goal is to automatically filter out unwanted spam emails and ensure that legitimate emails reach the inbox.

**Implementation:**

1. **Data Collection**: Gather a labeled dataset consisting of emails, where each email is labeled as "spam" or "ham."
2. **Data Preprocessing**: Convert the emails into a numerical format by using techniques like bag-of-words or TF-IDF (Term Frequency-Inverse Document Frequency) to represent the text as feature vectors.
3. **Naive Bayes Model**: Train a Naive Bayes classifier using the training set, where the features represent the words in the emails, and the labels are "spam" or "ham."
4. **Probability Calculation**: Calculate the probabilities of each word appearing in spam and non-spam emails, based on the training data.
5. **Classification**: For new, unseen emails, use the Naive Bayes model to predict whether each email is "spam" or "ham" by combining the probabilities of the words present in the email.
6. **Threshold Selection**: Apply an appropriate probability threshold to convert the continuous probabilities into discrete class labels.
7. **Evaluation**: Measure the performance of the Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1 score on a separate test dataset.
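
Steps 3 to 5 can be illustrated with a small from-scratch multinomial Naive Bayes. This is a sketch, not a production filter: the four toy emails, the whitespace tokenization, and the function names `train_naive_bayes` and `predict` are all assumptions made for the example. Log-probabilities are summed to avoid numeric underflow, and add-one smoothing is applied.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the class with the highest log-posterior for one tokenized email."""
    class_counts, word_counts, vocabulary = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for word in tokens:
            if word not in vocabulary:
                continue  # words never seen in training are skipped
            # Add-one (Laplace) smoothed log likelihood
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


emails = [
    "win free money now".split(),
    "free prize claim now".split(),
    "meeting agenda for monday".split(),
    "lunch meeting on monday".split(),
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(emails, labels)
```

Calling `predict(model, "free money prize".split())` classifies the message as spam on this toy data, while `predict(model, "monday meeting".split())` is classified as ham.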

## KNN

# Q10.
### What is the K-Nearest Neighbors (KNN) algorithm?

The K-Nearest Neighbors (KNN) algorithm is a simple and popular supervised machine learning algorithm used for classification and regression tasks. It is a non-parametric, instance-based learning method, meaning it does not make any assumptions about the underlying data distribution. Instead, it relies on the proximity of data points to make predictions.

# Q 11.
### How does the KNN algorithm work?

The K-Nearest Neighbors (KNN) algorithm works as follows:

1. Training Phase:
   - The algorithm simply stores the entire training dataset with labeled data points (instances) and their corresponding class labels (for classification tasks) or target values (for regression tasks).
2. Prediction Phase (Classification):
   - Given a new, unlabeled data point to classify, KNN calculates the distance between this data point and all the data points in the training dataset. The most common distance metric used is the Euclidean distance, but other metrics like Manhattan distance or cosine similarity can also be used.
   - The algorithm selects the K data points (neighbors) with the smallest distances to the new data point.
   - The class label of the new data point is determined by majority voting among the K neighbors. The new data point is assigned the class label that occurs most frequently among its K nearest neighbors.
3. Prediction Phase (Regression):
   - For regression tasks, the process is similar to the classification step, except that instead of using majority voting, KNN calculates the average or weighted average of the target values of the K nearest neighbors. This average value becomes the predicted target value for the new data point.
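
The phases above can be sketched in plain Python. This is a minimal brute-force version assuming Euclidean distance and unweighted votes; the function names and the toy data are illustrative only.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to query (Euclidean)."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()
    return [i for _, i in distances[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Predict the majority class label among the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Predict the mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k


# "Training" is just storing the data
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.2, 0.8, 5.0, 5.5, 6.0]
```

For example, `knn_classify(X, labels, (1.2, 1.3))` votes among the three points of the first group, and `knn_regress(X, targets, (1.2, 1.3))` averages their target values.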

# Q 12.
### How do you choose the value of K in KNN?

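Choosing the value of K in K-Nearest Neighbors (KNN) is a critical step that can significantly impact the performance of the algorithm. The appropriate value of K depends on the specific dataset and the complexity of the underlying data distribution. A small K follows the training data closely and is sensitive to noise (high variance), while a large K smooths the decision boundary and can blur the distinction between classes (high bias). In practice, K is usually chosen by evaluating several candidate values with cross-validation and keeping the one with the best validation performance. An odd K avoids ties in binary classification.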
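
One common way to pick K is to score several candidates with cross-validation. The sketch below uses leave-one-out cross-validation on a tiny made-up dataset with one deliberately noisy point. The helper names, the candidate list, and the tie-breaking rule (smallest K wins) are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        correct += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return correct / len(X)


def choose_k(X, y, candidates):
    """Return the candidate K with the best accuracy (smallest K wins ties)."""
    return max(sorted(candidates), key=lambda k: loo_accuracy(X, y, k))


# Two groups plus one noisy "b" point sitting next to the "a" group
X = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (2.5, 2.5)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
best_k = choose_k(X, y, [1, 3, 5])
```

With K = 1 the noisy point misleads its closest neighbor, so a slightly larger K scores better on this data.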

# Q 13.
### What are the advantages and disadvantages of the KNN algorithm?

Advantages of the K-Nearest Neighbors (KNN) algorithm:

1. Simple and Intuitive: KNN is easy to understand and implement, making it a great choice for beginners in machine learning.
2. Non-Parametric: KNN makes no assumptions about the underlying data distribution, making it suitable for various types of data.
3. Adapts to Complex Decision Boundaries: KNN can handle complex decision boundaries and nonlinear relationships between features.
4. No Explicit Training Phase: KNN is a lazy learner, so training amounts to storing the training dataset, which makes training very fast. The computational cost is deferred to prediction time, when distances to the stored points must be calculated.
5. Suitable for Multi-Class Classification: KNN can handle multi-class classification tasks without modification.

Disadvantages of the KNN algorithm:

1. Computationally Expensive: KNN can be computationally expensive, especially with large datasets, as it requires calculating distances between the new data point and all training data points.
2. Sensitivity to Outliers: KNN is sensitive to outliers, as they can significantly affect the distance calculations and lead to incorrect predictions.
3. Need for Feature Scaling: KNN performance can be affected by the scale of features, requiring proper feature scaling before training.
4. High Memory Usage: Since KNN stores the entire training dataset during prediction, it can consume a considerable amount of memory for large datasets.

# Q 14.
### How does the choice of distance metric affect the performance of KNN?

The choice of distance metric in K-Nearest Neighbors (KNN) can significantly impact the performance of the algorithm. The distance metric determines how the similarity or dissimilarity between data points is calculated, which, in turn, affects how KNN finds the nearest neighbors and makes predictions.

# Q 15.
### Can KNN handle imbalanced datasets? If yes, how?

Yes, K-Nearest Neighbors (KNN) can handle imbalanced datasets to some extent, but it requires certain considerations and techniques to address the challenges posed by class imbalance.

Here are some ways KNN can handle imbalanced datasets:

1. Selecting the Right K: The choice of K can influence how KNN handles imbalanced data. A smaller K might give more weight to the local neighborhood, which can be beneficial for handling minority class instances. However, excessively small K values can also introduce noise and lead to overfitting, so it's essential to find a balance through experimentation and cross-validation.
2. Weighted Voting: Implement weighted voting in KNN so that closer neighbors count more than distant ones, for example with weights inversely proportional to the distance from the new data point. Votes can additionally be weighted by class, giving more importance to neighbors that belong to the minority class.

# Q 16.
### How do you handle categorical features in KNN?

Handling categorical features in K-Nearest Neighbors (KNN) requires converting these features into a numerical format since KNN relies on distance calculations in the feature space. There are two common methods to handle categorical features in KNN:

1. Label Encoding: In this method, each category in a categorical feature is assigned a unique integer label. For example, if we have a feature "Color" with categories "Red," "Blue," and "Green," we can map them to 0, 1, and 2, respectively. However, be cautious with this approach, as it implies an ordinal relationship between the categories, which may not be the case for all categorical variables.
2. One-Hot Encoding: This method creates binary columns for each category in the original feature. Each category becomes a new feature, and its presence is denoted by a 1, while all other columns are set to 0. This way, there is no implied ordinal relationship between the categories.

# Q 17.
### What are some techniques for improving the efficiency of KNN?

Improving the efficiency of K-Nearest Neighbors (KNN) is crucial, especially when dealing with large datasets or high-dimensional feature spaces. Here are some techniques to enhance the efficiency and effectiveness of KNN:

1. Feature Scaling: Properly scale the features before applying KNN. Standardize the features to have a mean of 0 and a standard deviation of 1 (z-score normalization). Feature scaling does not speed up the neighbor search itself, but it ensures that all features contribute comparably to the distance calculations, which usually improves accuracy.
2. Dimensionality Reduction: Reduce the dimensionality of the feature space using techniques like Principal Component Analysis (PCA). Reducing the number of features can significantly speed up distance computations, especially in high-dimensional datasets. t-distributed Stochastic Neighbor Embedding (t-SNE) is better suited to visualization, since it does not provide a direct mapping for new data points.
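
The z-score standardization from item 1 can be sketched as below. The function names are hypothetical, and the population standard deviation is used here as an assumption (libraries differ on this detail). The statistics are fitted once and then reused, so new query points are scaled the same way as the stored data.

```python
import statistics


def fit_standardizer(rows):
    """Compute the per-feature mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply z-score scaling using previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Two features on very different scales
train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
means, stds = fit_standardizer(train)
scaled = standardize(train, means, stds)
```

After scaling, both columns span the same range, so neither dominates a Euclidean distance simply because of its units.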

# Q 18.
### Give an example scenario where KNN can be applied.

One example scenario where K-Nearest Neighbors (KNN) can be applied is in the field of personalized movie recommendations.

**Scenario: Personalized Movie Recommendations**

In this scenario, the goal is to recommend movies to users based on their preferences and historical movie ratings. The dataset contains information about users' movie ratings and the genre of each movie.

**Implementation:**

1. Data Collection: Gather a dataset of movie ratings from users, where each rating is associated with a specific movie and user ID. Additionally, collect information about the genre of each movie.
2. Data Preprocessing: Encode categorical features like movie genres using one-hot encoding to convert them into a numerical format. Scale the numerical features, such as movie ratings, using feature scaling techniques.
3. KNN Model: Represent each user as a feature vector built from their movie ratings and genre preferences, and store these vectors as the training data for a nearest-neighbor search.
4. Nearest Neighbors: When a new user rates a movie or expresses their preferences, use the KNN model to find the K nearest neighbors (users) in the feature space, based on their movie ratings and genre preferences.
5. Recommendation: Recommend to the new user the top-rated movies of the selected neighbors that the new user has not yet watched. The rating for an unseen movie can be predicted as the average or weighted average of the neighbors' ratings for it.
6. Evaluation: Measure the performance of the recommendation system using evaluation metrics like precision, recall, or Mean Average Precision (MAP) on a separate test dataset.


## Clustering

# Q 19.
### What is clustering in machine learning?

Clustering in machine learning is a type of unsupervised learning technique where the goal is to group similar data points together based on their inherent similarities or patterns. The process of clustering involves partitioning a dataset into subsets, known as clusters, in such a way that data points within the same cluster are more similar to each other than to those in other clusters. Clustering algorithms do not use predefined class labels; instead, they try to identify inherent structures or patterns within the data.

# Q 20.
### Explain the difference between hierarchical clustering and k-means clustering.


Hierarchical clustering and K-Means clustering are both popular techniques for partitioning data into clusters, but they have significant differences in their approaches and outputs:

**Approach:**
- Hierarchical Clustering: This method creates a hierarchical representation of the data by iteratively merging or splitting clusters. The agglomerative (bottom-up) variant starts with each data point as its own cluster and repeatedly merges the closest clusters until a single cluster containing all data points is formed. The divisive (top-down) variant starts with one cluster containing all data points and repeatedly splits clusters until each data point is its own cluster. The result is a dendrogram, representing the hierarchical structure of clusters.
- K-Means Clustering: K-Means is a partition-based algorithm that assigns data points to a fixed number (K) of clusters. It starts with randomly initialized cluster centroids and iteratively assigns each data point to the nearest centroid. Then, it updates the centroids based on the mean of the points assigned to each cluster. This process continues until the centroids stabilize, and the clusters converge.

**Number of Clusters:**
- Hierarchical Clustering: Hierarchical clustering does not require the user to specify the number of clusters in advance. The number of clusters is determined implicitly by the dendrogram and can be chosen at different levels of the hierarchy.
- K-Means Clustering: K-Means requires the user to specify the number of clusters (K) before running the algorithm. The selection of K can influence the clustering results and is often determined using validation techniques or domain knowledge.
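
The K-Means loop described under **Approach** can be sketched in plain Python. This is a minimal version of the standard assign-then-update iteration (Lloyd's algorithm); the function name, the fixed random seed, and the choice to keep an old centroid when a cluster becomes empty are assumptions for illustration.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points with Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    assignments = None
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its points
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignments = kmeans(points, k=2)
```

Note that K must be passed in explicitly, which is the key practical difference from hierarchical clustering described above.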


# Q 21.
### How do you determine the optimal number of clusters in k-means clustering?

Determining the optimal number of clusters (K) in K-Means clustering is a crucial step to achieve meaningful and interpretable results. Two common methods are:

1. Elbow Method: Plot the inertia of the K-Means model for different values of K. The inertia is the sum of squared distances between data points and their assigned cluster centroids (the within-cluster sum of squares). Inertia always decreases as K grows, so the plot resembles an elbow shape, and the optimal K is typically the point where the inertia starts to level off or decreases at a slower rate. This point suggests that adding more clusters does not significantly improve the model's fit.
2. Silhouette Score: Calculate the silhouette score for different values of K. The silhouette score measures how well-separated the clusters are and ranges from -1 to 1. Higher values indicate better-defined clusters. The optimal K is the one that maximizes the silhouette score.
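
For reference, the inertia used by the elbow method can be written out. The notation is an illustrative assumption: $x_i$ is the $i$-th of $n$ data points and $\mu_{c(i)}$ is the centroid of the cluster that $x_i$ is assigned to.

$$
\text{inertia}(K) = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2
$$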

# Q 22.
### What are some common distance metrics used in clustering?

In clustering, distance metrics play a crucial role in measuring the similarity or dissimilarity between data points. Different distance metrics are used depending on the nature of the data and the clustering algorithm. Here are some common distance metrics used in clustering:

1. Euclidean Distance: The most widely used distance metric in clustering. It measures the straight-line distance between two data points in a multi-dimensional space. It is suitable for continuous numerical data.
2. Manhattan Distance: Also known as L1 distance or city block distance, it measures the sum of the absolute differences between the coordinates of two data points. It is less sensitive to large differences in a single coordinate than Euclidean distance, which can make it more robust to outliers.
3. Cosine Similarity: It measures the cosine of the angle between two non-zero vectors. It is commonly used when dealing with text data or sparse feature representations and is invariant to the overall magnitude of the vectors. It is turned into a distance as 1 minus the similarity.
4. Correlation Distance: Defined as 1 minus the Pearson correlation between two data points, so strongly correlated points are close together. It is suitable for datasets where the magnitude of the values is not essential, and the relative relationships matter more.
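
The four metrics can be compared on a small example. The sketch below uses plain Python; the function names are illustrative, and the cosine and correlation variants are written as distances (1 minus the similarity), which is a common convention assumed here.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def correlation_distance(a, b):
    """1 - Pearson correlation; assumes neither vector is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_distance(centered_a, centered_b)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the magnitude
```

For these two vectors the Euclidean and Manhattan distances are clearly non-zero, while the cosine and correlation distances are approximately zero, because `v` points in the same direction as `u`.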

# Q 23.
### How do you handle categorical features in clustering?

Handling categorical features in clustering requires converting them into a numerical format since many clustering algorithms rely on distance-based calculations. Here are some common techniques to handle categorical features in clustering:

1. Label Encoding: Assign a unique integer label to each category in the categorical feature. For example, if you have a feature "Color" with categories "Red," "Blue," and "Green," you can map them to 0, 1, and 2, respectively. However, be cautious with this approach, as it implies an ordinal relationship between the categories, which may not be appropriate for all categorical variables.
2. One-Hot Encoding: Create binary columns for each category in the original feature. Each category becomes a new feature, and its presence is denoted by a 1, while all other columns are set to 0. This way, there is no implied ordinal relationship between the categories.
3. Binary Encoding: Replace each category with binary code. This method encodes each category as a binary representation of its integer index. For example, if we have three categories (0, 1, and 2), we can represent them as (00, 01, and 10) in binary format.
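
Binary encoding, the least familiar of the three, can be sketched as follows. The helper name `binary_encode` is hypothetical, and integer indices are assigned in order of first appearance as an assumption.

```python
def binary_encode(values):
    """Encode each category as the fixed-width binary digits of its integer index."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    # Number of bits needed to represent the largest index (at least 1)
    width = max(1, (len(mapping) - 1).bit_length())
    return [[int(bit) for bit in format(mapping[v], f"0{width}b")] for v in values]


encoded = binary_encode(["Red", "Blue", "Green", "Blue"])
```

Three categories need only two binary columns here, compared with three columns for one-hot encoding. The saving grows with the number of categories.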

# Q 24.
### What are the advantages and disadvantages of hierarchical clustering?

Advantages of Hierarchical Clustering:

1. Hierarchy of Clusters: Hierarchical clustering creates a tree-like structure (dendrogram) that visually represents the hierarchy of clusters at different levels of granularity. This allows for easy interpretation and understanding of the relationships between clusters.
2. No Need to Specify K: Unlike K-Means, hierarchical clustering does not require the user to specify the number of clusters (K) beforehand. The dendrogram can be cut at different levels to obtain the desired number of clusters.
3. Flexibility in Distance Metrics: Hierarchical clustering can work with various distance metrics, allowing it to handle different types of data and distances.

Disadvantages of Hierarchical Clustering:

1. Computational Complexity: Hierarchical clustering can be computationally expensive, especially for large datasets. The time complexity is O(n^3) for a naive implementation of agglomerative hierarchical clustering, and the pairwise distance matrix requires O(n^2) memory, which can make it impractical for very large datasets.
2. Lack of Global Optimality: Hierarchical clustering decisions are made locally during each merging or splitting step, which may lead to suboptimal overall clustering solutions.
3. Difficulty in Handling Large Datasets: Dendrograms can become difficult to visualize and interpret for large datasets, making it challenging to determine the appropriate number of clusters.

# Q 25.
### Explain the concept of silhouette score and its interpretation in clustering.

The silhouette score is a widely used metric to evaluate the quality and consistency of clustering results. It quantifies how well each data point fits into its assigned cluster and provides an overall measure of cluster cohesion and separation. The silhouette score ranges from -1 to 1, with higher values indicating better-defined and well-separated clusters. Values near 0 indicate overlapping clusters, and negative values suggest that points may have been assigned to the wrong cluster.
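
For a single point, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below averages this over all points. It assumes Euclidean distance, at least two clusters, and no duplicate points, and it scores singleton clusters as 0 by convention.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to each cluster, leaving p out of its own cluster
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_score(points, [0, 1, 0, 1])   # pairs each point with a far one
```

The sensible grouping scores close to 1, while the grouping that pairs distant points gets a negative score.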

# Q 26.
### Give an example scenario where clustering can be applied.

An example scenario where clustering can be applied is customer segmentation in marketing.

**Scenario: Customer Segmentation in Marketing**

In this scenario, a marketing team wants to divide their customer base into distinct groups or segments based on their behavior, preferences, and purchasing patterns. The goal is to gain insights into customer segments and tailor marketing strategies to better meet the specific needs of each group.

**Implementation:**
1. Data Collection: Gather data on customer demographics, purchase history, website interactions, customer support interactions, and any other relevant information.
2. Data Preprocessing: Clean and preprocess the data, handle missing values, and perform feature engineering if needed.
3. Clustering: Apply a clustering algorithm (e.g., K-Means, Hierarchical Clustering, or DBSCAN) to cluster the customers based on their features.
4. Determine Optimal K: Use evaluation metrics like the silhouette score or elbow method to find the optimal number of clusters (K).
5. Customer Segmentation: Assign each customer to the cluster they belong to based on the clustering results.
6. Analysis and Insights: Analyze the characteristics and behaviors of customers in each segment to understand their preferences, needs, and patterns.
7. Marketing Strategies: Develop tailored marketing strategies for each customer segment. For example, customers in one segment might respond better to discount offers, while another segment prefers personalized recommendations.
8. Monitoring and Refinement: Continuously monitor the performance of the marketing strategies and refine the clusters and strategies as needed based on customer feedback and changing trends.


## Anomaly Detection


# Q 27.
### What is anomaly detection in machine learning?

# Q 28.
### Explain the difference between supervised and unsupervised anomaly detection.

# Q 29.
### What are some common techniques used for anomaly detection?

# Q 30.
### How does the One-Class SVM algorithm work for anomaly detection?

# Q 31.
### How do you choose the appropriate threshold for anomaly detection?

# Q 32.
### How do you handle imbalanced datasets in anomaly detection?

# Q 33.
### Give an example scenario where anomaly detection can be applied.

## Dimension Reduction

# Q 34.
### What is dimension reduction in machine learning?

# Q 35.
### Explain the difference between feature selection and feature extraction.

# Q 36.
### How does Principal Component Analysis (PCA) work for dimension reduction?

# Q 37.
### How do you choose the number of components in PCA?

# Q 38.
### What are some other dimension reduction techniques besides PCA?

# Q 39.
### Give an example scenario where dimension reduction can be applied.

## Feature Selection

# Q 40.
### What is feature selection in machine learning?

# Q 41.
### Explain the difference between filter, wrapper, and embedded methods of feature selection.

# Q 42.
### How does correlation-based feature selection work?

# Q 43.
### How do you handle multicollinearity in feature selection?

# Q 44.
### What are some common feature selection metrics?

# Q 45.
### Give an example scenario where feature selection can be applied.

## Data Drift Detection

# Q 46.
### What is data drift in machine learning?


# Q 47.
### Why is data drift detection important?

# Q 48.
### Explain the difference between concept drift and feature drift.

# Q 49.
### What are some techniques used for detecting data drift?

# Q 50.
### How can you handle data drift in a machine learning model?

## Data Leakage

# Q 51.
### What is data leakage in machine learning?


Data leakage is a phenomenon in machine learning where information that would not be available at prediction time, such as data from the test set or from the future, is used when training the model. This produces overly optimistic evaluation results, and the model then performs poorly on genuinely unseen data.

# Q 52.
### Why is data leakage a concern?

Data leakage is a concern because it makes a model look better during evaluation than it really is, so the model does not generalize well to new data once deployed. This can have serious consequences, such as incorrect predictions or financial losses.

# Q 53.
### Explain the difference between target leakage and train-test contamination.

Target leakage occurs when the features include information about the target variable that would not be available at prediction time. This can happen, for example, if a feature is recorded after the outcome is known or is derived from the target itself. Train-test contamination occurs when data from the test set leaks into the training process. This can happen, for example, if the training and test sets are not properly separated, if preprocessing (such as scaling or feature selection) is fitted on the full dataset before splitting, or if the model is retrained on the test set.

# Q 54.
### How can you identify and prevent data leakage in a machine learning pipeline?

There are a number of ways to identify and prevent data leakage in a machine learning pipeline. Some common methods include:

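- **Visualizing the data:** This can help to identify suspicious patterns, such as a single feature that predicts the target almost perfectly.
- **Using statistical tests:** Compare the training and test sets, for example by checking for duplicated records across the two sets or for features whose correlation with the target is implausibly high.
- **Using a holdout set:** This is a separate set of data that is not used to train the model. The model is then evaluated on the holdout set to see how well it performs on unseen data. All preprocessing steps should be fitted on the training data only and then applied unchanged to the holdout set.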
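
A holdout workflow that avoids contamination can be sketched as follows. This is a minimal example under stated assumptions: the helper names, the 25% test fraction, and the single numeric feature are hypothetical. The point is that scaling statistics are fitted on the training split only and then reused on the holdout split.

```python
import random
import statistics


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split them into disjoint train/test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(train_values):
    """Learn scaling statistics from the training values only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values) or 1.0


def apply_scaler(values, mean, std):
    """Scale any split with statistics that were fitted on the training split."""
    return [(v - mean) / std for v in values]


values = [float(v) for v in range(1, 21)]
train, test = split_train_test(values)
mean, std = fit_scaler(train)            # test rows never influence these statistics
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)
overlap = set(train) & set(test)         # a non-empty overlap signals contamination
```

Fitting the scaler on `values` before splitting would be a subtle form of train-test contamination, because the test rows would then influence the mean and standard deviation used in training.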

# Q 55.
### What are some common sources of data leakage?


Some common sources of data leakage include:

* **Using the target variable to select features on the full dataset:** This can happen, for example, if features that are highly correlated with the target are selected before the data is split, so that the test set influences the selection.
* **Using the target variable to weight features on the full dataset:** This can happen, for example, if feature weights or target-based encodings are computed with the test rows included.
* **Not properly separating the training and test sets:** This can happen, for example, if the same records (or duplicates of them) appear in both sets, or if time-ordered data is shuffled so that future observations end up in the training set.

# Q 56.
### Give an example scenario where data leakage can occur.

Here is an example scenario where data leakage can occur:

A company is building a model to predict customer churn. The company has a large dataset of customer data, including the customer's past purchase history, demographics, and whether or not the customer has churned in the past. The company decides to use the customer's past purchase history to select features for the model. However, the customer's past purchase history also includes information about the customer's churn status. This means that the target variable (churn status) is leaking into the features used for training. As a result, the model appears highly accurate during evaluation but performs poorly on new customers, for whom this information is not yet available.


## Cross Validation

# Q 57.
### What is cross-validation in machine learning?



# Q 58.
### Why is cross-validation important?

# Q 59.
### Explain the difference between k-fold cross-validation and stratified k-fold cross-validation.

# Q 60.
### How do you interpret the cross-validation results?