In [None]:
"""
Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

1. What are missing values in a dataset?

Missing values occur when no data value is stored for a particular variable (column) in an observation (row).
They are usually represented as NaN, NULL, or blank cells.

2. Why is it essential to handle missing values?

Handling missing values is important because:

Algorithms may fail

Many ML algorithms cannot work with missing values.

Incorrect results

Missing data can bias predictions and reduce accuracy.

Loss of valuable information

Dropping every row that has a missing value may discard useful data held in the other columns.

Better model performance

Proper handling improves reliability and robustness.

Data consistency

Clean data ensures correct statistical analysis.

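3. Algorithms that are not affected by missing values

No algorithm is completely unaffected, but some implementations handle missing values internally or are less sensitive to them:

Decision Trees (implementations that support missing values, e.g. through surrogate splits)

Random Forest (depends on the implementation)

Gradient Boosting (XGBoost and LightGBM learn which branch missing values should follow)

Naïve Bayes (in some implementations, which skip the missing feature)

k-Nearest Neighbors (KNN) (only with a distance measure that ignores missing features; standard KNN needs complete data)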


"""

In [None]:
"""
Q2: List down techniques used to handle missing data. Give an example of each with python code.

1. Removing rows with missing values

Description:
Rows containing missing values are deleted.

When to use:

When missing values are very few

When dataset is large

2. Removing columns with missing values

Description:
Columns with missing values are removed.

When to use:

When a column has too many missing values

When column is not important

3. Mean Imputation (Numerical data)

Description:
Missing values are replaced with the mean of the column.

4. Median Imputation (Numerical data)

Description:
Missing values are replaced with the median of the column.
Useful when data has outliers.

5. Mode Imputation (Categorical data)

Description:
Missing values are replaced with the most frequent value.

6. Forward Fill (Propagation method)

Description:
Missing value is replaced with the previous value.

7. Backward Fill

Description:
Missing value is replaced with the next value.

8. Using a Constant Value

Description:
Missing values are replaced with a fixed value like 0 or "Unknown".

"""

In [None]:
"""
Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data occurs when the classes in a dataset are not equally represented, and one class has significantly more samples than the other(s).
It is very common in classification problems.

Most machine learning algorithms optimize overall accuracy and treat every misclassification as equally costly.
When data is imbalanced, the model therefore becomes biased toward the majority class.

What happens if imbalanced data is not handled?
1. High accuracy but poor performance

Model may predict only the majority class.

Example: If only 1% of transactions are fraud, predicting Not Fraud for all cases gives 99% accuracy but detects no fraud.

2. Poor minority class prediction

The model fails to identify rare but important cases (fraud, disease, defects).

3. Misleading evaluation metrics

Accuracy becomes meaningless.

Precision, recall, and F1-score for the minority class are often poor even when accuracy looks high.

4. Increased business risk  

Missed fraud cases → financial loss

Missed disease detection → health risk

5. Overfitting to majority class

Model learns patterns only from the dominant class.

"""

In [None]:
"""
Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

1. Up-sampling
Definition:

Up-sampling is a technique used to handle imbalanced data by increasing the number of samples in the minority class so that it becomes comparable to the majority class.

This is done by:

Duplicating existing minority samples, or

Generating synthetic samples (e.g., SMOTE).

When up-sampling is required:

When the dataset is small

When removing majority data is risky

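When minority class is more important (fraud, disease detection)

Example:
A small medical dataset with 950 healthy and 50 diseased patients, where discarding healthy records would leave too little data to train on.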

2. Down-sampling
Definition:

Down-sampling is a technique where the number of samples in the majority class is reduced to balance the dataset.

This is done by randomly removing samples from the majority class.

When down-sampling is required:

When dataset is very large

When computation cost is high

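When majority class has redundant data

Example:
A transaction log with millions of legitimate records and a few thousand fraudulent ones, where a random subset of the legitimate records is enough for training.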


"""


In [None]:
"""
Q5: What is Data Augmentation? Explain SMOTE.

Data augmentation is a technique used to increase the size and diversity of a dataset by creating new data samples from existing data without collecting new data.

It is mainly used to:

Handle imbalanced datasets

Reduce overfitting

Improve model generalization

SMOTE (Synthetic Minority Over-sampling Technique) is a data augmentation technique used for imbalanced classification problems.

It increases the minority class samples by creating synthetic (new) data points, not by duplicating existing ones.

How SMOTE works:

Select a minority class data point

Find its k nearest minority neighbors

Create a new synthetic point at a random position on the line segment between the point and one of those neighbors

Repeat until classes are balanced

"""

In [None]:
"""
Q6: What are outliers in a dataset? Why is it essential to handle outliers?

1. What are outliers in a dataset?

Outliers are data points that significantly differ from the majority of the data values.
They may occur due to measurement errors, data entry mistakes, or genuine rare events.

2. Why is it essential to handle outliers?
1. Prevents misleading results

Outliers can distort mean, variance, and correlations.

2. Improves model accuracy

Many algorithms (Linear Regression, KNN, SVM) are sensitive to outliers.

3. Better data visualization & analysis

Outliers can hide true data patterns.

4. Reduces overfitting to extreme values

Without handling, a model may fit a few extreme values instead of the general trend.

5. Detects errors and anomalies

Helps identify data entry or system errors.

3. What happens if outliers are not handled?

Poor model performance

Wrong predictions

Unstable and unreliable models

"""

In [None]:
"""
Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

When working on customer data, missing values are common (e.g., missing age, income, location).
Several techniques can be used to handle missing data effectively, depending on the situation.

1. Remove Missing Data
a) Deleting rows

Remove records with missing values.

Suitable when missing data is very small and random.

Example:
Remove customers with missing age.

b) Deleting columns

Remove entire columns if they have too many missing values.

Useful when the column is not important.

2. Statistical Imputation
a) Mean Imputation

Replace missing values with the average.

Used for numerical features like income or spending.

b) Median Imputation

Replace with middle value.

Best when data has outliers.

c) Mode Imputation

Replace with most frequent value.

Used for categorical features like city or gender.

3. Forward Fill and Backward Fill

Forward fill: Use the previous record's value.

Backward fill: Use next available value.

Mostly used in time-series customer data.

4. Constant Value Replacement

Replace missing values with a placeholder like "Unknown" or 0.

Useful for categorical customer attributes.

5. Model-Based Imputation

Predict missing values using machine learning models.

Example: Predict missing income based on age and occupation.

6. Using Algorithms That Handle Missing Data

Some algorithms handle missing values internally, depending on the implementation:

Decision Trees

Random Forest

XGBoost

7. Business-Aware Handling

For critical fields (email, phone): record may be discarded

For optional fields (age, income): imputation is preferred

"""

In [None]:
"""
Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?

1. Types of Missing Data (Conceptual)

MCAR (Missing Completely At Random):
Missingness has no relationship with any variable.

MAR (Missing At Random):
Missingness depends on other observed variables.

MNAR (Missing Not At Random):
Missingness depends on the missing value itself.

2. Strategies to Identify Missing Data Patterns
1. Calculate Missing Value Percentages

Check how much data is missing in each column.

Very small and uniform missingness often suggests randomness.

Example:
If 1 to 2% of values are missing evenly across columns → likely random.

2. Visual Inspection

Use heatmaps or matrix plots to visualize missing values.

Random missing data appears scattered.

Pattern-based missing data appears clustered.

Example:
Income missing only for senior citizens → pattern exists.

3. Compare Distributions

Compare rows with missing values vs without missing values.

If their distributions differ significantly, missingness is not random.

Example:
Customers with missing income mostly belong to a specific region.

4. Correlation with Other Features

Create a missing indicator column (1 = missing, 0 = not missing).

Check correlation with other variables.

Example:
Income missing strongly correlated with employment type.

5. Group-wise Analysis

Analyze missing data by categories such as:

Age group

Gender

Region

Example:
Phone numbers missing mostly for older customers.

6. Statistical Tests

Use tests like Little's MCAR test to formally test randomness.

A significant result (small p-value) is evidence that the data is not MCAR; the test cannot distinguish MAR from MNAR.

7. Business & Data Collection Understanding

Understand how the data was collected.

Survey skip logic or system failures often create patterns.

Example:
Optional survey questions → intentional missing data.

3. Why this analysis is important

Determines correct imputation strategy

Avoids biased analysis

Improves model accuracy and trustworthiness

"""

In [None]:
"""
Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Evaluating Model Performance on an Imbalanced Medical Dataset

In medical diagnosis datasets, it is common that most patients do not have the disease, while only a small percentage do.
In such imbalanced datasets, using accuracy alone is misleading. Proper evaluation strategies are essential.

1. Use Appropriate Evaluation Metrics (Instead of Accuracy)
a) Confusion Matrix

Shows TP, FP, TN, FN

Helps understand how many sick patients are correctly or incorrectly classified

Why important:
False Negatives (missing a disease) are critical in healthcare.

b) Precision

Measures how many predicted positive cases are actually positive

Important when false positives are costly

c) Recall (Sensitivity)

Measures how many actual disease cases are correctly detected

Usually the most important metric in medical diagnosis, because a missed case (false negative) can be dangerous

d) F1-Score

Harmonic mean of precision and recall

Useful when classes are imbalanced

2. ROC Curve and AUC Score

ROC Curve: Trade-off between True Positive Rate and False Positive Rate

AUC: Overall model performance independent of threshold

Why useful:
Allows threshold-independent comparison of models, although it can look optimistic when the positive class is rare.

3. Precision-Recall Curve

More informative than ROC when the positive class is rare

Focuses on performance of the minority (disease) class

4. Stratified Train-Test Split

Ensures both train and test sets maintain the same class distribution

Prevents biased evaluation

5. Cost-Sensitive Evaluation

Assign higher cost to False Negatives

Evaluate whether the model minimizes critical medical errors

6. Cross-Validation with Stratification

Use Stratified K-Fold Cross Validation

Ensures stable and reliable performance estimates

7. Baseline Comparison

Compare model against a naive baseline (e.g., always predicting “no disease”)

Ensures model adds real value

8. Domain-Specific Evaluation

Collaborate with medical experts

Evaluate whether predictions are clinically acceptable

"""

In [None]:
"""
Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

Handling an Imbalanced Customer Satisfaction Dataset

When most customers report being satisfied and only a few are unsatisfied, the dataset becomes imbalanced.
To build a reliable model, the data must be balanced and the majority class may need to be down-sampled.

1. Random Down-sampling of the Majority Class

What it does:

Randomly removes samples from the satisfied (majority) class.

When to use:

Dataset is large

Majority class has redundant data

Example:
Reduce 90,000 satisfied customers to 10,000 to match the 10,000 unsatisfied customers.

2. Stratified Sampling

What it does:

Maintains class proportions while sampling, so it does not balance the classes by itself.

Often combined with down-sampling.

Use case:

Splitting train and test sets with the original class ratio, then down-sampling only the training set.

3. Cluster-Based Down-sampling

What it does:

Groups majority class data into clusters

Samples equally from each cluster to preserve diversity

Benefit:

Reduces information loss compared to random down-sampling.

4. NearMiss Algorithm

What it does:

Selects majority class samples closest to minority class samples.

Focuses learning on decision boundaries.

Best for:

Improving classification near class overlap regions.

5. Tomek Links (Cleaning Majority Class)

What it does:

Finds pairs of opposite-class samples that are each other's nearest neighbors (Tomek links) and removes the majority sample from each pair.

Helps clean noisy boundaries.

6. Ensemble Methods with Balanced Sampling

Examples:

Balanced Random Forest

EasyEnsemble

Benefit:

Each model is trained on a different down-sampled subset.

7. Combine Down-sampling with Up-sampling (Hybrid)

Example:

Down-sample satisfied customers slightly

Up-sample unsatisfied customers using SMOTE

8. Keep Test Set Imbalanced

Important note:

Balance only the training data

Keep test data reflecting real-world distribution

"""

In [None]:
"""
Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

Handling Rare Events by Up-sampling the Minority Class

When estimating the occurrence of a rare event (fraud, failure, disease), datasets are usually highly imbalanced, with very few positive cases.
To improve model learning, the minority class can be up-sampled.

1. Random Over-sampling

What it does:

Randomly duplicates existing minority class samples.

Advantages:

Simple to implement

No data loss

Limitation:

Risk of overfitting due to duplication

2. SMOTE (Synthetic Minority Over-sampling Technique)

What it does:

Generates new synthetic minority samples using nearest neighbors.

Why effective:

Avoids exact duplication

Improves generalization

3. SMOTE Variants
a) Borderline-SMOTE

Generates samples near class boundaries

Improves decision boundary learning

b) ADASYN

Creates more synthetic samples for harder-to-learn minority points

4. Data Augmentation

What it does:

Creates new samples by modifying existing data

Common in image, text, and signal data

Example:

Slightly altering sensor readings to simulate failures

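5. Ensemble Methods with Resampling

Examples:

EasyEnsemble

Balanced Bagging

Benefit:

Each model is trained on a differently resampled, balanced subset

Note:

EasyEnsemble balances each subset by under-sampling the majority class; in imbalanced-learn, Balanced Bagging also under-samples by default.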

6. Cost-Sensitive Learning (Alternative Approach)

What it does:

Penalizes misclassification of minority class more heavily

Does not modify dataset size

7. Keep Validation Data Realistic

Best practice:

Apply up-sampling only on training data

Keep validation/test data unchanged

"""