Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

Missing values are the data points that are absent or unknown in a dataset.
Why it's important to handle them:

Missing values can distort analysis, leading to biased or incorrect results.

Many algorithms raise errors when given missing inputs, and those that do run may train poorly or lose accuracy if missing data is not handled.

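Algorithms that can tolerate missing values:

Tree-based models with native support for missing values (e.g., XGBoost, LightGBM, and scikit-learn's HistGradientBoosting estimators)

Naive Bayes, which can in principle skip a missing feature when computing class probabilities (support depends on the implementation)

K-Nearest Neighbors (KNN) is often listed here, but it relies on distance calculations and generally needs imputation first.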

Q2: List down techniques used to handle missing data. Give an example of each with python code.

In [1]:
pip install pandas
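
Common techniques include deleting affected rows, imputing with the mean, median, or mode, forward filling, and interpolation. The sketch below illustrates each with pandas; the DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up dataset with missing entries
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
    "income": [50.0, 60.0, np.nan, 80.0, 90.0],
})

# 1. Deletion: drop every row that has at least one missing value
dropped = df.dropna()

# 2. Mean imputation for a numeric column
mean_filled = df["age"].fillna(df["age"].mean())

# 3. Median imputation (less sensitive to outliers than the mean)
median_filled = df["income"].fillna(df["income"].median())

# 4. Mode imputation for a categorical column
mode_filled = df["city"].fillna(df["city"].mode()[0])

# 5. Forward fill: carry the last observed value forward (ordered data)
forward_filled = df["income"].ffill()

# 6. Linear interpolation between neighbouring observed values
interpolated = df["income"].interpolate()

print(dropped)
print(interpolated.tolist())
```

Which technique fits depends on how much data is missing and why; deletion is simplest but discards information.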



Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data occurs when one class has far more instances than another, which tends to bias the model toward the majority class.

Consequences:

The model might predict the majority class most of the time and ignore the minority class, leading to poor performance on the minority class.

For example, in fraud detection most transactions are legitimate. If the imbalance is ignored, the model can reach high accuracy by labeling everything as legitimate while detecting almost no fraud.
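
The effect can be sketched with a trivial "model" that always predicts the majority class. The labels below are made up (95 legitimate, 5 fraudulent) purely to illustrate why accuracy alone is misleading.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 95 legitimate transactions (0) and 5 fraudulent ones (1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")          # Accuracy: 0.95
print(f"Fraud recall: {fraud_recall:.2f}")  # Fraud recall: 0.00
```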

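Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-
sampling are required.

Up-sampling increases the number of minority-class samples, either by duplicating existing ones or by generating synthetic ones (e.g., SMOTE), until the classes are roughly balanced. It is useful when the dataset is small and discarding data is not affordable, for example a fraud dataset with only a few dozen fraud cases.

Down-sampling reduces the number of majority-class samples by randomly discarding some of them. It is useful when the majority class is very large, so that dropping rows loses little information and speeds up training.

The cells below apply SMOTE (up-sampling) and random down-sampling to a synthetic dataset with a 95:5 class ratio.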

In [5]:
pip install imbalanced-learn




In [8]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from collections import Counter

# Create a simple imbalanced dataset (using make_classification)
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.95, 0.05], flip_y=0, random_state=42)

# Convert to DataFrame (for easier handling)
df = pd.DataFrame(X)
df['target'] = y

# Check the original class distribution
print(f"Original class distribution: {Counter(y)}")

# Apply SMOTE to up-sample the minority class
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Convert back to DataFrame (optional, for easier analysis)
df_resampled = pd.DataFrame(X_resampled)
df_resampled['target'] = y_resampled

# Check the new class distribution
print(f"Resampled class distribution: {Counter(y_resampled)}")


Original class distribution: Counter({0: 950, 1: 50})
Resampled class distribution: Counter({0: 950, 1: 950})


In [9]:
from sklearn.utils import resample
from collections import Counter

# Separate the majority and minority classes
df_majority = df[df['target'] == 0]
df_minority = df[df['target'] == 1]

# Down-sample the majority class
df_majority_downsampled = resample(df_majority,
                                   replace=False,  # sample without replacement (no duplicates)
                                   n_samples=len(df_minority),  # balance to the size of minority class
                                   random_state=42)  # for reproducibility

# Combine the down-sampled majority with the minority class
df_balanced = pd.concat([df_majority_downsampled, df_minority])

# Check the new class distribution
print(f"Resampled class distribution (down-sampling): {Counter(df_balanced['target'])}")


Resampled class distribution (down-sampling): Counter({0: 50, 1: 50})


Q5: What is data Augmentation? Explain SMOTE.

Data Augmentation involves creating new data points by modifying existing data (e.g., flipping, rotating, or scaling images) to increase the size and diversity of the dataset.

SMOTE (Synthetic Minority Over-sampling Technique):

SMOTE generates synthetic minority-class samples by interpolation: for each selected minority instance, it picks one of its k nearest minority-class neighbors and creates a new point at a random position on the line segment between the two.
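
The interpolation step can be sketched by hand as follows. This is a simplified illustration of the idea, not the imbalanced-learn implementation; the minority points, `k`, and the helper `make_synthetic` are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Made-up minority-class points in 2D
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])

# Find the k nearest minority neighbours of each point (k + 1 because the
# nearest neighbour of each training point is the point itself)
k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
_, neighbor_idx = nn.kneighbors(X_minority)

def make_synthetic(i):
    """Create one synthetic sample from minority point i (hypothetical helper)."""
    x = X_minority[i]
    j = rng.choice(neighbor_idx[i][1:])   # random neighbour, skipping the point itself
    gap = rng.random()                    # random position in [0, 1)
    return x + gap * (X_minority[j] - x)  # point on the segment from x to the neighbour

synthetic = np.array([make_synthetic(i) for i in range(len(X_minority))])
print(synthetic)
```

Because each synthetic point lies between two real minority points, the new samples stay inside the region already occupied by the minority class.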

Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points that differ significantly from the rest of the data, taking values that are unusually high or unusually low.

Why it's important to handle them:

Outliers can skew statistical analyses and model training, leading to poor predictions or biased results.

They can distort distributions and mislead the model into making incorrect predictions.
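
A minimal sketch of one common way to detect and cap outliers, the 1.5 × IQR rule, is shown below. The values are made up, and the rule is only one of several reasonable choices.

```python
import pandas as pd

# Made-up measurements with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# The outlier pulls the mean far from the median
print(f"mean={values.mean():.2f}, median={values.median():.2f}")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [95]

# One way to handle them: cap (winsorize) values at the IQR bounds
capped = values.clip(lower, upper)
print(capped.tolist())
```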

Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

You can handle missing data by:

Removing rows or columns with missing data when only a small fraction of the data is affected.

Imputation using statistical measures (mean, median) or machine learning techniques (KNN, regression).

Using algorithms that can handle missing values directly, such as gradient-boosted tree implementations (e.g., XGBoost, LightGBM).
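
As a sketch of the machine learning imputation option, the example below uses scikit-learn's `KNNImputer` on made-up customer data. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up customer data with one missing age and one missing spend
customers = pd.DataFrame({
    "age": [25, 27, np.nan, 52, 50, 48],
    "monthly_spend": [200, 220, 210, 800, np.nan, 780],
})

# Each missing value is replaced by the mean of that feature over the
# 2 most similar rows, where similarity uses the features that are present
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(customers), columns=customers.columns)

print(imputed)
```

Because KNN imputation is distance based, features on very different scales are usually standardized first.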

Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?

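Missing Completely at Random (MCAR): Missingness is unrelated to any variable, observed or unobserved. Test for it with Little’s MCAR test, or by checking that rows with and without missing values have similar distributions on the other features.

Missing at Random (MAR): Missingness depends on other observed variables. Create a missingness indicator for each affected column and check whether it correlates with, or can be predicted from, the other features.

Missing Not at Random (MNAR, also written NMAR): Missingness depends on the unobserved value itself. This cannot be confirmed from the observed data alone, so rely on domain knowledge or, if possible, collect more data to understand the mechanism.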
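
The indicator-based checks can be sketched as follows. The data is simulated so that `income` is more likely to be missing for older customers; that mechanism, the column names, and the probabilities are assumptions made for the demonstration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})

# Simulated mechanism (assumption): income is missing more often above age 60
missing_prob = np.where(df["age"] > 60, 0.4, 0.05)
df.loc[rng.random(n) < missing_prob, "income"] = np.nan

# 1. Build a missingness indicator for the affected column
df["income_missing"] = df["income"].isna().astype(int)

# 2. Compare another feature between rows with and without missing income
print(df.groupby("income_missing")["age"].mean())

# 3. Correlate the indicator with the other feature
corr = df["age"].corr(df["income_missing"])

# 4. Two-sample t-test on age between the two groups
age_missing = df.loc[df["income_missing"] == 1, "age"]
age_present = df.loc[df["income_missing"] == 0, "age"]
t_stat, p_value = stats.ttest_ind(age_missing, age_present, equal_var=False)

print(f"correlation={corr:.3f}, p-value={p_value:.3g}")
```

A clear difference between the groups argues against MCAR and points toward MAR; no detectable difference is consistent with MCAR but does not rule out MNAR.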

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

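Use metrics like Precision, Recall, and F1-Score instead of accuracy, since accuracy is dominated by the majority class. In diagnosis, recall (sensitivity) is usually the priority because a missed case is costly.

ROC-AUC: Summarizes the trade-off between sensitivity and specificity across decision thresholds. Under severe imbalance, the area under the precision-recall curve is often more informative.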

Stratified K-Fold cross-validation: Ensures that each fold has a proportionate number of minority and majority class instances.
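
These evaluation strategies can be sketched with scikit-learn as below. The dataset is synthetic (about 5% positives) and logistic regression is only a stand-in model, both chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic imbalanced data: roughly 95% negative, 5% positive
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

# Stratify the split so both sets keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Per-class precision, recall, and F1
print(classification_report(y_test, y_pred, digits=3))

# Threshold-independent metrics computed from the predicted probabilities
roc_auc = roc_auc_score(y_test, y_score)
pr_auc = average_precision_score(y_test, y_score)
print(f"ROC-AUC={roc_auc:.3f}, PR-AUC={pr_auc:.3f}")

# Stratified 5-fold cross-validation scored by F1 on the positive class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(f1_scores)
```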


Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

Down-sample the majority class by randomly reducing its size.

Alternatively, up-sample the minority class with synthetic data generation techniques like SMOTE.

Use weighted loss functions in models like logistic regression to give more importance to the minority class.
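
A minimal sketch of the weighted-loss option is shown below, using the `class_weight="balanced"` setting of scikit-learn's `LogisticRegression`. The data is synthetic, and treating class 1 as "dissatisfied" is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: class 0 = satisfied (about 90%), class 1 = dissatisfied (about 10%)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Unweighted baseline
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# "balanced" weights each class inversely to its frequency in the training data
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"Minority recall, unweighted: {recall_plain:.3f}")
print(f"Minority recall, weighted:   {recall_weighted:.3f}")
```

Class weighting leaves the dataset unchanged, so no rows are discarded as they are with down-sampling.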


Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

Up-sample the minority class using techniques like SMOTE to generate synthetic data.

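Reframe the task as anomaly detection (e.g., Isolation Forest, One-Class SVM), which treats the rare event as an anomaly instead of requiring balanced classes.

Adjust class weights in models like SVM or Random Forest so that errors on the minority class are penalized more heavily.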
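
Besides SMOTE, the simplest form of up-sampling is random over-sampling with replacement. The sketch below uses `sklearn.utils.resample`; the DataFrame and its column names are made up for illustration.

```python
from collections import Counter

import pandas as pd
from sklearn.utils import resample

# Made-up data: 95 non-events (0) and 5 rare events (1)
df = pd.DataFrame({"feature": range(100), "event": [0] * 95 + [1] * 5})

majority = df[df["event"] == 0]
minority = df[df["event"] == 1]

# Up-sample the minority class by drawing rows with replacement
minority_upsampled = resample(minority,
                              replace=True,             # duplicates are expected
                              n_samples=len(majority),  # match the majority size
                              random_state=42)

balanced = pd.concat([majority, minority_upsampled])
counts = Counter(balanced["event"])
print(counts)  # 95 rows in each class
```

Resampling is normally applied to the training split only, so that the test set still reflects the real class ratio.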