Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.
Q2: List techniques used to handle missing data. Give an example of each with Python code.
Q3: Explain imbalanced data. What will happen if imbalanced data is not handled?
Q4: What are up-sampling and down-sampling? Explain with an example when up-sampling and down-sampling are required.
Q5: What is data augmentation? Explain SMOTE.
Q6: What are outliers in a dataset? Why is it essential to handle outliers?
Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?
Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?
Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?
Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?
Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

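Q1: Missing values are entries for which no value was recorded for a given variable in a given observation (for example NaN, None, or an empty cell). Handling them is essential because many estimators cannot train on them at all, and ignoring or naively dropping them can bias results, reduce statistical power, and degrade predictions. Algorithms that tolerate missing values natively include gradient-boosted tree implementations such as XGBoost and LightGBM, and scikit-learn's HistGradientBoosting estimators, which learn which branch missing entries should follow at each split. Support in Random Forest depends on the implementation (recent scikit-learn versions accept NaN, older ones do not). Linear models such as Ridge Regression are not robust to missing values and need imputation first.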

Q3: Imbalanced data refers to a situation where the classes in the dataset are not represented equally. If imbalanced data is not handled, machine learning models tend to be biased towards the majority class, leading to poor performance in predicting the minority class.
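
The majority-class bias can be illustrated with a made-up 95/5 split and a hypothetical "model" that always predicts the majority class. It scores high accuracy while never finding a single minority instance.

```python
import numpy as np

# Made-up labels: 95 majority (0) and 5 minority (1) instances
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate predictor that always outputs the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
# Recall on the minority class: fraction of true 1s predicted as 1
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# accuracy=0.95, minority recall=0.00
```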

Q4: Up-sampling increases the number of instances in the minority class (by duplicating existing samples or generating synthetic ones), while down-sampling reduces the number of instances in the majority class. Both address class imbalance. For example, if 90% of instances belong to class A and 10% to class B, you can up-sample class B or down-sample class A until the classes are balanced. Up-sampling is usually preferred when data is scarce and discarding majority samples would lose information; down-sampling suits large datasets where the majority class has plenty of redundant examples and training cost matters.
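
The 90/10 example can be sketched with `sklearn.utils.resample`. The DataFrame, column names, and `random_state` below are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up dataset: 90 rows of class A, 10 rows of class B
df = pd.DataFrame({"feature": range(100), "label": ["A"] * 90 + ["B"] * 10})
majority = df[df["label"] == "A"]
minority = df[df["label"] == "B"]

# Up-sampling: draw minority rows with replacement until they match the majority
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())    # {'A': 90, 'B': 90}
print(downsampled["label"].value_counts().to_dict())  # {'A': 10, 'B': 10}
```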

Q5: Data augmentation artificially creates new training samples from existing data by applying label-preserving transformations, such as rotating, scaling, or flipping images. SMOTE (Synthetic Minority Over-sampling Technique) addresses class imbalance by generating synthetic minority-class samples: it picks a minority sample, selects one of its k nearest minority-class neighbors, and creates a new point at a random position on the line segment between the two.
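
The core SMOTE interpolation step can be sketched in plain NumPy. This is a simplified illustration, not the reference implementation: `smote_sketch`, its parameters, and the sample points are hypothetical, and in practice a library such as imbalanced-learn (`imblearn.over_sampling.SMOTE`) would normally be used.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per point

    base = rng.integers(0, len(X_min), size=n_new)            # random base samples
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Made-up minority-class points
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
synthetic = smote_sketch(X_min, n_new=10)
print(synthetic.shape)  # (10, 2)
```

Because every synthetic point is a convex combination of two real minority points, it always falls inside the region the minority class already occupies.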

Q6: Outliers are data points that significantly differ from other observations in the dataset. It's essential to handle outliers because they can skew statistical analyses and machine learning models, leading to inaccurate results.
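
One common way to flag outliers is the interquartile-range (IQR) rule. The sketch below uses made-up values, and the 1.5 multiplier is a convention rather than something stated above. It also shows how a single outlier pulls the mean far from the median.

```python
import numpy as np

# Made-up measurements with one extreme value
values = np.array([10, 12, 11, 13, 12, 11, 10, 95])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # conventional 1.5 * IQR fences

outliers = values[(values < lower) | (values > upper)]
print(outliers)                           # [95]
print(values.mean(), np.median(values))   # 21.75 11.5
```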

Q7: Techniques to handle missing data in analysis include mean/mode imputation, interpolation, and dropping rows or columns with missing values.
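
Dropping can be sketched with pandas `dropna`. The customer table and the 50% threshold below are assumptions chosen for illustration; the right threshold depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up customer data with missing entries
customers = pd.DataFrame({
    "age": [34, np.nan, 45, 29],
    "income": [52000, 61000, np.nan, 48000],
    "notes": [np.nan, np.nan, np.nan, "vip"],
})

# Drop columns that have fewer than 50% non-missing values ("notes" here)
min_non_missing = int(0.5 * len(customers))
wide = customers.dropna(axis=1, thresh=min_non_missing)

# Then drop any rows that still contain a missing value
complete = wide.dropna(axis=0)
print(complete)
```

Dropping is simple but discards information: here two of the four rows are lost, so imputation may be the better choice when data is scarce.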

Q8: Strategies to determine if missing data is missing at random or if there is a pattern include analyzing correlations between missing values and other variables, plotting missing value patterns, and using statistical tests such as Little's MCAR test.
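
A rough sketch of the correlation check: build a missingness indicator and compare it against another variable. The data is simulated, and the mechanism (income more often missing for customers over 50) is an assumption made up so that a pattern exists to be found.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.integers(20, 70, size=500)
income = rng.normal(50000, 10000, size=500)

# Simulated mechanism (assumption): income is missing more often for older customers
missing_prob = np.where(age > 50, 0.4, 0.05)
income_obs = np.where(rng.random(500) < missing_prob, np.nan, income)
df = pd.DataFrame({"age": age, "income": income_obs})

# Indicator column: 1 where income is missing, 0 otherwise
df["income_missing"] = df["income"].isna().astype(int)

# If data were missing completely at random, both groups would look alike
mean_age_by_missing = df.groupby("income_missing")["age"].mean()
corr = df["income_missing"].corr(df["age"])
print(mean_age_by_missing)
print(f"correlation between missingness and age: {corr:.2f}")
```

A clearly non-zero correlation, or a visible gap between the group means, suggests the data is not missing completely at random.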

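Q9: Strategies to evaluate model performance on an imbalanced dataset include using metrics that focus on the minority class, such as precision, recall, F1-score, ROC-AUC, and the area under the precision-recall curve; inspecting the confusion matrix; and using stratified cross-validation so that every fold preserves the class ratio. Accuracy alone is misleading here, because a model that always predicts "no condition" still scores highly. Resampling (oversampling the minority class or undersampling the majority class) is a training technique rather than an evaluation method; if it is used, apply it only to the training folds so that the test data keeps the real class distribution.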
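
A minimal sketch of such an evaluation with scikit-learn, assuming a synthetic 95/5 dataset from `make_classification` and a plain logistic regression as a stand-in model. The dataset and model are placeholders, not the medical data described in the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data: roughly 95% negatives, 5% positives
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Stratified folds keep the class ratio the same in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["precision", "recall", "f1", "roc_auc", "average_precision"]

scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=metrics)

for name in metrics:
    print(name, round(float(scores[f"test_{name}"].mean()), 3))
```

Here `average_precision` summarizes the precision-recall curve, which tends to be more informative than ROC-AUC when positives are rare.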

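Q10: Methods to balance an unbalanced dataset by down-sampling the majority class include random undersampling, cluster-based undersampling (for example, replacing majority samples with cluster centroids), and neighbor-based methods such as Tomek links or NearMiss. SMOTE, by contrast, is an over-sampling technique: it adds synthetic minority samples rather than reducing the majority class.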

Q11: Methods to balance an unbalanced dataset and up-sample the minority class include random oversampling, SMOTE, and ADASYN (Adaptive Synthetic Sampling) to create synthetic samples for the minority class.

In [7]:
# Q2: Techniques to handle missing data include:
# a. Mean/median/mode imputation: replace missing values with the mean, median,
#    or mode of the non-missing values in the column.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Create a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
}

df = pd.DataFrame(data)
print("Sample DataFrame with missing values:")
print(df)

# Mean imputation
imputer = SimpleImputer(strategy='mean')
df_mean_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print("\nDataFrame after mean imputation:")
print(df_mean_imputed)

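# b. Forward/backward fill: use the value preceding or succeeding the missing value to fill it.

# Forward fill. df.ffill() replaces fillna(method='ffill'), which is deprecated
# since pandas 2.1. A leading NaN (column B, row 0) has no previous value and stays NaN.
df_ffill = df.ffill()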
print("\nDataFrame after forward fill:")
print(df_ffill)

# Forward fill followed by backward fill
df_ffill_bfill = df.ffill().bfill()
print("\nDataFrame after forward fill followed by backward fill:")
print(df_ffill_bfill)

# c. Interpolation: estimate missing values based on the surrounding values.

# Linear interpolation between the nearest known values above and below
df_interpolated = df.interpolate(method='linear')
print("\nDataFrame after interpolation:")
print(df_interpolated)

Sample DataFrame with missing values:
     A    B  C
0  1.0  NaN  1
1  2.0  2.0  2
2  NaN  3.0  3
3  4.0  NaN  4
4  5.0  5.0  5

DataFrame after mean imputation:
     A         B    C
0  1.0  3.333333  1.0
1  2.0  2.000000  2.0
2  3.0  3.000000  3.0
3  4.0  3.333333  4.0
4  5.0  5.000000  5.0

DataFrame after forward fill:
     A    B  C
0  1.0  NaN  1
1  2.0  2.0  2
2  2.0  3.0  3
3  4.0  3.0  4
4  5.0  5.0  5

DataFrame after forward fill followed by backward fill:
     A    B  C
0  1.0  2.0  1
1  2.0  2.0  2
2  2.0  3.0  3
3  4.0  3.0  4
4  5.0  5.0  5

DataFrame after interpolation:
     A    B  C
0  1.0  NaN  1
1  2.0  2.0  2
2  3.0  3.0  3
3  4.0  4.0  4
4  5.0  5.0  5
