Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.
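Missing values are entries that are absent or undefined for one or more features; in pandas they usually appear as NaN, None, or NaT. They must be handled because many model implementations reject them outright, and because dropping or ignoring them carelessly can bias the analysis, especially when the missingness is related to the target or to other features. Few algorithms are truly unaffected, but some tolerate missing values natively: gradient-boosted tree libraries such as XGBoost and LightGBM, and scikit-learn's HistGradientBoosting estimators, learn at each split which branch missing values should follow. Whether Random Forest accepts missing values depends on the implementation and version, so check the library documentation before relying on it.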


Q2: List down techniques used to handle missing data. Give an example of each with python code.
Techniques to handle missing data include:

Deletion: Removing rows or columns with missing values.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 5, 6, 7]})
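df_rows = df.dropna()        # drop rows with any missing value; returns a new DataFrame
df_cols = df.dropna(axis=1)  # drop columns with any missing value instead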


Imputation: Replacing missing values with estimated or calculated values.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')  # replace each NaN with the column mean
df[['A']] = imp.fit_transform(df[['A']])  # fit_transform expects and returns a 2-D array


Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?
Imbalanced data refers to a dataset in which the classes are not represented equally, with one or more classes having significantly fewer instances than the others. If the imbalance is not handled, the model tends to favor the majority class: it can reach a high overall accuracy simply by predicting the majority class almost every time while missing most minority cases, which are often the cases that matter most (fraud, disease, failures).
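
The effect can be reproduced with a few lines of plain Python. The label counts below are hypothetical; the "model" simply predicts the majority class for every sample.

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
minority_recall = true_pos / actual_pos

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.95, minority recall=0.00
```

Accuracy looks strong even though not a single positive case is detected.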


Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

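Up-sampling: Randomly duplicating minority-class samples (sampling with replacement) until the class distribution is balanced. It is typically used when the dataset is small and discarding majority-class rows would lose too much information.

Down-sampling: Randomly removing majority-class samples until the class distribution is balanced. It is used when the dataset is large enough that the discarded rows are affordable; it also reduces training time.
For example, in fraud detection, where fraudulent transactions are rare, up-sampling the minority class helps the model see enough fraud cases. Conversely, with millions of click logs in which non-clicks dominate, down-sampling the non-clicks keeps training manageable while retaining every click. In both cases, resample only the training set so that the test set still reflects the real class distribution.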
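
Both operations can be sketched with pandas. The transaction data and the column names (amount, is_fraud) are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 8 legitimate (0) and 2 fraudulent (1) transactions.
df = pd.DataFrame({
    "amount": [10, 12, 9, 11, 13, 8, 10, 12, 950, 870],
    "is_fraud": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority count.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up], ignore_index=True)

# Down-sampling: draw majority rows without replacement down to the minority count.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority], ignore_index=True)

print(upsampled["is_fraud"].value_counts().to_dict())
print(downsampled["is_fraud"].value_counts().to_dict())
```

Up-sampling yields 8 rows per class and down-sampling yields 2 rows per class, which shows how much data down-sampling discards on a small dataset.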

Q5: What is data Augmentation? Explain SMOTE.
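Data augmentation artificially increases the size and diversity of a dataset by creating new samples from existing ones, for example by rotating or flipping images or by adding small amounts of noise. SMOTE (Synthetic Minority Over-sampling Technique) is an over-sampling method for class imbalance that applies the same idea to the minority class. For each selected minority sample, it picks one of that sample's k nearest minority-class neighbors and creates a new point at a random position on the line segment between the two, so the synthetic samples are interpolations rather than exact duplicates.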
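
The interpolation step can be sketched in NumPy as follows. This is a simplified illustration, not the reference implementation: the function name smote_sketch, its parameters, and the toy points are all assumptions, and in practice a library version such as SMOTE from imbalanced-learn would normally be used.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))        # pick a random minority sample
        nb = neighbors[base, rng.integers(k)]  # pick one of its k nearest neighbors
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points (k must be smaller than the number of points).
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_sketch(X_minority, n_new=5)
print(X_new.shape)  # (5, 2)
```

Because every synthetic point lies between two real minority points, it stays inside the region the minority class already occupies.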


Q6: What are outliers in a dataset? Why is it essential to handle outliers?
Outliers are data points that deviate significantly from the rest of the data. They can result from measurement errors, data entry mistakes, or genuinely rare events. Handling them is essential because they can distort statistics such as the mean and standard deviation, pull the fit of sensitive models such as linear regression, and lead to erroneous insights. Genuinely rare events may carry real signal, so investigate outliers before removing them.
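
A small sketch of how one outlier distorts the mean, and of one common way to flag it: the 1.5 x IQR rule. The values are hypothetical, and the IQR rule is used here as an illustration; it is not the only detection method.

```python
import numpy as np

# Hypothetical measurements; 500 is a likely data entry error.
values = np.array([48, 50, 51, 49, 52, 50, 47, 500])

mean_value = values.mean()        # pulled far upward by the single extreme value
median_value = np.median(values)  # barely affected
print(mean_value, median_value)

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

The mean lands above 100 even though seven of the eight values are near 50, while the median stays at 50.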


Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?
To handle missing data in customer data analysis, you can:

Impute numerical features with the mean or, when the distribution is skewed, the median.
Impute categorical features with the most frequent value (mode), or with an explicit "Unknown" category.
Apply model-based imputation such as K-nearest neighbors (KNN), which fills each gap from the most similar customers.
Drop columns with a high percentage of missing values if they are not informative, and drop rows only when very few are affected.
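
The imputation options above can be sketched as follows. The customers table and its columns are hypothetical, and n_neighbors=2 is an arbitrary choice for such a small example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical customer data with gaps in numeric and categorical columns.
customers = pd.DataFrame({
    "age": [25, np.nan, 47, 51, 38],
    "income": [32000, 41000, np.nan, 88000, 56000],
    "plan": ["basic", "premium", np.nan, "basic", "basic"],
})
num_cols = ["age", "income"]

# Median imputation for the numeric columns.
median_filled = customers.copy()
median_filled[num_cols] = SimpleImputer(strategy="median").fit_transform(customers[num_cols])

# Mode imputation for the categorical column.
mode_filled = customers.copy()
mode_filled["plan"] = customers["plan"].fillna(customers["plan"].mode()[0])

# KNN imputation: each gap is filled from the 2 most similar rows.
# Features are left unscaled here for brevity; scaling first is usually advisable.
knn_filled = customers.copy()
knn_filled[num_cols] = KNNImputer(n_neighbors=2).fit_transform(customers[num_cols])

print(median_filled[num_cols])
print(mode_filled["plan"].tolist())
print(knn_filled[num_cols])
```

In a modeling workflow, the imputers should be fitted on the training split only and then applied to the test split.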


Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?
To determine if missing data is missing at random or if there is a pattern, you can:

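Visualize the missingness matrix (for example, a heatmap of df.isna()) to see whether gaps cluster in particular rows or columns.
Create a missingness indicator (1 if missing, 0 otherwise) for each affected column and test whether it is associated with other variables, for example with a chi-square test for categorical variables or a t-test for numerical ones.
Compare missing rates across groups such as time period, data source, or customer segment to identify likely causes, for example a form field that was only added recently.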
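
A minimal pandas sketch of these checks, using a hypothetical table in which income is missing more often for one segment than for the other:

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is mostly missing for the "online" segment.
df = pd.DataFrame({
    "segment": ["online"] * 6 + ["store"] * 6,
    "age": [23, 35, 41, 29, 52, 33, 61, 45, 38, 57, 49, 66],
    "income": [np.nan, 42000, np.nan, np.nan, 58000, np.nan,
               61000, 55000, 47000, np.nan, 52000, 70000],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Missingness indicator: 1 where income is missing, 0 otherwise.
df["income_missing"] = df["income"].isna().astype(int)

# Missing rate per group; large differences suggest the data is not missing completely at random.
rate_by_segment = df.groupby("segment")["income_missing"].mean()

# Correlation between the indicator and another numeric variable.
corr_with_age = df["income_missing"].corr(df["age"])

print(missing_share)
print(rate_by_segment)
print(corr_with_age)
```

Here the missing rate differs sharply between segments, which points to a pattern rather than purely random gaps. A formal test would still be needed on real data before drawing conclusions.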



Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?
Strategies to evaluate model performance on an imbalanced dataset include:

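Using metrics that focus on the minority class, such as precision, recall, and F1-score, together with the confusion matrix, instead of accuracy. Report the area under the precision-recall curve alongside AUC-ROC, since AUC-ROC can look optimistic when negatives vastly outnumber positives.
Using a stratified train/test split or stratified cross-validation so that every fold contains the same proportion of positive cases.
Applying any resampling (oversampling the minority class, undersampling the majority class, or both) and any class-weight adjustment to the training data only, and evaluating on an untouched test set that keeps the real class distribution.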
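
A sketch of such an evaluation with scikit-learn. The synthetic dataset from make_classification and the choice of logistic regression are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: roughly 5% positive cases.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Stratified split keeps the same positive rate in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Class weights affect training only; the test set stays untouched.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, y_score)
pr_auc = average_precision_score(y_test, y_score)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))
print("ROC AUC:", roc_auc)
print("PR AUC (average precision):", pr_auc)
```

The classification report lists precision, recall, and F1 per class, so a weak minority-class recall is visible even when overall accuracy is high.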


Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?
To balance a dataset with a majority class in customer satisfaction estimation, you can:

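Down-sample the majority class (satisfied customers) by randomly removing samples until it is closer in size to the minority class.
Split the data with stratified sampling first, then down-sample only the training set, so that the test set keeps the original class distribution.
Optionally combine this with up-sampling the minority class, for example by generating synthetic samples with SMOTE.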
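
A sketch of the split-then-down-sample order, using scikit-learn's resample helper. The survey data, its column names, and the 90/10 class ratio are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical survey: 900 satisfied (1) and 100 dissatisfied (0) customers.
rng = np.random.default_rng(0)
survey = pd.DataFrame({
    "score": rng.normal(size=1000),
    "satisfied": [1] * 900 + [0] * 100,
})

# 1) Stratified split so the test set keeps the original 90/10 ratio.
train, test = train_test_split(
    survey, test_size=0.2, stratify=survey["satisfied"], random_state=0)

# 2) Down-sample the majority class in the training set only.
train_major = train[train["satisfied"] == 1]
train_minor = train[train["satisfied"] == 0]
train_major_down = resample(
    train_major, replace=False, n_samples=len(train_minor), random_state=0)

# 3) Recombine and shuffle.
train_balanced = pd.concat([train_major_down, train_minor]).sample(frac=1, random_state=0)

print(train_balanced["satisfied"].value_counts().to_dict())
print(test["satisfied"].value_counts().to_dict())
```

The training set ends up balanced while the test set still reflects the real proportion of satisfied customers.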


Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?
To balance a dataset with a low percentage of occurrences in estimating a rare event, you can:

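Up-sample the minority class by randomly duplicating its samples (sampling with replacement).
Generate synthetic minority samples with techniques like SMOTE instead of exact duplicates, which reduces the risk of overfitting to repeated rows.
Split with stratified sampling before up-sampling, and up-sample only the training set, so that copies of the same sample never appear in both the training and test data.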
