# Q1

In [None]:
"""
What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.
"""

In [None]:
"""
Missing values in a dataset are entries for which no value was recorded or is available for a given observation or variable. They can be represented in various ways, such as "NaN" (Not a Number), "NA" (Not Available), or blank cells.

Handling missing values is crucial for several reasons:

Accurate Analysis: Missing values can lead to biased or incorrect analyses if not appropriately addressed. Ignoring missing values can result in distorted statistical measures and inaccurate conclusions.
Data Completeness: Missing values can impact the completeness and representativeness of the dataset. To ensure a comprehensive analysis, it is important to handle missing values appropriately.
Model Performance: Many machine learning algorithms cannot handle missing values directly. Therefore, missing value treatment is necessary to ensure the accurate functioning and performance of the models.

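
Algorithms that can tolerate missing values (this depends on the implementation, not only on the algorithm):

1. Decision Trees that support surrogate splits or a dedicated branch for missing values (e.g. CART, C4.5)
2. Random Forests built on such trees
3. Gradient Boosting Machines (GBM) with native missing-value support (e.g. XGBoost, LightGBM)

Note: standard k-Nearest Neighbors (k-NN) computes distances over all features, so it is affected by missing values and generally needs imputation or a NaN-aware distance first.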
"""

# Q2

In [None]:
"""
List down techniques used to handle missing data. Give an example of each with python code.
"""

In [2]:
# Removal of Missing Data
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5]}
df = pd.DataFrame(data)

# Remove rows with any missing values
df_dropped = df.dropna()

# Mean/Median/Mode Imputation
# dropna() returns a new DataFrame, so df still contains the missing values

# Impute missing values with the column mean
df_filled = df.fillna(df.mean())

# Median and mode imputation follow the same pattern
df_filled_median = df.fillna(df.median())
df_filled_mode = df.fillna(df.mode().iloc[0])
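Two further techniques that suit ordered data such as time series are forward/backward fill and interpolation. The sketch below uses a made-up Series to show what each one does; it is an illustration, not a recommendation for any particular dataset.

```python
import pandas as pd

# Hypothetical ordered measurements with gaps.
s = pd.Series([10.0, None, None, 40.0, None])

# Forward fill: carry the last observed value forward.
forward_filled = s.ffill()

# Backward fill: pull the next observed value backward.
# The trailing gap has no later value, so it stays missing.
backward_filled = s.bfill()

# Linear interpolation: interior gaps are placed on the line
# between their neighbours (here 20 and 30).
interpolated = s.interpolate(method="linear")

print(forward_filled.tolist())
print(backward_filled.tolist())
print(interpolated.tolist())
```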

# Q3

In [None]:
"""
Explain the imbalanced data. What will happen if imbalanced data is not handled?
"""

In [None]:
"""
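Imbalanced data refers to a situation in a classification problem where the classes are not represented equally in the dataset: one class (the majority) has far more samples than another (the minority).

If imbalanced data is not handled, several issues may arise:
1. Biased Model Performance: the model learns to favor the majority class and rarely predicts the minority class.
2. Misleading Evaluation Metrics: accuracy can look high even when the minority class is never predicted correctly.
3. Lack of Generalization: with few minority examples to learn from, the model performs poorly on new minority-class cases.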
"""

# Q4

In [None]:
"""
What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.
"""

In [None]:
"""
Up-sampling (Over-sampling):

Up-sampling involves increasing the number of samples in the minority class to match the number of samples in the majority class.
This can be done by replicating existing samples from the minority class or generating synthetic samples using techniques like SMOTE.
Up-sampling is required when the minority class is underrepresented, and we want to provide the model with more examples to learn from and give it a better chance to recognize the patterns of the minority class.
Example: Consider a credit card fraud detection dataset where the fraud cases are significantly less common compared to non-fraud cases. Up-sampling the fraud cases can help balance the dataset and enable the model to learn better representations of fraudulent patterns.



Down-sampling (Under-sampling):

Down-sampling involves reducing the number of samples in the majority class to match the number of samples in the minority class.
This can be done by randomly removing instances from the majority class until the desired balance is achieved.
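Down-sampling is required when the majority class overwhelms the minority class, leading to biased model performance and predictions that heavily favor the majority class. Because it discards data, it is most suitable when the majority class is large enough that the remaining samples still represent it well.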
Example: Consider a medical diagnosis dataset where the occurrence of a rare disease is significantly lower than the non-disease cases. In this scenario, down-sampling the non-disease cases can help create a more balanced dataset and prevent the model from being biased towards predicting non-disease cases.
"""

# Q5

In [None]:
"""
What is data Augmentation? Explain SMOTE.
"""

In [None]:
"""
Data augmentation is a technique used to artificially increase the size of a dataset by creating new synthetic samples based on existing data. It is commonly employed in machine learning tasks, particularly when working with limited or imbalanced datasets.

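SMOTE (Synthetic Minority Over-sampling Technique) is a specific data augmentation method designed to address the class imbalance problem. It focuses on the minority class by generating synthetic samples to balance the class distribution. For each selected minority sample, SMOTE picks one of its k nearest minority-class neighbors and creates a new point at a random position on the line segment between the two, instead of simply duplicating existing samples.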
"""

# Q6

In [None]:
"""
What are outliers in a dataset? Why is it essential to handle outliers?
"""

In [None]:
"""
Outliers in a dataset are data points or observations that significantly deviate from the majority of the other data points. They are values that are unusually high or low in comparison to the rest of the data.

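It is essential to handle outliers for several reasons:
1. Distorted Analysis: statistics such as the mean and standard deviation are pulled toward extreme values.
2. Biased Models: models that minimize squared error (e.g. linear regression) can be dominated by a few extreme points.
3. Unrepresentative Patterns: outliers caused by measurement or entry errors do not reflect the underlying process.
4. Robustness and Generalization: treating outliers appropriately helps the model generalize to typical unseen data.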
"""

# Q7

In [None]:
"""
You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?
"""

In [None]:
"""
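1. Removal of Missing Data (Deletion): drop rows or columns with missing values when only a small share of the data is affected.
2. Mean/Median/Mode Imputation: replace missing values with a summary statistic of the column.
3. Forward or Backward Fill: carry the previous or next observed value into the gap; suited to ordered data such as time series.
4. Interpolation Methods: estimate missing values from neighboring observations.
5. Model-Based Imputation: predict missing values from the other features, e.g. with regression or k-NN.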
"""

# Q8

In [None]:
"""
You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?
"""

In [None]:
"""
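1. Missing Data Visualization: plot the missingness pattern (e.g. a heatmap or matrix of missing cells) to spot rows or columns where gaps cluster.
2. Missing Data Summary: compute the count and percentage of missing values per column and per group.
3. Missing Data Mechanism Test: compare other variables between rows with and without a missing value, or apply a formal test such as Little's MCAR test.
4. Imputation Comparison: impute with several methods and check whether the conclusions change noticeably.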
"""

# Q9

In [None]:
"""
Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?
"""

In [3]:
"""
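1. Confusion Matrix and Class-Specific Metrics: inspect true/false positives and negatives, and report precision, recall, and F1-score for the minority (positive) class.
2. Resampling Techniques: use stratified train/test splits and stratified cross-validation so every fold keeps the original class ratio; resample only the training data.
3. Class Weighting: weight errors on the minority class more heavily so that the evaluation reflects the cost of missing a positive case.
4. Evaluation Metrics: prefer metrics that are informative under imbalance, such as ROC AUC and precision-recall AUC, over plain accuracy.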
"""


# Q10

In [None]:
"""
When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?
"""

In [None]:
"""
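1. Random Under-Sampling: randomly remove satisfied-customer records until the classes are closer to balanced.
2. Cluster-Based Under-Sampling: cluster the majority class (e.g. with k-means) and keep only representative samples or centroids from each cluster, so the retained data still covers the variety of majority cases.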
"""

# Q11

In [None]:
"""
You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?
"""

In [None]:
"""
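1. Random Over-Sampling: duplicate randomly chosen minority-class samples until the classes are closer to balanced.
2. SMOTE: generate synthetic minority samples by interpolating between existing minority samples and their nearest minority-class neighbors.
3. Cluster-Based Over-Sampling: cluster the data first, then over-sample within each cluster so that small sub-groups of the minority class are also represented.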
"""