Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

In [None]:
# Answer:
'''
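Missing values in a dataset are entries where no value is recorded for a variable in an observation (in pandas they usually appear as NaN or None). Handling them is important because many machine learning algorithms cannot accept missing values as input, 
and ignoring or carelessly dropping them can bias the analysis and shrink the usable sample. Algorithms that can tolerate missing values, depending on the implementation, include tree-based methods (Decision Trees, Random Forests, gradient-boosted trees such as XGBoost) and Naive Bayes; many library implementations still expect the data to be imputed first.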
'''

Q2: List down techniques used to handle missing data. Give an example of each with python code.

In [2]:
# Answer:
'''
Some techniques to handle missing data include:

1. Removing rows with missing data: This can be done using the dropna() method in pandas.
2. Replacing missing values with mean or median values: This can be done using the fillna() method in pandas.
3. Using interpolation: This can be done using the interpolate() method in pandas.
'''
# Example code for replacing missing values with mean:

import pandas as pd
import numpy as np

# create a dataframe with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [6, 7, 8, np.nan, 10]})

# replace missing values with mean
df.fillna(df.mean(), inplace=True)

# display the dataframe
print(df)


     A      B
0  1.0   6.00
1  2.0   7.00
2  3.0   8.00
3  4.0   7.75
4  5.0  10.00
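
The cell above only demonstrates mean imputation. A minimal sketch of the other techniques from the list (row removal, median imputation and interpolation), assuming the same toy dataframe; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# same toy dataframe as in the mean-imputation example
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [6, 7, 8, np.nan, 10]})

# 1. remove every row that contains at least one missing value
dropped = df.dropna()

# 2. replace missing values with the column median
median_filled = df.fillna(df.median())

# 3. fill missing values by linear interpolation between neighboring rows
interpolated = df.interpolate(method='linear')

print(dropped)
print(median_filled)
print(interpolated)
```

Row removal is the simplest option but discards two of the five rows here, while interpolation only makes sense when the row order is meaningful (for example, a time series).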


Q3: Explain imbalanced data. What will happen if imbalanced data is not handled?

In [None]:
# Answer:
'''
Imbalanced data refers to a situation where the classes in the target variable are not represented equally, for example 95% of one class and 5% of another. If imbalanced data is not handled, 
the model tends to be biased towards the majority class: overall accuracy can look high while performance on the minority class, which is often the class of interest, is poor.
'''

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and
down-sampling are required.

In [None]:
# Answer:
'''
Up-sampling increases the number of instances in the minority class, while down-sampling decreases the number of instances in the majority class. 
Up-sampling is useful when the dataset is small and no data can be spared, while down-sampling is useful when data is plentiful and discarding some majority-class instances loses little information.

For example, 
if we have a dataset with 100 instances, of which 90 are in the majority class and 10 are in the minority class, we can up-sample the minority class by duplicating 
instances to increase the number of instances to 90, making the dataset balanced. Alternatively, we can down-sample the majority class by randomly removing 
instances to decrease the number of instances to 10, making the dataset balanced.
'''
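
The 90/10 example above can be sketched with pandas as follows. The dataframe, column names and `random_state` value are hypothetical choices for illustration:

```python
import pandas as pd

# hypothetical toy data: 90 majority rows (label 0) and 10 minority rows (label 1)
df = pd.DataFrame({'feature': range(100), 'label': [0] * 90 + [1] * 10})
majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# up-sampling: draw minority rows with replacement until there are 90 of them
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# down-sampling: keep a random subset of 10 majority rows, without replacement
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled['label'].value_counts())
print(downsampled['label'].value_counts())
```

Up-sampling keeps all 100 original rows and adds duplicates, while down-sampling leaves only 20 rows in total.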

Q5: What is data Augmentation? Explain SMOTE.

In [None]:
# Answer:
'''
Data augmentation refers to the process of creating new data by applying transformations to the existing data. SMOTE (Synthetic Minority Over-sampling Technique) 
is an over-sampling technique that creates synthetic samples for the minority class by interpolating between an existing minority instance and one of its nearest minority-class neighbors. It is available in the imbalanced-learn package (imported as imblearn) in python.
'''
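
To make the interpolation idea concrete, here is a simplified NumPy-only sketch of SMOTE-style sampling. The function `smote_sketch` and its parameters are hypothetical and written for illustration; it is not the imblearn implementation, which is normally used through `imblearn.over_sampling.SMOTE`:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))         # pick a minority sample
        nb = neighbors[base, rng.integers(k)]   # pick one of its k neighbors
        gap = rng.random()                      # position on the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# hypothetical minority-class points (k must be smaller than the number of points)
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sketch(X_min, n_new=6)
print(new_points)
```

Each synthetic point lies on the line segment between two real minority points, so it is new but stays within the region the minority class already occupies.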

Q6: What are outliers in a dataset? Why is it essential to handle outliers?

In [None]:
# Answer:
'''
Outliers in a dataset are values that are significantly different from the other values in the dataset. Handling outliers is important because they can distort the
analysis and lead to incorrect conclusions. Outliers can be handled by removing them, transforming them, or using robust statistical methods.
'''
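
One common way to flag outliers is the 1.5 × IQR rule. The sketch below assumes that rule and a made-up array of values; it shows both removal and capping (clipping) as handling options:

```python
import numpy as np

# hypothetical measurements; 95 looks suspicious next to the other values
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# interquartile range and the usual 1.5 * IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
removed = values[~is_outlier]            # option 1: drop the outliers
clipped = np.clip(values, lower, upper)  # option 2: cap them at the fences

print(values[is_outlier])
# the mean is pulled towards the outlier, the median (a robust statistic) is not
print(np.mean(values), np.median(values))
```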

Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

In [None]:
# Answer:
'''
Some techniques that can be used to handle missing data in analysis include imputing missing data with mean or median values, using interpolation to fill missing data, 
or removing rows with missing data.
'''

Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?

In [None]:
# Answer:
'''
To determine if missing data is missing at random or if there is a pattern to the missing data, we can use techniques such as visualizing missing data using heatmaps, 
checking the correlation between missing values and other variables, or using statistical tests to compare the characteristics of observations with and without missing data.
'''
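
A small sketch of the checks described above, assuming a hypothetical customer table in which income is missing mostly for younger customers. The column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

# hypothetical customer data; 'income' is missing mostly for younger customers
df = pd.DataFrame({
    'age':    [22, 25, 27, 30, 41, 45, 52, 58],
    'income': [np.nan, np.nan, 31000, np.nan, 52000, 58000, 61000, 66000],
})

# share of missing values per column
print(df.isna().mean())

# indicator column: True where income is missing
df['income_missing'] = df['income'].isna()

# compare another variable between rows with and without missing income
age_by_missing = df.groupby('income_missing')['age'].mean()
print(age_by_missing)

# correlation between the missingness indicator and age
corr = df['income_missing'].astype(int).corr(df['age'])
print(corr)
```

If the groups differ clearly, as they do in this toy data, the values are probably not missing completely at random, and simply dropping those rows could bias the analysis.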

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

In [None]:
# Answer:
'''
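Accuracy is misleading on an imbalanced dataset, because a model that always predicts the majority class already scores highly. Better evaluation strategies include the confusion matrix, precision, recall (sensitivity) and F1-score for the minority class, 
the ROC AUC or precision-recall AUC, and stratified train/test splits or stratified cross-validation so that every fold contains minority cases. Resampling (up-sampling or down-sampling) and ensemble methods such as bagging and boosting improve training rather than evaluation, and any resampling should be applied to the training data only.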
'''
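
To illustrate why accuracy alone is not enough, the sketch below scores a deliberately useless model that always predicts "no condition". The labels and the helper `binary_metrics` are hypothetical; in practice, functions such as `precision_score`, `recall_score` and `f1_score` from `sklearn.metrics` compute the same quantities:

```python
import numpy as np

# hypothetical labels: 1 = has the condition, 0 = does not (95 negatives, 5 positives)
y_true = np.array([0] * 95 + [1] * 5)
# a useless model that always predicts the majority class
y_pred = np.zeros_like(y_true)

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    # guard against division by zero when there are no positive predictions/labels
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

m = binary_metrics(y_true, y_pred)
print(m)
```

The model is 95% accurate yet finds none of the patients with the condition, which recall and F1 expose immediately.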

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

In [None]:
# Answer:
'''
When dealing with an imbalanced dataset where the majority class is overrepresented, downsampling can be used to balance the dataset. 
Here are some methods that can be employed to down-sample the majority class:

1. Random Under-Sampling: In this method, random samples are removed from the majority class to reduce its size to that of the minority class.

2. Tomek Links: A Tomek Link is a pair of instances from different classes that are each other's nearest neighbors. In this method, 
the majority-class instance of each Tomek Link is removed, which cleans the class boundary but does not by itself guarantee equal class sizes.

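3. Cluster Centroids: In this method, the K-means clustering algorithm is applied to the majority class, with the number of clusters set to the desired number of majority samples. 
The original majority instances are then replaced by the cluster centroids, so the class is summarized rather than randomly thinned.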
'''
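
Random under-sampling is a one-line `sample()` call in pandas, so the sketch below focuses on Tomek Links instead. It is a simplified NumPy illustration with a hypothetical helper and toy data, not the imblearn implementation (which is available as `imblearn.under_sampling.TomekLinks`):

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that are part of a Tomek Link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # pairwise Euclidean distances; a point is not its own neighbor
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)  # index of each point's nearest neighbor

    to_remove = []
    for i, j in enumerate(nearest):
        # Tomek Link: i and j are mutual nearest neighbors with different labels
        if nearest[j] == i and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return to_remove

# hypothetical 1-D data: label 0 is the majority class, label 1 the minority class
X = np.array([[0.0], [1.0], [2.5], [5.0], [5.2], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1])

remove = tomek_link_majority_indices(X, y, majority_label=0)
X_clean = np.delete(X, remove, axis=0)
y_clean = np.delete(y, remove)
print(remove, y_clean)
```

Only the majority point at 5.0, which sits right next to the minority point at 5.2, is removed; the classes are still unequal afterwards, so this method is often combined with another resampling step.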

Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

In [None]:
# Answer:
'''
When working with a dataset that has a low percentage of occurrences of a rare event, the dataset is said to be imbalanced.
In this case, we can use up-sampling techniques to balance the dataset. Some of the popular up-sampling techniques are:

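1. Random over-sampling: In this technique, we randomly duplicate samples from the minority class to balance the dataset. 
    This method can lead to overfitting because the model sees exact copies of the same samples many times. If it is applied before the train/test split, copies of the same sample can also end up in both the training and testing datasets, which inflates the test score.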

2. SMOTE (Synthetic Minority Over-sampling Technique): This method creates synthetic samples of the minority class by finding the k-nearest neighbors and generating new samples between them. 
    This method is less prone to overfitting than random over-sampling.

3. ADASYN (Adaptive Synthetic Sampling): This method generates synthetic samples of the minority class in a way that emphasizes the samples that are harder to learn. 
    This method is also less prone to overfitting.
'''
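
Whichever up-sampling technique is chosen, it is usually applied after the train/test split and to the training part only. A minimal NumPy sketch of that order of operations with random over-sampling; the data, the 80/20 split and the seed are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 100 rows, of which the last 10 belong to the rare class
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# 1. stratified 80/20 split: shuffle each class separately, then cut it
maj_idx = rng.permutation(np.flatnonzero(y == 0))
min_idx = rng.permutation(np.flatnonzero(y == 1))
train_maj, test_maj = maj_idx[:72], maj_idx[72:]
train_min, test_min = min_idx[:8], min_idx[8:]

# 2. over-sample the minority class inside the training split only
extra = rng.choice(train_min, size=len(train_maj) - len(train_min), replace=True)
train_idx = np.concatenate([train_maj, train_min, extra])
test_idx = np.concatenate([test_maj, test_min])

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(np.bincount(y_train), np.bincount(y_test))
```

The training set is balanced, while the test set keeps the original class ratio and shares no rows with the training set.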