## What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values in a dataset refer to the absence of a particular value in a variable. There can be various reasons for missing values, such as data entry errors, loss of data during transmission, or a failure to record data for some observation. Handling missing values is essential because they can impact the quality and accuracy of the analysis performed on the dataset.

Some of the consequences of missing values are:

1. Reduced power and representativeness of the analysis

2. Biased results

3. Inaccurate analysis if the missing values are imputed incorrectly

Some common machine learning algorithms that can tolerate missing values, depending on the implementation:

1. Decision Trees: Some decision tree implementations handle missing values directly, for example through surrogate splits (CART) or by distributing an observation fractionally across branches (C4.5). Other implementations require imputation first.

2. Random Forest: Random forest is an ensemble learning algorithm that uses multiple decision trees to make predictions. It can handle missing values in a similar way to decision trees.

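3. K-Nearest Neighbors (KNN): Standard KNN needs complete feature vectors to compute distances, so it is not immune to missing values on its own. It can be adapted by computing distances over only the features that are present, and the same idea is used to fill in a missing value with the average of the K nearest neighbors' values.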

4. Naive Bayes: Naive Bayes can handle missing values by ignoring the missing values and making a prediction based on the available data.

5. Principal Component Analysis (PCA): PCA is a technique used for dimensionality reduction. Standard PCA requires a complete data matrix, but variants such as probabilistic PCA can estimate the principal components from incomplete data.

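6. Gradient Boosting: Gradient boosting is another ensemble learning algorithm built from decision trees. Implementations such as XGBoost and LightGBM handle missing values natively by learning, at each split, which branch observations with a missing value should follow.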

7. Neural Networks: Neural networks do not accept missing inputs directly, because a NaN propagates through every computation. They are usually combined with imputation or with an extra indicator (mask) input that marks which values were missing.

## List down techniques used to handle missing data. Give an example of each with python code.

Some techniques used to handle missing data, each with an example in Python:

1. __Deletion:__ In this technique, we remove the missing values from the dataset. There are two main types of deletion: listwise deletion and pairwise deletion. Listwise deletion (also called complete case analysis) removes every row that contains a missing value. Pairwise deletion keeps each row and, for each calculation, uses only the observations that have values for the variables involved.

In [12]:
import pandas as pd

x = [2,3,4,6, None, 8]
y = [9,3,None, 4, 7, 12]

df = pd.DataFrame({'x': x, 'y': y})
df

|   | x | y |
|---|---|---|
| 0 | 2.0 | 9.0 |
| 1 | 3.0 | 3.0 |
| 2 | 4.0 | NaN |
| 3 | 6.0 | 4.0 |
| 4 | NaN | 7.0 |
| 5 | 8.0 | 12.0 |


In [13]:
df.dropna()

|   | x | y |
|---|---|---|
| 0 | 2.0 | 9.0 |
| 1 | 3.0 | 3.0 |
| 3 | 6.0 | 4.0 |
| 5 | 8.0 | 12.0 |
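The `dropna()` call above is listwise deletion. As a rough sketch of how pairwise deletion differs, the example below counts the rows each approach keeps and relies on the fact that pandas' `DataFrame.corr()` excludes missing values pair by pair. The three-column frame `data` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical three-column frame; None marks a missing entry.
data = pd.DataFrame({
    "x": [2, 3, 4, 6, None, 8],
    "y": [9, 3, None, 4, 7, 12],
    "z": [1, None, 5, 2, 8, 3],
})

# Listwise deletion: keep only rows that are complete in every column.
listwise_rows = len(data.dropna())

# Pairwise deletion: a given pair of columns uses every row where both are present.
pair_xy_rows = len(data[["x", "y"]].dropna())

# corr() drops missing values separately for each pair of columns.
corr = data.corr()

print(listwise_rows, pair_xy_rows)  # 3 4
```

Pairwise deletion keeps more data per calculation, but each entry of the correlation matrix may then be based on a different subset of rows.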


2. __Mean/median imputation:__ In this technique, the missing values are replaced with the mean or median of the available data.

In [14]:
df['x'].fillna(df['x'].mean())

0    2.0
1    3.0
2    4.0
3    6.0
4    4.6
5    8.0
Name: x, dtype: float64

In [15]:
df['y'].fillna(df['y'].median())

0     9.0
1     3.0
2     7.0
3     4.0
4     7.0
5    12.0
Name: y, dtype: float64

3. __Mode imputation:__ In this technique, the missing values are replaced with the mode of the available data.

In [16]:
df = pd.DataFrame({'x': ['a', 'b', 'c', 'a', None , 'b', 'a']})

df.fillna(df['x'].mode()[0])

|   | x |
|---|---|
| 0 | a |
| 1 | b |
| 2 | c |
| 3 | a |
| 4 | a |
| 5 | b |
| 6 | a |


4. __Interpolation technique:__ In this case, the missing values are replaced with the values obtained by connecting the nearest known values with a straight line.

In [6]:
# creating df
x = [1,2,3,4, None, 6]
y = [2,4,None, 6, None, 10]

df = pd.DataFrame({'A':x, "B":y})

In [7]:
# fill each gap by linear interpolation between the neighboring known values
df.interpolate(method= 'linear')

|   | A | B |
|---|---|---|
| 0 | 1.0 | 2.0 |
| 1 | 2.0 | 4.0 |
| 2 | 3.0 | 5.0 |
| 3 | 4.0 | 6.0 |
| 4 | 5.0 | 8.0 |
| 5 | 6.0 | 10.0 |


## Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a situation where the number of observations belonging to one class in a classification problem is much higher or much lower than the number of observations belonging to the other classes. In other words, the distribution of class labels in the data is not equal or balanced.

If imbalanced data is not handled, the model may suffer from the following problems:

#### 1. Bias towards the majority class:

The model may be biased towards the majority class, leading to an over-representation of the majority class in the predictions.

#### 2. Poor generalization:

The model may not generalize well to new data if the class distribution in the new data is different from the class distribution in the training data.

#### 3. Lower predictive performance:

Overall accuracy can look high while precision, recall, and F1 score for the minority class remain low, so the model performs poorly on the class that usually matters most.

#### 4. Incorrect ranking:

The model may incorrectly rank the instances in the minority class, leading to a higher false negative rate and missing important instances.
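A small numeric sketch makes the bias problem concrete. It assumes a hypothetical dataset with 95 negatives and 5 positives and a "model" that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 95 negatives (majority) and 5 positives (minority).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}, minority recall = {recall:.2f}")
# accuracy = 0.95, minority recall = 0.00
```

The accuracy looks strong even though not a single minority instance is detected.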

## What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

### Up-sampling and Down-sampling

__Down-sampling:__

In this technique, the majority class is randomly reduced to match the number of observations in the minority class. This can result in a smaller dataset but can help balance the class distribution.

__Up-sampling:__

In this technique, the minority class is randomly duplicated to match the number of observations in the majority class. This can result in a larger dataset but can help balance the class distribution.

Here are some examples of when up-sampling and down-sampling may be required:

__Example of up-sampling:__

Up-sampling may be required when the minority class is under-represented, and the model is not correctly identifying instances from that class. For example, in fraud detection, the number of fraudulent transactions may be much lower than non-fraudulent transactions. In this case, up-sampling the minority class can help improve the model's performance.

__Example of down-sampling:__

Down-sampling may be required when the majority class is over-represented, and the model is biased towards that class. For example, in medical diagnosis, the number of healthy patients may be much higher than the number of patients with a disease. In this case, down-sampling the majority class can help balance the class distribution and improve the model's performance.
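Both techniques can be sketched with pandas' `DataFrame.sample`. The frame `data`, its 90/10 class split, and the column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical imbalanced frame: 90 rows of class 0, 10 rows of class 1.
data = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})
majority = data[data["label"] == 0]
minority = data[data["label"] == 1]

# Down-sampling: draw as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=42)
data_down = pd.concat([majority_down, minority])

# Up-sampling: draw minority rows with replacement until they match the majority.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
data_up = pd.concat([majority, minority_up])

print(data_down["label"].value_counts().to_dict())
print(data_up["label"].value_counts().to_dict())
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class distribution.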

## What is data Augmentation? Explain SMOTE.

__Data Augmentation__

Data augmentation is a technique used to increase the size of the training dataset by applying various transformations to the existing data. It is commonly used in machine learning and deep learning to improve the performance of models by increasing the amount and diversity of data available for training. Data augmentation can be applied to various types of data, such as images, text, and time-series data. For images, the most common techniques include flipping, cropping, rotation, scaling, and adding noise.

__SMOTE__

SMOTE (Synthetic Minority Over-sampling Technique) is a popular over-sampling technique, often described as a form of data augmentation, used to handle imbalanced data. It creates synthetic samples of the minority class by interpolating between existing minority class samples. The algorithm selects a minority class observation and finds its k nearest minority class neighbors. It then picks one of those neighbors at random and places a new synthetic sample at a random point on the line segment between the observation and that neighbor.
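The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration rather than a reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical, and in practice a library such as imbalanced-learn would normally be used.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))       # random minority sample
        nn = rng.choice(neighbors[base])      # one of its k nearest neighbors
        lam = rng.random()                    # interpolation weight in [0, 1)
        synthetic[i] = X_min[base] + lam * (X_min[nn] - X_min[base])
    return synthetic

# Hypothetical minority class samples with two features.
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5]])
new_points = smote_sketch(X_min, n_new=10)
print(new_points.shape)  # (10, 2)
```

Each synthetic point lies on the segment between two real minority samples, so it stays inside the region the minority class already occupies.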

## What are outliers in a dataset? Why is it essential to handle outliers?

Outliers in a dataset are data points that are significantly different from other observations in the dataset. They can be identified by their extreme values that are either too high or too low compared to the rest of the data. Outliers can occur due to various reasons such as data entry errors, measurement errors, or extreme events, and can have a significant impact on the statistical analysis of the dataset.

Handling outliers is necessary for the following reasons:

1. Outliers can significantly affect the mean and standard deviation of the dataset, making them unreliable as measures of central tendency and variability. Removing or adjusting outliers can help improve the accuracy of these measures.

2. Outliers can affect the distribution of the data, making it non-normal. Many statistical models assume normality, and outliers can violate this assumption.

3. Removing or adjusting outliers can help improve the validity of the statistical models.

4. Outliers can affect the correlation between variables in the dataset, making it difficult to interpret the relationship between them. Removing or adjusting outliers can help improve the accuracy of correlation analysis.

5. Outliers can have a significant impact on the predictive performance of machine learning models. Many machine learning algorithms are sensitive to outliers, and removing or adjusting them can help improve the model's performance.
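The first point can be checked with a short sketch. The measurements below are made up, and the value 100.0 plays the role of a single outlier.

```python
import numpy as np

# Hypothetical measurements, then the same data with one extreme value added.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
with_outlier = np.append(values, 100.0)

print("mean:  ", values.mean(), "->", with_outlier.mean())
print("std:   ", values.std(), "->", with_outlier.std())
print("median:", np.median(values), "->", np.median(with_outlier))
```

The mean and standard deviation shift substantially after one extreme value is added, while the median stays at 12.0, which is why the median is often preferred as a robust summary when outliers are present.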

## You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

1. __Deletion:__

    In this technique, we remove the missing values from the dataset. There are two main types of deletion: listwise deletion and pairwise deletion. Listwise deletion (also called complete case analysis) removes every row that contains a missing value. Pairwise deletion keeps each row and, for each calculation, uses only the observations that have values for the variables involved.

2. __Mean/Median/Mode imputation:__

    In this method, missing values are replaced with the mean/median/mode of the respective column. This is a simple and quick method but can introduce bias if the missing values are not missing at random.

3. __Interpolation technique:__

    In this case, the missing values are replaced with the values obtained by connecting the nearest known values with a straight line.

4. __Expectation-Maximization (EM):__

    EM is an iterative method that alternates between estimating the missing values given the current model parameters and re-estimating the parameters given the completed data. It generally assumes the data is missing at random (MAR); data that is missing not at random usually requires modeling the missingness mechanism explicitly.

5. __K-Nearest Neighbor (KNN):__

    In this technique, missing values are imputed by identifying the K nearest neighbors to the observation with missing data and using the average value of the nearest neighbors to fill in the missing data.
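The KNN idea can be sketched by hand for a single column. The helper `knn_impute_column` and the customer-style data are hypothetical, and the sketch assumes all other columns are fully observed; scikit-learn's `KNNImputer` covers the general case.

```python
import numpy as np

def knn_impute_column(X, col, k=2):
    """Fill NaNs in X[:, col] with the mean of that column over the k nearest
    rows, measuring distance on the remaining (fully observed) columns."""
    X = X.astype(float).copy()
    other = [j for j in range(X.shape[1]) if j != col]
    missing = np.isnan(X[:, col])
    donors = np.where(~missing)[0]            # rows that have a value in col
    for i in np.where(missing)[0]:
        dists = np.linalg.norm(X[donors][:, other] - X[i, other], axis=1)
        nearest = donors[np.argsort(dists)[:k]]
        X[i, col] = X[nearest, col].mean()
    return X

# Hypothetical customer data: column 0 = age, column 1 = income (in thousands).
customers = np.array([
    [25.0, 50.0],
    [27.0, 54.0],
    [26.0, np.nan],   # income missing
    [60.0, 120.0],
    [62.0, 118.0],
])
filled = knn_impute_column(customers, col=1, k=2)
print(filled[2, 1])  # 52.0
```

The two customers closest in age (25 and 27) supply the imputed income, so the filled value reflects similar customers rather than the overall column mean.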

## You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

There are several strategies that can be used to determine if the missing data is missing at random or if there is a pattern to the missing data.

Some of the commonly used strategies are:

- Visualizations:

    Visualizations can be used to identify patterns in the missing data. Missing data can be represented graphically, such as through heat maps, histograms, and scatter plots, to identify whether missing data is clustered in particular regions or distributed randomly.

- Statistical tests:

    Statistical tests can be used to determine whether the missing data is missing at random or is systematically related to other variables in the dataset. Tests such as Little's MCAR test and the chi-square test can be used to determine whether the missing data is missing completely at random or not.

- Imputation methods:

    Imputation methods can be used to examine the relationship between missing data and other variables in the dataset. For example, imputing missing data using regression imputation can help identify whether the missing data is related to other variables in the dataset.

- Machine Learning:

    Machine learning algorithms can be used to identify patterns in the missing data. Techniques such as clustering and association rule mining can be used to identify patterns and relationships in the missing data.

- Domain knowledge:

    In some cases, the domain knowledge of the data can be used to identify patterns in missing data. For example, if the missing data is related to a specific demographic group, it may be possible to infer the missing values based on other variables that are known to be related to that demographic group.
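A quick first check, before any formal test, is to build a missingness indicator and compare its rate across another variable. The sketch below uses a hypothetical survey frame in which `income` is missing more often for one age group; the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical survey data: income is missing more often in the "18-25" group.
data = pd.DataFrame({
    "age_group": ["18-25"] * 4 + ["26-40"] * 4 + ["41-60"] * 4,
    "income": [None, None, 21.0, None,
               40.0, 42.0, None, 45.0,
               60.0, 58.0, 61.0, 63.0],
})

# Indicator column: True where income is missing.
data["income_missing"] = data["income"].isna()

# Share of missing income within each age group.
missing_rate = data.groupby("age_group")["income_missing"].mean()
print(missing_rate)
```

Clearly different rates between groups suggest the data is not missing completely at random. This is an exploratory check, not a substitute for a statistical test such as Little's MCAR test.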

## Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

When working with an imbalanced dataset, where one class has significantly fewer samples than the other class, the performance of a machine learning model can be biased towards the majority class. Therefore, it is essential to evaluate the performance of the model using appropriate metrics that consider the class imbalance.

Here are some strategies that can be used to evaluate, and where needed improve, the performance of machine learning models on imbalanced datasets:

- __Cost-Sensitive Learning:__

Cost-sensitive learning is a technique that assigns different costs to different types of classification errors based on the class imbalance. This approach can help to train a model that is more sensitive to the minority class and reduces the false-negative rate.

- __Sampling Techniques:__

Sampling techniques like over-sampling and under-sampling can be used to balance the dataset by either increasing the minority class samples or decreasing the majority class samples. This can help to improve the model's performance on the minority class.

- __Confusion Matrix:__

A confusion matrix can be used to evaluate the performance of the model. It provides information about the number of true positive, false positive, true negative, and false negative predictions. Using this, we can calculate performance metrics like precision, recall, F1 score, and accuracy. On imbalanced data, accuracy alone can be misleading, so precision, recall, and F1 score for the minority class deserve the most attention.

- __Adjust class weights:__

Some machine learning algorithms such as logistic regression and decision trees allow for adjusting the weights of the different classes. By increasing the weight of the minority class, the algorithm can be forced to focus more on correctly classifying examples from the minority class.

- __Collect more data:__

Collecting more data from the minority class can help address the class imbalance and improve the performance of the machine learning model. However, this is not always feasible, particularly in medical diagnosis projects where data collection can be expensive and time-consuming.
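The confusion matrix metrics mentioned above can be computed directly with NumPy. The labels and predictions below are made up: 90 healthy patients, 10 patients with the condition, and a model that produces 5 false positives and 4 false negatives.

```python
import numpy as np

# Hypothetical ground truth: 90 negatives followed by 10 positives.
y_true = np.array([0] * 90 + [1] * 10)

# Hypothetical predictions: start from the truth, then inject errors.
y_pred = y_true.copy()
y_pred[:5] = 1       # 5 healthy patients flagged as positive (false positives)
y_pred[90:94] = 0    # 4 sick patients missed (false negatives)

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is 0.91 while recall on the condition of interest is only 0.60, which shows why the minority class metrics should be reported alongside accuracy.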

## When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

To balance an unbalanced dataset where the majority class overwhelms the minority class, one can use several methods for down-sampling the majority class. Here are some common techniques:

- Random under-sampling: This method involves randomly removing samples from the majority class until the class distribution is balanced with the minority class.

- Cluster-based under-sampling: This method involves identifying clusters of the majority class samples and randomly removing samples from each cluster until the class distribution is balanced with the minority class.

- Tomek links: A Tomek link is a pair of samples from different classes that are each other's nearest neighbors. This method identifies such pairs and removes the majority class sample from each pair, which cleans up the boundary between the classes.

- SMOTE-ENN (SMOTE combined with Edited Nearest Neighbors): This hybrid method first applies SMOTE to up-sample the minority class to a desired level, then applies ENN to remove noisy samples, that is, samples whose class disagrees with most of their nearest neighbors. If the classes are still unbalanced, a random subset of the majority class samples can be selected to finish balancing the dataset.
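The Tomek links idea can be sketched with NumPy as a search for mutual nearest neighbors of opposite classes. The function name `tomek_link_majority_indices` and the one-dimensional toy data are hypothetical; imbalanced-learn provides a maintained implementation.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label=0):
    """Return indices of majority samples that take part in a Tomek link."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)       # ignore self-distance
    nearest = dists.argmin(axis=1)        # nearest neighbor of every sample
    to_drop = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i          # i and j are each other's nearest
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_drop.append(i)
    return np.array(to_drop, dtype=int)

# Hypothetical 1-D data: four majority samples (0) and two minority samples (1).
X = np.array([[0.0], [1.2], [2.0], [4.9], [5.0], [6.0]])
y = np.array([0, 0, 0, 0, 1, 1])

drop = tomek_link_majority_indices(X, y)
X_clean, y_clean = np.delete(X, drop, axis=0), np.delete(y, drop)
print(drop)  # [3]
```

Only the majority sample at 4.9, which sits right next to the minority sample at 5.0, is removed. Tomek links therefore clean the class boundary rather than fully balancing the classes, so they are often combined with another sampling method.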

## You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

When dealing with an imbalanced dataset where the minority class has a low percentage of occurrences, we can use various techniques to balance the dataset and up-sample the minority class.

Here are some methods that can be used to up-sample the minority class:

- Random Over-Sampling:

In random over-sampling, we randomly duplicate samples from the minority class until we have a balanced dataset. This method can be quick and easy, but it may result in overfitting and the generation of redundant data.

- SMOTE:

SMOTE is a popular method for balancing imbalanced datasets. In SMOTE, we create synthetic minority class samples by interpolating between the minority class samples. This can help to increase the size of the minority class and balance the dataset.

- ADASYN:

ADASYN (Adaptive Synthetic Sampling) is a variation of SMOTE that generates more synthetic samples for the minority class samples that are harder to learn, meaning those with more majority class neighbors. This shifts the synthetic data toward the class boundary and can improve the performance of the model.