##### Q1. What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

* **Missing values** in a dataset are entries where no data is recorded for a particular variable in a particular observation.
* Handling missing values is essential for several reasons:
    1. **Data Integrity**: Missing values can lead to errors in data analysis, modeling, and interpretation. If not addressed, they can result in biased or incorrect conclusions.
    2. **Statistical Analysis**: Many statistical methods and machine learning algorithms cannot handle missing data. Imputing missing values makes these methods applicable and helps keep statistical summaries and inferences reliable.
    3. **Visualization**: Missing values can distort data visualizations, for example when incomplete rows are silently dropped from a plot.
    4. **Model Performance**: In machine learning, missing values can adversely impact the performance of predictive models. Most machine learning algorithms require complete data.
* Algorithms that can tolerate missing values (actual support depends on the implementation):
    1. **Decision Trees**: Some decision tree implementations, such as CART (Classification and Regression Trees) with surrogate splits, can handle missing values during tree building.
    2. **Random Forests**: Random Forests are an ensemble learning method that combines multiple decision trees. They inherit the ability to handle missing values when the underlying tree implementation supports it.
    3. **Gradient Boosting Machines**: Implementations such as XGBoost handle missing values by learning, at each split, which branch the missing values should follow.
    4. **Neural Networks**: Neural networks do not handle missing values by default, but they can be designed to do so through specialized architectures and techniques such as embedding layers or masking.
    5. **K-Nearest Neighbors (KNN)**: KNN makes predictions based on the similarity of data points. Standard implementations expect complete data, but variants can handle missing values by ignoring the missing features when computing the distance between data points.
___

##### Q2. List down techniques used to handle missing data. Give an example of each with python code.
* Missing values can be handled by imputing them. Below are a few imputation methods:
    1. Mean value imputation
    2. Median value imputation
    3. Mode value imputation
    4. Random value imputation


In [24]:
## 1. mean value imputation

import pandas as pd
import numpy as np
import seaborn as sns

df_titanic = sns.load_dataset('titanic')

# Check the total number of missing values for each variable
df_titanic.isnull().sum()   # 177 missing values for age

# calculate mean age
mean_age = df_titanic['age'].mean()

# Fill the missing values with mean age (Can also be done in place)
df_titanic['age_mean'] = df_titanic['age'].fillna(mean_age)

# Verify the data
df_titanic[df_titanic['age'].isnull()][['age', 'age_mean']]

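| | age | age_mean |
|---|---|---|
| 5 | NaN | 29.699118 |
| 17 | NaN | 29.699118 |
| 19 | NaN | 29.699118 |
| 26 | NaN | 29.699118 |
| 28 | NaN | 29.699118 |
| ... | ... | ... |
| 859 | NaN | 29.699118 |
| 863 | NaN | 29.699118 |
| 868 | NaN | 29.699118 |
| 878 | NaN | 29.699118 |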


In [25]:
## 2. Median value imputation - preferred when there are outliers in the data points

## Calculate the median value
median_age = df_titanic['age'].median()

# Fill the missing values with median age (Can also be done in place)
df_titanic['age_median'] = df_titanic['age'].fillna(median_age)

# Verify the data
df_titanic[df_titanic['age'].isnull()][['age', 'age_mean', 'age_median']]

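| | age | age_mean | age_median |
|---|---|---|---|
| 5 | NaN | 29.699118 | 28.0 |
| 17 | NaN | 29.699118 | 28.0 |
| 19 | NaN | 29.699118 | 28.0 |
| 26 | NaN | 29.699118 | 28.0 |
| 28 | NaN | 29.699118 | 28.0 |
| ... | ... | ... | ... |
| 859 | NaN | 29.699118 | 28.0 |
| 863 | NaN | 29.699118 | 28.0 |
| 868 | NaN | 29.699118 | 28.0 |
| 878 | NaN | 29.699118 | 28.0 |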


In [26]:
## 3. Mode value imputation - for categorical data

# Identify the missing categorical data that can be imputed
df_titanic.isnull().sum()
# The embarked column has two missing values that can be imputed with the mode value

# Find the mode value (mode() ignores missing values by default)
mode_value = df_titanic['embarked'].mode()[0]

# Fill the missing values with the mode value (assign back rather than using a chained inplace fill)
df_titanic['embarked'] = df_titanic['embarked'].fillna(mode_value)

# Verify
df_titanic['embarked'].isnull().sum()


0

In [27]:
## 4. Random value imputation - select random value from the non missing data and fill the missing values

# Get the non missing values
non_na_ages = df_titanic.age.dropna()

# Calculate the missing values count
missing_count = df_titanic.age.isnull().sum()

# Generate random values from non missing values
random_values = np.random.choice(non_na_ages, size=missing_count)

# Create a Series of the random values, indexed by the rows where age is missing
random_value_series = pd.Series(random_values, index=df_titanic.index[df_titanic['age'].isnull()])

# Fill the missing value with random generated series
df_titanic['age_random'] = df_titanic['age'].fillna(random_value_series)

# Verify the result
df_titanic[df_titanic['age'].isnull()][['age', 'age_random', 'age_median', 'age_mean']]
 


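| | age | age_random | age_median | age_mean |
|---|---|---|---|---|
| 5 | NaN | 35.0 | 28.0 | 29.699118 |
| 17 | NaN | 21.0 | 28.0 | 29.699118 |
| 19 | NaN | 29.0 | 28.0 | 29.699118 |
| 26 | NaN | 24.0 | 28.0 | 29.699118 |
| 28 | NaN | 16.0 | 28.0 | 29.699118 |
| ... | ... | ... | ... | ... |
| 859 | NaN | 16.0 | 28.0 | 29.699118 |
| 863 | NaN | 16.0 | 28.0 | 29.699118 |
| 868 | NaN | 42.0 | 28.0 | 29.699118 |
| 878 | NaN | 19.0 | 28.0 | 29.699118 |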


___

##### Q3. Explain the imbalanced data. What will happen if imbalanced data is not handled?
* **Imbalanced data** is a situation in which the distribution of classes or labels in a dataset is not roughly equal, meaning that one class significantly outnumbers the others. This imbalance can occur in various types of datasets, such as binary classification problems, multi-class classification problems, or even regression tasks with imbalanced target values.
* *Example* : In a binary classification problem where you are trying to predict whether an email is spam or not, you might have a dataset with 95% of the emails being non-spam (class 0) and only 5% being spam (class 1). In this case, the data is imbalanced because one class (non-spam) dominates the other (spam).
* If the imbalance is not handled properly, several issues can arise:
    1. Bias in model training: Most machine learning algorithms optimize overall accuracy (or a similar average loss), so on imbalanced data the model tends to favor the majority class.
    2. Poor generalization: Models trained on imbalanced data may not generalize well to new, unseen data, especially for the minority class.
    3. Loss of information: The minority class may carry valuable information, but because it is underrepresented the model may fail to learn it.
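
The first issue can be made concrete with a small sketch. The labels below are made up to match the 95%/5% spam example; the "model" simply predicts the majority class for every email.

```python
import numpy as np

# Hypothetical labels: 950 non-spam (0) and 50 spam (1) emails
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall for spam: fraction of actual spam emails that were flagged
spam_recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy = {accuracy:.2f}")        # accuracy = 0.95
print(f"spam recall = {spam_recall:.2f}")  # spam recall = 0.00
```

The accuracy looks excellent even though not a single spam email is detected, which is why accuracy alone is a poor guide on imbalanced data.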
___

##### Q4. What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.
1. **Up-sampling** refers to adding samples to the minority class (for example, by repeating existing records) in order to create a more balanced dataset.
2. **Down-sampling** refers to removing records from the majority class, thereby creating a more balanced dataset.

**Example**:

In [28]:
## Create a sample imbalanced dataset
sample_size = 1000
class_0 = 0.897
class_1 = 1 - class_0
class_0_sample_size = int(sample_size * class_0)
# Note: int() truncates, and floating-point error makes this 102 rather than 103
class_1_sample_size = int(sample_size * class_1)

class_0 = pd.DataFrame({
    'f1' : np.random.normal(loc = 0, scale = 2, size = class_0_sample_size),
    'f2' : np.random.normal(loc = 0, scale = 2, size = class_0_sample_size),
    'target' : [0] * class_0_sample_size
})

class_1 = pd.DataFrame({
    'f1' : np.random.normal(loc = 0, scale = 2, size = class_1_sample_size),
    'f2' : np.random.normal(loc = 0, scale = 2, size = class_1_sample_size),
    'target' : [1] * class_1_sample_size
})

df = pd.concat([class_0, class_1])

# Verify that the data is imbalanced
df['target'].value_counts()


target
0    897
1    102
Name: count, dtype: int64

In [29]:
## 1. Up-sampling
''' 
Up-sampling is used in the following cases:
1. when the minority class is important
2. When the dataset is small
3. When there is no risk of overwhelming the majority class
'''

# Import resample module from sklearn
from sklearn.utils import resample

# Divide the dataframe into two parts - majority and minority, based on target
majority = df[df['target'] == 0]
minority = df[df['target'] == 1]

# Upsample the minority dataset
minority_upsampled = resample(minority,
                              replace=True,    # rows from the minority dataset can be picked multiple times
                              n_samples=len(majority),
                              random_state=42)

# Check the size of df after upsampling
minority_upsampled.shape

# Concatenate majority df and upsampled minority df by resetting the index
upsampled_df = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

# Verify if the df is balanced
upsampled_df.target.value_counts()



target
0    897
1    897
Name: count, dtype: int64

In [30]:
## 2. Down-sampling
'''
Down-sampling can be used in below scenarios:
1. When the majority class is overwhelming
2. When we have limited computational capacity
3. When the majority class is less important
'''

majority_downsampled = resample(majority,
                                replace = False,
                                n_samples = len(minority),
                                random_state = 42)

# Check the size of the majority class after down-sampling
print(majority_downsampled.shape)

# Concatenate the down-sampled majority df and the minority df, resetting the index
downsampled_df = pd.concat([majority_downsampled, minority]).reset_index(drop=True)

# Verify if the df is balanced
downsampled_df.target.value_counts()

(102, 3)


target
0    102
1    102
Name: count, dtype: int64

___

##### Q5. What is data Augmentation? Explain SMOTE.
**Data augmentation**
* It refers to artificially increasing the size of the dataset by applying various transformations to the original data.
* These transformations create new, slightly modified versions of the existing data, which can be used to train machine learning models.
* Data augmentation serves several purposes, including improving model generalization, reducing overfitting, and addressing issues related to limited training data.

**SMOTE (Synthetic Minority Over-sampling Technique)**
* It is a data augmentation technique commonly used to address class imbalance in classification tasks.
* It is particularly useful when the minority class is underrepresented in the dataset.
* SMOTE works by generating synthetic examples for the minority class, thereby balancing the class distribution.
* For each selected instance in the minority class, SMOTE finds its k nearest neighbors within the minority class (typically using Euclidean distance in feature space) and randomly picks one of them.
* SMOTE then creates a synthetic sample by interpolating between the instance and the chosen neighbor: it draws a random value between 0 and 1, multiplies it by the difference between the two feature vectors, and adds the result to the original instance. The random value determines how far along the line segment between the two points the new sample lies.
* The synthetic samples are added to the original dataset, effectively increasing the size of the minority class.
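
The interpolation step can be sketched with NumPy. This is a simplified illustration, not the implementation used by any particular library; the function name `smote_like_sample` and the toy minority data are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=5, rng=None):
    """Create one synthetic point from a 2-D array of minority-class rows."""
    rng = np.random.default_rng() if rng is None else rng

    # Pick a random minority instance
    i = rng.integers(len(minority))
    x = minority[i]

    # Find its k nearest minority neighbors by Euclidean distance (skipping itself)
    distances = np.linalg.norm(minority - x, axis=1)
    neighbor_ids = np.argsort(distances)[1:k + 1]

    # Choose one neighbor at random and interpolate toward it
    neighbor = minority[rng.choice(neighbor_ids)]
    gap = rng.random()  # random value in [0, 1)
    return x + gap * (neighbor - x)

rng = np.random.default_rng(42)
minority = rng.normal(size=(20, 2))  # hypothetical minority-class features
synthetic = np.array([smote_like_sample(minority, k=5, rng=rng) for _ in range(10)])
print(synthetic.shape)
```

Because each synthetic point lies on the line segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.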
___



##### Q6. What are outliers in a dataset? Why is it essential to handle outliers?

**Outliers**:
* Outliers are data points in a dataset that significantly deviate from the majority of other data points. 
* They are observations that are unusually distant from the center or the typical values of a dataset. 
* Outliers can occur for various reasons, including data entry errors, measurement errors, natural variability, or the presence of rare events.

**Detecting and handling outliers is important for several reasons:**
* **Data Quality** : Outliers can be the result of data entry errors or measurement inaccuracies. By identifying and addressing outliers, we can improve the overall quality and accuracy of our dataset.
* **Statistical Analysis**: Outliers can distort summary statistics, such as the mean and standard deviation, leading to inaccurate interpretations of the data. 
* **Model Performance**: Outliers can have a significant impact on the performance of machine learning and statistical models. Many models are sensitive to outliers, and their presence can lead to biased parameter estimates, reduced predictive accuracy, and increased model complexity.
* **Risk Management**: In applications like finance, fraud detection, and quality control, outliers may represent important events or anomalies that require attention. Detecting and managing outliers can help mitigate risks and make informed decisions.
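
The effect on summary statistics can be illustrated with a few made-up measurements. The sketch below also flags outliers with the 1.5 × IQR rule, a common rule of thumb that is assumed here rather than taken from the text above.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([10, 12, 11, 13, 12, 11, 10, 200])

# The mean is pulled toward the outlier; the median barely moves
mean_with, median_with = values.mean(), np.median(values)
mean_without, median_without = values[:-1].mean(), np.median(values[:-1])

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("with outlier:    mean =", mean_with, " median =", median_with)
print("without outlier: mean =", mean_without, " median =", median_without)
print("flagged outliers:", outliers)
```

Whether a flagged point should be removed, capped, or kept depends on whether it is an error or a genuine rare event.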
___



##### Q7. You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?
We can handle the missing values by imputing them, using techniques such as:
1. Mean value imputation.
2. Median value imputation if there are any outliers in the existing dataset.
3. Mode value imputation for categorical data.
4. Random value imputation if the values are missing at random.
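
As a compact sketch of the median and mode options, assuming scikit-learn is available, `SimpleImputer` can be applied to a small customer table. The table and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical customer data with gaps in a numeric and a categorical column
customers = pd.DataFrame({
    'age': [25, np.nan, 40, 35, np.nan, 90],
    'plan': ['basic', 'premium', np.nan, 'basic', 'basic', 'premium'],
})

# Median imputation for the numeric column (less sensitive to the extreme age of 90)
median_imputer = SimpleImputer(strategy='median')
customers['age_imputed'] = median_imputer.fit_transform(customers[['age']]).ravel()

# Mode imputation for the categorical column
mode_imputer = SimpleImputer(strategy='most_frequent')
customers['plan_imputed'] = mode_imputer.fit_transform(customers[['plan']]).ravel()

print(customers)
```

Using a fitted imputer has the practical advantage that the same fill values can later be applied to new data with `transform`.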
___

##### Q8. You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?
Determining whether the missing data is missing at random (MAR) or if there is a pattern to the missing data is an important step in handling missing data effectively. We can use some of the strategies below to investigate the missing data pattern.
1. **Visualization of Missing Data**: Visualizing missing data using techniques like missing data heatmaps or bar charts is a widely used and intuitive approach to initially explore the patterns of missingness in a dataset. It provides a quick visual summary of which variables have missing values and whether there are any apparent patterns.

2. **Statistical Tests**: Statistical tests, such as Chi-square tests for independence or correlation tests, are frequently employed to determine if there is a significant association between the missingness of one variable and the values of other variables. These tests help quantify the relationship between missing data and other variables in the dataset.

3. **Imputation and Analysis**: Imputing missing data and comparing the results before and after imputation is a common practice. If imputation significantly affects the analysis results, it may indicate non-random missingness. While not a direct method of identifying patterns, this approach can indirectly reveal whether missing data follows a systematic pattern.
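
The first two strategies can be sketched as follows, assuming SciPy is installed. The data is synthetic and deliberately built so that `income` is missing more often for one customer segment; all column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: income is more likely to be missing for segment B
rng = np.random.default_rng(0)
n = 2000
segment = rng.choice(['A', 'B'], size=n)
income = rng.normal(50_000, 10_000, size=n)
p_missing = np.where(segment == 'B', 0.30, 0.05)
income[rng.random(n) < p_missing] = np.nan
df_check = pd.DataFrame({'segment': segment, 'income': income})

# 1. Share of missing values per column (a numeric stand-in for a bar chart)
missing_share = df_check.isnull().mean()

# 2. Missingness indicator cross-tabulated against another variable
df_check['income_missing'] = df_check['income'].isnull()
table = pd.crosstab(df_check['segment'], df_check['income_missing'])

# 3. Chi-square test of independence between segment and missingness
chi2, p_value, dof, expected = chi2_contingency(table)

print(missing_share)
print(table)
print(f"p-value = {p_value:.3g}")
```

A small p-value suggests that missingness in `income` depends on `segment`, so the values are unlikely to be missing purely by chance. The test cannot tell whether missingness also depends on the unobserved income values themselves.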
___

##### Q9. Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?
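- Because most patients do not have the condition, accuracy alone is misleading: a model that predicts "no condition" for everyone would still score high. The model should be evaluated with metrics that focus on the minority class, such as the confusion matrix, precision, recall, and F1-score, computed on a test set that keeps the original class distribution.
- To help the model learn the minority class, we can balance the training data with one of the methods below:
1. **Up-Sampling**: We can increase the size of the minority class (patients with the condition) by randomly repeating its records.
2. **Down-Sampling**: We can decrease the size of the majority class (patients without the condition) to match the size of the minority class by randomly deleting records.
3. **SMOTE (Synthetic Minority Over-Sampling Technique)**: SMOTE generates synthetic examples for the minority class, thereby balancing the class distribution.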
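
For the evaluation itself, a minimal sketch with scikit-learn is shown below. The data is synthetic (roughly 5% positive cases) and the choice of logistic regression is only an assumption for illustration; the point is to report per-class metrics rather than a single accuracy figure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 5% of patients have the condition
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

# Stratify so that the train and test sets keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

# Confusion matrix plus per-class precision, recall, and F1
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))

# Area under the precision-recall curve, computed from predicted probabilities
pr_auc = average_precision_score(y_test, y_score)
print("PR-AUC:", pr_auc)
```

If resampling is used, it is usually applied to the training split only, so that the test set still reflects the real class distribution.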
___

##### Q10. When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

- We can use **Down-Sampling** to balance the dataset, which removes randomly chosen records from the majority class.
- The code below shows how it can be done:

```python 
# Import resample from sklearn.utils to resample the data
import pandas as pd
from sklearn.utils import resample

# `df` is assumed to be a DataFrame with a binary 'satisfaction' column
# Divide the dataset into two dataframes - majority class and minority class
df_majority = df[df['satisfaction'] == 1]
df_minority = df[df['satisfaction'] == 0]

# Down-Sample the majority class
df_majority_down_sampled = resample(df_majority,
                                    replace = False,    # This is to make sure that the data points are not repeated
                                    n_samples = len(df_minority),    # The desired number of majority samples
                                    random_state = 42
                                    )

# Now we combine the original minority and down sampled majority dataframes
df = pd.concat([df_minority, df_majority_down_sampled], axis=0).reset_index(drop=True)
```
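
The same result can also be obtained with pandas alone. The sketch below uses made-up survey data with a hypothetical `score` feature and draws an equal number of rows from each class with `groupby(...).sample`.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: 900 satisfied (1) and 100 unsatisfied (0) customers
rng = np.random.default_rng(42)
survey = pd.DataFrame({
    'score': rng.normal(size=1000),
    'satisfaction': [1] * 900 + [0] * 100,
})

# Size of the smaller class
minority_size = int(survey['satisfaction'].value_counts().min())

# Draw `minority_size` rows without replacement from each class
balanced = (
    survey.groupby('satisfaction', group_keys=False)
          .sample(n=minority_size, random_state=42)
          .reset_index(drop=True)
)

print(balanced['satisfaction'].value_counts())
```

Sampling the minority class at its full size keeps all of its rows, so only majority records are discarded.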
___

##### Q11. You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

- When we're dealing with an unbalanced dataset with a low percentage of occurrences of a rare event, it can lead to biased machine learning models that perform poorly on the minority class. 
- To address this issue, we can up-sample the minority class, either by random resampling with replacement or by generating synthetic minority examples with SMOTE.

1. **Up-Sampling Minority Class** : Here we increase the size of minority class by picking the data repeatedly at random.
```python 
# Import resample from sklearn.utils to resample the data
import pandas as pd
from sklearn.utils import resample

# `df` is assumed to be a DataFrame with a binary 'occurrence' column
# Divide the dataset into two dataframes - majority class and minority class
df_majority = df[df['occurrence'] == 0]
df_minority = df[df['occurrence'] == 1]

# Up-Sample the minority class
df_minority_up_sampled = resample(df_minority,
                                    replace = True,    # Sample with replacement so minority rows can be repeated
                                    n_samples = len(df_majority),    # The desired number of minority samples
                                    random_state = 42
                                    )

# Now we combine the original majority and up sampled minority dataframes
df = pd.concat([df_majority, df_minority_up_sampled], axis=0).reset_index(drop=True)
```

2. **Using SMOTE to generate synthetic examples of minority class**: SMOTE generates synthetic examples of the minority class by interpolating between a minority observation and one of its nearest minority-class neighbors.
```python
# Import SMOTE from imblearn (the imbalanced-learn package)
import pandas as pd
from imblearn.over_sampling import SMOTE

# Create an instance of SMOTE
oversample = SMOTE()

# `df_final` is assumed to be a DataFrame with numeric features and an 'occurrence' target
# Resample the dataset by passing the independent and dependent variables separately
X, y = oversample.fit_resample(df_final[['feature_1', 'feature_2']], df_final['occurrence'])

# Concatenate the resampled features and target into one dataframe
oversampled_df = pd.concat([X, y], axis=1)
```
___