Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values in a dataset refer to the absence of values for one or more features in some or all observations. It is essential to handle them because most algorithms cannot process them directly, and ignoring or mishandling them can lead to biased or inaccurate results when analyzing the data or building models.

Few algorithms are truly unaffected by missing values, but some tolerate them, depending on the implementation:

1.Decision trees: Some implementations handle missing values natively, for example by using surrogate splits (CART) or by sending observations with a missing value down a dedicated or default branch.

2.Random forests and gradient-boosted trees: Ensembles built on such trees inherit this behaviour. XGBoost and LightGBM, for example, learn a default direction for missing values at each split.

3.Support Vector Machines (SVMs): Standard SVMs do not tolerate missing values. The kernel is computed over all features, so missing entries must be imputed, or the affected rows dropped, before training.

4.K-Nearest Neighbors (KNN): KNN can handle missing values if the distance is computed only over the features available in both points. With an ordinary distance metric it requires imputation first.
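
The idea of finding neighbors from only the available features can be sketched in plain Python. This is an illustration under stated assumptions, not how any particular KNN library is implemented: the function name `nan_euclidean` and the sample points are hypothetical, and the rescaling by (total features / usable features) mirrors the weighting that scikit-learn documents for its NaN-aware Euclidean distance.

```python
import math

def nan_euclidean(a, b):
    """Euclidean distance over the coordinates present in both vectors.

    The squared distance is rescaled by (total coordinates / usable
    coordinates) so that pairs with more gaps do not look artificially close.
    """
    usable = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not usable:
        return math.nan  # no shared coordinates: the distance is undefined
    squared = sum((x - y) ** 2 for x, y in usable)
    return math.sqrt(len(a) / len(usable) * squared)

nan = float("nan")
query = [1.0, nan, 3.0]                      # second feature is missing
points = {"p": [1.0, 5.0, 7.0], "q": [2.0, 2.0, 3.0]}

# Distance from the query to each candidate, skipping the missing feature
distances = {name: nan_euclidean(query, vec) for name, vec in points.items()}
nearest = min(distances, key=distances.get)
print(distances, nearest)
```

Here the second feature is skipped for both candidates, so the neighbor is chosen from the first and third features alone.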

Q2: List down techniques used to handle missing data. Give an example of each with python code.

1.Deletion methods: This involves removing observations (rows) or features (columns) with missing values from the dataset.

In [3]:
import pandas as pd

# create a sample dataframe with missing data
df = pd.DataFrame({
    'A': [1, 2, 3, None, 5],
    'B': [6, None, 8, 9, 10]
})

# drop rows where 'B' is missing (row 3, where only 'A' is missing, is kept)
df = df.dropna(subset=['B'])
print(df)

     A     B
0  1.0   6.0
2  3.0   8.0
3  NaN   9.0
4  5.0  10.0


2.Imputation methods: This involves filling in the missing values with estimated values. There are several techniques for imputing missing values, including mean imputation, median imputation, mode imputation, and regression imputation.

In [None]:
from sklearn.impute import SimpleImputer
import pandas as pd

# Load the dataset
df = pd.read_csv('data.csv')

# Create an imputer object with mean strategy
imputer = SimpleImputer(strategy='mean')

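# Impute missing values in the numeric columns only
# (the mean strategy raises an error on non-numeric columns)
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = imputer.fit_transform(df[num_cols])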

# View the cleaned dataset
print(df.head())


3.Prediction methods: This involves using a machine learning model to predict the missing values from the relationships among the available features. KNNImputer, for example, fills each gap with the average of that feature over the k most similar rows.

In [None]:
from sklearn.impute import KNNImputer
import pandas as pd

# Load the dataset
df = pd.read_csv('data.csv')

# Create an imputer object with KNN strategy
imputer = KNNImputer(n_neighbors=3)

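# Impute missing values in the numeric columns only
# (KNNImputer computes distances, so it needs numeric input)
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = imputer.fit_transform(df[num_cols])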

# View the cleaned dataset
print(df.head())
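
The two imputation cells above read `data.csv`, which is not part of this notebook. As a self-contained comparison, the sketch below applies both imputers to a small inline array. The array values are made up for illustration; `SimpleImputer` and `KNNImputer` are the same scikit-learn classes used above.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical 4x3 feature matrix with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Mean imputation: each gap gets the mean of its column
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: each gap gets the mean of that column over the
# 2 nearest rows, using a distance that skips missing coordinates
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled)
print(knn_filled)
```

Mean imputation fills the gaps with the column means (4.0 and 5.0), whereas KNN imputation fills them with values taken from the most similar rows (5.5 and 4.0), so the two methods can disagree noticeably even on a tiny dataset.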


Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a situation where one target class accounts for a disproportionately large share of the observations while another class is rare. Imbalanced datasets can cause problems in both model training and evaluation because both are commonly run with the assumption that there are an adequate number of observations for each class.

If imbalanced data is not handled, a model can score well on headline metrics while performing poorly on the minority class. For example, if 92% of the data is labelled ‘Not Fraud’ and the remaining 8% are cases of ‘Fraud’, a model that always predicts ‘Not Fraud’ reaches 92% accuracy while detecting no fraud at all, so accuracy is misleading. In such cases, we need to use techniques like undersampling, oversampling, or SMOTE to handle the imbalance.
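
The 92% / 8% example can be checked with a few lines of plain Python. The "model" here is a hypothetical baseline that always predicts the majority class; the labels are constructed to match the proportions in the text.

```python
# 100 transactions matching the 92% / 8% split described above
labels = ["Fraud"] * 8 + ["Not Fraud"] * 92

# A trivial baseline that always predicts the majority class
predictions = ["Not Fraud"] * len(labels)

# Accuracy: share of predictions that match the label
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the fraud class: share of actual fraud cases that were caught
fraud_caught = sum(p == "Fraud" and y == "Fraud"
                   for p, y in zip(predictions, labels))
fraud_recall = fraud_caught / labels.count("Fraud")

print(accuracy, fraud_recall)
```

The baseline scores 0.92 accuracy with a fraud recall of 0.0, which is why accuracy alone is not a useful measure here.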

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

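Up-sampling involves randomly duplicating observations from the minority class, whereas down-sampling involves randomly removing observations from the majority class. Up-sampling is typically needed when the minority class has too few observations to afford discarding any data (for example, a small fraud dataset), while down-sampling suits cases where the majority class is large enough that removing rows loses little information (for example, millions of non-fraud transactions).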

In [4]:
import pandas as pd
from sklearn.utils import resample

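# create a small sample dataset
# note: both classes have 5 rows here, so the class sizes are already equal and
# the resampling calls below leave the class counts unchanged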
data = {'feature_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'feature_2': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
        'target': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)

# separate majority and minority classes
majority_class = df[df.target==0]
minority_class = df[df.target==1]

# up-sample minority class
minority_upsampled = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=123)

# down-sample majority class
majority_downsampled = resample(majority_class, replace=False, n_samples=len(minority_class), random_state=123)

# combine majority and minority classes
upsampled_df = pd.concat([majority_class, minority_upsampled])
downsampled_df = pd.concat([minority_class, majority_downsampled])

print(upsampled_df)
print(downsampled_df)


   feature_1  feature_2  target
0          1          1       0
1          2          0       0
2          3          1       0
3          4          0       0
4          5          1       0
7          8          0       1
9         10          0       1
7          8          0       1
6          7          1       1
8          9          1       1
   feature_1  feature_2  target
5          6          0       1
6          7          1       1
7          8          0       1
8          9          1       1
9         10          0       1
1          2          0       0
3          4          0       0
4          5          1       0
0          1          1       0
2          3          1       0
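
The cell above uses five rows per class, so the effect of resampling on the class counts is not visible. The sketch below repeats the idea on a hypothetical 8-versus-2 dataset. It swaps in pandas' `DataFrame.sample` in place of `sklearn.utils.resample` to stay dependency-light; the sampling logic is the same.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 rows of class 0, 2 rows of class 1
df = pd.DataFrame({
    "feature_1": range(1, 11),
    "target": [0] * 8 + [1] * 2,
})
majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Up-sampling: draw minority rows with replacement until both classes match
up = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# Down-sampling: keep a random subset of majority rows, without replacement
down = pd.concat([
    minority,
    majority.sample(n=len(minority), replace=False, random_state=0),
])

up_counts = up["target"].value_counts().to_dict()
down_counts = down["target"].value_counts().to_dict()
print(up_counts, down_counts)
```

Up-sampling yields 8 rows per class (16 in total) and down-sampling yields 2 rows per class (4 in total), which makes the trade-off concrete: the first duplicates information, the second discards it.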


Q5: What is data Augmentation? Explain SMOTE.

Data augmentation is a technique used in machine learning and computer vision to artificially increase the size of a training dataset by creating new training samples from existing ones.

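SMOTE (Synthetic Minority Over-sampling Technique) is an algorithm that performs data augmentation on the minority class. For each synthetic point, it picks a minority class sample, selects one of its k nearest minority class neighbors, and places the new point at a random position on the line segment between the two.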
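
The interpolation step behind SMOTE can be sketched with NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical, and it assumes `k` is smaller than the number of minority samples.

```python
import numpy as np

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority class neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances within the minority class
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))            # random minority sample
        other = neighbors[base, rng.integers(k)]      # one of its k neighbors
        gap = rng.random()                            # position in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[other] - minority[base])
    return synthetic

minority_points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_sketch(minority_points, n_new=6)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies.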

Q6: What are outliers in a dataset? Why is it essential to handle outliers?

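Outliers are data points that are significantly different from other data points in a dataset. They can be caused by measurement errors, data entry errors, or natural variation in the population.

It is essential to handle outliers because they can have a significant impact on statistical analysis: a few extreme values can distort the mean and standard deviation, skew regression coefficients, and degrade models that are sensitive to scale or squared error.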
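
A small numeric illustration of that impact, using made-up values and the common 1.5 × IQR rule to flag outliers. The rule is one convention among several, and the threshold of 1.5 is an assumption rather than something prescribed by the text above.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([10, 12, 11, 13, 12, 11, 95], dtype=float)

# Flag points more than 1.5 * IQR outside the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print("mean with outlier:   ", values.mean())
print("mean without outlier:", values[~is_outlier].mean())
print("median with outlier: ", np.median(values))
print("flagged:", values[is_outlier])
```

The single value 95 roughly doubles the mean (from 11.5 to about 23.4), while the median stays at 12, which is why robust statistics or explicit outlier handling matter.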

Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

1.Mean / median imputation: In this technique, missing values are replaced with the mean or median value of the respective variable. The median is preferable when the variable is skewed or contains outliers.

2.Mode imputation: This technique is used when the data is categorical. The missing values are replaced with the most frequent value of the respective variable.

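3.Backward/forward filling: This technique is useful when the missing data is in a time series dataset. Each gap is filled with the last observed value before it (forward fill) or the next observed value after it (backward fill).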

4.Hot-deck imputation: In this technique, missing values are replaced with values from similar cases or observations.

5.Regression imputation: This technique involves using regression analysis to predict the missing values based on the relationship between the dependent and independent variables.
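
Several of the techniques listed above can be sketched with pandas alone. The customer table, column names, and values below are hypothetical; hot-deck and regression imputation are omitted to keep the example short.

```python
import pandas as pd

# Hypothetical customer data with gaps in every column
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "plan": ["basic", "pro", None, "basic"],
    "monthly_spend": [20.0, None, None, 35.0],
})

filled = df.copy()

# Median imputation for a numeric column
filled["age"] = df["age"].fillna(df["age"].median())

# Mode imputation for a categorical column
filled["plan"] = df["plan"].fillna(df["plan"].mode()[0])

# Forward fill: carry the last observed value forward
filled["monthly_spend"] = df["monthly_spend"].ffill()

# Backward fill: pull the next observed value backward
backfilled_spend = df["monthly_spend"].bfill()

print(filled)
print(backfilled_spend)
```

Forward and backward fill give different answers for the same gaps (20.0 versus 35.0 here), so the choice should follow from which direction makes sense for the data.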

Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

1.Visual inspection: One way to identify patterns in missing data is to visually inspect the dataset using plots, such as scatter plots or histograms, to see if there is a relationship between the missing data and other variables.

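2.Missing data analysis: You can perform a missing data analysis to examine the patterns in the missing data, such as the share of missing values per variable and whether missingness in one variable tends to co-occur with missingness in another.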

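3.Statistical tests: You can use Little's MCAR (Missing Completely At Random) test to check whether the data is missing completely at random. You can also compare the distribution of the other variables between rows with and without the missing value (for example, with a t-test or chi-square test); a significant difference suggests the data is not MCAR. Whether data is MNAR (Missing Not At Random) cannot be established from the observed data alone.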

4.Imputation and analysis: Another strategy is to impute the missing data using various techniques, such as mean/median imputation or regression imputation, and analyze the impact of the imputation on the results.

5.Expert knowledge: Finally, you can consult with domain experts to determine if there are any known reasons or patterns for the missing data, such as missing data due to measurement error or non-response bias.
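
A first pass at the missing data analysis described above can be sketched with pandas. The dataset and column names are hypothetical; the point is to compare another variable between rows where a value is missing and rows where it is present.

```python
import pandas as pd

# Hypothetical data in which income is missing mostly for older customers
df = pd.DataFrame({
    "age":    [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [30.0, 32.0, 41.0, 45.0, None, None, 58.0, None],
})

# Share of missing values per column
missing_share = df.isna().mean()

# Compare the mean age of rows with and without a missing income
income_missing = df["income"].isna()
age_when_missing = df.loc[income_missing, "age"].mean()
age_when_present = df.loc[~income_missing, "age"].mean()

print(missing_share)
print(age_when_missing, age_when_present)
```

A large gap between the two means, as in this constructed example, hints that missingness depends on age and is therefore not completely at random. On real data the difference would need a formal test before drawing that conclusion.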

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

1.Confusion matrix: The confusion matrix is a table that shows the number of true positives, true negatives, false positives, and false negatives. 

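2.Class-sensitive metrics: Precision, recall, F1 score, and ROC-AUC can be used to evaluate the model's performance on an imbalanced dataset. Unlike plain accuracy, they are not dominated by the majority class. In a diagnosis setting, recall (sensitivity) often deserves the most attention because a missed case is costly.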

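3.Resampling techniques: Resampling techniques such as oversampling and undersampling can be used to balance the training set. The test set should keep the original class distribution so that the evaluation reflects real-world prevalence.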

4.Cost-sensitive learning: Cost-sensitive learning is a technique that assigns different weights to different classes based on their importance. 

5.Ensemble models: Ensemble models such as bagging, boosting, and stacking can be used to improve the performance of the model on an imbalanced dataset.
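
The confusion matrix and the metrics derived from it can be computed by hand, which makes their definitions explicit. The labels and predictions below are hypothetical: 4 patients with the condition and 16 without.

```python
# 1 = has the condition, 0 = does not (hypothetical data)
y_true = [1] * 4 + [0] * 16
y_pred = [1, 1, 0, 0] + [1] + [0] * 15

# Confusion matrix cells
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)      # of predicted positives, how many are real
recall = tp / (tp + fn)         # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, fn, fp, tn)
print(accuracy, precision, recall, f1)
```

Accuracy is 0.85 even though half of the patients with the condition are missed (recall 0.5), which shows why recall and F1 should be reported alongside accuracy.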

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

1.Random under-sampling: This method involves randomly removing some of the majority class samples until the dataset is balanced. 

2.Cluster-based under-sampling: This method involves identifying clusters of samples from the majority class and keeping only the centroid of each cluster. 

3.Tomek links: Tomek links are pairs of samples from different classes that are each other's nearest neighbors. By removing the majority class sample from each Tomek link, we can clean up the class boundary and obtain a slightly smaller dataset. Because only boundary samples are removed, this method is usually combined with another under-sampling technique to reach full balance.

4.Edited nearest neighbors: This method involves identifying the samples in the majority class that are misclassified by the nearest neighbor classifier and removing them from the dataset.

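5.Synthetic minority oversampling technique (SMOTE): Strictly speaking, SMOTE is an over-sampling method, not a down-sampling one: it generates synthetic samples for the minority class by interpolating between existing minority samples. It is often combined with under-sampling of the majority class (for example, SMOTE followed by Tomek link removal).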

Once you have down-sampled the majority class, you can train a model on the balanced dataset to estimate customer satisfaction. It's important to note that down-sampling the majority class may result in a smaller dataset and may lead to a loss of information. 
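
Of the methods above, Tomek link removal is the least obvious to implement, so here is a minimal NumPy sketch. The function name `tomek_links` and the one-dimensional toy data are hypothetical, and the brute-force distance matrix is only suitable for small datasets.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors and carry different class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)     # ignore self-distances
    nearest = dists.argmin(axis=1)      # nearest neighbor of each sample
    return [(i, int(j)) for i, j in enumerate(nearest)
            if nearest[j] == i and y[i] != y[j] and i < j]

# Toy data: class 0 (satisfied) is the majority, class 1 the minority
X = [[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [3.0, 0.0], [3.1, 0.0], [5.0, 0.0]]
y = [0, 0, 0, 0, 1, 1]

links = tomek_links(X, y)

# Drop the majority class member of each link
majority_in_links = {i if y[i] == 0 else j for i, j in links}
keep = [i for i in range(len(X)) if i not in majority_in_links]

print(links, keep)
```

Only sample 3 is removed: it is a majority sample sitting right next to minority sample 4. Samples 0 and 1 are also mutual nearest neighbors, but they share a class and so do not form a Tomek link.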

Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

1.Random over-sampling: This method involves randomly replicating some of the minority class samples until the dataset is balanced.

2.Synthetic minority oversampling technique (SMOTE): SMOTE is a method that generates synthetic samples for the minority class by creating new samples between existing samples of the minority class.

3.Adaptive synthetic sampling (ADASYN): ADASYN is a variant of SMOTE that generates more synthetic samples for the minority class samples that are harder to learn, that is, those with more majority class samples among their nearest neighbors.

4.Cluster-based over-sampling: This method involves identifying clusters of samples from the minority class and replicating them to increase their representation in the dataset.

5.Kernel Density Estimation (KDE) based over-sampling: This method involves estimating the density distribution of the minority class and creating new samples based on the estimated distribution.
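
KDE-based over-sampling is the least commonly packaged of these methods, so a short NumPy sketch may help. It relies on the fact that drawing from a Gaussian kernel density estimate is equivalent to picking an existing point at random and adding Gaussian noise with the kernel bandwidth as its standard deviation. The function name `kde_oversample`, the fixed bandwidth of 0.1, and the sample points are assumptions for illustration; in practice the bandwidth would need tuning.

```python
import numpy as np

def kde_oversample(minority, n_new, bandwidth=0.1, seed=0):
    """Draw n_new samples from a Gaussian KDE fitted to the minority class."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pick kernel centers uniformly at random from the existing samples
    centers = minority[rng.integers(len(minority), size=n_new)]
    # Add isotropic Gaussian noise with the bandwidth as standard deviation
    noise = rng.normal(scale=bandwidth, size=centers.shape)
    return centers + noise

# Hypothetical rare-event samples
rare_events = [[0.9, 1.1], [1.0, 1.0], [1.2, 0.8]]
synthetic = kde_oversample(rare_events, n_new=5)

# Combine the original and synthetic minority samples
augmented = np.vstack([rare_events, synthetic])
print(augmented.shape)
```

Unlike SMOTE, the synthetic points are not confined to segments between existing samples; a larger bandwidth spreads them further from the originals.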