# Assignment

### Ans1)

Missing values in a dataset refer to the absence of data for one or more variables in a given observation or record. The missing value may occur due to various reasons, such as incomplete data collection, data entry errors, or data corruption during transmission or storage.

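Most algorithms cannot consume missing values directly, but some tolerate them. Tree-based methods are the main example: decision trees that use surrogate splits (as in CART) and gradient-boosted tree libraries such as XGBoost and LightGBM handle missing values natively, and some random forest implementations do as well. Support vector machines, in contrast, require complete inputs, so missing values must be imputed or removed before training.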
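As a small illustration of why many algorithms need missing values handled first, the sketch below shows a single NaN propagating through a distance computation of the kind k-nearest neighbors relies on. The two feature vectors are made up for the example.

```python
import math

# Two hypothetical feature vectors; the second has a missing value (NaN).
a = [1.0, 2.0, 3.0]
b = [1.5, float("nan"), 2.0]

# Euclidean distance, as used by distance-based models such as k-nearest neighbors.
distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# One NaN makes the whole distance NaN, so neighbors can no longer be ranked.
distance_is_nan = math.isnan(distance)
print(distance_is_nan)  # True
```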

### Ans2)

In [14]:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A":[1,2,3,np.nan,4],
                  "B":[3,4,5,np.nan,np.nan],
                  "C":[7,np.nan,9,10,11]})

1) Deletion: This technique involves removing the observations or variables that contain missing values from the dataset.

In [15]:
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 3.0 | 7.0 |
| 1 | 2.0 | 4.0 | NaN |
| 2 | 3.0 | 5.0 | 9.0 |
| 3 | NaN | NaN | 10.0 |
| 4 | 4.0 | NaN | 11.0 |


In [16]:
df.dropna(inplace=True)

In [17]:
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 3.0 | 7.0 |
| 2 | 3.0 | 5.0 | 9.0 |


2) Imputation: This technique involves filling in the missing values with a predicted or estimated value.

In [18]:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A":[1,2,3,np.nan,4],
                  "B":[3,4,5,np.nan,np.nan],
                  "C":[7,np.nan,9,10,11]})

In [19]:
df.fillna(df.mean(),inplace=True)

In [20]:
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 3.0 | 7.0 |
| 1 | 2.0 | 4.0 | 9.25 |
| 2 | 3.0 | 5.0 | 9.0 |
| 3 | 2.5 | 4.0 | 10.0 |
| 4 | 4.0 | 4.0 | 11.0 |


### Ans3)

Imbalanced data is a situation where the proportion of observations in different classes or categories in a dataset is not equal. In other words, one class may have significantly more or fewer observations than the other classes. This often occurs in real-world datasets, such as medical diagnosis, fraud detection, and spam filtering, where the rare events or outcomes are of more interest.

If imbalanced data is not handled properly, it can lead to biased machine learning models. A model trained on imbalanced data tends to favor the majority class and neglect the minority class. As a result, overall accuracy can look high while being misleading: the model may achieve high recall for the majority class but low recall and a high false negative rate for the minority class, which is usually the class of interest.
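The effect can be illustrated with a minimal sketch. The labels below are hypothetical (990 majority and 10 minority examples), and the "model" is simply a rule that always predicts the majority class.

```python
# Hypothetical labels: 990 majority-class (0) and 10 minority-class (1) examples.
y_true = [0] * 990 + [1] * 10

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: detected positives / actual positives.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
minority_recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# accuracy=0.99, minority recall=0.00
```

The model is right 99% of the time yet never detects a single minority example, which is why accuracy alone is a poor guide on imbalanced data.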

### Ans4)

1) Upsampling involves increasing the number of instances in the minority class by either duplicating the existing observations or generating synthetic observations using techniques such as SMOTE (Synthetic Minority Over-sampling Technique). The goal of upsampling is to provide more examples of the minority class to the machine learning algorithm and help it learn the patterns and characteristics of the minority class more effectively.

2) Downsampling involves decreasing the number of instances in the majority class by randomly removing some observations from the dataset. The goal of downsampling is to reduce the bias towards the majority class and provide a more balanced representation of the different classes to the machine learning algorithm.

Here is an example to illustrate when upsampling and downsampling may be required:


Suppose we have a dataset of credit card transactions with 100,000 observations, out of which only 1,000 are fraudulent transactions. In this case, the dataset is highly imbalanced, with the fraudulent transactions representing only 1% of the total observations.

If we train a machine learning model on this imbalanced dataset without any data balancing techniques, the model may not be able to detect the fraudulent transactions effectively and may have a high false negative rate. To address this problem, we can use upsampling or downsampling.

If we choose to upsample the minority class, we can increase the number of fraudulent transactions from 1,000 to 10,000 by either duplicating the existing observations or generating synthetic observations. This will provide more examples of the fraudulent transactions to the machine learning algorithm and help it learn the patterns and characteristics of the fraudulent transactions more effectively.

On the other hand, if we choose to downsample the majority class, we can randomly remove some of the non-fraudulent transactions from the dataset, say 80,000, to reduce the bias towards the majority class. This will provide a more balanced representation of the different classes to the machine learning algorithm and help it learn the patterns and characteristics of both classes more effectively.
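The two strategies in this example can be sketched with the standard library alone. The class sizes come from the example above; the row identifiers and the fixed seed are assumptions made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Row identifiers only, using the class sizes from the example:
# 1,000 fraudulent and 99,000 legitimate transactions.
fraud = [("fraud", i) for i in range(1_000)]
legit = [("legit", i) for i in range(99_000)]

# Upsampling: add random duplicates until there are 10,000 fraud rows.
fraud_upsampled = fraud + rng.choices(fraud, k=10_000 - len(fraud))

# Downsampling: drop 80,000 legitimate rows by keeping a random subset.
legit_downsampled = rng.sample(legit, k=len(legit) - 80_000)

# Fraud share of the dataset after each strategy.
share_after_up = len(fraud_upsampled) / (len(fraud_upsampled) + len(legit))
share_after_down = len(fraud) / (len(fraud) + len(legit_downsampled))

print(f"{share_after_up:.1%} {share_after_down:.1%}")  # 9.2% 5.0%
```

With these particular numbers neither strategy fully balances the classes; the fraud share rises from 1% to roughly 9% or 5%, and the target ratio remains a choice to tune.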

### Ans5)

Data augmentation is a technique used to artificially increase the size of a dataset by creating new samples that are similar to the existing ones, but have small variations. The goal of data augmentation is to improve the generalization ability of machine learning models by providing them with more diverse and representative training data.

One popular technique for data augmentation is SMOTE (Synthetic Minority Over-sampling Technique), which is specifically designed to address the problem of imbalanced data. SMOTE works by generating synthetic samples of the minority class by interpolating between existing samples.

Here's how SMOTE works:

1) Select a minority sample x.

2) Choose the k nearest neighbors of x from the minority class.

3) Randomly choose one of the k nearest neighbors, say x'.

4) Generate a synthetic sample by linearly interpolating between x and x': x_new = x + (x' - x) * r, where r is a random number between 0 and 1.

5) Repeat steps 1-4 until the desired number of synthetic samples is generated.
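The steps above can be sketched in plain Python. This is a simplified illustration, not a replacement for a library implementation; the function name `smote`, its parameters, and the sample points are assumptions made for the example.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate n_synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        # Step 1: select a minority sample x.
        x = rng.choice(minority)
        # Step 2: find its k nearest minority neighbors (excluding x itself).
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 3: pick one of those neighbors at random.
        x_prime = rng.choice(neighbors)
        # Step 4: interpolate, x_new = x + (x' - x) * r with r in [0, 1).
        r = rng.random()
        synthetic.append(tuple(xi + (xpi - xi) * r for xi, xpi in zip(x, x_prime)))
    return synthetic

# Hypothetical two-dimensional minority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=5)
print(len(new_points))  # 5
```

Because every synthetic point lies on the segment between two existing minority samples, the new points stay inside the region the minority class already occupies.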

### Ans6)

Outliers are data points in a dataset that are significantly different from other data points. These data points are usually located far away from the rest of the data points, and they can have a significant impact on the results of data analysis and machine learning models.

It is essential to handle outliers for several reasons:

1) Outliers can affect the statistical measures of a dataset, such as the mean and standard deviation, making them less representative of the actual data distribution.

2) Outliers can affect the accuracy of predictive models by introducing bias and reducing the model's ability to generalize to new data.

3) Outliers can also affect the performance of clustering algorithms by creating clusters that are not representative of the underlying data distribution.

4) Outliers can also be caused by errors in data collection or data entry, and their presence can signal problems in the data collection process.
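A short sketch of the first point, using made-up measurements: one extreme value shifts the mean considerably while the median barely moves. The 1.5 * IQR rule used here to flag the outlier is one common convention, chosen for illustration.

```python
import statistics

# Hypothetical measurements; 95 is an obvious outlier.
values = [10, 12, 11, 13, 12, 11, 10, 95]

mean_with = statistics.mean(values)      # pulled up by the outlier
median_with = statistics.median(values)  # barely affected

# Flag points more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
cleaned = [v for v in values if lower <= v <= upper]

print(mean_with, median_with, outliers)  # 21.75 11.5 [95]
print(round(statistics.mean(cleaned), 2))  # 11.29
```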

### Ans7)

There are several techniques that can be used to handle missing data in customer data analysis. Here are some of the most commonly used techniques:

1) Delete the missing data: One technique is to delete the missing data from the dataset. This is only appropriate if the amount of missing data is small and the remaining data is still representative of the underlying population.

2) Impute missing data: Another technique is to impute the missing data by replacing the missing values with an estimate of the missing value. This can be done using techniques such as mean imputation, median imputation, mode imputation, or regression imputation.

3) Use multiple imputation: Techniques such as multiple imputation by chained equations (MICE) generate several plausible values for each missing entry and combine the results, which reflects the uncertainty in the estimates instead of treating a single imputed value as exact.

4) Use machine learning algorithms: Machine learning algorithms such as k-Nearest Neighbors (KNN) and Decision Trees can be used to predict missing values based on the other variables in the dataset.

5) Use expert knowledge: Expert knowledge can be used to estimate missing values in a dataset based on the characteristics of the data.
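As an illustration of simple imputation on customer data, the sketch below fills numeric gaps with the median and categorical gaps with the mode. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer records with gaps in numeric and categorical columns.
customers = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "monthly_spend": [120.0, np.nan, 80.0, 300.0, 95.0],
    "segment": ["retail", "retail", None, "business", "retail"],
})

imputed = customers.copy()

# Median imputation: robust to skewed numeric columns such as spend.
for col in ["age", "monthly_spend"]:
    imputed[col] = imputed[col].fillna(imputed[col].median())

# Mode imputation: the most frequent category fills categorical gaps.
imputed["segment"] = imputed["segment"].fillna(imputed["segment"].mode()[0])

print(imputed)
```

The median is used instead of the mean here because one large spender (300.0) would otherwise pull the filled value upward.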

### Ans8)

Here are some strategies that can be used to determine if the missing data is missing at random or if there is a pattern to the missing data:

Missing data analysis: Conduct a missing data analysis to identify any patterns or correlations between the missing data and the other variables in the dataset. This can involve visualizing the missing data patterns and conducting statistical tests such as chi-square tests or t-tests.

Correlation analysis: Create a binary indicator for each variable with missing values (1 if the value is missing, 0 otherwise) and calculate correlation coefficients between these indicators and the other variables in the dataset. Strong correlations suggest that the missingness follows a pattern.

Imputation tests: Conduct imputation tests by imputing the missing data using different imputation techniques and comparing the results. If the imputed values are significantly different from the observed values, it may indicate that the missing data is not missing at random.

Expert knowledge: Use expert knowledge to identify any potential reasons for the missing data and whether there is a pattern to the missing data.

Machine learning algorithms: Use machine learning algorithms such as random forests or logistic regression to predict whether a value is missing (a binary missingness indicator) from the other variables in the dataset. If the indicator can be predicted much better than chance, the missingness depends on the observed variables, so the data is not missing completely at random.
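A minimal sketch of a missingness check with pandas, on hypothetical survey data in which income is missing mostly for older respondents. The column names and values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing mostly for older respondents.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 58, 63, 67, 71],
    "income": [30, 35, 42, 50, np.nan, 61, np.nan, np.nan, np.nan, np.nan],
})

# Indicator column: 1 where income is missing, 0 otherwise.
df["income_missing"] = df["income"].isna().astype(int)

# Compare an observed variable across the two groups.
mean_age_by_missing = df.groupby("income_missing")["age"].mean()

# Correlation between missingness and age; values far from 0 suggest a pattern.
corr = df["income_missing"].corr(df["age"])

print(mean_age_by_missing)
print(round(corr, 2))
```

A large gap between the group means, or a correlation far from zero, indicates that missingness depends on an observed variable. A check like this cannot reveal whether missingness depends on the unobserved values themselves, which is where expert knowledge comes in.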

### Ans9)

Here are some strategies that can be used to evaluate the performance of a machine learning model on an imbalanced dataset:

1) Confusion Matrix: A confusion matrix provides the counts of true positives, true negatives, false positives, and false negatives, which can be used to calculate metrics such as precision, recall, and F1-score. Accuracy can also be computed from it, but on an imbalanced dataset it is dominated by the majority class and should not be relied on alone.

2) Precision-Recall Curve: A precision-recall curve can be plotted to evaluate the performance of the model on an imbalanced dataset. It shows the trade-off between precision and recall for different classification thresholds, and the area under the curve (AUC) can be used as a performance metric.

3) ROC Curve: A ROC (Receiver Operating Characteristic) curve can also be used to evaluate the performance of the model. It shows the trade-off between true positive rate (sensitivity) and false positive rate (1 - specificity) for different classification thresholds, and the area under the curve (AUC) can be used as a performance metric.

4) Stratified Sampling: When splitting the dataset into training and testing sets, it is important to use stratified sampling to ensure that the proportion of positive and negative examples is maintained in both sets. This keeps the minority class represented in the test set, so the evaluation metrics are reliable estimates.

5) Resampling Techniques: Resampling techniques such as oversampling the minority class or undersampling the majority class can be used to balance the class distribution and improve performance on the minority class. Apply them to the training set only and leave the test set at its original distribution; otherwise the measured performance will not reflect real-world behavior.
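To make the confusion-matrix metrics concrete, here is a small worked example. The counts are hypothetical, with the minority class treated as the positive class.

```python
# Hypothetical confusion-matrix counts for a binary classifier,
# with the minority class treated as "positive".
tp, fp, fn, tn = 80, 40, 20, 9860

precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.3f}")
# precision=0.67 recall=0.80 f1=0.73 accuracy=0.994
```

Accuracy is above 99% even though one in three positive predictions is wrong and one in five positives is missed, which is why precision, recall, and F1 are the more informative numbers here.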

### Ans10)

Here are some methods to balance the dataset and down-sample the majority class:

1) Undersampling: Undersampling involves randomly selecting a subset of the majority class samples to match the number of minority class samples. This can help to balance the class distribution in the dataset.

2) Oversampling: Oversampling involves generating synthetic samples for the minority class to match the number of majority class samples. This can be done using techniques such as SMOTE (Synthetic Minority Over-sampling Technique) or ADASYN (Adaptive Synthetic Sampling).

3) Hybrid Sampling: Hybrid sampling involves combining both undersampling and oversampling techniques to balance the dataset. One example is SMOTE combined with Tomek links: SMOTE is applied first, and then Tomek links (pairs of nearest-neighbor samples that belong to opposite classes) are identified and removed to clean up the overlap between the classes.
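Random undersampling can be sketched with pandas as follows. The dataset, column names, and fixed `random_state` are hypothetical choices for illustration; `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 95 majority rows (label 0), 5 minority rows (label 1).
data = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 95 + [1] * 5,
})

# Size of the smallest class.
minority_size = int(data["label"].value_counts().min())

# Randomly keep `minority_size` rows per class; the minority class is kept whole.
balanced = data.groupby("label").sample(n=minority_size, random_state=0)

print(balanced["label"].value_counts().to_dict())
```

The result has five rows per class. The discarded majority rows may still carry useful information, which is the main cost of undersampling.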

### Ans11)

Here are some methods that can be used to up-sample the minority class:

1) Random oversampling: In this method, we randomly duplicate samples from the minority class to increase its size to match the size of the majority class. This method is simple and fast, but it adds exact copies and no new information, so it can lead to overfitting.

2) Synthetic Minority Over-sampling Technique (SMOTE): In this method, we create synthetic samples of the minority class by interpolating between pairs of samples from the minority class. This method can increase the diversity of the data and reduce the risk of overfitting, but it can also generate noisy samples, especially where the minority class overlaps with the majority class.

3) Adaptive Synthetic (ADASYN): This method is an extension of SMOTE that generates more synthetic samples for minority samples that are harder to learn, that is, those with more majority-class samples among their nearest neighbors. This focuses the model on difficult regions near the class boundary, but it can also amplify noise if those minority samples are outliers.
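The difference between ADASYN and SMOTE lies mainly in how many synthetic samples are generated around each minority point. The sketch below covers only that allocation step, as it is commonly described: each minority point is weighted by the share of majority-class points among its k nearest neighbors. The function name `adasyn_allocation`, its parameters, and the data are assumptions made for illustration.

```python
import math

def adasyn_allocation(minority, majority, n_synthetic, k=3):
    """Return how many synthetic samples to generate around each minority point."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for x in minority:
        # k nearest neighbors of x among all other points, regardless of class.
        others = [(p, label) for p, label in labelled if p is not x]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
        # Fraction of those neighbors that belong to the majority class.
        ratios.append(sum(1 for _, label in nearest if label == 0) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)
    # Points surrounded by more majority samples get a larger share.
    # Rounding means the counts may not sum exactly to n_synthetic.
    return [round(n_synthetic * r / total) for r in ratios]

# Hypothetical data: three minority points form a tight cluster,
# while the fourth sits in the middle of the majority class.
minority = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (2.0, 2.0)]
majority = [(2.1, 2.0), (2.0, 2.2), (1.9, 2.1), (3.0, 3.0), (3.1, 2.9)]

allocation = adasyn_allocation(minority, majority, n_synthetic=6)
print(allocation)  # [0, 0, 0, 6]
```

All six synthetic samples are assigned to the isolated minority point. The actual samples would then be generated by SMOTE-style interpolation, a step this sketch leaves out.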
