# Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

# Missing Values
Missing values are entries that are not stored (or not present) for one or more variables in a dataset.

Types of Missing Values

![image.png](attachment:89e49189-1827-43d1-8f42-bf1472e852ce.png)

1. Missing Completely at Random (MCAR):

Missing completely at random (MCAR) is a type of missing data mechanism in which the probability of a value being missing is unrelated to both the observed data and the missing data. In other words, if the data is MCAR, the missing values are randomly distributed throughout the dataset, and there is no systematic reason for why they are missing.

For example, in a survey about the prevalence of a certain disease, the missing data might be MCAR if the survey participants with missing values for certain questions were selected randomly and their missing responses are not related to their disease status or any other variables measured in the survey.

2. Missing at Random (MAR):

Missing at Random (MAR) is a type of missing data mechanism in which the probability of a value being missing depends only on the observed data, but not on the missing data itself. In other words, if the data is MAR, the missing values are systematically related to the observed data, but not to the missing data. Here are a few examples of missing at random:

Income data: Suppose you are collecting income data from a group of people, but some participants choose not to report their income. If the decision to report or not report income is related to the participant's age or gender, but not to their income level, then the data is missing at random.

Medical data: Suppose you are collecting medical data on patients, including their blood pressure, but some patients do not report their blood pressure. If the patients who do not report their blood pressure are more likely to be younger or have healthier lifestyles, but the missingness is not related to their actual blood pressure values, then the data is missing at random.

3. Missing Not at Random (MNAR):

Missing not at random (MNAR) is a type of missing data mechanism in which the probability of a value being missing depends on the missing value itself, even after accounting for the observed data. In other words, if the data is MNAR, the missingness is not random and depends on unobserved information, including the very values that are missing.

For example, suppose you are collecting data on the income and job satisfaction of employees in a company. If employees with lower incomes are more likely to refuse to report their income, then the data is not missing at random, because the missingness depends on the income value that was never recorded. (If the refusal depended only on job satisfaction, and job satisfaction was recorded for everyone, the data would be MAR instead.)
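The three mechanisms can be illustrated with a small simulation. This is only a sketch on synthetic data: the `age` and `income` columns, the missingness probabilities, and the sample size are all assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: age is always observed, income may go missing.
age = rng.integers(20, 65, size=n)
income = 20_000 + 800 * (age - 20) + rng.normal(0, 5_000, size=n)
survey = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: missingness depends only on an observed variable (age).
mar_mask = rng.random(n) < np.where(age < 35, 0.40, 0.10)

# MNAR: missingness depends on the income value itself.
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.40, 0.10)

true_mean = survey["income"].mean()
observed_means = {}
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    # Mean income computed only from the values that were "reported".
    observed_means[name] = survey.loc[~mask, "income"].mean()
    print(f"{name}: missing={mask.mean():.1%}, observed mean={observed_means[name]:,.0f}")
print(f"True mean income: {true_mean:,.0f}")
```

In this setup the observed mean stays close to the true mean under MCAR, but drifts away from it under MAR and MNAR. Under MAR the drift can be corrected using `age`; under MNAR it cannot be corrected from the observed data alone.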

# Why handling missing values is important:
It is important to handle missing values appropriately for the following reasons.

1. Many machine learning implementations fail if the dataset contains missing values. Distance-based methods such as K-nearest neighbors need complete (or imputed) data to compute distances. Naive Bayes can, in principle, skip missing attributes when computing likelihoods, but many implementations do not support this.
2. You may end up building a biased machine learning model, leading to incorrect results, if the missing values are not handled properly.
3. Missing data reduces the effective sample size, which can lead to a lack of precision in the statistical analysis.


Some algorithms can work with missing values directly. Tree-based methods are the usual examples: classical decision trees (CART) can use surrogate splits, and gradient boosting libraries such as XGBoost and LightGBM learn which branch missing values should follow at each split. Whether decision trees and random forests accept missing values depends on the implementation, so check the library you use. Even with these algorithms, it is recommended to understand why values are missing and to handle them appropriately, to avoid introducing errors and biases in data analysis and modeling.

# Q2: List down techniques used to handle missing data. Give an example of each with python code.

### The techniques are as follows:

#### 1. Deleting rows that contain missing values

In [1]:
import pandas as pd
import seaborn as sns

# Load the Titanic dataset that ships with seaborn
df = sns.load_dataset('titanic')

In [4]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [6]:
df.dropna()  # Returns a copy without the rows that contain missing (NaN) values

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


#### 2. Deleting columns that contain missing values:

In [7]:
df.dropna(axis=1)  # Returns a copy without the columns that contain missing (NaN) values

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,class,who,adult_male,alive,alone
0,0,3,male,1,0,7.2500,Third,man,True,no,False
1,1,1,female,1,0,71.2833,First,woman,False,yes,False
2,1,3,female,0,0,7.9250,Third,woman,False,yes,True
3,1,1,female,1,0,53.1000,First,woman,False,yes,False
4,0,3,male,0,0,8.0500,Third,man,True,no,True
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,0,0,13.0000,Second,man,True,no,True
887,1,1,female,0,0,30.0000,First,woman,False,yes,True
888,0,3,female,1,2,23.4500,Third,woman,False,no,False
889,1,1,male,0,0,30.0000,First,man,True,yes,True


## 3. Imputation techniques to handle missing values
1. Mean Imputation
2. Median Imputation
3. Mode Imputation

### 1. Mean Imputation:

##### This technique is suitable for numerical columns that have missing values and no strong outliers.

##### Suppose we have to handle the 'age' column

In [13]:
df['age'].isnull().sum()   # The 'age' column contains 177 missing values

177

In [14]:
df['age_mean'] = df['age'].fillna(df['age'].mean())

In [16]:
df[['age','age_mean']].head()

Unnamed: 0,age,age_mean
0,22.0,22.0
1,38.0,38.0
2,26.0,26.0
3,35.0,35.0
4,35.0,35.0


In [18]:
df['age_mean'].isnull().sum()  # All missing values have been filled

0

### 2. Median Imputation:
##### This technique is used when outliers are present in the dataset, because the median is less sensitive to extreme values than the mean.

In [27]:
dataset = pd.DataFrame({"Weight":[50,60,56,67,None,45,None,200,300]})   # Contains outliers (200 and 300)

In [30]:
dataset.isnull().sum()   # Contains 2 missing values

Weight    2
dtype: int64

In [31]:
dataset['New_weight_median'] = dataset['Weight'].fillna(dataset['Weight'].median())

In [37]:
dataset

Unnamed: 0,Weight,New_weight_median
0,50.0,50.0
1,60.0,60.0
2,56.0,56.0
3,67.0,67.0
4,,60.0
5,45.0,45.0
6,,60.0
7,200.0,200.0
8,300.0,300.0


In [38]:
dataset.isnull().sum()

Weight               2
New_weight_median    0
dtype: int64

### 3. Mode Imputation:
##### This technique is mostly used to handle missing values in categorical columns.

##### Suppose we have to handle the "embarked" column in the titanic dataset

In [66]:
df['embarked'].isnull().sum()

2

In [67]:
df['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [68]:
mode = df['embarked'].fillna(df['embarked'].mode()[0])

In [69]:
df['embarked_mode'] = mode

In [70]:
df[['embarked','embarked_mode']]

Unnamed: 0,embarked,embarked_mode
0,S,S
1,C,C
2,S,S
3,S,S
4,S,S
...,...,...
886,S,S
887,S,S
888,S,S
889,C,C


In [71]:
df['embarked_mode'].isnull().sum()

0

In [65]:
### We successfully handled the missing values of the 'embarked' column

### These are the only techniques covered in class so far.

# Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

#### Imbalanced data refers to a situation where the classes or categories in a dataset are not equally represented. This means that one or more classes have significantly fewer samples than the others. Imbalanced data is common in many real-world applications such as fraud detection, medical diagnosis, and rare event prediction.

#### The problem with imbalanced data is that many machine learning algorithms implicitly assume roughly balanced classes, because they are trained to minimize the overall error. On imbalanced data they tend to be biased towards the majority class, which can lead to poor performance on the minority class. For example, if a dataset contains 95% samples of Class A and only 5% samples of Class B, a classifier trained on this dataset is likely to predict most new examples as Class A, regardless of their actual class, and it will still reach about 95% accuracy.

#### If imbalanced data is not handled, it can lead to several problems, including:
1. Poor performance: The performance of a classifier trained on imbalanced data is likely to be poor, particularly on the minority class. This can lead to false negatives and false positives, which can have serious consequences in some applications.

2. Biased models: Imbalanced data can lead to biased models that are not representative of the true distribution of the data. This can result in poor generalization to new examples and can make the model less reliable.

3. Overfitting: With only a few minority examples available, the model can memorize them instead of learning a general pattern, or it can fit the majority class almost exclusively. Either way, performance on new minority examples suffers.

#### To handle imbalanced data, several techniques can be used, including:
1. Resampling: This involves either oversampling the minority class or undersampling the majority class to create a balanced dataset.

2. Cost-sensitive learning: This involves assigning different misclassification costs to different classes to reflect the imbalance in the data.

3. Algorithmic modifications: This involves modifying the machine learning algorithm to handle imbalanced data directly, such as changing the threshold of a decision rule or using specialized classifiers designed for imbalanced data.
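As a rough illustration of cost-sensitive learning, the sketch below computes class weights that are inversely proportional to class frequency, using the common "balanced" heuristic `n_samples / (n_classes * class_count)`. The 95/5 split mirrors the example above; the labels `0` and `1` and the always-majority predictor are assumptions made for the illustration.

```python
import numpy as np

# 95 samples of class 0 (majority) and 5 samples of class 1 (minority)
y = np.array([0] * 95 + [1] * 5)

classes, counts = np.unique(y, return_counts=True)
# "Balanced" heuristic: rarer classes receive proportionally larger weights
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# A trivial model that always predicts the majority class
y_pred = np.zeros_like(y)
plain_accuracy = (y_pred == y).mean()

# Weight every sample by its class weight (labels 0/1 index into `weights`)
sample_weight = weights[y]
weighted_accuracy = np.average(y_pred == y, weights=sample_weight)
print(plain_accuracy, weighted_accuracy)
```

With these weights the always-majority model no longer looks good: its plain accuracy is high, but its weighted accuracy drops to chance level, because each minority mistake now counts as much as many majority successes.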

![image.png](attachment:72c0acc4-22d4-4062-9cee-9a006be41623.png)

# Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

## Upsampling and downsampling are two common techniques used to handle imbalanced data in machine learning.

### 1. Downsampling:
##### Downsampling involves reducing the number of samples in the majority class to match the number of samples in the minority class. This can be done randomly or using more sophisticated techniques, such as clustering or instance selection. Downsampling is useful when the majority class has a large number of samples that can be safely removed without losing important information.
##### For example, consider a dataset with 1000 samples of Class A and 100 samples of Class B. If we downsample Class A to 100 samples, we can create a balanced dataset with 100 samples of each class.

### 2. Upsampling:
##### Upsampling involves increasing the number of samples in the minority class to match the number of samples in the majority class. This can be done by replicating existing samples in the minority class, or by generating new synthetic samples using techniques such as SMOTE (Synthetic Minority Over-sampling Technique). Upsampling is useful when the dataset is small, so that discarding majority class samples would lose important information.
##### For example, consider a dataset with 1000 samples of Class A and 100 samples of Class B. If we upsample Class B to 1000 samples using SMOTE, we can create a balanced dataset with 1000 samples of each class.

#### Whether to use upsampling or downsampling depends on the specific dataset and problem at hand. In general, upsampling is preferred when the minority class is important and has important features that need to be preserved, while downsampling is preferred when the majority class is too large to process efficiently or contains a significant amount of irrelevant data.

#### In summary, upsampling and downsampling are two techniques used to handle imbalanced data in machine learning. Upsampling involves increasing the number of samples in the minority class, while downsampling involves reducing the number of samples in the majority class. The choice of which technique to use depends on the specific dataset and problem at hand.
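The 1000-versus-100 example above can be sketched with pandas. The DataFrame `df_imb` and its `feature` and `label` columns are made up for the illustration, and random resampling is only one of several possible strategies.

```python
import pandas as pd

# Hypothetical dataset: 1000 samples of Class A and 100 samples of Class B
df_imb = pd.DataFrame({
    "feature": range(1100),
    "label": ["A"] * 1000 + ["B"] * 100,
})
majority = df_imb[df_imb["label"] == "A"]
minority = df_imb[df_imb["label"] == "B"]

# Downsampling: keep a random subset of the majority class (without replacement)
majority_down = majority.sample(n=len(minority), random_state=42)
down = pd.concat([majority_down, minority])

# Upsampling: draw minority rows with replacement until the classes match
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
up = pd.concat([majority, minority_up])

print(down["label"].value_counts().to_dict())
print(up["label"].value_counts().to_dict())
```

Resampling like this should normally be applied to the training split only, so that the test data keeps the original class distribution.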

![image.png](attachment:3a517720-0db7-4c61-b1a3-3f7938f6fdf2.png)

# Q5: What is data Augmentation? Explain SMOTE.

#### Data augmentation is a technique that artificially expands the size of a training set by creating modified copies of the existing data. It is good practice to use data augmentation if you want to reduce overfitting, if the initial dataset is too small to train on, or if you want to squeeze better performance from your model.

#### Data augmentation is not only used to reduce overfitting. In general, having a large dataset is crucial for the performance of both machine learning (ML) and deep learning (DL) models. When collecting more data is not possible, augmenting the data we already have can serve as a substitute and improve the model’s performance.

#### One popular data augmentation technique is SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is specifically designed to handle imbalanced datasets where the minority class has very few samples. SMOTE generates synthetic examples of the minority class by interpolating between pairs of minority class examples.

#### The basic idea of SMOTE is to pick a minority class example at random and find its k nearest neighbors within the minority class, where k is a user-defined parameter. SMOTE then chooses one of those neighbors at random and creates a synthetic example at a random point on the line segment connecting the original example and the chosen neighbor. This point is added to the dataset as a new minority class example.

#### This process is repeated until the desired number of synthetic examples has been generated. The result is a larger and more diverse dataset that includes synthetic examples of the minority class.

#### SMOTE can be very effective in improving the performance of machine learning models on imbalanced datasets. By creating synthetic examples of the minority class, SMOTE can help to address the problem of class imbalance and ensure that the model is better able to generalize to new examples.

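#### However, SMOTE can also introduce noise, particularly when the classes overlap or the minority class contains outliers, because synthetic points may then fall inside regions that belong to the majority class. The choice of k also affects how far the synthetic points spread. Therefore, it is important to select the parameters of SMOTE carefully, to apply it only to the training data, and to evaluate its effectiveness using appropriate validation techniques.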
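The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the reference implementation: the function name `smote_sketch`, its parameters, and the random test data are assumptions, and it handles numerical features only.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points from minority samples X_min (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per point

    # Pick a random base point and one of its k neighbors for each new sample
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate: x_new = x_base + gap * (x_neighbor - x_base), gap in [0, 1)
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Hypothetical minority class: 20 points with 2 features
X_minority = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_sketch(X_minority, n_new=30, k=3)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new data never leaves the region already spanned by the minority class. For real projects, a maintained implementation such as the one in the imbalanced-learn package is usually a better choice than a hand-written version.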

![image.png](attachment:c4a1660a-4abc-4a9a-86be-9fd0644084ac.png)

# Q6: What are outliers in a dataset? Why is it essential to handle outliers?

#### Outliers are data points that differ significantly from the other observations. They lie far outside the overall distribution of the dataset. Outliers, if not treated, can cause serious problems in statistical analyses.

### Types of Outliers

##### Outliers are generally classified into two types: Univariate and Multivariate.

##### 1. Univariate Outliers – These outliers are found in the distribution of values of a single feature.

##### 2. Multivariate Outliers – These outliers are found in the distribution of values in an n-dimensional space (n features). A point can look normal in each feature separately and still be unusual as a combination.

#### It is essential to handle outliers because they can cause a number of problems, including:

##### 1. Skewed data distribution: Outliers can distort the data distribution, making it difficult to accurately interpret the data and identify patterns.

##### 2. Misleading statistical measures: Outliers can significantly affect statistical measures such as mean and standard deviation, leading to inaccurate or misleading results.

##### 3. Biased machine learning models: Outliers can have a disproportionate influence on the model training process, leading to biased models that perform poorly on new data.

##### 4. Reduced model performance: A model may fit the outliers instead of the general pattern, which reduces its performance and accuracy on new data.
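The effect on the mean, and one common way to flag univariate outliers, can be sketched as follows. The weight values are made up for the illustration, and the 1.5 × IQR rule is a widely used convention rather than the only possible threshold.

```python
import numpy as np

# Hypothetical weights with one extreme value
weights = np.array([50, 60, 56, 67, 45, 58, 62, 300], dtype=float)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(weights, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (weights < lower) | (weights > upper)
print("Outliers:", weights[is_outlier])

# Compare the mean and median with and without the flagged points
clean = weights[~is_outlier]
print("mean   with / without outliers:", weights.mean(), clean.mean())
print("median with / without outliers:", np.median(weights), np.median(clean))
```

The mean shifts substantially when the extreme value is included, while the median barely moves, which is why the median is often preferred when outliers are present.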

![image.png](attachment:aea5054c-3fef-45dc-8239-e976224b1986.png)

# Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

### There are several techniques that can be used to handle missing data in customer data analysis:

1. Deletion: One simple approach is to delete any rows or columns with missing data. However, this can lead to loss of important information, reduces the size of the dataset, and can bias the results unless the data is missing completely at random.

2. Imputation: Imputation involves replacing missing data with estimated values based on the available data. This can be done using techniques such as mean imputation, median imputation, mode imputation, and iterative imputation.

3. Regression: Regression analysis can be used to predict missing values based on the available data. This approach can be particularly effective if there is a strong correlation between the missing variable and other variables in the dataset.

4. Multiple imputation: Multiple imputation involves creating several imputed datasets, running the analysis on each one, and pooling the results. Unlike a single imputation, this reflects the uncertainty about the missing values in the final estimates. This approach can be particularly effective if there is a significant amount of missing data in the dataset.

5. Machine learning: Machine learning algorithms can be used to predict missing values based on the available data. This approach can be particularly effective if the dataset contains complex relationships between variables.
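Regression imputation (item 3) can be sketched with a simple linear fit. The `customers` table, its `age` and `spend` columns, and the assumption of a linear relationship are all made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with two missing 'spend' values
customers = pd.DataFrame({
    "age":   [25, 32, 41, 50, 58, 36, 45],
    "spend": [200.0, 270.0, np.nan, 450.0, 530.0, np.nan, 400.0],
})

# Fit spend ~ age on the rows where spend is known
known = customers.dropna(subset=["spend"])
slope, intercept = np.polyfit(known["age"], known["spend"], deg=1)

# Predict spend for every row, then fill only the missing entries
predicted = intercept + slope * customers["age"]
customers["spend_imputed"] = customers["spend"].fillna(predicted)
print(customers)
```

Observed values are left untouched; only the missing entries receive the model's prediction. Note that imputing exactly on the regression line understates the natural variability of the data, which is one motivation for multiple imputation.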

# Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

### When dealing with missing data, there are several strategies to determine if the missing data is missing at random or if there is a pattern to the missing data. Here are some of the most commonly used methods:

1. Analyze missingness patterns: You can start by examining the missingness patterns in the data. Plotting the share of missing values by variable or by record can help identify patterns of missingness. If missing values appear scattered with no visible structure, that is consistent with data missing completely at random (MCAR). However, if missing values cluster in particular records, time periods, or groups, this suggests that the missingness is systematic.

2. Correlation analysis: You can create a missingness indicator for a variable (1 if missing, 0 otherwise) and examine its relationship with the other variables in the dataset. If the indicator is not related to any other variable, that is consistent with MCAR. If it is related to observed variables, the data is not MCAR; it may be missing at random (MAR) given those variables, or missing not at random (MNAR).

3. Imputation and sensitivity analysis: Impute the missing values using various techniques and compare the results. If the conclusions are stable across imputation techniques, they are less sensitive to the missing data mechanism. If the results vary significantly depending on the technique used, the missingness mechanism matters and deserves closer investigation.

4. Expert knowledge: Sometimes expert knowledge can help determine if the missing data is missing at random or not. For example, if you are studying the impact of a new medication, and patients who experience side effects are more likely to drop out of the study, then the missing data is likely not missing at random.

5. Statistical tests: You can use statistical tests such as Little's MCAR test, which checks whether the data is consistent with the missing completely at random assumption. A significant result indicates that the pattern of missing data is unlikely to be explained by chance alone, so there is probably a systematic reason for the missing data.

#### Overall, it's important to remember that determining the pattern of missing data usually requires a combination of these methods and some judgment. In particular, the observed data alone cannot distinguish MAR from MNAR, because that distinction depends on the values that are missing.
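The first two checks can be sketched with pandas on synthetic data. The `age` and `visits` columns, the sample size, and the way missingness is injected (more often for older customers) are assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000
data = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "visits": rng.poisson(3, size=n).astype(float),
})

# Inject missingness in 'visits' that depends on the observed 'age'
p_missing = np.where(data["age"] > 60, 0.5, 0.05)
data.loc[rng.random(n) < p_missing, "visits"] = np.nan

# 1. Share of missing values per column
missing_rate = data.isnull().mean()
print(missing_rate)

# 2. Relationship between the missingness indicator and an observed variable
indicator = data["visits"].isnull().astype(int)
corr_with_age = indicator.corr(data["age"])
age_by_missing = data.groupby(indicator)["age"].mean()
print("Correlation of missingness with age:", round(corr_with_age, 3))
print(age_by_missing)
```

A clear correlation, or a clear gap in mean age between rows with and without `visits`, indicates that the data is not MCAR. It does not tell us whether the mechanism is MAR or MNAR.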

# Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

### Dealing with imbalanced datasets is a common problem in machine learning, especially in medical diagnosis projects. Here are some strategies you can use to evaluate the performance of your machine learning model on an imbalanced dataset:

1. Confusion matrix: A confusion matrix is a table that summarizes the performance of a classification model. It shows the counts of true positives, false positives, true negatives, and false negatives. On an imbalanced dataset, accuracy can be misleading, because a model that always predicts the majority class still scores highly. Instead, look at precision, recall, and F1-score for the minority class, along with the area under the receiver operating characteristic (ROC) curve or, when positives are very rare, the area under the precision-recall curve. These metrics focus on how well the minority class is detected and give a more informative picture than accuracy.

2. Resampling techniques: Resampling techniques can be used to balance the training data. You can either oversample the minority class or undersample the majority class. Oversampling involves adding copies of the minority class to the dataset, while undersampling involves removing examples from the majority class. However, both techniques have some drawbacks. Oversampling can lead to overfitting, while undersampling can lead to a loss of information. One common resampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic examples of the minority class. The model should still be evaluated on data that keeps the original class distribution.

3. Ensemble methods: Ensemble methods combine multiple models to improve their performance. One common ensemble method is the bagging method, which involves training multiple models on different subsets of the dataset and averaging their predictions. Another common ensemble method is the boosting method, which involves training multiple models sequentially, with each subsequent model focusing on the errors of the previous model. Ensemble methods can help improve the performance of the model on imbalanced datasets.

4. Cost-sensitive learning: Cost-sensitive learning involves assigning different costs to different types of errors. In the case of an imbalanced dataset, misclassifying a minority class example as a majority class example may be more costly than the opposite. By assigning different costs to different types of errors, the model can be trained to minimize the overall cost of errors rather than just the number of errors.

5. Domain knowledge: Finally, domain knowledge can be used to improve the model's performance on an imbalanced dataset. For example, if the dataset contains demographic information, you can use this information to stratify the dataset and ensure that both classes are represented equally in each stratum.

### Overall, it's important to remember that there is no single best strategy for dealing with imbalanced datasets, and the best approach may depend on the specific dataset and problem at hand. It's often a combination of these techniques that leads to the best results.
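The gap between accuracy and the minority-class metrics can be shown with a small hand-computed example. The labels and predictions below are made up: 10 patients with the condition, 90 without, and a hypothetical model that finds 6 of the 10 and raises 3 false alarms.

```python
import numpy as np

# 1 = has the condition (minority), 0 = does not (majority)
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 6 + [0] * 4 + [1] * 3 + [0] * 87)

# Confusion matrix counts
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
tn = int(((y_true == 0) & (y_pred == 0)).sum())

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)              # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

# Baseline that always predicts "no condition"
baseline_accuracy = (y_true == 0).mean()

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.3f} (baseline {baseline_accuracy:.3f})")
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} balanced_accuracy={balanced_accuracy:.3f}")
```

Here the accuracy is only slightly above the trivial baseline, while recall shows that 4 of the 10 patients with the condition were missed, which is the number that matters most in a diagnosis setting.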

# Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

### There are several methods that can be employed to balance an unbalanced dataset and down-sample the majority class. Here are a few possible approaches:

1. Random under-sampling: This involves randomly removing instances from the majority class until the dataset is balanced. One potential drawback of this approach is that it may result in the loss of important information, particularly if the majority class contains important or rare examples that should be preserved.

2. Cluster-based under-sampling: This method involves clustering the majority class instances and then selecting representative instances from each cluster. This can help to preserve important information in the majority class, while also reducing the imbalance.

3. Tomek Links: This method is an under-sampling technique that identifies pairs of instances from different classes that are close to each other, and removes the majority class instance from each pair. By doing this, the Tomek Links method creates a clearer separation between the two classes.

4. Edited Nearest Neighbors (ENN): This method is also an under-sampling technique that removes noisy or mislabeled instances by checking the class of each instance's nearest neighbors. If an instance's nearest neighbors are mostly from a different class, then the instance is removed. ENN can be applied after other under-sampling or over-sampling techniques to further improve the balance of the dataset.

5. Ensemble-based methods: These methods involve training multiple models on different subsets of the data, and then combining the results to produce a final prediction. This can be particularly useful in cases where the dataset is highly imbalanced and standard methods may not be effective.

### It is important to note that there is no one "best" method for balancing an unbalanced dataset, and the choice of method will depend on the specific characteristics of the dataset and the goals of the analysis. It is also important to evaluate the performance of the chosen method on a validation set to ensure that it does not introduce biases or negatively impact the accuracy of the model.
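As an illustration of the Tomek Links idea, the sketch below finds pairs of points from different classes that are each other's nearest neighbor and removes the majority member of each pair. The function name `tomek_majority_indices` and the toy data are assumptions; the brute-force distance matrix is only practical for small datasets.

```python
import numpy as np

def tomek_majority_indices(X, y, majority_label):
    """Return indices of majority samples that are part of a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Nearest neighbor of every point (brute force)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    idx = np.arange(len(X))
    mutual = nearest[nearest] == idx      # the two points are each other's nearest neighbor
    different = y != y[nearest]           # and they belong to different classes
    in_link = mutual & different
    return idx[in_link & (y == majority_label)]

# Toy data: 0 = satisfied (majority), 1 = not satisfied (minority)
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.1, 0.0], [5.0, 0.0], [5.2, 0.0], [9.0, 0.0]])
y = np.array([0, 0, 0, 0, 1, 1])

to_drop = tomek_majority_indices(X, y, majority_label=0)
X_res = np.delete(X, to_drop, axis=0)
y_res = np.delete(y, to_drop)
print("Removed majority indices:", to_drop)
print("Class counts after cleaning:", np.bincount(y_res))
```

Tomek Links usually removes only a few borderline points, so it cleans the class boundary rather than fully balancing the dataset. It is often combined with random under-sampling or over-sampling.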

# Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

### If you have an unbalanced dataset with a low percentage of occurrences of a rare event, you can employ various techniques to balance the dataset and up-sample the minority class. Here are a few possible approaches:
1. Random over-sampling: This involves randomly duplicating instances from the minority class until the dataset is balanced. One potential drawback of this approach is that the model may overfit the duplicated instances and generalize poorly to new data.

2. Synthetic minority over-sampling technique (SMOTE): This method involves creating synthetic instances of the minority class by interpolating between existing instances. SMOTE generates a new instance by taking the difference between the feature vector of a minority class instance and that of one of its k nearest minority neighbors, multiplying this difference by a random number between 0 and 1, and adding the result to the original instance. This can help to balance the dataset while also preserving the overall distribution of the minority class.