In [None]:
'''
Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.
'''


Missing values in a dataset refer to the absence of a particular value or data point in one or more
columns of a row. There are various reasons why data may be missing, such as data corruption, human error,
or system failure.

It is essential to handle missing values because they can adversely impact the accuracy and reliability
of any analysis or machine learning model developed using the dataset. If missing values are ignored, 
they can lead to biased or incorrect conclusions, which may result in poor business decisions or ineffective models.

Few algorithms are truly unaffected by missing values, but some can handle them natively. Tree-based
methods are the main example: CART-style decision trees can use surrogate splits to route rows with a
missing feature, and gradient boosting implementations such as XGBoost and LightGBM learn a default
branch for missing values at each split. Whether a random forest accepts missing values depends on the
implementation. K-means and other distance-based algorithms, by contrast, cannot compute distances
with missing entries, so the gaps must be imputed or removed first.

In [2]:
'''
Q2: List down techniques used to handle missing data. Give an example of each with python code.
'''

# 1. Deleting rows or columns: If only a few values are missing, we can delete the corresponding rows or columns. Here's an example:

import numpy as np 

import pandas as pd

# Create a sample dataframe with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Delete rows with missing values
df_drop_row = df.dropna(axis=0)
print(df_drop_row)

# Delete columns with missing values
df_drop_col = df.dropna(axis=1)
print(df_drop_col)



     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12
    C
0   9
1  10
2  11
3  12


In [3]:
# 2. Forward or backward fill: We can propagate the last known value forward, or the next known value backward. Here's an example:

import numpy as np
import pandas as pd

# Create a sample dataframe with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

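# Forward fill: carry the last valid value down each column
# (fillna(method='ffill') is deprecated in recent pandas versions)
df_ffill = df.ffill()
print(df_ffill)

# Backward fill: carry the next valid value up each column
df_bfill = df.bfill()
print(df_bfill)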


     A    B   C
0  1.0  5.0   9
1  2.0  5.0  10
2  2.0  5.0  11
3  4.0  8.0  12
     A    B   C
0  1.0  5.0   9
1  2.0  8.0  10
2  4.0  8.0  11
3  4.0  8.0  12
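
A third common technique is imputation with a column statistic. The following is a minimal sketch that reuses the same sample dataframe; filling with the mean or the median is only one illustrative choice, and which statistic suits a column depends on its distribution.

```python
import numpy as np
import pandas as pd

# Same sample dataframe with missing values as above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Replace each missing value with the mean of its column
df_mean = df.fillna(df.mean())
print(df_mean)

# Replace each missing value with the median of its column
df_median = df.fillna(df.median())
print(df_median)
```

The median is usually the safer choice when a column contains extreme values, since the mean is pulled toward them.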


In [None]:
'''
Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?
'''


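Imbalanced data refers to a classification dataset in which one class has significantly more
instances than the other class(es).
For example, in a binary classification problem, if one class has 90% of the instances and
the other class has only 10%, the data is said to be imbalanced.

If the imbalance is not handled, the model tends to favor the majority class: it can reach high
accuracy simply by predicting the majority class for almost every input while misclassifying most
minority instances. Handling imbalanced data is therefore essential to prevent biased predictions
and to ensure that the model is effective at predicting all classes in the classification problem.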


In [None]:
'''
Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and downsampling are required.
'''

Up-sampling is the process of randomly replicating instances from the minority class to increase
its representation in the dataset. This technique can be useful when the dataset has a significant
class imbalance and the minority class has few instances.

Down-sampling, on the other hand, is the process of randomly removing instances from the majority
class to decrease its representation in the dataset. This technique can be useful when the majority 
class has a significantly larger number of instances than the minority class.
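
Both techniques can be sketched with plain pandas. The toy dataset below is hypothetical (8 majority rows and 2 minority rows), and the column names `feature` and `label` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy dataset: 8 majority rows (label 0), 2 minority rows (label 1)
df = pd.DataFrame({
    'feature': range(10),
    'label': [0] * 8 + [1] * 2
})
majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
df_up = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
df_down = pd.concat([majority_down, minority])

print(df_up['label'].value_counts())
print(df_down['label'].value_counts())
```

As a rule of thumb, up-sampling fits cases where the dataset is small and discarding majority rows would leave too little to train on, while down-sampling fits cases where data is plentiful and the majority class can afford to lose rows.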

In [None]:
'''
Q5: What is data Augmentation? Explain SMOTE.
'''


Data augmentation is a technique used in machine learning to increase the amount of training data 
by creating new, artificial examples from the existing ones. It is commonly used when the size of
the training dataset is small and the model's performance is limited by the lack of data.
Data augmentation can be applied to various types of data, including images, audio, text, and tabular data.

SMOTE (Synthetic Minority Over-sampling Technique) is a popular oversampling technique used
for imbalanced datasets. It creates synthetic examples of the minority class by interpolating
along the line segments that join neighboring minority class examples. The SMOTE algorithm
selects a random example from the minority class and finds its k nearest neighbors within that
class. It then generates a new example at a random point between the selected example and one
of those neighbors, chosen at random.

For example, suppose we have a dataset with two classes: class A has 100 samples, and class B has
only 10 samples. The dataset is imbalanced, and the model's performance on class B is limited by
the lack of data. In this case, we can use SMOTE to generate 90 synthetic examples of class B,
which balances the dataset and can improve the model's performance on that class.
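
The interpolation step can be sketched in a few lines of NumPy. This is an illustrative simplification, not a replacement for a library implementation; the function name `smote_sketch`, its parameters, and the sample points are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # random minority sample
        nb = rng.choice(neighbors[base])         # one of its k nearest neighbors
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority class samples with two features
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5], [3.0, 1.0]])
new_points = smote_sketch(X_min, n_new=10)
print(new_points.shape)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.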

In [None]:
'''
Q6: What are outliers in a dataset? Why is it essential to handle outliers?
'''

Outliers are data points that differ significantly from the other data points in a dataset.
They can arise from measurement errors, data entry errors, or natural variation in the data.
Statistics such as the mean, variance, and standard deviation are sensitive to extreme values,
so even a few outliers can shift them noticeably.

It is essential to handle outliers because they can distort the results of statistical analyses
and degrade the performance of machine learning models. Outliers can cause a model to overfit
or underfit, which leads to poor generalization performance.
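
A small numeric sketch makes the effect concrete. The values below are made up, and the 1.5 * IQR rule used to flag outliers is one common convention among several, chosen here only for illustration.

```python
import numpy as np

# Hypothetical measurements; 95.0 is the outlier
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

print("mean with outlier:  ", values.mean())
print("median with outlier:", np.median(values))

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
outliers = values[mask]
cleaned = values[~mask]

print("flagged outliers:     ", outliers)
print("mean without outliers:", cleaned.mean())
```

A single extreme value roughly doubles the mean here, while the median barely moves.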

In [None]:
'''

Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

'''
1. Deletion: One option is to remove the records that contain missing data. There are two types
        of deletion: listwise deletion (also called complete case analysis), which drops any row
        with a missing value, and pairwise deletion, which keeps every row but leaves it out of
        only those calculations that involve the variable it is missing.

2. Imputation: Imputation involves estimating missing values based on the available data.
        Some common imputation methods include mean, median, mode, and regression imputation.

3. Prediction: Another option is to use machine learning algorithms to predict missing values based
        on the available data.

4. Expert judgment: If none of the above techniques are suitable, expert judgment can be used to
        estimate missing values based on the context of the data and the expertise of the analyst.
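
The difference between listwise and pairwise deletion can be sketched with pandas. The customer data and column names below are hypothetical. The sketch relies on the fact that `DataFrame.corr()` excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with scattered missing values
customers = pd.DataFrame({
    'age':    [25, 32, np.nan, 41, 38, 29],
    'income': [40, 52, 61, np.nan, 58, 45],
    'spend':  [5, 8, 9, 11, np.nan, 6]
})

# Listwise deletion: keep only rows with no missing values
complete_cases = customers.dropna()
print(len(complete_cases), "of", len(customers), "rows survive listwise deletion")

# Listwise correlation: computed on the complete cases only
listwise_corr = complete_cases.corr()

# Pairwise deletion: for each pair of columns, corr() uses every row
# in which both of those columns are present
pairwise_corr = customers.corr()

print(listwise_corr)
print(pairwise_corr)
```

Pairwise deletion keeps more data per estimate, at the cost that each correlation may be based on a different subset of rows.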


In [None]:
'''
Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?
'''


1. Visual inspection: Plotting the data can sometimes reveal patterns in the missing data.
        For example, if missing values cluster in a certain range of another variable, it suggests
        that there is a pattern to the missing data.

2. Correlation analysis: Correlation analysis can be used to determine if there is a relationship between
        missingness and other variables in the dataset. If an indicator of "value is missing" is correlated
        with certain variables, it suggests that there is a pattern to the missing data.

3. Missing data imputation: The true values behind missing entries are unknown, so imputed values cannot
        be compared with them directly. A workable variant is to hide some observed values, impute them,
        and compare the result with the hidden values. This measures imputation quality, but on its own
        it does not show whether the data is missing at random.

4. Statistical tests: Statistical tests such as Little's MCAR test can be used to check whether the
        data is missing completely at random (MCAR) or whether there is a pattern to the missing data.
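
The correlation analysis in point 2 can be sketched by building a missingness indicator and relating it to an observed variable. The data below is hypothetical and deliberately constructed so that `income` is missing only for the oldest respondents.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing only for the oldest respondents
df = pd.DataFrame({
    'age':    [22, 25, 31, 38, 45, 52, 60, 67],
    'income': [30, 35, 42, 50, 58, np.nan, np.nan, np.nan]
})

# Indicator column: 1 where income is missing, 0 otherwise
df['income_missing'] = df['income'].isna().astype(int)

# Correlation between missingness and an observed variable
corr = df['age'].corr(df['income_missing'])
print("correlation of age with income missingness:", round(corr, 3))

# Compare the observed variable across the two groups
group_means = df.groupby('income_missing')['age'].mean()
print(group_means)
```

A strong correlation or a clear gap between the group means, as in this constructed example, is evidence against the data being missing completely at random.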

In [None]:
'''
Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?
'''


Confusion Matrix: A confusion matrix can be used to visualize the performance of a model on a binary
        classification problem. It displays the number of true positive, true negative, false positive,
        and false negative predictions. From this, various performance metrics such as precision, recall,
        and F1 score can be computed.

Precision and Recall: Precision is the proportion of true positives among all predicted positives,
        while recall is the proportion of true positives among all actual positives. These metrics
        are often used in conjunction with each other to evaluate the performance of a model on imbalanced datasets.

ROC Curve and AUC: The receiver operating characteristic (ROC) curve is a graphical representation of the 
        performance of a binary classifier. It plots the true positive rate (TPR) against the false positive rate 
        (FPR) for different classification thresholds. The area under the curve (AUC) is a scalar metric that
        summarizes the overall performance of the classifier.

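Class weights: Many machine learning algorithms allow for assigning class weights to account for imbalanced datasets.
        By giving a higher weight to the minority class, the model is incentivized to better classify examples from
        that class. Strictly speaking, this is a training-time remedy rather than an evaluation method, so its
        effect should still be measured with the metrics above.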
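
The metrics above can be computed directly from the four confusion matrix counts. The labels and predictions below are hypothetical: 10 of 100 patients have the condition, and the model finds 6 of them while raising 5 false alarms.

```python
import numpy as np

# Hypothetical ground truth: 1 = has the condition (10 of 100 patients)
y_true = np.array([1] * 10 + [0] * 90)
# Hypothetical predictions: 6 true positives, 4 misses, 5 false alarms
y_pred = np.array([1] * 6 + [0] * 4 + [1] * 5 + [0] * 85)

# Confusion matrix counts
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP, FN, FP, TN:", tp, fn, fp, tn)
print("accuracy: ", accuracy)
print("precision:", precision)
print("recall:   ", recall)
print("F1 score: ", f1)
```

Accuracy looks strong here, yet a model that predicts "no condition" for everyone would already score 0.90, which is why precision and recall are the more informative numbers on imbalanced data.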

In [7]:
'''
Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?
'''

1. Random Under-Sampling: Randomly select a subset of data from the majority class to match the
        number of samples in the minority class.

2. Cluster-Based Under-Sampling: Cluster the majority class data and select samples from each
        cluster to match the number of samples in the minority class.

3. Tomek Links: Identify pairs of samples from opposite classes that are each other's nearest
        neighbors, and remove the majority class sample from each pair.

4. NearMiss Algorithm: This algorithm keeps the majority class samples that are closest to
        the minority class samples and discards the rest.
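
The Tomek links idea can be sketched with NumPy. This is an illustrative simplification using brute-force Euclidean distances; the function name `tomek_majority_indices` and the sample points are hypothetical.

```python
import numpy as np

def tomek_majority_indices(X, y, majority_label):
    """Return indices of majority samples that belong to a Tomek link (illustrative only)."""
    # Pairwise Euclidean distances between all samples
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)   # a point is not its own neighbor
    nearest = dist.argmin(axis=1)    # index of each sample's nearest neighbor

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i                 # i and j are each other's nearest neighbor
        opposite = y[i] != y[j]                  # and they come from different classes
        if mutual and opposite and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Hypothetical one-feature data: label 0 is the majority class
X = np.array([[0.0], [1.0], [2.5], [5.0], [5.2], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1])

tomek = tomek_majority_indices(X, y, majority_label=0)
print(tomek)

# Drop the flagged majority samples
keep = np.setdiff1d(np.arange(len(X)), tomek)
X_clean, y_clean = X[keep], y[keep]
```

Tomek links remove only borderline majority samples, so this method cleans the class boundary rather than fully balancing the class counts.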



In [None]:
'''
Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?
'''

1. SMOTE (Synthetic Minority Over-sampling Technique): This method creates synthetic samples for
        the minority class by interpolating between existing minority class samples.

2. ADASYN (Adaptive Synthetic Sampling): This method is similar to SMOTE but generates more
        synthetic samples around minority examples that are harder to learn, that is, those
        whose neighborhoods contain more majority class samples.

3. Random oversampling: This method involves duplicating samples from the minority class randomly
        until the classes are balanced.

