In [None]:
# Q1: Missing values in a dataset refer to the absence of data for one or more variables in certain observations. There are several reasons for missing values, such as data entry errors, equipment malfunctions, or participants choosing not to provide certain information. It is essential to handle missing values because they can lead to biased or inefficient analyses and may affect the performance of machine learning algorithms.

# Some algorithms can handle missing values natively, depending on the implementation:
# 1. Decision Trees: Some implementations handle missing values during tree construction, for example CART-style surrogate splits (alternative splits that mimic the primary split for cases with missing values) or treating missingness as just another category.
# 2. Random Forests: As ensembles of decision trees, random forests inherit whatever missing-value handling the underlying tree implementation provides; many implementations still require imputation first.
# 3. Gradient Boosting Machines (GBMs): Implementations such as XGBoost and LightGBM learn, at each split, which branch rows with missing values should follow.

# Q2: Techniques to handle missing data:
# 1. Deletion: Delete rows or columns with missing values.
# ```python
# # Deleting rows with missing values
# df.dropna(axis=0, inplace=True)

# # Deleting columns with missing values
# df.dropna(axis=1, inplace=True)
# ```
# 2. Mean/Mode/Median Imputation: Replace missing values with the mean, mode, or median of the non-missing values in the same column.
# ```python
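# # Mean imputation (assign the result back; calling fillna with inplace=True on a selected column is deprecated in recent pandas)
# df['column'] = df['column'].fillna(df['column'].mean())

# # Mode imputation (mode() can return several values; take the first)
# df['column'] = df['column'].fillna(df['column'].mode()[0])

# # Median imputation
# df['column'] = df['column'].fillna(df['column'].median())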
# ```
# 3. Forward/Backward Fill: Fill missing values with the previous(forward fill) or next(backward fill) valid value in the column.
# ```python
# # Forward fill (fillna(method='ffill') is deprecated in recent pandas; use ffill())
# df['column'] = df['column'].ffill()

# # Backward fill
# df['column'] = df['column'].bfill()
# ```
# 4. Interpolation: Fill missing values using interpolation methods like linear interpolation, polynomial interpolation, etc.
# ```python
# # Linear interpolation
# df['column'] = df['column'].interpolate(method='linear')

# # Polynomial interpolation (requires SciPy to be installed)
# df['column'] = df['column'].interpolate(method='polynomial', order=2)
# ```
# 5. Model-based Imputation: Predict missing values using a regression or classification model trained on the non-missing values.
# ```python
# from sklearn.linear_model import LinearRegression

# # Create a mask of missing values
# mask = df['column'].isnull()

# # Split data into two parts: rows with non-missing values and rows with missing values
# train = df[~mask]
# test = df[mask]

# # Train a linear regression model (assumes feature1 and feature2 contain no missing values)
# model = LinearRegression()
# model.fit(train[['feature1', 'feature2']], train['column'])

# # Predict missing values
# df.loc[mask, 'column'] = model.predict(test[['feature1', 'feature2']])
# ```

# Q3: Imbalanced data refers to a situation where the distribution of classes in a classification problem is uneven, with one class having significantly more instances than the others. If imbalanced data is not handled properly, it can lead to biased models that perform poorly on the minority class. The model may favor the majority class, so overall accuracy can look high while recall and F1-score for the minority class stay low.

# Q4: Upsampling and downsampling are techniques used to address imbalanced data.
# - Upsampling involves increasing the number of instances in the minority class to match the number of instances in the majority class. This can be done by duplicating existing instances or generating synthetic samples.
# - Downsampling involves decreasing the number of instances in the majority class to match the number of instances in the minority class. This can be done by randomly selecting a subset of instances from the majority class.

# For example, let's consider a binary classification problem with classes A and B. Class A has 100 instances, while class B has only 20 instances. To upsample, we can duplicate instances from class B to have 100 instances. To downsample, we can randomly select 20 instances from class A to match the number of instances in class B.
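
# The class A / class B example above can be sketched with pandas as follows. The DataFrame, its column names (`feature`, `label`), and the `random_state` value are made up for illustration.

```python
import pandas as pd

# Toy frame matching the example: 100 rows of class A, 20 rows of class B
df = pd.DataFrame({
    "feature": range(120),
    "label": ["A"] * 100 + ["B"] * 20,
})
majority = df[df["label"] == "A"]
minority = df[df["label"] == "B"]

# Upsample: draw from class B with replacement until it matches class A
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Downsample: draw a random subset of class A without replacement
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

# Upsampling keeps every original row but repeats minority rows, while downsampling discards 80 of the 100 majority rows.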

# Q5: Data augmentation is a technique used to artificially increase the size of a dataset by creating slightly modified copies of the existing data. It can help improve the generalization and performance of machine learning models. SMOTE (Synthetic Minority Over-sampling Technique) is a popular method of this kind for addressing imbalanced datasets; it augments only the minority class.

# SMOTE works by generating synthetic samples for the minority class. It selects a sample from the minority class, finds its k nearest neighbors, and creates new samples along the line segments joining the sample and its neighbors. This helps to create a more balanced dataset.
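
# The interpolation step described above can be sketched in plain NumPy. This is a simplified illustration, not the reference SMOTE implementation: the function name `smote_sketch`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))            # pick a minority sample
        nb = rng.choice(neighbors[base])           # pick one of its k neighbors
        gap = rng.random()                         # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy minority-class points (made up for illustration)
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5]])
new_points = smote_sketch(X_min, n_new=10)
print(new_points.shape)
```

# Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies.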

# Q6: Outliers in a dataset are observations that significantly deviate from the other data points. They can be caused by various factors like data entry errors, measurement noise, or genuine rare events. It is essential to handle outliers because they can distort statistical analyses and modeling results. Outliers can disproportionately influence the estimated parameters and affect the performance and generalizability of machine learning models.
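
# To make the point about distorted estimates concrete, the small sketch below compares the mean and median of a made-up sample before and after adding a single extreme value. The numbers are invented for illustration.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12])
with_outlier = np.append(values, 100)   # add one extreme observation

# The mean moves a lot; the median barely reacts
mean_before, mean_after = values.mean(), with_outlier.mean()
median_before, median_after = np.median(values), np.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

# A single outlier more than doubles the mean here while leaving the median unchanged, which is why parameters estimated from means or squared errors are sensitive to outliers.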

# Q7: Techniques to handle missing data in customer analysis:
# 1. Complete Case Analysis: Exclude observations with missing values from the analysis. This is suitable when the missing data is minimal, and removing the incomplete cases does not introduce significant bias.
# 2. Mean/Mode/Median Imputation: Replace missing values with the mean, mode, or median of the non-missing values in the same variable. This is applicable when the missingness is completely random and not associated with the customer characteristics; note that it shrinks the variable's variance.
# 3. Model-based Imputation: Use regression or classification models to predict missing values based on other variables' values. This is suitable when the missingness is related to other observed customer characteristics.

# Q8: Strategies to determine if missing data is missing at random or has a pattern:
# 1. Missing Data Visualization: Plot missingness patterns to visualize if there are any systematic patterns in missing values across variables or observations.
# 2. Missing Data Mechanism Tests: Conduct statistical tests such as Little's MCAR (Missing Completely at Random) test to check whether the data are consistent with MCAR. MNAR (Missing Not at Random) generally cannot be confirmed from the observed data alone and usually requires domain knowledge or sensitivity analysis.
# 3. Correlation Analysis: Examine the correlations between missing values and other variables to identify potential patterns or relationships.
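
# The correlation check can be sketched with pandas by turning missingness into an indicator column. The data below is simulated: the column names (`age`, `income`) and the mechanism that makes older customers skip the income field are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 80, size=n).astype(float)
income = rng.normal(50_000, 10_000, size=n)

# Hypothetical mechanism: older customers are more likely to skip the income field
missing_prob = (age - 18) / (80 - 18) * 0.6
income[rng.random(n) < missing_prob] = np.nan

df = pd.DataFrame({"age": age, "income": income})

# Indicator column: 1 where income is missing, 0 otherwise
df["income_missing"] = df["income"].isna().astype(int)

# Correlation between the missingness indicator and an observed variable
corr = df["income_missing"].corr(df["age"])
# Mean age split by whether income is missing
age_by_missing = df.groupby("income_missing")["age"].mean()

print(f"correlation with age: {corr:.2f}")
print(age_by_missing)
```

# A clearly non-zero correlation, or a visible gap between the two group means, suggests the values are not missing completely at random.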

# Q9: Strategies to evaluate performance on an imbalanced dataset:
# 1. Confusion Matrix and Class Metrics: Evaluate the confusion matrix, including metrics such as precision, recall (sensitivity), specificity, and F1-score, to get a comprehensive understanding of model performance for both the majority and minority classes. Treat accuracy with caution, since it can be high even when the minority class is rarely predicted correctly.
# 2. ROC Curve and AUC: Plot the Receiver Operating Characteristic (ROC) curve and calculate the Area Under the Curve (AUC) to assess the model's ability to distinguish between the classes.
# 3. Precision-Recall Curve: Plot the Precision-Recall curve and calculate metrics like the Average Precision (AP) score, which focus on the positive (usually minority) class and are therefore informative for imbalanced data.
# 4. Resampling with Held-Out Evaluation: If you oversample the minority class, undersample the majority class, or combine both, apply the resampling to the training split only and compute the metrics above on an untouched test set that keeps the original class distribution.
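
# The metrics listed above can be computed with scikit-learn as in the sketch below. The labels and scores are a made-up ten-sample example with two positives, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, average_precision_score, confusion_matrix,
    f1_score, precision_score, recall_score, roc_auc_score,
)

# Toy imbalanced problem: 8 negatives, 2 positives (made up for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.5])
y_pred = (y_score > 0.5).astype(int)   # hard predictions at a 0.5 threshold

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
specificity = tn / (tn + fp)
f1 = f1_score(y_true, y_pred)

# Threshold-free metrics use the scores rather than the hard predictions
roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

print(cm)
print(accuracy, precision, recall, specificity, f1, roc_auc, avg_precision)
```

# In this toy case accuracy is 0.8 while precision and recall for the positive class are only 0.5, which shows why accuracy alone is a weak summary on imbalanced data.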

# Q10: Methods to balance an unbalanced dataset and down-sample the majority class when estimating customer satisfaction:
# 1. Random Under-Sampling: Randomly select a subset of instances from the majority class to reduce its representation in the dataset.
# 2. Cluster-Based Under-Sampling: Use clustering algorithms to identify clusters within the majority class and keep only a representative subset of instances from each cluster.
# 3. Tomek Links: Identify pairs of instances from different classes that are each other's nearest neighbor and remove the majority class instance from each pair. This helps reduce overlap between the classes.
# 4. NearMiss Algorithm: Select instances from the majority class based on their distance to the minority class instances, such as selecting the instances with the smallest average distance.
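
# As a rough sketch of the Tomek link idea, the NumPy code below finds pairs of opposite-class points that are each other's nearest neighbor and drops the majority member. The function name `tomek_link_majority_indices` and the one-dimensional toy data are hypothetical.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that belong to a Tomek link."""
    # Pairwise Euclidean distances; a point is never its own neighbor
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)   # nearest neighbor of each point

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i     # i and j are each other's nearest neighbor
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Toy data (made up): class 0 is the majority, class 1 the minority
X = np.array([[0.0], [1.2], [2.0], [2.9], [3.0], [6.0]])
y = np.array([0, 0, 0, 0, 1, 1])

remove_idx = tomek_link_majority_indices(X, y, majority_label=0)
X_clean = np.delete(X, remove_idx, axis=0)
y_clean = np.delete(y, remove_idx)
print(remove_idx, len(X_clean))
```

# Only the majority point sitting right next to a minority point is removed, so this method cleans the class boundary rather than balancing the class counts on its own.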

# Q11: Methods to balance an unbalanced dataset and up-sample the minority class when estimating the occurrence of a rare event:
# 1. Random Over-Sampling: Randomly duplicate instances from the minority class to increase its representation in the dataset.
# 2. SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic samples for the minority class by creating new instances along the line segments joining each minority sample to its nearest minority class neighbors.
# 3. ADASYN (Adaptive Synthetic Sampling): Similar to SMOTE, but it generates more synthetic samples around the minority class instances that are harder to learn, that is, those with more majority class neighbors.
# 4. SMOTE-ENN: Apply SMOTE to up-sample the minority class and then use the Edited Nearest Neighbors (ENN) algorithm to remove noisy samples from both classes.
