<a href="https://colab.research.google.com/github/kanchandhole/Data-Scientist/blob/main/17th_march_feature_engn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Q1:** What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

**Ans:**


---

## **Missing Values in a Dataset**

### **What are Missing Values?**

Missing values occur when **no data is recorded for a variable** in one or more observations.
They may appear as **NaN, NULL, empty cells, or special symbols** in a dataset.
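As a small illustration of how such placeholders can be normalized, the sketch below uses a made-up frame in which `?` and empty strings stand in for missing entries. The column names and placeholder symbols are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: "?" and "" are placeholder symbols for missing entries
raw = pd.DataFrame({
    "Age": ["25", "?", "31", ""],
    "City": ["Pune", "Mumbai", None, "Pune"],
})

# Convert the placeholders to NaN so pandas recognizes them as missing
clean = raw.replace({"?": np.nan, "": np.nan})
clean["Age"] = pd.to_numeric(clean["Age"])

# Count missing values per column
missing_counts = clean.isna().sum()
print(missing_counts)
```

Once every placeholder is a real NaN, `isna()` and the imputation methods shown later treat all of them uniformly.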

**Common Causes:**

* Data entry errors
* Sensor or system failure
* Survey non-responses
* Data corruption or loss

---

### **Why Is It Essential to Handle Missing Values?**

Handling missing values is important because:

1. **Model Accuracy**
   Missing values can lead to incorrect predictions or biased results.

2. **Algorithm Limitations**
   Many machine learning algorithms cannot work with missing values directly.

3. **Statistical Validity**
   Ignoring missing data can distort distributions and relationships.

4. **Data Consistency**
   Clean data ensures reliable analysis and decision-making.
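A quick way to see the problem is that a single NaN propagates through ordinary arithmetic. The sketch below uses made-up salary figures; `np.nanmean` is shown only as one NaN-aware alternative.

```python
import numpy as np

# Made-up salaries with one missing entry
salaries = np.array([50000.0, np.nan, 60000.0, 58000.0])

plain_mean = np.mean(salaries)         # NaN propagates through the sum
nan_aware_mean = np.nanmean(salaries)  # skips NaN entries

print(plain_mean)      # nan
print(nan_aware_mean)  # 56000.0
```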

---

### **Algorithms Not Affected by Missing Values**

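Strictly speaking, few algorithms are truly unaffected by missing values. Some implementations can **handle them internally**, while others need imputation first:

* **Gradient Boosting (XGBoost, LightGBM, CatBoost):** these libraries have built-in handling of missing values.
* **Decision Trees and Random Forest:** support depends on the implementation; some libraries accept NaN directly, others require imputation.
* **K-Nearest Neighbors:** works only with a missing-aware distance metric; standard implementations usually reject NaN.

> Note: Many algorithms still perform better when missing values are properly handled.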

---

### **Conclusion**

Missing values represent incomplete information in data. Proper handling improves **model performance, reliability, and interpretability**, even when using algorithms that can tolerate missing data.


**Q2:** List down techniques used to handle missing data. Give an example of each with python code.

**Ans:**  Techniques to Handle Missing Data (with Python Examples)

Let's assume we have this sample dataset:

In [1]:
import pandas as pd
import numpy as np

data = {
    'Age': [25, 30, np.nan, 28, 35],
    'Salary': [50000, np.nan, 60000, 58000, np.nan]
}

df = pd.DataFrame(data)
print(df)


    Age   Salary
0  25.0  50000.0
1  30.0      NaN
2   NaN  60000.0
3  28.0  58000.0
4  35.0      NaN


1. Removing Missing Values (Deletion Method)

a) Row-wise deletion

Used when missing values are very few.

In [2]:
df_drop_rows = df.dropna()
print(df_drop_rows)


    Age   Salary
0  25.0  50000.0
3  28.0  58000.0


b) Column-wise deletion

Used when a column has too many missing values. In this sample both columns contain at least one NaN, so `dropna(axis=1)` removes every column.

In [3]:
df_drop_cols = df.dropna(axis=1)
print(df_drop_cols)


Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]


2. Mean Imputation

Replace missing values with the mean (numerical data).

In [4]:
df_mean = df.copy()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_mean['Age'] = df_mean['Age'].fillna(df_mean['Age'].mean())
df_mean['Salary'] = df_mean['Salary'].fillna(df_mean['Salary'].mean())
print(df_mean)

    Age   Salary
0  25.0  50000.0
1  30.0  56000.0
2  29.5  60000.0
3  28.0  58000.0
4  35.0  56000.0


3. Median Imputation

Used when data has outliers.

In [5]:
df_median = df.copy()
df_median.fillna(df_median.median(), inplace=True)
print(df_median)

    Age   Salary
0  25.0  50000.0
1  30.0  58000.0
2  29.0  60000.0
3  28.0  58000.0
4  35.0  58000.0


4. Mode Imputation

Used for categorical data.

In [6]:
df_cat = pd.DataFrame({
    'City': ['Pune', 'Mumbai', np.nan, 'Pune']
})

# mode() returns a Series; [0] picks the most frequent value
df_cat['City'] = df_cat['City'].fillna(df_cat['City'].mode()[0])
print(df_cat)

     City
0    Pune
1  Mumbai
2    Pune
3    Pune


5. Forward Fill (Propagation Method)

Replace each missing value with the last observed value above it.

In [7]:
df_ffill = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas
print(df_ffill)


    Age   Salary
0  25.0  50000.0
1  30.0  50000.0
2  30.0  60000.0
3  28.0  58000.0
4  35.0  58000.0


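6. Backward Fill

Replace each missing value with the next observed value below it. A missing value in the last row stays NaN because there is no later value to copy.

In [8]:
df_bfill = df.bfill()  # fillna(method='bfill') is deprecated in recent pandas
print(df_bfill)

    Age   Salary
0  25.0  50000.0
1  30.0  60000.0
2  28.0  60000.0
3  28.0  58000.0
4  35.0      NaN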


7. Constant Value Imputation

Replace missing values with a fixed value (e.g., 0 or "Unknown").

In [9]:
df_const = df.fillna(0)
print(df_const)

    Age   Salary
0  25.0  50000.0
1  30.0      0.0
2   0.0  60000.0
3  28.0  58000.0
4  35.0      0.0


8. Predictive Imputation (Advanced, ML-Based)

Use models like regression or KNN to predict missing values.

In [10]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_knn)

    Age   Salary
0  25.0  50000.0
1  30.0  54000.0
2  26.5  60000.0
3  28.0  58000.0
4  35.0  54000.0


| Technique               | Best Used When          |
| ----------------------- | ----------------------- |
| Deletion                | Very few missing values |
| Mean                    | Normal distribution     |
| Median                  | Outliers present        |
| Mode                    | Categorical data        |
| Forward / Backward Fill | Time-series data        |
| Constant                | Placeholder needed      |
| KNN / ML                | Complex patterns        |


Conclusion

Choosing the right missing data technique depends on data type, amount of missingness, and business context.


**Q3:** Explain the imbalanced data. What will happen if imbalanced data is not handled?

**Ans:**


---

## **Imbalanced Data in Machine Learning**

### **What is Imbalanced Data?**

Imbalanced data occurs when the **classes in a target variable are not represented equally**, meaning one class (majority class) has significantly more observations than the other class(es) (minority class).

**Example:**

* Fraud detection: 98% non-fraud, 2% fraud
* Medical diagnosis: 95% healthy, 5% disease

---

### **What Happens If Imbalanced Data Is Not Handled?**

1. **Biased Model Predictions**
   The model tends to predict the **majority class**, ignoring the minority class.

2. **Misleading Accuracy**
   High accuracy may be achieved, but the model fails to detect important minority cases.

   * Example: Predicting "non-fraud" always gives 98% accuracy.

3. **Poor Recall and Precision for Minority Class**
   Critical cases (fraud, disease, churn) are missed.

4. **Business and Real-World Risks**

   * Missed fraud → financial loss
   * Missed disease → health risks

5. **Unreliable Model Performance**
   The model does not generalize well for minority outcomes.
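The misleading-accuracy point can be checked with a few lines of arithmetic. The sketch below assumes 1,000 hypothetical transactions with the 98% / 2% split from the fraud example and a "model" that always predicts the majority class.

```python
# Hypothetical labels: 1 = fraud, 0 = non-fraud (98% / 2% split)
y_true = [0] * 980 + [1] * 20
y_pred = [0] * 1000  # a "model" that always predicts the majority class

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.98
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy looks excellent while not a single fraud case is detected, which is why recall on the minority class matters here.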

---

### **Conclusion**

If imbalanced data is not handled, the model becomes **biased, unreliable, and ineffective** for decision-making. Handling imbalance is essential for fair and meaningful predictions.


**Q4:** What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

**Ans:**


---

## **Up-sampling and Down-sampling in Machine Learning**

Up-sampling and down-sampling are techniques used to **handle imbalanced datasets** by adjusting the number of samples in different classes.

---

## **1. Up-sampling**

### **Definition**

Up-sampling increases the number of samples in the **minority class** to balance it with the majority class.

This is usually done by **duplicating existing minority samples** or **creating synthetic samples**.

### **When Up-sampling Is Required**

* Minority class has **very few samples**
* Losing minority data is risky (e.g., fraud, disease detection)

### **Example**

Fraud detection dataset:

* Fraud cases: 200
* Non-fraud cases: 9,800

Up-sampling increases fraud cases to match non-fraud cases.

### **Advantages**

* Preserves all original data
* Improves detection of minority class

### **Disadvantages**

* Risk of overfitting due to duplicated data

---

## **2. Down-sampling**

### **Definition**

Down-sampling reduces the number of samples in the **majority class** to balance the dataset.

This is done by **randomly removing majority class samples**.

### **When Down-sampling Is Required**

* Dataset is very large
* Majority class dominates heavily

### **Example**

Customer churn dataset:

* Non-churn: 50,000
* Churn: 5,000

Down-sampling reduces non-churn samples to 5,000.

### **Advantages**

* Faster training
* Reduces model bias

### **Disadvantages**

* Loss of potentially useful data

---

## **Comparison Table**

| Aspect           | Up-sampling    | Down-sampling  |
| ---------------- | -------------- | -------------- |
| Targets          | Minority class | Majority class |
| Data Size        | Increases      | Decreases      |
| Data Loss        | No             | Yes            |
| Overfitting Risk | Higher         | Lower          |
| Best For         | Small datasets | Large datasets |

---

## **Conclusion**

* Use **up-sampling** when minority class data is limited and valuable.
* Use **down-sampling** when the dataset is large and majority class dominates.




**Q5:** What is data Augmentation? Explain SMOTE.

**Ans:**


---

## **Data Augmentation**

### **What is Data Augmentation?**

Data augmentation is a technique used to **artificially increase the size and diversity of a dataset** by creating new data samples from existing ones. It helps improve model performance and reduce **overfitting**, especially when data is limited or imbalanced.

**Common Examples:**

* Image data: rotation, flipping, zooming
* Text data: synonym replacement
* Tabular data: synthetic sample generation (e.g., SMOTE)
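For image data, the transformations listed above can be sketched with plain NumPy. The tiny array below stands in for real pixel data and is an assumption for the example; real pipelines typically use a dedicated augmentation library.

```python
import numpy as np

# A tiny 2x3 "image" standing in for real pixel data
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

flipped = np.fliplr(image)  # horizontal flip
rotated = np.rot90(image)   # 90-degree counter-clockwise rotation

# The original plus its transformed copies form the augmented set
augmented = [image, flipped, rotated]
print(flipped)
print(rotated)
```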

---

## **SMOTE (Synthetic Minority Over-sampling Technique)**

### **What is SMOTE?**

SMOTE is a **data augmentation technique for imbalanced datasets** that generates **synthetic samples for the minority class** instead of duplicating existing ones.

### **How SMOTE Works**

1. Select a minority class data point
2. Find its **k-nearest neighbors**
3. Create a new synthetic point **between the data point and one of its neighbors**

This results in more realistic and diverse minority samples.
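The three steps can be sketched by hand for a single synthetic point. This is a minimal illustration with made-up 2-D minority points and `k = 2`, not the `imblearn` implementation; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]])
k = 2

# Step 1: select a minority point
i = rng.integers(len(minority))
x = minority[i]

# Step 2: find its k nearest minority neighbors (index 0 is the point itself)
distances = np.linalg.norm(minority - x, axis=1)
neighbor_idx = np.argsort(distances)[1:k + 1]

# Step 3: interpolate between the point and one randomly chosen neighbor
neighbor = minority[rng.choice(neighbor_idx)]
lam = rng.random()                    # interpolation factor in [0, 1)
synthetic = x + lam * (neighbor - x)  # lies on the segment from x to neighbor
print(synthetic)
```

Repeating these steps until the classes are balanced gives the basic procedure; library implementations add details such as choosing how many points to generate per sample.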

---

### **Why SMOTE Is Better Than Simple Up-sampling**

* Avoids exact duplication
* Tends to reduce overfitting compared with exact duplicates
* Produces more generalized samples

---

### **Example Scenario**

In fraud detection:

* Fraud cases = 2%
* Non-fraud cases = 98%

SMOTE generates new fraud samples to balance the dataset.

---

### **Advantages of SMOTE**

* Improves minority class recall
* Creates diverse synthetic data
* Works well for numeric features

---

### **Limitations of SMOTE**

* Not suitable for categorical data (without modification)
* Can create overlapping classes
* Sensitive to noise

---

### **Conclusion**

Data augmentation improves model learning by increasing data diversity. **SMOTE** is a powerful technique for handling class imbalance by generating synthetic minority samples, leading to better and fairer predictions.



**Q6:** What are outliers in a dataset? Why is it essential to handle outliers?

**Ans:**


---

## **Outliers in a Dataset**

### **What Are Outliers?**

Outliers are **data points that differ significantly from other observations** in a dataset.
They can be unusually high or low values compared to the majority of the data.

**Examples:**

* Age = 150 in a human dataset
* Salary = $1,000,000 when most are $30,000–$50,000

**Causes of Outliers:**

* Data entry errors or typos
* Measurement or sensor errors
* Natural variability in the population
* Fraudulent data

---

### **Why Is It Essential to Handle Outliers?**

1. **Impact on Statistical Measures**

   * Outliers can **distort mean, standard deviation, and correlations**.
   * Example: A single extremely high salary can inflate the average.

2. **Affect Model Performance**

   * Many machine learning models (e.g., linear regression, k-NN) are **sensitive to extreme values**.
   * Outliers can lead to poor predictions and biased coefficients.

3. **Influence on Visualizations**

   * Outliers can **stretch the axes** of plots such as histograms or scatter plots, compressing the bulk of the data and hiding its structure.

4. **Impact on Distance-Based Algorithms**

   * Algorithms like **k-NN or clustering** rely on distance metrics; outliers can dominate distances and distort results.
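The effect on summary statistics is easy to reproduce. The sketch below uses made-up salaries and adds one extreme value, comparing how the mean and the median react.

```python
from statistics import mean, median

# Made-up salaries, then the same list with one extreme value added
salaries = [30000, 35000, 40000, 45000, 50000]
with_outlier = salaries + [1000000]

print(mean(salaries), median(salaries))          # mean 40000, median 40000
print(mean(with_outlier), median(with_outlier))  # mean 200000, median 42500
```

One extreme value multiplies the mean by five, while the median barely moves. This is why median-based statistics are often preferred when outliers are present.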

---

### **Conclusion**

Handling outliers is crucial to ensure **accurate statistics, robust models, and reliable analysis**. Common approaches include **removal, transformation, or capping (winsorization)**.
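As a sketch of the capping approach, the example below flags values outside the common 1.5 × IQR fences and clips them to those limits. The ages are made up, and the 1.5 multiplier is a conventional choice rather than a fixed rule.

```python
import numpy as np

# Made-up ages with one implausible value
ages = np.array([22, 25, 27, 30, 31, 33, 35, 150], dtype=float)

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (ages < lower) | (ages > upper)
capped = np.clip(ages, lower, upper)  # cap instead of removing

print(ages[is_outlier])
print(capped)
```

Capping keeps the row in the dataset, which can matter when the other columns of that observation are still valid.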




**Q7:** You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

**Ans:**


---

## **Handling Missing Data in Customer Analysis**

When working with datasets containing missing values, it's important to handle them carefully to ensure accurate analysis. Some common techniques include:

---

### **1. Deletion Methods**

* **Row-wise deletion:** Remove rows that contain missing values.
* **Column-wise deletion:** Remove columns with too many missing values.
  **Use case:** When missing data is minimal and removing it won't affect the analysis.

---

### **2. Imputation Methods**

* **Mean Imputation:** Replace missing numerical values with the column mean.
* **Median Imputation:** Replace missing values with the median (useful for skewed data).
* **Mode Imputation:** Replace missing categorical values with the mode.
* **Constant Value:** Fill missing data with a fixed value, e.g., "Unknown" or 0.

---

### **3. Forward / Backward Fill**

* **Forward fill:** Replace missing value with the previous value in the column (time-series data).
* **Backward fill:** Replace missing value with the next value.

---

### **4. Predictive Imputation**

* Use machine learning models (e.g., **KNN, regression**) to predict missing values based on other features.
  **Use case:** When data has patterns that can help estimate missing values.

---

### **5. Specialized Techniques**

* **Multiple Imputation:** Creates multiple imputed datasets and combines results.
* **Interpolation:** Estimate missing values based on trends (time-series data).
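Interpolation can be sketched with pandas' `Series.interpolate`. The series below is a made-up sequence of daily spend values for one customer, and linear interpolation is just one of the available methods.

```python
import numpy as np
import pandas as pd

# Hypothetical daily spend for one customer, with two gaps
spend = pd.Series([100.0, np.nan, 140.0, np.nan, 180.0])

# Each gap is filled with a value on the straight line between its neighbors
filled = spend.interpolate(method="linear")
print(filled.tolist())  # [100.0, 120.0, 140.0, 160.0, 180.0]
```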

---

### **Conclusion**

Choosing the right technique depends on:

* The **amount of missing data**
* **Data type** (numerical or categorical)
* **Business context** and importance of the missing values

Proper handling ensures **accurate insights** and **robust models**.


**Q8:** You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?

**Ans:**



---

## **Determining Patterns in Missing Data**

When a small percentage of data is missing, it is important to check **why the data is missing** to decide how to handle it. Missing data can be:

* **MCAR (Missing Completely at Random):** No pattern; missingness is unrelated to any variable.
* **MAR (Missing at Random):** Missingness depends on other observed variables.
* **MNAR (Missing Not at Random):** Missingness depends on the unobserved value itself.

---

### **Strategies to Detect Patterns**

1. **Visual Inspection**

   * Use heatmaps or missing value plots to see the distribution.

   ```python
   import seaborn as sns
   sns.heatmap(df.isnull(), cbar=False)
   ```

   * Missing values scattered without visible structure are consistent with MCAR; clusters or blocks suggest MAR or MNAR. A plot alone cannot prove either case.

2. **Summary Statistics**

   * Compare statistics of rows with missing vs. non-missing data.
   * Significant differences may indicate MAR or MNAR.

3. **Correlation Analysis**

   * Check correlations between missingness and other variables.
   * Example: `df['feature'].isnull().astype(int).corr(df['other_feature'])`

4. **Chi-Square Test**

   * For categorical data, test whether missingness is independent of other features.

5. **Pattern Detection Tools**

   * Python libraries like `missingno` help identify missing data patterns:

   ```python
   import missingno as msno
   msno.matrix(df)
   msno.bar(df)
   ```
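The summary-statistics and correlation checks can be combined in a short sketch. The data below is made up so that `Income` is missing more often for younger customers; the column names are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Made-up data in which Income is missing more often for younger customers
customers = pd.DataFrame({
    "Age":    [22, 24, 25, 41, 45, 52, 58, 63],
    "Income": [np.nan, np.nan, 30000, 52000, np.nan, 61000, 64000, 70000],
})

# Indicator column: 1 where Income is missing, 0 otherwise
customers["Income_missing"] = customers["Income"].isna().astype(int)

# Compare the mean Age of rows with and without missing Income
age_by_missing = customers.groupby("Income_missing")["Age"].mean()

# Correlation between missingness and another observed variable
corr = customers["Income_missing"].corr(customers["Age"])

print(age_by_missing)
print(round(corr, 2))
```

A clear difference in mean age between the two groups, or a correlation far from zero, hints that missingness depends on an observed variable (MAR) rather than being completely random.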

---

### **Conclusion**

* **MCAR:** Safe to use deletion methods
* **MAR:** Use imputation techniques based on other features
* **MNAR:** Requires careful modeling or domain knowledge

Understanding the **pattern of missing data** ensures more **accurate handling** and reduces bias in analysis.


**Q9:** Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

**Ans:**


---

## **Evaluating Machine Learning Models on Imbalanced Data**

In medical diagnosis, if the dataset is **highly imbalanced** (e.g., few patients have a disease), standard accuracy is **not reliable**, because predicting the majority class will give high accuracy but fail to detect the minority (disease) cases.

---

### **Strategies to Evaluate Performance**

1. **Use Appropriate Metrics**

   * **Precision:** Fraction of correctly predicted positive cases among all predicted positives
   * **Recall (Sensitivity):** Fraction of correctly predicted positive cases among actual positives
   * **F1-Score:** Harmonic mean of precision and recall; balances both
   * **ROC-AUC (Receiver Operating Characteristic – Area Under Curve):** Measures the tradeoff between true positive rate and false positive rate
   * **PR-AUC (Precision-Recall AUC):** Especially useful for highly imbalanced datasets

2. **Confusion Matrix Analysis**

   * Examine **true positives, false positives, true negatives, false negatives**
   * Helps understand how many minority class cases are correctly predicted

3. **Resampling Techniques**

   * **Up-sampling / Down-sampling:** Balance the classes in the training set only, so the test set keeps the real class distribution
   * **SMOTE (Synthetic Minority Oversampling Technique):** Generate synthetic minority class examples for training

4. **Stratified Sampling**

   * Ensure **train-test splits maintain the original class ratio**
   * Prevents model bias during training or testing

5. **Cost-Sensitive Learning**

   * Assign **higher misclassification penalties** to the minority class
   * Encourages the model to pay more attention to critical cases

6. **Cross-Validation with Stratification**

   * Ensures each fold has a **proportional representation** of the minority class
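The metrics from the first two strategies can be computed with `sklearn.metrics`. The labels and predictions below are hypothetical, chosen so that the confusion matrix is easy to verify by hand.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = has the condition, 0 = healthy
y_true = [0] * 90 + [1] * 10
# Hypothetical predictions: 85 TN and 5 FP, then 4 FN and 6 TP
y_pred = [0] * 85 + [1] * 5 + [0] * 4 + [1] * 6

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Accuracy for these predictions is 91%, yet recall shows that 4 of the 10 patients with the condition are missed.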

---

### **Key Takeaway**

* Do **not rely solely on accuracy**
* Focus on metrics that **emphasize the minority class** (recall, F1-score, ROC-AUC)
* Combine **evaluation metrics** with **resampling or cost-sensitive methods** to get a reliable assessment.
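Stratified splitting and cost-sensitive learning can be combined in a few lines of scikit-learn. The sketch below uses synthetic data with 5% positives (an assumption for the example) and logistic regression as a stand-in for any classifier that supports `class_weight`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 5% positives whose features are shifted upward
y = np.array([0] * 190 + [1] * 10)
X = rng.normal(size=(200, 2)) + y[:, None] * 2.0

# stratify=y keeps the 95/5 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# class_weight="balanced" penalizes minority-class errors more heavily
model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)

print(y_train.sum(), y_test.sum())  # positives in each split
print(recall_score(y_test, model.predict(X_test)))
```

Because the test split keeps the original class ratio, the reported recall reflects performance on the real class distribution.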


**Q10:** When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

**Ans:**


---

## **Balancing an Unbalanced Customer Satisfaction Dataset**

When most customers report being satisfied, the dataset is **imbalanced**, with the majority class (satisfied) dominating the minority class (dissatisfied). This can bias machine learning models toward predicting satisfaction for everyone.

To balance the dataset, you can **down-sample the majority class** or use other resampling techniques.

---

### **1. Down-Sampling the Majority Class**

**Definition:** Randomly remove samples from the majority class to match the size of the minority class.


---

### **2. Up-Sampling the Minority Class**

**Definition:** Duplicate or generate synthetic samples of the minority class to match the majority class.

* Can use **SMOTE** for generating synthetic minority samples.

---

### **3. Combination Methods**

* **SMOTE + Down-Sampling:** Balance by reducing majority and enhancing minority classes simultaneously.
* **Weighted Loss Functions:** Assign higher misclassification penalties to the minority class during training.

---

### **4. Considerations When Down-Sampling**

* May **discard useful information** from majority class
* Works well when **majority class is very large**

---

### **Conclusion**

Down-sampling is an effective way to **balance imbalanced datasets** when the majority class dominates. Always combine it with careful **model evaluation using metrics like recall, F1-score, or ROC-AUC** to ensure the minority class is well represented.




In [11]:
import pandas as pd
from sklearn.utils import resample

# Example dataset
data = pd.DataFrame({
    'Satisfaction': ['Satisfied']*90 + ['Dissatisfied']*10,
    'Feature': range(100)
})

# Separate majority and minority classes
majority = data[data.Satisfaction=='Satisfied']
minority = data[data.Satisfaction=='Dissatisfied']

# Down-sample majority class
majority_downsampled = resample(
    majority,
    replace=False,     # No replacement
    n_samples=len(minority),  # Match minority class size
    random_state=42
)

# Combine minority class with downsampled majority class
balanced_data = pd.concat([majority_downsampled, minority])
print(balanced_data['Satisfaction'].value_counts())



Satisfaction
Satisfied       10
Dissatisfied    10
Name: count, dtype: int64


**Q11:** You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

**Ans:**

Balancing an Unbalanced Dataset with Rare Events

When the dataset has a low percentage of rare events (minority class), models may fail to predict these events accurately. To handle this, we can up-sample the minority class to balance the dataset.

1. Random Over-Sampling

Definition: Duplicate existing minority class samples to increase their frequency.

Python Example:


In [12]:
import pandas as pd
from sklearn.utils import resample

# Example dataset
data = pd.DataFrame({
    'Event': ['No']*95 + ['Yes']*5,
    'Feature': range(100)
})

# Separate majority and minority classes
majority = data[data.Event=='No']
minority = data[data.Event=='Yes']

# Up-sample minority class
minority_upsampled = resample(
    minority,
    replace=True,           # With replacement
    n_samples=len(majority),  # Match majority size
    random_state=42
)

# Combine with majority class
balanced_data = pd.concat([majority, minority_upsampled])
print(balanced_data['Event'].value_counts())


Event
No     95
Yes    95
Name: count, dtype: int64


2. SMOTE (Synthetic Minority Over-sampling Technique)

Generates synthetic samples by interpolating between existing minority samples instead of simple duplication.

Reduces overfitting compared to random duplication.

Python Example with imblearn:

In [14]:
import pandas as pd
from imblearn.over_sampling import SMOTE

# Example dataset
data = pd.DataFrame({
    'Event': ['No']*95 + ['Yes']*5,  # Minority class = 5 samples
    'Feature': range(100)
})

# Separate features and target
X = data[['Feature']]
y = data['Event']

# Determine the number of minority samples
minority_count = y.value_counts()['Yes']

# Set k_neighbors to a safe value
k_neighbors = min(5, minority_count - 1)  # Must be < minority samples

# Apply SMOTE
smote = SMOTE(random_state=42, k_neighbors=k_neighbors)
X_res, y_res = smote.fit_resample(X, y)

# Combine back into a DataFrame for convenience
balanced_data = pd.DataFrame(X_res, columns=['Feature'])
balanced_data['Event'] = y_res

# Check class distribution
print(balanced_data['Event'].value_counts())



Event
No     95
Yes    95
Name: count, dtype: int64


3. Other Up-sampling Techniques

ADASYN (Adaptive Synthetic Sampling): Focuses more on minority samples that are harder to classify.

Combination with Down-Sampling: Reduce majority class slightly and up-sample minority class to balance dataset efficiently.

4. Key Considerations

Avoid overfitting by not creating too many duplicates.

Always evaluate model using metrics for imbalanced data (Recall, F1-score, ROC-AUC).

Up-sampling is particularly useful when minority class is small but important (fraud, rare disease, equipment failure).

Conclusion

Up-sampling the minority class ensures that the model learns to detect rare events, improving prediction performance on critical but infrequent outcomes.