### Identifying and Handling Missing Values in a Pandas DataFrame

Pandas marks missing values as `NaN` (Not a Number); `None` is treated as missing too, and missing datetimes appear as `NaT`. We can identify and handle them using various methods:
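
As a small illustration (the Series below is hypothetical), `None` in a numeric column is stored as `NaN`, and because `NaN` never compares equal to itself, missing values should be detected with `isna()`/`isnull()` rather than `==`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: None in a numeric column is stored as NaN (dtype float64)
ages = pd.Series([25, None, 30])

# NaN never equals itself, so an equality test cannot find missing values
found_by_equality = int((ages == np.nan).sum())

# isna() (alias: isnull()) is the reliable check
found_by_isna = int(ages.isna().sum())

print(ages.dtype, found_by_equality, found_by_isna)
```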

#### **Identifying Missing Values**
1. `df.isnull()` → Returns a DataFrame showing `True` for missing values.
2. `df.isnull().sum()` → Counts missing values per column.

#### **Handling Missing Values**
1. **Removing Missing Data**
   - `df.dropna()` → Removes rows with missing values.
   - `df.dropna(axis=1)` → Removes columns with missing values.

2. **Filling Missing Data**
   - `df.fillna(value)` → Replaces missing values with a specific value.
   - `df['column'].fillna(df['column'].mean())` → Replaces missing values with the column's mean (for numerical data).
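   - `df.ffill()` → Forward fill (propagates the last valid value). The older `df.fillna(method='ffill')` spelling is deprecated since pandas 2.1.
   - `df.bfill()` → Backward fill (uses the next valid value). The older `df.fillna(method='bfill')` spelling is deprecated since pandas 2.1.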


In [None]:
import pandas as pd

# Sample DataFrame with missing values
data = {
    'Name': ['Ali', 'Ayesha', 'Hassan', None, 'Sara'],
    'Age': [25, None, 30, 28, None],
    'City': ['Karachi', 'Lahore', None, 'Islamabad', 'Quetta']
}

df = pd.DataFrame(data)

# Identifying missing values
print("Missing Values in DataFrame:")
print(df.isnull())  # Shows True for missing values

print("\nCount of Missing Values per Column:")
print(df.isnull().sum())  # Count missing values per column

# Handling missing values

# Removing rows with missing values
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped_rows)

# Removing columns with missing values
df_dropped_cols = df.dropna(axis=1)
print("\nDataFrame after dropping columns with missing values:")
print(df_dropped_cols)

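# Filling missing values with a specific value
# Note: this also writes the string "Unknown" into the numeric 'Age' column,
# turning it into an object column; pass a dict to fill per column instead.
df_filled = df.fillna("Unknown")
print("\nDataFrame after filling missing values with 'Unknown':")
print(df_filled)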

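# Filling missing numerical values with the column's mean
# (work on a copy so the forward/backward fill examples still see the gaps)
df_mean = df.copy()
df_mean['Age'] = df_mean['Age'].fillna(df_mean['Age'].mean())
print("\nDataFrame after filling missing 'Age' values with mean:")
print(df_mean)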

# Forward Fill (fillna(method='ffill') is deprecated since pandas 2.1)
df_ffill = df.ffill()
print("\nDataFrame after Forward Fill:")
print(df_ffill)

# Backward Fill (fillna(method='bfill') is deprecated since pandas 2.1)
df_bfill = df.bfill()
print("\nDataFrame after Backward Fill:")
print(df_bfill)


___

### What is Imputation, and Why is it Useful in Dealing with Missing Data?

**Imputation** is the process of replacing missing values in a dataset with estimated values instead of removing them.

#### **Why is Imputation Useful?**
- **Prevents Data Loss**: Removing missing values can lead to loss of important information.
- **Maintains Data Integrity**: Helps preserve the dataset size and structure.
- **Improves Model Performance**: In machine learning, missing data can lead to bias or incorrect predictions.
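
To make the data-loss point concrete, here is a minimal sketch that counts the rows left after dropping versus imputing. The sample frame mirrors the one used elsewhere in this notebook; the choice of median and `'Unknown'` as fill values is only an assumption for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Ali', 'Ayesha', 'Hassan', None, 'Sara'],
    'Age': [25, None, 30, 28, None],
    'City': ['Karachi', 'Lahore', None, 'Islamabad', 'Quetta'],
})

# Dropping: every row with at least one missing value disappears
rows_after_drop = len(df_demo.dropna())

# Imputing: all rows are kept, gaps are replaced with estimates
df_imputed = df_demo.copy()
df_imputed['Age'] = df_imputed['Age'].fillna(df_imputed['Age'].median())
df_imputed[['Name', 'City']] = df_imputed[['Name', 'City']].fillna('Unknown')
rows_after_impute = len(df_imputed)

print("Rows after dropna():", rows_after_drop)
print("Rows after imputation:", rows_after_impute)
```

Here four of the five rows contain at least one gap, so dropping keeps a single row while imputation keeps all five.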

#### **Common Imputation Techniques**
1. **Mean/Median/Mode Imputation** (mean and median for numerical data; mode also works for categorical data)
   - Mean: Replaces missing values with the average of available data.
   - Median: Uses the middle value of the dataset.
   - Mode: Uses the most frequent value.

2. **Forward/Backward Fill** (for time series data)
   - **Forward Fill (`ffill`)**: Uses the previous row's value.
   - **Backward Fill (`bfill`)**: Uses the next row's value.

3. **Predictive Imputation** (Advanced)
   - Uses machine learning models (e.g., KNN, regression) to predict missing values.
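
Predictive imputation can be sketched with scikit-learn's `KNNImputer`. The numeric frame and the choice of `n_neighbors=2` below are assumptions made for illustration; the imputer only accepts numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data with one missing Age
df_num = pd.DataFrame({
    'Age':    [25, 27, np.nan, 40, 42],
    'Salary': [50, 54, 55, 90, 94],
})

# Each missing value is replaced by the average of that feature
# over the n_neighbors rows that are closest on the observed features
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df_num), columns=df_num.columns)

print(df_knn)
```

The missing `Age` is estimated from the two rows whose `Salary` is closest, rather than from the column as a whole.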


In [None]:
import pandas as pd

# Sample DataFrame with missing values
data = {
    'Name': ['Ali', 'Ayesha', 'Hassan', None, 'Sara'],
    'Age': [25, None, 30, 28, None],
    'City': ['Karachi', 'Lahore', None, 'Islamabad', 'Quetta']
}

df = pd.DataFrame(data)

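# Mean Imputation (filling missing 'Age' values with the mean)
df_mean = df.copy()
df_mean['Age'] = df_mean['Age'].fillna(df_mean['Age'].mean())
print("\nDataFrame after Mean Imputation for 'Age':")
print(df_mean)

# Median Imputation (on a fresh copy: after mean imputation no NaN would be left to fill)
df_median = df.copy()
df_median['Age'] = df_median['Age'].fillna(df_median['Age'].median())
print("\nDataFrame after Median Imputation for 'Age':")
print(df_median)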

# Mode Imputation for categorical data (Filling missing 'City' values with mode)
df['City'] = df['City'].fillna(df['City'].mode()[0])
print("\nDataFrame after Mode Imputation for 'City':")
print(df)

# Forward Fill (fillna(method='ffill') is deprecated since pandas 2.1)
df_ffill = df.ffill()
print("\nDataFrame after Forward Fill:")
print(df_ffill)

# Backward Fill (fillna(method='bfill') is deprecated since pandas 2.1)
df_bfill = df.bfill()
print("\nDataFrame after Backward Fill:")
print(df_bfill)


___

### How Can You Encode Categorical Variables in a Pandas DataFrame?

**Categorical variables** are non-numeric variables that represent categories, such as `"Male"`/`"Female"` or `"Red"`/`"Blue"`/`"Green"`. Since most machine learning models expect numerical input, we need to encode these categorical variables into numerical formats.

#### **Methods for Encoding Categorical Variables**
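1. **Label Encoding** (`.astype('category').cat.codes`)
   - Converts each category into a unique integer.
   - By default the codes follow the sorted (alphabetical) category order, so for ordinal data the order must be declared explicitly (e.g., an ordered categorical with `Low < Medium < High`) to obtain `Low=0, Medium=1, High=2`.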

2. **One-Hot Encoding** (`pd.get_dummies()`)
   - Creates separate binary columns (0 or 1) for each category.
   - Useful for nominal data (e.g., `"Red"`, `"Blue"`, `"Green"`).

3. **Ordinal Encoding** (`OrdinalEncoder` from `sklearn.preprocessing`)
   - Similar to label encoding but allows specifying a meaningful order.

4. **Target Encoding** (Advanced)
   - Replaces each category with the mean of the target variable (used in ML).
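
Target encoding is not shown in the cell that follows, so here is a minimal pandas sketch of the idea. The `Bought` target column and its values are hypothetical; in a real project the category means should be computed on the training split only, otherwise the target leaks into the features.

```python
import pandas as pd

df_te = pd.DataFrame({
    'City':   ['Karachi', 'Lahore', 'Karachi', 'Lahore', 'Islamabad'],
    'Bought': [1, 0, 1, 1, 0],  # hypothetical binary target
})

# Mean of the target for each category
city_means = df_te.groupby('City')['Bought'].mean()

# Replace each category with its target mean
df_te['City_Target_Encoded'] = df_te['City'].map(city_means)

print(df_te)
```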


import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Sample DataFrame with categorical variables
data = {
    'Name': ['Ali', 'Ayesha', 'Hassan', 'Sara', 'Bilal'],
    'City': ['Karachi', 'Lahore', 'Islamabad', 'Karachi', 'Lahore'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']
}

df = pd.DataFrame(data)

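# Label Encoding
# Note: scikit-learn documents LabelEncoder as meant for target labels (y);
# it is used on a feature column here only to illustrate the integer mapping.
label_encoder = LabelEncoder()
df['City_Label_Encoded'] = label_encoder.fit_transform(df['City'])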
print("\nDataFrame after Label Encoding:")
print(df)

# One-Hot Encoding
# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
df_one_hot = pd.get_dummies(df, columns=['City'], dtype=int)
print("\nDataFrame after One-Hot Encoding:")
print(df_one_hot)

# Ordinal Encoding (Assuming order: Small < Medium < Large)
size_categories = [['Small', 'Medium', 'Large']]
ordinal_encoder = OrdinalEncoder(categories=size_categories)
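# fit_transform returns a 2D array of shape (n, 1); flatten it before assigning to one column
df['Size_Ordinal_Encoded'] = ordinal_encoder.fit_transform(df[['Size']]).ravel()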
print("\nDataFrame after Ordinal Encoding:")
print(df)


### What is One-Hot Encoding, and When Would You Use It in Data Preprocessing?

#### **What is One-Hot Encoding?**
One-hot encoding is a technique used to convert categorical variables into a numerical format. It creates **binary (0 or 1) columns** for each unique category in the original feature.

For example, if we have a `"Color"` column with values `["Red", "Green", "Blue"]`, one-hot encoding will create separate columns:

| Color     | Red | Green | Blue |
|-----------|-----|-------|------|
| **Red**   | 1   | 0     | 0    |
| **Green** | 0   | 1     | 0    |
| **Blue**  | 0   | 0     | 1    |




#### **When Should You Use One-Hot Encoding?**
- When dealing with **nominal categorical variables** (no inherent order, e.g., `"Red"`, `"Blue"`, `"Green"`).
- When preparing data for **machine learning models**, as most models work with numerical inputs.
- When categorical variables are **not ordinal**, meaning that the order does not matter.

#### **When Should You Avoid One-Hot Encoding?**
- When there are **too many unique categories** (e.g., thousands of city names) since it increases dimensionality (curse of dimensionality).
- When the variable is **ordinal** (e.g., `"Small" < "Medium" < "Large"`) – use ordinal encoding instead.






In [None]:
import pandas as pd

# Sample DataFrame with categorical variable
data = {
    'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']
}

df = pd.DataFrame(data)

# Applying One-Hot Encoding
# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
df_one_hot = pd.get_dummies(df, columns=['Color'], dtype=int)

print("\nDataFrame after One-Hot Encoding:")
print(df_one_hot)


### How Do You Identify and Remove Duplicate Rows from a DataFrame?

#### **Identifying Duplicates**
In Pandas, we can check for duplicate rows using:
- `df.duplicated()` → Returns a Boolean Series indicating whether each row is a duplicate (`True` for duplicates).
- `df.duplicated().sum()` → Counts the total number of duplicate rows.

#### **Removing Duplicates**
To remove duplicate rows, we use:
- `df.drop_duplicates()` → Removes all duplicate rows, keeping only the first occurrence by default.
- `df.drop_duplicates(keep='last')` → Keeps the last occurrence instead of the first.
- `df.drop_duplicates(subset=['column_name'])` → Removes duplicates based on a specific column.


In [None]:
import pandas as pd

# Sample DataFrame with duplicate rows
data = {
    'Name': ['Ali', 'Ayesha', 'Hassan', 'Ali', 'Sara'],
    'Age': [25, 30, 28, 25, 22],
    'City': ['Karachi', 'Lahore', 'Islamabad', 'Karachi', 'Quetta']
}

df = pd.DataFrame(data)

# Identifying duplicate rows
print("Duplicate Rows (True means duplicate):")
print(df.duplicated())

# Counting duplicate rows
print("\nTotal Number of Duplicate Rows:", df.duplicated().sum())

# Removing duplicate rows (keeping the first occurrence)
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after Removing Duplicates:")
print(df_no_duplicates)

# Removing duplicates while keeping the last occurrence
df_no_duplicates_last = df.drop_duplicates(keep='last')
print("\nDataFrame after Removing Duplicates (Keeping Last Occurrence):")
print(df_no_duplicates_last)

# Removing duplicates based on a specific column (e.g., 'Name')
df_no_duplicates_name = df.drop_duplicates(subset=['Name'])
print("\nDataFrame after Removing Duplicates Based on 'Name' Column:")
print(df_no_duplicates_name)


### Difference Between `duplicated()` and `drop_duplicates()` in Pandas

#### **1. `duplicated()` Method**
- It **identifies** duplicate rows in a DataFrame.
- Returns a **Boolean Series**, where `True` indicates a duplicate row.
- Does **not** modify the original DataFrame.

##### **Syntax:**
```python
df.duplicated()
df.duplicated(subset=['column_name'])
df.duplicated(keep='first')  # Default: Marks all but first occurrence as duplicate
df.duplicated(keep='last')   # Marks all but last occurrence as duplicate
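df.duplicated(keep=False)    # Marks all occurrences as duplicate
```

#### **2. `drop_duplicates()` Method**
- It **removes** duplicate rows and returns a new DataFrame.
- Accepts the same `subset` and `keep` arguments as `duplicated()`.
- Does **not** modify the original DataFrame unless `inplace=True` is passed.

##### **Example:**
```python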
import pandas as pd

# Sample DataFrame with duplicate rows
data = {
    'Name': ['Ali', 'Ayesha', 'Hassan', 'Ali', 'Sara', 'Ali'],
    'Age': [25, 30, 28, 25, 22, 25],
    'City': ['Karachi', 'Lahore', 'Islamabad', 'Karachi', 'Quetta', 'Karachi']
}

df = pd.DataFrame(data)

# Identifying duplicate rows using duplicated()
print("Duplicate Rows (True means duplicate):")
print(df.duplicated())

# Removing duplicates using drop_duplicates()
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after Removing Duplicates (Keeping First Occurrence):")
print(df_no_duplicates)

# Removing duplicates while keeping the last occurrence
df_no_duplicates_last = df.drop_duplicates(keep='last')
print("\nDataFrame after Removing Duplicates (Keeping Last Occurrence):")
print(df_no_duplicates_last)

# Removing all occurrences of duplicates
df_no_all_duplicates = df.drop_duplicates(keep=False)
print("\nDataFrame after Removing All Duplicate Rows:")
print(df_no_all_duplicates)
```


### Importance of Feature Scaling in Machine Learning

#### **What is Feature Scaling?**
Feature scaling is the process of **normalizing or standardizing** numerical data so that all features have a similar range. It ensures that no feature dominates the learning process due to differences in magnitude.

#### **Why is Feature Scaling Important?**
1. **Improves Model Performance**  
   - Algorithms like gradient descent converge faster with scaled features.
  
2. **Prevents Features with Larger Values from Dominating**  
   - Features measured on larger scales contribute more to distances and gradient updates, which can bias the model toward them.

3. **Essential for Distance-Based Algorithms**  
   - Algorithms like **K-Nearest Neighbors (KNN), K-Means Clustering, and Support Vector Machines (SVM)** rely on distances between points, making scaling crucial.

4. **Stabilizes Neural Networks**  
   - Scaled inputs keep activations and gradients in a well-behaved range, which typically makes training faster and more stable.

5. **Improves Interpretability**  
   - Helps in understanding and comparing feature contributions in linear models.

#### **Common Feature Scaling Techniques**
1. **Min-Max Scaling (Normalization)**
   - Rescales values between 0 and 1.
   - Formula:  
     \[
     X' = \frac{X - X_{min}}{X_{max} - X_{min}}
     \]
   - Use case: When data needs to be in a fixed range (e.g., neural networks).

2. **Standardization (Z-score Normalization)**
   - Centers data around mean (0) with standard deviation (1).
   - Formula:  
     \[
     X' = \frac{X - \mu}{\sigma}
     \]
   - Use case: When data follows a normal distribution.

3. **Robust Scaling**
   - Uses median and interquartile range (IQR) instead of mean and standard deviation.
   - Less sensitive to outliers.
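
The three formulas above can be checked by hand with NumPy. This is a sketch on a single hypothetical feature; the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` applies, and the robust version divides by the 25th to 75th percentile range, which is `RobustScaler`'s default.

```python
import numpy as np

x = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

# Min-Max: (X - X_min) / (X_max - X_min)
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: (X - mean) / std, with population std (ddof=0)
z = (x - x.mean()) / x.std()

# Robust: (X - median) / IQR
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)

print("Min-Max:", min_max)
print("Z-score:", z.round(3))
print("Robust: ", robust)
```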


import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Sample DataFrame
data = {
    'Feature1': [100, 200, 300, 400, 500],
    'Feature2': [1, 2, 3, 4, 5],
    'Feature3': [1000, 2000, 3000, 4000, 5000]
}

df = pd.DataFrame(data)

# Min-Max Scaling (Normalization)
min_max_scaler = MinMaxScaler()
df_min_max = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)

# Standardization (Z-score Normalization)
standard_scaler = StandardScaler()
df_standardized = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)

# Robust Scaling (Handles Outliers)
robust_scaler = RobustScaler()
df_robust = pd.DataFrame(robust_scaler.fit_transform(df), columns=df.columns)

# Display Results
print("\nOriginal DataFrame:")
print(df)

print("\nMin-Max Scaled DataFrame:")
print(df_min_max)

print("\nStandardized DataFrame:")
print(df_standardized)

print("\nRobust Scaled DataFrame:")
print(df_robust)


### Difference Between Min-Max Scaling and Z-Score Normalization

Feature scaling is an essential step in machine learning, and **Min-Max Scaling** and **Z-Score Normalization (Standardization)** are two common techniques.

#### **1. Min-Max Scaling (Normalization)**
- Rescales data to a fixed range, usually **[0,1]** or **[-1,1]**.
- Formula:
  \[
  X' = \frac{X - X_{min}}{X_{max} - X_{min}}
  \]
- **When to Use?**
  - When feature values have a known minimum and maximum.
  - When data does **not** follow a normal distribution.
  - Useful for algorithms that require bounded input (e.g., **Neural Networks**).

#### **2. Z-Score Normalization (Standardization)**
- Transforms data to have a **mean of 0** and a **standard deviation of 1**.
- Formula:
  \[
  X' = \frac{X - \mu}{\sigma}
  \]
- **When to Use?**
  - When data follows a **normal distribution**.
  - When feature values do **not** have a known range.
  - Preferred for algorithms like **SVM, K-Means, PCA** that rely on distances or variances.

#### **Key Differences**
| Method | Range | Affected by Outliers? | Best Used For |
|--------|-------|-----------------------|---------------|
| Min-Max Scaling | [0,1] (or [-1,1]) | **Yes** (min and max are set by the extremes) | Neural Networks, Image Processing |
| Z-Score Normalization | Unbounded (Mean = 0, Std Dev = 1) | **Yes, but less so** (mean and std still shift with outliers) | SVM, K-Means, PCA, Linear Regression |

#### **Choosing the Right Method**
- If your data contains **outliers**, prefer **Robust Scaling**; of the two methods compared here, **Z-Score Normalization** is distorted less than Min-Max Scaling.
- If your data needs to be in a **fixed range**, use **Min-Max Scaling**.
- Some models like **Tree-based algorithms (Decision Trees, Random Forests)** do **not require** scaling.
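
A small sketch of how a single extreme value affects each scaler, using a hypothetical feature in which `1000` is the outlier. The robust version (median and IQR) is included for comparison.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])  # 1000 is the outlier

min_max = (x - x.min()) / (x.max() - x.min())
z = (x - x.mean()) / x.std()

q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)

# How much of the scaled range the four ordinary values occupy
print("Min-Max spread of inliers:", round(min_max[3] - min_max[0], 4))
print("Z-score spread of inliers:", round(z[3] - z[0], 4))
print("Robust spread of inliers: ", round(robust[3] - robust[0], 4))
```

With Min-Max scaling the four ordinary values are squeezed into a tiny slice near 0. The z-score keeps them slightly further apart but is still pulled by the outlier, while the median/IQR version leaves their spacing intact.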


import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample DataFrame
data = {
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [100, 200, 300, 400, 500],
    'Feature3': [5, 10, 15, 20, 25]
}

df = pd.DataFrame(data)

# Applying Min-Max Scaling
min_max_scaler = MinMaxScaler()
df_min_max = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)

# Applying Z-Score Normalization (Standardization)
standard_scaler = StandardScaler()
df_standardized = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)

# Display Results
print("\nOriginal DataFrame:")
print(df)

print("\nMin-Max Scaled DataFrame:")
print(df_min_max)

print("\nStandardized DataFrame (Z-Score Normalization):")
print(df_standardized)


________

### What Are Outliers, and Why Might They Impact Machine Learning Models?

#### **What Are Outliers?**
An **outlier** is a data point that differs significantly from other observations in the dataset. It is an unusually high or low value compared to the rest of the data.

#### **Types of Outliers**
1. **Univariate Outliers** – Found when analyzing a single feature.
2. **Multivariate Outliers** – Detected when considering relationships between multiple features.

#### **Why Do Outliers Matter in Machine Learning?**
- **Skewing Statistical Measures** – Affects mean, standard deviation, and correlation.
- **Impacting Model Performance** – Some ML algorithms (e.g., Linear Regression, K-Means) are sensitive to extreme values.
- **Slowing Model Convergence** – Can cause instability in gradient-based models (e.g., Neural Networks).
- **Misleading Insights** – Can result in incorrect conclusions or poor model predictions.
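
The effect on summary statistics is easy to see on a tiny sample. This sketch uses a hypothetical feature in which `500` is the outlier; the cutoff of `100` used to drop it is arbitrary and only for illustration.

```python
import pandas as pd

values = pd.Series([10, 12, 14, 16, 18, 500])  # 500 is the outlier
clean = values[values < 100]                   # arbitrary cutoff for illustration

print("With outlier:    mean =", values.mean(), " median =", values.median())
print("Without outlier: mean =", clean.mean(), " median =", clean.median())
```

The mean jumps from 14 to 95 because of one value, while the median only moves from 14 to 15.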

#### **When to Remove or Keep Outliers?**
- **Remove outliers** if they are due to errors, sensor malfunctions, or irrelevant noise.
- **Keep outliers** if they provide valuable insights (e.g., fraud detection, rare events).



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample Data with an Outlier
data = {'Feature1': [10, 12, 14, 16, 18, 500]}  # 500 is an outlier
df = pd.DataFrame(data)

# Boxplot to Visualize Outliers
plt.boxplot(df['Feature1'])
plt.title("Boxplot for Outlier Detection")
plt.show()

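# Identifying Outliers using Z-Score
from scipy.stats import zscore

df['Z-Score'] = np.abs(zscore(df['Feature1']))
# With only 6 points, |z| can never exceed sqrt(6 - 1) ≈ 2.24, so the usual
# cutoff of 3 would flag nothing here; a cutoff of 2 is used for this tiny sample.
outliers = df[df['Z-Score'] > 2]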

print("\nOutliers Detected Using Z-Score:")
print(outliers)


____

### Different Methods for Detecting Outliers in a Dataset

Outliers are extreme values that differ significantly from the rest of the data. Detecting and handling them is crucial for improving the performance of machine learning models.

#### **1. Z-Score Method (Standard Deviation)**
- Measures how many standard deviations a data point is from the mean.
- A common threshold for outliers is **Z-score > 3 or < -3**.
- Best for normally distributed data.
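
One caveat is worth checking numerically: with the population standard deviation (the default in `scipy.stats.zscore`), no point in a sample of `n` values can have `|z|` above `sqrt(n - 1)`, so a threshold of 3 can only fire when there are at least 11 points. A minimal sketch on a hypothetical six-point sample:

```python
import numpy as np

x = np.array([10, 12, 14, 16, 18, 500], dtype=float)

# Population std (ddof=0), the same default scipy.stats.zscore uses
z = (x - x.mean()) / x.std()

max_abs_z = np.abs(z).max()
bound = np.sqrt(len(x) - 1)  # upper limit for |z| in a sample of this size

print("Largest |z|:", round(max_abs_z, 3))
print("Upper bound:", round(bound, 3))
print("Flagged at threshold 3:", int((np.abs(z) > 3).sum()))
```

Even though `500` is clearly extreme, its z-score stays below 3 in such a small sample, so a lower threshold or the IQR method is the safer choice there.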

#### **2. IQR (Interquartile Range) Method**
- Uses the **first quartile (Q1) and third quartile (Q3)** to define outliers.
- Data points outside the range **[Q1 - 1.5 × IQR, Q3 + 1.5 × IQR]** are considered outliers.
- Works well with skewed data.

#### **3. Boxplot Method**
- A graphical approach using boxplots to visualize outliers.
- Outliers appear as **individual points outside the whiskers**.

#### **4. Isolation Forest**
- A machine learning model that isolates anomalies using decision trees.
- Assigns an "anomaly score" to each observation.

#### **5. DBSCAN (Density-Based Clustering)**
- Detects outliers by identifying points in low-density areas.
- Good for **unsupervised anomaly detection**.

Each method has its advantages, and the choice depends on the data distribution and the problem at hand.
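
DBSCAN is not covered by the cell that follows, so here is a minimal sketch with scikit-learn. The values `eps=5` and `min_samples=2` are assumptions chosen to suit this tiny sample; points labelled `-1` are the ones DBSCAN treats as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical single feature, reshaped to (n_samples, n_features)
X = np.array([10, 12, 14, 16, 18, 500], dtype=float).reshape(-1, 1)

# A point is "core" if at least min_samples points (itself included) lie within eps
db = DBSCAN(eps=5, min_samples=2)
labels = db.fit_predict(X)

# Label -1 marks noise, i.e. points that belong to no dense region
outliers_dbscan = X[labels == -1].ravel()

print("Labels:", labels)
print("Outliers:", outliers_dbscan)
```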


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import zscore
from sklearn.ensemble import IsolationForest

# Sample Data
data = {'Feature1': [10, 12, 14, 16, 18, 500]}  # 500 is an outlier
df = pd.DataFrame(data)

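# 1. Z-Score Method
# With only 6 points, |z| can never exceed sqrt(6 - 1) ≈ 2.24, so the usual
# cutoff of 3 would flag nothing here; a cutoff of 2 is used for this tiny sample.
df['Z-Score'] = np.abs(zscore(df['Feature1']))
outliers_z = df[df['Z-Score'] > 2]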

# 2. IQR Method
Q1 = df['Feature1'].quantile(0.25)
Q3 = df['Feature1'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['Feature1'] < Q1 - 1.5 * IQR) | (df['Feature1'] > Q3 + 1.5 * IQR)]

# 3. Boxplot Visualization
plt.boxplot(df['Feature1'])
plt.title("Boxplot for Outlier Detection")
plt.show()

# 4. Isolation Forest
# fit_predict returns labels, not scores: -1 for outliers and 1 for inliers
iso_forest = IsolationForest(contamination=0.1, random_state=42)
df['Anomaly Label'] = iso_forest.fit_predict(df[['Feature1']])
outliers_iso = df[df['Anomaly Label'] == -1]

# Display Results
print("\nOutliers Detected Using Z-Score:")
print(outliers_z)

print("\nOutliers Detected Using IQR Method:")
print(outliers_iqr)

print("\nOutliers Detected Using Isolation Forest:")
print(outliers_iso)


_____

### How to Handle Outliers in a Continuous Numerical Variable

Outliers can distort statistical analysis and negatively impact machine learning models. Here are several methods to handle outliers in continuous numerical variables:

#### **1. Removal of Outliers**
- If outliers are due to errors or irrelevant data, they can be removed.
- **Methods:**
  - Z-Score (Remove values with |Z| > 3)
  - IQR Method (Remove values outside **[Q1 - 1.5 × IQR, Q3 + 1.5 × IQR]**)

#### **2. Winsorization (Capping)**
- Replace extreme values with a specified percentile (e.g., 5th and 95th percentiles).

#### **3. Transformation Techniques**
- Apply transformations like **log, square root, or Box-Cox** to reduce the effect of outliers.

#### **4. Binning**
- Convert continuous variables into categorical bins to reduce the impact of extreme values.

#### **5. Model-Based Approaches**
- Use **Robust Regression** (e.g., RANSAC) or **Tree-Based Models** (e.g., Decision Trees, Random Forests) that are less sensitive to outliers.

The best method depends on whether the outliers are **genuine data points** or **errors**.
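
Binning is not covered by the cell that follows, so here is a short sketch using `pd.qcut`. The three equal-frequency bins and their labels are assumptions chosen for illustration.

```python
import pandas as pd

values = pd.Series([10, 12, 14, 16, 18, 500])  # 500 is the outlier

# Equal-frequency bins: each bin receives roughly the same number of rows
binned = pd.qcut(values, q=3, labels=['low', 'mid', 'high'])

print(pd.DataFrame({'Feature1': values, 'Bin': binned}))
```

After binning, `500` simply falls into the same `high` bin as `18`, so its magnitude no longer influences the model.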


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import zscore

# Sample Data
data = {'Feature1': [10, 12, 14, 16, 18, 500]}  # 500 is an outlier
df = pd.DataFrame(data)

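# 1. Removing Outliers Using Z-Score
# With 6 points |z| is capped at sqrt(6 - 1) ≈ 2.24, so a cutoff of 3 would
# remove nothing; a cutoff of 2 is used for this tiny sample.
df_zscore = df[np.abs(zscore(df['Feature1'])) < 2]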

# 2. Removing Outliers Using IQR Method
Q1 = df['Feature1'].quantile(0.25)
Q3 = df['Feature1'].quantile(0.75)
IQR = Q3 - Q1
df_iqr = df[(df['Feature1'] >= Q1 - 1.5 * IQR) & (df['Feature1'] <= Q3 + 1.5 * IQR)]

# 3. Winsorization (Capping Outliers)
lower_limit = df['Feature1'].quantile(0.05)  # 5th percentile
upper_limit = df['Feature1'].quantile(0.95)  # 95th percentile
df_winsorized = df.copy()
df_winsorized['Feature1'] = np.clip(df['Feature1'], lower_limit, upper_limit)

# 4. Log Transformation
df_log_transformed = df.copy()
df_log_transformed['Feature1'] = np.log1p(df['Feature1'])  # log1p avoids log(0) issues

# Display Results
print("\nOriginal Data:")
print(df)

print("\nAfter Removing Outliers (Z-Score):")
print(df_zscore)

print("\nAfter Removing Outliers (IQR Method):")
print(df_iqr)

print("\nAfter Winsorization (Capping Outliers):")
print(df_winsorized)

print("\nAfter Log Transformation:")
print(df_log_transformed)


___

# THE END 

___