---
---

### ✅ **Complete List of EDA Topics for Machine Learning**

*(Organized step-by-step, from raw data to insights)*

---

#### 🧹 1. **Data Cleaning**

* Handling missing values (imputation, removal)
* Handling duplicates
* Handling outliers
* Type conversions (e.g., object to float)
* Dealing with inconsistent formats (dates, currencies, etc.)
* String trimming and standardization

---

#### 📏 2. **Data Type Identification & Conversion**

* Numerical vs Categorical
* Ordinal vs Nominal
* Datetime parsing
* Encoding (Label, One-hot, Ordinal)

---

#### 📊 3. **Univariate Analysis**

* Summary statistics (mean, median, mode, std, IQR)
* Frequency distribution
* Value counts
* Distribution plots (histogram, KDE, boxplot)
* Detecting skewness and kurtosis

---

#### 📈 4. **Bivariate Analysis**

* Correlation matrix (Pearson, Spearman)
* Scatter plots
* Heatmaps
* Pair plots
* Groupby analysis
* Cross-tabulation
* Boxplots/grouped boxplots

---

#### 🔁 5. **Multivariate Analysis**

* Multivariate correlation
* Pairplots (Seaborn)
* FacetGrid analysis
* PCA for visualization
* Bubble charts

---

#### 📉 6. **Outlier Detection**

* Z-score
* IQR method
* Boxplot visual method
* Isolation Forest (optional ML method)
* Mahalanobis distance

---

#### 🧬 7. **Feature Distribution Analysis**

* Normal vs non-normal distribution
* Skewness correction (log, sqrt, Box-Cox)
* Visualization: histogram, `histplot`/`displot` (seaborn's older `distplot` is deprecated), violin plot

---

#### 📆 8. **Time Series EDA**

* Time-based decomposition
* Rolling statistics
* Seasonal trends
* Lag analysis
* Autocorrelation/Partial Autocorrelation

---

#### 📂 9. **Categorical Variable Analysis**

* Frequency tables
* Bar plots / Count plots
* Pie charts (use sparingly)
* Stacked bar charts
* Chi-square test (for association)

---

#### 📊 10. **Numerical Variable Analysis**

* Distribution shape
* Mean/median comparison
* Boxplots by category
* Violin plots
* ANOVA or t-tests

---

#### 🔀 11. **Encoding Categorical Data**

* Label Encoding
* One-Hot Encoding
* Target/Mean Encoding
* Frequency Encoding

---

#### 📉 12. **Correlation Analysis**

* Pearson, Spearman, Kendall coefficients
* Heatmaps
* Variance Inflation Factor (VIF) for multicollinearity

---

#### 🧪 13. **Missing Value Treatment**

* Count and percentage of missing values
* Missingness pattern visualization
* Imputation techniques:

  * Mean/median/mode
  * Forward/backward fill
  * KNN imputation
  * Regression imputation

---

#### 🧮 14. **Feature Engineering (EDA-Aided)**

* Polynomial features
* Interaction terms
* Date parts (year, month, weekday, etc.)
* Binning (equal-width, equal-frequency, quantile-based)

---

#### ⚖️ 15. **Target Variable Analysis**

* Class imbalance (binary/multiclass)
* Distribution of target vs features
* Use of stratification for classification
* SMOTE or undersampling techniques (if applied)

---

#### 📈 16. **Visualization Techniques**

* Seaborn, Matplotlib, Plotly
* Histograms, KDE, Boxplots
* Count plots, Pie charts, Bar charts
* Heatmaps, Correlation plots
* Joint plots, Pair plots, Violin plots
* Time series plots
* Missing value matrix (e.g., `msno.matrix()`)

---

#### 📋 17. **EDA Reporting**

* ydata-profiling (formerly Pandas Profiling)
* Sweetviz
* D-Tale
* Autoviz
* Lux

---

#### 🛑 18. **EDA Red Flags**

* Data leakage detection
* Target leakage in features
* High multicollinearity
* Dominant class in target

---
---



---

### 🔷 **1. Data Cleaning** 🧹

---

#### 📖 1. Definition:

* **Data cleaning** refers to identifying and correcting errors or inconsistencies in data to improve its quality and reliability.
* It includes removing duplicates, fixing missing or inconsistent data, and formatting values appropriately.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):


**➤ `df.isnull()` / `df.isna()`**

* Detects missing (`NaN`) values.

```python
df.isnull()
# Output:
#     Name    Age  Gender
# 0  False  False   False
# 1  False  False   False
# 2  False   True   False
# 3  False  False   False
# 4   True  False   False
```

**➤ `df.dropna()`**

* Removes rows (or columns) with missing values.

```python
df.dropna()
# Output:
#     Name   Age Gender
# 0  Alice  25.0      F
# 1    Bob  30.0      M
# 3    Bob  30.0      M
```

**➤ `df.fillna(value)`**

* Replaces missing values with a specific value.

```python
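df['Age'] = df['Age'].fillna(df['Age'].median())  # assign back; chained inplace=True is unreliable in recent pandas
# Output: [25.0, 30.0, 27.5, 30.0, 22.0]  (median of 22, 25, 30, 30 is 27.5)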
```

**➤ `df.duplicated()`**

* Returns a Boolean Series indicating duplicate rows.

```python
df.duplicated()
# Output:
# 0    False
# 1    False
# 2    False
# 3     True
# 4    False
```

**➤ `df.drop_duplicates()`**

* Drops duplicate rows from DataFrame.

```python
df.drop_duplicates()
# Output:
#     Name   Age Gender
# 0  Alice  25.0      F
# 1    Bob  30.0      M
# 2 Charlie   NaN      M
# 4   None  22.0      F
```

**➤ `df.astype(type)`**

* Converts column to a different data type.

```python
df['Age'] = df['Age'].astype(int)  # run after filling NaN; astype(int) raises on missing values
# Output: [25, 30, 27, 30, 22]  (27.5 is truncated to 27)
```

**➤ `df.replace()`**

* Replaces specified values with new ones.

```python
df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})  # assign back instead of inplace=True
# Output: ['Female', 'Male', 'Male', 'Male', 'Female']
```

**➤ `str.strip()` / `str.lower()` / `str.upper()`**

* Trims strings and adjusts case.

```python
df['Name'] = df['Name'].str.strip()
# Output: ['Alice', 'Bob', 'Charlie', 'Bob', None]

df['Name'] = df['Name'].str.upper()
# Output: ['ALICE', 'BOB', 'CHARLIE', 'BOB', None]
```

---

#### 📌 3. Code Example:

```python
import pandas as pd

# Raw data with issues
df = pd.DataFrame({
    'Name': [' Alice ', 'Bob', 'Charlie', 'Bob', None],
    'Age': [25, 30, None, 30, 22],
    'Gender': ['F', 'M', 'M', 'M', 'F']
})

# 1. Strip whitespace
df['Name'] = df['Name'].str.strip()
# df['Name']: ['Alice', 'Bob', 'Charlie', 'Bob', None]

# 2. Fill missing 'Age' with median
df['Age'] = df['Age'].fillna(df['Age'].median())
# df['Age']: [25.0, 30.0, 27.5, 30.0, 22.0]  (median of 22, 25, 30, 30 is 27.5)

# 3. Drop rows with missing 'Name'
df = df.dropna(subset=['Name'])
# Rows with None in 'Name' are removed

# 4. Remove duplicate rows
df = df.drop_duplicates()
# Keeps only unique rows

# 5. Replace 'M'/'F' with full gender labels
df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})
# df['Gender']: ['Female', 'Male', 'Male']
```

---

#### ⏱️ 4. When to Use:

* 🕐 Immediately after importing a dataset
* 🧪 Before visualizations, analysis, or machine learning modeling
* ⚙️ When encountering missing, duplicate, or inconsistent values

---

#### ⚠️ 5. Limitations and Challenges:

* 🧯 Risk of removing important data when dropping
* 🎯 Imputation might distort statistical integrity
* 🧠 Requires domain knowledge for accurate cleaning
* 🧪 Edge cases and text data often require custom logic
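
The "custom logic" for text data often amounts to normalizing strings before converting them. The following is a minimal sketch, assuming a hypothetical `Price` column stored as text with currency symbols and thousands separators (this column is not part of the example data above):

```python
import pandas as pd

# Hypothetical raw data: prices stored as text with symbols and separators
raw = pd.DataFrame({
    'Price': ['$1,200.50', ' 950 ', 'N/A', '$30'],
})

# Remove currency symbols, thousands separators, and surrounding whitespace
cleaned = (
    raw['Price']
    .str.replace(r'[$,]', '', regex=True)
    .str.strip()
)

# Convert to float; anything that still is not a number becomes NaN
raw['Price_num'] = pd.to_numeric(cleaned, errors='coerce')
print(raw)
```

Values such as `'N/A'` end up as `NaN`, so they can then be handled with the missing-value tools shown earlier.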

---



---

### 🔷 **2. Data Type Identification & Conversion** 🧬

---

#### 📖 1. Definition:

* This step involves **understanding and converting** the data types of columns in a dataset.
* Correct data types ensure efficient memory usage, accurate calculations, and proper function behavior.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):


**➤ `df.dtypes`**

* Returns data types of each column

```python
df.dtypes
# Output:
# Name       object
# Age       float64
# Gender     object
# DOB        object
```

**➤ `df.info()`**

* Gives a summary: column count, non-null count, and data types

```python
df.info()
# Output:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 5 entries, 0 to 4
# Data columns (total 4 columns):
#  #   Column  Non-Null Count  Dtype  
# ---  ------  --------------  -----  
#  0   Name    5 non-null      object 
#  1   Age     5 non-null      float64
#  2   Gender  5 non-null      object 
#  3   DOB     5 non-null      object
```

**➤ `df.astype(type)`**

* Converts column to a different data type

```python
df['Age'] = df['Age'].astype(int)
# Output: df['Age']: [25, 30, 25, 30, 22]
```

**➤ `pd.to_numeric()`**

* Converts values to numeric type (int or float), with error handling

```python
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# Output: Converts valid values, replaces invalid ones with NaN
```

**➤ `pd.to_datetime()`**

* Converts string column to datetime format

```python
df['DOB'] = pd.to_datetime(df['DOB'])
# Output: datetime64[ns]
```

**➤ `df.select_dtypes(include=...)`**

* Filters columns by their data type

```python
df.select_dtypes(include='object')
# Output: Only columns with dtype=object
```

---

#### 📌 3. Code Example:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': ['25', '30', '28'],     # String type
    'DOB': ['2000-01-01', '1998-05-06', '1995-03-10']  # String dates
})

# Check current data types
print(df.dtypes)
# Output:
# Name    object
# Age     object
# DOB     object

# Convert Age to int
df['Age'] = pd.to_numeric(df['Age'])
# Output: [25, 30, 28]

# Convert DOB to datetime
df['DOB'] = pd.to_datetime(df['DOB'])
# Output: datetime64[ns]

# Confirm changes
print(df.dtypes)
# Output:
# Name    object
# Age      int64
# DOB     datetime64[ns]
```

---

#### ⏱️ 4. When to Use:

* 💼 When importing external datasets (CSV, Excel, SQL)
* 🧠 Before applying functions that require specific types (e.g., math ops on numbers)
* 📊 When preparing features for machine learning models

---

#### ⚠️ 5. Limitations and Challenges:

* 🧯 Data with mixed types (e.g., '10', 'ten') may fail to convert
* ⛔ `astype()` will throw an error on invalid conversion
* 📅 Date parsing can fail with inconsistent formats
* 🧠 Requires manual inspection in ambiguous cases
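
One way to make the manual inspection easier is to coerce invalid entries to missing values and then list exactly which raw values failed. A small sketch, assuming hypothetical sample values:

```python
import pandas as pd

# Hypothetical column mixing numeric strings, a word, and a missing entry
s = pd.Series(['10', 'ten', '7.5', None, '3'])

# Invalid entries become NaN instead of raising an error
converted = pd.to_numeric(s, errors='coerce')

# Values that were present in the raw data but could not be converted
failed = s[converted.isna() & s.notna()]
print(failed.tolist())

# The same idea for dates: unparseable entries become NaT
dates = pd.Series(['2000-01-01', 'not a date'])
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed.isna().tolist())
```

Reviewing `failed` before imputing or dropping shows whether the bad values are typos, alternative spellings, or genuinely unusable entries.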

---




---

### 🔷 **3. Univariate Analysis** 📊

---

#### 📖 1. Definition:

* **Univariate analysis** is the examination of **a single variable**.
* It helps summarize and understand the distribution, central tendency, and spread of a variable.
* Can be performed on both **numerical** and **categorical** data.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):


**➤ `df['col'].value_counts()`**

* Counts frequency of unique values (best for categorical)

```python
df['Gender'].value_counts()
# Output:
# Male      3
# Female    2
```

**➤ `df['col'].unique()` / `df['col'].nunique()`**

* Returns unique values and number of unique values

```python
df['Gender'].unique()
# Output: ['Female', 'Male']  (order of first appearance)

df['Gender'].nunique()
# Output: 2
```

**➤ `df['col'].describe()`**

* Gives summary stats (count, mean, std, min, 25%, 50%, 75%, max)

```python
df['Age'].describe()
# Output:
# count     5.000000
# mean     26.400000
# std       3.507136
# min      22.000000
# 25%      25.000000
# 50%      25.000000
# 75%      30.000000
# max      30.000000
```

**➤ `df['col'].plot(kind='hist')` / `df['col'].hist()`**

* Plots histogram for numeric data

```python
df['Age'].plot(kind='hist')
# Output: Histogram showing distribution of Age
```

**➤ `df['col'].plot(kind='box')` / `sns.boxplot()`**

* Boxplot shows median, quartiles, and outliers

```python
import seaborn as sns
sns.boxplot(x=df['Age'])
# Output: Boxplot for Age distribution
```

**➤ `sns.countplot(x='col', data=df)`**

* Countplot for categorical features

```python
sns.countplot(x='Gender', data=df)
# Output: Bar chart showing count of Male/Female
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Age': [22, 25, 25, 30, 30],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female']
})

# Summary statistics
print(df['Age'].describe())
# Output:
# count     5.000000
# mean     26.400000
# std       3.507136
# min      22.000000
# 25%      25.000000
# 50%      25.000000
# 75%      30.000000
# max      30.000000

# Value counts for Gender
print(df['Gender'].value_counts())
# Output:
# Male      3
# Female    2

# Countplot for Gender
sns.countplot(x='Gender', data=df)
plt.show()

# Boxplot for Age
sns.boxplot(x=df['Age'])
plt.show()
```

---

#### ⏱️ 4. When to Use:

* 📊 Early in EDA to understand each variable individually
* 🎯 To detect outliers, skewness, and distribution shape
* ✅ Useful in feature selection (e.g., a column where every value is the same carries no information and can be dropped)

---

#### ⚠️ 5. Limitations and Challenges:

* 🚫 Doesn’t reveal relationships between variables
* 📉 Can miss hidden patterns unless visualized well
* ⚠️ Sensitive to outliers (especially for numerical data)
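
To illustrate the sensitivity to outliers, the sketch below compares mean and median on a hypothetical sample with one extreme value and reports skewness and kurtosis using pandas' built-in `skew()` and `kurt()`:

```python
import pandas as pd

# Hypothetical sample with one extreme value (95)
ages = pd.Series([22, 25, 25, 30, 30, 95])

print('mean:  ', ages.mean())     # pulled upward by the extreme value
print('median:', ages.median())   # stays near the bulk of the data
print('skew:  ', ages.skew())     # positive for a right-skewed sample
print('kurt:  ', ages.kurt())     # excess kurtosis (0 for a normal distribution)
```

A large gap between mean and median, together with clearly positive skewness, is a hint to inspect the tail before relying on mean-based summaries.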

---




---

### 🔷 **4. Bivariate Analysis** 🔗

---

#### 📖 1. Definition:

* **Bivariate analysis** explores the **relationship between two variables**.
* It helps detect **correlation**, **association**, or **trends** between features; on its own it cannot establish causation.
* Variable combinations can be:

  * 🔢 Numerical vs Numerical
  * 🧩 Categorical vs Categorical
  * 🧠 Categorical vs Numerical

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):


**➤ `df.corr()`**

* Computes pairwise correlation between numeric columns

```python
df.corr(numeric_only=True)  # numeric_only avoids errors from text columns in pandas 2.x
# Output:
#          Age  Salary
# Age     1.00    0.98
# Salary  0.98    1.00
```

**➤ `sns.heatmap()`**

* Visualizes the correlation matrix

```python
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
# Output: Heatmap with correlation coefficients
```

**➤ `sns.scatterplot(x=..., y=..., data=...)`**

* Scatter plot for numeric vs numeric

```python
sns.scatterplot(x='Age', y='Salary', data=df)
# Output: Points showing relationship
```

**➤ `sns.boxplot(x=..., y=..., data=...)`**

* Boxplot for numeric vs categorical

```python
sns.boxplot(x='Gender', y='Salary', data=df)
# Output: Salary distribution across genders
```

**➤ `pd.crosstab()`**

* Cross-tabulation (frequency table) of two categorical variables

```python
pd.crosstab(df['Gender'], df['Purchased'])
# Output:
# Purchased  No  Yes
# Gender            
# Female      1    1
# Male        1    2
```

**➤ `sns.countplot(x=..., hue=..., data=...)`**

* Countplot for comparing categories

```python
sns.countplot(x='Gender', hue='Purchased', data=df)
# Output: Side-by-side bars grouped by gender
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 26],
    'Salary': [25000, 30000, 50000, 42000, 38000],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'Purchased': ['Yes', 'No', 'Yes', 'Yes', 'No']
})

# Correlation matrix
print(df.corr(numeric_only=True))
# Output:
#              Age    Salary
# Age     1.000000  0.979143
# Salary  0.979143  1.000000

# Heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

# Scatter plot: Age vs Salary
sns.scatterplot(x='Age', y='Salary', data=df)
plt.show()

# Box plot: Gender vs Salary
sns.boxplot(x='Gender', y='Salary', data=df)
plt.show()

# Crosstab
print(pd.crosstab(df['Gender'], df['Purchased']))
# Output:
# Purchased  No  Yes
# Gender            
# Female      1    1
# Male        1    2

# Countplot
sns.countplot(x='Gender', hue='Purchased', data=df)
plt.show()
```

---

#### ⏱️ 4. When to Use:

* 🔍 To identify correlation, trends, or class separation
* 🧠 To detect possible data leakage or multicollinearity
* 🎯 Before selecting features for models

---

#### ⚠️ 5. Limitations and Challenges:

* ❌ Correlation ≠ causation
* 📉 Pearson correlation applies only to numeric variables and captures only linear relationships
* 🧪 Categorical relations may require deeper statistical testing (e.g., chi-square)
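
The difference between linear and rank-based correlation can be seen on a small, hypothetical example. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, which needs only pandas:

```python
import pandas as pd

# Hypothetical monotonic but non-linear relationship: y = x ** 3
x = pd.Series([1, 2, 3, 4, 5, 6])
y = x ** 3

pearson = x.corr(y)                  # measures linear association only
spearman = x.rank().corr(y.rank())   # Pearson on ranks = Spearman

print(round(pearson, 3), round(spearman, 3))
```

Because `y` always increases with `x`, the rank-based coefficient is 1, while the Pearson coefficient stays below 1 since the relationship is not a straight line.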

---




---

### 🔷 **5. Multivariate Analysis** 🧪📊

---

#### 📖 1. Definition:

* **Multivariate analysis** examines **three or more variables** simultaneously.
* It helps uncover **complex relationships**, **interactions**, and **patterns** that aren’t visible in univariate or bivariate analysis.
* Common in feature selection, hypothesis testing, and advanced visualization.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):


**➤ `sns.pairplot(data, hue=...)`**

* Plots pairwise relationships between numeric columns, color-coded by categorical column

```python
sns.pairplot(df, hue='Purchased')
# Output: Matrix of scatter plots and histograms
```

**➤ `sns.heatmap(df.corr(), annot=True)`**

* Heatmap for all variable correlations

```python
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='YlGnBu')
# Output: Correlation between all numeric features
```

**➤ `sns.scatterplot(x=..., y=..., hue=..., size=..., data=...)`**

* Scatterplot with third or fourth variable shown via color and size

```python
sns.scatterplot(x='Age', y='Salary', hue='Purchased', size='Experience', data=df)
# Output: Enhanced scatter plot
```

**➤ `pd.plotting.scatter_matrix()`**

* Similar to pairplot, built into pandas

```python
pd.plotting.scatter_matrix(df[['Age', 'Salary', 'Experience']], figsize=(8, 6))
# Output: Matrix of scatter plots
```

**➤ `sns.lmplot(x=..., y=..., hue=..., col=..., data=...)`**

* Fits regression lines and shows separation across multiple dimensions

```python
sns.lmplot(x='Age', y='Salary', hue='Purchased', col='Gender', data=df)
# Output: Multi-panel linear regression plots
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 26],
    'Salary': [25000, 30000, 50000, 42000, 38000],
    'Experience': [1, 3, 7, 6, 4],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'Purchased': ['Yes', 'No', 'Yes', 'Yes', 'No']
})

# Pairplot
sns.pairplot(df, hue='Purchased')
plt.show()

# Scatterplot with hue and size
sns.scatterplot(x='Age', y='Salary', hue='Purchased', size='Experience', data=df)
plt.show()

# Heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

# lmplot (multi-panel regression lines)
sns.lmplot(x='Age', y='Salary', hue='Purchased', col='Gender', data=df)
plt.show()
```

---

#### ⏱️ 4. When to Use:

* 🧠 For exploring complex relationships between multiple features
* 🧪 During feature selection and dimensionality reduction
* 🔬 For hypothesis generation in multivariate modeling

---

#### ⚠️ 5. Limitations and Challenges:

* ⚠️ High-dimensional plots can be hard to interpret
* 🖼️ Too many features = cluttered visuals
* 🔍 May require advanced techniques (PCA, t-SNE) for visual clarity
* 🧮 Computationally expensive for large datasets
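
As a rough illustration of the PCA idea mentioned above, the sketch below projects the three numeric columns of the example data onto two components using a plain NumPy SVD. This is a minimal sketch rather than a replacement for a library implementation such as scikit-learn's `PCA`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 26],
    'Salary': [25000, 30000, 50000, 42000, 38000],
    'Experience': [1, 3, 7, 6, 4],
})

# Standardize so each feature contributes on the same scale
X = df.to_numpy(dtype=float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardized matrix
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
components_2d = X_std @ Vt[:2].T        # project onto the first two components
explained = (S ** 2) / (S ** 2).sum()   # share of variance per component

print(components_2d.shape)
print(explained.round(3))
```

The two columns of `components_2d` can be passed to a scatter plot; `explained` indicates how much of the total variance each component retains.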

---




---

### 🔷 **6. Outlier Detection** 🚨📉

---

#### 📖 1. Definition:

* **Outliers** are data points that significantly differ from the rest of the dataset.
* Outlier detection matters for both **data cleaning** and **model accuracy**.
* Outliers can result from **errors**, **natural variance**, or **rare events**.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):


**➤ `df.describe()`**

* Shows summary statistics; useful for spotting large deviations

```python
df['Salary'].describe()
# Output (values rounded):
# count         6.00
# mean      66666.67
# std       46224.09
# min       25000.00
# 25%       33750.00
# 50%       57500.00
# 75%       77500.00
# max      150000.00  ← possible outlier
```

**➤ `sns.boxplot()`**

* Visual tool to detect outliers using IQR (dots beyond whiskers = outliers)

```python
sns.boxplot(x=df['Salary'])
plt.show()
# Output: Boxplot with outliers shown as individual dots
```

**➤ `zscore` from `scipy.stats`**

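* Detects outliers using standard deviations from the mean (absolute Z-score above 3); needs a reasonably large sample, since small samples cannot produce large Z-scores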

```python
from scipy.stats import zscore
import numpy as np

z_scores = zscore(df['Salary'])
outliers = df[np.abs(z_scores) > 3]
print(outliers)
# Output: Rows with Z-score outliers
```

**➤ IQR method (manual)**

* Flags values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`

```python
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Salary'] < Q1 - 1.5 * IQR) | (df['Salary'] > Q3 + 1.5 * IQR)]
print(outliers)
# Output: Data points outside the acceptable range
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore
import numpy as np

df = pd.DataFrame({
    'Salary': [25000, 30000, 45000, 70000, 80000, 150000]  # 150000 may be an outlier
})

# Describe
print(df['Salary'].describe())
# Output (values rounded):
# count         6.00
# mean      66666.67
# std       46224.09
# min       25000.00
# 25%       33750.00
# 50%       57500.00
# 75%       77500.00
# max      150000.00

# Boxplot
sns.boxplot(x=df['Salary'])
plt.show()

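# Z-score method
z_scores = zscore(df['Salary'])
outliers_z = df[np.abs(z_scores) > 3]
print(outliers_z)
# Output:
# Empty DataFrame
# Columns: [Salary]
# Index: []
# The largest |z| here is about 1.97 (for 150000), so the ±3 rule flags nothing;
# with only 6 points, |z| can never exceed (n - 1) / sqrt(n) ≈ 2.04.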

# IQR method
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['Salary'] < Q1 - 1.5 * IQR) | (df['Salary'] > Q3 + 1.5 * IQR)]
print(outliers_iqr)
# Output:
#     Salary
# 5  150000
```

---

#### ⏱️ 4. When to Use:

* 🔍 Before modeling or normalization steps
* 💡 In fraud detection, anomaly detection, or quality control
* 📊 In datasets where variance matters (e.g., salaries, prices)

---

#### ⚠️ 5. Limitations and Challenges:

* ❗ May incorrectly remove **legitimate rare events**
* ⚠️ Not all outliers are “bad” (some may carry insights)
* ⚙️ Sensitive to method used (Z-score, IQR, etc.)
* 🧠 Must consider domain knowledge before removing outliers
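
The sensitivity to the chosen method shows up even on the small salary sample used above. The sketch below wraps both rules in two helper functions (`iqr_mask` and `zscore_mask` are hypothetical names introduced here for illustration) and applies them to the same data:

```python
import pandas as pd

salary = pd.Series([25000, 30000, 45000, 70000, 80000, 150000])

def iqr_mask(s, k=1.5):
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def zscore_mask(s, threshold=3.0):
    """True where |z| exceeds the threshold (population std, ddof=0)."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold

print(salary[iqr_mask(salary)].tolist())     # IQR rule flags 150000
print(salary[zscore_mask(salary)].tolist())  # Z-score rule flags nothing here
```

The IQR rule flags 150000 while the Z-score rule does not, because with six observations no value can reach an absolute Z-score of 3. On small samples the IQR rule (or a lower Z threshold chosen with domain knowledge) is usually the more informative check.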

---




---

### 🔷 **7. Handling Missing Values** 🧩🛠️

---

#### 📖 1. Definition:

* **Missing values** are blank, `NaN`, or `None` entries in the dataset.
* They can arise due to data corruption, manual errors, or unrecorded info.
* Handling them is essential to ensure clean, complete, and reliable models.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):


**➤ `df.isnull()` / `df.isnull().sum()`**

* Detects missing values and counts them

```python
print(df.isnull())
# Output:
#     Age  Salary  Gender
# 0  False   False   False
# 1  False   False   False
# 2  False   False   False
# 3  False    True   False
# 4  False   False    True

print(df.isnull().sum())
# Output:
# Age       0
# Salary    1
# Gender    1
```

**➤ `df.dropna()`**

* Removes rows with any missing values

```python
df_clean = df.dropna()
print(df_clean)
# Output: Rows where all values are present
```

**➤ `df.fillna(value)`**

* Fills missing values with a specified value

```python
df_filled = df.fillna(0)
print(df_filled)
# Output: Missing values replaced with 0
```

**➤ `df['col'].ffill()` / `df['col'].bfill()`**

* Forward or backward fill from adjacent rows (`fillna(method=...)` is deprecated in recent pandas)

```python
df['Gender'] = df['Gender'].ffill()
# Output: Fills with previous row value
```

**➤ `df['col'].fillna(df['col'].mean())`**

* Fills numeric column missing values with mean

```python
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
# Output: Fills with average salary
```

---

#### 📌 3. Code Example:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 26],
    'Salary': [25000, 30000, 50000, None, 38000],
    'Gender': ['Female', 'Male', 'Male', 'Male', None]
})

# Check for missing values
print(df.isnull().sum())
# Output:
# Age       0
# Salary    1
# Gender    1

# Fill Salary with mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print(df['Salary'])
# Output:
# 0    25000.0
# 1    30000.0
# 2    50000.0
# 3    35750.0   ← mean of other 4 salaries
# 4    38000.0

# Forward fill Gender
df['Gender'] = df['Gender'].ffill()
print(df['Gender'])
# Output:
# 0    Female
# 1      Male
# 2      Male
# 3      Male
# 4      Male   ← copied from row 3
```

---

#### ⏱️ 4. When to Use:

* 📋 Always check for missing values during EDA
* 🛠️ Use drop when only a small % of data is missing
* 📊 Use imputation (mean/median for numeric columns, mode for categorical ones) when values are missing at random

---

#### ⚠️ 5. Limitations and Challenges:

* ❌ Dropping rows may lose valuable information
* 📉 Filling with averages may distort variance
* 🧠 Requires domain knowledge to fill accurately
* ⚠️ May bias the model if patterns in missingness are ignored
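
One way to avoid losing the missingness pattern is to report it first and keep an indicator column before imputing. A minimal sketch on the example data; the `Salary_missing` column name is an assumption made for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 26],
    'Salary': [25000, 30000, 50000, None, 38000],
    'Gender': ['Female', 'Male', 'Male', 'Male', None],
})

# Count and percentage of missing values per column
report = pd.DataFrame({
    'missing': df.isna().sum(),
    'percent': df.isna().mean() * 100,
})
print(report)

# Keep a flag for "was missing" before imputing, so the pattern is not lost
df['Salary_missing'] = df['Salary'].isna()
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
print(df)
```

The indicator column lets a later model or analysis check whether missingness itself is related to the target.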

---




---

### 🔷 **8. Feature Engineering** 🛠️✨

---

#### 📖 1. Definition:

* **Feature Engineering** is the process of **creating, transforming, or selecting** variables (features) to improve the performance of machine learning models.
* It involves techniques like encoding categorical data, scaling numerical features, creating interaction terms, and extracting new variables.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):


**➤ `pd.get_dummies()`**

* Converts categorical variables into dummy/indicator variables (one-hot encoding)

```python
pd.get_dummies(df['Gender'], dtype=int)  # dtype=int gives 0/1; pandas 2.x defaults to True/False
# Output:
#    Female  Male
# 0       1     0
# 1       0     1
# 2       0     1

**➤ `sklearn.preprocessing.StandardScaler()`**

* Standardizes features by removing the mean and scaling to unit variance

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['Age', 'Salary']])
print(scaled)
# Output: 2D array with scaled values
```

**➤ `sklearn.preprocessing.MinMaxScaler()`**

* Scales features to a given range (default 0 to 1)

```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[['Age', 'Salary']])
print(scaled)
# Output: 2D array with scaled values between 0 and 1
```

**➤ `df['new_feature'] = df['Age'] * df['Salary']`**

* Creating interaction terms or new features by combining existing ones

```python
df['Age_Salary'] = df['Age'] * df['Salary']
print(df['Age_Salary'])
# Output: Series with multiplied values
```

**➤ `df['col'].astype('category').cat.codes`**

* Label encoding categorical columns

```python
df['Gender_code'] = df['Gender'].astype('category').cat.codes
print(df['Gender_code'])
# Output:
# 0    0
# 1    1
# 2    1
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 26],
    'Salary': [25000, 30000, 50000, 42000, 38000],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female']
})

# One-hot encoding
gender_dummies = pd.get_dummies(df['Gender'], dtype=int)  # dtype=int gives 0/1 instead of True/False
print(gender_dummies)
# Output:
#    Female  Male
# 0       1     0
# 1       0     1
# 2       0     1
# 3       0     1
# 4       1     0

# Standard scaling
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['Age', 'Salary']])
print(scaled_features)
# Output (rounded to 3 decimals):
# [[-1.548 -1.362]
#  [-0.442 -0.795]
#  [ 1.401  1.476]
#  [ 0.663  0.568]
#  [-0.074  0.114]]

# Min-Max scaling
minmax = MinMaxScaler()
scaled_minmax = minmax.fit_transform(df[['Age', 'Salary']])
print(scaled_minmax)
# Output (rounded):
# [[0.    0.  ]
#  [0.375 0.2 ]
#  [1.    1.  ]
#  [0.75  0.68]
#  [0.5   0.52]]

# Interaction feature
df['Age_Salary'] = df['Age'] * df['Salary']
print(df['Age_Salary'])
# Output:
# 0     550000
# 1     750000
# 2    1500000
# 3    1176000
# 4     988000

# Label encoding
df['Gender_code'] = df['Gender'].astype('category').cat.codes
print(df['Gender_code'])
# Output:
# 0    0
# 1    1
# 2    1
# 3    1
# 4    0
```

---

#### ⏱️ 4. When to Use:

* 🔧 Before training models to improve accuracy and performance
* 🧠 When dealing with categorical variables
* ⚖️ To scale features for models sensitive to feature magnitude (e.g., SVM, KNN)

---

#### ⚠️ 5. Limitations and Challenges:

* ⚠️ Over-engineering can cause overfitting
* 🕵️‍♂️ Requires domain knowledge for meaningful features
* ⏳ Time-consuming and may increase model complexity
* 🔄 Some transformations require reversing for model interpretability
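
Two other common feature-engineering steps, extracting date parts and binning, can be sketched with pandas alone. The `Joined` column and the bin labels below are hypothetical and only serve the illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 26],
    'Joined': pd.to_datetime(['2021-01-15', '2020-06-01', '2019-11-30',
                              '2022-03-12', '2021-08-23']),  # hypothetical column
})

# Date parts
df['Join_year'] = df['Joined'].dt.year
df['Join_month'] = df['Joined'].dt.month
df['Join_weekday'] = df['Joined'].dt.day_name()

# Equal-width bins (same interval width) vs. equal-frequency bins (similar row counts)
df['Age_bin_width'] = pd.cut(df['Age'], bins=2, labels=['low', 'high'])
df['Age_bin_freq'] = pd.qcut(df['Age'], q=2, labels=['younger', 'older'])
print(df)
```

`pd.cut` splits the value range into intervals of equal width, while `pd.qcut` places the cut points at quantiles, which tends to be more robust when the distribution is skewed.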

---




---

### 🔷 **9. Data Transformation and Scaling** ⚙️📏

---

#### 📖 1. Definition:

* **Data transformation and scaling** involve changing the range or distribution of data features.
* It makes data suitable for machine learning algorithms by normalizing, standardizing, or applying mathematical transformations.
* Essential for algorithms sensitive to feature scale (e.g., KNN, SVM, gradient descent).

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):


**➤ `sklearn.preprocessing.StandardScaler()`**

* Standardizes features by removing mean and scaling to unit variance

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['Age', 'Salary']])
print(scaled)
# Output: 2D numpy array of standardized values
```

**➤ `sklearn.preprocessing.MinMaxScaler()`**

* Scales features to a fixed range, usually [0, 1]

```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[['Age', 'Salary']])
print(scaled)
# Output: 2D numpy array with values between 0 and 1
```

**➤ `sklearn.preprocessing.MaxAbsScaler()`**

* Scales each feature by its maximum absolute value (keeps sparse data intact)

```python
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
scaled = scaler.fit_transform(df[['Age', 'Salary']])
print(scaled)
# Output: values between -1 and 1 scaled by max abs value
```

**➤ `sklearn.preprocessing.PowerTransformer()`**

* Applies power transformations to make data more Gaussian-like (Box-Cox or Yeo-Johnson)

```python
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(df[['Salary']])
print(transformed)
# Output: transformed data array
```

**➤ `np.log1p()` and `np.sqrt()`**

* Log and square root transformations for skewed data

```python
import numpy as np
df['Log_Salary'] = np.log1p(df['Salary'])
print(df['Log_Salary'])
# Output: log-transformed salary values
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer
import numpy as np

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 26],
    'Salary': [25000, 30000, 50000, 42000, 38000]
})

# Standard scaling
scaler = StandardScaler()
scaled_std = scaler.fit_transform(df[['Age', 'Salary']])
print(scaled_std)
# Output (rounded to 3 decimals):
# [[-1.548 -1.362]
#  [-0.442 -0.795]
#  [ 1.401  1.476]
#  [ 0.663  0.568]
#  [-0.074  0.114]]

# Min-Max scaling
minmax = MinMaxScaler()
scaled_minmax = minmax.fit_transform(df[['Age', 'Salary']])
print(scaled_minmax)
# Output (rounded):
# [[0.    0.  ]
#  [0.375 0.2 ]
#  [1.    1.  ]
#  [0.75  0.68]
#  [0.5   0.52]]

# Power transformation (Yeo-Johnson)
pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(df[['Salary']])
print(transformed)
# Output: (5, 1) array with zero mean and unit variance; exact values depend on the fitted lambda

# Log transformation for skewness reduction
df['Log_Salary'] = np.log1p(df['Salary'])
print(df['Log_Salary'])
# Output (rounded to 4 decimals):
# 0    10.1267
# 1    10.3090
# 2    10.8198
# 3    10.6454
# 4    10.5454
```

---

#### ⏱️ 4. When to Use:

* ⚖️ When features have different units or scales
* 🧠 Before applying distance-based ML algorithms (KNN, SVM)
* 📉 To reduce skewness and improve normality assumptions
* 🚀 To speed up convergence in gradient descent algorithms

---

#### ⚠️ 5. Limitations and Challenges:

* 🔄 Transformation may make data less interpretable
* ❗ Not all transformations fit every data distribution
* ⚠️ Improper scaling may hurt model performance
* 🕵️‍♂️ Requires consistent scaling on train and test datasets
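
Consistent scaling means the statistics are learned from the training split only and then reused unchanged on the test split. The sketch below shows the idea with plain pandas on hypothetical train/test frames; it uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows:

```python
import pandas as pd

# Hypothetical train/test split
train = pd.DataFrame({'Age': [22, 25, 30, 28], 'Salary': [25000, 30000, 50000, 42000]})
test = pd.DataFrame({'Age': [26, 40], 'Salary': [38000, 90000]})

# Learn the statistics from the training split only
mean = train.mean()
std = train.std(ddof=0)

# Apply the same statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(test_scaled)
```

With scikit-learn the same pattern is `scaler.fit(train)` followed by `scaler.transform(test)`; calling `fit_transform` on the test split would leak test statistics into preprocessing.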

---




---

### 🔷 **10. Feature Selection** 🎯🔍

---

#### 📖 1. Definition:

* **Feature Selection** is the process of selecting a subset of relevant features (variables) for model building.
* It helps improve model performance, reduces overfitting, decreases training time, and enhances interpretability.
* Methods include filter, wrapper, and embedded approaches.
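
To make the three families concrete, here is a small sketch with one representative of each on the iris data. The choice of `SelectFromModel` with a random forest as the embedded method is an assumption made for illustration; any estimator that exposes importances or coefficients could play that role.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: score each feature on its own, independent of any model (ANOVA F-test)
filter_mask = SelectKBest(score_func=f_classif, k=2).fit(X, y).get_support()

# Wrapper: repeatedly refit a model and drop the weakest feature
wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
wrapper_mask = wrapper.fit(X, y).get_support()

# Embedded: selection is a by-product of model training (tree importances here)
embedded = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0), threshold="median"
)
embedded_mask = embedded.fit(X, y).get_support()

print("filter  :", filter_mask)
print("wrapper :", wrapper_mask)
print("embedded:", embedded_mask)
```

Each mask is a boolean array over the four iris features; the methods do not have to agree on which features they keep.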

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `sklearn.feature_selection.SelectKBest`**

* Selects the top k features based on a scoring function (e.g., chi2, f\_classif)

```python
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)
# Output: (n_samples, 2)
```

**➤ `sklearn.feature_selection.RFE` (Recursive Feature Elimination)**

* Repeatedly fits a model, drops the least important feature(s), and refits until the requested number of features remains

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)
print(rfe.support_)
# Output: array indicating selected features (True/False)
```

**➤ `sklearn.feature_selection.VarianceThreshold`**

* Removes features with low variance (below threshold)

```python
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.1)
X_var = sel.fit_transform(X)
print(X_var.shape)
# Output: shape after removing low variance features
```

**➤ `feature_importances_` attribute (e.g., from RandomForestClassifier)**

* Feature importance from tree-based models

```python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
print(model.feature_importances_)
# Output: array of feature importance scores
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Load example data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# SelectKBest example
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print("SelectKBest shape:", X_new.shape)
# Output: SelectKBest shape: (150, 2)

# Recursive Feature Elimination
model = LogisticRegression(max_iter=200)
rfe = RFE(model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)
print("RFE selected features:", rfe.support_)
# Output: a boolean mask with exactly two True entries; which features are kept
# depends on the estimator and the scikit-learn version

# Variance Threshold
sel = VarianceThreshold(threshold=0.5)
X_var = sel.fit_transform(X)
print("VarianceThreshold shape:", X_var.shape)
# Output: VarianceThreshold shape: (150, 3)  (only sepal width has variance below 0.5)

# Feature Importances from Random Forest
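rf = RandomForestClassifier(random_state=42)  # fixed seed for repeatable importances
rf.fit(X, y)
print("Feature importances:", rf.feature_importances_)
# Output: four non-negative scores that sum to 1; for iris the two petal features typically score highest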
```

---

#### ⏱️ 4. When to Use:

* 🚀 To reduce dimensionality and improve model speed
* 🔍 When many features are irrelevant or redundant
* 🛡️ To reduce overfitting risk
* 🤖 To improve model interpretability

---

#### ⚠️ 5. Limitations and Challenges:

* ⚠️ Incorrect feature removal may lose important info
* 🕰️ Some methods are computationally expensive on large datasets
* 🧩 Selecting too few features can underfit the model
* 🧠 Requires tuning and domain expertise
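
The tuning mentioned in the last point can be automated to a degree. The sketch below treats the number of selected features as a hyperparameter and searches over it with cross-validation. Putting the selector inside a `Pipeline` is a design choice assumed here so that feature scores are recomputed on each training fold rather than on the full dataset.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),  # k is set by the grid search
    ("clf", LogisticRegression(max_iter=1000)),
])

# Try every possible k for the four iris features with 5-fold cross-validation
search = GridSearchCV(pipe, param_grid={"select__k": [1, 2, 3, 4]}, cv=5)
search.fit(X, y)

print("Best k:", search.best_params_["select__k"])
print("Mean CV accuracy:", round(search.best_score_, 3))
```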

---




---

### 🔷 **11. Outlier Detection and Treatment** 🚨📊

---

#### 📖 1. Definition:

* **Outliers** are data points that significantly differ from other observations.
* Detecting and treating outliers is important to avoid skewed results and improve model robustness.
* Treatment includes removal, transformation, or capping (winsorizing).

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ Using Z-Score (from `scipy.stats`)**

* Identifies outliers by measuring how many standard deviations a point is from the mean

```python
from scipy.stats import zscore
import numpy as np

z_scores = zscore(df['Salary'])
outliers = np.where(np.abs(z_scores) > 3)
print(outliers)
# Output: indices of outliers
```

**➤ Using IQR Method**

* Flags points below Q1 - 1.5\*IQR or above Q3 + 1.5\*IQR as outliers

```python
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Salary'] < Q1 - 1.5*IQR) | (df['Salary'] > Q3 + 1.5*IQR)]
print(outliers)
# Output: DataFrame rows considered outliers
```

**➤ `sklearn.ensemble.IsolationForest`**

* An unsupervised method to detect outliers

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(df[['Salary']])
outliers = df[yhat == -1]
print(outliers)
# Output: DataFrame rows detected as outliers
```

**➤ Capping or Winsorizing (using `numpy.clip`)**

* Limits values outside a range to the boundary values

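```python
import numpy as np

# Cap at the 5th and 95th percentiles (the percentile choice is a judgment call)
lower_bound = df['Salary'].quantile(0.05)
upper_bound = df['Salary'].quantile(0.95)
df['Salary_capped'] = np.clip(df['Salary'], lower_bound, upper_bound)
print(df['Salary_capped'])
# Output: Series with capped values
```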

---

#### 📌 3. Code Example:

```python
import pandas as pd
import numpy as np
from scipy.stats import zscore
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({
    'Salary': [25000, 30000, 50000, 42000, 38000, 150000]  # 150000 is an outlier
})

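# Z-score method
# With only 6 points, |z| cannot exceed sqrt(6 - 1) ≈ 2.24, so the usual
# threshold of 3 would flag nothing; a threshold of 2 is used here instead.
z_scores = zscore(df['Salary'])
outlier_indices = np.where(np.abs(z_scores) > 2)[0]
print("Z-score outliers indices:", outlier_indices)
# Output: Z-score outliers indices: [5]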

# IQR method
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['Salary'] < Q1 - 1.5*IQR) | (df['Salary'] > Q3 + 1.5*IQR)]
print("IQR outliers:\n", outliers_iqr)
# Output:
# IQR outliers:
#    Salary
# 5  150000

# Isolation Forest
iso = IsolationForest(contamination=0.1, random_state=42)
yhat = iso.fit_predict(df[['Salary']])
outliers_iforest = df[yhat == -1]
print("Isolation Forest outliers:\n", outliers_iforest)
# Output:
# Isolation Forest outliers:
#    Salary
# 5  150000

# Winsorizing/Capping
lower_bound = df['Salary'].quantile(0.05)
upper_bound = df['Salary'].quantile(0.95)
df['Salary_capped'] = np.clip(df['Salary'], lower_bound, upper_bound)
print(df['Salary_capped'])
# Output:
# 0     26250.0
# 1     30000.0
# 2     50000.0
# 3     42000.0
# 4     38000.0
# 5    125000.0
# Name: Salary_capped, dtype: float64
```

---

#### ⏱️ 4. When to Use:

* ⚠️ When extreme values may distort analysis or model training
* 🛡️ To improve model robustness and accuracy
* 🔎 When suspicious or erroneous data points exist
* 🔄 Before normalization or scaling

---

#### ⚠️ 5. Limitations and Challenges:

* ❌ Removing outliers might lose important rare events
* 🔄 Capping can bias the data distribution
* 🤔 Defining “outlier” thresholds can be subjective
* 🧠 Some methods are sensitive to parameter tuning (e.g., contamination in IsolationForest)
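
How much the threshold matters is easy to see on a small sample. The sketch below reuses the six salary values from the code example and counts the flagged points at several z-score cut-offs; the specific thresholds are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import zscore

salary = np.array([25000, 30000, 50000, 42000, 38000, 150000])
z = np.abs(zscore(salary))  # uses the population standard deviation (ddof=0) by default

for threshold in (2.0, 2.5, 3.0):
    flagged = salary[z > threshold]
    print(f"|z| > {threshold}: {flagged}")

# With n points and ddof=0, no |z| can exceed sqrt(n - 1)
print("Largest possible |z| for n=6:", np.sqrt(len(salary) - 1))
```

Only the loosest cut-off flags 150000 here. With so few points, the common threshold of 3 cannot be reached at all, which is one reason the IQR rule is often preferred for small samples.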

---




---

### 🔷 **12. Data Visualization for EDA** 📊🎨

---

#### 📖 1. Definition:

* **Data Visualization** is the graphical representation of data to identify patterns, trends, outliers, and relationships.
* It helps to quickly understand data distribution and underlying structure during exploratory data analysis (EDA).
* Common plots include histograms, box plots, scatter plots, bar charts, and heatmaps.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `matplotlib.pyplot.hist()`**

* Creates a histogram to visualize the frequency distribution of a numeric variable

```python
import matplotlib.pyplot as plt

plt.hist(df['Age'], bins=5)
plt.show()
# Output: Histogram plot showing distribution of 'Age'
```

**➤ `seaborn.boxplot()`**

* Displays the distribution and detects outliers using box plots

```python
import seaborn as sns

sns.boxplot(x=df['Salary'])
plt.show()
# Output: Box plot with median, quartiles, and outliers of 'Salary'
```

**➤ `matplotlib.pyplot.scatter()`**

* Plots scatter plot to visualize relationships between two numeric variables

```python
plt.scatter(df['Age'], df['Salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
# Output: Scatter plot showing Age vs Salary
```

**➤ `seaborn.heatmap()`**

* Shows correlation matrix or data values as color-coded matrix

```python
import seaborn as sns

corr = df.corr()
sns.heatmap(corr, annot=True)
plt.show()
# Output: Heatmap of correlation coefficients with annotations
```

**➤ `pandas.DataFrame.plot()`**

* Quick plotting method for line, bar, and other charts

```python
df['Salary'].plot(kind='bar')
plt.show()
# Output: Bar chart of Salary values
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 26],
    'Salary': [25000, 30000, 50000, 42000, 38000]
})

# Histogram of Age
plt.hist(df['Age'], bins=5, color='skyblue')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
# Output: Histogram plot

# Box plot of Salary
sns.boxplot(x=df['Salary'], color='lightgreen')
plt.title('Salary Distribution')
plt.show()
# Output: Box plot (no points fall outside the whiskers for this sample)

# Scatter plot of Age vs Salary
plt.scatter(df['Age'], df['Salary'], color='purple')
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
# Output: Scatter plot

# Correlation heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
# Output: Heatmap with correlation coefficients

# Bar plot of Salary
df['Salary'].plot(kind='bar', color='orange')
plt.title('Salary Bar Chart')
plt.xlabel('Index')
plt.ylabel('Salary')
plt.show()
# Output: Bar chart
```

---

#### ⏱️ 4. When to Use:

* 👁️ To visually explore and understand data distribution
* 🔍 To detect outliers, trends, and patterns
* 🤝 To investigate relationships between variables
* 📊 Before modeling to guide feature engineering

---

#### ⚠️ 5. Limitations and Challenges:

* 🖼️ Visualizations can be misleading if axes or scales are inappropriate
* 🧩 Complex datasets may require advanced or customized plots
* 📏 Limited to human interpretation and might miss subtle patterns
* 🕰️ Large datasets can slow plotting and clutter visuals

---




---

### 🔷 **13. Handling Missing Data** 🕳️❓

---

#### 📖 1. Definition:

* **Handling Missing Data** involves identifying and dealing with absent or null values in datasets.
* Missing data can cause biased analysis and affect model performance.
* Strategies include deletion, imputation (mean, median, mode), or predictive modeling.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `pandas.isnull()` / `pandas.isna()`**

* Detects missing values in DataFrame or Series

```python
import pandas as pd

df.isnull()
# Output: DataFrame of booleans indicating NaNs
```

**➤ `pandas.notnull()` / `pandas.notna()`**

* Detects non-missing values

```python
df.notnull()
# Output: DataFrame of booleans indicating non-NaNs
```

**➤ `pandas.DataFrame.dropna()`**

* Removes rows or columns containing missing values

```python
df.dropna(axis=0, inplace=True)
print(df)
# Output: DataFrame with rows containing NaNs removed
```

**➤ `pandas.DataFrame.fillna()`**

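* Fills missing values with a specified value (e.g., the column mean, median, or mode); forward and backward fill are available through `ffill()` and `bfill()`

```python
# Assign the result back: fillna(inplace=True) on a selected column is
# unreliable and raises a warning in recent pandas versions
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
# Output: DataFrame with NaNs in 'Age' filled by mean
```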

**➤ `sklearn.impute.SimpleImputer`**

* Imputation transformer for filling missing values with strategies such as mean, median, most frequent

```python
from sklearn.impute import SimpleImputer
import numpy as np

imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X)
print(X_imputed)
# Output: Numpy array with missing values imputed by mean
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
import numpy as np

df = pd.DataFrame({
    'Age': [25, np.nan, 30, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 42000, 38000]
})

# Detect missing values
print(df.isnull())
# Output:
#      Age  Salary
# 0  False   False
# 1   True   False
# 2  False    True
# 3  False   False
# 4   True   False

# Drop rows with any missing values
df_drop = df.dropna()
print(df_drop)
# Output (rows 1, 2, and 4 each contain a NaN and are removed):
#     Age   Salary
# 0  25.0  50000.0
# 3  28.0  42000.0

# Fill missing values with mean (assign back instead of using inplace=True on a column)
df_fill = df.copy()
df_fill['Age'] = df_fill['Age'].fillna(df_fill['Age'].mean())
df_fill['Salary'] = df_fill['Salary'].fillna(df_fill['Salary'].mean())
print(df_fill)
# Output:
#          Age   Salary
# 0  25.000000  50000.0
# 1  27.666667  60000.0
# 2  30.000000  47500.0
# 3  28.000000  42000.0
# 4  27.666667  38000.0

# Using SimpleImputer
imp = SimpleImputer(strategy='mean')
X = df.values
X_imputed = imp.fit_transform(X)
print(X_imputed)
# Output (NumPy may display these values in scientific notation):
# [[25.         50000.       ]
#  [27.66666667 60000.       ]
#  [30.         47500.       ]
#  [28.         42000.       ]
#  [27.66666667 38000.       ]]
```

---

#### ⏱️ 4. When to Use:

* ❗ When dataset contains missing or null values
* 💡 Before applying machine learning models that don’t handle missing values natively
* 🔄 To improve data quality and completeness
* 🔧 When cleaning data or preparing for analysis

---

#### ⚠️ 5. Limitations and Challenges:

* ❌ Dropping data can reduce dataset size and lose information
* 🤔 Imputation may introduce bias or dilute variance
* 🧩 Some missing data patterns are not random and require careful handling
* 🕵️ Complex methods can be computationally expensive
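
The variance point can be checked numerically. In the sketch below (the `age` values are made up for illustration), filling gaps with the mean adds observations that sit exactly at the centre, so the spread of the column shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical column with three missing entries
age = pd.Series([25, np.nan, 30, 28, np.nan, 41, 19, np.nan])

observed_var = age.var()          # NaNs are skipped by default
imputed = age.fillna(age.mean())  # every gap becomes the mean (28.6)
imputed_var = imputed.var()

print("Variance of observed values:   ", round(observed_var, 2))
print("Variance after mean imputation:", round(imputed_var, 2))
```

The sum of squared deviations stays the same while the number of rows grows, so the variance drops. Models that rely on the spread of a feature may then see it as less informative than it really is.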

---




---

### 🔷 **14. Data Transformation and Scaling** ⚙️📏

---

#### 📖 1. Definition:

* **Data Transformation and Scaling** refers to modifying data to a suitable format or scale for better analysis or model performance.
* Common techniques include normalization, standardization, log transformation, and power transforms.
* Helps models converge faster and prevents features with large scales from dominating.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `sklearn.preprocessing.StandardScaler`**

* Standardizes features by removing the mean and scaling to unit variance (z-score normalization)

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()
data_scaled = scaler.fit_transform(np.array([[1], [2], [3], [4], [5]]))
print(data_scaled)
# Output: array([[-1.41421356], [-0.70710678], [0.], [0.70710678], [1.41421356]])
```

**➤ `sklearn.preprocessing.MinMaxScaler`**

* Scales features to a fixed range, usually \[0, 1]

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(np.array([[1], [2], [3], [4], [5]]))
print(data_scaled)
# Output: array([[0.], [0.25], [0.5], [0.75], [1.]])
```

**➤ `numpy.log()` and `numpy.log1p()`**

* Applies logarithmic transformation, useful for skewed data

```python
import numpy as np

data = np.array([1, 10, 100, 1000])
log_data = np.log(data)
print(log_data)
# Output: array([0.        , 2.30258509, 4.60517019, 6.90775528])
```

**➤ `sklearn.preprocessing.PowerTransformer`**

* Applies power transforms like Box-Cox or Yeo-Johnson to stabilize variance and make data more Gaussian-like

```python
from sklearn.preprocessing import PowerTransformer

data = np.array([[1], [2], [3], [4], [5]])
pt = PowerTransformer(method='yeo-johnson')
data_transformed = pt.fit_transform(data)
print(data_transformed)
# Output: a (5, 1) array with zero mean and unit variance; the exact values
# depend on the fitted lambda (pt.lambdas_) and need not match StandardScaler
```

---

#### 📌 3. Code Example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer

data = np.array([[1], [2], [3], [4], [5]])

# StandardScaler (z-score normalization)
scaler_std = StandardScaler()
data_std = scaler_std.fit_transform(data)
print("StandardScaler:\n", data_std)
# Output:
# StandardScaler:
# [[-1.41421356]
#  [-0.70710678]
#  [ 0.        ]
#  [ 0.70710678]
#  [ 1.41421356]]

# MinMaxScaler (scale to [0,1])
scaler_minmax = MinMaxScaler()
data_minmax = scaler_minmax.fit_transform(data)
print("MinMaxScaler:\n", data_minmax)
# Output:
# MinMaxScaler:
# [[0.  ]
#  [0.25]
#  [0.5 ]
#  [0.75]
#  [1.  ]]

# Log Transformation
log_data = np.log(data)
print("Log Transformation:\n", log_data)
# Output:
# Log Transformation:
# [[0.        ]
#  [0.69314718]
#  [1.09861229]
#  [1.38629436]
#  [1.60943791]]

# Power Transformer (Yeo-Johnson)
pt = PowerTransformer(method='yeo-johnson')
data_pt = pt.fit_transform(data)
print("PowerTransformer (Yeo-Johnson):\n", data_pt)
# Output: a (5, 1) array with zero mean and unit variance; the exact values
# depend on the fitted lambda (pt.lambdas_) and need not match StandardScaler
```

---

#### ⏱️ 4. When to Use:

* ⚖️ When features have different scales or units
* 🔄 To improve model convergence speed and performance
* 📊 For skewed data distributions requiring normalization
* 🧠 Before algorithms sensitive to feature magnitude (e.g., SVM, KNN, PCA)

---

#### ⚠️ 5. Limitations and Challenges:

* ❌ Scaling can distort interpretability of feature values
* ⚠️ Log transform requires positive values only
* 🔧 Power transforms have input constraints: Box-Cox requires strictly positive data, while Yeo-Johnson also accepts zero and negative values
* 🧩 Not all models require scaling (e.g., tree-based methods)
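
The positivity constraints are easiest to see on data that contains a zero. A small sketch, with values chosen only for illustration:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([[0.0], [1.0], [10.0], [100.0]])  # contains a zero

# np.log(0) is -inf; log1p computes log(1 + x) and stays finite for x >= 0
print("log1p:", np.log1p(data).ravel())

# Box-Cox needs strictly positive input, so it rejects the zero
try:
    PowerTransformer(method="box-cox").fit_transform(data)
    box_cox_ok = True
except ValueError as err:
    box_cox_ok = False
    print("Box-Cox failed:", err)

# Yeo-Johnson accepts zero and negative values
yj = PowerTransformer(method="yeo-johnson").fit_transform(data)
print("Yeo-Johnson:", yj.ravel())
```

When a column can legitimately contain zeros, `log1p` or Yeo-Johnson is usually the safer default than a plain logarithm or Box-Cox.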

---



---

### 🔷 **15. Feature Engineering and Selection** 🛠️✨

---

#### 📖 1. Definition:

* **Feature Engineering** is the process of creating, modifying, or transforming variables (features) to improve model performance.
* **Feature Selection** involves choosing the most relevant features to reduce dimensionality and improve efficiency.
* Both are crucial steps in preparing data for machine learning models.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `pandas.get_dummies()`**

* Converts categorical variables into dummy/indicator variables (one-hot encoding)

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
df_encoded = pd.get_dummies(df['Color'], dtype=int)  # dtype=int gives 0/1; recent pandas defaults to True/False
print(df_encoded)
# Output:
#    Blue  Green  Red
# 0     0      0    1
# 1     1      0    0
# 2     0      1    0
```

**➤ `sklearn.feature_selection.SelectKBest`**

* Selects the top k features based on statistical tests

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)
# Output:
# (150, 2)
```

**➤ `sklearn.feature_selection.RFE` (Recursive Feature Elimination)**

* Recursively removes less important features based on model weights

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=500)
rfe = RFE(model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)
print(X_rfe.shape)
# Output:
# (150, 2)
```

**➤ `pandas.DataFrame.apply()`**

* Apply custom transformations to columns (e.g., creating new features)

```python
df = pd.DataFrame({'Age': [25, 30, 45]})
df['Age_squared'] = df['Age'].apply(lambda x: x**2)
print(df)
# Output:
#    Age  Age_squared
# 0   25          625
# 1   30          900
# 2   45         2025
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# One-hot encoding categorical data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
df_encoded = pd.get_dummies(df['Color'], dtype=int)  # dtype=int gives 0/1; recent pandas defaults to True/False
print("One-hot encoded data:\n", df_encoded)
# Output:
#    Blue  Green  Red
# 0     0      0    1
# 1     1      0    0
# 2     0      1    0

# Feature selection with SelectKBest
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print("Shape after SelectKBest:", X_new.shape)
# Output: Shape after SelectKBest: (150, 2)

# Recursive Feature Elimination (RFE)
model = LogisticRegression(max_iter=500)
rfe = RFE(model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)
print("Shape after RFE:", X_rfe.shape)
# Output: Shape after RFE: (150, 2)

# Feature engineering: create new feature by applying function
df2 = pd.DataFrame({'Age': [25, 30, 45]})
df2['Age_squared'] = df2['Age'].apply(lambda x: x**2)
print("Feature engineered DataFrame:\n", df2)
# Output:
#    Age  Age_squared
# 0   25          625
# 1   30          900
# 2   45         2025
```

---

#### ⏱️ 4. When to Use:

* 🌟 To improve predictive power of models by generating meaningful features
* 🔍 To reduce dimensionality by selecting only important features
* 🧹 To reduce noise and overfitting
* 🛠️ When working with categorical variables or raw data needing transformation

---

#### ⚠️ 5. Limitations and Challenges:

* ⚙️ Feature engineering can be time-consuming and requires domain knowledge
* 🔎 Feature selection risks removing useful but less obvious features
* 🧩 Over-engineering can lead to overfitting
* 📏 Different algorithms may require different feature sets

---




---

### 🔷 **16. Handling Outliers** 🚫📈

---

#### 📖 1. Definition:

* **Outliers** are data points that differ significantly from other observations.
* Handling outliers involves detecting and managing these extreme values to prevent distortion in analysis or modeling.
* Approaches include removal, transformation, or capping.
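
The three approaches can be compared side by side on a small column. This sketch assumes the IQR rule for the bounds and `log1p` as the transformation; both are illustrative choices rather than the only options.

```python
import numpy as np
import pandas as pd

age = pd.Series([20, 22, 21, 100, 23, 22], name="Age")

q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 19.0 and 25.0 for this data

removed = age[age.between(lower, upper)]     # removal: drop rows outside the bounds
capped = age.clip(lower=lower, upper=upper)  # capping: pull extremes back to the bounds
logged = np.log1p(age)                       # transformation: compress the long tail

print("Removed:", removed.tolist())
print("Capped: ", capped.tolist())
print("Logged: ", logged.round(2).tolist())
```

Removal shortens the data, capping keeps every row but changes the extreme value, and the log transform keeps the ordering while shrinking the gap between 100 and the rest.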

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `pandas.DataFrame.describe()`**

* Provides statistical summary including min, max, quartiles to help spot outliers

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 22, 21, 100, 23, 22]})
print(df.describe())
# Output includes max=100 indicating potential outlier
```

**➤ `numpy.percentile()`**

* Calculates percentiles, useful for defining outlier thresholds based on IQR

```python
import numpy as np

data = np.array([20, 22, 21, 100, 23, 22])
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
print(q1, q3)
# Output: 21.25 22.75
```

**➤ Outlier detection using Interquartile Range (IQR)**

* IQR = Q3 - Q1; outliers are points below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`

```python
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(lower_bound, upper_bound)
# Output: 19.0 25.0
```

**➤ `pandas.DataFrame.clip()`**

* Caps outliers by setting upper and lower limits

```python
df['Age_clipped'] = df['Age'].clip(lower=19.0, upper=25.0)
print(df)
# Output: Outliers capped to limits
```

**➤ `scipy.stats.zscore`**

* Calculates Z-scores to identify outliers beyond a threshold (e.g., ±3)

```python
from scipy.stats import zscore

import numpy as np
data = np.array([20, 22, 21, 100, 23, 22])
z_scores = zscore(data)
print(z_scores)
# Output: Array with z-scores, extreme values stand out
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
import numpy as np
from scipy.stats import zscore

df = pd.DataFrame({'Age': [20, 22, 21, 100, 23, 22]})

# Statistical summary
print(df.describe())
# Output:
#           Age
# count   6.000000
# mean   34.666667
# std    32.022908
# min    20.000000
# 25%    21.250000
# 50%    22.000000
# 75%    22.750000
# max   100.000000

# Calculate IQR and bounds
q1 = np.percentile(df['Age'], 25)
q3 = np.percentile(df['Age'], 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(f"IQR bounds: {lower_bound}, {upper_bound}")
# Output: IQR bounds: 19.0, 25.0

# Cap outliers using clip
df['Age_clipped'] = df['Age'].clip(lower=lower_bound, upper=upper_bound)
print(df)
# Output:
#    Age  Age_clipped
# 0   20         20.0
# 1   22         22.0
# 2   21         21.0
# 3  100         25.0
# 4   23         23.0
# 5   22         22.0

# Z-score for outlier detection
z_scores = zscore(df['Age'])
print("Z-scores:", z_scores)
# Output (rounded to 3 decimals): Z-scores: [-0.502 -0.433 -0.468  2.235 -0.399 -0.433]

# Identify outliers by threshold (e.g., abs(z) > 2)
outliers = np.where(np.abs(z_scores) > 2)
print("Outlier indices:", outliers)
# Output: Outlier indices: (array([3]),)
```

---

#### ⏱️ 4. When to Use:

* ⚠️ When extreme values distort model learning or analysis
* 🔎 When exploratory data analysis reveals suspicious data points
* 🧹 When cleaning data to improve robustness and accuracy
* 🔄 Before applying algorithms sensitive to outliers (e.g., linear regression)

---

#### ⚠️ 5. Limitations and Challenges:

* ❌ Removing outliers might discard important rare events
* 🔍 Defining thresholds can be subjective and dataset-dependent
* 🛠️ Capping or transforming may distort data distribution
* 🧩 Some models are robust to outliers, so treatment might be unnecessary

---




---

### 🔷 **18. Handling Missing Data** ❓🧹

---

#### 📖 1. Definition:

* **Handling Missing Data** involves identifying and dealing with gaps or nulls in datasets.
* Missing data can cause biased analysis or errors in models if not properly managed.
* Techniques include deletion, imputation, or flagging missing values.
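
Flagging is the least familiar of the three techniques, so here is a minimal sketch of it. It records where values were missing before they are filled, first with plain pandas and then with `SimpleImputer(add_indicator=True)`; the `_missing` column suffix is a naming choice made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, np.nan]})

# pandas: add one boolean flag per column, then fill the gaps with the column mean
flagged = df.copy()
for col in df.columns:
    flagged[f"{col}_missing"] = df[col].isna()
flagged[list(df.columns)] = df.fillna(df.mean())
print(flagged)

# scikit-learn: add_indicator appends one 0/1 column per feature that had NaNs during fit
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_flagged = imp.fit_transform(df)
print(X_flagged.shape)
```

Keeping the flags lets a model learn from the fact that a value was missing, which matters when missingness itself carries information.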

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `pandas.isnull()` / `pandas.isna()`**

* Detects missing values in DataFrame or Series

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
print(df.isnull())
# Output:
#        A      B
# 0  False  False
# 1   True  False
# 2  False   True
```

**➤ `pandas.DataFrame.dropna()`**

* Drops rows or columns with missing values

```python
print(df.dropna())
# Output:
#      A    B
# 0  1.0  4.0
```

**➤ `pandas.DataFrame.fillna()`**

* Fills missing values with specified value or method (forward/backward fill)

```python
print(df.fillna(0))
# Output:
#      A    B
# 0  1.0  4.0
# 1  0.0  5.0
# 2  3.0  0.0
```

**➤ `sklearn.impute.SimpleImputer`**

* Provides flexible imputation strategies like mean, median, most frequent

```python
from sklearn.impute import SimpleImputer
import numpy as np

imp = SimpleImputer(strategy='mean')
data = np.array([[1, 2], [np.nan, 3], [7, 6]])
imputed_data = imp.fit_transform(data)
print(imputed_data)
# Output:
# [[1. 2.]
#  [4. 3.]
#  [7. 6.]]
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
import numpy as np

df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})

# Detect missing values
print("Missing values:\n", df.isnull())
# Output:
#        A      B
# 0  False  False
# 1   True  False
# 2  False   True

# Drop rows with any missing values
print("After dropna():\n", df.dropna())
# Output:
#      A    B
# 0  1.0  4.0

# Fill missing values with zero
print("After fillna(0):\n", df.fillna(0))
# Output:
#      A    B
# 0  1.0  4.0
# 1  0.0  5.0
# 2  3.0  0.0

# Impute missing values with mean using SimpleImputer
imp = SimpleImputer(strategy='mean')
imputed_array = imp.fit_transform(df)
print("After SimpleImputer:\n", imputed_array)
# Output:
# [[1.  4. ]
#  [2.  5. ]
#  [3.  4.5]]
```

---

#### ⏱️ 4. When to Use:

* ⚠️ When datasets contain null or NaN values that may affect analysis or models
* 🔄 Before feeding data to models that do not support missing values
* 🧹 To clean data and reduce bias due to missingness
* 📊 When preparing data for visualization or statistical tests

---

#### ⚠️ 5. Limitations and Challenges:

* ❌ Dropping missing data can reduce dataset size and cause bias
* 🧠 Imputation assumes missingness is random, which may not hold
* ⚙️ Complex imputation methods may require careful tuning
* 🔍 Some missing patterns can signal important info and should be analyzed

---




---

### 🔷 **19. Dimensionality Reduction** 📉✨

---

#### 📖 1. Definition:

* **Dimensionality Reduction** is the process of reducing the number of input variables (features) in a dataset while retaining as much important information as possible.
* It helps to simplify models, reduce overfitting, improve visualization, and speed up computation.
* Techniques include feature selection and feature extraction (e.g., PCA, t-SNE).

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `sklearn.decomposition.PCA`**

* Principal Component Analysis transforms features into a lower-dimensional space maximizing variance

```python
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X_reduced)
# Output (the sign of the component may be flipped depending on the scikit-learn version):
# [[-2.82842712]
#  [ 0.        ]
#  [ 2.82842712]]
```

**➤ `sklearn.manifold.TSNE`**

* t-SNE maps high-dimensional data to 2D or 3D space preserving local structure, used mainly for visualization

```python
from sklearn.manifold import TSNE

X = np.array([[1, 2], [3, 4], [5, 6]])
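# perplexity must be smaller than the number of samples (3 here); the default of 30 would raise an error
tsne = TSNE(n_components=2, perplexity=2, random_state=0)
X_embedded = tsne.fit_transform(X)
print(X_embedded)
# Output: (3, 2) array of 2D coordinates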
```

**➤ `sklearn.feature_selection.SelectKBest`**

* Selects top k features based on statistical tests

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)
# Output: (150, 2)
```

---

#### 📌 3. Code Example:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris
import numpy as np

# PCA example
X = np.array([[1, 2], [3, 4], [5, 6]])
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)
print("PCA result:\n", X_pca)
# Output (the sign of the component may be flipped depending on the scikit-learn version):
# [[-2.82842712]
#  [ 0.        ]
#  [ 2.82842712]]

# t-SNE example (perplexity must be smaller than the number of samples, 3 here)
tsne = TSNE(n_components=2, perplexity=2, random_state=0)
X_tsne = tsne.fit_transform(X)
print("t-SNE result:\n", X_tsne)
# Output: a (3, 2) array of embedded coordinates; the exact values vary
# across scikit-learn versions and platforms

# Feature selection example
iris = load_iris()
X, y = iris.data, iris.target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print("Selected features shape:", X_selected.shape)
# Output: Selected features shape: (150, 2)
```

---

#### ⏱️ 4. When to Use:

* 🚀 To reduce computational cost when dealing with high-dimensional data
* 🔍 To avoid overfitting by removing redundant or irrelevant features
* 📊 For visualizing data in 2D or 3D plots
* 🧩 When a simpler model with fewer inputs is desired (feature selection keeps the original, interpretable features)

---

#### ⚠️ 5. Limitations and Challenges:

* ❌ Risk of losing important information during reduction
* 🔍 Interpretation of transformed features can be difficult (especially PCA, t-SNE)
* ⚙️ Parameter tuning (e.g., number of components) requires experimentation
* 🕰️ Some methods (e.g., t-SNE) can be computationally expensive on large datasets
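
For PCA, part of that experimentation can be replaced by looking at the explained variance. The sketch below fits PCA with all components on iris and picks the smallest number that reaches a 95% target; the target itself is an arbitrary choice made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA().fit(X)  # keep all components to inspect the variance each one explains
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Cumulative explained variance:", cumulative.round(3))

# Smallest number of components whose cumulative variance reaches the target
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95%:", n_components)
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make the same choice internally.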

---




---

### 🔷 **20. Feature Engineering** ⚙️✨

---

#### 📖 1. Definition:

* **Feature Engineering** is the process of creating, transforming, or selecting features from raw data to improve the performance of machine learning models.
* It includes techniques like encoding categorical variables, scaling, creating interaction features, and extracting meaningful attributes.
* It is often the most crucial step for effective modeling.
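
Two of the techniques named above, interaction features and attribute extraction, take only a few lines of pandas. The column names and values in this sketch are hypothetical.

```python
import pandas as pd

# Hypothetical order data
df = pd.DataFrame({
    "price": [10.0, 20.0, 15.0],
    "quantity": [3, 1, 4],
    "order_date": pd.to_datetime(["2024-01-15", "2024-02-03", "2024-03-09"]),
})

# Interaction feature: product of two existing columns
df["revenue"] = df["price"] * df["quantity"]

# Attribute extraction: pull calendar parts out of a datetime column
df["order_month"] = df["order_date"].dt.month
df["order_weekday"] = df["order_date"].dt.dayofweek  # Monday=0 ... Sunday=6

print(df)
```

Whether such derived columns help depends on the model: linear models cannot discover a product of two inputs on their own, while tree-based models can often approximate it.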

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `pandas.get_dummies()`**

* Converts categorical variables into one-hot encoded variables

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green']})
print(pd.get_dummies(df['color'], dtype=int))  # dtype=int gives 0/1; recent pandas defaults to True/False
# Output:
#    blue  green  red
# 0     0      0    1
# 1     1      0    0
# 2     0      1    0
```

**➤ `sklearn.preprocessing.StandardScaler`**

* Standardizes features by removing the mean and scaling to unit variance

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
# Output:
# [[-1.22474487 -1.22474487]
#  [ 0.          0.        ]
#  [ 1.22474487  1.22474487]]
```

**➤ `sklearn.preprocessing.LabelEncoder`**

* Converts categorical labels into numerical labels

```python
from sklearn.preprocessing import LabelEncoder

labels = ['cat', 'dog', 'cat', 'bird']
le = LabelEncoder()
encoded = le.fit_transform(labels)
print(encoded)
# Output:
# [1 2 1 0]
```

**➤ `pandas.Series.str` methods**

* Extract or transform features from string columns, e.g., `.str.lower()`, `.str.extract()`, `.str.contains()`

```python
df = pd.DataFrame({'text': ['Hello World', 'Test Data']})
print(df['text'].str.lower())
# Output:
# 0    hello world
# 1       test data
# Name: text, dtype: object
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
import numpy as np

# One-hot encoding categorical variable
df = pd.DataFrame({'color': ['red', 'blue', 'green']})
print("One-hot encoded:\n", pd.get_dummies(df['color'], dtype=int))  # dtype=int prints 0/1 instead of True/False
# Output:
#    blue  green  red
# 0     0      0    1
# 1     1      0    0
# 2     0      1    0

# Standard scaling numerical data
data = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Scaled data:\n", scaled_data)
# Output:
# [[-1.22474487 -1.22474487]
#  [ 0.          0.        ]
#  [ 1.22474487  1.22474487]]

# Label encoding categorical labels
labels = ['cat', 'dog', 'cat', 'bird']
le = LabelEncoder()
encoded_labels = le.fit_transform(labels)
print("Label encoded:", encoded_labels)
# Output:
# [1 2 1 0]

# String transformation
df_text = pd.DataFrame({'text': ['Hello World', 'Test Data']})
print("Lowercase text:\n", df_text['text'].str.lower())
# Output:
# 0    hello world
# 1       test data
# Name: text, dtype: object
```

---

#### ⏱️ 4. When to Use:

* 🔧 When raw data needs to be converted into formats suitable for ML algorithms
* 🧩 To improve model accuracy by creating more meaningful features
* 🔍 To handle categorical variables and scale features appropriately
* 📊 For enriching datasets with domain knowledge transformations

---

#### ⚠️ 5. Limitations and Challenges:

* 🕰️ Can be time-consuming and require domain expertise
* ⚠️ Risk of overfitting if too many features are created
* 🔄 May require iterative experimentation and validation
* ⚙️ Some transformations may not generalize well to new data
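
One common way a transformation fails to generalize is a category that appears only in new data. The sketch below shows one possible mitigation, assuming scikit-learn's `OneHotEncoder` with `handle_unknown="ignore"`; the `train` and `new` frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green"]})
new = pd.DataFrame({"color": ["blue", "purple"]})  # "purple" never seen in training

# Learn the categories from the training data only; ignore unseen ones later.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform() returns a sparse matrix by default, so convert it for display.
encoded_new = encoder.transform(new).toarray()

print(encoder.categories_[0])  # learned categories, sorted alphabetically
print(encoded_new)             # the unseen "purple" row is encoded as all zeros
```

By contrast, `pd.get_dummies` has no fitted state, so applying it separately to training and new data can produce different columns.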

---




---

### 🔷 **21. Outlier Detection and Treatment** 🚨📉

---

#### 📖 1. Definition:

* **Outliers** are data points that differ significantly from other observations.
* **Outlier Detection** identifies these unusual points that may distort analysis or model training.
* **Treatment** involves removing, capping, or transforming outliers to reduce their negative impact.
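
The three treatment options can be sketched side by side. This is a minimal illustration, assuming IQR fences as the cutoff and `log1p` as the transformation; other cutoffs and transformations are equally valid.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 14, 1000, 15], name="values")

# IQR fences: points outside [lower, upper] are treated as outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = s[s.between(lower, upper)]        # 1. remove points outside the fences
capped = s.clip(lower=lower, upper=upper)   # 2. cap them at the fences
logged = np.log1p(s)                        # 3. compress the scale with log(1 + x)

print("Removed:", removed.tolist())
print("Capped:", capped.tolist())
print("Log-transformed:", logged.round(2).tolist())
```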

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ Using `pandas.DataFrame.describe()`**

* Quickly checks summary statistics to spot extreme values

```python
import pandas as pd

df = pd.DataFrame({'values': [10, 12, 14, 1000, 15]})
print(df.describe())
# Output:
#             values
# count     5.000000
# mean    210.200000
# std     441.515798
# min      10.000000
# 25%      12.000000
# 50%      14.000000
# 75%      15.000000
# max    1000.000000
```

**➤ Using Interquartile Range (IQR) method**

* Calculates Q1, Q3 and identifies outliers beyond 1.5\*IQR

```python
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['values'] < Q1 - 1.5*IQR) | (df['values'] > Q3 + 1.5*IQR)]
print(outliers)
# Output:
#    values
# 3    1000
```

**➤ Using `scipy.stats.zscore`**

* Computes z-scores to detect points with high deviation

```python
from scipy.stats import zscore

df['zscore'] = zscore(df['values'])
outliers = df[(df['zscore'] > 3) | (df['zscore'] < -3)]
print(outliers)
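# Output (nothing is flagged: 1000 has z ≈ 2.0, and with only 5 points
# |z| can never exceed sqrt(n - 1) = 2, so a threshold of 3 is unreachable):
# Empty DataFrame
# Columns: [values, zscore]
# Index: []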
```

**➤ Using `sklearn.ensemble.IsolationForest`**

* Anomaly detection algorithm to identify outliers

```python
from sklearn.ensemble import IsolationForest
import numpy as np

X = np.array([[10], [12], [14], [1000], [15]])
clf = IsolationForest(contamination=0.2, random_state=0)
outliers = clf.fit_predict(X)
print(outliers)
# Output:
# [ 1  1  1 -1  1]  # -1 indicates outlier at index 3 (1000)
```

---

#### 📌 3. Code Example:

```python
import pandas as pd
from scipy.stats import zscore
from sklearn.ensemble import IsolationForest
import numpy as np

df = pd.DataFrame({'values': [10, 12, 14, 1000, 15]})

# Summary stats to spot outliers
print("Describe:\n", df.describe())
# Output:
#             values
# count     5.000000
# mean    210.200000
# std     441.515798
# min      10.000000
# 25%      12.000000
# 50%      14.000000
# 75%      15.000000
# max    1000.000000

# IQR method to detect outliers
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['values'] < Q1 - 1.5*IQR) | (df['values'] > Q3 + 1.5*IQR)]
print("Outliers detected by IQR:\n", outliers_iqr)
# Output:
#    values
# 3    1000

# Z-score method
df['zscore'] = zscore(df['values'])
outliers_z = df[(df['zscore'] > 3) | (df['zscore'] < -3)]
print("Outliers detected by z-score:\n", outliers_z)
# Output:
# Empty DataFrame
# Columns: [values, zscore]
# Index: []

# Isolation Forest
X = df[['values']].values
clf = IsolationForest(contamination=0.2, random_state=0)
df['anomaly'] = clf.fit_predict(X)
print("Isolation Forest result:\n", df)
# Output:
#    values    zscore  anomaly
# 0      10 -0.506959        1
# 1      12 -0.501894        1
# 2      14 -0.496830        1
# 3    1000  1.999981       -1
# 4      15 -0.494298        1
```

---

#### ⏱️ 4. When to Use:

* 🛑 When extreme values may distort statistical analysis or models
* 🔍 Before model training to improve accuracy and robustness
* 🧹 To clean data and help satisfy model assumptions (e.g., normality)
* ⚠️ When outliers are errors, measurement issues, or rare events needing special handling

---

#### ⚠️ 5. Limitations and Challenges:

* ❓ Differentiating between true outliers and valid rare data points can be hard
* 🔄 Treatment (removal, capping) might bias results if not done carefully
* ⚙️ Some methods depend on assumptions (e.g., normality for z-score)
* 🕰️ Advanced detection algorithms may require tuning and are computationally intensive
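
The z-score's dependence on its assumptions shows up clearly on small samples, where a single extreme value inflates both the mean and the standard deviation. A commonly used alternative is the modified z-score based on the median and the median absolute deviation (MAD). The sketch below assumes the conventional constant 0.6745 and cutoff 3.5; both are conventions rather than fixed rules, and the formula breaks down when the MAD is zero.

```python
import numpy as np

values = np.array([10, 12, 14, 1000, 15], dtype=float)

# Classic z-score: the outlier itself inflates the mean and standard deviation.
z = (values - values.mean()) / values.std()

# Modified z-score: the median and MAD are barely affected by one extreme point.
median = np.median(values)
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * (values - median) / mad

flagged_z = values[np.abs(z) > 3].tolist()
flagged_modified = values[np.abs(modified_z) > 3.5].tolist()

print("Flagged by |z| > 3:", flagged_z)                        # []
print("Flagged by |modified z| > 3.5:", flagged_modified)      # [1000.0]
```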

---




---

### 🔷 **22. Data Normalization and Scaling** ⚖️📊

---

#### 📖 1. Definition:

* **Data Normalization and Scaling** are preprocessing techniques used to adjust the range and distribution of feature values.
* **Normalization** typically rescales each feature to a fixed range such as \[0, 1] (min-max scaling).
* **Standardization**, often loosely called scaling, transforms each feature to have zero mean and unit variance.
* These techniques help improve the convergence and performance of many machine learning algorithms.
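
Both definitions reduce to one-line formulas. The following sketch writes them out in NumPy for a single feature, as an illustration of what the scikit-learn scalers compute column by column.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])

# Min-max normalization: (x - min) / (max - min), giving values in [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, giving zero mean and unit variance.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what sklearn's StandardScaler uses.
x_standard = (x - x.mean()) / x.std()

print(x_minmax)    # [0.  0.5 1. ]
print(x_standard)
```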

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `sklearn.preprocessing.MinMaxScaler`**

* Scales features to a fixed range, usually \[0, 1]

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[10, 100], [20, 200], [30, 300]])
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# Output:
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```

**➤ `sklearn.preprocessing.StandardScaler`**

* Standardizes features by removing the mean and scaling to unit variance

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# Output:
# [[-1.22474487 -1.22474487]
#  [ 0.          0.        ]
#  [ 1.22474487  1.22474487]]
```

**➤ `sklearn.preprocessing.Normalizer`**

* Scales individual samples to have unit norm (useful in text classification)

```python
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
normalized = normalizer.fit_transform(data)
print(normalized)
# Output:
# [[0.09950372 0.99503719]
#  [0.09950372 0.99503719]
#  [0.09950372 0.99503719]]
```

---

#### 📌 3. Code Example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

data = np.array([[10, 100], [20, 200], [30, 300]])

# Min-Max Scaling
minmax_scaler = MinMaxScaler()
data_minmax = minmax_scaler.fit_transform(data)
print("Min-Max Scaled Data:\n", data_minmax)
# Output:
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]

# Standard Scaling
standard_scaler = StandardScaler()
data_standard = standard_scaler.fit_transform(data)
print("Standard Scaled Data:\n", data_standard)
# Output:
# [[-1.22474487 -1.22474487]
#  [ 0.          0.        ]
#  [ 1.22474487  1.22474487]]

# Normalization (unit norm)
normalizer = Normalizer()
data_normalized = normalizer.fit_transform(data)
print("Normalized Data:\n", data_normalized)
# Output:
# [[0.09950372 0.99503719]
#  [0.09950372 0.99503719]
#  [0.09950372 0.99503719]]
```

---

#### ⏱️ 4. When to Use:

* ⚙️ When features have different units or scales
* 🚀 For algorithms sensitive to feature magnitude like k-NN, SVM, gradient descent-based models
* 📉 To speed up convergence of optimization algorithms
* 🔍 When comparing or combining features with varying scales

---

#### ⚠️ 5. Limitations and Challenges:

* ⚠️ Scaling may distort interpretability of features
* ❗ Different scaling methods suit different algorithms; wrong choice affects performance
* 🔄 Unit-norm normalization rescales rows rather than columns, so it is not appropriate for every kind of data
* 🧩 Fit the scaler on the training data only and reuse it on the test data; fitting on the full dataset leaks test information
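
A leakage-safe scaling workflow can be sketched as follows. The toy array and the 70/30 split are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)  # toy data for illustration
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; never refit on test

print("Train column means after scaling:", X_train_scaled.mean(axis=0).round(6))
print("Test column means after scaling:", X_test_scaled.mean(axis=0).round(6))
```

The training columns come out centered at zero, while the test columns generally do not, which is expected: the test set is measured against statistics it did not contribute to.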

---




---

### 🔷 **23. Handling Imbalanced Data** ⚖️🔄

---

#### 📖 1. Definition:

* **Imbalanced Data** occurs when the classes in a classification problem are not represented equally (e.g., 90% of samples in one class, 10% in another).
* This can cause models to be biased toward the majority class.
* Handling imbalance involves techniques to balance class distributions to improve model performance, especially for minority classes.
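
The bias toward the majority class is easy to miss when only accuracy is reported. The sketch below uses the 90/10 split from the definition and a baseline that always predicts the majority class; the all-zero feature matrix is a placeholder, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 90 majority samples (class 0) and 10 minority samples (class 1).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # placeholder features; this baseline never looks at them

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
minority_recall = recall_score(y, y_pred)

print("Accuracy:", accuracy)                # 0.9
print("Minority recall:", minority_recall)  # 0.0
```

An accuracy of 90% looks respectable, yet the baseline never identifies a single minority sample, which is why metrics such as recall, precision, or F1 are usually reported for imbalanced problems.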

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `imblearn.over_sampling.SMOTE`**

* Synthetic Minority Over-sampling Technique creates synthetic samples of minority class to balance data

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 0, 1, 1])  # Imbalanced (4:2)
print("Original class distribution:", Counter(y))
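# k_neighbors must be smaller than the minority class size (2 here); the default of 5 raises a ValueError
smote = SMOTE(k_neighbors=1, random_state=42)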
X_res, y_res = smote.fit_resample(X, y)
print("Resampled class distribution:", Counter(y_res))
# Output (NumPy 2.x prints the keys as np.int64(0) and np.int64(1)):
# Original class distribution: Counter({0: 4, 1: 2})
# Resampled class distribution: Counter({0: 4, 1: 4})
```

**➤ `imblearn.under_sampling.RandomUnderSampler`**

* Randomly removes samples from majority class to balance data

```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("Resampled class distribution:", Counter(y_res))
# Output:
# Resampled class distribution: Counter({0: 2, 1: 2})
```

**➤ `sklearn.utils.class_weight.compute_class_weight`**

* Computes weights for classes to use in model training to balance class influence

```python
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

y = np.array([0, 0, 0, 0, 1, 1])
class_weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print("Class weights:", dict(zip(np.unique(y), class_weights)))
# Output:
# Class weights: {0: 0.75, 1: 1.5}
```

---

#### 📌 3. Code Example:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 0, 1, 1])  # Imbalanced (4 majority, 2 minority)

# Original class distribution
print("Original class distribution:", Counter(y))
# Output: Counter({0: 4, 1: 2})

# Oversampling minority class using SMOTE
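# k_neighbors must be smaller than the minority class size (2 here); the default of 5 raises a ValueError
smote = SMOTE(k_neighbors=1, random_state=42)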
X_smote, y_smote = smote.fit_resample(X, y)
print("After SMOTE:", Counter(y_smote))
# Output: Counter({0: 4, 1: 4})

# Undersampling majority class using RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
print("After undersampling:", Counter(y_rus))
# Output: Counter({0: 2, 1: 2})

# Computing class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print("Class weights:", dict(zip(np.unique(y), class_weights)))
# Output: {0: 0.75, 1: 1.5}
```

---

#### ⏱️ 4. When to Use:

* 📉 When classification data shows severe class imbalance
* 🔍 When minority class prediction is important (e.g., fraud detection, medical diagnosis)
* ⚖️ To avoid models biased towards majority classes
* 🧠 When training algorithms that support class weights or need balanced data

---

#### ⚠️ 5. Limitations and Challenges:

* 🔄 Oversampling can cause overfitting on synthetic data
* 🗑️ Undersampling may discard useful majority class information
* ⚙️ Not all algorithms support sample weighting natively
* 🧩 Needs careful tuning of resampling parameters and validation strategy
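
A validation strategy that respects imbalance could look like the sketch below. It assumes synthetic data from `make_classification`, stratified 5-fold cross-validation, and class weighting instead of resampling; all of these are illustrative choices rather than a recommended recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with roughly a 90/10 class split (illustrative only).
X, y = make_classification(n_samples=500, n_features=5, weights=[0.9, 0.1],
                           random_state=0)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# class_weight="balanced" reweights errors instead of resampling the data.
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Score with recall on the minority class rather than plain accuracy.
scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
print("Minority recall per fold:", scores.round(2))
```

If resampling such as SMOTE is used instead of class weights, it should be applied to the training portion of each fold only; `imblearn.pipeline.Pipeline` is designed for that purpose.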

---




---

### 🔷 **24. Data Visualization for EDA** 📊👁️

---

#### 📖 1. Definition:

* **Data Visualization** is the graphical representation of data to help understand patterns, trends, and outliers during Exploratory Data Analysis.
* It helps to gain insights quickly and communicate findings effectively.
* Common visualizations include histograms, scatter plots, box plots, bar charts, and heatmaps.

---

#### 🛠️ 2. Available Built-in Functions (with Descriptions & Output Examples):

**➤ `matplotlib.pyplot.hist()`**

* Creates histograms to show data distribution

```python
import matplotlib.pyplot as plt

data = [1,2,2,3,3,3,4,4,4,4]
plt.hist(data, bins=4)
plt.show()
# Output: Histogram plot showing frequency of values
```

**➤ `seaborn.scatterplot()`**

* Creates scatter plots to visualize relationships between two variables

```python
import seaborn as sns
import pandas as pd

df = pd.DataFrame({'x':[1,2,3,4], 'y':[10,15,13,17]})
sns.scatterplot(data=df, x='x', y='y')
plt.show()
# Output: Scatter plot of x vs y
```

**➤ `matplotlib.pyplot.boxplot()`**

* Creates box plots to visualize data distribution and detect outliers

```python
plt.boxplot(data)
plt.show()
# Output: Box plot showing median, quartiles, and whiskers (this sample has no outliers)
```

**➤ `seaborn.heatmap()`**

* Creates heatmaps for correlation matrices or data intensity

```python
import numpy as np

corr = np.corrcoef([[1,2,3],[4,5,6]])
sns.heatmap(corr, annot=True)
plt.show()
# Output: Heatmap of correlation coefficients
```

---

#### 📌 3. Code Example:

```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

data = [1,2,2,3,3,3,4,4,4,4]

# Histogram
plt.hist(data, bins=4)
plt.title("Histogram")
plt.show()
# Output: Histogram plot with frequencies for each bin

# Scatter plot
df = pd.DataFrame({'x':[1,2,3,4], 'y':[10,15,13,17]})
sns.scatterplot(data=df, x='x', y='y')
plt.title("Scatter Plot")
plt.show()
# Output: Scatter plot showing relation between x and y

# Box plot
plt.boxplot(data)
plt.title("Box Plot")
plt.show()
# Output: Box plot showing median, quartiles, and whiskers (this sample has no outliers)

# Heatmap of correlation matrix
corr = df.corr()  # Pearson correlation matrix with labeled rows and columns
sns.heatmap(corr, annot=True)
plt.title("Heatmap of Correlation")
plt.show()
# Output: Heatmap visualizing correlation between x and y
```

---

#### ⏱️ 4. When to Use:

* 🔍 To understand data distribution and central tendency
* 🔄 To identify relationships and correlations between variables
* ⚠️ To spot outliers and unusual data points visually
* 📢 To communicate findings effectively in reports and presentations

---

#### ⚠️ 5. Limitations and Challenges:

* 🖼️ Visualizations can be misleading if axes or scales are not chosen carefully
* 📊 Overplotting can obscure insights in large datasets
* ⚙️ Requires careful selection of appropriate visualization types for data
* 🧩 May require domain knowledge to interpret complex plots accurately
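
Overplotting in particular has simple mitigations. The sketch below compares a default scatter plot with two common alternatives, marker transparency and hexagonal binning, on synthetic data; the sample size, `alpha`, and `gridsize` values are arbitrary choices for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic correlated data, large enough to cause overplotting.
rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
y = x + rng.normal(scale=0.5, size=20_000)

fig, (ax_raw, ax_alpha, ax_hex) = plt.subplots(1, 3, figsize=(12, 4))

ax_raw.scatter(x, y, s=5)                # opaque markers merge into a blob
ax_raw.set_title("Default scatter")

ax_alpha.scatter(x, y, s=5, alpha=0.05)  # transparency reveals dense regions
ax_alpha.set_title("alpha=0.05")

hb = ax_hex.hexbin(x, y, gridsize=40)    # show bin counts instead of single points
fig.colorbar(hb, ax=ax_hex, label="count")
ax_hex.set_title("hexbin")

# Render to an in-memory PNG; use plt.show() instead in an interactive session.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```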

---

