
### ✅ **Complete List of EDA Topics for Machine Learning**

*(Organized step-by-step, from raw data to insights)*

---

#### 🧹 1. **Data Cleaning**

* Handling missing values (imputation, removal)
* Handling duplicates
* Handling outliers
* Type conversions (e.g., object to float)
* Dealing with inconsistent formats (dates, currencies, etc.)
* String trimming and standardization

---

#### 📏 2. **Data Type Identification & Conversion**

* Numerical vs Categorical
* Ordinal vs Nominal
* Datetime parsing
* Encoding (Label, One-hot, Ordinal)

---

#### 📊 3. **Univariate Analysis**

* Summary statistics (mean, median, mode, std, IQR)
* Frequency distribution
* Value counts
* Distribution plots (histogram, KDE, boxplot)
* Detecting skewness and kurtosis

---

#### 📈 4. **Bivariate Analysis**

* Correlation matrix (Pearson, Spearman)
* Scatter plots
* Heatmaps
* Pair plots
* Groupby analysis
* Cross-tabulation
* Boxplots/grouped boxplots

---

#### 🔁 5. **Multivariate Analysis**

* Multivariate correlation
* Pairplots (Seaborn)
* FacetGrid analysis
* PCA for visualization
* Bubble charts

---

#### 📉 6. **Outlier Detection**

* Z-score
* IQR method
* Boxplot visual method
* Isolation Forest (optional ML method)
* Mahalanobis distance

---

#### 🧬 7. **Feature Distribution Analysis**

* Normal vs non-normal distribution
* Skewness correction (log, sqrt, Box-Cox)
* Visualization: histogram, KDE plot, violin plot (`sns.distplot` is deprecated; use `sns.histplot` or `sns.displot`)

---

#### 📆 8. **Time Series EDA**

* Time-based decomposition
* Rolling statistics
* Seasonal trends
* Lag analysis
* Autocorrelation/Partial Autocorrelation
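
A minimal sketch of rolling statistics, lag analysis, and autocorrelation using only pandas. The series `ts` and its values are hypothetical (a linear trend plus a weekly cycle), chosen so the effect of a 7-day window is easy to see. Decomposition and partial autocorrelation usually need an extra library such as `statsmodels` and are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: linear trend plus a weekly cycle (illustrative only)
idx = pd.date_range("2024-01-01", periods=56, freq="D")
values = np.arange(56) * 0.5 + 10 * np.sin(2 * np.pi * np.arange(56) / 7)
ts = pd.Series(values, index=idx, name="sales")

# Rolling statistics: a 7-day window averages out the weekly cycle
rolling_mean = ts.rolling(window=7).mean()
rolling_std = ts.rolling(window=7).std()

# Lag analysis: align each value with the value one week earlier
lag_7 = ts.shift(7)

# Autocorrelation: Pearson correlation of the series with its shifted copy
acf = {lag: ts.autocorr(lag=lag) for lag in (1, 7, 14)}

print(rolling_mean.dropna().head())
print(acf)
```

A high autocorrelation at lag 7 compared with lag 1 is one hint that the data has a weekly seasonal pattern.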

---

#### 📂 9. **Categorical Variable Analysis**

* Frequency tables
* Bar plots / Count plots
* Pie charts (use sparingly)
* Stacked bar charts
* Chi-square test (for association)
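
The frequency table and chi-square test can be sketched as follows. The column names and values are hypothetical, and the sample is deliberately tiny; with expected counts this small the chi-square approximation is unreliable, so treat the printed p-value as an illustration of the API only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey data (illustrative only)
df = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F", "M", "F"],
    "churned": ["yes", "no", "no", "no", "yes", "yes", "yes", "no", "no", "yes"],
})

# Frequency table of observed counts for each category pair
table = pd.crosstab(df["gender"], df["churned"])
print(table)

# Chi-square test of independence; a small p-value suggests the variables are associated
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```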

---

#### 📊 10. **Numerical Variable Analysis**

* Distribution shape
* Mean/median comparison
* Boxplots by category
* Violin plots
* ANOVA or t-tests

---

#### 🔀 11. **Encoding Categorical Data**

* Label Encoding
* One-Hot Encoding
* Target/Mean Encoding
* Frequency Encoding

---

#### 📉 12. **Correlation Analysis**

* Pearson, Spearman, Kendall coefficients
* Heatmaps
* Variance Inflation Factor (VIF) for multicollinearity
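
A rough sketch of correlation coefficients and VIF on synthetic data. The features `x1`, `x2`, `x3` are hypothetical; `x2` is built as a near copy of `x1` so that multicollinearity shows up clearly. The VIF here is computed with plain NumPy from the identity that VIF values are the diagonal of the inverse Pearson correlation matrix, rather than through a dedicated library function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Hypothetical features: x2 is almost a copy of x1, x3 is independent
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Correlation matrices; method="kendall" is also accepted by DataFrame.corr
pearson = X.corr(method="pearson")
spearman = X.corr(method="spearman")
print(pearson.round(2))

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse of the Pearson correlation matrix
vif = pd.Series(np.diag(np.linalg.inv(pearson.to_numpy())), index=X.columns)
print(vif.round(2))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity; the exact cutoff is a judgment call.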

---

#### 🧪 13. **Missing Value Treatment**

* Count and percentage of missing values
* Missingness pattern visualization
* Imputation techniques:

  * Mean/median/mode
  * Forward/backward fill
  * KNN imputation
  * Regression imputation
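
The steps above can be sketched on a small hypothetical frame. The column names and values are made up for illustration, and `n_neighbors=2` is an arbitrary choice for such a tiny sample.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical frame with gaps (illustrative values)
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "income": [50.0, 52.0, np.nan, 80.0, 75.0],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
})
print(missing)

# Median imputation, column by column
median_filled = df.fillna(df.median())

# Forward fill, then backward fill for any leading gaps (mainly for ordered data)
ffilled = df.ffill().bfill()

# KNN imputation: each gap is filled from the nearest rows that have the value
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
print(knn_filled)
```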

---

#### 🧮 14. **Feature Engineering (EDA-Aided)**

* Polynomial features
* Interaction terms
* Date parts (year, month, weekday, etc.)
* Binning (equal-width, equal-frequency, quantile-based)

---

#### ⚖️ 15. **Target Variable Analysis**

* Class imbalance (binary/multiclass)
* Distribution of target vs features
* Use of stratification for classification
* SMOTE or undersampling techniques (if applied)

---

#### 📈 16. **Visualization Techniques**

* Seaborn, Matplotlib, Plotly
* Histograms, KDE, Boxplots
* Count plots, Pie charts, Bar charts
* Heatmaps, Correlation plots
* Joint plots, Pair plots, Violin plots
* Time series plots
* Missing value matrix (e.g., `msno.matrix()`)

---

#### 📋 17. **EDA Reporting**

* Pandas Profiling (now published as `ydata-profiling`)
* Sweetviz
* D-Tale
* Autoviz
* Lux

---

#### 🛑 18. **EDA Red Flags**

* Data leakage detection
* Target leakage in features
* High multicollinearity
* Dominant class in target

---

### ✅ Complete EDA Guide for Machine Learning (All-in-One)

---

## 🧹 1. Data Cleaning

### 📘 Definition:

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset.

### 🔧 Built-in Functions:

* `df.isnull()` – Detects missing values
* `df.dropna()` – Removes missing values
* `df.fillna()` – Fills missing values
* `df.duplicated()` – Detects duplicate rows
* `df.drop_duplicates()` – Removes duplicate rows
* `pd.to_numeric()` – Converts values to a numeric dtype (`errors='coerce'` turns unparseable entries into `NaN`)
* `df.replace()` – Replaces specific values
* `df.astype()` – Type conversion

### 🧪 Example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': ['  text', 'text', 'Text', 'text'],
    'C': [1, 1, 1, 1]
})

# Drop every row that contains at least one missing value (row 2 has NaN in 'A')
df_cleaned = df.dropna()
print(df_cleaned)
# Output:
#      A       B  C
# 0  1.0    text  1
# 1  2.0    text  1
# 3  4.0    text  1
```
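
The other functions listed above (trimming, type conversion, duplicates) can be sketched in the same way. The frame `raw` and its values are hypothetical; the point is that some duplicates only become visible after strings and formats are standardized.

```python
import pandas as pd

# Hypothetical messy records (illustrative values)
raw = pd.DataFrame({
    "name": ["  Alice", "alice ", "Bob", "Bob"],
    "price": ["$1,200", "$1,200", "300", "300"],
})

clean = raw.copy()  # keep the raw frame untouched for auditing

# String trimming and standardization
clean["name"] = clean["name"].str.strip().str.title()

# Strip currency symbols and thousands separators, then convert to numbers
clean["price"] = pd.to_numeric(
    clean["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Count duplicates before and after standardization, then drop them
print(raw.duplicated().sum(), clean.duplicated().sum())
clean = clean.drop_duplicates().reset_index(drop=True)
print(clean)
```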

### ✅ When to Use:

* At the beginning of any ML project

### ❌ When Not to Use:

* Directly on the only copy of the data when raw records must be preserved for auditing (clean a copy instead)

### ⚠️ Limitations:

* Over-cleaning may lead to loss of important information

### 🎯 Interview Questions:

1. How do you handle missing values in a dataset?
2. What is the difference between `dropna()` and `fillna()`?
3. How can you detect outliers during data cleaning?

---

## 📏 2. Data Type Identification & Conversion

### 📘 Definition:

Identifying and converting data into appropriate formats such as numerical, categorical, or datetime for proper analysis.

### 🔧 Built-in Functions:

* `df.dtypes` – Shows data types
* `df.astype()` – Converts type
* `pd.to_datetime()` – Parses datetime
* `pd.get_dummies()` – One-hot encoding
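* `LabelEncoder()` – Label encoding (scikit-learn intends it for target labels; `OrdinalEncoder()` is the equivalent for feature columns)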
* `OrdinalEncoder()` – Ordinal encoding

### 🧪 Example:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female']})
le = LabelEncoder()  # classes are sorted alphabetically: Female -> 0, Male -> 1
df['Gender_encoded'] = le.fit_transform(df['Gender'])
print(df)
# Output:
#    Gender  Gender_encoded
# 0    Male               1
# 1  Female               0
# 2  Female               0
```
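
One-hot encoding, ordinal encoding, and datetime parsing from the list above can be sketched like this. The columns `city`, `size`, and `signup` are hypothetical, and the category order passed to `OrdinalEncoder` is an assumption about what the labels mean.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical columns (illustrative values)
df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris"],
    "size": ["small", "large", "medium"],
    "signup": ["2024-01-15", "2024-02-01", "not a date"],
})

# Nominal variable: one indicator column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Ordinal variable: pass the category order explicitly
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

# Datetime parsing: unparseable strings become NaT instead of raising
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")

print(one_hot)
print(df.dtypes)
```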

### ✅ When to Use:

* Before applying ML models

### ❌ When Not to Use:

* When working on raw EDA before preprocessing

### ⚠️ Limitations:

* Incorrect encoding can mislead models

### 🎯 Interview Questions:

1. Difference between One-hot and Label encoding?
2. What are ordinal variables and how do you handle them?
3. How do you convert a column to datetime format?

---

## 📊 3. Univariate Analysis

### 📘 Definition:

Analyzing one variable at a time to understand its distribution and characteristics.

### 🔧 Built-in Functions:

* `df.describe()` – Summary statistics
* `df['col'].value_counts()` – Frequency of values in a column
* `df['col'].plot(kind='hist')` – Histogram
* `sns.boxplot()` – Boxplot
* `sns.kdeplot()` – Kernel Density Estimate

### 🧪 Example:

```python
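import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Same 'A' values as in the Data Cleaning example
df = pd.DataFrame({'A': [1, 2, None, 4]})

sns.boxplot(data=df, x='A')
plt.show()
# Output: Boxplot visualization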
```
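
The numeric side of univariate analysis (summary statistics, value counts, skewness, kurtosis) might look like the sketch below. The series `s` is a hypothetical right-skewed sample with one large value.

```python
import pandas as pd

# Hypothetical right-skewed sample (illustrative values)
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 50], name="amount")

print(s.describe())       # count, mean, std, min, quartiles, max
print(s.value_counts())   # frequency of each distinct value

iqr = s.quantile(0.75) - s.quantile(0.25)
skewness = s.skew()       # > 0 indicates a longer right tail
kurtosis = s.kurt()       # excess kurtosis: about 0 for a normal distribution

print(f"median={s.median()}, mode={s.mode().tolist()}, IQR={iqr}")
print(f"skew={skewness:.2f}, kurtosis={kurtosis:.2f}")
```

A mean well above the median, as in this sample, is a quick signal of right skew even before plotting.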

### ✅ When to Use:

* To understand distribution, outliers, central tendency

### ❌ When Not to Use:

* When analyzing interaction between features

### ⚠️ Limitations:

* Can’t show relationships with other variables

### 🎯 Interview Questions:

1. What are summary statistics?
2. How do you detect skewness and kurtosis?
3. Why is univariate analysis important?

---

## 📈 4. Bivariate Analysis

### 📘 Definition:

Analyzing the relationship between two variables.

### 🔧 Built-in Functions:

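* `df.corr()` – Pairwise correlation of numeric columns (pass `numeric_only=True` in pandas 2.x if the frame has non-numeric columns)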
* `pd.crosstab()` – Cross tabulation
* `df.groupby()` – Group-wise analysis
* `sns.scatterplot()` – Scatter plot
* `sns.heatmap()` – Heatmap
* `sns.boxplot(data=df, x='category', y='value')` – Grouped boxplot

### 🧪 Example:

```python
# Assumes sns, plt, and the Data Cleaning df (numeric columns 'A' and 'C') are already loaded
sns.scatterplot(data=df, x='A', y='C')
plt.show()
# Output: Scatter plot
```
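
A small sketch of the non-plotting tools, which also shows how a correlation coefficient can mislead on non-linear data. The columns `x`, `y`, and `group` are hypothetical, built so that `y` rises monotonically but not linearly with `x`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically but non-linearly with x
x = np.arange(1, 21)
df = pd.DataFrame({
    "x": x,
    "y": np.exp(x / 3.0),
    "group": np.where(x <= 10, "low", "high"),
})

# Pearson measures linear association; Spearman measures monotonic (rank) association
pearson = df[["x", "y"]].corr(method="pearson").loc["x", "y"]
spearman = df[["x", "y"]].corr(method="spearman").loc["x", "y"]
print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")

# Group-wise summary of a numeric column by a categorical column
print(df.groupby("group")["y"].agg(["mean", "median"]))

# A symmetric (U-shaped) relationship can have a Pearson correlation near zero
z = (df["x"] - df["x"].mean()) ** 2
pearson_xz = df["x"].corr(z)
print(f"pearson(x, z)={pearson_xz:.3f}")
```

Here `z` is fully determined by `x`, yet its Pearson correlation with `x` is essentially zero, which is why a scatter plot is worth checking alongside the coefficient.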

### ✅ When to Use:

* To identify linear/non-linear relationships

### ❌ When Not to Use:

* When one or both variables are not meaningful together

### ⚠️ Limitations:

* Only works on two variables at a time

### 🎯 Interview Questions:

1. What is the use of scatter plots?
2. What is a heatmap and when is it used?
3. How can correlation mislead in non-linear cases?

---

## 🔁 5. Multivariate Analysis

### 📘 Definition:

Analyzing relationships among more than two variables simultaneously.

### 🔧 Built-in Functions:

* `sns.pairplot()` – Pairwise plots
* `sns.FacetGrid()` – Multi-variable faceted plots
* `PCA()` – Dimensionality reduction
* `sns.scatterplot()` with `hue` – Colored multivariable scatter

### 🧪 Example:

```python
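# Assumes df has at least two numeric columns plus a categorical 'Gender' column
sns.pairplot(df, hue='Gender')
plt.show()
# Output: a grid of scatter plots, one per pair of numeric columns, colored by Gender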
```
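
PCA for visualization can be sketched as below. The data is synthetic: four hypothetical features (`f1` to `f4`) generated from two hidden factors, so two components should capture almost all of the variance. Standardizing first is a common choice, not a requirement of the API.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 4 features driven by 2 hidden factors plus small noise
factors = rng.normal(size=(100, 2))
mixing = np.array([[1.0, 0.5, 0.0, 1.0],
                   [0.0, 1.0, 1.0, -1.0]])
X = pd.DataFrame(
    factors @ mixing + rng.normal(scale=0.05, size=(100, 4)),
    columns=["f1", "f2", "f3", "f4"],
)

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of highest variance
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)  # shape (100, 2), ready for a 2-D scatter plot

print(pca.explained_variance_ratio_)
```

The two columns of `components` can be passed to `sns.scatterplot()` with a `hue` column to look for clusters.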

### ✅ When to Use:

* When studying combined effects of features

### ❌ When Not to Use:

* With too many variables (can be noisy)

### ⚠️ Limitations:

* Hard to visualize beyond 3 dimensions

### 🎯 Interview Questions:

1. What is multivariate analysis?
2. When would you use PCA in EDA?
3. Difference between pairplot and FacetGrid?

---

## 📉 6. Outlier Detection

### 📘 Definition:

Finding data points that are significantly different from others.

### 🔧 Built-in Functions:

* `scipy.stats.zscore()` – Z-score method
* `df['col'].quantile()` – Quartiles for the IQR rule
* `sns.boxplot()` – Boxplot for visual detection
* `IsolationForest()` – Tree-based outlier detection

### 🧪 Example:

```python
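# Assumes df has a numeric column 'A'
Q1 = df['A'].quantile(0.25)  # first quartile
Q3 = df['A'].quantile(0.75)  # third quartile
IQR = Q3 - Q1
# Flag rows outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR
outliers = df[(df['A'] < Q1 - 1.5*IQR) | (df['A'] > Q3 + 1.5*IQR)]
print(outliers)
# Output: Rows considered as outliers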
```
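
The Z-score rule and Isolation Forest can be sketched on synthetic data. The sample is hypothetical (100 typical values plus two planted extremes), and both the threshold of 3 and `contamination=0.02` are assumptions you would tune for real data.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical sample: 100 typical values around 50 plus two extreme ones
values = np.concatenate([rng.normal(loc=50, scale=5, size=100), [120.0, -40.0]])

# Z-score rule: flag points more than 3 standard deviations from the mean
z = zscore(values)
z_outliers = values[np.abs(z) > 3]

# Isolation Forest: fit_predict() returns -1 for outliers and 1 for inliers
forest = IsolationForest(contamination=0.02, random_state=0)
labels = forest.fit_predict(values.reshape(-1, 1))
forest_outliers = values[labels == -1]

print("z-score:", z_outliers)
print("isolation forest:", forest_outliers)
```

Note that extreme values inflate the mean and standard deviation used by the Z-score, so the IQR rule is often the more robust first check.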

### ✅ When to Use:

* Before modeling, especially with outlier-sensitive methods such as linear regression or distance-based models

### ❌ When Not to Use:

* In naturally long-tailed distributions

### ⚠️ Limitations:

* May remove valid rare events

### 🎯 Interview Questions:

1. What is IQR and how is it used?
2. How does Isolation Forest detect outliers?
3. Why are outliers important in ML?

---

## 🧬 7. Feature Distribution Analysis

### 📘 Definition:

Examining the distribution shape of individual features.

### 🔧 Built-in Functions:

* `sns.histplot()` – Histogram
* `sns.kdeplot()` – KDE plot
* `sns.violinplot()` – Violin plot
* `np.log()`, `np.sqrt()` – Skewness correction (`np.log1p()` when zeros are present)

### 🧪 Example:

```python
# Assumes df has a categorical 'Gender' column and a numeric 'A' column
sns.violinplot(x='Gender', y='A', data=df)
plt.show()
# Output: Violin plot showing distribution by Gender
```
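
Skewness correction can be compared numerically with a sketch like the following. The feature `x` is a hypothetical log-normal draw, so it is right-skewed and strictly positive, which Box-Cox requires.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

rng = np.random.default_rng(0)
# Hypothetical right-skewed, strictly positive feature (log-normal draw)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=500))

log_x = np.log1p(x)   # log(1 + x), safe when zeros are present
sqrt_x = np.sqrt(x)   # milder correction than the log

# Box-Cox needs strictly positive input; lam is fitted by maximum likelihood
bc_values, lam = boxcox(x.to_numpy())
bc_x = pd.Series(bc_values)

for name, data in [("raw", x), ("log1p", log_x), ("sqrt", sqrt_x), ("box-cox", bc_x)]:
    print(f"{name:8s} skew={data.skew():.2f}")
```

Comparing the skewness before and after each transform is a quick way to pick one; a value close to 0 indicates a roughly symmetric distribution.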

### ✅ When to Use:

* When verifying data assumptions like normality

### ❌ When Not to Use:

* When shape is irrelevant to analysis/model

### ⚠️ Limitations:

* Plots can be misleading if outliers are not handled first

### 🎯 Interview Questions:

1. What’s the use of KDE plot?
2. How do you handle skewed features?
3. When do you apply log transformation?

---



## 🧩 8. Feature Engineering

### 📘 Definition:

Feature engineering is the process of creating new features or modifying existing ones to improve model performance.

### 🔧 Built-in Functions:

* `df['new'] = df['col1'] + df['col2']` – Create new features
* `df['col'].apply()` – Apply custom transformations
* `pd.cut()` – Bin numerical values
* `pd.qcut()` – Quantile binning
* `np.log()`, `np.sqrt()` – Transform features
* `PolynomialFeatures()` – Generate polynomial terms

### 🧪 Example:

```python
# log1p(x) computes log(1 + x), which stays defined when x is 0
df['log_A'] = np.log1p(df['A'])
print(df[['A', 'log_A']])
# Output: Original and log-transformed column
```

### ✅ When to Use:

* To expose hidden patterns to ML algorithms

### ❌ When Not to Use:

* When the raw features already give good model performance and extra features would only add complexity

### ⚠️ Limitations:

* Over-engineering can cause overfitting

### 🎯 Interview Questions:

1. What is feature engineering?
2. How does feature transformation affect ML models?
3. Difference between `pd.cut()` and `pd.qcut()`?

---

## 🔁 9. Feature Selection

### 📘 Definition:

Selecting the most relevant features that contribute to the model and removing irrelevant ones.

### 🔧 Built-in Functions:

* `SelectKBest()` – Top K features
* `RFE()` – Recursive Feature Elimination
* `VarianceThreshold()` – Low variance filter
* `df.corr()` – Correlation-based filtering
* `model.feature_importances_` – From tree models

### 🧪 Example:

```python
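from sklearn.feature_selection import VarianceThreshold

# Assumes df is the Data Cleaning frame: 'C' is constant (variance 0), so it falls below the threshold
selector = VarianceThreshold(threshold=0.01)
df_selected = selector.fit_transform(df[['A', 'C']])
print(df_selected)
# Output: array containing only the values of column 'A'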
```
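
A filter method and a tree-based importance ranking can be compared with a sketch like this. The data is synthetic: the hypothetical column `signal` is built from the target, while `noise_1` and `noise_2` are unrelated to it. `f_classif` and `k=1` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 300
# Hypothetical data: only 'signal' carries information about the target
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "signal": y * 2.0 + rng.normal(size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})

# Filter method: score each feature on its own (ANOVA F-test), keep the top k
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(selected)

# Embedded method: importances learned while fitting a tree ensemble
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```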

### ✅ When to Use:

* To reduce overfitting and improve model performance

### ❌ When Not to Use:

* On small datasets where all features are essential

### ⚠️ Limitations:

* Risk of removing informative features

### 🎯 Interview Questions:

1. Why is feature selection important?
2. What’s the difference between filter and wrapper methods?
3. How do tree models help in feature selection?

---

## 🧪 10. Handling Imbalanced Data

### 📘 Definition:

Techniques used to address datasets where the target class distribution is skewed.

### 🔧 Built-in Functions:

* `df['target'].value_counts()` – View imbalance
* `sklearn.utils.resample()` – Over/undersampling
* `SMOTE()` – Synthetic Minority Over-sampling (from the `imbalanced-learn` package)
* `class_weight='balanced'` – Adjust model training
* `confusion_matrix()` – Evaluate imbalance impact

### 🧪 Example:

```python
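import pandas as pd
from sklearn.utils import resample

# Toy frame with an 8:2 class split (illustrative values)
df = pd.DataFrame({'feature': range(10), 'target': [0] * 8 + [1] * 2})

df_majority = df[df['target'] == 0]
df_minority = df[df['target'] == 1]
# Sample the minority class with replacement until it matches the majority size
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)
df_balanced = pd.concat([df_majority, df_minority_upsampled])
print(df_balanced['target'].value_counts())
# Output: both classes now have 8 rows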
```
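
As an alternative to resampling, a stratified split combined with `class_weight='balanced'` can be sketched as follows. The data is synthetic (about 10% positives, one hypothetical feature), and logistic regression is used only as a simple stand-in model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical data: about 10% positives, shifted by 1.5 in the single feature
y = (rng.random(1000) < 0.1).astype(int)
X = (rng.normal(size=1000) + 1.5 * y).reshape(-1, 1)

# stratify=y keeps the class ratio the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

plain = LogisticRegression().fit(X_train, y_train)
# class_weight='balanced' weights each class inversely to its frequency
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

for name, model in [("plain", plain), ("balanced", weighted)]:
    pred = model.predict(X_test)
    print(name, "minority recall:", round(recall_score(y_test, pred), 2))
    print(confusion_matrix(y_test, pred))
```

Recall on the minority class and the confusion matrix are more informative here than accuracy, which a model can inflate by predicting the majority class for every row.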

### ✅ When to Use:

* When classification accuracy is biased toward majority class

### ❌ When Not to Use:

* On already balanced data

### ⚠️ Limitations:

* Oversampling may introduce noise and undersampling may remove useful data; resample only the training split, never the test set

### 🎯 Interview Questions:

1. What is SMOTE and how does it work?
2. What are some metrics better than accuracy in imbalanced datasets?
3. How do you detect and fix imbalanced datasets?

---

