# Exploratory Data Analysis (EDA) and Preprocessing

EDA deals with understanding the data, identifying patterns, and uncovering insights by visualizing and summarizing the data. Preprocessing involves cleaning and transforming the data to make it suitable for analysis or modeling. Both EDA and preprocessing are crucial steps in any ML pipeline.

### Key things to keep in mind when selecting a dataset from Kaggle, UCI, Google Dataset Search, or any other source:

1. Usability: On platforms that report a usability score (such as Kaggle), it should be roughly 7 to 7.5 out of 10 or higher.
2. Labels: Must have clear labels.
3. Source & Acknowledgment: Must have clear source and acknowledgment of creator.
4. Citation: A dataset that has been used in a publication is more likely to be a reliable source.
5. Synthetic vs Real: Synthetic datasets are generated artificially or copied from other sources, while real datasets are collected from real-world observations. Real datasets are generally preferred for training ML models because they better represent real-world scenarios. Whether a dataset is synthetic or real is usually stated in the context (description) section of that dataset.
6. Overall Summary: Read the overall summary of the dataset to understand its context and purpose.
7. Attributes: Check the attributes/features of the dataset to see if they are relevant to the analysis.

## EDA Steps:

#### Step 1: Import Libraries and Load Dataset
---

At this point, check the basics: the number of rows and columns, the column names, data types, missing values, and summary statistics.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("default")
sns.set(font_scale=1.1)

# Load dataset
df = pd.read_csv("path_to_your_dataset.csv")

# Get basic info
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
print("Info:")
df.info()                   # info() prints its report directly and returns None
print("Description:")
print(df.describe().T)
```

#### Step 2: Define Target & Features Types
---

Explicitly listing numeric and categorical features makes the work reproducible. It also forces one to think about each column and its meaning, which is a key habit in serious ML projects.

```python

target_col = "target_column_name"  # Replace with your target column name
feature_cols = [col for col in df.columns if col != target_col]
# "number" matches every numeric dtype (int32, float32, ...), not only int64 and float64
numerical_cols = df[feature_cols].select_dtypes(include="number").columns.tolist()
categorical_cols = df[feature_cols].select_dtypes(include=['object', 'category']).columns.tolist()

print("Target Column: ", target_col)
print("Numerical Columns: ", numerical_cols)
print("Categorical Columns: ", categorical_cols)
```

#### Step 3: Missing Values & Basic Quality Checks
---

Look for missing values and for impossible or suspicious values, such as zeros or extremely low or high numbers in columns where they make no sense. These may be data entry errors or special codes that need to be handled separately. Even if the dataset description claims there are no missing values, always verify; real-world data often breaks promises. Also check the unique values in categorical columns to identify inconsistencies.

```python
print(df.isna().sum())         # Missing values per column
print(df[numerical_cols].agg(['min', 'max', 'mean', 'median', 'std']))  # Basic stats for numerical columns

# Unique values per categorical column. A Sex column, for example, may contain
# M, F, Female, Male, 'F', etc., which can later be merged during preprocessing.
for col in categorical_cols:
    print(col, df[col].unique())
```
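
Inconsistent labels such as the `Sex` example above can usually be merged with a small mapping. The following is a minimal sketch on toy data; the column values and the `sex_map` dictionary are illustrative assumptions, not part of any particular dataset.

```python
import pandas as pd

# Toy data standing in for a real column; the labels are illustrative only.
toy = pd.DataFrame({"Sex": ["M", "F", "Female", "male ", "'F'", "Male"]})

# Hypothetical mapping from cleaned raw labels to canonical labels.
sex_map = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

cleaned = (
    toy["Sex"]
    .str.strip()          # drop surrounding whitespace
    .str.strip("'\"")     # drop stray quote characters
    .str.lower()          # make matching case-insensitive
)
toy["Sex_clean"] = cleaned.map(sex_map)   # labels missing from sex_map become NaN

print(toy["Sex_clean"].value_counts(dropna=False))
```

Keeping the cleaned values in a new column makes it easy to compare against the raw labels and to spot anything the mapping missed.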

#### Step 4: Understanding Distributions with Histograms and Boxplots
---

Histograms tell the shape of distributions. Boxplots give a quick view of spread and potential outliers. Before any scaling or transformation, a mental picture of how these variables behave is important.

```python
# Histograms for numerical features
df[numerical_cols].hist(bins=15, figsize=(12,8))
plt.suptitle("Histograms of Numerical Features", fontsize=14)
plt.tight_layout()          # Adjust layout so the subplots do not overlap
plt.show()

# Single column histogram 
df['numerical_column_name'].hist(bins=15, figsize=(6,4))
plt.title("Histogram of Numerical Column", fontsize=14)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

# Boxplots for numerical features
plt.figure(figsize=(12,8))
df[numerical_cols].boxplot()
plt.title("Boxplots of Numerical Features", fontsize=14)
plt.xticks(rotation=45)         # Rotate x-axis labels for better readability
plt.show()

# Single column boxplot
plt.figure(figsize=(6,4))
df.boxplot(column='numerical_column_name')
plt.title("Boxplot of Numerical Column", fontsize=14)
plt.show()
```
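
Boxplots flag points beyond the whiskers visually; the same idea can be turned into a count per column. The sketch below uses the common 1.5 × IQR rule on toy data. The helper name `iqr_outlier_counts` and the multiplier `k` are assumptions made for illustration, and the rule itself is only a heuristic.

```python
import pandas as pd

# Toy numerical data; 120 in "age" is a deliberately extreme value.
toy = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 33, 35, 120],
    "income": [30, 32, 35, 36, 38, 40, 41, 43],
})

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (frame < lower) | (frame > upper)   # column-wise comparison
    return mask.sum()

outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

A flagged value is a candidate for inspection, not automatically an error; whether to cap, transform, or keep it depends on the domain.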

#### Step 5: Target Distribution and Class Imbalance
---

See how balanced the target variable is. If one class dominates heavily, resampling strategies or class-weighted models may be needed later. Even if the imbalance is moderate, it is important to be aware of it before modeling.

```python
# Target variable distribution
plt.figure(figsize=(6,4))
sns.countplot(x=target_col, data=df)
plt.title("Target Variable Distribution", fontsize=14)
plt.xlabel("Classes")
plt.ylabel("Count")
plt.show()

# Normalized distribution
print(df[target_col].value_counts(normalize=True))   # Proportion of each class (values between 0 and 1)
```
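
To put a number on the imbalance, one option is the ratio between the largest and smallest class, plus a set of inverse-frequency class weights. The sketch below uses a toy target and the formula `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn applies for `class_weight="balanced"`; the variable names are illustrative.

```python
import pandas as pd

# Toy target with an 80/20 split between two classes.
y = pd.Series([0] * 8 + [1] * 2, name="target")

counts = y.value_counts()
imbalance_ratio = counts.max() / counts.min()     # majority size / minority size

# Inverse-frequency weights: n_samples / (n_classes * class_count)
class_weights = len(y) / (counts.size * counts)

print("Imbalance ratio:", imbalance_ratio)
print(class_weights)
```

The rarer class receives the larger weight, so errors on it count more during training if these weights are passed to a model that supports them.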

#### Step 6: Categorical Feature Exploration
---

For categorical variables, check if some categories have very few samples. Rare categories can be merged, encoded carefully, or sometimes dropped if they add noise instead of signal.

```python
for col in categorical_cols:
    plt.figure(figsize=(5,4))
    sns.countplot(x=col, data=df)
    plt.title(f"Population of {col}", fontsize=14)
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.xticks(rotation=45)         # Rotate x-axis labels for better readability
    plt.show()
```
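
Merging rare categories can be sketched as follows. The toy column, the `"Other"` label, and the 15% threshold `min_share` are assumptions chosen for illustration; a sensible threshold depends on the dataset size.

```python
import pandas as pd

# Toy categorical column: "C" appears only once out of ten rows.
toy = pd.DataFrame({"city": ["A"] * 6 + ["B"] * 3 + ["C"]})

min_share = 0.15                                   # hypothetical rarity threshold
shares = toy["city"].value_counts(normalize=True)  # share of rows per category
rare = shares[shares < min_share].index            # categories below the threshold

# Keep frequent categories as they are and replace rare ones with "Other".
toy["city_grouped"] = toy["city"].where(~toy["city"].isin(rare), "Other")

print(toy["city_grouped"].value_counts())
```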

#### Step 7: Relationships Between Features and Target
---

The relationship between numerical features and the target variable can be visualized using boxplots. This shows how the distribution of each numerical feature differs across the classes of the target variable. Large differences in distributions between target classes often signal strong predictive potential. If the distributions are almost identical, that feature may be less useful on its own.

```python
# Boxplots of numerical features vs target
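n_grid_cols = 3
n_grid_rows = max(1, -(-len(numerical_cols) // n_grid_cols))   # Ceiling division so every feature gets a subplot
plt.figure(figsize=(12, 4 * n_grid_rows))
for i, col in enumerate(numerical_cols, 1):          # 1-based index for subplot
    plt.subplot(n_grid_rows, n_grid_cols, i)         # Grid sized to the number of features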
    sns.boxplot(x=target_col, y=col, data=df)
    plt.title(f"{col} vs {target_col}", fontsize=12)
    plt.xlabel(target_col)
    plt.ylabel(col)

plt.tight_layout()
plt.show()
```
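
The visual comparison can be backed by simple per-class statistics. The sketch below, on toy data with a binary target, computes class means and a rough standardized gap (difference in class means divided by the overall standard deviation). The column names and the `gap` measure are illustrative assumptions, not a formal test.

```python
import pandas as pd

# Toy data: "glucose" differs strongly between classes, "height" does not.
toy = pd.DataFrame({
    "target":  [0, 0, 0, 1, 1, 1],
    "glucose": [80, 85, 90, 140, 150, 160],
    "height":  [160, 170, 180, 161, 170, 179],
})
feature_names = ["glucose", "height"]

# Mean of each feature within each target class.
means = toy.groupby("target")[feature_names].mean()

# Difference in class means, scaled by the overall standard deviation.
gap = (means.loc[1] - means.loc[0]) / toy[feature_names].std()

print(means)
print(gap)
```

A gap near zero mirrors the "almost identical distributions" case in the boxplots, while a large absolute gap suggests the feature separates the classes well on its own.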

#### Step 8: Pairplot for a Subset of Features
---

Pairplots are expensive but powerful for small to medium sized datasets. They show pairwise relationships and distributions in one go.

```python
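# The target column must be in the plotted frame, otherwise hue cannot find it.
# For wide datasets, pass a shorter list of columns instead of all numerical_cols.
sns.pairplot(df[numerical_cols + [target_col]], hue=target_col, diag_kind='hist')
plt.suptitle("Pairplot of Numerical Features", y=1.02, fontsize=14)   # y > 1 keeps the title above the grid
plt.show()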
```

#### Step 9: Correlation Matrix and Heatmap
---

High absolute correlation with the target is interesting, but features with low correlation should not be ignored: some of them become powerful in combination with others. Also watch for high correlation between predictors, which may indicate redundancy.

```python
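# Assumes target_col is numeric (for example 0/1); encode a categorical target before calling corr()
corr_matrix = df[numerical_cols + [target_col]].corr()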
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title("Correlation Matrix", fontsize=14)
plt.tight_layout()
plt.show()

corr_matrix[target_col].sort_values(ascending=False)  # Sorting correlation of features with target variable
```
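
Redundant predictors can also be listed programmatically instead of read off the heatmap. The sketch below keeps only the upper triangle of the absolute correlation matrix and reports pairs above a cutoff. The toy columns and the 0.9 `threshold` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy predictors: "b" is almost a multiple of "a", "c" is only loosely related.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr().abs()

# Mask everything except the strict upper triangle so each pair appears once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

threshold = 0.9                                   # hypothetical redundancy cutoff
high_pairs = pairs[pairs > threshold].sort_values(ascending=False)

print(high_pairs)
```

For each reported pair, it is usually worth checking which of the two features is easier to interpret or more complete before deciding whether to drop one.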

#### Step 10: Categorical Feature vs Target
---

Identify categorical features that have strong relationships with the target variable using stacked bar plots & crosstabs. Crosstabs normalized by row are very helpful. They show how the proportions change across categories of the categorical feature. This is often more informative than raw counts.

```python
for col in categorical_cols:
    # Row-normalized crosstab: proportion of each target class within a category
    ct = pd.crosstab(df[col], df[target_col], normalize='index')
    ct.plot(kind='bar', stacked=True, figsize=(6,4))   # plot() creates its own figure
    plt.title(f"{col} vs {target_col}", fontsize=14)
    plt.xlabel(col)
    plt.ylabel("Proportion")
    plt.xticks(rotation=0)         # Keep x-axis labels horizontal
    plt.show()
```
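
A small toy example shows why row-normalized crosstabs can be more informative than raw counts. The `smoker` column and its values are invented for illustration only.

```python
import pandas as pd

# Toy data: 4 smokers (3 positive) and 16 non-smokers (4 positive).
toy = pd.DataFrame({
    "smoker": ["yes"] * 4 + ["no"] * 16,
    "target": [1, 1, 1, 0] + [1] * 4 + [0] * 12,
})

raw = pd.crosstab(toy["smoker"], toy["target"])                      # raw counts
prop = pd.crosstab(toy["smoker"], toy["target"], normalize="index")  # row proportions

print(raw)
print(prop)
```

In raw counts the "no" group has more positive cases (4 vs 3), yet the proportions show that the positive rate is 0.25 for "no" and 0.75 for "yes". The raw view is dominated by group size; the normalized view exposes the relationship.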


### EDA Best Practices Before Model Building

Some practical guidelines:

- Read the dataset description and understand dataset context where possible  
- Verify data types and ranges, do not trust them blindly  
- Check missing values and consider if they are random or systematic  
- Study distributions, not just summary statistics  
- Look at target distribution and potential class imbalance  
- Explore feature-target relationships through plots and simple statistics  
- Take notes about:
  - Features that look noisy or suspicious  
  - Features that seem strongly related to the target  
  - Possible transformations such as log, binning, or scaling  
  - Any domain-inspired features you might create later  
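
As one example of noting a possible transformation, the sketch below compares skewness before and after a log transform on a toy right-skewed feature. The values and the choice of `np.log1p` are assumptions for illustration; whether a log transform is appropriate depends on the feature and the model.

```python
import numpy as np
import pandas as pd

# Toy right-skewed feature: values double at each step.
income = pd.Series([1, 2, 4, 8, 16, 32, 64, 128], name="income")

skew_before = income.skew()            # strongly positive for a long right tail
skew_after = np.log1p(income).skew()   # log1p(x) = log(1 + x), safe for zeros

print("Skew before:", round(skew_before, 2))
print("Skew after: ", round(skew_after, 2))
```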

> Insight: Good EDA is like building a mental simulation of how the data behaves. Once that simulation is clear, the choices in preprocessing and modeling feel much less random.