# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is typically the first step in a data analysis or machine learning workflow. Its goal is to understand the data, identify patterns, and detect problems before any models are applied.

### Objectives of EDA

- Understand data structure & variables

- Identify missing values & outliers

- Analyze distributions & relationships

- Validate assumptions

- Generate hypotheses & insights

### Types of EDA

#### Univariate Analysis

Analysis of one variable at a time.

1. Numerical Variables:
   - Mean, Median, Mode
   - Variance, Standard Deviation
   - Skewness, Kurtosis
   - Histogram
   - Boxplot

2. Categorical Variables:
   - Frequency table
   - Bar chart
   - Pie chart
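
The summary measures listed above can be computed directly with pandas. The following is a minimal sketch on made-up data; the column names `mpg` and `fuel` are assumptions chosen for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical toy data; replace with your own DataFrame.
df = pd.DataFrame({
    "mpg": [18.0, 21.0, 21.0, 24.0, 30.0, 45.0],
    "fuel": ["petrol", "diesel", "petrol", "petrol", "diesel", "petrol"],
})

# Numerical variable: central tendency, spread, and shape.
mpg = df["mpg"]
summary = {
    "mean": mpg.mean(),
    "median": mpg.median(),
    "mode": mpg.mode().iloc[0],  # mode() can return several values; take the first
    "variance": mpg.var(),       # sample variance (ddof=1)
    "std": mpg.std(),            # sample standard deviation (ddof=1)
    "skewness": mpg.skew(),
    "kurtosis": mpg.kurt(),      # excess kurtosis (about 0 for a normal distribution)
}
print(summary)

# Categorical variable: frequency table.
freq = df["fuel"].value_counts()
print(freq)
```

A mean noticeably larger than the median, as in this toy column, is a quick hint of right skew that a histogram or boxplot would confirm.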

#### Bivariate Analysis

Analysis of two variables.

| Variable Types | Common Plots |
| --- | --- |
| Numerical – Numerical | Scatter plot, Line plot |
| Numerical – Categorical | Boxplot, Violin plot |
| Categorical – Categorical | Grouped bar chart, Heatmap |

##### Measures:

- Correlation (Pearson, Spearman)

- Covariance
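
As a rough illustration of these measures, the sketch below computes Pearson and Spearman correlation and the covariance for one pair of numerical columns. The data and the column names `engine_size` and `cty` are hypothetical.

```python
import pandas as pd

# Hypothetical data: engine size (litres) and city mileage.
df = pd.DataFrame({
    "engine_size": [1.0, 1.4, 1.6, 2.0, 2.5, 3.0],
    "cty": [30, 27, 25, 22, 18, 15],
})

# DataFrame.corr returns the full pairwise matrix; pick the pair of interest.
pearson = df.corr(method="pearson").loc["engine_size", "cty"]    # linear association
spearman = df.corr(method="spearman").loc["engine_size", "cty"]  # rank-based (monotonic) association
covariance = df.cov().loc["engine_size", "cty"]                  # scale-dependent, sample covariance

print(f"Pearson:    {pearson:.3f}")
print(f"Spearman:   {spearman:.3f}")
print(f"Covariance: {covariance:.3f}")
```

Correlation is unitless and bounded between -1 and 1, while covariance depends on the units of both variables, so correlation is usually the easier of the two to compare across pairs.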

#### Multivariate Analysis

Analysis of more than two variables.

##### Techniques:

- Pair plot

- Correlation heatmap

- Facet grids

- Parallel coordinates
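
A correlation heatmap is a colored rendering of a pairwise correlation matrix. The sketch below builds only the matrix, using made-up data and hypothetical column names; passing it to a plotting library is left out to keep the example dependency-free.

```python
import pandas as pd

# Hypothetical data with four numerical variables.
df = pd.DataFrame({
    "engine_size": [1.0, 1.4, 1.6, 2.0, 2.5, 3.0],
    "cylinders": [3, 4, 4, 4, 6, 6],
    "cty": [30, 27, 25, 22, 18, 15],
    "hwy": [40, 36, 35, 31, 26, 22],
})

# Pairwise Pearson correlations between every pair of columns.
corr_matrix = df.corr()
print(corr_matrix.round(2))
```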

## Step 1: Variable Identification

Variable identification is the foundation of Exploratory Data Analysis (EDA). It helps you decide which analysis techniques and plots are appropriate.

### Types of Variables

Predictor Variables vs Target Variable

#### 1. Predictor Variables (Independent Variables)

- Used to explain or predict the target

- Also called features, inputs, or X variables

- Can be numerical or categorical

- Examples: Age, Income, Engine size, City mileage (cty), Education level

#### 2. Target Variable (Dependent Variable)

- The variable you want to predict or explain

- Also called response, output, or Y variable

- Examples: House price, Sales, Customer churn (Yes/No), MPG, Loan approval status

#### Important:

EDA is performed on both the predictors and the target, but the type of the target variable determines the modeling approach:

1. Continuous → Regression

2. Categorical → Classification
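
The mapping above can be sketched as a small helper. `suggest_task` and its `max_classes` threshold are hypothetical names and a rough heuristic introduced here for illustration; a numeric target with only a few distinct values (such as 0/1) is treated as categorical.

```python
import pandas as pd


def suggest_task(target: pd.Series, max_classes: int = 10) -> str:
    """Heuristic only: guess the modeling type from the target column."""
    # A numeric target with many distinct values is treated as continuous.
    if pd.api.types.is_numeric_dtype(target) and target.nunique() > max_classes:
        return "regression"
    # Everything else (labels, or numeric codes with few levels) is treated as categorical.
    return "classification"


# Hypothetical targets.
house_price = pd.Series([250_000.0 + 1_000 * i for i in range(50)])
churn = pd.Series(["Yes", "No", "No", "Yes"])

print(suggest_task(house_price))
print(suggest_task(churn))
```

In practice the decision should rest on what the target means, not only on its dtype, so treat the output as a starting point.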

### Data Types of Variables

Variables are broadly classified into Numerical and Categorical types.

#### A. Numerical Variables

Represent quantitative values where mathematical operations are meaningful.

##### 1. Continuous Variables

- Can take any value within a range

- Measured rather than counted

###### Examples: Height, Weight, Temperature, Salary, MPG

###### Common EDA plots: Histogram, Density plot, Boxplot

##### 2. Discrete Variables

- Countable values

- Usually integers

###### Examples: Number of children, Number of cylinders, Number of transactions

###### Common EDA plots: Bar chart, Count plot

#### B. Categorical Variables

Represent qualitative characteristics

##### 1. Nominal Variables

Categories without order

###### Examples: Gender, Color, City, Fuel type

###### Common EDA plots:
- Bar chart
- Pie chart

##### 2. Ordinal Variables

Categories with a meaningful order, where the spacing between levels is not necessarily equal

###### Examples:

1. Education level (School < Graduate < Postgraduate)

2. Satisfaction rating (Low < Medium < High)

3. Credit score category

###### Common EDA plots:

- Ordered bar chart

- Boxplot (with encoding)
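
One way to keep the order of an ordinal variable and obtain the integer encoding mentioned above is an ordered pandas categorical. This is a minimal sketch using the education levels from the example; the sample values are made up.

```python
import pandas as pd

# Category order taken from the example: School < Graduate < Postgraduate.
levels = ["School", "Graduate", "Postgraduate"]
education = pd.Series(["Graduate", "School", "Postgraduate", "Graduate"])

# An ordered categorical remembers the ranking of its levels.
ordered = pd.Categorical(education, categories=levels, ordered=True)

# Integer codes follow the category order: School=0, Graduate=1, Postgraduate=2.
codes = pd.Series(ordered.codes, index=education.index)

# Counts sorted by category order (not by frequency), ready for an ordered bar chart.
counts = pd.Series(ordered).value_counts().sort_index()

print(codes.tolist())
print(counts)
```

The codes preserve rank only; the gap between 0 and 1 carries no more meaning than the gap between 1 and 2.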

### Identify Variables in Python

```python
# Assumes `df` is an existing pandas DataFrame.

# Numerical and categorical variables
num_vars = df.select_dtypes(include="number")  # all integer and float dtypes
cat_vars = df.select_dtypes(include=["object", "category"])

print("Numerical Variables:\n", num_vars.columns.tolist())
print("Categorical Variables:\n", cat_vars.columns.tolist())
```

#### Why Is This Step Important?

- Decides visualization methods

- Guides feature engineering

- Prevents wrong statistical assumptions

- Helps choose ML algorithms

## Step 2: Univariate Analysis

Univariate analysis examines one variable at a time to understand its distribution, central tendency, spread, and anomalies.

#### Objectives of Univariate Analysis

- Understand data distribution

- Identify outliers

- Detect skewness

- Summarize data using statistics

- Choose suitable transformations

## Step 3: Bivariate Analysis

Bivariate analysis studies the relationship between two variables to understand association, trends, and dependency.
This step helps answer the question: is one variable associated with another?

#### Objectives of Bivariate Analysis

- Identify relationships & patterns

- Measure strength and direction of association

- Compare distributions across groups

- Support feature selection
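
To illustrate comparing distributions across groups, the sketch below summarizes a numerical variable per category and cross-tabulates two categorical variables. The data and the column names `fuel`, `drive`, and `cty` are hypothetical.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "fuel": ["petrol", "diesel", "petrol", "diesel", "petrol", "diesel"],
    "drive": ["fwd", "fwd", "awd", "awd", "fwd", "fwd"],
    "cty": [28, 22, 24, 18, 30, 20],
})

# Numerical vs categorical: summary of city mileage within each fuel type.
group_stats = df.groupby("fuel")["cty"].agg(["mean", "median", "std"])
print(group_stats)

# Categorical vs categorical: contingency table of counts.
contingency = pd.crosstab(df["fuel"], df["drive"])
print(contingency)
```

The grouped summary is the numeric counterpart of a boxplot per group, and the contingency table is what a grouped bar chart or heatmap would display.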

## Step 4: Missing Values Treatment

Missing values can bias analysis, break models, and reduce performance.
This step focuses on identifying, understanding, and handling missing data correctly.

#### Objectives

- Detect missing values

- Understand why they are missing

- Choose the right treatment strategy

- Preserve data integrity
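
A minimal sketch of detection followed by one simple treatment is shown below. Median and mode imputation are only one option among several and are used here as an assumption; the right strategy depends on why the values are missing. The data and column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both a numerical and a categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Detect: count and percentage of missing values per column.
missing_count = df.isna().sum()
missing_pct = df.isna().mean() * 100
print(pd.DataFrame({"count": missing_count, "percent": missing_pct}))

# Treat: median for the numerical column, most frequent value for the categorical one.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode().iloc[0])
print(filled)
```

Working on a copy keeps the original data available for comparing results before and after imputation.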

## Step 5: Outlier Detection & Treatment

Outliers are observations that deviate significantly from the rest of the data.
They can distort statistics, mislead models, or sometimes represent important insights.

#### Objectives

- Detect outliers accurately

- Understand their cause

- Decide whether to remove, cap, or keep

- Minimize negative impact on models
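
One common detection rule, used here as an assumption since the text does not name a specific method, is the 1.5 × IQR fence that also defines boxplot whiskers. The sketch flags values outside the fences and shows capping as one possible treatment; the `income` values are made up.

```python
import pandas as pd

# Hypothetical data with one extreme value.
income = pd.Series([32, 35, 36, 38, 40, 41, 43, 45, 250], name="income")

# IQR fences: values beyond 1.5 * IQR from the quartiles are flagged.
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detect.
outliers = income[(income < lower) | (income > upper)]
print("Outliers:", outliers.tolist())

# Treat by capping: pull extreme values back to the fences.
capped = income.clip(lower=lower, upper=upper)
print("Capped max:", capped.max())
```

Whether to cap, remove, or keep a flagged value still depends on its cause: a data-entry error and a genuinely high earner call for different decisions.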

## Step 6: Variable Transformation

Variable transformation modifies variables to make the data more suitable for analysis and modeling.
It can help satisfy statistical assumptions, improve model performance, and stabilize variance.

#### Objectives

- Reduce skewness

- Handle outliers

- Normalize scale

- Improve linear relationships

- Prepare data for ML algorithms
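
Two of these objectives can be sketched briefly: a log transform to reduce right skew and standardization to normalize scale. Both are common choices assumed here for illustration rather than the only options, and the `income` values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable.
income = pd.Series([20, 25, 30, 35, 40, 60, 120, 400], dtype="float64")

# Log transform: log(1 + x) compresses large values and is defined at x = 0.
log_income = np.log1p(income)

# Standardization (z-score): mean 0 and standard deviation 1.
z_scores = (income - income.mean()) / income.std()

print("Skewness before:", round(income.skew(), 2))
print("Skewness after: ", round(log_income.skew(), 2))
```

Standardization changes only the scale, not the shape, so it does not reduce skewness on its own.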

## Step 7: Variable Creation

Variable creation (feature engineering) means generating new, meaningful variables from existing data to improve model performance and insights.
Well-chosen derived variables can have a large effect on both the analysis and the resulting models.

#### Objectives

- Capture hidden patterns

- Improve predictive power

- Reduce noise

- Add domain knowledge
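
The sketch below shows three typical kinds of derived variables: a ratio, a date component, and a binned category. All column names, values, and bin edges are hypothetical and would come from domain knowledge in a real project.

```python
import pandas as pd

# Hypothetical housing data.
df = pd.DataFrame({
    "price": [300_000, 450_000, 600_000],
    "area_sqft": [1_000, 1_500, 1_200],
    "sold_on": pd.to_datetime(["2023-01-15", "2023-06-30", "2024-03-01"]),
    "buyer_age": [23, 41, 67],
})

# Ratio feature: normalizes price by size.
df["price_per_sqft"] = df["price"] / df["area_sqft"]

# Date component: exposes seasonality hidden inside a timestamp.
df["sold_month"] = df["sold_on"].dt.month

# Binning: turns a numerical variable into ordered groups (right-inclusive intervals).
df["age_group"] = pd.cut(
    df["buyer_age"],
    bins=[0, 30, 60, 120],
    labels=["young", "middle", "senior"],
)

print(df[["price_per_sqft", "sold_month", "age_group"]])
```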