# Visualization Guide for Univariate, Bivariate & Multivariate Analysis

## 1. UNIVARIATE ANALYSIS (Single Variable)

### For Numerical Variables

| Plot | Purpose | Best When |
|------|---------|-----------|
| **Histogram** | Shows frequency distribution | Understanding data spread and shape |
| **KDE Plot** | Smooth probability density curve | Visualizing continuous distribution |
| **Box Plot** | Shows median, quartiles, outliers | Detecting outliers and spread |
| **Violin Plot** | Box plot + density shape | Seeing full distribution shape |
| **Rug Plot** | Individual data points as ticks | Small datasets, shows actual values |
| **QQ Plot** | Compares to theoretical distribution | Checking normality assumption |

## 2. BIVARIATE ANALYSIS (Two Variables)

### NUM-NUM (Numerical vs Numerical)

| Plot | Purpose | Best When |
|------|---------|-----------|
| **Scatter Plot** | Shows relationship between two numbers | First look at how two numerical variables relate |
| **Regression Plot** | Scatter + fitted regression line | Checking linear relationship |
| **Hexbin Plot** | Binned scatter (hexagonal) | Large datasets (>10k points) |
| **2D KDE / Contour Plot** | Density of point clusters | Identifying concentration areas |
| **Joint Plot** | Scatter + marginal distributions | Complete picture of both variables |
| **Line Plot** | Trend over ordered variable | Time series data |

### CAT-CAT (Categorical vs Categorical)

| Plot | Purpose | Best When |
|------|---------|-----------|
| **Grouped Bar Chart** | Side-by-side comparison | Comparing counts across groups |
| **Stacked Bar Chart** | Composition within categories | Showing part-to-whole |
| **Heatmap** | Contingency table visualization | Many categories, frequency matrix |
| **Mosaic Plot** | Proportional area representation | Showing relative proportions |
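
The frequency matrix behind a heatmap, stacked bar chart, or mosaic plot is a contingency table. A minimal sketch of building one with the standard library follows; the `(plan, churned)` records and their labels are made up purely for illustration.

```python
from collections import Counter

# Hypothetical categorical data: (subscription plan, churned?) per customer.
records = [
    ("basic", "yes"), ("basic", "no"), ("basic", "yes"), ("basic", "no"),
    ("pro", "no"), ("pro", "no"), ("pro", "yes"),
    ("team", "no"), ("team", "no"), ("team", "no"),
]

counts = Counter(records)
plans = sorted({plan for plan, _ in records})        # row labels
outcomes = sorted({churn for _, churn in records})   # column labels

# Contingency table: one row per plan, one column per outcome.
table = [[counts[(plan, outcome)] for outcome in outcomes] for plan in plans]

# Row-normalised shares: the part-to-whole view a 100% stacked bar would show.
row_shares = [[cell / sum(row) for cell in row] for row in table]

for plan, row in zip(plans, table):
    print(plan, dict(zip(outcomes, row)))
```

The raw counts in `table` are what a heatmap would colour; `row_shares` is the composition view.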

### NUM-CAT (Numerical vs Categorical)

| Plot | Purpose | Best When |
|------|---------|-----------|
| **Grouped Box Plot** | Distribution per category | Comparing spread across groups |
| **Grouped Violin Plot** | Full distribution shape per category | Seeing density differences |
| **Strip Plot** | Individual points by category | Small datasets |
| **Swarm Plot** | Non-overlapping strip plot | Medium datasets |
| **Bar Plot** | Mean/median with error bars | Summarizing central tendency |
| **Point Plot** | Mean with confidence intervals | Comparing means across groups |
| **Ridge Plot / Joy Plot** | Overlapping distributions | Many categories; compact comparison of distribution shapes |

## 3. MULTIVARIATE ANALYSIS (3+ Variables)

### Multiple Numerical Variables

| Plot | Purpose | Best When |
|------|---------|-----------|
| **Pair Plot / Scatter Matrix** | All pairwise scatter plots | Initial exploration of relationships |
| **Correlation Heatmap** | Visualize correlation matrix | Identifying correlated features |
| **3D Scatter Plot** | Three numerical variables | Spatial relationship (limited use) |
| **Bubble Chart** | Scatter + size dimension | Adding third numerical variable |
| **Parallel Coordinates** | Multiple dimensions as parallel axes | High-dimensional comparison |
| **Radar / Spider Chart** | Multiple metrics on radial axes | Comparing profiles |
| **PCA / t-SNE Plot** | Dimensionality reduction visualization | Very high dimensions |

### Mixed Variables (Num + Cat combined)

| Plot | Purpose | Best When |
|------|---------|-----------|
| **Colored Scatter Plot** | Scatter with color = category | Adding categorical dimension |
| **Facet Grid / Small Multiples** | Repeat plots for each category | Comparing patterns across groups |
| **Pair Plot with Hue** | Scatter matrix colored by category | Multivariate + grouping |
| **Grouped Box/Violin with Hue** | Multiple grouping levels | Two categorical + one numerical |

## Quick Reference Summary Table

| Analysis | Variable Types | Primary Plots |
|----------|---------------|---------------|
| **Univariate** | Numerical | Histogram, Box, Violin, KDE |
| **Univariate** | Categorical | Bar/Count, Pie |
| **Bivariate** | Num-Num | Scatter, Regression, Joint, Hexbin |
| **Bivariate** | Cat-Cat | Grouped Bar, Stacked Bar, Heatmap |
| **Bivariate** | Num-Cat | Box, Violin, Strip, Swarm, Bar |
| **Multivariate** | Multiple Num | Pair Plot, Correlation Heatmap, Parallel Coords |
| **Multivariate** | Mixed | Colored Scatter, Facet Grid, Pair Plot with Hue |

<br>
<br>
<hr>
<br>
<br>

#### What is the Interquartile Range (IQR), and why is it considered a "robust" measure of spread compared to standard deviation?
→ Standard deviation is computed from the mean and squares each deviation, so a few extreme values can inflate it well beyond the typical spread. The IQR is the distance between the 25th and 75th percentiles (IQR = Q3 − Q1), i.e. the width of the middle 50% of the data. Because values in the tails barely move those percentiles, the IQR is largely unaffected by outliers and heavy skew, which makes it a more robust summary of typical variability.
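
A small sketch of that robustness, using the standard library. The salary figures are invented for illustration, and the `iqr` helper is hypothetical; it uses the "inclusive" quantile method, one of several common conventions.

```python
import statistics

def iqr(values):
    """Interquartile range: Q3 - Q1, the width of the middle 50% of the data."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical sample: ten typical salaries (in thousands), then one extreme value.
typical = [40, 42, 45, 47, 50, 52, 55, 57, 60, 62]
with_outlier = typical + [1000]

std_before = statistics.stdev(typical)
std_after = statistics.stdev(with_outlier)
iqr_before = iqr(typical)
iqr_after = iqr(with_outlier)

print(f"std: {std_before:.1f} -> {std_after:.1f}")
print(f"IQR: {iqr_before:.1f} -> {iqr_after:.1f}")
```

In this sample a single extreme value multiplies the standard deviation many times over, while the IQR only moves from 11.0 to 12.5.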

<br>
<br>
<hr>
<br>
<br>

### 1. Variance vs. Covariance vs. Correlation

To understand the difference, imagine you are analyzing data about people's height and weight.

#### **Variance (The Spread of One)**
Variance measures how spread out a **single** variable is from its own average.
*   **Function:** It tells you about the dispersion of data.
*   **Formula Concept:** Average of squared differences from the mean.
*   **Example:** Analyzing only *Height*. High variance means you have a mix of giants and short people. Low variance means everyone is roughly the same height.
*   **Units:** The units are squared (e.g., $\text{cm}^2$). This makes it hard to interpret intuitively.

#### **Covariance (The Joint Movement of Two)**
Covariance measures how **two** variables change together.
*   **Function:** It tells you the *direction* of the relationship.
*   **Formula Concept:** Average of the products of paired deviations: ($x$ minus the mean of $X$) times ($y$ minus the mean of $Y$).
*   **Example:** Analyzing *Height* and *Weight*.
    *   **Positive Covariance:** As height increases, weight tends to increase.
    *   **Negative Covariance:** As one increases, the other tends to decrease.
*   **Units:** The units are the product of the two variables (e.g., $\text{cm} \cdot \text{kg}$).

#### **Correlation (The Standardized Relationship)**
Correlation is the **normalized** version of covariance.
*   **Function:** It tells you the *direction* AND the *strength* of the relationship on a standardized scale.
*   **Formula Concept:** Covariance divided by the product of the standard deviations of $X$ and $Y$.
*   **Units:** **Unitless.** It is a pure number between -1 and +1.
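
The three "formula concepts" above can be sketched in a few lines of plain Python. The height and weight values are made up, and the helpers use the sample convention (dividing by $n - 1$), which is an assumption; dividing by $n$ gives the population versions.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Sample variance: average squared deviation from the mean (n - 1 divisor)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def covariance(xs, ys):
    """Sample covariance: average product of paired deviations (n - 1 divisor)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (math.sqrt(variance(xs)) * math.sqrt(variance(ys)))

# Hypothetical measurements for five people.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 84]

print("variance of height (cm^2):", variance(heights_cm))            # 250.0
print("covariance (cm*kg):", covariance(heights_cm, weights_kg))      # 207.5
print("correlation (unitless):", round(correlation(heights_cm, weights_kg), 3))
```

Note how only the last number can be judged without knowing the units.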

---

### 2. How the problem of one solves another

These three measures build on one another: each one addresses a limitation of the previous one.

#### **Problem 1: Variance tells us nothing about relationships.**
Variance is excellent for understanding volatility (risk) in a single dataset, but it is "blind" to how that dataset interacts with the outside world.
*   **Solution (Covariance):** By multiplying the deviations of two different variables, Covariance allows us to see if Variable A and Variable B move in sync.

#### **Problem 2: Covariance has an "Interpretation Problem" (The Scale Issue).**
This is the biggest issue. Because covariance keeps the units (e.g., $\text{cm} \cdot \text{kg}$), the number changes based on the scale.
*   *Scenario:* If you measure height in **meters**, you might get a covariance of **1.5**. If you convert height to **centimeters**, every height deviation becomes 100 times larger, so the covariance jumps to **150** even though the relationship is exactly the same.
*   *The Issue:* If I tell you the covariance is 500, is that a strong relationship? You don't know. It depends on the units.

#### **Solution (Correlation):**
Correlation solves the scale problem by **dividing the covariance by the product of the two standard deviations** (the square roots of the variances of $X$ and $Y$).
Dividing the "co-movement" by the "individual spreads" cancels out the units.
*   **Result:** A correlation of 0.8 means the same thing whether you are measuring in inches, miles, or light-years.
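
A quick numerical check of the scale issue, as a sketch with invented data: the same heights are expressed in meters and in centimeters, and only the covariance changes.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson r: covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: identical people, height recorded in two units.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [52, 58, 69, 77, 84]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print(f"covariance:  {cov_m:.3f} (m*kg)  vs  {cov_cm:.1f} (cm*kg)")
print(f"correlation: {r_m:.4f}  vs  {r_cm:.4f}")
```

The covariance scales by the conversion factor of 100; the correlation is identical up to floating-point rounding.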

---

### 3. Why are they limited to Linear Relationships?

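Pearson correlation and covariance measure how well a **straight line** (a constant rate of change) describes the relationship.

If you look at the math: $\sum (x - \bar{x})(y - \bar{y})$, it sums the products of paired deviations. A point contributes positively when $x$ and $y$ sit on the same side of their means and negatively when they sit on opposite sides.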

**The "Parabola" Problem:**
Imagine a U-shaped curve (like $y = x^2$, from $x = -10$ to $x = 10$).
*   On the left side, as $X$ goes up (moves toward 0), $Y$ goes down. (Negative correlation).
*   On the right side, as $X$ goes up (moves away from 0), $Y$ goes up. (Positive correlation).

If you run a standard correlation calculation on this perfect mathematical relationship, the negative side cancels out the positive side. The result is **0**. Correlation claims there is "no relationship," even though there is a perfect non-linear relationship.

Therefore, <u>`standard covariance and correlation are only reliable when the rate of change is constant (a straight line).`</u>
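
The parabola case is easy to verify numerically. A minimal sketch, using a hand-written `pearson` helper (hypothetical name) on $y = x^2$ for integer $x$ from $-10$ to $10$:

```python
import math

def pearson(xs, ys):
    """Pearson correlation from sums of deviation products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(-10, 11))
ys = [x ** 2 for x in xs]

left = [x for x in xs if x <= 0]    # descending arm of the U
right = [x for x in xs if x >= 0]   # ascending arm of the U

r_full = pearson(xs, ys)
r_left = pearson(left, [x ** 2 for x in left])
r_right = pearson(right, [x ** 2 for x in right])

print(f"full range: {r_full:.3f}")
print(f"left half:  {r_left:.3f}")
print(f"right half: {r_right:.3f}")
```

Each arm on its own shows a strong correlation (negative on the left, positive on the right); over the full symmetric range the two cancel to essentially zero.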

---

### 4. Types of Correlation Coefficients and When to Use Them

Because of the linearity limitation and differing data types, different coefficients suit different situations.

#### **A. Pearson Correlation Coefficient ($r$)**
*   **What is it?** The standard, most common type.
*   **When to use:**
    *   Both variables are continuous (e.g., height, stock price, temperature).
    *   The relationship is **Linear**.
    *   The data is approximately normally distributed (this matters mainly for significance tests and confidence intervals on $r$).
    *   There are no extreme outliers (Pearson is very sensitive to outliers).

#### **B. Spearman’s Rank Correlation ($\rho$ or rho)**
*   **What is it?** A non-parametric measure. Instead of using the raw values, it ranks the data (1st, 2nd, 3rd highest) and calculates the Pearson correlation of the **ranks**.
*   **When to use:**
    *   **Non-Linear but Monotonic:** The relationship isn't a straight line, but it moves in one direction (e.g., an exponential curve).
    *   **Ordinal Data:** Data that has an order but no fixed numeric spacing between levels (e.g., Survey results: "Satisfied", "Neutral", "Unsatisfied").
    *   **Outliers:** You have extreme data points that are ruining your Pearson score. Ranking them minimizes the impact of the outlier.
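
As a sketch of the "rank first, then correlate" idea, the helpers below (hypothetical names, plain Python) compute Spearman's coefficient as Pearson on average ranks and compare it with Pearson on a monotonic but non-linear curve, $y = 2^x$.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average 1-based position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

xs = list(range(1, 11))
ys = [2 ** x for x in xs]                 # monotonic, far from a straight line

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Pearson is pulled well below 1 by the curvature, while Spearman reports a perfect monotonic relationship.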

#### **C. Kendall’s Tau ($\tau$)**
*   **What is it?** Similar to Spearman, but instead of correlating rank values it counts **concordant** pairs (two observations ordered the same way on both variables) against **discordant** pairs (ordered in opposite ways).
*   **When to use:**
    *   **Small sample sizes.**
    *   When there are many "tied" ranks (e.g., three people ranked 3rd). Kendall’s Tau is statistically more robust than Spearman in these specific cases.
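
A minimal sketch of the pair-counting idea. It implements the tau-b variant, which adjusts the denominator for ties; the function name and the tiny datasets are illustrative assumptions, and the quadratic loop is only suitable for small samples.

```python
import math
from itertools import combinations

def kendall_tau_b(xs, ys):
    """Kendall's tau-b: (concordant - discordant) pairs, with a tie-adjusted denominator."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0:
            tied_x += 1                   # pair tied on x
        if dy == 0:
            tied_y += 1                   # pair tied on y
        if dx * dy > 0:
            concordant += 1               # ordered the same way on both variables
        elif dx * dy < 0:
            discordant += 1               # ordered in opposite ways
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))

# No ties: 9 concordant pairs, 1 discordant pair out of 10.
tau_no_ties = kendall_tau_b([1, 2, 3, 4, 5], [1, 3, 2, 4, 5])

# With ties: 4 concordant pairs, one pair tied on x, one pair tied on y.
tau_with_ties = kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3])

print(tau_no_ties, tau_with_ties)
```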

#### **D. Point-Biserial Correlation**
*   **What is it?** Pearson correlation computed with the binary variable coded as 0 and 1.
*   **When to use:**
    *   One variable is continuous (e.g., Test Score).
    *   One variable is binary/dichotomous (e.g., Gender: Male/Female, or Pass/Fail).
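
The point-biserial coefficient is usually written in terms of the two group means, but it is numerically the same as Pearson's $r$ with the binary variable coded 0/1. A sketch under that coding, with made-up scores and a hypothetical group flag:

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(values, flags):
    """(mean of group 1 - mean of group 0) / population SD * sqrt(p * q)."""
    group1 = [v for v, f in zip(values, flags) if f == 1]
    group0 = [v for v, f in zip(values, flags) if f == 0]
    n = len(values)
    mean_all = sum(values) / n
    sd_pop = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean_gap = sum(group1) / len(group1) - sum(group0) / len(group0)
    return mean_gap / sd_pop * math.sqrt(len(group1) * len(group0) / n ** 2)

# Hypothetical data: test scores and a 0/1 group membership flag.
scores = [55, 60, 65, 70, 75, 80, 85, 90]
group = [0, 0, 0, 1, 0, 1, 1, 1]

r_formula = point_biserial(scores, group)
r_pearson = pearson(scores, group)
print(f"point-biserial formula: {r_formula:.4f}")
print(f"Pearson with 0/1 coding: {r_pearson:.4f}")
```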

<br>
<br>
<hr>
<br>
<br>

### The Fundamental, Non-Mathematical Difference

**Correlation** describes a statistical relationship where two variables tend to change together in a predictable way. It's an observation about a pattern. You can think of it as **"What"** — *what* is happening in the data.

**Causation** implies that a change in one variable is directly responsible for producing a change in the other. It's about a mechanism of action or influence. You can think of it as **"Why"** — *why* it is happening.

In essence:
*   **Correlation is about association.** (X and Y move together).
*   **Causation is about consequence.** (X *makes* Y happen).

---

### Why a Strong Correlation Can Be Completely Misleading

A strong correlation flags a relationship worth investigating, but it doesn't tell you *why* that relationship exists. It can be misleading because the observed link may be explained by other, hidden factors. Here are the primary reasons:

1.  **The Third Variable (Confounding/Lurking Variable):** This is the most common culprit. A third, unmeasured variable is causing *both* of the observed variables to change.
    *   **Classic Example:** There is a strong positive correlation between *ice cream sales* and *swimming pool drownings*. Does buying ice cream cause drowning? No. The lurking variable is **hot weather (summer season)**. Hot weather causes more people to buy ice cream *and* more people to swim, leading to more drownings.

2.  **Coincidence (Spurious Correlation):** Sometimes two variables trend together purely by random chance, with no logical connection at all.
    *   **Example:** There is a historically strong correlation between the number of films Nicolas Cage appeared in a given year and the number of people who drowned by falling into a pool. The link is utterly coincidental and meaningless.

3.  **Reverse Causation:** The direction of cause and effect is the opposite of what one might assume. Correlation doesn't tell you which variable is the cause and which is the effect.
    *   **Example:** A study finds a correlation between low self-esteem and depression. Does low self-esteem *cause* depression? Possibly. But it is equally plausible that being depressed *causes* low self-esteem. The correlation alone cannot untangle this.

4.  **Selection Bias (The Sample is Not Representative):** The observed correlation exists only within a specific, non-random sample and does not reflect a true relationship in the broader population.
    *   **Example:** A university finds a strong correlation between students' ownership of textbooks and their final grades. It would be misleading to conclude textbooks *cause* better grades. The sample includes only students who enrolled and stayed in the course; students who dropped out early (some of whom may have owned the textbooks and still done poorly) are missing. Within the remaining sample, the relationship may also be driven by a student's overall motivation or commitment rather than by the textbooks themselves.

**Key Takeaway:** A strong correlation is a clue, not a conclusion. It signals that something interesting *might* be happening and warrants deeper investigation. Establishing causation requires logical reasoning, controlled experiments (where possible), and ruling out these alternative explanations. It means moving from simply observing a pattern to understanding the underlying process.
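
The confounding case can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: the numbers are invented, temperature is the only driver of both series, and the two series have no direct link. The partial correlation formula then shows what remains once temperature is held fixed.

```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z from both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy world: temperature drives both series; neither affects the other.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream_sales = [20 * t + rng.gauss(0, 50) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1) for t in temperature]

r_raw = pearson(ice_cream_sales, drownings)
r_partial = partial_correlation(
    r_raw,
    pearson(ice_cream_sales, temperature),
    pearson(drownings, temperature),
)
print(f"raw correlation:                  {r_raw:.2f}")
print(f"after controlling for temperature: {r_partial:.2f}")
```

The raw correlation is strong, yet it largely disappears once the lurking variable is accounted for. In real data the confounder is often unmeasured, so this check is not always available.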

---

### The Core Problem in ML
Most ML models are **masters of correlation, not causation**. They find patterns (correlations) in historical data to make predictions. They assume the future will behave like the past. But if those patterns are not causal, the model can fail spectacularly when you try to use it to make decisions that *change* the world.

---

### How It Affects ML in Practice: Simple Examples

#### 1. **Models Learn to Predict, Not to Understand "Why"**
*   **Example:** A bank builds a model to predict loan defaults. It finds a strong correlation: **"People who buy high-quality printers are less likely to default."**
*   **The Correlation Trap:** The model might use "printer ownership" as a key signal for a safe borrower.
*   **The Causal Reality:** Buying an expensive printer doesn't *cause* financial stability. Both are likely caused by a **lurking variable**: being **organized and financially responsible**. A responsible person buys a good printer *and* pays their bills.
*   **The Harm:** If the bank now **denies loans to people without good printers**, it is penalizing applicants on the basis of a proxy signal rather than their actual financial behavior. It's rejecting potentially good customers for the wrong reason.

#### 2. **Models Break When the World Changes (Lack of Robustness)**
*   **Example:** An e-commerce ML model learns to recommend products. It finds a strong historical correlation: **"People who click on article links about 'sunscreen' also buy swimsuits."** So, it recommends swimsuits on sunscreen pages.
*   **The Correlation Trap:** The model assumes this link is fixed.
*   **The Causal Reality:** The link wasn't causal; it was driven by context (**summer season**). In summer, people read about sun protection *and* buy beachwear.
*   **The Harm:** If you deploy this model in **December**, the recommendations fail. The model didn't understand the *cause* (summer intent); it just memorized a seasonal correlation. A model that captured the underlying driver (the "beach trip" intent) would not make this mistake.

#### 3. **Models Can Reinforce Bias and Create Feedback Loops**
*   **Example:** A police department uses an ML model to predict crime hotspots based on historical arrest data.
*   **The Correlation Trap:** The model finds a "high crime" correlation with certain neighborhoods and recommends sending more police there.
*   **The Causal Reality:** The historical data might reflect **biased policing patterns** (more police in certain areas lead to more arrests, not necessarily more crime). The correlation is between "police presence" and "reported crime," not underlying criminal activity.
*   **The Harm:** The model sends even more police to that neighborhood, leading to even more arrests, which the model sees as "confirming" its prediction. This creates a **toxic feedback loop** that amplifies historical bias, all because it mistook arrest counts for a direct measure of underlying crime.
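
A deliberately simplified, deterministic sketch of such a loop. All numbers and the allocation rule are hypothetical: two neighborhoods have identical true crime rates, the arrest history starts slightly biased, and each round the "model" sends most patrols to whichever neighborhood has more recorded arrests.

```python
def simulate_feedback(rounds=10):
    """Return neighborhood 0's share of recorded arrests after each round."""
    true_crime_rate = [1.0, 1.0]     # identical by construction
    arrests = [60.0, 40.0]           # biased historical record
    shares = []
    for _ in range(rounds):
        # "Prediction": the hotspot is wherever more arrests were recorded.
        hotspot = 0 if arrests[0] >= arrests[1] else 1
        patrols = [30.0, 30.0]
        patrols[hotspot] = 70.0
        # More patrols produce more recorded arrests, even with equal crime.
        for hood in (0, 1):
            arrests[hood] += patrols[hood] * true_crime_rate[hood]
        shares.append(arrests[0] / sum(arrests))
    return shares

shares = simulate_feedback()
print([round(s, 3) for s in shares])
```

Neighborhood 0's share of recorded arrests climbs every round, from 60% in the initial history toward the 70% patrol allocation, although the underlying crime rates never differ.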

#### 4. **Models Give Bad Advice for Interventions**
*   **This is the biggest practical consequence.** ML is often used to answer "What should I do?"
*   **Question a correlation-based model answers:** "What is likely to happen next?" (Prediction)
*   **Question a decision actually needs answered:** "What will happen **if I do X**?" (Intervention)
*   **Simple Example:**
    *   **Data shows a correlation:** Hospital patients with a dedicated family visitor (Variable A) recover faster (Variable B).
    *   **A correlation-only model suggests:** To speed recovery, assign a family visitor.
    *   **The causal truth:** The family visitor may not be the *cause* of faster recovery. A plausible **common cause** is the patient's **underlying strength and will to live**, which speeds recovery and also motivates family to visit. If so, assigning a visitor to a lonely patient won't produce the same effect, and the model's recommendation is useless or wasteful.
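
The gap between prediction and intervention can be sketched with a toy simulation. It assumes, purely for illustration, that a latent `strength` variable drives both visits and recovery and that a visitor has no effect at all; every number is invented.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def recovery_days(strength):
    # Toy assumption: only patient strength matters; a visitor has NO effect.
    return 20 - 3 * strength + rng.gauss(0, 2)

def mean_gap(rows):
    """Mean recovery time with a visitor minus mean recovery time without."""
    with_visitor = [days for visited, days in rows if visited]
    without = [days for visited, days in rows if not visited]
    return sum(with_visitor) / len(with_visitor) - sum(without) / len(without)

observational, randomized = [], []
for _ in range(5000):
    strength = rng.gauss(0, 1)
    # Observed world: stronger patients are more likely to receive visits.
    visited = strength + rng.gauss(0, 1) > 0
    observational.append((visited, recovery_days(strength)))
    # Experiment: a coin flip decides who is assigned a visitor.
    assigned = rng.random() < 0.5
    randomized.append((assigned, recovery_days(strength)))

gap_observed = mean_gap(observational)
gap_randomized = mean_gap(randomized)
print(f"observational gap: {gap_observed:.2f} days")
print(f"randomized gap:    {gap_randomized:.2f} days")
```

In the observational data, visited patients appear to recover several days faster. When visits are assigned at random the gap is close to zero, which matches the causal effect built into this toy world.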

### The Simple Analogy
Think of a machine learning model as a **brilliant pattern-spotting detective**.
*   It sees that **every time the rooster crows (A), the sun rises (B)**. It learns this pattern perfectly.
*   If you ask it, **"Will the sun rise tomorrow?"** it will correctly predict **"Yes, if the rooster crows."** (This is **prediction** using correlation).
*   But if you ask it, **"How can I make the sun rise earlier?"** and it answers **"Make the rooster crow earlier,"** it is catastrophically wrong. It confused correlation (B follows A) with causation (A causes B).

<br>
<br>
<hr>
<br>
<br>