# Unit 5 - Correlations 


1. [Correlation](#section1)
2. [Heatmaps](#section2)
3. [Label encoding and one-hot encoding](#section3)
4. [Summary - bringing it all together](#section4)
5. [Correlation ≠ Causation](#section5)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # for setting the figure size
import seaborn as sns  # for creating the graphs

## 1. Correlation 

### 🎯 Motivation: Is There a Relationship Between Grades?

An example of student grades in two courses:

| Student | Math | Physics |
|---------|------|---------|
| A       | 95   | 94      |
| B       | 88   | 90      |
| C       | 70   | 68      |
| D       | 60   | 62      |
| E       | 98   | 97      |

📈 Is there a connection between Math and Physics grades?

---

Another example:

| Student | Math | History |
|---------|------|---------|
| A       | 95   | 32      |
| B       | 88   | 85      |
| C       | 70   | 64      |
| D       | 60   | 91      |
| E       | 98   | 45      |

👀 Here, the relationship is less obvious. Maybe there's a pattern, maybe not — it's hard to tell just by eye.

---
<div>
    <center>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/correlations_sample.png?raw=true" width="700"/>
    </center>
</div>


### ❓ So How Can We Tell If There’s a Relationship — and How Do We Measure It?
Key questions:
-  Do high Math scores tend to go with high Physics scores?
-  Is there a similar trend with History scores?
- **How can we *quantify* such a relationship?**


This is where **correlation** comes in — specifically, **Pearson’s correlation coefficient**, which gives us a number between -1 and 1 that tells us how strong and linear the relationship is between two variables.

### Pearson's correlation  

Measures linear correlation; it is most appropriate when the variables are approximately normally distributed.

$$
r = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}
$$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$, and:

$$
\text{cov}(x, y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n}
$$
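
To make the formula concrete, here is a minimal sketch that applies it to the grades from the two tables above and cross-checks the result against NumPy. The helper name `pearson_r` is ours, not part of any library.

```python
import numpy as np

# Grades from the two tables above
math_grades = np.array([95, 88, 70, 60, 98])
physics_grades = np.array([94, 90, 68, 62, 97])
history_grades = np.array([32, 85, 64, 91, 45])

def pearson_r(x, y):
    """Pearson's r from the formula: cov(x, y) / (sigma_x * sigma_y)."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance (divide by n)
    return cov / (x.std() * y.std())                # np.std also divides by n by default

r_math_physics = pearson_r(math_grades, physics_grades)
r_math_history = pearson_r(math_grades, history_grades)

print(round(r_math_physics, 2))  # 0.99
print(round(r_math_history, 2))  # -0.71

# Cross-check against NumPy's built-in implementation
print(np.isclose(r_math_physics, np.corrcoef(math_grades, physics_grades)[0, 1]))  # True
```

With only five students, these numbers are an illustration rather than evidence of a general pattern.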





#### Pearson's correlation reflects **strength** and **direction**, but not slope or non-linear relationships:
Note: the correlation for the figure in the center is undefined, because the variance of `y` is 0.
<div>
    <center>
        <img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/correlation slopes.png?raw=true" width="700"/>
        <p style="text-align: center;">
            <em>By Denis Boigelot – <a href="https://commons.wikimedia.org/wiki/File:Correlation_examples2.svg">Wikimedia Commons</a>, licensed under 
            <a href="https://creativecommons.org/licenses/by-sa/2.5/deed.en">CC BY-SA 2.5</a></em>
        </p>
    </center>
</div>


### Spearman and Kendall Correlation

Both are rank-based, so they can capture non-linear relationships to some extent, as long as the relationship is monotonic.

<div>
<center><img src="https://github.com/nlihin/data-analytics/blob/main/images/Spearman.png?raw=true" width="400"/>
    <p style="text-align: center;"><em>By Skbkekas - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8778554</em></p></center>
</div>
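
As a small illustration of the difference, the sketch below uses made-up data in which `y` grows exponentially with `x`. The relationship is perfectly monotonic but not linear, so Spearman and Pearson disagree. The name `df_demo` is ours.

```python
import numpy as np
import pandas as pd

# Made-up data: y grows exponentially with x (monotonic, but far from a straight line)
x = np.arange(1, 11)
df_demo = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df_demo.corr(method="pearson").loc["x", "y"]
spearman = df_demo.corr(method="spearman").loc["x", "y"]

print(round(pearson, 2))   # noticeably below 1: the points do not lie on a line
print(round(spearman, 2))  # 1.0: the ranks of x and y agree perfectly
```

`method="kendall"` is accepted by `DataFrame.corr` in the same way.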



### Correlation vs. Regression

**Correlation** - the degree of relationship between two random variables, `x` and `y`.
* Purpose: descriptive statistics
* `x` and `y` can be interchanged
* Random variables: rerunning the experiment can change both `x` and `y`

**Regression** - the effect of independent variables (`x_1`...`x_n`) on a random (dependent) variable (`y`)
* Purpose: prediction, estimation
* `x` and `y` cannot be interchanged
* Fixed `x`s, random `y`: rerunning the experiment will not change the `x`s, but might change `y`

Examples:

* Math & Physics tests?
* Temperature & Electricity bill?
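
The "interchangeable" point can be checked numerically. The sketch below reuses the Math and Physics grades from Section 1 and fits a straight line in each direction with `np.polyfit`, used here only as a simple stand-in for a regression.

```python
import numpy as np

math_grades = np.array([95, 88, 70, 60, 98])
physics_grades = np.array([94, 90, 68, 62, 97])

# Correlation is symmetric: swapping the variables gives the same number
r_xy = np.corrcoef(math_grades, physics_grades)[0, 1]
r_yx = np.corrcoef(physics_grades, math_grades)[0, 1]

# Regression is not: each direction has its own slope
slope_physics_on_math, _ = np.polyfit(math_grades, physics_grades, deg=1)
slope_math_on_physics, _ = np.polyfit(physics_grades, math_grades, deg=1)

print(np.isclose(r_xy, r_yx))                                    # True
print(np.isclose(slope_physics_on_math, slope_math_on_physics))  # False

# The two slopes are linked through r: their product equals r squared
print(np.isclose(slope_physics_on_math * slope_math_on_physics, r_xy ** 2))  # True
```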

## 2. Heatmaps

Read the pickled file you saved on your PC at the end of unit 3:

In [None]:
strike_df = pd.read_pickle("pickled_strike")

If you haven't saved it, uncomment the code below:

In [None]:
# url1 = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/aircraft%20wildlife%20strikes%202018-2020.csv'
# url2 = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/aircraft%20wildlife%20strikes%202021-2023.csv'
# strike_df_18_20 = pd.read_csv(url1)
# strike_df_21_23 = pd.read_csv(url2)
# strike_df = pd.concat([strike_df_18_20 ,strike_df_21_23]).reset_index(drop = True)
# strike_df['date'] = pd.to_datetime(strike_df['INCIDENT_DATE'],format='%d/%m/%Y')
# strike_df['month'] = pd.DatetimeIndex(strike_df['date']).month
# strike_df['year'] = pd.DatetimeIndex(strike_df['date']).year
# strike_df["people_impact"] = strike_df[['NR_INJURIES', 'NR_FATALITIES']].sum(axis=1)

In [None]:
numeric_features = ['HEIGHT', 'SPEED', 'AC_MASS']
target_features = ['AircraftOutOfService','people_impact','struck_parts', 'damaged_parts']

In [None]:
strike_df_num = strike_df[numeric_features + target_features]

If the variables are not normally distributed, it is better to use Spearman or Kendall correlation:

In [None]:
correlation_matrix = strike_df_num.corr(numeric_only = True, method = 'spearman' ).round(2)
correlation_matrix

#### Make it look a bit nicer: 
Take only the columns of the target features & transpose

In [None]:
subset_correlation_matrix = correlation_matrix[target_features].transpose()
subset_correlation_matrix

Turn into a heatmap

In [None]:
plt.figure(figsize=(17,2))
sns.heatmap(data=subset_correlation_matrix,cmap='coolwarm', annot=True)
plt.show()

## 3. Label encoding and one-hot encoding

### Label encoding

When categories are ordinal (the labels reflect a meaningful progression):

Examples:

* Education level: 'High School' < 'Bachelor' < 'Master' < 'PhD'

* Rating levels: 'Poor' < 'Fair' < 'Good' < 'Very Good' < 'Excellent'

* Size: 'Small' < 'Medium' < 'Large'

#### We don't have an example in our data, so here is a made up one: 
`pd.Categorical()` - turns the data type to categorical, and `.codes` assigns a numeric code

In [None]:
# Sample data
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Ethan", "Fiona"],
    "Education": ["Bachelor", "High School", "PhD", "Master", "Bachelor", "PhD"],
    "Satisfaction": ["Satisfied", "Neutral", "Very Satisfied", "Dissatisfied", "Neutral", "Very Dissatisfied"]
})

# Define the ordinal categories
education_order = ["High School", "Bachelor", "Master", "PhD"]
satisfaction_order = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]

# Encode as ordered categorical variables
df["Education_encoded"] = pd.Categorical(df["Education"], categories=education_order, ordered=True).codes
df["Satisfaction_encoded"] = pd.Categorical(df["Satisfaction"], categories=satisfaction_order, ordered=True).codes

# Display result
df
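
One detail worth knowing: a value that is missing from the category list does not raise an error. It gets the code `-1`, which would quietly distort a correlation. The sketch below shows this on a made-up `sizes` column.

```python
import pandas as pd

size_order = ["Small", "Medium", "Large"]
sizes = pd.Series(["Small", "Large", "Huge", "Medium"])  # "Huge" is not in size_order

codes = pd.Categorical(sizes, categories=size_order, ordered=True).codes
print(codes.tolist())  # [0, 2, -1, 1]

# A quick check before using the codes in a correlation
print((codes == -1).any())  # True: at least one value was not recognized
```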



### One-hot encoding: `get_dummies`

Convert categorical variables into dummy variables.

Each variable is converted into as many 0/1 variables as there are different values.  
Columns in the output are each named after a value; the name of the original variable is prepended to the value.  

[pandas.get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)
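
Here is a minimal sketch of the naming rule on a made-up table. The `SKY` values are illustrative and not necessarily the ones in the strike dataset; `toy_df` and `toy_dum` are our own names.

```python
import pandas as pd

# Made-up rows; the SKY values are illustrative, not taken from the dataset
toy_df = pd.DataFrame({
    "HEIGHT": [0, 1500, 300],
    "SKY": ["Overcast", "No Cloud", "Overcast"],
})

# dtype=int requests 0/1 columns (recent pandas versions default to True/False)
toy_dum = pd.get_dummies(toy_df, dtype=int)

print(toy_dum.columns.tolist())          # ['HEIGHT', 'SKY_No Cloud', 'SKY_Overcast']
print(toy_dum["SKY_Overcast"].tolist())  # [1, 0, 1]
```

Numeric columns such as `HEIGHT` pass through unchanged; only the text column is expanded.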

In [None]:
categorical_features = ['WARNED','PHASE_OF_FLIGHT','SKY','TIME_OF_DAY']

In [None]:
all_features = target_features + numeric_features + categorical_features 
all_features

In [None]:
strike_df_dum = pd.get_dummies(strike_df[all_features])
strike_df_dum.columns

In [None]:
all_correlation_matrix = strike_df_dum.corr(numeric_only = False, method = 'spearman').round(2)
subset_all_correlation_matrix = all_correlation_matrix[target_features].transpose()
subset_all_correlation_matrix

---
### <span style="color:blue"> Exercise:</span>
> Create a heatmap for `subset_all_correlation_matrix`
>

In [None]:
plt.figure(figsize=(17,2))
# YOUR CODE HERE
plt.show()

<div>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/heatmap.png" width="900"/>
</div>

### Study the correlations:

* `HEIGHT` and `damaged_parts`
* `AC_MASS` & `damaged_parts`
* `AircraftOutOfService` and `damaged_parts`
* `PHASE_OF_FLIGHT` and `struck_parts` / `damaged_parts`

### `HEIGHT` and `damaged_parts`

A positive correlation of 0.14: greater height tends to go with more damaged parts, and vice versa.

In [None]:
plt.figure(figsize=(6,3))
sns.scatterplot(data = strike_df, y = 'HEIGHT', x = 'damaged_parts')

#### ⚠️ <span style="color:red"> Problem:</span>
If the correlation is positive (0.14), why do we see the opposite?

#### ✅ <span style="color:green"> Solution:</span>
#### Use a log-scale
##### Below, the y-axis is on a **log** scale. A log scale compresses large values and spreads out small ones, which makes the trend easier to see

[For an intuitive understanding of log scales](https://www.youtube.com/watch?v=0fKBhvDjuy0&ab_channel=EamesOffice)

In [None]:
fig, axes = plt.subplots(figsize=(10, 4), ncols=2)

sns.scatterplot(data = strike_df, y = 'HEIGHT', x = 'damaged_parts', ax = axes[0])
axes[0].set_yscale('log')

sns.stripplot(data = strike_df, y = 'HEIGHT', x = 'damaged_parts', ax = axes[1])
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()
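
Why is a log axis a fair way to look at a Spearman correlation? Spearman depends only on the order of the values, and taking a log keeps that order. The sketch below checks this on synthetic, right-skewed data (not the strike dataset); all names are ours.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed data (not the strike dataset)
rng = np.random.default_rng(seed=0)
height = rng.lognormal(mean=5, sigma=2, size=500)
damaged = rng.poisson(lam=1 + np.log10(height + 1))

toy = pd.DataFrame({"height": height, "log_height": np.log10(height), "damaged": damaged})
corr_spearman = toy.corr(method="spearman")
corr_pearson = toy.corr(method="pearson")

# Spearman works on ranks, and taking a log does not change the order of the values
print(np.isclose(corr_spearman.loc["height", "damaged"],
                 corr_spearman.loc["log_height", "damaged"]))  # True

# Pearson works on the raw values, so the two numbers differ
print(corr_pearson.loc["height", "damaged"], corr_pearson.loc["log_height", "damaged"])
```

Keep in mind that values of 0 cannot be drawn on a log axis, so such rows do not appear in the plot.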

### `AC_MASS` & `damaged_parts`

A negative correlation of -0.16: lower mass tends to go with more damaged parts, and vice versa.

In [None]:
fig, axes = plt.subplots(figsize=(10, 4), ncols=2)

sns.stripplot(data = strike_df, x = 'AC_MASS', y = 'damaged_parts', ax = axes[0])
sns.scatterplot(data = strike_df, x = 'AC_MASS', y = 'damaged_parts', ax = axes[1])

plt.tight_layout()
plt.show()

#### ⚠️ <span style="color:red"> Problem:</span>
A log-scale doesn't help here, since the values are 1-5.  
#### ✅ <span style="color:green"> Solution:</span>
A barplot does help

In [None]:
plt.figure(figsize=(6,3))

sns.barplot(data = strike_df, x = 'AC_MASS', y = 'damaged_parts')

plt.show()
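
With its default estimator, `sns.barplot` draws the mean of `y` for each value of `x`. The same numbers can be read from a `groupby` table, sketched here on made-up rows standing in for `strike_df`.

```python
import pandas as pd

# Made-up rows standing in for strike_df
toy_strikes = pd.DataFrame({
    "AC_MASS":       [1, 1, 1, 3, 3, 4, 4, 4],
    "damaged_parts": [2, 1, 3, 1, 0, 0, 1, 0],
})

# The bar heights of sns.barplot (default estimator) are these group means
mean_damage = toy_strikes.groupby("AC_MASS")["damaged_parts"].mean()
print(mean_damage.round(2).tolist())  # [2.0, 0.5, 0.33]
```

In this toy table the mean drops as `AC_MASS` grows, which is the kind of pattern a negative correlation summarizes.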


#### ❓ <span style="color:purple"> Question:</span>
So why bother with log scales? Why not always use a bar-plot?
#### ✅ <span style="color:green"> Solution:</span>


📝 **Your Answer:** <ins>_Type your response here..._</ins>


### `AircraftOutOfService` and `damaged_parts`


### <span style="color:blue"> Exercise:</span>
> Create a scatterplot and a stripplot for `AircraftOutOfService` and `damaged_parts`  
> Don't forget to use log scale
>

In [None]:
fig, axes = plt.subplots(figsize=(10, 4), ncols=2)

#YOUR CODE HERE

plt.tight_layout()
plt.show()

### `PHASE_OF_FLIGHT` and `struck_parts` / `damaged_parts`

We will first order the phases of flight by their total number of struck parts

In [None]:
struck_counts = strike_df.groupby('PHASE_OF_FLIGHT')['struck_parts'].sum().sort_values(ascending=False)
struck_order = struck_counts.index

In [None]:
fig, axes = plt.subplots(figsize=(12, 5), ncols=2)

sns.barplot(data = strike_df, y = 'struck_parts', x = 'PHASE_OF_FLIGHT', ax = axes[0], order = struck_order, estimator = sum)
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=45, ha='right')

sns.barplot(data = strike_df, y = 'damaged_parts', x = 'PHASE_OF_FLIGHT', ax = axes[1], order = struck_order, estimator = sum)
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

## 4. Summary – Bringing It All Together

> ✅ **Step 1: Load and inspect your data**
> - Upload your dataset to GitHub and read it into your notebook  
> - Explore its structure using: `len`, `shape`, `info()`, `describe()`

> ✅ **Step 2: Understand your variables**
> - **Categorical variables** → use `countplot`  
> - **Numerical variables** → use `histplot`, `boxplot`  
> - **Group-level statistics** → use `barplot`  
> - **Relationships between variables** → use `lineplot`, `stripplot`  
> - **Summarize with tables** → show averages, minima, maxima, etc.  
> - Ask yourself:  
>   - Do I need to remove outliers?  
>   - Should I relabel categories?  
>   - Are there missing values to fill?

> ✅ **Step 3: Analyze correlations**
> - Use `.corr()` to compute correlations  
> - Visualize with `heatmap`, `scatterplot`, `stripplot`, or `barplot`  
> - 📌 *In some cases, a log scale is needed to reveal patterns*

---

**Notes:**
- These are just examples — there are many other useful plots and techniques  
- Sometimes it’s helpful to **group**, **filter**, or even **melt** the data before plotting  
- In some cases, a **table** communicates your insight more clearly than a plot
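
As a small example of melting, the sketch below reshapes a made-up wide table into long format. The column names `measure` and `count` are our own choices.

```python
import pandas as pd

# Made-up wide table: one row per phase, one column per measure
wide = pd.DataFrame({
    "PHASE_OF_FLIGHT": ["Approach", "Climb"],
    "struck_parts": [10, 4],
    "damaged_parts": [3, 1],
})

# Long format: one row per (phase, measure) pair
long = wide.melt(id_vars="PHASE_OF_FLIGHT", var_name="measure", value_name="count")
print(long.shape)  # (4, 3)
```

In long format, a single `sns.barplot` call with `hue="measure"` can show both measures side by side.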


## 5. Correlation ≠ Causation

Just because two variables are correlated does **not** mean that one causes the other.

- Sometimes a correlation **does** reflect a real causal relationship.
- But often, there is a **hidden (confounding) variable** influencing both.
- And sometimes, the correlation is simply **coincidental** — not meaningful at all.

For a fun (and cautionary) example, see the [Spurious Correlations website](https://www.tylervigen.com/spurious-correlations), where you'll find examples like:
> “Ice cream sales” are correlated with “shark attacks”, not because ice cream causes shark attacks, but because both happen more often in summer.


👣 **Be humble**.  
> Correlation can reveal interesting patterns — but we **report** them, not **explain** them.  
> We observe relationships in the data, but that doesn’t mean we understand the cause.


---
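
To see how a hidden variable can create a correlation, here is a sketch with fully synthetic data. Temperature drives both series, and the two series never influence each other. The coefficients are arbitrary assumptions chosen for illustration, and `residuals` is our own helper.

```python
import numpy as np

# Synthetic data: temperature drives both series; they never influence each other
rng = np.random.default_rng(seed=1)
temperature = rng.uniform(10, 35, size=365)
ice_cream_sales = 20 * temperature + rng.normal(0, 40, size=365)
shark_attacks = 0.3 * temperature + rng.normal(0, 1, size=365)

def residuals(y, x):
    """What is left of y after removing its best straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

raw_r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]
partial_r = np.corrcoef(residuals(ice_cream_sales, temperature),
                        residuals(shark_attacks, temperature))[0, 1]

print(round(raw_r, 2))      # clearly positive
print(round(partial_r, 2))  # close to 0 once temperature is accounted for
```

In real data the confounder is usually not measured, which is why the correlation alone cannot settle the question.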