# Unit 5 - Correlations 


1. [Correlation](#section1)
2. [Heatmaps](#section2)
3. [Get Dummies](#section3)
4. [Summary - bringing it all together](#section4)
5. [Correlation is NOT causation](#section5)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # for setting the figure size
import seaborn as sns  # for creating the graphs

## 1. Correlation 

#### Correlation vs. Regression

**Correlation** - the degree of relationship between two random variables, `x` and `y`.
* Purpose: descriptive statistics
*  `x` and `y` can be interchanged
*  Random variables: rerunning the experiment can change both `x` and `y`

**Regression** - the effect of independent variables (`x_1`...`x_n`) on a dependent (random) variable (`y`)
* Purpose: prediction, estimation
* `x` and `y` cannot be interchanged
* Fixed `x`s, random `y`. Re-running the experiment will not change the `x`s, but might change `y`

Examples:

* Math & Physics tests?
* Temperature & Electricity bill?
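
To make the "interchangeable" point concrete, here is a minimal sketch on simulated data (the data, the seed and all variable names are made up for illustration). It suggests that the correlation is the same whichever variable comes first, while the fitted regression slope depends on which variable is treated as `y`.

```python
import numpy as np

# Simulated data (an assumption for illustration): y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

# Correlation is symmetric: swapping x and y gives the same value
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

# Regression is not symmetric: the slope of y on x differs from the slope of x on y
slope_y_on_x = np.polyfit(x, y, 1)[0]
slope_x_on_y = np.polyfit(y, x, 1)[0]

print(round(r_xy, 3), round(r_yx, 3))
print(round(slope_y_on_x, 3), round(slope_x_on_y, 3))
```

The two slopes multiply to `r_xy ** 2`, which is one way to see how the two ideas are linked.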

### Pearson  

Measures linear correlation; commonly used when the variables are roughly normally distributed.

$$
r = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}
$$

where:

$$
\text{cov}(x, y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n}
$$
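
As a rough check of the formula, the sketch below computes `r` by hand on a tiny made-up sample and compares it with NumPy's `np.corrcoef`. The numbers are invented for illustration; the standard deviations use the same division by `n` as the covariance above.

```python
import numpy as np

# Tiny made-up sample (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)

# Covariance: mean product of deviations, dividing by n
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n

# Standard deviations, also dividing by n (NumPy's default, ddof=0)
sigma_x = x.std()
sigma_y = y.std()

r = cov_xy / (sigma_x * sigma_y)

# np.corrcoef returns the full correlation matrix; [0, 1] is r between x and y
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r, 4), round(r_numpy, 4))
```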





<div>
    <center>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/correlation slopes.png?raw=true" width="700"/>
    </center>
</div>

### Spearman and Kendall

Rank-based measures; they can capture monotonic non-linear relationships to some extent.

<div>
<center><img src="https://github.com/nlihin/data-analytics/blob/main/images/Spearman.png?raw=true" width="500"/>
    <p style="text-align: center;"><em>By Skbkekas - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8778554</em></p></center>
</div>
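
One way to see why a rank-based measure helps: Spearman's coefficient can be computed as the Pearson correlation of the ranks. The sketch below uses made-up data where `y` grows exponentially with `x`, so the relationship is perfectly monotonic but not linear. Kendall is not shown here.

```python
import numpy as np
import pandas as pd

# Made-up data: y increases with x, but exponentially rather than linearly
x = pd.Series(np.linspace(0, 5, 50))
y = np.exp(x)

# Pearson on the raw values: penalised by the curvature
pearson = np.corrcoef(x, y)[0, 1]

# Spearman = Pearson on the ranks of the values
spearman = np.corrcoef(x.rank(), y.rank())[0, 1]

print(round(pearson, 3), round(spearman, 3))
```

Because the ranks of `x` and `y` are identical, the Spearman value is 1 even though the Pearson value is clearly lower.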



## 2. Heatmaps

Read the pickled file you saved on your PC at the end of unit 3:

In [None]:
strike_df = pd.read_pickle("pickled_strike")

If you haven't saved it, uncomment the code below:

In [None]:
# url1 = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/aircraft%20wildlife%20strikes%202018-2020.csv'
# url2 = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/aircraft%20wildlife%20strikes%202021-2023.csv'
# strike_df_18_20 = pd.read_csv(url1)
# strike_df_21_23 = pd.read_csv(url2)
# strike_df = pd.concat([strike_df_18_20 ,strike_df_21_23]).reset_index(drop = True)
# strike_df['date'] = pd.to_datetime(strike_df['INCIDENT_DATE'],format='%d/%m/%Y')
# strike_df['month'] = pd.DatetimeIndex(strike_df['date']).month
# strike_df['year'] = pd.DatetimeIndex(strike_df['date']).year
# strike_df["people_impact"] = strike_df[['NR_INJURIES', 'NR_FATALITIES']].sum(axis=1)

In [None]:
numeric_features = ['HEIGHT', 'SPEED', 'AC_MASS']
target_features = ['AircraftOutOfService','people_impact','struck_parts', 'damaged_parts']

In [None]:
strike_df_num = strike_df[numeric_features + target_features]

If the distribution of the variables is not normal, it is better to use Spearman or Kendall correlation:

In [None]:
correlation_matrix = strike_df_num.corr(numeric_only = True, method = 'spearman' ).round(2)
correlation_matrix

Make it look a bit nicer by keeping only the target-feature columns and transposing, so that each target becomes a row:

In [None]:
subset_correlation_matrix = correlation_matrix[target_features].transpose()
subset_correlation_matrix

Turn it into a heatmap:

In [None]:
plt.figure(figsize=(17,2))
sns.heatmap(data=subset_correlation_matrix,cmap='coolwarm', annot=True)
plt.show()

## 3. Get Dummies

Convert categorical variables into dummy/indicator variables.

Each variable is converted into as many 0/1 variables as it has distinct values.  
Each output column is named after a value, with the name of the original variable prepended.  

[pandas.get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)
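
A minimal sketch on a toy DataFrame shows the naming scheme. The column names echo the dataset, but the values here are made up for illustration and may not match the real labels.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the strikes dataset)
toy = pd.DataFrame({
    "SKY": ["No Cloud", "Overcast", "No Cloud"],
    "HEIGHT": [100, 2500, 0],
})

# Numeric columns are kept as they are; each distinct SKY value becomes its own column
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())
print(dummies)
```

Depending on the pandas version, the indicator columns hold `True`/`False` or `1`/`0`; both work with `corr`.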

In [None]:
categorical_features = ['WARNED','PHASE_OF_FLIGHT','SKY','TIME_OF_DAY']

In [None]:
all_features = target_features + numeric_features + categorical_features 
all_features

In [None]:
strike_df_dum = pd.get_dummies(strike_df[all_features])
strike_df_dum.columns

In [None]:
all_correlation_matrix = strike_df_dum.corr(numeric_only = False, method = 'spearman').round(2)
subset_all_correlation_matrix = all_correlation_matrix[target_features].transpose()
subset_all_correlation_matrix

---
### <span style="color:blue"> Exercise:</span>
> Create a heatmap for `subset_all_correlation_matrix`
>

In [None]:
plt.figure(figsize=(17,2))
# YOUR CODE HERE
plt.show()

<div>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/heatmap.png" width="900"/>
</div>

### Study the correlations:

* `HEIGHT` and `damaged_parts`
* `AC_MASS` & `damaged_parts`
* `AircraftOutOfService` and `damaged_parts`
* `PHASE_OF_FLIGHT` and `struck_parts` / `damaged_parts`

### `HEIGHT` and `damaged_parts`

##### Note that the y-axis is on a **log** scale here!

[For an intuitive understanding of log scales](https://www.youtube.com/watch?v=0fKBhvDjuy0&ab_channel=EamesOffice)

In [None]:
fig, axes = plt.subplots(figsize=(10, 4), ncols=2)

sns.scatterplot(data = strike_df, y = 'HEIGHT', x = 'damaged_parts', ax = axes[0])
axes[0].set_yscale('log')

sns.stripplot(data = strike_df, y = 'HEIGHT', x = 'damaged_parts', ax = axes[1])
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

### `AC_MASS` & `damaged_parts`

In [None]:
fig, axes = plt.subplots(figsize=(10, 4), ncols=3)

sns.barplot(data = strike_df, x = 'AC_MASS', y = 'damaged_parts', ax = axes[0])
sns.stripplot(data = strike_df, x = 'AC_MASS', y = 'damaged_parts', ax = axes[1])
sns.scatterplot(data = strike_df, x = 'AC_MASS', y = 'damaged_parts', ax = axes[2])

plt.tight_layout()
plt.show()

### `AircraftOutOfService` and `damaged_parts`

---
### <span style="color:blue"> Exercise:</span>
> Create a scatterplot and a stripplot for `AircraftOutOfService` and `damaged_parts`  
> Don't forget to use log scale
>

In [None]:
fig, axes = plt.subplots(figsize=(10, 4), ncols=2)

#YOUR CODE HERE

plt.tight_layout()
plt.show()

### `PHASE_OF_FLIGHT` and `struck_parts` / `damaged_parts`

We will first order the phases of flight by their total number of struck parts:

In [None]:
struck_counts = strike_df.groupby('PHASE_OF_FLIGHT')['struck_parts'].sum().sort_values(ascending=False)
struck_order = struck_counts.index

In [None]:
fig, axes = plt.subplots(figsize=(12, 5), ncols=2)

sns.barplot(data = strike_df, y = 'struck_parts', x = 'PHASE_OF_FLIGHT', ax = axes[0], order = struck_order, estimator = sum)
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=45, ha='right')

sns.barplot(data = strike_df, y = 'damaged_parts', x = 'PHASE_OF_FLIGHT', ax = axes[1], order = struck_order, estimator = sum)
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

## 4. Summary - bringing it all together:
>
> * Load your data to GitHub, then read it
> * Look at it using: `len`, `shape`, `info`, `describe`
> * Describe the variables:  
>   * Categorical variables: `countplot`
>   * Numerical variables:  `histplot`, `boxplot`
>   * Statistics: `barplot`
>   * Relationships: `lineplot`,  `stripplot`
>   * Tables for averages, max, min etc.
>   * Do you need to remove outliers? Change labels? Fill missing values?
> * Look for correlations: `corr`, `heatmap`  
> * Visualize the correlations: `scatterplot`
>
>  **Note 1:** These are just examples! There are more ways to visualize  
>  **Note 2:** Sometimes you need to group or filter the data first (or to melt it.. in the next lessons)  
>  **Note 3:** Sometimes presenting it in a table gives the best effect  


  

## 5. Correlation is NOT causation

Sometimes correlation indicates causation. But not always.  
Sometimes there is a hidden variable that drives both.  
Sometimes it just isn't causation: [spurious (meaning false/fake) correlations website](https://www.tylervigen.com/spurious-correlations)
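
The hidden-variable case can be sketched with simulated data. Everything below is invented for illustration: a hidden `temperature` drives both `ice_cream` and `sunburn`, which have no direct link to each other. The two still appear correlated, and the correlation roughly vanishes once the linear effect of the hidden variable is removed.

```python
import numpy as np

# Simulated data (all names and numbers are assumptions for illustration)
rng = np.random.default_rng(42)
n = 5000
temperature = rng.normal(size=n)              # the hidden variable
ice_cream = temperature + rng.normal(size=n)  # depends only on temperature
sunburn = temperature + rng.normal(size=n)    # depends only on temperature

# The two outcomes are correlated although neither causes the other
r_raw = np.corrcoef(ice_cream, sunburn)[0, 1]

def residuals(values, hidden):
    """Remove the linear effect of `hidden` from `values`."""
    slope, intercept = np.polyfit(hidden, values, 1)
    return values - (slope * hidden + intercept)

# Correlation of what is left after accounting for temperature
r_adjusted = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(sunburn, temperature),
)[0, 1]

print(round(r_raw, 2), round(r_adjusted, 2))
```

In real data the hidden variable is usually not measured, which is exactly why it is hard to rule out.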

Be humble. We see things in the data. We report them. We don't know their cause.