# Correlations 


1. [Correlation computation and scatterplots](#section1)
2. [Scatterplot matrix](#section2)
3. [Heatmaps](#section3)


In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

We'll work with the [California Housing data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).



In [157]:
url = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/housing.csv'
house_df = pd.read_csv(url)
house_df.head()

| Unnamed: 0 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


<a id='section1'></a>

### 1. Correlation computation and scatterplots

Is there a correlation between the income and the house value?

In [None]:
house_df[['median_income', 'median_house_value']].corr(method='pearson')

Plot this correlation:

In [None]:
sns.scatterplot(data = house_df, x = 'median_house_value', y = 'median_income')
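To see what `.corr(method='pearson')` computes, here is a minimal sketch that calculates the Pearson coefficient by hand and compares it with pandas. It runs on synthetic data, not the housing data: `toy_df`, its columns, and the helper `pearson_r` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration, not the housing data)
rng = np.random.default_rng(42)
income = rng.normal(loc=4.0, scale=1.5, size=500)
value = 40000 * income + rng.normal(scale=50000, size=500)
toy_df = pd.DataFrame({"income": income, "value": value})


def pearson_r(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


r_manual = pearson_r(toy_df["income"], toy_df["value"])
r_pandas = toy_df["income"].corr(toy_df["value"])  # Pearson is the default method

print(r_manual, r_pandas)
```

The two numbers should agree up to floating-point error. The coefficient always lies between -1 and 1, and it does not depend on which variable goes on which axis.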

How do you compute the correlation between every pair of numeric attributes in the dataset?

In [None]:
house_df.corr(method='pearson', numeric_only = True)
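A full correlation matrix can be hard to scan by eye. One possible way to list the pairs from strongest to weakest is sketched below. It uses a small synthetic frame (`toy_df` and its columns are hypothetical), keeps only the upper triangle so each pair appears once, and sorts by absolute value.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy_df = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=300),  # nearly a multiple of "a"
    "c": rng.normal(size=300),                        # unrelated noise
    "label": ["x"] * 300,                             # non-numeric, skipped by numeric_only
})

corr = toy_df.corr(numeric_only=True)

# Keep only the cells above the diagonal: drops self-correlations and mirrored pairs
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

# Flatten to (row, column) pairs and sort by strength, ignoring the sign
ranked = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(ranked)
```

The same pattern should work on `house_df.corr(numeric_only=True)`.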

In [None]:
sns.scatterplot(data = house_df, x = 'total_bedrooms', y = 'households')

##### Almost the same plot, using matplotlib's `plt.scatter` function:

In [None]:
import matplotlib.pyplot as plt 
plt.scatter(house_df['total_bedrooms'], house_df['households'])

##### Adding a regression line:

In [None]:
sns.regplot(data=house_df, x='total_bedrooms', y='households')
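`sns.regplot` draws the fitted line but does not return its coefficients. If you need the numbers, one option is to fit the line yourself. The sketch below uses `np.polyfit` on synthetic data (the variable names and the generated values are assumptions for illustration) and also shows how the slope relates to the Pearson coefficient.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(7)
bedrooms = rng.uniform(100, 2000, size=400)
households = 0.9 * bedrooms + 20 + rng.normal(scale=40, size=400)

# Least-squares straight line: households ≈ slope * bedrooms + intercept
slope, intercept = np.polyfit(bedrooms, households, deg=1)

# Pearson correlation of the same two variables
r = np.corrcoef(bedrooms, households)[0, 1]

# For a simple linear fit, slope = r * std(y) / std(x)
slope_from_r = r * households.std() / bedrooms.std()

print(slope, intercept, r, slope_from_r)
```

The correlation measures how tightly the points follow a line, while the slope measures how steep that line is. A strong correlation can come with a shallow slope, and the other way around.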

---
### <span style="color:blue"> Exercise:</span>
> Find a strong correlation in the above and visualize it

---

<a id='section2'></a>

### 2. Scatterplot matrix

The diagonal shows the distribution of each of the three selected numeric variables.

The other cells of the plot matrix show the scatterplot of each pair of variables in the dataframe.

In [20]:
features = ['median_house_value', 'housing_median_age',
            'median_income']

In [None]:
house_df[features]

In [None]:
sns.pairplot(house_df[features], height = 2.5)

<a id='section3'></a>

Remember our Dino data?

In [22]:
url = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/DatasaurusDozen.tsv'

In [23]:
df = pd.read_csv(url, sep='\t')

Add a regression line:

In [None]:
df_grid = sns.FacetGrid(df, col="dataset", hue="dataset", col_wrap=4)
df_grid.map_dataframe(sns.regplot, x="x", y="y")

Try the different correlation methods (Pearson, Spearman, Kendall). Is there a difference?

In [None]:
df.groupby('dataset').corr(method = 'pearson')
#df.groupby('dataset').corr(method = 'spearman')
#df.groupby('dataset').corr(method = 'kendall')
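The methods can disagree when a relationship is monotonic but not linear. As a minimal sketch on synthetic data (`toy_df` is a hypothetical frame, not the Dino data), compare Pearson, which measures linear association, with Spearman, which works on ranks:

```python
import numpy as np
import pandas as pd

# y always increases with x, but along a curve rather than a straight line
x = np.linspace(0, 5, 50)
toy_df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy_df.corr(method="pearson").loc["x", "y"]
spearman = toy_df.corr(method="spearman").loc["x", "y"]

print(pearson, spearman)
```

Spearman comes out at essentially 1 because the ordering of the points is perfectly preserved, while Pearson is lower because the points do not lie on a line. `method='kendall'` can be passed in the same way.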

### 3. Heatmaps

In [47]:
features = ['median_house_value', 'housing_median_age','median_income','total_bedrooms','population']

In [None]:
correlation_matrix = house_df[features].corr().round(2)
correlation_matrix

In [None]:
sns.heatmap(data=correlation_matrix,cmap='Blues', annot=True)
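A correlation matrix is symmetric, so half of the heatmap repeats the other half. The sketch below shows one possible variation: it hides the upper triangle with the `mask` parameter and fixes the color scale to the full range from -1 to 1 with a diverging colormap. The data is synthetic (`demo_df` and its columns are hypothetical), and the styling choices are only suggestions.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
demo_df["b"] = demo_df["a"] + 0.3 * demo_df["b"]  # make "b" correlate with "a"

demo_corr = demo_df.corr().round(2)

# True marks cells to hide: everything strictly above the diagonal
mask = np.triu(np.ones_like(demo_corr, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    data=demo_corr,
    mask=mask,
    cmap="coolwarm",  # diverging colormap: negative and positive values get different hues
    vmin=-1,
    vmax=1,
    annot=True,
    ax=ax,
)
```

Fixing `vmin` and `vmax` keeps colors comparable between heatmaps, since the scale no longer depends on the values that happen to be in the matrix.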

---
### <span style="color:blue"> Exercise:</span>
> What happens if `annot=False`? \
> Use another colormap for the heatmap. You can run the command `plt.colormaps()` to see all available colormaps

---
> ##### Summary
>
>* `.corr` - compute pairwise correlation of columns, excluding NA/null values. [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)
>
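>* `.corr().style.background_gradient` - change the background color of the cells according to their values. [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.background_gradient.html)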
>
>* `.plotting.scatter_matrix` - draw a matrix of scatter plots. [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html?highlight=scatter_matrix)
>
>* `.plot.scatter` - plot a scatter plot
>
> Seaborn package:
>
>* `sns.scatterplot` - a scatter plot
>
>* `sns.regplot` - a scatter plot with a regression line
>
>* `sns.pairplot` - scatter plot matrix
>
>* `sns.heatmap` - a heatmap. Set `annot=True` to print the values inside the cells
>
---