
#IMPORTING RELEVANT LIBRARIES AND DATASET

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import math
import seaborn as sns
import pandas as pd

In [None]:
data = sns.load_dataset(name = 'iris')

#FINDING MEAN AND VARIANCE OF ATTRIBUTES

##DEFINING FUNCTIONS

In [None]:
def me(att):
  x = np.array(att)
  return np.sum(x)/len(att)

The function me(att) is defined to find the mean or expectation, E[X], of a particular attribute.

$E[X] = \Large{\frac{\sum_{i=1}^n x_i}{n}}$, where $n$ is the sample size.
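As a quick sanity check of the formula above, the sketch below compares a hand-computed mean with `np.mean`. The sample values are made up for illustration and are not taken from the iris data.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
sample = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# Sum of the observations divided by the sample size n
manual_mean = np.sum(sample) / len(sample)

# NumPy's built-in mean should agree up to floating-point rounding
assert np.isclose(manual_mean, np.mean(sample))
print(round(manual_mean, 2))
```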

In [None]:
def va(att):
  x = np.array(att)
  a = x - me(x)
  b = a*a
  return (np.sum(b)/(len(b)-1))

The function va(att) has been defined to find the variance, $\sigma_x^2$, of a particular attribute.

$\sigma^2_x = E[(X - E[X])^2]$

$\sigma^2_x= E[X^2 - 2XE[X] +(E[X])^2]$

$\sigma^2_x= E[X^2] - 2(E[X])^2 + (E[X])^2$

$\sigma^2_x= E[X^2] - (E[X])^2$

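Dividing by (len(b) - 1) instead of len(b) applies Bessel's correction, which gives an unbiased estimate of the population variance from a sample. Note that the identity $\sigma^2_x = E[X^2] - (E[X])^2$ derived above holds exactly only without the correction, that is, when dividing by $n$. A second function, vaa(att), therefore computes the variance without Bessel's correction. It is used later for the correlation coefficient, where the covariance is also computed as a plain average over $n$, so both quantities must use the same divisor.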

In [None]:
def vaa(att):
  x = np.array(att)
  a = x - me(x)
  b = a*a
  return (np.sum(b)/(len(b)))
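The difference between the two divisors can be checked on a small example. The sketch below uses a made-up sample and NumPy's `ddof` argument of `np.var`, which selects the divisor `n - ddof`; it mirrors what va and vaa compute but does not call them.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(sample)

# Squared deviations from the sample mean
sq_dev = (sample - np.mean(sample)) ** 2

var_population = np.sum(sq_dev) / n      # no correction, as in vaa
var_sample = np.sum(sq_dev) / (n - 1)    # Bessel's correction, as in va

# ddof=0 divides by n, ddof=1 divides by n - 1
assert np.isclose(var_population, np.var(sample, ddof=0))
assert np.isclose(var_sample, np.var(sample, ddof=1))

# Only the uncorrected value matches E[X^2] - (E[X])^2
assert np.isclose(var_population, np.mean(sample**2) - np.mean(sample)**2)

print(var_population, round(var_sample, 4))
```

The corrected value is always the larger of the two, by a factor of n/(n - 1), so the gap shrinks as the sample grows.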

In [None]:
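def mean_var(att):
  # Turn a column name such as 'sepal_length' into the label 'sepal length'
  l = att.split('_')
  a = l[0] + ' ' + l[1]
  print(a.title()+':')
  # pandas' .var() divides by n - 1 by default (ddof=1), matching va(att)
  print('Mean', a, '=', round(data[att].mean(), 2))
  print('Variance in', a, '=',round(data[att].var(), 2))
  print()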

[string_name].split([char]) returns a list of the substrings of [string_name] separated by the given character; with no argument, it splits on whitespace.
For example, 'sepal,length'.split(',') returns ['sepal', 'length'].

[string_name].title() returns a new string with the first letter of every word capitalized. For example, 'a new world'.title() returns 'A New World'.
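Put together, the two string methods turn a column name into a printable heading. The sketch below walks through the steps for one column name; `' '.join(words)` is used here as an alternative to concatenating the two pieces by hand, and the variable names are illustrative.

```python
# Column name in the snake_case style used by the iris dataset
att = 'petal_width'

words = att.split('_')          # ['petal', 'width']
label = ' '.join(words)         # 'petal width'
heading = label.title() + ':'   # 'Petal Width:'

print(heading)
```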



##DISPLAYING VALUES

In [None]:
mean_var('sepal_length')
mean_var('sepal_width')
mean_var('petal_length')
mean_var('petal_width')

Sepal Length:
Mean sepal length = 5.84
Variance in sepal length = 0.69

Sepal Width:
Mean sepal width = 3.06
Variance in sepal width = 0.19

Petal Length:
Mean petal length = 3.76
Variance in petal length = 3.12

Petal Width:
Mean petal width = 1.2
Variance in petal width = 0.58



#FINDING CORRELATION COEFFICIENTS OF ATTRIBUTES

##DEFINING FUNCTIONS

###DEFINING FUNCTION TO CALCULATE CORRELATION COEFFICIENTS

In [None]:
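def corr_coeff(x, y):
  # x and y must be NumPy arrays of equal length so that x*y is elementwise
  a = me(x)                 # E[X]
  b = me(y)                 # E[Y]
  c = me(x*y)               # E[XY]
  cov_xy = c - a*b          # cov(X, Y) = E[XY] - E[X]E[Y]
  sigx = math.sqrt(vaa(x))  # standard deviations without Bessel's correction,
  sigy = math.sqrt(vaa(y))  # matching the divisor n used for cov_xy
  r = (cov_xy)/(sigx*sigy)
  return r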

The correlation coefficient, $\rho_{X, Y}$, measures the degree of linear correlation between two random variables X and Y. It ranges from -1 to 1: 1 is a perfect positive linear correlation, -1 is a perfect negative linear correlation, and 0 means no linear correlation (the variables may still be related nonlinearly).

$ \rho _{X, Y} = \Large{\frac {cov(X, Y)}{\sigma _X \sigma _Y}}$

Here,

$cov(X, Y)$ is the covariance of the two variables X and Y, stored as cov_xy

$cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$

$\sigma_X$ is the standard deviation of X, stored as sigx

$\sigma_X = \sqrt{E[X^2] - (E[X])^2} = \sqrt{\sigma^2_X}$

$\sigma_Y$ is the standard deviation of Y, stored as sigy

$\sigma_Y = \sqrt{E[Y^2] - (E[Y])^2} = \sqrt{\sigma^2_Y}$

Therefore, the correlation coefficient can also be written as

$ \rho _{X, Y} = \Large{\frac {E[XY] - E[X]E[Y]}{\sqrt{E[X^2] - (E[X])^2}\,\sqrt{E[Y^2] - (E[Y])^2}}}$
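The expectation form of the formula can be checked numerically. The sketch below is a minimal illustration on made-up data: `rho` is a hypothetical helper written for this example (it is not the notebook's corr_coeff), and `np.corrcoef` serves as an independent reference. The second pair of arrays shows that a coefficient of 0 rules out only a linear relationship.

```python
import numpy as np

def rho(x, y):
    """Correlation coefficient from expectations (hypothetical helper for illustration)."""
    cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - E[X]E[Y]
    sig_x = np.sqrt(np.mean(x**2) - np.mean(x)**2)      # sqrt(E[X^2] - (E[X])^2)
    sig_y = np.sqrt(np.mean(y**2) - np.mean(y)**2)      # sqrt(E[Y^2] - (E[Y])^2)
    return cov_xy / (sig_x * sig_y)

# Hypothetical paired measurements, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
r_xy = rho(x, y)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is rho
assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])

# v is fully determined by u, yet the relationship is symmetric and nonlinear
u = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
v = u**2
r_uv = rho(u, v)
assert np.isclose(r_uv, 0.0)

print(round(r_xy, 2), r_uv)
```

Because the same divisor appears in the covariance and in both standard deviations, it cancels in the ratio, which is why the result agrees with `np.corrcoef` regardless of whether Bessel's correction is applied consistently throughout.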

###DEFINING FUNCTION TO DISPLAY CORRELATION COEFFICIENTS OF DIFFERENT PAIRS OF ATTRIBUTES

In [None]:
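def get_ord_r(ls, name):
  # Store (coefficient, pair) tuples; a list keeps pairs that share the same coefficient,
  # which a dictionary keyed by coefficient would silently overwrite
  pairs = []
  for i in range(len(ls)):
    for j in range(i+1, len(ls)):
      pair = name[i]+' and '+name[j]
      coeff = corr_coeff(ls[i], ls[j])
      pairs.append((coeff, pair))
  # Sort by coefficient, largest first
  pairs.sort(key = lambda p: p[0], reverse = True)
  print('Pairs of attributes arranged in descending order of their correlation coefficients:')
  for coeff, pair in pairs:
    print(pair,':', round(coeff, 2))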

### DEFINING FUNCTION TO DISPLAY CORRELATION COEFFICIENTS ACCORDING TO DIFFERENT SPECIES

In [None]:
def disp(spec):
  cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
  # Use every row for 'all'; otherwise keep only the rows of the requested species
  subset = data if spec == 'all' else data[data['species'] == spec]
  ls = [np.array(subset[c]) for c in cols]
  name = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']
  print(spec.title(), 'Flowers')
  print()
  get_ord_r(ls, name)

In [None]:
data['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

##DISPLAYING VALUES

###ALL FLOWERS

In [None]:
disp('all')

All Flowers

Pairs of attributes arranged in descending order of their correlation coefficients:
Petal Length and Petal Width : 0.96
Sepal Length and Petal Length : 0.87
Sepal Length and Petal Width : 0.82
Sepal Length and Sepal Width : -0.12
Sepal Width and Petal Width : -0.37
Sepal Width and Petal Length : -0.43


###SETOSA FLOWERS

In [None]:
disp('setosa')

Setosa Flowers

Pairs of attributes arranged in descending order of their correlation coefficients:
Sepal Length and Sepal Width : 0.74
Petal Length and Petal Width : 0.33
Sepal Length and Petal Width : 0.28
Sepal Length and Petal Length : 0.27
Sepal Width and Petal Width : 0.23
Sepal Width and Petal Length : 0.18


###VERSICOLOR FLOWERS

In [None]:
disp('versicolor')

Versicolor Flowers

Pairs of attributes arranged in descending order of their correlation coefficients:
Petal Length and Petal Width : 0.79
Sepal Length and Petal Length : 0.75
Sepal Width and Petal Width : 0.66
Sepal Width and Petal Length : 0.56
Sepal Length and Petal Width : 0.55
Sepal Length and Sepal Width : 0.53


###VIRGINICA FLOWERS

In [None]:
disp('virginica')

Virginica Flowers

Pairs of attributes arranged in descending order of their correlation coefficients:
Sepal Length and Petal Length : 0.86
Sepal Width and Petal Width : 0.54
Sepal Length and Sepal Width : 0.46
Sepal Width and Petal Length : 0.4
Petal Length and Petal Width : 0.32
Sepal Length and Petal Width : 0.28
