#### Covariance

Covariance is a measure of joint variability of two random variables.

If greater values of one variable mainly correspond to greater values of the other, and the same holds for lesser values (that is, the variables tend to show similar behaviour), the covariance is positive. In the opposite case, when greater values of one variable mainly correspond to lesser values of the other, the covariance is negative. The sign of the covariance therefore shows the direction of the linear relationship between the variables.

$cov(X,Y) = E[(X - E[X])(Y - E[Y])]$

$cov(X,Y) = \frac{1}{n}\sum_{i=1}^{n}{(x_i - \overline{x})(y_i - \overline{y})}$
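
As a quick check of the formula, here is a minimal worked sketch on three made-up points (the values are illustrative and not taken from any dataset). It also computes the $n - 1$ variant, usually called the sample covariance, for comparison.

```python
import numpy as np

# Small illustrative sample (made-up values).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Deviations from the means (the means are 2 and 4).
dx = x - x.mean()
dy = y - y.mean()

# Form used in the formula above: divide the summed products by n.
cov_population = (dx * dy).sum() / len(x)

# Sample form: divide by n - 1 instead.
cov_sample = (dx * dy).sum() / (len(x) - 1)

print(cov_population, cov_sample)
```

The products of the deviations are 2, 0 and 2, so the sum is 4; dividing by $n = 3$ gives about 1.33, and dividing by $n - 1 = 2$ gives 2.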

##### Covariance disadvantages
| Disadvantage      | Description |
| ----------- | ----------- |
| Scale dependent | The covariance value is highly dependent on the scale of the variables being measured. This can make it difficult to compare the strength of the relationship between variables measured on different scales. |
| Sensitive to outliers | Covariance can be sensitive to outliers, which are data points that are significantly different from the rest of the data. Outliers can have a large influence on the covariance value, potentially leading to inaccurate interpretations of the relationship between variables. |
| Limited to linear relationships | Covariance only captures linear association between variables. If the relationship is nonlinear, the covariance can be close to zero even when the variables are strongly dependent. |
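
The three disadvantages in the table can be illustrated with a small sketch on synthetic data. Everything here is an assumption made for illustration: the random data, the seed, the helper name `cov_xy`, and the single outlier point.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = rng.normal(size=1_000)
y = 2.0 * x + rng.normal(size=1_000)


def cov_xy(a, b):
    """Population covariance (divides by n) of two 1-D arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))


# Scale dependence: expressing x in units 100 times smaller multiplies
# the covariance by 100, while the correlation stays the same.
cov_original = cov_xy(x, y)
cov_rescaled = cov_xy(100.0 * x, y)
corr_original = np.corrcoef(x, y)[0, 1]
corr_rescaled = np.corrcoef(100.0 * x, y)[0, 1]

# Sensitivity to outliers: one extreme point added to 1,000 regular
# points pulls the covariance down noticeably.
cov_with_outlier = cov_xy(np.append(x, 50.0), np.append(y, -50.0))

# Nonlinear relationship: y = x**2 is fully determined by x, yet the
# covariance is zero because the deviation products cancel out.
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
cov_parabola = cov_xy(x_sym, x_sym ** 2)

print(cov_original, cov_rescaled, cov_with_outlier, cov_parabola)
```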

In [54]:
import numpy as np
import pandas as pd

In [9]:
data_url = "http://lib.stat.cmu.edu/datasets/boston"

# Raw string avoids an invalid-escape warning for the regex separator.
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

Variables in order:

| Variable      | Description |
| ----------- | ----------- |
| CRIM  | per capita crime rate by town |
| ZN    | proportion of residential land zoned for lots over 25,000 sq.ft. |
| INDUS | proportion of non-retail business acres per town |
| CHAS  | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| NOX   | nitric oxides concentration (parts per 10 million) |
| RM    | average number of rooms per dwelling |
| AGE   | proportion of owner-occupied units built prior to 1940 |
| DIS   | weighted distances to five Boston employment centres |
| RAD   | index of accessibility to radial highways |
| TAX   | full-value property-tax rate per $10,000 |
| PTRATIO | pupil-teacher ratio by town |
| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| LSTAT | % lower status of the population |
| MEDV | Median value of owner-occupied homes in $1000's |

In [11]:
raw_df

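|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.00 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 |
| 1 | 396.90000 | 4.98 | 24.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 0.02731 | 0.00 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 |
| 3 | 396.90000 | 9.14 | 21.60 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0.02729 | 0.00 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1007 | 396.90000 | 5.64 | 23.90 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1008 | 0.10959 | 0.00 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 |
| 1009 | 393.45000 | 6.48 | 22.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1010 | 0.04741 | 0.00 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 |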
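
The alternating short rows above show that each record spans two physical lines of the file, so a column of `raw_df` mixes different variables: column 1, for example, holds ZN on even rows and LSTAT on odd rows. The `data` and `target` arrays built earlier undo that interleaving. A minimal sketch of attaching the names from the variable table follows; `to_named_frame` is a hypothetical helper introduced only for illustration, and the demo call uses placeholder arrays rather than the real data.

```python
import numpy as np
import pandas as pd

# Column names taken from the variable table, in the documented order.
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def to_named_frame(data: np.ndarray, target: np.ndarray) -> pd.DataFrame:
    """Hypothetical helper: one row per record, one named column per variable."""
    frame = pd.DataFrame(data, columns=FEATURE_NAMES)
    frame["MEDV"] = target
    return frame


# Placeholder arrays with the expected shapes (13 features plus a target).
demo = to_named_frame(np.zeros((3, 13)), np.ones(3))
print(demo.columns.tolist())
```

With the real arrays, `to_named_frame(data, target)` would give a frame in which, for instance, the covariance of `ZN` and `INDUS` can be computed on unmixed columns.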


In [23]:
X = raw_df[1]
Z = raw_df[2]

In [36]:
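def get_covariance(x: list, y: list) -> float:
    """Return the population covariance (divides by n) of two equal-length lists."""
    x_mean = np.array(x).mean()
    y_mean = np.array(y).mean()

    # Accumulate the products of paired deviations from the means.
    sum_of_products = 0.0
    for i in range(len(x)):
        x_delta = x[i] - x_mean
        y_delta = y[i] - y_mean
        sum_of_products += x_delta * y_delta

    return sum_of_products / len(x)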


In [37]:
get_covariance(X.to_list(), Z.to_list())

-63.12423418630577
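
The loop-based function divides by $n$, while `DataFrame.cov()` and `np.cov` divide by $n - 1$ by default. That is why the value above (about -63.124) is slightly smaller in magnitude than the corresponding entry of the pandas covariance matrix further down (about -63.187). A small sketch on synthetic data (the arrays and seed are assumptions for illustration) shows that the two conventions differ only by the factor $(n - 1)/n$:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
n = len(x)

# Divide by n, as the loop-based function does.
cov_n = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov matches this convention only when bias=True is passed.
cov_numpy_biased = np.cov(x, y, bias=True)[0, 1]

# Default np.cov divides by n - 1.
cov_numpy_default = np.cov(x, y)[0, 1]

print(np.isclose(cov_n, cov_numpy_biased))
print(np.isclose(cov_n, cov_numpy_default * (n - 1) / n))
```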

#### Example of covariance visualization

In [55]:

raw_df.cov().style.background_gradient(cmap='coolwarm')

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 35393.965253 | -25.379811 | 1158.707506 | -0.122109 | 0.419594 | -1.325038 | 85.405322 | -6.876722 | 46.847761 | 844.821538 | 5.399331 |
| 1 | -25.379811 | 297.587624 | -63.186672 | -0.252925 | -1.396148 | 5.112513 | -373.901548 | 32.629304 | -63.348695 | -1236.453735 | -19.776571 |
| 2 | 1158.707506 | -63.186672 | 98.259949 | 0.109669 | 0.607074 | -1.887957 | 124.513903 | -10.228097 | 35.549971 | 833.36029 | 5.692104 |
| 3 | -0.122109 | -0.252925 | 0.109669 | 0.064513 | 0.002684 | 0.016285 | 0.618571 | -0.053043 | -0.016296 | -1.523367 | -0.066819 |
| 4 | 0.419594 | -1.396148 | 0.607074 | 0.002684 | 0.013428 | -0.024603 | 2.385927 | -0.187696 | 0.616929 | 13.046286 | 0.047397 |
| 5 | -1.325038 | 5.112513 | -1.887957 | 0.016285 | -0.024603 | 0.493671 | -4.751929 | 0.303663 | -1.283815 | -34.583448 | -0.540763 |
| 6 | 85.405322 | -373.901548 | 124.513903 | 0.618571 | 2.385927 | -4.751929 | 792.358399 | -44.329379 | 111.770846 | 2402.690122 | 15.936921 |
| 7 | -6.876722 | 32.629304 | -10.228097 | -0.053043 | -0.187696 | 0.303663 | -44.329379 | 4.434015 | -9.068252 | -189.664592 | -1.059775 |
| 8 | 46.847761 | -63.348695 | 35.549971 | -0.016296 | 0.616929 | -1.283815 | 111.770846 | -9.068252 | 75.816366 | 1335.756577 | 8.760716 |
| 9 | 844.821538 | -1236.453735 | 833.36029 | -1.523367 | 13.046286 | -34.583448 | 2402.690122 | -189.664592 | 1335.756577 | 28404.759488 | 168.153141 |


#### Example of a correlation matrix

In [56]:
raw_df.corr().style.background_gradient(cmap='coolwarm')

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | -0.00782 | 0.621328 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.37967 | 0.625505 | 0.582764 | 0.289946 |
| 1 | -0.00782 | 1.0 | -0.369513 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 |
| 2 | 0.621328 | -0.369513 | 1.0 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.72076 | 0.383248 |
| 3 | -0.055892 | -0.042697 | 0.062938 | 1.0 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 |
| 4 | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1.0 | -0.302188 | 0.73147 | -0.76923 | 0.611441 | 0.668023 | 0.188933 |
| 5 | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.0 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 |
| 6 | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.73147 | -0.240265 | 1.0 | -0.747881 | 0.456022 | 0.506456 | 0.261515 |
| 7 | -0.37967 | 0.664408 | -0.708027 | -0.099176 | -0.76923 | 0.205246 | -0.747881 | 1.0 | -0.494588 | -0.534432 | -0.232471 |
| 8 | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.0 | 0.910228 | 0.464741 |
| 9 | 0.582764 | -0.314563 | 0.72076 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.0 | 0.460853 |
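
Correlation removes the scale dependence of covariance by dividing each entry by the product of the two standard deviations. The sketch below shows this on a small synthetic frame without missing values; the frame, its column names and the seed are assumptions made for illustration. For data with missing values, pandas computes each pair on the rows where both columns are present, so the identity may not hold entry by entry.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

cov = df.cov()

# Standard deviations are the square roots of the diagonal (the variances).
std = np.sqrt(np.diag(cov))

# Divide entry (i, j) by std[i] * std[j] to obtain the correlation.
corr_from_cov = cov / np.outer(std, std)

print(np.allclose(corr_from_cov, df.corr()))
```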
