
# Starting off

Below is a list of the weights (in kg) of 10 male subjects. How would you describe this data set to another person?


```[55, 56, 56, 58, 60, 61, 63, 64, 70, 78]```

# Introducing Statistics: Measures of Central Tendency, Dispersion, and Correlation

## Aim:
- Be able to describe a large sample of data in a meaningful way that conveys information.

- Be able to describe how two sets of data are related to each other.

- Write functions to calculate the descriptive statistics of a data set.


A **population** is the collection of **all** people, plants, animals, or objects of interest about which we wish to make statistical inferences (generalizations). 

A **population parameter** is a numerical characteristic of a population. In nearly all statistical problems we do not know the value of a parameter because we do not measure the entire population. We use sample data to make an inference about the value of a parameter.



A **sample** is the subset of the population that we actually measure or observe.

A **sample statistic** is a numerical characteristic of a sample. A sample statistic estimates the unknown value of a population parameter. Statistics computed from a sample to summarize it are sometimes referred to as descriptive statistics.

Here is the notation that will be used:

$X_{ij}$ = observation for variable $j$ in subject $i$

$p$ = number of variables

$n$ = number of subjects

In the example to come, we'll have data on 737 people (subjects) and 5 nutritional outcomes (variables). So,

$p$ = 5 variables

$n$ = 737 subjects





In multivariate statistics we will always be working with vectors of observations. In this case we arrange the data for the $p$ variables on each subject into a vector. In the expression below, $\mathbf{X}_i$ is the vector of observations for the $i^{th}$ subject, $i = 1, \dots, n$ (here $n = 737$). The data for the $j^{th}$ variable is located in the $j^{th}$ element of this subject's vector, $j = 1, \dots, p$ (here $p = 5$).


$$\mathbf{X}_i = \left(\begin{array}{l}X_{i1}\\X_{i2}\\ \vdots \\ X_{ip}\end{array}\right)$$

## Measures of Central Tendency



### Mean
The mean, or average, is the value obtained by dividing the sum of all the data by the number of data points.


<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/4e3313161244f8ab61d897fb6e5fbf6647e1d5f5' />

## Mathematically Speaking


Throughout this course, we'll use the ordinary notation for the mean of a variable.

That is, the symbol $\mu$ is used to represent a (theoretical) population mean and the symbol $\bar{x}$ is used to represent a sample mean computed from observed data.

In the multivariate setting, we add subscripts to these symbols to indicate the specific variable for which the mean is being given. For instance, $\mu_1$ represents the population mean for variable $x_1$, and $\bar{x}_1$ denotes a sample mean based on observed data for variable $x_1$.



The population mean is the measure of central tendency for the population. Here, the population mean for variable $j$ is:

$$\mu_j = E(X_{ij})$$

and the sample mean for variable $j$ is:

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n}X_{ij}$$
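As a minimal sketch of this notation in code, the data can be held as a list of $n$ rows (subjects), each with $p$ entries (variables), so that `X[i][j]` plays the role of $X_{ij}$ (with Python's zero-based indices). The toy values and the helper name `column_means` are made up for illustration; they are not the nutritional data described above.

```python
# Hypothetical toy data: n = 3 subjects (rows), p = 2 variables (columns)
X = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]

def column_means(X):
    """Return the sample mean of each variable (column) of X."""
    n = len(X)      # number of subjects
    p = len(X[0])   # number of variables
    return [sum(row[j] for row in X) / n for j in range(p)]

print(column_means(X))  # [2.0, 20.0]
```

Each entry of the result corresponds to one $\bar{x}_j$.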

### Median

In a data set with an odd number of points, the median is the middle value of the sorted data; if the number of points is even, it is the average of the two middle values.

The data set above has 10 values (an even number), so the 5th and 6th sorted values (60 and 61) are the two middle items, and the median is 60.5.


<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/da59c1e963f56160361fcce819a95f351748630a' />

### Mode

The mode is the data item that occurs most frequently in a given data set.

### Questions:

- When would the median be a better measure of central tendency than the mean?

- When is the mode the best measure of central tendency to use?

1. We want to calculate the mean, median, and mode for the above list of numbers. Please write a function to calculate each of those statistics.



In [2]:
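# The sample from the start of the notebook (described there as weights in kg;
# the code cells refer to it as `heights`)
heights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

In [3]:
# Write a function that returns the mean, then call that function on heights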

def calc_mean(data):
    return sum(data)/len(data)
calc_mean(heights)

62.1

In [13]:
# Write a function that returns the median and call it on heights
def medi(data):
    data = sorted(data)
    mid = len(data) // 2
    if len(data) % 2 == 0:
        # even count: average the two middle values
        return (data[mid - 1] + data[mid]) / 2
    else:
        # odd count: take the single middle value
        return data[mid]
medi(heights)

60.5
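
The exercise also asks for the mode. One possible sketch uses `collections.Counter` from the standard library; the name `calc_mode` is our own choice, and returning every tied value as a list is an assumption about how ties should be handled.

```python
from collections import Counter

heights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

def calc_mode(data):
    """Return a list of the most frequent value(s) in data."""
    counts = Counter(data)
    highest = max(counts.values())
    # keep every value that reaches the highest count, in case of ties
    return [value for value, count in counts.items() if count == highest]

print(calc_mode(heights))  # [56]
```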

## Measures of Dispersion
Measures of dispersion quantify the spread of the data: they measure how much variation there is among the data points.



### Range
One simple such measure is the range, which is the difference between the largest and the smallest data item. For our previous dataset,

Range = 78 - 55 = 23.





### Interquartile Range - IQR
The quartiles of a data set divide the data into four equal parts, with one-fourth of the data values in each part. The second quartile is the median of the data set, which divides the data set in half, as shown for a simple dataset below:

![IQR](iqr.png)

The interquartile range (IQR) is the difference between the third and first quartiles, so it measures the spread of the “middle fifty” percent of a data set. Where the range measures the distance between the beginning and end of a set, the interquartile range measures the width of the region where the bulk of the values lie. Because it ignores the extremes, it is less sensitive to outliers than the range, which is why it is often preferred when reporting the spread of things like retirement ages and test scores.
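
Quartiles can be computed under several slightly different conventions, so the sketch below is only one option: it takes the first and third quartiles to be the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd). The function names are illustrative.

```python
def median(values):
    """Median of a list of numbers."""
    values = sorted(values)
    mid = len(values) // 2
    if len(values) % 2 == 0:
        return (values[mid - 1] + values[mid]) / 2
    return values[mid]

def calc_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)

def calc_iqr(data):
    """Third quartile minus first quartile (median-of-halves convention)."""
    data = sorted(data)
    half = len(data) // 2
    lower = data[:half]              # values below the median position
    upper = data[len(data) - half:]  # values above the median position
    return median(upper) - median(lower)

heights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]
print(calc_range(heights))  # 23
print(calc_iqr(heights))    # 8
```

Other conventions (for example, interpolating between neighbouring values) can give slightly different quartiles for the same data.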

### Variance
A more complex measure of dispersion is variance. The variance of a population for variable $x_j$ is:

$\sigma_j^2 = E(x_j-\mu_j)^2$

The population variance $\sigma_j^2$ can be estimated by the sample variance:

$s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)^2=\frac{\sum_{i=1}^{n}X_{ij}^2-\left(\sum_{i=1}^{n}X_{ij}\right)^2/n}{n-1}$

Variance signifies how much the data items are deviating from mean.

1) Larger variance means the data items deviate more from the mean.

2) Smaller variance means the data items are closer to the mean.

Now let’s calculate the variance for the previous dataset,

*Variance* = 

~~~
[(55 - 62.1)² + (56 - 62.1)² + (56 - 62.1)² + (58 - 62.1)² + (60 - 62.1)² + (61 - 62.1)² + (63 - 62.1)² + (64 - 62.1)² + (70 - 62.1)² + (78 - 62.1)²] / 9

= 466.9 / 9

= 51.88

~~~
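
The two forms of the sample variance formula can be checked against this hand calculation. The sketch below computes both; the function names are illustrative, and the results are compared with `math.isclose` because floating-point rounding can make the last digits differ.

```python
import math

heights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

def var_definition(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def var_shortcut(data):
    """(Sum of squares minus (sum)^2 / n), divided by n - 1."""
    n = len(data)
    return (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(round(var_definition(heights), 2))  # 51.88
print(round(var_shortcut(heights), 2))    # 51.88
print(math.isclose(var_definition(heights), var_shortcut(heights)))  # True
```

The shortcut form needs only one pass over the data, but it subtracts two large numbers, so it can lose precision when the values are large relative to their spread.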

### Standard deviation
The standard deviation is simply the square root of the variance. Just as $\sigma^2$ denotes the population variance and $s^2$ the sample variance, $\sigma$ and $s$ denote the corresponding standard deviations. Hence, in this example the sample standard deviation is

$s = \sqrt{s^2}$

$\sqrt{51.88} = 7.20$

### Application


- Write a function to calculate the variance of a dataset.
- Write a function to calculate the standard deviation of a dataset using the variance function.


**Functions can call other functions**

In [46]:
def increase_mean(data, increase):
    new_mean = calc_mean(data) + increase
    return new_mean

In [47]:
increase_mean(heights, 4)

66.1

In [21]:
# variance function
def var(data):
    return sum(list(map(lambda x: (x-calc_mean(data))**2, data)))/(len(data)-1)
var(heights)

51.87777777777777

In [24]:
#standard deviation function
def std(data):
    return var(data)**0.5
std(heights)

7.202622979011033

In [28]:
def var2(data):
    mean = calc_mean(data)
    return sum([(x-mean)**2 for x in data])/(len(data)-1)
var2(heights)

51.87777777777777
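
As a cross-check on the hand-written functions, Python's standard-library `statistics` module provides the same measures. This is only a sanity check, not a replacement for writing the functions yourself.

```python
import statistics

heights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

print(statistics.mean(heights))    # 62.1
print(statistics.median(heights))  # 60.5
print(statistics.mode(heights))    # 56

# variance() and stdev() use the n - 1 (sample) denominator
print(round(statistics.variance(heights), 2))  # 51.88
print(round(statistics.stdev(heights), 2))     # 7.2
```

`statistics.pvariance` and `statistics.pstdev` divide by $n$ instead, which is the population form.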

### List Comprehension


List comprehension is an elegant way to define and create lists based on existing lists.

**Syntax of List Comprehension**

`[expression for item in list]`


Let's take our list of data and create a new list where every data point is multiplied by 2.


In [32]:
print(heights)
times_2 = [x*2 for x in heights]

[55, 56, 56, 58, 60, 61, 63, 64, 70, 78]
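
A list comprehension can also filter items with an optional `if` clause, `[expression for item in list if condition]`. A brief sketch on the same data (the names `over_60` and `over_60_loop` are just for illustration):

```python
heights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# transform every item
times_2 = [x * 2 for x in heights]
print(times_2)  # [110, 112, 112, 116, 120, 122, 126, 128, 140, 156]

# keep only the items that satisfy a condition
over_60 = [x for x in heights if x > 60]
print(over_60)  # [61, 63, 64, 70, 78]

# the equivalent for loop
over_60_loop = []
for x in heights:
    if x > 60:
        over_60_loop.append(x)
print(over_60_loop == over_60)  # True
```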


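Let's replace the for loop in `calc_var()` with a list comprehension:

In [62]:
import math

def calc_var(data):
    # sample variance written with a list comprehension
    # (the same computation as var2 above)
    mean = calc_mean(data)
    return sum([(x - mean)**2 for x in data]) / (len(data) - 1)

def calc_std(data):
    variance = calc_var(data)
    return math.sqrt(variance)

In [61]:
calc_var(heights)

51.87777777777777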

## Measures of Association

Measures of association describe how two variables are related to each other.

### Correlation

Let’s say we have a dataset of the heights and weights of ten males. Normally we expect the weight and height of a person to be correlated, i.e. a taller person is more likely to weigh more than a shorter person. Correlation measures the relationship between variables like these.

### Co-variance
One such measure is called co-variance, which measures how two variables vary with respect to each other.

The population covariance $\sigma_{jk}$ between variables $j$ and $k$ can be estimated by the sample covariance $s_{jk}$. This can be calculated using the formula below:


$$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)(X_{ik}-\bar{x}_k)=\frac{\sum_{i=1}^{n}X_{ij}X_{ik}-(\sum_{i=1}^{n}X_{ij})(\sum_{i=1}^{n}X_{ik})/n}{n-1}$$





### Positive & Negative Covariance.

1) Positive covariance signifies that the higher values of one variable correspond with the higher values of the other variable, and similarly for the lower ones.

2) Negative covariance, on the other hand, signifies that the higher values of one variable correspond to the lower values of the other.

Hence the sign of the covariance shows us the direction of the linear relationship between two variables.
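
To see the sign behaviour on a small example, here is a sketch that applies the sample covariance formula to made-up toy lists (the values and variable names are hypothetical and chosen only so that one list rises with `x` and the other falls):

```python
def calc_mean(data):
    return sum(data) / len(data)

def cov(data1, data2):
    """Sample covariance of two equal-length lists."""
    mean1, mean2 = calc_mean(data1), calc_mean(data2)
    products = [(a - mean1) * (b - mean2) for a, b in zip(data1, data2)]
    return sum(products) / (len(data1) - 1)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases with x
falling = [10, 8, 6, 4, 2]   # decreases as x increases

print(cov(x, rising))   # 5.0
print(cov(x, falling))  # -5.0
```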

#### Question:  
 What does a co-variance of 0 probably mean?

### Correlation Coefficient
The correlation coefficient is obtained by dividing the covariance by the product of the standard deviations of the two variables.  It is defined as,

![correlation](correlation.jpeg)

Or in our fancy notation it is: 

$$r_{jk}=\frac{s_{jk}}{s_js_k}=\frac{\sum_{i=1}^{n}X_{ij}X_{ik}-(\sum_{i=1}^{n}X_{ij})(\sum_{i=1}^{n}X_{ik})/n}{\sqrt{\{\sum_{i=1}^{n}X^2_{ij}-(\sum_{i=1}^{n}X_{ij})^2/n\}\{\sum_{i=1}^{n}X^2_{ik}-(\sum_{i=1}^{n}X_{ik})^2/n\}}}$$


The values lie between $-1$ and $+1$.

- $+1$ signifies a perfect increasing linear relationship (correlation).

- $-1$ signifies a perfect decreasing linear relationship (anti-correlation).
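
One reason to divide by the standard deviations is that the result no longer depends on the units of measurement. The sketch below illustrates this with hypothetical toy values (not taken from `weight-height.csv`): converting heights from inches to centimetres changes the covariance but leaves the correlation unchanged, up to floating-point rounding.

```python
import math

def calc_mean(data):
    return sum(data) / len(data)

def cov(data1, data2):
    """Sample covariance of two equal-length lists."""
    mean1, mean2 = calc_mean(data1), calc_mean(data2)
    products = [(a - mean1) * (b - mean2) for a, b in zip(data1, data2)]
    return sum(products) / (len(data1) - 1)

def std(data):
    # the variance of a variable is its covariance with itself
    return math.sqrt(cov(data, data))

def corr(data1, data2):
    return cov(data1, data2) / (std(data1) * std(data2))

# Hypothetical toy values, for illustration only
heights_in = [60.0, 62.0, 65.0, 68.0, 72.0]
weights_lb = [110.0, 120.0, 140.0, 155.0, 180.0]
heights_cm = [h * 2.54 for h in heights_in]

# The covariance scales with the unit change (by a factor of 2.54)...
print(cov(heights_in, weights_lb), cov(heights_cm, weights_lb))
# ...while the correlation does not
print(math.isclose(corr(heights_in, weights_lb),
                   corr(heights_cm, weights_lb)))  # True
```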

### Applied 

1. Write a function to calculate the covariance of a dataset.
2. Write a function to calculate correlation using your functions for covariance and standard deviation.

In [69]:
#read in data to use

import csv
with open('weight-height.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    
    weights = []
    heights = []
    count = 0
    for row in readCSV:
        if count > 0:
            weight = row[2]
            height = row[1]
            weights.append(float(weight))
            heights.append(float(height))
        else:
            count +=1

    print(weights[:10])
    print(heights[:10])

[241.893563180437, 162.3104725213, 212.7408555565, 220.042470303077, 206.349800623871, 152.212155757083, 183.927888604031, 167.971110489509, 175.92944039571, 156.399676387112]
[73.847017017515, 68.7819040458903, 74.1101053917849, 71.7309784033377, 69.8817958611153, 67.2530156878065, 68.7850812516616, 68.3485155115879, 67.018949662883, 63.4564939783664]
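
The `count` variable above exists only to skip the header row. An alternative sketch calls `next()` on the reader once before looping. To keep the example self-contained it reads from an in-memory string instead of the file: the header names and the first column's values are assumptions (placeholders), while the numbers are the first two rows printed above.

```python
import csv
import io

# Stand-in for open('weight-height.csv'); header names and the first
# column are placeholders, since the file itself is not shown here.
sample = io.StringIO(
    "Label,Height,Weight\n"
    "a,73.847017017515,241.893563180437\n"
    "a,68.7819040458903,162.3104725213\n"
)

reader = csv.reader(sample, delimiter=',')
next(reader)  # skip the header row

heights, weights = [], []
for row in reader:
    heights.append(float(row[1]))  # second column: height
    weights.append(float(row[2]))  # third column: weight

print(heights)  # [73.847017017515, 68.7819040458903]
print(weights)  # [241.893563180437, 162.3104725213]
```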


In [70]:
# write formula to calculate covariance
def cov(data1,data2):
    x1mean,x2mean = calc_mean(data1),calc_mean(data2)
    return sum([(x1-x1mean)*(x2-x2mean) for x1,x2 in list(zip(data1,data2))])/(len(data1)-1)

cov(heights,weights)

114.2426564464631

In [72]:
# write formula to calculate correlation
def corr(data1,data2):
    return cov(data1,data2)/(std(data1)*std(data2))


corr(heights,weights)

0.9247562987409157

#### Question

When we find two variables are highly correlated, does that mean we can say one causes the other?

**Example:** 
*Children that watch a lot of TV are the most violent. Therefore, TV makes children more violent.*

http://www.tylervigen.com/spurious-correlations