# 04 Statistics

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

Recommended reading for this section:

1. Grus, J. (2019). Data Science From Scratch: First Principles with Python (Vol. Second edition). Sebastopol, CA: O’Reilly Media

The following Python modules will be required. Make sure that you have them installed (`collections` belongs to the standard library and needs no separate installation).
- `matplotlib`
- `requests`
- `numpy`
- `scipy`
- `collections`

## Lesson 1

### What is Statistics?

Statistics is a strict elder sister of Data Science. It is a branch of mathematics that uses rigorous mathematical methods for extracting information from data.

Data Science uses statistics, but it is not a branch of mathematics. In addition to statistical formulas it uses many newer computational methods such as machine learning. Often the approaches of Data Science are not well founded mathematically.

One needs to distinguish Statistics as a science from a statistic as a numerical characteristic of a set of data.

Statistics (science) computes statistics (numerical values that describe data sets) using statistical algorithms.

Statistics (numbers) are computed to describe data.

### First look at the data and trivial statistics

If a dataset is small we do not need any special effort to describe it. We can merely look at it.

For example, let a student have the examination grades (5, 4, 5, 3), on a scale of 2 to 5. It is obvious from mere observation that the last exam was not very successful.

But what if we have several hundred exam results? We cannot take in these data at a glance. We need to describe them somehow. Statistical methods help. They allow us to extract essential features that describe the data meaningfully.

We will experiment with a dataset `unif_state_exam.csv` that contains a sample of Unified State Examination grades in physics (column `Phys`), mathematics (`Math`) and Russian language (`Lang`) of applicants to a university (the dataset is not synthetic, it is taken from a real university). Recall that each grade scale ranges from 1 to 100. One more column `Ach` contains marks added for personal achievements, for example, in sports. It ranges from 0 to 10. Finally, column `Tot` is the total grade.

Now we will download this dataset and take only totals for further analysis.

In [None]:
# This module allows us to work with web pages
import requests

# This is the URL of the repository
base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"

# We need this file
file_name = "unif_state_exam.csv"

# Here we download the file
web_data = requests.get(base_url + file_name)
assert web_data.status_code == 200

# Take a look at the data
print(web_data.text[:100])

In [None]:
# Split by line ends
str_data = web_data.text.splitlines()
print(str_data[:10])

In [None]:
# Drop out the header and split grades
lst_data = [s.split(',') for s in str_data[1:]]
print(lst_data[:10])

In [None]:
# Take only totals
data_grd = [int(s[0]) for s in lst_data]
print(data_grd[:10])
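
As an alternative to selecting columns by position, the standard `csv` module can address them by name. The sketch below is only an illustration on a small inline sample: the rows and the column order in `sample_text` are invented for the example and are not taken from the real file.

```python
import csv
import io

# Hypothetical sample that mimics the file layout; the values are invented
sample_text = "Tot,Ach,Phys,Math,Lang\n210,5,60,70,75\n188,0,55,62,71\n"

# DictReader uses the header row to name the fields of every record
reader = csv.DictReader(io.StringIO(sample_text))
sample_totals = [int(row["Tot"]) for row in reader]
print(sample_totals)
```

Addressing a column by name keeps working if the column order in the file changes, which is the main reason to prefer it over a positional index.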

The first question to answer when describing data is how many of them we have. We can answer it with the help of `len` function:

In [None]:
print(f'Data size={len(data_grd)}')

This is too large to examine the data just by printing and looking at them.

We are probably also interested in the largest and the smallest values. There are functions `max` and `min` for it:

In [None]:
print(f'Largest  grade={max(data_grd)}')
print(f'Smallest grade={min(data_grd)}')

The length, the largest and the smallest values are trivial statistics describing the data.

Before going further let us plot a histogram.

Since our data are integers we can compute an exact number of bins: one for each particular value.

In [None]:
nbins = max(data_grd) - min(data_grd) + 1
print(nbins)

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_grd, bins=max(data_grd) - min(data_grd) + 1);
ax.set_xlabel("Total grade")
ax.set_ylabel("# of applicants")
ax.grid();

One more dataset is `stud_activ.csv`. It describes student activity during one year: how many times a student raised a hand (column `RH`), visited course resources (`Res`) and participated in discussion groups (`Disc`). We will take the number of raised hands `RH`.

In [None]:
# This module allows us to work with web pages
import requests

# This is the URL of the repository
base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"

# We need this file
file_name = "stud_activ.csv"

# Here we download the file
web_data = requests.get(base_url + file_name)
assert web_data.status_code == 200

# Take a look at the data
print(web_data.text[:100])

In [None]:
# Split by line ends
str_data = web_data.text.splitlines()
print(str_data[:10])

In [None]:
# Drop out the header and split
lst_data = [s.split(',') for s in str_data[1:]]
print(lst_data[:10])

In [None]:
# Take only RH
data_rh = [int(s[0]) for s in lst_data]
print(data_rh[:10])

Let us have a first look at our dataset

In [None]:
print(f'Data size     ={len(data_rh)}')
print(f'Largest value ={max(data_rh)}')
print(f'Smallest value={min(data_rh)}')

Now plot a histogram

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_rh, bins=max(data_rh) - min(data_rh) + 1);
ax.set_xlabel("# of raised hand")
ax.set_ylabel("# of students")
ax.grid();

Observe two clusters: active students and those who prefer to stay still.

### Central tendencies

If we want to describe a dataset with only one value, this value obviously must be the most important in some sense.

For example, inspecting the histogram of grades above we notice that there is an area where most of the data are located: many people have a grade near 200.

These most important values, in other words the values that are in some sense typical for the dataset, are called central tendencies.

There are different definitions of the central tendency. We will consider the following:
- arithmetic mean (or simply, mean)
- median
- mode

### The Mean

The mean value is the sum of the data divided by their count.

In [None]:
def my_mean(data):
    """
    Mean value of dataset.
    """
    return sum(data) / len(data)

Let us test how it works

In [None]:
# compute the mean by hand
mean_by_hand = (12 + 13 + 18 + 11) / 4

# gather the data into a list
test_data = [12, 13, 18, 11]

# use our function
mean_by_func = my_mean(test_data)

print(f'mean_by_hand={mean_by_hand}')
print(f'mean_by_func={mean_by_func}')

Notice that the mean depends on every value of a dataset.

In [None]:
# change 12 to 12.5
test_data = [12.5, 13, 18, 11]

# observe that the mean is changed
print(f'mean={my_mean(test_data)}')

Now we find mean value for our main dataset and show it on the histogram.

In [None]:
data_mean = my_mean(data_grd)
print(f'data_mean={data_mean}')

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_grd, bins=max(data_grd) - min(data_grd) + 1);
ax.set_xlabel("Total grade")
ax.set_ylabel("# of applicants")

data_mean = my_mean(data_grd)

# this function plots a vertical line through the whole figure
ax.axvline(data_mean, color='red', linewidth=3, label=f'mean at {int(data_mean)}')
ax.legend()
ax.grid();

Student activity dataset:

In [None]:
data_mean = my_mean(data_rh)
print(f'data_mean={data_mean}')

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_rh, bins=max(data_rh) - min(data_rh) + 1);
ax.set_xlabel("# of raised hand")
ax.set_ylabel("# of students")

data_mean = my_mean(data_rh)

ax.axvline(data_mean, color='red', linewidth=3, label=f'mean at {int(data_mean)}')
ax.legend()
ax.grid();

### The Median

The median is the value in the middle of the dataset after it has been sorted.

Let us see how it is computed manually. Consider a dataset:

In [None]:
tst_data = [2, 14, 5, 8, 3, 12, 15, 6, 22]

To find a value in the middle we have to sort this array first:

In [None]:
srt_data = sorted(tst_data)
size = len(srt_data)
print(srt_data)
print(f'size={size}')

The dataset has size 9. Numbering starts from 0 so that the middle number is `9//2=4`. Thus the median is:

In [None]:
median = srt_data[size//2]
print(f'median={median}')

If a dataset has an even size the median is an average of its two middle values:

In [None]:
tst_data = [2, 14, 5, 8, 3, 12, 15, 22]
srt_data = sorted(tst_data)
size = len(srt_data)
print(srt_data)
print(f'size={size}')

In [None]:
# size is 8, numbering from 0:
# first half:   0 1 2 3 
# second half:  4 5 6 7

# higher midpoint is 8 // 2 -> 4
hi_midpoint = size // 2

# lower midpoint is hi_midpoint - 1
lo_midpoint = hi_midpoint - 1

print(f'hi_midpoint={hi_midpoint}')
print(f'lo_midpoint={lo_midpoint}')

# now compute the median
median = (srt_data[lo_midpoint] + srt_data[hi_midpoint]) / 2
print(f'median={median}')

In [None]:
def my_median(data):
    """
    Median of a dataset.
    """
    size = len(data)
    srt_data = sorted(data)
    if size % 2 != 0:
        # odd length
        midpoint = size // 2
        return srt_data[midpoint]
    else:
        # even length
        hi_midpoint = size // 2
        lo_midpoint = hi_midpoint - 1
        return (srt_data[lo_midpoint] + srt_data[hi_midpoint]) / 2
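
As a cross-check, the standard library already provides `statistics.median`, which follows the same rule: the middle element for an odd size and the average of the two middle elements for an even size. The sketch below applies it to the two test lists used above; the variable names are chosen for this illustration only.

```python
import statistics

# Odd and even sized examples from this section
odd_data = [2, 14, 5, 8, 3, 12, 15, 6, 22]
even_data = [2, 14, 5, 8, 3, 12, 15, 22]

median_odd = statistics.median(odd_data)    # middle element of the sorted list
median_even = statistics.median(even_data)  # average of the two middle elements
print(median_odd, median_even)
```

Calling `my_median` on the same lists should give the same values.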

Why do we use the median in addition to or instead of the mean?

It is useful when the data are heavily skewed by a small number of very large or very small values. In this case the mean becomes misleading.

Consider a dataset of salaries. Both non-management and management employees are included:

In [None]:
# salaries, non-management and management employees
salaries = [1200, 950, 1150, 1300, 950, 1150, 800, 1000, 900, 1100, 1200, 120000, 230000]

This dataset is skewed: there are two extremely large values. 

Obviously the mean salary is not representative. Most of the actual values are much smaller:

In [None]:
mean_sal = my_mean(salaries)
print(f'mean salary={mean_sal:8.2f}')

The median salary is a more adequate description of this dataset:

In [None]:
median_sal = my_median(salaries)
print(f'median salary={median_sal}')

Unlike the mean, the median is not so sensitive to data variation. For example, if the largest value becomes larger or the smallest value smaller, the median remains unchanged. Such a statistic is said to be robust.
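
A small check of this property, as a sketch on the salary list from above: the largest salary is multiplied by ten and the shifts of the mean and of the median are compared. The names `salaries_demo` and `inflated` are introduced only for this illustration.

```python
import statistics

salaries_demo = [1200, 950, 1150, 1300, 950, 1150, 800, 1000, 900, 1100, 1200, 120000, 230000]

# Make the largest salary ten times larger
inflated = sorted(salaries_demo)
inflated[-1] *= 10

mean_shift = statistics.mean(inflated) - statistics.mean(salaries_demo)
median_shift = statistics.median(inflated) - statistics.median(salaries_demo)
print(f"mean shift   = {mean_shift:.2f}")
print(f"median shift = {median_shift}")
```

The mean moves by a large amount, while the median stays where it was because the middle element of the sorted list is not affected.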

Let us compute and show the median for our dataset of grades.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_grd, bins=max(data_grd) - min(data_grd) + 1);
ax.set_xlabel("Total grade")
ax.set_ylabel("# of applicants")

data_mean = my_mean(data_grd)
data_median = my_median(data_grd)

print(f"mean   = {data_mean:6.2f}")
print(f"median = {data_median}")

ax.axvline(data_mean, color='red', linewidth=3, label=f'mean at {data_mean:6.2f}')
ax.axvline(data_median, color='green', linewidth=3, label=f'median at {data_median}')
ax.legend()
ax.grid();

We observe that the mean and the median are very close to each other. This suggests that our dataset is nearly symmetric: large and small values appear with similar frequencies.

Now our second dataset:

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_rh, bins=max(data_rh) - min(data_rh) + 1);
ax.set_xlabel("# of raised hand")
ax.set_ylabel("# of students")

data_mean = my_mean(data_rh)
data_median = my_median(data_rh)

print(f"mean   = {data_mean:5.2f}")
print(f"median = {data_median}")

ax.axvline(data_mean, color='red', linewidth=3, label=f'mean at {data_mean:5.2f}')
ax.axvline(data_median, color='green', linewidth=3, label=f'median at {data_median}')
ax.legend()
ax.grid();

This dataset is not so symmetric and the median differs substantially from the mean.

The median tells more about the situation: we observe that half of all students are very active, even though the mean activity is not so high.

### The Mode

The mode is the most frequent value in a dataset.

Let us first consider a simple example:

In [None]:
# what number is the most frequent? 
nums = [5, 1, 2, 3, 2, 2, 2, 1, 2, 2, 2, 4, 1, 5, 2]

# it will be easy to see if we sort it
srt_nums = sorted(nums)
print(srt_nums)

The mode of this set of numbers is `2`. It occurs `8` times, which is more often than any other value.

To compute the mode we need `Counter` from the module `collections`. It creates an object that counts how many times each value is encountered in a list.

In [None]:
from collections import Counter

bins = Counter(nums)
print(f'bins={bins}')

By the way, notice that we call the variable `bins`. This is because `Counter` actually produces the data for a histogram: it puts each value into its own bin and counts them.

The `Counter` class provides a method `.most_common(n)` that returns a list of two-item tuples with the `n` most frequent elements and their respective counts, ordered from the most to the least frequent. If `n` is omitted, then `.most_common()` returns all of the elements.

In [None]:
# this is the most frequent element
print(bins.most_common(1))

# these are two most frequent elements
print(bins.most_common(2))

# all elements
print(bins.most_common())

To get the mode we just take the first most frequent element

In [None]:
nums_mode = bins.most_common(1)[0][0]
print(f'nums_mode={nums_mode}')

But what if a dataset has several elements with identical frequencies?

In [None]:
# new list with 1s and 2s that occur five times
nums = [5, 1, 3, 2, 5, 2, 5, 2, 1, 2, 3, 1, 2, 1, 4, 1, 5]
srt_nums = sorted(nums)
print(srt_nums)

Our approach above will find only one of them.

In [None]:
from collections import Counter

bins = Counter(nums)
print(f'bins={bins}')
print(f'most_common={bins.most_common()}')

Now the mode is a list of two values: 1 and 2.

The idea is the following: first take the largest frequency (this is five in our example), then select only those elements that have this frequency.

In [None]:
most_freq = bins.most_common(1)
print(most_freq)
most_freq = most_freq[0][1]
print(most_freq)

A `Counter` object is a kind of dictionary. We can iterate over it as follows:

In [None]:
# Just to check how we iterate over bins
[[key, val] for key, val in bins.items()]

Add here the filtering: take only most frequent elements

In [None]:
# Select only the most frequent elements
[[key, val] for key, val in bins.items() if val == most_freq]

In [None]:
# Variable val stores frequency, key holds the elements.
# We need only elements
[key for key, val in bins.items() if val == most_freq]

That is it. We have a list of modes for our dataset. Let us collect all together

In [None]:
from collections import Counter

def my_mode(data):
    """
    Returns a list of the most frequent elements
    """
    bins = Counter(data)
    most_freq = bins.most_common(1)[0][1]
    return [key for key, val in bins.items() if val == most_freq]

my_mode(nums)
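
For comparison, the standard library offers `statistics.multimode` (available since Python 3.8), which also returns a list of all most frequent values. The sketch below runs it on the list with tied frequencies from above; the name `tied` is used only here.

```python
import statistics

tied = [5, 1, 3, 2, 5, 2, 5, 2, 1, 2, 3, 1, 2, 1, 4, 1, 5]

# multimode returns every value that reaches the highest count,
# in the order of first appearance in the data
modes = statistics.multimode(tied)
print(modes)
```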

Now we can find the mode for our grades dataset

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_grd, bins=max(data_grd) - min(data_grd) + 1);
ax.set_xlabel("Total grade")
ax.set_ylabel("# of applicants")

data_mean = my_mean(data_grd)
data_median = my_median(data_grd)
data_mode = my_mode(data_grd)

print(f"mean   = {data_mean:6.2f}")
print(f"median = {data_median}")
print(f"mode   = {data_mode}")

# take each mode element if there are many of them
for md in data_mode:
    ax.axvline(md, color='cyan', linewidth=3, label=f'mode at {md}')
    
ax.legend()
ax.grid();

We see that our dataset has only one most frequent element and it almost coincides with the mean and the median. This is again due to the symmetry of the data distribution.

The other dataset:

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_rh, bins=max(data_rh) - min(data_rh) + 1);
ax.set_xlabel("# of raised hand")
ax.set_ylabel("# of students")

data_mean = my_mean(data_rh)
data_median = my_median(data_rh)
data_mode = my_mode(data_rh)

print(f"mean   = {data_mean:5.2f}")
print(f"median = {data_median}")
print(f"mode   = {data_mode}")


ax.axvline(data_mean, color='red', linewidth=3, label=f'mean at {data_mean:5.2f}')
ax.axvline(data_median, color='green', linewidth=3, label=f'median at {data_median}')

for md in data_mode:
    ax.axvline(md, color='cyan', linewidth=3, label=f'mode at {md}')
    
ax.legend()
ax.grid();

We observe that the most common activity is at level 80.

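The above function for the mode is meaningful only for datasets with repeated values, such as integers. If the data are real numbers, exact repeats are rare, so almost every value occurs only once and the result is not informative. In that case the values are usually grouped into bins first.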
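
One possible way to handle real-valued data, shown here only as a sketch, is to group the values into bins of a fixed width and to report the centers of the most populated bins. The helper `binned_mode`, its parameter `width` and the sample list are hypothetical and are not part of the course code; the result depends on the chosen bin width.

```python
from collections import Counter

def binned_mode(data, width):
    """
    Hypothetical helper: centers of the most populated bins of the given width.
    """
    # Map every value to the index of the bin it falls into
    bins = Counter(int(x // width) for x in data)
    top = max(bins.values())
    # Report the center of every bin that reaches the highest count
    return [(k + 0.5) * width for k, cnt in bins.items() if cnt == top]

real_data = [0.12, 0.48, 0.51, 0.53, 0.58, 0.91, 1.34]
print(binned_mode(real_data, width=0.25))
```

Here three of the values fall into the bin from 0.5 to 0.75, so its center is reported.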

### Quantiles

A quantile is a value below which a certain percentage of the data lies. The median is the 50% quantile.

In [None]:
def my_quantile(data, q):
    """
    The quantile, q in (0,1)
    """
    # clamp the index so that q close to 1 does not run past the end of the list
    inx = min(int(round(q * len(data))), len(data) - 1)
    return sorted(data)[inx]

This is a simplified version of a function for quantiles. If `q * len(data)` is not an integer we just round it. A more accurate version interpolates between two neighbouring values. We did this for the median when the dataset had even length.
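
A minimal sketch of such an interpolating version is given below. It assumes one particular convention, namely that the quantile sits at the fractional position `q * (N - 1)` of the sorted data; other conventions exist and give slightly different values. The name `interp_quantile` is hypothetical.

```python
import math

def interp_quantile(data, q):
    """
    Hypothetical helper: quantile with linear interpolation, q in [0, 1].
    """
    srt = sorted(data)
    # Fractional position of the quantile in the sorted data
    pos = q * (len(srt) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(srt) - 1)
    frac = pos - lo
    # Weighted average of the two neighbouring values
    return srt[lo] + frac * (srt[hi] - srt[lo])

print(interp_quantile([2, 14, 5, 8, 3, 12, 15, 22], 0.5))
```

For `q = 0.5` and an even size this reproduces the average of the two middle values used for the median.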

Often quantiles in steps of 25% are used. Since they split the whole range of data into four parts, they are called quartiles.
- The first quartile (Q1): the middle number between the smallest one and the median.
- The second quartile (Q2) is the median.
- The third quartile (Q3) is the middle value between the median and the highest value.

Often the quartile refers to a range of data:
- quartile Q1: range from the smallest values to the point Q1
- Q2: the range between Q1 and Q2 points
and so on.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_grd, bins=max(data_grd) - min(data_grd) + 1);
ax.set_xlabel("Total grade")
ax.set_ylabel("# of applicants")

Q1 = my_quantile(data_grd, 0.25)
Q2 = my_median(data_grd)
Q3 = my_quantile(data_grd, 0.75)

ax.axvline(Q1, color='C1', linewidth=3, label=f'Q1 at {Q1}')
ax.axvline(Q2, color='C2', linewidth=3, label=f'Q2 at {Q2}')
ax.axvline(Q3, color='C3', linewidth=3, label=f'Q3 at {Q3}')
    
ax.legend()
ax.grid();

Our dataset represents examination grades: the higher the value the better. Thus applicants from Q1 are the worst.

Often for datasets where higher means better the quantiles, and hence the quartiles, are defined with respect to the reversed sorting. In other words, the dataset is sorted in descending order.

Let us redefine our quantile function to take it into account.

In [None]:
def my_quantile(data, q, reverse=False):
    """
    The quantile, q in (0,1)
    reverse=True means sorting the array in descending order
    """
    # clamp the index so that q close to 1 does not run past the end of the list
    inx = min(int(round(q * len(data))), len(data) - 1)
    return sorted(data, reverse=reverse)[inx]

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_grd, bins=max(data_grd) - min(data_grd) + 1);
ax.set_xlabel("Total grade")
ax.set_ylabel("# of applicants")

Q1 = my_quantile(data_grd, 0.25, reverse=True)
Q2 = my_median(data_grd)
Q3 = my_quantile(data_grd, 0.75, reverse=True)

ax.axvline(Q1, color='C1', linewidth=3, label=f'Q1 at {Q1}')
ax.axvline(Q2, color='C2', linewidth=3, label=f'Q2 at {Q2}')
ax.axvline(Q3, color='C3', linewidth=3, label=f'Q3 at {Q3}')
    
ax.legend()
ax.grid();

Now the best applicants belong to the first quartile Q1, and those that are not so bad are in Q2. Applicants from Q3 risk not being enrolled, and finally those from Q4 probably will not be enrolled.

Quartiles (also for sorting in the descending order) for the student activity:

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_rh, bins=max(data_rh) - min(data_rh) + 1);
ax.set_xlabel("# of raised hand")
ax.set_ylabel("# of students")

Q1 = my_quantile(data_rh, 0.25, reverse=True)
Q2 = my_median(data_rh)
Q3 = my_quantile(data_rh, 0.75, reverse=True)

ax.axvline(Q1, color='C1', linewidth=3, label=f'Q1 at {Q1}')
ax.axvline(Q2, color='C2', linewidth=3, label=f'Q2 at {Q2}')
ax.axvline(Q3, color='C3', linewidth=3, label=f'Q3 at {Q3}')
    
ax.legend()
ax.grid();

Observe how the quartiles correspond to the data clustering. 

A real example of using quartiles: each scientific journal gets a rank according to its popularity: downloads, citations, submitted papers and so on. Then the journals are sorted by this rank in descending order and split into quartiles.

It is very prestigious for scientists to publish their papers in Q1 Journals.

### Exercises

1\. File `"bank_customers.csv"` that you can find at `"https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"` describes bank clients. It contains three columns: customer age (`CustAge`), period of relationship with the bank in months (`Months`), and credit limit (`CredLim`). Plot a histogram for the column `Months` and compute its mean, median and mode. Show these values on the histogram. Make a second copy of the histogram, compute quartiles and show them on the plot.

2\. Download the dataset from the previous exercise and do the same as in the exercise 1 for the column `CredLim`.

## Lesson 2

### Data variability

Once we have described the data with a central tendency, what more can be said? The second important characteristic of the data is their variability.

Do we have a lot of different values, or are all of them basically the same?

Data variability indicates the strength of the central tendency: the higher the variability, the weaker the central tendency.

We will consider the following variability measures:
- range
- variance and standard deviation
- interquartile range

### The Range

The range is the largest value minus the smallest value. It can be called the full width of the dataset. 

In [None]:
def my_range(data):
    return max(data) - min(data)

The ranges of our datasets:

In [None]:
print(f'data_grd range={my_range(data_grd)}')
print(f'data_rh  range={my_range(data_rh)}')

This is the simplest measure and the least informative.

Consider two datasets of random numbers:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Generate two random datasets 
rng = np.random.default_rng()
rand_data1 = np.concatenate([rng.normal(loc=0, scale=10, size=10000), 
                             rng.normal(loc=0, scale=1,  size=100000), 
                             rng.normal(loc=0, scale=2,  size=100000)])
rand_data2 = np.concatenate([rng.normal(loc=0, scale=10, size=110000),
                             rng.normal(loc=0, scale=7,  size=100000)])

fig, axs = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(5,6))

# compute ranges
rn1 = my_range(rand_data1)
rn2 = my_range(rand_data2)

axs[0].hist(rand_data1, bins=300, label=f"range={rn1:5.2f}");
axs[1].hist(rand_data2, bins=300, label=f"range={rn2:5.2f}");

axs[0].legend()
axs[1].legend();

Although the ranges of these two datasets are close to each other, they have quite different variability.

The range is insensitive to most of the data: only the largest and smallest values matter. For the same reason it is not robust against outliers, since a single extreme value changes it completely.
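
The dependence on the two extreme values only can be checked on a small made-up list. In the sketch below all numbers are invented for the illustration: first an interior value is changed, then only the largest one.

```python
values = [3, 7, 8, 10, 12, 15]
base_range = max(values) - min(values)

# Change an interior value: the range stays the same
interior_changed = [3, 7, 9, 10, 12, 15]
interior_range = max(interior_changed) - min(interior_changed)

# Change only the largest value: the range follows it
extreme_changed = [3, 7, 8, 10, 12, 150]
extreme_range = max(extreme_changed) - min(extreme_changed)

print(base_range, interior_range, extreme_range)
```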

### The Variance and the standard deviation

The variance is the mean squared deviation of the values from their mean.

In other words
- we compute the mean of the dataset
- we compute the deviations: subtract the mean from each value
- we compute the squared deviations
- we compute the mean of the squared deviations, but divide by $N-1$ instead of $N$

The same procedure using mathematical notation. First we compute the mean:
$$
\overline x = \frac{1}{N}\sum_{i=1}^N x_i.
$$
Then find the variance:
$$
\text{Var}=\frac{1}{N-1}\sum_{i=1}^N (x_i - \overline x)^2.
$$

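Why divide by $N-1$? This is not obvious.

When we deal with a dataset, it is typically a sample from a larger population. Thus $\overline x$ is not the actual mean of the population but only its estimate computed from the same sample. The deviations from $\overline x$ are on average smaller than the deviations from the actual mean, which results in an underestimate of the variance. To correct this we divide by $N-1$ instead of $N$.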
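
The effect of the denominator can be observed in a small simulation. The sketch below assumes a normal population with a known variance of 4, draws many small samples from it and averages the two estimates; the sample size, the number of trials and the seed are arbitrary choices made for this illustration.

```python
import random

random.seed(0)  # fixed seed so that the run is repeatable

true_var = 4.0   # population: normal with standard deviation 2
size = 5         # small samples make the effect visible
trials = 20000

sum_div_n = 0.0
sum_div_n1 = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(size)]
    m = sum(sample) / size
    ss = sum((x - m) ** 2 for x in sample)
    sum_div_n += ss / size         # divide by N
    sum_div_n1 += ss / (size - 1)  # divide by N - 1

avg_div_n = sum_div_n / trials
avg_div_n1 = sum_div_n1 / trials
print(f"true variance      = {true_var}")
print(f"average with N     = {avg_div_n:.3f}")
print(f"average with N - 1 = {avg_div_n1:.3f}")
```

On average the estimate with $N$ stays below the true variance, while the estimate with $N-1$ lands close to it.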

The variance is measured in squared data units. If, for example, the data are in meters the variance is in squared meters. To describe the data variability we need the same units as the data themselves, as is the case for the range.

To obtain the correct units we need to compute the square root of the variance. This is called the standard deviation and is usually denoted by $\sigma$:
$$
\sigma=\sqrt{\frac{1}{N-1}\sum_{i=1}^N (x_i - \overline x)^2}.
$$
Let us first compute the standard deviation manually.

In [None]:
tst_data = [10.2, 12.3, 9.2, 8.4, 15.4, 11.2]

# the mean 
mean = sum(tst_data) / len(tst_data)
print(mean)

In [None]:
# squared deviations
sqr_dev = [(x - mean)**2 for x in tst_data]
print(sqr_dev)

In [None]:
# variance
variance = sum(sqr_dev) / (len(sqr_dev) -  1)
print(variance)

In [None]:
# standard deviation
sig = variance**0.5
print(sig)

Now wrap it into a function

In [None]:
def my_stddev(data):
    """
    Standard deviation.
    """
    size = len(data)
    mean = sum(data) / size
    sqr_dev = [(x - mean)**2 for x in data]
    variance = sum(sqr_dev) / (size - 1)
    return variance**0.5

print(my_stddev(tst_data))
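
The result can be compared with the standard library. As far as the `statistics` module is concerned, `stdev` uses the $N-1$ denominator and `pstdev` uses $N$. The sketch below repeats the manual computation on the test data so that it can be run on its own; the names `sample` and `manual` are used only here.

```python
import math
import statistics

sample = [10.2, 12.3, 9.2, 8.4, 15.4, 11.2]

# Manual computation with the N - 1 denominator
m = sum(sample) / len(sample)
manual = math.sqrt(sum((x - m) ** 2 for x in sample) / (len(sample) - 1))

# statistics.stdev is the sample standard deviation (N - 1),
# statistics.pstdev is the population version (N)
print(manual, statistics.stdev(sample), statistics.pstdev(sample))
```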

Let us compute standard deviations for random data considered above.

In [None]:
import matplotlib.pyplot as plt
#from matplotlib.patches import Rectangle
import numpy as np

fig, axs = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(5,6))

# compute standard deviations
sig1 = my_stddev(rand_data1)
sig2 = my_stddev(rand_data2)

axs[0].hist(rand_data1, bins=300);
axs[1].hist(rand_data2, bins=300);

# raw f-strings keep the backslash of \sigma from being read as an escape sequence
axs[0].axvspan(-sig1, sig1, color='red', alpha=0.3, label=rf'$\sigma={sig1:4.2f}$')
axs[1].axvspan(-sig2, sig2, color='red', alpha=0.3, label=rf'$\sigma={sig2:4.2f}$')

for ax in axs:
    ax.set_xlim([-25, 25]);
    ax.legend();

We see that the standard deviation more properly describes the data extent.

Now we compute the standard deviations for grades and student activities.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_grd, bins=max(data_grd) - min(data_grd) + 1);
ax.set_xlabel("Total grade")
ax.set_ylabel("# of applicants")

data_mean = my_mean(data_grd)
data_std = my_stddev(data_grd)

ax.axvline(data_mean, color='red', linewidth=3, label=f'mean at {int(data_mean)}')
ax.axvspan(-data_std + data_mean, data_std + data_mean, color='yellow', alpha=0.3, label=rf'$\sigma={data_std:4.2f}$')
ax.legend()
ax.grid();

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_rh, bins=max(data_rh) - min(data_rh) + 1);
ax.set_xlabel("# of raised hand")
ax.set_ylabel("# of students")

data_mean = my_mean(data_rh)
data_std = my_stddev(data_rh)

ax.axvline(data_mean, color='red', linewidth=3, label=f'mean at {int(data_mean)}')
ax.axvspan(-data_std + data_mean, data_std + data_mean, color='yellow', alpha=0.3, label=rf'$\sigma={data_std:4.2f}$')
ax.legend()
ax.grid();

We observe that if the histogram of the data looks like a hump (or bell) the standard deviation properly describes data variability.

But for a complicated histogram the range is better, while the standard deviation looks irrelevant.

### Interquartile range

One more variability characteristic is the interquartile range. This is the distance between the 75% and 25% quantiles, or Q3 - Q1.

In [None]:
def my_interquart(data):
    return my_quantile(data, 0.75) - my_quantile(data, 0.25)

Interquartile range for grades and student activities:

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_grd, bins=max(data_grd) - min(data_grd) + 1);
ax.set_xlabel("Total grade")
ax.set_ylabel("# of applicants")

data_mean = my_mean(data_grd)
data_std = my_stddev(data_grd)
ax.axvspan(-data_std + data_mean, data_std + data_mean, color='yellow', alpha=0.3, label=rf'$2\sigma={2*data_std:4.2f}$')

Q1 = my_quantile(data_grd, 0.25)
Q3 = my_quantile(data_grd, 0.75)
ax.axvspan(Q1, Q3, color='gray', alpha=0.3, label=f'iqr={Q3-Q1:4.2f}')

ax.legend()
ax.grid();

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(data_rh, bins=max(data_rh) - min(data_rh) + 1);
ax.set_xlabel("# of raised hand")
ax.set_ylabel("# of students")

data_mean = my_mean(data_rh)
data_std = my_stddev(data_rh)
ax.axvspan(-data_std + data_mean, data_std + data_mean, color='yellow', alpha=0.3, label=rf'$2\sigma={2*data_std:4.2f}$')

Q1 = my_quantile(data_rh, 0.25)
Q3 = my_quantile(data_rh, 0.75)
ax.axvspan(Q1, Q3, color='gray', alpha=0.3, label=f'iqr={Q3-Q1:4.2f}')
    
ax.legend()
ax.grid();

For the hump-like histogram of grades the interquartile range is as good as the standard deviation, at least visually. But for the complicated histogram of the activities the interquartile range provides an intuitively more appropriate estimate, in particular because it is located between Q1 and Q3 and thus covers the middle half of the data.
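
Another difference between the two measures is their reaction to outliers. The sketch below uses `statistics.quantiles` from the standard library (Python 3.8 or newer), which interpolates between data points and therefore may differ slightly from the rounding-based `my_quantile`. The helper `iqr` and the made-up data are assumptions of this illustration.

```python
import statistics

def iqr(data):
    # quantiles with n=4 returns the three quartile cut points
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

base = list(range(1, 21))
with_outlier = base + [1000]

iqr_change = iqr(with_outlier) - iqr(base)
std_change = statistics.stdev(with_outlier) - statistics.stdev(base)
print(f"IQR change   = {iqr_change}")
print(f"stdev change = {std_change:.2f}")
```

A single extreme value barely moves the interquartile range, while the standard deviation grows by a large amount.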

### Correlation

Given two sequences of data, the question is whether they depend on each other.

Assume that we record the air temperature and the air pressure daily. When our records become sufficiently long we can try to reveal the mutual dependency of these two features.

It can be done via the Pearson correlation coefficient.

First we compute the covariance.
- we find the mean of each data sequence
- we find the deviations from the mean for both of them
- we find a dot product: multiply component by component and then sum
- we divide the sum by $N-1$

In [None]:
# just two sets of random numbers
xs = [-7.6, -9.5, -5.3, 2.0, 8.6, -0.5, -3.2, -7.4, 0.2, -0.7]
ys = [-92, -112, -69, 20, 89, 11, -43, -88, 2, -18]

# find the mean values
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
print(x_mean, y_mean)

In [None]:
# find the deviations
xd = [x - x_mean for x in xs]
yd = [y - y_mean for y in ys]
print(xd)
print(yd)

In [None]:
# find dot product
dot = [x * y for x, y in zip(xd, yd)]
print(dot)

In [None]:
# this is the covariance
cov = sum(dot) / (len(dot) - 1)
print(cov)

When we compute the dot product of `xd` and `yd` and the corresponding elements of `xs` and `ys` are either both above their means or both below their means, a positive number enters the sum.

When one is above its mean and the other below, a negative number enters the sum.

Accordingly, a large positive covariance means that `x` tends to be large when `y` is large and small when `y` is small.

A large negative covariance means the opposite: `x` tends to be small when `y` is large and vice versa.

A covariance close to zero means that no such relationship exists.

But the covariance has an arbitrary scale that depends on the scales of `xs` and `ys`. It is unclear what counts as a large or small covariance.

But if we divide it by the standard deviations of the input datasets it will always fit the range \[-1,1\]. This is called the Pearson correlation coefficient.
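
In the notation used above for the variance, the two steps can be written as follows. Here $\sigma_x$ and $\sigma_y$ denote the standard deviations of the two sequences, and $r$ is a symbol introduced here for the correlation coefficient:
$$
\text{Cov}(x,y)=\frac{1}{N-1}\sum_{i=1}^N (x_i - \overline x)(y_i - \overline y),
\qquad
r=\frac{\text{Cov}(x,y)}{\sigma_x\,\sigma_y}.
$$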

In [None]:
cor = cov / (my_stddev(xs) * my_stddev(ys))
print(cor)

Now wrap all of this into a function.

In [None]:
def my_corrcoef(xs, ys):
    """
    Pearson correlation coefficient
    """
    x_mean = my_mean(xs)
    y_mean = my_mean(ys)
    x_std = my_stddev(xs)
    y_std = my_stddev(ys)
    xd = [x - x_mean for x in xs]
    yd = [y - y_mean for y in ys]
    dot = [x * y for x, y in zip(xd, yd)]
    cov = sum(dot) / (len(dot) - 1)
    return cov / (x_std * y_std)

print(my_corrcoef(xs, ys))

We observe that `xs` and `ys` are highly correlated since the correlation coefficient is close to its highest value 1.

It means that when `x` is large `y` is also large and vice versa. 

Let us check what happens if we change the sign of `ys`:

In [None]:
print(my_corrcoef(xs, [-y for y in ys]))

We see that now these two sequences are anti-correlated: when `x` is large positive `y` is large negative and vice versa.

Consider two independent random sequences.

In [None]:
import numpy as np
rng = np.random.default_rng()

rx = rng.random(size=100)
ry = rng.random(size=100)

print(my_corrcoef(rx, ry))

Their correlation coefficient is small compared to 1. Thus we conclude that the two sequences are uncorrelated.

### Correlation examples

Consider correlations of grades in our dataset

In [None]:
# This module allows us to work with web pages
import requests

# This is the URL of the repository
base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"

# We need this file
file_name = "unif_state_exam.csv"

# Here we download the file
web_data = requests.get(base_url + file_name)
assert web_data.status_code == 200

# Take a look at the data
print(web_data.text[:100])

In [None]:
# Split by line ends
str_data = web_data.text.splitlines()
print(str_data[:10])

In [None]:
# Drop the header and split each line into grades
lst_data = [s.split(',') for s in str_data[1:]]
print(lst_data[:10])

In [None]:
# Take Phys, Math, Lang
data_pml = [[int(s[2]), int(s[3]), int(s[4])] for s in lst_data]
print(data_pml[:10])

In [None]:
# Compute correlations
cor = my_corrcoef([x[0] for x in data_pml], [x[1] for x in data_pml])
print(f"Phys-Math grades correlation {cor:5.2f}")

cor = my_corrcoef([x[0] for x in data_pml], [x[2] for x in data_pml])
print(f"Phys-Lang grades correlation {cor:5.2f}")

cor = my_corrcoef([x[1] for x in data_pml], [x[2] for x in data_pml])
print(f"Lang-Math grades correlation {cor:5.2f}")

We observe that the correlations are small. Nevertheless, the highest correlation is between the Physics and Mathematics grades, and the smallest one is between Physics and Language.

Now check the student activities dataset. Recall that it contains three columns: how many times a student raised a hand (column `RH`), visited course resources (`Res`) and participated in discussion groups (`Disc`). 

In [None]:
# This module lets us work with web pages
import requests

# This is the URL of the repository
base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"

# We need this file
file_name = "stud_activ.csv"

# Here we download the file
web_data = requests.get(base_url + file_name)
assert web_data.status_code == 200

# Take a look at the data
print(web_data.text[:100])

In [None]:
# Split by line ends
str_data = web_data.text.splitlines()
print(str_data[:10])

In [None]:
# Drop the header and split each line into fields
lst_data = [s.split(',') for s in str_data[1:]]
print(lst_data[:10])

In [None]:
# Convert to ints
data_rrd = [[int(s[0]), int(s[1]), int(s[2])] for s in lst_data]
print(data_rrd[:10])

In [None]:
# Compute correlations
cor = my_corrcoef([x[0] for x in data_rrd], [x[1] for x in data_rrd])
print(f"Raised hand vs Visited res {cor:5.2f}")

cor = my_corrcoef([x[0] for x in data_rrd], [x[2] for x in data_rrd])
print(f"Raised hand vs Discussions {cor:5.2f}")

cor = my_corrcoef([x[1] for x in data_rrd], [x[2] for x in data_rrd])
print(f"Visited res vs Discussions {cor:5.2f}")

We observe that class activity (the raised hands count) correlates strongly with the number of visits to the course resources.
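
The three pairwise computations above follow one pattern, so they can be written as a loop over all pairs of columns. The following is only a sketch: the helper name `pairwise_correlations` and the table `toy` are hypothetical and introduced here for illustration, and `np.corrcoef` stands in for `my_corrcoef` so that the block runs on its own.

```python
from itertools import combinations
import numpy as np

def pairwise_correlations(columns):
    """
    Correlation coefficient for each pair of named columns.
    `columns` maps a column name to a list of numbers.
    """
    result = {}
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        # Element [0, 1] of the 2x2 matrix is the a-b correlation
        result[(name_a, name_b)] = np.corrcoef(a, b)[0, 1]
    return result

# Tiny made-up table, used here only to show the call
toy = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 4, 3, 2, 1],
}
toy_cors = pairwise_correlations(toy)
for (name_a, name_b), cor in toy_cors.items():
    print(f"{name_a} vs {name_b} {cor:5.2f}")
```

With real data one would build the dictionary from the columns of `data_pml` or `data_rrd`.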

### Correlation caveats

Consider two sequences:

In [None]:
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [ 3,  2,  1, 0, 1, 2, 3]
print(my_corrcoef(xs, ys))

The correlation coefficient is zero, but we definitely observe a relationship between `xs` and `ys`: absolute values of their corresponding elements coincide. 

The reason why the correlation coefficient vanishes is the following: if one knows how $x_i$ deviates from the mean $\overline x$, one cannot say how the corresponding $y_i$ deviates from $\overline y$. Correlation reveals only that sort of relationship.
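
A short check makes this concrete. It is a sketch that uses `np.corrcoef` in place of `my_corrcoef` so that it runs on its own. Once `xs` is replaced by its absolute values, the relationship becomes linear and the coefficient detects it.

```python
import numpy as np

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [ 3,  2,  1, 0, 1, 2, 3]

# Raw values: y = |x| is not a linear relationship, so the coefficient vanishes
cor_raw = np.corrcoef(xs, ys)[0, 1]

# Absolute values: now y equals |x| exactly, a linear relationship
cor_abs = np.corrcoef(np.abs(xs), ys)[0, 1]

print(f"x   vs y: {cor_raw:5.2f}")
print(f"|x| vs y: {cor_abs:5.2f}")
```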

Another ambiguous example:

In [None]:
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [100.01, 100.02, 100.03, 100.04, 100.05, 100.06, 100.07]
print(my_corrcoef(xs, ys))

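The correlation takes its highest value, so technically the data are perfectly correlated. But `ys` changes by only 0.01 when `xs` changes by 1. The correlation coefficient says nothing about how large the relationship is, so whether it matters in practice remains questionable. 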
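
The sketch below illustrates why the coefficient cannot see the size of the effect: stretching and shifting one of the sequences leaves it unchanged. The stretch factor, the shift and the use of `np.polyfit` to estimate the slope are choices made here for illustration, and `ys` is rebuilt from `xs` so that it matches the values above up to rounding.

```python
import numpy as np

xs = np.array([-3, -2, -1, 0, 1, 2, 3])
ys = 100.04 + 0.01 * xs   # approximately the values used above

cor_small = np.corrcoef(xs, ys)[0, 1]

# Stretch ys a million times and shift it: the coefficient does not change
cor_big = np.corrcoef(xs, 1e6 * ys - 5)[0, 1]

# Slope of the best-fit line: how much y changes per unit of x
slope = np.polyfit(xs, ys, 1)[0]

print(cor_small, cor_big, slope)
```

The slope, unlike the correlation coefficient, does carry the scale of the relationship.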

### Correlation and causation

Correlation is not causation. 

High correlation indicates that two data sequences vary similarly. But it says nothing about why they vary similarly. If $x$ and $y$ are strongly correlated, that might mean that $x$ causes $y$, that $y$ causes $x$, that each causes the other, that some third factor causes both, or nothing at all.
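
The "third factor" case is easy to simulate. In the sketch below, which uses synthetic data only, a hidden sequence `z` drives both `x` and `y`. The seed, the noise level and the sample size are assumptions made here. Neither `x` nor `y` influences the other, yet they are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen here for reproducibility

n = 1000
z = rng.normal(size=n)               # hidden third factor
x = z + 0.5 * rng.normal(size=n)     # x depends on z and its own noise only
y = z + 0.5 * rng.normal(size=n)     # y depends on z and its own noise only

cor_xy = np.corrcoef(x, y)[0, 1]
print(f"x vs y correlation {cor_xy:5.2f}")
```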

### NumPy and SciPy functions

Most of the statistical functions discussed above are available in NumPy. The exception is the mode, which is computed with another module, SciPy.

In [None]:
import numpy as np
rng = np.random.default_rng()

rx = rng.integers(10, size=1000)
ry = rng.integers(100, size=1000)

In [None]:
# Mean
data_mean = np.mean(rx)
print(data_mean)

In [None]:
# Median
data_median = np.median(rx)
print(data_median)

In [None]:
# Mode
from scipy import stats
data_mode = stats.mode(rx)
print(data_mode)

In [None]:
# Quartiles
Q1, Q2, Q3 = np.percentile(rx, [25, 50, 75])
print(Q1, Q2, Q3)

In [None]:
# Standard deviation (np.std divides by n by default; pass ddof=1 to divide by n - 1)
std = np.std(rx)
print(std)

In [None]:
# Correlation coefficient - returns a matrix with pairwise correlation coefficients
cor = np.corrcoef(rx, ry)
print(cor)
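
Two details are worth checking when comparing NumPy results with hand-written functions. First, `np.std` divides by $n$ unless `ddof=1` is passed, whereas `my_corrcoef` above divides the covariance by $n-1$. Second, `np.corrcoef` returns a matrix, so the coefficient itself is its off-diagonal element. The sketch below illustrates both points on a made-up list `data`, which is not taken from the course datasets.

```python
import numpy as np

# Made-up numbers, used here only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
sq_dev = sum((x - mean) ** 2 for x in data)

std_n = (sq_dev / n) ** 0.5          # divide by n
std_n1 = (sq_dev / (n - 1)) ** 0.5   # divide by n - 1

print(np.std(data), std_n)             # default: ddof=0, division by n
print(np.std(data, ddof=1), std_n1)    # ddof=1: division by n - 1

# np.corrcoef returns a 2x2 matrix; the off-diagonal element is the coefficient
cor_matrix = np.corrcoef(data, data[::-1])
cor = cor_matrix[0, 1]
print(cor)
```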

### Exercises

1\. The file `"bank_customers.csv"`, which you can find at `"https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"`, describes bank clients. It contains three columns: customer age (`CustAge`), period of relationship with the bank in months (`Months`), and credit limit (`CredLim`). Plot a histogram for the column `CustAge` and compute its standard deviation and interquartile range. Show them on the plot.

2\. Download the dataset from the previous exercise and compute correlation coefficients for each pair of columns.