# Lesson 3

## Intro

Hello, everyone! Here is the material for the third lesson.

As an introduction, let's look at the following example to set the context. Make sure you understand every term used in it.

Suppose there are many different experts, each of whom is given their own statistical data and has to solve a correlation analysis problem of the kind you are already familiar with. Assume also that each of them must decide whether to reject the hypothesis of independence at the $5\%$ significance level. Finally, assume that in every case the data are in fact independent. Then we would expect, on average, $5\%$ of the experts to reject the correct hypothesis (in other words, to *make a Type I error*).

Let $t$ be the positive value at which the corresponding $P$-value is $5\%$, i.e. $P(r > t) = 0.05$, where $r$ denotes the sample correlation coefficient. By symmetry, for the corresponding negative value we have $P(r < -t) = 0.05$.
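The thought experiment with the experts can be checked with a small simulation. This is only an illustrative sketch: the number of experts, the sample size, and the random seed are assumptions that do not come from the lesson, and each "expert" is modelled as a call to `scipy.stats.pearsonr` on two independent normal samples.

```python
import numpy as np
from scipy import stats

# Hypothetical settings chosen for illustration only
rng = np.random.default_rng(0)
n_experts = 2000
sample_size = 30
alpha = 0.05

rejections = 0
for _ in range(n_experts):
    # Each expert receives two independent samples, so H0 (independence) is true
    x = rng.normal(size=sample_size)
    y = rng.normal(size=sample_size)
    r, p_value = stats.pearsonr(x, y)
    if p_value < alpha:
        # Rejecting a true hypothesis is a Type I error
        rejections += 1

type_one_rate = rejections / n_experts
print("Share of experts who made a Type I error:", type_one_rate)
```

The printed share should be close to the significance level, although it will not match it exactly because the number of simulated experts is finite.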

## Part 1. Parametric Tasks.

Let there be some sample. For the purposes of this assignment, we will always assume that the sample is normal (in other words, that the sample has a normal distribution).

We can also think of it this way: by a sample we mean a series of independent measurements of some physical quantity, where the result of each measurement can be represented as

$$
X = a + \varepsilon,
$$

where $a$ is the true value of the measured quantity (fixed, non-random, but not known in advance); $\varepsilon$ is the error of measurement (a random variable, different for each new experiment).

Let us now represent the error of measurement $\varepsilon$ as the sum of the random error $\varepsilon_0$ with zero mathematical expectation and some fixed component $\mathbb{E}\varepsilon$, which is the systematic error:

$$
\varepsilon = \varepsilon_0 + \mathbb{E}\varepsilon.
$$

> In simpler terms: You conduct a series of experiments on an instrument. Because you do not change the instrument on which you perform your tests, your systematic error does not change. However, you always have some random error, which differs from test to test. 

The result of the measurement $X$ can now be written as 

$$
X = a + \varepsilon_0 + \mathbb{E}\varepsilon.
$$

### 1.1. Zero Systematic Error. 

If $\mathbb{E}\varepsilon = 0$, then passing to the expectation we have

$$
\mathbb{E} X = \mathbb{E} a = a.
$$

Thus, the task of finding the true value of the measured value is reduced to the task of estimating the mathematical expectation of the sample.

### 1.2. Non-Zero Systematic Error. 

If $\mathbb{E}\varepsilon \neq 0$, then going to the expectation we have

$$
\mathbb{E} X = \mathbb{E} a + \mathbb{E} \varepsilon \neq a.
$$

In this case the situation is worse: even though we can estimate $\mathbb{E}X$, that alone is not enough to recover $a$, because the systematic error $\mathbb{E}\varepsilon$ is unknown. So it is important to try to get rid of systematic error whenever possible.
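The two cases can be illustrated with simulated measurements. In the sketch below the true value `a`, the noise level `sigma`, and the systematic error `bias` are hypothetical numbers chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

a = 10.0       # hypothetical true value of the measured quantity
sigma = 0.5    # hypothetical standard deviation of the random error
bias = 0.3     # hypothetical systematic error
n = 10_000     # number of simulated measurements

# Case 1.1: zero systematic error, X = a + eps_0
unbiased = a + rng.normal(0.0, sigma, n)

# Case 1.2: non-zero systematic error, X = a + eps_0 + bias
biased = a + rng.normal(0.0, sigma, n) + bias

mean_unbiased = unbiased.mean()
mean_biased = biased.mean()

print("Mean without systematic error:", mean_unbiased)
print("Mean with systematic error:   ", mean_biased)
```

In the first case the sample mean settles near `a`; in the second it settles near `a + bias`, and no amount of extra data removes the shift.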

#### Remark

Turning to the variance in the same equality, we get

$$
D X = D \varepsilon,
$$

i.e., the variance really determines the random scatter of the measurement results around the mean.
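The intermediate step behind this remark can be written out explicitly. Since $a$ and $\mathbb{E}\varepsilon$ are constants, they do not contribute to the variance:

$$
D X = D(a + \varepsilon_0 + \mathbb{E}\varepsilon) = D \varepsilon_0 = D \varepsilon.
$$

In particular, a systematic error shifts the mean of the measurements but leaves their scatter unchanged.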

## Part 2. Comparison of the Variance of the Two Samples.

Consider the following hypotheses:
<br>$H_0:$ The samples have equal variance,
<br>$H_1:$ The samples have different variances.

To check for equality of variance, we will conduct the $F$-test or, as it is also called, the Fisher test (or the $F$-criterion).

The statistic of the criterion is the value $F$, which is the ratio of the point estimates of the variance of each of the samples:
$$
F = \dfrac{\hat{\sigma}_X^2}{\hat{\sigma}_Y^2}
$$

It can be proved that if the samples are normal and their variances are equal, then the distribution of the $F$-statistic is known and depends only on the sizes of the samples.

If the variances differ, the $F$-statistic tends to move away from $1$, towards values that are unlikely under this distribution.

For two samples of sizes $m$ and $n$ of normal random variables $X$ and $Y$, respectively, with equal variances, the $F$-statistic will have a Fisher distribution $F(m - 1, n - 1)$.
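To get a feel for this distribution, one can look at its quantiles. The sketch below assumes two samples of four observations each and a two-sided $5\%$ level; both choices are made here for illustration. It uses `scipy.stats.f.ppf`, the quantile function of the Fisher distribution.

```python
from scipy import stats

m, n = 4, 4      # assumed sample sizes
alpha = 0.05     # assumed two-sided significance level

dfn, dfd = m - 1, n - 1

# Values of F below lower_critical or above upper_critical are unlikely under H0
lower_critical = stats.f.ppf(alpha / 2, dfn, dfd)
upper_critical = stats.f.ppf(1 - alpha / 2, dfn, dfd)

print("Lower critical value:", lower_critical)
print("Upper critical value:", upper_critical)
```

With such small samples the interval between the two critical values is very wide, so only a large difference between the variances can be detected.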

Let's finally get busy.

In [1]:
import pandas as pd
df = pd.read_excel("../Data-EN/Comparisons.xls")
df.head()

Unnamed: 0,Method,CuO,Part,CuSO4,Group,N,Place,X,Т(С),Product
0,1.0,38.2,1.0,98.2,1.0,9.29,1.0,26.3,20,2.4
1,1.0,38.0,1.0,98.7,1.0,9.38,1.0,26.6,20,3.3
2,1.0,37.66,1.0,97.8,1.0,9.35,1.0,26.1,20,3.4
3,2.0,37.7,2.0,98.1,1.0,9.43,1.0,26.0,20,3.2
4,2.0,37.65,2.0,97.7,2.0,9.53,1.0,26.9,20,4.4


In [2]:
pd.set_option('display.max_rows', 100)
display(df)

Unnamed: 0,Method,CuO,Part,CuSO4,Group,N,Place,X,Т(С),Product
0,1.0,38.2,1.0,98.2,1.0,9.29,1.0,26.3,20,2.4
1,1.0,38.0,1.0,98.7,1.0,9.38,1.0,26.6,20,3.3
2,1.0,37.66,1.0,97.8,1.0,9.35,1.0,26.1,20,3.4
3,2.0,37.7,2.0,98.1,1.0,9.43,1.0,26.0,20,3.2
4,2.0,37.65,2.0,97.7,2.0,9.53,1.0,26.9,20,4.4
5,2.0,37.55,2.0,97.3,2.0,9.48,2.0,26.8,20,3.4
6,,,,,2.0,9.61,2.0,26.1,80,1.7
7,,,,,2.0,9.68,2.0,25.9,80,1.6
8,,,,,,,2.0,26.4,80,2.7
9,,,,,,9.52,2.0,26.6,80,2.2


*From this point on, I'm removing the pieces about formatting tables, assuming you've already learned that.*

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Method   6 non-null      float64
 1   CuO      6 non-null      float64
 2   Part     6 non-null      float64
 3   CuSO4    6 non-null      float64
 4   Group    8 non-null      float64
 5   N        9 non-null      float64
 6   Place    10 non-null     float64
 7   X        10 non-null     float64
 8   Т(С)     12 non-null     int64  
 9   Product  12 non-null     float64
dtypes: float64(9), int64(1)
memory usage: 1.1 KB


In [4]:
df = df.fillna(0)

Let's look at the descriptive statistics of our dataframe. We obviously won't need them today, but we compute them for the sake of order and to reinforce the material.

In [5]:
df.describe()

Unnamed: 0,Method,CuO,Part,CuSO4,Group,N,Place,X,Т(С),Product
count,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0
mean,0.75,18.896667,0.75,48.983333,1.0,7.105833,1.25,21.975,50.0,2.758333
std,0.866025,19.737649,0.866025,51.162448,0.852803,4.286317,0.753778,10.269294,31.333978,0.817378
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,1.6
25%,0.0,0.0,0.0,0.0,0.0,6.9675,1.0,25.975,20.0,2.175
50%,0.5,18.775,0.5,48.65,1.0,9.405,1.0,26.2,50.0,2.7
75%,1.25,37.67,1.25,97.875,2.0,9.5225,2.0,26.6,80.0,3.325
max,2.0,38.2,2.0,98.7,2.0,9.68,2.0,26.9,80.0,4.4


Now let's perform a *two-sample $F$-test* to check for equality of variance for column **N**.

In [6]:
import scipy.stats
import numpy as np

Let's choose the first group as the first set of data, and the second group as the second.

In [7]:
firstGroupBadWay = df['N'][0:4]
secondGroupBadWay = df['N'][4:8]
print(firstGroupBadWay)
print("------")
print(secondGroupBadWay)

0    9.29
1    9.38
2    9.35
3    9.43
Name: N, dtype: float64
------
4    9.53
5    9.48
6    9.61
7    9.68
Name: N, dtype: float64


As you can see, we had to find the row indexes corresponding to the first and second groups by eye, which is not very convenient. Besides, what should we do if the table is very large and rows of the same group are scattered throughout it? Fortunately, there is a more convenient way to select the values we need in column **N**, depending on the value in the column **Group**.

In [8]:
firstGroup = df.loc[df['Group'] == 1.0, 'N']
secondGroup = df.loc[df['Group'] == 2.0, 'N']

In [9]:
print(firstGroup)
print("-------")
print(secondGroup)

0    9.29
1    9.38
2    9.35
3    9.43
Name: N, dtype: float64
-------
4    9.53
5    9.48
6    9.61
7    9.68
Name: N, dtype: float64


As we can see, everything is in order.
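When there are many groups, writing one mask per group becomes tedious. As a possible alternative, `groupby` processes all groups at once. The sketch below rebuilds only the eight rows of columns **Group** and **N** shown above, so that it runs without the Excel file; the name `demo` is introduced here for illustration.

```python
import pandas as pd

# The eight (Group, N) pairs printed above, re-entered by hand
demo = pd.DataFrame({
    "Group": [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0],
    "N": [9.29, 9.38, 9.35, 9.43, 9.53, 9.48, 9.61, 9.68],
})

# Split column N by the value of Group, whatever the order of the rows
group_sizes = demo.groupby("Group")["N"].size()
group_variances = demo.groupby("Group")["N"].var()  # pandas uses ddof=1 by default

print(group_sizes)
print(group_variances)
```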

Let's find the variances of the samples. We know how to do this in several ways, and you can choose whichever one you like best.

In [10]:
firstVar = np.var(firstGroup, ddof = 1)
secondVar = np.var(secondGroup, ddof = 1)
print("firstGroupVar: ",firstVar)
print("secondGroupVar: ", secondVar)

firstGroupVar:  0.003425000000000041
secondGroupVar:  0.007766666666666626


In [11]:
print("firstGroupVar:", firstVar)

firstGroupVar: 0.003425000000000041
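For reference, here are three interchangeable ways to obtain the same unbiased sample variance. The values are the first group's **N** column re-entered by hand, so the sketch does not depend on the Excel file.

```python
import statistics

import numpy as np
import pandas as pd

values = [9.29, 9.38, 9.35, 9.43]  # first group, column N

var_numpy = np.var(values, ddof=1)        # NumPy: ddof=1 must be set explicitly
var_pandas = pd.Series(values).var()      # pandas: ddof=1 is the default
var_stdlib = statistics.variance(values)  # standard library: sample variance

print(var_numpy, var_pandas, var_stdlib)
```

Note that `np.var` without `ddof=1` divides by $n$ rather than $n - 1$ and therefore returns the biased estimate.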


Now we need to find the ratio of point estimates of variance. For example, we can do this for sample variances. 

Let's write a function that performs an $F$-test for equality of variance between two samples. Make sure you understand why the function is written this way.

In [12]:
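def fTest(firstSample, secondSample):
    # Convert the inputs to NumPy arrays so that .size is available
    firstSample = np.array(firstSample)
    secondSample = np.array(secondSample)
    # Ratio of the unbiased (ddof=1) sample variances of the two arguments
    f = np.var(firstSample, ddof=1) / np.var(secondSample, ddof=1)
    # Degrees of freedom of the numerator and of the denominator
    dfn = firstSample.size - 1
    dfd = secondSample.size - 1
    p = 1 - scipy.stats.f.cdf(f, dfn, dfd)
    return (f, p)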

> Pay particular attention to the line `p = 1 - scipy.stats.f.cdf(f, dfn, dfd)`. In your completed assignment, please write an explanation of this line along the lines of "Since the P-value is so-and-so, we must do so-and-so to get so-and-so."

Let us finally perform the $F$-test to check the variance of our samples.

In [13]:
f, pValue = fTest(firstGroup, secondGroup)
print("F:", f)
print("P-value:", pValue)

F: 0.4409871244635269
P-value: 0.7406260852648179
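The $P$-value printed above is the upper-tail probability. Since $H_1$ speaks of *different* variances, one might prefer a two-sided $P$-value. A common convention, used here as an assumption and not as part of the lesson's function, is to double the smaller of the two tails. The helper name `f_test_two_sided` is hypothetical.

```python
import numpy as np
from scipy import stats

def f_test_two_sided(first_sample, second_sample):
    """Two-sided F-test sketch: doubles the smaller tail probability."""
    first = np.asarray(first_sample, dtype=float)
    second = np.asarray(second_sample, dtype=float)
    f = np.var(first, ddof=1) / np.var(second, ddof=1)
    dfn, dfd = first.size - 1, second.size - 1
    lower_tail = stats.f.cdf(f, dfn, dfd)  # P(F <= f)
    upper_tail = stats.f.sf(f, dfn, dfd)   # P(F > f)
    p = min(1.0, 2.0 * min(lower_tail, upper_tail))
    return f, p

# Column N of the two groups, re-entered by hand
first_group = [9.29, 9.38, 9.35, 9.43]
second_group = [9.53, 9.48, 9.61, 9.68]

f_value, p_two_sided = f_test_two_sided(first_group, second_group)
print("F:", f_value)
print("Two-sided P-value:", p_two_sided)
```

With this convention the result does not depend on which sample is placed in the numerator.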


## Part 3. Comparison of the Averages of Two Samples.

You can use Student's t-test (or, as it is also called, $t$-test) to compare the averages of two samples.

It is worth noting that the classical Student's test uses the assumption that the variances are equal.
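When the equal-variance assumption is in doubt, `scipy.stats.ttest_ind` accepts `equal_var=False`, which switches to Welch's variant of the test. The sketch below compares the two variants on the **N** values of the two groups, re-entered by hand.

```python
from scipy import stats

first_group = [9.29, 9.38, 9.35, 9.43]
second_group = [9.53, 9.48, 9.61, 9.68]

# Classical Student's test: assumes equal variances (the default)
student = stats.ttest_ind(first_group, second_group)

# Welch's test: does not assume equal variances
welch = stats.ttest_ind(first_group, second_group, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

Because the two samples here have the same size, both variants give the same statistic; only the degrees of freedom, and hence the $P$-value, differ.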

Consider the following hypotheses:
<br>$H_0:$ The samples have equal averages,
<br>$H_1:$ The samples have different averages.

Suppose there are two independent normal samples of size $n_1$ and $n_2$. If the variances coincide, the statistic of the criterion is

$$
t = \frac{\overline X_1 - \overline X_2}{s_X \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} ~,~~s_X=\sqrt {\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},
$$
where $s^2$ is the unbiased estimate of the variance:
$$
s^2=\frac {\sum^n_{i=1}(X_i-\overline X)^2}{n-1}
$$

Note also that this statistic has a Student distribution $t(n_1 + n_2 - 2)$.

Since we did not reject the hypothesis of equal variances, let's perform the $t$-test for the same groups.

In [14]:
scipy.stats.ttest_ind(firstGroup, secondGroup)

Ttest_indResult(statistic=-4.017367360141328, pvalue=0.006979687421480706)

What do we see? The value of the statistic deviates quite strongly from zero (it is negative because the sample with the smaller mean was taken as the first sample). The small (two-sided, by the way) $P$-value of about $0.007$ confirms this: it is well below the $5\%$ significance level. The conclusion is that we can reject the hypothesis of equality of means. If the means were in fact equal, a deviation this large would occur in less than $1\%$ of cases.
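To connect the formula from this part with the library call, the statistic can be recomputed by hand. The sketch below uses the **N** values of the two groups, re-entered by hand, and compares the manual result with `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

x1 = np.array([9.29, 9.38, 9.35, 9.43])
x2 = np.array([9.53, 9.48, 9.61, 9.68])
n1, n2 = x1.size, x2.size

# Unbiased variance estimates of each sample
s1_sq = np.var(x1, ddof=1)
s2_sq = np.var(x2, ddof=1)

# Pooled standard deviation s_X
s_pooled = np.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))

# t statistic and two-sided P-value from the t(n1 + n2 - 2) distribution
t_manual = (x1.mean() - x2.mean()) / (s_pooled * np.sqrt(1 / n1 + 1 / n2))
p_manual = 2 * stats.t.sf(abs(t_manual), n1 + n2 - 2)

t_scipy, p_scipy = stats.ttest_ind(x1, x2)
print("Manual:", t_manual, p_manual)
print("SciPy: ", t_scipy, p_scipy)
```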

## Part 4. A Paired Two-sample t-test.

Here we will not go deeply into the theoretical justification. Instead, we will perform the necessary calculations.

In [15]:
dfFilter = pd.read_excel("../Data-EN/Filter.xls")

In [16]:
display(dfFilter)

Unnamed: 0,Before,After
0,100.1,96.6
1,115.1,115.6
2,130.0,125.5
3,93.6,94.0
4,108.3,103.3
5,137.2,134.4
6,104.4,100.2
7,97.3,97.1


In [17]:
dfFilter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Before  8 non-null      float64
 1   After   8 non-null      float64
dtypes: float64(2)
memory usage: 256.0 bytes


In [18]:
scipy.stats.ttest_rel(dfFilter['Before'], dfFilter['After'])

Ttest_relResult(statistic=2.9732548005798334, pvalue=0.020711935094041782)

We see that the $P$-value is quite small (below $0.05$). Consequently, we can reject the hypothesis that the mean results **Before** and **After** filtering are equal: the difference is statistically significant at the $5\%$ level.

In [19]:
print(dfFilter['Before'].mean())
print(dfFilter['After'].mean())

110.74999999999999
108.3375
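One way to see what the paired test does is to note that it is equivalent to a one-sample $t$-test on the pairwise differences, with zero as the hypothesised mean. The sketch below checks this on the **Before** and **After** values shown above, re-entered by hand.

```python
import numpy as np
from scipy import stats

before = np.array([100.1, 115.1, 130.0, 93.6, 108.3, 137.2, 104.4, 97.3])
after = np.array([96.6, 115.6, 125.5, 94.0, 103.3, 134.4, 100.2, 97.1])

# Paired test on the two columns
paired = stats.ttest_rel(before, after)

# One-sample test on the differences against a zero mean
differences = before - after
one_sample = stats.ttest_1samp(differences, 0.0)

print("Paired:    ", paired.statistic, paired.pvalue)
print("One-sample:", one_sample.statistic, one_sample.pvalue)
```

This also explains why the paired test needs samples of equal length in which the rows correspond to each other.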


## Part 5. Check for Equality of the Average to a Certain Number.

Let's go back to the file with the contents data.

In [20]:
dfContents = pd.read_excel("../Data-EN/Contents.xls")

In [21]:
display(dfContents)

Unnamed: 0,Lab,Al,Sn,FeO
0,1,0.016,0.42,6.21
1,1,0.015,0.2,6.22
2,1,0.017,0.26,6.33
3,1,0.016,0.27,6.02
4,1,0.019,0.32,6.32
5,2,0.017,0.19,6.09
6,2,0.016,0.12,6.23
7,2,0.016,0.21,6.15
8,2,0.016,0.43,6.26
9,2,0.018,0.27,6.14


In [22]:
dfContents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Lab     60 non-null     int64  
 1   Al      60 non-null     float64
 2   Sn      60 non-null     float64
 3   FeO     60 non-null     float64
dtypes: float64(3), int64(1)
memory usage: 2.0 KB


In [23]:
scipy.stats.ttest_1samp(dfContents['Sn'], 0.26)

Ttest_1sampResult(statistic=-0.053254042834835584, pvalue=0.9577094055431739)

We see that the deviation is very small, and the $P$-value is very large. So, the hypothesis of equality of the mean to a given number ($0.26$) is not rejected.

In [24]:
scipy.stats.ttest_1samp(dfContents['FeO'], 0.26)

Ttest_1sampResult(statistic=247.2628135524946, pvalue=1.1124582548513744e-90)

What's here? The deviation is very large, and the $P$-value is almost zero, so the hypothesis that the mean is equal to $0.26$ is rejected.

In [25]:
dfContents['FeO'].mean()

6.117

Let's look at the hypothesis that the average of the last sample is equal to $6$.

In [26]:
scipy.stats.ttest_1samp(dfContents['FeO'], 6.0)

Ttest_1sampResult(statistic=4.939345942571638, pvalue=6.791082679392058e-06)

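What do we see? The sample mean ($6.117$) differs from $6$ only slightly in absolute terms, but the statistic is still far from zero and the $P$-value is very small. Consequently, we reject the hypothesis that the mean is equal to the given number ($6$).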

For fun, let's see what happens if we check the average against the exact value of the average.

In [27]:
scipy.stats.ttest_1samp(dfContents['FeO'], 6.117)

Ttest_1sampResult(statistic=3.7495901483978375e-14, pvalue=0.9999999999999702)

As expected, the deviation is almost zero with a large $P$-value.
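The one-sample statistic is simple enough to compute by hand. The sketch below uses only the ten **Sn** values displayed above, not the full 60-row file, so its numbers will differ from the outputs in this part. It is meant only to show how the statistic is formed.

```python
import numpy as np
from scipy import stats

# Only the ten displayed rows of column Sn (the real file has 60 rows)
sn = np.array([0.42, 0.20, 0.26, 0.27, 0.32, 0.19, 0.12, 0.21, 0.43, 0.27])
mu0 = 0.26  # hypothesised mean

n = sn.size
standard_error = np.std(sn, ddof=1) / np.sqrt(n)

# t statistic and two-sided P-value from the t(n - 1) distribution
t_manual = (sn.mean() - mu0) / standard_error
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)

result = stats.ttest_1samp(sn, mu0)
print("Manual:", t_manual, p_manual)
print("SciPy: ", result.statistic, result.pvalue)

# Testing against the sample's own mean gives a statistic of (almost) zero
self_check = stats.ttest_1samp(sn, sn.mean())
print("Against own mean:", self_check.statistic, self_check.pvalue)
```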

## Task.

Complete your individual assignment.