## Lesson 8

### Analysis of variance. Factor analysis. Logistic regression

#### Analysis of variance

##### Single-factor analysis of variance

Analysis of variance examines the effect of one or more qualitative variables on a quantitative variable.

In a one-factor analysis of variance, one quantitative variable $Y$ is influenced by one factor (one qualitative indicator), observed at $k$ levels, i.e. we have $k$ samples for variable $Y$.

We denote the observed data by $y_{ij}$, where $i$ is the level index ($i$ = 1, 2, ..., $k$), $j$ is the observation index at the $i$th level ($j$ = 1, 2, ..., $n_{i}$).

Each level may have a different number of observations $n_{i}$. The total number of observations is the sum of the observations at all levels:

$$n = \sum\limits_{i=1}^{k}n_i$$

The following quantities are computed from the data $y_{ij}$:

$\overline{y}_{i}$ is the average value of the variable $Y$ at level $i$:

$$\overline{y}_{i} = \frac{1}{n_i}\sum\limits_{j=1}^{n_i}y_{ij}$$

$\overline{Y}$ is the average value of the variable $Y$ over all observations:

$$\overline{Y} = \frac{1}{n}\sum\limits_{i=1}^{k}\sum\limits_{j=1}^{n_i}y_{ij} = \frac{1}{n}\sum\limits_{i=1}^{k}\overline{y}_{i}n_{i}$$

$S^2$ is the sum of the squared deviations of the observations from the overall mean:

$$S^2 = \sum\limits_{i=1}^{k}\sum\limits_{j=1}^{n_i}({y}_{ij} - \overline{Y})^2$$

$S_F^2$ — the sum of the squares of the deviations of the group mean values from the overall mean value $\overline{Y}$:

$$S_F^2 = \sum\limits_{i=1}^{k}(\overline{y}_i - \overline{Y})^2n_i$$

$S_{rest}^2$ — the residual sum of the squared deviations:

$$S_{rest}^{2} = \sum\limits_{i=1}^{k}\sum\limits_{j=1}^{n_i}(y_{ij} - \overline{y}_i)^2$$

The total sum of squared deviations then decomposes into the factor sum and the residual sum:

$$S^2 = S_F^2 + S_{rest}^2$$
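
The decomposition can be checked directly. The short derivation below is a sketch in the notation already introduced; it splits each deviation from the overall mean into a within-group part and a between-group part:

$$S^2 = \sum\limits_{i=1}^{k}\sum\limits_{j=1}^{n_i}\left[(y_{ij} - \overline{y}_i) + (\overline{y}_i - \overline{Y})\right]^2 = S_{rest}^2 + S_F^2 + 2\sum\limits_{i=1}^{k}(\overline{y}_i - \overline{Y})\sum\limits_{j=1}^{n_i}(y_{ij} - \overline{y}_i)$$

The inner sum in the last term is zero for every level, because $\sum\limits_{j=1}^{n_i}(y_{ij} - \overline{y}_i) = n_i\overline{y}_i - n_i\overline{y}_i = 0$, so only $S_F^2 + S_{rest}^2$ remains.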

From these sums, three variance estimates are calculated:

1) total variance:

$$\sigma_{total}^{2} = \frac{S^2}{n - 1} = \frac{1}{n-1}\sum\limits_{i=1}^{k}\sum\limits_{j=1}^{n_i}(y_{ij} - \overline{Y})^2$$

2) factor variance:

$$\sigma_{F}^{2} = \frac{S_{F}^{2}}{k-1} = \frac{1}{k-1}\sum\limits_{i=1}^{k}(\overline{y}_i - \overline{Y})^{2}n_i$$

3) residual variance:

$$\sigma_{rest}^{2} = \frac{S_{rest}^{2}}{n - k} = \frac{1}{n - k}\sum\limits_{i=1}^{k}\sum\limits_{j=1}^{n_i}(y_{ij} - \overline{y}_i)^2$$

In analysis of variance, the null hypothesis $H_0$ that the group means of the quantitative variable are equal is tested:

$$H_0: \overline{y}_1 = \overline{y}_2 = ... = \overline{y}_k.$$

To test this hypothesis, we need to use the ratio:

$$F_H = \frac{\sigma_{F}^{2}}{\sigma_{rest}^{2}}$$

If $F_H$ exceeds the critical value $F_{crit}$ taken from the table of critical points of the Fisher-Snedecor distribution for a given significance level $\alpha$ and two degrees of freedom, $df_{between} = k - 1$ (for the numerator of the ratio) and $df_{within} = n - k$ (for the denominator), then the hypothesis $H_0$ is rejected: the group means differ significantly.

Tables of critical points of the Fisher-Snedecor distribution can be found at the links:

<a href = "https://studfiles.net/preview/5520526/page:14/">Critical points of the Fisher-Snedecor distribution</a>

<a href = "https://www.matburo.ru/tv/table_fisher.pdf">Fisher-Snedecor distribution (F-distribution)</a>
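
Instead of a printed table, the critical point can also be computed numerically. The sketch below assumes SciPy is installed (it is not used elsewhere in this lesson) and uses `scipy.stats.f`; the values `k = 3` and `n = 20` match Example 1 below, and the helper `f_p_value` is a hypothetical name introduced only for illustration.

```python
from scipy import stats

alpha = 0.05          # significance level
k, n = 3, 20          # number of levels and total number of observations
df_between = k - 1    # degrees of freedom of the numerator
df_within = n - k     # degrees of freedom of the denominator

# Critical point: the (1 - alpha) quantile of the F distribution.
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)
print(round(f_crit, 2))  # 3.59


def f_p_value(f_observed, dfn, dfd):
    """Probability of an F value at least as large as f_observed under H0."""
    return stats.f.sf(f_observed, dfn, dfd)
```

Comparing `f_p_value(F_H, df_between, df_within)` with $\alpha$ leads to the same decision as comparing $F_H$ with $F_{crit}$.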

Another useful quantity is the ratio:

$$\eta^{2} = \frac{S_{F}^{2}}{S^{2}}$$

This is the empirical correlation ratio $\eta^2$ (eta squared), which lies between 0 and 1. The greater its value, the more likely it is that the samples have different mean values. As a rule of thumb, if $\eta^2$ is below 0.2-0.3, the group means are not considered statistically significantly different.

**Example 1**

Among the residents of one city, three groups are distinguished by a qualitative feature, profession: accountants, lawyers and programmers.

Consider a quantitative feature, wages (in thousands of rubles). We need to establish whether the average wages of these three groups differ at the significance level $\alpha = 0.05$. The number of people in each group: accountants - 5, lawyers - 8, programmers - 7.

In [None]:
import numpy as np

In [None]:
n1 = 5
n2 = 8
n3 = 7
n = n1 + n2 + n3
print(n)

20


There are three groups in total:

In [None]:
k = 3

Salaries of accountants:

In [None]:
y1 = np.array([70, 50, 65, 60, 75], dtype=np.float64)

Salaries of lawyers:

In [None]:
y2 = np.array([80, 75, 90, 70, 75, 65, 85, 100], dtype=np.float64)

Salaries of programmers:

In [None]:
y3 = np.array([130, 100, 140, 150, 160, 170, 200], dtype=np.float64)

Let's perform a one-factor analysis of variance. First, find the average wages for each occupation:

In [None]:
y1_mean = np.mean(y1)
print(y1_mean)

64.0


In [None]:
y2_mean = np.mean(y2)
print(y2_mean)

80.0


In [None]:
y3_mean = np.mean(y3)
print(y3_mean)

150.0


It can be seen that the average wages are different. Let us check whether this difference is statistically significant. To do this, we first collect all wage values into one array:

In [None]:
y_all = np.concatenate([y1, y2, y3])
y_all

array([ 70.,  50.,  65.,  60.,  75.,  80.,  75.,  90.,  70.,  75.,  65.,
        85., 100., 130., 100., 140., 150., 160., 170., 200.])

Let's find the average wage across all values:

In [None]:
y_mean = np.mean(y_all)
print(y_mean)

100.5


Find $S^2$ - the sum of the squares of the deviations of the observations from the overall mean:

In [None]:
s2 = np.sum((y_all - y_mean)**2)
s2

34445.0

Find $S^2_F$ - the sum of the squares of the deviations of the group mean from the overall mean:

In [None]:
s2_f = ((y1_mean - y_mean)**2) * n1 + ((y2_mean - y_mean)**2) * n2 + ((y3_mean - y_mean)**2) * n3
s2_f

27175.0

Find $S^2_{rest}$ - the residual sum of the squares of deviations:

In [None]:
s2_residual = np.sum((y1 - y1_mean)**2) + np.sum((y2 - y2_mean)**2) + np.sum((y3 - y3_mean)**2)
s2_residual

7270.0

Let's make sure that the equality $S^2 = S_F^2 + S_{rest}^2$ holds:

In [None]:
print(s2)
print(s2_f + s2_residual)

34445.0
34445.0

Let's find the total variance:

In [None]:
sigma2_general = s2 / (n - 1)
sigma2_general

Let's find the factor variance:

In [None]:
sigma2_f = s2_f / (k - 1)
sigma2_f

Let's find the residual variance:

In [None]:
sigma2_residual = s2_residual / (n - k)
sigma2_residual

Calculate $F_H$:

In [None]:
F_h = sigma2_f / sigma2_residual
F_h

Find the value of $F_{crit}$ in the table of critical points of the Fisher-Snedecor distribution for the significance level $\alpha = 0.05$ and two degrees of freedom:

$df_{between} = k - 1 = 3 - 1 = 2$ and $df_{within} = n - k = 20 - 3 = 17$.

For these values $F_{crit} = 3.59$. Since $F_H \approx 31.77 > F_{crit}$, the difference in average wages between the three groups is statistically significant.

Also calculate the empirical correlation ratio $\eta^2$:

In [None]:
eta2 = s2_f / s2
eta2

The value $\eta^2 \approx 0.79$ is well above 0.2-0.3, which agrees with the conclusion that the difference in average wages between the three groups is statistically significant.
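
As a cross-check of the manual calculation, the same test can be run with a library routine. This is a sketch that assumes SciPy is available; `scipy.stats.f_oneway` performs a one-way analysis of variance and returns the F statistic together with its p-value. The arrays repeat the data of Example 1.

```python
import numpy as np
from scipy import stats

# Wages (thousands of rubles) from Example 1
y1 = np.array([70, 50, 65, 60, 75], dtype=np.float64)               # accountants
y2 = np.array([80, 75, 90, 70, 75, 65, 85, 100], dtype=np.float64)  # lawyers
y3 = np.array([130, 100, 140, 150, 160, 170, 200], dtype=np.float64)  # programmers

# One-way ANOVA: the statistic should agree with F_H computed by hand
f_stat, p_value = stats.f_oneway(y1, y2, y3)
print(f_stat, p_value)

# H0 is rejected when the p-value is below the significance level
alpha = 0.05
print(p_value < alpha)
```

The decision based on the p-value is equivalent to comparing $F_H$ with $F_{crit}$.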

##### Two-factor analysis of variance

Table 1 gives the computational formulas for two-factor analysis of variance with single observations.

<img src='https://ru.files.fm/thumb_show.php?i=vppqqgpj&view' width=650>Table 1. Formulas for two-factor analysis of variance with single observations</img>
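
The contents of the table image are not reproduced here, so the sketch below follows the usual textbook formulas for a two-factor layout with one observation per cell; the notation in Table 1 may differ. The data matrix and all variable names are hypothetical: rows are levels of factor A, columns are levels of factor B.

```python
import numpy as np

# Hypothetical data: one observation per (A level, B level) cell
x = np.array([[20., 24., 28.],
              [22., 27., 29.],
              [25., 30., 35.],
              [27., 32., 36.]])
a, b = x.shape                      # number of levels of A and of B

grand_mean = x.mean()
row_means = x.mean(axis=1)          # means over levels of factor A
col_means = x.mean(axis=0)          # means over levels of factor B

# Sums of squares: total, factor A, factor B, residual
ss_total = np.sum((x - grand_mean) ** 2)
ss_a = b * np.sum((row_means - grand_mean) ** 2)
ss_b = a * np.sum((col_means - grand_mean) ** 2)
residuals = x - row_means[:, None] - col_means[None, :] + grand_mean
ss_rest = np.sum(residuals ** 2)

# Degrees of freedom and F ratios for each factor
df_a, df_b, df_rest = a - 1, b - 1, (a - 1) * (b - 1)
f_a = (ss_a / df_a) / (ss_rest / df_rest)
f_b = (ss_b / df_b) / (ss_rest / df_rest)
print(f_a, f_b)
```

Each ratio is compared with the critical point of the F distribution for its own pair of degrees of freedom, `(df_a, df_rest)` or `(df_b, df_rest)`, in the same way as in the single-factor case.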

#### Factor analysis

Factor analysis (FA) is a way of reducing the set of directly observable indicators $X_j = \{x_{ij}\}$, $(i = 1, 2, ..., n;\:j = 1, 2, ..., m)$, to a smaller number $Q < m$ of new linearly independent factors $Y_q$ ($q$ = 1, 2, ..., $Q$).

Suppose that the original data are represented as a matrix $X = \{x_{ij}\}$, $(i = 1, 2, ..., n;\:j = 1, 2, ..., m)$, where $n$ is the number of observations and $m$ is the number of indicators.

Because the quantities $X_j$ may have different physical meanings and measurement scales, it is convenient to work with the standardised data matrix $X^{*} = \{x_{ij}^{*}\}$.

Here each indicator $X_j^{*}$ has a mean value equal to 0 and a unit variance. In factor analysis, the relationship between the measured indicators and the factors is assumed to be linear:

$$X_{j}^{*} = \sum\limits_{q=1}^{Q}a_{jq}Y_q + U_j,\;\sum\limits_{j=1}^{m}a_{jq}^{2} = 1, \;q = 1, 2, ..., m,$$

where $a_{jq}$ are the coefficients to be determined.

The following relation holds:

$$\sigma_{j}^{2} = \sum\limits_{q=1}^{Q}a_{jq}^{2} + U_{j}^{2} = h_j^2 + U_{j}^{2} = 1,$$

where $h_j^2$ and $U_j^2$ are the communality and the specificity of the $j$th indicator, respectively.

This equality holds if the variables are standardised, the factors are uncorrelated and a linear model is used. The communalities $h_j^2$ have to be estimated before the factors are extracted, and this is the first problem of factor analysis.

The second problem is choosing the number of factors and the coordinate axes used to represent the $m$ variables. Centroid, principal component and factor models are used for this.

In the factor problem, constraints must be imposed so that the system of equations $R = A\cdot{A^T} + U^2$ has a unique solution; here $R$ is the correlation matrix of the indicators, $A$ is the matrix of factor loadings and $A^T$ is its transpose.

The factor extraction procedure has infinitely many equivalent solutions satisfying the equality $R^h = A\cdot{A^T}$, where $R^h$ is the correlation matrix with the communalities $h_j^2$ on its diagonal.
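
One way to obtain a loading matrix is the principal component model mentioned above. The sketch below is an illustration under stated assumptions, not the only possible extraction method: the data are synthetic, and the loadings are scaled as eigenvector times the square root of the eigenvalue, a common convention under which $A\cdot{A^T}$ reproduces $R$ when all $m$ components are kept (it differs from a unit-norm scaling of the columns of $A$).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, Q = 200, 5, 2                      # observations, indicators, factors

# Synthetic data (assumption): m indicators driven by Q latent factors plus noise
latent = rng.normal(size=(n, Q))
mixing = rng.normal(size=(Q, m))
X = latent @ mixing + 0.5 * rng.normal(size=(n, m))

# Standardise each indicator to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(X_std, rowvar=False)     # m x m correlation matrix

# Eigen-decomposition of R, eigenvalues sorted in descending order
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: column q is eigenvector q scaled by sqrt(eigenvalue q)
A_full = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
A = A_full[:, :Q]                        # keep the first Q factors

communalities = np.sum(A ** 2, axis=1)   # h_j^2 for each indicator
specificities = 1 - communalities        # U_j^2 for each indicator
print(communalities)
```

Multiplying `A` by any orthogonal $Q \times Q$ matrix leaves `A @ A.T` unchanged, which is the source of the infinitely many equivalent solutions.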

The rotation problem is solved next: in the common factor space already found, the simplest factor description is sought for each variable (maximum loadings on some factors, minimum loadings on the others).

The final result of factor analysis is a set of meaningfully interpretable factors that reproduce the matrix of correlation coefficients between the variables. For a single observation (object) we have:

$$x_{ij}^{*} = \sum\limits_{q=1}^{Q}a_{jq}y_{iq} + U_j, \: i = \overline{1, n}; \: j = \overline{1, m}.$$

Here $y_{iq}$ is the value of factor $q$ for the $i$th object. The factor values are computed from the equality:

$$Y = A^{T}X, \; y_{iq} = \sum\limits_{j=1}^{m}a_{jq}x_{ij}; \; i = \overline{1, n}, \; q = \overline{1, m}.$$

#### Logistic regression

Let $Y_i$, $i = 1, ..., n$, denote the observed values of a binary variable $Y$ (the outcome of a choice), where $n$ is the number of observations, and let $X_i = (x_{1i}, ..., x_{ki})$ be the values of the factors that determine the choice. Consider the linear probability model:

$$Y_i = F(X_{i}\beta^{T}) + \varepsilon_{i} = \beta_{1} x_{1i} + \beta_{2} x_{2i} + ... + \beta_{k} x_{ki} + \varepsilon_{i},$$

where $F(z) = z$, $\beta$ is the vector of regression coefficients and $\varepsilon_{i}$ are independent random variables with zero mathematical expectation (random errors). Since $Y_i$ takes only the two values 0 and 1, the random error for a given $X_i$ also takes only two values. Taking the expectation and using the zero expectation of the error gives:

$$E(Y_i) = 1 \cdot {P(Y_i=1)} + 0 \cdot {P(Y_i = 0)} = P(Y_i = 1) = F(X_i \beta^T)$$

The model can be written in the form:

$$P(Y_i=1) = F(X_i \beta^T) = X_i \beta^T$$
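
A small numerical illustration of the linear probability model is sketched below. The data are hypothetical, the first factor is taken to be a constant 1 (an intercept, which is an assumption), and the coefficients are estimated by ordinary least squares with `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical data: one explanatory factor and a binary outcome
x = np.arange(10, dtype=np.float64)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=np.float64)

# Design matrix with a constant column (intercept) and the factor x
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of the coefficient vector beta
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted "probabilities" P(Y_i = 1) = X_i beta^T
p_hat = X @ beta
print(p_hat.min(), p_hat.max())
```

For these data the smallest fitted value is below 0 and the largest is above 1, so the fitted values cannot all be read as probabilities.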

The linear form does not guarantee that $X_i \beta^T$ lies between 0 and 1, so a binary choice model instead uses a distribution function $F(z)$ whose range of values lies in the interval $[0, 1]$:

$$P_i = F(X_i \beta^T)$$

The binary choice model with the logistic distribution function

$$F(X_i \beta^T) = \frac{e^{X_i \beta^T}}{1 + e^{X_i \beta^T}} = \frac{1}{1 + e^{-X_i \beta^T}} = \Lambda(X_i \beta^T)$$

is called a logit model.
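
The logistic function is easy to evaluate directly. In the sketch below the coefficient vector `beta` and the factor values are hypothetical, chosen only to show that both forms of $\Lambda(z)$ agree and that the resulting probabilities stay strictly between 0 and 1.

```python
import numpy as np


def logistic(z):
    """Logistic distribution function: 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients: intercept and one slope
beta = np.array([-4.0, 0.8])

# Design matrix: a constant column and one factor
X = np.column_stack([np.ones(5), np.array([0.0, 2.0, 5.0, 8.0, 10.0])])

z = X @ beta                              # linear index X_i beta^T
p = logistic(z)                           # P(Y_i = 1) under the logit model
p_alt = np.exp(z) / (1.0 + np.exp(z))     # the equivalent first form
print(p)
```

The probability equals 0.5 exactly where the linear index $X_i \beta^T$ is zero and increases monotonically with it.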