## DIFFERENT STATISTICAL TERMINOLOGIES

1. **ANOVA** stands for **Analysis of Variance**. It is a statistical test that was developed by **Ronald Fisher** in **1918**. ANOVA tells you whether there are any statistically significant differences between the means of three or more independent groups. One-way ANOVA is the most basic form. It works by comparing the variance between the group means with the variance within the groups, using samples taken from each of them. It gives a **probability (p-value)** of observing differences at least as large as those in your samples if all group means were actually equal; a small p-value suggests that the **differences between groups of data are statistically significant**.

In [3]:
# import library
import pandas as pd
#load data
sd= pd.read_csv("data/ml_salary_data.csv")
sd.head()

Unnamed: 0,age,distance,YearsExperience,Salary
0,31.1,77.75,1.1,39343
1,31.3,78.25,1.3,46205
2,31.5,78.75,1.5,37731
3,32.0,80.0,2.0,43525
4,32.2,80.5,2.2,39891


In [None]:
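# Example of an analysis of variance (ANOVA) test
from scipy.stats import f_oneway

# Each argument is treated as one group of observations.
# Note: ANOVA compares group means, so the groups would normally hold
# the same measurement taken in different groups.
data1 = sd['Salary']
data2 = sd['YearsExperience']
data3 = sd['age']

# f_oneway returns the F statistic and the p-value
stat, p = f_oneway(data1, data2, data3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')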

stat=230.458, p=0.000
Probably different distributions
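
ANOVA is usually applied to the same measurement taken in several groups. The following is a minimal sketch under that assumption, using synthetic data: the three groups, their means, and the variable names are hypothetical and do not come from the salary dataset above.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical salaries for three groups (synthetic data, for illustration only)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50_000, scale=5_000, size=30)
group_b = rng.normal(loc=52_000, scale=5_000, size=30)
group_c = rng.normal(loc=60_000, scale=5_000, size=30)

# One-way ANOVA: the null hypothesis is that all group means are equal
stat, p = f_oneway(group_a, group_b, group_c)
print('stat=%.3f, p=%.3f' % (stat, p))

alpha = 0.05  # significance threshold chosen for this sketch
if p < alpha:
    print('At least one group mean probably differs')
else:
    print('No evidence that the group means differ')
```

A significant result only says that some group mean differs; it does not say which one, so a follow-up comparison between pairs of groups is normally needed.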


2. **Analysis of covariance (ANCOVA)** is a general linear model which **blends ANOVA and regression**. It is used to test the main and interaction effects of categorical variables on a continuous dependent variable, while controlling for the effects of selected other continuous variables that co-vary with the dependent variable. These control variables are called the "covariates."

In [None]:
pip install pingouin

In [None]:
from pingouin import ancova

In [None]:
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import scipy as sc
import pandas as pd

#load data
df= sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
# perform ANCOVA
ancova(data=df, dv='fare', covar='age', between='sex')

Unnamed: 0,Source,SS,DF,F,p-unc,np2
0,sex,75769.96,1,28.316695,1.381982e-07,0.038301
1,age,25864.45,1,9.666043,0.001951945,0.013413
2,Residual,1902497.0,711,,,


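ANCOVA summary:
- ‘Source’: Names of the factors considered
- ‘SS’: Sums of squares
- ‘DF’: Degrees of freedom
- ‘F’: F-values
- ‘p-unc’: Uncorrected p-values
- ‘np2’: Partial eta-squared

RESULT:
Both p-values are below 0.05. Sex has a statistically significant effect on fare after controlling for age (F = 28.32, p < 0.001), and the covariate age is itself significantly related to fare (F = 9.67, p = 0.002).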
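
The idea of ANCOVA as a blend of ANOVA and regression can be sketched as a comparison of two nested linear models: one with only the covariate, and one with the covariate plus the group. The example below is an illustration on synthetic data, not a description of how pingouin computes its table; the variable names, effect sizes, and the helper `residual_sum_of_squares` are hypothetical.

```python
import numpy as np
from scipy.stats import f as f_dist

# Synthetic data: a binary group, a continuous covariate, and an outcome
rng = np.random.default_rng(0)
n = 100
group = rng.integers(0, 2, size=n)        # categorical factor coded 0/1
covariate = rng.normal(30, 10, size=n)    # continuous control variable
y = 5.0 * group + 0.5 * covariate + rng.normal(0, 2, size=n)

def residual_sum_of_squares(X, y):
    """Fit ordinary least squares and return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
X_reduced = np.column_stack([ones, covariate])       # covariate only
X_full = np.column_stack([ones, covariate, group])   # covariate + group

rss_reduced = residual_sum_of_squares(X_reduced, y)
rss_full = residual_sum_of_squares(X_full, y)

# F test for the group effect, adjusted for the covariate
df_num = 1
df_den = n - X_full.shape[1]
F = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
p_value = f_dist.sf(F, df_num, df_den)
print('F=%.3f, p=%.3g' % (F, p_value))
```

The group effect is judged by how much the residual sum of squares drops once the group is added to a model that already contains the covariate.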

3. **MANOVA** (Multivariate Analysis of Variance) is a technique that determines the **effects of independent categorical variables on multiple continuous dependent variables**. It is usually used to compare several groups with respect to multiple continuous variables.

In [None]:
from statsmodels.multivariate.manova import MANOVA
fit = MANOVA.from_formula('fare + age ~ sex', data=df)
print(fit.mv_test())

                  Multivariate linear model
                                                              
--------------------------------------------------------------
       Intercept        Value  Num DF  Den DF  F Value  Pr > F
--------------------------------------------------------------
          Wilks' lambda 0.3935 2.0000 711.0000 547.8547 0.0000
         Pillai's trace 0.6065 2.0000 711.0000 547.8547 0.0000
 Hotelling-Lawley trace 1.5411 2.0000 711.0000 547.8547 0.0000
    Roy's greatest root 1.5411 2.0000 711.0000 547.8547 0.0000
--------------------------------------------------------------
                                                              
--------------------------------------------------------------
           sex           Value  Num DF  Den DF  F Value Pr > F
--------------------------------------------------------------
           Wilks' lambda 0.9533 2.0000 711.0000 17.4012 0.0000
          Pillai's trace 0.0467 2.0000 711.0000 17.4012 0.0000
  Hotelling

The Pillai’s Trace test statistic is statistically significant [Pillai’s Trace = 0.0467, F(2, 711) = 17.40, p < 0.001], which indicates that sex has a statistically significant association with the combination of age and fare.
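
To make the test statistics less opaque, here is a minimal sketch of how Wilks' lambda and Pillai's trace can be computed by hand for a one-way design. It uses synthetic data with two hypothetical groups and two dependent variables; it is an illustration of the definitions, not a reproduction of the statsmodels internals.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical groups, each measured on two dependent variables
group_1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))
group_2 = rng.normal(loc=[1.0, 0.5], scale=1.0, size=(40, 2))
groups = [group_1, group_2]

grand_mean = np.vstack(groups).mean(axis=0)

# E: within-group sums of squares and cross-products
E = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)

# H: between-group sums of squares and cross-products
H = sum(
    len(g) * np.outer(g.mean(axis=0) - grand_mean, g.mean(axis=0) - grand_mean)
    for g in groups
)

wilks_lambda = np.linalg.det(E) / np.linalg.det(E + H)
pillai_trace = np.trace(H @ np.linalg.inv(E + H))
print('Wilks lambda=%.4f, Pillai trace=%.4f' % (wilks_lambda, pillai_trace))
```

With only two groups the two statistics add up to 1, which matches the sex rows in the output above (0.9533 + 0.0467).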

4. **Multivariate Analysis of Covariance (MANCOVA)** assesses statistical differences on multiple continuous dependent variables by an independent grouping variable, while controlling for a third variable called the covariate. Multiple covariates can be used, depending on the sample size.

5. **Variance** is a measure of how far a set of data is dispersed from its mean (average) value. It is denoted by ‘σ²’. It is always non-negative and is expressed in squared units of the data.

In [None]:
# Population variance (ddof=0, matching the np.var default) of the numeric and boolean columns
df.var(numeric_only=True, ddof=0)


survived         0.236506
pclass           0.698231
age            210.723580
sibsp            1.214678
parch            0.648999
fare          2466.665312
adult_male       0.239454
alone            0.239454
dtype: float64

6. **Standard deviation** measures the spread of statistical data: it quantifies how far the data points typically deviate from their mean. It is computed as the square root of the variance and is denoted by the symbol ‘σ’.

In [None]:
# Population standard deviation (ddof=0, matching the np.std default) of the numeric and boolean columns
df.std(numeric_only=True, ddof=0)


survived       0.486319
pclass         0.835602
age           14.516321
sibsp          1.102124
parch          0.805605
fare          49.665534
adult_male     0.489340
alone          0.489340
dtype: float64
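
The link between the two measures can be checked on a small array. This is a minimal sketch with made-up numbers; it also shows the `ddof` argument, which switches between the population formula (divide by n, the NumPy default) and the sample formula (divide by n - 1).

```python
import numpy as np

# Small made-up sample with mean 5
x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

pop_var = np.var(x)              # population variance, divides by n
pop_std = np.std(x)              # population standard deviation
sample_var = np.var(x, ddof=1)   # sample variance, divides by n - 1
sample_std = np.std(x, ddof=1)   # sample standard deviation

print(pop_var, pop_std)          # 4.0 2.0
print(np.isclose(pop_std, np.sqrt(pop_var)))        # True
print(np.isclose(sample_std, np.sqrt(sample_var)))  # True
```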

> Difference between the two:
> 
> Variance is the average of the squared deviations from the mean, while the standard deviation is the square root of the variance, so it is expressed in the same units as the data.

7. The **standard error** is a statistical term that measures how accurately a sample represents a population, using the standard deviation. A sample mean generally deviates from the actual mean of the population; the standard error of the mean estimates the typical size of this deviation.

> The standard deviation (SD) measures the amount of variability, or dispersion, of the individual data values around the mean, while the standard error of the mean (SEM) measures how far the sample mean (average) of the data is likely to be from the true population mean. The SEM is always smaller than the SD for any sample with more than one observation.

In [None]:
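from scipy.stats import sem

# sem() needs numeric input; 'age' contains missing values, so they are omitted
sem(df['age'], nan_policy='omit')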

Standard deviation (SD) measures the dispersion of a dataset relative to its mean.
The standard error of the mean (SEM) measures how much discrepancy is likely in a sample’s mean compared with the population mean.
**The SEM takes the SD and divides it by the square root of the sample size.**
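
The formula can be verified against `scipy.stats.sem` on a small array. This is a minimal sketch with made-up numbers; note that `sem` uses the sample standard deviation (`ddof=1`) by default, so the manual calculation does the same.

```python
import numpy as np
from scipy.stats import sem

# Small made-up sample
x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

sample_sd = np.std(x, ddof=1)              # sample standard deviation
manual_sem = sample_sd / np.sqrt(len(x))   # SD divided by sqrt(n)
scipy_sem = sem(x)                         # same quantity from SciPy

print(manual_sem, scipy_sem)
print(np.isclose(manual_sem, scipy_sem))   # True
```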

![image](pics/stdse.png)

8. **Covariance** is a statistical tool that is used to determine the relationship between the movements of two random variables. When two stocks tend to move together, they are seen as having a positive covariance; when they move inversely, the covariance is negative. **Covariance indicates the direction of the linear relationship between two variables**; because its magnitude depends on the units of the variables, the strength of the relationship is usually judged with the correlation coefficient instead.
9. A **covariate** is a possible predictive or explanatory variable of the dependent variable. This may be the reason that in regression analyses, independent variables (i.e., the regressors) are sometimes called covariates.

In [None]:
X = df['age']
Y = df['fare']
c = np.cov(X,Y, rowvar = True)
print(c)

[[          nan           nan]
 [          nan 2469.43684574]]


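The np.cov function returns a covariance matrix, where the off-diagonal entries are covariances and the diagonal entries are variances. The nan entries above appear because the age column contains missing values; rows with missing values need to be dropped before calling np.cov to obtain a usable estimate.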
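
A minimal sketch of handling missing values before computing a covariance matrix, using two short made-up arrays (the numbers are hypothetical, not taken from the Titanic data):

```python
import numpy as np

# Made-up paired observations; the third x value is missing
x = np.array([22.0, 38.0, np.nan, 35.0, 54.0])
y = np.array([7.25, 71.28, 7.92, 53.10, 51.86])

# Without cleaning, every entry that involves x becomes nan
raw_cov = np.cov(x, y)

# Keep only the pairs where both values are present
mask = ~np.isnan(x) & ~np.isnan(y)
cov_matrix = np.cov(x[mask], y[mask])

print(cov_matrix)
# Diagonal entries are the sample variances (ddof=1) of x and y
print(np.isclose(cov_matrix[0, 0], np.var(x[mask], ddof=1)))  # True
```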

10. **Univariate** statistics summarize only one variable at a time. Bivariate statistics compare two variables.

11. **Multivariate** statistics compare more than two variables. Most multivariate analysis involves a dependent variable and multiple independent variables.



12. A **confidence interval** is a range of values that expresses how much uncertainty there is around a particular statistic. Confidence intervals are often reported with a margin of error. They tell you how confident you can be that the results from a poll or survey reflect what you would expect to find if it were possible to survey the entire population. Confidence intervals are intrinsically connected to confidence levels.
Confidence levels are expressed as a percentage (for example, 95%), whereas a confidence interval is the resulting range of numbers.
You can find the upper and lower bounds of the confidence interval by adding the margin of error to, and subtracting it from, the sample mean.
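
As an illustration, the bounds for a mean can be computed as mean ± margin of error, where the margin is a critical value times the standard error. The sketch below assumes a small made-up sample and uses the t distribution for the critical value, which is one common choice for small samples.

```python
import numpy as np
from scipy import stats

# Small made-up sample
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
confidence_level = 0.95

mean = sample.mean()
standard_error = stats.sem(sample)

# Two-sided critical value from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf((1 + confidence_level) / 2, df=len(sample) - 1)
margin_of_error = t_crit * standard_error

lower = mean - margin_of_error
upper = mean + margin_of_error
print('mean=%.3f, 95%% CI=(%.3f, %.3f)' % (mean, lower, upper))
```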

![image](pics/Cl.jpg)

13. **Alpha** is a threshold value used to judge whether a test statistic is statistically significant. It is chosen by the researcher. Alpha represents an acceptable probability of a Type I error (rejecting a null hypothesis that is actually true) in a statistical test. Because alpha corresponds to a probability, it can range from 0 to 1. In practice, 0.01, 0.05, and 0.1 are the most commonly used values for alpha, representing a 1%, 5%, and 10% chance of a Type I error occurring.
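
The meaning of alpha as a Type I error rate can be illustrated with a small simulation. In this sketch (synthetic data, hypothetical settings) both samples always come from the same distribution, so every "significant" result is a false positive; the share of such results should land close to alpha.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 2000

# Each row is one experiment; both samples share the same true mean
a = rng.normal(size=(n_experiments, 30))
b = rng.normal(size=(n_experiments, 30))

# One independent two-sample t-test per row
_, p_values = ttest_ind(a, b, axis=1)

# Fraction of experiments that wrongly reject the null hypothesis
type_i_rate = np.mean(p_values < alpha)
print('Observed Type I error rate: %.3f' % type_i_rate)
```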