# Tools and Methods of Data Analysis
## Session 9 - Part 1

Niels Hoppe <<niels.hoppe.extern@srh.de>>

In [95]:
import math
import pandas as pd
from scipy import stats

### Questionnaire

Please fill out this [short questionnaire](https://docs.google.com/forms/d/e/1FAIpQLScWKpDeZphVHjWxT-9_5svR5NU8jjfRPZGbvaAs_-Q_YifwYw/viewform?usp=sf_link) to help me improve the course. Thank you!

### The Titanic dataset

In [96]:
df = pd.read_csv('../data/titanic_dataset.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### The Titanic dataset (cont.)

In [97]:
def clean_data(df):
    # Drop the columns that are not needed for the analyses below
    df = df.drop(columns=['SibSp', 'Parch', 'Ticket', 'Fare',
                          'Cabin', 'Embarked'])
    # Replace missing values in 'Age' with the rounded mean age
    df = df.fillna({'Age': round(df['Age'].mean())})
    return df

df_clean = clean_data(df.copy())
df_clean.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0


#### Women and children first!

Did female passengers of the Titanic have a higher chance of survival?

*State and test the appropriate hypotheses!*

* $H_0: \pi_{female} \leq \pi_{male}$
* $H_1: \pi_{female} > \pi_{male}$

#### Women and children first! (cont.)

In [98]:
# Crosstab columns 'Survived' and 'Sex'
ctab = pd.crosstab(index=df['Survived'], columns=df['Sex'])
# Calculate column totals (vertical)
ctab.loc['total'] = ctab.sum(axis=0)
# Calculate row totals (horizontal)
ctab['total'] = ctab.sum(axis=1)
ctab

Survived,female,male,total
0,81,468,549
1,233,109,342
total,314,577,891


#### Women and children first! (cont.)

In [99]:
from statsmodels.stats.proportion import proportions_ztest

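# Survivors per sex (row 'Survived == 1'), in the order female, male
count = ctab.iloc[1, 0:2]
# Passengers per sex (row 'total')
nobs = ctab.iloc[2, 0:2]
# One-sided test of H1: pi_female > pi_male
_, pval = proportions_ztest(count=count, nobs=nobs,
                            alternative='larger')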
pval

1.8558738850567398e-59
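
The same one-sided test can be reproduced by hand from the counts in the crosstab. The following is a minimal sketch, assuming the pooled-variance form of the two-proportion z-test and a standard normal reference distribution; the variable names are illustrative and not part of the original notebook.

```python
import math

# Counts taken from the crosstab above
survived_female, n_female = 233, 314
survived_male, n_male = 109, 577

p_female = survived_female / n_female
p_male = survived_male / n_male
# Pooled proportion under H0 (both groups share one survival rate)
p_pooled = (survived_female + survived_male) / (n_female + n_male)

# Standard error of the difference in proportions under H0
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_female + 1 / n_male))
z = (p_female - p_male) / se

# One-sided p-value P(Z > z) for a standard normal Z
pval_manual = 0.5 * math.erfc(z / math.sqrt(2))
print(z, pval_manual)
```

With these counts, z comes out at roughly 16.2, and the p-value has the same order of magnitude as the one reported above.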

#### Age and economic status

Is there a significant difference in the mean age between first, second and third class passengers?

*State and test the appropriate hypotheses!*

* $H_0: \forall i, j: \mu_i = \mu_j$
* $H_1: \exists i, j: \mu_i \neq \mu_j$

#### Age and economic status (cont.)

In [100]:
k = 3 # number of groups
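# Ages per passenger class ('Age' still contains missing values here)
groups = {i: df['Age'][df['Pclass'] == i] for i in range(1, k+1)}
# Key 0 holds the ages of all passengers
groups[0] = df['Age']
# Sample means (pandas skips missing values by default)
x = {i: group.mean() for i, group in groups.items()}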
x

{1: 38.233440860215055,
 2: 29.87763005780347,
 3: 25.14061971830986,
 0: 29.69911764705882}

### One-way ANOVA for means (three or more samples)

The test statistic for one-way ANOVA is:

$$F = \frac{MS_B}{MS_W}$$

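where $k$ is the number of groups, $n_i$, $\bar{x}_i$ and $s^2_i$ are the size, mean and variance of sample $i$, $n_T$ is the total number of observations and $\bar{x}$ is the overall mean: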

$$MS_B = \frac{SS_B}{k - 1} \qquad MS_W = \frac{SS_W}{n_T - k}$$
$$SS_B = \sum_{i=1}^k{n_i \cdot (\bar{x}_i - \bar{x})^2} \qquad SS_W = \sum_{i=1}^k{(n_i - 1) \cdot s^2_i}$$

Assumptions:

* Samples are independent
* Values are normally distributed in each population
* Populations have equal variances

The p-value is calculated from the F-distribution with $k - 1$ and $n_T - k$ degrees of freedom.
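
To make the formulas concrete, here is a small worked sketch on made-up samples (not the Titanic data), using only the standard library. The sample values and variable names are assumptions chosen to keep the arithmetic easy to follow.

```python
from statistics import mean, variance

# Hypothetical toy samples, chosen only to keep the arithmetic simple
samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

k = len(samples)                                   # number of groups
n_total = sum(len(s) for s in samples)             # n_T
grand_mean = sum(sum(s) for s in samples) / n_total

# Between-group and within-group sums of squares
ss_b = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
ss_w = sum((len(s) - 1) * variance(s) for s in samples)

# Mean squares and the F statistic
ms_b = ss_b / (k - 1)
ms_w = ss_w / (n_total - k)
f_stat = ms_b / ms_w
print(ss_b, ss_w, f_stat)
```

For these samples, $SS_B = 26$ and $SS_W = 6$, so $F = 13$ with 2 and 6 degrees of freedom.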

### One-way ANOVA for means (three or more samples) (cont.)

In [101]:
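# Sample sizes (.size counts rows with a missing 'Age' as well)
n = {i: group.size for i, group in groups.items()}
# Sample standard deviations (missing values are skipped)
s = {i: group.std() for i, group in groups.items()}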
n, s

({1: 216, 2: 184, 3: 491, 0: 891},
 {1: 14.802855896450462,
  2: 14.0010768124762,
  3: 12.495398210982415,
  0: 14.526497332334042})
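
The sample sizes above come from `Series.size`, which counts every row, whereas `mean()` and `std()` skip missing values by default. A small sketch on a made-up series (not the Titanic data) illustrates the difference; `Series.count()` is one way to obtain the number of non-missing observations.

```python
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([20.0, None, 40.0])

n_rows = ages.size        # counts all rows, including the missing one
n_valid = ages.count()    # counts only non-missing values
mean_age = ages.mean()    # missing values are skipped by default
print(n_rows, n_valid, mean_age)
```

If the ages contain missing values, sums of squares built from `size` and from `std()` refer to different numbers of observations, so it may be worth checking which count is intended.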

### One-way ANOVA for means (three or more samples) (cont.)

$$SS_B = \sum_{i=1}^k{n_i \cdot (\bar{x}_i - \bar{x})^2}$$

In [102]:
SS_B = sum([n[i] * (x[i] - x[0]) ** 2 for i in range(1, k+1)])
SS_B

25941.085326801287

$$SS_W = \sum_{i=1}^k{(n_i - 1) \cdot s^2_i}$$

In [103]:
SS_W = sum([(n[i] - 1) * s[i] ** 2 for i in range(1, k+1)])
SS_W

159491.432938904

### One-way ANOVA for means (three or more samples) (cont.)

$$MS_B = \frac{SS_B}{k - 1}$$

In [104]:
MS_B = SS_B / (k - 1)
MS_B

12970.542663400644

$$MS_W = \frac{SS_W}{n_T - k}$$

In [105]:
MS_W = SS_W / (n[0] - k)
MS_W

179.6074695257928

### One-way ANOVA for means (three or more samples) (cont.)

$$F = \frac{MS_B}{MS_W}$$

The p-value is calculated from the F-distribution with $k - 1$ and $n_T - k$ degrees of freedom.

In [106]:
F = MS_B / MS_W
pval = stats.f.sf(F, dfn=k-1, dfd=n[0]-k)
F, pval

(72.216053695573, 8.726696383881273e-30)
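
`stats.f.sf` returns the upper-tail probability $P(F_{dfn, dfd} > F)$. As a quick sanity check, the sketch below uses hypothetical values ($F = 13$ with 2 and 6 degrees of freedom) and compares the result with the closed form $(1 + 2F/d_2)^{-d_2/2}$, which holds for the special case of two numerator degrees of freedom.

```python
from scipy import stats

# Hypothetical values for illustration only
f_value, dfn, dfd = 13.0, 2, 6

# Upper-tail probability from SciPy
p_scipy = stats.f.sf(f_value, dfn=dfn, dfd=dfd)

# Closed form, valid only when dfn == 2
p_closed = (1 + 2 * f_value / dfd) ** (-dfd / 2)
print(p_scipy, p_closed)
```

Both values should equal $27/4096 \approx 0.0066$ up to floating-point error.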

### One-way ANOVA for means (three or more samples) (cont.)

In [107]:
stat, pval = stats.f_oneway(groups[1], groups[2], groups[3])
stat, pval

(nan, nan)
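
The `nan` result is most likely caused by the missing values in `'Age'`: `stats.f_oneway` propagates NaN by default. The following minimal sketch on made-up samples (not the Titanic data) shows the effect and one possible remedy, namely dropping missing values before the test.

```python
import math
from scipy import stats

nan = float('nan')
# Hypothetical toy samples; the first one contains a missing value
a = [1.0, 2.0, 3.0, nan]
b = [2.0, 3.0, 4.0]
c = [5.0, 6.0, 7.0]

# A single NaN makes the whole result NaN
stat_raw, p_raw = stats.f_oneway(a, b, c)

# Drop missing values before running the test
a_clean = [v for v in a if not math.isnan(v)]
stat_clean, p_clean = stats.f_oneway(a_clean, b, c)
print(stat_raw, stat_clean)
```

With pandas Series such as `groups[1]`, `Series.dropna()` plays the same role as the list comprehension.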

### Questionnaire again

In [108]:
sheet_id = '1nnTe5PNL5PdH_LrnNd9tqn0KjkQzUeXa6Q58-gFKSZ4'
sheet_name = 'Form%20Responses%201'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'

df = pd.read_csv(url)
df.head()

Unnamed: 0,Timestamp,How old are you? (Full years),Which gender do you identify with?,What month were you born in?,Which semester are you studying in?,What was your Python experience BEFORE the course?,What was your statistics knowledge BEFORE the course?,How long is your daily commute to university? (Estimate one-way in minutes),Do you prefer online or in-person lectures?,Do you prefer a continuous project or isolated exercises?
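
The sheet name in the URL is percent-encoded. As a sketch, assuming the tab is actually called "Form Responses 1" (inferred from the encoded string, not confirmed by the source), the encoding can be derived with the standard library instead of being typed by hand.

```python
from urllib.parse import quote

sheet_id = '1nnTe5PNL5PdH_LrnNd9tqn0KjkQzUeXa6Q58-gFKSZ4'
# Assumed human-readable tab name behind 'Form%20Responses%201'
sheet_name = quote('Form Responses 1')
url = (f'https://docs.google.com/spreadsheets/d/{sheet_id}'
       f'/gviz/tq?tqx=out:csv&sheet={sheet_name}')
print(url)
```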
