In [1]:
import pandas as pd
from scipy import stats

### Hypothesis 1: 
"***On first-class roads, people died in accidents with the same probability as on third-class roads.***"

To begin with, we read the dataframe from the file "accidents.pkl.gz" and keep only the necessary columns: road class ('p36') and number of fatalities ('p13a').

Also, according to the hypothesis, we are only interested in first-class and third-class roads (values 1 and 3 of 'p36').

In [2]:
df = pd.read_pickle("accidents.pkl.gz")

In [3]:
df_hyp1 = df[['p36','p13a']].copy()
df_hyp1 = df_hyp1.query('p36 in [1,3]')

We also need to create a new boolean column, "dead", which is True if at least one person died in the accident and False otherwise.

In [4]:
df_hyp1['dead'] = (df_hyp1['p13a'] > 0)

To run the Chi-Square Test, we need to convert the data into a contingency table.

In [5]:
cont = pd.crosstab(df_hyp1['p36'], df_hyp1['dead'])
cont

| p36 | dead = False | dead = True |
|---|---|---|
| 1 | 96618 | 1104 |
| 3 | 91882 | 536 |
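Dividing each row of the contingency table by its row total gives the share of fatal accidents per road class, which makes the two classes easier to compare. The following is a minimal sketch that rebuilds the table from the counts shown above; the names `counts`, `rates`, `fatal_ratio` and the column labels `non_fatal` / `fatal` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Counts copied from the contingency table above
# (rows: road class 'p36'; columns relabelled for readability).
counts = pd.DataFrame(
    {"non_fatal": [96618, 91882], "fatal": [1104, 536]},
    index=pd.Index([1, 3], name="p36"),
)

# Share of non-fatal and fatal accidents within each road class.
rates = counts.div(counts.sum(axis=1), axis=0)
print(rates)

# Ratio of the fatal shares: first-class relative to third-class roads.
fatal_ratio = rates.loc[1, "fatal"] / rates.loc[3, "fatal"]
print(fatal_ratio)
```

On these counts, the fatal share is about 1.1 % on first-class roads and about 0.6 % on third-class roads.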


Run the Chi-Square Test and take the p-value (the second element of the result).

In [6]:
chi2_test = stats.chi2_contingency(cont)
chi2_test[1]

2.9583564622976707e-38

The p-value is far below 0.05, so we reject the null hypothesis (H0). We can conclude that people died with different probabilities in accidents on first-class and third-class roads.
Next, we compare the observed counts with the counts expected under H0.

In [7]:
expected = chi2_test[3]
cont - expected

| p36 | dead = False | dead = True |
|---|---|---|
| 1 | -261.125907 | 261.125907 |
| 3 | 261.125907 | -261.125907 |


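The difference between the observed and expected counts shows that first-class roads had about 261 more fatal accidents than expected under H0, while third-class roads had about 261 fewer. Accidents on first-class roads were therefore more likely to be fatal.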
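The expected counts used above are the counts that would appear if road class and fatality were independent: each cell equals its row total times its column total, divided by the overall total. A minimal sketch of that calculation with NumPy, using the counts from the contingency table (the variable names are illustrative, not taken from the notebook):

```python
import numpy as np
from scipy import stats

# Observed counts: rows are road classes 1 and 3,
# columns are non-fatal and fatal accidents.
observed = np.array([[96618, 1104],
                     [91882, 536]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected_manual = np.outer(row_totals, col_totals) / observed.sum()

# The same expected counts as returned by SciPy (fourth element of the result).
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Observed minus expected, cell by cell.
diff = observed - expected_manual
print(diff)
```

The manual result should agree with the `expected` array from `chi2_contingency`, so `diff` reproduces the table of differences shown above.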

### Hypothesis 2: 
"***In accidents involving Škoda vehicles, the damage to the vehicle is lower than in accidents involving Audi vehicles.***"

First of all, we select only the necessary columns: vehicle brand ('p45a') and damage to the vehicle ('p14').

Also, according to the hypothesis, we are only interested in these car brands: "Škoda" (39) and "Audi" (2).

In [21]:
# Damage values ('p14') for Škoda (brand code 39) and Audi (brand code 2)
skoda_dmg = df.query('p45a == 39')['p14']
audi_dmg = df.query('p45a == 2')['p14']

To determine whether the damage to Škoda cars is lower than the damage to Audi cars, we calculate the p-value with the Mann-Whitney U test. It is a nonparametric test of the null hypothesis that the distribution underlying sample x is the same as the distribution underlying sample y.

In [26]:
res = stats.mannwhitneyu(skoda_dmg,audi_dmg)
print(res)
print(skoda_dmg.median())
print(audi_dmg.median())

MannwhitneyuResult(statistic=893517999.0, pvalue=1.8082422042771395e-165)
400.0
510.0


The p-value is far below 0.05, so we reject the null hypothesis (H0) that both damage distributions are the same. Since the median damage is 400 for Škoda and 510 for Audi, we can conclude that in accidents involving Škoda vehicles, the damage to the vehicle is lower than in accidents involving Audi vehicles.
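Because the hypothesis is directional ("lower than"), a one-sided variant of the test may match it more closely than a test of equal distributions. The sketch below illustrates the `alternative` parameter of `stats.mannwhitneyu` on synthetic values; `sample_a` and `sample_b` are made-up arrays, not the accident data, and the results say nothing about the real Škoda and Audi samples.

```python
import numpy as np
from scipy import stats

# Synthetic values, NOT taken from the accident data:
# sample_a is shifted towards lower values than sample_b.
sample_a = np.arange(0, 100)
sample_b = np.arange(50, 150)

# alternative='less': the alternative hypothesis is that the distribution
# underlying the first sample is stochastically less than the second.
res_less = stats.mannwhitneyu(sample_a, sample_b, alternative='less')

# alternative='greater' tests the opposite direction.
res_greater = stats.mannwhitneyu(sample_a, sample_b, alternative='greater')

print(res_less.pvalue, res_greater.pvalue)
```

With the real data, the equivalent call would presumably be `stats.mannwhitneyu(skoda_dmg, audi_dmg, alternative='less')`; a small p-value would then directly support the claim that Škoda damage tends to be lower.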