Dataset “Babyboom” (variables: Time of birth recorded on the 24-hour clock, Sex of the child (1 = girl, 2 = boy), Birth weight in grams, Number of minutes after midnight of each birth):
- Test the hypothesis that the mean weight of girls equals the mean weight of boys.
- Test the hypothesis that the variance of girls' weight equals the variance of boys' weight.

Dataset “Euroweight” (variables: weight, batch):
- Test the hypotheses that the mean coin weight is the same across batches (pairwise and for all batches jointly).

Dataset “iris.txt” (read the data description in the file “iris_description.txt”; variables: sepal length, sepal width, petal length, petal width, class):
- Test the hypotheses that the distributions of the flower characteristics are equal across classes.
- Test the hypotheses that the means and variances of the flower characteristics are equal across classes.

Dataset “surgery.xlsx” (variables: “Before surgery, V left”, “Before surgery, V right”, “After surgery, V left”, “After surgery, V right”):

- Test the hypothesis that the surgery succeeds with probability 0.7 (0.8). A success means that “V right” before the surgery is less than “V right” after the surgery and, at the same time, “V left” before the surgery is less than “V left” after the surgery.

In [2]:
import numpy as np
import pandas as pd

from scipy import stats

# Babyboom

In [23]:
# Character positions of each fixed-width column
column_specs = [
    (0, 8),   # TimeOfBirth
    (8, 16),  # Sex
    (16, 24),  # BirthWeight
    (24, 32),  # NumOfMinutesAfterMidnight
]

# Column names
column_names = [
    "TimeOfBirth",
    "Sex",
    "BirthWeight",
    "NumOfMinutesAfterMidnight",
]

# Read the fixed-width file
df = pd.read_fwf('babyboomdat.txt', colspecs=column_specs, header=None, names=column_names)

In [24]:
df

Unnamed: 0,TimeOfBirth,Sex,BirthWeight,NumOfMinutesAfterMidnight
0,5,1,3837,5
1,104,1,3334,64
2,118,2,3554,78
3,155,2,3838,115
4,257,2,3625,177
5,405,1,2208,245
6,407,1,1745,247
7,422,2,2846,262
8,431,2,3166,271
9,708,2,3520,428


In [31]:
girls_weight = df[df['Sex'] == 1]['BirthWeight']
boys_weight = df[df['Sex'] == 2]['BirthWeight']

stat, pvalue = stats.ttest_ind(girls_weight, boys_weight)

print(f'T Statistics = {stat}, P-Value = {pvalue}')

T Statistics = -1.5228564442562815, P-Value = 0.13528918910545554
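
By default `stats.ttest_ind` assumes that both groups have the same variance. Since equality of variances is itself one of the hypotheses tested in this notebook, it can be useful to compare against Welch's t-test, which drops that assumption. The sketch below uses synthetic samples (hypothetical values, not the Babyboom data) and also recomputes Welch's statistic by hand.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two weight samples (NOT the Babyboom data).
rng = np.random.default_rng(0)
girls = rng.normal(loc=3100, scale=600, size=20)
boys = rng.normal(loc=3400, scale=400, size=25)

# Student's t-test: pooled variance, i.e. equal variances assumed (the default).
t_student, p_student = stats.ttest_ind(girls, boys)

# Welch's t-test: each group keeps its own variance estimate.
t_welch, p_welch = stats.ttest_ind(girls, boys, equal_var=False)

# Welch's statistic by hand: difference of means over its standard error.
se = np.sqrt(girls.var(ddof=1) / len(girls) + boys.var(ddof=1) / len(boys))
t_manual = (girls.mean() - boys.mean()) / se

print(f'Student: t = {t_student:.4f}, p = {p_student:.4f}')
print(f'Welch:   t = {t_welch:.4f}, p = {p_welch:.4f}')
print(f'Welch by hand: t = {t_manual:.4f}')
```

On the real data the only change would be passing `equal_var=False` to the existing call.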


In [30]:
from scipy.stats import f

# Sample variances (unbiased, ddof=1)
var1 = np.var(girls_weight, ddof=1)
var2 = np.var(boys_weight, ddof=1)

# F statistic: ratio of the two sample variances
f_stat = var1 / var2

# Degrees of freedom
df1 = len(girls_weight) - 1
df2 = len(boys_weight) - 1

# p-value (two-sided test)
p_value = 2 * min(f.cdf(f_stat, df1, df2), 1 - f.cdf(f_stat, df1, df2))

print(f"F-stat: {f_stat}")
print(f"p-value: {p_value}")

F-stat: 2.1771042882107263
p-value: 0.07526261914285004


In [40]:
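statistic, p_value = stats.levene(girls_weight, boys_weight)
# Print Levene's statistic; the output below was recorded while this line still printed the stale `stat`
print(f'Statistic = {statistic}, P-Value = {p_value}')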

Statistic = -14.625367047410148, P-Value = 0.18508483634639278


In [65]:
f_value, p_value = stats.f_oneway(girls_weight, boys_weight)
print(f'F-value = {f_value}, P-Value = {p_value}')

F-value = 2.319091749812882, P-Value = 0.13528918910545557
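
The one-way ANOVA p-value above coincides with the t-test p-value from the earlier cell, and the F-value is the square of the t statistic: (-1.5228564442562815)² ≈ 2.3191. For two groups the pooled-variance t-test and one-way ANOVA are the same test. A small check on synthetic data (hypothetical samples, not the Babyboom data):

```python
import numpy as np
from scipy import stats

# Two synthetic groups (NOT the Babyboom data).
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=20)
b = rng.normal(0.5, 1.0, size=25)

t_stat, t_p = stats.ttest_ind(a, b)   # pooled-variance t-test
f_stat, f_p = stats.f_oneway(a, b)    # one-way ANOVA with two groups

print(f't^2 = {t_stat**2}, F = {f_stat}')
print(f'p (t-test) = {t_p}, p (ANOVA) = {f_p}')
```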


# Euroweight

In [7]:
# Column names
column_names = [
    "id",
    "weight",
    "batch",
]

# Read the tab-separated file
df = pd.read_csv('euroweightdat.txt', sep='\t', header=None, names=column_names)

In [14]:
batches = df['batch'].unique()

for i in range(len(batches)):
    for j in range(i + 1, len(batches)):
        first_batch = df[df['batch'] == batches[i]]['weight']
        second_batch = df[df['batch'] == batches[j]]['weight']
        stat, pvalue = stats.ttest_ind(first_batch, second_batch)
        print(f'Batches {batches[i]} and {batches[j]} T Statistics = {stat}, P Value = {pvalue}')

Batches 1 and 2 T Statistics = -1.1241810509353891, P Value = 0.26147780679017935
Batches 1 and 3 T Statistics = 3.1644998631182344, P Value = 0.001648507542831034
Batches 1 and 4 T Statistics = -4.0016907491459435, P Value = 7.245319251258257e-05
Batches 1 and 5 T Statistics = -4.0914574540531685, P Value = 4.999384916652405e-05
Batches 1 and 6 T Statistics = 1.4565887588620556, P Value = 0.14586011949455843
Batches 1 and 7 T Statistics = -1.115153466577227, P Value = 0.2653225064981139
Batches 1 and 8 T Statistics = 0.9226820370831669, P Value = 0.3566197335516895
Batches 2 and 3 T Statistics = 4.199462222751052, P Value = 3.1699994833517745e-05
Batches 2 and 4 T Statistics = -2.7223104844542156, P Value = 0.0067100099181974905
Batches 2 and 5 T Statistics = -2.814325820935377, P Value = 0.005081426143454851
Batches 2 and 6 T Statistics = 2.5714317115838488, P Value = 0.010416963435836278
Batches 2 and 7 T Statistics = 0.04959673381902426, P Value = 0.9604636345373081
Batches 2 and 8
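
Running every pairwise t-test at an unadjusted level inflates the chance of at least one false rejection; with eight batches there would be 28 comparisons. One simple remedy, not used in the original notebook, is the Bonferroni adjustment: multiply each p-value by the number of tests and cap it at 1. A minimal sketch on synthetic groups (the group names and values are hypothetical, not the Euroweight data):

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Three synthetic groups (NOT the Euroweight data).
rng = np.random.default_rng(2)
groups = {
    'A': rng.normal(7.52, 0.03, size=50),
    'B': rng.normal(7.52, 0.03, size=50),
    'C': rng.normal(7.55, 0.03, size=50),
}

# All unordered pairs of group names.
pairs = list(combinations(groups, 2))

# Raw p-values from the pairwise t-tests.
raw = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]

# Bonferroni: scale by the number of comparisons, capped at 1.
m = len(raw)
adjusted = [min(1.0, p * m) for p in raw]

for (a, b), p, p_adj in zip(pairs, raw, adjusted):
    print(f'{a} vs {b}: raw p = {p:.4g}, Bonferroni p = {p_adj:.4g}')
```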

In [20]:
from scipy.stats import f_oneway

samples = [df[df['batch'] == b]['weight'] for b in batches]

f_value, p_value = f_oneway(*samples)

print(f'ALL: F-value = {f_value}, P-Value = {p_value}')

ALL: F-value = 12.67221788627366, P-Value = 5.361761521220593e-16


# Iris

In [69]:
# Column names
column_names = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "class",
]

# Read the comma-separated file
df = pd.read_csv('iris.txt', header=None, names=column_names)

In [70]:
classes = df['class'].unique()

for column in column_names[:4]:
    print(column)
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            k_stat, p_value = stats.kstest(df[df['class'] == classes[i]][column], df[df['class'] == classes[j]][column])
            print(f'Classes {classes[i]} and {classes[j]}, K-Statistics = {k_stat}, P-Value = {p_value}')

sepal_length
Classes Iris-setosa and Iris-versicolor, K-Statistics = 0.78, P-Value = 2.807570962237254e-15
Classes Iris-setosa and Iris-virginica, K-Statistics = 0.92, P-Value = 7.773164323782225e-23
Classes Iris-versicolor and Iris-virginica, K-Statistics = 0.46, P-Value = 3.800827929128319e-05
sepal_width
Classes Iris-setosa and Iris-versicolor, K-Statistics = 0.68, P-Value = 2.6679407140599687e-11
Classes Iris-setosa and Iris-virginica, K-Statistics = 0.5, P-Value = 4.8075337049514946e-06
Classes Iris-versicolor and Iris-virginica, K-Statistics = 0.26, P-Value = 0.06779471096995852
petal_length
Classes Iris-setosa and Iris-versicolor, K-Statistics = 1.0, P-Value = 1.9823306042836678e-29
Classes Iris-setosa and Iris-virginica, K-Statistics = 1.0, P-Value = 1.9823306042836678e-29
Classes Iris-versicolor and Iris-virginica, K-Statistics = 0.86, P-Value = 3.173227767377155e-19
petal_width
Classes Iris-setosa and Iris-versicolor, K-Statistics = 1.0, P-Value = 1.9823306042836678e-29
Class
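
When `stats.kstest` receives two arrays it performs the two-sample Kolmogorov-Smirnov test, the same test as `stats.ks_2samp`. Its statistic is the largest vertical gap between the two empirical distribution functions, which is why fully separated samples (such as petal length for setosa against the other classes above) give exactly 1.0. The helper `ks_statistic` below is a hypothetical illustration on synthetic data, not part of the original notebook.

```python
import numpy as np
from scipy import stats


def ks_statistic(x, y):
    """Largest absolute difference between the two empirical CDFs."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.concatenate([x, y])  # the gap can only change at observed points
    cdf_x = np.searchsorted(x, grid, side='right') / len(x)
    cdf_y = np.searchsorted(y, grid, side='right') / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))


# Synthetic samples (NOT the iris data).
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=50)
y = rng.normal(0.8, 1.0, size=50)

d_manual = ks_statistic(x, y)
d_scipy = stats.ks_2samp(x, y).statistic
print(f'manual D = {d_manual}, scipy D = {d_scipy}')

# Samples that do not overlap at all give the maximal statistic.
d_separated = ks_statistic([1, 2, 3], [4, 5, 6])
print(f'separated samples: D = {d_separated}')
```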

In [71]:
# Two-sided F-test for equality of two variances
def ftest(x1, x2):
    var1 = np.var(x1, ddof=1)
    var2 = np.var(x2, ddof=1)
    
    # F statistic: ratio of the sample variances
    f_stat = var1 / var2
    
    # Degrees of freedom
    df1 = len(x1) - 1
    df2 = len(x2) - 1
    
    # p-value (two-sided test)
    p_value = 2 * min(f.cdf(f_stat, df1, df2), 1 - f.cdf(f_stat, df1, df2))

    return f_stat, p_value
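
A quick sanity check for a two-sided F-test of this form: swapping the two samples inverts the statistic but must leave the p-value unchanged. The sketch below repeats the function so that it runs on its own and uses synthetic samples (hypothetical values, not the iris data).

```python
import numpy as np
from scipy.stats import f


def ftest(x1, x2):
    """Two-sided F-test for equality of variances (same logic as above)."""
    f_stat = np.var(x1, ddof=1) / np.var(x2, ddof=1)
    df1, df2 = len(x1) - 1, len(x2) - 1
    cdf = f.cdf(f_stat, df1, df2)
    return f_stat, 2 * min(cdf, 1 - cdf)


# Synthetic samples with different spreads (NOT the iris data).
rng = np.random.default_rng(4)
x1 = rng.normal(0.0, 1.0, size=30)
x2 = rng.normal(0.0, 1.5, size=40)

f12, p12 = ftest(x1, x2)
f21, p21 = ftest(x2, x1)

print(f'F(x1, x2) = {f12:.4f}, p = {p12:.4f}')
print(f'F(x2, x1) = {f21:.4f}, p = {p21:.4f}')
```

Note that this F-test relies on normality of both samples; the notebook's Babyboom section also tries Levene's test as an alternative.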

In [72]:
classes = df['class'].unique()

for column in column_names[:4]:
    print(column)
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            stat, pvalue = stats.ttest_ind(df[df['class'] == classes[i]][column], df[df['class'] == classes[j]][column])
            print(f'Classes {classes[i]} and {classes[j]} mean, T-Statistics = {stat}, P-Value = {pvalue}')
            #stat, p_value = stats.levene(df[df['class'] == classes[i]][column], df[df['class'] == classes[j]][column])
            stat, p_value = ftest(df[df['class'] == classes[i]][column], df[df['class'] == classes[j]][column])
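            # Print the F-test's `p_value`; the output below was recorded while this line still printed the t-test's `pvalue`
            print(f'Classes {classes[i]} and {classes[j]} variance, Statistics = {stat}, P-Value = {p_value}')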

sepal_length
Classes Iris-setosa and Iris-versicolor mean, T-Statistics = -10.52098626754911, P-Value = 8.985235037487079e-18
Classes Iris-setosa and Iris-versicolor variance, Statistics = 0.4663429131686992, P-Value = 8.985235037487079e-18
Classes Iris-setosa and Iris-virginica mean, T-Statistics = -15.386195820079404, P-Value = 6.892546060674106e-28
Classes Iris-setosa and Iris-virginica variance, Statistics = 0.3072861988209642, P-Value = 6.892546060674106e-28
Classes Iris-versicolor and Iris-virginica mean, T-Statistics = -5.629165259719801, P-Value = 1.7248563024547942e-07
Classes Iris-versicolor and Iris-virginica variance, Statistics = 0.6589275619801338, P-Value = 1.7248563024547942e-07
sepal_width
Classes Iris-setosa and Iris-versicolor mean, T-Statistics = 9.282772555558111, P-Value = 4.362239016010215e-15
Classes Iris-setosa and Iris-versicolor variance, Statistics = 1.4743626943005181, P-Value = 4.362239016010215e-15
Classes Iris-setosa and Iris-virginica mean, T-Statistics

# Surgery

In [3]:
df = pd.read_excel('surgery.xlsx')
df = df.drop(0)
df.columns = ["VRBefore", "VLBefore", "VRAfter", "VLAfter"]

df

Unnamed: 0,VRBefore,VLBefore,VRAfter,VLAfter
1,7.2,6.7,12,13.1
2,1.2,1.2,4.5,4.2
3,6.7,7.3,15.3,14.9
4,9.9,10.05,9.6,9.1
5,3.1,2.13,,
...,...,...,...,...
90,3,5.4,5.09,6.7
91,0.32,0.33,0.8,0.76
92,6.5,5.3,9.7,8.03
93,11.7,8.3,11.7,9.3


In [5]:
def test(k, n, p):
    p_val = stats.binomtest(k, n, p, alternative='less').pvalue
    print(f"P = {p}, Number of successes {k} of {n}, p-value = {p_val}")

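# Success: both V values increased after the surgery.
# Comparisons involving a missing value are False, so incomplete rows count as failures.
suc = (df['VRBefore'] < df['VRAfter']) & (df['VLBefore'] < df['VLAfter'])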
k = suc.sum()
n = len(df)

test(k, n, 0.7)
test(k, n, 0.85)

P = 0.7, Number of successes 69 of 94, p-value = 0.7961639942121717
P = 0.85, Number of successes 69 of 94, p-value = 0.002539960080399346
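
With `alternative='less'` the p-value is the probability of observing at most `k` successes under the hypothesised success probability, that is the binomial CDF at `k`. The sketch below reproduces that from the counts printed above (69 of 94) and adds an exact confidence interval for the success probability; the interval is an addition for illustration, not part of the original analysis.

```python
from scipy import stats

k, n = 69, 94  # counts taken from the output above

# One-sided p-value computed directly as P(X <= k) under each hypothesised p0.
results = {p0: float(stats.binom.cdf(k, n, p0)) for p0 in (0.7, 0.85)}

for p0, p_manual in results.items():
    p_scipy = stats.binomtest(k, n, p0, alternative='less').pvalue
    print(f'p0 = {p0}: binom.cdf = {p_manual}, binomtest = {p_scipy}')

# Exact (Clopper-Pearson) 95% confidence interval for the success probability.
ci = stats.binomtest(k, n).proportion_ci(confidence_level=0.95, method='exact')
print(f'estimate = {k / n:.4f}, 95% CI = ({ci.low:.4f}, {ci.high:.4f})')
```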


In [1]:
69 / 94

0.7340425531914894