In [1]:
import numpy as np
import pandas as pd
from functions import *

df = pd.read_csv("data/data.csv")
df.head()

Unnamed: 0,Sex,Height
0,Male,187
1,Male,187
2,Male,193
3,Female,170
4,Female,165


In [2]:
# difference between average height of men and women
mask = df["Sex"] == "Male"

np.mean(df["Height"][mask]) - np.mean(df["Height"][~mask])

np.float64(15.303716608594641)

Measures of association between a quantitative variable and a binary variable (for example, height and sex, or hand size and sex):
- Cohen's d (difference between the two group means divided by the pooled standard deviation)
    $$ \text{Cohen's d} = \frac{\bar{x}_1 - \bar{x}_2}{\text{sd}_{pooled}}$$
- point-biserial correlation coefficient (the Pearson correlation when one of the variables is coded 0/1)
- probability of superiority (also known as AUC, C-statistic, common language effect size...)
- Somers' D, which for a binary variable equals $2 \cdot \text{AUC} - 1$
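
As a small illustration of the point-biserial coefficient, the sketch below computes it in two ways on invented toy data (not the contents of `data/data.csv`): as a plain Pearson correlation with the binary variable coded 0/1, and from the difference in group means. The variable names are assumptions made for this example.

```python
import numpy as np

# Toy data, invented for illustration only
height = np.array([187, 187, 193, 170, 165, 160], dtype=float)
is_male = np.array([1, 1, 1, 0, 0, 0], dtype=float)

# Point-biserial correlation = Pearson r with the binary variable coded 0/1
r_pb = np.corrcoef(is_male, height)[0, 1]

# Equivalent closed form: (mean_1 - mean_0) / sd * sqrt(p * q),
# where sd is the population standard deviation (ddof=0) of all heights
# and p, q are the proportions of the two groups
p = is_male.mean()
q = 1 - p
mean_diff = height[is_male == 1].mean() - height[is_male == 0].mean()
r_formula = mean_diff / height.std(ddof=0) * np.sqrt(p * q)

print(round(r_pb, 4), round(r_formula, 4))
```

Both numbers agree, which is why the point-biserial coefficient needs no special routine beyond an ordinary correlation.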

In [3]:
# Calculate Cohen's d (positive number)

print(cohens_d(df["Height"], df["Sex"]))

2.2140420469797393
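
The `cohens_d` helper is imported from the local `functions` module, and its source is not shown here. The following is only a sketch of what such a function might do, assuming it uses the pooled standard deviation with $n-1$ denominators and returns the absolute value; the name `cohens_d_sketch` and the toy data are hypothetical.

```python
import numpy as np

def cohens_d_sketch(values, groups):
    """Absolute standardized mean difference between two groups (pooled SD)."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    if len(labels) != 2:
        raise ValueError("expected exactly two groups")
    a = values[groups == labels[0]]
    b = values[groups == labels[1]]
    # Pooled variance: sample variances (ddof=1) weighted by degrees of freedom
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return abs(a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy data, invented for illustration only
d = cohens_d_sketch(
    [187, 187, 193, 170, 165, 160],
    ["Male", "Male", "Male", "Female", "Female", "Female"],
)
print(d)
```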


Cohen's d interpretation:
- Standard Deviation Units: A \(d\) of 1.0 indicates the two group means differ by one standard deviation.
- Small, Medium, Large: According to Cohen (1988), \(d=0.2\) is a small effect, \(d=0.5\) is a medium effect (visible to the naked eye), and \(d=0.8\) is a large effect.
- Further classifications include \(d<0.1\) (tiny), \(0.1\le d<0.2\) (very small), \(0.8\le d<1.2\) (large), \(1.2\le d<2\) (very large), and \(d\ge 2\) (huge).

In our case, \(d \approx 2.21\) falls in the "huge" category (\(d \ge 2\)): sex has a huge effect on height.
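
The cut-offs listed above can be turned into a small lookup. This is a sketch only: the function name `effect_size_label` is hypothetical, and treating "small" as \(0.2\le d<0.5\) and "medium" as \(0.5\le d<0.8\) is an assumption that fills the gaps between the listed thresholds.

```python
from bisect import bisect_right

# Thresholds from the list above; half-open intervals [lower, upper) are assumed
CUTOFFS = [0.1, 0.2, 0.5, 0.8, 1.2, 2.0]
LABELS = ["tiny", "very small", "small", "medium", "large", "very large", "huge"]

def effect_size_label(d):
    """Map |d| to a verbal label using the cut-offs above."""
    return LABELS[bisect_right(CUTOFFS, abs(d))]

print(effect_size_label(2.2140420469797393))  # huge
```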

In [19]:
# Calculate AUC (number greater than 0.5)

from sklearn.metrics import roc_auc_score

# roc_auc_score(y_true, y_score): sex is the binary label, height is the score
is_male = (df["Sex"] == "Male").astype(int)
height = df["Height"]
auc = round(roc_auc_score(is_male, height), 4)
print()
print("AUC for height to predict sex:", auc)


AUC for height to predict sex: 0.9428


AUC can be read as the "probability of superiority": the probability that a randomly chosen observation from one group is larger than a randomly chosen observation from the other group. A value of 0.5 corresponds to random choice (no separation between the groups).

Here AUC = 0.94 means that a randomly chosen male is taller than a randomly chosen female about 94% of the time, so height separates males and females very well.
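
To make the "probability of superiority" reading concrete, the sketch below counts pairs directly on invented toy data: every (male, female) pair scores 1 if the male is taller, 0.5 for a tie, and 0 otherwise. The function name and data are hypothetical; on real data this pairwise proportion should match the AUC computed above, and Somers' D follows as $2 \cdot \text{AUC} - 1$.

```python
from itertools import product

def probability_of_superiority(group_a, group_b):
    """Share of (a, b) pairs with a > b, counting ties as half a win."""
    wins = 0.0
    for a, b in product(group_a, group_b):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / (len(group_a) * len(group_b))

# Toy data, invented for illustration only
males = [187, 187, 193, 170]
females = [170, 165, 175, 160]

ps = probability_of_superiority(males, females)
somers_d = 2 * ps - 1
print(ps, somers_d)  # 0.90625 0.8125
```

The double loop is quadratic in the group sizes, so it is meant for explanation rather than for large data sets.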