# Q2 part 2: Is the number of domains a student is interested in related to their comfort level (1–5) in programming, math, or statistics?


In [65]:
import pandas as pd
import numpy as np

#Combined data from data-cleaning.py
combined = pd.read_csv('data/combined.csv')
#print(combined.head())    

In [66]:
#Create a new column for the number of domain interests for each student (each row)

domain_cols = [
    "Biology", "Chemistry", "Ecology", "Economics / Accounting", "Entertainment",
    "Environmental science", "Music & Audio", "Neuroscience", "Psychology", "Public health", 
    "Social or political science", "Software development", "Technology", "media/musical technology"
]
combined["domain_count"] = combined[domain_cols].sum(axis=1).astype(int)

#The domain counts range from 2 to 8
print("domain_count range:", combined["domain_count"].min(), "to", combined["domain_count"].max())

domain_count range: 2 to 8


### An ordered logistic regression model is fit separately for each comfort outcome (programming, math, statistics)

Hypothesis testing: 

Null Hypothesis H_0: beta_domain_count = 0. The number of domains a student is interested in is not related to comfort; the odds ratio equals 1.

Alternative Hypothesis H_a: beta_domain_count != 0. The number of domains is related to comfort (either higher or lower); the odds ratio is not equal to 1.

We ran proportional-odds ordinal logistic regressions to test the association between the number of domains a student is interested in and comfort (1–5) in programming, math, and statistics.

We modeled comfort (1–5) ~ domain_count, estimating the odds ratio per +1 domain and using a two-sided Wald test on beta_domain_count.
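
For reference, the model behind this test can be sketched as follows. This assumes the usual latent-variable parameterization of the proportional-odds model; the threshold symbols $\alpha_k$ and the predictor symbol $x$ (domain_count) are notation introduced here, not names from the code.

$$
\operatorname{logit} P(Y \le k \mid x) = \alpha_k - \beta_{\text{domain\_count}}\, x, \qquad k = 1, \dots, 4
$$

Under that assumption, the same coefficient applies at every threshold $k$, so a +1 increase in domain_count multiplies the odds of being above any given comfort level by a single odds ratio:

$$
\frac{\operatorname{odds}(Y > k \mid x + 1)}{\operatorname{odds}(Y > k \mid x)} = e^{\beta_{\text{domain\_count}}}
$$

so $H_0: \beta_{\text{domain\_count}} = 0$ is the same statement as an odds ratio of 1.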

In [67]:
import numpy as np, pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel
from statsmodels.stats.multitest import multipletests

#Columns for comfort levels in programming, math, and statistics
comfort_cols = ["prog.comf", "math.comf", "stat.comf"]

rows = []
for y in comfort_cols:

    #Ordinal logistic: comfort (1–5) ~ domain_count
    mod = OrderedModel(combined[y], combined[["domain_count"]], distr="logit")
    res = mod.fit(method="bfgs", disp=False)

    #BFGS is the optimizer used to maximize the ordered-logit likelihood for comfort ~ domain_count.
    #The fitted coefficient on domain_count (b) is exponentiated below to get the odds ratio (OR).

    b  = res.params["domain_count"] #coefficient for domain_count. On log-odds scale
    se = res.bse["domain_count"] #standard error
    odds_ratio = float(np.exp(b)) #converts b to an odds ratio by exponentiating
    lower = float(np.exp(b - 1.96*se)) #95% CI lower bound
    upper = float(np.exp(b + 1.96*se)) #95% CI upper bound
    #The p-value testing H0: odds_ratio = 1 (no association).
    p_value  = float(res.pvalues["domain_count"])

    rows.append({"outcome": y, "n": len(combined), "odds_ratio": odds_ratio, "lower": lower, "upper": upper, "p_value": p_value})

out = pd.DataFrame(rows)

#Benjamini–Hochberg to adjust the p-values for multiple testing (across the 3 tests)
out["q_value"] = multipletests(out["p_value"], method="fdr_bh")[1]
print(out.round(4))

     outcome   n  odds_ratio   lower   upper  p_value  q_value
0  prog.comf  49      0.8878  0.6186  1.2743   0.5188   0.7781
1  math.comf  49      0.9822  0.7025  1.3733   0.9164   0.9164
2  stat.comf  49      0.8989  0.6509  1.2414   0.5176   0.7781


After running the test we obtain the odds_ratio, the lower and upper bound for the 95% confidence interval, the p_value, and the q_value. 

The odds ratio quantifies how the odds of being in a higher comfort category are associated with a +1 increase in domain count.

- Odds ratio > 1: an increase in domains is associated with higher odds of being in a higher comfort category
- Odds ratio ≈ 1: little/no relationship
- Odds ratio < 1: an increase in domains is associated with lower odds of being in a higher comfort category
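
Because the model is linear on the log-odds scale, a per-domain odds ratio compounds multiplicatively over larger differences in domain count. The sketch below illustrates this with the programming point estimate from the results table and the observed domain_count range (2 to 8). It uses the point estimate only and ignores its uncertainty, so it is an illustration of the scale, not a finding.

```python
# Point estimate for prog.comf, copied from the printed results table.
or_per_domain = 0.8878

# Observed range of domain_count reported earlier in the notebook.
min_domains, max_domains = 2, 8
k = max_domains - min_domains

# Under proportional odds, the odds ratio for +k domains is OR ** k.
or_across_range = or_per_domain ** k

print(f"OR per +1 domain: {or_per_domain:.4f}")
print(f"Implied OR for +{k} domains: {or_across_range:.4f}")
```

An odds ratio that looks close to 1 per domain can therefore correspond to a larger implied difference between students at the two ends of the range, which is one reason to read it together with its confidence interval.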

From our results, the odds_ratio estimates for programming (0.8878), math (0.9822), and statistics (0.8989) are all close to 1. The programming and statistics estimates are below 1, which would point toward slightly lower odds of higher comfort as the number of domain interests increases, but the point estimates alone do not establish that direction.

The 95% confidence intervals for all three subject areas include 1, so the data are consistent with no association.
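
The link between the confidence intervals and the Wald p_values can be checked by hand. The sketch below works backwards from the rounded table values: it recovers the coefficient and standard error from each odds ratio and its 95% CI, then recomputes the two-sided p_value from a standard normal. The helper name `wald_from_ci` is hypothetical, and because the inputs are rounded to four decimals the recovered p_values only approximately match the table.

```python
import math

# (odds_ratio, lower, upper) copied from the printed results table.
results = {
    "prog.comf": (0.8878, 0.6186, 1.2743),
    "math.comf": (0.9822, 0.7025, 1.3733),
    "stat.comf": (0.8989, 0.6509, 1.2414),
}

def wald_from_ci(odds_ratio, lower, upper, z_crit=1.96):
    """Recover b, se, z, and the two-sided Wald p-value from an OR and its CI."""
    b = math.log(odds_ratio)                                  # log-odds coefficient
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)   # CI width on the log scale
    z = b / se                                                # Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2))                      # two-sided normal p-value
    return b, se, z, p

recovered = {name: wald_from_ci(*vals) for name, vals in results.items()}
for name, (b, se, z, p) in recovered.items():
    print(f"{name}: b={b:.4f}, se={se:.4f}, z={z:.3f}, p={p:.4f}")
```

A 95% CI that contains 1 on the odds-ratio scale corresponds to |z| < 1.96, and therefore to a p_value above 0.05, which is what the table shows for all three outcomes.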

The p_values were also adjusted using Benjamini–Hochberg to control the false discovery rate (FDR), the expected proportion of false positives among the results declared significant; the adjusted values are reported as q_values. The q_values for programming, math, and statistics were 0.7781, 0.9164, and 0.7781, respectively. All are well above the alpha level of 0.05, so we do not detect a statistically significant association between the number of domains and comfort level in this sample of n=49.
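
To make the adjustment concrete, here is a minimal plain-Python sketch of the Benjamini–Hochberg step-up procedure applied to the three rounded p_values from the table. The function `benjamini_hochberg` is written here for illustration and is not the statsmodels implementation; with rounded inputs its output can differ from the table in the fourth decimal.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# p_values for prog.comf, math.comf, stat.comf from the results table.
p_values = [0.5188, 0.9164, 0.5176]
q_values = benjamini_hochberg(p_values)
print([round(q, 4) for q in q_values])
```

The smallest p_value (statistics) would be scaled to 0.5176 × 3 = 1.5528, so the running minimum caps it at the programming value 0.5188 × 3 / 2. That is why programming and statistics share the same q_value.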

## Conclusions

The results show that the odds_ratios for programming, math, and statistics were all near 1 (programming and statistics slightly below 1). However, all 95% CIs included 1 and all q_values were greater than 0.05, so we do not detect a statistically significant association between domain count and comfort in any subject for this sample (n=49).

Overall, we do not find statistically significant evidence that the number of domains a student is interested in is associated with comfort in programming, math, or statistics.