# CS 685/785: Coding Assignment 1

**Spring 2025&mdash;Prof. Brandon Oubre**

In this assignment, you will gain familiarity with the central limit theorem, data visualization and hypothesis testing.

This assignment is worth a total of 80 points for CS 685 students. (40 points for CS 785 students.)

## Academic Integrity Declaration

**I declare that:**
- **I have completed this assignment entirely on my own.**
- **I understand and have complied with the course policy on the use of AI tools.**
- **I have read the UAB Academic Integrity Code and understand that any breach of this code may result in severe penalties, including failure of the class.**

Name: <u>*__You should edit this markdown cell (double-click) to include your name here as an acknowledgement of the academic integrity declaration.__*</u>

## Reminder on the Use of AI Tools
For coding assignments, you are permitted limited use of AI tools within the bounds of the policy in the syllabus. In broad strokes, you must:
- Include the prompt you used to generate the code
- Include the original code resulting from the prompt
- Include a citation to the tool used
- Be able to explain any code submitted as part of this assignment
  
<u>You have the ultimate responsibility for the correctness and clarity of any code submitted as part of this assignment.</u> You should therefore understand it, test it, debug it, revise/improve it, and document it.

<u>**You should not use AI to respond to the written prompts asking you to analyze or interpret your results.**</u>

## Imports

Do not import additional libraries unless specified by the assignment prompts. When permitted, those imports should not be added here but rather be included as part of your solution.

In [None]:
import os
from tqdm.auto import tqdm

import numpy as np
import pandas as pd
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns

## Loading the Data

The data is assumed to be stored in a `data/` folder in the same directory as this notebook.

In [None]:
data_dir = os.path.abspath('data/')
print(f'The data directory is: {data_dir}')

In [None]:
file_paths = [
    os.path.join(data_dir, f) for f in os.listdir(data_dir)
    if os.path.splitext(f)[1].lower() == '.xpt'
]
data_files = [pd.read_sas(f) for f in tqdm(file_paths, desc='Loading Data')]

# Merge all the data files into a single DataFrame 
data = data_files[0]
for df in tqdm(data_files[1:], desc='Merging Data'):
    data = pd.merge(data, df, how='outer', on='SEQN', validate='1:1')  # Outer join is important; do not want to drop data missing in one file but not others

print('The data frame has {} rows and {} columns'.format(*data.shape))

if data['SEQN'].duplicated().sum() > 0:
    print('WARNING: The data has duplicated identifiers. Something is wrong.')
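To see why the outer join matters, here is a minimal sketch on two toy frames. The frames `demo` and `body` and all of their values are made up for illustration; only the `pd.merge` arguments mirror the cell above.

```python
import pandas as pd

# Toy stand-ins for two NHANES files; the values are made up for illustration.
demo = pd.DataFrame({'SEQN': [1, 2, 3], 'RIDAGEYR': [34, 51, 8]})
body = pd.DataFrame({'SEQN': [2, 3, 4], 'BMXHT': [172.5, 128.0, 160.2]})

# Outer join keeps every SEQN seen in either frame; gaps become NaN.
merged = pd.merge(demo, body, how='outer', on='SEQN', validate='1:1')
print(merged)

# An inner join, by contrast, keeps only identifiers present in both frames.
inner = pd.merge(demo, body, how='inner', on='SEQN', validate='1:1')
print(len(merged), len(inner))
```

The `validate='1:1'` argument makes pandas raise an error if `SEQN` is repeated in either frame, which complements the duplicate check above.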

## Data Set Documentation

You may find the following documentation links helpful for interpreting the subset of NHANES data used in this assignment:

- [Demographic Variables and Sample Weights (DEMO_L)](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DEMO_L.htm)
- [Body Measures (BMX_L)](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/BMX_L.htm)
- [Cholesterol – High-Density Lipoprotein (HDL_L)](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/HDL_L.htm)
- [Blood Pressure - Oscillometric Measurements (BPXO_L)](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/BPXO_L.htm)

## Problem 1

In this problem, we will empirically observe the Central Limit Theorem.

First, we create an `hdl_cholesterol` array that contains the amount of HDL cholesterol (mg/dL) measured in participants' blood during laboratory tests.

In [None]:
hdl_cholesterol = data['LBDHDD'].dropna().to_numpy()

### Problem 1.1 [10 points]

Let $X$ be a random variable that takes on the values in `hdl_cholesterol` with equal probability. We will create a new random variable $Y_n=\frac{1}{n}\sum_{i=1}^{n}x_i$ that represents the mean of $n$ samples $x_i$ drawn from $X$.

Create a function (call it `hdl_means`) that takes an argument `n` and returns an array containing **10,000 samples drawn from $Y_n$**.

*Hint: You may use `np.random.choice`. You should think about how to avoid explicitly looping in Python if you want your code to run quickly.*
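As a sketch of what the hint means by avoiding explicit loops, `np.random.choice` accepts a tuple for `size`, so a whole table of draws can be produced in one call. The array `toy_values` and the sizes below are assumptions chosen for readability, not part of the assignment.

```python
import numpy as np

# Toy population; in the assignment this role is played by `hdl_cholesterol`.
toy_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

n_rows, n = 5, 3  # small sizes chosen only for readability

# One call draws an (n_rows, n) array of values, sampled with replacement.
draws = np.random.choice(toy_values, size=(n_rows, n), replace=True)

# Averaging along axis=1 yields one mean per row, with no Python loop.
row_means = draws.mean(axis=1)
print(draws.shape, row_means.shape)
```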

### Problem 1.2 [10 points]
Create a plot with three subplots on a single row. Each plot should contain a histogram that shows the approximate distribution of $Y_n$ (i.e., the result of your `hdl_means` function).
The leftmost plot should display $Y_{10}$, the middle plot should display $Y_{30}$, and the right plot should display $Y_{100}$.
In each histogram, plot a vertical black line at the mean using `axvline`.

Ensure that the histograms are directly comparable (e.g., **same** bins, normalized heights, same y axes). Remember to label your axes. A bin width of 1 should show sufficient detail to compare the distributions.
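One way to make histograms directly comparable is to compute a single set of bin edges and reuse it for every sample. The sketch below uses `np.histogram` on synthetic data (the arrays `narrow` and `wide` are hypothetical) so that the normalization can be checked numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic samples with different spreads (not assignment data).
narrow = rng.normal(loc=50.0, scale=2.0, size=1000)
wide = rng.normal(loc=50.0, scale=5.0, size=1000)

# Build one set of bin edges of width 1 that covers both samples.
low = np.floor(min(narrow.min(), wide.min()))
high = np.ceil(max(narrow.max(), wide.max()))
bin_edges = np.arange(low, high + 1, 1.0)

# density=True normalizes each histogram so that its total area is 1.
narrow_density, _ = np.histogram(narrow, bins=bin_edges, density=True)
wide_density, _ = np.histogram(wide, bins=bin_edges, density=True)
print(len(bin_edges), narrow_density.sum(), wide_density.sum())
```

The same `bin_edges` array can be passed as the `bins` argument of a plotting call such as `plt.hist` (with `density=True`), so that every subplot uses identical bins.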

### Problem 1.3 [8 points]
In written language (use a Markdown cell), explain how the Central Limit Theorem predicts what you observed above. (Be sure to consider the distribution, the mean, and the standard deviation.)

**NO AI USE ALLOWED FOR THIS QUESTION**

Write your answer here

### Problem 1.4 [6 points]
For each of $Y_{10}$, $Y_{30}$, and $Y_{100}$, compute the theoretical mean and standard deviation using the Central Limit Theorem. (You may use the empirical value of the mean and standard deviation of $X$, i.e., `hdl_cholesterol`. NumPy has functions for this.)

Then, compute the absolute difference between each theoretical value and the empirically obtained value (shown by your histograms). Print out each absolute difference in scientific notation. (If your difference is `diff`, then you can format it with the f-string `f'{diff:.2e}'`.) Be sure to specify the parameter and distribution corresponding to each difference.

*Hint: You should report six differences, as you are considering two parameters for three distributions.*
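For reference, the Central Limit Theorem says that the mean of $n$ independent draws from a population with mean $\mu$ and standard deviation $\sigma$ is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The sketch below checks this on a synthetic exponential population; the population, the choice of $n=25$, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic population (an assumption for illustration, not NHANES data).
population = rng.exponential(scale=10.0, size=5000)

mu = population.mean()
sigma = population.std()  # population standard deviation (ddof=0)

n = 25
# CLT: the mean of n draws is approximately Normal(mu, sigma / sqrt(n)).
theoretical_std = sigma / np.sqrt(n)

# Draw 10,000 sample means, each the average of n values with replacement.
sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
mean_diff = abs(sample_means.mean() - mu)
std_diff = abs(sample_means.std() - theoretical_std)
print(f'n={n} mean: {mean_diff:.2e}  std: {std_diff:.2e}')
```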

## Problem 2

In this problem, we will compare the heights of men and women.

To start, we will curate a subset of the data for your use.
The `male` and `female` arrays contain Boolean flags indicating the sex of each participant. The `height` array contains the height of each participant in centimeters.
The $i^{th}$ entry in each array corresponds to the $i^{th}$ participant.

In [None]:
prompt2_subset = data[['RIAGENDR', 'RIDAGEYR', 'BMXHT']].dropna()
prompt2_subset = prompt2_subset.loc[prompt2_subset['RIDAGEYR'] >= 18]  # Remove pediatric data

male = (prompt2_subset['RIAGENDR'] == 1).to_numpy()
female = (prompt2_subset['RIAGENDR'] == 2).to_numpy()
height = prompt2_subset['BMXHT'].to_numpy()
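Because the arrays are aligned entry by entry, a Boolean array can be used directly as an index into a value array. A minimal sketch with made-up values (`toy_height` and `toy_male` are hypothetical names, not assignment variables):

```python
import numpy as np

# Made-up values for four hypothetical participants.
toy_height = np.array([180.0, 165.0, 172.0, 158.0])
toy_male = np.array([True, False, True, False])

# A Boolean array used as an index keeps the entries where the flag is True.
print(toy_height[toy_male])

# The ~ operator inverts the mask, selecting the remaining entries.
print(toy_height[~toy_male])
```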

### Problem 2.1 [10 points]

Create a single plot with no subplots containing two box plots. One box plot should show the heights for men. The other should show the heights for women.

Remember to label the axis corresponding to participant height. The text specifying which group each box plot corresponds to (i.e., the tick labels) should read "Male" and "Female".

### Problem 2.2 [6 points]
You should see a difference in the above plot. Let us now test the hypothesis that a difference actually exists. Use `scipy.stats.ttest_ind` with `equal_var=False` to perform a Welch's $t$-test.
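As a sketch of the call on synthetic data (the groups `group_a` and `group_b` are assumptions, not the assignment arrays), `ttest_ind` returns a result object with `statistic` and `pvalue` attributes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic groups with different means and variances (not assignment data).
group_a = rng.normal(loc=10.0, scale=1.0, size=200)
group_b = rng.normal(loc=12.0, scale=3.0, size=150)

# equal_var=False requests Welch's t-test instead of Student's t-test.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f't = {result.statistic:.3f}, p = {result.pvalue:.3e}')
```

Welch's test does not assume that the two groups share a variance, which is why it is a reasonable default when group sizes and spreads differ.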

### Problem 2.3 [6 points]
Assume $\alpha=0.01$. In written language (use a Markdown cell), interpret this result in terms of the null hypothesis and explain your reasoning. What do you ultimately conclude?

**NO AI USE ALLOWED FOR THIS QUESTION**

Write your answer here

## Problem 3
In this problem, we will investigate whether there is an association between blood pressure and age.

Let us again curate a subset of data for your use. `systolic` and `diastolic` are systolic (during heartbeats) and diastolic (between heartbeats) pressures, respectively. `age` is, of course, participant age.

In [None]:
prompt3_subset = data[['BPXOSY1', 'BPXODI1', 'RIDAGEYR']].dropna()
systolic = prompt3_subset['BPXOSY1'].to_numpy()
diastolic = prompt3_subset['BPXODI1'].to_numpy()
age = prompt3_subset['RIDAGEYR'].to_numpy()

### Problem 3.1 [10 points]
Create a plot with three subplots in a single row. Each subplot should be a scatter plot between two of the three variables being investigated (systolic pressure, diastolic pressure, and age). Remember to label your axes.

### Problem 3.2 [6 points]
Use `scipy.stats.pearsonr` to compute the correlation between each pair of variables.
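A minimal sketch of the call on synthetic data, assuming hypothetical arrays `x`, `y`, and `z`: the result unpacks into the correlation coefficient and a two-sided p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# y depends linearly on x plus noise; z is generated independently of x.
y = 2.0 * x + rng.normal(scale=0.5, size=500)
z = rng.normal(size=500)

r_xy, p_xy = stats.pearsonr(x, y)
r_xz, p_xz = stats.pearsonr(x, z)
print(f'x vs y: r = {r_xy:.3f}, p = {p_xy:.2e}')
print(f'x vs z: r = {r_xz:.3f}, p = {p_xz:.2e}')
```

Keep in mind that the Pearson coefficient measures only linear association, so a scatter plot remains a useful companion to the number.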

### Problem 3.3 [8 points]
Interpret these correlations in writing (use a Markdown cell). Are there actually associations between each pair of variables? Looking at your scatter plots, explain any differences in correlation observed between systolic pressure and age and between diastolic pressure and age.

**NO AI USE ALLOWED FOR THIS QUESTION**

Write your answer here