# Data Manipulation

## Primary Dataset

### Menstrual Cycle Regularity

Menstrual cycle regularity over the past 12 months was assessed with a single question asking participants whether they had regular periods (RHQ031). Participants who answered yes were classified as having regular cycles. Participants who answered no were classified as having irregular cycles, provided that the lack of regular periods was not explained by one of the medical or physiological reasons listed below.

**Participants were excluded from the irregularity classification if they had conditions or circumstances that naturally prevent menstruation, including: pregnancy, postpartum or breastfeeding-related amenorrhea, hysterectomy, or menopause.** This approach ensures that the measure of cycle irregularity reflects only those individuals for whom menstrual cycles would be expected under normal conditions.
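The classification rule can be sketched on a small synthetic table. The rows are invented for illustration, the helper name `classify_regularity` is hypothetical, and the reason code 9 simply stands in for any reason that is not on the exclusion list. The codes for RHQ031 (1 = regular, 2 = not regular) and the excluded RHD043 reasons (1, 2, 3, 7) follow the cells in this notebook.

```python
import numpy as np
import pandas as pd

# RHD043 reason codes excluded in this notebook:
# pregnancy (1), breastfeeding (2), hysterectomy (3), menopause (7)
EXCLUDED_REASONS = [1, 2, 3, 7]

def classify_regularity(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: label cycles as regular or not, then apply exclusions."""
    out = df.copy()
    # Any value other than 1 or 2 becomes NaN
    out["regular_mapped"] = out["RHQ031"].map({1: "Yes", 2: "No"})
    # Drop participants with no usable answer
    out = out.dropna(subset=["regular_mapped"])
    # Drop participants whose reason is on the exclusion list
    out = out[~out["RHD043"].isin(EXCLUDED_REASONS)]
    return out[["SEQN", "regular_mapped"]].reset_index(drop=True)

# Invented rows, not real NHANES records
toy = pd.DataFrame({
    "SEQN":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "RHQ031": [1.0, 2.0, 2.0, np.nan, 2.0],
    "RHD043": [np.nan, 7.0, 9.0, np.nan, 1.0],
})
result = classify_regularity(toy)
print(result)
```

In this toy table only participants 1 (regular) and 3 (irregular, reason not excluded) remain; participant 4 has no answer, and participants 2 and 5 report an excluded reason.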

In [2]:
import pandas as pd

# Load the primary dataset (reproductive health questionnaire) XPT files, one per survey cycle
cycle_17_to_pre20 = pd.read_sas("data/P_RHQ.XPT")
cycle_21_to_23 = pd.read_sas("data/RHQ_L.XPT")

# See what’s inside
print(cycle_17_to_pre20.shape)
print(cycle_21_to_23.shape)

(5314, 32)
(3917, 13)


In [3]:
# Find overlapping columns
common_cols = cycle_17_to_pre20.columns.intersection(cycle_21_to_23.columns)

# Keep only overlapping columns
cycle = pd.concat([cycle_17_to_pre20[common_cols], cycle_21_to_23[common_cols]], axis=0, ignore_index=True)
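To illustrate what the column intersection keeps, here is a toy sketch with two invented frames. The `source_cycle` column is a hypothetical addition, not part of the notebook, that would let each row be traced back to its survey cycle after stacking.

```python
import pandas as pd

# Two invented frames standing in for the two survey cycles
older = pd.DataFrame({"SEQN": [1.0, 2.0], "RHQ031": [1.0, 2.0], "OLD_ONLY": [5.0, 6.0]})
newer = pd.DataFrame({"SEQN": [3.0], "RHQ031": [1.0], "NEW_ONLY": [9.0]})

# Columns present in both frames
common_cols = older.columns.intersection(newer.columns)

# Stack the shared columns; the assigned label records each row's origin
stacked = pd.concat(
    [
        older[common_cols].assign(source_cycle="older"),
        newer[common_cols].assign(source_cycle="newer"),
    ],
    axis=0,
    ignore_index=True,
)
print(list(common_cols))
print(stacked)
```

Columns that exist in only one cycle (`OLD_ONLY`, `NEW_ONLY` here) are dropped, so any variable needed later must be present in both files.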

In [4]:
cycle

Unnamed: 0,SEQN,RHQ010,RHQ031,RHD043,RHQ060,RHQ078,RHQ131,RHD143,RHD167,RHQ200,RHD280,RHQ305,RHQ332
0,109264.0,12.0,1.0,,,,,,,,,,
1,109266.0,13.0,1.0,,,2.0,2.0,,,,2.0,2.0,
2,109277.0,11.0,1.0,,,,,,,,,,
3,109279.0,12.0,1.0,,,,,,,,,,
4,109284.0,9.0,1.0,,,2.0,1.0,2.0,4.0,,2.0,2.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9226,142301.0,12.0,2.0,7.0,52.0,,,,,,2.0,2.0,
9227,142303.0,13.0,2.0,3.0,38.0,,,,,,1.0,1.0,40.0
9228,142305.0,,,,,,,,,,,,
9229,142307.0,12.0,2.0,3.0,38.0,2.0,1.0,,3.0,,1.0,2.0,


In [5]:
# Map values (others become NaN automatically)
cycle["regular_mapped"] = cycle["RHQ031"].map({1: "Yes", 2: "No"})

In [6]:
cycle["regular_mapped"].isna().sum()

1018

In [7]:
len(cycle["regular_mapped"])

9231

In [8]:
# Drop rows where the regular-period indicator (RHQ031) is missing
cycle.dropna(subset=["regular_mapped"], inplace=True)

In [9]:
# Drop rows where the reason for not having regular periods (RHD043) is
# pregnancy (1), breastfeeding (2), hysterectomy (3), or menopause/change of life (7)
cycle = cycle[~cycle['RHD043'].isin([1, 2, 3, 7])]

In [10]:
cycle['regular_mapped'].value_counts()

regular_mapped
Yes    4063
No      396
Name: count, dtype: int64

In [11]:
cycle = cycle[['SEQN','regular_mapped']]

## Secondary Dataset
- Body Measures (height, weight, BMI) - Ying
- Smoking (smoking status, frequency, history) - Miranda
- Physical Activity (activity type, frequency, duration, intensity) - Tanisha
- Demographics (age, race/ethnicity, education, income) - Ying
- Dietary Data (diet quality, caloric and nutrient intake) - Miranda
- Sleep Disorders (sleep quality, trouble sleeping, sleep duration) - Tanisha

#### Demographics (age, race/ethnicity, education, income)

From the Demographic Dataset, we preserve variables that capture key background characteristics of participants. Specifically, we keep:

- Gender (RIAGENDR)

- Age (RIDAGEYR)

- Race/Ethnicity (RIDRETH3)

- Education (DMDEDUC2) – available only for respondents aged 20 years and older. For younger participants, values are coded as missing.

- Income (INDFMPIR) – the ratio of family income to the poverty threshold. It is the income measure shared by the two demographic files combined here. A value of 1.0 indicates income exactly at the poverty line, values below 1.0 indicate income below it, and values above 1.0 indicate income above it. Values of 5.0 or more are top-coded to 5.0.

During cleaning, we:

- Recode categorical variables (e.g., gender, race, and education) into readable labels.

- Preserve income-to-poverty ratio as a continuous measure that can be further categorized if needed (e.g., below poverty, low income, middle income, high income).
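If the ratio is categorized later, `pd.cut` is one way to do it. The sketch below uses invented ratios, and the cut points (1.0, 2.0, 4.0) are illustrative assumptions rather than NHANES definitions; they would need to be chosen and justified for the analysis.

```python
import numpy as np
import pandas as pd

# Invented ratios; 5.0 is the top-coded maximum
pir = pd.Series(
    [0.83, 1.0, 1.99, 2.0, 5.0, np.nan],
    name="Ratio_of_family_income_to_poverty",
)

# Assumed cut points (illustrative only): [0, 1), [1, 2), [2, 4), [4, inf)
bins = [0.0, 1.0, 2.0, 4.0, np.inf]
labels = ["Below poverty", "Low income", "Middle income", "High income"]

# right=False makes each interval closed on the left, so 1.0 counts as "Low income"
income_group = pd.cut(pir, bins=bins, labels=labels, right=False)
print(income_group.tolist())
```

Missing ratios stay missing after binning, so the number of participants without income data is unchanged.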

In [12]:
# Load demographics dataset XPT files
demo_17_to_pre20 = pd.read_sas("data/P_DEMO.XPT")
demo_21_to_23 = pd.read_sas("data/DEMO_L.XPT")

In [13]:
# Find overlapping columns
common_cols = demo_17_to_pre20.columns.intersection(demo_21_to_23.columns)

# Keep only overlapping columns
demo = pd.concat([demo_17_to_pre20[common_cols], demo_21_to_23[common_cols]], axis=0, ignore_index=True)

In [14]:
# Rename variables we will keep
demo.rename(columns={
    "RIAGENDR": "Gender",
    "RIDRETH3": "Race",
    "RIDAGEYR": "Age",
    "DMDEDUC2": "Education",
    "INDFMPIR": "Ratio_of_family_income_to_poverty"
}, inplace=True)


In [15]:
# Map categorical values
demo["Gender"] = demo["Gender"].map({1: "Male", 2: "Female"})
demo["Race"] = demo["Race"].map({1: "Mexican American", 
                                  2: "Other Hispanic",
                                  3: "Non-Hispanic White", 
                                  4: "Non-Hispanic Black",
                                  6: "Non-Hispanic Asian",
                                  7: "Other Race - Including Multi-Racial"})
demo["Education"] = demo["Education"].map({1: "Less than 9th grade",
                                           2: "9-11th grade",
                                           3: "High school/GED",
                                           4: "Some college/AA",
                                           5: "College graduate or above"})

In [17]:
# Keep only relevant variables
demo = demo[["SEQN","Gender", "Race","Age","Education","Ratio_of_family_income_to_poverty"]]

# Merge cycle and demo on SEQN (common ID)
merged = pd.merge(cycle, demo, on='SEQN', how='left')
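A left merge keeps every row of `cycle`, but a repeated `SEQN` on either side would silently duplicate participants. As a minimal sketch on invented data, the `validate` and `indicator` options of `pd.merge` can guard against that and count participants with no demographic record. The frame names ending in `_toy` are hypothetical.

```python
import pandas as pd

# Invented rows, not real NHANES records
cycle_toy = pd.DataFrame({"SEQN": [1.0, 2.0, 3.0], "regular_mapped": ["Yes", "No", "Yes"]})
demo_toy = pd.DataFrame({"SEQN": [1.0, 2.0, 4.0], "Age": [13.0, 29.0, 44.0]})

merged_toy = pd.merge(
    cycle_toy,
    demo_toy,
    on="SEQN",
    how="left",
    validate="one_to_one",  # raises MergeError if SEQN repeats on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Participants present in the cycle data but missing from the demographics data
unmatched = int((merged_toy["_merge"] == "left_only").sum())
print(len(merged_toy), unmatched)
```

The row count after the merge should equal the row count of the left frame; here participant 3 has no match and receives a missing `Age`.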

#### Body Measures (height, weight, BMI)

From the Body Measures dataset, we preserve only BMI (BMXBMI), which captures the participant’s body mass relative to height. Height and weight are not retained separately since BMI already summarizes body size. 
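For reference, BMI is weight in kilograms divided by the square of height in metres. The sketch below recomputes it from invented values; the function name `bmi_from_measures` is hypothetical, and it assumes weight is recorded in kilograms and standing height in centimetres.

```python
def bmi_from_measures(weight_kg: float, height_cm: float) -> float:
    """Hypothetical helper: BMI in kg/m^2 from weight (kg) and height (cm)."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

# Invented example: 70 kg at 175 cm
bmi = bmi_from_measures(70.0, 175.0)
print(round(bmi, 1))
```

Because BMI is a deterministic function of the two measurements, keeping all three variables would add redundancy without new information about body size.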

In [18]:
# Load body measures dataset XPT files
bodymeasures_17_to_pre20 = pd.read_sas("data/P_BMX.XPT")
bodymeasures_21_to_23 = pd.read_sas("data/BMX_L.XPT")

In [19]:
# Find overlapping columns
common_cols = bodymeasures_17_to_pre20.columns.intersection(bodymeasures_21_to_23.columns)

# Keep only overlapping columns
bodymeasures = pd.concat([bodymeasures_17_to_pre20[common_cols], bodymeasures_21_to_23[common_cols]], axis=0, ignore_index=True)

In [20]:
# Rename variables we will keep
bodymeasures.rename(columns={
    "BMXBMI": "BMX",
}, inplace=True)

In [21]:
# Keep only relevant variables
bodymeasures = bodymeasures[['SEQN','BMX']]
# Merge the running merged table with bodymeasures on SEQN (common ID)
merged = pd.merge(merged, bodymeasures, on='SEQN', how='left')

In [22]:
merged

Unnamed: 0,SEQN,regular_mapped,Gender,Race,Age,Education,Ratio_of_family_income_to_poverty,BMX
0,109264.0,Yes,Female,Mexican American,13.0,,0.83,17.6
1,109266.0,Yes,Female,Non-Hispanic Asian,29.0,College graduate or above,5.00,37.8
2,109277.0,Yes,Female,Mexican American,12.0,,1.35,18.6
3,109279.0,Yes,Female,Non-Hispanic White,17.0,,1.19,21.0
4,109284.0,Yes,Female,Mexican American,44.0,9-11th grade,,39.1
...,...,...,...,...,...,...,...,...
4454,142263.0,Yes,Female,Non-Hispanic White,44.0,College graduate or above,,22.6
4455,142269.0,No,Female,Non-Hispanic Black,32.0,Some college/AA,0.74,
4456,142280.0,Yes,Female,Other Race - Including Multi-Racial,23.0,Some college/AA,1.40,38.4
4457,142283.0,Yes,Female,Other Race - Including Multi-Racial,29.0,High school/GED,1.04,45.8


#### Smoking (smoking status, frequency, history)

- Which variables did you keep in the dataset, and why?

#### Sleep Disorders
- Which variables did you keep in the sleep disorders dataset to join with the cycle data, and why?

In [1]:
#Clean a dataset with only relevant sleep disorder variables
#Join your cleaned dataset with the cycle dataset