# Data Manipulation

## Primary Dataset

### Menstrual Cycle Regularity

Menstrual cycle regularity over the past 12 months was assessed by asking participants whether they had regular periods (`RHQ031`). Participants who answered yes were classified as having regular cycles. Participants who answered no were classified as having irregular cycles, provided that no notable medical or physiological reason accounted for the absence of regular periods.

**Participants were excluded from the irregularity classification if they had conditions or circumstances that naturally prevent menstruation, including: pregnancy, postpartum or breastfeeding-related amenorrhea, hysterectomy, or menopause.** This approach ensures that the measure of cycle irregularity reflects only those individuals for whom menstrual cycles would be expected under normal conditions.
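
The classification and exclusion rules above can be sketched on a few synthetic rows. This is a minimal illustration, not the notebook's exact pipeline: the helper name `classify_regularity` and the toy values are hypothetical, and the exclusion codes `[1, 2, 3, 7]` are simply the ones used in the filter later in this section. The code `9` in the toy data stands in for any reason that is not excluded.

```python
import numpy as np
import pandas as pd

# Synthetic rows for illustration only; these are not real survey records.
toy = pd.DataFrame({
    "SEQN": [1, 2, 3, 4, 5],
    "RHQ031": [1.0, 2.0, 2.0, 2.0, np.nan],      # 1 = regular, 2 = not regular
    "RHD043": [np.nan, 1.0, 7.0, 9.0, np.nan],   # reported reason, if any
})

# Reason codes treated as exclusions (assumed to match the notebook's filter).
EXCLUDED_REASONS = [1, 2, 3, 7]


def classify_regularity(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: label regularity, then drop excluded reasons."""
    out = df.copy()
    # Codes other than 1 or 2 (including missing) become NaN.
    out["regular_mapped"] = out["RHQ031"].map({1: "Yes", 2: "No"})
    # NaN reasons are not in the excluded list, so those rows are kept.
    keep = ~out["RHD043"].isin(EXCLUDED_REASONS)
    return out.loc[keep, ["SEQN", "regular_mapped"]].reset_index(drop=True)


result = classify_regularity(toy)
print(result)
```

Note that a row with a missing `RHQ031` answer survives the exclusion step with a missing label. Whether to drop such rows before analysis is a separate decision that this sketch does not make.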

In [51]:
import pandas as pd

# Load the primary dataset XPT files (one per survey cycle)
cycle_17_to_pre20 = pd.read_sas("data/P_RHQ.XPT")
cycle_21_to_23 = pd.read_sas("data/RHQ_L.XPT")

# Check the dimensions (rows, columns) of each cycle
print(cycle_17_to_pre20.shape)
print(cycle_21_to_23.shape)

(5314, 32)
(3917, 13)


In [52]:
# Find overlapping columns
common_cols = cycle_17_to_pre20.columns.intersection(cycle_21_to_23.columns)

# Keep only overlapping columns
cycle = pd.concat([cycle_17_to_pre20[common_cols], cycle_21_to_23[common_cols]], axis=0, ignore_index=True)
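
Keeping only the overlapping columns silently discards everything that appears in just one cycle. Given the shapes printed above, the earlier file has 32 columns and the later one 13, so at least 19 columns from the earlier file are not carried forward. A minimal sketch of how one might list the discarded columns and tag each row with its cycle is shown below. The frames, the column names `OLD_ONLY` and `NEW_ONLY`, and the `source` column are hypothetical and exist only for illustration.

```python
import pandas as pd

# Synthetic frames standing in for the two survey cycles (values are made up).
older = pd.DataFrame({"SEQN": [1, 2], "RHQ031": [1.0, 2.0], "OLD_ONLY": [5.0, 6.0]})
newer = pd.DataFrame({"SEQN": [3], "RHQ031": [1.0], "NEW_ONLY": [9.0]})

common_cols = older.columns.intersection(newer.columns)

# Columns that the intersection drops from each cycle.
dropped_older = older.columns.difference(common_cols).tolist()
dropped_newer = newer.columns.difference(common_cols).tolist()
print("Dropped from older cycle:", dropped_older)
print("Dropped from newer cycle:", dropped_newer)

# Stack the shared columns and record which cycle each row came from.
combined = pd.concat(
    [
        older[common_cols].assign(source="older"),
        newer[common_cols].assign(source="newer"),
    ],
    axis=0,
    ignore_index=True,
)
print(combined)
```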

In [53]:
cycle

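|  | SEQN | RHQ010 | RHQ031 | RHD043 | RHQ060 | RHQ078 | RHQ131 | RHD143 | RHD167 | RHQ200 | RHD280 | RHQ305 | RHQ332 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109264.0 | 12.0 | 1.0 |  |  |  |  |  |  |  |  |  |  |
| 1 | 109266.0 | 13.0 | 1.0 |  |  | 2.0 | 2.0 |  |  |  | 2.0 | 2.0 |  |
| 2 | 109277.0 | 11.0 | 1.0 |  |  |  |  |  |  |  |  |  |  |
| 3 | 109279.0 | 12.0 | 1.0 |  |  |  |  |  |  |  |  |  |  |
| 4 | 109284.0 | 9.0 | 1.0 |  |  | 2.0 | 1.0 | 2.0 | 4.0 |  | 2.0 | 2.0 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9226 | 142301.0 | 12.0 | 2.0 | 7.0 | 52.0 |  |  |  |  |  | 2.0 | 2.0 |  |
| 9227 | 142303.0 | 13.0 | 2.0 | 3.0 | 38.0 |  |  |  |  |  | 1.0 | 1.0 | 40.0 |
| 9228 | 142305.0 |  |  |  |  |  |  |  |  |  |  |  |  |
| 9229 | 142307.0 | 12.0 | 2.0 | 3.0 | 38.0 | 2.0 | 1.0 |  | 3.0 |  | 1.0 | 2.0 |  |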


In [13]:
# Map RHQ031 codes to labels (1 -> "Yes", 2 -> "No"); any other code becomes NaN
cycle["regular_mapped"] = cycle["RHQ031"].map({1: "Yes", 2: "No"})

In [15]:
cycle["regular_mapped"].value_counts()

regular_mapped
No     4150
Yes    4063
Name: count, dtype: int64

In [20]:
# Drop rows whose reported reason (RHD043) is one of the excluded codes
# [1, 2, 3, 7], i.e. the exclusion criteria described above
cycle = cycle[~cycle['RHD043'].isin([1, 2, 3, 7])]

In [29]:
# Keep only the participant ID and the regularity label
cycle = cycle[['SEQN','regular_mapped']]

In [38]:
cycle.shape

(5477, 2)

## Secondary Dataset
- Body Measures (height, weight, BMI)
- Smoking (smoking status, frequency, history)
- Physical Activity (activity type, frequency, duration, intensity)
- Demographics (age, race/ethnicity, education, income)
- Dietary Data (diet quality, caloric and nutrient intake)
- Sleep Disorders (sleep quality, trouble sleeping, sleep duration)

#### Body Measures (height, weight, BMI)

In [26]:
# Load the body measures XPT files (one per survey cycle)
bodymeasures_17_to_pre20 = pd.read_sas("data/P_BMX.XPT")
bodymeasures_21_to_23 = pd.read_sas("data/BMX_L.XPT")

In [44]:
# Find overlapping columns
common_cols = bodymeasures_17_to_pre20.columns.intersection(bodymeasures_21_to_23.columns)

# Keep only overlapping columns
bodymeasures = pd.concat([bodymeasures_17_to_pre20[common_cols], bodymeasures_21_to_23[common_cols]], axis=0, ignore_index=True)

In [46]:
# Left-merge bodymeasures onto cycle by SEQN (participant ID), keeping every row of cycle
merged = pd.merge(cycle, bodymeasures, on='SEQN', how='left')
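
A left merge keeps every row of `cycle`, so participants without a body-measures record end up with missing measurements, and a repeated `SEQN` on either side would duplicate rows. The sketch below shows one way to check both conditions using the `validate` and `indicator` options of `pd.merge`. The toy frames and their values are hypothetical, and it is an assumption that `SEQN` is unique within each real file.

```python
import pandas as pd

# Synthetic stand-ins for `cycle` and `bodymeasures` (values are made up).
cycle_toy = pd.DataFrame(
    {"SEQN": [1.0, 2.0, 3.0], "regular_mapped": ["Yes", "No", "Yes"]}
)
bmx_toy = pd.DataFrame({"SEQN": [1.0, 2.0, 4.0], "BMXBMI": [17.6, 37.8, 21.0]})

merged_toy = pd.merge(
    cycle_toy,
    bmx_toy,
    on="SEQN",
    how="left",
    validate="one_to_one",  # raises MergeError if SEQN repeats on either side
    indicator=True,         # adds a `_merge` column: "both" or "left_only"
)

# Participants with no body-measures record keep their row but get NaN measures.
unmatched = merged_toy.loc[merged_toy["_merge"] == "left_only", "SEQN"].tolist()
print(merged_toy)
print("Unmatched SEQN:", unmatched)
```

If the check passes on the real data, the `_merge` column can be dropped before further analysis.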

In [49]:
merged[['SEQN','regular_mapped', 'BMXBMI']]

|  | SEQN | regular_mapped | BMXBMI |
|---|---|---|---|
| 0 | 109264.0 | Yes | 17.6 |
| 1 | 109266.0 | Yes | 37.8 |
| 2 | 109277.0 | Yes | 18.6 |
| 3 | 109279.0 | Yes | 21.0 |
| 4 | 109284.0 | Yes | 39.1 |
| ... | ... | ... | ... |
| 5472 | 142272.0 |  | 18.0 |
| 5473 | 142280.0 | Yes | 38.4 |
| 5474 | 142283.0 | Yes | 45.8 |
| 5475 | 142300.0 | Yes | 32.6 |


#### Smoking (smoking status, frequency, history)