**FINAL PROJECT SAP**

_Working Hypothesis:_

My working hypothesis is that higher urinary glyphosate concentrations are associated with elevated levels of oxidative stress biomarkers during pregnancy. I expect these associations to be positive and potentially nonlinear, reflecting biological pathways through which pesticide exposures may influence oxidative stress.

_Research Questions:_

1. How do urinary glyphosate concentrations relate to oxidative stress biomarkers among pregnant women?
2. Do demographic or lifestyle factors such as maternal age, BMI, race, or diet confound or modify the associations between glyphosate exposure and oxidative stress?
3. Is the relationship between exposure to glyphosate or its degradation product aminomethylphosphonic acid (AMPA) and oxidative stress linear, or do generalized additive models suggest nonlinear exposure–response patterns?

**Objective 1: Dataset**

The dataset is simulated and stored as simdata.xlsx in the data folder.

**Objective 2: Summary Statistics**

_Packages needed: pandas, numpy, scipy.stats_

_1. Demographic Categorical and Continuous Variables_

For all categorical variables, including maternal race, education, employment status, and marital status, I will calculate counts and percentages. For all continuous variables, including maternal age, pre-pregnancy BMI, and Alternative Healthy Eating Index (AHEI) score, I will calculate means and standard deviations. These summaries will provide a clear overview of the sample composition and show how participant characteristics are distributed.
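The summaries described above could be computed along these lines. This is a minimal sketch on toy data, not the project's actual code: the column names (`race`, `education`, `age`, `bmi`) and the helper functions are hypothetical stand-ins for whatever simdata.xlsx contains.

```python
import pandas as pd

# Toy data standing in for simdata.xlsx; column names and values are hypothetical.
df = pd.DataFrame({
    "race": ["A", "B", "A", "C", "B", "A"],
    "education": ["HS", "College", "College", "HS", "College", "HS"],
    "age": [28.0, 31.5, 25.0, 35.0, 29.5, 33.0],
    "bmi": [22.1, 27.4, 24.8, 30.2, 21.5, 26.0],
})

def summarize_categorical(data, cols):
    """Counts and percentages for each level of each categorical column."""
    rows = []
    for col in cols:
        counts = data[col].value_counts(dropna=False)
        for level, n in counts.items():
            rows.append({
                "variable": col,
                "level": level,
                "n": int(n),
                "percent": round(100 * n / len(data), 1),
            })
    return pd.DataFrame(rows)

def summarize_continuous(data, cols):
    """Mean and sample standard deviation (ddof=1) for each continuous column."""
    return data[cols].agg(["mean", "std"]).T.rename(columns={"std": "sd"})

cat_table = summarize_categorical(df, ["race", "education"])
cont_table = summarize_continuous(df, ["age", "bmi"])
print(cat_table)
print(cont_table)
```

The two resulting frames can then be formatted as "n (%)" and "mean (SD)" strings when assembling Table 1.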

_2. Exposure Variable_

Since glyphosate concentrations are right-skewed, I will summarize them using geometric means (GM) and geometric standard deviations (GSD). I will compute these overall and stratified by each categorical covariate. This will help identify whether certain demographic groups have higher or lower pesticide exposure, which may inform later decisions about confounding or effect modification.
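For reference, the geometric mean is the exponentiated mean of the log-transformed values, and the geometric standard deviation is the exponentiated standard deviation of the log-transformed values:

$$\mathrm{GM} = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\ln x_i\right), \qquad \mathrm{GSD} = \exp\!\left(\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(\ln x_i - \overline{\ln x}\right)^2}\right)$$

A small custom function for this might look like the sketch below. The column names (`glyphosate`, `race`) and the toy values are assumptions for illustration, and the choice of the sample standard deviation (`ddof=1`) is one reasonable convention rather than something fixed by this plan.

```python
import numpy as np
import pandas as pd

def geometric_summary(values):
    """Geometric mean and geometric SD of strictly positive values."""
    x = pd.Series(values, dtype=float).dropna()
    if (x <= 0).any():
        raise ValueError("Geometric statistics require strictly positive values.")
    logs = np.log(x)
    return pd.Series({
        "n": len(x),
        "gm": np.exp(logs.mean()),
        "gsd": np.exp(logs.std(ddof=1)),  # sample SD on the log scale
    })

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "race": ["A", "A", "B", "B"],
    "glyphosate": [1.0, 4.0, 2.0, 8.0],
})

# Overall summary, then one row per level of a categorical covariate.
overall = geometric_summary(df["glyphosate"])
by_group = pd.DataFrame({
    level: geometric_summary(group)
    for level, group in df.groupby("race")["glyphosate"]
}).T
print(overall)
print(by_group)
```

Repeating the grouped step for each categorical covariate would give the rows of Table 2.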

_3. Descriptive Tables_

I will present the descriptive statistics in two tables:

Table 1: Descriptive characteristics of the sample (categorical: n (%); continuous: mean (SD))

Table 2: Geometric means and GSDs of glyphosate, overall and stratified by covariates

**Objective 3: Statistical Analysis Plan**

To address my research questions, I will begin by preparing the dataset for analysis. This includes restricting the dataset to complete cases, renaming outcome variables so that their labels are scientifically interpretable, checking distributions, and verifying that all variables are coded with the correct data types. Because glyphosate and the oxidative stress biomarkers are right-skewed, I will log-transform the exposure and outcome variables prior to modeling. These steps, which support the assumptions required for later analyses and directly contribute to answering Research Question 1, will be completed using pandas and numpy.

Once the dataset is cleaned, I will generate the descriptive statistics outlined above using pandas, with geometric means calculated via numpy (and custom functions), to understand the distribution of glyphosate across demographic groups.

To estimate unadjusted associations between glyphosate and each oxidative stress biomarker (Research Question 1), I will fit separate ordinary least squares (OLS) linear regressions with log-glyphosate as the exposure and each log-transformed biomarker as the outcome using statsmodels.api.OLS. These models will provide β coefficients, 95% confidence intervals, and p-values.

To identify potential covariates for adjusted models (Research Question 2), I will evaluate relationships between glyphosate, biomarkers, and demographic variables using t-tests (two-level variables) and one-way ANOVA (variables with three or more levels) for categorical predictors, and Pearson correlations for continuous predictors, implemented through scipy.stats. Covariates associated with both the exposure and the outcome at p < 0.10 will be included in multivariable models. I will then fit adjusted OLS regression models using statsmodels to assess whether demographic or lifestyle factors confound or modify the association between glyphosate and oxidative stress, directly addressing Research Question 2.

Finally, to explore potential nonlinearity in the exposure–response relationship (Research Question 3), I will fit generalized additive models (GAMs) with spline terms using statsmodels.gam.api.GLMGam and BSplines. These models will allow me to visualize and statistically test nonlinear patterns between glyphosate and each biomarker.
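A GAM of this kind could be set up roughly as below. This is a sketch under stated assumptions: the data are simulated with a deliberately nonlinear shape, the column names are hypothetical, and the basis dimension (`df=[6]`), spline degree, and penalty weight (`alpha`) are arbitrary placeholders that would need tuning on the real data.

```python
import numpy as np
import pandas as pd
from statsmodels.gam.api import GLMGam, BSplines

# Simulated toy data with a nonlinear shape; names and values are placeholders.
rng = np.random.default_rng(1)
n = 300
log_gly = rng.uniform(-2, 2, size=n)
df = pd.DataFrame({
    "log_gly": log_gly,
    "log_biomarker": np.sin(log_gly) + rng.normal(scale=0.3, size=n),
})

# Cubic B-spline basis for the exposure; df and alpha are illustrative choices.
bs = BSplines(df[["log_gly"]], df=[6], degree=[3])

# Linear part of the model: intercept only (adjustment covariates would be added here).
exog = pd.DataFrame({"const": 1.0}, index=df.index)

# The default family is Gaussian, so this is a penalized additive linear model.
gam = GLMGam(df["log_biomarker"], exog=exog, smoother=bs, alpha=np.array([1.0]))
res = gam.fit()

# Partial effect of the smooth term and its standard error at the observed x values.
partial, se = res.partial_values(0)
band = pd.DataFrame({
    "log_gly": df["log_gly"],
    "partial": partial,
    "lower": partial - 1.96 * se,
    "upper": partial + 1.96 * se,
}).sort_values("log_gly")
print(band.head())
```

The `band` frame holds what each GAM panel needs: the smooth estimate and an approximate 95% confidence band, ordered by exposure. Comparing the fit against the corresponding linear model (for example by AIC) is one possible way to judge whether the nonlinear term is needed.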

**Objective 4: Visualizations**

First, to address Research Question 1, I will generate a forest-style plot showing the β estimates and 95% confidence intervals from the unadjusted linear regressions between log-glyphosate and each oxidative stress biomarker. This figure will be titled “Unadjusted Associations Between Glyphosate and Oxidative Stress Biomarkers” with the x-axis labeled “Beta Estimate (95% CI)” and the y-axis labeled with biomarker names.

To address Research Questions 1 and 2, I will produce a second forest plot summarizing the adjusted regression results, titled “Adjusted Associations Between Glyphosate and Oxidative Stress Biomarkers,” using the same axis labels for consistency.

Finally, to explore Research Question 3, I will create a 3×2 multipanel GAM plot (one panel per biomarker) showing the smooth function of log-glyphosate with 95% confidence bands. Each panel will have a brief descriptive title, with the x-axis labeled “log(Glyphosate)” and the y-axis labeled “Partial Effect.” Together, these visualizations will help clarify both the linear and potential nonlinear relationships between glyphosate and oxidative stress biomarkers.
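One way to draw the forest plots with matplotlib is sketched below. The estimates in `results` are invented placeholders used only to show the layout, the biomarker names are generic, and `forest_plot` is a hypothetical helper. The Agg backend is selected so the sketch runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented placeholder estimates for layout only; not results from any analysis.
results = pd.DataFrame({
    "outcome": ["Biomarker 1", "Biomarker 2", "Biomarker 3"],
    "beta": [0.10, -0.05, 0.20],
    "ci_low": [0.02, -0.15, 0.05],
    "ci_high": [0.18, 0.05, 0.35],
})

def forest_plot(results, title):
    """Plot beta estimates with 95% CIs, one row per biomarker."""
    fig, ax = plt.subplots(figsize=(6, 0.6 * len(results) + 1.5))
    y = np.arange(len(results))
    # errorbar expects distances from the point estimate, not interval endpoints.
    xerr = np.vstack([
        (results["beta"] - results["ci_low"]).to_numpy(),
        (results["ci_high"] - results["beta"]).to_numpy(),
    ])
    ax.errorbar(results["beta"].to_numpy(), y, xerr=xerr,
                fmt="o", color="black", capsize=3)
    ax.axvline(0, linestyle="--", color="grey")  # null value for a beta coefficient
    ax.set_yticks(y)
    ax.set_yticklabels(results["outcome"])
    ax.invert_yaxis()  # first biomarker at the top
    ax.set_xlabel("Beta Estimate (95% CI)")
    ax.set_title(title)
    fig.tight_layout()
    return fig, ax

fig, ax = forest_plot(
    results,
    "Unadjusted Associations Between Glyphosate and Oxidative Stress Biomarkers",
)
fig.savefig("forest_unadjusted.png", dpi=150)
```

Calling the same helper with the adjusted estimates and the second title would produce the matching adjusted figure with identical axes.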