**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - EDA Checkpoint

# Names

* Sophia Ashraf
* Dylan Oquendo
* Karun Mokha
* Jake Kondo
* Ekrem Ersoz

# Research Question

**How did the interplay between mental health and specific Covid-19-related events influence pregnancy outcomes, and what novel patterns emerge when comparing pre-pandemic, pandemic, and post-vaccine introduction phases?**

**Sub Questions:**

**Normal time data (overall population):**<br>
Analyze pregnancy outcomes data from periods before the COVID-19 pandemic as a baseline. This will help establish what "normal" outcomes look like, against which pandemic-era outcomes can be compared.<br>

**Was it related to mental health:**<br>
Investigate whether changes in pregnancy outcomes during the COVID-19 pandemic correlate with reported changes in mental health statistics. This involves collecting data on mental health issues among pregnant individuals during the pandemic and comparing these with pregnancy outcomes.<br>

**How did these relate to large events (e.g., vaccines)?**<br>
Examine the timeline of major Covid-19-related events, such as lockdowns, infection waves, and the introduction of vaccines, and analyze their impact on mental health and pregnancy outcomes. This could involve comparing pregnancy outcomes and mental health data before and after such events to identify any significant changes or trends.
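
A minimal sketch of how delivery dates could be assigned to the three phases for this comparison. The cutoff dates and the names `PANDEMIC_START`, `VACCINE_START`, and `label_phase` are assumptions for illustration, not values taken from the dataset; the dates that fit the study population would need to be confirmed.

```python
import pandas as pd

# Assumed cutoffs (illustrative only): WHO pandemic declaration and an
# approximate start of vaccine rollout. Adjust to the population studied.
PANDEMIC_START = pd.Timestamp("2020-03-11")
VACCINE_START = pd.Timestamp("2020-12-14")

def label_phase(date):
    """Return the pandemic phase for a delivery date (missing -> 'unknown')."""
    if pd.isna(date):
        return "unknown"
    if date < PANDEMIC_START:
        return "pre-pandemic"
    if date < VACCINE_START:
        return "pandemic"
    return "post-vaccine"

# Toy delivery dates, not real observations
dates = pd.Series(pd.to_datetime(["2019-11-01", "2020-07-01", "2021-02-01", None]))
phases = dates.apply(label_phase)
print(phases.tolist())
```

Once each row carries a phase label, outcomes and mental health scores can be summarized per phase with a `groupby`.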


## Background and Prior Work

The impact of COVID-19 on mental health around the globe has been widely acknowledged and studied, particularly its effects on specific “at risk” populations such as pregnant and postpartum women. The World Health Organization (WHO) estimates that the prevalence of anxiety and depression increased by 25% globally during the height of the pandemic. A systematic review and meta-analysis led by Gayathri Delanerolle of the department of Health Care Science at Oxford, titled “The prevalence of mental ill-health in women during pregnancy and after childbirth during the COVID-19 pandemic,” found a significant increase in negative mental health outcomes specifically among pregnant and postpartum women. The authors quantified depression, anxiety, and stress, and suggested that the increase in these conditions highlights a need for mental health resources within maternal healthcare services, especially during a pandemic as impactful as COVID-19. This research provides strong empirical context for our project by illustrating the vulnerability of pregnant women to negative mental health outcomes during the pandemic.

For our project, we also want to contextualize maternal stress during pregnancy outside of a pandemic so that we have a baseline for comparison. In “Prenatal developmental origins of behavior and mental health: The influence of maternal stress in pregnancy,” Van den Bergh et al. reviewed current research and found that stress during pregnancy can have long-lasting behavioral and mental health effects on the child later in life. These include effects on fetal development through mechanisms such as hormonal changes and altered brain development. Overall, the article highlights how critical the pregnancy period is, and how especially stressful times could have profound, long-lasting effects on child development.

These two studies serve as the foundation for our project: a meta-analysis of mental health in pregnant and postpartum women during the pandemic, and a review of current research on maternal stress and child development. Delanerolle et al. found an increase in mental health issues among pregnant and postpartum women during the pandemic, while Van den Bergh et al.'s comprehensive review gave us insight into the prenatal origins of mental health by highlighting cognitive and neurodevelopmental dysfunction. These findings underscore the importance of our research question. By examining the relationship between mental health during COVID-19 and pregnancy outcomes, our project aims to contribute to prior research on these interactions and to support maternal well-being during current and future pandemics.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9883834/

https://pubmed.ncbi.nlm.nih.gov/28757456/

# Hypothesis


Increased stress and mental health challenges faced by pregnant individuals during the COVID-19 pandemic negatively impacted pregnancy outcomes. We arrived at this prediction by considering the radical effect of the COVID-19 pandemic on an individual’s emotional and mental well-being, paired with the existing stress that a woman already faces throughout her pregnancy.
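
One hedged way to probe this hypothesis is a rank correlation between a mental health score and a birth outcome. The sketch below uses made-up numbers and the hypothetical names `toy` and `rho`; it computes Spearman's correlation as the Pearson correlation of the ranks, which avoids assuming a linear relationship. A correlation of this kind would not by itself establish that stress caused the outcome.

```python
import pandas as pd

# Toy values (made up) standing in for depression scores and birth weights
toy = pd.DataFrame({
    "depression": [2, 5, 9, 12, 15, 20],
    "birth_weight": [3800, 3600, 3500, 3300, 3400, 2900],
})

# Spearman correlation = Pearson correlation computed on the ranks
rho = toy["depression"].rank().corr(toy["birth_weight"].rank())
print(round(rho, 3))
```

A negative `rho` would be consistent with the hypothesis that higher scores go with lower birth weight.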

# Data

## Data overview

- **Dataset #1**
  - Mental health in the pregnancy during the COVID-19
  - https://www.kaggle.com/datasets/yeganehbavafa/mental-health-in-the-pregnancy-during-the-covid-19/data
  - 5176 observations
  - 16 variables



**Variables:** Mental health indicators (e.g., levels of stress, anxiety, depression), pregnancy outcomes (e.g., gestational age at birth, birth weight, any complications), Covid-19 impact metrics (e.g., infection status, lockdown impact), major pandemic events timelines (e.g., start of lockdowns, vaccine rollouts).

**Population:** Pregnant individuals during the COVID-19 pandemic, with a comparison group from before the pandemic as a baseline.

**Time Period:** Data should span from before the COVID-19 pandemic (as a baseline) and continue through the pandemic, ideally with timestamps to align with major pandemic events for dynamic analysis.

This dataset from the Pregnancy during the COVID-19 Pandemic (PdP) project includes variables such as maternal age, household income, maternal education levels, Edinburgh Postnatal Depression Scale (EPDS) scores, PROMIS Anxiety scores, gestational age at birth, delivery date, birth length and weight, delivery mode, NICU stay, survey language, and perceived threat levels to life and unborn baby due to COVID-19. These variables offer comprehensive insights into the socio-economic, psychological, and health-related aspects of pregnant individuals' experiences during the pandemic, allowing for a multifaceted analysis of the impact of COVID-19 on pregnancy outcomes.

## Pregnancy During the COVID-19 Pandemic (PdP)

In [20]:
import pandas as pd
import numpy as np


file_path = 'Pregnancy During the COVID-19 Pandemic.csv'
df = pd.read_csv(file_path)

# Display the first few rows to understand the structure of the dataset
df.head()

Unnamed: 0,OSF_ID,Maternal_Age,Household_Income,Maternal_Education,Edinburgh_Postnatal_Depression_Scale,PROMIS_Anxiety,Gestational_Age_At_Birth,Delivery_Date(converted to month and year),Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Language,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,"$200,000+",Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,No,English,2.0,3.0,27.0
1,2,34.6,"$200,000+",Undergraduate degree,4.0,17.0,,,,,,,English,2.0,33.0,92.0
2,3,34.3,"$100,000 -$124,999",Undergraduate degree,,,,,,,,,French,,,
3,4,28.8,"$100,000 -$124,999",Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,No,French,53.0,67.0,54.0
4,5,36.5,"$40,000-$69,999",Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),No,English,23.0,32.0,71.0


In [21]:
# Drop any row with a null in the columns we are most interested in
df = df.dropna(subset=['PROMIS_Anxiety', 'Birth_Length', 'Birth_Weight', 'NICU_Stay', 'Edinburgh_Postnatal_Depression_Scale'])
# Renumber the index so it runs 0..n-1 after the rows were removed
df = df.reset_index(drop=True)
df.shape

(5176, 16)
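
A minimal sketch, on a made-up frame, of what `dropna(subset=...)` does: only gaps in the listed columns remove a row, so columns outside the subset can still contain nulls afterwards. The names `toy`, `required`, and `rows_dropped` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in different columns
toy = pd.DataFrame({
    "PROMIS_Anxiety": [13.0, 17.0, np.nan, 20.0],
    "Birth_Weight": [3431.0, np.nan, np.nan, 2534.0],
    "Maternal_Age": [38.3, 34.6, 34.3, np.nan],
})

required = ["PROMIS_Anxiety", "Birth_Weight"]
rows_before = len(toy)
cleaned = toy.dropna(subset=required).reset_index(drop=True)
rows_dropped = rows_before - len(cleaned)

# Maternal_Age is not in `required`, so its NaN does not remove the row
print(rows_dropped, cleaned["Maternal_Age"].isna().sum())
```

Recording the number of dropped rows alongside the final shape makes it easier to report how much of the original sample the analysis retains.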

In [22]:
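# Preview the frame without the Language column.
# Note: drop() returns a new DataFrame and the result is not assigned,
# so df itself still contains Language in the cells that follow.
df.drop('Language', axis=1)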

Unnamed: 0,OSF_ID,Maternal_Age,Household_Income,Maternal_Education,Edinburgh_Postnatal_Depression_Scale,PROMIS_Anxiety,Gestational_Age_At_Birth,Delivery_Date(converted to month and year),Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,"$200,000+",Masters degree,9.0,13.0,39.71,Dec2020,49.20,3431.0,Vaginally,No,2,3,27
1,4,28.8,"$100,000 -$124,999",Masters degree,9.0,20.0,38.57,Dec2020,41.00,2534.0,Vaginally,No,53,67,54
2,5,36.5,"$40,000-$69,999",Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),No,23,32,71
3,9,33.1,"$100,000 -$124,999",College/trade school,1.0,7.0,40.86,Nov2020,55.88,4480.0,Vaginally,No,27,76,72
4,14,29.2,"$70,000-$99,999",Masters degree,14.0,17.0,41.00,Oct2020,47.00,3084.0,Vaginally,No,68,69,81
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5171,10756,41.7,"$175,000- $199,999",Undergraduate degree,19.0,21.0,38.43,Aug2020,48.26,3700.0,Caesarean-section (c-section),No,94,94,91
5172,10757,27.8,"$150,000 - $174,999",Masters degree,8.0,19.0,38.86,Aug2020,50.80,3573.0,Caesarean-section (c-section),No,45,82,86
5173,10758,36.2,"$150,000 - $174,999",Undergraduate degree,3.0,9.0,38.57,Jul2020,50.50,3119.0,Vaginally,No,70,32,75
5174,10762,33.2,"$125,000- $149,999",College/trade school,0.0,8.0,41.57,Oct2020,52.00,3629.0,Vaginally,No,0,13,17


In [23]:
missing_values = df.isnull().sum()

missing_values

OSF_ID                                         0
Maternal_Age                                   3
Household_Income                              20
Maternal_Education                            14
Edinburgh_Postnatal_Depression_Scale           0
PROMIS_Anxiety                                 0
Gestational_Age_At_Birth                       0
Delivery_Date(converted to month and year)     0
Birth_Length                                   0
Birth_Weight                                   0
Delivery_Mode                                  0
NICU_Stay                                      0
Language                                       0
Threaten_Life                                  0
Threaten_Baby_Danger                           0
Threaten_Baby_Harm                             0
dtype: int64

In [24]:
df.rename(columns={'Maternal_Age': 'mat_age', 'Household_Income': 'income', 'Maternal_Education': 'mat_edu',
                  'Edinburgh_Postnatal_Depression_Scale': 'depression',
                  'PROMIS_Anxiety': 'anxiety', 'Gestational_Age_At_Birth': 'birth_age',
                  'Delivery_Date(converted to month and year)': 'birth_date'}, inplace=True)
df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Language,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,"$200,000+",Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,No,English,2,3,27
1,4,28.8,"$100,000 -$124,999",Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,No,French,53,67,54
2,5,36.5,"$40,000-$69,999",Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),No,English,23,32,71
3,9,33.1,"$100,000 -$124,999",College/trade school,1.0,7.0,40.86,Nov2020,55.88,4480.0,Vaginally,No,English,27,76,72
4,14,29.2,"$70,000-$99,999",Masters degree,14.0,17.0,41.0,Oct2020,47.0,3084.0,Vaginally,No,French,68,69,81


In [25]:
summary_statistics = df.describe()


summary_statistics, missing_values

(             OSF_ID      mat_age   depression      anxiety   birth_age  \
 count   5176.000000  5173.000000  5176.000000  5176.000000  5176.00000   
 mean    5300.733578    32.521322     9.738022    18.389104    39.33868   
 std     3114.246816     4.140823     5.307232     5.950169     1.62486   
 min        1.000000    18.500000     0.000000     7.000000    24.86000   
 25%     2560.750000    29.700000     6.000000    14.000000    38.57000   
 50%     5294.500000    32.400000    10.000000    19.000000    39.57000   
 75%     8009.250000    35.300000    13.000000    23.000000    40.43000   
 max    10764.000000    49.000000    28.000000    35.000000    42.86000   
 
        Birth_Length  Birth_Weight  
 count   5176.000000   5176.000000  
 mean      50.499834   3412.676005  
 std        4.433899    534.564742  
 min       20.000000    314.000000  
 25%       49.000000   3119.000000  
 50%       50.800000   3431.000000  
 75%       53.310000   3742.000000  
 max       70.000000   5968

In [26]:
def standardize_income(income):
    if pd.isna(income):
        # Return NaN as is, you can also choose to fill it with a specific value if required
        return np.nan
    elif isinstance(income, str):
        # Check for non-standard strings and convert them
        if 'Less than' in income:
            return 20000  # Example value, adjust based on your dataset
        # Check if income is a range
        elif '-' in income:
            parts = income.replace('$', '').replace(',', '').split('-')
            # Calculate midpoint for ranges
            if len(parts) == 2 and parts[1]:
                low, high = map(int, parts)
                return (low + high) / 2
            else:  # Handle cases like '$150,000 -'
                low = int(parts[0])
                return low * 1.25
        elif '+' in income:
            # Handle open-ended values like '$200,000+'
            low = int(income.replace('$', '').replace(',', '').replace('+', ''))
            return low * 1.25
        else:
            # Handle single values without range
            return int(income.replace('$', '').replace(',', '').replace(' ', ''))
    else:
        # If income is already a number, just return it
        return income

# Assuming 'df' is your dataframe
df['income'] = df['income'].apply(standardize_income)

df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Language,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,250000.0,Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,No,English,2,3,27
1,4,28.8,112499.5,Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,No,French,53,67,54
2,5,36.5,54999.5,Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),No,English,23,32,71
3,9,33.1,112499.5,College/trade school,1.0,7.0,40.86,Nov2020,55.88,4480.0,Vaginally,No,English,27,76,72
4,14,29.2,84999.5,Masters degree,14.0,17.0,41.0,Oct2020,47.0,3084.0,Vaginally,No,French,68,69,81


In [27]:
def categorize_income(value, low_thresh, high_thresh):
    if pd.isna(value):
        return 'Unknown'  # Handle NaN values as 'Unknown'
    elif value < low_thresh:
        return 'Low'
    elif low_thresh <= value < high_thresh:
        return 'Middle'
    else:
        return 'High'

# Assuming 'df' is your dataframe and 'income' is the correct income column
low_threshold = df['income'].quantile(0.33)
high_threshold = df['income'].quantile(0.66)

# Categorize incomes using the thresholds
df['income'] = df['income'].apply(lambda x: categorize_income(x, low_threshold, high_threshold))

# Display the updated DataFrame
df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Language,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,High,Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,No,English,2,3,27
1,4,28.8,Middle,Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,No,French,53,67,54
2,5,36.5,Low,Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),No,English,23,32,71
3,9,33.1,Middle,College/trade school,1.0,7.0,40.86,Nov2020,55.88,4480.0,Vaginally,No,English,27,76,72
4,14,29.2,Middle,Masters degree,14.0,17.0,41.0,Oct2020,47.0,3084.0,Vaginally,No,French,68,69,81


In [28]:
# Convert 'Yes'/'No' to 1/0 in NICU_Stay
df['NICU_Stay'] = df['NICU_Stay'].map({'Yes': 1, 'No': 0})
df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Language,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,High,Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,0,English,2,3,27
1,4,28.8,Middle,Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,0,French,53,67,54
2,5,36.5,Low,Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),0,English,23,32,71
3,9,33.1,Middle,College/trade school,1.0,7.0,40.86,Nov2020,55.88,4480.0,Vaginally,0,English,27,76,72
4,14,29.2,Middle,Masters degree,14.0,17.0,41.0,Oct2020,47.0,3084.0,Vaginally,0,French,68,69,81
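
`Series.map` with a dictionary returns NaN for any value that is not a key, so it can be worth confirming that the encoding did not silently create gaps. A minimal sketch on made-up values, where the lowercase `'yes'` is a deliberate mismatch and `unmapped` is a hypothetical name:

```python
import pandas as pd

# Toy responses (made up); 'yes' does not match the 'Yes' key exactly
nicu = pd.Series(["Yes", "No", "No", "yes"])
encoded = nicu.map({"Yes": 1, "No": 0})

# Original labels that had no entry in the mapping
unmapped = nicu[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would indicate that every response was covered by the mapping.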


In [29]:
# Delivery dates are stored as month-year strings such as 'Dec2020',
# so parse them with '%b%Y'; each date becomes the first day of that month
df['birth_date'] = pd.to_datetime(df['birth_date'], format='%b%Y', errors='coerce')
df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Language,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,High,Masters degree,9.0,13.0,39.71,2020-12-01,49.2,3431.0,Vaginally,0,English,2,3,27
1,4,28.8,Middle,Masters degree,9.0,20.0,38.57,2020-12-01,41.0,2534.0,Vaginally,0,French,53,67,54
2,5,36.5,Low,Undergraduate degree,14.0,20.0,39.86,2020-10-01,53.34,3714.0,Caesarean-section (c-section),0,English,23,32,71
3,9,33.1,Middle,College/trade school,1.0,7.0,40.86,2020-11-01,55.88,4480.0,Vaginally,0,English,27,76,72
4,14,29.2,Middle,Masters degree,14.0,17.0,41.0,2020-10-01,47.0,3084.0,Vaginally,0,French,68,69,81
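
Because the delivery dates are stored as strings such as `Dec2020`, the format string has to describe an abbreviated month followed by a four-digit year. A minimal sketch on made-up rows that parses such strings and then averages a score per delivery month; `toy` and `monthly_anxiety` are hypothetical names:

```python
import pandas as pd

# Toy rows (made-up values) using the same month-year string style
toy = pd.DataFrame({
    "birth_date": ["Dec2020", "Oct2020", "Oct2020", "not a date"],
    "anxiety": [13.0, 20.0, 18.0, 17.0],
})

# %b = abbreviated month name, %Y = four-digit year; bad strings become NaT
toy["birth_date"] = pd.to_datetime(toy["birth_date"], format="%b%Y", errors="coerce")

# Mean anxiety score per delivery month (rows with NaT dates are excluded)
monthly_anxiety = toy.groupby("birth_date")["anxiety"].mean()
print(monthly_anxiety)
```

Counting the `NaT` values after parsing is a quick way to confirm that the format string matched the column.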


In [30]:
df.shape

(5176, 16)

# Results

## Exploratory Data Analysis

### Distributions of maternal, mental health, and birth variables

![Screenshot%202024-03-11%20at%202.19.58%20AM.png](attachment:Screenshot%202024-03-11%20at%202.19.58%20AM.png)

Given the multitude of variables in our dataset related to each mother's background, mental health status, and birth, we explored the dataset for possible outliers, skews, and trends, and checked that they align with common knowledge about births. Maternal age is approximately normally distributed and centered in the early thirties, suggesting that most of the women in this dataset are in their 30s. Interestingly, the Edinburgh Postnatal Depression Scale score is slightly right-skewed, meaning more respondents had lower depression scores, with fewer individuals reporting high levels of depression on this scale. We saw a similar trend in the PROMIS Anxiety score. The Gestational Age at Birth distribution peaks at 38-40 weeks, indicating that the majority of births in the dataset occurred at term according to the WHO, whose reference values we used for our birth-related comparisons. The majority of babies weighed 2500 to 4000 grams, which is also within the normal range according to the WHO.
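
The checks described above can also be expressed numerically rather than read off a histogram. The sketch below uses made-up values, not rows from the PdP dataset, and the reference ranges (37 to under 42 weeks for term, 2500 to 4000 g for weight) are assumptions for illustration that should be checked against the WHO definitions we cite.

```python
import pandas as pd

# Toy values (made up), not rows from the PdP dataset
toy = pd.DataFrame({
    "depression": [1, 3, 4, 5, 6, 7, 9, 14, 22],
    "birth_age": [36.5, 38.0, 38.6, 39.1, 39.6, 39.9, 40.1, 40.4, 41.0],
    "Birth_Weight": [2400, 3100, 3300, 3400, 3450, 3600, 3700, 3900, 4200],
})

# Positive skewness means a longer right tail (most scores are low)
depression_skew = toy["depression"].skew()

# Share of births inside the assumed reference ranges
term_share = ((toy["birth_age"] >= 37) & (toy["birth_age"] < 42)).mean()
normal_weight_share = toy["Birth_Weight"].between(2500, 4000).mean()

print(round(depression_skew, 2), round(term_share, 2), round(normal_weight_share, 2))
```

Reporting the skewness value and the in-range shares next to each plot gives a reader something concrete to compare against the visual impression.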

Other trends in our dataset: the majority of mothers fell in the middle or high income categories, most spoke English with some French speakers (our dataset comes from a Canadian population), and the large majority of mothers had completed a college degree. The histograms for depression and anxiety scores hint at potential outliers in the higher score ranges, but these could simply be cases of severe anxiety or depression, potentially in people with overlapping conditions. The birth weight and gestational age distributions seemed relatively clean, with no extreme outliers apparent. When we start running our models, it will be important to keep in mind how we cleaned the data in case we need to reevaluate how we handle null values or distributions, since some of our ranges are defined relative to publicly available depression and birth reference values (i.e., how we compared values to the WHO).
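
Visual inspection can be complemented with a simple rule-based check. The sketch below applies Tukey's 1.5 × IQR rule to made-up birth weights; `iqr_outlier_mask` is a hypothetical helper, and the 314 g entry mirrors the minimum birth weight reported in the summary statistics, which may merit a closer look.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy birth weights in grams (made up); 314 mimics an implausibly low entry
weights = pd.Series([3100, 3250, 3400, 3431, 3500, 3600, 3742, 314])
flagged = weights[iqr_outlier_mask(weights)]
print(flagged.tolist())
```

Flagged values are candidates for review rather than automatic removal, since a very low weight can be a genuine preterm birth instead of a data entry error.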


# Ethics & Privacy

The dataset from the Pregnancy during the COVID-19 Pandemic (PdP) project might exhibit biases such as geographic concentration (limited to Canada, possibly not reflecting experiences in other healthcare systems or cultures), socio-economic and educational disparities (respondents may skew towards certain income or education levels based on survey reach and accessibility), and potential language barriers (despite survey language options, nuances in understanding or expression may affect responses). To address these, analyzing demographic data against wider population statistics and considering the socio-cultural context in our interpretations will be important. The variables we used, such as the Edinburgh Postnatal Depression Scale and the PROMIS Anxiety score, are established measures in research, and participants consented both to take part in the study and to have the findings published, so although we are working with their numerical scores, we consider their use here consistent with that consent. The main remaining privacy concern is that the data contains individuals' mental health information.

# Team Expectations 

* It's agreed that there should be equal effort from all members.
* The team plans to meet once a week in person for discussions and updates on the project's progress.
* The team will communicate regularly and reply to group messages in a timely manner.
* The team will respect one another's opinions and ideas and maintain a safe environment for sharing them.

# Project Timeline Proposal

**February 22nd** - Met in person to discuss Checkpoint #1 and the Project Proposal<br>
**February 23rd** - Attend OH to get approval for the new project idea<br>
**February 25th** - Complete Checkpoint #1, rewrite the project proposal, Discord call as a group<br>
**February 28th** - Had team meeting online to work on the project<br>
**March 8th** - Held team meeting in person and collaborated on Checkpoint #2<br>
**March 10th** - Complete Checkpoint #2<br>
**March 14th** - Have a team meeting in person to finalize the project, attend OH if needed<br>
**March 15th** - Complete the Final Project Report, have a meeting to review the report<br>
**March 20th** - Submit all project work<br>