<h1>Checkpoint #1: Data</h1>


# NAMES: <br>
Sophia Ashraf <br>
Dylan Oquendo <br>
Karun Mokha <br>
Jake Kondo <br>
Ekrem Ersoz


# RESEARCH QUESTION:


**How did the interplay between mental health and specific Covid-19-related events influence pregnancy outcomes, and what novel patterns emerge when comparing pre-pandemic, pandemic, and post-vaccine introduction phases?**

**Sub Questions:**

**Normal time data (overall population):**<br>
Analyze pregnancy outcomes data from periods before the COVID-19 pandemic as a baseline. This will help establish what "normal" outcomes look like, against which pandemic-era outcomes can be compared.<br>

**Was it related to mental health:**<br>
Investigate whether changes in pregnancy outcomes during the COVID-19 pandemic correlate with reported changes in mental health statistics. This involves collecting data on mental health issues among pregnant individuals during the pandemic and comparing these with pregnancy outcomes.<br>

**How did these relate to large events (e.g., vaccines)?**<br>
Examine the timeline of major Covid-19-related events, such as lockdowns, infection waves, and the introduction of vaccines, and analyze their impact on mental health and pregnancy outcomes. This could involve comparing pregnancy outcomes and mental health data before and after such events to identify any significant changes or trends.

# DATASET(S):

Our study uses variables that cover three themes: mental health status, pregnancy outcomes, and the impact of the COVID-19 pandemic. The data comes from the dataset described in the data article at https://doi.org/10.1016/j.dib.2023.109366. The mental health indicators include stress, anxiety, and depression levels, measured by validated scales such as the Edinburgh Postnatal Depression Scale (EPDS) and the PROMIS Anxiety score, and they let us assess the psychological well-being of pregnant individuals during different phases of the pandemic. We measure pregnancy outcomes through clinical metrics such as gestational age at birth, birth weight, and NICU stay, which reflect the health of both mother and newborn. To examine the pandemic's effects specifically, we use the perceived threat to the mother's own life and to the unborn baby, and we align these with key events such as lockdowns and vaccine rollouts. Our selection also includes socio-economic and demographic factors (maternal age, household income, and maternal education) to account for broader social and economic influences on child development and mental health, since these factors shape access to resources and opportunities.

Building upon the findings of the research article accessible via https://doi.org/10.1016/j.jad.2021.12.057, which explored the influence of mental health on pregnancy outcomes using the same dataset, our analysis aims to reveal new patterns by segmenting data across different pandemic phases: pre-pandemic, during pandemic, and post-vaccine introduction. We employ advanced analytical methods, including temporal analysis and interaction effects, to delve into the dynamics of mental health fluctuations in response to specific COVID-19-related events and their effects on pregnancy outcomes. Our goal is to identify critical periods within the pandemic that significantly impacted mental health and pregnancy outcomes, and to understand the mediating role of socio-economic factors. This approach enables us to extend beyond the scope of the original study, offering novel insights into the complex interplay between mental health, socio-economic variables, and the progression of the COVID-19 pandemic.
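
The phase segmentation described above can be sketched as a simple lookup on the delivery month. This is a minimal illustration, not our final definition: the function `assign_phase` and both cutoff constants are hypothetical names, and the cutoff months are assumptions (the month of the WHO pandemic declaration and the month of the first vaccine rollouts) that may need to be adjusted to the study's setting.

```python
from datetime import date

# Assumed cutoffs, for illustration only; the analysis may define phases differently.
PANDEMIC_START = date(2020, 3, 1)   # assumption: month of the WHO pandemic declaration
VACCINE_START = date(2020, 12, 1)   # assumption: month of the first vaccine rollouts


def assign_phase(delivery_month: date) -> str:
    """Label a delivery month with one of the three study phases."""
    if delivery_month < PANDEMIC_START:
        return "pre-pandemic"
    if delivery_month < VACCINE_START:
        return "pandemic"
    return "post-vaccine"


# Delivery dates are only known to the month, so each is represented by day 1.
examples = [date(2019, 11, 1), date(2020, 10, 1), date(2020, 12, 1)]
phases = [assign_phase(d) for d in examples]
print(phases)
```

Because the dataset records only the month and year of delivery, any event that falls mid-month has to be rounded to a month boundary, which is why both cutoffs are set to the first of the month.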


**SETUP:** 

- import pandas as pd
- import numpy as np

**DATA CLEANING:** 

The data was moderately clean. However, we removed rows with null values in the columns most important to our questions, since those observations could not contribute to the analysis.
We also renamed columns and simplified some column entries so that they are uniform and easier to read and work with. Some entries were expressed as percentages because they were scores out of 100, and we converted the birth date to a datetime variable.
Before making these changes, we checked the datatype of each column of the dataframe and the unique entries of the categorical columns, so that we knew what to rename, clean up, or convert.
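
The datatype and unique-entry checks mentioned above might look like the following. This is a sketch on a few toy rows shaped like the dataset; the frame `sample` and its values are illustrative, not real records.

```python
import pandas as pd

# Toy rows shaped like the dataset; values are illustrative, not real records.
sample = pd.DataFrame({
    "Maternal_Age": [38.3, 34.6, 28.8],
    "Household_Income": ["$200,000+", "$200,000+", "$100,000 -$124,999"],
    "NICU_Stay": ["No", None, "No"],
})

# Column dtypes: the non-numeric columns are the ones that need renaming or conversion.
print(sample.dtypes)

# Distinct entries of each non-numeric column (missing values show up as well).
categorical_cols = sample.select_dtypes(exclude="number").columns
unique_entries = {col: sample[col].unique().tolist() for col in categorical_cols}
print(unique_entries)
```

Listing the distinct entries is what reveals inconsistencies such as the stray space in `$100,000 -$124,999`, which any later parsing has to tolerate.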

In [18]:
import pandas as pd
import numpy as np

# Load the dataset

file_path = 'Pregnancy During the COVID-19 Pandemic.csv'
df = pd.read_csv(file_path)

# Display the first few rows to understand the structure of the dataset
df.head()

Unnamed: 0,OSF_ID,Maternal_Age,Household_Income,Maternal_Education,Edinburgh_Postnatal_Depression_Scale,PROMIS_Anxiety,Gestational_Age_At_Birth,Delivery_Date(converted to month and year),Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Language,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,"$200,000+",Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,No,English,2.0,3.0,27.0
1,2,34.6,"$200,000+",Undergraduate degree,4.0,17.0,,,,,,,English,2.0,33.0,92.0
2,3,34.3,"$100,000 -$124,999",Undergraduate degree,,,,,,,,,French,,,
3,4,28.8,"$100,000 -$124,999",Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,No,French,53.0,67.0,54.0
4,5,36.5,"$40,000-$69,999",Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),No,English,23.0,32.0,71.0


In [19]:
# Drop rows with a null value in any of the columns we are most interested in
key_columns = ['PROMIS_Anxiety', 'Birth_Length', 'Birth_Weight', 'NICU_Stay',
               'Edinburgh_Postnatal_Depression_Scale']
df = df.dropna(subset=key_columns)
# Renumber the remaining rows from 0
df = df.reset_index(drop=True)
df.shape

(5176, 16)

In [20]:
#drop language column

df = df.drop('Language', axis=1)

In [21]:
# Checking for missing values
missing_values = df.isnull().sum()

missing_values

OSF_ID                                         0
Maternal_Age                                   3
Household_Income                              20
Maternal_Education                            14
Edinburgh_Postnatal_Depression_Scale           0
PROMIS_Anxiety                                 0
Gestational_Age_At_Birth                       0
Delivery_Date(converted to month and year)     0
Birth_Length                                   0
Birth_Weight                                   0
Delivery_Mode                                  0
NICU_Stay                                      0
Threaten_Life                                  0
Threaten_Baby_Danger                           0
Threaten_Baby_Harm                             0
dtype: int64
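
The remaining gaps are small: the largest, 20 missing incomes, is under half a percent of the 5176 rows. A share-based summary makes that easier to see than raw counts. The sketch below uses a toy frame with illustrative values, and the name `missing_report` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; values are illustrative, not real records.
sample = pd.DataFrame({
    "mat_age": [38.3, np.nan, 36.5, 33.1],
    "income": ["$200,000+", None, None, "$70,000-$99,999"],
    "anxiety": [13.0, 20.0, 20.0, 7.0],
})

# Count and percentage of missing entries per column.
missing_report = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": (sample.isna().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps, worst first.
missing_report = missing_report[missing_report["n_missing"] > 0]
missing_report = missing_report.sort_values("n_missing", ascending=False)
print(missing_report)
```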

In [22]:
df.rename(columns={'Maternal_Age': 'mat_age', 'Household_Income': 'income', 'Maternal_Education': 'mat_edu',
                  'Edinburgh_Postnatal_Depression_Scale': 'depression',
                  'PROMIS_Anxiety': 'anxiety', 'Gestational_Age_At_Birth': 'birth_age', 
                  'Delivery_Date(converted to month and year)': 'birth_date'}, inplace=True)
df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,"$200,000+",Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,No,2,3,27
1,4,28.8,"$100,000 -$124,999",Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,No,53,67,54
2,5,36.5,"$40,000-$69,999",Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),No,23,32,71
3,9,33.1,"$100,000 -$124,999",College/trade school,1.0,7.0,40.86,Nov2020,55.88,4480.0,Vaginally,No,27,76,72
4,14,29.2,"$70,000-$99,999",Masters degree,14.0,17.0,41.0,Oct2020,47.0,3084.0,Vaginally,No,68,69,81


In [23]:
# Summary statistics for numerical columns
summary_statistics = df.describe()


summary_statistics, missing_values

(             OSF_ID      mat_age   depression      anxiety   birth_age  \
 count   5176.000000  5173.000000  5176.000000  5176.000000  5176.00000   
 mean    5300.733578    32.521322     9.738022    18.389104    39.33868   
 std     3114.246816     4.140823     5.307232     5.950169     1.62486   
 min        1.000000    18.500000     0.000000     7.000000    24.86000   
 25%     2560.750000    29.700000     6.000000    14.000000    38.57000   
 50%     5294.500000    32.400000    10.000000    19.000000    39.57000   
 75%     8009.250000    35.300000    13.000000    23.000000    40.43000   
 max    10764.000000    49.000000    28.000000    35.000000    42.86000   
 
        Birth_Length  Birth_Weight  
 count   5176.000000   5176.000000  
 mean      50.499834   3412.676005  
 std        4.433899    534.564742  
 min       20.000000    314.000000  
 25%       49.000000   3119.000000  
 50%       50.800000   3431.000000  
 75%       53.310000   3742.000000  
 max       70.000000   5968

In [24]:
def standardize_income(income):
    if pd.isna(income):
        # Return NaN as is, you can also choose to fill it with a specific value if required
        return np.nan
    elif isinstance(income, str):
        # Check for non-standard strings and convert them
        if 'Less than' in income:
            return 20000  # Example value, adjust based on your dataset
        # Check if income is a range
        elif '-' in income:
            parts = income.replace('$', '').replace(',', '').split('-')
            # Calculate midpoint for ranges
            if len(parts) == 2 and parts[1]:
                low, high = map(int, parts)
                return (low + high) / 2
            else:  # Handle cases like '$150,000 -'
                low = int(parts[0])
                return low * 1.25
        elif '+' in income:
            # Handle open-ended values like '$200,000+'
            low = int(income.replace('$', '').replace(',', '').replace('+', ''))
            return low * 1.25
        else:
            # Handle single values without range
            return int(income.replace('$', '').replace(',', '').replace(' ', ''))
    else:
        # If income is already a number, just return it
        return income

# Replace each income bracket label with a single numeric estimate
df['income'] = df['income'].apply(standardize_income)

df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,250000.0,Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,No,2,3,27
1,4,28.8,112499.5,Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,No,53,67,54
2,5,36.5,54999.5,Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),No,23,32,71
3,9,33.1,112499.5,College/trade school,1.0,7.0,40.86,Nov2020,55.88,4480.0,Vaginally,No,27,76,72
4,14,29.2,84999.5,Masters degree,14.0,17.0,41.0,Oct2020,47.0,3084.0,Vaginally,No,68,69,81
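
As a cross-check on the bracket parsing, the same midpoint logic can be written more compactly with a regular expression. This is an alternative sketch, not the function used above: `income_midpoint` is a hypothetical name, and it keeps the same assumption that an open-ended bracket is estimated as 1.25 times its lower bound.

```python
import math
import re


def income_midpoint(label):
    """Turn an income bracket label into a single representative number."""
    if label is None or (isinstance(label, float) and math.isnan(label)):
        return math.nan
    # Pull out every dollar amount, ignoring '$', spaces, and thousands separators.
    amounts = [int(m.replace(",", "")) for m in re.findall(r"\d[\d,]*", label)]
    if not amounts:
        return math.nan
    if len(amounts) == 2:
        return sum(amounts) / 2          # closed range: use the midpoint
    if "+" in label:
        return amounts[0] * 1.25         # open-ended: assumed 25% above the lower bound
    return float(amounts[0])             # single amount, e.g. a 'Less than ...' label


for label in ["$200,000+", "$100,000 -$124,999", "$40,000-$69,999"]:
    print(label, "->", income_midpoint(label))
```

The three labels shown reproduce the values in the table above (250000.0, 112499.5, and 54999.5), which is a quick way to confirm that both versions agree.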


In [25]:
def categorize_income(value, low_thresh, high_thresh):
    if pd.isna(value):
        return 'Unknown'  # Handle NaN values as 'Unknown'
    elif value < low_thresh:
        return 'Low'
    elif low_thresh <= value < high_thresh:
        return 'Middle'
    else:
        return 'High'

# Use the 33rd and 66th percentiles of the numeric income as cut points
low_threshold = df['income'].quantile(0.33)
high_threshold = df['income'].quantile(0.66)

# Categorize incomes using the thresholds
df['income'] = df['income'].apply(lambda x: categorize_income(x, low_threshold, high_threshold))

# Display the updated DataFrame
df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,High,Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,No,2,3,27
1,4,28.8,Middle,Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,No,53,67,54
2,5,36.5,Low,Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),No,23,32,71
3,9,33.1,Middle,College/trade school,1.0,7.0,40.86,Nov2020,55.88,4480.0,Vaginally,No,27,76,72
4,14,29.2,Middle,Masters degree,14.0,17.0,41.0,Oct2020,47.0,3084.0,Vaginally,No,68,69,81
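
The cell above overwrites the numeric income with its category. If the numeric values are needed later, one option is to write the category to a separate column. The sketch below does this with `pd.cut` on toy values; the column name `income_group` is hypothetical, and the bin edges follow the same rule as `categorize_income` (Low below the 33rd percentile, High at or above the 66th).

```python
import numpy as np
import pandas as pd

# Toy numeric incomes, including one missing value; illustrative only.
sample = pd.DataFrame(
    {"income": [20000.0, 54999.5, 84999.5, 112499.5, 250000.0, np.nan]}
)

low, high = sample["income"].quantile([0.33, 0.66])

# right=False gives the intervals [-inf, low), [low, high), [high, inf).
groups = pd.cut(
    sample["income"],
    bins=[-np.inf, low, high, np.inf],
    right=False,
    labels=["Low", "Middle", "High"],
)

# Missing incomes fall outside every bin; label them explicitly.
sample["income_group"] = groups.cat.add_categories("Unknown").fillna("Unknown")
print(sample)
```

One caveat: because the real incomes take only a handful of bracket values, the two percentiles can coincide, in which case `pd.cut` rejects the duplicate bin edges and the thresholds would need to be chosen by hand.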


In [26]:
# Convert 'Yes'/'No' to 1/0 in NICU_Stay
df['NICU_Stay'] = df['NICU_Stay'].map({'Yes': 1, 'No': 0})
df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,High,Masters degree,9.0,13.0,39.71,Dec2020,49.2,3431.0,Vaginally,0,2,3,27
1,4,28.8,Middle,Masters degree,9.0,20.0,38.57,Dec2020,41.0,2534.0,Vaginally,0,53,67,54
2,5,36.5,Low,Undergraduate degree,14.0,20.0,39.86,Oct2020,53.34,3714.0,Caesarean-section (c-section),0,23,32,71
3,9,33.1,Middle,College/trade school,1.0,7.0,40.86,Nov2020,55.88,4480.0,Vaginally,0,27,76,72
4,14,29.2,Middle,Masters degree,14.0,17.0,41.0,Oct2020,47.0,3084.0,Vaginally,0,68,69,81
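
`Series.map` returns NaN for any entry that is not a key of the dictionary, so a spelling variant would silently become missing. A quick check for that, sketched on an illustrative series (the lowercase `no` is a made-up variant, not something observed in the data):

```python
import pandas as pd

# Illustrative raw values, including a variant spelling and a true missing entry.
raw = pd.Series(["Yes", "No", "no", None], name="NICU_Stay")

mapped = raw.map({"Yes": 1, "No": 0})

# Entries that were present in the raw data but matched no key in the mapping.
unmapped = raw[mapped.isna() & raw.notna()]
print(unmapped.tolist())
```

An empty list here means every non-missing entry was converted.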


In [27]:
# Delivery dates are stored as month-year strings such as 'Dec2020',
# so the format is abbreviated month name followed by four-digit year
df['birth_date'] = pd.to_datetime(df['birth_date'], format='%b%Y', errors='coerce')
df.head()

Unnamed: 0,OSF_ID,mat_age,income,mat_edu,depression,anxiety,birth_age,birth_date,Birth_Length,Birth_Weight,Delivery_Mode,NICU_Stay,Threaten_Life,Threaten_Baby_Danger,Threaten_Baby_Harm
0,1,38.3,High,Masters degree,9.0,13.0,39.71,2020-12-01,49.2,3431.0,Vaginally,0,2,3,27
1,4,28.8,Middle,Masters degree,9.0,20.0,38.57,2020-12-01,41.0,2534.0,Vaginally,0,53,67,54
2,5,36.5,Low,Undergraduate degree,14.0,20.0,39.86,2020-10-01,53.34,3714.0,Caesarean-section (c-section),0,23,32,71
3,9,33.1,Middle,College/trade school,1.0,7.0,40.86,2020-11-01,55.88,4480.0,Vaginally,0,27,76,72
4,14,29.2,Middle,Masters degree,14.0,17.0,41.0,2020-10-01,47.0,3084.0,Vaginally,0,68,69,81
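
Because `errors='coerce'` turns every string that does not match the format into `NaT` without raising an error, it is worth counting the failures after parsing. A minimal sketch, assuming month-year strings such as `Dec2020` (format `%b%Y`) and using an illustrative series with one deliberately bad entry:

```python
import pandas as pd

# Illustrative month-year strings plus one entry that cannot be parsed.
raw_dates = pd.Series(["Dec2020", "Oct2020", "Nov2020", "not a date"])

# %b = abbreviated month name, %Y = four-digit year; each date becomes day 1 of its month.
parsed = pd.to_datetime(raw_dates, format="%b%Y", errors="coerce")

# Number of entries that failed to parse and were coerced to NaT.
n_failed = int(parsed.isna().sum())
print(parsed)
print("unparsed entries:", n_failed)
```

If the failure count equals the number of rows, the format string does not match the data at all, which is the situation a column full of `NaT` indicates.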


In [28]:
df.shape

(5176, 15)