# Calculate Wind Demographics #
**Author:** Andrew Larkin <br>
**Date Created:** February 8th, 2023 <br>
**Summary:** Calculate summary statistics of wind exposures and sociodemographics for the wind cohort (HEI contract 4970).  <br>

- for details about cohort records and vital statistics, see "Data Dictionary - HEI Birth Data - 1996 to 2016 V1.docx"
- for details about deriving the wind-based exposure metrics, see "Deriving Wind Exposure Metrics_HEI_4970.docx"
- for details about matching upwinwd/downwind neighbors, see "Matching by Wind_HEI_4970.docx"

### load required libraries and define global constants ###

In [1]:
import numpy as np
import pandas as ps
import gConst as const

In [51]:
epiDataAll = ps.read_csv(const.EPI_WIND_DATASET_ALL)
epiDataRestricted = ps.read_csv(const.EPI_WIND_DATASET_37TO42)

### calculate percent distribution of values for a given category in an exposure-based subset of the entire cohort ###
**INPUTS:** 

- epiData (pandas dataframe) - contains information about the category to calculate values for in an exposure-based
  subset of the entire cohort
- catName (string) - variable (column) name for the category of interest
- catDict (dictionary) - formal names for each value in a category
- colName (string) - name to differentiate the exposure-based subset from the entire cohort

In [47]:
def calcCatPercents(epiData,catName,catDict,colName):
    
    # number of participants
    
    totalRecords = epiData.count()[0] 
    percents = []
    indexNames = []
    
    # for each possible value in a type of category ('e.g. less than high school for education'),
    # calcualte the percent of the dataset that contain that value
    for key in catDict.keys():
        curSubset = epiData[epiData[catName]==key]
        totalCount = curSubset.count()[0]
        percents.append(100*totalCount/totalRecords)
        indexNames.append(catDict[key])
        
    # combine the value percents into a dataframe and rename the columns
    newDF = ps.DataFrame({
        colName:percents
    })
    newDF.index = indexNames
    return(newDF)

### calculate descriptive statistics for sociodemographic variables in a subse of the entire cohort ###
**INPUTS:**

- epiData (dataframe) - subset data that contains the variables to calculate descriptive statistics for
- colName (string) - name to differentiate the exposure-based subset from the entire cohort

**OUTPUTS:**

- descrStats (pandas dataframe) - calculated descriptive stats

In [39]:
def calcCatDescrStats(epiData,colName):
    
    # dictionary for race categories
    raceDict = {
        1:'White (%)',
        2:'Black (%)',
        3:'Native American (%)',
        4:'Asian (%)',
        5:'Pacific Islander (%)',
        6:'Other (%)'
    }
    
    # calculate race distribution in the cohort subset
    raceStats = calcCatPercents(epiData,'b_m_race_eth',raceDict,colName)

    
    # dictionary for ethnicity categories
    ethnicityDit = {
        0:'Non-Hispanic/Latina (%)',
        1:'Hispanic (%)'
    }

    # calculate ethnicity distribution in the cohort subset
    ethnicityStats = calcCatPercents(epiData,'b_m_hispanic',ethnicityDit,colName)
    
    
    # dictionary for education categories
    educDict = {
        1:'less than 8th Grade (%)',
        2:'Up to High School Diploma (%)',
        3:"Up to Bachelor's Degree (%)",
        4:"More than a Bachelor's Degree (%)"
    }

    # calculate education distribution in the cohort subset
    educStats = calcCatPercents(epiData,'b_m_educ2',educDict,colName)
    
    neighDict = {
        0:'low neighborhood income tertile',
        1:'middle neighborhod income tertile',
        2:'high neighborhood income tertile'
    }
    
    # calculate education distribution in the cohort subset
    neighbStats = calcCatPercents(epiData,'neigh_inc_tert',neighDict,colName)
    
    # dictionary for cigarette smoking categories
    cigDict = {
        0:'no reported smoking',
        1:'reported smoking'
    }
    
    # calculate smoking distribution in the cohort
    cigStats = calcCatPercents(epiData,'b_m_cig',cigDict,colName)


    # combine sociodemographic summary statistics into a single dataframe
    descrStats = ps.concat([raceStats,ethnicityStats,educStats,cigStats,neighbStats])
    
    return(descrStats)

### calculate number of continuous birth outcomes in the cohort or a cohort subset ###
**INPUTS:**

- data (pandas dataframe) - contains the cohort subset and metrics data to calculate the number of events for
- keys (string list) - names of the variables to calculate number of events for

**OUTPUTS:**

- an array of number of estimates

In [5]:
def generateEvents(data,keys):
    subsetData = data[keys]
    return(list(subsetData.sum()))

### generate mean estimates of continuous birth metrics in the cohort or a cohort subset ###
**INPUTS:**

- data (pandas dataframe) - contains the cohort subset and metrics data to calculate the mean estimates for
- keys (string list) - names of the variables to calculate mean estimates for

**OUTPUTS:**

- an array of mean estimates

In [6]:
def generateEstimates(data,keys):
    subsetData = data[keys]
    return(list(subsetData.mean()))

### calculate birth events such as low term birth weight and premature birth ###
**INPUTS:**

- cohortData (pandas dataframe) - cohort to calculate birth events for 

**OUTPUTS:**
- cohortData (pandas dataframe) - the input cohort data appended with birth events

In [58]:
def calculateBirthEvents(cohortData):
    # code low term birth weight events
    cohortData.dropna(subset=['median_income_imputeavg5'],inplace=True)
    cohortData['ltbw'] = cohortData['b_wt_cgr']<2500
    cohortData['ltbw'] = cohortData['ltbw'].astype(int)
    
    # calculate preterm birth events
    cohortData['ptb'] = cohortData['b_es_ges'] < 37
    cohortData['ptb'] = cohortData['ptb'].astype(int)
    
    # calculate very preterm birth events
    cohortData['vptb'] = cohortData['b_es_ges'] < 32
    cohortData['vptb'] = cohortData['vptb'].astype(int)
    
    cohortData['neigh_inc_tert'] = ps.qcut(cohortData['median_income_imputeavg5'],3,labels=False)
    
    return cohortData

### calculate summary statistics for an exposure-based subset of the epi cohort ###
**INPUTS:**
- categoryData (pandas dataframe) - the exposure-based subset of the cohort to derive summary statistics for
- colName (string) - name to differentiate the exposure-based subset from the entire cohort

**OUTPUTS:**
- epiDF (pandas dataframe) - contains derived descriptive statistics

In [40]:
def calcDescrStatsOneCategory(categoryData,colName):
    
    # variables containing wind metrics 
    #windKeys = ['allmwn','allcutLn','allcutTr','allcutBd','allpwn','alllen','alltrsh','allblsh']
    windKeys = ['allmwn','allpwn','alltrsh','allblsh']
    
    # variables containing continuous birth outcomes
    birthKeys = ['b_es_ges','b_wt_cgr']
    
    # define the column names for the summary statistics    
    indexNames = [
        'n',
        'max downwind (%)',
        'mean downwind (%)',
        'max downwind tree shielding (%)',
        'max downwind building shielding (%)',
        'estimated gestational age(wk)',
        'birth weight (g)',
        'low term birth weight (n)',
        'preterm birth (n)',
        'very preterm birth (n)'  
    ]
    
    # code low term birth weight, preterm birth, and very preterm birth events
    eventKeys = ['ltbw','ptb','vptb']
    
    # calculate birth events
    categoryData = calculateBirthEvents(categoryData)
    
    # calculate number of participants
    nParticipants = [len(list(set(categoryData['uniqueid'])))]
    
    # derive summary statistics of wind metrics
    epiWind = generateEstimates(categoryData,windKeys)
    
    # derive summary statistics of continuous birth outcomes
    epiBirth = generateEstimates(categoryData,birthKeys)
    
    # derive summary statistics of birth outcome events
    epiEvents = generateEvents(categoryData,eventKeys)
    
    # combine all summary statistics into a single dataframe and rename labels
    nParticipants += epiWind + epiBirth + epiEvents
    epiDF = ps.DataFrame({
        colName:nParticipants
    })
    epiDF.index = indexNames
    catStats = calcCatDescrStats(categoryData,colName)
    epiDF = ps.concat([epiDF,catStats])
    epiDF = epiDF.round(2)
    return(epiDF)

### calculate descriptive statistics for the wind epi cohort and store results in a pandas dataframe ###
**INPUTS:**
- epiData (pandas dataframe) - contains wind epi cohort records <br>

**OUTPUTS:** <br>
- statsDF (pandas dataframe) - descriptive statistics for variables of interest

In [18]:
def createDescrStatsDataset(epiData):
    
    # restrict the cohort to wind exposed and calculate descriptive statistics on the subset
    eposureData = epiData[epiData['windCat']==0]
    statsDF = calcDescrStatsOneCategory(eposureData,'paired exposed')
    
    # restrict the cohort to wind controls (minimal exposure) and calculate descriptive statistics on the subset
    controlData = epiData[epiData['windCat']==1]
    controlStats = calcDescrStatsOneCategory(controlData,'paired control')
    statsDF['paired control'] = controlStats
    
    # calculate descriptive statistics on the entire cohort 
    allStats = calcDescrStatsOneCategory(epiData,'all residences')
    statsDF['all residences'] = allStats['all residences']
    
    return(statsDF)

In [43]:
epiDataRestricted['neigh_inc_tert'].describe()

count    73190.000000
mean         0.999822
std          0.816483
min          0.000000
25%          0.000000
50%          1.000000
75%          2.000000
max          2.000000
Name: neigh_inc_tert, dtype: float64

In [59]:
# calculate descriptive statistics for the cohort restricted 37 to 42 weeks
restrictedDescriptive = createDescrStatsDataset(epiDataRestricted)
restrictedDescriptive.to_csv(const.EPI_FOLDER + "results/epi_4_100_37to42_descriptive.csv",index=True)
restrictedDescriptive

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cohortData['ltbw'] = cohortData['b_wt_cgr']<2500
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cohortData['ltbw'] = cohortData['ltbw'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cohortData['ptb'] = cohortData['b_es_ges'] < 37
A value is trying to be set on a copy of a slice from a 

Unnamed: 0,paired exposed,paired control,all residences
n,36534.0,18081.0,54615.0
max downwind (%),24.54,6.99,15.75
mean downwind (%),11.29,4.58,7.93
max downwind tree shielding (%),4.08,4.06,4.07
max downwind building shielding (%),10.79,10.28,10.53
estimated gestational age(wk),38.9,38.88,38.89
birth weight (g),3329.58,3341.98,3335.79
low term birth weight (n),1023.0,1026.0,2049.0
preterm birth (n),0.0,0.0,0.0
very preterm birth (n),0.0,0.0,0.0


In [68]:
import pandas as ps
#a = ps.read_csv("E:/wind_estimates/vitalWindJoined_Jan_11_22.csv")
#print(a.count())
#print(a.head())
print(a['windCat'].describe())
b = a[a['windCat']==1]
print(b.count())

count    388316.000000
mean          1.250415
std           0.829118
min           0.000000
25%           1.000000
50%           2.000000
75%           2.000000
max           2.000000
Name: windCat, dtype: float64
uniqueid     97024
bs_sex       97024
b_bcntyc     96925
b_btype      97024
b_m_sta      97024
             ...  
sc_250       97024
md_250       97024
matc_250     97024
omat_250     97024
NEAR_DIST    97024
Length: 177, dtype: int64


In [60]:
# calculate descriptive statistics for the unrestricted cohort
allDescriptive = createDescrStatsDataset(epiDataAll)
allDescriptive.to_csv(const.EPI_FOLDER + "results/epi_4_100_all_descriptive.csv",index=True)
allDescriptive

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cohortData['ltbw'] = cohortData['b_wt_cgr']<2500
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cohortData['ltbw'] = cohortData['ltbw'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the 

Unnamed: 0,paired exposed,paired control,all residences
n,40693.0,20066.0,60759.0
max downwind (%),24.6,6.97,15.77
mean downwind (%),11.29,4.56,7.92
max downwind tree shielding (%),4.09,4.06,4.07
max downwind building shielding (%),10.8,10.25,10.52
estimated gestational age(wk),38.44,38.43,38.44
birth weight (g),3243.26,3259.57,3251.43
low term birth weight (n),3009.0,2881.0,5890.0
preterm birth (n),3801.0,3710.0,7511.0
very preterm birth (n),559.0,555.0,1114.0
