# Investigating DHS microdata for Nigeria, Ethiopia, and India
We have microdata from a recent DHS for each location:
- Nigeria 2018 - J:\DATA\DHS_PROG_DHS\NGA\2018
- Ethiopia 2016 - J:\DATA\DHS_PROG_DHS\ETH\2016
- India 2019-2021 - J:\DATA\DHS_PROG_DHS\IND\2019_2021

Questions we're trying to answer in this notebook: 
- Do we have continuous values for hemoglobin (e.g. instead of the categorical data in the DHS Final Reports)? 
- Do we have hemoglobin cross-stratified by wealth quintile and pregnancy? (i.e., in DHS Final Reports, we have hemoglobin in women stratified by pregnancy and separately stratified by wealth quintile)

DHS 7 recode manual for variable definitions: https://www.dhsprogram.com/pubs/pdf/DHSG4/Recode7_DHS_10Sep2018_DHSG4.pdf

Hemoglobin variables of interest: 
- HA53: Hemoglobin level in g/dl with 1 implied decimal.
- HA54: Currently pregnant.
- HA55: Result of hemoglobin measurement.
- HA56: Hemoglobin level adjusted by altitude in g/dl with 1 implied decimal.

Wealth index variables of interest: 
- HV270: The wealth index is a composite measure of a household's cumulative living standard.
The wealth index is calculated using easy-to-collect data on a household’s ownership of
selected assets, such as televisions and bicycles; materials used for housing construction; and
types of water access and sanitation facilities.
Generated with a statistical procedure known as principal components analysis, the wealth
index places individual households on a continuous scale of relative wealth. DHS separates
all interviewed households into five wealth quintiles to compare the influence of wealth on
various population, health and nutrition indicators. The wealth index is presented in the DHS
Final Reports and survey datasets as a background characteristic.
- HV271: Wealth index factor score (5 decimals) 

Pregnancy variables of interest: 
- HML18: Pregnancy status from individual questionnaire. For complete woman’s interviews this is
taken from V213. For incomplete woman's interview with anemia testing the pregnancy
status is taken from this section.
BASE: Women with a completed individual questionnaire or when available information
from the anemia testing section. 


List of datasets: https://www.dhsprogram.com/data/dataset/Nigeria_Standard-DHS_2018.cfm?flag=1

In [1]:
import pandas as pd, numpy as np

%load_ext autoreload
%autoreload 2

!date

Thu Jul 18 06:58:09 PDT 2024


In [2]:
def convert(word):
    return ' '.join(x.capitalize() or ' ' for x in word.split(' '))

In [30]:
%%time

# Load Nigeria dataset with wealth quintile and hemoglobin data
directory = '/snfs1/DATA/DHS_PROG_DHS/NGA/2018/' 
dfn =  pd.read_stata(directory + 'NGA_DHS7_2019_HHM_NGPR7AFL_Y2019M11D05.DTA')
dfn

CPU times: user 3.39 s, sys: 56.2 ms, total: 3.44 s
Wall time: 3.46 s


Unnamed: 0,hhid,hvidx,hv000,hv001,hv002,hv003,hv004,hv005,hv006,hv007,...,idxdis,hdis1,hdis2,hdis3,hdis4,hdis5,hdis6,hdis7,hdis8,hdis9
0,1 1,1,NG7,1,1,1,1,1368354,9,2018,...,1,yes,some difficulty,no,no difficulty hearing,no difficulty communicating,some difficulty,some difficulty,some difficulty,some difficulty
1,1 1,2,NG7,1,1,1,1,1368354,9,2018,...,2,no,no difficulty seeing,no,no difficulty hearing,no difficulty communicating,no difficulty remembering/concentrating,no difficulty walking or climbing,no difficulty washing or dressing,no difficulty
2,1 1,3,NG7,1,1,1,1,1368354,9,2018,...,3,no,no difficulty seeing,no,no difficulty hearing,no difficulty communicating,no difficulty remembering/concentrating,no difficulty walking or climbing,no difficulty washing or dressing,no difficulty
3,1 1,4,NG7,1,1,1,1,1368354,9,2018,...,4,no,no difficulty seeing,no,no difficulty hearing,no difficulty communicating,no difficulty remembering/concentrating,no difficulty walking or climbing,no difficulty washing or dressing,no difficulty
4,1 1,5,NG7,1,1,1,1,1368354,9,2018,...,5,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188005,1400 45,2,NG7,1400,45,1,1400,787007,10,2018,...,2,no,no difficulty seeing,no,no difficulty hearing,no difficulty communicating,no difficulty remembering/concentrating,no difficulty walking or climbing,no difficulty washing or dressing,no difficulty
188006,1400 45,3,NG7,1400,45,1,1400,787007,10,2018,...,3,no,no difficulty seeing,no,no difficulty hearing,no difficulty communicating,no difficulty remembering/concentrating,no difficulty walking or climbing,no difficulty washing or dressing,no difficulty
188007,1400 46,1,NG7,1400,46,1,1400,787007,10,2018,...,1,no,no difficulty seeing,no,no difficulty hearing,no difficulty communicating,no difficulty remembering/concentrating,no difficulty walking or climbing,no difficulty washing or dressing,no difficulty
188008,1400 46,2,NG7,1400,46,1,1400,787007,10,2018,...,2,no,no difficulty seeing,no,no difficulty hearing,no difficulty communicating,no difficulty remembering/concentrating,no difficulty walking or climbing,no difficulty washing or dressing,no difficulty


In [4]:
dfn.hv270.value_counts()
# Wealth quintile column

middle     39784
poorest    39607
poorer     38588
richer     37675
richest    32356
Name: hv270, dtype: int64

In [5]:
dfn.ha53.value_counts()
# Hemoglobin concentration column

112.0    427
120.0    419
113.0    418
118.0    411
108.0    406
        ... 
166.0      1
218.0      1
170.0      1
161.0      1
27.0       1
Name: ha53, Length: 132, dtype: int64
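
The counts above are stored with one implied decimal, so a value of 112.0 corresponds to 11.2 g/dl. A minimal sketch of the conversion, assuming a numeric `ha53`-style column; the helper name `hb_to_g_per_dl` and the toy values are illustrative only and not part of the notebook:

```python
import pandas as pd


def hb_to_g_per_dl(raw: pd.Series) -> pd.Series:
    """Convert a hemoglobin column with 1 implied decimal to g/dl.

    Hypothetical helper: anything that cannot be parsed as a number
    becomes NaN before the division.
    """
    return pd.to_numeric(raw, errors="coerce") / 10.0


# Toy values in the same encoding as ha53 (not real survey records)
raw_hb = pd.Series([112.0, 120.0, 27.0], name="ha53")
hb_g_dl = hb_to_g_per_dl(raw_hb)
print(hb_g_dl.tolist())
```

Keeping the converted values in a separate column (for example `hb_g_dl`) leaves the raw DHS encoding available for checking against the recode manual.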

In [6]:
dfn.hml18.value_counts()
# Pregnancy status column

not pregnant, don't know    37833
pregnant                     4203
Name: hml18, dtype: int64
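
Since `hv270`, `ha53`, and `hml18` all sit in the same Nigeria file, the cross-stratification asked about at the top of the notebook can be sketched as a grouped, weighted mean. This is a rough illustration on made-up rows: using `hv005` (stored with 6 implied decimals) as the weight is an assumption that should be checked against how the DHS Final Reports weight anemia results, and the column names `hb_g_dl`, `weight`, and `hb_x_w` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the household member file; all values are invented.
toy = pd.DataFrame({
    "hv270": ["poorest", "poorest", "richest", "richest", "richest"],
    "hml18": ["pregnant", "not pregnant, don't know", "pregnant",
              "pregnant", "not pregnant, don't know"],
    "ha53": [105.0, 118.0, 112.0, 120.0, 131.0],
    "hv005": [1000000, 1000000, 500000, 1500000, 1000000],
})

toy["hb_g_dl"] = toy["ha53"] / 10.0          # 1 implied decimal
toy["weight"] = toy["hv005"] / 1_000_000     # 6 implied decimals
toy["hb_x_w"] = toy["hb_g_dl"] * toy["weight"]

# Weighted mean = sum(hb * w) / sum(w) within each wealth x pregnancy cell
valid = toy.dropna(subset=["hb_g_dl", "hml18", "hv270"])
sums = valid.groupby(["hv270", "hml18"])[["hb_x_w", "weight"]].sum()
mean_hb = (sums["hb_x_w"] / sums["weight"]).rename("mean_hb_g_dl")
print(mean_hb)
```

The same grouping works with `.size()` instead of the weighted mean to see how many measured women fall in each cell.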

In [7]:
%%time

# Load Ethiopia dataset with wealth index and hemoglobin data
directory = '/snfs1/DATA/DHS_PROG_DHS/ETH/2016/' 
dfe =  pd.read_stata(directory + 'ETH_DHS7_2016_HHM_ETPR71FL_Y2019M12D11.DTA')
dfe

CPU times: user 1.28 s, sys: 26.6 ms, total: 1.31 s
Wall time: 1.31 s


Unnamed: 0,hhid,hvidx,hv000,hv001,hv002,hv003,hv004,hv005,hv006,hv007,...,hb61,hb62,hb63,hb64,hb65,hb66,hb67,hb68,hb69,hb70
0,1 15,1,ET7,1,15,1,1,5056023,9,2008,...,,,,,,,,,,
1,1 15,2,ET7,1,15,1,1,5056023,9,2008,...,,,,,,,,,,
2,1 15,3,ET7,1,15,1,1,5056023,9,2008,...,granted,Z1Y4E,blood taken,granted,not at home,no education,,no education,,1391.0
3,1 15,4,ET7,1,15,1,1,5056023,9,2008,...,,,,,,,,,,
4,1 15,5,ET7,1,15,1,1,5056023,9,2008,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75219,645 464,2,ET7,645,464,1,645,664627,5,2008,...,,,,,,,,,,
75220,645 464,3,ET7,645,464,1,645,664627,5,2008,...,,,,,,,,,,
75221,645 464,4,ET7,645,464,1,645,664627,5,2008,...,,,,,,,,,,
75222,645 485,1,ET7,645,485,2,645,664627,5,2008,...,granted,Z7N9E,blood taken,granted,completed,technical/ vocational,3.0,more than secondary,,3191.0


In [8]:
dfe.ha53.value_counts()
# Hemoglobin concentration column

not present    1215
refused         913
135.0           378
140.0           377
144.0           356
               ... 
188.0             1
198.0             1
196.0             1
193.0             1
35.0              1
Name: ha53, Length: 156, dtype: int64
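
In the Ethiopia file the hemoglobin column mixes numeric readings with text labels such as `not present` and `refused`, so it has to be made numeric before any averaging. A small sketch under the assumption that the labels carry no measurement and can be treated as missing; the toy series below is invented:

```python
import pandas as pd

# Toy column in the same shape as dfe.ha53: numbers mixed with text labels
raw_hb = pd.Series([135.0, "refused", 140.0, "not present", 35.0],
                   dtype=object, name="ha53")

# astype(object) also covers the case where read_stata returned a categorical
hb_g_dl = pd.to_numeric(raw_hb.astype(object), errors="coerce") / 10.0

# Number of entries that were labels rather than measurements
n_labels = int(hb_g_dl.isna().sum() - raw_hb.isna().sum())
print(hb_g_dl.tolist(), n_labels)
```

Loading the file with `convert_categoricals=False`, as done for the later files in this notebook, is another way to avoid the text labels, at the cost of having to look up the numeric codes for the special values.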

In [9]:
# dfe.hml18 .value_counts()
# Pregnancy status column - Not present in this data file 

In [10]:
dfe.hv270.value_counts()
# Wealth quintile column

poorest    22929
richest    20988
poorer     11138
richer     10092
middle     10077
Name: hv270, dtype: int64

In [11]:
dfe.columns.tolist()

['hhid',
 'hvidx',
 'hv000',
 'hv001',
 'hv002',
 'hv003',
 'hv004',
 'hv005',
 'hv006',
 'hv007',
 'hv008',
 'hv008a',
 'hv009',
 'hv010',
 'hv011',
 'hv012',
 'hv013',
 'hv014',
 'hv015',
 'hv016',
 'hv017',
 'hv018',
 'hv019',
 'hv020',
 'hv021',
 'hv022',
 'hv023',
 'hv024',
 'hv025',
 'hv026',
 'hv027',
 'hv028',
 'hv030',
 'hv031',
 'hv032',
 'hv035',
 'hv040',
 'hv041',
 'hv042',
 'hv044',
 'hv045a',
 'hv045b',
 'hv045c',
 'hv046',
 'hv801',
 'hv802',
 'hv803',
 'hv804',
 'hv807d',
 'hv807m',
 'hv807y',
 'hv807c',
 'hv807a',
 'hv201',
 'hv202',
 'hv201a',
 'hv204',
 'hv205',
 'hv206',
 'hv207',
 'hv208',
 'hv209',
 'hv210',
 'hv211',
 'hv212',
 'hv213',
 'hv214',
 'hv215',
 'hv216',
 'hv217',
 'hv218',
 'hv219',
 'hv220',
 'hv221',
 'hv225',
 'hv226',
 'hv227',
 'hv228',
 'hv230a',
 'hv230b',
 'hv232',
 'hv232b',
 'hv232c',
 'hv232d',
 'hv232e',
 'hv232y',
 'hv234',
 'hv234a',
 'hv235',
 'hv236',
 'hv237',
 'hv237a',
 'hv237b',
 'hv237c',
 'hv237d',
 'hv237e',
 'hv237f',
 'hv237

In [12]:
dfe.hv001.unique()

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
        80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105,
       106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
       119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
       132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
       145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157,
       158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
       171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 18

In [13]:
dfe.hv002.unique()

array([ 15,  17,  18,  25,  37,  40,  45,  65,  90, 125, 133, 135, 183,
       207, 212, 214, 227, 263, 273, 279, 316, 358, 400, 410, 439, 448,
       449,   8,  29,  59,  97, 103, 140, 165, 168, 171, 176, 189, 206,
       234, 235, 340, 399, 424, 437, 463,  14,  22,  26,  31,  67, 100,
       115, 127, 131, 138, 141, 143, 188, 241, 243, 250, 286, 300, 317,
       363, 368, 382, 401, 433, 447,   7,  39,  72, 112, 116, 154, 162,
       238, 240, 287, 335, 336, 418, 442,   4,  20,  43,  75,  99, 106,
       136, 148, 173, 182, 213, 228, 230, 236, 245, 276, 281, 289, 318,
       322, 348, 385, 416, 486,  94, 123, 152, 198, 223, 252, 267, 292,
       294, 296, 311, 320, 349, 373, 374, 378, 389, 398, 440, 443, 461,
       462,  24,  48,  51,  92, 174, 190, 199, 232, 297, 312, 319, 330,
       334, 337, 347, 372, 386, 417, 429, 480, 108, 124, 129, 139, 160,
       256, 277, 324, 332, 353, 390, 394, 396, 430, 483,   5,   9,  50,
        60,  78, 205, 239, 249, 259, 328, 329, 438, 460, 468, 47

In [14]:
%%time

# Load Ethiopia dataset with individual women survey results 
directory = '/snfs1/DATA/DHS_PROG_DHS/ETH/2016/' 
dfe1 =  pd.read_stata(directory + 'ETH_DHS7_2016_WN_ETIR71FL_Y2019M12D11.DTA', convert_categoricals=False)
dfe1

CPU times: user 50.5 s, sys: 53.1 s, total: 1min 43s
Wall time: 1min 43s


Unnamed: 0,caseid,v000,v001,v002,v003,v004,v005,v006,v007,v008,...,sd508va_3,sd508va_4,sd508va_5,sd508va_6,sm508va_1,sm508va_2,sm508va_3,sm508va_4,sm508va_5,sm508va_6
0,1 17 2,ET7,1,17,2,1,5087433,9,2008,1305,...,,,,,,,,,,
1,1 17 3,ET7,1,17,3,1,5087433,9,2008,1305,...,,,,,,,,,,
2,1 18 2,ET7,1,18,2,1,5087433,9,2008,1305,...,,,,,,,,,,
3,1 25 2,ET7,1,25,2,1,5087433,9,2008,1305,...,,,,,,,,,,
4,1 25 10,ET7,1,25,10,1,5087433,9,2008,1305,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15678,645 419 4,ET7,645,419,4,645,685395,5,2008,1301,...,,,,,,,,,,
15679,645 449 2,ET7,645,449,2,645,685395,5,2008,1301,...,,,,,0.0,,,,,
15680,645 464 2,ET7,645,464,2,645,685395,5,2008,1301,...,,,,,0.0,,,,,
15681,645 464 4,ET7,645,464,4,645,685395,5,2008,1301,...,,,,,,,,,,


In [15]:
dfe1.v213.value_counts()
# Pregnancy status column
# 1 is currently pregnant, 0 is not pregnant (and maybe also unknown)

0    14561
1     1122
Name: v213, dtype: int64

In [16]:
dfe1.v001.unique()

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
        80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105,
       106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
       119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
       132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
       145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157,
       158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
       171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 18

In [17]:
# For Ethiopia, we don't have pregnancy status in the same dataset as our hemoglobin and wealth status columns.
# ETH_DHS7_2016_HH_ETHR71FL_Y2019M12D11.DTA doesn't have pregnancy either

# Info on merging datasets from DHS here: https://dhsprogram.com/data/Merging-Datasets.cfm

In [18]:
dfe2 = pd.merge(dfe, dfe1[['v001', 'v002', 'v213']], left_on=['hv001', 'hv002'], right_on=['v001', 'v002'], how='left')
dfe2

Unnamed: 0,hhid,hvidx,hv000,hv001,hv002,hv003,hv004,hv005,hv006,hv007,...,hb64,hb65,hb66,hb67,hb68,hb69,hb70,v001,v002,v213
0,1 15,1,ET7,1,15,1,1,5056023,9,2008,...,,,,,,,,,,
1,1 15,2,ET7,1,15,1,1,5056023,9,2008,...,,,,,,,,,,
2,1 15,3,ET7,1,15,1,1,5056023,9,2008,...,granted,not at home,no education,,no education,,1391.0,,,
3,1 15,4,ET7,1,15,1,1,5056023,9,2008,...,,,,,,,,,,
4,1 15,5,ET7,1,15,1,1,5056023,9,2008,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97976,645 464,3,ET7,645,464,1,645,664627,5,2008,...,,,,,,,,645.0,464.0,0.0
97977,645 464,4,ET7,645,464,1,645,664627,5,2008,...,,,,,,,,645.0,464.0,0.0
97978,645 464,4,ET7,645,464,1,645,664627,5,2008,...,,,,,,,,645.0,464.0,0.0
97979,645 485,1,ET7,645,485,2,645,664627,5,2008,...,granted,completed,technical/ vocational,3.0,more than secondary,,3191.0,645.0,485.0,0.0


In [21]:
dfe2.v213.value_counts()
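# Now we have pregnancy, Hb, and wealth status all in the same df for Ethiopia! We could rename dfe2['v213'] to 
# dfe2['hml18'] if that makes things easier to deal with - but skipping that for now.
# Caveat: the merge keys on cluster and household only (hv001/hv002 = v001/v002), so every household member is
# paired with every interviewed woman in that household. That is why the row count grows from 75,223 to 97,980
# and why the counts below (85,546 non-null values) exceed the 15,683 women in dfe1. A person-level match would
# also need the line number (hvidx = v003).

# 1.0 is pregnant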

0.0    79908
1.0     5638
Name: v213, dtype: int64
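
A person-level merge can be sketched by adding the household line number to the keys, so each woman's `v213` attaches only to her own row. The example below uses invented toy frames rather than the real files, and it assumes `hvidx` in the household member file and `v003` in the women's file refer to the same line number, as the DHS merging guide linked above describes:

```python
import pandas as pd

# Toy household with four members; two of them were interviewed as women
members = pd.DataFrame({
    "hv001": [1, 1, 1, 1],
    "hv002": [15, 15, 15, 15],
    "hvidx": [1, 2, 3, 4],
    "ha53": [None, 121.0, 135.0, None],
})
women = pd.DataFrame({
    "v001": [1, 1],
    "v002": [15, 15],
    "v003": [2, 3],
    "v213": [0, 1],
})

# Person-level merge: cluster + household + line number
person_level = members.merge(
    women,
    left_on=["hv001", "hv002", "hvidx"],
    right_on=["v001", "v002", "v003"],
    how="left",
    validate="one_to_one",  # raises if either side has duplicate keys
)

# Household-level merge, for comparison: rows multiply
household_level = members.merge(
    women[["v001", "v002", "v213"]],
    left_on=["hv001", "hv002"],
    right_on=["v001", "v002"],
    how="left",
)

print(len(members), len(person_level), len(household_level))
```

With the line number in the keys the merged frame keeps one row per household member, and members who were not interviewed simply get a missing `v213`.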

In [26]:
%%time

# Load India dataset with wealth quintile and hemoglobin data
# Note: this loads the 2015-2016 round, not the 2019-2021 round listed at the top of the notebook.
directory = '/snfs1/DATA/DHS_PROG_DHS/IND/2015_2016/' 
dfi =  pd.read_stata(directory + 'IND_DHS7_2015_2016_HHM_IAPR74FL_Y2018M12D06.DTA', convert_categoricals = False)
dfi

CPU times: user 53.9 s, sys: 44.8 s, total: 1min 38s
Wall time: 1min 44s


Unnamed: 0,hhid,hvidx,hv000,hv001,hv002,hv003,hv004,hv005,hv006,hv007,...,hml32a,hml32b,hml32c,hml32d,hml32e,hml32f,hml32g,hml33,hml34,hml35
0,01000101,1,IA6,10001,1,3,1,191072,7,2015,...,,,,,,,,,,
1,01000101,2,IA6,10001,1,3,1,191072,7,2015,...,,,,,,,,,,
2,01000101,3,IA6,10001,1,3,1,191072,7,2015,...,,,,,,,,,,
3,01000101,4,IA6,10001,1,3,1,191072,7,2015,...,,,,,,,,,,
4,01000109,1,IA6,10001,9,3,1,191072,7,2015,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2869038,36048285,1,IA6,360482,85,1,482,2270734,5,2015,...,,,,,,,,,,
2869039,36048285,2,IA6,360482,85,1,482,2270734,5,2015,...,,,,,,,,,,
2869040,36048296,1,IA6,360482,96,2,482,2270734,5,2015,...,,,,,,,,,,
2869041,36048296,2,IA6,360482,96,2,482,2270734,5,2015,...,,,,,,,,,,


In [27]:
dfi.hml18.value_counts()
# Pregnancy status column 
# 1.0 is pregnant

0.0    679788
1.0     33167
Name: hml18, dtype: int64

In [28]:
dfi.hv270.value_counts()
# Wealth quintile column

2    625444
1    609790
3    586096
4    535071
5    512642
Name: hv270, dtype: int64
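
Because the India file was read with `convert_categoricals=False`, `hv270` shows numeric codes instead of the labels seen for Nigeria and Ethiopia. A minimal sketch of mapping them back, assuming the usual coding of 1 = poorest through 5 = richest (worth confirming against the value labels in the .DTA file); `WEALTH_LABELS` is a name introduced here for illustration:

```python
import pandas as pd

# Assumed coding for hv270; confirm against the file's value labels
WEALTH_LABELS = {1: "poorest", 2: "poorer", 3: "middle", 4: "richer", 5: "richest"}

# Toy codes, not real survey rows
codes = pd.Series([2, 1, 3, 4, 5, 1], name="hv270")

quintile_dtype = pd.CategoricalDtype(list(WEALTH_LABELS.values()), ordered=True)
quintile = codes.map(WEALTH_LABELS).astype(quintile_dtype)

# sort=False keeps the poorest-to-richest order of the categories
print(quintile.value_counts(sort=False))
```

An ordered categorical keeps tables and plots in poorest-to-richest order rather than alphabetical or frequency order.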

In [29]:
dfi.ha53.value_counts()
# Hemoglobin concentration column

123.0    20810
121.0    20672
122.0    20266
124.0    19749
112.0    19712
         ...  
207.0        1
205.0        1
201.0        1
209.0        1
203.0        1
Name: ha53, Length: 211, dtype: int64

# Check for other data needs in DHS microdata
Data needs and associated columns in DHS files (if present):
- Serum/plasma folate levels at baseline: NOT PRESENT IN DHS
- RBC folate levels at baseline: NOT PRESENT IN DHS
- LBWSG prevalence
    - M19:  Weight of child at birth given in kilograms with three implied decimal places (or grams with
    no decimal places). Children who were not weighed are coded 9996. In some countries, the
    birth weight was collected in grams, i.e. a total of four digits, whereas other countries
    collected the weight in kilograms to one decimal place, i.e. a total of two digits. In the latter
    case, the third and fourth digits are set to zeros. In a few countries, the weight was collected
    in pounds and/or ounces. For these countries, the original weight variables are stored as a
    country-specific variable and this variable contains the weight converted to kilograms. 
- CGF prevalence: I'm not sure which of these columns will be most applicable to calculating CGF prevalence by wealth quintile 
    - HC70: Height for age standard deviation (according to WHO)
    - HC71: Weight for age standard deviation (according to WHO)
    - HC72: Weight for height standard deviations (according to WHO)
    - HC4: Height/Age Percentile
    - HC5: Height/Age Standard deviations
    - HC6: Height/Age Percent of ref. median
    - HC7: Weight/Age Percentile
    - HC8: Weight/Age Standard deviations
    - HC9: Weight/Age Percent of ref. median
    - HC10: Weight/Height Percentile
    - HC11: Weight/Height Standard deviations
    - HC12: Weight/Height Percent of ref. median 
- Measles prevalence
    - H9: Measles 1 vaccination. As for H2, H2D, H2M, H2Y. There was only one measles
      vaccination prior to the DHS VII recode.  
    - H9A: Measles 2 vaccination. As for H2, H2D, H2M, H2Y. 
- LRI prevalence: I'm not sure whether the columns we have for this will suffice. The DHS Final Reports, though, do include data on acute respiratory infection. 
    - H31: Whether the child had suffered from a cough in the last two weeks and whether the child had
        been ill with the cough in the last 24 hours. Code 1 indicates that the child had been ill in the
        last 24 hours; code 2 indicates that the child had been ill with the cough in the last two
        weeks. Code 1 is country specific for surveys after DHS II. In case code 1 is used, code 2
        indicates that the child had cough in last two weeks but not in the last 24 hours.
    - H31B: Whether the child had suffered from rapid breathing.
        BASE: All living children born in the last 5 years. In previous recodes the base was
        restricted to children who had suffered from cough.
    - H31C: Whether the child has a problem in the chest or a blocked or running nose. 
- Malaria prevalence
    - For Nigeria and Ethiopia, we have malaria prevalence data and do not have to rely on these DHS household surveys.
    - However, for India, we may have to use fever prevalence as a proxy for malaria prevalence. 
    - H22: Whether the child had fever in the last two weeks.
- U5 hemoglobin levels at baseline
    - HC53: Hemoglobin level (g/dl - 1 decimal)
    - HC55: Result of measuring (Hemoglobin)
    - HC56: Hemoglobin level adjusted by altitude in g/dl with 1 implied decimal
    - HC57: Anemia level. Levels below 7.0 g/dl are considered severe anemia, levels between 7.1 g/dl and
        9.9 g/dl are considered moderate anemia, and levels between 10.0 g/dl and 10.9 g/dl are
        considered mild anemia. 
- Maternal disorders incidence: NOT PRESENT IN DHS 
- Maternal hemorrhage incidence: NOT PRESENT IN DHS
- Neural tube defect deaths: NOT PRESENT IN DHS
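
The HC57 cut-offs listed above can be applied directly to a continuous hemoglobin column. The following is a sketch only: it assumes the input uses the 1-implied-decimal encoding of HC53/HC56, it places a reading of exactly 7.0 g/dl in the moderate group (the description above leaves that value unassigned), and the `not anemic` label and the function name are introduced here for illustration.

```python
import numpy as np
import pandas as pd


def classify_child_anemia(hb_raw: pd.Series) -> pd.Series:
    """Bin child hemoglobin (1 implied decimal) into anemia categories.

    Hypothetical helper; thresholds follow the HC57 description:
    < 7.0 severe, 7.0-9.9 moderate, 10.0-10.9 mild, otherwise not anemic.
    """
    hb_g_dl = pd.to_numeric(hb_raw, errors="coerce") / 10.0
    return pd.cut(
        hb_g_dl,
        bins=[-np.inf, 7.0, 10.0, 11.0, np.inf],
        right=False,  # intervals are closed on the left: [7.0, 10.0) etc.
        labels=["severe", "moderate", "mild", "not anemic"],
    )


# Toy readings, not real survey rows
raw_hb = pd.Series([65.0, 70.0, 99.0, 105.0, 121.0, np.nan], name="hc56")
anemia = classify_child_anemia(raw_hb)
print(anemia.tolist())
```

Comparing the result against the file's own `hc57` column would be a quick check that the thresholds and the choice of adjusted versus unadjusted hemoglobin match what DHS used.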

In [35]:
columns = ['hc53','hc55','hc56','hc57','h22','h31','h9','hc70','hc72','hc4','hc5','hc6','hc7','hc8','hc9','hc10','hc11','hc12']
dhs_dfs = {'dfe2': dfe2, 'dfi': dfi, 'dfn': dfn}

for df_name, df in dhs_dfs.items():
    print(f"Checking in DataFrame: {df_name}")
    for column in columns:
        if column in df.columns:
            print(f"Column {column} exists.")
        else:
            print(f"Column {column} does not exist.")
    print() 

Checking in DataFrame: dfe2
Column hc53 exists.
Column hc55 exists.
Column hc56 exists.
Column hc57 exists.
Column h22 does not exist.
Column h31 does not exist.
Column h9 does not exist.
Column hc70 exists.
Column hc72 exists.
Column hc4 exists.
Column hc5 exists.
Column hc6 exists.
Column hc7 exists.
Column hc8 exists.
Column hc9 exists.
Column hc10 exists.
Column hc11 exists.
Column hc12 exists.

Checking in DataFrame: dfi
Column hc53 exists.
Column hc55 exists.
Column hc56 exists.
Column hc57 exists.
Column h22 does not exist.
Column h31 does not exist.
Column h9 does not exist.
Column hc70 exists.
Column hc72 exists.
Column hc4 exists.
Column hc5 exists.
Column hc6 exists.
Column hc7 exists.
Column hc8 exists.
Column hc9 exists.
Column hc10 exists.
Column hc11 exists.
Column hc12 exists.

Checking in DataFrame: dfn
Column hc53 exists.
Column hc55 exists.
Column hc56 exists.
Column hc57 exists.
Column h22 does not exist.
Column h31 does not exist.
Column h9 does not exist.
Column h
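
The column check above prints one line per column per DataFrame, which gets long quickly. As an alternative, a small helper can return the same information as a single True/False table. This is a sketch with a hypothetical function name (`column_presence`) and empty toy frames standing in for the real files:

```python
import pandas as pd


def column_presence(frames: dict, columns: list) -> pd.DataFrame:
    """Return a table with one row per column name and one column per DataFrame."""
    return pd.DataFrame(
        {name: [col in df.columns for col in columns] for name, df in frames.items()},
        index=columns,
    )


# Empty toy frames that only carry column names
toy_frames = {
    "household_members": pd.DataFrame(columns=["hv270", "hc53", "hc70"]),
    "children": pd.DataFrame(columns=["h22", "h31", "h9"]),
}

presence = column_presence(toy_frames, ["hc53", "h22", "h9"])
print(presence)
```

Rows where every entry is False point to variables that need a different recode file.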

In [36]:
# The same 3 columns are missing from every location's dataset that I loaded. I think this is because they are 
# in the children's survey data files and not the overall household files. I will double-check this.
# h22 (whether child has had fever in last 2 weeks) - we only want this for India anyway
# h31 (whether child has had coughing/chest problems in last 2 weeks) - not sure we will use this anyway, proxy for LRI
# h9 (measles 1 vaccination)

In [38]:
# Load India children's survey data: 

directory = '/snfs1/DATA/DHS_PROG_DHS/IND/2015_2016/' 
dfi1 =  pd.read_stata(directory + 'IND_DHS7_2015_2016_CH_IAKR74FL_Y2018M12D06.DTA', convert_categoricals = False)
dfi1

Unnamed: 0,caseid,midx,v000,v001,v002,v003,v004,v005,v006,v007,...,s560,s561,s562,s563a,s563b,s563c,s564,s565a,s565b,s565c
0,01000110 02,1,IA6,10001,10,2,1,191760,7,2015,...,,,0.0,,,,0.0,,,
1,01000123 02,1,IA6,10001,23,2,1,191760,7,2015,...,,,0.0,,,,0.0,,,
2,01000227 04,1,IA6,10002,27,4,2,8798,7,2015,...,2.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,01000227 04,2,IA6,10002,27,4,2,8798,7,2015,...,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,01000258 02,1,IA6,10002,58,2,2,8798,7,2015,...,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259622,36048211 04,2,IA6,360482,11,4,482,2380715,5,2015,...,2.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
259623,36048218 02,1,IA6,360482,18,2,482,2380715,5,2015,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
259624,36048261 07,1,IA6,360482,61,7,482,2380715,5,2015,...,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
259625,36048275 02,1,IA6,360482,75,2,482,2380715,5,2015,...,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [40]:
# Load Ethiopia children's survey data: 

directory = '/snfs1/DATA/DHS_PROG_DHS/ETH/2016/' 
dfe3 =  pd.read_stata(directory + 'ETH_DHS7_2016_CH_ETKR71FL_Y2019M12D11.DTA', convert_categoricals = False)
dfe3

Unnamed: 0,caseid,midx,v000,v001,v002,v003,v004,v005,v006,v007,...,sd508nc,sm508nc,sd508ra,sm508ra,sd508rb,sm508rb,sd508ma,sm508ma,sd508va,sm508va
0,1 17 2,1,ET7,1,17,2,1,5087433,9,2008,...,,,,,,,,,,
1,1 17 2,2,ET7,1,17,2,1,5087433,9,2008,...,,,,,,,,,,
2,1 17 2,3,ET7,1,17,2,1,5087433,9,2008,...,,,,,,,,,,
3,1 18 2,1,ET7,1,18,2,1,5087433,9,2008,...,,,,,,,,,,
4,1 25 2,1,ET7,1,25,2,1,5087433,9,2008,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10636,645 364 2,1,ET7,645,364,2,645,685395,5,2008,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10637,645 364 2,2,ET7,645,364,2,645,685395,5,2008,...,,,,,,,,,,
10638,645 396 1,1,ET7,645,396,1,645,685395,5,2008,...,0.0,0.0,9.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
10639,645 449 2,1,ET7,645,449,2,645,685395,5,2008,...,25.0,2.0,6.0,13.0,28.0,1.0,0.0,0.0,0.0,0.0


In [42]:
# Load Nigeria children's survey data: 

directory = '/snfs1/DATA/DHS_PROG_DHS/NGA/2018/' 
dfn2 =  pd.read_stata(directory + 'NGA_DHS7_2018_CH_NGKR7AFL_Y2019M11D05.DTA', convert_categoricals = False)
dfn2

Unnamed: 0,caseid,midx,v000,v001,v002,v003,v004,v005,v006,v007,...,s434ib,s434ic,s434id,s434ie,s434if,s434ig,s434ix,s434iz,s434k,s434l
0,1 1 2,1,NG7,1,1,2,1,1335530,9,2018,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,1 11 1,1,NG7,1,11,1,1,1335530,9,2018,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,1 25 2,1,NG7,1,25,2,1,1335530,9,2018,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,1 35 2,1,NG7,1,35,2,1,1335530,9,2018,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,1 40 5,1,NG7,1,40,5,1,1335530,9,2018,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33919,1400 19 2,2,NG7,1400,19,2,1400,768129,10,2018,...,,,,,,,,,,
33920,1400 29 1,1,NG7,1400,29,1,1400,768129,10,2018,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,101.0
33921,1400 33 1,1,NG7,1400,33,1,1400,768129,10,2018,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,101.0
33922,1400 38 2,1,NG7,1400,38,2,1400,768129,10,2018,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


In [43]:
columns = ['h22','h31','h9']
dhs_dfs = {'dfi1': dfi1,'dfe3': dfe3,'dfn2': dfn2}

for df_name, df in dhs_dfs.items():
    print(f"Checking in DataFrame: {df_name}")
    for column in columns:
        if column in df.columns:
            print(f"Column {column} exists.")
        else:
            print(f"Column {column} does not exist.")
    print() 
# Yep, we have them! So if we do want to use these variables, we can join them to our existing dfs like we did with the 
# Ethiopia pregnancy status column, but I will skip doing that for now. 

Checking in DataFrame: dfi1
Column h22 exists.
Column h31 exists.
Column h9 exists.

Checking in DataFrame: dfe3
Column h22 exists.
Column h31 exists.
Column h9 exists.

Checking in DataFrame: dfn2
Column h22 exists.
Column h31 exists.
Column h9 exists.
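
If these child-level columns are needed alongside wealth and hemoglobin, one way to join them is by cluster, household, and the child's household line number. The sketch below uses invented toy frames. It assumes that `b16` in the children's file holds the child's line number in the household and corresponds to `hvidx`; that assumption should be confirmed in the recode manual and the DHS merging guide before use.

```python
import pandas as pd

# Toy household member rows (one household, three members)
members = pd.DataFrame({
    "hv001": [1, 1, 1],
    "hv002": [17, 17, 17],
    "hvidx": [1, 2, 3],
    "hv270": ["poorest", "poorest", "poorest"],
})

# Toy children's file row; b16 is ASSUMED to be the child's household line number
children = pd.DataFrame({
    "v001": [1],
    "v002": [17],
    "b16": [3],
    "h22": [1],
    "h31": [0],
    "h9": [1],
})

with_child_vars = members.merge(
    children,
    left_on=["hv001", "hv002", "hvidx"],
    right_on=["v001", "v002", "b16"],
    how="left",
    validate="one_to_one",  # raises if a child matches more than one member row
)
print(with_child_vars[["hvidx", "hv270", "h22", "h31", "h9"]])
```

Children who are not listed in the household roster would not find a match under this scheme, so the number of unmatched rows is worth inspecting after the merge.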

