## Identify updated / new DHS surveys

Compare the surveys listed on the DHS website with those already in our database.

Any that are missing may be new, may be ones we lack access to, or may be ones we have simply chosen not to use.



In [2]:
import pandas as pd
import pandas.io.sql as psql
#import psycopg2 as pg
from sqlalchemy import create_engine
import sqlalchemy as sa
import os
import glob

In [3]:
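with open('config/pg_conn.txt') as conn_details:
    # Line 1: psycopg2-style connection string; line 2: SQLAlchemy URL.
    # strip() removes the trailing newline that readline() keeps.
    conn_str_psyco = conn_details.readline().strip()
    conn_str_sqlalchemy = conn_details.readline().strip()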

In [4]:
engine = create_engine(conn_str_sqlalchemy)

We maintain a CSV of known issues to explain discrepancies between the surveys listed as available and our actual holdings.

In [5]:
known_issues_all = pd.read_csv('~/notebooks/DHS-DataExtraction/DHS_Preprocess_And_Ingestion/01_Check_Survey_Updates/missing_and_unavailable.csv')
known_issues_all.head()

Unnamed: 0,surveynum,country,year,issue,data issue,gps issue,change expected,last manual check
0,23,botswana,1988,,restricted data,not collected,no,04/02/2019
1,73,eritrea,1995,,restricted data,not collected,no,04/02/2019
2,103,yemen,1997,,restricted data,not collected,no,04/02/2019
3,179,turkmenistan,2000,,not in public domain,not collected,no,04/02/2019
4,160,mauritania,2000,,restricted data,not collected,no,04/02/2019


### Available surveys

Query the DHS API for available surveys (may include ones our account does not have download access to)

In [10]:
# Note: at present this returns all surveys in a single response. In the past it has
# been paginated; if that happens again we will have to loop over the pages.
surveys = pd.read_csv("http://api.dhsprogram.com/rest/dhs/surveys?f=csv",
                      usecols=['SurveyId', 'SurveyNum', 'ReleaseDate', 'SurveyCharacteristicIds', 'CountryName', 'RegionName'])
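
If the endpoint starts paginating again, the loop mentioned in the comment could look roughly like the sketch below. It assumes that the JSON form of the response carries `Data` and `TotalPages` fields; those field names are assumptions and should be checked against the current API documentation. `fetch_page` is a hypothetical callable, introduced here so the looping logic can be exercised without network access.

```python
def fetch_all_pages(fetch_page):
    """Collect records from every page of a paginated response.

    fetch_page(page_number) is a hypothetical callable returning a dict
    with a 'Data' list and a 'TotalPages' count (assumed field names).
    """
    records = []
    page = 1
    while True:
        payload = fetch_page(page)
        records.extend(payload.get('Data', []))
        # Stop once the last advertised page has been read
        if page >= payload.get('TotalPages', 1):
            break
        page += 1
    return records


# Stand-in for the real API: three records split over two pages
_fake_pages = {
    1: {'TotalPages': 2, 'Data': [{'SurveyNum': 471}, {'SurveyNum': 327}]},
    2: {'TotalPages': 2, 'Data': [{'SurveyNum': 525}]},
}
all_records = fetch_all_pages(lambda page: _fake_pages[page])
```

With real data, `fetch_page` would wrap a request for a single page, and `pd.DataFrame(all_records)` would give a frame comparable to `surveys`.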

In [11]:
surveys.head() 

Unnamed: 0,SurveyId,CountryName,SurveyNum,RegionName,ReleaseDate,SurveyCharacteristicIds
0,AF2015DHS,Afghanistan,471,South & Southeast Asia,2017-02-15,"92, 39, 6, 1, 101, 79, 31, 14, 57, 8, 34, 17, ..."
1,AL2008DHS,Albania,327,North Africa/West Asia/Europe,2010-03-15,"33, 10, 43, 41, 31, 22, 26, 24, 21, 59, 11, 34..."
2,AL2017DHS,Albania,525,North Africa/West Asia/Europe,2019-01-11,"70, 93, 33, 41, 10, 43, 11, 34, 5, 60, 67, 31,..."
3,AO2006MIS,Angola,282,Sub-Saharan Africa,2008-02-27,"8, 41, 57, 36, 68, 63, 61, 26, 101, 79, 89"
4,AO2011MIS,Angola,395,Sub-Saharan Africa,2012-01-24,"101, 79, 90, 41, 26, 57, 8, 17, 89"


### Updated surveys

Query the DHS API for surveys updated since 1 September 2015 (`lastUpdate=20150901`). This may include ones our account does not have download access to.

In [7]:
recent_updates = pd.read_csv("http://api.dhsprogram.com/rest/dhs/dataupdates?lastUpdate=20150901&f=csv",
                            usecols=['SurveyId', 'SurveyYear', 'DHS_CountryCode', 'CountryName', 'UpdateDate'])

In [8]:
recent_updates.head()

Unnamed: 0,SurveyId,SurveyYear,DHS_CountryCode,CountryName,UpdateDate
0,TD2014DHS,2014.0,TD,Chad,2016-07-15 19:17:14.0
1,ZW2010DHS,2010.0,ZW,Zimbabwe,2016-12-07 20:54:25.0
2,GH2016MIS,2016.0,GH,Ghana,2017-07-27 17:52:17.0
3,ET2016DHS,2016.0,ET,Ethiopia,2017-08-16 21:17:18.0
4,AM2016DHS,2016.0,AM,Armenia,2017-08-30 17:46:51.0


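The recent updates API endpoint doesn't return all the details we need, so merge it with the data from the main surveys endpoint. The result has one row per update, so a survey that has been updated more than once (e.g. AM2016DHS) appears more than once.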

In [12]:
potential_recent_updates = recent_updates.merge(surveys, how='inner', on='SurveyId')
potential_recent_updates

Unnamed: 0,SurveyId,SurveyYear,DHS_CountryCode,CountryName_x,UpdateDate,CountryName_y,SurveyNum,RegionName,ReleaseDate,SurveyCharacteristicIds
0,TD2014DHS,2014.0,TD,Chad,2016-07-15 19:17:14.0,Chad,465,Sub-Saharan Africa,2016-07-11,"10, 33, 67, 32, 39, 34, 5, 6, 4, 1, 101, 104, ..."
1,ZW2010DHS,2010.0,ZW,Zimbabwe,2016-12-07 20:54:25.0,Zimbabwe,367,Sub-Saharan Africa,2012-04-05,"33, 17, 3, 79, 69, 24, 21, 23, 41, 10, 31, 22,..."
2,GH2016MIS,2016.0,GH,Ghana,2017-07-27 17:52:17.0,Ghana,516,Sub-Saharan Africa,2017-07-27,"89, 92, 41, 70, 15, 57"
3,ET2016DHS,2016.0,ET,Ethiopia,2017-08-16 21:17:18.0,Ethiopia,478,Sub-Saharan Africa,2017-08-15,"91, 33, 15, 41, 10, 31, 22, 39, 26, 69, 24, 21..."
4,AM2016DHS,2016.0,AM,Armenia,2017-08-30 17:46:51.0,Armenia,492,North Africa/West Asia/Europe,2017-08-30,"16, 33, 31, 22, 26, 38, 69, 24, 21, 17, 87, 94..."
5,AM2016DHS,2016.0,AM,Armenia,2017-10-05 21:50:48.0,Armenia,492,North Africa/West Asia/Europe,2017-08-30,"16, 33, 31, 22, 26, 38, 69, 24, 21, 17, 87, 94..."
6,SN2015DHS,2015.0,SN,Senegal,2017-10-05 21:50:48.0,Senegal,489,Sub-Saharan Africa,2016-11-04,"90, 79, 34, 14, 4, 31, 32, 69, 15, 26, 57, 41,..."
7,SN2016DHS,2016.0,SN,Senegal,2017-10-05 21:50:48.0,Senegal,524,Sub-Saharan Africa,2017-09-06,"80, 26, 69, 14, 11, 34, 4, 15, 70, 22, 57, 79,..."
8,SL2016MIS,2016.0,SL,Sierra Leone,2017-10-05 21:50:48.0,Sierra Leone,515,Sub-Saharan Africa,2017-09-21,"41, 90, 89, 79, 70, 26, 57, 92"
9,NP2016DHS,2016.0,NP,Nepal,2017-11-13 15:29:15.0,Nepal,472,South & Southeast Asia,2017-11-13,"6, 31, 14, 20, 17, 34, 36, 13, 9, 60, 3, 22, 3..."


### Our surveys

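Define a survey as being in our database if it has an entry in any of the RECH1, REC01, or MREC01 tables. Note that the `surveyid` column in these tables holds the numeric survey number (the API's `SurveyNum`), not the alphanumeric `SurveyId`.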

In [15]:
our_surveys = pd.read_sql('''
    SELECT DISTINCT (surveyid) FROM dhs_data_tables."RECH1" 
    UNION 
    SELECT DISTINCT (surveyid) FROM dhs_data_tables."REC01"
    UNION 
    SELECT DISTINCT (surveyid) FROM dhs_data_tables."MREC01"''',
                         con=engine)

In [16]:
our_surveys.head()

Unnamed: 0,surveyid
0,354
1,497
2,13
3,383
4,93


### Missing surveys

List of all surveys that the API advertises but that aren't in our DB, for whatever reason

In [19]:
surveys_we_dont_have = surveys[~surveys['SurveyNum'].isin(our_surveys['surveyid'])]

In [20]:
surveys_we_dont_have

Unnamed: 0,SurveyId,CountryName,SurveyNum,RegionName,ReleaseDate,SurveyCharacteristicIds
18,BD2017DHS,Bangladesh,536,South & Southeast Asia,2020-12-14,"38, 22, 70, 92, 31, 10, 43, 46, 9, 2, 41, 68, ..."
29,BT1988DHS,Botswana,23,Sub-Saharan Africa,1989-08-01,"79, 3, 21, 20, 17, 101"
90,ER1995DHS,Eritrea,73,Sub-Saharan Africa,1997-03-01,"1, 2, 4, 10, 11, 21, 24, 79, 101, 3"
91,ER2002DHS,Eritrea,170,Sub-Saharan Africa,2003-05-01,"4, 13, 57, 17, 10, 14, 79, 101, 5, 3, 60, 20, ..."
107,GH2019MIS,Ghana,557,Sub-Saharan Africa,2020-07-28,"90, 17, 26, 70, 89, 92, 41, 79, 69"
187,MR2000DHS,Mauritania,160,Sub-Saharan Africa,2001-12-01,"1, 2, 4, 11, 21, 24, 57, 14, 32, 79, 101, 10, ..."
226,PG2017DHS,Papua New Guinea,499,Oceania,2019-11-25,"10, 78, 11, 6, 1, 74, 31, 22, 79, 69, 24, 21, ..."
265,SN2018DHS,Senegal,580,Sub-Saharan Africa,2020-06-17,"31, 17, 3, 80, 26, 11, 10, 14, 6, 4, 32, 69, 8..."
269,SL2019DHS,Sierra Leone,545,Sub-Saharan Africa,2020-12-11,"11, 41, 23, 70, 104, 1, 6, 4, 10, 89, 99, 26"
298,TR2008DHS,Turkey,547,North Africa/West Asia/Europe,2018-07-30,"16, 10, 31, 69, 14, 34, 3"


List of surveys for which we've previously noted that we'll never get the data due to known issues

In [21]:
known_survey_issues = known_issues_all.loc[
    (known_issues_all['change expected']=='no') & (known_issues_all['data issue'] != 'no issue')
]
known_survey_issues

Unnamed: 0,surveynum,country,year,issue,data issue,gps issue,change expected,last manual check
0,23,botswana,1988,,restricted data,not collected,no,04/02/2019
1,73,eritrea,1995,,restricted data,not collected,no,04/02/2019
2,103,yemen,1997,,restricted data,not collected,no,04/02/2019
3,179,turkmenistan,2000,,not in public domain,not collected,no,04/02/2019
4,160,mauritania,2000,,restricted data,not collected,no,04/02/2019
5,170,eritrea,2002,,restricted data,restricted data,no,04/02/2019
6,224,uganda,2004,,restricted data,not collected,no,04/02/2019
7,525,albania,2017,,we don't care about it,we are not approved for gps,no,04/02/2019


### Listing of surveys that we don't have and might want to get

In [22]:
not_known_issue = ~surveys_we_dont_have['SurveyNum'].isin(known_survey_issues['surveynum'])
survey_data_to_look_for = surveys_we_dont_have[not_known_issue]
survey_data_to_look_for

Unnamed: 0,SurveyId,CountryName,SurveyNum,RegionName,ReleaseDate,SurveyCharacteristicIds
18,BD2017DHS,Bangladesh,536,South & Southeast Asia,2020-12-14,"38, 22, 70, 92, 31, 10, 43, 46, 9, 2, 41, 68, ..."
107,GH2019MIS,Ghana,557,Sub-Saharan Africa,2020-07-28,"90, 17, 26, 70, 89, 92, 41, 79, 69"
226,PG2017DHS,Papua New Guinea,499,Oceania,2019-11-25,"10, 78, 11, 6, 1, 74, 31, 22, 79, 69, 24, 21, ..."
265,SN2018DHS,Senegal,580,Sub-Saharan Africa,2020-06-17,"31, 17, 3, 80, 26, 11, 10, 14, 6, 4, 32, 69, 8..."
269,SL2019DHS,Sierra Leone,545,Sub-Saharan Africa,2020-12-11,"11, 41, 23, 70, 104, 1, 6, 4, 10, 89, 99, 26"
298,TR2008DHS,Turkey,547,North Africa/West Asia/Europe,2018-07-30,"16, 10, 31, 69, 14, 34, 3"
299,TR2013DHS,Turkey,486,North Africa/West Asia/Europe,2018-07-30,"10, 22, 3, 101, 69, 16, 31, 32, 38, 20, 34, 60"


## GPS data

Listing of all the surveys we have GPS data for

In [26]:
our_gps_data = pd.read_sql('''
    SELECT DISTINCT (surveyid) FROM dhs_data_locations.dhs_cluster_locs''',
                         con=engine)

In [27]:
our_gps_data.head()

Unnamed: 0,surveyid
0,145
1,435
2,58
3,228
4,384


Listing of all the surveys for which GPS data are available, according to the API

In [28]:
available_gps_data = surveys[surveys['SurveyCharacteristicIds'].str.contains('26')]['SurveyNum']
available_gps_data

0      471
1      327
3      282
4      395
5      477
      ... 
325    542
328    184
329    260
330    367
331    475
Name: SurveyNum, Length: 201, dtype: int64
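
Note that `str.contains('26')` is a substring match, so it would also accept IDs such as 126 or 260 if they ever appear in `SurveyCharacteristicIds`. A stricter alternative, sketched here, splits the comma-separated string and compares whole tokens. `has_characteristic` is a hypothetical helper, not part of the original notebook, and swapping it in may change the counts shown above.

```python
def has_characteristic(ids, wanted):
    """Return True if the comma-separated ID string contains `wanted` as a whole token."""
    # Missing values arrive from pandas as NaN (a float), not a string
    if not isinstance(ids, str):
        return False
    tokens = {token.strip() for token in ids.split(',')}
    return str(wanted) in tokens


# Whole-token matching: 26 is found in the first string only
print(has_characteristic("8, 41, 57, 36, 68, 63, 61, 26, 101, 79, 89", 26))
print(has_characteristic("126, 260, 2, 6", 26))
```

In the notebook it could be applied as `surveys[surveys['SurveyCharacteristicIds'].apply(has_characteristic, wanted=26)]['SurveyNum']`.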

Listing of all the GPS data that are advertised as available but that we don't have

In [30]:
gps_data_we_dont_have = available_gps_data[~available_gps_data.isin(our_gps_data['surveyid'])]
gps_data_we_dont_have

0      471
7      262
47     101
91     170
107    557
147    115
165    207
213    277
214    407
224    419
235    426
236    433
237    434
245    165
264    534
265    580
269    545
277     82
319    358
322    206
Name: SurveyNum, dtype: int64

Listing of surveys for which we've previously noted that we'll never get the GPS data due to known issues 

In [31]:
known_gps_issues = known_issues_all.loc[
    (known_issues_all['change expected']=='no') & (known_issues_all['gps issue'] != 'no issue')
]['surveynum']
known_gps_issues

0      23
1      73
2     103
3     179
4     160
5     170
6     224
7     525
8     101
9     115
10    165
11    206
12    207
13    262
14    277
15    358
16    407
17    419
18    426
19    433
20    434
21    471
23     82
Name: surveynum, dtype: int64

### Listing of GPS data that we don't have and might want to get

In [32]:
gps_data_to_look_for = gps_data_we_dont_have[~gps_data_we_dont_have.isin(known_gps_issues)]
gps_data_to_look_for

107    557
264    534
265    580
269    545
Name: SurveyNum, dtype: int64

In [33]:
surveys[surveys['SurveyNum'].isin(gps_data_to_look_for)]

Unnamed: 0,SurveyId,CountryName,SurveyNum,RegionName,ReleaseDate,SurveyCharacteristicIds
107,GH2019MIS,Ghana,557,Sub-Saharan Africa,2020-07-28,"90, 17, 26, 70, 89, 92, 41, 79, 69"
264,SN2017DHS,Senegal,534,Sub-Saharan Africa,2018-09-28,"10, 22, 23, 57, 15, 41, 31, 70, 79, 80, 26, 69..."
265,SN2018DHS,Senegal,580,Sub-Saharan Africa,2020-06-17,"31, 17, 3, 80, 26, 11, 10, 14, 6, 4, 32, 69, 8..."
269,SL2019DHS,Sierra Leone,545,Sub-Saharan Africa,2020-12-11,"11, 41, 23, 70, 104, 1, 6, 4, 10, 89, 99, 26"


## Next steps

This notebook no longer includes code to download or scrape the data automatically from the DHS website. DHS took steps to prevent this, so we are not going to try to circumvent them.

Instead, use the batch download tool that DHS does provide, in conjunction with a browser bulk download utility, to get the necessary files.

For each survey, you should download the "Individual Recode - Hierarchical ASCII data (.dat)" and the "Men's Recode - Hierarchical ASCII data (.dat)" files, as well as the Geographic Data where available.
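
The two "might want to get" listings produced above can be combined into a per-survey download checklist. The sketch below is one way to do that; `build_download_checklist` is a hypothetical helper, and the labels are simply the file descriptions quoted in the previous paragraph.

```python
def build_download_checklist(survey_nums, gps_nums):
    """Map each survey number to the list of downloads it still needs."""
    survey_nums, gps_nums = set(survey_nums), set(gps_nums)
    checklist = {}
    for num in sorted(survey_nums | gps_nums):
        wanted = []
        if num in survey_nums:
            wanted.append("Individual Recode - Hierarchical ASCII data (.dat)")
            wanted.append("Men's Recode - Hierarchical ASCII data (.dat)")
        if num in gps_nums:
            wanted.append("Geographic Data")
        checklist[num] = wanted
    return checklist


# Survey numbers copied from the two listings earlier in this notebook
checklist = build_download_checklist(
    survey_nums=[536, 557, 499, 580, 545, 547, 486],
    gps_nums=[557, 534, 580, 545],
)
for num, wanted in checklist.items():
    print(num, wanted)
```

Inside the notebook the same call could take `survey_data_to_look_for['SurveyNum']` and `gps_data_to_look_for` directly.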

If any turn out not to be available, which suggests a mismatch between the API and reality, update the known-issues CSV files where appropriate to note this.
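
A new entry could be appended to `missing_and_unavailable.csv` along these lines. This is a sketch only: the column names are those shown in the `known_issues_all` output above, while `append_known_issue` and the example values are hypothetical placeholders. The file appears to carry an unnamed index column in first position, which the sketch fills with the next row number.

```python
import csv
import io

COLUMNS = ['surveynum', 'country', 'year', 'issue', 'data issue',
           'gps issue', 'change expected', 'last manual check']


def append_known_issue(csv_text, entry):
    """Return csv_text with one extra row built from the `entry` dict."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    next_index = len(rows) - 1  # header excluded; index column starts at 0
    new_row = [str(next_index)] + [str(entry.get(col, '')) for col in COLUMNS]
    out = io.StringIO()
    csv.writer(out, lineterminator='\n').writerows(rows + [new_row])
    return out.getvalue()


existing = (
    ",surveynum,country,year,issue,data issue,gps issue,change expected,last manual check\n"
    "0,23,botswana,1988,,restricted data,not collected,no,04/02/2019\n"
)
updated = append_known_issue(existing, {
    'surveynum': 999,            # placeholder, not a real survey number
    'country': 'exampleland',    # placeholder
    'year': 2000,                # placeholder
    'data issue': 'not available for download',
    'gps issue': 'not collected',
    'change expected': 'no',
    'last manual check': '04/02/2019',
})
print(updated)
```

Working on the file's text rather than the file itself keeps the sketch free of side effects; writing `updated` back to disk is left as a deliberate manual step.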

Once downloaded, unzip and rename the files using notebook 02.