# DHS recode generator

This file contains code for producing custom output "recode" tables from DHS tables that have first been processed out into individual files for each survey and table (record type). 

Essentially it provides a means of "joining" tables that are stored in flat CSV files, just like executing a join query on data held in a DB. The joining is done by loading each of the input tables to an in-memory SQLite database, building the output table there, and dumping the output table back to a CSV file.

The code for actually constructing the SQL that builds the in-memory database, and performs the joins, is implemented in an external module that should be in the same directory as this notebook. This notebook just contains the necessary code to read the input requirements and data, and write the output files.

This was developed by Harry Gibson for extracting information (potentially) related to Under 5 Mortality for Donal Bisanzio. However it should be applicable for creating any "flat" joined output tables from DHS data that has been parsed into separate tables from the CSPro format.

## Usage

The main input is a list of tables / variables that should go into the output, specified in a CSV file. This has one row for each output column, and it should have the columns "Name" and "RecordName" which specify the variable name and table name respectively.

A second input CSV file provides a list of survey IDs (in a "DHS_id" column) that data should be extracted from.

Finally a directory of the parsed DHS survey tables (produced using the DHS survey parsing code) must be provided. The survey id and record name from the above inputs will be used to select CSV files from this directory.

In [1]:
import csv
import glob
from collections import defaultdict
import sqlite3
import os

In [2]:
from DHSTableManagement import *
from UnicodeWriter import UnicodeWriter

### Specify input file locations / patterns

In [3]:
varsFile = r'\\129.67.26.176\map_data\DHS_Automation\Processing\U5M_TheUniverse_And_Everything_201510\Info\variables_chosen_with_lengthandtype.csv'
#varsFile = r'\\129.67.26.176\map_data\DHS_Automation\Processing\HIV_Pilot\Info\259_Variables_Chosen.csv'
svyFile = r'\\129.67.26.176\map_data\DHS_Automation\Processing\U5M_TheUniverse_And_Everything_201510\Info\survey_db_list.csv'
#tblPattern = r'\\129.67.26.176\map_data\DHS_Automation\DataExtraction\20150626_FullSiteScrape\ProcessedTables\{0}.*.{1}.csv'
tblPattern = r'\\129.67.26.176\map_data\DHS_Automation\DataExtraction\20160307_Updates\ProcessedTables\{0}.*.{1}.csv'
#tblPattern = r'\\129.67.26.176\map_data\DHS_Automation\Acquisition\AdditionalDownloads\259\OutTables\{0}.*.{1}.csv'
outputFilenameTag = "U5M_Extra_Surveys"

In [4]:
#varsFile = r'\\129.67.26.176\map_data\DHS_Automation\Processing\HouseholdElectricity_201511\Info\variables_for_hh_electric.csv'
#svyFile = r'\\129.67.26.176\map_data\DHS_Automation\Processing\U5M_Universe_And_Everything_201510\Info\survey_db_list.csv'
#tblPattern = r'\\129.67.26.176\map_data\DHS_Automation\DataExtraction\20150626_FullSiteScrape\ProcessedTables\{0}.*.{1}.csv'

In [4]:
#outDir = r'\\129.67.26.176\map_data\hsg\Donal'
outDir = r'\\129.67.26.176\map_data\DHS_Automation\Processing\U5M_TheUniverse_And_Everything_201510\Outputs_Extra'
#outDir = r'\\129.67.26.176\map_data\DHS_Automation\Processing\HIV_Pilot\out'

In [6]:
#outDir = r'\\129.67.26.176\map_data\DHS_Automation\Processing\HouseholdElectricity_201511\Out'

In [5]:
outFNPattern = os.path.join(outDir,outputFilenameTag+".{0!s}.csv")

In [29]:
# allTables = ["REC01","REC11","REC21","REC22","REC41","REC42","REC43","REC44","REC51",
#             "REC71","REC75","REC91","REC94","REC95","RECH0","RECH1","RECH2","RECH3",
#             "RECH4","RECHM2"]
# "RECH5","RECH6", missed as they contain nothing new

### Specify master table

In [32]:
# All other tables must be capable of joining to this table either 1:1 or M:1. Any that join 
# 1:M would result in duplicate rows from the left outer join, possibly exponentially many if
# there are several such tables.
masterTable = "REC21" #"RECH1" #"REC21"

In [33]:
# Build a dictionary of the columns that have been requested for each table
# Key is the tablename and value is a list (because order matters) of the 
# column names/lengths (each in a 2-item dict).
tableVars = defaultdict(list)
with open(varsFile) as varfile:
    reader = csv.DictReader(varfile,delimiter=',') # delim ; in original
    for row in reader:
        varname = row['Name']
        recname = row['RecordName']
        varlength = row['Len']# if 'Len' in row else 10
        if 1:#recname in allTables:
            tableVars[recname].append(ColumnInfo({"Name": varname, "Length": varlength}))
            

In [38]:
# Ensure that we are always adding the iditems and joinable column for each table
#fullSpecFile = r'\\129.67.26.176\map_data\DHS_Automation\Acquisition\AdditionalDownloads\259\OutSpecs\259.szir51.FlatRecordSpec.csv'
#specFileTemplate = r'\\129.67.26.176\map_data\DHS_Automation\DataExtraction\20160307_Updates\ParsedSpecs\{0}.*.FlatRecordSpec.csv'
fullSpecFile = r'\\129.67.26.176\map_data\DHS_Automation\DataExtraction\20150626_FullSiteScrape\SchemaMappingSupport\SchemaMapperTableSpecs_AllTables_AllSurveys_InclTypes.csv'
# Now build a similar thing for tableIDs so we know how to join each table to each other one
tableIds = defaultdict(list)
with open(fullSpecFile) as specfile:
    reader = csv.DictReader(specfile, delimiter = ',')
    for row in reader:
        recname = row['RecordName']
        if (recname in tableVars and 
            (row['ItemType'] in ['IdItem', 'JoinableItem']
             )): #or row['Label'].startswith('Index'))):
             #or 'line num' in row['Label'].lower())):
            varname = row['Name']
            varlength = row['Len']
            colInfo = ColumnInfo({"Name": varname, "Length":varlength})
            if colInfo not in tableIds[recname]:
            #if not sum(i.Name == varname for i in tableVars[recname]):
            #    tableVars[recname].insert(0, ColumnInfo({"Name": varname, "Length":varlength}))
                tableIds[recname].append(ColumnInfo({"Name": varname, "Length":varlength}))    


In [71]:
s.startswith('Inde')

False

In [39]:
tableIds['REC01'].append(ColumnInfo({"Name":"V003", "Length":3}))

In [40]:
[c.Name for c in tableIds['REC01']]

['CASEID', 'V003']

In [41]:
[c.Name for c in tableIds['RECH1']]

['HHID', 'HVIDX']

In [37]:
# get files to read from survey_db_list
with open(svyFile) as svyfile:
    reader = csv.DictReader(svyfile)
    svys = [row['DHS_id'] for row in reader]


In [32]:
svys = [259]
svys = [393, 421,239,311,425,437,451,473,457,450]
# or do all that are available
# svys= [os.path.basename(f).split('.')[0] 
#       for f in glob.glob(tblPattern.format("*", "RECH0"))]

### Process the surveys

In [42]:
skipDB = False
skipBlanks = False

* Load all CSV files for a survey into individual tables in the in-memory DB 
* Create indexes
* Create an output table that is the result of joining them all
* Write that to disk.

In [43]:
for svyID in [450]:#svys:
    print "Survey "+str(svyID)
    # Use an in-memory sqlite db to load the tables and join them 
    db = sqlite3.connect(':memory:')
    cursor = db.cursor()
    srcTableInfos = {}
    outname = outFNPattern.format(svyID)
    for tblName, tblCols in tableVars.iteritems():
        # Load one table of one survey into the database. 
        tblIdCols = tableIds[tblName]
        # Find the individual file required
        tblFiles = glob.glob(tblPattern.format(svyID, tblName))
        if len(tblFiles) != 1:
            print ("Survey "+str(svyID)+" table "+tblName+" does not exist or is not well specified!")
            continue
        print tblName +"... ",
        tblFile = tblFiles[0]
        
        # build a TableInfo for working with the file and use it to load the data to a DB table
        with open(tblFile) as tbl:
            reader = csv.DictReader(tbl)
            # Ensure we get the ID (join) variables, which are CASEID or HIDX / BIDX etc,
            # regardless of what is specified in the output spec file.
            
            # Note that there are also HA0 and HC0 as IDs in RECH5/RECH6 but we don't 
            # actually need those tables (The only cols so far requested from those tables 
            # are duplicated in the woman tables).
            # So for now we'll just look for any columns with ID in the name. To do it 
            # "properly" we would have to go back to the DCF parsing code and pull out the 
            # relationship info there.
            # Note that the relative order of the id columns between the tables is important
            # as it is used by the joiner code to figure out which columns match to which
            # The fieldnames in the parsed files do give them in a consistent order,
            # but it might be more relaxing to actually check that here (CASEID first then 
            # BIDX or whatever)
            #ids = [{"Name":v, "Length":1} for v in reader.fieldnames if v.find("ID") != -1]
            #for i in ids:
            #    if i not in tblCols:
            #        tblCols.insert(0, {"Name": i["Name"], "Length":1})
            # Create a tableinfo object which will handle building the sql necessary
            # for interacting with this table in the database
            srcTable = TableInfo(tblName, tblIdCols, tblCols)
            
            if (skipDB):
                # For debugging of TableInfo
                continue

            # Get the sql to create the table in the database
            createSql = srcTable.GetCreateTableSQL()
            orderedCols = srcTable.AllColumns()
            
            # Populate the data into the DB from the CSV reader
            insertSql = srcTable.GetInsertSQLTemplate()
            # Use "N/A" for any columns that are not present in this survey
            data = [([row.get(i, 'N/A') for i in orderedCols]) for row in reader ]
            
            # if a table doesn't have any of the columns we asked for then don't just 
            # include its ID columns, just skip dealing with it altogether
            gotData = False
            for i in data:
                if i.count('N/A') < (len(i) - len(tblIdCols)):
                    gotData = True
                    break
            if skipBlanks and not gotData:
                print "Skipping table {0} as none of the required cols are present".format(
                    tblName)
                continue
                
            # otherwise save the tableinfo
            srcTableInfos[tblName] = srcTable
            # and put the data into the db
            cursor.execute(createSql)
            cursor.executemany(insertSql, data)
            # and create indexes in the DB on the relevant join columns
            idxSql = srcTable.GetCreateIndexSQL()
            cursor.executescript(idxSql)
        db.commit()
        
    # Move the "master table" - i.e. the left one on the left outer join - 
    # to the start of the list as required by MultiTableJoiner
    tblNames = [i for i in sorted(srcTableInfos) if i != masterTable]
    if masterTable in srcTableInfos:
        tblNames[0] = masterTable
    else:
        print "Warning: requested master table {0} isn't present! Join may fail!"
    # Note that we also don't actually check here if the join is appropriate. 
    # For example from a Child master table we can join to its parents table and the household 
    # table. But we couldn't do the reverse as for each household there are many children.
    # If we tried, we'd get repeated rows (probably) on the left join.
    # If there was more than one such table then we would get an exploding number of rows.
 
    if (len(tblNames)) == 0:
        print "Nothing for survey " + str(svyID)
        continue
    
    # Perform the join!
    multi = MultiTableJoiner("outputTbl", [srcTableInfos[n] for n in tblNames] )
    # Use GetCreateIntoSQL(QualifyFieldNames=True) to name output fields like 
    # RECH2_HV270 rather than just HV270
    joinEmAllSQL = multi.GetCreateIntoSQL(QualifyFieldNames=True)
    # Bodge for Donal's data where we want to join the household schedule table to the child table. This uses 
    # a different join column in the child tables (REC21.B16) and there is no way that I can see of automatically 
    # inferring this from the .DCF specification files.
    joinEmAllSQL = joinEmAllSQL.replace (
        'LEFT JOIN RECH1 ON substr(REC21.CASEID, 1, length(REC21.CASEID)-3) = RECH1.HHID and REC21.BIDX = RECH1.HVIDX',
        'LEFT JOIN RECH1 ON substr(REC21.CASEID, 1, length(REC21.CASEID)-3) = RECH1.HHID and REC21.B16 = RECH1.HVIDX'
    )
    continue
    cursor.execute(joinEmAllSQL)
    
    # Write the results out to CSV
    cursor.execute("select * from outputTbl")
    colNames = [description[0] for description in cursor.description]
    # TODO a given column should always appear in the same table but occasionally 
    # this is not the case. So we have to specify in the input file all the places it 
    # could come from, which will generate multiple columns in the output.
    # e.g. some surveys have HV270 in RECH3 rather than RECH2 and so we need to specify 
    # both if we are running for all surveys.
    # Ideally we would check here for these duplicates and write out only the one which 
    # doesn't have "N/A" in the values. But that would need to inspect each row, and thus 
    # would be much slower.
    with open(outname, "wb") as f:
        writer = UnicodeWriter(f)
        writer.writerow(colNames)
    #print ""
        writer.writerows(cursor)
    db.close()

Survey 450
REC21...  RECML...  REC22...  RECH3...  REC51...  RECH1...  RECH0...  REC11...  RECH2...  REC01...  REC91...  RECH4...  REC94...  REC95...  REC75...  REC43...  REC42...  REC41...  REC71...  Survey 450 table RECHM2 does not exist or is not well specified!
REC44...  397
397


In [27]:
m = MultiTableJoiner("test", [srcTableInfos['REC21'],srcTableInfos['REC22']])


In [44]:
print m.GetCreateIntoSQL(QualifyFieldNames=True)

18
18
CREATE TABLE test AS SELECT 
REC21.CASEID as REC21_CASEID, REC21.BIDX as REC21_BIDX, REC21.B0 as REC21_B0, REC21.B1 as REC21_B1, REC21.B11 as REC21_B11, REC21.B12 as REC21_B12, REC21.B15 as REC21_B15, REC21.B16 as REC21_B16, REC21.B2 as REC21_B2, REC21.B3 as REC21_B3, REC21.B4 as REC21_B4, REC21.B5 as REC21_B5, REC21.B6 as REC21_B6, REC21.B7 as REC21_B7, REC21.B8 as REC21_B8, REC21.BORD as REC21_BORD, REC22.V201 as REC22_V201, REC22.V218 as REC22_V218 FROM 
 REC21 
LEFT JOIN REC22 ON REC21.CASEID = REC22.CASEID


In [45]:
print joinEmAllSQL

CREATE TABLE outputTbl AS SELECT 
REC21.CASEID as REC21_CASEID, REC21.BIDX as REC21_BIDX, REC21.B0 as REC21_B0, REC21.B1 as REC21_B1, REC21.B11 as REC21_B11, REC21.B12 as REC21_B12, REC21.B15 as REC21_B15, REC21.B16 as REC21_B16, REC21.B2 as REC21_B2, REC21.B3 as REC21_B3, REC21.B4 as REC21_B4, REC21.B5 as REC21_B5, REC21.B6 as REC21_B6, REC21.B7 as REC21_B7, REC21.B8 as REC21_B8, REC21.BORD as REC21_BORD, REC11.V104 as REC11_V104, REC11.V106 as REC11_V106, REC11.V107 as REC11_V107, REC11.V113 as REC11_V113, REC11.V115 as REC11_V115, REC11.V116 as REC11_V116, REC11.V119 as REC11_V119, REC11.V120 as REC11_V120, REC11.V121 as REC11_V121, REC11.V122 as REC11_V122, REC11.V123 as REC11_V123, REC11.V124 as REC11_V124, REC11.V125 as REC11_V125, REC11.V127 as REC11_V127, REC11.V128 as REC11_V128, REC11.V129 as REC11_V129, REC11.V134 as REC11_V134, REC11.V135 as REC11_V135, REC11.V136 as REC11_V136, REC11.V137 as REC11_V137, REC11.V138 as REC11_V138, REC11.V139 as REC11_V139, REC11.V140 as REC1

In [60]:
s = s.replace ('LEFT JOIN RECH1 ON substr(REC21.CASEID, 1, length(REC21.CASEID)-3) = RECH1.HHID and REC21.BIDX = RECH1.HVIDX',
           'LEFT JOIN RECH1 ON substr(REC21.CASEID, 1, length(REC21.CASEID)-3) = RECH1.HHID and REC21.B16 = RECH1.HVIDX')

In [62]:
cursor.execute(s)

ProgrammingError: Cannot operate on a closed database.

In [66]:
svyID

450

In [None]:
multi = MultiTableJoiner("outputTbl", [srcTableInfos[n] for n in tblNames] )

In [40]:
multi.GetCreateIntoSQL()

397
397


'CREATE TABLE outputTbl AS SELECT REC21.CASEID, REC21.BIDX, REC21.B0, REC21.B1, REC21.B11, REC21.B12, REC21.B15, REC21.B16, REC21.B2, REC21.B3, REC21.B4, REC21.B5, REC21.B6, REC21.B7, REC21.B8, REC21.BORD, REC11.V104, REC11.V106, REC11.V107, REC11.V113, REC11.V115, REC11.V116, REC11.V119, REC11.V120, REC11.V121, REC11.V122, REC11.V123, REC11.V124, REC11.V125, REC11.V127, REC11.V128, REC11.V129, REC11.V134, REC11.V135, REC11.V136, REC11.V137, REC11.V138, REC11.V139, REC11.V140, REC11.V141, REC11.V150, REC11.V153, REC11.V155, REC11.V160, REC11.V161, REC22.V201, REC22.V218, REC41.MIDX, REC41.M1, REC41.M15, REC41.M17, REC41.M18, REC41.M19, REC41.M19A, REC41.M27, REC41.M28, REC41.M2A, REC41.M2B, REC41.M2C, REC41.M2D, REC41.M2E, REC41.M2F, REC41.M2G, REC41.M2H, REC41.M2I, REC41.M2J, REC41.M2K, REC41.M2L, REC41.M2M, REC41.M2N, REC41.M3A, REC41.M3B, REC41.M3C, REC41.M3D, REC41.M3E, REC41.M3F, REC41.M3G, REC41.M3H, REC41.M3I, REC41.M3J, REC41.M3K, REC41.M3L, REC41.M3M, REC41.M3N, REC41.M49A, RE

In [22]:
multi._MasterTable.Name()

'REC01'

In [25]:
srcTableInfos['REC01'].JoinColumns()


['CASEID']

In [26]:
srcTableInfos['RECH5'].JoinColumns()

['HHID', 'HA0']

In [24]:
tblNames

['REC01',
 'REC42',
 'REC51',
 'REC71',
 'REC75',
 'REC80',
 'REC91',
 'RECH0',
 'RECH1',
 'RECH2',
 'RECH5',
 'RECHCH',
 'RECHEL',
 'RECHMA',
 'RECHYT']

In [14]:
d = [ColumnInfo(p) for p in l]

In [16]:
tblCols

['HHID', {'Length': '1', 'Name': 'HA60'}]

In [23]:
tblCols

{'HA60': '1'}

In [None]:
multi.GetCreateIntoSQL()

### Workings below this point - all redundant

Now the best thing to do would be then to join each of the other tables in turn and update, for example:

In [None]:
inTable = srcTableInfos['REC43']
cp = TableToTableFieldCopier(outTable, inTable, inTable.OutputColumns())

update = cp.GetUpdateSQL_Join()
cursor.execute(update)
db.commit()

But that doesn't work because it turns out SQLite doesn't support join in an update query. Nice to know.

In [None]:
inTable = srcTableInfos['REC43']
cp = TableToTableFieldCopier(outTable, inTable, inTable.OutputColumns())


Instead we might try REPLACE INTO. But that doesn't work because it adds duplicate rows.

In [None]:
inTable = srcTableInfos['REC43']
cp = TableToTableFieldCopier(outTable, inTable, inTable.OutputColumns())

update = cp.GetUpdateSQL_Replace()
cursor.execute(update)
db.commit()
