012616IPUMSdataaggregation.py - Takes cleaned IPUMS data (cleaned by IPUMS/01182016ipumscleaning.py) and
aggregates the variables in the files to the provincial and municipal level, after merging geolevel with universal key

INPUTS: 
'IPUMS/ipumsclean.csv' - cleaned IPUMS data  (by 01182016ipumscleaning.ipynb) for DR in 2002 and 2010
'DR Codigos/muncorrespondence.csv' - file which provides a correspondence between IPUMS geo2 names and D.R. municipality codes

usd2013todr02pesos - Nominal conversion rate from 2013 USD to 2002 (2003?) RD pesos, times a weighting factor with respect to
current 2010 USD (to express this in real terms).

INTERMEDIATE FILES: 

'occupationbymunicipality2002.csv' - IPUMS 2002 information on the share of workers with reported ISIC 2-digit occupation in a given D.R. municipality, used by compute_regional_employment.py

'occupationbymunicipality2010.csv'  - IPUMS 2010 information on the share of workers with reported ISIC 2-digit occupation in a given D.R. municipality, used by compute_regional_employment.py


OUTPUTS:

'averageincomebymunicipality2002.csv' - IPUMS 2002 information on the average income of workers in the private sector for a given D.R. municipality

'averageincomebyoccmun2002.csv' - IPUMS 2002 information on the average income of workers in a given ISIC occupation code for a given D.R. municipality

Jay Sayre - sayrejay (at) Gm@i|,

In [1]:
import pandas as pd
import os

if os.name == 'nt':
    base_dir = "D:/Dropbox/Dropbox (Personal)/College/DR_Paper/"
else:
    base_dir = "/home/j/Dropbox/College/DR_Paper/"

## INPUTS
ipumsinputdata = base_dir+'IPUMS/ipumsclean.csv'
geocodefl = base_dir+'cafta-dr/DR Codigos/muncorrespondence.csv'
## Conversion rate from 2002 RD to 2013 USD 
usd2013todr02pesos = 18.609825#*(0.975)
## Alternative if the 2003 exchange rate seems more appropriate, since the survey occurred
## on October 19th, 2002
#usd2013todr02pesos = 30.8307083333*(0.975)
### Weighted average of both, come back and find monthly exchange rate
#usd2013todr02pesos = ((18.609825*0.17)+(30.8307083333*0.83))*(0.975)

## INTERMEDIATE DATA
occumunoutput2002 = base_dir+'cafta-dr/Output/IPUMSoccupationbymunicipality2002.csv'
occumunoutput2010 = base_dir+'cafta-dr/Output/IPUMSoccupationbymunicipality2010.csv'
## OUTPUTS
munincoutput = base_dir+'cafta-dr/Output/averageincomebymunicipality2002.csv'
munincoccoutput = base_dir+'cafta-dr/Output/averageincomebyoccmun2002.csv'

## Read in input files
ipumsdf = pd.read_csv(ipumsinputdata, encoding='utf-8')
geodf = pd.read_csv(geocodefl, encoding='utf-8')

## Merge IPUMS data with geographic key

## Build dict mapping each IPUMS name to the list of municipality codes it corresponds to (possibly more than one)
tempc = geodf.set_index('CODIGO')['IPUMS'].to_dict()
codesnames = {}
for a,b in tempc.items():
    if b not in codesnames.keys():
        codesnames[b] = [a]
    else:
        codesnames[b].extend([a])
        
## Manually add "other municipalities" categories
codesnames.update({'Other municipalities in Peravia':[1701,1702],
 'Other municipalities in Monte Plata':[2901,2902,2903,2904,2905],
 'Other municipalities in La Altagracia':[1101,1102],
 'Other municipalities in Duarte':[601,602,603,604,605,606,607],
 'Other municipalities in Maria Trinidad':[1401,1402,1403,1404],
 'Other municipalities in Hermanas Mirabal':[1901,1902,1903],
 'Other municipalities in La Vega':[1301,1302,1303,1304],
 u'Other municipalities in Monse\xf1or Nouel':[2801,2802,2803],
 'Other municipalities in Espaillat':[901,902,903,904],
 'Other municipalities in Puerto Plata':[1801,1802,1803,1804,1805,1806,1807,1808,1809],
 'Other municipalities in Monte Cristi':[1501,1502,1503,1504,1505,1506],
 'Other municipalities in Valverde':[2701,2702,2703],
 'Other municipalities in San Juan':[2201,2202,2203,2204,2205,2206],
 ##Since I can't determine where "El Carril" is, dropping obs from data set 
 #'El Carril':[]
    })
#Drop 'El Carril' observations
withelcarril = len(ipumsdf)
ipumsdf = ipumsdf[ipumsdf["geo2_dox"] != 'El Carril']
print("Number of El Carril observations %d" % (withelcarril-len(ipumsdf)))

## Split data set into 2002 and 2010 sections, we primarily care about 2002
df2002, df2010 = ipumsdf[ipumsdf['year']==2002], ipumsdf[ipumsdf['year']==2010]

## Subsetting down to only obs with available (and nonzero) income data for 2002
incdf = df2002[df2002['inctot']!= 9999998]
incdf = incdf[incdf['inctot']!= 9999999]
incdf = incdf[incdf['inctot']!= 0]
## Convert monthly income in 2002 RD pesos to annual income in 2013 USD
incdf['inctot'] = (incdf['inctot']/usd2013todr02pesos)*12
## Subset to only workers employed in the private sector
incdf = incdf[incdf['classwk'] == 2]

## Subsetting down to only workers employed in the private sector for 2010
workers2010df = df2010[df2010['classwk'] == 2]

Number of El Carril observations 3216
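
The income cleaning and conversion steps in the cell above can be illustrated on a few made-up values. This is a minimal sketch in plain Python, assuming (as the filters above do) that 9999998 and 9999999 are the codes for unavailable income; the helper name `monthly_pesos_to_annual_usd` and the sample incomes are hypothetical.

```python
# Values dropped by the cell above: the two "not available" codes and zero income
EXCLUDED_INCOME_VALUES = {9999998, 9999999, 0}

# Same rate as in the cell above; income in pesos is divided by it to get USD
usd2013todr02pesos = 18.609825

def monthly_pesos_to_annual_usd(inctot, rate=usd2013todr02pesos):
    """Convert monthly income in RD pesos to annual income in USD (hypothetical helper)."""
    return (inctot / rate) * 12

# Made-up monthly incomes, including excluded entries
sample_incomes = [5000, 9999998, 0, 12000, 9999999]
valid = [x for x in sample_incomes if x not in EXCLUDED_INCOME_VALUES]
annual_usd = [monthly_pesos_to_annual_usd(x) for x in valid]
```

Only the two ordinary incomes survive the filter, and each is scaled by 12 after the currency conversion, mirroring the `inctot` line above.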


cols = ['COUNTRY', 'YEAR', 'SAMPLE', 'SERIAL', 'PERSONS', 'HHWT', 'SUBSAMP', 'STRATA', 'URBAN', 'REGIONW', 'GEOLEV1', 'GEO1_DO', 'GEO1_DOX', 'GEO2_DOX', 'SUBRDO', 'AGE', 'SEX', 'NATIVITY', 'BPLCOUNTRY', 'BPLDO', 'YRIMM', 'YRSIMM', 'SCHOOL', 'LIT', 'EDATTAIN', 'EDATTAIND', 'YRSCHOOL', 'EDUCDO', 'EMPSTAT', 'EMPSTATD', 'OCCISCO', 'OCC', 'INDGEN', 'IND', 'CLASSWK', 'CLASSWKD', 'EMPSECT', 'INCTOT', 'MIGRATE5', 'MIGCTRY5', 'MIGDO', 'DISABLED', 'DISEMP']

In [7]:
### Calculate average income at the municipality level for 2002

## Calculate number of observations for each municipality code originating from each IPUMS geoname 
## Create dict with number of observations from a given municipality for each municipio codigo
munshares = {}
munobs = incdf.groupby('geo2_dox')['year'].count()
munobs = dict(zip(munobs.index,munobs))
for mun in munobs.keys():
    for muncode in codesnames[mun]:
        if muncode in munshares.keys():
            munshares[muncode][mun] = munobs[mun]
        else:
            munshares[muncode] = {mun:munobs[mun]}
            
## Weight municipality according to num of obs from each category
codigos = [str(b) for b in list(geodf['CODIGO'])]
for muncode in munshares.keys():
    if len(munshares[muncode]) == 1:
        munshares[muncode] = [1]
    else:
        othermuns = float(len([a for a in codigos if str(muncode)[:2] in a[:2]]))
        totalcount = 0
        for munname in munshares[muncode].keys():
                if 'Other municipalities' in munname:
                    totalcount += munshares[muncode][munname]/othermuns
                    othermuncount = munshares[muncode][munname]/othermuns
                else:
                    totalcount += munshares[muncode][munname]
                    frstmuncount = munshares[muncode][munname]
        munshares[muncode] = [frstmuncount/totalcount, othermuncount/totalcount]
        
## Helper to find the row position of an item (by default a municipality code) in a dataframe column
def whereindf(item,dataframe,column='mun'):
    return list(dataframe[column]).index(item)

## Aggregate income data at the municipality level
muninc = incdf.groupby('geo2_dox', as_index=False)['inctot'].mean()
muninc = dict(zip(muninc['geo2_dox'],muninc['inctot']))
munincdf = pd.DataFrame({'mun':list(geodf['CODIGO'])+['MEAN'],'inctot':0})
for municiname in muninc.keys():
    for municicode in codesnames[municiname]:
        ## If there's only one municipio code corresponding to the municipality name, weight that fully
        if len(munshares[municicode]) == 1:
            munincdf.loc[whereindf(municicode,munincdf),'inctot'] = muninc[municiname]
        else:
            if 'Other municipalities' in municiname:
                munincdf.loc[whereindf(municicode,munincdf),'inctot'] += munshares[municicode][1]*muninc[municiname]
            else:
                ## PRETTY SURE THERE WAS A CODING ERROR HERE BEFORE, used to be munshares[municicode][1]
                munincdf.loc[whereindf(municicode,munincdf),'inctot'] += munshares[municicode][0]*muninc[municiname]
                
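## Average over municipality rows only, excluding the 'MEAN' placeholder row (whose inctot is still 0)
munincdf.loc[whereindf('MEAN',munincdf),'inctot'] = munincdf.loc[munincdf['mun'] != 'MEAN','inctot'].mean()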

Unnamed: 0,inctot,mun
0,5918.637010,101
1,3489.051061,201
2,1593.022997,202
3,1877.987084,203
4,1760.761636,204
5,1593.022997,205
6,1760.761636,206
7,3489.051061,207
8,1877.987084,208
9,1877.987084,209
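
The weighting scheme in the cell above can be traced with a small worked example. This is a sketch in plain Python under made-up numbers: the function name `municipality_weights` and all counts and incomes are hypothetical. It assumes a municipality code that appears under two IPUMS names, its own name and an "Other municipalities in ..." category, with the "Other" observations spread evenly over the codes in the province.

```python
def municipality_weights(own_obs, other_obs, n_province_codes):
    """Return (own_weight, other_weight) for one municipality code.

    own_obs: observations reported under the municipality's own IPUMS name
    other_obs: observations reported under 'Other municipalities in <province>'
    n_province_codes: number of municipality codes in that province
    """
    # Spread the 'Other' observations evenly across the province's codes
    other_share = other_obs / float(n_province_codes)
    total = own_obs + other_share
    return own_obs / total, other_share / total

# Made-up counts: 300 own observations, 400 'Other' observations over 4 codes
own_w, other_w = municipality_weights(300, 400, 4)

# Weighted average income from made-up group means (2000 own, 1000 'Other')
weighted_income = own_w * 2000.0 + other_w * 1000.0
```

With these numbers the "Other" category contributes 100 effective observations, so the weights are 0.75 and 0.25 and the weighted income is 1750.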


In [3]:
### Calculate average income at the municipality and occupation level for 2002

munoccinc = incdf.groupby(['occ','geo2_dox'], as_index=False)['inctot'].mean()
#munoccincdf = pd.DataFrame({'mun':list(geodf['CODIGO'])+['MEAN'],'inctot':0,'occ':0})
munoccinc['mun'] = munoccinc['geo2_dox'].apply(lambda x: codesnames[x])
## Create duplicate entries for each municipality code (note that for each, 'geo2_dox' will differ)
munoccinc = munoccinc.reset_index(drop=True)
origrows = list(munoccinc.index)
for i in origrows:
    nummuns = len(munoccinc.loc[i,'mun'])
    if nummuns != 1:
        updaterows = []
        for j in range(nummuns)[1:]:
            newrow = dict(munoccinc.loc[i,:])
            newrow['mun']=munoccinc.loc[i,'mun'][j]
            updaterows.append(newrow)
        ## Add the duplicated rows and renumber the index
        munoccinc = pd.concat([munoccinc, pd.DataFrame(updaterows)], ignore_index=True)
    
    munoccinc.loc[i,'mun'] = munoccinc.loc[i,'mun'][0]

## Sort on municipality and occupation for testing, don't delete this though
munoccinc.sort_values(['mun','occ'], inplace=True)
munoccinc = munoccinc.reset_index(drop=True)

## Create dict with number of observations for a given municipality code in a certain occupation
munoccinc['munocc'] = munoccinc['mun'].astype(str)+"  "+munoccinc['occ'].astype(str)
munoccobs = munoccinc.groupby('munocc')['geo2_dox'].count()
munoccobs = munoccobs.reset_index()
occandmuncounts = dict(zip(munoccobs['munocc'],munoccobs['geo2_dox']))

## Weight income by munshares dictionary, and then sum on municipality
munoccinc['munocc'] = munoccinc['munocc'].apply(lambda x: occandmuncounts[x])
for i in munoccinc.index:
    if munoccinc.loc[i,'munocc'] != 1:
        if 'Other municipalities' in munoccinc.loc[i,'geo2_dox']:
            munoccinc.loc[i,'inctot']=munoccinc.loc[i,'inctot']*munshares[munoccinc.loc[i,'mun']][1]
        else:
            munoccinc.loc[i,'inctot']=munoccinc.loc[i,'inctot']*munshares[munoccinc.loc[i,'mun']][0]
munoccinc.drop(['munocc','geo2_dox'],axis=1,inplace=True) #This isn't really necessary, groupby drops cols
munoccinc = munoccinc.groupby(['occ','mun'], as_index=False)['inctot'].sum()
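
The row-duplication loop in the cell above gives each municipality code its own row. As a point of comparison, here is a minimal sketch of the same reshaping using `DataFrame.explode`, assuming pandas 0.25 or newer. The toy frame and the place names in it are made up, and this is not the code that produced the output files.

```python
import pandas as pd

# Toy stand-in for munoccinc: 'mun' holds the list of codes for each IPUMS name
toy = pd.DataFrame({
    'occ': [11, 11],
    'geo2_dox': ['Town A', 'Other municipalities in X'],
    'inctot': [2000.0, 1000.0],
    'mun': [[201], [202, 203]],
})

# One row per (occupation, municipality code); the other columns are repeated
exploded = toy.explode('mun').reset_index(drop=True)
exploded['mun'] = exploded['mun'].astype(int)
```

Unlike the loop, `explode` keeps the duplicated rows next to their source row rather than appending them at the end, which makes no difference once the frame is sorted on `mun` and `occ`.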

In [4]:
## Count observations in each occupation for each municipality in the 2002 data
occudf = pd.DataFrame({'mun':list(geodf['CODIGO'])})
occupationcodes = sorted(incdf['occ'].unique())
## Create columns in dataframe according to each occupation code
for occ in occupationcodes:
    occudf[occ] = 0    
## Construct data set
munocc = incdf.groupby('occ', as_index=False)['geo2_dox'].groups
for occ in occupationcodes:
    for dfindex in munocc[occ]:
        municiname = incdf['geo2_dox'][dfindex]
        for municicode in codesnames[municiname]:
            occudf.loc[whereindf(municicode,occudf),occ] += 1

In [5]:
## Count observations in each occupation for each municipality in the 2010 data
## This code is literally a direct copy of code above

occudf2010 = pd.DataFrame({'mun':list(geodf['CODIGO'])})
occodes2010 = sorted(workers2010df['occ'].unique())
## Create columns in dataframe according to each occupation code
for occ in occodes2010:
    occudf2010[occ] = 0    
## Construct data set
munocc2010 = workers2010df.groupby('occ', as_index=False)['geo2_dox'].groups
for occ in occodes2010:
    for dfindex in munocc2010[occ]:
        municiname = workers2010df['geo2_dox'][dfindex]
        for municicode in codesnames[municiname]:
            occudf2010.loc[whereindf(municicode,occudf2010),occ] += 1
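
Since the 2002 and 2010 cells repeat the same counting logic, it could be factored into one function. The following is a sketch in plain Python rather than pandas, with a hypothetical function name and made-up inputs; it assumes, as the cells above do, that every municipality code listed under an IPUMS name receives the observation.

```python
from collections import Counter

def count_occupations_by_municipality(records, codesnames):
    """Count observations per (municipality code, occupation).

    records: iterable of (occ, geo2_dox) pairs
    codesnames: dict mapping each IPUMS name to a list of municipality codes
    """
    counts = Counter()
    for occ, geoname in records:
        # Credit the observation to every code that shares this IPUMS name
        for muncode in codesnames[geoname]:
            counts[(muncode, occ)] += 1
    return counts

# Made-up inputs for illustration
toy_codesnames = {'Town A': [101], 'Other municipalities in X': [201, 202]}
toy_records = [(11, 'Town A'), (11, 'Other municipalities in X'),
               (52, 'Other municipalities in X'), (11, 'Town A')]
toy_counts = count_occupations_by_municipality(toy_records, toy_codesnames)
```

The same function could then be called once with the 2002 records and once with the 2010 records, instead of keeping two copies of the loop.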

In [6]:
### Write each file to csv
## INTERMEDIATES
occudf.to_csv(occumunoutput2002,index=False)
occudf2010.to_csv(occumunoutput2010,index=False)

## OUTPUTS
munincdf.to_csv(munincoutput,index=False)
munoccinc.to_csv(munincoccoutput,index=False)