dataassembly.py
Jay Sayre - sayrejay (at) Gmail

Purpose: Combines various datasets and prepares the data for analysis (see exploratoryanalysis.R script)

Inputs:

Produced by mergingtariffandindustrydata.py:
"cafta-dr/Output/ISICtwodigitleveltariffs.csv" - tariff averages at the ISIC 2-digit level for 2002 and 2013

"cafta-dr/Output/municipalityaverageisic4dig.csv" - municipality level tariff averages using ISIC 4 digit codes for the D.R. in 2013, using empresas data produced by the script compute_regional_employment.py, updated on 2/13/16
    
"cafta-dr/Output/municipalityaveragetariff2002.csv" - municipality level tariff averages for import competing industries in the D.R. in 2002 (import competing means Harmonized System goods that correspond to ISIC codes in the conversion table), using estimated industrial activity in a municipality for 2002 produced by the script mergingtariffandindustrydata.py

"cafta-dr/Output/municipalityaveragetariff2013.csv" - municipality level tariff averages for import competing industries in the D.R. in 2013 (same definition of import competing), using estimated industrial activity in a municipality for 2010 produced by the script mergingtariffandindustrydata.py

Produced by DHSdataaggregation.py:
"averageincomebymunicipality2013.csv" - Average income in each D.R. municipality for 2013

"averageincomebyoccmun2013.csv" - Average income in each D.R. municipality and occupation for 2013

Produced by IPUMSdataaggregation.py:
'averageincomebymunicipality2002.csv' - IPUMS 2002 information on the average income of workers in the private sector for a given D.R. municipality

'averageincomebyoccmun2002.csv' - IPUMS 2002 information on the average income of workers in a given ISIC occupation code for a given D.R. municipality

Produced by compute_regional_employment.py:

estmunicipalindustryactivity2002.csv - Combines D.R. empresa data and IPUMS data at the ISIC 2-digit level for 2002
estmunicipalindustryactivity2010.csv - Combines D.R. empresa data and IPUMS data at the ISIC 2-digit level for 2010

Outputs:

'mun_level_isic4dig_DATASET.csv' - contains income and tariff levels for 2002 and 2013 at the municipality level (using ISIC four digit codes, empresas data only), to be analyzed later in R or STATA

'municipality_level_DATASET.csv' - contains income and tariff levels for 2002 and 2013 at the municipality level (using ISIC two digit codes, IPUMS+empresas data), to be analyzed later in R or STATA

'municipality_occupation_level_DATASET.csv' - contains income and tariff levels for 2002 and 2013 at the municipality and occupational level, to be analyzed later in R or STATA

In [2]:
import pandas as pd
import os

if os.name == 'nt':
    basedir ="D:/Dropbox/Dropbox (Personal)/College/DR_Paper/"
else:
    basedir ="/home/j/Dropbox/College/DR_Paper/"

outputdir = basedir+'cafta-dr/Output/'
    
## INPUTS
isictariffs = outputdir+'ISICtwodigitleveltariffs.csv' #MUN/OCC
muntariffisic4dig = outputdir+'municipalityaverageisic4dig.csv' # MUN/ISIC 4 DIG
muntariff2002 = outputdir+'municipalityaveragetariff2002.csv' #MUN
muntariff2013 = outputdir+'municipalityaveragetariff2013.csv' #MUN
munavginc2002 = outputdir+'averageincomebymunicipality2002.csv' #MUN
munavginc2013 = outputdir+'averageincomebymunicipality2013.csv' #MUN
munavgincocc2002 = outputdir+'averageincomebyoccmun2002.csv' #MUN/OCC
munavgincocc2013 = outputdir+'averageincomebyoccmun2013.csv' #MUN/OCC
industry2002 = outputdir+'estmunicipalindustryactivity2002.csv' #MUN/OCC
industry2010 = outputdir+'estmunicipalindustryactivity2010.csv' #MUN/OCC

## OUTPUTS
munisic4output = outputdir+'mun_level_isic4dig_DATASET.csv'
munoutput = outputdir+'municipality_level_DATASET.csv'
munoccoutput = outputdir+'municipality_occupation_level_DATASET.csv'
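
Since every input above is a hard-coded path, a quick existence check before reading anything can save a confusing failure later. The sketch below is only an illustration: `missing_inputs` is a hypothetical helper, not part of the original script, and the demo uses a temporary file rather than the real inputs.

```python
import os
import tempfile

def missing_inputs(paths):
    """Return the entries of `paths` that are not existing files."""
    return [p for p in paths if not os.path.isfile(p)]

# Demo with one real (temporary) file and one made-up path.
with tempfile.NamedTemporaryFile(suffix='.csv') as tmp:
    demo_missing = missing_inputs([tmp.name, tmp.name + '.does_not_exist'])

print(demo_missing)
```

In the notebook itself the same helper could be called on the list of input variables (isictariffs, muntariff2002, and so on) before the first pd.read_csv.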

In [3]:
### Build municipality level tariff/income data set aka #MUN

tariffdf02 = pd.read_csv(muntariff2002)
tariffdf13 = pd.read_csv(muntariff2013)
avgincdf02 = pd.read_csv(munavginc2002)
avgincdf13 = pd.read_csv(munavginc2013)

tariffdf02.columns = ['mun','duty02']
tariffdf13.columns = ['mun','duty13']
avgincdf02.columns = ['inc02','mun']
avgincdf02['mun']=avgincdf02['mun'].astype(str)  # defensive: match the string merge key used below
avgincdf13.drop(columns=['PROV','MUN'],inplace=True)  # keyword form; a positional axis argument fails in pandas 2.0+
avgincdf13.columns = ['grossalary13','occinc13','frstsourcinc13','mun']
avgincdf13['mun']=avgincdf13['mun'].astype(str)

mundf = tariffdf02.merge(tariffdf13,on='mun')
mundf['mun']=mundf['mun'].astype(str)
mundf = mundf.merge(avgincdf02,on='mun')
mundf = mundf.merge(avgincdf13,on='mun')
mundf['prov'] = mundf['mun'].apply(lambda x: x[:1] if len(x)==3 else x[:2])
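
The merges above use the default inner join, so any municipality missing from one of the tables is dropped silently. One way to see what would be lost is an outer merge with indicator=True. This is a sketch on made-up toy data (the codes and values are not from the real files), not the author's procedure.

```python
import pandas as pd

# Toy stand-ins for the tariff and income tables (values are invented).
tariffs = pd.DataFrame({'mun': ['101', '102', '2501'],
                        'duty02': [8.0, 5.5, 3.2]})
incomes = pd.DataFrame({'mun': ['101', '2501', '3201'],
                        'inc02': [4100.0, 3900.0, 3500.0]})

# An outer merge with indicator=True adds a '_merge' column whose values are
# 'both', 'left_only', or 'right_only'.
audit = tariffs.merge(incomes, on='mun', how='outer', indicator=True)

# Rows that an inner merge would have dropped.
unmatched = audit.loc[audit['_merge'] != 'both', ['mun', '_merge']]
print(unmatched)
```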


In [4]:
### Build municipality level tariff/income data set aka #MUN/ISIC 4 dig

tariff4digdf = pd.read_csv(muntariffisic4dig)
avgincdf02 = pd.read_csv(munavginc2002)
avgincdf13 = pd.read_csv(munavginc2013)

tariff4digdf.columns = ['mun','duty02','duty13']
tariff4digdf['mun']=tariff4digdf['mun'].astype(str)
avgincdf02.columns = ['inc02','mun']
avgincdf02['mun']=avgincdf02['mun'].astype(str)  # defensive: match the string merge key used below
avgincdf13.drop(columns=['PROV','MUN'],inplace=True)  # keyword form; a positional axis argument fails in pandas 2.0+
avgincdf13.columns = ['grossalary13','occinc13','frstsourcinc13','mun']
avgincdf13['mun']=avgincdf13['mun'].astype(str)

mun4digdf = tariff4digdf.merge(avgincdf02,on='mun')
mun4digdf = mun4digdf.merge(avgincdf13,on='mun')
mun4digdf['prov'] = mun4digdf['mun'].apply(lambda x: x[:1] if len(x)==3 else x[:2])
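
The 'prov' lambda above takes the first character of a 3-character code and the first two characters otherwise. Assuming every municipality code is a province code followed by a two-digit municipality number (which is what the lambda implies for 3- and 4-character codes), the same result can be sketched with a vectorized string slice. The codes below are toy values.

```python
import pandas as pd

codes = pd.Series(['101', '2501', '3201'], name='mun')

# Original approach from the cells above.
prov_lambda = codes.apply(lambda x: x[:1] if len(x) == 3 else x[:2])

# Vectorized alternative: drop the last two characters.
prov_slice = codes.str[:-2]

print(prov_slice.tolist())
```

The two versions agree only for 3- and 4-character codes; for any other length they differ, so the lambda remains the safer choice if the input may contain other code lengths.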


In [5]:
### Build municipality and occupation level tariff/income data set aka #MUN/OCC

munoccdf02 = pd.read_csv(munavgincocc2002)
munoccdf13 = pd.read_csv(munavgincocc2013)
isictwodig = pd.read_csv(isictariffs)
industryact02 = pd.read_csv(industry2002)
industryact10 = pd.read_csv(industry2010)

## Prepare data for merging
munoccdf13.drop(columns=['PROV','MUN'],inplace=True)  # keyword form; a positional axis argument fails in pandas 2.0+
munoccdf02.columns=['occ','mun','inc2002']
munoccdf13.columns = ['occ','grossalary13','occinc13','frstsourcinc13','mun']
isictwodig.columns=['occ','duty02','duty13','nontraded']
industryact02 = industryact02.set_index('mun').stack().reset_index()
industryact10 = industryact10.set_index('mun').stack().reset_index()
industryact02.columns=['mun','occ','numworkers02']
industryact10.columns=['mun','occ','numworkers10']
## Make sure all merge columns are of the same type
## This shouldn't be necessary, but I couldn't fix merge issues otherwise
## Whatever, it only means code is longer than it has to be.. oh well
munoccdf02['munocc']=munoccdf02['mun'].astype(str)+'  '+munoccdf02['occ'].astype(str)
munoccdf13['munocc']=munoccdf13['mun'].astype(str)+'  '+munoccdf13['occ'].astype(str)
industryact02['munocc']=industryact02['mun'].astype(str)+'  '+industryact02['occ'].astype(str)
industryact10['munocc']=industryact10['mun'].astype(str)+'  '+industryact10['occ'].astype(str)
## Drop the separate key columns now that 'munocc' carries both
munoccdf02.drop(columns=['mun','occ'],inplace=True)
munoccdf13.drop(columns=['mun','occ'],inplace=True)
industryact02.drop(columns=['mun','occ'],inplace=True)
industryact10.drop(columns=['mun','occ'],inplace=True)
isictwodig['occ']=isictwodig['occ'].astype(str)

## Merge all files together
munoccdf = munoccdf02.merge(munoccdf13,on='munocc')
munoccdf = munoccdf.merge(industryact02,on='munocc')
munoccdf = munoccdf.merge(industryact10,on='munocc')
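## Split the combined key back apart (the separator is two spaces)
munoccdf['occ'] = munoccdf['munocc'].apply(lambda x: x.split('  ')[1])
## Note: after this line 'munocc' holds only the municipality code
munoccdf['munocc'] = munoccdf['munocc'].apply(lambda x: x.split('  ')[0])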
munoccdf = munoccdf.merge(isictwodig,on='occ')
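
The cell above works around dtype mismatches by gluing 'mun' and 'occ' into a single string key. A possible alternative, sketched here on invented toy data, is to cast both key columns to str in every table and merge on the two columns directly. It also shows the wide-to-long reshape done with set_index/stack. Whether this resolves the merge issues seen with the real files is an assumption.

```python
import pandas as pd

# Toy wide table: one row per municipality, one column per ISIC 2-digit code.
wide = pd.DataFrame({'mun': [101, 102], '15': [120, 30], '17': [45, 0]})

# Reshape to long form: one row per (municipality, code) pair.
long_df = wide.set_index('mun').stack().reset_index()
long_df.columns = ['mun', 'occ', 'numworkers02']

# Toy income table whose keys are stored as integers.
income = pd.DataFrame({'occ': [15, 17, 15],
                       'mun': [101, 101, 102],
                       'inc2002': [4000.0, 3500.0, 3800.0]})

# Normalize both key columns to strings in both tables.
for df in (long_df, income):
    df['mun'] = df['mun'].astype(str)
    df['occ'] = df['occ'].astype(str)

# Merge on the two keys without building a combined key.
merged = income.merge(long_df, on=['mun', 'occ'])
print(merged)
```

This keeps 'mun' and 'occ' as separate columns throughout, so no split step is needed afterwards.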

In [6]:
### Write outputs to file
mundf.to_csv(munoutput,index=False)
mun4digdf.to_csv(munisic4output,index=False)
munoccdf.to_csv(munoccoutput,index=False)
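
Municipality codes are treated as strings throughout this notebook, but a CSV does not store dtypes, so a later pandas read will infer integers again unless told otherwise. A small sketch with toy data and an in-memory buffer (not the real output files):

```python
import io
import pandas as pd

df = pd.DataFrame({'mun': ['101', '2501'],
                   'prov': ['1', '25'],
                   'duty02': [8.0, 3.2]})

# Write to an in-memory buffer instead of a file for this demo.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: the code columns are inferred as integers.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Passing dtype keeps the code columns as strings.
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={'mun': str, 'prov': str})

print(default_read.dtypes)
print(string_read.dtypes)
```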