This imports pandas and loads the survey data set, a Stata .DTA file hosted on GitHub.

In [1]:
import pandas as pd
Survey='https://github.com/arsell/599-Project/blob/master/data/HNBR62FL.DTA?raw=true'
DF=pd.read_stata(Survey)

This restricts the data frame to the variables that will be used in our analysis.

In [2]:
DF=DF[['caseid', 'v133', 'v012', 'v130', 'v024', 'v190', 'v191','v025', 'v001']]

This renames the variables.

In [3]:
DF.columns=['caseid',    # Case ID
            'educ',      # Education in single years
            'age',       # Respondent's current age
            'religion',  # Religion
            'region',    # Region
            'wealthCat', # Wealth index categorical
            'wealthDec', # Wealth index decimal
            'urban',     # Type of residence
            'DHSCLUST']  # DHS cluster number, to link with shape file


Recoding missing values: the code 99 is replaced with NaN in educ, and observations with religion coded 99 are dropped

In [4]:
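import numpy as np

# 99 is replaced with NaN; np.nan is the spelling that works across NumPy versions
DF[['educ']]=DF[['educ']].replace([99], np.nan)
DF[['educ']]=DF[['educ']].astype('float')
# drop observations whose religion is coded 99
DF=DF[DF.religion != 99]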
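
As a minimal sketch of the recoding step above, the following uses a small made-up frame (the values are hypothetical, not taken from the survey) to show how a sentinel code of 99 becomes NaN in one column and how rows carrying 99 in another column are dropped.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; 99 stands in for "missing".
toy = pd.DataFrame({"educ": [6, 99, 12, 0],
                    "religion": [1, 2, 99, 1]})

# Replace the sentinel with NaN; the column is cast to float to hold NaN.
toy["educ"] = toy["educ"].replace(99, np.nan).astype("float")

# Keep only rows whose religion code is not the sentinel.
toy = toy[toy["religion"] != 99]

n_missing_educ = int(toy["educ"].isna().sum())
n_rows = len(toy)
```

One thing worth checking against the real data: read_stata converts labelled Stata variables to categoricals by default, in which case a comparison with the number 99 would not match any row and the filter would need the label instead. The toy frame assumes plain numeric codes.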

In [5]:
oldUrban=list(DF.urban.cat.categories)
# 1 urban / 0 rural
newUrban=[1,0]
recodeUrban={old:new for old,new in zip(oldUrban,newUrban)}

In [6]:
# rename_categories returns a new column, so assign the result back
DF['urban']=DF.urban.cat.rename_categories(recodeUrban)
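
The recoding above pairs old and new categories by position, so it depends on the order in which the categories happen to be stored. A minimal sketch of an alternative, assuming hypothetical labels "urban" and "rural" on a toy column (the actual labels in the survey may differ), maps each label explicitly:

```python
import pandas as pd

# Toy column, invented for illustration; the labels are hypothetical.
residence = pd.Series(["urban", "rural", "urban"], dtype="category")

# Map each label explicitly rather than relying on category order.
mapping = {"urban": 1, "rural": 0}
recoded = residence.cat.rename_categories(mapping)

codes_as_list = recoded.tolist()
new_categories = list(recoded.cat.categories)
```

With an explicit dictionary the result stays the same even if the categories are stored in a different order, which is not guaranteed when two lists are zipped together.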

Removing duplicate observations from the data frame

In [7]:
DF=DF.drop_duplicates()

Because the original data are in Spanish, some of the region category names contain accented characters. We build a dictionary that maps each original category name to its unaccented version, keyed by column name so that only region is affected, and pass it to replace to update the data frame.

In [8]:
cleanup_names = {"region": {"Atl?ntida": "Atlantida", "Cop?n": "Copan", "Col?n": "Colon", "Cort?s": "Cortes",  
                            "Francisco Moraz?n": "Francisco Morazan", "Intibuc?": "Intibuca", 
                            "Santa B?rbara": "Santa Barbara", "El Para?so": "El Paraiso", 
                            "Islas de la Bah?a": "Islas de la Bahia"}} 

DF.replace(cleanup_names, inplace = True)
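
To illustrate how the nested dictionary is interpreted, here is a minimal sketch on a made-up frame (the rows are hypothetical). The outer key names the column and the inner dictionary maps old values to new ones, so values in other columns are left alone.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"region": ["Copán", "Yoro", "Colón"],
                    "urban": [1, 0, 1]})

# Outer key: column name. Inner dict: old value -> new value.
cleanup = {"region": {"Copán": "Copan", "Colón": "Colon"}}

toy = toy.replace(cleanup)
regions = toy["region"].tolist()
```

Names without an entry in the inner dictionary, such as "Yoro" here, pass through unchanged.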

Exporting clean data as CSV file

In [9]:
# index=False leaves the row index out of the file
DF.to_csv("../data/cleandata.csv", index=False)