Import pandas and load the data set (a Stata .DTA file hosted on GitHub).

In [1]:
import pandas as pd
Survey='https://github.com/arsell/599-Project/blob/master/data/HNBR62FL.DTA?raw=true'
DF=pd.read_stata(Survey)

Next, keep only the variables that will be used in our analysis.

In [2]:
DF=DF[['caseid', 'v133', 'v012', 'v130', 'v024', 'v190', 'v191','v025', 'v001']]
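
As a possible shortcut, `read_stata` accepts a `columns` argument, so loading and selecting could happen in one call. Here is a minimal sketch on a made-up toy file (the `toy` frame, its values, and the `unused` column are invented for illustration, not taken from the survey):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the survey file (hypothetical values).
toy = pd.DataFrame({
    "caseid": ["1 21 1", "1 21 2"],
    "v012": [40, 27],
    "v025": ["Urban", "Rural"],
    "unused": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")
    toy.to_stata(path, write_index=False)
    # Read back only the columns we care about.
    wanted = ["caseid", "v012", "v025"]
    subset = pd.read_stata(path, columns=wanted)

print(list(subset.columns))
```

Whether this is worthwhile depends on the file size; for a one-off load, selecting columns afterwards works just as well.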

Let's look at our data.

In [3]:
DF.shape

(49263, 9)

In [4]:
DF.head()

   caseid  v133  v012                      v130       v024    v190   v191   v025  v001
0  1 21 1     6    40  Evangelical / Protestant  Atl?ntida  Richer  72182  Urban     1
1  1 21 1     6    40  Evangelical / Protestant  Atl?ntida  Richer  72182  Urban     1
2  1 21 1     6    40  Evangelical / Protestant  Atl?ntida  Richer  72182  Urban     1
3  1 21 1     6    40  Evangelical / Protestant  Atl?ntida  Richer  72182  Urban     1
4  1 21 1     6    40  Evangelical / Protestant  Atl?ntida  Richer  72182  Urban     1


Now we rename the variables to something easier to understand.

In [5]:
newnames={'caseid': 'caseid', #CASEID
            'v133': 'educ', #Education in Single years
            'v012': 'age', #Respondent's current age
            'v130': 'religion', #Religion-Catholic, Evangelical, Other, None
            'v024': 'region', #Region-one of 18 administrative regions
            'v190': 'wealthCat', #Wealth index categorical
            'v191': 'wealthDec', #Wealth index decimal
            'v025': 'urban', #Type of residence-Urban, Rural
            'v001': 'DHSCLUST'} #DHS Cluster number-to link with shape file


In [6]:
DF.rename(columns=newnames,inplace=True)

Let's check to make sure everything was renamed properly.

In [7]:
DF.head()

   caseid  educ  age                  religion     region  wealthCat  wealthDec  urban  DHSCLUST
0  1 21 1     6   40  Evangelical / Protestant  Atl?ntida     Richer      72182  Urban         1
1  1 21 1     6   40  Evangelical / Protestant  Atl?ntida     Richer      72182  Urban         1
2  1 21 1     6   40  Evangelical / Protestant  Atl?ntida     Richer      72182  Urban         1
3  1 21 1     6   40  Evangelical / Protestant  Atl?ntida     Richer      72182  Urban         1
4  1 21 1     6   40  Evangelical / Protestant  Atl?ntida     Richer      72182  Urban         1
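
Besides eyeballing `head()`, the rename can also be checked programmatically. A minimal sketch, assuming a shortened version of the mapping and an empty toy frame (both hypothetical stand-ins for `newnames` and `DF`):

```python
import pandas as pd

# Shortened, hypothetical version of the rename mapping.
newnames = {"v133": "educ", "v012": "age", "v025": "urban"}
toy = pd.DataFrame(columns=["caseid", "v133", "v012", "v025"])

# Keys that do not exist as columns would be silently ignored by rename.
missing_keys = set(newnames) - set(toy.columns)

renamed = toy.rename(columns=newnames)
# Any old name still present after renaming signals a problem.
leftover = [c for c in renamed.columns if c in newnames]

print(missing_keys, leftover, list(renamed.columns))
```

An empty `missing_keys` and an empty `leftover` mean every old name was found and replaced.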


Check the types of variables (so we know how to treat them)

In [8]:
DF.dtypes

caseid         object
educ         category
age              int8
religion     category
region       category
wealthCat    category
wealthDec       int32
urban        category
DHSCLUST        int32
dtype: object

Handle missing or incorrectly coded data: values coded 99 in educ become NaN, and rows with religion coded 99 are dropped

In [9]:
import numpy as np

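# Treat the code 99 in educ as missing (np.nan; the np.NaN alias was removed in NumPy 2.0)
DF[['educ']]=DF[['educ']].replace([99], np.nan)
# Store educ as a float so it can hold NaN
DF[['educ']]=DF[['educ']].astype('float')
# Drop rows where religion is coded 99
DF=DF[DF.religion != 99]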
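
To see what this recoding does, here is a minimal sketch on a toy column (the values are invented; only the sentinel code 99 comes from the step above):

```python
import numpy as np
import pandas as pd

# Toy education column where 99 stands for a missing answer.
toy = pd.DataFrame({"educ": [6, 99, 12, 0]})

# Replace the sentinel with NaN; the column becomes float as a result.
toy["educ"] = toy["educ"].replace(99, np.nan)

# Count how many values are now missing.
n_missing = int(toy["educ"].isna().sum())
print(n_missing, toy["educ"].dtype)
```

Counting missing values with `isna().sum()` before and after the recode is a quick way to confirm how many observations were affected.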

Look at individual variables (in this case Urban)

In [10]:
DF.urban

0        Urban
1        Urban
2        Urban
3        Urban
4        Urban
5        Urban
6        Urban
7        Urban
8        Urban
9        Urban
10       Urban
11       Urban
12       Urban
13       Urban
14       Urban
15       Urban
16       Urban
17       Urban
18       Urban
19       Urban
20       Urban
21       Urban
22       Urban
23       Urban
24       Urban
25       Urban
26       Urban
27       Urban
28       Urban
29       Urban
         ...  
49233    Rural
49234    Rural
49235    Rural
49236    Rural
49237    Rural
49238    Rural
49239    Rural
49240    Rural
49241    Rural
49242    Rural
49243    Rural
49244    Rural
49245    Rural
49246    Rural
49247    Rural
49248    Rural
49249    Rural
49250    Rural
49251    Rural
49252    Rural
49253    Rural
49254    Rural
49255    Rural
49256    Rural
49257    Rural
49258    Rural
49259    Rural
49260    Rural
49261    Rural
49262    Rural
Name: urban, Length: 49222, dtype: category
Categories (2, object): [Urban < Rural]

Recode Type of Residence as a true binary [0,1]

In [11]:
oldUrban=list(DF.urban.cat.categories)
# 1 urban / 0 rural
newUrban=[1,0]
recodeUrban={old:new for old,new in zip(oldUrban,newUrban)}

In [12]:
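# rename_categories returns a new Series (the inplace option was removed in pandas 2.0), so assign it back
DF['urban']=DF.urban.cat.rename_categories(recodeUrban)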

Check our work

In [13]:
DF.urban

0        1
1        1
2        1
3        1
4        1
5        1
6        1
7        1
8        1
9        1
10       1
11       1
12       1
13       1
14       1
15       1
16       1
17       1
18       1
19       1
20       1
21       1
22       1
23       1
24       1
25       1
26       1
27       1
28       1
29       1
        ..
49233    0
49234    0
49235    0
49236    0
49237    0
49238    0
49239    0
49240    0
49241    0
49242    0
49243    0
49244    0
49245    0
49246    0
49247    0
49248    0
49249    0
49250    0
49251    0
49252    0
49253    0
49254    0
49255    0
49256    0
49257    0
49258    0
49259    0
49260    0
49261    0
49262    0
Name: urban, Length: 49222, dtype: category
Categories (2, int64): [1 < 0]
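
Note that `urban` is still a category after the recode, just with the labels 1 and 0. If a plain integer column is needed later (for a regression, say), one possible follow-up is sketched below on toy data (the `toy` frame is invented for illustration):

```python
import pandas as pd

# Toy categorical column mirroring the Urban/Rural coding.
toy = pd.DataFrame({
    "urban": pd.Categorical(
        ["Urban", "Rural", "Urban"],
        categories=["Urban", "Rural"],
        ordered=True,
    )
})

# Relabel the categories: 1 urban / 0 rural.
recode = {"Urban": 1, "Rural": 0}
toy["urban"] = toy["urban"].cat.rename_categories(recode)

# Convert the categorical labels to ordinary integers.
urban_int = toy["urban"].astype(int)
print(urban_int.tolist())
```

This conversion assumes the column has no missing values; with NaN present, `astype(int)` would fail.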

Remove duplicate observations from the data frame. We want each row to represent a single woman, rather than a woman-birth pairing.

In [14]:
DF=DF.drop_duplicates()
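
A minimal sketch of what this does, using a toy frame where one woman appears three times (the second caseid and the ages are invented for illustration):

```python
import pandas as pd

# Three birth records for one woman, one record for another.
toy = pd.DataFrame({
    "caseid": ["1 21 1", "1 21 1", "1 21 1", "1 35 2"],
    "age": [40, 40, 40, 27],
})

# Rows identical across all columns collapse to a single row.
women = toy.drop_duplicates()

# After deduplication, each caseid should appear only once.
print(len(toy), len(women), women["caseid"].is_unique)
```

This only works because the selected columns are all woman-level variables. If a birth-level variable were kept, the rows would differ and would not be dropped.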

Because the original language is Spanish, some of the region category names contain accented characters, which show up garbled (as "?") in the output above. The code below builds a dictionary that maps each affected name to a non-accented version, then uses the replace method to apply the changes to the data frame.

In [15]:
cleanup_names = {"region": {"Atl?ntida": "Atlantida", "Cop?n": "Copan", "Col?n": "Colon", "Cort?s": "Cortes",  
                            "Francisco Moraz?n": "Francisco Morazan", "Intibuc?": "Intibuca", 
                            "Santa B?rbara": "Santa Barbara", "El Para?so": "El Paraiso", 
                            "Islas de la Bah?a": "Islas de la Bahia"}} 

DF.replace(cleanup_names, inplace = True)
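
As an aside, if the accented characters had been read in intact, a general approach would avoid listing every name by hand. The sketch below swaps in a different technique (Unicode normalization from the standard library) and assumes correctly decoded input, so it would not fix names where the accent has already been replaced by "?". The helper name `strip_accents` is ours:

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Region names as they would look with the accents intact.
regions = pd.Series(["Atlántida", "Copán", "Islas de la Bahía"])
cleaned = regions.map(strip_accents)
print(cleaned.tolist())
```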

Exporting clean data as CSV file

In [16]:
# index=False leaves the row index out of the CSV
DF.to_csv("../data/cleandata.csv", index=False)
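
A quick way to confirm the export is to read the file back and compare. Here is a minimal sketch that writes a toy frame to a temporary folder (the frame and the temporary path are stand-ins, not the real `../data/cleandata.csv`):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned data (hypothetical values).
toy = pd.DataFrame({"caseid": ["1 21 1", "1 35 2"], "urban": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleandata.csv")
    toy.to_csv(path, index=False)
    # Read the file back to check that nothing was lost.
    back = pd.read_csv(path)

print(back.shape == toy.shape, list(back.columns))
```

Keep in mind that CSV does not store dtypes, so categorical columns come back as plain strings or numbers and need to be converted again if the categories matter.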