# DEVP 252: Problem Set 1. Due January 31

You may work in groups of up to 2 people, and then turn in just one copy of the problem set for the whole group. You may use STATA, R or Python to complete this assignment (but a walkthrough will only be provided for STATA and R). Please provide very short answers to each of the questions and turn in your Jupyter notebook as a PDF if you are using R or Python, or include your .do or .log files if you are using STATA.

For this problem set, you will look at poverty and welfare in South Africa. The data set we will use is a little old, but it is a very easy data set to learn on, so it's good to start with. Once you learn on these data, it will be easier to use many other data sets. 

Begin by going to the World Bank's Living Standards Measurement Survey (LSMS) webpage (just google World Bank LSMS, or go to: https://microdata.worldbank.org/index.php/catalog/297/get-microdata) and download the data for South Africa (you will have to fill out a brief data use agreement form to start). 

When you download the data, you will notice that the dataset is in many separate files, each file referring to a different portion of the survey questionnaire (there are also a few files containing "constructed" variables, where the World Bank saved you some time and added some variables together for you). You can find a "data map" file on the website that tells you which sections of the questionnaire are in which data files. This is done simply to make the data set more manageable and easy to use. And the files are also "compressed" or "zipped" to save space, so you should double click on the folders and then click the button to "extract" the files so that they are no longer compressed. You should also download the survey questionnaires, since they will show you what questions were asked. 


## Question 1: Computing Poverty Rates

Let's start out by measuring poverty. Let's use 150 Rand per person per month as our poverty line. (To keep it simple, ignore economies of scale and adult equivalents for now. Let's also ignore sampling weights.)

### A. What percent of households live below the poverty line?

I will walk you through this one because some of you are new to R. And this may seem like a very long process, but it is just a few steps--I am making it long to help explain things to you. So you don't get lost, here is a road map to the major steps: 
1. Take the data file that gives you information on people in the household and create a variable that adds up how many people are in the household;
2.  Merge this file with the file containing data on household income;
3.  Create an income per capita variable;
4.  Create a variable that tells me if income per capita is less than 150.

Before diving in further though, let's get R up and running. We'll need to install some packages and make sure R is pointed to the right location on our machines. 

In [1]:
## USE CTRL-ENTER to run me

# Install pacman only if it isn't already installed
if (!requireNamespace('pacman', quietly = TRUE)) install.packages('pacman')
# Load packages we need (pacman installs them if we don't already have them). tidyverse (which includes dplyr) helps us manipulate dataframes while haven allows us to read .dta files
pacman::p_load(tidyverse, haven)

# Print your working directory
getwd()

# Replace below with path to where you've put the data (unless it's in your current wd). Only run this once
setwd('DEVP252/ZAF_1993_IHS_v01_M_STATA')


package 'pacman' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\jsilv\AppData\Local\Temp\RtmpQdAm4O\downloaded_packages
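
Before reading anything in, it can help to confirm that R can actually see the data files from your working directory. The following is a minimal sketch of one way to check, assuming the two files used in this walkthrough sit directly in the folder you passed to `setwd()`; the names `needed` and `found` are just illustrative.

```r
# The two files used in this walkthrough
needed <- c("HHINCTL.dta", "M8_HROST.dta")

# TRUE for each file that R can find in the current working directory
found <- file.exists(needed)
names(found) <- needed
print(found)
```

If either entry is `FALSE`, the working directory is probably not pointing at the extracted data folder.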


Let's use income to measure poverty, instead of expenditures. The file `HHINCTL.dta` contains data on total household income, and income from various sources. This is one of those "constructed" files, where the World Bank took all of the separate questions on income from Sections 8.1 to 8.7 on the survey and added them all together to make one variable called `totminc`, for total monthly income (I could have asked you to do that yourself, but this will save you some time).

But to compute poverty, you need income per person (we will ignore adult equivalents and economies of scale to make your life easier). So you need to divide this variable by the number of people in the household. How do you do that?

First, you need to know how many people are in each household. So we will need to use the household roster file (`M8_HROST.dta`), which contains data on each person in the household. This corresponds to Section 1 of the survey questionnaire; for each person in each household, there is a line of data, with information on the person's age, sex, relationship to household head, etc.

R is an object-oriented language (unlike STATA), which means that everything (whether it's a number, a dataframe, a list, a function, or anything else) can be assigned a name and held in memory. To assign an object a name we use this reverse arrow `<-` (which you can read as "gets"). If we don't assign a name in a line of code then it will print the output (as in the last line below).
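
To make the assignment arrow concrete before we touch the survey data, here is a small sketch with made-up numbers. The names `z`, `incomes`, `halve` and `below` are purely illustrative and are not used elsewhere in the walkthrough.

```r
# Assign a number, a vector, and a function to names
z <- 150                      # illustrative name for a poverty line
incomes <- c(100, 200, 300)   # made-up values, not survey data
halve <- function(x) x / 2

# No assignment on this line, so R prints the result
halve(incomes)

# Assignment on this line, so nothing is printed...
below <- incomes < z
# ...until we ask for the object by name
below
```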



In [2]:
# Read in both dataframes
inc <- read_stata('./HHINCTL.dta')
rost <- read_stata('./M8_HROST.dta')


# How many observations are in the income data?
N <- nrow(inc)
print(paste0(N, " households"))


# See first 10 rows of the roster data
head(rost, 10)


[1] "8810 households"


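| hhid | pcode | mcode | pers_res | rel_head | gender_c | age | educ_c | spouse | father | mother | abs_mont | absence_ | resident | migrate | last_pla | gender_n | clustnum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | 1 | 1 | 1 | 1 | M | 62 | 4 | 2 | 88 | 99 | 0 | -2 | 1 | 2 | -2 | 3 | 1 |
| 1001 | 2 | 1 | 2 | 3 | F | 45 | 6 | 1 | 88 | 88 | 0 | -2 | 1 | 2 | -2 | 2 | 1 |
| 1002 | 1 | 1 | 1 | 1 | M | 67 | 0 | 99 | 88 | 88 | 0 | -2 | 1 | 2 | -2 | 3 | 1 |
| 1003 | 1 | 1 | 2 | 1 | M | 62 | 5 | 99 | 88 | 88 | 0 | -2 | 1 | 2 | -2 | 3 | 1 |
| 1003 | 2 | 1 | 1 | 4 | F | 24 | 6 | -2 | 1 | 99 | 0 | -2 | 1 | 2 | -2 | 2 | 1 |
| 1004 | 1 | 1 | 1 | 1 | M | 57 | 5 | 99 | 88 | 88 | 2 | 11 | 1 | 2 | -2 | 3 | 1 |
| 1005 | 1 | 1 | 2 | 1 | M | 28 | 6 | 2 | 88 | 99 | 0 | -2 | 1 | 2 | -2 | 3 | 1 |
| 1005 | 2 | 1 | 1 | 3 | F | 25 | 9 | 1 | 88 | 99 | 0 | -2 | 1 | 2 | -2 | 2 | 1 |
| 1005 | 3 | 1 | 2 | 4 | M | 4 | 0 | -2 | 1 | 2 | 0 | -2 | 1 | 2 | -2 | 3 | 1 |
| 1006 | 1 | 1 | 1 | 1 | M | 51 | 0 | 99 | 88 | 99 | 0 | -2 | 1 | 2 | -2 | 3 | 1 |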


Look at the first two lines above. These two people live in a household called 1001 (hhid=1001). There are two people in the household. Person 1 (`pcode=1`) is the household head, Male, 62 years old. Person 2 (`pcode=2`) is the wife of the head of the household, Female, 45 years old. The third line then tells you about person 1 in household 1002, etc. Get it? And do you see how they correspond to the survey questionnaire, Section 1?

OK, back to the question of how we compute how many people are in each household. You don't want to do this by hand: there are over 8,000 households in the data set! There are many ways we can do this in R. One such way is to use `dplyr`'s `group_by()` function followed by `mutate()`.

In [3]:
# Create variable for # of HH members
rost <- rost %>% arrange(hhid) %>% group_by(hhid) %>%
  mutate(hh_size = n())

What's going on above involves a few steps, so let's unpack them one by one. 

First, `%>%` is known as the pipe operator and what it does is use the previous data frame as the first argument of the following function. So `rost %>% arrange(hhid)` takes the dataframe `rost` and then sorts it by the column `hhid`. `group_by()` then groups the data by `hhid`. After `group_by()` (note how we can continue the code on the following line after a pipe) we use the `mutate()` function to create a new column called `hh_size` that's equal to the number of observations in each group using the function `n()`. Note that we're assigning the outcome of this whole thing to `rost` (overwriting the original object). 
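
If the pipe still feels abstract, the sketch below runs the same chain on a tiny made-up roster and then writes the identical computation without pipes. The data frame `toy` and the names `piped` and `nested` are illustrative only.

```r
library(dplyr)

# Toy roster with made-up households (not the survey data)
toy <- tibble(hhid = c(2, 1, 1, 2, 2), age = c(30, 62, 45, 4, 28))

# Piped version: read it left to right
piped <- toy %>% arrange(hhid) %>% group_by(hhid) %>% mutate(hh_size = n())

# Same computation without pipes: read it inside out
nested <- mutate(group_by(arrange(toy, hhid), hhid), hh_size = n())

# Household 1 has 2 members, household 2 has 3
piped$hh_size
```

Both versions give the same result; the pipe just lets you write the steps in the order they happen.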

We may also want to know how many children and adults are in each household, which we can create as follows by summing up the values of the Boolean (TRUE/FALSE) statements within each household.

In [4]:
rost <- rost %>% mutate(adults = sum(age>=16), kids = sum(age<16))
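
Why does summing a TRUE/FALSE statement count people? R treats `TRUE` as 1 and `FALSE` as 0 when you add them up. Here is a small sketch with made-up ages for a single household (the vector `ages` is illustrative):

```r
ages <- c(62, 45, 4)   # made-up ages for one household

ages >= 16             # TRUE TRUE FALSE
sum(ages >= 16)        # number of adults
sum(ages < 16)         # number of kids

# A missing age makes the comparison NA, and so the sum is NA too
sum(c(62, NA) >= 16)
# Adding na.rm = TRUE drops the missing value before summing
sum(c(62, NA) >= 16, na.rm = TRUE)
```

If the roster has missing ages, the last two lines show what to expect and one possible way to handle it.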

This is all the information we really need out of `rost`, and since these variables are the same for all observations within each household, we can make our lives easier by only keeping one row per household as follows:

In [5]:
rost <- rost %>% slice(1)
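
As an aside, `mutate()` followed by `slice(1)` is not the only way to get one row per household. The sketch below shows an alternative using `summarise()` on a made-up roster; this is a different technique from the one used in the walkthrough, and `toy` and `collapsed` are illustrative names.

```r
library(dplyr)

# Toy roster with made-up households (not the survey data)
toy <- tibble(hhid = c(1, 1, 2, 2, 2), age = c(62, 45, 30, 4, 28))

# summarise() computes the household totals and collapses to one row per group
collapsed <- toy %>%
  group_by(hhid) %>%
  summarise(hh_size = n(), adults = sum(age >= 16), kids = sum(age < 16))

collapsed
```

One difference to keep in mind: `slice(1)` keeps every other roster column (including `clustnum`), whereas `summarise()` keeps only the grouping columns and the ones you compute.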

Now that we've collapsed everything down to the household level, we'd like to merge the data on household size with the data on incomes we loaded earlier. We do so as follows, creating a new dataframe called `df`. 

We need to choose which variables to merge on (note that these variables should uniquely identify rows in at least one of the data frames) and put them in quotes inside the `by` argument. Also, when we want to pass several values at once in R, we need to combine them into a vector with `c()`. We want to merge on both `hhid` and `clustnum` here because both columns appear in both dataframes (you could just merge on `hhid` but then it would create a `clustnum.x` and a `clustnum.y`, which would be annoying). Finally, we specify `all = T` (or equivalently `all = TRUE`) to indicate that we want to keep all observations from both dataframes, even if they don't match each other.

In [6]:
# merge the two dataframes
df <- merge(inc, rost, by=c('hhid','clustnum'), all = T)
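
To see what `all = TRUE` actually does, here is a small sketch with two made-up data frames in which one household is missing from each side. The names `inc_toy`, `rost_toy`, `both` and `inner` are illustrative, and the toy merge uses only `hhid` for brevity.

```r
# Made-up data: household 3 has no roster, household 4 has no income record
inc_toy  <- data.frame(hhid = c(1, 2, 3), totminc = c(500, 1200, 90))
rost_toy <- data.frame(hhid = c(1, 2, 4), hh_size = c(2, 3, 5))

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA
both <- merge(inc_toy, rost_toy, by = "hhid", all = TRUE)
both

# The default (all = FALSE) keeps only households found in both data frames
inner <- merge(inc_toy, rost_toy, by = "hhid")
inner
```

Comparing `nrow()` before and after a merge like this is a quick way to spot households that failed to match.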

All we've got to do now is create a variable for household income per capita and then a dummy for whether the household's income per capita (`totminc` divided by `hh_size`) is below the poverty line of 150 Rand. Note that there are multiple ways to create variables (columns) in a dataframe.  

In [7]:
df <- df %>% mutate(inc_percapita = totminc/hh_size)
df$is_poor <- df$inc_percapita < 150 # $ is used to access a dataframe column

What's the poverty rate? We just need to take the average of `is_poor`. But since some rows have missing data on incomes, we need to specify `na.rm = T`. (Here, we're choosing to ignore households missing data for simplicity, although this is not always a good idea in practice. You can also just drop these households altogether using `df <- df %>% drop_na(totminc)`.)

In [18]:
mean(df$is_poor, na.rm = T)

So 38.1% of households live below the poverty line.
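
Taking the mean of a TRUE/FALSE column gives the share of TRUEs, and `na.rm` controls what happens to missing values. A minimal sketch with a made-up vector (`is_poor_toy` is an illustrative name):

```r
is_poor_toy <- c(TRUE, FALSE, NA, TRUE)   # made-up values, one missing

mean(is_poor_toy)                 # NA, because one value is missing
mean(is_poor_toy, na.rm = TRUE)   # 2 of the 3 non-missing values are TRUE
sum(is.na(is_poor_toy))           # how many observations were dropped
```

The last line is worth running on the real data too, so you know how many households the poverty rate silently leaves out.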

### B. I have computed the headcount poverty measure, $P_0$. Now you compute the poverty gap ($P_1$) and $P_2$ poverty measures. 
_Hint:_ For the poverty gap measure $P_1$, after you compute income per capita ($y$), compute $1-\frac{y}{z}$ for each household, where $z$ is the poverty line. Multiply this by your `is_poor` variable and then take the average across all households (that is, sum and divide by the number of households). I'll leave it to you to figure out how to do $P_2$.
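
For reference, the headcount measure from part A and the poverty gap described in the hint can be written as below. The household index $i$ and the household count $N$ are notation assumed here rather than defined earlier in the problem set; $y_i$ is household $i$'s income per capita, $z$ is the poverty line, and $\mathbf{1}(\cdot)$ equals 1 when the condition holds and 0 otherwise (this is your `is_poor` variable).

$$
P_0 = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(y_i < z),
\qquad
P_1 = \frac{1}{N}\sum_{i=1}^{N} \left(1 - \frac{y_i}{z}\right)\mathbf{1}(y_i < z)
$$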

## Question 2. Adult Equivalents

Now compute the poverty rate (using the headcount measure) but adjust for adult equivalents so that children (age < 16) are only worth half of an adult, i.e. $\gamma=0.5$. How does the poverty rate change? 

_Hint:_ instead of counting up the number of people in the household, count up the number of children and the number of adults separately.

## Question 3. By Race and Urban/Rural 

A. Use the adult equivalents from Question 2, but now compute the headcount measure by race.

B. Now calculate the headcount ratio separately for urban and rural households (_Hint:_ the merge here will be by `clustnum`). 

## Question 4: Final Proposal

The objective of your analysis is to make recommendations to the Government of South Africa and the World Bank as to how to target their efforts to reduce poverty. Give your punch line: if the objective of eliminating poverty is to be met, where should South Africa and the World Bank concentrate their efforts? Are there any caveats regarding the analysis? 

