In [3]:
suppressPackageStartupMessages(library(tidyverse))

# Some drafty exercises

## A normal income

In this series of exercises you will be guided through an apparently simple data science task: computing the typical disposable income. The data set is synthetic, but it is similar to many real-life income surveys you may find in government data repositories.

**_Pedagogical goal_**: This series of exercises should encourage students to consider the particular point of view they adopt when solving even simple data science tasks. _Data analysis strongly depends on the point of view we take. These points of view carry political meaning, and we routinely implement them in technical decisions. We cannot avoid them, but we should try to be aware of them._

### Ex 1.2 Alberonia and Florilandia, being mean

In the dataframe `Income_survey` every row corresponds to a person. In the column `Income` of the dataframe you can find the yearly income of that person, normalized to a standard value.

In [4]:
Income_survey <- read_rds("Income.rds")
Income_survey %>%
    head()

PersonID,Income,Country,Family_ID
1,1358.146,Alberonia,80
2,1717.555,Alberonia,161
3,1710.991,Alberonia,94
4,1382.5,Alberonia,59
5,1506.073,Alberonia,17
6,1380.668,Alberonia,27


The data frame holds information about two different countries: Alberonia and Florilandia. We are asked a simple data science question: where are people richer? Consider the mean of `Income` grouped by `Country`.

In [5]:
Income_survey %>%
  group_by(Country) %>%
  summarise(mean_Income = mean(Income))

Country,mean_Income
Alberonia,5378.305
Florilandia,5378.284


Which of the following statements would you support with reasonable confidence?

- [ ] The available incomes of the citizens of the two countries do not differ significantly.
- [ ] The available income of the Alberonians is significantly higher than that of the Florilandiensis.
- [*] None of the above.

_Solution_

There are many reasons why we should take a step back and not jump to conclusions. The most relevant one at this point is the difference between the average income (`mean(Income)`) and the income of the average person (`median(Income)`). In particular, `mean(Income)` is sensitive to outliers (a few people with a very high income). By contrast, `median(Income)` is not affected by an increase in the income of the rich.
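
A minimal sketch of this sensitivity, using made-up incomes for five hypothetical people (the values are illustrative and are not taken from `Income_survey`): multiplying the top income by one hundred shifts the mean considerably, while the median does not move.

```r
# Hypothetical incomes for five people (illustrative values only)
incomes <- c(1000, 1200, 1500, 1800, 2000)

# Same people, but the richest one now earns 100 times as much
incomes_rich <- incomes
incomes_rich[which.max(incomes_rich)] <- 100 * max(incomes)

mean_before   <- mean(incomes)
mean_after    <- mean(incomes_rich)
median_before <- median(incomes)
median_after  <- median(incomes_rich)

# The mean changes, the median does not
print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

Two countries could therefore share the same mean while having quite different distributions, which is why a single summary value is not enough to support either statement.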

### Ex 1.3 Being median

The data frame holds information about two different countries: Alberonia and Florilandia. Do people earn the same in the two countries? Consider the median of `Income` grouped by `Country`.

In [6]:
Income_survey %>%
  group_by(Country) %>%
  summarise(median_Income = median(Income))

Country,median_Income
Alberonia,4427.662
Florilandia,4427.627


Can we say with reasonable confidence that the average person has access to the same income (roughly) in the two countries?

- [*] no
- [ ] yes

_Solution_

We should not jump to conclusions, even if `median(Income)` is essentially equal in the two countries. One mistake we have been making in this analysis is to think that _every person is an island_. When we summarised, whether with a mean or a median, we treated each observation as independent. However, the presence of a family structure (as indicated by the `Family_ID` field in the dataframe) suggests that people do not live independently. If they share their income within the family, the family structure may strongly change the result.
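
One way to check whether family structure could matter is to count how many people share each `Family_ID`. The sketch below does this on a small hypothetical table (`toy_survey` and its values are invented for illustration and assume only the `Family_ID` and `Income` columns shown above).

```r
library(dplyr)

# Hypothetical survey: families 1 and 2 have one member, family 3 has five
toy_survey <- tibble(
  Family_ID = c(1, 2, 3, 3, 3, 3, 3),
  Income    = c(9000, 6000, 1000, 3000, 2000, 2500, 1500)
)

# Number of people sharing each Family_ID
family_sizes <- toy_survey %>%
  count(Family_ID, name = "n_members")

print(family_sizes)
```

If family sizes vary widely, as they do in this toy table, treating individuals as independent observations and treating families as observations can give different pictures.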

### Ex 1.4 Family matters

In Ex 1.3 we discussed the possibility that the family structure may affect our results. Now, let's assume that the members of a family share their income, and let's repeat the analysis of Ex 1.2 and Ex 1.3 under this hypothesis. Family ties are encoded in the dataframe as factors in `Family_ID`.

Consider the following chunk of code. It encodes a specific decision about how to compute a family income. What does it entail? By changing a **single** function we could have encoded a different decision (e.g., one that takes the size of a family into account). Try changing the following code accordingly.

In [18]:
Income_families <- Income_survey %>%
    group_by(Country,Family_ID) %>%
    summarise(Family_Income = sum(Income))

Income_families %>%
    group_by(Country) %>%
    summarise(median_Income = median(Family_Income))

Country,median_Income
Alberonia,5013.013
Florilandia,5174.533


_Solution_

The decision we are making in the code is to treat every family as one observation, independently of the number of people who compose it. But families may have to stretch a given income over a larger number of people. A different decision would be to consider not the overall income of a family (`sum(Income)`), but the _per person_ income (the overall income divided by the number of people in that family, that is, the `mean` income for that family).

In [17]:
Income_families <- Income_survey %>%
    group_by(Country,Family_ID) %>%
    summarise(Family_Income = mean(Income))

Income_families %>%
    group_by(Country) %>%
    summarise(median_Income = median(Family_Income))

Country,median_Income
Alberonia,5148.613
Florilandia,5180.549


### Ex 1.5 Family matters and people too

In Ex 1.4 we considered the possibility that the family structure may affect our results and addressed it by computing the per person income of each family and then taking the median across families. Let's dig deeper.

Consider the following chunk of code. The `group_by()` and `summarise()` calls encode a decision about what we consider our unit of observation when we compute the `median` income. What does this decision entail? What other unit of observation would also be reasonable to use? Implement it by changing the code offered here.

In [14]:
Income_families <- Income_survey %>%
    group_by(Country,Family_ID) %>%
    summarise(Family_Income = mean(Income))

Income_families %>%
    group_by(Country) %>%
    summarise(median_Income = median(Family_Income))

Country,median_Income
Alberonia,5148.613
Florilandia,5180.549


_Solution_

The decision affecting our result is to take families, instead of individuals, as our unit of observation, and thus to compute the typical available income as a `median()` over families. As a consequence, individuals in larger families are less represented in our results than people in smaller families or singles. A different decision would be to take the median over individuals, each carrying the per person income of their family. Notice that we obtain a noticeably different result.

In [15]:
Income_survey %>%
    group_by(Country,Family_ID) %>%
    mutate(Family_Income = mean(Income)) %>%
    ungroup() %>%
    group_by(Country) %>%
    summarise(median_available_Income = median(Family_Income))

Country,median_available_Income
Alberonia,5502.398
Florilandia,4963.787
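
To see why the unit of observation matters, here is a small sketch on a hypothetical table (`toy_survey` and its values are invented for illustration and are not part of `Income_survey`). Two single-person families earn a lot, while one five-person family earns little per person; the median over families and the median over individuals then disagree.

```r
library(dplyr)

# Hypothetical survey: families 1 and 2 have one member, family 3 has five
toy_survey <- tibble(
  Family_ID = c(1, 2, 3, 3, 3, 3, 3),
  Income    = c(9000, 6000, 1000, 3000, 2000, 2500, 1500)
)

# Unit of observation = family: one row per family, then the median
per_family <- toy_survey %>%
  group_by(Family_ID) %>%
  summarise(Family_Income = mean(Income)) %>%
  summarise(median_Income = median(Family_Income))

# Unit of observation = person: one row per person, each carrying
# the per person income of their family, then the median
per_person <- toy_survey %>%
  group_by(Family_ID) %>%
  mutate(Family_Income = mean(Income)) %>%
  ungroup() %>%
  summarise(median_Income = median(Family_Income))

print(per_family)
print(per_person)
```

The only difference between the two pipelines is `summarise()` versus `mutate()` in the grouped step: the first collapses each family to a single row, the second keeps one row per person, so the five members of the large family weigh five times as much in the final median.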


### Ex 1.6 Bringing it all home

In Ex 1.2 to 1.5 we computed the average income of the people in Alberonia and Florilandia in four different ways. How would you decide which result to choose?

- [ ] Comparing the squared error distances of the models.
- [ ] Comparing the lack-of-fit sum of squares of the models.
- [ ] All of the above.
- [*] None of the above.

_Solution_

The difference between the results depends on difficult ethical and political decisions. The results are not simply commensurable, because they entail different points of view on the problem. Whichever result (or combination of results) you think describes reality most adequately, a goodness-of-fit test is not enough to justify that choice.