# Dr. Semmelweis and the discovery of handwashing

In [None]:
# This allows .... to be used as placeholder value in the sample code cells
.... <- NULL 

## 1. Meet Dr. Ignaz Semmelweis


<img style="float: left;margin:5px 20px 5px 1px" src="http://s3.amazonaws.com/assets.datacamp.com/production/project_49/datasets/ignaz_semmelweis_1860.jpeg">

<!--
<img style="float: left;margin:5px 20px 5px 1px" src="datasets/ignaz_semmelweis_1860.jpeg">
-->

This is Dr. Ignaz Semmelweis, a Hungarian physician born in 1818 and active at the Vienna General Hospital. If Dr. Semmelweis looks troubled it's probably because he's thinking about *childbed fever*: a deadly disease affecting women who have just given birth. He is thinking about it because in the early 1840s at the Vienna General Hospital as many as 10% of the women giving birth die from it. He is thinking about it because he knows the cause of childbed fever: it's the contaminated hands of the doctors delivering the babies. And they won't listen to him and *wash their hands*!

In this notebook, we're going to reanalyze the data that made Semmelweis discover the importance of *handwashing*. Let's start by looking at the data that made Semmelweis realize that something was wrong with the procedures at Vienna General Hospital.

- Read about Dr. Ignaz Semmelweis to the right.
- Load in the `tidyverse` package.
- Read in `datasets/yearly_deaths_by_clinic.csv` using `read_csv` and assign it to the variable `yearly`.
- Print out `yearly`.

<hr>

### Good to know

The `tidyverse` package automatically loads in the packages `ggplot2`, `dplyr`, and `readr`. This project assumes you can manipulate data frames using `dplyr` and make simple plots using `ggplot2`. You can learn these skills in the course <a href="https://www.datacamp.com/courses/introduction-to-the-tidyverse" target="_blank">Introduction to the Tidyverse</a>. The most relevant exercises are:

- <a href="https://campus.datacamp.com/courses/introduction-to-the-tidyverse/data-wrangling-1?ex=11" target="_blank">Using mutate to change or create a column</a>
- <a href="https://campus.datacamp.com/courses/introduction-to-the-tidyverse/data-visualization?ex=9" target="_blank">Adding color to a scatter plot</a>
- <a href="https://campus.datacamp.com/courses/introduction-to-the-tidyverse/grouping-and-summarizing?ex=7" target="_blank">Summarizing by continent</a>
- <a href="https://campus.datacamp.com/courses/introduction-to-the-tidyverse/types-of-visualizations?ex=3" target="_blank">Visualizing median GDP per capita by continent over time</a>

Even if you've taken this course, you will still find this project challenging unless you use some external _documentation_. In this project, RStudio's <a href="https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf" target="_blank">ggplot2 cheat sheet</a> and <a href="https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf" target="_blank">dplyr cheat sheet</a> can come in handy.

If you load in the `tidyverse` package:

```r
library(tidyverse)
```

You can use the `read_csv` function to read in data stored in `csv`-files like this:

```r
my_data <- read_csv("path_to/my_data.csv")
```
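
As a minimal sketch of what `read_csv` gives back, the following writes a tiny made-up CSV file to a temporary location and reads it in again. The file contents and the names `tmp_path` and `toy` are illustrative assumptions and are not part of the project data.

```r
library(readr)

# Write a small, made-up CSV file to a temporary path (illustration only)
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("group,count", "a,10", "b,25"), tmp_path)

# read_csv returns a tibble and guesses a type for each column
toy <- read_csv(tmp_path)

# A tibble is also a data frame, so data frame tools work on it
class(toy)

# The guessed type of each column
sapply(toy, class)
```

Because the result is a data frame, it can be passed straight on to `dplyr` and `ggplot2` functions.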

In [None]:
# Load in the tidyverse package
# .... YOUR CODE FOR TASK 1 ....

# Read datasets/yearly_deaths_by_clinic.csv into yearly
yearly = ....

# Print out yearly
# .... YOUR CODE FOR TASK 1 ....

In [None]:
# Load in the tidyverse package
library(tidyverse)

# Read datasets/yearly_deaths_by_clinic.csv into yearly
yearly <- read_csv('datasets/yearly_deaths_by_clinic.csv')

# Print out yearly
yearly

In [None]:
library(testthat) 
library(IRkernel.testthat)
run_tests({
    test_that("Read in data correctly.", {
        expect_is(yearly, "data.frame", 
            info = 'You should use read_csv to read "datasets/yearly_deaths_by_clinic.csv" into yearly')
    })
    
    test_that("Read in data correctly.", {
        yearly_temp <- read_csv('datasets/yearly_deaths_by_clinic.csv')
        expect_equivalent(yearly, yearly_temp, 
            info = 'yearly should contain the data in "datasets/yearly_deaths_by_clinic.csv"')
    })
})

## 2. The alarming number of deaths

The table above shows the number of women giving birth at the two clinics at the Vienna General Hospital for the years 1841 to 1846. You'll notice that giving birth was very dangerous; an *alarming* number of women died as the result of childbirth, most of them from childbed fever.

We see this more clearly if we look at the *proportion of deaths* out of the number of women giving birth. 

- Use `mutate`  to add the column `proportion_deaths` to `yearly` calculated as the proportion of `deaths` per number of `births`.
- Print out `yearly`.

<hr>

For an example of how `mutate` works look under **Make New Variables** in the <a href="https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf" target="_blank">dplyr cheat sheet</a>.

Don't forget that `mutate` doesn't change the actual data frame. To make the changes last you have to overwrite the data frame like this:

```r
my_data <- my_data %>%
  mutate(new_column = old_column * 100)
```
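
The point about overwriting can be checked on a small made-up data frame. This is only a sketch: `toy`, `old_column`, and `new_column` are placeholder names, and the values are invented for illustration.

```r
library(dplyr)

# Made-up data frame for illustration only
toy <- tibble(old_column = c(0.1, 0.25, 0.5))

# Calling mutate without assigning the result leaves toy unchanged
toy %>% mutate(new_column = old_column * 100)
names_before <- names(toy)
names_before

# Assigning the result back to toy is what makes the new column last
toy <- toy %>% mutate(new_column = old_column * 100)
names_after <- names(toy)
names_after
```

The first call prints a data frame with two columns, but `toy` itself gains the column only after the second call.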

In [None]:
# Adding a new column to yearly with proportion of deaths per no. births
# .... YOUR CODE FOR TASK 2 ....

# Print out yearly
yearly

In [1]:
# Adding a new column with proportion of deaths per no. births
yearly <- yearly %>% 
  mutate(proportion_deaths = deaths / births)

# Print out yearly
yearly


In [None]:
run_tests({
    test_that("A proportion_deaths column exists", {
        expect_true("proportion_deaths" %in% names(yearly), 
            info = 'yearly should have the new column proportion_deaths')
    })
    
    test_that("proportion_deaths is calculated correctly.", {
        yearly_temp <- read_csv('datasets/yearly_deaths_by_clinic.csv') %>% 
          mutate(proportion_deaths = deaths / births)
        expect_equivalent(yearly, yearly_temp, 
            info = 'proportion_deaths should be calculated as deaths / births')
    })
})

## 3. Death at the clinics

If we now plot the proportion of deaths at both clinic 1 and clinic 2, we'll see a curious pattern...

- Use `ggplot` to make a line plot of `proportion_deaths` by `year` with one line per clinic.
- The lines should have different `color`s.

<hr>

If you don't remember how to plot line plots with `ggplot` check out the <a href="https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf" target="_blank">ggplot2 cheat sheet</a> under **Geoms**, **continuous function**.

If `my_data` has columns `a`, `b`, and `c`, this is how to make a line plot where the color depends on `c` :

```r
ggplot(my_data, aes(x = a, y = b, color = c)) +
  geom_line()
```
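
To see the pattern above run end to end, here is a small sketch on made-up data. The data frame `toy` and its values are invented for illustration; only the column names `a`, `b`, and `c` follow the example above.

```r
library(ggplot2)

# Made-up data: two groups measured at three time points (illustration only)
toy <- data.frame(
  a = rep(1:3, times = 2),
  b = c(0.10, 0.12, 0.09, 0.04, 0.05, 0.03),
  c = rep(c("group 1", "group 2"), each = 3)
)

# Mapping color to a character column draws one line per distinct value
toy_plot <- ggplot(toy, aes(x = a, y = b, color = c)) +
  geom_line()

# Display the plot
toy_plot
```

Because `c` holds text rather than numbers, `ggplot2` treats it as a discrete variable and gives each group its own line and color.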

In [None]:
# Setting the size of plots in this notebook
options(repr.plot.width=7, repr.plot.height=4)

# Plot yearly proportion of deaths at the two clinics
# .... YOUR CODE FOR TASK 3 ....

In [None]:
# Setting the size of plots in this notebook
options(repr.plot.width=7, repr.plot.height=4)

# Plot yearly proportion of deaths at the two clinics
ggplot(yearly, aes(x = year, y = proportion_deaths, color = clinic)) +
  geom_line()

In [None]:
run_tests({
    test_that("The right columns are plotted", {
        correct_mapping <- list(x = as.symbol("year"), y = as.symbol("proportion_deaths"), colour = as.symbol("clinic"))
        student_mapping <- last_plot()$mapping
        expect_equivalent(correct_mapping, student_mapping,
            info = 'year should be on the x-axis, proportion_deaths on the y-axis, and the color should depend on clinic .')
    })
    
})

## 4. The handwashing begins

Why is the proportion of deaths consistently so much higher in Clinic 1? Semmelweis saw the same pattern and was puzzled and distressed. The only difference between the clinics was that many medical students served at Clinic 1, while mostly midwife students served at Clinic 2. While the midwives only tended to the women giving birth, the medical students also spent time in the autopsy rooms examining corpses. 

Semmelweis started to suspect that something on the corpses, spread from the hands of the medical students, caused childbed fever. So in a desperate attempt to stop the high mortality rates, he decreed: *Wash your hands!* This was an unorthodox and controversial request; nobody in Vienna knew about bacteria at this point in time. 

Let's load in monthly data from Clinic 1 to see if the handwashing had any effect.

- Read in `datasets/monthly_deaths.csv` and assign it to the variable `monthly`. 
- Add the column `proportion_deaths` to `monthly` calculated as the proportion of `deaths` per number of `births`.
- Print out the first rows in `monthly` using the `head()` function.

<hr>



You can calculate `proportion_deaths` in almost the same way as in Task 2. 

In [None]:
# Read datasets/monthly_deaths.csv into monthly
monthly <- ....

# Adding a new column with proportion of deaths per no. births
# .... YOUR CODE FOR TASK 4 ....

# Print out the first rows in monthly
# .... YOUR CODE FOR TASK 4 ....

In [None]:
# Read datasets/monthly_deaths.csv into monthly
monthly <- read_csv("datasets/monthly_deaths.csv")

# Adding a new column with proportion of deaths per no. births
monthly <- monthly %>% 
  mutate(proportion_deaths = deaths / births)

# Print out the first rows in monthly
head(monthly)

In [None]:
run_tests({
    test_that("Read in monthly correctly.", {
        monthly_temp <- read_csv("datasets/monthly_deaths.csv")
        expect_true(all(names(monthly_temp) %in% names(monthly)), 
            info = 'monthly should contain the data in "datasets/monthly_deaths.csv"')
    })
    
    test_that("proportion_deaths is calculated correctly.", {
        monthly_temp <- read_csv("datasets/monthly_deaths.csv")
        monthly_temp <- monthly_temp %>% 
          mutate(proportion_deaths = deaths / births)
        expect_equivalent(monthly, monthly_temp, 
            info = 'proportion_deaths should be calculated as deaths / births')
    })
})

## 5. The effect of handwashing

With the data loaded we can now look at the proportion of deaths over time. In the plot below we haven't marked where obligatory handwashing started, but it reduced the proportion of deaths to such a degree that you should be able to spot it!

- Make a line plot of `proportion_deaths` by `date` for the `monthly` data frame using `ggplot`.
- Use the `labs` function to give the x-axis and y-axis *any* prettier labels.

<hr>

For how to use the `labs` function to add labels check out the <a href="https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf" target="_blank">ggplot2 cheat sheet</a> under **Labels**.

The code to make the plot is almost the same as for task 3, except that you don't need to specify `color`.

In [None]:
# Plot monthly proportion of deaths
# ... YOUR CODE FOR TASK 5 ...

In [None]:
ggplot(monthly, aes(date, proportion_deaths)) +
  geom_line() +
  labs(x = "Year", y = "Proportion Deaths")

In [None]:
run_tests({
    test_that("The right columns are plotted", {
        correct_mapping <- list(x = as.symbol("date"), y = as.symbol("proportion_deaths"))
        student_mapping <- last_plot()$mapping
        expect_equivalent(correct_mapping, student_mapping,
            info = 'date should be on the x-axis, proportion_deaths on the y-axis.')
    })  
})

## 6. The effect of handwashing highlighted

Starting from the summer of 1847 the proportion of deaths is drastically reduced and, yes, this was when Semmelweis made handwashing obligatory. 

The effect of handwashing is made even more clear if we highlight this in the graph.

- Add a `TRUE`/`FALSE` column to `monthly` called `handwashing_started` which is `TRUE` for `date`s where obligatory handwashing was enforced.
- Make a line plot of `proportion_deaths` by `date` for the `monthly` data frame using `ggplot`. Make the `color` of the line depend on `handwashing_started`.
- Use the `labs` function to give the x-axis and y-axis *any* prettier labels.

<hr>

Since the column `monthly$date` is a `Date` column you can now compare it to other `Date`s using the comparison operators (`<`, `>=`, `==`, etc.). For example, the following would create a new column in `monthly` which is `FALSE` for all `date`s except for the month when handwashing started:

```r
monthly <- monthly %>%
  mutate(is_start_month = 
    date == handwashing_start)
```
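
As a rough illustration of how the comparison operators behave on `Date` values, the sketch below uses a short vector of dates on its own rather than the `monthly` data frame. The names `dates` and `cutoff` are placeholders chosen for this example.

```r
# A few dates for illustration; cutoff plays the role of handwashing_start
dates  <- as.Date(c("1847-04-01", "1847-05-01", "1847-06-01", "1847-07-01"))
cutoff <- as.Date("1847-06-01")

# TRUE only for the date equal to the cutoff
is_cutoff <- dates == cutoff
is_cutoff

# TRUE for the cutoff date and every later date
on_or_after <- dates >= cutoff
on_or_after

# Summing a logical vector counts how many entries are TRUE
sum(on_or_after)
```

Each comparison returns one `TRUE`/`FALSE` value per date, which is exactly the shape needed for a new column inside `mutate`.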

You should be able to solve this task using a combination of code copied from tasks 2 and 3.

In [None]:
# From this date handwashing was made mandatory
handwashing_start = as.Date('1847-06-01')

# Add a TRUE/FALSE column to monthly called handwashing_started
# .... YOUR CODE FOR TASK 6 ....

# Plot monthly proportion of deaths before and after handwashing
# .... YOUR CODE FOR TASK 6 ....

In [None]:
# From this date handwashing was made mandatory
handwashing_start = as.Date('1847-06-01')

# Add a TRUE/FALSE column to monthly called handwashing_started
monthly <- monthly %>%
  mutate(handwashing_started = date >= handwashing_start)

# Plot monthly proportion of deaths before and after handwashing
ggplot(monthly, aes(x = date, y = proportion_deaths, color = handwashing_started)) +
  geom_line() +
  labs(x = "Date", y = "Proportion of deaths")

In [None]:
run_tests({
    test_that("handwashing_started has been defined", {
        expect_true("handwashing_started" %in% names(monthly),
            info = 'monthly should contain the column handwashing_started.')
    })  
    
    test_that("there are 22 rows where handwashing_started is TRUE", {
        expect_equal(22, sum(monthly$handwashing_started),
            info = 'handwashing_started should be a TRUE/FALSE column where the rows where handwashing was enforced are set to TRUE.')
    })
    
    test_that("The right columns are plotted", {
        correct_mapping <- list(x = as.symbol("date"), y = as.symbol("proportion_deaths"), colour = as.symbol("handwashing_started"))
        student_mapping <- last_plot()$mapping
        expect_equivalent(correct_mapping, student_mapping,
            info = 'date should be on the x-axis, proportion_deaths on the y-axis, and handwashing_started should be mapped to color.')
    })  
})

## 7. More handwashing, fewer deaths?

Again, the graph shows that handwashing had a huge effect. How much did it reduce the monthly proportion of deaths on average?

* Use `group_by` and `summarise` to calculate the `mean` proportion of deaths before and after handwashing was enforced. 
* Put the resulting table into `monthly_summary`.

<hr>

The resulting data frame should look like below, but with 0.????? replaced by the actual numbers.

![](http://s3.amazonaws.com/assets.datacamp.com/production/project_49/datasets/task_7_example_table.png)

Look under **Group Cases** in the <a href="https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf" target="_blank">dplyr cheat sheet</a> for an example of how `group_by` and `summarise` work together.

You could group by `handwashing_started` like this:

```r
monthly %>% 
  group_by(handwashing_started) %>%
  ....
```

Then you just need to replace `....` with something that calculates the `mean` `proportion_deaths` for each group.
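
For a feel of how `group_by` and `summarise` work together, here is a sketch on a made-up data frame. The names `toy`, `flag`, `value`, and `mean_value` and all the numbers are invented for illustration and are unrelated to the hospital data.

```r
library(dplyr)

# Made-up values for illustration only
toy <- tibble(
  flag  = c(FALSE, FALSE, TRUE, TRUE),
  value = c(0.10, 0.20, 0.02, 0.04)
)

# One output row per distinct value of flag, holding that group's mean
toy_summary <- toy %>%
  group_by(flag) %>%
  summarise(mean_value = mean(value))

toy_summary
```

The result has one row for `FALSE` and one for `TRUE`, which is the same shape as the table shown above.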

In [None]:
# Calculating the mean proportion of deaths 
# before and after handwashing.

monthly_summary <- ....
# .... YOUR CODE FOR TASK 7 HERE ....

# Printing out the summary.
monthly_summary

In [None]:
# Calculating the mean proportion of deaths 
# before and after handwashing.

monthly_summary <- monthly %>% 
  group_by(handwashing_started) %>%
  summarise(mean_proportion_deaths = mean(proportion_deaths))

# Printing out the summary.
monthly_summary

In [None]:
as.numeric(unlist(.Last.value))

In [None]:
run_tests({
    test_that("mean_proportion_deaths was calculated correctly", {
        flat_summary <- as.numeric(unlist(monthly_summary))
        handwashing_start = as.Date('1847-06-01')
        monthly_temp <- read_csv("datasets/monthly_deaths.csv") %>% 
          mutate(proportion_deaths = deaths / births) %>% 
          mutate(handwashing_started = date >= handwashing_start) %>% 
          group_by(handwashing_started) %>%
          summarise(mean_proportion_deaths = mean(proportion_deaths))
        expect_true(all(monthly_temp$mean_proportion_deaths %in% flat_summary),
            info = 'monthly_summary should contain the mean monthly proportion of deaths before and after handwashing was enforced.')
    })  
})

## 8. A statistical analysis of Semmelweis' handwashing data

Handwashing reduced the proportion of deaths by around 8 percentage points! From 10% on average before handwashing to just 2% when handwashing was enforced (which is still a high number by modern standards). 
To get a feeling for the uncertainty around how much handwashing reduces mortality, we could look at a confidence interval (here calculated using a t-test).

- Use the `t.test` function to calculate a 95% confidence interval around how much dirty hands increase `proportion_deaths`.

<hr>

A t-test is a simple statistical model for the means of two groups where you have continuous measurements. The two groups we have are monthly `proportion_deaths` _before_ handwashing had started and then _after_ it was enforced. A t-test produces a lot of numbers, but what we are interested in is the _confidence interval_, here a measure of uncertainty around what the increase in mortality could be due to doctors not washing their hands.

If `df` is a data frame, `outcome` is a numeric column in `df`, and `group` is a `TRUE`/`FALSE` column splitting `df` into two groups, then the following would run a t-test for the two groups:

```r
t.test(outcome ~ group, data = df)
```

The tilde (`~`) should be read as "depends on", and so the above means "assume the `outcome` depends on `group`".

The `....` to the left of `~` should be the measure we are interested in, the `....` to the right of `~` should be the `TRUE/FALSE` column splitting the data into two parts. 
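
To show where the confidence interval lives in the output, the sketch below runs the same kind of t-test on `sleep`, a small dataset that ships with R and has nothing to do with the hospital data. The name `toy_result` is a placeholder chosen for this example.

```r
# sleep ships with R; extra is numeric and group is a two-level factor
toy_result <- t.test(extra ~ group, data = sleep)

# The result is a list; conf.int holds the lower and upper bounds
toy_result$conf.int

# The confidence level is stored as an attribute (0.95 by default)
attr(toy_result$conf.int, "conf.level")

# estimate holds the two group means the interval is based on
toy_result$estimate
```

The interval describes the mean of the first group minus the mean of the second. With a `TRUE`/`FALSE` grouping column, `FALSE` comes first, so the difference is the `FALSE` group minus the `TRUE` group.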

In [None]:
# Calculating a 95% confidence interval using t.test 
test_result <- t.test( .... ~ ...., data = monthly)
test_result

In [None]:
# Calculating a 95% confidence interval using t.test 
test_result <- t.test( proportion_deaths ~ handwashing_started, data = monthly)
test_result

In [None]:
run_tests({
    test_that("the confidence intervals match", {
        temp_test_result <- t.test( proportion_deaths ~ handwashing_started, data = monthly)
        expect_equivalent(test_result$conf.int, temp_test_result$conf.int,
            info = 'The t-test should be calculated with proportion_deaths as a function of handwashing_started.')
    })  
})

## 9. The fate of Dr. Semmelweis

That the doctors didn't wash their hands increased the proportion of deaths by between 6.7 and 10 percentage points, according to a 95% confidence interval. All in all, it would seem that Semmelweis had solid evidence that handwashing was a simple but highly effective procedure that could save many lives.

The tragedy is that, despite the evidence, Semmelweis' theory — that childbed fever was caused by some "substance" (what we today know as *bacteria*) from autopsy room corpses — was ridiculed by contemporary scientists. The medical community largely rejected his discovery and in 1849 he was forced to leave the Vienna General Hospital for good.

One reason for this was that statistics and statistical arguments were uncommon in medical science in the 1800s. Semmelweis published his data only as long tables of raw numbers, without any graphs or confidence intervals. If he had had access to the analysis we've just put together, he might have been more successful in getting the Viennese doctors to wash their hands.

* Given the data Semmelweis collected, is it `TRUE` or `FALSE` that doctors should wash their hands? 

<hr>

Congratulations, you've made it this far! If you haven't tried it already, you should **check** your project now by clicking the "Check project" button.

Good luck! :)


When you've finished this project <a href="https://www.cbtnuggets.com/blog/2016/10/bytes-and-bacteria-exposing-the-germs-on-your-technology/" target="_blank">you should probably wash your hands</a> too... 

In [None]:
# The data Semmelweis collected points to that:
doctors_should_wash_their_hands <- FALSE

In [None]:
# The data Semmelweis collected points to that:
doctors_should_wash_their_hands <- TRUE

In [None]:
run_tests({
    test_that("The project is finished.", {
        expect_true(doctors_should_wash_their_hands, 
            info = "Semmelweis would argue that doctors_should_wash_their_hands should be TRUE .")
    })  
})