<center><img src="ignaz_semmelweis_1860_small.jpeg"></center>

Hungarian physician Dr. Ignaz Semmelweis worked at the Vienna General Hospital with childbed fever patients. Childbed fever is a deadly disease affecting women who have just given birth, and in the early 1840s, as many as 10% of the women giving birth at the Vienna General Hospital died from it. Dr. Semmelweis discovered that the cause was the contaminated hands of the doctors delivering the babies, and on **June 1st, 1847**, he decreed that everyone should wash their hands. This was an unorthodox and controversial request, because nobody in Vienna knew about bacteria at the time.

You will reanalyze the data that made Semmelweis discover the importance of handwashing and its impact on the hospital and the number of deaths.

The data is stored as two CSV files within the `data` folder.

`data/yearly_deaths_by_clinic.csv` contains the yearly number of births and deaths at the two clinics of the Vienna General Hospital between the years 1841 and 1846.

| Column | Description |
|--------|-------------|
|`year`  |Years (1841-1846)|
|`births`|Number of births|
|`deaths`|Number of deaths|
|`clinic`|Clinic 1 or clinic 2|

`data/monthly_deaths.csv` contains data from 'Clinic 1' of the hospital where most deaths occurred.

| Column | Description |
|--------|-------------|
|`date`|Date (YYYY-MM-DD)|
|`births`|Number of births|
|`deaths`|Number of deaths|
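
The column layout described above can be checked on a small in-memory sample before touching the real file. This is a minimal sketch, assuming the CSV header is exactly `date,births,deaths`; the three rows are the first rows shown in the notebook's printed output, and the names `sample_csv`, `sample`, and `proportion_deaths` are illustrative only.

```python
import io

import pandas as pd

# Three rows taken from the notebook's printed output; the real file has 98 rows.
# The header line is an assumption based on the column table above.
sample_csv = io.StringIO(
    "date,births,deaths\n"
    "1841-01-01,254,37\n"
    "1841-02-01,239,18\n"
    "1841-03-01,277,12\n"
)

# parse_dates converts the YYYY-MM-DD strings to datetime64 while reading
sample = pd.read_csv(sample_csv, parse_dates=["date"])

# Proportion of deaths per birth for each month
sample["proportion_deaths"] = sample["deaths"] / sample["births"]

print(sample.dtypes)
print(sample)
```

Parsing the dates at read time is one option; converting afterwards with `pd.to_datetime` gives the same column type.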

In [33]:
# Imported libraries
import pandas as pd
import matplotlib.pyplot as plt

# Read CSV file into df and check the first values
df = pd.read_csv("data/yearly_deaths_by_clinic.csv")
df.head()

# Check for columns types and missing values
df.info()

# New column with the proportion of deaths over births
df["deaths/births"] = df["deaths"] / df["births"]

df.head(12)

# Separate the df for clinic 1 and clinic 2
c1_df = df[df["clinic"]=="clinic 1"]
c2_df = df[df["clinic"]=="clinic 2"]

# Find the row with the maximum deaths/births ratio for clinic 1 and clinic 2
c1_max = c1_df[c1_df["deaths/births"] == c1_df["deaths/births"].max()]
print(c1_max["year"])
c2_max = c2_df[c2_df["deaths/births"] == c2_df["deaths/births"].max()]
print(c2_max["year"])

# Year with the highest deaths/births ratio; the printed results show 1842 for both clinics
highest_year = 1842

# Import monthly data for clinic 1 and check the data type
df_month = pd.read_csv("data/monthly_deaths.csv")
df_month.info()

# Convert the date column to datetime, then add month and year columns
df_month["date"] = pd.to_datetime(df_month["date"])
df_month["month"] = df_month["date"].dt.month
df_month["year"] = df_month["date"].dt.year

# Adding deaths/birth proportion
df_month["deaths/births"] = df_month["deaths"] / df_month["births"]

print(df_month.head())

# Adding a column to identify when handwashing started
df_month["handwashing_started"] = df_month["date"] >= "1847-06-01"

# Group by the handwashing_started column to compare the mean proportion before and after handwashing
monthly_summary = df_month.groupby("handwashing_started")["deaths/births"].agg("mean").reset_index()
print(monthly_summary)

# Separate data frames for before and after handwashing. Create a variable with the proportions
before_df = df_month[~df_month["handwashing_started"]]
after_df = df_month[df_month["handwashing_started"]]
before_proportion = before_df["deaths/births"]
after_proportion = after_df["deaths/births"]

# Bootstrap the difference in mean proportions (after minus before)
boot_mean_diff = []
for _ in range(3000):
    boot_before = before_proportion.sample(frac=1, replace=True)
    boot_after = after_proportion.sample(frac=1, replace=True)
    boot_mean_diff.append(boot_after.mean() - boot_before.mean())
    
# 95% confidence interval from the 2.5th and 97.5th percentiles
confidence_interval = pd.Series(boot_mean_diff).quantile([0.025, 0.975])
print(confidence_interval)
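
The bootstrap loop above could also be wrapped in a reusable function. The following is a minimal sketch, assuming NumPy is available; the function name `bootstrap_mean_diff_ci`, its parameters, and the example values are hypothetical and are not part of the original analysis or the hospital data.

```python
import numpy as np
import pandas as pd


def bootstrap_mean_diff_ci(before, after, n_boot=3000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(after) - mean(before)."""
    rng = np.random.default_rng(seed)  # fixed seed so the result is reproducible
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)

    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size
        boot_before = rng.choice(before, size=before.size, replace=True)
        boot_after = rng.choice(after, size=after.size, replace=True)
        diffs[i] = boot_after.mean() - boot_before.mean()

    # Symmetric percentile interval, e.g. 2.5% and 97.5% for level=0.95
    alpha = 1.0 - level
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lower, upper


# Made-up proportions purely for illustration (not the hospital data)
example_before = pd.Series([0.10, 0.12, 0.09, 0.11, 0.13, 0.08])
example_after = pd.Series([0.02, 0.03, 0.01, 0.02, 0.04, 0.03])

lower, upper = bootstrap_mean_diff_ci(example_before, example_after)
print(f"95% interval for the difference in means: [{lower:.4f}, {upper:.4f}]")
```

With the notebook's own variables, the call would be `bootstrap_mean_diff_ci(before_proportion, after_proportion)`. An interval that lies entirely below zero suggests a lower mean death proportion after handwashing began.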
    

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    12 non-null     int64 
 1   births  12 non-null     int64 
 2   deaths  12 non-null     int64 
 3   clinic  12 non-null     object
dtypes: int64(3), object(1)
memory usage: 512.0+ bytes
1    1842
Name: year, dtype: int64
7    1842
Name: year, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    98 non-null     object
 1   births  98 non-null     int64 
 2   deaths  98 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 2.4+ KB
        date  births  deaths  month  year  deaths/births
0 1841-01-01     254      37      1  1841       0.145669
1 1841-02-01     239      18      2  1841       0.075314
2 1841-03-01     277      12      3  1841       0.0433