# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [13]:
DATA_FOLDER = '/Users/guillaume/ADA/ADA2017-Tutorials/02 - Intro to Pandas/Data' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

In [162]:
# plan: load the reports of each country into their own DataFrame
# and inspect the data of each country before concatenating them
# process the data and keep only the relevant parts
# merge the per-country DataFrames together
# compute the requested result

#import os utility
from os import listdir
#import libraries
import pandas as pd
import numpy as np

ebola_folder_path = DATA_FOLDER + '/ebola'

## 1.1 Guinea

In [28]:
# Guinea data
ebola_guinea_folder_path = ebola_folder_path + '/guinea_data'
ebola_guinea_files_paths = [ebola_guinea_folder_path + '/' + path for path in listdir(ebola_guinea_folder_path)]

# check the first file to see the structure
temp_data = pd.read_csv(ebola_guinea_files_paths[0])
temp_data.head()

Unnamed: 0,Date,Description,Totals,Conakry,Gueckedou,Macenta,Dabola,Kissidougou,Dinguiraye,Telimele,Boffa,Kouroussa,Dubreka,Siguiri,Pita,Nzerekore
0,2014-08-04,New cases of suspects,5,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2014-08-04,New cases of probables,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2014-08-04,New cases of confirmed,4,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2014-08-04,Total new cases registered so far,9,6.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2014-08-04,Total cases of suspects,11,9.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [36]:
# we can assume that, for this exercise, the only relevant columns are Date, Description and Totals
# let's see if these columns are present in each file

files_number = len(ebola_guinea_files_paths) # we have 22 files

counter = 0
for path in ebola_guinea_files_paths:
    df = pd.read_csv(path)
    if set(['Date', 'Description', 'Totals']).issubset(df.columns):
        counter += 1
    
print('We have', files_number, 'files.')
print('And', counter, 'times the columns Date, Description and Totals.')


We have 22 files.
And 22 times the columns Date, Description and Totals.
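
The column check above parses every file in full. A lighter variant, sketched here on in-memory stand-ins for the report files (the helper name `has_required_columns` and the sample rows are made up for illustration), reads only the header row with `nrows=0`:

```python
import io
import pandas as pd

REQUIRED = {"Date", "Description", "Totals"}

def has_required_columns(source, required=REQUIRED):
    """Return True if the CSV header contains every required column."""
    header = pd.read_csv(source, nrows=0)  # header only, no data rows
    return required.issubset(header.columns)

# Hypothetical stand-ins for two report files.
good = io.StringIO("Date,Description,Totals,Conakry\n2014-08-04,New cases of suspects,5,5\n")
bad = io.StringIO("date,variable,National\n2014-08-12,new_confirmed,11\n")

results = [has_required_columns(f) for f in (good, bad)]
print(results)
```

With real files, `source` would simply be a path from `ebola_guinea_files_paths`.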


In [178]:
# now that we know every file has these columns, we can load them all into one DataFrame

# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate them once
ebola_guinea_df = pd.concat([
    pd.read_csv(path, usecols=['Date', 'Description', 'Totals'], parse_dates=['Date'], index_col=['Date'])
    for path in ebola_guinea_files_paths
])

ebola_guinea_df.head()

Date,Description,Totals
2014-08-04,New cases of suspects,5
2014-08-04,New cases of probables,0
2014-08-04,New cases of confirmed,4
2014-08-04,Total new cases registered so far,9
2014-08-04,Total cases of suspects,11


In [179]:
# now we need to filter the description to get the corresponding death and new cases for each day
ebola_guinea_df.Description.value_counts()

# we observe that "Total new cases registered so far" is present in every file (22), so we can use it
# for the number of deaths of the day it is a little more complicated, as no single
# description appears in all 22 files

Total cases of confirmed                                    22
Total cases of probables                                    22
Total deaths (confirmed + probables + suspects)             22
Number of contacts to follow today                          22
Total deaths of confirmed                                   22
New cases of confirmed                                      22
Total cases of suspects                                     22
New cases of suspects                                       22
Total deaths of suspects                                    22
New cases of probables                                      22
Cumulative (confirmed + probable + suspects)                22
Total contacts registered from start date                   22
Total deaths of probables                                   22
Total new cases registered so far                           22
New contacts registered so far                              21
Total number of exits from CTE                         

In [180]:
# to narrow down which descriptions we could use, let's find out which ones contain "death"
ebola_guinea_df[ebola_guinea_df.Description.str.contains('[Dd]eath')].Description.value_counts()
# only one file seems to differ: we should use "New deaths registered" (21 files) and "New deaths registered today" (1 file)

Total deaths of confirmed                                   22
Total deaths of probables                                   22
Total deaths of suspects                                    22
Total deaths (confirmed + probables + suspects)             22
New deaths registered among health workers                  21
Total deaths registered among health workers                21
New deaths registered                                       21
Number of deaths of probables cases among health workers     1
Total of deaths in confirmed cases in CTE                    1
New deaths registered today (probables)                      1
Number of deaths of confirmed cases among health workers     1
Number of death of confirmed cases among health workers      1
New deaths registered today (suspects)                       1
New deaths registered today                                  1
New deaths registered today (confirmed)                      1
Name: Description, dtype: int64
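
The counts above show that the same quantity is reported under slightly different labels from file to file. One way to handle this, sketched below on made-up rows, is to map each raw label to a canonical name before filtering. The `CANONICAL` mapping and the `Quantity` column are illustrative assumptions, not part of the original data:

```python
import pandas as pd

# Made-up rows that reuse label variants seen in the Guinea reports.
reports = pd.DataFrame(
    {
        "Description": [
            "New deaths registered",
            "New deaths registered today",
            "New deaths registered today (confirmed)",
            "Total new cases registered so far",
        ],
        "Totals": [5, 2, 1, 9],
    },
    index=pd.DatetimeIndex(
        ["2014-08-26", "2014-08-04", "2014-08-04", "2014-08-04"], name="Date"
    ),
)

# Hypothetical mapping from raw labels to one canonical name per quantity.
CANONICAL = {
    "New deaths registered": "New Deaths",
    "New deaths registered today": "New Deaths",
    "Total new cases registered so far": "New Cases",
}

# Labels missing from the mapping become NaN and are dropped.
reports["Quantity"] = reports["Description"].map(CANONICAL)
selected = reports.dropna(subset=["Quantity"]).reset_index()

# One row per date, one column per canonical quantity.
tidy = selected.pivot_table(index="Date", columns="Quantity", values="Totals", aggfunc="sum")
print(tidy)
```

Keeping the mapping explicit makes it easy to review which label variants were treated as equivalent.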

In [181]:
# For this exercise, the interesting values are the new deaths registered per day
# and the "Total new cases registered so far"
# let's check that these descriptions are present in every file:
ebola_guinea_df_new_cases = ebola_guinea_df.query('Description == "Total new cases registered so far"')
ebola_guinea_df_new_deaths = ebola_guinea_df.query('Description == "New deaths registered today" or Description == "New deaths registered"')

# check that we have, as we supposed, one value per day
print('New case date is unique: ',ebola_guinea_df_new_cases.index.is_unique)
print('New death date is unique: ',ebola_guinea_df_new_deaths.index.is_unique)

# drop Description as it's not useful anymore
# (reassigning instead of inplace=True avoids a possible SettingWithCopyWarning on the query result):
ebola_guinea_df_new_cases = ebola_guinea_df_new_cases.drop('Description', axis=1)
ebola_guinea_df_new_deaths = ebola_guinea_df_new_deaths.drop('Description', axis=1)

# and rename the Totals column to what it represents:
ebola_guinea_df_new_cases = ebola_guinea_df_new_cases.rename(columns = {'Totals':'New Cases'})
ebola_guinea_df_new_deaths = ebola_guinea_df_new_deaths.rename(columns = {'Totals':'New Deaths'})

New case date is unique:  True
New death date is unique:  True


In [182]:
ebola_guinea_df_new_cases.head()

Date,New Cases
2014-08-04,9
2014-08-26,28
2014-08-27,22
2014-08-30,24
2014-08-31,46


In [183]:
ebola_guinea_df_new_deaths.head()

Date,New Deaths
2014-08-04,2
2014-08-26,5
2014-08-27,2
2014-08-30,5
2014-08-31,3


In [221]:
# now we concat and we are done
ebola_guinea_df_final = pd.concat([ebola_guinea_df_new_cases, ebola_guinea_df_new_deaths], axis=1)
ebola_guinea_df_final.head()

Date,New Cases,New Deaths
2014-08-04,9,2
2014-08-26,28,5
2014-08-27,22,2
2014-08-30,24,5
2014-08-31,46,3


## 1.2 Liberia

In [185]:
# Liberia data
ebola_liberia_folder_path = ebola_folder_path + '/liberia_data'
ebola_liberia_files_paths = [ebola_liberia_folder_path + '/' + path for path in listdir(ebola_liberia_folder_path)]

# check the first file to see the structure
temp_data = pd.read_csv(ebola_liberia_files_paths[0])
temp_data.head()

Unnamed: 0,Date,Variable,National,Bomi County,Bong County,Grand Kru,Lofa County,Margibi County,Maryland County,Montserrado County,Nimba County,River Gee County,RiverCess County,Sinoe County
0,6/16/2014,Specimens collected,1.0,,,,1.0,,,0.0,,,,
1,6/16/2014,Specimens pending for testing,0.0,,,,0.0,,,0.0,,,,
2,6/16/2014,Total specimens tested,28.0,,,,21.0,,,7.0,,,,
3,6/16/2014,Newly reported deaths,2.0,,,,1.0,,,0.0,,,,
4,6/16/2014,Total death/s in confirmed cases,8.0,,,,4.0,,,0.0,,,,


In [209]:
# here again we can assume that, for this exercise, the only relevant columns are Date, Variable and National
# let's see if these columns are present in each file

files_number = len(ebola_liberia_files_paths)

counter = 0
for path in ebola_liberia_files_paths:
    df = pd.read_csv(path)
    if set(['Date', 'Variable', 'National']).issubset(df.columns):
        counter += 1
    
print('We have', files_number, 'files.')
print('And', counter, 'times the columns Date, Variable and National.')

We have 100 files.
And 100 times the columns Date, Variable and National.


In [177]:
# now that we know every file has these columns, we can load them all into one DataFrame

# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate them once
ebola_liberia_df = pd.concat([
    pd.read_csv(path, usecols=['Date', 'Variable', 'National'], parse_dates=['Date'], index_col=['Date'])
    for path in ebola_liberia_files_paths
])

ebola_liberia_df.head()

Date,Variable,National
2014-06-16,Specimens collected,1.0
2014-06-16,Specimens pending for testing,0.0
2014-06-16,Total specimens tested,28.0
2014-06-16,Newly reported deaths,2.0
2014-06-16,Total death/s in confirmed cases,8.0


In [189]:
# now we need to filter on Variable to get the corresponding deaths and new cases for each day
ebola_liberia_df.Variable.value_counts()

Cumulative deaths among HCW                                         101
Total death/s in probable cases                                     101
Total death/s in suspected cases                                    101
Total death/s in confirmed cases                                    101
Cumulative cases among HCW                                          101
Newly reported contacts                                             100
Newly reported deaths                                               100
Total contacts listed                                               100
New case/s (confirmed)                                              100
Contacts lost to follow-up                                          100
New admissions                                                      100
Currently under follow-up                                           100
Total confirmed cases                                               100
Total suspected cases                                           

In [194]:
# This time the interesting Variable values seem to be "Newly reported deaths" and "New case/s (confirmed)",
# which seem to be in each file according to the counts, but let's check that:
ebola_liberia_df_new_cases = ebola_liberia_df.query('Variable == "New case/s (confirmed)"')
ebola_liberia_df_new_deaths = ebola_liberia_df.query('Variable == "Newly reported deaths"')

# check that we have, as we supposed, one value per day
print('New case date is unique: ',ebola_liberia_df_new_cases.index.is_unique)
print('New death date is unique: ',ebola_liberia_df_new_deaths.index.is_unique)

# drop Variable as it's not useful anymore
# (reassigning instead of inplace=True avoids a possible SettingWithCopyWarning on the query result):
ebola_liberia_df_new_cases = ebola_liberia_df_new_cases.drop('Variable', axis=1)
ebola_liberia_df_new_deaths = ebola_liberia_df_new_deaths.drop('Variable', axis=1)

# and rename the National column to what it represents:
ebola_liberia_df_new_cases = ebola_liberia_df_new_cases.rename(columns = {'National':'New Cases'})
ebola_liberia_df_new_deaths = ebola_liberia_df_new_deaths.rename(columns = {'National':'New Deaths'})

New case date is unique:  True
New death date is unique:  True


In [195]:
ebola_liberia_df_new_cases.head()

Date,New Cases
2014-06-16,1.0
2014-06-17,0.0
2014-06-22,5.0
2014-06-24,4.0
2014-06-25,2.0


In [196]:
ebola_liberia_df_new_deaths.head()

Date,New Deaths
2014-06-16,2.0
2014-06-17,0.0
2014-06-22,4.0
2014-06-24,4.0
2014-06-25,3.0


In [203]:
# now we concat and we are done
ebola_liberia_df_final = pd.concat([ebola_liberia_df_new_cases, ebola_liberia_df_new_deaths], axis=1)
ebola_liberia_df_final = ebola_liberia_df_final.dropna()
# possible improvement with more time: instead of dropping every row with a NaN,
# when New Cases is NaN, fall back to the suspected cases, then to the probable cases, and only then drop the row
ebola_liberia_df_final.head()

Date,New Cases,New Deaths
2014-06-16,1.0,2.0
2014-06-17,0.0,0.0
2014-06-22,5.0,4.0
2014-06-24,4.0,4.0
2014-06-25,2.0,3.0
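
The fallback idea noted in the comment above (use confirmed cases, else suspected, else probable, else drop the row) can be sketched with `Series.combine_first`. The values are made up, and the column names `confirmed`, `suspected` and `probable` are placeholders for whichever Liberia variables would actually be used:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2014-06-16", "2014-06-17", "2014-06-22"])

# Made-up daily values; NaN marks a figure missing from that day's report.
wide = pd.DataFrame(
    {
        "confirmed": [1.0, np.nan, np.nan],
        "suspected": [2.0, 3.0, np.nan],
        "probable": [0.0, 1.0, np.nan],
    },
    index=dates,
)

# Take confirmed where available, otherwise suspected, otherwise probable;
# rows where all three are missing are dropped.
new_cases = (
    wide["confirmed"]
    .combine_first(wide["suspected"])
    .combine_first(wide["probable"])
    .dropna()
    .rename("New Cases")
)
print(new_cases)
```

Note that mixing case definitions this way changes what the column measures, so it would be worth flagging which rows used a fallback.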


## 1.3 Sierra Leone

In [207]:
# Sierra Leone data
ebola_sierra_folder_path = ebola_folder_path + '/sl_data'
ebola_sierra_files_paths = [ebola_sierra_folder_path + '/' + path for path in listdir(ebola_sierra_folder_path)]

# check the first file to see the structure
temp_df = pd.read_csv(ebola_sierra_files_paths[0])
temp_df.head()

Unnamed: 0,date,variable,Kailahun,Kenema,Kono,Kambia,Koinadugu,Bombali,Tonkolili,Port Loko,Pujehun,Bo,Moyamba,Bonthe,Western area urban,Western area rural,National
0,2014-08-12,population,465048,653013,325003.0,341690.0,335471.0,494139,434937,557978,335574,654142,278119,168729.0,1040888,263619,6348350
1,2014-08-12,new_noncase,0,3,0.0,0.0,0.0,0,0,1,0,0,0,0.0,0,0,4
2,2014-08-12,new_suspected,0,9,0.0,0.0,0.0,0,0,0,0,1,0,0.0,0,0,10
3,2014-08-12,new_probable,0,0,0.0,0.0,0.0,0,0,0,0,1,0,0.0,0,0,1
4,2014-08-12,new_confirmed,0,9,0.0,0.0,0.0,0,0,2,0,0,0,0.0,0,0,11


In [208]:
# here again we can assume that, for this exercise, the only relevant columns are date, variable and National
# let's see if these columns are present in each file

files_number = len(ebola_sierra_files_paths) 

counter = 0
for path in ebola_sierra_files_paths:
    df = pd.read_csv(path)
    if set(['date', 'variable', 'National']).issubset(df.columns):
        counter += 1
    
print('We have', files_number, 'files.')
print('And', counter, 'times the columns date, variable and National.')

We have 103 files.
And 103 times the columns date, variable and National.


In [211]:
# now that we know every file has these columns, we can load them all into one DataFrame

# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate them once
ebola_sierra_df = pd.concat([
    pd.read_csv(path, usecols=['date', 'variable', 'National'], parse_dates=['date'], index_col=['date'])
    for path in ebola_sierra_files_paths
])

ebola_sierra_df.head()

date,variable,National
2014-08-12,population,6348350
2014-08-12,new_noncase,4
2014-08-12,new_suspected,10
2014-08-12,new_probable,1
2014-08-12,new_confirmed,11


In [214]:
# now we need to filter on variable to get the corresponding deaths and new cases for each day
ebola_sierra_df.variable.value_counts()

etc_cum_deaths            103
etc_cum_admission         103
contacts_healthy          103
etc_cum_discharges        103
death_probable            103
percent_seen              103
etc_new_discharges        103
cum_noncase               103
etc_new_deaths            103
new_suspected             103
etc_new_admission         103
cum_suspected             103
cfr                       103
new_completed_contacts    103
new_contacts              103
new_probable              103
death_confirmed           103
cum_completed_contacts    103
contacts_ill              103
etc_currently_admitted    103
contacts_not_seen         103
death_suspected           103
cum_confirmed             103
contacts_followed         103
cum_contacts              103
population                103
cum_probable              103
new_confirmed             103
new_noncase               103
negative_corpse            35
pending                    35
positive_corpse            35
new_negative               34
new_positi

In [215]:
# This time the interesting variable values seem to be new_confirmed for the cases and death_confirmed for the deaths,
# which seem to be in each file according to the counts, but let's check that:
# NOTE: the death_confirmed values look cumulative rather than daily,
# so they would need differencing (e.g. with .diff()) to give true daily new deaths
ebola_sierra_df_new_cases = ebola_sierra_df.query('variable == "new_confirmed"')
ebola_sierra_df_new_deaths = ebola_sierra_df.query('variable == "death_confirmed"')

# check that we have, as we supposed, one value per day
print('New case date is unique: ',ebola_sierra_df_new_cases.index.is_unique)
print('New death date is unique: ',ebola_sierra_df_new_deaths.index.is_unique)

# drop variable as it's not useful anymore
# (reassigning instead of inplace=True avoids a possible SettingWithCopyWarning on the query result):
ebola_sierra_df_new_cases = ebola_sierra_df_new_cases.drop('variable', axis=1)
ebola_sierra_df_new_deaths = ebola_sierra_df_new_deaths.drop('variable', axis=1)

# and rename the National column to what it represents:
ebola_sierra_df_new_cases = ebola_sierra_df_new_cases.rename(columns = {'National':'New Cases'})
ebola_sierra_df_new_deaths = ebola_sierra_df_new_deaths.rename(columns = {'National':'New Deaths'})

New case date is unique:  True
New death date is unique:  True


In [216]:
ebola_sierra_df_new_cases.head()

date,New Cases
2014-08-12,11
2014-08-13,15
2014-08-14,13
2014-08-15,10
2014-08-16,18


In [217]:
ebola_sierra_df_new_deaths.head()

date,New Deaths
2014-08-12,264
2014-08-13,273
2014-08-14,280
2014-08-15,287
2014-08-16,297


In [223]:
# now we concat and we are done
ebola_sierra_df_final = pd.concat([ebola_sierra_df_new_cases, ebola_sierra_df_new_deaths], axis=1)
ebola_sierra_df_final = ebola_sierra_df_final.dropna()
ebola_sierra_df_final.head()

date,New Cases,New Deaths
2014-08-12,11,264
2014-08-13,15,273
2014-08-14,13,280
2014-08-15,10,287
2014-08-16,18,297
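
The `death_confirmed` values above rise steadily (264, 273, 280, ...), which suggests a running total rather than a daily count. A minimal sketch, assuming the series really is cumulative, recovers daily figures with `Series.diff`:

```python
import pandas as pd

# The first five death_confirmed values shown in the output above.
cumulative = pd.Series(
    [264, 273, 280, 287, 297],
    index=pd.to_datetime(
        ["2014-08-12", "2014-08-13", "2014-08-14", "2014-08-15", "2014-08-16"]
    ),
    name="death_confirmed",
)

# Difference between consecutive reports; the first row has no predecessor and is dropped.
daily = cumulative.diff().dropna().astype(int).rename("New Deaths")
print(daily)
```

This only yields per-day numbers when reports exist for consecutive days. Across a gap in the dates, the difference covers several days and would need to be divided by the gap length or treated separately.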


## 1.4 Group all the dataframes into one single clean dataframe

In [None]:
# set hierarchical index on the dataframes
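
One possible way to carry out this step, sketched on made-up numbers rather than the real per-country frames: `pd.concat` with `keys` builds a (Country, Date) hierarchical index, and grouping by country and calendar month gives the mean of the daily values. The helper `toy_frame` and the level names are assumptions made for the example.

```python
import pandas as pd

def toy_frame(dates, cases, deaths):
    """Build a small stand-in for one country's final DataFrame."""
    return pd.DataFrame(
        {"New Cases": cases, "New Deaths": deaths},
        index=pd.DatetimeIndex(dates, name="Date"),
    )

guinea = toy_frame(["2014-08-04", "2014-08-26", "2014-09-02"], [9, 28, 20], [2, 5, 4])
liberia = toy_frame(["2014-06-16", "2014-06-17"], [1, 0], [2, 0])

# Hierarchical index: the outer level is the country, the inner level the date.
combined = pd.concat(
    [guinea, liberia], keys=["Guinea", "Liberia"], names=["Country", "Date"]
)

# Daily average per month: mean of the daily values within each (country, month).
country = combined.index.get_level_values("Country")
month = combined.index.get_level_values("Date").to_period("M").rename("Month")
monthly_mean = combined.groupby([country, month]).mean()
print(monthly_mean)
```

The average here is taken over the days that have a report, so months with few reports rest on few observations.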


## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.
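
A rough sketch of the shape such a solution could take, using small in-memory frames instead of the real spreadsheets. The column names (`Taxon`, `Count`, `BARCODE`, `GROUP`, `SAMPLE`) and the sheet keys are assumptions made for illustration, not taken from the data files:

```python
import numpy as np
import pandas as pd

# Stand-ins for two of the nine spreadsheets (hypothetical names and values).
sheets = {
    "MID1": pd.DataFrame({"Taxon": ["Archaea A", "Bacteria B"], "Count": [7, 2]}),
    "MID2": pd.DataFrame({"Taxon": ["Archaea A", "Bacteria C"], "Count": [1, 5]}),
}

# Stand-in for the metadata spreadsheet; the missing SAMPLE value is deliberate.
metadata = pd.DataFrame(
    {
        "BARCODE": ["MID1", "MID2"],
        "GROUP": ["Control", "Treated"],
        "SAMPLE": [np.nan, "tissue"],
    }
)

# Tag each sheet with its key, then stack all sheets into one frame.
frames = [df.assign(BARCODE=barcode) for barcode, df in sheets.items()]
combined = pd.concat(frames, ignore_index=True)

# Attach the metadata columns, replace NaN by 'unknown', and build a unique index.
combined = combined.merge(metadata, on="BARCODE", how="left")
combined = combined.fillna("unknown")
combined = combined.set_index(["BARCODE", "Taxon"])
print(combined)
```

With the real files, each frame would come from `pd.read_excel` instead of a literal, and the choice of index would depend on which columns identify a row uniquely.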

In [None]:
# Write your answer here

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [None]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.
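
Questions 2 and 6 both involve binning ages. The sketch below, on made-up ages, contrasts `pd.cut` (fixed decade intervals) with `pd.qcut` (equally populated categories). The labels `younger` and `older` are illustrative choices.

```python
import pandas as pd

ages = pd.Series([2, 15, 22, 29, 35, 41, 58, 63], name="age")  # made-up ages

# Fixed-width decade intervals [0, 10), [10, 20), ..., [70, 80).
decade_bins = list(range(0, 81, 10))
decades = pd.cut(ages, bins=decade_bins, right=False)

# Two categories holding (as nearly as possible) the same number of passengers:
# the split point is the median age.
age_category = pd.qcut(ages, q=2, labels=["younger", "older"])

print(decades.value_counts(sort=False))
print(age_category.value_counts())
```

On real data, missing ages would have to be dropped or handled before binning, and that choice is one of the assumptions worth stating in the answer.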

In [None]:
# Write your answer here