# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [3]:
DATA_FOLDER = 'Data' # Use the data folder provided in Tutorial 02 - Intro to Pandas.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

## Solution
---
First, we need to read all of the data, which is in CSV format. To do so, we retrieve the names of all CSV files for a given country by listing its folder with an `ls` command. We then keep the useful columns of each CSV (always the date, the description and the national totals) and concatenate the files.
We then filter the rows by their description, keeping only those related to new cases and deaths, and rename these descriptions to either 'New cases' or 'Deaths'. This lets us group by month, day and description, which gives at most one 'New cases' total and one 'Deaths' total per day. The daily average per month is then obtained by grouping on the first and third index levels (month and description) and taking the mean. Finally, we rename the index levels and the column.
We repeat this procedure for each country, with its own column names and its own 'New cases' and 'Deaths' descriptions, and concatenate the three resulting DataFrames.
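
The two-stage aggregation described above can be sketched on a toy table. This is only an illustration under stated assumptions: the rows and numbers below are made up, and the column names (`Date`, `Description`, `Totals`) simply mirror the Guinea reports.

```python
import pandas as pd

# Toy rows standing in for the concatenated daily reports (values are made up).
reports = pd.DataFrame({
    'Date': pd.to_datetime(['2014-08-04', '2014-08-04', '2014-08-05',
                            '2014-08-05', '2014-09-01', '2014-09-01']),
    'Description': ['Total new cases registered so far', 'New deaths registered',
                    'Total new cases registered so far', 'New deaths registered',
                    'Total new cases registered so far', 'New deaths registered'],
    'Totals': [10, 2, 20, 4, 6, 1],
})

# Map each raw description (lower-cased) to one of the two target labels.
label_map = {'total new cases registered so far': 'New cases',
             'new deaths registered': 'Deaths'}
reports['Description'] = reports['Description'].str.lower().map(label_map)
reports = reports.dropna(subset=['Description'])  # drop rows that match no filter

# Step 1: one total per (month, day, label).
daily = reports.groupby([reports['Date'].dt.month.rename('Month'),
                         reports['Date'].dt.day.rename('Day'),
                         'Description'])['Totals'].sum()

# Step 2: average the daily totals within each (month, label).
monthly = daily.groupby(level=['Month', 'Description']).mean()
monthly = monthly.to_frame('Daily average per month')
print(monthly)
```

Summing first and averaging second matters when several rows map to the same label on the same day: each day then contributes a single total to the monthly mean.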

A priori, for Sierra Leone we can't use `death_suspected`, `death_probable` and `death_confirmed`, as these appear to be cumulative counts, whereas we only want the new deaths.




Question: consider the following rows:

* 0 	2014-10-01 	New cases of suspects 	28
* 1 	2014-10-01 	New cases of probables 	0
* 2 	2014-10-01 	New cases of confirmed 	6
* 3 	2014-10-01 	Total new cases registered so far 	34
* 15 	2014-10-01 	New cases of confirmed among health workers 	2

Should the new confirmed cases among health workers be counted as part of the new confirmed cases (so that the reported total is correct), or do they form a separate category?
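
One way to approach this question is to check the arithmetic on the rows quoted above. The sketch below only uses those five values; it suggests, but does not prove, that health workers are already included in the confirmed cases.

```python
# Values copied from the 2014-10-01 rows quoted above.
new_suspects = 28
new_probables = 0
new_confirmed = 6
total_new_cases = 34
confirmed_health_workers = 2

# The three main categories alone already reproduce the reported total ...
sum_without_hw = new_suspects + new_probables + new_confirmed
# ... whereas adding health workers as a separate category would overshoot it.
sum_with_hw = sum_without_hw + confirmed_health_workers

print(sum_without_hw == total_new_cases)  # True
print(sum_with_hw == total_new_cases)     # False
```

Under this reading, using the reported total (or the sum of the three main categories) avoids double counting.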

In [4]:
def country_DF(country, date_c, description_c, total_c, filters_new_cases, filters_deaths):
    path = DATA_FOLDER + '/ebola/' + country + '_data/'
    country_files = !ls $path

    country_DF = pd.concat([pd.read_csv(path+filename, usecols=[date_c, description_c, total_c], parse_dates=[date_c], index_col=False) 
                           for filename in country_files]) 
    

    country_DF = country_DF.loc[[ description.lower() in filters_new_cases or
                                    description.lower() in filters_deaths
                                    for description in country_DF[description_c].values]]

    country_DF.loc[[description.lower() in filters_new_cases for description in country_DF[description_c]], description_c] = 'New cases'
    country_DF.loc[[description.lower() in filters_deaths for description in country_DF[description_c]], description_c] = 'Deaths'
    country_DF[total_c] = pd.to_numeric(country_DF[total_c])
    

    country_computation = pd.DataFrame(country_DF.groupby([country_DF[date_c].dt.month, country_DF[date_c].dt.day, country_DF[description_c]])[total_c].sum())
    country_computation = country_computation.groupby(level=[0,2]).mean()
    country_computation.index.names = ['Month', 'Description']
    country_computation.columns = ['Daily average per month']
    return country_computation

guinea = country_DF('guinea', 'Date', 'Description', 'Totals', ['total new cases registered so far'], ['new deaths registered', 'new deaths registered today'])
liberia = country_DF('liberia', 'Date', 'Variable', 'National', ['new case/s (suspected)', 'new case/s (probable)', 'new case/s (confirmed)'], ['newly reported deaths'])
sl = country_DF('sl', 'date', 'variable', 'National', ['new_noncase', 'new_suspected', 'new_probable', 'new_confirmed'], ['death_suspected', 'death_probable', 'death_confirmed'])

pd.concat({'Guinea': guinea, 'Liberia': liberia, 'Sierra Leone': sl})

,Month,Description,Daily average per month
Guinea,8,Deaths,3.4
Guinea,8,New cases,25.8
Guinea,9,Deaths,3.5625
Guinea,9,New cases,19.625
Guinea,10,Deaths,15.0
Guinea,10,New cases,34.0
Liberia,6,Deaths,2.0
Liberia,6,New cases,5.714286
Liberia,7,Deaths,4.272727
Liberia,7,New cases,8.545455


In [148]:
def country_DF(country, date_c, description_c, total_c, filters_new_cases, filters_deaths):
    path = DATA_FOLDER + '/ebola/' + country + '_data/'
    country_files = !ls $path

    country_DF = pd.concat([pd.read_csv(path+filename, parse_dates=[date_c], na_values=['-']) 
                           for filename in country_files]) 

    country_DF = country_DF.loc[[ description.lower() in filters_new_cases or
                                    description.lower() in filters_deaths
                                    for description in country_DF[description_c].values]]
    
    # Remaining columns (per-location counts): everything except date, description and national total.
    cols = pd.Series(country_DF.columns)
    cols = cols[~cols.isin([date_c, description_c, total_c])]
    
    country_DF = country_DF.dropna(thresh = 3) #drop every entry containing no data, as we can't infer anything (it's not 0)
    recomputed_total = country_DF[cols].apply(pd.to_numeric).sum(axis=1)
    
    # Keep the reported national total when present; otherwise fall back to the sum of the other columns.
    country_DF[total_c] = pd.to_numeric(country_DF[total_c]).combine_first(recomputed_total)

    country_DF.loc[[description.lower() in filters_new_cases for description in country_DF[description_c]], description_c] = 'New cases'
    country_DF.loc[[description.lower() in filters_deaths for description in country_DF[description_c]], description_c] = 'Deaths'
    
    
    country_computation = pd.DataFrame(country_DF.groupby([country_DF[date_c].dt.month, country_DF[date_c].dt.day, country_DF[description_c]])[total_c].sum())
    country_computation = country_computation.groupby(level=[0,2]).mean()
    country_computation.index.names = ['Month', 'Description']
    country_computation.columns = ['Daily average per month']
    return country_computation

guinea = country_DF('guinea', 'Date', 'Description', 'Totals', ['total new cases registered so far'], ['new deaths registered', 'new deaths registered today'])
liberia = country_DF('liberia', 'Date', 'Variable', 'National', ['new case/s (suspected)', 'new case/s (probable)', 'new case/s (confirmed)'], ['newly reported deaths'])
sl = country_DF('sl', 'date', 'variable', 'National', ['new_noncase', 'new_suspected', 'new_probable', 'new_confirmed'], ['etc_new_deaths'])

pd.concat({'Guinea': guinea, 'Liberia': liberia, 'Sierra Leone': sl})

,Month,Description,Daily average per month
Guinea,8,Deaths,3.4
Guinea,8,New cases,25.8
Guinea,9,Deaths,3.5625
Guinea,9,New cases,19.625
Guinea,10,Deaths,15.0
Guinea,10,New cases,34.0
Liberia,6,Deaths,2.0
Liberia,6,New cases,5.714286
Liberia,7,Deaths,4.272727
Liberia,7,New cases,8.545455


In [282]:
def get_filenames(path):
    filenames = !ls $path
    return filenames

def get_cols(df, to_remove):
    cols = pd.Series(df.columns)
    return cols[~cols.isin(to_remove)]


def country_DF_cumulative(country, date_c, description_c, total_c, filters_new_cases, filters_deaths, date_begin= '2014-11-30', date_end='2014-12-09'):
    """Estimate daily averages from the cumulative totals reported on date_begin and date_end.

    The reports between the two dates must cover consecutive days (no gap).
    """
    PATH = DATA_FOLDER + '/ebola/' + country + '_data/'
    
    country_DF = pd.concat([pd.read_csv(PATH + date_begin + '.csv', parse_dates=[date_c], na_values=['-']), 
                            pd.read_csv(PATH + date_end + '.csv', parse_dates=[date_c], na_values=['-'])]) 
    
    country_DF = country_DF.loc[[ description.lower() in filters_new_cases or
                                    description.lower() in filters_deaths
                                    for description in country_DF[description_c].values]]
    
    cols = get_cols(country_DF, to_remove=[date_c, description_c, total_c])
    recomputed_total = country_DF[cols].apply(pd.to_numeric).sum(axis=1)
    country_DF[total_c] = pd.to_numeric(country_DF[total_c]).combine_first(recomputed_total)
    
    country_DF.loc[[description.lower() in filters_new_cases for description in country_DF[description_c]], description_c] = 'New cases'
    country_DF.loc[[description.lower() in filters_deaths for description in country_DF[description_c]], description_c] = 'Deaths'
    
    days_between = (pd.to_datetime(date_end) - pd.to_datetime(date_begin)).days
    
    country_DF = country_DF.groupby(country_DF[description_c])[total_c].apply(lambda x: x.iloc[1] - x.iloc[0])/days_between
    country_DF = pd.DataFrame(pd.concat({pd.to_datetime(date_end).month: country_DF}))
    country_DF.index.names = ['Month', 'Description']
    country_DF.columns = ['Daily average per month']
    
    return country_DF



def country_DF_mass(country, date_c, description_c, total_c, filters_new_cases, filters_deaths, lsnames=''):
    
    PATH = ''
    
    if(not lsnames):
        PATH = DATA_FOLDER + '/ebola/' + country + '_data/'
        country_files = get_filenames(PATH)
    else:
        country_files = lsnames

    country_DF = pd.concat([pd.read_csv(PATH+filename, parse_dates=[date_c], na_values=['-']) 
                           for filename in country_files]) 

    country_DF = country_DF.loc[[ description.lower() in filters_new_cases or
                                    description.lower() in filters_deaths
                                    for description in country_DF[description_c].values]]
    
    cols = get_cols(country_DF, to_remove=[date_c, description_c, total_c])
    
    country_DF = country_DF.dropna(thresh = 3) #drop every entry containing no data, as we can't infer anything (it's not 0)
    recomputed_total = country_DF[cols].apply(pd.to_numeric).sum(axis=1)

    country_DF[total_c] = pd.to_numeric(country_DF[total_c]).combine_first(recomputed_total)

    country_DF.loc[[description.lower() in filters_new_cases for description in country_DF[description_c]], description_c] = 'New cases'
    country_DF.loc[[description.lower() in filters_deaths for description in country_DF[description_c]], description_c] = 'Deaths'
    
    country_computation = pd.DataFrame(country_DF.groupby([country_DF[date_c].dt.month, country_DF[date_c].dt.day, country_DF[description_c]])[total_c].sum())
    country_computation = country_computation.groupby(level=[0,2]).mean()
    country_computation.index.names = ['Month', 'Description']
    country_computation.columns = ['Daily average per month']
    return country_computation

guinea = country_DF_mass('guinea', 'Date', 'Description', 'Totals', ['total new cases registered so far'], ['new deaths registered', 'new deaths registered today'])

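# All Liberia reports except December (2014-12-*), which is handled separately from cumulative totals.
liberia_mass_files = !ls 'Data/ebola/liberia_data'/2014-[01][^2]*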
liberia = country_DF_mass('liberia', 'Date', 'Variable', 'National', ['new case/s (suspected)', 'new case/s (probable)', 'new case/s (confirmed)'], ['newly reported deaths'], liberia_mass_files)
liberia_december = country_DF_cumulative('liberia', 'Date', 'Variable', 'National', ['cumulative confirmed, probable and suspected cases'], ['total death/s in confirmed, \n probable, suspected cases'])
liberia = pd.concat([liberia, liberia_december])

sl = country_DF_mass('sl', 'date', 'variable', 'National', ['new_noncase', 'new_suspected', 'new_probable', 'new_confirmed'], ['etc_new_deaths'])

pd.concat({'Guinea': guinea, 'Liberia': liberia, 'Sierra Leone': sl})



,Month,Description,Daily average per month
Guinea,8,Deaths,3.4
Guinea,8,New cases,25.8
Guinea,9,Deaths,3.5625
Guinea,9,New cases,19.625
Guinea,10,Deaths,15.0
Guinea,10,New cases,34.0
Liberia,6,Deaths,2.0
Liberia,6,New cases,5.714286
Liberia,7,Deaths,4.272727
Liberia,7,New cases,8.545455
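
The `country_DF_cumulative` function above estimates a daily average from two cumulative reports instead of from daily "new" figures. A minimal sketch of that idea follows, on made-up numbers; the column name `National` mirrors the Liberia files, and the rows are assumed to be already relabelled as 'New cases' and 'Deaths'.

```python
import pandas as pd

# Hypothetical cumulative totals from two reports (values are made up).
cumulative = pd.DataFrame({
    'Date': pd.to_datetime(['2014-11-30', '2014-11-30', '2014-12-09', '2014-12-09']),
    'Description': ['New cases', 'Deaths', 'New cases', 'Deaths'],
    'National': [7000, 3000, 7090, 3018],
})

date_begin, date_end = cumulative['Date'].min(), cumulative['Date'].max()
days_between = (date_end - date_begin).days  # 9 days between the two reports

# For each label: (last cumulative total - first cumulative total) / elapsed days.
ordered = cumulative.sort_values('Date')
per_label = ordered.groupby('Description')['National'].agg(lambda s: s.iloc[-1] - s.iloc[0])
daily_average = per_label / days_between
print(daily_average)
```

This only yields an average over the whole interval; it says nothing about how the cases were distributed across individual days.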


## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.
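
The steps requested above (stack the sheets, attach the metadata, rebuild a unique index, tag missing values) can be sketched on in-memory stand-ins. Everything in this sketch is hypothetical: the sheet keys, the column names (`Taxon`, `Count`, `BARCODE`, `GROUP`, `SAMPLE`) and the values are assumptions for illustration, not the content of the actual files.

```python
import pandas as pd

# Hypothetical stand-ins for two of the nine sheets: taxon name -> count.
sheets = {
    'MID1': pd.DataFrame({'Taxon': ['Archaea A', 'Bacteria B'], 'Count': [7, 2]}),
    'MID2': pd.DataFrame({'Taxon': ['Archaea A', 'Bacteria C'], 'Count': [1, 5]}),
}
# Hypothetical metadata: one row per sheet, with one missing value.
metadata = pd.DataFrame({'BARCODE': ['MID1', 'MID2'],
                         'GROUP': ['Control', 'NEC'],
                         'SAMPLE': [None, 'tissue']})

# Stack the sheets, remembering which sheet each row came from.
combined = pd.concat(sheets, names=['BARCODE']).reset_index(level='BARCODE')

# Attach the metadata columns, rebuild a unique 0..n-1 index, tag missing values.
combined = combined.merge(metadata, on='BARCODE', how='left')
combined = combined.reset_index(drop=True).fillna('unknown')
print(combined)
```

With real files, the dictionary of frames would be built by reading each spreadsheet (for example with `pd.read_excel`); the rest of the sketch would stay the same.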

## Solution
---


In [None]:
# Write your answer here

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [None]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.
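
Questions 2 and 6 both rely on binning ages, once into fixed decade intervals and once into two equally populated groups. A small sketch of the two pandas tools that fit these cases follows; the ages are made up and the labels `younger` and `older` are hypothetical.

```python
import pandas as pd

# Made-up ages; the real values would come from the Titanic file.
ages = pd.Series([2, 15, 22, 22, 35, 41, 58, 63], name='age')

# Discrete decade intervals: [0, 10), [10, 20), ..., [70, 80).
decade_bins = list(range(0, 90, 10))
decades = pd.cut(ages, bins=decade_bins, right=False)
print(decades.value_counts(sort=False))

# Two equally populated categories: qcut splits at the median.
age_category = pd.qcut(ages, q=2, labels=['younger', 'older'])
print(age_category.value_counts())
```

`pd.cut` uses fixed bin edges, so group sizes vary, while `pd.qcut` chooses the edges from the data so that each group holds roughly the same number of passengers. Missing ages stay `NaN` in both cases and need an explicit assumption.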

## Solution
---


In [None]:
# Write your answer here