# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [None]:
DATA_FOLDER = 'Data' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

In [None]:
# Useful imports
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os  # used to list all files in a folder automatically

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.
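
As a minimal sketch of what "daily average per month" could mean here, the toy example below averages daily counts over all reports that fall in the same month, per country. The numbers are invented and the column names (`Country`, `Date`, `Cases`, `Deaths`) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy daily reports; the numbers are invented for illustration only.
reports = pd.DataFrame({
    'Country': ['Guinea', 'Guinea', 'Guinea', 'Liberia', 'Liberia'],
    'Date': pd.to_datetime(['2014-08-04', '2014-08-26', '2014-09-02',
                            '2014-08-10', '2014-08-20']),
    'Cases': [10, 20, 30, 4, 6],
    'Deaths': [2, 4, 6, 1, 3],
})

# Average the daily counts over every report that falls in the same month.
monthly_means = (
    reports
    .assign(Month=reports['Date'].dt.month)
    .groupby(['Country', 'Month'])[['Cases', 'Deaths']]
    .mean()
)
print(monthly_means)
```

Grouping on the month number alone is only safe when all reports come from a single year; otherwise the year would need to be part of the key as well.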

# Answer Commentary:

As the csv files are structured in very different ways, I cleaned the data for each country individually before merging it into one dataframe.
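
One possible way to picture this "clean per country, then merge" idea is to map each country's columns onto a shared schema before concatenating. The sketch below uses in-memory CSV text with invented values, and `to_common_schema` is a hypothetical helper, not part of the answer that follows.

```python
import io
import pandas as pd

# Two in-memory "files" with different layouts (invented values).
guinea_csv = io.StringIO(
    "Date,Description,Totals\n"
    "2014-08-04,New deaths registered,2\n"
    "2014-08-04,Total new cases of the day,11\n"
)
liberia_csv = io.StringIO(
    "Date,Variable,National\n"
    "2014-08-04,Newly reported deaths,3\n"
    "2014-08-04,New case/s (confirmed),5\n"
)

def to_common_schema(raw, label_col, value_col, country):
    """Hypothetical helper: rename country-specific columns to shared names."""
    frame = pd.read_csv(raw, parse_dates=['Date'])
    frame = frame.rename(columns={label_col: 'Label', value_col: 'Value'})
    frame['Country'] = country
    return frame[['Country', 'Date', 'Label', 'Value']]

# Concatenate once, after every source has the same columns.
combined = pd.concat(
    [
        to_common_schema(guinea_csv, 'Description', 'Totals', 'Guinea'),
        to_common_schema(liberia_csv, 'Variable', 'National', 'Liberia'),
    ],
    ignore_index=True,
)
print(combined)
```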

In [None]:
path = DATA_FOLDER + '/ebola/'

#I proceed to go through each folder, starting with guinea

guinea_frame = pd.DataFrame()
for filename in os.listdir(path+'guinea_data'):
    frame = pd.read_csv(path+'guinea_data/'+filename, parse_dates=['Date'],
                        usecols=['Date', 'Description', 'Totals'])#keep only the 3 relevant columns
    
    deaths = frame[['New deaths registered' in d for d in frame.Description]
                  ][['Date', 'Totals']].drop_duplicates(subset='Date')
    
    cases= frame[['Total new cases' in d for d in frame.Description]][['Date', 'Totals']]

    deaths.columns = ['Date', 'Deaths']
    cases.columns = ['Date', 'Cases']
    
    if(cases.shape != deaths.shape):
        raise AssertionError #make sure that we get the same number of values per file
                            #will be one row per date/csv file
    
    total = pd.merge(deaths, cases, on='Date', how='inner')
    total['Date'] = total['Date'].map(lambda x : x.month) #we only want to know about the month (all csv set in 2014)
    guinea_frame = pd.concat([guinea_frame, total])

guinea_frame['Country'] = 'Guinea'

#make sure we get some data from every document & we don't miss rows
print('Getting all rows : ',
      guinea_frame.Deaths.shape[0] == len(os.listdir(path+'guinea_data')))


The next part treats the data from Liberia. Overall, the transcription contains many inconsistencies,
which were addressed by treating each case separately.
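
As a rough illustration of handling inconsistent labels, the sketch below matches a label fragment regardless of capitalisation using `Series.str.contains`. The labels and numbers are invented; only the `Variable` and `National` column names are taken from the Liberia files.

```python
import pandas as pd

# Invented labels showing the kind of spelling variation to expect.
frame = pd.DataFrame({
    'Variable': ['New Case/s (Suspected)', 'New case/s (probable)',
                 'New case/s (confirmed)', 'Newly reported deaths', None],
    'National': [2, 3, 4, 1, 9],
})

# case=False ignores capitalisation, regex=False treats '/' and '(' literally,
# and na=False makes missing labels count as "no match" instead of raising.
is_new_case = frame['Variable'].str.contains('new case/s', case=False,
                                             regex=False, na=False)
new_cases_total = frame.loc[is_new_case, 'National'].sum()
print(new_cases_total)
```

Compared with a list comprehension over the column, this form does not fail when a label is missing.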

In [None]:
liberia_frame = pd.DataFrame()
cumulative_cases = pd.DataFrame()

for filename in os.listdir(path+'liberia_data'):
    frame = pd.read_csv(path+'liberia_data/'+filename, parse_dates=['Date'],
                       usecols=['Date', 'Variable', 'National']) #get only the 3 relevant columns
    
    #one csv contains duplicates on the variable column
    #As the values differ I assumed that the first value is the correct one
    frame = frame.drop_duplicates(subset='Variable') 
    
    deaths = frame[[d in 'Newly reported deaths' for d in frame.Variable]][['Date', 'National']]
    deaths.columns = ['Date', 'Deaths']
    
    #we get all 3 values of new cases (suspected, probable and confirmed) & sum them up
    cases_int = frame[['New Case/s'.lower() in d.lower() for d in frame.Variable]][['Date', 'National']]  
    cases_int.columns = ['Date', 'Cases'] #renaming for convenience
    cases = cases_int.groupby('Date', as_index=False)['Cases'].sum()
    
    #make sure that there is no error in the rows we're getting
    if(cases.shape != deaths.shape):
        raise AssertionError
    
    total = pd.merge(deaths, cases, on='Date', how='inner')  
    total['Date'] = total['Date'].map(lambda x : x.month)
    liberia_frame = pd.concat([liberia_frame, total])
    
#in December the cumulative totals and the new cases are swapped, so no values for the new cases are provided
#we approximate those values by taking the difference between consecutive cumulative totals
liberia_frame.loc[(liberia_frame.Cases > 1000 ),'Cases'] =\
    liberia_frame[[d > 1000 for d in liberia_frame.Cases]]['Cases'].diff(periods=1)

liberia_frame['Country'] = 'Liberia'    

print('Getting all rows : ' , 
      liberia_frame.Deaths.shape[0] == len(os.listdir(path+'liberia_data')))    


In [None]:
#dealing with sierra leone files

sl_frame = pd.DataFrame()
for filename in os.listdir(path+'sl_data'):
    frame = pd.read_csv(path+'sl_data/'+filename, parse_dates=['date'],
                       usecols=['date', 'variable', 'National'])
    #these csv files report cumulative deaths only, not new deaths
    #we find an approximate value by taking the difference between two consecutive registered days
    deaths = frame[[d in ['death_suspected','death_probable','death_confirmed' ]
                    for d in frame.variable]][['date', 'variable', 'National']]

    deaths.National = deaths.National.astype(float) #float dtype so that nan values are handled
    deaths = deaths.groupby('date', as_index=False)['National'].sum() #summing over 3 possible types of death
    
    
    cases = frame[[d in ['new_suspected','new_probable','new_confirmed' ]
                    for d in frame.variable]][['date', 'variable', 'National']]
    
    #nan values are kept here: the map below passes them through and sum() skips them
    cases.National = cases.National.map(
        lambda x: x if type(x) is float else x.replace(',', '')) #if string contains , remove it
    
    cases.National = cases.National.astype(float) #needed to sum
    cases = cases.groupby('date', as_index=False)['National'].sum()
    
    cases.columns = ['Date', 'Cases']
    deaths.columns = ['Date', 'Total Deaths']
    
    if(cases.shape != deaths.shape):
        raise AssertionError
    
    total = pd.merge(deaths, cases, on='Date', how='inner')  
    total['Date'] = total['Date'].map(lambda x : x.month)
    
    sl_frame = pd.concat([sl_frame, total])#aggregate all csv files

sl_frame.reset_index(drop=True, inplace=True)

#calculate new deaths by taking the difference between consecutive total deaths
new_deaths = sl_frame['Total Deaths'].diff(periods=1) 

sl_frame['Deaths'] = new_deaths #set new deaths
del sl_frame['Total Deaths']

sl_frame.loc[((sl_frame.Deaths < 0)), 'Deaths'] = float('nan') #clearly we can't have negative deaths
sl_frame.loc[((sl_frame.Deaths > 200)), 'Deaths'] = float('nan')#strong outliers (2) in dataset

sl_frame['Country'] = 'Sierra Leone'

print('Getting all rows : ' , 
    sl_frame.Cases.shape[0] == len(os.listdir(path+'sl_data')))
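
The step of turning cumulative totals into new daily values can be sketched in isolation as follows. The totals are invented, and the explicit sort by date is an addition for illustration: `os.listdir` returns files in arbitrary order, so a difference between consecutive rows is only meaningful once the rows are in chronological order.

```python
import pandas as pd

# Invented cumulative death totals, including one downward correction
# and one implausible jump.
totals = pd.Series([100, 104, 110, 108, 115, 400],
                   index=pd.date_range('2014-09-01', periods=6, freq='D'))

# Sort by date first, then take the day-to-day difference.
new_deaths = totals.sort_index().diff()

# Mask values that cannot be real daily counts (same 0..200 range as above).
new_deaths = new_deaths.where((new_deaths >= 0) & (new_deaths <= 200))
print(new_deaths)
```

The first day has no predecessor, so its value is `NaN` as well.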

Now that we have clean values for all three countries, we can easily put them together and calculate the means.

In [None]:
#MERGING ALL FRAMES
ebola_deaths_cases = pd.concat([guinea_frame, liberia_frame, sl_frame]).reset_index(drop=True)
ebola_deaths_cases.Deaths = ebola_deaths_cases.Deaths.astype(float)
ebola_deaths_cases.Cases = ebola_deaths_cases.Cases.astype(float)
means = ebola_deaths_cases.groupby(['Date', 'Country'], as_index=False)[['Deaths', 'Cases']].mean()

means.sort_values(by='Country', ascending=True).set_index(['Country', 'Date']) #the means we were asked to calculate, sorted for convenience

In [None]:
#some figures (for fun!)
plt.figure();
grouped_means = means.groupby('Country')
grouped_means.get_group('Guinea').plot(x='Date', title='Guinea')
grouped_means.get_group('Liberia').plot(x='Date', title='Liberia')
grouped_means.get_group('Sierra Leone').plot(x='Date', title='Sierra Leone')

BONUS answer: 
To make sure the 'cleaned' data makes sense (no negative deaths or cases, etc.), I used simple bar graphs to check it at a glance.

In [None]:
ebola_deaths_cases.groupby('Country').plot(x='Date', kind='bar')
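
Alongside the visual check, the same conditions could also be tested programmatically. The sketch below runs on an invented stand-in frame with the same column names as `ebola_deaths_cases`; the name `check_frame` and its values are assumptions for illustration.

```python
import pandas as pd

# Invented stand-in for the combined frame built above.
check_frame = pd.DataFrame({
    'Country': ['Guinea', 'Liberia', 'Sierra Leone'],
    'Date': [8, 9, 10],
    'Deaths': [2.0, float('nan'), 5.0],
    'Cases': [11.0, 7.0, 30.0],
})

values = check_frame[['Deaths', 'Cases']]
# NaN compares as False, so missing values do not trigger the check.
has_negative = bool((values < 0).any().any())
months_valid = bool(check_frame['Date'].between(1, 12).all())
print('negative counts present:', has_negative)
print('all months in 1..12:', months_valid)
```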

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

# Answer Commentary:

The idea for the first part of this exercise is to merge all the files and to name each imported column after its file name (this makes it easier to tag the columns using the barcode at the end).
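
To show what joining on the taxon index does, here is a small sketch with two invented single-column frames standing in for two of the spreadsheets. The taxon names and counts are made up.

```python
import pandas as pd

# Invented counts standing in for two of the MID spreadsheets.
mid1 = pd.DataFrame({'MID1': [7, 2]},
                    index=pd.Index(['Archaea A', 'Bacteria B'], name='Taxon'))
mid2 = pd.DataFrame({'MID2': [5, 1]},
                    index=pd.Index(['Bacteria B', 'Bacteria C'], name='Taxon'))

# An outer join keeps every taxon seen in either file; taxa missing
# from one file get NaN in that file's column.
joined = mid1.join(mid2, how='outer')
print(joined)
```

Because each taxon appears once per file, the joined index stays unique, which is one of the requirements of the task.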

In [None]:
aggregated_frame = pd.DataFrame() # Creation of the aggregate (as in the exercises)
aggregated_frame.index.name = 'Taxon' # Tagging the index to be able to join the frames

for i in range(1, 10):
    filename = 'MID' + str(i) # Defined as a variable to be able to tag the columns
    temp_frame = pd.read_excel(DATA_FOLDER + '/microbiome/' + filename + '.xls','Sheet 1', index_col=0, header=None)
    temp_frame.columns = [filename]
    temp_frame.index.name = 'Taxon'
    aggregated_frame = aggregated_frame.join(temp_frame, how='outer') #Joining frames
    
aggregated_frame #Show intermediate DataFrame

In the second part, we import the metadata and clean it, before merging it in the third part.

In [None]:
metadata = pd.read_excel(DATA_FOLDER + '/microbiome/metadata.xls','Sheet1', index_col=0)
metadata = metadata.fillna('NA') #pandas reads the literal 'NA' as NaN; we restore it as the string 'NA' so that it matches
                                    #the metadata file and stays distinct from the 'unknown' tag used later
metadata

In the final part, we group the elements by the values given in the metadata file and replace the NaN values.
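
The transpose-join-transpose pattern can be sketched on toy data as follows. The counts and metadata values are invented; the cast to `object` before `fillna` is an addition for illustration, so that text and numbers can share a column.

```python
import pandas as pd

# Invented counts (taxa as rows, one column per barcode) and metadata.
counts = pd.DataFrame({'MID1': [7.0, None], 'MID2': [None, 5.0]},
                      index=pd.Index(['Archaea A', 'Bacteria B'], name='Taxon'))
meta = pd.DataFrame({'GROUP': ['Control 1', 'NEC 1'],
                     'SAMPLE': ['tissue', 'stool']},
                    index=pd.Index(['MID1', 'MID2'], name='BARCODE'))

# After transposing, barcodes become the row index and line up with `meta`.
by_barcode = counts.T.join(meta)
# Move the metadata into the index, then flip back so taxa are rows again.
result = by_barcode.set_index(['SAMPLE', 'GROUP']).T
# Cast to object so the 'unknown' tag can sit next to numeric counts.
result = result.astype(object).fillna('unknown')
print(result)
```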

In [None]:
aggregated_frame = aggregated_frame.T.join(metadata) #Allows us to easily join with metadata
final_frame = aggregated_frame.set_index(['SAMPLE', 'GROUP']).T #We group the elements and obtain the desired shape
final_frame = final_frame.fillna('unknown') #This is the last step as required
final_frame #Display final frame

### Notes:

1. Instead of iterating over a fixed range of file numbers, we could iterate over the barcodes listed in the metadata, which would make it possible to add more files without having to change the code.
2. We did not know whether the 'NA' value in the SAMPLE column should be tagged as 'unknown' or kept as is, so we decided to stick with the original name.
3. To group the elements, we thought it best to define SAMPLE as the supergroup because it spans more columns than GROUP (easier visualization), but it is easy to swap the two levels and order them according to each group.
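
The first note could look roughly like the sketch below, where the barcodes in the metadata index drive the loading loop. `load_counts` is a hypothetical loader that returns invented data; in the notebook it would wrap the `pd.read_excel` call on the file named after the barcode.

```python
import pandas as pd

# Invented metadata; in the notebook this would come from metadata.xls.
meta = pd.DataFrame({'GROUP': ['Control 1', 'NEC 1']},
                    index=pd.Index(['MID1', 'MID2'], name='BARCODE'))

def load_counts(barcode):
    """Hypothetical loader standing in for reading '<barcode>.xls'."""
    fake_files = {
        'MID1': {'Archaea A': 7, 'Bacteria B': 2},
        'MID2': {'Bacteria B': 5},
    }
    return pd.Series(fake_files[barcode], name=barcode).rename_axis('Taxon')

# One column per barcode listed in the metadata; concat along axis=1
# aligns on the Taxon index and fills gaps with NaN.
all_counts = pd.concat([load_counts(b) for b in meta.index], axis=1)
print(all_counts)
```

With this layout, adding a new sample would only require a new row in the metadata and a matching file.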

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [None]:
from IPython.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

# Importing the xls file
titanicXls = pd.read_excel(DATA_FOLDER+'/titanic.xls', header=None)

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.
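
Two of the binning steps mentioned in questions 2 and 6 can be sketched with `pd.cut` and `pd.qcut`. The ages below are invented, and the labels `younger` and `older` are assumptions chosen for illustration.

```python
import pandas as pd

# Invented ages; the real values would come from the passengers' age column.
ages = pd.Series([0.9, 8, 19, 22, 25, 31, 38, 47, 54, 71], name='age')

# Decade intervals [0, 10), [10, 20), ...; right=False closes the left edge.
decades = pd.cut(ages, bins=list(range(0, 90, 10)), right=False)
decade_counts = decades.value_counts(sort=False)

# qcut splits at the median, giving two (nearly) equally populated groups.
age_category = pd.qcut(ages, q=2, labels=['younger', 'older'])
print(decade_counts)
print(age_category.value_counts())
```

When many passengers share the median age, the two `qcut` groups may not be exactly equal in size, so the resulting counts are worth checking.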

In [None]:
# Write your answer here