# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [None]:
# Imports
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import webbrowser 

# Constants
DATA_FOLDER = 'Data/'

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.
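One possible shape for this computation is sketched below on made-up numbers. The two small frames, their column names (`date`, `new_cases`, `deaths`) and the values are assumptions for illustration only; the real reports use different labels per country, which would have to be mapped onto common columns first.

```python
import pandas as pd

# Hypothetical stand-ins for two already-parsed country reports.
guinea = pd.DataFrame({
    "date": pd.to_datetime(["2014-08-04", "2014-08-26", "2014-09-02"]),
    "new_cases": [4, 10, 6],
    "deaths": [2, 4, 3],
})
liberia = pd.DataFrame({
    "date": pd.to_datetime(["2014-08-05", "2014-09-01"]),
    "new_cases": [8, 12],
    "deaths": [1, 5],
})

# Stack the per-country frames, keeping the country as a column.
ebola = pd.concat(
    [guinea.assign(country="Guinea"), liberia.assign(country="Liberia")],
    ignore_index=True,
)

# Average the daily figures within each (country, month) group.
ebola["month"] = ebola["date"].dt.to_period("M")
monthly_avg = ebola.groupby(["country", "month"])[["new_cases", "deaths"]].mean()
print(monthly_avg)
```

Grouping on a monthly `Period` keeps months from different years apart, which a plain month number would not.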

In [None]:
# Write your answer here

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.
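The steps above could look roughly like the following sketch. The frames, the `barcode` key and the column names are hypothetical placeholders, not the layout of the actual spreadsheets.

```python
import pandas as pd

# Hypothetical stand-ins for two of the nine spreadsheets (taxon, read count).
mid1 = pd.DataFrame({"taxon": ["Archaea A", "Bacteria B"], "count": [7, 2]})
mid2 = pd.DataFrame({"taxon": ["Archaea A", "Bacteria C"], "count": [3, 9]})

# Hypothetical metadata sheet: one row per spreadsheet, with one missing value.
metadata = pd.DataFrame({
    "barcode": ["MID1", "MID2"],
    "group": ["control", "NEC 1"],
    "sample": [None, "tissue"],
})

# Tag each sheet with its barcode, then stack them.
frames = {"MID1": mid1, "MID2": mid2}
combined = pd.concat(
    [df.assign(barcode=code) for code, df in frames.items()],
    ignore_index=True,
)

# Attach the metadata columns and replace missing values with the tag.
combined = combined.merge(metadata, on="barcode", how="left").fillna("unknown")

# Each (barcode, taxon) pair occurs once, so it can serve as a unique index.
combined = combined.set_index(["barcode", "taxon"])
print(combined)
```

Checking `combined.index.is_unique` afterwards is a cheap way to confirm that the chosen index really is unique.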

In [None]:
# Write your answer here

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

## For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.

### Question 3.1

##### Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 

Assumptions: 
    - "For each exercise, please provide both a written explanation of the steps you will apply to manipulate the data, and the corresponding code." We assume that the written explanation can take the form of code comments as well as text.
    - We assume that we do not have to describe the value range of string attributes, since neither string lengths nor ASCII values give any insight.

In [None]:
''' 
Here is a sample of the information in the titanic dataframe
''' 

# Importing titanic.xls info with Pandas
titanic = pd.read_excel('Data/titanic.xls')

# Printing the frame; pandas truncates long frames to the first and last rows
print(titanic)

In [None]:
'''
To describe the INTENDED values and types of the data we show the titanic.html file that was provided to us
Notice:
    - 'age' is of type double, so someone can be 17.5 years old; fractions are mostly used for babies that are 0.x years old
    - 'cabin' is stored as integer, but it contains letters and digits
    - By this model, 'embarked' is stored as an integer, which has to be interpreted as the 3 different embarkation ports
    - It says that 'boat' is stored as an integer even though it contains spaces and letters; it should be stored as a string
    
PS: it might be that the information stored as integer is supposed to be categorical data,
        ...because they have a "small" number of valid options
''' 

# Display html info in Jupyter Notebook
from IPython.display import display, HTML
htmlFile = 'Data/titanic.html'
# filename= makes HTML read the file; a bare string would be rendered as markup
display(HTML(filename=htmlFile))


In [None]:
''' 
The default types of the data after import:
Notice:
    - the strings and characters are imported as objects
    - 'survived' is imported as int instead of double (which is in our opinion better since it's only 0 and 1)
    - 'sex' is imported as object, not integer, because it is a string
'''

titanic.dtypes

In [None]:
''' 
Below you can see the value range of the numerical attributes.

name, sex, ticket, cabin, embarked, boat and home.dest are not included because they can't be quantified numerically.
''' 

titanic.describe()

In [None]:

'''
Additional information that is important to remember when manipulating the data
is whether and where there are NaN values in the dataset
'''

# This displays the number of NaN there is in different attributes
print(pd.isnull(titanic).sum())

'''
Some of this data is missing, while some NaN values are meant to describe 'No' or something else of meaning.
Example:
    Cabin has 1014 NaN in its column. It might be that every passenger had a cabin and the data is missing,
    or that most passengers did not have a cabin, or a mix of both. The displayed titanic.html file
    gives us some insight: it says that there are 0 NaN in the column, which indicates that
    there are 1014 people without a cabin. Boat also has 823 NaN's, while titanic.html lists 0 NaN's.
    That is probably because most of those who died were never in a boat.
'''

In [None]:
'''
What attributes should be stored as categorical information?

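pandas stores categorical data as small integer codes that point into a table of the distinct categories.
With few categories the codes fit in 8-bit integers, and pandas switches to wider integers when there are more.
The benefit is lower memory usage and faster calculations when a column has few distinct values.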
'''

print('Number of unique values in... :')
for attr in titanic:
    print("   {attr}: {u}".format(attr=attr, u=len(titanic[attr].unique())))
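As a side check on how categories are stored, the following sketch inspects the integer codes behind two made-up categorical Series. The Series `small` and `large` are illustrative only and not part of the Titanic data.

```python
import pandas as pd

# Three categories versus a thousand categories.
small = pd.Series(["a", "b", "c"], dtype="category")
large = pd.Series([str(i) for i in range(1000)], dtype="category")

# The codes are integer positions into the category table; pandas picks
# the narrowest integer type that can hold them.
print(small.cat.codes.dtype)
print(large.cat.codes.dtype)
```

The code width grows with the number of categories, so the real trade-off is how many distinct values a column has compared to its length.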

In [None]:
'''
We think it is sensible to categorize: 'pclass', 'survived', 'sex', 'cabin', 'embarked' and 'boat'
because they have under 256 distinct values and no strong numerical meaning like 'age'.
'survived' is a borderline case because it might be more practical to work with integers in some settings
'''

# changing the attributes to categorical data
titanic.pclass = titanic.pclass.astype('category')
titanic.survived = titanic.survived.astype('category')
titanic.sex = titanic.sex.astype('category')
titanic.cabin = titanic.cabin.astype('category')
titanic.embarked = titanic.embarked.astype('category')
titanic.boat = titanic.boat.astype('category')

#Illustrate the change by printing out the new types
titanic.dtypes

### Question 3.2
###### "Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. "

assumptions: 

In [None]:

# Plotting the number of passengers in each travel class (1st, 2nd and 3rd class)
pc = titanic.pclass.value_counts().sort_index().plot(kind='bar')
pc.set_title('Travel classes')
pc.set_ylabel('Number of passengers')
pc.set_xlabel('Travel class')
pc.set_xticklabels(('1st class', '2nd class', '3rd class'))
plt.show()

# Plotting the number of people that embarked from each city (C=Cherbourg, Q=Queenstown, S=Southampton)
em = titanic.embarked.value_counts().sort_index().plot(kind='bar')
em.set_title('Ports of embarkation')
em.set_ylabel('Number of passengers')
em.set_xlabel('Port of embarkation')
em.set_xticklabels(('Cherbourg', 'Queenstown', 'Southampton'))
plt.show()

# Plotting the sex of the passengers
# sort_index() puts 'female' before 'male' so the tick labels below match the bars
sex = titanic.sex.value_counts().sort_index().plot(kind='bar')
sex.set_title('Gender of the passengers')
sex.set_ylabel('Number of passengers')
sex.set_xlabel('Gender')
sex.set_xticklabels(('Female', 'Male'))
plt.show()

# Plotting the age groups of the passengers in decade intervals
bins = [0,10,20,30,40,50,60,70,80]
age_grouped = pd.DataFrame(pd.cut(titanic.age, bins))
ag = age_grouped.age.value_counts().sort_index().plot.bar()
ag.set_title('Age of passengers')
ag.set_ylabel('Number of passengers')
ag.set_xlabel('Age groups')
plt.show()
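How the decade edges are treated depends on the `right` argument of `pd.cut`. The sketch below, on a handful of made-up ages, compares the default right-closed intervals with left-closed ones; the variable names are illustrative.

```python
import pandas as pd

# A few made-up ages that sit on or near decade boundaries.
ages = pd.Series([0.5, 10, 29, 30, 80])
bins = list(range(0, 91, 10))

# Default: intervals closed on the right, so age 10 falls in the first decade.
right_closed = pd.cut(ages, bins)
# right=False: intervals closed on the left, so age 10 opens the second decade.
left_closed = pd.cut(ages, bins, right=False)

print(pd.DataFrame({"age": ages,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```

Either convention gives discrete decade intervals; the choice only matters for passengers whose age is an exact multiple of ten.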


### Question 3.3
###### Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.

assumptions: 
    - Because we are tasked with categorizing passengers by the floor of their cabin, cabin entries such as "F E57" and "F G63" are problematic, since they name two different floors. There were only 7 such instances with conflicting cabin floors, and we leave them out of the chart. We also presumed that there was a floor "T", even though there was only one instance, so it might have been a typo.
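To make the ambiguity concrete, here is a small sketch on made-up cabin strings in the same style as the dataset. The helper name `floor_letters` is hypothetical, and this is only one way to extract the floor letters.

```python
import pandas as pd

# Made-up cabin entries, including two that name conflicting floors.
cabins = pd.Series(["C22 C26", "B5", "F E57", "F G63", "T", None])

def floor_letters(cabin):
    # Collect the distinct capital letters in a cabin string, sorted.
    return "".join(sorted({ch for ch in cabin if ch.isupper()}))

floors = cabins.dropna().map(floor_letters)

# A single letter is unambiguous; two letters mean conflicting floors.
unambiguous = floors[floors.str.len() == 1]
print(floors.tolist())
print(unambiguous.value_counts().sort_index())
```

Entries such as "C22 C26" collapse to a single floor, while "F E57" keeps both letters and is filtered out.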

In [None]:
# Parsing the cabin floor of each passenger: A, B, C, D, E, F, G or T
# Cabins that name two different floors (e.g. "F E57", "F G63") are ambiguous and dropped

# 'cabin' was made categorical in 3.1, so cast back to plain strings first
cabin_floors = titanic.cabin.dropna().astype(str)
cabin_floors = cabin_floors.str.replace(r'[\d ]+', '', regex=True) # removes digits and spaces
cabin_floors = cabin_floors.str.replace(r'(.)(?=.*\1)', '', regex=True) # removes duplicate letters and leaves unique ones (CC -> C)

# Keeping only unambiguous entries, i.e. a single floor letter (FE and FG are dropped)
cabin_floors = cabin_floors[cabin_floors.str.len() == 1]

# Preparing data for plt.pie
numberOfCabinPlaces = cabin_floors.count()
grouped = cabin_floors.groupby(cabin_floors).count()
sizes = np.array(grouped)
labels = np.array(grouped.index)

plt.pie(sizes, labels=labels, autopct='%1.1f%%', pctdistance=0.75, labeldistance=1.1)
plt.show()
print("There are {cabin} passengers with a known cabin floor and {nocabin} passengers without one".format(cabin=numberOfCabinPlaces, nocabin=(len(titanic) - numberOfCabinPlaces)))

### Question 3.4
###### For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.

assumptions: 

In [None]:
def survivedPerClass(pclass):
    # Number of survivors and casualties within one travel class
    in_class = titanic.pclass == pclass
    survived = (in_class & (titanic.survived == 1)).sum()
    died = (in_class & (titanic.survived == 0)).sum()
    return [survived, died]


the_grid = plt.GridSpec(1, 3)
labels = ["Survived", "Died"]

# Each iteration plots a pie chart for one travel class
for p in sorted(titanic.pclass.unique()):
    sizes = survivedPerClass(p)
    plt.subplot(the_grid[0, p-1], aspect=1)
    plt.title('Class {p}'.format(p=p))
    plt.pie(sizes, labels=labels, autopct='%1.1f%%')
    
plt.show()

### Question 3.5
##### "Calculate the proportion of the passengers that survived by travel class and sex. Present your results in a single histogram."

assumptions: 
    1. By "proportions" we assume a likelihood percentage of surviving
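Under that reading, the proportion is simply the mean of a 0/1 survival indicator within each group. A sketch on a tiny made-up table (the frame `toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up passengers: two per (class, sex) combination.
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of ones; times 100 gives percent.
rate = toy.groupby(["pclass", "sex"])["survived"].mean() * 100
print(rate)
```

Because the result is indexed by (class, sex), it can be drawn as a single bar chart with one bar per group.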

In [None]:
# change survived to boolean data; the mean of a boolean column is the share of True values
titanic.survived = titanic.survived.astype(bool)

# Survival percentage for each (travel class, sex) group
survivalpercentage = titanic.groupby(['pclass', 'sex'], observed=True).survived.mean() * 100

histogram = survivalpercentage.plot(kind='bar')
histogram.set_title('Proportion of the passengers that survived by travel class and sex')
histogram.set_ylabel('Percent likelihood of surviving the Titanic')
histogram.set_xlabel('class/gender group')
plt.show()

### Question 3.6
##### "Create 2 equally populated age categories and calculate survival proportions by age category, travel class and sex. Present your results in a DataFrame with unique index."

assumptions: 
1. By "proportions" we assume a likelihood percentage of surviving
2. To create 2 equally populated age categories, we find the median age and split there: passengers at or below the median are "younger", the rest are "older".
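Ties at the median are what keep such a split from being exactly even. The sketch below uses made-up ages to show this, and also shows a rank-based split, an alternative not used in this notebook, that forces equal halves by breaking ties arbitrarily.

```python
import pandas as pd

# Made-up ages with several passengers exactly at the median (28).
ages = pd.Series([20, 25, 28, 28, 28, 28, 35, 40])

# Quantile split: everyone tied at the median lands in the lower bin.
by_qcut = pd.qcut(ages, 2, labels=["Younger", "Older"])

# Rank-based split: ties are ranked in order of appearance, so the
# two halves are equal in size but tied passengers are separated.
by_rank = pd.qcut(ages.rank(method="first"), 2, labels=["Younger", "Older"])

print(by_qcut.value_counts().sort_index())
print(by_rank.value_counts().sort_index())
```

The rank-based version trades a meaningful age boundary for equal group sizes, which may or may not be what the question intends.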

In [None]:
#drop NaN rows
age_without_nan = titanic.age.dropna()

#categorizing
age_categories = pd.qcut(age_without_nan, 2, labels=["Younger", "Older"])

#Numbers to explain difference
median = int(age_without_nan.median())
# counting the passengers whose age equals the median
amount = int((age_without_nan == median).sum())
print("The median age is {median} years old".format(median = median))
print("and there are {amount} passengers that are {median} years old \n".format(amount=amount, median=median))

print(age_categories.groupby(age_categories).count())
print("\nAs you can see, pd.qcut does not cut into entirely equal sized bins, because the age is of a discrete nature")


In [None]:
# Taking an explicit copy so the new columns are not written to a view of titanic
csas = titanic[['pclass', 'sex', 'age', 'survived']].dropna(subset=['age']).copy()
csas['age_group'] = (csas.age > csas.age.median()).map({True: 'older', False: 'younger'})

# Converting to int to make it able to aggregate and give percentage
csas['survived'] = csas.survived.astype(int)

g_categories = csas.groupby(['pclass', 'age_group', 'sex'], observed=True)
result = pd.DataFrame(g_categories.survived.mean()).rename(columns={'survived': 'survived proportion'})

# reset current index and specify the unique index
result = result.reset_index()
unique_index = result.pclass.astype(str) + ': ' + result.age_group.astype(str) + ' ' + result.sex.astype(str)

# Finalize the unique index dataframe
result_w_unique = result[['survived proportion']].set_index(unique_index)
print(result_w_unique)
