# Part 1 - Importing necessary packages

If the installation completed correctly, this cell should run without errors.

In [None]:
%matplotlib inline

# Numerical library
import numpy as np

# Data manipulation
import pandas as pd
from patsy import dmatrix

# Plotting
import matplotlib
import matplotlib.pyplot as plt

# Survival analysis
import lifelines

# Part 2 - A look at the clinical data
The data is available at http://www.cbioportal.org/study?id=brca_tcga_pub

Now let's load the clinical data.

In [None]:
clinical = pd.read_csv('data/brca_tcga_pub_clinical_data.tsv', sep='\t')

with pd.option_context('display.max_columns', None):
    display(clinical)

Let's plot what the survival data looks like.

In [None]:
matplotlib.rcParams['figure.figsize'] = [15, 50]

# Keep only the survival columns, sort patients by follow-up time, and drop missing values
data_sorted = (
    clinical[['Overall Survival (Months)', 'Overall Survival Status']]
    .sort_values(by='Overall Survival (Months)')
    .dropna()
    .reset_index(drop=True)
)
status_slice = data_sorted['Overall Survival Status'] == 'DECEASED'

plt.barh(data_sorted.loc[~status_slice].index, data_sorted.loc[~status_slice,'Overall Survival (Months)'], height = 1, color = 'b')
plt.barh(data_sorted.loc[status_slice].index, data_sorted.loc[status_slice,'Overall Survival (Months)'], height = 1, color = 'r')
plt.legend(['Alive', 'Dead'])
plt.ylabel('Patients')
plt.xlabel('Months alive')

---
### Exercises
1.1 Describe the plot and the inferences you can make from it.

---

OK, now that we've seen the data, let's play around with it.

What does the survival curve look like in general? We can use the survival analysis package __lifelines__ to find out by generating a *Kaplan-Meier plot*.

In [None]:
kmf = lifelines.KaplanMeierFitter()

# Note: this drops every patient with a missing value in any clinical column
clinical = clinical.dropna()

Time = clinical['Overall Survival (Months)']
Event = clinical['Overall Survival Status'] == 'DECEASED'

kmf.fit(Time, Event)

matplotlib.rcParams['figure.figsize'] = [15, 10]
kmf.plot()
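
To make the Kaplan-Meier estimate less of a black box, here is a minimal sketch of the product-limit calculation in plain Python. The function name `kaplan_meier` and the toy data are hypothetical and not part of the dataset. At each time where a death is observed, the running survival probability is multiplied by the fraction of at-risk patients who survived that time. Censored patients count as at risk until their last follow-up.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct death time."""
    # Number of observed deaths at each time point (censored patients are skipped)
    deaths = Counter(t for t, died in zip(durations, events) if died)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Patients still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: follow-up in months; True = death observed, False = censored
toy_months = [5, 8, 8, 12, 15, 20]
toy_died = [True, False, True, True, False, True]

for month, prob in kaplan_meier(toy_months, toy_died):
    print(f"S({month}) = {prob:.3f}")
```

The curve only steps down at months where a death is observed. Censored patients never cause a step, but they shrink the at-risk group for later steps.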

---
### Exercises

2.1 Compare the Kaplan-Meier plot with the first one. What additional insights are available in this second plot?

---

Now we can start to explore clinical variables that might influence the survival curve.

Experiment with the groupings and see if you can find some useful insights.

In [None]:
# Define positive groups here
groups = clinical['Metastasis-Coded'] == 'Positive'

kmf.fit(Time[~groups], Event[~groups], label='False')
ax = kmf.plot()
kmf.fit(Time[groups], Event[groups], label='True')
kmf.plot(ax=ax)

---
### Exercises
3.1 Make at least 3 plots using different separation criteria, label them accordingly, and save them in the report.

3.2 Make a plot that includes the survival curve for 3 different age groups.

3.3 (Advanced) The **logrank_test** function in the **lifelines.statistics** module performs a statistical test on two groups to check whether their "death generation process" is the same. Use this function to obtain a significance statistic for the separations above.

---

# Part 3 - Incorporating gene expression data

It's time to look at the gene expression data!

First we load the expression data for the same samples. This may take some time.

In [None]:
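expression_raw = pd.read_csv('data/data_expression_median.txt', sep='\t')

# Index by gene symbol, drop the first remaining column, and transpose
# so that rows are samples and columns are genes
expression = expression_raw.set_index('Hugo_Symbol').iloc[:, 1:].T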

We then merge the clinical and expression data into a single table and display the result.

In [None]:
data = clinical.merge(expression, how='inner', left_on='Sample ID', right_index=True)
data
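
The reshaping and merging steps are easier to follow on a tiny table. The sketch below uses made-up sample IDs and values, and it assumes the second column of the expression file is a gene identifier named `Entrez_Gene_Id`. Check the real file, since that name is not confirmed here. It shows that the transpose turns samples into rows and that the inner merge keeps only samples present in both tables.

```python
import pandas as pd

# Hypothetical miniature of the expression file: one row per gene,
# two identifier columns (the second name is an assumption), then one column per sample
toy_expression_raw = pd.DataFrame({
    'Hugo_Symbol': ['TP53', 'BRCA1'],
    'Entrez_Gene_Id': [7157, 672],
    'S1': [0.5, -1.2],
    'S2': [1.1, 0.3],
    'S3': [-0.4, 0.9],
})

# Hypothetical miniature of the clinical table; sample S4 has no expression data
toy_clinical = pd.DataFrame({
    'Sample ID': ['S1', 'S2', 'S4'],
    'Diagnosis Age': [55, 62, 47],
})

# Same steps as above: genes become columns, samples become the row index
toy_expression = toy_expression_raw.set_index('Hugo_Symbol').iloc[:, 1:].T

# The inner merge drops S3 (no clinical record) and S4 (no expression record)
toy_data = toy_clinical.merge(
    toy_expression, how='inner', left_on='Sample ID', right_index=True
)
print(toy_data)
```

Comparing the row counts of `clinical`, `expression`, and `data` in the real notebook is a quick way to see how many samples the inner merge discarded.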

How can we incorporate the expression data into the survival analysis? With Cox regression.

In [None]:
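# Q('...') quotes column names that are not valid Python identifiers;
# '-1' removes the intercept column from the design matrix
formula = "Q('Diagnosis Age') -1"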

X = dmatrix(formula, data)
X
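
As background for Cox regression: the model assumes that each patient's hazard is a shared baseline hazard multiplied by `exp(beta * x)`, where `x` is a covariate such as age or a gene's expression level. The coefficient `beta` is chosen to maximize the partial likelihood. The sketch below evaluates that partial log-likelihood for a single covariate in plain Python. The function name and toy data are hypothetical, tied event times are assumed not to occur, and this is only an illustration of the objective, not how a library fits the model in practice.

```python
import math


def cox_partial_log_likelihood(beta, durations, events, covariate):
    """Cox partial log-likelihood for one covariate (assumes no tied death times)."""
    total = 0.0
    for t_i, died, x_i in zip(durations, events, covariate):
        if not died:
            continue  # censored patients contribute only through the risk sets
        # Everyone still under observation at the time of this death
        risk_set = [x_j for t_j, x_j in zip(durations, covariate) if t_j >= t_i]
        total += beta * x_i - math.log(sum(math.exp(beta * x_j) for x_j in risk_set))
    return total


# Hypothetical toy data: older patients tend to die earlier
toy_months = [5, 8, 12, 20]
toy_died = [True, False, True, True]
toy_age = [70, 50, 60, 40]

for beta in (-0.05, 0.0, 0.05):
    ll = cox_partial_log_likelihood(beta, toy_months, toy_died, toy_age)
    print(f"beta = {beta:+.2f}  partial log-likelihood = {ll:.4f}")
```

In this toy data a positive `beta` scores higher than zero or a negative value, matching the intuition that higher age goes with higher hazard.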