# Step 1 - Importing necessary packages

If installation was done correctly, there should be no errors here.

In [None]:
%matplotlib inline

# Numerical library
import numpy as np

# Data manipulation
import pandas as pd
pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", None)
from patsy import dmatrix

# Plotting
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = [15, 10]

# Survival analysis
import lifelines

# Step 2 - Load and manipulate the data
The data is available at http://www.cbioportal.org/study?id=brca_tcga_pub

Now let's load the clinical data.

In [None]:
clinical = pd.read_csv('data/brca_tcga_pub_clinical_data.tsv', sep='\t')
clinical

Let's play around with this data.

What does the overall survival curve look like? We can use the survival analysis package __lifelines__ to find out.

In [None]:
kmf = lifelines.KaplanMeierFitter()

# Drop every sample that has a missing value in any clinical column
clinical = clinical.dropna()

Time = clinical['Overall Survival (Months)']
Event = clinical['Overall Survival Status'] == 'DECEASED'

kmf.fit(Time, Event)
kmf.plot()
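
To make the curve above less of a black box, here is a minimal sketch of the Kaplan-Meier product-limit estimate, S(t) = Π (1 - d_i / n_i) over the event times t_i ≤ t, where d_i is the number of deaths at t_i and n_i the number of patients still at risk just before t_i. It uses only the standard library and made-up toy data; the function name `kaplan_meier` and the toy values are illustrative assumptions, not part of lifelines or of the TCGA dataset.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct time with at least one death."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # deaths and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients surviving past t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Remove everyone who died or was censored at t
        at_risk -= leaving[t]
    return curve


# Toy data (hypothetical): months of follow-up and whether death was observed
toy_durations = [5, 8, 8, 12, 15, 20]
toy_events = [True, False, True, True, False, True]

km_curve = kaplan_meier(toy_durations, toy_events)
for t, s in km_curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored patients (event is `False`) never lower the curve directly; they only shrink the risk set for later time points, which is why dropping them entirely would bias the estimate.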

Now we can start to play around with clinical variables that might influence the survival curve.

In [None]:
# Define positive groups here
groups = clinical['Metastasis-Coded'] == 'Positive'


kmf.fit(Time[~groups], Event[~groups], label='False')
ax = kmf.plot()
kmf.fit(Time[groups], Event[groups], label='True')
kmf.plot(ax=ax)
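
Two curves that look different in a plot may still be compatible with chance, so a common next step is a log-rank test. lifelines provides one as `lifelines.statistics.logrank_test`; the block below is only a rough standard-library sketch of the two-group statistic, so the arithmetic is visible. The function name `logrank_sketch` and the toy data are assumptions made for illustration.

```python
import math


def logrank_sketch(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test. Returns (chi2, p_value) with 1 degree of freedom."""
    event_times = sorted(
        {t for t, e in zip(durations_a, events_a) if e}
        | {t for t, e in zip(durations_b, events_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, events_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        # Observed deaths in A minus the number expected if both groups shared one hazard
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data (hypothetical): group A dies early, group B dies late
toy_a = [2, 3, 4, 5, 6]
toy_b = [10, 12, 14, 16, 18]
all_dead = [True] * 5

chi2, p_value = logrank_sketch(toy_a, all_dead, toy_b, all_dead)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

With the notebook's variables, the analogous call would pass `Time[~groups]`, `Event[~groups]`, `Time[groups]`, and `Event[groups]`.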

It's time to look at the gene expression data!

First we load the expression data for the same samples. This may take some time.

In [None]:
expression_raw = pd.read_csv('data/data_expression_median.txt', sep='\t')
expression = expression_raw.set_index('Hugo_Symbol').iloc[:,1:].T
expression

And then we merge the clinical and expression data into one table and display the result.

In [None]:
data = clinical.merge(expression, how='inner', left_on='Sample ID', right_index=True)
data
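
Because the merge is an inner join between the `Sample ID` column and the index of the expression table, any sample present in only one of the two tables is silently dropped. The toy example below illustrates this behaviour; the sample IDs, ages, and gene names (`GENE_A`, `GENE_B`) are invented for the illustration.

```python
import pandas as pd

# Hypothetical clinical table: one row per sample
toy_clinical = pd.DataFrame({
    "Sample ID": ["S1", "S2", "S3"],
    "Diagnosis Age": [55, 62, 47],
})

# Hypothetical expression table: samples as index, genes as columns
toy_expression = pd.DataFrame(
    {"GENE_A": [0.1, -1.2], "GENE_B": [2.3, 0.4]},
    index=["S2", "S4"],
)

toy_merged = toy_clinical.merge(
    toy_expression, how="inner", left_on="Sample ID", right_index=True
)

# Only S2 appears in both tables, so only S2 survives the inner join
print(toy_merged)
print(f"{len(toy_clinical)} clinical rows -> {len(toy_merged)} merged rows")
```

Comparing `len(clinical)` with `len(data)` after the real merge is a quick way to see how many samples were lost.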

In [None]:
# Q() quotes a column name that contains spaces; "-1" removes the intercept column
formula = "Q('Diagnosis Age') -1"

X = dmatrix(formula, data)
X