# 05. Logistic regression

In [12]:
# Install a conda package in the current Jupyter kernel
# xlrd package needs to be installed for pandas to open Excel files
import sys
! conda install --yes --prefix {sys.prefix} xlrd

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [1]:
import os
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import matplotlib.gridspec as gridspec
from mpl_toolkits.axes_grid1 import make_axes_locatable

from IPython.display import display

In [2]:
data = './data/'
out = './out/'

# Bold print for Jupyter Notebook
b1 = '\033[1m'
b0 = '\033[0m'

### Just some matplotlib and seaborn parameter tuning

In [3]:
axistitlesize = 20
axisticksize = 17
axislabelsize = 26
axislegendsize = 23
axistextsize = 20
axiscbarfontsize = 15

# Set axtick dimensions
major_size = 6
major_width = 1.2
minor_size = 3
minor_width = 1
mpl.rcParams['xtick.major.size'] = major_size
mpl.rcParams['xtick.major.width'] = major_width
mpl.rcParams['xtick.minor.size'] = minor_size
mpl.rcParams['xtick.minor.width'] = minor_width
mpl.rcParams['ytick.major.size'] = major_size
mpl.rcParams['ytick.major.width'] = major_width
mpl.rcParams['ytick.minor.size'] = minor_size
mpl.rcParams['ytick.minor.width'] = minor_width

mpl.rcParams.update({'figure.autolayout': False})

# Seaborn style settings
sns.set_style({'axes.axisbelow': True,
               'axes.edgecolor': '.8',
               'axes.facecolor': 'white',
               'axes.grid': True,
               'axes.labelcolor': '.15',
               'axes.spines.bottom': True,
               'axes.spines.left': True,
               'axes.spines.right': True,
               'axes.spines.top': True,
               'figure.facecolor': 'white',
               'font.family': ['sans-serif'],
               'font.sans-serif': ['Arial',
                'DejaVu Sans',
                'Liberation Sans',
                'Bitstream Vera Sans',
                'sans-serif'],
               'grid.color': '.8',
               'grid.linestyle': '--',
               'image.cmap': 'rocket',
               'lines.solid_capstyle': 'round',
               'patch.edgecolor': 'w',
               'patch.force_edgecolor': True,
               'text.color': '.15',
               'xtick.bottom': True,
               'xtick.color': '.15',
               'xtick.direction': 'in',
               'xtick.top': True,
               'ytick.color': '.15',
               'ytick.direction': 'in',
               'ytick.left': True,
               'ytick.right': True})

# Colorpalettes, colormaps, etc.
sns.set_palette(palette='rocket')

## 1. Download data from https://science.sciencemag.org/content/359/6378/926 (supplementary materials). If you do not succeed, you will find the _aar3247_Cohen_SM_Tables-S1-S11.xlsx_ file in the homework's folder.
 - read the abstract of the article to get familiar with the origin of the data
 - open the data in Excel and get familiar with its content
 - load the protein level data (you need to figure out which sheet that is) as a pandas dataframe
 - handle missing values and convert features to numeric values where needed
 - drop the unnecessary columns (those that encode neither protein levels nor the tumor type) and the CancerSEEK results
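
The conversion step in the list above can be sketched on a toy frame. In the sheet, some measurements are stored as text prefixed with `*` or `**` (the footnotes relate these markers to the limit of detection), so the columns are not numeric after loading. The helper name `to_numeric_levels`, the decision to keep the reported number after stripping the marker, and the median imputation are all assumptions made for illustration, not the method prescribed by the exercise.

```python
import numpy as np
import pandas as pd

def to_numeric_levels(frame):
    """Strip leading '*' / '**' markers and convert every column to floats.

    Hypothetical helper: values that cannot be parsed become NaN.
    """
    cleaned = frame.astype(str).apply(lambda col: col.str.lstrip('*'))
    return cleaned.apply(pd.to_numeric, errors='coerce')

# Toy stand-in for two protein columns (not the real dataset)
toy = pd.DataFrame({
    'AFP (pg/ml)': ['1583.45', '*715.308', 4365.53],
    'CA 15-3 (U/ml)': [19.08, '**836.85', np.nan],
})

numeric = to_numeric_levels(toy)
# One simple option for the remaining gaps: fill with the column median
numeric = numeric.fillna(numeric.median())
```

On the real dataframe the same helper would be applied only to the protein columns, leaving `Tumor type` untouched.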

### 1./a. Open the protein dataset

In [4]:
import urllib.request

### Issue

For some reason, pandas cannot read Excel files directly from an `urllib3.response.HTTPResponse` object:

- [Issue #20434](https://github.com/pandas-dev/pandas/issues/20434)
- [Issue #28825](https://github.com/pandas-dev/pandas/issues/28825)

It was reportedly addressed in [PR #28874](https://github.com/pandas-dev/pandas/pull/28874), but either the fix did not work or the bug was reintroduced in a newer release.

In [6]:
# PANDAS BUG!
#url = 'https://science.sciencemag.org/highwire/filestream/704651/field_highwire_adjunct_files/1/aar3247_Cohen_SM_Tables-S1-S11.xlsx'
#with urllib.request.urlopen(url) as response:
#    df = pd.read_excel(response)
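
A common workaround for this kind of problem is to read the whole response into memory and hand pandas a seekable buffer instead of the raw HTTP response. This is only a sketch: the helper name `to_seekable_buffer` is hypothetical, and it is an assumption that the missing seek support is what breaks the read in the issues linked above.

```python
import io

def to_seekable_buffer(response):
    """Read a file-like HTTP response fully into an in-memory, seekable buffer."""
    return io.BytesIO(response.read())

# Hypothetical usage (not run here; needs network access and an Excel engine):
# import urllib.request
# import pandas as pd
# with urllib.request.urlopen(url) as response:
#     df = pd.read_excel(to_seekable_buffer(response),
#                        sheet_name='Table S6', header=2)
```

The whole file is held in memory, which is fine for a spreadsheet of this size.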

### Open file locally

In [8]:
os.listdir(data)

['aar3247_Cohen_SM_Tables-S1-S11.xlsx']

In [31]:
df = pd.read_excel(data + 'aar3247_Cohen_SM_Tables-S1-S11.xlsx', sheet_name='Table S6', header=2)

In [32]:
display(df.head())
display(df.tail())

Unnamed: 0,Patient ID #,Sample ID #,Tumor type,AJCC Stage,AFP (pg/ml),Angiopoietin-2 (pg/ml),AXL (pg/ml),CA-125 (U/ml),CA 15-3 (U/ml),CA19-9 (U/ml),...,sFas (pg/ml),SHBG (nM),sHER2/sEGFR2/sErbB2 (pg/ml),sPECAM-1 (pg/ml),TGFa (pg/ml),Thrombospondin-2 (pg/ml),TIMP-1 (pg/ml),TIMP-2 (pg/ml),CancerSEEK Logistic Regression Score,CancerSEEK Test Result
0,CRC 455,CRC 455 PLS 1,Colorectum,I,1583.45,5598.5,3621.04,5.09,19.08,*16.452,...,*204.792,55.06,6832.07,9368.53,*16.086,21863.7,56428.7,39498.82,0.938342,Positive
1,CRC 456,CRC 456 PLS 1,Colorectum,I,*715.308,20936.3,2772.96,7.27,10.04,40.91,...,*204.792,72.92,5549.47,6224.55,*16.086,29669.7,73940.5,41277.09,0.925363,Positive
2,CRC 457,CRC 457 PLS 1,Colorectum,II,4365.53,2350.93,4120.77,*4.854,16.96,*16.452,...,*204.792,173.78,3698.16,4046.48,179.03,6020.47,22797.3,28440.6,0.852367,Negative
3,CRC 458,CRC 458 PLS 1,Colorectum,II,*715.308,1604.34,2029.96,5.39,8.31,*16.452,...,*204.792,29.47,5856.0,6121.93,*16.086,4331.02,20441.2,25896.73,0.617639,Negative
4,CRC 459,CRC 459 PLS 1,Colorectum,II,801.3,2087.57,2069.17,*4.854,11.73,*16.452,...,*204.792,78.07,5447.93,6982.32,*16.086,2311.91,56288.5,49425.2,0.318434,Negative


Unnamed: 0,Patient ID #,Sample ID #,Tumor type,AJCC Stage,AFP (pg/ml),Angiopoietin-2 (pg/ml),AXL (pg/ml),CA-125 (U/ml),CA 15-3 (U/ml),CA19-9 (U/ml),...,sFas (pg/ml),SHBG (nM),sHER2/sEGFR2/sErbB2 (pg/ml),sPECAM-1 (pg/ml),TGFa (pg/ml),Thrombospondin-2 (pg/ml),TIMP-1 (pg/ml),TIMP-2 (pg/ml),CancerSEEK Logistic Regression Score,CancerSEEK Test Result
1816,PAPA 1357,PAPA 1357 PLS 1,Ovary,III,*879.498,3546.43,1493.32,1428.31,**836.85,37.9,...,*207.24,72.22,3967.55,4045.18,*16.89,12877.1,88464.0,47219.24,1.0,Positive
1817,,,,,,,,,,,...,,,,,,,,,,
1818,*Protein concentration below the limit of dete...,,,,,,,,,,...,,,,,,,,,,
1819,**Protein concentration above the limit of det...,,,,,,,,,,...,,,,,,,,,,
1820,NA: Not available,,,,,,,,,,...,,,,,,,,,,


In [33]:
# the last 4 rows are an empty row plus footnotes, not data
df = df.iloc[:-4]

In [34]:
display(df.head())
display(df.tail())

Unnamed: 0,Patient ID #,Sample ID #,Tumor type,AJCC Stage,AFP (pg/ml),Angiopoietin-2 (pg/ml),AXL (pg/ml),CA-125 (U/ml),CA 15-3 (U/ml),CA19-9 (U/ml),...,sFas (pg/ml),SHBG (nM),sHER2/sEGFR2/sErbB2 (pg/ml),sPECAM-1 (pg/ml),TGFa (pg/ml),Thrombospondin-2 (pg/ml),TIMP-1 (pg/ml),TIMP-2 (pg/ml),CancerSEEK Logistic Regression Score,CancerSEEK Test Result
0,CRC 455,CRC 455 PLS 1,Colorectum,I,1583.45,5598.5,3621.04,5.09,19.08,*16.452,...,*204.792,55.06,6832.07,9368.53,*16.086,21863.7,56428.7,39498.82,0.938342,Positive
1,CRC 456,CRC 456 PLS 1,Colorectum,I,*715.308,20936.3,2772.96,7.27,10.04,40.91,...,*204.792,72.92,5549.47,6224.55,*16.086,29669.7,73940.5,41277.09,0.925363,Positive
2,CRC 457,CRC 457 PLS 1,Colorectum,II,4365.53,2350.93,4120.77,*4.854,16.96,*16.452,...,*204.792,173.78,3698.16,4046.48,179.03,6020.47,22797.3,28440.6,0.852367,Negative
3,CRC 458,CRC 458 PLS 1,Colorectum,II,*715.308,1604.34,2029.96,5.39,8.31,*16.452,...,*204.792,29.47,5856.0,6121.93,*16.086,4331.02,20441.2,25896.73,0.617639,Negative
4,CRC 459,CRC 459 PLS 1,Colorectum,II,801.3,2087.57,2069.17,*4.854,11.73,*16.452,...,*204.792,78.07,5447.93,6982.32,*16.086,2311.91,56288.5,49425.2,0.318434,Negative


Unnamed: 0,Patient ID #,Sample ID #,Tumor type,AJCC Stage,AFP (pg/ml),Angiopoietin-2 (pg/ml),AXL (pg/ml),CA-125 (U/ml),CA 15-3 (U/ml),CA19-9 (U/ml),...,sFas (pg/ml),SHBG (nM),sHER2/sEGFR2/sErbB2 (pg/ml),sPECAM-1 (pg/ml),TGFa (pg/ml),Thrombospondin-2 (pg/ml),TIMP-1 (pg/ml),TIMP-2 (pg/ml),CancerSEEK Logistic Regression Score,CancerSEEK Test Result
1812,PAPA 1353,PAPA 1353 PLS 1,Ovary,I,*879.498,1484.7,2096.76,24.82,10.3,42.39,...,*207.24,115.24,5390.31,8538.58,*16.89,*599.4,167800,50128.6,0.980312,Positive
1813,PAPA 1354,PAPA 1354 PLS 1,Ovary,I,1337.33,1607.9,852.37,5.58,9.8,*16.44,...,*207.24,147.17,7951.03,12966.19,*16.89,*599.4,123444,54066.98,0.999995,Positive
1814,PAPA 1355,PAPA 1355 PLS 1,Ovary,III,*879.498,1592.84,1044.45,30.48,8.48,*16.44,...,*207.24,104.63,2396.36,1901.41,*16.89,*599.4,104071,39844.02,1.0,Positive
1815,PAPA 1356,PAPA 1356 PLS 1,Ovary,II,*879.498,5267.95,1445.69,1469.45,23.74,62.26,...,*207.24,73.55,3079.81,5312.9,*16.89,6864.33,110579,42921.13,1.0,Positive
1816,PAPA 1357,PAPA 1357 PLS 1,Ovary,III,*879.498,3546.43,1493.32,1428.31,**836.85,37.9,...,*207.24,72.22,3967.55,4045.18,*16.89,12877.1,88464,47219.24,1.0,Positive


### 1./b. Handle missing values

In [44]:
print('Missing values in the dataset by features  :\n'+
      '--------------------------------------------')
print(df.isna().sum())

Missing values in the dataset by features  :
--------------------------------------------
Patient ID #                              0
Sample ID #                               0
Tumor type                                0
AJCC Stage                              812
AFP (pg/ml)                               0
Angiopoietin-2 (pg/ml)                    0
AXL (pg/ml)                               6
CA-125 (U/ml)                             0
CA 15-3 (U/ml)                            0
CA19-9 (U/ml)                             0
CD44 (ng/ml)                              6
CEA (pg/ml)                               0
CYFRA 21-1 (pg/ml)                        0
DKK1 (ng/ml)                              0
Endoglin (pg/ml)                          0
FGF2 (pg/ml)                              0
Follistatin (pg/ml)                       0
Galectin-3 (ng/ml)                        0
G-CSF (pg/ml)                             7
GDF15 (ng/ml)                             0
HE4 (pg/ml)                   

## 2. Predict if a sample is cancerous or not
 - you need to build a classifier that predicts the probability that a sample comes from a person with cancer (i.e. the tumor type is not normal), based on the measured protein levels
 - train a logistic regression (sklearn API) on every second sample (not on the first 50% of the data (!), use every second row)
 - generate predictions for the samples that were not used during training
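
The every-second-row split described above can be sketched with slicing. The data here is synthetic and stands in for the protein matrix and the cancer/normal label. Standardizing the features before the fit is an added assumption (protein levels span very different scales), not a requirement of the task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the protein matrix: 200 samples, 5 features
X = rng.normal(size=(200, 5))
# Synthetic binary target: 1 = cancer, 0 = normal
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Rows 0, 2, 4, ... for training; rows 1, 3, 5, ... are held out
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Predicted probability of the positive class for the held-out rows
proba_test = model.predict_proba(X_test)[:, 1]
```

With a dataframe, the same split is `df.iloc[::2]` for training and `df.iloc[1::2]` for testing.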

## 3. Comparison to CancerSEEK
 - plot the ROC curve and calculate the confusion matrix for the predictions
 - do the same for the CancerSEEK predictions
 - compare your model's performance to CancerSEEK performance
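
A minimal sketch of the two evaluation steps, using synthetic labels and scores in place of real held-out predictions. The 0.5 cut-off for the hard labels is an assumption; any other threshold could be used.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

rng = np.random.default_rng(1)
# Synthetic true labels and predicted scores (not real model output)
y_true = rng.integers(0, 2, size=300)
scores = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=300), 0, 1)

# ROC curve: false/true positive rates over all score thresholds
fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)

# Hard labels from an assumed 0.5 cut-off, then the confusion matrix
y_pred = (scores >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
```

The curve itself can be drawn with `plt.plot(fpr, tpr)`. For the CancerSEEK side of the comparison, the `CancerSEEK Logistic Regression Score` column could presumably play the role of `scores` and `CancerSEEK Test Result` the role of `y_pred`, evaluated on the same held-out rows.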

## 4. Hepatocellular carcinoma
 - fit a logistic regression (using the statsmodels API this time) to predict whether a sample has hepatocellular carcinoma (liver cancer) or not. You need to keep only the liver and the normal samples for this exercise! For fitting, use only the first 25 features and all the rows (which are liver or normal)
 - select the 5 best predictors based on P values
 - write down the most important features (based on P value) and compare them to the tumor markers that you find on Wikipedia https://en.wikipedia.org/wiki/Hepatocellular_carcinoma or in other sources!

## 5. Multiclass classification
 - Again using every second data point, train a logistic regression (sklearn API) to predict the tumor type. It is a multiclass classification problem.
 - Generate predictions for the rest of the dataset and show the confusion matrix for the predictions!
 - Plot the ROC curves for the different cancer types on the same plot!
 - Interpret your results. Which cancer type can be predicted the most reliably?
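
For the multiclass case, one way to get a ROC curve per cancer type is the one-vs-rest view: each class's predicted probability is scored against a binary "this class or not" label. The sketch below uses three synthetic classes `A`, `B`, `C` as placeholders for tumor types. One-vs-rest is an assumption here, since the task does not say how the per-class curves should be built.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(3)
classes = np.array(['A', 'B', 'C'])  # placeholder tumor types
centers = np.array([[0, 0], [3, 0], [0, 3]])
y = np.repeat(classes, 100)
X = np.repeat(centers, 100, axis=0) + rng.normal(size=(300, 2))

# Train on every second row, test on the rest
clf = LogisticRegression(max_iter=1000).fit(X[::2], y[::2])
X_test, y_test = X[1::2], y[1::2]

cm = confusion_matrix(y_test, clf.predict(X_test), labels=clf.classes_)

# One-vs-rest ROC: one curve per class from its predicted probability
proba = clf.predict_proba(X_test)
y_bin = label_binarize(y_test, classes=clf.classes_)
roc_auc = {}
for i, name in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_bin[:, i], proba[:, i])
    roc_auc[name] = auc(fpr, tpr)
```

Calling `plt.plot(fpr, tpr, label=name)` inside the loop puts all curves on one axis. The class with the largest area under its curve is a natural candidate for "most reliably predicted".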

### Hints:
 - In total you can get 10 points for fully completing all tasks.
 - Decorate your notebook with questions, explanations, etc. Make it self-contained and understandable!
 - Comment your code when necessary
 - Write functions for repetitive tasks!
 - Use the pandas package for data loading and handling
 - Use matplotlib and seaborn for plotting, or bokeh and plotly for interactive investigation
 - Use the scikit-learn package for almost everything
 - Use for loops only if it is really necessary!
 - Code sharing is not allowed between students! Sharing code will result in zero points.
 - Using code found on the web is OK, but make its source clear!