# 05. Logistic regression

In [1]:
# Install a conda package in the current Jupyter kernel
# xlrd package needs to be installed for pandas to open Excel files
import sys
! conda install --yes --prefix {sys.prefix} xlrd

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\ProgramData\Miniconda3

  added / updated specs:
    - xlrd


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2020.6.20          |   py38h9bdc248_2         152 KB  conda-forge
    xlrd-1.2.0                 |     pyh9f0ad1d_1         108 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         259 KB

The following NEW packages will be INSTALLED:

  xlrd               conda-forge/noarch::xlrd-1.2.0-pyh9f0ad1d_1

The following packages will be UPDATED:

  certifi                          2020.6.20-py38h32f6830_0 --> 2020.6.20-py38h9bdc248_2




In [1]:
import os
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import matplotlib.gridspec as gridspec
from mpl_toolkits.axes_grid1 import make_axes_locatable

from IPython.display import display

In [2]:
data = './data/'
out = './out/'

# Bold print for Jupyter Notebook
b1 = '\033[1m'
b0 = '\033[0m'

### Just some matplotlib and seaborn parameter tuning

In [3]:
axistitlesize = 20
axisticksize = 17
axislabelsize = 26
axislegendsize = 23
axistextsize = 20
axiscbarfontsize = 15

# Set axtick dimensions
major_size = 6
major_width = 1.2
minor_size = 3
minor_width = 1
mpl.rcParams['xtick.major.size'] = major_size
mpl.rcParams['xtick.major.width'] = major_width
mpl.rcParams['xtick.minor.size'] = minor_size
mpl.rcParams['xtick.minor.width'] = minor_width
mpl.rcParams['ytick.major.size'] = major_size
mpl.rcParams['ytick.major.width'] = major_width
mpl.rcParams['ytick.minor.size'] = minor_size
mpl.rcParams['ytick.minor.width'] = minor_width

mpl.rcParams.update({'figure.autolayout': False})

# Seaborn style settings
sns.set_style({'axes.axisbelow': True,
               'axes.edgecolor': '.8',
               'axes.facecolor': 'white',
               'axes.grid': True,
               'axes.labelcolor': '.15',
               'axes.spines.bottom': True,
               'axes.spines.left': True,
               'axes.spines.right': True,
               'axes.spines.top': True,
               'figure.facecolor': 'white',
               'font.family': ['sans-serif'],
               'font.sans-serif': ['Arial',
                'DejaVu Sans',
                'Liberation Sans',
                'Bitstream Vera Sans',
                'sans-serif'],
               'grid.color': '.8',
               'grid.linestyle': '--',
               'image.cmap': 'rocket',
               'lines.solid_capstyle': 'round',
               'patch.edgecolor': 'w',
               'patch.force_edgecolor': True,
               'text.color': '.15',
               'xtick.bottom': True,
               'xtick.color': '.15',
               'xtick.direction': 'in',
               'xtick.top': True,
               'ytick.color': '.15',
               'ytick.direction': 'in',
               'ytick.left': True,
               'ytick.right': True})

# Colorpalettes, colormaps, etc.
sns.set_palette(palette='rocket')

## 1. Download data from https://science.sciencemag.org/content/359/6378/926 (supplementary materials). If you do not succeed, you will find _aar3247_Cohen_SM_Tables-S1-S11.xlsx_ file in the homework's folder.
 - read the abstract of the article to get familiar with data origin
 - open the data in excel and get familiar with its content
 - load the protein level data (you need to figure out which one is that) as a pandas dataframe
 - handle missing values and convert features to numeric values when it is needed
 - get rid of the unnecessary columns (those that do not encode protein levels or the tumor type) and the CancerSEEK results

### 1./a. Open the protein dataset

#### Open file from URL

In [4]:
#import urllib.request

### Issue

For some reason pandas can't handle I/O with Excel files when they are loaded from an `urllib3.response.HTTPResponse` object:

- [Issue #20434](https://github.com/pandas-dev/pandas/issues/20434)
- [Issue #28825](https://github.com/pandas-dev/pandas/issues/28825)

It was said to be addressed in [PR #28874](https://github.com/pandas-dev/pandas/pull/28874), but it seems that it wasn't, or maybe the bug was reintroduced in a newer release. In the end this code simply doesn't work, although it should under normal circumstances.

In [5]:
# PANDAS BUG!
#url = 'https://science.sciencemag.org/highwire/filestream/704651/field_highwire_adjunct_files/1/aar3247_Cohen_SM_Tables-S1-S11.xlsx'
#with urllib.request.urlopen(url) as response:
#    df = pd.read_excel(response)
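
One workaround that may sidestep the problem is to read the whole response into memory first and hand pandas a seekable `io.BytesIO` buffer instead of the raw response object. The sketch below assumes that a seekable buffer is all `pd.read_excel` needs. `fetch_to_buffer` and its `opener` parameter are hypothetical helper names introduced here, not part of pandas or urllib.

```python
import io
import urllib.request


def fetch_to_buffer(url, opener=urllib.request.urlopen):
    """Download `url` completely and return its content as a seekable buffer."""
    # The raw HTTP response is a forward-only stream; copying it into
    # BytesIO gives a file-like object that supports seek().
    with opener(url) as response:
        return io.BytesIO(response.read())


# Intended usage (not executed here: needs network access and an Excel engine):
# buffer = fetch_to_buffer(url)
# df = pd.read_excel(buffer, sheet_name='Table S6', header=2)
```

The `opener` argument only exists so that the helper can be tried without a network connection, for example by passing a function that returns an in-memory file.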

#### Open file locally

Open file using the local download

In [6]:
os.listdir(data)

['aar3247_Cohen_SM_Tables-S1-S11.xlsx']

In [7]:
df = pd.read_excel(data + 'aar3247_Cohen_SM_Tables-S1-S11.xlsx', sheet_name='Table S6', header=2)

In [8]:
display(df.head())
display(df.tail())

Unnamed: 0,Patient ID #,Sample ID #,Tumor type,AJCC Stage,AFP (pg/ml),Angiopoietin-2 (pg/ml),AXL (pg/ml),CA-125 (U/ml),CA 15-3 (U/ml),CA19-9 (U/ml),...,sFas (pg/ml),SHBG (nM),sHER2/sEGFR2/sErbB2 (pg/ml),sPECAM-1 (pg/ml),TGFa (pg/ml),Thrombospondin-2 (pg/ml),TIMP-1 (pg/ml),TIMP-2 (pg/ml),CancerSEEK Logistic Regression Score,CancerSEEK Test Result
0,CRC 455,CRC 455 PLS 1,Colorectum,I,1583.45,5598.5,3621.04,5.09,19.08,*16.452,...,*204.792,55.06,6832.07,9368.53,*16.086,21863.7,56428.7,39498.82,0.938342,Positive
1,CRC 456,CRC 456 PLS 1,Colorectum,I,*715.308,20936.3,2772.96,7.27,10.04,40.91,...,*204.792,72.92,5549.47,6224.55,*16.086,29669.7,73940.5,41277.09,0.925363,Positive
2,CRC 457,CRC 457 PLS 1,Colorectum,II,4365.53,2350.93,4120.77,*4.854,16.96,*16.452,...,*204.792,173.78,3698.16,4046.48,179.03,6020.47,22797.3,28440.6,0.852367,Negative
3,CRC 458,CRC 458 PLS 1,Colorectum,II,*715.308,1604.34,2029.96,5.39,8.31,*16.452,...,*204.792,29.47,5856.0,6121.93,*16.086,4331.02,20441.2,25896.73,0.617639,Negative
4,CRC 459,CRC 459 PLS 1,Colorectum,II,801.3,2087.57,2069.17,*4.854,11.73,*16.452,...,*204.792,78.07,5447.93,6982.32,*16.086,2311.91,56288.5,49425.2,0.318434,Negative


Unnamed: 0,Patient ID #,Sample ID #,Tumor type,AJCC Stage,AFP (pg/ml),Angiopoietin-2 (pg/ml),AXL (pg/ml),CA-125 (U/ml),CA 15-3 (U/ml),CA19-9 (U/ml),...,sFas (pg/ml),SHBG (nM),sHER2/sEGFR2/sErbB2 (pg/ml),sPECAM-1 (pg/ml),TGFa (pg/ml),Thrombospondin-2 (pg/ml),TIMP-1 (pg/ml),TIMP-2 (pg/ml),CancerSEEK Logistic Regression Score,CancerSEEK Test Result
1816,PAPA 1357,PAPA 1357 PLS 1,Ovary,III,*879.498,3546.43,1493.32,1428.31,**836.85,37.9,...,*207.24,72.22,3967.55,4045.18,*16.89,12877.1,88464.0,47219.24,1.0,Positive
1817,,,,,,,,,,,...,,,,,,,,,,
1818,*Protein concentration below the limit of dete...,,,,,,,,,,...,,,,,,,,,,
1819,**Protein concentration above the limit of det...,,,,,,,,,,...,,,,,,,,,,
1820,NA: Not available,,,,,,,,,,...,,,,,,,,,,


In [9]:
# The last 4 rows are just an empty line and footnotes, not samples
df = df.iloc[:-4]
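
Slicing off a fixed number of rows works for this file, but it depends on the exact footnote count. A slightly more defensive variant, sketched here on a toy frame with made-up entries, keeps only the rows that have a `Tumor type`. It assumes that every real sample has that field filled, which the missing-value counts of this dataset seem to support.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the sheet: two samples, a blank row and two footnotes
toy = pd.DataFrame({
    'Patient ID #': ['CRC 455', 'CRC 456', np.nan, '*Protein ...', 'NA: Not available'],
    'Tumor type': ['Colorectum', 'Normal', np.nan, np.nan, np.nan],
})

# Footnote rows only fill the first column, so `Tumor type` is NaN there
samples = toy.dropna(subset=['Tumor type'])
print(samples)
```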

In [10]:
display(df.tail())

Unnamed: 0,Patient ID #,Sample ID #,Tumor type,AJCC Stage,AFP (pg/ml),Angiopoietin-2 (pg/ml),AXL (pg/ml),CA-125 (U/ml),CA 15-3 (U/ml),CA19-9 (U/ml),...,sFas (pg/ml),SHBG (nM),sHER2/sEGFR2/sErbB2 (pg/ml),sPECAM-1 (pg/ml),TGFa (pg/ml),Thrombospondin-2 (pg/ml),TIMP-1 (pg/ml),TIMP-2 (pg/ml),CancerSEEK Logistic Regression Score,CancerSEEK Test Result
1812,PAPA 1353,PAPA 1353 PLS 1,Ovary,I,*879.498,1484.7,2096.76,24.82,10.3,42.39,...,*207.24,115.24,5390.31,8538.58,*16.89,*599.4,167800,50128.6,0.980312,Positive
1813,PAPA 1354,PAPA 1354 PLS 1,Ovary,I,1337.33,1607.9,852.37,5.58,9.8,*16.44,...,*207.24,147.17,7951.03,12966.19,*16.89,*599.4,123444,54066.98,0.999995,Positive
1814,PAPA 1355,PAPA 1355 PLS 1,Ovary,III,*879.498,1592.84,1044.45,30.48,8.48,*16.44,...,*207.24,104.63,2396.36,1901.41,*16.89,*599.4,104071,39844.02,1.0,Positive
1815,PAPA 1356,PAPA 1356 PLS 1,Ovary,II,*879.498,5267.95,1445.69,1469.45,23.74,62.26,...,*207.24,73.55,3079.81,5312.9,*16.89,6864.33,110579,42921.13,1.0,Positive
1816,PAPA 1357,PAPA 1357 PLS 1,Ovary,III,*879.498,3546.43,1493.32,1428.31,**836.85,37.9,...,*207.24,72.22,3967.55,4045.18,*16.89,12877.1,88464,47219.24,1.0,Positive


### 1./b. Handle missing values

#### Possible problems of naive filling and solutions

Handling columns with just a few ($< 10$) missing values is straightforward. However, almost half of the values (812 of 1817) are missing in the column `AJCC Stage`, which makes it more problematic. If this feature has a high impact, the chosen filling or dropping method could significantly alter the models in the next tasks. It is therefore reasonable to test the impact of the features on a simple model under different filling methods.

I can simply use Logistic Regression (since that's the topic of this assignment) to test how robust the results are to the way the training data is filled. For reference, I'll also fit a Random Forest Regressor. For the feature impact analysis I'll use SHAP values to get an approximate picture of the situation.

#### Non-numeric columns

There are numerous features with non-numeric entries, or numeric entries with non-numeric characters attached. First of all, the first two ID columns can simply be dropped: they are arbitrary identifiers and carry no useful information. However, three more columns hold useful data in non-numeric form: `Tumor type`, `AJCC Stage` and `CancerSEEK Test Result`. The last one is a binary column, while `Tumor type` and `AJCC Stage` are categorical features with $9$ and $3$ categories respectively. These can easily be mapped to numeric values, which I'll do first, before any other analysis or column filling.

All other columns with NaN entries hold continuous variables, so their missing entries can be filled with the mean of the existing values. There is one more problem, though, which also affects columns without any NaNs. Besides NaNs, the dataset contains values that are numeric but stored as strings with a `*` or `**` prepended. The meaning of these markers is given in the original `.xlsx` document, and also in the tail of the very first, raw DataFrame in this notebook:

- `*`  : Protein concentration below the limit of detection of the assay; value set as experiment-specific lower limit of detection  
- `**` : Protein concentration above the limit of detection of the assay; value set as experiment-specific upper limit of detection

Every occurrence of such a value should be converted to a number so that it can be used in the analysis and when filling missing entries.
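
As a small illustration of that conversion, the sketch below strips the markers from a toy column and parses the rest. The values are borrowed from the table preview. The two boolean flags are an optional extra introduced here (not something the notebook uses later) in case it matters which entries were clipped at a detection limit.

```python
import numpy as np
import pandas as pd

# Toy column with the kinds of entries seen in the sheet
raw = pd.Series([1583.45, '*715.308', '**836.85', np.nan])

# Strip any leading asterisks, then parse; missing entries stay NaN
as_text = raw.astype(str)
values = pd.to_numeric(as_text.str.lstrip('*'), errors='coerce')

# Optional: remember which entries were clipped at a detection limit
above_limit = as_text.str.startswith('**')
below_limit = as_text.str.startswith('*') & ~above_limit
```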

In [30]:
print('# of missing values in the dataset by features:\n'+
      '-----------------------------------------------')
print(df.isna().sum())

# of missing values in the dataset by features:
-----------------------------------------------
Patient ID #                              0
Sample ID #                               0
Tumor type                                0
AJCC Stage                              812
AFP (pg/ml)                               0
Angiopoietin-2 (pg/ml)                    0
AXL (pg/ml)                               6
CA-125 (U/ml)                             0
CA 15-3 (U/ml)                            0
CA19-9 (U/ml)                             0
CD44 (ng/ml)                              6
CEA (pg/ml)                               0
CYFRA 21-1 (pg/ml)                        0
DKK1 (ng/ml)                              0
Endoglin (pg/ml)                          0
FGF2 (pg/ml)                              0
Follistatin (pg/ml)                       0
Galectin-3 (ng/ml)                        0
G-CSF (pg/ml)                             7
GDF15 (ng/ml)                             0
HE4 (pg/ml)             

In [140]:
# Create a new DataFrame to work on, without the two ID columns
df_n = df.copy()
df_n = df_n[df.columns[2:]]

#### 1./b. - 1. Convert entries with appended `*` and `**` symbols to numeric

The second most upvoted answer to this question (not the accepted one) is a neat solution:  
https://stackoverflow.com/questions/13682044/remove-unwanted-parts-from-strings-in-a-column

In [141]:
columns = df_n.columns

In [142]:
# Columns with fully non-numeric entries
# Can be checked by
#     (df_n.applymap(type) == str).all(0),
# but NaN values are floats, so this check is only partly accurate
str_columns = ['Tumor type', 'AJCC Stage', 'CancerSEEK Test Result']

# Remaining columns: numeric entries, possibly mixed with
# starred strings and NaNs
nmr_columns = [c for c in df_n.columns if c not in str_columns]

# Create a map of numeric and non-numeric entries
# Here `True` entries stand for non-numeric (string) values,
# while `False` entries mark numeric values
str_map = (df_n[nmr_columns].applymap(type) == str)

In [151]:
str_map

Unnamed: 0,AFP (pg/ml),Angiopoietin-2 (pg/ml),AXL (pg/ml),CA-125 (U/ml),CA 15-3 (U/ml),CA19-9 (U/ml),CD44 (ng/ml),CEA (pg/ml),CYFRA 21-1 (pg/ml),DKK1 (ng/ml),...,sEGFR (pg/ml),sFas (pg/ml),SHBG (nM),sHER2/sEGFR2/sErbB2 (pg/ml),sPECAM-1 (pg/ml),TGFa (pg/ml),Thrombospondin-2 (pg/ml),TIMP-1 (pg/ml),TIMP-2 (pg/ml),CancerSEEK Logistic Regression Score
0,False,False,False,False,False,True,False,False,True,False,...,False,True,False,False,False,True,False,False,False,False
1,True,False,False,False,False,False,False,False,True,False,...,False,True,False,False,False,True,False,False,False,False
2,False,False,False,True,False,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
3,True,False,False,False,False,True,False,False,True,False,...,False,True,False,False,False,True,False,False,False,False
4,False,False,False,True,False,True,False,False,True,False,...,False,True,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1812,True,False,False,False,False,False,False,False,True,False,...,False,True,False,False,False,True,True,False,False,False
1813,False,False,False,False,False,True,False,False,True,False,...,False,True,False,False,False,True,True,False,False,False
1814,True,False,False,False,False,True,False,True,False,False,...,False,True,False,False,False,True,True,False,False,False
1815,True,False,False,False,False,False,False,True,False,False,...,False,True,False,False,False,True,False,False,False,False


In [155]:
def remove_stars(df, str_map):
    
    df_c = df.copy()
    for c in str_map.columns:
        mask = str_map[c]
        if mask.any():
            # Index with .loc instead of chained indexing to avoid
            # pandas' SettingWithCopyWarning
            df_c.loc[mask, c] = df_c.loc[mask, c].str.replace('*', '', regex=False).astype(float)
            # Every entry is numeric now, so make the whole column float
            df_c[c] = df_c[c].astype(float)
        
    return df_c

In [156]:
df_test = remove_stars(df_n, str_map)


#### 1./b. - 2. Map `Tumor type`

In [32]:
print('# of different values in the column `Tumor type`:')
print('-------------------------------------------------')
print(df['Tumor type'].value_counts())

# of different values in the column `Tumor type`:
-------------------------------------------------
Normal        812
Colorectum    388
Breast        209
Lung          104
Pancreas       93
Stomach        68
Ovary          54
Esophagus      45
Liver          44
Name: Tumor type, dtype: int64


In [16]:
map_tumor_type = {key : i+1 for i, key in enumerate(df['Tumor type'].value_counts().index)}
df_n['Tumor type'] = df['Tumor type'].map(map_tumor_type)

In [31]:
print('# of different values in the column `Tumor type`:')
print('-------------------------------------------------')
print(df_n['Tumor type'].value_counts())

# of different values in the column `Tumor type`:
-------------------------------------------------
1    812
2    388
3    209
4    104
5     93
6     68
7     54
8     45
9     44
Name: Tumor type, dtype: int64
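
The integer codes above are convenient as class labels, but they impose an arbitrary order on the categories. Should `Tumor type` ever be used as an input feature, one-hot encoding is a common alternative, and the cancer-vs-normal task only needs a binary target. The sketch below shows all three encodings on a toy column; the `tumor` prefix and the `is_cancer` name are choices made here for illustration.

```python
import pandas as pd

tumor = pd.Series(['Normal', 'Colorectum', 'Breast', 'Normal'], name='Tumor type')

# Integer codes ordered by frequency, as in the cell above
codes = {key: i + 1 for i, key in enumerate(tumor.value_counts().index)}
as_int = tumor.map(codes)

# One-hot encoding: one indicator column per category, no implied ordering
one_hot = pd.get_dummies(tumor, prefix='tumor')

# Binary target for the cancer-vs-normal task
is_cancer = (tumor != 'Normal').astype(int)
```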


#### 1./b. - 3. Map `AJCC Stage`

In [18]:
print('Different values in the column `AJCC Stage`:')
print('--------------------------------------------')
print(df['AJCC Stage'].value_counts())

Different values in the column `AJCC Stage`:
--------------------------------------------
II     497
III    309
I      199
Name: AJCC Stage, dtype: int64


Convert the `I`, `II` and `III` values in the `AJCC Stage` to numerical values first.

In [19]:
map_ajcc_stage = {'I' : 1, 'II' : 2, 'III' : 3}
df_n['AJCC Stage'] = df_n['AJCC Stage'].map(map_ajcc_stage)

In [20]:
print('Different values in the column `AJCC Stage`:')
print('--------------------------------------------')
print(df_n['AJCC Stage'].value_counts())

Different values in the column `AJCC Stage`:
--------------------------------------------
2.0    497
3.0    309
1.0    199
Name: AJCC Stage, dtype: int64


#### 1./b. - 4. Fill every NaN entry in features except for the column `AJCC Stage`

In [21]:
nan_counts = df_n.isna().sum()
nan_columns = [key for key in nan_counts.index if nan_counts[key] > 0]
# Exclude the column `AJCC Stage`
nan_columns.remove('AJCC Stage')

In [22]:
df_n[nan_columns]

Unnamed: 0,AXL (pg/ml),CD44 (ng/ml),G-CSF (pg/ml),Kallikrein-6 (pg/ml),Mesothelin (ng/ml),Midkine (pg/ml),PAR (pg/ml),sEGFR (pg/ml),sFas (pg/ml),sHER2/sEGFR2/sErbB2 (pg/ml),sPECAM-1 (pg/ml),Thrombospondin-2 (pg/ml)
0,3621.04,9.81,*131.46,5938.28,14.29,315.23,8852.96,3284.17,*204.792,6832.07,9368.53,21863.7
1,2772.96,27.57,*131.46,3409.18,32.57,260.56,20782.6,1911.81,*204.792,5549.47,6224.55,29669.7
2,4120.77,14.59,*131.46,3338.6,15.09,491.81,7534.43,1743.94,*204.792,3698.16,4046.48,6020.47
3,2029.96,7.78,152.24,3162.89,16.52,230.45,4722.42,1059.24,*204.792,5856,6121.93,4331.02
4,2069.17,12.21,*131.46,4442.46,8.81,238.47,6945.9,1736.92,*204.792,5447.93,6982.32,2311.91
...,...,...,...,...,...,...,...,...,...,...,...,...
1812,2096.76,14.92,133.08,5376.57,17.71,679.06,16717.4,2542.26,*207.24,5390.31,8538.58,*599.4
1813,852.37,12.32,*32.802,6774.89,21.95,524.17,2656.02,1670.22,*207.24,7951.03,12966.19,*599.4
1814,1044.45,8.26,*32.802,7294.52,37.91,467.4,7127.74,1194.03,*207.24,2396.36,1901.41,*599.4
1815,1445.69,16.53,104,6212.68,8.26,916.6,8954.41,1607.16,*207.24,3079.81,5312.90,6864.33


In [92]:
df_n[nan_columns].isna().sum()

AXL (pg/ml)                    6
CD44 (ng/ml)                   6
G-CSF (pg/ml)                  7
Kallikrein-6 (pg/ml)           6
Mesothelin (ng/ml)             6
Midkine (pg/ml)                6
PAR (pg/ml)                    6
sEGFR (pg/ml)                  6
sFas (pg/ml)                   1
sHER2/sEGFR2/sErbB2 (pg/ml)    6
sPECAM-1 (pg/ml)               6
Thrombospondin-2 (pg/ml)       6
dtype: int64

In [93]:
df_n[nan_columns].fillna(df_n.mean()).isna().sum()

AXL (pg/ml)                    0
CD44 (ng/ml)                   6
G-CSF (pg/ml)                  7
Kallikrein-6 (pg/ml)           6
Mesothelin (ng/ml)             6
Midkine (pg/ml)                6
PAR (pg/ml)                    6
sEGFR (pg/ml)                  6
sFas (pg/ml)                   1
sHER2/sEGFR2/sErbB2 (pg/ml)    6
sPECAM-1 (pg/ml)               0
Thrombospondin-2 (pg/ml)       6
dtype: int64

In [25]:
df_n[nan_columns] = df_n[nan_columns].fillna(df_n.mean())
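
The counts printed above suggest that only the columns already stored as floats were filled. A likely explanation is that columns still holding starred strings have `object` dtype, so no mean is available for them. The following minimal sketch on toy data converts everything to float first and only then fills with the column means; it is an illustration under that assumption, not the cell that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],        # already numeric
    'B': ['*2.0', np.nan, 4.0],     # mixed strings and floats -> object dtype
})

# Convert every column to float first (strip the detection-limit markers)
numeric = toy.apply(
    lambda col: pd.to_numeric(col.astype(str).str.lstrip('*'), errors='coerce')
)

# Now every column has a mean, so every NaN can be filled
filled = numeric.fillna(numeric.mean())
```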

In [26]:
df_n.isna().sum()

Patient ID #                              0
Sample ID #                               0
Tumor type                                0
AJCC Stage                              812
AFP (pg/ml)                               0
Angiopoietin-2 (pg/ml)                    0
AXL (pg/ml)                               0
CA-125 (U/ml)                             0
CA 15-3 (U/ml)                            0
CA19-9 (U/ml)                             0
CD44 (ng/ml)                              6
CEA (pg/ml)                               0
CYFRA 21-1 (pg/ml)                        0
DKK1 (ng/ml)                              0
Endoglin (pg/ml)                          0
FGF2 (pg/ml)                              0
Follistatin (pg/ml)                       0
Galectin-3 (ng/ml)                        0
G-CSF (pg/ml)                             7
GDF15 (ng/ml)                             0
HE4 (pg/ml)                               0
HGF (pg/ml)                               0
IL-6 (pg/ml)                    

In [12]:
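import shap  # needed for the SHAP feature impact analysis below

from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split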

In [13]:
def fit_model(df):
    
    # The target variable is 'CancerSEEK Test Result'; encode it as 0/1
    # so that the regressor can handle it as well
    Y = df['CancerSEEK Test Result']
    if not pd.api.types.is_numeric_dtype(Y):
        Y = (Y == 'Positive').astype(int)
    # Features: everything between the first two columns and the target
    X = df[df.columns[2:-1]]
    # Split the data into train and test data
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

    # Build the model with Logistic Regression and Random Forest Regression algorithms
    model_log = LogisticRegression(penalty='l2', random_state=0)
    model_log.fit(X_train, Y_train)
    model_rfr = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)
    model_rfr.fit(X_train, Y_train)
    
    return model_log, model_rfr, X_train, X_test, Y_train, Y_test

#### Filling methods for the column `AJCC Stage`

Since `AJCC Stage` values are discrete, we can't simply fill missing values with the mean of the existing ones, so different methods need to be considered.
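
A few candidate strategies are sketched below on toy data. These are illustrative options rather than the method settled on in this notebook; the stage `0` for samples without a stage and the `stage_missing` indicator are assumptions introduced here. In the real table the number of missing stages (812) equals the number of `Normal` samples, which hints, but does not prove, that the stage is missing exactly for the healthy controls.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Tumor type': ['Normal', 'Liver', 'Normal', 'Ovary', 'Lung'],
    'AJCC Stage': [np.nan, 2.0, np.nan, 3.0, 2.0],
})
stage = toy['AJCC Stage']

# Option 1: a dedicated value outside the I-III range (assumed: 0 = no stage)
fill_zero = stage.fillna(0)

# Option 2: the most frequent observed stage
fill_mode = stage.fillna(stage.mode().iloc[0])

# Option 3: keep a flag so a model can tell filled from observed values
stage_missing = stage.isna().astype(int)
```

Each variant can then be passed through the same model fit to see how much the choice changes the feature impacts.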

In [None]:
model_log, model_rfr, X_train, X_test, Y_train, Y_test = fit_model(df_n)

In [None]:
# TreeExplainer needs the tree-based model, i.e. the Random Forest
shap_values = shap.TreeExplainer(model_rfr).shap_values(X_train)

## 2. Predict if a sample is cancerous or not
 - you need to build a classifier that predicts the probability of a sample coming from a cancerous person (tumor type is normal or not) based on the measured protein levels
 - train a logistic regression (sklearn API) on every second sample (not the first 50% of the data (!), use every second line)
 - generate prediction for the samples that were not used during the training
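
The every-second-row split described above can be sketched as follows. The data is synthetic, the column names `protein_a`, `protein_b` and `is_cancer` are hypothetical, and the scaling step is an assumption added here because the real protein levels span several orders of magnitude.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the protein table (hypothetical column names)
rng = np.random.default_rng(0)
n = 200
y = pd.Series(rng.integers(0, 2, size=n), name='is_cancer')
X = pd.DataFrame({
    'protein_a': rng.normal(loc=y.to_numpy() * 2.0, scale=1.0),
    'protein_b': rng.normal(size=n),
})

# Even positions for training, odd positions for testing
X_train, y_train = X.iloc[::2], y.iloc[::2]
X_test, y_test = X.iloc[1::2], y.iloc[1::2]

# Standardise the features, then fit the logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Probability of the positive (cancer) class for the held-out samples
proba = model.predict_proba(X_test)[:, 1]
```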

## 3. Comparison to CancerSEEK
 - plot the ROC curve and calculate the confusion matrix for the predictions
 - do the same for the CancerSEEK predictions
 - compare your model's performance to CancerSEEK performance
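
A minimal sketch of the two evaluation tools, using `sklearn.metrics` on toy labels and scores (these numbers are not results from this dataset). The 0.5 cut-off for the confusion matrix is an assumption; any other threshold could be used.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Toy labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve: false/true positive rates over the score thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Confusion matrix at a fixed 0.5 cut-off;
# rows are true labels, columns are predicted labels
y_pred = (y_score >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)
```

The same calls should apply to the CancerSEEK columns, with `CancerSEEK Logistic Regression Score` as the score and `CancerSEEK Test Result` as the thresholded prediction.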

## 4. Hepatocellular carcinoma
 - fit a logistic regression (using statsmodels API this time) to predict if a sample has Hepatocellular carcinoma (liver cancer) or not. You need to keep only the liver and the normal samples for this exercise! For fitting use only the first 25 features and all the rows (which are liver or normal)
 - select the 5 best predictors based on P values.
 - Write down the most important features (based on P value) and compare them to the tumor markers that you find on Wikipedia https://en.wikipedia.org/wiki/Hepatocellular_carcinoma or other sources!

## 5. Multiclass classification
 - Again, using every second datapoint train a logistic regression (sklearn API) to predict the tumor type. It is a multiclass classification problem.
 - Generate prediction for the rest of the dataset and show the confusion matrix for the predictions!
 - Plot the ROC curves for the different cancer types on the same plot! 
 - Interpret your results. Which cancer type can be predicted the most reliably?

### Hints:
 - In total you can get 10 points for fully completing all tasks.
 - Decorate your notebook with questions, explanations etc.; make it self-contained and understandable!
 - Comment your code when necessary
 - Write functions for repetitive tasks!
 - Use the pandas package for data loading and handling
 - Use matplotlib and seaborn for plotting or bokeh and plotly for interactive investigation
 - Use the scikit learn package for almost everything
 - Use for loops only if it is really necessary!
 - Code sharing is not allowed between students! Sharing code will result in zero points.
 - If you use code found on the web, it is OK, but make its source clear!