<a href="https://colab.research.google.com/github/i-moes/TM10007_PROJECT/blob/master/assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TM10007 Assignment Group 7

---



In [0]:
# Run this to use from colab environment
!pip install -q --upgrade git+https://github.com/i-moes/TM10007_PROJECT.git

# Install packages
!pip install scikit-learn numpy pandas matplotlib seaborn

In [0]:
# General packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets as ds
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
 
# Metrics
#from sklearn.metrics import confusion_matrix
#from sklearn.metrics import mean_absolute_error
# from sklearn.metrics import r2_score

# Classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

pd.options.mode.chained_assignment = None  # default='warn'

## Section 1: Data loading and Splitting


In this section the data is loaded and split; preprocessing follows in Section 2.

The data is split into a train and a test set using train_test_split from sklearn.model_selection. The test set contains 45% of the samples.

In [85]:
## Load Data

from brats.load_data import load_data
data = load_data()

print(f'The number of samples: {len(data.index)}')
print(f'The number of columns: {len(data.columns)}')

## Split Data

# Extract labels from dataframe
labels = data['label']
# Drop column containing patient labels for imputation
data = data.drop(columns=['label'])

# Split data into train and test set
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size = 0.45)
print(f'length of x_train: {len(x_train.index)}')
print(f'length of x_test: {len(x_test.index)}')
print(f'total is: {len(x_train.index) + len(x_test.index)}')


The number of samples: 167
The number of columns: 725
length of x_train: 91
length of x_test: 76
total is: 167
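
The split above draws a different random partition on every run. A minimal sketch of a reproducible, class-balanced variant is given below, assuming a fixed `random_state` and stratification on the labels are acceptable for this project. The toy frame `demo_data` and the 0/1 labels in `demo_labels` are hypothetical stand-ins for the real feature table.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the loaded feature table: 167 samples, 5 features
rng = np.random.default_rng(0)
demo_data = pd.DataFrame(rng.normal(size=(167, 5)),
                         columns=[f'f{i}' for i in range(5)])
demo_labels = pd.Series(rng.integers(0, 2, size=167), name='label')

# random_state makes the split repeatable; stratify keeps the class
# proportions (almost) equal in the train and test set
x_tr, x_te, y_tr, y_te = train_test_split(
    demo_data, demo_labels,
    test_size=0.45, random_state=42, stratify=demo_labels)

print(f'train: {len(x_tr)}, test: {len(x_te)}')
print(f'positive fraction train: {y_tr.mean():.2f}, test: {y_te.mean():.2f}')
```

With 167 samples and `test_size=0.45` this yields the same 91/76 partition sizes as reported above.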


## Section 2: Preprocessing

In this section ... 

In [86]:
# Cleaning data from NaN, zero, #DIV/0! errors and inf values
for x_set in [x_train, x_test]:
  # Replace all zero values with NaN
  x_set.replace(0, np.nan, inplace=True)
  # Replace all zero division errors with NaN
  x_set.replace('#DIV/0!', np.nan, inplace=True)
  # Replace all inf values with NaN
  x_set.replace(np.inf, np.nan, inplace=True)
  # Remove a column when more than 5% of its values are NaN
  x_set.dropna(axis = 1, thresh=0.95*len(x_set.index), inplace=True)

print(f'The number of samples in train set: {len(x_train.index)}')
print(f'The number of samples in test set: {len(x_test.index)}')
print(f'The number of columns in train set: {len(x_train.columns)}')
print(f'The number of columns in test set: {len(x_test.columns)}')

# Impute for NaN values
data_imp_train = x_train
data_imp_test = x_test
imputor = KNNImputer(n_neighbors=5, weights='distance')
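# KNNImputer replaces each NaN with the distance-weighted mean of that
# feature over the 5 nearest samples

array_imp_train = imputor.fit_transform(data_imp_train)
# Note: this refits the imputer on the test set, so test samples are
# imputed from test neighbours rather than from train neighbours
array_imp_test = imputor.fit_transform(data_imp_test)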
data_imp_train[:] = array_imp_train
data_imp_test[:] = array_imp_test

print(f'The number of samples in train set imp: {len(data_imp_train.index)}')
print(f'The number of samples in test set imp: {len(data_imp_test.index)}')
print(f'The number of columns in train set imp: {len(data_imp_train.columns)}')
print(f'The number of columns in test set imp: {len(data_imp_test.columns)}')


The number of samples in train set: 91
The number of samples in test set: 76
The number of columns in train set: 458
The number of columns in test set: 452
The number of samples in train set imp: 91
The number of samples in test set imp: 76
The number of columns in train set imp: 458
The number of columns in test set imp: 452
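
The output above shows that the train and test set end up with a different number of columns (458 versus 452), because the NaN threshold is applied to each set separately. A minimal sketch of one way to keep both sets aligned is given below: the columns to keep are chosen on the train set only, and the imputer is fitted on the train set and then only applied to the test set. The toy frames `train` and `test` and the threshold `max_nan_fraction` are hypothetical and chosen for illustration; the notebook itself uses a 5% threshold and 5 neighbours.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frames standing in for x_train and x_test
train = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0],
                      'b': [np.nan, np.nan, np.nan, 1.0],
                      'c': [0.5, 0.1, 0.2, 0.3]})
test = pd.DataFrame({'a': [np.nan, 3.0],
                     'b': [2.0, np.nan],
                     'c': [0.4, np.nan]})

# Select the columns to keep using the train set only
max_nan_fraction = 0.5  # illustrative value, not the notebook's 5%
keep = train.columns[train.isna().mean() <= max_nan_fraction]
train_kept, test_kept = train[keep], test[keep]

# Fit the imputer on the train set, then apply it unchanged to the test set
imputer = KNNImputer(n_neighbors=2, weights='distance')
train_imp = pd.DataFrame(imputer.fit_transform(train_kept),
                         columns=keep, index=train.index)
test_imp = pd.DataFrame(imputer.transform(test_kept),
                        columns=keep, index=test.index)

print(list(train_imp.columns), list(test_imp.columns))
```

Because both frames share the same columns, a PCA or classifier fitted on the train set can be applied to the test set without a shape mismatch.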


## Section 3: Feature Extraction

Features are extracted from the train set with  ... 

In [0]:
# Feature extraction
import sys
import os
import seaborn as sns
import pandas as pd
from sklearn.decomposition import PCA

def train_pca(point_data_train, point_data_test, components=4):
    '''
    Fit a principal component analysis (PCA) on the training data and
    use it to project both the training and the test data onto the
    requested number of principal components.
    Returns the transformed training and test data.

    Parameters
    ----------

    point_data_train : array-like, shape (n_samples, n_features)
        Training data, where n_samples is the number of samples
        and n_features is the number of features.

    point_data_test : array-like, shape (n_samples, n_features)
        Testing data, where n_samples is the number of samples
        and n_features is the number of features.

    components : integer
        Number of principal components to keep.
        Default value is 4 principal components.

    Returns
    -------

    point_data_train_trans : array-like, shape (n_samples, n_components)
        Transformed training data, where n_samples is the number of samples
        and n_components is the number of principal components.

    point_data_test_trans : array-like, shape (n_samples, n_components)
        Transformed test data, where n_samples is the number of samples
        and n_components is the number of principal components.

    '''

    try:
        # Create a PCA which retains the requested number of principal components
        pca = PCA(n_components=components)

        # Fit the PCA model
        pca.fit(point_data_train)

        # Transform data
        point_data_train_trans = pca.transform(point_data_train)
        point_data_test_trans = pca.transform(point_data_test)

        return point_data_train_trans, point_data_test_trans
    
    except ValueError:
        print('Not enough subjects per set to fit the requested amount of components in PCA.')
        sys.exit()

#x_train_trans, x_test_trans = train_pca(data_imp_train, data_imp_test, components=4)

#PCA
pca = PCA(n_components=4)
pca.fit(data_imp_train)
data_train_trans = pca.transform(data_imp_train)
#data_test_trans = pca.transform(data_imp_test)

print(f' length of x_train after pca (index): {len(data_train_trans)}')
print(data_train_trans)
#print(f' length of x_test after pca (index): {len(x_test_trans)}')
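
PCA is sensitive to the scale of the input features, and the cell above applies it to the imputed data directly. The sketch below shows one possible variant that standardises the features first and reports how much variance the four components retain. The scaling step is an assumption and not part of the notebook; `demo_train` and `demo_test` are hypothetical random arrays with the same sample counts as the real sets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 20 features on very different scales
rng = np.random.default_rng(0)
scales = rng.uniform(1, 100, size=20)
demo_train = rng.normal(size=(91, 20)) * scales
demo_test = rng.normal(size=(76, 20)) * scales

# Scaler and PCA are both fitted on the train set only
reducer = make_pipeline(StandardScaler(), PCA(n_components=4))
train_trans = reducer.fit_transform(demo_train)
test_trans = reducer.transform(demo_test)

# Fraction of the total variance retained per component
explained = reducer.named_steps['pca'].explained_variance_ratio_
print(train_trans.shape, test_trans.shape)
print(f'cumulative explained variance: {explained.cumsum()}')
```

Inspecting the cumulative explained variance is one way to judge whether 4 components are enough for the real data.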



## Section 4: Training Classifiers

In this section ... are trained on the train set using sklearn. 

In [0]:
# Classifiers
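
The classifier cell is still empty. As a minimal sketch, assuming default hyperparameters and using only classifiers that are imported at the top of this notebook, training could look like the code below. The arrays `X_demo` and `y_demo` are hypothetical synthetic data standing in for the PCA-transformed train set and its labels.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the 4 principal components of 91 train samples
X_demo, y_demo = make_classification(
    n_samples=91, n_features=4, n_informative=3, n_redundant=1,
    random_state=0)

# Default settings; hyperparameter tuning is not covered in this sketch
classifiers = {
    'GaussianNB': GaussianNB(),
    'LDA': LinearDiscriminantAnalysis(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'kNN': KNeighborsClassifier(n_neighbors=5),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
}

# fit() returns the fitted estimator itself
fitted = {name: clf.fit(X_demo, y_demo) for name, clf in classifiers.items()}

for name, clf in fitted.items():
    print(f'{name}: train accuracy = {clf.score(X_demo, y_demo):.2f}')
```

Train accuracy only shows that the models fit; it says nothing about generalisation, which is what the test set is for.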

## Section 5: Performance Evaluation

The classifiers are tested on the test set. The following performance measures are computed:


*   Accuracy
*   F1-Score
*   AUC



In [0]:
# Metrics 
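
The metrics cell is still empty. A minimal sketch of how the three measures listed above could be computed with sklearn.metrics is given below, assuming a binary classification problem. The synthetic data and the single logistic regression model are hypothetical placeholders for the real test set and trained classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical binary data with the same size and split ratio as the notebook
X_demo, y_demo = make_classification(
    n_samples=167, n_features=4, n_informative=3, n_redundant=1,
    random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.45, random_state=0, stratify=y_demo)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy and F1 use hard predictions; AUC uses the predicted
# probability of the positive class
y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]

scores = {
    'accuracy': accuracy_score(y_te, y_pred),
    'f1': f1_score(y_te, y_pred),
    'auc': roc_auc_score(y_te, y_score),
}
for name, value in scores.items():
    print(f'{name}: {value:.2f}')
```

Classifiers without predict_proba, such as SGDClassifier with its default hinge loss, would need decision_function scores for the AUC instead.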