# Bioinformatics Project

## Introduction

Cancer classification based on molecular-level investigation has gained the interest of researchers, as it provides a systematic, accurate and objective diagnosis for different cancer types. In this project, we introduce an approach for classifying two different kinds of leukaemia based on gene expression profiles. To perform this classification, we applied seven machine learning (ML) algorithms, hoping to obtain an accurate model that could be used to predict the type of cancer of any patient from their expression profile. To this end, we compared the classification models on the basis of the area under the ROC curve (AUROC) obtained for each one. We considered that an AUROC over 0.95 would indicate an acceptable model.


## Preliminary data analysis

The data available for building the prediction models consist of two dataframes. The first one, already labelled as training data, contains observations of 38 patients, in whom the expression of 7129 genes was measured in order to identify any characteristic expression pattern for the differential diagnosis of two leukaemia conditions: acute myeloid leukaemia (AML) and acute lymphoblastic leukaemia (ALL). The test data (on which the classification models are going to be evaluated) consist of the expression levels of the same genes for 34 different patients. The actual diagnosis of both groups is also provided.

## Data pre-processing
Due to the large size of the data, an initial pre-processing step is needed to minimize the number of features used by the ML tools.
Since the expression levels of many genes have been analysed, it is important to determine which of them are correlated with the differential diagnosis of the diseases (i.e. which of them show a significant change in expression depending on whether the patient has been diagnosed with ALL or AML). To do that, it is useful to perform a feature selection.





In [1]:
#Importing all the necessary libraries for the ML analysis
import numpy as np
import sklearn
from sklearn import metrics
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_curve, auc
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression


In [2]:
#Reading the files:

#These paths might change according to the locations of the files within one's computer
ruta_train='C:\\Users\\Maria\\OneDrive\\Master\\Bioinformatica\\Trabajo\\Originales\\data_set_ALL_AML_train.csv'
ruta_y='C:\\Users\\Maria\\OneDrive\\Master\\Bioinformatica\\Trabajo\\Originales\\actual.csv'
ruta_test='C:\\Users\\Maria\\OneDrive\\Master\\Bioinformatica\\Trabajo\\Originales\\data_set_ALL_AML_independent.csv'
train=pd.read_csv(ruta_train)
test=pd.read_csv(ruta_test)
y=pd.read_csv(ruta_y)


FileNotFoundError: File b'C:\\Users\\Maria\\OneDrive\\Master\\Bioinformatica\\Trabajo\\Originales\\data_set_ALL_AML_train.csv' does not exist

Features are initially placed in the rows of the dataframe, while patients are in the columns. Thus, it is necessary to transpose the data. In addition, there are some columns named 'call', interleaved with the patient columns, whose data are not relevant for our model. These columns were also removed. Lastly, the features were renamed with their gene accession numbers and the expression measurements were converted to numeric data.

In [3]:
#Transposing the data
train=train.T
test=test.T


#Removing call data
for fila in train.index:
    if 'call' in fila:
        train=train.drop(fila)

for fila in test.index:
    if 'call' in fila:
        test=test.drop(fila)

#Columns are labelled with the gene accession number

columnastrain=train.loc['Gene Accession Number']
train=train[2:]
train.columns=columnastrain

columnastest=test.loc['Gene Accession Number']
test=test[2:]
test.columns=columnastest

#Converting into numeric
train=train.astype('float')
test=test.astype('float')
train.index=train.index.astype(int)
test.index=test.index.astype(int)
train=train.sort_index()
test=test.sort_index()

NameError: name 'train' is not defined
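
The reshaping described above can also be sketched on a toy frame. The column layout mirrors the one described in the text, but the values and the names `raw_demo`, `expr` and `tidy` are made up for illustration; this is one possible idiom under those assumptions, not the notebook's own code.

```python
import pandas as pd

# Toy stand-in for the raw file (hypothetical values): genes in rows,
# patients in columns, with a 'call' column after every patient.
raw_demo = pd.DataFrame({
    "Gene Description": ["desc A", "desc B"],
    "Gene Accession Number": ["G1_at", "G2_at"],
    "2": [7, 3],
    "call": ["A", "P"],
    "1": [10, -5],
    "call.1": ["A", "A"],
})

# Use the accession number as the gene label and drop the free-text description
expr = raw_demo.set_index("Gene Accession Number").drop(columns="Gene Description")

# Keep only the patient columns (those whose name does not contain 'call')
expr = expr.loc[:, ~expr.columns.str.contains("call")]

# Transpose so that patients are rows, then make values and patient ids numeric
tidy = expr.T.astype(float)
tidy.index = tidy.index.astype(int)
tidy = tidy.sort_index()

print(tidy)
```

Filtering the columns with a boolean mask avoids rebuilding the frame once per dropped row, which is what a loop over `drop` does.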

The response is coded as a categorical variable with 1 coding ALL and 0 coding AML. This response is split into the diagnosis of test and training patients.

In [4]:
y.index=y['patient']
y['cancer']=np.where(y.cancer=='ALL', 1, 0) 
y.groupby('cancer').size() 
ytrain=y[:len(train)]
ytest=y[len(train):]

Once the data have the appropriate structure, it is time to perform the aforementioned feature selection. Two algorithms were combined for this purpose. In the first one, a univariate selection was performed, so each feature was individually kept or discarded for the final analysis. To do that, the F-score (ANOVA F-value, `f_classif`) was computed and the 100 best attributes were selected.
In addition, a wrapper selection method was used. This type of method treats feature selection as a search problem (typical of artificial intelligence tools), in which different combinations of features are evaluated and compared. A score is computed for each of these combinations, and a search algorithm selects the best ones. In this case, the selection algorithm was recursive feature elimination (RFE).

In [5]:
#Univariate feature selection (F-score)
k = 100  
columnas = list(train.columns.values)
seleccionadas = SelectKBest(f_classif, k=k).fit(train, ytrain['cancer'])
atrib = seleccionadas.get_support()
atributos1 = [columnas[i] for i in list(atrib.nonzero()[0])]

#Recursive feature elimination
modelo = ExtraTreesClassifier()
era = RFE(modelo, n_features_to_select=100)  # number of features to select
era = era.fit(train, ytrain['cancer'])
atrib2 = era.support_
atributos2 = [columnas[i] for i in list(atrib2.nonzero()[0])]

#The selected attributes are combined into a single list
print('The selected features are: ')
atribselec=list(set(atributos1)|set(atributos2))
print(atribselec)

trainred=pd.DataFrame()
for i in atribselec:
    trainred[i]=train[i]

testred=pd.DataFrame()
for i in atribselec:
    testred[i]=test[i]

The selected features are: 
['X71345_f_at', 'L38608_at', 'Z47556_rna2_at', 'U12471_cds1_at', 'K03189_f_at', 'M28585_f_at', 'J03801_f_at', 'U19765_at', 'X67698_at', 'M33317_f_at', 'M63138_at', 'HG4535-HT4940_s_at', 'M27749_r_at', 'X98296_at', 'X72475_at', 'J04027_at', 'Z14978_at', 'X06985_at', 'U82279_at', 'M95178_at', 'HG4490-HT4876_f_at', 'Z83336_at', 'U25975_at', 'L08246_at', 'U61836_at', 'M11147_at', 'M27318_f_at', 'M21551_rna1_at', 'Z80781_at', 'L05188_f_at', 'M62762_at', 'L42583_f_at', 'M92269_f_at', 'L42611_f_at', 'Y00339_s_at', 'M32304_s_at', 'U10690_f_at', 'HG458-HT458_f_at', 'U41767_s_at', 'Z68274_at', 'V00551_f_at', 'L42379_at', 'AFFX-HUMTFRR/M11507_M_at', 'U07132_at', 'Y00787_s_at', 'U84388_at', 'X85116_rna1_s_at', 'X69654_at', 'X60487_at', 'Z48501_s_at', 'U28015_at', 'M81933_at', 'X05345_at', 'Z32765_at', 'X58399_at', 'X06825_at', 'M95678_at', 'D10495_at', 'Z00010_at', 'M20030_f_at', 'HG4236-HT4506_f_at', 'U83117_at', 'X00540_at', 'D14874_at', 'HG67-HT67_f_at', 'Z80779_at',
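
As a hedged illustration of how the two selectors can be combined, the sketch below runs a filter (`SelectKBest`) and a wrapper (`RFE`) on synthetic data and keeps the union of both selections. The data set, the `gene_i` column names, the reduced sizes (5 features per selector) and the `step=5` shortcut are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

# Synthetic stand-in for the expression matrix (hypothetical data)
X_arr, y_demo = make_classification(n_samples=60, n_features=30,
                                    n_informative=5, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"gene_{i}" for i in range(30)])

# Filter method: score every feature on its own with the ANOVA F-value
filter_sel = SelectKBest(f_classif, k=5).fit(X_demo, y_demo)
filter_genes = set(X_demo.columns[filter_sel.get_support()])

# Wrapper method: repeatedly refit the model and discard the weakest features
wrapper_sel = RFE(ExtraTreesClassifier(n_estimators=50, random_state=0),
                  n_features_to_select=5, step=5).fit(X_demo, y_demo)
wrapper_genes = set(X_demo.columns[wrapper_sel.support_])

# Union of both selections: between 5 and 10 features here
selected = sorted(filter_genes | wrapper_genes)
X_reduced = X_demo[selected]

print(len(filter_genes), len(wrapper_genes), len(selected))
```

Indexing the frame with the list of selected names builds the reduced matrix in one step instead of copying one column at a time.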

Since 100 features were selected with each method, a maximum of 200 features were considered relevant for our classification models. This may still be too many, so a principal component analysis (PCA) was performed. PCA is a statistical tool that describes complex data in terms of new, uncorrelated variables. In this case, enough components are retained to preserve 95% of the variance of the original variables. Since PCA is affected by the scale of the data, the data should be standardized before the components are computed.

In [6]:
scaler=StandardScaler()
scaler.fit(trainred)
train=scaler.transform(trainred)
test=scaler.transform(testred)

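pca=PCA(.95)  # keep enough components to explain 95% of the variance
# Note: PCA is fitted on and applied to the unscaled trainred/testred, so the
# standardized arrays computed above are overwritten here; the scaled train/test
# arrays would have to be passed instead for the standardization to take effect.
pca.fit(trainred)

train=pca.transform(trainred)
test=pca.transform(testred)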

print("TRAIN")
print(pd.DataFrame(train))
print("\n"+"-"*50+"\n")
print("TEST")
print(pd.DataFrame(test))

TRAIN
              0             1             2             3             4   \
0   -9719.946506   1897.733260  -5468.825853   -421.292636  -1670.279297   
1    3603.085601  -6334.906277  -1478.006496    -30.331072   2567.422317   
2   -9537.200508   -271.380681    156.073326  -2195.337254  -2217.662520   
3   -9588.668311   -218.602568  -1641.461557  -1002.840966  -1168.085611   
4   -6717.558950   -655.595651   4759.141317  -1953.901653   3077.684813   
5   -9190.667777  -1522.704311   3131.723675  -3645.631561  -1593.100699   
6  -10277.453758   1917.860279    142.098627    368.724046   -823.685989   
7   -9443.945005     38.600113  -2316.092980  -1155.947159  -3300.690522   
8   -6357.738546   -217.718441  -4765.634738  -2423.467060  -5273.865116   
9  -10162.905881   -669.496816   3735.176109    397.963521   2757.394146   
10 -10337.789365  -2814.295249   1476.317759   -580.946699   -233.664550   
11   1358.090902     28.007157    104.748090   4414.805601     -7.545819   
12  -9
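
One way to make sure PCA receives the standardized values is to chain both steps in a pipeline. The sketch below assumes synthetic data with features on very different scales; `X_train_demo`, `X_test_demo` and `reducer` are hypothetical names, and the shapes only imitate the 38/34 patient split.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the reduced expression matrices (hypothetical data):
# 20 features whose scales range from 1 to 10,000
rng = np.random.default_rng(0)
scales = np.logspace(0, 4, 20)
X_train_demo = rng.normal(size=(38, 20)) * scales
X_test_demo = rng.normal(size=(34, 20)) * scales

# The pipeline standardizes first and passes the scaled values on to PCA,
# which keeps enough components to explain 95% of the variance
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
train_pcs = reducer.fit_transform(X_train_demo)  # fitted on the training data only
test_pcs = reducer.transform(X_test_demo)        # reuses the fitted scaler and PCA

explained = reducer.named_steps["pca"].explained_variance_ratio_.sum()
print(train_pcs.shape, test_pcs.shape, round(explained, 3))
```

Because the scaler and the PCA are fitted together on the training data, the test set is projected with exactly the same transformation and no information from it leaks into the fit.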

## Machine learning algorithms to compute classifiers

At this point, we are able to apply the machine learning algorithms in order to build a classification model for the leukaemia diagnosis.


### Linear discriminant analysis

This model is a generalization of Fisher's linear discriminant, and it is used in machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination is used as a linear classifier.


In [7]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()  # Creating the Linear Discriminant Analysis
lda.fit(train, ytrain['cancer']) # Fitting the model
ypred_t_lda=lda.predict(train)
ypred_lda=lda.predict(test)

print("CONFUSION MATRIX")
print("\nTRAIN\n")
print(pd.crosstab(ytrain['cancer'],ypred_t_lda, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))
print("\nTEST\n")
print(pd.crosstab(ytest['cancer'],ypred_lda, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))

print("\nCLASSIFICATION REPORT\n")
print(classification_report(ytest['cancer'], ypred_lda))
false_positive_rate, true_positive_rate, thresholds = roc_curve(ytest['cancer'], ypred_lda)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("\nAUROC\n")
print(roc_auc)

CONFUSION MATRIX

TEST

Predicted diagnosis   0   1
Actual diagnosis           
0                    13   1
1                     0  20

TRAIN

Predicted diagnosis   0   1
Actual diagnosis           
0                    11   0
1                     0  27

CLASSIFICATION REPORT

             precision    recall  f1-score   support

          0       1.00      0.93      0.96        14
          1       0.95      1.00      0.98        20

avg / total       0.97      0.97      0.97        34


AUROC

0.9642857142857143
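
Note that `roc_curve` above receives hard class labels, so the curve has a single interior point and the area equals the mean of the two per-class recalls, here (13/14 + 20/20) / 2 ≈ 0.964. A threshold-independent AUROC needs continuous scores such as those returned by `decision_function` or `predict_proba`. The sketch below contrasts both on synthetic data; the data set and all variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Synthetic two-class problem (hypothetical data), split in half
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, class_sep=0.8,
                                     random_state=0)
X_tr, X_te = X_demo[:100], X_demo[100:]
y_tr, y_te = y_demo[:100], y_demo[100:]

lda_demo = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
hard_labels = lda_demo.predict(X_te)        # 0/1 predictions
scores = lda_demo.decision_function(X_te)   # signed distance to the boundary

# With hard labels the "AUROC" collapses to the balanced accuracy
auc_from_labels = roc_auc_score(y_te, hard_labels)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

# With continuous scores every possible threshold contributes to the curve
auc_from_scores = roc_auc_score(y_te, scores)

print(auc_from_labels, balanced_acc, auc_from_scores)
```

The same pattern applies to the other classifiers in this notebook: pass `decision_function(test)` or `predict_proba(test)[:, 1]` to `roc_curve` instead of the predicted labels.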


### Naive Bayes Model

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. In this case, we implemented the Gaussian Naive Bayes classifier, a variant of the Naive Bayes algorithm that is used specifically when the features take continuous values.
The code then prints a classification report, which contains the statistics required to judge a model, and a confusion matrix, which gives a clear idea of the accuracy and fit of the model.

In [9]:
model = GaussianNB() # Creating the Naive Bayes model
model.fit(train, ytrain['cancer']) # Fitting the model

expected = ytrain['cancer']
ypred_t_by = model.predict(train)  # Making predictions(train)

# Getting Accuracy and Statistics (train)
print('\nTRAIN\n\n\tCONFUSION MATRIX')
print(pd.crosstab(expected,ypred_t_by, rownames=['Expected diagnosis'], colnames=['Predicted diagnosis']))
print('\n\tACCURACY')
print(accuracy_score(expected, ypred_t_by, normalize = True))

expected = ytest['cancer']
ypred_by = model.predict(test)  # Making predictions(test)

# Getting Accuracy and Statistics (test)
print('\n\nTEST\n\n\tCLASSIFICATION REPORT')
print(metrics.classification_report(expected, ypred_by))
print('\n\tCONFUSION MATRIX')
print(pd.crosstab(expected,ypred_by, rownames=['Expected diagnosis'], colnames=['Predicted diagnosis']))
print('\n\tACCURACY')
print(accuracy_score(expected, ypred_by, normalize = True))
false_positive_rate, true_positive_rate, thresholds = roc_curve(ytest['cancer'], ypred_by)
roc_auc_by = auc(false_positive_rate, true_positive_rate)
print("\nAUROC\n")
print(roc_auc_by)


TRAIN

	CONFUSION MATRIX
Predicted diagnosis   0   1
Expected diagnosis         
0                    11   0
1                     1  26

	ACCURACY
0.9736842105263158


TEST

	CLASSIFICATION REPORT
             precision    recall  f1-score   support

          0       0.76      0.93      0.84        14
          1       0.94      0.80      0.86        20

avg / total       0.87      0.85      0.85        34


	CONFUSION MATRIX
Predicted diagnosis   0   1
Expected diagnosis         
0                    13   1
1                     4  16

	ACCURACY
0.8529411764705882

AUROC

0.8642857142857143


The model fits the training data well: as the confusion matrix shows, the prediction was correct for 37 out of 38 patients. On the test data, however, the model makes fewer correct predictions, correctly diagnosing only 29 of 34 patients. The model is still fairly accurate, although its AUROC of 0.86 falls below the 0.95 threshold set in the introduction.
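
The test figures can be checked directly from the confusion matrix. This is a small arithmetic sketch: the matrix is copied from the test output above, and `cm` and the other names are introduced here only for illustration.

```python
import numpy as np

# Test confusion matrix reported above for Gaussian Naive Bayes
# rows: actual class (0 = AML, 1 = ALL), columns: predicted class
cm = np.array([[13, 1],
               [4, 16]])

correct = np.trace(cm)                            # 13 + 16 correctly diagnosed
accuracy = correct / cm.sum()                     # fraction of the 34 patients
recall_per_class = np.diag(cm) / cm.sum(axis=1)   # 13/14 and 16/20
balanced_accuracy = recall_per_class.mean()

print(correct)                      # 29
print(round(accuracy, 4))           # 0.8529
print(round(balanced_accuracy, 4))  # 0.8643
```

The balanced accuracy coincides with the AUROC printed above because `roc_curve` was given predicted labels rather than continuous scores.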

### Decision tree

The next algorithm implemented was the decision tree classifier, which builds a tree-shaped diagram representing a set of conditions that are assessed consecutively to solve a problem. In order to get the best results, the hyperparameters used to build the model were selected by cross-validation. The best hyperparameters were used to build the final decision tree, which was evaluated on the test sample by constructing the confusion matrix.



In [11]:
split_range=list(range(2,15))
prof_range=list(range(2, 10))

param_grid={'min_samples_split':split_range, 'max_depth':prof_range}
clf_gini = DecisionTreeClassifier(criterion = 'gini', random_state=100)
grid_dt=GridSearchCV(clf_gini, param_grid, scoring='accuracy')
grid_dt.fit(train, ytrain['cancer'])
print(grid_dt.best_score_)
print(grid_dt.best_params_)

mejor_clf_gini=grid_dt.best_estimator_


ypred=mejor_clf_gini.predict(test)
ypred_t=mejor_clf_gini.predict(train)

print("CONFUSION MATRIX")
print("\nTRAIN\n")
print(pd.crosstab(ytrain['cancer'],ypred_t, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))

print("\nTEST\n")
print(pd.crosstab(ytest['cancer'],ypred, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))



print("\nCLASSIFICATION REPORT\n")
print(classification_report(ytest['cancer'], ypred))
false_positive_rate, true_positive_rate, thresholds = roc_curve(ytest['cancer'], ypred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("\nAUROC\n")
print(roc_auc)

0.9736842105263158
{'min_samples_split': 2, 'max_depth': 2}
CONFUSION MATRIX

TRAIN

Predicted diagnosis   0   1
Actual diagnosis           
0                    11   0
1                     0  27

TEST

Predicted diagnosis   0   1
Actual diagnosis           
0                    12   2
1                     0  20

CLASSIFICATION REPORT

             precision    recall  f1-score   support

          0       1.00      0.86      0.92        14
          1       0.91      1.00      0.95        20

avg / total       0.95      0.94      0.94        34


AUROC

0.9285714285714286


The resulting decision tree is not a bad classifier, but it shows some overfitting. When we apply this model to the training set, its accuracy is perfect: every patient is correctly diagnosed. However, when the classifier is run on the test set, 2 patients are misdiagnosed.

### Random forest
Instead of using a single decision tree, this algorithm builds a whole forest of shallow trees. To obtain the classification result, it takes the individual result of each tree, and the resulting class is the most "voted" one.
In order to tackle the aforementioned overfitting, a cross-validated search for the best hyperparameters was also performed.
Due to the high number of hyperparameters to be tested, two types of parameter search are run. First, a random search is performed in order to get close to the best value of each parameter. Once we have an idea of this approximate value, we perform a grid search over a grid that spans a range around this approximation. The best parameters found are finally used to build the final random forest classifier.

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn releases)
max_features = ['sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf_clf=RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator = rf_clf, param_distributions = random_grid, n_iter = 10, cv = 3, verbose=2, random_state=100, n_jobs = -1)
rf_random.fit(train, ytrain['cancer'])



print(rf_random.best_params_)

param_grid = {
    'bootstrap': [True],
    'max_depth': [None],
    'max_features': ['sqrt'],  # same behaviour as the former 'auto' alias for classifiers
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 4, 6],
    'n_estimators': [500, 1000, 2000]
}

grid_rf=GridSearchCV(rf_clf, param_grid, scoring='accuracy', cv=3)
grid_rf.fit(train, ytrain['cancer'])
print(grid_rf.best_score_)
print(grid_rf.best_params_)

mejor_rf_clf=grid_rf.best_estimator_


ypred_rf=mejor_rf_clf.predict(test)
ypred_t_rf=mejor_rf_clf.predict(train)  # must be computed before the TRAIN crosstab uses it

print("CONFUSION MATRIX")
print("\nTRAIN\n")
print(pd.crosstab(ytrain['cancer'],ypred_t_rf, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))
print("\nTEST\n")
print(pd.crosstab(ytest['cancer'],ypred_rf, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))


print("\nCLASSIFICATION REPORT\n")
print(classification_report(ytest['cancer'], ypred_rf))
false_positive_rate, true_positive_rate, thresholds = roc_curve(ytest['cancer'], ypred_rf)
roc_auc_rf = auc(false_positive_rate, true_positive_rate)
print("\nAUROC\n")
print(roc_auc_rf)
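
The two-stage search can also be written so that the fine grid is derived from the random-search result instead of being typed in by hand. The following is a minimal sketch on synthetic data; the small parameter ranges, the `fine_grid` construction (best value plus or minus one) and all names are assumptions chosen to keep it quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic two-class problem (hypothetical data)
X_demo, y_demo = make_classification(n_samples=80, n_features=10, random_state=0)

# Stage 1: coarse random search over a wide range
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [20, 50, 100],
                         "min_samples_split": [2, 5, 10],
                         "min_samples_leaf": [1, 2, 4]},
    n_iter=5, cv=3, random_state=0,
).fit(X_demo, y_demo)
best = coarse.best_params_

# Stage 2: narrow grid centred on the values found in stage 1
split, leaf = best["min_samples_split"], best["min_samples_leaf"]
fine_grid = {
    "n_estimators": [best["n_estimators"]],
    "min_samples_split": sorted({max(2, split - 1), split, split + 1}),
    "min_samples_leaf": sorted({max(1, leaf - 1), leaf, leaf + 1}),
}
fine = GridSearchCV(RandomForestClassifier(random_state=0), fine_grid,
                    scoring="accuracy", cv=3).fit(X_demo, y_demo)

print(best)
print(fine.best_params_, fine.best_score_)
```

Since the fine grid always contains the random-search optimum, the second stage can only match or improve on the first one under the same cross-validation splits.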


### Support Vector Machine
This model is a supervised machine learning classification algorithm. It chooses the optimal decision boundary, which is the one with the maximum margin, that is, the greatest distance to the nearest data points of each class.

In [13]:
svclassifier = SVC(kernel='linear')
svclassifier.fit(train, ytrain['cancer'])
ypred_svm = svclassifier.predict(test)
ypred_t_svm=svclassifier.predict(train)

print("CONFUSION MATRIX")
print("\nTRAIN\n")
print(pd.crosstab(ytrain['cancer'],ypred_t_svm, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))
print("\nTEST\n")
print(pd.crosstab(ytest['cancer'],ypred_svm, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))


print("\nCLASSIFICATION REPORT\n")
print(classification_report(ytest['cancer'], ypred_svm))
false_positive_rate, true_positive_rate, thresholds = roc_curve(ytest['cancer'], ypred_svm)
roc_auc_svm = auc(false_positive_rate, true_positive_rate)
print("\nAUROC\n")
print(roc_auc_svm)

CONFUSION MATRIX

TEST

Predicted diagnosis   0   1
Actual diagnosis           
0                    13   1
1                     0  20

TRAIN

Predicted diagnosis   0   1
Actual diagnosis           
0                    11   0
1                     0  27

CLASSIFICATION REPORT

             precision    recall  f1-score   support

          0       1.00      0.93      0.96        14
          1       0.95      1.00      0.98        20

avg / total       0.97      0.97      0.97        34


AUROC

0.9642857142857143
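
To make the idea of a maximum margin concrete, the sketch below fits a linear SVM to six hand-made 2D points and recovers the margin width as 2 / ||w||. The points, the very large `C` (used to approximate a hard margin) and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical points)
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations
svm_demo = SVC(kernel="linear", C=1e6).fit(X_demo, y_demo)

w = svm_demo.coef_[0]                    # normal vector of the separating line
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print(svm_demo.support_vectors_)         # the points that define the margin
print(margin_width)
```

For these points the nearest members of each class lie on the lines x + y = 1 and x + y = 6, so the widest possible margin is 5 / sqrt(2), and the fitted value should be close to that.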


### Logistic regression
This model is a supervised machine learning classification algorithm used to predict the probability of a categorical dependent variable. A threshold is applied to the predicted probability to assign each observation to one of the classes. It is important to point out that this algorithm assumes that the response follows a Bernoulli distribution. The final results are shown in a confusion matrix, which evaluates the performance of the model.


In [15]:
from sklearn.linear_model import LogisticRegression
rlog = LogisticRegression() # Creating the model
rlog.fit(train, ytrain['cancer']) 
ypred_t_rlog = rlog.predict(train) 
ypred_rlog = rlog.predict(test)

print("CONFUSION MATRIX")
print("\nTRAIN\n")
print(pd.crosstab(ytrain['cancer'],ypred_t_rlog, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))
print("\nTEST\n")
print(pd.crosstab(ytest['cancer'],ypred_rlog, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))


print("\nCLASSIFICATION REPORT\n")
print(classification_report(ytest['cancer'], ypred_rlog))
false_positive_rate, true_positive_rate, thresholds = roc_curve(ytest['cancer'], ypred_rlog)
roc_auc_rlog = auc(false_positive_rate, true_positive_rate)
print("\nAUROC\n")
print(roc_auc_rlog)

CONFUSION MATRIX

TRAIN

Predicted diagnosis   0   1
Actual diagnosis           
0                    11   0
1                     0  27

TEST

Predicted diagnosis   0   1
Actual diagnosis           
0                    13   1
1                     0  20

CLASSIFICATION REPORT

             precision    recall  f1-score   support

          0       1.00      0.93      0.96        14
          1       0.95      1.00      0.98        20

avg / total       0.97      0.97      0.97        34


AUROC

0.9642857142857143
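
The role of the threshold can be illustrated by applying it by hand to the predicted probabilities. The sketch below uses synthetic data; the data set, the stricter 0.8 cut-off and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class problem (hypothetical data)
X_demo, y_demo = make_classification(n_samples=120, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=1)
logit_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Estimated probability of class 1 for every sample
proba_class1 = logit_demo.predict_proba(X_demo)[:, 1]

# predict() corresponds to a 0.5 threshold on that probability
default_labels = (proba_class1 > 0.5).astype(int)

# A stricter threshold assigns class 1 only to confident cases
strict_labels = (proba_class1 >= 0.8).astype(int)

print(np.array_equal(default_labels, logit_demo.predict(X_demo)))
print(default_labels.sum(), strict_labels.sum())
```

Raising the threshold can only reduce the number of samples assigned to class 1, trading recall for precision on that class.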


### Neural network 
This classification model is inspired by biological neurons. Each neuron receives an input and produces a signal (output) based on it, which is received by another neuron as a new input. Each neuron has an activation function, which determines whether the output computed for a given input will be sent on to another neuron or not. Due to the complexity of this algorithm, no parameter search was performed in this case.

In [17]:
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
mlp.fit(train, ytrain['cancer'])
ypred_nn=mlp.predict(test)

print("CONFUSION MATRIX")
print("\nTEST\n")
print(pd.crosstab(ytest['cancer'],ypred_nn, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))
ypred_t_nn=mlp.predict(train)
print("\nTRAIN\n")
print(pd.crosstab(ytrain['cancer'],ypred_t_nn, rownames=['Actual diagnosis'], colnames=['Predicted diagnosis']))

print("\nCLASSIFICATION REPORT\n")
print(classification_report(ytest['cancer'], ypred_nn))
false_positive_rate, true_positive_rate, thresholds = roc_curve(ytest['cancer'], ypred_nn)
roc_auc_nn = auc(false_positive_rate, true_positive_rate)
print("\nAUROC\n")
print(roc_auc_nn)


CONFUSION MATRIX

TEST

Predicted diagnosis   0   1
Actual diagnosis           
0                    10   4
1                     5  15

TRAIN

Predicted diagnosis  0   1
Actual diagnosis          
0                    8   3
1                    4  23

CLASSIFICATION REPORT

             precision    recall  f1-score   support

          0       0.67      0.71      0.69        14
          1       0.79      0.75      0.77        20

avg / total       0.74      0.74      0.74        34


AUROC

0.7321428571428571
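
The behaviour of a single neuron and its activation function can be written in a few lines of NumPy. This is a toy sketch with hand-picked weights, not the trained network: `neuron_output` and all numbers are hypothetical, and ReLU is used because it is the default activation of `MLPClassifier`.

```python
import numpy as np

def relu(z):
    """ReLU activation: passes positive signals on and blocks negative ones."""
    return np.maximum(0.0, z)

def neuron_output(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return activation(np.dot(weights, inputs) + bias)

x = np.array([0.5, -1.0, 2.0])  # input signal (hypothetical values)

# Hidden layer with two neurons, each with its own weights and bias
hidden = np.array([
    neuron_output(x, np.array([1.0, 0.0, 0.5]), bias=-0.5),  # weighted sum 1.0: passed on
    neuron_output(x, np.array([0.2, 1.0, 0.0]), bias=0.0),   # weighted sum -0.9: blocked
])

# The hidden outputs become the inputs of the next neuron
out = neuron_output(hidden, np.array([2.0, -1.0]), bias=0.1)

print(hidden, out)
```

Training a network amounts to adjusting these weights and biases so that the final output matches the known diagnosis as closely as possible.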


## Discussion and conclusion
Using different machine learning (ML) tools, we have obtained some models that could be useful for the differential diagnosis of these two types of cancer. The best results were achieved with the Linear Discriminant Analysis, Support Vector Machine and Logistic Regression models (AUROC of 0.96 each), which were run with default parameters. This leads us to the conclusion that sometimes the simplest approach is also the most convenient.
On the other hand, poor results were obtained with the neural network model. This could be due to the high complexity of this kind of model, in which many parameters must be tuned in order to get the best results.
Finally, a best-parameter search was performed for the decision tree and the random forest, mainly in order to get to know the tools commonly used for this purpose. We found this to be a useful but time-consuming and computationally demanding technique.
On the whole, we have shown that ML analysis is an interesting tool with a vast range of applications in precision medicine.
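
The comparison can be summarised by collecting the AUROC values printed in the sections above and applying the 0.95 acceptance threshold from the introduction. The values below are copied from those outputs and rounded to four decimals; the random forest is left out because no output was recorded for it, and the variable names are introduced here for illustration.

```python
import pandas as pd

# AUROC values reported in the outputs above (rounded)
reported_auroc = pd.Series({
    "Linear discriminant analysis": 0.9643,
    "Naive Bayes": 0.8643,
    "Decision tree": 0.9286,
    "Support vector machine": 0.9643,
    "Logistic regression": 0.9643,
    "Neural network": 0.7321,
}).sort_values(ascending=False)

# Models that clear the acceptance threshold set in the introduction
acceptable = reported_auroc[reported_auroc > 0.95]

print(reported_auroc)
print(list(acceptable.index))
```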