<h1 align="center">Bonus task <br> Tissue type classification based on microarray gene expression profiles</h1>

<br>
<br>
<center>CS-EJ3211 Machine Learning with Python 29.5.-17.7.2023</center>
<center>Aalto University (Espoo, Finland)</center>
<center>fitech.io (Finland)</center>

<h2><strong>Submit the notebook by 24.07.2023 on the Aalto JupyterHub. Follow the required outline presented in this notebook.</strong></h2>

The submitted notebook should contain all Python code used in the project (early prototyping and "scrapbooking" can be excluded). The notebook should be arranged so that the reader can replicate your workflow by running the cells in the notebook in order.

You can earn a maximum of 15 p for this notebook (problem formulation 2 p, methods 3 p, implementation 5 p, results 2.5 p, conclusions 2.5 p).  

**General recommendations**\
Try to use the notation of this course when you write mathematical formulas or symbols. If you prefer a different notation, follow good scientific writing principles and clearly define the meaning of your symbols.

**Please comment your code.**\
The comments don't have to be as comprehensive as in the exercise rounds (where they serve an educational purpose), but they should give some indication of what is happening in the different sections of your code.

## Introduction

"A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time. DNA microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene."\
text source: https://www.nature.com/scitable/definition/microarray-202/

<img src="../../../coursedata/7_bonus_notebook/DNA_microarray.jpg" width=800/>

image source: https://www.genome.gov/about-genomics/fact-sheets/DNA-Microarray-Technology

The microarray data for this problem consists of normalized relative expression values of certain genes measured in different tissues. There are 3000 gene probes and 2000 samples. The full dataset can be found at https://www.ebi.ac.uk/arrayexpress/ (accession number E-MTAB-62). 

A subset of this data is stored as a CSV file in the `/coursedata/7_bonus_notebook/` directory, i.e. use the path `"/coursedata/7_bonus_notebook/data_subset.csv"` to load the data. 

The first two columns of the 'data_subset.csv' file contain the sample IDs (e.g. 'GSM23227.CEL') and the analysis info ('RMA'). The following columns contain the expression values for 3000 genes, and the last column contains the tissue category. 

Your task is to predict the type of tissue ('disease' vs 'normal') based on the expression profile of each sample. 
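
The file layout described above can be illustrated with a small toy table. This is only a sketch: the column names (`sample_id`, `analysis`, `gene_1`, ..., `category`) and the values are made up for illustration, and the real file would be read with `pd.read_csv` from the path given above. The sketch assumes, as stated later in the instructions, that the label is in the last column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real file. Column names and values are hypothetical;
# only the layout (id, analysis, expression values, category) follows the text.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sample_id": [f"S{i}.CEL" for i in range(6)],
    "analysis": ["RMA"] * 6,
    "gene_1": rng.normal(size=6),
    "gene_2": rng.normal(size=6),
    "gene_3": rng.normal(size=6),
    "category": ["disease", "normal", "neoplasm", "cell line", "normal", "disease"],
})

# Keep only the two categories of interest.
subset = toy[toy["category"].isin(["disease", "normal"])]

# Features: every column between the two metadata columns and the label column.
X = subset.iloc[:, 2:-1].to_numpy()
# Labels: 1 for "disease", 0 for "normal".
y = (subset["category"] == "disease").astype(int).to_numpy()

print(X.shape, y)
```

Selecting the feature columns by position avoids relying on the probe names, which are not listed in this notebook.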

<a id='problem'></a>
<div class=" alert alert-info">

## Problem formulation (2 p)

In contrast to the conceptual presentation of the problem in the introduction, this section formulates the problem as a machine learning problem. You should:

- Define the type of your problem. Is it a regression or classification problem? Or perhaps something else?

- Define the **data points** in your problem and define the **features** and **labels** of the points.

- Define the **metric** that serves as the measure of quality of an ML model on your problem*. For example, the mean squared error might be a reasonable choice for a regression problem, whereas some kind of balanced accuracy score might suit a classification problem with imbalanced classes. Note that this is not necessarily equivalent to the loss function used by your model!
    
    
*More info on metric below.
    
</div>
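
To see why the choice of metric matters, the following sketch compares accuracy, balanced accuracy and F1-score on made-up labels with imbalanced classes. The label vectors are invented for illustration and have nothing to do with the microarray data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: 9 negatives (0) and 1 positive (1).
y_true = np.array([0] * 9 + [1])
# A trivial predictor that always outputs the majority class.
y_pred = np.zeros(10, dtype=int)

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recalls
# zero_division=0 returns 0.0 (without a warning) when no positives are predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)

print(acc, bal_acc, f1)
```

The trivial predictor reaches a high accuracy although it never finds the positive class, whereas balanced accuracy and F1-score expose the failure.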

### YOUR TEXT HERE

Problem formulation ...

## Methods and Implementation instructions.

    
Your task is to build **logistic regression and Support Vector Machine (SVM) models** for the tissue type prediction task. During this course, you have familiarized yourself with multiple ML methods from the scikit-learn library, but now you will need to independently learn the specifics of how to use the SVM classifier in scikit-learn by studying the documentation and related resources. 

**Note that this is a model selection task: you choose between the hypothesis space of logistic regression and several hypothesis spaces of SVM models with different hyperparameters.**
    
More precisely, you need to:

1. Load the "data_subset.csv" file as a Pandas dataframe. The file contains gene expression data for tissues of different types. The first column contains the sample id and the second column indicates how the data was analysed (Robust Multi-array Average or RMA). The remaining columns, excluding the final one, contain the relative gene expression values. Finally, the last column contains the category (label) to which each data point belongs ('cell line', 'disease', 'neoplasm', 'normal'). 


2. You will only use data points belonging to two of the four categories in the dataset - 'disease' and 'normal'. Consequently, you should create a new data frame that only contains the data points with these labels. The new dataset should consist of 700 data points.


3. Create numpy arrays `X` (feature matrix) and `y` (label vector) based on the data frame. The feature matrix should contain the expression data and be of shape `(700, 3000)`.
   The label vector `y` should be of shape `(700,)` and contain integer values 1 (for data points labeled as "disease") and 0 (for data points labeled as "normal").
   
   
4. Split the data with `train_test_split` into trainval and test sets (80:20 ratio, `random_state=42`). Keep the test set aside until the final evaluation. Use the trainval data for training models and for model selection (as described below). 


5. Implement PCA (using 20 components) with logistic regression:

   - Use the scikit-learn `Pipeline` class to chain the pre-processing steps (`StandardScaler()` and `PCA(n_components=20, random_state=42)`) and logistic regression. 
   - Use the [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function from `sklearn.model_selection` to perform 5-fold cross-validation and compute the average F1-score (pass `scoring='f1'` and `cv=5` to `cross_val_score`).
 

6. Implement PCA (using 20 components) with SVM:

  - Construct a `Pipeline` object with the scaler and PCA for the SVM model in the same way as for logistic regression.
  - Use the trainval set to fit the parameters and choose the hyperparameters. Specifically, perform grid search combined with cross-validation on the `Pipeline` object by using the `GridSearchCV` class in scikit-learn. 
  
  The candidate parameter values for the SVM model in your grid search should be `'C': [0.01, 1, 100]` and `'gamma': [1e-04, 1e-03, 1e-02]`, the number of folds used for cross-validation should be `cv=5`, and the scoring parameter should be `f1`.
  - Report the F1-score of the SVM model with the best parameter values for `C` and `gamma`.
  

7. Choose the model with the best F1-score and perform the final evaluation:

    - Fit the model (pipeline object) on the trainval set.
    - Report the accuracy and F1-score on the training (trainval) and test sets.
    - Plot a normalized confusion matrix for the test set. 
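
The mechanics of steps 4 to 6 can be sketched on synthetic data. This is an illustration under stated assumptions, not a reference solution: the data comes from `make_classification`, the step names (`scaler`, `pca`, `clf`) are arbitrary, and the classifiers use scikit-learn defaults apart from the grid values listed above. Note that parameters of a pipeline step are addressed in the grid as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the microarray data).
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=42)

# 80:20 split; the test set is not touched during model selection.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling and PCA are refit inside each cross-validation fold by the pipeline.
logreg_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=20, random_state=42)),
    ("clf", LogisticRegression()),
])
logreg_f1 = cross_val_score(logreg_pipe, X_trainval, y_trainval,
                            scoring="f1", cv=5).mean()

svm_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=20, random_state=42)),
    ("clf", SVC()),  # default kernel is RBF, so gamma has an effect
])
# Step-prefixed names route the values to the SVC inside the pipeline.
param_grid = {"clf__C": [0.01, 1, 100], "clf__gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(svm_pipe, param_grid, scoring="f1", cv=5)
search.fit(X_trainval, y_trainval)

print(logreg_f1, search.best_params_, search.best_score_)
```

Because both scores are 5-fold cross-validated F1-scores on the same trainval data, `logreg_f1` and `search.best_score_` can be compared directly when choosing the final model.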

Useful links:

- Learn about Support Vector Machine (SVM) methods (e.g. https://scikit-learn.org/stable/modules/svm.html#support-vector-machines) and the implementation of SVM (specifically the SVC) in the scikit-learn library.
- Pipeline example https://scikit-learn.org/stable/auto_examples/compose/plot_digits_pipe.html
- Metrics for evaluation https://scikit-learn.org/stable/modules/model_evaluation.html
- Function for plotting confusion matrix https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn-metrics-confusionmatrixdisplay
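
A normalized confusion matrix can be computed and plotted as in the sketch below. The label vectors are made up for illustration, and the display labels assume the encoding 0 = "normal", 1 = "disease" described above. Matplotlib must be installed for the plotting call.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Made-up true and predicted labels (0 = "normal", 1 = "disease").
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# normalize="true" divides each row by the number of true samples in that class,
# so each row sums to 1.
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)

# Draw the matrix; in a notebook the figure is shown below the cell.
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["normal", "disease"])
disp.plot(cmap="Blues")
```

With row normalization, the diagonal entries are the per-class recalls, which makes the matrix easy to read when the classes differ in size.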

<a id='methods'></a>
<div class=" alert alert-info">

## Methods (3p)
    
This section presents the methods used to solve the machine learning problem and walks through the process of solving the problem. This section could include:

- A description of the dataset. What is the source of the dataset? How many data points does it contain? The features and labels were already presented in the previous section but can be presented once again.
    
- An explanation of why and how the data was split into subsets.

- A description of the pre-processing methods that you have used on your data. 

- A description of the model(s) you are using to solve your machine learning problem. Of what form are the predictor functions (include formula if applicable)? What is the loss function to be minimized or maximized (include formula if applicable). You should also include a short description of the hyperparameters that you tune to optimize the model. 
    
- If you use some tools/methods for model selection and validation (e.g. cross-validation, grid search), explain the purpose of it and how it was performed.
    
- A description of hyperparameter tuning and model selection process. E.g. which validation methods have you used to estimate the model performance on previously unseen data?
</div>
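
As a reminder of what such formulas can look like, here are the standard textbook forms of the two models, written in a notation that is an assumption and may differ from the course notation: $m$ data points with features $\mathbf{x}^{(i)}$ and labels $y^{(i)} \in \{0, 1\}$, and a weight vector $\mathbf{w}$ (the intercept is absorbed into $\mathbf{w}$ for logistic regression). The regularization term that scikit-learn adds to logistic regression by default is omitted.

Logistic regression minimizes the average logistic loss:

$$
L(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\sigma\big(\mathbf{w}^{T}\mathbf{x}^{(i)}\big) + \big(1-y^{(i)}\big)\log\Big(1-\sigma\big(\mathbf{w}^{T}\mathbf{x}^{(i)}\big)\Big)\Big],
\qquad \sigma(z)=\frac{1}{1+e^{-z}}
$$

The soft-margin SVM is usually written with labels $\tilde{y}^{(i)} = 2y^{(i)} - 1 \in \{-1, +1\}$ and a feature map $\phi$ that is defined implicitly by the kernel $K$ (here the RBF kernel):

$$
\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{m}\max\Big(0,\ 1-\tilde{y}^{(i)}\big(\mathbf{w}^{T}\phi(\mathbf{x}^{(i)})+b\big)\Big),
\qquad K(\mathbf{x},\mathbf{x}')=\exp\big(-\gamma\,\|\mathbf{x}-\mathbf{x}'\|^{2}\big)
$$

In this form, a larger $C$ penalizes margin violations more strongly, and a larger $\gamma$ makes the kernel more local.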

### YOUR TEXT HERE

Methods ...

<a id='methods'></a>
<div class=" alert alert-info">

## Implementation (5p)
</div>

In [None]:
#==============Import all needed libraries===============#



In [None]:
#===============Import dataset=================#



In [None]:
#===============Select subset of dataset [only 'disease' and 'normal' categories]=================#



In [None]:
#===============Split dataset=================#



In [None]:
#===============Logistic regression===============#



In [None]:
#===============SVM===============#



In [None]:
#===============Final evaluation of the chosen model===============#



<a id='result'></a>
<div class=" alert alert-info">

## Results (2.5 p)

This section presents the results of the experiments. In most problems, the central result is the estimated performance of the final model on new data with respect to the chosen performance metric. In addition, you can, for example, present results for different models or consider how the hyperparameters affect the model's performance.

</div>

### YOUR TEXT HERE ###

Some text about your results ...

<a id='discussion'></a>
<div class=" alert alert-info">


## Discussion/ Conclusions (2.5 p)

In this section you should analyze the results on a more general level and summarize the findings of your work. Discuss the following questions:
- Do the results suggest satisfactory performance of your final model, or is there much room for improvement?
- How do your results compare to benchmarks or solutions of others (if such are available)?
- Are you aware of some methodological shortcomings in the project?
- Do you have ideas for how to improve the performance (e.g. using more training data, using more features for the data points, or using a different class of predictor functions, i.e. a different hypothesis space)?

</div>

### YOUR TEXT HERE ###

Discussion ....