# BIO-SELECT - Marigliano
## Features selection using Limma and R

The goal of this notebook is to run the Limma algorithm and append the features it selects to those already selected in the features_selection.ipynb notebook.

To use this notebook, you need to have Docker installed.

The steps are the following:
1. Build the Docker image to set up a ready-to-use R environment
2. Run the two Docker containers, one for MILE and one for Golub
    1. Run the container
    1. Read the Limma CSV file
    1. Sort this file
    1. Convert the Limma feature indices to dataset indices
    1. Append the feature list to the CSV files generated in the features_selection.ipynb notebook
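
The read, sort, and convert steps listed above can be sketched on toy data. This is a minimal illustration, not the project's code: the probe IDs, the `F` values, and the `dataset_feature_names` list are invented, and the dictionary lookup only stands in for whatever `get_features_indices` does in the dataset class.

```python
import io

import pandas as pd

# Hypothetical tab-separated Limma output (IDs and F statistics are invented).
# The last row has an empty F value, which pandas reads as NaN.
toy_csv = (
    "ID\tF\tP.Value\n"
    "probe_b\t12.5\t0.001\n"
    "probe_a\t40.0\t0.0001\n"
    "probe_c\t\t0.5\n"
)

# Hypothetical order of the features in the dataset matrix.
dataset_feature_names = ["probe_a", "probe_b", "probe_c"]

# Read only the columns of interest, drop rows without a score,
# and sort so that the highest F statistic comes first.
df = pd.read_csv(io.StringIO(toy_csv), sep="\t", usecols=["ID", "F"])
df = df.dropna()
df = df.sort_values("F", ascending=False)

# Map each feature name to its position in the dataset.
name_to_index = {name: i for i, name in enumerate(dataset_feature_names)}
features_by_score = [
    (name_to_index[name], float(score)) for name, score in zip(df["ID"], df["F"])
]
print(features_by_score)
```

The result is a list of `(dataset index, score)` tuples ordered from the best feature to the worst, which is the shape the later cells work with.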



# Build Docker image

In [None]:
# Re-run this cell every time you change the R scripts.
!cd docker-R && \
docker build -t rdocker .

# Run Limma

In [None]:
import os
DATASET = "GSE13425"
GROUP_NAME = DATASET + "_10052017"
os.environ['DATASET'] = DATASET
os.environ["GROUP_NAME"] = GROUP_NAME

N_FEATURES = 1000
ALG_NAME = "Limma"

In [None]:
####
#### change the cell type to "Code" to be able to run it
####

# $HOST_WD is an environment variable that contains the current folder on the host, since Docker-in-Docker containers use the host context
!cd docker-R && \
docker run -it -v $HOST_WD/docker-R/dataset:/dataset --rm rdocker Rscript --no-save --no-restore --verbose limma-$DATASET.R

In [None]:
!ls -al docker-R/dataset/

In [None]:
!head -n4 docker-R/dataset/limma-$DATASET.csv

## Parse Limma CSV

In [None]:
import pandas as pd
from datasets.GSE13425.GSE13425Dataset import GSE13425Dataset
from algorithms.Algorithm import Algorithm

In [None]:
ds = GSE13425Dataset()

In [None]:
filename = r"docker-R/dataset/limma-%s.csv" % DATASET

df = pd.read_csv(filename, sep="\t", usecols=["ID", "F"])
df = df.dropna()  # ignore NaN values

df = df[["ID", "F"]] # order the columns

# convert pandas dataframe to array of tuples
features_by_score = [tuple(x) for x in df.to_records(index=False)]

# convert feature names to feature indices
f_names, f_scores = zip(*features_by_score)
f_names = ds.get_features_indices(f_names)
# build a list rather than a lazy iterator (zip is lazy in Python 3)
features_by_score = list(zip(f_names, f_scores))

# normalize the score
features_by_score_normed = Algorithm.normalize_scores(features_by_score)[:N_FEATURES]
print(features_by_score_normed[:10])

# build (index, rank weight) tuples
# the weight 1/(1+k) decreases with the position k, so the best features get the highest weight
r = [f[0] for f in features_by_score_normed]
features_by_rank = [(v, 1.0/(1.0+k)) for k, v in enumerate(r)]

# assign the same weight for all features
features = [(f[0], 1) for f in features_by_score_normed]

# prepare the subsets dict to export in CSV
subsets = {}
subsets[ALG_NAME] = {"features": [], "features_by_rank": [], "features_by_score": []}
subsets[ALG_NAME]["features"] = features
subsets[ALG_NAME]["features_by_rank"] = features_by_rank
subsets[ALG_NAME]["features_by_score"] = features_by_score_normed
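
The two weighting schemes built in the cell above can be checked on a small example. The sketch below assumes a list of `(index, score)` tuples already sorted by decreasing score; the values are invented, and `Algorithm.normalize_scores` is not reproduced here.

```python
# Hypothetical (feature index, normalized score) pairs, best first.
features_by_score_normed = [(42, 1.0), (7, 0.6), (19, 0.1)]

# Rank weight: 1 / (1 + position), so the top feature gets 1.0, the next 0.5, and so on.
ordered_indices = [idx for idx, _ in features_by_score_normed]
features_by_rank = [
    (idx, 1.0 / (1.0 + pos)) for pos, idx in enumerate(ordered_indices)
]

# Uniform weight: every selected feature counts equally.
features = [(idx, 1) for idx, _ in features_by_score_normed]

print(features_by_rank)
print(features)
```

The rank weight depends only on the position in the list, not on the size of the gap between scores, so two features with nearly equal scores still receive different weights.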

## Save the features

In [None]:
from utils.CSVFeaturesExporter import CSVFeaturesExporter

group_name = GROUP_NAME + "_limma"
features_exporter = CSVFeaturesExporter(subsets, group_name)
features_exporter.export()

In [None]:
print(group_name)

In [None]:
!ls outputs | grep -i GSE

In [None]:
# TODO: change the cell type of this cell to "Code" to concatenate 
# the CSVs (features_selection.ipynb and features_selection_limma.ipynb)

!cat outputs/$GROUP_NAME\_limma_features.csv >> outputs/$GROUP_NAME\_features.csv
!cat outputs/$GROUP_NAME\_limma_features_by_rank.csv >> outputs/$GROUP_NAME\_features_by_rank.csv
!cat outputs/$GROUP_NAME\_limma_features_by_score.csv >> outputs/$GROUP_NAME\_features_by_score.csv
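
Because `>>` appends unconditionally, running the cell above twice duplicates the Limma rows in the target files. A minimal sketch of a guarded append in Python follows; `append_once` is a hypothetical helper that is not part of the project, and it assumes the files are small enough to read into memory.

```python
def append_once(src_path, dst_path):
    """Append the content of src_path to dst_path.

    Returns False and leaves dst_path untouched when it already
    ends with the content of src_path, True otherwise.
    """
    with open(src_path) as src:
        new_rows = src.read()

    try:
        with open(dst_path) as dst:
            existing = dst.read()
    except FileNotFoundError:
        existing = ""

    # Skip the append if the rows are already at the end of the target.
    if new_rows and existing.endswith(new_rows):
        return False

    with open(dst_path, "a") as dst:
        dst.write(new_rows)
    return True
```

It could be called once per output file, for example with the `_limma_features.csv` file as `src_path` and the matching `_features.csv` file as `dst_path`.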