#   Researching algorithms mentioned in paper

##  [Implementing SVM and Kernel SVM with Python's Scikit-Learn](https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn)
Covers splitting a dataset into training and testing sets using `sklearn`.

##  [Recursive Feature Elimination](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html)
How to use RFE with SVM in `sklearn`.
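
As a minimal sketch of the idea, the snippet below runs RFE with a linear SVM on synthetic data. The data, the number of features to keep, and the `step` value are illustrative assumptions, not values taken from the linked example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 60 samples, 50 features, 5 informative.
X, y = make_classification(
    n_samples=60, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

# RFE ranks features by the estimator's coef_ (or feature_importances_),
# so the SVM needs a linear kernel to expose coef_.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=5)
selector.fit(X, y)

selected = selector.support_.nonzero()[0]  # column indices of the kept features
print(selected)
```

`selector.ranking_` assigns rank 1 to every kept feature, and larger ranks to features that were eliminated earlier.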

##  [Cross Validation in Scikit Learn](https://youtu.be/L_dQrZZjGDg)
Goes through `sklearn.model_selection.train_test_split` and how it is used.

## [Transformer that performs Sequential Feature Selection.](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)

## [Feature selection in Python using the Filter method](https://towardsdatascience.com/feature-selection-in-python-using-filter-method-7ae5cbc4ee05)

# Researching $Y$

[Basic probability: Joint, marginal and conditional probability | Independence - YouTube](https://www.youtube.com/watch?v=SrEmzdOT65s)

[Time Series Measures — PyInform 0.2.0 documentation](https://elife-asu.github.io/PyInform/timeseries.html#module-pyinform.conditionalentropy)
[Python Module Index — PyInform 0.2.0 documentation](https://elife-asu.github.io/PyInform/py-modindex.html)

[Empirical Distributions — PyInform 0.2.0 documentation](https://elife-asu.github.io/PyInform/dist.html?module-pyinform.dist#module-pyinform.dist)

[Time Series Measures — PyInform 0.2.0 documentation](https://elife-asu.github.io/PyInform/timeseries.html#module-pyinform.mutualinfo)

[entropy.dvi](https://www.csd.uoc.gr/~hy438/lectures/entropy.pdf)

[2 ENTROPY DEPENDENCE.pdf](http://cosynet.auth.gr/sites/default/files/less16_17/2%20ENTROPY%20DEPENDENCE.pdf)

[python - Best way to get joint probability from 2D numpy - Stack Overflow](https://stackoverflow.com/questions/45070966/best-way-to-get-joint-probability-from-2d-numpy)

[joint probability distribution marginal probabilities - Google Search](https://www.google.com/search?q=joint+probability+distribution+marginal+probabilities&oq=joint+distribution+of+marginal+pro&aqs=chrome.1.69i57j0i22i30l3j0i390l3.9008j0j7&sourceid=chrome&ie=UTF-8)

[numpy - Calculate marginal distribution from joint distribution in Python - Stack Overflow](https://stackoverflow.com/questions/53945334/calculate-marginal-distribution-from-joint-distribution-in-python)

[numpy.argmax — NumPy v1.20 Manual](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html)

[python - Negative values in time series forecast and high fluctuations in input data - Cross Validated](https://stats.stackexchange.com/questions/396647/negative-values-in-time-series-forecast-and-high-fluctuations-in-input-data)

[sklearn.metrics.mutual_info_score — scikit-learn 0.24.2 documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html#sklearn.metrics.mutual_info_score)

[sklearn.metrics.adjusted_mutual_info_score — scikit-learn 0.24.2 documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html#sklearn.metrics.adjusted_mutual_info_score)

[2.3. Clustering — scikit-learn 0.24.2 documentation](https://scikit-learn.org/stable/modules/clustering.html#mutual-info-score)

[Can the mutual information of a "cell" be negative? - Cross Validated](https://stats.stackexchange.com/questions/5698/can-the-mutual-information-of-a-cell-be-negative)

[pr.probability - Can the mutual information of a "cell" be negative? - Theoretical Computer Science Stack Exchange](https://cstheory.stackexchange.com/questions/3939/can-the-mutual-information-of-a-cell-be-negative/3943#3943)

[Shannon Information Measures — PyInform 0.2.0 documentation](https://elife-asu.github.io/PyInform/shannon.html?highlight=kullback)

[kullback-leibler divergence venn diagram - Google Search](https://www.google.com/search?q=kullback-leibler+divergence+venn+diagram&sxsrf=ALeKk02knhy320Bys-4rFFMhgoIYQdYSzw:1622810157415&tbm=isch&source=iu&ictx=1&fir=8gKMdWJTKQY0IM%252CNyi01RYV8spmIM%252C_&vet=1&usg=AI4_-kTrk0nY-muv3AzngG_ukpB9BbbJRg&sa=X&ved=2ahUKEwjCquH3_v3wAhXzh_0HHat5A7gQ9QF6BAgPEAE#imgrc=UkSG_FIH7kBp_M)

[ECS452 4 u5.pdf](http://www2.siit.tu.ac.th/prapun/ecs452_2017_2/ECS452%204%20u5.pdf)
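
The links above revolve around joint and marginal distributions, entropy, and mutual information. The sketch below ties them together for two discrete variables using $H(Y \mid X) = H(X, Y) - H(X)$ and $I(X; Y) = H(X) + H(Y) - H(X, Y)$, all in nats. The toy variables `x` and `y` are assumptions made up for illustration.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Toy discrete data (assumption): y depends on x, so the MI should be positive.
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=1000)
y = (x + rng.integers(0, 2, size=1000)) % 3

# Joint distribution: normalise the 2D table of co-occurrence counts.
counts = np.zeros((3, 3))
np.add.at(counts, (x, y), 1)
joint = counts / counts.sum()

# Marginal distributions: sum the joint over the other variable.
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)


def entropy(p):
    """Shannon entropy in nats; zero-probability cells contribute nothing."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))


h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint)
h_y_given_x = h_xy - h_x   # conditional entropy H(Y | X)
mi = h_x + h_y - h_xy      # mutual information I(X; Y)

print(mi, mutual_info_score(x, y))
```

`sklearn.metrics.mutual_info_score` also works in nats, so the two printed values should agree up to floating-point error.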

# Import modules

In [1]:
import time

import numpy as np
import pandas as pd
import pyinform
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

#   Data preprocessing

## Read gene dataset

In [2]:
# original_dataset = pd.read_csv('http://139.91.190.186/tei/bioinformatics/Golub.txt', sep='\t')
from sklearn.svm import SVC

original_dataset = pd.read_csv('Golub.txt', sep='\t')
original_dataset

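|  | Genes | ALL | ALL.1 | ALL.2 | ALL.3 | ALL.4 | ALL.5 | ALL.6 | ALL.7 | ALL.8 | ... | AML.15 | AML.16 | AML.17 | AML.18 | AML.19 | AML.20 | AML.21 | AML.22 | AML.23 | AML.24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFFX-BioB-5_at | -342 | -87 | 22 | -243 | -130 | -256 | -62 | 86 | -146 | ... | 7 | -213 | -25 | -72 | -4 | 15 | -318 | -32 | -124 | -135 |
| 1 | AFFX-BioB-M_at | -200 | -248 | -153 | -218 | -177 | -249 | -23 | -36 | -74 | ... | -100 | -252 | -20 | -139 | -116 | -114 | -192 | -49 | -79 | -186 |
| 2 | AFFX-BioB-3_at | 41 | 262 | 17 | -163 | -28 | -410 | -7 | -141 | 170 | ... | -57 | 136 | 124 | -1 | -125 | 2 | -95 | 49 | -37 | -70 |
| 3 | AFFX-BioC-5_at | 328 | 295 | 276 | 182 | 266 | 24 | 142 | 252 | 174 | ... | 132 | 318 | 325 | 392 | 241 | 193 | 312 | 230 | 330 | 337 |
| 4 | AFFX-BioC-3_at | -224 | -226 | -211 | -289 | -170 | -535 | -233 | -201 | -32 | ... | -377 | -209 | -396 | -324 | -191 | -51 | -139 | -367 | -188 | -407 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7124 | X83863_at | 1074 | 67 | 893 | 722 | 612 | 1950 | 245 | 1235 | 354 | ... | 752 | 1293 | 1733 | 1567 | 987 | 279 | 737 | 588 | 1170 | 2315 |
| 7125 | Z17240_at | 475 | 263 | 297 | 170 | 370 | 906 | 164 | 9 | -42 | ... | 295 | 342 | 304 | 627 | 279 | 51 | 227 | 361 | 284 | 250 |
| 7126 | L49218_f_at | 48 | -33 | 6 | 0 | 29 | 79 | 84 | 7 | -100 | ... | 28 | 26 | 12 | 21 | 22 | 6 | -9 | -26 | 39 | -12 |
| 7127 | M71243_f_at | 168 | -33 | 1971 | 510 | 333 | 170 | 100 | 1545 | 45 | ... | 1558 | 246 | 3193 | 2520 | 662 | 2484 | 371 | 133 | 298 | 790 |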


## Process data

In [3]:
def process_data(dataset: pd.DataFrame) -> tuple[pd.Series, pd.Series, pd.DataFrame]:
    """Process the raw gene-expression dataset.

    Args:
        dataset (pd.DataFrame): The dataset as read from file: one row per gene, with the
            gene names in the first column and one column per sample.

    Returns:
        tuple[pd.Series, pd.Series, pd.DataFrame]: In this order:
            -   The gene names.
            -   The sample labels: the ALL/AML column names with their numeric suffixes
                removed and the gene-name column excluded.
            -   The processed dataset: columns renamed to the labels, then transposed so
                that each row is a sample and each column a gene.
    """
    dataset_genes = dataset.iloc[:, 0]
    dataset_labels = pd.Series(
        [
            'ALL' if 'ALL' in dataset_column else 'AML' if 'AML' in dataset_column else dataset_column
            for dataset_column in dataset.columns
        ]
    )
    # Rename the columns, transpose, and drop the first row (the gene names).
    altered_dataset = dataset.set_axis(dataset_labels, axis='columns').T.iloc[1:]
    return dataset_genes, dataset_labels[1:], altered_dataset


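# process_data returns the gene names, the labels and the processed dataset, in that order
genes, labels, features = process_data(original_dataset)

## Split dataset to train and test data subsets

In [4]:
# train_test_split returns the two splits of its first argument, then those of its second
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, train_size=0.75)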
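
The split above is unseeded, so every run produces a different partition. A minimal sketch of a reproducible, class-balanced alternative follows. It uses made-up stand-in arrays (`demo_features`, `demo_labels`) rather than the Golub data, and the 47/25 class balance is an assumption chosen to match the 72 samples in the table.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 72 samples, 100 random features, 47 ALL and 25 AML labels.
rng = np.random.default_rng(0)
demo_labels = np.array(["ALL"] * 47 + ["AML"] * 25)
demo_features = rng.normal(size=(72, 100))

# random_state fixes the shuffle; stratify keeps the ALL/AML ratio similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    demo_features,
    demo_labels,
    train_size=0.75,
    stratify=demo_labels,
    random_state=0,
)

print(X_train.shape, X_test.shape)
```

With a fixed `random_state`, re-running the cell or restarting the kernel yields the same train and test subsets, which makes accuracy figures comparable between runs.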

#   Data analysis

##  SVM-RFE

### Train

In [5]:
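tic = time.perf_counter()

# RFE needs an estimator that exposes coef_, hence the linear kernel.
# RFE clones and refits the estimator internally, so fitting `svm` beforehand is not required.
svm = SVC(kernel='linear').fit(train_features, train_labels)
svm_rfe = RFE(estimator=svm).fit(train_features, train_labels)

toc = time.perf_counter()
toc - tic  # elapsed training time in seconds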

18.748268666999998

### Accuracy

In [6]:
svm_rfe.score(train_features, train_labels)
# should return 1.0

1.0

In [7]:
svm_rfe_accuracy = svm_rfe.score(test_features, test_labels)
# the split is unseeded, so the score varies between runs; it sometimes reaches 1.0 after a kernel restart and re-run
svm_rfe_accuracy

0.9444444444444444

### Prediction

In [14]:
list(svm_rfe.predict(train_features) == train_labels).count(True)
# should return 54

54

In [9]:
svm_rfe_prediction = svm_rfe.predict(test_features)
svm_rfe_test_prediction = list(test_labels == svm_rfe_prediction)
f'{svm_rfe_test_prediction.count(True)} labels correct out of {len(svm_rfe_prediction)}'
# the split is unseeded, so the count varies between runs; it sometimes reaches 18 out of 18 after a kernel restart and re-run

'17 labels correct out of 18'

### Notes

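Sometimes the SVM-RFE model achieves a test accuracy of 1.0 (18 labels correct out of 18). Because the train/test split is not seeded, the result changes between runs, and reaching it may take a few kernel restarts and re-runs.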
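
Since the test score depends on which samples land in the test subset, one way to get a steadier estimate is repeated stratified cross-validation. The sketch below runs on synthetic data. The dataset, the fold and repeat counts, and the RFE settings are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 72 samples, 40 features.
X, y = make_classification(n_samples=72, n_features=40, n_informative=5, random_state=0)

# Feature elimination is refitted inside every fold, so no test sample
# influences which features are kept for that fold.
model = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=5)
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=5, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

The spread of `scores` gives a rough idea of how much a single 75/25 split can swing the reported accuracy.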

## MIGS-Pruning

### Implementation

In [10]:
ns = np.vectorize(lambda x: 1 / (1 + np.exp(-x)))(np.array(train_features.T)).tolist()
# The sigmoid is applied because negative values later raise a ValueError,
# while a sparse array makes every MI equal to 0
s = []


In [11]:
#   Initialization
MI = [
    pyinform.conditional_entropy(
        pyinform.dist.Dist(ns[i]),
        pyinform.dist.Dist(ns[i + 1])
    )
    for i in range(len(ns) - 1)
]

**The following cell will not stop running when executed for a second time without a kernel restart.**

In [12]:
#   Iterative selection procedure

# for k in range(1, len(MI)):
#     s.append(np.argmax(MI[k]))
#     MI[k] = 0
#     for i in range(1, len(ns) - k):
#         MI[i] = min([
#             MI[i],
#             pyinform.conditional_entropy(
#                 pyinform.dist.Dist(ns[i]),
#                 pyinform.dist.Dist(ns[i] * s[k])
#             )
#         ])
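
The commented-out loop follows a greedy pattern: take the highest-scoring feature, then lower each remaining score to the minimum of its current value and a quantity conditioned on the newly selected feature. The sketch below shows one well-defined instance of that pattern, using conditional mutual information with the class label as the conditioned quantity. That choice, the helper names `conditional_mi` and `greedy_select`, and the toy data are assumptions. It is not the MIGS procedure from the paper and does not use `pyinform`.

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def conditional_mi(x, y, z):
    """I(X; Y | Z) in nats for discrete arrays: the p(z)-weighted sum of I(X; Y | Z = z)."""
    total = 0.0
    for value in np.unique(z):
        mask = z == value
        total += mask.mean() * mutual_info_score(x[mask], y[mask])
    return total


def greedy_select(features, labels, n_select):
    """Greedy selection on a (n_samples, n_features) array of discrete values."""
    n_features = features.shape[1]
    # Initial score: mutual information between each feature and the labels.
    scores = np.array([mutual_info_score(features[:, i], labels) for i in range(n_features)])
    selected = []
    for _ in range(n_select):
        best = int(np.argmax(scores))
        selected.append(best)
        scores[best] = -np.inf  # never pick the same feature twice
        for i in range(n_features):
            if np.isfinite(scores[i]):
                # Keep only the information not already explained by the chosen feature.
                cmi = conditional_mi(features[:, i], labels, features[:, best])
                scores[i] = min(scores[i], cmi)
    return selected


# Toy data (assumption): column 2 is the label with 10% of entries flipped,
# column 3 is an exact copy of column 2, and the other columns are noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
noise = rng.integers(0, 2, size=(400, 4))
informative = labels ^ (rng.random(400) < 0.1)
features = np.column_stack([noise[:, :2], informative, informative.copy(), noise[:, 2:]])

selected = greedy_select(features, labels, n_select=2)
print(selected)
```

The loop has a fixed number of iterations and works on a local copy of the scores, so re-running the cell does not depend on state left over from a previous run. The duplicate column is never chosen once its twin has been selected, because its conditional mutual information drops to zero.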