#   Sources

##  [Implementing SVM and Kernel SVM with Python's Scikit-Learn](https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn)
Shows how to split a dataset into training and testing subsets using `sklearn`.

##  [Recursive Feature Elimination](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html)
How to use RFE with SVM in `sklearn`.

##  [Cross Validation in Scikit Learn](https://youtu.be/L_dQrZZjGDg)
Walks through `sklearn.model_selection.train_test_split` and how it is used.

# Import modules

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix

#   Data preprocessing

## Read gene dataset

In [2]:
original_dataset = pd.read_csv('Golub.txt', sep='\t')
original_dataset

Unnamed: 0,Genes,ALL,ALL.1,ALL.2,ALL.3,ALL.4,ALL.5,ALL.6,ALL.7,ALL.8,...,AML.15,AML.16,AML.17,AML.18,AML.19,AML.20,AML.21,AML.22,AML.23,AML.24
0,AFFX-BioB-5_at,-342,-87,22,-243,-130,-256,-62,86,-146,...,7,-213,-25,-72,-4,15,-318,-32,-124,-135
1,AFFX-BioB-M_at,-200,-248,-153,-218,-177,-249,-23,-36,-74,...,-100,-252,-20,-139,-116,-114,-192,-49,-79,-186
2,AFFX-BioB-3_at,41,262,17,-163,-28,-410,-7,-141,170,...,-57,136,124,-1,-125,2,-95,49,-37,-70
3,AFFX-BioC-5_at,328,295,276,182,266,24,142,252,174,...,132,318,325,392,241,193,312,230,330,337
4,AFFX-BioC-3_at,-224,-226,-211,-289,-170,-535,-233,-201,-32,...,-377,-209,-396,-324,-191,-51,-139,-367,-188,-407
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7124,X83863_at,1074,67,893,722,612,1950,245,1235,354,...,752,1293,1733,1567,987,279,737,588,1170,2315
7125,Z17240_at,475,263,297,170,370,906,164,9,-42,...,295,342,304,627,279,51,227,361,284,250
7126,L49218_f_at,48,-33,6,0,29,79,84,7,-100,...,28,26,12,21,22,6,-9,-26,39,-12
7127,M71243_f_at,168,-33,1971,510,333,170,100,1545,45,...,1558,246,3193,2520,662,2484,371,133,298,790
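
The `.1`, `.2`, ... suffixes on the `ALL` and `AML` columns are what pandas produces when a file header repeats a column name. A minimal sketch of that behaviour and of turning such a table into a samples-by-genes layout, assuming the header of `Golub.txt` simply repeats the class names; the toy file and the `toy_*` names are hypothetical, and `set_index` is used here as a shortcut rather than the notebook's own processing:

```python
import io
import pandas as pd

# Hypothetical three-gene, four-sample file whose header repeats the class names.
toy_file = io.StringIO(
    "Genes\tALL\tALL\tAML\tAML\n"
    "g1\t1\t4\t7\t10\n"
    "g2\t2\t5\t8\t11\n"
    "g3\t3\t6\t9\t12\n"
)
toy = pd.read_csv(toy_file, sep='\t')

# pandas de-duplicates the repeated header names by appending numeric suffixes.
print(list(toy.columns))  # ['Genes', 'ALL', 'ALL.1', 'AML', 'AML.1']

# Recover the plain class label of each sample column.
toy_labels = ['ALL' if 'ALL' in column else 'AML' for column in toy.columns[1:]]

# Transpose so that each row is a sample and each column is a gene.
toy_samples = toy.set_index('Genes').T
print(toy_samples.shape)  # (4, 3)
```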


## Process data

In [3]:
def process_data(dataset: pd.DataFrame) -> tuple[pd.Series, pd.Series, pd.DataFrame]:
    """Process the raw gene expression dataset.

    Returns three objects:
        -   The gene names, taken from the first column.
        -   The class label of each sample: the column names without the Genes column
            and without the numeric suffixes of the ALL and AML columns.
        -   The dataset with its columns renamed to those labels and then transposed,
            so that each row is a sample and each column is a gene.

    Args:
        dataset (pd.DataFrame): Dataset as read from the file.

    Returns:
        tuple[pd.Series, pd.Series, pd.DataFrame]: Gene names, sample labels and
        the processed dataset.
    """
    dataset_genes = dataset.iloc[:, 0]
    # Map 'ALL', 'ALL.1', ... to 'ALL' and 'AML', 'AML.1', ... to 'AML'; keep other names as-is.
    dataset_labels = pd.Series(
        [
            'ALL' if 'ALL' in dataset_column else 'AML' if 'AML' in dataset_column else dataset_column
            for dataset_column in dataset.columns
        ]
    )
    # Rename the columns, transpose, and drop the first row, which holds the gene names.
    processed_dataset = dataset.set_axis(dataset_labels, axis='columns').T.iloc[1:]
    return dataset_genes, dataset_labels[1:], processed_dataset


genes, labels, features = process_data(original_dataset)

## Split dataset into train and test subsets

In [4]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, train_size=0.8)
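
`train_test_split` returns a train/test pair for each array it receives, in the same order as its inputs. A minimal sketch on stand-in data; the array contents, `random_state=0` and `stratify=y` are illustrative assumptions and are not used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples, 3 features, two balanced classes.
X = np.arange(30).reshape(10, 3)
y = np.array(['ALL'] * 5 + ['AML'] * 5)

# The first pair of outputs belongs to the first input (X), the second pair to y.
# random_state fixes the shuffle; stratify keeps the class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

Without a fixed `random_state`, each run draws a different split, so scores on a dataset this small can vary between runs.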

#   Data analysis

##  SVM-RFE

### Training

In [5]:
# RFE clones the estimator and fits it internally, so the SVC does not need to be fitted first.
svc = SVC(kernel='linear')
rfe = RFE(estimator=svc).fit(train_features, train_labels)
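
When `n_features_to_select` is left at its default, `RFE` keeps half of the features. A minimal sketch of inspecting which features survive, run on synthetic data rather than the gene dataset; the dataset parameters and `n_features_to_select=5` are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix: 60 samples, 20 features.
X, y = make_classification(
    n_samples=60, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

# A linear kernel exposes coef_, which RFE uses to rank features.
# step=1 removes one feature per iteration until five remain.
selector = RFE(estimator=SVC(kernel='linear'), n_features_to_select=5, step=1)
selector.fit(X, y)

print(selector.support_.sum())  # 5 features kept
print(selector.ranking_)        # rank 1 marks a kept feature; higher ranks were removed earlier
```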

### Accuracy

In [6]:
rfe.score(train_features, train_labels)

1.0

In [7]:
rfe.score(test_features, test_labels)

1.0
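
`RFE.score` first reduces the input to the selected features and then reports the mean accuracy of the fitted estimator. A small sketch that checks this against `accuracy_score`, again on synthetic data with illustrative parameters:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; sizes and random_state are arbitrary choices.
X, y = make_classification(n_samples=80, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

selector = RFE(estimator=SVC(kernel='linear')).fit(X_train, y_train)

# score() and a manual accuracy computation over predict() should agree.
score = selector.score(X_test, y_test)
manual = accuracy_score(y_test, selector.predict(X_test))
print(score, manual)
```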

### Prediction

In [8]:
train_prediction = list(train_labels == rfe.predict(train_features))
f'{train_prediction.count(True)} labels correct out of {len(train_prediction)}'

'57 labels correct out of 57'

In [9]:
test_prediction = list(test_labels == rfe.predict(test_features))
f'{test_prediction.count(True)} labels correct out of {len(test_prediction)}'

'15 labels correct out of 15'
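
The notebook imports `confusion_matrix` but never calls it. A minimal sketch of how it could break the correct/incorrect counts down per class; the label lists below are made up for illustration and are not results from this dataset:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six samples.
true_labels = ['ALL', 'ALL', 'ALL', 'AML', 'AML', 'AML']
predicted_labels = ['ALL', 'ALL', 'AML', 'AML', 'AML', 'AML']

# Rows are true classes and columns are predicted classes, both ordered as in `labels`.
matrix = confusion_matrix(true_labels, predicted_labels, labels=['ALL', 'AML'])
print(matrix)
# [[2 1]
#  [0 3]]
```

Here one `ALL` sample is misclassified as `AML`, which a single count of correct labels would not reveal.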