# ML test
---

In [None]:
# !pip install pyarrow
# mirna.columns
# mirna.info()
# mirna.describe()
# mirna = mirna.dropna()

In [199]:
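import pandas as pd
import numpy as np
import pyarrow.feather as pa
import pyarrow.fs as ft  # provides FileSelector, used in the next cell
import s3_access  # local module holding the S3 keys
# NOTE: `s3` (used below) is not defined in this notebook; it is assumed to be a
# pyarrow.fs.S3FileSystem built from the credentials in s3_access.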

In [200]:
s3.get_file_info(ft.FileSelector('envbran/methylation/', recursive = True))

[<FileInfo for 'envbran/methylation/GSE117064_mirna.arrow': type=FileType.File, size=17661274>,
 <FileInfo for 'envbran/methylation/GSE117064_pheno.arrow': type=FileType.File, size=146786>,
 <FileInfo for 'envbran/methylation/GSE216997_mirna.arrow': type=FileType.File, size=1140522>,
 <FileInfo for 'envbran/methylation/GSE216997_pheno.arrow': type=FileType.File, size=11250>]

In [201]:
a = s3.open_input_file('envbran/methylation/GSE117064_mirna.arrow')
b = s3.open_input_file('envbran/methylation/GSE117064_pheno.arrow')
pheno = pa.read_feather(b)
mirna = pa.read_feather(a)

In [148]:
DT = pheno.loc[pheno.class_label == 1]

In [149]:
DT.source_name_ch1.value_counts()

CVD patient        173
Non-CVD control    173
Name: source_name_ch1, dtype: int64

In [150]:
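# Reshape so that each row is a sample and each column is a miRNA ('rn' holds the miRNA names)
mirna = mirna.set_index('rn').unstack().unstack().reset_index().rename_axis(columns=None)
# The sample IDs end up in a column called 'index'; rename it to match the pheno table
mirna = mirna.rename(columns={'index': 'geo_accession'})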
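
The reshaping chain above can be checked on a toy table. The sketch below uses made-up miRNA names and sample IDs (they are not from the GEO data) and shows that the double `unstack` amounts to a transpose with the sample IDs moved into a `geo_accession` column.

```python
import pandas as pd

# Toy stand-in for the expression table: one row per miRNA ('rn'),
# one column per sample. Names and values are invented for illustration.
toy = pd.DataFrame({
    "rn": ["miR-1", "miR-2"],
    "GSM1": [1.0, 2.0],
    "GSM2": [3.0, 4.0],
})

# Same chain as in the notebook cell: samples become rows, miRNAs become columns.
wide = (
    toy.set_index("rn")
    .unstack()                  # Series indexed by (sample, miRNA)
    .unstack()                  # frame again: rows = samples, columns = miRNAs
    .reset_index()              # sample IDs move into a column named 'index'
    .rename_axis(columns=None)  # drop the leftover 'rn' label on the columns
    .rename(columns={"index": "geo_accession"})
)

# A plain transpose holds the same values.
transposed = toy.set_index("rn").T

print(wide)
```

If the ordering of rows and columns does not matter, `set_index('rn').T` may be the more readable way to write the same step.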

In [151]:
DT.columns

Index(['title', 'geo_accession', 'source_name_ch1', 'organism_ch1', 'relation',
       'age:ch1', 'bmi:ch1', 'diastolic bp:ch1', 'group:ch1', 'hb-a1c:ch1',
       'Sex:ch1', 'smoking:ch1', 'systolic bp:ch1', 'tissue:ch1', 'diagnosis',
       'class_label'],
      dtype='object')

In [152]:
DT = DT[['geo_accession','source_name_ch1','age:ch1','Sex:ch1']] # select only some columns

In [153]:
DT = pd.merge(DT, mirna, on = "geo_accession")

### LASSO regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is linear regression with L1 regularization. Ordinary least squares minimizes the sum of squared residuals; lasso adds a penalty to that objective that is proportional to the sum of the absolute values of the regression coefficients.

The objective function for lasso regression can be expressed as:

$$ \min_{\beta_0, \beta_1, \dots, \beta_p} \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| $$

Here:
- $n$ is the number of observations.
- $p$ is the number of features.
- $y_i$ is the target variable for the $i$-th observation.
- $x_{ij}$ is the $j$-th feature for the $i$-th observation.
- $\beta_0$ is the intercept and $\beta_1, \dots, \beta_p$ are the regression coefficients.
- $\lambda \ge 0$ is the regularization parameter that controls the strength of the penalty term.
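
As a rough illustration of the formula, the objective can be evaluated directly with NumPy. The function name `lasso_objective` and the toy numbers are assumptions made for this sketch; they are not part of the notebook.

```python
import numpy as np

def lasso_objective(X, y, beta0, beta, lam):
    """Evaluate (1 / 2n) * sum of squared residuals + lam * sum(|beta_j|)."""
    n = X.shape[0]
    residuals = y - beta0 - X @ beta
    squared_error = (residuals @ residuals) / (2 * n)
    l1_penalty = lam * np.abs(beta).sum()  # the intercept beta0 is not penalized
    return squared_error + l1_penalty

# Toy data: 3 observations, 2 features (invented values).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

value = lasso_objective(X, y, beta0=0.0, beta=np.array([1.0, 1.0]), lam=0.1)
print(value)
```

Here the residuals are (0, 1, 1), so the squared-error term is 2 / 6 and the penalty is 0.1 * 2.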

The term $\lambda \sum_{j=1}^{p} |\beta_j|$ is the L1 penalty. It encourages sparse solutions: as $\lambda$ grows, more coefficients are pushed to exactly zero, so the fit also performs feature selection. This is useful when there are many features and some of them contribute little to the predictive power of the model.
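
One way to see why coefficients land on exactly zero: for a single feature scaled so that $\frac{1}{n}\sum_i x_{ij}^2 = 1$ (an assumption of this sketch), minimizing the objective over that one coefficient gives the soft-thresholding rule below, where `rho` stands for $\frac{1}{n}\sum_i x_{ij} r_i$ and $r_i$ is the residual without feature $j$. The function name and the numbers are illustrative only.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything with |rho| <= lam becomes 0."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

# Four hypothetical per-feature correlations with the residual.
rho = np.array([-2.0, -0.3, 0.1, 0.8])
shrunk = soft_threshold(rho, lam=0.5)

print(shrunk)
print("coefficients set to zero:", int(np.sum(shrunk == 0)))
```

The two entries with magnitude below 0.5 are zeroed, while the larger ones are only shrunk by 0.5.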

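Lasso regression is particularly useful for high-dimensional data and when feature selection matters: the penalty reduces overfitting and yields a sparser, more interpretable model. The outcome modeled below (CVD patient versus non-CVD control) is binary, so the classification analogue is logistic regression with an L1 penalty, in which the squared-error term is replaced by the logistic log-loss.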

In [216]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [184]:
enc = OneHotEncoder(drop='first').fit(DT[['Sex:ch1']])

In [187]:
DT[['Sex:ch1']] = enc.transform(DT[['Sex:ch1']]).toarray()

In [218]:
X = DT.drop(['geo_accession','source_name_ch1'],axis=1)
y = DT['source_name_ch1']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=561)

In [219]:
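# NOTE: the default penalty here is L2; a lasso (L1) fit needs penalty='l1' with solver='liblinear' or 'saga'.
lasso_model = LogisticRegression(random_state=0, max_iter=10000).fit(X_train, y_train)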
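
With default arguments, scikit-learn's `LogisticRegression` applies an L2 penalty, so the fit above is not strictly a lasso. The following is a minimal sketch of an L1-penalized alternative. It runs on synthetic data from `make_classification` rather than on the miRNA table, and the value `C=0.1` and the scaling step are assumptions, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 features, only 5 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

# Default (L2) fit versus an L1 fit; features are standardized first
# because the penalty is sensitive to feature scale.
l2_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
l1_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=10000),
)
l2_clf.fit(X_demo, y_demo)
l1_clf.fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model.
n_zero_l2 = int(np.sum(l2_clf.named_steps["logisticregression"].coef_ == 0))
n_zero_l1 = int(np.sum(l1_clf.named_steps["logisticregression"].coef_ == 0))
print("zero coefficients, L2:", n_zero_l2, "| L1:", n_zero_l1)
```

In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` plays the role of a larger $\lambda$.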

In [220]:
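# TODO: add cross-validation and hyperparameter tuning
# lasso_model.coef_
lasso_model.predict(X_test)  # result is discarded; only the last expression is displayed
lasso_model.score(X_test, y_test)  # mean accuracy on the held-out test set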

0.9770114942528736
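
The cross-validation and tuning step noted as a to-do could look roughly like the sketch below. It is one possible approach, shown on synthetic data. The grid of `C` values, the five folds, and the pipeline step names are assumptions rather than choices made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged phenotype + miRNA table.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=561
)

# Scaling sits inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", max_iter=10000)),
])

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_tr, y_tr)

best_C = search.best_params_["clf__C"]
test_accuracy = search.score(X_te, y_te)  # refit best model, scored on held-out data
print("best C:", best_C, "| test accuracy:", test_accuracy)
```

Keeping the test split out of the search means the final score is not influenced by the choice of `C`.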

In [221]:
# Create a dataframe with the predicted and actual labels
# d = {'pred':  lasso_model.predict(X_test), 'actual': y_test}
# df = pd.DataFrame(data=d)

In [222]:
confusion_matrix(y_test,lasso_model.predict(X_test))

array([[42,  2],
       [ 0, 43]], dtype=int64)
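
The matrix above can be unpacked into per-class metrics. scikit-learn orders the rows and columns by sorted label, so, assuming `y` contains only the two labels shown earlier, row 0 is `CVD patient` and row 1 is `Non-CVD control`. Rows are actual classes and columns are predicted classes. The sketch below recomputes the accuracy and the per-class recall and precision from those four counts.

```python
import numpy as np

# Counts copied from the confusion matrix output above.
cm = np.array([[42, 2],
               [0, 43]])
labels = ["CVD patient", "Non-CVD control"]  # sorted label order (assumed)

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
recall = cm.diagonal() / cm.sum(axis=1)      # per actual class (row-wise)
precision = cm.diagonal() / cm.sum(axis=0)   # per predicted class (column-wise)

print("accuracy:", accuracy)
for name, r, p in zip(labels, recall, precision):
    print(name, "| recall:", r, "| precision:", p)
```

The accuracy is 85 / 87, which agrees with the score reported by `lasso_model.score`. The two errors are both CVD patients predicted as controls.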