# Use Case

ProFAB is a benchmarking platform that provides datasets, classifies proteins according to their functional annotations, and evaluates the trained models. The platform was built to cover the complete dataset-training-evaluation triangle. Because the workflow is dense, an easy-to-follow use case is provided here.

## 1. Data Importing

The following lines of code import data in Python. The interface is designed to be easy to use, and if several datasets need to be imported at the same time, a loop can be used. For example:

- To import a single dataset, here one for Gene Ontology term prediction:

In [1]:
from profab.import_dataset import ECNO,GOID
data_model = GOID(ratio = 0.2, protein_feature = 'paac', pre_determined = True, set_type = 'temporal')
X_train,X_test,X_validation,y_train,y_test,y_validation = data_model.get_data(data_name = 'GO_0000018')

- For any Gene Ontology term, the `GOID` function should be used, as in the cell above.
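
The loop mentioned at the start of this section, for importing several datasets at once, can be sketched as follows. This is a minimal sketch, not a ProFAB helper: `load_many` and `_StandInModel` are hypothetical names introduced here, the stand-in only mimics the `get_data(data_name=...)` call shown above so that the snippet runs without ProFAB installed, and the second dataset name is a placeholder rather than a real term.

```python
def load_many(data_model, data_names):
    """Call data_model.get_data once per name and collect the splits by name."""
    datasets = {}
    for name in data_names:
        datasets[name] = data_model.get_data(data_name=name)
    return datasets


class _StandInModel:
    """Hypothetical stand-in for a ProFAB loader such as GOID (assumption)."""

    def get_data(self, data_name):
        # A real loader returns the train/test/validation arrays; this returns
        # six labelled placeholders in the same order as the cell above.
        parts = ("X_train", "X_test", "X_validation",
                 "y_train", "y_test", "y_validation")
        return tuple(f"{data_name}:{part}" for part in parts)


# 'GO_0000018' comes from the cell above; 'GO_PLACEHOLDER' is not a real term.
data_names = ["GO_0000018", "GO_PLACEHOLDER"]
datasets = load_many(_StandInModel(), data_names)

# Each entry unpacks in the same order as a single get_data call.
X_train, X_test, X_validation, y_train, y_test, y_validation = datasets["GO_0000018"]
print(X_train)
```

With ProFAB installed, passing a real loader object in place of the stand-in should behave the same way, assuming `get_data` accepts every name in the list.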

## 2. Training

ProFAB can train models on any type of data and supports both classification and regression. Since the current datasets are based on protein classification, the example below uses a classification method. Once regression-based datasets are added to ProFAB, the same process will apply to regression as well.

After training, the trained model is stored at `model_path` if `path` is not `None`. Because training can take a long time, saving the model saves time later. The stored model is written and read back with `pickle`, a Python standard-library module.
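
The save-and-reload round trip described here can be illustrated with `pickle` alone. The sketch below is not ProFAB's internal code: `save_model` and `load_model` are hypothetical helpers, and a plain dictionary stands in for a trained model so that no machine learning library is needed.

```python
import os
import pickle
import tempfile


def save_model(model, path=None):
    """Pickle the model to `path` when a path is given; otherwise do nothing."""
    if path is not None:
        with open(path, "wb") as handle:
            pickle.dump(model, handle)
    return model


def load_model(path):
    """Read a pickled model back from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dict stands in for a trained model (assumption for illustration).
toy_model = {"ml_type": "logistic_reg", "coefficients": [0.1, -0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_model.pkl")
    save_model(toy_model, toy_path)
    restored = load_model(toy_path)

print(restored == toy_model)  # True
```

Pickle files are binary, so they are opened with `'wb'` and `'rb'` regardless of the file extension.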

In [3]:
#To train the data:
import pickle
from profab.model_preprocess import scale_methods
from profab.model_learn import classification_methods

#Let's define model path where training model will be saved.
model_path = 'model_path.txt'

#The sets are then scaled to eliminate bias. The scaler is fitted on the train data and can be reused for the other sets
X_train,scaler = scale_methods(X_train,scale_type = 'standard')
X_test,X_validation = scaler.transform(X_test),scaler.transform(X_validation)

#After assigning the path and scaling the datasets, training can be run manually as follows (with a user-supplied validation set):
model = classification_methods(ml_type = 'logistic_reg',
                                X_train = X_train,
                                y_train = y_train,
                                X_valid = X_validation,
                                y_valid = y_validation,
                                path = model_path
                                )

LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
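
The scaling step in the cell above fits the scaler on the training set and reuses it for the other sets. The pure-Python sketch below illustrates that idea, assuming that `'standard'` scaling means subtracting each column's mean and dividing by its standard deviation. `fit_standard_scaler` and `apply_scaler` are hypothetical helpers, not ProFAB functions, and the numbers are made up.

```python
from statistics import mean, pstdev


def fit_standard_scaler(rows):
    """Return per-column means and standard deviations from the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_scaler(rows, scaler):
    """Scale rows with statistics that were computed on the training set."""
    means, stds = scaler
    return [[(value - m) / s for value, m, s in zip(row, means, stds)]
            for row in rows]


toy_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
toy_test = [[2.0, 40.0]]

toy_scaler = fit_standard_scaler(toy_train)
scaled_train = apply_scaler(toy_train, toy_scaler)
scaled_test = apply_scaler(toy_test, toy_scaler)

print(scaled_train[1])  # the middle training row sits at the column means
print(scaled_test[0])
```

Fitting on the training rows only keeps information from the test and validation sets out of the preprocessing step.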


## 3. Evaluation

Once training is done, the model can be evaluated with the following lines of code. The output of each evaluation is shown below the code.

### Get Scores

In [4]:
#The model returned by classification_methods can be used directly;
#alternatively, the saved model can be loaded with the 'pickle' package.
with open(model_path,'rb') as f:
    model = pickle.load(f)

from profab.model_evaluate import evaluate_score

score_train,f_train = evaluate_score(model,X_train,y_train,preds = True)
score_test,f_test = evaluate_score(model,X_test,y_test,preds = True)
score_validation,f_validation = evaluate_score(model,X_validation,y_validation,preds = True)

The train, test, and validation scores below are for the target dataset 'ecNo_1-2-7'.

In [6]:
print(score_train)

{'Precision': 0.5205479452054794, 'Recall': 0.4730290456431535, 'F1-Score': 0.49565217391304345, 'F05-Score': 0.5102954341987467, 'Accuracy': 0.6791147994467497, 'MCC': 0.26178984017238344, 'AUC': 0.6275933609958506, 'AUPRC': 0.5846169878171241, 'TP': 114, 'FP': 105, 'TN': 377, 'FN': 127}


In [7]:
print(score_test)

{'Precision': 0.3819444444444444, 'Recall': 0.30386740331491713, 'F1-Score': 0.3384615384615384, 'F05-Score': 0.3632760898282695, 'Accuracy': 0.6040515653775322, 'MCC': 0.061949328472919014, 'AUC': 0.5290055248618785, 'AUPRC': 0.45892802332719457, 'TP': 55, 'FP': 89, 'TN': 273, 'FN': 126}


In [8]:
print(score_validation)

{'Precision': 0.5192307692307693, 'Recall': 0.34615384615384615, 'F1-Score': 0.41538461538461535, 'F05-Score': 0.47202797202797203, 'Accuracy': 0.6752136752136753, 'MCC': 0.2107878791782229, 'AUC': 0.592948717948718, 'AUPRC': 0.5416666666666667, 'TP': 27, 'FP': 25, 'TN': 131, 'FN': 51}
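
Several of the reported scores follow directly from the four confusion-matrix counts in each dictionary. As a worked check, the sketch below recomputes them from the training counts using the standard textbook formulas. `scores_from_counts` is a hypothetical helper, not part of ProFAB, and it does not claim to reproduce how `evaluate_score` is implemented.

```python
from math import sqrt


def scores_from_counts(tp, fp, tn, fn):
    """Compute common binary-classification scores from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = 0.5 ** 2  # beta = 0.5 weights precision more than recall
    return {
        "Precision": precision,
        "Recall": recall,
        "F1-Score": 2 * precision * recall / (precision + recall),
        "F05-Score": (1 + beta_sq) * precision * recall
        / (beta_sq * precision + recall),
        "Accuracy": (tp + tn) / (tp + fp + tn + fn),
        "MCC": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Counts taken from the printed training scores above.
recomputed = scores_from_counts(tp=114, fp=105, tn=377, fn=127)
for name, value in recomputed.items():
    print(f"{name}: {value:.4f}")
```

The recomputed values agree with the corresponding entries of the printed `score_train` dictionary up to floating-point rounding.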


### Table Formatting

To get the scores in table format, the following lines of code can be executed. Besides the scores, the size of each set is also given. The table is stored in .csv format.

In [9]:
#To see all results in one table, run the following code:
from profab.model_evaluate import form_table

score_path = 'score_path.csv' #To save the results.

scores = {'train':score_train,'test':score_test,'validation':score_validation}
form_table(scores = scores, path = score_path)
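
The exact layout of the file written by `form_table` is not shown here. To give an idea of what a score table with set sizes can look like, the sketch below writes one row per set with the standard `csv` module. `write_score_table` is a hypothetical helper, the row-per-set layout is an assumption, and so is the choice to derive the set size from the sum of the four confusion-matrix counts.

```python
import csv
import io


def write_score_table(scores, handle):
    """Write one row per set: set name, set size, then every metric."""
    metric_names = list(next(iter(scores.values())))
    writer = csv.writer(handle)
    writer.writerow(["set", "size"] + metric_names)
    for set_name, metrics in scores.items():
        # Assumption: the set size is the total of the confusion-matrix counts.
        size = sum(metrics[key] for key in ("TP", "FP", "TN", "FN"))
        writer.writerow([set_name, size] + [metrics[name] for name in metric_names])


# Shortened score dictionaries; the counts are copied from the outputs above.
toy_scores = {
    "train": {"TP": 114, "FP": 105, "TN": 377, "FN": 127},
    "test": {"TP": 55, "FP": 89, "TN": 273, "FN": 126},
    "validation": {"TP": 27, "FP": 25, "TN": 131, "FN": 51},
}

buffer = io.StringIO()
write_score_table(toy_scores, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch free of side effects; a file opened with `open(path, "w", newline="")` can be passed in the same way.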

## 4. Feature Extraction from Protein Sequences

In addition to the functionality above, ProFAB can extract numerical features from protein sequence data using many protein descriptor methods. The sequence files must be in ".fasta" format, and this module must be run on Linux or macOS. To use this module, run the following lines:

In [None]:
from profab.model_preprocess import extract_protein_feature
extract_protein_feature('edp', 1, 
                          'profab/feature_extraction_module/input_folder', 
                          'sample')
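
Because the input must be a FASTA file and the module runs only on Linux or macOS, both conditions can be checked before calling the extractor. The sketch below is an illustration under those stated requirements: `read_fasta` is a hypothetical helper, not part of ProFAB, and the two records are made up.

```python
import io
import platform


def read_fasta(handle):
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is its header.
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines are concatenated until the next header.
            records[header] += line
    return records


# Two made-up records, used only to exercise the parser.
toy_fasta = io.StringIO(">protein_1\nMKTAYIAK\nQRQISF\n>protein_2\nGSHMLE\n")
records = read_fasta(toy_fasta)
print(records)

# platform.system() reports 'Linux' on Linux and 'Darwin' on macOS.
print(platform.system() in ("Linux", "Darwin"))
```

A file that raises `ValueError` here, or that yields no records, is unlikely to be valid FASTA input.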