# Use Case

ProFAB is a benchmarking platform that aims to fill the gap in protein function datasets by providing 7656 datasets in total. In addition to the datasets, ProFAB provides the complete preprocessing, training and evaluation workflow to speed up the use of machine learning in biological studies. Since the workflow is dense, an easy-to-follow use case has been prepared. This use case covers only the functions that import data available in ProFAB and shows how the imported data are used.

## 1. Data Importing

ProFAB allows users to import ready-to-use datasets from the ProFAB database with a few lines of code. After the parent directory is added to the module search path (first cell), the three lines in the second cell do the job:

In [None]:
import sys
sys.path.insert(0, '../')

In [None]:
from profab.import_dataset import ECNO
data_model = ECNO(ratio = [0.1, 0.2], protein_feature = 'paac', pre_determined = False, set_type = 'random')
X_train,X_test,X_validation,y_train,y_test,y_validation = data_model.get_data(data_name = 'ecNo_1-2-4')

The parameters are explained in the 'import_dataset' section. The code above imports the dataset for Enzyme Commission number 1-2-4 as three sets: train, test and validation. Similar code imports the data of any GO term, for example:

In [None]:
from profab.import_dataset import GOID
data_model = GOID(ratio = [0.1, 0.2], protein_feature = 'paac', pre_determined = False, set_type = 'random')
X_train,X_test,X_validation,y_train,y_test,y_validation = data_model.get_data(data_name = 'GO_0000018')
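
To make the `ratio` parameter concrete, the sketch below splits a toy dataset into train, test and validation parts using only the Python standard library. It assumes that `ratio = [0.1, 0.2]` means a test fraction of 0.1 and a validation fraction of 0.2, with the remainder used for training; this order is an assumption and should be checked against the 'import_dataset' section. The function `three_way_split` is a hypothetical helper, not part of ProFAB.

```python
import random

def three_way_split(X, y, ratio, seed=0):
    """Randomly split X and y into train, test and validation parts.

    ratio = [test_fraction, validation_fraction] (assumed order).
    """
    test_frac, valid_frac = ratio
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)

    n_test = round(len(X) * test_frac)
    n_valid = round(len(X) * valid_frac)
    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]

    def pick(values, idx):
        return [values[i] for i in idx]

    return (pick(X, train_idx), pick(X, test_idx), pick(X, valid_idx),
            pick(y, train_idx), pick(y, test_idx), pick(y, valid_idx))

# Toy data: 100 samples with 2 features each and alternating binary labels.
X = [[i, i * 2] for i in range(100)]
y = [i % 2 for i in range(100)]

X_train, X_test, X_validation, y_train, y_test, y_validation = three_way_split(
    X, y, ratio=[0.1, 0.2])
print(len(X_train), len(X_test), len(X_validation))  # 70 10 20
```

With 100 samples, the split leaves 70 samples for training, 10 for testing and 20 for validation.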

After the datasets are loaded, the next step is preprocessing.

## 2. Preprocessing

Preprocessing consists of three steps: featurization, splitting and scaling. However, since ProFAB provides datasets that are already split and stored as numerical features, featurization and splitting are explained only in 'test_file_2'.

### a. Scaling

Scaling rearranges the range of the input points. It is applied to prevent imbalance between features, so that features with large ranges do not dominate the others. If the features already have comparable ranges, scaling is unnecessary. Like the other preprocessing steps, it is described in detail in 'model_preprocess'. A use case:

In [None]:
from profab.model_preprocess import scale_methods
X_train,scaler = scale_methods(X_train,scale_type = 'standard')
X_test = scaler.transform(X_test)
X_validation = scaler.transform(X_validation)

The scaling function returns the scaled training data (X_train) and the fitted scaler (scaler), which is then used to transform the other sets, as shown in the use case.
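
As a rough illustration of what standard scaling does, the sketch below computes per-feature means and standard deviations on the training set only and reuses them for another set. It assumes that `scale_type = 'standard'` corresponds to the usual z-score standardization; `fit_standard` and `apply_standard` are hypothetical helpers written for this example, not ProFAB functions.

```python
import statistics

def fit_standard(X):
    """Return per-feature means and standard deviations computed from X."""
    columns = list(zip(*X))
    means = [statistics.mean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant features.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_standard(X, means, stds):
    """Scale every row of X with statistics fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

# Toy data: the second feature has a range ten times larger than the first.
X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_test = [[2.0, 40.0]]

means, stds = fit_standard(X_train)                   # fit on the training set only
X_train_scaled = apply_standard(X_train, means, stds)
X_test_scaled = apply_standard(X_test, means, stds)   # reuse the training statistics
print(X_train_scaled)
print(X_test_scaled)
```

Fitting on the training set and reusing the statistics for the other sets keeps information from the test and validation sets out of the training step.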

## 3. Training

ProFAB can be trained on any type of data and supports both classification and regression. Since our datasets are built for protein classification, a classification method is shown as an example.

After training, the trained model is stored in 'model_path' if `path` is not None. Because training can take a long time, saving the model saves time later. A stored model is exported and imported with 'pickle', a module of the Python standard library.

In [None]:
from profab.model_learn import classification_methods

#Define the path where the trained model will be saved.
model_path = 'model_path.txt'

model = classification_methods(ml_type = 'logistic_reg',
                                X_train = X_train,
                                y_train = y_train,
                                path = model_path
                                )
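
Reloading a stored model could look like the sketch below. It uses a plain dictionary as a stand-in for a trained model and a temporary directory as a stand-in for the location of `model_path`, so it runs without ProFAB. Whether ProFAB writes the file with exactly this `pickle.dump` call is an assumption.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object works the same way.
trained_model = {'ml_type': 'logistic_reg', 'coefficients': [0.4, -1.2, 0.7]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model_path.txt')

    # Export: pickle files are binary, so the file is opened in 'wb' mode.
    with open(model_path, 'wb') as f:
        pickle.dump(trained_model, f)

    # Import: read the same file back in 'rb' mode.
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)

print(loaded_model == trained_model)  # True
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.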

## 4. Evaluation

After training is done, the model can be evaluated with the following lines of code. The output of the evaluation is given below the code.

### a. Get Scores

In [None]:
from profab.model_evaluate import evaluate_score

score_train,f_train = evaluate_score(model,X_train,y_train,preds = True)
score_test,f_test = evaluate_score(model,X_test,y_test,preds = True)
score_validation,f_validation = evaluate_score(model,X_validation,y_validation,preds = True)

The train and test scores are given for the 'ecNo_1-2-7' target.
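
For readers unfamiliar with classification scores, the sketch below computes a few common ones by hand for binary labels. The metric names, the 0/1 label encoding and the helper `binary_scores` are illustrative assumptions; the exact set of metrics returned by `evaluate_score` is described in 'model_evaluate' and may differ.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Made-up labels and predictions: 3 true positives, 3 true negatives,
# 1 false positive and 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_scores = binary_scores(y_true, y_pred)
print(toy_scores)
```

In this toy example all four metrics equal 0.75, because the numbers of false positives and false negatives are the same.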

### b. Table Formatting

To get the scores in table format, a dictionary that holds the scores of the different sets must be given. The following lines of code tabularize the results:

In [None]:
#To see all results in a table, run the following code:
from profab.model_evaluate import form_table

score_path = 'score_path.csv' #To save the results.

scores = {'train':score_train,'test':score_test,'validation':score_validation}
form_table(scores = scores, path = score_path)

The 'form_table' function writes the scores of a single dataset.
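
The idea of tabularizing a score dictionary can be sketched with the standard `csv` module. The layout below (one row per set, one column per metric), the made-up score values and the helper `write_score_table` are assumptions for illustration; the file produced by `form_table` may be organized differently.

```python
import csv
import io

def write_score_table(scores, handle):
    """Write {set_name: {metric: value}} as one CSV row per set."""
    metrics = sorted({m for set_scores in scores.values() for m in set_scores})
    writer = csv.writer(handle)
    writer.writerow(['set'] + metrics)
    for set_name, set_scores in scores.items():
        writer.writerow([set_name] + [set_scores.get(m, '') for m in metrics])

# Made-up scores, only to show the table layout.
example_scores = {
    'train': {'accuracy': 0.95, 'f1': 0.94},
    'test': {'accuracy': 0.90, 'f1': 0.88},
    'validation': {'accuracy': 0.91, 'f1': 0.89},
}

# An in-memory buffer keeps the example free of side effects;
# use open('score_path.csv', 'w', newline='') to write a real file.
buffer = io.StringIO()
write_score_table(example_scores, buffer)
print(buffer.getvalue())
```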

## 5. Working with Multiple Sets

If the user wants to train models on multiple datasets and compare the performance results, ProFAB can be used inside a for-loop. Suppose the user has negative sets for several GO terms and imports the positive sets from our servers; a use case is:

In [None]:
import sys
sys.path.insert(0, '../')

In [None]:
from profab.import_dataset import GOID, SelfGet
from profab.model_preprocess import ttv_split
from profab.model_learn import classification_methods
from profab.model_evaluate import evaluate_score, multiple_form_table

GO_list = ['GO_0000018','GO_0019935']
scores = {}
for go_term in GO_list: #GO_list holds the GO terms
    
    #Importing data
    negative_data_name = go_term + '_negative_data.txt' 
    #This file is given as a sample. Make sure the feature dimensions match when importing your own data.
    
    negative_set = SelfGet(name = True).get_data(file_name = negative_data_name)
    positive_set = GOID(set_type = 'random', protein_feature = 'aac', label = 'positive').get_data(go_term)
    
    print(len(negative_set[0]))
    print(len(positive_set[0]))
    
    #splitting
    X_train,X_test,X_validation,y_train,y_test,y_validation = ttv_split(X_pos = positive_set,
                                                              X_neg = negative_set,
                                                              ratio = [0.1,0.2])
    #for i in range(len(X_train)):
    #    print(len(X_train[i]))
              
    #prediction
    model = classification_methods(ml_type = 'SVM',
                                  X_train = X_train,
                                  X_valid = X_validation,
                                  y_train = y_train,
                                  y_valid = y_validation)
    
    #evaluation
    score_train = evaluate_score(model,X_train,y_train) 
    score_test = evaluate_score(model,X_test,y_test)
    set_scores = {'train':score_train,'test': score_test}
    scores.update({go_term:set_scores})

#tabularizing the scores
score_path = 'score_path.csv'
multiple_form_table(scores, score_path)
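
The note on dimension matching in the loop above can be made concrete with a small check that runs before positive and negative samples are combined. This is a minimal sketch on made-up feature vectors; the 1/0 label encoding and the helper `combine_sets` are assumptions for illustration and do not describe how `ttv_split` works internally.

```python
def combine_sets(X_pos, X_neg):
    """Stack positive and negative samples and build the label vector."""
    # All samples must have the same number of features.
    n_features = {len(row) for row in X_pos} | {len(row) for row in X_neg}
    if len(n_features) != 1:
        raise ValueError(
            f'Feature dimensions do not match: {sorted(n_features)}')
    X = list(X_pos) + list(X_neg)
    y = [1] * len(X_pos) + [0] * len(X_neg)  # assumed label encoding
    return X, y

# Made-up feature vectors with three features each.
toy_positive = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_negative = [[0.7, 0.8, 0.9]]

X, y = combine_sets(toy_positive, toy_negative)
print(len(X), y)  # 3 [1, 1, 0]
```

If the negative set had been featurized with a different descriptor than the positive set, the check would raise a `ValueError` instead of silently producing rows of unequal length.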