# Use Case

ProFAB is a benchmarking platform that aims to fill the gap in protein function datasets by providing 7656 datasets in total. In addition to these datasets, ProFAB provides a complete preprocessing, training and evaluation pipeline to speed up the use of machine learning in biological studies. Because the workflow is dense, an easy-to-follow use case is prepared. This use case shows how a user can import their own dataset.

## 1. Data Importing

ProFAB lets users import their own datasets that are not available in ProFAB. To import such data, use the SelfGet() function:

In [None]:
import sys
sys.path.insert(0, '../')

In [None]:
from profab.import_dataset import SelfGet
data = SelfGet(delimiter = '\t', name = False, label = False).get_data(file_name = "sample.txt")
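
As a rough picture of what such a delimited file might contain, the sketch below parses tab-separated text in plain Python. It is not ProFAB's implementation. The helper `parse_feature_text` is hypothetical, and the meaning given to the flags (`name = True` treats the first column as a protein identifier, `label = True` treats the last column as the label) is an assumption made for illustration.

```python
def parse_feature_text(text, delimiter="\t", name=False, label=False):
    """Parse delimited numeric rows into (names, features, labels).

    Hypothetical helper for illustration; it is not part of ProFAB.
    Assumes `name` marks the first column as an identifier and
    `label` marks the last column as the target value.
    """
    names, features, labels = [], [], []
    for line in text.strip().splitlines():
        fields = line.split(delimiter)
        if name:
            names.append(fields[0])
            fields = fields[1:]
        if label:
            labels.append(float(fields[-1]))
            fields = fields[:-1]
        features.append([float(value) for value in fields])
    return names, features, labels


# Two made-up rows: identifier, two feature values, label.
sample = "P12345\t0.10\t0.25\t1\nQ67890\t0.30\t0.05\t-1\n"
names, X, y = parse_feature_text(sample, name=True, label=True)
print(names)  # ['P12345', 'Q67890']
print(X)      # [[0.1, 0.25], [0.3, 0.05]]
print(y)      # [1.0, -1.0]
```

With both flags left at False, every column is read as a feature, which corresponds to the call shown above under the same assumption.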

The parameters are explained in the "import_dataset" section. With these functions, users can control how a dataset
is constructed. If a user already has the positive set of a term that is available in ProFAB, only the negative set
needs to be obtained, by setting the parameter 'label' = 'negative'. For example, if a user has a positive set for
EC number 1-2-7 and wants the matching negative set for prediction, the following lines can be executed:

In [None]:
from profab.import_dataset import SelfGet, ECNO
negative_set = ECNO(label = 'negative').get_data('ecNo_1-2-7')
positive_set = SelfGet().get_data('users_1-2-7_positive_set.txt')

After the datasets are loaded, the preprocessing step follows.

## 2. Preprocessing

Preprocessing consists of three steps: featurization, splitting and scaling.

### a. Featurization

Featurization converts a protein FASTA file into numerical feature data using one of many protein descriptors. A detailed
explanation can be found in "model_preprocess". This function only works on Linux and macOS, and the input file format must be '.fasta'. The following lines can be run:

In [None]:
from profab.model_preprocess import extract_protein_feature
extract_protein_feature('edp', 1, 
                       'directory_folder_input_file', 
                       'sample')

After running this function, a new file holding the numerical features of the proteins is created. It can be imported with the SelfGet() function as shown in the previous section.
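
To give a feel for what "numerical features" means, here is a minimal sketch of one simple descriptor, amino acid composition: the fraction of each of the 20 standard residues in a sequence. This is not how `extract_protein_feature` works internally, and it is not the 'edp' descriptor used above. The functions `read_fasta` and `amino_acid_composition` are hypothetical helpers written only for this illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def read_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may span several lines; join them.
            records[header] += line.upper()
    return records


def amino_acid_composition(sequence):
    """Fraction of each standard residue; non-standard letters are ignored."""
    counts = [sequence.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    return [count / total if total else 0.0 for count in counts]


# A made-up 7-residue protein split over two lines.
fasta_text = ">example_protein\nMKV\nLAAG\n"
for protein_id, sequence in read_fasta(fasta_text).items():
    features = amino_acid_composition(sequence)
    print(protein_id, len(features), features[AMINO_ACIDS.index("A")])
```

Each protein becomes a fixed-length vector (20 values here) regardless of sequence length, which is what makes the output usable as a feature matrix.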

### b. Splitting

The splitting module prepares the train, validation (if needed) and test sets
for prediction. Detailed information is available in "model_preprocess", and reading it is highly recommended to understand how the function works. Given X (feature matrix) and y
(label matrix), the data can be split by defining the fraction of the test set:

In [None]:
from profab.model_preprocess import ttv_split
X_train,X_test,y_train,y_test = ttv_split(X,y,ratio)
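
The idea behind a fractional train/test split can be sketched in a few lines of plain Python. This is an illustration of the general technique, not the code behind 'ttv_split'. The function `shuffle_split` and its `seed` parameter are hypothetical.

```python
import random


def shuffle_split(X, y, test_fraction, seed=0):
    """Shuffle row indices once; the first `test_fraction` of them form the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(rows, idx):
        return [rows[i] for i in idx]

    return take(X, train_idx), take(X, test_idx), take(y, train_idx), take(y, test_idx)


# Toy data: 10 samples with 2 features each; the label is the parity of the first feature.
X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = shuffle_split(X, y, test_fraction=0.2)
print(len(X_train), len(X_test))  # 8 2
```

Shuffling indices rather than the rows themselves keeps each feature row paired with its label.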

Instead of passing all the data at once, the user can feed the 'ttv_split' function with the positive and negative sets and obtain the split data in the same way.

In [None]:
from profab.model_preprocess import ttv_split
X_train,X_test,y_train,y_test = ttv_split(X_pos,X_neg,ratio)

If the task is regression, then y (label matrix) must be given.
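
When positive and negative sets are given separately, one common approach is to split each class on its own and then merge the parts, so every part keeps the original class balance. The sketch below illustrates that idea under stated assumptions. It is not ProFAB's implementation. The two-element `ratio` is assumed to mean [test fraction, validation fraction], and the labels 1 and 0 are chosen arbitrarily. The function `split_by_class` is hypothetical.

```python
import random


def split_by_class(X_pos, X_neg, ratio, seed=0, pos_label=1, neg_label=0):
    """Split positives and negatives separately, then merge each part.

    `ratio` is assumed to be [test_fraction, validation_fraction].
    Returns {"train" | "test" | "validation": (rows, labels)}.
    """
    rng = random.Random(seed)
    parts = {"train": ([], []), "test": ([], []), "validation": ([], [])}
    test_fraction, validation_fraction = ratio
    for rows, label in ((X_pos, pos_label), (X_neg, neg_label)):
        rows = list(rows)  # copy so the caller's list is not reordered
        rng.shuffle(rows)
        n_test = int(round(len(rows) * test_fraction))
        n_valid = int(round(len(rows) * validation_fraction))
        chunks = {
            "test": rows[:n_test],
            "validation": rows[n_test:n_test + n_valid],
            "train": rows[n_test + n_valid:],
        }
        for part, chunk in chunks.items():
            parts[part][0].extend(chunk)
            parts[part][1].extend([label] * len(chunk))
    return parts


# Toy data: 20 positive and 40 negative rows.
X_pos = [[1.0, i] for i in range(20)]
X_neg = [[0.0, i] for i in range(40)]
parts = split_by_class(X_pos, X_neg, ratio=[0.1, 0.2])
for part, (X_part, y_part) in parts.items():
    print(part, len(X_part), sum(y_part))
```

In this sketch each part lists its positives before its negatives, so a further shuffle may be needed before training with an order-sensitive learner.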

### c. Scaling

Scaling rescales the range of the input features. Its purpose is to prevent imbalance between feature ranges. If the
data is already on a stable scale, this step is unnecessary. Like the other preprocessing steps, a detailed introduction
can be found in 'model_preprocess'. A use case:

In [None]:
from profab.model_preprocess import scale_methods
X_train,scaler = scale_methods(X_train,scale_type = 'standard')
X_test = scaler.transform(X_test)

The scaling function returns the scaled training data (X_train) and the fitted scaler (scaler), which is then used to transform the other sets, as shown in the use case. The rest is exactly the same as 'test_file_1'.
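
The reason the scaler is fitted on the training data only and then reused on the test data can be seen in a small sketch of standardization. This is a plain-Python illustration of the technique, assuming that 'standard' means subtracting the column mean and dividing by the column standard deviation. The class `SimpleStandardScaler` is hypothetical and is not the object returned by 'scale_methods'.

```python
import statistics


class SimpleStandardScaler:
    """Column-wise standardization: (value - mean) / standard deviation."""

    def fit(self, X):
        columns = list(zip(*X))
        self.means = [statistics.fmean(col) for col in columns]
        # Population standard deviation; constant columns get 1.0 to avoid dividing by zero.
        self.stds = [statistics.pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, X):
        return [
            [(value - m) / s for value, m, s in zip(row, self.means, self.stds)]
            for row in X
        ]


X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_test = [[4.0, 40.0]]

# Statistics come from the training set only and are reused for the test set.
scaler = SimpleStandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```

Both columns of the test row map to the same scaled value even though their raw ranges differ by a factor of ten, which is the effect scaling is meant to achieve.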

## 3. Training

ProFAB supports training on any type of data, for both classification and regression tasks. Since our datasets are based on the classification of proteins, a classification method is shown as an example.

After training, the trained model can be stored at 'model_path' if 'path' is not None. Because training can take a long time, saving the model saves time later. A stored model is exported and imported with 'pickle', a Python package.

In [None]:
from profab.model_learn import classification_methods

#Let's define model path where training model will be saved.
model_path = 'model_path.txt'

model = classification_methods(ml_type = 'logistic_reg',
                                X_train = X_train,
                                y_train = y_train,
                                path = model_path
                                )
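
Exporting and re-importing an object with 'pickle' follows the pattern below. The dictionary is only a stand-in for a trained model, and the temporary directory is used so the sketch does not overwrite any real file. Neither is taken from ProFAB.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object behaves the same way.
trained_model = {"ml_type": "logistic_reg", "coefficients": [0.4, -1.2, 0.7]}

model_path = os.path.join(tempfile.mkdtemp(), "model_path.txt")

# Export: pickle files must be written in binary mode.
with open(model_path, "wb") as handle:
    pickle.dump(trained_model, handle)

# Import: read the object back in a later session instead of retraining.
with open(model_path, "rb") as handle:
    restored_model = pickle.load(handle)

print(restored_model == trained_model)  # True
```

Only unpickle files from a trusted source, because loading a pickle can execute arbitrary code.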

## 4. Evaluation

After training is done, evaluation can be performed with the following lines of code. The output of the evaluation is given below the code.

### a. Get Scores

In [None]:
from profab.model_evaluate import evaluate_score

score_train,f_train = evaluate_score(model,X_train,y_train,preds = True)
score_test,f_test = evaluate_score(model,X_test,y_test,preds = True)
#The validation set is available only if the data was split into train, test and validation sets.
score_validation,f_validation = evaluate_score(model,X_validation,y_validation,preds = True)

The train and test scores are given for the dataset 'ecNo_1-2-7' ('target').
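
Several of the reported metrics follow directly from the confusion-matrix counts (TP, FP, TN, FN). The sketch below recomputes them with the standard formulas, using the counts printed for the GO_0000018 training set in section 5. The function `scores_from_counts` is a hypothetical helper, not part of 'evaluate_score'. AUC and AUPRC are left out because they need the predicted scores, not just the counts.

```python
import math


def scores_from_counts(tp, fp, tn, fn):
    """Compute common binary-classification scores from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = 0.5 ** 2  # F0.5 weights precision more heavily than recall
    return {
        "Precision": precision,
        "Recall": recall,
        "F1-Score": 2 * precision * recall / (precision + recall),
        "F05-Score": (1 + beta_sq) * precision * recall / (beta_sq * precision + recall),
        "Accuracy": (tp + tn) / (tp + fp + tn + fn),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Counts reported for the GO_0000018 training set in section 5.
scores = scores_from_counts(tp=148, fp=63, tn=639, fn=203)
for metric, value in scores.items():
    print(f"{metric}: {value:.4f}")
```

The results agree with the values in that output, for example Precision = 148 / (148 + 63) ≈ 0.7014 and Recall = 148 / (148 + 203) ≈ 0.4217.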

### b. Table Formatting

To get the results in table format, a dictionary containing the scores of the different sets must be given. The following lines of code can be executed to tabulate the results:

In [None]:
#If user wants to see result in a table, following codes can be run:
from profab.model_evaluate import form_table

score_path = 'score_path.csv' #To save the results.

scores = {'train':score_train,'test':score_test,'validation':score_validation}

#form_table() function will write scores to score_path.
form_table(scores = scores, path = score_path)
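
For readers who want to see what tabulating such a dictionary involves, here is a minimal sketch with Python's 'csv' module: one row per metric and one column per set. The layout is an assumption and may differ from what form_table() writes. The function `write_score_table` is hypothetical, and the numbers are rounded from the GO_0000018 output in section 5.

```python
import csv
import io


def write_score_table(scores, handle):
    """Write one row per metric and one column per data set."""
    set_names = list(scores)
    metrics = list(next(iter(scores.values())))
    writer = csv.writer(handle)
    writer.writerow(["Metric"] + set_names)
    for metric in metrics:
        writer.writerow([metric] + [scores[name][metric] for name in set_names])


# Rounded values, used only to show the layout.
example_scores = {
    "train": {"Precision": 0.7014, "Recall": 0.4217},
    "test": {"Precision": 0.7419, "Recall": 0.4423},
}

# An in-memory buffer keeps the sketch side-effect free;
# open("score_path.csv", "w", newline="") would write a real file instead.
buffer = io.StringIO()
write_score_table(example_scores, buffer)
print(buffer.getvalue())
```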

## 5. Working with Multiple Sets

If the user wants to make predictions for multiple classes, ProFAB can handle this with a 'for' loop. For this case, let's say the user has negative datasets for 2 GO terms, stored in the following files:

    - GO_0000018_negative_data.txt
    - GO_0019935_negative_data.txt

Both files are tab-separated, and each row of protein features starts with the protein's name.
So this time, using the SelfGet() function with the parameter 'name' = True is the appropriate way to load the negative datasets. To load the positive datasets, the ProFAB GOID() function with the parameter 'label' = 'positive' can be used.

In [1]:
import sys
sys.path.insert(0, '../')

In [2]:
from profab.import_dataset import GOID, SelfGet
from profab.model_preprocess import ttv_split
from profab.model_learn import classification_methods
from profab.model_evaluate import evaluate_score, multiple_form_table

#GO_List: variable includes GO terms
GO_list = ['GO_0000018','GO_0019935']

#To hold scores of model performances
scores = {}

for go_term in GO_list: 

    #User imports his/her negative dataset with SelfGet() function
    negative_data_name = go_term + '_negative_data.txt' 
    negative_set = SelfGet(name = True).get_data(file_name = negative_data_name) 
    
    #Importing positive datasets with ProFAB GOID() function.
    positive_set = GOID(set_type = 'random', protein_feature = 'aac', label = 'positive').get_data(go_term)
    
    #splitting
    X_train,X_test,X_validation,y_train,y_test,y_validation = ttv_split(X_pos = positive_set,
                                                              X_neg = negative_set,
                                                              ratio = [0.1,0.2])
    #prediction
    model = classification_methods(ml_type = 'SVM',
                                  X_train = X_train,
                                  X_valid = X_validation,
                                  y_train = y_train,
                                  y_valid = y_validation)
    
    #evaluation
    score_train = evaluate_score(model,X_train,y_train) 
    score_test = evaluate_score(model,X_test,y_test)
    set_scores = {'train':score_train,'test': score_test}
    scores.update({go_term:set_scores})

#tabularizing the scores
score_path = 'score_path.csv'
multiple_form_table(scores, score_path)

SVC(C=49.50251256281407, gamma=0.0517947467923121, kernel='linear',
    max_iter=2500)
SVC(C=49.50251256281407, gamma=0.0517947467923121, kernel='linear',
    max_iter=2500)
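
The loop above builds a nested dictionary of the form {GO term: {set: {metric: value}}}. Before such a structure can be written as a table it has to be flattened into rows. The sketch below shows one way to do that. It is not the code behind multiple_form_table(), the function `flatten_scores` is hypothetical, and the values are rounded training scores from the output printed at the end of this section, with only two metrics kept for brevity.

```python
def flatten_scores(scores):
    """Turn {dataset: {set_name: {metric: value}}} into a list of flat row dicts."""
    rows = []
    for dataset, set_scores in scores.items():
        for set_name, metrics in set_scores.items():
            rows.append({"Dataset": dataset, "Set": set_name, **metrics})
    return rows


# Rounded training scores for the two GO terms.
nested = {
    "GO_0000018": {"train": {"Precision": 0.7014, "Recall": 0.4217}},
    "GO_0019935": {"train": {"Precision": 0.7550, "Recall": 0.5744}},
}

for row in flatten_scores(nested):
    print(row)
```

Each row carries its dataset and set name, so results for many GO terms can sit in one table and be compared directly.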


In [3]:
print(scores)

{'GO_0000018': {'train': {'Precision': 0.7014218009478673, 'Recall': 0.42165242165242167, 'F1-Score': 0.5266903914590748, 'F05-Score': 0.6192468619246863, 'Accuracy': 0.7473884140550807, 'MCC': 0.3908801801563357, 'AUC': 0.6659544159544161, 'AUPRC': 0.6579283743580743, 'TP': 148, 'FP': 63, 'TN': 639, 'FN': 203}, 'test': {'Precision': 0.7419354838709677, 'Recall': 0.4423076923076923, 'F1-Score': 0.5542168674698795, 'F05-Score': 0.6534090909090909, 'Accuracy': 0.7549668874172185, 'MCC': 0.42526107635748295, 'AUC': 0.6807498057498057, 'AUPRC': 0.6881480781555551, 'TP': 23, 'FP': 8, 'TN': 91, 'FN': 29}}, 'GO_0019935': {'train': {'Precision': 0.754950495049505, 'Recall': 0.5743879472693032, 'F1-Score': 0.6524064171122994, 'F05-Score': 0.7102934326967861, 'Accuracy': 0.7972551466001248, 'MCC': 0.5225464902906568, 'AUC': 0.7410186005003233, 'AUPRC': 0.7351620471107454, 'TP': 305, 'FP': 99, 'TN': 973, 'FN': 226}, 'test': {'Precision': 0.8431372549019608, 'Recall': 0.5512820512820513, 'F1-Score