# 01 Data Process
This document describes how to prepare input data before running GloEC. We use `input_sample` as the test data throughout.

## Import the necessary packages
`config_util.py` is part of the provided source code and contains the hyperparameters and file paths needed at runtime.
<br>`pandas` is used to read CSV files.
<br>`torch` is used to convert data into tensors for GloEC to utilize.

In [2]:
from config_util import Config
import pandas as pd
import torch

## Input file reading
We have stored the source file in the `root` directory shown below, under the name `input_sample`. It is saved in `.csv` format, which is easy to inspect outside of code, and we recommend that users convert their own source files to `.csv` as well. We read the file with pandas. For this demonstration the file is read with `encoding="GB2312"`.
<br>The source file has three data columns: Entry, Sequence, and EC number. Entry is the protein's entry identifier in the UniProt database. The leading `Unnamed: 0` column in the output is an unnamed row index stored in the CSV.

In [3]:
root = '../Data/input_sample/'
pd.read_csv(root + "input_sample.csv", encoding="GB2312")

Unnamed: 0,Entry,Sequence,EC number
0,A0A067YMX8,MAASPYSIFAVQLLLLASWMLSSSSSNFNQDFNIAWGGGRARILNN...,2.4.1.207
1,A0A0K3AV08,MEQASVPSYVNIPPIAKTRSTSHLAPTPEHHRSVSYEDTTTASTST...,2.7.11.25
2,A0A1D6K6U5,MVLSSSCTTVPHLSSLAVVQLGPWSSRIKKKTDAVAVPAAAGRWRA...,5.5.1.13
3,A1XSY8,MMTAKAVDKIPVTLSGFVHQLSDNIYPVEDLAATSVTIFPNAELGS...,2.3.2.-
4,A1ZA55,MAMNLENIVNQATAQYVKIKEHREPYTAHYNALKDKVYSEWKSSAV...,2.7.7.-
5,A2A5Z6,MSNPGGRRNGPVKLRLTVLCAKNLVKKDFFRLPDPFAKVVVDGSGQ...,2.3.2.26
6,A2CEI6,MDPKRPTFPSPPGVIRAPWQQSTEDQSQLLDQPSLGRARGLIMPID...,3.1.26.-
7,A2TK72,KREAEANRTPEQQIYDPYKYVETVFVVDKAMVTKYNGDLDKIKTRM...,3.4.24.-
8,A3KPQ7,MQVNDGPSSHPIFVAPVNGNAQRSSGYVPGRIVPVRSPPPAKAPPP...,3.2.1.35
9,A4FUD9,MAGTVVLDDVELREAQRDYLDFLDDEEDQGIYQSKVRELISDNQYR...,3.6.4.12


## FASTA file
The processed source file must then be converted to `.fasta` format for subsequent use. FASTA is a text format for nucleotide or peptide sequences in which each nucleotide or amino acid is represented by a single-letter code. Each record begins with a header line starting with `>`, which gives the sequence name and optional comments, followed by the sequence itself. The conversion is straightforward, so we do not cover it in detail here.
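
As a minimal sketch of that conversion, assuming each record is an (Entry, Sequence) pair and that every sequence goes on a single line as in the sample file printed below, the helper could look like this. `write_fasta` is a hypothetical name for illustration, not part of GloEC.

```python
import io


def write_fasta(records, handle):
    """Write (entry_id, sequence) pairs to an open text handle in FASTA format."""
    for entry_id, sequence in records:
        # Header line starts with '>', then the sequence on its own line.
        handle.write(f">{str(entry_id).strip()}\n{str(sequence).strip()}\n")


# Toy records standing in for rows of the CSV (sequences shortened).
records = [("A0A067YMX8", "MAASPYSIFAVQ"), ("A0A0K3AV08", "MEQASVPSYVNI")]
buffer = io.StringIO()
write_fasta(records, buffer)
fasta_text = buffer.getvalue()
print(fasta_text)
```

With a DataFrame read as in the previous section (here called `df`, a hypothetical name), the records could be supplied as `zip(df["Entry"], df["Sequence"])` and the handle opened with `open(root + "input_sample.fasta", "w")`.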

In [4]:
with open(root + 'input_sample.fasta', 'r') as file:
    for line in file:
        # Strip only the trailing newline so a final line without one is not truncated.
        print(line.rstrip('\n'))


>A0A067YMX8
MAASPYSIFAVQLLLLASWMLSSSSSNFNQDFNIAWGGGRARILNNGELVTLSLDKASGSGFRSKNLYLFGKIDMQLKLVPGNSAGTVTTYYLSSEGSVRDEIDFEFLGNLTGEPYTLHTNVYSHGKGEREQQFRLWFDPAADFHTYSILWNSKTIVFYVDQTPVREFKNMESIGVPYLRQPMRLFSSIWNADEWATRGGLIKTDWTQAPFTTSYRNFRADNACVWAAKASSCGLAAGGNAWLSVELDAKSRGRLRWVRRNQMIYDYCVDGKRFPRGVPPECKLNLHI
>A0A0K3AV08
MEQASVPSYVNIPPIAKTRSTSHLAPTPEHHRSVSYEDTTTASTSTDSVPEVRIRSESSQVSRESPPIRASKAFVASYEYEAQKDDELNLPLGAIITLVTVETNEDGWYRGELNGKVGLFPSNYAREVTYKDNLVEFKQDEIMLPVAVRTLSDCQIGHGATATVFKMDIKIKKELQNGRMGEAVGDQMKAALKRFNRHASNFRADVVSTDEQLEQLKREANLVNGLSHNNIVRLLGICLEDPYFGLLLELCEGSSLRNVCRNLNSDAAIPLGVLIDWATQVAEGMEYLTKQGYVHRDLKADNVLVKEEVCLCMDEEMFQYAYCLKCGKRPFDKLQLKITDFGVTRKMTADANRFSTAGTYAWLAPEAFKEGTWSEASDVWSYGVVLWELLTREEPYQGHIPATIAFQIANKGQNLSIGDSCPDRWKKLMQDCWNLEPNFRPKFSTLAISFKQYAKEFKDTHLQRAPSKMAVKELYSECFADKTKEEFEKRFHDLYAGSGDINRKNRHSIAPETKARRLKHHKPKKADITGPTEVKHILSVQKDDKNFRVKTYDQSSTGGTLPRLNERQSTLSLSSPDLFHISNLISGSNTVGHSAHRISRKNAIRHKKNQHRMFESPVVSPTMDDSNTFSTIDNADEVDPNHSKESKKGGTLSRAWAKLPWNKRDSKEDHDERAVAGSISSRSSSTT
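
Printing raw lines is enough for a quick look. To get the records back as (ID, sequence) pairs, for example to check that the FASTA file matches the CSV, a small parser is useful. The sketch below assumes the ID is the first whitespace-delimited token after `>` and allows a sequence to span several lines. `read_fasta` is a hypothetical helper name.

```python
import io


def read_fasta(handle):
    """Yield (entry_id, sequence) pairs from an open FASTA text handle."""
    entry_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if entry_id is not None:
                yield entry_id, "".join(chunks)
            header = line[1:].split()
            # Keep only the first token of the header as the ID.
            entry_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if entry_id is not None:
        yield entry_id, "".join(chunks)


# Toy input: the first record has a comment and a two-line sequence.
demo = io.StringIO(">P1 example comment\nMAAS\nPYSI\n>P2\nMEQA\n")
parsed = list(read_fasta(demo))
print(parsed)
```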

## Obtaining ESM embeddings
ESM-1b is a large protein language model based on the Transformer architecture, pre-trained on a large corpus of unannotated protein sequences. Its embeddings serve as input features for downstream tasks such as enzyme function prediction. ESM-1b is open source, so users only need to pass the `.fasta` sequences from the previous step to the model to obtain the corresponding embeddings. Once the embeddings are packaged into a single `.pt` file, GloEC can use them for prediction. In the sample below, the file holds one 1280-dimensional vector per sequence.

In [5]:
esm_tensor = torch.load(root + 'input_sample.pt')
print(esm_tensor.size())
print("-----------detail--------------")
# Iterate over rows directly: one embedding vector per input sequence.
for embedding in esm_tensor:
    print(embedding)

torch.Size([10, 1280])
-----------detail--------------
tensor([-0.0317,  0.1434,  0.1174,  ..., -0.0356,  0.0135, -0.0783])
tensor([-0.0093,  0.1870, -0.0409,  ...,  0.0302, -0.0700,  0.2251])
tensor([-0.2415,  0.1000,  0.2506,  ...,  0.0966, -0.0440,  0.1334])
tensor([ 0.0399,  0.2734,  0.0106,  ..., -0.0436, -0.0718,  0.1517])
tensor([-0.0444,  0.1305, -0.1353,  ..., -0.1114, -0.2894, -0.0029])
tensor([-0.0412,  0.1405,  0.1725,  ..., -0.0949, -0.1410, -0.0593])
tensor([-0.0757,  0.2825,  0.0841,  ..., -0.1310, -0.0524,  0.1540])
tensor([ 0.0464,  0.2468,  0.0159,  ..., -0.0729, -0.1199, -0.0078])
tensor([0.0278, 0.2378, 0.0902,  ..., 0.0890, 0.0183, 0.0597])
tensor([-0.0605,  0.2491,  0.0573,  ..., -0.0993, -0.0678,  0.0439])
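
The tensor loaded above has shape `[10, 1280]`, one row per input sequence. As a minimal sketch of the packaging step, assuming you already have one 1-D embedding per entry and that rows should follow the order of entries in the input file (both are assumptions; GloEC's own packaging code is not shown here), the vectors could be stacked as follows. `pack_embeddings` and the toy vectors are hypothetical.

```python
import torch


def pack_embeddings(entry_ids, per_protein):
    """Stack one 1-D embedding per entry into an [N, D] tensor, in entry order."""
    missing = [e for e in entry_ids if e not in per_protein]
    if missing:
        raise KeyError(f"no embedding for entries: {missing}")
    return torch.stack([per_protein[e] for e in entry_ids])


# Toy stand-ins for real ESM-1b vectors, which would have 1280 dimensions.
toy = {"A0A067YMX8": torch.zeros(4), "A0A0K3AV08": torch.ones(4)}
packed = pack_embeddings(["A0A067YMX8", "A0A0K3AV08"], toy)
print(packed.shape)
```

The packed tensor can then be written with `torch.save(packed, path)` and read back with `torch.load`, as in the cell above.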


## Hyperparameters
The `Config` class from `config_util.py` holds the hyperparameters that may be needed. `batch_size` is the batch size, `learning_rate` is the initial learning rate, `landa` (lambda) controls the exponential decay of the learning rate, and so on. More hyperparameters can be found in the `config_util.py` source file.

In [6]:
config = Config()
def show_config(config):
    print("----------- Display hyperparameter ------------")
    print('batch_size: ' + str(config.batch_size))
    print('start lr: ' + str(config.learning_rate))
    print('landa: ' + str(config.landa))
    print('epoch: ' + str(config.epoch))
    print('use_hierar_penalty: ' + str(config.use_hierar_penalty))
    print('use_GCN: ' + str(config.use_GCN))
    print('gcn_layer: ' + str(config.gcn_layer))
    print("--------------------------------- ------------")
    
show_config(config)

using CPU training
----------- Display hyperparameter ------------
batch_size: 256
start lr: 0.001
landa: 0.8
epoch: 300
use_hierar_penalty: True
use_GCN: True
gcn_layer: 3
--------------------------------- ------------
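
To make the role of `landa` concrete, here is a minimal sketch of exponential learning-rate decay using the values printed above. It assumes the rate is multiplied by `landa` once per epoch; the exact point at which GloEC applies the decay is not shown in this document, and `decayed_lr` is a hypothetical helper.

```python
def decayed_lr(start_lr, landa, epoch):
    """Learning rate after `epoch` decay steps: start_lr * landa ** epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return start_lr * landa ** epoch


# Values taken from the hyperparameter printout: start lr 0.001, landa 0.8.
schedule = [decayed_lr(0.001, 0.8, epoch) for epoch in range(4)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

In PyTorch, `torch.optim.lr_scheduler.ExponentialLR` applies the same rule, with `gamma` playing the role of `landa`. Whether GloEC uses that scheduler is not stated here.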
