Immunoglobulin

The benchmark dataset was established in the following steps. First, protein sequences containing the nonstandard amino acid characters "B", "J", "O", "X", or "Z" were removed. Second, to avoid overfitting caused by homology bias and to reduce redundancy, the CD-HIT program was used with a 60% sequence identity cutoff to remove highly similar sequences. Finally, any protein sequence that was a subsequence of another protein was also removed. To limit the influence of differing protein sequence expression on the prediction results, only human, mouse, and rat samples were selected.
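The first and last filtering steps above can be sketched in a few lines of Python. This is only an illustration: the function names `has_nonstandard` and `filter_sequences` are hypothetical and are not part of this repository, and the CD-HIT redundancy step is assumed to be run separately with the external program.

```python
# Hypothetical helper names; this sketch is not the repository's own code.
NONSTANDARD = set("BJOXZ")


def has_nonstandard(seq):
    """Return True if the sequence contains B, J, O, X, or Z."""
    return any(ch in NONSTANDARD for ch in seq.upper())


def filter_sequences(records):
    """records maps a sequence name to its sequence; returns the kept entries."""
    # Step 1: drop sequences with nonstandard amino acid characters.
    clean = {name: seq.upper() for name, seq in records.items()
             if not has_nonstandard(seq)}
    # Step 2 (CD-HIT at 60% identity) is an external program and is skipped here.
    # Step 3: drop any sequence contained in a strictly longer sequence.
    # Exact duplicates are left alone; the CD-HIT step is assumed to handle them.
    kept = {}
    for name, seq in clean.items():
        is_subsequence = any(
            len(other) > len(seq) and seq in other
            for other_name, other in clean.items()
            if other_name != name
        )
        if not is_subsequence:
            kept[name] = seq
    return kept


if __name__ == "__main__":
    demo = {"a": "ACDEFG", "b": "CDE", "c": "ACXD", "d": "MKV"}
    print(sorted(filter_sequences(demo)))
```

In the demo, "b" is dropped because it occurs inside "a", and "c" is dropped because it contains "X".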
For the benchmark dataset, the sequence information of 109 immunoglobulins is stored in the ip.fasta file, and the sequence information of 119 non-immunoglobulins is stored in the in.fasta file.
To study the generalization ability of the model, 184 samples were randomly selected from the benchmark dataset for training. The remaining 44 sequences were used as an independent test set, comprising 22 immunoglobulin sequences and 22 non-immunoglobulin sequences.
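One way to reproduce a split of this shape is to hold out 22 sequences per class at random and train on the rest. The sketch below assumes that procedure; the function name `split_benchmark`, the fixed seed, and the per-class sampling are illustrative assumptions, since the exact selection procedure is not stated here.

```python
import random


def split_benchmark(positives, negatives, n_test_per_class=22, seed=0):
    """Return (train, test) lists of (sequence, label) pairs.

    Assumption: the test set is drawn per class so that it stays balanced.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n = n_test_per_class
    test = [(s, 1) for s in pos[:n]] + [(s, 0) for s in neg[:n]]
    train = [(s, 1) for s in pos[n:]] + [(s, 0) for s in neg[n:]]
    return train, test


if __name__ == "__main__":
    # Placeholder identifiers standing in for the 109 + 119 sequences.
    pos = ["ip_%d" % i for i in range(109)]
    neg = ["in_%d" % i for i in range(119)]
    train, test = split_benchmark(pos, neg)
    print(len(train), len(test))
```

With 109 positive and 119 negative sequences, this yields 184 training samples and 44 test samples, matching the counts above.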

monoTriKGap

monoTriKGap Theoretical Description:

When -kgap=n, (20) × (20 × 20 × 20) × n features are generated for each protein sequence.
When -kgap=1, the feature structure is X_XXX.
When -kgap=2, the feature structures are X_XXX and X__XXX.
When -kgap=3, the feature structures are X_XXX, X__XXX, and X___XXX.
X = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}. In this study, kgap=1.
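As a rough illustration of the feature layout described above, the following sketch counts every "one residue, gap, three residues" pattern for each gap length from 1 to kgap. It is not the code in main.py: the function name `mono_tri_kgap` is hypothetical, and the use of raw counts (rather than normalized frequencies) and the ordering of the features are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 x 20 = 8000 tripeptides in a fixed order.
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def mono_tri_kgap(seq, kgap=1):
    """Return a list of 20 * 8000 * kgap counts for one protein sequence."""
    seq = seq.upper()
    features = []
    for gap in range(1, kgap + 1):
        counts = {}
        # The pattern covers 1 + gap + 3 positions starting at index i.
        for i in range(len(seq) - gap - 3):
            key = (seq[i], seq[i + gap + 1:i + gap + 4])
            counts[key] = counts.get(key, 0) + 1
        # Emit counts in a fixed (residue, tripeptide) order for this gap.
        for mono in AMINO_ACIDS:
            for tri in TRIPEPTIDES:
                features.append(counts.get((mono, tri), 0))
    return features


if __name__ == "__main__":
    vec = mono_tri_kgap("ACDEF", kgap=1)
    print(len(vec), sum(vec))
```

For the toy sequence "ACDEF" with kgap=1, the only pattern present is A_DEF, so exactly one of the 160000 entries is nonzero.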

Command to generate the dataset using only the monoTriKGap method:

Run command

python main.py -fa=/home/gongxiaodou/Datasets/Protein/7.fasta -la=/home/gongxiaodou/Datasets/Protein/label.txt -full=1 -optimum=1 -f13=1 -kgap=1

Parameter description

full = 1 is the parameter setting for not saving the complete dataset, optimum = 1 is the setting for not saving the optimum dataset, and pseudo = 1 means that the corresponding feature is generated.

CC-PSSM

This method first uses the pssm.py file to align the input protein sequences with BLAST and obtain the PSSM matrix, then uses the cut-pssm.py file to extract the first 20 columns of the matrix, and finally uses the test_calcCCPSSM.py file to calculate the profile-based cross covariance (CC-PSSM) features of each protein sequence.
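The last two steps can be sketched as follows, using the cross covariance as it is commonly defined for profile-based features: for two different PSSM columns a and b and a lag lg, the products of the mean-centered scores at positions j and j + lg are averaged over the sequence. This is an assumption about the formula; the function name `cc_pssm` and the `max_lag` parameter are hypothetical, the lag used in this study is not stated here, and test_calcCCPSSM.py may differ in detail.

```python
def cc_pssm(pssm, max_lag=2):
    """Cross covariance features from a PSSM given as a list of rows.

    Each row holds the scores for one sequence position. Only the first
    20 columns are used, mirroring the column cut described above.
    """
    rows = [list(r[:20]) for r in pssm]
    length = len(rows)
    if length <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_cols = len(rows[0])
    # Mean score of each column over the whole sequence.
    means = [sum(r[c] for r in rows) / length for c in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for a in range(n_cols):
            for b in range(n_cols):
                if a == b:
                    continue  # cross covariance uses two different columns
                total = sum(
                    (rows[j][a] - means[a]) * (rows[j + lag][b] - means[b])
                    for j in range(length - lag)
                )
                features.append(total / (length - lag))
    return features


if __name__ == "__main__":
    toy = [[1, 0], [2, 1], [3, 5]]  # 3 positions, 2 columns, for illustration
    print(cc_pssm(toy, max_lag=1))
```

With 20 columns, this produces 20 × 19 × max_lag values per sequence.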
