Machine learning method for detecting the presence of pseudoknots.

Pseudoknow: a Method for Fast and Accurate Pseudoknot Detection


All code is written in Python. The required packages are Biopython, numpy, scipy, and scikit-learn. The code has two main parts: the first performs feature extraction, and the second implements the machine learning (ML) model. Feature extraction is handled by the pknowExtractFeatures.py script and the pknow.py module. To extract only the 103 features of Pseudoknow, run the following:

    python pknowExtractFeatures.py -w 100 -c conf.txt test.fasta

Several options can be set as needed, such as the window size and the confirmed features file. The feature file has the same name as the FASTA file, except that its extension is .features.
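The naming rule above can be sketched in a few lines. This is only an illustration, not code from Pseudoknow: the helper name `features_path` is hypothetical, and it assumes the FASTA extension is replaced (test.fasta becomes test.features) rather than appended to.

```python
from pathlib import Path


def features_path(fasta_path):
    """Return the assumed feature-file path for a FASTA file.

    Assumption: the original extension is replaced by ".features",
    so "test.fasta" maps to "test.features".
    """
    return Path(fasta_path).with_suffix(".features")


print(features_path("test.fasta"))  # test.features
```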

Although there is a separate script for feature extraction, the feature extraction and ML functions are also combined in the following scripts:

  1. pknowTrain.py

    -Purpose: Extract features, save the features, train a random forest ML model, and save the model.

    -Input: FASTA files of the PK- and PKF-structures (2 separate files). Optional inputs such as the window size, the confirmed features file, and the number of trees in the ML model.

    -Output: Features as a .features file. The trained model as rf_trained.model.

  2. pknowTrainTest.py

    -Purpose: Extract features for both training and testing, save both feature sets, train a random forest ML model, and save the model. It also prints the PK probability of each sequence in the test set.

    -Input: FASTA files of the PK- and PKF-structures (2 separate files). The FASTA file of the test structures. Optional inputs such as the window size, the confirmed features file, and the number of trees in the ML model.

    -Output: Features as .features files for both the training and testing sequences. The trained model as rf_trained.model. The PK probability of each test sequence is printed with its ID at the end of the training phase.

  3. pknowTest.py

    -Purpose: Test new sequence(s) using the trained model. The benefit of this script is that any sequence can be tested with a model that has already been trained. For example, you can run the following to test:

    python pknowTest.py -w 100 -c conf.txt pkfs_PDB.fasta # For the PDB PKF-structures

    python pknowTest.py -w 100 -c conf.txt pks_PDB.fasta # For the PDB PK-structures

    -Input: The trained model and the FASTA file of the test sequences. Optional inputs can also be passed in as desired.

    -Output: Features as a .features file for the test sequences. The PK probability of each sequence is printed with its ID.
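
The train, save, and test workflow described by the three scripts above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the actual Pseudoknow code: the feature matrices are random placeholders instead of real .features contents, the label coding (1 = PK, 0 = PKF) and the sequence IDs are hypothetical, and pickle is only a guess at how rf_trained.model is serialized.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_FEATURES = 103  # number of Pseudoknow features stated in this README

# Placeholder data: random numbers standing in for extracted features.
rng = np.random.default_rng(0)
X_pk = rng.normal(loc=1.0, size=(40, N_FEATURES))    # stand-in for PK-structures
X_pkf = rng.normal(loc=-1.0, size=(40, N_FEATURES))  # stand-in for PKF-structures
X_train = np.vstack([X_pk, X_pkf])
# Assumed label coding: 1 = PK, 0 = PKF.
y_train = np.array([1] * len(X_pk) + [0] * len(X_pkf))

# Train the random forest; n_estimators is the number of trees.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Save and reload the model. A temporary directory keeps the sketch
# free of side effects; pickle is an assumed storage format.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "rf_trained.model"
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    with open(model_path, "rb") as fh:
        loaded = pickle.load(fh)

# Score new sequences with the reloaded model.
test_ids = ["seq1", "seq2"]  # hypothetical sequence IDs
X_test = rng.normal(size=(len(test_ids), N_FEATURES))

# predict_proba returns one column per class, ordered as in classes_,
# so look up the column that belongs to the PK label.
pk_column = list(loaded.classes_).index(1)
pk_prob = loaded.predict_proba(X_test)[:, pk_column]

for seq_id, prob in zip(test_ids, pk_prob):
    print(f"{seq_id}\t{prob:.3f}")
```

Looking up the PK column through `classes_` avoids relying on the order in which the labels happened to be encoded.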