All the notebooks and files for the bacteriocin prediction paper
.ipynb_checkpoints
Baseline_pipeline.ipynb
Blast_baseline_bacteriocin_paper.ipynb
Identifying_gene_blocks_for_putative_bacteriocins.ipynb
Making_bacteria_wordvectors.ipynb
Making_negative_training_set.ipynb
Mining_keywords_for_Context_genes.ipynb
README.md
SVM_and_other_models.ipynb
all_curated_context_genes_proper.fa
all_lactobacillus_sequences_from_geneblocks
bidirectional_rnn.ipynb
final_NN_model.h5
primary_bacteriocin_training_set
rnn_only_predictions_119_putative_bacteriocins.fa
rnn_predictions_six_putative_bacteriocins.fa
second_bacteriocin_training_set
third_bacteriocin_training_set
word2vec_model_trembl_size_200.txt
word2vec_model_trembl_size_200_gensim

README.md

A companion to the manuscript "Bacteriocin prediction using Word Embedding with Deep Recurrent Neural Networks", by Md-Nafiz Hamid and Iddo Friedberg.

This repository contains the data and the code, in Jupyter notebooks, used in the work described in the manuscript. Brief descriptions of the files follow.

  • Baseline_pipeline.ipynb: Code to generate the baseline performance for bacteriocin prediction, corresponding to section 2.2.1 of the paper.
  • Identifying_gene_blocks_for_putative_bacteriocins.ipynb: Identifies 50 kb gene blocks for putative bacteriocins.
  • Making_bacteria_wordvectors.ipynb: Generates word vectors for each 3-gram using the UniProt TrEMBL bacteria database.
  • Making_negative_training_set.ipynb: Creates the primary, second, and third negative bacteriocin datasets from UniProt Swiss-Prot bacteria protein sequences.
  • SVM_and_other_models.ipynb: SVM and other model performance with the word2vec representation of amino acid sequences.
  • bidirectional_rnn.ipynb: Bidirectional RNN performance with the word2vec representation of amino acid sequences.
  • all_curated_context_genes_proper.fa: The 1,240 context genes curated using annotation keywords.
  • all_lactobacillus_sequences_from_geneblocks: All sequences from all of the 50 kb gene blocks.
  • final_NN_model.h5: Weights for the final RNN model used for bacteriocin prediction on the gene block sequences.
  • primary_bacteriocin_training_set: The primary negative bacteriocin set used to train the final model.
  • second_bacteriocin_training_set: The second negative bacteriocin set, used only to report additional performance results.
  • third_bacteriocin_training_set: The third negative bacteriocin set, used only to report additional performance results.
  • word2vec_model_trembl_size_200_gensim: word2vec vectors for each 3-gram in gensim binary format.
  • word2vec_model_trembl_size_200.txt: word2vec vectors for each 3-gram in txt format.
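
To make the "word2vec representation of amino acid sequences" mentioned above more concrete, here is a minimal sketch of how a protein sequence could be split into 3-grams and mapped to vectors. It is not taken from the notebooks. It assumes overlapping 3-grams with a stride of 1, a text format of one `<3-gram> <v1> <v2> ...` entry per line with an optional `<vocab_size> <dimension>` header, and summation as one possible way to obtain a fixed-length vector. The function names and the toy two-dimensional vectors are hypothetical; the vectors in this repository have 200 dimensions, and the notebooks may tokenize or aggregate differently.

```python
from typing import Dict, Iterable, List


def overlapping_trigrams(sequence: str) -> List[str]:
    """Split an amino acid sequence into overlapping 3-grams (stride 1).

    Assumption: overlapping 3-grams; the notebooks may tokenize differently.
    """
    seq = sequence.strip().upper()
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def load_text_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse lines of the form '<token> <v1> <v2> ...' into a dictionary.

    Assumption: a first line made of two integers is a
    '<vocab_size> <dimension>' header and is skipped.
    """
    vectors: Dict[str, List[float]] = {}
    for line_number, line in enumerate(lines):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # skip the optional header line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def embed_sequence(sequence: str, vectors: Dict[str, List[float]]) -> List[List[float]]:
    """Return one vector per 3-gram, in order, skipping 3-grams without a vector."""
    return [vectors[g] for g in overlapping_trigrams(sequence) if g in vectors]


def sum_vectors(rows: List[List[float]]) -> List[float]:
    """Element-wise sum of the per-3-gram vectors (one possible fixed-length summary)."""
    if not rows:
        return []
    return [sum(column) for column in zip(*rows)]


# Toy data standing in for the real 200-dimensional vectors (hypothetical values).
toy_lines = ["3 2", "MKT 1.0 0.0", "KTL 0.0 2.0", "TLL 0.5 0.5"]
vectors = load_text_vectors(toy_lines)

trigrams = overlapping_trigrams("MKTLL")
per_gram = embed_sequence("MKTLL", vectors)  # ordered vectors, e.g. as input to a sequence model
summed = sum_vectors(per_gram)               # single vector, e.g. as input to an SVM

print(trigrams)  # ['MKT', 'KTL', 'TLL']
print(summed)    # [1.5, 2.5]
```

To try this on the vectors in this repository, an open handle to word2vec_model_trembl_size_200.txt could be passed to `load_text_vectors` in place of `toy_lines`, provided the file follows the line layout assumed above.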