## Examples of the `bi-python-kit` module in action

### Contents

[✨ Function examples](#✨-Function-examples)  
- [OpenFasta](#OpenFasta)
- [run_genscan](#run_genscan)
- [DNASequence](#DNASequence)
- [RNASequence](#RNASequence)
- [AminoAcidSequence](#AminoAcidSequence)
- [convert_multiline_fasta_to_oneline](#convert_multiline_fasta_to_oneline)
  
[🌲 Checking correct parallelization of threads in RandomForestClassifierCustom](#🌲-Checking-correct-parallelization-of-threads-in-RandomForestClassifierCustom)  
- [Checking fit function](#Checking-fit-function)  
- [Checking predict function](#Checking-predict-function)  
- [Comparison of predictions](#Comparison-of-predictions)

In [1]:
import time

import numpy as np
import pandas as pd

from sklearn.datasets import make_classification

from bi_python_kit import (DNASequence, RNASequence, AminoAcidSequence,
                           run_genscan, GenscanOutput)
from bio_files_processor import (convert_multiline_fasta_to_oneline,
                                 OpenFasta)
from custom_random_forest import RandomForestClassifierCustom

## ✨ Function examples

### `OpenFasta`

In [23]:
fasta_filename = 'data/example_open_fasta.fasta'

with OpenFasta(fasta_filename) as fasta_file:
    for record in fasta_file:
        print(record)

>GTD323452 5S_rRNA NODE_272_length_223_cov_0.720238:18-129(+)
ACGGCCATAGGACTTTGAAAGCACCGCATCCCGTCCGATCTGCGAAGTTAACCAAGATGCCGCCTGGTTAGTACCATGGTGGGGGACCACATGGGAATCCCTGGTGCTGTG
>GTD678345 16S_rRNA NODE_80_length_720_cov_1.094737:313-719(+)
TTGGCTTCTTAGAGGGACTTTTGATGTTTAATCAAAGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGAGCCCTTGGGAGTGGTCCATTTGAGCCGGCAACGGCACGTTTGGACTGCAAACTTGGGCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGT
>GTD174893 16S_rRNA NODE_1_length_2558431_cov_75.185164:2153860-2155398(+)
TTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAACAGCTTGCTGTTTCGCTGACGAGTGGGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTT
>GTD906783 16S_rRNA NODE_1_length_2558431_cov_75.185164:793941-795479(-)
TTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAACAGCTTGCTGTTTCGCTGACGAGTGGGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAACCGTAGGGGAACCTGCGGTT

In [24]:
with OpenFasta(fasta_filename) as fasta_file:
    single_record = fasta_file.read_record()
    print(single_record)

>GTD323452 5S_rRNA NODE_272_length_223_cov_0.720238:18-129(+)
ACGGCCATAGGACTTTGAAAGCACCGCATCCCGTCCGATCTGCGAAGTTAACCAAGATGCCGCCTGGTTAGTACCATGGTGGGGGACCACATGGGAATCCCTGGTGCTGTG


### `run_genscan`

In [25]:
result = run_genscan(sequence_file="data/example_genscan.fa")
print(f'Status of your request: {result.status}')

Status of your request: 200


In [26]:
print('Predicted protein sequences in your data:\n')
for cds in result.cds_list:
    print(cds)

Predicted protein sequences in your data:

MDVVDSLLVNGSNITPPCELGLENETLFCLDQPRPSKEWQPAVQILLYSLIFLLSVLGNTLVITVLIRNKRMRTVTNIFLLSLAVSDLMLCLFCMPFNLIPNLLKDFIFGSAVCKTTTYFMGTSVSVSTFNLVAISLERYGAICKPLQSRVWQTKSHALKVIAATWCLSFTIMTPYPIYSNLVPFTKNNNQTANMCRFLLPNDVMQQSWHTFLLLILFLIPGIVMMVAYGLISLELYQGIKFEASQKKSAKERKPSTTSSGKYEDSDGCYLQKTRPPRKLELRQLSTGSSSRANRIRSNSSAANLMAKKRVIRMLIVIVVLFFLCWMPIFSANAWRAYDTASAERRLSGTPISFILLLSYTSSCVNPIIYCFMNKRFRLGFMATFPCCPNPGPPGARGEVGEEEEGGTTGASLSRFSYSHMSASVPPHEMSPDPPPQKEGREEAEKKERKKRSGREGAELMEKEGSISSGNSSX


In [27]:
print('Predicted exons in your data:\n')
exons = pd.DataFrame(result.exon_dict.items(), columns=['exons', 'boundaries'])
print(exons.to_string(index=False))

Predicted exons in your data:

    exons   boundaries
Exon 1.01   (276, 387)
Exon 1.02 (1059, 1310)
Exon 1.03 (4645, 4906)
Exon 1.04 (7260, 7387)
Exon 1.05 (8373, 8901)
Exon 1.06 (8903, 9040)


In [28]:
print('Predicted introns in your data:\n')
introns = pd.DataFrame(result.intron_dict.items(), columns=['introns', 'boundaries'])
print(introns.to_string(index=False))

Predicted introns in your data:

    introns   boundaries
Intron 1.01  (388, 1058)
Intron 1.02 (1311, 4644)
Intron 1.03 (4907, 7259)
Intron 1.04 (7388, 8372)
Intron 1.05 (8902, 8902)
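
The intron boundaries above fill exactly the gaps between consecutive exons: each intron starts one position after an exon ends and stops one position before the next exon begins. The sketch below reproduces that relationship from the exon table. `introns_between` is a hypothetical helper written for illustration; the notebook does not show how `run_genscan` builds `intron_dict`, so this should not be read as its implementation.

```python
# Exon boundaries copied from the output above (1-based, inclusive).
exon_bounds = {
    "Exon 1.01": (276, 387),
    "Exon 1.02": (1059, 1310),
    "Exon 1.03": (4645, 4906),
    "Exon 1.04": (7260, 7387),
    "Exon 1.05": (8373, 8901),
    "Exon 1.06": (8903, 9040),
}


def introns_between(exons):
    """Return the gaps between consecutive exons as (start, end) tuples.

    Hypothetical helper: assumes non-overlapping exons on one strand.
    """
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


derived_introns = introns_between(exon_bounds.values())
for number, bounds in enumerate(derived_introns, start=1):
    print(f"Intron 1.{number:02d} {bounds}")
```

The single-base intron `(8902, 8902)` follows from exons 1.05 and 1.06 being separated by one position.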


### `DNASequence`

In [29]:
dna_sequence = DNASequence("ATGC")
dna_gc_content = dna_sequence.gc_content()
print(dna_gc_content)

50.0


In [30]:
dna_sequence = DNASequence("ATGC")
dna_transcribe = dna_sequence.transcribe()
print(dna_transcribe)

AUGC
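
For readers who want to check the two results above by hand, here is a minimal standalone sketch of both operations. The function names `gc_content` and `transcribe` mirror the method names used in the notebook, but the bodies are assumptions written for illustration, not the code of `DNASequence`.

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # assumption: treat an empty sequence as 0 % GC
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)


def transcribe(dna: str) -> str:
    """Replace thymine with uracil, preserving letter case."""
    return dna.translate(str.maketrans("Tt", "Uu"))


print(gc_content("ATGC"))  # 2 of 4 bases are G or C
print(transcribe("ATGC"))
```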


### `RNASequence`

In [31]:
rna_sequence = RNASequence("AUGC")
rna_complement = rna_sequence.complement()
print(rna_complement)

UACG


In [32]:
rna_sequence = RNASequence("ATGC")  # invalid RNA sequence: contains T instead of U
rna_complement = rna_sequence.complement()
print(rna_complement)

SequenceError: Operation cannot be performed: incorrect sequence.
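
The two cells above show a valid RNA sequence being complemented and an invalid one being rejected. A minimal sketch of that validate-then-complement pattern might look as follows. The local `SequenceError` class and the `rna_complement` function are stand-ins defined here for illustration; how `RNASequence` validates its input internally is an assumption.

```python
class SequenceError(ValueError):
    """Stand-in for the library's exception, defined locally for this sketch."""


RNA_ALPHABET = set("AUGCaugc")
RNA_COMPLEMENT = str.maketrans("AUGCaugc", "UACGuacg")


def rna_complement(sequence: str) -> str:
    """Return the complementary RNA strand, rejecting non-RNA characters."""
    if not set(sequence) <= RNA_ALPHABET:
        raise SequenceError("Operation cannot be performed: incorrect sequence.")
    return sequence.translate(RNA_COMPLEMENT)


print(rna_complement("AUGC"))

try:
    rna_complement("ATGC")  # T is not part of the RNA alphabet
except SequenceError as error:
    print(f"SequenceError: {error}")
```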

### `AminoAcidSequence`

In [33]:
protein_sequence = AminoAcidSequence("ARNDCHGQV")
molecular_weight = protein_sequence.calculate_molecular_weight()

print(f"Molecular Weight: {molecular_weight} Da")

Molecular Weight: 981.047 Da


### `convert_multiline_fasta_to_oneline`

In [34]:
with open("data/example_multiline_fasta.fasta", 'r') as file:
    file_contents = file.read()
    print(file_contents)

>5S_rRNA::NODE_272_length_223_cov_0.720238:18-129(+)
ACGGCCATAGGACTTTGAAAGCACCGCATCCCGTCCGATCTGCGAAGTTAACCAAGATGCCGCC
GGTGCTGTG
>16S_rRNA::NODE_80_length_720_cov_1.094737:313-719(+)
TTGGCTTCTTAGAGGGACTTTTGATGTTTAATCAAAGGAAGTTTGAGGCAATAACAGGTCTGTG
GACAAAGTCAACGAGTTTTATTATTATTCCTTTATTGAAAAATATGGGT


In [35]:
convert_multiline_fasta_to_oneline("data/example_multiline_fasta.fasta",
                                   output_fasta="data/example_oneline_fasta.fasta")

In [36]:
with open("data/example_oneline_fasta.fasta", 'r') as file:
    file_contents = file.read()
    print(file_contents)

>5S_rRNA::NODE_272_length_223_cov_0.720238:18-129(+)
ACGGCCATAGGACTTTGAAAGCACCGCATCCCGTCCGATCTGCGAAGTTAACCAAGATGCCGCCGGTGCTGTG
>16S_rRNA::NODE_80_length_720_cov_1.094737:313-719(+)
TTGGCTTCTTAGAGGGACTTTTGATGTTTAATCAAAGGAAGTTTGAGGCAATAACAGGTCTGTGGACAAAGTCAACGAGTTTTATTATTATTCCTTTATTGAAAAATATGGGT
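
The conversion shown above keeps every header line and joins the sequence lines that follow it into one line. A minimal sketch of that idea, working on any iterable of lines, is given below. `join_fasta_lines` is a hypothetical helper, and the two short records are made-up example data; neither comes from `bio_files_processor`.

```python
import io


def join_fasta_lines(lines):
    """Yield FASTA lines with each record's sequence joined into one line."""
    sequence_parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if sequence_parts:
                # A new header closes the previous record's sequence.
                yield "".join(sequence_parts)
                sequence_parts = []
            yield line
        else:
            sequence_parts.append(line)
    if sequence_parts:
        yield "".join(sequence_parts)  # flush the last record


multiline = io.StringIO(">seq1 example\nACGT\nTTGA\n>seq2 example\nGGCC\nAA\n")
oneline = list(join_fasta_lines(multiline))
print("\n".join(oneline))
```

Because the helper is a generator, it can process a large file line by line without loading it into memory.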


## 🌲 Checking correct parallelization of threads in `RandomForestClassifierCustom`

In [2]:
X, y = make_classification(n_samples=100000)
random_forest = RandomForestClassifierCustom(max_depth=30, n_estimators=10,
                                             max_features=2, random_state=42)

### Checking `fit` function

In [7]:
%%time

# 1 thread
fit_one_thread = random_forest.fit(X, y, n_jobs=1)

CPU times: user 6.51 s, sys: 4.58 ms, total: 6.51 s
Wall time: 6.51 s


In [8]:
%%time

# 2 threads
fit_two_thread = random_forest.fit(X, y, n_jobs=2)

CPU times: user 7 s, sys: 344 µs, total: 7 s
Wall time: 3.52 s


### Checking `predict` function

In [9]:
%%time

# 1 thread
predictions_one_thread = random_forest.predict(X, n_jobs=1)

CPU times: user 136 ms, sys: 21 µs, total: 136 ms
Wall time: 134 ms


In [10]:
%%time

# 2 threads
predictions_two_thread = random_forest.predict(X, n_jobs=2)

CPU times: user 146 ms, sys: 397 µs, total: 146 ms
Wall time: 78 ms
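
The wall times above can be turned into a speedup and a parallel efficiency (speedup divided by the number of threads). The sketch below uses the four wall times reported in this notebook; they come from single runs, so the resulting figures are rough indications rather than benchmarks. The helper names are illustrative.

```python
def speedup(wall_one_thread: float, wall_n_threads: float) -> float:
    """How many times faster the multi-threaded run finished."""
    return wall_one_thread / wall_n_threads


def efficiency(wall_one_thread: float, wall_n_threads: float, n_threads: int) -> float:
    """Fraction of the ideal n-fold speedup that was achieved."""
    return speedup(wall_one_thread, wall_n_threads) / n_threads


# Wall times in seconds, copied from the %%time outputs above.
timings = {"fit": (6.51, 3.52), "predict": (0.134, 0.078)}

for name, (one_thread, two_threads) in timings.items():
    print(f"{name}: speedup {speedup(one_thread, two_threads):.2f}x, "
          f"efficiency {efficiency(one_thread, two_threads, 2):.0%}")
```

With two threads, `fit` finishes about 1.85 times faster and `predict` about 1.72 times faster, while total CPU time stays roughly the same. That pattern is what one would expect when the work is split across threads rather than repeated.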


### Comparison of predictions

In [11]:
predictions_match = np.array_equal(predictions_one_thread, predictions_two_thread)
print("The resulting predictions coincide:", predictions_match)

The resulting predictions coincide: True
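
One common way to get identical results regardless of `n_jobs` is to give every tree its own seed derived from `random_state` and the tree index, so the outcome does not depend on which thread handles which tree. The sketch below illustrates this with feature sampling only. It is an assumption about a typical design, not a description of `RandomForestClassifierCustom`, and both function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def pick_features(seed, n_features=20, max_features=2):
    """Choose a feature subset using a generator private to this tree."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_features), max_features))


def pick_for_all_trees(n_estimators, random_state, n_jobs):
    """Draw one feature subset per tree, spreading the work over n_jobs threads."""
    # Each seed depends only on the tree index, never on thread scheduling.
    seeds = [random_state + tree_index for tree_index in range(n_estimators)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map returns results in input order.
        return list(pool.map(pick_features, seeds))


one_thread = pick_for_all_trees(n_estimators=10, random_state=42, n_jobs=1)
two_threads = pick_for_all_trees(n_estimators=10, random_state=42, n_jobs=2)
print("Feature subsets coincide:", one_thread == two_threads)
```

Sharing a single global random generator between threads would instead make the draws depend on scheduling order, which is why per-tree seeding is the usual choice.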
