# NORTH: a highly accurate and scalable Naive Bayes based ORTHologous gene cluster prediction algorithm

# Importing Libraries

Importing the necessary modules

In [None]:
import os
import pickle

import numpy as np
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
from math import ceil, log10
from Bio import SeqIO
from sklearn.metrics import classification_report, confusion_matrix

# Reading Clusters

To use NORTH, each orthologous cluster must be saved as a separate FASTA file. The name of the FASTA file is the name of the orthologous cluster. All of these FASTA files need to be placed together in one directory (for an example, see the *sample_clusters* directory).

The *readFastas* function from **ReadFastas.py** traverses a given directory and reads all the orthologous clusters.

However, the *readFastas* function ignores the file Outliers.fasta.



In [None]:
from ReadFastas import readFastas

readFastas(data_path='sample_clusters', save_file_name='sample_data.p', n_clusters=5, min_cnt=None, K=5)
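The directory layout described above can be checked before calling *readFastas* with a short standard-library sketch. This is not the code in **ReadFastas.py**: the helper names `list_cluster_files` and `count_sequences`, the `.fasta` extension filter, and the cluster names and sequences in the demonstration are all assumptions made for illustration.

```python
import os
import tempfile


def list_cluster_files(data_path, skip=("Outliers.fasta",)):
    """Map cluster name -> FASTA path, skipping the files listed in `skip`.

    Hypothetical helper for illustration; not part of ReadFastas.py.
    """
    clusters = {}
    for file_name in sorted(os.listdir(data_path)):
        if not file_name.endswith(".fasta") or file_name in skip:
            continue
        # The file name (without extension) is taken as the cluster name.
        cluster_name = os.path.splitext(file_name)[0]
        clusters[cluster_name] = os.path.join(data_path, file_name)
    return clusters


def count_sequences(fasta_path):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demonstration on a throwaway directory with made-up cluster files.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_files = {
        "K00001.fasta": ">seq1\nMKTAYIAK\n>seq2\nMKTAYLAK\n",
        "K00002.fasta": ">seq3\nMSLLTEVE\n",
        "Outliers.fasta": ">seq4\nMAAAAAAA\n",
    }
    for name, text in demo_files.items():
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write(text)

    found = list_cluster_files(demo_dir)
    sizes = {name: count_sequences(path) for name, path in found.items()}

print(sizes)
```

Pointing `list_cluster_files` at the real cluster directory gives a quick view of which files would be treated as clusters under these assumptions, and how many sequences each one holds.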

# Using the Naive Bayes Model

The model is described in **ScalableNaiveBayes.py**

## Load the data

First, we need to load the data and format it in a manner convenient for the model.

The clusters read by *readFastas()* are loaded in this stage.

The function *formatData* then formats the cluster data for training.



In [None]:
from OrthologClustering import formatData

# Load the data saved by the readFastas() function
with open('sample_data.p', 'rb') as data_file:
    (X, Y, v_size) = pickle.load(data_file)

# Prepare the data for training
(documents, total) = formatData(X, Y)

## Creating a model

Next, we need to initialize the Naive Bayes model with parameters such as v_size, total_data and n_child. More details on the parameters can be found in **ScalableNaiveBayes.py**.

In [None]:
from ScalableNaiveBayes import ScalableNaiveBayes

nb = ScalableNaiveBayes(model_id='dummy_model', v_size=v_size, total_data=total, n_child=1) # initializing the naive bayes model

## Train the model

After initializing the model and loading the data, the model can be trained as follows.

In [None]:
nb.train(documents, K=5)

## Make prediction with the model

After training the model, we can make predictions as follows.


In [None]:
YP = nb.predict([X[0]], proba=False)

print(YP)
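For intuition about what training and prediction involve, here is a minimal sketch of a multinomial Naive Bayes classifier over k-mer counts. It rests on assumptions not confirmed by this notebook: that `K` denotes a k-mer length and that each sequence is treated as a bag of overlapping k-mers. The class `TinyKmerNaiveBayes` and the toy sequences are hypothetical, and the code is not the implementation in **ScalableNaiveBayes.py**.

```python
from collections import Counter
from math import log


def kmers(sequence, k):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


class TinyKmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts with add-one smoothing.

    Illustrative only; not the ScalableNaiveBayes class used above.
    """

    def __init__(self, k):
        self.k = k
        self.class_counts = Counter()  # number of training sequences per cluster
        self.kmer_counts = {}          # cluster -> Counter of k-mers
        self.vocabulary = set()        # all k-mers seen during training

    def train(self, sequences, labels):
        for sequence, label in zip(sequences, labels):
            self.class_counts[label] += 1
            counts = self.kmer_counts.setdefault(label, Counter())
            counts.update(kmers(sequence, self.k))
        for counts in self.kmer_counts.values():
            self.vocabulary.update(counts)

    def log_scores(self, sequence):
        """Return log prior + summed log likelihoods for every cluster."""
        total_sequences = sum(self.class_counts.values())
        vocabulary_size = len(self.vocabulary)
        scores = {}
        for label, counts in self.kmer_counts.items():
            total_kmers = sum(counts.values())
            score = log(self.class_counts[label] / total_sequences)
            for kmer in kmers(sequence, self.k):
                # Add-one smoothing keeps unseen k-mers from zeroing the score.
                score += log((counts[kmer] + 1) / (total_kmers + vocabulary_size))
            scores[label] = score
        return scores

    def predict(self, sequence):
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)


# Made-up toy data: two clusters with two short sequences each.
train_sequences = ["MKTAYIAKQR", "MKTAYLAKQR", "GSLLTEVETP", "GSLLTEVQTP"]
train_labels = ["A", "A", "B", "B"]

model = TinyKmerNaiveBayes(k=3)
model.train(train_sequences, train_labels)
prediction = model.predict("MKTAYIAKQK")
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long sequences.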

# Stratified 10 Fold Cross Validation

Here a simple example of stratified 10-fold cross validation is shown. In the paper we used the 250 largest orthologous clusters from KEGG, but for simplicity 5 arbitrary clusters (provided in sample_clusters) are used here.

## Performing Cross Validation

The steps of dividing the data into folds, training the model and testing it on the validation set are encapsulated in the *stratifiedKFoldCrossValidation* function of **OrthologClustering.py**.

In [None]:
from OrthologClustering import stratifiedKFoldCrossValidation

stratifiedKFoldCrossValidation('sample_data.p', K=5, n_clusters=5, id='test', n_child=2)
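To illustrate what stratification means, the following sketch splits sample indices into folds so that each fold keeps roughly the same proportion of every cluster. It is a generic standard-library illustration, not the code inside *stratifiedKFoldCrossValidation*; the function name `stratified_folds`, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign each sample index to a fold while preserving the label mix."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share per label.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Made-up labels: 20 sequences from cluster "A" and 10 from cluster "B".
labels = ["A"] * 20 + ["B"] * 10
folds = stratified_folds(labels, n_folds=10)

for fold_id, test_indices in enumerate(folds):
    held_out = set(test_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    # A model would be trained on train_indices and evaluated on test_indices here.
    print(fold_id, sorted(labels[i] for i in test_indices))
```

With these toy labels every fold holds two "A" samples and one "B" sample, so each validation set mirrors the 2:1 class ratio of the full data.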

## Analyzing Results

The results of the cross validation test can be analyzed as follows.

In [None]:
from OrthologClustering import classificationPerformance

classificationPerformance(result_file='ST10FoldResults-test.p')
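As a reminder of how per-cluster metrics follow from true and predicted labels, here is a small standard-library sketch that computes precision and recall for each cluster. It does not read the result file and is not the code inside *classificationPerformance*; the function names and the toy labels are assumptions. The `confusion_matrix` and `classification_report` functions from `sklearn.metrics`, imported at the top of this notebook, compute the same kind of quantities.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall)} for every label that appears."""
    pairs = confusion_counts(y_true, y_pred)
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        true_positives = pairs[(label, label)]
        predicted = sum(n for (_, pred), n in pairs.items() if pred == label)
        actual = sum(n for (true, _), n in pairs.items() if true == label)
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up labels for three clusters.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]

scores = per_class_scores(y_true, y_pred)
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy example one "A" sequence is assigned to cluster "B", which lowers the recall of "A" and the precision of "B" while leaving "C" unaffected.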

## Plotting Confusion Matrix

In [None]:
from OrthologClustering import plotConfusionMatrix

plotConfusionMatrix(result_file='ST10FoldResults-test.p')

# Outlier Detection

In [None]:
# TODO