# DNA CLASSIFICATION
In this project, we will explore the field of bioinformatics.

For this project, we will use a dataset from the UCI Machine Learning Repository that has 106 DNA sequences, each with 57 sequential nucleotides.

We will learn how to import data from the UCI repository, convert the text sequences to numerical data, train several classification algorithms, and compare and contrast their performance.

## Introduction to DNA Classifier

### Getting the data

In [None]:
# Import libraries
import sys
import numpy as np
import sklearn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [None]:
# import the UCI Molecular Biology (Promoter Gene Sequences) Data Set
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data"

In [None]:
names = ["Class", "id", "Sequence"]

In [None]:
data = pd.read_csv(url, names = names)

In [None]:
data

## Exploring data


In [None]:
# Build our dataset by creating a custom pandas DataFrame
# Each column in a DataFrame is a Series; pull out the class labels first
classes = data.loc[:,"Class"]

In [None]:
print(classes)

**A promoter is a sequence of DNA needed to turn a gene on or off.** The process of transcription is initiated at the promoter. Usually found near the beginning of a gene, the promoter has a binding site for the enzyme used to make a messenger RNA (mRNA) molecule.

Source: [Genome - Genetic Glossary - Promoter](https://www.genome.gov/genetics-glossary/Promoter)

According to this dataset:
* "+" marks a promoter sequence
* "-" marks a sequence that is not a promoter
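
As a quick sanity check of the labels, the class balance can be tallied with `value_counts()`. The following is a minimal sketch on a hypothetical four-row sample (the `sample` frame is made up for illustration and is not taken from the dataset); on the real data the same call would be applied to `data["Class"]`.

```python
import pandas as pd

# hypothetical toy sample using the same "+" / "-" labels as the dataset
sample = pd.DataFrame({"Class": ["+", "+", "-", "+"]})

# count how many rows carry each label
class_counts = sample["Class"].value_counts()
print(class_counts)
```

A strongly uneven count here would be a reason to treat plain accuracy with caution later on.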



## Generating a DNA sequence

### Data preparation

**A nucleotide is the basic building block of nucleic acids.**

RNA and DNA are polymers made of long chains of nucleotides.

A nucleotide consists of a sugar molecule (either ribose in RNA or deoxyribose in DNA) attached to a phosphate group and a nitrogen-containing base.

The bases in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T). In RNA, the base uracil (U) takes the place of thymine.

Source: [Genome - Genetic Glossary - Nucleotide](https://www.genome.gov/genetics-glossary/Nucleotide)

---
Sequencing DNA means determining the order of the four chemical building blocks - called "bases" - that make up the DNA molecule. **The sequence tells scientists the kind of genetic information that is carried in a particular DNA segment.**

Source: [DNA Sequencing Fact Sheet](https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Fact-Sheet)
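
To make the idea of a sequence of bases concrete, here is a small sketch that counts the bases in a string and checks that it contains only the four DNA letters. The `sequence` value is a hypothetical lowercase example written for illustration, and the variable names are assumptions, not part of the notebook.

```python
from collections import Counter

# hypothetical short DNA sequence, one character per base
sequence = "tactagcaat"

# tally how often each base occurs
base_counts = Counter(sequence)

# a DNA sequence should only use the four bases a, c, g, t
valid_bases = set("acgt")
is_valid = set(sequence) <= valid_bases

print(dict(base_counts), is_valid)
```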

In [None]:
# generate list of DNA sequence
sequences = list(data.loc[: , "Sequence"])
dataset = {}

# loop through sequences and split into individual nucleotides
for i, seq in enumerate(sequences):

  # split into nucleotides, remove tab characters
  nucleotides = list(seq)
  nucleotides = [x for x in nucleotides if x != '\t']

  # append class assignment
  nucleotides.append(classes[i])

  # add to dataset
  dataset[i] = nucleotides

print(dataset[0])
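
The loop above can also be expressed with pandas string methods, which yields one row per sequence directly and avoids the later transpose. This is only a sketch of an alternative, shown on a hypothetical two-row sample that mimics the tab-prefixed raw format; the names `raw`, `cleaned`, and `nucleotide_df` are made up for illustration.

```python
import pandas as pd

# hypothetical two-row sample mimicking the raw format (tab-prefixed sequences)
raw = pd.DataFrame({
    "Class": ["+", "-"],
    "Sequence": ["\t\ttact", "\t\tgcaa"],
})

# strip tab characters, then split each string into one column per nucleotide
cleaned = raw["Sequence"].str.replace("\t", "", regex=False)
nucleotide_df = pd.DataFrame([list(s) for s in cleaned])

# attach the class label as the last column
nucleotide_df["Class"] = raw["Class"].values
print(nucleotide_df)
```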

In [None]:
# turn dataset into pandas DataFrame
dframe = pd.DataFrame(dataset)

# display the DataFrame; at this point each column holds one sequence
dframe

In [None]:
# transpose the dataframe
df = dframe.transpose()

# display the first five rows; each row is now one sequence
df.iloc[:5]

In [None]:
# rename the last column as class

df.rename(columns = {57 : "Class"}, inplace = True)
df

In [None]:
# Looks good. Let's familiarize ourselves with the dataset so we can pick the most suitable algorithms

df.describe()

In [None]:
# Target Column Visualization

def visualize_target(plot, feature):
  # annotate each bar of a count plot with its share of all rows
  total = len(feature)
  for p in plot.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height()/total)
    x = p.get_x() + p.get_width() / 2 - 0.05
    y = p.get_y() + p.get_height()
    # annotate the axes passed in, not a global variable
    plot.annotate(percentage, (x, y), size = 12)

In [None]:
plt.figure(figsize=(7,5))
ax = sns.countplot(x="Class", data = df)
plt.xticks(size = 12)
plt.xlabel("Promoters or Non Promoters")
plt.yticks(size = 12)
plt.ylabel("count", size = 12)

visualize_target(ax, df.Class)
# save before showing; plt.show() may clear the figure and leave a blank file
plt.savefig("target_histogram")
plt.show()

In [None]:
# describe() doesn't tell us enough since the attributes are text. Let's record value counts for each sequence position

series = []

for name in df.columns:
  series.append(df[name].value_counts())

info = pd.DataFrame(series)
details = info.transpose()
details

In [None]:
# Unfortunately, we can't run machine learning algorithms on data in string format, so we need to convert it to numerical data.
# This can easily be accomplished with the pd.get_dummies() function
numerical_df = pd.get_dummies(df)
numerical_df.head()
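
To see what `pd.get_dummies()` does to text columns, here is a minimal sketch on a hypothetical two-row frame. The column names `pos0` and `pos1` are made up for illustration; in the notebook the columns are the integer positions 0 to 56 plus `Class`.

```python
import pandas as pd

# hypothetical toy frame: two sequence positions, two rows
toy = pd.DataFrame({"pos0": ["a", "c"], "pos1": ["g", "g"]})

# each (column, value) pair becomes its own indicator column
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Each original column expands into one indicator column per distinct value it contains, which is why the encoded frame is much wider than the original.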

In [None]:
# We don't need both class columns. Let's drop one, then rename the other to simply 'Class'.
df = numerical_df.drop(columns=['Class_-'])

df.rename(columns = {'Class_+': 'Class'}, inplace = True)
df.head()

## Modeling

### Splitting the dataset into training set and test set

In [None]:
# Use the model_selection module to separate training and testing datasets
from sklearn import model_selection

# Create X and Y datasets for training
X = np.array(df.drop(['Class'], axis=1))
y = np.array(df['Class'])

# define seed for reproducibility
seed = 1

# split data into training and testing datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)
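
With a dataset this small, a purely random split can leave the test set with an uneven class ratio. One option, not used in the notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both parts. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical balanced toy data: 20 rows, 3 features, 10 rows per class
X_toy = np.arange(60).reshape(20, 3)
y_toy = np.array([0, 1] * 10)

# stratify=y_toy keeps the class ratio similar in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

print(len(y_tr), len(y_te), int(y_te.sum()))
```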

In [None]:
# Now that we have our dataset, we can start building algorithms! We'll need to import each algorithm we plan on using from sklearn.  
# We also need to import some performance metrics, such as accuracy_score and classification_report.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# define scoring method
scoring = 'accuracy'

# define the models to train
# (model_names avoids reusing the `names` list, which is reset below to collect results)
model_names = ["Nearest Neighbors", "Gaussian Process","Decision Tree","Random Forest",
         "Neural Net", "AdaBoost","Naive Bayes","SVM Linear","SVM RBF","SVM Sigmoid"]
classifiers = [
    KNeighborsClassifier(n_neighbors=3),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10,max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel='linear'),
    SVC(kernel='rbf'),
    SVC(kernel='sigmoid')
]
models = zip(model_names, classifiers)

# evaluate each model in turn
results = []
names=[]
accuracy = []
for name,model in models:
    kfold = model_selection.KFold(n_splits = 10,shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train,y_train,cv=kfold,scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg= "%s: %f (%f)" %(name, cv_results.mean(), cv_results.std())
    
    print(msg)
    model.fit(X_train,y_train)
    predictions= model.predict(X_test)
    print(name)
    print( classification_report(y_test,predictions))
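
The loop above prints a full classification report per model. To compare models at a glance, the held-out accuracy can also be collected with `accuracy_score`, which the notebook imports. The sketch below is an assumption about how that could look; it runs on synthetic data from `make_classification` with two of the classifiers, and the names `demo_models` and `test_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data, not the promoter dataset
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

demo_models = {
    "Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=1),
}

# fit each model and record its accuracy on the held-out test set
test_accuracy = {}
for model_name, clf in demo_models.items():
    clf.fit(X_tr, y_tr)
    test_accuracy[model_name] = accuracy_score(y_te, clf.predict(X_te))

# print models from best to worst test accuracy
for model_name, acc in sorted(test_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model_name}: {acc:.3f}")
```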

In [None]:
# center plot outputs in the notebook (HTML is exposed through the public IPython.display module)
from IPython.display import HTML as Center

Center(""" <style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style> """)

In [None]:
# boxplot algorithm comparison
fig = plt.figure(figsize=(10,12), dpi = 80)
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results, vert=False)
ax.set_yticklabels(names)
# save before showing; plt.show() may clear the figure and leave a blank file
plt.savefig("algorithm_comparison.png")
plt.show()