#**DNA CLASSIFICATION**

---

This notebook uses the **Molecular Biology (Promoter Gene Sequences) Data Set** from the UCI repository. The dataset has been used to evaluate a "hybrid" learning algorithm known as KBANN, which refines pre-existing biological knowledge using machine learning techniques. It contains **gene sequences that are classified into promoters (+) and non-promoters (-).**




In [1]:
import numpy as np
import pandas as pd
import sklearn

**A promoter** is a DNA sequence that initiates transcription of a particular gene.


**Type of Data:**
- Non-numeric, nominal data representing DNA nucleotides.
- Nucleotides are represented as the four DNA bases: Adenine (A), Guanine (G), Thymine (T), and Cytosine (C).
- R (Purine): A, G
- Y (Pyrimidine): T, C
- X: any nucleotide.

DNA nucleotides can be grouped into a hierarchy, as shown below:

		      X (any)
		    /   \
	  (purine) R     Y (pyrimidine)
		  / \   / \
		 A   G T   C
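
The hierarchy above can be written as a small lookup table. This is only an illustrative sketch: the names `GROUPS` and `base_group` are hypothetical and are not used elsewhere in the notebook, and lowercase letters are used because that is how the dataset stores the bases.

```python
# Hypothetical helper: map each base to its group in the hierarchy above.
GROUPS = {
    "R": {"a", "g"},  # purines
    "Y": {"t", "c"},  # pyrimidines
}

def base_group(base):
    """Return 'R' for a purine, 'Y' for a pyrimidine, 'X' for anything else."""
    base = base.lower()
    for group, members in GROUPS.items():
        if base in members:
            return group
    return "X"

print([base_group(b) for b in "tacg"])
```

Returning `'X'` for an unrecognized symbol is a design choice made for this sketch, mirroring the "any nucleotide" node at the top of the tree.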


#Data Loading and Inspection


In [2]:
# import the uci Molecular Biology (Promoter Gene Sequences) Data Set
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'
names = ['Class', 'id', 'Sequence']
data = pd.read_csv(url, names = names)

**explanation of code**

 The `names` variable is a list of three strings: `'Class'`, `'id'`, and `'Sequence'`. These are used as the column names for the imported data because the file has no header row.

 `pd.read_csv(url, names=names)` uses the pandas `read_csv` function to read a CSV (comma-separated values) file into a DataFrame, a 2-dimensional labeled data structure whose columns can have different types. The first argument, `url`, specifies the data source, and `names=names` sets the DataFrame's column names to the `names` list.

# **Overview of the Dataset:**


In [3]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106 entries, 0 to 105
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Class     106 non-null    object
 1   id        106 non-null    object
 2   Sequence  106 non-null    object
dtypes: object(3)
memory usage: 2.6+ KB


In [4]:
data.describe()

Unnamed: 0,Class,id,Sequence
count,106,106,106
unique,2,106,106
top,+,S10,\t\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...
freq,53,1,1


In [5]:
data.head()

Unnamed: 0,Class,id,Sequence
0,+,S10,\t\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...
1,+,AMPC,\t\ttgctatcctgacagttgtcacgctgattggtgtcgttacaat...
2,+,AROH,\t\tgtactagagaactagtgcattagcttatttttttgttatcat...
3,+,DEOP2,\taattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaa...
4,+,LEU1_TRNA,\ttcgataattaactattgacgaaaagctgaaaaccactagaatgc...


- Instances: 106 (53 promoter sequences and 53 non-promoter sequences).
- Features:
 - Class: The class label (+ for promoter, - for non-promoter).
 - Id: The instance name (for example `S10` or `AMPC`), which identifies the sequence.
 - Sequence: The nucleotide sequence itself, one character (a, g, t, or c) per position.

The dataset contains 57 nucleotide positions for each sequence.
Nucleotide sequences start at position -50 and end at +7 relative to the transcription start site.

There is a `\t` (tab) at the start of every sequence. That is because one or two tab characters precede every sequence in the CSV file, and pandas keeps them as part of the field when the data are imported.
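
One way to remove those tabs right after loading is sketched below. It is an alternative to the per-sequence filtering done later in the notebook, not the approach the notebook itself takes, and the two rows (including the ids `EX1` and `EX2`) are made up for illustration.

```python
import pandas as pd

# Two made-up rows that mimic the raw layout (tabs before the sequence).
raw = pd.DataFrame({
    "Class": ["+", "-"],
    "id": ["EX1", "EX2"],  # hypothetical ids
    "Sequence": ["\t\ttactag", "\tgcatta"],
})

# Remove every tab character from the Sequence column.
raw["Sequence"] = raw["Sequence"].str.replace("\t", "", regex=False)
print(raw["Sequence"].tolist())
```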

A custom pandas DataFrame will now be built from the imported data, with one column per nucleotide position. Each column in a DataFrame is called a Series. The first step is to extract the `Class` column as a Series with the `data.loc[]` indexer and print its first five entries.

In [6]:
# Each column in a DataFrame is called a Series. Let's start with the Series for the Class column.

classes = data.loc[:, 'Class']
print(classes[:5])

0    +
1    +
2    +
3    +
4    +
Name: Class, dtype: object


**explanation of code**

1. `classes = data.loc[:, 'Class']`: This line accesses the DataFrame `data` through the `.loc[]` indexer, which selects a group of rows and columns by label(s) or a boolean array. The colon `:` selects all rows, and `'Class'` selects the column named "Class". The result, a Series holding every value of the "Class" column, is assigned to the variable `classes`.

2. `print(classes[:5])`: This line displays the first 5 entries of `classes`.
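
As a small illustration of label-based and boolean selection with `.loc[]`, the sketch below uses a made-up three-row DataFrame (`toy` is a hypothetical name, not part of the notebook).

```python
import pandas as pd

toy = pd.DataFrame({"Class": ["+", "+", "-"], "id": ["A", "B", "C"]})  # made-up rows

by_loc = toy.loc[:, "Class"]   # all rows, column "Class"
by_key = toy["Class"]          # shorthand for the same column

print(by_loc.equals(by_key))
print(toy.loc[toy["Class"] == "+", "id"].tolist())  # boolean row selection
```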

In [7]:
sequences = list(data.loc[:, 'Sequence'])

data_list = []

# Loop through sequences and split into individual nucleotides
for i, seq in enumerate(sequences):
    # split into nucleotides, remove tab characters
    nucleotides = list(seq)
    nucleotides = [x for x in nucleotides if x != '\t']

    # append class assignment
    nucleotides.append(classes[i])

    # Add the processed nucleotides (a single sequence) to the data_list
    data_list.append(nucleotides)

# Create the DataFrame directly from data_list
df = pd.DataFrame(data_list)

# Rename columns for clarity
column_names = list(range(len(df.columns) - 1)) + ['Class']
df.columns = column_names
print(df)

     0  1  2  3  4  5  6  7  8  9  ... 48 49 50 51 52 53 54 55 56 Class
0    t  a  c  t  a  g  c  a  a  t  ...  g  c  t  t  g  t  c  g  t     +
1    t  g  c  t  a  t  c  c  t  g  ...  c  a  t  c  g  c  c  a  a     +
2    g  t  a  c  t  a  g  a  g  a  ...  c  a  c  c  c  g  g  c  g     +
3    a  a  t  t  g  t  g  a  t  g  ...  a  a  c  a  a  a  c  t  c     +
4    t  c  g  a  t  a  a  t  t  a  ...  c  c  g  t  g  g  t  a  g     +
..  .. .. .. .. .. .. .. .. .. ..  ... .. .. .. .. .. .. .. .. ..   ...
101  c  c  t  c  a  a  t  g  g  c  ...  g  a  a  c  t  a  t  a  t     -
102  g  t  a  t  t  c  t  c  a  a  ...  t  c  a  a  c  a  t  t  g     -
103  c  g  c  g  a  c  t  a  c  g  ...  a  a  g  g  c  t  t  c  c     -
104  c  t  c  g  t  c  c  t  c  a  ...  a  g  g  a  g  g  a  a  c     -
105  t  a  a  c  a  t  t  a  a  t  ...  t  c  a  a  g  a  a  c  t     -

[106 rows x 58 columns]


**explanation of code**

This code processes the DNA sequence data in the `data` DataFrame and creates a new DataFrame `df` where **each row represents a sequence and each column represents a position in the sequence, with the last column representing the class label ('+' or '-').**


1. `sequences = list(data.loc[:, 'Sequence']):` Extracts the 'Sequence' column from data, converts it to a list, and stores it in sequences.
2. `data_list = []:` Creates an empty list called data_list to store the processed sequences.
3. `for i, seq in enumerate(sequences):` Iterates through each sequence in sequences using enumerate, providing the index i and the sequence seq for each iteration.
4. `nucleotides = list(seq):` Converts the current sequence seq into a list of individual nucleotides.
5. `nucleotides = [x for x in nucleotides if x != '\t']:` Removes tab characters (\t) from the nucleotides list using a list comprehension.
6. `nucleotides.append(classes[i]):` Appends the corresponding class label from classes (using index i) to the end of the nucleotides list.
7. `data_list.append(nucleotides):` Appends the processed nucleotides list (representing a single sequence with its class label) to data_list.
8. `df = pd.DataFrame(data_list):` Creates a DataFrame df directly from data_list.
9. `column_names = list(range(len(df.columns) - 1)) + ['Class']:` Generates column names for the DataFrame. It creates numerical names for the sequence positions and adds 'Class' for the last column.
10. `df.columns = column_names:` Assigns the generated column names to the DataFrame df.
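
The same reshaping can also be done without an explicit loop. The sketch below is an alternative under stated assumptions, not the notebook's method: `data_demo` is a made-up stand-in for `data`, with short sequences for readability.

```python
import pandas as pd

# Made-up stand-in for `data`, with short sequences for readability.
data_demo = pd.DataFrame({
    "Class": ["+", "-"],
    "Sequence": ["\t\ttacg", "\tggca"],
})

# Strip tabs, then turn each string into a list of single characters.
cleaned = data_demo["Sequence"].str.replace("\t", "", regex=False)
df_demo = pd.DataFrame(cleaned.apply(list).tolist())

# Attach the class label as the last column.
df_demo["Class"] = data_demo["Class"]
print(df_demo)
```

Here the position columns receive the integer names 0, 1, 2, ... automatically, so only the `Class` column has to be named explicitly.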

In [8]:
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,Class
count,106,106,106,106,106,106,106,106,106,106,...,106,106,106,106,106,106,106,106,106,106
unique,4,4,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,4,2
top,t,a,a,c,a,a,a,a,a,a,...,c,c,c,t,t,c,c,c,t,+
freq,38,34,30,30,36,42,38,34,33,36,...,36,42,31,33,35,32,29,29,34,53


#**Discussion of outcome**

The letters correspond to the different nucleotides. Every position has 4 unique values, since each one can hold a, t, g, or c. The top row shows the most common nucleotide at each position, while the freq row shows the number of times that nucleotide appears. There are 53 promoters and 53 non-promoters in this dataset, which is a balanced 50/50 split. **However, these letters need to be converted to numerical values. Most algorithms work best with numerical data, as they rely on mathematical calculations and distance measures.**

In [9]:
series = []
for name in df.columns:
    series.append(df[name].value_counts())

info = pd.DataFrame(series)
details = info.transpose()
details.columns = df.columns
print(details)

      0     1     2     3     4     5     6     7     8     9  ...    48  \
t  38.0  26.0  27.0  26.0  22.0  24.0  30.0  32.0  32.0  28.0  ...  21.0   
c  27.0  22.0  21.0  30.0  19.0  18.0  21.0  20.0  22.0  22.0  ...  36.0   
a  26.0  34.0  30.0  22.0  36.0  42.0  38.0  34.0  33.0  36.0  ...  23.0   
g  15.0  24.0  28.0  28.0  29.0  22.0  17.0  20.0  19.0  20.0  ...  26.0   
+   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
-   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   

     49    50    51    52    53    54    55    56  Class  
t  22.0  23.0  33.0  35.0  30.0  23.0  29.0  34.0    NaN  
c  42.0  31.0  32.0  21.0  32.0  29.0  29.0  17.0    NaN  
a  24.0  28.0  27.0  25.0  22.0  26.0  24.0  27.0    NaN  
g  18.0  24.0  14.0  25.0  22.0  28.0  24.0  28.0    NaN  
+   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   53.0  
-   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   53.0  

[6 rows x 58 columns]


**explanation of code**

1. `series = []` Creates an empty list named series to store the value counts for each column.

2. `for name in df.columns:` This loops through each column in the dataframe 'df'.

3. `series.append(df[name].value_counts())` For each column, the 'value_counts' function is used to count the frequency of each unique value in that column. These frequency counts are then appended to the 'series' list.

4. `info = pd.DataFrame(series)` This line is transforming the 'series' list into a pandas DataFrame and storing it in the 'info' variable. At this point, each row in 'info' represents a unique value count for a certain column from 'df'.

5. `details = info.transpose()` Transposes the 'info' DataFrame, switching its rows and columns. Before the transpose, each row in 'info' held the counts for one column of 'df'. After it, each column in 'details' holds the counts for one column of 'df', and each row corresponds to one distinct value (t, c, a, g, +, -).

6. `details.columns = df.columns` Assigns the column names of the original DataFrame (df) to the 'details' DataFrame.
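
A more compact way to build the same kind of table is sketched below on made-up data. The names `toy` and `details_demo` are hypothetical; the sketch relies on pandas aligning the per-column counts on the union of their index values, which is what produces `NaN` where a value never occurs in a column.

```python
import pandas as pd

# Made-up data: two sequence positions and a class column.
toy = pd.DataFrame({"p0": list("tatt"), "p1": list("ccga"), "Class": list("++--")})

# One value_counts() Series per column; pandas aligns them on the union of values.
details_demo = pd.DataFrame({col: toy[col].value_counts() for col in toy.columns})
print(details_demo)
```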

#**Discussion of outcome**

The value 38.0 in the first row and first column indicates that the nucleotide 't' appears 38 times at position 0 across all sequences in the dataset.

The Class column holds no nucleotide counts, only the counts of pluses and minuses (53 each). Conversely, there aren't any pluses or minuses in the first 57 columns, which makes sense, because those should contain only nucleotides. The value counts for each nucleotide are fairly evenly split at each position.


Every sequence column has a count for all four nucleotides, and the Class column has a count for both labels. Together with the count of 106 reported for every column by `df.describe()`, this shows that there are no missing data, so the next step is to convert the data to numerical form.


In [10]:
numerical_df = pd.get_dummies(df)
numerical_df.iloc[:5]

Unnamed: 0,0_a,0_c,0_g,0_t,1_a,1_c,1_g,1_t,2_a,2_c,...,55_a,55_c,55_g,55_t,56_a,56_c,56_g,56_t,Class_+,Class_-
0,False,False,False,True,True,False,False,False,False,True,...,False,False,True,False,False,False,False,True,True,False
1,False,False,False,True,False,False,True,False,False,True,...,True,False,False,False,True,False,False,False,True,False
2,False,False,True,False,False,False,False,True,True,False,...,False,True,False,False,False,False,True,False,True,False
3,True,False,False,False,True,False,False,False,False,False,...,False,False,False,True,False,True,False,False,True,False
4,False,False,False,True,False,True,False,False,False,False,...,True,False,False,False,False,False,True,False,True,False


**explanation of code**

`numerical_df = pd.get_dummies(df)`: This line creates a new DataFrame (`numerical_df`) from the original DataFrame `df`. The `get_dummies()` function converts categorical variables into dummy/indicator variables: for each unique value in a column, a new column is created. For each row, the column representing the value found in the original categorical column is set to 1 (shown as `True` in the output above) and the other columns are set to 0 (`False`). For example, suppose `df` has a column named "Color" with the values "Red", "Blue", and "Green". After `get_dummies()` is applied, three new columns are generated ("Color_Red", "Color_Blue", "Color_Green"). A row with the value "Red" in the "Color" column has a 1 in "Color_Red" and a 0 in "Color_Blue" and "Color_Green".

**This process is called one-hot encoding, where each nucleotide is represented by a binary vector. With the column order a, c, g, t used here, 'a' becomes [1, 0, 0, 0], 'c' becomes [0, 1, 0, 0], and so on.**
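
A minimal sketch of one-hot encoding on made-up data follows. The column names `p0` and `p1` are hypothetical, and `dtype=int` is passed only so that the result shows 1/0 instead of True/False.

```python
import pandas as pd

# Made-up data: two positions, three sequences.
toy = pd.DataFrame({"p0": ["a", "t", "a"], "p1": ["c", "c", "g"]})

# One indicator column per (position, nucleotide) pair that occurs in the data.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Note that only values that actually occur get a column: `p0` never contains 'c' or 'g' in this toy data, so no `p0_c` or `p0_g` column is created.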



In [11]:
df = numerical_df.drop(columns=['Class_-'])

df.rename(columns = {'Class_+': 'Class'}, inplace = True)
print(df.iloc[:5])

     0_a    0_c    0_g    0_t    1_a    1_c    1_g    1_t    2_a    2_c  ...  \
0  False  False  False   True   True  False  False  False  False   True  ...   
1  False  False  False   True  False  False   True  False  False   True  ...   
2  False  False   True  False  False  False  False   True   True  False  ...   
3   True  False  False  False   True  False  False  False  False  False  ...   
4  False  False  False   True  False   True  False  False  False  False  ...   

    54_t   55_a   55_c   55_g   55_t   56_a   56_c   56_g   56_t  Class  
0  False  False  False   True  False  False  False  False   True   True  
1  False   True  False  False  False   True  False  False  False   True  
2  False  False   True  False  False  False  False   True  False   True  
3  False  False  False  False   True  False   True  False  False   True  
4   True   True  False  False  False  False  False   True  False   True  

[5 rows x 229 columns]


#**Discussion of outcome**

**Each column** represents a combination of the original sequence position and nucleotide. For example, 0_a represents the presence of nucleotide 'a' at position 0, and 55_c represents the presence of nucleotide 'c' at position 55.

After `get_dummies()`, the last two columns, Class_+ and Class_-, represent the one-hot encoded class labels (promoter or non-promoter).


**Each row** corresponds to a DNA sequence from the original dataset.

**The values are boolean (True or False)**, indicating whether the specific nucleotide is present at that position in the sequence. True means the nucleotide is present; False means it is not.

There is no need for both class columns, because each is the exact negation of the other. The code above drops Class_- and renames Class_+ to simply 'Class'.

So now there are 229 columns, 228 of which represent nucleotide inputs (57 positions with four columns each). The same information is maintained as before, but the dataset is now preprocessed in the way it is needed: converted to numerical input.
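
An alternative that avoids creating and then dropping a class column is to encode only the nucleotide columns and build the label separately. The sketch below uses made-up rows and hypothetical names (`toy`, `features`, `labels`); it is not the notebook's code.

```python
import pandas as pd

# Made-up rows: two positions plus a class label.
toy = pd.DataFrame({"p0": ["a", "t"], "p1": ["c", "g"], "Class": ["+", "-"]})

# One-hot encode only the nucleotide columns.
features = pd.get_dummies(toy.drop(columns="Class"), dtype=int)

# Encode the label directly: 1 for promoter (+), 0 for non-promoter (-).
labels = (toy["Class"] == "+").astype(int)

encoded = features.assign(Class=labels)
print(encoded)
```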

# Train-Test

---

In [12]:
# Use the model_selection module to separate training and testing datasets
from sklearn import model_selection

# Create X and Y datasets for training
X = np.array(df.drop(['Class'], axis=1))
y = np.array(df['Class'])

# define seed for reproducibility
seed = 1

# split data into training and testing datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)


**explanation of code**

`X = np.array(df.drop(['Class'], axis=1))` In this line, the 'Class' column is dropped from the DataFrame `df` because it is the label and does not belong in the feature matrix X.  
`y = np.array(df['Class'])` This line creates the target array y from the 'Class' column of `df`, because this is what will be predicted.

`np.array` converts the DataFrame to a NumPy array, a format that every scikit-learn estimator accepts.

`seed = 1` This line defines a seed for the random number generator, so that the same random sequence is generated each time the code runs. This is important for reproducibility in scientific studies.

`X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)` This line divides the dataset into a training set and a test set. The features and target are provided first `(X, y)`, then the `test_size` parameter indicates that the test data should be 25% of the total data, with the remaining 75% forming the training set. The `random_state` parameter initializes the internal random number generator, which decides how the samples are assigned to the train and test sets. Because a fixed seed (1) is provided, the split is deterministic and the output is stable between runs.
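
A purely random split does not guarantee that both classes are equally represented in the test set. The sketch below shows the `stratify` option of `train_test_split`, which the notebook does not use; the data are made up (`X_demo`, `y_demo` are hypothetical names), and the point is only that the class proportions are preserved in the held-out portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up balanced data: 20 samples per class, 3 features each.
X_demo = np.arange(120).reshape(40, 3)
y_demo = np.array([0] * 20 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)
print(np.bincount(y_te))  # class counts in the test set
```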

In [13]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

#define scoring method
scoring ='accuracy'
#define the model to train
names = ["Nearest Neighbors", "Gaussian Process","Decision Tree","Random Forest",
         "Neural Net", "AdaBoost","Naive Bayes","SVM Linear","SVM RBF","SVM Sigmoid"]
classifiers =[
    KNeighborsClassifier(n_neighbors=3),
    GaussianProcessClassifier(1.0 * RBF(1.0, length_scale_bounds=(1e-8, 100.0))),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10,max_features=1),
    MLPClassifier(alpha=1, max_iter=400),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel='linear'),
    SVC(kernel='rbf'),
    SVC(kernel='sigmoid')
]
models =zip(names,classifiers)
# evaluate each model in turn
results = []
names=[]
for name,model in models:
    kfold = model_selection.KFold(n_splits = 10, random_state=seed, shuffle=True)
    cv_results = model_selection.cross_val_score(model, X_train,y_train,cv=kfold,scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg= "%s: %f (%f)" %(name, cv_results.mean(), cv_results.std())

    print(msg)
    model.fit(X_train,y_train)
    predictions= model.predict(X_test)
    print(name)

    print(classification_report(y_test,predictions))

Nearest Neighbors: 0.810714 (0.099808)
Nearest Neighbors
              precision    recall  f1-score   support

       False       1.00      0.65      0.79        17
        True       0.62      1.00      0.77        10

    accuracy                           0.78        27
   macro avg       0.81      0.82      0.78        27
weighted avg       0.86      0.78      0.78        27





Gaussian Process: 0.855357 (0.160605)
Gaussian Process
              precision    recall  f1-score   support

       False       1.00      0.82      0.90        17
        True       0.77      1.00      0.87        10

    accuracy                           0.89        27
   macro avg       0.88      0.91      0.89        27
weighted avg       0.91      0.89      0.89        27

Decision Tree: 0.696429 (0.138321)
Decision Tree
              precision    recall  f1-score   support

       False       0.92      0.71      0.80        17
        True       0.64      0.90      0.75        10

    accuracy                           0.78        27
   macro avg       0.78      0.80      0.78        27
weighted avg       0.82      0.78      0.78        27

Random Forest: 0.671429 (0.112712)
Random Forest
              precision    recall  f1-score   support

       False       0.75      0.71      0.73        17
        True       0.55      0.60      0.57        10

    accuracy                 



AdaBoost: 0.862500 (0.141973)
AdaBoost
              precision    recall  f1-score   support

       False       1.00      0.76      0.87        17
        True       0.71      1.00      0.83        10

    accuracy                           0.85        27
   macro avg       0.86      0.88      0.85        27
weighted avg       0.89      0.85      0.85        27

Naive Bayes: 0.837500 (0.112500)
Naive Bayes
              precision    recall  f1-score   support

       False       1.00      0.88      0.94        17
        True       0.83      1.00      0.91        10

    accuracy                           0.93        27
   macro avg       0.92      0.94      0.92        27
weighted avg       0.94      0.93      0.93        27

SVM Linear: 0.912500 (0.097628)




SVM Linear
              precision    recall  f1-score   support

       False       1.00      0.94      0.97        17
        True       0.91      1.00      0.95        10

    accuracy                           0.96        27
   macro avg       0.95      0.97      0.96        27
weighted avg       0.97      0.96      0.96        27

SVM RBF: 0.875000 (0.111803)
SVM RBF
              precision    recall  f1-score   support

       False       1.00      0.88      0.94        17
        True       0.83      1.00      0.91        10

    accuracy                           0.93        27
   macro avg       0.92      0.94      0.92        27
weighted avg       0.94      0.93      0.93        27

SVM Sigmoid: 0.925000 (0.100000)
SVM Sigmoid
              precision    recall  f1-score   support

       False       1.00      0.88      0.94        17
        True       0.83      1.00      0.91        10

    accuracy                           0.93        27
   macro avg       0.92      0.94  

 **explanation of code**

- A string `scoring = 'accuracy'` is defined. It specifies that the models should be evaluated on their accuracy score (the proportion of correct predictions).
- `names:` A list of model names to be used for printing and tracking.
- `classifiers:` A list of classifier objects, each initialized with some parameters:
 - KNeighborsClassifier(n_neighbors=3): K-NN with k=3.
 - GaussianProcessClassifier(1.0 * RBF(1.0, length_scale_bounds=(1e-8, 100.0))): Gaussian process classifier using a Radial Basis Function (RBF) kernel with explicit bounds on the length scale.
 - DecisionTreeClassifier(max_depth=5): Decision tree with a maximum depth of 5.
 - RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1): Random forest of 10 trees, each with a maximum depth of 5 and considering 1 feature per split.
 - MLPClassifier(alpha=1, max_iter=400): Multi-layer perceptron (neural network) with a maximum of 400 iterations and regularization term alpha=1.
 - AdaBoostClassifier(): AdaBoost classifier with default parameters.
 - GaussianNB(): Gaussian Naive Bayes classifier.
 - SVC(kernel='linear'): Support Vector Classifier with a linear kernel.
 - SVC(kernel='rbf'): SVC with an RBF kernel.
 - SVC(kernel='sigmoid'): SVC with a sigmoid kernel.

- `models`: This zips together the names and classifiers lists so that each classifier is paired with its respective name.

- An empty list `results` is defined to store the results of each model after evaluation, and `names` is re-initialized as an empty list so that it can collect the model names in evaluation order.

- A for loop iterates over each (name, model) pair in `models`. For each model:

  1. A `kfold` cross validation object is created. **Cross validation is a method for assessing how well a model will generalize to unseen data.** `'KFold'` splits the training data into 10 subsets, and the model is trained on 9 and tested on 1, which is repeated for each subset.
  
  2. `cross_val_score` calculates the cross-validation scores for the current model, on the training data, using the k-fold object created, and the accuracy scoring method.
  
  3. The resulting score is then appended to the `results` list, and the name of the current model is appended to the list `names`.
  
  4. Finally, a message is constructed that displays the name of the model, the mean of the cross-validation scores, and the standard deviation of the scores, and the message is printed.
- `model.fit(X_train, y_train):` After cross-validation, the model is trained on the entire training set (X_train, y_train).

- `model.predict(X_test):` The trained model is used to predict the labels for the test set (X_test).
- `classification_report(y_test, predictions):` Prints a detailed classification report, which includes:

 1. Precision: Proportion of true positives out of predicted positives.
 2. Recall: Proportion of true positives out of actual positives.
 3. F1-score: The harmonic mean of precision and recall.
 4. Support: The number of actual occurrences of the class in the dataset.
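
The three formulas above can be checked by hand. The confusion counts in the sketch below are illustrative: they are chosen to be consistent with a report in which the True class has support 10, perfect recall, and two False instances predicted as True.

```python
# Illustrative confusion counts for the positive (True) class.
tp, fp, fn = 10, 2, 0  # true positives, false positives, false negatives

precision = tp / (tp + fp)                            # correct among predicted positives
recall = tp / (tp + fn)                               # found among actual positives
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```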

**Cross Validation**

---

Cross-validation takes the validation process a step further by dividing the data into multiple folds (e.g., k folds in k-fold cross-validation) and repeatedly training and evaluating the model on different combinations of these folds. This provides a more robust estimate of model performance compared to a single validation split.

In simpler terms:

Validation is like having one practice exam to check your understanding before the real exam.
Cross-validation is like having multiple practice exams, each covering different topics, to get a more comprehensive assessment of your knowledge.
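
To make the mechanics concrete, the sketch below writes out the k-fold loop by hand and compares it with `cross_val_score`. It uses synthetic data from `make_classification` rather than the promoter dataset, 5 folds instead of 10, and hypothetical variable names (`X_demo`, `y_demo`, `manual_scores`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (not the promoter dataset).
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=1)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

manual_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    model = SVC(kernel="linear")
    model.fit(X_demo[train_idx], y_demo[train_idx])  # train on 4 folds
    # accuracy on the single held-out fold
    manual_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

library_scores = cross_val_score(
    SVC(kernel="linear"), X_demo, y_demo, cv=kfold, scoring="accuracy"
)
print(np.mean(manual_scores), np.mean(library_scores))
```

Each sample is used for validation exactly once, and the mean and standard deviation of the per-fold scores are what the notebook prints for each model.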


#**Discussion of outcome**

For each model, the cross-validation accuracy (the proportion of correctly classified instances) is displayed as the mean accuracy across folds, followed by the standard deviation in parentheses. Then, after the model is trained on the training set and evaluated on the test set, a classification report is generated. This report includes several important metrics.

  - **For example, the Neural Net has a mean accuracy of 0.912500,** meaning that it correctly classified 91.25% of the instances on average across the cross-validation folds.
  This high average accuracy suggests that the Neural Net is generally effective at distinguishing between promoter (+) and non-promoter (-) DNA sequences. **The standard deviation (0.097628)** indicates that the fold accuracies were spread around the mean by about 9.76 percentage points, so performance depends to some degree on which subset of the data was used for training and testing in each fold. While a lower standard deviation is preferable for consistent performance, a spread of about 10 points is not uncommon with a dataset this small, where each fold holds only a handful of sequences.

  - **The classification report** provides a detailed breakdown of the model's performance on each class (False and True) in the test set. Here is what each metric means:


  - **Class: False**

   - **Precision (1.00):** When the model predicts an instance as False, it is correct 100% of the time. No True instance was mislabeled as False.

   - **Recall (0.88):** Out of all actual False instances, the model correctly identifies 88% of them. The remaining 12% are incorrectly classified as True.

   - **F1-score (0.94):** The harmonic mean of precision and recall for the False class is 0.94. A high F1-score indicates a good balance between precision and recall.

   - **Support (17):** There are 17 actual instances of the False class in the test set. This provides context for the other metrics, which are computed on these 17 instances.



  - **Class: True**

   - **Precision (0.83):** When the model predicts an instance as True, it is correct 83% of the time. The remaining 17% of the True predictions are actually False (false positives for the True class).

   - **Recall (1.00):** Out of all actual True instances, the model correctly identifies 100% of them. It does not miss any True instance (no false negatives).

   - **F1-score (0.91):** The harmonic mean of precision and recall for the True class is 0.91. A high F1-score indicates a strong balance between precision and recall.

   - **Support (10):** There are 10 actual instances of the True class in the test set. The True class is therefore less represented in the test set than the False class, which matters when interpreting precision and recall.

 - **Overall Accuracy: 0.93 (93%)** The model correctly predicted 93% of the total instances in the test set. This high accuracy indicates strong overall performance in distinguishing between the two classes.

 - **Macro Average:**
 The macro average is the unweighted mean of the per-class scores. Its values indicate that the model performs well on both classes, maintaining high precision and recall overall.

 - **Weighted Average:**
  The weighted average weights each class's score by its support, so it reflects the class distribution of the test set. Here, the weighted averages are close to the macro averages (within about 0.02).
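
The difference between the two averages can be worked through with the precision values discussed above. In this sketch the True-class precision of 0.83 is taken to be 10/12 for illustration, which is an assumption consistent with the rounded report rather than a value read from it.

```python
# Per-class precision and support for the two classes.
# Assumption: the True-class precision (reported as 0.83) equals 10/12.
precision = {"False": 1.00, "True": 10 / 12}
support = {"False": 17, "True": 10}

# Macro average: every class counts equally.
macro = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support.
weighted = sum(precision[c] * support[c] for c in precision) / sum(support.values())

print(f"macro={macro:.2f} weighted={weighted:.2f}")
```

The weighted value is pulled toward the False class because that class has more test instances.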

With a small test set (27 instances) and a moderately uneven class split (17 False, 10 True), the model's high performance is encouraging. However, each test instance accounts for almost 4 percentage points of accuracy, so further evaluation on larger and more diverse datasets would be needed to confirm robustness.

---
---



**Model Comparison:**

- Top Performers: SVM Linear, Neural Net, SVM RBF, SVM Sigmoid, and Naive Bayes performed the best, achieving around 93-96% test accuracy, with SVM Linear achieving the best accuracy (96%).
- Gaussian Process Classifier performs well but shows a convergence warning, suggesting that tuning the hyperparameters (e.g., the bounds of the kernel) might lead to better performance.
- Nearest Neighbors performed relatively well (78% test accuracy) but had a low recall for the False class (0.65).
- Weaker Performers: Random Forest and Decision Tree performed less consistently. In the output above they have the lowest cross-validation means (about 0.67 and 0.70 respectively), and the Decision Tree reaches 78% test accuracy. Because these tree-based models are created without a fixed `random_state`, their scores can change from run to run.
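
For a side-by-side view, the cross-validation scores can be collected into a table and sorted. The numbers in the sketch below are copied from the printed output above; the Neural Net line does not appear in that output, so it is left out. The names `cv_scores` and `ranked` are hypothetical.

```python
import pandas as pd

# Cross-validation mean and standard deviation per model, copied from the output above.
cv_scores = pd.DataFrame(
    [
        ("Nearest Neighbors", 0.810714, 0.099808),
        ("Gaussian Process", 0.855357, 0.160605),
        ("Decision Tree", 0.696429, 0.138321),
        ("Random Forest", 0.671429, 0.112712),
        ("AdaBoost", 0.862500, 0.141973),
        ("Naive Bayes", 0.837500, 0.112500),
        ("SVM Linear", 0.912500, 0.097628),
        ("SVM RBF", 0.875000, 0.111803),
        ("SVM Sigmoid", 0.925000, 0.100000),
    ],
    columns=["model", "cv_mean", "cv_std"],
)

# Best cross-validation mean first.
ranked = cv_scores.sort_values("cv_mean", ascending=False).reset_index(drop=True)
print(ranked)
```

Ranked by cross-validation mean, SVM Sigmoid comes first and SVM Linear second, which differs slightly from the ranking by test accuracy. With standard deviations of around 0.10, differences of this size between models should be read with caution.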






#**Conclusion**

The goal of the machine learning models is to learn patterns from the labeled sequences in the dataset and then accurately predict the class (promoter or non-promoter) of new, unseen DNA sequences.
The results indicate that SVM (linear kernel) and Neural Networks (MLPClassifier) perform best on this dataset, with consistently high accuracy, precision, and recall. Some models (e.g., Random Forest and Decision Tree) may need further tuning to improve their performance, and others, like K-Nearest Neighbors, may be sensitive to the class distribution or the distance metric used.


