
In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [None]:
# Load the Iris dataset
X, y = load_iris(return_X_y=True, as_frame=True)

In [None]:
X.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [None]:
# A negative n makes tail() return every row except the first 10
y.tail(-10)

10     0
11     0
12     0
13     0
14     0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 140, dtype: int32

### Data Preprocessing

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)  # scaled_data is a NumPy array

# Wrap the array back into a DataFrame, reusing the original column names
X = pd.DataFrame(scaled_data, columns=X.columns)
X.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,-0.900681,1.019004,-1.340227,-1.315444
1,-1.143017,-0.131979,-1.340227,-1.315444
2,-1.385353,0.328414,-1.397064,-1.315444
3,-1.506521,0.098217,-1.283389,-1.315444
4,-1.021849,1.249201,-1.340227,-1.315444


In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
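
The scaler above is fitted on all 150 rows before the split, so the test rows influence the mean and variance used for scaling. One common way to avoid this is to split first and let a `Pipeline` fit the scaler on the training rows only. The sketch below assumes that approach; the names `raw_X`, `raw_y`, `Xtr`, `Xte`, `ytr`, `yte`, and `pipe` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Reload the unscaled features so this sketch is self-contained
raw_X, raw_y = load_iris(return_X_y=True, as_frame=True)

# Split first, so the test rows stay unseen during preprocessing
Xtr, Xte, ytr, yte = train_test_split(raw_X, raw_y, test_size=0.4, random_state=42)

# The pipeline fits the scaler on the training rows only, then the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(Xtr, ytr)

# At prediction time the training-set mean and variance are reused
pipe_accuracy = pipe.score(Xte, yte)
print(f"Pipeline accuracy: {pipe_accuracy:.2f}")
```

On a dataset as small and clean as Iris the difference is likely to be negligible, but the pattern matters more when the evaluation has to reflect truly unseen data.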

### Logistic Regression

* Logistic regression is a classification algorithm that fits a linear decision boundary between classes. It passes a weighted sum of the input features through the logistic (sigmoid) function to obtain the probability that a sample belongs to a given class.
* The algorithm adjusts the weights on the training data to maximize the likelihood of the observed class labels.
* In simpler terms, logistic regression helps us predict the probability of an outcome or class based on input features.
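
As a rough illustration of the mapping described above, the following sketch computes a class probability for the binary case by hand. The weights and bias are hypothetical values chosen for the example, not parameters learned by the model in this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample (binary case)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and bias, chosen only for illustration
example_weights = [0.8, -0.4]
example_bias = 0.1

p = predict_probability([1.0, 2.0], example_weights, example_bias)
print(f"P(class = 1) = {p:.3f}")
```

A sample is assigned to the positive class when this probability exceeds a threshold, typically 0.5, which corresponds to the weighted sum being positive.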

In [None]:
from sklearn.linear_model import LogisticRegression

# Create and train the logistic regression model
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Make predictions on the test set
lr_pred = lr.predict(X_test)

### Decision Trees

* Decision trees are like flowcharts that make decisions by splitting the data based on features.
* They start with a root node and make decisions by moving down the tree branches until reaching a leaf node, which represents a predicted class.
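* At each node, the algorithm picks the feature and threshold that best separate the classes, for example by maximizing information gain or minimizing Gini impurity (the default criterion of scikit-learn's `DecisionTreeClassifier`).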
* In other words, decision trees learn patterns in the data and create a set of logical rules to classify new instances.
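
To make the splitting criterion more concrete, here is a minimal sketch of entropy, Gini impurity, and information gain for a single candidate split. The toy labels are made up, and the helper names are illustrative rather than scikit-learn functions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy labels (made up): a split that separates the two classes perfectly
parent = [0, 0, 1, 1]
gain = information_gain(parent, left=[0, 0], right=[1, 1])
print(f"entropy={entropy(parent):.2f}, gini={gini(parent):.2f}, gain={gain:.2f}")
```

A tree learner evaluates many candidate splits in this way and keeps the one with the highest gain (or the lowest weighted impurity).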

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create and train the decision tree model
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Make predictions on the test set
dt_pred = dt.predict(X_test)

### Random Forest

* Random forest is an ensemble learning algorithm that combines multiple decision trees to make predictions.
* It builds a collection of decision trees, each trained on a bootstrap sample of the training data and considering only a random subset of features at each split.
* When making predictions, each tree votes on the class, and the majority class becomes the final prediction.
* Random forests reduce overfitting and usually improve accuracy because the errors of individual trees tend to average out.
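
The two ingredients described above, resampling the training data and combining votes, can be sketched in a few lines. This is a simplified illustration with made-up predictions; scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so treat the voting function as a conceptual stand-in.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, as done for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)  # some rows repeat, others are left out

# Hypothetical predictions from five trees for a single sample
votes = [2, 1, 2, 2, 1]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```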

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create and train the random forest model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Make predictions on the test set
rf_pred = rf.predict(X_test)

### Support Vector Machines (SVM)

* Support Vector Machines (SVMs) are powerful algorithms used for both classification and regression tasks.
* An SVM works by finding the hyperplane that separates the classes with the largest margin.
* The support vectors, the training points closest to the decision boundary, determine the position of that hyperplane.
* SVMs can handle data that is not linearly separable by implicitly mapping the input features into a higher-dimensional space, a technique called the kernel trick.
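
The kernel trick can be checked numerically on a small case. The sketch below uses a degree-2 polynomial kernel as an example (note that `SVC()` defaults to an RBF kernel, which is not shown here) and verifies that the kernel value computed in the original 2-D space matches a dot product in an explicit 3-D feature space. The function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return dot(x, z) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, z)
explicit_value = dot(feature_map(x), feature_map(z))
print(kernel_value, round(explicit_value, 6))
```

Both routes give the same number, but the kernel never constructs the higher-dimensional vectors, which is what keeps the computation cheap for richer kernels.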

In [None]:
from sklearn.svm import SVC

# Create and train the SVM model
svc = SVC()
svc.fit(X_train, y_train)

# Make predictions on the test set
svc_pred = svc.predict(X_test)

### Naive Bayes

* Naive Bayes is a probabilistic algorithm that calculates the probability of a data point belonging to a specific class based on its features.
* It assumes that the features are conditionally independent of each other given the class, hence the name "naive".
* Naive Bayes uses Bayes' theorem to compute the probability using prior knowledge and likelihood estimates from the training data.
* It's a fast and efficient algorithm, especially for text classification and spam filtering tasks.
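
As a sketch of how the pieces fit together for continuous features, the code below multiplies a class prior by per-feature Gaussian likelihoods and normalizes the result. The priors, means, and variances are made up for illustration; `GaussianNB` estimates such quantities from the training data.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(sample, priors, stats):
    """Normalized class posteriors under the feature-independence assumption."""
    scores = {}
    for label, prior in priors.items():
        likelihood = 1.0
        # Independence lets us multiply the per-feature likelihoods
        for value, (mean, var) in zip(sample, stats[label]):
            likelihood *= gaussian_pdf(value, mean, var)
        scores[label] = prior * likelihood
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Made-up priors and per-feature (mean, variance) pairs for two classes
priors = {"a": 0.5, "b": 0.5}
stats = {"a": [(0.0, 1.0), (0.0, 1.0)], "b": [(3.0, 1.0), (3.0, 1.0)]}

probs = posterior([0.2, -0.1], priors, stats)
predicted = max(probs, key=probs.get)
print(predicted, {k: round(v, 4) for k, v in probs.items()})
```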

In [None]:
from sklearn.naive_bayes import GaussianNB

# Create and train the Naive Bayes model
nb = GaussianNB()
nb.fit(X_train, y_train)

# Make predictions on the test set
nb_pred = nb.predict(X_test)

### k-Nearest Neighbors (k-NN)

* k-Nearest Neighbors is a simple yet effective algorithm for classification and regression.
* It works by finding the k training points nearest to a new data point and assigning the majority class (for classification) or the average value (for regression) of those neighbors as the prediction.
* The algorithm uses distance metrics (such as Euclidean distance) to measure the similarity between data points.
* k-NN is based on the idea that similar data points tend to belong to the same class or have similar output values.
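
The procedure above fits in a short from-scratch sketch. The points and labels are toy data, and `knn_predict` is an illustrative helper, not the scikit-learn implementation (which defaults to 5 neighbors and uses faster neighbor searches).

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: three points near the origin (class 0), two near (3, 3) (class 1)
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
labels = [0, 0, 0, 1, 1]

knn_label = knn_predict(points, labels, query=(0.3, 0.3), k=3)
print(knn_label)
```

Because the prediction depends directly on distances, features on very different scales can dominate the result, which is one reason the data was standardized earlier.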

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create and train the k-NN model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Make predictions on the test set
knn_pred = knn.predict(X_test)

### Neural Networks

* Neural networks are inspired by the human brain and are composed of interconnected nodes called neurons.
* They can learn complex patterns by adjusting the strength of connections between neurons.
* Each neuron receives input signals, computes a weighted sum of them, applies an activation function, and passes the output to the next layer.
* Deep neural networks have multiple layers, allowing them to learn hierarchical representations of the data.
* They are widely used in tasks such as image and speech recognition, natural language processing, and more.
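
A forward pass through a tiny network shows what "passing the output to the next layer" looks like in practice. The weights below are hypothetical and fixed by hand; in `MLPClassifier` they are learned during training, and the default architecture (one hidden layer of 100 ReLU units) is much larger than this sketch.

```python
import math

def relu(values):
    """Activation function: negative inputs become zero."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the input weights of neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    shifted = [math.exp(v - max(values)) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]

# Hypothetical weights for a 2-input, 2-hidden-neuron, 2-class network
hidden_w, hidden_b = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
output_w, output_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

hidden = relu(dense([2.0, 1.0], hidden_w, hidden_b))
class_probs = softmax(dense(hidden, output_w, output_b))
print(hidden, [round(p, 3) for p in class_probs])
```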

In [None]:
from sklearn.neural_network import MLPClassifier

# Create and train the neural network model
nn = MLPClassifier()
nn.fit(X_train, y_train)

# Make predictions on the test set
nn_pred = nn.predict(X_test)



## Evaluation

#### Accuracy

Accuracy measures the overall correctness of the predictions by calculating the ratio of correctly predicted instances to the total number of instances in the dataset. It is a simple and intuitive metric but may not be suitable for imbalanced datasets.

In [None]:
from sklearn.metrics import accuracy_score


lr_accuracy = accuracy_score(y_test, lr_pred)
print(f"Logistic Regression Classifier: {lr_accuracy:.2f}")

dt_accuracy = accuracy_score(y_test, dt_pred)
print(f"Decision Trees Classifier: {dt_accuracy:.2f}")

rf_accuracy = accuracy_score(y_test, rf_pred)
print(f"Random Forest Classifier: {rf_accuracy:.2f}")

svc_accuracy = accuracy_score(y_test, svc_pred)
print(f"Support Vector Machine Classifier: {svc_accuracy:.2f}")

nb_accuracy = accuracy_score(y_test, nb_pred)
print(f"Naive Bayes Classifier: {nb_accuracy:.2f}")

knn_accuracy = accuracy_score(y_test, knn_pred)
print(f"k-Nearest Neighbors Classifier: {knn_accuracy:.2f}")

nn_accuracy = accuracy_score(y_test, nn_pred)
print(f"Neural Networks (MLP) Classifier: {nn_accuracy:.2f}")

Logistic Regression Classifier: 0.98
Decision Trees Classifier: 0.97
Random Forest Classifier: 0.98
Support Vector Machine Classifier: 0.98
Naive Bayes Classifier: 0.97
k-Nearest Neighbors Classifier: 0.98
Neural Networks (MLP) Classifier: 0.98


#### Precision

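Precision measures the proportion of correctly predicted positive instances (true positives) out of all instances predicted as positive. It focuses on the quality of positive predictions and is useful when the cost of false positives is high. With three classes, `average='micro'` pools the true and false positives of all classes before computing the ratio. For single-label problems like this one, micro-averaged precision, recall, and F1 all equal accuracy, which is why the scores reported in these sections are identical.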

In [None]:
from sklearn.metrics import precision_score


lr_precision = precision_score(y_test, lr_pred, average='micro')
print(f"Logistic Regression Classifier: {lr_precision:.2f}")

dt_precision = precision_score(y_test, dt_pred, average='micro')
print(f"Decision Trees Classifier: {dt_precision:.2f}")

rf_precision = precision_score(y_test, rf_pred, average='micro')
print(f"Random Forest Classifier: {rf_precision:.2f}")

svc_precision = precision_score(y_test, svc_pred, average='micro')
print(f"Support Vector Machine Classifier: {svc_precision:.2f}")

nb_precision = precision_score(y_test, nb_pred, average='micro')
print(f"Naive Bayes Classifier: {nb_precision:.2f}")

knn_precision = precision_score(y_test, knn_pred, average='micro')
print(f"k-Nearest Neighbors Classifier: {knn_precision:.2f}")

nn_precision = precision_score(y_test, nn_pred, average='micro')
print(f"Neural Networks (MLP) Classifier: {nn_precision:.2f}")

Logistic Regression Classifier: 0.98
Decision Trees Classifier: 0.97
Random Forest Classifier: 0.98
Support Vector Machine Classifier: 0.98
Naive Bayes Classifier: 0.97
k-Nearest Neighbors Classifier: 0.98
Neural Networks (MLP) Classifier: 0.98


#### Recall

Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive instances out of all actual positive instances in the dataset. It focuses on capturing all positive instances and is useful when the cost of false negatives is high.

In [None]:
from sklearn.metrics import recall_score


lr_recall = recall_score(y_test, lr_pred, average='micro')
print(f"Logistic Regression Classifier: {lr_recall:.2f}")

dt_recall = recall_score(y_test, dt_pred, average='micro')
print(f"Decision Trees Classifier: {dt_recall:.2f}")

rf_recall = recall_score(y_test, rf_pred, average='micro')
print(f"Random Forest Classifier: {rf_recall:.2f}")

svc_recall = recall_score(y_test, svc_pred, average='micro')
print(f"Support Vector Machine Classifier: {svc_recall:.2f}")

nb_recall = recall_score(y_test, nb_pred, average='micro')
print(f"Naive Bayes Classifier: {nb_recall:.2f}")

knn_recall = recall_score(y_test, knn_pred, average='micro')
print(f"k-Nearest Neighbors Classifier: {knn_recall:.2f}")

nn_recall = recall_score(y_test, nn_pred, average='micro')
print(f"Neural Networks (MLP) Classifier: {nn_recall:.2f}")

Logistic Regression Classifier: 0.98
Decision Trees Classifier: 0.97
Random Forest Classifier: 0.98
Support Vector Machine Classifier: 0.98
Naive Bayes Classifier: 0.97
k-Nearest Neighbors Classifier: 0.98
Neural Networks (MLP) Classifier: 0.98


#### F1 Score

The F1 score is the harmonic mean of precision and recall. It provides a balanced measure of the classifier's performance, taking into account both precision and recall. It is useful when the class distribution is imbalanced.

In [None]:
from sklearn.metrics import f1_score


lr_f1 = f1_score(y_test, lr_pred, average='micro')
print(f"Logistic Regression Classifier: {lr_f1:.2f}")

dt_f1 = f1_score(y_test, dt_pred, average='micro')
print(f"Decision Trees Classifier: {dt_f1:.2f}")

rf_f1 = f1_score(y_test, rf_pred, average='micro')
print(f"Random Forest Classifier: {rf_f1:.2f}")

svc_f1 = f1_score(y_test, svc_pred, average='micro')
print(f"Support Vector Machine Classifier: {svc_f1:.2f}")

nb_f1 = f1_score(y_test, nb_pred, average='micro')
print(f"Naive Bayes Classifier: {nb_f1:.2f}")

knn_f1 = f1_score(y_test, knn_pred, average='micro')
print(f"k-Nearest Neighbors Classifier: {knn_f1:.2f}")

nn_f1 = f1_score(y_test, nn_pred, average='micro')
print(f"Neural Networks (MLP) Classifier: {nn_f1:.2f}")

Logistic Regression Classifier: 0.98
Decision Trees Classifier: 0.97
Random Forest Classifier: 0.98
Support Vector Machine Classifier: 0.98
Naive Bayes Classifier: 0.97
k-Nearest Neighbors Classifier: 0.98
Neural Networks (MLP) Classifier: 0.98


In the context of binary classification, the terms True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) are used to describe the outcomes of a classification model. Using these terms, we can define accuracy, precision, recall, and F1 score as follows:



1.   Accuracy: Accuracy measures the overall correctness of the model's predictions and is defined as the ratio of correct predictions (TP + TN) to the total number of predictions (TP + TN + FP + FN).

  Accuracy = (TP + TN) / (TP + TN + FP + FN)

2.   Precision: Precision measures the proportion of correctly predicted positive instances (TP) out of the total instances predicted as positive (TP + FP). It focuses on the quality of positive predictions.

  Precision = TP / (TP + FP)

3.   Recall (Sensitivity or True Positive Rate): Recall measures the proportion of correctly predicted positive instances (TP) out of the actual positive instances (TP + FN). It focuses on the coverage of positive instances.

  Recall = TP / (TP + FN)

4.   F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a single metric that combines both precision and recall, giving equal weight to both. It is often used when there is an uneven class distribution.

  F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
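
The four formulas can be applied directly to a set of counts. The values of TP, TN, FP, and FN below are made up for illustration and do not come from the models in this notebook.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from binary confusion counts.

    Assumes the denominators are non-zero (no guard against division by zero).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for illustration only
accuracy, precision, recall, f1 = binary_metrics(tp=40, tn=45, fp=10, fn=5)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here precision (40/50) is lower than recall (40/45) because the example has more false positives than false negatives, and F1 falls between the two.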


#### Confusion Matrix

A confusion matrix tabulates the classifier's predictions against the true labels. In scikit-learn's layout, entry (i, j) counts the samples whose true class is i and whose predicted class is j, so the diagonal holds the correct predictions and every off-diagonal entry is a specific kind of error. For a binary problem this reduces to the counts of true negatives, false positives, false negatives, and true positives.

In [None]:
from sklearn.metrics import confusion_matrix

lr_confusion = confusion_matrix(y_test, lr_pred)
print(f"Logistic Regression Classifier:\n {lr_confusion}")

dt_confusion = confusion_matrix(y_test, dt_pred)
print(f"Decision Trees Classifier:\n {dt_confusion}")

rf_confusion = confusion_matrix(y_test, rf_pred)
print(f"Random Forest Classifier:\n {rf_confusion}")

svc_confusion = confusion_matrix(y_test, svc_pred)
print(f"Support Vector Machine Classifier:\n {svc_confusion}")

nb_confusion = confusion_matrix(y_test, nb_pred)
print(f"Naive Bayes Classifier:\n {nb_confusion}")

knn_confusion = confusion_matrix(y_test, knn_pred)
print(f"k-Nearest Neighbors Classifier:\n {knn_confusion}")

nn_confusion = confusion_matrix(y_test, nn_pred)
print(f"Neural Networks (MLP) Classifier:\n {nn_confusion}")

Logistic Regression Classifier:
 [[23  0  0]
 [ 0 19  0]
 [ 0  1 17]]
Decision Trees Classifier:
 [[23  0  0]
 [ 0 18  1]
 [ 0  1 17]]
Random Forest Classifier:
 [[23  0  0]
 [ 0 19  0]
 [ 0  1 17]]
Support Vector Machine Classifier:
 [[23  0  0]
 [ 0 19  0]
 [ 0  1 17]]
Naive Bayes Classifier:
 [[23  0  0]
 [ 0 18  1]
 [ 0  1 17]]
k-Nearest Neighbors Classifier:
 [[23  0  0]
 [ 0 19  0]
 [ 0  1 17]]
Neural Networks (MLP) Classifier:
 [[23  0  0]
 [ 0 19  0]
 [ 0  1 17]]
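
The scores reported earlier can be recovered from these matrices. As a sketch, the code below takes the decision tree matrix printed above, derives per-class true positives, false positives, and false negatives, and compares micro- and macro-averaged precision. The variable names are illustrative.

```python
# Decision tree confusion matrix from the output above
# (rows: true class, columns: predicted class)
cm = [
    [23, 0, 0],
    [0, 18, 1],
    [0, 1, 17],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Per-class counts: diagonal = TP, rest of column = FP, rest of row = FN
tp = [cm[i][i] for i in range(n_classes)]
fp = [sum(cm[r][i] for r in range(n_classes)) - tp[i] for i in range(n_classes)]
fn = [sum(cm[i]) - tp[i] for i in range(n_classes)]

per_class_precision = [tp[i] / (tp[i] + fp[i]) for i in range(n_classes)]
per_class_recall = [tp[i] / (tp[i] + fn[i]) for i in range(n_classes)]

accuracy_from_cm = sum(tp) / total
micro_precision = sum(tp) / (sum(tp) + sum(fp))   # pools counts over classes
macro_precision = sum(per_class_precision) / n_classes  # averages class scores

print(f"accuracy={accuracy_from_cm:.4f} micro={micro_precision:.4f} macro={macro_precision:.4f}")
```

Micro-averaged precision matches accuracy (58 of 60 correct), while the macro average weights each class equally and so differs slightly.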
