

**Aim:**

To compare five machine learning algorithms, namely Artificial Neural Networks (ANN), Support Vector Machines (SVM), Naive Bayes, Decision Trees (DT), and k-Nearest Neighbors (KNN), on the Diabetes dataset, and to determine which of them classifies diabetes occurrences most effectively.

**Title:**

"Comparative Analysis of Machine Learning Algorithms for Diabetes Prediction: ANN, SVM, Naive Bayes, DT, and KNN"

**Dataset Source:**

The Diabetes dataset is sourced from the UCI Machine Learning Repository, containing various features such as glucose concentration, blood pressure, BMI, age, and other health-related metrics. The dataset aims to predict the onset of diabetes within a certain period based on these features.

**Theory (Explanation of Algorithms):**

**1. Artificial Neural Networks (ANN):**

Explanation: ANN is a computational model inspired by the human brain, consisting of interconnected nodes (neurons) organized in layers (input, hidden, output).
Functionality: It learns complex patterns by adjusting weights and biases through forward propagation and backpropagation.
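
To make forward propagation and backpropagation concrete, the following is a minimal sketch of a network with two inputs, two hidden neurons, and one output, written in plain Python. The toy data (logical OR), the starting weights, the learning rate, and the names `forward`, `train_epoch`, and `params` are all assumptions chosen for illustration; this is not the `MLPClassifier` used in the notebook cells.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy problem (logical OR); each row is one sample with two input features.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0.0, 1.0, 1.0, 1.0]

# Fixed, arbitrary starting weights so the run is reproducible (assumed values).
params = {
    "W1": [[0.5, -0.4], [0.3, 0.8]],  # W1[j][i]: input i -> hidden neuron j
    "b1": [0.0, 0.0],
    "W2": [0.7, -0.2],                # hidden neuron j -> output neuron
    "b2": 0.0,
}

def forward(x, p):
    """Forward propagation: input -> hidden activations -> output probability."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(p["W1"][j], x)) + p["b1"][j])
              for j in range(2)]
    output = sigmoid(sum(w * h for w, h in zip(p["W2"], hidden)) + p["b2"])
    return hidden, output

def mean_cross_entropy(p):
    """Average binary cross-entropy loss over the toy data."""
    total = 0.0
    for x, target in zip(X, y):
        _, out = forward(x, p)
        total -= target * math.log(out) + (1.0 - target) * math.log(1.0 - out)
    return total / len(X)

def train_epoch(p, lr):
    """One pass over the data, updating weights and biases after every sample."""
    for x, target in zip(X, y):
        hidden, out = forward(x, p)
        # Backpropagation: error signal at the output, then at each hidden neuron.
        delta_out = out - target
        delta_hidden = [delta_out * p["W2"][j] * hidden[j] * (1.0 - hidden[j])
                        for j in range(2)]
        for j in range(2):
            p["W2"][j] -= lr * delta_out * hidden[j]
            p["b1"][j] -= lr * delta_hidden[j]
            for i in range(2):
                p["W1"][j][i] -= lr * delta_hidden[j] * x[i]
        p["b2"] -= lr * delta_out

loss_before = mean_cross_entropy(params)
for _ in range(2000):
    train_epoch(params, lr=0.5)
loss_after = mean_cross_entropy(params)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The loss printed after training should be lower than the loss before it, which is the sense in which the network "learns" by adjusting its weights and biases.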

**2. Support Vector Machines (SVM):**

Explanation: SVM is a supervised learning algorithm used for classification and regression tasks.
Functionality: SVM finds the optimal hyperplane that best separates classes by maximizing the margin between them.
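
The margin that an SVM maximizes can be computed directly for any candidate hyperplane. The sketch below compares two hand-picked separating lines on four made-up points; the function name `geometric_margin` and all numeric values are illustrative assumptions. It shows only the quantity being optimized, not the optimization itself, which `SVC` solves internally.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1; a negative result means some point is misclassified.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, label in zip(points, labels))

# Made-up, linearly separable points.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (1.0, 0.0)]
labels = [+1, +1, -1, -1]

margin_a = geometric_margin((1.0, 1.0), -2.5, points, labels)  # diagonal line
margin_b = geometric_margin((1.0, 0.0), -1.5, points, labels)  # vertical line
print(f"margin A: {margin_a:.3f}, margin B: {margin_b:.3f}")
```

Both lines separate the two classes, but line A leaves more room around the closest points, so a maximum-margin classifier would prefer it over line B.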

**3. Naive Bayes:**

Explanation: Naive Bayes is a probabilistic classifier based on Bayes' theorem with an assumption of independence between features.
Functionality: It computes the probability of a data point belonging to a particular class using prior probabilities and likelihoods.
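
As an illustration of combining priors and likelihoods, here is a minimal Gaussian Naive Bayes sketch in plain Python. The (glucose, BMI) readings are made up for the example and are not taken from the dataset; the helper names `fit_gaussian_nb`, `log_posterior`, and `predict` are hypothetical. The notebook itself uses scikit-learn's `BernoulliNB`, which models binary features rather than Gaussian ones.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # eps keeps the variance strictly positive for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posterior(x, prior, means, variances):
    """log P(class) plus the summed log Gaussian likelihood of each feature.

    Summing per-feature terms is exactly the feature-independence assumption.
    """
    total = math.log(prior)
    for value, mean, var in zip(x, means, variances):
        total += -0.5 * math.log(2.0 * math.pi * var) - (value - mean) ** 2 / (2.0 * var)
    return total

def predict(model, x):
    """Return the class with the highest (unnormalized) log posterior."""
    return max(model, key=lambda label: log_posterior(x, *model[label]))

# Made-up (glucose, BMI) readings for illustration only.
X_toy = [[85, 22], [90, 24], [95, 23], [150, 31], [160, 33], [170, 35]]
y_toy = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X_toy, y_toy)
print(predict(model, [155, 32]), predict(model, [88, 23]))
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied together.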

**4. Decision Trees (DT):**

Explanation: DT is a tree-like structure where internal nodes represent feature splits, and leaf nodes represent class labels.
Functionality: It recursively splits data based on features to create a tree, making predictions by traversing from the root to leaf nodes.
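
A single split decision can be sketched as follows, using entropy as the impurity measure (the same criterion the notebook passes to `DecisionTreeClassifier`). The glucose values and labels are made up, and the helper names are hypothetical; a full tree would apply this step recursively to each resulting subset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: information_gain(values, labels, t))

# Made-up glucose readings and labels for illustration only.
glucose = [85, 90, 95, 150, 160, 170]
has_diabetes = [0, 0, 0, 1, 1, 1]

threshold = best_threshold(glucose, has_diabetes)
gain = information_gain(glucose, has_diabetes, threshold)
print(f"best threshold: {threshold}, information gain: {gain:.3f} bits")
```

On this toy data the chosen threshold separates the two classes perfectly, so the gain equals the full entropy of the parent node.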

**5. k-Nearest Neighbors (KNN):**

Explanation: KNN is an instance-based (lazy) learning algorithm: it stores the training data and defers all computation to prediction time.
Functionality: It classifies new data points by assigning them the majority class label among their k-nearest neighbors in the feature space.
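
The majority vote can be sketched in a few lines of plain Python. The points, labels, and the name `knn_predict` are assumptions for illustration. The sketch uses Euclidean distance, which corresponds to the Minkowski metric with `p=2` set in the notebook's `KNeighborsClassifier` call.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    # math.dist is the Euclidean distance (Minkowski distance with p=2).
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up two-dimensional points forming two well-separated clusters.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (6.5, 6.5)))
print(knn_predict(train_points, train_labels, (1.5, 1.5)))
```

With `k=1` the prediction is simply the label of the single closest training point, which tends to be more sensitive to noise than a vote over several neighbors.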

**Conclusion:**

A comprehensive comparison of these algorithms on the Diabetes dataset should address the following points:

Performance Metrics: Evaluate each algorithm's accuracy, precision, recall, F1-score, and AUC-ROC to determine its classification performance.
Algorithm Suitability: Identify which algorithm provides the best balance between predictive accuracy, computational efficiency, and robustness for the given dataset.
Consideration of Hyperparameters: Optimize hyperparameters for each algorithm to improve their predictive performance and generalizability.
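
To show how the performance metrics listed above relate to one another, the sketch below computes them by hand for a small binary example. The labels and scores are made up, and the function names are hypothetical; in practice `sklearn.metrics` provides equivalents such as `classification_report`, which the notebook already uses.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.8, 0.2, 0.1, 0.6, 0.3, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen decision threshold, whereas AUC-ROC summarizes the ranking quality of the scores across all thresholds, which is one reason to report several metrics rather than accuracy alone.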

In [25]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

In [27]:
# ANN
# Load the 20 newsgroups dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

# Preprocess text data using TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Neural Network (MLPClassifier)
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)

# Predict on the test set
y_pred = mlp.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Display classification report
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))

Accuracy: 0.9228723404255319
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.88      0.92       252
         comp.graphics       0.92      0.93      0.93       295
               sci.med       0.90      0.92      0.91       299
soc.religion.christian       0.92      0.95      0.93       282

              accuracy                           0.92      1128
             macro avg       0.92      0.92      0.92      1128
          weighted avg       0.92      0.92      0.92      1128



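In [15]:
# Load the 20 newsgroups dataset (same four categories as the ANN cell)
d = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

In [16]:
# Convert the raw text to TF-IDF features with the fitted vectorizer;
# the classifiers below need numeric features, not raw strings
X, y = vectorizer.transform(d.data), d.target
y = y.astype(int)

In [17]:
# Split the data into training and testing sets
# (test_size=0.3 matches the 1128-document test set reported for the ANN)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)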

In [28]:
# Linear SVM
linear_svm = SVC(kernel='linear', C=1.0)
linear_svm.fit(X_train[:1000], y_train[:1000])
linear_svm_predictions = linear_svm.predict(X_test)
linear_svm_accuracy = accuracy_score(y_test, linear_svm_predictions)
print("Linear SVM Accuracy:", linear_svm_accuracy)

# Polynomial SVM
poly_svm = SVC(kernel='poly', degree=3, C=1.0)
poly_svm.fit(X_train[:1000], y_train[:1000])
poly_svm_predictions = poly_svm.predict(X_test)
poly_svm_accuracy = accuracy_score(y_test, poly_svm_predictions)
print("Polynomial SVM Accuracy:", poly_svm_accuracy)

# Radial Basis Function (RBF) SVM
rbf_svm = SVC(kernel='rbf', C=1.0)
rbf_svm.fit(X_train[:1000], y_train[:1000])
rbf_svm_predictions = rbf_svm.predict(X_test)
rbf_svm_accuracy = accuracy_score(y_test, rbf_svm_predictions)
print("RBF SVM Accuracy:", rbf_svm_accuracy)

Linear SVM Accuracy: 0.9131205673758865
Polynomial SVM Accuracy: 0.8457446808510638
RBF SVM Accuracy: 0.8971631205673759


In [30]:
# Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
bnb_predictions = bnb.predict(X_test)
bnb_accuracy = accuracy_score(y_test, bnb_predictions)
print("Bernoulli Naive Bayes Accuracy:", bnb_accuracy)

Bernoulli Naive Bayes Accuracy: 0.8049645390070922


In [9]:
# KNN with k=1 and Euclidean distance (Minkowski metric with p=2)
# Note: this cell and the Decision Tree cell below use whichever X_train/y_train
# are in scope. Their execution counts (In [9], In [10]) suggest they ran before
# the TF-IDF split was created in In [27], so the accuracies printed here may have
# been computed on different data than the ANN, SVM, and Naive Bayes results.
knnClassifier = KNeighborsClassifier(n_neighbors=1, metric='minkowski', p=2)
knnClassifier.fit(X_train, y_train)
y_predKnn = knnClassifier.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_predKnn)
print("KNeighborsClassifier Accuracy:", knn_accuracy)

KNeighborsClassifier Accuracy: 0.011235955056179775


In [10]:
# Decision Tree using entropy (information gain) as the split criterion
DecClassifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
DecClassifier.fit(X_train, y_train)
y_predDec = DecClassifier.predict(X_test)
dec_accuracy = accuracy_score(y_test, y_predDec)
print("DecisionTreeClassifier Accuracy:", dec_accuracy)

DecisionTreeClassifier Accuracy: 0.011235955056179775
