## Common Classification Models in Machine Learning

Classification models are algorithms that assign data points to predefined classes. The list below groups widely used classification models by family, from classic linear methods to neural networks and ensembles:

**Linear Models**
- Logistic Regression
- Linear Discriminant Analysis (LDA)

**Probabilistic Models**
- Naive Bayes (Gaussian, Multinomial, Bernoulli)

**Instance-Based Models**
- k-Nearest Neighbors (KNN)

**Tree-Based Models**
- Decision Trees
- Random Forest
- Extra Trees (Extremely Randomized Trees)
- Gradient Boosting Machines (GBM)
- XGBoost
- Bagging Classifier

**Support Vector Machines**
- Support Vector Machine (SVM), including various kernels (linear, polynomial, RBF, sigmoid)

**Neural Networks**
- Artificial Neural Networks (ANN)
- Deep Neural Networks (DNN)
- Convolutional Neural Networks (CNN) (for image classification)
- Recurrent Neural Networks (RNN) (for sequence classification)
- Ensemble of Neural Networks

**Ensemble Methods**
- Stacking and Blending (combining multiple models with a meta-classifier)
- Bagging (Bootstrap Aggregating)
- Boosting (AdaBoost, Gradient Boosting, XGBoost)
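
As a minimal sketch of stacking, assuming scikit-learn is installed, `StackingClassifier` trains several base models and passes their cross-validated predictions to a meta-classifier. The synthetic data from `make_classification`, the choice of base estimators, and the `stack_demo` helper name are all illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def stack_demo(random_state=0):
    """Fit a small stacking ensemble and return (fitted model, test accuracy)."""
    # Synthetic binary data (an assumption; stands in for a real dataset).
    X, y = make_classification(n_samples=400, n_features=8, random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    stack = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=random_state)),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(),  # the meta-classifier
        cv=5,  # meta-classifier is trained on 5-fold out-of-fold predictions
    )
    stack.fit(X_train, y_train)
    return stack, stack.score(X_test, y_test)


if __name__ == "__main__":
    model, acc = stack_demo()
    print(f"Stacking test accuracy: {acc:.2f}")
```

Blending follows the same idea but typically trains the meta-classifier on a single held-out split instead of out-of-fold predictions.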

**Other Models**
- Quadratic Discriminant Analysis (QDA)
- Cost-Sensitive Classifiers (for imbalanced data)
- Multi-label and Multi-class adaptations of the above models
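
Most of the families listed above are available in scikit-learn behind the same `fit` / `predict` / `score` interface, which makes them easy to swap. The following is a rough sketch on synthetic data; `make_classification` stands in for a real dataset and the `compare_families` helper is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_families(random_state=0):
    """Fit one representative per model family and return {name: test accuracy}."""
    # Synthetic data without redundant features (an assumption for illustration).
    X, y = make_classification(
        n_samples=300, n_features=6, n_informative=4, n_redundant=0,
        random_state=random_state,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    candidates = {
        "Linear (Logistic Regression)": LogisticRegression(max_iter=1000),
        "Linear (LDA)": LinearDiscriminantAnalysis(),
        "Probabilistic (Gaussian NB)": GaussianNB(),
        "Instance-based (KNN)": KNeighborsClassifier(),
        "Tree-based (Decision Tree)": DecisionTreeClassifier(random_state=random_state),
        "SVM (RBF kernel)": SVC(kernel="rbf"),
        "Other (QDA)": QuadraticDiscriminantAnalysis(),
    }
    scores = {}
    for name, clf in candidates.items():
        clf.fit(X_train, y_train)                  # same call for every family
        scores[name] = clf.score(X_test, y_test)   # mean accuracy on the test split
    return scores


if __name__ == "__main__":
    for name, acc in compare_families().items():
        print(f"{name}: {acc:.2f}")
```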

## Specialized and Advanced Techniques

- **Explainable AI (XAI) Techniques**: These are not models themselves. Methods such as SHAP, LIME, and counterfactual explanations are used to interpret the predictions of complex classification models.
- **Imbalanced Classification Approaches**: Cost-sensitive versions of standard models, sampling techniques (SMOTE, undersampling), and cluster-based oversampling.
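
One simple way to obtain a cost-sensitive version of a standard model in scikit-learn is the `class_weight` parameter. The sketch below assumes a synthetic dataset with roughly a 90/10 class split and reports minority-class recall with and without `class_weight="balanced"`. The `minority_recall` helper is hypothetical, and resampling methods such as SMOTE (provided by the separate imbalanced-learn package) are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split


def minority_recall(class_weight, random_state=0):
    """Return recall on the minority class (label 1) for a given class_weight."""
    # Roughly 90% of samples in class 0 and 10% in class 1 (illustrative).
    X, y = make_classification(
        n_samples=2000, n_features=10, weights=[0.9, 0.1],
        random_state=random_state,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
    clf.fit(X_train, y_train)
    return recall_score(y_test, clf.predict(X_test), pos_label=1)


if __name__ == "__main__":
    print(f"Unweighted recall: {minority_recall(None):.2f}")
    print(f"Balanced recall:   {minority_recall('balanced'):.2f}")
```

Weighting the minority class more heavily typically raises its recall at the cost of some precision, so both metrics are worth checking.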

## Summary Table

| Model Type              | Example Algorithms                                 |
|-------------------------|---------------------------------------------------|
| Linear                  | Logistic Regression, Linear Discriminant Analysis |
| Probabilistic           | Naive Bayes                                       |
| Instance-Based          | k-Nearest Neighbors                               |
| Tree-Based              | Decision Tree, Random Forest, Gradient Boosting   |
| Support Vector Machines | SVM (various kernels)                             |
| Neural Networks         | ANN, DNN, CNN, RNN                                |
| Ensemble Methods        | Bagging, Boosting, Stacking, Blending             |
| Specialized             | Cost-sensitive, Multi-label, Multi-class versions |

Together, these models cover binary, multi-class, and multi-label classification tasks across a wide range of domains.

In [24]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold

from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC

import warnings
warnings.simplefilter('ignore')

In [6]:
df = pd.read_csv(r"C:\Users\asd\Desktop\Diabetes_Prediction\notebook\data\diabetes.csv")

In [7]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [8]:
X = df.drop('Outcome',axis=1)
y = df['Outcome']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
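
When the classes are unevenly represented, passing `stratify=y` to `train_test_split` keeps the class proportions the same in the training and test sets. The sketch below illustrates this on toy labels (65% class 0, 35% class 1); the data and the `stratified_split_demo` helper are assumptions for illustration and do not use the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split_demo(random_state=42):
    """Return the positive-class fraction in the train and test splits."""
    # Toy labels: 130 zeros and 70 ones (illustrative only).
    y = np.array([0] * 130 + [1] * 70)
    X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y
    )
    return y_train.mean(), y_test.mean()


if __name__ == "__main__":
    train_frac, test_frac = stratified_split_demo()
    print(f"Positive fraction, train: {train_frac:.2f}, test: {test_frac:.2f}")
```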

In [10]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
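
A common alternative to scaling by hand is to wrap the scaler and the model in a scikit-learn pipeline. During cross-validation the scaler is then fit on each training fold only, rather than on the full dataset. A minimal sketch, assuming synthetic data from `make_classification` and a hypothetical `pipeline_cv_scores` helper:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def pipeline_cv_scores(n_splits=5, random_state=42):
    """Cross-validate a scaler + classifier pipeline and return per-fold accuracy."""
    # Synthetic data (an assumption; replace with the real X and y).
    X, y = make_classification(n_samples=500, n_features=8, random_state=random_state)
    # The scaler is refit inside every fold, so validation rows never
    # influence the scaling statistics.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return cross_val_score(pipe, X, y, cv=cv)


if __name__ == "__main__":
    scores = pipeline_cv_scores()
    print(f"Fold accuracies: {scores}")
    print(f"Mean accuracy: {scores.mean():.3f}")
```

The same pipeline object can also be passed to `fit` and `score` on a train/test split, which removes the need to track separate scaled copies of the data.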

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC(probability=True),
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    'MLP (Neural Network)': MLPClassifier(max_iter=1000)
}

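# Scale the full feature matrix once for the scaled cross-validation runs.
# A separate scaler leaves the train-only `scaler` fitted earlier untouched.
# Caveat: scaling before cross-validation exposes each fold to statistics
# from its own validation rows, so these scores may be slightly optimistic.
X_scaled = StandardScaler().fit_transform(X)

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)  # needed by classification_report below
    score = model.score(X_test_scaled, y_test)
    print(classification_report(y_test, y_pred))

    # 5-fold cross-validation on the raw (unscaled) features
    cross_score = cross_val_score(model, X, y, cv=5)
    print(f"{name} Accuracy: {score:.2f}")
    print(f"{name} Cross Validation score :{cross_score}")
    print(f"Mean of Cross Validation Score: {cross_score.mean()}")
    print("--"*30)

    # 5-fold cross-validation on the standardized features
    cross_score1 = cross_val_score(model, X_scaled, y, cv=5)
    print(f"{name} Cross Validation Scaled score :{cross_score1}")
    print(f"Mean of Scaled Cross Validation Score: {cross_score1.mean()}")
    print("=="*30)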

Logistic Regression Accuracy: 0.75
Logistic Regression Cross Validation score :[0.77272727 0.74675325 0.75974026 0.81699346 0.75163399]
Mean of Cross Validation Score: 0.7695696460402341
------------------------------------------------------------
Logistic Regression Cross Validation Scaled score :[0.77272727 0.74675325 0.75324675 0.81699346 0.76470588]
Mean of Scaled Cross Validation Score: 0.7708853238265002
SVM Accuracy: 0.73
SVM Cross Validation score :[0.74675325 0.73376623 0.77272727 0.79084967 0.75163399]
Mean of Cross Validation Score: 0.7591460826754943
------------------------------------------------------------
SVM Cross Validation Scaled score :[0.76623377 0.75324675 0.74675325 0.81045752 0.77777778]
Mean of Scaled Cross Validation Score: 0.7708938120702827
KNN Accuracy: 0.69
KNN Cross Validation score :[0.72727273 0.72727273 0.7012987  0.75816993 0.70588235]
Mean of Cross Validation Score: 0.723979288685171
------------------------------------------------------------
KNN C

In [32]:
# Train and evaluate each model
for model_name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"Model: {model_name}\n, Accuracy_Score: {accuracy}")
    print(classification_report(y_test, y_pred))

Model: Logistic Regression
, Accuracy_Score: 0.7532467532467533
              precision    recall  f1-score   support

           0       0.81      0.80      0.81        99
           1       0.65      0.67      0.66        55

    accuracy                           0.75       154
   macro avg       0.73      0.74      0.73       154
weighted avg       0.76      0.75      0.75       154

Model: SVM
, Accuracy_Score: 0.7337662337662337
              precision    recall  f1-score   support

           0       0.77      0.83      0.80        99
           1       0.65      0.56      0.60        55

    accuracy                           0.73       154
   macro avg       0.71      0.70      0.70       154
weighted avg       0.73      0.73      0.73       154

Model: KNN
, Accuracy_Score: 0.6948051948051948
              precision    recall  f1-score   support

           0       0.75      0.80      0.77        99
           1       0.58      0.51      0.54        55

    accuracy          