# Classification

## Dataset preparation

### Step 01 : importing libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Step 02 : Loading data

In [2]:
path = "/content/drive/MyDrive/Projet ML/Modified_ObesityDataset_.csv"
df = pd.read_csv(path)

### Step 03 : split the data into X (features) and y (target)

In [None]:
# Features: every column except the target
X = df.drop('NObeyesdad', axis=1)
# Target: the NObeyesdad class label
y = df['NObeyesdad']

### Step 04 : split the data into training and testing sets

In [None]:
# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
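
The split above shuffles rows without regard to class. As a hedged aside, `train_test_split` also accepts a `stratify` argument that keeps class proportions the same in both subsets. The sketch below uses synthetic data (`X_demo`, `y_demo` are made-up stand-ins, not the obesity dataset) to show the effect; whether stratification changes the results reported in this notebook is not something this sketch establishes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 classes with deliberately uneven sizes (assumption).
rng = np.random.default_rng(42)
y_demo = np.repeat(np.arange(6), [300, 250, 200, 150, 60, 40])
X_demo = rng.normal(size=(len(y_demo), 4))

# stratify=y_demo keeps each class's share equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

full_share = np.bincount(y_demo) / len(y_demo)
test_share = np.bincount(y_te, minlength=6) / len(y_te)
print("Class share, full data:", np.round(full_share, 3))
print("Class share, test set: ", np.round(test_share, 3))
```

In the notebook itself this would amount to adding `stratify=y` to the existing call.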

## Model 01 : Support Vector Machine (SVM)

### Step 01 : import libraries

In [7]:
from sklearn.svm import SVC

### Step 02 : fit the classification model

In [8]:
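# Linear-kernel SVM with the default regularization strength (C=1.0)
svm_classifier = SVC(kernel='linear', random_state=42)
# Train on the training split only
svm_classifier.fit(X_train, y_train)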
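
SVMs are sensitive to the scale of the input features, and this notebook does not show whether the columns of `Modified_ObesityDataset_.csv` were already standardized. A minimal sketch, assuming they were not, wraps the classifier in a pipeline with `StandardScaler` so the scaling is learned from the training split only. The data below is synthetic (`X_demo`, `y_demo` are illustrative names), so the printed accuracy says nothing about the obesity dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); one feature is put on a much larger scale.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data inside the pipeline,
# then the same transformation is applied at prediction time.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", random_state=42))
scaled_svm.fit(X_tr, y_tr)

demo_accuracy = scaled_svm.score(X_te, y_te)
print("Accuracy on synthetic data:", demo_accuracy)
```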

### Step 03 : evaluate the model

In [9]:
# Make predictions on the test set
y_pred = svm_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8858695652173914

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.86      0.87        63
           1       0.72      0.82      0.77        61
           2       0.85      0.71      0.77        55
           3       0.89      0.92      0.91        64
           4       0.97      0.98      0.98        66
           5       1.00      1.00      1.00        59

    accuracy                           0.89       368
   macro avg       0.89      0.88      0.88       368
weighted avg       0.89      0.89      0.89       368


Confusion Matrix:
[[54  9  0  0  0  0]
 [ 5 50  5  1  0  0]
 [ 2  9 39  5  0  0]
 [ 0  1  2 59  2  0]
 [ 0  0  0  1 65  0]
 [ 0  0  0  0  0 59]]
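
The figures in the classification report can be recomputed from the confusion matrix alone, which is a useful check when reading these outputs. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes) and derives accuracy, per-class recall, and per-class precision with NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the linear SVM, copied from the output above.
cm = np.array([
    [54,  9,  0,  0,  0,  0],
    [ 5, 50,  5,  1,  0,  0],
    [ 2,  9, 39,  5,  0,  0],
    [ 0,  1,  2, 59,  2,  0],
    [ 0,  0,  0,  1, 65,  0],
    [ 0,  0,  0,  0,  0, 59],
])

correct = np.trace(cm)           # correctly classified samples (diagonal)
total = cm.sum()                 # all test samples
accuracy = correct / total

recall = np.diag(cm) / cm.sum(axis=1)     # diagonal / row sums (true counts)
precision = np.diag(cm) / cm.sum(axis=0)  # diagonal / column sums (predicted counts)

print(f"{correct} of {total} correct, accuracy = {accuracy:.4f}")
print("Recall:   ", np.round(recall, 2))
print("Precision:", np.round(precision, 2))
```

The row sums also reproduce the `support` column of the report (63, 61, 55, 64, 66, 59).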


### Step 04 : grid search to fine-tune the hyperparameters of SVC

In [10]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],  # Regularization parameter
    'kernel': ['linear', 'rbf', 'poly'],  # Kernel types to try
    'gamma': ['scale', 'auto']  # Kernel coefficient for 'rbf' and 'poly'
}

# Create an SVC classifier object
svm_classifier = SVC(random_state=42)

# Instantiate GridSearchCV
grid_search = GridSearchCV(estimator=svm_classifier, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Get the best estimator (classifier)
best_classifier = grid_search.best_estimator_

# Make predictions on the test set using the best classifier
y_pred = best_classifier.predict(X_test)

# Evaluate the best classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report and confusion matrix
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Best Parameters: {'C': 100, 'gamma': 'scale', 'kernel': 'linear'}
Accuracy: 0.9565217391304348

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        63
           1       0.90      0.90      0.90        61
           2       0.89      0.91      0.90        55
           3       0.97      0.97      0.97        64
           4       1.00      0.98      0.99        66
           5       1.00      1.00      1.00        59

    accuracy                           0.96       368
   macro avg       0.96      0.96      0.96       368
weighted avg       0.96      0.96      0.96       368


Confusion Matrix:
[[61  2  0  0  0  0]
 [ 2 55  4  0  0  0]
 [ 0  4 50  1  0  0]
 [ 0  0  2 62  0  0]
 [ 0  0  0  1 65  0]
 [ 0  0  0  0  0 59]]
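
In `SVC`, `gamma` only applies to the `rbf`, `poly`, and `sigmoid` kernels, so the `'gamma': 'scale'` entry in the best parameters above has no effect on the selected linear model. The single-dictionary grid therefore fits every linear candidate twice. One possible alternative, sketched here as an assumption rather than what the notebook ran, is to pass `GridSearchCV` a list of dictionaries so that `gamma` is only varied where it matters. `ParameterGrid` is used to count the candidates in each layout.

```python
from sklearn.model_selection import ParameterGrid

# Grid as written in the notebook: gamma is crossed with every kernel.
original_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
}

# Alternative layout (assumption): gamma is searched only for rbf and poly.
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10, 100], "gamma": ["scale", "auto"]},
]

n_original = len(ParameterGrid(original_grid))
n_split = len(ParameterGrid(split_grid))

# With cv=5, each candidate is fitted 5 times during the search.
print("Candidates:", n_original, "vs", n_split)
print("Fits with cv=5:", n_original * 5, "vs", n_split * 5)
```

`split_grid` can be passed as `param_grid` to `GridSearchCV` unchanged; the set of distinct models searched is the same.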


## Model 02 : Random Forest

### Step 01 : import libraries

In [11]:
from sklearn.ensemble import RandomForestClassifier

### Step 02 : create and fit the Random Forest classifier

In [12]:
# Create a Random Forest classifier object
rf_classifier = RandomForestClassifier(random_state=42)

# Fit the classifier to the training data
rf_classifier.fit(X_train, y_train)


### Step 03 : evaluate the model

In [13]:
# Make predictions on the test set
y_pred_rf = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Accuracy (Random Forest):", accuracy_rf)

# Generate classification report
print("\nClassification Report (Random Forest):")
print(classification_report(y_test, y_pred_rf))

# Generate confusion matrix
print("\nConfusion Matrix (Random Forest):")
print(confusion_matrix(y_test, y_pred_rf))


Accuracy (Random Forest): 0.9429347826086957

Classification Report (Random Forest):
              precision    recall  f1-score   support

           0       0.84      0.94      0.89        63
           1       0.91      0.87      0.89        61
           2       0.94      0.87      0.91        55
           3       0.97      0.97      0.97        64
           4       1.00      1.00      1.00        66
           5       1.00      1.00      1.00        59

    accuracy                           0.94       368
   macro avg       0.94      0.94      0.94       368
weighted avg       0.94      0.94      0.94       368


Confusion Matrix (Random Forest):
[[59  4  0  0  0  0]
 [ 6 53  2  0  0  0]
 [ 4  1 48  2  0  0]
 [ 1  0  1 62  0  0]
 [ 0  0  0  0 66  0]
 [ 0  0  0  0  0 59]]


## Comparison


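1. **Accuracy**:
   - SVM: 95.65% on the test set after grid search (`C=100`, linear kernel), up from 88.59% for the linear SVM with default `C`.
   - Random Forest: 94.29% on the test set with default hyperparameters (no tuning was performed).

2. **Precision, Recall, and F1-score**:
   - Both models reach an F1-score of at least 0.89 on every class. The weakest classes are 1 and 2 for the tuned SVM (F1 0.90) and 0 and 1 for Random Forest (F1 0.89); both classify class 5 perfectly.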

3. **Computational Complexity**:
   - SVM: Generally slower to train on large datasets, because fitting solves a quadratic optimization problem whose cost grows at least quadratically with the number of samples.
   - Random Forest: Cost grows with the number of trees, but the trees are built independently and can be trained in parallel, so it typically scales better with large datasets.

4. **Interpretability**:
   - SVM: With a linear kernel, the learned feature weights can be inspected directly; with nonlinear kernels this direct reading is lost.
   - Random Forest: Harder to interpret as a whole because each prediction aggregates many trees, although feature importances give a coarse summary.

5. **Robustness to Overfitting**:
   - SVM: Can overfit when regularization is weak (large `C`) or the kernel is very flexible, particularly when there are few samples relative to the number of features.
   - Random Forest: Less prone to overfitting because bagging and random feature selection average out the variance of individual trees.

6. **Scalability**:
   - SVM: Kernel SVM scales poorly with the number of samples.
   - Random Forest: Typically scales better with large datasets and handles many features by sampling a subset at each split.

7. **Sensitivity to Hyperparameters**:
   - SVM: Sensitive to `C` (regularization parameter) and the choice of kernel; here, grid search raised test accuracy from 88.59% to 95.65%.
   - Random Forest: Also sensitive to hyperparameters such as the number of trees, maximum depth, and minimum samples per leaf, none of which were tuned here.
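
The interpretability point can be made concrete with the attributes scikit-learn exposes on fitted models. The sketch below, on synthetic data (`X_demo`, `y_demo` are illustrative stand-ins), reads `coef_` from a linear `SVC` and `feature_importances_` from a `RandomForestClassifier`. It only illustrates where these quantities live, not what they would look like for the obesity dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic 3-class problem with 5 features (assumption, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

linear_svm = SVC(kernel="linear", random_state=42).fit(X_demo, y_demo)
forest = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# SVC is one-vs-one: 3 classes give 3 class pairs, each with one weight per feature.
# coef_ is only available for the linear kernel.
print("SVM weight matrix shape:", linear_svm.coef_.shape)

# One non-negative importance per feature; the values sum to 1.
print("Forest feature importances:", forest.feature_importances_.round(3))
```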

Based on these comparisons:

- **Accuracy**: The tuned SVM is slightly ahead of the untuned Random Forest (352 vs. 347 of 368 test samples correct). The gap comes from a single train/test split and was not tested for statistical significance, so neither model is a clear winner on accuracy.
- **Interpretability**: The selected SVM uses a linear kernel, so its weights can be inspected directly, which is an advantage when model interpretability is crucial.
- **Robustness to Overfitting**: Random Forest tends to be less prone to overfitting, which can be beneficial with complex or noisy datasets.
- **Scalability**: Random Forest scales better with large datasets, making it the more suitable choice if the data is expected to grow.
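
A comparison based on one train/test split can shift with the split itself. One common way to get a steadier estimate is k-fold cross-validation, which reports a mean and spread of accuracy per model. The sketch below is a hedged illustration on synthetic data (`X_demo`, `y_demo`, and the `models` dictionary are assumptions); the numbers it prints are unrelated to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# The same folds are reused for both models so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "SVM (linear)": SVC(kernel="linear", random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    print(f"{name}: mean={scores[name].mean():.3f}, std={scores[name].std():.3f}")
```

If the difference between the two means is small relative to the fold-to-fold standard deviation, the models are hard to separate on accuracy alone.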

