# Classification

## Dataset preparation

### Step 01 : import libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Step 02 : Loading data

In [3]:
# Mount Google Drive first so that the CSV path below is reachable
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

In [4]:
path = "/content/drive/MyDrive/Projet ML/Modified_ObesityDataset_.csv"
df = pd.read_csv(path)

### Step 03 : split the data into X (features) and y (target)

In [5]:
# 'NObeyesdad' is the target column; every other column is a feature
X = df.drop('NObeyesdad', axis=1)
y = df['NObeyesdad']


### Step 04 : split the data into training and testing sets

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
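
The split above is purely random. With six classes, one optional variation is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below uses made-up data (`X_demo`, `y_demo` are hypothetical names, not part of this notebook) only to illustrate the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical, deliberately imbalanced toy data: 150 / 100 / 50 samples per class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.repeat([0, 1, 2], [150, 100, 50])

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of each class in the training and test sets
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", train_share.round(3))
print("test: ", test_share.round(3))
```

Whether stratification changes anything for the obesity dataset depends on how balanced its classes are, which this notebook does not check.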

### Step 05 : Scaling the data

Scaling transforms the numerical features so that they share a comparable scale. `StandardScaler` standardizes each feature to zero mean and unit variance, using statistics learned from the training set only. This prevents features with large numeric ranges from dominating the others, which matters for distance-based models such as KNN and helps the solver used by logistic regression converge.

In [None]:
from sklearn.preprocessing import StandardScaler

# Instantiate the StandardScaler
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)
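
To see what standardization does in practice, here is a small sketch on made-up numbers (the height and weight values and the `*_demo` names are assumptions for illustration, not rows from the dataset). The scaler learns the mean and standard deviation from the training rows and reuses them on the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: height in m, weight in kg
train_demo = np.array([[1.5, 50.0],
                       [1.6, 60.0],
                       [1.7, 70.0],
                       [1.8, 80.0]])
test_demo = np.array([[1.9, 90.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std, then scales
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean/std

print("train mean per feature:", train_scaled.mean(axis=0))
print("train std per feature: ", train_scaled.std(axis=0))
print("scaled test row:       ", test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1, and the test row lands at the same value in both columns even though the raw units differ by two orders of magnitude.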

## Model 03 : Logistic Regression

### Step 01 : import libraries

In [7]:
from sklearn.linear_model import LogisticRegression

### Step 02 : fit the classification model

In [16]:
# Create a Logistic Regression classifier object
lr_classifier = LogisticRegression(random_state=42, max_iter=1000)

# Fit the classifier to the scaled training data
lr_classifier.fit(X_train_scaled, y_train)

### Step 03 : evaluate the model

In [17]:
# Make predictions on the test set
y_pred_lr = lr_classifier.predict(X_test_scaled)

# Calculate accuracy
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy (Logistic Regression):", accuracy_lr)

# Generate classification report
print("\nClassification Report (Logistic Regression):")
print(classification_report(y_test, y_pred_lr))

# Generate confusion matrix
print("\nConfusion Matrix (Logistic Regression):")
print(confusion_matrix(y_test, y_pred_lr))


Accuracy (Logistic Regression): 0.9157608695652174

Classification Report (Logistic Regression):
              precision    recall  f1-score   support

           0       0.92      0.90      0.91        63
           1       0.78      0.82      0.80        61
           2       0.84      0.84      0.84        55
           3       0.97      0.94      0.95        64
           4       0.98      0.98      0.98        66
           5       1.00      1.00      1.00        59

    accuracy                           0.92       368
   macro avg       0.91      0.91      0.91       368
weighted avg       0.92      0.92      0.92       368


Confusion Matrix (Logistic Regression):
[[57  6  0  0  0  0]
 [ 5 50  6  0  0  0]
 [ 0  8 46  1  0  0]
 [ 0  0  3 60  1  0]
 [ 0  0  0  1 65  0]
 [ 0  0  0  0  0 59]]
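
The figures in the classification report can be recomputed from the confusion matrix alone, which is a useful sanity check. The sketch below copies the matrix printed above into an array (`cm_lr` is a name introduced here) and relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the logistic regression output above
cm_lr = np.array([
    [57,  6,  0,  0,  0,  0],
    [ 5, 50,  6,  0,  0,  0],
    [ 0,  8, 46,  1,  0,  0],
    [ 0,  0,  3, 60,  1,  0],
    [ 0,  0,  0,  1, 65,  0],
    [ 0,  0,  0,  0,  0, 59],
])

correct = np.diag(cm_lr)                 # correctly classified samples per class
accuracy = correct.sum() / cm_lr.sum()   # overall share of correct predictions
recall = correct / cm_lr.sum(axis=1)     # divide by row totals (true counts)
precision = correct / cm_lr.sum(axis=0)  # divide by column totals (predicted counts)

print("accuracy: ", accuracy)
print("recall:   ", recall.round(2))
print("precision:", precision.round(2))
```

The row totals are the `support` column of the report, and the rounded recall and precision values agree with the per-class figures printed above.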


## Model 04 : KNN

### Step 01 : import libraries

In [20]:
from sklearn.neighbors import KNeighborsClassifier

### Step 02 : create and fit the KNN classifier

In [21]:
# Create a KNN classifier that votes among the 5 nearest neighbors
knn_classifier = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the scaled training data
knn_classifier.fit(X_train_scaled, y_train)


### Step 03 : evaluate the model

In [22]:
# Make predictions on the test set
y_pred = knn_classifier.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7880434782608695
Classification Report:
              precision    recall  f1-score   support

           0       0.66      0.63      0.65        63
           1       0.71      0.67      0.69        61
           2       0.74      0.64      0.69        55
           3       0.76      0.81      0.79        64
           4       0.85      0.95      0.90        66
           5       0.98      1.00      0.99        59

    accuracy                           0.79       368
   macro avg       0.78      0.79      0.78       368
weighted avg       0.78      0.79      0.78       368

Confusion Matrix:
[[40 12  6  4  1  0]
 [11 41  3  5  0  1]
 [ 7  3 35  5  5  0]
 [ 3  2  2 52  5  0]
 [ 0  0  1  2 63  0]
 [ 0  0  0  0  0 59]]


### Step 04 : Grid search to optimize parameters

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Step 1: Define the parameter grid
param_grid = {
    'knn__n_neighbors': [3, 5, 7],  # Number of neighbors
    'knn__weights': ['uniform', 'distance'],  # Weight function
    'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],  # Algorithm used to compute the nearest neighbors
}

# Step 2: Instantiate the KNN classifier within a pipeline
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Step 3: Instantiate the GridSearchCV
grid_search = GridSearchCV(estimator=knn_pipeline, param_grid=param_grid, cv=5, scoring='accuracy')

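# Step 4: Fit the GridSearchCV to the unscaled training data.
# The pipeline already contains a StandardScaler that is refit inside each CV fold,
# so passing pre-scaled data would standardize the features a second time.
grid_search.fit(X_train, y_train)

# Step 5: Get the best parameters and best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

print("Best Parameters:", best_params)

# Step 6: Evaluate the best model (the pipeline scales X_test itself)
y_pred = best_estimator.predict(X_test)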
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Best Parameters: {'knn__algorithm': 'auto', 'knn__n_neighbors': 7, 'knn__weights': 'distance'}
Accuracy: 0.8179347826086957
Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.67      0.68        63
           1       0.79      0.69      0.74        61
           2       0.81      0.71      0.76        55
           3       0.76      0.84      0.80        64
           4       0.86      0.98      0.92        66
           5       0.98      1.00      0.99        59

    accuracy                           0.82       368
   macro avg       0.82      0.82      0.81       368
weighted avg       0.82      0.82      0.81       368

Confusion Matrix:
[[42 10  5  5  1  0]
 [ 9 42  2  7  0  1]
 [ 5  1 39  5  5  0]
 [ 4  0  1 54  5  0]
 [ 0  0  1  0 65  0]
 [ 0  0  0  0  0 59]]
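
The output above shows only the winning combination. To judge how much the hyperparameters actually matter, it can help to look at every combination in `cv_results_`. The sketch below does this on synthetic data from `make_classification` (the data and the names `pipe_demo`, `search_demo`, `results` are assumptions for illustration), with a smaller grid than the one used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Scaler and classifier in one pipeline, so scaling is refit on each CV fold
pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
grid_demo = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
}
search_demo = GridSearchCV(pipe_demo, grid_demo, cv=5, scoring="accuracy")
search_demo.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
columns = ["param_knn__n_neighbors", "param_knn__weights",
           "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search_demo.cv_results_)[columns].sort_values("rank_test_score")
print(results.to_string(index=False))
```

If the mean scores of the top few rows differ by less than their standard deviations, the choice among them is largely arbitrary.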


## Comparison


1. **Accuracy**:
   - KNN (after grid search): 0.8179
   - Logistic Regression: 0.9158

2. **Precision**:
   - KNN: Macro avg: 0.82, Weighted avg: 0.82
   - Logistic Regression: Macro avg: 0.91, Weighted avg: 0.92

3. **Recall**:
   - KNN: Macro avg: 0.82, Weighted avg: 0.82
   - Logistic Regression: Macro avg: 0.91, Weighted avg: 0.92

4. **F1-score**:
   - KNN: Macro avg: 0.81, Weighted avg: 0.81
   - Logistic Regression: Macro avg: 0.91, Weighted avg: 0.92

5. **Confusion Matrix**:
   - KNN: 67 of the 368 test samples are misclassified, and the errors are spread across the matrix (for example, 5 samples of class 0 are predicted as class 3).
   - Logistic Regression: 31 of the 368 test samples are misclassified, and every error falls in a cell directly next to the diagonal, so the model only confuses classes with neighboring labels.

On this single train/test split, the Logistic Regression model outperforms the KNN classifier, even after grid search, on accuracy (0.92 vs 0.82) and on macro and weighted precision, recall, and F1-score. Its confusion matrix also shows fewer than half as many misclassifications (31 vs 67).
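
The confusion-matrix comparison can be made concrete by counting errors directly. The sketch below copies the two matrices printed earlier (logistic regression and the grid-searched KNN) and counts, for each, the total number of misclassifications and how many of them are at least two class indices away from the true class. The helper `error_summary` is introduced here for illustration, and reading "distance between class indices" as meaningful assumes the integer labels are ordered by obesity level, which this notebook does not state.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true, columns: predicted)
cm_lr = np.array([
    [57,  6,  0,  0,  0,  0],
    [ 5, 50,  6,  0,  0,  0],
    [ 0,  8, 46,  1,  0,  0],
    [ 0,  0,  3, 60,  1,  0],
    [ 0,  0,  0,  1, 65,  0],
    [ 0,  0,  0,  0,  0, 59],
])
cm_knn = np.array([
    [42, 10,  5,  5,  1,  0],
    [ 9, 42,  2,  7,  0,  1],
    [ 5,  1, 39,  5,  5,  0],
    [ 4,  0,  1, 54,  5,  0],
    [ 0,  0,  1,  0, 65,  0],
    [ 0,  0,  0,  0,  0, 59],
])

def error_summary(cm):
    """Return (total errors, errors at least two class indices from the truth)."""
    idx = np.arange(cm.shape[0])
    distance = np.abs(idx[:, None] - idx[None, :])  # |true index - predicted index|
    total_errors = int(cm[distance >= 1].sum())
    far_errors = int(cm[distance >= 2].sum())
    return total_errors, far_errors

print("Logistic Regression:", error_summary(cm_lr))
print("KNN (grid search):  ", error_summary(cm_knn))
```

Logistic regression makes 31 errors, none of them more than one class index away, while the tuned KNN makes 67 errors, 34 of which are two or more indices away.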
