In [2]:
##importing the dataset
from google.colab import files
uploaded = files.upload()


Saving diabetes.csv to diabetes.csv


In [19]:
#importing required libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report



#loading csv in dataframe

df=pd.read_csv("diabetes.csv")
print(df.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

In [4]:
df.shape

(768, 9)

In [5]:
## getting the statistical measures of data
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|------|------|------|------|------|------|------|------|------|------|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


## 📊 Feature Scaling Requirement

From the statistical summary, it is clear that the features in the dataset exist on **very different numerical scales**. For example, `Insulin` ranges from 0 to 846, while `DiabetesPedigreeFunction` ranges only from 0.078 to 2.42.

Such variation can negatively impact models that rely on distance calculations. Since **SVM is a distance-based algorithm** (its decision boundary is defined by margins between data points), feature scaling becomes necessary so that every feature can contribute fairly. Without scaling, features with larger numerical ranges may dominate the learning process and bias the model.
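
To make the scale problem concrete, here is a small sketch that compares two rows from the dataset preview using only `Insulin` and `DiabetesPedigreeFunction`. It is an illustration of the general effect, not part of the modelling pipeline; the variable names are chosen here for the example.

```python
import numpy as np

# Two patients described by (Insulin, DiabetesPedigreeFunction),
# using the values of rows 3 and 4 from the dataset preview.
a = np.array([94.0, 0.167])
b = np.array([168.0, 2.288])

# Per-feature squared differences that add up to the squared Euclidean distance.
squared_diffs = (a - b) ** 2
share = squared_diffs / squared_diffs.sum()

print("squared differences:", squared_diffs)
print("share of squared distance:", share)
```

Insulin accounts for more than 99% of the squared distance here, even though the `DiabetesPedigreeFunction` gap (2.121) spans most of that feature's observed range.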

---

## 🛠️ Scaling Technique Used: StandardScaler

To address this, **StandardScaler** is applied to the dataset.

StandardScaler transforms each feature such that:
- The **mean becomes 0**
- The **standard deviation becomes 1**

This ensures that:
- All features are on a comparable scale, so none dominates purely because of its units
- Model training becomes more stable and the results more reliable
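
The transformation can be checked by hand. The sketch below standardizes a single column with the formula z = (x - mean) / std and compares the result with `StandardScaler`. The five values are the first five `Glucose` readings from the preview; everything else is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column: the first five Glucose values from the dataset preview.
x = np.array([[148.0], [85.0], [183.0], [89.0], [137.0]])

# Manual standardization with the population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

z_sklearn = StandardScaler().fit_transform(x)

print("manual and sklearn agree:", np.allclose(z_manual, z_sklearn))
print("mean:", z_sklearn.mean(axis=0), "std:", z_sklearn.std(axis=0))
```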


In [6]:
##target value distribution
df["Outcome"].value_counts()

| Outcome | count |
|------|------|
| 0 | 500 |
| 1 | 268 |


## Class Imbalance Analysis

The distribution of the target variable `Outcome` shows that the dataset is **mildly imbalanced**: 500 non-diabetic cases (about 65%) versus 268 diabetic cases (about 35%). The imbalance is not severe, but it is still important to handle it properly to avoid biased predictions.

---

##  Why Class Imbalance Matters

If a dataset is imbalanced, a model can achieve high accuracy by simply predicting the majority class.  
In medical datasets, this is risky because the minority class (diabetic patients) is the most important to identify.

Therefore, accuracy alone is not a reliable evaluation metric in this case.
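
As a quick illustration of this point, the sketch below scores a trivial "model" that always predicts the majority class. The labels are synthetic and only reproduce the class counts reported above (500 and 268).

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class counts as the dataset:
# 500 non-diabetic (0) and 268 diabetic (1).
y_true = np.array([0] * 500 + [1] * 268)

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)  # recall for the diabetic class

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"recall:   {baseline_recall:.3f}")
```

The baseline reaches roughly 65% accuracy while identifying no diabetic patients at all, which is why recall and F1 are reported alongside accuracy.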

---

##  Measures Taken to Handle Class Imbalance

### 1️⃣ Stratified Train-Test Split
I used **stratified sampling** while splitting the data so that both training and testing sets maintain the same class distribution as the original dataset.
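
The effect of stratification can be verified on labels alone. This sketch uses synthetic labels with the dataset's class counts and a placeholder feature matrix; the split parameters mirror the ones used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts (500 zeros, 268 ones).
# The feature matrix is a placeholder because only the labels matter here.
y = np.array([0] * 500 + [1] * 268)
X = np.arange(len(y)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The share of positive labels should be nearly identical in every subset.
for name, labels in [("full", y), ("train", y_tr), ("test", y_te)]:
    print(f"{name:5s} size={len(labels):3d} positive share={labels.mean():.3f}")
```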

---

### 2️⃣ Appropriate Evaluation Metrics
Instead of relying only on accuracy, I evaluated the model using:

- **Precision** – measures how many predicted positives are actually correct  
- **Recall** – measures how well the model identifies actual positive cases  
- **F1-score** – balances precision and recall  
- **Confusion Matrix** – shows detailed prediction performance  

These metrics provide a more reliable evaluation for imbalanced data.

---

### 3️⃣ Class Weight Handling
To reduce bias toward the majority class, **class weights** were used during model training.  
This makes errors on the minority class more costly, so the model gives that class more importance and is less likely to miss diabetic cases.
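
For intuition, scikit-learn's `"balanced"` mode sets each class weight to `n_samples / (n_classes * count_of_class)`. The sketch below evaluates this on synthetic labels with the full dataset's class counts; the classifier itself derives the weights from the training labels, so its exact values may differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the full dataset's class counts (illustrative only).
y = np.array([0] * 500 + [1] * 268)
classes = np.array([0, 1])

# Manual formula: n_samples / (n_classes * count_per_class)
manual = len(y) / (len(classes) * np.bincount(y))

# The same computation through scikit-learn's helper.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print("manual: ", manual)
print("sklearn:", weights)
```

The minority class receives the larger weight, so each misclassified diabetic sample is penalized more heavily during training.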

---



Together, these measures help keep the model's performance fair and its evaluation reliable.


## Feature-wise Comparison by Outcome

To better understand how different features behave across classes, the mean values of all numerical features were calculated separately for each outcome class.

This group-wise analysis helps identify patterns and differences between diabetic and non-diabetic patients.






In [7]:
df.groupby("Outcome").mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|------|------|------|------|------|------|------|------|------|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


### Observations from above analysis

- **Glucose** levels are clearly higher in diabetic patients (mean 141.3 vs 110.0), suggesting a strong association with the target variable.
- **BMI** is also higher for diabetic individuals (35.1 vs 30.3), suggesting that increased body weight is an important contributing factor.
- **Age** is noticeably higher in diabetic patients (37.1 vs 31.2), indicating higher risk with age.
- **Insulin** (100.3 vs 68.8) and **SkinThickness** (22.2 vs 19.7) values are moderately higher for the diabetic class.
- The remaining features (Pregnancies, BloodPressure, DiabetesPedigreeFunction) also have higher means in the diabetic class, with the BloodPressure gap being the smallest.

In [8]:
##separating the data and label
X=df.drop(columns="Outcome")
y=df["Outcome"]

In [9]:
print(X.head())
print("\n")
print(y.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
2                     0.672   32  
3                     0.167   21  
4                     2.288   33  


0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64


In [10]:

##train test split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)



In [11]:
## Data Standardization

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)





In [12]:
##verifying shapes of training and testing
print(X.shape, X_train.shape, X_test.shape)
print(y.shape, y_train.shape, y_test.shape)


(768, 8) (614, 8) (154, 8)
(768,) (614,) (154,)


In [17]:
##training the support vector machine classifier model
classifier = SVC(kernel='linear', class_weight='balanced')
classifier.fit(X_train, y_train)




In [20]:
##Model Evaluation
# Predictions
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)

# Training performance
print("Training Accuracy:", accuracy_score(y_train, y_train_pred))

# Testing performance
print("\nTest Accuracy:", accuracy_score(y_test, y_test_pred))
print("Precision:", precision_score(y_test, y_test_pred))
print("Recall:", recall_score(y_test, y_test_pred))
print("F1 Score:", f1_score(y_test, y_test_pred))

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred))


Training Accuracy: 0.7703583061889251

Test Accuracy: 0.7532467532467533
Precision: 0.6290322580645161
Recall: 0.7222222222222222
F1 Score: 0.6724137931034483

Confusion Matrix:
[[77 23]
 [15 39]]

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.77      0.80       100
           1       0.63      0.72      0.67        54

    accuracy                           0.75       154
   macro avg       0.73      0.75      0.74       154
weighted avg       0.76      0.75      0.76       154



# Model Evaluation Summary

On the held-out test set (154 samples), the SVM classifier performs reasonably evenly across both classes.

## 🔹 Overall Performance

- **Training Accuracy:** 77.0%  
- **Test Accuracy:** 75.3%

The small gap between training and testing accuracy (about 1.7 percentage points) suggests good generalization and minimal overfitting.

## 🔹 Class-wise Performance

| Class | Precision | Recall | F1-score | Support |
|------|----------|--------|----------|---------|
| 0 (Non-Diabetic) | 0.84 | 0.77 | 0.80 | 100 |
| 1 (Diabetic) | 0.63 | 0.72 | 0.67 | 54 |

## 🔹 Key Observations from Classification Report

- Precision (**0.63**) for the diabetic class means that 39 of the 62 patients predicted as diabetic actually are; the remaining 23 are non-diabetic cases incorrectly flagged.  
- Recall (**0.72**) for the diabetic class shows the model correctly identifies 39 of the 54 diabetic patients, which is crucial in medical diagnosis.  
- F1-score (**0.67**) reflects a reasonable balance between precision and recall for the minority class.  
- The confusion matrix shows 38 misclassifications out of 154 test samples (23 false positives and 15 false negatives).

## 🔹 Confusion Matrix Interpretation

- **True Negatives (TN = 77):**  
  Non-diabetic patients correctly classified.

- **False Positives (FP = 23):**  
  Non-diabetic patients incorrectly classified as diabetic.

- **False Negatives (FN = 15):**  
  Diabetic patients incorrectly classified as non-diabetic (more critical error).

- **True Positives (TP = 39):**  
  Diabetic patients correctly identified.
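
The reported metrics follow directly from these four counts. The short check below recomputes them by hand; the variable names are chosen here for illustration.

```python
# Counts read off the confusion matrix for the test set.
tn, fp, fn, tp = 77, 23, 15, 39

accuracy = (tp + tn) / (tn + fp + fn + tp)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # correct among predicted diabetic
recall = tp / (tp + fn)                      # found among actual diabetic
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```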

## Final Conclusion

The SVM model delivers solid baseline performance, especially considering the class imbalance.  
It identifies about 72% of the diabetic cases in the test set with a precision of 0.63, which makes it a reasonable baseline for further optimization before any real-world application.


In [21]:
# Making a Predictive System (Diabetes Prediction)

# Sample input data
# Order: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
input_data = (4, 110, 92, 0, 0, 37.6, 0.191, 30)

# Convert the input to a one-row DataFrame with the training column names,
# so the scaler receives the same feature names it was fitted on
input_df = pd.DataFrame([input_data], columns=X.columns)


# Standardize the input with the scaler fitted on the training data
std_data = scaler.transform(input_df)

# Make the prediction
prediction = classifier.predict(std_data)

if prediction[0] == 0:
    print("Prediction: The person is NOT diabetic")
else:
    print("Prediction: The person IS diabetic")


Prediction: The person is NOT diabetic
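
One possible way to package the prediction step is to bundle the scaler and classifier into a single scikit-learn `Pipeline`, so raw feature values can be passed in directly. The sketch below is only an illustration under stated assumptions: it trains on synthetic stand-in data from `make_classification` (not the diabetes dataset), and `predict_one` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 8 features (NOT the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The pipeline applies scaling and classification as one unit.
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", class_weight="balanced"),
)
model.fit(X_demo, y_demo)


def predict_one(fitted_model, features):
    """Hypothetical helper: predicted class for one raw feature vector."""
    row = np.asarray(features, dtype=float).reshape(1, -1)
    return int(fitted_model.predict(row)[0])


label = predict_one(model, X_demo[0])
print("Predicted class:", label)
```

With this arrangement the scaling step cannot be forgotten at prediction time, because it is part of the fitted model object.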


