---
<center><h1>Diabetes Prediction</h1></center>

---

## 1) Understanding Problem Statement
---

**Diabetes** is a chronic medical condition that affects how your body processes glucose (sugar) in the blood. It occurs when either the pancreas doesn't produce enough insulin (a hormone that regulates blood sugar) or when the body can't effectively use the insulin it produces. This results in elevated blood sugar levels, which can lead to various health complications if not managed properly.

The goal of this project is to use machine learning **to predict whether an individual has diabetes based on certain medical variables**. This is a **binary classification problem**. Such a prediction can support early diagnosis and intervention, which can improve the management of diabetes.




## 2) Understanding Data
---
The project uses a diabetes dataset that contains several medical variables (independent variables) and one outcome variable (dependent variable) for each individual.

### Dataset Description:

- **Pregnancies:** The number of times a woman has been pregnant.
- **Glucose:** Plasma glucose concentration measured 2 hours after an oral glucose tolerance test.
- **BloodPressure:** Diastolic blood pressure (in mmHg).
- **SkinThickness:** Triceps skin fold thickness (in mm).
- **Insulin:** 2-hour serum insulin level (in μU/ml).
- **BMI (Body Mass Index):** A measure of body fat based on weight in kilograms divided by height in meters squared.
- **Age:** The age of the individual in years.
- **DiabetesPedigreeFunction:** A score that indicates the likelihood of diabetes based on family history.
- **Outcome:** The outcome variable with two possible values:
  - 0: Indicates that the individual does not have diabetes.
  - 1: Indicates that the individual has diabetes.


## 3) Getting System Ready
---
Importing required libraries


In [1]:
import numpy as np
import pandas as pd

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

## 4) Data Eyeballing
---

### Loading Data

In [2]:
diabetes_data = pd.read_csv('Datasets/Day2_Diabetes_Data.csv') 

In [3]:
diabetes_data

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
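Several rows in the preview show 0 for Insulin or SkinThickness. A reading of zero is unlikely to be a genuine measurement for these variables, so such zeros may be standing in for missing values; the notebook does not say so, and this should be treated as an assumption. A minimal sketch that counts zeros per column, using only the first five rows copied from the preview (the list `suspect_columns` is a hypothetical choice):

```python
import pandas as pd

# First five rows copied from the preview above (not the full dataset).
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a zero is unlikely to be a real measurement (assumption).
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count how many zeros each suspect column contains.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

On the full data, the same check would be run against `diabetes_data` instead of `sample`.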


In [4]:
print('The size of Dataframe is: ', diabetes_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
diabetes_data.info()
print('-'*100)

The size of Dataframe is:  (768, 9)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
----------------------------------------------------------------------------------------------------

In [None]:
# printing the first 5 rows of the dataset
diabetes_data.head()

In [None]:
# number of rows and columns in this dataset
diabetes_data.shape

In [None]:
# getting the statistical measures of the data
diabetes_data.describe()

In [None]:
diabetes_data['Outcome'].value_counts()

0 --> Non-Diabetic

1 --> Diabetic
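The class counts matter when reading accuracy later: a model that always predicts the more frequent class already scores that class's share of the data. A minimal sketch with toy labels (made up for illustration, not taken from the dataset) that computes the counts, the proportions, and this majority-class baseline:

```python
import pandas as pd

# Toy labels for illustration only; not the real Outcome column.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Outcome")

counts = outcome.value_counts()                      # absolute counts per class
proportions = outcome.value_counts(normalize=True)   # share of each class
majority_baseline = proportions.max()                # accuracy of "always predict the majority"

print(counts)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Running the same three lines on `diabetes_data['Outcome']` would give the baseline against which the classifier's accuracy can be judged.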

In [None]:
diabetes_data.groupby('Outcome').mean()

In [None]:
# separating the features and labels
X = diabetes_data.drop(columns='Outcome')
Y = diabetes_data['Outcome']

In [None]:
print(Y)

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)
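`StandardScaler` rescales each column to zero mean and unit variance by computing z = (x - mean) / std, where std is the population standard deviation. A small sketch on a made-up two-column array (the name `X_demo` and its values are illustrative) confirms that the library result matches the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative two-column array; the values are arbitrary.
X_demo = np.array([
    [148.0, 33.6],
    [85.0, 26.6],
    [183.0, 23.3],
    [89.0, 28.1],
    [137.0, 43.1],
])

# Library version: learn per-column mean and std, then transform.
scaler_demo = StandardScaler().fit(X_demo)
X_scaled = scaler_demo.transform(X_demo)

# Manual version: NumPy's default std (ddof=0) is the population std.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```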

In [None]:
X = standardized_data
Y = diabetes_data['Outcome']

In [None]:
print(X)
print(Y)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, stratify=Y, random_state=2)
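Passing `stratify=Y` asks `train_test_split` to keep the class proportions the same in both splits, which helps when one class is rarer than the other. A minimal sketch with toy labels (70 negatives and 30 positives, chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 70 negatives, 30 positives (illustrative only).
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# With stratification, the positive share is preserved in both splits.
print("positive share in train:", y_a.mean())
print("positive share in test: ", y_b.mean())
```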

In [None]:
print(X.shape, X_train.shape, X_test.shape)
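The cells above fit the scaler on the full dataset before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data (random numbers, not the diabetes records; all names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 8 features, binary labels.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=100.0, scale=20.0, size=(200, 8))
y_raw = rng.integers(0, 2, size=200)

# 1) Split the raw data first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, stratify=y_raw, random_state=2
)

# 2) Learn the scaling statistics from the training rows only.
split_scaler = StandardScaler().fit(X_tr)

# 3) Apply the same transformation to both splits.
X_tr_scaled = split_scaler.transform(X_tr)
X_te_scaled = split_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the test set stays unseen until evaluation; its scaled columns will generally not have exactly zero mean, which is expected.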

In [None]:
classifier = SVC(kernel='linear')

In [None]:
# training the Support Vector Machine classifier
classifier.fit(X_train, Y_train)

In [None]:
# accuracy score on the training data
X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy score of the training data : ', training_data_accuracy)
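Accuracy is the fraction of predictions that match the true labels. A minimal sketch with made-up labels (not notebook output) shows that `accuracy_score(y_true, y_pred)` agrees with a manual count:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration.
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 1])

# Manual accuracy: share of positions where prediction equals truth.
manual_accuracy = (y_true == y_pred).mean()

# Library accuracy: true labels first, predictions second.
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```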

In [None]:
# accuracy score on the test data
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy score of the test data : ', test_data_accuracy)
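The imports at the top also bring in `LogisticRegression`, `DecisionTreeClassifier`, `RandomForestClassifier` and `ExtraTreesClassifier`, although only the SVM is trained here. One way these could be compared is sketched below on synthetic data generated by `make_classification` (an assumption standing in for the diabetes table; the hyperparameters are arbitrary defaults, not the author's choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 8 features, binary target, not real records.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=2
)

# Scale using statistics from the training split only.
scaler_syn = StandardScaler().fit(X_tr)
X_tr, X_te = scaler_syn.transform(X_tr), scaler_syn.transform(X_te)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=2),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=2),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=2),
    "SVC (linear)": SVC(kernel="linear"),
}

# Fit each model and record (train accuracy, test accuracy).
results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = (
        accuracy_score(y_tr, model.predict(X_tr)),
        accuracy_score(y_te, model.predict(X_te)),
    )
    print(f"{name:20s} train={results[name][0]:.3f} test={results[name][1]:.3f}")
```

A large gap between a model's training and test accuracy suggests overfitting, so both numbers are worth reporting side by side.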