<a id='1'></a><center><h2 style="background-color:skyblue; color:red"><br>🩺 Diabetes Prediction Dataset Summary<br></h2></center>

This dataset is widely used for binary classification: predicting whether a patient has diabetes from diagnostic measurements. It contains medical data from female patients and is commonly referred to as the **Pima Indians Diabetes Dataset**.

## 📌 Dataset Overview

- **Total Records:** 768  
- **Total Features:** 8 input features + 1 target variable  
- **Purpose:** Predict diabetes diagnosis (1 = diabetic, 0 = non-diabetic)  
- **Data Types:** All numeric (integers and floats)


## 📋 Feature Description

| Feature       | Description                                                  |
|---------------|--------------------------------------------------------------|
| `pregnancies` | Number of times the patient has been pregnant                |
| `glucose`     | Plasma glucose concentration (mg/dL)                         |
| `diastolic`   | Diastolic blood pressure (mm Hg)                             |
| `triceps`     | Triceps skinfold thickness (mm)                              |
| `insulin`     | 2-hour serum insulin (µU/mL)                                 |
| `bmi`         | Body mass index (weight in kg / (height in m)²)              |
| `dpf`         | Diabetes Pedigree Function (genetic predisposition)          |
| `age`         | Age of the patient (in years)                                |
| `diabetes`    | Target variable (1 = diabetic, 0 = non-diabetic)             |


In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [3]:
df=pd.read_csv(r"D:\OneDrive\Venkat.My_projects\datassetss\Diabetes.csv")

In [4]:
df.head()

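|   | pregnancies | glucose | diastolic | triceps | insulin | bmi  | dpf   | age | diabetes |
|---|-------------|---------|-----------|---------|---------|------|-------|-----|----------|
| 0 | 6           | 148     | 72        | 35      | 0       | 33.6 | 0.627 | 50  | 1        |
| 1 | 1           | 85      | 66        | 29      | 0       | 26.6 | 0.351 | 31  | 0        |
| 2 | 8           | 183     | 64        | 0       | 0       | 23.3 | 0.672 | 32  | 1        |
| 3 | 1           | 89      | 66        | 23      | 94      | 28.1 | 0.167 | 21  | 0        |
| 4 | 0           | 137     | 40        | 35      | 168     | 43.1 | 2.288 | 33  | 1        |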


## 🧾 Basic Information

Overview of column types, non-null counts, and memory usage.


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [6]:
df.describe()

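|       | pregnancies | glucose    | diastolic | triceps   | insulin    | bmi       | dpf      | age       | diabetes |
|-------|-------------|------------|-----------|-----------|------------|-----------|----------|-----------|----------|
| count | 768.0       | 768.0      | 768.0     | 768.0     | 768.0      | 768.0     | 768.0    | 768.0     | 768.0    |
| mean  | 3.845052    | 120.894531 | 69.105469 | 20.536458 | 79.799479  | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std   | 3.369578    | 31.972618  | 19.355807 | 15.952218 | 115.244002 | 7.88416   | 0.331329 | 11.760232 | 0.476951 |
| min   | 0.0         | 0.0        | 0.0       | 0.0       | 0.0        | 0.0       | 0.078    | 21.0      | 0.0      |
| 25%   | 1.0         | 99.0       | 62.0      | 0.0       | 0.0        | 27.3      | 0.24375  | 24.0      | 0.0      |
| 50%   | 3.0         | 117.0      | 72.0      | 23.0      | 30.5       | 32.0      | 0.3725   | 29.0      | 0.0      |
| 75%   | 6.0         | 140.25     | 80.0      | 32.0      | 127.25     | 36.6      | 0.62625  | 41.0      | 1.0      |
| max   | 17.0        | 199.0      | 122.0     | 99.0      | 846.0      | 67.1      | 2.42     | 81.0      | 1.0      |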
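
The `min` row above is 0 for `glucose`, `diastolic`, `triceps`, `insulin`, and `bmi`. A reading of 0 is not physiologically plausible for these measurements, so the zeros most likely stand for missing values. The sketch below counts them and replaces them with `NaN`. It is an illustration only: it uses the five rows from `df.head()` typed in by hand so it runs on its own, and the column list `zero_as_missing` is an assumption, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), reproduced inline so the sketch is self-contained.
sample = pd.DataFrame(
    {
        "glucose": [148, 85, 183, 89, 137],
        "diastolic": [72, 66, 64, 66, 40],
        "triceps": [35, 29, 0, 23, 35],
        "insulin": [0, 0, 0, 94, 168],
        "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
    }
)

# Assumption: a 0 in these columns means "not recorded", not a real measurement.
zero_as_missing = ["glucose", "diastolic", "triceps", "insulin", "bmi"]

# Count zeros per column, then mark them as missing.
zero_counts = (sample[zero_as_missing] == 0).sum()
cleaned = sample[zero_as_missing].replace(0, np.nan)

print(zero_counts)
print(cleaned.isna().sum())
```

Running the same two lines on the full `df` would show how many records are affected before deciding whether to impute or drop them.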


In [7]:
df.columns

Index(['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi',
       'dpf', 'age', 'diabetes'],
      dtype='object')

In [8]:
df['diabetes'].value_counts()

diabetes
0    500
1    268
Name: count, dtype: int64
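
The classes are imbalanced, so accuracy needs a reference point. As a rough sketch, the snippet below computes the accuracy of always predicting the majority class, using only the counts printed above (the variable names are illustrative).

```python
# Class counts copied from the value_counts() output above.
counts = {0: 500, 1: 268}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total
# Share of diabetic patients in the data.
positive_rate = counts[1] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"positive rate: {positive_rate:.3f}")
```

Any classifier trained on this data should beat roughly 65% accuracy to be doing better than this trivial baseline.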

In [9]:
df.groupby(['diabetes']).mean()

| diabetes | pregnancies | glucose    | diastolic | triceps   | insulin    | bmi       | dpf      | age       |
|----------|-------------|------------|-----------|-----------|------------|-----------|----------|-----------|
| 0        | 3.298       | 109.98     | 68.184    | 19.664    | 68.792     | 30.3042   | 0.429734 | 31.19     |
| 1        | 4.865672    | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505   | 37.067164 |


### Target variable for Prediction

In [10]:
y=df['diabetes']

In [11]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: diabetes, Length: 768, dtype: int64

### Independent Variables

In [12]:
x=df[['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi','dpf', 'age']]

In [13]:
x

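|     | pregnancies | glucose | diastolic | triceps | insulin | bmi  | dpf   | age |
|-----|-------------|---------|-----------|---------|---------|------|-------|-----|
| 0   | 6           | 148     | 72        | 35      | 0       | 33.6 | 0.627 | 50  |
| 1   | 1           | 85      | 66        | 29      | 0       | 26.6 | 0.351 | 31  |
| 2   | 8           | 183     | 64        | 0       | 0       | 23.3 | 0.672 | 32  |
| 3   | 1           | 89      | 66        | 23      | 94      | 28.1 | 0.167 | 21  |
| 4   | 0           | 137     | 40        | 35      | 168     | 43.1 | 2.288 | 33  |
| ... | ...         | ...     | ...       | ...     | ...     | ...  | ...   | ... |
| 763 | 10          | 101     | 76        | 48      | 180     | 32.9 | 0.171 | 63  |
| 764 | 2           | 122     | 70        | 27      | 0       | 36.8 | 0.340 | 27  |
| 765 | 5           | 121     | 72        | 23      | 112     | 26.2 | 0.245 | 30  |
| 766 | 1           | 126     | 60        | 0       | 0       | 30.1 | 0.349 | 47  |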


In [14]:
x.shape

(768, 8)

In [15]:
mm=MinMaxScaler()

In [64]:
X=mm.fit_transform(x)

In [17]:
X

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.23441503,
        0.48333333],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.11656704,
        0.16666667],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.25362938,
        0.18333333],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.390462  , 0.07130658,
        0.15      ],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.4485842 , 0.11571307,
        0.43333333],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.10119556,
        0.03333333]])
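
Min-max scaling maps each column to the range [0, 1] with `(value - min) / (max - min)`. As a sanity check, the sketch below recomputes the visible entries of the first scaled row by hand, using the first row of `x` and the `min` and `max` rows from `df.describe()`. The helper `min_max` is a hypothetical name introduced for illustration.

```python
# First row of x, plus the column min and max reported by df.describe().
row = {"pregnancies": 6, "glucose": 148, "diastolic": 72, "bmi": 33.6, "dpf": 0.627, "age": 50}
col_min = {"pregnancies": 0, "glucose": 0, "diastolic": 0, "bmi": 0.0, "dpf": 0.078, "age": 21}
col_max = {"pregnancies": 17, "glucose": 199, "diastolic": 122, "bmi": 67.1, "dpf": 2.42, "age": 81}


def min_max(value, lo, hi):
    """Scale value into [0, 1] given the column minimum (lo) and maximum (hi)."""
    return (value - lo) / (hi - lo)


scaled = {name: min_max(row[name], col_min[name], col_max[name]) for name in row}

for name, value in scaled.items():
    print(f"{name}: {value:.8f}")
```

The printed values agree with the first row of the `X` array above, which confirms that `MinMaxScaler` used the full-dataset minimum and maximum of each column.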

In [18]:
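# Note: this splits the raw `x`, not the scaled `X` from above, so the model is trained on unscaled features.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y, random_state=2524)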

In [19]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((537, 8), (231, 8), (537,), (231,))

In [20]:
model=LogisticRegression()
model

In [51]:
model.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
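
The warning appears because the model was fitted on the unscaled `x_train`; the scaled array `X` was never passed to `train_test_split`. One possible way to address it, sketched below, is to put the scaler and the classifier in a single pipeline so the scaler is fitted on the training split only. This is an illustration under assumptions, not the notebook's method: `X_demo` and `y_demo` are synthetic stand-ins generated with NumPy, and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for 8 features on very different scales (not the real data).
scales = np.array([3, 30, 20, 15, 115, 8, 0.3, 12])
X_demo = rng.normal(size=(200, 8)) * scales
# Synthetic binary target driven by two of the columns plus noise.
signal = X_demo[:, 1] / 30 + X_demo[:, 5] / 8
y_demo = (signal + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training split and reuses it at predict time.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"test accuracy on synthetic data: {test_accuracy:.3f}")
```

With the notebook's own data, the equivalent call would be `pipe.fit(x_train, y_train)` followed by `pipe.predict(x_test)`; the metrics reported further down would then change, since they come from the unscaled model.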


In [22]:
y_pred=model.predict(x_test)

In [23]:
y_pred

array([0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int64)

In [24]:
model.predict_proba(x_test)

array([[0.87880621, 0.12119379],
       [0.74476075, 0.25523925],
       [0.26868455, 0.73131545],
       [0.82624425, 0.17375575],
       [0.52874767, 0.47125233],
       [0.47442017, 0.52557983],
       [0.85691774, 0.14308226],
       [0.67552761, 0.32447239],
       [0.77586634, 0.22413366],
       [0.57442252, 0.42557748],
       [0.96414193, 0.03585807],
       [0.68731995, 0.31268005],
       [0.84771034, 0.15228966],
       [0.31317718, 0.68682282],
       [0.8018937 , 0.1981063 ],
       [0.86004073, 0.13995927],
       [0.29558379, 0.70441621],
       [0.76102268, 0.23897732],
       [0.54590408, 0.45409592],
       [0.1374931 , 0.8625069 ],
       [0.4896754 , 0.5103246 ],
       [0.58517248, 0.41482752],
       [0.30583165, 0.69416835],
       [0.93880854, 0.06119146],
       [0.32073989, 0.67926011],
       [0.2898586 , 0.7101414 ],
       [0.73407173, 0.26592827],
       [0.49068564, 0.50931436],
       [0.81290969, 0.18709031],
       [0.39252192, 0.60747808],
       ...
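
Each row of `predict_proba` holds the estimated probability of class 0 and class 1. For a binary `LogisticRegression`, `predict` returns class 1 when the second column exceeds 0.5. The sketch below reproduces that rule on the first six rows of the output above and shows how a lower cutoff would change the labels. The 0.3 cutoff is an arbitrary illustrative value.

```python
import numpy as np

# First six rows of the predict_proba output: columns are P(class 0), P(class 1).
proba = np.array(
    [
        [0.87880621, 0.12119379],
        [0.74476075, 0.25523925],
        [0.26868455, 0.73131545],
        [0.82624425, 0.17375575],
        [0.52874767, 0.47125233],
        [0.47442017, 0.52557983],
    ]
)

# Default decision rule: predict class 1 when P(class 1) > 0.5.
labels_default = (proba[:, 1] > 0.5).astype(int)

# A lower cutoff flags more patients as diabetic (higher recall, lower precision).
labels_lenient = (proba[:, 1] > 0.3).astype(int)

print(labels_default)
print(labels_lenient)
```

The default labels match the first six entries of `y_pred` shown earlier.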

In [27]:
confusion_mat=confusion_matrix(y_test,y_pred)
confusion_mat

array([[135,  15],
       [ 28,  53]], dtype=int64)
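
In scikit-learn's layout, rows of the confusion matrix are true classes and columns are predicted classes, so the matrix reads as 135 true negatives, 15 false positives, 28 false negatives, and 53 true positives. As a worked example, the sketch below derives the headline metrics for the diabetic class directly from those four numbers.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = np.array([[135, 15], [28, 53]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all test patients classified correctly
precision = tp / (tp + fp)            # of those predicted diabetic, share truly diabetic
recall = tp / (tp + fn)               # of the truly diabetic, share the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
```

The recall of about 0.65 means roughly one in three diabetic patients in the test set is missed, which matters more than overall accuracy in a screening setting.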

In [30]:
classification = classification_report(y_test, y_pred)
print(classification)  # print() renders the line breaks instead of showing the raw string

              precision    recall  f1-score   support

           0       0.83      0.90      0.86       150
           1       0.78      0.65      0.71        81

    accuracy                           0.81       231
   macro avg       0.80      0.78      0.79       231
weighted avg       0.81      0.81      0.81       231

## Predict on sample data

In [128]:
x_new=df.sample(5)
x_new

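|     | pregnancies | glucose | diastolic | triceps | insulin | bmi  | dpf   | age | diabetes |
|-----|-------------|---------|-----------|---------|---------|------|-------|-----|----------|
| 601 | 6           | 96      | 0         | 0       | 0       | 23.7 | 0.19  | 28  | 0        |
| 746 | 1           | 147     | 94        | 41      | 0       | 49.3 | 0.358 | 27  | 1        |
| 187 | 1           | 128     | 98        | 41      | 58      | 32.0 | 1.321 | 33  | 1        |
| 728 | 2           | 175     | 88        | 0       | 0       | 22.9 | 0.326 | 22  | 0        |
| 336 | 0           | 117     | 0         | 0       | 0       | 33.8 | 0.932 | 44  | 0        |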


In [129]:
x_new.shape

(5, 9)

In [130]:
x_new.drop('diabetes',axis=1,inplace=True)

In [131]:
x_new.shape

(5, 8)

In [132]:
m=MinMaxScaler()

In [133]:
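# Caution: this fits a new scaler on only these 5 rows, while the model was trained on unscaled features.
sample_data = m.fit_transform(x_new)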

In [134]:
sample_data.shape

(5, 8)

In [135]:
new_pred=model.predict(sample_data)
new_pred

array([0, 0, 0, 0, 0], dtype=int64)

In [139]:
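# Unlike the predict() call above, this passes the unscaled x_new, which matches how the model was trained.
model.predict_proba(x_new)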

array([[0.7950302 , 0.2049698 ],
       [0.34617148, 0.65382852],
       [0.600321  , 0.399679  ],
       [0.60018186, 0.39981814],
       [0.37126537, 0.62873463]])
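
The two results above disagree: `predict` on the rescaled `sample_data` returns class 0 for every row, while `predict_proba` on the raw `x_new` gives the second and fifth rows a class 1 probability above 0.5. The cause is inconsistent preprocessing. New samples need the same transformation the model saw during training, and a scaler fitted on five rows learns a different minimum and maximum than one fitted on the training data. The sketch below shows the difference for `glucose` alone. It is a simplified illustration: `train_glucose` contains just the column's minimum and maximum from `df.describe()` as a stand-in for the training data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the training column: only the min and max of glucose from describe().
train_glucose = np.array([[0.0], [199.0]])
# Glucose values of the five sampled patients shown above.
new_glucose = np.array([[96.0], [147.0], [128.0], [175.0], [117.0]])

# Consistent: fit once on the training data, then only transform new samples.
train_scaler = MinMaxScaler().fit(train_glucose)
reused = train_scaler.transform(new_glucose)

# Inconsistent: refitting on the new samples stretches them to span [0, 1].
refit = MinMaxScaler().fit_transform(new_glucose)

print(reused.ravel().round(3))
print(refit.ravel().round(3))
```

With the refitted scaler, the lowest glucose value in the sample becomes 0 and the highest becomes 1 regardless of their clinical meaning, so the model receives inputs on a scale it was never trained on.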