# Logistic Regression Analysis on Diabetes Data

This notebook uses logistic regression to predict a binary diabetes outcome (`Outcome`) from a set of health metrics. It covers data preprocessing, model training, and evaluation with common classification metrics.

## Contents

1. **Data Loading and Exploration**
   - Load the diabetes dataset from a CSV file.
   - Display basic information about the dataset and check for missing values.
   - Analyze feature correlations to understand relationships.

2. **Feature Selection**
   - Separate features (`X`) from the target variable (`y`).
   - Drop `BloodPressure` and `SkinThickness`, the two features with the weakest correlation to `Outcome`, to focus on more relevant predictors.

3. **Data Preprocessing**
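   - Standardize the feature set using `StandardScaler` so that each feature has zero mean and unit variance and features measured on different scales become comparable. Note that the notebook fits the scaler on the full dataset, before the train/test split.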

4. **Data Splitting**
   - Split the dataset into training and testing sets (80% train, 20% test, `random_state=42`) so the model is evaluated on rows it was not trained on.

5. **Model Training**
   - Create and fit a logistic regression model to the training data.
   - Set `max_iter=1000` to give the solver enough iterations to converge.

6. **Model Evaluation**
   - Make predictions on the test set.
   - Calculate and display the model's accuracy score.
   - Display the confusion matrix to visualize prediction results.
   - Provide a detailed classification report, including precision, recall, and F1-score for each class.

## Dependencies

- `numpy`
- `pandas`
- `scikit-learn`

## Usage

- Run each section sequentially to perform logistic regression analysis on the diabetes dataset.
- Modify the input data as needed or experiment with different feature selections.

## Notes

- Ensure that the diabetes dataset (`diabetesdata.csv`) is available at the Google Drive path used in the loading cell; in Colab this requires Drive to be mounted first.
- The notebook reports accuracy, a confusion matrix, and a per-class classification report (precision, recall, F1-score) to assess the logistic regression model.




In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [None]:
data = pd.read_csv("/content/drive/MyDrive/Week 4/regression exercises/logRegression/diabetesdata.csv")

In [None]:
data

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [None]:
print(data.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
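
`isnull()` only counts `NaN` entries, so a result of all zeros does not rule out placeholder values. The rows displayed earlier contain `0` for `Insulin` and `SkinThickness`; treating those zeros as possibly missing measurements is an assumption here, not something the notebook states. A minimal sketch of counting zeros per column, using only the first five displayed rows and three of the columns as sample data:

```python
import pandas as pd

# First five rows as displayed above, restricted to three columns.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
})

# Count exact zeros in each column; isnull() would report 0 for all of them.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the full dataset the same idea would be `(data == 0).sum()`; columns such as `Pregnancies` and `Outcome` legitimately contain zeros, so the counts need to be read column by column.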


In [None]:
data.corr()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
Pregnancies,1.0,0.129459,0.141282,-0.081672,-0.073535,0.017683,-0.033523,0.544341,0.221898
Glucose,0.129459,1.0,0.15259,0.057328,0.331357,0.221071,0.137337,0.263514,0.466581
BloodPressure,0.141282,0.15259,1.0,0.207371,0.088933,0.281805,0.041265,0.239528,0.065068
SkinThickness,-0.081672,0.057328,0.207371,1.0,0.436783,0.392573,0.183928,-0.11397,0.074752
Insulin,-0.073535,0.331357,0.088933,0.436783,1.0,0.197859,0.185071,-0.042163,0.130548
BMI,0.017683,0.221071,0.281805,0.392573,0.197859,1.0,0.140647,0.036242,0.292695
DiabetesPedigreeFunction,-0.033523,0.137337,0.041265,0.183928,0.185071,0.140647,1.0,0.033561,0.173844
Age,0.544341,0.263514,0.239528,-0.11397,-0.042163,0.036242,0.033561,1.0,0.238356
Outcome,0.221898,0.466581,0.065068,0.074752,0.130548,0.292695,0.173844,0.238356,1.0
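
The full matrix is easier to read when only the `Outcome` row is ranked. The sketch below copies the `Outcome` correlations from the output above into a `pandas.Series` and sorts them; the variable names `outcome_corr` and `ranked` are illustrative.

```python
import pandas as pd

# Correlation of each feature with Outcome, copied from the matrix above.
outcome_corr = pd.Series({
    "Pregnancies": 0.221898,
    "Glucose": 0.466581,
    "BloodPressure": 0.065068,
    "SkinThickness": 0.074752,
    "Insulin": 0.130548,
    "BMI": 0.292695,
    "DiabetesPedigreeFunction": 0.173844,
    "Age": 0.238356,
})

# Strongest linear association first.
ranked = outcome_corr.sort_values(ascending=False)
print(ranked)
```

With the loaded DataFrame, the equivalent would be `data.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)`. `BloodPressure` and `SkinThickness` rank last, which is consistent with dropping them in the next cell, although the notebook does not state its reason.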


In [None]:
# Separate the features (X) from the target variable (y);
# BloodPressure and SkinThickness are also dropped from the feature set
X = data.drop(columns=['Outcome','BloodPressure','SkinThickness'])
y = data['Outcome']


In [None]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

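# Fit the scaler on all rows and standardize every feature column.
# Note: this runs before the train/test split, so test rows also
# contribute to the mean and variance used for scaling.
X_scaled = scaler.fit_transform(X)
X = X_scaled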
X

array([[ 0.63994726,  0.84832379, -0.69289057,  0.20401277,  0.46849198,
         1.4259954 ],
       [-0.84488505, -1.12339636, -0.69289057, -0.68442195, -0.36506078,
        -0.19067191],
       [ 1.23388019,  1.94372388, -0.69289057, -1.10325546,  0.60439732,
        -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.27959377, -0.73518964, -0.68519336,
        -0.27575966],
       [-0.84488505,  0.1597866 , -0.69289057, -0.24020459, -0.37110101,
         1.17073215],
       [-0.84488505, -0.8730192 , -0.69289057, -0.20212881, -0.47378505,
        -0.87137393]])
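
To make the transformation concrete: `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The sketch below checks this by hand on the five `Age` values from the first displayed rows. Because it uses only five rows, the numbers differ from the array above, which was computed over all 768 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age values from the first five displayed rows, as a single-column matrix.
ages = np.array([[50.0], [31.0], [32.0], [21.0], [33.0]])

# Manual standardization: (x - mean) / population standard deviation.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# The same computation through scikit-learn.
scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
```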

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the train and test sets
print(X_train.shape, X_test.shape)


(614, 6) (154, 6)
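
Since the scaler in this notebook is fitted before the split, the test rows influence the scaling statistics. One common alternative, shown here as a sketch rather than as the notebook's method, is to split first and wrap the scaler and model in a scikit-learn pipeline. The data below is synthetic (`X_demo`, `y_demo` are made-up stand-ins with the same shape as the six selected features), and `stratify` is an optional addition that the original split does not use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six selected features (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 6))
y_demo = (X_demo[:, 1] + 0.5 * rng.normal(size=768) > 0).astype(int)

# Split first, so test rows never influence the scaler's mean and variance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits StandardScaler on the training rows only and reuses
# those statistics when it transforms the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_acc = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape)
print("Test accuracy on synthetic data:", test_acc)
```

The accuracy printed here describes the synthetic data only and says nothing about the diabetes dataset.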


In [None]:
# Create a logistic regression model
model = LogisticRegression(max_iter=1000)

# Fit the model to the training data
model.fit(X_train, y_train)


In [None]:
# Make predictions on the testing set
y_pred = model.predict(X_test)
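
For a binary logistic regression, `predict` amounts to computing a linear score, passing it through the sigmoid function, and labelling the row `1` when the resulting probability exceeds 0.5. The sketch below spells this out in plain Python. The weights and bias are hypothetical values chosen for illustration, not the coefficients fitted in this notebook, and `row` is the first standardized row rounded to two decimals.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one row."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return p, int(p > threshold)

# Hypothetical coefficients, one per feature (not the fitted model's values).
weights = [0.4, 1.1, -0.1, 0.7, 0.3, 0.2]
bias = -0.8

# First standardized row from the scaled array, rounded to two decimals.
row = [0.64, 0.85, -0.69, 0.20, 0.47, 1.43]

prob, label = predict_label(row, weights, bias)
print(prob, label)
```

On the fitted model, `model.coef_` and `model.intercept_` hold the learned weights and bias, and `model.predict_proba(X_test)[:, 1]` returns the class-1 probabilities.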


In [None]:
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Display the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', conf_matrix)

# Display the classification report
class_report = classification_report(y_test, y_pred)
print('Classification Report:\n', class_report)


Accuracy: 0.7597402597402597
Confusion Matrix:
 [[81 18]
 [19 36]]
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.82      0.81        99
           1       0.67      0.65      0.66        55

    accuracy                           0.76       154
   macro avg       0.74      0.74      0.74       154
weighted avg       0.76      0.76      0.76       154
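
The figures in the classification report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. As a check, the sketch below recomputes accuracy and the class-1 precision, recall, and F1-score from the four counts printed above.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
conf = [[81, 18],
        [19, 36]]

tn, fp = conf[0]  # true class 0: correctly predicted 0, wrongly predicted 1
fn, tp = conf[1]  # true class 1: wrongly predicted 0, correctly predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all test rows classified correctly
precision = tp / (tp + fp)     # of rows predicted 1, share that are truly 1
recall = tp / (tp + fn)        # of rows truly 1, share that were predicted 1
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Of the 55 positive cases in the test set, 19 are predicted as class 0, which is why recall for class 1 (0.65) is noticeably lower than recall for class 0 (0.82).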

