 # Predicting Diabetes

In [1]:
from pathlib import Path
import pandas as pd

In [2]:
data = Path('../Resources/diabetes.csv')
df = pd.read_csv(data)
df.head()  # not displayed: a notebook cell shows only its last expression
df.tail()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


 ## Separate the Features (X) from the Target (y)

In [3]:
y = df["Outcome"]
X = df.drop(columns="Outcome")

 ## Split our data into training and testing sets

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=1, 
                                                    stratify=y)
X_train.shape

(576, 8)

 ## Create a Logistic Regression Model

In [5]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs',
                                max_iter=200,
                                random_state=1)
classifier

 ## Fit (train) our model using the training data

In [6]:
classifier.fit(X_train, y_train)

* scikit-learn's `LogisticRegression` class has a `fit` method, which trains (fits) the model on the training data. Other libraries do not always follow this convention.

 ## Score the model using the test data

In [7]:
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.7829861111111112
Testing Data Score: 0.7760416666666666


# Understanding Model Evaluation and Performance Metrics

## Analyzing Model Scores

- The scores on the training and testing datasets should be close to each other. A large gap between them usually signals a problem such as overfitting.
- How should you interpret this data? How confident are you that your model effectively predicts diabetes?
    - On the training dataset, the model predicts diabetes with 78.30% accuracy (`score` returns mean accuracy).
    - On the held-out testing dataset, the accuracy drops slightly to 77.60%.
- Should the testing score be higher or lower than the training score? Can the testing score be higher than the training score?
    - The testing score is typically somewhat lower. However, it is possible for the testing score to be higher. In such cases, use K-fold cross-validation or perform further analysis to understand why your test set outperforms your training set. You will learn about this soon.
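
The K-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's own analysis: it uses synthetic data from `make_classification` as a stand-in for the diabetes CSV, and the names `X_cv`, `y_cv`, and `folds` are hypothetical. The model settings mirror the ones used earlier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the diabetes data (assumption: 768 rows and
# 8 features, chosen only to resemble the shape of the notebook's data).
X_cv, y_cv = make_classification(n_samples=768, n_features=8, random_state=1)

model = LogisticRegression(solver="lbfgs", max_iter=200, random_state=1)

# Five stratified folds: every fold is used once as the held-out set
# while the model is trained on the remaining four.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_scores = cross_val_score(model, X_cv, y_cv, cv=folds, scoring="accuracy")

print(f"Fold accuracies: {cv_scores.round(3)}")
print(f"Mean: {cv_scores.mean():.3f}, std: {cv_scores.std():.3f}")
```

If the fold accuracies vary widely, a single train/test split can easily produce a testing score above the training score by chance; the mean over folds is a steadier estimate.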

## Evaluating Model Accuracy

- Is the accuracy obtained acceptable? What metrics can we use to evaluate model performance?
- Consider a model that incorrectly flags diabetes for patients who don't have the disease (a false positive). Alternatively, it could fail to flag the disease in patients who do have it (a false negative). Which is the less harmful error: a false positive or a false negative?
- Accuracy alone isn't enough; we also need to consider precision and recall.



In [8]:
predictions = classifier.predict(X_test)
results = pd.DataFrame({"Prediction": predictions, "Actual": y_test}).reset_index(drop=True)
results.head(20)

Unnamed: 0,Prediction,Actual
0,0,0
1,1,1
2,0,0
3,1,1
4,0,0
5,0,0
6,1,1
7,1,0
8,1,1
9,0,0



## Understanding Stratification in scikit-learn

- In scikit-learn, "stratify" refers to the process of dividing a dataset into subsets that maintain the same class distribution as the original dataset.
- For example, in classification tasks, if your dataset has two classes, stratification ensures that each subset (e.g., train and test sets) contains approximately the same proportion of each class as the original dataset.
- Stratification is crucial to ensure that the model is trained and evaluated on representative samples from each class, preventing biases in model performance evaluation.
- When splitting datasets for training and testing, using the "stratify" parameter helps maintain the integrity of class distributions in both subsets.
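- This notebook passes `stratify=y` to `train_test_split`. For details, see the scikit-learn documentation for `train_test_split` in `sklearn.model_selection`, or User Guide section 3.1.2.2.1, "Stratified k-fold".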
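
To see what stratification does in practice, the sketch below compares class proportions before and after a stratified split. The label counts (500 negatives, 268 positives) and the names `y_labels` and `X_placeholder` are illustrative assumptions, not values read from the notebook's CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative labels only (assumption): 500 negatives and 268 positives.
y_labels = np.array([0] * 500 + [1] * 268)
# Placeholder single-column feature matrix, just so there is something to split.
X_placeholder = np.arange(len(y_labels)).reshape(-1, 1)

# Same call shape as the notebook: default 25% test size, stratified on y.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_placeholder, y_labels, random_state=1, stratify=y_labels
)

overall_rate = y_labels.mean()  # share of positives in the full data
train_rate = y_tr.mean()        # share of positives in the training split
test_rate = y_te.mean()         # share of positives in the testing split

print(f"Positive share overall: {overall_rate:.3f}")
print(f"Positive share in train: {train_rate:.3f}")
print(f"Positive share in test:  {test_rate:.3f}")
```

Without `stratify`, the two shares can drift apart by chance, which matters most when one class is much rarer than the other.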


### Confusion Matrix

In [9]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


# Create a confusion matrix
confusion_matrix(y_test, predictions)

array([[113,  12],
       [ 31,  36]], dtype=int64)

In [10]:
# Create a classification report
target_names = ["No Diabetes", "Diabetes"]
print(classification_report(y_test, predictions, target_names=target_names))

              precision    recall  f1-score   support

 No Diabetes       0.78      0.90      0.84       125
    Diabetes       0.75      0.54      0.63        67

    accuracy                           0.78       192
   macro avg       0.77      0.72      0.73       192
weighted avg       0.77      0.78      0.77       192
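
The figures in the report can be traced back to the four counts in the confusion matrix. The sketch below recomputes them by hand in plain Python; the variable names are hypothetical. In scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes, in label order (0, then 1).

```python
# Counts copied from the confusion matrix output above.
matrix = [[113, 12],
          [31, 36]]
(tn, fp), (fn, tp) = matrix  # true neg, false pos, false neg, true pos

total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all predictions that are correct
precision_pos = tp / (tp + fp)    # of predicted "Diabetes", how many are right
recall_pos = tp / (tp + fn)       # of actual "Diabetes", how many are found
recall_neg = tn / (tn + fp)       # of actual "No Diabetes", how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy:             {accuracy:.2f}")
print(f"Diabetes precision:   {precision_pos:.2f}")
print(f"Diabetes recall:      {recall_pos:.2f}")
print(f"Diabetes f1-score:    {f1_pos:.2f}")
print(f"No Diabetes recall:   {recall_neg:.2f}")
```

Rounded to two decimals, these match the corresponding entries in the classification report.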



### Understanding Diabetes Recall Rates

* In the context of diabetes prediction, a recall of 0.54 for diabetes cases and 0.90 for non-diabetes cases raises important considerations:

    * A recall of 0.54 for diabetes means a high rate of false negatives: 31 of the 67 diabetic patients in the test set were predicted as non-diabetic. The model misses nearly half of the positive cases.

    * Conversely, the higher recall of 0.90 for non-diabetic cases means a low rate of false positives: only 12 of the 125 non-diabetic patients were flagged as diabetic. Negative cases are identified much more reliably.

    * Ideally, in medical diagnosis, we aim for high sensitivity (recall on the positive class) to minimize false negatives, since a missed diagnosis is usually costlier than a false alarm. A recall of 0.54 falls well short of that.

    * While the model identifies most non-diabetic cases correctly, the main need for improvement lies in reducing false negatives. Further training and refinement of the model could help mitigate this issue and raise recall for the diabetes class.
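
One possible refinement, offered here only as a sketch and not as something the notebook itself does, is to lower the probability threshold at which a case is classified as positive. The example uses synthetic data as a stand-in for the diabetes CSV; the 0.3 threshold and all variable names are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the diabetes data (assumption).
X_syn, y_syn = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=1
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, random_state=1, stratify=y_syn)

clf = LogisticRegression(solver="lbfgs", max_iter=200, random_state=1)
clf.fit(X_a, y_a)

# Predicted probability of the positive class for each test row.
proba_pos = clf.predict_proba(X_b)[:, 1]


def evaluate(threshold):
    """Return (recall, false-positive count) when classifying at `threshold`."""
    preds = (proba_pos >= threshold).astype(int)
    recall = recall_score(y_b, preds)
    false_positives = int(((preds == 1) & (y_b == 0)).sum())
    return recall, false_positives


recall_default, fp_default = evaluate(0.5)  # the usual cut-off
recall_lowered, fp_lowered = evaluate(0.3)  # arbitrary lower cut-off

print(f"threshold 0.5: recall={recall_default:.2f}, false positives={fp_default}")
print(f"threshold 0.3: recall={recall_lowered:.2f}, false positives={fp_lowered}")
```

Lowering the threshold can never reduce recall, but it generally increases false positives, so the right cut-off depends on how the two kinds of error are weighed.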
