In [1]:
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('health_data.csv')

# is_diabetic column is my Y variable
# find missing values for is_diabetic column

missing_values = df['is_diabetic'].isnull().sum()
print(f'Missing values in is_diabetic column: {missing_values}')

# Check the class balance of the is_diabetic column
imbalance = df['is_diabetic'].value_counts()
print('Imbalance in is_diabetic column:')
print(imbalance)



Missing values in is_diabetic column: 0
Imbalance in is_diabetic column:
is_diabetic
0    481
1    260
Name: count, dtype: int64


### Class Imbalance in is_diabetic Column

The target variable `is_diabetic` is imbalanced:
- 0 (non-diabetic): 481 samples
- 1 (diabetic): 260 samples

Non-diabetic cases outnumber diabetic cases by roughly 1.85 to 1 (about 65% versus 35% of the 741 samples). Class imbalance can affect classification performance because a model can reach reasonable accuracy while favouring the majority class. Consider resampling or class weights to address the imbalance, and report per-class metrics (precision, recall, F1-score) rather than accuracy alone when building predictive models.
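The proportions quoted above can be reproduced with a short sketch. It rebuilds the target from the reported counts instead of reading `health_data.csv`, so the series `y` here is a stand-in and not the original column.

```python
import pandas as pd

# Rebuild the target distribution reported above (481 zeros, 260 ones)
y = pd.Series([0] * 481 + [1] * 260, name="is_diabetic")

counts = y.value_counts()
shares = y.value_counts(normalize=True)

# Ratio of the rarest class to the most common class
minority_to_majority = counts.min() / counts.max()

print(shares.round(3))
print(f"Minority-to-majority ratio: {minority_to_majority:.2f}")
```

For these counts the minority share is about 0.351 and the minority-to-majority ratio is about 0.54.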

In [2]:
# Evaluate a baseline model on the data as is, without imputation or balancing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Split the data into features and target variable
X = df.drop('is_diabetic', axis=1)
y = df['is_diabetic']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.80      0.76        98
           1       0.53      0.45      0.49        51

    accuracy                           0.68       149
   macro avg       0.64      0.62      0.63       149
weighted avg       0.67      0.68      0.67       149



### Model Evaluation: Classification Report Interpretation

- **Class 0 (non-diabetic):**
  - Precision: 0.74
  - Recall: 0.80
  - F1-score: 0.76
  - Support: 98

- **Class 1 (diabetic):**
  - Precision: 0.53
  - Recall: 0.45
  - F1-score: 0.49
  - Support: 51

**Interpretation:**
- The model performs better for the majority class (non-diabetic) than for the minority class (diabetic).
- A recall of 0.45 for the diabetic class means the model misses more than half of the diabetic cases in the test set (false negatives).
- This is a common issue with imbalanced datasets, where the model is biased towards the majority class.
- To improve performance for the minority class, consider resampling (oversampling/undersampling), class weights, or a different algorithm, and keep tracking per-class recall and F1-score rather than overall accuracy.
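As an illustration of the class-weight option, here is a minimal sketch. It runs on synthetic data from `make_classification` (an assumption, chosen to mimic the 65/35 split) and not on the notebook's dataset, and it also stratifies the split so train and test keep the same class proportions. Variable names such as `X_demo` and `weighted_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the health data: roughly 65% class 0, 35% class 1
X_demo, y_demo = make_classification(
    n_samples=741, n_features=8, weights=[0.65, 0.35], random_state=42
)

# stratify keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight="balanced" weights each class inversely to its frequency
weighted_model = RandomForestClassifier(class_weight="balanced", random_state=42)
weighted_model.fit(X_tr, y_tr)

print(classification_report(y_te, weighted_model.predict(X_te)))
```

Whether class weights help on the real data would need to be checked by rerunning the baseline cell with the same change.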

In [9]:
# Evaluate the model with imputation and balancing

from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
# Impute missing values
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# Balance the dataset using SMOTE
smote = SMOTE(random_state=30)  
X_balanced, y_balanced = smote.fit_resample(X_imputed, y)   
# Split the balanced data
X_train_bal, X_test_bal, y_train_bal, y_test_bal = train_test_split(X_balanced, y_balanced, test_size=0.2, random_state=42)
# Train the model
model_bal = RandomForestClassifier(random_state=30)
model_bal.fit(X_train_bal, y_train_bal)
y_pred_bal = model_bal.predict(X_test_bal)
print(classification_report(y_test_bal, y_pred_bal))

              precision    recall  f1-score   support

           0       0.83      0.81      0.82       105
           1       0.78      0.80      0.79        88

    accuracy                           0.80       193
   macro avg       0.80      0.80      0.80       193
weighted avg       0.80      0.80      0.80       193



### Model Evaluation After Imputation and Balancing (SMOTE)

- **Class 0 (non-diabetic):**
  - Precision: 0.83
  - Recall: 0.81
  - F1-score: 0.82
  - Support: 105

- **Class 1 (diabetic):**
  - Precision: 0.78
  - Recall: 0.80
  - F1-score: 0.79
  - Support: 88

**Interpretation:**
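- After mean imputation and SMOTE, the reported scores are higher and more even across classes: diabetic-class recall rises from 0.45 to 0.80 and F1-score from 0.49 to 0.79.
- These numbers are not directly comparable with the previous model. The imputer and SMOTE were fitted on the full dataset before the train/test split, so the 193-row test set contains synthetic samples (support for class 1 is 88 rather than 51) that were interpolated from rows which may sit in the training set.
- This kind of leakage usually inflates the scores. To measure the real effect, split first, fit the imputer and SMOTE on the training set only, and evaluate on the untouched test set.
- Until that is done, the result suggests, but does not demonstrate, that balancing reduces bias towards the majority class.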
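In the cell above, the imputer and SMOTE are fitted before `train_test_split`, so resampled rows end up in the test set. A minimal sketch of the alternative order follows: split first, then fit preprocessing and oversampling on the training rows only. Two assumptions apply. The data is synthetic (`make_classification`), not the health dataset, and plain random oversampling via `sklearn.utils.resample` is swapped in for SMOTE to keep the example dependency-free; `SMOTE.fit_resample` would take the place of step 3.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the health data (hypothetical, not the real file)
X_demo, y_demo = make_classification(
    n_samples=741, n_features=8, weights=[0.65, 0.35], random_state=42
)

# 1. Split first so the test set contains only original, untouched rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2. Fit the imputer on the training rows only, then apply it to both sets
imputer = SimpleImputer(strategy="mean")
X_tr = imputer.fit_transform(X_tr)
X_te = imputer.transform(X_te)

# 3. Oversample the minority class (label 1 here) in the training set only
minority = y_tr == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=n_majority, random_state=30,
)
X_tr_bal = np.vstack([X_tr[~minority], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority], y_min_up])

# 4. Train on the balanced training set, evaluate on the original test set
model_demo = RandomForestClassifier(random_state=30)
model_demo.fit(X_tr_bal, y_tr_bal)
print(classification_report(y_te, model_demo.predict(X_te)))
```

With this order the test support stays at the original class proportions, so the report is comparable with the baseline.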

### General Rule for Identifying Class Imbalance

Class imbalance occurs when the number of samples in one class is much higher or lower than in other classes. 

**General Rule:**
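- As a rough rule of thumb (conventions vary), a dataset is considered imbalanced when the minority-to-majority ratio falls below 1:2, i.e. when the minority class makes up less than about 33% of the samples. With a minority share of about 35%, the `is_diabetic` target sits just above this cutoff, so its imbalance is mild.
- Severe imbalance is often defined as a minority class share below 10%.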

**How to Check:**
- Use `value_counts()` on the target variable to see the distribution of classes.
- Calculate the percentage of each class:
  ```python
  class_distribution = y.value_counts(normalize=True)
  print(class_distribution)
  ```
- If one class is much less frequent, consider the dataset imbalanced and apply appropriate techniques to address it.

### How to Deal with Class Imbalance

Common techniques to address class imbalance:

1. **Resampling Methods**
   - **Oversampling**: Increase the number of minority class samples (e.g., SMOTE).
   - **Undersampling**: Reduce the number of majority class samples.

2. **Class Weights**
   - Assign higher weights to the minority class in model training (supported by many algorithms).

3. **Ensemble Methods**
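   - Use ensemble variants built for imbalance (e.g., `BalancedRandomForestClassifier` from imbalanced-learn) or imbalance-aware options such as `scale_pos_weight` in XGBoost. Standard ensembles are not immune to imbalance, as the first Random Forest above shows.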

4. **Algorithm Selection**
   - Choose algorithms robust to imbalance or specifically designed for imbalanced data.

5. **Evaluation Metrics**
   - Use metrics like precision, recall, F1-score, ROC-AUC instead of accuracy.

6. **Data Augmentation**
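   - Generate additional minority-class samples with domain-specific transformations. This overlaps with oversampling, since SMOTE is itself a form of synthetic data generation.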

7. **Threshold Tuning**
   - Adjust decision thresholds to improve minority class detection.

Select and combine these techniques based on your data and problem context for best results.
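Threshold tuning (item 7) is the least self-explanatory entry in the list, so here is a minimal sketch. It assumes synthetic data from `make_classification` in place of the health dataset, and the two thresholds (0.5 and 0.3) are arbitrary illustrative values, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the health data (hypothetical, not the real file)
X_demo, y_demo = make_classification(
    n_samples=741, n_features=8, weights=[0.65, 0.35], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Predicted probability of the positive (minority) class
proba_pos = clf.predict_proba(X_te)[:, 1]

recall_by_threshold = {}
for threshold in (0.5, 0.3):
    # Label a row positive when its probability reaches the threshold
    y_hat = (proba_pos >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    print(f"threshold={threshold}: precision={precision:.2f}, "
          f"recall={recall_by_threshold[threshold]:.2f}")
```

Lowering the threshold can only keep or raise minority-class recall, usually at the cost of precision. In practice the threshold would be chosen on a validation set, not on the test set used for the final report.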