## ✅ **Q1. Import the dataset and explore it**

### **Code outline (in Python using Pandas, Matplotlib, Seaborn):**

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("diabetes.csv")  # replace with the actual file path if local

# Preview the dataset
print(df.head())

# Summary statistics
print(df.describe())

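# Column dtypes and non-null counts (info() prints directly and returns None)
df.info()

# Missing values per column
print(df.isnull().sum())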

# Visualizations
sns.pairplot(df, hue='Outcome')
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
```
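
If missing measurements are stored as 0 rather than `NaN` (as Q2 below assumes), `info()` and `isnull()` will not reveal them. A minimal sketch of counting zeros per column follows; the toy frame and its values are made up for illustration and stand in for `df`.

```python
import pandas as pd

# Toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Count and share of zero entries per column
zero_counts = (toy == 0).sum()
zero_share = (toy == 0).mean().round(2)
print(pd.DataFrame({"zeros": zero_counts, "share": zero_share}))
```

On the real data, the same two lines applied to `df` would show which columns use 0 as a placeholder.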

---

## ✅ **Q2. Data preprocessing**

### Tasks:
- Handle missing or zero values (especially in `Glucose`, `BMI`, `BloodPressure`, `SkinThickness`, and `Insulin`)
- Outlier removal using IQR or z-score
- No categorical variables → dummy encoding not needed

```python
import numpy as np

# Treat 0 as missing in columns where it is not a plausible measurement
# ('Pregnancies' and 'Outcome' can legitimately be 0)
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)

# Fill missing values with each column's median
# (assigning back avoids the unreliable chained inplace fillna)
df[cols_with_zeros] = df[cols_with_zeros].fillna(df[cols_with_zeros].median())

# Check again
print(df.isnull().sum())
```
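
The task list above mentions outlier removal, which the preprocessing snippet does not perform. A minimal sketch of the IQR rule follows. The helper name `remove_iqr_outliers`, the conventional 1.5 multiplier, and the toy data are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

def remove_iqr_outliers(frame, columns, k=1.5):
    """Drop rows where any listed column falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # A row is kept only if every listed column is inside its fence
    within = ((frame[columns] >= lower) & (frame[columns] <= upper)).all(axis=1)
    return frame[within]

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Glucose": [90, 100, 110, 120, 130, 400],
    "BMI": [22.0, 25.0, 27.0, 30.0, 33.0, 31.0],
})
cleaned = remove_iqr_outliers(toy, ["Glucose", "BMI"])
print(len(toy), "->", len(cleaned))
```

On the real data this might be called as `df = remove_iqr_outliers(df, cols_with_zeros)`. On a small dataset with skewed columns the rule can discard a noticeable share of rows, so it is worth comparing the row count before and after.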

---

## ✅ **Q3. Train-test split**

```python
from sklearn.model_selection import train_test_split

X = df.drop("Outcome", axis=1)
y = df["Outcome"]

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```
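
If the outcome classes are imbalanced, it can help to confirm that both splits keep similar class proportions. A small sketch using the `stratify` argument of `train_test_split` follows; the synthetic labels (35% positives) are made-up numbers used only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 35% positives (made-up numbers)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 35 + [0] * 65)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the positive rate close to 35%
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```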

---

## ✅ **Q4. Train Decision Tree with Cross-validation**

Use `GridSearchCV` with `DecisionTreeClassifier` to tune the tree's hyperparameters with 5-fold cross-validation on the training set.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"]
}

clf = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
clf.fit(X_train, y_train)

print("Best parameters:", clf.best_params_)
```
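
`best_params_` reports only the winning combination. To see how close the alternatives were, the `cv_results_` table can be inspected. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` / `y_train`, a smaller grid, and F1 scoring; all three are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, None], "min_samples_split": [2, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X_syn, y_syn)

# One row per parameter combination, best rank first
results = pd.DataFrame(search.cv_results_)
cols = ["param_max_depth", "param_min_samples_split",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").to_string(index=False))
```

If several combinations score within one standard deviation of each other, the simpler tree (smaller `max_depth`) is often the more defensible choice.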

---

## ✅ **Q5. Evaluate model performance**

```python
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, roc_curve

y_pred = clf.predict(X_test)

# Metrics
print(classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["No Diabetes", "Diabetes"])
disp.plot()
plt.show()

# ROC Curve
y_proba = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
```

---

## ✅ **Q6. Interpret the decision tree**

```python
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index
plot_tree(clf.best_estimator_, feature_names=list(X.columns), class_names=["No Diabetes", "Diabetes"], filled=True)
plt.show()

# Feature importances
importances = pd.Series(clf.best_estimator_.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind='bar', title="Feature Importance")
plt.show()
```

### Likely top features:
- Glucose
- BMI
- Age
- Insulin

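Read each root-to-leaf path as a rule. For example (the thresholds here are illustrative; take the real ones from your fitted tree):  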
> If Glucose > 127 and BMI > 30 → high likelihood of diabetes.
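
Reading thresholds off a large plot can be awkward. `sklearn.tree.export_text` prints the same splits as indented text rules. The sketch below runs on synthetic data, and the feature names `f0` to `f3` are hypothetical placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in; f0..f3 are hypothetical feature names
X_syn, y_syn = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_syn, y_syn)

# Each printed line is either a split condition or a leaf with its predicted class
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

For the fitted grid search, the equivalent call would be `export_text(clf.best_estimator_, feature_names=list(X.columns))`.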

---

## ✅ **Q7. Validate the model (robustness check)**

You can simulate new data scenarios or perturb values:

```python
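# Sensitivity test: add Gaussian noise scaled to each feature's spread.
# A fixed 0.01 would be negligible for features such as Glucose;
# 5% of the standard deviation is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(42)  # seeded so the check is reproducible
noise = rng.normal(0, 1, X_test.shape) * X_test.std().to_numpy() * 0.05
X_test_perturbed = X_test + noise
y_pred_perturbed = clf.predict(X_test_perturbed)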

# Compare results
print("Original F1:", classification_report(y_test, y_pred, output_dict=True)["weighted avg"]["f1-score"])
print("Perturbed F1:", classification_report(y_test, y_pred_perturbed, output_dict=True)["weighted avg"]["f1-score"])
```
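
A single noise level says little about how quickly predictions degrade. The sketch below sweeps several noise levels, expressed as a fraction of each feature's standard deviation, and reports how many predictions flip. It runs on synthetic data; the chosen levels and the helper name `flip_rate` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes features
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)
model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
feature_std = X_tr.std(axis=0)
baseline = model.predict(X_te)

def flip_rate(scale):
    """Fraction of test predictions that change under Gaussian noise of the given relative scale."""
    noise = rng.normal(0.0, 1.0, X_te.shape) * feature_std * scale
    return float(np.mean(model.predict(X_te + noise) != baseline))

rates = {scale: flip_rate(scale) for scale in (0.0, 0.05, 0.1, 0.5)}
for scale, rate in rates.items():
    print(f"noise = {scale:.2f} x std -> {rate:.1%} of predictions changed")
```

A model whose predictions flip at very small noise levels is relying on thresholds that sit close to many data points, which may warrant a shallower tree.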
