# Diabetes Prediction

**Name:** Iqbal Satria Nugraha  
**Student ID (NIM):** A11.2023.15310  
**Date:** 9 July 2025


## Problem Summary and Objectives
This project aims to predict whether a person has diabetes from a set of medical measurements, framed as a binary classification task.

The dataset used is the *Pima Indians Diabetes Dataset*, which contains medical features such as blood pressure, glucose level, and body mass index (BMI).

### Objectives:
- Explore and preprocess the dataset
- Train several machine learning models (Logistic Regression, Decision Tree, Random Forest)
- Evaluate model performance and identify the best one

## Workflow
```mermaid
graph TD
A[Load Dataset] --> B[Data Exploration]
B --> C[Preprocessing]
C --> D[Split Data]
D --> E[Modeling]
E --> F[Model Evaluation]
```

---

### 💻 **[Code Cell] Import and Load Dataset**
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load dataset
file_path = "pima-indians-diabetes.data.csv"
column_names = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", 
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"
]

data = pd.read_csv(file_path, names=column_names)
data.head()
```

```python
# Column dtypes and non-null counts
data.info()
```

```python
# Summary statistics for each column
data.describe()
```

```python
# Class distribution of the target variable
sns.countplot(x="Outcome", data=data)
plt.title("Class Distribution (Diabetes vs Non-Diabetes)")
plt.show()
```

```python
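# A value of 0 is physiologically implausible in these columns, so treat it as missing
cols_with_zero = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
data[cols_with_zero] = data[cols_with_zero].replace(0, np.nan)
# Impute each missing value with its column median
data = data.fillna(data.median())
```

```python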
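X = data.drop("Outcome", axis=1)
y = data["Outcome"]

# Split first so the test set stays unseen during scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

```python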
# Model 1: Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
```

```python
# Model 2: Decision Tree (fixed seed so the result is reproducible)
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
```

```python
# Model 3: Random Forest (fixed seed so the result is reproducible)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
```

```python
# Detailed evaluation of the Random Forest predictions
print("Confusion Matrix - Random Forest:")
print(confusion_matrix(y_test, y_pred_rf))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))
```
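
The confusion matrix and the classification report are linked by simple arithmetic, which the following minimal sketch spells out in plain Python. It assumes scikit-learn's layout for a binary problem, `[[TN, FP], [FN, TP]]`. The helper name `metrics_from_confusion` and the counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def metrics_from_confusion(cm):
    """Derive accuracy, precision, recall and F1 for the positive class.

    `cm` follows scikit-learn's layout: rows are true labels, columns are
    predicted labels, so cm = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only (not results from this notebook)
example_cm = [[50, 10], [5, 35]]
example_metrics = metrics_from_confusion(example_cm)
for name, value in example_metrics.items():
    print(f"{name}: {value:.3f}")
```

For a screening task, recall on the positive class is often the number to watch, since a false negative means a missed case.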


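## Conclusion

Based on experiments with several classification models:
- **Random Forest** achieved the highest accuracy among the models tested.
- Preprocessing, namely imputing implausible zero values and scaling the features, is important for model performance. Scaling matters mainly for Logistic Regression, since tree-based models are insensitive to feature scale.
- The class distribution is moderately imbalanced: in the standard 768-row version of this dataset, 500 samples are negative and 268 are positive (about 65% vs. 35%), so accuracy should be read alongside precision and recall.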
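
The preprocessing point above can be sketched without any library to show what imputation and scaling actually compute. This is a minimal illustration under two assumptions: the readings are made-up toy values, and the fill value and scaling statistics are learned from the training rows only, which is one common way to keep test data out of the fitted statistics. The helper names `impute_zeros_with_median` and `standardize` are hypothetical.

```python
from statistics import mean, median, pstdev


def impute_zeros_with_median(values, fill=None):
    """Replace zeros (treated as missing) with `fill`.

    If `fill` is not given, it is the median of the non-zero values.
    Returns the imputed list and the fill value that was used.
    """
    if fill is None:
        fill = median(v for v in values if v != 0)
    return [fill if v == 0 else v for v in values], fill


def standardize(values, mu, sigma):
    """Z-score the values using externally supplied statistics."""
    return [(v - mu) / sigma for v in values]


# Toy glucose readings (made-up numbers); 0 marks a missing measurement
train_glucose = [148, 85, 0, 89, 137, 116]
test_glucose = [0, 110]

# Learn the fill value and the scaling statistics from the training rows only
train_filled, train_median = impute_zeros_with_median(train_glucose)
mu, sigma = mean(train_filled), pstdev(train_filled)

# Reuse the training statistics on the test rows
test_filled, _ = impute_zeros_with_median(test_glucose, fill=train_median)
train_scaled = standardize(train_filled, mu, sigma)
test_scaled = standardize(test_filled, mu, sigma)

print("train median:", train_median)
print("test after imputation:", test_filled)
```

The population standard deviation (`pstdev`) is used because that is what `StandardScaler` computes. After scaling, the training values have mean 0 and standard deviation 1, while the test values generally do not, which is expected.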

The model can be used to support an early prediction of diabetes risk based on basic medical data.
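
An accuracy figure is easier to interpret next to a trivial baseline: the score obtained by always predicting the most frequent class. The sketch below computes it from class counts. The counts are those commonly reported for the standard 768-row Pima file, which is an assumption about the CSV used here; `data["Outcome"].value_counts()` would confirm them. The function name `majority_baseline_accuracy` is hypothetical.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Assumed counts for the standard 768-row file (0 = negative, 1 = positive);
# verify against the actual data with data["Outcome"].value_counts()
counts = {0: 500, 1: 268}
baseline = majority_baseline_accuracy(counts)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

A model is only informative to the extent that it beats this baseline, ideally on recall for the positive class as well as on accuracy.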

