# **Data Understanding**

The dataset used is the **Pima Indians Diabetes Database**, published by UCI Machine Learning on Kaggle.  
It contains clinical data for each patient, with attributes such as number of pregnancies, glucose level, blood pressure, BMI, family history of diabetes, and a label indicating whether the patient was diagnosed with diabetes.

## Dataset Information
- Number of rows: 768
- Number of columns: 9 (8 features + 1 label)
- Target variable: `Outcome`
  - 1 = patient has diabetes
  - 0 = patient does not have diabetes

## Data Source
- https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

## Data Exploration
The goal of data exploration is to understand the initial patterns in the dataset:
- **Distribution of numeric variables** (for example: Glucose, BloodPressure, BMI, Age).
- **Comparison of means between diabetic and non-diabetic patients**.
- **Checking for anomalous values** (for example, a value of 0 for blood pressure or insulin, which should not be possible).
- **Correlation between features** to see which variables are most strongly associated with the outcome.

Example exploration questions:
- Do diabetic patients tend to have higher glucose levels?
- Are patients with a high BMI at greater risk of diabetes?
- How does the age distribution differ between diabetic and non-diabetic patients?
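
The group comparison and correlation checks listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own analysis: the `sample` frame holds only the five rows printed by `df.head()` further down, and the helper names `compare_groups` and `outcome_correlation` are hypothetical. With so few rows the numbers are not meaningful. The point is the shape of the computation.

```python
import pandas as pd

# Illustration only: the five rows shown by df.head() in this notebook,
# not the full 768-row dataset.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})


def compare_groups(df, features=("Glucose", "BMI", "Age")):
    """Mean of each feature per Outcome class (0 = non-diabetic, 1 = diabetic)."""
    return df.groupby("Outcome")[list(features)].mean()


def outcome_correlation(df):
    """Pearson correlation of every feature with Outcome, strongest first."""
    corr = df.corr()["Outcome"].drop("Outcome")
    return corr.sort_values(ascending=False)


group_means = compare_groups(sample)
print(group_means)
print(outcome_correlation(sample))
```

In the notebook itself, the same two helpers could be called on the full `df` once it has been loaded.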


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# === Load Dataset ===
# Note: the CSV is read from a GitHub copy (plotly/datasets), not directly from the Kaggle page.
url = "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
df = pd.read_csv(url)

# === Dataset Info ===
print("Number of rows & columns:", df.shape)
print("\nFirst 5 rows:")
display(df.head())

Number of rows & columns: (768, 9)

First 5 rows:


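| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |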


In [None]:
print("\nSummary Statistics:")
display(df.describe())


Summary Statistics:


| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
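
One thing the summary gives almost for free is the class balance: because `Outcome` is coded 0/1, its mean is the share of diabetic patients. The short sketch below only redoes that arithmetic with the two numbers printed above (`count` and the `Outcome` mean); the variable names are illustrative.

```python
# Values copied from the describe() output above.
n_rows = 768
positive_rate = 0.348958  # mean of Outcome, i.e. the share of rows with Outcome == 1

# Recover the approximate class counts from the rounded mean.
n_positive = round(n_rows * positive_rate)
n_negative = n_rows - n_positive

print(f"diabetic: {n_positive}, non-diabetic: {n_negative}")
```

In the notebook, `df["Outcome"].value_counts()` would report the same counts directly. Roughly one third of the rows are positive, so the classes are moderately imbalanced.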


In [None]:
print("\nMissing value check:")
# isnull() only counts NaN entries; zeros used as placeholders are not detected here.
print(df.isnull().sum())


Missing value check:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
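
A count of zero NaN values does not mean the data is complete: the summary statistics show a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, which matches the anomalous zeros mentioned in the exploration goals. A minimal sketch of how those zeros could be counted and marked as missing follows. Treating exactly these five columns as "zero means missing" is an assumption made here, not something the notebook states, and the helper names are hypothetical. The `sample` frame again holds only the five rows shown by `df.head()`.

```python
import numpy as np
import pandas as pd

# Illustration only: the relevant columns of the five rows shown by df.head().
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns is a placeholder for "not measured".
# Pregnancies and Outcome are excluded because 0 is a valid value there.
ZERO_AS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


def count_zeros(df, columns=ZERO_AS_MISSING):
    """Number of zero entries per column."""
    return (df[list(columns)] == 0).sum()


def zeros_to_nan(df, columns=ZERO_AS_MISSING):
    """Return a copy in which zeros in the given columns are replaced by NaN."""
    out = df.copy()
    cols = list(columns)
    out[cols] = out[cols].replace(0, np.nan)
    return out


zero_counts = count_zeros(sample)
cleaned = zeros_to_nan(sample)

print(zero_counts)
print(cleaned.isnull().sum())
```

After the replacement, `isnull().sum()` reports the hidden gaps, which can then be imputed or dropped in a later preparation step.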
