# **1. Dataset Introduction**


First, find and use a dataset that meets the following requirement:

1. **Dataset source**:  
   The dataset can come from a variety of sources, such as public repositories (*Kaggle*, *UCI ML Repository*, *Open Data*) or primary data that you collect yourself.


# **2. Import Libraries**

In this step, import the Python libraries needed for data analysis and for building machine learning or deep learning models.

In [43]:
!pip install pandas numpy scikit-learn joblib kagglehub



In [44]:
import kagglehub
import pandas as pd
import numpy as np
import os
import joblib
from sklearn.preprocessing import StandardScaler

# **3. Loading the Dataset**

In this step, load the dataset into the notebook. If the dataset is in CSV format, you can read it with pandas. Check the first few rows to understand the structure and confirm that the data loaded correctly.

If the dataset is stored on Google Drive, connect Google Drive to Colab first. Once the dataset is loaded, the next step is to verify that the data is consistent and ready for further analysis.

If the dataset is unstructured, adapt the format to match the one used in the Machine Learning Pengembangan or Machine Learning Terapan classes.
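
As a rough illustration of the "load, then check" step described above, the sketch below reads a small in-memory CSV and reports which expected columns are missing. The helper `check_loaded` and the sample data are hypothetical and are not part of this notebook; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd


def check_loaded(df: pd.DataFrame, expected_cols: list[str]) -> list[str]:
    """Hypothetical helper: return the expected columns missing from df."""
    if df.empty:
        raise ValueError("DataFrame is empty; check the file path or separator.")
    return [c for c in expected_cols if c not in df.columns]


# Tiny in-memory CSV standing in for a real file on disk (assumption).
sample_csv = io.StringIO("age,sex,condition\n69,1,0\n66,0,0\n65,1,1\n")
sample_df = pd.read_csv(sample_csv)

missing = check_loaded(sample_df, ["age", "sex", "condition"])
print(sample_df.head())
print("Missing columns:", missing)
```

An empty list means every expected column is present; anything else is worth investigating before moving on to EDA.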

In [45]:
path = kagglehub.dataset_download("cherngs/heart-disease-cleveland-uci")

print("Dataset path:", path)
print("Isi folder:")
for f in os.listdir(path):
    print(f)

Using Colab cache for faster access to the 'heart-disease-cleveland-uci' dataset.
Dataset path: /kaggle/input/heart-disease-cleveland-uci
Isi folder:
heart_cleveland_upload.csv


In [46]:
csv_path = os.path.join(path, "heart_cleveland_upload.csv")
df = pd.read_csv(csv_path)

df.head()

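| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69 | 1 | 0 | 160 | 234 | 1 | 2 | 131 | 0 | 0.1 | 1 | 1 | 0 | 0 |
| 1 | 69 | 0 | 0 | 140 | 239 | 0 | 0 | 151 | 0 | 1.8 | 0 | 2 | 0 | 0 |
| 2 | 66 | 0 | 0 | 150 | 226 | 0 | 0 | 114 | 0 | 2.6 | 2 | 0 | 0 | 0 |
| 3 | 65 | 1 | 0 | 138 | 282 | 1 | 2 | 174 | 0 | 1.4 | 1 | 1 | 0 | 1 |
| 4 | 64 | 1 | 0 | 110 | 211 | 0 | 2 | 144 | 1 | 1.8 | 1 | 0 | 0 | 0 |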


In [47]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'condition'],
      dtype='object')

In [48]:
for cand in ["target", "condition", "num"]:
    if cand in df.columns:
        print("Target ketemu:", cand)

Target ketemu: condition
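
The loop above only prints a match. If later cells need the column name, one option is to return the first candidate found and fail loudly when none exists. This is a minimal sketch; the function name `find_target_column` and the demo frame are assumptions made for illustration.

```python
import pandas as pd


def find_target_column(df, candidates=("target", "condition", "num")):
    """Return the first candidate name present in df, or raise KeyError."""
    for name in candidates:
        if name in df.columns:
            return name
    raise KeyError(f"None of {candidates} found in columns: {list(df.columns)}")


# Small demo frame (assumption); the real notebook would pass `df`.
demo = pd.DataFrame({"age": [69, 66], "condition": [0, 1]})
target_col = find_target_column(demo)
print("Target found:", target_col)
```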


# **4. Exploratory Data Analysis (EDA)**

In this step, you perform **Exploratory Data Analysis (EDA)** to understand the characteristics of the dataset.

The goal of EDA is to gain initial insight into the data and to decide the next steps in analysis or modeling.

In [49]:
print("Shape:", df.shape)
df.info()

Shape: (297, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        297 non-null    int64  
 1   sex        297 non-null    int64  
 2   cp         297 non-null    int64  
 3   trestbps   297 non-null    int64  
 4   chol       297 non-null    int64  
 5   fbs        297 non-null    int64  
 6   restecg    297 non-null    int64  
 7   thalach    297 non-null    int64  
 8   exang      297 non-null    int64  
 9   oldpeak    297 non-null    float64
 10  slope      297 non-null    int64  
 11  ca         297 non-null    int64  
 12  thal       297 non-null    int64  
 13  condition  297 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 32.6 KB


In [50]:
df.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 |
| mean | 54.542088 | 0.676768 | 2.158249 | 131.693603 | 247.350168 | 0.144781 | 0.996633 | 149.599327 | 0.326599 | 1.055556 | 0.602694 | 0.676768 | 0.835017 | 0.461279 |
| std | 9.049736 | 0.4685 | 0.964859 | 17.762806 | 51.997583 | 0.352474 | 0.994914 | 22.941562 | 0.469761 | 1.166123 | 0.618187 | 0.938965 | 0.95669 | 0.49934 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 2.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 56.0 | 1.0 | 2.0 | 130.0 | 243.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 61.0 | 1.0 | 3.0 | 140.0 | 276.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 1.0 | 1.0 | 2.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 3.0 | 2.0 | 1.0 |


In [51]:
df.duplicated().sum()

np.int64(0)

In [52]:
df["condition"].value_counts().sort_index()

| condition | count |
|---|---|
| 0 | 160 |
| 1 | 137 |


In [53]:
df["target"] = (df["condition"] > 0).astype(int)
df = df.drop(columns=["condition"])

df["target"].value_counts()

| target | count |
|---|---|
| 0 | 160 |
| 1 | 137 |
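
The counts above (160 vs. 137) correspond to roughly 54% and 46%, so the classes are fairly balanced. A possible next EDA step, not performed in this notebook, is to look at class shares and at how each numeric feature correlates with the target. The sketch below uses a toy frame built from the five rows shown by `df.head()`; with only five rows the correlations are not meaningful and serve purely to show the calls.

```python
import pandas as pd

# Toy frame copied from the first five rows of the dataset (illustration only).
toy = pd.DataFrame({
    "age": [69, 69, 66, 65, 64],
    "chol": [234, 239, 226, 282, 211],
    "thalach": [131, 151, 114, 174, 144],
    "target": [0, 0, 0, 1, 0],
})

# Share of each class instead of raw counts.
class_share = toy["target"].value_counts(normalize=True).sort_index()

# Pearson correlation of every feature with the target.
corr_with_target = toy.drop(columns="target").corrwith(toy["target"])

print(class_share)
print(corr_with_target.sort_values(ascending=False))
```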


# **5. Data Preprocessing**

Data preprocessing is an important step for ensuring data quality before the data is used in a machine learning model.

Raw data, including text data, often contains missing values, duplicates, or inconsistent value ranges, all of which can hurt model performance. This step therefore cleans and prepares the data so that the analysis runs well.

The steps below are typical, but the list is **not exhaustive**:
1. Removing or handling missing values
2. Removing duplicate rows
3. Normalizing or standardizing features
4. Detecting and handling outliers
5. Encoding categorical data
6. Binning (grouping values into ranges)

Adapt these steps to the characteristics of your data, especially when you are working with unstructured data.
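
Steps 5 and 6 in the list above are not used later in this notebook, so here is a minimal sketch of what they could look like with pandas. The toy values, the bin edges, and the label names are arbitrary assumptions chosen for illustration, not recommendations for this dataset.

```python
import pandas as pd

# Toy data (assumption): age in years and chest pain type as an integer code.
toy = pd.DataFrame({"age": [29, 48, 56, 61, 77], "cp": [0, 2, 2, 3, 3]})

# Step 5: one-hot encode the categorical code so no ordering is implied.
encoded = pd.get_dummies(toy, columns=["cp"], prefix="cp", dtype=int)

# Step 6: bin age into left-closed ranges [0, 45), [45, 60), [60, 120).
age_bins = [0, 45, 60, 120]
age_labels = ["under_45", "45_to_59", "60_plus"]
encoded["age_group"] = pd.cut(
    toy["age"], bins=age_bins, labels=age_labels, right=False
)

print(encoded)
```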

In [54]:
df2 = df.copy()

In [55]:
# Force every column to a numeric dtype; values that cannot be parsed become NaN.
for col in df2.columns:
    df2[col] = pd.to_numeric(df2[col], errors="coerce")

In [56]:
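# Fill missing values in each affected column with that column's median.
for col in df2.columns:
    if df2[col].isna().sum() > 0:
        df2[col] = df2[col].fillna(df2[col].median())

# Confirm that no missing values remain.
df2.isna().sum()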

| | 0 |
|---|---|
| age | 0 |
| sex | 0 |
| cp | 0 |
| trestbps | 0 |
| chol | 0 |
| fbs | 0 |
| restecg | 0 |
| thalach | 0 |
| exang | 0 |
| oldpeak | 0 |


In [57]:
print("Duplikat sebelum:", df2.duplicated().sum())
df2 = df2.drop_duplicates()
print("Duplikat sesudah:", df2.duplicated().sum())

Duplikat sebelum: 0
Duplikat sesudah: 0


In [58]:
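feature_cols = [c for c in df2.columns if c != "target"]

# Clip every feature to [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] instead of dropping rows.
# Note: this is also applied to binary and categorical codes. For fbs, Q1 = Q3 = 0
# (see df.describe() above), so every fbs value is clipped to 0.
for c in feature_cols:
    q1 = df2[c].quantile(0.25)
    q3 = df2[c].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df2[c] = df2[c].clip(lower, upper)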
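
One side effect of the loop above is worth noting. According to the `df.describe()` output, `fbs` has a 25th and a 75th percentile of 0, so its IQR is 0 and both bounds are 0. Clipping therefore sets every `fbs` value to 0 and the feature carries no information afterwards. A possible alternative, sketched below under the assumption that only continuous measurements should be clipped, is to pass an explicit column list. The helper name `clip_iqr`, the toy data, and the choice of continuous columns are hypothetical.

```python
import pandas as pd


def clip_iqr(df, columns, k=1.5):
    """Return a copy of df with the given columns clipped to the IQR fences."""
    out = df.copy()
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - k * iqr, q3 + k * iqr)
    return out


# Toy data (assumption): one continuous column and one rare binary flag.
toy = pd.DataFrame({
    "chol": [211, 226, 234, 239, 243, 276, 282, 564],
    "fbs": [0, 0, 0, 0, 0, 0, 0, 1],
})

continuous_cols = ["chol"]  # assumption: only continuous features are clipped
clipped = clip_iqr(toy, continuous_cols)

# For comparison: clipping the binary flag as well collapses it to one value.
flattened = clip_iqr(toy, ["chol", "fbs"])

print(clipped)
print("Distinct fbs values if clipped:", flattened["fbs"].nunique())
```

Which columns count as continuous is a modeling decision; in this dataset `age`, `trestbps`, `chol`, `thalach`, and `oldpeak` would be plausible candidates.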

In [59]:
X = df2[feature_cols].astype(float)
y = df2["target"].astype(int)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

df_processed = pd.DataFrame(X_scaled, columns=feature_cols)
df_processed["target"] = y.values

df_processed.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.600302 | 0.691095 | -1.923237 | 1.708484 | -0.256741 | 0.0 | 1.010199 | -0.8187 | -0.696419 | -0.844856 | 0.643781 | 0.415469 | -0.874292 | 0 |
| 1 | 1.600302 | -1.44698 | -1.923237 | 0.516098 | -0.152042 | 0.0 | -1.003419 | 0.059667 | -0.696419 | 0.682965 | -0.976583 | 1.579566 | -0.874292 | 0 |
| 2 | 1.268242 | -1.44698 | -1.923237 | 1.112291 | -0.424258 | 0.0 | -1.003419 | -1.565312 | -0.696419 | 1.40194 | 2.264145 | -0.748628 | -0.874292 | 0 |
| 3 | 1.157555 | 0.691095 | -1.923237 | 0.39686 | 0.748366 | 0.0 | 1.010199 | 1.069789 | -0.696419 | 0.323478 | 0.643781 | 0.415469 | -0.874292 | 1 |
| 4 | 1.046868 | 0.691095 | -1.923237 | -1.272481 | -0.738354 | 0.0 | 1.010199 | -0.247762 | 1.435916 | 0.682965 | 0.643781 | -0.748628 | -0.874292 | 0 |
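
The cell above fits `StandardScaler` on the full dataset. If the processed data is later split into training and test sets, the scaler statistics will already include the test rows. A common alternative, shown here as a sketch on synthetic data rather than as this notebook's method, is to split first and fit the scaler on the training portion only. The random data, the split ratio, and the seeds are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and binary target (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first so the test rows never influence the scaler statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```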


In [60]:
out_dir = "namadataset_preprocessing"
os.makedirs(out_dir, exist_ok=True)

csv_out = os.path.join(out_dir, "heart_disease_preprocessing.csv")
scaler_out = os.path.join(out_dir, "scaler.joblib")

df_processed.to_csv(csv_out, index=False)
joblib.dump(scaler, scaler_out)

print("Saved:", csv_out)
print("Saved:", scaler_out)
print("Shape processed:", df_processed.shape)


Saved: namadataset_preprocessing/heart_disease_preprocessing.csv
Saved: namadataset_preprocessing/scaler.joblib
Shape processed: (297, 14)
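
Saving the scaler alongside the CSV makes it possible to apply the same transformation to new rows later. The sketch below shows the round trip with `joblib.dump` and `joblib.load` in a temporary directory; the two-column toy data is an assumption, and in practice you would load `scaler.joblib` from the output folder and pass rows with the same feature columns in the same order.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "scaler.joblib")

    # Fit and persist a scaler on toy data (assumption).
    fitted = StandardScaler().fit(np.array([[1.0, 10.0], [3.0, 30.0]]))
    joblib.dump(fitted, scaler_path)

    # Later, possibly in another process: restore it and transform new rows.
    restored = joblib.load(scaler_path)
    new_rows = np.array([[2.0, 20.0]])
    scaled_rows = restored.transform(new_rows)

print(scaled_rows)
```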


In [61]:
%%writefile automate_NadhylaRachellya.py
import os
import pandas as pd
import kagglehub
import joblib
from sklearn.preprocessing import StandardScaler

def main(out_dir="namadataset_preprocessing"):
    path = kagglehub.dataset_download("cherngs/heart-disease-cleveland-uci")
    csv_path = os.path.join(path, "heart_cleveland_upload.csv")
    df = pd.read_csv(csv_path)

    # target: condition -> binary target
    df["target"] = (df["condition"] > 0).astype(int)
    df = df.drop(columns=["condition"])

    df2 = df.copy()

    for col in df2.columns:
        df2[col] = pd.to_numeric(df2[col], errors="coerce")

    for col in df2.columns:
        if df2[col].isna().sum() > 0:
            df2[col] = df2[col].fillna(df2[col].median())

    df2 = df2.drop_duplicates()

    feature_cols = [c for c in df2.columns if c != "target"]
    for c in feature_cols:
        q1 = df2[c].quantile(0.25)
        q3 = df2[c].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        df2[c] = df2[c].clip(lower, upper)

    X = df2[feature_cols].astype(float)
    y = df2["target"].astype(int)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    df_processed = pd.DataFrame(X_scaled, columns=feature_cols)
    df_processed["target"] = y.values

    os.makedirs(out_dir, exist_ok=True)
    df_processed.to_csv(os.path.join(out_dir, "heart_disease_preprocessing.csv"), index=False)
    joblib.dump(scaler, os.path.join(out_dir, "scaler.joblib"))

    print("✅ Preprocessing selesai. Output ada di:", out_dir)

if __name__ == "__main__":
    main()


Overwriting automate_NadhylaRachellya.py


In [62]:
!python automate_NadhylaRachellya.py

Using Colab cache for faster access to the 'heart-disease-cleveland-uci' dataset.
✅ Preprocessing selesai. Output ada di: namadataset_preprocessing
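
After an automated run it can help to sanity-check the generated file before using it for training. The sketch below is one hypothetical way to do that: `validate_processed` is not part of the script above, and the toy frame stands in for the result of `pd.read_csv` on the output CSV.

```python
import pandas as pd


def validate_processed(df, target_col="target"):
    """Hypothetical check: return a list of problems found in a processed frame."""
    problems = []
    if df.isna().any().any():
        problems.append("contains missing values")
    if not set(df[target_col].unique()) <= {0, 1}:
        problems.append("target is not binary")
    constant = [c for c in df.columns if c != target_col and df[c].nunique() <= 1]
    if constant:
        problems.append(f"constant features: {constant}")
    return problems


# Toy stand-in for the processed CSV (assumption).
toy = pd.DataFrame({
    "age": [1.60, 1.27, 1.16],
    "fbs": [0.0, 0.0, 0.0],
    "target": [0, 0, 1],
})
issues = validate_processed(toy)
print(issues)
```

An empty list means none of these checks failed; a constant feature such as `fbs` in the toy frame is flagged because it cannot help a model distinguish the classes.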
