# **1. Dataset Introduction**


The dataset used is `train_u6lujuX_CVtuZ9i.csv` from the Loan Prediction Problem Dataset, downloaded from Kaggle (https://www.kaggle.com/datasets/altruistdelhite04/loan-prediction-problem-dataset).


# **2. Importing Libraries**

This step imports the Python libraries used in this notebook: `os`, `pandas`, and `numpy` for file handling and data analysis, and `kagglehub` for downloading the dataset from Kaggle.

In [1]:
import os
import pandas as pd 
import numpy as np 
import kagglehub
from kagglehub import KaggleDatasetAdapter



# **3. Loading the Dataset**

In this step, the dataset is downloaded from Kaggle and saved to the `../data_raw` directory.

In [3]:
# Kaggle dataset ID
dataset_id = "altruistdelhite04/loan-prediction-problem-dataset"

# Output folder
os.makedirs("../data_raw", exist_ok=True)

# Files to download
file_list = ["train_u6lujuX_CVtuZ9i.csv"]

# Load each file and save it locally
for file_name in file_list:
    df = kagglehub.load_dataset(
        KaggleDatasetAdapter.PANDAS,
        dataset_id,
        file_name
    )

    # Save to ../data_raw
    save_path = os.path.join("../data_raw", file_name)
    df.to_csv(save_path, index=False)
    print(f"{file_name} saved to: {save_path}")

train_u6lujuX_CVtuZ9i.csv saved to: ../data_raw\train_u6lujuX_CVtuZ9i.csv
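
The loading cell downloads the file every time it runs. A minimal sketch of a cache-first loader is shown below; `load_or_fetch` and `fake_fetch` are hypothetical names introduced here for illustration, and the stand-in fetcher replaces the real Kaggle call so the example runs offline.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(save_path, fetch):
    """Read save_path if it exists; otherwise call fetch(), save the result, and return it."""
    if os.path.exists(save_path):
        return pd.read_csv(save_path)
    df = fetch()
    os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)
    df.to_csv(save_path, index=False)
    return df


# Stand-in for the Kaggle download (hypothetical, illustrative values only)
calls = []

def fake_fetch():
    calls.append(1)
    return pd.DataFrame({"Loan_ID": ["A1"], "ApplicantIncome": [3000]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_raw", "demo.csv")
    first = load_or_fetch(path, fake_fetch)   # file is missing: fetches and saves
    second = load_or_fetch(path, fake_fetch)  # file exists: reads the saved CSV

print(f"fetch was called {len(calls)} time(s)")
```

In the notebook, the `fetch` argument could be a lambda wrapping the `kagglehub.load_dataset(...)` call from the cell above, with `save_path` pointing into `../data_raw`.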


# **4. Exploratory Data Analysis (EDA)**

This step explores the characteristics of the dataset to determine which preprocessing steps are needed.

## Info and sample of data_raw

The dataset consists of 614 rows and 13 columns with the following variables:

| Column Name         | Description                                | Data Type   |
|---------------------|--------------------------------------------|-------------|
| Loan_ID             | Unique applicant ID                        | object      |
| Gender              | Applicant's gender                         | object      |
| Married             | Marital status                             | object      |
| Dependents          | Number of dependents                       | object      |
| Education           | Education level                            | object      |
| Self_Employed       | Self-employment status                     | object      |
| ApplicantIncome     | Applicant's income                         | int64       |
| CoapplicantIncome   | Co-applicant's income                      | float64     |
| LoanAmount          | Loan amount (in thousands)                 | float64     |
| Loan_Amount_Term    | Loan term                                  | float64     |
| Credit_History      | Credit history                             | float64     |
| Property_Area       | Property location                          | object      |
| Loan_Status         | Loan approval status (target)              | object      |


In [4]:
data_raw = pd.read_csv('../data_raw/train_u6lujuX_CVtuZ9i.csv')
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [22]:
data_raw.head()

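|   | Loan_ID  | Gender | Married | Dependents | Education    | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|----------|--------|---------|------------|--------------|---------------|-----------------|-------------------|------------|------------------|----------------|---------------|-------------|
| 0 | LP001002 | Male   | No      | 0          | Graduate     | No            | 5849            | 0.0               | NaN        | 360.0            | 1.0            | Urban         | Y           |
| 1 | LP001003 | Male   | Yes     | 1          | Graduate     | No            | 4583            | 1508.0            | 128.0      | 360.0            | 1.0            | Rural         | N           |
| 2 | LP001005 | Male   | Yes     | 0          | Graduate     | Yes           | 3000            | 0.0               | 66.0       | 360.0            | 1.0            | Urban         | Y           |
| 3 | LP001006 | Male   | Yes     | 0          | Not Graduate | No            | 2583            | 2358.0            | 120.0      | 360.0            | 1.0            | Urban         | Y           |
| 4 | LP001008 | Male   | No      | 0          | Graduate     | No            | 6000            | 0.0               | 141.0      | 360.0            | 1.0            | Urban         | Y           |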


## Missing values in data_raw

- Seven columns contain missing values:
  - Gender (13 null)
  - Married (3 null)
  - Dependents (15 null)
  - Self_Employed (32 null)
  - LoanAmount (22 null)
  - Loan_Amount_Term (14 null)
  - Credit_History (50 null)

In [24]:
data_raw.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
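
Raw counts are easier to judge relative to the number of rows; for example, the 50 missing `Credit_History` values are about 8.14% of the 614 rows. The sketch below shows one way to tabulate both numbers. `missing_summary` is a hypothetical helper, and the toy frame stands in for `data_raw` with illustrative values only.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame standing in for data_raw (illustrative values only)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [np.nan, 128.0, np.nan, 120.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})
print(missing_summary(toy))
```

Calling `missing_summary(data_raw)` in the notebook would list the affected columns sorted by how many values they are missing.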

## Unique values in all columns

- Number of unique values per column:
  - Loan_ID              614
  - Gender                 2
  - Married                2
  - Dependents             4
  - Education              2
  - Self_Employed          2
  - ApplicantIncome      505
  - CoapplicantIncome    287
  - LoanAmount           203
  - Loan_Amount_Term      10
  - Credit_History         2
  - Property_Area          3
  - Loan_Status            2

In [25]:
data_raw.nunique()

Loan_ID              614
Gender                 2
Married                2
Dependents             4
Education              2
Self_Employed          2
ApplicantIncome      505
CoapplicantIncome    287
LoanAmount           203
Loan_Amount_Term      10
Credit_History         2
Property_Area          3
Loan_Status            2
dtype: int64

## Unique values of the categorical columns

- Unique values in 'Gender': ['Male' 'Female' nan]
- Unique values in 'Married': ['No' 'Yes' nan]
- Unique values in 'Dependents': ['0' '1' '2' '3+' nan]
- Unique values in 'Education': ['Graduate' 'Not Graduate']
- Unique values in 'Self_Employed': ['No' 'Yes' nan]
- Unique values in 'Property_Area': ['Urban' 'Rural' 'Semiurban']
- Unique values in 'Loan_Status': ['Y' 'N']

In [14]:
# Show the unique values of every object-typed column
unique_values = {
    col: data_raw[col].unique()
    for col in data_raw.select_dtypes(include='object').columns
}

for col, values in unique_values.items():
    print(f"Unique values in '{col}': {values}")

Unique values in 'Loan_ID': ['LP001002' 'LP001003' 'LP001005' 'LP001006' 'LP001008' 'LP001011'
 'LP001013' 'LP001014' 'LP001018' 'LP001020' 'LP001024' 'LP001027'
 'LP001028' 'LP001029' 'LP001030' 'LP001032' 'LP001034' 'LP001036'
 'LP001038' 'LP001041' 'LP001043' 'LP001046' 'LP001047' 'LP001050'
 'LP001052' 'LP001066' 'LP001068' 'LP001073' 'LP001086' 'LP001087'
 'LP001091' 'LP001095' 'LP001097' 'LP001098' 'LP001100' 'LP001106'
 'LP001109' 'LP001112' 'LP001114' 'LP001116' 'LP001119' 'LP001120'
 'LP001123' 'LP001131' 'LP001136' 'LP001137' 'LP001138' 'LP001144'
 'LP001146' 'LP001151' 'LP001155' 'LP001157' 'LP001164' 'LP001179'
 'LP001186' 'LP001194' 'LP001195' 'LP001197' 'LP001198' 'LP001199'
 'LP001205' 'LP001206' 'LP001207' 'LP001213' 'LP001222' 'LP001225'
 'LP001228' 'LP001233' 'LP001238' 'LP001241' 'LP001243' 'LP001245'
 'LP001248' 'LP001250' 'LP001253' 'LP001255' 'LP001256' 'LP001259'
 'LP001263' 'LP001264' 'LP001265' 'LP001266' 'LP001267' 'LP001273'
 'LP001275' 'LP001279' 'LP001280' 'LP
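
Because `Loan_ID` is an object column with 614 distinct values, it floods the output above. A possible variant that skips identifier columns is sketched here; `categorical_uniques` and its `skip` parameter are hypothetical, and the toy frame uses made-up IDs.

```python
import pandas as pd


def categorical_uniques(df: pd.DataFrame, skip=("Loan_ID",)) -> dict:
    """Map each non-numeric column (except those in `skip`) to its unique values."""
    cat_cols = df.select_dtypes(exclude="number").columns
    return {col: df[col].unique().tolist() for col in cat_cols if col not in skip}


# Toy frame standing in for data_raw (illustrative values only)
toy = pd.DataFrame({
    "Loan_ID": ["A1", "A2", "A3"],
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5849, 4583, 3000],
})

uniques = categorical_uniques(toy)
for col, values in uniques.items():
    print(f"Unique values in '{col}': {values}")
```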

## Duplicate values

`data_raw` contains no duplicate rows.

In [36]:
data_raw.duplicated().sum()

np.int64(0)
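
Since every row has a distinct `Loan_ID`, a check over all columns cannot find duplicates. The sketch below also counts rows that are identical once the ID column is ignored, which is the comparison that matters after `Loan_ID` is dropped. `count_duplicates` is a hypothetical helper and the toy data is illustrative only.

```python
import pandas as pd


def count_duplicates(df: pd.DataFrame, id_col: str = "Loan_ID"):
    """Return (duplicates over all columns, duplicates ignoring id_col)."""
    full = int(df.duplicated().sum())
    features = df.drop(columns=[id_col], errors="ignore")
    without_id = int(features.duplicated().sum())
    return full, without_id


# Toy frame: rows A1 and A2 differ only in their ID
toy = pd.DataFrame({
    "Loan_ID": ["A1", "A2", "A3"],
    "Gender": ["Male", "Male", "Female"],
    "ApplicantIncome": [3000, 3000, 4000],
})
print(count_duplicates(toy))  # (0, 1)
```

Rows flagged only in the second count may still be distinct applications that happen to share feature values, so whether to drop them is a judgment call.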

# **5. Data Preprocessing**

The `clean_data()` function preprocesses the data to prepare it for the machine learning pipeline:

1. Handling missing values:
   - Numeric columns: filled with the median
   - Categorical columns: filled with the mode

2. Data transformation:
   - Converts '3+' in 'Dependents' to 4
   - Converts 'Loan_Status' to binary (Y→1, N→0)
   - Casts 'ApplicantIncome' to float

3. Cleanup:
   - Drops the 'Loan_ID' column, which is not needed
   - Drops duplicate rows

In [31]:
def clean_data(df):
    """Return a cleaned copy of the raw loan DataFrame."""
    df_clean = df.copy()

    # Drop the Loan_ID column
    if 'Loan_ID' in df_clean.columns:
        df_clean = df_clean.drop(columns=['Loan_ID'])

    # Fill missing values: median for numeric columns, mode for the rest
    for col in df_clean.columns:
        if pd.api.types.is_numeric_dtype(df_clean[col]):
            median_val = df_clean[col].median()
            df_clean[col] = df_clean[col].fillna(median_val)
        else:
            mode_val = df_clean[col].mode(dropna=True)
            if not mode_val.empty:
                df_clean[col] = df_clean[col].fillna(mode_val[0])

    # Clean and convert the 'Dependents' column ('3+' becomes 4)
    if 'Dependents' in df_clean.columns:
        df_clean['Dependents'] = df_clean['Dependents'].replace('3+', '4')
        df_clean['Dependents'] = df_clean['Dependents'].astype(float)

    # Map 'Loan_Status' to 1.0 (Y) and 0.0 (N); map() avoids the downcasting
    # FutureWarning that replace() emits here in recent pandas versions
    if 'Loan_Status' in df_clean.columns:
        df_clean['Loan_Status'] = df_clean['Loan_Status'].map({'Y': 1.0, 'N': 0.0}).astype(float)

    # Cast 'ApplicantIncome' to float
    if 'ApplicantIncome' in df_clean.columns:
        df_clean['ApplicantIncome'] = df_clean['ApplicantIncome'].astype(float)

    # Drop duplicate rows
    df_clean = df_clean.drop_duplicates()

    return df_clean

The cleaned data is saved to `data_clean/data_clean.csv` for the next stage.

In [None]:
os.makedirs('data_clean', exist_ok=True)
clean_data(data_raw).to_csv('data_clean/data_clean.csv', index=False)
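
Before handing the cleaned file to the next stage, it can help to verify that the preprocessing goals listed above actually hold. The following is a minimal sketch of such a check; `check_clean` is a hypothetical helper, and the two toy frames are illustrative stand-ins for raw and cleaned data.

```python
import numpy as np
import pandas as pd


def check_clean(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame passed."""
    problems = []
    if "Loan_ID" in df.columns:
        problems.append("Loan_ID column is still present")
    null_cols = df.columns[df.isnull().any()].tolist()
    if null_cols:
        problems.append(f"missing values remain in: {null_cols}")
    if "Loan_Status" in df.columns and not df["Loan_Status"].isin([0.0, 1.0]).all():
        problems.append("Loan_Status contains values other than 0.0 and 1.0")
    if "Dependents" in df.columns and not pd.api.types.is_numeric_dtype(df["Dependents"]):
        problems.append("Dependents is not numeric")
    if df.duplicated().any():
        problems.append("duplicate rows remain")
    return problems


# Toy frames (illustrative values only)
good = pd.DataFrame({
    "Dependents": [0.0, 4.0],
    "LoanAmount": [128.0, 66.0],
    "Loan_Status": [1.0, 0.0],
})
bad = pd.DataFrame({
    "Loan_ID": ["A1", "A2"],
    "Dependents": ["0", "3+"],
    "LoanAmount": [np.nan, 66.0],
    "Loan_Status": ["Y", "N"],
})

print(check_clean(good))
for problem in check_clean(bad):
    print("-", problem)
```

In the notebook, `check_clean(clean_data(data_raw))` would be expected to return an empty list if the cleaning steps behave as described.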

