
### **Cat Gender Classification Experiment Using Several Machine Learning Methods**

In [15]:
# Import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

### 1. DATA

In [16]:
data = pd.read_csv("cats_dataset.csv")

print("===== DATASET INFORMATION =====")
print("Number of rows :", data.shape[0])
print("Number of attributes :", data.shape[1])
print("\nSample data:")
print(data.head())

===== DATASET INFORMATION =====
Number of rows : 1000
Number of attributes : 5

Sample data:
              Breed  Age (Years)  Weight (kg)          Color  Gender
0      Russian Blue           19            7  Tortoiseshell  Female
1  Norwegian Forest           19            9  Tortoiseshell  Female
2         Chartreux            3            3          Brown  Female
3           Persian           13            6          Sable  Female
4           Ragdoll           10            8          Tabby    Male


### 2. DATA PREPROCESSING

This stage performs:
- Encoding of the categorical columns
- Separation of features and target
- Standardization of the features

In [17]:
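print("\n===== DATA PREPROCESSING =====")

label_encoder = LabelEncoder()

# Encode each categorical column as integers.
# fit_transform refits the shared encoder on every call, so only the
# last mapping (Gender) stays available for inverse_transform.
data["Breed"] = label_encoder.fit_transform(data["Breed"])
data["Color"] = label_encoder.fit_transform(data["Color"])
data["Gender"] = label_encoder.fit_transform(data["Gender"])

# Separate features and target
X = data.drop("Gender", axis=1)
y = data["Gender"]

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Preprocessing complete.")


===== DATA PREPROCESSING =====
Preprocessing complete.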
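
The cell above fits the scaler on all 1000 rows before the split, so test-set statistics influence the training features, and the integer codes impose an arbitrary order on `Breed` and `Color`. One possible alternative, sketched here under assumptions, is to one-hot encode the categorical columns and wrap every step in a `Pipeline` that is fitted on the training rows only. The `toy` frame is a hypothetical stand-in for `cats_dataset.csv`, and the names `preprocess` and `pipe` are illustrative, not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data using the notebook's column names.
toy = pd.DataFrame({
    "Breed": ["Persian", "Ragdoll", "Chartreux", "Persian", "Ragdoll", "Chartreux"] * 5,
    "Age (Years)": [13, 10, 3, 7, 15, 1] * 5,
    "Weight (kg)": [6, 8, 3, 5, 7, 4] * 5,
    "Color": ["Sable", "Tabby", "Brown", "Tabby", "Brown", "Sable"] * 5,
    "Gender": ["Female", "Male", "Female", "Male", "Male", "Female"] * 5,
})

X_toy = toy.drop("Gender", axis=1)
y_toy = toy["Gender"]

# One-hot encode nominal columns; standardize numeric columns.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Breed", "Color"]),
    ("num", StandardScaler(), ["Age (Years)", "Weight (kg)"]),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

# Split first, then fit: the encoder and scaler only ever see training rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
pipe.fit(X_tr, y_tr)
toy_pred = pipe.predict(X_te)
print(toy_pred)
```

Because the classifier is the last pipeline step, it can be swapped for `KNeighborsClassifier` or `RandomForestClassifier` without touching the preprocessing.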


### 3. DATA SPLIT
The data is split into:
- 80% training data
- 20% test data

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

print("\nTraining data:", X_train.shape)
print("Test data     :", X_test.shape)


Training data: (800, 4)
Test data     : (200, 4)
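
The split above is purely random, so the class ratio in the test set can drift from the ratio in the full dataset. A minimal sketch of a stratified split follows, assuming a synthetic feature matrix and a hypothetical 490/510 label vector in place of the real `X_scaled` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))       # hypothetical features, same shape as X_scaled
y_demo = np.array([0] * 490 + [1] * 510)  # hypothetical labels, not the real Gender column

# stratify=y_demo keeps the class proportions (approximately) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Training data:", X_tr.shape)
print("Test data    :", X_te.shape)
print("Test class counts:", np.bincount(y_te))
```

In the notebook itself, the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.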


### 4. METHOD 1: LOGISTIC REGRESSION

In [19]:
print("\n===== EXPERIMENT 1: LOGISTIC REGRESSION =====")

lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

lr_pred = lr_model.predict(X_test)

lr_accuracy = accuracy_score(y_test, lr_pred)

print("Logistic Regression accuracy:", round(lr_accuracy * 100, 2), "%")
print("\nClassification Report:")
print(classification_report(y_test, lr_pred))


===== EXPERIMENT 1: LOGISTIC REGRESSION =====
Logistic Regression accuracy: 50.0 %

Classification Report:
              precision    recall  f1-score   support

           0       0.49      0.39      0.43        98
           1       0.51      0.61      0.55       102

    accuracy                           0.50       200
   macro avg       0.50      0.50      0.49       200
weighted avg       0.50      0.50      0.49       200
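
The notebook imports `confusion_matrix` but never calls it. The per-class recall values in the report are the diagonal of that matrix divided by its row sums, which the sketch below shows on a small hypothetical label vector (the arrays `y_true_demo` and `y_pred_demo` are made up for illustration and are not the notebook's predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

recall_0 = tn / (tn + fp)  # share of true class 0 predicted as 0
recall_1 = tp / (tp + fn)  # share of true class 1 predicted as 1

print(cm)
print("Recall class 0:", recall_0)
print("Recall class 1:", recall_1)
```

In the notebook, `confusion_matrix(y_test, lr_pred)` would give the same view for the Logistic Regression predictions.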



### 5. METHOD 2: K-NEAREST NEIGHBOR (KNN)



In [20]:
print("\n===== EXPERIMENT 2: K-NEAREST NEIGHBOR =====")

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

knn_pred = knn_model.predict(X_test)

knn_accuracy = accuracy_score(y_test, knn_pred)

print("KNN accuracy:", round(knn_accuracy * 100, 2), "%")
print("\nClassification Report:")
print(classification_report(y_test, knn_pred))


===== EXPERIMENT 2: K-NEAREST NEIGHBOR =====
KNN accuracy: 47.5 %

Classification Report:
              precision    recall  f1-score   support

           0       0.46      0.45      0.46        98
           1       0.49      0.50      0.49       102

    accuracy                           0.47       200
   macro avg       0.47      0.47      0.47       200
weighted avg       0.47      0.47      0.47       200
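
The value `n_neighbors=5` above is fixed by hand. One way to choose it from the data is a cross-validated grid search, sketched here on synthetic data from `make_classification`; the candidate grid and the 5-fold setting are assumptions, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 4 features, like the notebook's feature matrix.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

# Try several odd values of K and score each with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best K:", search.best_params_["n_neighbors"])
print("Mean CV accuracy:", round(search.best_score_, 3))
```

In the notebook, the search would be fitted on `X_train` and `y_train` only, leaving the test set for the final evaluation.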



### 6. METHOD 3: RANDOM FOREST

In [21]:
print("\n===== EXPERIMENT 3: RANDOM FOREST =====")

rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_pred)

print("Random Forest accuracy:", round(rf_accuracy * 100, 2), "%")
print("\nClassification Report:")
print(classification_report(y_test, rf_pred))


===== EXPERIMENT 3: RANDOM FOREST =====
Random Forest accuracy: 50.5 %

Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.53      0.51        98
           1       0.52      0.48      0.50       102

    accuracy                           0.51       200
   macro avg       0.51      0.51      0.50       200
weighted avg       0.51      0.51      0.50       200
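
A fitted `RandomForestClassifier` exposes impurity-based importances through its `feature_importances_` attribute. The sketch below reads them from a forest trained on synthetic data. The column names are borrowed from the dataset only as labels, so the printed ranking says nothing about the real cat data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Labels borrowed from the dataset; the values below are synthetic.
feature_names = ["Breed", "Age (Years)", "Weight (kg)", "Color"]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

In the notebook, the same pattern would use `rf_model.feature_importances_` with `X.columns` as the index.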



### 7. COMPARISON OF RESULTS

In [22]:
print("\n===== ACCURACY COMPARISON =====")

print("Logistic Regression :", round(lr_accuracy * 100, 2), "%")
print("KNN                :", round(knn_accuracy * 100, 2), "%")
print("Random Forest      :", round(rf_accuracy * 100, 2), "%")


===== ACCURACY COMPARISON =====
Logistic Regression : 50.0 %
KNN                : 47.5 %
Random Forest      : 50.5 %
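
With only 200 test samples, sampling noise alone can move accuracy by several points. A rough check, assuming a normal approximation to the binomial distribution, is to compute the 95% band around 50% accuracy for a classifier that guesses at random and see whether the three reported accuracies fall inside it.

```python
import math

n_test = 200  # size of the test set reported above
accuracies = {
    "Logistic Regression": 0.500,
    "KNN": 0.475,
    "Random Forest": 0.505,
}

# Standard error of the accuracy of a coin-flip classifier on n_test samples.
se = math.sqrt(0.5 * 0.5 / n_test)
low, high = 0.5 - 1.96 * se, 0.5 + 1.96 * se

for name, acc in accuracies.items():
    inside = low <= acc <= high
    print(f"{name}: {acc:.3f}, inside chance band [{low:.3f}, {high:.3f}]: {inside}")
```

An accuracy inside this band is statistically indistinguishable from random guessing at this sample size, so differences between such models should not be read as a ranking.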


### 8. DISCUSSION

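**Based on the experimental results:**
1. Logistic Regression serves as the baseline model and reached 50.0% accuracy.
2. KNN is sensitive to feature scaling and to the choice of K; with K=5 it reached 47.5%.
3. Random Forest scored highest at 50.5%, but its margin over Logistic Regression is a single test sample out of 200. With the test classes split 98 to 102, all three models perform at chance level, so these results do not show that any model captured a usable relationship, linear or non-linear, between the features and gender.

**Factors that affect the results:**
1. Amount of data
2. Feature quality: breed, age, weight, and color may carry little information about gender
3. Preprocessing method

**Future work:**
1. Add hyperparameter tuning
2. Try other methods such as SVM
3. Perform feature importance analysis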
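
As a starting point for the SVM item, the sketch below evaluates an RBF-kernel `SVC` with 5-fold cross-validation. It runs on synthetic data, and the kernel, `C=1.0`, and the fold count are assumptions chosen for illustration rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 4 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

# Scaling lives inside the pipeline, so each fold is scaled on its own training part.
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(svm_pipe, X_demo, y_demo, cv=5)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy  :", round(scores.mean(), 3))
```

Reporting the mean over folds, rather than a single 80/20 split, also gives a steadier basis for comparing SVM against the three models above.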