# **Task 1**

Given the mushroom dataset, compare the performance of the Decision Tree and RandomForest algorithms. Use hyperparameter tuning to find the best parameters and the best accuracy.

In [79]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # import DT
from sklearn.ensemble import RandomForestClassifier # import RandomForest
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix

In [36]:
# Load data
df = pd.read_csv('data/mushrooms.csv')

df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [37]:
# Count null values per column
df.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

**Feature Selection**

Drop `veil-type` because it contains only a single value, so it carries no information for the classifier.
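
A single-valued column can be found programmatically rather than by eye. The sketch below is a minimal illustration on a small made-up frame (`toy` and its values are assumptions, not rows taken from `mushrooms.csv`); it counts distinct values per column and drops the columns that have only one.

```python
import pandas as pd

# Hypothetical toy frame standing in for the mushroom data
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "veil-type": ["p", "p", "p", "p"],  # constant column
})

# Count distinct values per column and collect the columns with exactly one
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

# Drop every constant column in one call
toy_clean = toy.drop(columns=constant_cols)
print(toy_clean.columns.tolist())
```

Running the same check on `df` should flag `veil-type`, assuming the file matches the description above.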

In [39]:
df_clean = df.drop(columns=['veil-type'])

In [40]:
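# Encode categorical variables as integers.
# DataFrame.apply calls fit_transform once per feature column, so the encoder is refit for each column.
label_encoder = LabelEncoder()
X = df_clean.drop('class', axis=1).apply(label_encoder.fit_transform)
# The last fit is on the target, so label_encoder ends up holding the class labels only.
y = label_encoder.fit_transform(df_clean['class'])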
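
Because one `LabelEncoder` is reused for every column, the per-feature mappings are not kept after encoding. One possible alternative, shown here only as a sketch on a made-up frame (`toy`, `feature_encoder`, and `target_encoder` are illustrative names), is to use `OrdinalEncoder` for the features and a separate `LabelEncoder` for the target so that both mappings stay available.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy frame standing in for the mushroom data
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "x", "b", "x"],
    "odor": ["p", "a", "l", "p"],
})

# OrdinalEncoder handles all feature columns at once and remembers each mapping
feature_encoder = OrdinalEncoder()
X_toy = feature_encoder.fit_transform(toy.drop(columns="class"))

# A dedicated LabelEncoder for the target keeps the class mapping recoverable
target_encoder = LabelEncoder()
y_toy = target_encoder.fit_transform(toy["class"])

print(feature_encoder.categories_)              # per-column category lists
print(target_encoder.classes_)                  # class labels in encoded order
print(target_encoder.inverse_transform(y_toy))  # back to the original labels
```

Both encoders assign integers in sorted category order, so the encoded values match what the per-column `LabelEncoder` approach produces. The implied ordering is arbitrary, which tree-based models generally tolerate.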

In [41]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Modeling Without Tuning**

In [42]:
# 1. Decision Tree without Tuning
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Accuracy without tuning (Decision Tree):{:.2f}%".format(accuracy_score(y_test, y_pred_dt) * 100))

# 2. Random Forest without Tuning
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Accuracy without tuning (Random Forest):{:.2f}%".format(accuracy_score(y_test, y_pred_rf) * 100))

Accuracy without tuning (Decision Tree):100.00%
Accuracy without tuning (Random Forest):100.00%


**Tuning with GridSearchCV**

In [43]:
# 1. Hyperparameter Tuning for Decision Tree
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_dt, cv=5, scoring='accuracy')
grid_search_dt.fit(X_train, y_train)


In [44]:
# 2. Hyperparameter Tuning for Random Forest
param_grid_rf = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='accuracy')
grid_search_rf.fit(X_train, y_train)
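
A fitted `GridSearchCV` keeps the cross-validated score of every parameter combination, which helps show whether the best setting is clearly better or merely tied with others. The sketch below illustrates this on synthetic data from `make_classification` with a deliberately small grid; `X_demo`, `y_demo`, `demo_grid`, and `demo_search` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Small grid so the search finishes quickly
demo_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 4]}
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), demo_grid, cv=5, scoring="accuracy"
)
demo_search.fit(X_demo, y_demo)

# cv_results_ has one row per parameter combination
results = pd.DataFrame(demo_search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))

# Mean cross-validated accuracy of the best combination
print("Best CV accuracy:", demo_search.best_score_)
```

The same attributes (`cv_results_`, `best_score_`) are available on `grid_search_dt` and `grid_search_rf` once they are fitted.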

**Evaluation**

In [45]:
# Evaluate tuned models
y_pred_dt_tuned = grid_search_dt.best_estimator_.predict(X_test)
y_pred_rf_tuned = grid_search_rf.best_estimator_.predict(X_test)

# Results after tuning
print("Accuracy with tuning (Decision Tree):{:.2f}%".format(accuracy_score(y_test, y_pred_dt_tuned) * 100))
print("Accuracy with tuning (Random Forest):{:.2f}%".format(accuracy_score(y_test, y_pred_rf_tuned) * 100))

Accuracy with tuning (Decision Tree):100.00%
Accuracy with tuning (Random Forest):100.00%


In [46]:
# Best hyperparameters
print("Best hyperparameters (Decision Tree):", grid_search_dt.best_params_)
print("Best hyperparameters (Random Forest):", grid_search_rf.best_params_)

Best hyperparameters (Decision Tree): {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2}
Best hyperparameters (Random Forest): {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}


# **Task 2**
Given the mushroom dataset, compare the performance of the Decision Tree and AdaBoost algorithms. Use hyperparameter tuning to find the best parameters and the best accuracy.

In [47]:
from sklearn.tree import DecisionTreeClassifier # import DT
from sklearn.ensemble import AdaBoostClassifier # import AdaBoost

# Hyperparameter Tuning for Decision Tree
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_dt, cv=5, scoring='accuracy')
grid_search_dt.fit(X_train, y_train)


In [55]:
# Evaluate the Decision Tree model after tuning
best_dt = grid_search_dt.best_estimator_
y_pred_dt = best_dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

In [61]:
# Define the hyperparameter grid to search
param_grid_ada = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1],
    'estimator': [DecisionTreeClassifier(max_depth=1), 
                  DecisionTreeClassifier(max_depth=3)]
}

# Use GridSearchCV to find the best parameters for AdaBoost
grid_search_ada = GridSearchCV(AdaBoostClassifier(), param_grid_ada, cv=5, scoring='accuracy')

# Fit the search on the training data
grid_search_ada.fit(X_train, y_train)




In [62]:
# Evaluate the AdaBoost model after tuning
best_ada = grid_search_ada.best_estimator_
y_pred_ada = best_ada.predict(X_test)
acc_ada = accuracy_score(y_test, y_pred_ada)

In [63]:
print("Best hyperparameters (Decision Tree):", grid_search_dt.best_params_)
print("Accuracy with tuning (Decision Tree): {:.2f}%".format(acc_dt * 100))

print("Best hyperparameters (AdaBoost):", grid_search_ada.best_params_)
print("Accuracy with tuning (AdaBoost): {:.2f}%".format(acc_ada * 100))

Best hyperparameters (Decision Tree): {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2}
Accuracy with tuning (Decision Tree): 100.00%
Best hyperparameters (AdaBoost): {'estimator': DecisionTreeClassifier(max_depth=1), 'learning_rate': 1, 'n_estimators': 50}
Accuracy with tuning (AdaBoost): 100.00%


# **Task 3**
Using the diabetes dataset, build a voting ensemble from the following algorithms:

1. Logistic Regression
2. SVM with a polynomial kernel
3. Decision Tree

You may also explore hyperparameter tuning.

In [71]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes (assumes normally distributed features)
from sklearn.svm import SVC # SVM classifier
from sklearn.ensemble import VotingClassifier # voting ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier # Decision Tree

In [65]:
# Load Data

dbt = pd.read_csv('data/diabetes.csv')

dbt.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [91]:
# Split the data
X = dbt.drop('Outcome', axis=1)
y = dbt.Outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [92]:
# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
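
Fitting the scaler on the training split only, as above, keeps test-set statistics out of training. If cross-validation or `GridSearchCV` is added later, wrapping the scaler and model in a pipeline applies the same rule within each fold. The sketch below shows this on synthetic data from `make_classification`; `X_demo`, `scaled_log_clf`, and the other names are illustrative assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline refits the scaler on the training part of every CV fold
scaled_log_clf = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(scaled_log_clf, X_tr, y_tr, cv=5, scoring="accuracy")
print("CV accuracy per fold:", cv_scores)

# Final fit on the full training split, then score on the held-out test split
scaled_log_clf.fit(X_tr, y_tr)
test_acc = scaled_log_clf.score(X_te, y_te)
print("Test accuracy:", test_acc)
```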

In [93]:
# Logistic Regression
log_clf = LogisticRegression()
log_clf.fit(X_train, y_train)
y_pred_log = log_clf.predict(X_test)
accuracy_log = accuracy_score(y_test, y_pred_log)


print('Logistic Regression:')
print('Accuracy:', accuracy_log)
print('Classification Report:\n', classification_report(y_test, y_pred_log))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_log))

Logistic Regression:
Accuracy: 0.7597402597402597
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.83      0.82        99
           1       0.67      0.64      0.65        55

    accuracy                           0.76       154
   macro avg       0.74      0.73      0.74       154
weighted avg       0.76      0.76      0.76       154

Confusion Matrix:
 [[82 17]
 [20 35]]


In [94]:
# SVM with Polynomial Kernel
svm_clf = SVC(kernel='poly', probability=True)
svm_clf.fit(X_train, y_train)
y_pred_svm = svm_clf.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)

print('SVM Kernel Polynomial:')
print('Accuracy:', accuracy_svm)
print('Classification Report:\n', classification_report(y_test, y_pred_svm))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_svm))

SVM Kernel Polynomial:
Accuracy: 0.7402597402597403
Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.91      0.82        99
           1       0.73      0.44      0.55        55

    accuracy                           0.74       154
   macro avg       0.74      0.67      0.68       154
weighted avg       0.74      0.74      0.72       154

Confusion Matrix:
 [[90  9]
 [31 24]]


In [96]:
# Decision Tree
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

y_pred_tree = tree_clf.predict(X_test)
accuracy_tree = accuracy_score(y_test, y_pred_tree)

print('Decision Tree:')
print('Accuracy:', accuracy_tree)
print('Classification Report:\n', classification_report(y_test, y_pred_tree))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_tree))

Decision Tree:
Accuracy: 0.7012987012987013
Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.77      0.77        99
           1       0.58      0.58      0.58        55

    accuracy                           0.70       154
   macro avg       0.67      0.67      0.67       154
weighted avg       0.70      0.70      0.70       154

Confusion Matrix:
 [[76 23]
 [23 32]]
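
The three classifiers above can be combined with `VotingClassifier`, as the task asks. The sketch below is one possible arrangement, run on synthetic data from `make_classification` rather than `diabetes.csv`, so its numbers say nothing about the diabetes results. The variable names, the choice of soft voting, and the `random_state` values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scale using statistics from the training split only
scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Soft voting averages predicted probabilities, so SVC needs probability=True
voting_clf = VotingClassifier(
    estimators=[
        ("log", LogisticRegression()),
        ("svm", SVC(kernel="poly", probability=True, random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ],
    voting="soft",
)
voting_clf.fit(X_tr_s, y_tr)

y_pred_vote = voting_clf.predict(X_te_s)
vote_acc = accuracy_score(y_te, y_pred_vote)
print("Voting ensemble accuracy:", vote_acc)
```

With `voting="hard"` the ensemble takes a majority vote over predicted labels instead, and `probability=True` on the SVM is then no longer required.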
