# **Business Understanding**

### Problem Background

Diabetes is a growing health problem worldwide. Early identification and risk prediction can support better prevention and management of the disease. However, there is not yet an optimal method for predicting it efficiently and accurately.

### Goals

Develop a diabetes prediction model that is accurate and interpretable.

# **Data Understanding**

### Import Library

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

In [2]:
%matplotlib inline
pd.options.display.max_rows = None
pd.options.display.max_columns = None

In [3]:
sns.set(rc={'figure.figsize':(20.7,8.27)})
sns.set_style("whitegrid")
# set_palette applies the palette; color_palette only returns it
sns.set_palette("dark")

### Import Drive

In [4]:
from google.colab import drive

In [None]:
drive.mount('/content/drive/')

In [None]:
%cd '/content/drive/MyDrive/SEM6/'

In [None]:
diabetes = pd.read_csv('diabetes.csv')

### Descriptive Analysis

In [None]:
diabetes.head()

Feature information

- Pregnancies: number of pregnancies
- Glucose: blood glucose level
- BloodPressure: blood pressure measurement
- SkinThickness: skin thickness
- Insulin: blood insulin level
- BMI: body mass index
- DiabetesPedigreeFunction: a score for the likelihood of diabetes based on family history
- Age: age
- Outcome: final result, where 1 means YES and 0 means NO

In [None]:
diabetes.info()

In [None]:
diabetes.shape

In [None]:
print(diabetes.dtypes)

In [None]:
diabetes.describe()

Insights from the descriptive analysis:

- The dataset has no missing (null) values
- The data type of every feature is appropriate
- All features are numeric (int and float)
- There are 9 columns and 768 rows
- Outcome is the target feature (label)
- Because the dataset is labeled, supervised learning is used
- Because the target feature is categorical, a classification model is chosen
- Pregnancies, Glucose, BloodPressure, SkinThickness, BMI, DiabetesPedigreeFunction, and Age appear to be approximately normally distributed, judging by the small gap between each feature's mean and median. A small gap only indicates rough symmetry, so this is a heuristic rather than a normality test.
- Insulin, by contrast, has a wide gap between its mean and median (79.37 and 30.5), which suggests the presence of outliers.
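
The mean-versus-median comparison in the list above can be made explicit in code. The sketch below runs on a tiny hypothetical frame (`toy`), not on the notebook's data. The helper name `skew_report` and the 1.5 × IQR outlier rule are assumptions introduced here for illustration.

```python
import pandas as pd

def skew_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise mean, median, their gap, and 1.5*IQR outlier counts per column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # A value counts as an outlier when it falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    outside = (df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))
    return pd.DataFrame({
        "mean": df.mean(),
        "median": df.median(),
        "mean_minus_median": df.mean() - df.median(),
        "n_outliers": outside.sum(),
    })

# Hypothetical toy data: one symmetric column, one right-skewed column.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "skewed": [0, 0, 10, 20, 30, 40, 500],
})
report = skew_report(toy)
print(report)
```

Calling `skew_report(diabetes.drop(columns='Outcome'))` would apply the same check to the real features, assuming `diabetes` is loaded as in the cells above.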

# **Pre Processing**

In [None]:
# check missing values:
diabetes.isna().sum().to_frame('isna').T

In [None]:
# check invalid values:
for col in diabetes:
    print(f"{col}: {diabetes[col].unique()}\n")

In [None]:
# check duplicated data
print(f"Number of duplicated rows: {diabetes.duplicated().sum()}")

In [None]:
%matplotlib inline
sns.pairplot(diabetes)
plt.show()

# **Exploratory Data Analysis**

## **EDA 1**

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(diabetes['BMI'], diabetes['BloodPressure'])
ax.set_xlabel('BMI (body mass index)')
ax.set_ylabel('Blood pressure')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(diabetes['Glucose'], diabetes['Insulin'])
ax.set_xlabel('Glucose')
ax.set_ylabel('Insulin')
plt.show()

## **Multivariate Analysis 1**

In [None]:
selected_columns=['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age','Outcome']
subset_diabetes = diabetes[selected_columns]

In [None]:
correlation_matrix = subset_diabetes.corr()

In [None]:
print(correlation_matrix)

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

## **Outlier Analysis 1**

In [None]:
# Analyze outliers using a boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=diabetes.drop(columns='Outcome'))
plt.title('Boxplot of the Diabetes Dataset Features')
plt.xticks(rotation=45)
plt.show()

Insight:
- Correlation values describe the linear relationship between features. They range from -1 to 1: values near 1 indicate a strong positive correlation, values near -1 indicate a strong negative correlation, and values near 0 indicate no meaningful linear relationship.
- Glucose has a notable positive correlation with Outcome, about 0.47. Higher glucose levels therefore tend to go along with a higher risk of diabetes.
- Insulin and Glucose show a moderate positive correlation of about 0.33. This is plausible because insulin is produced in response to the glucose concentration in the blood.
- In the boxplot above, Insulin is the feature whose outliers lie farthest from the rest of the data.
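
To show why correlation values stay between -1 and 1, the sketch below computes the Pearson coefficient by hand with NumPy. The function name `pearson_r` and the input values are hypothetical and are not taken from the dataset. `DataFrame.corr()` uses the same Pearson definition by default.

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation: co-variation of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical values chosen to show the three extreme cases.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])      # perfectly increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])    # perfectly decreasing line
r_flat = pearson_r(x, [1, -1, 0, -1, 1])   # symmetric pattern with no linear trend
print(r_up, r_down, r_flat)
```

The third case is a reminder that a coefficient near 0 rules out a linear relationship only; the two variables there still follow a clear pattern.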


## **Univariate Analysis**


In [None]:
diabetes.describe().round(3).T

In [None]:
df2 = diabetes.copy()
df2['Outcome'] = df2['Outcome'].map({1: "Yes", 0: "No"})

In [None]:
df2.head()

In [None]:
df2.info()

In [None]:
categorical = df2.select_dtypes(['object'])
# draw countplot and pie plot of categorical data
for col in categorical:

    fig, axes = plt.subplots(1,2,figsize=(10,4))

    # count of col (countplot)
    sns.countplot(data=df2, x=col, ax=axes[0])
    for container in axes[0].containers:
        axes[0].bar_label(container)
    # share of each value in col (pie chart); each label is the category value itself
    counts = df2[col].value_counts()
    axes[1].pie(counts.values, labels=counts.index, shadow=True, autopct='%1.1f%%')

    plt.suptitle(f'Count of Unique Value in {col}', y=1.09)
    plt.show()

In [None]:
df3 = df2.copy()

In [None]:
# Define a function that categorizes age
def Age_Category(Age):
    if Age < 26:
        return 'Remaja'
    elif Age < 60:
        return 'Dewasa'
    else:
        return 'Manula'

# Use the function to create a new age-category column
df3['kategori_usia'] = df3['Age'].apply(Age_Category)

In [None]:
df3.info()

In [None]:
df3 = df3.drop(columns=['Age'])

In [None]:
df3['kategori_usia'].value_counts()

In [None]:
# Show the distribution of age categories as a bar chart
plt.figure(figsize=(6, 4))
df3['kategori_usia'].value_counts().plot(kind='bar', color='lightskyblue')
plt.suptitle('Age Category Distribution', y=1.09)
plt.xlabel('Age Category', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

plt.figure(figsize=(6, 4))
df3['kategori_usia'].value_counts().plot(kind='pie', shadow=True, autopct='%1.1f%%', colors=['lightcoral', 'lightskyblue', 'lightgreen'])
plt.ylabel('')
plt.tight_layout()
plt.show()

## **Bivariate Analysis**

In [None]:
# Show the distribution of Outcome by age category
plt.figure(figsize=(10, 6))
sns.countplot(data=df3, x='kategori_usia', hue='Outcome')
plt.title('Outcome Distribution by Age Category')
plt.xlabel('Age Category')
plt.ylabel('Count')
plt.legend(title='Outcome')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
outcome_distribution = df3.groupby(['kategori_usia', 'Outcome']).size().unstack()
outcome_distribution

## **Multivariate Analysis**

In [None]:
# Swarmplot of Glucose
plt.figure(figsize=(10, 6))
sns.swarmplot(data=df3, x='kategori_usia', y='Glucose', hue='Outcome')
plt.title('Swarmplot of Glucose')
plt.xlabel('Age Category', fontsize=12)
plt.ylabel('Glucose', fontsize=12)
plt.legend(title='Outcome')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Swarmplot of BloodPressure
plt.figure(figsize=(10, 6))
sns.swarmplot(data=df3, x='kategori_usia', y='BloodPressure', hue='Outcome')
plt.title('Swarmplot of BloodPressure')
plt.xlabel('Age Category', fontsize=12)
plt.ylabel('BloodPressure', fontsize=12)
plt.legend(title='Outcome')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

## INSIGHT 3

1. Of the 768 records, 268 have the value 1 / YES / positive for diabetes, and 500 have the value 0 / NO / negative for diabetes.
2. The Outcome distribution is 65.1% NO and 34.9% YES.
3. Records were categorized by the Age feature.
4. After categorizing by age, the data contains:
     - 34.8% (267 records) in the Remaja (adolescent) category (< 26 years)
     - 61.1% (469 records) in the Dewasa (adult) category (26 to 59 years)
     - 4.2% (32 records) in the Manula (elderly) category (60 years and over)
5. Even so, the highest number of positive diabetes cases is in the Dewasa category, with 214 cases.
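
The age categorization can also be written with `pd.cut`, which states the bin edges in one place. This is a sketch of an equivalent approach, not the notebook's own method. It uses the same thresholds as `Age_Category` (under 26, under 60, otherwise), and the sample ages are hypothetical values picked to sit on each boundary.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to hit each boundary.
ages = pd.Series([21, 25, 26, 59, 60, 81], name="Age")

# right=False makes each bin closed on the left: [-inf, 26), [26, 60), [60, inf).
age_category = pd.cut(
    ages,
    bins=[-np.inf, 26, 60, np.inf],
    labels=["Remaja", "Dewasa", "Manula"],
    right=False,
)
print(age_category.tolist())
# → ['Remaja', 'Remaja', 'Dewasa', 'Dewasa', 'Manula', 'Manula']
```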

# **MODELLING**

## **SVM**

In [None]:
# Separate the feature columns
X = diabetes[diabetes.columns[:8]]

# Separate the output label

y = diabetes['Outcome']

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardization

scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.svm import SVC

# Create an SVC object and call fit to train the model
clf = SVC()
svc_model = clf.fit(X_train, y_train)

In [None]:
# Show the prediction accuracy score
svc_model.score(X_test, y_test)
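
Fitting the scaler on all rows before the split lets test-set statistics influence training. One way to avoid this is to put the scaler and the classifier in a single scikit-learn pipeline, so the scaler is fitted on the training rows only. The sketch below assumes synthetic data from `make_classification` as a stand-in for `diabetes.csv`. The names `X_demo`, `y_demo` and `svm_pipeline` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 768 x 8 feature matrix (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# The scaler is fitted on the training rows only, inside the pipeline.
svm_pipeline = make_pipeline(StandardScaler(), SVC())
svm_pipeline.fit(X_tr, y_tr)

test_accuracy = svm_pipeline.score(X_te, y_te)
# Raw (unscaled) rows can be passed directly; the pipeline scales them first.
one_prediction = svm_pipeline.predict(X_te[:1])
print(test_accuracy, one_prediction)
```

A pipeline also removes the need to remember a separate `scaler.transform` call when predicting on a new raw sample.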

In [None]:
diabetes.head(5)

In [None]:
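# Test prediction
# 0 means no diabetes, 1 means diabetes
# The model was trained on standardized features, so the sample is scaled first
sample = pd.DataFrame([[4, 142, 72, 30, 0, 33.6, 0.627, 20]], columns=diabetes.columns[:8])
print(svc_model.predict(scaler.transform(sample))[0])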

## **LOGISTIC REGRESSION**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Create the Logistic Regression model
logreg_model = LogisticRegression(max_iter=1000, random_state=42)
print(logreg_model)

In [None]:
# Train the model on the training data
logreg_model.fit(X_train, y_train)

# Make predictions with the trained model
y_pred = logreg_model.predict(X_test)

In [None]:
# Show the prediction accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In [None]:
# Take a single example instance to predict
sample_instance = pd.DataFrame([[4, 142, 72, 30, 0, 33.6, 0.627, 20]], columns=diabetes.columns[:8])
# Predict with the trained Logistic Regression model; the features are scaled
# first because the model was trained on standardized data
predicted_outcome = logreg_model.predict(scaler.transform(sample_instance))
# Show the prediction
print(predicted_outcome[0])

## **RANDOM FOREST**

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Create the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

In [None]:
# Make predictions with the trained model
y_pred = rf_model.predict(X_test)

In [None]:
# Show the prediction accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In [None]:
# Take a single example instance to predict
sample_instance = pd.DataFrame([[4, 142, 72, 30, 0, 33.6, 0.627, 20]], columns=diabetes.columns[:8])

# Predict with the trained Random Forest model; the features are scaled
# first because the model was trained on standardized data
predicted_outcome = rf_model.predict(scaler.transform(sample_instance))

# Show the prediction
print(predicted_outcome[0])

## **GRADIENT BOOSTING**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Create the Gradient Boosting model
gbm_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gbm_model.fit(X_train, y_train)

In [None]:
# Make predictions with the trained model
y_pred = gbm_model.predict(X_test)

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


# **Final Model**

In [None]:
# Accuracy of each model
data = {
    'Model': ['SVM', 'Logistic Regression', 'Random Forest', 'Gradient Boosting'],
    'Accuracy': [0.727, 0.753, 0.727, 0.740]
}
df4 = pd.DataFrame(data)
# Use df4 here; no DataFrame named df exists in this notebook
max_accuracy_row = df4.loc[df4['Accuracy'].idxmax()]
highlighted_df4 = df4.style.apply(lambda x: ['background: green' if x.name == max_accuracy_row.name else '' for i in x], axis=1)
highlighted_df4
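
The comparison table uses accuracies typed in by hand, which can drift from what the models actually produce. A possible alternative is to build the table from the fitted models in a loop. The sketch below assumes synthetic data from `make_classification` in place of the real split. The names `models`, `results` and `best_model` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the standardized features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit every model on the same split and record its test accuracy.
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({"Model": name, "Accuracy": accuracy_score(y_te, model.predict(X_te))})

results = pd.DataFrame(rows)
best_model = results.loc[results["Accuracy"].idxmax(), "Model"]
print(results)
print("Best model:", best_model)
```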

# **OVERSAMPLING**

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
import pandas as pd

df5 = diabetes.copy()
df5['Outcome'] = df5['Outcome'].map({1: "Yes", 0: "No"})

In [None]:
X = df5.drop('Outcome', axis=1)
y = df5['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)

X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
df_resampled = pd.concat([pd.DataFrame(X_train_resampled, columns=X_train.columns), pd.DataFrame(y_train_resampled, columns=['Outcome'])], axis=1)
print("Sample counts after oversampling:")
print(df_resampled['Outcome'].value_counts())

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

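# max_iter is raised because the features here are unscaled and the default
# of 100 iterations may not be enough to converge
model = LogisticRegression(max_iter=1000)

model.fit(X_train_resampled, y_train_resampled)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)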
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l2']
}
model = LogisticRegression(max_iter=1000)

# GridSearchCV
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_resampled, y_train_resampled)
print("Best parameters:", grid_search.best_params_)
print("Best accuracy:", grid_search.best_score_)

## Feature standardization

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)

In [None]:
standarized_data = scaler.transform(X)

In [None]:
print(standarized_data)

In [None]:
X = standarized_data
Y = diabetes['Outcome']

In [None]:
print(X)
print(Y)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size= 0.2, stratify=Y, random_state=2)
print(X.shape, X_train.shape, X_test.shape)

In [None]:
from sklearn import svm
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, Y_train)
X_train_prediction = classifier.predict(X_train)
# accuracy_score expects (y_true, y_pred) in that order
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Training data accuracy = ', training_data_accuracy)

In [None]:
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Test data accuracy = ', test_data_accuracy)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = diabetes[diabetes.columns[:8]]
y = diabetes['Outcome']

scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = SVC()
svc_model = clf.fit(X_train, y_train)
svc_model.score(X_test, y_test)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
logreg_model = LogisticRegression(max_iter=1000, random_state=42)
logreg_model.fit(X_train, y_train)
y_pred = logreg_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data into training and test sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

# Initialize the logistic regression model
logreg_model = LogisticRegression(max_iter=1000, random_state=42)

# Train the model on the training data
logreg_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = logreg_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# **UNDERSAMPLING**

In [None]:
from imblearn.under_sampling import RandomUnderSampler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

In [None]:
print("Class counts after undersampling:", y_resampled.value_counts())

In [None]:
# Logistic Regression
logreg_model = LogisticRegression(max_iter=1000, random_state=42)
logreg_model.fit(X_resampled, y_resampled)
y_pred = logreg_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
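
To make clear what random undersampling does to the training set, the sketch below reproduces the idea in plain NumPy: every class is cut down to the size of the smallest class by sampling rows without replacement. This illustrates the concept only and is not imbalanced-learn's implementation. The function name `random_undersample` and the toy arrays are hypothetical.

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Keep a random subset of every class so all classes match the smallest one."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = []
    for label in classes:
        idx = np.flatnonzero(y == label)
        # Sample without replacement so no row is duplicated.
        kept.append(rng.choice(idx, size=n_keep, replace=False))
    kept = np.sort(np.concatenate(kept))
    return X[kept], y[kept]

# Hypothetical toy data: 7 negatives and 3 positives.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
X_bal, y_bal = random_undersample(X_toy, y_toy)
print(np.bincount(y_bal))
# → [3 3]
```

The trade-off is visible in the toy example: four of the seven majority rows are discarded, so the balanced set carries less information than the original.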

# **HANDLING OUTLIERS**

In [None]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split first so that the test set contains only original rows; applying SMOTE
# before the split would leak synthetic neighbours of test rows into training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Use logistic regression as the classifier
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display the class counts in the training set after oversampling
print("Number of training rows per class after oversampling:")
print(pd.Series(y_train).value_counts())

In [None]:
# Logistic Regression
logreg_model = LogisticRegression(max_iter=1000, random_state=42)
logreg_model.fit(X_train, y_train)
y_pred = logreg_model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

In [None]:
# Create the SVM model
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("SVM accuracy:", accuracy_svm)

In [None]:
# Create the Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest accuracy:", accuracy_rf)

In [None]:
# Create the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print("Gradient Boosting accuracy:", accuracy_gb)

In [None]:
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix for Random Forest
# (this cell must run before the heatmap cell, which uses conf_matrix_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix Random Forest:")
print(conf_matrix_rf)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Put the confusion matrix in a DataFrame
conf_matrix_rf_df = pd.DataFrame(conf_matrix_rf, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_rf_df, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()


In [None]:
from sklearn.metrics import accuracy_score

# Compute accuracy on the training data
y_train_pred = rf_model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Accuracy on training data:", train_accuracy)

# Compute accuracy on the test data
y_test_pred = rf_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Accuracy on test data:", test_accuracy)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Run cross-validation with 5 folds
scores = cross_val_score(rf_model, X_train, y_train, cv=5)

# Show the cross-validation results
print("Cross Validation Scores:", scores)
print("Mean Cross Validation Score:", scores.mean())


# Save model

In [None]:
# Save the model in .sav format
import joblib
joblib.dump(rf_model, 'random_forest_model.sav')

In [None]:
import pickle
with open('random_forest_model.sav', 'wb') as f:
    pickle.dump(rf_model, f)

In [None]:
import os

save_dir = 'C:/Users/zaima/Downloads/streamlit-diabetes-main/streamlit-diabetes-main'
model_filename = 'random_forest_model.sav'

# Create the directory if it does not exist yet
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

model_path = os.path.join(save_dir, model_filename)
with open(model_path, 'wb') as f:
    pickle.dump(rf_model, f)

print("Random Forest model saved to:", model_path)

In [None]:
import pickle
filename = 'random_forest_model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(rf_model, f)

In [None]:
import os

folder_path = 'D:/saved_models'  # Replace with your own folder path

if os.path.exists(folder_path):
    print("Directory exists.")
else:
    print("Directory does not exist or cannot be accessed.")


In [None]:
import os
import pickle

# Set the directory and file name for saving the model
save_dir = '/'
model_filename = 'random_forest_model.sav'

# Make sure the directory exists; create it if it does not
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

# Join the directory and file name to get the full path
model_path = os.path.join(save_dir, model_filename)

# Save the model to a .sav file using pickle
with open(model_path, 'wb') as f:
    pickle.dump(rf_model, f)

print("Random Forest model saved to:", model_path)
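
After saving, it is worth confirming that the file loads back into a model that behaves the same. The sketch below does a save-and-load round trip with `joblib` in a temporary directory. It assumes a small synthetic model (`demo_model`) in place of `rf_model`; that name and the data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the trained model (assumption, not the notebook's rf_model).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "random_forest_model.sav")
    joblib.dump(demo_model, model_path)
    restored_model = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), restored_model.predict(X_demo))
print("Round trip OK:", same_predictions)
```

Because `rf_model` in this notebook is trained on standardized features, any application that loads it also needs the fitted `scaler`, which can be saved the same way.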


In [None]:
pd.show_versions(as_json=False)