# Tumor Analysis
1. Problem
    The use of artificial intelligence in healthcare has grown rapidly from year to year, with many models built to solve healthcare problems. Some algorithms, however, outperform others on a given task. Models therefore need to be tested as extensively as possible to find the one that performs best on a given healthcare problem and so better supports decision-making by medical staff. This dataset is modeled to predict whether an "Image" shows a brain tumor. More generally, I want to establish whether features extracted from brain images carry enough information to detect a brain tumor accurately when only the "Image" is available.

2. Objective
    To compare models/algorithms for classifying the data and to determine which implementation is the most effective.

# DATA LOAD

In [7]:
import pandas as pd
url = "https://raw.githubusercontent.com/kyky2912/Tumor-Otak/main/Brain%20Tumor.csv"

# error_bad_lines was removed in pandas 2.0; on_bad_lines="skip" (pandas >= 1.3) replaces it
data = pd.read_csv(url, on_bad_lines="skip")

# Data Features
"First order feature"

1. Mean
2. Variance
3. Standard Deviation
4. Skewness
5. Kurtosis

"Second order feature"

*   Contrast
*   Energy
*   ASM (Angular second moment)
*   Entropy
*   Homogeneity
*   Dissimilarity
*   Correlation
*   Coarseness
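
As an illustration of what the first-order features measure, the sketch below computes them from the pixel intensities of a small array. It is only a sketch under assumptions: the helper name `first_order_features` and the toy image are hypothetical, and the dataset does not state its exact definitions (for example, whether kurtosis is reported as excess kurtosis), so the values in the CSV may follow slightly different conventions.

```python
import numpy as np


def first_order_features(image):
    """Return first-order statistics of the pixel intensities (hypothetical helper)."""
    pixels = np.asarray(image, dtype=float).ravel()
    mean = pixels.mean()
    variance = pixels.var()  # population variance (ddof=0); an assumption
    std = np.sqrt(variance)
    centered = pixels - mean
    # Standardized third and fourth central moments (undefined for a constant image).
    skewness = np.mean(centered ** 3) / std ** 3
    kurtosis = np.mean(centered ** 4) / std ** 4  # non-excess kurtosis; an assumption
    return {
        "Mean": mean,
        "Variance": variance,
        "Standard Deviation": std,
        "Skewness": skewness,
        "Kurtosis": kurtosis,
    }


# Toy 2x2 "image" used only for illustration.
toy_image = np.array([[0, 0], [0, 4]])
toy_features = first_order_features(toy_image)
print(toy_features)
```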

In [8]:
#Show the first 5 rows
data.head()

#tumor (1)
#no-tumor (0)

Unnamed: 0,Image,Class,Mean,Variance,Standard Deviation,Entropy,Skewness,Kurtosis,Contrast,Energy,ASM,Homogeneity,Dissimilarity,Correlation,Coarseness
0,Image1,0,6.535339,619.587845,24.891522,0.109059,4.276477,18.900575,98.613971,0.293314,0.086033,0.530941,4.473346,0.981939,7.458341e-155
1,Image2,0,8.749969,805.957634,28.389393,0.266538,3.718116,14.464618,63.858816,0.475051,0.225674,0.651352,3.220072,0.988834,7.458341e-155
2,Image3,1,7.341095,1143.808219,33.820234,0.001467,5.06175,26.479563,81.867206,0.031917,0.001019,0.268275,5.9818,0.978014,7.458341e-155
3,Image4,1,5.958145,959.711985,30.979219,0.001477,5.677977,33.428845,151.229741,0.032024,0.001026,0.243851,7.700919,0.964189,7.458341e-155
4,Image5,0,7.315231,729.540579,27.010009,0.146761,4.283221,19.079108,174.988756,0.343849,0.118232,0.50114,6.834689,0.972789,7.458341e-155


# Visualisasi Data

In [9]:
#import plotly
import plotly.express as px

In [10]:
class_counts = data["Class"].value_counts()
count_fig = px.bar(x=class_counts.index, y=class_counts, labels={"x":"Class", "y":"Number of Images"},
                   title="Count by Presence of Tumor")
count_fig.show()

#tumor (1)
#no-tumor (0)
The bar chart shows that non-tumor images outnumber tumor images.
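
To put a number on the imbalance, one option is to look at class shares and at the accuracy of a model that always predicts the most frequent class. The sketch below uses a short hypothetical label series in place of `data["Class"]`, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical labels standing in for data["Class"] (0 = no tumor, 1 = tumor).
labels = pd.Series([0, 0, 1, 1, 0, 0, 1, 0], name="Class")

counts = labels.value_counts().sort_index()                # images per class
shares = labels.value_counts(normalize=True).sort_index()  # fraction per class

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = shares.max()

print(counts)
print(shares)
print("Majority-class baseline accuracy:", majority_baseline)
```

Any model's accuracy is easier to judge against this baseline than against 0%.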

In [11]:
scatter1 = px.scatter_matrix(data, dimensions= data.columns[2:4], color="Class")
scatter1.show()

In [12]:
scatter2 = px.scatter_matrix(data, dimensions= data.columns[5:9], color="Class")
scatter2.show()

In [13]:
scatter3 = px.scatter_matrix(data, dimensions= data.columns[9:13], color="Class")
scatter3.show()

The three scatter-plot matrices above suggest a relationship between the presence of a tumor and changes in the plotted features, which is a good sign that a useful model can be developed.

# Extracting Features and Target Values

In [16]:
features = data[list(data.columns[2:])]
target = data["Class"]

In [19]:
#Show the first 5 rows
features.head()

Unnamed: 0,Mean,Variance,Standard Deviation,Entropy,Skewness,Kurtosis,Contrast,Energy,ASM,Homogeneity,Dissimilarity,Correlation,Coarseness
0,6.535339,619.587845,24.891522,0.109059,4.276477,18.900575,98.613971,0.293314,0.086033,0.530941,4.473346,0.981939,7.458341e-155
1,8.749969,805.957634,28.389393,0.266538,3.718116,14.464618,63.858816,0.475051,0.225674,0.651352,3.220072,0.988834,7.458341e-155
2,7.341095,1143.808219,33.820234,0.001467,5.06175,26.479563,81.867206,0.031917,0.001019,0.268275,5.9818,0.978014,7.458341e-155
3,5.958145,959.711985,30.979219,0.001477,5.677977,33.428845,151.229741,0.032024,0.001026,0.243851,7.700919,0.964189,7.458341e-155
4,7.315231,729.540579,27.010009,0.146761,4.283221,19.079108,174.988756,0.343849,0.118232,0.50114,6.834689,0.972789,7.458341e-155


In [20]:
target.head()

0    0
1    0
2    1
3    1
4    0
Name: Class, dtype: int64

# Splitting Data
The data is split at random into two subsets (75% training, 25% testing), which is the default proportion used by train_test_split.
Because the features span very different value ranges, the training and testing data are also scaled with the preprocessing tools in Scikit-learn.

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=1000)

In [22]:
from sklearn import preprocessing
X_train_scaled = preprocessing.scale(X_train)
X_test_scaled = preprocessing.scale(X_test)
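
Note that `preprocessing.scale` standardizes each array with that array's own mean and standard deviation, so the cell above scales the training and testing sets with different statistics. A common alternative, sketched here on synthetic data rather than the tumor dataset, is to fit a `StandardScaler` on the training set and reuse it for the testing set. The `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for X_train / X_test: two features on very different scales.
X_train_demo = rng.normal(loc=[5.0, 800.0], scale=[1.0, 200.0], size=(200, 2))
X_test_demo = rng.normal(loc=[5.0, 800.0], scale=[1.0, 200.0], size=(50, 2))

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only.
X_train_demo_scaled = scaler.fit_transform(X_train_demo)
# Apply the same training statistics to the testing data.
X_test_demo_scaled = scaler.transform(X_test_demo)

print(X_train_demo_scaled.mean(axis=0), X_train_demo_scaled.std(axis=0))
print(X_test_demo_scaled.mean(axis=0), X_test_demo_scaled.std(axis=0))
```

With this approach the testing set is not exactly zero-mean and unit-variance, which is expected: it is transformed the same way new, unseen data would be.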

In [24]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

#define the selection method with f_classif (ANOVA F-test) as the score function
select = SelectKBest(score_func=f_classif, k=3)
select.fit(X_train, y_train)

#transform the training and testing sets so that only the selected features are kept
X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)

In [25]:
print('Selected features (SelectKBest):')
bool_list = select.get_support()
for i in range(len(bool_list)):
    if (bool_list[i]):
        print('\t' + features.columns[i])

Selected features (SelectKBest):
	Entropy
	Energy
	Homogeneity
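
To see why particular features are chosen, the F-scores computed by `SelectKBest` can be inspected through its `scores_` attribute. The sketch below does this on synthetic data with hypothetical column names ("informative" and "noise"); on the notebook's data the same pattern would apply to `select` and `features.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_demo = np.array([0] * 100 + [1] * 100)
# Hypothetical features: "informative" shifts with the class, "noise" does not.
X_demo = pd.DataFrame({
    "informative": rng.normal(size=200) + 3.0 * y_demo,
    "noise": rng.normal(size=200),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)

# One ANOVA F-score per feature; higher means stronger separation between classes.
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = list(X_demo.columns[selector.get_support()])

print(scores)
print("Selected:", selected)
```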


Next, the 3 features identified as most important are used as the axes of a 3-dimensional scatter plot.

In [26]:
scatter3d_tumor = px.scatter_3d(data, x="Entropy", y="Energy", z="Homogeneity", color="Class", title="Three-Dimensional Feature Plot")
scatter3d_tumor.show()

This plot shows a strong relationship between these features and the presence of a tumor in the image, which suggests that these 3 features alone could probably yield a reasonably accurate model. However, models are still also built with all features, so that the two can be compared and to check that using every feature does not lead to overfitting.

# Implementation Model

# Linear Regression
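The first algorithm is linear regression, which during training fits a coefficient to each feature so as to minimize the residual sum of squares; the fitted values should then separate feature values consistent with a brain tumor from those without one. Note that LinearRegression.score returns the coefficient of determination (R²), not classification accuracy, so the figures printed as "accuracy" for this model are R² values expressed as percentages.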

In [27]:
#import model
from sklearn.linear_model import LinearRegression

In [28]:
#Unscaled data
lin_reg = LinearRegression()
lin_reg.fit(X=X_train, y= y_train)

lin_train_accuracy = lin_reg.score(X_train, y_train)
lin_test_accuracy = lin_reg.score(X_test, y_test)

In [29]:
#Scaled data
lin_reg_scaled = LinearRegression()
lin_reg_scaled.fit(X=X_train_scaled, y= y_train)

lin_train_accuracy_scaled = lin_reg_scaled.score(X_train_scaled, y_train)
lin_test_accuracy_scaled = lin_reg_scaled.score(X_test_scaled, y_test)

In [30]:
#Selected features only
lin_reg_selected = LinearRegression()
lin_reg_selected.fit(X=X_train_selected, y= y_train)

lin_train_accuracy_selected = lin_reg_selected.score(X_train_selected, y_train)
lin_test_accuracy_selected = lin_reg_selected.score(X_test_selected, y_test)

In [31]:
print("Linear Regression Accuracy:")
print("\t Training set accuracy with no scaling: " + format(lin_train_accuracy*100, '.2f') + '%')
print("\t Testing set accuracy with no scaling: " + format(lin_test_accuracy*100, '.2f') + '%\n')
print("\t Training set accuracy with scaling: " + format(lin_train_accuracy_scaled*100, '.2f') + '%')
print("\t Testing set accuracy with scaling: " + format(lin_test_accuracy_scaled*100, '.2f') + '%\n')
print("\t Training set accuracy with selection: " + format(lin_train_accuracy_selected*100, '.2f') + '%')
print("\t Testing set accuracy with selection: " + format(lin_test_accuracy_selected*100, '.2f') + '%')

Linear Regression Accuracy:
	 Training set accuracy with no scaling: 87.31%
	 Testing set accuracy with no scaling: 86.00%

	 Training set accuracy with scaling: 87.31%
	 Testing set accuracy with scaling: 86.26%

	 Training set accuracy with selection: 81.02%
	 Testing set accuracy with selection: 80.34%
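
Because `LinearRegression.score` reports R², the percentages above are not directly comparable with the classifier accuracies later in the notebook. One way to obtain an accuracy from a linear regression, sketched here on synthetic data, is to threshold its predictions at 0.5. The threshold and the `*_demo` data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y_demo = np.array([0] * 100 + [1] * 100)
# One synthetic feature whose mean differs between the two classes.
X_demo = (rng.normal(size=200) + 3.0 * y_demo).reshape(-1, 1)

model = LinearRegression().fit(X_demo, y_demo)

# score() is the coefficient of determination (R^2), not accuracy.
r_squared = model.score(X_demo, y_demo)

# Turn the continuous predictions into class labels with an assumed 0.5 cutoff.
predicted_class = (model.predict(X_demo) >= 0.5).astype(int)
accuracy = accuracy_score(y_demo, predicted_class)

print(f"R^2: {r_squared:.3f}  accuracy: {accuracy:.3f}")
```

The two numbers generally differ, which is why an R² in the 80s does not mean that 80-odd percent of images were classified correctly.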


# Logistic Regression
Logistic regression is a very popular algorithm for binary classification. It works much like the linear regression model, but passes the weighted sum of the features through the logistic function, so the output can be read as the probability of the positive class.
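
The prediction step of logistic regression can be sketched in a few lines of NumPy. The coefficients below are made up for illustration; in the notebook they are learned by `LogisticRegression.fit`.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(X, weights, bias):
    """Probability of class 1 for each row of X."""
    return sigmoid(X @ weights + bias)


# Hypothetical coefficients and inputs, for illustration only.
weights = np.array([2.0, -1.0])
bias = 0.5
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [-2.0, 1.0]])

probabilities = predict_proba(X_demo, weights, bias)
# Classify as tumor (1) when the probability is at least 0.5.
predicted_class = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predicted_class)
```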

In [32]:
#import model
from sklearn.linear_model import LogisticRegression

In [33]:
#Unscaled data
log_reg = LogisticRegression(max_iter=1000000)
log_reg.fit(X=X_train, y= y_train)

log_train_accuracy = log_reg.score(X_train, y_train)
log_test_accuracy = log_reg.score(X_test, y_test)

In [34]:
#Scaled data
log_reg_scaled = LogisticRegression(max_iter=1000000)
log_reg_scaled.fit(X=X_train_scaled, y= y_train)

log_train_accuracy_scaled = log_reg_scaled.score(X_train_scaled, y_train)
log_test_accuracy_scaled = log_reg_scaled.score(X_test_scaled, y_test)

In [35]:
#Selected features only
log_reg_selected = LogisticRegression(max_iter=1000000)
log_reg_selected.fit(X=X_train_selected, y= y_train)

log_train_accuracy_selected = log_reg_selected.score(X_train_selected, y_train)
log_test_accuracy_selected = log_reg_selected.score(X_test_selected, y_test)

In [36]:
print("Logistic Regression Accuracy:")
print("\t Training set accuracy with no scaling: " + format(log_train_accuracy*100, '.2f') + '%')
print("\t Testing set accuracy with no scaling: " + format(log_test_accuracy*100, '.2f') + '%\n')
print("\t Training set accuracy with scaling: " + format(log_train_accuracy_scaled*100, '.2f') + '%')
print("\t Testing set accuracy with scaling: " + format(log_test_accuracy_scaled*100, '.2f') + '%\n')
print("\t Training set accuracy with selection: " + format(log_train_accuracy_selected*100, '.2f') + '%')
print("\t Testing set accuracy with selection: " + format(log_test_accuracy_selected*100, '.2f') + '%')

Logistic Regression Accuracy:
	 Training set accuracy with no scaling: 97.31%
	 Testing set accuracy with no scaling: 96.71%

	 Training set accuracy with scaling: 98.44%
	 Testing set accuracy with scaling: 98.41%

	 Training set accuracy with selection: 96.95%
	 Testing set accuracy with selection: 97.02%


# K-Nearest Neighbor Classification
This is a very simple algorithm: it compares an input data point with the k training points found to be "closest" to it, and assigns the class based on the labels of those neighbors.
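
The idea can be sketched from scratch as follows. This is a simplified illustration using Euclidean distance and a majority vote, not the scikit-learn implementation; the function `knn_predict` and the toy points are hypothetical. `KNeighborsClassifier` itself uses k = 5 by default.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: three points near the origin (class 0) and three near (1, 1) (class 1).
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_demo, y_demo, np.array([0.15, 0.10]), k=3)
label_b = knn_predict(X_demo, y_demo, np.array([0.95, 1.00]), k=3)
print(label_a, label_b)
```

Because the vote depends entirely on distances, a feature with a much larger numeric range can dominate the result, which is why scaling matters for this algorithm.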

In [38]:
#Import
from sklearn.neighbors import KNeighborsClassifier

In [39]:
#Unscaled data
knn = KNeighborsClassifier()
knn.fit(X=X_train, y=y_train)

knn_train_accuracy = knn.score(X_train, y_train)
knn_test_accuracy = knn.score(X_test, y_test)

In [40]:
#Scaled data
knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X=X_train_scaled, y=y_train)

knn_train_accuracy_scaled = knn_scaled.score(X_train_scaled, y_train)
knn_test_accuracy_scaled = knn_scaled.score(X_test_scaled, y_test)

In [41]:
#Selected features only
knn_selected = KNeighborsClassifier()
knn_selected.fit(X=X_train_selected, y=y_train)

knn_train_accuracy_selected = knn_selected.score(X_train_selected, y_train)
knn_test_accuracy_selected = knn_selected.score(X_test_selected, y_test)

In [42]:
print("K-Nearest Neighbor Accuracy:")
print("\t Training set accuracy with no scaling: " + format(knn_train_accuracy*100, '.2f') + '%')
print("\t Testing set accuracy with no scaling: " + format(knn_test_accuracy*100, '.2f') + '%\n')
print("\t Training set accuracy with scaling: " + format(knn_train_accuracy_scaled*100, '.2f') + '%')
print("\t Testing set accuracy with scaling: " + format(knn_test_accuracy_scaled*100, '.2f') + '%\n')
print("\t Training set accuracy with selection: " + format(knn_train_accuracy_selected*100, '.2f') + '%')
print("\t Testing set accuracy with selection: " + format(knn_test_accuracy_selected*100, '.2f') + '%')

K-Nearest Neighbor Accuracy:
	 Training set accuracy with no scaling: 88.05%
	 Testing set accuracy with no scaling: 80.87%

	 Training set accuracy with scaling: 98.76%
	 Testing set accuracy with scaling: 97.98%

	 Training set accuracy with selection: 98.19%
	 Testing set accuracy with selection: 98.19%


# Results
## Accuracy

In [43]:
print("Linear Regression Accuracy:")
print("\t Training set accuracy with no scaling: " + format(lin_train_accuracy*100, '.2f') + '%')
print("\t Testing set accuracy with no scaling: " + format(lin_test_accuracy*100, '.2f') + '%\n')
print("\t Training set accuracy with scaling: " + format(lin_train_accuracy_scaled*100, '.2f') + '%')
print("\t Testing set accuracy with scaling: " + format(lin_test_accuracy_scaled*100, '.2f') + '%\n')
print("\t Training set accuracy with selection: " + format(lin_train_accuracy_selected*100, '.2f') + '%')
print("\t Testing set accuracy with selection: " + format(lin_test_accuracy_selected*100, '.2f') + '%\n\n')
print("Logistic Regression Accuracy:")
print("\t Training set accuracy with no scaling: " + format(log_train_accuracy*100, '.2f') + '%')
print("\t Testing set accuracy with no scaling: " + format(log_test_accuracy*100, '.2f') + '%\n')
print("\t Training set accuracy with scaling: " + format(log_train_accuracy_scaled*100, '.2f') + '%')
print("\t Testing set accuracy with scaling: " + format(log_test_accuracy_scaled*100, '.2f') + '%\n')
print("\t Training set accuracy with selection: " + format(log_train_accuracy_selected*100, '.2f') + '%')
print("\t Testing set accuracy with selection: " + format(log_test_accuracy_selected*100, '.2f') + '%\n\n')
print("K-Nearest Neighbor Accuracy:")
print("\t Training set accuracy with no scaling: " + format(knn_train_accuracy*100, '.2f') + '%')
print("\t Testing set accuracy with no scaling: " + format(knn_test_accuracy*100, '.2f') + '%\n')
print("\t Training set accuracy with scaling: " + format(knn_train_accuracy_scaled*100, '.2f') + '%')
print("\t Testing set accuracy with scaling: " + format(knn_test_accuracy_scaled*100, '.2f') + '%\n')
print("\t Training set accuracy with selection: " + format(knn_train_accuracy_selected*100, '.2f') + '%')
print("\t Testing set accuracy with selection: " + format(knn_test_accuracy_selected*100, '.2f') + '%\n\n')

Linear Regression Accuracy:
	 Training set accuracy with no scaling: 87.31%
	 Testing set accuracy with no scaling: 86.00%

	 Training set accuracy with scaling: 87.31%
	 Testing set accuracy with scaling: 86.26%

	 Training set accuracy with selection: 81.02%
	 Testing set accuracy with selection: 80.34%


Logistic Regression Accuracy:
	 Training set accuracy with no scaling: 97.31%
	 Testing set accuracy with no scaling: 96.71%

	 Training set accuracy with scaling: 98.44%
	 Testing set accuracy with scaling: 98.41%

	 Training set accuracy with selection: 96.95%
	 Testing set accuracy with selection: 97.02%


K-Nearest Neighbor Accuracy:
	 Training set accuracy with no scaling: 88.05%
	 Testing set accuracy with no scaling: 80.87%

	 Training set accuracy with scaling: 98.76%
	 Testing set accuracy with scaling: 97.98%

	 Training set accuracy with selection: 98.19%
	 Testing set accuracy with selection: 98.19%




# Discussion

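Linear regression scores poorly. Its figures (80% to 87%) are R² values rather than classification accuracies, since that is what LinearRegression.score returns. Scaling does not help much: the training score is identical (87.31%) with and without scaling, and feature selection lowers the score on both the training and testing sets.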

Logistic regression gives very accurate results. Scaling raises test accuracy slightly (96.71% to 98.41%), while feature selection lowers training accuracy slightly (97.31% to 96.95%) and gives 97.02% on the test set. The high accuracy obtained with only three selected features suggests that this model is likely to generalize well and to predict the presence of a brain tumor in an image effectively.

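KNN, perhaps surprisingly, gives some of the best accuracies once scaling or selection is applied: test accuracy rises from 80.87% on the raw features to 97.98% with scaling and 98.19% with selection, close to the best logistic regression result (98.41%). The work with this algorithm shows how much scaling and feature selection can help when applying it.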
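
Scaling, selection and KNN can also be chained into a single scikit-learn pipeline, so that each step is fitted on the training data only. The sketch below shows the pattern on synthetic data; the data, the choice of `k=1` and the variable names are assumptions for illustration and do not reproduce the results above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y_demo = np.array([0] * 150 + [1] * 150)
X_demo = np.column_stack([
    rng.normal(size=300) + 4.0 * y_demo,  # informative feature, small scale
    rng.normal(scale=1000.0, size=300),   # uninformative feature, large scale
])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Scale, keep the single best feature, then classify with KNN.
pipeline = make_pipeline(
    StandardScaler(), SelectKBest(f_classif, k=1), KNeighborsClassifier()
)
pipeline.fit(X_tr, y_tr)
pipeline_accuracy = pipeline.score(X_te, y_te)

# KNN on the raw features, for comparison.
raw_knn_accuracy = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

print(f"pipeline: {pipeline_accuracy:.3f}  raw KNN: {raw_knn_accuracy:.3f}")
```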

# Conclusion

This project offered a closer look at machine learning algorithms applied to classification, such as linear regression and logistic regression, by exploring the data, applying several algorithms, comparing the results, and looking for ways to improve the algorithms so that they ultimately predict the presence of a tumor in a given brain image effectively.

The results show that features extracted from brain images can be used to classify, with high accuracy, whether an image contains a tumor. This work demonstrates how effective these machine learning algorithms can be and the benefits such technology can bring to the healthcare industry.