### **Data classification analysis using Naive Bayes and KNN**

**Step 1: Importing the Libraries**

This first step always involves importing three libraries: NumPy, Pandas, and Matplotlib. NumPy (Numerical Python) is a Python library focused on scientific computing. It provides the N-dimensional array object, which resembles a Python list but supports element-wise numerical operations. Pandas is a Python library focused on data analysis tasks such as data manipulation, data preparation, and data cleaning. Matplotlib is a Python library focused on data visualization, such as drawing plots and charts.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
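To make the comparison between a NumPy array and a Python list concrete, here is a minimal sketch. The values are made up purely for illustration and are not part of the dataset.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
values = [1, 2, 3]
arr = np.array(values)

# Multiplying a list repeats it; multiplying an array scales each element.
repeated = values * 2
scaled = arr * 2

print(repeated)
print(scaled.tolist())
```

The same `*` operator means repetition for a list and element-wise arithmetic for an array, which is why numerical work is usually done on arrays.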

**Step 2: Importing the Dataset**

In this step, we import the Breast Cancer dataset, which is stored in my GitHub repository as breast-cancer.csv, and save it to the variable `dataset`. We then display the first 5 rows of the data.

In [9]:
dataset = pd.read_csv('https://raw.githubusercontent.com/datasets/breast-cancer/master/data/breast-cancer.csv')
dataset.head(5)

Unnamed: 0,age,mefalsepause,tumor-size,inv-falsedes,falsede-caps,deg-malig,breast,breast-quad,irradiat,class
0,40-49,premefalse,15-19,0-2,True,3,right,left_up,False,recurrence-events
1,50-59,ge40,15-19,0-2,False,1,right,central,False,false-recurrence-events
2,50-59,ge40,35-39,0-2,False,2,left,left_low,False,recurrence-events
3,40-49,premefalse,35-39,0-2,True,3,right,left_low,True,false-recurrence-events
4,40-49,premefalse,30-34,3-5,True,2,left,right_up,False,recurrence-events


In [27]:
col_names = ['age', 'mefalsepause', 'tumor-size', 'inv-falsedes', 'falsede-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat', 'class']

dataset.columns = col_names

dataset.columns

Index(['age', 'mefalsepause', 'tumor-size', 'inv-falsedes', 'falsede-caps',
       'deg-malig', 'breast', 'breast-quad', 'irradiat', 'class'],
      dtype='object')

In [28]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272 entries, 0 to 271
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   age           272 non-null    object
 1   mefalsepause  272 non-null    object
 2   tumor-size    272 non-null    object
 3   inv-falsedes  272 non-null    object
 4   falsede-caps  264 non-null    object
 5   deg-malig     272 non-null    int64 
 6   breast        272 non-null    object
 7   breast-quad   271 non-null    object
 8   irradiat      272 non-null    bool  
 9   class         272 non-null    object
dtypes: bool(1), int64(1), object(8)
memory usage: 19.5+ KB


In [29]:
# find categorical variables

categorical = [var for var in dataset.columns if dataset[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :\n\n', categorical)

There are 8 categorical variables

The categorical variables are :

 ['age', 'mefalsepause', 'tumor-size', 'inv-falsedes', 'falsede-caps', 'breast', 'breast-quad', 'class']


In [30]:
dataset[categorical].head()

Unnamed: 0,age,mefalsepause,tumor-size,inv-falsedes,falsede-caps,breast,breast-quad,class
0,40-49,premefalse,15-19,0-2,True,right,left_up,recurrence-events
1,50-59,ge40,15-19,0-2,False,right,central,false-recurrence-events
2,50-59,ge40,35-39,0-2,False,left,left_low,recurrence-events
3,40-49,premefalse,35-39,0-2,True,right,left_low,false-recurrence-events
4,40-49,premefalse,30-34,3-5,True,left,right_up,recurrence-events


In [31]:
dataset[categorical].isnull().sum()

age             0
mefalsepause    0
tumor-size      0
inv-falsedes    0
falsede-caps    8
breast          0
breast-quad     1
class           0
dtype: int64
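The counts above show a few missing values in two columns. The notebook leaves them as they are. One common alternative, shown here only as a sketch and not used in this notebook, is to fill each gap with the most frequent category of that column. The toy frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column name mirrors the notebook, the values are made up.
toy = pd.DataFrame({"breast-quad": ["left_low", "left_up", np.nan, "left_low"]})

# Most frequent category in the column (mode() ignores NaN).
most_frequent = toy["breast-quad"].mode()[0]

# Replace missing entries with that category.
filled = toy["breast-quad"].fillna(most_frequent)

print(most_frequent)
print(filled.isnull().sum())
```

In a real pipeline the most frequent category would be computed on the training set only and then reused for the test set.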

In [32]:
for var in categorical: 
    
    print(dataset[var].value_counts())

50-59    91
40-49    87
60-69    53
30-39    34
70-79     6
20-29     1
Name: age, dtype: int64
premefalse    143
ge40          122
lt40            7
Name: mefalsepause, dtype: int64
30-34    60
25-29    51
20-24    47
15-19    26
10-14    26
40-44    21
35-39    18
0-4       8
50-54     8
5-9       4
45-49     3
Name: tumor-size, dtype: int64
0-2      200
3-5       36
6-8       16
9-11      10
15-17      6
12-14      3
24-26      1
Name: inv-falsedes, dtype: int64
False    209
True      55
Name: falsede-caps, dtype: int64
left     143
right    129
Name: breast, dtype: int64
left_low     103
left_up       92
right_up      32
right_low     24
central       20
Name: breast-quad, dtype: int64
false-recurrence-events    191
recurrence-events           81
Name: class, dtype: int64


In [33]:
for var in categorical:
    
    print(var, ' contains ', len(dataset[var].unique()), ' labels')

age  contains  6  labels
mefalsepause  contains  3  labels
tumor-size  contains  11  labels
inv-falsedes  contains  7  labels
falsede-caps  contains  3  labels
breast  contains  2  labels
breast-quad  contains  6  labels
class  contains  2  labels
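Some of the label counts above are one higher than the number of categories listed by `value_counts()`. A likely reason is that `unique()` returns NaN as a value of its own, while `value_counts()` drops it by default. The sketch below uses a hypothetical column to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two categories and one missing value.
col = pd.Series(["yes", "no", np.nan, "no"])

n_with_nan = len(col.unique())   # NaN is returned as a distinct value
n_without_nan = col.nunique()    # NaN is ignored by default

print(n_with_nan, n_without_nan)
```

Using `nunique()` instead gives the number of real categories, without the missing-value marker.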


In [34]:
numerical = [var for var in dataset.columns if dataset[var].dtype!='O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :', numerical)

There are 2 numerical variables

The numerical variables are : ['deg-malig', 'irradiat']


In [35]:
dataset[numerical].head()

Unnamed: 0,deg-malig,irradiat
0,3,False
1,1,False
2,2,False
3,3,True
4,2,False


In [36]:
X = dataset.drop(['class'], axis=1)

y = dataset['class']

In [37]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [38]:
X_train.shape, X_test.shape

((190, 9), (82, 9))
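The split above is purely random, so the class ratio in the test set may drift from the ratio in the full data. As a sketch of one alternative (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The toy arrays below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 of one class, 30 of the other.
y_toy = np.array(["a"] * 70 + ["b"] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# With stratify, the test set keeps the 70/30 class ratio.
n_b_test = int((y_te == "b").sum())
print(n_b_test, len(y_te))
```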

In [39]:
!pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting category_encoders
  Downloading category_encoders-2.6.0-py2.py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.2/81.2 KB 3.4 MB/s eta 0:00:00
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.0


In [40]:
import category_encoders as ce

In [41]:
encoder = ce.OneHotEncoder(cols=['age', 'mefalsepause', 'tumor-size', 'inv-falsedes', 'falsede-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [42]:
X_train.head()

Unnamed: 0,age_1,age_2,age_3,age_4,age_5,mefalsepause_1,mefalsepause_2,mefalsepause_3,tumor-size_1,tumor-size_2,...,deg-malig_3,breast_1,breast_2,breast-quad_1,breast-quad_2,breast-quad_3,breast-quad_4,breast-quad_5,irradiat_1,irradiat_2
189,1,0,0,0,0,1,0,0,1,0,...,0,1,0,1,0,0,0,0,1,0
196,0,1,0,0,0,1,0,0,0,1,...,0,1,0,0,1,0,0,0,1,0
129,1,0,0,0,0,1,0,0,1,0,...,0,0,1,0,0,1,0,0,1,0
252,0,0,1,0,0,0,1,0,1,0,...,0,0,1,0,0,1,0,0,0,1
21,0,0,1,0,0,0,1,0,0,0,...,1,1,0,0,0,1,0,0,1,0


In [43]:
X_train.shape

(190, 41)

In [44]:
X_test.head()

Unnamed: 0,age_1,age_2,age_3,age_4,age_5,mefalsepause_1,mefalsepause_2,mefalsepause_3,tumor-size_1,tumor-size_2,...,deg-malig_3,breast_1,breast_2,breast-quad_1,breast-quad_2,breast-quad_3,breast-quad_4,breast-quad_5,irradiat_1,irradiat_2
217,0,1,0,0,0,0,1,0,1,0,...,1,0,1,1,0,0,0,0,0,1
258,1,0,0,0,0,1,0,0,0,0,...,1,1,0,1,0,0,0,0,0,1
201,0,1,0,0,0,0,1,0,1,0,...,0,0,1,0,0,1,0,0,1,0
111,0,0,0,1,0,0,1,0,0,0,...,0,0,1,1,0,0,0,0,1,0
237,0,0,1,0,0,1,0,0,0,1,...,0,0,1,0,0,1,0,0,1,0


In [45]:
X_test.shape

(82, 41)
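The encoder above replaces each categorical column with one indicator column per category, which is why 9 input columns become 41. The sketch below shows the same idea with `pd.get_dummies` on a hypothetical toy column. It is a different function from `ce.OneHotEncoder`, and it names the new columns after the category values rather than with numeric suffixes.

```python
import pandas as pd

# Hypothetical toy column; the category names mirror the notebook's data.
toy = pd.DataFrame({"breast": ["left", "right", "left"]})

# One indicator column per category, 1 where the row has that category.
encoded = pd.get_dummies(toy, columns=["breast"], dtype=int)

print(encoded.columns.tolist())
print(encoded.values.tolist())
```

Each row has exactly one 1 among the indicator columns that came from the same original column.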

In [46]:
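from sklearn.preprocessing import RobustScaler

# keep the encoded column names; the scaler returns plain NumPy arrays
cols = X_train.columns

scaler = RobustScaler()

# fit on the training set only, then apply the same scaling to the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# rebuild DataFrames with a flat column index (columns=[cols] would create a MultiIndex)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
X_train.head()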

Unnamed: 0,age_1,age_2,age_3,age_4,age_5,mefalsepause_1,mefalsepause_2,mefalsepause_3,tumor-size_1,tumor-size_2,...,deg-malig_3,breast_1,breast_2,breast-quad_1,breast-quad_2,breast-quad_3,breast-quad_4,breast-quad_5,irradiat_1,irradiat_2
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,-1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-1.0,1.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,1.0,-1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
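Some indicator columns in the scaled output above now hold -1.0 and 0.0 instead of 0 and 1. Robust scaling subtracts the column median and divides by the interquartile range (IQR). The sketch below reimplements that rule in plain NumPy on hypothetical columns. Treating a zero IQR as 1 is my understanding of how scikit-learn handles it, so read that part as an assumption.

```python
import numpy as np

def robust_scale(column):
    """Scale one column as (x - median) / IQR, treating a zero IQR as 1."""
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # assumption: mirrors scikit-learn's handling of zero spread
    return (column - median) / iqr

# Hypothetical indicator columns: one that is mostly 1, one that is mostly 0.
common = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1])
rare = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0])

scaled_common = robust_scale(common)
scaled_rare = robust_scale(rare)

print(scaled_common.tolist())
print(scaled_rare.tolist())
```

When a category is common enough that the median is 1, the column is shifted to {-1, 0}. When it is rare, the median and IQR are both 0 and the column stays as {0, 1}. Either way the column still carries the same binary information.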


In [47]:
# train a Gaussian Naive Bayes classifier on the training set
from sklearn.naive_bayes import GaussianNB


# instantiate the model
gnb = GaussianNB()


# fit the model
gnb.fit(X_train, y_train)
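For reference, the decision rule behind Gaussian Naive Bayes can be written as follows. The symbols are introduced here for illustration: $c$ is a class, $x_i$ is the $i$-th of $d$ features, and $\mu_{c,i}$ and $\sigma_{c,i}^{2}$ are the mean and variance of feature $i$ estimated from the training rows of class $c$.

```latex
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{d}
  \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
  \exp\!\left( -\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}} \right)
```

The rule models every feature as normally distributed within each class, which is only a rough approximation for the 0/1 indicator columns produced by one-hot encoding.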

In [48]:
y_pred = gnb.predict(X_test)

y_pred

array(['false-recurrence-events', 'recurrence-events',
       'recurrence-events', 'false-recurrence-events',
       'false-recurrence-events', 'false-recurrence-events',
       'false-recurrence-events', 'false-recurrence-events',
       'recurrence-events', 'false-recurrence-events',
       'recurrence-events', 'false-recurrence-events',
       'false-recurrence-events', 'false-recurrence-events',
       'false-recurrence-events', 'false-recurrence-events',
       'false-recurrence-events', 'recurrence-events',
       'false-recurrence-events', 'false-recurrence-events',
       'recurrence-events', 'false-recurrence-events',
       'false-recurrence-events', 'recurrence-events',
       'false-recurrence-events', 'false-recurrence-events',
       'false-recurrence-events', 'false-recurrence-events',
       'recurrence-events', 'recurrence-events', 'recurrence-events',
       'recurrence-events', 'false-recurrence-events',
       'recurrence-events', 'false-recurrence-events',
       ...

In [49]:
from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

Model accuracy score: 0.6220
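An accuracy figure is easier to judge next to a baseline, because the classes here are imbalanced. The sketch below builds a confusion matrix and compares accuracy with the score of always predicting the most common class. It uses hypothetical labels, not the notebook's test set.

```python
from collections import Counter
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, not the notebook's actual test set.
y_true = ["neg", "neg", "neg", "pos", "pos", "neg"]
y_hat = ["neg", "pos", "neg", "pos", "neg", "neg"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_hat, labels=["neg", "pos"])

# Accuracy of always predicting the most common true class.
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline = majority_count / len(y_true)
accuracy = accuracy_score(y_true, y_hat)

print(cm.tolist())
print(accuracy, baseline)
```

If the model's accuracy is no higher than this baseline, as in the toy example, the model adds little over guessing the majority class.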


In [50]:
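# compare true and predicted labels side by side
# (a separate name avoids overwriting the original `dataset`)
results = pd.DataFrame({'Real Values': y_test, 'Predicted Values': y_pred})
results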

Unnamed: 0,Real Values,Predicted Values
217,false-recurrence-events,false-recurrence-events
258,false-recurrence-events,recurrence-events
201,recurrence-events,recurrence-events
111,false-recurrence-events,false-recurrence-events
237,false-recurrence-events,false-recurrence-events
...,...,...
109,false-recurrence-events,recurrence-events
204,recurrence-events,false-recurrence-events
261,false-recurrence-events,false-recurrence-events
191,false-recurrence-events,recurrence-events
