## What is Scikit-Learn?

__Scikit-Learn__ is a Python <i>library</i> for building <i>machine learning</i> models. Scikit-Learn, also known as __sklearn__, supports classification, regression, and <i>clustering</i>, along with other utilities that make <i>machine learning</i> modeling easier.

## Features

Scikit-Learn offers many features that help with <i>machine learning</i> modeling. Some of the most commonly used <i>packages</i> are:

* __Datasets:__ <i>Load</i> the datasets bundled with Scikit-Learn.
* __Supervised Learning Algorithms:__ Access <i>supervised learning</i> algorithms.
* __Clustering:__ Model data with <i>clustering</i> algorithms.
* __Dimensionality Reduction:__ Reduce the number of attributes in the data.
* __Cross Validation:__ Check the accuracy of a <i>supervised</i> model on data it was not trained on.
* __Feature Selection:__ Identify the attributes that are useful for <i>machine learning</i> modeling.
* __Parameter Tuning:__ Choose the best parameters so that the resulting <i>machine learning</i> model achieves good accuracy.

Scikit-Learn has many more features for a wide range of <i>machine learning</i> needs. This <i>notebook</i>, however, only covers the seven features listed above.

### Datasets

The datasets bundled with the Sklearn <i>library</i> include:

* __Boston House Prices__ dataset (<i>regression</i>) --> <code>load_boston()</code>
* __Iris__ dataset (<i>classification</i>) --> <code>load_iris()</code>
* __Diabetes__ dataset (<i>regression</i>) --> <code>load_diabetes()</code>
* __Digits__ dataset (<i>classification</i>) --> <code>load_digits()</code>
* __Wine__ dataset (<i>classification</i>) --> <code>load_wine()</code>
* __Breast Cancer Wisconsin__ dataset (<i>classification</i>) --> <code>load_breast_cancer()</code>
* __Linnerud__ dataset (<i>multivariate regression</i>) --> <code>load_linnerud()</code>

#### Load dataset

To <i>load</i> one of these datasets, first import its loader function, then build a Pandas dataframe from the loaded data. For example, the Breast Cancer Wisconsin dataset can be imported and turned into a dataframe as shown below.

In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

# load the breast cancer dataset
data_bc = load_breast_cancer()

# build a dataframe
df_bc = pd.DataFrame(data_bc['data'], columns=data_bc['feature_names'])
df_bc.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
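
As an alternative to building the dataframe by hand, scikit-learn 0.23 and later accept <code>as_frame=True</code> in the loader functions. The sketch below assumes such a version is installed; the variable names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
bunch = load_breast_cancer(as_frame=True)

df_features = bunch.data      # DataFrame with the 30 feature columns
s_target = bunch.target       # Series with the integer class labels
df_all = bunch.frame          # features and target combined in one DataFrame

print(df_features.shape)
print(df_all.columns[-1])     # name of the appended target column
```

With this option the column names come from the dataset itself, so there is no need to pass <code>feature_names</code> manually.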


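The Boston House Prices dataset is loaded the same way. Note that <code>load_boston()</code> was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell only runs on older versions.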

In [None]:
from sklearn.datasets import load_boston
import pandas as pd               

# load the Boston dataset
boston = load_boston()

# build a dataframe
df_boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])
df_boston.head()


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

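| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |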
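
On scikit-learn installations where <code>load_boston</code> is not available, the same loading pattern can be tried on another bundled regression dataset. The sketch below uses the Diabetes dataset from the list above as a stand-in; that substitution is an assumption of this example, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# load the Diabetes dataset (bundled with scikit-learn, no download needed)
diabetes = load_diabetes()

# build a dataframe, exactly as was done for the Boston dataset
df_diabetes = pd.DataFrame(diabetes['data'], columns=diabetes['feature_names'])

print(df_diabetes.shape)
print(df_diabetes.head())
```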


#### Displaying the dataset's target

The <code>data</code> key used above contains only the dataset's attributes and does NOT include the target. To see the target for each row, use the <code>target</code> key.

In [None]:
# Display the target of the breast cancer dataset

data_bc['target']

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

For classification datasets, the category names of the target are available under the <code>target_names</code> key. For example, the Breast Cancer dataset encodes its target as <code>0</code> and <code>1</code>, and <code>target_names</code> shows which category/class each integer represents.

In [None]:
# Show the target class names

data_bc['target_names']

array(['malignant', 'benign'], dtype='<U9')
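
Since <code>target_names</code> is an array indexed by class number, the integer targets can be turned into readable labels by indexing into it. The following is a small sketch of that idea; <code>labels</code> and <code>class_counts</code> are names chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data_bc = load_breast_cancer()

# index the array of class names with the integer targets:
# 0 -> 'malignant', 1 -> 'benign'
labels = data_bc['target_names'][data_bc['target']]
print(labels[:5])

# count how many rows fall into each class
names, counts = np.unique(labels, return_counts=True)
class_counts = dict(zip(names.tolist(), counts.tolist()))
print(class_counts)
```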

#### Viewing the dataset description

The dataset description can be printed as shown below.

In [None]:
# Print the description of the breast cancer dataset

print(data_bc['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

### Modeling Algorithms

Scikit-Learn provides <i>packages</i> that make it easy to apply <i>machine learning</i> algorithms. Many algorithms are available, covering both <i>supervised learning</i> and clustering, including:

* __Linear Regression__ (<i>regression</i>) --> <code>linear_model.LinearRegression()</code>
* __Logistic Regression__ (<i>classification</i>) --> <code>linear_model.LogisticRegression()</code>
* __K-Nearest Neighbors (KNN)__ (<i>classification</i>) --> <code>neighbors.KNeighborsClassifier()</code>
* __Support Vector Machine (SVM)__ (<i>classification</i>) --> <code>svm.SVC()</code>
* __Support Vector Machine (SVM)__ (<i>regression</i>) --> <code>svm.SVR()</code>
* __Decision Tree__ (<i>classification</i>) --> <code>tree.DecisionTreeClassifier()</code>
* __Decision Tree__ (<i>regression</i>) --> <code>tree.DecisionTreeRegressor()</code>
* __Random Forest__ (<i>classification</i>) --> <code>ensemble.RandomForestClassifier()</code>
* __Naive Bayes__ (<i>classification</i>) --> <code>naive_bayes.GaussianNB()</code>
* __K-Means__ (<i>clustering</i>) --> <code>cluster.KMeans()</code>
* __DBSCAN__ (<i>clustering</i>) --> <code>cluster.DBSCAN()</code>

Beyond the algorithms above, many others are available for classification, regression, and clustering.
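
All of these estimators share the same basic interface, so switching algorithms mostly means switching the class. The sketch below illustrates this on the Iris dataset; the dataset, the chosen hyperparameters, and the variable names are assumptions made for the example. The scores are computed on the training data, so they only show that the calls work and are not a proper evaluation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# three classifiers from the list above, all used through fit() and score()
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

train_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X, y)                              # train on attributes and target
    train_accuracy[name] = clf.score(X, y)     # accuracy on the training data
    print(name, train_accuracy[name])

# clustering needs no target: fit_predict() returns a cluster id per row
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids[:10])
```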

As an example, suppose we want to model the Boston House Prices dataset with Linear Regression. The first step is to import what we need: <i>import</i> <code>LinearRegression</code> from the <code>sklearn.linear_model</code> <i>module</i>. Consider the following code.

In [None]:
from sklearn.datasets import load_boston               # import the Boston dataset
from sklearn.linear_model import LinearRegression      # import the Linear Regression algorithm

# load the Boston dataset
boston = load_boston()

# define the attributes and the target
X = boston['data']
y = boston['target']

# fit Linear Regression to the attributes and the target
model_lr = LinearRegression()
model_lr.fit(X, y)


LinearRegression()

Put simply, we define a linear regression model and store it in the variable <code>model_lr</code>. We then train the model by passing <code>X</code> and <code>y</code> as arguments to the <code>fit()</code> <i>method</i>.

At this point, a simple linear regression model has been built.

For a linear regression model, the __coefficients__ and the __intercept__ can be displayed as shown below.

In [None]:
# Display the coefficients and the intercept

print('Linear Regression model coefficients:\n')
print(model_lr.coef_)
print('\n')
print('Intercept: ', model_lr.intercept_)

Linear Regression model coefficients:

[-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]


Intercept:  36.459488385090125
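
The coefficients and the intercept fully describe a fitted linear model: each prediction is the dot product of a row with <code>coef_</code>, plus <code>intercept_</code>. The sketch below checks this by hand. It uses the bundled Diabetes dataset so that it also runs on scikit-learn versions without <code>load_boston</code>; that choice and the variable names are assumptions of the example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
X, y = diabetes['data'], diabetes['target']

model = LinearRegression().fit(X, y)

# pair every coefficient with the attribute it belongs to
for name, coef in zip(diabetes['feature_names'], model.coef_):
    print(name, coef)

# a linear model predicts y_hat = X @ coef_ + intercept_
manual_pred = X @ model.coef_ + model.intercept_
print(np.allclose(manual_pred, model.predict(X)))
```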


### Model Evaluation

To evaluate a <i>machine learning</i> model, use the tools that Sklearn provides.

__Model Evaluation Techniques__
* __Holdout Validation__ --> <code>model_selection.train_test_split()</code> (the __train/test split__ is the most widely used <i>holdout validation</i> technique)
* __Cross Validation__ --> <code>model_selection.KFold()</code> (__K-Fold Cross Validation__ is the most widely used <i>cross validation</i> technique)
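
K-Fold cross validation can be sketched as follows. The example assumes the bundled Diabetes dataset and five folds, and it uses <code>cross_val_score</code> to run the train/evaluate loop; none of these choices come from the original notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# split the data into 5 folds; each fold serves once as the test set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so MAE is returned negated
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kfold, scoring='neg_mean_absolute_error')
mae_per_fold = -scores

print(mae_per_fold)
print('Mean MAE:', mae_per_fold.mean())
```

Averaging over several folds gives a less split-dependent estimate than a single train/test split, at the cost of training the model once per fold.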


__Model Evaluation Metrics__
* __Accuracy__ (<i>classification</i>) --> <code>metrics.accuracy_score()</code>
* __Confusion Matrix__ (<i>classification</i>) --> <code>metrics.confusion_matrix()</code>
* __F-Measure / F1 Score__ (<i>classification</i>) --> <code>metrics.f1_score()</code>
* __Logarithmic Loss__ (<i>classification</i>) --> <code>metrics.log_loss()</code>
* __Mean Absolute Error (MAE)__ (<i>regression</i>) --> <code>metrics.mean_absolute_error()</code>
* __Mean Squared Error (MSE)__ (<i>regression</i>) --> <code>metrics.mean_squared_error()</code>

Beyond the techniques and metrics listed above, many other model evaluation options are available, and the right choice depends on the characteristics of the dataset.

In [None]:
from sklearn.datasets import load_boston               # import the Boston dataset
from sklearn.linear_model import LinearRegression      # import the Linear Regression algorithm
from sklearn.model_selection import train_test_split   # import train/test split
from sklearn.metrics import mean_absolute_error        # import mean absolute error (MAE)
from sklearn.metrics import mean_squared_error         # import mean squared error (MSE)

# load the Boston dataset
boston = load_boston()

# define the attributes and the target
X = boston['data']
y = boston['target']

# split the data into a train set and a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# fit Linear Regression on the training data
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)           # train the model on X_train and y_train
y_predict = model_lr.predict(X_test)     # predict on X_test

# compute MAE and MSE to assess the model (true values first, then predictions)
print('MAE: ', mean_absolute_error(y_test, y_predict))
print('MSE: ', mean_squared_error(y_test, y_predict))

MAE:  3.148255754816822
MSE:  20.724023437339717
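
The classification metrics listed earlier are used in the same way. The sketch below is one possible setup, assuming the Breast Cancer dataset and a scaled Logistic Regression model; the pipeline and the split parameters are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# scale the attributes, then fit a logistic regression classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# every metric takes the true labels first, then the predictions
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
f1 = f1_score(y_test, y_pred)

print('Accuracy:', acc)
print('Confusion matrix:\n', cm)
print('F1 score:', f1)
```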



### Preprocessing

Data <i>preprocessing</i> steps such as normalization, standardization, and <i>encoding</i> can be applied to a dataset with Sklearn.

__Feature Scaling__
* __Normalization__ --> <code>preprocessing.Normalizer()</code>
* __Standard Scaling__ --> <code>preprocessing.StandardScaler()</code>
* __Min Max Scaling__ --> <code>preprocessing.MinMaxScaler()</code>
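
The three scalers behave differently, which a tiny array makes visible. The sketch below uses made-up numbers: <code>StandardScaler</code> and <code>MinMaxScaler</code> work column by column, while <code>Normalizer</code> rescales each row to unit norm.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# toy data: two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

X_std = StandardScaler().fit_transform(X)     # per column: mean 0, standard deviation 1
X_minmax = MinMaxScaler().fit_transform(X)    # per column: values mapped to [0, 1]
X_norm = Normalizer().fit_transform(X)        # per row: L2 norm equal to 1

print(X_std)
print(X_minmax)
print(X_norm)
print(np.linalg.norm(X_norm, axis=1))         # length of each row after normalization
```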

__Encoding__
* __Label Encoding__ --> <code>preprocessing.LabelEncoder()</code>
* __One-Hot Encoding__ --> <code>preprocessing.OneHotEncoder()</code>
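
The two encoders can be compared on a small categorical column. The colour values below are invented for the example. <code>LabelEncoder</code> assigns one integer per category (in sorted order), while <code>OneHotEncoder</code> creates one 0/1 column per category.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(['red', 'green', 'blue', 'green'])

# label encoding: categories are sorted, so blue=0, green=1, red=2
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(colors)
print(label_encoder.classes_)
print(encoded)

# one-hot encoding expects a 2D input and returns a sparse matrix by default
onehot = OneHotEncoder()
onehot_matrix = onehot.fit_transform(colors.reshape(-1, 1)).toarray()
print(onehot.categories_)
print(onehot_matrix)
```

Label encoding implies an order between categories, so one-hot encoding is usually the safer choice for input attributes that have no natural order.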

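The example below normalizes the Boston House Prices dataset before modeling. Note that <code>Normalizer</code> rescales each row (sample) to unit norm rather than scaling each column. Consider the following code.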

In [None]:
import pandas as pd
from sklearn.datasets import load_boston               # import the Boston dataset
from sklearn.preprocessing import Normalizer           # import the normalizer

# load the Boston dataset
boston = load_boston()

# normalize the data (each row is rescaled to unit norm)
norm = Normalizer().fit_transform(boston['data'])

# build a dataframe to display the normalized values
df_boston = pd.DataFrame(norm, columns=boston['feature_names'])
df_boston.head()


| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.3e-05 | 0.035997 | 0.00462 | 0.0 | 0.001076 | 0.013149 | 0.130388 | 0.008179 | 0.002 | 0.591945 | 0.030597 | 0.793727 | 0.009959 |
| 1 | 5.8e-05 | 0.0 | 0.014977 | 0.0 | 0.000994 | 0.013602 | 0.16714 | 0.010522 | 0.004237 | 0.512648 | 0.037707 | 0.840785 | 0.019362 |
| 2 | 5.9e-05 | 0.0 | 0.015174 | 0.0 | 0.001007 | 0.015421 | 0.13114 | 0.010661 | 0.004293 | 0.519409 | 0.038204 | 0.843138 | 0.00865 |
| 3 | 7.1e-05 | 0.0 | 0.004785 | 0.0 | 0.001005 | 0.01536 | 0.100527 | 0.013306 | 0.006585 | 0.487268 | 0.041045 | 0.866174 | 0.006453 |
| 4 | 0.000151 | 0.0 | 0.004755 | 0.0 | 0.000999 | 0.015587 | 0.118209 | 0.013222 | 0.006543 | 0.484177 | 0.040784 | 0.865631 | 0.011625 |


### Feature Selection

<i>Feature selection</i> is sometimes needed because it can improve the performance of a <i>machine learning</i> model. It can take the form of <i>feature extraction</i>, <i>feature elimination</i>, or <i>dimensionality reduction</i>. Each of these techniques either keeps the features considered important or combines features so that the model works better.

With Sklearn, feature selection can be carried out easily using the various algorithms the library provides.

__Feature Selection__
* __Recursive Feature Elimination (RFE)__ --> <code>feature_selection.RFE()</code>
* __Select K Best__ --> <code>feature_selection.SelectKBest()</code>
* __chi2__ --> <code>feature_selection.chi2()</code>
* __Select From Model__ --> <code>feature_selection.SelectFromModel()</code>
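
As a sketch of how these selectors are used, the example below combines <code>SelectKBest</code> with the <code>chi2</code> score function on the Breast Cancer dataset. Keeping five attributes is an arbitrary choice for illustration, and <code>chi2</code> only applies here because all attributes in this dataset are non-negative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

data_bc = load_breast_cancer()
X, y = data_bc['data'], data_bc['target']

# keep the 5 attributes with the highest chi-squared score against the target
selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask marking the selected columns
selected_names = data_bc['feature_names'][selector.get_support()]

print(X_selected.shape)
print(selected_names)
```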

__Dimensionality Reduction Algorithms__
* __Principal Component Analysis (PCA)__ --> <code>decomposition.PCA()</code>
* __Linear Discriminant Analysis (LDA)__ --> <code>discriminant_analysis.LinearDiscriminantAnalysis()</code>

Beyond those listed above, many other <i>packages</i> can be used for feature selection.

As an example, the code below applies dimensionality reduction with PCA.

In [None]:
import pandas as pd
from sklearn.datasets import load_boston               # import the Boston dataset
from sklearn.preprocessing import Normalizer           # import the normalizer
from sklearn.decomposition import PCA                  # import PCA

# load the Boston dataset
boston = load_boston()

# normalize the data
norm_data = Normalizer().fit_transform(boston['data'])

# apply PCA
pca = PCA(n_components=2).fit_transform(norm_data)

df_pca = pd.DataFrame(data = pca, columns = ['Fitur 1', 'Fitur 2'])
df_pca.head()


| | Fitur 1 | Fitur 2 |
|---|---|---|
| 0 | -0.169264 | -0.006695 |
| 1 | -0.249895 | -0.06637 |
| 2 | -0.24892 | -0.037394 |
| 3 | -0.287084 | -0.015875 |
| 4 | -0.287936 | -0.030092 |
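
To judge how much information two components retain, the fitted PCA object can be kept and its <code>explained_variance_ratio_</code> inspected. The sketch below assumes the Breast Cancer dataset and standard scaling instead of the Boston data and <code>Normalizer</code> used above, so that it runs on current scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# PCA is sensitive to scale, so standardize the columns first
X_scaled = StandardScaler().fit_transform(X)

# keep the estimator in its own variable so its attributes stay accessible
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)
print(pca.explained_variance_ratio_)         # share of variance per component
print(pca.explained_variance_ratio_.sum())   # share retained by both components
```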


### Parameter Tuning

<i>Parameter tuning</i> is essentially the process of choosing or combining parameter values so that a <i>machine learning</i> algorithm delivers its best performance. Scikit-Learn offers at least two parameter tuning techniques: Grid Search and Randomized Search.

* __Grid Search__ --> <code>model_selection.GridSearchCV()</code>
* __Randomized Search__ --> <code>model_selection.RandomizedSearchCV()</code>

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# load the dataset
data_bc = load_breast_cancer()

# define the attributes and the target
X = data_bc['data']
y = data_bc['target']

# split the data into a train set and a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

# define the parameter values to try
param_grid = {'n_neighbors': np.arange(2,20), 
             'weights': ['distance', 'uniform'], 
             'p': [1, 2],
             'algorithm': ['auto', 'brute', 'kd_tree', 'ball_tree']}

# modeling with KNN + Grid Search
model_knn = KNeighborsClassifier()
gscv = GridSearchCV(model_knn, param_grid, scoring='accuracy', cv=5)
gscv.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'algorithm': ['auto', 'brute', 'kd_tree', 'ball_tree'],
                         'n_neighbors': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19]),
                         'p': [1, 2], 'weights': ['distance', 'uniform']},
             scoring='accuracy')

The best parameter combination is available through the <code>.best_params_</code> attribute.

In [None]:
# Display the best parameter combination

gscv.best_params_

{'algorithm': 'auto', 'n_neighbors': 9, 'p': 1, 'weights': 'uniform'}

The best score is available through the <code>.best_score_</code> attribute.

In [None]:
# Display the best score

gscv.best_score_

0.9506703146374831
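
Randomized Search follows the same pattern but samples only a fixed number of combinations instead of trying all of them. The sketch below mirrors the KNN setup above; the values of <code>n_iter</code> and <code>random_state</code> are arbitrary choices for illustration, and the final line scores the refitted best model on the held-out test set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=10)

# candidate values to sample from
param_distributions = {
    'n_neighbors': np.arange(2, 20),
    'weights': ['distance', 'uniform'],
    'p': [1, 2],
}

# try 10 randomly sampled combinations, each with 5-fold cross validation
rscv = RandomizedSearchCV(KNeighborsClassifier(), param_distributions,
                          n_iter=10, scoring='accuracy', cv=5, random_state=0)
rscv.fit(X_train, y_train)

print(rscv.best_params_)
print(rscv.best_score_)

# accuracy of the best model on data not used during the search
test_accuracy = rscv.score(X_test, y_test)
print(test_accuracy)
```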

Scikit-Learn supports many more steps than the ones shown here. What has been covered above is a subset worth knowing when getting started in <i>data science</i> or <i>machine learning</i>.



---


Hope this is useful, and don't forget to drop by: <a href="https://nurpurwanto.github.io/">**nurpurwanto**</a>. Thank you.

---


