[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kjmazidi/Machine_Learning_3rd_edition/blob/master/Volume_II_Python/Part-I-Python/Chapter_2/sklearn.ipynb)

# Scikit-learn

The scikit-learn project (https://scikit-learn.org/), imported in Python as sklearn, started in 2007 and made its first public release in 2010. An international team of contributors works on this open-source project, releasing a new minor version roughly every six months. As of this writing, version 1.6 is the current version. The community also provides a helpful getting-started guide: https://scikit-learn.org/stable/getting_started.html

Our goal here is to just get started using sklearn.

### Code Accompanying ***The Machine Learning Handbooks***, Volume II, Chapter 2

#### Book pdf is available on the GitHub repo: <https://github.com/kjmazidi/Machine_Learning_3rd_edition>

###### (c) 2025 KJG Mazidi, all rights reserved

### sample datasets

scikit-learn comes with some datasets that you can simply load as shown below. These include well-known datasets such as iris, diabetes, digits and wine. (The boston housing dataset that appears in older tutorials was removed in version 1.2.)

The built-in data sets are objects with a **.data** member that holds the data as an array of shape (**n_samples**, **n_features**). A supervised learning data set also stores its response variable in the **.target** member.

In [9]:
# imports used in this notebook

import sklearn
from sklearn import datasets
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [2]:
# load the iris data set

iris = datasets.load_iris()
print(iris.data[:5])  # first 5 rows of data
print(iris.target[:5]) # first 5 labels
print('iris shape is ', iris.data.shape)  # get the shape of the data

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]
iris shape is  (150, 4)
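The other built-in data sets load the same way. As a small sketch, the cell below loads the digits data; the `return_X_y=True` argument asks the loader to return the data and target arrays directly rather than the full data set object. The variable names `X_digits` and `y_digits` are chosen here only for illustration.

```python
from sklearn import datasets

# return_X_y=True returns (data, target) instead of the full data set object
X_digits, y_digits = datasets.load_digits(return_X_y=True)

print('digits data shape is ', X_digits.shape)      # (n_samples, n_features)
print('digits target shape is ', y_digits.shape)    # one label per sample
```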


### bunch

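Using the type() function we see that iris is a *Bunch*. A sklearn Bunch is like a dictionary in that it holds key-value pairs, and its values can also be read as attributes (iris['data'] and iris.data refer to the same array). We can print out the keys.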

In [3]:
print(type(iris))
print(iris.keys())

<class 'sklearn.utils._bunch.Bunch'>
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
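Because a Bunch behaves like a dictionary, each key can be read with either bracket or attribute syntax. The short sketch below checks that both forms return the same object and prints two of the other keys listed above.

```python
from sklearn import datasets

iris = datasets.load_iris()

# key access and attribute access return the very same array
print(iris['data'] is iris.data)

# the column names for the four predictors
print(iris.feature_names)

# DESCR holds a long text description; show only the beginning
print(iris.DESCR[:200])
```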



The data and target values are numpy arrays. 

In [4]:
print(type(iris.data))
print(type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


The target is what we are trying to learn - the species of iris, coded as 0, 1 or 2. We can see the labels corresponding to these codes using **target_names**.

In [5]:
print(iris.target)
print(iris.target_names)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']
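Since **target_names** is a numpy array, indexing it with the integer codes translates every code to its species name in one step. The sketch below does that and also counts the examples per class; the names `species`, `codes` and `counts` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# index the array of names with the array of codes
species = iris.target_names[iris.target]
print(species[:3])

# count how many examples belong to each class
codes, counts = np.unique(iris.target, return_counts=True)
for code, count in zip(codes, counts):
    print(code, iris.target_names[code], count)
```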


### data exploration

Now that we know some general things about the iris dataset, let's explore further. It is customary to let X represent the predictors and y the target.

We can also make a pandas dataframe. The head() function shows us the first few rows, info() and describe() tell us more about the data frame.

In [8]:
X = iris.data
y = iris.target

# form the data into a pandas data frame
df = pd.DataFrame(X, columns=iris.feature_names)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
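The info() and describe() functions mentioned above can be tried the same way. In this sketch, info() prints the column types and non-null counts, and describe() returns summary statistics for each column; the name `summary` is introduced here for illustration.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# column names, dtypes and non-null counts
df.info()

# count, mean, std, min, quartiles and max for each column
summary = df.describe()
print(summary)
```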


### machine learning

All machine learning algorithms in scikit-learn are implemented as Python classes. Each class implements the algorithm, and an instance of the class stores the information about the fitted model and makes predictions. The .fit() method is used to fit a model to the data. The .predict() method is used to predict on new data, for example the test data. Remember, the model is usually fit on the training data and evaluated on the test data.
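One consequence of this design is that different algorithms can be used through the same two methods. The sketch below is only an illustration: DecisionTreeClassifier is not otherwise used in this notebook and is brought in here as an assumption, simply to show a second class following the same fit/predict pattern.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier  # second estimator, for illustration only

X, y = datasets.load_iris(return_X_y=True)

for model in (KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)):
    fitted = model.fit(X, y)        # fit() returns the estimator itself
    preds = model.predict(X[:3])    # one prediction per input row
    # attributes learned during fit end with an underscore, e.g. classes_
    print(type(model).__name__, fitted is model, preds, model.classes_)
```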

### knn classifier

k-Nearest Neighbors can be used for classification or regression. We will use it for classifying the iris species. The kNN algorithm works by finding k examples nearest to the example you want to predict, then classifying it according to the majority class of the neighbors. 
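To make the idea concrete, here is a minimal from-scratch sketch of the classification rule just described. It is not how scikit-learn implements kNN; it assumes Euclidean distance, and the function name `knn_predict` and the two sample flowers are hypothetical. Ties in the vote are broken arbitrarily by whichever class was counted first.

```python
from collections import Counter

import numpy as np
from sklearn import datasets


def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training examples."""
    # Euclidean distance from x_new to every training example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


X, y = datasets.load_iris(return_X_y=True)

small_flower = np.array([5.0, 3.5, 1.4, 0.2])   # hypothetical measurements
large_flower = np.array([6.5, 3.0, 5.5, 2.0])   # hypothetical measurements
print(knn_predict(X, y, small_flower))
print(knn_predict(X, y, large_flower))
```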

In [10]:
# uses sklearn KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(iris['data'], iris['target'])
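Once fit, the model can classify a flower it has not seen. In the sketch below the measurements in `new_flower` are made up for illustration. Note that predict() expects a 2D input: one row per example and one column per feature.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(iris['data'], iris['target'])

# hypothetical flower: one row, four features
new_flower = [[5.0, 3.5, 1.4, 0.2]]

code = knn.predict(new_flower)[0]
print(code, iris.target_names[code])

# fraction of the 5 neighbors that belong to each class
print(knn.predict_proba(new_flower))
```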

We fit a model, but we need to know how good it is. Accuracy, a common metric for classification models, is the fraction of correct classifications. The model above was fit on the entire data set, which leaves no unseen data to evaluate it on. Instead, we split the data with train_test_split(), train on the training set, make predictions on the test set and compute accuracy on the test set. By default train_test_split() makes a 75/25 train/test split, but we can override that with the test_size argument, as seen below. We use the stratify argument so that the train and test sets have the same proportions of the 3 classes as the entire data set.

Here we chose k=7. Note that the higher the value of k, the smoother the decision boundary and the less likely the model is to overfit. However, if you make k too large, you might underfit the data.
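One way to explore this trade-off is to fit the classifier for several values of k and compare training and test accuracy. The sketch below uses the same split settings as this notebook; the list of k values and the `scores` dictionary are arbitrary choices made for illustration. With only 150 examples the differences may be small and depend on the particular split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

scores = {}
for k in (1, 3, 5, 7, 9, 15, 25):   # arbitrary values chosen for illustration
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # store (training accuracy, test accuracy) for this k
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))

for k, (train_acc, test_acc) in scores.items():
    print(f'k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}')
```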

In [11]:
# uses sklearn.model_selection  train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
knn.score(X_test, y_test)

0.9555555555555556
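The score() method of a classifier reports accuracy, so the same number can be recovered from the predictions directly. The sketch below repeats the split and fit from the cell above, computes accuracy by hand and with accuracy_score, and prints a confusion matrix to show which classes are confused. The use of sklearn.metrics here is an addition for illustration; the names `manual` and `cm` are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# accuracy is the fraction of predictions that match the true labels
manual = np.mean(y_pred == y_test)
print(manual, accuracy_score(y_test, y_pred), knn.score(X_test, y_test))

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```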