# My first `scikit-learn` notebook

In [1]:
import pandas as pd
import numpy as np
from random import choices
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

Load a dataset

In [12]:
forecast = pd.read_csv('Forecast.csv')
forecast.head()

Unnamed: 0,Temperature,Humidity,Wind_Speed,Go-Out
0,6,85,30,0
1,14,90,35,0
2,15,86,8,1
3,21,56,15,1
4,17,67,9,1


Set up the `numpy` arrays used to train classifiers

In [13]:
y = forecast.pop('Go-Out').values  # target feature
X = forecast.values                # training features
type(X),type(y)

(numpy.ndarray, numpy.ndarray)

Train a *k*-NN classifier

In [14]:
kNN = KNeighborsClassifier(n_neighbors=3) 
kNN.fit(X,y)


Set up sample test data and use it for prediction

In [15]:
X_test = np.array([[8,70,11],
                   [8,69,15]])
kNN.predict(X_test)

array([1, 0])
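
To see where a *k*-NN prediction comes from, here is a minimal sketch. It assumes only the five rows shown in the `forecast.head()` output (not the full `Forecast.csv`) and the default settings (Euclidean distance, uniform weights). The names `X_demo`, `y_demo` and `knn_demo` are made up for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# The five rows displayed by forecast.head(); the real dataset has more rows
X_demo = np.array([[6, 85, 30],
                   [14, 90, 35],
                   [15, 86, 8],
                   [21, 56, 15],
                   [17, 67, 9]])
y_demo = np.array([0, 0, 1, 1, 1])

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

query = np.array([[8, 70, 11]])

# kneighbors returns the distances to, and row indices of, the 3 closest training rows
distances, indices = knn_demo.kneighbors(query)
neighbour_labels = y_demo[indices[0]]

# With uniform weights the prediction is the majority label among those neighbours
majority_vote = np.bincount(neighbour_labels).argmax()

print("neighbour rows:  ", indices[0])
print("neighbour labels:", neighbour_labels)
print("majority vote:   ", majority_vote)
print("predict():       ", knn_demo.predict(query)[0])
```

Because the distances are computed on the raw values, the feature with the largest numeric range tends to dominate the choice of neighbours, which is one reason to look at scaling.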

All `sklearn` classifiers implement the `Estimator` API.

In [16]:
tree = DecisionTreeClassifier()
tree.fit(X,y)
tree.predict(X_test)

array([1, 1])

In [17]:
lr = LogisticRegression()
lr.fit(X,y)
lr.predict(X_test)

array([0, 0])

Because every classifier (Estimator) exposes the same `fit` and `predict` methods, swapping between them makes model selection easy.  
Note that the three classifiers give different predictions for the two test examples.

In [18]:
cfrs = [kNN,tree,lr]
for cfr in cfrs:
    cfr.fit(X,y)
    print(cfr.predict(X_test))

[1 0]
[1 1]
[0 0]


## Preprocessing
All preprocessing modules implement the `Transformer` API.

Note that the standardisation is *fit* to the training data only and then applied (using *transform*) to both the training data and the test data.

In [19]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)   # standardise to zero mean and unit variance
X_scaled = scaler.transform(X)
X_test_scaled = scaler.transform(X_test)
X_test_scaled

array([[-1.59094327, -0.05406252, -0.79537086],
       [-1.59094327, -0.10040182, -0.37117307]])
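
As a rough sketch of what *fit* and *transform* do here, the cell below reproduces `StandardScaler` by hand. It uses four rows copied from the `forecast.head()` output as stand-in training data, so the numbers will not match the output above. The names `X_train_demo`, `X_test_demo` and `demo_scaler` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in training rows (a subset of the rows shown by forecast.head())
X_train_demo = np.array([[6.0, 85.0, 30.0],
                         [14.0, 90.0, 35.0],
                         [15.0, 86.0, 8.0],
                         [21.0, 56.0, 15.0]])
X_test_demo = np.array([[8.0, 70.0, 11.0]])

# fit() learns one mean and one standard deviation per column, from the training rows only
demo_scaler = StandardScaler().fit(X_train_demo)

# The same statistics computed directly (population standard deviation, ddof=0)
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)

# transform() subtracts the training mean and divides by the training std,
# even when it is applied to test rows
manual_test = (X_test_demo - train_mean) / train_std

print(np.allclose(demo_scaler.transform(X_test_demo), manual_test))
```

The test rows never contribute to the mean or standard deviation, so no information from the test data leaks into the preprocessing.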

In [20]:
mm_scaler = preprocessing.MinMaxScaler()        # rescale each feature to the range [0, 1]
mm_scaler.fit(X)
X_scaled = mm_scaler.transform(X)
X_test_scaled = mm_scaler.transform(X_test)
X_test_scaled

array([[0.125     , 0.6875    , 0.17241379],
       [0.125     , 0.675     , 0.31034483]])
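
A small sketch of the min-max formula, assuming made-up rows rather than the real dataset: each column is mapped with `(x - min) / (max - min)`, where `min` and `max` come from the training data. A consequence worth knowing is that test values outside the training range land outside [0, 1] (with the default `clip=False`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative training rows (two features) and one test row outside the training range
X_train_demo = np.array([[6.0, 85.0],
                         [14.0, 90.0],
                         [21.0, 56.0]])
X_test_demo = np.array([[25.0, 50.0]])

demo_mm = MinMaxScaler().fit(X_train_demo)

# Column-wise minimum and maximum of the training data
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)

# The same rescaling written out by hand
manual = (X_test_demo - col_min) / (col_max - col_min)
scaled_test = demo_mm.transform(X_test_demo)

print(scaled_test)                       # first value is above 1, second is below 0
print(np.allclose(scaled_test, manual))
```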

# Try It Yourself

Using the `penguin_size` dataset, experiment with some of the different models available in *scikit-learn*. Some examples of what you can try are:

* [Decision Trees](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)
* [KNN Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

You can try each of the algorithms with and without scaling, and explore the parameters outlined in the scikit-learn documentation for each to see what impact they have on the results.
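
One possible way to organise such an experiment is sketched below. It is not the penguin exercise itself: it assumes the built-in wine dataset (`load_wine`) as a stand-in and a random split from `train_test_split`. The names `make_models`, `versions` and `results` are invented for the illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: the wine dataset that ships with scikit-learn
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Fit the scaler on the training split only, then apply it to both splits
demo_scaler = StandardScaler().fit(X_tr)
versions = {
    "raw": (X_tr, X_te),
    "scaled": (demo_scaler.transform(X_tr), demo_scaler.transform(X_te)),
}

def make_models():
    """Return fresh, unfitted classifiers so each run starts from scratch."""
    return {
        "kNN": KNeighborsClassifier(n_neighbors=3),
        "tree": DecisionTreeClassifier(random_state=0),
        "naive Bayes": GaussianNB(),
        "logistic regression": LogisticRegression(max_iter=5000),
    }

# Train every model on every version of the data and record its test accuracy
results = {}
for version, (train_X, test_X) in versions.items():
    for name, model in make_models().items():
        model.fit(train_X, y_tr)
        results[(name, version)] = accuracy_score(y_te, model.predict(test_X))

for (name, version), acc in results.items():
    print(f"{name:20s} {version:7s} accuracy = {acc:.3f}")
```

Distance-based models such as *k*-NN usually react most to scaling, while a decision tree is largely unaffected because it splits on one feature at a time.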


In [23]:

from sklearn.metrics import accuracy_score

# Read in the data
penguins_train = pd.read_csv('penguins_train.csv')
penguins_test = pd.read_csv('penguins_test.csv')


penguins_train.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181,3750,male
1,Adelie,Torgersen,39.5,17.4,186,3800,female
2,Adelie,Torgersen,40.3,18.0,195,3250,female
3,Adelie,Torgersen,36.7,19.3,193,3450,female
4,Adelie,Torgersen,39.3,20.6,190,3650,male


In [25]:
# Most scikit-learn models (the decision tree, for example) need numeric input and cannot take string categories directly.
# The lines below use a label encoder to convert categorical data to numerical.
# For example, the penguin species `Adelie`, `Chinstrap`, `Gentoo` get replaced with 0, 1, 2 (codes follow sorted label order).
# We'll talk more about label encoding and other things to watch out for as the module progresses.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(penguins_train["species"])
penguins_train["species"] = le.transform(penguins_train["species"])
penguins_test["species"] = le.transform(penguins_test["species"])
le.fit(penguins_train["island"])
penguins_train["island"] = le.transform(penguins_train["island"])
penguins_test["island"] = le.transform(penguins_test["island"])

le.fit(penguins_train["sex"])
penguins_train["sex"] = le.transform(penguins_train["sex"])
penguins_test["sex"] = le.transform(penguins_test["sex"])

penguins_train.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,2,39.1,18.7,181,3750,1
1,0,2,39.5,17.4,186,3800,0
2,0,2,40.3,18.0,195,3250,0
3,0,2,36.7,19.3,193,3450,0
4,0,2,39.3,20.6,190,3650,1
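
To check which code stands for which category, the encoder's `classes_` attribute and `inverse_transform` method are useful. A minimal sketch, assuming a short hand-written list of species rather than the real column (`species_demo` and `species_le` are made-up names):

```python
from sklearn.preprocessing import LabelEncoder

# A few hand-written labels standing in for the species column
species_demo = ["Gentoo", "Adelie", "Chinstrap", "Adelie"]

species_le = LabelEncoder().fit(species_demo)

# classes_ holds the sorted unique labels; a label's position is its integer code
print(list(species_le.classes_))

codes = species_le.transform(species_demo)
print(codes)

# inverse_transform maps the integer codes back to the original strings
decoded = species_le.inverse_transform(codes)
print(list(decoded))
```

Keeping one encoder object per column (instead of refitting a single `le`) preserves each mapping, so predictions can be turned back into species names later. The scikit-learn documentation describes `LabelEncoder` as intended for the target column and offers `OrdinalEncoder` for feature columns.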


In [26]:
# Create the descriptive features and target feature arrays for both training and test data
y_train = penguins_train.pop('species')
X_train = penguins_train.values

y_test = penguins_test.pop('species')
X_test = penguins_test.values


In [28]:
# Preprocessing goes here. Fit it on the training set only.
# Then apply exactly the same fitted transformation to both the training data and the test data.
scaler = preprocessing.StandardScaler().fit(X_train)   # standardise to zero mean and unit variance
X_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled


array([[ 1.88939197, -0.89604189,  0.7807321 , -1.42675157, -0.56847478,
         0.99103121],
       [ 1.88939197, -0.82278787,  0.11958397, -1.06947358, -0.50628618,
        -1.00904996],
       [ 1.88939197, -0.67627982,  0.42472926, -0.42637319, -1.1903608 ,
        -1.00904996],
       ...,
       [ 0.48812799,  1.02687621,  0.52644436, -0.56928439, -0.53738048,
         0.99103121],
       [ 0.48812799,  1.24663828,  0.93330475,  0.64546078, -0.13315457,
         0.99103121],
       [ 0.48812799,  1.13675725,  0.7807321 , -0.2120064 , -0.53738048,
        -1.00904996]])

In [31]:
kNN = KNeighborsClassifier(n_neighbors=3)
kNN.fit(X_scaled,y_train)
y_pred = kNN.predict(X_test_scaled)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [34]:
# Create the classifiers to compare
kNN = KNeighborsClassifier(n_neighbors=3)
tree = DecisionTreeClassifier()
lr = LogisticRegression()

cfrs = [kNN,tree,lr]
for cfr in cfrs:
    # Make sure you `fit` the classifier on the training data before you try to predict
    cfr.fit(X_scaled,y_train)
    # predict() returns an array with one prediction for each row of the test data
    y_pred = cfr.predict(X_test_scaled)
    # A handy way to measure accuracy: compare the actual targets against the predictions
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy is {accuracy}")

# TODO: why is the accuracy always 1.0? Shouldn't I have scaled the data?



Accuracy is 1.0
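
When a score looks suspicious, it can help to recompute it by hand and to look at the per-class breakdown. A small sketch, assuming made-up label arrays (`y_true_demo` and `y_pred_demo` are not the penguin results):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up targets and predictions for three classes
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 2, 2])

# Accuracy is the fraction of predictions that match the actual targets
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
sk_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(manual_accuracy, sk_accuracy)

# Rows are actual classes, columns are predicted classes;
# anything off the diagonal is a misclassification
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

An accuracy of exactly 1.0 corresponds to a confusion matrix with zeros everywhere off the diagonal, which is easy to check on the real `y_test` and `y_pred`.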
