# Lab 3: Supervised learning

This lab shows how to run simple classification models using the scikit-learn library. You will learn how to build different classification models on a given training set and then apply them to predict the classes of a test set. The lab also shows how to compute accuracy, one of the most widely used performance measures, on the test set.

### Contents
- 3-1. Supervised learning models using scikit-learn
  - Perceptron
  - K-nearest neighbors
  - Decision tree
  - Support vector machines

- 3-2. Manual implementation
  - Perceptron
  - K-nearest neighbors

## 3-1. Supervised learning models using scikit-learn

You may already know the basic concepts of scikit-learn from the previous labs. We will keep the same format, but now our task is **supervised learning**, which means we work with labeled datasets, where every data point comes with its correct class.

To use scikit-learn, we do not need to write our own Python functions or classes; rather, we just import and call the methods provided by the library directly.
We will use **Connectionist Bench** from the UCI Machine Learning Repository, which can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data). We have already placed the dataset in the **datasets** directory, so you can also simply load it from there. This dataset has two classes, ***Mines*** and ***Rocks***, with 60 attributes describing each data point. More information can be found [here](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)).

#### Load the libraries

These are the basic libraries used throughout this lab session. The random seed is fixed so that your results match the instructor's.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
RANDOM_SEED = 12345

#### Load the data

The first thing you need to do is to load the data and check that it is loaded correctly. We will use a simple pandas function called **read_csv** to load CSV-like datasets, that is, datasets whose values are separated by a single delimiter such as a comma (,) or a tab. Since the dataset has no header row, you need to tell pandas not to use the first row as column names.

* The dataset is located in the **datasets** directory and its name is **sonar.all-data**.

* The header parameter should be provided.

In [None]:
data = pd.read_csv("datasets/sonar.all-data", header = None)
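
To see why the header parameter matters, here is a minimal sketch on a made-up three-row sample (the values in `raw` are hypothetical, not taken from the sonar file). Without `header=None`, pandas uses the first data row as column names and that row is lost.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the sonar file:
# numeric attributes followed by a class letter, with no header row.
raw = "0.02,0.03,0.04,R\n0.05,0.01,0.08,M\n0.03,0.09,0.02,R\n"

with_header = pd.read_csv(io.StringIO(raw))             # first row becomes the column names
no_header = pd.read_csv(io.StringIO(raw), header=None)  # columns are numbered 0, 1, 2, ...

print(with_header.shape)        # one data row is consumed by the header
print(no_header.shape)          # all three rows are kept
print(list(no_header.columns))  # integer column names
```

With `header=None` the columns get integer names, which is why the label column of the real dataset is later referred to as column 60.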

You can always check whether the data has the expected shape by looking at the first five rows using the **head** method.

* DataFrame.head()

In [None]:
data.head()

You can also check for null values. One option is .info(), which you learned in the previous labs.

* DataFrame.info()

In [None]:
data.info()

There is also the isnull() method to check for nulls in the dataframe.

* isnull().sum() returns the column-wise count of `True` values.
* isnull().sum().sum() returns the total number of nulls in the dataset.

In [None]:
data.isnull().sum().sum()
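
The difference between the two calls is easier to see on a small frame that actually contains nulls. The sketch below uses a made-up dataframe `toy` (an assumption for illustration; the sonar data itself is not used here).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one null in column "a" and two in column "b".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

per_column = toy.isnull().sum()   # Series: number of nulls in each column
total = toy.isnull().sum().sum()  # single number: nulls in the whole frame

print(per_column)
print(total)
```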

Our dataframe holds both the attributes (columns 0-59) and the labels (column 60). However, scikit-learn requires the labels and the data attributes to be separated. Let's separate the labels from the dataset.

In [None]:
X = data.drop(60, axis=1)

In [None]:
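# The labels are in the last column (column 60). Either line alone is enough.
y = data.iloc[:, -1]  # select the last column by position
y = data[60]          # select the same column by name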

Next, we will split the dataset into two sets: training and test sets. Since we will not apply any validation strategy, such as k-fold cross-validation, splitting the whole dataset into two sets will be enough.

To do this, we could manually pick parts of the data to create two different subsets. However, scikit-learn also provides a function for this job. We will use the **train_test_split** function from the **model_selection** module of scikit-learn.

In [None]:
from sklearn.model_selection import train_test_split

This function divides the entire dataset into a training set and a test set. To do this, we pass our data attributes (X), our labels (y), and the fraction of the data we want to hold out for the test set (test_size). The function also has optional parameters such as 1) whether we want to allow shuffling (shuffle), 2) the random state (random_state), and 3) whether we want to keep the label proportions when we divide the data (stratify).

Here we will divide the training and test sets with a 70:30 ratio.

* Specify X, y, test_size, random_state, stratify

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_SEED, stratify=y)
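
The effect of stratify can be checked on a small imbalanced example. The sketch below uses made-up labels (70 "M" and 30 "R") rather than the sonar data; the names `X_toy` and `y_toy` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 "M" and 30 "R".
y_toy = pd.Series(["M"] * 70 + ["R"] * 30)
X_toy = pd.DataFrame({"feature": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=12345, stratify=y_toy
)

# Both subsets keep the 70:30 class ratio of the full data.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Without stratify, the class ratio in the two subsets depends on the random shuffle and can drift away from the ratio in the full dataset.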

#### Perceptron

The first algorithm we are going to build is the **Perceptron**. A perceptron is a binary classifier with a weight vector ($w$) and a bias ($b$); it classifies a data point $x$ by the sign of $w \cdot x + b$. You can also regard it as a single-neuron classifier.
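
As a minimal sketch of this decision rule, the snippet below evaluates $w \cdot x + b$ for two points. The weights, the bias, and the helper `decide` are hypothetical, and the convention that a value of exactly zero maps to class 1 is an assumption that matches the manual implementation later in this lab.

```python
import numpy as np

# Hypothetical weights and bias for a two-attribute problem.
w = np.array([0.5, -1.0])
b = 0.25

def decide(x):
    """Return 1 if w.x + b is non-negative, otherwise 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

print(decide(np.array([1.0, 0.5])))  # 0.5 - 0.5 + 0.25 = 0.25, so class 1
print(decide(np.array([0.0, 1.0])))  # -1.0 + 0.25 = -0.75, so class 0
```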

* Scikit-learn provides the perceptron as a built-in class.

**Perceptron** is in the linear_model module of scikit-learn.

In [None]:
from sklearn.linear_model import Perceptron

To perform the analysis, we first need to create an instance of the **Perceptron** class. It accepts several parameters, which can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html). Since we are trying the basic perceptron we learned in the lecture, we do not need to set all the parameters scikit-learn supports. We will not use any regularization or early stopping here, but there are still some parameters we need to consider.

- max_iter: Whether the perceptron converges depends on the dataset, so we should at least set a reasonable maximum number of iterations.
- fit_intercept: The perceptron can be trained with or without an intercept (bias) value. You can state it here (True/False).
- tol: Since the perceptron may never converge, we can set a stopping criterion. The iterations will stop when loss > previous_loss - tol.
- shuffle: We can shuffle the training data after each iteration.

First, create an instance with the following options:

* maximum iteration = 100.
* without shuffling.
* without a tol value.

In [None]:
ppn = Perceptron(max_iter=100, tol=None, shuffle = False)

Since we already have prepared our training data (X_train, y_train), we can call **fit** function with those variables.

In [None]:
ppn.fit(X_train, y_train)

Our model has now learned its weights and bias, and this information is stored in our instance **ppn**. We can get the *test accuracy* on our test set by calling the **score** method, and check the predicted labels by calling the **predict** method.

In [None]:
ppn.score(X_test, y_test)

In [None]:
ppn.predict(X_test)
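
For classifiers, **score** reports accuracy, the fraction of test rows whose predicted label equals the true label. The sketch below checks this by hand on synthetic data generated with `make_classification`, which is only a stand-in for the sonar data (an assumption for this example).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sonar data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=12345)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=12345, stratify=y_demo
)

clf = Perceptron(max_iter=100, tol=None, shuffle=False)
clf.fit(X_tr, y_tr)

# Accuracy by hand: the share of predictions that match the true labels.
manual_accuracy = np.mean(clf.predict(X_te) == y_te)
print(manual_accuracy, clf.score(X_te, y_te))
```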

If we set the **tol** parameter, the algorithm might finish before reaching the maximum number of iterations. We can check this because the number of iterations actually run is stored in the *n_iter_* attribute of our instance.

In [None]:
ppn.n_iter_

In [None]:
ppn = Perceptron(max_iter=100, tol=0.1, shuffle = False)
ppn.fit(X_train, y_train)
ppn.n_iter_
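
To compare several stopping tolerances side by side, a loop like the following can be used. It is a sketch on synthetic data from `make_classification` (a stand-in, not the sonar data), and the dictionary `iterations` is a name introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

# Synthetic stand-in for the sonar data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=12345)

iterations = {}
for tol in (None, 1e-3, 0.1):
    clf = Perceptron(max_iter=100, tol=tol, shuffle=False).fit(X_demo, y_demo)
    iterations[tol] = clf.n_iter_  # iterations actually run before stopping
    print(f"tol={tol}: stopped after {clf.n_iter_} of {clf.max_iter} iterations")
```

A larger tol makes the stopping criterion easier to satisfy, so training tends to stop earlier, while tol=None disables the criterion altogether.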

#### K-nearest neighbors

The next algorithm we will try is k-nearest neighbors (kNN). In scikit-learn, all methods and processes we need are entirely the same. The only change is when we create an instance because different models will have different parameters. We can find kNN in the **neighbors** package of scikit-learn.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

kNN is a simple algorithm with a small number of parameters. Detailed information can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). For now, we will focus only on the number of neighbors, as it is a critical factor in the algorithm's performance, and we can run some experiments by changing it. Besides that, you can change the distance function from Euclidean to something else (p), and you can give closer neighbors more weight in the vote (weights). Supported distance measures and other parameter information can be found on the official page.

* Create a new instance of KNeighborsClassifier with n_neighbors=3

In [None]:
neigh = KNeighborsClassifier(n_neighbors=3)

After making the instance, we can train and test our algorithm in the same way as Perceptron. We can use **fit** for training, **score** to get test accuracy, and **predict** to get the predicted labels of the test dataset.
- Fit the classifier to `X_train` and `y_train`.

In [None]:
neigh.fit(X_train, y_train)

- Return a classification score of the trained model on `X_test` and `y_test`.

In [None]:
neigh.score(X_test, y_test)

- Print predicted labels of `X_test`.

In [None]:
neigh.predict(X_test)
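
One way to run the experiment on the number of neighbors is to loop over several values and record the test accuracy of each. The sketch below does this on synthetic data from `make_classification` (a stand-in for the sonar data); the candidate values of k and the dictionary `knn_scores` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the sonar data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=12345)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=12345, stratify=y_demo
)

knn_scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    knn_scores[k] = model.score(X_te, y_te)  # test accuracy for this k
    print(f"n_neighbors={k}: test accuracy {knn_scores[k]:.3f}")
```

Odd values of k are used here so that a two-class vote cannot end in a tie.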

#### Decision tree

Let's deal with the **decision tree**. When running a decision tree using scikit-learn, the process after creating an instance is again the same as other classifiers. Therefore, the most important thing is understanding the parameters for each model to correctly create a new instance. You can find a normal decision tree in the **tree** package.

In [None]:
from sklearn.tree import DecisionTreeClassifier

To create a decision tree instance, we do not need to pass any parameters, as DecisionTreeClassifier has a default value for every parameter. The parameters are used to constrain the tree, for example by limiting the maximum depth or the minimum number of samples required to split a node. No single set of parameters is optimal for all cases, so we may need to tune them with techniques such as *grid search*, which we will look into in the next lab. Here we are going to use a plain decision tree without specifying any parameter. Because the algorithm involves some randomness, we still need to set the random state.

- Create a new instance of DecisionTreeClassifier with random_state=RANDOM_SEED

In [None]:
dtc = DecisionTreeClassifier(random_state = RANDOM_SEED)

After making the instance, we can train and test our algorithm in the same way as Perceptron. We can use **fit** for training, **score** to get test accuracy, and **predict** to get the predicted labels of the test dataset.

- Fit the classifier to `X_train` and `y_train`.

In [None]:
dtc.fit(X_train, y_train)

- Return a classification score of the trained model on `X_test` and `y_test`.

In [None]:
dtc.score(X_test, y_test)

- Print predicted labels of `X_test`.

In [None]:
dtc.predict(X_test)
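
The constraining parameters mentioned above can be tried out as in the following sketch, which compares an unconstrained tree with one limited by max_depth and min_samples_split. The data comes from `make_classification` (a stand-in for the sonar data), and the chosen limits are arbitrary values for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sonar data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=12345)

# Default tree: grows until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=12345).fit(X_demo, y_demo)

# Constrained tree: at most 3 levels, and a node needs 10 samples to be split.
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=12345
).fit(X_demo, y_demo)

print("full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("small tree: depth", small_tree.get_depth(), "leaves", small_tree.get_n_leaves())
```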

#### Support vector machines

Scikit-learn offers several support vector machine classifiers: SVC, NuSVC, and LinearSVC. SVC is the basic form of support vector machine and supports various kernels, while LinearSVC forms a linear boundary without a kernel (you can find more [here](https://scikit-learn.org/stable/modules/svm.html)). NuSVC is similar to SVC, but its distinguishing feature is a parameter (nu) that lets us control the number of support vectors. All three classifiers are available in the **svm** module, and in this lab, we will use SVC.

In [None]:
from sklearn.svm import SVC

**SVC** has many parameters, like the decision tree we saw earlier, and most of them are used to fine-tune the model. One important parameter is **C**, the regularization parameter. It controls how heavily the SVM is penalized for training points that fall inside the margin or on the wrong side of the boundary. The larger C is, the narrower the SVM's margin, which means that the model fits the training set more closely; the smaller C is, the wider the margin and the stronger the regularization. Therefore, it is important to find a C that gives the right level of regularization.

We can plot the iris dataset for different values of C. This lab includes a function from the scikit-learn user guide website, which shows the decision boundary of an SVC model. We can apply various C values and see the difference.

- Please import **plot_iris** function by running this block below.

In [None]:
from dami_dsv.supervised_learning.plot_iris import plot_iris

You can freely change the value of C and see differences.

In [None]:
plot_iris(C=100.0)

In [None]:
plot_iris(C=5.0)

In [None]:
plot_iris(C=0.1)

Now we can apply SVM to our dataset. There are many available kernels that scikit-learn supports, but we will use the RBF kernel, set as a default in scikit-learn. We will have a chance to deal with other kernels in the upcoming assignment.

In [None]:
svc = SVC(gamma="scale")

- Fit the classifier to `X_train` and `y_train`.

In [None]:
svc.fit(X_train, y_train)

- Return a classification score of the trained model on `X_test` and `y_test`.

In [None]:
svc.score(X_test, y_test)

- Print predicted labels of `X_test`.

In [None]:
svc.predict(X_test)

You can change C and see the difference in the test score.

In [None]:
svc2 = SVC(C = 5, gamma="scale")
svc2.fit(X_train, y_train)
svc2.score(X_test, y_test)

When we allow the margin to be too wide, the SVM can underfit and lose classification power. More regularization is not always better.

In [None]:
svc3 = SVC(C = 0.1, gamma="scale")
svc3.fit(X_train, y_train)
svc3.score(X_test, y_test)
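
To compare several values of C at once, it can help to look at training accuracy, test accuracy, and the number of support vectors together. The sketch below does this on synthetic data from `make_classification` (a stand-in for the sonar data); the dictionary `svc_results` is a name introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the sonar data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=12345)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=12345, stratify=y_demo
)

svc_results = {}
for C in (0.1, 1.0, 5.0, 100.0):
    model = SVC(C=C, gamma="scale").fit(X_tr, y_tr)
    svc_results[C] = (
        model.score(X_tr, y_tr),      # training accuracy
        model.score(X_te, y_te),      # test accuracy
        int(model.n_support_.sum()),  # support vectors over both classes
    )
    print(f"C={C}: train/test accuracy and support vectors = {svc_results[C]}")
```

A training accuracy far above the test accuracy suggests that C is too large for the data, while low accuracy on both suggests that C is too small.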

## 3-2. Manual implementation

Now it is time to implement some algorithms we tried in this lab manually. It will give you a more robust understanding of the algorithm. We are going to implement simple ones: **perceptron** and **kNN**.

In those implementations, we use the **class** notation and **self** variables inside. This structure is made to give you the same experience with scikit-learn when testing. You only use **self** here to call the methods defined in the class structure or to access the class variable defined by self inside the class. The class-based structure will not appear in the assignment.

#### Perceptron

Before implementing perceptron, we need to change the letter classes into numbers as perceptron assumes that it receives binary numeric classes.

In [None]:
# Map the class letters to numbers: M (mine) -> 0, R (rock) -> 1.
y_train_numeric = y_train.map({'M': 0, 'R': 1})
y_test_numeric = y_test.map({'M': 0, 'R': 1})

Here we already have the basic structure of our new perceptron classifier! It has the same structure as scikit-learn's, so after finishing the development we can test our model in the same way as the Perceptron in scikit-learn.

In [None]:
class Perceptron():
    def __init__(self, max_iter):
        """
        A constructor that receives parameters and save them into member variables.
        You will receive max_iter value and need to save into self.max_iter.

        Input:
          max_iter: The maximum iteration of the algorithm.
        Output:
          None.
        """
        return
    
    def fit(self, X, y):
        """
        A method to train the model by receiving the training dataset and labels.
        
        Input:
          X: Training dataset.
          y: Training labels.

        Output:
          None.
          
        """    
        #- Step 1: The algorithm needs to set an empty list of size |attributes|+1 to save our weights (vector w) and bias (b).
        #          The additional value is used for intercept (or bias) value of the perceptron classifier.
        self.w = None
        
        #- Step 2: The algorithm iterates self.max_iter times and train the model.
        for _ in range(None):
        
        #- Step 3: For each iteration, we traverse all rows in our dataset and predict the label of each row 
        #          by calling self.predict method.
        #          We can calculate the 'error' to check whether our prediction was correct or not,
        #          by subtracting the predicted label from the true label.
        #          (Fill in the iterable so that it yields an index and a row, as X.iterrows() does.)
            for idx, row in None:
                prediction = None
                error = None
        
        #- Step 4: When prediction was wrong, we update the weights by adding [error*row] to the previous weights. 
        #          For intercept value, we update it by simply adding the error to the previous value.
        #          Assign the values to self.w so we can use the updated version in the next iteration.
        
                self.w[0] = self.w[0] + error
                self.w[1:] = self.w[1:] + error * row
        
        return
                
    def predict(self, row):
        """
        A method to predict a label with trained weights.

        Input:
          row: A single row from dataset.
        Output:
          Binary integer (0 or 1).
        """
        
        #- Step 1: We calculate the dot product of our weights (self.w) and the given row.
        #          For the bias value, we multiply this value by one since we do not have any value in the received row.
        
        act = self.w[0]
        act += self.w[1:].dot(row)
        
        #- Step 2: If the dot product is bigger than or equal to zero, return 1. Otherwise, return 0.
        
        if act >= 0:
            return 1
        else: return 0
        
    def score(self, X, y):
        """
        A method to calculate an accuracy score of a received dataset X and labels Y.

        Input:
          X: Dataset that we want to calculate scores.
          y: True labels for the dataset X.

        Output:
          score: An accuracy with a range of [0, 1].

        """
        
        #- Step 1: Set the initial loss value to zero.
        
        loss = 0
        
        #- Step 2: We traverse all rows in our dataset and predict the label of each row
        #          by calling self.predict method. If the label is different (prediction was wrong),
        #          we add one to the loss value.
        for idx, row in X.iterrows():
            prediction = self.predict(row)
            
            if (y[idx] - prediction) != 0:
                loss += 1
        
        #- Step 3: Calculate the accuracy score: Divide the summed loss value by the size of the dataset
        #          and subtract it from one.
        
        accuracy = 1 - loss/len(y)
        
        #- Step 4: Return the accuracy score.
        return accuracy
    

ANSWER

In [None]:
class Perceptron():
    def __init__(self, max_iter):
        # Number of passes over the training set.
        self.max_iter = max_iter
        
    def fit(self, X, y):
        # self.w[0] is the bias; self.w[1:] holds one weight per attribute.
        self.w = np.zeros(len(X.iloc[0])+1)
        for it in range(self.max_iter):
            for idx, row in X.iterrows():
                prediction = self.predict(row)
                # error is 0 for a correct prediction, so the update below changes nothing.
                error = y[idx] - prediction
                self.w[0] = self.w[0] + error
                self.w[1:] = self.w[1:] + error * row
                
    def predict(self, row):
        # Activation: bias plus the dot product of the weights and the row.
        act = self.w[0]
        act += self.w[1:].dot(row)
        if act >= 0:
            return 1
        else: return 0
    
    def score(self, X, y):
        # Count misclassified rows; abs(error) is 1 for every wrong prediction.
        loss = 0
        for idx, row in X.iterrows():
            prediction = self.predict(row)
            error = y[idx] - prediction
            loss += abs(error)
        return 1 - loss/len(y)

Now we are done with the implementation! We can create the instance, train the model, and test it in the same way!

Run the code below to check the result of the algorithm.

In [None]:
p = Perceptron(max_iter=100)

In [None]:
p.fit(X_train, y_train_numeric)

Now, let's check whether the score matches the one we got from scikit-learn earlier in the lab.

In [None]:
p.score(X_test, y_test_numeric)
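
To watch the update rule at work on something small enough to trace by hand, here is a sketch in plain NumPy that applies the same rule to the logical OR function. The toy data and the helper `predict_row` are assumptions for illustration and are not part of the lab's dataset.

```python
import numpy as np

# Toy, linearly separable data: the logical OR of two binary inputs.
X_or = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or = np.array([0, 1, 1, 1])

w = np.zeros(X_or.shape[1] + 1)  # w[0] is the bias, w[1:] are the weights

def predict_row(w, row):
    """Return 1 if bias + weights.row is non-negative, otherwise 0."""
    return 1 if w[0] + w[1:].dot(row) >= 0 else 0

for _ in range(10):  # 10 passes are more than enough for this data
    for row, target in zip(X_or, y_or):
        error = target - predict_row(w, row)  # -1, 0, or +1
        w[0] += error                         # bias update
        w[1:] += error * row                  # weight update

predictions = [predict_row(w, row) for row in X_or]
print(w)            # [-1.  1.  1.]
print(predictions)  # [0, 1, 1, 1]
```

The weights stop changing after the third pass, because from then on every row is classified correctly and every error is zero.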

#### kNN

Now it is kNN's turn. To implement kNN more easily, we can use **Counter**, one of the data structures in Python's built-in collections module. It is okay if you do not know it yet, but it makes your job much easier! If you want to learn more, refer to Python's official documentation [here](https://docs.python.org/3/library/collections.html#counter-objects).

In [None]:
from collections import Counter
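
As a quick sketch of how Counter helps with a majority vote, the example below tallies three made-up neighbor labels (the list `votes` is an assumption for illustration).

```python
from collections import Counter

# Hypothetical labels of the three nearest neighbors.
votes = ["M", "R", "M"]

tally = Counter(votes)               # counts how often each label occurs
winner = tally.most_common(1)[0][0]  # label of the most frequent entry

print(tally.most_common(1))  # [('M', 2)]
print(winner)                # M
```

`most_common(1)` returns a list containing one (label, count) pair, so `[0][0]` picks out the label itself.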

We will use the same structure again so that we can test in the same way!

In [None]:
class KNN:
    def __init__(self, n_neighbors):
        """
        A constructor that receives parameters and save them into member variables.
        You will receive n_neighbors value and need to save into self.n_neighbors.

        Input:
          n_neighbors: Number of neighbors to look when we predict.
        Output:
          None.
        """
        self.n_neighbors = None
        
        return
        
    def fit(self, X, y):
        """
        A method to train the model by receiving the training dataset and labels.

        - Step 1: Since there is no training process in KNN algorithm, we can just save the dataset and labels into
                  the member variables self.X, self.y.
                  
        Input:
          X: Training dataset.
          y: Training labels.

        Output:
          None.
        """
        self.X = None
        self.y = None
        
        return None
    
    def euclidean_dist(self, d1, d2):
        """
        A method to calculate the Euclidean distance between two data points d1 and d2.

        - Step 1: We calculate the Euclidean distance by subtracting one point from the other,
                  squaring the differences, summing them, and taking the square root.
                  
        Input:
          d1, d2: Data points (rows) from the dataset.
          
        Output:
          distance: The Euclidean distance between d1 and d2.
        """
        euclidean = None
        return None
        
    def predict(self, row):
        """
        A method to predict a label by majority vote among the nearest training points.

        Input:
          row: A single row from dataset.
        Output:
          The predicted class label.
        """
        
        #- Step 1: For a given row, we need to calculate distances from all data points of the training set.
        # - To do this, create an empty list to save distances
        distances = []
        
        #- Step 2: We iterate over all training rows and get the [n_neighbors] nearest data points by calculating
        #          Euclidean distances from the input data point to every training row.
        
        # - iterate every row in the training set. We also need indices to recognize the rows.
        for idx, row_train in self.X.iterrows():
            # - calculate distance between the given row (row) and chosen row in the loop.
            dist = self.euclidean_dist(row, row_train)
            # - append the calculated distance to the list
            distances.append((idx, dist))
        
        # sort the distances by the distance values, so we can get top k nearest neighbors
        distances.sort(key=lambda x: x[1])
        
        #- Step 3: We will use the sorted list to get the labels from self.y 
        #          with the indices of [n_neighbors] nearest data points from self.X
        #          and perform majority vote on [n_neighbors] nearest data points' labels. 
        #          In this stage, you can use collections.Counter to make this task easier.
        
        # - Create a list to keep the labels (y) of the chosen nearest neighbors
        neighbors = []
        
        # - loop n_neighbors times and get first k (n_neighbors) labels using self.y
        for i in range(self.n_neighbors):
            neighbors.append(self.y[distances[i][0]])
        
        # - Step 4: Return the label that the majority of the data points have.
        # - Use Counter.most_common to return the most common label.
        final_guess = Counter(neighbors).most_common(1)[0][0]
        
        return final_guess
    
    def score(self, X, y):
        """
        A method to calculate an accuracy score of a received dataset X and labels Y.

        Input:
          X: Dataset that we want to calculate scores
          y: True labels for the dataset X

        Output:
          score: An accuracy with a range of [0, 1]

        """
        
        #- Step 1: Set the initial count of correct predictions to zero.
        correct = None
        
        #- Step 2: We traverse all rows in our dataset and predict the label of each row 
        #          by calling self.predict method. 
                
        for idx, row in X.iterrows():
            prediction = self.predict(row)
            # - If the predicted label matches the true label, we add one to the count.
            correct += 1 if y[idx] == prediction else 0
        # - Step 3: Divide the number of correct predictions by the size of the dataset to get an accuracy score.
        accuracy = correct/len(y)
        # - Step 4: Return the accuracy score.
        return accuracy

ANSWER

In [None]:
class KNN:
    def __init__(self, n_neighbors):
        self.n_neighbors = n_neighbors
        
    def fit(self, X, y):
        self.X = X
        self.y = y
    
    def euclidean_dist(self, d1, d2):
        # Square root of the summed squared differences between the two rows.
        return np.sqrt(np.sum((d1 - d2)**2))
        
    def predict(self, row):
        distances = []
        for idx, d2 in self.X.iterrows():
            dist = self.euclidean_dist(row, d2)
            distances.append((idx, dist))
        
        distances.sort(key=lambda x: x[1])
        neighbors = []
        for i in range(self.n_neighbors):
            neighbors.append(self.y[distances[i][0]])
        
        final_guess = Counter(neighbors).most_common(1)[0][0]
        return final_guess
    
    def score(self, X, y):
        # Count the correct predictions and divide by the number of rows.
        correct = 0

        for idx, row in X.iterrows():
            prediction = self.predict(row)
            correct += 1 if y[idx] == prediction else 0
        return correct/len(y)

Now, let's test it and see if it returns the same score on our test dataset!

In [None]:
knn = KNN(n_neighbors = 3)

In [None]:
knn.fit(X_train, y_train)

Now let's check whether the score matches the one we got from scikit-learn.

In [None]:
knn.score(X_test, y_test)
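
The row-by-row loop above is easy to follow but slow on larger datasets. As an optional sketch, the same prediction can be written with vectorized NumPy operations. The four training points, their labels, and the function `knn_predict` are made up for illustration and are not part of the lab's dataset or of scikit-learn.

```python
from collections import Counter

import numpy as np

# Hypothetical training set: two "R" points near the origin, two "M" points far away.
train_points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_labels = np.array(["R", "R", "M", "M"])

def knn_predict(query, k=3):
    """Majority vote among the k training points closest to query."""
    # Euclidean distance from the query to every training point at once.
    distances = np.sqrt(((train_points - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]  # indices of the k smallest distances
    return Counter(train_labels[nearest].tolist()).most_common(1)[0][0]

print(knn_predict(np.array([0.2, 0.1])))  # R
print(knn_predict(np.array([5.5, 5.0])))  # M
```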

# END OF LAB 3