# KNN Classifier

The **K-nearest neighbors** (KNN) algorithm is a supervised learning algorithm. It is easy to implement and can be used for both classification and regression problems. Here we use it as a classifier.


**Advantages**:
- Easy to implement.
- Few parameters to tune.
- Works for both classification and regression.
- Intuitive and easy to understand, which helps when explaining decisions.

**Disadvantages**
- Prediction becomes slower as the number of independent variables and stored observations grows, because distances must be computed against every stored point.

**Intuition**
- Step 1: Select the number K of neighbors.
- Step 2: Calculate the distance (Euclidean or Manhattan, for example) from the unclassified point to every other point.
- Step 3: Take the K nearest neighbors according to the calculated distance.
- Step 4: Among those K neighbors, count the number of points belonging to each category.
- Step 5: Assign the new point to the most frequent category among these K neighbors.
- Step 6: Our model is ready.
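
The steps above can be sketched in a few lines of plain Python. This is a minimal batch illustration, not river's streaming implementation; the function names `euclidean` and `knn_predict` and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3, distance=euclidean):
    """Classify `query` from a list of (point, label) pairs."""
    # Step 2: order the labelled points by their distance to the query
    ranked = sorted(train, key=lambda item: distance(item[0], query))
    # Step 3: keep the K nearest neighbors
    nearest = ranked[:k]
    # Step 4: count how many neighbors fall in each category
    votes = Counter(label for _, label in nearest)
    # Step 5: return the most frequent category
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
]

print(knn_predict(train, (1.1, 1.0), k=3))  # prints "a"
```

Passing a different `distance` callable changes the metric without touching the rest of the procedure.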

In [86]:
from river import datasets
from river import evaluate
from river import metrics
from river import neighbors
from river import preprocessing

dataset = datasets.Phishing()

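# Scale the features, then classify with KNN over a sliding window
# that keeps the 50 most recently seen samples.
model_knn = (
    preprocessing.StandardScaler() |
    neighbors.KNNClassifier(window_size=50)
)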

evaluate.progressive_val_score(dataset, model_knn, metrics.Accuracy())

Accuracy: 84.55%

In [87]:
evaluate.progressive_val_score(dataset, model_knn, metrics.F1())

F1: 82.33%

In [88]:
evaluate.progressive_val_score(dataset, model_knn, metrics.ROCAUC())

ROCAUC: 91.54%

### With Manhattan distance instead of Euclidean

Manhattan distance (L1 norm) may be preferable to Euclidean distance (L2 norm) for high-dimensional data.
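
Both distances are special cases of the Minkowski distance with exponent p. The standalone sketch below shows this on a small example; the helper name `minkowski` is an assumption for illustration and is not river's function.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a = (0.0, 0.0)
b = (3.0, 4.0)

manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7.0
euclid = minkowski(a, b, p=2)     # sqrt(3^2 + 4^2) = 5.0
print(manhattan, euclid)
```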

In [89]:
import functools
from river import utils
model_knnman = (
    preprocessing.StandardScaler() |
    neighbors.KNNClassifier(
        window_size=50,
        # Minkowski distance with p=1 is the Manhattan (L1) distance
        distance_func=functools.partial(utils.math.minkowski_distance, p=1)
    )
)
evaluate.progressive_val_score(dataset, model_knnman, metrics.Accuracy())

Accuracy: 86.87%

In [90]:
evaluate.progressive_val_score(dataset, model_knnman, metrics.F1())

F1: 85.21%

In [91]:
evaluate.progressive_val_score(dataset, model_knnman, metrics.ROCAUC())

ROCAUC: 93.10%

# AdaBoostClassifier

**Boosting algorithms** combine a set of low-accuracy classifiers to create a highly accurate classifier. A low-accuracy classifier (or weak classifier) performs only slightly better than flipping a coin. A highly accurate classifier (or strong classifier) has an error rate close to 0. Boosting algorithms keep track of where earlier models failed to predict accurately, and they tend to be less affected by overfitting.

**AdaBoost**, or Adaptive Boosting, is an ensemble boosting classifier.
It combines multiple classifiers to increase overall accuracy.
AdaBoost is an iterative ensemble method.
It builds a strong classifier by combining multiple poorly performing classifiers.
The basic concept behind AdaBoost is to set the weights of the classifiers and re-weight the training samples at each iteration so that unusual, hard-to-classify observations are predicted accurately. Any machine learning algorithm can be used as the base classifier if it accepts weights on the training set. AdaBoost should meet two conditions:
- The classifier should be trained iteratively on variously weighted training examples.
- At each iteration, it tries to provide a good fit for these examples by minimizing training error.
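
To illustrate the re-weighting idea, here is a minimal sketch of one round of the classic batch AdaBoost update, assuming labels encoded as -1/+1 and a weighted error strictly between 0 and 1. The function name `adaboost_round` and the toy predictions are assumptions for the example; the streaming variant used in the cells below re-weights examples differently.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return the classifier weight and new sample weights.

    Assumes labels in {-1, +1}, weights summing to 1, and 0 < error < 1.
    """
    # Weighted training error of the current weak classifier
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Classifier weight: larger when the weighted error is smaller
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase weights of misclassified samples, decrease the others
    updated = [
        w * math.exp(-alpha * y * h)
        for w, y, h in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical round: the weak classifier gets the last sample wrong
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.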

In [92]:
from river import datasets
from river import ensemble
from river import evaluate
from river import metrics
from river import tree

dataset = datasets.Phishing()

# metric = metrics.LogLoss()

model_ada = ensemble.AdaBoostClassifier(
    model=(
        tree.HoeffdingTreeClassifier(
            split_criterion='gini',
            delta=1e-5,
            grace_period=2000
        )
    ),
    n_models=5,
    seed=42
)

In [93]:
evaluate.progressive_val_score(dataset, model_ada, metrics.Accuracy())

Accuracy: 88.63%

In [94]:
evaluate.progressive_val_score(dataset, model_ada, metrics.F1())

F1: 87.56%

In [95]:
evaluate.progressive_val_score(dataset, model_ada, metrics.ROCAUC())

ROCAUC: 96.35%

# BaggingClassifier

**Bagging**: we generate multiple versions of a predictor and aggregate them into a single model.
Bagging is the procedure of applying Bootstrap sampling to the training dataset and then Aggregating the results: averaging when estimating a numerical outcome, and taking a plurality vote when predicting a class.

**The Bagging Algorithm**
- Step 1: Bootstrap samples of the same size are repeatedly drawn (with replacement) from the training dataset, so that each record has an equal probability of being selected.
- Step 2: A classification or estimation model is trained on each bootstrap sample drawn in Step 1, and a prediction is recorded for each sample.
- Step 3: The bagging ensemble prediction is then defined to be the class with the most votes in Step 2 (for classification models) or the average of the predictions made in Step 2 (for estimation models).
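
The three steps can be sketched in plain Python as follows. This follows the batch description above; streaming implementations cannot resample a fixed dataset and approximate Step 1 differently. The base learner `fit_one_nn` (a 1-nearest-neighbor rule on one-dimensional inputs), the other function names, and the toy data are assumptions for the example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Step 1: draw len(data) records with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def fit_one_nn(sample):
    """Step 2: hypothetical base learner, a 1-nearest-neighbor rule."""
    def predict(x):
        _, label = min(sample, key=lambda rec: abs(rec[0] - x))
        return label
    return predict


def majority_vote(labels):
    """Step 3: the class with the most votes wins."""
    return Counter(labels).most_common(1)[0][0]


def bagging_predict(models, x):
    """Collect one prediction per model and aggregate them."""
    return majority_vote([model(x) for model in models])


rng = random.Random(42)
data = [
    (0.0, "a"), (0.1, "a"), (0.2, "a"), (0.3, "a"),
    (10.0, "b"), (10.1, "b"), (10.2, "b"), (10.3, "b"),
]

# Train one model per bootstrap sample
models = [fit_one_nn(bootstrap_sample(data, rng)) for _ in range(5)]
print(bagging_predict(models, 0.05))
```

For an estimation (regression) model, `majority_vote` would be replaced by the mean of the individual predictions.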

In [96]:
from river import datasets
from river import ensemble
from river import evaluate
from river import linear_model
from river import metrics
from river import preprocessing

dataset = datasets.Phishing()

model_bagging = ensemble.BaggingClassifier(
    model=(
        preprocessing.StandardScaler() |
        linear_model.LogisticRegression()
    ),
    n_models=3,
    seed=42
)

# metric = metrics.F1()

In [97]:
evaluate.progressive_val_score(dataset, model_bagging, metrics.Accuracy())

Accuracy: 89.20%

In [98]:
evaluate.progressive_val_score(dataset, model_bagging, metrics.F1())

F1: 88.69%

In [99]:
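# Score the bagging model; the figure recorded below came from a run that passed model_ada here.
evaluate.progressive_val_score(dataset, model_bagging, metrics.ROCAUC())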

ROCAUC: 95.93%

# VotingClassifier

A **Voting Classifier** is a machine learning model that trains an ensemble of several models and
predicts the output class that receives the most support from them. It simply
aggregates the predictions of each classifier passed into the Voting Classifier and predicts the output class
that wins the majority of votes. The idea is that, instead of creating separate dedicated models and measuring
the accuracy of each of them, we create a single model that trains all of these models and predicts the output
based on their combined majority vote for each output class.
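
Two common ways to aggregate the individual classifiers are a majority vote over predicted labels and an average over predicted probabilities. The sketch below contrasts them on made-up probability outputs; the names `hard_vote` and `soft_vote`, the labels, and the numbers are assumptions for illustration only.

```python
from collections import Counter


def hard_vote(probas):
    """Each classifier votes for its most probable class; the majority wins."""
    labels = [max(p, key=p.get) for p in probas]
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas):
    """Sum the class probabilities across classifiers; the largest total wins."""
    totals = {}
    for p in probas:
        for label, prob in p.items():
            totals[label] = totals.get(label, 0.0) + prob
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for a single sample
probas = [
    {"phishing": 0.9, "legit": 0.1},
    {"phishing": 0.4, "legit": 0.6},
    {"phishing": 0.4, "legit": 0.6},
]

print(hard_vote(probas))  # prints "legit": two of three classifiers prefer it
print(soft_vote(probas))  # prints "phishing": summed probability 1.7 vs 1.3
```

The two schemes can disagree when one classifier is very confident and the others are only mildly opposed, as in this example.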

In [107]:
from river import datasets
from river import ensemble
from river import evaluate
from river import linear_model
from river import metrics
from river import naive_bayes
from river import preprocessing
from river import tree

dataset = datasets.Phishing()

model_voting = (
    preprocessing.StandardScaler() |
    ensemble.VotingClassifier([
        linear_model.LogisticRegression(),
        tree.HoeffdingTreeClassifier(),
        naive_bayes.GaussianNB()
    ])
)

# metric = metrics.F1()

In [108]:
evaluate.progressive_val_score(dataset, model_voting, metrics.Accuracy())


Accuracy: 88.48%

In [109]:
evaluate.progressive_val_score(dataset, model_voting, metrics.F1())


F1: 88.91%

# ADWINBaggingClassifier

**ADWIN Bagging** is the online bagging method with the addition of the
ADWIN algorithm as a change detector. If concept drift is detected, the worst member of the
ensemble (based on the error estimation by ADWIN) is replaced by a new (empty) classifier.

ADWIN is parameter- and assumption-free in the sense that it automatically detects and adapts to the current rate of change. Its only parameter is a confidence bound δ, indicating how confident we want to be in the algorithm’s output, inherent to all algorithms dealing with random processes.
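
To make the change-detection idea concrete, here is a minimal sketch of the window-split test that ADWIN is built on, assuming values in [0, 1] (for example 0/1 prediction errors). It is a simplified linear scan and not river's implementation, which compresses the window for efficiency; the function name `adwin_cut_point` and the value `delta=0.002` are assumptions chosen for the example.

```python
import math


def adwin_cut_point(window, delta=0.002):
    """Return the first split index whose two sides differ significantly.

    Returns None when no split of the window shows a significant change.
    """
    n = len(window)
    total = sum(window)
    head_sum = 0.0
    for i in range(1, n):
        head_sum += window[i - 1]
        n0, n1 = i, n - i
        mean0 = head_sum / n0            # mean of the older part
        mean1 = (total - head_sum) / n1  # mean of the newer part
        # Harmonic-style combination of the two sub-window sizes
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        # Threshold from the confidence bound, with delta split across n tests
        eps_cut = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
        if abs(mean0 - mean1) > eps_cut:
            return i
    return None


stable = [0, 1] * 50          # error rate stays around 0.5
drifting = [0] * 50 + [1] * 50  # error rate jumps from 0 to 1

print(adwin_cut_point(stable))    # prints None
print(adwin_cut_point(drifting) is not None)  # prints True
```

When a cut point is found, the older part of the window is dropped, which is how the window length adapts to the current rate of change.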


In [15]:
from river import datasets
from river import ensemble
from river import evaluate
from river import linear_model
from river import metrics
from river import preprocessing

dataset = datasets.Phishing()

model_adwin = ensemble.ADWINBaggingClassifier(
    model=(
        preprocessing.StandardScaler() |
        linear_model.LogisticRegression()
    ),
    n_models=3,
    seed=42
)

# metric = metrics.F1()

In [16]:
evaluate.progressive_val_score(dataset, model_adwin, metrics.Accuracy())


Accuracy: 89.20%

In [17]:
evaluate.progressive_val_score(dataset, model_adwin, metrics.F1())

F1: 88.61%

In [None]:
evaluate.progressive_val_score(dataset, model_adwin, metrics.ROCAUC())

ROCAUC: 96.01%