# Introduction to Machine Learning
## Lesson 6 Support Vector Machines
## Introduction

In this lab, we will get familiar with classification using support vector machines (SVM).


## Task
Machine learning often requires classifying data. Each data object is represented as a vector (a point) in p-dimensional space, that is, a sequence of p numbers, and each point belongs to exactly one of two classes. We want to know whether the points can be separated by a hyperplane of dimension p-1; if they can, the data are called linearly separable. Many such hyperplanes may exist, so it is natural to prefer the one that maximizes the gap between the classes, since a wider gap gives more confident classification. In other words, we look for a hyperplane whose distance to the nearest point is as large as possible, which also maximizes the distance between the two closest points lying on opposite sides of the hyperplane. If such a hyperplane exists, it is called the optimal separating hyperplane, and the corresponding linear classifier is called the optimal separating classifier.
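
As a sketch of what "maximizing the gap" means formally, assume the hyperplane is written as $w^\top x + b = 0$ and the class labels are $y_i \in \{-1, +1\}$ (this notation is an assumption; the text above does not fix one). For linearly separable data, the standard hard-margin formulation is:

$$
\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, n
$$

Under this scaling, the nearest points lie at distance $1 / \|w\|$ from the hyperplane, so the gap between the classes is $2 / \|w\|$, and minimizing $\|w\|$ maximizes it.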


### Performing the Classification
To complete the task, you will need to:
- Obtain the data from the competition
- Create a Jupyter notebook that produces a submission file
- Submit the file to the competition

### Questions

*In the practical part of the assignment, we compared the distributions of the decision function outputs (probability estimates) of SVM and LogReg. The histogram for LogReg usually has heavier tails than the histogram for SVM. What could explain this?*

1. It cannot be explained in any way; it is just an empirical fact.
2. It comes from the objective functions the models are trained on. SVM only needs the margin to reach 1, while LogReg keeps maximizing the margin.
3. SVM is bad at estimating probabilities, while LogReg is good at it.



**Answer: 2)** The explanation is contained in the answer itself.
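
To illustrate answer 2, here is a minimal sketch comparing the two per-object losses as a function of the margin `M = y * f(x)`. It assumes the standard hinge loss for SVM and the standard logistic loss for LogReg; the function names are illustrative and not part of the assignment.

```python
import math


def hinge_loss(margin):
    """SVM loss: becomes zero as soon as the margin reaches 1."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """LogReg loss: stays positive for every finite margin."""
    return math.log1p(math.exp(-margin))


margins = [0.0, 1.0, 2.0, 5.0]
hinge_values = [hinge_loss(m) for m in margins]
logistic_values = [logistic_loss(m) for m in margins]

for m, h, l in zip(margins, hinge_values, logistic_values):
    print(f"margin={m:>4}: hinge={h:.4f}, logistic={l:.4f}")
```

Once the margin exceeds 1, the hinge loss gives no incentive to push an object further from the hyperplane, while the logistic loss still decreases. This is consistent with LogReg outputs spreading further into the tails.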

_______________________________________________________

*Choose the correct statement about SVM and LogReg:*

1. A calibrated SVM always estimates probabilities better than LogReg.
2. SVM will have slightly higher accuracy on average than LogReg.
3. Neither model can be said to be better by any metric. It's just that one model estimates probabilities and the other does not.
4. SVM is trained faster than LogReg.




**Answer: 2)** Because LogReg maximizes its confidence (it tries to estimate probabilities correctly), it may accept errors on objects close to the separating hyperplane in exchange for a better fit on "distant" objects in the feature space.

_______________________________________________________

## Importing Required Libraries

First, we import the necessary libraries:

[Pandas](https://pandas.pydata.org/) - for data analysis and manipulation

[Numpy](https://numpy.org/) - for working with arrays and matrices

[Matplotlib](https://matplotlib.org/) - for creating static, animated, and interactive visualizations in Python

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

## Preparing Data
Preparing data for machine learning involves several steps such as data collection, cleaning the data from noise and outliers, transforming the data into a suitable format, normalizing or standardizing values, and creating and selecting features (feature engineering) to improve the quality of the model. This process is important to ensure the accuracy and reliability of machine learning models because data quality directly affects their performance.

In [None]:
data = pd.read_csv('processed_train.csv')
data.head()

Partitioning the sample into train and test parts is an important step in the process of developing and evaluating machine learning models. The main reasons for this are:

1) To evaluate model performance: Splitting the data allows us to use one part of the data (the training part) to train the model and the other part (the test part) to evaluate its performance. This helps in understanding how the model will perform on new, previously unseen data.

2) Detect overfitting: If a model is trained and evaluated on the same data, overfitting goes unnoticed: the model adapts too well to the training data and loses the ability to generalize to new data. Evaluating on a held-out part helps to identify such problems.

3) Objective model comparison: The test dataset serves as a benchmark to objectively compare different models or hyperparameters. This allows you to select the model that shows the best performance on new data.

In [None]:
from sklearn.model_selection import train_test_split

X = data.drop('HasDetections', axis=1)
y = data['HasDetections']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=33)

#### Exercise 1
Create two pipelines for machine learning models: Logistic Regression and Linear SVM, each including data preprocessing with MinMaxScaler. Train and evaluate the models.
Use functions from the sklearn library.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler

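# Logistic regression on features rescaled to [0, 1]
pipe_lr = Pipeline([('lr_scaler', MinMaxScaler()),
                    ('lr_estimator', LogisticRegression())])

# Linear SVM with the same preprocessing step
pipe_svm = Pipeline([('svm_scaler', MinMaxScaler()),
                     ('svm_estimator', LinearSVC())])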
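
As a rough illustration of what the `MinMaxScaler` step does inside each pipeline, the sketch below rescales a single feature by hand. The helper names and toy values are hypothetical; the point is that the minimum and maximum are taken from the training part only, so test values may fall outside [0, 1].

```python
def fit_min_max(values):
    """Remember the minimum and maximum of a training feature."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


train_feature = [2.0, 4.0, 6.0]  # toy training values
test_feature = [3.0, 8.0]        # toy test values

# Statistics come from the training part and are reused for the test part
lo, hi = fit_min_max(train_feature)
train_scaled = apply_min_max(train_feature, lo, hi)
test_scaled = apply_min_max(test_feature, lo, hi)

print(train_scaled)
print(test_scaled)
```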

#### Exercise 2

Build the ROC curve for both models and calculate the AUC. As usual, train on the training set and measure on the test set.

Note: the classical `SVM` implementation, as in the lectures, does not provide any probability estimation. To transform the outputs into probabilities, we used a sigmoid function in practice. Here we suggest that you transform the outputs of `decision_function` into probabilities in a proportional way.

For example, you have trained `SVM` and on test data the model produced the following `decision_function` outputs:

(-10, -5, 0, +2, +10, +15)

Each number must be transformed into an estimate of `P(y = +1 | x)`.

On the one hand, a negative sign signals that `P(y = +1 | x) < 0.5`.

A positive sign signals that `P(y = +1 | x) > 0.5`.

On the other hand, the objects that the model is most confident about receive the boundary probabilities 0 and 1. For the example above:

`P(y = +1 | -10) = 0`, `P(y = +1 | +15) = 1`. For all intermediate objects, we apply the proportional transformation. For example:

$$
P(y = +1 \mid -5) = \frac{|-5 - (-10)|}{|-10|} \cdot 0.5 = 0.25
$$

$$
P(y = +1 \mid +2) = \frac{|+2|}{|+15|} \cdot 0.5 + 0.5 \approx 0.567
$$
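
The proportional rule can be checked on the six example outputs with a short sketch. The function name `proportional_proba` is hypothetical, and the sketch assumes that the outputs contain at least one negative and one positive value.

```python
def proportional_proba(scores):
    """Map decision values to [0, 1]: minimum -> 0, zero -> 0.5, maximum -> 1."""
    lo, hi = min(scores), max(scores)
    probs = []
    for s in scores:
        if s <= 0:
            # Negative side: linear from 0 (at lo) to 0.5 (at zero)
            probs.append(0.5 * (s - lo) / abs(lo))
        else:
            # Positive side: linear from 0.5 (at zero) to 1 (at hi)
            probs.append(0.5 + 0.5 * s / hi)
    return probs


scores = [-10, -5, 0, 2, 10, 15]
probs = proportional_proba(scores)
print([round(p, 3) for p in probs])
```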

Use the **fit()** method to train both pipelines.

In [None]:
pipe_lr.fit(X_train, y_train)
pipe_svm.fit(X_train, y_train)

Import **roc_curve** and **RocCurveDisplay**.

**roc_curve** - A ROC curve is constructed by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various decision thresholds. The roc_curve function returns the FPR values, the TPR values, and the thresholds used to construct the curve.

**RocCurveDisplay** - This class simplifies the process of creating and displaying a ROC curve, making it easy to plot and interpret model performance graphs.
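
To make these definitions concrete, the following sketch computes one (TPR, FPR) point by hand for a given threshold. The helper `tpr_fpr_at` and the toy labels and scores are illustrative assumptions; `roc_curve` performs the same computation over many thresholds.

```python
def tpr_fpr_at(y_true, scores, threshold):
    """Return (TPR, FPR) when objects with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# A stricter threshold finds fewer positives but also raises fewer false alarms
point_strict = tpr_fpr_at(y_true, scores, threshold=0.5)
point_loose = tpr_fpr_at(y_true, scores, threshold=0.3)
print(point_strict, point_loose)
```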

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import RocCurveDisplay

#### Exercise 2.1

1) Train a Logistic Regression model using a training sample.
Using the trained model, predict the probabilities of positive class for the test sample.
2) Calculate the TPR (True Positive Rate) and FPR (False Positive Rate) values for different classification thresholds using roc_curve function.
3) Construct a ROC curve from the calculated TPR and FPR values and display it using the RocCurveDisplay class.
4) Interpret the resulting ROC curve, explaining the value of the area under the curve (AUC) and how it reflects the quality of the model performance.

In [None]:
preds_lr = pipe_lr.predict_proba(X_test)[:, 1]

fpr1, tpr1, _ = roc_curve(y_test, preds_lr)
roc_display1 = RocCurveDisplay(fpr=fpr1, tpr=tpr1).plot()

#### Exercise 2.2

1) Train an SVM model using a training sample.
2) Using the trained model, predict the values of the decision function for the test sample.
3) Find the minimum and maximum values among the predicted decision function values.
4) Convert the SVM predictions to probabilities using the specified formula:
- For negative values, convert them to a range of 0 to 0.5.
- For positive values, convert them to a range of 0.5 to 1.
5) Explain the logic behind converting decision function values to probabilities and how this might affect the interpretation of the results.

In [None]:
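decision_preds = pipe_svm.decision_function(X_test)

# Extreme decision values; assumes both negative and positive values are present
min_pred = decision_preds.min()
max_pred = decision_preds.max()

# Proportional mapping: [min_pred, 0] -> [0, 0.5] and (0, max_pred] -> (0.5, 1]
preds_svm = np.where(decision_preds <= 0,
                     0.5 * (decision_preds - min_pred) / abs(min_pred),
                     0.5 + 0.5 * decision_preds / max_pred)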

#### Exercise 2.3

1) Calculate the False Positive Rate (FPR), True Positive Rate (TPR) and thresholds for the ROC curve from the predicted preds_svm probabilities and the true y_test labels.
2) Create a RocCurveDisplay object from the FPR and TPR values stored in the fpr2 and tpr2 variables, and call its plot() method to display the ROC curve.

In [None]:
fpr2, tpr2, _ = roc_curve(y_test, preds_svm)
roc_display2 = RocCurveDisplay(fpr=fpr2, tpr=tpr2).plot()

#### Exercise 2.4
Derive the AUC values for the two machine learning models (logistic regression and SVM) based on their respective ROC curves, which were pre-calculated and stored in variables fpr1, tpr1 for logistic regression and fpr2, tpr2 for SVM.

In [None]:
from sklearn.metrics import auc 

print('LogReg auc =', auc(fpr1, tpr1))
print('SVM auc =', auc(fpr2, tpr2))
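
One way to interpret the printed values: AUC equals the fraction of (positive, negative) pairs in which the positive object receives the higher score. The sketch below uses a hypothetical helper `pairwise_auc` and toy data. It also shows that an order-preserving transformation of the scores (here a sigmoid) leaves this value unchanged, so as long as the mapping from decision values to probabilities preserves the ordering, it affects calibration rather than AUC.

```python
import math


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
decision_values = [-2.0, 0.5, -0.3, 1.5]  # toy decision values
# Sigmoid is strictly increasing, so the ranking of objects is preserved
sigmoid_values = [1 / (1 + math.exp(-d)) for d in decision_values]

auc_raw = pairwise_auc(y_true, decision_values)
auc_sigmoid = pairwise_auc(y_true, sigmoid_values)
print(auc_raw, auc_sigmoid)
```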

#### Exercise 2.5

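Build calibration curves for both models. You cannot use the `from_estimator` method for the SVM, because `LinearSVC` does not provide `predict_proba`; use the `preds_svm` values computed above instead.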

In [None]:
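import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# calibration_curve returns (fraction of positives, mean predicted probability) per bin
prob_true_lr, prob_pred_lr = calibration_curve(y_test,
                                               preds_lr,
                                               n_bins=5)

fig = plt.figure()
fig.set_size_inches(8, 8)

# Predicted probability on the x-axis, observed frequency on the y-axis
plt.plot(prob_pred_lr, prob_true_lr, label='LogReg')
# Diagonal of a perfectly calibrated model
plt.plot(np.linspace(0, 1, 5), np.linspace(0, 1, 5), label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()

plt.show()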

In [None]:
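import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# calibration_curve returns (fraction of positives, mean predicted probability) per bin
prob_true_svm, prob_pred_svm = calibration_curve(y_test,
                                                 preds_svm,
                                                 n_bins=5)

fig = plt.figure()
fig.set_size_inches(8, 8)

# Predicted probability on the x-axis, observed frequency on the y-axis
plt.plot(prob_pred_svm, prob_true_svm, label='SVM')
# Diagonal of a perfectly calibrated model
plt.plot(np.linspace(0, 1, 5), np.linspace(0, 1, 5), label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()

plt.show()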
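
For intuition about what `calibration_curve` returns, here is a hand-rolled sketch of equal-width binning. The helper `manual_calibration_curve` is hypothetical and only approximates the library behaviour (equal-width bins, empty bins dropped); it is meant for illustration, not as a replacement.

```python
def manual_calibration_curve(y_true, probs, n_bins):
    """Return (fraction of positives, mean predicted probability) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        # Equal-width bins on [0, 1]; p == 1.0 goes into the last bin
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((y, p))

    prob_true, prob_pred = [], []
    for members in bins:
        if members:
            prob_true.append(sum(y for y, _ in members) / len(members))
            prob_pred.append(sum(p for _, p in members) / len(members))
    return prob_true, prob_pred


y_true = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]

prob_true, prob_pred = manual_calibration_curve(y_true, probs, n_bins=2)
print(prob_true, prob_pred)
```

A well-calibrated model has `prob_true` close to `prob_pred` in every bin, which is what the diagonal line on the plots represents.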

#### Exercise 2.6
Calibrate the probabilities for the SVM model using the CalibratedClassifierCV class and obtain the adjusted probabilities for the test sample.

In [None]:
### Calibration (Platt scaling via method='sigmoid')

from sklearn.calibration import CalibratedClassifierCV

plats_calibration = CalibratedClassifierCV(pipe_svm,
                                           cv=3,
                                           method='sigmoid').fit(X_train, y_train)

plats_calibration_preds = plats_calibration.predict_proba(X_test)[:, 1]
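
With `method='sigmoid'`, the calibrator fits a sigmoid on top of the decision values (Platt scaling). A minimal sketch of that mapping follows; the coefficients `a` and `b` are made-up values for illustration, whereas in practice they are fitted on held-out folds.

```python
import math


def platt_probability(decision_value, a, b):
    """Platt scaling: P(y = 1 | f) = 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Hypothetical coefficients; a < 0 so that larger scores give larger probabilities
a, b = -1.5, 0.0

decision_values = [-2.0, 0.0, 2.0]
calibrated = [platt_probability(d, a, b) for d in decision_values]
print([round(p, 3) for p in calibrated])
```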

#### Exercise 2.7

Visualize the calibration curve and compare it with the ideal diagonal line

In [None]:
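### New calibration curve

# calibration_curve returns (fraction of positives, mean predicted probability) per bin
prob_true_svm, prob_pred_svm = calibration_curve(y_test,
                                                 plats_calibration_preds,
                                                 n_bins=5)

fig = plt.figure()
fig.set_size_inches(8, 8)

# Predicted probability on the x-axis, observed frequency on the y-axis
plt.plot(prob_pred_svm, prob_true_svm, label='Calibrated SVM')
# Diagonal of a perfectly calibrated model
plt.plot(np.linspace(0, 1, 5), np.linspace(0, 1, 5), label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()

plt.show()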