# Introduction to Supervised Learning and Classification

---------------

### LDA

Linear Discriminant Analysis (LDA) finds a linear discriminant function that maximizes the between-class variance while minimizing the within-class variance. It assumes that the data in every class follow a Gaussian probability density function and that all classes share the same covariance matrix.

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)

In [1]:
# Importing libraries
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [2]:
# Data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

In [3]:
# Model initialization
clf = LinearDiscriminantAnalysis()

In [4]:
# Training
clf.fit(X, y)

In [5]:
# Predicting
print(clf.predict([[-0.8, -1]]))

[1]


In [6]:
# Model params details
clf.get_params()

{'covariance_estimator': None,
 'n_components': None,
 'priors': None,
 'shrinkage': None,
 'solver': 'svd',
 'store_covariance': False,
 'tol': 0.0001}
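
The "linear discriminant function" mentioned above can be inspected on the fitted model. The following is a minimal sketch, assuming the same toy data as in the cells above; the names `lda`, `scores` and `manual_scores` are illustrative and not part of the original notebook. For a two-class problem, the score returned by `decision_function` should equal a linear combination of the features plus an intercept.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Same toy data as in the LDA cells above
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

lda = LinearDiscriminantAnalysis().fit(X, y)

# Scores computed by scikit-learn
scores = lda.decision_function(X)

# The same scores rebuilt by hand as w . x + b (binary case: one weight vector)
manual_scores = X @ lda.coef_.ravel() + lda.intercept_[0]

# A positive score maps to the second class, otherwise to the first
manual_pred = np.where(manual_scores > 0, lda.classes_[1], lda.classes_[0])

print(np.allclose(scores, manual_scores))
print(manual_pred)
```

Because the score is linear in the features, the boundary between the two classes is a straight line in this two-dimensional example.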

---------------

### QDA

Quadratic Discriminant Analysis (QDA) is a generative model that assumes each class follows a Gaussian distribution. Unlike LDA, each class has its own covariance matrix, which makes the decision boundary quadratic. The class-specific prior is estimated as the proportion of training points that belong to the class, and the class-specific mean vector is the average of the input variables over those points.

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html)

In [7]:
# Importing libraries
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import numpy as np

In [8]:
# Data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

In [9]:
# Model
clf = QuadraticDiscriminantAnalysis()

In [10]:
# Training
clf.fit(X, y)

In [11]:
# Predicting
print(clf.predict([[-0.8, -1]]))

[1]


In [12]:
# Model params details
clf.get_params()

{'priors': None, 'reg_param': 0.0, 'store_covariance': False, 'tol': 0.0001}
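
The description of the class-specific prior and mean vector can be checked against the fitted attributes `priors_` and `means_`. This is a small sketch on the same toy data; the variables `qda`, `manual_priors` and `manual_means` are illustrative names introduced here.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Same toy data as in the QDA cells above
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

qda = QuadraticDiscriminantAnalysis().fit(X, y)

# Prior of each class: the fraction of training points with that label
manual_priors = np.array([np.mean(y == c) for c in qda.classes_])

# Mean vector of each class: the average of the rows with that label
manual_means = np.array([X[y == c].mean(axis=0) for c in qda.classes_])

print(qda.priors_, manual_priors)
print(qda.means_)
print(manual_means)
```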

---------------

### KNN

The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand, but has a major drawback: prediction becomes significantly slower as the size of the data grows.

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [13]:
# Importing library
from sklearn.neighbors import KNeighborsClassifier

In [14]:
# Data
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

In [15]:
# Model
neigh = KNeighborsClassifier(n_neighbors=3, 
                             weights='uniform', 
                             algorithm='auto', 
                             leaf_size=30, 
                             p=2, 
                             metric='minkowski', 
                             metric_params=None, 
                             n_jobs=None)

In [16]:
# Training
neigh.fit(X, y)

In [17]:
# Predicting
print(neigh.predict([[1.1]]))

[0]


In [18]:
# Probabilities estimated
print(neigh.predict_proba([[0.9]]))

[[0.66666667 0.33333333]]


In [19]:
# Model params details
neigh.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 3,
 'p': 2,
 'weights': 'uniform'}
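
To make the idea behind KNN concrete, here is a minimal from-scratch sketch of the classification rule, assuming Euclidean distance and uniform voting (the same settings as `p=2` and `weights='uniform'` above). The helper `knn_predict` is hypothetical and written only for illustration; it is not how scikit-learn implements the search internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - np.asarray(query, dtype=float), axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]
print(knn_predict(X_toy, y_toy, [1.1], k=3))
```

Every prediction scans all training points, which is one way to see why the cost of prediction grows with the size of the data.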

---------------

### Classification Evaluation Metrics

* A **classification report** measures the quality of predictions from a classification algorithm: how many predictions are correct and how many are not. More specifically, the counts of true positives, false positives, true negatives and false negatives are used to compute the metrics in the report (precision, recall and F1-score).

* A **confusion matrix** is a table that summarizes the performance of a classification algorithm by counting, for each true class, how many samples were assigned to each predicted class.

* The **precision-recall curve** is constructed by calculating and plotting the precision against the recall for a single classifier at a variety of thresholds. For example, if we use **logistic regression**, the threshold would be the predicted probability of an observation belonging to the positive class.

Documentation:
- [Metrics and scoring: quantifying the quality of predictions](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)
- [sklearn.metrics.classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report)
- [sklearn.metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix)
- [sklearn.metrics.precision_recall_curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve)

In [20]:
# Importing libraries
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve
from sklearn.metrics import PrecisionRecallDisplay

In [21]:
# Actual data and predicted data
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

target_names = ['class 0', 'class 1', 'class 2']

In [22]:
# Classification report
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5



In [23]:
# Confusion Matrix
# Confusion matrix whose i-th row and j-th column entry indicates the number of samples with true 
# label being i-th class and predicted label being j-th class.
confusion_matrix(y_true, y_pred)

array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 2]], dtype=int64)
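
The per-class precision and recall in the classification report can be recovered from the confusion matrix. The sketch below assumes the same `y_true` and `y_pred` as above; the variable names are illustrative. It works here because every class is predicted at least once, so no column sum is zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred)

# Diagonal entries: samples whose predicted class matches the true class
true_positives = np.diag(cm)

# Column sums: how often each class was predicted (TP + FP)
predicted_per_class = cm.sum(axis=0)

# Row sums: how often each class actually occurs (TP + FN)
actual_per_class = cm.sum(axis=1)

precision = true_positives / predicted_per_class
recall = true_positives / actual_per_class

print(precision)
print(recall)
```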

In [24]:
# Precision-recall pairs at different thresholds
# (restricted to the binary classification task)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

In [None]:
# Plot the Precision-Recall curve
display = PrecisionRecallDisplay.from_predictions(y_true, y_score, name="Classifier Model")
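
Each point on a precision-recall curve corresponds to one threshold applied to the scores. As a rough sketch, the helper below (a hypothetical function, `precision_recall_at`, not part of scikit-learn) computes a single point by hand for the binary example above. The thresholds are chosen so that at least one sample is predicted positive, which avoids a division by zero.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])


def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold are labelled positive."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)


for t in [0.35, 0.4, 0.8]:
    p, r = precision_recall_at(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold tends to trade recall for precision, which is the trade-off the curve visualizes.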

---------------

### Model Tuning

GridSearchCV performs an exhaustive search over a specified grid of parameter values, using cross-validation to score each combination. It takes the model (estimator) and the parameter grid as input. The best parameter values are selected, and by default the model is refit with them so that it can be used to make predictions.

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
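
A minimal sketch of how GridSearchCV might be used is shown below. The choice of dataset (the bundled iris data), estimator (`KNeighborsClassifier`) and grid values are assumptions made for illustration only; they do not come from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small bundled dataset, used here only as an example
X_iris, y_iris = load_iris(return_X_y=True)

# Grid of candidate parameter values (illustrative choices)
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}

# Each combination is scored with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_iris, y_iris)

print(search.best_params_)  # best combination found on the grid
print(search.best_score_)   # mean cross-validated accuracy of that combination

# With the default refit=True, the search object predicts with the best model
print(search.predict(X_iris[:3]))
```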

---------------

# Curated Articles - Introduction to Supervised Learning and Classification

- [LDA and QDA](https://maelfabien.github.io/machinelearning/LDA/#): This article goes into the fundamentals of LDA and QDA, the underlying mathematical formulation, the associated assumptions, and the differences between them.
- [Logistic Regression](https://kambria.io/blog/logistic-regression-for-machine-learning/): This is a basic article that talks about the key concepts of the logistic regression algorithm - its need, underlying mathematical formulation, and how that can help in classifying binary / multiple classes.
- [Confusion Matrix](https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/): This article explains one of the most powerful performance measurement tools for classification models - the confusion matrix. This is a detailed read that helps throw light on key concepts and interpretations involved with various metrics that can be calculated from the confusion matrix.
- [The K-NN Algorithm](https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning): This article talks about the K-NN algorithm and the steps involved in applying it to your data. It also explains the various advantages and disadvantages of K-NN (along with the Python implementation).

---------------