# Machine Learning with scikit-learn

## What Is scikit-learn?

Scikit-learn provides a wide range of machine learning algorithms unified under a common and intuitive API. Most of its dozens of model classes share nearly the same calling interface. Very often, as we will see in the examples below, you can substitute one algorithm for another with almost no change to the surrounding code. This lets you explore the problem space quickly and often arrive at an optimal, or at least satisficing$^1$, approach to your problem domain or datasets.

* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license
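
The interchangeability described above can be sketched as follows. This is only a minimal illustration: the bundled iris dataset and the three classifiers are arbitrary choices made for the example, not ones taken from this text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed example data: the iris dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three unrelated algorithms, all driven through the same fit/score calls
models = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]

scores = {}
for model in models:
    model.fit(X_train, y_train)                       # same call for every model
    scores[type(model).__name__] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Only the list of models changes when you try a different algorithm; the loop body stays the same.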

<hr/>

<small>$^1$<i>Satisficing is a decision-making strategy of searching through the alternatives until an acceptability threshold is met. It is a portmanteau of satisfy and suffice, and was introduced by Herbert A. Simon in 1956. He maintained that many natural problems are characterized by computational intractability or a lack of information, both of which preclude the use of mathematical optimization procedures.</i></small>

## Overview of Techniques Used in Machine Learning

The diagram below is from the scikit-learn documentation, but the general schematic of techniques and algorithms that it outlines applies equally to any other library. Most of the classes shown in the bubbles have equivalents in other libraries.



![](img/sklearn-topics.png)


## Classification versus Regression versus Clustering

### Classification

Classification is a type of **supervised learning** in which the targets for a prediction are a set of categorical values.

### Regression

Regression is a type of **supervised learning** in which the targets for a prediction are quantitative or continuous values.

### Clustering

Clustering is a type of **unsupervised learning** where you want to identify similarities among collections of items without an *a priori* classification scheme. You may or may not have an *a priori* expectation about the number of categories.
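
The three kinds of task can be contrasted on one tiny feature matrix. The sketch below uses made-up numbers and arbitrarily chosen estimators (`DecisionTreeClassifier`, `LinearRegression`, `KMeans`), purely to show how the targets differ in each case.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Made-up data: one feature, six observations in two obvious groups
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])

# Classification: the targets are categorical labels
y_class = np.array(["low", "low", "low", "high", "high", "high"])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
class_pred = clf.predict([[1.5], [10.5]])

# Regression: the targets are continuous values (here exactly 2x + 1)
y_reg = 2 * X.ravel() + 1
reg = LinearRegression().fit(X, y_reg)
reg_pred = reg.predict([[3.0]])[0]

# Clustering: no targets at all, only a requested number of clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(class_pred, reg_pred, cluster_ids)
```

Note that the cluster identifiers are arbitrary integers; only the grouping they induce is meaningful.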

## Categorical versus Ordinal versus Continuous Variables

Features come in one of three basic types.

### Categorical variables 

Some are **categorical** (also called nominal): A discrete set of values that a feature may assume, often named by words or codes (but sometimes confusingly as integers where an order may be misleadingly implied).

### Ordinal variables

Some are **ordinal**: There is a scale from low to high in the data values, but the spacing in the data may have little to no relationship to the underlying phenomenon. For example, while an airline or credit card "reward program" might have levels of Gold/Silver/Platinum/Diamond, there is probably no real sense in which Diamond is "4 times as much" as Gold, even though they are encoded as 1-4.
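
One way to keep the order without pretending the spacing is meaningful is an ordered categorical in pandas. The sketch below assumes the levels run from low to high in the order listed above; the sample values are invented.

```python
import pandas as pd

# Assumed low-to-high order, taken from the order the levels are listed in
levels = ["Gold", "Silver", "Platinum", "Diamond"]

# Invented sample of customer reward levels
tiers = pd.Series(["Silver", "Diamond", "Gold", "Silver"])

# An ordered categorical records rank, not magnitude
ordered = pd.Categorical(tiers, categories=levels, ordered=True)

# Integer codes 1-4 as in the text; differences between them carry no meaning
ranks = ordered.codes + 1

# Order comparisons are valid even though arithmetic on the codes is not
above_gold = ordered > "Gold"

print(list(ranks))
print(list(above_gold))
```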

### Continuous variables

Some are **continuous** or quantitative: Some quantity is actually measured such that a number represents the amount of it. The distribution of these measurements is likely not to be uniform and linear (in which case scaling might be relevant), but there is a real thing being measured. Measurements might be quantized for continuous variables, but that does not necessarily make them ordinal instead. For example, we might measure annual rainfall in each town only to the nearest inch, and hence have integers for that feature.
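
As a rough sketch of the scaling mentioned above, the rainfall example could be standardized with scikit-learn's `StandardScaler`. The rainfall figures are invented, and standardization is just one of several possible scaling choices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented annual rainfall per town, measured to the nearest inch
rainfall_inches = np.array([[12], [30], [45], [8], [60]], dtype=float)

# Standardize: subtract the mean, then divide by the standard deviation
scaler = StandardScaler()
scaled = scaler.fit_transform(rainfall_inches)

print(scaled.ravel().round(2))
```

The quantized integers remain a continuous feature: after scaling they have mean 0 and unit variance, but their relative spacing is preserved.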

This notion of types of variables applies to statistics broadly. Some other concepts are genuinely specific to machine learning.  

## One-hot Encoding

For many machine learning algorithms, including neural networks, it is more useful to have a categorical feature with N possible values encoded as N features, each taking a binary value. Several tools, including a couple of functions in scikit-learn, will transform raw datasets into this format. Encoding this way necessarily increases dimensionality.

Let us illustrate using a toy test dataset. The following whimsical data is suggested in a post by [Håkon Hapnes Strand](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science). Imagine we collected some data on individual organisms: taxonomic class, height, and lifespan. Depending on our purpose, we might use this data for either supervised or unsupervised learning techniques (given many more observations and several more features).

In [None]:
# Data, one row per individual organism: taxonomic class; height; lifespan

data = [
    ['human', 1.7, 85],
    ['alien', 1.8, 92],
    ['penguin', 1.2, 37],
    ['octopus', 2.3, 25],
    ['alien', 1.7, 85],
    ['human', 1.2, 37],
    ['octopus', 0.4, 8],
    ['human', 2.0, 97]
]

print(data)  # perform a raw print

In [None]:
# The data with its original feature, just as a DataFrame
import pandas as pd
naive = pd.DataFrame(data, columns=['species', 'height (m)', 'lifespan (years)'])
naive

In [None]:
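# The data one-hot encoded: get_dummies expands the string column 'species'
# into one indicator column per value (species_alien, species_human, ...)
encoded = pd.get_dummies(naive)
# encoded   # uncomment to display the columns before the prefix is stripped

# Drop the 'species_' prefix so each column is named by the bare category
encoded.columns = [c.replace('species_', '') for c in encoded.columns]
encoded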
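
For comparison, here is a sketch of the same encoding done with scikit-learn's `OneHotEncoder`. The text does not name which scikit-learn functions it has in mind, so treating this class as one of them is an assumption; the species column is repeated here so the block runs on its own.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The species column from the toy dataset, as a 2-D array (one feature)
species = np.array([
    ["human"], ["alien"], ["penguin"], ["octopus"],
    ["alien"], ["human"], ["octopus"], ["human"],
])

encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default; densify it for display
onehot = encoder.fit_transform(species).toarray()

# Column order follows the sorted category names learned during fitting
categories = list(encoder.categories_[0])

print(categories)
print(onehot)
```

Unlike `pd.get_dummies`, the fitted encoder remembers its categories, so the same column layout can be applied later to new data with `encoder.transform`.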

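### Code Examples

The cells below compare a small set of hypothetical predictions against the true labels using a confusion matrix, then compute recall, precision, accuracy, and F1 score.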

In [None]:
from sklearn.metrics import confusion_matrix
import numpy as np

y_true = ["human",   "octopus", "human", "human", "octopus", "penguin", "penguin"]
y_pred = ["octopus", "octopus", "human", "human", "octopus", "human",   "penguin"]
labels = ['octopus', 'penguin', 'human']

In [None]:
cm = confusion_matrix(y_true, y_pred, labels=labels)

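#print('---------------')
#print(cm[0,0])  # explore C_{0,0}: actual octopus, predicted octopus
#print('---------------')

# scikit-learn puts actual labels on the rows and predicted labels on the columns
print("Confusion Matrix (rows: actual, columns: predicted):\n",
      pd.DataFrame(cm, index=labels, columns=labels), sep="")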

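# Recall: correct predictions per class divided by the row sums (actual counts)
recall = np.diag(cm) / np.sum(cm, axis=1)
print("\nRecall:\n", pd.Series(recall, index=labels), sep="")

# Precision: correct predictions per class divided by the column sums (predicted counts)
precision = np.diag(cm) / np.sum(cm, axis=0)
print("\nPrecision:\n", pd.Series(precision, index=labels), sep="")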

print("\nAccuracy:\n", np.sum(np.diag(cm)) / np.sum(cm))
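
The NumPy arithmetic above can be cross-checked against scikit-learn's own metric functions. This is a sketch that repeats the labels so it runs on its own; passing `average=None` asks for one score per class, in the order given by `labels`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["human", "octopus", "human", "human", "octopus", "penguin", "penguin"]
y_pred = ["octopus", "octopus", "human", "human", "octopus", "human", "penguin"]
labels = ["octopus", "penguin", "human"]

# average=None returns an array of per-class scores instead of one number
sk_recall = recall_score(y_true, y_pred, labels=labels, average=None)
sk_precision = precision_score(y_true, y_pred, labels=labels, average=None)
sk_accuracy = accuracy_score(y_true, y_pred)

for name, r, p in zip(labels, sk_recall, sk_precision):
    print(f"{name:8s} recall={r:.3f} precision={p:.3f}")
print(f"accuracy={sk_accuracy:.3f}")
```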


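In this particular case, the F1 score is very close to accuracy. In fact, for single-label multiclass data like this, "micro" averaging reduces the result to accuracy. "Macro" averaging is the unweighted mean of the per-class scores $F_1 = 2PR/(P+R)$, where $P$ and $R$ are the precision and recall of each class, so it can be reproduced with a NumPy reduction.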

In [None]:
from sklearn.metrics import f1_score
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print("\nWeighted F1 score:\n", weighted_f1, sep="")

macro_f1 = f1_score(y_true, y_pred, average='macro')
print("\nMacro F1 score:\n", macro_f1, sep="")


micro_f1 = f1_score(y_true, y_pred, average='micro')
print("\nMicro F1 score:\n", micro_f1, sep="")

In [None]:
print("Naive averaging F1 score:", np.mean(2*(recall*precision)/(recall+precision)))
print(" sklearn macro averaging:", f1_score(y_true, y_pred, average="macro"))
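
To see where the averaged numbers come from, the per-class F1 scores can be inspected directly. The sketch below repeats the labels so it is self-contained, and also checks the claim that micro averaging coincides with accuracy for this data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = ["human", "octopus", "human", "human", "octopus", "penguin", "penguin"]
y_pred = ["octopus", "octopus", "human", "human", "octopus", "human", "penguin"]
labels = ["octopus", "penguin", "human"]

# One F1 score per class, in the order given by `labels`
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)

# Macro averaging is the plain mean of the per-class scores
macro_from_classes = np.mean(per_class_f1)

# Micro averaging pools all decisions, which equals accuracy here
micro = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print("Per-class F1:", dict(zip(labels, per_class_f1.round(3))))
print("Macro F1:", macro_from_classes)
print("Micro F1 equals accuracy:", np.isclose(micro, accuracy))
```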

## Conclusion

We have learnt:
* A roadmap of scikit-learn: which learning models to use for which kind of problem
* pandas and its DataFrame
* One-hot encoding
* Confusion matrix
* Accuracy
* Precision
* Recall
* F1 score