# Machine learning using scikit-learn

There are many kinds of machine learning algorithms. Here, we consider two: *supervised* and *unsupervised* learning.

Examples of supervised learning tasks: classification, regression, etc.  
Examples of unsupervised learning tasks: clustering, dimensionality reduction, etc.

## scikit-learn estimators

Scikit-learn strives to have a uniform interface across all objects. Given a scikit-learn *estimator* named `model`, the following methods are available:

- Available in **all estimators**
  + `model.fit()` : Fit training data. For supervised learning applications,
    this accepts two arguments: the data `X` and the labels `y` (e.g., `model.fit(X, y)`).
    For unsupervised learning applications, ``fit`` takes only a single argument,
    the data `X` (e.g. `model.fit(X)`).
    
- Available in **supervised estimators**
  + `model.predict()` : Given a trained model, predict the label of a new set of data.
    This method accepts one argument, the new data `X_new` (e.g., `model.predict(X_new)`),
    and returns the learned label for each object in the array.
  + `model.predict_proba()` : For classification problems, some estimators also provide
    this method, which returns the probability that a new observation has each categorical label.
    In this case, the label with the highest probability is returned by `model.predict()`.
  + `model.score()` : An indication of how well the model fits the given data and labels
    (e.g., `model.score(X_test, y_test)`). For classifiers this is the mean accuracy, which lies
    between 0 and 1. For regressors it is the coefficient of determination (R²), which is at most 1
    and can be negative. In both cases, a larger score indicates a better fit.

- Available in **clustering estimators** (unsupervised)
  + `model.fit_predict()` : Fits the model and returns the cluster label of each sample
    in one step (e.g., `model.fit_predict(X)`).
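
The uniform interface described above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the choice of `LogisticRegression` and the reuse of the first five training rows as `X_new` are assumptions made only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative data: 150 iris samples with 4 features and 3 classes
X, y = load_iris(return_X_y=True)

# Supervised estimators are fitted on the data and the labels
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

X_new = X[:5]                        # stand-in for unseen samples
labels = model.predict(X_new)        # one predicted label per sample
probas = model.predict_proba(X_new)  # one probability per sample and class
accuracy = model.score(X, y)         # mean accuracy for classifiers

print(labels.shape, probas.shape, accuracy)
```

Any other scikit-learn classifier could be swapped in for `LogisticRegression` without changing the remaining lines, which is the point of the shared interface.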
  
## Data in scikit-learn

Data in scikit-learn, with very few exceptions, is assumed to be stored as a
**two-dimensional array** of size `[n_samples, n_features]`. Many algorithms also accept ``scipy.sparse`` matrices of the same shape.

- **n_samples:**   The number of samples: each sample is an item to process (e.g., classify).
  A sample can be a document, a picture, a sound, a video, an astronomical object,
  a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.
- **n_features:**  The number of features or distinct traits that can be used to describe each
  item in a quantitative manner. Features are generally real-valued, but may be boolean or
  discrete-valued in some cases.
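
As a quick check of the `[n_samples, n_features]` layout, the sketch below inspects the shape of the iris data and reshapes a single feature into a column. The variable names are chosen for illustration only.

```python
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n_samples, n_features = X.shape   # rows are samples, columns are features

# A single feature must still be passed as a 2-D column, not a 1-D array
petal_length = X[:, 2]                           # 1-D array, one value per sample
petal_length_2d = petal_length.reshape(-1, 1)    # 2-D array with a single column

print(X.shape, petal_length.shape, petal_length_2d.shape)
```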
  
scikit-learn already accepts pandas DataFrames as input. Returning DataFrames from transformers is on the horizon (see [proposal](https://github.com/scikit-learn/enhancement_proposals/blob/4588fc41908eaa8e3a3988addb00084e14a5f888/slep014/proposal.rst)).
  
### Numerical vs. categorical

What if you have categorical features?  For example, imagine there is a dataset containing the color of the
iris:

    color in [red, blue, purple]

You might be tempted to assign numbers to these categories, e.g., *red=1, blue=2, purple=3*,
but in general **this is a bad idea**.  Estimators tend to operate under the assumption that
numerical features lie on some continuous scale, so that, for example, 1 and 2 are more alike
than 1 and 3, which is often not the case for categorical features.

A better strategy is to give each category its own dimension.  
The enriched iris feature set would then be:

- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- color=purple (1.0 or 0.0)
- color=blue (1.0 or 0.0)
- color=red (1.0 or 0.0)

Note that using many of these categorical features may result in data that is better
represented as a **sparse matrix**, because most entries of the encoded columns are zero.
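
To illustrate the one-column-per-category idea and the resulting sparsity, here is a small sketch using `OneHotEncoder` from `sklearn.preprocessing`. The encoder and the five-flower color column are assumptions added for illustration; they do not appear in the original notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical color column for five flowers (one feature, five samples)
colors = np.array([['red'], ['blue'], ['purple'], ['blue'], ['red']])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors)

print(sparse.issparse(encoded))   # by default the result is a SciPy sparse matrix
print(encoder.categories_)        # the categories found in the column, in sorted order
print(encoded.toarray())          # dense view with one column per category
```

Each row contains exactly one non-zero entry, so with many categories most of the matrix is zero, which is why a sparse representation pays off.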

#### Using the DictVectorizer to encode categorical features

When the source data is a list of dicts whose values are either string category names or numerical values, you can use the `DictVectorizer` class to compute the boolean expansion of the categorical features while leaving the numerical features unchanged:

In [1]:
# disable some annoying warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

import numpy as np
import pandas as pd

import altair as alt

In [2]:
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

In [3]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
tf_measurements = vec.fit_transform(measurements)
tf_measurements.toarray()

array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])

In [4]:
vec.get_feature_names_out()

array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'],
      dtype=object)

In [5]:
pd.DataFrame(tf_measurements.toarray(), columns=vec.get_feature_names_out())

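|   | city=Dubai | city=London | city=San Francisco | temperature |
|---|------------|-------------|--------------------|-------------|
| 0 | 1.0 | 0.0 | 0.0 | 33.0 |
| 1 | 0.0 | 1.0 | 0.0 | 12.0 |
| 2 | 0.0 | 0.0 | 1.0 | 18.0 |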


#### Using Pandas to encode categorical features
You can also use pandas to encode categorical features with the [get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function.

In [6]:
import pandas as pd

In [7]:
pd_measurement = pd.DataFrame(measurements)
pd_measurement.head()

|   | city | temperature |
|---|------|-------------|
| 0 | Dubai | 33.0 |
| 1 | London | 12.0 |
| 2 | San Francisco | 18.0 |


In [8]:
# get all categorical features of the data
pd_measurement.select_dtypes(exclude=['number']).columns

Index(['city'], dtype='object')

In [9]:
# encode the categorical features
encoded_data = pd.get_dummies(pd_measurement)
# hint: be careful about which features to encode. For example, avoid the 'car' (= name) feature from the mtcars.csv file, as it's an identifier for every car

encoded_data.head()

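|   | temperature | city_Dubai | city_London | city_San Francisco |
|---|-------------|------------|-------------|--------------------|
| 0 | 33.0 | 1 | 0 | 0 |
| 1 | 12.0 | 0 | 1 | 0 |
| 2 | 18.0 | 0 | 0 | 1 |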


## Unsupervised Partitioning using K-Means
With the K-means algorithm, you can partition your data into _k_ clusters.

In [10]:
# load the iris dataset
import sklearn.datasets

data = sklearn.datasets.load_iris(as_frame=True)
iris = data.frame
iris.head()

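|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|-------------------|------------------|-------------------|------------------|--------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |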


In [11]:
sklearn.__version__

'1.0.1'

In [12]:
from sklearn.cluster import KMeans
iris['predict'] = KMeans(n_clusters=3, random_state = 102).fit_predict(data.data)

In [13]:
chart = alt.Chart(iris).mark_circle().encode(
    alt.X('sepal length (cm)', scale=alt.Scale(zero=False)),
    alt.Y('sepal width (cm)', scale=alt.Scale(zero=False))
)

chart.encode(color="target:N").properties(title='Ground Truth') | \
    chart.encode(color="predict:N").properties(title='KMeans-3 clusterer')
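
When comparing the two panels, keep in mind that K-Means cluster ids are arbitrary, so cluster 0 need not correspond to target 0. One way to quantify the agreement independently of the label names is the adjusted Rand index. The sketch below is an addition to the notebook and assumes `adjusted_rand_score` from `sklearn.metrics` is an acceptable measure here; `n_init=10` is set explicitly so the result does not depend on the library version's default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)
clusters = KMeans(n_clusters=3, n_init=10, random_state=102).fit_predict(X)

# Cluster ids are arbitrary, so compare the partitions rather than the raw labels
ari = adjusted_rand_score(y, clusters)
print(f"Adjusted Rand index: {ari:.2f}")
```

A value of 1 means the clustering reproduces the ground truth exactly, while values near 0 indicate agreement no better than chance.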

## Supervised classification using decision trees

Well, the K-Means result does not match the ground truth that well. Let's use a supervised classifier instead.

First, split our data into a training and a test set.

In [14]:
import sklearn.model_selection

data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(
    data.data, data.target, test_size=0.20, random_state = 5)

print(data.data.shape, data_train.shape, data_test.shape)

(150, 4) (120, 4) (30, 4)


Now, we use a `DecisionTreeClassifier` to learn a model on the training set and evaluate it on the test set.

In [15]:
from sklearn.tree import DecisionTreeClassifier

instance = DecisionTreeClassifier()
r = instance.fit(data_train, target_train)
target_predict = instance.predict(data_test)

from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
print('Prediction accuracy: ', accuracy_score(target_test, target_predict))


Prediction accuracy:  0.8666666666666667


Pretty good, isn't it?
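
A single train/test split gives an accuracy estimate that depends on which 30 samples end up in the test set. A more robust estimate can be sketched with cross-validation. This is an addition to the notebook: the use of `cross_val_score`, five folds, and the fixed `random_state` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# random_state is fixed here only to make the sketch reproducible
tree = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(tree, X, y, cv=5)   # accuracy on each of the 5 folds

print(scores)
print(scores.mean(), scores.std())
```

The spread of the per-fold scores gives a rough idea of how much the accuracy of a single split can vary.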

## Dimension reduction using MDS and PCA

If we go back to our K-Means example, the clustering doesn't seem to make sense. However, the scatter plot shows only two of the four dimensions, so we can't see the real distances and similarities between items. Dimension reduction techniques reduce the number of dimensions while preserving as much of the structure of the original high-dimensional data as possible. We take a look at two of them: Multidimensional Scaling (MDS) and Principal Component Analysis (PCA).

In [16]:
from sklearn import manifold

#create mds instance
mds = manifold.MDS(n_components=2, random_state=5)

#fit the model and get the embedded coordinates
pos = mds.fit_transform(data.data)
iris['mds_x']= pos[:, 0]
iris['mds_y']= pos[:, 1]

chart = alt.Chart(iris).mark_point().encode(
    alt.X('mds_x'),
    alt.Y('mds_y'),
    color='target:N'
)

chart

In [17]:
#compare with PCA

from sklearn import decomposition

pca = decomposition.PCA(n_components=2)
pca_pos = pca.fit_transform(data.data)
iris['pca_x']= pca_pos[:, 0]
iris['pca_y']= pca_pos[:, 1]

chart | chart.encode(x='pca_x', y='pca_y')

It seems that versicolor and virginica are more similar to each other than either is to setosa.
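
To judge how much of the original structure a 2D PCA projection keeps, one can look at the share of variance captured by each component. The following sketch is an addition to the notebook and relies on the `explained_variance_ratio_` attribute of a fitted `PCA` object.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the two ratios sum to a value close to 1, little information is lost by plotting only the first two components.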

## TASK

> Create an interactive colored plot of the Iris dataset projected in 2D using MDS. The color should correspond to the result of a K-Means clustering algorithm, where the user can interactively define the number of clusters between 1 and 10.

In [18]:
from ipywidgets import interact

@interact(n_clusters=(1,10))
def draw_plot(n_clusters):
    kmeans = KMeans(n_clusters=n_clusters, random_state = 102)
    iris['predict'] = kmeans.fit_predict(data.data)
    return alt.Chart(iris).mark_point().encode(
        alt.X('mds_x'),
        alt.Y('mds_y'),
        color='predict:N'
    )


Thanks!