# Unsupervised Learning

As we saw, many models can be used for trajectory classification. In this notebook we use and test unsupervised learning models.

First, we load the feature vectors that describe each trajectory by its characteristics.

In [1]:
import feature_vec as fv

metadata = fv.get_selected_data()
feat_vectors, clss_mask, clss = fv.get_feat_vectors(metadata)

100.00%

## K-Means

We will use the KMeans model with 5 clusters, the idea being that each cluster ends up corresponding to one class. We also transform the data before clustering: we keep the most important components with a PCA model and standardize the result.

In [2]:
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

feat_v = StandardScaler().fit_transform(PCA(n_components=30, random_state=0).fit_transform(feat_vectors))

kmeans = KMeans(n_clusters=5,
                n_init=15,
                init='random',
                tol=1e-6, 
                random_state=0,
                verbose=True,
                algorithm='elkan')
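
The cell above applies PCA first and scales afterwards. As a point of comparison, here is a minimal sketch of the other common ordering (scale, then PCA, then cluster) written with `make_pipeline`. The array `demo_vectors` and its shape are assumptions standing in for `feat_vectors`, and the parameter values are illustrative rather than tuned.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for feat_vectors (assumption: 200 samples, 40 features).
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(200, 40))

# Scale first so that PCA is not dominated by features with large ranges,
# then reduce the dimensionality, then cluster.
demo_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=0),
    KMeans(n_clusters=5, n_init=10, random_state=0),
)

# fit_predict runs every transform and returns the KMeans cluster labels.
demo_labels = demo_pipeline.fit_predict(demo_vectors)
print(np.bincount(demo_labels))
```

Keeping the steps in one pipeline makes it harder to evaluate a model on data that went through a different preprocessing than the one it was fitted on.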


Now let's check how pure the clusters are.

In [11]:
import pprint
import numpy as np

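def count_classes(model, classes=None, cls_count=5):
    """Print, for every cluster, how many samples of each class it contains."""
    if classes is None:
        classes = clss
    # One list of per-class counters per cluster; the noise label -1 is excluded.
    n_clusters = len(set(model.labels_[model.labels_ != -1]))
    count_dict = {i: [0] * cls_count for i in range(n_clusters)}
    for label, cls in zip(model.labels_, classes):
        if label == -1:
            continue  # skip noise points (DBSCAN, OPTICS)
        count_dict[label][cls] += 1
    pprint.pprint(count_dict)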
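
The per-cluster counts can also be summarized in a single number. The sketch below computes cluster purity, the fraction of clustered samples that belong to the majority class of their cluster. The helper `cluster_purity` and the `demo_*` lists are hypothetical and not part of `feature_vec`; the sketch assumes classes are encoded as non-negative integers and that at least one sample is not noise.

```python
import numpy as np

def cluster_purity(labels_true, labels_pred):
    """Fraction of non-noise samples in the majority class of their cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    keep = labels_pred != -1  # ignore noise points, like count_classes does
    labels_true, labels_pred = labels_true[keep], labels_pred[keep]
    majority_total = 0
    for cluster in np.unique(labels_pred):
        members = labels_true[labels_pred == cluster]
        # Size of the largest class inside this cluster.
        majority_total += np.bincount(members).max()
    return majority_total / len(labels_pred)

demo_true = [0, 0, 0, 1, 1, 2]
demo_pred = [0, 0, 1, 1, 1, -1]
print(cluster_purity(demo_true, demo_pred))  # 0.8
```

A purity of 1.0 means every cluster contains a single class. Purity alone rewards over-splitting, so it is best read next to the per-cluster counts.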

In [4]:
y_pred = kmeans.fit_predict(feat_v)
count_classes(kmeans)

Initialization complete
Iteration 0, inertia 124322.17915291403
Iteration 1, inertia 110874.02438229012
Iteration 2, inertia 110098.84092695064
Iteration 3, inertia 109446.66400887344
Iteration 4, inertia 109142.35233711892
Iteration 5, inertia 108999.36674464999
Iteration 6, inertia 108907.66137573136
Iteration 7, inertia 108825.06883077111
Iteration 8, inertia 108730.36910133551
Iteration 9, inertia 108652.95006652156
Iteration 10, inertia 108586.73105285979
Iteration 11, inertia 108503.1155604387
Iteration 12, inertia 108362.86568897488
Iteration 13, inertia 108187.6026159172
Iteration 14, inertia 108119.79687883561
Iteration 15, inertia 108100.25752606614
Iteration 16, inertia 108088.59622577779
Iteration 17, inertia 108074.2078293663
Iteration 18, inertia 108041.71355991889
Iteration 19, inertia 107981.53601436625
Iteration 20, inertia 107924.08140683039
Iteration 21, inertia 107888.62007062673
Iteration 22, inertia 107848.20853902522
Iteration 23, inertia 107782.01546557544
...

Iteration 15, inertia 108030.87843852662
Iteration 16, inertia 108019.05480510287
Iteration 17, inertia 107987.51737799961
Iteration 18, inertia 107910.35366455081
Iteration 19, inertia 107812.7636811995
Iteration 20, inertia 107747.6064370014
Iteration 21, inertia 107703.13858983667
Iteration 22, inertia 107669.1967422175
Iteration 23, inertia 107652.79537821152
Iteration 24, inertia 107643.36709708425
Iteration 25, inertia 107632.26881423486
Iteration 26, inertia 107617.5201571949
Iteration 27, inertia 107605.8479126519
Iteration 28, inertia 107592.54940874818
Iteration 29, inertia 107579.13448177923
Iteration 30, inertia 107573.25902567571
Iteration 31, inertia 107570.52455655747
Iteration 32, inertia 107568.09825683002
Iteration 33, inertia 107559.79282888958
Iteration 34, inertia 107535.34166825555
Iteration 35, inertia 107523.67338210932
Iteration 36, inertia 107507.88032757907
Iteration 37, inertia 107488.18733286274
Iteration 38, inertia 107456.86329807281
...

The result does not look good: the clusters are far from pure.

Now let's compute the homogeneity and completeness of the clusters, and the Silhouette Coefficient.

In [5]:
from sklearn import metrics

# Calculate the homogeneity and completeness of the clusters.
homogeneity = metrics.homogeneity_score(clss, y_pred)
completeness = metrics.completeness_score(clss, y_pred)

# Calculate the Silhouette coefficient for each sample.
# Note: the distances are taken on the raw feat_vectors, not on the
# transformed feat_v that KMeans was fitted on.
silh = metrics.silhouette_samples(feat_vectors, y_pred)

# Calculate the mean Silhouette coefficient over all samples.
silh_mean = metrics.silhouette_score(feat_vectors, y_pred)

print(homogeneity,
      completeness,
      silh, silh_mean)


0.07789914357567077 0.094675106734564 [ 0.60540582 -0.96862113 -0.96862113 ... -0.6062343   0.60540582
 -0.6062343 ] -0.49642492760458085
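
To make the two label-based scores easier to read, here is a small toy example (not the trajectory data) showing how homogeneity and completeness pull in opposite directions. The label lists are made up for illustration.

```python
from sklearn import metrics

true_classes = [0, 0, 1, 1]

# Every point in its own cluster: each cluster is pure (homogeneity 1.0),
# but each class is split over two clusters (completeness 0.5).
over_split = [0, 1, 2, 3]
h_split = metrics.homogeneity_score(true_classes, over_split)
c_split = metrics.completeness_score(true_classes, over_split)

# A single cluster for everything: no class is split (completeness 1.0),
# but the cluster mixes both classes (homogeneity 0.0).
one_cluster = [0, 0, 0, 0]
h_one = metrics.homogeneity_score(true_classes, one_cluster)
c_one = metrics.completeness_score(true_classes, one_cluster)

print(h_split, c_split)
print(h_one, c_one)
```

Values close to 0 for both scores, as obtained above, indicate that the clusters neither isolate the classes nor keep them together.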


## DBSCAN

We now apply the DBSCAN model as another alternative.

In [6]:
from sklearn.cluster import DBSCAN

We use `StandardScaler` to standardize the features (here, the PCA components) to zero mean and unit variance.

In [7]:
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(PCA(n_components=8).fit_transform(feat_vectors))

dbscan = DBSCAN(eps=0.1, min_samples=14).fit(X)

count_classes(dbscan)

{0: [92, 149, 5, 4, 258],
 1: [21, 0, 0, 0, 0],
 2: [12, 1, 0, 0, 1],
 3: [35, 17, 0, 0, 13]}
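
DBSCAN results depend strongly on `eps`. One common heuristic, not used in this notebook, is to look at the distance from each point to its `min_samples`-th nearest neighbour and pick `eps` from that distribution. The sketch below illustrates the idea on synthetic data; `demo_X` and the 90th-percentile rule are assumptions for illustration, not a recommendation for the trajectory features.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-in for X: two tight blobs plus some scattered points.
demo_X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(100, 2)),
    rng.normal(loc=3.0, scale=0.1, size=(100, 2)),
    rng.uniform(low=-2.0, high=5.0, size=(20, 2)),
])

min_samples = 14
# Distance from each point to its min_samples-th nearest neighbour.
# The query point itself counts as the first neighbour (distance 0),
# which matches how DBSCAN counts min_samples.
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(demo_X)
distances, _ = neighbors.kneighbors(demo_X)
k_distances = np.sort(distances[:, -1])

# Hypothetical rule: use a high percentile of the k-distances as eps.
demo_eps = np.percentile(k_distances, 90)
demo_dbscan = DBSCAN(eps=demo_eps, min_samples=min_samples).fit(demo_X)
noise_fraction = np.mean(demo_dbscan.labels_ == -1)
print(demo_eps, noise_fraction)
```

Plotting `k_distances` and looking for the point where the curve bends sharply is the usual manual variant of the same idea.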


## OPTICS

In [17]:
from sklearn.cluster import OPTICS

new_md = [md for md in metadata if md["class"] in ["walk", "car", "bike"]]
new_feat_vectors, _, new_clss = fv.get_feat_vectors(new_md)


100.00%

In [34]:
from sklearn.model_selection import train_test_split

X = StandardScaler().fit_transform(PCA(n_components=5, random_state=0).fit_transform(new_feat_vectors))
optics = OPTICS(min_samples=34)

y_pred = optics.fit_predict(X)

count_classes(optics, new_clss, 3)

{0: [1, 38, 0], 1: [1, 7, 43], 2: [64, 5, 2]}


In [39]:
print("Homogeneity:", metrics.homogeneity_score(new_clss, y_pred))
print("Completeness:", metrics.completeness_score(new_clss, y_pred))
print("Silhouette score:", metrics.silhouette_score(X, y_pred))

Homogeneity: 0.055920123521278814
Completeness: 0.19366948821911267
Silhouette score: -0.45836681885834213
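
OPTICS labels noise points as `-1`, and the scores above treat that label as one more cluster. This may be part of why the homogeneity is low even though the counted clusters look fairly pure. The sketch below, on made-up labels, compares the score over all points with the score restricted to clustered points; `demo_true` and `demo_pred` are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Two pure clusters (labels 0 and 1) plus four noise points (label -1).
demo_true = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
demo_pred = np.array([0, 0, 0, 1, 1, 1, -1, -1, -1, -1])

clustered = demo_pred != -1

# Noise counted as a cluster: the mixed "-1" group lowers the score.
h_all = metrics.homogeneity_score(demo_true, demo_pred)
# Noise removed: only the real clusters are evaluated.
h_clustered = metrics.homogeneity_score(demo_true[clustered], demo_pred[clustered])
# Fraction of points that were assigned to a cluster at all.
coverage = clustered.mean()

print(h_all, h_clustered, coverage)
```

Reporting the restricted score together with the coverage keeps the comparison honest: a model can look very pure simply by discarding most points as noise.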
