# Unsupervised Learning

As we have seen, many models are used for trajectory classification. In this notebook we use and test unsupervised learning models (in case that was not obvious enough already).

First of all, we load the vectors that describe each trajectory by its features.

In [16]:
import feature_vec as fv

metadata = fv.get_selected_data()
feat_vectors, clss_mask, clss = fv.get_feat_vectors(metadata)

100.00%

## K-Means

We apply the K-Means algorithm to the data.

In [88]:
from sklearn.cluster import KMeans
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Reduce to 30 principal components, then rescale each component to zero mean and unit variance.
feat_v = StandardScaler().fit_transform(PCA(n_components=30, random_state=0).fit_transform(feat_vectors))

kmeans = KMeans(n_clusters=5,
                n_init=15,
                init='random',
                tol=1e-6,
                random_state=0,
                verbose=True,
                algorithm='elkan')

y_pred = kmeans.fit_predict(feat_v)

# count_dict[cluster][class] = number of trajectories of that class assigned to that cluster.
count_dict = {i: [0]*5 for i in range(5)}
for cluster, true_class in zip(kmeans.labels_, clss):
    count_dict[cluster][true_class] += 1

count_dict

Initialization complete
Iteration 0, inertia 126205.77228228719
Iteration 1, inertia 110903.28182396787
Iteration 2, inertia 110305.38403939131
Iteration 3, inertia 109916.39790080063
Iteration 4, inertia 109561.94322885176
Iteration 5, inertia 109209.03269696262
Iteration 6, inertia 108756.51001883682
Iteration 7, inertia 108500.5487137201
Iteration 8, inertia 108425.01470709371
Iteration 9, inertia 108350.17246006029
Iteration 10, inertia 108276.82062786326
Iteration 11, inertia 108201.42058726051
Iteration 12, inertia 108124.5073752792
Iteration 13, inertia 108063.58137652167
Iteration 14, inertia 108013.56371003223
Iteration 15, inertia 107981.25148018509
Iteration 16, inertia 107955.48395906355
Iteration 17, inertia 107938.9383928439
Iteration 18, inertia 107933.38195554455
Iteration 19, inertia 107926.28958117869
Iteration 20, inertia 107915.77559234075
Iteration 21, inertia 107907.63214180157
Iteration 22, inertia 107902.07879837076
Iteration 23, inertia 107893.19745094105
...

{0: [29, 55, 105, 86, 75],
 1: [63, 5, 0, 99, 319],
 2: [725, 214, 72, 651, 349],
 3: [221, 88, 33, 81, 150],
 4: [140, 47, 85, 70, 181]}
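The cluster-by-class table above can also be built without a manual loop. The following is a minimal sketch using `sklearn.metrics.cluster.contingency_matrix`; the arrays `true_classes` and `cluster_ids` are hypothetical toy stand-ins for `clss` and `kmeans.labels_`, not the notebook's data. The purity score at the end is an extra summary that the notebook itself does not compute.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical toy labels standing in for `clss` and `kmeans.labels_`.
true_classes = np.array([0, 0, 1, 1, 2, 2, 2])
cluster_ids = np.array([1, 1, 0, 0, 0, 2, 2])

# The first argument indexes the rows, so passing the cluster ids first
# gives rows = clusters and columns = classes, like `count_dict`.
table = contingency_matrix(cluster_ids, true_classes)
print(table)

# Purity: fraction of samples that belong to the majority class of their cluster.
purity = table.max(axis=1).sum() / table.sum()
print(purity)
```

In a table like this, a row dominated by a single column means the cluster is mostly one class, while a row spread evenly across columns means the cluster mixes classes.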

Now let us determine the homogeneity and completeness of the clusters, as well as the Silhouette Coefficient.

In [89]:
from sklearn import metrics

# Compute the homogeneity and completeness of the clusters.
homogeneity = metrics.homogeneity_score(clss, y_pred)
completeness = metrics.completeness_score(clss, y_pred)

# Compute the Silhouette Coefficient for each sample.
# Note: this is measured on the raw `feat_vectors`, not on the PCA-reduced `feat_v` that K-Means was fitted on.
s = metrics.silhouette_samples(feat_vectors, y_pred)

# Compute the mean Silhouette Coefficient over all samples.
s_mean = metrics.silhouette_score(feat_vectors, y_pred)

print(homogeneity,
    completeness,
    s, s_mean)

0.08166501438023839 0.0892204077790774 [-0.68982217 -0.3659875  -0.89824299 ...  0.36445689 -0.3659875
 -0.3659875 ] -0.42709379881958637
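To help read these numbers, here is a small sketch of how the three metrics behave on hand-made data. All arrays below are hypothetical toy values, not the trajectory features. Homogeneity is high when each cluster contains a single class, completeness is high when each class falls into a single cluster, and the Silhouette Coefficient is negative when samples sit closer to another cluster than to their own, in the feature space passed to the function.

```python
import numpy as np
from sklearn import metrics

true_classes = [0, 0, 1, 1]

# Every sample in its own cluster: each cluster is pure (homogeneous),
# but each class is split across two clusters (incomplete).
over_split = [0, 1, 2, 3]
h = metrics.homogeneity_score(true_classes, over_split)
c = metrics.completeness_score(true_classes, over_split)
print(h, c)

# Two well-separated groups of points in the plane.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_labels = [0, 0, 1, 1]  # clusters match the two groups
bad_labels = [0, 1, 0, 1]   # clusters cut across the two groups

sil_good = metrics.silhouette_score(points, good_labels)
sil_bad = metrics.silhouette_score(points, bad_labels)
print(sil_good, sil_bad)
```

Because the Silhouette Coefficient depends on the distances in the array it receives, evaluating it on a different representation than the one used for clustering can change its value considerably.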


## DBSCAN

We now apply the DBSCAN algorithm as an alternative model.

In [15]:
from sklearn.cluster import DBSCAN

After reducing the data with PCA, we use `StandardScaler` to standardize the features to zero mean and unit variance. Then we apply the algorithm to the data.
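As a quick check of what this preprocessing does, the sketch below applies PCA followed by `StandardScaler` to random data. The array `demo` is a hypothetical stand-in for `feat_vectors`. Note that scaling after PCA gives every retained component the same unit variance, so the components no longer differ in spread.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for `feat_vectors`: 200 samples, 10 features on different scales.
demo = rng.normal(size=(200, 10)) * np.arange(1, 11)

components = PCA(n_components=3, random_state=0).fit_transform(demo)
scaled = StandardScaler().fit_transform(components)

# PCA orders components by decreasing variance.
print(components.std(axis=0))
# After scaling, every component has mean 0 and standard deviation 1.
print(scaled.mean(axis=0), scaled.std(axis=0))
```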

In [71]:
from sklearn.preprocessing import StandardScaler

# Reduce to 8 principal components, then standardize each component.
X = StandardScaler().fit_transform(PCA(n_components=8).fit_transform(feat_vectors))

dbscan = DBSCAN(eps=0.04, min_samples=9).fit(X)

# count_dict[cluster][class]; assumes DBSCAN finds at most 5 clusters.
count_dict = {i: [0]*5 for i in range(5)}
for cluster, true_class in zip(dbscan.labels_, clss):
    if cluster == -1:  # -1 marks noise points
        continue
    count_dict[cluster][true_class] += 1

count_dict

{0: [0, 0, 16, 0, 0],
 1: [0, 0, 9, 0, 0],
 2: [0, 1, 14, 0, 0],
 3: [0, 0, 0, 0, 0],
 4: [0, 0, 0, 0, 0]}
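The table above shows that only a few dozen samples end up in any cluster, which suggests that `eps=0.04` is small relative to the typical distances between standardized points. One common heuristic for choosing `eps` is to look at the sorted distance from each point to its `min_samples`-th nearest neighbor and pick a value near the bend of that curve. The sketch below illustrates the idea on synthetic data; `X_demo` is a hypothetical stand-in for `X`, and this is not necessarily how the value in this notebook was chosen.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the standardized data `X`.
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

min_samples = 9
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
# Calling kneighbors() without arguments queries the training points and
# does not count a point as its own neighbor.
distances, _ = nn.kneighbors()

# Distance to the farthest of the `min_samples` neighbors, sorted ascending.
k_dist = np.sort(distances[:, -1])

# A few percentiles give a rough idea of a sensible range for eps.
print(np.percentile(k_dist, [50, 90, 99]))
```

Plotting `k_dist` and looking for the point where it starts rising sharply is the usual way to read off a candidate `eps`.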

Now we compute and print the evaluation metrics.

In [None]:
labels = dbscan.labels_  # cluster label per sample; -1 marks noise
n_clusters = len(set(labels[labels != -1]))
print('Number of clusters: {}'.format(n_clusters))
print('Homogeneity: {}'.format(metrics.homogeneity_score(clss, labels)))
print('Completeness: {}'.format(metrics.completeness_score(clss, labels)))
# print('Mean Silhouette score: {}'.format(metrics.silhouette_score(X, labels)))


## OPTICS

In [90]:
from sklearn.cluster import OPTICS
import numpy as np

clustering = OPTICS(min_samples=100).fit(X)
len(clustering.labels_)

3943
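Note that `len(clustering.labels_)` is the number of samples, not the number of clusters. A minimal sketch of how the labels could be summarized is shown below; it runs on synthetic data (`X_demo` is a hypothetical stand-in for `X`) with a smaller `min_samples` chosen only to suit the toy data size.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the standardized data `X`.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

optics = OPTICS(min_samples=20).fit(X_demo)
labels = optics.labels_

# As with DBSCAN, the label -1 marks samples treated as noise.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))

print('Samples: {}'.format(len(labels)))
print('Clusters: {}'.format(n_clusters))
print('Noise points: {}'.format(n_noise))
```

The same `labels` array can be passed to `metrics.homogeneity_score` and `metrics.completeness_score` to compare OPTICS with the other two models.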