## Clustering

Now that we have identified the interesting tweets, we can cluster them to find areas of interest.

In [66]:
#Imports
import os
import json
import pandas as pd
import numpy as np
import csv

## Data loading

We load the non-urban tweet data computed previously.

In [67]:
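# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 and
# parse_dates=True reproduces its defaults
non_urban = pd.read_csv("data/cleaned_non_urban.tsv", sep="\t", index_col=0, parse_dates=True)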

## Clustering

We use the DBSCAN algorithm to compute the clusters and find points of interest. Since this algorithm requires a lot of memory, we first run it on a random subset of 200,000 tweets to find the cluster centers.

In [68]:
# .as_matrix() was removed in pandas 1.0; .to_numpy() returns the same array
array_data = non_urban[["longitude","latitude"]].to_numpy()
array_data.shape

(1199362, 2)

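Create the subset. Note that `np.random.randint` draws indices with replacement, so the same tweet can appear more than once.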

In [69]:
idx = np.random.randint(array_data.shape[0], size=200000)
subset = array_data[idx]
subset

array([[  7.4426 ,  46.172  ],
       [  6.45029,  45.9063 ],
       [ 10.0293 ,  46.9783 ],
       ..., 
       [  7.03545,  46.5661 ],
       [  6.8901 ,  46.5063 ],
       [  6.68697,  46.1926 ]])
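If duplicate rows in the subset are a concern (they add artificial density, which DBSCAN is sensitive to), one option is to draw the indices without replacement. The following is a minimal sketch on toy data, not the sampling used in this notebook; the helper name `sample_rows_without_replacement` and the `demo` array are hypothetical.

```python
import numpy as np

def sample_rows_without_replacement(data, size, seed=None):
    """Return `size` distinct rows of `data`, chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no row index is drawn twice
    idx = rng.choice(data.shape[0], size=size, replace=False)
    return data[idx]

# Hypothetical toy array standing in for array_data
demo = np.arange(20).reshape(10, 2)
demo_subset = sample_rows_without_replacement(demo, size=5, seed=42)
print(demo_subset.shape)
```
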

Run DBSCAN on the subset

In [102]:
# Only DBSCAN is needed here; sklearn.datasets.samples_generator no longer
# exists in recent scikit-learn releases and was unused
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.01, min_samples=50).fit(subset)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)


Estimated number of clusters: 534
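In the cell above, `eps=0.01` is expressed in degrees and compared with the default Euclidean metric, so the neighbourhood is not the same size in kilometers along longitude and latitude. A possible alternative, which is not what this notebook does, is to let DBSCAN use the haversine (great-circle) metric and give `eps` in kilometers. The sketch below runs on hypothetical toy points; the helper `dbscan_haversine` and the chosen `eps_km` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used to convert km to radians

def dbscan_haversine(lon_lat, eps_km, min_samples):
    """Cluster (longitude, latitude) points given in degrees by great-circle distance."""
    # The haversine metric expects (latitude, longitude) pairs in radians
    lat_lon_rad = np.radians(lon_lat[:, ::-1])
    model = DBSCAN(eps=eps_km / EARTH_RADIUS_KM, min_samples=min_samples,
                   metric="haversine", algorithm="ball_tree")
    return model.fit_predict(lat_lon_rad)

# Hypothetical toy data: two tight groups of 60 points and one isolated point
rng = np.random.default_rng(0)
group_a = np.array([7.44, 46.17]) + rng.normal(scale=0.001, size=(60, 2))
group_b = np.array([10.03, 46.98]) + rng.normal(scale=0.001, size=(60, 2))
outlier = np.array([[8.5, 45.0]])
points = np.vstack([group_a, group_b, outlier])

toy_labels = dbscan_haversine(points, eps_km=1.0, min_samples=50)
print(sorted(set(toy_labels)))
```

Results obtained this way would not be directly comparable with the 534 clusters reported above, since the neighbourhood definition differs.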


Display the number of tweets in each cluster (label -1 marks noise, i.e. tweets that belong to no cluster)

In [103]:
test = pd.DataFrame(subset)
test.columns = ["longitude","latitude"]
test["cluster"] = labels
count_per_cluster = test.groupby("cluster").count().sort_values("longitude",ascending=False).reset_index()[["cluster", "longitude"]]
count_per_cluster.columns=["cluster","count"]
count_per_cluster.head()

|   | cluster | count |
|---|---------|-------|
| 0 | -1 | 49749 |
| 1 | 5 | 8354 |
| 2 | 45 | 5171 |
| 3 | 58 | 3683 |
| 4 | 4 | 3457 |


Remove the unassigned tweets (cluster -1) and display the clusters with the most tweets in the subset

In [104]:
test_cleaned = test[test.cluster != -1]
test_cleaned.head()
test_cleaned.groupby("cluster").count().sort_values("longitude",ascending=False).head()


| cluster | longitude | latitude |
|---------|-----------|----------|
| 5 | 8354 | 8354 |
| 45 | 5171 | 5171 |
| 58 | 3683 | 3683 |
| 4 | 3457 | 3457 |
| 24 | 3296 | 3296 |


## Centroids

We compute the average longitude and latitude of each cluster

In [105]:
# .as_matrix() was removed in pandas 1.0; .to_numpy() is the replacement
test_array_data = test_cleaned[["longitude","latitude"]].to_numpy()
test_array_labels = test_cleaned[["cluster"]].to_numpy()

In [106]:
# Columns of the result: cluster label, mean longitude, mean latitude
clusters_centroids = test_cleaned.groupby("cluster").mean().reset_index().to_numpy()
clusters_centroids

array([[   0.        ,    6.43165284,   45.88530463],
       [   1.        ,   10.02139632,   46.98967791],
       [   2.        ,    7.6223579 ,   45.87525018],
       ..., 
       [ 531.        ,    6.95492042,   46.24035   ],
       [ 532.        ,    9.6970716 ,   46.399812  ],
       [ 533.        ,    8.63494909,   45.83577636]])

## Cluster Assignments

Using the cluster centers computed from the subset, we now assign each tweet of the full dataset to the closest cluster, if one is close enough.
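"Close enough" is defined in the next cell as a radius of 0.1 degrees, described there as about 10 kilometers. As a rough check of that figure, the sketch below converts a span in degrees to kilometers along both axes using a spherical Earth approximation. The helper `degree_span_in_km` and the reference latitude of 46.5°N (roughly the middle of the coordinates shown above) are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)
KM_PER_DEGREE = 2 * math.pi * EARTH_RADIUS_KM / 360.0

def degree_span_in_km(span_deg, latitude_deg):
    """Return (north_south_km, east_west_km) covered by `span_deg` degrees."""
    north_south = span_deg * KM_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns_km, ew_km = degree_span_in_km(0.1, latitude_deg=46.5)
print(round(ns_km, 1), round(ew_km, 1))
```

Under these assumptions the radius is about 11 km north-south and somewhat less than 8 km east-west, so the circle in degree space is really an ellipse on the ground.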

In [107]:
# Limit the cluster radius to 0.1 degrees (roughly 10 km); distances are compared as squared values
threshold = 0.1 * 0.1
non_urban["cluster"] = -1
non_urban["diff"] = threshold

for i in range(clusters_centroids.shape[0]):
    non_urban["longdiff"] = (non_urban["longitude"] - clusters_centroids[i][1]) ** 2
    non_urban["latdiff"] = (non_urban["latitude"] - clusters_centroids[i][2]) ** 2
    non_urban["totdiff"] = non_urban["latdiff"] + non_urban["longdiff"]
    # Keep this centroid only if it is within the radius and closer than the best one so far
    assigned = (non_urban["totdiff"] < threshold) & (non_urban["totdiff"] < non_urban["diff"])
    # .loc writes into non_urban directly and avoids the SettingWithCopyWarning of chained indexing
    non_urban.loc[assigned, "cluster"] = i
    non_urban.loc[assigned, "diff"] = non_urban.loc[assigned, "totdiff"]


In [108]:
non_urban.head()

| 0 | date | longitude | latitude | used | cluster | diff | longdiff | latdiff | totdiff |
|---|------|-----------|----------|------|---------|------|----------|---------|---------|
| 9564875001 | 2010-02-24 06:26:53 | 7.44237 | 46.8958 | True | 313 | 0.001752 | 1.422245 | 1.12365 | 2.545895 |
| 9575552318 | 2010-02-24 13:28:52 | 8.06674 | 46.3913 | True | 357 | 2.1e-05 | 0.322862 | 0.308607 | 0.631468 |
| 9575623646 | 2010-02-24 13:30:51 | 8.06763 | 46.391 | True | 357 | 2.9e-05 | 0.321851 | 0.308273 | 0.630124 |
| 9587557928 | 2010-02-24 18:45:38 | 8.77847 | 47.2034 | True | 50 | 9e-06 | 0.020598 | 1.870394 | 1.890993 |
| 9621355274 | 2010-02-25 11:16:50 | 7.53729 | 46.8894 | True | 435 | 0.001768 | 1.204855 | 1.110123 | 2.314978 |
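The assignment loop above scans the whole DataFrame once per centroid. An alternative that might scale better is a nearest-neighbour query against a k-d tree built on the centroids. The sketch below uses `scipy.spatial.cKDTree` on hypothetical toy data; it is not the method used in this notebook, and the helper name `assign_to_nearest_centroid` is an assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

def assign_to_nearest_centroid(points, centroids, max_dist):
    """Return, for each point, the row index of the nearest centroid, or -1 if none is within max_dist."""
    tree = cKDTree(centroids)
    dist, idx = tree.query(points, k=1, distance_upper_bound=max_dist)
    # Points with no centroid within max_dist come back with an infinite distance
    return np.where(np.isfinite(dist), idx, -1)

# Hypothetical toy data: two centroids and three points, the last one far from both
toy_centroids = np.array([[0.05, 0.0], [1.0, 1.05]])
toy_points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_assignment = assign_to_nearest_centroid(toy_points, toy_centroids, max_dist=0.1)
print(toy_assignment)
```

With the notebook's data, the inputs would presumably be the longitude and latitude columns of `non_urban` and columns 1 and 2 of `clusters_centroids`, with `max_dist=0.1`. As in the loop above, this relies on row `i` of the centroid array corresponding to cluster label `i`.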


## Data export

We now export all tweets with their cluster assignment (-1 for unassigned tweets) so that the following notebook can process them

In [109]:
non_urban[["date", "longitude", "latitude", "cluster"]].to_csv("data/cleaned_non_urban_with_clusters_dbscan.tsv", sep="\t")

In [110]:
pd.DataFrame(clusters_centroids).to_csv("data/clusters_centers_dbscan.tsv", sep="\t")