# Clustering Geocoded Tweets of 2012 London Olympic Games
by [Talha Oz](http://talhaoz.com) (submitted as GeoSocial Class Assignment #3)

#### Q1
I use DBSCAN, a density-based clustering algorithm, to cluster the tweets by geographic location. To do so, I first compute the distance between every pair of unique tweet locations using the Vincenty algorithm as implemented in geopy, and then run sklearn's DBSCAN implementation on this precomputed distance matrix to assign a cluster label to each location.
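As a rough illustration of what a precomputed geographic distance matrix looks like, the sketch below builds one in plain Python. Note that it swaps in the haversine formula (a spherical approximation) for the Vincenty algorithm used in this notebook, so the distances differ slightly from geopy's. The names `haversine_mi` and `distance_matrix` and the sample coordinates are illustrative assumptions, not part of the assignment.

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles (spherical approximation)


def haversine_mi(p1, p2):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot
    return 2 * EARTH_RADIUS_MI * math.asin(min(1.0, math.sqrt(a)))


def distance_matrix(points):
    """Symmetric matrix of pairwise distances; only the upper triangle is computed."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = haversine_mi(points[i], points[j])
    return dist


# Hypothetical coordinates, not rows from the assignment's CSV
points = [(51.0, 0.0), (52.0, 0.0), (51.0, 1.0)]
D = distance_matrix(points)
```

Because the matrix is symmetric with a zero diagonal, computing only one triangle halves the number of distance calls, which matters when each call is as costly as a Vincenty evaluation.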

Since the tweet locations are scattered across several countries, we are mainly interested in high-density clusters of locations, which DBSCAN is known to detect well. Potential clusters with fewer than 10 points are ignored.

#### Q2
1. Read the CSV file into a DataFrame, assigning column names because the provided file has no header row.
2. Group the tweets by their coordinates [exact (lat,lon) pairs]:
 1. Average the sentiment polarities
 2. Count number of tweets in each group
3. Interestingly, 1778 of the 5729 tweets come from the same location, i.e. the London city center.
4. A higher eps and a lower min_samples allow lower-density clusters to form. The default values are 0.5 and 5; I changed them to 15 (in miles, since the distance matrix is in miles) and 10, respectively.
5. Cluster sizes are examined and a colormap is selected accordingly.
6. The Folium (Leaflet.js) library is used for interactive mapping; each location is drawn as a circle whose radius is proportional to the number of tweets originating from that (lat, lon).
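
To make the roles of eps and min_samples in step 4 concrete, here is a minimal pure-Python sketch of DBSCAN on a precomputed distance matrix. It is a simplified teaching version, not sklearn's implementation, and the function name `dbscan_precomputed` and the 1-D toy positions (in miles) are assumptions made for illustration.

```python
from collections import deque

NOISE = -1


def dbscan_precomputed(dist, eps, min_samples):
    """Label points from a square distance matrix (list of lists).

    A point is a core point when at least min_samples points
    (itself included) lie within eps of it.
    """
    n = len(dist)
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n  # None means "not visited yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points keep expanding
        cluster += 1
    return labels


# Toy positions along a line, in miles: a group of 4, a group of 3, one outlier
positions = [0, 1, 2, 3, 50, 51, 52, 200]
dist = [[abs(a - b) for b in positions] for a in positions]

loose = dbscan_precomputed(dist, eps=1.5, min_samples=3)
strict = dbscan_precomputed(dist, eps=2.5, min_samples=4)
```

With `min_samples=3` both groups form clusters and only the outlier is noise; raising `min_samples` to 4 turns the three-point group into noise as well, which is the same trade-off that the choice of 15 and 10 makes on the tweet data.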

#### Q3
* Five clusters are detected.
* 286 of the 1185 unique locations could not be assigned to any cluster (label -1).
* DBSCAN performed poorly in less dense regions.
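
The figures above can be cross-checked with a few lines of arithmetic. The sketch below copies the cluster sizes printed by the notebook's cluster-size cell and derives the number of clusters and the share of unclustered locations; the variable names are illustrative.

```python
# Cluster sizes as printed by the notebook (label -1 marks noise)
cluster_sizes = {0: 572, 1: 277, 2: 19, 3: 15, 4: 16, -1: 286}

total = sum(cluster_sizes.values())            # unique (lat, lon) locations
noise = cluster_sizes[-1]                      # locations DBSCAN left unclustered
n_clusters = sum(1 for label in cluster_sizes if label != -1)
noise_share = noise / total

print(f"{n_clusters} clusters, {noise} of {total} locations unclustered ({noise_share:.1%})")
```

These counts refer to unique locations rather than tweets, since the tweets were grouped by coordinates before clustering.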

#### Q4
Please see below; an interactive map is produced as the output of the last command.

In [1]:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import pairwise_distances
from geopy.distance import vincenty
import numpy as np
import folium
from palettable.colorbrewer.qualitative import Dark2_6
from IPython.display import HTML

In [2]:
def vincenty_mi(p1,p2):
    return vincenty((p1[0],p1[1]),(p2[0],p2[1])).miles

In [3]:
df = pd.read_csv('Olympic_torch_2012_UK.csv',header=None,names=['twtime','lat','lon','sp'],parse_dates=[0])
df['cnt'] = 0
# average sp (sentiment polarities) and count tweets from the same lat/lon
df = pd.DataFrame(df.groupby(by=['lat','lon'],as_index=False).agg({'cnt':len,'sp':np.mean}))
print('Total number of tweets:',df['cnt'].sum())
print('Location with the highest tweet count (London city center):')
df[df.cnt == df['cnt'].max()]

Total number of tweets: 5729
Location with the highest tweet count (London city center):


|     | lat | lon | sp | cnt |
|-----|-----|-----|----|-----|
| 364 | 51.506325 | -0.127144 | 0.762092 | 1778 |


In [4]:
# this takes about 1 min 14 secs (measured by %timeit -n1 -r1)...
X = pairwise_distances(df[['lat','lon']],metric=vincenty_mi)

In [5]:
# eps is in miles because X holds distances in miles; -1 in the result marks noise
db = DBSCAN(eps=15,min_samples=10,metric='precomputed').fit_predict(X)
df['cluster'] = db
df.head()

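|   | lat | lon | sp | cnt | cluster |
|---|-----|-----|----|-----|---------|
| 0 | 46.126862 | 3.42999 | 0.0 | 1 | -1 |
| 1 | 46.211401 | 2.20936 | 0.636364 | 11 | -1 |
| 2 | 46.289863 | 3.060979 | 3.0 | 1 | -1 |
| 3 | 46.707375 | 0.87453 | 0.0 | 1 | -1 |
| 4 | 46.914511 | 1.160956 | 0.0 | 1 | -1 |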


In [7]:
grouped = df.groupby(by='cluster',as_index=False)
print('size of each cluster:',[{k:len(v)} for k,v in grouped.groups.items()])

size of each cluster: [{0: 572}, {1: 277}, {2: 19}, {3: 15}, {4: 16}, {-1: 286}]


In [9]:
# this cell can be removed as the cluster IDs are in the range of [-1,numOfClusters-1]
# so, instead of colors[x['cluster']], we could directly use Dark2_6.hex_colors[x['cluster']]
colors = {}
for i,c in enumerate(set(df['cluster'])):
    colors.update({c:Dark2_6.hex_colors[i]})

In [10]:
uk = folium.Map(location=[53.3, -3.5], zoom_start=7,  width=991, height = 1000)
df.apply(lambda x: uk.circle_marker(location=[x['lat'], x['lon']],
                 radius=x['cnt']*10,
                 popup=str(x['cluster']), line_color=colors[x['cluster']],
                 fill_color=colors[x['cluster']], fill_opacity=0.2),
         axis=1);
uk.create_map(path='uk.html')
HTML('<iframe src="uk.html" style="width: 100%; height: 1000px; border: none"></iframe>')