# Event detection from Twitter
> The aim of this project is to use geolocated tweets to identify travelling patterns across the Swiss border. We will use the radius of gyration (ROG) to measure the distance covered by users over a period of time. The reference point for each user will be the location from which that user posted the most geolocated tweets. We will query individual tweets to decide the end points for the ROG. The results will be categorised by journeys between cities and also by time, to show how the patterns change over a period.
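
As a rough illustration of the measure described above, the radius of gyration of one user can be sketched as the root-mean-square distance between that user's tweet locations and the reference point. The helper names `haversine_km` and `radius_of_gyration_km`, the use of great-circle distance, and the Earth radius of 6371 km are assumptions made for this sketch; the cells in this notebook do not compute the ROG.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # spherical approximation (assumption)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def radius_of_gyration_km(points):
    """RMS distance of (lat, lon) points from the most frequent point."""
    # Reference point: the location that occurs most often for this user
    ref_lat, ref_lon = Counter(points).most_common(1)[0][0]
    squared = [haversine_km(lat, lon, ref_lat, ref_lon) ** 2 for lat, lon in points]
    return math.sqrt(sum(squared) / len(squared))

# Made-up example: two tweets at the reference point, one a degree of longitude away
print(radius_of_gyration_km([(0.0, 0.0), (0.0, 0.0), (0.0, 1.0)]))
```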

# Load Data

We start by reading our preprocessed data.

In [1]:
import pandas as pd
import numpy as np
import csv
import pickle
import json

In [2]:
# Use a context manager so the file handle is closed after loading
with open('data/xaa_proc.p', 'rb') as fh:
    df = pickle.load(fh)

In [3]:
df.head()

Unnamed: 0,id,userId,text,longitude,latitude,inReplyTo,placeLatitude,placeLongitude,followersCount,friendsCount,statusesCount,userLocation,year,month,day,week_number,hour,city,state,country
551,10231423626,6257282.0,"The new apartment is nice, but there is no Wif...",7.58531,47.5455,,47.5367,7.57849,14249,9260.0,19585.0,"Potsdam, Germany",2010.0,3.0,9.0,10.0,18.0,Binningen,Basel-Landschaft,ch
609,10292646240,15602037.0,Is that wet yet solid stuff on my screen suppo...,8.52725,47.3876,,47.3791,8.50021,177,136.0,5167.0,"Zürich, Switzerland",2010.0,3.0,10.0,10.0,22.0,Zürich,Zürich,ch
611,10309829732,625553.0,I'm at DCTI - David Dufour in Geneva http://go...,6.13183,46.2006,,46.1996,6.13011,471,82.0,3363.0,"Geneva, Switzerland",2010.0,3.0,11.0,10.0,5.0,Genève,Genève,ch
612,10310391132,17341045.0,God morgon! :-),7.44235,46.8957,,46.9214,7.38855,586,508.0,9016.0,"Bern, Switzerland",2010.0,3.0,11.0,10.0,6.0,Köniz,Bern - Berne,ch
618,10311568050,634553.0,"At this very minute, the sun is pink.",6.199,46.2043,,46.1938,6.15415,2230,387.0,10605.0,"Geneva, Switzerland",2010.0,3.0,11.0,10.0,7.0,Genève,Genève,ch


# Preprocess data

## Drop missing data

The geolocation and the text of every tweet are both required for event detection, so we drop rows with missing values in 'text', 'placeLongitude' or 'placeLatitude'.

In [4]:
df.shape

(22611, 20)

In [5]:
df.dropna(subset=['text', 'placeLongitude', 'placeLatitude'], inplace=True)

In [6]:
df.shape

(22611, 20)
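
The shape is the same before and after the drop, so no rows in this file were missing any of the three fields. One way to check this before dropping is to count missing values per column. The sketch below uses a small made-up frame (`toy`) in place of the real data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real DataFrame
toy = pd.DataFrame({
    'text': ['hello', None, 'world'],
    'placeLongitude': [7.578, 8.500, np.nan],
    'placeLatitude': [47.537, 47.379, 46.200],
})

required = ['text', 'placeLongitude', 'placeLatitude']

# Number of missing values in each required column
missing_per_column = toy[required].isna().sum()

# Rows that survive dropna(subset=required)
kept = toy.dropna(subset=required)

print(missing_per_column)
print(len(toy) - len(kept), 'rows dropped')
```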

## Handle longitude and latitude

We round longitude and latitude to three decimal places so that nearby tweets fall into the same grid cell and form clusters.

In [7]:
# Round to 3 decimal places (despite the column name, this rounds rather than truncates)
df['truncatedLatitude'] = df['placeLatitude'].round(3)

In [8]:
# Round to 3 decimal places (despite the column name, this rounds rather than truncates)
df['truncatedLongitude'] = df['placeLongitude'].round(3)
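
To get a feel for the resolution that three decimal places imply, the size of one grid cell can be estimated from the length of a degree. The sketch assumes a spherical Earth of radius 6371 km and a reference latitude of 47° N (close to the Zürich coordinates in the data above); both are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # spherical approximation (assumption)
STEP_DEG = 0.001            # three decimal places

def cell_size_m(latitude_deg, step_deg=STEP_DEG):
    """Return the (north-south, east-west) size in metres of one grid cell."""
    metres_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    north_south = step_deg * metres_per_degree
    # Meridians converge towards the poles, so the east-west size shrinks with cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns, ew = cell_size_m(47.0)
print(round(ns), round(ew))
```

Under these assumptions a cell is about 111 m north-south and 76 m east-west, so tweets grouped together were posted within roughly a city block of each other.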

In [9]:
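# Keep only the columns needed below; copy() avoids SettingWithCopyWarning when adding columns later
df = df[['id', 'userId', 'text', 'truncatedLongitude', 'truncatedLatitude', 'city', 'year', 'month', 'day']].copy()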

# Event detection

We define an event as a hashtag that appears in at least 5 tweets posted on the same day within the same area (that is, with the same truncatedLongitude and truncatedLatitude).
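
On a small made-up table, this definition can be sketched as a group-and-count followed by a threshold. The frame `toy`, its values, and the constant `MIN_TWEETS` are illustrative assumptions; the sketch counts distinct tweet ids so that a hashtag repeated within one tweet is counted once.

```python
import pandas as pd

MIN_TWEETS = 5  # threshold from the definition above

# Made-up data: six tweets with '#expo' in one cell on one day, one tweet elsewhere
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'year': [2010] * 7,
    'month': [6] * 7,
    'day': [24] * 7,
    'truncatedLongitude': [8.537] * 6 + [7.395],
    'truncatedLatitude': [47.377] * 6 + [46.954],
    'hashtag': ['expo'] * 7,
})

keys = ['year', 'month', 'day', 'truncatedLongitude', 'truncatedLatitude', 'hashtag']

# Number of distinct tweets per (day, cell, hashtag)
counts = toy.groupby(keys)['id'].nunique().reset_index(name='tweetCount')

# Keep only the groups that reach the threshold
toy_events = counts[counts['tweetCount'] >= MIN_TWEETS]
print(toy_events)
```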

### Extract hashtags

We create a new column 'hashtags' which stores the list of hashtags extracted from every tweet.

In [10]:
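def extract_hashtags(row):
    """Return the whitespace-delimited words of a tweet that start with '#', without the '#'."""
    text = str(row.text)
    hashtags = []
    for word in text.split():
        # Trailing punctuation stays attached to the hashtag (e.g. '#snow!' gives 'snow!')
        if word.startswith('#'):
            hashtags.append(word[1:])

    return hashtags

df['hashtags'] = df.apply(extract_hashtags, axis=1)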
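
Because the extraction splits on whitespace only, punctuation next to a hashtag ends up inside it, which can make one hashtag appear under several spellings. A regular-expression variant is sketched below as an alternative; the function name `extract_hashtags_regex` and the pattern are assumptions, not the method used in this notebook.

```python
import re

# '#' not preceded by a word character, followed by one or more word characters
HASHTAG_RE = re.compile(r'(?<!\w)#(\w+)')

def extract_hashtags_regex(text):
    """Return the hashtags in text without '#', ignoring surrounding punctuation."""
    return HASHTAG_RE.findall(str(text))

print(extract_hashtags_regex('#Amazon #Boykott, wäre #Zürich! a#b #'))
# → ['Amazon', 'Boykott', 'Zürich']
```

In Python 3, `\w` matches Unicode letters, so hashtags with umlauts or accents are kept whole.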

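### Flatten hashtag lists into one row per hashtag

Each tweet's hashtag list is expanded so that every hashtag gets its own row. Tweets without hashtags produce no rows.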

In [11]:
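# One group per tweet: turn its hashtag list into a one-column DataFrame, giving one row per hashtag
df = df.groupby(['id', 'userId', 'text', 'truncatedLongitude', 'truncatedLatitude', 'city', 'year', 'month', 'day']).hashtags.apply(lambda x: pd.DataFrame(x.values[0])).reset_index()
# reset_index also exposes the inner per-hashtag index as 'level_9', which is not needed
df = df.drop('level_9', axis=1)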

In [12]:
df.rename(columns={0:'hashtag'}, inplace=True)

In [13]:
df.shape

(6570, 10)

In [14]:
df.head()

Unnamed: 0,id,userId,text,truncatedLongitude,truncatedLatitude,city,year,month,day,hashtag
0,10002265813295104,16272692.0,#Amazon #Boykott wäre jetzt eine vernünftige S...,9.637,47.561,Wasserburg (Bodensee),2010.0,12.0,1.0,Amazon
1,10002265813295104,16272692.0,#Amazon #Boykott wäre jetzt eine vernünftige S...,9.637,47.561,Wasserburg (Bodensee),2010.0,12.0,1.0,Boykott
2,10002265813295104,16272692.0,#Amazon #Boykott wäre jetzt eine vernünftige S...,9.637,47.561,Wasserburg (Bodensee),2010.0,12.0,1.0,Wikileaks
3,10002504087502848,17366149.0,@barracuda00795 sollte sicher kein Problem wer...,7.437,46.945,Bern,2010.0,12.0,1.0,flug
4,10002862222344192,17366149.0,@RahelRadisli willst du das sicher? ;-) #schne...,7.437,46.945,Bern,2010.0,12.0,1.0,schnee
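
In pandas 0.25 and later, `DataFrame.explode` offers a shorter way to get one row per hashtag. The sketch below uses a made-up frame (`toy`); it is an alternative to the groupby approach above, not the code that produced the output shown.

```python
import pandas as pd

# Made-up tweets: one with two hashtags, one with none, one with a single hashtag
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'hashtags': [['Amazon', 'Boykott'], [], ['schnee']],
})

flat = (
    toy.explode('hashtags')            # one row per list element; empty lists become NaN
       .dropna(subset=['hashtags'])    # drop tweets that had no hashtags
       .rename(columns={'hashtags': 'hashtag'})
       .reset_index(drop=True)
)
print(flat)
```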


### Detect events in cluster

In [15]:
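# Count rows (one per hashtag occurrence) for each (day, city, grid cell, hashtag)
events = df.groupby(['year', 'month', 'day', 'city', 'truncatedLongitude', 'truncatedLatitude', 'hashtag'])[['id']].count()
events.rename(columns={'id':'tweetCount'}, inplace=True)
# Turn the group keys back into ordinary columns
events.reset_index(inplace=True)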

In [16]:
events.shape

(5887, 8)

In [17]:
events.head()

Unnamed: 0,year,month,day,city,truncatedLongitude,truncatedLatitude,hashtag,tweetCount
0,2010.0,3.0,12.0,Zürich,8.5,47.379,bloFo,1
1,2010.0,3.0,12.0,Zürich,8.544,47.364,hinweisebitte,1
2,2010.0,3.0,12.0,Zürich,8.544,47.364,stadelhofen,1
3,2010.0,3.0,13.0,Genève,6.134,46.201,fb,1
4,2010.0,3.0,13.0,Genève,6.14,46.2,fb,1


In [18]:
events = events[events['tweetCount'] >= 5]
events.shape

(29, 8)

In [19]:
events

Unnamed: 0,year,month,day,city,truncatedLongitude,truncatedLatitude,hashtag,tweetCount
38,2010.0,3.0,25.0,Zürich,8.523,47.352,bosw,8
109,2010.0,4.0,19.0,Zürich,8.537,47.377,Sechselaeuten,9
285,2010.0,6.0,2.0,Bern,7.395,46.954,muse,7
304,2010.0,6.0,3.0,Zürich,8.537,47.377,twitref,6
348,2010.0,6.0,12.0,Zürich,8.537,47.377,mcw,5
389,2010.0,6.0,16.0,Zürich,8.537,47.377,sui,5
412,2010.0,6.0,20.0,Bern,7.395,46.954,NZL,7
465,2010.0,6.0,24.0,Zürich,8.537,47.377,swisscrmforum,18
480,2010.0,6.0,26.0,Bern,7.395,46.954,go2leu2606,5
532,2010.0,7.0,3.0,Zürich,8.537,47.377,twittboat,11


# Visualization
We built a web page that visualizes the detected events at nnnsyyy.github.io/ADA_Project/

To feed it, we first generate a JSON file that the page reads. The file wraps a GeoJSON FeatureCollection in a call to onLoadData(...).

In [20]:
def create_item(start, end, longitude, latitude, text):
    # GeoJSON positions are ordered [longitude, latitude]
    item = {"type":"Feature","properties":{"start":start, "end":end, "text":text},"geometry":{"type":"Point", "coordinates":[longitude, latitude]}}
    return item

In [21]:
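items = []

def gen_data(x):
    # Each event spans a single day, so start and end are the same date
    date = "{}-{}-{}".format(int(x.year), int(x.month), int(x.day))
    item = create_item(date, date, x.truncatedLongitude, x.truncatedLatitude, '')
    items.append(item)

# Loop explicitly: gen_data is called only for its side effect on items
for row in events.itertuples(index=False):
    gen_data(row)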

In [22]:
len(items)

29

In [23]:
f = open("events.json", "w")

In [24]:
f.write('onLoadData({"type":"FeatureCollection","features":')

50

In [25]:
json.dump(items, f)

In [26]:
f.write('});');

In [27]:
f.close()
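
The same file can be written in one step with a context manager, which closes the file even if an error occurs part-way through. This is a sketch: the helper name `write_events_js`, the sample feature, and the temporary output path are assumptions; only the `onLoadData(...)` wrapper comes from the cells above.

```python
import json
import os
import tempfile

def write_events_js(features, path):
    """Write features as a GeoJSON FeatureCollection wrapped in onLoadData(...);"""
    collection = {"type": "FeatureCollection", "features": features}
    with open(path, "w") as f:
        f.write("onLoadData(")
        json.dump(collection, f)
        f.write(");")

# Made-up sample feature in the same shape as the items built above
sample = [{
    "type": "Feature",
    "properties": {"start": "2010-6-24", "end": "2010-6-24", "text": ""},
    "geometry": {"type": "Point", "coordinates": [8.537, 47.377]},
}]

# Write to a temporary directory so the sketch does not overwrite events.json
out_path = os.path.join(tempfile.mkdtemp(), "events.json")
write_events_js(sample, out_path)

with open(out_path) as f:
    content = f.read()
print(content[:40])
```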