# Event detection from Twitter
> The aim of this project is to use geolocated tweets to identify travel patterns across the Swiss border. We will use the radius of gyration (ROG) to measure the distance covered by users over a period of time. The reference point for each user will be the location from which they posted the most geolocated tweets. We will query individual tweets to decide the end points for the ROG. The results will be categorized by the journeys made between cities and also on a temporal basis, to show how the patterns change over time.

# Overview

## What is an event?

An event is an abstract idea that has a topic, a temporal dimension, and a set of associated entities such as location, person, organization, etc.

## How to detect events?

...

# Literature review

Papers or projects:
* Supervised learning
  * Geo-spatial Event Detection in the Twitter Stream
  * ... TO ADD
* Unsupervised learning
  * https://github.com/harshil93/Event-Detection-and-Clustering-for-Twitter
  * ... TO ADD



# Load Data

We start by reading in the schema file and extracting the list of column names from it.

In [1]:
import pandas as pd
import numpy as np
import csv

In [2]:
schema_path = 'data/twitter-swisscom/schema.txt'
schema = pd.read_csv(schema_path, header=None)

In [3]:
# Extract the column names: each schema line is split on spaces and,
# after dropping empty tokens, the second token is the column name.
columns = []
for _, row in schema.iterrows():
    entries = [entry for entry in row.loc[0].split(" ") if entry]
    columns.append(entries[1])

In [4]:
columns

['id',
 'userId',
 'createdAt',
 'text',
 'longitude',
 'latitude',
 'placeId',
 'inReplyTo',
 'source',
 'truncated',
 'placeLatitude',
 'placeLongitude',
 'sourceName',
 'sourceUrl',
 'userName',
 'screenName',
 'followersCount',
 'friendsCount',
 'statusesCount',
 'userLocation']

Then we read in the dataset, using sample.tsv rather than twex.tsv for now.  
For some unknown reason, my first attempt to split the sample dataset on '\t' failed: the output had only 10 columns. As a temporary workaround, I replaced every occurrence of '\t' in the sample dataset with ',,,,,' and saved the result as sample-workaround.tsv.

In [5]:
data_path = 'data/twitter-swisscom/sample.tsv'
df = pd.read_csv(
    data_path,
    sep="\t",
    encoding='utf-8',
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    header=None,
    escapechar='\\',
    na_values='N',
    names=columns,
)
df.shape

(8790, 20)
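The `quoting=csv.QUOTE_NONE` argument may matter for the column count. One possible cause of a wrong number of columns (an assumption, not verified against this dataset) is quote characters inside tweet text, which the default parser settings treat as field delimiters. A minimal sketch on a made-up two-row sample:

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated sample: the quoted field contains a tab.
raw = '1\t"a\tb"\t3\n2\t"c\td"\t4\n'

# Default quoting: the tab inside the quotes stays part of one field.
quoted = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# QUOTE_NONE: quote characters are ordinary text, so every tab splits a field.
unquoted = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, quoting=csv.QUOTE_NONE
)

print(quoted.shape)    # fewer columns than there are tabs + 1
print(unquoted.shape)  # one column per tab-separated token
```

With default quoting the parser merges tokens that sit between quote characters, so the number of columns no longer matches the schema.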

In [6]:
df.head()

Unnamed: 0,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
0,776522983837954049,735449229028675584,2016-09-15 20:48:01,se lo dici tu... https://t.co/x7Qm1VHBKL,,,51c0e6b24c64e54e,,1,,46.0027,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110,28621,Earleen.
1,776523000636203010,2741685639,2016-09-15 20:48:05,https://t.co/noYrTnqmg9,,,4e7c21fd2af027c6,,1,,46.8131,8.22414,Twitter for iPhone,http://twitter.com/#!/download/iphone,samara,letisieg,755,2037,3771,Suisse
2,776523045200691200,435239151,2016-09-15 20:48:15,@BesacTof @Leonid_CCCP Tu dois t'engager en si...,,,12eb9b254faf37a3,7.765221e+17,5,,47.201,5.94082,Twitter for Android,http://twitter.com/download/android,lebrübrü❤,lebrubru,811,595,30191,Fontain
3,776523058404290560,503244217,2016-09-15 20:48:18,@Mno0or_Abyat اشوف مظاهرات على قانون العمل الج...,,,30bcd7f767b4041e,7.765216e+17,1,,45.8011,6.16552,Twitter for iPhone,http://twitter.com/#!/download/iphone,عبدالله القنيص,bingnais,28433,417,12262,Shargeyah
4,776523058504925185,452805259,2016-09-15 20:48:18,Greek night #geneve (@ Emilios in Genève) http...,6.14414,46.1966,c3a6437e1b1a726d,,3,,46.2048,6.14319,foursquare,http://foursquare.com,Alkan Şenli,Alkanoli,204,172,3390,İstanbul/Burgazada


# Explore data

In [7]:
df.dtypes

id                  int64
userId              int64
createdAt          object
text               object
longitude         float64
latitude          float64
placeId            object
inReplyTo         float64
source              int64
truncated         float64
placeLatitude     float64
placeLongitude    float64
sourceName         object
sourceUrl          object
userName           object
screenName         object
followersCount      int64
friendsCount        int64
statusesCount       int64
userLocation       object
dtype: object

What fraction of tweets is missing longitude, latitude, or both?

In [8]:
# fraction of rows missing longitude
len(df[df['longitude'].isnull()]) / len(df['longitude'])

0.823094425483504

In [9]:
# fraction of rows missing latitude
len(df[df['latitude'].isnull()]) / len(df['latitude'])

0.823094425483504

In [10]:
# fraction of rows missing longitude or latitude
len(df[df['longitude'].isnull() | df['latitude'].isnull()]) / df.shape[0]

0.823094425483504

The results above show that a large fraction of the data (about 82%) has no (longitude, latitude) pair. Because the three fractions are identical, the two coordinates are always missing together: no tweet in the sample is missing only one of them.
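One way to check directly whether any row has exactly one of the two coordinates missing is to compare the two null masks. The sketch below uses a small made-up frame (`toy`) in place of `df`; the column names match the schema, the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: two rows with both coordinates, two with neither.
toy = pd.DataFrame({
    "longitude": [6.14, np.nan, np.nan, 8.54],
    "latitude": [46.20, np.nan, np.nan, 47.37],
})

missing = toy[["longitude", "latitude"]].isnull()

# Fraction of missing values per column.
fractions = missing.mean()

# Rows where exactly one of the two coordinates is missing (XOR of the masks).
only_one = (missing["longitude"] ^ missing["latitude"]).sum()

print(fractions)
print(only_one)
```

If `only_one` is zero, the coordinates are either both present or both absent in every row.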

What fraction of tweets is missing placeLatitude, placeLongitude, or both?

In [11]:
# fraction of rows missing placeLatitude
len(df[df['placeLatitude'].isnull()]) / df.shape[0]

0.0

In [12]:
# fraction of rows missing placeLongitude
len(df[df['placeLongitude'].isnull()]) / df.shape[0]

0.0

In [13]:
# fraction of rows missing placeLatitude or placeLongitude
len(df[df['placeLatitude'].isnull() | df['placeLongitude'].isnull()]) / df.shape[0]

0.0

Let's check the content of some tweets. The output below shows that many of them are not in English. French (and German?) are prevalent, which makes event detection more challenging.

In [14]:
df['text'].values[:100]

array(['se lo dici tu... https://t.co/x7Qm1VHBKL',
       'https://t.co/noYrTnqmg9',
       '@BesacTof @Leonid_CCCP Tu dois t\'engager en signant précisément:"Je partage les valeurs républicaines de la droite et du centre 1/2...',
       '@Mno0or_Abyat اشوف مظاهرات على قانون العمل الجديد انا طلعت الصباح من هناك مافيه شي بس اشوف الأخبار قبل شوي المشكلة راجع بعد يومين !!',
       'Greek night #geneve (@ Emilios in Genève) https://t.co/sEplW0Mcyz',
       '@gregorypons #BusinessMontresVision https://t.co/T01r96nCfw',
       'dillo https://t.co/hScjeZbi4c',
       'Miii le voci nere.. Che meraviglia.. #XF10',
       '@Manu_Aka_Manny Sorry ', 'Je veut ca https://t.co/NZpSScxQ70',
       'seh https://t.co/HDbhb8yVma', 'Buenas noches',
       'Am looking like a 2004 kid I swear you looking like you in your early40s.@Dremoapg she say make I nr touch her you be statue for Museum??',
       "Comunque se sei figa non c'è bisogno di caricare una foto giornaliera sui social, noi maschi non ce lo sc

# Preprocess data

Preprocess the content of tweets: remove URLs, ..
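As a minimal sketch of the URL-removal step, assuming a simple regular expression is enough for the `t.co` links seen in the sample (the helper name `clean_tweet` is hypothetical):

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_tweet(text):
    """Strip URLs from a tweet and collapse runs of whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


print(clean_tweet("Greek night #geneve (@ Emilios in Genève) https://t.co/sEplW0Mcyz"))
```

On the real data this could be applied with `df['text'].map(clean_tweet)`. Tweets that consist only of a link become empty strings and may need to be dropped afterwards.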

Group tweets by coordinates with reduced precision

In [15]:
# Reduce coordinate precision: errors='coerce' turns non-numeric values
# (including strings) into NaN, then the result is rounded to 2 decimals.
df['derivedLongitude'] = pd.to_numeric(df['longitude'], errors='coerce').round(2)

Remove tweets outside Switzerland
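A rough way to do this is a bounding-box filter on the place coordinates, which are never missing in the sample. The box limits below are approximate values assumed for illustration, and the helper `in_swiss_bbox` is hypothetical. A bounding box also keeps parts of neighbouring countries, so a border polygon would be needed for an exact filter.

```python
import pandas as pd

# Approximate bounding box of Switzerland (assumed values, degrees).
LON_MIN, LON_MAX = 5.96, 10.49
LAT_MIN, LAT_MAX = 45.82, 47.81


def in_swiss_bbox(frame, lat_col="placeLatitude", lon_col="placeLongitude"):
    """Return a boolean mask of rows whose place lies inside the box."""
    lat_ok = frame[lat_col].between(LAT_MIN, LAT_MAX)
    lon_ok = frame[lon_col].between(LON_MIN, LON_MAX)
    return lat_ok & lon_ok


# Place coordinates of the first four rows shown in df.head().
toy = pd.DataFrame({
    "placeLatitude": [46.0027, 46.8131, 47.201, 45.8011],
    "placeLongitude": [8.96044, 8.22414, 5.94082, 6.16552],
})

swiss = toy[in_swiss_bbox(toy)]
print(swiss)
```

Here the last two rows fall just outside the box (west of 5.96°E and south of 45.82°N respectively) and are dropped.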

Group tweets by date
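A minimal sketch of the grouping, assuming `createdAt` is first parsed into datetimes. The toy frame below stands in for `df` and reuses three timestamps from the sample.

```python
import pandas as pd

# Toy stand-in for df with the createdAt column as strings.
toy = pd.DataFrame({
    "createdAt": [
        "2016-09-15 20:48:01",
        "2016-09-15 20:48:05",
        "2016-09-16 16:32:13",
    ],
    "text": ["a", "b", "c"],
})

# Parse the strings, then group on the calendar date of each timestamp.
toy["createdAt"] = pd.to_datetime(toy["createdAt"])
tweets_per_day = toy.groupby(toy["createdAt"].dt.date).size()

print(tweets_per_day)
```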

In [16]:
df.head()

Unnamed: 0,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,...,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation,derivedLongitude
0,776522983837954049,735449229028675584,2016-09-15 20:48:01,se lo dici tu... https://t.co/x7Qm1VHBKL,,,51c0e6b24c64e54e,,1,,...,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110,28621,Earleen.,
1,776523000636203010,2741685639,2016-09-15 20:48:05,https://t.co/noYrTnqmg9,,,4e7c21fd2af027c6,,1,,...,8.22414,Twitter for iPhone,http://twitter.com/#!/download/iphone,samara,letisieg,755,2037,3771,Suisse,
2,776523045200691200,435239151,2016-09-15 20:48:15,@BesacTof @Leonid_CCCP Tu dois t'engager en si...,,,12eb9b254faf37a3,7.765221e+17,5,,...,5.94082,Twitter for Android,http://twitter.com/download/android,lebrübrü❤,lebrubru,811,595,30191,Fontain,
3,776523058404290560,503244217,2016-09-15 20:48:18,@Mno0or_Abyat اشوف مظاهرات على قانون العمل الج...,,,30bcd7f767b4041e,7.765216e+17,1,,...,6.16552,Twitter for iPhone,http://twitter.com/#!/download/iphone,عبدالله القنيص,bingnais,28433,417,12262,Shargeyah,
4,776523058504925185,452805259,2016-09-15 20:48:18,Greek night #geneve (@ Emilios in Genève) http...,6.14414,46.1966,c3a6437e1b1a726d,,3,,...,6.14319,foursquare,http://foursquare.com,Alkan Şenli,Alkanoli,204,172,3390,İstanbul/Burgazada,6.14


In [17]:
# Parse the createdAt strings into datetimes (the result is displayed, not stored)
pd.to_datetime(df.createdAt)

0      2016-09-15 20:48:01
1      2016-09-15 20:48:05
2      2016-09-15 20:48:15
3      2016-09-15 20:48:18
4      2016-09-15 20:48:18
5      2016-09-15 20:48:21
6      2016-09-15 20:48:27
7      2016-09-15 20:48:29
8      2016-09-15 20:48:35
9      2016-09-15 20:48:37
10     2016-09-15 20:48:40
11     2016-09-15 20:48:42
12     2016-09-15 20:48:44
13     2016-09-15 20:48:44
14     2016-09-15 20:48:50
15     2016-09-15 20:48:53
16     2016-09-15 20:49:03
17     2016-09-15 20:49:06
18     2016-09-15 20:49:09
19     2016-09-15 20:49:09
20     2016-09-15 20:49:11
21     2016-09-15 20:49:17
22     2016-09-15 20:49:28
23     2016-09-15 20:49:32
24     2016-09-15 20:49:32
25     2016-09-15 20:49:34
26     2016-09-15 20:49:37
27     2016-09-15 20:49:39
28     2016-09-15 20:49:42
29     2016-09-15 20:49:42
               ...        
8760   2016-09-16 16:32:13
8761   2016-09-16 16:32:16
8762   2016-09-16 16:32:20
8763   2016-09-16 16:32:35
8764   2016-09-16 16:32:40
8765   2016-09-16 16:32:40
8

Create clusters of tweets by temporal & spatial similarity
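One simple way to approximate this is to bin tweets into grid cells and time windows and treat each (cell, window) pair as a cluster. This is plain binning rather than a clustering algorithm. The cell size (coordinates rounded to 2 decimals) and the one-hour window are assumptions chosen for illustration, as are the toy rows.

```python
import pandas as pd

# Toy stand-in for df; values are illustrative.
toy = pd.DataFrame({
    "createdAt": pd.to_datetime([
        "2016-09-15 20:48:18",
        "2016-09-15 20:55:00",
        "2016-09-15 22:10:00",
        "2016-09-15 20:50:00",
    ]),
    "placeLatitude": [46.2048, 46.2044, 46.2048, 47.3769],
    "placeLongitude": [6.14319, 6.14402, 6.14319, 8.5417],
})

# Spatial bin: coordinates rounded to 2 decimals.
toy["latBin"] = toy["placeLatitude"].round(2)
toy["lonBin"] = toy["placeLongitude"].round(2)

# Temporal bin: timestamps floored to the start of the hour.
toy["hourBin"] = toy["createdAt"].dt.floor("60min")

# Each distinct (latBin, lonBin, hourBin) combination is one cluster.
cluster_sizes = toy.groupby(["latBin", "lonBin", "hourBin"]).size()
print(cluster_sizes)
```

The first two rows share a cell and an hour and so form one cluster of size two; the other two rows each form their own cluster. Tweets that sit just across a cell or hour boundary are split apart, which a density-based method would avoid.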

Create features for machine learning:   
language, location (latitude, longitude), time (day, time period), etc.
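A sketch of the location and time features, built on a toy frame in place of `df`. The feature names (`dayOfWeek`, `hour`) are hypothetical. The language feature is left out because it would need a language-identification step that is not part of this notebook.

```python
import pandas as pd

# Toy stand-in for df; coordinates and timestamps are illustrative.
toy = pd.DataFrame({
    "createdAt": pd.to_datetime(["2016-09-15 20:48:18", "2016-09-16 16:32:13"]),
    "placeLatitude": [46.2048, 46.8131],
    "placeLongitude": [6.14319, 8.22414],
})

features = pd.DataFrame({
    "latitude": toy["placeLatitude"],
    "longitude": toy["placeLongitude"],
    # Monday is 0 and Sunday is 6.
    "dayOfWeek": toy["createdAt"].dt.dayofweek,
    # Hour of day, 0 to 23, as a coarse time-period feature.
    "hour": toy["createdAt"].dt.hour,
})

print(features)
```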

# Machine learning

TODO: supervised learning or unsupervised learning?

Supervised learning: predict whether a cluster of tweets refers to an event.
1. manually mark some clusters of tweets as containing an event or not.
2. train model
3. predict & find out clusters of tweets that contain events
4. run topic modeling over those clusters to filter event-related tweets & find out the exact event

Unsupervised learning
1. find out topics with topic modeling
2. filter out topics that are close in time and location, related to some entities, etc
3. figure out events

# Visualization
Specify a date and display (location, event/list of tweets) on a map, like https://www.youtube.com/watch?v=WGEjI0TvWnk ?