# Event detection from Twitter
> The aim of this project is to use geolocated tweets to identify travel patterns across the Swiss border. We will use the radius of gyration (ROG) to measure the distance covered by users over a period of time. The reference point for each user will be the location from which they most frequently posted geolocated tweets. We will query individual tweets to decide the end points for the ROG. The results will be categorized by the journeys made between cities and also on a temporal basis, to show how they change over time.
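
As a rough illustration of the radius of gyration described above, the sketch below computes the root-mean-square distance of a user's tweets from their most frequent tweet location. The function names, the use of the haversine distance, and the sample coordinates are assumptions made for this example, not part of the project code.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def radius_of_gyration_km(points):
    """ROG of (latitude, longitude) tuples around the most frequent location."""
    if not points:
        return float("nan")
    # Reference point: the location that occurs most often in the list.
    (ref_lat, ref_lon), _ = Counter(points).most_common(1)[0]
    squared = [haversine_km(lat, lon, ref_lat, ref_lon) ** 2 for lat, lon in points]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical user: three tweets in Geneva, one in Zurich.
tweets = [(46.20, 6.14), (46.20, 6.14), (46.20, 6.14), (47.38, 8.54)]
rog = radius_of_gyration_km(tweets)
```

A user who always tweets from the same place gets an ROG of zero, and a single distant tweet among many local ones raises the value only moderately.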

# Overview

## What is an event?

An event is an abstract idea that has a topic, a temporal dimension, and a set of associated entities such as locations, persons, and organizations.

## How to detect events?

...

# Literature review

Papers or projects:
* Supervised learning
  * Geo-spatial Event Detection in the Twitter Stream
  * ... TO ADD
* Unsupervised learning
  * https://github.com/harshil93/Event-Detection-and-Clustering-for-Twitter
  * ... TO ADD



# Load Data

We start by reading in the schema file and extracting the list of column names from it.

In [1]:
import pandas as pd
import numpy as np
import csv

In [2]:
schema_path = 'data/twitter-swisscom/schema.txt'
schema = pd.read_csv(schema_path, header=None)

In [3]:
# Extract the column name (the second non-empty token) from each schema line
columns = []
for index, row in schema.iterrows():
    entries = row.loc[0].split(" ")
    entries_filt = list(filter(('').__ne__, entries))
    columns.append(entries_filt[1])

In [4]:
len(columns)

20

Then we read in the dataset, using sample.tsv rather than twex.tsv for now.  
Splitting the sample dataset on '\t' did not work, for a reason not yet identified: the result had only 10 columns. As a temporary workaround, I replaced every occurrence of '\t' with ',,,,,' in the sample dataset and saved the result as sample-workaround.tsv.
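
One possible explanation for the failed tab split, not verified against sample.tsv, is that quote characters inside the tweet text cause the parser to swallow the tabs that follow them. The sketch below uses a made-up three-column sample with illustrative column names to show that passing `quoting=csv.QUOTE_NONE` together with `sep='\t'` keeps such rows intact.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the text field of the first row starts with a stray
# double quote, and the last field of the second row is the \N null marker.
sample = '1\t"unbalanced quote\t46.2\n2\tplain text\t\\N\n'

df_tab = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    header=None,
    na_values="\\N",
    names=["id", "text", "latitude"],  # illustrative column names
)
```

If this is indeed the cause, the same arguments applied to sample.tsv would make the ',,,,,' workaround unnecessary.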

In [5]:
data_path = 'data/twitter-swisscom/sample-workaround.tsv'
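# A multi-character separator is interpreted as a regular expression and requires the python engine
df = pd.read_csv(data_path, sep=',,,,,', engine='python', encoding='utf-8', quoting=csv.QUOTE_NONE, header=None, na_values="\\N", names=columns)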
df.shape


(9999, 20)

In [6]:
df.head()

Unnamed: 0,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
0,776522983837954049,7.354492e+17,2016-09-15 20:48:01,se lo dici tu... https://t.co/x7Qm1VHBKL,,,51c0e6b24c64e54e,,1,�,46.0027,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110.0,28621,Earleen.
1,776523000636203010,2741686000.0,2016-09-15 20:48:05,https://t.co/noYrTnqmg9,,,4e7c21fd2af027c6,,1,�,46.8131,8.22414,Twitter for iPhone,http://twitter.com/#!/download/iphone,samara,letisieg,755,2037.0,3771,Suisse
2,776523045200691200,435239200.0,2016-09-15 20:48:15,@BesacTof @Leonid_CCCP Tu dois t'engager en si...,,,12eb9b254faf37a3,7.765221e+17,5,�,47.201,5.94082,Twitter for Android,http://twitter.com/download/android,lebrübrü❤,lebrubru,811,595.0,30191,Fontain
3,776523058404290560,503244200.0,2016-09-15 20:48:18,@Mno0or_Abyat اشوف مظاهرات على قانون العمل الج...,,,30bcd7f767b4041e,7.765216e+17,1,�,45.8011,6.16552,Twitter for iPhone,http://twitter.com/#!/download/iphone,عبدالله القنيص,bingnais,28433,417.0,12262,Shargeyah
4,776523058504925185,452805300.0,2016-09-15 20:48:18,Greek night #geneve (@ Emilios in Genève) http...,6.14414,46.1966,c3a6437e1b1a726d,,3,�,46.2048,6.14319,foursquare,http://foursquare.com,Alkan Şenli,Alkanoli,204,172.0,3390,İstanbul/Burgazada


# Explore data

What fraction of tweets is missing longitude, latitude, or both?

In [7]:
# fraction of tweets missing longitude
len(df[df['longitude'].isnull()]) / len(df['longitude'])

0.8538853885388539

In [8]:
# fraction of tweets missing latitude
len(df[df['latitude'].isnull()]) / len(df['latitude'])

0.7928792879287929

In [9]:
# fraction of tweets missing longitude or latitude
len(df[df['longitude'].isnull() | df['latitude'].isnull()]) / df.shape[0]

0.8541854185418541

About 85% of the tweets lack a complete (longitude, latitude) pair. Longitude is missing for 85.4% of the tweets but latitude for only 79.3%, so some tweets have just one of the two coordinates.

What fraction of tweets is missing placeLatitude, placeLongitude, or both?

In [10]:
# fraction of tweets missing placeLatitude
len(df[df['placeLatitude'].isnull()]) / df.shape[0]

0.12091209120912091

In [11]:
# fraction of tweets missing placeLongitude
len(df[df['placeLongitude'].isnull()]) / df.shape[0]

0.1273127312731273

In [12]:
# fraction of tweets missing placeLatitude or placeLongitude
len(df[df['placeLatitude'].isnull() | df['placeLongitude'].isnull()]) / df.shape[0]

0.1273127312731273
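
The missing-value fractions above can also be computed in one pass. The following is a minimal sketch on a small made-up frame (`demo` stands in for `df`, and its values are invented); it also counts the rows where exactly one of longitude and latitude is missing.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for `df`; the values are made up for illustration.
demo = pd.DataFrame({
    "longitude": [6.14, np.nan, np.nan, np.nan],
    "latitude": [46.20, np.nan, np.nan, 46.90],
    "placeLongitude": [6.14, 8.96, np.nan, 8.22],
    "placeLatitude": [46.20, 46.00, np.nan, 46.81],
})

coord_cols = ["longitude", "latitude", "placeLongitude", "placeLatitude"]

# isnull() yields booleans; their mean is the fraction of missing rows.
missing_fraction = demo[coord_cols].isnull().mean()

# True where exactly one of longitude/latitude is missing (exclusive or).
only_one_missing = demo["longitude"].isnull() ^ demo["latitude"].isnull()
```

Inspecting the rows flagged by `only_one_missing` in the real data may help tell genuine gaps apart from parsing problems.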

# Preprocess data

Group tweets by coordinates with reduced precision

In [None]:
def computeLongitude(row):
    # errors='coerce' turns values that cannot be parsed as numbers into NaN
    x = pd.to_numeric(row.longitude, errors='coerce')
    # round to 2 decimal places to reduce the precision
    return np.nan if pd.isnull(x) else round(float(x), 2)

df['derivedLongitude'] = df.apply(computeLongitude, axis=1)
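
To actually group by reduced-precision coordinates, latitude needs the same treatment as longitude. The sketch below is one vectorized way to do it on made-up data; the column name `derivedLatitude` and the `demo` frame are assumptions for this example. Two decimal places correspond to roughly 1 km in latitude.

```python
import pandas as pd

# Made-up raw values, including an unparseable string and a missing entry.
demo = pd.DataFrame({
    "longitude": ["6.14414", "6.14319", "8.96044", "not-a-number", None],
    "latitude": ["46.1966", "46.2048", "46.0027", "46.5", None],
})

# Coerce to numbers (invalid values become NaN) and round to 2 decimals.
for src, dst in [("longitude", "derivedLongitude"), ("latitude", "derivedLatitude")]:
    demo[dst] = pd.to_numeric(demo[src], errors="coerce").round(2)

# Count tweets per grid cell, ignoring rows without a full coordinate pair.
cell_counts = (
    demo.dropna(subset=["derivedLongitude", "derivedLatitude"])
        .groupby(["derivedLongitude", "derivedLatitude"])
        .size()
)
```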

Remove tweets outside Switzerland
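
A simple first approximation of this step is a bounding-box filter. In the sketch below the box limits are approximate values for Switzerland's extreme points and are an assumption of this example, as is the choice of the place coordinates; a rectangular box also keeps parts of neighbouring countries, so a polygon test would be more precise.

```python
import pandas as pd

# Approximate bounding box of Switzerland (assumed values for illustration).
CH_LAT_MIN, CH_LAT_MAX = 45.81, 47.81
CH_LON_MIN, CH_LON_MAX = 5.95, 10.50

# Place coordinates taken from four of the sample rows shown earlier.
demo = pd.DataFrame({
    "placeLatitude": [46.2048, 47.201, 45.8011, 46.8131],
    "placeLongitude": [6.14319, 5.94082, 6.16552, 8.22414],
})

# between() is inclusive and evaluates to False for missing values.
in_box = (
    demo["placeLatitude"].between(CH_LAT_MIN, CH_LAT_MAX)
    & demo["placeLongitude"].between(CH_LON_MIN, CH_LON_MAX)
)
swiss = demo[in_box]
```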

Group tweets by date
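
Grouping by date could look like the sketch below, which parses `createdAt` and counts tweets per calendar day on made-up rows. The time zone of `createdAt` is not stated in the data, so the day boundaries here are an assumption.

```python
import pandas as pd

# Made-up rows using the createdAt format seen in the sample.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "createdAt": ["2016-09-15 20:48:01", "2016-09-15 23:59:59", "2016-09-16 00:00:05"],
})

# Parse timestamps; unparseable values become NaT instead of raising.
demo["createdAt"] = pd.to_datetime(demo["createdAt"], errors="coerce")
demo["date"] = demo["createdAt"].dt.date

tweets_per_day = demo.groupby("date").size()
```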

Create clusters of tweets by temporal & spatial similarity
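
As a crude starting point, tweets can be binned into fixed space-time cells: same 0.01-degree grid cell and same clock hour. This is plain grid binning, not a real clustering algorithm, and tweets near a cell or hour boundary end up in different bins; the cell size, the one-hour window, and the `demo` data are assumptions for this sketch.

```python
import pandas as pd

# Made-up tweets: three close together in time and space, one later.
demo = pd.DataFrame({
    "createdAt": pd.to_datetime([
        "2016-09-15 20:48:15", "2016-09-15 20:52:40",
        "2016-09-15 20:55:03", "2016-09-15 22:10:00",
    ]),
    "latitude": [46.1966, 46.2011, 46.2048, 46.1990],
    "longitude": [6.14414, 6.14102, 6.14319, 6.14250],
})

# Bin keys: rounded coordinates and the timestamp floored to the hour.
keys = [
    demo["latitude"].round(2).rename("lat_cell"),
    demo["longitude"].round(2).rename("lon_cell"),
    demo["createdAt"].dt.floor("60min").rename("hour"),
]

# ngroup() assigns one integer label per distinct (cell, hour) combination.
demo["cluster"] = demo.groupby(keys).ngroup()
cluster_sizes = demo["cluster"].value_counts()
```

A density-based method would avoid the boundary effects, at the cost of choosing distance and time thresholds.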

Create features for machine learning:   
language, location (latitude, longitude), time (day, time period), etc.
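
A minimal sketch of the location and time features follows. The period boundaries, the feature names, and the `demo` rows are arbitrary choices for illustration; language is left out because none of the columns shown in the `df.head()` output carries it.

```python
import pandas as pd

# Made-up tweets with parsed timestamps and coordinates.
demo = pd.DataFrame({
    "createdAt": pd.to_datetime(["2016-09-15 20:48:01", "2016-09-17 08:05:00"]),
    "latitude": [46.1966, 47.3769],
    "longitude": [6.14414, 8.5417],
})


def time_period(hour):
    """Map an hour (0-23) to a coarse period; the cut points are arbitrary."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


features = pd.DataFrame({
    "latitude": demo["latitude"],
    "longitude": demo["longitude"],
    "dayOfWeek": demo["createdAt"].dt.dayofweek,  # Monday = 0
    "timePeriod": demo["createdAt"].dt.hour.map(time_period),
})

# One-hot encode the categorical period so models can consume it.
features = pd.get_dummies(features, columns=["timePeriod"])
```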

# Machine learning

TODO: supervised learning or unsupervised learning?

Supervised learning: predict whether a cluster of tweets refers to an event.
1. Manually label some clusters of tweets as containing an event or not.
2. Train a model on the labeled clusters.
3. Use the model to find the clusters of tweets that contain events.
4. Run topic modeling over those clusters to filter event-related tweets and identify the exact event.

Unsupervised learning
1. Find topics with topic modeling.
2. Select the topics whose tweets are close in time and location, are related to some entities, etc.
3. Identify the events from the selected topics.

# Visualization
Idea: for a specified date, display (location, event or list of tweets) on a map, similar to https://www.youtube.com/watch?v=WGEjI0TvWnk ?