## Create Event Database
- we want tweet data for event summarization
- we define an event as spanning one full calendar day
- we arbitrarily extend the window by one day on each side, from the start of the day before to the end of the day after the event day
- all dates are handled in UTC so that every event is treated the same way

#### import tools

In [154]:
import pandas as pd
import csv
from pandas.tseries.offsets import DateOffset
import re

#### import data

In [155]:
event_name = input('Enter Event Name: ')

Enter Event Name: Chelsea vs. Tottenham Premier League Match


In [156]:
event_filename = re.sub(r"\W+", "", event_name.strip())
print(event_filename)

ChelseavsTottenhamPremierLeagueMatch
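
The filename step can be sketched as a small helper to show what the `\W+` pattern keeps and drops. The function name `to_event_filename` is hypothetical and not part of the notebook; letters, digits and underscores survive, everything else is removed.

```python
import re

def to_event_filename(event_name: str) -> str:
    """Hypothetical helper: strip every run of non-word characters."""
    # \W matches any character that is not a letter, digit, or underscore
    return re.sub(r"\W+", "", event_name.strip())

print(to_event_filename("Chelsea vs. Tottenham Premier League Match"))
# ChelseavsTottenhamPremierLeagueMatch
```

Because underscores are word characters, a name such as `a_b-c!` would become `a_bc`.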


In [157]:
keyword = input('Enter Twitter Query Term: ')

Enter Twitter Query Term: #CHETOT


In [158]:
keyword = keyword.replace(' ', '_').lower()
print(keyword)

#chetot


In [159]:
df = pd.read_table('data/final/query_%s_data.txt' % keyword, sep='\t', encoding='utf-8', header=0)

#### (OPTIONAL) merge query data to event data
- edit the cells below to match the number of query terms used for the event

In [160]:
#queryterm1 =
#queryterm2 = 

In [161]:
#df_query1 = pd.read_table('data/final/query_%s_data.txt' % queryterm1, sep='\t', encoding='utf-8', header=0)
#df_query2 = pd.read_table('data/final/query_%s_data.txt' % queryterm2, sep='\t', encoding='utf-8', header=0)

In [162]:
#df_query1['query'] = queryterm1
#df_query2['query'] = queryterm2

In [163]:
#df = pd.concat([df_query1, df_query2])
#df.shape

In [164]:
#drop duplicate tweets
#handles overlap between queries
#df = df.drop_duplicates('id_str')
#df.shape
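
A minimal sketch of the optional merge, run on toy data rather than the real query files. The two small frames and the second query term `#cfc` are assumptions made up for illustration; only the `id_str` and `query` column names come from the cells above.

```python
import pandas as pd

# Hypothetical toy frames standing in for two per-query files
df_query1 = pd.DataFrame({"id_str": ["1", "2"], "text": ["a", "b"]})
df_query2 = pd.DataFrame({"id_str": ["2", "3"], "text": ["b", "c"]})

# Tag each frame with the query term that produced it
df_query1["query"] = "#chetot"
df_query2["query"] = "#cfc"  # hypothetical second term

merged = pd.concat([df_query1, df_query2], ignore_index=True)
# Tweet "2" matched both queries; drop_duplicates keeps its first occurrence
merged = merged.drop_duplicates("id_str")
print(merged.shape)
# (3, 3)
```

Note that a tweet matched by several queries keeps only the `query` label of the frame concatenated first.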

#### datetime
- convert string to pandas datetime
- sort data on date posted ascending

In [165]:
df['created_at'] = pd.to_datetime(df['created_at'])
df = df.sort_values(by='created_at', ascending=True)
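
If the `created_at` strings carry UTC offsets, one way to make the UTC handling explicit is `utc=True`. This is a sketch on hypothetical sample strings; the actual timestamp format in the data file is not shown in this notebook.

```python
import pandas as pd

# Hypothetical sample values with explicit offsets
toy = pd.DataFrame({"created_at": ["2016-05-02 19:00:00+01:00",
                                   "2016-05-02 17:30:00+00:00"]})

# utc=True converts every offset to UTC and yields a tz-aware column
toy["created_at"] = pd.to_datetime(toy["created_at"], utc=True)
toy = toy.sort_values(by="created_at", ascending=True)
print(toy["created_at"].tolist())
```

A tz-aware column cannot be compared with tz-naive bounds, so the slicing bounds would then need a timezone as well, for example `pd.Timestamp('2016-05-02', tz='UTC')`.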

#### slice event on date range
- we discard data outside the event window (the day before through the day after the event day)

In [166]:
event_date = pd.to_datetime(input('Enter Event Date (YYYY-MM-DD): '))

Enter Event Date (YYYY-MM-DD): 2016-05-02


In [167]:
start = event_date - DateOffset(days=1)
finish = event_date + DateOffset(days=2)
print('START: ', start, 'FINISH: ', finish)

START:  2016-05-01 00:00:00 FINISH:  2016-05-04 00:00:00


In [168]:
df = df[(df['created_at'] >= start) & (df['created_at'] < finish)]
df.shape

(29808, 28)
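
The window is half-open: `start` is included and `finish` is excluded, so exactly three calendar days are kept. A small check on made-up timestamps (the toy frame is an assumption for illustration) shows the boundary behaviour:

```python
import pandas as pd

event_date = pd.Timestamp("2016-05-02")
start = event_date - pd.DateOffset(days=1)
finish = event_date + pd.DateOffset(days=2)  # exclusive upper bound

# Hypothetical timestamps placed on and around the window edges
toy = pd.DataFrame({"created_at": pd.to_datetime([
    "2016-04-30 23:59:59",  # before the window, dropped
    "2016-05-01 00:00:00",  # first instant of the window, kept
    "2016-05-03 23:59:59",  # last second of the window, kept
    "2016-05-04 00:00:00",  # equals finish, dropped
])})

in_window = toy[(toy["created_at"] >= start) & (toy["created_at"] < finish)]
print(len(in_window))
# 2
```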

#### create master database
- add index as master id, rename twitter id
- export master event file

In [169]:
df.reset_index(drop=True, inplace=True)
df = df.reset_index().rename(columns={'index' : 'master_id', 'id_str' : 'twitter_id'})

In [170]:
df.to_csv('data/final/event_%s_data.txt' % event_filename, sep='\t', encoding='utf-8', header=True, index=False)

#### get annotation data
- define the data that will go to CrowdFlower for annotation
- exclude retweets and replies
- add event title and description
- sample 10k tweets with a fixed random seed for reproducibility

In [171]:
#copy the slice so new columns can be added without a SettingWithCopyWarning
df_forann = df[(df['is_retweet'] == 0) & (df['is_reply'] == 0)][['master_id', 'twitter_id', 'created_at', 'text']].copy()

In [172]:
df_forann.loc[:,'event'] = event_name
df_forann.loc[:,'event_description'] = input('Enter Event Description: ')

Enter Event Description: The 2015–16 Premier League is the 24th season of the Premier League, the top English professional league for association football clubs, since its establishment in 1992.


In [173]:
#no retweets, only text
df_forann_smpl = df_forann.sample(10000, random_state=2016)
#format for crowdflower import
#delimited by ,
#double quote all non-numeric fields
df_forann_smpl.to_csv('data/final/event_%s_annsample.csv' % event_filename, sep=',', quoting=csv.QUOTE_NONNUMERIC, encoding='utf-8', header=True, index=False)
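
`DataFrame.sample` raises a `ValueError` when asked for more rows than the frame holds (without replacement), which could happen for a smaller event. A hedged sketch of a guard follows; the helper name `sample_for_annotation` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def sample_for_annotation(frame: pd.DataFrame, n: int = 10000, seed: int = 2016) -> pd.DataFrame:
    """Hypothetical helper: sample at most n rows with a fixed seed."""
    # Cap n at the frame length so small events do not raise ValueError
    return frame.sample(min(n, len(frame)), random_state=seed)

toy = pd.DataFrame({"master_id": range(5), "text": list("abcde")})
smpl = sample_for_annotation(toy)
print(len(smpl))
# 5
```

With the same seed and the same input, repeated calls return the same rows, which keeps the annotation sample reproducible.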