## Tweets from old games

`scotrugbytweets` uses some rudimentary NLP-based named entity extraction from `nltk` with some fairly poor results. I've added a 'boring term' blacklist (team names, home nations, sponsors) but this is fairly unscalable and might miss out on some cool organic trends. In this notebook, I'll have a look at historical tweets around a Glasgow Warriors and Edinburgh Rugby game to get some insight into what is typically being tweeted about, and what might be interesting for `scotrugbytweets` visualisations.

In [17]:
import pandas as pd
from tweepy import OAuthHandler, Cursor, API
import yaml

#add directory above to path (listener Python namespace)
import sys
sys.path.append("..")

with open('../credentials.yaml', 'r') as f:
    creds = yaml.load(f)
    
auth = OAuthHandler(creds['consumer']['key'], creds['consumer']['secret'])
auth.set_access_token(creds['access']['key'], creds['access']['secret'])

Due to the week-limit on the Twitter API, we can only grab historical tweets from two Scottish games.

* **Glasgow Warriors** (A) v. Benetton Treviso (5/1/19 15:00KO)
* **Edinburgh** (H) v. Southern Kings (5/1/19 19:35 KO)

In [69]:
games_meta = {
    'GLA': {'kickoff': pd.to_datetime('2019-01-05 15:00'), 
            'twitter_punc': ['@GlasgowWarriors', '#wearewarriors', '#warriornation'], 
            'opposition': '@BenettonTreviso'},
    'EDI': {'kickoff': pd.to_datetime('2019-01-05 19:35'), 
            'twitter_punc': ['@EdinburghRugby', '#alwaysedinburgh'], 
            'opposition': '@SouthernKings'}
        }

In [74]:
games_meta['GLA']['twitter_punc']

['@GlasgowWarriors', '#wearewarriors', '#warriornation']

In [93]:
import csv
from nlp import clean_tweet, extract_entities
from itertools import chain

for team in games_meta.keys():
    print(team)
    csv_file = open('tweets_{0}.csv'.format(team), 'w')
    csv_writer = csv.writer(csv_file)
    
    #time definition
    window = '1 day'
    start_time = games_meta[team]['kickoff'] - pd.Timedelta(window)
    end_time = games_meta[team]['kickoff'] + pd.Timedelta(window)

    api = API(auth)
    tweets = Cursor(api.search, q=games_meta[team]['twitter_punc'], lang="en",
           since=start_time, until=end_time)

    for tweet in tweets.items():
        cleaned_tweet = clean_tweet(tweet.text)
        print(cleaned_tweet)
        csv_writer.writerow([tweet.created_at, cleaned_tweet, extract_entities(cleaned_tweet)])

GLA
EDI


In [92]:
cleaned_tweet

'RT Just going to leave this here'

In [89]:
!ls -lh

total 280
-rw-r--r--  1 jjac  staff   7.9K  8 Jan 18:50 Tweets from old games.ipynb
-rw-r--r--  1 jjac  staff    81K  8 Jan 18:49 tweets.csv
-rw-r--r--  1 jjac  staff     0B  8 Jan 19:07 tweets_EDI.csv
-rw-r--r--  1 jjac  staff     0B  8 Jan 19:07 tweets_GLA.csv


In [85]:
tweets_df = pd.read_csv('./tweets.csv', header=None)
tweets_df.columns = ['timestamp', 'tweet', 'entities']

tweets_df['entities'] = tweets_df['entities'].apply(lambda row: row[1:-1].split(', '))
entities = pd.Series([row[1:-1] for row in list(chain(*tweets_df['entities'].values))])
entities = entities[entities != ''] #filter empty results

In [86]:
entities.value_counts()

Benetton         43
Italian          29
Conference       24
Treviso          17
Hastings         13
Sam Johnson      12
Rosie            10
Horne             8
Johnson           8
Europe            8
Rennie            8
SCRUM             7
Kick              7
Nairn             7
Dave Rennie       6
Dave              6
GLA               5
McDowall          5
Hogg              5
Jackson           5
Adam              5
Good              5
ESPN              5
Special Win       5
Kebble            5
Yep               4
Monigo            4
Click             4
Thomson           4
Capitano          4
                 ..
TRY               1
Narin             1
Poor              1
Dire              1
Didn              1
LOTR              1
GLASGOW           1
Strauss           1
Total             1
Xmas              1
Damp              1
Matthew Smith     1
Hmmm              1
Thank             1
Update            1
Keep              1
Possession        1
Fantastic         1
Irish             1
