## Tweets from old games

`scotrugbytweets` uses rudimentary named entity extraction from `nltk`, and the results are fairly poor. I've added a 'boring term' blacklist (team names, home nations, sponsors), but this doesn't scale well and might miss out on some cool organic trends. In this notebook, I'll have a look at historical tweets around a Glasgow Warriors game and an Edinburgh Rugby game to get some insight into what is typically being tweeted about, and what might be interesting for `scotrugbytweets` visualisations.
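
For reference, a blacklist filter of this kind can be sketched in a few lines. This is only an illustration: `BORING_TERMS` and `drop_boring` are hypothetical names, and the terms listed are made-up examples rather than the actual list used by `scotrugbytweets`.

```python
# Hypothetical blacklist; the real list in scotrugbytweets may differ.
BORING_TERMS = {"glasgow warriors", "edinburgh rugby", "scotland", "guinness pro14"}

def drop_boring(entities, blacklist=BORING_TERMS):
    """Return the entities whose lower-cased form is not in the blacklist."""
    return [e for e in entities if e.strip().lower() not in blacklist]

print(drop_boring(["Glasgow Warriors", "Hogg", "Scotland", "Dave Rennie"]))
```

Every new sponsor, opponent or nickname needs a manual entry here, which is why the approach is hard to scale.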

In [17]:
import pandas as pd
from tweepy import OAuthHandler, Cursor, API
import yaml

#add directory above to path (listener Python namespace)
import sys
sys.path.append("..")

with open('../credentials.yaml', 'r') as f:
    # safe_load only builds plain Python data; bare yaml.load(f) is deprecated and unsafe
    creds = yaml.safe_load(f)
    
auth = OAuthHandler(creds['consumer']['key'], creds['consumer']['secret'])
auth.set_access_token(creds['access']['key'], creds['access']['secret'])

Because the Twitter search API only goes back about a week, we can only grab historical tweets from two Scottish games:

* **Glasgow Warriors** (A) v. Benetton Treviso (5/1/19, 15:00 KO)
* **Edinburgh** (H) v. Southern Kings (5/1/19, 19:35 KO)
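
A date-limited search around a game comes down to a pair of `YYYY-MM-DD` strings either side of kickoff. A minimal sketch of deriving them with the standard library is below; `search_window` is a hypothetical helper, and the one-day margin is an assumption chosen for illustration. As far as I understand the search API, the end date is exclusive, so the end of the window needs to be the day after the last day of interest.

```python
from datetime import datetime, timedelta

def search_window(kickoff, days=1):
    """Return (start, end) date strings `days` either side of kickoff."""
    start = (kickoff - timedelta(days=days)).strftime("%Y-%m-%d")
    end = (kickoff + timedelta(days=days)).strftime("%Y-%m-%d")
    return start, end

# Glasgow Warriors kickoff from the list above
print(search_window(datetime(2019, 1, 5, 15, 0)))
```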

In [153]:
games_meta = {
    'GLA': {'kickoff': pd.to_datetime('2019-01-05 15:00'), 
            'twitter_punc': ['@GlasgowWarriors'], 
            'opposition': '@BenettonTreviso'},
    'EDI': {'kickoff': pd.to_datetime('2019-01-05 19:35'), 
            'twitter_punc': ['@EdinburghRugby'], 
            'opposition': '@SouthernKings'}
        }

def querify(terms):
    """Join hashtags or handles into a single OR query for the search API."""
    return ' OR '.join(terms)

In [155]:
import csv
from nlp import clean_tweet, extract_entities
from itertools import chain

api = API(auth, wait_on_rate_limit=True)
window = pd.Timedelta('1 day')  # search one day either side of kickoff

for team, meta in games_meta.items():
    print(team)

    # time definition: the search takes YYYY-MM-DD dates
    start_time = (meta['kickoff'] - window).strftime('%Y-%m-%d')
    end_time = (meta['kickoff'] + window).strftime('%Y-%m-%d')

    # api.search is the Tweepy 3.x name (renamed search_tweets in Tweepy 4)
    tweets = Cursor(api.search, q=querify(meta['twitter_punc']), lang="en",
                    since=start_time, until=end_time)

    # the with-block closes (and flushes) the file before it is read back later
    with open('tweets_{0}.csv'.format(team), 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        for tweet in tweets.items():
            cleaned_tweet = clean_tweet(tweet.text)
            csv_writer.writerow([tweet.created_at, cleaned_tweet, extract_entities(cleaned_tweet)])

GLA
EDI


In [160]:
tweets_df = pd.read_csv('./tweets_GLA.csv', header=None)
tweets_df.columns = ['timestamp', 'tweet', 'entities']

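# each entities cell is a stringified list: strip the brackets and split on ', '
# (note this also splits any single entity that itself contains ', ')
tweets_df['entities'] = tweets_df['entities'].apply(lambda row: row[1:-1].split(', '))
# flatten the per-tweet lists and strip the surrounding quotes from each entity
entities = pd.Series([row[1:-1] for row in chain(*tweets_df['entities'].values)])
entities = entities[entities != '']  # filter empty results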
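
The slicing above works, but it is brittle if an entity ever contains a comma or a quote. A possible alternative, sketched here under the assumption that each cell holds the `repr` of a Python list of strings (e.g. `"['Sam Johnson', 'Hogg']"`), is to parse the cell with `ast.literal_eval`. `parse_entities` is a hypothetical helper, not part of `scotrugbytweets`.

```python
import ast

def parse_entities(cell):
    """Parse a stringified list of entities, dropping empty strings."""
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    if not isinstance(parsed, (list, tuple, set)):
        return []
    return [str(e) for e in parsed if e]

print(parse_entities("['Sam Johnson', 'Hogg', '']"))
print(parse_entities("['Kick, Click']"))  # the comma stays inside one entity
```

With this in place, `tweets_df['entities'].apply(parse_entities)` would give lists that can be chained together directly, without the second round of quote stripping.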

In [161]:
entities.value_counts()

Benetton          41
Italian           26
Conference        20
Treviso           17
Sam Johnson       12
Hastings          12
Europe             8
Johnson            8
Kick               7
Rennie             7
Horne              7
SCRUM              7
Nairn              7
Dave Rennie        6
Dave               6
McDowall           5
GLA                5
Hogg               5
Kebble             5
Good               5
Adam               5
Jackson            5
Thomson            4
Special Win        4
Click              4
Yep                4
Monigo             4
Wilson             3
Happiness          3
Milestone          3
                  ..
Keep               1
America            1
Thank              1
Mark               1
Autumn             1
Dylan              1
Benetton Rugby     1
TRY                1
Narin              1
Poor               1
Dire               1
COME               1
Charlie            1
RUGBY              1
Jones              1
Dolce Vita         1
Unfortunate  
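
One thing that stands out in the counts is that the same person is split across several entities (`Sam Johnson` / `Johnson`, `Dave Rennie` / `Rennie` / `Dave`). A rough sketch of folding single-word entities into the one full name that contains them is below. `merge_partial_names` is a hypothetical helper, the input is a hand-copied subset of the counts above, and the merge is a heuristic: a lone `Dave` or `Johnson` could of course refer to someone else.

```python
from collections import Counter

def merge_partial_names(counts):
    """Fold an entity into a multi-word entity containing it as a whole word,
    but only when exactly one such multi-word entity exists."""
    merged = Counter()
    full_names = [name for name in counts if " " in name]
    for name, n in counts.items():
        hosts = [f for f in full_names if f != name and name in f.split()]
        target = hosts[0] if len(hosts) == 1 else name
        merged[target] += n
    return merged

# Subset of the value counts shown above
counts = Counter({"Sam Johnson": 12, "Johnson": 8, "Dave Rennie": 6,
                  "Rennie": 7, "Dave": 6, "Hogg": 5})
print(merge_partial_names(counts).most_common())
```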