# Beltway reporters

The goals here are:
1. Get a sense of the first tweets for each user.
2. Using mentions by the periodical reporters as an example, show that
   these mentions can be categorized by comparing them against lists
   of known Twitter accounts. For now, the rough categories
   are government, media, politicians, and reporters, but these can
   be refined later. This can then be replicated for other groups
   (e.g., newspaper reporters), as well as retweets and replies.
3. Start putting together a list of additional Twitter accounts that need
   to be categorized.

## Setup
This imports the required libraries and defines the function used to load the data.

In [10]:
import pandas as pd
import numpy as np
import json
from dateutil.parser import parse as date_parse
import gzip
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Filepaths of the files to load.
filepaths = ['d59d27e2f2ed4778881573df2ecf2fad_001.json.gz',
            '25319652321b4bb498b250ffc53aa0f0_001.json.gz']

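# Load tweets from gzipped, line-oriented JSON files, optionally transforming each tweet
# with the provided function. If the function returns a list, each item is yielded separately.
# limit caps the number of lines read from each file (not the total across files).
# Returns an iterator.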
def tweet_iter(filepaths, limit=None, tweet_transform_func=None):
    for filepath in filepaths:
        with gzip.open(filepath) as file:
            for count, line in enumerate(file):
                if count % 50000 == 0:
                    logging.debug('Loaded %s', count)
                tweet = json.loads(line)
                if tweet_transform_func:
                    tweet_transform_ret = tweet_transform_func(tweet)
                    if isinstance(tweet_transform_ret, list):
                        for tweet in tweet_transform_ret:
                            yield tweet
                    else:
                        yield tweet_transform_ret
                else:
                    yield tweet
                if count+1 == limit:
                    break


## Find first tweet for each user
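Comparing created_at (the date of the tweet) with user_created_at (the date the account was created) helps distinguish new accounts from prolific tweeters. If the first tweet in the dataset falls shortly after user_created_at, the account is probably new. If the account is old but its first tweet in the dataset is recent, the user is probably a prolific tweeter whose older tweets are not in the dataset.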

__Note that this is using data collected from the periodical press.__

### Load the data and count.

In [2]:
# Simplify the tweet on load
def tweet_transform(tweet):
    return { 
        'id': tweet['id_str'], 
        'created_at': date_parse(tweet['created_at']),
        'user_id': tweet['user']['id_str'],
        'screen_name': tweet['user']['screen_name'],
        'user_created_at': date_parse(tweet['user']['created_at']),
    }

tweet_df = pd.DataFrame(tweet_iter(filepaths, tweet_transform_func=tweet_transform))
tweet_df.count()

created_at         3364440
id                 3364440
screen_name        3364440
user_created_at    3364440
user_id            3364440
dtype: int64

### View the top of the data.

In [3]:
tweet_df.head()

Unnamed: 0,created_at,id,screen_name,user_created_at,user_id
0,2017-03-31 14:41:35+00:00,847821180832804864,A_Childers_,2013-08-01 21:44:28+00:00,1638925448
1,2017-03-31 14:15:34+00:00,847814632643473411,A_Childers_,2013-08-01 21:44:28+00:00,1638925448
2,2017-03-31 01:52:09+00:00,847627543142219776,A_Childers_,2013-08-01 21:44:28+00:00,1638925448
3,2017-03-30 23:52:23+00:00,847597404719267841,A_Childers_,2013-08-01 21:44:28+00:00,1638925448
4,2017-03-30 23:37:48+00:00,847593734896324608,A_Childers_,2013-08-01 21:44:28+00:00,1638925448


### First tweet for each user
created_at is the date of the first tweet in the dataset for the user. This can be compared
against user_created_at, the date the user account was created.

In [11]:
tweet_df.loc[tweet_df.groupby('user_id')['created_at'].idxmin()].sort_values('created_at', ascending=False).head(20)

Unnamed: 0,created_at,id,screen_name,user_created_at,user_id
946753,2017-03-31 17:07:58+00:00,847858018809237504,sklee_ca,2009-09-23 17:09:53+00:00,76696176
745318,2017-03-27 13:25:07+00:00,846352383609356288,emmaroller,2009-08-18 19:10:55+00:00,66768858
959935,2017-03-17 17:27:58+00:00,842789621314588673,RebeccaEHoffman,2017-03-17 17:18:52+00:00,842787331224584192
942211,2017-03-15 14:16:31+00:00,842016664237559809,ErinMcManus15,2017-02-15 21:03:24+00:00,831972200014045191
2530157,2017-03-10 16:33:43+00:00,840239253745471489,EvanMcS,2009-02-06 23:09:59+00:00,20281013
810658,2017-03-10 13:43:41+00:00,840196461472174081,CahnEmily,2009-01-10 03:19:50+00:00,18825339
1906009,2017-03-08 20:33:21+00:00,839574782341414912,LaurenFCarroll,2009-04-10 06:29:32+00:00,30176025
2552103,2017-03-07 16:38:59+00:00,839153415515168768,ericgeller,2007-04-08 20:27:11+00:00,3817401
780962,2017-03-03 22:00:52+00:00,837784869958737920,HotlineJosh,2009-02-22 23:45:46+00:00,21612122
2200367,2017-03-02 16:35:58+00:00,837340716560953344,chrisgeidner,2009-03-05 06:48:00+00:00,22891564
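
The comparison between created_at and user_created_at can be made explicit by computing the gap between the two. The following is a minimal sketch on a toy DataFrame standing in for tweet_df. The rows are illustrative, and the 30-day threshold and the account_age and is_new_account column names are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for tweet_df; the rows below are illustrative, not real data.
sample_df = pd.DataFrame({
    'user_id': ['1', '1', '2', '2'],
    'created_at': pd.to_datetime([
        '2017-03-01 12:00:00+00:00', '2017-03-05 12:00:00+00:00',
        '2017-03-17 17:27:58+00:00', '2017-03-20 09:00:00+00:00']),
    'user_created_at': pd.to_datetime([
        '2009-01-10 03:19:50+00:00', '2009-01-10 03:19:50+00:00',
        '2017-03-17 17:18:52+00:00', '2017-03-17 17:18:52+00:00']),
})

# Earliest tweet in the dataset for each user.
first_df = sample_df.loc[sample_df.groupby('user_id')['created_at'].idxmin()].copy()

# Gap between account creation and the first tweet in the dataset.
first_df['account_age'] = first_df['created_at'] - first_df['user_created_at']

# Hypothetical threshold: treat a gap under 30 days as a new account.
first_df['is_new_account'] = first_df['account_age'] < pd.Timedelta(days=30)
print(first_df[['user_id', 'account_age', 'is_new_account']])
```

Accounts with a large account_age and a recent first tweet are the candidates for prolific tweeters.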


## Top mentions
Determine who is being mentioned and attempt to characterize.

__Note that this is using data collected from the periodical press.__

In [12]:
# Simplify the tweet on load, producing one record per mention
def mention_transform(tweet):
    mentions = []
    for mention in tweet.get('entities', {}).get('user_mentions', []):
        mentions.append({
            'id': tweet['id_str'],
            'user_id': tweet['user']['id_str'],
            'screen_name': tweet['user']['screen_name'],
            'mention_user_id': mention['id_str'],
            'mention_screen_name': mention['screen_name']
        })
    return mentions

mention_df = pd.DataFrame(tweet_iter(filepaths, tweet_transform_func=mention_transform))


DEBUG:root:Loaded 0
DEBUG:root:Loaded 50000
DEBUG:root:Loaded 100000
DEBUG:root:Loaded 150000
DEBUG:root:Loaded 200000
DEBUG:root:Loaded 250000
DEBUG:root:Loaded 300000
DEBUG:root:Loaded 350000
DEBUG:root:Loaded 400000
DEBUG:root:Loaded 450000
DEBUG:root:Loaded 500000
DEBUG:root:Loaded 550000
DEBUG:root:Loaded 600000
DEBUG:root:Loaded 650000
DEBUG:root:Loaded 700000
DEBUG:root:Loaded 750000
DEBUG:root:Loaded 800000
DEBUG:root:Loaded 850000
DEBUG:root:Loaded 900000
DEBUG:root:Loaded 950000
DEBUG:root:Loaded 1000000
DEBUG:root:Loaded 1050000
DEBUG:root:Loaded 1100000
DEBUG:root:Loaded 1150000
DEBUG:root:Loaded 1200000
DEBUG:root:Loaded 1250000
DEBUG:root:Loaded 1300000
DEBUG:root:Loaded 1350000
DEBUG:root:Loaded 1400000
DEBUG:root:Loaded 1450000
DEBUG:root:Loaded 1500000
DEBUG:root:Loaded 0
DEBUG:root:Loaded 50000
DEBUG:root:Loaded 100000
DEBUG:root:Loaded 150000
DEBUG:root:Loaded 200000
DEBUG:root:Loaded 250000
DEBUG:root:Loaded 300000
DEBUG:root:Loaded 350000
DEBUG:root:Loaded 400000

### Number of mentions found in the dataset

In [13]:
mention_df.count()

id                     3029170
mention_screen_name    3029170
mention_user_id        3029170
screen_name            3029170
user_id                3029170
dtype: int64

### The mention data
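Each row is one mention. id is the tweet id, mention_screen_name and mention_user_id
identify the account being mentioned, and screen_name and user_id identify the account
that wrote the tweet. A tweet that mentions several accounts produces several rows.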

In [14]:
mention_df.head()

Unnamed: 0,id,mention_screen_name,mention_user_id,screen_name,user_id
0,847821180832804864,paulconndc,64502388,A_Childers_,1638925448
1,847821180832804864,Pat_Ambrosio,2497185313,A_Childers_,1638925448
2,847814632643473411,azevin,14744078,A_Childers_,1638925448
3,847627543142219776,davidbschultz,53739928,A_Childers_,1638925448
4,847597404719267841,davidbschultz,53739928,A_Childers_,1638925448
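
To illustrate how one tweet becomes several mention rows, here is a small sketch that applies the same extraction logic to a single tweet-like dict. The dict contains only the fields the extraction reads, with values copied from the first two rows of the table above. The variable names are assumptions made for this example.

```python
import pandas as pd

# Minimal tweet-like dict with only the fields needed to extract mentions.
# Values are taken from rows 0 and 1 of the table above.
sample_tweet = {
    'id_str': '847821180832804864',
    'user': {'id_str': '1638925448', 'screen_name': 'A_Childers_'},
    'entities': {
        'user_mentions': [
            {'id_str': '64502388', 'screen_name': 'paulconndc'},
            {'id_str': '2497185313', 'screen_name': 'Pat_Ambrosio'},
        ]
    },
}

# One record per mention; a tweet without mentions yields no records.
sample_rows = [
    {
        'id': sample_tweet['id_str'],
        'user_id': sample_tweet['user']['id_str'],
        'screen_name': sample_tweet['user']['screen_name'],
        'mention_user_id': mention['id_str'],
        'mention_screen_name': mention['screen_name'],
    }
    for mention in sample_tweet.get('entities', {}).get('user_mentions', [])
]

sample_mention_df = pd.DataFrame(sample_rows)
print(sample_mention_df)
```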


### Top mentioned accounts

In [15]:
mention_summary_df = pd.DataFrame(mention_df.groupby('mention_screen_name').size().reset_index(name='mention_screen_name_count'))
mention_summary_df['mention_screen_name_lower'] = mention_summary_df.mention_screen_name.apply(str.lower)
mention_summary_df.sort_values('mention_screen_name_count', ascending=False).head(50)

Unnamed: 0,mention_screen_name,mention_screen_name_count,mention_screen_name_lower
183261,realDonaldTrump,33899,realdonaldtrump
2674,AP,29688,ap
17686,CQnow,20586,cqnow
108893,WSJ,18616,wsj
78501,POTUS,16846,potus
105905,USATODAY,15634,usatoday
43586,HillaryClinton,15628,hillaryclinton
180900,politico,15310,politico
185611,rollcall,15118,rollcall
86343,Reuters,14000,reuters
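
One caveat: the summary groups by mention_screen_name as written, so the same account would be counted separately if it ever appeared with different capitalization. Whether that occurs in this dataset is not established here. As a hedged sketch on toy data, grouping by the lowercased name instead would merge such variants. The mention_count column name is an assumption for this example.

```python
import pandas as pd

# Toy mentions with inconsistent capitalization (illustrative, not real data).
toy_mentions = pd.DataFrame({'mention_screen_name': ['POTUS', 'potus', 'AP', 'POTUS']})

# Normalize case first, then count per normalized name.
toy_mentions['mention_screen_name_lower'] = toy_mentions['mention_screen_name'].str.lower()
toy_summary = (
    toy_mentions.groupby('mention_screen_name_lower')
    .size()
    .reset_index(name='mention_count')
    .sort_values('mention_count', ascending=False)
)
print(toy_summary)
```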


### Load known Twitter accounts

In [16]:
def seed_iter(filepath):
    with open(filepath) as file:
        for line in file:
            # Strip the trailing newline (if any) before splitting into the two fields.
            screen_name, user_id = line.strip().split(',')
            yield {'screen_name': screen_name, 'user_id': user_id}

def load_seed_df(filepath, seed_type):
    df = pd.DataFrame(seed_iter(filepath))
    df['screen_name_lower'] = df.screen_name.apply(str.lower)
    df['type'] = seed_type
    return df
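
As an alternative sketch, the same seed files could be read with pandas directly. This assumes each file has exactly two comma-separated fields per line (screen name, then user id) and no header row, which is what seed_iter implies. The function name load_seed_df_csv and the sample rows are assumptions for this example.

```python
import io
import pandas as pd

def load_seed_df_csv(file, seed_type):
    # dtype=str keeps user ids as strings so large ids are not converted to floats.
    df = pd.read_csv(file, header=None, names=['screen_name', 'user_id'], dtype=str)
    df['screen_name_lower'] = df['screen_name'].str.lower()
    df['type'] = seed_type
    return df

# In-memory sample in the assumed format; a real call would pass a file path.
sample_file = io.StringIO('AP,51241574\nWSJ,3108351')
sample_seed_df = load_seed_df_csv(sample_file, 'media')
print(sample_seed_df)
```

Reading ids as strings matters later: after a left join, a numeric user_id column with missing values is converted to float, which loses precision on long ids.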

In [17]:
federal_agencies_df = load_seed_df('federal_agencies.csv', 'government')
federal_agencies_df.count()

screen_name          2968
user_id              2968
screen_name_lower    2968
type                 2968
dtype: int64

In [18]:
news_outlets_df = load_seed_df('news_outlets.csv', 'media')
news_outlets_df.count()

screen_name          92
user_id              92
screen_name_lower    92
type                 92
dtype: int64

In [19]:
newspaper_reporters_df = load_seed_df('newspaper_reporters.csv', 'reporters')
newspaper_reporters_df.count()

screen_name          790
user_id              790
screen_name_lower    790
type                 790
dtype: int64

In [20]:
periodical_reporters_df = load_seed_df('periodical_reporters.csv', 'reporters')
periodical_reporters_df.count()

screen_name          677
user_id              677
screen_name_lower    677
type                 677
dtype: int64

In [21]:
administration_officials_df = load_seed_df('administration_officials.csv', 'politicians')
administration_officials_df.count()

screen_name          63
user_id              63
screen_name_lower    63
type                 63
dtype: int64

In [22]:
cabinet_df = load_seed_df('cabinet.csv', 'politicians')
cabinet_df.count()

screen_name          12
user_id              12
screen_name_lower    12
type                 12
dtype: int64

In [23]:
representatives_df = load_seed_df('representatives.csv', 'politicians')
representatives_df.count()

screen_name          431
user_id              431
screen_name_lower    431
type                 431
dtype: int64

In [24]:
senators_df = load_seed_df('senators.csv', 'politicians')
senators_df.count()

screen_name          100
user_id              100
screen_name_lower    100
type                 100
dtype: int64

In [25]:
media_df = load_seed_df('media.csv', 'media')
media_df.count()

screen_name          5997
user_id              5997
screen_name_lower    5997
type                 5997
dtype: int64

In [26]:
# Order is deliberate here: drop_duplicates keeps the first occurrence of each screen name,
# so earlier lists take precedence. pd.concat is used because DataFrame.append was removed in pandas 2.0.
screen_name_lookup_df = pd.concat([newspaper_reporters_df,
                                   administration_officials_df,
                                   news_outlets_df,
                                   periodical_reporters_df,
                                   cabinet_df,
                                   representatives_df,
                                   senators_df,
                                   media_df,
                                   federal_agencies_df], ignore_index=True).drop_duplicates(subset='screen_name_lower')
screen_name_lookup_df.count()

screen_name          10932
user_id              10932
screen_name_lower    10932
type                 10932
dtype: int64
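
The effect of list order on deduplication can be seen in a small sketch. The account names below are made up for illustration. Because drop_duplicates keeps the first occurrence by default, an account that appears in two lists takes its type from whichever list comes first.

```python
import pandas as pd

# Hypothetical seed lists; 'example_account' appears in both.
toy_reporters = pd.DataFrame({'screen_name_lower': ['example_account'],
                              'type': ['reporters']})
toy_media = pd.DataFrame({'screen_name_lower': ['example_account', 'other_account'],
                          'type': ['media', 'media']})

# Reporters come first, so the duplicate keeps the 'reporters' type.
toy_lookup = pd.concat([toy_reporters, toy_media], ignore_index=True).drop_duplicates(
    subset='screen_name_lower')
print(toy_lookup)
```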

### Join the mentions and the known Twitter accounts

In [27]:
mention_join_df = pd.merge(mention_summary_df, screen_name_lookup_df, how='left', left_on='mention_screen_name_lower', right_on='screen_name_lower')
# Assign the result back rather than calling fillna in place on a column selection.
mention_join_df['type'] = mention_join_df['type'].fillna('unknown')
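
A hedged variation on this join, shown on toy data: pd.merge accepts validate='many_to_one' to check that the lookup keys are unique, and indicator=True to add a _merge column that marks which rows found a match. The toy frames reuse two rows from the tables above, and the variable names are assumptions for this example.

```python
import pandas as pd

# Toy versions of the mention summary and the lookup table.
toy_summary = pd.DataFrame({
    'mention_screen_name': ['AP', 'CQnow'],
    'mention_screen_name_count': [29688, 20586],
    'mention_screen_name_lower': ['ap', 'cqnow'],
})
toy_lookup = pd.DataFrame({
    'screen_name': ['AP'],
    'user_id': ['51241574'],
    'screen_name_lower': ['ap'],
    'type': ['media'],
})

toy_join = pd.merge(
    toy_summary, toy_lookup, how='left',
    left_on='mention_screen_name_lower', right_on='screen_name_lower',
    validate='many_to_one',  # raises MergeError if lookup keys are duplicated
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
toy_join['type'] = toy_join['type'].fillna('unknown')

# Rows marked 'both' matched a known account.
toy_matched = toy_join['_merge'] == 'both'
print(toy_join[['mention_screen_name', 'type', '_merge']])
```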

### Top (by mention count) accounts that are matched against known Twitter accounts

In [28]:
mention_join_df[pd.notnull(mention_join_df.screen_name)].sort_values('mention_screen_name_count', ascending=False).head()

Unnamed: 0,mention_screen_name,mention_screen_name_count,mention_screen_name_lower,screen_name,user_id,screen_name_lower,type
183261,realDonaldTrump,33899,realdonaldtrump,realDonaldTrump,25073877,realdonaldtrump,politicians
2674,AP,29688,ap,AP,51241574,ap,media
108893,WSJ,18616,wsj,WSJ,3108351,wsj,media
78501,POTUS,16846,potus,POTUS,822215679726100480,potus,politicians
105905,USATODAY,15634,usatoday,USATODAY,15754281,usatoday,media


### Number of matched accounts
The count for mention_screen_name is the number of unique mentioned accounts. The count
for screen_name is the number of those accounts that matched a known Twitter account.

In [29]:
mention_join_df.count()

mention_screen_name          205243
mention_screen_name_count    205243
mention_screen_name_lower    205243
screen_name                    4445
user_id                        4445
screen_name_lower              4445
type                         205243
dtype: int64

### Top accounts by mentions
NaN (shown as an empty value) for screen_name indicates that the account is not matched with a known Twitter account.

In [30]:
mention_join_df.sort_values('mention_screen_name_count', ascending=False).head(50)

Unnamed: 0,mention_screen_name,mention_screen_name_count,mention_screen_name_lower,screen_name,user_id,screen_name_lower,type
183261,realDonaldTrump,33899,realdonaldtrump,realDonaldTrump,25073877,realdonaldtrump,politicians
2674,AP,29688,ap,AP,51241574,ap,media
17686,CQnow,20586,cqnow,,,,unknown
108893,WSJ,18616,wsj,WSJ,3108351,wsj,media
78501,POTUS,16846,potus,POTUS,822215679726100480,potus,politicians
105905,USATODAY,15634,usatoday,USATODAY,15754281,usatoday,media
43586,HillaryClinton,15628,hillaryclinton,,,,unknown
180900,politico,15310,politico,politico,9300262,politico,media
185611,rollcall,15118,rollcall,rollcall,15922214,rollcall,media
86343,Reuters,14000,reuters,Reuters,1652541,reuters,media


### Mentions by account type

In [31]:
# Sum only the numeric count column; the other columns are strings.
mention_join_df.groupby('type')[['mention_screen_name_count']].sum()

type,mention_screen_name_count
government,61099
media,236317
politicians,164298
reporters,563052
unknown,2004404
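
The counts are easier to interpret as shares of all mentions. The sketch below uses the totals from the table above. They sum to 3,029,170, which matches the number of mention rows reported earlier. The variable names are assumptions for this example.

```python
import pandas as pd

# Mention totals by account type, copied from the table above.
type_counts = pd.Series({
    'government': 61099,
    'media': 236317,
    'politicians': 164298,
    'reporters': 563052,
    'unknown': 2004404,
}, name='mention_screen_name_count')

total_mentions = type_counts.sum()

# Percentage of all mentions that falls in each category, rounded to one decimal.
type_shares = (type_counts / total_mentions * 100).round(1)
print(total_mentions)
print(type_shares.sort_values(ascending=False))
```

Roughly two thirds of mentions go to accounts that are still uncategorized, which is why the next step focuses on the unknown accounts.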


### Top (by mentions) accounts that are not known.
These are the accounts that we will want to categorize.

In [32]:
mention_join_df[mention_join_df.type == 'unknown'].sort_values('mention_screen_name_count', ascending=False).head(50)

Unnamed: 0,mention_screen_name,mention_screen_name_count,mention_screen_name_lower,screen_name,user_id,screen_name_lower,type
17686,CQnow,20586,cqnow,,,,unknown
43586,HillaryClinton,15628,hillaryclinton,,,,unknown
125458,business,12370,business,,,,unknown
138297,educationweek,10667,educationweek,,,,unknown
13548,BloombergBNA,10124,bloombergbna,,,,unknown
134090,dcexaminer,8924,dcexaminer,,,,unknown
31198,EEPublishing,7271,eepublishing,,,,unknown
165932,maggieNYT,6848,maggienyt,,,,unknown
123990,bpolitics,5839,bpolitics,,,,unknown
13560,BloombergLaw,5744,bloomberglaw,,,,unknown
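
One possible way to start the list of accounts to categorize is to export the most-mentioned unknown accounts with an empty type column to fill in by hand. This is a sketch on three rows from the table above. The cutoff of two rows and the in-memory buffer are assumptions for illustration. A real export would pass a file path to to_csv and use a larger cutoff.

```python
import io
import pandas as pd

# Three unknown accounts from the table above.
toy_unknown_df = pd.DataFrame({
    'mention_screen_name': ['HillaryClinton', 'CQnow', 'business'],
    'mention_screen_name_count': [15628, 20586, 12370],
    'type': 'unknown',
})

# Keep the most-mentioned accounts and blank out type so it can be filled in manually.
todo_df = (
    toy_unknown_df.sort_values('mention_screen_name_count', ascending=False)
    .head(2)[['mention_screen_name', 'mention_screen_name_count']]
    .assign(type='')
)

# Write to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
todo_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```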
