# Beltway reporters

The goals here are:
1. Get a sense of the first tweets for each user.
2. Using mentions by the periodical reporters as an example, show that
   these mentions can be categorized by comparing them against lists
   of known Twitter accounts. For now, the rough categories
   are government, media, politicians, and reporters, but these can
   be refined later. This can then be replicated for other groups
   (e.g., newspaper reporters), as well as retweets and replies.
3. Start putting together a list of additional Twitter accounts that need
   to be categorized.

## Setup
This defines the function used to load the tweet data.

In [1]:
import pandas as pd
import numpy as np
import json
from dateutil.parser import parse as date_parse

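# Load tweets from a line-oriented JSON file, optionally transforming each one with the
# provided function. If the function returns a list, each element is yielded separately.
# limit caps the number of lines read, not the number of items yielded.
# Returns an iterator.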
def tweet_iter(filepath, limit=None, tweet_transform_func=None):
    with open(filepath) as file:
        for count, line in enumerate(file):
            tweet = json.loads(line)
            if tweet_transform_func:
                tweet_transform_ret = tweet_transform_func(tweet)
                if isinstance(tweet_transform_ret, list):
                    for tweet in tweet_transform_ret:
                        yield tweet
                else:
                    yield tweet_transform_ret
            else:
                yield tweet
            if count+1 == limit:
                break


## Find first tweet for each user
Comparing the created_at date with the user_created_at date shows which accounts are new and which are prolific tweeters.

__Note that this is using data collected from the periodical press.__

### Load the data and count.

In [2]:
# Simplify the tweet on load
def tweet_transform(tweet):
    return { 
        'id': tweet['id_str'], 
        'created_at': date_parse(tweet['created_at']),
        'user_id': tweet['user']['id_str'],
        'screen_name': tweet['user']['screen_name'],
        'user_created_at': date_parse(tweet['user']['created_at']),
    }

tweet_df = pd.DataFrame(tweet_iter('034f608f30bd4800a1c448b490125b1d_001.json', tweet_transform_func=tweet_transform))
tweet_df.count()

created_at         515215
id                 515215
screen_name        515215
user_created_at    515215
user_id            515215
dtype: int64

### View the top of the data.

In [3]:
tweet_df.head()

Unnamed: 0,created_at,id,screen_name,user_created_at,user_id
0,2017-03-09 17:30:14+00:00,839891087095447552,Bruninga,2010-01-21 19:04:23+00:00,107170206
1,2017-03-09 17:25:19+00:00,839889849419186176,Bruninga,2010-01-21 19:04:23+00:00,107170206
2,2017-03-08 18:05:52+00:00,839537669067718659,Bruninga,2010-01-21 19:04:23+00:00,107170206
3,2017-03-07 02:58:35+00:00,838946955292192768,Bruninga,2010-01-21 19:04:23+00:00,107170206
4,2017-03-01 19:27:43+00:00,837021552428343296,Bruninga,2010-01-21 19:04:23+00:00,107170206


### First tweet for each user
created_at is the date of the first tweet in the dataset for the user. This can be compared
against user_created_at, the date the user account was created.

In [4]:
tweet_df.loc[tweet_df.groupby('user_id')['created_at'].idxmin()].sort_values('created_at', ascending=False).head(20)

Unnamed: 0,created_at,id,screen_name,user_created_at,user_id
18253,2017-02-28 19:00:31+00:00,836652316191244288,BeddingfieldMJ,2017-02-28 18:23:37+00:00,836643030161625089
502656,2017-02-08 21:50:54+00:00,829447440034066432,OSHAReporter,2017-02-03 19:50:14+00:00,827605131109793792
5963,2017-02-03 20:46:05+00:00,827619189259194368,brianbeutler,2009-02-23 21:31:16+00:00,21696279
288588,2017-02-03 01:33:02+00:00,827329014612291585,JSwiftTWS,2008-06-17 15:19:03+00:00,15146659
372117,2017-02-01 15:30:41+00:00,826815038728052736,ToporHCF,2017-02-01 15:05:54+00:00,826808800921403393
126979,2017-01-22 14:19:59+00:00,823173369281544192,edroso,2008-08-05 01:29:22+00:00,15730608
214337,2017-01-21 20:44:35+00:00,822907768801816578,MikeMadden,2008-01-11 22:43:03+00:00,12134692
77791,2016-12-21 00:51:22+00:00,811373460857454592,GrahamDavidA,2009-06-13 20:18:33+00:00,46955476
144540,2016-12-20 17:55:31+00:00,811268809021657089,DefenseBaron,2009-01-15 20:59:03+00:00,19038768
28914,2016-12-20 05:48:07+00:00,811085752041926656,andrewperezdc,2009-02-22 17:16:51+00:00,21579498
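
The comparison between created_at and user_created_at can be made explicit by subtracting the two columns. The following is a minimal sketch using three rows copied from the table above. The names `first_df` and `NEW_ACCOUNT_WINDOW` and the 30-day cutoff are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Three rows copied from the first-tweet table above.
first_df = pd.DataFrame({
    'screen_name': ['BeddingfieldMJ', 'OSHAReporter', 'brianbeutler'],
    'created_at': pd.to_datetime([
        '2017-02-28 19:00:31+00:00',
        '2017-02-08 21:50:54+00:00',
        '2017-02-03 20:46:05+00:00',
    ]),
    'user_created_at': pd.to_datetime([
        '2017-02-28 18:23:37+00:00',
        '2017-02-03 19:50:14+00:00',
        '2009-02-23 21:31:16+00:00',
    ]),
})

# Age of the account at the time of its first tweet in the dataset.
first_df['account_age_at_first_tweet'] = (
    first_df['created_at'] - first_df['user_created_at']
)

# Hypothetical cutoff: treat accounts younger than 30 days as new.
NEW_ACCOUNT_WINDOW = pd.Timedelta(days=30)
first_df['is_new_account'] = (
    first_df['account_age_at_first_tweet'] <= NEW_ACCOUNT_WINDOW
)

print(first_df[['screen_name', 'account_age_at_first_tweet', 'is_new_account']])
```

An old account whose first tweet in the dataset is recent (brianbeutler here) is a candidate prolific tweeter, while a small age suggests a newly created account.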


## Top mentions
Determine who is being mentioned and attempt to characterize.

__Note that this is using data collected from the periodical press.__

In [5]:
# Extract one record per mention on load
def mention_transform(tweet):
    mentions = []
    for mention in tweet.get('entities', {}).get('user_mentions', []):
        mentions.append({
            'id': tweet['id_str'],
            'user_id': tweet['user']['id_str'],
            'screen_name': tweet['user']['screen_name'],
            'mention_user_id': mention['id_str'],
            'mention_screen_name': mention['screen_name']
        })
    return mentions

mention_df = pd.DataFrame(tweet_iter('034f608f30bd4800a1c448b490125b1d_001.json', tweet_transform_func=mention_transform))


### Number of mentions found in the dataset

In [6]:
mention_df.count()

id                     466766
mention_screen_name    466766
mention_user_id        466766
screen_name            466766
user_id                466766
dtype: int64

### The mention data
Each mention consists of the tweet id, the screen name and user id of the mentioned account,
and the screen_name and user_id of the account doing the mentioning.

In [7]:
mention_df.head()

Unnamed: 0,id,mention_screen_name,mention_user_id,screen_name,user_id
0,839891087095447552,BBNAEnvironment,41847726,Bruninga,107170206
1,839891087095447552,BloombergBNA,459277523,Bruninga,107170206
2,839889849419186176,EPA,14615871,Bruninga,107170206
3,839889849419186176,BBNAEnvironment,41847726,Bruninga,107170206
4,839889849419186176,BloombergBNA,459277523,Bruninga,107170206
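
To show where these rows come from, the sketch below builds a tweet dictionary containing only the fields the transform reads and flattens it into one record per mention. The ids and screen names are taken from the rows above. `mention_rows` is a hypothetical compact equivalent of the transform, and the second, mention-free tweet is made up for illustration.

```python
def mention_rows(tweet):
    """Return one flat record per account mentioned in the tweet."""
    return [
        {
            'id': tweet['id_str'],
            'user_id': tweet['user']['id_str'],
            'screen_name': tweet['user']['screen_name'],
            'mention_user_id': mention['id_str'],
            'mention_screen_name': mention['screen_name'],
        }
        for mention in tweet.get('entities', {}).get('user_mentions', [])
    ]

# Only the fields that the transform reads; a real tweet has many more.
sample_tweet = {
    'id_str': '839889849419186176',
    'user': {'id_str': '107170206', 'screen_name': 'Bruninga'},
    'entities': {'user_mentions': [
        {'id_str': '14615871', 'screen_name': 'EPA'},
        {'id_str': '41847726', 'screen_name': 'BBNAEnvironment'},
        {'id_str': '459277523', 'screen_name': 'BloombergBNA'},
    ]},
}
rows = mention_rows(sample_tweet)

# Hypothetical tweet without an 'entities' key: it contributes no rows.
no_mention_rows = mention_rows(
    {'id_str': '1', 'user': {'id_str': '2', 'screen_name': 'example'}}
)

print(len(rows), len(no_mention_rows))
```

A tweet with several mentions yields several rows, which is why the mention count is not bounded by the tweet count.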


### Top mentioned accounts

In [8]:
mention_summary_df = mention_df.groupby('mention_screen_name').size().reset_index(name='mention_screen_name_count')
mention_summary_df['mention_screen_name_lower'] = mention_summary_df.mention_screen_name.str.lower()
mention_summary_df.sort_values('mention_screen_name_count', ascending=False).head(50)

Unnamed: 0,mention_screen_name,mention_screen_name_count,mention_screen_name_lower
4569,BloombergBNA,7055,bloombergbna
61138,realDonaldTrump,3367,realdonaldtrump
4579,BloombergLaw,2710,bloomberglaw
36690,WashBlade,2300,washblade
19206,KimberlyRobinsn,2254,kimberlyrobinsn
23462,MorningConsult,2221,morningconsult
12295,ForeignPolicy,2144,foreignpolicy
26187,POTUS,1916,potus
10450,EPA,1847,epa
31756,Slate,1820,slate


### Load known Twitter accounts

In [9]:
def seed_iter(filepath):
    with open(filepath) as file:
        for line in file:
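            # Strip the line ending instead of slicing off the last character,
            # so a final line without a trailing newline keeps its last digit.
            screen_name, user_id = line.strip().split(',')
            yield {'screen_name': screen_name, 'user_id': user_id}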

def load_seed_df(filepath, seed_type):
    df = pd.DataFrame(seed_iter(filepath))
    df['screen_name_lower'] = df.screen_name.apply(str.lower)
    df['type'] = seed_type
    return df
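
As an alternative, the same loading could be done with pandas itself. This is a sketch, assuming the seed files have no header row and exactly two comma-separated columns, as the parsing above implies. `load_seed_df_read_csv` is a hypothetical name, and the in-memory sample reuses two accounts that appear elsewhere in this notebook with a placeholder type.

```python
import io

import pandas as pd

def load_seed_df_read_csv(source, seed_type):
    # dtype=str keeps user ids as strings so that large ids are not
    # converted to floats and lose precision.
    df = pd.read_csv(
        source,
        header=None,
        names=['screen_name', 'user_id'],
        dtype=str,
    )
    df['screen_name_lower'] = df['screen_name'].str.lower()
    df['type'] = seed_type
    return df

# In-memory stand-in for a seed file; a file path works the same way.
sample = io.StringIO("EPA,14615871\nPOTUS,822215679726100480\n")
seed_df = load_seed_df_read_csv(sample, 'example')
print(seed_df)
```

Reading ids as strings avoids the float rendering of user_id that can appear once missing values are introduced by a join.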

In [10]:
federal_agencies_df = load_seed_df('federal_agencies.csv', 'government')
federal_agencies_df.count()

screen_name          2968
user_id              2968
screen_name_lower    2968
type                 2968
dtype: int64

In [11]:
news_outlets_df = load_seed_df('news_outlets.csv', 'media')
news_outlets_df.count()

screen_name          92
user_id              92
screen_name_lower    92
type                 92
dtype: int64

In [12]:
newspaper_reporters_df = load_seed_df('newspaper_reporters.csv', 'reporters')
newspaper_reporters_df.count()

screen_name          795
user_id              795
screen_name_lower    795
type                 795
dtype: int64

In [13]:
periodical_reporters_df = load_seed_df('periodical_reporters.csv', 'reporters')
periodical_reporters_df.count()

screen_name          256
user_id              256
screen_name_lower    256
type                 256
dtype: int64

In [14]:
administration_officials_df = load_seed_df('administration_officials.csv', 'politicians')
administration_officials_df.count()

screen_name          63
user_id              63
screen_name_lower    63
type                 63
dtype: int64

In [15]:
cabinet_df = load_seed_df('cabinet.csv', 'politicians')
cabinet_df.count()

screen_name          12
user_id              12
screen_name_lower    12
type                 12
dtype: int64

In [16]:
representatives_df = load_seed_df('representatives.csv', 'politicians')
representatives_df.count()

screen_name          431
user_id              431
screen_name_lower    431
type                 431
dtype: int64

In [17]:
senators_df = load_seed_df('senators.csv', 'politicians')
senators_df.count()

screen_name          100
user_id              100
screen_name_lower    100
type                 100
dtype: int64

In [18]:
media_df = load_seed_df('media.csv', 'media')
media_df.count()

screen_name          5997
user_id              5997
screen_name_lower    5997
type                 5997
dtype: int64

In [19]:
# Order is deliberate here: drop_duplicates keeps the first occurrence,
# so lists earlier in the sequence take precedence.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
screen_name_lookup_df = pd.concat([newspaper_reporters_df,
                                   administration_officials_df,
                                   news_outlets_df,
                                   periodical_reporters_df,
                                   cabinet_df,
                                   representatives_df,
                                   senators_df,
                                   media_df,
                                   federal_agencies_df], ignore_index=True).drop_duplicates(subset='screen_name_lower')
screen_name_lookup_df.count()

screen_name          10524
user_id              10524
screen_name_lower    10524
type                 10524
dtype: int64
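
Why the order matters can be seen on a toy example: `drop_duplicates` keeps the first row for each key by default, so an account that appears in two lists receives the type of whichever list comes first. The account names below are made up for illustration.

```python
import pandas as pd

# Hypothetical lists that share one account.
reporters = pd.DataFrame({
    'screen_name_lower': ['example_account'],
    'type': ['reporters'],
})
media = pd.DataFrame({
    'screen_name_lower': ['example_account', 'other_account'],
    'type': ['media', 'media'],
})

# Reporters first: the shared account is typed as a reporter.
lookup = pd.concat([reporters, media], ignore_index=True).drop_duplicates(
    subset='screen_name_lower'
)

# Media first: the same account is now typed as media.
reversed_lookup = pd.concat([media, reporters], ignore_index=True).drop_duplicates(
    subset='screen_name_lower'
)

print(dict(zip(lookup['screen_name_lower'], lookup['type'])))
print(dict(zip(reversed_lookup['screen_name_lower'], reversed_lookup['type'])))
```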

### Join the mentions and the known Twitter accounts

In [20]:
mention_join_df = pd.merge(mention_summary_df, screen_name_lookup_df, how='left', left_on='mention_screen_name_lower', right_on='screen_name_lower')
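# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas.
mention_join_df['type'] = mention_join_df['type'].fillna('unknown')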
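
The effect of the left join can be checked on a two-row example. This is a sketch using two accounts and counts from the mention summary and a one-row lookup. The `indicator=True` argument is an addition not used in the original cell; it adds a `_merge` column recording whether each row found a match.

```python
import pandas as pd

mentions = pd.DataFrame({
    'mention_screen_name_lower': ['bloombergbna', 'epa'],
    'mention_screen_name_count': [7055, 1847],
})
lookup = pd.DataFrame({
    'screen_name_lower': ['epa'],
    'type': ['government'],
})

# A left join keeps every mentioned account; unmatched rows get NaN
# in the lookup columns.
joined = mentions.merge(
    lookup,
    how='left',
    left_on='mention_screen_name_lower',
    right_on='screen_name_lower',
    indicator=True,
)
joined['type'] = joined['type'].fillna('unknown')

print(joined[['mention_screen_name_lower', 'type', '_merge']])
```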

### Top (by mention count) accounts that are matched against known Twitter accounts

In [21]:
mention_join_df[pd.notnull(mention_join_df.screen_name)].sort_values('mention_screen_name_count', ascending=False).head()

Unnamed: 0,mention_screen_name,mention_screen_name_count,mention_screen_name_lower,screen_name,user_id,screen_name_lower,type
61138,realDonaldTrump,3367,realdonaldtrump,realDonaldTrump,25073877,realdonaldtrump,politicians
19206,KimberlyRobinsn,2254,kimberlyrobinsn,KimberlyRobinsn,906734342,kimberlyrobinsn,reporters
26187,POTUS,1916,potus,POTUS,822215679726100480,potus,politicians
10450,EPA,1847,epa,EPA,14615871,epa,government
58984,nytimes,1779,nytimes,nytimes,807095,nytimes,media


### Number of matched accounts
The mention_screen_name count is the number of unique mentioned accounts. The screen_name
count is the number of those accounts that matched a known Twitter account.

In [22]:
mention_join_df.count()

mention_screen_name          68235
mention_screen_name_count    68235
mention_screen_name_lower    68235
screen_name                   2490
user_id                       2490
screen_name_lower             2490
type                         68235
dtype: int64

### Top accounts by mentions
NaN for screen_name indicates that the account is not matched with a known Twitter account.

In [23]:
mention_join_df.sort_values('mention_screen_name_count', ascending=False).head(50)

Unnamed: 0,mention_screen_name,mention_screen_name_count,mention_screen_name_lower,screen_name,user_id,screen_name_lower,type
4569,BloombergBNA,7055,bloombergbna,,,,unknown
61138,realDonaldTrump,3367,realdonaldtrump,realDonaldTrump,25073877.0,realdonaldtrump,politicians
4579,BloombergLaw,2710,bloomberglaw,,,,unknown
36690,WashBlade,2300,washblade,,,,unknown
19206,KimberlyRobinsn,2254,kimberlyrobinsn,KimberlyRobinsn,906734342.0,kimberlyrobinsn,reporters
23462,MorningConsult,2221,morningconsult,,,,unknown
12295,ForeignPolicy,2144,foreignpolicy,,,,unknown
26187,POTUS,1916,potus,POTUS,8.222156797261005e+17,potus,politicians
10450,EPA,1847,epa,EPA,14615871.0,epa,government
31756,Slate,1820,slate,,,,unknown


### Mentions by account type

In [24]:
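# Select the count column explicitly so the string columns are not summed.
mention_join_df.groupby('type')[['mention_screen_name_count']].sum()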

type,mention_screen_name_count
government,11548
media,19064
politicians,17713
reporters,50909
unknown,367532
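
These totals are easier to read as shares. The sketch below copies the five counts from the table above and converts them to percentages. The variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the table above.
counts = pd.Series(
    {
        'government': 11548,
        'media': 19064,
        'politicians': 17713,
        'reporters': 50909,
        'unknown': 367532,
    },
    name='mention_screen_name_count',
)

# The total equals the number of mentions reported by mention_df.count().
total = counts.sum()

# Share of all mentions per type, in percent.
share = (counts / total * 100).round(1)

print(total)
print(share)
```

The total comes to 466766, matching the mention count found earlier, and roughly 79% of mentions go to accounts that are not yet categorized.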


### Top (by mentions) accounts that are not known.
These are the accounts that we will want to categorize.

In [25]:
mention_join_df[mention_join_df.type == 'unknown'].sort_values('mention_screen_name_count', ascending=False).head(50)

Unnamed: 0,mention_screen_name,mention_screen_name_count,mention_screen_name_lower,screen_name,user_id,screen_name_lower,type
4569,BloombergBNA,7055,bloombergbna,,,,unknown
4579,BloombergLaw,2710,bloomberglaw,,,,unknown
36690,WashBlade,2300,washblade,,,,unknown
23462,MorningConsult,2221,morningconsult,,,,unknown
12295,ForeignPolicy,2144,foreignpolicy,,,,unknown
31756,Slate,1820,slate,,,,unknown
42006,business,1562,business,,,,unknown
43103,chronicle,1431,chronicle,,,,unknown
14580,HillaryClinton,1163,hillaryclinton,,,,unknown
9394,DefenseOne,996,defenseone,,,,unknown
