
# Gender dynamics

## Tweet data prep

### Load the tweets

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import logging
from dateutil.parser import parse as date_parse
from utils import load_tweet_df, tweet_type
import matplotlib.pyplot as plt


logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Set the float format so values are not displayed in scientific notation
pd.options.display.float_format = '{:20,.2f}'.format

def tweet_transform(tweet):
    return {
        'tweet_id': tweet['id_str'], 
        'tweet_created_at': date_parse(tweet['created_at']),
        'user_id': tweet['user']['id_str'],
        'screen_name': tweet['user']['screen_name'],
        'tweet_type': tweet_type(tweet)
    }

tweet_df = load_tweet_df(tweet_transform, ['tweet_id', 'user_id', 'screen_name', 'tweet_created_at', 'tweet_type'], dedupe_columns=['tweet_id'])
tweet_df.count()

INFO:root:Loading from tweets/642bf140607547cb9d4c6b1fc49772aa_001.json.gz
DEBUG:root:Loaded 50000
DEBUG:root:Loaded 100000
DEBUG:root:Loaded 150000
DEBUG:root:Loaded 200000
DEBUG:root:Loaded 250000
INFO:root:Loading from tweets/9f7ed17c16a1494c8690b4053609539d_001.json.gz
DEBUG:root:Loaded 300000
DEBUG:root:Loaded 350000
DEBUG:root:Loaded 400000
DEBUG:root:Loaded 450000
DEBUG:root:Loaded 500000
INFO:root:Loading from tweets/41feff28312c433ab004cd822212f4c2_001.json.gz
DEBUG:root:Loaded 550000
DEBUG:root:Loaded 600000
DEBUG:root:Loaded 650000
DEBUG:root:Loaded 700000
DEBUG:root:Loaded 750000
DEBUG:root:Loaded 800000


tweet_id            817136
user_id             817136
screen_name         817136
tweet_created_at    817136
tweet_type          817136
dtype: int64

In [2]:
tweet_df.head()

,tweet_id,user_id,screen_name,tweet_created_at,tweet_type
0,872631046088601600,327862439,jonathanvswan,2017-06-08 01:47:08+00:00,retweet
1,872610483647516673,327862439,jonathanvswan,2017-06-08 00:25:26+00:00,retweet
2,872609618626826240,327862439,jonathanvswan,2017-06-08 00:22:00+00:00,retweet
3,872605974699311104,327862439,jonathanvswan,2017-06-08 00:07:31+00:00,retweet
4,872603191518646276,327862439,jonathanvswan,2017-06-07 23:56:27+00:00,retweet


## Tweet analysis

### What are the first and last tweets in the dataset?

In [3]:
tweet_df.tweet_created_at.min()

Timestamp('2017-06-01 04:00:01+0000', tz='UTC')

In [4]:
tweet_df.tweet_created_at.max()

Timestamp('2017-08-01 03:59:58+0000', tz='UTC')

### How many retweets, original tweets, replies, and quotes are in dataset?

In [5]:
pd.DataFrame({'count':tweet_df.tweet_type.value_counts(), 
              'percentage':tweet_df.tweet_type.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})

tweet_type,count,percentage
retweet,345266,42.3%
original,233926,28.6%
reply,126254,15.5%
quote,111690,13.7%
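
The count-and-percentage table above is built the same way in several later cells. A minimal sketch of a helper that could factor this out is shown below; the name `count_and_percentage` is hypothetical and not part of the original notebook, and the data is a toy example.

```python
import pandas as pd


def count_and_percentage(series):
    """Return the count and percentage share of each distinct value (hypothetical helper)."""
    counts = series.value_counts()
    shares = series.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
    return pd.DataFrame({'count': counts, 'percentage': shares})


# Toy data, not drawn from the dataset
example = pd.Series(['retweet', 'retweet', 'original', 'reply'])
summary = count_and_percentage(example)
print(summary)
```

With such a helper, a cell like the one above would reduce to `count_and_percentage(tweet_df.tweet_type)`.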


## Tweeter data prep

The tweeter data is assembled from three sources:
1. User lookup: Lists of users exported from SFM. This is the final set of beltway journalists. Accounts that were suspended or deleted have been removed. The list includes users who did not tweet (i.e., have no tweets in the dataset).
2. Tweets in the dataset: Used to generate tweet counts per tweeter. Because some beltway journalists may not have tweeted, the tweeters found here may not cover the whole user lookup. The tweets may also include those of users who were later excluded because their accounts were suspended or deleted, or because they were determined not to be beltway journalists.
3. User info lookup: Information on users that was manually coded in the beltway journalist spreadsheet or looked up from Twitter's API. It includes some accounts that were excluded from data collection for various reasons, such as working for a foreign news organization or no longer working as a beltway journalist. It is therefore a superset of the user lookup.

The tweeter data should therefore include tweet and user info data only for users in the user lookup.
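
As a rough illustration of why a left join from the user lookup gives the intended result, here is a small sketch with toy data. All ids and values are made up, and the frames are simplified stand-ins for the three sources described above.

```python
import pandas as pd

# Toy stand-in for the user lookup: the final set of journalists
lookup = pd.DataFrame({'screen_name': ['a', 'b', 'c']},
                      index=pd.Index(['1', '2', '3'], name='user_id'))
# Toy stand-in for the user info lookup: a superset ('9' was excluded from collection)
info = pd.DataFrame({'gender': ['F', 'M', 'F', 'M']},
                    index=pd.Index(['1', '2', '3', '9'], name='user_id'))
# Toy stand-in for tweet counts: '3' never tweeted, '8' was later removed from the lookup
counts = pd.DataFrame({'tweets_in_dataset': [5, 7, 2]},
                      index=pd.Index(['1', '2', '8'], name='user_id'))

# A left join keeps exactly the users in the lookup and drops everyone else
summary = lookup.join(info, how='left').join(counts, how='left')
summary['tweets_in_dataset'] = summary['tweets_in_dataset'].fillna(0)
print(summary)
```

Users `'8'` and `'9'` disappear because they are not in the lookup, while user `'3'` is kept with a tweet count of zero.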

### Load user lookup

In [6]:
user_lookup_filepaths = ('lookups/senate_press_lookup.csv',
                         'lookups/periodical_press_lookup.csv',
                         'lookups/radio_and_television_lookup.csv')
user_lookup_df = pd.concat((pd.read_csv(user_lookup_filepath, usecols=['Uid', 'Token'], dtype={'Uid': str}) for user_lookup_filepath in user_lookup_filepaths))
user_lookup_df.set_index('Uid', inplace=True)
user_lookup_df.rename(columns={'Token': 'screen_name'}, inplace=True)
user_lookup_df.index.names = ['user_id']
# Some users may be in multiple lists, so need to drop duplicates
user_lookup_df = user_lookup_df[~user_lookup_df.index.duplicated()]

user_lookup_df.count()

screen_name    2487
dtype: int64

In [7]:
user_lookup_df.head()

user_id,screen_name
23455653,abettel
33919343,AshleyRParker
18580432,b_fung
399225358,b_muzz
18834692,becca_milfeld


### Tweets in dataset per tweeter

In [8]:
user_tweet_count_df = tweet_df[['user_id', 'tweet_type']].groupby(['user_id', 'tweet_type']).size().unstack()
user_tweet_count_df.fillna(0, inplace=True)
user_tweet_count_df['tweets_in_dataset'] = user_tweet_count_df.original + user_tweet_count_df.quote + user_tweet_count_df.reply + user_tweet_count_df.retweet
user_tweet_count_df.count()

tweet_type
original             2292
quote                2292
reply                2292
retweet              2292
tweets_in_dataset    2292
dtype: int64
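
The per-user counts above come from a `groupby(...).size().unstack()` pivot. The sketch below shows the same idea on toy data, with two variations that are assumptions rather than the notebook's approach: `fill_value=0` keeps the counts as integers, and `sum(axis=1)` computes the total without naming each tweet type, so it would still work if a type never occurred.

```python
import pandas as pd

# Toy tweets, not drawn from the dataset
tweets = pd.DataFrame({
    'user_id': ['1', '1', '1', '2'],
    'tweet_type': ['original', 'retweet', 'retweet', 'reply'],
})

# One row per user, one column per tweet type; missing combinations become 0
counts = tweets.groupby(['user_id', 'tweet_type']).size().unstack(fill_value=0)
# Row total across whatever tweet types are present
counts['tweets_in_dataset'] = counts.sum(axis=1)
print(counts)
```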

In [9]:
user_tweet_count_df.head()

user_id,original,quote,reply,retweet,tweets_in_dataset
1001991865,13.0,3.0,1.0,31.0,48.0
1002229862,48.0,20.0,3.0,118.0,189.0
100270054,1.0,0.0,0.0,0.0,1.0
100802089,4.0,7.0,12.0,17.0,40.0
100860790,102.0,26.0,4.0,166.0,298.0


### Load user info

In [10]:
user_info_df = pd.read_csv('source_data/user_info_lookup.csv', names=['user_id', 'name', 'organization', 'position',
                                            'gender', 'followers_count', 'following_count', 'tweet_count',
                                            'user_created_at', 'verified', 'protected'],
                          dtype={'user_id': str}).set_index(['user_id'])
user_info_df.count()

name               2506
organization       2477
position           2503
gender             2505
followers_count    2506
following_count    2506
tweet_count        2506
user_created_at    2506
verified           2506
protected          2506
dtype: int64

In [11]:
user_info_df.head()

user_id,name,organization,position,gender,followers_count,following_count,tweet_count,user_created_at,verified,protected
20711445,"Glinski, Nina",,Freelance Reporter,F,963,507,909,Thu Feb 12 20:00:53 +0000 2009,False,False
258917371,"Enders, David",,Journalist,M,1444,484,6296,Mon Feb 28 19:52:03 +0000 2011,True,False
297046834,"Barakat, Matthew",Associated Press,Northern Virginia Correspondent,M,759,352,631,Wed May 11 20:55:24 +0000 2011,True,False
455585786,"Atkins, Kimberly",Boston Herald,Chief Washington Reporter/Columnist,F,2944,2691,6277,Thu Jan 05 08:26:46 +0000 2012,True,False
42584840,"Vlahou, Toula",CQ Roll Call,Editor & Podcast Producer,F,2703,201,6366,Tue May 26 07:41:38 +0000 2009,False,False


In [12]:
user_summary_df = user_lookup_df.join((user_info_df, user_tweet_count_df), how='left')
# Fill NaNs: blank organization, zero counts for users with no tweets in the dataset
user_summary_df['organization'] = user_summary_df['organization'].fillna('')
count_columns = ['original', 'quote', 'reply', 'retweet', 'tweets_in_dataset']
user_summary_df[count_columns] = user_summary_df[count_columns].fillna(0)
user_summary_df.count()

screen_name          2487
name                 2487
organization         2487
position             2484
gender               2486
followers_count      2487
following_count      2487
tweet_count          2487
user_created_at      2487
verified             2487
protected            2487
original             2487
quote                2487
reply                2487
retweet              2487
tweets_in_dataset    2487
dtype: int64

In [13]:
user_summary_df.head()

user_id,screen_name,name,organization,position,gender,followers_count,following_count,tweet_count,user_created_at,verified,protected,original,quote,reply,retweet,tweets_in_dataset
23455653,abettel,"Bettelheim, Adriel",Politico,Health Care Editor,F,2664,1055,15990,Mon Mar 09 16:32:20 +0000 2009,True,False,289.0,12.0,6.0,52.0,359.0
33919343,AshleyRParker,"Parker, Ashley",Washington Post,White House Reporter,F,122382,2342,12433,Tue Apr 21 14:28:57 +0000 2009,True,False,172.0,67.0,11.0,120.0,370.0
18580432,b_fung,"Fung, Brian",Washington Post,Tech Reporter,M,16558,2062,44799,Sat Jan 03 15:15:57 +0000 2009,True,False,257.0,85.0,205.0,82.0,629.0
399225358,b_muzz,"Murray, Brendan",Bloomberg News,"Managing Editor, U.S. Economy",M,624,382,360,Thu Oct 27 05:34:05 +0000 2011,True,False,3.0,0.0,0.0,5.0,8.0
18834692,becca_milfeld,"Milfeld, Becca",Agence France-Presse,English Desk Editor and Journalist,F,483,993,1484,Sat Jan 10 13:58:43 +0000 2009,False,False,3.0,14.0,0.0,7.0,24.0


### Remove users with no tweets in dataset

In [14]:
user_summary_df[user_summary_df.tweets_in_dataset == 0].count()

screen_name          195
name                 195
organization         195
position             195
gender               194
followers_count      195
following_count      195
tweet_count          195
user_created_at      195
verified             195
protected            195
original             195
quote                195
reply                195
retweet              195
tweets_in_dataset    195
dtype: int64

In [15]:
user_summary_df = user_summary_df[user_summary_df.tweets_in_dataset != 0]
user_summary_df.count()

screen_name          2292
name                 2292
organization         2292
position             2289
gender               2292
followers_count      2292
following_count      2292
tweet_count          2292
user_created_at      2292
verified             2292
protected            2292
original             2292
quote                2292
reply                2292
retweet              2292
tweets_in_dataset    2292
dtype: int64

## Tweeter analysis

### How many of the journalists are male / female?

In [16]:
journalist_gender_summary_df = pd.DataFrame({'count':user_summary_df.gender.value_counts(), 'percentage':user_summary_df.gender.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})
journalist_gender_summary_df

gender,count,percentage
M,1299,56.7%
F,993,43.3%


### Summary

* 25%, 50%, and 75% are percentiles. (Min is equivalent to the 0th percentile and max to the 100th; 50% is the median.)
* std is the sample standard deviation, normalized by N-1.
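
To make these statistics concrete, here is a small sketch on toy data (not from the dataset). It shows that the `std` reported by `describe()` matches pandas' default `Series.std()` with `ddof=1`, and how a single large value pulls the mean far above the median, which is the pattern visible in the follower and tweet counts below.

```python
import numpy as np
import pandas as pd

# Toy data with one outlier
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
stats = values.describe()

# pandas default: ddof=1, i.e. the sum of squared deviations is divided by N-1
sample_std = values.std()
# NumPy default: ddof=0, i.e. divided by N
population_std = np.std(values.to_numpy())

print(stats)
print(sample_std, population_std)
```

Here the median (`50%`) is 3.0 while the mean is 22.0, so the median is the better guide to a typical value when the distribution is skewed.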

#### All

In [17]:
user_summary_df[['followers_count', 'following_count', 'tweet_count', 'original', 'quote', 'reply', 'retweet', 'tweets_in_dataset']].describe()

statistic,followers_count,following_count,tweet_count,original,quote,reply,retweet,tweets_in_dataset
count,2292.0,2292.0,2292.0,2292.0,2292.0,2292.0,2292.0,2292.0
mean,16467.62,1444.83,9619.69,102.06,48.73,55.08,150.64,356.52
std,91886.9,3003.0,16618.09,169.43,135.9,249.18,585.08,833.76
min,6.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
25%,831.75,505.75,1449.5,10.0,1.0,1.0,8.0,32.0
50%,2419.5,998.5,4211.5,41.0,9.0,5.0,39.0,122.0
75%,7348.75,1713.5,10817.25,124.25,43.0,30.0,129.0,375.0
max,2176578.0,96194.0,208763.0,2693.0,3069.0,9033.0,21524.0,21547.0


#### Female

In [18]:
user_summary_df[user_summary_df.gender == 'F'][['followers_count', 'following_count', 'tweet_count', 'original', 'quote', 'reply', 'retweet', 'tweets_in_dataset']].describe()

statistic,followers_count,following_count,tweet_count,original,quote,reply,retweet,tweets_in_dataset
count,993.0,993.0,993.0,993.0,993.0,993.0,993.0,993.0
mean,11609.53,1314.07,7498.74,83.84,39.27,32.06,135.55,290.72
std,65563.72,1250.56,11312.72,124.86,135.05,94.73,724.92,833.07
min,6.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
25%,825.0,567.0,1393.0,8.0,1.0,1.0,9.0,32.0
50%,2327.0,1034.0,4055.0,39.0,9.0,4.0,37.0,111.0
75%,6340.0,1659.0,8983.0,111.0,33.0,21.0,115.0,314.0
max,1388543.0,18197.0,118713.0,1440.0,3069.0,1458.0,21524.0,21547.0


#### Male

In [19]:
user_summary_df[user_summary_df.gender == 'M'][['followers_count', 'following_count', 'tweet_count', 'original', 'quote', 'reply', 'retweet', 'tweets_in_dataset']].describe()

statistic,followers_count,following_count,tweet_count,original,quote,reply,retweet,tweets_in_dataset
count,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0,1299.0
mean,20181.31,1544.78,11241.02,115.99,55.96,72.69,162.17,406.81
std,107635.37,3833.89,19584.46,195.72,136.16,319.41,449.75,831.1
min,10.0,0.0,5.0,0.0,0.0,0.0,0.0,1.0
25%,857.5,472.0,1477.0,12.0,0.0,1.0,6.0,33.0
50%,2498.0,953.0,4401.0,44.0,9.0,6.0,40.0,131.0
75%,8341.5,1763.0,12584.5,140.0,50.5,38.5,142.0,428.0
max,2176578.0,96194.0,208763.0,2693.0,1955.0,9033.0,7528.0,11432.0


### Verified

#### Of all journalists, how many are verified?

In [20]:
pd.DataFrame({'count':user_summary_df.verified.value_counts(), 'percentage':user_summary_df.verified.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})

verified,count,percentage
True,1240,54.1%
False,1052,45.9%


#### Of female journalists, how many are verified?

In [21]:
pd.DataFrame({'count':user_summary_df[user_summary_df.gender == 'F'].verified.value_counts(), 'percentage':user_summary_df[user_summary_df.gender == 'F'].verified.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})

verified,count,percentage
True,512,51.6%
False,481,48.4%


#### Of male journalists, how many are verified?

In [22]:
pd.DataFrame({'count':user_summary_df[user_summary_df.gender == 'M'].verified.value_counts(), 'percentage':user_summary_df[user_summary_df.gender == 'M'].verified.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})

verified,count,percentage
True,728,56.0%
False,571,44.0%


## Mention data prep

### Load mentions from tweets
Mentions are extracted from original tweets only; retweets, quotes, and replies are excluded.

In [23]:
# Simplify the tweet on load: emit one record per user mention in an original tweet
def mention_transform(tweet):
    mentions = []
    if tweet_type(tweet) == 'original':
        for mention in tweet.get('entities', {}).get('user_mentions', []):
            mentions.append({
                'tweet_id': tweet['id_str'],
                'user_id': tweet['user']['id_str'],
                'screen_name': tweet['user']['screen_name'],
                'mention_user_id': mention['id_str'],
                'mention_screen_name': mention['screen_name'],
                'tweet_created_at': date_parse(tweet['created_at'])
            })
    return mentions

base_mention_df = load_tweet_df(mention_transform, ['tweet_id', 'user_id', 'screen_name', 'mention_user_id',
                                           'mention_screen_name', 'tweet_created_at'], 
                           dedupe_columns=['tweet_id', 'mention_user_id'])
base_mention_df.count()

INFO:root:Loading from tweets/642bf140607547cb9d4c6b1fc49772aa_001.json.gz
DEBUG:root:Loaded 50000
DEBUG:root:Loaded 100000
DEBUG:root:Loaded 150000
DEBUG:root:Loaded 200000
DEBUG:root:Loaded 250000
INFO:root:Loading from tweets/9f7ed17c16a1494c8690b4053609539d_001.json.gz
DEBUG:root:Loaded 300000
DEBUG:root:Loaded 350000
DEBUG:root:Loaded 400000
DEBUG:root:Loaded 450000
DEBUG:root:Loaded 500000
INFO:root:Loading from tweets/41feff28312c433ab004cd822212f4c2_001.json.gz
DEBUG:root:Loaded 550000
DEBUG:root:Loaded 600000
DEBUG:root:Loaded 650000
DEBUG:root:Loaded 700000
DEBUG:root:Loaded 750000
DEBUG:root:Loaded 800000


tweet_id               118210
user_id                118210
screen_name            118210
mention_user_id        118210
mention_screen_name    118210
tweet_created_at       118210
dtype: int64

In [24]:
base_mention_df.head()

,tweet_id,user_id,screen_name,mention_user_id,mention_screen_name,tweet_created_at
0,872522339962978307,327862439,jonathanvswan,800707492346925056,axios,2017-06-07 18:35:11+00:00
1,872484939530461184,327862439,jonathanvswan,17494010,SenSchumer,2017-06-07 16:06:34+00:00
2,872475140575170562,327862439,jonathanvswan,2836421,MSNBC,2017-06-07 15:27:37+00:00
3,872475140575170562,327862439,jonathanvswan,800707492346925056,axios,2017-06-07 15:27:37+00:00
4,872459457946673154,327862439,jonathanvswan,800707492346925056,axios,2017-06-07 14:25:18+00:00


### Add gender of mentioner

In [25]:
mention_df = base_mention_df.join(user_summary_df['gender'], on='user_id')
mention_df.count()

tweet_id               118210
user_id                118210
screen_name            118210
mention_user_id        118210
mention_screen_name    118210
tweet_created_at       118210
gender                 118210
dtype: int64

### How many tweets have mentions?

In [26]:
mention_df['tweet_id'].unique().size

84942

### How many users are mentioned? (All users, not just journalists)

In [27]:
mention_df['mention_user_id'].unique().size

17730

### Limit to mentions of journalists

In [28]:
journalists_mention_df = mention_df.join(user_summary_df['gender'], how='inner', on='mention_user_id', rsuffix='_mention')
journalists_mention_df.rename(columns = {'gender_mention': 'mention_gender'}, inplace=True)
journalists_mention_df.count()

tweet_id               14298
user_id                14298
screen_name            14298
mention_user_id        14298
mention_screen_name    14298
tweet_created_at       14298
gender                 14298
mention_gender         14298
dtype: int64
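
The inner join above does two things at once: it drops mentions of accounts that are not in `user_summary_df`, and it attaches the gender of the mentioned journalist. A minimal sketch on toy data follows; the ids and frames are made up for illustration.

```python
import pandas as pd

# Toy stand-in for user_summary_df: two journalists
journalists = pd.DataFrame({'gender': ['F', 'M']},
                           index=pd.Index(['1', '2'], name='user_id'))
# Toy mentions; '99' is not a journalist
mentions = pd.DataFrame({
    'user_id': ['1', '1', '2'],
    'mention_user_id': ['2', '99', '1'],
})

# Gender of the mentioner (default left join keeps every mention)
mentions = mentions.join(journalists['gender'], on='user_id')
# Gender of the mentioned account; the inner join drops non-journalists
journalist_mentions = mentions.join(journalists['gender'], how='inner',
                                    on='mention_user_id', rsuffix='_mention')
journalist_mentions = journalist_mentions.rename(columns={'gender_mention': 'mention_gender'})
print(journalist_mentions)
```

The `rsuffix` is needed because both sides carry a `gender` column by the time of the second join.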

In [29]:
journalists_mention_df.head()

,tweet_id,user_id,screen_name,mention_user_id,mention_screen_name,tweet_created_at,gender,mention_gender
16,870408075878027268,327862439,jonathanvswan,16031927,greta,2017-06-01 22:33:51+00:00,M,F
283,872581449861541893,19847765,sahilkapur,16031927,greta,2017-06-07 22:30:04+00:00,M,F
2202,872578055910371328,21252618,JakeSherman,16031927,greta,2017-06-07 22:16:34+00:00,M,F
15977,880841069243629568,70511174,Hadas_Gold,16031927,greta,2017-06-30 17:30:50+00:00,F,F
17258,880183952018886661,90077282,politicoalex,16031927,greta,2017-06-28 21:59:41+00:00,M,F


### Functions for summarizing mentions by beltway journalists

In [30]:
# Gender of beltway journalists mentioned by beltway journalists
def journalist_mention_gender_summary(mention_df):
    gender_summary_df = pd.DataFrame({'count': mention_df.mention_gender.value_counts(), 
                  'percentage': mention_df.mention_gender.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})
    gender_summary_df.reset_index(inplace=True)
    gender_summary_df['avg_mentions'] = gender_summary_df.apply(lambda row: row['count'] / journalist_gender_summary_df.loc[row['index']]['count'], axis=1)    
    gender_summary_df.set_index('index', inplace=True, drop=True)
    return gender_summary_df

def journalist_mention_summary(mention_df):
    # Mention count
    mention_count_df = pd.DataFrame(mention_df.mention_user_id.value_counts().rename('mention_count'))

    # Mentioning users. That is, the number of unique users mentioning each user.
    mention_user_id_per_user_df = mention_df[['mention_user_id', 'user_id']].drop_duplicates()
    mentioning_user_count_df = pd.DataFrame(mention_user_id_per_user_df.groupby('mention_user_id').size(), columns=['mentioning_count'])
    mentioning_user_count_df.index.name = 'user_id'

    # Join with user summary
    journalist_mention_summary_df = user_summary_df.join([mention_count_df, mentioning_user_count_df])
    journalist_mention_summary_df.fillna(0, inplace=True)
    journalist_mention_summary_df = journalist_mention_summary_df.sort_values(['mention_count', 'mentioning_count', 'followers_count'], ascending=False)
    return journalist_mention_summary_df

# Gender of top journalists mentioned by beltway journalists
def top_journalist_mention_gender_summary(mention_summary_df, mentioning_count_threshold=0, head=100):
    top_mention_summary_df = mention_summary_df[mention_summary_df.mentioning_count > mentioning_count_threshold].head(head)
    return pd.DataFrame({'count': top_mention_summary_df.gender.value_counts(), 
                  'percentage': top_mention_summary_df.gender.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})


# Fields for displaying journalist mention summaries
journalist_mention_summary_fields = ['screen_name', 'name', 'organization', 'gender', 'followers_count', 'mention_count', 'mentioning_count']
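
The two measures computed by `journalist_mention_summary` answer different questions: `mention_count` is the total number of mentions received, while `mentioning_count` is the number of distinct journalists who made them. The toy sketch below (made-up ids) shows how they diverge when one person mentions the same account repeatedly.

```python
import pandas as pd

# Toy mentions: user '1' mentions account '7' three times, user '2' once
mentions = pd.DataFrame({
    'user_id':         ['1', '1', '1', '2'],
    'mention_user_id': ['7', '7', '7', '7'],
})

# Total mentions received per account
mention_count = mentions['mention_user_id'].value_counts()

# Distinct mentioners per account: de-duplicate (account, mentioner) pairs first
mentioning_count = (mentions[['mention_user_id', 'user_id']]
                    .drop_duplicates()
                    .groupby('mention_user_id')
                    .size())

print(mention_count['7'], mentioning_count['7'])
```

An account mentioned many times by a handful of colleagues ranks high on the first measure and low on the second, which is why the tables below are shown in both orderings.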


## Mentioned analysis
*Note that for each of these, the complete list is being written to CSV in the output directory.*


### Original tweets (since mentions are extracted from original tweets)

#### Of the original tweets, how many were posted by male journalists / female journalists?

In [31]:
original_tweets_by_gender_df = user_summary_df[['gender', 'original']].groupby('gender').sum()
original_tweets_by_gender_df['percentage'] = original_tweets_by_gender_df.original.div(user_summary_df.original.sum()).mul(100).round(1).astype(str) + '%'
original_tweets_by_gender_df.reset_index(inplace=True)
original_tweets_by_gender_df['avg_original'] = original_tweets_by_gender_df.apply(lambda row: row['original'] / journalist_gender_summary_df.loc[row['gender']]['count'], axis=1)
original_tweets_by_gender_df.set_index('gender', inplace=True, drop=True)
original_tweets_by_gender_df

gender,original,percentage,avg_original
F,83251.0,35.6%,83.84
M,150675.0,64.4%,115.99


#### Who posted the most original tweets?

In [32]:
user_summary_df[['screen_name', 'name', 'organization', 'gender', 'followers_count', 'tweet_count', 'original', 'tweets_in_dataset']].sort_values(['original'], ascending=False).head(25)

user_id,screen_name,name,organization,gender,followers_count,tweet_count,original,tweets_in_dataset
16187637,ChadPergram,"Pergram, Chad",Fox News,M,59305,61461,2693.0,2693.0
31127446,markknoller,"Knoller, Mark",CBS News,M,301474,115132,1858.0,2089.0
16459325,ryanbeckwith,"Beckwith, Ryan Teague",Time Magazine,M,20947,92203,1534.0,5187.0
19580890,LeeCamp,"Camp, Lee",RTTV America,M,67601,52051,1517.0,3708.0
18825339,CahnEmily,"Cahn, Emily",Mic,F,16980,100803,1440.0,8196.0
593813785,DonnaYoungDC,"Young, Donna",S&P Global Market Intelligence,F,5894,49967,1332.0,4414.0
14529929,jaketapper,"Tapper, Jake",CNN,M,1305680,148143,1316.0,5078.0
21316253,ZekeJMiller,"Miller, Zeke J.",Time Magazine,M,198517,161148,1271.0,2106.0
36246939,malbertnews,"Albert, Mark",The Voyage Report,M,3575,28230,1078.0,1151.0
117467779,palbergo,"Albergo, Paul F.",Bloomberg BNA,M,1191,18083,1043.0,1236.0


### Mentions of all accounts (not just journalists)

#### Of journalists mentioning accounts, which are mentioned the most?
This is based on screen name, which could have changed during the collection period. However, for the accounts that appear at the top of this list, a change seems unlikely.
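
If screen-name changes were a concern, one alternative would be to count by `mention_user_id`, which is stable, and attach the most recent screen name for display. The sketch below illustrates this on toy data; it is not the approach the notebook takes, and the ids and names are made up.

```python
import pandas as pd

# Toy mentions: account '7' changed its screen name partway through
mentions = pd.DataFrame({
    'mention_user_id': ['7', '7', '7', '8'],
    'mention_screen_name': ['old_name', 'old_name', 'new_name', 'other'],
    'tweet_created_at': pd.to_datetime(['2017-06-01', '2017-06-02',
                                        '2017-07-01', '2017-06-15']),
})

# Counting by screen name splits account '7' across two rows
by_name = mentions['mention_screen_name'].value_counts()

# Counting by user id keeps it together
by_id = mentions['mention_user_id'].value_counts().rename('mention_count')
# Most recent screen name per account, for display only
latest_name = (mentions.sort_values('tweet_created_at')
               .groupby('mention_user_id')['mention_screen_name']
               .last()
               .rename('screen_name'))
by_id_named = pd.concat([latest_name, by_id], axis=1)
print(by_id_named)
```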

In [33]:
# Mention count
mention_count_screen_name_df = pd.DataFrame(mention_df.mention_screen_name.value_counts().rename('mention_count'))

# Count of mentioning users
mention_user_id_per_user_screen_name_df = mention_df[['mention_screen_name', 'user_id']].drop_duplicates()
mentioning_count_screen_name_df = pd.DataFrame(mention_user_id_per_user_screen_name_df.groupby('mention_screen_name').size(), columns=['mentioning_count'])
mentioning_count_screen_name_df.index.name = 'screen_name'

all_mentioned_df = mention_count_screen_name_df.join(mentioning_count_screen_name_df)
all_mentioned_df.to_csv('output/all_mentioned_by_journalists.csv')
all_mentioned_df.head(25)

screen_name,mention_count,mentioning_count
realDonaldTrump,2876,452
POTUS,2265,253
wusa9,2111,41
AP,1948,143
USATODAY,1235,105
nbcwashington,1230,70
WSJ,1227,152
dcexaminer,1034,53
SHSanders45,927,148
nytimes,829,289


#### Same, but ordered by the number of journalists mentioning the account

In [34]:
all_mentioned_df.sort_values(['mentioning_count', 'mention_count'], ascending=False).head(25)

screen_name,mention_count,mentioning_count
realDonaldTrump,2876,452
nytimes,829,289
POTUS,2265,253
SenJohnMcCain,599,231
Scaramucci,657,198
CNN,628,186
politico,747,181
SpeakerRyan,700,181
PressSec,654,178
washingtonpost,413,154


### Journalists mentioning journalists

#### Of journalists mentioning journalists, who is mentioned the most?

In [35]:
journalists_mention_summary_df = journalist_mention_summary(journalists_mention_df)
journalists_mention_summary_df.to_csv('output/journalists_mentioned_by_journalists.csv')
journalists_mention_summary_df[journalist_mention_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,mention_count,mentioning_count
325050734,AllysonRaeWx,"Banks, Allyson",WUSA–TV,F,6918,330.0,7.0
28496589,TenaciousTopper,"Shutt, Charles",WUSA–TV,M,15868,239.0,13.0
63149389,hbwx,"Bernstein, Howard",WUSA–TV,M,8337,235.0,10.0
407013776,burgessev,"Everett, John B.",Politico,M,31010,212.0,46.0
16018516,jenhab,"Haberkorn, Jennifer A.",Politico,F,20028,200.0,31.0
19186003,seungminkim,"Kim, Seung Min",Politico,F,33980,143.0,41.0
14529929,jaketapper,"Tapper, Jake",CNN,M,1305680,127.0,51.0
169586280,WaPoSean,"Sullivan, Sean",Washington Post,M,22860,117.0,20.0
997684836,pkcapitol,"Kane, Paul",Washington Post,M,31300,116.0,47.0
108617810,DanaBashCNN,"Bash, Dana",CNN,F,281861,115.0,55.0


#### Same, but ordered by number of journalists mentioning

In [36]:
journalists_mention_summary_df[journalist_mention_summary_fields].sort_values(['mentioning_count', 'mention_count'], ascending=False).head(25)

user_id,screen_name,name,organization,gender,followers_count,mention_count,mentioning_count
108617810,DanaBashCNN,"Bash, Dana",CNN,F,281861,115.0,55.0
14529929,jaketapper,"Tapper, Jake",CNN,M,1305680,127.0,51.0
997684836,pkcapitol,"Kane, Paul",Washington Post,M,31300,116.0,47.0
407013776,burgessev,"Everett, John B.",Politico,M,31010,212.0,46.0
112526560,kenvogel,"Vogel, Kenneth P.",Politico,M,53894,67.0,45.0
18227519,morningmika,"Brzezinski, Mika",MSNBC,F,653031,70.0,44.0
123327472,peterbakernyt,"Baker, Peter",New York Times,M,96956,107.0,43.0
39155029,mkraju,"Raju, Manu K.",CNN,M,88366,95.0,43.0
13524182,daveweigel,"Weigel, David",Washington Post,M,332344,106.0,42.0
19186003,seungminkim,"Kim, Seung Min",Politico,F,33980,143.0,41.0


#### Of mentions of journalists by other journalists, how many are of male / female journalists?

In [37]:
journalist_mention_gender_summary(journalists_mention_df)


index,count,percentage,avg_mentions
M,8298,58.0%,6.39
F,6000,42.0%,6.04


#### On average how many times are journalists mentioned by other journalists?

In [38]:
journalists_mention_summary_df[['mention_count']].describe()

statistic,mention_count
count,2292.0
mean,6.24
std,17.59
min,0.0
25%,0.0
50%,1.0
75%,5.0
max,330.0


### Journalists mentioning female journalists

#### Of journalists mentioning female journalists, who is mentioned the most?

In [39]:
female_journalists_mention_summary_df = journalists_mention_summary_df[journalists_mention_summary_df.gender == 'F']
female_journalists_mention_summary_df.to_csv('output/female_journalists_mentioned_by_journalists.csv')
female_journalists_mention_summary_df[journalist_mention_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,mention_count,mentioning_count
325050734,AllysonRaeWx,"Banks, Allyson",WUSA–TV,F,6918,330.0,7.0
16018516,jenhab,"Haberkorn, Jennifer A.",Politico,F,20028,200.0,31.0
19186003,seungminkim,"Kim, Seung Min",Politico,F,33980,143.0,41.0
108617810,DanaBashCNN,"Bash, Dana",CNN,F,281861,115.0,55.0
82151660,kelsey_snell,"Snell, Kelse",Washington Post,F,8108,109.0,22.0
33919343,AshleyRParker,"Parker, Ashley",Washington Post,F,122382,100.0,31.0
52392666,ZoeTillman,"Tillman, Zoe",BuzzFeed,F,15246,87.0,14.0
26632935,HopeSeck,"Hodge Seck, Hope",Military.com,F,4584,83.0,3.0
16441088,jestei,"Steinhauer, Jennifer",New York Times,F,13452,76.0,26.0
18227519,morningmika,"Brzezinski, Mika",MSNBC,F,653031,70.0,44.0


#### On average, how many times are female journalists mentioned by journalists?

In [40]:
female_journalists_mention_summary_df[['mention_count']].describe()

statistic,mention_count
count,993.0
mean,6.04
std,17.95
min,0.0
25%,0.0
50%,1.0
75%,4.0
max,330.0


### Journalists mentioning male journalists

#### Of journalists mentioning male journalists, who do they mention the most?

In [41]:
male_journalists_mention_summary_df = journalists_mention_summary_df[journalists_mention_summary_df.gender == 'M']
male_journalists_mention_summary_df.to_csv('output/male_journalists_mentioned_by_journalists.csv')
male_journalists_mention_summary_df[journalist_mention_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,mention_count,mentioning_count
28496589,TenaciousTopper,"Shutt, Charles",WUSA–TV,M,15868,239.0,13.0
63149389,hbwx,"Bernstein, Howard",WUSA–TV,M,8337,235.0,10.0
407013776,burgessev,"Everett, John B.",Politico,M,31010,212.0,46.0
14529929,jaketapper,"Tapper, Jake",CNN,M,1305680,127.0,51.0
169586280,WaPoSean,"Sullivan, Sean",Washington Post,M,22860,117.0,20.0
997684836,pkcapitol,"Kane, Paul",Washington Post,M,31300,116.0,47.0
123327472,peterbakernyt,"Baker, Peter",New York Times,M,96956,107.0,43.0
13524182,daveweigel,"Weigel, David",Washington Post,M,332344,106.0,42.0
46557945,StevenTDennis,"Dennis, Steven T.",Bloomberg News,M,55762,105.0,27.0
15931637,jonkarl,"Karl, Jonathan",ABC News,M,183467,104.0,40.0


#### On average, how many times are male journalists mentioned by journalists?

In [42]:
male_journalists_mention_summary_df[['mention_count']].describe()

statistic,mention_count
count,1299.0
mean,6.39
std,17.31
min,0.0
25%,0.0
50%,1.0
75%,5.0
max,239.0


### Female journalists mentioning other journalists

#### Of female journalists mentioning other journalists, who do they mention the most?

In [43]:
journalists_mentioned_by_female_summary_df = journalist_mention_summary(journalists_mention_df[journalists_mention_df.gender == 'F'])
journalists_mentioned_by_female_summary_df.to_csv('output/journalists_mentioned_by_female_journalists.csv')
journalists_mentioned_by_female_summary_df[journalist_mention_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,mention_count,mentioning_count
407013776,burgessev,"Everett, John B.",Politico,M,31010,164.0,20.0
16018516,jenhab,"Haberkorn, Jennifer A.",Politico,F,20028,116.0,13.0
46557945,StevenTDennis,"Dennis, Steven T.",Bloomberg News,M,55762,79.0,10.0
169586280,WaPoSean,"Sullivan, Sean",Washington Post,M,22860,71.0,11.0
48802204,HardballChris,"Matthews, Chris",NBC News,M,718330,70.0,3.0
19186003,seungminkim,"Kim, Seung Min",Politico,F,33980,64.0,16.0
22891564,chrisgeidner,"Geidner, Chris",BuzzFeed,M,83316,61.0,6.0
108617810,DanaBashCNN,"Bash, Dana",CNN,F,281861,60.0,26.0
16067683,pauldemko,"Demko, Paul Jeffrey",Politico,M,8170,57.0,10.0
313545488,LauraLitvan,"Litvan, Laura",Bloomberg News,F,4468,53.0,2.0


#### Of female journalists mentioning journalists, how many are male / female?

In [44]:
journalist_mention_gender_summary(journalists_mention_df[journalists_mention_df.gender == 'F'])

index,count,percentage,avg_mentions
M,3162,54.8%,2.43
F,2605,45.2%,2.62


### Male journalists mentioning other journalists

#### Of male journalists mentioning other journalists, who do they mention the most?

In [45]:
journalists_mentioned_by_male_summary_df = journalist_mention_summary(journalists_mention_df[journalists_mention_df.gender == 'M'])
journalists_mentioned_by_male_summary_df.to_csv('output/journalists_mentioned_by_male_journalists.csv')
journalists_mentioned_by_male_summary_df[journalist_mention_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,mention_count,mentioning_count
325050734,AllysonRaeWx,"Banks, Allyson",WUSA–TV,F,6918,324.0,4.0
28496589,TenaciousTopper,"Shutt, Charles",WUSA–TV,M,15868,225.0,7.0
63149389,hbwx,"Bernstein, Howard",WUSA–TV,M,8337,225.0,4.0
14529929,jaketapper,"Tapper, Jake",CNN,M,1305680,87.0,30.0
13524182,daveweigel,"Weigel, David",Washington Post,M,332344,84.0,30.0
16018516,jenhab,"Haberkorn, Jennifer A.",Politico,F,20028,84.0,18.0
997684836,pkcapitol,"Kane, Paul",Washington Post,M,31300,81.0,34.0
19186003,seungminkim,"Kim, Seung Min",Politico,F,33980,79.0,25.0
123327472,peterbakernyt,"Baker, Peter",New York Times,M,96956,78.0,29.0
26632935,HopeSeck,"Hodge Seck, Hope",Military.com,F,4584,76.0,1.0


#### Of male journalists mentioning other journalists, how many are male / female?

In [46]:
journalist_mention_gender_summary(journalists_mention_df[journalists_mention_df.gender == 'M'])

index,count,percentage,avg_mentions
M,5136,60.2%,3.95
F,3395,39.8%,3.42


## Retweet data prep

### Load retweets from tweets
This includes both retweets and quotes.

In [47]:
# Simplify the tweet on load, keeping only retweets and quotes
def retweet_transform(tweet):
    if tweet_type(tweet) in ('retweet', 'quote'):
        retweet = tweet.get('retweeted_status') or tweet.get('quoted_status')
        return {
            'tweet_id': tweet['id_str'],
            'user_id': tweet['user']['id_str'],
            'screen_name': tweet['user']['screen_name'],
            'retweet_user_id': retweet['user']['id_str'],
            'retweet_screen_name': retweet['user']['screen_name'],
            'tweet_created_at': date_parse(tweet['created_at'])            
        }
    return None

base_retweet_df = load_tweet_df(retweet_transform, ['tweet_id', 'user_id', 'screen_name', 'retweet_user_id',
                                           'retweet_screen_name', 'tweet_created_at'],
                           dedupe_columns=['tweet_id'])

base_retweet_df.count()

INFO:root:Loading from tweets/642bf140607547cb9d4c6b1fc49772aa_001.json.gz
DEBUG:root:Loaded 50000
DEBUG:root:Loaded 100000
DEBUG:root:Loaded 150000
DEBUG:root:Loaded 200000
DEBUG:root:Loaded 250000
INFO:root:Loading from tweets/9f7ed17c16a1494c8690b4053609539d_001.json.gz
DEBUG:root:Loaded 300000
DEBUG:root:Loaded 350000
DEBUG:root:Loaded 400000
DEBUG:root:Loaded 450000
DEBUG:root:Loaded 500000
INFO:root:Loading from tweets/41feff28312c433ab004cd822212f4c2_001.json.gz
DEBUG:root:Loaded 550000
DEBUG:root:Loaded 600000
DEBUG:root:Loaded 650000
DEBUG:root:Loaded 700000
DEBUG:root:Loaded 750000
DEBUG:root:Loaded 800000


tweet_id               456956
user_id                456956
screen_name            456956
retweet_user_id        456956
retweet_screen_name    456956
tweet_created_at       456956
dtype: int64

In [48]:
base_retweet_df.head()

Unnamed: 0,tweet_id,user_id,screen_name,retweet_user_id,retweet_screen_name,tweet_created_at
0,872631046088601600,327862439,jonathanvswan,93069110,maggieNYT,2017-06-08 01:47:08+00:00
1,872610483647516673,327862439,jonathanvswan,160951141,TomNamako,2017-06-08 00:25:26+00:00
2,872609618626826240,327862439,jonathanvswan,18678924,jmartNYT,2017-06-08 00:22:00+00:00
3,872605974699311104,327862439,jonathanvswan,93069110,maggieNYT,2017-06-08 00:07:31+00:00
4,872603191518646276,327862439,jonathanvswan,94784682,JonathanTurley,2017-06-07 23:56:27+00:00
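
As a rough illustration of what `retweet_transform` produces, the sketch below applies the same logic to two hypothetical, heavily trimmed tweet dicts. `tweet_type` comes from the notebook's `utils` module and its implementation is not shown, so `toy_tweet_type` is an assumed stand-in; all IDs and screen names are made up.

```python
from dateutil.parser import parse as date_parse

# Assumed stand-in for utils.tweet_type (the real implementation is not shown).
def toy_tweet_type(tweet):
    if 'retweeted_status' in tweet:
        return 'retweet'
    if 'quoted_status' in tweet:
        return 'quote'
    return 'original'

def toy_retweet_transform(tweet):
    # Same logic as retweet_transform, using the stand-in above.
    if toy_tweet_type(tweet) in ('retweet', 'quote'):
        retweet = tweet.get('retweeted_status') or tweet.get('quoted_status')
        return {
            'tweet_id': tweet['id_str'],
            'user_id': tweet['user']['id_str'],
            'screen_name': tweet['user']['screen_name'],
            'retweet_user_id': retweet['user']['id_str'],
            'retweet_screen_name': retweet['user']['screen_name'],
            'tweet_created_at': date_parse(tweet['created_at'])
        }
    return None

# Hypothetical tweets: one retweet and one original tweet.
toy_retweet = {
    'id_str': '1',
    'created_at': 'Thu Jun 08 01:47:08 +0000 2017',
    'user': {'id_str': '100', 'screen_name': 'example_journalist'},
    'retweeted_status': {'user': {'id_str': '200', 'screen_name': 'example_source'}},
}
toy_original = {
    'id_str': '2',
    'created_at': 'Thu Jun 08 02:00:00 +0000 2017',
    'user': {'id_str': '100', 'screen_name': 'example_journalist'},
}

row = toy_retweet_transform(toy_retweet)
skipped = toy_retweet_transform(toy_original)
print(row)
print(skipped)
```

The transform returns `None` for tweets that are neither retweets nor quotes. The drop from 817,136 tweets to 456,956 rows suggests `load_tweet_df` leaves those out of the resulting frame.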


### Add gender of retweeter

In [49]:
retweet_df = base_retweet_df.join(user_summary_df['gender'], on='user_id')
retweet_df.count()

tweet_id               456956
user_id                456956
screen_name            456956
retweet_user_id        456956
retweet_screen_name    456956
tweet_created_at       456956
gender                 456956
dtype: int64

### How many users have been retweeted by journalists?

In [50]:
retweet_df['retweet_user_id'].unique().size

49154

### Limit to retweeted journalists

In [51]:
journalists_retweet_df = retweet_df.join(user_summary_df['gender'], how='inner', on='retweet_user_id', rsuffix='_retweet')
journalists_retweet_df.rename(columns = {'gender_retweet': 'retweet_gender'}, inplace=True)
journalists_retweet_df.count()

tweet_id               117048
user_id                117048
screen_name            117048
retweet_user_id        117048
retweet_screen_name    117048
tweet_created_at       117048
gender                 117048
retweet_gender         117048
dtype: int64
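
The inner join above keeps only retweets whose `retweet_user_id` appears in the index of `user_summary_df`, that is, retweets of journalists in the dataset. A minimal sketch on toy data (the frames and IDs here are hypothetical, not taken from the dataset) shows how `how='inner'` and `rsuffix` behave together:

```python
import pandas as pd

# Hypothetical user summary indexed by user_id (stand-in for user_summary_df).
toy_user_summary_df = pd.DataFrame(
    {'gender': ['M', 'F']},
    index=pd.Index(['1', '2'], name='user_id'))

# Hypothetical retweets; user '9' is not in the user summary.
toy_retweet_df = pd.DataFrame({
    'user_id': ['1', '1', '2'],
    'retweet_user_id': ['2', '9', '1'],
    'gender': ['M', 'M', 'F'],  # gender of the retweeter
})

# Inner join on the retweeted user's id drops the retweet of user '9'.
# rsuffix renames the joined column because 'gender' already exists.
toy_journalists_retweet_df = toy_retweet_df.join(
    toy_user_summary_df['gender'], how='inner',
    on='retweet_user_id', rsuffix='_retweet')
toy_journalists_retweet_df = toy_journalists_retweet_df.rename(
    columns={'gender_retweet': 'retweet_gender'})

print(toy_journalists_retweet_df)
```

In the toy data only two of three rows survive, just as the real join reduces 456,956 retweets to 117,048.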

In [52]:
journalists_retweet_df.head()

Unnamed: 0,tweet_id,user_id,screen_name,retweet_user_id,retweet_screen_name,tweet_created_at,gender,retweet_gender
2,872609618626826240,327862439,jonathanvswan,18678924,jmartNYT,2017-06-08 00:22:00+00:00,M,M
435,871437820044464128,242169927,colinwilhelm,18678924,jmartNYT,2017-06-04 18:45:41+00:00,M,M
1406,872620054889857024,163589845,PoliticoKevin,18678924,jmartNYT,2017-06-08 01:03:28+00:00,M,M
1424,872240756597174272,163589845,PoliticoKevin,18678924,jmartNYT,2017-06-06 23:56:16+00:00,M,M
1455,870749993279385601,163589845,PoliticoKevin,18678924,jmartNYT,2017-06-02 21:12:30+00:00,M,M


### Functions for summarizing retweets by beltway journalists

In [53]:
# Gender of beltway journalists retweeted by beltway journalists
def journalist_retweet_gender_summary(retweet_df):
    gender_summary_df = pd.DataFrame({'count':retweet_df.retweet_gender.value_counts(), 
                  'percentage': retweet_df.retweet_gender.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})
    gender_summary_df.reset_index(inplace=True)
    gender_summary_df['avg_retweets'] = gender_summary_df.apply(lambda row: row['count'] / journalist_gender_summary_df.loc[row['index']]['count'], axis=1)    
    gender_summary_df.set_index('index', inplace=True, drop=True)
    return gender_summary_df


def journalist_retweet_summary(retweet_df):
    # Retweet count
    retweet_count_df = pd.DataFrame(retweet_df.retweet_user_id.value_counts().rename('retweet_count'))

    # Retweeting users. That is, the number of unique users retweeting each user.
    retweet_user_id_per_user_df = retweet_df[['retweet_user_id', 'user_id']].drop_duplicates()
    retweeting_user_count_df = pd.DataFrame(retweet_user_id_per_user_df.groupby('retweet_user_id').size(), columns=['retweeting_count'])
    retweeting_user_count_df.index.name = 'user_id'

    # Join with user summary
    journalist_retweet_summary_df = user_summary_df.join([retweet_count_df, retweeting_user_count_df])
    journalist_retweet_summary_df.fillna(0, inplace=True)
    journalist_retweet_summary_df = journalist_retweet_summary_df.sort_values(['retweet_count', 'retweeting_count', 'followers_count'], ascending=False)
    return journalist_retweet_summary_df

# Gender of top journalists retweeted by beltway journalists
def top_journalist_retweet_gender_summary(retweet_summary_df, retweeting_count_threshold=0, head=100):
    top_retweet_summary_df = retweet_summary_df[retweet_summary_df.retweeting_count > retweeting_count_threshold].head(head)
    return pd.DataFrame({'count': top_retweet_summary_df.gender.value_counts(), 
                  'percentage': top_retweet_summary_df.gender.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'})

# Fields for displaying journalist retweet summaries
journalist_retweet_summary_fields = ['screen_name', 'name', 'organization', 'gender', 'followers_count', 'retweet_count', 'retweeting_count']
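
The difference between `retweet_count` and `retweeting_count` in `journalist_retweet_summary` is easy to miss: the first counts retweets received, the second counts distinct retweeters. A small sketch on hypothetical data (user IDs are invented) that follows the same steps:

```python
import pandas as pd

# Hypothetical retweets: user 'a' retweets 'x' twice; user 'b' retweets
# 'x' once and 'y' once.
toy_retweet_df = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b'],
    'retweet_user_id': ['x', 'x', 'x', 'y'],
})

# retweet_count: total retweets received by each retweeted user.
retweet_count = toy_retweet_df.retweet_user_id.value_counts().rename('retweet_count')

# retweeting_count: number of distinct users who retweeted each user.
retweeting_count = (toy_retweet_df[['retweet_user_id', 'user_id']]
                    .drop_duplicates()
                    .groupby('retweet_user_id')
                    .size()
                    .rename('retweeting_count'))

toy_summary_df = pd.concat([retweet_count, retweeting_count], axis=1)
print(toy_summary_df)
```

Here `x` is retweeted three times by two distinct users, so a high `retweet_count` with a low `retweeting_count` points to a few very active retweeters.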


## Retweet analysis
*Note that for each of these, the complete list is being written to CSV in the output directory.*


### Retweets of all accounts (not just journalists)

#### Of journalists retweeting other accounts, how many of the retweets are from males / females?
That is, by gender of retweeter.

In [54]:
retweets_by_gender_df = user_summary_df[['gender', 'retweet', 'quote']].groupby('gender').sum()
retweets_by_gender_df['total'] = retweets_by_gender_df.retweet + retweets_by_gender_df.quote
retweets_by_gender_df['percentage'] = retweets_by_gender_df.total.div(retweets_by_gender_df.total.sum()).mul(100).round(1).astype(str) + '%'
retweets_by_gender_df.reset_index(inplace=True)
retweets_by_gender_df['avg_retweets'] = retweets_by_gender_df.apply(lambda row: row['total'] / journalist_gender_summary_df.loc[row['gender']]['count'], axis=1)
retweets_by_gender_df.set_index('gender', inplace=True, drop=True)
retweets_by_gender_df

gender,retweet,quote,total,percentage,avg_retweets
F,134606.0,38998.0,173604.0,38.0%,174.83
M,210660.0,72692.0,283352.0,62.0%,218.13


#### Of journalists retweeting other accounts, who retweets the most?

In [55]:
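# Select the display columns with a list and copy, so the new column is added to an independent frame
retweet_user_summary_df = user_summary_df[['screen_name', 'name', 'organization', 'gender', 'followers_count', 'tweet_count', 'retweet', 'quote', 'tweets_in_dataset']].copy()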
retweet_user_summary_df['retweet_count'] = retweet_user_summary_df.retweet + retweet_user_summary_df.quote
retweet_user_summary_df.sort_values(['retweet_count'], ascending=False).head(25)

user_id,screen_name,name,organization,gender,followers_count,tweet_count,retweet,quote,tweets_in_dataset,retweet_count
2453025128,gloriaminott,"Minott, Gloria",WPFW–FM,F,586,61473,21524.0,0.0,21547.0,21524.0
304988603,NeilWMcCabe,"McCabe, Neil",Breitbart News,M,18903,64673,7528.0,625.0,9370.0,8153.0
18825339,CahnEmily,"Cahn, Emily",Mic,F,16980,100803,4449.0,1834.0,8196.0,6283.0
191964162,SamLitzinger,"Litzinger, Sam",CBS News,M,2329,95236,6017.0,225.0,7537.0,6242.0
21612122,HotlineJosh,"Kraushaar, Josh P.",National Journal,M,50438,156610,4881.0,893.0,6703.0,5774.0
259395895,JohnJHarwood,"Harwood, John",CNBC,M,149040,78015,4570.0,822.0,6377.0,5392.0
16031927,greta,"Van Susteren, Greta",MSNBC,F,1186850,116645,794.0,3069.0,4792.0,3863.0
21810329,sdonnan,"Donnan, Shawn",Financial Times,M,12311,79125,3332.0,449.0,4537.0,3781.0
47408060,JonathanLanday,"Landay, Jonathan",McClatchy Newspapers,M,11213,81042,3687.0,80.0,4285.0,3767.0
13524182,daveweigel,"Weigel, David",Washington Post,M,332344,169908,2703.0,859.0,4564.0,3562.0


#### Of journalists retweeting other accounts, who is retweeted the most?
This is based on screen name, which could have changed during the collection period. However, that seems unlikely for the accounts at the top of this list.

In [56]:
# Retweet count
retweet_count_screen_name_df = pd.DataFrame(retweet_df.retweet_screen_name.value_counts().rename('retweet_count'))

# Count of retweeting users
retweet_user_id_per_user_screen_name_df = retweet_df[['retweet_screen_name', 'user_id']].drop_duplicates()
retweeting_count_screen_name_df = pd.DataFrame(retweet_user_id_per_user_screen_name_df.groupby('retweet_screen_name').size(), columns=['retweeting_count'])
retweeting_count_screen_name_df.index.name = 'screen_name'

all_retweeted_df = retweet_count_screen_name_df.join(retweeting_count_screen_name_df)
all_retweeted_df.to_csv('output/all_retweeted_by_journalists.csv')
all_retweeted_df.head(25)

Unnamed: 0,retweet_count,retweeting_count
realDonaldTrump,6650,807
thehill,5424,457
BraddJaffy,3564,554
maggieNYT,3024,530
business,3000,229
washingtonpost,2638,498
AP,2480,581
politico,2335,334
nytimes,2268,485
WSJ,1949,213
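
If screen name changes were a concern, one alternative would be to count by `retweet_user_id`, which is stable, and attach a screen name afterwards. This is a minimal sketch on hypothetical data. The rows and the choice to display the most recently seen screen name are assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical retweets in which user '10' changed screen name mid-collection.
toy_retweet_df = pd.DataFrame({
    'user_id': ['a', 'b', 'c'],
    'retweet_user_id': ['10', '10', '20'],
    'retweet_screen_name': ['old_name', 'new_name', 'other'],
    'tweet_created_at': pd.to_datetime(
        ['2017-06-01', '2017-06-05', '2017-06-02'], utc=True),
})

# Counting by screen name splits user '10' across two rows.
by_screen_name = toy_retweet_df.retweet_screen_name.value_counts()

# Counting by user id keeps the user together.
by_user_id = toy_retweet_df.groupby('retweet_user_id').size().rename('retweet_count')

# Attach the most recently seen screen name for display.
latest_screen_name = (toy_retweet_df.sort_values('tweet_created_at')
                      .groupby('retweet_user_id')
                      .retweet_screen_name.last())

toy_retweeted_df = pd.concat([latest_screen_name, by_user_id], axis=1)
print(toy_retweeted_df)
```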


### Journalists retweeting other journalists

#### Of journalists retweeting other journalists, who is retweeted the most?

In [57]:
journalists_retweet_summary_df = journalist_retweet_summary(journalists_retweet_df)
journalists_retweet_summary_df.to_csv('output/journalists_retweeted_by_journalists.csv')
journalists_retweet_summary_df[journalist_retweet_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,retweet_count,retweeting_count
407013776,burgessev,"Everett, John B.",Politico,M,31010,1836.0,289.0
21316253,ZekeJMiller,"Miller, Zeke J.",Time Magazine,M,198517,1723.0,387.0
19107878,GlennThrush,"Thrush, Glenn H.",New York Times,M,308181,1577.0,451.0
14529929,jaketapper,"Tapper, Jake",CNN,M,1305680,1459.0,397.0
46557945,StevenTDennis,"Dennis, Steven T.",Bloomberg News,M,55762,1403.0,280.0
19186003,seungminkim,"Kim, Seung Min",Politico,F,33980,1393.0,327.0
39155029,mkraju,"Raju, Manu K.",CNN,M,88366,1359.0,341.0
31127446,markknoller,"Knoller, Mark",CBS News,M,301474,1343.0,341.0
398088661,MEPFuller,"Fuller, Matt E.",Huffington Post,M,77919,1324.0,286.0
13524182,daveweigel,"Weigel, David",Washington Post,M,332344,1221.0,306.0


#### Of journalists retweeting other journalists, how many of the retweets are of males / females?

In [58]:
journalist_retweet_gender_summary(journalists_retweet_df)


index,count,percentage,avg_retweets
M,80634,68.9%,62.07
F,36414,31.1%,36.67


#### On average, how many times are journalists retweeted by other journalists?

In [59]:
journalists_retweet_summary_df[['retweet_count']].describe()

Unnamed: 0,retweet_count
count,2292.0
mean,51.07
std,149.06
min,0.0
25%,0.0
50%,6.0
75%,33.0
max,1836.0


### Journalists retweeting female journalists

#### Of journalists retweeting female journalists, who is retweeted the most?

In [60]:
female_journalists_retweet_summary_df = journalists_retweet_summary_df[journalists_retweet_summary_df.gender == 'F']
female_journalists_retweet_summary_df.to_csv('output/female_journalists_retweeted_by_journalists.csv')
female_journalists_retweet_summary_df[journalist_retweet_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,retweet_count,retweeting_count
19186003,seungminkim,"Kim, Seung Min",Politico,F,33980,1393.0,327.0
33653195,ericawerner,"Werner, Erica",Associated Press,F,14049,939.0,281.0
12354832,kasie,"Hunt, Kasie",NBC News,F,187357,909.0,388.0
70511174,Hadas_Gold,"Gold, Hadas",Politico,F,45221,849.0,306.0
593813785,DonnaYoungDC,"Young, Donna",S&P Global Market Intelligence,F,5894,708.0,13.0
167024520,rachaelmbade,"Bade, Rachel M.",Politico,F,30164,614.0,161.0
33919343,AshleyRParker,"Parker, Ashley",Washington Post,F,122382,539.0,268.0
139738464,mj_lee,"Lee, MJ",CNN,F,31940,518.0,189.0
16018516,jenhab,"Haberkorn, Jennifer A.",Politico,F,20028,474.0,136.0
18825339,CahnEmily,"Cahn, Emily",Mic,F,16980,444.0,118.0


#### On average, how many times are female journalists retweeted by other journalists?

In [61]:
female_journalists_retweet_summary_df[['retweet_count']].describe()

Unnamed: 0,retweet_count
count,993.0
mean,36.67
std,97.34
min,0.0
25%,0.0
50%,5.0
75%,25.0
max,1393.0


### Journalists retweeting male journalists

#### Of journalists retweeting male journalists, who is retweeted the most?

In [62]:
male_journalists_retweet_summary_df = journalists_retweet_summary_df[journalists_retweet_summary_df.gender == 'M']
male_journalists_retweet_summary_df.to_csv('output/male_journalists_retweeted_by_journalists.csv')
male_journalists_retweet_summary_df[journalist_retweet_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,retweet_count,retweeting_count
407013776,burgessev,"Everett, John B.",Politico,M,31010,1836.0,289.0
21316253,ZekeJMiller,"Miller, Zeke J.",Time Magazine,M,198517,1723.0,387.0
19107878,GlennThrush,"Thrush, Glenn H.",New York Times,M,308181,1577.0,451.0
14529929,jaketapper,"Tapper, Jake",CNN,M,1305680,1459.0,397.0
46557945,StevenTDennis,"Dennis, Steven T.",Bloomberg News,M,55762,1403.0,280.0
39155029,mkraju,"Raju, Manu K.",CNN,M,88366,1359.0,341.0
31127446,markknoller,"Knoller, Mark",CBS News,M,301474,1343.0,341.0
398088661,MEPFuller,"Fuller, Matt E.",Huffington Post,M,77919,1324.0,286.0
13524182,daveweigel,"Weigel, David",Washington Post,M,332344,1221.0,306.0
14007532,frankthorp,"Thorp, Frank",NBC News,M,39798,1207.0,334.0


#### On average, how many times are male journalists retweeted by other journalists?

In [63]:
male_journalists_retweet_summary_df[['retweet_count']].describe()

Unnamed: 0,retweet_count
count,1299.0
mean,62.07
std,178.04
min,0.0
25%,1.0
50%,8.0
75%,39.5
max,1836.0


### Female journalists retweeting other journalists

#### Of female journalists retweeting other journalists, who is retweeted the most?

In [64]:
journalists_retweeted_by_female_summary_df = journalist_retweet_summary(journalists_retweet_df[journalists_retweet_df.gender == 'F'])
journalists_retweeted_by_female_summary_df.to_csv('output/journalists_retweeted_by_female_journalists.csv')
journalists_retweeted_by_female_summary_df[journalist_retweet_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,retweet_count,retweeting_count
407013776,burgessev,"Everett, John B.",Politico,M,31010,748.0,122.0
593813785,DonnaYoungDC,"Young, Donna",S&P Global Market Intelligence,F,5894,704.0,9.0
19186003,seungminkim,"Kim, Seung Min",Politico,F,33980,572.0,142.0
31127446,markknoller,"Knoller, Mark",CBS News,M,301474,549.0,140.0
21316253,ZekeJMiller,"Miller, Zeke J.",Time Magazine,M,198517,516.0,149.0
46557945,StevenTDennis,"Dennis, Steven T.",Bloomberg News,M,55762,503.0,97.0
14007532,frankthorp,"Thorp, Frank",NBC News,M,39798,470.0,140.0
19107878,GlennThrush,"Thrush, Glenn H.",New York Times,M,308181,463.0,165.0
33653195,ericawerner,"Werner, Erica",Associated Press,F,14049,452.0,119.0
398088661,MEPFuller,"Fuller, Matt E.",Huffington Post,M,77919,447.0,116.0


#### Of female journalists retweeting other journalists, how many are male / female?
The average is the number of retweets by female journalists that each male / female journalist receives.

In [65]:
journalist_retweet_gender_summary(journalists_retweet_df[journalists_retweet_df.gender == 'F'])


index,count,percentage,avg_retweets
M,25410,59.6%,19.56
F,17228,40.4%,17.35
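
The `avg_retweets` column divides each count by the number of journalists of the retweeted gender: 1,299 male and 993 female, per the `describe()` outputs earlier in this notebook. As a quick arithmetic check of the table above, with the counts copied by hand:

```python
# Counts copied from the tables in this notebook.
retweets_of_males = 25410     # retweets by female journalists of male journalists
retweets_of_females = 17228   # retweets by female journalists of female journalists
male_journalists = 1299       # count from the male describe() output
female_journalists = 993      # count from the female describe() output

# avg_retweets divides by the number of journalists of the *retweeted* gender.
avg_male = round(retweets_of_males / male_journalists, 2)
avg_female = round(retweets_of_females / female_journalists, 2)
print(avg_male, avg_female)
```

Dividing the 25,410 retweets of male journalists by the 993 female retweeters instead gives about 25.59. That is a different quantity: retweets made per female journalist rather than retweets received per male journalist.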


#### On average, how many times do female journalists retweet male / female / all journalists?
That is, retweets per female journalist.  

In [66]:
female_journalists_retweet_df = journalists_retweet_df[journalists_retweet_df.gender == 'F']
female_journalists_retweet_by_gender_df = pd.merge(user_summary_df[user_summary_df.gender == 'F'], female_journalists_retweet_df.groupby(['user_id', 'retweet_gender']).size().unstack(), how='left', left_index=True, right_index=True)[['F', 'M']]
female_journalists_retweet_by_gender_df.fillna(0, inplace=True)
female_journalists_retweet_by_gender_df['all'] = female_journalists_retweet_by_gender_df.F + female_journalists_retweet_by_gender_df.M
female_journalists_retweet_by_gender_df.describe()

Unnamed: 0,F,M,all
count,993.0,993.0,993.0
mean,17.35,25.59,42.94
std,45.34,74.55,113.79
min,0.0,0.0,0.0
25%,0.0,1.0,2.0
50%,4.0,6.0,10.0
75%,16.0,22.0,39.0
max,857.0,1779.0,2385.0
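
The per-journalist table above comes from a `groupby(...).size().unstack()` followed by a left merge, so journalists with no retweets of other journalists still count, with zeros. A minimal sketch of that pattern on hypothetical data (user IDs are invented):

```python
import pandas as pd

# Hypothetical female journalists; 'f3' never retweets another journalist.
toy_users_df = pd.DataFrame(
    {'gender': ['F', 'F', 'F']},
    index=pd.Index(['f1', 'f2', 'f3'], name='user_id'))

# Hypothetical retweets by those journalists, with the retweeted user's gender.
toy_retweet_df = pd.DataFrame({
    'user_id': ['f1', 'f1', 'f1', 'f2'],
    'retweet_gender': ['M', 'M', 'F', 'M'],
})

# One row per retweeter, one column per retweeted gender.
counts_df = toy_retweet_df.groupby(['user_id', 'retweet_gender']).size().unstack()

# The left merge keeps journalists with no retweets; their counts become 0.
per_user_df = pd.merge(toy_users_df, counts_df, how='left',
                       left_index=True, right_index=True)[['F', 'M']].fillna(0)
per_user_df['all'] = per_user_df.F + per_user_df.M
print(per_user_df)
```

Without the left merge, `f3` would drop out and the mean from `describe()` would be computed over fewer journalists.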


### Male journalists retweeting other journalists

#### Of male journalists retweeting other journalists, who is retweeted the most?

In [67]:
journalists_retweeted_by_male_summary_df = journalist_retweet_summary(journalists_retweet_df[journalists_retweet_df.gender == 'M'])
journalists_retweeted_by_male_summary_df.to_csv('output/journalists_retweeted_by_male_journalists.csv')
journalists_retweeted_by_male_summary_df[journalist_retweet_summary_fields].head(25)

user_id,screen_name,name,organization,gender,followers_count,retweet_count,retweeting_count
21316253,ZekeJMiller,"Miller, Zeke J.",Time Magazine,M,198517,1207.0,238.0
19107878,GlennThrush,"Thrush, Glenn H.",New York Times,M,308181,1114.0,286.0
407013776,burgessev,"Everett, John B.",Politico,M,31010,1088.0,167.0
14529929,jaketapper,"Tapper, Jake",CNN,M,1305680,1071.0,239.0
13524182,daveweigel,"Weigel, David",Washington Post,M,332344,975.0,209.0
39155029,mkraju,"Raju, Manu K.",CNN,M,88366,956.0,209.0
46557945,StevenTDennis,"Dennis, Steven T.",Bloomberg News,M,55762,900.0,183.0
398088661,MEPFuller,"Fuller, Matt E.",Huffington Post,M,77919,877.0,170.0
19847765,sahilkapur,"Kapur, Sahil",Bloomberg News,M,69086,848.0,193.0
16006592,BenjySarlin,"Sarlin, Benjamin",NBC News,M,78075,828.0,141.0


#### Of male journalists retweeting other journalists, how many are male / female?
The average is the number of retweets by male journalists that each male / female journalist receives.

In [68]:
journalist_retweet_gender_summary(journalists_retweet_df[journalists_retweet_df.gender == 'M'])

index,count,percentage,avg_retweets
M,55224,74.2%,42.51
F,19186,25.8%,19.32


#### On average, how many times do male journalists retweet male / female / all journalists?
That is, retweets per male journalist.  

In [69]:
male_journalists_retweet_df = journalists_retweet_df[journalists_retweet_df.gender == 'M']
male_journalists_retweet_by_gender_df = pd.merge(user_summary_df[user_summary_df.gender == 'M'], male_journalists_retweet_df.groupby(['user_id', 'retweet_gender']).size().unstack(), how='left', left_index=True, right_index=True)[['F', 'M']]
male_journalists_retweet_by_gender_df.fillna(0, inplace=True)
male_journalists_retweet_by_gender_df['all'] = male_journalists_retweet_by_gender_df.F + male_journalists_retweet_by_gender_df.M
male_journalists_retweet_by_gender_df.describe()

Unnamed: 0,F,M,all
count,1299.0,1299.0,1299.0
mean,14.77,42.51,57.28
std,33.5,106.87,136.92
min,0.0,0.0,0.0
25%,0.0,1.0,1.0
50%,3.0,7.0,11.0
75%,14.0,35.0,50.0
max,442.0,1414.0,1766.0
