# Twitter IRA Data Analysis

On October 16, 2018, Twitter released several large datasets of tweets and account information related to foreign influence of US elections. From Twitter's [Elections Integrity](https://about.twitter.com/en_us/values/elections-integrity.html) page:

> Our initial disclosures cover two previously disclosed campaigns, and include information from 3,841 accounts believed to be connected to the Russian Internet Research Agency, and 770 accounts believed to originate in Iran. For additional information about this disclosure, see our announcement.
>
> These datasets include all public, nondeleted Tweets and media (e.g., images and videos) from accounts we believe are connected to state-backed information operations. Tweets deleted by these users prior to their suspension (which are not included in these datasets) comprise less than 1% of their overall activity. Note that not all of the accounts we identified as connected to these campaigns actively Tweeted, so the number of accounts represented in the datasets may be less than the total number of accounts listed here.

We use this archive of tweets to run a network analysis of the accounts related to the Russian Internet Research Agency (IRA). This is the larger of the two datasets Twitter released. The tweet-level information contains three different types of relationships between users: retweets, replies, and mentions. We'll use all three to build our network. The first part of this tutorial handles the data cleaning necessary to generate a network.

The large-scale network analysis is then done with [graph-tool](https://graph-tool.skewed.de/). The graph-tool library has a unique API that might seem a little tricky at first, particularly if you're coming from something like NetworkX. However, once you understand the fundamentals of the API, graph-tool enables powerful analysis: it's typically much faster than NetworkX, and the visualizations are much better.


## Data Import

A few notes:
1. I wasn't able to load in the IRA data directly from the `.zip` file downloaded, so I had to manually extract it.
2. The dataset is fairly large, so we'll load only the columns we actually need, which keeps memory usage down.
3. The `user_mentions` column is a list of other `userid`s, but it gets imported as a string. Thus, we'll have to convert this column from a string representation of a list to an actual list object.

In [1]:
import pandas as pd
import numpy as np

In [2]:
def str_list_split(input_str):
    """Converts a list stored as a string, e.g. "[a, b]", to a list object"""
    if not isinstance(input_str, str):
        return np.nan  # missing values arrive as float NaN, not str
    items_str = input_str[1:-1]  # first and last characters are "[" and "]"
    splits = [item for item in items_str.split(", ") if item]  # skip empty items
    return splits if splits else np.nan  # an empty list "[]" is treated as missing

In [3]:
dtype = {"tweetid" : str,
 "userid" : str,
 "in_reply_to_userid" : str, 
 "retweet_userid" : str}

In [4]:
df = (pd.read_csv("./data/ira_tweets_csv_hashed.csv",
                 dtype=dtype,
                 usecols=["tweetid", "userid", "in_reply_to_userid", "retweet_userid", "user_mentions"])
      .assign(user_mentions=lambda x: x['user_mentions'].apply(str_list_split)))
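
To see what the import step does without loading the full file, here is a minimal sketch on a made-up three-row CSV. The sample rows, the extra `tweet_text` column, and the helper name `parse_mentions` are assumptions for illustration; the helper mirrors the logic of `str_list_split` above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample that mimics the columns used above.
sample_csv = io.StringIO(
    "tweetid,userid,in_reply_to_userid,retweet_userid,user_mentions,tweet_text\n"
    '1,aaa,,222,"[222, 333]",hello\n'
    "2,bbb,,,[],world\n"
    "3,ccc,111,,,again\n"
)


def parse_mentions(value):
    """Turn a string such as "[222, 333]" into ["222", "333"]; NaN otherwise."""
    if not isinstance(value, str):
        return np.nan
    items = [item for item in value[1:-1].split(", ") if item]
    return items if items else np.nan


sample = pd.read_csv(
    sample_csv,
    dtype=str,  # keep every ID as a string
    usecols=["tweetid", "userid", "in_reply_to_userid", "retweet_userid", "user_mentions"],
).assign(user_mentions=lambda x: x["user_mentions"].apply(parse_mentions))

print(sample["user_mentions"].tolist())
```

The unused `tweet_text` column is never loaded, and both the empty list and the missing value end up as `NaN`, so later steps can drop them with a single `dropna`.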

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9041308 entries, 0 to 9041307
Data columns (total 5 columns):
tweetid               object
userid                object
in_reply_to_userid    object
retweet_userid        object
user_mentions         object
dtypes: object(5)
memory usage: 344.9+ MB


In [6]:
df.head()

| | tweetid | userid | in_reply_to_userid | retweet_userid | user_mentions |
|---|---|---|---|---|---|
| 0 | 877919995476496385 | 249064136b1c5cb00a705316ab73dd9b53785748ab757f... | | 2572896396.0 | [2572896396] |
| 1 | 492388766930444288 | 0974d5dbee4ca9bd6c3b46d62a5cbdbd5c0d86e196b624... | | | |
| 2 | 719455077589721089 | bda40f262856eee77c48a332e5eb23bc4f1943d600867d... | 40807205.0 | | [40807205] |
| 3 | 536179342423105537 | bda40f262856eee77c48a332e5eb23bc4f1943d600867d... | | | |
| 4 | 841410788409630720 | a53ed619f1dea6015c7c878bf744b0eefe8f7272dccf34... | | | |


## Edge Weight Creation

We'll now reshape the data to support network analysis. First, we find the unique users in the dataset. We use this set to filter the retweet, reply, and mention columns so that we only capture relationships between users in our dataset. That is, if a user in our dataset retweeted a `userid` that isn't in our dataset, we won't use that information. This is because we want to include only bidirectional information, and we don't know whether `userid`s outside of our dataset are retweeting `userid`s inside it.

This is fairly straightforward for the retweet and reply relationships, since a user can only retweet or reply to another single user at a time. The mentions column requires a bit more work, since multiple users can be mentioned within a tweet. For mentions, we'll create a new variable that only has mentions of other users within our dataset. Then, we'll iterate through each tweet and create a new relationship for each mention included in the tweet.

For all relationships we'll exclude self-retweets, self-replies, and self-mentions.
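
As a small illustration of this filtering rule, the sketch below applies it to a made-up retweet table. The frame `toy` and the IDs in it are hypothetical.

```python
import pandas as pd

# Hypothetical tweets: "z" never appears as a userid, so it is outside the dataset.
toy = pd.DataFrame(
    {
        "userid": ["a", "a", "b", "c"],
        "retweet_userid": ["b", "z", "b", None],
    }
)

known_users = set(toy["userid"])

toy_rt = toy[
    toy["retweet_userid"].isin(known_users)  # target must be in the dataset
    & (toy["userid"] != toy["retweet_userid"])  # drop self-retweets
]

print(toy_rt)
```

Only the `a -> b` row survives: `a -> z` points outside the dataset, `b -> b` is a self-retweet, and the last row is not a retweet at all.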

In [7]:
unique_users = set(df["userid"].drop_duplicates())

rt_df = df[
    (df["retweet_userid"].isin(unique_users)) & (df["userid"] != df["retweet_userid"])
]

reply_df = df[
    (df["in_reply_to_userid"].isin(unique_users)) & (df["userid"] != df["in_reply_to_userid"])
]

In [8]:
def keep_mentions_in_dataset(mentions):
    return [m for m in mentions if m in unique_users]

In [9]:
df["mentions_in_dataset"] = df["user_mentions"].dropna().apply(keep_mentions_in_dataset)

In [10]:
mention_edges = []
for row in df.dropna(subset=['mentions_in_dataset']).itertuples():
    userid = row.userid
    for user_mention in row.mentions_in_dataset:
        if userid != user_mention:
            mention_data = (userid, user_mention)
            mention_edges.append(mention_data)
            
mention_df = pd.DataFrame(mention_edges, columns=['userid', 'user_mention'])
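
The explicit loop is easy to follow. As an alternative sketch, assuming pandas 0.25 or newer, `DataFrame.explode` can produce the same kind of edge list without iterating over rows in Python. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical tweets with already-filtered mention lists.
toy = pd.DataFrame(
    {
        "userid": ["a", "b", "c", "d"],
        "mentions_in_dataset": [["b", "c"], ["b"], None, []],
    }
)

toy_mentions = (
    toy.dropna(subset=["mentions_in_dataset"])
    .explode("mentions_in_dataset")  # one row per (tweet, mention) pair
    .rename(columns={"mentions_in_dataset": "user_mention"})
    .dropna(subset=["user_mention"])  # empty lists explode to NaN
    .query("userid != user_mention")  # drop self-mentions
    .reset_index(drop=True)
)

print(toy_mentions)
```

Note the second `dropna`: an empty list explodes into a single `NaN` row, which would otherwise pass the self-mention check.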

## The Great Weight Merge

Now that we have our lists of relationships, we're going to aggregate them by counting the number of directed relationships between each pair of users. We'll calculate this for each type of relationship, and then merge all the datasets together.

We'll standardize the relationships by calling the originating `userid` column `source`, and the receiving column `target`. 

In [11]:
rt_weights = (
    rt_df.groupby(["userid", "retweet_userid"])
    .size()
    .reset_index()
    .rename(columns={"userid": "source", "retweet_userid": "target", 0: "rt_weight"})
)
reply_weights = (
    reply_df.groupby(["userid", "in_reply_to_userid"])
    .size()
    .reset_index()
    .rename(
        columns={"userid": "source", "in_reply_to_userid": "target", 0: "reply_weight"}
    )
)
mention_weights = (
    mention_df.groupby(["userid", "user_mention"])
    .size()
    .reset_index()
    .rename(columns={"userid": "source", "user_mention": "target", 0: "mention_weight"})
)
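
To make the counting step concrete, here is a minimal sketch on a made-up retweet table. It uses `reset_index(name=...)`, which names the count column directly instead of renaming the default `0` column afterwards; the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical retweets: "a" retweeted "b" twice and "c" once.
toy_rt = pd.DataFrame(
    {
        "userid": ["a", "a", "a", "b"],
        "retweet_userid": ["b", "b", "c", "a"],
    }
)

toy_weights = (
    toy_rt.groupby(["userid", "retweet_userid"])
    .size()  # number of rows per (source, target) pair
    .reset_index(name="rt_weight")
    .rename(columns={"userid": "source", "retweet_userid": "target"})
)

print(toy_weights)
```

The edges are directed, so `a -> b` (weight 2) and `b -> a` (weight 1) are counted separately.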

In [12]:
print("N Retweets:", len(rt_weights))
print("N Replies:", len(reply_weights))
print("N Mentions:", len(mention_weights))

N Retweets: 168074
N Replies: 20276
N Mentions: 180345


In [13]:
all_weights = (
    rt_weights
    .merge(reply_weights, on=["source", "target"], how="outer")
    .merge(mention_weights, on=["source", "target"], how="outer")
    .fillna(0)
              )
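
The outer merge keeps every pair that appears in at least one of the tables. A minimal sketch with two made-up weight tables shows how pairs missing from one side end up with a weight of 0; the explicit sort is only there to make the toy output order predictable.

```python
import pandas as pd

# Hypothetical weight tables; only the pair (a, b) appears in both.
rt = pd.DataFrame({"source": ["a", "b"], "target": ["b", "a"], "rt_weight": [2, 1]})
mention = pd.DataFrame(
    {"source": ["a", "c"], "target": ["b", "a"], "mention_weight": [3, 1]}
)

merged = (
    rt.merge(mention, on=["source", "target"], how="outer")
    .fillna(0)  # a pair absent from one table gets weight 0 there
    .sort_values(["source", "target"])
    .reset_index(drop=True)
)

print(merged)
```

Because the missing entries pass through `NaN` before being filled, the weight columns come out as floats rather than integers.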

In [14]:
all_weights.describe()

| | rt_weight | reply_weight | mention_weight |
|---|---|---|---|
| count | 181015.0 | 181015.0 | 181015.0 |
| mean | 4.945502 | 1.012601 | 5.902864 |
| std | 15.56989 | 19.088364 | 22.901294 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.0 | 1.0 |
| 50% | 1.0 | 0.0 | 2.0 |
| 75% | 4.0 | 0.0 | 5.0 |
| max | 1624.0 | 2011.0 | 1683.0 |


## Building a Node List for `graph-tool`

Now that we have a list of the relationships, we're going to build a list of the unique users in our dataset. In a network analysis context, these are called our "nodes" or "vertices". Because of how the `graph-tool` API operates, we're going to index all of our nodes to make it easy to create the graph.

The process goes like this:
1. Find all the unique nodes by taking the `set` of both the `source` and `target` columns
2. Sort this set of unique nodes by name. This isn't strictly necessary, but since sets are unordered, it's nice to put them in a sorted list first so we can depend on the indexing being deterministic.
3. Index the nodes and create maps for `index -> node` and `node -> index`
4. Apply the node indexing back to the edges data.

Creating the indexed list of nodes is the essential step here for use with `graph-tool`. All of the relationships and properties of the nodes tie back to their index, so it's important to keep the index map around for reference.

In [15]:
all_nodes = set(all_weights['source'].drop_duplicates()) | set(all_weights['target'].drop_duplicates())
all_nodes = sorted(list(all_nodes))
node_map = dict(enumerate(all_nodes))
node_map_inv = {v : k for k, v in node_map.items()}

all_weights['source_id'] = all_weights['source'].map(node_map_inv)
all_weights['target_id'] = all_weights['target'].map(node_map_inv)
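
The same four steps on a tiny made-up edge list make the mapping easier to inspect. The names `index_to_node` and `node_to_index` are illustrative stand-ins for `node_map` and `node_map_inv`.

```python
import pandas as pd

# Hypothetical edge list with three distinct users.
edges = pd.DataFrame({"source": ["carol", "alice"], "target": ["alice", "bob"]})

# Steps 1 and 2: unique nodes, sorted so the indexing is deterministic.
nodes = sorted(set(edges["source"]) | set(edges["target"]))

# Step 3: index -> node and node -> index maps.
index_to_node = dict(enumerate(nodes))
node_to_index = {node: idx for idx, node in index_to_node.items()}

# Step 4: apply the indexing back to the edges.
edges["source_id"] = edges["source"].map(node_to_index)
edges["target_id"] = edges["target"].map(node_to_index)

print(index_to_node)
print(edges)
```

Sorting first means `alice` always gets index 0, `bob` 1, and `carol` 2, no matter what order the set happens to iterate in.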

## Export

Finally, we'll export our node and edge lists to `.csv` files.

In [17]:
nodes = pd.DataFrame(list(node_map.items()), columns=['index', 'userid'])
nodes.to_csv('./data/nodes.csv')
all_weights.to_csv('./data/edge_weights.csv')
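
One thing to keep in mind when loading these files later: `to_csv` writes the DataFrame's row index as an unnamed first column by default, and it shows up as `Unnamed: 0` on a plain `read_csv`. The sketch below demonstrates this with an in-memory buffer and a made-up two-row node table standing in for `./data/nodes.csv`; passing `index_col=0` is one way to handle it.

```python
import io

import pandas as pd

# Hypothetical two-row node table; a buffer stands in for the file on disk.
nodes = pd.DataFrame({"index": [0, 1], "userid": ["aaa", "bbb"]})

buffer = io.StringIO()
nodes.to_csv(buffer)  # same default settings as the export above

buffer.seek(0)
default_read = pd.read_csv(buffer)

buffer.seek(0)
indexed_read = pd.read_csv(buffer, index_col=0, dtype={"userid": str})

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Reading `userid` with `dtype=str` keeps the IDs as strings, matching how they were loaded at the start of the tutorial.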