# Case Study 4: Co-retweets

Amplification of information sources is perhaps the most common form of manipulation. 
On Twitter, a group of accounts retweeting the same tweets or the same set of accounts may signal coordinated behavior. Therefore we focus on retweets in this case study.

We apply the proposed approach to detect coordinated accounts that amplify narratives related to the White Helmets, a volunteer organization that was [targeted by disinformation campaigns during the civil war in Syria](www.theguardian.com/world/2017/dec/18/syria-white-helmets-conspiracy-theories). Recent reports identify Russian sources behind these campaigns. The data was collected from Twitter using English and Arabic keywords. 

## Detection

We construct the bipartite network between retweeting accounts and retweeted messages, excluding self-retweets and accounts having less than ten retweets.
This network is weighted using TF-IDF to discount the contributions of popular tweets.
Each account is therefore represented as a TF-IDF vector of retweeted tweet IDs.
The projected co-retweet network is then weighted by the cosine similarity between the account vectors.
Finally, to focus on evidence of potential coordination, we keep only the most suspicious 0.5\% of the edges.  These parameters can be tuned to trade off between precision and recall; the effect of the thresholds on the precision is analyzed in the Discussion section of the paper.

| Parameter          | Description                  |
|--------------------|------------------------------|
| Conjecture         | High overlapping of retweets |
| Support filter     | Accounts with <10 retweets   |
| Trace              | Retweeted tweet ID           |
| Eng. trace         | No                           |
| Bipartite weight   | TF-IDF                       |
| Proj. weight       | Cosine similarity            |
| Edge filter        | Keep top 0.5% weights        |
| Clustering         | Connected components         |
| Data source        | DARPA SocialSim              |
| Data period        | Apr 2018–Mar 2019            |
| No. accounts       | 11,669                       |

*Table: Case study 3 summary*

# Helper functions

In [21]:
import sys
import glob
import argparse
import numpy as np
import pandas as pd
import gzip
import json
from sparse_dot_topn import sp_matmul_topn
from tqdm import tqdm
from os.path import join
from itertools import combinations
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict, Counter


def counters2edges(counter_dict, p1_col, p2_col, w_col):
    # Count how many times each feature co-occurs with each user
    edge = pd.DataFrame.from_records(
        [
            (uid, feature, count)
            for uid, feature_counter in counter_dict.items()
            for feature, count in feature_counter.items()
        ],
        columns=[p1_col, p2_col, w_col],
    )
    return edge


def prepare_content(in_fp, tqdm_desc="parse raw content", tqdm_total=None):
    """
    Parse retweet ids from a raw tweet file.
    Note that the fields in here are different from the usual Twitter fields, because this is a anonymized dataset
    i.e., we access the attribute "id_h" instead of "id_str"
    """
    user_features = defaultdict(lambda: defaultdict(Counter))
    idx = 0
    for line in tqdm(in_fp, desc=tqdm_desc, total=tqdm_total):
        twt = json.loads(line)
        if "user" not in twt:
            continue

        user_id = twt["user"]["id_h"]
        if "retweeted_status" in twt:
            try:
                rtwt = twt["retweeted_status"]
                retweeted_tweet_id = rtwt["id_h"]
                user_features["retweet_tid"][user_id][retweeted_tweet_id] += 1
            except Exception as e:
                print(f"pass {idx}")
                idx += 1
                continue

    return user_features


def dummy_func(doc):
    return doc


def calculate_interaction(
    edge_df,
    p1_col,
    p2_col,
    w_col,
    node1_col,
    node2_col,
    sim_col,
    sup_col,
    tqdm_total=None,
    tqdm_desc="calculating interaction",
):
    """
    calculate interactions between nodes in partition 1

    input:
        edge_df : dataframe that contains (p1, p2, w)
        p1_col : column name that represent p1, i.e: uid
        p2_col : column name that represent p2 , e.g: time
        w_col : column name that represent weight, if None it's unweighted
        node1_col, node2_col: col after projection, both are uid
        supports: dictionary of userids - post count {p1_node: support}
    return:
        interactions : dict { (p1_node1, p1_node2) : interaction_weight }

    assumption:
        1. non-empty inputs: edge_df
        2. p1_node1 < p1_node2 holds for the tuples in interactions
    """
    user_ids, docs = zip(
        *(
            (p1_node, grp[p2_col].repeat(grp[w_col]).values)
            for p1_node, grp in edge_df.groupby(p1_col)
            if len(grp) > 1
        )
    )

    try:
        tfidf = TfidfVectorizer(
            analyzer="word",
            tokenizer=dummy_func,
            preprocessor=dummy_func,
            use_idf=True,
            token_pattern=None,
            lowercase=False,
            sublinear_tf=True,
            min_df=3,
        )
        docs_vec = tfidf.fit_transform(docs)
    except ValueError:
        tfidf = TfidfVectorizer(
            analyzer="word",
            tokenizer=dummy_func,
            preprocessor=dummy_func,
            use_idf=True,
            token_pattern=None,
            lowercase=False,
            sublinear_tf=True,
            min_df=1,
        )
        docs_vec = tfidf.fit_transform(docs)

    results = sp_matmul_topn(docs_vec, docs_vec.T, top_n=100)

    interactions = pd.DataFrame.from_records(
        [
            (
                u1,
                user_ids[u2_idx],
                sim,
                min(supports[user_ids[u1_idx]], supports[user_ids[u2_idx]]),
            )
            for u1_idx, u1 in enumerate(user_ids)
            for u2_idx, sim in zip(
                results.indices[results.indptr[u1_idx] : results.indptr[u1_idx + 1]],
                results.data[results.indptr[u1_idx] : results.indptr[u1_idx + 1]],
            )
            if u1_idx < u2_idx
        ],
        columns=[node1_col, node2_col, sim_col, sup_col],
    )

    return interactions

## Count the number of lines to use progress bar (optional)

In [22]:
# get number of tweets
raw_json_gz = f"data/sample_raw_tweets.json.gz"
# number of lines in the input file
numline_file = f"data/sample_raw_tweets.numline"

In [23]:
! cat {raw_json_gz} | wc -l > {numline_file}

In [24]:
numline = int(open(numline_file).read().strip())
numline

1850502

# Step 1. Parse json of tweets to a table of user - feature values 

Construct a uid - feature - count table from raw tweets 

Input: raw tweet.json.gz files 

Output: feature parquets
In this case study, the feature are the tweet IDs that a user retweeted.

In [25]:
outdir = "results"
# column name of nodes in partite 1 and 2
p1_col = "uid"
p2_col = "feature"
# column name of edge weights
w_col = "cnt"

feature = "retweet_id"
# read input
with gzip.open(raw_json_gz, "rb") as in_fp:
    # do work
    user_features = prepare_content(in_fp, tqdm_total=numline)

# write output
for feature, counter_dict in user_features.items():
    edge_df = counters2edges(counter_dict, p1_col, p2_col, w_col)

    fname = "{}.edge.parquet".format(feature)
    edge_df.to_parquet(join(outdir, fname))

parse raw content:  32%|███▏      | 600000/1850502 [00:37<01:19, 15801.26it/s]


## Let's take a look at the feature dataframe

In [26]:
edge_df_name = f"{outdir}/{feature}.edge.parquet"
pd.read_parquet(edge_df_name).head()

Unnamed: 0,uid,feature,cnt
0,GpiuNvbRtbXwfynWiD_czA,XQFFKrN4jekP1eBSssFQsQ,1
1,N1t6Jzysm3Y885HGvTXLkQ,XQFFKrN4jekP1eBSssFQsQ,1
2,nLUp9Cli7fPuBSO8ghqgtQ,Clnzx7_oryrhThV-XRBSmQ,1
3,nLUp9Cli7fPuBSO8ghqgtQ,DvoA5c_dijc9wPLa06t50Q,1
4,FulMsCaamosiAYela5pgrg,t3YbgFjNzocBgkeeW3khyA,1


# Step 2. Calculate the similarity between features 

We consider each user as a document, and "words" are the feature scores, i.e., retweet ids. 
- We first filter out users with lack of support, i.e., those reposts less than `sup_thresh = 10` times

- We then create a TF-IDF vector for each user, where each dimension is a retweet id, and the value is the number of retweets.

We output a dataframe with the following columns: `user1, user2, similarity, support`; where each record represents an interaction between `user1` and `user2`; `similarity` is the strength of connections and `support` is the post count of `user1`

In [28]:
# column name of node1, node2, similarity, support
node1_col = "user1"
node2_col = "user2"
sim_col = "similarity"
sup_col = "support"
sup_thresh = 8
# takes a .parquet file of edge table
# save to output file of interactions with p-values
interaction_file = f"{outdir}/{feature}.interactions.parquet"
# read input
edge_df = pd.read_parquet(edge_df_name)

if not (len(edge_df) > 0 and np.any(edge_df.groupby(p1_col).size() > 1)):
    raise ValueError(
        "Empty input edge table. Run `prepare_content()` to create edge table first"
    )
else:
    # calculate the support, which is the post count for each user
    supports = edge_df.groupby(p1_col)[w_col].sum().to_dict()
    print("No users: ", len(supports))
    # filter nodes based on support
    supports = {k: v for k, v in supports.items() if v > sup_thresh}
    print(f"No users after filtering with threshold={sup_thresh}: {len(supports)}")
    filtered_nodes = list(supports.keys())
    # remove the nodes with low support from edge_df
    edge_df = edge_df[edge_df[p1_col].isin(filtered_nodes)]

    # calculate similirity between users
    interaction_df = calculate_interaction(
        edge_df, p1_col, p2_col, w_col, node1_col, node2_col, sim_col, sup_col
    )
    # write output
    interaction_df.to_parquet(interaction_file)

No users:  154414
No users after filtering with threshold=8: 4095


# Let's take a look at the interaction df

In [29]:
interaction_df = pd.read_parquet(interaction_file)
interaction_df.head()

Unnamed: 0,user1,user2,similarity,support
0,-1Yb7yo_soaiKjc5x3biGQ,myVI0yg25Ws-HHLvOMwtyw,0.264076,11
1,-1Yb7yo_soaiKjc5x3biGQ,lnHATpuKdTSVO40BbCJETQ,0.232556,12
2,-1Yb7yo_soaiKjc5x3biGQ,a6Wn9Gtt28SxNe6BKnsJIw,0.291757,9
3,-1Yb7yo_soaiKjc5x3biGQ,DuwH448XfvmUrM6hDDYvYw,0.285331,9
4,-1Yb7yo_soaiKjc5x3biGQ,718SpdW5td0JVg7t4a7daQ,0.294394,9


# Step 3. Filter low edge weight and create a network 
For this case study, we apply an weight filter `sim_thresh = 0.995`, i.e., only accounts with similarity >0.995 percentile are considered

In [34]:
# path to input indexed text of tweets
twttext = f"{outdir}/table.by.user.pkl"

# path to output file of the graphml for Gephi visualization
outgraph = f"{outdir}/{feature}.filtered.coord.graphml"
# path to output file of context of each suspicious group
grouptext = f"{outdir}/{feature}filtered.coord.pkl"

# read input parquet file of interaction
interaction_df = pd.read_parquet(interaction_file)

# define the filters
sim_thresh = 0.995

In [None]:
import networkx as nx

# apply edge weight filter: 0.995 quantile cosine similarity of TFIDF vectors
interaction_df = interaction_df[
    interaction_df[sim_col] >= interaction_df[sim_col].quantile(sim_thresh)
]
# create a graph object from the interaction table
G = nx.Graph()
for idx, row in interaction_df.iterrows():
    try:
        G.add_edge(
            row[node1_col], row[node2_col], weight=row[sim_col], support=row[sup_col]
        )
    except:
        print(row)
        pass
#  write graph object to file
nx.write_gml(G, outgraph)
del interaction_df

# Note

The above code matches the filtering procedure described in the paper: 
1. users with low support (few posts) are removed
2. tf-idf vectors are calculated
3. similarity between tf-idf vectors are calculated, and edges with low similarity are removed. 

The results of case study 5 in the paper were obtained by a slightly different ordering of these steps:
1. tf-idf vectors are calculated
2. similarity between tf-idf vectors are calculated, and edges with low similarity are removed
3. users with low support are removed. 

The resulting networks are very similar, and the result reported in the paper are not affected by this change.
