# Case Study 5: Activity Synchronization

``Pump \& dump" is a shady scheme where the price of a stock is inflated by simulating a surge in buyer interest through false statements (pump) to sell the cheaply purchased stock at a higher price (dump). Investors are vulnerable to this kind of manipulation because they want to act quickly when acquiring stocks that seem to promise high future profits. By exposing investors to information seemingly from different sources in a short period of time, fraudsters create a false sense of urgency that prompts victims to act.

Social media provides fertile grounds for this type of scam. We investigate the effectiveness of our approach in detecting coordinated cryptocurrency pump \& dump campaigns on Twitter. The data was collected using keywords and cashtags (e.g., \$BTC) associated with 25 vulnerable cryptocoins as query terms. We consider both original tweets and retweets because they all add to the stream of information considered by potential buyers. More details on the dataset are found in 

| **Conjecture**       | Synchronous activities       |
|-----------------------|------------------------------|
| **Support filter**    | Accounts with < 8 tweets    |
| **Trace**             | Tweet timestamp             |
| **Eng. trace**        | 30-minute time intervals    |
| **Bipartite weight**  | TF-IDF                      |
| **Proj. weight**      | Cosine similarity           |
| **Edge filter**       | Keep top 0.5% weights       |
| **Clustering**        | Connected components        |
| **Data source**       | DARPA SocialSim             |
| **Data period**       | Jan 2017--Jan 2019          |
| **No. accounts**      | 887,239                     |

*Table: Case study 5 summary*

## Detection

We hypothesize that coordinated pump \& dump campaigns use software to have multiple accounts post pump messages in close temporal proximity. Tweet timestamps are therefore used as the behavioral trace of the accounts. 
The shorter the time interval in which two tweets are posted, the less likely they are to be coincidental. However, short time intervals result in significantly fewer matches and increased computation time. On the other hand, longer (e.g., daily) intervals produce many false positive matches. To balance between these concerns, we use 30-minute time intervals.

Intuitively, it is likely that any two users would post one or two tweets that fall within any time interval; however, the same is not true for a set of more tweets. To focus on accounts with sufficient support for coordination, we only keep those that post at least eight messages. This specific support threshold value is chosen to minimize false positive matches

The tweets are then binned based on the time interval in which they are posted. These time features are used to construct the bipartite network of accounts and tweet times. Edges are weighted using TF-IDF. Similar to the previous case, the projected account coordination network is weighted by the cosine similarity between the TF-IDF vectors. Upon manual inspection, we found that many of the tweets being shared in this network are not related to cryptocurrencies, while only a small percentage of edges are about this topic. These edges also have high similarity and yield a strong signal of coordination. Thus, we only preserve the 0.5\% of edges with largest cosine similarity 

# Helper functions

In [1]:
import sys
import glob
import argparse
import numpy as np
import pandas as pd
from sparse_dot_topn import sp_matmul_topn
from tqdm import tqdm
from os.path import join
from itertools import combinations
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer


def counters2edges(counter_dict, p1_col, p2_col, w_col):
    # Count how many times each feature co-occurs with each user
    edge = pd.DataFrame.from_records(
        [
            (uid, feature, count)
            for uid, feature_counter in counter_dict.items()
            for feature, count in feature_counter.items()
        ],
        columns=[p1_col, p2_col, w_col],
    )
    return edge


def prepare_content(in_fp, tqdm_desc="parse raw content", tqdm_total=None):
    # construct uid - feature - count table using the defined counters2edges() function
    user_features = defaultdict(lambda: defaultdict(Counter))
    n = 0
    idx = 0
    for line in tqdm(in_fp, desc=tqdm_desc, total=tqdm_total):
        twt = json.loads(line)

        if "timestamp_ms" in twt:
            try:
                # user_id = twt['user']['id_str_h'] #TODO: uncomment
                user_id = twt["user"]["id_str"]
                twt_time_str = twt["timestamp_ms"][:-3]
                created_at = int(twt_time_str)
            except:
                print(f"pass {idx}")
                idx += 1
                continue

            # find min time
            if n == 0:
                min_time = created_at
            # convert the time to the nearest 30 minutes
            if ((created_at - min_time) % 1800) == 0:
                user_features["tweet_time"][user_id][created_at] += 1
            else:
                if created_at < min_time:
                    time_bin = math.ceil((created_at - min_time) / 1800)
                if created_at > min_time:
                    time_bin = math.floor((created_at - min_time) / 1800)
                twt_time = min_time + time_bin * 1800
                user_features["tweet_time"][user_id][twt_time] += 1
            n += 1

    return user_features


def dummy_func(doc):
    return doc


def calculate_interaction(
    edge_df,
    p1_col,
    p2_col,
    w_col,
    node1_col,
    node2_col,
    sim_col,
    sup_col,
    tqdm_total=None,
    tqdm_desc="calculating interaction",
):
    """
    calculate interactions between nodes in partition 1

    input:
        edge_df : dataframe that contains (p1, p2, w)
        p1_col : column name that represent p1, i.e: uid
        p2_col : column name that represent p2 , e.g: time
        w_col : column name that represent weight, if None it's unweighted
        node1_col, node2_col: col after projection, both are uid
        supports: dictionary of userids - post count {p1_node: support}
    return:
        interactions : dict { (p1_node1, p1_node2) : interaction_weight }

    assumption:
        1. non-empty inputs: edge_df
        2. p1_node1 < p1_node2 holds for the tuples in interactions
    """
    user_ids, docs = zip(
        *(
            (p1_node, grp[p2_col].repeat(grp[w_col]).values)
            for p1_node, grp in edge_df.groupby(p1_col)
            if len(grp) > 1
        )
    )

    try:
        tfidf = TfidfVectorizer(
            analyzer="word",
            tokenizer=dummy_func,
            preprocessor=dummy_func,
            use_idf=True,
            token_pattern=None,
            lowercase=False,
            sublinear_tf=True,
            min_df=3,
        )
        docs_vec = tfidf.fit_transform(docs)
    except ValueError:
        tfidf = TfidfVectorizer(
            analyzer="word",
            tokenizer=dummy_func,
            preprocessor=dummy_func,
            use_idf=True,
            token_pattern=None,
            lowercase=False,
            sublinear_tf=True,
            min_df=1,
        )
        docs_vec = tfidf.fit_transform(docs)

    results = sp_matmul_topn(docs_vec, docs_vec.T, top_n=100)

    interactions = pd.DataFrame.from_records(
        [
            (
                u1,
                user_ids[u2_idx],
                sim,
                min(supports[user_ids[u1_idx]], supports[user_ids[u2_idx]]),
            )
            for u1_idx, u1 in enumerate(user_ids)
            for u2_idx, sim in zip(
                results.indices[results.indptr[u1_idx] : results.indptr[u1_idx + 1]],
                results.data[results.indptr[u1_idx] : results.indptr[u1_idx + 1]],
            )
            if u1_idx < u2_idx
        ],
        columns=[node1_col, node2_col, sim_col, sup_col],
    )

    return interactions

## Count the number of lines to use progress bar (optional)

In [None]:
fname = "candidates_streaming_data"  # replace with raw_tweets
# get number of tweets
raw_json_gz = f"/Users/baott/tcd/tcd_code/data/{fname}.json.gz"
# number of lines in the input file
numline_file = f"/Users/baott/tcd/tcd_code/data/{fname}.numline"

In [None]:
! cat {raw_json_gz} | wc -l > {numline_file}

In [None]:
numline = int(open(numline_file).read().strip())
numline

652547

# Step 1. Parse json of tweets to a table of user - feature values 

Construct a uid - feature - count table from raw tweets 

Input: raw tweet.json.gz files 

Output: feature parquets
In this case study, the feature are the timestamps of tweets that a user posted.
For other case studies, the features can be different. E.g., in case study 4, the features are the tweet IDs that a user retweeted.

In [None]:
# path to input parquet file of edge table
infile = raw_json_gz

# number of lines in the input file
numline = numline
# column name of nodes in partite 1 and 2
p1_col = "uid"
p2_col = "feature"
# column name of edge weights
w_col = "cnt"

feature = "tweet_time"
# read input
with gzip.open(infile, "rb") as in_fp:
    # do work
    user_features = prepare_content(in_fp, tqdm_total=numline)

# write output
for feature, counter_dict in user_features.items():
    edge_df = counters2edges(counter_dict, p1_col, p2_col, w_col)

    fname = "{}.edge.parquet".format(feature)
    edge_df.to_parquet(join(outdir, fname))

parse raw content:  16%|█▌        | 104948/652547 [00:10<00:52, 10356.18it/s]


In [57]:
edge_df_name = f"{outdir}/{feature}.edge.parquet"

## Let's take a look at the feature dataframe

In [3]:
pd.read_parquet("../features/pnd/tweet_time.edge.parquet").head()

Unnamed: 0,uid,feature,cnt
0,4OVTHrO8tazRZibJuR2OOg,1491409852,1
1,4OVTHrO8tazRZibJuR2OOg,1491445852,1
2,4OVTHrO8tazRZibJuR2OOg,1493334052,1
3,4OVTHrO8tazRZibJuR2OOg,1491343252,2
4,4OVTHrO8tazRZibJuR2OOg,1492144252,1


# Step 2. Calculate the similarity between features 

We consider each user as a document, and "words" are the feature scores, i.e., time intervals. 
- We first filter out users with lack of support, i.e., those posting less than `sup_thresh = 8`

- We then create a TF-IDF vector for each user, where each dimension is a time interval, and the value is the number of tweets in that time interval.

We output a dataframe with the following columns: `user1, user2, similarity, support`; where each record represents an interaction between `user1` and `user2`; `similarity` is the strength of connections and `support` is the post count of `user1`

In [73]:
# column name of node1, node2, similarity, support
node1_col = "user1"
node2_col = "user2"
sim_col = "similarity"
sup_col = "support"
sup_thresh = 8
# takes a .parquet file of edge table
# save to output file of interactions with p-values
interaction_file = f"{outdir}/{feature}.interactions.parquet"
outfile = interaction_file
# read input
edge_df = pd.read_parquet(edge_df_name)

if not (len(edge_df) > 0 and np.any(edge_df.groupby(p1_col).size() > 1)):
    raise ValueError(
        "Empty input edge table. Run `prepare_content()` to create edge table first"
    )
else:
    # calculate the support, which is the post count for each user
    supports = edge_df.groupby(p1_col)[w_col].sum().to_dict()
    print("No users: ", len(supports))
    # filter nodes based on support
    supports = {k: v for k, v in supports.items() if v > sup_thresh}
    print(f"No users after filtering with threshold={sup_thresh}: {len(supports)}")
    filtered_nodes = list(supports.keys())
    # remove the nodes with low support from edge_df
    edge_df = edge_df[edge_df[p1_col].isin(filtered_nodes)]

    # calculate similirity between users
    interaction_df = calculate_interaction(
        edge_df, p1_col, p2_col, w_col, node1_col, node2_col, sim_col, sup_col
    )
    # write output
    interaction_df.to_parquet(outfile)

# Let's take a look at the interaction df

In [2]:
interaction_df = pd.read_parquet(
    "../features/pnd/filter_support.tweet_time.interactions.parquet"
)
interaction_df.head()

Unnamed: 0,user1,user2,similarity,support
0,---7ozV6QOWYeeNpX3s_Yw,yEQ29U-82sbvEDZfwApHdA,0.082107,9
1,---7ozV6QOWYeeNpX3s_Yw,j1Vd4_5A3BaE8VSqaQysfw,0.070501,9
2,---7ozV6QOWYeeNpX3s_Yw,PKD7gEVVncUB4SD_R0ahww,0.145915,9
3,---7ozV6QOWYeeNpX3s_Yw,P2dJ8F5ig5dAwosHqXAVTQ,0.077978,9
4,---7ozV6QOWYeeNpX3s_Yw,L7FI6SkWiZ4Mxb433r-g7g,0.073727,9


# Step 3. Filter low edge weight and create a network 
For case study 5, we apply an weight filter `sim_thresh = 0.995`, i.e., only accounts with similarity >0.995 percentile are considered

In [None]:
# path to input indexed text of tweets
twttext = f"{outdir}/table.by.user.pkl"

# path to output file of the graphml for Gephi visualization
outgraph = f"{outdir}/{feature}.filtered.coord.graphml"
# path to output file of context of each suspicious group
grouptext = f"{outdir}/{feature}filtered.coord.pkl"

# read input parquet file of interaction
interaction_df = pd.read_parquet(interaction_file)

# define the filters
sim_thresh = 0.995

<class 'networkx.classes.graph.Graph'>
13


In [None]:
# apply edge weight filter: 0.995 quantile cosine similarity of TFIDF vectors
interaction_df = interaction_df[
    interaction_df[sim_col] >= interaction_df[sim_col].quantile(sim_thresh)
]
# create a graph object from the interaction table
G = nx.Graph()
for idx, row in interaction_df.iterrows():
    try:
        G.add_edge(
            row[node1_col], row[node2_col], weight=row[sim_col], support=row[sup_col]
        )
    except:
        print(row)
        pass
#  write graph object to file
nx.write_gml(G, outgraph)
del interaction_df

# get the tweet texts for each connected component for manual annotation
has_text = twttext is not None
if has_text:
    with open(twttext, "rb") as f:
        tweet_table = pickle.load(f)
# texts (list of dict): each element is a connected component
# represented by a dictionary where each key is a node_id and the value is the tweet text
texts = [
    {node_id: tweet_table[node_id] for node_id in component}
    for component in nx.connected_components(G)
]
texts.sort(key=len, reverse=True)
# save to file
with open(grouptext, "wb") as f:
    pickle.dump(texts, f)

# Note

The above code matches the filtering procedure described in the paper: 
1. users with low support (few posts) are removed
2. tf-idf vectors are calculated
3. similarity between tf-idf vectors are calculated, and edges with low similarity are removed. 

The results of case study 5 in the paper were obtained by a slightly different ordering of these steps:
1. tf-idf vectors are calculated
2. similarity between tf-idf vectors are calculated, and edges with low similarity are removed
3. users with low support are removed. 

The resulting networks are very similar, and the result reported in the paper are not affected by this change.


# Optional final step: save user_id and tweet texts 

For optional close-reading of the case study, we parse the raw json of tweets to a table of user_id - tweet text and save to file

In [None]:
#!/usr/bin/env python3
import re
import os
import sys
import gzip
import pickle
import argparse
import pandas as pd
import ujson as json
from tqdm import tqdm
from os.path import join
from collections import defaultdict, Counter
import math
import pyarrow
import numpy as np
import networkx as nx

# path to output directory for output randomized edge tables
outdir = "12-04-2024_features"
if not os.path.exists(outdir):
    os.makedirs(outdir)


def prepare_content(in_fp, tqdm_desc="parse user tweet", tqdm_total=None):
    # create a dictionary of user_id -> list of tweets by that user
    user_tweets = defaultdict(list)
    idx = 0
    for line in tqdm(in_fp, desc=tqdm_desc, total=tqdm_total):
        twt = json.loads(line)
        try:
            # user_id = twt['user']['id_str_h']
            # user_tweets[user_id].append(twt['text_m'])
            user_id = twt["user"]["id_str"]
            user_tweets[user_id].append(twt["text"])
        except:
            pass

            idx += 1
            print(f"pass {idx}")
    return user_tweets


# path to input parquet file of edge table
infile = raw_json_gz
# path to output pickle file of tweet table
outfile = f"{outdir}/table.by.user.pkl"


# read input to create a dictionary of user_id -> list of tweets by that user
with gzip.open(infile, "rb") as in_fp:
    # do work
    user_tweets = prepare_content(in_fp, tqdm_total=numline)
with open(outfile, "wb") as out_fp:
    # write output
    pickle.dump(user_tweets, out_fp)

parse user tweet:   1%|▏         | 8209/652547 [00:00<00:57, 11155.01it/s]

pass 1
pass 2


parse user tweet:   2%|▏         | 12907/652547 [00:01<00:57, 11201.46it/s]

pass 3
pass 4
pass 5
pass 6
pass 7
pass 8


parse user tweet:   2%|▏         | 15473/652547 [00:01<00:52, 12049.13it/s]

pass 9
pass 10


parse user tweet:   3%|▎         | 19062/652547 [00:01<00:53, 11752.45it/s]

pass 11
pass 12
pass 13
pass 14
pass 15
pass 16
pass 17
pass 18
pass 19
pass 20
pass 21


parse user tweet:   3%|▎         | 21398/652547 [00:01<00:56, 11153.23it/s]

pass 22
pass 23
pass 24
pass 25
pass 26
pass 27
pass 28
pass 29
pass 30
pass 31


parse user tweet:   4%|▍         | 24724/652547 [00:02<01:06, 9394.61it/s] 

pass 32
pass 33
pass 34
pass 35
pass 36
pass 37


parse user tweet:   4%|▍         | 26784/652547 [00:02<01:09, 9036.33it/s]

pass 38
pass 39
pass 40
pass 41


parse user tweet:   4%|▍         | 28934/652547 [00:02<01:02, 9919.40it/s]

pass 42
pass 43
pass 44
pass 45
pass 46
pass 47
pass 48


parse user tweet:   5%|▍         | 31337/652547 [00:02<00:56, 10984.32it/s]

pass 49
pass 50
pass 51
pass 52
pass 53
pass 54
pass 55


parse user tweet:   5%|▌         | 33972/652547 [00:03<00:51, 12085.95it/s]

pass 56
pass 57
pass 58
pass 59
pass 60
pass 61
pass 62
pass 63
pass 64
pass 65
pass 66


parse user tweet:   6%|▌         | 40596/652547 [00:03<00:47, 12795.12it/s]

pass 67
pass 68
pass 69
pass 70
pass 71
pass 72
pass 73
pass 74


parse user tweet:   7%|▋         | 43144/652547 [00:03<00:48, 12544.95it/s]

pass 75
pass 76
pass 77
pass 78
pass 79


parse user tweet:   7%|▋         | 45623/652547 [00:04<00:50, 12043.87it/s]

pass 80
pass 81
pass 82
pass 83


parse user tweet:   8%|▊         | 50819/652547 [00:04<00:47, 12658.66it/s]

pass 84
pass 85
pass 86
pass 87


parse user tweet:   8%|▊         | 52087/652547 [00:04<00:47, 12522.46it/s]

pass 88
pass 89
pass 90
pass 91
pass 92
pass 93
pass 94


parse user tweet:   9%|▊         | 55841/652547 [00:05<00:53, 11157.73it/s]

pass 95
pass 96
pass 97


parse user tweet:   9%|▉         | 58260/652547 [00:05<00:51, 11548.68it/s]

pass 98
pass 99
pass 100
pass 101
pass 102


parse user tweet:   9%|▉         | 60782/652547 [00:05<00:48, 12079.02it/s]

pass 103
pass 104
pass 105
pass 106


parse user tweet:  10%|█         | 65773/652547 [00:05<00:48, 12224.26it/s]

pass 107
pass 108
pass 109
pass 110


parse user tweet:  11%|█         | 70928/652547 [00:06<00:46, 12600.37it/s]

pass 111
pass 112


parse user tweet:  11%|█▏        | 74752/652547 [00:06<00:49, 11670.53it/s]

pass 113
pass 114


parse user tweet:  12%|█▏        | 75928/652547 [00:06<00:53, 10729.60it/s]

pass 115
pass 116
pass 117
pass 118


parse user tweet:  12%|█▏        | 78104/652547 [00:06<00:55, 10334.63it/s]

pass 119
pass 120


parse user tweet:  13%|█▎        | 83075/652547 [00:07<00:48, 11802.98it/s]

pass 121
pass 122
pass 123


parse user tweet:  14%|█▍        | 90509/652547 [00:07<00:46, 11990.20it/s]

pass 124
pass 125
pass 126
pass 127


parse user tweet:  15%|█▍        | 95418/652547 [00:08<00:47, 11638.98it/s]

pass 128


parse user tweet:  15%|█▌        | 100354/652547 [00:08<00:45, 12047.54it/s]

pass 129
pass 130


parse user tweet:  16%|█▌        | 104948/652547 [00:09<00:47, 11416.60it/s]


pass 131
pass 132
pass 133
pass 134
pass 135


In [None]:
user_tweets

defaultdict(list,
            {'150521075': ['I’m on my way to Bullock County to help get out the vote!  Please remember today (OCT 24th) is the last day to regi… https://t.co/cfeUdgSPfd'],
             '1400915708259696640': ['RT @willainsworthAL: Addie is super excited to be campaigning with our next U.S. Senator @KatieBrittforAL and former Patriots player @wesle…',
              'The Port of Mobile has a tremendous positive impact across all 67 counties in our state. This data confirms that th… https://t.co/VZQnkcmahH',
              'We applaud the Alabama families, teachers, and school administrators who navigated this uniquely challenging time i… https://t.co/1Q619RCJBE',
              "The Nation's Report Card released today shows that Alabama is the only state nationwide to have fourth grade readin… https://t.co/WOaDCPEGck",
              'We were excited to be at Hancock Whitney Stadium in Mobile to watch @SouthAlabamaFB and @TroyTrojansFB battle for t… https://t.co/TLOMvGyawJ

In [None]:
user_to_tweets = f"{outdir}/table.by.user.pkl"