# Case Study 3

A data preprocessing step filters out accounts with few tweets and hashtags.
The thresholds depend on the time period under evaluation. In this case we use a minimum of
five tweets and five unique hashtags over a period of 24 hours to ensure sufficient
support for possible coordination.
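
The support filter described above can be sketched in a few lines. This is a minimal illustration, not the notebook's own implementation: the function name `filter_accounts` and the input layout (an iterable of `(userid, hashtags)` pairs covering one 24-hour window) are assumptions made for the example, and only the two thresholds stated in the text are applied.

```python
from collections import defaultdict


def filter_accounts(tweets, min_tweets=5, min_unique_hashtags=5):
    """Return the user ids that pass both support thresholds.

    `tweets` is assumed to be an iterable of (userid, hashtags) pairs for a
    single 24-hour window, where `hashtags` is a list of strings.
    """
    tweet_counts = defaultdict(int)
    unique_hashtags = defaultdict(set)
    for userid, hashtags in tweets:
        tweet_counts[userid] += 1
        unique_hashtags[userid].update(hashtags)
    return {
        userid
        for userid in tweet_counts
        if tweet_counts[userid] >= min_tweets
        and len(unique_hashtags[userid]) >= min_unique_hashtags
    }
```

Because the thresholds depend on the evaluation period, both are exposed as parameters rather than hard-coded.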

We engineer features that combine content (hashtags) and activity (timestamps) traces.
In particular, we use ordered sequences of hashtags for each user. The bipartite
network consists of accounts in one layer and hashtag sequences in the other.
In the projection phase, we draw an edge between two accounts with identical hashtag sequences.
These edges are unweighted and we do not apply any filtering, based on the assumption
that two independent users are unlikely to post identical sequences of five or more hashtags
on the same day.
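
As a rough sketch of the projection step, accounts can be grouped by their ordered hashtag sequence and every pair within a group linked by an unweighted edge. The function name `identical_sequence_edges` and the input dictionary are hypothetical. Keying the groups on a tuple is a choice made for this example: it keeps hashtag boundaries intact, so the sequences `["ab"]` and `["a", "b"]` are never confused.

```python
from collections import defaultdict
from itertools import combinations


def identical_sequence_edges(user_sequences):
    """Return sorted account pairs that share an identical hashtag sequence.

    `user_sequences` is assumed to map each userid to that user's
    time-ordered list of hashtags for the day.
    """
    accounts_by_sequence = defaultdict(list)
    for userid, sequence in user_sequences.items():
        # A tuple preserves both the order and the boundaries of the hashtags.
        accounts_by_sequence[tuple(sequence)].append(userid)

    edges = []
    for accounts in accounts_by_sequence.values():
        # Groups with a single account yield no pairs and hence no edges.
        edges.extend(combinations(sorted(accounts), 2))
    return sorted(edges)
```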

We identify suspicious groups of accounts by removing singleton nodes and then extracting
the connected components of the network. Large components are more suspicious,
as it is less likely that many accounts post the same hashtag sequences by chance.
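
The clustering step can be illustrated with `networkx`. The sketch below assumes an undirected account graph has already been built. The helper name `suspicious_groups` and its `min_size` parameter are assumptions introduced for the example. Components are returned largest first, reflecting the intuition that larger groups are more suspicious.

```python
import networkx as nx


def suspicious_groups(graph, min_size=2):
    """Return connected components as sets of accounts, largest first.

    `min_size` is a hypothetical knob for ignoring very small groups.
    """
    pruned = graph.copy()
    # Remove singleton nodes, i.e. accounts with no edges.
    pruned.remove_nodes_from(list(nx.isolates(pruned)))
    components = [
        component
        for component in nx.connected_components(pruned)
        if len(component) >= min_size
    ]
    return sorted(components, key=len, reverse=True)
```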

| Conjecture           | Similar large sequence of hashtags                                               |
|----------------------|----------------------------------------------------------------------------------|
| Support filter       | At least 5 tweets and 5 unique hashtags per day                                 |
| Trace                | Hashtags in a tweet                                                             |
| Eng. trace           | Ordered sequence of hashtags in a day                                           |
| Bipartite weight     | NA, the bipartite network is unweighted                                         |
| Proj. weight         | Co-occurrence                                                                   |
| Edge filter          | No                                                                             |
| Clustering           | Connected components                                                           |
| Data source          | BEV (Yang, Hui, and Menczer 2019)                                              |
| Data period          | Oct–Dec 2018                                                                    |
| No. accounts         | 59,389,305                                                                     |

*Table 3: Case study 3 summary*

In [None]:
import json
import gzip
import pandas as pd
import networkx as nx
from itertools import combinations

### The tweets file should only contain native tweets from the v1.1 Twitter API

In [None]:
# minimum number of tweets per user
min_tweets = 5

# minimum number of unique hashtags per user, and minimum number of
# tweets per user that contain at least one hashtag
min_unique_hashtags = 5
min_tweets_with_hashtags = 5

tweetid = []
text = []
hashtags = []
userid = []
created_at = []

# The tweets json file is a daily dump of tweets
with gzip.open('tweets.json.gz', 'rb') as f:
    for line in f:
        try:
            tmp_json = json.loads(line)
            for row in range(len(tmp_json)):
                tweetid.append(tmp_json[row]['id_str'])
                text.append(tmp_json[row]['text'])
                userid.append(tmp_json[row]['user']['id_str'])
                created_at.append(pd.to_datetime(tmp_json[row]['created_at']))
                hashtags.append(tmp_json[row]['entities']['hashtags'])
        except Exception as e:
            print(e)

data = pd.DataFrame(tweetid, columns=['tweetid'])
data['userid'] = userid
data['created_at'] = created_at
data['text'] = text
data['hashtags'] = hashtags
data.sort_values(by='created_at', inplace=True)

# Identify users that published the same sequence of hashtags
hashtag_seqs = dict()
# We iterate over each user in the dataframe
for user in data['userid'].unique():
    hashtag_sequence_list = []
    twts_with_hashtags_count = 0
    # For users that have at least the minimum required number of tweets,
    if len(data[data['userid'] == user].index) >= min_tweets:
        # we count how many of them have hashtags
        # and keep track of the hashtags
        for index in data[data['userid'] == user].index:
            if len(data.loc[index,'hashtags']) != 0:
                twts_with_hashtags_count += 1
                for hashtag in data.loc[index,'hashtags']:
                    hashtag_sequence_list.append(hashtag['text'])
        # If an account has at least the minimum required number of tweets that contain hashtags
        # and enough of them are unique, then we add the user to a dictionary that
        # keeps track of qualifying users
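        if (len(set(hashtag_sequence_list)) >= min_unique_hashtags) and (twts_with_hashtags_count >= min_tweets_with_hashtags):
            # Join with a space (hashtags cannot contain whitespace) so that
            # different sequences never collapse into the same string
            hashtag_sequence_str = " ".join(hashtag_sequence_list)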
            if hashtag_sequence_str in hashtag_seqs.keys():
                hashtag_seqs[hashtag_sequence_str].append(user)
            else:
                hashtag_seqs[hashtag_sequence_str] = [user]

# Build an undirected network of users that met the above thresholds and
# have the same hashtag sequences. The nodes (users) in the network will be connected
# to other users with the same hashtag sequence
G = nx.Graph()
for key in hashtag_seqs.keys():
    if len(hashtag_seqs[key]) > 1:
        for comb in combinations(hashtag_seqs[key], 2):
            # add_edge is a method of the Graph instance, not of the nx module
            G.add_edge(comb[0], comb[1])

nx.write_gexf(G,'hashtag_coordination_graph.gexf')