# Anonymous Twitter Data Collection

Jennifer Shan

1. [Collecting followers and following](#collect)
2. [Checking for alphanumeric variations of 'anonymous' and 'legion'](#filter)
3. [Adding Anonymous, profile, and content features](#add)
4. [Applying DecisionTreeClassifier and RandomForestClassifier](#apply)
5. [Collecting Anonymous-affiliated tweets](#anon)
6. [Collecting randomly sampled tweets](#random)

In [38]:
%store -r consumer_key
%store -r consumer_secret
%store -r access_token
%store -r access_token_secret

In [39]:
import tweepy

In [40]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth, wait_on_rate_limit = True)

We specify our seed accounts.

In [41]:
ausers = ['AnonyOps', 'YourAnonNews', 'YourAnonCentral', 'AnonPress']

<a id="collect"></a>
**1. Collecting followers and following**  
We collect a list of followers and a list of followed users.

In [42]:
import json

We only grab 5000 followers for every seed account.

In [197]:
fwr = {}

for a in ausers:
    fwr[a] = []
    for page in tweepy.Cursor(api.get_follower_ids, screen_name = a).pages():
        fwr[a].extend(page)
        break

In [122]:
fwg = {}

for a in ausers:
    fwg[a] = []
    for page in tweepy.Cursor(api.get_friend_ids, screen_name = a).pages():
        fwg[a].extend(page)

Let's check what we're dealing with.

In [198]:
for a in ausers:
    print(a, len(fwr[a]), len(fwg[a]))

AnonyOps 5000 1141
YourAnonNews 5000 807
YourAnonCentral 5000 665
AnonPress 5000 101


We don't want to run this again.

In [199]:
with open('fwr.json', 'w') as f:
    json.dump(fwr, f)
with open('fwg.json', 'w') as f:
    json.dump(fwg, f)

<a id="filter"></a>
**2. Checking for alphanumeric variations of 'anonymous' and 'legion'**  
We can check with `variations` in our users since these are probably Anonymous-affiliated.

In [13]:
fwr = json.load(open('fwr.json'))
fwg = json.load(open('fwg.json'))

In [11]:
variations = ('anonymous an0nym0u5 anonymou5 an0nymous anonym0us anonym0u5 '
              'an0nymou5 an0nym0us anony legion leg1on an0ny anon l3gion legi0n '
              'l3g1on leg10n an0n le3gi0n l3g10n')
variations = variations.split()

We check in name or screen name.

In [25]:
anonfwr = {}

for a in ausers:
    anonfwr[a] = []
    for u in fwr[a]:
        try:
            user = api.get_user(user_id = u)
            name = user.name.lower()
            scrname = user.screen_name.lower()
            if any(var in name for var in variations) or any(var in scrname for var in variations):
                anonfwr[a].append(u)
        except tweepy.errors.Forbidden as e:
            print(e)
        except tweepy.errors.NotFound as e:
            print(e)

Rate limit reached. Sleeping for: 421


403 Forbidden
63 - User has been suspended.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 647


404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 650
Rate limit reached. Sleeping for: 661


403 Forbidden
63 - User has been suspended.


Rate limit reached. Sleeping for: 662
Rate limit reached. Sleeping for: 711


403 Forbidden
63 - User has been suspended.
403 Forbidden
63 - User has been suspended.
404 Not Found
50 - User not found.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 714


404 Not Found
50 - User not found.
404 Not Found
50 - User not found.
404 Not Found
50 - User not found.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 727


404 Not Found
50 - User not found.
404 Not Found
50 - User not found.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 727


404 Not Found
50 - User not found.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 728


404 Not Found
50 - User not found.
403 Forbidden
63 - User has been suspended.
403 Forbidden
63 - User has been suspended.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 726


403 Forbidden
63 - User has been suspended.
403 Forbidden
63 - User has been suspended.
403 Forbidden
63 - User has been suspended.


Rate limit reached. Sleeping for: 697


404 Not Found
50 - User not found.
404 Not Found
50 - User not found.
404 Not Found
50 - User not found.
404 Not Found
50 - User not found.
403 Forbidden
63 - User has been suspended.
403 Forbidden
63 - User has been suspended.


Rate limit reached. Sleeping for: 733


404 Not Found
50 - User not found.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 712


403 Forbidden
63 - User has been suspended.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 716


403 Forbidden
63 - User has been suspended.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 713


404 Not Found
50 - User not found.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 710


403 Forbidden
63 - User has been suspended.
404 Not Found
50 - User not found.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 708


404 Not Found
50 - User not found.
404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 683


404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 726
Rate limit reached. Sleeping for: 717
Rate limit reached. Sleeping for: 683


In [26]:
anonfwg = {}

for a in ausers:
    anonfwg[a] = []
    for u in fwg[a]:
        try:
            user = api.get_user(user_id = u)
            name = user.name.lower()
            scrname = user.screen_name.lower()
            if any(var in name for var in variations) or any(var in scrname for var in variations):
                anonfwg[a].append(u)
        except tweepy.errors.Forbidden as e:
            print(e)
        except tweepy.errors.NotFound as e:
            print(e)

Rate limit reached. Sleeping for: 691


404 Not Found
50 - User not found.


Rate limit reached. Sleeping for: 686
Rate limit reached. Sleeping for: 684


In [27]:
for a in ausers:
    print(a, len(anonfwr[a]), len(anonfwg[a]))

AnonyOps 140 47
YourAnonNews 53 82
YourAnonCentral 41 66
AnonPress 101 9


We don't want to run this again.

In [28]:
with open('anonfwr.json', 'w') as f:
    json.dump(anonfwr, f)
with open('anonfwg.json', 'w') as f:
    json.dump(anonfwg, f)

<a id="add"></a>
**3. Adding Anonymous, profile, and content features**  
We can add features for our accounts.

In [2]:
import pandas as pd
import re
import string
import emoji

In [6]:
anonfwr = json.load(open('anonfwr.json'))
anonfwg = json.load(open('anonfwg.json'))

We combine all users together into `users`.

In [26]:
users = []

for a in ausers:
    users.extend(anonfwr[a])
    users.extend(anonfwg[a])

print(len(users))

539


We get name, screen name, and description for each user.

In [139]:
names = []
usernames = []
desc = []

for u in users:
    user = api.get_user(user_id = u)
    name = user.name
    scrname = user.screen_name
    d = user.description
    names.append(name)
    usernames.append(scrname)
    desc.append(d)  

print(len(usernames), len(names), len(desc))

539 539 539


In [141]:
links = []
for username in usernames:
    links.append('https://twitter.com/' + username)

We create a `DataFrame` to store all our information.

In [28]:
anonacc = pd.DataFrame({'id': users, 'name': names, 'screen name': usernames, 'description': desc, 'profile': links})
anonacc.drop_duplicates(inplace = True, ignore_index = True)

def click(val): # makes links clickable
    return '<a href="{}">{}</a>'.format(val, val)

anonacc.style.format({'profile': click})

Unnamed: 0,id,name,screen name,description,profile
0,1434935899645566976,CAnon Yacht Club,CAnonYachtClub,"If you love Gaming, NFT, Metaverse and making money in Crypto, join our community for discussion. Like and follow us for special giveaway!Not a financial advice",https://twitter.com/CAnonYachtClub
1,1469439375637241858,AnonMedic001 2.0,HisRobbins,Just here from the comments and #AMCSQUEEZE Kill the Gibson #Over9000,https://twitter.com/HisRobbins
2,1429176781777555457,Anonymous Hong Kong,Op_HongKong,Anonymous Hong Kong. Latest News/Hacks Democracy is human right you dare to deny it? #FreeHongKong #OpHongKong #Anonymous,https://twitter.com/Op_HongKong
3,1401080776817471488,Anonymous,cryptobugfixer,#Bitcoin,https://twitter.com/cryptobugfixer
4,1468690875937169411,Anonymous,RealAnonyNews,News reporting twitter true daily news freedom anony center join us & justice text #Anonymous #InterWebs,https://twitter.com/RealAnonyNews
5,1391716598252285952,QanonlikeFulda2021,QFulda2021,,https://twitter.com/QFulda2021
6,1464968757143650309,Anonymous,Anonymo95956205,,https://twitter.com/Anonymo95956205
7,1465624651590160385,Anonim,anonimresmi,Gerçekleri söylemekten ve susmaktan vaz geçme! senin bir hakkin var! ezdirme kendini sisteme karşı. #Anonim,https://twitter.com/anonimresmi
8,1462567083271593992,JJ,metis_anon,Here for the Earth and its keepers. Indigenous Resistance,https://twitter.com/metis_anon
9,1463147172661121024,anonymarie,anonymarie1,Estamos te Observando! We are Watching you!,https://twitter.com/anonymarie1


We write functions to check for Anonymous features.

In [43]:
def anonymous(x): # checks for anonymous
    x = x.lower().replace('.', '')
    if 'anonymous' in x:
        return True
    return False
    
def anon(x): # checks for anon
    x = x.lower().replace('.', '')
    if 'anon' in x:
        return True
    return False
    
def anony(x): # checks for anony
    x = x.lower().replace('.', '')
    if 'anony' in x:
        return True
    return False
    
def legion(x): # checks for legion
    x = x.lower().replace('.', '')
    if 'legion' in x:
        return True
    return False
    
def ops(x): # checks for ops
    x = x.lower().replace('.', '')
    if 'ops' in x:
        return True
    return False
    
def motto(x): # checks for motto in description
    x = x.lower().replace('.', '')
    motto = 'we are anonymous we are legion we do not forgive we do not forget expect us'
    if motto in x:
        return True
    return False

In [44]:
features = [anonymous, anon, anony, legion, ops]
check = {'name': '_in_name', 'screen name': '_in_scr', 'description': '_in_desc'}

for f in features:
    for c in check:
        anonacc[f.__name__ + check[c]] = anonacc[c].apply(f)
        
anonacc['motto_in_desc'] = anonacc['description'].apply(motto)

We write functions to check for profile features.

In [45]:
def tweet_number(x): # gets number of tweets
    user = api.get_user(user_id = x)
    return user.statuses_count
    
def fwr_number(x): # gets number of followers
    user = api.get_user(user_id = x)
    return user.followers_count
    
def fwg_number(x): # gets number following
    user = api.get_user(user_id = x)
    return user.friends_count
    
def f_f_ratio(x): # gets follower-friend ratio
    fwr = fwr_number(x)
    fwg = fwg_number(x)
    if fwg == 0:
        return fwg
    return fwr/fwg
    
def location_prov(x): # gets if location is provided
    user = api.get_user(user_id = x)
    if user.location is None:
        return False
    return True

In [46]:
features = [tweet_number, fwr_number, fwg_number, f_f_ratio, location_prov]

for f in features:
    anonacc[f.__name__] = anonacc['id'].apply(f)

Rate limit reached. Sleeping for: 739
Rate limit reached. Sleeping for: 678


We write functions to check for content features.

In [47]:
def char_number(x):
    return len(x)
    
def word_number(x):
    return len(re.findall(r'\w+', x))
    
def uppercase_number(x):
    return sum(1 for c in x if c.isupper())
    
def lowercase_number(x):
    return sum(1 for c in x if c.islower())

def alpha_number(x):
    return sum(1 for c in x if c.isalpha())

def num_number(x):
    return sum(1 for c in x if c.isdigit())

def punc_number(x):
    return sum(1 for c in x if c in string.punctuation)  
    
def emoji_number(x):
    return sum(1 for c in x if c in emoji.UNICODE_EMOJI['en'])
    
def hashtag_number(x):
    return x.count('#')

In [48]:
features = [char_number, uppercase_number, lowercase_number, alpha_number, num_number, punc_number]
check = {'name': '_in_name', 'screen name': '_in_scr', 'description': '_in_desc'}

for f in features:
    for c in check:
        anonacc[f.__name__ + check[c]] = anonacc[c].apply(f)

anonacc['word_number_in_name'] = anonacc['name'].apply(word_number)
anonacc['word_number_in_desc'] = anonacc['description'].apply(word_number)
anonacc['emoji_number_in_name'] = anonacc['name'].apply(emoji_number)
anonacc['emoji_number_in_desc'] = anonacc['description'].apply(emoji_number)
anonacc['hashtag_number_in_name'] = anonacc['name'].apply(hashtag_number)
anonacc['hashtag_number_in_desc'] = anonacc['description'].apply(hashtag_number)

Let's see what `anonacc` looks like.

In [49]:
anonacc

Unnamed: 0,id,name,screen name,description,profile,anonymous_in_name,anonymous_in_scr,anonymous_in_desc,anon_in_name,anon_in_scr,...,num_number_in_desc,punc_number_in_name,punc_number_in_scr,punc_number_in_desc,word_number_in_name,word_number_in_desc,emoji_number_in_name,emoji_number_in_desc,hashtag_number_in_name,hashtag_number_in_desc
0,1434935899645566976,CAnon Yacht Club,CAnonYachtClub,"If you love Gaming, NFT, Metaverse and making ...",https://twitter.com/CAnonYachtClub,False,False,False,True,True,...,0,0,0,5,3,27,0,0,0,0
1,1469439375637241858,AnonMedic001 2.0,HisRobbins,Just here from the comments and #AMCSQUEEZE\nK...,https://twitter.com/HisRobbins,False,False,False,True,False,...,4,1,0,2,3,11,0,0,0,2
2,1429176781777555457,Anonymous Hong Kong,Op_HongKong,Anonymous Hong Kong. Latest News/Hacks \nDemoc...,https://twitter.com/Op_HongKong,True,False,True,True,False,...,0,0,1,6,3,18,0,0,0,3
3,1401080776817471488,Anonymous,cryptobugfixer,#Bitcoin,https://twitter.com/cryptobugfixer,True,False,False,True,False,...,0,0,0,1,1,1,0,0,0,1
4,1468690875937169411,Anonymous,RealAnonyNews,News reporting twitter true daily news freedom...,https://twitter.com/RealAnonyNews,True,False,True,True,True,...,0,0,0,3,1,15,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
423,1389672130086936577,AnonJsf,YourAnonJ,Not to follow. Just share\n🇨🇴🏴‍☠️ United for C...,https://twitter.com/YourAnonJ,False,False,False,True,True,...,0,0,0,1,1,8,0,2,0,0
424,1185963459357413377,Anonymous,camarozulia,Anonymous,https://twitter.com/camarozulia,True,False,True,True,False,...,0,0,0,0,1,1,0,0,0,0
425,1390055524494823424,Anonymous,Anonymo48157551,"we are legion, we are Anonymous",https://twitter.com/Anonymo48157551,True,False,True,True,True,...,0,0,0,1,1,6,0,0,0,0
426,3130880601,Jhojan Andres,LEGIONESPECIAL,,https://twitter.com/LEGIONESPECIAL,False,False,False,False,False,...,0,0,0,0,2,0,0,0,0,0


We should store this.

In [50]:
anonacc.to_json('anonacc.json')

<a id="apply"></a>
**4. Applying DecisionTreeClassifier and RandomForestClassifier**  
We manually label accounts to train `DecisionTreeClassifier` and `RandomForestClassifier`. 

In [103]:
import csv
import sklearn
from sklearn import ensemble, tree
from sklearn.model_selection import GridSearchCV, cross_validate, cross_val_predict

In [29]:
anonacc = pd.read_json('anonacc.json')

An Anonymous account has at least one Anonymous keyword from `variations` in name or screen name and in description and has a profile or background image containing a Guy Fawkes mask or a floating businessman.

In [81]:
with open('affiliation.csv', encoding = 'utf-8-sig') as f:
    reader = csv.reader(f)
    aff = [int(line[0]) for line in reader]

Let's add this to `anonacc`.

In [94]:
anonacc.insert(4, 'affiliation', aff)
anonacc

Unnamed: 0,id,name,screen name,description,affiliation,profile,anonymous_in_name,anonymous_in_scr,anonymous_in_desc,anon_in_name,...,num_number_in_desc,punc_number_in_name,punc_number_in_scr,punc_number_in_desc,word_number_in_name,word_number_in_desc,emoji_number_in_name,emoji_number_in_desc,hashtag_number_in_name,hashtag_number_in_desc
0,1434935899645566976,CAnon Yacht Club,CAnonYachtClub,"If you love Gaming, NFT, Metaverse and making ...",0,https://twitter.com/CAnonYachtClub,False,False,False,True,...,0,0,0,5,3,27,0,0,0,0
1,1469439375637241858,AnonMedic001 2.0,HisRobbins,Just here from the comments and #AMCSQUEEZE\nK...,0,https://twitter.com/HisRobbins,False,False,False,True,...,4,1,0,2,3,11,0,0,0,2
2,1429176781777555457,Anonymous Hong Kong,Op_HongKong,Anonymous Hong Kong. Latest News/Hacks \nDemoc...,0,https://twitter.com/Op_HongKong,True,False,True,True,...,0,0,1,6,3,18,0,0,0,3
3,1401080776817471488,Anonymous,cryptobugfixer,#Bitcoin,0,https://twitter.com/cryptobugfixer,True,False,False,True,...,0,0,0,1,1,1,0,0,0,1
4,1468690875937169411,Anonymous,RealAnonyNews,News reporting twitter true daily news freedom...,1,https://twitter.com/RealAnonyNews,True,False,True,True,...,0,0,0,3,1,15,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
423,1389672130086936577,AnonJsf,YourAnonJ,Not to follow. Just share\n🇨🇴🏴‍☠️ United for C...,0,https://twitter.com/YourAnonJ,False,False,False,True,...,0,0,0,1,1,8,0,2,0,0
424,1185963459357413377,Anonymous,camarozulia,Anonymous,1,https://twitter.com/camarozulia,True,False,True,True,...,0,0,0,0,1,1,0,0,0,0
425,1390055524494823424,Anonymous,Anonymo48157551,"we are legion, we are Anonymous",1,https://twitter.com/Anonymo48157551,True,False,True,True,...,0,0,0,1,1,6,0,0,0,0
426,3130880601,Jhojan Andres,LEGIONESPECIAL,,0,https://twitter.com/LEGIONESPECIAL,False,False,False,False,...,0,0,0,0,2,0,0,0,0,0


We define X and y.

In [139]:
X = anonacc.loc[:,'anonymous_in_name':,]
y = anonacc['affiliation']

In [148]:
grid_params = {'n_estimators': range(100, 600, 100), 'criterion': ['gini','entropy'],
               'max_depth': range(10, 20), 'max_samples': [0.2, 0.4, 0.6, 0.8, 1.0]}
gs = GridSearchCV(ensemble.RandomForestClassifier(random_state = 0), grid_params,
                  verbose = 1, cv = 5)
gs_results = gs.fit(X, y)

Fitting 5 folds for each of 500 candidates, totalling 2500 fits


In [149]:
print(gs_results.best_score_)
print(gs_results.best_params_)

0.9112995896032832
{'criterion': 'entropy', 'max_depth': 10, 'max_samples': 1.0, 'n_estimators': 400}


In [150]:
randomforest = ensemble.RandomForestClassifier(random_state = 0, n_estimators = 400,
                                               criterion = 'entropy', max_depth = 10, 
                                               max_samples = 1)

In [151]:
randomforest.fit(X, y)
randomforest.score(X, y)

0.6214953271028038

In [117]:
cv_results_cv = cross_validate(randomforest, X, y, cv = 5)
cv_results_cv

{'fit_time': array([0.56436491, 0.56949282, 0.5301919 , 0.55721593, 0.64493203]),
 'score_time': array([0.04416013, 0.03484821, 0.03571415, 0.03786778, 0.04299808]),
 'test_score': array([0.88372093, 0.90697674, 0.90697674, 0.89411765, 0.96470588])}

In [114]:
grid_params = {'criterion': ['gini','entropy'], 'max_depth': range(2, 20)}
gs = GridSearchCV(tree.DecisionTreeClassifier(random_state = 0), grid_params,
                  verbose = 1, cv = 5)
gs_results = gs.fit(X, y)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


In [115]:
print(gs_results.best_score_)
print(gs_results.best_params_)

0.911326949384405
{'criterion': 'gini', 'max_depth': 2}


In [132]:
decisiontree = tree.DecisionTreeClassifier(random_state = 0, criterion = 'gini', max_depth = 2)

In [133]:
decisiontree.fit(X, y)
decisiontree.score(X, y)

0.9228971962616822

In [142]:
cv_results_cv = cross_validate(decisiontree, X, y, cv = 5)
cv_results_cv

{'fit_time': array([0.00928402, 0.00645614, 0.00399899, 0.0046761 , 0.004071  ]),
 'score_time': array([0.00293994, 0.00263596, 0.00171328, 0.00307894, 0.00173092]),
 'test_score': array([0.84883721, 0.93023256, 0.90697674, 0.90588235, 0.96470588])}

Looks like `DecisionTreeClassifier` wins over `RandomTreeClassifier`! We could continue applying this on more users to determine who is Anonymous-affiliated.

<a id="anon"></a>
**5. Collecting Anonymous-affiliated tweets that include emojis**  
We collect tweets from Anonymous-affiliated accounts that include emojis.

In [173]:
anonaff = anonacc[anonacc['affiliation'] == 1]['screen name'].tolist()

We exclude retweets and tweets without emojis.

In [178]:
acoll = {}

for a in anonaff:
    try:
        acoll[a] = []
        for tweet in tweepy.Cursor(api.search_tweets, q = 'from:' + a).items(): # last 7 days
            if tweet.text.startswith('RT @') == False and any(c in tweet.text for c in emoji.UNICODE_EMOJI['en']):
                acoll[a].append(tweet.text)
    except tweepy.errors.Forbidden as e:
        print(e)
    except tweepy.errors.NotFound as e:
        print(e)
    except tweepy.errors.Unauthorized as e:
        print(e)

Rate limit reached. Sleeping for: 840
Rate limit reached. Sleeping for: 860
Rate limit reached. Sleeping for: 836


In [443]:
atweets = [tweet for a in acoll.values() for tweet in a if len(a) > 0 ]

We should store this.

In [446]:
with open('atweets.json', 'w') as f:
    json.dump(atweets, f)

<a id="random"></a>
**6. Collecting randomly sampled tweets that include emojis**  
We set up a `Stream` to collect tweets containing 'the'.

In [415]:
class MyStream(tweepy.Stream):
    def on_status(self, status):
        if len(rtweets) == 137:
            return False
        if status.text.startswith('RT @') == False and any(e in status.text for e in emoji.UNICODE_EMOJI['en']):
            rtweets.append(status.text)

In [416]:
stream = MyStream(consumer_key = consumer_key, consumer_secret = consumer_secret,
                  access_token = access_token, access_token_secret = access_token_secret)

In [420]:
rtweets = []

try:
    stream.filter(track = ['the']) # using a common word to approximate randomness
except KeyboardInterrupt:
    stream.disconnect()

In [447]:
print(len(atweets))
print(len(rtweets))

137
137


We should store this.

In [None]:
with open('rtweets.json', 'w') as f:
    json.dump(rtweets, f)