Gather and process text from the same domains as the white supremacist dataset (forums, chat, long-form articles) that is not itself white supremacist.

# Reddit data (forum data)
Scraped by get_reddit.py

In [10]:
# Load data
# Load from ../tmp since it has direct year info and no newline issues, though could also load from data/reddit_comments (which is csv)
import os
import pandas as pd
from tqdm.notebook import tqdm

dirpath = '../tmp/'
fpaths = sorted([fname for fname in os.listdir(dirpath) if 'subreddit_comments' in fname])
dfs = []
for fname in tqdm(fpaths):
    # print(fname)
    fpath = os.path.join(dirpath, fname)
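    if fname.endswith('.csv'):
        sub = pd.read_csv(fpath, index_col=0, engine='python')
    elif fname.endswith('.pkl'):
        sub = pd.read_pickle(fpath)
    else:
        continue  # skip other file types rather than reusing the previous `sub`
    # File names look like <year>_<subreddit>_subreddit_comments.<ext>
    year, subreddit, _, _ = fname.split('_')
    dfs.append(sub.assign(year=year, subreddit=subreddit.lower()))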
    
reddit = pd.concat(dfs)
reddit['year'] = reddit.year.astype(int)
print(len(reddit))
reddit.columns

7622973


Index(['author', 'body', 'created_utc', 'id', 'parent_id', 'score', 'created',
       'year', 'subreddit', 'text', 'permalink'],
      dtype='object')
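
The loader relies on each file name splitting into exactly four underscore-separated parts. A minimal sketch of that parsing step, in plain Python, is below. The helper name `parse_dump_name` and the example file name are hypothetical; only the `<year>_<subreddit>_subreddit_comments.<ext>` pattern is taken from the loading code.

```python
from pathlib import Path

def parse_dump_name(fname):
    """Split '<year>_<subreddit>_subreddit_comments.<ext>' into (year, subreddit)."""
    stem = Path(fname).stem                      # drop the extension
    year, subreddit, *rest = stem.split('_')
    # Fail loudly on names that do not follow the expected pattern
    if rest != ['subreddit', 'comments'] or not year.isdigit():
        raise ValueError(f'unexpected file name: {fname}')
    return int(year), subreddit.lower()

# Hypothetical file name, for illustration only
print(parse_dump_name('2017_AskAnAmerican_subreddit_comments.pkl'))
```

Raising on unexpected names is a design choice of this sketch; the notebook cell would instead fail at the tuple unpacking.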

In [8]:
pd.set_option('display.max_colwidth', None)
reddit[['body', 'text']].sample(10)

,body,text
158874,"That PDF was so much more interesting than the typical ""we all benefit from foreign workers on the farms"" I keep seeing. Thank you.","That PDF was so much more interesting than the typical ""we all benefit from foreign workers on the farms"" I keep seeing. Thank you."
397196,&gt; The reason they do it is because American tech companies (as they have openly stated in the past) consider themselves more like private governments than conventional corporations.\nMore complete bullshit from /u/GatineauKing \n\nThis has been the cultural norm in US stores since well before there were major US tech companies. \n\n,
70136,What things about work culture in Europe did you dislike?,
2811,"As someone from outside the USA I feel the need to write, ""FFS, everyone on the planet knows and regards the USA as an evil corporate empire with less real democracy than Stalin's holiday camp.","As someone from outside the USA I feel the need to write, ""FFS, everyone on the planet knows and regards the USA as an evil corporate empire with less real democracy than Stalin's holiday camp."
589140,There Can Be Only One!,
313232,Porque no los tres?,Porque no los tres?
1291071,so what's does being redacted mean exactly?,so what's does being redacted mean exactly?
198885,*whoosh*\n\nI was just continuing the joke. :),
933641,"I like her, personally, but the whole Franken deal soured me on her a bit. If she was part of the push to get him to prematurely resign, I'm not a fan of that.","I like her, personally, but the whole Franken deal soured me on her a bit. If she was part of the push to get him to prematurely resign, I'm not a fan of that."
265661,No. The first amendment is pretty cool.,


In [4]:
# Compare with white supremacist dataset
# Load white supremacist dataset to count posts over time
import datetime

path = '../tmp/white_supremacist_corpus.pkl'
ws_data = pd.read_pickle(path)

# Select forum data, Group by year
yearly = ws_data.query('domain=="forum"').groupby(by=ws_data.timestamp.dt.year)['text'].count()
lookup = pd.DataFrame(yearly)
lookup['begin'] = pd.to_datetime(yearly.index.astype(int).astype(str), format='%Y')
lookup['end'] = lookup.begin.shift(-1, fill_value = datetime.datetime(2020,1,1))
lookup.index.name = 'year'
lookup.index = lookup.index.astype(int)
lookup.rename(columns={'text': 'post_count'}, inplace=True)
lookup

year,post_count,begin,end
2001,1439,2001-01-01,2002-01-01
2002,13302,2002-01-01,2003-01-01
2003,27470,2003-01-01,2004-01-01
2004,50783,2004-01-01,2005-01-01
2005,54377,2005-01-01,2006-01-01
2006,80554,2006-01-01,2007-01-01
2007,63544,2007-01-01,2008-01-01
2008,74203,2008-01-01,2009-01-01
2009,94895,2009-01-01,2010-01-01
2010,62649,2010-01-01,2011-01-01
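
The lookup table pairs each year with a post count and a `[begin, end)` window. A dependency-free sketch of the same idea follows; `yearly_lookup` and the toy timestamps are made up for illustration. Unlike the `shift(-1)` approach in the cell above, this sketch sets `end` to 1 January of the following year, which gives the same result only when the years are consecutive.

```python
from collections import Counter
from datetime import datetime

def yearly_lookup(timestamps):
    """Map each year to its post count and a [begin, end) window."""
    counts = Counter(ts.year for ts in timestamps)
    return {
        year: {
            'post_count': n,
            'begin': datetime(year, 1, 1),
            'end': datetime(year + 1, 1, 1),   # exclusive upper bound
        }
        for year, n in sorted(counts.items())
    }

# Toy timestamps, not taken from the corpus
posts = [datetime(2001, 3, 5), datetime(2001, 11, 20), datetime(2002, 1, 9)]
lookup = yearly_lookup(posts)
for year, row in lookup.items():
    print(year, row['post_count'], row['begin'].date(), row['end'].date())
```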


In [14]:
# Sample comments from different subreddits
pd.concat([reddit.subreddit.value_counts(), reddit.subreddit.value_counts(normalize=True)], axis=1)

subreddit,count,proportion
politics,4111029,0.539295
europe,1944666,0.255106
askanamerican,1526955,0.20031
usa,40323,0.00529


In [16]:
reddit.groupby('year')['body'].count()

year
2011      83192
2012     128309
2013     105839
2014      94515
2015     144790
2016     215733
2017    2887805
2018    2654264
2019    1308424
Name: body, dtype: int64

In [18]:
reddit['year'].dtype

dtype('int64')

In [21]:
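# For each year, sample as many Reddit comments as there are white supremacist forum posts that year.
# Within a year the subreddits are sampled in proportion to their comment counts, but since the yearly totals
# are fixed by the white supremacist data, this isn't a true proportional sample by popularity.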

reddit_sampled = reddit.groupby('year').apply(lambda group: group.sample(lookup.post_count[group.name])).reset_index(drop = True)
reddit_sampled

,author,body,created_utc,id,parent_id,score,created,year,subreddit,text,permalink
0,[deleted],Oh well *that's* the straw that broke the camel's back. I was totally going to vote for Romney before I found this out.\n\n/not a real issue,1325192381,c3c7eei,t3_nv2lg,6,1.325207e+09,2011,politics,Oh well *that's* the straw that broke the camel's back. I was totally going to vote for Romney before I found this out.\n\n/not a real issue,
1,matude,"Actually, good point... there's no reason my naive vision of a nice modern future will become true. In fact, more likely the new houses will be even worse because of the government pressure to get them up asap. :/",1323793402,c37ol0o,t1_c37m1fi,1,1.323808e+09,2011,europe,,
2,ian80,"Thank you, thank you, thank you. \n\nRon Paul represents an evolving social/political consciousness. I don't think the country as a whole (or should I say as a hole) is willing to make that leap just yet, but Ron is representing the sentiment of a large demographic who are sick of the corruption currently being lived through. The corruption, of course, isn't solely caused by the ones on top. The ones truly to blame are the populace as a whole who won't stand up to unjust behavior. Hence the importance of Ron: he's the only one expressing a sentiment that is needed for the country to evolve into something new and better.\n\nHe's no savior. He's just the one with the balls to say what needs to be said. Thank God. He may not become president, but he's changing people's minds for the better.",1325260390,c3cfxwj,t1_c3cfrww,3,1.325275e+09,2011,politics,"Thank you, thank you, thank you. \n\nRon Paul represents an evolving social/political consciousness. I don't think the country as a whole (or should I say as a hole) is willing to make that leap just yet, but Ron is representing the sentiment of a large demographic who are sick of the corruption currently being lived through. The corruption, of course, isn't solely caused by the ones on top. The ones truly to blame are the populace as a whole who won't stand up to unjust behavior. Hence the importance of Ron: he's the only one expressing a sentiment that is needed for the country to evolve into something new and better.\n\nHe's no savior. He's just the one with the balls to say what needs to be said. Thank God. He may not become president, but he's changing people's minds for the better.",
3,arthur_sc_king,"Whatever he's smoking (other than [choad](http://catb.org/jargon/html/C/choad.html)), I want some. It'd be interesting to try having such a complete and utter break from reality.",1325183025,c3c6d4g,t3_nv59e,1,1.325197e+09,2011,politics,"Whatever he's smoking (other than [choad](http://catb.org/jargon/html/C/choad.html)), I want some. It'd be interesting to try having such a complete and utter break from reality.",
4,italianjob17,"I'll upvote you because what you said is quite true in Italy... the amount of shit we became able to be fed (even ejoying it!) is impressive, there's even people that said that being a bastard womanizer is not a big fuss, but a medal of honour indeed. God it looks like I'm living in 4chan here sometimes.",1305806592,c1uvvep,t1_c1uvrfj,2,1.305821e+09,2011,europe,,
...,...,...,...,...,...,...,...,...,...,...,...
4110898,WoollyMuffler,"I'd be happy if it was night all the time here, because then you wouldn't be able to see how grey it is.",1575032333,f92iamj,t3_e3cmw7,1,1.575047e+09,2019,europe,,/r/europe/comments/e3cmw7/good_black_friday_to_everyone/f92iamj/
4110899,NorwegianSteam,"I would go salted, but way more important than that is that you bring real maple syrup. If it has more than 1 ingredient it's garbage.",1575391113,f9k23b0,t3_e5ih1x,1,1.575406e+09,2019,askanamerican,,/r/AskAnAmerican/comments/e5ih1x/salted_or_unsalted_butter_on_pancake/f9k23b0/
4110900,ChapoMigrant,"America needs to *secure Mexican territory*, you know, for a *safe zone* against cartel violence.",1573418502,f75s1f6,t1_f74yb13,0,1.573433e+09,2019,europe,,/r/europe/comments/du9lav/rest_in_peace_the_founder_of_the_republic_of/f75s1f6/
4110901,Anargramy,Cant wait. No more 12 hr waits at a+e.,1576681253,fbalkbg,t1_fbahimz,1,1.576696e+09,2019,usa,,/r/usa/comments/eccmxn/hm_yes_healthcare/fbalkbg/
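
The per-year sampling can be sketched without pandas as follows. This is an illustration under assumptions, not the notebook's code: `sample_per_year` and the toy rows are hypothetical, and the sketch adds two things the cell above does not have, a fixed seed for reproducibility and a cap so that a year with fewer comments than the target does not raise an error.

```python
import random
from collections import Counter

def sample_per_year(rows, targets, seed=0):
    """Draw targets[year] rows per year, capped at the rows available."""
    rng = random.Random(seed)                    # fixed seed: repeatable sample
    by_year = {}
    for row in rows:
        by_year.setdefault(row['year'], []).append(row)
    sampled = []
    for year, group in sorted(by_year.items()):
        n = min(targets.get(year, 0), len(group))   # cap at what is available
        sampled.extend(rng.sample(group, n))
    return sampled

# Toy data: five comments in 2011, three in 2012
rows = ([{'year': 2011, 'body': f'comment {i}'} for i in range(5)]
        + [{'year': 2012, 'body': f'comment {i}'} for i in range(3)])
targets = {2011: 2, 2012: 5}                     # 2012 asks for more than exist
sampled = sample_per_year(rows, targets)
counts = Counter(row['year'] for row in sampled)
print(dict(counts))
```

With the cap, the 2012 request for five comments yields the three that exist; the pandas `group.sample(n)` call would raise a `ValueError` in that situation unless `replace=True` is passed.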


In [23]:
reddit_sampled.year.value_counts().sort_index()

2011      56932
2012      87025
2013      71877
2014      63350
2015      71974
2016     110085
2017    1625386
2018    1367449
2019     656825
Name: year, dtype: int64

In [25]:
lookup.loc[2011:].post_count

year
2011      56932
2012      87025
2013      71877
2014      63350
2015      71974
2016     110085
2017    1625386
2018    1367449
2019     656825
Name: post_count, dtype: int64

In [33]:
pd.concat([reddit_sampled.subreddit.value_counts(), reddit_sampled.subreddit.value_counts(normalize=True)], axis=1)

subreddit,count,proportion
politics,2230282,0.542528
europe,1054433,0.256497
askanamerican,804732,0.195756
usa,21456,0.005219
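
As a quick check that sampling by year left the subreddit mix roughly intact, the shares before and after sampling can be compared directly. The counts below are copied from the two `value_counts` outputs above; the helper `shares` is only for illustration.

```python
# Comment counts per subreddit, copied from the outputs above
full = {'politics': 4111029, 'europe': 1944666, 'askanamerican': 1526955, 'usa': 40323}
sampled = {'politics': 2230282, 'europe': 1054433, 'askanamerican': 804732, 'usa': 21456}

def shares(counts):
    """Convert raw counts to proportions of the total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

full_shares, sampled_shares = shares(full), shares(sampled)
for name in full:
    diff = sampled_shares[name] - full_shares[name]
    print(f'{name:14s} {full_shares[name]:.4f} {sampled_shares[name]:.4f} {diff:+.4f}')
```

Each subreddit's share moves by less than half a percentage point.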


In [27]:
# Tokenize and lowercase the sampled Reddit comments
import nltk
from tqdm.notebook import tqdm
from multiprocessing import Pool

def preprocess(inp):
    return ' '.join(nltk.word_tokenize(str(inp))).lower()

with Pool(20) as p:
    reddit_sampled['processed'] = list(tqdm(p.imap(preprocess, reddit_sampled.body), total=len(reddit_sampled)))
reddit_sampled['word_count'] = reddit_sampled.processed.str.split().str.len()

In [29]:
# reddit_gped = data.groupby('year').word_count.sum()
reddit_wc = reddit_sampled.groupby('year').agg({'word_count': ['sum', 'mean']})
reddit_wc

# Compare number of words between white supremacist dataset forum data and Reddit data
# .copy() so that adding a column below does not trigger SettingWithCopyWarning
selected_ws = ws_data.loc[(ws_data.timestamp.dt.year.isin(reddit_sampled.year.unique())) & (ws_data.domain == 'forum')].copy()
selected_ws['word_count'] = selected_ws.text.str.split().str.len()

print(selected_ws.word_count.mean())
print(reddit_sampled.word_count.mean())

ws_wc = selected_ws.groupby(ws_data.timestamp.dt.year).agg({'word_count': ['sum', 'mean']})
ws_wc

In [31]:
comparison = pd.concat([ws_wc, reddit_wc], axis=1, keys=['white_supremacist', 'reddit'])
# comparison = pd.concat([ws_wc, reddit_wc], axis=1)
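# Despite the column name, this is a ratio (Reddit words / white supremacist words), not a percentage; values above 1 mean more Reddit words
comparison['reddit_percentage'] = comparison[('reddit', 'word_count', 'sum')]/comparison[('white_supremacist', 'word_count', 'sum')]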
comparison

year,white_supremacist word_count sum,white_supremacist word_count mean,reddit word_count sum,reddit word_count mean,reddit_percentage
2011.0,6446972,113.239865,3363967,59.087455,0.52179
2012.0,8424657,96.80732,4911861,56.441953,0.583034
2013.0,6869854,95.577918,4250035,59.129276,0.61865
2014.0,6054223,95.567845,3428457,54.11929,0.566292
2015.0,5538318,76.94887,4027860,55.962709,0.727271
2016.0,6191536,56.24323,5433458,49.356933,0.877562
2017.0,47959049,29.506252,69659453,42.857175,1.452478
2018.0,36979076,27.04238,58315819,42.645699,1.576995
2019.0,17675839,26.911033,26411751,40.211245,1.494229
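
The `reddit_percentage` column is the Reddit word total divided by the white supremacist word total for the same year. The arithmetic for three of the rows above can be reproduced as a sanity check; the totals are copied from the table and nothing else is assumed.

```python
# (white supremacist words, Reddit words) per year, copied from the table above
word_totals = {
    2011: (6446972, 3363967),
    2016: (6191536, 5433458),
    2017: (47959049, 69659453),
}

ratios = {year: reddit / ws for year, (ws, reddit) in word_totals.items()}
for year, ratio in ratios.items():
    side = 'more' if ratio > 1 else 'fewer'
    print(f'{year}: ratio {ratio:.6f} ({side} Reddit words than white supremacist words)')
```

Post counts are matched per year, so the ratio moves away from 1 only because average post length differs between the two sources.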


In [25]:
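# Sample comments vs submissions
# Note: `data` is not defined in the cells shown here; it needs to be a DataFrame with `post_type` and `text` columns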
from IPython.display import display
pd.set_option('display.max_colwidth', None)

for post_type in ['comment', 'submission']:
    print(post_type)
    display(data.query('post_type == @post_type')[['text']].sample(20))

comment


,text
55771,The same can be said for fascist regimes.
13284,Ignorance.
20569,"To be honest, I'm not particularly interesting in debating the interpretation of the passage. All I'm trying to tell you is that you need to provide something more than some lawyer's interpretation of the passage and your opinion in the matter in order to claim that other people are misinterpreting the passage since they are not evidences or facts. Or else you are just making a baseless claim and it would not be surprising that people don't agree with you."
48625,Wrong about what? (I covered a lot of ground on the last post)
8311,"4 easier states, AK, AZ, Montana I believe, and Vermont, don't require a ccw. But I believe all still have one so you can carry to another state that recognizes said ccw."
11618,"Oh wow, this account has to be the most impressive Markov text generator I've ever seen. Incredible."
40066,"This is a really nice post. See, its helpful, kind, and offers alternatives. Its an example on how not to be a dick, unlike your first post."
2905,"I just have to say that I went to a very bike-friendly university and I worked at a bike shop for a few months (not claiming to be an expert) and I've never been in as bike-friendly a place as Manhattan. \n\nWalking was amazing too, much safer and friendlier than crossing the six-lane roads near where I grew up."
46724,none of which have ever won an election
55711,"To be fair the GTA series of games (which I played and liked) do tend to have a pretty jacked up and sociopathic narrative. While they don't focus on terrorism and there has always a veneer of good vs evil and redemption, the redemptions comes via morally ambiguous mass killings. I'm not arguing that GTA is breeding serial killers or anything like it, just that the subject matter of these games is as perverse and anti social as the subject matter on Fox News."


submission


,text
42918,Gen. David Petraeus arrived at a private lawn party in 2010 with a 28-car motorcade.
1531,Conservative Columnist Charles Krauthammer: Obama Caused GOP Civil War
2574,Letter Allegedly Penned by Chinese Labor Camp Prisoner Found Inside Box of Halloween Decorations from Kmart
18216,"George Will: ""Quite Literally, Opposition To Gay Marriage Is Dying"""
6388,"NRA advocates for armed guards in every school. In fact, Columbine High had an armed guard. He was out monitoring the Smoker's Corner while the shooters did their work inside."
14886,Petition to make illegal for news to lie to their audience in the US
49111,"Romney ""shellshocked"" by loss"
39349,"War: Israel Launches Gaza Invasion, Assassinates Hamas Military Wing Chief"
24511,"Gay Marriage Should be Addressed by SCOTUS, and Championed by Republicans"
42579,UPS ends charitable giving to boy scouts of America over Anti-gay discrimination


# Discord data (chat) 

In [29]:
# Compare with white supremacist dataset
# Load white supremacist dataset to count posts over time
import datetime

path = '../tmp/white_supremacist_corpus.pkl'
ws_data = pd.read_pickle(path)

# Select chat data, group by year
chat_yearly = ws_data.query('domain=="chat"').groupby(by=ws_data.timestamp.dt.year)['text'].count()
chat_lookup = pd.DataFrame(chat_yearly)
chat_lookup['begin'] = pd.to_datetime(chat_yearly.index.astype(int).astype(str), format='%Y')
chat_lookup['end'] = chat_lookup.begin.shift(-1, fill_value = datetime.datetime(2019,1,1))
chat_lookup.index.name = 'year'
chat_lookup.index = chat_lookup.index.astype(int)
chat_lookup.rename(columns={'text': 'post_count'}, inplace=True)
print(chat_lookup.post_count.sum())
chat_lookup

46130


year,post_count,begin,end
2017,33790,2017-01-01,2018-01-01
2018,12340,2018-01-01,2019-01-01


In [14]:
# Load random Discord data
import os
import json
import pandas as pd
from tqdm.notebook import tqdm
import re
from nltk.tokenize import TweetTokenizer
from multiprocessing import Pool

base_dirpath = '/storage2/mamille3/data/discord'
dirpath = os.path.join(base_dirpath, 'v3', 'content/drive/Shareddrives/Datasets/cleaned-v4/discord-v3')
# with open(os.path.join(dirpath, 'stats.json')) as f:
#     stats = json.load(f)
# stats
tokenizer = TweetTokenizer(strip_handles=True)

def process_dump(fpath):
    """ Read one chat dump (given as a full path) and return its processed messages """
    if not fpath.endswith('.txt'):
        return
    with open(fpath) as f:
        # Each line holds tab-separated messages
        messages = [process_chat(message) for line in f.read().splitlines() for message in line.split('\t')]
    return messages

def process_chat(text):
    # Drop the leading 'speaker: ' prefix; split only once so any ': ' inside the message is kept
    if ': ' in text:
        res = text.split(': ', 1)[1]
    else:
        res = text
    # # Remove usernames
    # res = re.sub(r'@.*\b', '', res)
    # Tokenize, lowercase
    return ' '.join(tokenizer.tokenize(res)).lower()

fpaths = [os.path.join(dirpath, fname) for fname in os.listdir(dirpath)]
with Pool(20) as p:
    chats = list(tqdm(p.imap(process_dump, fpaths), total=len(fpaths)))
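
The speaker-stripping step can be sketched in isolation as below. It assumes each message looks like `speaker: text`, which is what the split on `': '` implies; the example messages are made up, and plain whitespace splitting stands in for `TweetTokenizer` so that the sketch has no dependencies.

```python
def strip_speaker(message):
    """Drop a leading 'name: ' prefix, keeping any later ': ' in the message."""
    _, sep, rest = message.partition(': ')       # splits at the first ': ' only
    return rest if sep else message

def normalize(message):
    """Whitespace-tokenize and lowercase (stand-in for the TweetTokenizer step)."""
    return ' '.join(strip_speaker(message).split()).lower()

# Made-up messages, for illustration only
print(normalize('Alice: Note: bring SNACKS'))
print(normalize('no prefix here'))

# Indexing a full split keeps only the piece between the first and second ': '
print('Alice: Note: bring SNACKS'.split(': ')[1])
```

The last line shows why splitting once matters: taking element `[1]` of an unrestricted split drops everything after a second `': '`.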

In [26]:
# Create a df
df = pd.DataFrame({'text': [chat for chat_list in [el for el in chats if el is not None] for chat in chat_list]})
df

,text
0,hugs luci
1,snugglehugs
2,aite im gon na shower\npce
3,i feel big depresso but idk what i 'm sad abou...
4,yeah
...,...
132481515,stupid roaches i 'll be back in a few days
132481516,.
132481517,i ate his willy
132481518,lmfao


In [31]:
# Compare with white supremacist data
selected_ws = ws_data.query('domain=="chat"').copy()
selected_ws['word_count'] = selected_ws.text.str.split().str.len()
print(len(selected_ws))
print(selected_ws.word_count.sum())

46130
551967


In [32]:
selected_ws.word_count.mean()

11.96546715803165

In [38]:
sample = df.sample(int(selected_ws.word_count.sum()/4.2)) # 4.2 is the average words/post, so trying to match the number of words
print(len(sample))
sample['word_count'] = sample.text.str.split().str.len()
print(sample.word_count.sum())
print(sample.word_count.mean())

131420
542813
4.130368284888145
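
The sample size above comes from dividing the white supremacist chat word total by the average message length of the Discord data. The arithmetic can be reproduced with the numbers printed in the cells above; `matched_sample_size` is a hypothetical helper name.

```python
def matched_sample_size(target_words, mean_words_per_message):
    """Number of messages expected to contain about target_words words."""
    return int(target_words / mean_words_per_message)

ws_chat_words = 551967          # total words in the white supremacist chat data
avg_discord_words = 4.2         # average words per Discord message, as used above

n_messages = matched_sample_size(ws_chat_words, avg_discord_words)
print(n_messages)

# How close the drawn sample came to the target word count
sample_words = 542813           # word total of the sample drawn above
coverage = sample_words / ws_chat_words
print(f'{coverage:.1%}')
```

The drawn sample covers about 98% of the target word count. It falls slightly short because the sampled messages averaged about 4.13 words rather than 4.2.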


# Long-form articles

In [37]:
# Compare with white supremacist dataset
# Load white supremacist dataset to count posts over time

path = '../tmp/white_supremacist_corpus.pkl'
ws_data = pd.read_pickle(path)

In [38]:
# Select long-form data, group by year
articles_yearly = ws_data.query('domain=="long-form"').groupby(by=ws_data.timestamp.dt.year)['text'].count()
articles_lookup = pd.DataFrame(articles_yearly)
articles_lookup['begin'] = pd.to_datetime(articles_yearly.index.astype(int).astype(str), format='%Y')
articles_lookup['end'] = [x.replace(year=x.year + 1) for x in articles_lookup['begin']]
articles_lookup.index.name = 'year'
articles_lookup.index = articles_lookup.index.astype(int)
articles_lookup.rename(columns={'text': 'post_count'}, inplace=True)
print(articles_lookup.post_count.sum())
articles_lookup

47939


year,post_count,begin,end
1968,47,1968-01-01,1969-01-01
1973,1231,1973-01-01,1974-01-01
1978,1585,1978-01-01,1979-01-01
1988,33,1988-01-01,1989-01-01
2005,1,2005-01-01,2006-01-01
2010,2,2010-01-01,2011-01-01
2011,18886,2011-01-01,2012-01-01
2012,175,2012-01-01,2013-01-01
2013,2783,2013-01-01,2014-01-01
2014,7564,2014-01-01,2015-01-01


In [41]:
ws_data.query('domain=="long-form"').groupby(by=ws_data.timestamp.dt.year).source.unique()

timestamp
1968.0                                      [enoch_powell_rivers_of_blood_speech]
1973.0                                          [raspail_camp_of_the_saints_book]
1978.0                                           [pierce_the_turner_diaries_book]
1988.0                                            [lane_white_genocide_manifesto]
2005.0                                                            [daily_stormer]
2010.0                                      [daily_stormer, american_renaissance]
2011.0                                         [daily_stormer, breivik_manifesto]
2012.0    [daily_stormer, american_renaissance, camus_the_great_replacement_book]
2013.0                                      [daily_stormer, american_renaissance]
2014.0                                      [daily_stormer, american_renaissance]
2015.0                                      [daily_stormer, american_renaissance]
2016.0                                      [daily_stormer, american_renaissance]
2017.0

In [None]:
# Load NOW corpus
import os
from tqdm.notebook import tqdm
import re
import pandas as pd
from datetime import datetime

now_dirpath = '/storage2/mamille3/data/news/now2010-2021/text'

countries = [ # since most articles are from the Daily Stormer and American Renaissance, US-based websites
    'us'
]
# Keep files whose name contains one of the country codes (case-insensitive)
fnames = sorted([fname for fname in os.listdir(now_dirpath) if any(country in fname.lower() for country in countries)])

dfs = []
for fname in tqdm(fnames):
    fpath = os.path.join(now_dirpath, fname)
    # File names carry a yy-mm or yy_mm date, e.g. '20_06-us1.txt'
    m = re.search(r'\d\d[-_]\d\d', fname)
    date_str = m.group().replace('_', '-')  # normalize the separator so strptime accepts both forms
    date = datetime.strptime(date_str, '%y-%m')
    year = date.year
    with open(fpath) as f:
        articles = f.read().splitlines()
    # process_now is defined in a later cell; run that cell first
    dfs.append(pd.DataFrame({'text': [process_now(line) for line in articles], 'year': year}))
len(dfs)

In [60]:
fname

'20_06-us1.txt'
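
The file name above uses an underscore between year and month, which `strptime` with the `'%y-%m'` format does not accept. A sketch of a parser that tolerates both separators is below; `year_from_now_fname` is a hypothetical helper, and the hyphenated file name in the example is made up.

```python
import re
from datetime import datetime

def year_from_now_fname(fname):
    """Extract the four-digit year from a 'yy-mm' or 'yy_mm' file name prefix."""
    m = re.search(r'(\d\d)[-_](\d\d)', fname)
    if m is None:
        raise ValueError(f'no yy-mm date found in {fname!r}')
    # Rebuild the date with a hyphen so a single strptime format covers both forms
    return datetime.strptime(f'{m.group(1)}-{m.group(2)}', '%y-%m').year

print(year_from_now_fname('20_06-us1.txt'))   # name shown in the output above
print(year_from_now_fname('10-01-us.txt'))    # made-up hyphenated name
```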

In [None]:
articles = pd.concat(dfs)
print(len(articles))
articles.head()

In [55]:
def process_now(inp):
    """ Preprocess NOW articles """
    text = re.sub(r'@@\d+ ', '', inp)
    text = re.sub(r'<\w+>', '', text)
    text = text.replace('@ @ @ @ @ @ @ @ @ @ ', '')
    return text

processed = [process_now(line) for line in articles]
processed[:2]

[' Believe it or not : Avatar takes 1 petabyte of storage space , equivalent to a 32 YEAR long MP3 .  Avatar , the movie Oscar winning Director James Cameron , is now the world \'s most expensive movie ever made .  The movie has received enormous praise , particularly from the movie and tech blogosphere who have been blown away by the films 3D and CGI effects . Also , the movie currently sits at 26 on IMDB \'s list of the most highly rated movies of all time , no mean feat .  One thing that particularly fascinated me however is Cameron \'s use of a new 3D shooting technology called 3D Fusion . The technique requires an enormous amount of processing power , and needless to say , also a lot of storage space , but how much ?  The entire rendering of Avatar reportedly requires over 1 Petabyte of storage space according to BBC \'s Clickbits , which is the equivalent of 500 harddrives of 2TB each . That \'s equal to a 32 year long MP3 file ( in Miramar , New Zealand , ( population 8,334 ) in
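
To see what each cleaning step removes, the same three substitutions can be run on a short synthetic line. The input below is made up to resemble a NOW corpus line (an `@@<id>` prefix, `<p>`/`<h>` tags, and a run of ten masked `@` tokens); the function body restates `process_now` from the cell above.

```python
import re

def process_now(inp):
    """Strip the article id, markup tags and masked-token runs from a NOW line."""
    text = re.sub(r'@@\d+ ', '', inp)            # leading '@@<id> ' marker
    text = re.sub(r'<\w+>', '', text)            # tags such as <p> and <h>
    text = text.replace('@ @ @ @ @ @ @ @ @ @ ', '')   # run of ten masked tokens
    return text

mask = '@ ' * 10                                 # ten '@' tokens, each followed by a space
line = '@@123 <p> Hello <h> world ' + mask + 'end'
cleaned = process_now(line)
print(repr(cleaned))
print(cleaned.split())
```

Removing the tags leaves their surrounding spaces behind, so the cleaned text has leading and doubled spaces. That matches the spacing visible in the output above and makes no difference once the text is split on whitespace.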

In [None]:
# Just take 2011 on