# **NLP Project:  Classification of ‘Triggering’ Content on Social Media**

CSCI 4152 - Natural Language Processing

Keelin Sekerka-Bajbus

**Data Scraping from Reddit using Pushshift (Reddit) API**

In [9]:
!pip install psaw



In [10]:
import csv
import time
import pandas as pd
import requests as req
import os
import sys
import numpy as np
import datetime
import json
from psaw import PushshiftAPI

#import nltk

We use the Pushshift API (through the PSAW API wrapper) to extract Reddit posts pertaining to the following nine trigger warning labels, which make up the classes for this multi-class classification problem.

**Classes:**


1.   Anxiety
2.   Depression
3.   PTSD
4.   Eating Disorder
5.   Dysmorphia
6.   Domestic Violence
7.   Death
8.   Suicide
9.   Abuse

Data will be collected from the following subreddits to facilitate class labelling:  

1.   Anxiety --> r/Anxiety
2.   Depression --> r/Depression
3.   PTSD --> r/PTSD
4.   Eating Disorder --> r/EDAnonymous
5.   Dysmorphia --> r/BodyDysmorphia
6.   Domestic Violence --> r/DomesticViolence
7.   Death --> r/death (NSFW)
8.   Suicide --> r/SuicideWatch
9.   Abuse --> r/AbusiveRelationships

Connecting to Pushshift API

In [11]:
print("Set up API connection")
api = PushshiftAPI()

Set up API connection


In [12]:
subreddits = ['suicidewatch',
              'depression',
              'ptsd',
              'anxiety',
              'EDAnonymous',
              'BodyDysmorphia',
              'DomesticViolence',
              'Death',
              'AbusiveRelationships']

In [13]:
filtered = ['title','selftext', 'subreddit']

In [14]:
# Fetch up to `limit_size` of the most recent submissions from one subreddit,
# keeping only the fields named in `filters`.
def data_scraper(subreddit, limit_size, filters):
  posts = list(api.search_submissions(subreddit=subreddit, limit=limit_size, filter=filters))
  # Each PSAW result exposes its fields as a dict through the `d_` attribute.
  df = pd.DataFrame([post.d_ for post in posts])

  return df

In [15]:
full_df = pd.DataFrame()

In [None]:
# Fetch up to 25,000 posts from r/sexualassault (not among the classes listed above)
sexual_assault = data_scraper('sexualassault', 25000, filtered)
sexual_assault.head()
sexual_assault.shape



(10382, 5)

In [16]:
for sub in subreddits:
  sub_df = data_scraper(sub,100000,filtered)
  full_df = pd.concat([full_df,sub_df])
  print(sub)
  print(sub_df.shape)



suicidewatch
(100000, 5)
depression
(100000, 5)
ptsd
(38825, 5)
anxiety
(100000, 5)


KeyboardInterrupt: ignored

In [19]:
subreddits = [
              'EDAnonymous',
              'BodyDysmorphia',
              'DomesticViolence',
              'Death',
              'AbusiveRelationships']

In [20]:
for sub in subreddits:
  sub_df = data_scraper(sub,100000,filtered)
  full_df = pd.concat([full_df,sub_df])
  print(sub)
  print(sub_df.shape)



EDAnonymous
(100000, 5)
BodyDysmorphia
(21491, 5)
DomesticViolence
(13022, 5)
Death
(11342, 5)
AbusiveRelationships
(24334, 5)
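
The loops above call `pd.concat` inside the loop, so every subreddit's rows keep their own 0-based index labels and the combined frame ends up with repeated labels. A minimal sketch of one alternative follows: collect the frames in a list and concatenate once with `ignore_index=True`. The `fake_scraper` helper is hypothetical and only stands in for `data_scraper` so the example runs without network access.

```python
import pandas as pd

def fake_scraper(subreddit, n_rows):
    """Hypothetical stand-in for data_scraper that needs no network access."""
    return pd.DataFrame({
        "title": [f"{subreddit} post {i}" for i in range(n_rows)],
        "selftext": ["text"] * n_rows,
        "subreddit": [subreddit] * n_rows,
    })

# Collect one frame per subreddit instead of growing a DataFrame in the loop.
frames = []
for sub, n_rows in [("ptsd", 3), ("death", 2)]:
    frames.append(fake_scraper(sub, n_rows))

# A single concat at the end; ignore_index=True relabels rows 0..N-1,
# so every row in the combined frame has a unique index label.
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
print(combined.index.is_unique)
```

With unique labels, later label-based operations such as `drop` affect only the rows they were computed from.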


In [21]:
full_df.head()

Unnamed: 0,created_utc,selftext,subreddit,title,created
0,1638296055,I have been considerning death as my only way ...,SuicideWatch,Suicide over grades,1638296000.0
1,1638295977,[removed],SuicideWatch,Euthanasia under 18 should be legal and withou...,1638296000.0
2,1638295971,"I know this is typical, cliched, blah blah, bu...",SuicideWatch,I (21F) feel like my life is over,1638296000.0
3,1638295931,just remembered racist jokes i used to make t...,SuicideWatch,i hate myself,1638296000.0
4,1638295927,[removed],SuicideWatch,Has anybody else wondered how many of the post...,1638296000.0


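Keep only the title, selftext and subreddit columns. The created_utc and created columns are dropped, and the author and id fields were never requested, so the data stays anonymized. Deleted and removed posts are filtered out in the dataset processing steps further down.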

In [22]:
full_df = full_df.drop(columns=['created_utc','created'])
full_df = full_df[['title','selftext', 'subreddit']]
full_df.head()

Unnamed: 0,title,selftext,subreddit
0,Suicide over grades,I have been considerning death as my only way ...,SuicideWatch
1,Euthanasia under 18 should be legal and withou...,[removed],SuicideWatch
2,I (21F) feel like my life is over,"I know this is typical, cliched, blah blah, bu...",SuicideWatch
3,i hate myself,just remembered racist jokes i used to make t...,SuicideWatch
4,Has anybody else wondered how many of the post...,[removed],SuicideWatch


In [23]:
full_df.shape

(509014, 3)

**Creation of uncleaned dataset csv to save raw data**

In [None]:
full_df.to_csv('uncleaned_dataset_triggers_bigger.csv')
from google.colab import files
files.download("uncleaned_dataset_triggers_bigger.csv")

**Dataset processing**

In [26]:
# Work on a copy so the in-place cleaning below does not also modify full_df
df = full_df.copy()

Drop null values, then drop posts that have been removed or deleted.

In [27]:
df.dropna(inplace=True)
df.shape

(508070, 3)

In [28]:
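# Collect the index labels of posts whose selftext is the placeholder '[removed]'.
# Caution: full_df was concatenated without ignore_index=True, so index labels
# repeat across subreddits. Dropping by label therefore also removes rows from
# other subreddits that happen to share a label with a removed post.
indx = df.index[df.selftext == '[removed]'].tolist()
print(indx)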

[1, 4, 13, 18, 35, 39, 41, 42, 43, 51, 60, 65, 75, 92, 104, 108, 124, 135, 140, 142, ...]

In [29]:
df.drop(indx, inplace=True)
df.shape

(330178, 3)

In [30]:
indx = df.index[df.selftext == '[deleted]'].tolist()
print(indx)
df.drop(indx, inplace=True)
df.shape

[2784, 2961, 3770, 4793, 5772, 5815, 5820, 5908, 5941, 9652, 12303, 14547, 14558, 14604, 14631, 14664, 14681, 14693, 14699, 14716, ...]

(309049, 3)
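
Because the index labels repeat across subreddits, dropping by label removes more than the placeholder posts. The cleaned `head()` output later starts at label 7, even though rows 0, 2 and 3 of the raw data had real text. A minimal sketch on a toy frame (all data here is made up for illustration) contrasts the label-based drop with a boolean-mask filter:

```python
import pandas as pd

# Toy frame with repeated index labels, as produced by concatenating
# several frames without ignore_index=True.
toy = pd.DataFrame(
    {"selftext": ["real post", "[removed]", "[removed]", "another real post", "[deleted]"]},
    index=[0, 1, 0, 1, 2],
)

# Label-based drop: removes every row carrying a matching label,
# including the two valid posts that share labels 0 and 1.
labels = toy.index[toy.selftext == "[removed]"].tolist()
by_label = toy.drop(labels)

# Mask-based filter: keeps exactly the rows whose text is not a placeholder.
placeholders = ["[removed]", "[deleted]"]
by_mask = toy[~toy["selftext"].isin(placeholders)]

print(len(by_label), len(by_mask))
```

The mask-based version does not depend on index labels at all, so it behaves the same whether or not the index is unique.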

In [31]:
# drop duplicates
df.drop_duplicates(keep='first',inplace=True)
df.shape

(307717, 3)

Remove emojis and all other non-ASCII characters

In [32]:
df['selftext'] = df['selftext'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
df['title'] = df['title'].str.replace(r'[^\x00-\x7F]+', '', regex=True)

df.head()

Unnamed: 0,title,selftext,subreddit
7,Suicidal Partner,My boyfriend has been depressed and suicidal s...,SuicideWatch
10,Why does there always have to be a reason to f...,when im honest with my dad if im anxious depre...,SuicideWatch
17,my friend's attempted and i dont know how to r...,"hi, i just got a suicidal note/message from a ...",SuicideWatch
20,Im scared of myself,I literally think about it hourly. Im so hideo...,SuicideWatch
22,I dont want to die,But I dont really see a way out of this I have...,SuicideWatch
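
The pattern `[^\x00-\x7F]+` deletes every non-ASCII character, not only emojis. That includes curly apostrophes, which is why contractions show up as "Im" and "dont" in the output above. The sketch below, on made-up sample strings, shows the effect and one possible variant that maps curly apostrophes to ASCII before stripping. The variant is an assumption about what might be preferable, not part of the original pipeline.

```python
import pandas as pd

# Made-up samples: a curly apostrophe plus an emoji, and an accented letter.
samples = pd.Series(["I\u2019m so tired \U0001F61E", "caf\u00e9"])

# Original approach: delete every run of non-ASCII characters.
stripped = samples.str.replace(r"[^\x00-\x7F]+", "", regex=True)

# Variant: convert curly apostrophes to ASCII first, then strip the rest.
normalized = (
    samples.str.replace("\u2019", "'", regex=False)
           .str.replace(r"[^\x00-\x7F]+", "", regex=True)
)

print(stripped.tolist())    # ['Im so tired ', 'caf']
print(normalized.tolist())  # ["I'm so tired ", 'caf']
```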


We assign class labels based on the subreddit of the scraped data.

In [33]:
df['subreddit'].unique()

array(['SuicideWatch', 'depression', 'ptsd', 'Anxiety', 'EDAnonymous',
       'BodyDysmorphia', 'domesticviolence', 'death',
       'abusiverelationships'], dtype=object)

In [34]:
sub_conditions = [
                  (df['subreddit'] == 'SuicideWatch'),
                  (df['subreddit'] == 'depression'),
                  (df['subreddit'] == 'ptsd'),
                  (df['subreddit'] == 'Anxiety'),
                  (df['subreddit'] == 'EDAnonymous'),
                  (df['subreddit'] == 'BodyDysmorphia'),
                  (df['subreddit'] == 'domesticviolence'),
                  (df['subreddit'] == 'death'),
                  (df['subreddit'] == 'abusiverelationships')
]

In [35]:
classes_names = [
                 'Suicide',
                 'Depression',
                 'PTSD',
                 'Anxiety',
                 'Eating Disorder',
                 'Dysmorphia',
                 'Domestic Violence',
                 'Death',
                 'Abuse'
]

In [36]:
df['class'] = np.select(sub_conditions, classes_names)
df.head()

Unnamed: 0,title,selftext,subreddit,class
7,Suicidal Partner,My boyfriend has been depressed and suicidal s...,SuicideWatch,Suicide
10,Why does there always have to be a reason to f...,when im honest with my dad if im anxious depre...,SuicideWatch,Suicide
17,my friend's attempted and i dont know how to r...,"hi, i just got a suicidal note/message from a ...",SuicideWatch,Suicide
20,Im scared of myself,I literally think about it hourly. Im so hideo...,SuicideWatch,Suicide
22,I dont want to die,But I dont really see a way out of this I have...,SuicideWatch,Suicide
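
The same labelling can be written as a dictionary lookup with `Series.map`, which keeps each subreddit next to its class name. This is a sketch of an alternative, not the method used above. The `subreddit_to_class` name and the toy frame are introduced here for illustration, and the keys follow the values printed by `df['subreddit'].unique()`.

```python
import pandas as pd

# Hypothetical lookup table; keys match the subreddit names as they appear in the data.
subreddit_to_class = {
    "SuicideWatch": "Suicide",
    "depression": "Depression",
    "ptsd": "PTSD",
    "Anxiety": "Anxiety",
    "EDAnonymous": "Eating Disorder",
    "BodyDysmorphia": "Dysmorphia",
    "domesticviolence": "Domestic Violence",
    "death": "Death",
    "abusiverelationships": "Abuse",
}

# Toy frame; "AskReddit" is deliberately not in the lookup table.
toy = pd.DataFrame({"subreddit": ["ptsd", "death", "AskReddit"]})

# map() returns NaN for subreddits without an entry, which makes
# unmapped rows easy to spot with isna().
toy["class"] = toy["subreddit"].map(subreddit_to_class)
print(toy)
```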


In [37]:
df.drop(columns=['subreddit'], inplace=True)
df.head()

Unnamed: 0,title,selftext,class
7,Suicidal Partner,My boyfriend has been depressed and suicidal s...,Suicide
10,Why does there always have to be a reason to f...,when im honest with my dad if im anxious depre...,Suicide
17,my friend's attempted and i dont know how to r...,"hi, i just got a suicidal note/message from a ...",Suicide
20,Im scared of myself,I literally think about it hourly. Im so hideo...,Suicide
22,I dont want to die,But I dont really see a way out of this I have...,Suicide


In [42]:
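# reset_index returns a new frame, so assign the result; drop=True discards the old labels
df = df.reset_index(drop=True)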
df.head()
df.tail()
df.shape

(307717, 3)

In [40]:
df.to_csv('dataset_triggers_new.csv',encoding='utf-8')
files.download("dataset_triggers_new.csv")
