# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Detecting Hate Speech


--- 

SG-DSI-41 Group 01: Lionel Foo, Joel Lim, Poon Wenzhe, Daryl Chia

### <b> Notebook: 03 Reddit Scraping for Demo</b>

#### Overview

* Scrape a subreddit's hottest threads and their comments to obtain data for the demo.



---

### 01 Import Libraries

In [1]:
# 1. Installation
#!pip install praw

# 2. Imports
import praw
import pandas as pd
import re

---

### 02 Data Collection - Scraping Subreddit Data

* Scrape the 100 "hottest" threads and their comments from the r/TheRightCantMeme subreddit
* The code below reuses the Project 3 scraper, so it retains extra fields such as the post hint

In [2]:
# authenticate with the Reddit API (credentials redacted; supply your own client_id, client_secret and user_agent)
reddit = praw.Reddit(
    client_id = "",
    client_secret = "",
    user_agent = "",
    ratelimit_seconds = 30)

# create 2 separate lists to store threads and comments after we scrape them from the subreddit
threads = []
comments = []

# subreddit to scrape
subreddit = reddit.subreddit("TheRightCantMeme")

# iterate through the hottest threads in the subreddit
# scrape 100 threads for illustration (reset the limit to 1,000 for a full scrape)
for submission in subreddit.hot(limit = 100):
    # submissions with no author object raise AttributeError here, so store an empty name
    try:
        tr_author_nm = submission.author.name
    except AttributeError:
        tr_author_nm = ""
    # post_hint is not present on every submission, so default to an empty string
    hint = getattr(submission, "post_hint", "")


    # store thread data
    thread = {
        "id": submission.id,
        "title": submission.title, 
        "score": submission.score,
        "num_comments": submission.num_comments,
        "post_hint": hint,
        "self_text": submission.selftext,
        "author_name": tr_author_nm,
        "url": submission.url
    }  
    threads.append(thread)
    

    # store all comment data under thread
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        # required as there are some comments without author name
        author_nm = ""
        try:
            author_nm = comment.author.name
        except:
            # store empy name
            author_nm = ""
        # store thread data
        comment_data = {
            "thread_id": submission.id, 
            "comment_id": comment.id,
            "comment_text": comment.body,
            "comment_score" : comment.score,
            "author_name": author_nm
        }
        comments.append(comment_data)
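
The author lookup is repeated for threads and comments, so it could be factored into one helper. The sketch below is a minimal illustration, assuming that an item with no author exposes `author` as `None`. The name `safe_author_name` and the `SimpleNamespace` stand-ins are hypothetical and not part of the notebook or of PRAW.

```python
from types import SimpleNamespace


def safe_author_name(item):
    """Return the author's name, or "" when the item has no author."""
    # assumption: a missing author shows up as `author` being None
    author = getattr(item, "author", None)
    return author.name if author is not None else ""


# stand-in objects (not real PRAW models) covering both cases
with_author = SimpleNamespace(author=SimpleNamespace(name="some_user"))
deleted_author = SimpleNamespace(author=None)

print(safe_author_name(with_author))            # some_user
print(repr(safe_author_name(deleted_author)))   # ''
```

Inside the loop, `safe_author_name(submission)` and `safe_author_name(comment)` would then replace the two `try`/`except` blocks.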

* Convert Lists to Pandas Dataframes, and Export Dataframes as CSV files

In [3]:
df_threads = pd.DataFrame(threads)
df_comments = pd.DataFrame(comments)

df_threads.to_csv('./demo/threads.csv', index=False)
df_comments.to_csv('./demo/comments.csv', index=False)


---

### 03 EDA & Cleaning

* Check the number of threads and comments

In [4]:
print('Thread Rows: ', df_threads.shape[0])
print('Comment Rows: ', df_comments.shape[0])

Thread Rows:  100
Comment Rows:  4130


* Remove rows whose comment text is [deleted] or [removed]

In [5]:
# placeholder texts that stand in for deleted or removed comments
deleted_texts = ["[deleted]", "[removed]"]

# keep only the rows whose comment text is not one of the placeholders
df_comments_clean = df_comments[~df_comments["comment_text"].isin(deleted_texts)]

print('Comment Rows: ', df_comments_clean.shape[0])

Comment Rows:  4025
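
A frame produced by boolean filtering can be treated by pandas as a slice of the original, and reassigning its columns later may then trigger a `SettingWithCopyWarning` in pandas versions that emit it. The sketch below shows one common way to avoid that by taking an explicit copy. The toy frame and the name `toy_clean` are made up for illustration.

```python
import pandas as pd

# toy frame standing in for df_comments; the values are made up
toy = pd.DataFrame({"comment_text": ["hello", "[deleted]", "[removed]", "world"]})
deleted_texts = ["[deleted]", "[removed]"]

# .copy() makes the filtered result an independent DataFrame
toy_clean = toy[~toy["comment_text"].isin(deleted_texts)].copy()

# later column assignments now target toy_clean itself, not a slice of toy
toy_clean["comment_text"] = toy_clean["comment_text"].str.upper()

print(toy_clean["comment_text"].tolist())  # ['HELLO', 'WORLD']
print(len(toy))                            # 4, the original frame is unchanged
```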


* Remove Reddit's own markdown formatting for links and keep only the text

In [6]:
# preview comments that start with a markdown link (str.match anchors the pattern at the start of the string)
link_regex = r'\[(.+?)\]\((.+?)\)'
df_comments_clean[df_comments_clean['comment_text'].str.match(link_regex)].head()

Unnamed: 0,thread_id,comment_id,comment_text,comment_score,author_name
359,tzaw3i,j9a8f7s,[Those](https://en.wikipedia.org/wiki/Powers_o...,3,Niomedes
943,1al0n50,kpc2tjm,[Here](https://www.mic.com/articles/185045/wol...,79,Alric_Rahl
1060,1al51n7,kpefwp3,[Ayo](https://en.m.wikipedia.org/wiki/Lalibela),22,Soviet-pirate
1997,1ajh1mt,kp1mimr,[Christian Socialism](https://en.wikipedia.org...,10,Less-Country-2767
2295,1aiso8s,koxvpz7,[the talmud doesn’t actually say that](http://...,50,TravisPorerr


In [7]:
# define a function that replaces each markdown link with its link text

def keep_text_from_link(text):
    link_regex = r'\[(.+?)\]\((.+?)\)'
    
    def replace_link(match):
        return match.group(1)

    return re.sub(link_regex, replace_link, text)
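
As a small worked example of the pattern: `Series.str.match` behaves like `re.match` and only finds links at the very start of a comment, whereas `re.sub` rewrites links anywhere in the text. The sample string below is hypothetical, and the `\1` backreference is an equivalent shorthand for the nested `replace_link` function.

```python
import re

link_regex = r'\[(.+?)\]\((.+?)\)'

# hypothetical sample comment with a link in the middle of the text
sample = "See [the docs](https://example.com/docs) for details."

# re.match only succeeds when the pattern matches at the start of the string
print(re.match(link_regex, sample))             # None
# re.search scans the whole string
print(re.search(link_regex, sample).group(1))   # the docs

# a backreference keeps the link text and drops the target
print(re.sub(link_regex, r'\1', sample))        # See the docs for details.
```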

In [8]:
# apply to dataframe; assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_comments_clean = df_comments_clean.assign(comment_text=df_comments_clean["comment_text"].apply(keep_text_from_link))


* Check if filtering is successful

In [9]:
df_comments_clean[df_comments_clean["comment_text"].str.match(link_regex)].head()

Unnamed: 0,thread_id,comment_id,comment_text,comment_score,author_name


* Strip URLs from comment text (URL-only comments become empty strings)

In [10]:
# preview comments that start with a URL (str.match anchors the pattern at the start of the string)
url_regex = r'https?://\S+|www\.\S+'
df_comments_clean[df_comments_clean["comment_text"].str.match(url_regex)]

Unnamed: 0,thread_id,comment_id,comment_text,comment_score,author_name
297,tzaw3i,i6mb3wa,https://www.usatoday.com/story/news/politics/2...,14,jliane
364,tzaw3i,ikybdyo,https://en.m.wikipedia.org/wiki/Accelerationism,51,maxwellsearcy
1247,1al7vv9,kpg35e7,https://en.m.wikipedia.org/wiki/Jewish_settlem...,5,SSeptic
2380,1aisx9r,kp0krq9,https://images.app.goo.gl/PcN1G9qPHUNtHnud7\n\...,9,Quartia
3433,1ah3162,komabqx,https://www.nbcnews.com/news/us-news/pa-man-ar...,13,Titaniumfury
3539,1agyr36,kokjrds,https://en.wikipedia.org/wiki/Critical_race_th...,57,faultydesign
3557,1agyr36,kol31a5,https://www.statista.com/statistics/476456/mas...,75,gartsmith
3580,1agyr36,kozg67s,https://www2.ed.gov/admins/lead/safety/prevent...,1,Depressed_Squirrl
3936,1agde4c,kogem6a,https://en.m.wikipedia.org/wiki/Lehi_(militant...,16,Okayhatstand


In [11]:
# define a function that strips URLs from text, replacing each with an empty string
def remove_url(text):
    url_regex = r'https?://\S+|www\.\S+'
    return re.sub(url_regex, '', text)
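
Removing a URL leaves the surrounding whitespace behind, and a comment that was only a URL becomes an empty string. The sketch below is a hypothetical variant, `remove_url_and_tidy`, that also collapses the leftover whitespace. It is not the notebook's function, and note that it flattens newlines into single spaces as well.

```python
import re

url_regex = r'https?://\S+|www\.\S+'


def remove_url_and_tidy(text):
    """Hypothetical variant of remove_url that also collapses leftover whitespace."""
    without_urls = re.sub(url_regex, '', text)
    # collapse runs of whitespace (including newlines) left behind by the removed URL
    return re.sub(r'\s+', ' ', without_urls).strip()


print(repr(remove_url_and_tidy("Source: https://example.com/report (archived)")))
# 'Source: (archived)'
print(repr(remove_url_and_tidy("https://example.com/only-a-link")))
# ''
```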

In [12]:
# apply to the dataframe; assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_comments_clean = df_comments_clean.assign(comment_text=df_comments_clean["comment_text"].apply(remove_url))


* Check if filtering is successful

In [13]:
df_comments_clean[df_comments_clean['comment_text'].str.match(url_regex)]

Unnamed: 0,thread_id,comment_id,comment_text,comment_score,author_name


* Removal of comments posted by AutoModerator

In [14]:
# only save comment rows that are not posted by automoderator
df_comments_clean = df_comments_clean[df_comments_clean["author_name"] != "AutoModerator"]

* Sanity check: confirm that no AutoModerator comments remain

In [15]:
df_comments_clean[df_comments_clean["author_name"] == "AutoModerator"]

Unnamed: 0,thread_id,comment_id,comment_text,comment_score,author_name


In [16]:
print('Comment Rows: ', df_comments_clean.shape[0])

Comment Rows:  3894


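* Check that there are no missing (null) comments

In [17]:
# isnull() only counts missing values; comments reduced to empty strings by the URL removal are not counted
print("No. of empty comments:", df_comments_clean["comment_text"].isnull().sum())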

No. of empty comments: 0
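
Because `isnull()` does not flag empty or whitespace-only strings, a stricter check could count those as well. The sketch below uses a made-up toy frame rather than the scraped data, so its counts say nothing about the real dataset.

```python
import pandas as pd

# toy frame standing in for df_comments_clean; the values are made up
toy = pd.DataFrame({"comment_text": ["good point", "", "   ", None]})

null_count = toy["comment_text"].isnull().sum()
# fillna("") lets the None row be stripped and compared like any other string
blank_count = toy["comment_text"].fillna("").str.strip().eq("").sum()

print("Null comments:", null_count)             # 1
print("Null or blank comments:", blank_count)   # 3
```

Rows flagged this way could then be dropped with the same boolean-mask pattern used for the [deleted] and AutoModerator filters.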


* Check the final dataframe row counts

In [18]:
print('Thread Rows: ', df_threads.shape[0])
print('Comment Rows: ', df_comments_clean.shape[0])

Thread Rows:  100
Comment Rows:  3894



---

### 04 Merge Comments & Threads Dataframes

Merge the thread and comment dataframes for demo purposes.

In [19]:
# perform merging
clean_merged = pd.merge(df_comments_clean, df_threads, left_on='thread_id', right_on='id', how='left', suffixes=('', '_thread'))
clean_merged.shape

(3894, 13)
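
The left merge keeps one row per comment and appends the thread columns, with `suffixes` renaming the thread's clashing `author_name` column. The sketch below reproduces this on made-up toy frames and adds `validate="many_to_one"`, an optional `pd.merge` argument that the notebook does not use, to guard against duplicated thread ids.

```python
import pandas as pd

# toy frames; column names follow the notebook, values are made up
toy_comments = pd.DataFrame({
    "thread_id": ["t1", "t1", "t2"],
    "comment_id": ["c1", "c2", "c3"],
    "author_name": ["a", "b", "c"],
})
toy_threads = pd.DataFrame({
    "id": ["t1", "t2"],
    "title": ["first", "second"],
    "author_name": ["x", "y"],
})

merged = pd.merge(
    toy_comments, toy_threads,
    left_on="thread_id", right_on="id",
    how="left", suffixes=("", "_thread"),
    validate="many_to_one",  # raises MergeError if a thread id is duplicated
)

print(merged.shape)           # (3, 6)
print(list(merged.columns))
```

The row count matching the comment frame (3 here, 3894 in the notebook) is a quick sign that the merge did not duplicate or drop comments.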

In [20]:
# inspect the first rows to confirm the thread columns were joined onto each comment
clean_merged.head()

Unnamed: 0,thread_id,comment_id,comment_text,comment_score,author_name,id,title,score,num_comments,post_hint,self_text,author_name_thread,url
0,tzaw3i,i3zbuma,The left wing doesn’t exist in this country. C...,1856,_realm_breaker,tzaw3i,"2nd post regarding Biden. again, he's not left...",8109,736,image,,the_red_guard,https://i.redd.it/dliu0r47tcs81.jpg
1,tzaw3i,i49gyqr,As horriable as trump was. I do miss his senil...,514,englishcrumpit,tzaw3i,"2nd post regarding Biden. again, he's not left...",8109,736,image,,the_red_guard,https://i.redd.it/dliu0r47tcs81.jpg
2,tzaw3i,i5zbs1f,"Why is ""Nothing fundamentally changing"" not a...",108,,tzaw3i,"2nd post regarding Biden. again, he's not left...",8109,736,image,,the_red_guard,https://i.redd.it/dliu0r47tcs81.jpg
3,tzaw3i,i45nffy,The sheer amount of libs using this sub to def...,205,Somelebguy989,tzaw3i,"2nd post regarding Biden. again, he's not left...",8109,736,image,,the_red_guard,https://i.redd.it/dliu0r47tcs81.jpg
4,tzaw3i,i50tba0,He isn’t Trump so that one should be Checked off,78,,tzaw3i,"2nd post regarding Biden. again, he's not left...",8109,736,image,,the_red_guard,https://i.redd.it/dliu0r47tcs81.jpg



---

### 05 Save Cleaned Merged Dataframes as CSV

In [21]:
clean_merged.to_csv("./demo/clean_merged.csv", index=False)