# Reddit Classifier


## Scraping for data

In [4]:
import requests
import pandas as pd
import numpy as np

In [16]:
url = 'https://www.reddit.com/r/atheism.json?limit=100'
base_url = 'https://www.reddit.com/r/'

In [17]:
res = requests.get(url, headers={'User-agent': 'Pony Inc 1.0'})

In [18]:
res.json()['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [8]:
posts = res.json()['data']['children']

In [19]:
len(posts)

101

In [12]:
posts[0]['data']['title']

'The secular survey'

In [15]:
posts[100]['data']['title']

'Video of pastor allegedly performing oral sex on a woman who is not his wife goes viral, energizes social media'

In [10]:
posts[0]['data']['name']

't3_dir1oo'

In [191]:
[{key: post['data'][key] for key in ['selftext', 'title']} for post in res.json()['data']['children'][:10]]

[{'selftext': '',
  'title': 'In U.S., Decline of Christianity Continues at Rapid Pace (PEW)'},
 {'selftext': 'The “church people”, as she calls them, has been harassing my daughter to try to get her to come to church.  She wants them to stop.  I was shocked to learn that this is happening.    How can a public school do this?  Granted, it is the Bible Belt (US).  \n\nI want to get them kicked out of the school but I don’t know how to go about it since I work for the school and would probably be fired in retaliation.   Again, it’s the Bible Belt. \n\nI sent a message to the Freedom From Religion foundation.  I hope the contact me back.',
  'title': 'My daughter’s public high school allows pastors from the local mega church to come inside the school and proselytize to students during lunch.'},
 {'selftext': '',
  'title': 'Americans becoming less Christian as over a quarter follow no religion | World news | The Guardian'},
 {'selftext': 'I freaking love the Pew Forum.\nhttps://www.pewfor

In [186]:
[ post['data']['title'] + post['data']['selftext'] for post in res.json()['data']['children'][:10] ]

AttributeError: 'dict' object has no attribute 'json'

In [192]:
[ post['data']['title'] + post['data']['selftext'] for post in res.json()['data']['children'][:2] ]

['In U.S., Decline of Christianity Continues at Rapid Pace (PEW)',
 'My daughter’s public high school allows pastors from the local mega church to come inside the school and proselytize to students during lunch.The “church people”, as she calls them, has been harassing my daughter to try to get her to come to church.  She wants them to stop.  I was shocked to learn that this is happening.    How can a public school do this?  Granted, it is the Bible Belt (US).  \n\nI want to get them kicked out of the school but I don’t know how to go about it since I work for the school and would probably be fired in retaliation.   Again, it’s the Bible Belt. \n\nI sent a message to the Freedom From Religion foundation.  I hope the contact me back.']
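
In the second entry above, the title runs straight into the body (`lunch.The`) because `title + selftext` adds no separator. One possible variation, not used elsewhere in this notebook, joins the two fields with a newline and skips an empty body. `join_post_text` is a hypothetical helper name, and the sample dicts are illustrative rather than scraped.

```python
def join_post_text(post_data, sep='\n'):
    """Join title and selftext with a separator, dropping empty parts."""
    parts = [post_data.get('title', ''), post_data.get('selftext', '')]
    return sep.join(part for part in parts if part)


# Illustrative dicts shaped like post['data'] in the listing.
link_post = {'title': 'A link post', 'selftext': ''}
self_post = {'title': 'A question.', 'selftext': 'Some details.'}

print(repr(join_post_text(link_post)))
print(repr(join_post_text(self_post)))
```

Keeping a separator stops the last word of the title from fusing with the first word of the body, which may matter once the text is tokenized for a classifier.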

In [115]:
def extract_content(response):
    # Title concatenated with selftext for the first 10 posts of a listing response
    return [post['data']['title'] + post['data']['selftext'] for post in response.json()['data']['children'][:10]]

In [73]:
# Inspect the fields available on each post
for post in res.json()['data']['children'][:2]:
    print(post['data'].keys())

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'thumbnail', 'edited', 'author_flair_css_class', 'steward_reports', 'author_flair_richtext', 'gildings', 'post_hint', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', '

In [211]:
a = {'a': [1, 2], 'b': [3, 4]}
for key, val in a.items():
    print(val)

[1, 2]
[3, 4]


In [48]:
from collections import defaultdict
import random  # used by scrape_topic below
import time    # used by scrape_topic below

def extract_content(response, topic):
    # Build parallel lists: 'content' (title + selftext), 'topic', and 'id' (the post's fullname)
    rslt = defaultdict(list)
    for post in response.json()['data']['children']:
        data = post['data']
        rslt['content'].append(data['title'] + data['selftext'])
        rslt['topic'].append(topic)
        rslt['id'].append(data['name'])

    return rslt

def scrape_topic(topic, depth=12):
    '''
    Scrape up to `depth` pages of a subreddit, requesting 100 posts per page.

    Returns a dict holding the topic name and the accumulated post data.
    '''
    base_url = 'https://www.reddit.com/r/'
    url = base_url + topic + '.json?limit=100'
    last_entry_name = ''
    payload = None

    for i in range(depth):
        # 'after' comes back as None when the listing has no further page
        if last_entry_name is None:
            break
        if last_entry_name != '':
            url = base_url + topic + '.json?limit=100&after=' + last_entry_name

        res = requests.get(url, headers={'User-agent': 'Pony Inc 1.0'})
        print(f"request url: {url} for {topic} at depth:{i}")

        last_entry_name = res.json()['data']['after']
        print(f"Last entry name : {last_entry_name} ")
        content = extract_content(res, topic)
        if not payload:
            payload = content
        else:
            # Append this page's lists onto the accumulated lists
            for k in payload:
                payload[k] += content[k]
        # Pause a random 2-30 s between requests, but not after the final page
        if i < depth - 1 and last_entry_name is not None:
            sleep_duration = random.randint(2, 30)
            time.sleep(sleep_duration)
        print(f'Scraping {topic}, {i} pages in..')
    return {'topic': topic, 'data': payload}
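
The cursor-following loop in `scrape_topic` can be tried without touching the network. The sketch below is a simplified variant, not the function above: `collect_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names introduced here, and the fake listing only imitates the `data` / `children` / `after` shape seen earlier in this section.

```python
def collect_pages(fetch_page, max_pages=12):
    """Follow 'after' cursors until they run out or max_pages is reached.

    fetch_page is a hypothetical callable: it takes a cursor (None for the
    first page) and returns a dict shaped like the listing JSON.
    """
    posts = []
    cursor = None
    for _ in range(max_pages):
        listing = fetch_page(cursor)
        posts.extend(child['data'] for child in listing['data']['children'])
        cursor = listing['data']['after']
        if cursor is None:  # no further page to request
            break
    return posts


# Fake two-page listing used instead of a network call (illustrative ids).
FAKE_PAGES = {
    None: {'data': {'after': 't3_b', 'children': [
        {'data': {'name': 't3_a'}}, {'data': {'name': 't3_b'}}]}},
    't3_b': {'data': {'after': None, 'children': [
        {'data': {'name': 't3_c'}}]}},
}


def fake_fetch(cursor):
    return FAKE_PAGES[cursor]


names = [post['name'] for post in collect_pages(fake_fetch)]
print(names)
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP call, so the stopping conditions can be checked against canned data.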
    

In [28]:
import concurrent.futures as cf
import random
import time


def get_reddit_data(topics):
    '''
    Main function to scrape reddit data into a pandas DataFrame.
    Args:
        topics: list of subreddit names to scrape
    '''
    with cf.ThreadPoolExecutor(max_workers=5) as executor:
        # executor.map yields each scrape_topic result dict, in the order of `topics`
        frames = [pd.DataFrame(result['data']) for result in executor.map(scrape_topic, topics)]

    # pd.concat raises on an empty list, so return an empty frame when no topics were given
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
        
      

In [49]:
!rm reddit_scrape_bio_chem.csv

rm: reddit_scrape.csv: No such file or directory


In [61]:
import os.path

rdf = get_reddit_data(['AskDocs','legaladvice'])
print(rdf.groupby(by='topic').count())

# filename = 'reddit_scrape_bio_chem.csv'
filename = 'reddit_scrape_docs_legal.csv'

# Write the header row only when creating the file; later runs append rows without it
use_header = not os.path.exists(filename)

rdf.to_csv(filename, mode='a', header=use_header, index=False)

request url: https://www.reddit.com/r/legaladvice.json?limit=100 for legaladvice at depth:0
Last entry name : t3_dkz2et 
request url: https://www.reddit.com/r/AskDocs.json?limit=100 for AskDocs at depth:0
Last entry name : t3_dkv1dd 
Scraping legaladvice, 0 pages in..
request url: https://www.reddit.com/r/legaladvice.json?limit=100&after=t3_dkz2et for legaladvice at depth:1
Last entry name : t3_dklq67 
Scraping legaladvice, 1 pages in..
request url: https://www.reddit.com/r/legaladvice.json?limit=100&after=t3_dklq67 for legaladvice at depth:2
Last entry name : t3_dkqwo4 
Scraping AskDocs, 0 pages in..
request url: https://www.reddit.com/r/AskDocs.json?limit=100&after=t3_dkv1dd for AskDocs at depth:1
Last entry name : t3_dkrf1i 
Scraping legaladvice, 2 pages in..
request url: https://www.reddit.com/r/legaladvice.json?limit=100&after=t3_dkqwo4 for legaladvice at depth:3
Last entry name : t3_dkbfqt 
Scraping AskDocs, 1 pages in..
request url: https://www.reddit.com/r/AskDocs.json?limit=10
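
The cell above appends to the CSV and writes the header only when the file does not exist yet. The same pattern can be sketched in isolation. This sketch uses the standard-library `csv` module instead of `DataFrame.to_csv` so that it runs without pandas; `append_rows`, the file name, and the rows are hypothetical and purely illustrative.

```python
import csv
import os
import tempfile


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV, writing the header only for a new file."""
    write_header = not os.path.exists(path)
    with open(path, 'a', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Illustrative rows; the ids are made up.
fields = ['content', 'topic', 'id']
batch_one = [{'content': 'first', 'topic': 'AskDocs', 'id': 't3_x1'}]
batch_two = [{'content': 'second', 'topic': 'legaladvice', 'id': 't3_x2'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_scrape.csv')
    append_rows(demo_path, batch_one, fields)   # creates the file with a header
    append_rows(demo_path, batch_two, fields)   # appends without a second header
    with open(demo_path, newline='', encoding='utf-8') as handle:
        saved = list(csv.reader(handle))

print(saved)
```

Because every run appends, scraping the same subreddit twice can store the same post more than once, which is why checking the `id` column for duplicates after reading the file back is worthwhile.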

In [32]:
res = requests.get('https://www.reddit.com/r/medicine.json?limit=100&after=t3_cq3rbx', headers={'User-agent': 'Pony Inc 1.0'})


In [43]:
res.json()['data']['children'][98]['data']['after']

KeyError: 'after'
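
The `KeyError` above is consistent with the `dict_keys` output near the top of this section: `after` is a key of the listing's `data` object, alongside `children`, and not a key of each post. A minimal sketch of reading the cursor follows. The hand-built `listing` dict only mimics the shape of `res.json()`; its ids and titles are made up.

```python
# Hand-built stand-in for res.json(); only the shape mirrors the real listing.
listing = {
    'data': {
        'after': 't3_example2',
        'before': None,
        'children': [
            {'data': {'name': 't3_example1', 'title': 'first post', 'selftext': ''}},
            {'data': {'name': 't3_example2', 'title': 'second post', 'selftext': 'body'}},
        ],
    }
}

# The pagination cursor sits next to 'children', one level above the posts.
cursor = listing['data']['after']

# Each post carries its own fullname under 'name', but has no 'after' key.
post_keys = listing['data']['children'][0]['data'].keys()

print(cursor)
print('after' in post_keys)
```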

In [58]:
len(res.json()['data']['children'])

41

In [52]:
res = requests.get('https://www.reddit.com/r/chemistry.json?limit=100&after=t3_d9mcs6', headers={'User-agent': 'Pony Inc 1.0'})


In [53]:
len(res.json()['data']['children'])

41

In [64]:
filename

'reddit_scrape_docs_legal.csv'

In [65]:
test_pd = pd.read_csv(filename)

In [66]:
test_pd

| | content | topic | id |
|---|---|---|---|
| 0 | Weekly Discussion/General Questions Thread - O... | AskDocs | t3_dhnk4v |
| 1 | (F18) Hit my head off a nail in the wall repea... | AskDocs | t3_dktyay |
| 2 | [29F] Fell on ribs a month ago, xray showed no... | AskDocs | t3_dl06ul |
| 3 | What happened to me last night ?25M\nWeight 73... | AskDocs | t3_dkudly |
| 4 | Back injury advise.26\nMale\n5" 11\n85kg\nCauc... | AskDocs | t3_dl0nns |
| ... | ... | ... | ... |
| 1755 | [FL] In June my wife got a speeding ticket and... | legaladvice | t3_djt05l |
| 1756 | Domestic Violence MichiganI’m seeking advice f... | legaladvice | t3_djvyvt |
| 1757 | My future ex wife is trying to get our dogs ce... | legaladvice | t3_djblj9 |
| 1758 | Rented apartment is at 85% humidity. Building ... | legaladvice | t3_djsw8k |


In [69]:
test_pd[test_pd.duplicated(subset=['id'])]

| | content | topic | id |
|---|---|---|---|



In [148]:
import concurrent.futures as cf

with cf.ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map returns a lazy iterator over the results, not a list
    results = executor.map(scrape_topic, ['boardgames', 'atheism'])
print("result:", results)

result: <generator object Executor.map.<locals>.result_iterator at 0x0000019F8A451F48>
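
The printed value is a generator because `Executor.map` returns a lazy iterator rather than a list. Wrapping it in `list()` (or looping over it, as `get_reddit_data` does) yields the actual return values in the order of the inputs. A small sketch, with `slow_square` as a hypothetical stand-in for `scrape_topic` so that nothing is fetched:

```python
import concurrent.futures as cf


def slow_square(n):
    # Hypothetical stand-in for scrape_topic; no network access involved.
    return n * n


with cf.ThreadPoolExecutor(max_workers=5) as executor:
    # list() drains the iterator; results come back in input order.
    squares = list(executor.map(slow_square, [1, 2, 3]))

print(squares)  # [1, 4, 9]
```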


In [122]:
rdf.shape

(26, 1)

In [123]:
rdf

Unnamed: 0,boardgames
0,/r/boardgames Daily Discussion and Game Recomm...
1,The Board Game at the Heart of Viking Culture&...
2,"What's your go-to player colour, and why?Since..."
3,Board Game Atlas Acquires Board Game Prices
4,"TIL that according to Graham Nash, Jimi Hendri..."
5,Are there any upcoming board game secret santa...
6,"Designer Diary 2: Origins, Expanded | Oath: Ch..."
7,Mage Knight Ultimate Edition first play charac...
8,Artemis Project vs other dice placement gamesA...
9,Pandemic: Rapid Response - A lesson in persona...
