# Data Acquisition: Reddit Scraping

In this exercise, we search for a query (e.g., "data science") on the old Reddit interface (https://www.old.reddit.com/), take the URL of the results page (e.g., https://old.reddit.com/search?q=data+science), and scrape the returned posts. We use the old Reddit interface because its HTML is simpler and easier to parse. For each post, we extract the title, URL, author, author's profile link, subreddit, tag, timestamp, number of points (votes), and number of comments.
<img src="../images/reddit_search.png" />



* You are free to use your own query string. 
* On the search page, a set of subreddits are shown. Ignore these subreddits and focus on extracting Reddit posts. 
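
If you choose your own query, the search URL can be built from it instead of being typed by hand. The following is a minimal sketch using only the standard library; the helper name `build_search_url` is hypothetical, and the base URL is the one from the example above.

```python
from urllib.parse import urlencode

# Search endpoint of the old Reddit interface, as used in this exercise.
SEARCH_URL = "https://old.reddit.com/search"


def build_search_url(query):
    """Return the search URL for a free-text query (hypothetical helper)."""
    # urlencode escapes special characters and turns spaces into '+'.
    return f"{SEARCH_URL}?{urlencode({'q': query})}"


print(build_search_url("data science"))
# prints https://old.reddit.com/search?q=data+science
```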



**Activity 1:** Fetch the page and create a soup object using the Beautiful Soup library.

In [1]:
# Your code for activity 1 goes here..
#---------------------------------------

# Import requests to query the website
import requests
# Import Beautiful Soup to parse the HTML returned by the website
from bs4 import BeautifulSoup

# Import pandas to convert the lists to a DataFrame
import pandas as pd
# Import numpy
import numpy as np


# Send a custom User-Agent so the request identifies its client;
# requests that use the library's default User-Agent may be rejected or rate-limited
headers = {'User-Agent': 'MyAPP/1.0'}


# specify the url
url = 'https://old.reddit.com/search?q=data+science'

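# Request the page and store the HTTP response in the variable 'response'
response = requests.get(url, headers=headers)
# Stop here if the server answered with an error status instead of the page
response.raise_for_status()

# Parse the HTML in 'response' with Python's built-in parser and store it as a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")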



In [2]:
attrs = {'class': 'search-result'}

In [3]:
# Collect every <div> with class "search-result" into a list (one element per search result)

posts = soup.find_all('div', class_="search-result")

In [4]:
# post_1 = posts[3]

**Activity 2:** Extract the titles and URLs of the retrieved posts from the soup and print them.

In [5]:
# Titles (posts[3:] skips the first three results, which are subreddit suggestions rather than posts)

Title = []

for post in posts[3:]:
    content = post.find("a", class_="search-title",text=True)
    Title.append(content.text)
    print(content.text)

Data science even at mature companies can be a mixed bag.
Who are your data science heroes?
College Professor in Data Science Course Just Said That Functional Programming Is Better Than OOP, Does He Have a Point?
Do you use OOP in your daily data science work?
The Key Word in Data Science is Science, not Data
I built an interactive map to help people self-teaching Data Science online. It's like a skill tree for Data Science!
Ethics of a data science project I am undertaking
What data science skills do you see as in-demand given evolution of data science field in last few years?
So I, a data science noob, ran sentiment analysis on as much BTS MVs on r/kpop I could find...
How I use Data Science to Trade Options Around Earnings
Why do people look down on data science work and “computer” work in general?
Data Science for the Good of Society: are there realistic employment options?
I am interested in creating a group of new comers and intermediate Data science and ML practitioners just to 

In [6]:
# Links

URL = []

for post in posts[3:]:
    link = post.find("a")
    URL.append(link.get('href'))
    print(link.get('href'))

/r/datascience/comments/pu1y72/data_science_even_at_mature_companies_can_be_a/
/r/datascience/comments/pq44jp/who_are_your_data_science_heroes/
/r/datascience/comments/ppntvz/college_professor_in_data_science_course_just/
/r/datascience/comments/pkw92b/do_you_use_oop_in_your_daily_data_science_work/
/r/datascience/comments/p7hpd9/the_key_word_in_data_science_is_science_not_data/
/r/learndatascience/comments/pjplux/i_built_an_interactive_map_to_help_people/
/r/datascience/comments/prrou1/ethics_of_a_data_science_project_i_am_undertaking/
/r/datascience/comments/pj3dls/what_data_science_skills_do_you_see_as_indemand/
/r/kpopthoughts/comments/plyyes/so_i_a_data_science_noob_ran_sentiment_analysis/
/r/wallstreetbets/comments/psyjv5/how_i_use_data_science_to_trade_options_around/
/r/datascience/comments/pmb7a3/why_do_people_look_down_on_data_science_work_and/
/r/datascience/comments/p5mzc3/data_science_for_the_good_of_society_are_there/
/r/datascience/comments/oy2vfu/i_am_interested_in_crea
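
The `href` values printed above are relative to the site root. If absolute links are wanted, `urllib.parse.urljoin` can resolve them against the site's base URL. This is a small sketch; `BASE_URL` and `to_absolute` are illustrative names that do not appear in the exercise code.

```python
from urllib.parse import urljoin

# Root of the site the relative links belong to.
BASE_URL = "https://old.reddit.com"


def to_absolute(href):
    """Resolve a relative href against BASE_URL (hypothetical helper)."""
    # Hrefs that are already absolute are returned unchanged by urljoin.
    return urljoin(BASE_URL, href)


relative = "/r/datascience/comments/pq44jp/who_are_your_data_science_heroes/"
print(to_absolute(relative))
# prints https://old.reddit.com/r/datascience/comments/pq44jp/who_are_your_data_science_heroes/
```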

**Activity 3:** Extract the author ids and their profile links from the retrieved posts and print them.

In [7]:
# author

Author = []

for post in posts[3:]:
    author = post.find("a", class_="author",text=True)
    Author.append(author.text)
    print(author.text)

__compactsupport__
GravityAI
Illustrious_Ice_5022
rightheart
yoi12321
InstinctiveDoubt
productive_guy123
svyas
palebabbu
nema31lebowski
ogretronz
saindoja
yaakarsh1011
saik2363
VictorChen1
MisterInvicta
hyperxenophiliac
TheLSales
aznpersuazion
kribz666
fu11m3ta1
kunal_packtpub


In [8]:
# profile link

Author_Profile_Link =[]

for post in posts[3:]:
    author = post.find("a", class_="author",text=True)
    Author_Profile_Link.append(author.get('href'))
    print(author.get('href'))

https://old.reddit.com/user/__compactsupport__
https://old.reddit.com/user/GravityAI
https://old.reddit.com/user/Illustrious_Ice_5022
https://old.reddit.com/user/rightheart
https://old.reddit.com/user/yoi12321
https://old.reddit.com/user/InstinctiveDoubt
https://old.reddit.com/user/productive_guy123
https://old.reddit.com/user/svyas
https://old.reddit.com/user/palebabbu
https://old.reddit.com/user/nema31lebowski
https://old.reddit.com/user/ogretronz
https://old.reddit.com/user/saindoja
https://old.reddit.com/user/yaakarsh1011
https://old.reddit.com/user/saik2363
https://old.reddit.com/user/VictorChen1
https://old.reddit.com/user/MisterInvicta
https://old.reddit.com/user/hyperxenophiliac
https://old.reddit.com/user/TheLSales
https://old.reddit.com/user/aznpersuazion
https://old.reddit.com/user/kribz666
https://old.reddit.com/user/fu11m3ta1
https://old.reddit.com/user/kunal_packtpub


**Activity 4:** Extract the submission time of the retrieved posts and print them.

In [9]:
# time

Submission_Time =[]


for post in posts[3:]:
    time = post.find("time",text=True)
    #print(time.text)
    Submission_Time.append(time.get('title'))
    print(time.get('title'))
    #print(time.get('datetime'))

Thu Sep 23 18:49:43 2021 UTC
Fri Sep 17 17:00:55 2021 UTC
Thu Sep 16 22:40:21 2021 UTC
Thu Sep 9 11:50:16 2021 UTC
Thu Aug 19 16:01:05 2021 UTC
Tue Sep 7 15:43:32 2021 UTC
Mon Sep 20 10:00:51 2021 UTC
Mon Sep 6 16:57:19 2021 UTC
Sat Sep 11 02:31:51 2021 UTC
Wed Sep 22 03:03:33 2021 UTC
Sat Sep 11 17:06:12 2021 UTC
Mon Aug 16 19:18:19 2021 UTC
Wed Aug 4 21:30:53 2021 UTC
Fri Sep 17 14:11:21 2021 UTC
Fri Sep 17 01:20:46 2021 UTC
Tue Sep 21 18:05:12 2021 UTC
Tue Sep 21 14:19:20 2021 UTC
Tue Sep 14 13:05:59 2021 UTC
Tue Jul 20 01:29:22 2021 UTC
Sun Sep 19 20:49:43 2021 UTC
Sun Sep 19 00:16:10 2021 UTC
Fri Sep 24 09:30:56 2021 UTC
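
The `title` strings above are plain text. For sorting or filtering by date, they can be converted to `datetime` objects. The sketch below assumes every title follows the layout shown in the output and that day and month names are in English (the `%a` and `%b` codes depend on the locale); `parse_submission_time` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Layout of the <time> element's title attribute,
# e.g. "Thu Sep 23 18:49:43 2021 UTC".
TITLE_FORMAT = "%a %b %d %H:%M:%S %Y UTC"


def parse_submission_time(title):
    """Convert a title string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(title, TITLE_FORMAT)
    # The string ends in a literal "UTC", so attach that zone explicitly.
    return naive.replace(tzinfo=timezone.utc)


print(parse_submission_time("Thu Sep 9 11:50:16 2021 UTC").isoformat())
# prints 2021-09-09T11:50:16+00:00
```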


**Activity 5:** Extract the subreddits of the retrieved posts and print them

In [10]:
# sub reddit

Subreddit = []

for post in posts[3:]:
    subreddit = post.find("a", class_="search-subreddit-link",text=True)
    Subreddit.append(subreddit.text)
    print(subreddit.text)

r/datascience
r/datascience
r/datascience
r/datascience
r/datascience
r/learndatascience
r/datascience
r/datascience
r/kpopthoughts
r/wallstreetbets
r/datascience
r/datascience
r/datascience
r/datascience
r/Notion
r/canoo
r/rstats
r/AerospaceEngineering
r/datascience
r/smallstreetbets
r/analytics
r/learndatascience


**Activity 6:** Extract the associated tag(s) of the retrieved posts and print them

In [11]:
# tags

Tag = []

for post in posts[3:]:
    tag = post.find("span", class_="linkflairlabel",text=True)
    if tag is None:
        Tag.append('None')
        print('None')
    else:
        Tag.append(tag.text)
        print(tag.text)
    

Discussion
Fun/Trivia
Discussion
Discussion
Discussion
Resources
Discussion
Discussion
Boy Groups
Discussion
Discussion
Career
Networking
Networking
Showcase
New Hires
None
Career
Career
Discussion
Question
Resources


**Activity 7:** Extract the points of the retrieved posts and print them

In [12]:
# points

Points = []

for post in posts[3:]:
    points = post.find("span", class_="search-score",text=True)
    Points.append(points.text)
    print(points.text)

281 points
191 points
109 points
211 points
1,217 points
746 points
68 points
196 points
183 points
43 points
53 points
245 points
334 points
156 points
165 points
61 points
24 points
22 points
681 points
103 points
31 points
10 points


**Activity 8:** Extract the number of comments of the retrieved posts and print them.

In [13]:
# comments

Comments = []

for post in posts[3:]:
    comments = post.find("a", class_="search-comments",text=True)
    Comments.append(comments.text)
    print(comments.text)

105 comments
119 comments
128 comments
158 comments
156 comments
54 comments
53 comments
95 comments
50 comments
31 comments
63 comments
152 comments
238 comments
19 comments
17 comments
17 comments
28 comments
40 comments
302 comments
12 comments
24 comments
20 comments
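
The points and comment counts are stored as text such as "1,217 points" and "105 comments". If numeric values are needed later (for example, to sort by score), the number can be pulled out of each label. This is a minimal sketch; `parse_count` is a hypothetical helper, and it assumes the count is the first number in the label.

```python
import re


def parse_count(label):
    """Return the first integer in a label such as '1,217 points', or None."""
    # Match a run of digits that may contain thousands separators.
    match = re.search(r"\d[\d,]*", label)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


for label in ["1,217 points", "105 comments", "10 points"]:
    print(label, "->", parse_count(label))
```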


**Activity 9:** Using the above nine features, create a DataFrame for the retrieved posts and print the first 10 entries.

In [14]:
# Combine the lists built in the sections above into a single DataFrame

Reddit_DF = pd.DataFrame({
    'Title': Title,
    'URL': URL,
    'Author': Author,
    'Author_Profile_Link': Author_Profile_Link,
    'Submission_Time': Submission_Time,
    'Subreddit': Subreddit,
    'Tag': Tag,
    'Points': Points,
    'Comments': Comments
})

Reddit_DF.head()

| | Title | URL | Author | Author_Profile_Link | Submission_Time | Subreddit | Tag | Points | Comments |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Data science even at mature companies can be a... | /r/datascience/comments/pu1y72/data_science_ev... | __compactsupport__ | https://old.reddit.com/user/__compactsupport__ | Thu Sep 23 18:49:43 2021 UTC | r/datascience | Discussion | 281 points | 105 comments |
| 1 | Who are your data science heroes? | /r/datascience/comments/pq44jp/who_are_your_da... | GravityAI | https://old.reddit.com/user/GravityAI | Fri Sep 17 17:00:55 2021 UTC | r/datascience | Fun/Trivia | 191 points | 119 comments |
| 2 | College Professor in Data Science Course Just ... | /r/datascience/comments/ppntvz/college_profess... | Illustrious_Ice_5022 | https://old.reddit.com/user/Illustrious_Ice_5022 | Thu Sep 16 22:40:21 2021 UTC | r/datascience | Discussion | 109 points | 128 comments |
| 3 | Do you use OOP in your daily data science work? | /r/datascience/comments/pkw92b/do_you_use_oop_... | rightheart | https://old.reddit.com/user/rightheart | Thu Sep 9 11:50:16 2021 UTC | r/datascience | Discussion | 211 points | 158 comments |
| 4 | The Key Word in Data Science is Science, not Data | /r/datascience/comments/p7hpd9/the_key_word_in... | yoi12321 | https://old.reddit.com/user/yoi12321 | Thu Aug 19 16:01:05 2021 UTC | r/datascience | Discussion | 1,217 points | 156 comments |
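
The parallel lists used above only line up if every lookup succeeds for every post; a lookup that returns `None` would raise an `AttributeError` on `.text`. One alternative is to build a single dictionary per post and guard each lookup. The sketch below runs on a small hand-written HTML snippet, not on real Reddit output; `SAMPLE_HTML`, `text_or_none`, and `extract_record` are hypothetical names, and only the class names are taken from the exercise code.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two search results; the second has no flair tag.
SAMPLE_HTML = """
<div class="search-result">
  <a class="search-title" href="/r/example/comments/1/first/">First post</a>
  <span class="linkflairlabel">Discussion</span>
  <a class="author" href="https://old.reddit.com/user/alice">alice</a>
</div>
<div class="search-result">
  <a class="search-title" href="/r/example/comments/2/second/">Second post</a>
  <a class="author" href="https://old.reddit.com/user/bob">bob</a>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_record(post):
    """Collect a subset of the fields of one post into a dictionary."""
    title = post.find("a", class_="search-title")
    author = post.find("a", class_="author")
    flair = post.find("span", class_="linkflairlabel")
    return {
        "Title": text_or_none(title),
        "URL": title.get("href") if title is not None else None,
        "Author": text_or_none(author),
        "Author_Profile_Link": author.get("href") if author is not None else None,
        "Tag": text_or_none(flair),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [extract_record(post) for post in soup.find_all("div", class_="search-result")]
for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)`, so every row keeps its fields together even when a value is missing.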


**Activity 10:** Save the retrieved posts in a json file.  

In [15]:
# Saving the entire dataframe as a json file

Reddit_DF.to_json('LCM_Reddit_DF.json')

# Save your notebook, then `File > Close and Halt`
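
By default, `DataFrame.to_json` writes one JSON object per column, keyed by row index. If one JSON object per post is preferred, `orient="records"` produces that layout. The sketch below uses a two-row made-up frame in place of `Reddit_DF` and writes to a temporary directory; the file name `posts.json` is an arbitrary choice.

```python
import json
import os
import tempfile

import pandas as pd

# Two made-up rows standing in for Reddit_DF.
df = pd.DataFrame({
    "Title": ["First post", "Second post"],
    "Points": ["281 points", "191 points"],
})

path = os.path.join(tempfile.mkdtemp(), "posts.json")

# orient="records" writes a list with one JSON object per row.
df.to_json(path, orient="records")

# Read the file back with the standard library to inspect its shape.
with open(path, encoding="utf-8") as handle:
    rows = json.load(handle)

print(rows)
```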