# **How to collect and analyze text from social media (1) Web-Scraping**
Created by Yoonwon Jung  
Email: ywjung@snu.ac.kr  
ResearchGate: https://www.researchgate.net/profile/Yoonwon-Jung


## Crawling Basics: BeautifulSoup

### 1. Import libraries

- requests = library for fetching web pages
- bs4 = crawling library (parses the fetched HTML)

In [None]:
import requests
from bs4 import BeautifulSoup

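### 2. Fetch the web page

1. Every computer has an IP address; a web address (URL) is the readable name that points to it
2. Fetch the HTML file at 'http://~' and assign the response to the variable 'res'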

In [None]:
res = requests.get('http://abcdefg')

- How to view the HTML file: in the Chrome web browser, right-click and choose "View page source"

### 3. Parse the web page
parsing = analyzing a string to extract its structure and meaning  

In [None]:
#The parsed HTML is assigned to the variable 'soup'
soup = BeautifulSoup(res.content, 'html.parser')

### 4. Extract the data you need

In [None]:
mydata = soup.find('title')

1. Select by **tag** and **attribute** (**find** function)
crawling_data = soup.find('h1') <br>
crawling_data = soup.find('title') <br>
crawling_data = soup.find('p', class_='cssstyle') <br>
crawling_data = soup.find('p', attrs = {'align': 'center'})

2. Select with a **CSS selector**
crawling_data = soup.select('html > title') <br>
crawling_data = soup.select('div.article_view') <br>
crawling_data = soup.select('#harmonyContainer') <br>
crawling_data = soup.select('div#mArticle')
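
A minimal sketch of how the two approaches differ, run on a small hand-written HTML string. The markup, the class name `cssstyle`, and the id `mArticle` are placeholders chosen for illustration and do not come from a real page. `find` returns the first matching tag (or `None`), while `select` always returns a list of matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = """
<html><head><title>Sample page</title></head>
<body>
  <div id="mArticle">
    <p class="cssstyle">First paragraph</p>
    <p align="center">Second paragraph</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() returns a single Tag (the first match) or None.
title_tag = soup.find('title')
styled_p = soup.find('p', class_='cssstyle')
centered_p = soup.find('p', attrs={'align': 'center'})

# select() takes a CSS selector and always returns a list of Tags.
paragraphs = soup.select('div#mArticle > p')

print(title_tag.get_text())
print(styled_p.get_text())
print(centered_p.get_text())
print([p.get_text() for p in paragraphs])
```

Because `select` returns a list, calling `get_text()` on its result directly fails; index into the list or loop over it first.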

### 5. Use the extracted data

In [None]:
print(mydata.get_text())

### Example

Page to crawl: https://news.v.daum.net/v/20210823155607617

In [None]:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.v.daum.net/v/20210823155607617')

soup = BeautifulSoup(res.content, 'html.parser')

In [None]:
mydata = soup.find('title')

print(mydata.get_text())

2023년부터 고교학점제 부분 도입..수업 170시간 줄어


In [None]:
mydata = soup.find('div', class_="layer_util layer_summary")

print(mydata.get_text())


2025년부터 전면도입되는 고교학점제의 점진적 적용을 위해 현재 중학교 2학년 학생들이 고등학교 1학년에 되는 2023년부터 고교 3년 동안 이수해야 하는 수업시간이 2890시간에서 2720시간으로 170시간 줄어든다.
교육부는 23일 고교교육 혁신 추진단 회의를 열고 '2025년 고교학점제 전면 적용을 위한 단계적 이행 계획'을 발표했다. 기사 제목과 주요 문장을 기반으로 자동요약한 결과입니다. 전체 맥락을 이해하기 위해서는 본문 보기를 권장합니다.



## Crawling Basics: Using API
Target site: Reddit    
https://www.reddit.com/r/lonely/ : 188k members  
https://www.reddit.com/r/loneliness/ : 8.1k members

## 1. Reddit API (PRAW)
Reference: https://www.storybench.org/how-to-scrape-reddit-with-python/

In [None]:
pip install praw

Collecting praw
  Downloading https://files.pythonhosted.org/packages/2c/15/4bcc44271afce0316c73cd2ed35f951f1363a07d4d5d5440ae5eb2baad78/praw-7.1.0-py3-none-any.whl (152kB)
Collecting websocket-client>=0.54.0 (from praw)
  Downloading https://files.pythonhosted.org/packages/4c/5f/f61b420143ed1c8dc69f9eaec5ff1ac36109d52c80de49d66e0c36c3dfdf/websocket_client-0.57.0-py2.py3-none-any.whl (200kB)
Collecting prawcore<2.0,>=1.3.0 (from praw)
  Downloading https://files.pythonhosted.org/packages/1d/40/b741437ce4c7b64f928513817b29c0a615efb66ab5e5e01f66fe92d2d95b/prawcore-1.5.0-py3-none-any.whl
Collecting update-checker>=0.17 (from praw)
  Downloading https://files.pythonhosted.org/packages/0c/ba/8dd7fa5f0b1c6a8ac62f8f57f7e794160c1f86f31c6d0fb00f582372a3e4/update_checker-0.18.0-py3-none-any.whl
Installing collected packages: websocket-client, prawcore, update-checker, praw
Successfully installed praw-7.1.0 prawcore-1.5.0 update-checker-0.18.0 websocket-client-0.57.0
Note: you may need to restart

In [None]:
import praw #Python Reddit API Wrapper
import pandas as pd
import datetime as dt

Go to Reddit and choose "Create an App" to obtain the OAuth2 keys needed for API access: https://www.reddit.com/prefs/apps  
Enter the keys below, together with the other account information.

In [None]:
reddit = praw.Reddit(client_id='PERSONAL_USE_SCRIPT_14_CHARS',
                     client_secret='SECRET_KEY_27_CHARS',
                     user_agent='YOUR_APP_NAME',
                     username='YOUR_REDDIT_USER_NAME',
                     password='YOUR_REDDIT_LOGIN_PASSWORD')
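
Keys typed directly into a notebook are easy to leak when the notebook is shared. One possible alternative, sketched below, reads them from environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are assumptions made for this sketch, not part of PRAW.

```python
import os

# Hypothetical environment variable names chosen for this sketch.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}

def load_reddit_credentials(env=None):
    """Collect the praw.Reddit keyword arguments from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    # strip() guards against stray spaces copied along with a key.
    return {key: env[name].strip() for key, name in REQUIRED_VARS.items()}

# Demonstration with a fake environment (note the trailing space in each value).
fake_env = {name: "value-for-" + name + " " for name in REQUIRED_VARS.values()}
credentials = load_reddit_credentials(fake_env)
print(sorted(credentials))
```

With real environment variables set, the result could be passed on as `praw.Reddit(**load_reddit_credentials())`.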

### Fetching titles and bodies

In [None]:
subreddit = reddit.subreddit('SUBREDDIT_NAME') #name of the subreddit to crawl, e.g. 'lonely'
top_subreddit = subreddit.top(limit=500)
top_dict = { "title":[],
             "body":[]}
for submission in top_subreddit:
    top_dict["title"].append(submission.title)
    top_dict["body"].append(submission.selftext)

In [None]:
reddit_top_data = pd.DataFrame(top_dict)
reddit_top_data

### Example: Loneliness web scraping (collected 2021-01-06 to 2021-01-07)
### (1) Top 1000 posts & comments

In [None]:
#Set the subreddit
subreddit = reddit.subreddit('lonely')

#Crawl the top 1000 posts
subreddit_top = subreddit.top(limit=1000)
#To filter by a search query, use subreddit.search(query, sort='top', limit=1000) instead
subreddit_new = subreddit.new(limit=1000)

post_dict_top = {
    "title":[], \
    "body":[], \
    "score" : [], \
    "id" : [], \
    "url" : [], \
    "comms_num": [], \
    "created" : []
            }
# score of the post: number of upvotes minus the number of downvotes.
# unique id of the post
# url of the post
# the number of comments on the post
# timestamp of the post

comments_dict_top = {
    "comment_id" : [], \
    "comment_parent_id" : [],  \
    "comment_body" : [],  \
    "comment_link_id" : [],  \
    "created" : []
                }
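# unique id of the comment
# fullname of the parent (t3_ = the post itself, t1_ = another comment)
# text of the comment
# fullname (t3_...) of the post the comment belongs to
# timestamp of the comment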

for submission in subreddit_top:
    post_dict_top["title"].append(submission.title)
    post_dict_top["body"].append(submission.selftext)
    post_dict_top["score"].append(submission.score)
    post_dict_top["id"].append(submission.id)
    post_dict_top["url"].append(submission.url)
    post_dict_top["comms_num"].append(submission.num_comments)
    post_dict_top["created"].append(submission.created)

    ##### Accessing comments on the post
    submission.comments.replace_more(limit = None)
    for comment in submission.comments.list():
        comments_dict_top["comment_id"].append(comment.id)
        comments_dict_top["comment_parent_id"].append(comment.parent_id)
        comments_dict_top["comment_body"].append(comment.body)
        comments_dict_top["comment_link_id"].append(comment.link_id)
        comments_dict_top["created"].append(comment.created)

In [None]:
post_dict_top

{'title': ['I am dying and no one is coming to my funeral.',
  'Does anyone ever feel so lonely that whenever a person of the opposite sex/same sex treats you like a human being, you instantly fall in love with them only then to realize how pathetic you really are?',
  'Does anybody have friends and family, but still feel lonely because nobody knows the "real" you.',
  'I am so lonely that I smile after seeing someone upvoted me!',
  'Whoever’s reading this, I pray that one day you don’t have to pretend to be happy anymore. I pray you find your purpose & no longer feel like your alone slowly drowning in the middle of the ocean. I hope that you find someone who brings light, joy, & life into your darkest days.',
  'Do you ever get so lonely you start reading old messages from people you liked/loved at the time?',
  'You ever just hug your blankets and fantasize about how one day you’ll find that special someone who’ll hold on to you while telling you how much they love you?',
  'This is

In [None]:
#Convert to DataFrames
reddit_top_posts = pd.DataFrame.from_dict(post_dict_top, orient='index')
reddit_top_posts = reddit_top_posts.transpose()
reddit_top_comments = pd.DataFrame.from_dict(comments_dict_top, orient='index')
reddit_top_comments = reddit_top_comments.transpose()

import datetime
def get_date(submission):
    time = submission
    return datetime.datetime.fromtimestamp(time)

timestamps_tp = reddit_top_posts["created"].apply(get_date)
reddit_top_posts = reddit_top_posts.assign(timestamp = timestamps_tp)

timestamps_tc = reddit_top_comments["created"].apply(get_date)
reddit_top_comments = reddit_top_comments.assign(timestamp = timestamps_tc)
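
`datetime.datetime.fromtimestamp` called without a time zone converts to the local time zone of the machine running the notebook, so the same `created` values can produce different `timestamp` columns on different machines. A minimal sketch of a timezone-aware variant follows; the function name `get_date_utc` is made up for this example.

```python
import datetime

def get_date_utc(created):
    """Convert a Unix epoch value to a timezone-aware UTC datetime."""
    return datetime.datetime.fromtimestamp(created, tz=datetime.timezone.utc)

# 1609459200 is 2021-01-01 00:00:00 UTC.
example = get_date_utc(1609459200)
print(example.isoformat())
```

It could be applied in the same way as `get_date`, for example `reddit_top_posts["created"].apply(get_date_utc)`.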

In [None]:
reddit_top_posts

Unnamed: 0,title,body,score,id,url,comms_num,created,timestamp
0,I am dying and no one is coming to my funeral.,Throwaway because my main account is for posit...,,dnixdr,https://www.reddit.com/r/lonely/comments/dnixd...,158,1.57215e+09,2019-10-27 13:00:08
1,Does anyone ever feel so lonely that whenever ...,It's been one of those days for me.,,f2rsfs,https://www.reddit.com/r/lonely/comments/f2rsf...,185,1.58155e+09,2020-02-13 07:12:58
2,"Does anybody have friends and family, but stil...",I feel like I create a different person with e...,,esptqg,https://www.reddit.com/r/lonely/comments/esptq...,185,1.57979e+09,2020-01-24 00:33:58
3,I am so lonely that I smile after seeing someo...,Not a click bait! M genuinely lonely af,,jwh0zt,https://www.reddit.com/r/lonely/comments/jwh0z...,100,1.60574e+09,2020-11-19 07:46:02
4,"Whoever’s reading this, I pray that one day yo...",,,da0l7s,https://www.reddit.com/r/lonely/comments/da0l7...,98,1.56962e+09,2019-09-28 06:42:24
...,...,...,...,...,...,...,...,...
995,Does anyone else go to be early because they f...,I’m just lonely and sad. I’d rather go to bed ...,,c775nl,https://www.reddit.com/r/lonely/comments/c775n...,38,1.56189e+09,2019-06-30 18:08:36
996,A taste of intimacy,I finally got the intimacy I was craving. It h...,,ca7njm,https://www.reddit.com/r/lonely/comments/ca7nj...,35,1.56254e+09,2019-07-08 07:53:16
997,girlfriend,i feel like many people in my age only think a...,,hcxy4x,https://www.reddit.com/r/lonely/comments/hcxy4...,43,1.59273e+09,2020-06-21 18:18:32
998,I have so much love to give yet no one wants i...,I am a kind and caring person but yet I end up...,,gc8pne,https://www.reddit.com/r/lonely/comments/gc8pn...,33,1.58846e+09,2020-05-03 08:52:56


In [None]:
reddit_top_comments

Unnamed: 0,comment_id,comment_parent_id,comment_body,comment_link_id,created
0,f5bk1dq,t3_dnixdr,"No one might go to your funeral, but please th...",t3_dnixdr,1.57215e+09
1,f5bdaxx,t3_dnixdr,I want to become a neurosurgeon... I am so sor...,t3_dnixdr,1.57215e+09
2,f5bjllr,t3_dnixdr,[삭제된 글],t3_dnixdr,1.57215e+09
3,f5bwahi,t3_dnixdr,Did you regret living your life that way befor...,t3_dnixdr,1.57216e+09
4,f5bxdkf,t3_dnixdr,"Damn, this really touched me.",t3_dnixdr,1.57216e+09
...,...,...,...,...,...
67692,f33cgic,t1_f33bl47,"You’re right, i was wrong. One day you might b...",t3_dfb1mw,1.57066e+09
67693,f32wqc9,t1_f32w5cw,yep exactly. i can’t promise a dude i’ll feel ...,t3_dfb1mw,1.57066e+09
67694,f33cnnw,t1_f33cgic,lmfao.\n you got some great comebacks huh.,t3_dfb1mw,1.57066e+09
67695,f32wzf6,t1_f32wqc9,[삭제된 글],t3_dfb1mw,1.57066e+09
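
The `comment_parent_id` and `comment_link_id` columns hold Reddit "fullnames": a type prefix followed by the id. The prefix `t3_` marks a post and `t1_` marks a comment, so a parent id starting with `t3_` belongs to a top-level comment and one starting with `t1_` belongs to a reply to another comment. A minimal sketch of that distinction, using three rows copied from the table above (the helper name `comment_kind` is hypothetical):

```python
def comment_kind(parent_id):
    """Classify a comment by the type prefix of its parent fullname."""
    prefix = parent_id.split("_", 1)[0]
    if prefix == "t3":
        return "top-level"   # parent is the post itself
    if prefix == "t1":
        return "reply"       # parent is another comment
    return "unknown"

# (comment_id, comment_parent_id) pairs taken from the output above.
rows = [
    ("f5bk1dq", "t3_dnixdr"),
    ("f33cgic", "t1_f33bl47"),
    ("f33cnnw", "t1_f33cgic"),
]
kinds = {comment_id: comment_kind(parent_id) for comment_id, parent_id in rows}
print(kinds)
```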


### (2) New 1000 posts & comments

In [None]:
post_dict_new = {
    "title":[], \
    "body":[], \
    "score" : [], \
    "id" : [], \
    "url" : [], \
    "comms_num": [], \
    "created" : [],
            }

comments_dict_new = {
    "comment_id" : [], \
    "comment_parent_id" : [],  \
    "comment_body" : [],  \
    "comment_link_id" : [],  \
    "created" : []
                }

for submission in subreddit_new:
    post_dict_new["title"].append(submission.title)
    post_dict_new["body"].append(submission.selftext)
    post_dict_new["score"].append(submission.score)
    post_dict_new["id"].append(submission.id)
    post_dict_new["url"].append(submission.url)
    post_dict_new["comms_num"].append(submission.num_comments)
    post_dict_new["created"].append(submission.created)

    ##### Accessing comments on the post
    submission.comments.replace_more(limit = None)
    for comment in submission.comments.list():
        comments_dict_new["comment_body"].append(comment.body)
        comments_dict_new["comment_id"].append(comment.id)
        comments_dict_new["comment_parent_id"].append(comment.parent_id)
        comments_dict_new["comment_link_id"].append(comment.link_id)
        comments_dict_new["created"].append(comment.created)

In [None]:
#Convert to DataFrames
reddit_new_posts = pd.DataFrame.from_dict(post_dict_new, orient='index')
reddit_new_posts = reddit_new_posts.transpose()
reddit_new_comments = pd.DataFrame.from_dict(comments_dict_new, orient='index')
reddit_new_comments = reddit_new_comments.transpose()

import datetime
def get_date(submission):
    time = submission
    return datetime.datetime.fromtimestamp(time)

timestamps_np = reddit_new_posts["created"].apply(get_date)
reddit_new_posts = reddit_new_posts.assign(timestamp = timestamps_np)

timestamps_nc = reddit_new_comments["created"].apply(get_date)
reddit_new_comments = reddit_new_comments.assign(timestamp = timestamps_nc)

In [None]:
reddit_new_posts

Unnamed: 0,title,body,score,id,url,comms_num,created,timestamp
0,Feeling lonely and need to talk on voice app? ...,"Hi there, fellow human! I'm a shy and a bit an...",,krp08v,https://www.reddit.com/r/lonely/comments/krp08...,0,1.60997e+09,2021-01-07 07:35:07
1,"Everyday i get teary eyed after i leave work, ...",Header.,,krot9h,https://www.reddit.com/r/lonely/comments/krot9...,0,1.60997e+09,2021-01-07 07:24:09
2,Thinking a lot about romantic relationships la...,"With my laptop broken, my relationship problem...",,krnpn4,https://www.reddit.com/r/lonely/comments/krnpn...,0,1.60997e+09,2021-01-07 06:18:54
3,The pandemic is taking a toll on me,I'm also a little heartbroken at the moment. L...,,krmxiv,https://www.reddit.com/r/lonely/comments/krmxi...,1,1.60996e+09,2021-01-07 05:27:06
4,Holidays as single are a f curse.,"On Christmas, on WhatsApp (my only social alon...",,krmtv5,https://www.reddit.com/r/lonely/comments/krmtv...,0,1.60996e+09,2021-01-07 05:20:42
...,...,...,...,...,...,...,...,...
992,"advice, should I keep looking for emotional su...",I think im pretty careful with who I chose to ...,,km3dae,https://www.reddit.com/r/lonely/comments/km3da...,1,1.60923e+09,2020-12-29 17:56:37
993,Need some one to get my mind off of everything,Going through a break up. We live together and...,,km3boa,https://www.reddit.com/r/lonely/comments/km3bo...,1,1.60923e+09,2020-12-29 17:53:56
994,I don't even have acquaintances anymore,"It's one thing to not have friends, but still ...",,km2ucu,https://www.reddit.com/r/lonely/comments/km2uc...,3,1.60923e+09,2020-12-29 17:27:24
995,I Have the Worst Relationship Ever,So I met this girl during freshman year of hig...,,km2kvi,https://www.reddit.com/r/lonely/comments/km2kv...,1,1.60923e+09,2020-12-29 17:12:56


In [None]:
reddit_new_comments

Unnamed: 0,comment_id,comment_parent_id,comment_body,comment_link_id,created,timestamp
0,giaz1o2,t3_krmxiv,"Fear not as you grow from strength. Remember, ...",t3_krmxiv,1.60997e+09,2021-01-07 07:16:32
1,giaqlfg,t3_krma85,Jesus I am so sorry you are going through this...,t3_krma85,1.60997e+09,2021-01-07 05:45:48
2,giakz1w,t3_krlytp,Why is it so hard for me to make friends? I wi...,t3_krlytp,1.60996e+09,2021-01-07 04:26:23
3,gib1fko,t3_krlytp,I share the same feeling. What’s helped me has...,t3_krlytp,1.60997e+09,2021-01-07 07:38:23
4,giarr7k,t3_krlqn9,I can understand you in many ways. Trust only ...,t3_krlqn9,1.60997e+09,2021-01-07 06:00:02
...,...,...,...,...,...,...
5695,ghd7cpv,t3_km3boa,"I’m really sorry to hear that, I’ve recently g...",t3_km3boa,1.60925e+09,2020-12-29 22:26:37
5696,ghce0yv,t3_km2ucu,Have you tried joining the discord groups and ...,t3_km2ucu,1.60923e+09,2020-12-29 17:48:08
5697,ghcfcj5,t3_km2ucu,Me too,t3_km2ucu,1.60923e+09,2020-12-29 18:00:10
5698,ghcz7s3,t3_km2ucu,I think my acquaintances got tired of how much...,t3_km2ucu,1.60924e+09,2020-12-29 21:00:31


In [None]:
reddit_new_comments.loc[8]['comment_body']

'I’m not sure - I probably did play with their emotions (unintentionally) since I arranged hook ups with them but had no intention of going since it was a fake profile. I really didn’t understand catfishing at the time even though that’s what I was doing. I completely agree though that there’s harm in doing it'

In [None]:
#Export to CSV
reddit_top_posts.to_csv('Reddit_Loneliness_top_posts.csv', index=False)
reddit_top_comments.to_csv('Reddit_Loneliness_top_comments.csv', index=False)
reddit_new_posts.to_csv('Reddit_Loneliness_new_posts.csv', index=False)
reddit_new_comments.to_csv('Reddit_Loneliness_new_comments.csv', index=False)

## 2. Pushshift API
### How do you crawl posts from a specific period? PRAW cannot do this, but Pushshift can.

References:
https://rareloot.medium.com/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563  
https://colab.research.google.com/drive/1biLcXeHs8yZD1x9f3gv-cNJXEq7tpyoO?usp=sharing  
https://www.osrsbox.com/blog/2019/03/18/watercooler-scraping-an-entire-subreddit-2007scape/  

### Learning by example: crawling the last 5 days of December 2020 and the first 5 days of January 2021

In [None]:
import pandas as pd
import requests #Pushshift accesses Reddit via an url so this is needed
import json #JSON manipulation
import csv #To Convert final table into a csv file to save to your machine
import time
import datetime

In [None]:
#Adapted from this https://gist.github.com/dylankilkenny/3dbf6123527260165f8c5c3bc3ee331b
#This function builds a Pushshift URL, accesses the webpage and stores JSON data in a nested list

#Function to use when there is a query
"""
def getPushshiftData(query, after, before, sub):
    #Build URL
    url = 'https://api.pushshift.io/reddit/search/submission/?title='+str(query)+'&size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    #Print URL to show user
    print(url)
    #Request URL
    r = requests.get(url)
    #Load JSON data from webpage into data variable
    data = json.loads(r.text)
    #return the data element which contains all the submissions data
    return data['data']
"""

In [None]:
#Adapted from this https://gist.github.com/dylankilkenny/3dbf6123527260165f8c5c3bc3ee331b
#This function builds a Pushshift URL, accesses the webpage and stores JSON data in a nested list

#Function that fetches submissions when there is no query
def getPushshiftData(after, before, sub):
    #Build URL
    url = 'https://api.pushshift.io/reddit/search/submission/?'+'size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    #Print URL to show user
    print(url)
    #Request URL
    r = requests.get(url)
    #Load JSON data from webpage into data variable
    data = json.loads(r.text)
    #return the data element which contains all the submissions data
    return data['data']
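
Joining the URL by string concatenation does not escape characters such as spaces or `&` in a query. A minimal sketch of one way to build the same URL with `urllib.parse.urlencode`, covering both the with-query and no-query cases, is shown here. The names `BASE_URL` and `build_pushshift_url` are made up for this example, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search/submission/'

def build_pushshift_url(after, before, sub, query=None, size=1000):
    """Build the search URL; the title filter is added only when query is given."""
    params = {}
    if query is not None:
        params['title'] = query
    params.update({'size': size, 'after': after, 'before': before, 'subreddit': sub})
    # urlencode escapes each value and joins the pairs with '&'.
    return BASE_URL + '?' + urlencode(params)

url = build_pushshift_url("1609027200", "1609847999", "lonely")
print(url)
print(build_pushshift_url("1609027200", "1609847999", "lonely", query="feel alone"))
```

The resulting string can be passed to `requests.get` exactly as the concatenated URL is.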

## 1. Posts from the "lonely" subreddit

### December 27 to January 5

In [None]:
#Create your timestamps and queries for your search URL
#https://www.unixtimestamp.com/index.php > Use this to create your timestamps
after = "1609027200" #Submissions after this timestamp
before = "1609847999" #Submissions before this timestamp
query = None #Keyword(s) to look for in submissions
sub = "lonely" #Which Subreddit to search in
#sub = "loneliness"
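
The two timestamps can also be computed in Python instead of on the website. The sketch below assumes the dates are meant in UTC; the helper name `to_epoch` is hypothetical. The values used above correspond to 2020-12-27 00:00:00 UTC and 2021-01-05 11:59:59 UTC.

```python
import datetime

def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Return the Unix timestamp (as a string) for a UTC date and time."""
    moment = datetime.datetime(year, month, day, hour, minute, second,
                               tzinfo=datetime.timezone.utc)
    return str(int(moment.timestamp()))

after_example = to_epoch(2020, 12, 27)
before_example = to_epoch(2021, 1, 5, 11, 59, 59)
print(after_example, before_example)
```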

In [None]:
# We need to run this function outside the loop first to get the updated after variable
data_12 = getPushshiftData(after, before, sub)

https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609027200&before=1609847999&subreddit=lonely


In [None]:
data_12

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'AA1723',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_3ofdaeh9',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1609027648,
  'domain': 'self.lonely',
  'full_link': 'https://www.reddit.com/r/lonely/comments/kkt2w3/been_feeling_extra_lonely_as_of_late_and_haunted/',
  'gildings': {},
  'id': 'kkt2w3',
  'is_crosspostable': True,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': True,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_only': False,
  'no_follow': True,
  'num_comments': 8,
  'num_crossposts': 0,
  'o

In [None]:
#This function will be used to extract the key data points from each JSON result
def collectSubData(subm):
    #subData was created at the start to hold all the data which is then added to our global subStats dictionary.
    subData = list() #list to store data points
    title = subm['title']
    url = subm['url']
    try:
        body = subm['selftext']
    except KeyError:
        body = "NaN"
    #flairs are not always present so we wrap in try/except
    try:
        flair = subm['link_flair_text']
    except KeyError:
        flair = "NaN"
    author = subm['author']
    sub_id = subm['id']
    score = subm['score']
    created = datetime.datetime.fromtimestamp(subm['created_utc']) #1520561700.0
    numComms = subm['num_comments']
    permalink = subm['permalink']

    #Put all data points into a tuple and append to subData
    subData.append((sub_id,title,body,url,author,score,created,numComms,permalink,flair))
    #Create a dictionary entry of current submission data and store all data related to it
    subStats[sub_id] = subData
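
The `try`/`except KeyError` pattern for optional fields can also be written with `dict.get` and a default value. A minimal sketch under that assumption follows; the function name `extract_fields` and the sample record are made up for illustration and only cover a few of the fields collected above.

```python
def extract_fields(subm):
    """Pull a few fields of interest out of one submission dict."""
    return {
        "id": subm["id"],                             # required: KeyError if absent
        "title": subm["title"],                       # required
        "body": subm.get("selftext", "NaN"),          # optional field
        "flair": subm.get("link_flair_text", "NaN"),  # optional field
    }

# Made-up record for illustration; real records come from the API response.
sample = {"id": "abc123", "title": "Example title", "selftext": "Example body"}
fields = extract_fields(sample)
print(fields)
```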

In [None]:
#subCount tracks the no. of total submissions we collect
subCount = 0
#subStats is the dictionary where we will store our data.
subStats = {}

In [None]:
# Will run until all posts have been gathered i.e. When the length of data variable = 0
# from the 'after' date up until before date
while len(data_12) > 0: #The length of data is the number submissions (data[0], data[1] etc), once it hits zero (after and before vars are the same) end
    for submission in data_12:
        collectSubData(submission)
        subCount+=1
    # Calls getPushshiftData() with the created date of the last submission
    print(len(data_12))
    print(str(datetime.datetime.fromtimestamp(data_12[-1]['created_utc'])))
    #update after variable to last created date of submission
    after = data_12[-1]['created_utc']
    #data has changed due to the new after variable provided by above code
    data_12 = getPushshiftData(after, before, sub)

print(len(data_12))

100
2020-12-27 21:47:08
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609073228&before=1609847999&subreddit=lonely
100
2020-12-28 13:58:26
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609131506&before=1609847999&subreddit=lonely
100
2020-12-29 06:12:54
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609189974&before=1609847999&subreddit=lonely
100
2020-12-29 18:12:19
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609233139&before=1609847999&subreddit=lonely
100
2020-12-30 11:29:59
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609295399&before=1609847999&subreddit=lonely
100
2020-12-31 04:38:38
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609357118&before=1609847999&subreddit=lonely
100
2020-12-31 17:56:24
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609404984&before=1609847999&subreddit=lonely
100
2021-01-01 07:23:40
https://api.pushs
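
In the log above each response holds 100 submissions, so the loop keeps re-querying with `after` set to the `created_utc` of the last submission received, and stops when a response comes back empty. The same logic is sketched below against an in-memory stand-in for the API so that it runs offline. `POSTS`, `PAGE_SIZE`, `fake_fetch`, and `collect_all` are all hypothetical names for this illustration.

```python
# In-memory stand-in for the API: 250 fake posts, one per minute.
POSTS = [{"id": i, "created_utc": 1609027200 + 60 * i} for i in range(1, 251)]
PAGE_SIZE = 100

def fake_fetch(after, before):
    """Return up to PAGE_SIZE posts with after < created_utc < before, oldest first."""
    matches = [p for p in POSTS if after < p["created_utc"] < before]
    return matches[:PAGE_SIZE]

def collect_all(after, before):
    """Page through the fake API until an empty page is returned."""
    collected = []
    page = fake_fetch(after, before)
    while len(page) > 0:
        collected.extend(page)
        # Move the window start to the newest post seen so far.
        after = page[-1]["created_utc"]
        page = fake_fetch(after, before)
    return collected

all_posts = collect_all(1609027200, 1609847999)
print(len(all_posts))
```

In this sketch the window start is exclusive, so two posts sharing the exact same `created_utc` at a page boundary would lose the second one; keying the results by post id, as `subStats` does, at least prevents duplicates.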

In [None]:
#Check submission
print(str(len(subStats)) + " submissions have added to list")
print("1st entry is:")
#Tuple layout is (sub_id, title, body, url, author, score, created, ...), so index 6 is the creation time
print(list(subStats.values())[0][0][1] + " created: " + str(list(subStats.values())[0][0][6]))
print("Last entry is:")
print(list(subStats.values())[-1][0][1] + " created: " + str(list(subStats.values())[-1][0][6]))

1571 submissions have added to list
1st entry is:
Been feeling extra lonely as of late and haunted by memories 😔 could use some distractions. created: 1
Last entry is:
Anyone up rn? created: 1


In [None]:
def updateSubs_file():
    upload_count = 0
    #location = "\\Reddit Data\\" >> If you're running this outside of a notebook you'll need this to direct to a specific location
    print("input filename of submission file, please add .csv")
    filename = input() #This asks the user what to name the file
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        a = csv.writer(file, delimiter=',')
        headers = ["Post ID","Title", "Body","Url","Author","Score","Publish Date","Total No. of Comments","Permalink","Flair"]
        a.writerow(headers)
        for sub in subStats:
            a.writerow(subStats[sub][0])
            upload_count+=1

        print(str(upload_count) + " submissions have been uploaded")
updateSubs_file()

input filename of submission file, please add .csv
12~1_lonely_subreddit.csv
1571 submissions have been uploaded


In [None]:
df_12= pd.read_csv("12~1_lonely_subreddit.csv")

## 2. Posts from the "loneliness" subreddit

### December 27 to January 5

In [None]:
#Create your timestamps and queries for your search URL
#https://www.unixtimestamp.com/index.php > Use this to create your timestamps
after = "1609027200" #Submissions after this timestamp
before = "1609847999" #Submissions before this timestamp
query = None #Keyword(s) to look for in submissions
sub = "loneliness" #Which Subreddit to search in
#sub = "loneliness"

In [None]:
# We need to run this function outside the loop first to get the updated after variable
data_12 = getPushshiftData(after, before, sub)

https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609027200&before=1609847999&subreddit=loneliness


In [None]:
data_12

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'casperthespookyghost',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_8tn5q3qi',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1609031491,
  'domain': 'self.loneliness',
  'full_link': 'https://www.reddit.com/r/loneliness/comments/kku4o1/something_positive_here/',
  'gildings': {},
  'id': 'kku4o1',
  'is_crosspostable': True,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': True,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_only': False,
  'no_follow': True,
  'num_comments': 3,
  'num_crossposts': 0,
  'over

In [None]:
#This function will be used to extract the key data points from each JSON result
def collectSubData(subm):
    #subData was created at the start to hold all the data which is then added to our global subStats dictionary.
    subData = list() #list to store data points
    title = subm['title']
    url = subm['url']
    try:
        body = subm['selftext']
    except KeyError:
        body = "NaN"
    #flairs are not always present so we wrap in try/except
    try:
        flair = subm['link_flair_text']
    except KeyError:
        flair = "NaN"
    author = subm['author']
    sub_id = subm['id']
    score = subm['score']
    created = datetime.datetime.fromtimestamp(subm['created_utc']) #1520561700.0
    numComms = subm['num_comments']
    permalink = subm['permalink']

    #Put all data points into a tuple and append to subData
    subData.append((sub_id,title,body,url,author,score,created,numComms,permalink,flair))
    #Create a dictionary entry of current submission data and store all data related to it
    subStats[sub_id] = subData

In [None]:
#subCount tracks the no. of total submissions we collect
subCount = 0
#subStats is the dictionary where we will store our data.
subStats = {}

In [None]:
# Will run until all posts have been gathered i.e. When the length of data variable = 0
# from the 'after' date up until before date
while len(data_12) > 0: #The length of data is the number submissions (data[0], data[1] etc), once it hits zero (after and before vars are the same) end
    for submission in data_12:
        collectSubData(submission)
        subCount+=1
    # Calls getPushshiftData() with the created date of the last submission
    print(len(data_12))
    print(str(datetime.datetime.fromtimestamp(data_12[-1]['created_utc'])))
    #update after variable to last created date of submission
    after = data_12[-1]['created_utc']
    #data has changed due to the new after variable provided by above code
    data_12 = getPushshiftData(after, before, sub)

print(len(data_12))

36
2021-01-05 10:33:25
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1609810405&before=1609847999&subreddit=loneliness
0


In [None]:
#Check submission
print(str(len(subStats)) + " submissions have added to list")
print("1st entry is:")
#Tuple layout is (sub_id, title, body, url, author, score, created, ...), so index 6 is the creation time
print(list(subStats.values())[0][0][1] + " created: " + str(list(subStats.values())[0][0][6]))
print("Last entry is:")
print(list(subStats.values())[-1][0][1] + " created: " + str(list(subStats.values())[-1][0][6]))

36 submissions have added to list
1st entry is:
Something positive here ;) created: 1
Last entry is:
Negative Health Effects? created: 1


In [None]:
def updateSubs_file():
    upload_count = 0
    #location = "\\Reddit Data\\" >> If you're running this outside of a notebook you'll need this to direct to a specific location
    print("input filename of submission file, please add .csv")
    filename = input() #This asks the user what to name the file
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        a = csv.writer(file, delimiter=',')
        headers = ["Post ID","Title", "Body","Url","Author","Score","Publish Date","Total No. of Comments","Permalink","Flair"]
        a.writerow(headers)
        for sub in subStats:
            a.writerow(subStats[sub][0])
            upload_count+=1

        print(str(upload_count) + " submissions have been uploaded")
updateSubs_file()

input filename of submission file, please add .csv
12~1_loneliness_subreddit.csv
36 submissions have been uploaded


In [None]:
df_12= pd.read_csv("12~1_loneliness_subreddit.csv")