## Reddit PRAW Scraper

### Using Reddit API

For fetching Reddit data using API, we will be using a Python wrapper to Reddit API: [PRAW: The Python Reddit API Wrapper](https://github.com/praw-dev/praw)

Documentation: https://praw.readthedocs.io/en/latest/index.html

In [3]:
import praw

import nltk, re, pprint

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import word_tokenize, tokenize
from nltk import FreqDist

import pandas as pd
import numpy as np

from sklearn.datasets import load_files
from sklearn.metrics import classification_report

import os

In [4]:
reddit = praw.Reddit(client_id='', 
                     client_secret='', 
                     user_agent='')

In [22]:
# get 10 hot posts from the MachineLearning subreddit
# hot_posts = reddit.subreddit('datascience').hot(limit=10)  # hot posts

new_posts = reddit.subreddit('datascience').new(limit=10)  # new posts

# get hottest posts from all subreddits
# hot_posts = reddit.subreddit('all').hot(limit=10)

In [6]:
all_posts = list(new_posts) 

In [7]:
for post in all_posts:
    print(f"id : {post.id}")
    print(f"title : {post.title}")
    print(f"url : {post.url}")
    print(f"author : {str(post.author)} {type(str(post.author))}")
    print(f"score : {post.score} {type(post.score)} ")
    print(f"subreddit : {post.subreddit} {type(post.subreddit)} ")
    print(f"num_comments : {post.num_comments}")
    print(f"body : {post.selftext}")
    print(f"created : {post.created}")
    print(f"link_flair_text : {post.link_flair_text}")
    break  # break the loop after printing information about the first post

id : 1hurdd1
title : Weekly Entering & Transitioning - Thread 06 Jan, 2025 - 13 Jan, 2025
url : https://www.reddit.com/r/datascience/comments/1hurdd1/weekly_entering_transitioning_thread_06_jan_2025/
author : AutoModerator <class 'str'>
score : 7 <class 'int'> 
subreddit : datascience <class 'praw.models.reddit.subreddit.Subreddit'> 
num_comments : 41
body :  

Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

* Learning resources (e.g. books, tutorials, videos)
* Traditional education (e.g. schools, degrees, electives)
* Alternative education (e.g. online courses, bootcamps)
* Job search questions (e.g. resumes, applying, career prospects)
* Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and Resources pages on

In [24]:
reddit_df = pd.DataFrame([vars(post) for post in new_posts])

In [26]:
reddit_df

Unnamed: 0,comment_limit,comment_sort,_reddit,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,...,media,is_video,_fetched,_additional_fetch_params,_comments_by_id,post_hint,crosspost_parent_list,url_overridden_by_dest,preview,crosspost_parent
0,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,"Hello,\nIs there a way to get an image from an...",t2_bpcrc4t2k,False,,0,...,,False,False,{},{},,,,,
1,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,,t2_5pa1eqhy,False,,0,...,,False,False,{},{},link,"[{'approved_at_utc': None, 'subreddit': 'OpenA...",/r/OpenAI/comments/1hwc8xp/cag_improved_rag_fr...,{'images': [{'source': {'url': 'https://extern...,t3_1hwc8xp
2,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,"As the title says, which one would you install...",t2_10u15itxe6,False,,0,...,,False,False,{},{},,,,,
3,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,I started last year at my second full-time dat...,t2_1zkrsyfq,False,,0,...,,False,False,{},{},,,,,
4,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,I'm running a gradient boosting machine with t...,t2_6cjiszgb,False,,0,...,,False,False,{},{},,,,,
5,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,Hi all\n\nI've been in DS and aligned fields i...,t2_t8udov,False,,0,...,,False,False,{},{},,,,,
6,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,,t2_5pa1eqhy,False,,0,...,,False,False,{},{},link,"[{'approved_at_utc': None, 'subreddit': 'OpenA...",/r/OpenAI/comments/1hvnjf6/tried_leetcode_prob...,{'images': [{'source': {'url': 'https://extern...,t3_1hvnjf6
7,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,So I tried to compile a list of top LLMs (acco...,t2_5pa1eqhy,False,,0,...,,False,False,{},{},self,,,{'images': [{'source': {'url': 'https://extern...,
8,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,"Hey all. First, I'd like to thank everyone for...",t2_9wge0haf,False,,0,...,,False,False,{},{},,,,,
9,2048,confidence,<praw.reddit.Reddit object at 0x16a466350>,,datascience,I am doing a bachelor in DS but honestly i bee...,t2_1e45ka03,False,,0,...,,False,False,{},{},,,,,


In [30]:
reddit_df = reddit_df[['id', 'title', 'url', 'author', 'score', 'subreddit', 'num_comments', 
                 'selftext', 'created', 'link_flair_text']]
# reddit_df = reddit_df.astype(str)

In [28]:
reddit_df.columns

Index(['comment_limit', 'comment_sort', '_reddit', 'approved_at_utc',
       'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title',
       'gilded',
       ...
       'media', 'is_video', '_fetched', '_additional_fetch_params',
       '_comments_by_id', 'post_hint', 'crosspost_parent_list',
       'url_overridden_by_dest', 'preview', 'crosspost_parent'],
      dtype='object', length=116)

In [34]:
reddit_df.head()

Unnamed: 0,id,title,url,author,score,subreddit,num_comments,selftext,created,link_flair_text
0,1hwmsd2,absolute path to image in shiny ui,https://www.reddit.com/r/datascience/comments/...,Due-Duty961,0,datascience,1,"Hello,\nIs there a way to get an image from an...",1736350000.0,Coding
1,1hwcayh,CAG : Improved RAG framework using cache,/r/OpenAI/comments/1hwc8xp/cag_improved_rag_fr...,mehul_gupta1997,2,datascience,3,,1736314000.0,AI
2,1hw5s76,As of 2025 which one would you install? Minifo...,https://www.reddit.com/r/datascience/comments/...,SmartPercent177,32,datascience,71,"As the title says, which one would you install...",1736294000.0,Discussion
3,1hvzskd,Change my mind: feature stores are needless co...,https://www.reddit.com/r/datascience/comments/...,Any-Fig-921,109,datascience,46,I started last year at my second full-time dat...,1736278000.0,Discussion
4,1hvy3ld,Gradient boosting machine still running after ...,https://www.reddit.com/r/datascience/comments/...,RobertWF_47,16,datascience,38,I'm running a gradient boosting machine with t...,1736274000.0,ML
