# Scraping Reddit Data  

<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/TannerGilbert/Tutorials/blob/master/Reddit%20Webscraping%20using%20PRAW/Reddit%20API.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/TannerGilbert/Tutorials/blob/master/Reddit%20Webscraping%20using%20PRAW/Reddit%20API.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td></table>

![](https://www.redditstatic.com/new-icon.png)  
PRAW (the Python Reddit API Wrapper) is a Python library that wraps the Reddit API, which makes it straightforward to collect data from Reddit or to build a Reddit bot.

In [1]:
#!pip install praw


In [2]:
import praw
import pandas as pd

Before PRAW can be used to scrape data, we need to authenticate. To do so, create a `Reddit` instance and pass it a `client_id`, a `client_secret`, and a `user_agent`. To obtain the ID and secret, create a Reddit application on [this page](https://www.reddit.com/prefs/apps).

In [3]:
reddit = praw.Reddit(client_id='...',
                     client_secret='...',
                     user_agent='...')
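Hard-coding the secret in a notebook makes it easy to leak when the notebook is shared. One option, sketched below, is to read the three values from environment variables. The variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) and the helper `load_reddit_credentials` are hypothetical choices for this example, not something PRAW defines.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from environment variables."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: env[var] for kwarg, var in REQUIRED_VARS.items()}


# Usage (requires praw and the variables above to be set):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early with the list of missing variables tends to be easier to debug than an authentication error raised on the first request.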

We can fetch information or posts from a specific subreddit by calling the `reddit.subreddit` method with the subreddit's name. The special name `all` covers posts from every subreddit.

In [6]:
# get hot posts from all subreddits
hot_posts = reddit.subreddit('all').hot(limit=10)
for post in hot_posts:
    print(post.title)

Not this year
This is NOT going to end well:
me_irl
Trump (claimed height of 6’3) standing next to Vivek Ramaswamy (5’10)
These kids are so pure
Meanwhile in China
Sports Scholarship
American Rule
Well done 🤣
nobody wants to work anymore [oc]


In [4]:
# get 10 hot posts from the Pikabu subreddit
hot_posts = reddit.subreddit('Pikabu').hot(limit=10)

`hot()` returns a lazy listing, so the posts are only fetched as we iterate over it. We can now loop through the 10 posts and print their titles.

In [5]:
for post in hot_posts:
    print(post.title)

Военной ситуации мегатред
Но запах! […] Это был запах… победы!
Нюдсы на каждый день
Циник. Злодейское [слайдер]
Когда не в курсе последних новостей.
Бывает ¯\_(ツ)_/¯
Извините, на наши 40 см не было картинки
Румынские бандиты
Так началось восстание машин...
Это неправильные пчёлы (с)
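The title is only one of the fields a post exposes. The sketch below collects a few of them into a dictionary per post. `post_to_record` is a hypothetical helper written for this example; it assumes the object has the `id`, `title`, `score`, `num_comments`, `url`, and `created_utc` attributes that PRAW submissions provide. A stand-in object is used so the snippet runs without network access.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def post_to_record(post):
    """Pick a few commonly used fields from a PRAW submission-like object."""
    return {
        "id": post.id,
        "title": post.title,
        "score": post.score,
        "num_comments": post.num_comments,
        "url": post.url,
        # created_utc is a Unix timestamp in seconds
        "created": datetime.fromtimestamp(post.created_utc, tz=timezone.utc),
    }


# Stand-in for a real submission; with PRAW you would pass the posts
# yielded by reddit.subreddit(...).hot(...) instead.
fake_post = SimpleNamespace(
    id="abc123", title="Example title", score=42,
    num_comments=7, url="https://example.com", created_utc=0,
)
record = post_to_record(fake_post)
print(record["title"], record["created"].isoformat())
```

With real posts, `pd.DataFrame([post_to_record(p) for p in hot_posts])` should turn the records into a table.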


In [7]:
# Specify the URL of the Reddit post you want to parse
post_url = 'https://www.reddit.com/r/Pikabu/comments/110gz4q/наггетсы_проверенный_рецепт/'

# Create a submission object for the post
submission = reddit.submission(url=post_url)

In [9]:
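# Drop unexpanded "load more comments" placeholders, which have no body
submission.comments.replace_more(limit=0)
for top_level_comment in submission.comments:
    print(top_level_comment.body)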

- 1 кг куриного филе
- 150г. панировочных сухарей.
- 150г кукурузных хлопьев.
- 2 ст.л. кукурузной муки. Либо обычной пшеничной.
- 1 ч.л. с горкой паприки
- 1 ч.л. с горкой сухого чеснока
- 3 ч.л. с горкой соли
- 3/4 ч.л. черного перца.
- 2 ст.л. молока или нежирных сливок
- 2 яйца
- сахар -2ч.л.
- растительное масло
++++++++++++++++++++++++++++
- Мясо минут за 20 забросить в морозилку, чтобы его легче было аккуратно нарезать.
- Хлопья измельчить в блендере пока они не станут размером примерно с фракцию ваших сухарей.
- Все специи объединить вместе с солью и перемешать.
- В большой миске
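Iterating over `submission.comments` visits only the top-level comments; replies hang off each comment's `replies` attribute. A minimal sketch of a depth-first walk over that tree follows. `walk_comments` is a hypothetical helper, and the stand-in objects below only mimic the `body` and `replies` attributes of PRAW comments so the snippet runs offline.

```python
from types import SimpleNamespace


def walk_comments(comments, depth=0):
    """Yield (depth, body) pairs for a comment tree, depth-first."""
    for comment in comments:
        # Unexpanded "load more comments" placeholders have no body; skip them.
        if not hasattr(comment, "body"):
            continue
        yield depth, comment.body
        yield from walk_comments(comment.replies, depth + 1)


# Tiny stand-in tree; with PRAW you would pass submission.comments.
reply = SimpleNamespace(body="a reply", replies=[])
top = SimpleNamespace(body="top-level comment", replies=[reply])
placeholder = SimpleNamespace(replies=[])  # mimics a placeholder without a body

for depth, body in walk_comments([top, placeholder]):
    print("  " * depth + body)
```

If nesting does not matter, `submission.comments.list()` returns the same comments as a flat list.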

In [28]:
# Choose the subreddit and the number of posts to extract
# (Reddit listings return at most about 1000 items)
subreddit_name = 'Pikabu'
num_posts_to_extract = 1000

all_comments = []

# Iterate over the newest posts in the subreddit
for submission in reddit.subreddit(subreddit_name).new(limit=num_posts_to_extract):
    # Expand all "load more comments" placeholders (extra requests, so this can be slow)
    submission.comments.replace_more(limit=None)

    # Flatten the comment tree and append every comment to all_comments
    for comment in submission.comments.list():
        all_comments.append({
            'Author': str(comment.author),
            'Score': comment.score,
            'Comment': comment.body,
            'Submission Title': submission.title,
        })

# Export to a CSV file:
df = pd.DataFrame(all_comments)
df.to_csv('reddit_comments_pikabu.csv', index=False)
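Note that `str(comment.author)` stores the literal string `'None'` for comments whose author account was deleted. If that distinction matters, a row builder along the following lines could keep missing authors as real nulls and also record the comment ID and creation time. `comment_to_row` and the extra column names are hypothetical; the sketch assumes the `id`, `author`, `score`, `body`, and `created_utc` attributes of PRAW comments and uses stand-in objects so it runs offline.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def comment_to_row(comment, submission_title):
    """Build one CSV row from a PRAW comment-like object."""
    author = comment.author  # None when the account was deleted
    created = datetime.fromtimestamp(comment.created_utc, tz=timezone.utc)
    return {
        "Comment ID": comment.id,
        "Author": author.name if author is not None else None,
        "Score": comment.score,
        "Comment": comment.body,
        "Created (UTC)": created.isoformat(),
        "Submission Title": submission_title,
    }


# Stand-in comments: one with an author, one from a deleted account.
known = SimpleNamespace(id="c1", author=SimpleNamespace(name="someone"),
                        score=3, body="hello", created_utc=0)
deleted = SimpleNamespace(id="c2", author=None,
                          score=1, body="[deleted]", created_utc=60)

rows = [comment_to_row(c, "Example post") for c in (known, deleted)]
print(rows[1]["Author"])
```

Inside the loop above, `all_comments.append(comment_to_row(comment, submission.title))` would replace the inline dictionary.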

In [29]:
df.head()

|   | Author | Score | Comment | Submission Title |
|---|--------|-------|---------|------------------|
| 0 | RECabu | 2 | Записал на видеокассету \*\*\[Вечное сияние чисто... | Будильник |
| 1 | SamSamABC1 | 3 | может как это? - \n\nhttps://preview.redd.it/0... | Выглядит залипательно, как какой-то физический... |
| 2 | IvanovRomannn | -1 | Напоминает нашу демократию | Выглядит залипательно, как какой-то физический... |
| 3 | RECabu | 1 | Записал на видеокассету \*\*\[Вспомнить всё\](http... | Выглядит залипательно, как какой-то физический... |
| 4 | Initial-Carpenter | 1 | Красивое | Выглядит залипательно, как какой-то физический... |


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20240 entries, 0 to 20239
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Author            20240 non-null  object
 1   Score             20240 non-null  int64 
 2   Comment           20240 non-null  object
 3   Submission Title  20240 non-null  object
dtypes: int64(1), object(3)
memory usage: 632.6+ KB
