# Collecting Reddit Data with the Reddit API


In [1]:
!pip install python-dotenv
!pip install praw

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
Collecting praw
  Downloading praw-7.8.1-py3-none-any.whl.metadata (9.4 kB)
Collecting prawcore<3,>=2.4 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update_checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.8.1-py3-none-any.whl (189 kB)
Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update_checker, prawcore, praw
Successfully installed praw-7.8.1 prawcore-2.4.0 update_checker-0.18.0


In [2]:
# Import libraries
import praw
import datetime
import pandas as pd
import os
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load environment variables
load_dotenv('x.env')

# Set up credentials from env file
reddit = praw.Reddit(
    client_id=os.getenv("CLIENT_ID"),
    client_secret=os.getenv("CLIENT_SECRET"),
    user_agent=os.getenv("USER_AGENT")
)
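
The cell above reads `CLIENT_ID`, `CLIENT_SECRET`, and `USER_AGENT` from `x.env`. Because `os.getenv` returns `None` for a variable that is not set, a quick check before building the client can make a misnamed or missing entry easier to spot. The following is a minimal sketch; the helper name `missing_credentials` and the stand-in mapping are illustrative assumptions, not part of the notebook.

```python
import os

# Variable names taken from the credentials cell above
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "USER_AGENT")

def missing_credentials(env=os.environ, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty (hypothetical helper)."""
    return [key for key in required if not env.get(key)]

# Example with a stand-in mapping instead of the real environment
example_env = {"CLIENT_ID": "abc", "CLIENT_SECRET": "", "USER_AGENT": "my-script/0.1"}
print(missing_credentials(example_env))  # ['CLIENT_SECRET']
```

In the notebook, calling `missing_credentials()` with no arguments right after `load_dotenv('x.env')` would check the real environment.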

### Get Training Data

In [4]:
# Training data range, defined in UTC to match Reddit's created_utc timestamps
start_date = datetime.datetime(2024, 11, 1, tzinfo=datetime.timezone.utc)  # November 1, 2024, 00:00 UTC
end_date = datetime.datetime(2025, 1, 31, tzinfo=datetime.timezone.utc)  # January 31, 2025, 00:00 UTC (later posts on Jan 31 are excluded)

# Convert datetime dates to Unix timestamps
start_timestamp = int(start_date.timestamp())
end_timestamp = int(end_date.timestamp())

# Subreddits to search for posts mentioning 'NVIDIA'
subreddits = ['stocks', 'investing', 'money', 'DayTrading', 'wallstreetbets']
# Posts that fall within the date range
train_filtered_posts = []
# Loop through subreddits
for subreddit in subreddits:
    curr_subreddit = reddit.subreddit(subreddit)

    # Search newest-first with no server-side time limit (time_filter='all');
    # the Nov-Jan window is applied client-side in the loop below.
    # Reddit caps listings (about 1,000 items), so limit=None may not return every match.
    posts = curr_subreddit.search("NVIDIA",
                                  sort='new',
                                  time_filter='all',
                                  limit=None)
    # Loop through search results and keep those within the time period
    for post in posts:
        if start_timestamp <= post.created_utc <= end_timestamp:
            # Naive UTC datetime of the post (replaces the deprecated utcfromtimestamp)
            post_date = datetime.datetime.fromtimestamp(
                post.created_utc, tz=datetime.timezone.utc).replace(tzinfo=None)
            train_filtered_posts.append({
                'Post_Title': post.title,
                'Post_URL': post.url,
                'Post_Text': post.selftext,
                'Date_Posted': post_date,
                'Upvotes': post.score,
                'Comments': post.num_comments,
                'Subreddit': post.subreddit.display_name,
            })

# Convert the list of dictionaries to pandas df
train_reddit_df = pd.DataFrame(train_filtered_posts)

print(len(train_reddit_df))
train_reddit_df.head()


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

447


Unnamed: 0,Post_Title,Post_URL,Post_Text,Date_Posted,Upvotes,Comments,Subreddit
0,Intel's revenue forecast disappoints as invest...,https://www.reddit.com/r/stocks/comments/1idxs...,"Intel's (INTC.O), opens new tab first-quarter ...",2025-01-30 21:17:18,238,79,stocks
1,Nvidia’s Prime time to buy,https://www.reddit.com/r/stocks/comments/1idqh...,\nStocks are emotional in nature. The Nvidia i...,2025-01-30 16:12:44,9,42,stocks
2,These are the stocks on my watchlist (01/30),https://www.reddit.com/r/stocks/comments/1ido0...,This is a daily watchlist for short-term tradi...,2025-01-30 14:20:34,25,15,stocks
3,1/30) - Thursday's Pre-Market News & Stock Movers,https://www.reddit.com/r/stocks/comments/1idni...,#Good morning traders and investors of the r/s...,2025-01-30 13:57:10,11,2,stocks
4,Meta's CAPEX Spending Exceeds the Combined Net...,https://www.reddit.com/r/stocks/comments/1id9r...,**META** plans to spend **$60-$65 billion** in...,2025-01-30 00:50:17,239,87,stocks


In [5]:
train_reddit_df.to_csv('train_reddit_df_w_text.csv', index=False)
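
The date filter in the training cell depends only on each post's `created_utc` value, so it can be exercised without calling the Reddit API. The sketch below wraps the same logic in a function and runs it on stand-in objects; the function name `collect_posts` and the fake posts built with `SimpleNamespace` are assumptions made for illustration, not PRAW objects or notebook code.

```python
import datetime
from types import SimpleNamespace

UTC = datetime.timezone.utc

def collect_posts(posts, start, end):
    """Return row dicts for posts whose created_utc lies in [start, end].

    `start` and `end` are timezone-aware datetimes; `posts` is any iterable of
    objects exposing the attributes used below (hypothetical helper).
    """
    start_ts, end_ts = start.timestamp(), end.timestamp()
    rows = []
    for post in posts:
        if start_ts <= post.created_utc <= end_ts:
            rows.append({
                'Post_Title': post.title,
                'Post_URL': post.url,
                'Post_Text': post.selftext,
                # Naive UTC datetime, matching the Date_Posted column
                'Date_Posted': datetime.datetime.fromtimestamp(
                    post.created_utc, tz=UTC).replace(tzinfo=None),
                'Upvotes': post.score,
                'Comments': post.num_comments,
                'Subreddit': post.subreddit.display_name,
            })
    return rows

def fake_post(title, when):
    """Build a stand-in object with the attributes the filter reads."""
    return SimpleNamespace(
        title=title, url='https://example.com', selftext='', score=1,
        num_comments=0, created_utc=when.timestamp(),
        subreddit=SimpleNamespace(display_name='stocks'),
    )

posts = [
    fake_post('inside window', datetime.datetime(2024, 12, 15, 12, 0, tzinfo=UTC)),
    fake_post('outside window', datetime.datetime(2025, 2, 3, 12, 0, tzinfo=UTC)),
]
rows = collect_posts(
    posts,
    start=datetime.datetime(2024, 11, 1, tzinfo=UTC),
    end=datetime.datetime(2025, 1, 31, tzinfo=UTC),
)
print([row['Post_Title'] for row in rows])  # ['inside window']
```

With a function like this, the training and test cells could share one code path and differ only in the dates passed in.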

### Get Test Data

In [6]:
# Test data range, defined in UTC to match Reddit's created_utc timestamps
start_date = datetime.datetime(2025, 2, 1, tzinfo=datetime.timezone.utc)  # February 1, 2025, 00:00 UTC
end_date = datetime.datetime(2025, 2, 7, tzinfo=datetime.timezone.utc)  # February 7, 2025, 00:00 UTC (later posts on Feb 7 are excluded)

# Convert datetime dates to Unix timestamps
start_timestamp = int(start_date.timestamp())
end_timestamp = int(end_date.timestamp())

# Subreddits to search for posts mentioning 'NVIDIA'
subreddits = ['stocks', 'investing', 'money', 'DayTrading', 'wallstreetbets']
test_filtered_posts = []
for subreddit in subreddits:
    curr_subreddit = reddit.subreddit(subreddit)

    # Search newest-first with no server-side time limit (time_filter='all');
    # the Feb 1-Feb 7 window is applied client-side in the loop below.
    # Reddit caps listings (about 1,000 items), so limit=None may not return every match.
    posts = curr_subreddit.search("NVIDIA",
                                  sort='new',
                                  time_filter='all',
                                  limit=None)
    # Loop through search results and keep those within the desired time range
    for post in posts:
        if start_timestamp <= post.created_utc <= end_timestamp:
            # Naive UTC datetime of the post (replaces the deprecated utcfromtimestamp)
            post_date = datetime.datetime.fromtimestamp(
                post.created_utc, tz=datetime.timezone.utc).replace(tzinfo=None)
            test_filtered_posts.append({
                'Post_Title': post.title,
                'Post_URL': post.url,
                'Post_Text': post.selftext,
                'Date_Posted': post_date,
                'Upvotes': post.score,
                'Comments': post.num_comments,
                'Subreddit': post.subreddit.display_name,
            })

# Convert the list of dictionaries to a DataFrame
test_reddit_df = pd.DataFrame(test_filtered_posts)

print(len(test_reddit_df))
test_reddit_df.head()

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

40


Unnamed: 0,Post_Title,Post_URL,Post_Text,Date_Posted,Upvotes,Comments,Subreddit
0,Why did DeepSeek cause NVIDIA to drop? Doesn’t...,https://www.reddit.com/r/stocks/comments/1ijbs...,Doesn’t it use NVIDIA to train their LLMs as w...,2025-02-06 19:59:32,0,23,stocks
1,"SMCI - Road to Redemption, or The Final Blow?",https://www.reddit.com/r/stocks/comments/1ij45...,We are approaching what could be one of the la...,2025-02-06 14:45:18,36,67,stocks
2,These are the stocks on my watchlist (02/6),https://www.reddit.com/r/stocks/comments/1ij3k...,This is a daily watchlist for short-term tradi...,2025-02-06 14:18:22,52,16,stocks
3,Thinking about NVDA beyond 2025 Hyperscaler Ca...,https://www.reddit.com/r/stocks/comments/1iihl...,With 3/4 hyperscalers reporting earnings alrea...,2025-02-05 18:54:07,63,34,stocks
4,Big tech CapEx: 2024 vs. 2025 and increase in ...,https://www.reddit.com/r/stocks/comments/1ihzo...,\nAI infrastructure spending is accelerating f...,2025-02-05 02:29:16,41,14,stocks


In [7]:
test_reddit_df.to_csv('test_reddit_df_w_text.csv', index=False)
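
Both DataFrames are written with `to_csv`, which stores `Date_Posted` as text. When the files are loaded again, the column comes back as plain strings unless it is parsed explicitly. The sketch below shows one way to restore it with `parse_dates`, using an in-memory buffer and a one-row stand-in DataFrame (both assumptions for illustration) rather than the real CSV files.

```python
import io
import pandas as pd

# One-row stand-in with the same kind of columns as the saved DataFrames
df = pd.DataFrame({
    'Post_Title': ['Example post'],
    'Date_Posted': [pd.Timestamp('2025-02-06 19:59:32')],
    'Upvotes': [0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the text column back to datetimes on load
restored = pd.read_csv(buffer, parse_dates=['Date_Posted'])
print(pd.api.types.is_datetime64_any_dtype(restored['Date_Posted']))  # True
```

For the notebook's files, the equivalent call would pass `'train_reddit_df_w_text.csv'` or `'test_reddit_df_w_text.csv'` in place of the buffer.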