# DSCI 511: Data Acquisition and Pre-Processing <br> Term Project Phase 1: Scoping a data set

## The big picture
Welcome to your term project! This is the first portion of a two-part, open-ended team assignment that will culminate in a presentation during the last week of class or the regularly scheduled final exam period. Overall, this term project is intended to provide some open-ended experience with building a complex dataset and making it available. Specifically, all projects for this course will entail the following:

- The construction, acquistion, integration, enrichment, and distribution of a project-motivated and computationally significant dataset.

The first report on your team's project will constitue a discussion of what the dataset is that you want to build/access/create, why you believe it will be possilbe to conduct, and how long you believe it will take to build, in addition to discussion of the sorts of tasks that will be involved. Additionally, this inital project planning report should speculate and provide examples potential dataset uses, whether academic or commercial.

__Note__: All reports should inclue a high level abstract/discussion in a tone that is set for a completely diverse audience.

Later on, a final report will recap progress at the task you're group has come up with, specifically revisiting what you _though_ the dataset development would take,  as compared to the actual work involved and obstacles encountered.

__Important__: because your project reports will have discussion intermingled with data and code as output, I not only request the  submission of your work in Jupyter Notebooks format, but additionly recommend conducting your work as a group collaboratively in Jupyter notebooks.

## This is only a guideline

While I will provide some idea of structure and expectation for your project it is important to note that this is an intentionally open-ended project. Hence, no specific rubric is provided. The courses of different projects will require overcoming different obstacles, and success in a data science project is ultimately a (partial) function of a team's abaility to adapt to project needs. However, all work should be well documented, articulately presented, and justified. If at any point it is unclear what to do or how to represent your project's work, please do not hesitate to ask your instructor for direction.

## Your team

The first thing you'll have to do in this phase is organize into a project team. Data science is often conducted in teams, with different team members covering the diversity of knowledge and skills relevant to the different areas that a project must support to succeed. Even though our course is only focused on early-phase data science tasks (data set development), be sure to consider the strengths of your teamates and interests for gaining experience in dataset construction&mdash;if you want extensive experience with web scraping, pitch a project about this with a few other interested points. It will help to discuss interests. Be sure to write out the names of the project team's members in your first report and answer the two questions:

1. What areas/skills/domains does the team member presently identify with?
2. Into which areas/skills/domains would the team member like to grow?

## Your topic

The course of your project will be determined by two things:

1. the motivations present in your project's team and
2. the data your project is able to pull together.

Thus, choosing your topic is closely tied to both your team and the data you are able to identify. To start, discuss the domain interests present on your project team. Te get you on your way, let's start with two questions:

1. Is there an aspect of the IoT, natural world, society, literature, or art, etc. that you would like to investigate computationally through what might be considered 'data'?

2. What sort of data-medium are you interested to work with?&mdash;For example: transaction records, stock prices, memes and online conversations, open-domain poems, congressional records, News Articles, songs and popularity, Associated Press Images, transit records, call logs, CCTV footage, etcetera.

Whatever the direction you set for your project please make sure you document it well, keeping track of how its objectives and strategies change as you encounter available materials and other existing work.

## What you're responsible for in this phase
Ok, so here's the goal again for phase 1. You must:

- scope a computationally tangible artifact&mdash;heretofore known as the data set&mdash;whose study is expected to satisfy goals pertaining to the project's topic of interest.

This phase of the project will set expectations and a work plan for your project's open-ended work. Not only should you scope the collection of your dataset, but determine what mode's of distribution will be possible once its produces. Will you have to distribute access code, or will you be able to directly provide links to stored data.

Ultimately, the completion of your poject will produce raw materials for other folks (possibly you) interested in trying out analysis applications in future coursework (DSCI 521). So, as you identify a potential data set be sure to be realistic about what is possible to collect and how you can preprocess it for use! Ultimately, please make sure that some portion of your target data are guarenteed to be collectable. However, it's okay to try for some data that are a reach, just document any un- or partially successful efforts in your report and discuss what obstacles prevented those data from being collected.

### Things I'll be looking for in a Phase 1 report

- a background report on the team's members, their self-identified skills, and individual contributions
- a discussion of what you would like to your data to do/hope it is good for
- an exhibition of a sample of your data&mdash;show me it exists and what it looks like, even if very raw
- a discussion of who might be interested in your data set
- a discussion of how your data is limited and could be improved
- a discussion of how your data were created, e.g., people texting, The Earth's molten core spinning, etc.
- a discussion of what sort of access rights presently exist on your data and how/if you will make them available

As a heads up, by the end of the term and in your final report I'll be looking for things like
- a data dictionary or README.md that describes what is present in the data set and where or how to access
- code that documents the construction of your data&mdash;I should be able to re-construct/re-access it!
- code that allows me or someone else to interact with your data set
- tables and figures indicating the size and variety present in your data

_Note_: These are not exhaustive lists of topics or tasks worth covering in your project. In general, if there's something interesting about your dataset, whether relating to its construction, existence, representative population or _anything else_, then be sure to document it!

Reference to orgianl code in first cell below

## Pulling data

In [None]:
import praw
import json
import os
import time
from dotenv import load_dotenv
from datetime import datetime

load_dotenv()  # Load credential from .env

reddit = praw.Reddit(
    client_id=os.getenv("client_id"),
    client_secret=os.getenv("client_secret"),
    user_agent=os.getenv("user_agent")
)


# List of subreddits to take data
subreddits = ["wallstreetbets", "stocks", "investing"]
output_file = "hot_discussion_June1.json"

# Create file if not exist
if not os.path.exists(output_file):
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump([], f)

# Load data
with open(output_file, "r", encoding="utf-8") as f:
    all_data = json.load(f)

total_posts = 0

# Loop through each subreddit
for sub_name in subreddits:
    subreddit = reddit.subreddit(sub_name)
    print(f"\n Pulling hot posts from r/{sub_name}...")
    hot_posts = subreddit.hot(limit=1000)

    for post in hot_posts:
        post_created = datetime.utcfromtimestamp(post.created_utc)

        post_info = {
            "subreddit": sub_name,
            "post_id": post.id,
            "title": post.title,
            "score": post.score,
            "flair": post.link_flair_text,
            "created_utc": post_created.isoformat(),
            "num_comments": post.num_comments,
            "url": post.url,
            "comments": []
        }

        post.comments.replace_more(limit=0)
        for comment in post.comments:
            post_info["comments"].append({
                "comment_body": comment.body,
                "comment_score": comment.score,
                "comment_created_utc": datetime.utcfromtimestamp(comment.created_utc).isoformat()
            })

        all_data.append(post_info)
        total_posts += 1
        print(f"Collected post from r/{sub_name}: {post.title[:60]}... ({len(post_info['comments'])} comments)")

        time.sleep(1)  # Respect Reddit rate limit

# Save updated data
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(all_data, f, ensure_ascii=False, indent=2)


print("\nDone pulling data")
print(f"Total posts collected: {total_posts}")
print(f"Total posts saved in {output_file}: {len(all_data)}")



 Pulling hot posts from r/wallstreetbets...


  post_created = datetime.utcfromtimestamp(post.created_utc)
  "comment_created_utc": datetime.utcfromtimestamp(comment.created_utc).isoformat()


Collected post from r/wallstreetbets: Weekend Discussion Thread for the Weekend of May 30, 2025... (460 comments)
Collected post from r/wallstreetbets: Weekly Earnings Thread 6/2 - 6/6... (84 comments)
Collected post from r/wallstreetbets: Tariff cheat code... (37 comments)
Collected post from r/wallstreetbets: $30k interest-free margin loan idea... (50 comments)
Collected post from r/wallstreetbets: Can't go broke taking profits... (4 comments)
Collected post from r/wallstreetbets: How cooked am I?... (46 comments)
Collected post from r/wallstreetbets: UNH YOLO... (15 comments)
Collected post from r/wallstreetbets: Thanks for the gains Vlad!... (15 comments)
Collected post from r/wallstreetbets: momey... (41 comments)
Collected post from r/wallstreetbets: Started scalping once I hit 25k. No looking back now! 2 days... (56 comments)
Collected post from r/wallstreetbets: YOLO nvda calls... (53 comments)
Collected post from r/wallstreetbets: Should I sell PLTR? Up 1100%... (389 comments)

## Preprocessing

In [3]:
import pandas as pd
import json

file_path = '/Users/tomorrowcute/MSDS_2024_2026/spring_2025/DSCI511/Team project/Term project/hot_discussion_June1.json'

# Load the JSON file
with open(file_path, 'r') as file:
    data = json.load(file)

# Convert JSON to DataFrame
df = pd.DataFrame(data)

df.head()  


Unnamed: 0,subreddit,post_id,title,score,flair,created_utc,num_comments,url,comments
0,wallstreetbets,1kzdtf9,Weekend Discussion Thread for the Weekend of M...,168,Weekend Discussion,2025-05-30T19:57:24,10208,https://www.reddit.com/r/wallstreetbets/commen...,[{'comment_body': 'Farewell emojis these were ...
1,wallstreetbets,1kz68li,Weekly Earnings Thread 6/2 - 6/6,78,Earnings Thread,2025-05-30T14:52:51,259,https://i.redd.it/ypo8tjhfnx3f1.jpeg,[{'comment_body': 'New strat this week: 1. Wa...
2,wallstreetbets,1l0nofm,Tariff cheat code,737,Gain,2025-06-01T12:23:54,99,https://i.redd.it/f6w0psdo6b4f1.jpeg,[{'comment_body': ' **User Report**| | | | :--...
3,wallstreetbets,1l0rbl6,$30k interest-free margin loan idea,172,Discussion,2025-06-01T15:13:05,86,https://www.reddit.com/r/wallstreetbets/commen...,[{'comment_body': ' **User Report**| | | | :--...
4,wallstreetbets,1l0sbtr,Can't go broke taking profits,18,Gain,2025-06-01T15:54:17,7,https://i.redd.it/dhi43npr7c4f1.png,[{'comment_body': ' **User Report**| | | | :--...


In [None]:
# -------2. Posts Data cleaning -----
print("\nStep 2: cleaning posts data")

df_posts_cleaned = df.copy()

# Drop duplicate posts
print("\nDropping duplicate posts by post_id")
orginal_post_count = len(df_posts_cleaned)
df_posts_cleaned.drop_duplicates(subset=['post_id'], inplace=True, keep="first")
print(f"Posts before filtering: {orginal_post_count}")
print(f"Posts after filtering: {len(df_posts_cleaned)}")

# Handling missing flair if needed
print("\nHandling missing flair if needed")
df_posts_cleaned['flair'].fillna('No Flair', inplace=True)
print("Nulls in 'flair' after fillna:", df_posts_cleaned['flair'].isnull().sum())  # should be 0
print("Value counts for 'flair':")
print(df_posts_cleaned['flair'].value_counts(dropna=False).head())

# Data type conversion
print("\nData type conversion")
df_posts_cleaned['score'] = pd.to_numeric(df_posts_cleaned['score'], errors='coerce').astype('Int64')
df_posts_cleaned['num_comments'] = pd.to_numeric(df_posts_cleaned['num_comments'], errors='coerce').astype('Int64')
df_posts_cleaned['created_utc'] = pd.to_datetime(df_posts_cleaned['created_utc'], errors='coerce')
print("----Post data after conversion----\n", df_posts_cleaned.dtypes[["score", "num_comments", "created_utc"]])

# Check for NaNs introduced by "coerce"
print("---Nulls in 'score' after conversion:", df_posts_cleaned['score'].isnull().sum())
print("---Nulls in 'num_comments' after conversion:", df_posts_cleaned['num_comments'].isnull().sum())
print("---Nulls in 'created_utc' after conversion:", df_posts_cleaned['created_utc'].isnull().sum())

df_posts_cleaned.dropna(subset=['score', 'num_comments', 'created_utc', 'post_id', 'title'], inplace=True)
print("---Nulls in 'score' after dropna:", df_posts_cleaned['score'].isnull().sum())

print("\nskipping fixed historical date filtering")

print("\nLowercasing post titles")
df_posts_cleaned['title'] = df_posts_cleaned['title'].astype(str).str.lower()
print('---Sample of lowercased titles---')
print(df_posts_cleaned['title'].head())

# Remove Deleted/"Removed" posts
print("\nRemoving posts with '[deleted]' or '[removed]' titles...")
orginal_post_count_before_del_filter = len(df_posts_cleaned)
df_posts_cleaned = df_posts_cleaned[~df_posts_cleaned['title'].isin(['[deleted]', '[removed]'])]
print(f"Posts before filtering: {orginal_post_count_before_del_filter}")
print(f"Posts after filtering: {len(df_posts_cleaned)}")

if "df_posts_cleaned" not in locals():
    df_posts_cleaned = pd.DataFrame()
if "df_comments_final_cleaned" not in locals():
    df_comments_final_cleaned = pd.DataFrame()




Step 2: cleaning posts data

Dropping duplicate posts by post_id
Posts before filtering: 1120
Posts after filtering: 1120

Handling missing flair if needed
Nulls in 'flair' after fillna: 0
Value counts for 'flair':
flair
No Flair      564
Gain          114
YOLO           88
Discussion     49
News           40
Name: count, dtype: int64

Data type conversion
----Post data after conversion----
 score                    Int64
num_comments             Int64
created_utc     datetime64[ns]
dtype: object
---Nulls in 'score' after conversion: 0
---Nulls in 'num_comments' after conversion: 0
---Nulls in 'created_utc' after conversion: 0
---Nulls in 'score' after dropna: 0

skipping fixed historical date filtering

Lowercasing post titles
---Sample of lowercased titles---
0    weekend discussion thread for the weekend of m...
1                     weekly earnings thread 6/2 - 6/6
2                                    tariff cheat code
3                  $30k interest-free margin loan idea
4      

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_posts_cleaned['flair'].fillna('No Flair', inplace=True)


In [5]:
  # ---- 3. Comments data cleaning and prep ----
  print("\nStep 3: cleaning comments data")

  df_comments_final_cleaned = pd.DataFrame()

  if not df_posts_cleaned.empty and 'comments' in df_posts_cleaned.columns:
    print("\n a. Exploding comments from posts")

    mask_is_list = df_posts_cleaned['comments'].apply(lambda x: isinstance(x, list))
    posts_with_valid_comment_lists = df_posts_cleaned[mask_is_list]

    if not posts_with_valid_comment_lists.empty:
      print(f"---Total comment entries before explode: {len(df_posts_cleaned)}---")

      df_comments_exploded = posts_with_valid_comment_lists[["post_id", "comments"]].explode('comments').reset_index(drop=True)

      df_comments_exploded = df_comments_exploded[df_comments_exploded['comments'].notna()]
      print(f"---Total comment entries after explode: {len(df_comments_exploded)}---")


  # Extract comment info into Columns
      print("\n b. Extracting comment info into columns")
      if not df_comments_exploded.empty:
        df_comments_exploded.loc[:,'comment_body'] = df_comments_exploded['comments'].apply(lambda x: x.get('comment_body') if isinstance(x, dict) else None)
        df_comments_exploded.loc[:,'comment_score'] = df_comments_exploded['comments'].apply(lambda x: x.get('comment_score') if isinstance(x, dict) else None)
        df_comments_exploded.loc[:,'comment_created_utc'] = df_comments_exploded['comments'].apply(lambda x: x.get('comment_created_utc') if isinstance(x, dict) else None)

        #df_commments_final will be the dataframe for comments
        df_comments_intermediate = df_comments_exploded.drop(columns=['comments'])


        print("---Sample of comments data---")
        print(df_comments_intermediate.head())
        print(("---Null values in extracted comment columns---"))
        print(df_comments_intermediate[["comment_body", "comment_score", "comment_created_utc"]].isnull().sum())

        #Data Type Conversion for comments
        print("\n c. Data type conversion for comments")
        df_comments_intermediate.loc[:,'comment_score'] = pd.to_numeric(df_comments_intermediate['comment_score'], errors='coerce').astype('Int64')
        df_comments_intermediate.loc[:,'comment_created_utc'] = pd.to_datetime(df_comments_intermediate['comment_created_utc'], errors='coerce')

        print("---Comments data after conversion---")
        print(df_comments_intermediate.dtypes[["comment_score", "comment_created_utc"]])

        print("---Null values in 'comment_score' columns after coerce---")
        print(df_comments_intermediate["comment_score"].isnull().sum())
        print("---Null values in 'comment_created_utc' columns after coerce---")
        print(df_comments_intermediate["comment_created_utc"].isnull().sum())


        df_comments_intermediate.dropna(subset=['comment_score', 'comment_created_utc'], inplace=True)
        print("---Null values in 'comment_score' columns after dropna---")

        #Text cleaning for comment Bodies
        print("\n d. Lowercasing comment bodies")
        df_comments_intermediate.loc[:, 'comment_body_cleaned'] = df_comments_intermediate['comment_body'].astype(str).str.lower()

        print('---Sample of lowercased comment bodies---')
        print(df_comments_intermediate['comment_body'].head())
        print(df_comments_intermediate['comment_body_cleaned'].head())


        #Removed Deleted/Removed comments

        print("\n e. Removing deleted/removed comments")


        original_comment_count = len(df_comments_intermediate)
        df_comments_intermediate = df_comments_intermediate[~df_comments_intermediate['comment_body_cleaned'].isin(['deleted]', '[removed]'])]
        print(f"Comments before filtering: {original_comment_count}")
        print(f"Comments after filtering: {len(df_comments_intermediate)}")


    #f Removed empty comment Bodies
        print("\n f. Removing empty comment bodies")
        orginal_comment_count = len(df_comments_intermediate)
        df_comments_final_cleaned =  df_comments_intermediate [df_comments_intermediate['comment_body_cleaned'].str.strip() != '' ]
        print(f"Comments before filtering: {orginal_comment_count}")
        print(f"Comments after filtering: {len(df_comments_final_cleaned)}")
    else:
      print("No valid comments data to explode")
  else:
    print("No posts valid list-type data to explode")

  if "comments" in df_posts_cleaned.columns:
    df_posts_cleaned = df_posts_cleaned.drop(columns=["comments"])
    print("\nDropped 'comments' column from posts DataFrame")

if "df_comments_final_cleaned" not in locals() or not isinstance(df_comments_final_cleaned, pd.DataFrame):
    df_comments_final_cleaned = pd.DataFrame(columns=["post_id", "comment_body", "comment_score", "comment_created_utc", "comment_body_cleaned"])



Step 3: cleaning comments data

 a. Exploding comments from posts
---Total comment entries before explode: 1120---
---Total comment entries after explode: 42932---

 b. Extracting comment info into columns
---Sample of comments data---
   post_id                                       comment_body  comment_score  \
0  1kzdtf9  Farewell emojis these were my favorites:\n\n![...             61   
1  1kzdtf9               what a fucking close lmao clown shit             51   
2  1kzdtf9                  Risked 900k to make $7 today lmao            143   
3  1kzdtf9  Fellow regards, I’m learning. \n\nI didn’t buy...             94   
4  1kzdtf9         Please end at exactly 0.00% for the vibes.             45   

   comment_created_utc  
0  2025-05-30T20:58:52  
1  2025-05-30T19:59:09  
2  2025-05-30T20:05:05  
3  2025-05-30T20:24:07  
4  2025-05-30T19:59:32  
---Null values in extracted comment columns---
comment_body           0
comment_score          0
comment_created_utc    0
dtype: int

In [7]:
import re
from typing_extensions import final

print("\nStep 4: Adanced Text Cleaning for posts and comments")

def advanced_text_cleaner(text): #remove_punctuation
  if pd.isna(text):
    return ""
  text = str(text)
#Remove URLS
  text = re.sub(r'http\S+|www\.\S+', '', text , flags=re.MULTILINE)
# Remove user mentions
  text = re.sub(r'(?:/u/|u/)\w+|r/\w+', '', text)
# Remove hashtags
  text = re.sub(r'(@\w+|#\w+)', '', text)
# Remove specail characters , punctuation
  text = re.sub(r'[^\w\s]', '', text)
# Remove extra white space
  text = re.sub(r'\s+', ' ', text).strip()
  return text

#Apply to post titles

print("\nApplying advanced text cleaning to post titles")
if not df_posts_cleaned.empty and 'title' in df_posts_cleaned.columns:
  df_posts_cleaned.loc[:, 'title_nlp'] = df_posts_cleaned['title'].apply(advanced_text_cleaner)
  print("---Sample of cleaned post titles---")
  print(df_posts_cleaned[['title', "title_nlp"]].head())
else:
  print("No 'title' due to empty DataFrame.")
  if 'title_nlp' not in df_posts_cleaned.columns:
    df_posts_cleaned['title_nlp'] = pd.Series(dtype='object')


#Apply to comment bodies
if not df_comments_final_cleaned.empty and 'comment_body_cleaned' in df_comments_final_cleaned.columns:
  df_comments_final_cleaned.loc[:, 'comment_body_nlp'] = df_comments_final_cleaned['comment_body_cleaned'].apply(advanced_text_cleaner)
  print("---Sample of cleaned comment bodies---")
  print(df_comments_final_cleaned[['comment_body_cleaned', "comment_body_nlp"]].head())

  orginal_comment_count = len(df_comments_final_cleaned)
  df_comments_final_cleaned = df_comments_final_cleaned[df_comments_final_cleaned['comment_body_nlp'].str.strip() != '']
  print(f"Comments before filtering: {orginal_comment_count}")
  print(f"Comments after filtering: {len(df_comments_final_cleaned)}")
else:
  if 'comment_body_nlp' not in df_comments_final_cleaned.columns:
    df_comments_final_cleaned['comment_body_nlp'] = pd.Series(dtype='object')



Step 4: Adanced Text Cleaning for posts and comments

Applying advanced text cleaning to post titles
---Sample of cleaned post titles---
                                               title  \
0  weekend discussion thread for the weekend of m...   
1                   weekly earnings thread 6/2 - 6/6   
2                                  tariff cheat code   
3                $30k interest-free margin loan idea   
4                      can't go broke taking profits   

                                           title_nlp  
0  weekend discussion thread for the weekend of m...  
1                       weekly earnings thread 62 66  
2                                  tariff cheat code  
3                  30k interestfree margin loan idea  
4                       cant go broke taking profits  
---Sample of cleaned comment bodies---
                                comment_body_cleaned  \
0  farewell emojis these were my favorites:\n\n![...   
1               what a fucking close lmao cl

Saving and verification

In [None]:
#----5. Finalize Data Frames for output ---
print("\nStep 5: Finalize Data Frames for output")

#Defining final columns for 'posts' CSV

final_posts_columns = [
  'post_id',
  'title',
  'score',
  'flair',
  'created_utc',
  'num_comments',
  'url'
]

if not df_posts_cleaned.empty and 'title_nlp' in df_posts_cleaned.columns:
  for col in ['post_id', 'score', 'flair', 'created_utc', 'num_comments', 'url']:
    if col not in df_posts_cleaned.columns:
      df_posts_cleaned[col] = None #pd.Series(dtype='object')

  df_posts_output = df_posts_cleaned[['post_id','title_nlp', 'score', 'flair', 'created_utc', 'num_comments', 'url']].copy()
  df_posts_output.rename(columns={'title_nlp': 'title'}, inplace=True)

  df_posts_output = df_posts_output[final_posts_columns]

else:
  print("No 'title_nlp' due to empty DataFrame.")
  df_posts_output = pd.DataFrame(columns=final_posts_columns)

#Define columns for comments CSV
final_comments_columns = [
  'post_id',
  'comment_body', #comment_body_nlp
  'comment_score',
  'comment_created_utc'
]
if not df_comments_final_cleaned.empty and 'comment_body_nlp' in df_comments_final_cleaned.columns:
  for col in ['post_id', 'comment_score', 'comment_created_utc']:
    if col not in df_comments_final_cleaned.columns:
      df_comments_final_cleaned[col] = None #pd.Series(dtype='object')

  df_comments_output = df_comments_final_cleaned[['post_id', 'comment_body_nlp', 'comment_score', 'comment_created_utc']].copy()
  df_comments_output.rename(columns={'comment_body_nlp': 'comment_body'}, inplace=True)
  df_comments_output = df_comments_output[final_comments_columns]
else:
  print("No 'comment_body_nlp' due to empty DataFrame.")
  df_comments_output = pd.DataFrame(columns=final_comments_columns) #'post_id', 'comment_body', 'comment_score', 'comment_created_utc'

print("\nFinal Posts DataFrame for CSV (head):")
print(df_posts_output.head())
df_posts_output.info()

print("\nFinal Comments DataFrame for CSV (head):")
print(df_comments_output.head())
df_comments_output.info()


Step 5: Finalize Data Frames for output

Final Posts DataFrame for CSV (head):
   post_id                                              title  score  \
0  1kzdtf9  weekend discussion thread for the weekend of m...    168   
1  1kz68li                       weekly earnings thread 62 66     78   
2  1l0nofm                                  tariff cheat code    737   
3  1l0rbl6                  30k interestfree margin loan idea    172   
4  1l0sbtr                       cant go broke taking profits     18   

                flair         created_utc  num_comments  \
0  Weekend Discussion 2025-05-30 19:57:24         10208   
1     Earnings Thread 2025-05-30 14:52:51           259   
2                Gain 2025-06-01 12:23:54            99   
3          Discussion 2025-06-01 15:13:05            86   
4                Gain 2025-06-01 15:54:17             7   

                                                 url  
0  https://www.reddit.com/r/wallstreetbets/commen...  
1               https:

In [9]:

print("\nStep 6: Saving to JSON Files")
OUTPUT_JSON_FILE = "reddit_datafinal.json"

try:

  df_posts_output_serializable = df_posts_output.copy()
  if "created_utc" in df_posts_output_serializable.columns:
    df_posts_output_serializable["created_utc"] = df_posts_output_serializable["created_utc"].astype(str)

  df_comments_output_serializable = df_comments_output.copy()
  if "comment_created_utc" in df_comments_output_serializable.columns:
    df_comments_output_serializable["comment_created_utc"] = df_comments_output_serializable["comment_created_utc"].astype(str)


  posts_data = df_posts_output_serializable.to_dict(orient='records')
  comments_data = df_comments_output_serializable.to_dict(orient='records')

  #combine into a single dictionary
  combined_data = {
    "posts": posts_data,
    "comments": comments_data
  }

  with open(OUTPUT_JSON_FILE, "w", encoding="utf-8") as f:
    json.dump(combined_data, f, ensure_ascii=False, indent=4)

    print(f"Saved {len(posts_data)} posts and {len(comments_data)} comments to {OUTPUT_JSON_FILE}")
except Exception as e:
  print(f"Error saving data to {OUTPUT_JSON_FILE}: {e}")


print(f"\n---------------------------------------------------")
print(f"\n Done with data cleaning")
print(f"Saved {len(posts_data)} posts and {len(comments_data)} comments to {OUTPUT_JSON_FILE}")
print(f'---------------------------------------------------')



Step 6: Saving to JSON Files
Saved 1120 posts and 42322 comments to reddit_datafinal.json

---------------------------------------------------

 Done with data cleaning
Saved 1120 posts and 42322 comments to reddit_datafinal.json
---------------------------------------------------


In [10]:
#---7. Final Verifcation output ---
print("\nStep 7: Final Verifcation output")
if posts_data:
  print("\nSample of final saved posts (head):")
  print(posts_data[:3])
  if len(posts_data) > 3:
    print("\nSample of final saved posts (tail):")
    print(posts_data[-3:])
else:
  print("No posts to verify.")

if comments_data:
  print("\nSample of final saved posts (head):")
  print(comments_data[:3])
  if len(comments_data) > 3:
    print("\nSample of final saved posts (tail):")
    print(comments_data[-3:])
else:
  print("No comments to verify.")



Step 7: Final Verifcation output

Sample of final saved posts (head):
[{'post_id': '1kzdtf9', 'title': 'weekend discussion thread for the weekend of may 30 2025', 'score': 168, 'flair': 'Weekend Discussion', 'created_utc': '2025-05-30 19:57:24', 'num_comments': 10208, 'url': 'https://www.reddit.com/r/wallstreetbets/comments/1kzdtf9/weekend_discussion_thread_for_the_weekend_of_may/'}, {'post_id': '1kz68li', 'title': 'weekly earnings thread 62 66', 'score': 78, 'flair': 'Earnings Thread', 'created_utc': '2025-05-30 14:52:51', 'num_comments': 259, 'url': 'https://i.redd.it/ypo8tjhfnx3f1.jpeg'}, {'post_id': '1l0nofm', 'title': 'tariff cheat code', 'score': 737, 'flair': 'Gain', 'created_utc': '2025-06-01 12:23:54', 'num_comments': 99, 'url': 'https://i.redd.it/f6w0psdo6b4f1.jpeg'}]

Sample of final saved posts (tail):
[{'post_id': '1kjbjrj', 'title': 'has anyone used request for quote rfq orders to invest', 'score': 3, 'flair': 'No Flair', 'created_utc': '2025-05-10 14:27:38', 'num_commen