# Data Collection for Sentiment Analysis

This notebook is dedicated to collecting raw social media data for sentiment analysis. We will utilize APIs or web scraping techniques to gather the data.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Function to scrape posts from a subreddit
def scrape_reddit(subreddit, max_posts=500):
    base_url = f"https://www.reddit.com/r/{subreddit}/new/.rss"
    headers = {"User-Agent": "Mozilla/5.0"}
    
    posts = []
    
    while len(posts) < max_posts:
        response = requests.get(base_url, headers=headers)
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            break
        
        soup = BeautifulSoup(response.content, "xml")
        entries = soup.find_all("entry")

        for entry in entries:
            if len(posts) >= max_posts:
                break
            posts.append({
                "title": entry.title.text,
                "author": entry.author.find("name").text if entry.author else "Unknown",
                "link": entry.link["href"],
                "published": entry.published.text,
                "content": entry.content.text if entry.content else ""
            })

            
        
        time.sleep(1)  # Avoid too many requests in a short time
    
    return pd.DataFrame(posts)

# Scrape 500 posts from a subreddit (change 'technology' to any subreddit you want)
df = scrape_reddit("politics", 500)



# Display the first few rows
print(df.head())


                                               title            author  \
0  Putin offers to sell minerals to Trump, includ...       /u/Drext833   
1  Oversight agency finds Trump’s federal worker ...        /u/1900grs   
2  Latest News on Dogie - Department of Governmen...       /u/Staragox   
3  ‘It’s bedlam’: Federal workers left in limbo a...    /u/Ace-Cuddler   
4  Florida congressman investigated for alleged D...  /u/rapidcreek409   

                                                link  \
0  https://www.reddit.com/r/politics/comments/1ix...   
1  https://www.reddit.com/r/politics/comments/1ix...   
2  https://www.reddit.com/r/politics/comments/1ix...   
3  https://www.reddit.com/r/politics/comments/1ix...   
4  https://www.reddit.com/r/politics/comments/1ix...   

                   published  \
0  2025-02-25T13:18:11+00:00   
1  2025-02-25T13:16:52+00:00   
2  2025-02-25T13:09:07+00:00   
3  2025-02-25T13:08:30+00:00   
4  2025-02-25T13:06:40+00:00   

                         

## Initial Data Exploration

In this section, we will perform some initial exploration of the collected data to understand its structure and contents.

In [3]:
# Display basic information about the DataFrame
if 'df' in locals():
    print(df.info())
    print(df.describe())
else:
    print('DataFrame not available for exploration.')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      500 non-null    object
 1   author     500 non-null    object
 2   link       500 non-null    object
 3   published  500 non-null    object
 4   content    500 non-null    object
dtypes: object(5)
memory usage: 19.7+ KB
None
                                                    title       author  \
count                                                 500          500   
unique                                                 25           24   
top     ‘It’s bedlam’: Federal workers left in limbo a...  /u/zsreport   
freq                                                   24           40   

                                                     link  \
count                                                 500   
unique                                                 26   
top     https://www.red

## Save the Collected Data

Finally, we will save the collected data to the raw data directory for further processing.

In [5]:
import os

# Define the directory and file path
directory = '../data/raw'
file_path = os.path.join(directory, 'social_media_data.csv')

# Check if the directory exists, if not, create it
if not os.path.exists(directory):
    os.makedirs(directory)

# Save the DataFrame if it exists
if 'df' in locals():
    df.to_csv(file_path, index=False, mode='w')
    print('Data overwritten in data/raw/social_media_data.csv')
else:
    print('No data to save.')

OSError: Cannot save file into a non-existent directory: '..\data\raw'

In [17]:
df.shape

(500, 5)