# 01 - Data Collection
## INSY 669 Text Analytics | GLP-1 Weight Loss Drugs

This notebook documents our data collection process from three sources:
1. **Reddit** (r/Ozempic, r/Semaglutide, r/WegovyWeightLoss) via Arctic Shift API
2. **WebMD** (patient reviews) via web scraping of reviews.webmd.com
3. **News articles** (major health news outlets) via Google News RSS feeds

In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import time
import os
import warnings

warnings.filterwarnings('ignore')

## 1.1 Reddit Data Collection

We collected posts from three GLP-1-related subreddits using the **Arctic Shift API** (https://arctic-shift.photon-reddit.com/), a free archive of Reddit data that requires no authentication. To spread the sample across the study period, we queried each subreddit month by month from January through November 2024, requesting up to 100 posts per subreddit per month.

**API Endpoint:** `https://arctic-shift.photon-reddit.com/api/posts/search`
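
The month-by-month plan described above can be sketched as follows. This is a minimal illustration, not the code in `collect_reddit_v2.py`: the helper names `month_windows` and `build_requests` are hypothetical, and building the timestamps in UTC is an assumption. The parameter names (`subreddit`, `after`, `before`, `limit`) are the ones used in the sample request in this notebook.

```python
from datetime import datetime, timezone

def month_windows(year, first_month, last_month):
    """Return (start, end) UTC datetimes for each month in [first_month, last_month]."""
    windows = []
    for m in range(first_month, last_month + 1):
        start = datetime(year, m, 1, tzinfo=timezone.utc)
        if m == 12:
            # December rolls over to January of the following year.
            end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
        else:
            end = datetime(year, m + 1, 1, tzinfo=timezone.utc)
        windows.append((start, end))
    return windows

def build_requests(subreddits, windows, limit=100):
    """Build one query-parameter dict per (subreddit, month) pair."""
    return [
        {
            'subreddit': sub,
            'after': int(start.timestamp()),   # window start as epoch seconds
            'before': int(end.timestamp()),    # window end as epoch seconds
            'limit': limit,
        }
        for sub in subreddits
        for start, end in windows
    ]

SUBREDDITS = ['Ozempic', 'Semaglutide', 'WegovyWeightLoss']
windows = month_windows(2024, 1, 11)            # January through November 2024
param_sets = build_requests(SUBREDDITS, windows)
print(f"{len(param_sets)} requests planned")    # 3 subreddits x 11 months
```

Each dictionary in `param_sets` could then be passed as `params` to `requests.get` against the endpoint above, ideally with a short pause between calls to stay polite to the archive.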

In [None]:
# Reddit collection via Arctic Shift API
import requests
from datetime import datetime

SUBREDDITS = ['Ozempic', 'Semaglutide', 'WegovyWeightLoss']
API_URL = "https://arctic-shift.photon-reddit.com/api/posts/search"

# Build (start, end) datetime pairs, one per month, for Jan-Nov 2024
months = [
    (datetime(2024, m, 1), datetime(2024, m + 1, 1))
    for m in range(1, 12)
]

# Example: fetch one month from one subreddit
params = {
    'subreddit': 'Ozempic',
    'after': int(datetime(2024, 1, 1).timestamp()),
    'before': int(datetime(2024, 2, 1).timestamp()),
    'limit': 100,
}
resp = requests.get(API_URL, params=params, timeout=30)
resp.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
sample = resp.json().get('data', [])
print(f"Sample API response: {len(sample)} posts from r/Ozempic (Jan 2024)")
if sample:
    post = sample[0]
    print(f"  Post ID: {post['id']}, Title: {post['title'][:80]}...")

# Full collection was done via collect_reddit_v2.py
# Loading pre-collected data:
df_reddit = pd.read_csv('../data/reddit_posts.csv')
print(f"\nTotal Reddit posts collected: {len(df_reddit)}")
print(f"Subreddits: {df_reddit['subreddit'].value_counts().to_dict()}")
print(f"Date range: {df_reddit['date'].min()} to {df_reddit['date'].max()}")
df_reddit.head()

## 1.2 WebMD Reviews Collection

We scraped patient reviews from **reviews.webmd.com** for Ozempic and Wegovy. The site embeds review data in a `window.__INITIAL_STATE__` JSON object on each page. We paginated through all available pages (20 reviews per page) using the `?page=N` parameter.

**URLs:**
- Ozempic: `https://reviews.webmd.com/drugs/drugreview-ozempic-semaglutide`
- Wegovy: `https://reviews.webmd.com/drugs/drugreview-wegovy-semaglutide`
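
As a rough sketch of the extraction step, the snippet below pulls the `window.__INITIAL_STATE__` object out of a page's HTML and builds `?page=N` URLs. The helpers `extract_initial_state` and `page_url` are hypothetical names, and the HTML string is a synthetic stand-in rather than a real WebMD response. The key names (`all_reviews`, `drug_review_nimvs`, `review_nimvs`, `UserExperience`) are the ones used in the scraping cell of this notebook.

```python
import json
import re

def extract_initial_state(html):
    """Return the object assigned to window.__INITIAL_STATE__, or None if absent."""
    match = re.search(r'window\.__INITIAL_STATE__\s*=\s*', html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index,
    # so a trailing ';' or further script code after the object is ignored.
    state, _ = json.JSONDecoder().raw_decode(html, match.end())
    return state

def page_url(base_url, page):
    """Build the URL for a given results page using the ?page=N parameter."""
    return f"{base_url}?page={page}"

# Synthetic page for illustration only (not real WebMD content).
fake_html = (
    '<html><script>window.__INITIAL_STATE__={"all_reviews": {"drug_review_nimvs": '
    '[{"review_nimvs": [{"UserExperience": "Example text"}]}]}};</script></html>'
)

state = extract_initial_state(fake_html)
reviews = state['all_reviews']['drug_review_nimvs'][0]['review_nimvs']
print(len(reviews), reviews[0]['UserExperience'])
print(page_url('https://reviews.webmd.com/drugs/drugreview-ozempic-semaglutide', 2))
```

One way to paginate with these helpers is to request `page_url(base, n)` for n = 1, 2, ... and stop when a page yields no reviews. Whether the full collection script stops on that condition is not stated here, so treat it as an assumption.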

In [None]:
# WebMD scraping approach - extract reviews from embedded JSON
import json

url = 'https://reviews.webmd.com/drugs/drugreview-ozempic-semaglutide?page=1'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)'}, timeout=30)
response.raise_for_status()  # stop here if the page could not be fetched
soup = BeautifulSoup(response.text, 'html.parser')

# Extract reviews from __INITIAL_STATE__ embedded JSON
for script in soup.find_all('script'):
    text = script.string or ''
    if 'window.__INITIAL_STATE__' in text:
        json_str = text.replace('window.__INITIAL_STATE__=', '').strip().rstrip(';')
        data = json.loads(json_str)
        sample_reviews = data['all_reviews']['drug_review_nimvs'][0]['review_nimvs']
        print(f"Sample page: {len(sample_reviews)} reviews extracted from page 1")
        print(f"Sample review: {sample_reviews[0]['UserExperience'][:150]}...")
        break

# Full collection was done via collect_webmd_real.py (paginated through all pages)
# Loading pre-collected data:
df_webmd = pd.read_csv('../data/webmd_reviews.csv')
print(f"\nTotal WebMD reviews: {len(df_webmd)}")
print(f"By drug: {df_webmd['drug'].value_counts().to_dict()}")
df_webmd.head()

## 1.3 News Articles Collection

We collected news articles about GLP-1 drugs via **Google News RSS feeds**. Eight search queries covered distinct angles (weight loss, side effects, insurance coverage, shortages, clinical trials) to capture diverse coverage. Articles were then deduplicated by title.
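
The query construction and title-based deduplication might look like the sketch below. The helper names `build_rss_url` and `dedupe_by_title` are hypothetical, the sample articles are made up for illustration, and normalizing case and whitespace before comparing titles is an assumption, since the notebook states only that articles were deduplicated by title.

```python
from urllib.parse import quote_plus

def build_rss_url(query, after='2024-01-01', before='2024-12-01'):
    """Build a Google News RSS search URL with a date-restricted query."""
    # quote_plus turns spaces into '+' and percent-encodes ':' and other reserved characters.
    q = quote_plus(f"{query} after:{after} before:{before}")
    return f"https://news.google.com/rss/search?q={q}&hl=en-US&gl=US&ceid=US:en"

def dedupe_by_title(articles):
    """Keep the first article seen for each title (case- and whitespace-insensitive)."""
    seen = set()
    unique = []
    for article in articles:
        key = ' '.join(article['title'].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique

# Made-up records standing in for parsed RSS items.
articles = [
    {'title': 'Drug shortage continues', 'source': 'Outlet A'},
    {'title': 'drug  shortage continues', 'source': 'Outlet B'},
    {'title': 'New trial results announced', 'source': 'Outlet C'},
]

print(build_rss_url('Ozempic weight loss'))
print([a['source'] for a in dedupe_by_title(articles)])
```

Because the same story often surfaces under several queries, deduplicating after merging the results of all queries, rather than per query, avoids counting it more than once.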

In [None]:
# News collection via Google News RSS
import re

queries = [
    'Ozempic weight loss', 'Wegovy weight loss', 'semaglutide obesity',
    'GLP-1 weight loss drug', 'Ozempic side effects',
    'Wegovy insurance coverage', 'Ozempic shortage', 'semaglutide clinical trial',
]

# Example: fetch one query
sample_query = queries[0]
rss_url = f"https://news.google.com/rss/search?q={sample_query.replace(' ', '+')}+after:2024-01-01+before:2024-12-01&hl=en-US&gl=US&ceid=US:en"
response = requests.get(rss_url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
response.raise_for_status()  # stop here if the feed could not be fetched
soup = BeautifulSoup(response.text, 'xml')
items = soup.find_all('item')
print(f"Sample RSS query '{sample_query}': {len(items)} articles")
if items:
    title = items[0].find('title').get_text(strip=True)
    source = items[0].find('source').get_text(strip=True) if items[0].find('source') else 'Unknown'
    print(f"  First article: [{source}] {title[:100]}")

# Full collection used 8 queries, deduplicated by title, cleaned HTML from descriptions
# Loading pre-collected data:
df_news = pd.read_csv('../data/news_articles.csv')
print(f"\nTotal news articles: {len(df_news)}")
print(f"Unique sources: {df_news['source'].nunique()}")
print(f"Top 5 sources: {df_news['source'].value_counts().head(5).to_dict()}")
df_news.head()

## 1.4 Data Summary

In [None]:
print("=" * 50)
print("DATA COLLECTION SUMMARY")
print("=" * 50)
print(f"\nReddit posts:     {len(df_reddit):>6}")
print(f"WebMD reviews:    {len(df_webmd):>6}")
print(f"News articles:    {len(df_news):>6}")
print(f"{'─' * 30}")
print(f"Total documents:  {len(df_reddit) + len(df_webmd) + len(df_news):>6}")
print(f"\nPublic corpus:    {len(df_reddit) + len(df_webmd):>6} (Reddit + WebMD)")
print(f"Media corpus:     {len(df_news):>6} (News articles)")
print(f"\nDate range: {df_reddit['date'].min()} to {df_reddit['date'].max()}")
print(f"\nReddit subreddits: {df_reddit['subreddit'].unique().tolist()}")
print(f"News sources: {df_news['source'].nunique()} unique outlets")