# Notebook 1: Data Collection

**Objectives:**
- Scrape TimeOut NYC's "Things to Do This Weekend" page
- Parse event listings (title, description, URL) and assign each one an `event_id`
- Save raw data to CSV

**Target:** 80+ events from TimeOut NYC

---


## Setup & Imports


In [None]:
# Install required packages if needed
# !pip install beautifulsoup4 requests lxml pandas python-dotenv


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import re
from pathlib import Path
import os

# Create data directories if they don't exist
Path('../data/raw').mkdir(parents=True, exist_ok=True)
Path('../data/processed').mkdir(parents=True, exist_ok=True)
Path('../data/test_datasets').mkdir(parents=True, exist_ok=True)

print("✅ Imports successful!")
print("✅ Data directories created")


✅ Imports successful!
✅ Data directories created


## 1. Setup Web Scraping

✅✅✅ **Important:** The request sends a browser-like `User-Agent` header (defined in the next cell) to reduce the chance of the site rejecting it as an automated client.
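
A minimal sketch of what a fuller set of browser-like headers could look like. Only the `User-Agent` string comes from this notebook; the helper name `build_headers` and the `Accept` and `Accept-Language` values are assumptions for illustration.

```python
def build_headers(user_agent: str) -> dict:
    """Return browser-like request headers (hypothetical helper)."""
    return {
        "User-Agent": user_agent,
        # Assumed values: typical of what a desktop browser sends.
        "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

headers = build_headers("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)")
print(headers["User-Agent"])
```

The resulting dict can be passed as the `headers=` argument of `requests.get`, in the same way as the `HEADERS` constant.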


## 2. Fetch HTML Content


In [2]:
import requests
from datetime import datetime

BASE_URL = "https://www.timeout.com/newyork/things-to-do/things-to-do-in-nyc-this-weekend"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
}

def fetch_page(url, headers):
    """Fetch HTML content from URL"""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        print(f"✅ Successfully fetched page (Status: {response.status_code})")
        print(f"✅ Content length: {len(response.text)} characters")
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"❌ Error fetching page: {e}")
        return None

# Fetch the page
html_content = fetch_page(BASE_URL, HEADERS)

if html_content:
    # Save HTML to local file for inspection/debugging
    today = datetime.now().strftime('%Y%m%d')
    html_file = f'../data/raw/timeout_page_{today}.html'
    
    with open(html_file, 'w', encoding='utf-8') as f:
        f.write(html_content)
    
    print(f"✅ HTML saved to: {html_file}")
    print(f"✅ You can now inspect it to see the structure!")
    print("\n✅✅✅ Page fetched and saved! Ready to parse.")


✅ Successfully fetched page (Status: 200)
✅ Content length: 843320 characters
✅ HTML saved to: ../data/raw/timeout_page_20251019.html
✅ You can now inspect it to see the structure!

✅✅✅ Page fetched and saved! Ready to parse.
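
`fetch_page` returns `None` on any request error and does not retry. If transient failures turn out to be a problem, a retry wrapper with exponential backoff is one option. The sketch below is an assumption, not part of the notebook: `fetch_with_retries` and its parameters are hypothetical names, and the demo uses a stand-in fetcher so that it runs without network access.

```python
import time

def fetch_with_retries(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None value or retries run out.

    `fetch` is any callable that returns the page text, or None on failure.
    The wait doubles after each failed attempt.
    """
    for attempt in range(retries):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < retries - 1:
            sleep(backoff * (2 ** attempt))
    return None

# Demo with a stand-in fetcher that fails twice, then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    return "<html></html>" if len(calls) >= 3 else None

waits = []  # record the requested waits instead of actually sleeping
page = fetch_with_retries(flaky_fetch, "https://example.com", sleep=waits.append)
print(page, waits)
```

With the notebook's own function, the call would look like `fetch_with_retries(lambda u: fetch_page(u, HEADERS), BASE_URL)`.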


## 3. Debug HTML Structure

✅✅✅ **Let's inspect the saved HTML to find the right selectors, then parse the event cards:**
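
One way to find candidate selectors is to count how often each tag and class combination occurs, since event cards tend to show up as a repeated container. The sketch below uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra dependencies. The class `TagClassCounter` and the sample HTML are made up for illustration and do not reflect the real page markup.

```python
from collections import Counter
from html.parser import HTMLParser

class TagClassCounter(HTMLParser):
    """Count (tag, class) pairs to reveal repeated containers."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = dict(attrs).get("class") or ""
        for cls in classes.split():
            self.counts[(tag, cls)] += 1

# Made-up sample; in the notebook this would be `html_content`.
sample_html = """
<article class="tile _article_x"><h3>1.First event</h3><p>Text</p></article>
<article class="tile _article_x"><h3>2.Second event</h3><p>Text</p></article>
<div class="footer">About</div>
"""

counter = TagClassCounter()
counter.feed(sample_html)
for (tag, cls), n in counter.counts.most_common(3):
    print(f"<{tag} class='{cls}'>: {n}")
```

Feeding `html_content` instead of `sample_html` lists the most frequent tag and class pairs on the real page, which can then be checked by hand in the saved HTML file.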


In [3]:
def parse_events(html_content):
    """Parse event data from HTML - TimeOut NYC specific"""
    soup = BeautifulSoup(html_content, 'lxml')
    events = []
    
    # Event cards are <article> elements whose class contains "tile" or "article"
    event_cards = soup.find_all('article', class_=re.compile(r'tile|article', re.I))
    
    print(f"Found {len(event_cards)} event cards")
    
    # Skip the first card (it's a header)
    for idx, card in enumerate(event_cards[1:], start=1):
        try:
            # Extract title from <h3> inside <a>
            title_elem = card.find('h3')
            if not title_elem:
                title_elem = card.find(['h2', 'h4'])
            title = title_elem.get_text(strip=True) if title_elem else None
            
            if not title or len(title) < 3:
                continue
            
            # Extract URL from <a> tag
            link_elem = card.find('a', href=True)
            url = link_elem['href'] if link_elem else ""
            if url and not url.startswith('http'):
                url = f"https://www.timeout.com{url}"
            
            # Extract description/summary (look for ALL p tags to get full description)
            desc_paragraphs = card.find_all('p')
            if desc_paragraphs:
                # Combine all paragraph texts with space separator
                description = ' '.join([p.get_text(strip=True) for p in desc_paragraphs if p.get_text(strip=True)])
            else:
                # Try finding divs with substantial text
                content_div = card.find('div', class_=re.compile(r'content|description|summary', re.I))
                if content_div:
                    desc_paragraphs = content_div.find_all('p')
                    description = ' '.join([p.get_text(strip=True) for p in desc_paragraphs if p.get_text(strip=True)])
                else:
                    description = title
            
            # Fallback to title if description is empty
            if not description or len(description) < 10:
                description = title
            
            # Extract category from the link's data-layer attribute, if present.
            # NOTE: category is computed here but not stored in the record below.
            category = "General"
            if link_elem and link_elem.has_attr('data-layer'):
                data_layer = str(link_elem.get('data-layer', ''))
                # Pull the category value out of the data-layer JSON string
                category_match = re.search(r'"category":"([^"]+)"', data_layer)
                if category_match:
                    category = category_match.group(1)

            # Add event to list
            events.append({
                'event_id': f'evt_{len(events)+1:03d}',
                'title': title,
                'description': description,  # full text, not truncated
                'url': url,
            })
            
        except Exception as e:
            print(f"Error parsing event {idx}: {e}")
            continue
    
    return events

# Parse events (skip parsing if the fetch failed and returned None)
events = parse_events(html_content) if html_content else []
print(f"\n✅ Successfully parsed {len(events)} events")


Found 103 event cards

✅ Successfully parsed 101 events
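
`parse_events` makes relative links absolute by prefixing the site origin by hand. As an alternative sketch, `urllib.parse.urljoin` covers the same case and also resolves protocol-relative links. The helper name `absolutize` and the example paths are made up for illustration.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.timeout.com"  # same origin the parser prefixes by hand

def absolutize(href: str, base: str = SITE_ROOT) -> str:
    """Return an absolute URL for href; empty input stays empty."""
    return urljoin(base, href) if href else ""

print(absolutize("/newyork/example-event"))              # relative path
print(absolutize("https://www.timeout.com/newyork"))     # already absolute
print(absolutize("//media.example.com/image.jpg"))       # protocol-relative
```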


## 4. Data Validation & Preview

✅✅✅ **Let's check what we scraped:**


In [4]:
# Create DataFrame
df = pd.DataFrame(events)

# Display basic info
print(f"Total events scraped: {len(df)}")
print(f"\nColumn names: {df.columns.tolist()}")
print(f"\nData shape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")

# Preview first 5 events
print("\n" + "="*80)
print("PREVIEW: First 5 Events")
print("="*80)
df.head()


Total events scraped: 101

Column names: ['event_id', 'title', 'description', 'url']

Data shape: (101, 4)

Missing values:
event_id       0
title          0
description    0
url            0
dtype: int64

PREVIEW: First 5 Events


| | event_id | title | description | url |
|---|---|---|---|---|
| 0 | evt_001 | 1.Open House New York | Admit it, are you a nosy New Yorker? Same here... | https://www.timeout.com/newyork/news/open-hous... |
| 1 | evt_002 | 2.Tompkins Square Park Halloween Dog Parade | The Village Halloween Parade is fun and all, b... | https://www.timeout.com/newyork/things-to-do/t... |
| 2 | evt_003 | 3.Gowanus Open Studios | Stroll through Gowanus to visit the art studio... | https://www.timeout.com/newyork/art/gowanus-op... |
| 3 | evt_004 | 4."Renoir’s Drawings" at The Morgan | Renoir’s sketchbook is moving into the spotlig... | https://www.timeout.com/newyork/news/renoirs-r... |
| 4 | evt_005 | 5.The Amazing Maize Maze | For the past 50 years, the Queens County Farm ... | https://www.timeout.com/newyork/news/the-amazi... |
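
The previewed titles still carry the listing's rank prefix (`1.`, `2.`, and so on). If a later step needs clean names, a small regex can strip it. This is a sketch under the assumption that the prefix is always digits followed by a period; `strip_rank` is a hypothetical helper, not part of the notebook.

```python
import re

# Leading digits plus a period, with optional surrounding whitespace.
RANK_PREFIX = re.compile(r"^\s*\d+\.\s*")

def strip_rank(title: str) -> str:
    """Remove a leading list rank such as '12.' from a title."""
    return RANK_PREFIX.sub("", title, count=1)

print(strip_rank("1.Open House New York"))
print(strip_rank("5.The Amazing Maize Maze"))
```

One caveat: a title that genuinely begins with a number and a period would lose that part too, so it may be worth spot-checking the result before overwriting the column.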


In [6]:
# Check data quality
print("\n✅✅✅ DATA QUALITY CHECKS:\n")

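# The parser never emits placeholder strings, so check for non-empty values instead.
# Note: a description may still equal the title, which parse_events uses as a fallback.
print(f"1. Events with valid titles: {df['title'].str.strip().ne('').sum()} / {len(df)}")
print(f"2. Events with descriptions: {df['description'].str.strip().ne('').sum()} / {len(df)}")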


✅✅✅ DATA QUALITY CHECKS:

1. Events with valid titles: 101 / 101
2. Events with descriptions: 101 / 101
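
Because `parse_events` falls back to the title when it finds no description, a full score here does not prove that every event has real descriptive text. A stricter report could count fallback descriptions, missing URLs, and duplicate URLs. The sketch below works on a list of dicts shaped like `events`; `quality_report` and the sample rows are made up for illustration.

```python
from collections import Counter

def quality_report(events):
    """Summarise fallback descriptions, missing URLs and duplicate URLs."""
    url_counts = Counter(e["url"] for e in events if e["url"])
    return {
        "total": len(events),
        "description_is_title": sum(e["description"] == e["title"] for e in events),
        "missing_url": sum(not e["url"] for e in events),
        # Each URL seen n times contributes n - 1 duplicates.
        "duplicate_urls": sum(n - 1 for n in url_counts.values() if n > 1),
    }

sample_events = [  # made-up rows in the same shape as the notebook's `events`
    {"event_id": "evt_001", "title": "A", "description": "Long text about A", "url": "https://example.com/a"},
    {"event_id": "evt_002", "title": "B", "description": "B", "url": "https://example.com/a"},
    {"event_id": "evt_003", "title": "C", "description": "Long text about C", "url": ""},
]
report = quality_report(sample_events)
print(report)
```

Calling `quality_report(events)` on the parsed list would give the corresponding counts for the real scrape.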


## 5. Save Raw Data to CSV

✅✅✅ **Saving to:** `data/raw/timeout_events_YYYYMMDD.csv`


In [7]:
# Generate filename with today's date
today = datetime.now().strftime('%Y%m%d')
output_file = f'../data/raw/timeout_events_{today}.csv'

# Save to CSV
df.to_csv(output_file, index=False)

print(f"✅ Data saved to: {output_file}")
print(f"✅ Total events saved: {len(df)}")
print(f"✅ File size: {os.path.getsize(output_file) / 1024:.2f} KB")

# Verify we can read it back
verify_df = pd.read_csv(output_file)
print(f"\n✅ Verification: Successfully read back {len(verify_df)} events from CSV")


✅ Data saved to: ../data/raw/timeout_events_20251019.csv
✅ Total events saved: 101
✅ File size: 83.56 KB

✅ Verification: Successfully read back 101 events from CSV
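
The verification above confirms only the row count. A stricter round trip would also compare field values, which catches quoting problems in titles or descriptions that contain commas and quotation marks. The sketch below illustrates the idea with the standard library's `csv` module and an in-memory buffer rather than pandas and a file; `roundtrip_matches` and the sample row are assumptions for illustration.

```python
import csv
import io

def roundtrip_matches(rows, fieldnames):
    """Write rows to CSV text, read them back, and compare field by field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    restored = list(csv.DictReader(buffer))
    return restored == rows

fields = ["event_id", "title", "description", "url"]
sample_rows = [  # made-up row with a comma and quotes to exercise CSV escaping
    {
        "event_id": "evt_001",
        "title": 'An "Example" Show',
        "description": "Admit it, commas are tricky",
        "url": "https://example.com/a",
    }
]
ok = roundtrip_matches(sample_rows, fields)
print(ok)
```

With the notebook's DataFrames, a comparable check might be `pd.testing.assert_frame_equal(df, verify_df)`, with the caveat that empty strings are read back as `NaN` by default.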


## 6. Summary

✅✅✅ **Notebook 1 Complete!**


In [8]:
print("="*80)
print("NOTEBOOK 1 SUMMARY: DATA COLLECTION")
print("="*80)
print(f"\n✅✅✅ SUCCESSFULLY COMPLETED!\n")
print(f"📊 Events Collected: {len(df)}")
print(f"📁 Saved to: {output_file}")
print(f"🗂️  Columns: {', '.join(df.columns)}")
print(f"\n📈 Summary Statistics:")
print(f"   - Events with descriptions: {df['description'].str.strip().ne('').sum()}")
print(f"   - Events with URLs: {(df['url'] != '').sum()}")

if len(df) >= 80:
    print(f"\n✅ SUCCESS: Collected {len(df)} events (target: 80+)")
else:
    print(f"\n⚠️  WARNING: Only collected {len(df)} events (target: 80+)")
    print(f"   Consider scraping additional pages or sections")

print(f"\n📝 Next Step: Notebook 2 - Data Processing & Vector DB")
print(f"   - Extract baby_friendly metadata using LLM")
print(f"   - Generate embeddings with OpenAI")
print(f"   - Set up Qdrant vector database")
print("="*80)


NOTEBOOK 1 SUMMARY: DATA COLLECTION

✅✅✅ SUCCESSFULLY COMPLETED!

📊 Events Collected: 101
📁 Saved to: ../data/raw/timeout_events_20251019.csv
🗂️  Columns: event_id, title, description, url

📈 Summary Statistics:
   - Events with descriptions: 101
   - Events with URLs: 101

✅ SUCCESS: Collected 101 events (target: 80+)

📝 Next Step: Notebook 2 - Data Processing & Vector DB
   - Extract baby_friendly metadata using LLM
   - Generate embeddings with OpenAI
   - Set up Qdrant vector database


---

## ✅✅✅ Notebook 1 Complete!

**What we accomplished:**
1. ✅ Set up web scraping with proper headers
2. ✅ Scraped TimeOut NYC event listings
3. ✅ Parsed event data (title, description, url)
4. ✅ Validated data quality
5. ✅ Saved raw data to CSV: `data/raw/timeout_events_YYYYMMDD.csv`

**CSV Structure:**
- `event_id`: Unique identifier (`evt_001`, `evt_002`, ...)
- `title`: Event name
- `description`: Event summary (falls back to the title when no description is found)
- `url`: Link to full event page

Date, category, price, location, and a scrape timestamp are not saved by this notebook.

**Next Steps:**
- Move to **Notebook 2: Data Processing & Vector DB**
- Extract `baby_friendly` metadata using GPT-4
- Generate embeddings with OpenAI
- Set up Qdrant vector database

---
