# Notebook 1: Data Collection

**Objectives:**
- Scrape TimeOut NYC's "Things to Do This Weekend" page
- Parse event listings (title, description, URL)
- Save raw data to CSV

**Target:** 80+ events from TimeOut NYC

---


## Setup & Imports


In [9]:
# Install required packages if needed
# !pip install beautifulsoup4 requests lxml pandas python-dotenv


In [10]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import re
from pathlib import Path
import os

# Create data directories if they don't exist
Path('../data/raw').mkdir(parents=True, exist_ok=True)
Path('../data/processed').mkdir(parents=True, exist_ok=True)
Path('../data/test_datasets').mkdir(parents=True, exist_ok=True)

print("✅ Imports successful!")
print("✅ Data directories created")


✅ Imports successful!
✅ Data directories created


## 1. Setup Web Scraping

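✅✅✅ **Important:** We send a browser-like `User-Agent` header with every request so the site is less likely to reject the scraper as an automated client. The `HEADERS` dictionary and the fetch code for sections 1 and 2 live in cell `In [13]`.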
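
As a minimal sketch of what such headers might look like, the hypothetical helper `build_headers` below (not part of the notebook) returns a dictionary that could be passed as `headers=` to `requests.get`. The `Accept` and `Accept-Language` values are illustrative assumptions; the notebook itself only sets `User-Agent`.

```python
def build_headers(user_agent=None):
    """Return request headers that resemble a regular browser visit.

    Hypothetical helper; the Accept and Accept-Language values are assumptions.
    """
    default_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
    return {
        "User-Agent": user_agent or default_agent,
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }


headers = build_headers()
print(headers["User-Agent"])
```

Keeping the headers in one function makes it easy to reuse the same values for the listing page and for every event page.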


## 2. Fetch HTML Content


## 3. Download Event Pages

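Now we'll download the full HTML for each event page so the detailed descriptions can be extracted offline. This cell relies on `df` and `HEADERS`, which are created in cells `In [13]` to `In [15]`, so run those first.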
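
Each page is stored under a filename derived from the last segment of its URL. The sketch below isolates that step in a hypothetical helper, `filename_for_event`, and additionally tolerates trailing slashes and query strings, which the notebook's one-liner does not handle. The example URL is a placeholder, not a real event page.

```python
import re
from urllib.parse import urlparse


def filename_for_event(event_id, url, max_slug_len=50):
    """Build '<event_id>_<slug>.html' from the last non-empty URL path segment."""
    path = urlparse(url).path  # drops any ?query or #fragment
    segments = [s for s in path.split("/") if s]
    slug = segments[-1] if segments else "index"
    # Keep only word characters and hyphens, similar to the notebook's cleanup
    slug = re.sub(r"[^\w-]", "", slug)[:max_slug_len]
    return f"{event_id}_{slug}.html"


print(filename_for_event("evt_001", "https://example.com/events/sample-event/"))
```

Prefixing the slug with `event_id` keeps filenames unique even if two events share the same final path segment.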


In [11]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import time

def download_event_page(event_id, url, headers, output_dir):
    """Download a single event page and save it to disk"""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        # Create a safe filename from the last segment of the URL
        safe_title = re.sub(r'[^\w\s-]', '', url.split('/')[-1])[:50]
        filename = f"{event_id}_{safe_title}.html"
        filepath = output_dir / filename
        
        # Save HTML
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(response.text)
        
        return event_id, str(filepath)
    except Exception as e:
        print(f"  ⚠️  Error downloading {url}: {e}")
        return event_id, None

# Create directory for event HTML files
today = datetime.now().strftime('%Y%m%d')
html_dir = Path(f'../data/raw/event_pages_{today}')
html_dir.mkdir(parents=True, exist_ok=True)

print(f"🔄 Downloading {len(df)} event pages in parallel...")
print(f"📁 Saving to: {html_dir}\n")

# Download all pages in parallel
downloaded_files = {}
with ThreadPoolExecutor(max_workers=10) as executor:
    # Submit all download tasks
    futures = {
        executor.submit(download_event_page, row['event_id'], row['url'], HEADERS, html_dir): row['event_id']
        for _, row in df.iterrows()
    }
    
    # Collect results
    completed = 0
    for future in as_completed(futures):
        completed += 1
        event_id, filepath = future.result()
        if filepath:
            downloaded_files[event_id] = filepath
        
        if completed % 10 == 0 or completed == len(df):
            print(f"  ✅ Downloaded {completed}/{len(df)} pages...")

# Add filepath column to DataFrame
df['html_filepath'] = df['event_id'].map(downloaded_files)

print(f"\n✅ Successfully downloaded {len(downloaded_files)}/{len(df)} event pages")
print(f"📊 Success rate: {len(downloaded_files)/len(df)*100:.1f}%")


🔄 Downloading 90 event pages in parallel...
📁 Saving to: ../data/raw/event_pages_20251106

  ✅ Downloaded 10/90 pages...
  ✅ Downloaded 20/90 pages...
  ✅ Downloaded 30/90 pages...
  ✅ Downloaded 40/90 pages...
  ✅ Downloaded 50/90 pages...
  ✅ Downloaded 60/90 pages...
  ✅ Downloaded 70/90 pages...
  ✅ Downloaded 80/90 pages...
  ✅ Downloaded 90/90 pages...

✅ Successfully downloaded 90/90 event pages
📊 Success rate: 100.0%
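
The download cell makes a single attempt per page, so a transient network error permanently drops that event from `downloaded_files`. One possible remedy, sketched here under the assumption that failures are temporary, is to wrap the request in a retry loop with exponential backoff. `retry_with_backoff` and `flaky_download` are hypothetical names, and the demo uses a stand-in function rather than a real request.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    Returns func()'s result, or re-raises the last exception.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Demo with a stand-in that fails twice before succeeding (no network involved)
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_download, attempts=4, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

In the notebook, `func` could be a small closure around `requests.get(...)` plus `raise_for_status()`. Lowering `max_workers` is a separate, complementary way to reduce load on the site.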


## 4. Extract Long Descriptions

Extract the full description from each downloaded event page.


In [12]:
def extract_long_description_from_file(html_filepath):
    """Extract long description from a downloaded HTML file"""
    if not html_filepath or not Path(html_filepath).exists():
        return ""
    
    try:
        with open(html_filepath, 'r', encoding='utf-8') as f:
            html_content = f.read()
        
        soup = BeautifulSoup(html_content, 'lxml')
        
        # Remove unwanted elements
        for element in soup(['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript']):
            element.decompose()
        
        # Find the content div - TimeOut uses id="content" with contentAnnotation class
        content_div = soup.find('div', id='content')
        if not content_div:
            # Fallback: look for div with contentAnnotation class
            content_div = soup.find('div', class_=re.compile(r'contentAnnotation', re.I))
        
        if content_div:
            # Extract all paragraph text
            paragraphs = content_div.find_all('p')
            long_desc_parts = []
            
            for p in paragraphs:
                text = p.get_text(strip=True)
                # Skip very short paragraphs
                if len(text) < 20:
                    continue
                # Skip RECOMMENDED links
                if text.startswith('RECOMMENDED:'):
                    continue
                # Skip social media links
                if 'View this post on Instagram' in text or 'A post shared by' in text:
                    continue
                long_desc_parts.append(text)
            
            long_desc = ' '.join(long_desc_parts)
            long_desc = re.sub(r'\s+', ' ', long_desc).strip()
            
            if len(long_desc) > 100:
                return long_desc[:5000]
        
        return ""
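    except Exception as e:
        # Return an empty string so a failed parse is not counted as a description
        print(f"  ⚠️  Error parsing {html_filepath}: {e}")
        return ""

print("🔄 Extracting long descriptions from downloaded pages...\n")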

# Apply extraction function to all events
df['long_description'] = df['html_filepath'].apply(
    lambda filepath: extract_long_description_from_file(filepath) if filepath else ""
)

# Show statistics
successful = (df['long_description'] != '').sum()
avg_length = df[df['long_description'] != '']['long_description'].str.len().mean()

print(f"✅ Extracted descriptions for {successful}/{len(df)} events")
print(f"📊 Average description length: {avg_length:.0f} characters")

# Show sample
print("\n" + "="*80)
print("SAMPLE: First Event Long Description")
print("="*80)
first_desc = df.iloc[0]['long_description']
print(f"{first_desc[:300]}..." if len(first_desc) > 300 else first_desc)


🔄 Extracting long descriptions from downloaded pages...

✅ Extracted descriptions for 90/90 events
📊 Average description length: 2281 characters

SAMPLE: First Event Long Description
The New York Comedy Festival(NYCF), the country’s largest and longest-running annual comedy festival, will return for its 21st edition this November, with over 200 comedians across 100 shows at iconic NYC venues likeCarnegie Hall,Madison Square Garden, theBeacon TheatreandTown HallfromFriday, Novemb...
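
The sample shows words running together ("likeCarnegie Hall,Madison Square Garden"). That is consistent with `get_text(strip=True)`, which strips each text fragment inside a paragraph and concatenates the fragments with no separator, so text around inline links loses its spaces. A minimal sketch of one alternative: join the stripped fragments with a space, then remove the space that joining leaves before punctuation. `join_fragments` is a hypothetical helper, and the fragment list is a stand-in for what BeautifulSoup's `p.stripped_strings` would yield.

```python
import re


def join_fragments(fragments):
    """Join text fragments with single spaces and tidy spacing before punctuation."""
    text = " ".join(f.strip() for f in fragments if f.strip())
    text = re.sub(r"\s+", " ", text)
    # Remove the space that joining leaves before closing punctuation
    return re.sub(r"\s+([,.;:!?)])", r"\1", text)


# Stand-in for the pieces of one paragraph that contains inline links
fragments = [
    "venues like", "Carnegie Hall", ",", "Madison Square Garden",
    ", the", "Beacon Theatre", "and", "Town Hall",
]
print(join_fragments(fragments))
```

Passing a separator directly, as in `p.get_text(" ", strip=True)`, is the shorter route, at the cost of a stray space before some punctuation marks.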


In [13]:
import requests
from datetime import datetime

BASE_URL = "https://www.timeout.com/newyork/things-to-do/things-to-do-in-nyc-this-weekend"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
}

def fetch_page(url, headers):
    """Fetch HTML content from URL"""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        print(f"✅ Successfully fetched page (Status: {response.status_code})")
        print(f"✅ Content length: {len(response.text)} characters")
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"❌ Error fetching page: {e}")
        return None

# Fetch the page
html_content = fetch_page(BASE_URL, HEADERS)

if html_content:
    # Save HTML to local file for inspection/debugging
    today = datetime.now().strftime('%Y%m%d')
    html_file = f'../data/raw/timeout_page_{today}.html'
    
    with open(html_file, 'w', encoding='utf-8') as f:
        f.write(html_content)
    
    print(f"✅ HTML saved to: {html_file}")
    print(f"✅ You can now inspect it to see the structure!")
    print("\n✅✅✅ Page fetched and saved! Ready to parse.")


✅ Successfully fetched page (Status: 200)
✅ Content length: 774788 characters
✅ HTML saved to: ../data/raw/timeout_page_20251106.html
✅ You can now inspect it to see the structure!

✅✅✅ Page fetched and saved! Ready to parse.
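
Because the page is saved to disk, later runs on the same day could reuse the snapshot instead of requesting the page again. The sketch below shows one way to do that. `load_or_fetch` is a hypothetical helper, and the demo substitutes a stand-in fetcher and a temporary directory for the real `fetch_page` call and `../data/raw` path.

```python
from pathlib import Path
import tempfile


def load_or_fetch(cache_path, fetch):
    """Return cached HTML if the snapshot exists; otherwise call fetch() and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    html = fetch()
    if html:  # do not cache a failed (None or empty) fetch
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(html, encoding="utf-8")
    return html


# Demo with a stand-in fetcher and a temporary directory (no network involved)
fetch_calls = []


def fake_fetch():
    fetch_calls.append(1)
    return "<html>snapshot</html>"


with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "raw" / "page.html"
    first = load_or_fetch(snapshot, fake_fetch)
    second = load_or_fetch(snapshot, fake_fetch)

print(first == second, len(fetch_calls))
```

In the notebook, the call might look like `load_or_fetch(html_file, lambda: fetch_page(BASE_URL, HEADERS))`.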


## 3. Parse Event Listings

✅✅✅ **Parse each event card on the listing page into a record with a title, description, and URL:**


In [14]:
def parse_events(html_content):
    """Parse event data from HTML - TimeOut NYC specific"""
    soup = BeautifulSoup(html_content, 'lxml')
    events = []
    
    # TimeOut NYC uses <article class="tile"> for events
    event_cards = soup.find_all('article', class_=re.compile(r'tile|article', re.I))
    
    print(f"Found {len(event_cards)} event cards")
    
    # Skip the first card (it's a header)
    for idx, card in enumerate(event_cards[1:], start=1):
        try:
            # Extract title from <h3> inside <a>
            title_elem = card.find('h3')
            if not title_elem:
                title_elem = card.find(['h2', 'h4'])
            title = title_elem.get_text(strip=True) if title_elem else None
            
            if not title or len(title) < 3:
                continue
            
            # Extract URL from <a> tag
            link_elem = card.find('a', href=True)
            url = link_elem['href'] if link_elem else ""
            if url and not url.startswith('http'):
                url = f"https://www.timeout.com{url}"
            
            # Extract description/summary (look for ALL p tags to get full description)
            desc_paragraphs = card.find_all('p')
            if desc_paragraphs:
                # Combine all paragraph texts with space separator
                description = ' '.join([p.get_text(strip=True) for p in desc_paragraphs if p.get_text(strip=True)])
            else:
                # Try finding divs with substantial text
                content_div = card.find('div', class_=re.compile(r'content|description|summary', re.I))
                if content_div:
                    desc_paragraphs = content_div.find_all('p')
                    description = ' '.join([p.get_text(strip=True) for p in desc_paragraphs if p.get_text(strip=True)])
                else:
                    description = title
            
            # Fallback to title if description is empty
            if not description or len(description) < 10:
                description = title
            
            # Extract category (often in data-layer or category tags)
            # Note: the value is computed here but is not written to the event record below
            category = "General"
            # Look in data attributes
            if link_elem and 'data-layer' in str(link_elem):
                data_layer = str(link_elem.get('data-layer', ''))
                if 'category' in data_layer.lower():
                    # Extract category from data-layer JSON
                    category_match = re.search(r'"category":"([^"]+)"', data_layer)
                    if category_match:
                        category = category_match.group(1)
            
            
            # Add event to list
            events.append({
                'event_id': f'evt_{len(events)+1:03d}',
                'title': title,
                'description': description,
                'url': url,
            })
            
        except Exception as e:
            print(f"Error parsing event {idx}: {e}")
            continue
    
    return events

# Parse events
events = parse_events(html_content)
print(f"\n✅ Successfully parsed {len(events)} events")


Found 92 event cards

✅ Successfully parsed 90 events
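
`parse_events` turns relative links into absolute ones by prefixing the site root whenever the `href` does not start with `http`. A sketch of the same step using the standard library's `urljoin`, which also copes with protocol-relative links (`//host/path`) that simple prefixing would mangle. `absolute_url` is a hypothetical helper and the paths in the demo are placeholders.

```python
from urllib.parse import urljoin

BASE = "https://www.timeout.com"


def absolute_url(href, base=BASE):
    """Resolve an href against the site root; absolute URLs pass through unchanged."""
    return urljoin(base, href) if href else ""


print(absolute_url("/newyork/sample-event"))
print(absolute_url("//media.example.com/img.png"))
print(absolute_url("https://example.com/other"))
```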


## 4. Data Validation & Preview

✅✅✅ **Let's check what we scraped:**


In [15]:
# Create DataFrame
df = pd.DataFrame(events)

# Display basic info
print(f"Total events scraped: {len(df)}")
print(f"\nColumn names: {df.columns.tolist()}")
print(f"\nData shape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")

# Preview first 5 events
print("\n" + "="*80)
print("PREVIEW: First 5 Events")
print("="*80)
df.head()


Total events scraped: 90

Column names: ['event_id', 'title', 'description', 'url']

Data shape: (90, 4)

Missing values:
event_id       0
title          0
description    0
url            0
dtype: int64

PREVIEW: First 5 Events


| | event_id | title | description | url |
|---|---|---|---|---|
| 0 | evt_001 | 1.The NY Comedy Festival | The New York Comedy Festival is where the best... | https://www.timeout.com/newyork/news/the-ny-co... |
| 1 | evt_002 | 2.The Other Art Fair Brooklyn | Connect with artists in-person and explore hun... | https://www.timeout.com/newyork/things-to-do/t... |
| 2 | evt_003 | 3.Canstruction | This annual cans-for-a-cause competitionpitsar... | https://www.timeout.com/newyork/things-to-do/c... |
| 3 | evt_004 | 4.Queer History Walking Tour | This fall, explore the long and rich history o... | https://www.timeout.com/newyork/lgbtq/queer-hi... |
| 4 | evt_005 | 5.Cheese Week | New Yorkers, prepare to get a littleextra chee... | https://www.timeout.com/newyork/news/dairy-lov... |
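
The titles in the preview carry the list rank fused to the name ("3.Canstruction"). If the rank is not wanted in the title, a small regular expression can separate the two. This is a sketch with a hypothetical helper, `split_rank`; it assumes the prefix is always digits followed by a period, so a title that genuinely begins with a number and a period would be split incorrectly.

```python
import re


def split_rank(title):
    """Split '3.Canstruction' into (3, 'Canstruction'); rank is None if absent."""
    match = re.match(r"^(\d+)\.\s*(.+)$", title)
    if match:
        return int(match.group(1)), match.group(2)
    return None, title


print(split_rank("3.Canstruction"))
print(split_rank("Cheese Week"))
```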


In [16]:
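# Check data quality
print("\n✅✅✅ DATA QUALITY CHECKS:\n")

# parse_events() never emits 'No title' / 'No description' placeholders, so test for non-empty text.
# A description that merely repeats the title (the parser's fallback) still counts as present here.
print(f"1. Events with valid titles: {df['title'].str.strip().ne('').sum()} / {len(df)}")
print(f"2. Events with descriptions: {df['description'].str.strip().ne('').sum()} / {len(df)}")


✅✅✅ DATA QUALITY CHECKS:

1. Events with valid titles: 90 / 90
2. Events with descriptions: 90 / 90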
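
These counts cannot reveal two cases the parser can produce: a description that is only the title fallback, and the same URL appearing on more than one card. A sketch of a stricter report over plain dictionaries follows. `quality_report` is a hypothetical helper and the sample records are invented to exercise each case.

```python
from collections import Counter


def quality_report(events):
    """Count records whose description only repeats the title, and duplicated URLs."""
    url_counts = Counter(e["url"] for e in events if e["url"])
    return {
        "total": len(events),
        "description_is_title": sum(e["description"] == e["title"] for e in events),
        "missing_url": sum(not e["url"] for e in events),
        "duplicate_urls": sorted(u for u, n in url_counts.items() if n > 1),
    }


# Hypothetical records shaped like the output of parse_events()
sample = [
    {"event_id": "evt_001", "title": "A", "description": "Long text about A", "url": "https://example.com/a"},
    {"event_id": "evt_002", "title": "B", "description": "B", "url": "https://example.com/a"},
    {"event_id": "evt_003", "title": "C", "description": "Long text about C", "url": ""},
]
report = quality_report(sample)
print(report)
```

With the notebook's DataFrame, the input could be obtained with `df.to_dict("records")`.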


## 5. Save Raw Data to CSV

✅✅✅ **Saving to:** `data/raw/timeout_events_YYYYMMDD.csv`


In [17]:
# Generate filename with today's date
today = datetime.now().strftime('%Y%m%d')
output_file = f'../data/raw/timeout_events_{today}.csv'

# Save to CSV
df.to_csv(output_file, index=False)

print(f"✅ Data saved to: {output_file}")
print(f"✅ Total events saved: {len(df)}")
print(f"✅ File size: {os.path.getsize(output_file) / 1024:.2f} KB")

# Verify we can read it back
verify_df = pd.read_csv(output_file)
print(f"\n✅ Verification: Successfully read back {len(verify_df)} events from CSV")


✅ Data saved to: ../data/raw/timeout_events_20251106.csv
✅ Total events saved: 90
✅ File size: 75.13 KB

✅ Verification: Successfully read back 90 events from CSV
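
The verification above compares row counts only. A stricter check, sketched here with the standard `csv` module rather than pandas, writes the rows, reads them back, and compares every field, which also exercises quoting of commas, quotes, and embedded newlines. `round_trip_ok` is a hypothetical helper and the sample row is invented.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["event_id", "title", "description", "url"]


def round_trip_ok(rows, path):
    """Write rows to CSV, read them back, and compare field by field."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    with open(path, newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))
    return loaded == rows


rows = [
    {
        "event_id": "evt_001",
        "title": "Sample, with comma",
        "description": 'Quoted "text"\nsecond line',
        "url": "https://example.com/a",
    },
]
with tempfile.TemporaryDirectory() as tmp:
    ok = round_trip_ok(rows, Path(tmp) / "events.csv")
print(ok)
```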


## 6. Summary

✅✅✅ **Notebook 1 Complete!**


In [18]:
print("="*80)
print("NOTEBOOK 1 SUMMARY: DATA COLLECTION")
print("="*80)
print(f"\n✅✅✅ SUCCESSFULLY COMPLETED!\n")
print(f"📊 Events Collected: {len(df)}")
print(f"📁 Saved to: {output_file}")
print(f"🗂️  Columns: {', '.join(df.columns)}")
print(f"\n📈 Summary Statistics:")
# parse_events() never emits a 'No description' placeholder, so count non-empty text instead
print(f"   - Events with descriptions: {df['description'].str.strip().ne('').sum()}")
print(f"   - Events with URLs: {(df['url'] != '').sum()}")

if len(df) >= 80:
    print(f"\n✅ SUCCESS: Collected {len(df)} events (target: 80+)")
else:
    print(f"\n⚠️  WARNING: Only collected {len(df)} events (target: 80+)")
    print(f"   Consider scraping additional pages or sections")

print(f"\n📝 Next Step: Notebook 2 - Data Processing & Vector DB")
print(f"   - Extract baby_friendly metadata using LLM")
print(f"   - Generate embeddings with OpenAI")
print(f"   - Set up Qdrant vector database")
print("="*80)


NOTEBOOK 1 SUMMARY: DATA COLLECTION

✅✅✅ SUCCESSFULLY COMPLETED!

📊 Events Collected: 90
📁 Saved to: ../data/raw/timeout_events_20251106.csv
🗂️  Columns: event_id, title, description, url

📈 Summary Statistics:
   - Events with descriptions: 90
   - Events with URLs: 90

✅ SUCCESS: Collected 90 events (target: 80+)

📝 Next Step: Notebook 2 - Data Processing & Vector DB
   - Extract baby_friendly metadata using LLM
   - Generate embeddings with OpenAI
   - Set up Qdrant vector database


---

## ✅✅✅ Notebook 1 Complete!

**What we accomplished:**
1. ✅ Set up web scraping with proper headers
2. ✅ Scraped TimeOut NYC event listings
3. ✅ Parsed event data (event_id, title, description, url)
4. ✅ Validated data quality
5. ✅ Saved raw data to CSV: `data/raw/timeout_events_YYYYMMDD.csv`

**CSV Structure:**
- `event_id`: Unique identifier (`evt_001`, `evt_002`, ...)
- `title`: Event name
- `description`: Event summary
- `url`: Link to full event page

**Next Steps:**
- Move to **Notebook 2: Data Processing & Vector DB**
- Extract `baby_friendly` metadata using GPT-4
- Generate embeddings with OpenAI
- Set up Qdrant vector database

---
