Notebook 1: Data Collection
NYC Event Recommender - TimeOut NYC Scraper

This notebook collects event data from TimeOut NYC including:
- Basic event info (title, url, short description)
- Full event descriptions
- Pricing information (is_free)


1. Setup and Imports


In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import datetime
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

print("✅ All imports successful")


✅ All imports successful


In [3]:
# Create data directories
Path('../data/raw').mkdir(parents=True, exist_ok=True)
Path('../data/processed').mkdir(parents=True, exist_ok=True)
Path('../data/test_datasets').mkdir(parents=True, exist_ok=True)

print("✅ Data directories created")


✅ Data directories created


2. Fetch HTML Content


In [4]:
# TimeOut NYC Things To Do page
URL = 'https://www.timeout.com/newyork/things-to-do/things-to-do-in-nyc-this-weekend'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

print(f"🔄 Fetching: {URL}")
response = requests.get(URL, headers=HEADERS, timeout=10)  # timeout so a stalled request cannot hang the notebook
response.raise_for_status()
html_content = response.text

# Save raw HTML
today = datetime.now().strftime('%Y%m%d')
html_file = f'../data/raw/timeout_page_{today}.html'
with open(html_file, 'w', encoding='utf-8') as f:
    f.write(html_content)

print(f"✅ Saved HTML to {html_file}")
print(f"📊 HTML size: {len(html_content):,} characters")


🔄 Fetching: https://www.timeout.com/newyork/things-to-do/things-to-do-in-nyc-this-weekend
✅ Saved HTML to ../data/raw/timeout_page_20251107.html
📊 HTML size: 761,441 characters
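
Since the raw HTML is already written to a dated file, re-running the cell on the same day could reuse that file instead of fetching again. A minimal sketch, assuming a hypothetical helper `load_or_fetch` (not part of the original notebook) that takes the cache path and any zero-argument fetch function:

```python
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return cached text if cache_path exists; otherwise call fetch() and cache the result."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Reuse the copy saved by an earlier run
        return cache_path.read_text(encoding='utf-8')
    text = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding='utf-8')
    return text
```

In this notebook it might be called as `load_or_fetch(html_file, lambda: requests.get(URL, headers=HEADERS, timeout=10).text)`, which keeps the network call out of the helper and makes it easy to exercise with a stub.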


3. Parse Events from Listing Page


In [5]:
def parse_events(html_content):
    """Parse event data from HTML - TimeOut NYC specific"""
    soup = BeautifulSoup(html_content, 'lxml')
    events = []
    
    # TimeOut NYC event cards are <article> elements whose class contains "tile" or "article"
    event_cards = soup.find_all('article', class_=re.compile(r'tile|article', re.I))
    
    print(f"Found {len(event_cards)} event cards")
    
    # Skip the first card (it's a header)
    for idx, card in enumerate(event_cards[1:], start=1):
        try:
            # Extract title
            title_elem = card.find('h3') or card.find(['h2', 'h4'])
            title = title_elem.get_text(strip=True) if title_elem else None
            
            if not title or len(title) < 3:
                continue
            
            # Remove leading numbers from title
            title = re.sub(r'^\d+\.\s*', '', title).strip()
            
            # Extract URL
            link_elem = card.find('a', href=True)
            url = link_elem['href'] if link_elem else ""
            if url and not url.startswith('http'):
                url = f"https://www.timeout.com{url}"
            
            # Extract short description
            desc_paragraphs = card.find_all('p')
            if desc_paragraphs:
                description = ' '.join([p.get_text(strip=True) for p in desc_paragraphs if p.get_text(strip=True)])
            else:
                description = title
            
            if not description or len(description) < 10:
                description = title
            
            # Add event to list
            events.append({
                'event_id': f'evt_{len(events)+1:03d}',
                'title': title,
                'description': description,
                'url': url,
            })
            
        except Exception as e:
            print(f"Error parsing event {idx}: {e}")
            continue
    
    return events

# Parse events
events = parse_events(html_content)
print(f"\n✅ Successfully parsed {len(events)} events")


Found 92 event cards

✅ Successfully parsed 90 events


In [6]:
# Create DataFrame
df = pd.DataFrame(events)
print(f"✅ Created DataFrame with {len(df)} events")
print(f"\nColumns: {list(df.columns)}")
df.head()


✅ Created DataFrame with 90 events

Columns: ['event_id', 'title', 'description', 'url']


Unnamed: 0,event_id,title,description,url
0,evt_001,The NY Comedy Festival,The New York Comedy Festival is where the best...,https://www.timeout.com/newyork/news/the-ny-co...
1,evt_002,The Other Art Fair Brooklyn,Connect with artists in-person and explore hun...,https://www.timeout.com/newyork/things-to-do/t...
2,evt_003,Canstruction,This annual cans-for-a-cause competitionpitsar...,https://www.timeout.com/newyork/things-to-do/c...
3,evt_004,Queer History Walking Tour,"This fall, explore the long and rich history o...",https://www.timeout.com/newyork/lgbtq/queer-hi...
4,evt_005,Cheese Week,"New Yorkers, prepare to get a littleextra chee...",https://www.timeout.com/newyork/news/dairy-lov...
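
The title and URL clean-up steps inside `parse_events` can be checked in isolation without any HTML. The sketch below uses two hypothetical helpers, `clean_title` and `absolutize`, and the assumed constant `BASE_URL`; none of these names appear in the notebook. It also swaps the f-string prefixing for `urllib.parse.urljoin`, which is a substitution, not the notebook's own method.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.timeout.com"  # assumed site root, taken from the URLs above


def clean_title(raw_title):
    """Drop a leading list number such as '12. ' and trim whitespace."""
    return re.sub(r'^\d+\.\s*', '', raw_title).strip()


def absolutize(href, base=BASE_URL):
    """Resolve a relative href against the site root; absolute URLs pass through unchanged."""
    return urljoin(base, href) if href else ""


print(clean_title("12. Cheese Week"))
print(absolutize("/newyork/things-to-do/canstruction"))
```

One difference from the f-string version: `urljoin` also handles protocol-relative links such as `//host/path`, which plain prefixing would turn into a malformed URL.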


4. Download Individual Event Pages


In [7]:
def download_event_page(event_id, url, headers, output_dir):
    """Download a single event page and save it to disk"""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        # Create safe filename from URL
        safe_title = re.sub(r'[^\w\s-]', '', url.split('/')[-1])[:50]
        filename = f"{event_id}_{safe_title}.html"
        filepath = output_dir / filename
        
        # Save HTML
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(response.text)
        
        return event_id, str(filepath)
    except Exception as e:
        print(f"  ⚠️  Error downloading {url}: {e}")
        return event_id, None

# Create directory for event HTML files
today = datetime.now().strftime('%Y%m%d')
html_dir = Path(f'../data/raw/event_pages_{today}')
html_dir.mkdir(parents=True, exist_ok=True)

print(f"🔄 Downloading {len(df)} event pages in parallel...")
print(f"📁 Saving to: {html_dir}\n")

# Download all pages in parallel
downloaded_files = {}
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {
        executor.submit(download_event_page, row['event_id'], row['url'], HEADERS, html_dir): row['event_id']
        for _, row in df.iterrows()
    }
    
    completed = 0
    for future in as_completed(futures):
        completed += 1
        event_id, filepath = future.result()
        if filepath:
            downloaded_files[event_id] = filepath
        
        if completed % 10 == 0 or completed == len(df):
            print(f"  ✅ Downloaded {completed}/{len(df)} pages...")

# Add filepath column to DataFrame
df['html_filepath'] = df['event_id'].map(downloaded_files)

print(f"\n✅ Successfully downloaded {len(downloaded_files)}/{len(df)} event pages")
print(f"📊 Success rate: {len(downloaded_files)/len(df)*100:.1f}%")


🔄 Downloading 90 event pages in parallel...
📁 Saving to: ../data/raw/event_pages_20251107

  ✅ Downloaded 10/90 pages...
  ✅ Downloaded 20/90 pages...
  ✅ Downloaded 30/90 pages...
  ✅ Downloaded 40/90 pages...
  ✅ Downloaded 50/90 pages...
  ✅ Downloaded 60/90 pages...
  ✅ Downloaded 70/90 pages...
  ✅ Downloaded 80/90 pages...
  ✅ Downloaded 90/90 pages...

✅ Successfully downloaded 90/90 event pages
📊 Success rate: 100.0%
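
The filename in `download_event_page` comes from `url.split('/')[-1]`, which is empty when a URL ends in `/` and keeps any query string. A minimal sketch of a more defensive variant, using a hypothetical helper `safe_filename` and an assumed fallback slug `"page"` (neither is in the original notebook):

```python
import re
from urllib.parse import urlparse


def safe_filename(event_id, url, max_len=50):
    """Build '<event_id>_<slug>.html' from the last non-empty path segment of url."""
    path = urlparse(url).path  # ignores any ?query or #fragment
    segments = [s for s in path.split('/') if s]
    slug = segments[-1] if segments else ""
    # Same character filter as the notebook, then truncate
    slug = re.sub(r'[^\w\s-]', '', slug)[:max_len] or "page"
    return f"{event_id}_{slug}.html"


print(safe_filename("evt_003", "https://www.timeout.com/newyork/things-to-do/canstruction"))
```

Because the event ID prefixes every name, two events with the same slug still map to different files.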


5. Extract Long Descriptions and Pricing


In [None]:
def extract_event_details_from_file(html_filepath):
    """Extract long_description and is_free from event HTML"""
    if not html_filepath or not Path(html_filepath).exists():
        return "", None  # Pricing is unknown if the file doesn't exist
    
    try:
        with open(html_filepath, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'lxml')
        
        # Check price section first
        is_free = None
        price_section = soup.find('div', attrs={'data-section': 'price'})
        if price_section:
            price_text = price_section.get_text(strip=True).lower()
            if 'free' in price_text:
                is_free = True
            elif re.search(r'\$\d+', price_text):
                is_free = False
        
        # Clean up soup
        for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'noscript']):
            tag.decompose()
        
        # Extract long description
        content_div = soup.find('div', id='content') or soup.find('div', class_=re.compile(r'contentAnnotation', re.I))
        long_desc = ""
        
        if content_div:
            paragraphs = content_div.find_all('p')
            valid_paragraphs = []
            
            for p in paragraphs:
                text = p.get_text(strip=True)
                if len(text) < 20 or text.startswith('RECOMMENDED:') or 'View this post on Instagram' in text:
                    continue
                
                # Fallback price detection if not found in price section
                if is_free is None:
                    text_lower = text.lower()
                    if 'free' in text_lower or 'no admission' in text_lower:
                        is_free = True
                    elif '$' in text and re.search(r'\$\d+', text):
                        is_free = False
                
                valid_paragraphs.append(text)
            
            long_desc = re.sub(r'\s+', ' ', ' '.join(valid_paragraphs)).strip()[:5000]
        
        # If no price info was found anywhere, leave is_free as None (unknown)
        # so the statistics below can report it separately
        
        return long_desc, is_free
    
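    except Exception as e:
        # Report the failure instead of storing the error text as a description
        print(f"  ⚠️  Error parsing {html_filepath}: {e}")
        return "", None  # Leave pricing unknown on error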


# Extract details from all downloaded pages
print("🔄 Extracting long descriptions and pricing info...\n")

results = df['html_filepath'].apply(lambda path: extract_event_details_from_file(path) if path else ("", None))
df['long_description'] = results.apply(lambda x: x[0])
df['is_free'] = results.apply(lambda x: x[1])

# Statistics
successful = (df['long_description'] != '').sum()
has_pricing = df['is_free'].notna().sum()
free = (df['is_free'] == True).sum()
paid = (df['is_free'] == False).sum()

print(f"✅ Extracted descriptions: {successful}/{len(df)}")
print(f"✅ Detected pricing: {has_pricing}/{len(df)}")
print(f"\n💰 Pricing: {free} free, {paid} paid, {len(df) - has_pricing} unknown")
print(f"📊 Avg length: {df[df['long_description'] != '']['long_description'].str.len().mean():.0f} chars")

# Sample
print("\n" + "="*80)
print(f"SAMPLE: {df.iloc[0]['title']}")
print(f"Is Free: {df.iloc[0]['is_free']}")
print(f"Description: {df.iloc[0]['long_description'][:300]}...")
print("="*80)


🔄 Extracting long descriptions and pricing info...

✅ Extracted descriptions: 90/90
✅ Detected pricing: 67/90

💰 Pricing: 23 free, 44 paid, 23 unknown
📊 Avg length: 2300 chars

SAMPLE: The NY Comedy Festival
Is Free: None
Description: The New York Comedy Festival(NYCF), the country’s largest and longest-running annual comedy festival, will return for its 21st edition this November, with over 200 comedians across 100 shows at iconic NYC venues likeCarnegie Hall,Madison Square Garden, theBeacon TheatreandTown HallfromFriday, Novemb...
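
The pricing check above tests for the substring `free`, so a word like "freestyle" would also count as free. A minimal sketch of a stricter three-way classifier, assuming a hypothetical helper `classify_price` and word-boundary patterns that are not in the original notebook:

```python
import re

FREE_RE = re.compile(r'\bfree\b|\bno admission\b', re.I)  # whole words only
PRICE_RE = re.compile(r'\$\d+')                            # same dollar pattern as the notebook


def classify_price(text):
    """Return True (free), False (paid), or None (no pricing signal in text)."""
    if FREE_RE.search(text):
        return True
    if PRICE_RE.search(text):
        return False
    return None


print(classify_price("Admission is free"))
print(classify_price("Tickets start at $25"))
print(classify_price("Freestyle rap battle"))
```

Word boundaries do not remove every false positive (a phrase like "feel free to" still matches), so this is only a narrower heuristic, not a reliable price parser.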


6. Save Final Dataset


In [9]:
# Drop html_filepath column (not needed for later use)
df_final = df.drop(columns=['html_filepath'])

# Save to CSV
today = datetime.now().strftime('%Y%m%d')
output_file = f'../data/raw/timeout_events_{today}.csv'
df_final.to_csv(output_file, index=False)

print(f"✅ Saved {len(df_final)} events to {output_file}")
print(f"\n📋 Final columns: {list(df_final.columns)}")
print(f"\n🎯 Dataset ready for processing in Notebook 2!")


✅ Saved 90 events to ../data/raw/timeout_events_20251107.csv

📋 Final columns: ['event_id', 'title', 'description', 'url', 'long_description', 'is_free']

🎯 Dataset ready for processing in Notebook 2!
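
In the saved CSV, `is_free` has three states: `True`, `False`, or an empty cell when pricing is unknown. A minimal sketch of reading those cells back, using the standard-library `csv` module instead of pandas and a hypothetical helper `parse_is_free`; the sample rows mirror the first three events shown in the data preview:

```python
import csv
import io


def parse_is_free(cell):
    """Map a CSV cell to True, False, or None (empty or unrecognized means unknown)."""
    return {"True": True, "False": False}.get(cell.strip())


# Stand-in for the saved file: only the two columns needed for the example
sample_csv = io.StringIO(
    "event_id,is_free\n"
    "evt_001,\n"
    "evt_002,False\n"
    "evt_003,True\n"
)
rows = [
    {"event_id": r["event_id"], "is_free": parse_is_free(r["is_free"])}
    for r in csv.DictReader(sample_csv)
]
print(rows)
```

Keeping the unknown state explicit lets later steps decide how to treat events with no pricing signal, so those events are not silently counted as free or paid.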


7. Data Preview


In [10]:
# Display final dataset
print(f"📊 Dataset Shape: {df_final.shape}")
print(f"\nFirst 3 events:")
df_final.head(3)


📊 Dataset Shape: (90, 6)

First 3 events:


Unnamed: 0,event_id,title,description,url,long_description,is_free
0,evt_001,The NY Comedy Festival,The New York Comedy Festival is where the best...,https://www.timeout.com/newyork/news/the-ny-co...,"The New York Comedy Festival(NYCF), the countr...",
1,evt_002,The Other Art Fair Brooklyn,Connect with artists in-person and explore hun...,https://www.timeout.com/newyork/things-to-do/t...,Connect with artists in-person and explore hun...,False
2,evt_003,Canstruction,This annual cans-for-a-cause competitionpitsar...,https://www.timeout.com/newyork/things-to-do/c...,This annual cans-for-a-cause competitionpitsar...,True


✅✅✅ Notebook 1 Complete!

**What we collected:**
- Event titles and URLs
- Short descriptions (from listing page)
- Long descriptions (from individual pages)
- Pricing information (is_free)

**Next step:** Open `02_data_processing_and_vectordb.ipynb` to process this data and create embeddings!
