# Search History Information Extraction & Analysis

## Project Overview
This notebook analyzes Google search history data to extract meaningful insights about user interests, preferences, and behavior patterns. Using Named Entity Recognition (NER) with OpenAI's GPT-4o-mini model and web scraping, we extract structured information from both search queries and visited pages to build a comprehensive user interest profile.

### Key Capabilities:
- Extract and categorize search queries by semantic meaning
- Analyze temporal patterns and search behavior
- Scrape and extract content from visited web pages
- Identify search clusters and user intent patterns
- Generate comprehensive user interest profiles
- Track API costs and optimize caching for efficiency

## Section 1: Load and Explore Search History Data

**Objective:** Load the Google search history JSON file and understand its structure, including all available fields and how records are organized.

**What this section does:**
- Loads search history from the JSON file
- Examines the data structure and available fields
- Shows sample records to understand the format
- Displays statistics about total records

In [1]:
import json
import pandas as pd
import numpy as np
from datetime import datetime
from collections import Counter, defaultdict
import re
from urllib.parse import urlparse, parse_qs, unquote
import warnings
warnings.filterwarnings('ignore')

# Load the search history
with open('search_history.json', 'r', encoding='utf-8') as f:
    search_data = json.load(f)

print(f"Total search history records: {len(search_data)}")
print(f"\nFirst 3 records structure:")
for i, record in enumerate(search_data[:3]):
    print(f"\n--- Record {i+1} ---")
    print(f"Keys: {record.keys()}")
    print(f"Title: {record.get('title', 'N/A')[:80]}")
    print(f"Time: {record.get('time', 'N/A')}")

Total search history records: 55383

First 3 records structure:

--- Record 1 ---
Keys: dict_keys(['header', 'title', 'titleUrl', 'time', 'products', 'activityControls'])
Title: Visited https://www.businessinsider.com/shivon-zilis-reported-mother-elon-musk-t
Time: 2024-06-23T22:21:50.431Z

--- Record 2 ---
Keys: dict_keys(['header', 'title', 'titleUrl', 'time', 'products', 'activityControls'])
Title: Visited Elon Musk and Shivon Zilis privately welcome third baby – NBC10 ...
Time: 2024-06-23T22:20:53.934Z

--- Record 3 ---
Keys: dict_keys(['header', 'title', 'titleUrl', 'time', 'products', 'activityControls', 'locationInfos'])
Title: Searched for elon musk shivon zilis
Time: 2024-06-23T22:20:47.560Z
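The three records above already show that the field set varies (only the third has `locationInfos`). A minimal sketch for surveying every field across the whole file is shown here; `summarize_keys` is a hypothetical helper, not part of the notebook, and it assumes each record is a flat dict as in the sample.

```python
from collections import Counter

def summarize_keys(records):
    """Count how many records contain each top-level key."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts

# Tiny illustrative sample shaped like the records printed above
sample = [
    {"header": "Search", "title": "Visited a page", "time": "2024-06-23T22:21:50.431Z"},
    {"header": "Search", "title": "Visited another page", "time": "2024-06-23T22:20:53.934Z"},
    {"header": "Search", "title": "Searched for something", "time": "2024-06-23T22:20:47.560Z",
     "locationInfos": [{"name": "example"}]},
]

for key, count in summarize_keys(sample).most_common():
    print(f"{key:15s} present in {count} of {len(sample)} records")
```

Running the same helper on `search_data` would show which fields are optional before any cleaning logic depends on them.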


## Section 2: Parse and Clean Search History Records

**Objective:** Process raw search history records to extract, classify, and structure the data for downstream analysis.

**What this section does:**
- Extracts search queries from the `title` field using a regex
- Classifies each record as a search, page visit, notification, or other activity
- Converts timestamps to datetime format
- Adds temporal features (date, hour, day_of_week)
- Displays distribution of activity types and date range

In [2]:
# Function to extract search query from title
def extract_search_query(title):
    """
    Extract the search query from the title field.

    Only "Searched for [query]" titles yield a query. "Visited ..." titles,
    notifications, and all other activities return None.
    """
    if not title:
        return None

    # Capture the text after "Searched for ". The lazy group stops at the
    # first '?' or '&' after the opening character, so queries containing
    # those characters are truncated at that point.
    search_match = re.search(r'^Searched for (.+?)(?:\s*$|[\?&])', title)
    if search_match:
        return search_match.group(1).strip()

    # Page visits, notifications, and anything else are not searches
    return None

# Function to classify activity type
def classify_activity(record):
    """Classify the type of activity"""
    title = record.get('title', '')
    
    if title.startswith('Searched for'):
        return 'search_query'
    elif title.startswith('Visited'):
        return 'page_visit'
    elif 'notification' in title.lower():
        return 'notification'
    else:
        return 'other'

# Clean and structure the data
cleaned_records = []
for record in search_data:
    cleaned = {
        'timestamp': record.get('time'),
        'title': record.get('title'),
        'titleUrl': record.get('titleUrl'),
        'activity_type': classify_activity(record),
        'search_query': extract_search_query(record.get('title')),
        'location': record.get('locationInfos', [{}])[0].get('name') if record.get('locationInfos') else None,
    }
    cleaned_records.append(cleaned)

# Convert to DataFrame
df = pd.DataFrame(cleaned_records)

# Convert timestamp to datetime - use format='ISO8601' to handle mixed formats
df['timestamp'] = pd.to_datetime(df['timestamp'], format='ISO8601', utc=True)
df['date'] = df['timestamp'].dt.date
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.day_name()

# Show data summary
print(f"Total records: {len(df)}")
print(f"\nActivity type distribution:")
print(df['activity_type'].value_counts())
print(f"\nDate range: {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"\nSample cleaned records:")
print(df[['timestamp', 'activity_type', 'search_query', 'title']].head(10))

Total records: 55383

Activity type distribution:
activity_type
search_query    30542
page_visit      22496
other            2187
notification      158
Name: count, dtype: int64

Date range: 2017-06-08 16:42:55.223000+00:00 to 2024-06-23 22:21:50.431000+00:00

Sample cleaned records:
                         timestamp activity_type             search_query  \
0 2024-06-23 22:21:50.431000+00:00    page_visit                     None   
1 2024-06-23 22:20:53.934000+00:00    page_visit                     None   
2 2024-06-23 22:20:47.560000+00:00  search_query   elon musk shivon zilis   
3 2024-06-23 17:08:38.542000+00:00  notification                     None   
4 2024-06-23 16:52:09.311000+00:00  search_query  bank station fire alert   
5 2024-06-23 16:52:00.916000+00:00  search_query  bank station fire alert   
6 2024-06-22 20:40:58.305000+00:00  search_query      mukesh ambani house   
7 2024-06-22 07:59:27.621000+00:00    page_visit                     None   
8 2024-06-22 07:39:03.
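The regex in `extract_search_query` ends its capture at the first `?` or `&`, so a query such as "what is AT&T" would come back shortened. If the intent is to keep queries whole, one possible alternative is plain prefix stripping. This is a sketch under that assumption; `extract_query_by_prefix` and `SEARCH_PREFIX` are illustrative names that do not appear in the notebook.

```python
SEARCH_PREFIX = "Searched for "

def extract_query_by_prefix(title):
    """Return everything after 'Searched for ', or None for non-search titles."""
    if not title or not title.startswith(SEARCH_PREFIX):
        return None
    query = title[len(SEARCH_PREFIX):].strip()
    # An empty remainder is treated as "no query"
    return query or None

print(extract_query_by_prefix("Searched for what is AT&T"))
print(extract_query_by_prefix("Visited https://example.com"))
```

Swapping this in would change the extracted counts reported above, so the downstream outputs would need to be regenerated.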

## Section 3: Extract Search Queries and Categorize by Topics

**Objective:** Filter search queries and categorize them into semantic categories using keyword-based classification.

**What this section does:**
- Filters out page visits and non-search activities to get pure search queries
- Uses predefined keyword mapping to categorize queries (fashion, real_estate, technology, etc.)
- Counts occurrences of each search query
- Displays top 20 most frequent searches
- Shows category distribution and sample searches per category

In [3]:
# Category mapping based on keyword analysis
CATEGORY_KEYWORDS = {
    'fashion': ['sandal', 'shoe', 'brand', 'hermes', 'tory burch', 'valentino', 'mule', 'raffia', 'dress', 'apparel'],
    'real_estate': ['plot', 'property', 'layout', 'apartment', 'real estate', 'house', 'bangalore', 'mysore', 'hsr'],
    'travel': ['metro', 'airport', 'station', 'hotel', 'airbnb', 'virgin active', 'travel'],
    'technology': ['elevenlabs', 'midjourney', 'openai', 'ai', 'tech', 'software'],
    'business': ['balderton', 'equity', 'employee', 'startup', 'venture'],
    'news': ['news', 'politics', 'sunak', 'starmer', 'bbc', 'reuters', 'alert', 'fire'],
    'entertainment': ['elon', 'musk', 'celebrity', 'businessinsider'],
    'wellness': ['massage', 'health', 'gym', 'fitness', 'active'],
    'shopping': ['amazon', 'shopping', 'retail', 'buy', 'product'],
    'books': ['autobiography', 'morita', 'sony', 'japan', 'made'],
}

def categorize_query(query):
    """Categorize a search query based on keywords"""
    if not query:
        return 'uncategorized'
    
    query_lower = query.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in query_lower for keyword in keywords):
            return category
    return 'general_interest'

# Get only search queries (not page visits or notifications)
search_queries_df = df[df['search_query'].notna()].copy()
search_queries_df['category'] = search_queries_df['search_query'].apply(categorize_query)

print(f"Total actual search queries extracted: {len(search_queries_df)}")
print(f"\nTop 20 search queries:")
query_counts = search_queries_df['search_query'].value_counts().head(20)
for i, (query, count) in enumerate(query_counts.items(), 1):
    print(f"{i:2d}. {query:40s} (Count: {count})")

print(f"\n\nCategory Distribution:")
category_dist = search_queries_df['category'].value_counts()
for category, count in category_dist.items():
    pct = (count / len(search_queries_df)) * 100
    print(f"{category:20s}: {count:3d} searches ({pct:5.1f}%)")

print(f"\n\nSearches by category sample:")
for category in search_queries_df['category'].unique()[:5]:
    category_queries = search_queries_df[search_queries_df['category'] == category]['search_query'].unique()
    print(f"\n{category.upper()}:")
    for query in category_queries[:3]:
        print(f"  - {query}")

Total actual search queries extracted: 30535

Top 20 search queries:
 1. otta                                     (Count: 91)
 2. ukraine                                  (Count: 55)
 3. keats                                    (Count: 47)
 4. bbc                                      (Count: 46)
 5. israel                                   (Count: 44)
 6. gmail                                    (Count: 36)
 7. maps                                     (Count: 32)
 8. ukraine.                                 (Count: 31)
 9. kcl email                                (Count: 29)
10. calculator                               (Count: 26)
11. drive                                    (Count: 26)
12. british royals news                      (Count: 23)
13. german to english                        (Count: 22)
14. beautyfini                               (Count: 22)
15. kcl student records                      (Count: 22)
16. kate middleton                           (Count: 20)
17. facebook       
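`categorize_query` tests keywords with `in`, which is a substring match: `'dress'` matches "address" and `'ai'` matches "email" or "gmail". The Section 5 output reflects this, listing "help to buy email address" under fashion. A minimal sketch of whole-word matching follows. The keyword map is a small hypothetical subset of `CATEGORY_KEYWORDS`, the function names are illustrative, and the optional `s`/`es` suffix is an assumption made so that plurals such as "sandals" still match.

```python
import re

# Hypothetical subset of the notebook's keyword map, for illustration only
SAMPLE_KEYWORDS = {
    "fashion": ["sandal", "dress", "brand"],
    "technology": ["ai", "openai", "software"],
}

def compile_category_patterns(category_keywords):
    """Build one whole-word regex per category (simple plurals allowed)."""
    patterns = {}
    for category, keywords in category_keywords.items():
        alternation = "|".join(re.escape(k) for k in keywords)
        patterns[category] = re.compile(rf"\b(?:{alternation})(?:s|es)?\b", re.IGNORECASE)
    return patterns

def categorize_query_whole_word(query, patterns):
    """Return the first category whose pattern matches a whole word in the query."""
    if not query:
        return "uncategorized"
    for category, pattern in patterns.items():
        if pattern.search(query):
            return category
    return "general_interest"

patterns = compile_category_patterns(SAMPLE_KEYWORDS)
for q in ["hermes sandals", "help to buy email address", "openai pricing"]:
    print(f"{q!r} -> {categorize_query_whole_word(q, patterns)}")
```

As in the original, the first matching category wins, so the order of the dictionary still decides ties.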

## Section 4: Analyze Search Patterns and Temporal Trends

**Objective:** Understand when and how users search, including patterns by day of week, time of day, and search clustering behavior.

**What this section does:**
- Analyzes search activity by day of week
- Shows search distribution across 24 hours
- Identifies most active dates
- Finds search clusters (multiple related searches within 5 minutes)
- Displays example search sequences to understand user search behavior

In [4]:
# Temporal Analysis
print("=== TEMPORAL SEARCH ANALYSIS ===\n")

print("Search Activity by Day of Week:")
dow_counts = df[df['activity_type'] == 'search_query'].groupby('day_of_week').size()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_counts = dow_counts.reindex(day_order, fill_value=0)
BAR_WIDTH = 50  # length of the longest bar; other bars scale relative to the maximum
max_dow = max(int(dow_counts.max()), 1)
for day, count in dow_counts.items():
    bar = "█" * (int(count) * BAR_WIDTH // max_dow)
    print(f"{day:10s}: {count:5d} searches {bar}")

print("\n\nSearch Activity by Hour of Day:")
hour_counts = df[df['activity_type'] == 'search_query'].groupby('hour').size()
hour_counts = hour_counts.reindex(range(24), fill_value=0)
max_hour = max(int(hour_counts.max()), 1)
for hour, count in hour_counts.items():
    bar = "█" * (int(count) * BAR_WIDTH // max_hour)
    print(f"{hour:02d}:00 - {hour:02d}:59: {count:5d} searches {bar}")

print("\n\nMost Active Dates (Top 10):")
date_counts = df[df['activity_type'] == 'search_query'].groupby('date').size().sort_values(ascending=False)
for i, (date, count) in enumerate(date_counts.head(10).items(), 1):
    print(f"{i:2d}. {date}: {count:3d} searches")

# Behavior Analysis
print("\n\n=== SEARCH BEHAVIOR ANALYSIS ===\n")

# Search sequence analysis
search_queries_sorted = search_queries_df.sort_values('timestamp')
print("Example search sequences (showing how searches are grouped in time):")
print("(Likely searching for related topics in clusters)")

# Extract hour and minute for grouping
search_queries_sorted['hour_minute'] = search_queries_sorted['timestamp'].dt.strftime('%Y-%m-%d %H:%M')

# Find clusters of searches within 5 minutes
search_clusters = []
current_cluster = []
last_time = None

for idx, row in search_queries_sorted.iterrows():
    if last_time is None:
        current_cluster = [row['search_query']]
    elif (row['timestamp'] - last_time).total_seconds() <= 300:  # 5 minutes
        current_cluster.append(row['search_query'])
    else:
        if len(current_cluster) > 1:
            search_clusters.append(current_cluster)
        current_cluster = [row['search_query']]
    last_time = row['timestamp']

if len(current_cluster) > 1:
    search_clusters.append(current_cluster)

print(f"Found {len(search_clusters)} search clusters (2+ searches within 5 minutes):\n")
for i, cluster in enumerate(search_clusters[:5], 1):
    print(f"Cluster {i}:")
    for query in cluster:
        print(f"  - {query}")
    print()

=== TEMPORAL SEARCH ANALYSIS ===

Search Activity by Day of Week:
Monday    : 4089 searches
Tuesday   : 4416 searches
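The clustering loop in this section compares each search only with the one immediately before it, so a chain of searches spaced a few minutes apart forms a single cluster even when the whole chain lasts much longer than 5 minutes. The same rule can be sketched as a standalone function that is easy to test. `cluster_by_gap` is a hypothetical name, and the input is assumed to be a time-sorted list of `(timestamp, query)` pairs.

```python
from datetime import datetime, timedelta

def cluster_by_gap(events, max_gap_seconds=300):
    """Group time-sorted (timestamp, query) pairs; a gap above the limit starts a new group.

    Only groups with at least two queries are returned.
    """
    clusters = []
    current = []
    last_ts = None
    for ts, query in events:
        if last_ts is not None and (ts - last_ts).total_seconds() > max_gap_seconds:
            if len(current) > 1:
                clusters.append(current)
            current = []
        current.append(query)
        last_ts = ts
    if len(current) > 1:
        clusters.append(current)
    return clusters

t0 = datetime(2024, 6, 23, 16, 52, 0)
demo = [
    (t0, "bank station fire alert"),
    (t0 + timedelta(seconds=9), "bank station fire alert"),
    (t0 + timedelta(hours=5), "elon musk shivon zilis"),
]
print(cluster_by_gap(demo))
```

If a hard cap on total cluster duration is wanted, the comparison would need to use the first timestamp of the current cluster instead of the previous one.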

## Section 5: Key Insights and Extracted Preferences

**Objective:** Extract actionable insights from search data to build a user interest profile and infer lifestyle characteristics.

**What this section does:**
- Identifies primary interest categories and their percentages
- Analyzes sub-interests within major categories (e.g., specific fashion brands)
- Infers lifestyle profile based on search patterns
- Calculates page visits to search ratio for purchase intent
- Displays recent search trends (last 7 days)

In [5]:
print("=== KEY CUSTOMER INSIGHTS FROM SEARCH ANALYSIS ===\n")

# Primary Interests
print("PRIMARY INTERESTS IDENTIFIED:")
print("-" * 50)
primary_categories = search_queries_df['category'].value_counts().head(5)
for i, (cat, count) in enumerate(primary_categories.items(), 1):
    pct = (count / len(search_queries_df)) * 100
    print(f"{i}. {cat.replace('_', ' ').title():25s} - {pct:5.1f}% of searches")

# Sub-interests within Fashion
print("\n\nFASHION PREFERENCES (Most Relevant Category):")
print("-" * 50)
fashion_searches = search_queries_df[search_queries_df['category'] == 'fashion']['search_query'].unique()
print("Specific brands and items searched:")
for search in fashion_searches:
    print(f"  • {search}")

# Real Estate Interests
print("\n\nREAL ESTATE & LOCATION INTERESTS:")
print("-" * 50)
realstate_searches = search_queries_df[search_queries_df['category'] == 'real_estate']['search_query'].unique()
for search in realstate_searches:
    print(f"  • {search}")

# Inferred Lifestyle Profile
print("\n\nINFERRED LIFESTYLE PROFILE:")
print("-" * 50)
profile_points = []

if len(search_queries_df[search_queries_df['category'] == 'fashion']) > 0:
    profile_points.append("✓ Fashion-conscious: Interested in luxury and designer brands (Hermes, Tory Burch, Valentino)")

if len(search_queries_df[search_queries_df['category'] == 'real_estate']) > 0:
    profile_points.append("✓ Property investor/seeker: Actively searching for real estate in premium areas of Bangalore")

if len(search_queries_df[search_queries_df['category'] == 'wellness']) > 0:
    profile_points.append("✓ Health-conscious: Interested in fitness and wellness facilities (Virgin Active, massage services)")

if len(search_queries_df[search_queries_df['category'] == 'technology']) > 0:
    profile_points.append("✓ Tech-savvy: Following AI/ML trends and innovation (ElevenLabs, Midjourney, OpenAI)")

if len(search_queries_df[search_queries_df['category'] == 'business']) > 0:
    profile_points.append("✓ Professionally oriented: Interest in startup equity and talent management")

for point in profile_points:
    print(point)

# Shopping behavior
print("\n\nSHOPPING BEHAVIOR INSIGHTS:")
print("-" * 50)
page_visits_df = df[df['activity_type'] == 'page_visit'].copy()
print(f"Total page visits: {len(page_visits_df)}")
print(f"Page visits vs searches ratio: {len(page_visits_df) / len(search_queries_df):.2f}:1")
print("Rough proxy for how often searches are followed by page visits (not direct evidence of purchase intent)")

# Searches from the 7 days leading up to the most recent record
print("\n\nRECENT SEARCH TRENDS (Last 7 days):")
print("-" * 50)
cutoff = search_queries_df['timestamp'].max() - pd.Timedelta(days=7)
recent_searches = search_queries_df[search_queries_df['timestamp'] >= cutoff]
recent_categories = recent_searches['category'].value_counts()
for cat, count in recent_categories.items():
    print(f"  • {cat.replace('_', ' ').title():20s} - {count} recent searches")

=== KEY CUSTOMER INSIGHTS FROM SEARCH ANALYSIS ===

PRIMARY INTERESTS IDENTIFIED:
--------------------------------------------------
1. General Interest          -  83.7% of searches
2. Technology                -   8.6% of searches
3. Wellness                  -   1.6% of searches
4. Travel                    -   1.3% of searches
5. Real Estate               -   1.0% of searches


FASHION PREFERENCES (Most Relevant Category):
--------------------------------------------------
Specific brands and items searched:
  • hermes oasis sandals
  • hermes sandals
  • tory burch raffia sandals
  • help to buy email address
  • apple watch hermes sports band
  • apple watch hermes band
  • brandalley
  • before brands failure
  • gen z health brands
  • serviceable addressable market
  • white strappy espadrille sandals
  • valentino gold hoop earrings
  • brands like monica vinader
  • best indian jewellery brands
  • barclays investment bank address
  • king's college london address
  • meetcl
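The visits-to-searches ratio in this section is an aggregate, so it cannot say whether a given search was actually followed by a visit. A more direct measure, sketched here under stated assumptions, is the share of searches whose next activity is a page visit within a short window. `search_follow_through_rate` and the 120-second default are hypothetical choices, not values from the notebook.

```python
from datetime import datetime, timedelta

def search_follow_through_rate(events, window_seconds=120):
    """Fraction of searches followed by a page visit before the next search or window end.

    events: iterable of (timestamp, activity_type) pairs in any order.
    """
    ordered = sorted(events, key=lambda e: e[0])
    searches = 0
    followed = 0
    for i, (ts, kind) in enumerate(ordered):
        if kind != "search_query":
            continue
        searches += 1
        j = i + 1
        while j < len(ordered):
            next_ts, next_kind = ordered[j]
            # Stop at the window boundary or when a new search begins
            if (next_ts - ts).total_seconds() > window_seconds or next_kind == "search_query":
                break
            if next_kind == "page_visit":
                followed += 1
                break
            j += 1
    return followed / searches if searches else 0.0

t0 = datetime(2024, 6, 23, 22, 20, 47)
demo_events = [
    (t0, "search_query"),
    (t0 + timedelta(seconds=6), "page_visit"),
    (t0 + timedelta(seconds=63), "page_visit"),
]
print(f"Follow-through rate: {search_follow_through_rate(demo_events):.2f}")
```

With the notebook's DataFrame, the input could be built from `zip(df['timestamp'], df['activity_type'])`.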

## Section 6: Advanced NER-Based Entity Extraction Using OpenAI

**Objective:** Use OpenAI's GPT-4o-mini model to perform sophisticated Named Entity Recognition and extract structured information (category, topic, entities) from search queries.

**What this section does:**
- Initializes OpenAI client with API credentials
- Defines extraction prompt for structured information extraction
- Implements extraction function that returns category, topic, and entities
- Tests extraction on sample queries
- Sets up caching system to avoid re-processing identical texts

In [6]:
# Import and configure the OpenAI client
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from .env (if present)
load_dotenv()

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not found. Set it in your environment or .env file.")

# Initialize OpenAI client
client = OpenAI(api_key=api_key)

print("OpenAI client initialized successfully!")
print("Using model: gpt-4o-mini for NER extraction")

OpenAI client initialized successfully!
Using model: gpt-4o-mini for NER extraction


In [7]:
# Define the extraction prompt for OpenAI
EXTRACTION_SYSTEM_PROMPT = """You are an expert at analyzing search queries and web page titles to extract structured information.

For each input text, extract:
1. CATEGORY: The main semantic category (choose from: Fashion & Accessories, Real Estate & Property, Technology & Innovation, Wellness & Health, Travel & Transportation, Business & Finance, News & Media, Shopping & Retail, Entertainment, Books & Learning, General Interest)
2. TOPIC: The main subject or focus of the query/title
3. ITEMS: Specific entities mentioned (brands, products, people, places, organizations)

Examples:
- "Hermes sandals for women" → Category: Fashion & Accessories, Topic: Sandals, Items: ["Hermes"]
- "Real estate property in HSR Layout Bangalore" → Category: Real Estate & Property, Topic: Property search, Items: ["HSR Layout", "Bangalore"]
- "How to bake sourdough bread" → Category: Books & Learning, Topic: Baking, Items: ["sourdough"]
- "children using smartphones too often" → Category: Wellness & Health, Topic: Smartphone usage, Items: ["children", "smartphones"]

Return ONLY valid JSON in this exact format:
{
  "category": "Category Name",
  "topic": "Main topic",
  "items": ["item1", "item2"]
}"""

def extract_entities_and_context(text):
    """
    Extract named entities and categorize them using OpenAI GPT-4o-mini
    
    Returns:
    {
        'category': str,
        'topic': str,
        'items': list,
        'raw_text': str
    }
    """
    if not text or not isinstance(text, str):
        return {
            'category': 'Uncategorized',
            'topic': 'Unknown',
            'items': [],
            'raw_text': text
        }
    
    try:
        # Call OpenAI API
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": EXTRACTION_SYSTEM_PROMPT},
                {"role": "user", "content": f"Extract structured information from: {text}"}
            ],
            temperature=0,
            max_tokens=150,
            response_format={"type": "json_object"}
        )
        
        # Parse response
        result = json.loads(response.choices[0].message.content)
        
        return {
            'category': result.get('category', 'General Interest'),
            'topic': result.get('topic', 'Not specified'),
            'items': result.get('items', []),
            'raw_text': text
        }
    
    except Exception as e:
        # If the API call or JSON parsing fails, log the error and return a default result
        print(f"API error for '{text[:50]}...': {e}")
        return {
            'category': 'General Interest',
            'topic': text[:30] if len(text) > 30 else text,
            'items': [],
            'raw_text': text
        }

# Test the extraction function
print("=== TESTING OPENAI NER EXTRACTION ===\n")
test_queries = [
    "Hermes sandals for women",
    "How to bake sourdough bread",
    "Real estate property in HSR Layout Bangalore",
    "Children smartphone addiction research"
]

for query in test_queries:
    result = extract_entities_and_context(query)
    print(f"Query: {query}")
    print(f"  Category: {result['category']}")
    print(f"  Topic: {result['topic']}")
    print(f"  Items: {result['items']}")
    print()

=== TESTING OPENAI NER EXTRACTION ===

Query: Hermes sandals for women
  Category: Fashion & Accessories
  Topic: Sandals
  Items: ['Hermes']



KeyboardInterrupt: 
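`extract_entities_and_context` trusts whatever JSON the model returns, but a response can still carry a category outside the prompt's list or a non-list `items` value, and the later `', '.join(row['items'])` call assumes a list of strings. One way to guard against that is sketched here. `normalize_extraction` is a hypothetical helper, and `ALLOWED_CATEGORIES` simply repeats the category names from `EXTRACTION_SYSTEM_PROMPT`.

```python
import json

# Category names copied from the extraction prompt
ALLOWED_CATEGORIES = {
    "Fashion & Accessories", "Real Estate & Property", "Technology & Innovation",
    "Wellness & Health", "Travel & Transportation", "Business & Finance",
    "News & Media", "Shopping & Retail", "Entertainment", "Books & Learning",
    "General Interest",
}

def normalize_extraction(raw_content, text):
    """Parse the model's JSON reply and coerce it into the expected shape."""
    try:
        data = json.loads(raw_content)
    except (TypeError, ValueError):
        data = {}
    if not isinstance(data, dict):
        data = {}

    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        category = "General Interest"

    topic = data.get("topic")
    if not isinstance(topic, str) or not topic.strip():
        topic = "Not specified"

    items = data.get("items")
    if not isinstance(items, list):
        items = []
    items = [str(item) for item in items if item is not None]

    return {"category": category, "topic": topic, "items": items, "raw_text": text}

reply = '{"category": "Fashion & Accessories", "topic": "Sandals", "items": ["Hermes"]}'
print(normalize_extraction(reply, "Hermes sandals for women"))
```

Falling back to "General Interest" mirrors the default the notebook already uses when a key is missing.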

In [None]:
# Load previously cached results if available
import os

CACHE_FILE = 'openai_extraction_cache.json'

if os.path.exists(CACHE_FILE):
    try:
        with open(CACHE_FILE, 'r') as f:
            extraction_cache = json.load(f)
        print(f"✓ Loaded {len(extraction_cache):,} cached extractions from '{CACHE_FILE}'")
        print("  This will save API calls and processing time!")
    except Exception as e:
        print(f"⚠ Could not load cache file: {e}")
        extraction_cache = {}
else:
    extraction_cache = {}
    print(f"No existing cache found. Starting fresh.")

No existing cache found. Starting fresh.
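The cell above only reads `CACHE_FILE`; for the cache to survive an interrupted run it also has to be written back at some point. A minimal sketch of an atomic save is given here, assuming the cache is a JSON-serializable dict. `save_cache` is a hypothetical helper, and writing to a temporary file before `os.replace` is one common way to avoid leaving a half-written cache behind.

```python
import json
import os
import tempfile

def save_cache(cache, path):
    """Write the cache to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(cache, f, ensure_ascii=False)
        os.replace(tmp_path, path)  # replaces any existing file in one step
    except BaseException:
        # Clean up the temp file if anything went wrong, including Ctrl+C
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "openai_extraction_cache.json")
    save_cache({"hermes sandals": {"category": "Fashion & Accessories"}}, demo_path)
    with open(demo_path, encoding="utf-8") as f:
        print(json.load(f))
```

Calling such a helper every few hundred queries would limit how much work a `KeyboardInterrupt` can lose.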


### 6.1: Load Cache and Initialize Batch Processing

**Objective:** Load previously cached API responses and set up batch processing utilities to optimize API usage and reduce costs.

**What this section does:**
- Loads existing extraction cache if available
- Creates batch processing function for efficient API calls
- Implements caching mechanism to avoid reprocessing
- Displays cache statistics

In [None]:
# Batch-style loop with progress tracking (still one API call per text)
import time
from typing import List, Dict

# Initialize cache if not already loaded
if 'extraction_cache' not in globals():
    extraction_cache = {}

def extract_entities_batch(texts: List[str], batch_size: int = 10) -> List[Dict]:
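    """
    Process texts in groups of `batch_size`, printing progress after each group.
    Each text is still sent in its own API call and the cache is not consulted,
    so grouping affects progress reporting only, not the number of requests.
    """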
    results = []
    total = len(texts)
    
    print(f"Processing {total} texts in batches of {batch_size}...")
    
    for i in range(0, total, batch_size):
        batch = texts[i:i + batch_size]
        batch_results = []
        
        for text in batch:
            result = extract_entities_and_context(text)
            batch_results.append(result)
            time.sleep(0.1)  # Small delay to avoid rate limits
        
        results.extend(batch_results)
        
        # Progress update
        progress = min(i + batch_size, total)
        pct = (progress / total) * 100
        print(f"Progress: {progress}/{total} ({pct:.1f}%) - Last: '{batch[-1][:40]}...'")
    
    print("✓ Batch processing complete!")
    return results

# Cache-based approach to avoid re-processing
# Note: extraction_cache is loaded from file in the previous cell
def extract_entities_and_context_cached(text):
    """
    Cached version to avoid re-processing identical queries.
    Checks cache first, then calls API if needed.
    """
    if not text or not isinstance(text, str):
        return {
            'category': 'Uncategorized',
            'topic': 'Unknown',
            'items': [],
            'raw_text': text
        }
    
    # Check cache
    if text in extraction_cache:
        return extraction_cache[text]
    
    # Extract using API
    result = extract_entities_and_context(text)
    
    # Store in cache
    extraction_cache[text] = result
    
    return result

print("Batch processing utilities ready!")
print(f"Cache size: {len(extraction_cache):,} entries")

Batch processing utilities ready!
Cache size: 0 entries
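Because `extract_entities_and_context` returns a default result when the API call fails, `extract_entities_and_context_cached` stores that default just like a real answer, and the query is never retried. One possible arrangement, sketched under the assumption that the extractor raises on failure instead of returning a fallback, is to retry with exponential backoff and cache only successes. `call_with_retries` and `extract_cached` are hypothetical names.

```python
import time

def call_with_retries(fn, text, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(text); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn(text)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * (2 ** attempt))

def extract_cached(text, cache, fn, **retry_kwargs):
    """Return the cached result if present; otherwise call fn and cache only on success."""
    if text in cache:
        return cache[text]
    result = call_with_retries(fn, text, **retry_kwargs)  # raises on persistent failure
    cache[text] = result
    return result

# Demonstration with a stand-in extractor that fails once, then succeeds
attempts = {"n": 0}
def flaky_extractor(text):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("simulated transient error")
    return {"category": "General Interest", "raw_text": text}

demo_cache = {}
print(extract_cached("otta", demo_cache, flaky_extractor, sleep=lambda s: None))
```

The `sleep` parameter exists only so the delay can be replaced in tests or demonstrations.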


In [None]:
# Apply NER extraction to all search queries using OpenAI
print("=== EXTRACTING STRUCTURED INFORMATION FROM ALL SEARCHES ===\n")
print(f"Total unique search queries: {search_queries_df['search_query'].nunique()}")
print(f"Total search records (with duplicates): {len(search_queries_df)}\n")

# Get unique queries to minimize API calls
unique_queries = search_queries_df['search_query'].unique()
print(f"Processing {len(unique_queries)} unique queries with OpenAI GPT-4o-mini...")
print("This may take a few minutes. Progress will be shown below.\n")

# Process in batches
unique_results = {}
batch_size = 5  # Process 5 at a time to manage rate limits

for i in range(0, len(unique_queries), batch_size):
    batch = unique_queries[i:i + batch_size]
    
    for query in batch:
        unique_results[query] = extract_entities_and_context_cached(query)
    
    # Progress update
    progress = min(i + batch_size, len(unique_queries))
    pct = (progress / len(unique_queries)) * 100
    print(f"Progress: {progress}/{len(unique_queries)} ({pct:.1f}%)")
    
    # Small delay to respect rate limits
    if i + batch_size < len(unique_queries):
        time.sleep(1)

print("\n✓ Extraction complete!")

# Map results back to all rows
search_queries_df['ner_extraction'] = search_queries_df['search_query'].map(unique_results)

# Expand the extracted data into separate columns
search_queries_df['semantic_category'] = search_queries_df['ner_extraction'].apply(lambda x: x['category'])
search_queries_df['topic'] = search_queries_df['ner_extraction'].apply(lambda x: x['topic'])
search_queries_df['items'] = search_queries_df['ner_extraction'].apply(lambda x: x['items'])

# Display top results
print("\nSample of extracted structured information:\n")
display_df = search_queries_df[['search_query', 'semantic_category', 'topic', 'items']].drop_duplicates(subset=['search_query']).head(15)
for idx, row in display_df.iterrows():
    print(f"Search: {row['search_query']}")
    print(f"  Category: {row['semantic_category']}")
    print(f"  Topic: {row['topic']}")
    print(f"  Items: {', '.join(row['items']) if row['items'] else 'None'}")
    print()

print(f"\n\nSemantic Category Distribution:")
print("=" * 60)
category_dist = search_queries_df['semantic_category'].value_counts()
for category, count in category_dist.items():
    pct = (count / len(search_queries_df)) * 100
    bar = "█" * (count // max(1, len(search_queries_df) // 30))
    print(f"{category:30s} {count:3d} ({pct:5.1f}%) {bar}")

=== EXTRACTING STRUCTURED INFORMATION FROM ALL SEARCHES ===

Total unique search queries: 25807
Total search records (with duplicates): 30535

Processing 25807 unique queries with OpenAI GPT-4o-mini...
This may take a few minutes. Progress will be shown below.

Progress: 5/25807 (0.0%)
Progress: 10/25807 (0.0%)
Progress: 15/25807 (0.1%)
Progress: 20/25807 (0.1%)
Progress: 25/25807 (0.1%)
Progress: 30/25807 (0.1%)
Progress: 35/25807 (0.1%)


KeyboardInterrupt: 
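The loop above sleeps for one second after every batch of 5 except the last, so with 25,807 uncached queries the pauses alone add up to 5,161 seconds, roughly 86 minutes, before any API latency is counted. The arithmetic can be sketched as follows; `estimate_min_runtime_seconds` is a hypothetical helper and `per_call_seconds` is an assumed average latency, not a measured value.

```python
import math

def estimate_min_runtime_seconds(n_queries, batch_size=5, batch_pause=1.0, per_call_seconds=0.0):
    """Lower-bound runtime: one pause between consecutive batches plus per-call latency."""
    n_batches = math.ceil(n_queries / batch_size)
    pauses = max(n_batches - 1, 0) * batch_pause
    return pauses + n_queries * per_call_seconds

pause_only = estimate_min_runtime_seconds(25807)
print(f"Pauses alone: {pause_only:.0f} s (~{pause_only / 60:.0f} min)")

# With an assumed 0.5 s average latency per call (illustrative only)
with_latency = estimate_min_runtime_seconds(25807, per_call_seconds=0.5)
print(f"With 0.5 s per call: ~{with_latency / 3600:.1f} h")
```

An estimate like this is worth running before starting the cell, since it indicates whether a smaller sample or a persisted cache is needed.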

### 6.2: Extract and Process All Search Queries

**Objective:** Apply NER extraction to all unique search queries to extract semantic categories, topics, and entities using the OpenAI API.

**What this section does:**
- Processes all unique search queries through GPT-4o-mini
- Batches requests to manage API rate limits
- Maps results back to the dataframe with semantic categories
- Expands extraction into separate columns (category, topic, items)
- Displays sample results and category distribution

In [None]:
# Web scraping utilities for page visit URL content extraction
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
import re

# Cache for scraped content to avoid re-scraping
scraped_content_cache = {}

def clean_google_url(url):
    """Extract actual URL from Google redirect URL"""
    if 'google.com/url' in url:
        try:
            parsed = urlparse(url)
            params = parse_qs(parsed.query)
            if 'q' in params:
                return params['q'][0]
        except Exception:
            pass
    return url

def extract_url_from_title(title):
    """Extract URL from page visit title (format: 'Visited <URL>')"""
    match = re.search(r'https?://[^\s<>"]+|www\.[^\s<>"]+', title)
    if match:
        url = match.group(0)
        if not url.startswith('http'):
            url = 'http://' + url
        return url
    return None

MAX_CONTENT_CHARS = 3000  # keep prompts small (GPT-4o-mini has token limits)

def _fetch_page_text(url, headers, timeout):
    """Download one page and return its visible text, truncated to MAX_CONTENT_CHARS."""
    response = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')

    # Remove non-content elements: scripts, styles and page chrome
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Extract text and collapse runs of whitespace
    text = soup.get_text(separator=' ', strip=True)
    text = re.sub(r'\s+', ' ', text).strip()

    if len(text) > MAX_CONTENT_CHARS:
        text = text[:MAX_CONTENT_CHARS] + "..."
    return text

def scrape_url_content(url, timeout=10):
    """
    Scrape website content with fallback strategies:
    1. Try the full URL
    2. If that fails, try the root domain
    3. If that fails too, return None (the caller falls back to URL text)
    """
    if not url:
        return None

    # Check cache first (failed URLs are cached as None, so they are not retried)
    if url in scraped_content_cache:
        return scraped_content_cache[url]

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    }

    # Candidate URLs in order: the full URL, then the root domain if it differs
    candidates = [url]
    try:
        parsed = urlparse(url)
        root_url = f"{parsed.scheme}://{parsed.netloc}"
        if root_url != url:
            candidates.append(root_url)
    except ValueError:
        pass  # malformed URL: only the original is attempted

    text = None
    for candidate in candidates:
        try:
            text = _fetch_page_text(candidate, headers, timeout)
            break
        except Exception:
            # Network, HTTP or parsing error: try the next candidate.
            # Unlike a bare `except:`, this does not swallow KeyboardInterrupt.
            continue

    # Cache under the original URL, whether scraping succeeded or not
    scraped_content_cache[url] = text
    return text

def get_page_visit_content(title, url=None):
    """
    Extract content for page visit with fallback strategy:
    1. Use provided URL or extract from title
    2. Clean Google redirect URLs
    3. Scrape website content
    4. If scraping fails, use URL text for semantic parsing
    """
    # Use provided URL or try to extract from title
    if not url:
        url = extract_url_from_title(title)
    
    if not url:
        # No URL found, use title directly
        return title
    
    # Clean Google redirect URLs
    url = clean_google_url(url)
    
    # Try to scrape content
    content = scrape_url_content(url)
    
    if content:
        # Successfully scraped - use scraped content
        return f"Website: {url}\nContent: {content}"
    else:
        # Scraping failed - use URL for semantic parsing
        # Extract meaningful parts from URL
        parsed = urlparse(url)
        domain = re.sub(r'^www\.', '', parsed.netloc)  # strip only a leading "www."
        path = parsed.path.strip('/').replace('/', ' ')
        
        # Combine domain and path for semantic analysis
        semantic_text = f"{domain} {path}".replace('-', ' ').replace('_', ' ')
        return f"URL: {url}\nSemantic context: {semantic_text}"

print("✓ Web scraping utilities loaded")
print(f"  - extract_url_from_title(): Extract URL from page visit titles")
print(f"  - scrape_url_content(): Scrape website with fallback to root domain")
print(f"  - get_page_visit_content(): Main function with full fallback strategy")

✓ Web scraping utilities loaded
  - extract_url_from_title(): Extract URL from page visit titles
  - scrape_url_content(): Scrape website with fallback to root domain
  - get_page_visit_content(): Main function with full fallback strategy


## Section 7: Web Scraping and Page Visit Analysis

**Objective:** Extract content from visited web pages and apply the same NER extraction to understand which pages the user actually visits.

**What this section does:**
- Implements web scraping utilities to extract page content
- Cleans Google redirect URLs to recover the actual target URLs
- Falls back to parsing the URL text if scraping fails
- Caches scraped content in memory to avoid re-scraping the same URL
- Prepares utilities for extracting content from visited pages
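The URL-parsing fallback can be checked offline, without any network access. The sketch below mirrors the fallback branch of `get_page_visit_content()` as a standalone function; the name `url_to_semantic_text` is an illustrative helper and not part of the notebook.

```python
from urllib.parse import urlparse

def url_to_semantic_text(url):
    """Turn a URL into space-separated words taken from its domain and path."""
    parsed = urlparse(url)
    domain = parsed.netloc
    if domain.startswith("www."):
        domain = domain[4:]  # drop only a leading "www."
    path = parsed.path.strip("/").replace("/", " ")
    # Hyphens and underscores usually separate words in URL slugs
    return f"{domain} {path}".replace("-", " ").replace("_", " ").strip()

example_url = (
    "https://www.independent.co.uk/news/uk/politics/"
    "rishi-sunak-general-election-national-service-b2566249.html"
)
print(url_to_semantic_text(example_url))
```

Even without page content, the slug words ("rishi sunak general election national service") usually carry enough signal for a coarse category.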

In [None]:
# Test the web scraping function with a sample
test_title = "Visited https://www.example.com/products/shoes"
print("Testing web scraping with sample URL:")
print(f"Title: {test_title}\n")

url = extract_url_from_title(test_title)
print(f"Extracted URL: {url}")

content = get_page_visit_content(test_title)
print(f"\nContent preview (first 200 chars):")
print(content[:200] + "..." if len(content) > 200 else content)

Testing web scraping with sample URL:
Title: Visited https://www.example.com/products/shoes

Extracted URL: https://www.example.com/products/shoes

Content preview (first 200 chars):
Website: https://www.example.com/products/shoes
Content: Example Domain Example Domain This domain is for use in documentation examples without needing permission. Avoid use in operations. Learn more


### 7.1: Test Web Scraping with Sample Data

**Objective:** Verify that web scraping functions work correctly with sample URLs.

**What this section does:**
- Tests URL extraction from page titles
- Tests content scraping functionality
- Shows preview of scraped content
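The text-cleaning step can also be exercised on a fixed HTML string, so the test does not depend on a live site. The sketch below is a stand-in that uses only the standard library's `html.parser` rather than BeautifulSoup; `VisibleTextParser`, `html_to_text` and the sample HTML are hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser
import re

SKIPPED_TAGS = {"script", "style", "nav", "footer", "header"}

class VisibleTextParser(HTMLParser):
    """Collect text that is not inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def html_to_text(html, max_chars=3000):
    """Return whitespace-normalised visible text, truncated to max_chars."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
    return text[:max_chars] + "..." if len(text) > max_chars else text

sample_html = """
<html><head><title>Shoes</title><style>p { color: red; }</style></head>
<body><nav>Home | Shop</nav><p>Running   shoes
on sale.</p><script>track();</script><footer>Contact</footer></body></html>
"""
print(html_to_text(sample_html))
```

The page title is kept while navigation, footer, script and style content is dropped, which matches the intent of the `decompose()` loop in `scrape_url_content()`.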

In [None]:
# Check actual page visit titles format
print("Sample page visit titles from the dataset:\n")
sample_visits = page_visits_df['title'].head(10).tolist()
for i, title in enumerate(sample_visits, 1):
    print(f"{i}. {title}")
    url = extract_url_from_title(title)
    print(f"   Extracted URL: {url}\n")

Sample page visit titles from the dataset:

1. Visited https://www.businessinsider.com/shivon-zilis-reported-mother-elon-musk-twins-2022-7?amp
   Extracted URL: https://www.businessinsider.com/shivon-zilis-reported-mother-elon-musk-twins-2022-7?amp

2. Visited Elon Musk and Shivon Zilis privately welcome third baby – NBC10 ...
   Extracted URL: None

3. Visited Teens could lose bank accounts and driving licences for snubbing ...
   Extracted URL: None

4. Visited Starmer: Sunak showing 'total lack of leadership' - BBC News
   Extracted URL: None

5. Visited Sunak looked like a man who was running the country - until he ...
   Extracted URL: None

6. Visited Balderton Essentials Guide to Employee Equity - Balderton Capital
   Extracted URL: None

7. Visited https://urban.co/en-gb/all-treatments?utm_medium=cpc&utm_term=urban%20massage&creative=644204401832&netw=g&utm_source=google_mobile&match=e&device=m&model=&pos=&ace=&utm_campaign=2022898691&utm_adgroupid=72394300020&campaignid=202289

### 7.2: Inspect Page Visit Data Format

**Objective:** Examine the actual page visit records from the dataset to understand their format.

**What this section does:**
- Displays sample page visit titles
- Shows extracted URLs from titles
- Verifies data structure for processing
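Most titles in the sample are page headlines rather than URLs, so the title alone is rarely enough to locate the page. A minimal sketch of the resulting priority order (record URL first, then a URL embedded in the title) follows; `resolve_visit_url` is a hypothetical helper, and it reuses the regular expression from `extract_url_from_title()`.

```python
import re

URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')

def resolve_visit_url(title, title_url=None):
    """Return (url, source), preferring the record's titleUrl over the title text."""
    if isinstance(title_url, str) and title_url:
        return title_url, "titleUrl"
    match = URL_PATTERN.search(title)
    if match:
        url = match.group(0)
        if not url.startswith("http"):
            url = "http://" + url  # bare "www." matches need a scheme
        return url, "title"
    return None, "none"

print(resolve_visit_url("Visited Starmer: Sunak showing 'total lack of leadership' - BBC News"))
print(resolve_visit_url("Visited www.example.com/products/shoes"))
```

The `isinstance` check also treats a missing value (for example a pandas `NaN`) as "no URL" instead of raising.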

In [None]:
# Check if titleUrl field is available
print("\nChecking titleUrl field availability:\n")
sample_with_urls = page_visits_df[['title', 'titleUrl']].head(10)
for idx, row in sample_with_urls.iterrows():
    print(f"Title: {row['title'][:60]}...")
    print(f"  URL: {row['titleUrl']}\n")


Checking titleUrl field availability:

Title: Visited https://www.businessinsider.com/shivon-zilis-reporte...
  URL: https://www.google.com/url?q=https://www.businessinsider.com/shivon-zilis-reported-mother-elon-musk-twins-2022-7%3Famp&usg=AOvVaw1JpQbDqah1O4c5A5wg4Who

Title: Visited Elon Musk and Shivon Zilis privately welcome third b...
  URL: https://www.google.com/url?q=https://www.nbcphiladelphia.com/entertainment/entertainment-news/elon-musk-and-shivon-zilis-privately-welcome-third-baby/3892694/&usg=AOvVaw0BqY5StEFFTmHdppOUNY4V

Title: Visited Teens could lose bank accounts and driving licences ...
  URL: https://www.google.com/url?q=https://www.independent.co.uk/news/uk/politics/rishi-sunak-general-election-national-service-b2566249.html&usg=AOvVaw15XA5JKBYTMEfFaM6V9rI5

Title: Visited Starmer: Sunak showing 'total lack of leadership' - ...
  URL: https://www.google.com/url?q=https://www.bbc.co.uk/news/videos/cx00z3pldz5o&usg=AOvVaw3ctOceA0AWj8e-dU-4UPWL

Title: Visited Sunak l

### 7.3: Verify URL Field Availability

**Objective:** Check that the titleUrl field is available in the data and accessible for processing.

**What this section does:**
- Displays titles and corresponding Google redirect URLs
- Verifies data structure has necessary fields

In [None]:
# Test the updated web scraping with actual data
test_row = page_visits_df.iloc[1]  # Second row
test_title = test_row['title']
test_url = test_row['titleUrl']

print(f"Testing with actual page visit:")
print(f"Title: {test_title}")
print(f"Google URL: {test_url[:80]}...")
print()

# Clean the URL
cleaned_url = clean_google_url(test_url)
print(f"Cleaned URL: {cleaned_url}")
print()

# Get content
content = get_page_visit_content(test_title, test_url)
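# Classify by prefix; splitting on ':' would mislabel plain titles that contain a colon
content_type = "Website" if content.startswith("Website:") else "URL" if content.startswith("URL:") else "Title only"
print(f"Content type: {content_type}")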
print(f"Content preview (first 300 chars):")
print(content[:300] + "..." if len(content) > 300 else content)

Testing with actual page visit:
Title: Visited Elon Musk and Shivon Zilis privately welcome third baby – NBC10 ...
Google URL: https://www.google.com/url?q=https://www.nbcphiladelphia.com/entertainment/enter...

Cleaned URL: https://www.nbcphiladelphia.com/entertainment/entertainment-news/elon-musk-and-shivon-zilis-privately-welcome-third-baby/3892694/

Content type: Website
Content preview (first 300 chars):
Website: https://www.nbcphiladelphia.com/entertainment/entertainment-news/elon-musk-and-shivon-zilis-privately-welcome-third-baby/3892694/
Content: Elon Musk and Shivon Zilis privately welcome third baby – NBC10 Philadelphia Skip to content Celebrity News Elon Musk and Shivon Zilis privately welcome...


### 7.4: Test Web Scraping with Real Data

**Objective:** Test the complete web scraping pipeline with actual data from the dataset.

**What this section does:**
- Tests URL cleaning from Google redirects
- Demonstrates the complete content extraction flow
- Shows content type and preview
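The redirect-cleaning step relies on `urllib.parse.parse_qs`, which both splits the query string and percent-decodes its values. The sketch below shows this on one of the redirect URLs from the dataset; `unwrap_google_redirect` is an illustrative variant of `clean_google_url()` that checks the host and path explicitly instead of using a substring test, which is an assumption about what counts as a redirect.

```python
from urllib.parse import urlparse, parse_qs

def unwrap_google_redirect(url):
    """Return the `q` target of a google.com/url redirect, or the URL unchanged."""
    parsed = urlparse(url)
    is_redirect = (
        parsed.netloc.lower() in ("google.com", "www.google.com")
        and parsed.path == "/url"
    )
    if is_redirect:
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]  # already percent-decoded by parse_qs
    return url

redirect = (
    "https://www.google.com/url?q=https://www.businessinsider.com/"
    "shivon-zilis-reported-mother-elon-musk-twins-2022-7%3Famp"
    "&usg=AOvVaw1JpQbDqah1O4c5A5wg4Who"
)
print(unwrap_google_redirect(redirect))
```

Note that the encoded `%3Famp` comes back as `?amp`, so the cleaned URL matches the one shown in the page visit title.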

In [None]:
# Test the complete extraction flow with a small sample
print("=== TESTING COMPLETE EXTRACTION FLOW ===\n")

# Create a small test sample
test_visits = page_visits_df.head(10).copy()
title_to_url_test = test_visits.groupby('title')['titleUrl'].first().to_dict()
unique_test_titles = list(title_to_url_test.keys())

print(f"Processing {len(unique_test_titles)} test page visits...\n")

test_results = {}
for i, title in enumerate(unique_test_titles, 1):
    url = title_to_url_test.get(title)
    content = get_page_visit_content(title, url)
    
    # Show what we're extracting from
    content_type = "Scraped website" if content.startswith("Website:") else "URL parsing" if content.startswith("URL:") else "Title only"
    print(f"{i}. {title[:50]}... [{content_type}]")
    
    # Extract entities
    result = extract_entities_and_context_cached(content)
    test_results[title] = result
    
    print(f"   → Category: {result['category']}, Topic: {result['topic']}, Items: {result['items']}\n")

print("✓ Test complete!")

=== TESTING COMPLETE EXTRACTION FLOW ===

Processing 10 test page visits...



Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


1. Visited Balderton Essentials Guide to Employee Equ... [Scraped website]
   → Category: Business & Finance, Topic: Equity Guide, Items: ['Balderton Capital']

2. Visited Elon Musk and Shivon Zilis privately welco... [Scraped website]
   → Category: Entertainment, Topic: Celebrity news, Items: ['Elon Musk', 'Shivon Zilis', 'Neuralink', 'Tesla', 'SpaceX', 'Grimes', 'Justine Wilson']

3. Visited Made in Japan: Akio Morita And Sony: Amazo... [Scraped website]
   → Category: Books & Learning, Topic: Biography of Akio Morita and Sony, Items: ['Akio Morita', 'Sony', 'Penguin']

4. Visited Starmer: Sunak showing 'total lack of lead... [Scraped website]
   → Category: News & Media, Topic: Political leadership and accountability, Items: ['Keir Starmer', 'Rishi Sunak', 'Conservative candidates', 'Gambling Commission']

5. Visited Sunak looked like a man who was running th... [Scraped website]
   → Category: News & Media, Topic: Political event coverage, Items: ['Rishi Sunak', 'BBC', 'National S

### 7.6: Process All Page Visits with Web Scraping and NER

**Objective:** Apply the complete extraction pipeline to all page visits in the dataset.

**What this section does:**
- Processes all unique page visit titles through the complete pipeline
- Scrapes website content with automatic fallback strategies
- Applies OpenAI NER extraction to categorize and extract entities
- Tracks scraping success rates (full scrape vs URL parsing fallback)
- Maps results back to dataframe and displays category distribution
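Success tracking depends on the prefix of the string returned by `get_page_visit_content()`: `Website:` for scraped pages, `URL:` for the URL-parsing fallback, and the bare title when no URL is available. A minimal sketch of a tally that covers all three cases is given here; `content_source` and the sample strings are hypothetical and only illustrate the shape of the data.

```python
from collections import Counter

def content_source(content):
    """Classify a get_page_visit_content() result by its prefix."""
    if content.startswith("Website:"):
        return "scraped"
    if content.startswith("URL:"):
        return "url_fallback"
    return "title_only"

# Hypothetical results, shaped like the strings get_page_visit_content() returns
contents = [
    "Website: https://www.bbc.co.uk/news/videos/cx00z3pldz5o\nContent: ...",
    "URL: https://urban.co/en-gb/all-treatments\nSemantic context: urban.co en gb all treatments",
    "Visited Balderton Essentials Guide to Employee Equity - Balderton Capital",
]

tally = Counter(content_source(c) for c in contents)
total = sum(tally.values())
for source in ("scraped", "url_fallback", "title_only"):
    print(f"{source:12s} {tally[source]:3d} ({tally[source] / total:.0%})")
```

A pages-scraped count obtained this way cannot tell a full-URL scrape from a root-domain fallback, because both produce the `Website:` prefix.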

In [None]:
import time  # used for the pause between batches

# Extract and organize page visits using OpenAI NER with web scraping
print("=== EXTRACTING INFORMATION FROM PAGE VISITS ===\n")
print(f"Total unique page visit titles: {page_visits_df['title'].nunique()}")
print(f"Total page visit records: {len(page_visits_df)}\n")

# Create a mapping of titles to URLs for processing
# Since multiple visits might have same title but different URLs, we'll use the first URL for each title
title_to_url = page_visits_df.groupby('title')['titleUrl'].first().to_dict()

unique_titles = list(title_to_url.keys())
print(f"Processing {len(unique_titles)} unique page titles with web scraping + OpenAI GPT-4o-mini...")
print("Progress will be shown below.\n")

# Process in batches
unique_title_results = {}
batch_size = 5
scrape_success = 0  # content scraped from the page (full URL or root-domain fallback)
scrape_failed = 0   # scraping failed; URL text is used instead

for i in range(0, len(unique_titles), batch_size):
    batch = unique_titles[i:i + batch_size]
    
    for title in batch:
        # Get URL for this title
        url = title_to_url.get(title)
        
        # Get content using web scraping with fallback strategy
        content = get_page_visit_content(title, url)
        
        # Track scraping results
        if content.startswith("Website:"):
            scrape_success += 1
        elif content.startswith("URL:"):
            scrape_failed += 1
        
        # Extract entities from the scraped content
        unique_title_results[title] = extract_entities_and_context_cached(content)
    
    progress = min(i + batch_size, len(unique_titles))
    pct = (progress / len(unique_titles)) * 100
    print(f"Progress: {progress}/{len(unique_titles)} ({pct:.1f}%) | Scraped: {scrape_success} | Failed: {scrape_failed}")
    
    if i + batch_size < len(unique_titles):
        time.sleep(1)

print(f"\n✓ Page visit extraction complete!")
print(f"  Successfully scraped: {scrape_success}")
print(f"  Fallback to URL parsing: {scrape_failed}")
print(f"  Total processed: {len(unique_titles)}")

# Map results back
page_visits_df['ner_extraction'] = page_visits_df['title'].map(unique_title_results)
page_visits_df['semantic_category'] = page_visits_df['ner_extraction'].apply(lambda x: x['category'])
page_visits_df['topic'] = page_visits_df['ner_extraction'].apply(lambda x: x['topic'])
page_visits_df['items'] = page_visits_df['ner_extraction'].apply(lambda x: x['items'])

print("\nSample page visits with extracted information:\n")
display_visits = page_visits_df[['title', 'semantic_category', 'topic', 'items']].drop_duplicates(subset=['title']).head(10)
for idx, row in display_visits.iterrows():
    print(f"Page: {row['title'][:60]}...")
    print(f"  Category: {row['semantic_category']}")
    print(f"  Topic: {row['topic']}")
    print(f"  Items: {', '.join(row['items']) if row['items'] else 'None'}")
    print()

print(f"\n\nPage Visits by Semantic Category:")
print("=" * 60)
visit_category_dist = page_visits_df['semantic_category'].value_counts()
for category, count in visit_category_dist.items():
    pct = (count / len(page_visits_df)) * 100
    bar = "█" * (count // max(1, len(page_visits_df) // 30))
    print(f"{category:30s} {count:3d} ({pct:5.1f}%) {bar}")

=== EXTRACTING INFORMATION FROM PAGE VISITS ===

Total unique page visit titles: 19808
Total page visit records: 22496

Processing 19808 unique page titles with web scraping + OpenAI GPT-4o-mini...
Progress will be shown below.

Progress: 5/19808 (0.0%) | Scraped: 5 | Failed: 0
Progress: 10/19808 (0.1%) | Scraped: 9 | Failed: 1
Progress: 15/19808 (0.1%) | Scraped: 13 | Failed: 2
Progress: 20/19808 (0.1%) | Scraped: 18 | Failed: 2
Progress: 25/19808 (0.1%) | Scraped: 23 | Failed: 2
Progress: 30/19808 (0.2%) | Scraped: 28 | Failed: 2
Progress: 35/19808 (0.2%) | Scraped: 33 | Failed: 2
Progress: 40/19808 (0.2%) | Scraped: 34 | Failed:

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 125/19808 (0.6%) | Scraped: 91 | Failed: 34
Progress: 130/19808 (0.7%) | Scraped: 96 | Failed: 34
Progress: 135/19808 (0.7%) | Scraped: 101 | Failed: 34
Progress: 140/19808 (0.7%) | Scraped: 106 | Failed: 34
Progress: 145/19808 (0.7%) | Scraped: 110 | Failed: 35


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 150/19808 (0.8%) | Scraped: 114 | Failed: 36
Progress: 155/19808 (0.8%) | Scraped: 119 | Failed: 36
Progress: 160/19808 (0.8%) | Scraped: 124 | Failed: 36
Progress: 165/19808 (0.8%) | Scraped: 129 | Failed: 36
Progress: 170/19808 (0.9%) | Scraped: 133 | Failed: 37
Progress: 175/19808 (0.9%) | Scraped: 138 | Failed: 37
API error for 'Website: https://www.visittnt.com/south-india-tour...': Unterminated string starting at: line 4 column 380 (char 461)
Progress: 180/19808 (0.9%) | Scraped: 142 | Failed: 38
Progress: 185/19808

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 260/19808 (1.3%) | Scraped: 212 | Failed: 48


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 265/19808 (1.3%) | Scraped: 216 | Failed: 49
Progress: 270/19808 (1.4%) | Scraped: 220 | Failed: 50
Progress: 275/19808 (1.4%) | Scraped: 225 | Failed: 50
Progress: 280/19808 (1.4%) | Scraped: 228 | Failed: 52
Progress: 285/19808 (1.4%) | Scraped: 232 | Failed: 53
API error for 'Website: https://www.goodhousekeeping.com/uk/produ...': Unterminated string starting at: line 15 column 5 (char 510)
Progress: 290/19808 (1.5%) | Scraped: 236 | Failed: 54
Progress: 295/19808 (1.5%) | Scraped: 240 | Failed: 55
Progress: 300/19808 (

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 415/19808 (2.1%) | Scraped: 343 | Failed: 72
Progress: 420/19808 (2.1%) | Scraped: 348 | Failed: 72
Progress: 425/19808 (2.1%) | Scraped: 353 | Failed: 72
Progress: 430/19808 (2.2%) | Scraped: 358 | Failed: 72
Progress: 435/19808 (2.2%) | Scraped: 363 | Failed: 72
Progress: 440/19808 (2.2%) | Scraped: 368 | Failed: 72
API error for 'Website: https://www.53degreescapital.com/
Content...': Unterminated string starting at: line 4 column 379 (char 457)
Progress: 445/19808 (2.2%) | Scraped: 373 | Failed: 72
API error for 'Webs

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 455/19808 (2.3%) | Scraped: 382 | Failed: 73
Progress: 460/19808 (2.3%) | Scraped: 387 | Failed: 73
Progress: 465/19808 (2.3%) | Scraped: 392 | Failed: 73
Progress: 470/19808 (2.4%) | Scraped: 397 | Failed: 73
Progress: 475/19808 (2.4%) | Scraped: 402 | Failed: 73
Progress: 480/19808 (2.4%) | Scraped: 407 | Failed: 73
Progress: 485/19808 (2.4%) | Scraped: 412 | Failed: 73
Progress: 490/19808 (2.5%) | Scraped: 416 | Failed: 74
Progress: 495/19808 (2.5%) | Scraped: 421 | Failed: 74
Progress: 500/19808 (2.5%) | Scraped: 426 | Failed: 74
Progress: 

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 605/19808 (3.1%) | Scraped: 515 | Failed: 90
Progress: 610/19808 (3.1%) | Scraped: 519 | Failed: 91


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 615/19808 (3.1%) | Scraped: 524 | Failed: 91
Progress: 620/19808 (3.1%) | Scraped: 526 | Failed: 94
Progress: 625/19808 (3.2%) | Scraped: 530 | Failed: 95
Progress: 630/19808 (3.2%) | Scraped: 535 | Failed: 95
Progress: 635/19808 (3.2%) | Scraped: 539 | Failed: 96
Progress: 640/19808 (3.2%) | Scraped: 544 | Failed: 96
Progress: 645/19808 (3.3%) | Scraped: 548 | Failed: 97
Progress: 650/19808 (3.3%) | Scraped: 553 | Failed: 97
Progress: 655/19808 (3.3%) | Scraped: 557 | Failed: 98
Progress: 660/19808 (3.3%) | Scraped: 562 | Failed: 98
Progress: 

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 670/19808 (3.4%) | Scraped: 571 | Failed: 99
Progress: 675/19808 (3.4%) | Scraped: 575 | Failed: 100
Progress: 680/19808 (3.4%) | Scraped: 580 | Failed: 100
Progress: 685/19808 (3.5%) | Scraped: 585 | Failed: 100
Progress: 690/19808 (3.5%) | Scraped: 589 | Failed: 101
Progress: 695/19808 (3.5%) | Scraped: 594 | Failed: 101
Progress: 700/19808 (3.5%) | Scraped: 599 | Failed: 101
Progress: 705/19808 (3.6%) | Scraped: 604 | Failed: 101
Progress: 710/19808 (3.6%) | Scraped: 609 | Failed: 101


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 715/19808 (3.6%) | Scraped: 614 | Failed: 101
Progress: 720/19808 (3.6%) | Scraped: 618 | Failed: 102
Progress: 725/19808 (3.7%) | Scraped: 623 | Failed: 102
Progress: 730/19808 (3.7%) | Scraped: 628 | Failed: 102
Progress: 735/19808 (3.7%) | Scraped: 633 | Failed: 102
Progress: 740/19808 (3.7%) | Scraped: 638 | Failed: 102
Progress: 745/19808 (3.8%) | Scraped: 642 | Failed: 103
Progress: 750/19808 (3.8%) | Scraped: 646 | Failed: 104
Progress: 755/19808 (3.8%) | Scraped: 649 | Failed: 106
Progress: 760/19808 (3.8%) | Scraped: 653 | Fail

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 895/19808 (4.5%) | Scraped: 774 | Failed: 121
Progress: 900/19808 (4.5%) | Scraped: 779 | Failed: 121
Progress: 905/19808 (4.6%) | Scraped: 783 | Failed: 122
Progress: 910/19808 (4.6%) | Scraped: 788 | Failed: 122
Progress: 915/19808 (4.6%) | Scraped: 793 | Failed: 122
Progress: 920/19808 (4.6%) | Scraped: 798 | Failed: 122
Progress: 925/19808 (4.7%) | Scraped: 802 | Failed: 123
Progress: 930/19808 (4.7%) | Scraped: 806 | Failed: 124
Progress: 935/19808 (4.7%) | Scraped: 811 | Failed: 124
Progress: 940/19808 (4.7%) | Scraped: 816 | Fail

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 1370/19808 (6.9%) | Scraped: 1193 | Failed: 177
Progress: 1375/19808 (6.9%) | Scraped: 1197 | Failed: 178
Progress: 1380/19808 (7.0%) | Scraped: 1202 | Failed: 178
Progress: 1385/19808 (7.0%) | Scraped: 1207 | Failed: 178
Progress: 1390/19808 (7.0%) | Scraped: 1212 | Failed: 178
Progress: 1395/19808 (7.0%) | Scraped: 1215 | Failed: 180
Progress: 1400/19808 (7.1%) | Scraped: 1217 | Failed: 183
Progress: 1405/19808 (7.1%) | Scraped: 1222 | Failed: 183
Progress: 1410/19808 (7.1%) | Scraped: 1227 | Failed: 183
Progress: 1415

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 1455/19808 (7.3%) | Scraped: 1268 | Failed: 187
Progress: 1460/19808 (7.4%) | Scraped: 1273 | Failed: 187
API error for 'Website: https://www.healthstatus.com/health_blog/...': Unterminated string starting at: line 4 column 304 (char 379)
Progress: 1465/19808 (7.4%) | Scraped: 1277 | Failed: 188
Progress: 1470/19808 (7.4%) | Scraped: 1281 | Failed: 189
Progress: 1475/19808 (7.4%) | Scraped: 1285 | Failed: 190
Progress: 1480/19808 (7.5%) | Scraped: 1288 | Failed: 192
Progress: 1485/19808 (7.5%) | Scraped: 1293 | Failed: 192
Progress: 1485/19808 (7.5%) | Scraped:

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 1920/19808 (9.7%) | Scraped: 1663 | Failed: 257
Progress: 1925/19808 (9.7%) | Scraped: 1668 | Failed: 257
Progress: 1930/19808 (9.7%) | Scraped: 1672 | Failed: 258
Progress: 1935/19808 (9.8%) | Scraped: 1677 | Failed: 258
Progress: 1940/19808 (9.8%) | Scraped: 1681 | Failed: 259
Progress: 1945/19808 (9.8%) | Scraped: 1683 | Failed: 262
Progress: 1950/19808 (9.8%) | Scraped: 1688 | Failed: 262
Progress: 1955/19808 (9.9%) | Scraped: 1692 | Failed: 263
Progress: 1960/19808 (9.9%) | Scraped: 1697 | Failed: 263
Progress: 1965

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2095/19808 (10.6%) | Scraped: 1807 | Failed: 288
Progress: 2100/19808 (10.6%) | Scraped: 1810 | Failed: 290


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2105/19808 (10.6%) | Scraped: 1814 | Failed: 291
Progress: 2110/19808 (10.7%) | Scraped: 1819 | Failed: 291


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2115/19808 (10.7%) | Scraped: 1824 | Failed: 291
API error for 'Website: https://www.conceptventures.vc/
Content: ...': Unterminated string starting at: line 4 column 348 (char 428)
Progress: 2120/19808 (10.7%) | Scraped: 1829 | Failed: 291
Progress: 2125/19808 (10.7%) | Scraped: 1833 | Failed: 292
Progress: 2130/19808 (10.8%) | Scraped: 1838 | Failed: 292
Progress: 2135/19808 (10.8%) | Scraped: 1841 | Failed: 294
Progress: 2140/19808 (10.8%) | Scraped: 1845 | Failed: 295
Progress: 2145/19808 (10.8%) | Scraped: 1850 | Failed: 295
Progress: 2145/19808 (10.8

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2205/19808 (11.1%) | Scraped: 1905 | Failed: 300
Progress: 2210/19808 (11.2%) | Scraped: 1910 | Failed: 300
Progress: 2215/19808 (11.2%) | Scraped: 1914 | Failed: 301
Progress: 2220/19808 (11.2%) | Scraped: 1919 | Failed: 301
Progress: 2225/19808 (11.2%) | Scraped: 1924 | Failed: 301
Progress: 2230/19808 (11.3%) | Scraped: 1929 | Failed: 301
Progress: 2235/19808 (11.3%) | Scraped: 1931 | Failed: 304
Progress: 2240/19808 (11.3%) | Scraped: 1934 | Failed: 306
Progress: 2245/19808 (11.3%) | Scraped: 1939 | Failed: 306
Progress: 2245/19808 (11.3%) | Scraped: 1939 | Failed: 3

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2255/19808 (11.4%) | Scraped: 1948 | Failed: 307
Progress: 2260/19808 (11.4%) | Scraped: 1952 | Failed: 308
Progress: 2265/19808 (11.4%) | Scraped: 1956 | Failed: 309
Progress: 2270/19808 (11.5%) | Scraped: 1960 | Failed: 310
Progress: 2275/19808 (11.5%) | Scraped: 1965 | Failed: 310
Progress: 2280/19808 (11.5%) | Scraped: 1967 | Failed: 313
Progress: 2285/19808 (11.5%) | Scraped: 1971 | Failed: 314
Progress: 2290/19808 (11.6%) | Scraped: 1976 | Failed: 314
Progress: 2295/19808 (11.6%) | Scraped: 1981 | Failed: 314

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2385/19808 (12.0%) | Scraped: 2054 | Failed: 331
Progress: 2390/19808 (12.1%) | Scraped: 2059 | Failed: 331
Progress: 2395/19808 (12.1%) | Scraped: 2063 | Failed: 332
Progress: 2400/19808 (12.1%) | Scraped: 2068 | Failed: 332
Progress: 2405/19808 (12.1%) | Scraped: 2073 | Failed: 332
Progress: 2410/19808 (12.2%) | Scraped: 2078 | Failed: 332
API error for 'Website: https://www.reddit.com/r/suggestmeabook/c...': Unterminated string starting at: line 23 column 5 (char 514)
Progress: 2415/19808 (12.2%) | Scraped: 2081 | Failed: 334

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2500/19808 (12.6%) | Scraped: 2161 | Failed: 339
Progress: 2505/19808 (12.6%) | Scraped: 2166 | Failed: 339
Progress: 2510/19808 (12.7%) | Scraped: 2171 | Failed: 339
Progress: 2515/19808 (12.7%) | Scraped: 2176 | Failed: 339
Progress: 2520/19808 (12.7%) | Scraped: 2180 | Failed: 340
Progress: 2525/19808 (12.7%) | Scraped: 2185 | Failed: 340
Progress: 2530/19808 (12.8%) | Scraped: 2190 | Failed: 340
Progress: 2535/19808 (12.8%) | Scraped: 2195 | Failed: 340
Progress: 2540/19808 (12.8%) | Scraped: 2200 | Failed: 340

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2615/19808 (13.2%) | Scraped: 2265 | Failed: 350
Progress: 2620/19808 (13.2%) | Scraped: 2269 | Failed: 351
Progress: 2625/19808 (13.3%) | Scraped: 2274 | Failed: 351
Progress: 2630/19808 (13.3%) | Scraped: 2279 | Failed: 351
Progress: 2635/19808 (13.3%) | Scraped: 2284 | Failed: 351
Progress: 2640/19808 (13.3%) | Scraped: 2288 | Failed: 352
Progress: 2645/19808 (13.4%) | Scraped: 2292 | Failed: 353
Progress: 2650/19808 (13.4%) | Scraped: 2297 | Failed: 353
Progress: 2655/19808 (13.4%) | Scraped: 2302 | Failed: 353

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2715/19808 (13.7%) | Scraped: 2361 | Failed: 354
Progress: 2720/19808 (13.7%) | Scraped: 2365 | Failed: 355
Progress: 2725/19808 (13.8%) | Scraped: 2369 | Failed: 356
Progress: 2730/19808 (13.8%) | Scraped: 2373 | Failed: 357
Progress: 2735/19808 (13.8%) | Scraped: 2378 | Failed: 357
Progress: 2740/19808 (13.8%) | Scraped: 2383 | Failed: 357
Progress: 2745/19808 (13.9%) | Scraped: 2388 | Failed: 357
Progress: 2750/19808 (13.9%) | Scraped: 2393 | Failed: 357
Progress: 2755/19808 (13.9%) | Scraped: 2398 | Failed: 357

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2775/19808 (14.0%) | Scraped: 2416 | Failed: 359
Progress: 2780/19808 (14.0%) | Scraped: 2420 | Failed: 360
Progress: 2785/19808 (14.1%) | Scraped: 2425 | Failed: 360
Progress: 2790/19808 (14.1%) | Scraped: 2430 | Failed: 360
API error for 'Website: https://www.joinef.com/
Content: Found, d...': Unterminated string starting at: line 27 column 5 (char 484)
Progress: 2795/19808 (14.1%) | Scraped: 2434 | Failed: 361
Progress: 2800/19808 (14.1%) | Scraped: 2439 | Failed: 361


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2805/19808 (14.2%) | Scraped: 2443 | Failed: 362
Progress: 2810/19808 (14.2%) | Scraped: 2448 | Failed: 362
Progress: 2815/19808 (14.2%) | Scraped: 2453 | Failed: 362
Progress: 2820/19808 (14.2%) | Scraped: 2457 | Failed: 363
Progress: 2825/19808 (14.3%) | Scraped: 2461 | Failed: 364
Progress: 2830/19808 (14.3%) | Scraped: 2465 | Failed: 365
Progress: 2835/19808 (14.3%) | Scraped: 2470 | Failed: 365
Progress: 2840/19808 (14.3%) | Scraped: 2475 | Failed: 365
Progress: 2845/19808 (14.4%) | Scraped: 2479 | Failed: 366

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2915/19808 (14.7%) | Scraped: 2539 | Failed: 376


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 2920/19808 (14.7%) | Scraped: 2544 | Failed: 376
Progress: 2925/19808 (14.8%) | Scraped: 2547 | Failed: 378
Progress: 2930/19808 (14.8%) | Scraped: 2552 | Failed: 378
API error for 'Website: https://www.federalreserve.gov/releases/l...': Unterminated string starting at: line 22 column 5 (char 429)
Progress: 2935/19808 (14.8%) | Scraped: 2556 | Failed: 379
Progress: 2940/19808 (14.8%) | Scraped: 2561 | Failed: 379
Progress: 2945/19808 (14.9%) | Scraped: 2565 | Failed: 380
Progress: 2950/19808 (14.9%) | Scraped: 2570 | Failed: 380

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 3035/19808 (15.3%) | Scraped: 2645 | Failed: 390
Progress: 3040/19808 (15.3%) | Scraped: 2649 | Failed: 391
Progress: 3045/19808 (15.4%) | Scraped: 2654 | Failed: 391
Progress: 3050/19808 (15.4%) | Scraped: 2659 | Failed: 391
Progress: 3055/19808 (15.4%) | Scraped: 2662 | Failed: 393
Progress: 3060/19808 (15.4%) | Scraped: 2667 | Failed: 393
Progress: 3065/19808 (15.5%) | Scraped: 2671 | Failed: 394
Progress: 3070/19808 (15.5%) | Scraped: 2676 | Failed: 394
API error for 'Website: https://www.fintechdesignsummit.com/
Cont...': Unterminated string starting at: line 4 colu

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 3170/19808 (16.0%) | Scraped: 2762 | Failed: 408
Progress: 3175/19808 (16.0%) | Scraped: 2766 | Failed: 409
Progress: 3180/19808 (16.1%) | Scraped: 2769 | Failed: 411
Progress: 3185/19808 (16.1%) | Scraped: 2774 | Failed: 411
Progress: 3190/19808 (16.1%) | Scraped: 2779 | Failed: 411
Progress: 3195/19808 (16.1%) | Scraped: 2783 | Failed: 412
Progress: 3200/19808 (16.2%) | Scraped: 2787 | Failed: 413
Progress: 3205/19808 (16.2%) | Scraped: 2792 | Failed: 413
Progress: 3210/19808 (16.2%) | Scraped: 2797 | Failed: 413

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 3815/19808 (19.3%) | Scraped: 3321 | Failed: 494
Progress: 3820/19808 (19.3%) | Scraped: 3324 | Failed: 496
Progress: 3825/19808 (19.3%) | Scraped: 3328 | Failed: 497
Progress: 3830/19808 (19.3%) | Scraped: 3330 | Failed: 500
Progress: 3835/19808 (19.4%) | Scraped: 3334 | Failed: 501
Progress: 3840/19808 (19.4%) | Scraped: 3339 | Failed: 501
Progress: 3845/19808 (19.4%) | Scraped: 3344 | Failed: 501
Progress: 3850/19808 (19.4%) | Scraped: 3348 | Failed: 502
Progress: 3855/19808 (19.5%) | Scraped: 3353 | Failed: 502

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 3945/19808 (19.9%) | Scraped: 3431 | Failed: 514
Progress: 3950/19808 (19.9%) | Scraped: 3436 | Failed: 514
Progress: 3955/19808 (20.0%) | Scraped: 3441 | Failed: 514
Progress: 3960/19808 (20.0%) | Scraped: 3445 | Failed: 515
Progress: 3965/19808 (20.0%) | Scraped: 3449 | Failed: 516
Progress: 3970/19808 (20.0%) | Scraped: 3454 | Failed: 516
Progress: 3975/19808 (20.1%) | Scraped: 3458 | Failed: 517
Progress: 3980/19808 (20.1%) | Scraped: 3463 | Failed: 517
Progress: 3985/19808 (20.1%) | Scraped: 3466 | Failed: 519

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 4080/19808 (20.6%) | Scraped: 3548 | Failed: 532
Progress: 4085/19808 (20.6%) | Scraped: 3551 | Failed: 534
Progress: 4090/19808 (20.6%) | Scraped: 3554 | Failed: 536
Progress: 4095/19808 (20.7%) | Scraped: 3557 | Failed: 538
Progress: 4100/19808 (20.7%) | Scraped: 3562 | Failed: 538
Progress: 4105/19808 (20.7%) | Scraped: 3564 | Failed: 541
Progress: 4110/19808 (20.7%) | Scraped: 3569 | Failed: 541
Progress: 4115/19808 (20.8%) | Scraped: 3574 | Failed: 541
Progress: 4120/19808 (20.8%) | Scraped: 3578 | Failed: 542

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 4290/19808 (21.7%) | Scraped: 3719 | Failed: 571
Progress: 4295/19808 (21.7%) | Scraped: 3723 | Failed: 572
Progress: 4300/19808 (21.7%) | Scraped: 3727 | Failed: 573
Progress: 4305/19808 (21.7%) | Scraped: 3731 | Failed: 574
Progress: 4310/19808 (21.8%) | Scraped: 3736 | Failed: 574
Progress: 4315/19808 (21.8%) | Scraped: 3741 | Failed: 574
Progress: 4320/19808 (21.8%) | Scraped: 3742 | Failed: 578
Progress: 4325/19808 (21.8%) | Scraped: 3746 | Failed: 579
Progress: 4330/19808 (21.9%) | Scraped: 3750 | Failed: 580

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 4355/19808 (22.0%) | Scraped: 3772 | Failed: 583
Progress: 4360/19808 (22.0%) | Scraped: 3777 | Failed: 583


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 4365/19808 (22.0%) | Scraped: 3780 | Failed: 585
Progress: 4370/19808 (22.1%) | Scraped: 3785 | Failed: 585
Progress: 4375/19808 (22.1%) | Scraped: 3789 | Failed: 586
Progress: 4380/19808 (22.1%) | Scraped: 3792 | Failed: 588
Progress: 4385/19808 (22.1%) | Scraped: 3797 | Failed: 588
Progress: 4390/19808 (22.2%) | Scraped: 3802 | Failed: 588
Progress: 4395/19808 (22.2%) | Scraped: 3806 | Failed: 589
Progress: 4400/19808 (22.2%) | Scraped: 3811 | Failed: 589
API error for 'Website: https://www.goindigo.in/baggage/baggage-a...': Unterminated string starting at: line 4 colu

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 4535/19808 (22.9%) | Scraped: 3932 | Failed: 603
Progress: 4540/19808 (22.9%) | Scraped: 3937 | Failed: 603
Progress: 4545/19808 (22.9%) | Scraped: 3941 | Failed: 604
Progress: 4550/19808 (23.0%) | Scraped: 3944 | Failed: 606
Progress: 4555/19808 (23.0%) | Scraped: 3948 | Failed: 607
Progress: 4560/19808 (23.0%) | Scraped: 3952 | Failed: 608
Progress: 4565/19808 (23.0%) | Scraped: 3955 | Failed: 610
Progress: 4570/19808 (23.1%) | Scraped: 3960 | Failed: 610
Progress: 4575/19808 (23.1%) | Scraped: 3963 | Failed: 612

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 4595/19808 (23.2%) | Scraped: 3981 | Failed: 614
Progress: 4600/19808 (23.2%) | Scraped: 3985 | Failed: 615
Progress: 4605/19808 (23.2%) | Scraped: 3988 | Failed: 617
Progress: 4610/19808 (23.3%) | Scraped: 3993 | Failed: 617
Progress: 4615/19808 (23.3%) | Scraped: 3996 | Failed: 619
Progress: 4620/19808 (23.3%) | Scraped: 4001 | Failed: 619
Progress: 4625/19808 (23.3%) | Scraped: 4005 | Failed: 620


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 4630/19808 (23.4%) | Scraped: 4010 | Failed: 620
Progress: 4635/19808 (23.4%) | Scraped: 4015 | Failed: 620
Progress: 4640/19808 (23.4%) | Scraped: 4018 | Failed: 622
Progress: 4645/19808 (23.5%) | Scraped: 4022 | Failed: 623
Progress: 4650/19808 (23.5%) | Scraped: 4026 | Failed: 624
Progress: 4655/19808 (23.5%) | Scraped: 4030 | Failed: 625
Progress: 4660/19808 (23.5%) | Scraped: 4035 | Failed: 625
Progress: 4665/19808 (23.6%) | Scraped: 4039 | Failed: 626
Progress: 4670/19808 (23.6%) | Scraped: 4043 | Failed: 627

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 4930/19808 (24.9%) | Scraped: 4253 | Failed: 677
Progress: 4935/19808 (24.9%) | Scraped: 4256 | Failed: 679
Progress: 4940/19808 (24.9%) | Scraped: 4260 | Failed: 680
Progress: 4945/19808 (25.0%) | Scraped: 4265 | Failed: 680
Progress: 4950/19808 (25.0%) | Scraped: 4269 | Failed: 681
API error for 'Website: https://lsvp.com/
Content: Lightspeed Ven...': Unterminated string starting at: line 25 column 5 (char 491)
Progress: 4955/19808 (25.0%) | Scraped: 4273 | Failed: 682
Progress: 4960/19808 (25.0%) | Scraped: 4278 | Failed: 682

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 5375/19808 (27.1%) | Scraped: 4622 | Failed: 753
Progress: 5380/19808 (27.2%) | Scraped: 4627 | Failed: 753
Progress: 5385/19808 (27.2%) | Scraped: 4632 | Failed: 753
Progress: 5390/19808 (27.2%) | Scraped: 4637 | Failed: 753
Progress: 5395/19808 (27.2%) | Scraped: 4641 | Failed: 754
Progress: 5400/19808 (27.3%) | Scraped: 4645 | Failed: 755
Progress: 5405/19808 (27.3%) | Scraped: 4648 | Failed: 757
Progress: 5410/19808 (27.3%) | Scraped: 4652 | Failed: 758
Progress: 5415/19808 (27.3%) | Scraped: 4657 | Failed: 758

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 5445/19808 (27.5%) | Scraped: 4683 | Failed: 762
Progress: 5450/19808 (27.5%) | Scraped: 4687 | Failed: 763
Progress: 5455/19808 (27.5%) | Scraped: 4692 | Failed: 763


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 5460/19808 (27.6%) | Scraped: 4696 | Failed: 764
Progress: 5465/19808 (27.6%) | Scraped: 4701 | Failed: 764
Progress: 5470/19808 (27.6%) | Scraped: 4705 | Failed: 765
Progress: 5475/19808 (27.6%) | Scraped: 4710 | Failed: 765
Progress: 5480/19808 (27.7%) | Scraped: 4715 | Failed: 765
Progress: 5485/19808 (27.7%) | Scraped: 4719 | Failed: 766
Progress: 5490/19808 (27.7%) | Scraped: 4724 | Failed: 766
Progress: 5495/19808 (27.7%) | Scraped: 4728 | Failed: 767
Progress: 5500/19808 (27.8%) | Scraped: 4733 | Failed: 767

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 5785/19808 (29.2%) | Scraped: 4976 | Failed: 809
Progress: 5790/19808 (29.2%) | Scraped: 4981 | Failed: 809
API error for 'Website: https://octopusventures.com/
Content: Oct...': Unterminated string starting at: line 4 column 372 (char 447)
Progress: 5795/19808 (29.3%) | Scraped: 4986 | Failed: 809
Progress: 5800/19808 (29.3%) | Scraped: 4990 | Failed: 810
Progress: 5805/19808 (29.3%) | Scraped: 4995 | Failed: 810
Progress: 5810/19808 (29.3%) | Scraped: 5000 | Failed: 810
Progress: 5815/19808 (29.4%) | Scraped: 5003 | Failed: 812

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 5885/19808 (29.7%) | Scraped: 5063 | Failed: 822
Progress: 5890/19808 (29.7%) | Scraped: 5068 | Failed: 822
Progress: 5895/19808 (29.8%) | Scraped: 5073 | Failed: 822
Progress: 5900/19808 (29.8%) | Scraped: 5078 | Failed: 822
Progress: 5905/19808 (29.8%) | Scraped: 5083 | Failed: 822
Progress: 5910/19808 (29.8%) | Scraped: 5088 | Failed: 822
API error for 'Website: https://www.backed.vc/faqs/
Content: Buil...': Unterminated string starting at: line 4 column 354 (char 436)
Progress: 5915/19808 (29.9%) | Scraped: 5092 | Failed: 823

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 5940/19808 (30.0%) | Scraped: 5114 | Failed: 826
Progress: 5945/19808 (30.0%) | Scraped: 5117 | Failed: 828


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 5950/19808 (30.0%) | Scraped: 5121 | Failed: 829
Progress: 5955/19808 (30.1%) | Scraped: 5126 | Failed: 829
Progress: 5960/19808 (30.1%) | Scraped: 5129 | Failed: 831
Progress: 5965/19808 (30.1%) | Scraped: 5133 | Failed: 832
Progress: 5970/19808 (30.1%) | Scraped: 5138 | Failed: 832
API error for 'Website: https://www.packtpub.com/
Content: Packt ...': Unterminated string starting at: line 20 column 5 (char 615)
Progress: 5975/19808 (30.2%) | Scraped: 5143 | Failed: 832
Progress: 5980/19808 (30.2%) | Scraped: 5148 | Failed: 832

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6010/19808 (30.3%) | Scraped: 5174 | Failed: 836
Progress: 6015/19808 (30.4%) | Scraped: 5179 | Failed: 836
Progress: 6020/19808 (30.4%) | Scraped: 5184 | Failed: 836


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6025/19808 (30.4%) | Scraped: 5189 | Failed: 836
Progress: 6030/19808 (30.4%) | Scraped: 5193 | Failed: 837
Progress: 6035/19808 (30.5%) | Scraped: 5197 | Failed: 838
Progress: 6040/19808 (30.5%) | Scraped: 5201 | Failed: 839
Progress: 6045/19808 (30.5%) | Scraped: 5206 | Failed: 839
Progress: 6050/19808 (30.5%) | Scraped: 5210 | Failed: 840
Progress: 6055/19808 (30.6%) | Scraped: 5214 | Failed: 841
Progress: 6060/19808 (30.6%) | Scraped: 5219 | Failed: 841
Progress: 6065/19808 (30.6%) | Scraped: 5223 | Failed: 842

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6100/19808 (30.8%) | Scraped: 5256 | Failed: 844
Progress: 6105/19808 (30.8%) | Scraped: 5260 | Failed: 845
Progress: 6110/19808 (30.8%) | Scraped: 5265 | Failed: 845
Progress: 6115/19808 (30.9%) | Scraped: 5270 | Failed: 845
Progress: 6120/19808 (30.9%) | Scraped: 5275 | Failed: 845
Progress: 6125/19808 (30.9%) | Scraped: 5280 | Failed: 845
Progress: 6130/19808 (30.9%) | Scraped: 5283 | Failed: 847
Progress: 6135/19808 (31.0%) | Scraped: 5287 | Failed: 848
Progress: 6140/19808 (31.0%) | Scraped: 5291 | Failed: 849

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6220/19808 (31.4%) | Scraped: 5358 | Failed: 862
Progress: 6225/19808 (31.4%) | Scraped: 5363 | Failed: 862
Progress: 6230/19808 (31.5%) | Scraped: 5368 | Failed: 862
Progress: 6235/19808 (31.5%) | Scraped: 5372 | Failed: 863
Progress: 6240/19808 (31.5%) | Scraped: 5377 | Failed: 863
Progress: 6245/19808 (31.5%) | Scraped: 5382 | Failed: 863
Progress: 6250/19808 (31.6%) | Scraped: 5387 | Failed: 863
Progress: 6255/19808 (31.6%) | Scraped: 5391 | Failed: 864
Progress: 6260/19808 (31.6%) | Scraped: 5396 | Failed: 864

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6405/19808 (32.3%) | Scraped: 5514 | Failed: 891
Progress: 6410/19808 (32.4%) | Scraped: 5519 | Failed: 891
API error for 'Website: https://www.telegraph.co.uk/fashion/shopp...': Unterminated string starting at: line 27 column 5 (char 459)
Progress: 6415/19808 (32.4%) | Scraped: 5523 | Failed: 892
Progress: 6420/19808 (32.4%) | Scraped: 5527 | Failed: 893
Progress: 6425/19808 (32.4%) | Scraped: 5531 | Failed: 894
Progress: 6430/19808 (32.5%) | Scraped: 5536 | Failed: 894
Progress: 6435/19808 (32.5%) | Scraped: 5540 | Failed: 895

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6525/19808 (32.9%) | Scraped: 5616 | Failed: 909
Progress: 6530/19808 (33.0%) | Scraped: 5621 | Failed: 909
Progress: 6530/19808 (33.0%) | Scraped: 5621 | Failed: 909
API error for 'Website: https://relliklondon.co.uk/
Content: Rell...': Unterminated string starting at: line 12 column 5 (char 463)
Progress: 6535/19808 (33.0%) | Scraped: 5626 | Failed: 909


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6540/19808 (33.0%) | Scraped: 5631 | Failed: 909
Progress: 6545/19808 (33.0%) | Scraped: 5636 | Failed: 909
Progress: 6550/19808 (33.1%) | Scraped: 5641 | Failed: 909
Progress: 6555/19808 (33.1%) | Scraped: 5645 | Failed: 910
API error for 'Website: https://www.brainyquote.com/topics/respon...': Expecting value: line 27 column 4 (char 517)
Progress: 6560/19808 (33.1%) | Scraped: 5649 | Failed: 911
Progress: 6565/19808 (33.1%) | Scraped: 5654 | Failed: 911
Progress: 6570/19808 (33.2%) | Scraped: 5658 | Failed: 912
Progress: 6575/19808 (33.2%) | Scraped: 5663 | Failed: 912
Progress: 6580/19808 (33.2%) | Scraped: 5667 | Failed: 913
Progress: 6585/19808 (33.2%) | Scraped: 5671 | Failed: 914
Progress: 6590/19808 (33.3%) | Scraped: 5675 | Failed: 915
Progress: 6595/19808 (33.3%) | Scraped: 5679 | Failed: 916
Progress: 6600/19808 (33.3%) | Scraped: 5679 | Failed: 921
Progress: 6605/19808 (33.3%) | Scraped: 5681 | Failed: 924
Progress: 6610/19808 (33.4%) | Scraped: 5683 | Failed: 927

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6690/19808 (33.8%) | Scraped: 5746 | Failed: 944
Progress: 6695/19808 (33.8%) | Scraped: 5751 | Failed: 944
Progress: 6700/19808 (33.8%) | Scraped: 5754 | Failed: 946
Progress: 6705/19808 (33.8%) | Scraped: 5758 | Failed: 947
Progress: 6710/19808 (33.9%) | Scraped: 5763 | Failed: 947
Progress: 6715/19808 (33.9%) | Scraped: 5768 | Failed: 947


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6720/19808 (33.9%) | Scraped: 5772 | Failed: 948
Progress: 6725/19808 (34.0%) | Scraped: 5777 | Failed: 948


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6730/19808 (34.0%) | Scraped: 5781 | Failed: 949
Progress: 6735/19808 (34.0%) | Scraped: 5785 | Failed: 950
Progress: 6740/19808 (34.0%) | Scraped: 5790 | Failed: 950
Progress: 6745/19808 (34.1%) | Scraped: 5795 | Failed: 950
Progress: 6750/19808 (34.1%) | Scraped: 5800 | Failed: 950
Progress: 6755/19808 (34.1%) | Scraped: 5805 | Failed: 950
Progress: 6760/19808 (34.1%) | Scraped: 5810 | Failed: 950
Progress: 6765/19808 (34.2%) | Scraped: 5815 | Failed: 950
Progress: 6770/19808 (34.2%) | Scraped: 5820 | Failed: 950
Progress: 6775/19808 (34.2%) | Scraped: 5825 | Failed: 950
API error for 'Website: https://razorsql.com/features/sqlite_gui....': Expecting value: line 4 column 364 (char 437)
Progress: 6780/19808 (34.2%) | Scraped: 5830 | Failed: 950
Progress: 6785/19808 (34.3%) | Scraped: 5833 | Failed: 952
Progress: 6790/19808 (34.3%) | Scraped: 5838 | Failed: 952
Progress: 6795/19808 (34.3%) | Scraped: 5842 | Failed: 953
Progress: 6800/19808 (34.3%) | Scraped: 5845 | Failed: 95

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6905/19808 (34.9%) | Scraped: 5930 | Failed: 975
Progress: 6910/19808 (34.9%) | Scraped: 5934 | Failed: 976
Progress: 6915/19808 (34.9%) | Scraped: 5939 | Failed: 976


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6920/19808 (34.9%) | Scraped: 5944 | Failed: 976
Progress: 6925/19808 (35.0%) | Scraped: 5949 | Failed: 976
Progress: 6930/19808 (35.0%) | Scraped: 5953 | Failed: 977
Progress: 6935/19808 (35.0%) | Scraped: 5957 | Failed: 978
Progress: 6940/19808 (35.0%) | Scraped: 5962 | Failed: 978
Progress: 6945/19808 (35.1%) | Scraped: 5966 | Failed: 979
Progress: 6950/19808 (35.1%) | Scraped: 5971 | Failed: 979
Progress: 6955/19808 (35.1%) | Scraped: 5974 | Failed: 981
Progress: 6960/19808 (35.1%) | Scraped: 5978 | Failed: 982
Progress: 6965/19808 (35.2%) | Scraped: 5983 | Failed: 982


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 6970/19808 (35.2%) | Scraped: 5987 | Failed: 983
Progress: 6975/19808 (35.2%) | Scraped: 5992 | Failed: 983
Progress: 6980/19808 (35.2%) | Scraped: 5996 | Failed: 984
Progress: 6985/19808 (35.3%) | Scraped: 6000 | Failed: 985
API error for 'Website: https://www.marksandspencer.com/l/food-to...': Unterminated string starting at: line 21 column 5 (char 612)
Progress: 6990/19808 (35.3%) | Scraped: 6005 | Failed: 985
API error for 'Website: https://www.delhiairport.com/shops-and-st...': Expecting value: line 25 column 1 (char 504)
Progress: 6995/19808 (35.3%) | Scraped: 6010 | Failed: 985
Progress: 7000/19808 (35.3%) | Scraped: 6015 | Failed: 985
Progress: 7005/19808 (35.4%) | Scraped: 6020 | Failed: 985
Progress: 7010/19808 (35.4%) | Scraped: 6024 | Failed: 986
Progress: 7015/19808 (35.4%) | Scraped: 6029 | Failed: 986
Progress: 7020/19808 (35.4%) | Scraped: 6034 | Failed: 986
Progress: 7025/19808 (35.5%) | Scraped: 6039 | Failed: 986
Progress: 7030/19808 (35.5%) | Scraped: 6042

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7075/19808 (35.7%) | Scraped: 6083 | Failed: 992
Progress: 7080/19808 (35.7%) | Scraped: 6088 | Failed: 992
Progress: 7085/19808 (35.8%) | Scraped: 6092 | Failed: 993
Progress: 7090/19808 (35.8%) | Scraped: 6097 | Failed: 993
Progress: 7095/19808 (35.8%) | Scraped: 6101 | Failed: 994
Progress: 7100/19808 (35.8%) | Scraped: 6106 | Failed: 994
Progress: 7105/19808 (35.9%) | Scraped: 6110 | Failed: 995
Progress: 7110/19808 (35.9%) | Scraped: 6115 | Failed: 995
API error for 'Website: https://www.wanderwomaniya.com/
Content: ...': Unterminated string starting at: line 4 column 324 (char 410)
Progress: 7115/19808 (35.9%) | Scraped: 6120 | Failed: 995
Progress: 7120/19808 (35.9%) | Scraped: 6125 | Failed: 995
Progress: 7125/19808 (36.0%) | Scraped: 6130 | Failed: 995
Progress: 7130/19808 (36.0%) | Scraped: 6135 | Failed: 995
API error for 'Website: https://www.birleysandwiches.co.uk/menus/...': Unterminated string starting at: line 4 column 463 (char 519)
Progress: 7135/19808 (36.0

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7205/19808 (36.4%) | Scraped: 6201 | Failed: 1004
Progress: 7210/19808 (36.4%) | Scraped: 6206 | Failed: 1004
Progress: 7215/19808 (36.4%) | Scraped: 6209 | Failed: 1006
Progress: 7220/19808 (36.4%) | Scraped: 6213 | Failed: 1007
Progress: 7225/19808 (36.5%) | Scraped: 6218 | Failed: 1007
Progress: 7230/19808 (36.5%) | Scraped: 6222 | Failed: 1008
Progress: 7235/19808 (36.5%) | Scraped: 6227 | Failed: 1008
Progress: 7240/19808 (36.6%) | Scraped: 6232 | Failed: 1008


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7245/19808 (36.6%) | Scraped: 6237 | Failed: 1008
Progress: 7250/19808 (36.6%) | Scraped: 6241 | Failed: 1009
Progress: 7255/19808 (36.6%) | Scraped: 6244 | Failed: 1011
Progress: 7260/19808 (36.7%) | Scraped: 6249 | Failed: 1011
Progress: 7265/19808 (36.7%) | Scraped: 6253 | Failed: 1012
Progress: 7270/19808 (36.7%) | Scraped: 6256 | Failed: 1014
Progress: 7275/19808 (36.7%) | Scraped: 6261 | Failed: 1014
Progress: 7280/19808 (36.8%) | Scraped: 6264 | Failed: 1016
Progress: 7285/19808 (36.8%) | Scraped: 6268 | Failed: 1017
Progress: 7290/19808 (36.8%) | Scraped: 6272 | Failed: 1018
Progress: 7295/19808 (36.8%) | Scraped: 6277 | Failed: 1018
Progress: 7300/19808 (36.9%) | Scraped: 6282 | Failed: 1018
Progress: 7305/19808 (36.9%) | Scraped: 6287 | Failed: 1018


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7310/19808 (36.9%) | Scraped: 6291 | Failed: 1019
Progress: 7315/19808 (36.9%) | Scraped: 6296 | Failed: 1019
Progress: 7320/19808 (37.0%) | Scraped: 6301 | Failed: 1019
Progress: 7325/19808 (37.0%) | Scraped: 6304 | Failed: 1021
Progress: 7330/19808 (37.0%) | Scraped: 6306 | Failed: 1024
Progress: 7335/19808 (37.0%) | Scraped: 6311 | Failed: 1024
Progress: 7340/19808 (37.1%) | Scraped: 6314 | Failed: 1026
Progress: 7345/19808 (37.1%) | Scraped: 6314 | Failed: 1031
Progress: 7350/19808 (37.1%) | Scraped: 6314 | Failed: 1036
Progress: 7355/19808 (37.1%) | Scraped: 6314 | Failed: 1041
Progress: 7360/19808 (37.2%) | Scraped: 6314 | Failed: 1046
Progress: 7365/19808 (37.2%) | Scraped: 6314 | Failed: 1051
Progress: 7370/19808 (37.2%) | Scraped: 6314 | Failed: 1056
Progress: 7375/19808 (37.2%) | Scraped: 6314 | Failed: 1061
Progress: 7380/19808 (37.3%) | Scraped: 6314 | Failed: 1066
Progress: 7385/19808 (37.3%) | Scraped: 6314 | Failed: 1071
Progress: 7390/19808 (37.3%) | Scraped: 

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7420/19808 (37.5%) | Scraped: 6330 | Failed: 1090
Progress: 7425/19808 (37.5%) | Scraped: 6335 | Failed: 1090
Progress: 7430/19808 (37.5%) | Scraped: 6339 | Failed: 1091
Progress: 7435/19808 (37.5%) | Scraped: 6342 | Failed: 1093
Progress: 7440/19808 (37.6%) | Scraped: 6346 | Failed: 1094
Progress: 7445/19808 (37.6%) | Scraped: 6351 | Failed: 1094
Progress: 7450/19808 (37.6%) | Scraped: 6354 | Failed: 1096
Progress: 7455/19808 (37.6%) | Scraped: 6359 | Failed: 1096
Progress: 7460/19808 (37.7%) | Scraped: 6364 | Failed: 1096
Progress: 7465/19808 (37.7%) | Scraped: 6369 | Failed: 1096
API error for 'Website: https://octopusventures.com/blog/ten-reas...': Unterminated string starting at: line 4 column 372 (char 450)
Progress: 7470/19808 (37.7%) | Scraped: 6374 | Failed: 1096
Progress: 7475/19808 (37.7%) | Scraped: 6379 | Failed: 1096
Progress: 7480/19808 (37.8%) | Scraped: 6383 | Failed: 1097
Progress: 7485/19808 (37.8%) | Scraped: 6387 | Failed: 1098


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7490/19808 (37.8%) | Scraped: 6391 | Failed: 1099
Progress: 7495/19808 (37.8%) | Scraped: 6396 | Failed: 1099
Progress: 7500/19808 (37.9%) | Scraped: 6397 | Failed: 1103
Progress: 7505/19808 (37.9%) | Scraped: 6397 | Failed: 1108
Progress: 7510/19808 (37.9%) | Scraped: 6397 | Failed: 1113
Progress: 7515/19808 (37.9%) | Scraped: 6397 | Failed: 1118
Progress: 7520/19808 (38.0%) | Scraped: 6397 | Failed: 1123
Progress: 7525/19808 (38.0%) | Scraped: 6401 | Failed: 1124
Progress: 7530/19808 (38.0%) | Scraped: 6406 | Failed: 1124
Progress: 7535/19808 (38.0%) | Scraped: 6410 | Failed: 1125
Progress: 7540/19808 (38.1%) | Scraped: 6414 | Failed: 1126
Progress: 7545/19808 (38.1%) | Scraped: 6417 | Failed: 1128
Progress: 7550/19808 (38.1%) | Scraped: 6422 | Failed: 1128
Progress: 7555/19808 (38.1%) | Scraped: 6426 | Failed: 1129
Progress: 7560/19808 (38.2%) | Scraped: 6431 | Failed: 1129
Progress: 7565/19808 (38.2%) | Scraped: 6436 | Failed: 1129
Progress: 7570/19808 (38.2%) | Scraped: 

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7660/19808 (38.7%) | Scraped: 6513 | Failed: 1147
Progress: 7665/19808 (38.7%) | Scraped: 6517 | Failed: 1148
Progress: 7670/19808 (38.7%) | Scraped: 6521 | Failed: 1149
Progress: 7675/19808 (38.7%) | Scraped: 6526 | Failed: 1149
Progress: 7680/19808 (38.8%) | Scraped: 6530 | Failed: 1150


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7685/19808 (38.8%) | Scraped: 6535 | Failed: 1150
Progress: 7690/19808 (38.8%) | Scraped: 6539 | Failed: 1151
Progress: 7695/19808 (38.8%) | Scraped: 6544 | Failed: 1151
Progress: 7700/19808 (38.9%) | Scraped: 6548 | Failed: 1152
Progress: 7705/19808 (38.9%) | Scraped: 6553 | Failed: 1152
Progress: 7710/19808 (38.9%) | Scraped: 6558 | Failed: 1152
Progress: 7715/19808 (38.9%) | Scraped: 6563 | Failed: 1152
Progress: 7720/19808 (39.0%) | Scraped: 6566 | Failed: 1154
Progress: 7725/19808 (39.0%) | Scraped: 6571 | Failed: 1154
Progress: 7730/19808 (39.0%) | Scraped: 6576 | Failed: 1154
Progress: 7735/19808 (39.0%) | Scraped: 6580 | Failed: 1155
Progress: 7740/19808 (39.1%) | Scraped: 6584 | Failed: 1156


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7745/19808 (39.1%) | Scraped: 6589 | Failed: 1156
Progress: 7750/19808 (39.1%) | Scraped: 6593 | Failed: 1157


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7755/19808 (39.2%) | Scraped: 6598 | Failed: 1157


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7760/19808 (39.2%) | Scraped: 6602 | Failed: 1158
Progress: 7765/19808 (39.2%) | Scraped: 6605 | Failed: 1160
Progress: 7770/19808 (39.2%) | Scraped: 6610 | Failed: 1160
Progress: 7775/19808 (39.3%) | Scraped: 6615 | Failed: 1160
Progress: 7780/19808 (39.3%) | Scraped: 6620 | Failed: 1160
Progress: 7785/19808 (39.3%) | Scraped: 6625 | Failed: 1160
Progress: 7790/19808 (39.3%) | Scraped: 6629 | Failed: 1161
Progress: 7795/19808 (39.4%) | Scraped: 6633 | Failed: 1162
Progress: 7800/19808 (39.4%) | Scraped: 6638 | Failed: 1162
Progress: 7805/19808 (39.4%) | Scraped: 6642 | Failed: 1163
Progress: 7810/19808 (39.4%) | Scraped: 6646 | Failed: 1164
Progress: 7815/19808 (39.5%) | Scraped: 6651 | Failed: 1164
Progress: 7820/19808 (39.5%) | Scraped: 6656 | Failed: 1164
API error for 'Website: https://www.cntraveller.com/gallery/paris...': Expecting value: line 23 column 4 (char 490)
Progress: 7825/19808 (39.5%) | Scraped: 6661 | Failed: 1164
Progress: 7830/19808 (39.5%) | Scraped: 6666

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7845/19808 (39.6%) | Scraped: 6679 | Failed: 1166
Progress: 7850/19808 (39.6%) | Scraped: 6683 | Failed: 1167
Progress: 7855/19808 (39.7%) | Scraped: 6688 | Failed: 1167
Progress: 7860/19808 (39.7%) | Scraped: 6691 | Failed: 1169


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7865/19808 (39.7%) | Scraped: 6695 | Failed: 1170
Progress: 7870/19808 (39.7%) | Scraped: 6699 | Failed: 1171
Progress: 7875/19808 (39.8%) | Scraped: 6704 | Failed: 1171
Progress: 7880/19808 (39.8%) | Scraped: 6709 | Failed: 1171
Progress: 7885/19808 (39.8%) | Scraped: 6713 | Failed: 1172
Progress: 7890/19808 (39.8%) | Scraped: 6718 | Failed: 1172
Progress: 7895/19808 (39.9%) | Scraped: 6722 | Failed: 1173
Progress: 7900/19808 (39.9%) | Scraped: 6727 | Failed: 1173
Progress: 7905/19808 (39.9%) | Scraped: 6732 | Failed: 1173
Progress: 7910/19808 (39.9%) | Scraped: 6737 | Failed: 1173
Progress: 7915/19808 (40.0%) | Scraped: 6741 | Failed: 1174


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7920/19808 (40.0%) | Scraped: 6745 | Failed: 1175
Progress: 7925/19808 (40.0%) | Scraped: 6750 | Failed: 1175
Progress: 7930/19808 (40.0%) | Scraped: 6755 | Failed: 1175
Progress: 7935/19808 (40.1%) | Scraped: 6760 | Failed: 1175
Progress: 7940/19808 (40.1%) | Scraped: 6765 | Failed: 1175
Progress: 7945/19808 (40.1%) | Scraped: 6769 | Failed: 1176
Progress: 7950/19808 (40.1%) | Scraped: 6774 | Failed: 1176
Progress: 7955/19808 (40.2%) | Scraped: 6778 | Failed: 1177


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 7960/19808 (40.2%) | Scraped: 6783 | Failed: 1177
Progress: 7965/19808 (40.2%) | Scraped: 6784 | Failed: 1181
Progress: 7970/19808 (40.2%) | Scraped: 6789 | Failed: 1181
Progress: 7975/19808 (40.3%) | Scraped: 6793 | Failed: 1182
Progress: 7980/19808 (40.3%) | Scraped: 6797 | Failed: 1183
Progress: 7985/19808 (40.3%) | Scraped: 6801 | Failed: 1184
Progress: 7990/19808 (40.3%) | Scraped: 6804 | Failed: 1186
Progress: 7995/19808 (40.4%) | Scraped: 6807 | Failed: 1188
Progress: 8000/19808 (40.4%) | Scraped: 6812 | Failed: 1188
Progress: 8005/19808 (40.4%) | Scraped: 6817 | Failed: 1188
Progress: 8010/19808 (40.4%) | Scraped: 6820 | Failed: 1190
Progress: 8015/19808 (40.5%) | Scraped: 6825 | Failed: 1190
API error for 'Website: https://shizune.co/investors/health-care-...': Unterminated string starting at: line 4 column 515 (char 603)
Progress: 8020/19808 (40.5%) | Scraped: 6830 | Failed: 1190
Progress: 8025/19808 (40.5%) | Scraped: 6835 | Failed: 1190
Progress: 8030/19808 (40.5%

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 8320/19808 (42.0%) | Scraped: 7045 | Failed: 1275
Progress: 8325/19808 (42.0%) | Scraped: 7049 | Failed: 1276
Progress: 8330/19808 (42.1%) | Scraped: 7054 | Failed: 1276


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 8335/19808 (42.1%) | Scraped: 7057 | Failed: 1278
Progress: 8340/19808 (42.1%) | Scraped: 7061 | Failed: 1279
Progress: 8345/19808 (42.1%) | Scraped: 7065 | Failed: 1280
Progress: 8350/19808 (42.2%) | Scraped: 7069 | Failed: 1281
Progress: 8355/19808 (42.2%) | Scraped: 7074 | Failed: 1281
Progress: 8360/19808 (42.2%) | Scraped: 7078 | Failed: 1282
Progress: 8365/19808 (42.2%) | Scraped: 7083 | Failed: 1282
Progress: 8370/19808 (42.3%) | Scraped: 7088 | Failed: 1282
Progress: 8375/19808 (42.3%) | Scraped: 7091 | Failed: 1284
Progress: 8380/19808 (42.3%) | Scraped: 7096 | Failed: 1284
Progress: 8385/19808 (42.3%) | Scraped: 7100 | Failed: 1285
Progress: 8390/19808 (42.4%) | Scraped: 7104 | Failed: 1286
API error for 'Website: https://www.michael-konczer.com/en/traini...': Request timed out.
Progress: 8395/19808 (42.4%) | Scraped: 7109 | Failed: 1286
Progress: 8400/19808 (42.4%) | Scraped: 7113 | Failed: 1287
Progress: 8405/19808 (42.4%) | Scraped: 7118 | Failed: 1287
Progress: 

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 8510/19808 (43.0%) | Scraped: 7214 | Failed: 1296
Progress: 8515/19808 (43.0%) | Scraped: 7219 | Failed: 1296
Progress: 8520/19808 (43.0%) | Scraped: 7223 | Failed: 1297
Progress: 8525/19808 (43.0%) | Scraped: 7227 | Failed: 1298
Progress: 8530/19808 (43.1%) | Scraped: 7232 | Failed: 1298
Progress: 8535/19808 (43.1%) | Scraped: 7236 | Failed: 1299
Progress: 8540/19808 (43.1%) | Scraped: 7241 | Failed: 1299
Progress: 8545/19808 (43.1%) | Scraped: 7245 | Failed: 1300
Progress: 8550/19808 (43.2%) | Scraped: 7250 | Failed: 1300
Progress: 8555/19808 (43.2%) | Scraped: 7255 | Failed: 1300
Progress: 8560/19808 (43.2%) | Scraped: 7260 | Failed: 1300
Progress: 8565/19808 (43.2%) | Scraped: 7263 | Failed: 1302
Progress: 8570/19808 (43.3%) | Scraped: 7268 | Failed: 1302
Progress: 8575/19808 (43.3%) | Scraped: 7273 | Failed: 1302
Progress: 8580/19808 (43.3%) | Scraped: 7277 | Failed: 1303
Progress: 8585/19808 (43.3%) | Scraped: 7282 | Failed: 1303
Progress: 8590/19808 (43.4%) | Scraped: 

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 8880/19808 (44.8%) | Scraped: 7503 | Failed: 1377
Progress: 8885/19808 (44.9%) | Scraped: 7507 | Failed: 1378
Progress: 8890/19808 (44.9%) | Scraped: 7512 | Failed: 1378
Progress: 8895/19808 (44.9%) | Scraped: 7517 | Failed: 1378
Progress: 8900/19808 (44.9%) | Scraped: 7522 | Failed: 1378
Progress: 8905/19808 (45.0%) | Scraped: 7526 | Failed: 1379
Progress: 8910/19808 (45.0%) | Scraped: 7530 | Failed: 1380
Progress: 8915/19808 (45.0%) | Scraped: 7535 | Failed: 1380
Progress: 8920/19808 (45.0%) | Scraped: 7537 | Failed: 1383
Progress: 8925/19808 (45.1%) | Scraped: 7542 | Failed: 1383
API error for 'Website: https://www.theoutnet.com/en-gb/shop/desi...': Unterminated string starting at: line 4 column 441 (char 514)
API error for 'Website: https://www.malonesouliers.com/collection...': Unterminated string starting at: line 14 column 5 (char 481)
Progress: 8930/19808 (45.1%) | Scraped: 7547 | Failed: 1383
Progress: 8935/19808 (45.1%) | Scraped: 7552 | Failed: 1383
Progress: 8940/

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9045/19808 (45.7%) | Scraped: 7648 | Failed: 1397
API error for 'Website: https://zoom.us/zoomconference
Content: Z...': Unterminated string starting at: line 4 column 375 (char 461)
API error for 'Website: https://zoom.us/zoomconference/rates
Cont...': Unterminated string starting at: line 4 column 371 (char 472)
Progress: 9050/19808 (45.7%) | Scraped: 7651 | Failed: 1399
API error for 'Website: https://zumarestaurant.com/
Content: cont...': Unterminated string starting at: line 4 column 319 (char 395)
Progress: 9055/19808 (45.7%) | Scraped: 7656 | Failed: 1399
Progress: 9060/19808 (45.7%) | Scraped: 7659 | Failed: 1401
Progress: 9065/19808 (45.8%) | Scraped: 7664 | Failed: 1401
Progress: 9070/19808 (45.8%) | Scraped: 7668 | Failed: 1402
Progress: 9075/19808 (45.8%) | Scraped: 7673 | Failed: 1402
Progress: 9080/19808 (45.8%) | Scraped: 7678 | Failed: 1402
Progress: 9085/19808 (45.9%) | Scraped: 7682 | Failed: 1403
Progress: 9090/19808 (45.9%) | Scraped: 7687 | Failed: 1403
P

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9100/19808 (45.9%) | Scraped: 7697 | Failed: 1403
Progress: 9105/19808 (46.0%) | Scraped: 7702 | Failed: 1403
Progress: 9110/19808 (46.0%) | Scraped: 7707 | Failed: 1403
Progress: 9115/19808 (46.0%) | Scraped: 7711 | Failed: 1404
Progress: 9120/19808 (46.0%) | Scraped: 7715 | Failed: 1405
Progress: 9125/19808 (46.1%) | Scraped: 7719 | Failed: 1406
Progress: 9130/19808 (46.1%) | Scraped: 7724 | Failed: 1406
Progress: 9135/19808 (46.1%) | Scraped: 7729 | Failed: 1406


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9140/19808 (46.1%) | Scraped: 7733 | Failed: 1407


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9145/19808 (46.2%) | Scraped: 7738 | Failed: 1407
Progress: 9150/19808 (46.2%) | Scraped: 7743 | Failed: 1407
Progress: 9155/19808 (46.2%) | Scraped: 7748 | Failed: 1407
Progress: 9160/19808 (46.2%) | Scraped: 7753 | Failed: 1407
Progress: 9165/19808 (46.3%) | Scraped: 7757 | Failed: 1408


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9170/19808 (46.3%) | Scraped: 7762 | Failed: 1408
Progress: 9175/19808 (46.3%) | Scraped: 7766 | Failed: 1409
Progress: 9180/19808 (46.3%) | Scraped: 7771 | Failed: 1409
Progress: 9185/19808 (46.4%) | Scraped: 7776 | Failed: 1409
Progress: 9190/19808 (46.4%) | Scraped: 7781 | Failed: 1409
Progress: 9195/19808 (46.4%) | Scraped: 7786 | Failed: 1409
Progress: 9200/19808 (46.4%) | Scraped: 7789 | Failed: 1411
Progress: 9205/19808 (46.5%) | Scraped: 7794 | Failed: 1411
Progress: 9210/19808 (46.5%) | Scraped: 7798 | Failed: 1412


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9215/19808 (46.5%) | Scraped: 7803 | Failed: 1412
Progress: 9220/19808 (46.5%) | Scraped: 7807 | Failed: 1413
Progress: 9225/19808 (46.6%) | Scraped: 7812 | Failed: 1413
Progress: 9230/19808 (46.6%) | Scraped: 7816 | Failed: 1414
Progress: 9235/19808 (46.6%) | Scraped: 7817 | Failed: 1418
Progress: 9240/19808 (46.6%) | Scraped: 7818 | Failed: 1422
Progress: 9245/19808 (46.7%) | Scraped: 7823 | Failed: 1422


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9250/19808 (46.7%) | Scraped: 7827 | Failed: 1423


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9255/19808 (46.7%) | Scraped: 7829 | Failed: 1426


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9260/19808 (46.7%) | Scraped: 7834 | Failed: 1426
Progress: 9265/19808 (46.8%) | Scraped: 7838 | Failed: 1427
API error for 'Website: http://bananarepublic.gap.co.uk/browse/pr...': Unterminated string starting at: line 13 column 5 (char 597)
Progress: 9270/19808 (46.8%) | Scraped: 7840 | Failed: 1430
Progress: 9275/19808 (46.8%) | Scraped: 7844 | Failed: 1431
Progress: 9280/19808 (46.8%) | Scraped: 7848 | Failed: 1432
Progress: 9285/19808 (46.9%) | Scraped: 7850 | Failed: 1435
Progress: 9290/19808 (46.9%) | Scraped: 7855 | Failed: 1435
Progress: 9295/19808 (46.9%) | Scraped: 7858 | Failed: 1437
Progress: 9300/19808 (47.0%) | Scraped: 7861 | Failed: 1439
Progress: 9305/19808 (47.0%) | Scraped: 7865 | Failed: 1440
API error for 'Website: http://clickserve.dartsearch.net/link/cli...': Unterminated string starting at: line 4 column 471 (char 558)
Progress: 9310/19808 (47.0%) | Scraped: 7870 | Failed: 1440
Progress: 9315/19808 (47.0%) | Scraped: 7875 | Failed: 1440
Progress: 9320/

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9765/19808 (49.3%) | Scraped: 8323 | Failed: 1442
Progress: 9770/19808 (49.3%) | Scraped: 8327 | Failed: 1443


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9775/19808 (49.3%) | Scraped: 8331 | Failed: 1444


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9780/19808 (49.4%) | Scraped: 8334 | Failed: 1446
Progress: 9785/19808 (49.4%) | Scraped: 8337 | Failed: 1448


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9790/19808 (49.4%) | Scraped: 8342 | Failed: 1448
Progress: 9795/19808 (49.4%) | Scraped: 8346 | Failed: 1449
Progress: 9800/19808 (49.5%) | Scraped: 8348 | Failed: 1452


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9805/19808 (49.5%) | Scraped: 8353 | Failed: 1452


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9810/19808 (49.5%) | Scraped: 8357 | Failed: 1453
API error for 'Website: http://frenchvineyard.co.uk/vine/wine-reg...': Unterminated string starting at: line 18 column 5 (char 465)
Progress: 9815/19808 (49.6%) | Scraped: 8362 | Failed: 1453


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9820/19808 (49.6%) | Scraped: 8367 | Failed: 1453
Progress: 9825/19808 (49.6%) | Scraped: 8371 | Failed: 1454
Progress: 9830/19808 (49.6%) | Scraped: 8373 | Failed: 1457
Progress: 9835/19808 (49.7%) | Scraped: 8376 | Failed: 1459


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9840/19808 (49.7%) | Scraped: 8378 | Failed: 1462


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9845/19808 (49.7%) | Scraped: 8382 | Failed: 1463
Progress: 9850/19808 (49.7%) | Scraped: 8386 | Failed: 1464
Progress: 9855/19808 (49.8%) | Scraped: 8389 | Failed: 1466
Progress: 9860/19808 (49.8%) | Scraped: 8392 | Failed: 1468
Progress: 9865/19808 (49.8%) | Scraped: 8395 | Failed: 1470
Progress: 9870/19808 (49.8%) | Scraped: 8399 | Failed: 1471
Progress: 9875/19808 (49.9%) | Scraped: 8401 | Failed: 1474
Progress: 9880/19808 (49.9%) | Scraped: 8403 | Failed: 1477


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9885/19808 (49.9%) | Scraped: 8407 | Failed: 1478
Progress: 9890/19808 (49.9%) | Scraped: 8412 | Failed: 1478
Progress: 9895/19808 (50.0%) | Scraped: 8414 | Failed: 1481
Progress: 9900/19808 (50.0%) | Scraped: 8418 | Failed: 1482


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9905/19808 (50.0%) | Scraped: 8422 | Failed: 1483
Progress: 9910/19808 (50.0%) | Scraped: 8425 | Failed: 1485
API error for 'Website: http://nishamadhulika.com/sweets/-gulgule...': Unterminated string starting at: line 4 column 234 (char 309)
Progress: 9915/19808 (50.1%) | Scraped: 8429 | Failed: 1486
Progress: 9920/19808 (50.1%) | Scraped: 8433 | Failed: 1487
Progress: 9925/19808 (50.1%) | Scraped: 8438 | Failed: 1487


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9930/19808 (50.1%) | Scraped: 8443 | Failed: 1487
Progress: 9935/19808 (50.2%) | Scraped: 8443 | Failed: 1492


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 9940/19808 (50.2%) | Scraped: 8444 | Failed: 1496


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


API error for 'Website: http://pixel.everesttech.net/3199/cq?ev_s...': Unterminated string starting at: line 27 column 5 (char 471)
Progress: 9945/19808 (50.2%) | Scraped: 8447 | Failed: 1498
API error for 'Website: http://pixel.everesttech.net/3199/cq?ev_s...': Unterminated string starting at: line 27 column 5 (char 469)
Progress: 9950/19808 (50.2%) | Scraped: 8450 | Failed: 1500
Progress: 9955/19808 (50.3%) | Scraped: 8455 | Failed: 1500
Progress: 9960/19808 (50.3%) | Scraped: 8457 | Failed: 1503
Progress: 9965/19808 (50.3%) | Scraped: 8460 | Failed: 1505
Progress: 9970/19808 (50.3%) | Scraped: 8465 | Failed: 1505
Progress: 9975/19808 (50.4%) | Scraped: 8470 | Failed: 1505
Progress: 9980/19808 (50.4%) | Scraped: 8475 | Failed: 1505
Progress: 9985/19808 (50.4%) | Scraped: 8480 | Failed: 1505
Progress: 9990/19808 (50.4%) | Scraped: 8485 | Failed: 1505
Progress: 9995/19808 (50.5%) | Scraped: 8488 | Failed: 1507
Progress: 10000/19808 (50.5%) | Scraped: 8493 | Failed: 1507
Progress: 10005

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10080/19808 (50.9%) | Scraped: 8550 | Failed: 1530
Progress: 10085/19808 (50.9%) | Scraped: 8554 | Failed: 1531
Progress: 10090/19808 (50.9%) | Scraped: 8557 | Failed: 1533


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10095/19808 (51.0%) | Scraped: 8562 | Failed: 1533
Progress: 10100/19808 (51.0%) | Scraped: 8565 | Failed: 1535
Progress: 10105/19808 (51.0%) | Scraped: 8570 | Failed: 1535
API error for 'Website: http://www.allsubscriptionboxes.co.uk/cat...': Unterminated string starting at: line 25 column 5 (char 490)
Progress: 10110/19808 (51.0%) | Scraped: 8575 | Failed: 1535
Progress: 10115/19808 (51.1%) | Scraped: 8580 | Failed: 1535
Progress: 10120/19808 (51.1%) | Scraped: 8585 | Failed: 1535
Progress: 10125/19808 (51.1%) | Scraped: 8590 | Failed: 1535
Progress: 10130/19808 (51.1%) | Scraped: 8595 | Failed: 1535
Progress: 10135/19808 (51.2%) | Scraped: 8600 | Failed: 1535
Progress: 10140/19808 (51.2%) | Scraped: 8605 | Failed: 1535
Progress: 10145/19808 (51.2%) | Scraped: 8609 | Failed: 1536
Progress: 10150/19808 (51.2%) | Scraped: 8613 | Failed: 1537
Progress: 10155/19808 (51.3%) | Scraped: 8616 | Failed: 1539
Progress: 10160/19808 (51.3%) | Scraped: 8621 | Failed: 1539
Progress: 1016

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10360/19808 (52.3%) | Scraped: 8795 | Failed: 1565
Progress: 10365/19808 (52.3%) | Scraped: 8799 | Failed: 1566


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


API error for 'Website: http://www.eladgil.com/
Content: Elad Gil...': Expecting value: line 28 column 4 (char 490)
Progress: 10370/19808 (52.4%) | Scraped: 8802 | Failed: 1568
Progress: 10375/19808 (52.4%) | Scraped: 8806 | Failed: 1569
Progress: 10380/19808 (52.4%) | Scraped: 8809 | Failed: 1571
Progress: 10385/19808 (52.4%) | Scraped: 8814 | Failed: 1571
Progress: 10390/19808 (52.5%) | Scraped: 8819 | Failed: 1571
Progress: 10395/19808 (52.5%) | Scraped: 8822 | Failed: 1573
API error for 'Website: http://www.floracleaningservices.com/
Con...': Unterminated string starting at: line 4 column 336 (char 404)
Progress: 10400/19808 (52.5%) | Scraped: 8826 | Failed: 1574
Progress: 10405/19808 (52.5%) | Scraped: 8830 | Failed: 1575
Progress: 10410/19808 (52.6%) | Scraped: 8833 | Failed: 1577
Progress: 10415/19808 (52.6%) | Scraped: 8837 | Failed: 1578
Progress: 10420/19808 (52.6%) | Scraped: 8839 | Failed: 1581
Progress: 10425/19808 (52.6%) | Scraped: 8844 | Failed: 1581
Progress: 10430/198

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10520/19808 (53.1%) | Scraped: 8926 | Failed: 1594


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10525/19808 (53.1%) | Scraped: 8930 | Failed: 1595
Progress: 10530/19808 (53.2%) | Scraped: 8933 | Failed: 1597
API error for 'Website: http://www.katescloset.com.au/hats.html
C...': Unterminated string starting at: line 25 column 5 (char 469)
Progress: 10535/19808 (53.2%) | Scraped: 8935 | Failed: 1600
Progress: 10540/19808 (53.2%) | Scraped: 8939 | Failed: 1601
Progress: 10545/19808 (53.2%) | Scraped: 8941 | Failed: 1604
Progress: 10550/19808 (53.3%) | Scraped: 8946 | Failed: 1604
Progress: 10555/19808 (53.3%) | Scraped: 8950 | Failed: 1605
API error for 'Website: http://www.listchallenges.com/200-movies-...': Unterminated string starting at: line 4 column 450 (char 525)
Progress: 10560/19808 (53.3%) | Scraped: 8955 | Failed: 1605
Progress: 10565/19808 (53.3%) | Scraped: 8958 | Failed: 1607
Progress: 10570/19808 (53.4%) | Scraped: 8961 | Failed: 1609
Progress: 10575/19808 (53.4%) | Scraped: 8965 | Failed: 1610
API error for 'Website: http://www.made-in-china.com/cs/hot-chin

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10590/19808 (53.5%) | Scraped: 8979 | Failed: 1611
Progress: 10595/19808 (53.5%) | Scraped: 8983 | Failed: 1612
Progress: 10600/19808 (53.5%) | Scraped: 8987 | Failed: 1613
Progress: 10605/19808 (53.5%) | Scraped: 8989 | Failed: 1616


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10610/19808 (53.6%) | Scraped: 8993 | Failed: 1617


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10615/19808 (53.6%) | Scraped: 8996 | Failed: 1619
Progress: 10620/19808 (53.6%) | Scraped: 9000 | Failed: 1620
Progress: 10625/19808 (53.6%) | Scraped: 9002 | Failed: 1623
Progress: 10630/19808 (53.7%) | Scraped: 9007 | Failed: 1623
Progress: 10635/19808 (53.7%) | Scraped: 9010 | Failed: 1625
Progress: 10640/19808 (53.7%) | Scraped: 9015 | Failed: 1625
Progress: 10645/19808 (53.7%) | Scraped: 9020 | Failed: 1625
Progress: 10650/19808 (53.8%) | Scraped: 9024 | Failed: 1626
Progress: 10655/19808 (53.8%) | Scraped: 9029 | Failed: 1626
Progress: 10660/19808 (53.8%) | Scraped: 9034 | Failed: 1626
Progress: 10665/19808 (53.8%) | Scraped: 9039 | Failed: 1626
Progress: 10670/19808 (53.9%) | Scraped: 9042 | Failed: 1628
Progress: 10675/19808 (53.9%) | Scraped: 9045 | Failed: 1630
Progress: 10680/19808 (53.9%) | Scraped: 9048 | Failed: 1632
Progress: 10685/19808 (53.9%) | Scraped: 9052 | Failed: 1633


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10690/19808 (54.0%) | Scraped: 9056 | Failed: 1634
Progress: 10695/19808 (54.0%) | Scraped: 9059 | Failed: 1636
Progress: 10700/19808 (54.0%) | Scraped: 9063 | Failed: 1637
Progress: 10705/19808 (54.0%) | Scraped: 9068 | Failed: 1637
Progress: 10710/19808 (54.1%) | Scraped: 9071 | Failed: 1639
Progress: 10715/19808 (54.1%) | Scraped: 9076 | Failed: 1639
Progress: 10720/19808 (54.1%) | Scraped: 9080 | Failed: 1640
Progress: 10725/19808 (54.1%) | Scraped: 9085 | Failed: 1640
Progress: 10730/19808 (54.2%) | Scraped: 9089 | Failed: 1641
Progress: 10735/19808 (54.2%) | Scraped: 9093 | Failed: 1642


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10740/19808 (54.2%) | Scraped: 9097 | Failed: 1643
Progress: 10745/19808 (54.2%) | Scraped: 9101 | Failed: 1644
API error for 'Website: http://www.sharedearth-trade.co.uk/
Conte...': Expecting value: line 25 column 4 (char 571)
API error for 'Website: http://www.simplyseed.co.uk/index/ref/www...': Unterminated string starting at: line 4 column 425 (char 501)
Progress: 10750/19808 (54.3%) | Scraped: 9105 | Failed: 1645
API error for 'Website: http://www.sixsenses.com/
Content: Luxury...': Expecting value: line 19 column 1 (char 501)
Progress: 10755/19808 (54.3%) | Scraped: 9109 | Failed: 1646
Progress: 10760/19808 (54.3%) | Scraped: 9114 | Failed: 1646
Progress: 10765/19808 (54.3%) | Scraped: 9118 | Failed: 1647
Progress: 10770/19808 (54.4%) | Scraped: 9121 | Failed: 1649
Progress: 10775/19808 (54.4%) | Scraped: 9124 | Failed: 1651
Progress: 10780/19808 (54.4%) | Scraped: 9128 | Failed: 1652
Progress: 10785/19808 (54.4%) | Scraped: 9132 | Failed: 1653
Progress: 10790/19808 (54

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10860/19808 (54.8%) | Scraped: 9182 | Failed: 1678


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10865/19808 (54.9%) | Scraped: 9185 | Failed: 1680
Progress: 10870/19808 (54.9%) | Scraped: 9190 | Failed: 1680
Progress: 10875/19808 (54.9%) | Scraped: 9193 | Failed: 1682
Progress: 10880/19808 (54.9%) | Scraped: 9195 | Failed: 1685
Progress: 10885/19808 (55.0%) | Scraped: 9197 | Failed: 1688
Progress: 10890/19808 (55.0%) | Scraped: 9202 | Failed: 1688
Progress: 10895/19808 (55.0%) | Scraped: 9207 | Failed: 1688
Progress: 10900/19808 (55.0%) | Scraped: 9212 | Failed: 1688
Progress: 10905/19808 (55.1%) | Scraped: 9216 | Failed: 1689
Progress: 10910/19808 (55.1%) | Scraped: 9219 | Failed: 1691


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10915/19808 (55.1%) | Scraped: 9220 | Failed: 1695


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10920/19808 (55.1%) | Scraped: 9224 | Failed: 1696
Progress: 10925/19808 (55.2%) | Scraped: 9228 | Failed: 1697
Progress: 10930/19808 (55.2%) | Scraped: 9233 | Failed: 1697
Progress: 10935/19808 (55.2%) | Scraped: 9238 | Failed: 1697
Progress: 10940/19808 (55.2%) | Scraped: 9242 | Failed: 1698
Progress: 10945/19808 (55.3%) | Scraped: 9245 | Failed: 1700
Progress: 10950/19808 (55.3%) | Scraped: 9247 | Failed: 1703
Progress: 10955/19808 (55.3%) | Scraped: 9247 | Failed: 1708
Progress: 10960/19808 (55.3%) | Scraped: 9249 | Failed: 1711


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 10965/19808 (55.4%) | Scraped: 9253 | Failed: 1712
Progress: 10970/19808 (55.4%) | Scraped: 9258 | Failed: 1712
Progress: 10975/19808 (55.4%) | Scraped: 9263 | Failed: 1712
Progress: 10980/19808 (55.4%) | Scraped: 9268 | Failed: 1712
Progress: 10985/19808 (55.5%) | Scraped: 9273 | Failed: 1712
Progress: 10990/19808 (55.5%) | Scraped: 9278 | Failed: 1712
Progress: 10995/19808 (55.5%) | Scraped: 9283 | Failed: 1712
Progress: 11000/19808 (55.5%) | Scraped: 9288 | Failed: 1712
Progress: 11005/19808 (55.6%) | Scraped: 9293 | Failed: 1712
Progress: 11010/19808 (55.6%) | Scraped: 9297 | Failed: 1713
Progress: 11015/19808 (55.6%) | Scraped: 9299 | Failed: 1716
Progress: 11020/19808 (55.6%) | Scraped: 9304 | Failed: 1716
Progress: 11025/19808 (55.7%) | Scraped: 9309 | Failed: 1716
Progress: 11030/19808 (55.7%) | Scraped: 9314 | Failed: 1716
Progress: 11035/19808 (55.7%) | Scraped: 9317 | Failed: 1718
Progress: 11040/19808 (55.7%) | Scraped: 9322 | Failed: 1718
Progress: 11045/19808 (5

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 11320/19808 (57.1%) | Scraped: 9571 | Failed: 1749


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 11325/19808 (57.2%) | Scraped: 9573 | Failed: 1752
API error for 'Website: https://back.cochrane.org/sites/back.coch...': Unterminated string starting at: line 20 column 5 (char 592)
Progress: 11330/19808 (57.2%) | Scraped: 9577 | Failed: 1753
Progress: 11335/19808 (57.2%) | Scraped: 9582 | Failed: 1753
Progress: 11340/19808 (57.2%) | Scraped: 9587 | Failed: 1753
Progress: 11345/19808 (57.3%) | Scraped: 9590 | Failed: 1755
Progress: 11350/19808 (57.3%) | Scraped: 9594 | Failed: 1756
Progress: 11355/19808 (57.3%) | Scraped: 9597 | Failed: 1758
Progress: 11360/19808 (57.4%) | Scraped: 9602 | Failed: 1758
Progress: 11365/19808 (57.4%) | Scraped: 9607 | Failed: 1758
Progress: 11370/19808 (57.4%) | Scraped: 9608 | Failed: 1762
Progress: 11375/19808 (57.4%) | Scraped: 9612 | Failed: 1763
API error for 'Website: https://bigmammagroup.com/
Content: Big M...': Unterminated string starting at: line 23 column 5 (char 490)
API error for 'Website: https://bigmammagroup.com/en/accueil
Cont

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 11550/19808 (58.3%) | Scraped: 9739 | Failed: 1811


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


API error for 'Website: https://clickserve.dartsearch.net/link/cl...': Unterminated string starting at: line 15 column 5 (char 557)
API error for 'Website: https://clickserve.dartsearch.net/link/cl...': Unterminated string starting at: line 4 column 360 (char 428)
Progress: 11555/19808 (58.3%) | Scraped: 9742 | Failed: 1813
API error for 'Website: https://clickserve.dartsearch.net/link/cl...': Unterminated string starting at: line 20 column 5 (char 587)
Progress: 11560/19808 (58.4%) | Scraped: 9747 | Failed: 1813
API error for 'Website: https://clickserve.dartsearch.net/link/cl...': Unterminated string starting at: line 4 column 512 (char 576)
Progress: 11565/19808 (58.4%) | Scraped: 9752 | Failed: 1813
Progress: 11570/19808 (58.4%) | Scraped: 9757 | Failed: 1813
API error for 'Website: https://clickserve.dartsearch.net/link/cl...': Unterminated string starting at: line 17 column 5 (char 571)
Progress: 11575/19808 (58.4%) | Scraped: 9762 | Failed: 1813
Progress: 11580/19808 (58.5%) | S

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 11975/19808 (60.5%) | Scraped: 10151 | Failed: 1824
Progress: 11980/19808 (60.5%) | Scraped: 10153 | Failed: 1827
Progress: 11985/19808 (60.5%) | Scraped: 10156 | Failed: 1829
Progress: 11990/19808 (60.5%) | Scraped: 10160 | Failed: 1830
Progress: 11995/19808 (60.6%) | Scraped: 10160 | Failed: 1835
Progress: 12000/19808 (60.6%) | Scraped: 10164 | Failed: 1836
Progress: 12005/19808 (60.6%) | Scraped: 10169 | Failed: 1836
Progress: 12010/19808 (60.6%) | Scraped: 10174 | Failed: 1836
Progress: 12015/19808 (60.7%) | Scraped: 10179 | Failed: 1836
Progress: 12020/19808 (60.7%) | Scraped: 10184 | Failed: 1836
Progress: 12025/19808 (60.7%) | Scraped: 10189 | Failed: 1836
Progress: 12030/19808 (60.7%) | Scraped: 10194 | Failed: 1836
Progress: 12035/19808 (60.8%) | Scraped: 10199 | Failed: 1836
Progress: 12040/19808 (60.8%) | Scraped: 10204 | Failed: 1836
Progress: 12045/19808 (60.8%) | Scraped: 10208 | Failed: 1837
Progress: 12050/19808 (60.8%) | Scraped: 10211 | Failed: 1839
Progress

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 12455/19808 (62.9%) | Scraped: 10590 | Failed: 1865
Progress: 12460/19808 (62.9%) | Scraped: 10594 | Failed: 1866
Progress: 12465/19808 (62.9%) | Scraped: 10597 | Failed: 1868
Progress: 12470/19808 (63.0%) | Scraped: 10601 | Failed: 1869
Progress: 12475/19808 (63.0%) | Scraped: 10605 | Failed: 1870
Progress: 12480/19808 (63.0%) | Scraped: 10609 | Failed: 1871
Progress: 12485/19808 (63.0%) | Scraped: 10612 | Failed: 1873
Progress: 12490/19808 (63.1%) | Scraped: 10616 | Failed: 1874
Progress: 12495/19808 (63.1%) | Scraped: 10621 | Failed: 1874
API error for 'URL: https://fr.tlscontact.com/gb/EDI/page.php?pid...': Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-BBY7di7miXBr1WlgXWkJQRd6 on requests per day (RPD): Limit 10000, Used 10000, Requested 1. Please try again in 8.64s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}
Progress: 12500/1980

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 12705/19808 (64.1%) | Scraped: 10783 | Failed: 1922
Progress: 12710/19808 (64.2%) | Scraped: 10786 | Failed: 1924
Progress: 12715/19808 (64.2%) | Scraped: 10790 | Failed: 1925
Progress: 12720/19808 (64.2%) | Scraped: 10795 | Failed: 1925
Progress: 12725/19808 (64.2%) | Scraped: 10800 | Failed: 1925
Progress: 12730/19808 (64.3%) | Scraped: 10805 | Failed: 1925
Progress: 12735/19808 (64.3%) | Scraped: 10809 | Failed: 1926
Progress: 12740/19808 (64.3%) | Scraped: 10812 | Failed: 1928
Progress: 12745/19808 (64.3%) | Scraped: 10816 | Failed: 1929
Progress: 12750/19808 (64.4%) | Scraped: 10816 | Failed: 1934
Progress: 12755/19808 (64.4%) | Scraped: 10818 | Failed: 1937
API error for 'URL: https://investors.thetradedesk.com/investor-o...': Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-BBY7di7miXBr1WlgXWkJQRd6 on requests per day (RPD): Limit 10000, Used 10000, Requested 1. Please try again in 8.64s. Visit https://platform.openai.com/

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 13315/19808 (67.2%) | Scraped: 11275 | Failed: 2040
Progress: 13320/19808 (67.2%) | Scraped: 11279 | Failed: 2041
Progress: 13325/19808 (67.3%) | Scraped: 11282 | Failed: 2043
Progress: 13330/19808 (67.3%) | Scraped: 11285 | Failed: 2045
API error for 'Website: https://onlinedoctor.lloydspharmacy.com/u...': Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-BBY7di7miXBr1WlgXWkJQRd6 on requests per day (RPD): Limit 10000, Used 10000, Requested 1. Please try again in 8.64s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}
Progress: 13335/19808 (67.3%) | Scraped: 11289 | Failed: 2046
Progress: 13340/19808 (67.3%) | Scraped: 11289 | Failed: 2051
Progress: 13345/19808 (67.4%) | Scraped: 11293 | Failed: 2052


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 13350/19808 (67.4%) | Scraped: 11298 | Failed: 2052
Progress: 13355/19808 (67.4%) | Scraped: 11303 | Failed: 2052
Progress: 13360/19808 (67.4%) | Scraped: 11307 | Failed: 2053
Progress: 13365/19808 (67.5%) | Scraped: 11311 | Failed: 2054
Progress: 13370/19808 (67.5%) | Scraped: 11315 | Failed: 2055
Progress: 13375/19808 (67.5%) | Scraped: 11320 | Failed: 2055
Progress: 13380/19808 (67.5%) | Scraped: 11325 | Failed: 2055
Progress: 13385/19808 (67.6%) | Scraped: 11330 | Failed: 2055
Progress: 13390/19808 (67.6%) | Scraped: 11334 | Failed: 2056
API error for 'URL: https://passotogo.com/
Semantic context: pass...': Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-BBY7di7miXBr1WlgXWkJQRd6 on requests per day (RPD): Limit 10000, Used 10000, Requested 1. Please try again in 8.64s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}
Progress: 13395/1980

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 13525/19808 (68.3%) | Scraped: 11444 | Failed: 2081


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 13530/19808 (68.3%) | Scraped: 11448 | Failed: 2082
API error for 'Website: https://scalable.capital/
Content: Scalab...': Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-BBY7di7miXBr1WlgXWkJQRd6 on requests per day (RPD): Limit 10000, Used 10000, Requested 1. Please try again in 8.64s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 13535/19808 (68.3%) | Scraped: 11453 | Failed: 2082
Progress: 13540/19808 (68.4%) | Scraped: 11458 | Failed: 2082
Progress: 13545/19808 (68.4%) | Scraped: 11462 | Failed: 2083
Progress: 13550/19808 (68.4%) | Scraped: 11467 | Failed: 2083
Progress: 13555/19808 (68.4%) | Scraped: 11472 | Failed: 2083
Progress: 13560/19808 (68.5%) | Scraped: 11477 | Failed: 2083
Progress: 13565/19808 (68.5%) | Scraped: 11482 | Failed: 2083
Progress: 13570/19808 (68.5%) | Scraped: 11487 | Failed: 2083
Progress: 13575/19808 (68.5%) | Scraped: 11491 | Failed: 2084


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 13580/19808 (68.6%) | Scraped: 11492 | Failed: 2088
Progress: 13585/19808 (68.6%) | Scraped: 11495 | Failed: 2090
Progress: 13590/19808 (68.6%) | Scraped: 11499 | Failed: 2091
Progress: 13595/19808 (68.6%) | Scraped: 11504 | Failed: 2091
Progress: 13600/19808 (68.7%) | Scraped: 11509 | Failed: 2091
Progress: 13605/19808 (68.7%) | Scraped: 11511 | Failed: 2094
API error for 'Website: https://sifted.eu/articles/deeptech-inves...': Expecting value: line 25 column 4 (char 605)
Progress: 13610/19808 (68.7%) | Scraped: 11516 | Failed: 2094
Progress: 13615/19808 (68.7%) | Scraped: 11521 | Failed: 2094
Progress: 13620/19808 (68.8%) | Scraped: 11526 | Failed: 2094


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 13625/19808 (68.8%) | Scraped: 11529 | Failed: 2096
Progress: 13630/19808 (68.8%) | Scraped: 11533 | Failed: 2097
Progress: 13635/19808 (68.8%) | Scraped: 11536 | Failed: 2099
Progress: 13640/19808 (68.9%) | Scraped: 11541 | Failed: 2099
Progress: 13645/19808 (68.9%) | Scraped: 11545 | Failed: 2100
Progress: 13650/19808 (68.9%) | Scraped: 11549 | Failed: 2101
Progress: 13655/19808 (68.9%) | Scraped: 11552 | Failed: 2103
Progress: 13660/19808 (69.0%) | Scraped: 11557 | Failed: 2103
Progress: 13665/19808 (69.0%) | Scraped: 11562 | Failed: 2103
Progress: 13670/19808 (69.0%) | Scraped: 11567 | Failed: 2103
Progress: 13675/19808 (69.0%) | Scraped: 11572 | Failed: 2103
Progress: 13680/19808 (69.1%) | Scraped: 11577 | Failed: 2103
Progress: 13685/19808 (69.1%) | Scraped: 11582 | Failed: 2103
Progress: 13690/19808 (69.1%) | Scraped: 11587 | Failed: 2103
Progress: 13695/19808 (69.1%) | Scraped: 11592 | Failed: 2103
Progress: 13700/19808 (69.2%) | Scraped: 11597 | Failed: 2103
API erro

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 14095/19808 (71.2%) | Scraped: 11924 | Failed: 2171
Progress: 14100/19808 (71.2%) | Scraped: 11929 | Failed: 2171
Progress: 14105/19808 (71.2%) | Scraped: 11934 | Failed: 2171
Progress: 14110/19808 (71.2%) | Scraped: 11939 | Failed: 2171
Progress: 14115/19808 (71.3%) | Scraped: 11943 | Failed: 2172
Progress: 14120/19808 (71.3%) | Scraped: 11947 | Failed: 2173
Progress: 14125/19808 (71.3%) | Scraped: 11949 | Failed: 2176
Progress: 14130/19808 (71.3%) | Scraped: 11953 | Failed: 2177
Progress: 14135/19808 (71.4%) | Scraped: 11953 | Failed: 2182
Progress: 14140/19808 (71.4%) | Scraped: 11956 | Failed: 2184
Progress: 14145/19808 (71.4%) | Scraped: 11956 | Failed: 2189
Progress: 14150/19808 (71.4%) | Scraped: 11956 | Failed: 2194
API error for 'URL: https://uk.louisvuitton.com/eng-gb/products/o...': Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-BBY7di7miXBr1WlgXWkJQRd6 on requests per day (RPD): Limit 10000, Used 10000, Requested 1.

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


API error for 'Website: https://usa.visa.com/dam/VCOM/regional/na...': Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-BBY7di7miXBr1WlgXWkJQRd6 on requests per day (RPD): Limit 10000, Used 10000, Requested 1. Please try again in 8.64s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}
Progress: 14210/19808 (71.7%) | Scraped: 11996 | Failed: 2214
Progress: 14215/19808 (71.8%) | Scraped: 12001 | Failed: 2214
Progress: 14220/19808 (71.8%) | Scraped: 12004 | Failed: 2216
Progress: 14225/19808 (71.8%) | Scraped: 12005 | Failed: 2220
Progress: 14230/19808 (71.8%) | Scraped: 12005 | Failed: 2225
Progress: 14235/19808 (71.9%) | Scraped: 12005 | Failed: 2230
Progress: 14240/19808 (71.9%) | Scraped: 12009 | Failed: 2231
API error for 'Website: https://wallethub.com/answers/cc/american...': Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mi

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 14250/19808 (71.9%) | Scraped: 12017 | Failed: 2233
Progress: 14255/19808 (72.0%) | Scraped: 12021 | Failed: 2234
API error for 'URL: https://weather-and-climate.com/srinagar-Marc...': Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-BBY7di7miXBr1WlgXWkJQRd6 on requests per day (RPD): Limit 10000, Used 10000, Requested 1. Please try again in 8.64s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}
Progress: 14260/19808 (72.0%) | Scraped: 12024 | Failed: 2236
Progress: 14265/19808 (72.0%) | Scraped: 12026 | Failed: 2239
API error for 'Website: https://whowhatwear.co.uk/amp/french-fash...': Expecting value: line 27 column 1 (char 438)
Progress: 14270/19808 (72.0%) | Scraped: 12030 | Failed: 2240
Progress: 14275/19808 (72.1%) | Scraped: 12034 | Failed: 2241
Progress: 14280/19808 (72.1%) | Scraped: 12039 | Failed: 2241
Progress: 14285/19808 (72.1%

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 14610/19808 (73.8%) | Scraped: 12352 | Failed: 2258
Progress: 14615/19808 (73.8%) | Scraped: 12357 | Failed: 2258
Progress: 14620/19808 (73.8%) | Scraped: 12359 | Failed: 2261
Progress: 14625/19808 (73.8%) | Scraped: 12364 | Failed: 2261
API error for 'Website: https://www.asos.com/search/?q=alpha-h&af...': Unterminated string starting at: line 4 column 487 (char 555)
Progress: 14630/19808 (73.9%) | Scraped: 12369 | Failed: 2261
API error for 'Website: https://www.autonomous.ai/?utm_term=&utm_...': Unterminated string starting at: line 20 column 5 (char 468)
Progress: 14635/19808 (73.9%) | Scraped: 12373 | Failed: 2262
Progress: 14640/19808 (73.9%) | Scraped: 12377 | Failed: 2263
Progress: 14645/19808 (73.9%) | Scraped: 12382 | Failed: 2263
Progress: 14650/19808 (74.0%) | Scraped: 12386 | Failed: 2264
Progress: 14655/19808 (74.0%) | Scraped: 12390 | Failed: 2265
API error for 'Website: https://www.bailliegifford.com/individual...': Error code: 429 - {'error': {'message': 'Rat

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 15390/19808 (77.7%) | Scraped: 13046 | Failed: 2344
Progress: 15395/19808 (77.7%) | Scraped: 13051 | Failed: 2344
Progress: 15400/19808 (77.7%) | Scraped: 13053 | Failed: 2347
Progress: 15405/19808 (77.8%) | Scraped: 13053 | Failed: 2352
Progress: 15410/19808 (77.8%) | Scraped: 13053 | Failed: 2357
Progress: 15415/19808 (77.8%) | Scraped: 13058 | Failed: 2357
Progress: 15420/19808 (77.8%) | Scraped: 13063 | Failed: 2357
Progress: 15425/19808 (77.9%) | Scraped: 13068 | Failed: 2357
Progress: 15430/19808 (77.9%) | Scraped: 13073 | Failed: 2357
Progress: 15435/19808 (77.9%) | Scraped: 13078 | Failed: 2357
Progress: 15440/19808 (77.9%) | Scraped: 13083 | Failed: 2357
Progress: 15445/19808 (78.0%) | Scraped: 13088 | Failed: 2357
Progress: 15450/19808 (78.0%) | Scraped: 13093 | Failed: 2357
Progress: 15455/19808 (78.0%) | Scraped: 13098 | Failed: 2357
Progress: 15460/19808 (78.0%) | Scraped: 13103 | Failed: 2357
Progress: 15465/19808 (78.1%) | Scraped: 13108 | Failed: 2357
Progress

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


API error for 'Website: https://www.eachccp.eu/members/
Content: ...': Expecting value: line 22 column 4 (char 491)
Progress: 15615/19808 (78.8%) | Scraped: 13226 | Failed: 2389
Progress: 15620/19808 (78.9%) | Scraped: 13231 | Failed: 2389
Progress: 15625/19808 (78.9%) | Scraped: 13236 | Failed: 2389
Progress: 15630/19808 (78.9%) | Scraped: 13240 | Failed: 2390
Progress: 15635/19808 (78.9%) | Scraped: 13245 | Failed: 2390
Progress: 15640/19808 (79.0%) | Scraped: 13250 | Failed: 2390
Progress: 15645/19808 (79.0%) | Scraped: 13255 | Failed: 2390
Progress: 15650/19808 (79.0%) | Scraped: 13260 | Failed: 2390
Progress: 15655/19808 (79.0%) | Scraped: 13265 | Failed: 2390
Progress: 15660/19808 (79.1%) | Scraped: 13270 | Failed: 2390
Progress: 15665/19808 (79.1%) | Scraped: 13272 | Failed: 2393
Progress: 15670/19808 (79.1%) | Scraped: 13275 | Failed: 2395
Progress: 15675/19808 (79.1%) | Scraped: 13280 | Failed: 2395
Progress: 15680/19808 (79.2%) | Scraped: 13284 | Failed: 2396
Progress: 15685/

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 15695/19808 (79.2%) | Scraped: 13296 | Failed: 2399
Progress: 15700/19808 (79.3%) | Scraped: 13301 | Failed: 2399
Progress: 15705/19808 (79.3%) | Scraped: 13306 | Failed: 2399
Progress: 15710/19808 (79.3%) | Scraped: 13310 | Failed: 2400
Progress: 15715/19808 (79.3%) | Scraped: 13314 | Failed: 2401
Progress: 15720/19808 (79.4%) | Scraped: 13314 | Failed: 2406
Progress: 15725/19808 (79.4%) | Scraped: 13318 | Failed: 2407
Progress: 15730/19808 (79.4%) | Scraped: 13322 | Failed: 2408
Progress: 15735/19808 (79.4%) | Scraped: 13326 | Failed: 2409
Progress: 15740/19808 (79.5%) | Scraped: 13327 | Failed: 2413
Progress: 15745/19808 (79.5%) | Scraped: 13327 | Failed: 2418
Progress: 15750/19808 (79.5%) | Scraped: 13327 | Failed: 2423
API error for 'Website: https://www.eurogamer.net/digitalfoundry
...': Unterminated string starting at: line 18 column 5 (char 494)
Progress: 15755/19808 (79.5%) | Scraped: 13332 | Failed: 2423
Progress: 15760/19808 (79.6%) | Scraped: 13337 | Failed: 2423


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 16140/19808 (81.5%) | Scraped: 13634 | Failed: 2506
Progress: 16145/19808 (81.5%) | Scraped: 13639 | Failed: 2506
Progress: 16150/19808 (81.5%) | Scraped: 13644 | Failed: 2506
Progress: 16155/19808 (81.6%) | Scraped: 13649 | Failed: 2506
Progress: 16160/19808 (81.6%) | Scraped: 13654 | Failed: 2506
Progress: 16165/19808 (81.6%) | Scraped: 13659 | Failed: 2506
Progress: 16170/19808 (81.6%) | Scraped: 13663 | Failed: 2507
Progress: 16175/19808 (81.7%) | Scraped: 13668 | Failed: 2507
Progress: 16180/19808 (81.7%) | Scraped: 13673 | Failed: 2507
Progress: 16185/19808 (81.7%) | Scraped: 13678 | Failed: 2507


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 16190/19808 (81.7%) | Scraped: 13683 | Failed: 2507
Progress: 16195/19808 (81.8%) | Scraped: 13688 | Failed: 2507
Progress: 16200/19808 (81.8%) | Scraped: 13693 | Failed: 2507
Progress: 16205/19808 (81.8%) | Scraped: 13698 | Failed: 2507
API error for 'Website: https://www.gpsmycity.com/tours/lucknow-h...': Unterminated string starting at: line 4 column 380 (char 462)
Progress: 16210/19808 (81.8%) | Scraped: 13703 | Failed: 2507
Progress: 16215/19808 (81.9%) | Scraped: 13707 | Failed: 2508
Progress: 16220/19808 (81.9%) | Scraped: 13711 | Failed: 2509
Progress: 16225/19808 (81.9%) | Scraped: 13715 | Failed: 2510
Progress: 16230/19808 (81.9%) | Scraped: 13720 | Failed: 2510
Progress: 16235/19808 (82.0%) | Scraped: 13725 | Failed: 2510
Progress: 16240/19808 (82.0%) | Scraped: 13730 | Failed: 2510
Progress: 16245/19808 (82.0%) | Scraped: 13735 | Failed: 2510
Progress: 16250/19808 (82.0%) | Scraped: 13740 | Failed: 2510
Progress: 16255/19808 (82.1%) | Scraped: 13745 | Failed: 2510

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 16365/19808 (82.6%) | Scraped: 13835 | Failed: 2530
Progress: 16370/19808 (82.6%) | Scraped: 13840 | Failed: 2530
Progress: 16375/19808 (82.7%) | Scraped: 13845 | Failed: 2530
API error for 'Website: https://www.home.barclays/about-barclays/...': Unterminated string starting at: line 4 column 402 (char 479)
Progress: 16380/19808 (82.7%) | Scraped: 13850 | Failed: 2530
Progress: 16385/19808 (82.7%) | Scraped: 13855 | Failed: 2530
Progress: 16390/19808 (82.7%) | Scraped: 13858 | Failed: 2532
Progress: 16395/19808 (82.8%) | Scraped: 13860 | Failed: 2535
Progress: 16400/19808 (82.8%) | Scraped: 13864 | Failed: 2536
Progress: 16405/19808 (82.8%) | Scraped: 13868 | Failed: 2537
Progress: 16410/19808 (82.8%) | Scraped: 13872 | Failed: 2538
API error for 'Website: https://www.houseoffraser.co.uk/women/coa...': Unterminated string starting at: line 4 column 369 (char 437)
Progress: 16415/19808 (82.9%) | Scraped: 13877 | Failed: 2538
Progress: 16420/19808 (82.9%) | Scraped: 13880 | Fai

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


API error for 'Website: https://www.imgmodels.com/thylaneblondeau...': Unterminated string starting at: line 4 column 359 (char 441)
Progress: 16500/19808 (83.3%) | Scraped: 13954 | Failed: 2546
Progress: 16505/19808 (83.3%) | Scraped: 13959 | Failed: 2546
Progress: 16510/19808 (83.4%) | Scraped: 13963 | Failed: 2547
Progress: 16515/19808 (83.4%) | Scraped: 13965 | Failed: 2550
Progress: 16520/19808 (83.4%) | Scraped: 13970 | Failed: 2550
Progress: 16525/19808 (83.4%) | Scraped: 13975 | Failed: 2550
API error for 'Website: https://www.independent.co.uk/extras/indy...': Unterminated string starting at: line 16 column 5 (char 574)
Progress: 16530/19808 (83.5%) | Scraped: 13980 | Failed: 2550
Progress: 16535/19808 (83.5%) | Scraped: 13985 | Failed: 2550
Progress: 16540/19808 (83.5%) | Scraped: 13990 | Failed: 2550
Progress: 16545/19808 (83.5%) | Scraped: 13995 | Failed: 2550
Progress: 16550/19808 (83.6%) | Scraped: 14000 | Failed: 2550
Progress: 16555/19808 (83.6%) | Scraped: 14005 | Fail

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 16950/19808 (85.6%) | Scraped: 14342 | Failed: 2608
API error for 'Website: https://www.lights4fun.co.uk/c/q/outdoor-...': Unterminated string starting at: line 16 column 5 (char 666)
Progress: 16955/19808 (85.6%) | Scraped: 14347 | Failed: 2608
Progress: 16960/19808 (85.6%) | Scraped: 14351 | Failed: 2609
Progress: 16965/19808 (85.6%) | Scraped: 14356 | Failed: 2609
Progress: 16970/19808 (85.7%) | Scraped: 14360 | Failed: 2610
Progress: 16975/19808 (85.7%) | Scraped: 14362 | Failed: 2613
Progress: 16980/19808 (85.7%) | Scraped: 14366 | Failed: 2614
API error for 'Website: https://www.linksoflondon.com/gb-en/women...': Unterminated string starting at: line 4 column 353 (char 424)
Progress: 16985/19808 (85.7%) | Scraped: 14371 | Failed: 2614
Progress: 16990/19808 (85.8%) | Scraped: 14375 | Failed: 2615
Progress: 16995/19808 (85.8%) | Scraped: 14378 | Failed: 2617
Progress: 17000/19808 (85.8%) | Scraped: 14383 | Failed: 2617
API error for 'Website: https://www.livestrong.com/my

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 17130/19808 (86.5%) | Scraped: 14493 | Failed: 2637
Progress: 17135/19808 (86.5%) | Scraped: 14497 | Failed: 2638
Progress: 17140/19808 (86.5%) | Scraped: 14501 | Failed: 2639
API error for 'Website: https://www.marksandspencer.com/l/food-to...': Unterminated string starting at: line 22 column 5 (char 555)
Progress: 17145/19808 (86.6%) | Scraped: 14504 | Failed: 2641
API error for 'Website: https://www.marksandspencer.com/l/food-to...': Expecting value: line 23 column 1 (char 585)
Progress: 17150/19808 (86.6%) | Scraped: 14509 | Failed: 2641
Progress: 17155/19808 (86.6%) | Scraped: 14513 | Failed: 2642
Progress: 17160/19808 (86.6%) | Scraped: 14518 | Failed: 2642
API error for 'Website: https://www.matalan.co.uk/homeware/dining...': Expecting value: line 4 column 495 (char 570)
Progress: 17165/19808 (86.7%) | Scraped: 14521 | Failed: 2644


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 17170/19808 (86.7%) | Scraped: 14526 | Failed: 2644
API error for 'Website: https://www.maxfactor.com/en/foundation-f...': Expecting value: line 16 column 4 (char 558)
Progress: 17175/19808 (86.7%) | Scraped: 14531 | Failed: 2644
Progress: 17180/19808 (86.7%) | Scraped: 14536 | Failed: 2644
Progress: 17185/19808 (86.8%) | Scraped: 14540 | Failed: 2645
Progress: 17190/19808 (86.8%) | Scraped: 14544 | Failed: 2646
Progress: 17195/19808 (86.8%) | Scraped: 14549 | Failed: 2646
Progress: 17200/19808 (86.8%) | Scraped: 14554 | Failed: 2646
Progress: 17205/19808 (86.9%) | Scraped: 14559 | Failed: 2646
Progress: 17210/19808 (86.9%) | Scraped: 14561 | Failed: 2649
Progress: 17215/19808 (86.9%) | Scraped: 14566 | Failed: 2649
Progress: 17220/19808 (86.9%) | Scraped: 14571 | Failed: 2649
Progress: 17225/19808 (87.0%) | Scraped: 14574 | Failed: 2651
Progress: 17230/19808 (87.0%) | Scraped: 14578 | Failed: 2652
Progress: 17235/19808 (87.0%) | Scraped: 14583 | Failed: 2652
Progress: 17240/

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 17310/19808 (87.4%) | Scraped: 14626 | Failed: 2684
Progress: 17315/19808 (87.4%) | Scraped: 14630 | Failed: 2685
Progress: 17320/19808 (87.4%) | Scraped: 14635 | Failed: 2685
Progress: 17325/19808 (87.5%) | Scraped: 14638 | Failed: 2687
Progress: 17330/19808 (87.5%) | Scraped: 14639 | Failed: 2691
Progress: 17335/19808 (87.5%) | Scraped: 14644 | Failed: 2691
Progress: 17340/19808 (87.5%) | Scraped: 14649 | Failed: 2691
Progress: 17345/19808 (87.6%) | Scraped: 14654 | Failed: 2691
Progress: 17350/19808 (87.6%) | Scraped: 14657 | Failed: 2693
Progress: 17355/19808 (87.6%) | Scraped: 14662 | Failed: 2693
Progress: 17360/19808 (87.6%) | Scraped: 14667 | Failed: 2693
Progress: 17365/19808 (87.7%) | Scraped: 14670 | Failed: 2695
Progress: 17370/19808 (87.7%) | Scraped: 14671 | Failed: 2699
Progress: 17375/19808 (87.7%) | Scraped: 14676 | Failed: 2699
Progress: 17380/19808 (87.7%) | Scraped: 14681 | Failed: 2699
Progress: 17385/19808 (87.8%) | Scraped: 14686 | Failed: 2699
Progress

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 17545/19808 (88.6%) | Scraped: 14823 | Failed: 2722
Progress: 17550/19808 (88.6%) | Scraped: 14828 | Failed: 2722
Progress: 17555/19808 (88.6%) | Scraped: 14831 | Failed: 2724
Progress: 17560/19808 (88.7%) | Scraped: 14836 | Failed: 2724


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 17565/19808 (88.7%) | Scraped: 14841 | Failed: 2724
Progress: 17570/19808 (88.7%) | Scraped: 14846 | Failed: 2724
Progress: 17575/19808 (88.7%) | Scraped: 14851 | Failed: 2724
Progress: 17580/19808 (88.8%) | Scraped: 14856 | Failed: 2724
Progress: 17585/19808 (88.8%) | Scraped: 14861 | Failed: 2724
Progress: 17590/19808 (88.8%) | Scraped: 14866 | Failed: 2724
Progress: 17595/19808 (88.8%) | Scraped: 14870 | Failed: 2725
Progress: 17600/19808 (88.9%) | Scraped: 14875 | Failed: 2725
Progress: 17605/19808 (88.9%) | Scraped: 14880 | Failed: 2725
Progress: 17610/19808 (88.9%) | Scraped: 14884 | Failed: 2726
Progress: 17615/19808 (88.9%) | Scraped: 14888 | Failed: 2727
Progress: 17620/19808 (89.0%) | Scraped: 14893 | Failed: 2727
Progress: 17625/19808 (89.0%) | Scraped: 14898 | Failed: 2727
Progress: 17630/19808 (89.0%) | Scraped: 14900 | Failed: 2730
Progress: 17635/19808 (89.0%) | Scraped: 14904 | Failed: 2731
Progress: 17640/19808 (89.1%) | Scraped: 14909 | Failed: 2731
Progress

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


API error for 'Website: https://www.riverisland.com/c/women/shoes...': Unterminated string starting at: line 4 column 420 (char 490)
Progress: 18010/19808 (90.9%) | Scraped: 15188 | Failed: 2822
Progress: 18015/19808 (90.9%) | Scraped: 15189 | Failed: 2826
Progress: 18020/19808 (91.0%) | Scraped: 15194 | Failed: 2826
Progress: 18025/19808 (91.0%) | Scraped: 15197 | Failed: 2828
Progress: 18030/19808 (91.0%) | Scraped: 15202 | Failed: 2828
Progress: 18035/19808 (91.0%) | Scraped: 15206 | Failed: 2829
Progress: 18040/19808 (91.1%) | Scraped: 15211 | Failed: 2829
Progress: 18045/19808 (91.1%) | Scraped: 15216 | Failed: 2829
Progress: 18050/19808 (91.1%) | Scraped: 15221 | Failed: 2829
Progress: 18055/19808 (91.2%) | Scraped: 15226 | Failed: 2829
Progress: 18060/19808 (91.2%) | Scraped: 15231 | Failed: 2829
Progress: 18065/19808 (91.2%) | Scraped: 15233 | Failed: 2832
Progress: 18070/19808 (91.2%) | Scraped: 15236 | Failed: 2834
Progress: 18075/19808 (91.3%) | Scraped: 15238 | Failed: 2837

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 18240/19808 (92.1%) | Scraped: 15375 | Failed: 2865
Progress: 18245/19808 (92.1%) | Scraped: 15375 | Failed: 2870
Progress: 18250/19808 (92.1%) | Scraped: 15379 | Failed: 2871
Progress: 18255/19808 (92.2%) | Scraped: 15383 | Failed: 2872
Progress: 18260/19808 (92.2%) | Scraped: 15388 | Failed: 2872
Progress: 18265/19808 (92.2%) | Scraped: 15392 | Failed: 2873
Progress: 18270/19808 (92.2%) | Scraped: 15395 | Failed: 2875
Progress: 18275/19808 (92.3%) | Scraped: 15398 | Failed: 2877
API error for 'Website: https://www.sportskeeda.com/amp/tennis/ne...': Unterminated string starting at: line 4 column 386 (char 456)
Progress: 18280/19808 (92.3%) | Scraped: 15403 | Failed: 2877
Progress: 18285/19808 (92.3%) | Scraped: 15408 | Failed: 2877
Progress: 18290/19808 (92.3%) | Scraped: 15413 | Failed: 2877
Progress: 18295/19808 (92.4%) | Scraped: 15418 | Failed: 2877
Progress: 18300/19808 (92.4%) | Scraped: 15423 | Failed: 2877
Progress: 18305/19808 (92.4%) | Scraped: 15428 | Failed: 2877

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 18360/19808 (92.7%) | Scraped: 15480 | Failed: 2880
Progress: 18365/19808 (92.7%) | Scraped: 15484 | Failed: 2881
API error for 'Website: https://www.susanentwistle.com/collection...': Expecting value: line 21 column 4 (char 545)
Progress: 18370/19808 (92.7%) | Scraped: 15487 | Failed: 2883
Progress: 18375/19808 (92.8%) | Scraped: 15490 | Failed: 2885
Progress: 18380/19808 (92.8%) | Scraped: 15495 | Failed: 2885


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 18385/19808 (92.8%) | Scraped: 15499 | Failed: 2886
Progress: 18390/19808 (92.8%) | Scraped: 15504 | Failed: 2886
Progress: 18395/19808 (92.9%) | Scraped: 15507 | Failed: 2888
Progress: 18400/19808 (92.9%) | Scraped: 15512 | Failed: 2888
Progress: 18405/19808 (92.9%) | Scraped: 15517 | Failed: 2888
Progress: 18410/19808 (92.9%) | Scraped: 15521 | Failed: 2889
Progress: 18415/19808 (93.0%) | Scraped: 15526 | Failed: 2889
Progress: 18420/19808 (93.0%) | Scraped: 15526 | Failed: 2894
Progress: 18425/19808 (93.0%) | Scraped: 15526 | Failed: 2899
Progress: 18430/19808 (93.0%) | Scraped: 15527 | Failed: 2903
Progress: 18435/19808 (93.1%) | Scraped: 15532 | Failed: 2903
API error for 'Website: https://www.tbvsc.com/bicester-village/en...': Unterminated string starting at: line 25 column 5 (char 439)
Progress: 18440/19808 (93.1%) | Scraped: 15535 | Failed: 2905
API error for 'Website: https://www.tbvsc.com/bicester-village/en...': Unterminated string starting at: line 24 column 5 (ch

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 18450/19808 (93.1%) | Scraped: 15545 | Failed: 2905
Progress: 18455/19808 (93.2%) | Scraped: 15550 | Failed: 2905
Progress: 18460/19808 (93.2%) | Scraped: 15555 | Failed: 2905
Progress: 18465/19808 (93.2%) | Scraped: 15558 | Failed: 2907
API error for 'Website: https://www.tedbaker.com/row/delivery-and...': Expecting ',' delimiter: line 5 column 1 (char 571)
Progress: 18470/19808 (93.2%) | Scraped: 15562 | Failed: 2908
API error for 'Website: https://www.tedbaker.com/uk/Womens/Clothi...': Unterminated string starting at: line 15 column 5 (char 546)
API error for 'Website: https://www.tedbaker.com/uk/Womens/Workwe...': Unterminated string starting at: line 4 column 431 (char 516)
Progress: 18475/19808 (93.3%) | Scraped: 15567 | Failed: 2908
Progress: 18480/19808 (93.3%) | Scraped: 15572 | Failed: 2908
API error for 'Website: https://www.tedbaker.com/uk/c/womens/edit...': Unterminated string starting at: line 4 column 431 (char 512)
API error for 'Website: https://www.tedbaker.

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 18565/19808 (93.7%) | Scraped: 15655 | Failed: 2910
Progress: 18570/19808 (93.8%) | Scraped: 15660 | Failed: 2910
API error for 'Website: https://www.thefishsociety.co.uk/sashimi-...': Unterminated string starting at: line 19 column 5 (char 516)
Progress: 18575/19808 (93.8%) | Scraped: 15665 | Failed: 2910
Progress: 18580/19808 (93.8%) | Scraped: 15669 | Failed: 2911
Progress: 18585/19808 (93.8%) | Scraped: 15674 | Failed: 2911
Progress: 18590/19808 (93.9%) | Scraped: 15678 | Failed: 2912
Progress: 18595/19808 (93.9%) | Scraped: 15683 | Failed: 2912
Progress: 18600/19808 (93.9%) | Scraped: 15686 | Failed: 2914
Progress: 18605/19808 (93.9%) | Scraped: 15689 | Failed: 2916
Progress: 18610/19808 (94.0%) | Scraped: 15694 | Failed: 2916
Progress: 18615/19808 (94.0%) | Scraped: 15697 | Failed: 2918
Progress: 18620/19808 (94.0%) | Scraped: 15699 | Failed: 2921
Progress: 18625/19808 (94.0%) | Scraped: 15704 | Failed: 2921
Progress: 18630/19808 (94.1%) | Scraped: 15709 | Failed: 2921


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 18925/19808 (95.5%) | Scraped: 15940 | Failed: 2985
Progress: 18930/19808 (95.6%) | Scraped: 15941 | Failed: 2989


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 18935/19808 (95.6%) | Scraped: 15944 | Failed: 2991


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 18940/19808 (95.6%) | Scraped: 15948 | Failed: 2992
Progress: 18945/19808 (95.6%) | Scraped: 15953 | Failed: 2992
Progress: 18950/19808 (95.7%) | Scraped: 15957 | Failed: 2993
Progress: 18955/19808 (95.7%) | Scraped: 15962 | Failed: 2993
API error for 'Website: https://www.vittoria.com/ww/en/stories/sp...': Unterminated string starting at: line 4 column 393 (char 473)
Progress: 18960/19808 (95.7%) | Scraped: 15967 | Failed: 2993
API error for 'Website: https://www.vlccwellness.com/India/vlcc-b...': Unterminated string starting at: line 4 column 391 (char 482)
API error for 'Website: https://www.vogue.co.uk/article/best-vint...': Expecting value: line 4 column 414 (char 502)
Progress: 18965/19808 (95.7%) | Scraped: 15972 | Failed: 2993
Progress: 18970/19808 (95.8%) | Scraped: 15977 | Failed: 2993
Progress: 18975/19808 (95.8%) | Scraped: 15982 | Failed: 2993
Progress: 18980/19808 (95.8%) | Scraped: 15987 | Failed: 2993
Progress: 18985/19808 (95.8%) | Scraped: 15992 | Failed: 29

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19055/19808 (96.2%) | Scraped: 16031 | Failed: 3024
Progress: 19060/19808 (96.2%) | Scraped: 16036 | Failed: 3024
Progress: 19065/19808 (96.2%) | Scraped: 16040 | Failed: 3025
Progress: 19070/19808 (96.3%) | Scraped: 16045 | Failed: 3025
Progress: 19075/19808 (96.3%) | Scraped: 16046 | Failed: 3029
Progress: 19080/19808 (96.3%) | Scraped: 16046 | Failed: 3034
Progress: 19085/19808 (96.3%) | Scraped: 16050 | Failed: 3035
Progress: 19090/19808 (96.4%) | Scraped: 16055 | Failed: 3035


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19095/19808 (96.4%) | Scraped: 16059 | Failed: 3036
Progress: 19100/19808 (96.4%) | Scraped: 16064 | Failed: 3036
Progress: 19105/19808 (96.5%) | Scraped: 16068 | Failed: 3037
API error for 'Website: https://www.whowhatwear.co.uk/amp/french-...': Expecting value: line 27 column 1 (char 438)
Progress: 19110/19808 (96.5%) | Scraped: 16073 | Failed: 3037
API error for 'Website: https://www.whowhatwear.co.uk/best-sneake...': Unterminated string starting at: line 18 column 5 (char 585)
Progress: 19115/19808 (96.5%) | Scraped: 16078 | Failed: 3037
Progress: 19120/19808 (96.5%) | Scraped: 16083 | Failed: 3037
API error for 'Website: https://www.whowhatwear.co.uk/timeless-je...': Unterminated string starting at: line 15 column 5 (char 504)
Progress: 19125/19808 (96.6%) | Scraped: 16087 | Failed: 3038
Progress: 19130/19808 (96.6%) | Scraped: 16092 | Failed: 3038
Progress: 19135/19808 (96.6%) | Scraped: 16095 | Failed: 3040
Progress: 19140/19808 (96.6%) | Scraped: 16100 | Failed: 3040


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19285/19808 (97.4%) | Scraped: 16229 | Failed: 3056
API error for 'Website: https://yorkfitness.com/a/s/collections/d...': Unterminated string starting at: line 4 column 402 (char 480)
Progress: 19290/19808 (97.4%) | Scraped: 16231 | Failed: 3059
Progress: 19295/19808 (97.4%) | Scraped: 16236 | Failed: 3059
Progress: 19300/19808 (97.4%) | Scraped: 16240 | Failed: 3060
Progress: 19305/19808 (97.5%) | Scraped: 16245 | Failed: 3060
Progress: 19310/19808 (97.5%) | Scraped: 16249 | Failed: 3061
Progress: 19315/19808 (97.5%) | Scraped: 16254 | Failed: 3061
API error for 'Website: https://www.apple.com/uk/iphone/compare/
...': Unterminated string starting at: line 20 column 5 (char 414)
Progress: 19320/19808 (97.5%) | Scraped: 16259 | Failed: 3061
Progress: 19325/19808 (97.6%) | Scraped: 16264 | Failed: 3061
Progress: 19330/19808 (97.6%) | Scraped: 16268 | Failed: 3062
Progress: 19335/19808 (97.6%) | Scraped: 16273 | Failed: 3062
Progress: 19340/19808 (97.6%) | Scraped: 16278 | Fail

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19380/19808 (97.8%) | Scraped: 16317 | Failed: 3063
Progress: 19385/19808 (97.9%) | Scraped: 16320 | Failed: 3065


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19390/19808 (97.9%) | Scraped: 16325 | Failed: 3065
Progress: 19395/19808 (97.9%) | Scraped: 16330 | Failed: 3065
Progress: 19400/19808 (97.9%) | Scraped: 16335 | Failed: 3065
Progress: 19405/19808 (98.0%) | Scraped: 16340 | Failed: 3065
Progress: 19410/19808 (98.0%) | Scraped: 16345 | Failed: 3065
Progress: 19415/19808 (98.0%) | Scraped: 16350 | Failed: 3065


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19420/19808 (98.0%) | Scraped: 16355 | Failed: 3065
API error for 'Website: https://www.eatatplow.com/menus
Content: ...': Unterminated string starting at: line 4 column 478 (char 544)
Progress: 19425/19808 (98.1%) | Scraped: 16360 | Failed: 3065
Progress: 19430/19808 (98.1%) | Scraped: 16365 | Failed: 3065
Progress: 19435/19808 (98.1%) | Scraped: 16370 | Failed: 3065
Progress: 19440/19808 (98.1%) | Scraped: 16375 | Failed: 3065
Progress: 19445/19808 (98.2%) | Scraped: 16380 | Failed: 3065
Progress: 19450/19808 (98.2%) | Scraped: 16385 | Failed: 3065
Progress: 19455/19808 (98.2%) | Scraped: 16390 | Failed: 3065
Progress: 19460/19808 (98.2%) | Scraped: 16395 | Failed: 3065
Progress: 19465/19808 (98.3%) | Scraped: 16400 | Failed: 3065
Progress: 19470/19808 (98.3%) | Scraped: 16405 | Failed: 3065
Progress: 19475/19808 (98.3%) | Scraped: 16410 | Failed: 3065


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19480/19808 (98.3%) | Scraped: 16415 | Failed: 3065
Progress: 19485/19808 (98.4%) | Scraped: 16420 | Failed: 3065
Progress: 19490/19808 (98.4%) | Scraped: 16424 | Failed: 3066
Progress: 19495/19808 (98.4%) | Scraped: 16429 | Failed: 3066
Progress: 19500/19808 (98.4%) | Scraped: 16434 | Failed: 3066
Progress: 19505/19808 (98.5%) | Scraped: 16439 | Failed: 3066
Progress: 19510/19808 (98.5%) | Scraped: 16444 | Failed: 3066
Progress: 19515/19808 (98.5%) | Scraped: 16449 | Failed: 3066
Progress: 19520/19808 (98.5%) | Scraped: 16454 | Failed: 3066
Progress: 19525/19808 (98.6%) | Scraped: 16459 | Failed: 3066
Progress: 19530/19808 (98.6%) | Scraped: 16464 | Failed: 3066
Progress: 19535/19808 (98.6%) | Scraped: 16469 | Failed: 3066
Progress: 19540/19808 (98.6%) | Scraped: 16474 | Failed: 3066
Progress: 19545/19808 (98.7%) | Scraped: 16479 | Failed: 3066
Progress: 19550/19808 (98.7%) | Scraped: 16484 | Failed: 3066
Progress: 19555/19808 (98.7%) | Scraped: 16489 | Failed: 3066
Progress

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19710/19808 (99.5%) | Scraped: 16640 | Failed: 3070
Progress: 19715/19808 (99.5%) | Scraped: 16645 | Failed: 3070
Progress: 19720/19808 (99.6%) | Scraped: 16650 | Failed: 3070
Progress: 19725/19808 (99.6%) | Scraped: 16655 | Failed: 3070
Progress: 19730/19808 (99.6%) | Scraped: 16660 | Failed: 3070
Progress: 19735/19808 (99.6%) | Scraped: 16665 | Failed: 3070
Progress: 19740/19808 (99.7%) | Scraped: 16670 | Failed: 3070
Progress: 19745/19808 (99.7%) | Scraped: 16671 | Failed: 3074
Progress: 19750/19808 (99.7%) | Scraped: 16673 | Failed: 3077
Progress: 19755/19808 (99.7%) | Scraped: 16678 | Failed: 3077
Progress: 19760/19808 (99.8%) | Scraped: 16683 | Failed: 3077
Progress: 19765/19808 (99.8%) | Scraped: 16688 | Failed: 3077
Progress: 19770/19808 (99.8%) | Scraped: 16692 | Failed: 3078
Progress: 19775/19808 (99.8%) | Scraped: 16697 | Failed: 3078


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19780/19808 (99.9%) | Scraped: 16702 | Failed: 3078
Progress: 19785/19808 (99.9%) | Scraped: 16707 | Failed: 3078


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Progress: 19790/19808 (99.9%) | Scraped: 16712 | Failed: 3078
Progress: 19795/19808 (99.9%) | Scraped: 16716 | Failed: 3079
Progress: 19800/19808 (100.0%) | Scraped: 16719 | Failed: 3081
Progress: 19805/19808 (100.0%) | Scraped: 16724 | Failed: 3081
Progress: 19808/19808 (100.0%) | Scraped: 16727 | Failed: 3081

✓ Page visit extraction complete!
  Successfully scraped: 16727
  Fallback to URL parsing: 3081
  Total processed: 19808

Sample page visits with extracted information:

Page: Visited https://www.businessinsider.com/shivon-zilis-reporte...
  Category: News & Media
  Topic: Elon Musk's children and relationship with Shivon Zilis
  Items: Shivon Zilis, Elon Musk, Neuralink, Walter Isaacson, Justine Wilson, Grimes

Page: Visited Elon Musk and Shivon Zilis privately welcome third b...
  Category: Entertainment
  Topic: Celebrity news
  Items: Elon Musk, Shivon Zilis, Neuralink, Tesla, SpaceX, Grimes, Justine Wilson

Page: Visited Teens could lose bank accounts and driving licences 
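
The `Unterminated string`, `Expecting value`, and `Expecting ',' delimiter` messages in the log above are the texts raised by `json.JSONDecodeError`, which suggests that some model responses were not valid JSON when they were parsed (possibly because they were cut off; that cause is a guess). A minimal sketch of a tolerant parser follows. The helper names `parse_extraction` and `fallback_extraction`, and the fallback values, are hypothetical and not taken from the notebook; only the field names `semantic_category`, `topic`, and `items` come from it.

```python
import json

def fallback_extraction():
    """Placeholder record used when a response cannot be parsed (hypothetical values)."""
    return {"semantic_category": "Unknown", "topic": None, "items": []}

def parse_extraction(raw_text):
    """Parse a model response; return (record, ok) instead of raising on bad JSON."""
    try:
        record = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        return fallback_extraction(), False
    if not isinstance(record, dict):
        return fallback_extraction(), False
    # Fill any missing fields so downstream code can rely on all three keys
    for key, default in fallback_extraction().items():
        record.setdefault(key, default)
    return record, True

good = '{"semantic_category": "News & Media", "topic": "Politics", "items": ["BBC"]}'
truncated = '{"semantic_category": "Shopping", "topic": "Dresses", "items": ["Ted Ba'

print(parse_extraction(good))
print(parse_extraction(truncated))
```

Returning an `ok` flag alongside the record makes it possible to count parse failures separately from scrape failures, rather than only printing them.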

In [None]:
# Display API usage statistics and cost estimation
print("=== API USAGE STATISTICS ===\n")

total_unique_texts = len(extraction_cache)
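# Assumes one API call per cached text; entries loaded from a previous run's cache
# were not billed again in this run, so this is an upper bound for the current run
total_api_calls = total_unique_texts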

# GPT-4o-mini pricing (as of Dec 2024)
# Input: $0.150 per 1M tokens
# Output: $0.600 per 1M tokens

# Estimate tokens (rough approximation)
avg_input_tokens_per_call = 100  # System prompt + user query
avg_output_tokens_per_call = 50   # JSON response

total_input_tokens = total_api_calls * avg_input_tokens_per_call
total_output_tokens = total_api_calls * avg_output_tokens_per_call

input_cost = (total_input_tokens / 1_000_000) * 0.150
output_cost = (total_output_tokens / 1_000_000) * 0.600
total_cost = input_cost + output_cost

print(f"Total unique texts processed: {total_unique_texts:,}")
print(f"Total API calls made: {total_api_calls:,}")
print(f"\nEstimated token usage:")
print(f"  Input tokens:  ~{total_input_tokens:,}")
print(f"  Output tokens: ~{total_output_tokens:,}")
print(f"  Total tokens:  ~{total_input_tokens + total_output_tokens:,}")
print(f"\nEstimated cost (GPT-4o-mini):")
print(f"  Input cost:    ${input_cost:.4f}")
print(f"  Output cost:   ${output_cost:.4f}")
print(f"  Total cost:    ${total_cost:.4f}")
print(f"\nCache efficiency:")
original_calls = len(search_queries_df) + len(page_visits_df)
calls_saved = original_calls - total_api_calls
# Guard against an empty dataset to avoid dividing by zero
savings_pct = (calls_saved / original_calls) * 100 if original_calls else 0.0
print(f"  Without cache: {original_calls:,} API calls")
print(f"  With cache:    {total_api_calls:,} API calls")
print(f"  Calls saved:   {calls_saved:,} ({savings_pct:.1f}%)")

=== API USAGE STATISTICS ===

Total unique texts processed: 19,791
Total API calls made: 19,791

Estimated token usage:
  Input tokens:  ~1,979,100
  Output tokens: ~989,550
  Total tokens:  ~2,968,650

Estimated cost (GPT-4o-mini):
  Input cost:    $0.2969
  Output cost:   $0.5937
  Total cost:    $0.8906

Cache efficiency:
  Without cache: 53,031 API calls
  With cache:    19,791 API calls
  Calls saved:   33,240 (62.7%)


## Section 8: API Usage Statistics and Cost Analysis

**Objective:** Analyze API usage, estimate costs, and demonstrate the efficiency of caching.

**What this section does:**
- Calculates total API calls made
- Estimates token usage based on typical request sizes
- Computes cost based on GPT-4o-mini pricing
- Shows cache efficiency and savings from deduplication
- Displays cost breakdown
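
The cost figures reported for this section follow from simple arithmetic on assumed averages (100 input and 50 output tokens per call). The sketch below reproduces that arithmetic and shows how the total moves if the input average is larger. The function name `estimate_cost` and the alternative averages are hypothetical; the prices and the call count are the ones quoted in the notebook.

```python
def estimate_cost(calls, avg_input_tokens, avg_output_tokens,
                  input_price_per_m=0.150, output_price_per_m=0.600):
    """Return (input_cost, output_cost, total_cost) in USD for the given averages."""
    input_cost = calls * avg_input_tokens / 1_000_000 * input_price_per_m
    output_cost = calls * avg_output_tokens / 1_000_000 * output_price_per_m
    return input_cost, output_cost, input_cost + output_cost

# Figures used by the notebook: 19,791 calls at 100 input / 50 output tokens
baseline = estimate_cost(19_791, 100, 50)
print(f"baseline total: ${baseline[2]:.4f}")

# Hypothetical heavier prompts, e.g. if scraped page content is part of the input
for avg_in in (500, 1000, 2000):
    total = estimate_cost(19_791, avg_in, 50)[2]
    print(f"avg input {avg_in:>5} tokens -> total ${total:.4f}")
```

Because the estimate is linear in the assumed averages, the token counts returned by the API would give a more reliable figure than fixed guesses, if they were recorded during the run.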

In [None]:
# Save the extraction cache for future use
import os  # needed for os.path.getsize below

CACHE_FILE = 'openai_extraction_cache.json'

try:
    with open(CACHE_FILE, 'w') as f:
        json.dump(extraction_cache, f, indent=2)
    
    # Get file size
    file_size = os.path.getsize(CACHE_FILE)
    file_size_mb = file_size / (1024 * 1024)
    
    print(f"✓ Extraction cache saved to '{CACHE_FILE}'")
    print(f"  Entries: {len(extraction_cache):,}")
    print(f"  File size: {file_size_mb:.2f} MB")
    print(f"\n💡 Next time you run this notebook, it will load from this cache")
    print(f"   and skip API calls for already-processed texts!")
    
except Exception as e:
    print(f"⚠ Error saving cache: {e}")

✓ Extraction cache saved to 'openai_extraction_cache.json'
  Entries: 19,791
  File size: 91.26 MB

💡 Next time you run this notebook, it will load from this cache
   and skip API calls for already-processed texts!
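
The message above says a later run will load the cache and skip already-processed texts. A minimal sketch of that pattern is shown here, assuming the cache is a JSON object keyed by the input text. The helper names `load_cache` and `get_or_extract` are hypothetical, and `fake_extract` is a stand-in for the real extraction call.

```python
import json
import os
import tempfile

def load_cache(path):
    """Return the cached dict, or an empty dict if the file is missing or unreadable."""
    if not os.path.exists(path):
        return {}
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (json.JSONDecodeError, OSError):
        return {}
    return data if isinstance(data, dict) else {}

def get_or_extract(text, cache, extract_fn):
    """Look up text in the cache; call extract_fn only on a miss."""
    if text not in cache:
        cache[text] = extract_fn(text)
    return cache[text]

# Demonstration with a temporary file and a stand-in extractor (no API calls)
calls = []

def fake_extract(text):
    calls.append(text)
    return {"topic": text.upper()}

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "cache.json")
    cache = load_cache(cache_path)  # file does not exist yet, so this is {}
    get_or_extract("red shoes", cache, fake_extract)
    get_or_extract("red shoes", cache, fake_extract)  # served from the cache
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f)
    reloaded = load_cache(cache_path)

print(f"extractor calls: {len(calls)}, cached entries: {len(reloaded)}")
```

Treating an unreadable cache file as empty trades a few repeated API calls for not crashing at startup; whether that trade is acceptable depends on how expensive the calls are.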


## Section 9: Search Cluster Analysis with Semantic Context

**Objective:** Analyze temporal clusters of searches to understand when users explore related topics.

**What this section does:**
- Extracts semantic context for each search cluster
- Identifies dominant categories within clusters
- Aggregates topics and entities from clustered searches
- Displays category distribution for each cluster
- Shows example search sequences
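
As an illustration of the aggregation listed above, the sketch below summarizes one cluster into a dominant category, a category distribution, and de-duplicated topics and entities. It is not the notebook's own `extract_cluster_context`; the function name `summarize_cluster`, the input shape (a list of dicts), and the sample records are assumptions. The returned keys mirror the ones the export cell reads from the cluster context.

```python
from collections import Counter

def summarize_cluster(records):
    """Summarize a cluster given as a list of dicts with
    'semantic_category', 'topic', and 'items' keys (assumed shape)."""
    categories = Counter(
        r["semantic_category"] for r in records if r.get("semantic_category")
    )
    # dict.fromkeys de-duplicates while keeping first-seen order
    topics = list(dict.fromkeys(r["topic"] for r in records if r.get("topic")))
    items = list(dict.fromkeys(i for r in records for i in (r.get("items") or [])))
    dominant = categories.most_common(1)[0][0] if categories else None
    return {
        "cluster_size": len(records),
        "dominant_category": dominant,
        "category_distribution": dict(categories),
        "topics": topics,
        "items": items,
    }

# Hypothetical cluster of three searches made close together in time
cluster = [
    {"semantic_category": "Shopping", "topic": "Dresses", "items": ["Ted Baker"]},
    {"semantic_category": "Shopping", "topic": "Shoes", "items": ["River Island", "Ted Baker"]},
    {"semantic_category": "News & Media", "topic": "Celebrity news", "items": []},
]
summary = summarize_cluster(cluster)
print(summary)
```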

In [None]:
# Create comprehensive structured output combining searches, visits, and clusters
print("=== COMPREHENSIVE USER INTEREST PROFILE ===\n")

# Aggregate all activities (searches + visits) by semantic category
all_activities = []

# Add search queries
for idx, row in search_queries_df.iterrows():
    all_activities.append({
        'activity_type': 'search',
        'text': row['search_query'],
        'timestamp': row['timestamp'],
        'semantic_category': row['semantic_category'],
        'topic': row['topic'],
        'items': row['items'],
    })

# Add page visits
for idx, row in page_visits_df.iterrows():
    all_activities.append({
        'activity_type': 'page_visit',
        'text': row['title'],
        'timestamp': row['timestamp'],
        'semantic_category': row['semantic_category'],
        'topic': row['topic'],
        'items': row['items'],
    })

# Create activity dataframe
activities_df = pd.DataFrame(all_activities).sort_values('timestamp')

# Generate category-based summary
print("CONSOLIDATED INTERESTS BY SEMANTIC CATEGORY:")
print("=" * 80)

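# dict.fromkeys de-duplicates while keeping first-seen order (set() order is arbitrary);
# rows whose 'items' value is not a list (e.g. NaN) are skipped
category_summary = activities_df.groupby('semantic_category').agg({
    'activity_type': 'count',
    'topic': lambda x: ', '.join(list(dict.fromkeys(x.dropna()))[:5]),
    'items': lambda x: ', '.join(list(dict.fromkeys(
        item for sublist in x if isinstance(sublist, list) for item in sublist))[:8])
}).rename(columns={'activity_type': 'activity_count', 'topic': 'main_topics', 'items': 'key_entities'})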

for category in category_summary.index:
    row = category_summary.loc[category]
    count = row['activity_count']
    topics = row['main_topics']
    items = row['key_entities']
    
    pct = (count / len(activities_df)) * 100
    print(f"\n{category.upper()}")
    print(f"  Activity Count: {int(count)} ({pct:.1f}%)")
    print(f"  Main Topics: {topics if topics else 'N/A'}")
    print(f"  Key Entities: {items if items else 'N/A'}")

# Temporal category trends
print("\n\n" + "=" * 80)
print("CATEGORY TRENDS BY TIME OF DAY:")
print("=" * 80)

activities_df['hour'] = activities_df['timestamp'].dt.hour
peak_categories_by_hour = activities_df.groupby(['hour', 'semantic_category']).size().reset_index(name='count')

for hour in range(0, 24, 3):
    hour_data = peak_categories_by_hour[peak_categories_by_hour['hour'].between(hour, hour + 2)]
    if len(hour_data) > 0:
        # Sum counts across the three hours so the winner is the top category for the whole window,
        # not just the largest single (hour, category) cell
        window_totals = hour_data.groupby('semantic_category')['count'].sum()
        top_category = window_totals.idxmax()
        print(f"{hour:02d}:00-{hour+2:02d}:59 → {top_category:30s} ({int(window_totals.max())} activities)")

=== COMPREHENSIVE USER INTEREST PROFILE ===



KeyError: 'semantic_category'
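
The `KeyError` above means a lookup of `semantic_category` failed, so `search_queries_df` or `page_visits_df` most likely lacked that column when the cell ran (for instance if an earlier extraction cell was skipped or stored its results under a different name; the exact cause is not visible here). A small pre-flight check, sketched below, reports which expected columns are absent before any aggregation starts. The helper name `missing_columns` and the example column list are hypothetical.

```python
REQUIRED_COLUMNS = ["timestamp", "semantic_category", "topic", "items"]

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# With pandas this would be called as missing_columns(search_queries_df.columns)
# and missing_columns(page_visits_df.columns); a plain list stands in here.
example_columns = ["search_query", "timestamp", "category", "topic", "items"]
print(missing_columns(example_columns))  # ['semantic_category']
```

Running this check at the top of the cell turns a mid-loop `KeyError` into an explicit list of what is missing and from which dataframe.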

## Section 10: Comprehensive User Interest Profile

**Objective:** Combine search and page visit data to create a holistic user profile with consolidated insights.

**What this section does:**
- Aggregates all activities (searches + visits) into single dataframe
- Groups by semantic category and provides activity counts
- Shows main topics and key entities per category
- Analyzes temporal trends showing peak interest times
- Displays category preferences across hours of the day
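
To make the time-of-day analysis concrete, the sketch below buckets activities into 3-hour windows and picks the most frequent category in each window. It uses only the standard library rather than pandas, and the function name `top_category_by_window` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

def top_category_by_window(activities, window_hours=3):
    """Map each window start hour to (top_category, count) for that window."""
    counts = defaultdict(Counter)
    for timestamp, category in activities:
        window_start = (timestamp.hour // window_hours) * window_hours
        counts[window_start][category] += 1
    return {start: counter.most_common(1)[0] for start, counter in sorted(counts.items())}

# Hypothetical (timestamp, category) pairs
sample = [
    (datetime(2024, 1, 5, 9, 15), "Shopping"),
    (datetime(2024, 1, 5, 10, 40), "News & Media"),
    (datetime(2024, 1, 5, 11, 5), "Shopping"),
    (datetime(2024, 1, 6, 22, 30), "Entertainment"),
]
windows = top_category_by_window(sample)
for start, (category, count) in windows.items():
    print(f"{start:02d}:00-{start + 2:02d}:59 -> {category} ({count} activities)")
```

Counts are accumulated per window before the winner is chosen, so a category spread evenly over three hours is not beaten by one that spikes in a single hour.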

In [None]:
# Export structured data to JSON for further use
import json

# Create a clean structured output format
structured_output = {
    'metadata': {
        'total_activities': len(activities_df),
        'total_searches': len(search_queries_df),
        'total_page_visits': len(page_visits_df),
        'total_search_clusters': len(search_clusters),
        'date_range': {
            'start': str(activities_df['timestamp'].min()),
            'end': str(activities_df['timestamp'].max())
        }
    },
    'semantic_categories': {},
    'search_clusters_analysis': [],
    'top_entities': {}
}

# Add category-level insights
for category in activities_df['semantic_category'].unique():
    cat_data = activities_df[activities_df['semantic_category'] == category]
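    items = []
    for item_list in cat_data['items']:
        # Skip rows whose 'items' value is missing (e.g. NaN)
        if isinstance(item_list, list):
            items.extend(item_list)
    
    structured_output['semantic_categories'][category] = {
        'activity_count': int(cat_data.shape[0]),
        'search_count': int(cat_data[cat_data['activity_type'] == 'search'].shape[0]),
        'visit_count': int(cat_data[cat_data['activity_type'] == 'page_visit'].shape[0]),
        'topics': list(set(cat_data['topic'].dropna())),
        # Rank by frequency so 'top_entities' lists the most common items,
        # not just the first ten seen (Counter is imported in the first cell)
        'top_entities': [entity for entity, _ in Counter(items).most_common(10)],
    }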

# Add cluster analysis
for i, cluster in enumerate(search_clusters[:5], 1):
    context = extract_cluster_context(cluster)
    structured_output['search_clusters_analysis'].append({
        'cluster_id': i,
        'size': context['cluster_size'],
        'dominant_category': context['dominant_category'],
        'category_distribution': context['category_distribution'],
        'topics': context['topics'],
        'entities': context['items'],
        'searches': cluster
    })

# Collect top entities across all categories
all_entities = []
for item_list in activities_df['items']:
    all_entities.extend(item_list)

from collections import Counter
entity_counts = Counter(all_entities)
structured_output['top_entities'] = dict(entity_counts.most_common(20))

# Save to JSON
output_file = 'structured_user_interests.json'
with open(output_file, 'w') as f:
    json.dump(structured_output, f, indent=2, default=str)

print(f"✓ Structured output saved to '{output_file}'")
print(f"\nGenerated JSON contains:")
print(f"  • Metadata about the analysis")
print(f"  • Semantic categories with topics and entities")
print(f"  • Search cluster analysis")
print(f"  • Top entities by frequency")

# Display the structured output
print("\n\n=== STRUCTURED USER INTEREST PROFILE (JSON FORMAT) ===\n")
print(json.dumps(structured_output, indent=2, default=str)[:2000] + "\n...[truncated]")

NameError: name 'activities_df' is not defined

## Section 11: Summary and Generated Files

**Objective:** Provide a summary of all output files generated by this analysis and usage instructions.

**What this section does:**
- Lists all generated files with descriptions and sizes
- Shows when files were created and updated
- Provides next steps and usage examples
- Explains how to load and use the structured output
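
The summary cell for this section reports file sizes; a last-modified time can be read from the same file metadata. The sketch below shows one way to do that with `os.stat`, assuming the modification time is an acceptable stand-in for "created and updated" (creation time is not reported consistently across operating systems). The helper name `describe_file` is hypothetical, and the demonstration uses a temporary file rather than the notebook's real outputs.

```python
import os
import tempfile
from datetime import datetime

def describe_file(path):
    """Return size in MB and last-modified time for path, or None if it does not exist."""
    if not os.path.exists(path):
        return None
    stats = os.stat(path)
    return {
        "size_mb": stats.st_size / (1024 * 1024),
        "modified": datetime.fromtimestamp(stats.st_mtime),
    }

# Demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("{}")
    info = describe_file(demo_path)
    missing_info = describe_file(os.path.join(tmp_dir, "absent.json"))

print(f"size: {info['size_mb']:.6f} MB, modified: {info['modified']:%Y-%m-%d %H:%M}")
print(f"missing file -> {missing_info}")
```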

In [None]:
# Display summary of all generated files
import os

print("=" * 80)
print("GENERATED FILES SUMMARY")
print("=" * 80)

files_to_check = [
    ('structured_user_interests.json', 'Complete analysis with categories, topics, entities'),
    ('openai_extraction_cache.json', 'Cached API responses (reusable for future runs)'),
]

print("\n📁 Output Files:\n")

for filename, description in files_to_check:
    if os.path.exists(filename):
        file_size = os.path.getsize(filename)
        file_size_mb = file_size / (1024 * 1024)
        
        print(f"✓ {filename}")
        print(f"  Description: {description}")
        print(f"  Size: {file_size_mb:.2f} MB")
        print()
    else:
        print(f"⚠ {filename} - Not found")
        print(f"  Description: {description}")
        print()

print("=" * 80)
print("NEXT STEPS")
print("=" * 80)
print("""
1. Use 'structured_user_interests.json' for analysis and recommendations
2. The cache file will speed up future runs (no re-processing needed)
3. To re-run with fresh data, delete 'openai_extraction_cache.json'

Example: Load the structured data
    import json
    with open('structured_user_interests.json') as f:
        data = json.load(f)
    
    # Access categories
    categories = data['semantic_categories']
    
    # Access top entities
    entities = data['top_entities']
""")

## Section 12: Advanced Visualization and Insights

**Objective:** Create detailed tables and visualizations showing the complete extraction results for all activities.

**What this section does:**
- Creates comprehensive tables of search queries with extraction results
- Shows page visits with extracted categories and entities
- Builds category-to-topic-to-item hierarchical mapping
- Displays activity counts and key entities per category
- Provides detailed view of all extracted structured information
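
The category-to-topic-to-item hierarchy mentioned above can be pictured as a nested mapping. The sketch below builds one from a list of activity records using only the standard library. The function name `build_hierarchy`, the `"Unknown"` fallback label, and the sample records are hypothetical; the field names match the ones used elsewhere in the notebook.

```python
from collections import defaultdict

def build_hierarchy(activities):
    """Return {category: {topic: [items...]}} with items de-duplicated in first-seen order."""
    hierarchy = defaultdict(lambda: defaultdict(dict))
    for activity in activities:
        category = activity.get("semantic_category") or "Unknown"
        topic = activity.get("topic") or "Unknown"
        bucket = hierarchy[category][topic]  # created on first access, even with no items
        for item in activity.get("items") or []:
            bucket.setdefault(item, None)  # dict keys act as an ordered set
    return {
        category: {topic: list(items) for topic, items in topics.items()}
        for category, topics in hierarchy.items()
    }

# Hypothetical activity records
sample_activities = [
    {"semantic_category": "News & Media", "topic": "Celebrity news", "items": ["Elon Musk", "Grimes"]},
    {"semantic_category": "News & Media", "topic": "Celebrity news", "items": ["Elon Musk", "Neuralink"]},
    {"semantic_category": "Shopping", "topic": "Shoes", "items": []},
]
hierarchy = build_hierarchy(sample_activities)
for category, topics in hierarchy.items():
    print(category)
    for topic, items in topics.items():
        print(f"  {topic}: {', '.join(items) if items else '-'}")
```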

In [None]:
# Build a detailed table of extraction results
print("=== DETAILED ENTITY EXTRACTION & CATEGORIZATION TABLE ===\n")

# Create a comprehensive table with all information
detailed_table = []

for idx, row in search_queries_df.head(30).iterrows():
    detailed_table.append({
        'Activity': 'Search',
        'Query/Title': row['search_query'][:50],
        'Category': row['semantic_category'],
        # Topic may be missing for some rows; show '-' instead of slicing None/NaN
        'Topic': row['topic'][:30] if isinstance(row['topic'], str) else '-',
        'Entities': ', '.join(row['items'][:2]) if row['items'] else '-',
        'Entity Count': len(row['items'])
    })

detailed_df = pd.DataFrame(detailed_table)
print("Search Queries - Structured Extraction (OpenAI GPT-4o-mini):")
print(detailed_df.to_string())

print("\n\n" + "=" * 100)
print("PAGE VISITS - STRUCTURED EXTRACTION:\n")

# Similar for page visits
visit_table = []
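for idx, row in page_visits_df.head(20).iterrows():
    # Treat missing entity lists (None/NaN) as empty, as in the search table
    items = row['items'] if isinstance(row['items'], list) else []
    topic = row['topic'] if isinstance(row['topic'], str) else '-'
    visit_table.append({
        'Activity': 'Page Visit',
        'Query/Title': row['title'][:50],
        'Category': row['semantic_category'],
        'Topic': topic[:30],
        'Entities': ', '.join(items[:2]) if items else '-',
    })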

visit_df = pd.DataFrame(visit_table)
print(visit_df.to_string())

# Category to Item mapping
print("\n\n" + "=" * 100)
print("CATEGORY → TOPIC → ITEM MAPPING:\n")

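category_mapping = activities_df.groupby('semantic_category').apply(
    lambda x: {
        # Deduplicate first (keeping first-seen order), then truncate,
        # so duplicates do not crowd out distinct topics or items
        'topics': list(dict.fromkeys(t for t in x['topic'] if pd.notna(t)))[:5],
        'items': list(dict.fromkeys(
            item
            for items in x['items'] if isinstance(items, list)
            for item in items
        ))[:8],
        'activity_count': len(x)
    }
).to_dict()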

for category, info in category_mapping.items():
    print(f"\n{'█' * 3} {category}")
    print(f"    Activity Count: {info['activity_count']}")
    if info['topics']:
        print(f"    Topics: {', '.join(info['topics'])}")
    if info['items']:
        print(f"    Key Items/Entities: {', '.join(info['items'])}")

# Search History Information Extraction & Analysis - Complete Reference

## Executive Summary
This notebook analyzes Google search history data. Rather than relying on keyword matching or embedding-based similarity, it extracts structured, semantic information through:

1. **Named Entity Recognition (NER)** - Extracts specific entities (brands, products, locations, people)
2. **Semantic Categorization** - Classifies searches and visits into 11+ semantic categories
3. **Behavior Analysis** - Identifies patterns, clusters, and temporal trends
4. **Web Content Extraction** - Analyzes actual pages visited, not just search queries
5. **Comprehensive Profiling** - Combines all signals for holistic user understanding
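
Web content extraction (item 4) can fail when a page is unreachable or blocks scraping. One way to degrade gracefully is to derive a rough description from the URL itself. The sketch below is illustrative only: `text_from_url` and `page_text` are hypothetical helpers, and `fetch_fn` stands in for whatever scraping function the notebook actually uses.

```python
import re
from urllib.parse import urlparse, unquote

def text_from_url(url):
    """Derive a rough description from the URL's domain and path segments."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    if domain.startswith('www.'):
        domain = domain[4:]
    # Split the decoded path on common separators: / - _ . +
    words = [w for w in re.split(r'[/\-_.+]+', unquote(parsed.path)) if w]
    return ' '.join([domain] + words)

def page_text(url, fetch_fn):
    """Return scraped text, or URL-derived text if scraping fails or is empty.

    fetch_fn is a hypothetical callable (url -> str) representing the scraper.
    """
    try:
        text = fetch_fn(url)
    except Exception:
        text = None
    return text if text else text_from_url(url)
```

The URL-derived text is far less informative than real page content, so downstream extraction results based on it are best treated as lower confidence.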

## Key Features:
- **Dual Processing**: Analyzes both search queries (intent) and page visits (behavior)
- **API Optimization**: Implements caching to reduce costs and improve efficiency
- **Fallback Strategies**: Handles failures gracefully (e.g., falls back to URL parsing when scraping fails)
- **Structured Output**: Generates machine-readable JSON for downstream integration
- **Cost Transparency**: Tracks and reports API usage and estimated costs
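
The caching mentioned under API Optimization can be sketched as a JSON file keyed by a hash of the input text. This is a minimal illustration, not the notebook's actual implementation: `load_cache`, `save_cache`, and `cached_extract` are hypothetical names, `extract_fn` stands in for the real API call, and the SHA-256 key scheme is an assumption. Only the cache file name is taken from the notebook.

```python
import hashlib
import json
import os

CACHE_FILE = 'openai_extraction_cache.json'  # file name used by this notebook

def load_cache(path=CACHE_FILE):
    """Return cached responses, or an empty dict if no cache file exists."""
    if os.path.exists(path):
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_FILE):
    """Write the cache back to disk as JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)

def cached_extract(text, cache, extract_fn):
    """Call extract_fn(text) only on a cache miss.

    extract_fn is a hypothetical callable (str -> JSON-serializable result).
    """
    key = hashlib.sha256(text.encode('utf-8')).hexdigest()
    if key not in cache:
        cache[key] = extract_fn(text)
    return cache[key]
```

Under this scheme, deleting the cache file forces every input to be re-processed on the next run, which matches the re-run advice given earlier in the notebook.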

## Data Flow:
```
Raw Search History JSON
    ↓
Parse & Classify (searches/visits/other)
    ↓
├─→ Search Queries → Temporal Analysis → NER Extraction
│
└─→ Page Visits → Web Scraping → NER Extraction
    ↓
Combine Results → Cluster Analysis → User Profile
    ↓
Export to Structured JSON
```
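
The "Parse & Classify" step at the top of this flow can be sketched as below. It assumes that search records have titles starting with "Searched for " and page visits have titles starting with "Visited ", which is how Google activity exports commonly label them; the notebook's own classification rules may differ. `classify_record` and `split_records` are hypothetical names.

```python
def classify_record(record):
    """Label a history record as 'search', 'visit', or 'other' by title prefix."""
    title = record.get('title') or ''
    if title.startswith('Searched for '):  # assumed prefix for search queries
        return 'search'
    if title.startswith('Visited '):  # assumed prefix for page visits
        return 'visit'
    return 'other'

def split_records(records):
    """Group records into the three branches shown in the data flow."""
    groups = {'search': [], 'visit': [], 'other': []}
    for record in records:
        groups[classify_record(record)].append(record)
    return groups
```

Calling `split_records(search_data)` would then yield the two branches (searches and visits) that feed the downstream extraction steps, with everything else set aside under `'other'`.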