# Embedding-Based Retrieval with Activeloop and OpenAI

Copyright 2024 Denis Rothman

This first component of the RAG pipeline collects a set of Wikipedia articles on space exploration, cleans the text, and writes it to `llm.txt`.

# Environment

In [1]:
!uv tree

rag-driven-generative-ai v0.1.0
├── beautifulsoup4 v4.14.2
│   ├── soupsieve v2.8
│   └── typing-extensions v4.15.0
├── deeplake v3.9.52
│   ├── boto3 v1.40.49
│   │   ├── botocore v1.40.49
│   │   │   ├── jmespath v1.0.1
│   │   │   ├── python-dateutil v2.9.0.post0
│   │   │   │   └── six v1.16.0
│   │   │   └── urllib3 v2.5.0
│   │   ├── jmespath v1.0.1
│   │   └── s3transfer v0.14.0
│   │       └── botocore v1.40.49 (*)
│   ├── click v8.3.0
│   │   └── colorama v0.4.6
│   ├── humbug v0.3.2
│   │   └── requests v2.32.5
│   │       ├── certifi v2025.10.5
│   │       ├── charset-normalizer v3.4.4
│   │       ├── idna v3.11
│   │       └── urllib3 v2.5.0
│   ├── lz4 v4.4.4
│   ├── numpy v2.3.4
│   ├── pathos v0.3.4
│   │   ├── dill v0.4.0
│   │   ├── multiprocess v0.70.18
│   │   │   └── dill v0.4.0
│   │   ├── pox v0.3.6
│   │   └── ppft v1.7.7
│   ├── pillow v10.4.0
│   ├── pydantic v2.12.3
│   │   ├── annotated-types v0.7.0
│   │   ├── pydantic-core v2.41.4
│   │   │   └── typing-ext

Resolved 151 packages in 3ms


In [2]:
# !pip install beautifulsoup4==4.12.3
# !pip install requests==2.31.0

# DATA COLLECTION

## Collecting the data

In [9]:
import requests
from bs4 import BeautifulSoup
import re

# URLs of the Wikipedia articles
urls = [
    "https://en.wikipedia.org/wiki/Space_exploration",
    "https://en.wikipedia.org/wiki/Apollo_program",
    "https://en.wikipedia.org/wiki/Hubble_Space_Telescope",
    "https://en.wikipedia.org/wiki/Mars_rover",  # Corrected link
    "https://en.wikipedia.org/wiki/International_Space_Station",
    "https://en.wikipedia.org/wiki/SpaceX",
    "https://en.wikipedia.org/wiki/Juno_(spacecraft)",
    "https://en.wikipedia.org/wiki/Voyager_program",
    "https://en.wikipedia.org/wiki/Galileo_(spacecraft)",
    "https://en.wikipedia.org/wiki/Kepler_Space_Telescope",
    "https://en.wikipedia.org/wiki/James_Webb_Space_Telescope",
    "https://en.wikipedia.org/wiki/Space_Shuttle",
    "https://en.wikipedia.org/wiki/Artemis_program",
    "https://en.wikipedia.org/wiki/Skylab",
    "https://en.wikipedia.org/wiki/NASA",
    "https://en.wikipedia.org/wiki/European_Space_Agency",
    "https://en.wikipedia.org/wiki/Ariane_(rocket_family)",
    "https://en.wikipedia.org/wiki/Spitzer_Space_Telescope",
    "https://en.wikipedia.org/wiki/New_Horizons",
    "https://en.wikipedia.org/wiki/Cassini%E2%80%93Huygens",
    "https://en.wikipedia.org/wiki/Curiosity_(rover)",
    "https://en.wikipedia.org/wiki/Perseverance_(rover)",
    "https://en.wikipedia.org/wiki/InSight",
    "https://en.wikipedia.org/wiki/OSIRIS-REx",
    "https://en.wikipedia.org/wiki/Parker_Solar_Probe",
    "https://en.wikipedia.org/wiki/BepiColombo",
    "https://en.wikipedia.org/wiki/Jupiter_Icy_Moons_Explorer",
    "https://en.wikipedia.org/wiki/Solar_Orbiter",
    "https://en.wikipedia.org/wiki/CHEOPS",
    "https://en.wikipedia.org/wiki/Gaia_(spacecraft)"
]
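
One entry in the list above, `Cassini%E2%80%93Huygens`, is percent-encoded, so a title derived by splitting the URL will show the escape sequences unless it is decoded. A minimal sketch of one way to get a readable title, assuming the standard library's `urllib.parse.unquote`; the helper name `article_title` is hypothetical and not part of the notebook:

```python
from urllib.parse import unquote

def article_title(url):
    """Return a readable article title from a Wikipedia URL (hypothetical helper)."""
    slug = url.rstrip('/').split('/')[-1]   # last path segment, e.g. "Mars_rover"
    return unquote(slug).replace('_', ' ')  # decode %XX escapes, then restore spaces

print(article_title("https://en.wikipedia.org/wiki/Cassini%E2%80%93Huygens"))
print(article_title("https://en.wikipedia.org/wiki/Juno_(spacecraft)"))
```

Decoding before replacing underscores keeps the two steps independent, so the order only matters if a slug contains an encoded underscore.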

## Preparing the data
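
The cleaning step in the next cell removes numeric citation markers such as `[1]` with the pattern `\[\d+\]`. The sketch below restates that function on a made-up sentence to show what the pattern does and does not match; the sample text is illustrative and not taken from the corpus:

```python
import re

def clean_text(content):
    # Same pattern as the notebook: a bracketed run of digits, e.g. [1] or [23]
    return re.sub(r'\[\d+\]', '', content)

# Illustrative sample, not real article text
sample = "Apollo 11 landed in 1969.[1][23] Details vary.[a][citation needed]"
cleaned = clean_text(sample)
print(cleaned)
```

Non-numeric markers such as `[a]` or `[citation needed]` pass through unchanged, so a broader pattern would be needed if those should be removed too.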

In [10]:
def clean_text(content):
    # Remove references that usually appear as [1], [2], etc.
    content = re.sub(r'\[\d+\]', '', content)
    return content

def fetch_and_clean(url):
    try:
        # Add user agent to avoid blocking
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        
        # Fetch the content of the URL
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise error for bad status codes
        
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the main content of the article
        content = soup.find('div', class_='mw-parser-output')
        
        # Check if content was found
        if content is None:
            print("  Warning: Could not find content div")
            return ""

        # Get all paragraphs on the page
        all_paragraphs = soup.find_all('p')
        
        # Filter paragraphs that are within the mw-parser-output div
        # Use find_parent to check if paragraph is inside content div
        content_paragraphs = [p for p in all_paragraphs 
                              if p.find_parent('div', class_='mw-parser-output') is not None]
        
        if not content_paragraphs:
            print("  Warning: No paragraphs found in content area")
            return ""
        
        # Further filter out paragraphs from reference/navigation sections
        filtered_paragraphs = []
        for p in content_paragraphs:
            # Get text to check if it's meaningful
            text = p.get_text(strip=True)
            if len(text) < 20:  # Skip very short paragraphs
                continue
            
            # Check if paragraph is inside a reference/citation section by class
            parent_classes = []
            for parent in p.parents:
                if parent.get('class'):
                    parent_classes.extend(parent.get('class'))
            
            # Skip if in reference/navigation sections
            skip_classes = ['references', 'reflist', 'navbox', 'metadata', 'ambox', 'infobox']
            parent_class_string = ' '.join(parent_classes).lower()
            if any(skip_class in parent_class_string for skip_class in skip_classes):
                continue
            
            # Check if we're past the References heading
            prev_headings = p.find_all_previous(['h2', 'h3'], limit=1)
            if prev_headings:
                heading_text = prev_headings[0].get_text().strip().lower()
                # Skip paragraphs that fall under these sections
                if any(ref_word in heading_text for ref_word in ['references', 'bibliography', 'external links', 'see also', 'notes', 'citations']):
                    continue
            
            filtered_paragraphs.append(p)
        
        # Extract text from filtered paragraphs
        if filtered_paragraphs:
            text = ' '.join([p.get_text(strip=True) for p in filtered_paragraphs])
            text = clean_text(text)
            return text
        else:
            print("  Warning: All paragraphs filtered out")
            return ""
        
    except requests.exceptions.RequestException as e:
        print(f"  Error fetching: {e}")
        return ""
    except Exception as e:
        print(f"  Unexpected error: {e}")
        import traceback
        traceback.print_exc()
        return ""

# Counters for the run summary; the clean text is written to llm.txt below
successful = 0
skipped = 0

with open('llm.txt', 'w', encoding='utf-8') as file:
    for i, url in enumerate(urls, 1):
        article_name = url.split('/')[-1].replace('_', ' ')
        print(f"\nProcessing {i}/{len(urls)}: {article_name}")
        clean_article_text = fetch_and_clean(url)
        
        if clean_article_text and len(clean_article_text) > 100:
            file.write(clean_article_text + '\n\n')  # Double newline between articles
            print(f"  ✓ Success: {len(clean_article_text):,} characters extracted")
            successful += 1
        else:
            print("  ✗ Skipped: Insufficient content")
            skipped += 1

print(f"\n{'='*60}")
print(f"Complete: {successful} articles saved, {skipped} skipped")
print(f"Success rate: {successful/len(urls)*100:.1f}%")
print("Content written to llm.txt")


Processing 1/30: Space exploration
  ✓ Success: 40,071 characters extracted

Processing 2/30: Apollo program
  ✓ Success: 57,059 characters extracted

Processing 3/30: Hubble Space Telescope
  ✓ Success: 81,333 characters extracted

Processing 4/30: Mars rover
  ✓ Success: 2,543 characters extracted

Processing 5/30: International Space Station
  ✓ Success: 108,244 characters extracted

Processing 6/30: SpaceX
  ✓ Success: 54,564 characters extracted

Processing 7/30: Juno (spacecraft)
  ✓ Success: 27,529 characters extracted

Processing 8/30: Voyager program
  ✓ Success: 24,412 characters extracted

Processing 9/30: Galileo (spacecraft)
  ✓ Success: 20,553 characters extracted

Processing 10/30: Kepler Space Telescope
  ✓ Success: 53,907 characters extracted

Processing 11/30: James Webb Space Telescope
  ✓ Success: 51,837 characters extracted

Processing 12/30: Space Shuttle
  ✓ Success: 61,917 characters extracted

Processing 13/30: Artemis program
  ✓ Success: 59,908 characters ex
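
`fetch_and_clean` decides whether to skip a paragraph by joining every ancestor's class names into one string and testing each skip word with `in`. That is a substring test, so it also matches longer class names that merely contain a skip word. A small sketch of that behaviour next to an exact-match alternative; the function names and the sample class lists are hypothetical:

```python
SKIP_CLASSES = ['references', 'reflist', 'navbox', 'metadata', 'ambox', 'infobox']

def skipped_by_substring(ancestor_classes):
    # Mirrors the notebook: join all class names, then substring-match each skip word
    joined = ' '.join(ancestor_classes).lower()
    return any(skip in joined for skip in SKIP_CLASSES)

def skipped_by_exact_match(ancestor_classes):
    # Alternative: a class name must equal a skip word exactly
    return not set(SKIP_CLASSES).isdisjoint(c.lower() for c in ancestor_classes)

# Illustrative class lists, as might be collected from a paragraph's ancestors
for classes in (['mw-parser-output'], ['reflist', 'columns'], ['infobox-caption']):
    print(classes, skipped_by_substring(classes), skipped_by_exact_match(classes))
```

The substring form is more forgiving toward class-name variants, at the cost of occasionally dropping a paragraph whose ancestor class only happens to contain one of the words.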

In [11]:
# Open the file and read the first 20 lines
with open('llm.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
    # Print the first 20 lines
    for line in lines[:20]:
        print(line.strip())

Space explorationis the physical investigation ofouter spacebyuncrewed robotic space probesand throughhuman spaceflight. While the observation of objects in space, known asastronomy, predates reliablerecorded history, it was the development of large and relatively efficientrocketsduring the mid-twentieth century that allowed physical space exploration to become a reality. Common rationales for exploring space include advancing scientific research, national prestige, uniting different nations, ensuring the future survival of humanity, and developing military and strategic advantages against other countries. The early era of space exploration was driven by a "Space Race" in which theSoviet Unionand theUnited Statesvied to demonstrate their technological superiority. Landmarks of this era include the launch of the first human-made object to orbitEarth, the Soviet Union'sSputnik 1, on 4 October 1957, and the firstMoon landingby the AmericanApollo 11mission on 20 July 1969. The Soviet space
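
In the sample above, words are fused wherever the original page had a link or other inline markup ("explorationis", "ofouter spacebyuncrewed"). This is consistent with how `get_text(strip=True)` behaves: each text fragment is stripped and the fragments are joined with an empty separator. A minimal sketch on a made-up HTML snippet, assuming `beautifulsoup4` is installed as in the environment listing; the snippet and variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Illustrative markup, not copied from Wikipedia
html = '<p><b>Space exploration</b> is the investigation of <a href="#">outer space</a> by probes.</p>'
p = BeautifulSoup(html, 'html.parser').find('p')

fused = p.get_text(strip=True)            # fragments stripped, then joined with ''
spaced = ' '.join(p.get_text().split())   # keep original spacing, collapse whitespace runs

print(fused)
print(spaced)
```

Using a variant like `spaced` in place of `get_text(strip=True)` when joining paragraphs would be one way to keep word boundaries; the character counts reported earlier would change as a result.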