# Epstein Files Search Engine — Cloud Build

This notebook downloads all data sources, normalizes them, and builds the search index entirely in Google Colab.
The final index is saved to your Google Drive so you can pull it into the GitHub repo.

**Runtime**: ~15 min | **RAM**: ~8 GB peak | **Drive space**: ~600 MB for index

## 1. Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Where the final index will be saved
DRIVE_OUTPUT = '/content/drive/MyDrive/epstein-search-index'
!mkdir -p "$DRIVE_OUTPUT"
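The build takes about 15 minutes and the index needs roughly 600 MB, so it can be worth confirming up front that the output folder has room. This is a minimal sketch, assuming `shutil.disk_usage` reports meaningful numbers for the mounted Drive. The helper name `has_room` is illustrative, and the 600 MB default comes from the estimate at the top of this notebook.

```python
import shutil

def has_room(path, needed_mb=600):
    """Return True if the filesystem holding `path` has at least `needed_mb` MB free."""
    free_mb = shutil.disk_usage(path).free / (1024 ** 2)
    return free_mb >= needed_mb

# Example on the current directory; in this notebook you would pass DRIVE_OUTPUT
print(has_room('.'))
```

If this prints `False` for `DRIVE_OUTPUT`, free up space in Drive before running the remaining cells.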

## 2. Clone the repo

In [None]:
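# Skip the clone if the directory already exists, so the cell can be re-run
![ -d /content/epstein-search ] || git clone https://github.com/loudfair/abovea.cloud.git /content/epstein-search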
%cd /content/epstein-search

## 3. Install dependencies

In [None]:
!pip install -q -r requirements.txt
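Because `pip install -q` suppresses most output, a failed install is easy to miss. The sketch below checks that the packages imported by later cells can be found. The helper name `missing_modules` is illustrative, and the module list only covers the imports visible in this notebook (`datasets`, `psutil`), not everything in `requirements.txt`.

```python
import importlib.util

def missing_modules(names):
    """Return the subset of top-level module `names` that cannot be found on the import path."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# `datasets` and `psutil` are imported by later cells in this notebook
print(missing_modules(['datasets', 'psutil']))
```

An empty list means both packages are importable.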

## 4. Download all data sources
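Downloads from Hugging Face, GitHub, and Archive.org. Everything runs on the Colab VM, so nothing is downloaded to your local machine. The same cell then normalizes the sources and builds the index.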

In [None]:
%%time
import os
os.chdir('/content/epstein-search')

# ── Download HuggingFace datasets ──
print("Downloading HuggingFace datasets...")
os.makedirs('downloads/huggingface', exist_ok=True)

from datasets import load_dataset

hf_datasets = [
    ('theelderemo/FULL_EPSTEIN_INDEX', 'full_index'),
    ('to-be/epstein-emails', 'emails'),
    ('svetfm/epstein-files-nov11-25-house-post-ocr-embeddings', 'embeddings'),
    ('svetfm/epstein-fbi-files', 'fbi_files'),
    ('vikash06/EpsteinFiles', 'fbi_ocr'),
    ('567-labs/jmail-house-oversight', 'house_emails'),
]

for name, folder in hf_datasets:
    outdir = f'downloads/huggingface/{folder}'
    os.makedirs(outdir, exist_ok=True)
    # Treat any saved split as a cache hit; not every dataset has a "train" split
    if any(fn.endswith('.parquet') for fn in os.listdir(outdir)):
        print(f'  ✓ {name} (cached)')
        continue
    try:
        ds = load_dataset(name)
        for split in ds:
            ds[split].to_parquet(f'{outdir}/{split}.parquet')
        print(f'  ✓ {name} ({len(ds[list(ds.keys())[0]])} rows)')
    except Exception as e:
        print(f'  ✗ {name}: {e}')

# ── Clone GitHub repos ──
print("\nCloning GitHub repos...")
repos = [
    ("https://github.com/epstein-docs/epstein-docs.github.io.git", "epstein-docs"),
    ("https://github.com/markramm/EpsteinFiles.git", "markramm"),
    ("https://github.com/benbaessler/epfiles.git", "epfiles"),
    ("https://github.com/theelderemo/FULL_EPSTEIN_INDEX.git", "full-index"),
    ("https://github.com/HarleyCoops/TrumpEpsteinFiles.git", "trump-files"),
    ("https://github.com/LMSBAND/epstein-files-db.git", "epstein-files-db"),
    ("https://github.com/promexdotme/epstein-justice-files-text.git", "justice-files-text"),
    ("https://github.com/phelix001/epstein-network.git", "epstein-network"),
    ("https://github.com/maxandrews/Epstein-doc-explorer.git", "doc-explorer"),
    ("https://github.com/yung-megafone/Epstein-Files.git", "magnet-links"),
    ("https://github.com/SvetimFM/epstein-files-visualizations.git", "visualizations"),
    ("https://github.com/paulgp/epstein-document-search.git", "document-search"),
]
for url, dirname in repos:
    dest = f"downloads/github/{dirname}"
    if os.path.isdir(dest):
        print(f"  ✓ {dirname} (cached)")
    else:
        r = os.system(f'git clone --depth 1 -q "{url}" "{dest}" 2>/dev/null')
        print(f"  {'✓' if r == 0 else '✗'} {dirname}")

# ── Download Archive.org PDFs ──
print("\nDownloading Archive.org files...")
archives = [
    ("downloads/archive/flight-logs/epstein-flight-logs.pdf",
     "https://archive.org/download/epstein-flight-logs-unredacted-17/EPSTEIN%20FLIGHT%20LOGS%20UNREDACTED%20%2817%29.pdf",
     "Flight logs"),
    ("downloads/archive/black-book/black-book.pdf",
     "https://archive.org/download/jeffrey-epstein-39s-little-black-book-unredacted/Jeffrey%20Epstein%27s%20Little%20Black%20Book%20unredacted.pdf",
     "Black book"),
    ("downloads/archive/epstein-docs-collection/Epstein-Docs.pdf",
     "https://ia600705.us.archive.org/21/items/epsteindocs/Epstein-Docs.pdf",
     "Epstein docs collection"),
    ("downloads/archive/depositions/Edwards-vs-Epstein-depositions.pdf",
     "https://ia600705.us.archive.org/21/items/epsteindocs/12%23%20Epstein%20deposition%27s%20-%20Edwards%20vs%20Epstein%20%2B%20attachments.pdf",
     "Depositions"),
]
for path, url, label in archives:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.exists(path):
        print(f"  ✓ {label} (cached)")
    else:
        # -f makes curl exit non-zero on HTTP errors (e.g. 404) instead of saving the error page
        r = os.system(f'curl -fsL -o "{path}" "{url}"')
        print(f"  {'✓' if r == 0 else '✗'} {label}")

# ── Normalize ──
print("\n\nNormalizing all sources...")
os.makedirs('data/normalized', exist_ok=True)
os.makedirs('data/index', exist_ok=True)
!python normalize.py

# ── Build index ──
print("\nBuilding search index...")
!python build_index.py

print("\n✅ Done!")
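A completed transfer does not guarantee that the saved file is a PDF, because a server can return an HTML page in its place. PDF files start with the bytes `%PDF-`, so a quick header check catches that case. This is a sketch only: the helper name `looks_like_pdf` is illustrative, and the two paths are taken from the `archives` list in the cell above.

```python
import os

def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF magic bytes."""
    if not os.path.isfile(path):
        return False
    with open(path, 'rb') as fh:
        return fh.read(5) == b'%PDF-'

# Paths taken from the `archives` list in the download cell
for p in [
    'downloads/archive/flight-logs/epstein-flight-logs.pdf',
    'downloads/archive/black-book/black-book.pdf',
]:
    print(p, looks_like_pdf(p))
```

If a file fails the check, delete it before re-running the download cell. Otherwise the cell sees the existing file and reports it as cached.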

## 5. Check disk & memory

In [None]:
import psutil, shutil

mem = psutil.virtual_memory()
disk = shutil.disk_usage('/content')

print(f"RAM:  {mem.available / (1024**3):.1f} GB available / {mem.total / (1024**3):.1f} GB total")
print(f"Disk: {disk.free / (1024**3):.1f} GB available / {disk.total / (1024**3):.1f} GB total")
print()
!du -sh /content/epstein-search/downloads/ 2>/dev/null || echo 'No downloads yet'
!du -sh /content/epstein-search/data/ 2>/dev/null || echo 'No data yet'
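If you want the directory sizes as numbers in Python, for example to compare them against the free space printed above, the `du` calls can be approximated with `os.walk`. This sketch sums file sizes, so the totals can differ slightly from `du`, which counts disk blocks. The helper name `dir_size_bytes` is illustrative.

```python
import os

def dir_size_bytes(root):
    """Sum the sizes of all regular files under `root` (symlinked files are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

for d in ['/content/epstein-search/downloads', '/content/epstein-search/data']:
    print(f"{d}: {dir_size_bytes(d) / (1024 ** 2):.1f} MB")
```

A missing directory reports 0.0 MB, because `os.walk` yields nothing for a path that does not exist.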

## 6. Run security audit

In [None]:
!python audit.py
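A `!` command does not raise a Python exception when the script exits with a non-zero status, so "Run all" would carry on past a failed audit. One way to make a failure stop the notebook is to run the script through `subprocess`. This is a sketch that assumes `audit.py` signals problems through its exit code, which this notebook does not confirm. The helper name `run_checked` is illustrative.

```python
import subprocess
import sys

def run_checked(cmd):
    """Run `cmd` (a list of arguments) and raise CalledProcessError on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout, end='')
    if result.returncode != 0:
        print(result.stderr, end='', file=sys.stderr)
        result.check_returncode()
    return result

# In this notebook the call would be: run_checked([sys.executable, 'audit.py'])
run_checked([sys.executable, '-c', 'print("ok")'])
```

The same helper could wrap `normalize.py` and `build_index.py`, at the cost of seeing their output only after each script finishes.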

## 7. Copy index to Google Drive
Copies the built search index, plus `corpus.jsonl` for the web UI, to the Google Drive folder created in step 1.

In [None]:
import shutil, os

INDEX_DIR = '/content/epstein-search/data/index'

# Copy index files to Drive
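for f in os.listdir(INDEX_DIR):
    src = os.path.join(INDEX_DIR, f)
    dst = os.path.join(DRIVE_OUTPUT, f)
    if os.path.isdir(src):
        # copy2 only handles files; copy any subdirectory recursively
        print(f'Copying directory {f}/...')
        shutil.copytree(src, dst, dirs_exist_ok=True)
        continue
    print(f'Copying {f} ({os.path.getsize(src) / (1024*1024):.1f} MB)...')
    shutil.copy2(src, dst)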

# Also copy the corpus for the web UI
corpus_src = '/content/epstein-search/data/normalized/corpus.jsonl'
if os.path.exists(corpus_src):
    corpus_dst = os.path.join(DRIVE_OUTPUT, 'corpus.jsonl')
    print(f'Copying corpus.jsonl ({os.path.getsize(corpus_src) / (1024*1024):.1f} MB)...')
    shutil.copy2(corpus_src, corpus_dst)

print(f'\n✓ Index saved to Google Drive: {DRIVE_OUTPUT}')
!du -sh "$DRIVE_OUTPUT"
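Writes to a mounted Drive are synced in the background, so it can be reassuring to confirm that each copy matches its source. The sketch below compares SHA-256 digests and reads the files in chunks to keep memory use flat. The helper names `sha256_of` and `same_content` are illustrative.

```python
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large index files do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def same_content(src, dst):
    """Return True if both files hash to the same SHA-256 digest."""
    return sha256_of(src) == sha256_of(dst)
```

In this notebook you would call `same_content(os.path.join(INDEX_DIR, f), os.path.join(DRIVE_OUTPUT, f))` for each copied file.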

## 8. Create GitHub Release (optional)
Uploads the index as a GitHub release asset so a fresh clone of the repo can download it directly. The cell is commented out; uncomment it to use it.

In [None]:
# Uncomment to upload the index as a release. The token is read from a prompt
# so it is not saved in the notebook.
# from getpass import getpass
# GITHUB_TOKEN = getpass('GitHub token: ')
#
# !pip install -q requests
# import requests, os, json
#
# REPO = 'loudfair/abovea.cloud'
# TAG = 'index-v1'
# headers = {'Authorization': f'token {GITHUB_TOKEN}'}
#
# # Create release
# r = requests.post(
#     f'https://api.github.com/repos/{REPO}/releases',
#     headers=headers,
#     json={'tag_name': TAG, 'name': 'Search Index v1', 'body': 'Pre-built search index'}
# )
# r.raise_for_status()  # stop here if the release was not created (e.g. the tag already exists)
# release = r.json()
# upload_url = release['upload_url'].replace('{?name,label}', '')
#
# # Tar the index
# !cd /content/epstein-search/data && tar czf /tmp/search-index.tar.gz index/
#
# # Upload
# with open('/tmp/search-index.tar.gz', 'rb') as f:
#     r = requests.post(
#         f'{upload_url}?name=search-index.tar.gz',
#         headers={**headers, 'Content-Type': 'application/gzip'},
#         data=f
#     )
# print(f'✓ Uploaded: {r.json().get("browser_download_url", r.text)}')
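The `tar czf` step in the cell above can also be done with Python's `tarfile` module, which makes it easy to list the archive contents before uploading. This is a sketch rather than part of the repo. The helper names `make_tarball` and `list_members` are illustrative.

```python
import os
import tarfile

def make_tarball(src_dir, out_path):
    """Write `src_dir` into a gzip-compressed tar, keeping its base name as the top-level folder."""
    arcname = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(out_path, 'w:gz') as tar:
        tar.add(src_dir, arcname=arcname)
    return out_path

def list_members(tar_path):
    """Return the member names stored in the archive."""
    with tarfile.open(tar_path, 'r:gz') as tar:
        return tar.getnames()
```

Calling `make_tarball('/content/epstein-search/data/index', '/tmp/search-index.tar.gz')` produces an archive whose members sit under a top-level `index` folder, matching the layout of the shell command.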

## 9. Quick test — verify the index works

In [None]:
!python search.py --text "flight log" --limit 3

---
## Next steps

Your index is now in Google Drive at `MyDrive/epstein-search-index/`.

### Pull to your local repo with rclone (recommended)

```bash
# One-time: install rclone and connect Google Drive
brew install rclone        # macOS (or: curl https://rclone.org/install.sh | sudo bash)
rclone config              # → New remote → name it "gdrive" → Google Drive → follow auth

# Then just run the setup script:
git clone https://github.com/loudfair/abovea.cloud.git
cd abovea.cloud
chmod +x setup_from_drive.sh
./setup_from_drive.sh      # syncs index from Drive → ready in seconds

# Start the web UI:
source venv/bin/activate
python app.py              # → http://localhost:5000
```

### Or mount Drive directly (zero download)
```bash
# Mount your Drive folder as a local directory — files stream on demand
mkdir -p data/index
rclone mount gdrive:epstein-search-index/ data/index/ --vfs-cache-mode full --daemon
# Now search.py and app.py just work — reads stream from Drive
```
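A mount can fail or drop later, for example when the Drive authorization expires, and `search.py` would then see an empty `data/index/`. Before starting the app, a quick check along these lines can confirm the mount is live. It is a sketch under the assumption that a working mount shows up as a mount point with at least one entry. The helper name `mount_ready` is illustrative.

```python
import os

def mount_ready(path):
    """Return True if `path` is a mount point that lists at least one entry."""
    # ismount is False for missing paths, so listdir only runs on an existing directory
    return os.path.ismount(path) and len(os.listdir(path)) > 0

print(mount_ready('data/index'))
```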