# Wiktionary Etymology Scraper

This notebook uses the `wiktionary_scraper` module to scrape and visualize etymological relationships between languages.

It supports two types of etymological relationships:
- **Borrowed terms**: Words that one language borrowed from another
- **Derived terms**: Words that are derived from words in another language
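
Both relationship types are encoded in Wiktionary category names, which the analysis cells later in this notebook split on `_terms_borrowed_from_` and `_terms_derived_from_`. The sketch below shows that split on its own. The helper name `split_category` is hypothetical, the second example name is illustrative, and the sketch assumes the keys are bare category names with no `Category:` prefix.

```python
from urllib.parse import unquote

# Separators as used by the comparison cells in this notebook.
SEPARATORS = {
    "borrowed": "_terms_borrowed_from_",
    "derived": "_terms_derived_from_",
}


def split_category(name, category_type):
    """Split a category name into (recipient, source), or return None."""
    recipient, sep, source = name.partition(SEPARATORS[category_type])
    if not sep:
        return None  # not a language-pair category
    # Decode percent-escapes and turn underscores into spaces for display.
    return (
        unquote(recipient).replace("_", " "),
        unquote(source).replace("_", " "),
    )


print(split_category("Balinese_terms_borrowed_from_Hokkien", "borrowed"))
# ('Balinese', 'Hokkien')
print(split_category("English_terms_derived_from_Old_French", "derived"))
# ('English', 'Old French')
```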

## Setup

In [1]:
# Imports
import json
import time
from pathlib import Path

import pandas as pd
import numpy as np

# Import the wiktionary_scraper module
import wiktionary_scraper as ws

# Optional: Sound notifications when scraping completes
try:
    import chime
    chime.theme('pokemon')
    CHIME_AVAILABLE = True
except ImportError:
    CHIME_AVAILABLE = False
    print("Note: Install 'chime' package for sound notifications: pip install chime")

In [2]:
# Configuration
DATA_DIR = Path(".")
BORROWED_TERMS_FILE = DATA_DIR / "borrowed_terms.json"
DERIVED_TERMS_FILE = DATA_DIR / "derived_terms.json"

## Part 1: Borrowed Terms

### Scrape Borrowed Terms

This cell scrapes all borrowed terms from Wiktionary. **Warning**: This can take several hours to complete; the run logged below took about 284 minutes.

If `borrowed_terms.json` already exists, skip this cell and load from the file in the next section.

In [3]:
# Scrape borrowed terms (skip if borrowed_terms.json already exists)
start_time = time.time()

borrowed_terms = ws.scrape_etymological_terms(
    category_type="borrowed",
    save_path=str(BORROWED_TERMS_FILE),
    verbose=True
)

elapsed = time.time() - start_time
print(f"\nScraping completed in {elapsed/60:.1f} minutes")

if CHIME_AVAILABLE:
    chime.success()

Scraping borrowed terms from: https://en.wiktionary.org/wiki/Category:Borrowed_terms_by_language

Step 1: Collecting root category pages...
  Found 8 page(s)

Step 2: Collecting language categories...


Level 1 pages: 100%|██████████| 8/8 [00:00<00:00, 46.74it/s]

  Found 5744 language categorie(s)

Step 3: Collecting language-pair categories...

  Found 5743 language-pair categorie(s)

Step 4: Expanding language-pair categories to find all subcategories...


Language pairs: 100%|██████████| 5743/5743 [12:08<00:00,  7.88it/s]  


  Found 13108 subcategorie(s)

Step 5: Extracting terms from categories...


Extracting terms:  35%|███▌      | 6622/18851 [15:45<3:55:27,  1.16s/it]

Failed to load url https://en.wiktionary.org/wiki/Category:Balinese_terms_borrowed_from_Hokkien


Extracting terms: 100%|██████████| 18851/18851 [3:55:03<00:00,  1.34it/s]   


Done! Collected 12349 categories with 135898 total terms
Saved results to: borrowed_terms.json

Scraping completed in 283.9 minutes
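
The log above reports one category URL that failed to load. How `wiktionary_scraper` fetches pages is not shown here, so the following is only a sketch of one common remedy: retrying a fetch with exponential backoff. `fetch_with_retry` and the `fetch` callable are hypothetical names, not part of the module.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


# Demo with a stand-in fetcher that fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return f"<html>{url}</html>"


page = fetch_with_retry(
    flaky_fetch,
    "https://en.wiktionary.org/wiki/Category:Balinese_terms_borrowed_from_Hokkien",
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(len(calls))  # 3
```

Passing `sleep` as a parameter keeps the demo instant; in a real run the default `time.sleep` would apply the delays.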

### Load Borrowed Terms (from existing file)

If you already have `borrowed_terms.json`, load it here instead of re-scraping.

In [None]:
# Load borrowed terms from file
if BORROWED_TERMS_FILE.exists():
    borrowed_terms = ws.load_terms_from_json(str(BORROWED_TERMS_FILE))
    print(f"Loaded {len(borrowed_terms)} categories")
    print(f"Total terms: {sum(len(v) for v in borrowed_terms.values()):,}")
    
    # Show top 10 categories by term count
    print("\nTop 10 categories by term count:")
    sorted_cats = sorted(borrowed_terms.items(), key=lambda x: len(x[1]), reverse=True)[:10]
    for cat, urls in sorted_cats:
        print(f"  {cat}: {len(urls):,} terms")
else:
    print(f"File not found: {BORROWED_TERMS_FILE}")
    print("Run the scraping cell above first.")
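
The manual "skip the scraping cell if the file exists" step can be folded into one helper. This is a minimal sketch, assuming the scrape call writes the file itself (as `save_path` suggests). `load_or_scrape` is a hypothetical name, and the `fake_*` functions are stand-ins used only for the demo.

```python
import json
import tempfile
from pathlib import Path


def load_or_scrape(path, load, scrape):
    """Load results from `path` if it exists; otherwise scrape and save there."""
    path = Path(path)
    if path.exists():
        return load(path)
    return scrape(path)  # `scrape` is expected to write `path` itself


# Stand-ins for the notebook's load and scrape calls (demo only).
scrape_calls = []


def fake_scrape(path):
    scrape_calls.append(path)
    terms = {"Balinese_terms_borrowed_from_Hokkien": ["https://example.org/term"]}
    path.write_text(json.dumps(terms), encoding="utf-8")
    return terms


def fake_load(path):
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "borrowed_terms.json"
    first = load_or_scrape(cache, fake_load, fake_scrape)   # file missing: scrapes
    second = load_or_scrape(cache, fake_load, fake_scrape)  # file present: loads

print(len(scrape_calls), first == second)  # 1 True
```

In this notebook, `load` would wrap `ws.load_terms_from_json(str(path))` and `scrape` would wrap `ws.scrape_etymological_terms(category_type="borrowed", save_path=str(path), verbose=True)`, the same calls used in the cells above.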

### Visualize Borrowed Terms Heatmap

This creates a heatmap showing which languages (debtors) borrowed the most terms from which other languages (creditors).

In [4]:
# Create and display heatmap for borrowed terms
if 'borrowed_terms' in locals():
    heatmap = ws.create_language_heatmap(
        borrowed_terms,
        category_type="borrowed",
        top_n=50
    )
    display(heatmap)
else:
    print("Load borrowed_terms first (see cells above)")

debtor \ creditor,French,Latin,English,Spanish,Sanskrit,Italian,Arabic,German,Ancient Greek,Dutch,Classical Persian,Russian,Mandarin,Japanese,Swedish,Ottoman Turkish,Hungarian,Old Church Slavonic,Old French,Pali,Old Armenian,Polish,Hindi,Hanyu Pinyin,Javanese,Occitan,Hebrew,Wade–Giles,Persian,Portuguese,Late Latin,Ukrainian,Malay,Romanian,Middle French,New Latin,Greek,Moroccan Arabic,Chinese,Maori,Medieval Latin,Korean,Urdu,Serbo-Croatian,Hokkien,Turkish,Koine Greek,Esperanto,Old Norse,Sicilian
Romanian,27069,1065,225,1,0,130,1,336,4,1,0,105,1,1,0,433,803,571,0,0,0,3,0,0,0,0,1,0,0,1,1,65,0,0,0,1,152,0,0,0,1,0,0,7,0,11,1,1,0,0
English,5205,4893,0,2271,187,2495,278,1937,506,243,1,630,1506,912,34,2,3,1,1,1,1,244,303,419,1,1,118,387,11,124,33,184,6,237,176,175,16,2,6,149,24,89,8,9,12,19,2,1,1,1
Tagalog,1,1,193,5129,2,1,1,1,1,0,1,0,1,10,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,14,0,0,0,0,0,1,0,0,1,0,0,21,0,0,0,0,0
Portuguese,433,2926,295,148,4,43,4,2,9,2,1,2,2,32,1,1,1,0,1,3,0,3,2,1,0,1,1,0,2,0,18,1,1,2,1,1,1,0,1,1,16,1,1,1,1,1,0,1,1,0
Spanish,498,2475,455,0,2,65,18,1,17,2,1,2,1,4,1,1,1,0,1,1,0,1,1,0,0,1,1,0,1,1,54,1,1,2,1,1,1,1,1,1,7,1,1,1,1,1,0,0,1,1
Indonesian,1,37,200,1,2,2,22,1,4,2811,1,1,2,4,1,0,0,0,0,1,0,0,4,0,385,0,2,0,1,5,1,1,1,0,1,1,0,0,1,0,1,1,1,2,3,1,0,0,0,1
Hindi,1,1,193,1,2300,1,41,1,1,0,951,2,1,5,0,1,0,0,0,1,0,0,0,0,1,0,1,0,1,1,0,1,5,0,0,0,0,0,10,0,0,1,1,0,0,1,0,0,0,0
Polish,1114,1000,293,1,1,15,1,525,9,1,1,26,1,1,1,1,3,1,1,1,0,0,2,0,1,1,6,0,4,3,1,12,1,1,1,3,1,0,1,1,1,1,1,1,0,1,1,1,1,0
Japanese,34,1,2882,3,1,1,16,13,1,3,1,3,1,0,1,0,2,0,0,2,0,1,1,0,0,0,1,0,1,2,0,1,1,0,0,1,1,0,10,1,0,1,1,1,1,1,0,1,1,0
Catalan,112,2707,10,35,1,5,10,1,8,1,0,1,1,1,1,1,1,0,5,0,0,1,1,0,0,1,2,0,1,1,32,1,1,1,1,3,1,1,1,0,1,1,0,0,0,1,1,0,2,1
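
A debtor-by-creditor table like the one above can be derived from the scraped dictionary by counting terms per language pair. The internals of `ws.create_language_heatmap` are not shown in this notebook, so this is only a sketch of one plausible approach. It assumes keys shaped like `Romanian_terms_borrowed_from_French`; the function names and the sample data are made up for illustration.

```python
from collections import Counter

SEP = "_terms_borrowed_from_"


def pair_counts(terms):
    """Map (debtor, creditor) -> number of terms, read from category names."""
    counts = Counter()
    for category, urls in terms.items():
        debtor, sep, creditor = category.partition(SEP)
        if sep:  # skip keys that are not language-pair categories
            counts[(debtor, creditor)] += len(urls)
    return counts


def top_languages(counts, axis, top_n):
    """Rank languages on one axis (0 = debtor, 1 = creditor) by total terms."""
    totals = Counter()
    for pair, n in counts.items():
        totals[pair[axis]] += n
    return [lang for lang, _ in totals.most_common(top_n)]


def build_matrix(terms, top_n=50):
    """Return row labels, column labels, and a dense matrix of term counts."""
    counts = pair_counts(terms)
    debtors = top_languages(counts, 0, top_n)
    creditors = top_languages(counts, 1, top_n)
    rows = [[counts.get((d, c), 0) for c in creditors] for d in debtors]
    return debtors, creditors, rows


# Tiny made-up input, not real scrape results.
sample = {
    "Romanian_terms_borrowed_from_French": ["a", "b", "c"],
    "Romanian_terms_borrowed_from_Latin": ["d"],
    "English_terms_borrowed_from_French": ["e", "f"],
}
debtors, creditors, rows = build_matrix(sample)
print(debtors)    # ['Romanian', 'English']
print(creditors)  # ['French', 'Latin']
print(rows)       # [[3, 1], [2, 0]]
```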


## Part 2: Derived Terms

### Scrape Derived Terms

This cell scrapes all derived terms from Wiktionary. **Warning**: This can take several hours to complete; the borrowed-terms scrape in Part 1 took about 284 minutes.

If `derived_terms.json` already exists, skip this cell and load from the file in the next section.

In [None]:
# Scrape derived terms (skip if derived_terms.json already exists)
start_time = time.time()

derived_terms = ws.scrape_etymological_terms(
    category_type="derived",
    save_path=str(DERIVED_TERMS_FILE),
    verbose=True
)

elapsed = time.time() - start_time
print(f"\nScraping completed in {elapsed/60:.1f} minutes")

if CHIME_AVAILABLE:
    chime.success()

### Load Derived Terms (from existing file)

If you already have `derived_terms.json`, load it here instead of re-scraping.

In [None]:
# Load derived terms from file
if DERIVED_TERMS_FILE.exists():
    derived_terms = ws.load_terms_from_json(str(DERIVED_TERMS_FILE))
    print(f"Loaded {len(derived_terms)} categories")
    print(f"Total terms: {sum(len(v) for v in derived_terms.values()):,}")
    
    # Show top 10 categories by term count
    print("\nTop 10 categories by term count:")
    sorted_cats = sorted(derived_terms.items(), key=lambda x: len(x[1]), reverse=True)[:10]
    for cat, urls in sorted_cats:
        print(f"  {cat}: {len(urls):,} terms")
else:
    print(f"File not found: {DERIVED_TERMS_FILE}")
    print("Run the scraping cell above first.")

### Visualize Derived Terms Heatmap

This creates a heatmap showing which languages (recipients) have the most terms derived from which other languages (sources).

In [None]:
# Create and display heatmap for derived terms
if 'derived_terms' in locals():
    heatmap = ws.create_language_heatmap(
        derived_terms,
        category_type="derived",
        top_n=50
    )
    display(heatmap)
else:
    print("Load derived_terms first (see cells above)")

## Part 3: Comparative Analysis (Optional)

Compare borrowed vs derived terms to understand different patterns of linguistic influence.

In [None]:
# Compare borrowed vs derived terms
if 'borrowed_terms' in locals() and 'derived_terms' in locals():
    # Count total terms
    borrowed_total = sum(len(v) for v in borrowed_terms.values())
    derived_total = sum(len(v) for v in derived_terms.values())
    
    print("Comparison Summary")
    print("=" * 50)
    print(f"Borrowed terms: {len(borrowed_terms):,} categories, {borrowed_total:,} total terms")
    print(f"Derived terms:  {len(derived_terms):,} categories, {derived_total:,} total terms")
    print()
    
    # Collect every language that appears on either side of a language pair
    def extract_languages(terms_dict, pattern):
        languages = set()
        for cat in terms_dict:
            # partition() splits on the first occurrence of the separator only
            recipient, sep, source = cat.partition(pattern)
            if sep:
                languages.update((recipient, source))
        return languages
    
    borrowed_langs = extract_languages(borrowed_terms, "_terms_borrowed_from_")
    derived_langs = extract_languages(derived_terms, "_terms_derived_from_")
    
    print(f"Languages with borrowed terms: {len(borrowed_langs)}")
    print(f"Languages with derived terms:  {len(derived_langs)}")
    print(f"Languages in both:             {len(borrowed_langs & derived_langs)}")
    print()
    
    # Find languages only in one category
    only_borrowed = borrowed_langs - derived_langs
    only_derived = derived_langs - borrowed_langs
    
    if only_borrowed:
        print(f"Languages only in borrowed: {len(only_borrowed)}")
        print(f"  Examples: {', '.join(sorted(only_borrowed)[:10])}")
        print()
    
    if only_derived:
        print(f"Languages only in derived: {len(only_derived)}")
        print(f"  Examples: {', '.join(sorted(only_derived)[:10])}")
else:
    print("Load both borrowed_terms and derived_terms to run this analysis")

In [None]:
# Find languages with most borrowing vs derivation
if 'borrowed_terms' in locals() and 'derived_terms' in locals():
    from urllib.parse import unquote
    
    # Count terms per language (as recipient/debtor)
    def count_by_recipient(terms_dict, pattern):
        counts = {}
        for cat, urls in terms_dict.items():
            if pattern in cat:
                recipient = cat.split(pattern)[0]
                recipient = unquote(recipient).replace("_", " ")
                counts[recipient] = counts.get(recipient, 0) + len(urls)
        return pd.Series(counts).sort_values(ascending=False)
    
    borrowed_by_lang = count_by_recipient(borrowed_terms, "_terms_borrowed_from_")
    derived_by_lang = count_by_recipient(derived_terms, "_terms_derived_from_")
    
    # Create comparison DataFrame
    comparison = pd.DataFrame({
        'borrowed': borrowed_by_lang,
        'derived': derived_by_lang
    }).fillna(0).astype(int)
    
    comparison['total'] = comparison['borrowed'] + comparison['derived']
    comparison['borrowed_pct'] = (comparison['borrowed'] / comparison['total'] * 100).round(1)
    comparison = comparison.sort_values('total', ascending=False)
    
    print("Top 20 languages by total etymological terms (borrowed + derived)")
    print(comparison.head(20))
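
The `borrowed_pct` column above is the borrowed count divided by the combined count. In pandas, a language whose total is zero yields `NaN` for that ratio. The plain-Python sketch below spells out the same arithmetic with an explicit guard; `borrowed_share` is a hypothetical helper and the counts are made up.

```python
def borrowed_share(borrowed, derived):
    """Percentage of terms that are borrowed, or None if there are no terms."""
    total = borrowed + derived
    if total == 0:
        return None  # avoid dividing by zero
    return round(borrowed / total * 100, 1)


print(borrowed_share(1, 3))  # 25.0
print(borrowed_share(0, 0))  # None
```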