In [1]:
%load_ext autoreload
%autoreload 2

# About

This notebook shows how to query the Archive.org API with a search string and save the results locally in the format expected by the next notebook, `2_do_inference.ipynb`.

In [2]:
from internetarchive import search_items, get_item, Search

from ner_pipeline.scrape_for_training import do_search
from ner_pipeline.scrape_for_training import prepare_data

# Step 1: Define documents to query

This first step finds documents that *may* contain citations of Homer's *Iliad* or *Odyssey*.

In [3]:
IL_OD: str = "iliad OR odyssey AND mediatype:texts"
# 543,608 results with full_text_search (as of 06 Sept 2021)
SEARCH_RES: Search = do_search(keyword_string=IL_OD)

Search string: iliad OR odyssey AND mediatype:texts
Results: 543609
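
The search string above mixes `OR` and `AND` without parentheses, so how the two operators group is left to the search backend's query parser. A minimal sketch of building the query with explicit grouping, assuming the intent is "either keyword, restricted to texts"; `build_query` is a hypothetical helper and not part of `ner_pipeline`, and the result count for the grouped query may differ from the one reported above.

```python
from typing import Sequence


def build_query(keywords: Sequence[str], mediatype: str = "texts") -> str:
    """Join keywords with OR, then restrict the whole group to one mediatype.

    Hypothetical helper for illustration only.
    """
    keyword_group = " OR ".join(keywords)
    # Parentheses make the grouping explicit instead of relying on parser precedence.
    return f"({keyword_group}) AND mediatype:{mediatype}"


query = build_query(["iliad", "odyssey"])
print(query)  # (iliad OR odyssey) AND mediatype:texts
```

The resulting string could then be passed as `keyword_string` to `do_search`.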


# Step 2: Query documents for citations

Next, search the documents returned above for citations that match the regex pattern below.

In [4]:
# Regex alternatives for citation patterns, e.g. "Iliad 2.558" or "Odyssey xxi. 350".
# Adjacent raw strings are concatenated, so no backslash or indentation leaks into the pattern.
PATTERN = (
    r'Iliad\s\d{1,2}\.\d{1,4}|Il\.*\s\d{1,2}\.\d{1,4}|Iliad\s.[ivxlcdm]*\.\s*\d{1,4}|'
    r'Il\.*\s.[ivxlcdm]*\.\s*\d{1,4}|book\s*.[ivxlcdm]\.\sline\s*\d{1,4}|'
    r'Odyssey\s\d{1,2}\.\d{1,4}|Od\.*\s\d{1,2}\.\d{1,4}|Odyssey\s.[ivxlcdm]*\.\s*\d{1,4}|'
    r'Od\.*\s.[ivxlcdm]*\.\s*\d{1,4}'
)
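
To see what kinds of strings such a pattern picks up, it can help to run a small regex over a few sample snippets. The sketch below uses `CITATION`, a simplified subset written for illustration (an assumption, not the notebook's full `PATTERN`): a work title or abbreviation followed by either an Arabic `book.line` pair or a Roman-numeral book and a line number.

```python
import re

# Simplified subset of the citation pattern (an assumption for illustration).
CITATION = re.compile(
    r"(?:Iliad|Il\.|Odyssey|Od\.)\s"                    # work title or abbreviation
    r"(?:\d{1,2}\.\s*\d{1,4}|[ivxlcdm]+\.\s*\d{1,4})"   # "2. 558" or "xxi. 350"
)

samples = [
    "Megarians for Salamis, they quoted Iliad 2. 558, where ",
    "The same lines occur in the Odyssey xxi. 350., and in ",
    "cp. Od. 4.293 ",
    "no citation in this line ",
]

for text in samples:
    # Each match yields the character offsets and the matched citation text.
    spans = [(m.start(), m.end(), m.group()) for m in CITATION.finditer(text)]
    print(spans)
```

The character offsets returned by `finditer` are the same kind of `start`/`end` values that appear in the saved annotations.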

Calling `prepare_data` saves a user-defined number of positive and negative instances in the directory `pos_neg_instances`.

In [5]:
NUM_POS = 10  # 10000
NUM_NEG = 10  # 10000
prepare_data(search_res=SEARCH_RES,
             pattern=PATTERN,
             num_of_pos=NUM_POS,
             num_of_neg=NUM_NEG)

Successfully got 10 positive data and 10 negative data by scraping 13 books!
Positive instances are saved at: pos_neg_instances/pos_instances_10.txt
Negative instances are saved at: pos_neg_instances/neg_instances_10.txt
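
The internals of `prepare_data` are not shown in this notebook. As a rough sketch of the idea, assuming a line counts as positive when the regex matches and negative otherwise, the collection loop could look like the following. `collect_instances` and the short pattern are hypothetical and only mirror the record layout seen in the saved files.

```python
import re
from typing import Iterable


def collect_instances(lines: Iterable[str], pattern: str, num_pos: int, num_neg: int):
    """Split lines into positive (matching) and negative (non-matching) records.

    Stops reading as soon as both quotas are filled.
    """
    regex = re.compile(pattern)
    positives, negatives = [], []
    for line in lines:
        if len(positives) >= num_pos and len(negatives) >= num_neg:
            break
        matches = list(regex.finditer(line))
        if matches and len(positives) < num_pos:
            annotations = [
                {"start": m.start(), "end": m.end(), "label": "Citation"}
                for m in matches
            ]
            positives.append({"content": line, "annotations": annotations})
        elif not matches and len(negatives) < num_neg:
            negatives.append({"content": line, "annotations": []})
    return positives, negatives


lines = [
    "cp. Odyssey iv. 293 ",
    "Hi ",
    "the Iliad vi. 490. at the close of the interview between ",
    "m ",
]
pos, neg = collect_instances(lines, r"(?:Iliad|Odyssey)\s[ivxlcdm]+\.\s*\d{1,4}", 1, 2)
print(pos)
print(neg)
```

Because the loop stops once both quotas are met, the number of books scraped depends on how quickly matching lines turn up.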


# Step 3: Inspect results

Now that the results have been downloaded, look at the two files that have been generated.

In [6]:
!head -n 10 pos_neg_instances/pos_instances_10.txt

{'content': 'I Megarians for Salamis, they quoted Iliad 2. 558, where ', 'annotations': [{'start': 37, 'end': 49, 'label': 'Citation'}]}
{'content': 'Megarians for Salamis, they quoted Iliad 2. 558, where ', 'annotations': [{'start': 35, 'end': 47, 'label': 'Citation'}]}
{'content': 'Megarians for Salamis, they quoted Iliad 2. 558, where ', 'annotations': [{'start': 35, 'end': 47, 'label': 'Citation'}]}
{'content': 'The same lines occur in the Odyssey xxi. 350., and in ', 'annotations': [{'start': 28, 'end': 44, 'label': 'Citation'}]}
{'content': 'the Iliad vi. 490. at the close of the interview between ', 'annotations': [{'start': 4, 'end': 17, 'label': 'Citation'}]}
{'content': 'Megarians for Salamis, they quoted Iliad 2. 558, where ', 'annotations': [{'start': 35, 'end': 47, 'label': 'Citation'}]}
{'content': 'cp. Odyssey iv. 293 ', 'annotations': [{'start': 4, 'end': 19, 'label': 'Citation'}]}
{'content': 'the Iliad (xix. 326-333) breaks the sequence of the verses ', 'annota

In [7]:
!head -n 10 pos_neg_instances/neg_instances_10.txt

{'content': 'wBMESm ', 'annotations': []}
{'content': '■ft*-: ', 'annotations': []}
{'content': 'Hi ', 'annotations': []}
{'content': 'm ', 'annotations': []}
{'content': "• ■■:*&'■- 1 ", 'annotations': []}
{'content': ', . .v;ii ; i- ', 'annotations': []}
{'content': 'IB ', 'annotations': []}
{'content': 'i . . ? /**« ', 'annotations': []}
{'content': '1 SSrSS ', 'annotations': []}
{'content': ':vy>< ', 'annotations': []}
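
Each saved line uses Python dict syntax with single quotes, so `json.loads` would reject it. A minimal sketch for reading a record back, assuming one complete record per line; `parse_instance` is a hypothetical helper, and the sample string is copied from the positive output above.

```python
import ast


def parse_instance(line: str) -> dict:
    """Parse one saved line written as a Python dict literal (not JSON)."""
    return ast.literal_eval(line)


sample = (
    "{'content': 'cp. Odyssey iv. 293 ', "
    "'annotations': [{'start': 4, 'end': 19, 'label': 'Citation'}]}"
)
record = parse_instance(sample)
for ann in record["annotations"]:
    # Slice the content with the stored offsets to recover the cited span.
    print(ann["label"], repr(record["content"][ann["start"]:ann["end"]]))
```

Slicing `content` with the stored offsets is a quick sanity check that the annotations line up with the text.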


# How to run

To scrape a large number of samples from Archive.org, export this notebook and put the exported file in the `scripts` directory.