### Intro

This notebook shows one way to work with a large body of text. It is based on a blog post by Bart de Goede entitled "Building a full-text search engine in 150 lines of code".

The post and the Python code it contains are a good practical example of using the computer to deal with a large amount of text: the dump of all English Wikipedia abstracts.

There are various ways of getting the dataset.

Simply pointing your browser at the address of the dataset will start the download of the gzipped file:

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz

But one could also write a small function in Python that will do the job:

In [4]:
import requests

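def download_wikipedia_abstracts():
    """Stream the gzipped abstracts dump to disk (the data/ directory must already exist)."""
    URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz"
    # stream=True avoids loading the whole response into memory at once
    with requests.get(URL, stream=True) as r:
        r.raise_for_status()
        with open('data/enwiki-latest-abstract.xml.gz', 'wb') as f:
            # Write the response to disk in chunks of up to 1 MB
            for i, chunk in enumerate(r.iter_content(chunk_size = 1024*1024)):
                f.write(chunk)
                # Report progress every 10 chunks
                if i % 10 == 0:
                    print(f'Downloaded {i} Mb', end='\r')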
                    
# download_wikipedia_abstracts()

Downloaded 760 Mb

Either way, we end up with a gzipped file of considerable size: 797.2 MB. Unzipping it yields an XML file of 6.3 GB.
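The archive does not have to be unzipped just to look at its first lines. The following is a minimal sketch, using only the standard library, that reads the beginning of a gzipped text file directly. The helper name `peek_gzip` is made up for this illustration, and the file is assumed to be UTF-8 encoded.

```python
import gzip
from itertools import islice

def peek_gzip(path, n=10):
    """Return the first n lines of a gzip-compressed text file.

    Hypothetical helper: only the part of the archive needed for
    these n lines is decompressed.
    """
    # "rt" opens the archive in text mode, so iteration yields str lines
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

Calling `peek_gzip('data/enwiki-latest-abstract.xml.gz')` would return the first ten lines of the dump, provided the file was saved at that path.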

To process a file this large, one first has to know how its data is structured. Usually one fires up an editor, but 6.3 GB is a serious file: using Emacs, I had to resort to its special VLF-mode, which loads the file in chunks.

In these cases some knowledge of the command line always pays off.

In [8]:
!head ../data/wikipedia/enwiki-latest-abstract.xml

<feed>
<doc>
<title>Wikipedia: Anarchism</title>
<url>https://en.wikipedia.org/wiki/Anarchism</url>
<abstract>Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be undesirable, unnecessary, and harmful.</abstract>
<links>
<sublink linktype="nav"><anchor>Etymology, terminology and definition</anchor><link>https://en.wikipedia.org/wiki/Anarchism#Etymology,_terminology_and_definition</link></sublink>
<sublink linktype="nav"><anchor>History</anchor><link>https://en.wikipedia.org/wiki/Anarchism#History</link></sublink>
<sublink linktype="nav"><anchor>Pre-modern era</anchor><link>https://en.wikipedia.org/wiki/Anarchism#Pre-modern_era</link></sublink>
<sublink linktype="nav"><anchor>Modern era</anchor><link>https://en.wikipedia.org/wiki/Anarchism#Modern_era</link></sublink>
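The output shows the structure: a `<feed>` root containing one `<doc>` element per article, each with a `<title>`, a `<url>`, an `<abstract>` and a list of `<links>`. Knowing this, the file can be processed one `<doc>` at a time instead of being loaded as a whole. The following is a minimal sketch using `iterparse` from the standard library; the function name `iter_docs` is made up for this illustration, and it is not necessarily how the original post reads the file.

```python
import xml.etree.ElementTree as ET

def iter_docs(source):
    """Yield a dict for every <doc> element in the abstracts dump.

    Hypothetical helper. `source` is a filename or a binary file
    object, for example the result of gzip.open(path, "rb").
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element (<feed>)
    _event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "doc":
            doc = {
                "title": elem.findtext("title", default=""),
                "url": elem.findtext("url", default=""),
                "abstract": elem.findtext("abstract", default=""),
            }
            # Drop the finished <doc> from the tree to keep memory use flat
            root.clear()
            yield doc
```

Because `iter_docs` is a generator, a loop such as `for doc in iter_docs(f): ...` only ever holds one article in memory, which matters for a file of several gigabytes.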
