# Reading plain text from the web

by Koenraad De Smedt at UiB

---
There is a lot of textual material on the web that can be read and processed.
The [CLARIN VLO](https://vlo.clarin.eu) is a searchable catalog of language resources in many formats. There is also much literature at [Project Gutenberg](https://gutenberg.org/).

This notebook will deal with the simplest format, namely, plain Unicode text. You will learn the following:

1.   Read a plain text from the web into a string
2.   Tokenize the text and compute the types and lexical variation
3.   Read and process plain text, line by line, from the web

---

# 0. Getting some text

We need to import the `requests` module, which can send a request to a webpage given its URL.

In [None]:
import requests

If you search for *Utopia* on the CLARIN VLO, you will find a [metadata record for Thomas More's Utopia](https://vlo.clarin.eu/record/https_58__47__47_hdl.handle.net_47_20.500.14106_47_3220_64_format_61_cmdi?26). That page has nine linked resources, one of which is the following link to plain text at the Oxford Text Archive. You can open it in a new tab in the browser to check that it contains plain text.

In [None]:
utopia_url = 'https://llds.ling-phil.ox.ac.uk/llds/xmlui/bitstream/handle/20.500.14106/3220/3220.txt?sequence=8'

You may think of copying the whole text and pasting it into a string, but that would be inconvenient for various reasons. Instead, we will get the text directly from the webpage into Python.

The function `requests.get` sends a request to the webpage at the given URL. The response contains several kinds of information, but here we are only interested in the textual content as a Unicode string, which we get by means of `.text`.
In this example, we take only the first 1000 characters because the whole text is too big to display here.

In [None]:
utopia_text = requests.get(utopia_url).text[:1000]
print(utopia_text)

# 1. Computing tokens, types and distribution

Now that we have plain text, we can further process it. We import the `nltk` module, which provides some useful text manipulation and counting functions.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab') # recent NLTK versions need this resource for word_tokenize
from nltk import word_tokenize, FreqDist

Make a list of word tokens from the lowercased text.

In [None]:
utopia_tokens = word_tokenize(utopia_text.lower())
print(utopia_tokens[:30])

The types are the distinct tokens in the text. The set of types can also be called the vocabulary of the text.

In [None]:
utopia_types = set(utopia_tokens)
print(utopia_types)

We can compute the lexical variation by dividing the number of types by the number of tokens. The higher this number, the more varied the use of words. The lower this number, the more words are repeated. For a very short text, this number doesn't mean all that much.

In [None]:
len(utopia_types) / len(utopia_tokens)

Let's define a function for lexical variation based on this proportion.

In [None]:
def lexical_variation(text):
  # proportion of types (distinct tokens) to tokens in the lowercased text
  tokens = word_tokenize(text.lower())
  types = set(tokens)
  return len(types) / len(tokens)

lexical_variation(utopia_text)
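
To see how the proportion changes with text length, here is a minimal sketch that uses only the standard library. The helper `simple_tokens` is a hypothetical regex-based stand-in for `word_tokenize` (it keeps only runs of letters and apostrophes, so its counts will differ from NLTK's), and the two sentences are made up for illustration.

```python
import re

def simple_tokens(text):
  # Hypothetical stand-in for nltk.word_tokenize: lowercased words, no punctuation.
  return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
  tokens = simple_tokens(text)
  # Guard against empty input, which would otherwise divide by zero.
  return len(set(tokens)) / len(tokens) if tokens else 0.0

first = "The cat sat on the mat."
both = first + " The dog sat on the log."

short_ratio = type_token_ratio(first)  # 5 types / 6 tokens
full_ratio = type_token_ratio(both)    # 7 types / 12 tokens
print(short_ratio, full_ratio)
```

Adding a second sentence that reuses several words lowers the ratio, which is why values for texts of very different lengths are hard to compare.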

Make a distribution and get the count of a token.

In [None]:
counts = FreqDist(utopia_tokens)
counts['prince'] # assume tokens are all lowercase
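
`FreqDist` is a subclass of Python's `collections.Counter`, so the counting idea can be sketched with the standard library alone. The token list below is made up for illustration, and the names `toy_tokens` and `toy_counts` are chosen only to avoid overwriting the variables above.

```python
from collections import Counter

toy_tokens = ["the", "prince", "and", "the", "people", "of", "the", "island"]
toy_counts = Counter(toy_tokens)

print(toy_counts["the"])          # number of occurrences of a token
print(toy_counts["utopia"])       # a token that never occurs has count 0, not an error
print(toy_counts.most_common(1))  # the most frequent token with its count
```

Because a missing token simply has count 0, a lookup such as `counts['Prince']` on lowercased tokens fails silently rather than raising an error, so it is worth checking the case of the word you look up.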

# 2. Streaming line by line

Instead of reading some or all characters of a webpage into a string, it is also possible to read and process *streamed* content line by line (if the text is divided into somewhat meaningful lines).

What we get from `iter_lines` is an iterator, so that only as many lines are read as the program asks for by means of `next`. The code in the following cell reads and prints the first 20 lines only and also prints a line counter.

By default, `iter_lines` produces byte strings (without newlines), so we need to tell it to decode each line into text.

In [None]:
utopia_stream = requests.get(utopia_url, stream=True)
linestream = utopia_stream.iter_lines(decode_unicode=True)
for n in range(20):
  print(n, next(linestream))

Here is an alternative way of reading and printing 20 lines. We zip a range of numbers and a stream of lines. Again, this is very efficient because if the range is limited to 20, only 20 lines are read and zipped.

In [None]:
utopia_stream = requests.get(utopia_url, stream=True)
for n, line in zip(range(20), utopia_stream.iter_lines(decode_unicode=True)):
  print(n, line)
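
The laziness of `zip` can be checked without a network connection. In this sketch, `logged` is a hypothetical helper generator that records every line actually pulled from a made-up list, which stands in for the stream of lines.

```python
def logged(lines, log):
  # Yield each line, recording it in `log` at the moment it is requested.
  for line in lines:
    log.append(line)
    yield line

sample_lines = ["one", "two", "three", "four", "five"]

pulled = []
pairs = list(zip(range(3), logged(sample_lines, pulled)))
print(pairs)
print(pulled)  # only three lines were requested

# With the arguments swapped, zip asks the line iterator first,
# so a fourth line is pulled and then discarded when the range runs out.
pulled_swapped = []
list(zip(logged(sample_lines, pulled_swapped), range(3)))
print(pulled_swapped)
```

The order of the arguments matters: with the range first, `zip` stops before it asks for another line.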

Suppose we want to read 20 lines and print only the lines containing a double quote sign. To do that, we add a condition with `if`.

In [None]:
utopia_stream = requests.get(utopia_url, stream=True)
for n, line in zip(range(20), utopia_stream.iter_lines(decode_unicode=True)):
  if '"' in line:
    print(n, line)

The previous cell counts 20 lines read, not 20 lines printed. If we want to print 20 lines with double quotes, we should use a counter that is increased only after we know we have a line that we want.

In [None]:
utopia_stream = requests.get(utopia_url, stream=True)
line_iterator = utopia_stream.iter_lines(decode_unicode=True)
printed = 0
while printed < 20:
  line = next(line_iterator, None) # None signals that the text has run out
  if line is None:
    break
  if '"' in line:
    print(printed, line)
    printed += 1
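
An alternative sketch uses `itertools.islice` on a filtering generator, which stops cleanly even when there are fewer matching lines than requested. The list `quote_lines` is made up and stands in for the lines from `iter_lines`, and `first_matching` is a hypothetical helper name.

```python
from itertools import islice

def first_matching(lines, needle, limit):
  # Lazily keep lines containing `needle`, stopping after `limit` matches
  # or when the input runs out, whichever comes first.
  matching = (line for line in lines if needle in line)
  return list(islice(matching, limit))

quote_lines = ['He said "yes".', 'No quotes here.', '"Why?" she asked.', 'The end.']

print(first_matching(quote_lines, '"', 1))
print(first_matching(quote_lines, '"', 20))  # fewer than 20 matches: no error
```

Since both the generator and `islice` are lazy, no more lines are pulled from the input than are needed to find the requested number of matches.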

## Exercises

1.  Read the full text of *Utopia* into Python. Do not print the whole text, because it is too long, but lowercase it, tokenize it and compute the lexical variation.

2.  Compute the distribution of tokens in the full text (see the notebook on tokenization and frequencies with NLTK). Print the counts of *I*, *you*, *he* and *she*. Also, compute the relative frequency of these words per million words. Optionally, make a barplot with the counts.

3.  Extend the code for reading lines with double quotes so that two counters are printed, one that counts the lines read and another that counts the lines printed.

4.  Read [one of the plain texts of the taped diary](https://llds.ling-phil.ox.ac.uk/llds/xmlui/bitstream/handle/20.500.14106/0070/tape1-0070.txt?sequence=8) of [Patty Hearst](https://www.youtube.com/watch?v=kDHccwiT_0E) available at the Oxford Text Archive. Notice that the first two lines are metadata: they are not part of the actual content. There are at least two possible strategies:
 - Read two lines with `next` but ignore them, then read the remaining lines and join them with newline;
 - Read the whole file but delete the first two lines with a *regex*.

5.  Find a large word list online with one word on each line. Iterate over its lines and print only the lines that are palindromes. Reuse the palindrome function from the earlier notebook about palindromes. The following are possible URLs for a large English word list. Alternatively, you can look for a list in another language.

 *   http://wiki.puzzlers.org/pub/wordlists/unixdict.txt
 *   https://raw.githubusercontent.com/quinnj/Rosetta-Julia/master/unixdict.txt
 *   https://searchcode.com/codesearch/raw/29038705/

6.  (optional) Suppose you need help in solving a crossword puzzle. Write a function `search_words` that iterates over an online word list, as suggested above, and prints all lines matching a given regex. For instance, `search_words('^[db]a...$')` will look for five-letter words starting with *d* or *b* followed by *a*. You may want to limit the number of words that are printed because there could be many.

7.  (optional) German has some very long words. Using a large word list for German (e.g. https://gist.githubusercontent.com/MarvinJWendt/2f4f4154b8ae218600eb091a5706b5f4/raw/36b70dd6be330aa61cd4d4cdfda6234dcb0b8784/wordlist-german.txt), iterate over its lines and select lines longer than 40 characters. Print each of those words or collect them in a list.

## Notes

1.  See the [documentation of requests](https://docs.python-requests.org/en/master/user/quickstart/) if you need more possibilities to access webpages.
If `.text` gives garbled characters because the webpage is encoded in something other than what `requests` assumes, you can instead use `.content.decode(encoding)`, which gets the raw content and decodes it according to the given encoding, for instance `.content.decode('cp1252')`.

2.  In some versions of Python on macOS, the use of the *requests* module may cause an error about certificates. If that is the case, you can run the *Install Certificates.command* on macOS. That command should be in the Python folder in your Applications folder.
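
The decoding step mentioned in note 1 can be tried without a network connection. This sketch encodes a made-up string as cp1252 to imitate the bytes in `.content`, then decodes those bytes with the right and the wrong encoding.

```python
raw = "café naïve".encode("cp1252")  # stands in for response.content

print(raw.decode("cp1252"))  # correct encoding: the text is recovered

try:
  raw.decode("utf-8")  # wrong encoding: these bytes are not valid UTF-8
except UnicodeDecodeError as err:
  print("cannot decode as UTF-8:", err.reason)

# With errors="replace", undecodable bytes become the replacement character U+FFFD.
replaced = raw.decode("utf-8", errors="replace")
print(replaced)
```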