# [Processing Raw Text](http://www.nltk.org/book/ch03.html)

## 3.1 Accessing Text from the Web and from Disk

### 3.1.1 Electronic Books

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese and Spanish (with more than 100 texts each).

Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows.

In [1]:
import nltk, re, pprint
from nltk import word_tokenize

In [2]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

In [3]:
print(type(raw))
print(len(raw))
raw[:75]

<class 'str'>
1176967


'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

The variable raw contains a string with 1,176,967 characters. (We can see that it is a string, using type(raw).) This is the raw content of the book, including many details we are not interested in such as whitespace, line breaks and blank lines. Notice the \r and \n in the opening line of the file, which is how Python displays the special carriage return and line feed characters (the file must have been created on a Windows machine). For our language processing, we want to break up the string into words and punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation.

In [4]:
tokens = word_tokenize(raw)
type(tokens)
len(tokens)
print(tokens[:10])

['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
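
The first token above, '\ufeffThe', still carries the byte order mark (BOM) that begins the file. The following is a minimal sketch, assuming the download is UTF-8 with a leading BOM as the output suggests. Python's standard 'utf-8-sig' codec drops the mark during decoding. The byte string is a made-up stand-in for response.read(), not the real download.

```python
# Stand-in for response.read(): UTF-8 bytes that begin with a byte order mark.
data = b'\xef\xbb\xbfThe Project Gutenberg EBook'

with_bom = data.decode('utf8')          # plain UTF-8 keeps the BOM as '\ufeff'
without_bom = data.decode('utf-8-sig')  # 'utf-8-sig' strips a leading BOM

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

Decoding the real response with 'utf-8-sig' instead of 'utf8' should keep the stray character out of the first token.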


If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations like slicing:

In [5]:
text = nltk.Text(tokens)
print(type(text))
print("-"*50)
print(text[1024:1062])
print("-"*50)
print(text.collocation_list())

<class 'nltk.text.Text'>
--------------------------------------------------
['an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.', 'He', 'had', 'successfully']
--------------------------------------------------
['Katerina Ivanovna', 'Pyotr Petrovitch', 'Pulcheria Alexandrovna', 'Avdotya Romanovna', 'Rodion Romanovitch', 'Marfa Petrovna', 'Sofya Semyonovna', 'old woman', 'Project Gutenberg-tm', 'Porfiry Petrovitch', 'Amalia Ivanovna', 'great deal', 'young man', 'Nikodim Fomitch', 'Ilya Petrovitch', 'Project Gutenberg', 'Andrey Semyonovitch', 'Hay Market', 'Dmitri Prokofitch', 'Good heavens']


Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer at the end of the file. **We cannot reliably detect where the content begins and ends, and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming raw to be just the content and nothing else:**

In [6]:
response = request.urlopen(url)
raw = response.read().decode('utf8')

print(raw.find("PART I"))
print(raw.rfind("End of Project Gutenberg"))

5336
1157812


In [7]:
raw = raw[raw.find("PART I"):raw.rfind("End of Project Gutenberg")]
print(raw.find("PART I"))

0


The find() and rfind() ("reverse find") methods help us get the right index values to use for slicing the string. We overwrite raw with this slice, so now it begins with "PART I" and goes up to (but not including) the phrase that marks the end of the content.
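
Both methods return -1 when the marker is absent, and slicing with -1 would quietly cut the wrong span. The sketch below repeats the trimming step on a tiny made-up string (not the real book) and checks for that case first.

```python
# Hypothetical miniature of a Project Gutenberg file: header, content, footer.
sample = (
    "The Project Gutenberg EBook of Example\n"
    "Produced by volunteers\n"
    "PART I\n"
    "On an exceptionally hot evening...\n"
    "End of Project Gutenberg's Example\n"
)

start = sample.find("PART I")                    # first occurrence, or -1
end = sample.rfind("End of Project Gutenberg")   # last occurrence, or -1

# Check both markers before slicing, since -1 is a valid slice index.
if start == -1 or end == -1:
    raise ValueError("start or end marker not found")

content = sample[start:end]
print(content)
```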

### 3.1.2 Dealing with HTML

In [8]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
print(html[:60])

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN


To get text out of HTML we will use a Python library called BeautifulSoup, available from http://www.crummy.com/software/BeautifulSoup/:

In [9]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)
print(tokens)

['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in', '200', "years'", 'NEWS', 'SPORT', 'WEATHER', 'WORLD', 'SERVICE', 'A-Z', 'INDEX', 'SEARCH', 'You', 'are', 'in', ':', 'Health', 'News', 'Front', 'Page', 'Africa', 'Americas', 'Asia-Pacific', 'Europe', 'Middle', 'East', 'South', 'Asia', 'UK', 'Business', 'Entertainment', 'Science/Nature', 'Technology', 'Health', 'Medical', 'notes', '--', '--', '--', '--', '--', '--', '-', 'Talking', 'Point', '--', '--', '--', '--', '--', '--', '-', 'Country', 'Profiles', 'In', 'Depth', '--', '--', '--', '--', '--', '--', '-', 'Programmes', '--', '--', '--', '--', '--', '--', '-', 'SERVICES', 'Daily', 'E-mail', 'News', 'Ticker', 'Mobile/PDAs', '--', '--', '--', '--', '--', '--', '-', 'Text', 'Only', 'Feedback', 'Help', 'EDITIONS', 'Change', 'to', 'UK', 'Friday', ',', '27', 'September', ',', '2002', ',', '11:51', 'GMT', '12:51', 'UK', 'Blondes', "'to", 'die', 'out', 'in', '200', "years'", 'Scientists', 'believe', 'the', 'last', 'blond

This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content and select the tokens of interest, and initialize a text as before.
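
When the first word of the article and the first word of the trailing navigation are known, list.index() can replace some of the trial and error. This is only a sketch on a made-up token list. The marker words are assumptions and would need to be chosen by inspecting the real page.

```python
# Toy token list standing in for the tokenized page (hypothetical content).
tokens = ['BBC', 'NEWS', '|', 'Health', 'Blondes', 'to', 'die', 'out',
          'Scientists', 'believe', 'the', 'gene', 'is', 'rare', '.',
          'See', 'also', ':', 'Related', 'stories']

start = tokens.index('Scientists')   # first token of the article body
end = tokens.index('See')            # first token of the trailing navigation

article = tokens[start:end]
print(article)
```

Note that list.index() raises ValueError if the marker is missing, and returns only the first occurrence, so a marker that also appears in the navigation bar would give the wrong index.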

In [10]:
?nltk.Text.concordance    

Signature: nltk.Text.concordance(self, word, width=79, lines=25)
Docstring:
Prints a concordance for ``word`` with the specified context window.
Word matching is not case-sensitive.

:param word: The target word
:type word: str
:param width: The width of each line, in characters (default=80)
:type width: int
:param lines: The number of lines to display (default=25)
:type lines: int

:seealso: ``ConcordanceIndex``
File:      c:\users\nikhil\.conda\envs\dl_nlp\lib\site-packages\nltk\text.py
Type:      function


In [11]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin


### 3.1.3 Processing Search Engine Results

The web can be thought of as a huge corpus of unannotated text. Web search engines provide an efficient means of searching this large quantity of text for relevant linguistic examples. The main advantage of search engines is size: since you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in. Furthermore, you can make use of very specific patterns, which would only match one or two examples in a smaller corpus, but which might match tens of thousands of examples when run on the web. A second advantage of web search engines is that they are very easy to use. Thus, they provide a very convenient tool for quickly checking a theory, to see if it is reasonable.

Unfortunately, search engines have some significant shortcomings. First, the allowable range of search patterns is severely restricted. Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual words or strings of words, sometimes with wildcards. Second, search engines give inconsistent results, and can give widely different figures when used at different times or in different geographical regions. When content has been duplicated across multiple sites, search results may be boosted. Finally, the markup in the result returned by a search engine may change unpredictably, breaking any pattern-based method of locating particular content **(a problem which is ameliorated by the use of search engine APIs)**. **In short: prefer a search engine API to scraping result pages.**

### 3.1.4 Processing RSS Feeds

The blogosphere is an important source of text, in both formal and informal registers. With the help of a Python library called the Universal Feed Parser, available from https://pypi.python.org/pypi/feedparser, we can access the content of a blog, as shown below:

In [12]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")  ## Returns Dictionary
llog['feed']

{'language': 'en-US',
 'title': 'Language Log',
 'title_detail': {'type': 'text/plain',
  'language': 'en-US',
  'base': 'https://languagelog.ldc.upenn.edu/nll/wp-atom.php',
  'value': 'Language Log'},
 'subtitle': '',
 'subtitle_detail': {'type': 'text/plain',
  'language': 'en-US',
  'base': 'https://languagelog.ldc.upenn.edu/nll/wp-atom.php',
  'value': ''},
 'updated': '2020-05-14T10:18:20Z',
 'updated_parsed': time.struct_time(tm_year=2020, tm_mon=5, tm_mday=14, tm_hour=10, tm_min=18, tm_sec=20, tm_wday=3, tm_yday=135, tm_isdst=0),
 'links': [{'rel': 'alternate',
   'type': 'text/html',
   'href': 'https://languagelog.ldc.upenn.edu/nll'},
  {'rel': 'self',
   'type': 'application/atom+xml',
   'href': 'https://languagelog.ldc.upenn.edu/nll/?feed=atom'}],
 'link': 'https://languagelog.ldc.upenn.edu/nll',
 'id': 'https://languagelog.ldc.upenn.edu/nll/?feed=atom',
 'guidislink': False}

In [18]:
# llog.entries[2]

In [14]:
len(llog.entries)
post = llog.entries[2]
print(post.title)
print(post['title'])
print(post.get('title'))

Goonerisms spalore!
Goonerisms spalore!
Goonerisms spalore!


In [15]:
content = post.content[0].value
print(content[:200])

<p>"Prinderella and the Cince Told by Cynthia Hall Domine"</p>
<p><iframe title="Prinderella and the Cince Told by Cynthia Hall Domine" width="500" height="281" src="https://www.youtube.com/embed/NbZd


In [17]:
raw = BeautifulSoup(content, 'html.parser').get_text()
print(word_tokenize(raw))

['``', 'Prinderella', 'and', 'the', 'Cince', 'Told', 'by', 'Cynthia', 'Hall', 'Domine', "''", 'Is', "n't", 'that', 'a', 'shirty', 'dame', '?', 'Selected', 'readings', "''", 'Obscene', 'spoonerism', 'and', 'stupid', 'verbing', 'discussion', 'on', 'Radio', '4', "''", '(', '12/6/10', ')', "''", 'Thematic', 'spoonerisms', '?', "''", '(', '7/14/19', ')', "''", 'Finger', 'spoonerisms', 'and', 'conservation', 'of', 'caps', "''", '(', '6/23/10', ')', "''", 'More', 'possible', 'than', 'they', 'can', 'powerfully', 'imagine', "''", '(', '7/31/09', ')', "''", "'The", 'travesty', 'that', 'is', 'taking', 'fold', "'", "''", '(', '10/28/19', ')', "''", 'Nuckin', 'Futs', "''", '(', '1/18/12', ')']


### 3.1.5 Reading Local Files

In order to read a local file, we use Python's built-in open() function, followed by the read() method. Suppose you have a file document.txt; you can load its contents like this:

In [19]:
# f = open('document.txt')
# raw = f.read()

To check that the file you are trying to open is really in the right directory, use IDLE's Open command in the File menu; this will display a list of all the files in the directory where IDLE is running. An alternative is to examine the current directory from within Python:

In [20]:
import os
os.listdir('.')

['.ipynb_checkpoints',
 'nltk_book_chapter1.ipynb',
 'nltk_book_chapter2.ipynb',
 'nltk_book_chapter3.ipynb',
 'test_install.ipynb',
 'week1_nlp.ipynb']

Another possible problem you might have encountered when accessing a text file is the newline conventions, which differ between operating systems. Older code passes a second argument to the built-in open() function to deal with this: open('document.txt', 'rU'), where 'r' means to open the file for reading (the default) and 'U' stands for "Universal". In Python 3 the 'U' flag is unnecessary, because text mode already translates every newline convention to '\n' by default, and the flag was removed in Python 3.11, where passing it raises an error. A plain open('document.txt') is enough.
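
The sketch below illustrates how Python 3 text mode handles newlines, using a throwaway temporary file rather than document.txt. With the default setting, Windows and Unix line endings both come back as '\n'. Passing newline='' is the standard way to switch the translation off.

```python
import os
import tempfile

# Write a file that mixes Windows (\r\n) and Unix (\n) line endings.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'first line\r\nsecond line\n')
    path = tmp.name

# Default text mode (newline=None) translates every line ending to '\n'.
with open(path, encoding='ascii') as f:
    translated = f.read()

# newline='' turns the translation off and returns the endings untouched.
with open(path, encoding='ascii', newline='') as f:
    untouched = f.read()

os.remove(path)
print(repr(translated))
print(repr(untouched))
```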

Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents of the entire file:

In [21]:
# f.read()

We can also read a file one line at a time using a for loop. Here we use the strip() method to remove the newline character at the end of the input line.

In [22]:
# f = open('document.txt')  # Python 3 text mode handles all newline conventions; 'rU' fails on 3.11+
# for line in f:
#     print(line.strip())
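
A runnable variant of the loop above, as a sketch: it creates its own temporary file (the contents are made up) so that it does not depend on document.txt, and it uses a with statement so the file is closed automatically.

```python
import os
import tempfile

# Create a throwaway file so the sketch does not depend on document.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf8') as tmp:
    tmp.write('Time flies like an arrow.\nFruit flies like a banana.\n')
    path = tmp.name

lines = []
# The with statement closes the file even if an error occurs in the loop.
with open(path, encoding='utf8') as f:
    for line in f:
        lines.append(line.strip())   # strip() removes the trailing newline

os.remove(path)
print(lines)
```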

## 3.2   Strings: Text Processing at the Lowest Level

Skipping this section since it is just talking about string manipulation in Python

## 3.3   Text Processing with Unicode

Our programs will often need to deal with different languages, and different character sets. The concept of "plain text" is a fiction. If you live in the English-speaking world you probably use ASCII, possibly without realizing it. If you live in Europe you might use one of the extended Latin character sets, containing such characters as "ø" for Danish and Norwegian, "ő" for Hungarian, "ñ" for Spanish and Breton, and "ň" for Czech and Slovak. In this section, we will give an overview of how to use Unicode for processing texts that use non-ASCII character sets.

### What is Unicode?

**Unicode supports over a million characters. Each character is assigned a number, called a code point**. In Python, code points are written in the form \uXXXX, where XXXX is the number in 4-digit hexadecimal form.
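
As a small illustration, the built-in ord() and chr() functions convert between a character and its code point. The character ń is used here only as an example; it appears in the Polish sample later in this section.

```python
# ord() gives the code point of a character; chr() goes the other way.
nacute = 'ń'
code_point = ord(nacute)

print(code_point, hex(code_point))
print('\u0144' == nacute)   # the \uXXXX escape names the same character
print(chr(0x144))
```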

Within a program, we can manipulate Unicode strings just like normal strings. **However, when Unicode characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can only support a small subset of Unicode, enough for a single language. Other encodings (such as UTF-8) use multiple bytes and can represent the full range of Unicode characters.**

**Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode — translation into Unicode is called decoding. Conversely, to write out Unicode to a file or a terminal, we first need to translate it into a suitable encoding — this translation out of Unicode is called encoding, and is illustrated in 3.3.**
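
The two directions can be sketched with str.encode() and bytes.decode(). The word used is taken from the Polish sample in the next subsection. The same string yields different bytes under UTF-8 and Latin-2, and cannot be written as ASCII at all.

```python
text = 'Państwowa'

utf8_bytes = text.encode('utf8')      # ń becomes two bytes: 0xC5 0x84
latin2_bytes = text.encode('latin2')  # ń becomes one byte: 0xF1

print(utf8_bytes)
print(latin2_bytes)

# Decoding with the matching codec recovers the original string.
assert utf8_bytes.decode('utf8') == text
assert latin2_bytes.decode('latin2') == text

# ASCII has no code for ń, so encoding fails unless an error policy is given.
try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    print('cannot encode:', err.reason)
```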

### 3.3.1 Extracting encoded text

Let's assume that we have a small text file, and that we know how it is encoded. For example, polish-lat2.txt, as the name suggests, is a snippet of Polish text (from the Polish Wikipedia; see http://pl.wikipedia.org/wiki/Biblioteka_Pruska). This file is encoded as Latin-2, also known as ISO-8859-2. The function nltk.data.find() locates the file for us.

In [24]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [27]:
# Opening without encoding causes issues with some of the characters.
f = open(path)
for line in f:
    line = line.strip()
    print(line)

Pruska Biblioteka Pañstwowa. Jej dawne zbiory znane pod nazw±
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny ¶wiatowej na Dolny ¦l±sk, zosta³y
odnalezione po 1945 r. na terytorium Polski. Trafi³y do Biblioteki
Jagielloñskiej w Krakowie, obejmuj± ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
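
The garbled characters above are what one would expect when Latin-2 bytes are decoded with a Western European codec. The sketch below reproduces the effect on two words from the sample. It assumes the platform default behaved like Latin-1 for these bytes, which is a guess about this particular machine, not something the output confirms.

```python
original = 'Państwowa nazwą'

raw_bytes = original.encode('latin2')   # how the text is stored in the file
garbled = raw_bytes.decode('latin1')    # decoded with the wrong codec
repaired = raw_bytes.decode('latin2')   # decoded with the right codec

print(garbled)
print(repaired)
```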


The Python open() function can read encoded data into Unicode strings, and write out Unicode strings in encoded form. It takes a parameter to specify the encoding of the file being read or written. So let's open our Polish file with the encoding 'latin2' and inspect the contents of the file:

In [26]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


If this does not display correctly on your terminal, or if we want to see the underlying numerical values (or "codepoints") of the characters, then we can convert all non-ASCII characters into their two-digit \xXX and four-digit \uXXXX representations:

In [28]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


## 3.4   Regular Expressions for Detecting Word Patterns

In [29]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

In [32]:
print([w for w in wordlist if re.search('ed$', w)][:10])

['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', 'abridged', 'abscessed', 'absconded']


### 3.4.1 Using Basic Meta-Characters

The . wildcard symbol matches any single character. Suppose we have room in a crossword puzzle for an 8-letter word with j as its third letter and t as its sixth letter. In place of each blank cell we use a period:

In [36]:
print([w for w in wordlist if re.search('^..j..t..$', w)][:10])

['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector']


Finally, the ? symbol specifies that the previous character is optional. Thus `^e-?mail$` will match both email and e-mail. We could count the total number of occurrences of this word (in either spelling) in a text using `sum(1 for w in text if re.search('^e-?mail$', w))`.
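
A minimal sketch of that count on a made-up token list (in the notebook, the list would come from a tokenized text):

```python
import re

# Hypothetical token list standing in for a tokenized text.
tokens = ['Send', 'an', 'email', 'or', 'e-mail', ',', 'not', 'a',
          'mail', 'or', 'emails']

# ^ and $ anchor the pattern, so 'emails' and 'mail' are not counted.
count = sum(1 for w in tokens if re.search(r'^e-?mail$', w))
print(count)
```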

### 3.4.2 Ranges and Closures

The T9 system is used for entering text on mobile phones (see 3.5). Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence? Here we use the regular expression `^[ghi][mno][jlk][def]$`:

In [38]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']
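
The same idea can be generalized to any key sequence by building the pattern from a keypad table. This is a sketch. The helper names t9_pattern and textonyms are invented for illustration, and a small hand-picked word list replaces the NLTK corpus so the example runs on its own.

```python
import re

# Standard phone keypad layout (digits 2-9).
KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def t9_pattern(digits):
    """Build an anchored regex with one character class per key press."""
    return '^' + ''.join('[' + KEYPAD[d] + ']' for d in digits) + '$'

def textonyms(digits, words):
    """Return the words that the key sequence could spell."""
    pattern = re.compile(t9_pattern(digits))
    return [w for w in words if pattern.search(w)]

# Small hand-picked word list so the sketch runs without the NLTK corpus.
sample_words = ['gold', 'golf', 'hold', 'hole', 'home', 'good']
print(t9_pattern('4653'))
print(textonyms('4653', sample_words))
```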

Let's explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters. It should be clear that + simply means "one or more instances of the preceding item", which could be an individual character like m, a set like [fed] or a range like [d-f]. 

In [40]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
print([w for w in chat_words if re.search('^m+i+n+e+$', w)])

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']


In [41]:
print([w for w in chat_words if re.search('^[ha]+$', w)])

['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh', 'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa', 'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', 'hahahahaaa', 'hahahahahaha', 'hahahahahahaha', 'hahahahahahahahahahahahahahahaha', 'hahahhahah', 'hahhahahaha']


Now let's replace `+` with `*`, which means "zero or more instances of the preceding item". The regular expression `^m*i*n*e*$` will match everything that we found using `^m+i+n+e+$`, but also words where some of the letters don't appear at all, e.g. me, min, and mmmmm. Note that the `+` and `*` symbols are sometimes referred to as Kleene closures, or simply closures.
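
The difference can be checked on a short made-up word list, which is used here in place of the chat corpus so that the sketch runs on its own:

```python
import re

words = ['mine', 'miiinnee', 'me', 'min', 'mmmmm', 'mean', '']

# + requires at least one of each letter; * allows any of them to be absent.
plus = [w for w in words if re.search(r'^m+i+n+e+$', w)]
star = [w for w in words if re.search(r'^m*i*n*e*$', w)]

print(plus)
print(star)
```

Because every item in the starred pattern is optional, it also matches the empty string, which is worth keeping in mind when filtering real data.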

**The ^ operator has another function when it appears as the first character inside square brackets. For example `[^aeiouAEIOU]` matches any character other than a vowel**. We can search the NPS Chat Corpus for words that are made up entirely of non-vowel characters using `^[^aeiouAEIOU]+$` to find items like these: :):):), grrr, cyb3r and zzzzzzzz. Notice this includes non-alphabetic characters.
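
A short sketch of the negated character class, run on a made-up list that includes the items mentioned above rather than on the corpus itself:

```python
import re

words = [':):):)', 'grrr', 'cyb3r', 'zzzzzzzz', 'mine', 'Ah', 'sky']

# Inside brackets a leading ^ negates the set: match only non-vowel characters.
no_vowels = [w for w in words if re.search(r'^[^aeiouAEIOU]+$', w)]
print(no_vowels)
```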

## 3.5   Useful Applications of Regular Expressions