In [1]:
import re

import nltk

In [2]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

##### Define a string s = 'colorless'. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.

In [3]:
s = "colorless"
idx = s.rfind("o")
s[: idx + 1] + "u" + s[idx + 1 :]

'colourless'

##### We can use the slice notation to remove morphological endings on words. For example, `'dogs'[:-1]` removes the last character of `dogs`, leaving `dog`. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): `dish-es, run-ning, nation-ality, un-do, pre-heat`.

In [4]:
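suffixed = [("dishes", "es"), ("running", "ning"), ("nationality", "ality")]
prefixed = [("undo", "un"), ("preheat", "pre")]
# Suffixes are sliced off the end of the word
for word, suffix in suffixed:
    print(word[: -len(suffix)])
# Prefixes are sliced off the start of the word
for word, prefix in prefixed:
    print(word[len(prefix) :])

dish
run
nation
do
heat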


##### We saw how we can generate an `IndexError` by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

In [5]:
# Yes: s has 9 characters, so the leftmost valid index is -9 and s[-10] raises IndexError
s[-10]

IndexError: string index out of range

##### We can specify a "step" size for the slice. The following returns every second character within the slice: `monty[6:11:2]`. It also works in the reverse direction: `monty[10:5:-2]`. Try these for yourself, then experiment with different step values.

In [6]:
monty = "Monty Python"
monty[6:11:2]
monty[10:5:-2]

'Pto'

'otP'

##### What happens if you ask the interpreter to evaluate `monty[::-1]`? Explain why this is a reasonable result.

In [7]:
# With start and end omitted and a step of -1, the slice walks from the last character to the first, so the string comes back reversed
monty[::-1]

'nohtyP ytnoM'

##### Describe the class of strings matched by the following regular expressions.

1. `[a-zA-Z]+`
1. `[A-Z][a-z]*`
1. `p[aeiou]{,2}t`
1. `\d+(\.\d+)?`
1. `([^aeiou][aeiou][^aeiou])*`
1. `\w+|[^\w\s]+`

Test your answers using `nltk.re_show()`.

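1. `[a-zA-Z]+`: one or more ASCII letters, in either case
1. `[A-Z][a-z]*`: an uppercase letter followed by zero or more lowercase letters
1. `p[aeiou]{,2}t`: `p`, then at most two vowels, then `t` (for example `pt`, `pat`, `pout`)
1. `\d+(\.\d+)?`: an integer of any length, optionally followed by a decimal point and one or more digits
1. `([^aeiou][aeiou][^aeiou])*`: zero or more repetitions of a three-character sequence with a vowel between two non-vowels (any character other than `a`, `e`, `i`, `o`, `u`); the empty string also matches
1. `\w+|[^\w\s]+`: either a run of word characters or a run of characters that are neither word characters nor whitespace (typically punctuation)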
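
The descriptions can also be checked without NLTK by using `re.fullmatch`, which succeeds only when the whole string belongs to the class. This is a minimal sketch; the sample strings and the helper name `matches` are assumptions chosen for illustration, not part of the exercise.

```python
import re

# (pattern, strings expected to match in full, strings expected not to match)
CASES = [
    (r"[a-zA-Z]+", ["Python", "abc"], ["abc1", ""]),
    (r"[A-Z][a-z]*", ["A", "Monty"], ["monty", "MONTY"]),
    (r"p[aeiou]{,2}t", ["pt", "pat", "pout"], ["peeet", "part"]),
    (r"\d+(\.\d+)?", ["42", "3.14"], ["3.", ".5"]),
    (r"([^aeiou][aeiou][^aeiou])*", ["", "cat", "catdog"], ["ca", "eat"]),
    (r"\w+|[^\w\s]+", ["word", "?!"], ["two words", "a!"]),
]


def matches(pattern, text):
    """Return True if the whole of text belongs to the class described by pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, good, bad in CASES:
    accepted = all(matches(pattern, s) for s in good)
    rejected = not any(matches(pattern, s) for s in bad)
    print(pattern, "behaves as described:", accepted and rejected)
```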

##### Write regular expressions to match the following classes of strings:

1. A single determiner (assume that a, an, and the are the only determiners).
1. An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.

Answer:
1. `\b(?:an?|the)\b`
1. `\d+(?:\s*[+*]\s*\d+)+`
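
One way to sanity-check candidate patterns for these two classes is to run them over a few strings. The sketch below assumes that determiners should match regardless of case and that an arithmetic expression needs at least one operator; both are interpretations of the exercise, and the names `DETERMINER` and `ARITHMETIC` are illustrative.

```python
import re

# A single determiner: "a", "an" or "the" as a whole word
DETERMINER = re.compile(r"\b(?:an?|the)\b", re.IGNORECASE)

# Integers joined by + or *, with optional spaces around the operators
ARITHMETIC = re.compile(r"\d+(?:\s*[+*]\s*\d+)+")

found = DETERMINER.findall("The cat saw an owl and a theme park")
print(found)

print(ARITHMETIC.fullmatch("2*3+8") is not None)   # well-formed expression
print(ARITHMETIC.fullmatch("2*+3") is not None)    # two operators in a row
print(ARITHMETIC.search("total = 2 * 3 + 8 items").group())
```

The word boundaries keep `and` and `theme` from being mistaken for determiners.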

##### Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use `from urllib import request` and then `request.urlopen('http://nltk.org/').read().decode('utf8')` to access the contents of the URL.

In [8]:
from urllib import request


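def url_text(url):
    f = request.urlopen(url).read().decode("utf8")
    # Crude extraction: this collects line-by-line fragments of text that sit between tags
    # rather than stripping the markup; entities such as &#39; stay undecoded and many fragments are empty
    return re.findall(r"<.*>(.+)?</.*>", f)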


url_text("http://nltk.org/")

['Natural Language Toolkit &#8212; NLTK 3.5 documentation',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '.',
 'Natural Language Processing with Python',
 '.)',
 '',
 'Tokenize and tag some text:',
 'nltk',
 '&quot;&quot;&quot;At eight o&#39;clock on Thursday morning',
 'Arthur didn&#39;t feel very good.&quot;&quot;&quot;',
 ')',
 'tokens',
 '[&#39;At&#39;, &#39;eight&#39;, &quot;o&#39;clock&quot;, &#39;on&#39;, &#39;Thursday&#39;, &#39;morning&#39;,',
 '&#39;Arthur&#39;, &#39;did&#39;, &quot;n&#39;t&quot;, &#39;feel&#39;, &#39;very&#39;, &#39;good&#39;, &#39;.&#39;]',
 ')',
 ']',
 '[(&#39;At&#39;, &#39;IN&#39;), (&#39;eight&#39;, &#39;CD&#39;), (&quot;o&#39;clock&quot;, &#39;JJ&#39;), (&#39;on&#39;, &#39;IN&#39;),',
 '(&#39;Thursday&#39;, &#39;NNP&#39;), (&#39;morning&#39;, &#39;NN&#39;)]',
 '',
 'Identify named entities:',
 ')',
 'entities',
 'Tree(&#39;S&#39;, [(&#39;At&#39;, &#39;IN&#39;), (&#39;eight&#39;, &#39;CD&#39;), (&quot;o&#39;clock&quot;, &#39;JJ&#39;),',
 '           (&#39;on&#39;
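
The output above still contains empty fragments and undecoded entities. A substitution-based approach may get closer to "all HTML markup removed". The following is a minimal sketch, not a full HTML parser: the helper `strip_markup` and the sample page are assumptions for illustration, and replacing each tag with a space can split a word that has markup inside it.

```python
import html
import re
from urllib import request


def strip_markup(page):
    """Return the visible text of an HTML string, with whitespace normalized."""
    # Drop script and style elements together with their contents
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", page)
    # Drop comments, then every remaining tag
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    # Decode entities such as &#39; and &amp;
    text = html.unescape(text)
    return " ".join(text.split())


def url_text(url):
    """Fetch a URL and return its contents with the markup removed."""
    return strip_markup(request.urlopen(url).read().decode("utf8"))


sample = (
    "<html><head><title>Tom &amp; Jerry</title><style>p {color: red}</style></head>"
    "<body><p>Arthur didn&#39;t feel <b>very</b> good.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(strip_markup(sample))
```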

##### Save some text into a file corpus.txt. Define a function load(f) that reads from the file named in its sole argument, and returns a string containing the text of the file.

1. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).
1. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.
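
A possible starting point for `load` and for a verbose tokenizer pattern is sketched below. It uses `re.findall` in place of `nltk.regexp_tokenize` so that it runs with the standard library only; the pattern, the sample sentence and the temporary file are assumptions for illustration, and dates and names are not handled.

```python
import os
import re
import tempfile


def load(f):
    """Return the full text of the file named f."""
    with open(f, encoding="utf8") as handle:
        return handle.read()


# One verbose pattern; alternatives are tried left to right at each position.
# Only non-capturing groups are used, so re.findall returns whole tokens.
PATTERN = r"""(?x)
    \$?\d+(?:\.\d+)?%?     # amounts and numbers, with optional currency or percent sign
  | \w+(?:[-']\w+)*        # words, with optional internal hyphens or apostrophes
  | \.\.\.                 # ellipsis
  | [^\w\s]                # any other single punctuation character
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf8") as handle:
        handle.write("The poster-print costs $12.40... Really?")
    tokens = re.findall(PATTERN, load(path))

print(tokens)
```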

##### Rewrite the following loop as a list comprehension:

```python
>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> result = []
>>> for word in sent:
...     word_len = (word, len(word))
...     result.append(word_len)
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
```


In [9]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = [(word, len(word)) for word in sent]
result

[('The', 3),
 ('dog', 3),
 ('gave', 4),
 ('John', 4),
 ('the', 3),
 ('newspaper', 9)]

##### Define a string `raw` containing a sentence of your own choosing. Now, split `raw` on some character other than space, such as 's'.

In [10]:
raw = "This is an example string"
raw.split("s")

['Thi', ' i', ' an example ', 'tring']

##### Write a for loop to print out the characters of a string, one per line.

In [11]:
for c in raw:
    print(c)

T
h
i
s
 
i
s
 
a
n
 
e
x
a
m
p
l
e
 
s
t
r
i
n
g


##### What is the difference between calling `split` on a string with no argument or with ' ' as the argument, e.g. `sent.split()` versus `sent.split(' ')`? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? (In IDLE you will need to use `'\t'` to enter a tab character.)

In [14]:
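sent = "This is  my\tsentence"
# With no argument, any run of whitespace (spaces, tabs, newlines) counts as one separator
sent.split()
# With " ", every single space is a separator: consecutive spaces yield empty strings and tabs are not split on
sent.split(" ")

['This', 'is', 'my', 'sentence']

['This', 'is', '', 'my\tsentence']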

##### Create a variable `words` containing a list of words. Experiment with `words.sort()` and `sorted(words)`. What is the difference?

In [15]:
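words = raw.split()
# sorted() returns a new sorted list and leaves words unchanged
sorted(words)
# sort() reorders words in place and returns None, so it displays nothing
words.sort()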

['This', 'an', 'example', 'is', 'string']

##### Explore the difference between strings and integers by typing the following at a Python prompt: `"3" * 7` and `3 * 7`. Try converting between strings and integers using int("3") and str(3).

In [16]:
"3" * 7
3 * 7

'3333333'

21

##### Use a text editor to create a file called `prog.py` containing the single line `monty = 'Monty Python'`. Next, start up a new session with the Python interpreter, and enter the expression `monty` at the prompt. You will get an error from the interpreter. Now, try the following (note that you have to leave off the .py part of the filename):

```python
>>> from prog import monty
>>> monty
```
This time, Python should return with a value. You can also try `import prog`, in which case Python should be able to evaluate the expression `prog.monty` at the prompt.

In [18]:
from prog import monty

monty

'Monty Python'

##### What happens when the formatting strings %6s and %-6s are used to display strings that are longer than six characters?

In [24]:
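s = 'assdfgghd'
# The 6 is a minimum field width, so a longer string is printed in full, with no padding or truncation
print("%6s" % s)
print("%-6s" % s)

assdfgghd
assdfgghd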


##### Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

In [43]:
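text = nltk.corpus.gutenberg.raw()
tokens = nltk.word_tokenize(text)
# Rough filter: startswith("wh") also admits non-wh-words such as 'whale' and 'white',
# and it skips capitalized forms such as 'What' as well as 'how'
wh_words = [w for w in tokens if w.startswith("wh")]
print(nltk.FreqDist(wh_words).most_common(20))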

[('which', 8434), ('when', 5284), ('what', 4125), ('who', 3604), ('where', 1454), ('whom', 1345), ('while', 1040), ('whole', 933), ('whale', 788), ('whose', 730), ('white', 592), ('why', 560), ('whether', 538), ('whales', 238), ('whatever', 183), ('wherein', 178), ('whatsoever', 142), ('whosoever', 136), ('whence', 133), ('whither', 131)]
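
To answer the question about duplicates, it may help to filter against an explicit list of wh-words and compare the types before and after lowercasing. This sketch makes several assumptions: the word list `WH_WORDS` is a hand-picked subset, the sample sentence stands in for a corpus, and a simple `re.findall` tokenizer replaces `nltk.word_tokenize`.

```python
import re

# Hypothetical, hand-picked list of wh-words (not exhaustive)
WH_WORDS = {"what", "when", "where", "which", "who", "whom", "whose", "why", "how"}

sample = "What did the whale see? Who knows what, when, or why."
tokens = re.findall(r"\w+|[^\w\s]", sample)

# Types as they appear in the text: case variants count separately
case_sensitive = sorted({t for t in tokens if t.lower() in WH_WORDS})
# Types after normalizing case: the duplicates collapse
normalized = sorted({t.lower() for t in tokens if t.lower() in WH_WORDS})

print(case_sensitive)
print(normalized)
```

Here `What` and `what` appear as separate types until the tokens are lowercased, while `whale` is excluded because it is not in the list.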


##### Create a file consisting of words and (made up) frequencies, where each line consists of a word, the space character, and a positive integer, e.g. fuzzy 53. Read the file into a Python list using open(filename).readlines(). Next, break each line into its two fields using split(), and convert the number into an integer using int(). The result should be a list of the form: [['fuzzy', 53], ...].
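
The steps in this exercise can be sketched as follows. The function name `read_frequencies`, the file name and the sample words are assumptions for illustration; the file is written to a temporary directory so the example is self-contained.

```python
import os
import tempfile


def read_frequencies(filename):
    """Read lines of the form 'word count' into a list like [['fuzzy', 53], ...]."""
    with open(filename, encoding="utf8") as handle:
        lines = handle.readlines()
    result = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        word, count = fields
        result.append([word, int(count)])
    return result


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "freq.txt")
    with open(path, "w", encoding="utf8") as handle:
        handle.write("fuzzy 53\nwuzzy 12\n")
    frequencies = read_frequencies(path)

print(frequencies)
```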

##### Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.

##### Write a function unknown() that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using re.findall()) and remove any items from this set that occur in the Words Corpus (nltk.corpus.words). Try to categorize these words manually and discuss your findings.

##### Examine the results of processing the URL http://news.bbc.co.uk/ using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.

##### Are you able to write a regular expression to tokenize text in such a way that the word don't is tokenized into do and n't? Explain why this regular expression won't work: «n't|\w+».
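
The pattern «n't|\w+» fails because the regex engine scans left to right: at the `d` of `don't` the alternative `n't` cannot match, so `\w+` greedily consumes `don`, and by the time the scan reaches the apostrophe the `n` is gone. One possible fix, sketched below, adds a first alternative with a lookahead so that a word stops just before `n't`. This is an assumption about what the exercise is after, not the only solution.

```python
import re

NAIVE = r"n't|\w+"
# First alternative: a word that is immediately followed by n't (lookahead, not consumed)
WITH_LOOKAHEAD = r"\w+(?=n't)|n't|\w+"

naive_tokens = re.findall(NAIVE, "don't")
better_tokens = re.findall(WITH_LOOKAHEAD, "I don't know")

print(naive_tokens)
print(better_tokens)
```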

##### Try to write code to convert text into hAck3r, using regular expressions and substitution, where e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: $ for word-initial s, and 5 for word-internal s.
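
Applying the substitutions one after another risks rewriting earlier output (for example the `e` in `ate` before `ate` is seen). A single-pass substitution with a callback avoids this. The sketch below is one possible approach; the names `SUBSTITUTIONS`, `PATTERN` and `hacker` are illustrative, and word-final `s` is treated as word-internal.

```python
import re

SUBSTITUTIONS = {"ate": "8", "e": "3", "i": "1", "o": "0", "l": "|", "s": "5", ".": "5w33t!"}

# Alternatives are tried in order: word-initial s first, then "ate", then single characters
PATTERN = re.compile(r"(?P<initial_s>\bs)|ate|[eiols.]")


def hacker(text):
    """Convert text to hAck3r in one pass over the lowercased input."""

    def replace(match):
        if match.lastgroup == "initial_s":
            return "$"
        return SUBSTITUTIONS[match.group()]

    return PATTERN.sub(replace, text.lower())


print(hacker("Sam ate less."))
print(hacker("This is easy"))
```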

##### Pig Latin is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay, e.g. string → ingstray, idle → idleay. http://en.wikipedia.org/wiki/Pig_Latin

1. Write a function to convert a word to Pig Latin.
1. Write code that converts text, instead of individual words.
1. Extend it further to preserve capitalization, to keep qu together (i.e. so that quiet becomes ietquay), and to detect when y is used as a consonant (e.g. yellow) vs a vowel (e.g. style).
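
The three parts can be sketched with a single regular expression for the initial consonant cluster. The simplifying assumption here is that `y` counts as a consonant only at the very start of a word; the names `ONSET`, `pig_latin` and `pig_latin_text` are illustrative.

```python
import re

# Optional word-initial y, then any run of "qu" units or non-vowel letters
ONSET = re.compile(r"(y?(?:qu|[^aeiouy])*)(.*)")


def pig_latin(word):
    """Convert one alphabetic word to Pig Latin, keeping an initial capital."""
    onset, rest = ONSET.fullmatch(word.lower()).groups()
    result = rest + onset + "ay"
    return result.capitalize() if word[:1].isupper() else result


def pig_latin_text(text):
    """Convert every alphabetic word in text, leaving punctuation and spacing alone."""
    return re.sub(r"[A-Za-z]+", lambda m: pig_latin(m.group()), text)


for word in ["string", "idle", "quiet", "yellow", "style"]:
    print(word, "->", pig_latin(word))
print(pig_latin_text("The quiet string."))
```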

##### Download some text from a language that has vowel harmony (e.g. Hungarian), extract the vowel sequences of words, and create a vowel bigram table.

##### Python's random module includes a function choice() which randomly chooses an item from a sequence, e.g. choice("aehh ") will produce one of four possible characters, with the letter h being twice as frequent as the others. Write a generator expression that produces a sequence of 500 randomly chosen letters drawn from the string "aehh ", and put this expression inside a call to the ''.join() function, to concatenate them into one long string. You should get a result that looks like uncontrolled sneezing or maniacal laughter: he  haha ee  heheeh eha. Use split() and join() again to normalize the whitespace in this string.
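
A minimal sketch of this exercise follows. The seed is an assumption added only to make runs repeatable; the exact characters produced are not shown because they depend on the random number generator.

```python
import random

random.seed(0)  # assumption: fixed seed so that repeated runs give the same string

# Generator expression inside ''.join(): 500 characters drawn from "aehh "
laughter = "".join(random.choice("aehh ") for _ in range(500))

# split() with no argument discards runs of spaces; join() puts single spaces back
normalized = " ".join(laughter.split())

print(normalized)
```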

##### Consider the numeric expressions in the following sentence from the MedLine Corpus: The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively. Should we say that the numeric expression 4.53 +/- 0.15% is three words? Or should we say that it's a single compound word? Or should we say that it is actually nine words, since it's read "four point five three, plus or minus zero point fifteen percent"? Or should we say that it's not a "real" word at all, since it wouldn't appear in any dictionary? Discuss these different possibilities. Can you think of application domains that motivate at least two of these answers?

##### Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define μw to be the average number of letters per word, and μs to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: 4.71 μw + 0.5 μs - 21.43. Compute the ARI score for various sections of the Brown Corpus, including section f (lore) and j (learned). Make use of the fact that nltk.corpus.brown.words() produces a sequence of words, while nltk.corpus.brown.sents() produces a sequence of sentences.
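
The formula translates directly into a small function. The sketch below takes a word sequence and a sentence sequence, mirroring the hint about `nltk.corpus.brown.words()` and `nltk.corpus.brown.sents()`; the function name `ari` and the toy input are assumptions, and every token (including punctuation tokens, if present) is counted as a word.

```python
def ari(words, sents):
    """Automated Readability Index: 4.71 * mu_w + 0.5 * mu_s - 21.43."""
    mu_w = sum(len(w) for w in words) / len(words)  # average letters per word
    mu_s = len(words) / len(sents)                  # average words per sentence
    return 4.71 * mu_w + 0.5 * mu_s - 21.43


# Toy input standing in for a corpus section
toy_sents = [["The", "dog", "barked"], ["It", "ran"]]
toy_words = [w for sent in toy_sents for w in sent]
score = ari(toy_words, toy_sents)
print(round(score, 3))

# With NLTK and the Brown Corpus installed, a section could be scored like this:
# from nltk.corpus import brown
# ari(brown.words(categories="lore"), brown.sents(categories="lore"))
```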

##### Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.

##### Define the variable saying to contain the list ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']. Process this list using a for loop, and store the length of each word in a new list lengths. Hint: begin by assigning the empty list to lengths, using lengths = []. Then each time through the loop, use append() to add another length value to the list. Now do the same thing using a list comprehension.
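
Both versions can be sketched side by side; the name `lengths_comprehension` is introduced here only to keep the two results apart.

```python
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more',
          'is', 'said', 'than', 'done', '.']

# Version 1: for loop with append()
lengths = []
for word in saying:
    lengths.append(len(word))

# Version 2: list comprehension
lengths_comprehension = [len(word) for word in saying]

print(lengths)
print(lengths == lengths_comprehension)
```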

##### Define a variable silly to contain the string: 'newly formed bland ideas are inexpressible in an infuriating way'. (This happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous nonsense phrase, colorless green ideas sleep furiously according to Wikipedia). Now write code to perform the following tasks:

1. Split silly into a list of strings, one per word, using Python's split() operation, and save this to a variable called bland.
1. Extract the second letter of each word in silly and join them into a string, to get 'eoldrnnnna'.
1. Combine the words in bland back into a single string, using join(). Make sure the words in the resulting string are separated with whitespace.
1. Print the words of silly in alphabetical order, one per line.
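
The four tasks can be sketched as follows; the names `second_letters` and `rejoined` are introduced here for illustration.

```python
silly = 'newly formed bland ideas are inexpressible in an infuriating way'

# 1. One string per word
bland = silly.split()

# 2. Second letter of each word, joined into one string
second_letters = ''.join(word[1] for word in bland)

# 3. Words recombined with single spaces
rejoined = ' '.join(bland)

# 4. Words in alphabetical order, one per line
for word in sorted(bland):
    print(word)

print(second_letters)
```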

##### The index() function can be used to look up items in sequences. For example, 'inexpressible'.index('e') tells us the index of the first position of the letter e.

1. What happens when you look up a substring, e.g. 'inexpressible'.index('re')?
1. Define a variable words containing a list of words. Now use words.index() to look up the position of an individual word.
1. Define a variable silly as in the exercise above. Use the index() function in combination with list slicing to build a list phrase consisting of all the words up to (but not including) in in silly.
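
A short sketch of the three parts follows. On a string, `index()` returns the position where the substring starts; on a list, it returns the position of the first element equal to its argument.

```python
# Substring lookup returns the index where the substring begins
first_e = 'inexpressible'.index('e')
first_re = 'inexpressible'.index('re')

silly = 'newly formed bland ideas are inexpressible in an infuriating way'
words = silly.split()

# On a list, index() compares whole elements, so 'in' is not found inside 'inexpressible'
position = words.index('in')

# Slice up to, but not including, the word 'in'
phrase = words[:position]

print(first_e, first_re, position)
print(phrase)
```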

##### Write code to convert nationality adjectives like Canadian and Australian to their corresponding nouns Canada and Australia (see http://en.wikipedia.org/wiki/List_of_adjectival_forms_of_place_names).

##### Read the LanguageLog post on phrases of the form as best as p can and as best p can, where p is a pronoun. Investigate this phenomenon with the help of a corpus and the findall() method for searching tokenized text described in 3.5. http://itre.cis.upenn.edu/~myl/languagelog/archives/002733.html

##### Study the lolcat version of the book of Genesis, accessible as nltk.corpus.genesis.words('lolcat.txt'), and the rules for converting text into lolspeak at http://www.lolcatbible.com/index.php?title=How_to_speak_lolcat. Define regular expressions to convert English words into corresponding lolspeak words.

##### Read about the re.sub() function for string substitution using regular expressions, using help(re.sub) and by consulting the further readings for this chapter. Use re.sub in writing code to remove HTML tags from an HTML file, and to normalize whitespace.

##### An interesting challenge for tokenization is words that have been split across a line-break. E.g. if long-term is split, then we have the string long-\nterm.

1. Write a regular expression that identifies words that are hyphenated at a line-break. The expression will need to include the \n character.
1. Use re.sub() to remove the \n character from these words.
1. How might you identify words that should not remain hyphenated once the newline is removed, e.g. 'encyclo-\npedia'?
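
One way to approach all three parts is sketched below. The idea for the last part is a dictionary check: if the two halves joined without a hyphen form a known word, drop the hyphen. The vocabulary passed in here is a tiny hypothetical set; in practice a word list such as `nltk.corpus.words` could play that role. The names `LINE_BREAK_HYPHEN` and `join_line_breaks` are illustrative.

```python
import re

# A word, a hyphen, a newline, and the continuation of the word
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\n(\w+)")


def join_line_breaks(text, vocabulary=frozenset()):
    """Remove the newline from words split at a line break.

    The hyphen is dropped as well when the joined form is in vocabulary.
    """

    def repair(match):
        first, second = match.groups()
        joined = first + second
        if joined.lower() in vocabulary:
            return joined
        return first + "-" + second

    return LINE_BREAK_HYPHEN.sub(repair, text)


text = "a long-\nterm plan in the encyclo-\npedia"
print(LINE_BREAK_HYPHEN.findall(text))
print(join_line_breaks(text, {"encyclopedia"}))
```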

##### Read the Wikipedia entry on Soundex. Implement this algorithm in Python.
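
A compact sketch of American Soundex is given below, under the usual rules: keep the first letter, code the remaining consonants as digits, collapse adjacent equal codes (including across `h` and `w`, but not across vowels), and pad or truncate to one letter plus three digits. The names `CODES` and `soundex` are illustrative, and variants of the algorithm differ in details.

```python
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex code of name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0], "")  # the first letter's code also blocks repeats
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this to "", so a repeat after it is coded again
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for name in ["Robert", "Rupert", "Rubin", "Ashcraft", "Tymczak", "Pfister", "Honeyman"]:
    print(name, soundex(name))
```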

##### Obtain raw texts from two or more genres and compute their respective reading difficulty scores as in the earlier exercise on reading difficulty. E.g. compare ABC Rural News and ABC Science News (nltk.corpus.abc). Use Punkt to perform sentence segmentation.

##### Rewrite the following nested loop as a nested list comprehension:

```python
>>> words = ['attribution', 'confabulation', 'elocution',
...          'sequoia', 'tenacious', 'unidirectional']
>>> vsequences = set()
>>> for word in words:
...     vowels = []
...     for char in word:
...         if char in 'aeiou':
...             vowels.append(char)
...     vsequences.add(''.join(vowels))
>>> sorted(vsequences)
['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']
```
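
One possible rewrite nests an inner list comprehension (the vowels of one word) inside an outer one (one vowel sequence per word); wrapping the outer list in `set()` removes duplicates, as the original loop does.

```python
words = ['attribution', 'confabulation', 'elocution',
         'sequoia', 'tenacious', 'unidirectional']

# Inner comprehension: vowels of one word; outer comprehension: one joined sequence per word
vsequences = set([''.join([char for char in word if char in 'aeiou'])
                  for word in words])

print(sorted(vsequences))
```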

##### Use WordNet to create a semantic index for a text collection. Extend the concordance search program in 3.6, indexing each word using the offset of its first synset, e.g. wn.synsets('dog')[0].offset (and optionally the offset of some of its ancestors in the hypernym hierarchy).

##### With the help of a multilingual corpus such as the Universal Declaration of Human Rights Corpus (nltk.corpus.udhr), and NLTK's frequency distribution and rank correlation functionality (nltk.FreqDist, nltk.spearman_correlation), develop a system that guesses the language of a previously unseen text. For simplicity, work with a single character encoding and just a few languages.

##### Write a program that processes a text and discovers cases where a word has been used with a novel sense. For each word, compute the WordNet similarity between all synsets of the word and all synsets of the words in its context. (Note that this is a crude approach; doing it well is a difficult, open research problem.)

##### Read the article on normalization of non-standard words (Sproat et al, 2001), and implement a similar system for text normalization.