## Chapter 3 Processing Raw Text

In [1]:
import nltk,re,pprint

In [2]:
from nltk import word_tokenize

### 3.1 Accessing Text from the Web and from Disk

#### The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

In [3]:
from urllib import request

#### Text number 1776 is King Richard II by William Shakespeare, and we can access it as follows.

In [4]:
url = "https://www.gutenberg.org/files/1776/1776.txt"

In [5]:
response = request.urlopen(url)

In [6]:
raw = response.read().decode('utf8')

In [7]:
type(raw)

str

In [8]:
len(raw)

156953

In [9]:
raw[:204]

'\r\nThis Etext file is presented by Project Gutenberg, in\r\ncooperation with World Library, Inc., from their Library of the\r\nFuture and Shakespeare CDROMS.  Project Gutenberg often releases\r\nEtexts that are '

#### The variable raw contains a string with 156,953 characters. This is the raw content of the book, including many details we are not interested in such as whitespace, line breaks and blank lines

#### For our language processing, we want to break up the string into words and punctuation. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation.

In [10]:
tokens = word_tokenize(raw)

In [11]:
type(tokens)

list

#### The number of words in this text file

In [12]:
len(tokens)

31024

#### The first ten words

In [13]:
tokens[:10]

['This',
 'Etext',
 'file',
 'is',
 'presented',
 'by',
 'Project',
 'Gutenberg',
 ',',
 'in']

#### Dealing with HTML

#### Much of the text on the web is in the form of HTML documents. 

#### You can use a web browser to save a page as text to a local file, then access this as described in the section on files below

#### The first step is the same as before, using urlopen. 

#### For fun we'll pick a BBC News story called Blondes to die out in 200 years, an urban legend passed along by the BBC as established scientific fact:

In [14]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"

In [15]:
html = request.urlopen(url).read().decode('utf8')

In [16]:
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

#### You can type print(html) to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.

In [17]:
print(html)

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>BBC NEWS | Health | Blondes 'to die out in 200 years'</title>
<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">
<meta name="OriginalPublicationDate" content="2002/09/27 11:51:55">
<meta name="UKFS_URL" content="/1/hi/health/2284783.stm">
<meta name="IFS_URL" content="/2/hi/health/2284783.stm">
<meta name="HTTP-EQUIV" content="text/html;charset=iso-8859-1">
<meta name="Headline" content="Blondes 'to die out in 200 years'">
<meta name="Section" content="Health">
<meta name="Description" content="Natural blondes are an endangered species and will die out by 2202, a study suggests.">
<!-- GENMaps-->
<map name="banner">
<area alt="BBC NEWS" coords="7,9,167,32" href="http://news.bbc.co.uk/1/hi.html" shape="RECT">
</map>

<script src="/nol/shared/js/livestats_v1_1.js" langua

#### To get text out of HTML we will use a Python library called BeautifulSoup, available from http://www.crummy.com/software/BeautifulSoup/:

In [18]:
from bs4 import BeautifulSoup

In [19]:
raw = BeautifulSoup(html, 'html.parser').get_text()

In [20]:
tokens = word_tokenize(raw)

In [21]:
tokens

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'UK',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 'GMT',
 '12:51'

#### This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content and select the tokens of interest, and initialize a text as before.

In [22]:
tokens=tokens[111:403]

In [23]:
print(tokens)

['Blondes', "'to", 'die', 'out', 'in', '200', "years'", 'Scientists', 'believe', 'the', 'last', 'blondes', 'will', 'be', 'in', 'Finland', 'The', 'last', 'natural', 'blondes', 'will', 'die', 'out', 'within', '200', 'years', ',', 'scientists', 'believe', '.', 'A', 'study', 'by', 'experts', 'in', 'Germany', 'suggests', 'people', 'with', 'blonde', 'hair', 'are', 'an', 'endangered', 'species', 'and', 'will', 'become', 'extinct', 'by', '2202', '.', 'Researchers', 'predict', 'the', 'last', 'truly', 'natural', 'blonde', 'will', 'be', 'born', 'in', 'Finland', '-', 'the', 'country', 'with', 'the', 'highest', 'proportion', 'of', 'blondes', '.', 'The', 'frequency', 'of', 'blondes', 'may', 'drop', 'but', 'they', 'wo', "n't", 'disappear', 'Prof', 'Jonathan', 'Rees', ',', 'University', 'of', 'Edinburgh', 'But', 'they', 'say', 'too', 'few', 'people', 'now', 'carry', 'the', 'gene', 'for', 'blondes', 'to', 'last', 'beyond', 'the', 'next', 'two', 'centuries', '.', 'The', 'problem', 'is', 'that', 'blonde'

#### If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1., along with the regular list operations like slicing

In [24]:
text = nltk.Text(tokens)

In [25]:
len(text)

292

In [26]:
sorted(set(text))

["'",
 "''",
 "'to",
 ',',
 '-',
 '.',
 '200',
 '2202',
 'A',
 'Ann',
 'BBC',
 'Blondes',
 'Bottle-blondes',
 'But',
 'Dyed',
 'Edinburgh',
 'Finland',
 'Genes',
 'Germany',
 'I',
 'In',
 'Jonathan',
 'News',
 'Online',
 'Prof',
 'Rees',
 'Researchers',
 'Scientists',
 'The',
 'They',
 'University',
 'Widdecombe',
 '``',
 'a',
 'also',
 'an',
 'and',
 'are',
 'as',
 'at',
 'attractive',
 'be',
 'become',
 'believe',
 'beyond',
 'blame',
 'blonde',
 'blondes',
 'born',
 'both',
 'bottle',
 'but',
 'by',
 'carry',
 'case',
 'caused',
 'centuries',
 'chance',
 'child',
 'choose',
 'completely',
 'country',
 'demise',
 'dermatology',
 'die',
 'disadvantage',
 'disappear',
 'do',
 'drop',
 'dyed-blondes',
 'endangered',
 'experts',
 'extinct',
 'family',
 'few',
 'for',
 'frequency',
 'gene',
 'generation',
 'grandparents',
 'hair',
 'have',
 'having',
 'he',
 'highest',
 'if',
 'in',
 'is',
 'it',
 'last',
 'like',
 'may',
 'men',
 'more',
 'must',
 "n't",
 'natural',
 'next',
 'not',
 'no

In [27]:
text.count("blondes")

10

In [28]:
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin


#### Exercise 0: Try to retreive some text from any web page in the form of HTML documents by using the above code (Discussion in Breakout Rooms)

In [29]:
url = "https://docs.python.org/3/library/html.html"

In [30]:
html = request.urlopen(url).read().decode('utf8')

In [31]:
raw = BeautifulSoup(html, 'html.parser').get_text()

In [32]:
tokens = word_tokenize(raw)

In [33]:
tokens

['html',
 '—',
 'HyperText',
 'Markup',
 'Language',
 'support',
 '—',
 'Python',
 '3.10.7',
 'documentation',
 'Previous',
 'topic',
 'Structured',
 'Markup',
 'Processing',
 'Tools',
 'Next',
 'topic',
 'html.parser',
 '—',
 'Simple',
 'HTML',
 'and',
 'XHTML',
 'parser',
 'This',
 'Page',
 'Report',
 'a',
 'Bug',
 'Show',
 'Source',
 'Navigation',
 'index',
 'modules',
 '|',
 'next',
 '|',
 'previous',
 '|',
 'Python',
 '»',
 '3.10.7',
 'Documentation',
 '»',
 'The',
 'Python',
 'Standard',
 'Library',
 '»',
 'Structured',
 'Markup',
 'Processing',
 'Tools',
 '»',
 'html',
 '—',
 'HyperText',
 'Markup',
 'Language',
 'support',
 '|',
 'html',
 '—',
 'HyperText',
 'Markup',
 'Language',
 'support¶',
 'Source',
 'code',
 ':',
 'Lib/html/__init__.py',
 'This',
 'module',
 'defines',
 'utilities',
 'to',
 'manipulate',
 'HTML',
 '.',
 'html.escape',
 '(',
 's',
 ',',
 'quote=True',
 ')',
 '¶',
 'Convert',
 'the',
 'characters',
 '&',
 ',',
 '<',
 'and',
 '>',
 'in',
 'string',
 's',
 't

In [34]:
vocab=sorted(set(w.lower() for w in tokens if w.isalpha()))

In [35]:
vocab

['a',
 'additionally',
 'all',
 'also',
 'an',
 'and',
 'are',
 'as',
 'attribute',
 'both',
 'bsd',
 'bug',
 'by',
 'character',
 'characters',
 'clause',
 'code',
 'contain',
 'convert',
 'copyright',
 'corporation',
 'corresponding',
 'created',
 'defined',
 'defines',
 'definitions',
 'delimited',
 'display',
 'documentation',
 'donate',
 'entity',
 'examples',
 'flag',
 'for',
 'found',
 'foundation',
 'function',
 'gt',
 'helps',
 'history',
 'html',
 'hypertext',
 'if',
 'in',
 'inclusion',
 'index',
 'information',
 'invalid',
 'is',
 'language',
 'last',
 'lenient',
 'library',
 'license',
 'licensed',
 'list',
 'manipulate',
 'markup',
 'might',
 'mode',
 'module',
 'modules',
 'more',
 'named',
 'navigation',
 'need',
 'new',
 'next',
 'numeric',
 'of',
 'on',
 'optional',
 'other',
 'package',
 'page',
 'parser',
 'parsing',
 'please',
 'previous',
 'processing',
 'python',
 'quote',
 'quotes',
 'recipes',
 'references',
 'report',
 'rules',
 's',
 'see',
 'sep',
 'sequence

#### Reading Local Files

#### In order to read a local file, we need to use Python's built-in open() function, followed by the read() method. 

#### Suppose you have a file document.txt, you can load its contents like this

In [36]:
f=open('document.txt')

#### The read() method creates a string with the contents of the entire file:

In [37]:
f.read()

'The Natural Language Toolkit (NLTK) is an open source Python library for Natural Language Processing. A free online book is available. (If you use the library for academic research, please cite the book.)\n\nSteven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. Oâ€™Reilly Media Inc. https://www.nltk.org/book/'

In [38]:
raw=open('document.txt').read()

In [39]:
len(raw)

345

In [40]:
tokens=word_tokenize(raw)

#### The NLP Pipeline (See Slides)

#### When we tokenize a string we produce a list (of words), and this is Python's <list> type

####  Normalizing and sorting lists produces other lists:

In [41]:
tokens

['The',
 'Natural',
 'Language',
 'Toolkit',
 '(',
 'NLTK',
 ')',
 'is',
 'an',
 'open',
 'source',
 'Python',
 'library',
 'for',
 'Natural',
 'Language',
 'Processing',
 '.',
 'A',
 'free',
 'online',
 'book',
 'is',
 'available',
 '.',
 '(',
 'If',
 'you',
 'use',
 'the',
 'library',
 'for',
 'academic',
 'research',
 ',',
 'please',
 'cite',
 'the',
 'book',
 '.',
 ')',
 'Steven',
 'Bird',
 ',',
 'Ewan',
 'Klein',
 ',',
 'and',
 'Edward',
 'Loper',
 '(',
 '2009',
 ')',
 '.',
 'Natural',
 'Language',
 'Processing',
 'with',
 'Python',
 '.',
 'Oâ€™Reilly',
 'Media',
 'Inc.',
 'https',
 ':',
 '//www.nltk.org/book/']

In [42]:
words = [w.lower() for w in tokens]

In [43]:
words

['the',
 'natural',
 'language',
 'toolkit',
 '(',
 'nltk',
 ')',
 'is',
 'an',
 'open',
 'source',
 'python',
 'library',
 'for',
 'natural',
 'language',
 'processing',
 '.',
 'a',
 'free',
 'online',
 'book',
 'is',
 'available',
 '.',
 '(',
 'if',
 'you',
 'use',
 'the',
 'library',
 'for',
 'academic',
 'research',
 ',',
 'please',
 'cite',
 'the',
 'book',
 '.',
 ')',
 'steven',
 'bird',
 ',',
 'ewan',
 'klein',
 ',',
 'and',
 'edward',
 'loper',
 '(',
 '2009',
 ')',
 '.',
 'natural',
 'language',
 'processing',
 'with',
 'python',
 '.',
 'oâ€™reilly',
 'media',
 'inc.',
 'https',
 ':',
 '//www.nltk.org/book/']

In [44]:
vocab=sorted(set(words))

In [45]:
vocab

['(',
 ')',
 ',',
 '.',
 '//www.nltk.org/book/',
 '2009',
 ':',
 'a',
 'academic',
 'an',
 'and',
 'available',
 'bird',
 'book',
 'cite',
 'edward',
 'ewan',
 'for',
 'free',
 'https',
 'if',
 'inc.',
 'is',
 'klein',
 'language',
 'library',
 'loper',
 'media',
 'natural',
 'nltk',
 'online',
 'open',
 'oâ€™reilly',
 'please',
 'processing',
 'python',
 'research',
 'source',
 'steven',
 'the',
 'toolkit',
 'use',
 'with',
 'you']

In [46]:
vocab=sorted(set(w.lower() for w in tokens))

In [47]:
vocab

['(',
 ')',
 ',',
 '.',
 '//www.nltk.org/book/',
 '2009',
 ':',
 'a',
 'academic',
 'an',
 'and',
 'available',
 'bird',
 'book',
 'cite',
 'edward',
 'ewan',
 'for',
 'free',
 'https',
 'if',
 'inc.',
 'is',
 'klein',
 'language',
 'library',
 'loper',
 'media',
 'natural',
 'nltk',
 'online',
 'open',
 'oâ€™reilly',
 'please',
 'processing',
 'python',
 'research',
 'source',
 'steven',
 'the',
 'toolkit',
 'use',
 'with',
 'you']

#### Exercise 1. Define your own document and set up the unique word vacabulary that does not include stopwords, punctuation symbols and numbers. In addition, please get rid of words that only differ in capitalization.

In [48]:
from nltk.corpus import stopwords

In [49]:
stopwords= nltk.corpus.stopwords.words('english')

In [50]:
raw=open('document.txt').read()

In [51]:
tokens=word_tokenize(raw)
tokens

['The',
 'Natural',
 'Language',
 'Toolkit',
 '(',
 'NLTK',
 ')',
 'is',
 'an',
 'open',
 'source',
 'Python',
 'library',
 'for',
 'Natural',
 'Language',
 'Processing',
 '.',
 'A',
 'free',
 'online',
 'book',
 'is',
 'available',
 '.',
 '(',
 'If',
 'you',
 'use',
 'the',
 'library',
 'for',
 'academic',
 'research',
 ',',
 'please',
 'cite',
 'the',
 'book',
 '.',
 ')',
 'Steven',
 'Bird',
 ',',
 'Ewan',
 'Klein',
 ',',
 'and',
 'Edward',
 'Loper',
 '(',
 '2009',
 ')',
 '.',
 'Natural',
 'Language',
 'Processing',
 'with',
 'Python',
 '.',
 'Oâ€™Reilly',
 'Media',
 'Inc.',
 'https',
 ':',
 '//www.nltk.org/book/']

In [52]:
vocab=sorted(set(w.lower() for w in tokens if (w.lower() not in stopwords) and w.isalpha()))

In [53]:
vocab

['academic',
 'available',
 'bird',
 'book',
 'cite',
 'edward',
 'ewan',
 'free',
 'https',
 'klein',
 'language',
 'library',
 'loper',
 'media',
 'natural',
 'nltk',
 'online',
 'open',
 'please',
 'processing',
 'python',
 'research',
 'source',
 'steven',
 'toolkit',
 'use']

#### This is because strings are immutable — you can't change a string once you have created it. However, lists are mutable, and their contents can be modified at any time. As a result, lists support operations that modify the original value rather than producing a new value.

## 3.2   Regular Expressions for Detecting Word Patterns

#### Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). We saw a variety of such "word tests" in 4.2. Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in.

#### To use regular expressions in Python we need to import the re library using: import re. We also need a list of words to search; we'll use the Words Corpus again (4). We will preprocess it to remove any proper names.

In [54]:
import nltk

In [55]:
import re

In [56]:
wordlist=[w for w in nltk.corpus.words.words('en') if w.islower()]

#### Let's find words ending with ed using the regular expression
#### We will use the re.search(p, s) function to check whether the pattern p can be found somewhere inside the string s. We need to specify the characters of interest, and use the dollar sign which has a special behavior in the context of regular expressions in that it matches the end of the word:

In [57]:
[w for w in wordlist if re.search('ed$', w)]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 'accelerated',
 'accepted',
 'accidented',
 'accoladed',
 'accolated',
 'accomplished',
 'accosted',
 'accredited',
 'accursed',
 'accused',
 'accustomed',
 'acetated',
 'acheweed',
 'aciculated',
 'aciliated',
 'acknowledged',
 'acorned',
 'acquainted',
 'acquired',
 'acquisited',
 'acred',
 'aculeated',
 'addebted',
 'added',
 'addicted',
 'addlebrained',
 'addleheaded',
 'addlepated',
 'addorsed',
 'adempted',
 'adfected',
 'adjoined',
 'admired',
 'admitted',
 'adnexed',
 'adopted',
 'adossed',
 'adreamed',
 'adscripted',
 'aduncated',
 'advanced',
 'advised',
 'aeried',
 'aethered',
 'afeared',
 'affected',
 'affectioned',
 'affined',
 'afflicted',
 'affricated',
 'affrighted',
 'affronted',
 'aforenamed',
 'afterfeed',
 'aftershafted',
 'afterthoughted',
 'afterwitted',
 'agazed',
 'aged',
 'agglomerated',
 'aggri

#### The . wildcard symbol matches any single character. Suppose we have room in a crossword puzzle for an 8-letter word with j as its third letter and t as its sixth letter. In place of each blank cell we use a period:

In [58]:
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']

####  Exercise 2: The caret symbol ^ matches the start of a string, just like the $ matches the end. 
#### What results do we get with the above example if we leave out both of these, and search for «..j..t..»?

number limits has been removed and EARLY PATTERN match has been removed
so the result has found with third letter j and sixth letter t

In [59]:
[w for w in wordlist if re.search('..j..t..',w)]

['abjectedness',
 'abjection',
 'abjective',
 'abjectly',
 'abjectness',
 'adjection',
 'adjectional',
 'adjectival',
 'adjectivally',
 'adjective',
 'adjectively',
 'adjectivism',
 'adjectivitis',
 'adjustable',
 'adjustably',
 'adjustage',
 'adjustation',
 'adjuster',
 'adjustive',
 'adjustment',
 'antejentacular',
 'antiprojectivity',
 'bijouterie',
 'coadjustment',
 'cojusticiar',
 'conjective',
 'conjecturable',
 'conjecturably',
 'conjectural',
 'conjecturalist',
 'conjecturality',
 'conjecturally',
 'conjecture',
 'conjecturer',
 'coprojector',
 'counterobjection',
 'dejected',
 'dejectedly',
 'dejectedness',
 'dejectile',
 'dejection',
 'dejectly',
 'dejectory',
 'dejecture',
 'disjection',
 'guanajuatite',
 'inadjustability',
 'inadjustable',
 'injectable',
 'injection',
 'injector',
 'injustice',
 'insubjection',
 'interjection',
 'interjectional',
 'interjectionalize',
 'interjectionally',
 'interjectionary',
 'interjectionize',
 'interjectiveness',
 'interjector',
 'interje

#### J is no longer the third letter and t is no longer the six letter and the end letter could be anything, the length of the word could be anything

In [60]:
#### Finally, the ? symbol specifies that the previous character is optional. 
#### Thus «^e-?mail$» will match both email and e-mail. 
#### We could count the total number of occurrences of this word (in either spelling) in a text using sum(1 for w in text if #### re.search('^e-?mail$', w)).

In [61]:
[w for w in wordlist if re.search('^e-?mail$', w)]

[]

In [62]:
text=['e-mail','email','e mail']

In [63]:
[w for w in text if re.search('^e-?mail$', w)]

['e-mail', 'email']

In [64]:
sum(1 for w in wordlist if re.search('^e-?mail$', w))

0

In [65]:
sum(1 for w in text if re.search('^e-?mail$', w))

2

#### Basic Regular Expression Meta-Characters, Including Wildcards, Ranges and Closures

In [66]:
#Operator	Behavior
#.	        Wildcard, matches any character
#^abc	    Matches some pattern abc at the start of a string
#abc$	    Matches some pattern abc at the end of a string
#[a-z]: one letter from a to z
# * means 0 or more

# + means  1 or more



#[abc]	    Matches one of a set of characters
#[A-Z0-9]	Matches one of a range of characters
#ed|ing|s	Matches one of the specified strings (disjunction)
#*	        Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
#+	        One or more of previous item, e.g. a+, [a-z]+
#?	        Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
#{n}	    Exactly n repeats where n is a non-negative integer
#{n,}	    At least n repeats
#{,n}	    No more than n repeats
#{m,n}	    At least m and no more than n repeats
#{1, 3}  at least 1 repeats and no more than 3 repeats of previous items

#a(b|c)+	Parentheses that indicate the scope of the operators

### Ranges and Closures

#### The T9 system is used for entering text on mobile phones (see 3.5). Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence? Here we use the regular expression «^[ghi][mno][jlk][def]$»:

In [67]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']

In [68]:
#Exercise 3: Look for some "finger-twisters", by searching for words that only use part of the number-pad. For example «^ [ghijklmno] +$», or more concisely, «^[g-o]+$», will match words that only use keys 4, 5, 6 in the center row, and «^[a-fj-o]+$» will match words that use keys 2, 3, 5, 6 in the top-right corner. What do - and + mean?

In [69]:
[w for w in wordlist if re.search("^[g-o]+$",w)]

['g',
 'ghoom',
 'gig',
 'giggling',
 'gigolo',
 'gilim',
 'gill',
 'gilling',
 'gilo',
 'gim',
 'gin',
 'ging',
 'gingili',
 'gink',
 'ginkgo',
 'ginning',
 'gio',
 'glink',
 'glom',
 'glonoin',
 'gloom',
 'glooming',
 'gnomon',
 'go',
 'gog',
 'gogo',
 'goi',
 'going',
 'gol',
 'goli',
 'gon',
 'gong',
 'gonion',
 'goo',
 'googol',
 'gook',
 'gool',
 'goon',
 'h',
 'hi',
 'high',
 'hill',
 'him',
 'hin',
 'hing',
 'hinoki',
 'ho',
 'hog',
 'hoggin',
 'hogling',
 'hoi',
 'hoin',
 'holing',
 'holl',
 'hollin',
 'hollo',
 'hollong',
 'holm',
 'homo',
 'homologon',
 'hong',
 'honk',
 'hook',
 'hoon',
 'i',
 'igloo',
 'ihi',
 'ilk',
 'ill',
 'imi',
 'imino',
 'immi',
 'in',
 'ing',
 'ingoing',
 'inion',
 'ink',
 'inkling',
 'inlook',
 'inn',
 'inning',
 'io',
 'ion',
 'j',
 'jhool',
 'jig',
 'jing',
 'jingling',
 'jingo',
 'jinjili',
 'jink',
 'jinn',
 'jinni',
 'jo',
 'jog',
 'johnin',
 'join',
 'joining',
 'joll',
 'joom',
 'k',
 'kiki',
 'kil',
 'kilhig',
 'kilim',
 'kill',
 'killing',

In [70]:
[w for w in wordlist if re.search("^[g-o]+",w)]

['g',
 'ga',
 'gab',
 'gabardine',
 'gabbard',
 'gabber',
 'gabble',
 'gabblement',
 'gabbler',
 'gabbro',
 'gabbroic',
 'gabbroid',
 'gabbroitic',
 'gabby',
 'gabelle',
 'gabelled',
 'gabelleman',
 'gabeller',
 'gaberdine',
 'gaberlunzie',
 'gabgab',
 'gabi',
 'gabion',
 'gabionade',
 'gabionage',
 'gabioned',
 'gablatores',
 'gable',
 'gableboard',
 'gablelike',
 'gablet',
 'gablewise',
 'gablock',
 'gaby',
 'gad',
 'gadabout',
 'gadbee',
 'gadbush',
 'gadded',
 'gadder',
 'gaddi',
 'gadding',
 'gaddingly',
 'gaddish',
 'gaddishness',
 'gade',
 'gadfly',
 'gadge',
 'gadger',
 'gadget',
 'gadid',
 'gadinine',
 'gadling',
 'gadman',
 'gadoid',
 'gadolinia',
 'gadolinic',
 'gadolinite',
 'gadolinium',
 'gadroon',
 'gadroonage',
 'gadsman',
 'gaduin',
 'gadwall',
 'gaen',
 'gaet',
 'gaff',
 'gaffe',
 'gaffer',
 'gaffle',
 'gaffsman',
 'gag',
 'gagate',
 'gage',
 'gageable',
 'gagee',
 'gageite',
 'gagelike',
 'gager',
 'gagership',
 'gagger',
 'gaggery',
 'gaggle',
 'gaggler',
 'gagman',

#### Let's explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:

In [71]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))

In [72]:
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [73]:
[w for w in chat_words if re.search('^[ha]$', w)]

['a', 'h']

In [74]:
[w for w in chat_words if re.search('^[ha]+$', w)]

['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

### Kleene closures (*)

#### It should be clear that + simply means "one or more instances of the preceding item", which could be an individual character like m, a set like [fed] or a range like [d-f]. 

#### Now let's replace + with *, which means "zero or more instances of the preceding item". 

In [75]:
#### The regular expression «^m*i*n*e*$» will match everything that we found using «^m+i+n+e+$», but also words where some of the letters don't appear at all, e.g. me, min, and mmmmm. 

#### Note that the + and * symbols are sometimes referred to as Kleene closures, or simply closures.

In [76]:
[w for w in chat_words if re.search('^m*i*n*e*$', w)]

['',
 'e',
 'i',
 'in',
 'm',
 'me',
 'meeeeeeeeeeeee',
 'mi',
 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'min',
 'mine',
 'mm',
 'mmm',
 'mmmm',
 'mmmmm',
 'mmmmmm',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee',
 'mmmmmmmmmm',
 'mmmmmmmmmmmmm',
 'mmmmmmmmmmmmmm',
 'n',
 'ne']

#### The ^ operator has another function when it appears as the first character inside square brackets. For example «[^aeiouAEIOU]» matches any character other than a vowel. We can search the NPS Chat Corpus for words that are made up entirely of non-vowel characters using «^[^aeiouAEIOU]+$» to find items like these: :):):), grrr, cyb3r and zzzzzzzz. Notice this includes non-alphabetic characters.

In [77]:
[w for w in chat_words if re.search('^[^aeiouAEIOU]+$', w)]
#not in (a,e,i,o,u,A,E,I,O, U)

['!',
 '!!',
 '!!!',
 '!!!!',
 '!!!!!',
 '!!!!!!',
 '!!!!!!!',
 '!!!!!!!!',
 '!!!!!!!!!',
 '!!!!!!!!!!',
 '!!!!!!!!!!!',
 '!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!',
 '!!!!!!.',
 '!!!!!.',
 '!!!!....',
 '!!!.',
 '!!.',
 '!!...',
 '!.',
 '!...',
 '!=',
 '!?',
 '!??',
 '!???',
 '"',
 '"...',
 '"?',
 '"s',
 '#',
 '###',
 '####',
 '$',
 '$$',
 '$27',
 '&',
 '&^',
 "'",
 "''",
 "'.",
 "'d",
 "'ll",
 "'m",
 "'n'",
 "'s",
 '(',
 '(((',
 '((((',
 '(((((',
 '((((((',
 '(((((((',
 '((((((((',
 '(((((((((',
 '((((((((((',
 '(((((((((((',
 '((((((((((((',
 '(((((((((((((',
 '((((((((((((((',
 '(((((((((((((((',
 '(((((((((((((((((',
 '((((((((((((((((((',
 '((((((((((((((((((((',
 '(((((((((((((((((((((',
 '(((((((((((((((((((((((',
 '((((((((((((((((((((((((',
 '(((((((((((((((((((((((((',
 '((((((((((((((((((((((((((',
 '((((

#### Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating the use of some new symbols: \, {}, (), and |:

#### You probably worked out that a backslash means that the following character is deprived of its special powers and must literally match a specific character in the word. Thus, while . is special, \. only matches a period. 

#### The braced expressions, like {3,5}, specify the number of repeats of the previous item. 

#### The pipe character indicates a choice between the material on its left or its right. 

#### Parentheses indicate the scope of an operator: they can be used together with the pipe (or disjunction) symbol like this: «w(i|e|ai|oo)t», matching wit, wet, wait, and woot. 

In [78]:
wsj = sorted(set(nltk.corpus.treebank.words()))

In [79]:
[w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)]

#0-9 multiple times, decimalnumbers 

['0.0085',
 '0.05',
 '0.1',
 '0.16',
 '0.2',
 '0.25',
 '0.28',
 '0.3',
 '0.4',
 '0.5',
 '0.50',
 '0.54',
 '0.56',
 '0.60',
 '0.7',
 '0.82',
 '0.84',
 '0.9',
 '0.95',
 '0.99',
 '1.01',
 '1.1',
 '1.125',
 '1.14',
 '1.1650',
 '1.17',
 '1.18',
 '1.19',
 '1.2',
 '1.20',
 '1.24',
 '1.25',
 '1.26',
 '1.28',
 '1.35',
 '1.39',
 '1.4',
 '1.457',
 '1.46',
 '1.49',
 '1.5',
 '1.50',
 '1.55',
 '1.56',
 '1.5755',
 '1.5805',
 '1.6',
 '1.61',
 '1.637',
 '1.64',
 '1.65',
 '1.7',
 '1.75',
 '1.76',
 '1.8',
 '1.82',
 '1.8415',
 '1.85',
 '1.8500',
 '1.9',
 '1.916',
 '1.92',
 '10.19',
 '10.2',
 '10.5',
 '107.03',
 '107.9',
 '109.73',
 '11.10',
 '11.5',
 '11.57',
 '11.6',
 '11.72',
 '11.95',
 '112.9',
 '113.2',
 '116.3',
 '116.4',
 '116.7',
 '116.9',
 '118.6',
 '12.09',
 '12.5',
 '12.52',
 '12.68',
 '12.7',
 '12.82',
 '12.97',
 '120.7',
 '1206.26',
 '121.6',
 '126.1',
 '126.15',
 '127.03',
 '129.91',
 '13.1',
 '13.15',
 '13.5',
 '13.50',
 '13.625',
 '13.65',
 '13.73',
 '13.8',
 '13.90',
 '130.6',
 '130.7',
 '

In [80]:
[w for w in wsj if re.search('^[A-Z]+\$$', w)]

['C$', 'US$']

In [81]:
[w for w in wsj if re.search('^[0-9]{4}$', w)]

['1614',
 '1637',
 '1787',
 '1901',
 '1903',
 '1917',
 '1925',
 '1929',
 '1933',
 '1934',
 '1948',
 '1953',
 '1955',
 '1956',
 '1961',
 '1965',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1971',
 '1972',
 '1973',
 '1975',
 '1976',
 '1977',
 '1979',
 '1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2005',
 '2009',
 '2017',
 '2019',
 '2029',
 '3057',
 '8300']

In [82]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
#1 or more numbers - 3 to 5 lower case letters

['10-day',
 '10-lap',
 '10-year',
 '100-share',
 '12-point',
 '12-year',
 '14-hour',
 '15-day',
 '150-point',
 '190-point',
 '20-point',
 '20-stock',
 '21-month',
 '237-seat',
 '240-page',
 '27-year',
 '30-day',
 '30-point',
 '30-share',
 '30-year',
 '300-day',
 '36-day',
 '36-store',
 '42-year',
 '50-state',
 '500-stock',
 '52-week',
 '69-point',
 '84-month',
 '87-store',
 '90-day']

In [83]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

#{2,3}
#at least 2 times but no more than 3 times
#{,6}
#0 - 6times

['black-and-white',
 'bread-and-butter',
 'father-in-law',
 'machine-gun-toting',
 'savings-and-loan']

In [151]:
[w for w in wsj if re.search(''\w+(?:[-']\w+)*|[-.(]+|\S\w*', w)]

SyntaxError: unexpected character after line continuation character (2225206983.py, line 1)

In [84]:
[w for w in wsj if re.search('(ed|ing)$', w)]
#(ed|ing) end with either ed or ing



['62%-owned',
 'Absorbed',
 'According',
 'Adopting',
 'Advanced',
 'Advancing',
 'Alfred',
 'Allied',
 'Annualized',
 'Anything',
 'Arbitrage-related',
 'Arbitraging',
 'Asked',
 'Assuming',
 'Atlanta-based',
 'Baking',
 'Banking',
 'Beginning',
 'Beijing',
 'Being',
 'Bermuda-based',
 'Betting',
 'Boeing',
 'Broadcasting',
 'Bucking',
 'Buying',
 'Calif.-based',
 'Change-ringing',
 'Citing',
 'Concerned',
 'Confronted',
 'Conn.based',
 'Consolidated',
 'Continued',
 'Continuing',
 'Declining',
 'Defending',
 'Depending',
 'Designated',
 'Determining',
 'Developed',
 'Died',
 'During',
 'Encouraged',
 'Encouraging',
 'English-speaking',
 'Estimated',
 'Everything',
 'Excluding',
 'Exxon-owned',
 'Faulding',
 'Fed',
 'Feeding',
 'Filling',
 'Filmed',
 'Financing',
 'Following',
 'Founded',
 'Fracturing',
 'Francisco-based',
 'Fred',
 'Funded',
 'Funding',
 'Generalized',
 'Germany-based',
 'Getting',
 'Guaranteed',
 'Having',
 'Heating',
 'Heightened',
 'Holding',
 'Housing',
 'Illumin

### 3.3   Useful Applications of Regular Expressions

#### r

#### To the Python interpreter, a regular expression is just like any other string

#### If the string contains a backslash followed by particular characters, it will interpret these specially.  For example \b would be interpreted as the backspace character. 

#### In general, when using regular expressions containing backslash, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the re library for processing. 

#### We do this by prefixing the string with the letter r, to indicate that it is a raw string. 

#### For example, the raw string r'\band\b' contains two \b symbols that are interpreted by the re library as matching word boundaries instead of backspace characters. 

#### If you get into the habit of using r'...' for regular expressions — as we will do from now on — you will avoid having to think about these complications.

#### The above examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w).

#### Apart from checking if a regular expression matches a word, we can use regular expressions to extract material from words, or to modify words in specific ways.

#### Extracting Word Pieces

#### The re.findall() ("find all") method finds all (non-overlapping) matches of the given regular expression. Let's find all the vowels in a word, then count them:

In [85]:
#re.search : match with the whole word with that pattern
#re.findall 

In [86]:
word = 'supercalifragilisticexpialidocious'

In [87]:
re.findall(r'[aeiou]', word) 

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']

In [88]:
len(re.findall(r'[aeiou]', word))

16

#### Let's look for all sequences of two or more vowels in some text, and determine their relative frequency:

In [89]:
wsj = sorted(set(nltk.corpus.treebank.words()))

In [90]:
fd = nltk.FreqDist(vs for word in wsj
                      for vs in re.findall(r'[aeiou]{2,}', word))

In [91]:
fd.most_common(12)

[('io', 549),
 ('ea', 476),
 ('ie', 331),
 ('ou', 329),
 ('ai', 261),
 ('ia', 253),
 ('ee', 217),
 ('oo', 174),
 ('ua', 109),
 ('au', 106),
 ('ue', 105),
 ('ui', 95)]

#### Doing More with Word Pieces

#### Once we can use re.findall() to extract material from words, there's interesting things to do with the pieces, like glue them back together or plot them.

#### It is sometimes noted that English text is highly redundant, and it is still easy to read when word-internal vowels are left out. 

#### For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel sequences

#### The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored.

#### This three-way disjunction is processed left-to-right, if one of the three parts matches the word, any later parts of the regular expression are ignored

In [92]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
regexp
#starts with vowel [AEIOUaeiou]
#ends w vowel
#not vowel

'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

#### We use re.findall() to extract all the matching pieces, and ''.join() to join them together 

In [93]:
def compress(word):
    pieces = re.findall(regexp, word)
    print(pieces)
    return ''.join(pieces)

In [94]:
english_udhr = nltk.corpus.udhr.words('English-Latin1')

In [95]:
compress(word='Universal')

['U', 'n', 'v', 'r', 's', 'l']


'Unvrsl'

In [96]:
compress(word="apple")

['a', 'p', 'p', 'l', 'e']


'apple'

In [97]:
compress(word='live')

['l', 'v', 'e']


'lve'

In [98]:
compress(word='viedo')

['v', 'd', 'o']


'vdo'

#### nltk.tokenwrap():Pretty print a list of text tokens, breaking lines on whitespace

In [99]:
print(nltk.tokenwrap(w for w in english_udhr[:75]))

Universal Declaration of Human Rights Preamble Whereas recognition of
the inherent dignity and of the equal and inalienable rights of all
members of the human family is the foundation of freedom , justice and
peace in the world , Whereas disregard and contempt for human rights
have resulted in barbarous acts which have outraged the conscience of
mankind , and the advent of a world in which human beings shall enjoy
freedom of speech and


In [100]:
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

['U', 'n', 'v', 'r', 's', 'l']
['D', 'c', 'l', 'r', 't', 'n']
['o', 'f']
['H', 'm', 'n']
['R', 'g', 'h', 't', 's']
['P', 'r', 'm', 'b', 'l', 'e']
['W', 'h', 'r', 's']
['r', 'c', 'g', 'n', 't', 'n']
['o', 'f']
['t', 'h', 'e']
['i', 'n', 'h', 'r', 'n', 't']
['d', 'g', 'n', 't', 'y']
['a', 'n', 'd']
['o', 'f']
['t', 'h', 'e']
['e', 'q', 'l']
['a', 'n', 'd']
['i', 'n', 'l', 'n', 'b', 'l', 'e']
['r', 'g', 'h', 't', 's']
['o', 'f']
['a', 'l', 'l']
['m', 'm', 'b', 'r', 's']
['o', 'f']
['t', 'h', 'e']
['h', 'm', 'n']
['f', 'm', 'l', 'y']
['i', 's']
['t', 'h', 'e']
['f', 'n', 'd', 't', 'n']
['o', 'f']
['f', 'r', 'd', 'm']
[',']
['j', 's', 't', 'c', 'e']
['a', 'n', 'd']
['p', 'c', 'e']
['i', 'n']
['t', 'h', 'e']
['w', 'r', 'l', 'd']
[',']
['W', 'h', 'r', 's']
['d', 's', 'r', 'g', 'r', 'd']
['a', 'n', 'd']
['c', 'n', 't', 'm', 'p', 't']
['f', 'r']
['h', 'm', 'n']
['r', 'g', 'h', 't', 's']
['h', 'v', 'e']
['r', 's', 'l', 't', 'd']
['i', 'n']
['b', 'r', 'b', 'r', 's']
['a', 'c', 't', 's']
['w', 'h'

### Finding Word Stems

#### There are various ways we can pull out the stem of a word. Here's a simple-minded approach which just strips off anything that looks like a suffix:

In [101]:
def stem(word):
     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
        return word

In [102]:
stem("studying")

'study'

#### Although we will ultimately use NLTK's built-in stemmers, it's interesting to see how we can use regular expressions for this task. Our first step is to build up a disjunction of all the suffixes. We need to enclose it in parentheses in order to limit the scope of the disjunction.

In [103]:
 re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['ing']

#### Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane subtleties of regular expressions. Here's the revised version.

In [104]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['processing']

#### However, we'd actually like to split the word into stem and suffix. So we should just parenthesize both parts of the regular expression:

In [105]:
 re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

[('process', 'ing')]

#### This looks promising, but still has a problem. Let's look at a different word, processes:

In [106]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('processe', 's')]

#### The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates another subtlety: the star operator is "greedy" and the .* part of the expression tries to consume as much of the input as possible. If we use the "non-greedy" version of the star operator, written *?, we get what we want:

In [107]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('process', 'es')]

#### This works even when we allow an empty suffix, by making the content of the second parentheses optional:

In [108]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

[('language', '')]

#### This approach still has many problems (can you spot them?) .Now we will move on to define a function to perform stemming, and apply it to a whole text:

In [109]:
def stem(word):  
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

In [110]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
   is no basis for a system of government.  Supreme executive power derives from
 a mandate from the masses, not from some farcical aquatic ceremony."""

In [111]:
tokens = word_tokenize(raw)

In [112]:
[stem(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut',
 'sword',
 'i',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Supreme',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

#### Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words like distribut and deriv, but these are acceptable stems in some applications.

### 3.4  Normalizing Text

#### In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g. set(w.lower() for w in text). 

#### By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. 

#### Often we want to go further than this, and strip off any affixes, a task known as stemming. 

#### A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. 

In [113]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government.  Supreme executive power derives from
 a mandate from the masses, not from some farcical aquatic ceremony."""

In [114]:
tokens=word_tokenize(raw)

#### Stemmers

#### NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases

#### The Porter and Lancaster stemmers follow their own rules for stripping affixes. 

#### Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), while the Lancaster stemmer does not.

In [115]:
porter = nltk.PorterStemmer()

In [116]:
lancaster = nltk.LancasterStemmer()

In [117]:
[porter.stem(t) for t in tokens]

['denni',
 ':',
 'listen',
 ',',
 'strang',
 'women',
 'lie',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandat',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcic',
 'aquat',
 'ceremoni',
 '.']

In [118]:
[lancaster.stem(t) for t in tokens]

['den',
 ':',
 'list',
 ',',
 'strange',
 'wom',
 'lying',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'bas',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'pow',
 'der',
 'from',
 'a',
 'mand',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'som',
 'farc',
 'aqu',
 'ceremony',
 '.']

#### Lemmatization

#### The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. This additional checking process makes the lemmatizer slower than the above stemmers. Notice that it doesn't handle lying, but it converts women to woman.

In [119]:
wnl = nltk.WordNetLemmatizer()

In [120]:
[wnl.lemmatize(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'woman',
 'lying',
 'in',
 'pond',
 'distributing',
 'sword',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

### 3.5 Regular Expressions for Tokenizing Text

#### Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. 

#### Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers.

#### Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.

#### Simple Approaches to Tokenization

#### The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice's Adventures in Wonderland:

In [121]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

#### We could split this raw text on whitespace using raw.split(). To do the same using a regular expression, it is not enough to match any space characters in the string [1] since this results in tokens that contain a \n newline character; instead we need to match any number of spaces, tabs, or newlines 

In [122]:
re.split(r' ', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone\nthough),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very\nwell',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

In [123]:
 re.split(r'[ \t\n]+', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very',
 'well',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

#### The regular expression «[ \t\n]+» matches one or more space, tab (\t) or newline (\n). Other whitespace characters, such as carriage-return and form-feed should really be included too. Instead, we will use a built-in re abbreviation, \s, which means any whitespace character. The above statement can be rewritten as re.split(r'\s+', raw)

In [124]:
re.split(r'\s+', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very',
 'well',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

#### Remember to prefix regular expressions with the letter r (meaning "raw"), which instructs the Python interpreter to treat the string literally, rather than processing any backslashed characters it contains.

#### Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with a character class \w for word characters, equivalent to [a-zA-Z0-9_]. It also defines the complement of this class \W, i.e. all characters other than letters, digits or underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:

In [125]:
 re.split(r'\W+', raw)

['',
 'When',
 'I',
 'M',
 'a',
 'Duchess',
 'she',
 'said',
 'to',
 'herself',
 'not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 'I',
 'won',
 't',
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 'Maybe',
 'it',
 's',
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 'tempered',
 '']

#### Now that we're matching the words, we're in a position to extend the regular expression to cover a wider range of cases. The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's) but that sequences of two or more punctuation characters are separated.

In [126]:
re.findall(r'\w+|\S\w*', raw)

["'When",
 'I',
 "'M",
 'a',
 'Duchess',
 ',',
 "'",
 'she',
 'said',
 'to',
 'herself',
 ',',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 ')',
 ',',
 "'I",
 'won',
 "'t",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 '.',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 '-',
 '-Maybe',
 'it',
 "'s",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 '-tempered',
 ',',
 "'",
 '.',
 '.',
 '.']

#### Let's generalize the \w+ in the above expression to permit word-internal hyphens and apostrophes: «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it's. (We need to include ?: in this expression for reasons discussed earlier.) We'll also add a pattern to match quote characters so these are kept separate from the text they enclose.# 

In [127]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']


#### Regular Expression Symbols

In [128]:
#Symbol 	Function
#\b 	Word boundary (zero width)
#\d 	Any decimal digit (equivalent to [0-9])
#\D 	Any non-digit character (equivalent to [^0-9])
#\s 	Any whitespace character (equivalent to [ \t\n\r\f\v])
#\S 	Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
#\w 	Any alphanumeric character (equivalent to [a-zA-Z0-9_])
#\W 	Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
#\t 	The tab character
#\n 	The newline character

### 3.6  Segmentation

#### Tokenization is an instance of a more general problem of segmentation.

#### Sentence Segmentation

#### Manipulating texts at the level of individual words often presupposes the ability to divide a text into individual sentences. As we have seen, some corpora already provide access at the sentence level. In the following example, we compute the average number of words per sentence in the Brown Corpus:

In [129]:
len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())

20.250994070456922

#### In other cases, the text is only available as a stream of characters. Before tokenizing the text into words, we need to segment it into sentences. NLTK facilitates this by including the Punkt sentence segmenter (Kiss & Strunk, 2006). Here is an example of its use in segmenting the text of a novel. 

In [130]:
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')

In [131]:
sents = nltk.sent_tokenize(text)

In [132]:
pprint.pprint(sents[79:89])

['"Nonsense!"',
 'said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks and navvies in the\n'
 'railway trains look so sad and tired, so very sad and tired?',
 'I will\ntell you.',
 'It is because they know that the train is going right.',
 'It\n'
 'is because they know that whatever place they have taken a ticket\n'
 'for that place they will reach.',
 'It is because after they have\n'
 'passed Sloane Square they know that the next station must be\n'
 'Victoria, and nothing but Victoria.',
 'Oh, their wild rapture!',
 'oh,\n'
 'their eyes like stars and their souls again in Eden, if the next\n'
 'station were unaccountably Baker Street!"',
 '"It is you who are unpoetical," replied the poet Syme.']


#### Sentence segmentation is difficult because period is used to mark abbreviations, and some periods simultaneously mark an abbreviation and terminate a sentence, as often happens with acronyms like U.S.A.For another approach to sentence segmentation, see Chapter 6.

### 3.7   String Formatting

#### The print command yields Python's attempt to produce the most human-readable form of an object. The second method — naming the variable at a prompt — shows us a string that can be used to recreate this object. It is important to keep in mind that both of these are just strings, displayed for the benefit of you, the user. They do not give us any clue as to the actual internal representation of the object.

#### There are many other useful ways to display an object as a string of characters. This may be for the benefit of a human reader, or because we want to export our data to a particular file format for use in an external program.

#### Formatted output typically contains a combination of variables and pre-specified strings, e.g. given a frequency distribution fdist we could do:

In [133]:
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])

In [134]:
for word in sorted(fdist):
    print(word, '->', fdist[word], end='; ')

cat -> 3; dog -> 4; snake -> 1; 

#### Print statements that contain alternating variables and constants can be difficult to read and maintain. Another solution is to use string formatting.

In [135]:
for word in sorted(fdist):
    print('{}->{};'.format(word, fdist[word]), end=' ')

cat->3; dog->4; snake->1; 

#### To understand what is going on here, let's test out the format string on its own. (By now this will be your usual method of exploring new syntax.)

#### The curly brackets '{}' mark the presence of a replacement field: this acts as a placeholder for the string values of objects that are passed to the str.format() method. We can embed occurrences of '{}' inside a string, then replace them with strings by calling format() with appropriate arguments. A string containing replacement fields is called a format string.

In [136]:
'{}->{};'.format ('cat', 3)

'cat->3;'

In [137]:
'{}'.format(3)

'3'

In [138]:
'I want a {} right now'.format('coffee')

'I want a coffee right now'

#### We can have any number of placeholders, but the str.format method must be called with exactly the same number of arguments.

In [139]:
'{} wants a {} {}'.format ('Lee', 'sandwich', 'for lunch')

'Lee wants a sandwich for lunch'

In [140]:
'{} wants a {} {}'.format ('sandwich', 'for lunch')

IndexError: Replacement index 2 out of range for positional args tuple

#### The field name in a format string can start with a number, which refers to a positional argument of format(). Something like 'from {} to {}' is equivalent to 'from {0} to {1}', but we can use the numbers to get non-default orders:

In [None]:
'from {1} to {0}'.format('A', 'B')

In [None]:
# We can also provide the values for the placeholders indirectly. Here's an example using a for loop:

In [1]:
template = 'Lee wants a {} right now'

In [2]:
menu = ['sandwich', 'spam fritter', 'pancake']

In [3]:
 for snack in menu:
    print(template.format(snack))

Lee wants a sandwich right now
Lee wants a spam fritter right now
Lee wants a pancake right now


#### lining Things Up

#### It is right-justified by default for numbers, but we can precede the width specifier with a '<' alignment option to make numbers left-justified

In [None]:
'{:6}'.format(41)

In [None]:
'{:<6}'.format(41)

#### Strings are left-justified by default, but can be right-justified with the '>' alignment option.

In [None]:
'{:6}'.format('dog')

In [None]:
'{:>6}'.format('dog')

#### Other control characters can be used to specify the sign and precision of floating point numbers; for example {:.4f} indicates that four digits should be displayed after the decimal point for a floating point number.

In [None]:
 import math

In [None]:
'{:.4f}'.format(math.pi)

#### The string formatting is smart enough to know that if you include a '%' in your format specification, then you want to represent the value as a percentage; there's no need to multiply by 100.

In [None]:
count, total = 3205, 9375

In [None]:
"accuracy for {} words: {:.4%}".format(total, count / total)

#### a format string '{:{width}}' and bound a value to the width parameter in format() allows us to specify the width of a field using a variable.

In [None]:
'{:{width}}'.format('Monty Python', width=15)

#### We could use this to automatically customize the column to be just wide enough to accommodate all the words, using width = max(len(w) for w in words).

#### Writing Results to a File

#### It is often useful to write output to files as well. The following code opens a file output.txt for writing, and saves the program output to the file.

In [None]:
output_file = open('output.txt', 'w')

In [None]:
words = set(nltk.corpus.genesis.words('english-kjv.txt'))

In [None]:
for word in sorted(words):
    print(word, file=output_file)