# IT Skills for linguists 2
## UAM, Faculty of English, 2BA
### Topic: *Regular expressions*
#### Poznań, 23.01.2023
#### Teacher: mgr inż. Michał Junczyk


# Regular Expressions
- Pattern matching is one of the most important aspects of programming for linguists.
- Linguists often have to assess whether a string:
  - matches a pattern, e.g. starts and ends with a specific character
  - contains a certain sequence of characters
  - contains a specific number of characters, etc.

## Recap - finding chars in strings
- To check whether string **s** contains the sequence **ab**:
    - 'ab' in s -> returns *True* or *False*
- How can we check whether a string contains **'a'** and then **'b'** with something in between?

In [18]:
import sys

def mymatch(s):      #a and then b
	i = 0
	#flag to keep track of
	#whether we see an 'a'
	aFlag = False
	while i < len(s):
		if s[i] == 'a':
			aFlag = True
			break
		i += 1
	#look for 'b' where we left off
	while i < len(s):
		if s[i] == 'b':
			#if we find 'b', return True or
			#False depending on whether we
			#previously saw 'a'
			return aFlag
		i += 1
	#if all that fails, return False
	return False



input:  ab result:  True
input:  a   b result:  True
input:  cc a ccc b cc result:  True
input:  ba result:  False
input:  bab result:  True


In [64]:
test_words = ["ab", "a   b", "caccbcb", "ba", "bab"]
for test_word in test_words:
    print("input: ", test_word, "result: ", mymatch(test_word))


input:  ab result:  True
input:  a   b result:  True
input:  caccbcb result:  True
input:  ba result:  False
input:  bab result:  True


- The solution above doesn't generalize!
  - there is an infinite number of patterns to search for -> writing a function for each is impossible
- Better solution: **Regular Expressions!** 
- Many programming languages implement regular expressions.

## 6.1 Matching

- Matching a string against some pattern is done with the **re** module
- Most typically with the search() function in **re** module.

### Code example 2

In [65]:
import re,sys

def find_patt_re(test_string):
    if re.search('ab',test_string):
        return('a match')
    else:
        return('no match')


In [66]:
print("Searching for 'ab' string.\n")
for test_word in test_words:
    print("input: ", test_word, "result: ", find_patt_re(test_word))


Searching for 'ab' string.

input:  ab result:  a match
input:  a   b result:  no match
input:  caccbcb result:  no match
input:  ba result:  no match
input:  bab result:  a match


### Code example 3

In [67]:
import re,sys

def find_patt_re(test_string):
    if re.search('a.*b',test_string):
        return('a match')
    else:
        return('no match')


In [68]:
print("Searching for 'a' followed by 'b' with other characters potentially in between.\n")
for test_word in test_words:
    print("input: ", test_word, "result: ", find_patt_re(test_word))


Searching for 'a' followed by 'b' with other characters potentially in between.

input:  ab result:  a match
input:  a   b result:  a match
input:  caccbcb result:  a match
input:  ba result:  no match
input:  bab result:  a match


### Examples explanation
- In example 3 we replaced the search pattern **"ab"** with **"a.*b"**
  - first pattern 'ab' - 'a' immediately followed by 'b'.
  - second pattern 'a.\*b' - 'a' followed by 'b' **anywhere** later in the string.
    - symbol '.' - matches any single character (letter, digit, space, punctuation), except a newline by default
    - symbol '\*' - the preceding element is matched zero or more times
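
A quick way to check the two symbols is to try them on a few strings. The sketch below uses only `re.search()`; the test strings and the variable names `dot_hits` and `star_hits` are made up for illustration.

```python
import re

# '.' matches any single character (letter, digit, space, punctuation),
# so 'a.b' needs exactly one character between 'a' and 'b'
dot_hits = [s for s in ["acb", "a b", "a-b", "ab"] if re.search("a.b", s)]
print(dot_hits)

# '*' repeats the preceding element zero or more times,
# so 'a.*b' also accepts 'ab' (nothing in between)
star_hits = [s for s in ["ab", "a---b", "ba"] if re.search("a.*b", s)]
print(star_hits)
```

Note that 'ab' fails the first pattern but passes the second, because only the second allows zero intervening characters.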

### Interactive example


- Go to page: https://regexr.com/
- Insert your regex on the top
- Insert your test data in the middle
- See the Explain tab at the bottom to understand your regex
- Play around!

### Understanding "re" object

- The search() function actually does not return *True* or *False*
- *Match object* is returned if the string matches
- None is returned if it does not.
- In previous examples:
  - *match object* evaluated to *True*
  - *None* object evaluated to *False*

- Let's take **search()** function output and compare it against *False* and against *None*

In [69]:
import re

#do two matches
res1 = re.search('a.*b','hat')
res2 = re.search('a.*b','nab')

#evaluate results of both matches
for s,r in [('hat',res1),('nab',res2)]:
	if r:            #simple if test
		print(s,"matches 'a.*b'")
		print("r is a match object")
	else:
		print(s,"does not match 'a.*b'")
		print("r is None")
	if r == False:   #does match simply fail?
		print('r == False')
	else:
		print('r != False')
	if r == None:    #is match a None object?
		print('r == None')
	else:
		print('r != None')



hat does not match 'a.*b'
r is None
r != False
r == None
nab matches 'a.*b'
r is a match object
r != False
r != None


### Methods of match object
- A match object has several methods:
  - group() - returns the matched part of the string
  - start() - returns the starting index of the matched portion
  - end() - returns the ending index of the matched portion
  - span() - returns both indices as a tuple

In [70]:
import re,sys

def find_patt_re(test_string):
    #do a match
    res = re.search('a.*b',test_string)
    if res:
        #if match succeeds, print matching
        print("match: '",res.group(),"'",sep='')
        return("True")
    else:
        print('no match')
        return("False")



In [71]:
print("Searching for 'a' followed by 'b' with other characters potentially in between.")
print("We also print matched portion of the string.\n")
for test_word in test_words:
    print("input: ", test_word, "result: ", find_patt_re(test_word))


Searching for 'a' followed by 'b' with other characters potentially in between.
We also print matched portion of the string.

match: 'ab'
input:  ab result:  True
match: 'a   b'
input:  a   b result:  True
match: 'accbcb'
input:  caccbcb result:  True
no match
input:  ba result:  False
match: 'ab'
input:  bab result:  True



In [74]:
import re,sys

def find_patt_re(test_string):
    #do a match
    res = re.search('a.*b',test_string)
    if res: #if match succeeds, print everything
        print("match: '",res.group(),"'",sep='')
        print('starting index:',res.start())
        print('ending index:',res.end())
        print('both indices:',res.span())
        return(True)
    else:
        print('no match')
        return(False)

In [75]:
for test_word in test_words:
    print("test word:", test_word)
    print(find_patt_re(test_word))
    print()


test word: ab
match: 'ab'
starting index: 0
ending index: 2
both indices: (0, 2)
True

test word: a   b
match: 'a   b'
starting index: 0
ending index: 5
both indices: (0, 5)
True

test word: caccbcb
match: 'accbcb'
starting index: 1
ending index: 7
both indices: (1, 7)
True

test word: ba
no match
False

test word: bab
match: 'ab'
starting index: 1
ending index: 3
both indices: (1, 3)
True




- Note 
  - the *ending index* is the index of the character **after the final character of the match**
    - for 'ab' the ending index is 2
    - for 'bab' the ending index is 3
  - the match begins at the earliest possible point in the string
    - if there are multiple instances of 'a', the match begins with the first one.
  - the match is as greedy as possible
    - if there are multiple instances of 'b', the match extends to the rightmost one.
    - e.g. in 'caccbcb' the match ends at the second 'b'
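
The relation between the indices and the matched text can be checked with string slicing. This is a minimal sketch reusing the 'caccbcb' example from above; the variable names are chosen for illustration only.

```python
import re

s = "caccbcb"
m = re.search("a.*b", s)

# span() gives (start, end); end is one past the last matched character,
# so slicing the string with these indices reproduces group()
start, end = m.span()
print(start, end)                 # 1 7
print(s[start:end])               # accbcb
print(s[start:end] == m.group())  # True
```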

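### Finding all matches with findall()
- **findall()** is a function of the **re** module (like search()), not a method of the match object
- it returns a list of all non-overlapping matches of the pattern in a string
  - e.g. let's match 'a' against 'abracadabra'
  - result ['a','a','a','a','a']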
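
The 'abracadabra' example can be run directly. A minimal sketch, with `hits` as an illustrative variable name:

```python
import re

# re.findall() returns a list with every non-overlapping match
hits = re.findall("a", "abracadabra")
print(hits)       # ['a', 'a', 'a', 'a', 'a']
print(len(hits))  # 5
```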

## 6.2 Patterns

- Patterns to be matched are defined in terms of **regular expressions (REs)**
- REs have a simple syntax that can be specified recursively, like a phrase-structure grammar


- There are 4 basic building blocks of REs
  - A single-symbol RE, e.g. 'a', '3', 'k', etc.
  - A concatenation or sequence of REs, e.g. 'ab', '3g', 'kk', etc. 
  - The union or disjunction of two REs 
    - indicated with a vertical bar, e.g. 'a|b', 'a|d'
    - a match occurs if either one or the other of the component expressions matches.
    - is recursive
  - Kleene star. Allows an RE to be matched zero or more times. 
    - indicated with a following asterisk
    - is recursive 
    - example: a*, a(b*), (ab)*, (a|b)*

- An RE is a string defined in terms of the recursive operations:
    - concatenation
    - union
    - Kleene star
- An RE itself is a **finite sequence of symbols**, but it defines a potentially **infinite set of strings**.

- With REs it is efficient to check whether a string matches a pattern.
- The tradeoff for this efficiency is that some patterns cannot be specified.
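
The three recursive operations can be tried out one by one. The sketch below is only an illustration: the helper `matches()` is a hypothetical name, and it uses `re.fullmatch()` so that the whole string, not just a part of it, has to fit the pattern.

```python
import re

def matches(pattern, s):
    """Return True if the whole string s matches pattern (illustrative helper)."""
    return re.fullmatch(pattern, s) is not None

# concatenation: 'a' followed by 'b'
concat_ok = matches("ab", "ab")
# union: either 'a' or 'b'
union_ok = matches("a|b", "b")
# Kleene star: zero or more repetitions of 'ab'
star_results = [matches("(ab)*", s) for s in ["", "ab", "abab", "aba"]]
# star over a union: any string made only of a's and b's
mixed_ok = matches("(a|b)*", "abba")

print(concat_ok, union_ok, star_results, mixed_ok)
```

The finite pattern '(ab)*' accepts '', 'ab', 'abab' and so on without limit, which is what is meant by a finite RE defining an infinite set of strings.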

### Example 8

In [85]:
!python3 ./topic\ 6\ -\ code\ examples/re8.py "rabb+"

rabbit
rabbit-hole
rabbit-hole
rabbit!
rabbits.
rabbit-hole--and


- Some special characters in REs may be interpreted by the operating system's shell
- Patterns provided on the command line must therefore be put in quotes

### Other RE notations and special characters

- Symbol: **.**
  - use: matches any single character e.g. alphabetic, numeric, space, tab, etc.
  - examples:
    - *'a.b'* - matches strings where 'b' follows 'a' with a single char intervening
    - '...' - matches any string with at least three characters.
- Symbol: **^**
  - use: matches the beginning of the string


- Symbol: **\$** 
  - use: matches the end of the string. 
  - examples: 
    - '^ab' - matches only strings that begin with that substring.
    - 'ab\$' - matches any string that ends with that substring
    - '^ab\$' - matches exact string 'ab'
- Symbol: **|** (vertical bar)
  - use: indicate disjunction or union.
  - examples
    - *'(a|b|cd|ef)g'* matches *'ag', 'bg', 'cdg'*, or *'efg'*


- Symbol: **[ ]**
  - use: lists the single characters of a union (a character class)
  - example:
    - *'a(b|c|d)e'* is equivalent to *'a[bcd]e'*
  - Additionally, a leading '^' inside the brackets inverts the set of characters
    - example:
      - *'[^abc]'* matches a single character other than 'a', 'b', or 'c'.
- Symbol: **-**
  - use: inside [ ] to give a range of naturally ordered characters
    - example:
      - [a-e] is the same as [abcde]
      - [0-5] is the same as [012345]
  - Multiple ranges can be used in the same square brackets
    - example: [a-zA-Z] - a single uppercase or lowercase letter
  - Ranges can be combined with '^' for inverse character classes
    - example: [^a-z] - a single character that is not a lowercase letter.
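
The anchors and character classes above can be compared on a small word list. This is a sketch under assumptions: the list `words` and the result names are invented for the illustration.

```python
import re

words = ["ab", "cab", "abc", "age", "a3e", "aXe"]

# '^' anchors the pattern at the beginning, '$' at the end of the string
starts_ab = [w for w in words if re.search("^ab", w)]
ends_ab = [w for w in words if re.search("ab$", w)]
exact_ab = [w for w in words if re.search("^ab$", w)]

# [a-z] is any lowercase letter; [^a-z] is any character that is NOT one
lower_mid = [w for w in words if re.search("^a[a-z]e$", w)]
not_lower = [w for w in words if re.search("^a[^a-z]e$", w)]

print(starts_ab, ends_ab, exact_ab)
print(lower_mid, not_lower)
```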

## 6.3 Backreferences

- Backreferences move pattern matching beyond regular expressions.
- Warning - provide a lot of power and convenience, but are expensive to calculate and should be used carefully
- Mechanism - reusing (backreferencing) matched groups of characters
- Symbol - \1, \2 etc.
- Example:
  - '(..)\1' matches 'abab' 'cccc' '0101' etc.
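
The '(..)\1' example can be tested as follows. Two details in this sketch are additions rather than part of the example above: the pattern is written as a raw string (`r"..."`), because in an ordinary Python string `\1` would be read as a control character, and the anchors `^` and `$` are added so that the whole string must consist of the repeated pair. The candidate strings are made up.

```python
import re

candidates = ["abab", "cccc", "0101", "abba", "abcd"]

# (..) captures any two characters as group 1;
# \1 then requires exactly the same two characters again
repeated = [s for s in candidates if re.search(r"^(..)\1$", s)]
print(repeated)
```

Here 'abba' is rejected: group 1 captures 'ab', but the next two characters are 'ba'.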

In [94]:
import sys,re

def find_patt_re(test_string):
    #do a match
    m = re.search('(.*)b(.*)',test_string)
    if m:               #if it succeeds, print...
        #the whole match
        print('all: "',m.group(),'"',sep='')
        #the first part
        print('group 1: "',m.group(1),'"',sep='')
        #the second part
        print('group 2: "',m.group(2),'"',sep='')



In [95]:
print("Searching for 'b' and capturing the text before and after it as groups 1 and 2.\n")
for test_word in test_words:
    find_patt_re(test_word)


Searching for 'b' and capturing the text before and after it as groups 1 and 2.

all: "ab"
group 1: "a"
group 2: ""
all: "a   b"
group 1: "a   "
group 2: ""
all: "caccbcb"
group 1: "caccbc"
group 2: ""
all: "ba"
group 1: ""
group 2: "a"
all: "bab"
group 1: "ba"
group 2: ""


In [97]:
import sys,re

def find_patt_re(test_string):
    #do a match
    m = re.search('(.*)b(.*)',test_string)
    if m:
        print('all: "',m.group(),'"',sep='')
        print('group 1: "',m.group(1),'"',sep='')
        print('group 1 start:',m.start(1))
        print('group 1 end:',m.end(1))
        print('group 2: "',m.group(2),'"',sep='')
        print('group 2 start:',m.start(2))
        print('group 2 end:',m.end(2))



In [98]:
print("Searching for 'b' and capturing the text before and after it as groups 1 and 2.\n")
for test_word in test_words:
    find_patt_re(test_word)


Searching for 'b' and capturing the text before and after it as groups 1 and 2.

all: "ab"
group 1: "a"
group 1 start: 0
group 1 end: 1
group 2: ""
group 2 start: 2
group 2 end: 2
all: "a   b"
group 1: "a   "
group 1 start: 0
group 1 end: 4
group 2: ""
group 2 start: 5
group 2 end: 5
all: "caccbcb"
group 1: "caccbc"
group 1 start: 0
group 1 end: 6
group 2: ""
group 2 start: 7
group 2 end: 7
all: "ba"
group 1: ""
group 1 start: 0
group 1 end: 0
group 2: "a"
group 2 start: 1
group 2 end: 2
all: "bab"
group 1: "ba"
group 1 start: 0
group 1 end: 2
group 2: ""
group 2 start: 3
group 2 end: 3


## 6.4 Program example - Initial Consonant Clusters

- Let's use pattern matching to investigate the distribution of word-initial consonant clusters
- I.e. let's measure how frequent different kinds of initial consonant clusters are.

### Example re11.py - frame code

In [99]:
f = open('alice.txt','r') #open the file
text = f.read()           #read it all in
f.close()                 #close file stream
#print first 100 characters to make sure
print(text[:100])

***This is the Project Gutenberg Etext of Alice in Wonderland***
*This 30th edition should be labele


### Example re12.py - stripping Project Gutenberg header

In [101]:
f = open('alice.txt','r') #open file
text = f.read()           #read it all in
f.close()                 #close file stream
text = text[10841:]       #remove header
#convert to lowercase and split into words
words = text.lower().split()
#print first 10 words
for w in words[:10]:
    print(w)

alice's
adventures
in
wonderland
lewis
carroll
the
millennium
fulcrum
edition


### Using re8.py to check for non-alphanumeric characters

In [111]:
!python3 "topic 6 - code examples/re8.py" '\W'

***This
Wonderland***
*This
alice30.txt
alice30.zip.
***This
8,
1994***
**In
Gutenberg***
header.
disk,
readers.
this.
**Welcome
Texts**
**Etexts
Computers,
1971**
*These
Donations*
Etexts,
below.
donations.
Alice's
March,
[Etext
#11]
[Originally
January,
1991]
[Date
updated:
10,
2004]
*****The
Wonderland*****
******This
alice30.txt
alice30.zip******
NUMBER,
alice31.txt
LETTER,
alice30a.txt
dates,
editing.
that.
note:
announcement.
Midnight,
Time,
month.
suggestion,
so.
[xxxxx10x.xxx]
month.
[tried
failed]
do,
less.
(one
page)
work.
selected,
entered,
proofread,
edited,
analyzed,
written,
etc.
readers.
$4
month:
$2
million.
31,
2001.
[10,000
100,000,000=Trillion]
readers,
10%
2001.
ever!
"Project
Gutenberg/IBC",
("IBC"
College).
(Subscriptions
IBC,
too)
matters,
to:
P.
O.
Champaign,
S.
Hart,
Director:
hart@vmd.cso.uiuc.edu
(internet)
hart@uiucvmd
(bitnet)
(Internet,
Bitnet,
Compuserv

before,
never!
it's
bad,
is!'
slipped,
moment,
splash!
water.
sea,
`and
railway,'
herself.
(Alice
life,
conclusion,
sea,
spades,
houses,
station.)
However,
high.
`I
hadn't
much!'
Alice,
about,
out.
`I
now,
suppose,
tears!
thing,
sure!
However,
to-day.'
off,
was:
hippopotamus,
now,
herself.
`Would
use,
now,'
Alice,
`to
mouse?
out-of-the-way
here,
talk:
rate,
there's
trying.'
began:
`O
Mouse,
pool?
here,
Mouse!'
(Alice
mouse:
before,
brother's
Grammar,
`A
mouse--of
mouse--to
mouse--a
mouse--O
mouse!')
inquisitively,
eyes,
nothing.
`Perhaps
doesn't
English,'
Alice;
`I
it's
mouse,
Conqueror.'
(For,
history,
happened.)
again:
`Ou
chatte?'
lesson-book.
water,
fright.
`Oh,
pardon!'
hastily,
animal's
feelings.
`I
didn't
cats.'
`Not
cats!'
Mouse,
shrill,
voice.
`Would
me?'
`Well,
not,'
tone:
`don't
it.
Dinah:
you'd
her.
thing,'
on,
herself,
pool,
`and
fire,
face--and
nurse--an

is--"Birds
together."'
`Only
isn't
bird,'
remarked.
`Right,
usual,'
Duchess:
`what
things!'
`It's
mineral,
THINK,'
Alice.
`Of
is,'
Duchess,
said;
`there's
mustard-mine
here.
is--"The
mine,
yours."'
`Oh,
know!'
Alice,
remark,
`it's
vegetable.
doesn't
one,
is.'
`I
you,'
Duchess;
`and
is--"Be
be"--or
you'd
simply--"Never
otherwise."'
`I
better,'
politely,
`if
down:
can't
it.'
`That's
chose,'
replied,
tone.
`Pray
don't
that,'
Alice.
`Oh,
don't
trouble!'
Duchess.
`I
I've
yet.'
`A
present!'
Alice.
`I'm
don't
that!'
loud.
`Thinking
again?'
asked,
chin.
`I've
think,'
sharply,
worried.
`Just
right,'
Duchess,
`as
fly;
m--'
here,
Alice's
surprise,
Duchess's
away,
`moral,'
tremble.
up,
them,
folded,
thunderstorm.
`A
day,
Majesty!'
low,
voice.
`Now,
Queen,
spoke;
`either
off,
time!
choice!'
choice,
moment.
`Let's
game,'
Alice;
word,
croquet-ground.
Queen's
absence,
shade:
however,

### Excluding cases where punctuation appears only at the right edge of the word

In [112]:
!python3 "topic 6 - code examples/re8.py" '\W\w'

***This
*This
alice30.txt
alice30.zip.
***This
**In
**Welcome
**Etexts
*These
Alice's
[Etext
#11]
[Originally
[Date
*****The
******This
alice30.txt
alice30.zip******
alice31.txt
alice30a.txt
[xxxxx10x.xxx]
[tried
(one
$4
$2
[10,000
100,000,000=Trillion]
"Project
Gutenberg/IBC",
("IBC"
(Subscriptions
hart@vmd.cso.uiuc.edu
(internet)
hart@uiucvmd
(bitnet)
(Internet,
(or
[Mac
.type]
mrcnext.cso.uiuc.edu
your@login
etext/etext91
[for
[now
etext/etext93]
etext/articles
[get
[to
[to
.set
0INDEX.GUT
**Information
(Three
***START**THE
PRINT!**FOR
ETEXTS**START***
"Small
what's
"Small
*BEFORE!*
GUTENBERG-tm
"Small
(if
(such
GUTENBERG-TM
GUTENBERG-tm
"public
(the
"Project").
(and
Project's
"PROJECT
Project's
"Defects".
"Right
[1]
(and
GUTENBERG-tm
[2]
(if
"AS-IS".
[1]
[2]
[3]
"PROJECT
GUTENBERG-tm"
"Small
[1]
"small
mark-up,
*EITHER*:
*not*
(_)
(as
(or
[2]
"Small
[3]
don't
"Project
(or
(or
*WANT*
DON'T
"Project
"Small
(72600.2026@compuserve.com);
(212-254-5093)
*END*THE
ETEXTS*Ver.04.29.93*END*


`or
I'll
bread-and-butter,
`I'm
`You're
guinea-pigs
(As
guinea-pig,
`I'm
I've
`I've
"There
`If
that's
`I
can't
`I'm
`Then
guinea-pig
`Come,
guinea-pigs!'
`Now
`I'd
`You
`--and
`Call
Duchess's
pepper-box
`Give
`Shan't,'
`Your
cross-examine
`Well,
`What
`Pepper,
`Treacle,'
`Collar
`Behead
`Never
`Call
`Really,
cross-examine
`--for
haven't
`Alice!'
Alice's
`Here!'
jury-box
`Oh,
jury-box,
`The
`until
jury-box,
`not
`I
`What
`Nothing,'
`Nothing
`Nothing
`That's
`UNimportant,
`UNimportant,
`important--unimportant--
unimportant--important--'
`important,'
`unimportant.'
`but
doesn't
note-book,
`Silence!'
`Rule
Forty-two.
`I'M
`You
`Nearly
`Well,
shan't
`besides,
that's
`It's
`Then
note-book
`Consider
`There's
`this
`What's
`I
haven't
`but
to--to
`It
`unless
isn't
`Who
`It
isn't
`in
there's
`It
isn't
it's
`Are
prisoner's
`No,
they're
`and
that's
(The
`He
else's
(The
`Please
`I
didn't

- The strings matching a non-word character (\W) contain:
  - apostrophes
  - hyphens
  - single and double quotation marks
  - brackets, etc.
- To reliably select initial consonant clusters, all punctuation marks should be removed


## Using re13.py to remove punctuation marks

In [113]:
import re

f = open('alice.txt','r')  #read in Alice
text = f.read()
f.close()
text = text[10841:]        #strip header
#convert to lower case
lowertext = text.lower()
#punctuation to convert
punc = r'[\.\?\-!\?\*,"\(\):\`\[\];_/~]'
#convert punctuation to space
newtext = ''
for c in lowertext:
	if re.search(punc,c):
		newtext += ' '
	else:
		newtext += c
words = newtext.split()    #split into words
#print first 50 words
for w in words[:50]:
	print(w)



alice's
adventures
in
wonderland
lewis
carroll
the
millennium
fulcrum
edition
3
0
chapter
i
down
the
rabbit
hole
alice
was
beginning
to
get
very
tired
of
sitting
by
her
sister
on
the
bank
and
of
having
nothing
to
do
once
or
twice
she
had
peeped
into
the
book
her
sister
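
The character-by-character loop above can also be written as a single substitution. The following is a sketch of an alternative, not the code of re13.py: it uses `re.sub()` from the standard **re** module with the same punctuation class, and the sample sentence is made up for the illustration.

```python
import re

# same punctuation class as above, written as a raw string
punc = r'[\.\?\-!\?\*,"\(\):\`\[\];_/~]'

sample = "Down the rabbit-hole: Alice's sister, (very) tired!"

# replace every punctuation character with a space, then split on whitespace
cleaned = re.sub(punc, ' ', sample.lower())
words = cleaned.split()
print(words)
```

The apostrophe is not in the class, so "alice's" survives as one word, just as in the output above.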


## Using re14.py to remove remaining apostrophes

In [115]:
import re

f = open('alice.txt','r')  #read in Alice
text = f.read()
f.close()
text = text[10841:]        #strip header
#convert to lower case
lowertext = text.lower()
#punctuation to convert
punc = r'[\.\?\-!\?\*,"\(\):\`\[\];_/~]'
#convert punctuation to space
newtext = ''
for c in lowertext:
	if re.search(punc,c):
		newtext += ' '
	else:
		newtext += c
words = newtext.split()    #split into words
#delete single quotes
newwords = []
for w in words:
	word = ''
	for c in w:
		if c != "'":
			word += c
	newwords.append(word)
#print first 50 words
for w in newwords[:50]:
	print(w)



alices
adventures
in
wonderland
lewis
carroll
the
millennium
fulcrum
edition
3
0
chapter
i
down
the
rabbit
hole
alice
was
beginning
to
get
very
tired
of
sitting
by
her
sister
on
the
bank
and
of
having
nothing
to
do
once
or
twice
she
had
peeped
into
the
book
her
sister
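
Deleting the apostrophes does not require an inner loop over characters. A minimal sketch of an alternative using the built-in string method `replace()`; the word list is invented for the example.

```python
words = ["alice's", "don't", "'", "sister"]

# str.replace() removes every apostrophe in a word in one call
newwords = [w.replace("'", "") for w in words]
print(newwords)
```

A word that consisted only of an apostrophe becomes an empty string, so empty words still have to be filtered out before counting.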


## Using re15.py to remove remaining numerals


In [116]:
import re

f = open('alice.txt','r')  #read in Alice
text = f.read()
f.close()
text = text[10841:]        #strip header
#convert to lower case
lowertext = text.lower()
#punctuation to convert
punc = r'[\.\?\-!\?\*,"\(\):\`\[\];_/~]'
#convert punctuation to space
newtext = ''
for c in lowertext:
	if re.search(punc,c):
		newtext += ' '
	else:
		newtext += c
#split into words
words = newtext.split()
#eliminate single quotes
newwords = []
for w in words:
	word = ''
	for c in w:
		if c != "'":
			word += c
	newwords.append(word)
#eliminate words with numbers
finalwords = []
for w in newwords:
	if re.search('[0-9]',w):
		continue
	else:
		finalwords.append(w)
#print first 50 words
for w in finalwords[:50]:
	print(w)



alices
adventures
in
wonderland
lewis
carroll
the
millennium
fulcrum
edition
chapter
i
down
the
rabbit
hole
alice
was
beginning
to
get
very
tired
of
sitting
by
her
sister
on
the
bank
and
of
having
nothing
to
do
once
or
twice
she
had
peeped
into
the
book
her
sister
was
reading


## Selecting consonant clusters (example re16.py)
- The identity of certain consonants is lexical
- The same letter can stand for different phonemes depending on the word it occurs in
  - For example, g is [g] in get, but [dʒ] in gem.
- Solution - aggregate by words before doing counts for individual clusters


In [118]:
import re

f = open('alice.txt','r')  #read in Alice
text = f.read()
f.close()
text = text[10841:]        #strip header
#convert to lower case
lowertext = text.lower()
#punctuation to convert
punc = r'[\.\?\-!\?\*,"\(\):\`\[\];_/~]'
#convert punctuation to space
newtext = ''
for c in lowertext:
	if re.search(punc,c):
		newtext += ' '
	else:
		newtext += c
#split into words
words = newtext.split()
#eliminate single quotes
newwords = []
for w in words:
	word = ''
	for c in w:
		if c != "'":
			word += c
	newwords.append(word)
#eliminate words with numbers
finalwords = []
for w in newwords:
	if re.search('[0-9]',w):
		continue
	else:
		finalwords.append(w)
#do counts for words
wordlist = {}
for w in finalwords:
	if len(w) > 0:
		if w in wordlist:
			wordlist[w] += 1
		else:
			wordlist[w] = 1
#sort the words
keys = sorted(wordlist.keys())
#print out the first 50 words
for i in range(50):
	print(keys[i],wordlist[keys[i]])
#print out the number of distinct words
print('Keys:',len(keys))

a 632
abide 1
able 1
about 94
above 3
absence 1
absurd 2
acceptance 1
accident 2
accidentally 1
account 1
accounting 1
accounts 1
accusation 1
accustomed 1
ache 1
across 5
act 1
actually 1
ada 1
added 23
adding 1
addressed 2
addressing 1
adjourn 1
adoption 1
advance 3
advantage 3
adventures 7
advice 2
advisable 2
advise 1
affair 1
affectionately 1
afford 1
afore 1
afraid 12
after 43
afterwards 2
again 83
against 9
age 4
ago 2
agony 1
agree 2
ah 5
ahem 1
air 15
airs 1
alarm 2
Keys: 2616


- The code above makes use of a new built-in function, *sorted()*
- *sorted()* returns a new, sorted list of strings (or numbers)
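
The counting and sorting steps can be tried on a short list. This sketch swaps in `collections.Counter` from the standard library in place of the hand-written dictionary loop, as one possible alternative; the word list is invented.

```python
from collections import Counter

finalwords = ["the", "rabbit", "the", "hole", "alice", "the", "alice"]

# Counter is a dict subclass that maps each word to its number of occurrences
wordlist = Counter(finalwords)

# sorted() on a dictionary returns its keys in alphabetical order
keys = sorted(wordlist)
for k in keys:
    print(k, wordlist[k])
```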

## Let's make our code more modular (example re17.py)

In [119]:
import re

def preprocess():
	f = open('alice.txt','r')  #read in Alice
	text = f.read()
	f.close()
	text = text[10841:]        #strip header
	#convert to lower case
	lowertext = text.lower()
	#punctuation to convert
	punc = r'[\.\?\-!\?\*,"\(\):\`\[\];_/~]'
	#convert punctuation to space
	newtext = ''
	for c in lowertext:
		if re.search(punc,c):
			newtext += ' '
		else:
			newtext += c
	#split into words
	words = newtext.split()
	#eliminate single quotes
	newwords = []
	for w in words:
		word = ''
		for c in w:
			if c != "'":
				word += c
		newwords.append(word)
	#eliminate words with numbers
	finalwords = []
	for w in newwords:
		if re.search('[0-9]',w):
			continue
		else:
			finalwords.append(w)
	#do counts for words
	wordlist = {}
	for w in finalwords:
		if len(w) > 0:
			if w in wordlist:
				wordlist[w] += 1
			else:
				wordlist[w] = 1
	return wordlist



In [122]:
import re17

#we're using the module re17 here
wordlist = re17.preprocess()
#sort the words using built-in sorted() function
keys = sorted(wordlist.keys())
#print out the first 50 words
for i in range(50):
	print(keys[i],wordlist[keys[i]])
#print out the number of distinct words
print('Keys:',len(keys))



a 632
abide 1
able 1
about 94
above 3
absence 1
absurd 2
acceptance 1
accident 2
accidentally 1
account 1
accounting 1
accounts 1
accusation 1
accustomed 1
ache 1
across 5
act 1
actually 1
ada 1
added 23
adding 1
addressed 2
addressing 1
adjourn 1
adoption 1
advance 3
advantage 3
adventures 7
advice 2
advisable 2
advise 1
affair 1
affectionately 1
afford 1
afore 1
afraid 12
after 43
afterwards 2
again 83
against 9
age 4
ago 2
agony 1
agree 2
ah 5
ahem 1
air 15
airs 1
alarm 2
Keys: 2616


## Using re19.py to find word-initial consonant clusters

In [124]:
import re,re17

#get the word counts
wordlist = re17.preprocess()
#just get the words
words = wordlist.keys()
clusters = []        #strip off onsets
for w in words:
	m = re.search('^[^aeiou]*',w)
	if m:
		onset = w[0:m.end()]
		clusters.append(onset)
#eliminate duplicate onsets using set() function
clusters = sorted(set(clusters))
for c in clusters:   #print all onsets
	print("'",c,"'",sep='')
print(len(clusters)) #print number of onsets



''
'b'
'bl'
'br'
'by'
'c'
'ch'
'chr'
'chrys'
'cl'
'cr'
'cry'
'd'
'dr'
'dry'
'f'
'fl'
'fly'
'fr'
'fry'
'g'
'gl'
'gr'
'gryph'
'h'
'hjckrrh'
'hm'
'j'
'k'
'kn'
'l'
'ly'
'm'
'my'
'mys'
'myst'
'n'
'p'
'pl'
'pr'
'q'
'r'
's'
'sc'
'sch'
'scr'
'sh'
'shr'
'shy'
'shyly'
'sk'
'sky'
'sl'
'sm'
'sn'
'sp'
'spl'
'spr'
'sq'
'st'
'str'
'sw'
't'
'th'
'thr'
'tr'
'try'
'tw'
'v'
'w'
'wh'
'why'
'wr'
'x'
'y'
'z'
76


- The program makes use of the *set()* function, which converts a list to a set.
- As a result, duplicates are removed from the list. 
- *sorted()* is then used to sort the onsets alphabetically. 
- 76 hypothetical onset clusters are detected
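
The effect of combining *set()* and *sorted()* can be seen on a tiny example; the list of onsets below is made up.

```python
onsets = ["st", "b", "st", "", "tr", "b"]

# set() drops the duplicates, sorted() returns an alphabetical list
unique_onsets = sorted(set(onsets))
print(unique_onsets)
```

The empty onset (a vowel-initial word) sorts before every other string, which is why '' comes first in the output above.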

- Treating y as a consonant and accepting words without any vowel is a source of noise (e.g. 'shyly', 'hjckrrh')
- Let's require words to contain a vowel and allow y to count as a vowel when it is not word-initial

In [125]:
import re,re17

#get the word counts
wordlist = re17.preprocess()
words = wordlist.keys()  #just get the words
clusters = []            #strip off onsets
for w in words:
	m = re.search('^([^aeiouy]*)[aeiouy]',w)
	if m:
		if m.end(1) == 0 and w[0] == 'y':
			onset = 'y'
		else:
			onset = w[0:m.end(1)]
		clusters.append(onset)
#eliminate duplicate onsets
clusters = sorted(set(clusters))
for c in clusters:       #print all onsets
	print("'",c,"'",sep='')
#print number of onsets
print(len(clusters))



''
'b'
'bl'
'br'
'c'
'ch'
'chr'
'cl'
'cr'
'd'
'dr'
'f'
'fl'
'fr'
'g'
'gl'
'gr'
'h'
'j'
'k'
'kn'
'l'
'm'
'n'
'p'
'pl'
'pr'
'q'
'r'
's'
'sc'
'sch'
'scr'
'sh'
'shr'
'sk'
'sl'
'sm'
'sn'
'sp'
'spl'
'spr'
'sq'
'st'
'str'
'sw'
't'
'th'
'thr'
'tr'
'tw'
'v'
'w'
'wh'
'wr'
'x'
'y'
'z'
58
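
To see how the revised pattern treats individual words, the onset logic can be wrapped in a small function. This is a sketch for experimenting: the function name `onset()` and the test words are assumptions, while the pattern and the treatment of word-initial y follow the cell above.

```python
import re

def onset(word):
    """Return the word-initial consonant cluster, or None if the word has no vowel."""
    m = re.search('^([^aeiouy]*)[aeiouy]', word)
    if m is None:
        return None      # no vowel at all, e.g. 'hjckrrh'
    if m.end(1) == 0 and word[0] == 'y':
        return 'y'       # word-initial y counts as a consonant
    return m.group(1)    # everything before the first vowel (or non-initial y)

for w in ['string', 'shyly', 'yes', 'alice', 'hjckrrh']:
    print(w, repr(onset(w)))
```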


## Last step - counting all clusters (example re21.py)


In [126]:
import re,re17

#get the word counts
wordlist = re17.preprocess()
#just get the words
words = wordlist.keys()
#strip off onsets and do counts
clusters = {}
for w in words:
	m = re.search('^([^aeiouy]*)[aeiouy]',w)
	if m:
		if m.end(1) == 0 and w[0] == 'y':
			ons = 'y'
		else:
			ons = w[0:m.end(1)]
		if ons in clusters:
			clusters[ons] += 1
		else:
			clusters[ons] = 1
#print onset counts
keys = sorted(clusters.keys())
for c in keys:
	print("'",c,"': ",clusters[c],sep='')



'': 414
'b': 110
'bl': 9
'br': 23
'c': 132
'ch': 37
'chr': 2
'cl': 19
'cr': 28
'd': 109
'dr': 24
'f': 124
'fl': 21
'fr': 16
'g': 37
'gl': 7
'gr': 30
'h': 115
'j': 21
'k': 17
'kn': 16
'l': 121
'm': 115
'n': 56
'p': 116
'pl': 21
'pr': 39
'q': 17
'r': 118
's': 151
'sc': 3
'sch': 2
'scr': 6
'sh': 49
'shr': 7
'sk': 4
'sl': 10
'sm': 8
'sn': 9
'sp': 14
'spl': 4
'spr': 3
'sq': 4
'st': 37
'str': 9
'sw': 7
't': 92
'th': 42
'thr': 8
'tr': 30
'tw': 10
'v': 25
'w': 94
'wh': 28
'wr': 9
'x': 2
'y': 23
'z': 2
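
The counts above are listed alphabetically and, because the loop runs over the distinct words, they are counts of word types rather than of running words. To rank onsets by frequency instead, `sorted()` can be given a key. A minimal sketch using a handful of the counts printed above; the variable name `ranked` is illustrative.

```python
# a few onset counts copied from the output above
clusters = {"": 414, "s": 151, "c": 132, "str": 9, "z": 2}

# sort the (onset, count) pairs by count, highest first
ranked = sorted(clusters.items(), key=lambda item: item[1], reverse=True)
for ons, count in ranked:
    print(repr(ons), count)
```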
