# Regular Expressions Exercises

Acknowledgement: This notebook is provided by Intel AI Developer Program

#### Introduction

Let's download the complete works of Sherlock Holmes and do some detective work using REGEX.

For many of these exercises, it helps to compile a regular expression p, call p.findall(), and then apply some simple additional Python processing to answer the question.


In [24]:
import re

In [25]:
# on Linux and macOS, uncomment and run the following line:
# !wget https://sherlock-holm.es/stories/plain-text/cnus.txt
# on Windows, use your browser to download the txt file; the next cell expects it at ./data/cnus.txt

In [26]:
text = ''
with open('./data/cnus.txt','r') as f:
    text = " ".join([l.strip() for l in f.readlines()])

In [27]:
text[2611:3000]

"On landing at Bombay, I learned that my corps had advanced through the passes, and was already deep in the enemy's country. I followed, however, with many other officers who were in the same situation as myself, and succeeded in reaching Candahar in safety, where I found my regiment, and at once entered upon my new duties.  The campaign brought honours and promotion to many, but for me "

## Question 1

One of Sherlock Holmes' famous catchphrases is the word 'undoubtedly'.

* How many times is the word 'undoubtedly' used?

In [33]:
# the word undoubtedly only appears 43 times
p = re.compile('undoubtedly')
print(len(p.findall(text)))

# alternative

print(len(re.findall('undoubtedly', text)))

43
43
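
The literal pattern above is case-sensitive and matches anywhere in the text, so a sentence-initial 'Undoubtedly' would not be counted. A minimal sketch of a whole-word, case-insensitive count follows; the sample sentence is made up for illustration and is not taken from the corpus.

```python
import re

# Hypothetical sample sentence (not from cnus.txt).
sample = "Undoubtedly so. It was undoubtedly the same man, undoubtedly."

# Case-sensitive literal match, as in the cell above.
exact = re.findall(r"undoubtedly", sample)

# \b anchors the match at word boundaries; re.IGNORECASE also accepts 'Undoubtedly'.
whole_word = re.findall(r"\bundoubtedly\b", sample, flags=re.IGNORECASE)

print(len(exact))       # 2
print(len(whole_word))  # 3
```

Whether the capitalised form should count is a judgment call for the exercise; the point is only that the two patterns can give different totals.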


## Question 2

Characters are announced very deliberately in the language of the setting, Victorian England. We can use this later to find characters in the book. But for now, let's practice on a character we know.

How often is Sherlock Holmes referred to as 'Mr. Sherlock Holmes' vs. 'Sherlock Holmes' vs. 'Mr. Holmes' vs. 'Sherlock'?

In [34]:
# let's start with the very simplest thing:
# just find the occurrences of Sherlock Holmes
p = re.compile('Sherlock Holmes')
len(p.findall(text))

361

In [35]:
# one easy way to solve this
# is to just use the 'or' operator | with all the patterns we want to match
p = re.compile(r'Mr\. Sherlock Holmes|Sherlock Holmes|Mr\. Holmes|Sherlock|Holmes')
results = p.findall(text)
counts = {}
for r in results:
    if r in counts.keys():
        counts[r] += 1
    else:
        counts[r] = 1
        
counts

{'Sherlock Holmes': 268,
 'Mr. Sherlock Holmes': 93,
 'Holmes': 1646,
 'Mr. Holmes': 496,
 'Sherlock': 22}
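
The order of the alternatives matters here. At any given position, Python's re tries the alternatives left to right and takes the first one that matches, so the longer names have to come before their shorter prefixes. A small sketch on a made-up sentence (not from the corpus):

```python
import re

# Hypothetical sample text.
sample = "Sherlock Holmes and Mr. Holmes"

# Longest alternatives first, as in the cell above.
longest_first = re.findall(
    r"Mr\. Sherlock Holmes|Sherlock Holmes|Mr\. Holmes|Sherlock|Holmes", sample
)

# Same alternatives, shortest first: 'Sherlock' wins before 'Sherlock Holmes' is tried.
shortest_first = re.findall(
    r"Holmes|Sherlock|Mr\. Holmes|Sherlock Holmes|Mr\. Sherlock Holmes", sample
)

print(longest_first)   # ['Sherlock Holmes', 'Mr. Holmes']
print(shortest_first)  # ['Sherlock', 'Holmes', 'Mr. Holmes']
```

'Mr. Holmes' survives in both cases because no shorter alternative can start at the 'M'; only alternatives that begin at the same position compete.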

How proper!

One thing to remember with REGEX is that there is rarely a single correct way to do things.

Another strategy we might have tried is to use optional matching groups.

In [36]:
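p = re.compile(r'((Mr\.\s)?(Sherlock\s)?(Holmes)?)')
results = p.findall(text)
counts = {}  # this is a dictionary
for r in results:
    if r[0]:  # r[0] is the full match; skip the empty matches
        # caution: the lookup uses the raw match but the key is stripped, so a
        # match that ends in whitespace (e.g. 'Sherlock ') is reset to 1 each time
        if r[0] in counts.keys():
            counts[r[0].strip()] += 1
        else:
            counts[r[0].strip()] = 1
        
counts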

{'Sherlock Holmes': 268,
 'Mr. Sherlock Holmes': 93,
 'Holmes': 1646,
 'Mr.': 1,
 'Mr. Holmes': 496,
 'Mr. Sherlock': 1,
 'Sherlock': 1}
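
The three entries with a count of 1 in the output above come from the tally loop rather than from the pattern: the membership test looks up the unstripped match, while the key is stored stripped, so any match ending in whitespace never accumulates. A hedged sketch of one way to tally consistently, using collections.Counter and non-capturing groups on a made-up sample (not the corpus):

```python
import re
from collections import Counter

# Hypothetical sample text.
sample = (
    "Mr. Sherlock Holmes nodded. Sherlock said nothing. "
    "Sherlock smiled at Mr. Holmes."
)

# Non-capturing groups make findall return the full match as a plain string.
p = re.compile(r"(?:Mr\.\s)?(?:Sherlock\s)?(?:Holmes)?")

# Strip once, drop the empty matches, and count on the stripped key only.
counts = Counter(m.strip() for m in p.findall(sample) if m.strip())

print(counts["Sherlock"])             # 2
print(counts["Mr. Holmes"])           # 1
print(counts["Mr. Sherlock Holmes"])  # 1
```

Note that this pattern still requires whitespace after 'Sherlock', so a 'Sherlock' followed directly by punctuation is not matched at all; the totals need not agree with the alternation approach.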

## Question 3

* Find all the doctors in the collection
    
* Make a list of all the characters that appear in the collection (hint: Mrs., Mr., Miss, Dr., etc.)

In [37]:
p = re.compile(r'[MD][irs][s\.]?[s\.]? [A-Z]\w*')
set(p.findall(text))
    

{'Dr. Ainstree',
 'Dr. Armstrong',
 'Dr. Barnicot',
 'Dr. Becher',
 'Dr. Ferrier',
 'Dr. Fordham',
 'Dr. Grimesby',
 'Dr. Horsom',
 'Dr. Huxtable',
 'Dr. James',
 'Dr. Leon',
 'Dr. Leslie',
 'Dr. Moore',
 'Dr. Mortimer',
 'Dr. Percy',
 'Dr. Richards',
 'Dr. Roylott',
 'Dr. Shlessinger',
 'Dr. Somerton',
 'Dr. Sterndale',
 'Dr. Thorneycroft',
 'Dr. Trevelyan',
 'Dr. Watson',
 'Dr. Willows',
 'Dr. Wood',
 'Miss Adler',
 'Miss Alice',
 'Miss Brenda',
 'Miss Burnet',
 'Miss Cushing',
 'Miss Dobney',
 'Miss Doran',
 'Miss Edith',
 'Miss Ettie',
 'Miss Flora',
 'Miss Fraser',
 'Miss Harrison',
 'Miss Hatty',
 'Miss Helen',
 'Miss Holder',
 'Miss Honoria',
 'Miss Hunter',
 'Miss Irene',
 'Miss M',
 'Miss Marie',
 'Miss Mary',
 'Miss Miles',
 'Miss Morrison',
 'Miss Morstan',
 'Miss Nancy',
 'Miss Rachel',
 'Miss Roylott',
 'Miss Rucastle',
 'Miss S',
 'Miss Sarah',
 'Miss Smith',
 'Miss Stapleton',
 'Miss Stoner',
 'Miss Stoper',
 'Miss Susan',
 'Miss Sutherland',
 'Miss Turner',
 'Miss Viole
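
The character-class pattern above is compact, but it also admits combinations that are not titles (for example 'Ds' or 'Mi.'). As an alternative sketch, the titles can be spelled out and captured in named groups, which also makes it easy to pull out only the doctors. The group names and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical sample text.
sample = (
    "Dr. Watson greeted Mrs. Hudson, Miss Adler and Mr. Holmes. "
    "Dr. Watson left."
)

# Explicit titles, an optional period, then a capitalised name.
p = re.compile(r"\b(?P<title>Mr|Mrs|Miss|Dr)\.?\s(?P<name>[A-Z]\w+)")

# Sets remove repeated mentions; sorting gives a stable order.
characters = sorted({f"{m['title']} {m['name']}" for m in p.finditer(sample)})
doctors = sorted({m["name"] for m in p.finditer(sample) if m["title"] == "Dr"})

print(characters)  # ['Dr Watson', 'Miss Adler', 'Mr Holmes', 'Mrs Hudson']
print(doctors)     # ['Watson']
```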

## Question 4

* Search out all the years and dates that appear in the story

In [38]:
# we can use \d to match any digit
p = re.compile(r'1[89]\d\d')
p.findall(text)

['1878',
 '1860',
 '1857',
 '1871',
 '1878',
 '1878',
 '1882',
 '1882',
 '1882',
 '1888',
 '1858',
 '1890',
 '1890',
 '1869',
 '1870',
 '1878',
 '1883',
 '1883',
 '1869',
 '1869',
 '1884',
 '1887',
 '1846',
 '1855',
 '1875',
 '1891',
 '1890',
 '1891',
 '1894',
 '1894',
 '1840',
 '1881',
 '1884',
 '1887',
 '1894',
 '1901',
 '1895',
 '1900',
 '1888',
 '1872',
 '1883',
 '1884',
 '1883',
 '1883',
 '1883',
 '1883',
 '1894',
 '1884',
 '1882',
 '1882',
 '1884',
 '1882',
 '1883',
 '1876',
 '1800',
 '1865',
 '1875',
 '1872',
 '1874',
 '1875',
 '1892',
 '1895',
 '1897',
 '1914',
 '1911',
 '1915']
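
The pattern above finds four-digit runs anywhere, including inside longer numbers, and it returns years only. The sketch below adds word boundaries and shows one possible pattern for full dates. The date format '4th of March, 1881' is an assumption chosen for illustration; the formats that actually occur in the corpus would need to be checked first.

```python
import re

# Hypothetical sample text; the number 218005 contains '1800'.
sample = (
    "He left in 1878. Case 218005 was filed on the 4th of March, 1881, "
    "and closed on 3rd of May, 1882."
)

unbounded = re.findall(r"1[89]\d\d", sample)
years = re.findall(r"\b1[89]\d\d\b", sample)

# Day with ordinal suffix, month name, then a year (assumed format).
date_pattern = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th) of ([A-Z][a-z]+), (1[89]\d\d)\b"
)
dates = date_pattern.findall(sample)

print(unbounded)  # ['1878', '1800', '1881', '1882']
print(years)      # ['1878', '1881', '1882']
print(dates)      # [('4', 'March', '1881'), ('3', 'May', '1882')]
```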

## Question 5

Sherlock Holmes is frequently smoking his pipe. But as with many verbs in English, the word 'smoke' can be conjugated in several ways depending on the context.

* Capture all sentences that talk about smoking (smoke, smokes, smoking, smoked)
* Capture the two words that appear after the smoking word (advanced task)
* Capture the two words that appear before the smoking word (advanced task)

In [40]:
p = re.compile(r'\.[ A-Za-z]+smok[ A-Za-z]+\.')
p.findall(text)

['. I am going to smoke and to think over this queer business to which my fair client has introduced us.',
 '. I would have thought no more of knifing him than of smoking this cigar.',
 '. The smoke and shouting were enough to shake nerves of steel.',
 '. He had even smoked there.',
 '. Then I went into the back yard and smoked a pipe and wondered what it would be best to do.',
 '. As we rolled into Eyford Station we saw a gigantic column of smoke which streamed up from behind a small clump of trees in the neighbourhood and hung like an immense ostrich feather over the landscape.',
 '. Then he lit his pipe and sat for some time smoking and turning them over.',
 '. I had smoked two cigarettes before he moved.',
 '.  We had breakfasted and were smoking our morning pipe on the day after the remarkable experience which I have recorded when Mr.',
 '. I observed that he was smoking with extraordinary rapidity.',
 '. He does smoke something terrible.',
 '. From over a distant rise there float

In [45]:
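# this attempt matches single characters, not words: it needs exactly two characters between 'smok' and the closing period
p = re.compile(r'\.[ A-Za-z][ A-Za-z]+smok[ A-Za-z][ A-Za-z]\.')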
p.findall(text)

[]
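
The empty result above leaves the two advanced tasks open. One possible approach is to match whole words with \w+ and capture them in groups on either side of the smoking word. This is a sketch on a made-up sample built from two of the sentences shown earlier, not a definitive solution; in particular, splitting sentences on periods breaks on abbreviations such as 'Mr.'.

```python
import re

# Sample built from two sentences in the output above.
sample = (
    "Then he lit his pipe and sat for some time smoking and turning them over. "
    "I had smoked two cigarettes before he moved."
)

# The verb forms listed in the question.
SMOKE = r"smok(?:e|es|ed|ing)"

# Sentences: anything without a period, the smoking word, then up to the next period.
sentences = [s.strip() for s in re.findall(rf"[^.]*\b{SMOKE}\b[^.]*\.", sample)]

# Two words before and two words after the smoking word.
before = re.findall(rf"\b(\w+)\s+(\w+)\s+{SMOKE}\b", sample)
after = re.findall(rf"\b{SMOKE}\s+(\w+)\s+(\w+)", sample)

print(len(sentences))  # 2
print(before)          # [('some', 'time'), ('I', 'had')]
print(after)           # [('and', 'turning'), ('two', 'cigarettes')]
```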

    
## Question 6

Often we will receive a block of unstructured text and want to use REGEX to provide some structure. In this case, we may want to split the book by chapter.

* View the contents of the text. Notice what is used to delimit the chapters
* Use the re.split() function to split the books by chapter. You should get a list where each item is a chapter
* Print out the contents of chapter 3

In [54]:
print(text[500:50000])

eeches  The Memoirs of Sherlock Holmes Silver Blaze The Yellow Face The Stock-Broker's Clerk The "Gloria Scott" The Musgrave Ritual The Reigate Squires The Crooked Man The Resident Patient The Greek Interpreter The Naval Treaty The Final Problem  The Return of Sherlock Holmes The Adventure of the Empty House The Adventure of the Norwood Builder The Adventure of the Dancing Men The Adventure of the Solitary Cyclist The Adventure of the Priory School The Adventure of Black Peter The Adventure of Charles Augustus Milverton The Adventure of the Six Napoleons The Adventure of the Three Students The Adventure of the Golden Pince-Nez The Adventure of the Missing Three-Quarter The Adventure of the Abbey Grange The Adventure of the Second Stain  The Hound of the Baskervilles  The Valley Of Fear  His Last Bow Preface The Adventure of Wisteria Lodge The Adventure of the Cardboard Box The Adventure of the Red Circle The Adventure of the Bruce-Partington Plans The Adventure of the Dying Detective T

In [46]:
p = re.compile(r'CHAPTER\s\w+')
p.findall(text)

['CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER IX',
 'CHAPTER X',
 'CHAPTER XI',
 'CHAPTER XII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER IX',
 'CHAPTER X',
 'CHAPTER XI',
 'CHAPTER XII',
 'CHAPTER XIII',
 'CHAPTER XIV',
 'CHAPTER XV',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER I',
 'CHAPTER II']

In [13]:
p = re.compile(r'CHAPTER\s[IVX]+')
p.findall(text)

['CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER IX',
 'CHAPTER X',
 'CHAPTER XI',
 'CHAPTER XII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER IX',
 'CHAPTER X',
 'CHAPTER XI',
 'CHAPTER XII',
 'CHAPTER XIII',
 'CHAPTER XIV',
 'CHAPTER XV',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER III',
 'CHAPTER IV',
 'CHAPTER V',
 'CHAPTER VI',
 'CHAPTER VII',
 'CHAPTER VIII',
 'CHAPTER I',
 'CHAPTER II',
 'CHAPTER I',
 'CHAPTER II']

In [11]:
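# index 0 is the text before the first heading, so index 3 follows the third heading (CHAPTER III)
print(p.split(text)[3])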
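
A small sketch of how the indices returned by split line up with the chapter headings, using a made-up string rather than the corpus. Element 0 is whatever precedes the first heading, so the text after the n-th heading sits at index n. Wrapping the pattern in a capturing group keeps the headings in the result.

```python
import re

# Hypothetical miniature "book".
sample = (
    "Title page CHAPTER I First text. "
    "CHAPTER II Second text. CHAPTER III Third text."
)

p = re.compile(r"CHAPTER\s[IVX]+")
parts = p.split(sample)

print(len(parts))        # 4
print(parts[0].strip())  # Title page
print(parts[3].strip())  # Third text.

# With a capturing group, the matched headings are kept in the list.
with_headings = re.split(r"(CHAPTER\s[IVX]+)", sample)
print(with_headings[1])  # CHAPTER I
```

In the full collection the numbering restarts for each novel, so an index counts headings from the start of the file, not chapters within one book.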

