----
Another regex example: Finding the "the"s
----

The objective of this exercise is to get you familiar with Python's regular expression module, “re”.

In this exercise we will type a sentence that contains several “the”s, either as the word itself or as part of a larger word, as in o“the”rwise. The task is to use the re module to find “the” by itself.

We will use a hypothetical sentence and then use the findall function of re to find every occurrence of “the” in it. We will then experiment with what happens when we change the regular expression we use. Make sure you have the regex cheat sheet with you.

In [30]:
""" Let's make up a fictional sentence containing many "the"s. We
 are only interested in the standalone "the"s and we should ignore all the other "the"s in blithe, theatrical, theme and otherwise.
"""
some_text = ' the overly blithe theatrical theme is otherwise the most boring' 

In [31]:
some_text.find('the')

1

Let's try to search for "the" in the above sentence.

In [32]:
import re
# The findall function takes the pattern we are searching for and the string to search in
results_the = re.findall('the', some_text) 

In [33]:
results_the # We see every "the": the standalone word (twice) plus the ones inside blithe, theatrical, theme and otherwise

['the', 'the', 'the', 'the', 'the', 'the']
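
To see which words those six matches come from, one option is to widen the pattern with `\w*` on both sides so that each match captures the whole surrounding word. This is a small sketch on the same sentence; the variable name `words_with_the` is only illustrative.

```python
import re

some_text = ' the overly blithe theatrical theme is otherwise the most boring'

# \w* absorbs any word characters before and after "the",
# so each match is the whole word that contains it.
words_with_the = re.findall(r'\w*the\w*', some_text)
print(words_with_the)
# ['the', 'blithe', 'theatrical', 'theme', 'otherwise', 'the']
```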

However, we just want the "the" that occurs by itself, i.e. with no other letters before or after it.

Let's look for 'the' as a word 

---
Useful Tools:
----

- [Realtime regex engine](http://regexr.com/)
- [Regex tester](https://regex101.com/)
- [Regex cheatsheet](http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/)
- [Python Regex checker](http://pythex.org/)

In [34]:
# Require a non-word character on each side of "the" (the raw string keeps \w intact)
results_the = re.findall(r'[^\w]the[^\w]', some_text)

In [35]:
results_the

[' the ', ' the ']

In [36]:
map(str.strip, results_the)

<map at 0x108d22860>

In [37]:
list(map(str.strip, results_the))

['the', 'the']
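
An alternative worth knowing is the word-boundary anchor `\b`, which matches a position rather than a character. The following is a sketch on the same sentence, not a replacement for the approach above. Because `\b` consumes nothing, the matches need no stripping, and a "the" at the very start or end of the string is still found, whereas the character-class pattern needs a real non-word character on each side.

```python
import re

some_text = ' the overly blithe theatrical theme is otherwise the most boring'

# \b matches the empty position between a word character and a non-word character
standalone = re.findall(r'\bthe\b', some_text)
print(standalone)  # ['the', 'the']

# With no surrounding characters, the character-class pattern finds nothing,
# while the \b version still matches.
no_neighbors_class = re.findall(r'[^\w]the[^\w]', 'the')
no_neighbors_boundary = re.findall(r'\bthe\b', 'the')
print(no_neighbors_class)     # []
print(no_neighbors_boundary)  # ['the']
```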

---
Check for understanding
---

<details><summary>
How do we turn the lazy `map` object into an actual list of results?
</summary>
`list(map(str.strip, results_the))`
</details>

---
Regex Design Patterns
---
1. Create pattern in Plain English
2. Map to regex language
3. Test on sample text
    - All Positives: Captures all examples of pattern
    - No Negatives: Everything captured is from the pattern
4. Data munging. Don't over-engineer your regex. Filtering before and after are okay.
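
Step 3 can be sketched as a small check: one list of strings the pattern should match and one list it should not. The example strings and the candidate pattern below are illustrative assumptions, reusing the standalone "the" task from earlier.

```python
import re

# Candidate regex for the plain-English pattern: "the" as a standalone word
pattern = re.compile(r'\bthe\b')

positives = ['the cat', 'over the moon', 'is that the']      # should match
negatives = ['blithe', 'theatrical', 'theme', 'otherwise']   # should not match

# All positives: every example of the pattern is captured
missed = [s for s in positives if not pattern.search(s)]
# No negatives: nothing outside the pattern is captured
false_hits = [s for s in negatives if pattern.search(s)]

print(missed, false_hits)  # [] []
```

Non-empty `missed` or `false_hits` lists point at the examples to look at before adjusting the regex.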

### Other methods

- search() returns the first match, if any.
- match() applies the pattern only at the start of the string.
- split() splits the source at matches of the pattern and returns a list of the string pieces.
- sub(), short for substitution, replaces each match of the pattern with another string.
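
A quick sketch of these four functions on a short made-up string (the string and variable names are illustrative only):

```python
import re

text = 'the theme is otherwise the same'

# search(): first match anywhere in the string, or None
first_span = re.search(r'\bthe\b', text).span()   # (0, 3)

# match(): only succeeds if the pattern matches at the start of the string
theme_at_start = re.match(r'theme', text)         # None, "theme" is not at the start
the_at_start = re.match(r'the', text).group()     # 'the'

# split(): cut the string wherever the pattern matches
pieces = re.split(r'\s+', text)

# sub(): replace every match with another string
replaced = re.sub(r'\bthe\b', 'a', text)

print(first_span, theme_at_start, the_at_start)
print(pieces)
print(replaced)
```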

-----
Yet Another Demo: Determining the frequency of words ending in "ing"
-----

In this exercise we will use regular expressions to determine if a word ends in “ing”. If it does, we will add it to a dictionary and increase its count by 1.

The objective is to get familiar with the idea of using word counts.

In [38]:
with open('../../corpora/shakespeare_all.txt') as f:
    shakespeare = f.read()

In [39]:
print("Shakespeare wrote about {:,} words.".format(len(shakespeare.split())))

Shakespeare wrote about 899,594 words.


--- 
Let's write some regex
----

When I write some regex

![](http://i.imgur.com/8b5kNhQ.gif)

[Source](http://thecodinglove.com/post/85802561535/when-i-write-some-regex)

In [52]:
from collections import defaultdict
import string

words = defaultdict(int)

for word in shakespeare.split():
    # Find words that end in "ing"
    matches = re.findall(r'ing$', word)  # $ anchors the match at the end of the string
    if matches:
        # Strip surrounding punctuation and lowercase, so "King" and "king" count together
        new_word = word.strip(string.punctuation).lower()
        words[new_word] += 1

In [53]:
# How many unique words do we have?
len(words.keys())

1667
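
One thing to note about the loop above: the `ing$` test runs on the raw token, so a token with trailing punctuation such as `king,` is skipped. A variant that normalizes each token first and counts with `collections.Counter` might look like the sketch below. It uses a short made-up sample string instead of the Shakespeare corpus, so its counts are not comparable to the output above.

```python
import re
import string
from collections import Counter

# Hypothetical sample text, standing in for the corpus
sample = 'The King is coming, the king is being kind. Nothing ends; bring the ring!'

ing_counts = Counter()
for token in sample.split():
    word = token.strip(string.punctuation).lower()  # normalize first
    if re.search(r'ing$', word):                    # then test the ending
        ing_counts[word] += 1

print(ing_counts['king'])   # 2
print(len(ing_counts))      # 6
```

`ing_counts.most_common(10)` would give the ten most frequent words without needing pandas.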

Let's make it pretty

In [42]:
import pandas as pd

In [43]:
# Creating a table using pandas and transposing it so that the columns become rows and vice versa  
ing_words = pd.DataFrame(words, index=[0]).T 

In [44]:
ing_words.columns = ['count']

In [45]:
# Sort the rows
ing_words.sort_values('count',
                      ascending=False,
                      inplace=True)

In [46]:
# How many words do we have?
ing_words.shape 

(1667, 1)

In [47]:
# Take a peek at the data
ing_words.head(10) 

| | count |
|---|---|
| king | 770 |
| being | 643 |
| bring | 436 |
| nothing | 436 |
| thing | 251 |
| something | 155 |
| having | 145 |
| loving | 106 |
| coming | 90 |
| living | 82 |


In [48]:
# Take a peek at the end of the data
ing_words.tail(10) 

| | count |
|---|---|
| averring | 1 |
| late-walking | 1 |
| laund'ring | 1 |
| authorizing | 1 |
| lazy-pacing | 1 |
| auguring | 1 |
| leave-taking | 1 |
| length'ning | 1 |
| libelling | 1 |
| a-birding | 1 |


---
Closing Thought
---

> Some people, when confronted with a problem, think   
> “I know, I'll use regular expressions.”   Now they have two problems.

<br>
<br>
---