## INVESTIGATING UNSTRUCTURED TEXT
As we saw last week, even the sometimes messy and unpredictable markup language of HTML can give us clues to how data may be structured. But language as a system (as we saw in Borges) also comes with its own structures, and Python provides numerous methods for navigating basic linguistic patterns. Let's begin with repetition itself:

In [None]:
speech = '''Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.'''



There are various ways to investigate Macbeth's famous, very short speech. We begin with the obvious: searching through the whole speech for a single word.

In [None]:
'tomorrow' in speech

In [None]:
speech.find('tomorrow')

In [None]:
speech[14:14+len('tomorrow')]
#speech[14:22]


In [None]:
speech.count('tomorrow')

In [None]:
speech.lower().count('tomorrow')
#speech.lower().find('tomorrow')

In [None]:
# Returns 0: after upper() no lowercase 'a' is left to count
speech.upper().count('a')
#speech.upper().count('A')

In [None]:
speech.lower().count(' a')

In [None]:
speech.lower().count(' a ')
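`count()` tells us how many times a word appears and `find()` tells us where the first one is, but neither lists every position. Here is a minimal sketch that combines the two ideas. The helper name `find_all` and the one-line `sample` are made up for illustration; swap in `speech` to try it on the whole passage.

```python
# A short sample so the cell runs on its own
sample = "Tomorrow, and tomorrow, and tomorrow,"

def find_all(text, word):
    """Return the start index of every case-insensitive match of `word`."""
    lowered = text.lower()
    word = word.lower()
    positions = []
    start = lowered.find(word)
    while start != -1:              # find() returns -1 when nothing is left
        positions.append(start)
        # the optional second argument tells find() where to resume
        start = lowered.find(word, start + 1)
    return positions

positions = find_all(sample, "tomorrow")
print(positions)   # [0, 14, 28]
```

The length of the result always agrees with `sample.lower().count('tomorrow')`, since both ignore case in the same way.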

Of course, there is already a structure to the speech that we are ignoring--it has lines. Let's get out those lines and put them into a list.

In [None]:
lines = speech.split('\n')
#lines = speech.splitlines()
lines
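The two ways of getting lines, `split('\n')` and `splitlines()`, give the same result on our speech, but they can differ on other texts. A small sketch with made-up sample strings shows where:

```python
# Made-up samples: one ends with a newline, one uses Windows line endings
text = "first line\nsecond line\n"
windows_text = "first line\r\nsecond line"

by_split = text.split('\n')
by_splitlines = text.splitlines()

print(by_split)                   # ['first line', 'second line', '']
print(by_splitlines)              # ['first line', 'second line']
print(windows_text.split('\n'))   # ['first line\r', 'second line']
print(windows_text.splitlines())  # ['first line', 'second line']
```

When a file might end with a newline or come from a different operating system, `splitlines()` is usually the safer choice.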

In [None]:
firstline = lines[0]
firstline

Python has a handful of built-in ways to search and modify a line. Here are just a few.

In [None]:
yest = firstline.replace('tomorrow','yesterday',1)
yest

In [None]:
firstline.startswith('tomorrow')

In [None]:
firstline.endswith('tomorrow')
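Both checks above come back `False`, which can be surprising. The string methods are exact: the line starts with a capital `T` and ends with a comma. A short sketch of a few ways around that (the variable names are only for illustration):

```python
firstline = "Tomorrow, and tomorrow, and tomorrow,"

starts_exact = firstline.startswith('tomorrow')             # capital T gets in the way
starts_any_case = firstline.lower().startswith('tomorrow')  # lowercase first, then test
starts_either = firstline.startswith(('tomorrow', 'Tomorrow'))  # a tuple of options
ends_with_comma = firstline.endswith('tomorrow,')           # include the punctuation

# Without the third argument, replace() changes every exact match
replace_all = firstline.replace('tomorrow', 'yesterday')

print(starts_exact, starts_any_case, starts_either, ends_with_comma)
# False True True True
print(replace_all)
# Tomorrow, and yesterday, and yesterday,
```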

## List comprehensions
What if we want to search through every line? The obvious way is to use a `for` loop.

In [None]:
for line in lines:
    if line.startswith('And'):
        print(line)

That is a very simple loop, so simple that Python lets you loop through a list in a one-line statement, called a **list comprehension**.

In [None]:
[line for line in lines if line.startswith('And')]
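To see that the comprehension really is the loop in disguise, here is a sketch that builds the same list both ways, using three lines from the speech so the cell runs on its own. The variable names are only for illustration.

```python
lines = ["And all our yesterdays have lighted fools",
         "The way to dusty death. Out, out, brief candle!",
         "And then is heard no more. It is a tale"]

# Loop version: build the list by hand
and_lines_loop = []
for line in lines:
    if line.startswith('And'):
        and_lines_loop.append(line)

# Comprehension version: [expression for item in list if condition]
and_lines = [line for line in lines if line.startswith('And')]

# The expression can also transform each item, here into its length
and_lengths = [len(line) for line in lines if line.startswith('And')]

print(and_lines == and_lines_loop)   # True
print(and_lengths)
```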

Remember this pattern: it will come in handy when we start using more robust ways of searching line by line (or sentence by sentence, etc.). But before we jump to those special searching methods, let's take a little detour into sorting.

## Sorting!
Say we want to investigate the lines in the speech and order them from longest to shortest. We know how to get the length of each line using a loop, but how can we use those lengths to reorder our list?

In [None]:
for line in lines:
    print(len(line))

We could write a function that pairs these numbers with each line and then sorts through everything, but sort functions are notoriously challenging to write. Fortunately, Python has a built-in sorting method for lists. Called with no arguments, it puts strings in alphabetical order.

In [None]:
sortlines = lines.copy()
sortlines.sort()
sortlines

But not only that: Python has a built-in way to write small, unnamed functions, called `lambda`, that you can pass to the sorting method as its `key`.

In [None]:
sortlines = lines.copy()
sortlines.sort(key=lambda x: len(x), reverse=True)
#sortlines.sort(key=lambda x: x.split()[-1], reverse=True)
sortlines
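For comparison, here is a sketch of two other routes to the same ordering: the built-in `sorted()` function, and the "pair the numbers with each line" idea mentioned earlier. It uses three lines from the speech so the cell runs on its own, and the variable names are only for illustration.

```python
lines = ["Signifying nothing.",
         "To the last syllable of recorded time;",
         "The way to dusty death. Out, out, brief candle!"]

# sorted() returns a new list and leaves `lines` untouched,
# so no .copy() is needed; key=len does the same job as lambda x: len(x)
longest_first = sorted(lines, key=len, reverse=True)

# Pair each length with its line, then sort the pairs;
# tuples are compared by their first element first
pairs = sorted([(len(line), line) for line in lines], reverse=True)
for length, line in pairs:
    print(length, line)
```

Use `.sort()` when you want to reorder a list in place, and `sorted()` when you want to keep the original order around.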

## Regular Expressions
The more you work with unstructured text, the more you will want the power that regular expressions give you. Regular expressions are a mini-language unto themselves (with many similarities across different programming languages). They allow you to search for a variety of patterns within text. The most obvious patterns you might look for are telephone numbers, ZIP Codes, and email addresses (or social security numbers and credit card numbers for the more malicious), and many regular expressions have been written to capture these with varying levels of accuracy. Today, however, our focus will be on exploring text.

First, import the built-in regular expression library `re`:

In [None]:
import re

There are five main regular expression functions that we will work with:

**match()** & **search()**: these functions tell you whether or not they found a match, and where that match was located. Note that match() only looks at the very beginning of the string, so it is less often useful.

**split()** & **sub()**: these two work just like the string methods split() & replace(), but they search for patterns and return a list or a substituted string, respectively.

**findall()**: just as the name sounds, this method returns a list of matching patterns that were found throughout the entire string.
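The difference between match() and search() is easiest to see side by side. A minimal sketch on the first line of the speech (repeated here so the cell runs on its own):

```python
import re

line = "Tomorrow, and tomorrow, and tomorrow,"

# match() only looks at position 0, so "and" is not found
print(re.match("and", line))     # None

# search() scans the whole string and stops at the first hit
found = re.search("and", line)
print(found.group(), found.start(), found.end())   # and 10 13

# Both return None on failure, so test the result before calling .group()
missing = re.search("yesterday", line)
if missing is None:
    print("no match")
```

Calling `.group()` directly on a failed search raises an `AttributeError`, which is why the last check is worth the extra line.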

In [None]:

#found = re.match("morrow",firstline,re.IGNORECASE)
found = re.search("morrow",firstline,re.IGNORECASE)
found.group()
#found.end()

In [None]:
newlist = re.split("and",firstline,flags=re.IGNORECASE)
newstring = re.sub("tomorrow","yesterday",firstline,flags=re.IGNORECASE)
print(newlist,newstring)

In [None]:
words = re.findall("to",firstline,re.IGNORECASE)
words

## Special characters
While the functions above are more flexible than Python's plain string methods, it is the pattern-seeking special characters that, once you get used to them, do the most powerful work.

Here's a list of the most common pattern-seeking characters:

| special character | what it does |
|--------|---------|
| `.` | match any character except newline |
| `^` | match the beginning of the string |
| `$` | match the end of the string, or just before a trailing `\n` |
| `*` | match 0 or more repetitions of the preceding character |
| `+` | match 1 or more repetitions of the preceding character |
| `?` | match 0 or 1 repetitions of the preceding character |
| `{m}` | match exactly m repetitions |
| `{m,n}` | match between m and n repetitions |
| `{m,}` | match m or more repetitions |
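The repetition characters in the table are easier to tell apart on a tiny, artificial string than on the speech. This sketch uses a made-up `text` in which `l` is followed by one, two, three, and zero `o`s:

```python
import re

text = "lo loo looo l"

star = re.findall("lo*", text)         # l followed by 0 or more o
plus = re.findall("lo+", text)         # l followed by 1 or more o
optional = re.findall("lo?", text)     # l followed by 0 or 1 o
exactly_2 = re.findall("lo{2}", text)  # l followed by exactly 2 o
two_to_3 = re.findall("lo{2,3}", text) # l followed by 2 to 3 o
at_least_2 = re.findall("lo{2,}", text)  # l followed by 2 or more o

print(star)        # ['lo', 'loo', 'looo', 'l']
print(plus)        # ['lo', 'loo', 'looo']
print(optional)    # ['lo', 'lo', 'lo', 'l']
print(exactly_2)   # ['loo', 'loo']
print(two_to_3)    # ['loo', 'looo']
print(at_least_2)  # ['loo', 'looo']
```

Notice that each repetition character applies only to the single character just before it (the `o`), not to the whole pattern.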


In [None]:
all_ll = re.findall("..ll",speech)
#re.search("^Tomorrow",firstline)
#re.search("tomorrow,$",firstline)
all_ll


In [None]:
#a list comprehension again!
#Note that match() would produce the same thing
[line for line in lines if re.search("^And",line)]

In [None]:
[line for line in lines if re.search(",$",line)]

In [None]:
# "th" followed by 0 or more e's, then any two characters
the_star = re.findall("the*..",speech)
the_star

In [None]:
l_plus = re.findall("..l+..",speech)
l_plus

In [None]:
# any character, then "o", then an optional "r"
or_optional = re.findall(".or?",speech)
or_optional

In [None]:
o_2 = re.findall("..o{2}..",speech)
o_2

## Sets and Groups
**Sets**, which include `[]` and shortcuts like `\w`, allow you to search for certain types of characters. **Groups**, which are demarcated by `()`, allow you to specify important sub-patterns that you can access individually.

| enclosures | what it does |
|--------|---------|
| `[]` | A defined set of characters to search for |
| `()` | A group of characters to search for, can be accessed individually in the results. |


| Examples of sets | what it does |
|--------|---------|
| `[aeiou]` | Find any vowel |
| `[Tt]` | Find a lowercase or uppercase t |
| `[0-9]` | Find any digit, there is a shortcut for this |
| `[^0-9]` | Find anything that's not a digit, there is a shortcut for this |
| `[13579]` | Find any odd digit |
| `[A-Za-z]` | Find any letter, there is a shortcut for this too |
| `[+.*]` | Find those actual characters; most special characters lose their special meaning inside sets (not including shortcuts: see below) |


| Shortcut | what it does |
|--------|---------|
| `\b` | Word boundary: the position at the beginning or end of a word (next to a space, comma, end of line...) |
| `\B` | Not a word boundary |
| `\d` | digits [0-9] |
| `\D` | not digits |
| `\s` | whitespace characters: space, tab, newline... |
| `\S` | not whitespace |
| `\w` | word characters: letters, digits, and underscore |
| `\W` | not word characters |
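A quick sketch of the shortcuts in action, on a made-up `text` that mixes digits, words, and punctuation. It also shows why the patterns below are written as raw strings (`r"..."`): in an ordinary Python string, `"\b"` means a backspace character, not a word boundary.

```python
import re

text = "Act 5, Scene 5: out, out, brief candle!"

digits = re.findall(r"\d", text)         # single digits
words = re.findall(r"\w+", text)         # runs of word characters
whole_out = re.findall(r"\bout\b", text) # "out" as a whole word
c_words = re.findall(r"\bc\w*", text)    # words that begin with c

print(digits)      # ['5', '5']
print(words)       # ['Act', '5', 'Scene', '5', 'out', 'out', 'brief', 'candle']
print(whole_out)   # ['out', 'out']
print(c_words)     # ['candle']

# Without the r prefix, Python turns \b into one backspace character
print(len("\b"), len(r"\b"))   # 1 2
```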


In [None]:
words = re.findall(r"\b[CcBb]\w+",speech)
words

In [None]:
words = re.findall(r"[tT]\w+",speech)
#words = re.findall(r"([tT]\w+)",line)
words

Looking for phrases

In [None]:
phrases = re.findall(r"(?=(\b\w{2}\W+\w+\W+\w+))",speech)
#phrases = re.findall(r"(\b\w{2}) (\w+) (\w+)",speech)
phrases
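The phrase search above leans on two ideas at once: groups, which change what findall() returns, and the lookahead `(?=...)`, which lets matches overlap. Here is a sketch that separates them, using the first line of the speech and simplified patterns (two-word phrases rather than three):

```python
import re

line = "Tomorrow, and tomorrow, and tomorrow,"

# No groups: findall() returns each whole match as a string
whole = re.findall(r"\w+, and", line)
print(whole)     # ['Tomorrow, and', 'tomorrow, and']

# With groups: findall() returns one tuple per match, one item per group
grouped = re.findall(r"(\w+), (and)", line)
print(grouped)   # [('Tomorrow', 'and'), ('tomorrow', 'and')]

# Ordinary matches use up the text they cover, so phrases cannot overlap
plain = re.findall(r"\b\w+\W+\w+", line)
print(plain)     # ['Tomorrow, and', 'tomorrow, and']

# A lookahead uses up nothing, so every word gets a turn to start a phrase
overlapping = re.findall(r"(?=(\b\w+\W+\w+))", line)
print(overlapping)
# ['Tomorrow, and', 'and tomorrow', 'tomorrow, and', 'and tomorrow']
```

The group inside the lookahead is what findall() hands back; without it, each match would be an empty string.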

Searching a longer poem

In [None]:
# `with` closes the file for us once the text has been read
with open('/Users/Jon/Documents/columbia_syllabus/wasteland.txt', 'r') as f:
    wasteland = f.read()

In [None]:
poemlines = wasteland.split('\n')

In [None]:

[line for line in poemlines if re.search("win.", line)]


Searching a whole play

In [None]:
# `with` closes the file for us once the text has been read
with open('/Users/Jon/Documents/columbia_syllabus/hamlet.txt', 'r') as f:
    play = f.read()


In [None]:
type(play)

In [None]:
play[:500]

In [None]:
all_chars = re.findall(r"[\n]([A-Z]+)[\n]",play)
all_chars
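Once we have a list of names, a natural next step is to count them. The sketch below assumes the play file puts each speaker's name alone on a line in capitals; that is a guess about the file's layout, and `sample` is a made-up miniature standing in for `play`. It also tries an alternative pattern that uses `^` and `$` with the `re.MULTILINE` flag, which makes them match at every line rather than only at the ends of the whole string.

```python
import re
from collections import Counter

# A made-up miniature in the assumed layout; use `play` for the real thing
sample = ("\nHAMLET\nWords, words, words.\n"
          "POLONIUS\nWhat is the matter, my lord?\n"
          "HAMLET\nBetween who?\n")

speakers = re.findall(r"[\n]([A-Z]+)[\n]", sample)
print(speakers)   # ['HAMLET', 'POLONIUS', 'HAMLET']

# Counter tallies how often each name appears
tally = Counter(speakers)
print(tally.most_common(1))   # [('HAMLET', 2)]

# The pattern above uses up the newline after each name, so a capitalized
# line directly below another one is skipped; ^ and $ use up nothing
stacked = "\nACT\nSCENE\n"
print(re.findall(r"[\n]([A-Z]+)[\n]", stacked))             # ['ACT']
print(re.findall(r"^[A-Z]+$", stacked, flags=re.MULTILINE))  # ['ACT', 'SCENE']
```

Either pattern will also pick up any other all-capitals line, so the results on a real file are worth checking by eye.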