# HW7: Counting and regular expressions



In [None]:
# re module for working with regular expressions
import re
# For numerical work, nearly everyone uses numpy
from numpy import pi

## Part 1: Dictionaries and counting

This notebook partly draws from materials put together by [Dirk Hovy](http://dirkhovy.com/). That's why there's a figure today! Dirk is a computational linguist at the University of Copenhagen. Much of his work tries to explore the intersection of social variables and NLP, working with large online corpora.

### The structure of programs

Most of programming, irrespective of the language you use, has four main elements:

1. ***Assignment***: linking a name to a value. The names are called ***variables***. 

2. ***Loops***: sometimes we want to do the same thing repeatedly, either a fixed number of times, or
until something happens. This is what loops are for. 

3. ***I/O (Input/Output)***: this refers to everything that has to do with getting information into and
out of our programs, e.g. files (opening, closing, reading from or writing to them) or output on
the screen.

4. ***Control structures***: sometimes, we need to make decisions. I.e., if a variable has a certain 
value, do `X`, otherwise, do `Y`. Control structures are simple `if...then...else` constructs that evaluate
the alternatives and make this decision. 

Today we'll put these together to do a useful elementary language processing task: getting counts of words in a document. The three main new things we need to learn today are: **reading from files**, **control structures**, and an important new data type the **dictionary** or just **dict**, which is a **mapping** data structure.

### The dictionary (or "dict") data type

Python uses the term "dictionary" or "dict" for a *mapping*: a collection of items of one type mapping to another type. A dictionary is written with curly braces. For example, here's a mapping, from web sites to my passwords:

In [None]:
passwds = {'Amazon': 'curly', 'Google': 'furry', 'Apple': 'easy',
           'Microsoft' : 'easy'}

No, not really! But it will do. You can access elements from a dict using the same square brackets notation after the dict/variable name, but now using a key which is the first half of the mapping:

In [None]:
print('My Google password is: ' + passwds['Google'])

Trying to get a value for a key that doesn't exist is an error!

In [None]:
passwds['LinkedIN']

If we want to add a new item to our dictionary, we can simply assign a key a value:
```
<dictionary>[key] = <value>
```

Add the value `"flotilla"` as my `"Facebook"` password:

In [None]:
# Add the value "flotilla" as my "Facebook" password:

In a dict, there can only be one value for a key, but several keys can have the same value. Oh, and while I said a key can have only one value, that value _can_ be a list, which lets you do general relations. A dictionary, unlike a list, isn't indexed by position: you look things up by key, not by number. But you can very efficiently get the value for a key. You can also call three methods, `keys()`, `values()`, and `items()`, which return list-like values that you can do a `for`-loop over to see all the keys, values, and mappings in the dict.  Try them:

In [None]:
for k in passwds.keys():
    print(k)

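Note that in older versions of Python (before 3.7) the keys could come out in a different order from the one I wrote them down in. Since Python 3.7, a dict keeps its keys in the order they were added. Even so, you usually shouldn't rely on the order you wrote things down in.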
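To make the point about list values concrete, here is a small sketch. The dict `old_passwds` and its contents are made up for illustration; they are not part of the homework data.

```python
# Hypothetical data: each site maps to a LIST of old passwords.
old_passwds = {'Amazon': ['curly', 'wavy'], 'Google': ['furry']}

# The value stored under a key is an ordinary list,
# so the usual list methods work on it.
old_passwds['Google'].append('fuzzy')

# Loop over the keys and look up the list stored under each one.
for site in old_passwds.keys():
    print(site, old_passwds[site])
```

Each key still has exactly one value; it just happens that the value is a list that can hold several things.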

In [None]:
# Now print all the values

In [None]:
# Now print all the items

Note that the item is something we haven't quite seen before – it's two strings wrapped in parentheses. It looks like the arguments to a function. This is different from a list and is called a ***tuple***. It's less important than lists, but we'll come back to them later today....
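As a quick preview, here is a minimal sketch of what you can do with a tuple. The name `pair` and its contents are made up for illustration.

```python
# A made-up pair, written with round brackets: this is a tuple.
pair = ('Amazon', 'curly')

# Elements are read by position, just as with a list.
print(pair[0])

# A tuple can be unpacked into separate names in one assignment.
site, passwd = pair
print(site, passwd)

# Unlike a list, a tuple cannot be changed once it has been made.
try:
    pair[0] = 'Google'
except TypeError as err:
    print('Cannot change a tuple:', err)
```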

You can check whether a key is in a dict with the `in` and `not in` operators: `<key> in <dict>`. (To test for a value instead, use `<value> in <dict>.values()`.) But that's often tedious to use, so you should also know the cleverer method on dicts `get(key, default)`, which lets you ask for a key and returns its value if it exists, or the default value otherwise. We'll be able to use it later to make our program neater.
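Here is a small sketch of `in` and `get()` on a different dict, so the exercises below are still yours to do. The dict `capitals` is made up for illustration.

```python
# Hypothetical example dict, unrelated to the password data.
capitals = {'Denmark': 'Copenhagen', 'France': 'Paris'}

# `in` on a dict tests the KEYS.
has_france = 'France' in capitals                  # True
has_paris_key = 'Paris' in capitals                # False: 'Paris' is a value, not a key
has_paris_value = 'Paris' in capitals.values()     # True

# get() returns the value if the key exists, and the default otherwise.
known = capitals.get('Denmark', 'unknown')
missing = capitals.get('Norway', 'unknown')

print(has_france, has_paris_key, has_paris_value)
print(known, missing)
```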

In [None]:
# See if I have an 'Amazon' password
'Amazon' in passwds

In [None]:
# Either print my 'Facebook' password or 'None'


### Word Counts—Dictionaries and Control Structures

Last week, we learned about variable assignment, loops, and printing to the screen.
There are several useful object types that we have not yet covered, and we need to learn about the constructs that let us
test conditions. We will see them in this program, as well as IO for reading from files.

We want to know which words occur how often in a
file. This is a common elementary text processing step in order to get some idea of your data and to get a sense of its overall topics. The output of such counting is precisely what people use to draw the very common visualization of [word clouds](http://www.wordle.net/). (Even though they're very common, many visualization people don't like them very much; just like pie charts.)

Let’s first think about what we have and what we want. We have a ***file***, and we want the
counts for the ***words*** in there. So there is a ***file***, ***sentences***, ***words***, and their ***counts***. We need to read the
file, get the sentences; for each sentence, get the words, and somehow record their counts. In the end,
we just print out the counts again. We can display this like in Figure 1.

<img src="pics/diagram_word_counts.png" width="500px">
<div align="center">*Figure 1: Flow chart for our word count problem*</div>



Now let’s look at the program: it takes a file, reads it in, keeps a running count for each word, and
prints those counts at the end. 

**Make sure to execute each code section as you progress (even the pre-written ones), so that the variables become available to the interpreter.** You won't see any direct output when executing the cell below, but we will need it further on.

We first declare the name of the file as a variable, and then actually open the file. 
`open()` is the function that opens the file. It needs just one argument: the name of the file we
try to open. You can give it a second argument, `'w'`, if you want to write to a file rather than just read from it. Here, we only want to read, so we don't need to specify anything else.
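To see the two ways of calling `open()` side by side, here is a minimal sketch. It writes to a scratch file in a temporary directory (the name `scratch.txt` is made up) so that none of your own files get overwritten, and it uses the file method `read()`, which returns the whole contents as one string.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory.
scratch_dir = tempfile.mkdtemp()
scratch_name = os.path.join(scratch_dir, 'scratch.txt')

# Second argument 'w': open the file for writing.
out_file = open(scratch_name, 'w')
out_file.write('first line\n')
out_file.write('second line\n')
out_file.close()

# No second argument: open the same file for reading.
in_file = open(scratch_name)
contents = in_file.read()
in_file.close()

print(contents)
```

Be careful with `'w'`: opening an existing file for writing empties it first.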

Python takes care of some pesky new line and encoding issues, so we won’t worry too much right now about special characters. Go ahead and read a file with runes!

In [None]:
file_name = 'debate-clinton.txt'

# open the file for reading
text_file = open(file_name)


The result of running `open()` is not the text of the file. It is similar to a list (it's not exactly a list, but an ***iterator*** – that's a lot like what `keys()` gave us above), and we give it the name `text_file`, so we can use it later on. This gives us a ***handle*** to read through the file. 

In [None]:
word_count = {}

After we have assigned the file handle, we assign the name `word_count` to a ***dictionary***. Here, our keys will be strings (the words), and the values we map them to are numbers (their respective
frequencies). If we just use a pair of curly braces, as we
did here, we get an empty dictionary. There are no entries. 

After we have declared the dictionary, we start iterating through the file with a `for`-loop.
Since `text_file` is an open file, this gives us the lines of the file one at a time. We can
thus iterate over them. For each line in the file, we want to do a number of things.
That is why the next lines are all indented under the `for`-loop header line. 

In [None]:
# go through the open file line by line
for line in text_file:
    # get rid of the line break at the end
    line = line.strip()
    # split sentence along spaces
    sentence = line.split()
    # go through words
    for word in sentence:
        # check whether word is already in dictionary
        if word in word_count:
            # if yes: increment
            word_count[word] = word_count[word] + 1
        # if not, add an entry
        else:
            word_count[word] = 1

First, we get rid of the line break and any white space at the beginning or end of the line. We could write code to do all these things, but there is an easier (and shorter) way:
we use the `strip()` command to remove all that whitespace from the line and assign it to 
the same name as before (`line`). (*Subtle point here about line break character!*) We made a new object – strings cannot be changed – but we assign it the same name. Whenever we use `line` from now on, it is the “cleaned-up” version of the line. (*Subtle points:* (1) Mutable and immutable objects; (2) If a method returns a new, changed object but doesn't change the original object – this is common and the only way to do things for immutable objects – then it is vital to assign the output of the method to something, or you will lose it. Commonly we assign it back to the same variable name if we conceptually think that we have *improved* the same thing.) 
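The point about immutable strings can be checked directly. This is a small sketch with a made-up string, not a line from the debate file.

```python
# A made-up line with spaces at both ends and a line break at the end.
raw = '  Thank you very much.  \n'

# strip() returns a NEW string; the original is left as it was.
cleaned = raw.strip()
print(repr(raw))
print(repr(cleaned))

# Calling strip() without assigning the result throws the result away.
raw.strip()
print(repr(raw))
```

Using `repr()` here makes the spaces and the `\n` visible in the printed output.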

We then use the `split()` command we have seen before.
Remember, with no argument it splits a sentence at each run of white space into a list, so extra white spaces
do not create empty entries in our list (they would if we split on a single space with `split(' ')`). The list of strings resulting from `split()` is assigned to
`sentence`, and we then iterate over that list. We have seen this before, so I will skip to the next
interesting part here: control structures.
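If you want to check how `split()` treats repeated spaces, here is a small sketch on a made-up string. It compares calling `split()` with no argument against splitting on a single space.

```python
# A made-up string with extra spaces between the words.
spaced = 'good   evening  everyone'

# No argument: split on runs of whitespace, no empty strings.
by_whitespace = spaced.split()
print(by_whitespace)

# Explicit single space: every space is a separator,
# so repeated spaces leave empty strings in the list.
by_single_space = spaced.split(' ')
print(by_single_space)
```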

After we have read the whole file, we close it with the `close()` command on the file variable. The dot tells us that it is a method of files. Note 
that this line is no longer indented under the `for`-loop, but at the same level. This means that it is only
executed once we have completed all our iterations of the for-loop, in this case, after we have read all
lines in the file!

In [None]:
# close the file after reading
text_file.close()

### Control Structures

So far, we have simply executed one command after the next. We never had to make a decision or
choose among options. Now we have to. If a word is already listed in our dictionary, we want to
increase its count by one (we know how to do that). If the word is not in our dictionary, however, we
have to make an entry. Otherwise, we would try to increment a count that does not even exist (you
cannot look up something that is not in the dictionary).

To make the decision what to do, we use the `if...then...else` structure or ***conditional***.
The structure of the conditional is simple:
```
if <condition is true>:
    <action1>
else:
    <action2>
```

Here `<condition>` is another type of value, a so-called ***boolean***. Booleans are named after the 
mathematician George Boole, and have only two values: `True` and `False` (note the capital spelling!). In our
case, the value comes from the outcome of the condition `word in word_count`, which checks whether
the dictionary `word_count` contains the key `word`. **`in`** is one of Python’s reserved words. You can
use it to check whether a value is in a dictionary, a list, or other ***collections***.
Sometimes, there are more than just two cases (something being true or false) that we would
like to account for. In that case, we can check for more conditions:
```
if <condition1 is true>:
    <action1>
elif <condition2 is true>:
    <action2>
elif <condition3 is true>:
    <action3>
else:
    <action4>
```

You can add as many `elif` cases as you want! We will see an example of this in the next section.
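Here is a minimal sketch of an `if`/`elif`/`else` chain. The dict `toy_counts`, the cut-off of 5, and the labels are all made up for illustration; only the first condition that is true has its action carried out.

```python
# Hypothetical counts, not taken from the debate files.
toy_counts = {'the': 12, 'debate': 3, 'tonight': 1}

labels = {}
for word in ['the', 'debate', 'tonight', 'zebra']:
    # The conditions are tried from top to bottom.
    if word not in toy_counts:
        label = 'never seen'
    elif toy_counts[word] == 1:
        label = 'seen once'
    elif toy_counts[word] < 5:
        label = 'seen a few times'
    else:
        label = 'seen often'
    labels[word] = label
    print(word, '->', label)
```

Note that the order of the tests matters: checking `word not in toy_counts` first means the later lookups `toy_counts[word]` can never fail.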

So if the word we look at is indeed in our dictionary, we increment its count by one. 
If it is not, the `else` branch runs instead: this puts the current word in the dictionary and sets its counter to 1.

Write your own control structure that checks whether `"Amazon"` is a key in `passwds`, and prints `"Your password is <passwd>"` if there is one and `"You don't have an Amazon account!"` otherwise.

In our program, we finally want to print all the counts we have collected to the screen. We use another `for`-loop. This time, it iterates over a sequence of ***tuples***. Tuples are a lot like lists, with the big difference that they 
have a fixed size. They are less flexible than lists. In Python, we denote tuples by round brackets
(instead of square ones as for lists). The method `items()` of a dictionary returns a list-like collection of 
tuples, each holding a key and its respective value. We use that and assign them to `word` and `frequency`, respectively. We print each word and its frequency (provided it occurred more than once) separated by a space (that is why there is a comma in the `print()` call, see below).

In [None]:
# take each pair of word and frequency in the dictionary
for (word, frequency) in word_count.items():
    if frequency > 1:
        print(word, frequency)

Well, we've learned a fair bit here. We have learned how to read in a text file, how to use control structures, and we have
seen the new object types dictionaries and tuples.
You have now seen a lot of the basics of Python! While there are a lot of other things that you *can* learn – and, gosh, I'm going to attempt to teach quite a few of them — you can actually write quite a bit of basic text processing using just these elements. Many of the things that we'll learn later provide faster, more powerful, more convenient ways to do things that you _could_ do with just these elements.

### Doing more with word counts from the Clinton-Trump debate

Try to do all of these things, and end up with a decent program at the end that does all this stuff.

The above program got word counts from the file `'debate-clinton.txt'`. It was hardcoded to do so. But we also want word counts from `'debate-trump.txt'`. So, what we want is a function that can count words in _any_ file.  We might structure our program as two functions:

1. A function that takes a string filename and returns a dict from words to their counts in the file.

2. A function that takes a dict of word counts and prints the word counts

You are welcome to copy and paste any code from above to get this to work.

In [None]:
# Function for collecting word counts from a file
# Don't forget to close the file after you have read it in!

In [None]:
# Function for printing word counts from a dict

In [None]:
# Top-level code that calls the above functions for each of the files
# 'debate-clinton.txt' and 'debate-trump.txt'.
# You probably also want to print out a blank line separator dividing the files and saying which you are printing.

Let's now try to make that program a bit better! You can just edit it above and leave the final program.

1. Although it was useful to learn about `if` control structures – and we will use them a lot – you don't actually need to use one here. Do you instead remember about the `get(<key>, <default>)` method on a dict that we saw earlier? Try using it. *(Warning: this can be challenging so it is fine to skip it at first and come back later)*.

2. The above code is hardcoded to print words that occur 2 or more times. This makes the list shorter by leaving out words that occur only once (the "hapax legomena" of the list — there are always a lot of these, often about 40% of the word types). But it's still a long list. We might want to only print words that occur 3 or more times, say. Make the minimum number of times for a word to occur to print it another parameter of the second function, and have the top-level code call it with the value 3.

3. At the other extreme, it might not be very interesting knowing how often the candidates say "the" or "to". Such function words often don't seem to carry much content. (Though, of course, be aware of the work of people such as [James Pennebaker](http://www.secretlifeofpronouns.com/), who emphasizes how much social meaning can be conveyed by function words.) Lists of common function words that you are not going to count are conventionally called **stop words** in computational work. Modify the first function to also accept a list of stop words which you don't put in the dict. Modify the top-level code so that it doesn't count `['the', 'a', 'an', 'that', 'and', 'to', 'of']`.

4. We're starting to suffer badly from simply tokenizing by dividing on whitespace. We could do a little better by simply "whiting out" the commonest punctuation marks that glom on to words. Before splitting on white space, we could change the string to delete characters like: `'.', ',', '"'`. Remember the `replace()` method on `str` that we saw last time. Look at your output; you may well want to delete a few more. Doing this will do a little textual damage; e.g., `30,000` will become `30000`, but it won't be too bad. You may want to not delete `'`, though, so that you don't damage words like `isn't`.

5. It might also be useful to lowercase all tokens, so that words don't become different just because they are at the start of a sentence. Of course, you'll then just have to be smart enough to recognize that `irs` means the `IRS`.

6. It would be good to also add up how many non-stop words were spoken by each candidate. Who spoke the most?

7. To normalize for frequency, it would be useful to also work out what percentage of all the words a candidate says each word accounts for. So, in the second function, print the percent as well as the raw count.

8. Find at least one interesting difference in word use between the two candidates, and put it in the cell below!

### Bonus: Getting word counts from the Google Books data

The raw data files for the Google Books collection are available for 
download. The files are huge, so I created a tiny sample in the file `googlebooks-eng-all-1gram-20120701-a-sample`.

The format of this file is as follows (whitespace inserted for 
readability):

```
word TAB year TAB match_count TAB volume_count NEWLINE
```

The TAB character is "\t", which you can treat like any other (for
example, you can split a string on "\t"). The `match_count` is how many times the word occurred and the `volume_count` is a smaller number for how many _different_ books it occurred in. We will use the `match_count`. Note that the words have also been disambiguated by part of speech where ambiguous. We'll get to that later.
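As a quick illustration of the format, here is a sketch that takes one line apart. The line itself is made up (it is not copied from the sample file), and the variable names are just suggestions.

```python
# A made-up line in the four-column, tab-separated format.
sample_line = 'aardvark\t1999\t12\t3\n'

# Remove the trailing newline, then split on the TAB character.
fields = sample_line.strip().split('\t')
print(fields)

# Unpack the four fields into separate names.
word, year, match_count, volume_count = fields

# Everything read from a file is a string, so convert
# the numbers with int() before doing arithmetic on them.
year = int(year)
match_count = int(match_count)
print(word, year, match_count)
```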

Your first task: complete `googlebooks_counts_by_year` so that it processes
my sample file and returns a 2d dictionary (a top-level dictionary whose values are each a dictionary!) with this structure:

{
  word1: {year1: count, year2: count ...},
  word2: {year1: count, year2: count ...},
  ...
}

where the contents of the year dicts is determined by the file.
(That is, different words will have different years and counts
associated with them.)
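If the idea of a dictionary inside a dictionary is new, here is a small sketch of how you read from and add to one. The dict `toy` and its numbers are made up; building the real one from the file is still your job.

```python
# Hypothetical 2d dictionary: word -> {year -> count}.
toy = {
    'apple': {1990: 4, 1991: 7},
    'anchor': {1990: 2},
}

# Two pairs of square brackets: first pick the word, then the year.
apple_1991 = toy['apple'][1991]
print(apple_1991)

# toy['anchor'] is itself a dict, so we can add a new year to it.
toy['anchor'][1992] = 5

for word in toy:
    print(word, toy[word])
```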

In [None]:
def googlebooks_counts_by_year(filename):
    """Maps a Google books 1-grams file to a 2d dictionary
    giving each word's counts by year."""
    pass


Second task: Complete the function `googlebooks_year_collapse` so that it takes as input the output of `googlebooks_counts_by_year` and collapses
it down so that each word is associated with its single token count
for the full file, obtained by summing up all of the counts for the
years associated with that word. 

In [None]:
def googlebooks_year_collapse(d):
    """Convert the output of googlebooks_counts_by_year to 
    a simpler dict mapping words to counts."""
    pass


Write something at the top-level to call this code on the file `googlebooks-eng-all-1gram-20120701-a-sample`:

## Part 2: Regular Expressions in Python

Regular expressions – regex – are super useful when processing text in Python! However, there are so many different patterns and ways to use the `re` module that it is impossible to learn them all by heart. So instead, this part of the HW is for you to play around and get comfortable with using regex. Also, head to [regex101.com](http://www.regex101.com/) for a nice place to test out your regex patterns before running them here.  

In [None]:
p = re.compile('[a-z]+')
p

In [None]:
# try a few different things instead of 'tempo' - can you find the things that don't match?
m = p.match('tempo')
m

In [None]:
m.group()

In [None]:
m.start(), m.end()

In [None]:
print(p.match('::: message'))

In [None]:
m = p.search('::: message')
print(m)

In [None]:
p = re.compile('[a-z]+')
m = p.search( '2942a9vv4dxaq42' )
if m:
    print('Match found: ', m.group())
else:
    print('No match')

In [None]:
p.findall('2942a9vv4dxaq42')

In [None]:
# Shortcut pattern and matcher in one!
re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')

### Groups

In [None]:
p = re.compile('(a(b)c)d')
m = p.match('abcd')
m.group(0)

In [None]:
m.group(1)

In [None]:
m.group(2)

In [34]:
p = re.compile(r'(\d+)\s+(\w+)\s+St')
m = p.search('I live at 1345 Cowper St')
m.group(0)

'1345 Cowper St'

In [35]:
m.group(1)

'1345'

In [36]:
m.group(2)

'Cowper'

### Substitutions in a string

In [None]:
p = re.compile('(blue|white|red)')
p.sub('colour', 'blue socks and red shoes')

### Splitting on a regular expression

In [None]:
# A slightly better word tokenizer
# The re split method cuts the string wherever the regular
# expression MATCHES and returns the pieces in between
s = '''\'As wet as ever,\' said Alice
    in a melancholy tone: \'it doesn\'t seem to
    dry me at all.\''''
# raw string, so the backslash in \W reaches the re module unchanged
p = re.compile(r'\W+')
p.split(s)

In [None]:
# Again you can shortcut this.
re.split(r'\W+', s)

Finally, I might note that there's even more complex and sometimes useful stuff you can do with regex that hasn't yet been covered. You can find all the glorious and messy details in the Python 3 library documentation: Case insensitivity, non-capturing groups, ....
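To give a taste of two of those extras, here is a minimal sketch of case insensitivity (the `re.IGNORECASE` flag) and of a non-capturing group (written `(?:...)`). The example strings are made up.

```python
import re

# Case insensitivity: the flag makes 'from' match in any capitalization.
p = re.compile(r'from', re.IGNORECASE)
any_case = p.findall('From here, FROM there, from anywhere')
print(any_case)

# With a capturing group, findall() returns only what the group matched.
captured = re.findall(r'(blue|red) socks', 'blue socks and red socks')
print(captured)

# With a non-capturing group (?:...), the brackets still group the
# alternatives, but findall() returns the whole match.
whole = re.findall(r'(?:blue|red) socks', 'blue socks and red socks')
print(whole)
```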

## Assignment

In [None]:
tweets = (
    """@Becky17 - i'm having a little trouble "getting"<br /> the whole twibes thing (but sometimes u gotta just get in there and try it).  :-)""",
    """Oh .. and follow @Spyker3292, @Domness, @Karlkempobrien, @Duidl_Media and @Chasetastic. Cheers for the Congrats! :D | #FollowSaturday""",
    """blade--trinity;;; sweeeeeeet. :)""",
    """@renay Thanks Renay! $9,000 yay =)""",
    """@denvy can try :) drop a tweet with "##awaresg_tshirts" so i can <strong>track</strong> orders #awaresg""",
    """U need to chk out & follow here, a more beautiful animal not anywhere else! @EmmaRileySutton :) #followfriday""",
    """@LadyB84 Manchester United??? Really??? Breaks my heart :-(http://www.twitpic.com/4x1fn""",
    """Can't wait till tomorrow =D""",
    """Big Shot's Funeral » Google » Peoria making its case for Google ... http://cli.gs/Wa8za#heading1.""",
    """Contact email@address.org today""",
    """@linguist278: Variations on phone numbers: +1 (800) 123-4567, (800) 123-4567. Not a real tweet!""",
    """RT @StanfordPraglab: Mole Day is coming up. Theme is Animole Kingdom: http://en.wikipedia.org/wiki/Mole_Day #Holidays :-)"""
    )


Write a function that takes a list of strings, texts, and a regular expression, regex, as input and prints to standard output the subset that match regex.

In [None]:
def matcher(texts, regex):
    """Takes a list of strings texts as input and prints to standard output the subset that match regex."""
    pass


Use this function to test out writing a few regular expressions, testing on the data above.

In [None]:
def contains_hashtag(texts):
    """Uses matcher to find tweets that contain a hashtag. Assume a hashtag begins with # and has a non-null sequence of non-space characters after it."""
    pass


In [None]:
def contains_money(texts):
    """Uses matcher to find tweets that contain a money amount. Assume a money amount begins with $ and has a non-null sequence of digits and periods after it."""
    pass

Write a function that takes a list of strings, texts, and a regular expression, regex, as input and prints to standard output the substrings of each string that match regex.

In [None]:
def searcher(texts, regex):
    """Takes a list of strings texts as input and prints to standard output the substrings of each that match regex."""
    pass

Use this function to test out writing a few regular expressions, testing on the data above.

In [None]:
def smileys(texts):
    """Uses searcher to find smiley faces, such as :) that appear in the list of strings, texts."""
    pass

In [52]:
smileys(tweets)

:-)
:D
:)
=)
:)
:)
:-(
=D
:-)


In [53]:
example_text = ["Yes. :) I'm really happy. :D Except when I'm sad. :-("]
smileys(example_text)

:)
:D
:-(


Use the included file `words-english.txt` and search it 
for words that have a consonant cluster of 4 or more consonants at the end. We're just using a count of four orthographic consonants, not sounds (phonemes). We won't count y since it is usually a vowel at the end of words.

Complete the function below.

In [None]:
def final_consonant_clusters(filename):
    consonants = "bcdfghjklmnpqrstvwxz" # 'y' left out for added interest

In [None]:
final_consonant_clusters('words-english.txt')

Use the included file gaddafi.txt and write a regular expression to match instances of his surname. You should match the first 112 but not the last 8!

In [None]:
def gaddafi_matches(filename):
    pass

In [None]:
gaddafi_matches('gaddafi.txt')