# Useful starting stuff

In [None]:
%matplotlib inline
import matplotlib
import nltk

In [None]:
class ListTable(list):
    def _repr_html_(self):
        html = ["<table style='border: 1px solid black;'>"]
        for row in self:
            html.append("<tr>")
            for col in row:
                html.append("<td align='left' style='border: .5px solid gray;'>{0}</td>".format(col))
            
            html.append("</tr>")
        html.append("</table>")
        return ''.join(html)

# Collocations in Genesis

Look back at our work on collocations in Genesis.
I don't think it was very satisfying though.
Here you're going to try to do better.

Start by reading in Genesis again.

In [None]:
mygenesis = nltk.corpus.PlaintextCorpusReader("corpora", 'genesis.txt')

Create a frequency distribution of bigrams, using one of the approaches we used in Notebook 6.
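
If you want a reminder of the shape of this, here is a minimal sketch. It assumes the `nltk.bigrams` plus `nltk.FreqDist` approach and uses a short made-up token list in place of `mygenesis.words()`.

```python
import nltk

# Toy token list standing in for mygenesis.words() (an assumption for illustration).
tokens = ("in the beginning God created the heaven and the earth "
          "and the earth was without form").split()

# nltk.bigrams yields each pair of adjacent tokens; FreqDist counts them.
bigram_fd = nltk.FreqDist(nltk.bigrams(tokens))

# The most frequent bigrams, as (bigram, count) pairs.
top_bigrams = bigram_fd.most_common(3)
print(top_bigrams)
```

With the real corpus you would pass `mygenesis.words()` instead of `tokens`.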

For practice, make a plot of part of that distribution. If you can, do it straight from matplotlib, rather than using `FreqDist.plot()`.
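
One possible way to plot straight from matplotlib is a bar chart of the top few bigrams. This is only a sketch: the `(bigram, count)` pairs below are made up, and in practice they would come from something like `most_common(5)` on your own distribution.

```python
import matplotlib.pyplot as plt

# Made-up (bigram, count) pairs for illustration only; not real Genesis counts.
top = [(("and", "the"), 12), (("of", "the"), 9), (("in", "the"), 7),
       (("the", "earth"), 5), (("said", "unto"), 4)]

# Turn each bigram tuple into a readable tick label.
labels = [" ".join(bigram) for bigram, _ in top]
counts = [count for _, count in top]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(len(counts)), counts)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_ylabel("count")
ax.set_title("Most frequent bigrams")
fig.tight_layout()
```

With `%matplotlib inline` active, the figure shows up under the cell.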

In Notebook 6, we did a so-so job of finding real, interesting collocations in Genesis. Our goal here is to see if we can do any better. As a first step, if you already have some ideas how to do better, go ahead and try them here. I'll give you a cell to work in. But, if you don't have anything you want to try, you can go to the next step.

We essentially used the t statistic as a way to score possible collocations.
But there are many other measures that can be used.

NLTK provides some tools that make it easy to play with a few potential measures.
There's a little tutorial here: http://www.nltk.org/howto/collocations.html

Take a look at the tutorial, and figure out how to use some of the alternative measures there to score
bigrams in Genesis. Are any of them better than the t statistic? Feel free to combine these measures with anything we tried in class.
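
As a rough sketch of how the tutorial's pieces fit together, the cell below scores the same bigrams with three different measures. The token list is made up, and the frequency cutoff of 2 is an arbitrary choice for this tiny example; you would want a higher one for Genesis.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token list (an assumption); use mygenesis.words() for the real thing.
tokens = ("the lord god said unto the man and the lord god made the earth "
          "and the man said unto the lord god").split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Drop bigrams seen fewer than 2 times; measures like PMI tend to
# over-reward pairs that occur only once.
finder.apply_freq_filter(2)

# Collect the top 3 bigrams under each measure so they can be compared.
results = {}
for name in ("student_t", "pmi", "likelihood_ratio"):
    measure = getattr(bigram_measures, name)
    results[name] = finder.nbest(measure, 3)
    print(name, results[name])

# score_ngrams returns (bigram, score) pairs if you want the numbers too.
pmi_scores = dict(finder.score_ngrams(bigram_measures.pmi))
```

Comparing the lists side by side is a quick way to see where the measures disagree.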

For continuity with what we have done, you'll probably want to read in the Genesis corpus as I do below. (The line that creates `finder` is different from the one in the tutorial.) I'll get you started.

In [None]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = BigramCollocationFinder.from_words(mygenesis.words())
### continue from here

# MTMS explorations

## The machinery

Now you're going to do some work with the MTMS corpus. As a first step, you just need to run the code in the next cell. It gives you the machinery you need to read in the MTMS corpus in a useful way. It uses Python *classes*, which is advanced stuff. You can try to understand it if you like, but it's not necessary at this point.

In [None]:
import csv
import nltk
import os

class mtms_document:
    def __init__(self, base, filename):
        self._filename = filename
        self._paragraphs = []
        self._sentences = []
        self._words = []
        # Open with a context manager so the file is closed when we are done.
        with open(os.path.join(base, filename), newline="") as csvfile:
            csv_reader = csv.DictReader(csvfile, delimiter=",")
            # Each CSV row is treated as one paragraph.
            for r in csv_reader:
                raw_sentences = nltk.sent_tokenize(r["text"])
                new_sentences = [nltk.word_tokenize(sent) for sent in raw_sentences]
                self._sentences += new_sentences
                # Flatten this row's sentences into a single list of words.
                new_words = []
                for sent in new_sentences:
                    new_words += sent
                # Append once per row, after all of its sentences are collected.
                self._paragraphs.append(new_words)
                self._words += new_words
            
    def words(self):
        return self._words
    
    def paragraphs(self):
        return self._paragraphs
    
    def sentences(self):
        return self._sentences
    
    def name(self):
        return self._filename
    
class mtms_reader:
    def __init__(self):
        self.base = "corpora/mtms_csv"
        self._name_list = os.listdir(self.base)
        self.mtms_docs = {}
        self._words = []
        self._sentences = []
        self._paragraphs = []
        self._document_list = []
        ndocs = len(self._name_list)
        for n, fname in enumerate(self._name_list):
            if n % 100 == 0:
                print("processing doc {} of {}".format(n, ndocs))
            new_doc = mtms_document(self.base, fname)
            self.mtms_docs[fname] = new_doc
            self._words += new_doc.words()
            self._sentences += new_doc.sentences()
            self._paragraphs += new_doc.paragraphs()
        return
    
    def document_names(self):
        return self._name_list
    
    def words(self):
        return self._words
    
    def sentences(self):
        return self._sentences
    
    def paragraphs(self):
        return self._paragraphs
    
    def __getitem__(self, docname):
        return self.mtms_docs[docname]

## Some basics

Read in the corpus, then find the most common unigrams and bigrams.

In [None]:
mr = mtms_reader()
# Find unigrams and bigrams

See if you can discover any interesting or meaningful collocations.

## The Lexicon

The MTMS corpus was actually assembled for a particular reason.
The "Lexicon Project" is an international collaboration in which researchers from several countries, speaking multiple languages, have assembled what are essentially dictionaries of the language used by math teachers to talk about events in their classrooms.

Miriam Sherin and the Lexicon team here at Northwestern used surveys and interviews to assemble a lexicon that 
is supposed to characterize the terms used by middle school math teachers here in the U.S.
Most of the lexicon is now available in a file in your lists folder.

It's in a slightly complicated form. Each lexicon term has its own row, with a list of phrases separated by commas. The phrases on a row are supposed to be different forms in which a lexicon term might appear. The first entry on a row is the base term. 

You can open the file and look inside. Note that some lexicon "terms" have multiple words in them.

The code in the next cell pulls in the lexicon file for you and does some initial processing. In particular, it builds three lists for you:

* `base_terms`: This is a list of all of the base terms.

* `one_word_terms`: A list of just the one-word terms.

* `two_word_terms`: A list of just the two-word terms.


In [None]:
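# Read the lexicon: one term per row, with its forms separated by ", ".
with open("lists/lexicon.txt") as lexicon_file:
    lexicon_raw_groups = lexicon_file.read().split("\n")
# Skip blank rows (such as the one produced by a trailing newline).
all_lexicon_groups = [g.split(", ") for g in lexicon_raw_groups if g.strip()]
lexicon_dict = {}
for group in all_lexicon_groups:
    split_group = [tuple(t.split()) for t in group]
    if len(split_group[0]) == 1:
        # One-word base term: the key is a string, and each form is
        # reduced to its first word.
        group_name = split_group[0][0]
        lexicon_dict[group_name] = [t[0] for t in split_group]
    else:
        # Multi-word base term: the key and its forms are tuples of words.
        group_name = split_group[0]
        lexicon_dict[group_name] = split_group
base_terms = list(lexicon_dict.keys())
one_word_terms = [t for t in base_terms if isinstance(t, str)]
# Require a tuple so that two-letter one-word terms are not counted here.
two_word_terms = [t for t in base_terms if isinstance(t, tuple) and len(t) == 2]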

Using one of the base terms, you can look up all of its forms like this:

In [None]:
a_one_word_term = one_word_terms[0]
print(a_one_word_term)
lexicon_dict[a_one_word_term]

One of the reasons for assembling the MTMS corpus was to see if the words identified by the lexicon team are actually used frequently by teachers. (There are reasons that the MTMS corpus is good for this purpose, and reasons that it is not so good.)

Do some of this exploration yourself. First, how frequently do the one-word terms in the lexicon appear in the MTMS corpus?
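
One way to approach this, sketched under some assumptions: count every word once, then add up the counts of all the forms listed for each base term. The `words` list and the small `lexicon_dict` here are made-up stand-ins for `mr.words()` and the real lexicon, and lower-casing is my own choice.

```python
from collections import Counter

# Toy stand-ins (assumptions): a token list like mr.words() and a small
# lexicon_dict mapping each one-word base term to its list of forms.
words = ("the teacher asked students to explain and then explained "
         "why students justify").split()
lexicon_dict = {
    "explain": ["explain", "explained", "explaining"],
    "justify": ["justify", "justified"],
    "scaffold": ["scaffold", "scaffolding"],
}

# Lower-case so that sentence-initial uses are counted too.
word_counts = Counter(w.lower() for w in words)

# Sum the counts of every form listed for a base term.
term_counts = {base: sum(word_counts[form] for form in forms)
               for base, forms in lexicon_dict.items()}

# Occurrences per 1,000 tokens, which is easier to compare across corpora.
per_thousand = {base: 1000 * n / len(words) for base, n in term_counts.items()}
print(term_counts)
```

Terms with a count of zero are worth a look too, since they are lexicon entries that never show up in the text.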

How frequently do the two-word terms appear?
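
For the two-word terms, the same idea works on bigram counts instead of word counts. Again this is only a sketch with made-up data: `two_word_forms` imitates the tuple-valued entries of `lexicon_dict`, and any form that is not exactly two words long would simply never match a bigram.

```python
from collections import Counter

# Toy stand-ins (assumptions) for mr.words() and the two-word lexicon entries.
words = ("we used think pair share and then small group work "
         "before more small group work").split()
two_word_forms = {
    ("small", "group"): [("small", "group"), ("small", "groups")],
    ("wait", "time"): [("wait", "time")],
}

# Pair each word with its successor to get all adjacent bigrams, then count.
bigram_counts = Counter(zip(words, words[1:]))

# Sum the counts of every two-word form listed for a base term.
two_word_counts = {base: sum(bigram_counts[form] for form in forms)
                   for base, forms in two_word_forms.items()}
print(two_word_counts)
```

Counting over `mr.sentences()` one sentence at a time would avoid bigrams that straddle a sentence boundary.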

## Advanced Lexicon Tasks

### Finding terms that *should* be on the lexicon

If you still have some time, you can work on this: By exploring collocations, can you identify two-word phrases that you think *should* be added to the lexicon?
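
A possible starting point, assuming you use the NLTK collocation finder: score the bigrams as before, but filter out the ones the lexicon already knows about. The tokens and the `known` set below are made up for illustration, and likelihood ratio is just one choice of measure.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy stand-ins (assumptions) for mr.words() and the two-word lexicon forms.
tokens = ("students worked in small group settings and used number line models "
          "then another small group compared each number line").split()
known = {("small", "group")}

finder = BigramCollocationFinder.from_words(tokens)

# Keep only bigrams that occur at least twice.
finder.apply_freq_filter(2)

# apply_ngram_filter removes every bigram for which the function returns True,
# so this drops the phrases that are already in the lexicon.
finder.apply_ngram_filter(lambda w1, w2: (w1, w2) in known)

candidates = finder.nbest(BigramAssocMeasures().likelihood_ratio, 5)
print(candidates)
```

What is left still needs a human eye; a high score only says the words go together, not that the phrase belongs in a teacher's lexicon.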

### Identifying co-occurrence patterns among lexicon terms

Finally, this is a task that I have to work on this week for an ongoing research project. If you can make progress, that would actually be helpful. The question is this: Are there lexicon terms that seem to co-occur, either in sentences, paragraphs, or documents? If you can make some nice pictures for me, that would be nice.
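
Here is one rough way to get started, and it is only a sketch: for each sentence, work out which base terms are present, then count every pair of terms that shows up together. The sentences and the `form_to_base` mapping are made up; the real mapping would be built from `lexicon_dict`.

```python
from collections import Counter
from itertools import combinations

# Toy stand-ins (assumptions): tokenized sentences like mr.sentences(), and a
# mapping from each lexicon form back to its base term.
sentences = [
    ["students", "explain", "and", "justify"],
    ["we", "explained", "then", "justify", "again"],
    ["students", "scaffold", "and", "explain"],
]
form_to_base = {"explain": "explain", "explained": "explain",
                "justify": "justify", "scaffold": "scaffold"}

pair_counts = Counter()
for sent in sentences:
    # The set keeps each base term once per sentence; sorting makes
    # (a, b) and (b, a) land on the same key.
    present = sorted({form_to_base[w] for w in sent if w in form_to_base})
    pair_counts.update(combinations(present, 2))

print(pair_counts.most_common())
```

The same loop runs over `mr.paragraphs()` or over each document's words, and the resulting counts could be drawn as a heatmap or a network diagram.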