# Homework: Competitive Grammar Writing

### Group northernwolfpack 

* Chithra Bhat, cbhat
* Gustavo Felhberg, gfelhber
* Heikal Badrulhisham, hbadrulh
* Helen Zhang, hyz1

In order to develop the grammar files and the vocabulary, we decided to try two different approaches and, in the end, chose the one with the best results.

### Penn Treebank (best results)

The first approach was to extract grammatical rules from the syntactic trees in the Penn Treebank sample available in the Python package NLTK (nltk.corpus.treebank).

We first extracted all the parse trees available in the treebank corpus and converted them to Chomsky Normal Form using the NLTK method 'tree.chomsky_normal_form()'. Then, for each tree, we extracted its rules using a method get_rules(tree) developed by our group. In the same way, we extracted the rules present in the 'devset.trees' file and combined them with those extracted from the Penn Treebank. After merging the rules, we counted their frequencies and saved the grammar file in PCFG format. The source code with the methods used to extract the grammar is available in the files 'get_treebank_grammar.py' and 'get_devset_rules.py'.
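The rule extraction and counting steps can be sketched without any dependencies. This is only an illustration: `parse_tree` and this version of `get_rules` are hypothetical stand-ins (the group's own `get_rules(tree)` is not shown here), the two example trees are made up, and the tab-separated "count, left-hand side, right-hand side" line format is an assumption about the grammar file.

```python
from collections import Counter

def parse_tree(text):
    """Parse a bracketed tree such as '(S (NP I) (VP run))' into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token              # a leaf word
        node = [tokens[pos]]          # first item is the node label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                      # skip the closing ")"
        return node

    return read()

def get_rules(tree):
    """Yield one (lhs, rhs) pair per internal node, top-down."""
    if isinstance(tree, str):
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield lhs, rhs
    for child in children:
        yield from get_rules(child)

# Made-up example trees standing in for the treebank and devset trees.
trees = [
    "(S (NP (Det the) (Noun knight)) (VP (Verb rides)))",
    "(S (NP (Det the) (Noun king)) (VP (Verb speaks)))",
]
counts = Counter(rule for t in trees for rule in get_rules(parse_tree(t)))

# Assumed output format: count <TAB> left-hand side <TAB> right-hand side.
grammar_lines = [f"{n}\t{lhs}\t{' '.join(rhs)}"
                 for (lhs, rhs), n in counts.most_common()]
```

With NLTK trees, the same pairs are available through each tree's `productions()` method, so a helper like this is only needed when working from raw bracketed strings.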

The vocabulary was built using the following corpora: brown, treebank, conll2000 and nps_chat. After filtering out the words that were not allowed, we obtained the part of speech of each remaining word and saved them in the Vocab.gr file. The source code with the methods to extract the vocabulary is available in the file 'build_vocab.py'.
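A minimal sketch of the filtering and counting step, under stated assumptions: the tagged words and the allowed-word set below are made up (in the project they came from the NLTK corpora and the assignment's word list), `build_vocab` is a hypothetical helper name, and the tab-separated output format is assumed.

```python
from collections import Counter

# Hypothetical inputs for illustration only.
tagged_words = [("the", "DT"), ("castle", "NN"), ("rides", "VBZ"),
                ("castle", "NN"), ("laptop", "NN"), ("rides", "NNS")]
allowed_words = {"the", "castle", "rides"}

def build_vocab(tagged, allowed):
    """Count (tag, word) pairs, keeping only words in the allowed set."""
    return Counter((tag, word) for word, tag in tagged if word in allowed)

vocab = build_vocab(tagged_words, allowed_words)

# Assumed output format: count <TAB> POS tag <TAB> word.
vocab_lines = [f"{n}\t{tag}\t{word}" for (tag, word), n in sorted(vocab.items())]
```

Keeping one entry per (tag, word) pair lets an ambiguous word such as "rides" appear under each of its tags, weighted by how often each tag was observed.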



#### Main files:

* get_treebank_grammar.py:

    * extract the trees from Penn Treebank
    * convert to CNF
    * extract the rules from trees
    * add rules extracted from the devset trees
    * count the rule frequencies
    * add misc rules
    * save rules into grammar file
    
    
* get_devset_rules.py:

    * extract trees from devset.trees file
    * convert to CNF
    * extract rules from trees
    * save 'devset_rules.txt' file


* build_vocab.py:

    * extract the tagged words from the corpora treebank, conll2000, brown and nps_chat
    * keep only the allowed words
    * extract the part of speech (POS) of each word
    * count the occurrences of each word with its respective POS
    * read vocab included manually in file 'manual_vocab.txt'
    * save the final vocabulary file
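The "convert to CNF" step in the scripts above relies on NLTK's 'tree.chomsky_normal_form()'. To show the idea behind it, here is a dependency-free sketch of the binarization part only: a rule with more than two symbols on the right is split into a chain of binary rules. The function name `binarize` and the naming scheme for the intermediate symbols are illustrative assumptions, and unary rules are left untouched.

```python
def binarize(lhs, rhs):
    """Split a rule with more than two right-hand symbols into binary rules.

    Intermediate symbols are named after the parent and the symbols still
    to be covered, e.g. 'NP|<Adj-Noun>'. The naming is illustrative only.
    """
    parent = lhs
    rhs = list(rhs)
    rules = []
    while len(rhs) > 2:
        rest = rhs[1:]
        new_symbol = f"{parent}|<{'-'.join(rest)}>"
        rules.append((lhs, (rhs[0], new_symbol)))
        lhs, rhs = new_symbol, rest
    rules.append((lhs, tuple(rhs)))
    return rules

rules = binarize("NP", ("Det", "Adj", "Noun"))
```

Here `rules` holds NP -> Det NP|<Adj-Noun> followed by NP|<Adj-Noun> -> Adj Noun, while rules that are already binary pass through unchanged.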


### Stanford Parser

The second approach was to use the Stanford Parser (https://nlp.stanford.edu/software/lex-parser.shtml) to extract the vocabulary and the grammatical rules. We took the text of Monty Python and the Holy Grail available in the NLTK library, tokenized it into sentences, and used the parser to extract the rules. The source code with the methods to extract the grammar and vocabulary is available in the /hw1/stanford_parser/ folder. The main file, stanford_parser.ipynb, is a Jupyter Notebook that describes the steps needed to generate the grammar and vocabulary with this method.

#### Main files:

* stanford_parser.ipynb:

    * load the Monty Python and the Holy Grail corpus
    * extract the vocabulary and calculate the frequency, respecting the allowed words restriction
    * extract the sentences from the corpus
    * read the Monty Python sentences and generate a pandas dataframe
    * clean the strings (remove parts like 'SCENE 1:', 'KING ARTHUR:', '[wind]', '[clop clop clop]', etc.)
    * extract trees, convert to CNF and extract the rules
    * calculate the frequency of the rules
    * save into a grammar file    
    
    
* parse_sentence.ipynb:

    * parse individual sentences and output the vocab with POS and the grammar rules as in the image below:

<img src="./stanford_parser/parse_sentence.png">
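The string-cleaning step in stanford_parser.ipynb can be sketched with a few regular expressions. The patterns below are assumptions written to match the examples listed above ('SCENE 1:', 'KING ARTHUR:', '[wind]'); the notebook's actual expressions may differ, and `clean_line` is a hypothetical name.

```python
import re

def clean_line(line):
    """Strip scene headers, speaker labels and bracketed stage directions."""
    line = re.sub(r"\[[^\]]*\]", " ", line)                # '[wind]', '[clop clop clop]'
    line = re.sub(r"^\s*SCENE \d+:\s*", "", line)          # 'SCENE 1:'
    line = re.sub(r"^\s*[A-Z][A-Z #0-9]*:\s*", "", line)   # 'KING ARTHUR:', 'SOLDIER #1:'
    return re.sub(r"\s+", " ", line).strip()               # collapse leftover whitespace

cleaned = clean_line("KING ARTHUR: Whoa there!  [clop clop clop]")
```

Lines that contain only a scene header and stage directions come back as empty strings, so they can be dropped before parsing.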


### Unsupervised taggers:

To evaluate taggers on the treebank corpus, we started with unsupervised taggers.
* nltk.DefaultTagger: The default tagger assigns a single tag to every token; we used NN, the most frequent tag.
* nltk.RegexpTagger: The regular expression tagger assigns tags to tokens based on matching patterns. On its own it captures only very common regularities of the language, so it tags only a few sentences of the whole corpus correctly.
    
### Supervised taggers:

* nltk.UnigramTagger: A unigram tagger is based on a simple statistical algorithm: for each token, it assigns the tag that is most frequent for that token in the training data.
* nltk.BigramTagger: A bigram tagger is an n-gram tagger with n = 2. An n-gram tagger generalizes the unigram tagger: its context is the current word together with the part-of-speech tags of the n-1 preceding tokens.

### Combining taggers:
One way to address the trade-off between accuracy and coverage is to use the more accurate algorithms when we can and fall back on algorithms with wider coverage when necessary. As in the Stanford Parser approach, we used the text of Monty Python and the Holy Grail available in the NLTK library, tokenized the sentences, and used the parser to extract the rules.

* Approach
    * Tag the tokens with the bigram tagger.
    * If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
    * If the unigram tagger is also unable to find a tag, use a regex/default tagger.
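The fallback chain above can be illustrated with a small dependency-free sketch. It is not the notebook's code: the training sentences are made up, `train` and `tag` are hypothetical helpers, and the regular expressions are illustrative only. In NLTK the same chaining is obtained by passing one tagger as the `backoff` argument of the next.

```python
import re
from collections import Counter, defaultdict

def train(tagged_sents):
    """Build most-frequent-tag tables for unigram and bigram contexts."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = None                      # previous tag; None at sentence start
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            prev = tag

    def best(table):
        return {key: c.most_common(1)[0][0] for key, c in table.items()}

    return best(uni), best(bi)

# Illustrative patterns only; the project's actual expressions are not shown.
PATTERNS = [(r".*ing$", "VBG"), (r".*ed$", "VBD"), (r"^\d+$", "CD")]

def tag(words, uni, bi, default="NN"):
    """Try the bigram table, then the unigram table, then regex, then default."""
    tagged, prev = [], None
    for word in words:
        t = bi.get((prev, word)) or uni.get(word)
        if t is None:
            t = next((p_tag for pat, p_tag in PATTERNS if re.match(pat, word)),
                     default)
        tagged.append((word, t))
        prev = t
    return tagged

# Made-up training data standing in for the Brown corpus tagged sentences.
train_sents = [[("the", "DT"), ("knight", "NN"), ("rides", "VBZ")],
               [("the", "DT"), ("rides", "NNS"), ("end", "VBP")]]
uni, bi = train(train_sents)
result = tag(["the", "rides", "galloping", "swallow"], uni, bi)
```

In this toy run, "rides" is resolved by its bigram context (it follows a DT tag), "galloping" is unseen and falls through to the regex patterns, and "swallow" ends up with the default tag.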
    
#### Main files:

* temp/chithra/cgw-default.ipynb:

    * Load the Monty Python and the Holy Grail text file
    * Train the model (the combined taggers) on the Brown corpus tagged sentences
    * Provide Monty Python corpus sentences as test set for the model 
    * For every word, identify POS and save that to a Pandas dataframe
    * Save this dataframe as vocabulary file.
    * Read Monty Python sentences
    * Extract trees, convert to CNF and extract the rules
    * Save the results to a grammar file  
    
However, the grammar generated by the Penn Treebank approach gave better results, so we decided to use it as our official result.