## 1. Syntactic Patterns for Technical Terms ##

In [5]:
import nltk, re
from nltk.corpus import brown
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize
from nltk.util import ngrams
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

As seen in the Manning and Schuetze chapter, Justeson and Katz defined a well-known 
part-of-speech pattern for identifying simple noun phrases that often works well 
for pulling out keyphrases.

 Technical Term  T = (A | N)+ (N | C)  | N
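
One way to read the pattern is as an ordinary regular expression over tag classes. The sketch below is plain Python, not the RegexpParser chunker the exercise asks for: it maps each tag to a single letter and matches `[AN]+[NC]|N` against the resulting string. The helper names (`tag_class`, `technical_terms`) are hypothetical, and the mapping takes *N* strictly as tags starting with `NN`, which is narrower than the liberal reading the exercise allows.

```python
import re

def tag_class(tag):
    """Map a Brown tag to A (adjective), N (noun), C (cardinal) or x (other)."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):  # strict reading of N (assumption)
        return "N"
    if tag.startswith("CD"):
        return "C"
    return "x"

# T = (A | N)+ (N | C) | N, written over the one-letter tag classes
TERM = re.compile(r"[AN]+[NC]|N")

def technical_terms(tagged_sent):
    """Return the word sequences whose tags match the pattern."""
    classes = "".join(tag_class(tag) for _, tag in tagged_sent)
    return [
        [word for word, _ in tagged_sent[m.start():m.end()]]
        for m in TERM.finditer(classes)
    ]

sent = [("the", "AT"), ("current", "JJ"), ("fiscal", "JJ"), ("year", "NN"),
        ("ends", "VBZ"), ("in", "IN"), ("Aug.", "NP")]
print(technical_terms(sent))  # [['current', 'fiscal', 'year']]
```

Under this strict mapping `Aug./NP` is not matched; widening *N* to cover proper nouns would pick it up.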

Below, write a function to define a chunker using the RegexpParser, as illustrated in the NLTK book, Chapter 7, Section 2.3, *Chunking with Regular Expressions*. You'll need to revise the grammar rules shown there to match the pattern shown above. You can be liberal with your definition of what is meant by *N* here. Also, *C* refers to a cardinal number, which is tagged CD in the Brown corpus.



In [4]:
technical_term = r"""
            T:  {<JJ|NN.*>+ <NN.*|CD>|<NN.*>}
                {<N.*>+}"""

Below, write a function to call the chunker, run it on some sentences, and then print out the results for  those sentences.

For uniformity, please run it on sentences 100 through 104 from the full tagged brown corpus.
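
Assuming the sentence numbers are counted from 1, "sentences 100 through 104" correspond to the zero-based indices 99 through 103, so the slice is `[99:104]`. A small sketch of that arithmetic, with a hypothetical helper `sentence_slice` and a plain list standing in for the corpus:

```python
def sentence_slice(first, last):
    """Slice selecting 1-based sentences first..last, inclusive (assumed convention)."""
    return slice(first - 1, last)

# Stand-in for brown.tagged_sents(): 's1' is the first sentence, 's2' the second, ...
sents = ["s%d" % i for i in range(1, 201)]

picked = sents[sentence_slice(100, 104)]
print(picked)  # ['s100', 's101', 's102', 's103', 's104']
```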

 

In [5]:
def chunks(sent_list):
    brown_chunks = []
    cp = nltk.RegexpParser(technical_term)
    for sentence in sent_list:
        result = cp.parse(sentence)
        brown_chunks.append(result)
    return brown_chunks

# Sentences 100 through 104 (counting from 1) are indices 99..103, i.e. the slice [99:104]
chunks(nltk.corpus.brown.tagged_sents()[99:104])

[Tree('S', [(u'--', u'--'), Tree('T', [(u'Committee', u'NN'), (u'approval', u'NN')]), (u'of', u'IN'), Tree('T', [(u'Gov.', u'NN-TL')]), Tree('T', [(u'Price', u'NP'), (u"Daniel's", u'NP$')]), (u'``', u'``'), (u'abandoned', u'VBN'), Tree('T', [(u'property', u'NN')]), (u"''", u"''"), Tree('T', [(u'act', u'NN')]), (u'seemed', u'VBD'), (u'certain', u'JJ'), Tree('T', [(u'Thursday', u'NR')]), (u'despite', u'IN'), (u'the', u'AT'), Tree('T', [(u'adamant', u'JJ'), (u'protests', u'NNS')]), (u'of', u'IN'), Tree('T', [(u'Texas', u'NP')]), Tree('T', [(u'bankers', u'NNS')]), (u'.', u'.')]),
 Tree('S', [Tree('T', [(u'Daniel', u'NP')]), (u'personally', u'RB'), (u'led', u'VBD'), (u'the', u'AT'), Tree('T', [(u'fight', u'NN')]), (u'for', u'IN'), (u'the', u'AT'), Tree('T', [(u'measure', u'NN')]), (u',', u','), (u'which', u'WDT'), (u'he', u'PPS'), (u'had', u'HVD'), (u'watered', u'VBN'), (u'down', u'RP'), (u'considerably', u'RB'), (u'since', u'IN'), (u'its', u'PP$'), Tree('T', [(u'rejection', u'NN')]), (u'by


Then extract the phrases themselves from sentences 100 through 160, using the subtree extraction technique shown in the 
*Exploring Text Corpora* category.

In [7]:
print(nltk.corpus.brown.tagged_sents()[99:160])

[[(u'--', u'--'), (u'Committee', u'NN'), (u'approval', u'NN'), (u'of', u'IN'), (u'Gov.', u'NN-TL'), (u'Price', u'NP'), (u"Daniel's", u'NP$'), (u'``', u'``'), (u'abandoned', u'VBN'), (u'property', u'NN'), (u"''", u"''"), (u'act', u'NN'), (u'seemed', u'VBD'), (u'certain', u'JJ'), (u'Thursday', u'NR'), (u'despite', u'IN'), (u'the', u'AT'), (u'adamant', u'JJ'), (u'protests', u'NNS'), (u'of', u'IN'), (u'Texas', u'NP'), (u'bankers', u'NNS'), (u'.', u'.')], [(u'Daniel', u'NP'), (u'personally', u'RB'), (u'led', u'VBD'), (u'the', u'AT'), (u'fight', u'NN'), (u'for', u'IN'), (u'the', u'AT'), (u'measure', u'NN'), (u',', u','), (u'which', u'WDT'), (u'he', u'PPS'), (u'had', u'HVD'), (u'watered', u'VBN'), (u'down', u'RP'), (u'considerably', u'RB'), (u'since', u'IN'), (u'its', u'PP$'), (u'rejection', u'NN'), (u'by', u'IN'), (u'two', u'CD'), (u'previous', u'JJ'), (u'Legislatures', u'NNS-TL'), (u',', u','), (u'in', u'IN'), (u'a', u'AT'), (u'public', u'JJ'), (u'hearing', u'NN'), (u'before', u'IN'), (u'th

In [8]:
brown_chunks = chunks(nltk.corpus.brown.tagged_sents()[100:161]) 
for tree in brown_chunks:
    for subtree in tree.subtrees():
        if subtree.label() == 'T':
            print(subtree)

(T Daniel/NP)
(T fight/NN)
(T measure/NN)
(T rejection/NN)
(T previous/JJ Legislatures/NNS-TL)
(T public/JJ hearing/NN)
(T House/NN-TL Committee/NN-TL)
(T Revenue/NN-TL)
(T Taxation/NN-TL)
(T committee/NN rules/NNS)
(T subcommittee/NN)
(T week/NN)
(T questions/NNS)
(T committee/NN members/NNS)
(T bankers/NNS)
(T witnesses/NNS)
(T doubt/NN)
(T passage/NN)
(T Daniel/NP)
(T estimate/NN)
(T dollars/NNS)
(T deficit/NN)
(T dollars/NNS)
(T end/NN)
(T current/JJ fiscal/JJ year/NN)
(T Aug./NP)
(T committee/NN)
(T measure/NN)
(T means/NNS)
(T escheat/NN law/NN)
(T books/NNS)
(T Texas/NP)
(T republic/NN)
(T state/NN)
(T bank/NN accounts/NNS)
(T stocks/NNS)
(T personal/JJ property/NN)
(T persons/NNS)
(T years/NNS)
(T bill/NN)
(T Daniel/NP)
(T banks/NNS)
(T insurance/NN firms/NNS)
(T pipeline/NN companies/NNS)
(T corporations/NNS)
(T such/JJ property/NN)
(T state/NN treasurer/NN)
(T escheat/NN law/NN)
(T such/JJ property/NN)
(T Daniel/NP)
(T Dewey/NP Lawrence/NP)
(T Tyler/NP)
(T lawyer/NN)
(T Texas

## 2. Identify Proper Nouns ##
For this next task, write a new version of the chunker, but this time change it in two ways:
 1. Make it recognize proper nouns
 2. Make it work on your personal text collection.

Note that the second requirement means that you need to run a tagger over your personal text collection before you design the proper noun recognizer. You can use a pre-trained tagger or train your own on one of the existing tagged collections (brown, conll, or treebank).
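
If you train your own tagger, a common design is a backoff chain: use the most specific model that knows the word, and fall back to a simpler one otherwise. The sketch below shows the idea in plain Python with a single unigram model backing off to a default tag. It is an illustration under simplifying assumptions, not NLTK's implementation, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_with_backoff(words, unigram_model, default_tag="NN"):
    """Tag known words from the model; back off to default_tag for unseen words."""
    return [(w, unigram_model.get(w, default_tag)) for w in words]

# Tiny hand-tagged training set (toy data for illustration)
train = [[("the", "AT"), ("Republican", "NP"), ("politicians", "NNS")],
         [("the", "AT"), ("American", "NP"), ("people", "NNS")]]
model = train_unigram(train)

print(tag_with_backoff(["the", "American", "voters"], model))
# [('the', 'AT'), ('American', 'NP'), ('voters', 'NN')]
```

The unseen word `voters` receives the default tag, which is why adding a few hand-tagged sentences from your own collection can change how frequent domain words are tagged.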



**Tagger:** Your code for optionally training a tagger, and for running the tagger on your personal collection, goes here:

In [48]:
def create_data_sets(sentences):
    size = int(len(sentences) * 0.9)
    train_sents = sentences[:size]
    test_sents = sentences[size:]
    return train_sents, test_sents

def build_backoff_tagger (train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    t3 = nltk.TrigramTagger(train_sents, backoff=t2)
    return t3


def train_tagger(already_tagged_sents):
    train_sents, test_sents = create_data_sets(already_tagged_sents)
    ngram_tagger = build_backoff_tagger(train_sents)
    print ("%0.3f pos accuracy on test set" % ngram_tagger.evaluate(test_sents))
    return ngram_tagger

In [2]:
def tokenize_text(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus)  # Split text into sentences
    return [nltk.word_tokenize(sent) for sent in raw_sents]  # Split each sentence into words

In [59]:
def train_tagger_on_brown():
    modified_speech_sents = [[('common', 'JJ'), ('Hard-working', 'JJ'), ('people', 'NNS'), ('.', '.')],
                        [("I'm", 'PPSS+BEM'), ('a', 'AT'), ('Republican', 'NP'), ('.', '.')],
                        [("I'm", 'PPSS+BEM'), ('Republican', 'NP'),('.', '.')], 
                        [('the', 'AT'), ('Republican', 'NP'), ('politicians', 'NNS'), ('.', '.')],
                        [('the', 'AT'), ('American', 'NP'), ('people', 'NNS'), ('.', '.')]]


    brown_tagged_sents = brown.tagged_sents(categories=['adventure', 'editorial', 'fiction', 'government', 'hobbies',
    'humor', 'learned', 'lore', 'mystery', 'religion', 'reviews', 'romance'])
    
    # Prepend the hand-tagged speech sentences to the Brown training data
    all_tagged_sents = modified_speech_sents + brown_tagged_sents
    return train_tagger(all_tagged_sents)

In [60]:
brown_tagger = train_tagger_on_brown()

0.909 pos accuracy on test set


In [61]:
def tagme(sents, tagger):
    return [tagger.tag(sent) for sent in sents]

In [3]:
with open("speeches.txt") as w:
    text = w.read()

In [6]:
text = text.replace('\ufeff', '')
new_text = re.sub('[\n]+','\n', text)
sent_text = tokenize_text(new_text)

In [7]:
sent_text[0]

['SPEECH', '1', '...', 'Thank', 'you', 'so', 'much', '.']

In [69]:
tagged_sents = tagme(sent_text,brown_tagger)

**Chunker:** Code for the proper noun chunker goes here:

In [70]:
grammar = "NC: {<NP.*>+}"

In [71]:
def my_chunker(sent_list):
    mychunks = []
    cp = nltk.RegexpParser(grammar)
    for sentence in sent_list:
        result = cp.parse(sentence)
        mychunks.append(result)
    return mychunks

**Test the Chunker:** Test your proper noun recognizer on a lot of sentences to see how well it is working.  You might want to add prepositions in order to improve your results.  
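
To illustrate what adding prepositions could buy, the sketch below joins runs of proper nouns that are linked by a preposition, so that a name like "University of Iowa" comes out as one phrase. It is plain Python over one-letter tag classes rather than a RegexpParser grammar, the helper names are hypothetical, and the example sentence is hand-tagged for illustration (a real tagger may assign different tags).

```python
import re

def tag_class(tag):
    """Map a tag to P (proper noun), i (preposition) or x (other)."""
    if tag.startswith("NP"):
        return "P"
    if tag == "IN":
        return "i"
    return "x"

# One or more proper nouns, optionally continued by preposition + proper nouns
PROPER = re.compile(r"P+(?:iP+)*")

def proper_noun_phrases(tagged_sent):
    """Return proper-noun phrases, allowing an internal preposition."""
    classes = "".join(tag_class(tag) for _, tag in tagged_sent)
    return [" ".join(word for word, _ in tagged_sent[m.start():m.end()])
            for m in PROPER.finditer(classes)]

sent = [("the", "AT"), ("University", "NP"), ("of", "IN"), ("Iowa", "NP"),
        ("is", "BEZ"), ("in", "IN"), ("Iowa", "NP"), (".", ".")]
print(proper_noun_phrases(sent))  # ['University of Iowa', 'Iowa']
```

Note the trade-off: any preposition between two proper nouns is absorbed, so a sequence like "Bush in Iowa" would also be merged. Restricting the link to specific words such as "of" may be safer.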


In [120]:
chunks(nltk.corpus.brown.tagged_sents()[105:125])

[Tree('S', [('It', 'PPS'), ('permits', 'VBZ'), ('the', 'AT'), Tree('T', [('state', 'NN')]), ('to', 'TO'), ('take', 'VB'), ('over', 'RP'), Tree('T', [('bank', 'NN'), ('accounts', 'NNS')]), (',', ','), Tree('T', [('stocks', 'NNS')]), ('and', 'CC'), ('other', 'AP'), Tree('T', [('personal', 'JJ'), ('property', 'NN')]), ('of', 'IN'), Tree('T', [('persons', 'NNS')]), ('missing', 'VBG'), ('for', 'IN'), ('seven', 'CD'), Tree('T', [('years', 'NNS')]), ('or', 'CC'), ('more', 'AP'), ('.', '.')]),
 Tree('S', [('The', 'AT'), Tree('T', [('bill', 'NN')]), (',', ','), ('which', 'WDT'), Tree('T', [('Daniel', 'NP')]), ('said', 'VBD'), ('he', 'PPS'), ('drafted', 'VBD'), ('personally', 'RB'), (',', ','), ('would', 'MD'), ('force', 'VB'), Tree('T', [('banks', 'NNS')]), (',', ','), Tree('T', [('insurance', 'NN'), ('firms', 'NNS')]), (',', ','), Tree('T', [('pipeline', 'NN'), ('companies', 'NNS')]), ('and', 'CC'), ('other', 'AP'), Tree('T', [('corporations', 'NNS')]), ('to', 'TO'), ('report', 'VB'), Tree('T'

**FreqDist Results:** After you have your proper noun recognizer working to your satisfaction, run it over your entire collection below, feed the results into a FreqDist, and then print out the top 20 proper nouns by frequency. That code goes here, along with the output:
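
The counting step can be sketched without NLTK, since in NLTK 3 `FreqDist` is built on `collections.Counter` and shares its `most_common` method. The chunk data below is a small made-up stand-in for the leaves of the `NC` subtrees, not real output.

```python
from collections import Counter

# Stand-in for the leaves of each NC chunk: lists of (word, tag) pairs (toy data)
chunk_leaves = [
    [("China", "NP")],
    [("Saudi", "NP"), ("Arabia", "NP")],
    [("China", "NP")],
    [("Iowa", "NP")],
]

# Join the words of each chunk into one phrase, dropping the tags
phrases = [" ".join(word for word, _ in leaves) for leaves in chunk_leaves]

counts = Counter(phrases)
print(counts.most_common(1))  # [('China', 2)]
```

Joining multi-word chunks before counting keeps "Saudi Arabia" as a single entry instead of two separate words.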


In [123]:
mychunks = my_chunker(tagged_sents)
for tree in mychunks:
    for subtree in tree.subtrees():
        if subtree.label() == 'NC': 
            print(subtree)

(NC Steve/NP)
(NC David/NP)
(NC Iowa/NP)
(NC Iowa/NP)
(NC Washington/NP)
(NC Republican/NP)
(NC Republican/NP)
(NC Republicans/NPS)
(NC China/NP)
(NC China/NP)
(NC China/NP)
(NC Mexico/NP-TL)
(NC China/NP)
(NC Iraq/NP)
(NC Donald/NP)
(NC Iran/NP)
(NC York/NP-TL)
(NC Republican/NP)
(NC Ryan/NP)
(NC Democrats/NPS)
(NC Iran/NP)
(NC Syria/NP)
(NC Syria/NP)
(NC Syria/NP)
(NC Midwest/NP)
(NC China/NP)
(NC Ohio/NP)
(NC Saudi/NP Arabia/NP)
(NC China/NP)
(NC George/NP Washington/NP)
(NC LaGuardia/NP)
(NC Mr./NP)
(NC Kennedy/NP)
(NC Chinese/NPS)
(NC Korea/NP-TL)
(NC Korea/NP-TL)
(NC China/NP)
(NC Korea/NP-TL)
(NC Korea/NP-TL)
(NC Massachusetts/NP)
(NC Bush/NP)
(NC Bush/NP)
(NC Bush/NP)
(NC Abraham/NP Lincoln/NP)
(NC Roberts/NP)
(NC Jeb/NP)
(NC Roberts/NP)
(NC Roberts/NP)
(NC Jeb/NP Bush/NP)
(NC Iran/NP)
(NC America/NP-TL)
(NC Pennsylvania/NP)
(NC Congress/NP)
(NC America/NP)
(NC America/NP)
(NC Nazis/NPS)
(NC Democrats/NPS)
(NC Republicans/NPS)
(NC Mr./NP)
(NC Iraq/NP)
(NC Egypt/NP)
(NC Syria/NP

In [80]:
proper_noun_list = []
for tree in mychunks:
    for subtree in tree.subtrees():
        if subtree.label() == 'NC': 
            proper_noun_list.append(subtree.leaves())

In [116]:
updated_proper_noun_list = []
for item in proper_noun_list:
    # Each item is a list of (word, tag) pairs; join the words and drop the tags
    updated_proper_noun_list.append(' '.join(word for word, tag in item))

In [117]:
fdist = nltk.FreqDist(updated_proper_noun_list)
fdist.most_common(20)

[('China', 194),
 ('America', 175),
 ('Mexico', 150),
 ('Hillary', 145),
 ('Iowa', 115),
 ('Iran', 82),
 ('Israel', 80),
 ('Iraq', 68),
 ('Japan', 63),
 ('Florida', 52),
 ('Donald', 51),
 ('Hampshire', 44),
 ('Americans', 43),
 ('Carolina', 40),
 ('York', 40),
 ('Mr.', 36),
 ('Democrats', 35),
 ('Cruz', 35),
 ('Republicans', 35),
 ('Korea', 34)]