## Assignment Steps

The Algorithm uses a combination of chunking, Trigram Collocation and Wordnet Hypernyms to help gather relevant information regarding the text. 

The steps I took are as follows:

1. Read in file and tokenize by sentence and word using train backoff tagger
2. Normalize by only accepting letters (used regex), removing stop words
3. Use collocation algorithms to find frequently occuring Bigrams and Trigrams
4. Remove all Bigrams with verb occurrences to narrow BiGram set 
5. Order the Bigrams and Trigrams using PMI, and Chi-Square
6. Use chunking on Noun Phrases based on a defined grammar to extract key chunks of information
7. Extract most common Hyperterms of the most common unigrams occurring in the text
8. Compare and Contrast different results to establish best Gist result

**Import Statements to gather relevant packages**

In [2]:
import nltk, re, string
import pandas as pd
from nltk.collocations import *
from nltk.stem import WordNetLemmatizer
from nltk.corpus import brown
from nltk import word_tokenize
from nltk.tokenize import regexp_tokenize
from nltk.util import ngrams
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
from nltk.collocations import *
from nltk.corpus import brown
import urllib.request
from nltk.corpus import stopwords
from nltk.collocations import *
from collections import defaultdict
from nltk.corpus import wordnet as wn

**Global Variables**

In [3]:
punctuations = list(string.punctuation)
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

**Grammar Rule to be used for chunking**

As we see, before arriving at this Grammar rule, I worked through multiple grammar rules to help develop a relevant noun phrase chunker which would yield the most appropriate and relevant results for chunking of the text

Some example grammar rules tried:

In [6]:
# NounPhrase: \{\<J\.\*>\<N.*>}
# NounPhrase: {(<J.*>*<N.*>+(<CD>*)(<CC>*)(<DT>*)(<IN>*)(<POS>*)<J.*>*<N.*>+)}
# NounPhrase: {((<DT>*)(<POS>*)(<IN>*)(<FW.*>*)<J.*>*<NN.*>+)+}

In [4]:
grammar = r"""
            NounPhrase: {((<DT>*)(<POS>*)(<FW.*>*)<J.*>*<N.*>+)+}"""

**Reading in the text file from the local user**

In [5]:
def readtextfile(file):
    with open(file) as w:
        text = w.read()
    return text

**Read URL file**

In [6]:
def read_file_url(url):
    with urllib.request.urlopen(url) as url:
        f_in =  url.read().decode(url.headers.get_content_charset())
        f_in = f_in.strip().split()
        f_in = " ".join(f_in)
    return f_in

**Preprocessing of the raw text**

This text processing is specific to the Donal Trump Speeches file

In [7]:
def text_preprocessing_trumpspeeches(text):
    text = text.replace('\ufeff', '')
    new_text = re.sub('[\n]+','\n', text)
    return new_text

**My function to use the above created grammar rule to parse through the text and extract relevant chunks**

In [8]:
def my_chunker(sent_list):
    mychunks = []
    cp = nltk.RegexpParser(grammar)
    for sentence in sent_list:
        result = cp.parse(sentence)
        mychunks.append(result)
    return mychunks

** Word Tokenizer **

In [9]:
def word_tokenizer(text):
    pattern = r'''(?x)  # set flag to allow verbose regexps
     (?:[A-Z]\.)+[A-Z]*        # abbreviations, e.g. U.S.A.
    | [a-zA-Z]+(?:[-'][a-zA-Z]+)*            # words with optional internal hyphens or apostrophes         
    | \$?\d+(?:\.\d+)?%?     # currency (dollars only, e.g. $12.40, $33, $.9) and digits 
    | [+/\-@&*.,;"'?():\-_`] #special symbols
    '''
    
    tokens = nltk.regexp_tokenize(text,pattern)
    
    return tokens

** Sentence Tokenizer **

In [10]:
def sentence_tokenizer(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus)
    return [word_tokenizer(item) for item in raw_sents]

** Normalization **

This function will remove any stop words and punctuations from the text that is sent to it

In [11]:
def normalize_text(text):
    sentence = [item for item in text if item[0] not in punctuations and item[0].lower() not in stopwords.words('english')]
    return sentence

** Function to tag the text using the standard POS Tagger **

In [12]:
def standard_pos_tagger(text):
    if(isinstance(text[0],list)):
        tagged_POS = [nltk.pos_tag(sent) for sent in text]
    else:
        tagged_POS = nltk.pos_tag(text)
    return tagged_POS

** Function to implement the backoff tagger based on the brown tag set **

In [13]:
def create_data_sets(sentences):
    size = int(len(sentences) * 0.9)
    train_sents = sentences[:size]
    test_sents = sentences[size:]
    return train_sents, test_sents

def build_backoff_tagger (train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    t3 = nltk.TrigramTagger(train_sents, backoff=t2)
    return t3


def train_tagger(already_tagged_sents):
    train_sents, test_sents = create_data_sets(already_tagged_sents)
    ngram_tagger = build_backoff_tagger(train_sents)
    print ("%0.3f pos accuracy on test set" % ngram_tagger.evaluate(test_sents))
    return ngram_tagger

** Training the tagger on brown tagset**

In [14]:
def train_tagger_on_brown():
    modified_speech_sents = [[('common', 'JJ'), ('Hard-working', 'JJ'), ('people', 'NNS'), ('.', '.')],
                        [("I'm", 'PPSS+BEM'), ('a', 'AT'), ('Republican', 'NP'), ('.', '.')],
                        [("I'm", 'PPSS+BEM'), ('Republican', 'NP'),('.', '.')], 
                        [('the', 'AT'), ('Republican', 'NP'), ('politicians', 'NNS'), ('.', '.')],
                        [('the', 'AT'), ('American', 'NP'), ('people', 'NNS'), ('.', '.')]]


    brown_tagged_sents = brown.tagged_sents(categories=['adventure', 'editorial', 'fiction', 'government', 'hobbies',
    'humor', 'learned', 'lore', 'mystery', 'religion', 'reviews', 'romance'])
    
    all_tagged_sents = modified_speech_sents + brown_tagged_sents
    return train_tagger(all_tagged_sents)

In [15]:
brown_tagger = train_tagger_on_brown()

0.909 pos accuracy on test set


** Tagger for the input text using the backoff tagger **

In [16]:
def Trained_Speech_Tagger(sents,tagger):
    if(isinstance(sents[0],list)):
        return [tagger.tag(sent) for sent in sents]
    else:
        return tagger.tag(sents)

**Funtion to retrieve the Bigram Collocations from the text**

In [17]:
def bigram_collocation_creator(text):
    finder = BigramCollocationFinder.from_words(text)
    return finder

**Funtion to retrieve the Trigram Collocations from the text**

In [18]:
def trigram_collocation_creator(text):
    finder = TrigramCollocationFinder.from_words(text)
    return finder

** Code to find the most common unigrams in the text**

In [19]:
def common_unigrams(sents):
   lemmatizer = WordNetLemmatizer() # to get word stems
   
   
   
   normalized_words = [lemmatizer.lemmatize(word[0].lower()) for sent in sents
                          for word in sent 
                          if word[1].startswith('N')]

   top_unigrams = [word for (word, count) in nltk.FreqDist(normalized_words).most_common(40)]
   return top_unigrams
    

** Categories Extraction through Hypernyms**

In [20]:
def categories_from_hypernyms(sents):
    termlist = common_unigrams(sents) 
    hypterms = []
    hypterms_dict = defaultdict(type([1]))
    for term in termlist:                  
        string = wn.synsets(term.lower(), 'n')  
        for syn in string:                      
            for hyp in syn.hypernyms():
                hypterms = hypterms + [hyp.name]     
                hypterms_dict[hyp.name].append(term)  
    frequency = nltk.FreqDist(hypterms)
    return frequency, hypterms_dict
    

### Steps to Extract Relevant information from personal text

Reading in my text file - Speeches.txt which contains all of the Donald Trump speeches for his 2016 presdential campagin

In [21]:
text = readtextfile("speeches.txt")

Running Preprocessing to clean up text

In [22]:
processed_text = text_preprocessing_trumpspeeches(text)

Tokenizing the text into sentences

In [23]:
text_sentences = sentence_tokenizer(processed_text)

Tokenizing the text into relevant tokens

In [24]:
text_tokens = word_tokenizer(processed_text)

Tagging of the sentences and the tokens using the standard tagger

In [25]:
standard_tagged_sentences = standard_pos_tagger(text_sentences)
standard_tagged_tokens = standard_pos_tagger(text_tokens)

Tagging of the sentences and the tokens using the trained backoff tagger

In [171]:
trained_tagged_sentences = Trained_Speech_Tagger(text_sentences,brown_tagger)
trained_tagged_tokens = Trained_Speech_Tagger(text_tokens,brown_tagger)

TypeError: isinstance() arg 2 must be a type or tuple of types

Cleaning up the tagged sentences to remove the stop words and punctuations. This function was initially performed before tagging, however, it made more sense to tag the text using the backoff tagger based on the complete text since stop words and and punctuations proved useful for correct tagging of certain elements. As a result, once tagged we go ahead and remove stop words, punctuations, etc

In [28]:
modified_tagged_sentences = [normalize_text(sent) for sent in trained_tagged_sentences]
modified_tagged_tokens = normalize_text(trained_tagged_tokens)

Cleaning up the list of untagged tokens 

In [29]:
tokenlist = [item for item in text_tokens if item not in punctuations and item.lower() not in stopwords.words('english')]

Find the Bigram Collocation from the text which occur atleast 10 times in the text.  We remove the bigrams which include 'VBG' verb types from the text and find the remaining most common bigram collocations and store them in relevant bigram_chunks

In [30]:
bigram_collocation = bigram_collocation_creator(modified_tagged_tokens)
bigram_collocation.apply_freq_filter(10)

In [81]:
relevant_bigram_chunks = []
for _list in bigram_collocation.ngram_fd.most_common(30):
    if (_list[0][0][1] == "VBG" or _list[0][1][1] == "VBG"):
        continue
    else:
        relevant_bigram_chunks.append(_list[0])

In [84]:
temp_list1 = []
temp_list2 = []
for item in relevant_bigram_chunks:
    for _list in item:
        temp_list1.append(_list[0])
    new_phrase = ' '.join(temp_list1)
    temp_list1 = []
    temp_list2.append(new_phrase)

relevant_bigram_chunks = temp_list2

Finding Bigrams and arranging them based on PMI and then CHI and only retaining top 10

In [76]:
bigram_collocation.apply_freq_filter(10)

In [89]:
relevant_bigram_chunks_pmi = []
temp_list1 = []
temp_list2 = []
for _list in bigram_collocation.nbest(bigram_measures.pmi, 10):
    for item in _list:
        temp_list1.append(item[0])
    new_phrase = ' '.join(temp_list1)
    temp_list1 = []
    temp_list2.append(new_phrase)

relevant_bigram_chunks_pmi = temp_list2

In [90]:
relevant_bigram_chunks_chi = []
temp_list1 = []
temp_list2 = []
for _list in bigram_collocation.nbest(bigram_measures.chi_sq, 10):
    for item in _list:
        temp_list1.append(item[0])
    new_phrase = ' '.join(temp_list1)
    temp_list1 = []
    temp_list2.append(new_phrase)

relevant_bigram_chunks_chi = temp_list2

Removing common elements from all three results sets to get uniquely identifying Bigrams

In [91]:
resulting_list_bigrams = relevant_bigram_chunks_pmi
resulting_list_bigrams.extend(x for x in relevant_bigram_chunks_chi if x not in resulting_list_bigrams)
resulting_list_bigrams.extend(x for x in relevant_bigram_chunks if x not in resulting_list_bigrams)

Find the Trigram Collocation from the text which occur atleast 10 times in the text.

In [37]:
trigram_collocation = trigram_collocation_creator(modified_tagged_tokens)
trigram_collocation.apply_freq_filter(5)

In [38]:
relevant_trigram_chunks = []
temp_list1 = []
temp_list = []
temp_list2 = []
for _list in trigram_collocation.ngram_fd.most_common(20):
    temp_list1.append(_list[0])

for item in temp_list1:
    for x in item:
        temp_list.append(x[0])
    new_phrase = ' '.join(temp_list)
    temp_list = []
    temp_list2.append(new_phrase)

relevant_trigram_chunks = temp_list2

In [39]:
relevant_trigram_chunks_pmi = []
temp_list1 = []
temp_list2 = []
for _list in trigram_collocation.nbest(trigram_measures.pmi, 10):
    for item in _list:
        temp_list1.append(item[0])
    new_phrase = ' '.join(temp_list1)
    temp_list1 = []
    temp_list2.append(new_phrase)

relevant_trigram_chunks_pmi = temp_list2

Removing common elements from all three results sets to get uniquely identifying Bigrams

In [69]:
resulting_list_trigrams = relevant_trigram_chunks
resulting_list_trigrams.extend(x for x in relevant_trigram_chunks_pmi if x not in resulting_list_trigrams)

Finding trigrams within the original token list and looking at the most frequent trigrams to compare with above result

In [41]:
temp_list1 = []
temp_list2 = []
tr = nltk.trigrams(tokenlist)
tr_fd = nltk.FreqDist(tr)
tr_fd.most_common(20)
for item in tr_fd.most_common(20):
    for _list in item[0]:
        temp_list1.append(_list)
    new_phrase = ' '.join(temp_list1)
    temp_list1 = []
    temp_list2.append(new_phrase)

Extracting Relevant Chunks from the text based on a defined Noun Phrase Grammar Rule

In [43]:
mychunks = my_chunker(trained_tagged_sentences)

In [59]:
nounphrase_list = []
for tree in mychunks:
    for subtree in tree.subtrees():
        if subtree.label() == 'NounPhrase': 
            nounphrase_list.append(subtree.leaves())

Chunked using noun phrases (including prepositions) and retaining the most common ones but only those with at least 4 words 

In [61]:
updated_nounphrase_list = []
temp_list = []
for item in nounphrase_list:
    if (len(item) >= 4):
        for list in item:
            temp_list.append(list[0])
        new_phrase = ' '.join(temp_list)
        temp_list = []
        updated_nounphrase_list.append(new_phrase)

In [62]:
fdist = nltk.FreqDist(updated_nounphrase_list)

In [97]:
relevant_chunks = [item[0] for item in fdist.most_common(20)]

We will now return the most common hyperterms of the most frequent unigrams along with their synset examples to find some relevant sub topics and text attributes based on these synsets

In [100]:
hyperterm_frequency, hypterms_dict = categories_from_hypernyms(modified_tagged_sentences)
common_terms = dict()
for (name, count) in hyperterm_frequency.most_common(10):
    nm=name().split('.')[0]
    common_terms[nm]=', '.join(set(hypterms_dict[name]))

### Displaying the Gist of the Personal Text based on above algorithm

In [122]:
pd.set_option('display.expand_frame_repr', True)
df1 = pd.DataFrame(resulting_list_trigrams, columns=['Key Phrases'])
df2 = pd.DataFrame(relevant_chunks, columns=['Key Chunks'])
print ("Key categories generated from text to help give context to determined key phrases and key chunks")
print()
for k, v in common_terms.items():
    print("Category:",k,"-->","Example terms:",v)
print()
print(df1)
print(df2)

Key categories generated from text to help give context to determined key phrases and key chunks

Category: difficulty --> Example terms: problem, wall, job
Category: transaction --> Example terms: deal, trade
Category: administrative_district --> Example terms: state, country
Category: work --> Example terms: care, job
Category: artifact --> Example terms: thing, way
Category: people --> Example terms: country, world, folk
Category: group --> Example terms: people, world
Category: time_period --> Example terms: day, time, year
Category: large_indefinite_quantity --> Example terms: lot, deal, million
Category: attribute --> Example terms: state, thing, time

                 Key Phrases
0            going take care
1         make America great
2           going build wall
3         make country great
4           Thank Thank much
5           going bring back
6              going get rid
7         going happen going
8       Thank much everybody
9       going happen anymore
10        goin

## Reflections

It was easy to see that for the text that has been adopted - 'Donal Trump Speeches', a lot of information was transactional in nature. Heavy usage of verbs indicatin Trump's promise to perform a certain action affected the algorithm output and hence, that needed to be taken into consideration early on. 

Also given the poltiical nature of the text, there was heavy reliance on Proper Nouns in the text to bring across certain points regarding the presenditial campaign and this again was important to keep in mind when extracting the Gist of the text through chunking and related activities.

There were very interesting insights gathered through this exercise and such as Trump's heavy reliance on the action word - "going" which again as mentioned above indicates promise.

For the above algorithm, as mentioned throughout the notebook, there were certain points where the results were not as good as expected and hence newer techniques needed to be adopted. 

Extraction of frequent Bigrams and Unigrams was less indicative of any imporant information for my personal text and hence since frequent Bigrams included a lot of Proper Nouns without much context to them. As a result it proved helpful to find Frequent trigrams which gave god results in terms of descriptive key phrases. Also in the case of chunking, it was important to note that this proved to be the most helpful in performing key phrase extraction. Given the limitation of size, I limited my result set, since I wanted to combine it with other results. Frequent hypernym extraction proved to be a bit less useful again due to the nature of the text and the way Trump talks. However, we were still able to extract some key pieces of information.


### Run below code for mystery text

In [127]:
file = 'link'

## Example -- file = 'http://people.ischool.berkeley.edu/~tygar/for.i206/pg1342.txt'

In [132]:
text_file = read_file_url(file)

In [26]:
text = readtextfile("mystery_text_expository_2016.txt")

In [28]:
processed_mystery_text = text_preprocessing_trumpspeeches(text)

In [29]:
text_sentences = sentence_tokenizer(processed_mystery_text)

In [30]:
text_tokens = word_tokenizer(processed_mystery_text)

In [31]:
trained_tagged_sentences = Trained_Speech_Tagger(text_sentences,brown_tagger)
trained_tagged_tokens = Trained_Speech_Tagger(text_tokens,brown_tagger)

In [32]:
modified_tagged_sentences = [normalize_text(sent) for sent in trained_tagged_sentences]
modified_tagged_tokens = normalize_text(trained_tagged_tokens)

In [33]:
bigram_collocation = bigram_collocation_creator(modified_tagged_tokens)
bigram_collocation.apply_freq_filter(10)

In [34]:
relevant_bigram_chunks = []
for _list in bigram_collocation.ngram_fd.most_common(30):
    if (_list[0][0][1] == "VBG" or _list[0][1][1] == "VBG"):
        continue
    else:
        relevant_bigram_chunks.append(_list[0])

In [35]:
temp_list1 = []
temp_list2 = []
for item in relevant_bigram_chunks:
    for _list in item:
        temp_list1.append(_list[0])
    new_phrase = ' '.join(temp_list1)
    temp_list1 = []
    temp_list2.append(new_phrase)

relevant_bigram_chunks = temp_list2

In [36]:
bigram_collocation.apply_freq_filter(10)

In [37]:
relevant_bigram_chunks_pmi = []
temp_list1 = []
temp_list2 = []
for _list in bigram_collocation.nbest(bigram_measures.pmi, 10):
    for item in _list:
        temp_list1.append(item[0])
    new_phrase = ' '.join(temp_list1)
    temp_list1 = []
    temp_list2.append(new_phrase)

relevant_bigram_chunks_pmi = temp_list2

In [38]:
relevant_bigram_chunks_chi = []
temp_list1 = []
temp_list2 = []
for _list in bigram_collocation.nbest(bigram_measures.chi_sq, 10):
    for item in _list:
        temp_list1.append(item[0])
    new_phrase = ' '.join(temp_list1)
    temp_list1 = []
    temp_list2.append(new_phrase)

relevant_bigram_chunks_chi = temp_list2

In [39]:
resulting_list_bigrams = relevant_bigram_chunks_pmi
resulting_list_bigrams.extend(x for x in relevant_bigram_chunks_chi if x not in resulting_list_bigrams)
resulting_list_bigrams.extend(x for x in relevant_bigram_chunks if x not in resulting_list_bigrams)

In [40]:
trigram_collocation = trigram_collocation_creator(modified_tagged_tokens)
trigram_collocation.apply_freq_filter(5)

In [41]:
relevant_trigram_chunks = []
temp_list1 = []
temp_list = []
temp_list2 = []
for _list in trigram_collocation.ngram_fd.most_common(20):
    temp_list1.append(_list[0])

for item in temp_list1:
    for x in item:
        temp_list.append(x[0])
    new_phrase = ' '.join(temp_list)
    temp_list = []
    temp_list2.append(new_phrase)

relevant_trigram_chunks = temp_list2

In [42]:
relevant_trigram_chunks_pmi = []
temp_list1 = []
temp_list2 = []
for _list in trigram_collocation.nbest(trigram_measures.pmi, 10):
    for item in _list:
        temp_list1.append(item[0])
    new_phrase = ' '.join(temp_list1)
    temp_list1 = []
    temp_list2.append(new_phrase)

relevant_trigram_chunks_pmi = temp_list2

In [43]:
resulting_list_trigrams = relevant_trigram_chunks
resulting_list_trigrams.extend(x for x in relevant_trigram_chunks_pmi if x not in resulting_list_trigrams)

In [44]:
mychunks = my_chunker(trained_tagged_sentences)

In [45]:
nounphrase_list = []
for tree in mychunks:
    for subtree in tree.subtrees():
        if subtree.label() == 'NounPhrase': 
            nounphrase_list.append(subtree.leaves())

In [46]:
updated_nounphrase_list = []
temp_list = []
for item in nounphrase_list:
    if (len(item) >= 4):
        for list in item:
            temp_list.append(list[0])
        new_phrase = ' '.join(temp_list)
        temp_list = []
        updated_nounphrase_list.append(new_phrase)

In [47]:
fdist = nltk.FreqDist(updated_nounphrase_list)

In [48]:
relevant_chunks = [item[0] for item in fdist.most_common(20)]

In [49]:
hyperterm_frequency, hypterms_dict = categories_from_hypernyms(modified_tagged_sentences)
common_terms = dict()
for (name, count) in hyperterm_frequency.most_common(10):
    nm=name().split('.')[0]
    common_terms[nm]=', '.join(set(hypterms_dict[name]))

In [50]:
pd.set_option('display.expand_frame_repr', True)
df1 = pd.DataFrame(resulting_list_trigrams, columns=['Key Phrases'])
df2 = pd.DataFrame(relevant_chunks, columns=['Key Chunks'])
print ("Key categories generated from text to help give context to determined key phrases and key chunks")
print()
for k, v in common_terms.items():
    print("Category:",k,"-->","Example terms:",v)
print()
print(df1)
print(df2)

Key categories generated from text to help give context to determined key phrases and key chunks

Category: administrative_district --> Example terms: state, department, country
Category: people --> Example terms: world, nation, country, business
Category: activity --> Example terms: effort, work, service, business
Category: group --> Example terms: world, system, people, men
Category: state --> Example terms: action, power, condition
Category: time_period --> Example terms: time, year, life
Category: political_unit --> Example terms: state, nation, country
Category: being --> Example terms: life
Category: force --> Example terms: law, service, men
Category: male --> Example terms: men

                       Key Phrases
0              State Union Address
1                 fiscal year 1947
2         Government United States
3             people United States
4                    21 st century
5                   ending June 30
6                     World War II
7            United Stat