This notebook shows how to retrain the NLTK backoff tagger.
- You'll see an example in which some recipe text has some errors in tagging, most likely because the training data did not have many examples of the target sentence structure.  
- Next, you'll see the effects on the tagger's accuracy of adding a few sentences of training data with the missing sentence structure.
- Your assignment is to do something similar on your adopted text.


In [25]:
import nltk, re
from nltk.corpus import brown
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize
from nltk.util import ngrams

Define functions for training and evaluating a backoff tagger.

In [49]:
def create_data_sets(sentences):
    size = int(len(sentences) * 0.9)
    train_sents = sentences[:size]
    test_sents = sentences[size:]
    return train_sents, test_sents

def build_backoff_tagger(train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    t3 = nltk.TrigramTagger(train_sents, backoff=t2)
    return t3


def train_tagger(already_tagged_sents):
    train_sents, test_sents = create_data_sets(already_tagged_sents)
    ngram_tagger = build_backoff_tagger(train_sents)
    print("%0.3f pos accuracy on test set" % ngram_tagger.evaluate(test_sents))
    return ngram_tagger
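
To make the backoff idea concrete, here is a simplified, standard-library-only sketch of how a chain of n-gram lookups can fall back from trigram context to bigram context to the word alone, and finally to a default tag. It is an illustration of the technique, not NLTK's actual implementation; the names `train_context_table` and `tag_with_backoff` and the toy training sentence are assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_context_table(tagged_sents, n):
    """Map (previous n-1 tags, word) to the tag seen most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        for i, (word, tag) in enumerate(sent):
            history = tuple(tags[max(0, i - n + 1):i])
            counts[(history, word)][tag] += 1
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}

def tag_with_backoff(tokens, tables, default_tag="NN"):
    """Tag tokens, trying each (n, table) in order and falling back to default_tag."""
    tags = []
    for i, word in enumerate(tokens):
        chosen = default_tag
        for n, table in tables:
            history = tuple(tags[max(0, i - n + 1):i])
            if (history, word) in table:
                chosen = table[(history, word)]
                break
        tags.append(chosen)
    return list(zip(tokens, tags))

# Hypothetical one-sentence training set, for illustration only.
toy_train = [[("Stir", "VB"), ("the", "AT"), ("soup", "NN"), (".", ".")]]
# Most specific table first: trigram, then bigram, then unigram.
tables = [(n, train_context_table(toy_train, n)) for n in (3, 2, 1)]
result = tag_with_backoff(["Stir", "the", "broth", "."], tables)
print(result)
```

The unseen word "broth" is not in any table, so it receives the default tag, which is the same role `nltk.DefaultTagger('NN')` plays at the bottom of the chain above.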


Make a specialized function for training a tagger on the Brown corpus.

For my text collection, I wanted to train on a limited set of categories, so I removed categories that seemed irrelevant in order to avoid overfitting. Since the text consists of presidential speeches, I left out the categories below:

science_fiction
belles_lettres


In [5]:
def train_tagger_on_brown():
    brown_tagged_sents = brown.tagged_sents(categories=['adventure', 'editorial', 'fiction', 'government', 'hobbies',
    'humor', 'learned', 'lore', 'mystery', 'religion', 'reviews', 'romance'])
    return train_tagger(brown_tagged_sents)


Functions for reading a text file into a string with create_corpus() and splitting it into tokenized sentences with tokenize_text().

In [12]:
def tokenize_text(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus)  # Split text into sentences
    return [nltk.word_tokenize(sent) for sent in raw_sents]  # Split each sentence into word tokens

def create_corpus(f):
    with open(f, 'r') as text_file:
        new_corpus = text_file.read()
    return new_corpus


Now train and evaluate an n-gram backoff tagger, using the Brown corpus as both the training and the testing set.  (This takes a few moments to complete.)

In [6]:
brown_tagger = train_tagger_on_brown()

0.909 pos accuracy on test set


Next, read in a file of recipes and tokenize it.

In [13]:
cookbook_file = './cookbooks.txt'
cookbook_sents = tokenize_text(create_corpus(cookbook_file))
cookbook_sents

[['VERMICELLI', 'SOUP', '.'],
 ['Put',
  'a',
  'shin',
  'of',
  'veal',
  ',',
  'one',
  'onion',
  ',',
  'two',
  'carrots',
  ',',
  'two',
  'turnips',
  ',',
  'and',
  'a',
  'little',
  'salt',
  ',',
  'into',
  'four',
  'quarts',
  'of',
  'water',
  '.'],
 ['Boil',
  'this',
  'three',
  'hours',
  ';',
  'add',
  'two',
  'cups',
  'of',
  'vermicelli',
  ',',
  'and',
  'boil',
  'it',
  'an',
  'hour',
  'and',
  'a',
  'half',
  'longer',
  '.'],
 ['Before', 'serving', 'take', 'out', 'the', 'bone', 'and', 'vegetable', '.'],
 ['JENNY', 'LIND', "'S", 'SOUP', '.'],
 ['The',
  'following',
  'soup',
  'is',
  'stated',
  'by',
  'Miss',
  'Bremer',
  ',',
  'to',
  'be',
  'the',
  'soup',
  'constantly',
  'served',
  'to',
  'Mademoiselle',
  'Jenny',
  'Lind',
  ',',
  'as',
  'prepared',
  'by',
  'her',
  'own',
  'cook',
  '.'],
 ['The',
  'sago',
  'and',
  'eggs',
  'were',
  'found',
  'by',
  'her',
  'soothing',
  'to',
  'the',
  'chest',
  ',',
  'and',
  'be

In this collection, imperative sentences (sentences that begin with a verb) are usually mistagged: the POS tagger marks the initial verb as NN instead of VB.  (There may be other kinds of errors too, but we are only looking at imperative sentences here.) To see the sentences where the errors occur, the code below finds sentences that begin with one of a hardcoded list of cooking verbs, tags them with the tagger, and returns them in a list.

In [31]:
def get_cookbook_imperatives(sents, tagger):
    cooking_commands = ["Wash", "Stir", "Moisten", "Drain", "Cook", "Pour", "Chop", "Slice", "Season", "Mix", "Fry", "Bake", "Roast", "Wisk"]        
    return [tagger.tag(sent) for sent in sents if sent[0] in cooking_commands]       


Let's look at those sentences.

In [37]:
imperatives = get_cookbook_imperatives(cookbook_sents, brown_tagger)
imperatives[0:5]

[[('Wash', 'NN'),
  ('a', 'AT'),
  ('quarter', 'NN'),
  ('of', 'IN'),
  ('a', 'AT'),
  ('pound', 'NN'),
  ('of', 'IN'),
  ('best', 'JJT'),
  ('pearl', 'NN'),
  ('sago', 'NN'),
  ('thoroughly', 'RB'),
  (',', ','),
  ('then', 'RB'),
  ('stew', 'NN'),
  ('it', 'PPS'),
  ('quite', 'QL'),
  ('tender', 'JJ'),
  ('and', 'CC'),
  ('very', 'QL'),
  ('View', 'NN'),
  ('page', 'NN'),
  ('[', '('),
  ('32', 'CD'),
  (']', ')'),
  ('thick', 'JJ'),
  ('in', 'IN'),
  ('water', 'NN'),
  ('or', 'CC'),
  ('thick', 'JJ'),
  ('broth', 'NN'),
  (';', '.'),
  ('(', '('),
  ('it', 'PPS'),
  ('will', 'MD'),
  ('require', 'VB'),
  ('nearly', 'QL'),
  ('or', 'CC'),
  ('quite', 'QL'),
  ('a', 'AT'),
  ('quart', 'NN'),
  ('of', 'IN'),
  ('liquid', 'JJ'),
  (',', ','),
  ('which', 'WDT'),
  ('should', 'MD'),
  ('be', 'BE'),
  ('poured', 'VBN'),
  ('to', 'TO'),
  ('it', 'PPO'),
  ('cold', 'JJ'),
  ('and', 'CC'),
  ('heated', 'VBN'),
  ('slowly', 'RB'),
  (';', '.'),
  (')', ')'),
  ('then', 'RB'),
  ('mix', 'VB'),

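Notice that most of the initial words are incorrectly tagged as nouns rather than verbs.  How can we fix this?  One way is to label a few rather generic sentences with the structure we are interested in, add them to the start of the training data, and then retrain the tagger. Because create_data_sets() uses the first 90% of the sentences for training, sentences added at the front always land in the training portion rather than the test portion.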

In [33]:
def train_tagger_on_brown_augmented_with_cooking_sents():

    cooking_action_sents = [[('Strain', 'VB'), ('it', 'PPS'), ('well', 'RB'), ('.', '.')],
                        [('Mix', 'VB'), ('them', 'PPS'), ('well', 'RB'), ('.', '.')],
                        [('Season', 'VB'), ('them', 'PPS'), ('with', 'IN'), ('pepper', 'NN'), ('.', '.')], 
                        [('Wash', 'VB'), ('it', 'PPS'), ('well', 'RB'), ('.', '.')],
                        [('Chop', 'VB'), ('the', 'AT'), ('greens', 'NNS'), ('.', '.')],
                        [('Slice', 'VB'), ('it', 'PPS'), ('well', 'RB'), ('.', '.')],
                        [('Bake', 'VB'), ('the', 'AT'), ('cake', 'NN'), ('.', '.')],
                        [('Pour', 'VB'), ('into', 'IN'), ('a', 'AT'), ('mold', 'NN'), ('.', '.')],
                        [('Stir', 'VB'), ('the', 'AT'), ('mixture', 'NN'), ('.', '.')],
                        [('Moisten', 'VB'), ('the', 'AT'), ('grains', 'NNS'), ('.', '.')],
                        [('Cook', 'VB'), ('the', 'AT'), ('duck', 'NN'), ('.', '.')],
                        [('Drain', 'VB'), ('for', 'IN'), ('one', 'CD'), ('day', 'NN'), ('.', '.')]]


    brown_tagged_sents = brown.tagged_sents(categories=['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
    'humor', 'learned', 'lore', 'mystery', 'religion', 'reviews', 'romance', 'science_fiction'])
    
    # prepend the hand-tagged cooking sentences to the training data
    all_tagged_sents = cooking_action_sents + brown_tagged_sents
    return train_tagger(all_tagged_sents)
    

Let's retrain the tagger.

In [34]:
brown_and_cooking_tagger = train_tagger_on_brown_augmented_with_cooking_sents()


0.911 pos accuracy on test set


How well is this working on the cookbook imperatives now? Is more training data needed to change the behavior of the tagger?

In [38]:
better_imperatives = get_cookbook_imperatives(cookbook_sents, brown_and_cooking_tagger)
better_imperatives

[[('Wash', 'VB'),
  ('a', 'AT'),
  ('quarter', 'NN'),
  ('of', 'IN'),
  ('a', 'AT'),
  ('pound', 'NN'),
  ('of', 'IN'),
  ('best', 'JJT'),
  ('pearl', 'NN'),
  ('sago', 'NN'),
  ('thoroughly', 'RB'),
  (',', ','),
  ('then', 'RB'),
  ('stew', 'NN'),
  ('it', 'PPS'),
  ('quite', 'QL'),
  ('tender', 'JJ'),
  ('and', 'CC'),
  ('very', 'QL'),
  ('View', 'NN'),
  ('page', 'NN'),
  ('[', '('),
  ('32', 'CD'),
  (']', ')'),
  ('thick', 'JJ'),
  ('in', 'IN'),
  ('water', 'NN'),
  ('or', 'CC'),
  ('thick', 'JJ'),
  ('broth', 'NN'),
  (';', '.'),
  ('(', '('),
  ('it', 'PPS'),
  ('will', 'MD'),
  ('require', 'VB'),
  ('nearly', 'QL'),
  ('or', 'CC'),
  ('quite', 'QL'),
  ('a', 'AT'),
  ('quart', 'NN'),
  ('of', 'IN'),
  ('liquid', 'JJ'),
  (',', ','),
  ('which', 'WDT'),
  ('should', 'MD'),
  ('be', 'BE'),
  ('poured', 'VBN'),
  ('to', 'TO'),
  ('it', 'PPO'),
  ('cold', 'JJ'),
  ('and', 'CC'),
  ('heated', 'VBN'),
  ('slowly', 'RB'),
  (';', '.'),
  (')', ')'),
  ('then', 'RB'),
  ('mix', 'VB'),

It worked quite well.  It would be worth experimenting to see if it would still work if I'd supplied fewer of the cooking verbs.
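
One way to check this without reading every sentence is to tally the tag given to the first token of each sentence. The sketch below assumes input shaped like the output of get_cookbook_imperatives() (a list of sentences, each a list of (word, tag) pairs). The helper name `initial_tag_counts` and the toy sentences are illustrative assumptions, not part of the original notebook or its data.

```python
from collections import Counter

def initial_tag_counts(tagged_sents):
    """Count the POS tag assigned to the first token of each non-empty sentence."""
    return Counter(sent[0][1] for sent in tagged_sents if sent)

# Hypothetical toy data standing in for tagger output before and after retraining.
toy_before = [
    [("Wash", "NN"), ("a", "AT"), ("quarter", "NN")],
    [("Mix", "NN"), ("them", "PPO"), ("well", "RB")],
    [("Pour", "VB"), ("into", "IN"), ("a", "AT")],
]
toy_after = [
    [("Wash", "VB"), ("a", "AT"), ("quarter", "NN")],
    [("Mix", "VB"), ("them", "PPO"), ("well", "RB")],
    [("Pour", "VB"), ("into", "IN"), ("a", "AT")],
]

print(initial_tag_counts(toy_before))
print(initial_tag_counts(toy_after))
```

In the notebook itself, comparing `initial_tag_counts(imperatives)` with `initial_tag_counts(better_imperatives)` would show how many sentence-initial verbs moved from NN to VB.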

## Assignment

Rewrite this notebook to do the following:
- Tag your adopted text with an NLTK backoff tagger
- Identify a common type of error that is amenable to fixing by making a pattern of training data, similar to what we see with the recipe examples.  You'll want to focus on a particular pattern so that making a few tweaks will have an impact on the results of training.
- Show the before and after effects on the output of the tagger.  Ideally you'll see the errors get fixed not just on the specific examples you fixed, but on similar examples with different words.  In the case of recipes, imperative verbs beyond those in the hardcoded list would be fixed because the tagger would recognize the pattern that verbs can occur at the start of the sentence.

## My Steps

For my collection I used a backoff tagger whose top level is a trigram tagger, since on inspecting the text, trigram context seemed the most useful. Below we will see examples of sentences where the POS tag of a word differs depending on the trigram it appears in.

Before moving into my tagged text, I had certain hypotheses:

-- Wrong tagging of words such as I've, We're, I'm, You've, etc

-- Given the nature of the speeches, I expected that Trump might be confused as a Verb or a Proper Noun

Upon initial analysis of my tagged adopted text, I noticed the results below:

-- Words such as I've, We're, I'm, You've, etc have been properly tagged.

-- Trump was never tagged as a Verb, but in the output shown below it is consistently tagged NN (common noun) rather than NP (proper noun)

There were some more interesting results that were seen through a quick analysis:

-- Hard-working was tagged as a noun

-- The term Republican was either tagged as an Adjective or as a Noun

-- Similar results were seen with the word American

### Understanding the tags

To understand how the terms Republican, American, and the other identified words were being tagged, I created trigrams from my tokens and connected them to their associated tags.

Below are some observations and discussion:

Common phrases such as 'I'm a Republican', 'I'm part of the Republican Party', 'The Republican people', 'An American Law', etc. are tagged inconsistently, sometimes as Adjectives and sometimes as Nouns.

For the purposes of this text, and within the context of these speeches, terms such as Republican, American, Democratic, Democrat, etc. should be treated as Nouns, so the tagger needs to be retrained to remove the incorrect Adjective tags.

### To Test
In the sentence 'No Republican Party' the word Republican was incorrectly tagged as an Adjective, but within the context of the text we want it to be a Noun. We did not hardcode this sentence in the added training data, so we will test whether 'Republican' is correctly tagged after retraining.

### Note
There still seems to be some confusion regarding the tagging of words such as 'Republican', 'American', etc. Even after considerable research, it is still unclear in which situations they should be tagged as an Adjective, Adverb or Noun. Further research is needed.

Read in the 'speeches.txt' collection for the tagger. The next step tokenizes the text with a regular-expression tokenizer.

In [106]:
with open("speeches.txt") as w:
    text = w.read()

In [8]:
text = text.replace('\ufeff', '')
new_text = re.sub('[\n]+','\n', text)
pattern = r'''(?x)  # set flag to allow verbose regexps
 (?:[A-Z]\.)+[A-Z]*        # abbreviations, e.g. U.S.A.
| [a-zA-Z]+(?:[-'][a-zA-Z]+)*            # words with optional internal hyphens or apostrophes         
| \$?\d+(?:\.\d+)?%?     # currency (dollars only, e.g. $12.40, $33, $.9) and digits 
| [+/\-@&*.,;"'?():\-_`] #special symbols
'''

Tokens = nltk.regexp_tokenize(new_text,pattern)
Tokens

['SPEECH',
 '1',
 '.',
 '.',
 '.',
 'Thank',
 'you',
 'so',
 'much',
 '.',
 "That's",
 'so',
 'nice',
 '.',
 "Isn't",
 'he',
 'a',
 'great',
 'guy',
 '.',
 'He',
 "doesn't",
 'get',
 'a',
 'fair',
 'press',
 ';',
 'he',
 "doesn't",
 'get',
 'it',
 '.',
 "It's",
 'just',
 'not',
 'fair',
 '.',
 'And',
 'I',
 'have',
 'to',
 'tell',
 'you',
 "I'm",
 'here',
 ',',
 'and',
 'very',
 'strongly',
 'here',
 ',',
 'because',
 'I',
 'have',
 'great',
 'respect',
 'for',
 'Steve',
 'King',
 'and',
 'have',
 'great',
 'respect',
 'likewise',
 'for',
 'Citizens',
 'United',
 ',',
 'David',
 'and',
 'everybody',
 ',',
 'and',
 'tremendous',
 'resect',
 'for',
 'the',
 'Tea',
 'Party',
 '.',
 'Also',
 ',',
 'also',
 'the',
 'people',
 'of',
 'Iowa',
 '.',
 'They',
 'have',
 'something',
 'in',
 'common',
 '.',
 'Hard-working',
 'people',
 '.',
 'They',
 'want',
 'to',
 'work',
 ',',
 'they',
 'want',
 'to',
 'make',
 'the',
 'country',
 'great',
 '.',
 'I',
 'love',
 'the',
 'people',
 'of',
 'Iowa'
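
To see which alternative of the token pattern fires on which kind of string, the pattern can be tried on a short sample with the standard library alone. In this sketch `re.findall` is a stand-in for `nltk.regexp_tokenize`; for a pattern with no capturing groups the two should produce the same tokens, but that equivalence is an assumption here, and the sample sentence is made up for illustration.

```python
import re

# Same four alternatives as the pattern above; (?x) lets the pattern span several lines.
pattern = r'''(?x)
   (?:[A-Z]\.)+[A-Z]*               # abbreviations, e.g. U.S.A.
 | [a-zA-Z]+(?:[-'][a-zA-Z]+)*      # words with internal hyphens or apostrophes
 | \$?\d+(?:\.\d+)?%?               # numbers, optionally with $ or %
 | [+/\-@&*.,;"'?():\-_`]           # single special symbols
'''

# Hypothetical sample text, not taken from speeches.txt.
sample = "Hard-working people in the U.S.A. paid $12.40; I'm here."
tokens = re.findall(pattern, sample)
print(tokens)
```

Note that contractions such as "I'm" and hyphenated words such as "Hard-working" stay as single tokens, which is why the Brown-trained tagger can assign them combined tags like PPSS+BEM.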

Create the tagme() function to tag my adopted text with an NLTK backoff tagger

In [50]:
def tagme(tokens, tagger):
    return [tagger.tag(tokens)]

In [107]:
tagging_answer =  tagme(Tokens, brown_tagger)
tagging_answer

[[('SPEECH', 'NN'),
  ('1', 'CD'),
  ('.', '.'),
  ('.', '.'),
  ('.', '.'),
  ('Thank', 'VB'),
  ('you', 'PPO'),
  ('so', 'CS'),
  ('much', 'AP'),
  ('.', '.'),
  ("That's", 'DT+BEZ'),
  ('so', 'QL'),
  ('nice', 'JJ'),
  ('.', '.'),
  ("Isn't", 'BEZ*'),
  ('he', 'PPS'),
  ('a', 'AT'),
  ('great', 'JJ'),
  ('guy', 'NN'),
  ('.', '.'),
  ('He', 'PPS'),
  ("doesn't", 'DOZ*'),
  ('get', 'VB'),
  ('a', 'AT'),
  ('fair', 'JJ'),
  ('press', 'NN'),
  (';', '.'),
  ('he', 'PPS'),
  ("doesn't", 'DOZ*'),
  ('get', 'VB'),
  ('it', 'PPO'),
  ('.', '.'),
  ("It's", 'PPS+BEZ'),
  ('just', 'RB'),
  ('not', '*'),
  ('fair', 'JJ'),
  ('.', '.'),
  ('And', 'CC'),
  ('I', 'PPSS'),
  ('have', 'HV'),
  ('to', 'TO'),
  ('tell', 'VB'),
  ('you', 'PPO'),
  ("I'm", 'PPSS+BEM'),
  ('here', 'RB'),
  (',', ','),
  ('and', 'CC'),
  ('very', 'QL'),
  ('strongly', 'RB'),
  ('here', 'RB'),
  (',', ','),
  ('because', 'CS'),
  ('I', 'PPSS'),
  ('have', 'HV'),
  ('great', 'JJ'),
  ('respect', 'NN'),
  ('for', 'IN'),
  

In [109]:
error_list = []
word_check = ["Republican", "American", "Trump", "republican", "trump", "Hard-working", "Democratic","conservative", "Democrat"]
# Trump_checker = ["Trump", "trump"]
# checking = ["conservative", "Conservative"]
# Party_checker_Demo = ["Democratic"]
Party_checker_Rep = ["Republican"]
for token in tagging_answer[0]:
    if (token[0] in word_check):
        error_list.append(token)

print(error_list)

[('Hard-working', 'NN'), ('conservative', 'JJ'), ('conservative', 'JJ'), ('Republican', 'JJ'), ('Republican', 'NP'), ('Republican', 'JJ'), ('Trump', 'NN'), ('Trump', 'NN'), ('Republican', 'NP'), ('Trump', 'NN'), ('American', 'JJ'), ('Trump', 'NN'), ('Trump', 'NN'), ('American', 'JJ'), ('American', 'JJ'), ('American', 'JJ'), ('Trump', 'NN'), ('Trump', 'NN'), ('American', 'JJ'), ('American', 'JJ'), ('American', 'JJ'), ('Trump', 'NN'), ('Trump', 'NN'), ('American', 'JJ'), ('American', 'JJ'), ('American', 'JJ'), ('Trump', 'NN'), ('American', 'JJ'), ('American', 'JJ'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Republican', 'JJ'), ('Democratic', 'JJ-TL'), ('Democrat', 'NP'), ('Republican', 'JJ'), ('conservative', 'JJ'), ('conservative', 'JJ'), ('conservative', 'NN

In [116]:
trigram_list = []
pairs = nltk.trigrams(Tokens)
for a in pairs:
    for item in word_check:
        if (item in a):
            trigram_list.append(a)

In [117]:
# Chunk the matching trigrams into groups of three; this assumes each
# occurrence of a target word contributes exactly three trigrams.
trigram_Set = [trigram_list[i:i+3] for i in range(0, len(trigram_list), 3)]

Below we connect each group of trigrams with its respective tag. For example, for the group of trigrams

("I'm", 'a', 'Republican'), ('a', 'Republican', '.'), ('Republican', '.', 'And')

Republican is tagged as an Adjective (JJ).

In [119]:
trigram_errors = zip(trigram_Set, error_list)

In [120]:
list(trigram_errors)

[([('common', '.', 'Hard-working'),
   ('.', 'Hard-working', 'people'),
   ('Hard-working', 'people', '.')],
  ('Hard-working', 'NN')),
 ([("I'm", 'a', 'conservative'),
   ('a', 'conservative', ','),
   ('conservative', ',', 'actually')],
  ('conservative', 'JJ')),
 ([('actually', 'very', 'conservative'),
   ('very', 'conservative', ','),
   ('conservative', ',', 'and')],
  ('conservative', 'JJ')),
 ([("I'm", 'a', 'Republican'),
   ('a', 'Republican', '.'),
   ('Republican', '.', 'And')],
  ('Republican', 'JJ')),
 ([('by', 'our', 'Republican'),
   ('our', 'Republican', 'politicians'),
   ('Republican', 'politicians', '.')],
  ('Republican', 'NP')),
 ([('are', 'the', 'Republican'),
   ('the', 'Republican', 'politicians'),
   ('Republican', 'politicians', 'doing')],
  ('Republican', 'JJ')),
 ([('better', 'than', 'Trump'), ('than', 'Trump', '?'), ('Trump', '?', 'I')],
  ('Trump', 'NN')),
 ([('love', 'Donald', 'Trump'), ('Donald', 'Trump', '-'), ('Trump', '-', '-')],
  ('Trump', 'NN')),
 (
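
Pairing trigram groups with tags by position relies on every target word producing exactly three trigrams, in the same order as the tags. An alternative that avoids this alignment assumption is to walk the tagged tokens by index and collect the windows around each hit directly. The sketch below is one possible way to do that; the function name `trigrams_with_tags` and the toy input are assumptions for illustration.

```python
def trigrams_with_tags(tagged_tokens, targets):
    """For each occurrence of a target word, return (covering trigrams, (word, tag))."""
    words = [word for word, _ in tagged_tokens]
    results = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if word not in targets:
            continue
        # Start positions of every trigram that covers index i, clipped at the text edges.
        starts = range(max(0, i - 2), min(i, len(words) - 3) + 1)
        windows = [tuple(words[s:s + 3]) for s in starts]
        results.append((windows, (word, tag)))
    return results

# Toy input shaped like tagging_answer[0]; the tags mirror one case shown above.
toy_tagged = [("I'm", "PPSS+BEM"), ("a", "AT"), ("Republican", "JJ"), (".", "."), ("And", "CC")]
pairs_with_tags = trigrams_with_tags(toy_tagged, {"Republican"})
print(pairs_with_tags)
```

With the notebook's variables, `trigrams_with_tags(tagging_answer[0], set(word_check))` would give the same kind of listing while keeping each tag attached to its own context, even when two target words fall within the same trigram.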

In [121]:
def train_tagger_on_brown_augmented():

    modified_speech_sents = [[('common', 'JJ'), ('Hard-working', 'JJ'), ('people', 'NNS'), ('.', '.')],
                        [("I'm", 'PPSS+BEM'), ('a', 'AT'), ('Republican', 'NP'), ('.', '.')],
                        [("I'm", 'PPSS+BEM'), ('Republican', 'NP'),('.', '.')], 
                        [('the', 'AT'), ('Republican', 'NP'), ('politicians', 'NNS'), ('.', '.')],
                        [('the', 'AT'), ('American', 'NP'), ('people', 'NNS'), ('.', '.')]]


    brown_tagged_sents = brown.tagged_sents(categories=['adventure', 'editorial', 'fiction', 'government', 'hobbies',
    'humor', 'learned', 'lore', 'mystery', 'religion', 'reviews', 'romance'])
    
    # prepend the hand-tagged speech sentences to the training data
    all_tagged_sents = modified_speech_sents + brown_tagged_sents
    return train_tagger(all_tagged_sents)

In [104]:
modified_tagger = train_tagger_on_brown_augmented()

0.909 pos accuracy on test set


In [123]:
tagging_modified_answer =  tagme(Tokens, modified_tagger)
error_list = []
for token in tagging_modified_answer[0]:
    if (token[0] in word_check):
        error_list.append(token)

print(error_list)

[('Hard-working', 'JJ'), ('action', 'NN'), ('action', 'NN'), ('conservative', 'JJ'), ('conservative', 'JJ'), ('Republican', 'NP'), ('Republican', 'NP'), ('Republican', 'NP'), ('Trump', 'NN'), ('Trump', 'NN'), ('Republican', 'NP'), ('Trump', 'NN'), ('American', 'JJ'), ('Trump', 'NN'), ('Trump', 'NN'), ('American', 'JJ'), ('American', 'JJ'), ('American', 'JJ'), ('Trump', 'NN'), ('Trump', 'NN'), ('American', 'JJ'), ('action', 'NN'), ('American', 'JJ'), ('American', 'JJ'), ('Trump', 'NN'), ('Trump', 'NN'), ('American', 'JJ'), ('American', 'JJ'), ('American', 'JJ'), ('Trump', 'NN'), ('American', 'JJ'), ('American', 'JJ'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('action', 'NN'), ('action', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Trump', 'NN'), ('Republican', 'NP'), ('Democratic', 'JJ-TL'), ('Democrat', 'NP'), 

In [126]:
trigram_list = []
pairs = nltk.trigrams(Tokens)
for a in pairs:
    for item in word_check:
        if (item in a):
            trigram_list.append(a)

In [125]:
# Chunk the matching trigrams into groups of three; this assumes each
# occurrence of a target word contributes exactly three trigrams.
trigram_Set = [trigram_list[i:i+3] for i in range(0, len(trigram_list), 3)]

In [127]:
trigram_errors = zip(trigram_Set,error_list)

In [128]:
list(trigram_errors)

[([('common', '.', 'Hard-working'),
   ('.', 'Hard-working', 'people'),
   ('Hard-working', 'people', '.')],
  ('Hard-working', 'JJ')),
 ([(',', 'no', 'action'), ('no', 'action', '.'), ('action', '.', 'They')],
  ('action', 'NN')),
 ([('and', 'no', 'action'), ('no', 'action', '.'), ('action', '.', 'And')],
  ('action', 'NN')),
 ([("I'm", 'a', 'conservative'),
   ('a', 'conservative', ','),
   ('conservative', ',', 'actually')],
  ('conservative', 'JJ')),
 ([('actually', 'very', 'conservative'),
   ('very', 'conservative', ','),
   ('conservative', ',', 'and')],
  ('conservative', 'JJ')),
 ([("I'm", 'a', 'Republican'),
   ('a', 'Republican', '.'),
   ('Republican', '.', 'And')],
  ('Republican', 'NP')),
 ([('by', 'our', 'Republican'),
   ('our', 'Republican', 'politicians'),
   ('Republican', 'politicians', '.')],
  ('Republican', 'NP')),
 ([('are', 'the', 'Republican'),
   ('the', 'Republican', 'politicians'),
   ('Republican', 'politicians', 'doing')],
  ('Republican', 'NP')),
 ([('be

The above is the final list of trigrams along with their associated tagging after training the tagger with the new sentences.

We saw that a lot of our tagging for Republican was fixed by the retrained tagger. For our test, 'No Republican Party', Republican was tagged as a Noun by the retrained tagger. However, some occurrences are still tagged as Adjectives, so further training may be required.

Hard-working was also correctly tagged as an Adjective: the group ('common', '.', 'Hard-working'), ('.', 'Hard-working', 'people'), ('Hard-working', 'people', '.') is now paired with ('Hard-working', 'JJ').

A quick look at the original and modified Republican tagging:

Original

In [129]:
for token in tagging_answer[0]:
    if (token[0] in Party_checker_Rep):
        print(token)

('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'JJ')
('Republican', 'NP')


Modified

In [130]:
for token in tagging_modified_answer[0]:
    if (token[0] in Party_checker_Rep):
        print(token)

('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'JJ')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
('Republican', 'NP')
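
Rather than comparing the two printed lists by eye, the change can be summarized by counting how each occurrence's tag moved between the two taggings. The sketch below assumes both taggings cover the same token sequence in the same order; the helper name `tag_shifts` and the toy taggings are illustrative assumptions and are not taken from the speeches.

```python
from collections import Counter

def tag_shifts(before, after, target):
    """Count (old_tag, new_tag) pairs for `target` across two taggings of the same tokens."""
    if len(before) != len(after):
        raise ValueError("the two taggings must cover the same token sequence")
    shifts = Counter()
    for (word, old_tag), (_, new_tag) in zip(before, after):
        if word == target:
            shifts[(old_tag, new_tag)] += 1
    return shifts

# Hypothetical toy taggings of the same five tokens.
toy_before = [("the", "AT"), ("Republican", "JJ"), ("Party", "NN"), ("a", "AT"), ("Republican", "NP")]
toy_after = [("the", "AT"), ("Republican", "NP"), ("Party", "NN"), ("a", "AT"), ("Republican", "NP")]
shifts = tag_shifts(toy_before, toy_after, "Republican")
print(shifts)
```

In the notebook, `tag_shifts(tagging_answer[0], tagging_modified_answer[0], 'Republican')` would report how many occurrences moved from JJ to NP, how many stayed JJ, and whether any moved in the other direction.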
