## Introduction to NLP course (2017-2018).

### Peter Weber & Jonatan Piñol

Homework 2.1: Markov Models. Hidden Markov Models and Part of Speech Tagging.

Objectives:

1) Create a tri-gram model for generating pseudo-Trump sentences 
- load the corpus, tokenize it and obtain list of trigrams 
- define a function that obtains the counts of the "model" 
- define a function that generates a pseudo-sentence 
- when generating a sentence, make sure that your sentence fulfils the following requirements
    - it is at least 5 words long
    - the last token of the pseudo-sentence is a ".", "!", or "?"
    - it does not contain any other ".", "!", "?" tokens other than the final one
- print 5 pseudo-sentences

2) Use the built-in n-gram HMM models in nltk to tag a corpus 
- load the brown corpus
- split each category in the corpus to test and train
- for each category in the corpus, train on the train set and evaluate on the test set the following taggers:
    - default
    - affix
    - unigram
    - bigram
    - trigram
    
    Each tagger should have backoff configured on the previous tagger.
    
    Print the results in a table.
    
    
- repeat the previous experiment using universal tagset. Print the results in a table.
- cross evaluate between different genres (train on one category, evaluate on all the other categories). Print and compare the results
- Only for the "news" portion of the corpus, compare
    - the best berforming tagger (with backoff)
    - the naive bayes tagger
    
    Compare the accuracy as well as the execution time.
    
    Use both the universal tagset and the full tagset.

In [2]:
# Import nltk
import nltk
from nltk import bigrams, trigrams

# Import numpy
import numpy as np

# Import time
import time

# Import codecs
import codecs

# Import taggers
from nltk import DefaultTagger, AffixTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk import ClassifierBasedPOSTagger

# Import the brown corpus
from nltk.corpus import brown

# Additional modules
import random

In [6]:
# Homework 2 part 1

def get_markov_stats(trigrams):
    """
    Input: 
        (list) trigrams
    
    Output: 
        (dict: key = str, value = list) 
            key: string of first two words in trigram  
            value: list of all possible third words in trigram
    """
    # Initialize
    markov_stats = {}
    
    for words in trigrams:
        words_list = list(words)
        word_12 = " ".join([words_list[0], words_list[1]])
        word_3 = words_list[2]
        
        # Check if there is an entry for the current word
        if word_12 in markov_stats.keys():
            # If it is, append the second one
            markov_stats[word_12].append(word_3)

        # If it isn't, create it with the corresponding value
        else:
            markov_stats[word_12] = [word_3]
    return(markov_stats)

def generate_sentence(corpus, stats):
    """
    Input:
        corpus: (list) corpus of words
        stats: output of get_markov_stats
    Output:
        prints sentence according to rules given in assignment
    """
    
    # Get first two words of sentence, excluding .!?
    first_bigram = list(random.choice(list(nltk.bigrams(corpus[:-1]))))
    while "." in first_bigram or "!" in first_bigram or "?" in first_bigram or "," in first_bigram or "’" in first_bigram or first_bigram[0].islower():
        first_bigram = list(random.choice(list(nltk.bigrams(corpus[:-1]))))
    
    new_speech = first_bigram

    # Generate a sentence of minimum length 5
    length = 1
    while True:
        # Get next word from previous two words
        next_word = np.random.choice(stats[" ".join(new_speech[-2:])])
        
        # Is sentence shorter than 5 words?
        if length < 5: 
            # If it finds punctuation restart
            if "." in next_word or "!" in next_word or "?" in next_word:
                return(generate_sentence(corpus = corpus, stats = stats))  
            # If no punctuation append word to existing sentence
            else:
                new_speech.append(next_word)
                length += 1
                
        # Is sentence at least 5 words long?
        elif length >= 5:
            # If includes punctuation return
            if "." in next_word or "!" in next_word or "?" in next_word:
                new_speech.append(next_word)
                return(" ".join(new_speech))
            # If no punctuation append word to existing sentence
            else:
                new_speech.append(next_word)

def hw2_part1():
    """
    Output:
        Prints 5 sentences Trump could have said, based on trigram HMM.
    """
    # Trump speeches file location
    fname = "speeches.txt"
    # Read the corpus
    raw_corpus = codecs.open(fname,'r','utf8').read()
    
    # Tokenize the corpus
    corpus = nltk.word_tokenize(raw_corpus)

    # Generate list of trigrams
    trump_trigr = list(nltk.trigrams(corpus))
    
    # Generate model
    markov_stats = get_markov_stats(trump_trigr)
   
    # Generate and print sentences
    for i in range(5):
        sentence = generate_sentence(corpus, markov_stats)
        print(sentence)    

hw2_part1()

SYSTEM IS WONDERFUL IN MANY WAYS , BUT FIVE AT LEAST PEOPLE LOOKING FOR HIM .
THE AIR-CONDITIONERS TO COME IN AND THEY STRIP APART TO USE FOR OUR COUNTRY , BUT THEY 'LL SAY , MAN , DID THEY TAKE THAT .
Israel will force the issue , but I want to go over – well , how long are we get .
`` Well , we had the killing of Jamiel and the decisions that are unbelievable , what he did a big business .
I can ’ t even want to leave on such and such a believer and I hope you all very much everybody .


In [3]:
# Homework 2 part 2
import pandas as pd 

# Function that splits a corpus in train and test
def split_train_test(corpus,test_size=500):
    return corpus[test_size:], corpus[:test_size]

# Dummy function
# Extend and rework

def split_data(category, tag_set):
    """
    Inputs:
        - category (string) category of the brown corpus
        - tag_set (string) 'universal' or any other string that is not 'universal'
    Output:
        - train_tags (list of strings) training tags for default tagger
        - train_tsents (corpus) tagged sentences training set
        - test_tsents (corpus) tagged sentences test set
    
    """
    # Evaluate either universal or full tagset
    if tag_set == 'universal':
        brown_twords = brown.tagged_words(categories=category, tagset = 'universal')
        brown_tsents = brown.tagged_sents(categories=category, tagset = 'universal')
    else: 
        brown_twords = brown.tagged_words(categories=category)
        brown_tsents = brown.tagged_sents(categories=category)

    # Split each category in the brown corpus into train and test
    train_twords, test_twords = split_train_test(brown_twords)
    train_tsents, test_tsents = split_train_test(brown_tsents)
    train_tags = [tag for (token,tag) in train_twords]
    return(train_tags, train_tsents, test_tsents)

def train_taggers(train_tags, train_tsents, test_tsents):
    """
    Inputs: 
        - output of split_data()
    Output:
        - trained taggers
    """
    
    ### For each category, train and evaluate taggers. Use backoff.

    # Default tagger (most frequent class)
    start_time = time.time()
    most_frequent_tag = nltk.FreqDist(train_tags).max()
    default_tagger = nltk.DefaultTagger(most_frequent_tag)
    default_time = time.time() - start_time
    
    # Affix tagger
    start_time = time.time()
    affix_tagger = AffixTagger(train_tsents)
    affix_time = time.time() - start_time
    
    # Unigram tagger
    start_time = time.time()
    unigram_tagger = UnigramTagger(train_tsents, backoff = affix_tagger)
    unigram_time = time.time() - start_time
    
    # Bigram tagger
    start_time = time.time()
    bigram_tagger = BigramTagger(train_tsents, backoff=unigram_tagger)
    bigram_time = time.time() - start_time
    
    # Trigram tagger
    start_time = time.time()
    trigram_tagger = TrigramTagger(train_tsents, backoff=bigram_tagger)
    trigram_time = time.time() - start_time
    
    taggers = [default_tagger, affix_tagger, unigram_tagger, bigram_tagger, trigram_tagger]
    times = [default_time, affix_time, unigram_time, bigram_time, trigram_time]
    return(taggers, times)

def print_and_evaluate_accuracy(tag_set, taggers, category, test_set):
    """
    Inputs:
        - tag_set (string) 'universal' or any other string that is not 'universal'
        - taggers (list) trained taggers
        - test_set (corpus) tsents output of split_data()
    Outputs:
        - taggerstr_list (list of strings) the names of the taggers
        - accuracy_list (list) evaluated accuracies
        - category_list (list of strings)
    """
    
    tagger_strings = ["default_tagger", "affix_tagger", "unigram_tagger", "bigram_tagger", "trigram_tagger"]
    taggerstr_list, accuracy_list, category_list = [], [], []
    
    # Print the statistics for all taggers provided the category
    print("\n")
    for tagger, tagstring in zip(taggers, tagger_strings):
        accuracy = tagger.evaluate(test_set)
        print("The accuracy of the {} tag set {} tagger on the {} category is: {}".\
              format(tag_set, tagstring, category, round(accuracy,2)))

        taggerstr_list.append(tagstring)
        accuracy_list.append(accuracy)
        category_list.append(category)
    return(taggerstr_list, accuracy_list, category_list)
    

def get_tagger_summary(tag_set, start_cat, end_cat):
    """
    Inputs:
        - start_cat (integers) category in the brown corpus to start with
        - end_cat --//-- end with
    """
    test_sents_list, taggers_list, taggerstr_list, accuracy_list, category_list,times_list = [], [], [], [], [], []
    
    for category in brown.categories()[start_cat:end_cat]:
        train_tags, train_tsents, test_tsents = split_data(category, tag_set)
        taggers,times = train_taggers(train_tags, train_tsents, test_tsents)
        taggerstr_, accuracy_, category_ = print_and_evaluate_accuracy(tag_set, taggers, category, test_tsents)
        taggerstr_list.extend(taggerstr_)
        accuracy_list.extend(accuracy_)
        category_list.extend(category_) 
        taggers_list.extend(taggers)
        times_list.extend(times)
        test_sents_list.extend(test_tsents)
        
    summary = pd.DataFrame({
        "tagger": taggerstr_list,
        "accuracy": accuracy_list,
        "category": category_list,
        "runtime" : times_list
                      })
    cross_evaluation = pd.DataFrame({
        "category": category_list,
        "tagger": taggers_list
    })
    return(summary, cross_evaluation)
 
def print_best_tagger(df):
    """
    Input:  
        - df: first output of get_tagger_summary()
    """
    ### Prints which is the best performing tagger for each category in df
    idx = df.groupby(['category'])['accuracy'].transform(max) == df['accuracy']
    max_ = df[idx]
    print("\nThe best performing tagger for each category is given by\n", max_)
    
def train_and_evaluate_nb_tagger(tag_set, category = "news"):
    
    # Train and evaluate nb tagger on the "news" category (full tagset)
    if tag_set == "universal":
        brown_tsents = brown.tagged_sents(categories=category, tagset = "universal")
    else:
        brown_tsents = brown.tagged_sents(categories=category)
    train_tsents, test_tsents = split_train_test(brown_tsents)
    start_time = time.time()
    nb_tagger = ClassifierBasedPOSTagger(train=train_tsents)
    nb_time = time.time() - start_time
    accuracy = nb_tagger.evaluate(test_tsents)

    # Print the performance of the nb tagger and the runtime (full tagset)
    print("\nThe accuracy of the nb tagger using the {} tagset on the news category is: {}".\
          format(tag_set, round(accuracy,2)))
    print("\nThe runtime of the nb tagger using the {} tagset on the news category in seconds is: {}".\
          format(tag_set, round(nb_time,2)))
    
def evaluate_cross_accuracy(model, train_category, test_category, tag_set = 'universal'):
    """
    Inputs:
        - model (df) trained taggers, second output of get_tagger_summary()
    Outputs:
        - summary_cross (df) summarizes all the results of cross evaluation in a df 
    
    """
    _, _, test_sents = split_data(test_category, tag_set)
    taggers = model[model["category"] == train_category]["tagger"]
    accuracies = [round(tagger.evaluate(test_sents),2) for tagger in taggers]        
    train_cat_list = taggers.shape[0] * [train_category]
    test_cat_list = taggers.shape[0] * [test_category]
    
    summary_cross = pd.DataFrame({
        'tagger': taggers,
        'accuracy': accuracies,
        'train_category': train_cat_list,
        'test_category': test_cat_list
    })
    return(summary_cross)
    
def hw2_part2(start, stop):
    
    ### FULL TAGSET
    ftag, cross_eval_ftag = get_tagger_summary('full', start, stop)
        
    ### UNIVERSAL TAGSET
    utag, cross_eval_utag = get_tagger_summary('universal', start, stop)
       
    # Print the performance of the best performing n-gram tagger and the runtime (full tagset)
    print_best_tagger(ftag)
    
    # NB classifier on full tagset
    train_and_evaluate_nb_tagger("full")
    
    # Print the performance of the best performing n-gram tagger and the runtime (universal tagset)
    print_best_tagger(utag)
    
    # NB classifier on universal tagset
    train_and_evaluate_nb_tagger("universal")
    
    ### Cross evaluation

    # Cross-evaluate between categories (using universal tagset)
    df_cross_eval = pd.DataFrame(columns = ['train_category', 'test_category', 'tagger', 'accuracy'])
    
    for train_category in brown.categories()[start:stop]:
        categories = brown.categories()[start:stop]
        categories.remove(train_category)
        for test_category in categories:
            df_cross_eval = pd.concat([df_cross_eval, \
                                       evaluate_cross_accuracy(cross_eval_utag, train_category, test_category)])
    
    # Example: train on news_train, evaluate on the "test" of every other category
    # Do this for all categories in the corpus
    # Print the results
    print("\nThe cross evaluation gives the following result:")
    print(df_cross_eval)
    return(df_cross_eval)

In [9]:
train_tags, train_tsents, test_tsents = split_data('news', 'unviersal')
test_tsents

[[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')], [(u'The', u'AT'), (u'jury', u'NN'), (u'further', u'RBR'), (u'said', u'VBD'), (u'in', u'IN'), (u'term-end', u'NN'), (u'presentments', u'NNS'), (u'that', u'CS'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'Executive', u'JJ-TL'), (u'Committee', u'NN-TL'), (u',', u','), (u'which', u'WDT'), (u'had', u'HVD'), (u'over-all', u'JJ'), (u'charge', u'NN'), (u'of', u'IN'), (u'the', u'AT'), (u'election', u'NN'), (u',', u','), (u'``', u'``'), (u'deserves', u'VBZ'), (u'the', u'AT'), (u'praise', u'NN'), (u'and'

In [4]:
summary = hw2_part2(0, len(brown.categories()))




The accuracy of the full tag set default_tagger tagger on the adventure category is: 0.11
The accuracy of the full tag set affix_tagger tagger on the adventure category is: 0.19
The accuracy of the full tag set unigram_tagger tagger on the adventure category is: 0.89
The accuracy of the full tag set bigram_tagger tagger on the adventure category is: 0.91
The accuracy of the full tag set trigram_tagger tagger on the adventure category is: 0.9


The accuracy of the full tag set default_tagger tagger on the belles_lettres category is: 0.12
The accuracy of the full tag set affix_tagger tagger on the belles_lettres category is: 0.26
The accuracy of the full tag set unigram_tagger tagger on the belles_lettres category is: 0.9
The accuracy of the full tag set bigram_tagger tagger on the belles_lettres category is: 0.91
The accuracy of the full tag set trigram_tagger tagger on the belles_lettres category is: 0.91


The accuracy of the full tag set default_tagger tagger on the editorial categ

In [5]:
# Dataframe containing the cross validation
summary.head() 

Unnamed: 0,accuracy,tagger,test_category,train_category
0,0.22,<DefaultTagger: tag=NOUN>,belles_lettres,adventure
1,0.28,<AffixTagger: size=996>,belles_lettres,adventure
2,0.89,<UnigramTagger: size=3013>,belles_lettres,adventure
3,0.9,<BigramTagger: size=779>,belles_lettres,adventure
4,0.9,<TrigramTagger: size=666>,belles_lettres,adventure
