
Your task is to train *character-level* language models.
You will train unigram, bigram, and trigram character-level models on a collection of books from Project Gutenberg, then use these trained English language models to distinguish English documents from Brazilian Portuguese documents in the test set.

In [1]:
import pandas as pd
import httpimport
import collections

with httpimport.remote_repo(['lm_helper'], 'https://raw.githubusercontent.com/jasoriya/CS6120-PS2-support/master/utils/'):
  from lm_helper import get_train_data, get_test_data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package mac_morpho to /root/nltk_data...
[nltk_data]   Unzipping corpora/mac_morpho.zip.


This code loads the training and test data. Each dataset is a list of books. Each book contains a list of sentences, and each sentence contains a list of words. For building a character language model, you should join the words of a sentence together with a space character.
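
As a minimal sketch of that joining step, the snippet below turns one book into per-sentence character sequences. The `toy_book` data and the `sentence_to_chars` helper are hypothetical stand-ins, not part of `lm_helper`.

```python
# Hypothetical stand-in for one book: a list of sentences, each a list of words.
toy_book = [["The", "cat", "sat", "."], ["It", "purred", "."]]

def sentence_to_chars(sentence):
    """Join the words of one sentence with single spaces, then split into characters."""
    return list(" ".join(sentence))

# One character list per sentence; spaces between words are kept as characters.
char_sentences = [sentence_to_chars(s) for s in toy_book]
print(char_sentences[0][:4])
```

Keeping the space as an ordinary character lets the model learn which characters tend to start and end words.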

In [0]:
# get the train and test data
train = get_train_data()
test, test_files = get_test_data()

## 1.1
Collect statistics on the unigram, bigram, and trigram character counts.

If your machine takes a long time to perform this computation, you may save these counts to files in your GitHub repository and load them on request. This is not necessary, however.

In [3]:
# Collect unigram statistics over word tokens (list of character counts per word, average character count)
def get_unigram_char_counts():
  char_counts = []
  for document in train:
    for sentence in document:
      for word in sentence:
        char_counts.append(len(word))
  return char_counts

def get_unigram_char_avg(char_counts_list):
  # Mean number of characters per unigram
  return sum(char_counts_list) / len(char_counts_list)

# Collect bigram statistics over word tokens (list of character counts, average character count)
def get_bigram_list():
  bigrams_list = []
  for document in train:
    for sentence in document:
      for i in range(len(sentence) - 1):
        bigrams_list.append([sentence[i], sentence[i + 1]])
  return bigrams_list

def get_bigram_char_counts(bigrams_list):
  char_counts = []
  for bigram in bigrams_list:
    bigram_char_count = len(bigram[0]) + len(bigram[1])
    char_counts.append(bigram_char_count)
  return char_counts

def get_bigram_char_avg(char_counts_list):
  # Mean number of characters per bigram
  return sum(char_counts_list) / len(char_counts_list)

# Collect trigram statistics over word tokens (list of character counts, average character count)
def get_trigram_list():
  trigrams_list = []
  for document in train:
    for sentence in document:
      for i in range(len(sentence) - 2):
        trigrams_list.append([sentence[i], sentence[i + 1], sentence[i + 2]])
  return trigrams_list

def get_trigram_char_counts(trigrams_list):
  char_counts = []
  for trigram in trigrams_list:
    trigram_char_count = len(trigram[0]) + len(trigram[1]) + len(trigram[2])
    char_counts.append(trigram_char_count)
  return char_counts

def get_trigram_char_avg(char_counts_list):
  # Mean number of characters per trigram
  return sum(char_counts_list) / len(char_counts_list)


def main():
  print("Character Count Statistics for ngrams")
  print("UNIGRAMS")
  print("Average unigram character count: ", get_unigram_char_avg(get_unigram_char_counts()))
  print("BIGRAMS")
  print("Average bigram character count: ", get_bigram_char_avg(get_bigram_char_counts(get_bigram_list())))
  print("TRIGRAMS")
  print("Average trigram character count: ", get_trigram_char_avg(get_trigram_char_counts(get_trigram_list())))


main()


Character Count Statistics for ngrams
UNIGRAMS
Average unigram character count:  3.6186308183165288
BIGRAMS
Average bigram character count:  7.366206767270403
TRIGRAMS
Average trigram character count:  11.087873364372285
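
The averages above measure characters per word n-gram. If the goal is instead a frequency table of character n-grams, one possible approach is sketched below. It assumes sentences are joined with single spaces and uses no sentence-boundary padding; `char_ngram_counts` and `toy_train` are hypothetical names introduced for illustration.

```python
from collections import Counter

def char_ngram_counts(books, n):
    """Count character n-grams over books given as lists of sentences of words."""
    counts = Counter()
    for book in books:
        for sentence in book:
            text = " ".join(sentence)
            # Slide a window of width n across the joined sentence.
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts

# Hypothetical toy corpus: one book with two sentences.
toy_train = [[["a", "cat"], ["a", "cab"]]]
unigrams = char_ngram_counts(toy_train, 1)
bigrams = char_ngram_counts(toy_train, 2)
trigrams = char_ngram_counts(toy_train, 3)
print(len(unigrams), len(bigrams), len(trigrams))
```

Calling the same function with `train` in place of `toy_train` would produce the three count tables that the later smoothing steps need.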


## 1.2
Calculate the perplexity for each document in the test set using the linear interpolation smoothing method. To determine the λs for linear interpolation, divide the training data into a new training set (80%) and a held-out set (20%), then run a grid search:
choose ~10 values of λ and evaluate each on the held-out data.
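
The following is a rough sketch of interpolation plus grid search on toy strings, not a reference solution. It assumes each document has already been flattened to a single string, applies add-one smoothing to the unigram term so that no probability is zero (a choice made for this sketch only), and uses hypothetical helper names throughout.

```python
import math
from collections import Counter
from itertools import product

def count_ngrams(texts, n):
    """Count character n-grams in a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts

def train_counts(texts):
    uni, bi, tri = (count_ngrams(texts, n) for n in (1, 2, 3))
    return {"uni": uni, "bi": bi, "tri": tri, "total": sum(uni.values())}

def interpolated_prob(trigram, m, lambdas):
    """l1 * P(c3) + l2 * P(c3 | c2) + l3 * P(c3 | c1 c2)."""
    l1, l2, l3 = lambdas
    c1c2, c2, c2c3, c3 = trigram[:2], trigram[1], trigram[1:], trigram[2]
    # Assumption of this sketch: add-one smoothing on the unigram term,
    # with one extra vocabulary slot for characters unseen in training.
    p_uni = (m["uni"][c3] + 1) / (m["total"] + len(m["uni"]) + 1)
    p_bi = m["bi"][c2c3] / m["uni"][c2] if m["uni"][c2] else 0.0
    p_tri = m["tri"][trigram] / m["bi"][c1c2] if m["bi"][c1c2] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

def perplexity(texts, m, lambdas):
    """Per-character perplexity, accumulated in log space to avoid overflow."""
    log_sum, n = 0.0, 0
    for text in texts:
        for i in range(len(text) - 2):
            log_sum += math.log(interpolated_prob(text[i:i + 3], m, lambdas))
            n += 1
    return math.exp(-log_sum / n)

# Ten candidate weight triples, each positive and summing to 1.
grid = [(a / 10, b / 10, (10 - a - b) / 10)
        for a, b in product((1, 3, 5, 7), repeat=2) if a + b < 10]

# Hypothetical toy data standing in for the 80% / 20% split.
new_train_texts = ["the cat sat on the mat", "the dog sat on the log"]
held_out_texts = ["the cat sat on the log"]

model = train_counts(new_train_texts)
best_lambdas = min(grid, key=lambda lam: perplexity(held_out_texts, model, lam))
best_ppl = perplexity(held_out_texts, model, best_lambdas)
print(best_lambdas, round(best_ppl, 3))
```

Once the weights are chosen on held-out data, the same `perplexity` function can score each test document with those weights held fixed.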

Some documents in the test set are in Brazilian Portuguese. Identify them as follows: 
  - Sort by perplexity and set a cut-off threshold. Every document whose perplexity is above this threshold should be categorized as Brazilian Portuguese.
  - Print the file names (from `test_files`) and perplexities of the documents above the threshold.

    ```
        file name, score
        file name, score
        . . .
        file name, score
    ```

  - Copy this list of filenames and manually annotate them as being correctly or incorrectly labeled as Portuguese.
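
The sorting and thresholding steps above might look like the sketch below. The file names, scores, and threshold are made up for illustration; in practice the pairs would come from zipping `test_files` with the computed perplexities.

```python
# Hypothetical (file name, perplexity) pairs standing in for real model output.
scores = [("doc_a.txt", 8.4), ("doc_b.txt", 21.7), ("doc_c.txt", 9.1), ("doc_d.txt", 19.3)]
threshold = 15.0  # illustrative value; choose it after inspecting the sorted scores

# Highest perplexity first, so the likely non-English documents come at the top.
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
flagged = [(name, ppl) for name, ppl in ranked if ppl > threshold]

for name, ppl in flagged:
    print(f"{name}, {ppl:.2f}")
```

A visible gap in the sorted scores is usually a reasonable place to put the cut-off.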




In [26]:
# Tokenize entire dataset
tokenized_train = []
for document in train:
  for sentence in document:
    for word in sentence:
      tokenized_train.append(word)
# print(len(tokenized_train))
# Contains 2,621,785 tokens

# Split training set into training (80%) and held-out (20%) sets
new_training = []
held_out = []

# print(len(train))
# length of train set is 18 documents; 80% is approx. 14

new_training = train[0:14]
held_out = train[14:]

# new_training contains 14 documents (~78%), and held_out contains 4 (~22%).

import math

# Create a unigram language model over word tokens to generate perplexities
def create_unigram_model():
  counts = collections.Counter(tokenized_train)
  num = float(sum(counts.values()))
  # Tokens never seen in training fall back to a fixed probability of 0.01,
  # so the model is not a normalized distribution.
  unigram_model = collections.defaultdict(lambda: 0.01)
  for word in counts:
    unigram_model[word] = counts[word] / num
  return unigram_model

# Calculate the perplexity of one training document under the unigram model.
def calculate_perplexity(document_number, unigram_model):
  tokenized_document = []
  for sentence in new_training[document_number]:
    for word in sentence:
      tokenized_document.append(word)
  # Sum log-probabilities rather than multiplying inverse probabilities;
  # the running product overflows to inf on a document of this size.
  log_prob_sum = 0.0
  for unigram in tokenized_document:
    log_prob_sum += math.log(unigram_model[unigram])
  N = len(tokenized_document)
  return math.exp(-log_prob_sum / N)


# def sort_by_perplexity():

calculate_perplexity(1, create_unigram_model())

## 1.3
Build a trigram language model with add-λ smoothing (use λ = 0.1).

Sort the test documents by perplexity and perform a check for Brazilian Portuguese documents as above:

  - Observe the perplexity scores and set a cut-off threshold. Every document whose perplexity is above this threshold should be categorized as Brazilian Portuguese.
  - Print the file names and perplexities of the documents above the threshold.

  ```
      file name, score
      file name, score
      . . .
      file name, score
  ```

  - Copy this list of filenames and manually annotate them for correctness.
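
One way to approach the add-λ trigram model is sketched below on toy strings. It assumes each document is a single string, reserves one extra vocabulary slot for characters unseen in training, and uses the hypothetical class name `AddLambdaTrigram`; it is an illustration under those assumptions rather than the expected solution.

```python
import math
from collections import Counter

class AddLambdaTrigram:
    """Character trigram model with add-lambda smoothing."""

    def __init__(self, texts, lam=0.1):
        self.lam = lam
        self.tri = Counter()      # trigram counts
        self.context = Counter()  # counts of the two-character contexts
        vocab = set()
        for text in texts:
            vocab.update(text)
            for i in range(len(text) - 2):
                self.tri[text[i:i + 3]] += 1
                self.context[text[i:i + 2]] += 1
        # Assumption: one extra slot so unseen characters get probability mass.
        self.vocab_size = len(vocab) + 1

    def prob(self, trigram):
        # P(c3 | c1 c2) = (count(c1 c2 c3) + lam) / (count(c1 c2) + lam * V)
        num = self.tri[trigram] + self.lam
        den = self.context[trigram[:2]] + self.lam * self.vocab_size
        return num / den

    def perplexity(self, text):
        # Accumulate in log space so long documents do not overflow.
        n = len(text) - 2
        log_sum = sum(math.log(self.prob(text[i:i + 3])) for i in range(n))
        return math.exp(-log_sum / n)

# Hypothetical toy data; real use would pass the joined training books.
lm = AddLambdaTrigram(["the cat sat on the mat", "the dog sat on the log"], lam=0.1)
english_ppl = lm.perplexity("the cat sat on the log")
portuguese_ppl = lm.perplexity("o gato sentou no tapete")
print(round(english_ppl, 2), round(portuguese_ppl, 2))
```

Even on this tiny corpus the Portuguese sentence scores a higher perplexity than the English one, which is the signal the threshold step relies on.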

In [0]:
# Trigram language model with add-lambda smoothing


## 1.4
Based on your observations from the questions above, compare linear interpolation and add-λ smoothing by listing their pros and cons.

[Your text here.]