# Preprocessing Testing Dataset

This notebook is the second pipeline of the N-Gram Language Modeling Analysis Project. It preprocesses the test data from the DUC 2005 dataset.

After this notebook runs, the following output files are saved in the folder output_data/UNK 5-55:
- test_sentences.pkl: List of sentences in the test set after preprocessing, before tagging \<UNK\> words.
- test_sentences_unk.pkl: List of sentences in the test set after preprocessing, after tagging \<UNK\> words.
- unigram_dictionary_test.pkl: Dictionary containing all the unigrams of the test set after preprocessing.
- bigram_dictionary_test.pkl: Dictionary containing all the bigrams of the test set after preprocessing.
- trigram_dictionary_test.pkl: Dictionary containing all the trigrams of the test set after preprocessing.
- fourgram_dictionary_test.pkl: Dictionary containing all the fourgrams of the test set after preprocessing.

The output files of this notebook will be used in the other pipelines: 
- NLP-Assignment_2-analytics.ipynb
- NLP-Assignment_3-results.ipynb

## Import Libraries

In [None]:
import pandas as pd
import nltk
from bs4 import BeautifulSoup
import os
import re
import itertools
import pickle

## Read HTML Documents

### Get the path and name of every file contained in the Test dataset

In [107]:
import os

#Collect the full path of every file in the test set folder (including subfolders)
mypath = "D:/Datos/Documents/Development/Python/Projects/NCF/NLP/Assignment01/DUC 2005 Dataset/TestSet"
#mypath = "/Users/milevavantuyl/Desktop/NLP/Assignment 1/google_drive/DUC 2005 Dataset" # Mileva's path
all_files = []

for path, subdirs, files in os.walk(mypath):
    for name in files:
        all_files.append(os.path.join(path, name))

In [108]:
all_files = [file for file in all_files if '.DS_Store' not in file] # Remove macOS .DS_Store files
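
The substring test above would also drop any file whose path merely contains `.DS_Store`, and it leaves other hidden files in place. A minimal sketch of a stricter filter, assuming hidden files are those whose base name starts with a dot; the helper name `is_hidden` and the sample paths are hypothetical:

```python
import os

def is_hidden(path):
    """Return True when the file name itself (not a parent folder) starts with a dot."""
    return os.path.basename(path).startswith(".")

# Hypothetical sample paths, for illustration only
sample_files = [
    os.path.join("TestSet", "topic_a", "doc_a"),
    os.path.join("TestSet", "topic_a", ".DS_Store"),
    os.path.join("TestSet", ".DS_Store"),
]

visible_files = [path for path in sample_files if not is_hidden(path)]
print(visible_files)
```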

### Join all documents into a single corpus

In [109]:
#html_document = "D:/Datos/Documents/Development/Python/Projects/NCF/NLP/Assignment01/DUC 2005 Dataset/TrainingSet/d301i/FT921-10162"

corpus = []

for file in all_files:
    with open(file, 'r') as f:
        contents = f.read()
        soup = BeautifulSoup(contents, 'html.parser')
        corpus.append(soup.text)

num_test_documents = len(corpus)
print(f"There are {num_test_documents} files in the test corpus.")

There are 270 files in the test corpus.


In [110]:
#corpus is a list with one string per document. Join all of them into a single string
text = ''.join(corpus)
#text

## Preprocessing

### Lowercase

In [111]:
text = text.lower()
#text

### Replace Line Breaks

In [112]:
#Replace \n with a blank space and remove \r
text = text.replace('\n', ' ').replace('\r', '')
#text
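
Replacing each line break with a space leaves runs of consecutive spaces in the text, as the sample outputs further down show. These are harmless for the tokenizer, but if a compact string is preferred, a minimal sketch (not part of the original pipeline) is to collapse every whitespace run into one space; `normalize_whitespace` is a hypothetical helper name:

```python
import re

def normalize_whitespace(raw_text):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return re.sub(r"\s+", " ", raw_text).strip()

# Made-up sample text, for illustration only
sample = "saluting the heroes;\r\n\r\n   deputies honored\n"
print(normalize_whitespace(sample))
```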

### Create List of Sentences

The text is split into sentences before the unknown words are identified (order updated by Mileva).

In [113]:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text)
len(sentences)
sentences[0]

"   la011089-0104   3661    january 10, 1989, tuesday, home edition      metro; part 2; page 3; column 1; metro desk      570 words      saluting the heroes;    deputies honored for risking their lives trying to save others      by william overend, times staff writer      sheriff's deputies tim parker and gean okada risked their lives trying to save  two small children from a burning house in south-central los angeles."

### Remove Special Characters

In [114]:
clean_sentences = []

for sentence in sentences:
    clean = re.sub(r"[^a-zA-Z0-9?! ]+", "", sentence)
    clean_sentences.append(clean)

clean_sentences[0]

'   la0110890104   3661    january 10 1989 tuesday home edition      metro part 2 page 3 column 1 metro desk      570 words      saluting the heroes    deputies honored for risking their lives trying to save others      by william overend times staff writer      sheriffs deputies tim parker and gean okada risked their lives trying to save  two small children from a burning house in southcentral los angeles'
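
The character class keeps only letters, digits, `?`, `!` and spaces, so apostrophes and hyphens are deleted rather than replaced by a space. A small sketch with a made-up sentence shows the effect on possessives, abbreviations and hyphenated words:

```python
import re

def remove_special_characters(sentence):
    # Same pattern as the cell above: delete everything except letters, digits, ?, ! and spaces
    return re.sub(r"[^a-zA-Z0-9?! ]+", "", sentence)

# Made-up sample sentence, for illustration only
sample = "sheriff's deputies in south-central l.a. were honored!"
print(remove_special_characters(sample))
```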

### Tokenize Sentences

In [115]:
# A list of sentences, where each sentence is represented as a list of tokens
tokenized_sentences = []
for sentence in clean_sentences:
    tokens = nltk.word_tokenize(sentence)
    tokenized_sentences.append(tokens)

### Remove long sentences

In [116]:
all_sentences = tokenized_sentences
#Keep only the sentences with fewer than 55 tokens
tokenized_sentences = [sentence for sentence in all_sentences if len(sentence) < 55]

len(all_sentences), len(tokenized_sentences)

(9152, 8886)

### Save Cleaned/Tokenized Sentences

In [118]:
# Save tokenized sentences (before adding the <UNK> tag)
filename = os.path.join("output_data", "UNK 5-55", "test_sentences.pkl")
with open(filename, "wb") as file: 
    pickle.dump(tokenized_sentences, file)

### Tag Unknown Words

Every word that does not appear in the used_words list, which was generated from the training set, is tagged with \<UNK\>.

In [119]:
#Read the vocabulary generated from the training set
with open("output_data/UNK 5-55/used_words.pkl", "rb") as a_file:
    used_words = pickle.load(a_file)
used_words

['ft',
 '07',
 'feb',
 '92',
 'noriega',
 'gains',
 'ground',
 'henry',
 'hamman',
 'focuses',
 'on',
 'the',
 'former',
 'panamanian',
 'leaders',
 'attempts',
 'to',
 'fend',
 'off',
 'us',
 'drugs',
 'charges',
 'by',
 'leader',
 'general',
 'manuel',
 'antonio',
 'noriegas',
 'defence',
 'against',
 'drug',
 'trafficking',
 'in',
 'miami',
 'gained',
 'this',
 'week',
 'a',
 'enforcement',
 'administration',
 'agent',
 'acknowledged',
 'that',
 'man',
 'identified',
 'prosecution',
 'as',
 'medelln',
 'cocaine',
 'cartel',
 'money',
 'launderer',
 'had',
 'been',
 'arrested',
 'basis',
 'of',
 'information',
 'from',
 'forces',
 'commanded',
 'gen',
 'testimony',
 'mr',
 'james',
 'l',
 'bramble',
 'charge',
 'panama',
 'operations',
 'august',
 '1982',
 'until',
 'june',
 '1984',
 'raised',
 'questions',
 'about',
 'earlier',
 'claims',
 'protected',
 'cartels',
 'return',
 'for',
 'payoffs',
 'prosecutors',
 'cited',
 'ramon',
 'milian',
 'rodriguez',
 'conduit',
 'bank',
 'accou

In [120]:
len(used_words)

22734

In [121]:
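#Membership tests are much faster on a set than on a list with thousands of words
used_words_set = set(used_words)

tokenized_sentences_unk = []
for sentence in tokenized_sentences:
    l_replace_all = ['<UNK>' if word not in used_words_set else word for word in sentence]
    tokenized_sentences_unk.append(l_replace_all)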

#tokenized_sentences_unk[0]
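
As a toy illustration of the tagging step, the sketch below uses a made-up vocabulary and sentence. It stores the vocabulary in a `set`, since membership tests on a set take constant time on average whereas a list is scanned element by element; `tag_unknown` is a hypothetical helper name:

```python
def tag_unknown(sentence, vocabulary, unk_token="<UNK>"):
    """Replace every token that is not in the vocabulary with the unknown tag."""
    return [token if token in vocabulary else unk_token for token in sentence]

# Made-up vocabulary and sentence, for illustration only
vocabulary = {"the", "deputies", "were", "honored"}
sentence = ["the", "brave", "deputies", "were", "honored"]

print(tag_unknown(sentence, vocabulary))
```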

In [122]:
# Save test sentences with the <UNK> tags 
filename = os.path.join("output_data/UNK 5-55/", "test_sentences_unk.pkl")
with open(filename, "wb") as file: 
    pickle.dump(tokenized_sentences_unk, file)

## Generate N-Grams

### Unigrams

For unigrams, sentence boundaries do not matter, so we flatten all sentences into a single list of tokens and count each token in a dictionary.

In [123]:
#Join elements of the list tokenized_sentences_unk (that contains each sentence unigrams) into a single list
unigrams = list(itertools.chain.from_iterable(tokenized_sentences_unk))
#unigrams

#### Generate Unigram Dictionary

In [124]:
from nltk.probability import FreqDist

unigrams_dict = FreqDist(unigrams)
unigrams_dict_top_20 = unigrams_dict.most_common(20)
print("Top 20 unigrams and frequency: \n", unigrams_dict_top_20)

Top 20 unigrams and frequency: 
 [('the', 11299), ('<UNK>', 11210), ('of', 5254), ('to', 4868), ('a', 4264), ('and', 4259), ('in', 3905), ('that', 1888), ('for', 1875), ('is', 1760), ('on', 1260), ('by', 1204), ('it', 1166), ('was', 1119), ('said', 1099), ('as', 1076), ('are', 1007), ('with', 997), ('he', 976), ('at', 941)]


In [125]:
#Convert FreqDist to dictionary
unigrams_dict = dict(unigrams_dict)

# Add <s> and </s> tags to the unigram dict
num_sentences = len(tokenized_sentences_unk)
unigrams_dict['<s>'] = num_sentences
unigrams_dict['</s>'] = num_sentences
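
`FreqDist` is a subclass of `collections.Counter`, so the counting step can be reproduced with the standard library alone. The sketch below uses two made-up sentences and also shows why `<s>` and `</s>` each receive a count equal to the number of sentences: every sentence contributes exactly one start marker and one end marker.

```python
from collections import Counter
from itertools import chain

# Made-up tokenized sentences, for illustration only
toy_sentences = [["the", "<UNK>", "said"], ["the", "deputies"]]

# Flatten the sentences and count each token
toy_unigrams = dict(Counter(chain.from_iterable(toy_sentences)))

# One start marker and one end marker per sentence
toy_unigrams["<s>"] = len(toy_sentences)
toy_unigrams["</s>"] = len(toy_sentences)

print(toy_unigrams)
```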

### Bigrams

#### Add Padding Symbols to Sentences

In [126]:
from nltk.util import pad_sequence

bi_tokens_padding = []

for sentence in tokenized_sentences_unk:
    e = list(pad_sequence(sentence,
                     pad_left=True, left_pad_symbol="<s>",
                     pad_right=True, right_pad_symbol="</s>",
                     n=2))
    bi_tokens_padding.append(e)

#bi_tokens_padding[0]

#### Generate Bigram Dictionary

In [127]:
bigrams = []
for sentence in bi_tokens_padding:
    bigrams.append(list(nltk.bigrams(sentence)))

#bigrams

In [128]:
#Join elements of the list bigrams (that contains each sentence bigrams) into a single list
bigrams = list(itertools.chain.from_iterable(bigrams))
#bigrams

In [129]:
bigrams_dict = FreqDist(bigrams)
bigrams_dict_top_20 = bigrams_dict.most_common(20)
print("Top 20 bigrams and frequency: \n", bigrams_dict_top_20)

Top 20 bigrams and frequency: 
 [(('of', 'the'), 1355), (('<s>', 'the'), 1221), (('<UNK>', '<UNK>'), 1170), (('the', '<UNK>'), 1062), (('in', 'the'), 963), (('<UNK>', '</s>'), 826), (('<s>', '<UNK>'), 572), (('<UNK>', 'and'), 568), (('a', '<UNK>'), 488), (('to', 'the'), 466), (('of', '<UNK>'), 456), (('and', '<UNK>'), 429), (('on', 'the'), 387), (('for', 'the'), 358), (('said', '</s>'), 343), (('<s>', 'but'), 333), (('<s>', 'in'), 326), (('<UNK>', 'of'), 317), (('by', 'the'), 295), (('<UNK>', 'the'), 293)]


In [130]:
#Convert FreqDist to dictionary
bigrams_dict = dict(bigrams_dict)

# Add <s><s> and </s></s> bigrams to the bigram dict
num_sentences = len(tokenized_sentences_unk)
bigrams_dict[('<s>', '<s>')] = num_sentences
bigrams_dict[('</s>', '</s>')] = num_sentences
#bigrams_dict
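
One plausible use of these extra entries, an assumption since the downstream notebooks are not shown here, is as denominators when estimating the probability of the first word of a sentence under a trigram model. The sketch below plugs in two counts reported in this notebook (8886 sentences kept, 1221 occurrences of `('<s>', '<s>', 'the')`):

```python
# Counts taken from the outputs in this notebook
num_sentences = 8886    # sentences kept after the length filter
trigram_count = 1221    # count of ('<s>', '<s>', 'the')

# The context ('<s>', '<s>') occurs once per sentence
bigram_counts = {("<s>", "<s>"): num_sentences}

# Maximum-likelihood estimate of P(the | <s> <s>), assumed downstream usage
p_the_at_start = trigram_count / bigram_counts[("<s>", "<s>")]
print(round(p_the_at_start, 4))
```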

### Trigrams

#### Add Padding Symbols to Sentences

In [131]:
tri_tokens_padding = []

#bi_tokens_padding already has one marker per side; padding again with n=2 adds a second one
for sentence in bi_tokens_padding:
    e = list(pad_sequence(sentence,
                     pad_left=True, left_pad_symbol="<s>",
                     pad_right=True, right_pad_symbol="</s>",
                     n=2))
    tri_tokens_padding.append(e)

#tri_tokens_padding[0]
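
`pad_sequence` adds `n - 1` markers on each side, so padding the already padded bigram sentences with `n=2` gives the two markers per side that trigrams need. The sketch below uses `pad_tokens`, a hypothetical pure-Python stand-in for `pad_sequence`, to check that padding twice with `n=2` matches padding the raw sentence once with `n=3`:

```python
def pad_tokens(tokens, n, left="<s>", right="</s>"):
    """Hypothetical stand-in for pad_sequence: n - 1 markers on each side."""
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)

# Made-up sentence, for illustration only
raw = ["the", "deputies"]
padded_twice = pad_tokens(pad_tokens(raw, n=2), n=2)
padded_once = pad_tokens(raw, n=3)

print(padded_twice == padded_once)
```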

#### Generate Trigram Dictionary

In [132]:
trigrams = []
for sentence in tri_tokens_padding:
    trigrams.append(list(nltk.trigrams(sentence)))

#trigrams[0]

In [133]:
#Join elements of the list trigrams (that contains each sentence trigrams) into a single list
trigrams = list(itertools.chain.from_iterable(trigrams))
#trigrams

In [134]:
trigrams_dict = FreqDist(trigrams)
trigrams_dict_top_20 = trigrams_dict.most_common(20)
print("Top 20 trigrams and frequency: \n", trigrams_dict_top_20)

Top 20 trigrams and frequency: 
 [(('<s>', '<s>', 'the'), 1221), (('<UNK>', '</s>', '</s>'), 826), (('<s>', '<s>', '<UNK>'), 572), (('said', '</s>', '</s>'), 343), (('<s>', '<s>', 'but'), 333), (('<s>', '<s>', 'in'), 326), (('<s>', '<s>', 'it'), 285), (('<s>', '<s>', 'he'), 237), (('<s>', '<s>', 'i'), 227), (('<UNK>', '<UNK>', '<UNK>'), 189), (('<s>', '<s>', 'mr'), 163), (('<s>', '<s>', 'a'), 156), (('<UNK>', 'and', '<UNK>'), 152), (('?', '</s>', '</s>'), 146), (('of', 'the', '<UNK>'), 138), (('<s>', '<s>', 'we'), 136), (('<s>', '<s>', 'they'), 123), (('<s>', '<s>', 'this'), 122), (('<s>', 'the', '<UNK>'), 112), (('<UNK>', 'said', '</s>'), 105)]


In [135]:
#Convert FreqDist to dictionary
trigrams_dict = dict(trigrams_dict)

# Add <s><s><s> and </s></s></s> trigrams to the trigram dict
num_sentences = len(tokenized_sentences_unk)
trigrams_dict[('<s>', '<s>', '<s>')] = num_sentences
trigrams_dict[('</s>', '</s>', '</s>')] = num_sentences

#trigrams_dict

### Four-grams

#### Add Padding Symbols to Sentences

In [136]:
four_tokens_padding = []

#tri_tokens_padding already has two markers per side; padding again with n=2 adds a third one
for sentence in tri_tokens_padding:
    e = list(pad_sequence(sentence,
                     pad_left=True, left_pad_symbol="<s>",
                     pad_right=True, right_pad_symbol="</s>",
                     n=2))
    four_tokens_padding.append(e)

#four_tokens_padding[0]

#### Generate Four-gram Dictionary

In [137]:
from nltk.util import ngrams

fourgrams = []
for sentence in four_tokens_padding:
    fourgrams.append( list(ngrams(sentence, 4)) )

#fourgrams[0]
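
`ngrams(sentence, 4)` slides a window of four tokens over the padded sentence. A minimal pure-Python sketch of the same idea, assuming the input is a list; `sliding_ngrams` is a hypothetical name:

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up padded sentence, for illustration only
padded = ["<s>", "<s>", "<s>", "the", "deputies", "</s>", "</s>", "</s>"]
for gram in sliding_ngrams(padded, 4):
    print(gram)
```

A sentence shorter than the window produces no n-grams, which is one reason the padding markers are added first.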

In [138]:
#Join elements of the list fourgrams (that contains each sentence fourgrams) into a single list
fourgrams = list(itertools.chain.from_iterable(fourgrams))
#fourgrams

In [139]:
fourgrams_dict = FreqDist(fourgrams)
fourgrams_dict_top_20 = fourgrams_dict.most_common(20)
print("Top 20 fourgrams and frequency: \n", fourgrams_dict_top_20)

Top 20 fourgrams and frequency: 
 [(('<s>', '<s>', '<s>', 'the'), 1221), (('<UNK>', '</s>', '</s>', '</s>'), 826), (('<s>', '<s>', '<s>', '<UNK>'), 572), (('said', '</s>', '</s>', '</s>'), 343), (('<s>', '<s>', '<s>', 'but'), 333), (('<s>', '<s>', '<s>', 'in'), 326), (('<s>', '<s>', '<s>', 'it'), 285), (('<s>', '<s>', '<s>', 'he'), 237), (('<s>', '<s>', '<s>', 'i'), 227), (('<s>', '<s>', '<s>', 'mr'), 163), (('<s>', '<s>', '<s>', 'a'), 156), (('?', '</s>', '</s>', '</s>'), 146), (('<s>', '<s>', '<s>', 'we'), 136), (('<s>', '<s>', '<s>', 'they'), 123), (('<s>', '<s>', '<s>', 'this'), 122), (('<s>', '<s>', 'the', '<UNK>'), 112), (('<UNK>', 'said', '</s>', '</s>'), 105), (('<s>', '<s>', '<s>', 'and'), 101), (('<s>', '<s>', '<s>', 'there'), 98), (('<s>', '<s>', '<s>', 'if'), 98)]


In [140]:
#Convert FreqDist to dictionary
fourgrams_dict = dict(fourgrams_dict)
#fourgrams_dict

### Export N-Gram Dictionaries (in pkl files)

In [141]:
import pickle
output_folder = "output_data/UNK 5-55/"
unigram_dictionary_file = "unigram_dictionary_test.pkl"
bigram_dictionary_file = "bigram_dictionary_test.pkl"
trigram_dictionary_file = "trigram_dictionary_test.pkl"
fourgram_dictionary_file = "fourgram_dictionary_test.pkl"

In [142]:
def export_dictionary(dictionary, output_file):
    #The parameter is not named `dict`, which would shadow the built-in type
    with open(output_folder + output_file, "wb") as a_file:
        pickle.dump(dictionary, a_file)

In [143]:
#Export dictionaries
export_dictionary(unigrams_dict, unigram_dictionary_file)
export_dictionary(bigrams_dict, bigram_dictionary_file)
export_dictionary(trigrams_dict, trigram_dictionary_file)
export_dictionary(fourgrams_dict, fourgram_dictionary_file)

#### How to Import pkl data

In [144]:
#Inverse process: Read data
with open(output_folder + bigram_dictionary_file, "rb") as a_file:
    output = pickle.load(a_file)
#output
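
A self-contained round-trip sketch of the export and import steps, using a temporary folder and a made-up dictionary so that it does not touch `output_data`. The `with` statement closes each file even if pickling fails; the file name `bigram_dictionary_toy.pkl` is hypothetical:

```python
import os
import pickle
import tempfile

# Made-up bigram counts, for illustration only
toy_bigrams = {("<s>", "the"): 2, ("the", "</s>"): 1}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "bigram_dictionary_toy.pkl")

    # Export
    with open(path, "wb") as file:
        pickle.dump(toy_bigrams, file)

    # Import
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == toy_bigrams)
```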