### CS5234J: Summative Group Assessment 3
**Goals**: In this assignment you will be practising more advanced tools that will be useful 
in your final projects. The assignment requires you to solve three realistic data processing problems 
on the Spark platform.

**Before you start:**
* This assignment is **summative** coursework.
* It constitutes 4% of the final course mark.
* It consists of 3 questions.
* The answers should be given by filling in blanks in the code cells of a copy of this 
notebook as instructed in the question descriptions and the comments in the code.
* Do **not** create your own cells as these will not be checked!
* Submission deadline is **21 June 2021, 10:00**
* Submit a copy of this notebook with your answers by following the Assignment 3 
submission link on Moodle. For example, if viewing the notebook in Jupyter, select `File->Download as->Notebook (.ipynb)` to download a copy of the notebook.
* Please note that submitting anything other than a copy of this notebook (e.g., a PDF file
or a ZIP archive) will automatically result in your entire submission receiving a mark of 0. 
Likewise, any code cells that do not compile (for whatever reason, including
accidental comments, incorrect indentation, unbalanced parentheses, etc.) will be penalized by deducting the **entire** 
quantity of marks associated with the relevant question. This is in line with the requirements 
of the departmental policy for electronic submissions: 
https://intranet.royalholloway.ac.uk/computerscience/documents/pdf/electronicsubmissionstudentversion.pdf
* You can work in teams of **two** people. 
* If you formed a team for Assignment 1, you **must** work as part of the same team for this assignment and the final project.

**Running the code**
To run the code, we recommend using an instance of the Jupyter Notebook server integrated with 
PySpark, which can be accessed as follows:
* Start NoMachine, and log into `linux.cim.rhul.ac.uk`
* Open a terminal window
* At the prompt, type `ssh -X bigdata`. Note the `X` must be capitalized.
* Type `/home/local/ufac001/pyspark-jupyter.sh` and hit `enter` 
to launch a Jupyter Notebook server integrated with 
PySpark. If everything works as expected, this will open a tab in a web browser through which
you can load and work on the notebook.

As an alternative, you can also use the Databricks Community Edition cloud, but please be
aware that their automated notebook synchronisation may not always work as expected,
potentially resulting in the loss of work. One possible workaround is to connect your notebook
to a Git repository, and then use the provided 
commit interface to force synchronisation as necessary. If 
you would like to follow this route, and need help creating a private repository
on GitHub (available to all RHUL students), please contact the CS Helpdesk.

**Spark Restrictions**
Your solution should use pyspark and the RDD APIs. In particular, you should *not* use
DataFrames/DataSets or SparkSQL as part of your solution.

## Question 1: Spark Term Frequencies (10%)
Write a function `get_term_frequencies(email_lst)` that computes the _term frequency_ for each term (word) that occurs in the body of a collection of emails. The function takes one argument `email_lst`, which is a list of key-value pairs `(EMAIL-ID, BODY)` where `EMAIL-ID` is a string identifier for an email and `BODY` is a string corresponding to the text content of the email body. Your function should return a list where each element is a pair `(EMAIL-ID, [(TERM, FREQ)])`. The key `EMAIL-ID` is an email identifier and the value is itself a list of pairs containing for every term `TERM` in the email its frequency of occurrence _in that email_.

Your solution may make use of the provided helper function `term_freq(body_text)`, which computes the term frequencies for a single email body, returning the result as a dictionary.

Your code should be written as a series of the following Spark transformations:
1. Use `sc.parallelize()` to create a base RDD from `email_lst`
2. Use `mapValues` with the helper function `term_freq` to compute the term frequencies for every email body in the base RDD created in step 1. Each element of the resulting RDD should be a tuple `(EMAIL-ID, {TERM : FREQ})`, such that the term frequencies for each email are stored as a dictionary. 
3. Use `mapValues` again to convert the RDD in step 2 to another key-value pair RDD with elements `(EMAIL-ID, [(TERM, FREQ)])`, where the term frequencies for each email are stored as a list of `(TERM, FREQ)` pairs.
4. Apply a `collect()` action to the result of step 3, and return the resulting list.
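
The four steps above can be cross-checked against a plain-Python computation that uses no Spark at all. The sketch below is only an illustration, not the RDD solution the question requires: `reference_term_frequencies` and the sample input are hypothetical names, and the tokenisation and normalisation are assumed to mirror the provided `term_freq` helper (purely alphabetic tokens, counts divided by the number of distinct terms).

```python
import re
from collections import Counter

# Same pattern as the provided term_freq helper: purely alphabetic tokens only.
WORD = re.compile(r'^[a-zA-Z]+$')

def reference_term_frequencies(email_lst):
    """Plain-Python cross-check (hypothetical helper, not part of the assignment)."""
    result = []
    for email_id, body in email_lst:
        counts = Counter(t for t in body.split() if WORD.match(t))
        n_terms = len(counts)  # number of distinct terms, mirroring term_freq
        result.append((email_id, [(term, c / n_terms) for term, c in counts.items()]))
    return result

# Hypothetical sample input: "42" is dropped because it is not alphabetic.
sample = [("demo/1.txt", "spam spam eggs 42")]
print(reference_term_frequencies(sample))
```

Because the counts are divided by the number of distinct terms rather than by the total number of tokens, the frequencies of one email need not sum to 1. When comparing with the Spark result, it may be safer to compare the per-email lists as dictionaries, since the order of terms is not part of the specification.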

In [1]:
import re
from functools import reduce

def dict_inc(dic, k):
    dic[k] = dic.get(k, 0) + 1
    return dic

def term_freq(text):
    r = re.compile('^[a-zA-Z]+$')
    terms = filter(lambda x: re.match(r, x), text.split())

    tf = reduce(lambda tfs, x: dict_inc(tfs, x), terms, {})
    n_words = len(tf.keys())

    # normalize by the number of distinct terms in the document
    tf_norm = {term :  f / n_words for (term, f) in tf.items()}

    return tf_norm

def get_term_frequencies(email_lst):
    '''
    email_lst: A list of pairs (EMAIL-ID, BODY), one per email
    Returns a list where each element (EMAIL-ID, [(TERM, FREQ)]) contains the term frequency
    FREQ (a float) for every term TERM that occurs in the email, computed
    using a series of Spark operations as described in the question.

    Use `sc` to reference the Spark context.
    '''
    # Your code goes here
    data = sc.parallelize(email_lst)                     # step 1: (EMAIL-ID, BODY)
    return (data.mapValues(term_freq)                    # step 2: (EMAIL-ID, {TERM: FREQ})
                .mapValues(lambda tf: list(tf.items()))  # step 3: (EMAIL-ID, [(TERM, FREQ)])
                .collect())                              # step 4: list on the driver

You can use the following code to test your implementation of `get_term_frequencies()`:

In [2]:
email1 = ("emails/1.txt", """Attached is the Hotsheet.   \nPlease be advised that it is password protected.
                Please review the last Hotsheet email or reply to this email if you forget
                the password.""")

email2 = ("emails/2.txt", """ As you discussed with Jenn Staton, we will be\nreviewing the daily DPR's
                tomorrow morning to determine the May curve shift\nreports that we would like to obtain.
                Also, since earnings release has\nbeen tentatively scheduled for July 12th, we would like
                to review the June\ncurve shift reports to date and make a portion of our June request
                now in\norder to alleviate some of the burden on your group in July.  Let me know\nif
                this poses any problems.""")

email3 = ("emails/3.txt", """Let\'s all conference sometime Monday to sort \nthrough how these trades should be documented,
                booked and otherwise handled.  \nLynn has suggested that ESA is merely a holding company and
                therefore, an \ninappropriate vehicle for holding trades (maybe one or two trades is O.K.).""")
    
print(get_term_frequencies([email1, email2, email3])[:1])
'''
The output produced by the line above when executed with the model implementation
of get_term_frequencies() was as follows (N.B. we only print the first element of the
resulting list):

[('emails/1.txt', 
    [('Attached', 0.05), ('is', 0.1), ('the', 0.15), ('Please', 0.1), ('be', 0.05), ('advised', 0.05), 
    ('that', 0.05), ('it', 0.05), ('password', 0.05), ('review', 0.05), ('last', 0.05), ('Hotsheet', 0.05), ('email', 0.1), 
    ('or', 0.05), ('reply', 0.05), ('to', 0.05), ('this', 0.05), ('if', 0.05), ('you', 0.05), ('forget', 0.05)])]

'''

[('emails/1.txt', [('Attached', 0.05), ('is', 0.1), ('the', 0.15), ('Please', 0.1), ('be', 0.05), ('advised', 0.05), ('that', 0.05), ('it', 0.05), ('password', 0.05), ('review', 0.05), ('last', 0.05), ('Hotsheet', 0.05), ('email', 0.1), ('or', 0.05), ('reply', 0.05), ('to', 0.05), ('this', 0.05), ('if', 0.05), ('you', 0.05), ('forget', 0.05)])]


## Question 2: Spark Inverse Document Frequencies (50%)

Write another function `get_inv_document_frequencies(term_freqs)` that applies further transformations to the dataset returned by the `get_term_frequencies(email_lst)` function from Question 1 to determine the _inverse document frequency_ for each term that occurs in the email dataset. The inverse document frequency is a measure that makes it possible to identify terms that occur infrequently in the dataset.

Your code should be written as a series of the following Spark transformations:

1. Create from `tfRDD` below a new RDD whose elements are of the form `(EMAIL-ID, TERM)`, such that every element `(EMAIL-ID, [(TERM, FREQ)])` of `tfRDD` produces a separate pair in the new RDD for every term in the email. _Hint:_ Use `flatMapValues()`.
2. Using a series of Spark transformations, convert the RDD produced in step 1 to a new RDD of document frequencies. Each element should be a tuple `(TERM, DOC-FREQ)` where `DOC-FREQ` is a _count_ of the number of emails containing the term.
3. Use a `map` transformation to convert each `(TERM, DOC-FREQ)` tuple to a tuple `(TERM, INV-DOC-FREQ)`, where `INV-DOC-FREQ` is the _inverse document frequency_. You may use the helper function `inv_doc_freq` below to compute the inverse document frequency for each term. 
4. Apply a `collect()` action to the result of step 3, and return the resulting list.
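
Steps 2 and 3 can be sanity-checked on a toy dataset in plain Python. This is a sketch under stated assumptions rather than the required Spark solution: `reference_inv_document_frequencies` and the `toy` data are hypothetical, while `inv_doc_freq` repeats the formula of the helper provided with this question.

```python
import math
from collections import Counter

def inv_doc_freq(doc_freq, n):
    # Same formula as the helper provided with this question.
    return max(0, math.log(n / (doc_freq + 1)))

def reference_inv_document_frequencies(term_freqs):
    """Plain-Python cross-check (hypothetical helper, not part of the assignment)."""
    n_docs = len(term_freqs)
    # Each email contributes at most 1 to a term's document frequency.
    doc_freq = Counter(
        term for _, pairs in term_freqs for term in {t for t, _ in pairs}
    )
    return {term: inv_doc_freq(df, n_docs) for term, df in doc_freq.items()}

# Hypothetical toy input in the (EMAIL-ID, [(TERM, FREQ)]) format of Question 1.
toy = [
    ("d1", [("spark", 0.5), ("rdd", 0.5)]),
    ("d2", [("spark", 1.0)]),
    ("d3", [("hadoop", 1.0)]),
]
idf = reference_inv_document_frequencies(toy)
print(idf)
```

With three documents, a term found in a single document gets `log(3/2)`, while a term found in two or more documents gets `0` because of the `+1` in the denominator and the `max(0, ...)` clamp. This matches the pattern of values in the expected output for the three test emails.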

Note that, as in Assignment 2, `get_term_frequencies()` is called first only to prevent errors in your implementation of Question 1 from propagating to the solution of this question: this way, we can use the model implementation of `get_term_frequencies()` for testing. A more efficient solution would avoid materializing the results of `get_term_frequencies()` in the driver, and would instead directly extend the processing steps of Question 1 with further operations. Make sure you understand why this is important!

In [3]:
import math

def inv_doc_freq(doc_freq, n):
    '''
    Helper function that returns the inverse document frequency
    given the term's document frequency and the total
    number of documents.
    doc_freq: the document frequency of a particular term
    n: the total number of documents.
    '''        
    return max(0, math.log(n/(doc_freq+1)))
    
def get_inv_document_frequencies(term_freqs):
    '''
    Computes using a series of Spark transformations a list of 
    inverse document frequency pairs (TERM, INV-DOC-FREQ),
    where TERM is a term.
    
    term_freqs: A list of pairs (EMAIL-ID, [(TERM, FREQ)]) of 
    email identifiers together with their term frequencies.
    '''
    # Your code below
    tfRDD = sc.parallelize(term_freqs)
    doc_count = tfRDD.count()  # total number of emails
    # step 1: (EMAIL-ID, TERM), one pair per term in each email
    term_rdd = tfRDD.flatMapValues(lambda pairs: [term for term, _ in pairs])
    # step 2: (TERM, DOC-FREQ), the number of emails containing each term
    term_doc_freq = term_rdd.map(lambda x: (x[1], x[0])).groupByKey().mapValues(len)
    # step 3: (TERM, INV-DOC-FREQ)
    inverse_doc_freq = term_doc_freq.mapValues(lambda df: inv_doc_freq(df, doc_count))
    return inverse_doc_freq.collect()

You may use the following code to test your implementation of `get_inv_document_frequencies()`:

In [4]:
term_freqs = [('emails/1.txt', 
    [('Attached', 0.05), ('is', 0.1), ('the', 0.15), ('Please', 0.1), ('be', 0.05), ('advised', 0.05), 
    ('that', 0.05), ('it', 0.05), ('password', 0.05), ('review', 0.05), ('last', 0.05), ('Hotsheet', 0.05), ('email', 0.1), 
    ('or', 0.05), ('reply', 0.05), ('to', 0.05), ('this', 0.05), ('if', 0.05), ('you', 0.05), ('forget', 0.05)]), 
('emails/2.txt', 
    [('As', 0.017543859649122806), ('you', 0.017543859649122806), ('discussed', 0.017543859649122806), 
    ('with', 0.017543859649122806), ('Jenn', 0.017543859649122806), ('we', 0.05263157894736842), 
    ('will', 0.017543859649122806), ('be', 0.017543859649122806), ('reviewing', 0.017543859649122806), 
    ('the', 0.07017543859649122), ('daily', 0.017543859649122806), ('tomorrow', 0.017543859649122806), 
    ('morning', 0.017543859649122806), ('to', 0.08771929824561403), ('determine', 0.017543859649122806), 
    ('May', 0.017543859649122806), ('curve', 0.03508771929824561), ('shift', 0.03508771929824561), 
    ('reports', 0.03508771929824561), ('that', 0.017543859649122806), ('would', 0.03508771929824561), 
    ('like', 0.03508771929824561), ('since', 0.017543859649122806), ('earnings', 0.017543859649122806), 
    ('release', 0.017543859649122806), ('has', 0.017543859649122806), ('been', 0.017543859649122806), 
    ('tentatively', 0.017543859649122806), ('scheduled', 0.017543859649122806), ('for', 0.017543859649122806), 
    ('July', 0.017543859649122806), ('review', 0.017543859649122806), ('June', 0.03508771929824561), 
    ('date', 0.017543859649122806), ('and', 0.017543859649122806), ('make', 0.017543859649122806), 
    ('a', 0.017543859649122806), ('portion', 0.017543859649122806), ('of', 0.03508771929824561), 
    ('our', 0.017543859649122806), ('request', 0.017543859649122806), ('now', 0.017543859649122806), 
    ('in', 0.03508771929824561), ('order', 0.017543859649122806), ('alleviate', 0.017543859649122806), 
    ('some', 0.017543859649122806), ('burden', 0.017543859649122806), ('on', 0.017543859649122806), 
    ('your', 0.017543859649122806), ('group', 0.017543859649122806), ('Let', 0.017543859649122806), 
    ('me', 0.017543859649122806), ('know', 0.017543859649122806), ('if', 0.017543859649122806), 
    ('this', 0.017543859649122806), ('poses', 0.017543859649122806), ('any', 0.017543859649122806)]), 
('emails/3.txt', 
    [('all', 0.03125), ('conference', 0.03125), ('sometime', 0.03125), ('Monday', 0.03125), ('to', 0.03125), 
    ('sort', 0.03125), ('through', 0.03125), ('how', 0.03125), ('these', 0.03125), ('trades', 0.09375), ('should', 0.03125), 
    ('be', 0.03125), ('booked', 0.03125), ('and', 0.0625), ('otherwise', 0.03125), ('Lynn', 0.03125), ('has', 0.03125), 
    ('suggested', 0.03125), ('that', 0.03125), ('ESA', 0.03125), ('is', 0.0625), ('merely', 0.03125), ('a', 0.03125), 
    ('holding', 0.0625), ('company', 0.03125), ('an', 0.03125), ('inappropriate', 0.03125), ('vehicle', 0.03125), 
    ('for', 0.03125), ('one', 0.03125), ('or', 0.03125), ('two', 0.03125)])]

print(get_inv_document_frequencies(term_freqs))


'''
The output produced by the line above when executed with the model implementation
of get_inv_document_frequencies() was as follows:

[('is', 0), ('Please', 0.4054651081081644), ('password', 0.4054651081081644), ('last', 0.4054651081081644), ('
this', 0), ('Jenn', 0.4054651081081644), ('we', 0.4054651081081644), ('tomorrow', 0.4054651081081644), ('May',
 0.4054651081081644), ('curve', 0.4054651081081644), ('would', 0.4054651081081644), ('like', 0.405465108108164
4), ('earnings', 0.4054651081081644), ('tentatively', 0.4054651081081644), ('July', 0.4054651081081644), ('mak
e', 0.4054651081081644), ('of', 0.4054651081081644), ('now', 0.4054651081081644), ('in', 0.4054651081081644),
('group', 0.4054651081081644), ('Let', 0.4054651081081644), ('know', 0.4054651081081644), ('sometime', 0.40546
51081081644), ('Monday', 0.4054651081081644), ('these', 0.4054651081081644), ('trades', 0.4054651081081644), (
'booked', 0.4054651081081644), ('suggested', 0.4054651081081644), ('ESA', 0.4054651081081644), ('holding', 0.4
054651081081644), ('an', 0.4054651081081644), ('two', 0.4054651081081644), ('As', 0.4054651081081644), ('you',
 0), ('discussed', 0.4054651081081644), ('with', 0.4054651081081644), ('will', 0.4054651081081644), ('be', 0),
 ('reviewing', 0.4054651081081644), ('the', 0), ('daily', 0.4054651081081644), ('morning', 0.4054651081081644)
, ('to', 0), ('determine', 0.4054651081081644), ('shift', 0.4054651081081644), ('reports', 0.4054651081081644)
, ('that', 0), ('since', 0.4054651081081644), ('release', 0.4054651081081644), ('has', 0), ('been', 0.40546510
81081644), ('scheduled', 0.4054651081081644), ('for', 0), ('review', 0), ('June', 0.4054651081081644), ('date'
, 0.4054651081081644), ('and', 0), ('a', 0), ('portion', 0.4054651081081644), ('our', 0.4054651081081644), ('r
equest', 0.4054651081081644), ('order', 0.4054651081081644), ('alleviate', 0.4054651081081644), ('some', 0.405
4651081081644), ('burden', 0.4054651081081644), ('on', 0.4054651081081644), ('your', 0.4054651081081644), ('me
', 0.4054651081081644), ('if', 0), ('poses', 0.4054651081081644), ('any', 0.4054651081081644), ('all', 0.40546
51081081644), ('conference', 0.4054651081081644), ('sort', 0.4054651081081644), ('through', 0.4054651081081644
), ('how', 0.4054651081081644), ('should', 0.4054651081081644), ('otherwise', 0.4054651081081644), ('Lynn', 0.
4054651081081644), ('merely', 0.4054651081081644), ('company', 0.4054651081081644), ('inappropriate', 0.405465
1081081644), ('vehicle', 0.4054651081081644), ('one', 0.4054651081081644), ('or', 0), ('Attached', 0.405465108
1081644), ('advised', 0.4054651081081644), ('it', 0.4054651081081644), ('Hotsheet', 0.4054651081081644), ('ema
il', 0.4054651081081644), ('reply', 0.4054651081081644), ('forget', 0.4054651081081644)]
'''

[('Jenn', 0.4054651081081644), ('we', 0.4054651081081644), ('tomorrow', 0.4054651081081644), ('May', 0.4054651081081644), ('curve', 0.4054651081081644), ('would', 0.4054651081081644), ('like', 0.4054651081081644), ('earnings', 0.4054651081081644), ('tentatively', 0.4054651081081644), ('July', 0.4054651081081644), ('make', 0.4054651081081644), ('of', 0.4054651081081644), ('now', 0.4054651081081644), ('in', 0.4054651081081644), ('group', 0.4054651081081644), ('Let', 0.4054651081081644), ('know', 0.4054651081081644), ('this', 0), ('sometime', 0.4054651081081644), ('Monday', 0.4054651081081644), ('these', 0.4054651081081644), ('trades', 0.4054651081081644), ('booked', 0.4054651081081644), ('suggested', 0.4054651081081644), ('ESA', 0.4054651081081644), ('is', 0), ('holding', 0.4054651081081644), ('an', 0.4054651081081644), ('two', 0.4054651081081644), ('Please', 0.4054651081081644), ('password', 0.4054651081081644), ('last', 0.4054651081081644), ('Attached', 0.4054651081081644), ('the', 0),

## Question 3: Spark TF-IDF index (30%)
Finally, write a function `tf_idf_index(term_freqs, inv_doc_freqs)` to compute the _term-frequency-inverse-document-frequency_ (TF-IDF) for every term in every email body (i.e. a TF-IDF index). Each TF-IDF value increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the dataset that contain the word, which helps to adjust for the fact that some words appear more frequently in general (e.g. 'the', 'and', etc.). The TF-IDF index you will compute could potentially be used to find relevant emails given a keyword search over the dataset (e.g. by ranking each email based on the sum of the TF-IDF scores for each term in the keyword search in that email).
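
The keyword search mentioned above can be sketched in plain Python as follows. This is only an illustration of the ranking idea: `rank_emails` and the toy index are hypothetical, and the index is assumed to have the `(EMAIL-ID, [(TERM, TF-IDF)])` shape that this question asks for.

```python
def rank_emails(index, query_terms):
    """Rank emails by the sum of the TF-IDF scores of the query terms.

    Hypothetical helper for illustration; not part of the assignment.
    """
    scores = []
    for email_id, pairs in index:
        weights = dict(pairs)  # TERM -> TF-IDF for this email
        # Terms that do not occur in the email contribute nothing.
        scores.append((email_id, sum(weights.get(t, 0.0) for t in query_terms)))
    # Highest score first; ties are broken by email identifier.
    return sorted(scores, key=lambda s: (-s[1], s[0]))

# Hypothetical toy index with made-up scores.
toy_index = [
    ("d1", [("spark", 0.2), ("rdd", 0.1)]),
    ("d2", [("spark", 0.05), ("join", 0.3)]),
    ("d3", [("hadoop", 0.4)]),
]
print(rank_emails(toy_index, ["spark", "rdd"]))
```

Here `d1` ranks first because it contains both query terms, `d2` second because it contains only one, and `d3` last with a score of zero.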

Your code should be written as a series of Spark transformations. Starting from `tfsRDD` and `idfsRDD` below, create a new RDD where each element is a key-value pair of the form `(EMAIL-ID, [(TERM, TF-IDF)])`, i.e. the key is an email identifier and the value is a list of `(TERM, TF-IDF)` tuples, such that every term in the email is paired with its associated `TF-IDF` score _for that email_. Note that the `TF-IDF` of a term in an email is simply the product of its term frequency in that email (`FREQ` from Question 1) and its inverse document frequency across all emails (`INV-DOC-FREQ` from Question 2). Finally, sort the resulting RDD by its key (`EMAIL-ID`) and return the result as a list.
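
The product described above can be cross-checked without Spark by using a dictionary lookup in place of the join. This is a sketch for checking results only, not the required RDD solution: `reference_tf_idf_index` and the toy inputs are hypothetical, and every term in `term_freqs` is assumed to have an entry in `inv_doc_freqs`.

```python
def reference_tf_idf_index(term_freqs, inv_doc_freqs):
    """Plain-Python cross-check (hypothetical helper, not part of the assignment)."""
    idf = dict(inv_doc_freqs)  # TERM -> INV-DOC-FREQ lookup table
    index = [
        (email_id, [(term, freq * idf[term]) for term, freq in pairs])
        for email_id, pairs in term_freqs
    ]
    return sorted(index, key=lambda kv: kv[0])  # sort by EMAIL-ID

# Hypothetical toy inputs in the formats produced by Questions 1 and 2.
toy_tfs = [("d2", [("spark", 1.0)]), ("d1", [("spark", 0.5), ("rdd", 0.5)])]
toy_idfs = [("spark", 0), ("rdd", 0.4054651081081644)]
toy_result = reference_tf_idf_index(toy_tfs, toy_idfs)
print(toy_result)
```

The order of the `(TERM, TF-IDF)` pairs within one email is not fixed by the question and may differ between a Spark run and this reference, so converting each list to a dictionary before comparing is a reasonable precaution.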

Note that, as with the previous question, `get_term_frequencies()` and `get_inv_document_frequencies()` are called first only to prevent errors in your implementations of Questions 1 and 2 from propagating to the solution of this question: this way, we can use the model implementations of `get_term_frequencies()` and `get_inv_document_frequencies()` for testing. A more efficient solution would avoid materializing results in the driver, and would instead directly extend the processing steps of Questions 1 and 2 with further operations. Make sure you understand why this is important!

In [6]:
def tf_idf_index(term_freqs, inv_doc_freqs):
    tfsRDD = sc.parallelize(term_freqs)
    idfsRDD = sc.parallelize(inv_doc_freqs)
    # Your code here
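    return (tfsRDD.flatMapValues(lambda pairs: pairs)         # (EMAIL-ID, (TERM, FREQ))
                  .map(lambda x: (x[1][0], (x[1][1], x[0])))  # (TERM, (FREQ, EMAIL-ID))
                  .join(idfsRDD)                              # (TERM, ((FREQ, EMAIL-ID), IDF))
                  # re-key by email: (EMAIL-ID, (TERM, FREQ * IDF))
                  .map(lambda x: (x[1][0][1], (x[0], x[1][0][0] * x[1][1])))
                  .groupByKey()
                  .mapValues(list)                            # (EMAIL-ID, [(TERM, TF-IDF)])
                  .sortByKey()
                  .collect())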

You may use the following code to test your implementation of `tf_idf_index()`:

In [7]:
term_freqs = [('emails/1.txt', 
    [('Attached', 0.05), ('is', 0.1), ('the', 0.15), ('Please', 0.1), ('be', 0.05), ('advised', 0.05), 
    ('that', 0.05), ('it', 0.05), ('password', 0.05), ('review', 0.05), ('last', 0.05), ('Hotsheet', 0.05), ('email', 0.1), 
    ('or', 0.05), ('reply', 0.05), ('to', 0.05), ('this', 0.05), ('if', 0.05), ('you', 0.05), ('forget', 0.05)]), 
('emails/2.txt', 
    [('As', 0.017543859649122806), ('you', 0.017543859649122806), ('discussed', 0.017543859649122806), 
    ('with', 0.017543859649122806), ('Jenn', 0.017543859649122806), ('we', 0.05263157894736842), 
    ('will', 0.017543859649122806), ('be', 0.017543859649122806), ('reviewing', 0.017543859649122806), 
    ('the', 0.07017543859649122), ('daily', 0.017543859649122806), ('tomorrow', 0.017543859649122806), 
    ('morning', 0.017543859649122806), ('to', 0.08771929824561403), ('determine', 0.017543859649122806), 
    ('May', 0.017543859649122806), ('curve', 0.03508771929824561), ('shift', 0.03508771929824561), 
    ('reports', 0.03508771929824561), ('that', 0.017543859649122806), ('would', 0.03508771929824561), 
    ('like', 0.03508771929824561), ('since', 0.017543859649122806), ('earnings', 0.017543859649122806), 
    ('release', 0.017543859649122806), ('has', 0.017543859649122806), ('been', 0.017543859649122806), 
    ('tentatively', 0.017543859649122806), ('scheduled', 0.017543859649122806), ('for', 0.017543859649122806), 
    ('July', 0.017543859649122806), ('review', 0.017543859649122806), ('June', 0.03508771929824561), 
    ('date', 0.017543859649122806), ('and', 0.017543859649122806), ('make', 0.017543859649122806), 
    ('a', 0.017543859649122806), ('portion', 0.017543859649122806), ('of', 0.03508771929824561), 
    ('our', 0.017543859649122806), ('request', 0.017543859649122806), ('now', 0.017543859649122806), 
    ('in', 0.03508771929824561), ('order', 0.017543859649122806), ('alleviate', 0.017543859649122806), 
    ('some', 0.017543859649122806), ('burden', 0.017543859649122806), ('on', 0.017543859649122806), 
    ('your', 0.017543859649122806), ('group', 0.017543859649122806), ('Let', 0.017543859649122806), 
    ('me', 0.017543859649122806), ('know', 0.017543859649122806), ('if', 0.017543859649122806), 
    ('this', 0.017543859649122806), ('poses', 0.017543859649122806), ('any', 0.017543859649122806)]), 
('emails/3.txt', 
    [('all', 0.03125), ('conference', 0.03125), ('sometime', 0.03125), ('Monday', 0.03125), ('to', 0.03125), 
    ('sort', 0.03125), ('through', 0.03125), ('how', 0.03125), ('these', 0.03125), ('trades', 0.09375), ('should', 0.03125), 
    ('be', 0.03125), ('booked', 0.03125), ('and', 0.0625), ('otherwise', 0.03125), ('Lynn', 0.03125), ('has', 0.03125), 
    ('suggested', 0.03125), ('that', 0.03125), ('ESA', 0.03125), ('is', 0.0625), ('merely', 0.03125), ('a', 0.03125), 
    ('holding', 0.0625), ('company', 0.03125), ('an', 0.03125), ('inappropriate', 0.03125), ('vehicle', 0.03125), 
    ('for', 0.03125), ('one', 0.03125), ('or', 0.03125), ('two', 0.03125)])]



inv_doc_freqs = [('is', 0), ('Please', 0.4054651081081644), ('password', 0.4054651081081644), ('last', 0.4054651081081644), 
                 ('this', 0), ('Jenn', 0.4054651081081644), ('we', 0.4054651081081644), ('tomorrow', 0.4054651081081644), 
                 ('May', 0.4054651081081644), ('curve', 0.4054651081081644), ('would', 0.4054651081081644), 
                 ('like', 0.4054651081081644), ('earnings', 0.4054651081081644), ('tentatively', 0.4054651081081644), 
                 ('July', 0.4054651081081644), ('make', 0.4054651081081644), ('of', 0.4054651081081644), 
                 ('now', 0.4054651081081644), ('in', 0.4054651081081644), ('group', 0.4054651081081644), 
                 ('Let', 0.4054651081081644), ('know', 0.4054651081081644), ('sometime', 0.4054651081081644), 
                 ('Monday', 0.4054651081081644), ('these', 0.4054651081081644), ('trades', 0.4054651081081644), 
                 ('booked', 0.4054651081081644), ('suggested', 0.4054651081081644), ('ESA', 0.4054651081081644), 
                 ('holding', 0.4054651081081644), ('an', 0.4054651081081644), ('two', 0.4054651081081644), 
                 ('As', 0.4054651081081644), ('you', 0), ('discussed', 0.4054651081081644), ('with', 0.4054651081081644), 
                 ('will', 0.4054651081081644), ('be', 0), ('reviewing', 0.4054651081081644), ('the', 0), 
                 ('daily', 0.4054651081081644), ('morning', 0.4054651081081644), ('to', 0), ('determine', 0.4054651081081644), 
                 ('shift', 0.4054651081081644), ('reports', 0.4054651081081644), ('that', 0), ('since', 0.4054651081081644), 
                 ('release', 0.4054651081081644), ('has', 0), ('been', 0.4054651081081644), ('scheduled', 0.4054651081081644), 
                 ('for', 0), ('review', 0), ('June', 0.4054651081081644), ('date', 0.4054651081081644), ('and', 0), 
                 ('a', 0), ('portion', 0.4054651081081644), ('our', 0.4054651081081644), ('request', 0.4054651081081644), 
                 ('order', 0.4054651081081644), ('alleviate', 0.4054651081081644), ('some', 0.4054651081081644), 
                 ('burden', 0.4054651081081644), ('on', 0.4054651081081644), ('your', 0.4054651081081644), 
                 ('me', 0.4054651081081644), ('if', 0), ('poses', 0.4054651081081644), ('any', 0.4054651081081644), 
                 ('all', 0.4054651081081644), ('conference', 0.4054651081081644), ('sort', 0.4054651081081644), 
                 ('through', 0.4054651081081644), ('how', 0.4054651081081644), ('should', 0.4054651081081644), 
                 ('otherwise', 0.4054651081081644), ('Lynn', 0.4054651081081644), ('merely', 0.4054651081081644), 
                 ('company', 0.4054651081081644), ('inappropriate', 0.4054651081081644), ('vehicle', 0.4054651081081644), 
                 ('one', 0.4054651081081644), ('or', 0), ('Attached', 0.4054651081081644), ('advised', 0.4054651081081644), 
                 ('it', 0.4054651081081644), ('Hotsheet', 0.4054651081081644), ('email', 0.4054651081081644), 
                 ('reply', 0.4054651081081644), ('forget', 0.4054651081081644)]

print(tf_idf_index(term_freqs, inv_doc_freqs)[:1])

'''
The output produced by the line above when executed with the model implementation
of tf_idf_index() was as follows (N.B. we only print the first element of the
resulting list):

[('emails/1.txt', [('last', 0.02027325540540822), ('this', 0.0), ('is', 0.0), ('Please', 0.04054651081081644), 
    ('password', 0.02027325540540822), ('Attached', 0.02027325540540822), ('the', 0.0), ('be', 0.0), ('that', 0.0), 
    ('email', 0.04054651081081644), ('to', 0.0), ('you', 0.0), ('forget', 0.02027325540540822), 
    ('advised', 0.02027325540540822), ('it', 0.02027325540540822), ('review', 0.0), ('Hotsheet', 0.02027325540540822), 
    ('or', 0.0), ('reply', 0.02027325540540822), ('if', 0.0)])]

'''

[('emails/1.txt', [('last', 0.02027325540540822), ('this', 0.0), ('is', 0.0), ('Please', 0.04054651081081644), ('password', 0.02027325540540822), ('Attached', 0.02027325540540822), ('the', 0.0), ('be', 0.0), ('that', 0.0), ('email', 0.04054651081081644), ('to', 0.0), ('you', 0.0), ('forget', 0.02027325540540822), ('advised', 0.02027325540540822), ('it', 0.02027325540540822), ('review', 0.0), ('Hotsheet', 0.02027325540540822), ('or', 0.0), ('reply', 0.02027325540540822), ('if', 0.0)])]
