# Assignment Group 4

## Module A _(54 points)_
In this module, you'll build a reusable toolkit for enhancing and summarizing plain text. Specifically, you'll use an API service called Wikifier to embed hyperlinks to encyclopedia articles in a document.

__A1.__ _(2 points)_ Throughout this module, we'll work with the Wikifier API, which is "a web service that takes a text document as input and annotates it with links to relevant Wikipedia concepts". Before the actual work can begin, you need to request access to this API at the following link: http://wikifier.org/info.html. Read the documentation there, familiarize yourself with it, and make sure to register for a `userKey`.
Store your `userKey` in the cell below:

__A2.__ _(7 points)_ We're going to make requests to Wikifier. For this, create a function that takes a block of text and `userKey` as input, and outputs the JSON response given by the Wikification API (per the following specifications) interpreted as a Python object (deserialized). __Important__: For full credit you must read the Wikifier docs: http://wikifier.org/info.html; specifically, you must determine how to build a URL that sets the `'applyPageRankSqThreshold'` field to `True`, the corresponding `'pageRankSqThreshold'` field to `0.8`, and the language field to `'en'`.
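
As a starting point, the request parameters can be assembled without sending anything. The sketch below is not a full solution: the endpoint `http://www.wikifier.org/annotate-article`, the parameter name `lang`, and the lowercase string `"true"` are assumptions to verify against the Wikifier docs, and `build_wikifier_request` is a hypothetical helper name.

```python
import urllib.parse

# Assumed endpoint; confirm it against http://wikifier.org/info.html before use.
WIKIFIER_URL = "http://www.wikifier.org/annotate-article"

def build_wikifier_request(text, user_key):
    """Return (url, body) for a POST request; nothing is sent here."""
    params = {
        "userKey": user_key,
        "text": text,
        "lang": "en",  # assumed name of the language field
        "applyPageRankSqThreshold": "true",  # assumed string form of True
        "pageRankSqThreshold": "0.8",
    }
    # URL-encode the parameters so arbitrary text is transmitted safely.
    body = urllib.parse.urlencode(params).encode("utf-8")
    return WIKIFIER_URL, body

url, body = build_wikifier_request("Hello world", "my-key")
print(body.decode("utf-8"))
```

From here, one option is to send the request with `urllib.request.urlopen(urllib.request.Request(url, data=body))` and deserialize the response text with `json.loads`.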

__A3.__ _(4 points)_ Check to make sure your function works! Load the small block of text provided under `./data/texts/example_text.txt`, and apply your `Wikify()` function. Print out the result. Does everything look correct, according to the docs? Discuss the output in the response box.

_Response_.

__A4.__ _(7 points)_ Now we want to make our code reusable, and we don't want to cause any server problems for wikifier.org. So, write a function called `get_wikification(filename, KEY)` that accepts the `filename` of a text file (assuming a `'.txt'` extension) from `'./data/texts/'` and a Wikifier `userKey`, and does the following:

1. Checks to see if a corresponding file (with a `'.json'` extension) exists in the `'./data/wikifications/'` directory, and
    1. loads the text's wikification from the JSON file (only if it exists!), or
    2. requests the text's wikification from wikifier.org (only if the file from __1.A__ doesn't exist!) and
        1. stores the newly received wikification in `'./data/wikifications/'` under the corresponding file name,
2. but always `return`s the text's wikification to the user.

\[__Hint.__ For __1.A__ vs. __1.B__ in the above steps, you can check to see if a file exists using the `os` module's
`os.path.exists(file_path)` function.\]
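
The check-then-load-or-fetch pattern in the steps above can be sketched generically. This is a minimal illustration, not the required `get_wikification()`: `load_or_fetch` and `fake_fetch` are hypothetical names, and the fetch step is a stand-in for the real Wikifier request.

```python
import json
import os
import tempfile

def load_or_fetch(json_path, fetch):
    """Return cached JSON from json_path, calling fetch() only on a cache miss."""
    if os.path.exists(json_path):
        # Cache hit: read the stored result instead of contacting the server.
        with open(json_path, "r", encoding="utf-8") as f:
            return json.load(f)
    # Cache miss: fetch once, store the result, then return it.
    result = fetch()
    os.makedirs(os.path.dirname(json_path) or ".", exist_ok=True)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result

calls = []

def fake_fetch():
    # Stand-in for a network request; records how often it is called.
    calls.append(1)
    return {"words": ["hi"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wikifications", "example_text.json")
    first = load_or_fetch(path, fake_fetch)
    second = load_or_fetch(path, fake_fetch)

print(first == second, len(calls))
```

The second call is served from disk, so the stand-in fetch runs only once.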


__A5.__ _(6 points)_ Now, to work with the Wikifier output we'll need to go through a few steps. The annotations (embedded links) provided by Wikifier reference specific document indices, both for words (of their tokenization) and for characters. We'll work on the character level because it's more precise. To get this off the ground, make a function called `build_doc(wiki)` that takes a Wikifier output (`wiki`) and rebuilds the original document from the `wiki['words']` and `wiki['spaces']` fields.

\[__Hint.__ These two fields alternate,  with `wiki['spaces']` having exactly one more element than `wiki['words']`, i.e., a rebuilt document will join the elements of the two with the first element coming from `wiki['spaces']`\].
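
The alternation described in the hint can be tried on hand-made data first. The sketch below is one possible way to interleave the two lists; the `toy` dictionary is invented for illustration and is not real Wikifier output.

```python
def build_doc(wiki):
    """Interleave spaces and words, starting with the first space element."""
    spaces, words = wiki["spaces"], wiki["words"]
    pieces = [spaces[0]]
    # Each word is followed by the space element that comes after it.
    for word, space in zip(words, spaces[1:]):
        pieces.append(word)
        pieces.append(space)
    return "".join(pieces)

# Hand-made example: three space elements surround two words.
toy = {"words": ["Hello", "world"], "spaces": ["", ", ", "!"]}
print(build_doc(toy))
```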

__A6.__ _(6 points)_ Now that the document is back together we want to collect the links from the annotations. Write a function called `get_links(wiki)` that processes the `annotation` objects from the `wiki['annotations']` field (a list of `annotation` objects), and creates/outputs a data object called links, which is a list of tuples:
```
links = [(chFrom, chTo, hyperlink), ...]
```
Hyperlinks will come from `annotation["url"]`, and the `chFrom` and `chTo` elements represent character ranges corresponding to embeddings of `annotation["url"]` in the document. Note that there are multiple character ranges: one for each `support` object in `annotation["support"]` (also a list). Specifically, each `support` object in the annotation has two character indices, keyed as `support["chFrom"]` and `support["chTo"]`, which designate a range (in character indices) over which the hyperlink could be embedded.
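
To make the nesting concrete, here is a sketch that flattens annotations and their supports into the `links` format. The `toy` data is hand-made and only mimics the fields named above; real responses contain additional keys.

```python
def get_links(wiki):
    """Collect one (chFrom, chTo, url) tuple per support of every annotation."""
    links = []
    for annotation in wiki["annotations"]:
        for support in annotation["support"]:
            links.append((support["chFrom"], support["chTo"], annotation["url"]))
    return links

# Hand-made example: the first annotation has two supports, the second has one.
toy = {
    "annotations": [
        {
            "url": "http://en.wikipedia.org/wiki/Python_(programming_language)",
            "support": [{"chFrom": 0, "chTo": 5}, {"chFrom": 20, "chTo": 25}],
        },
        {
            "url": "http://en.wikipedia.org/wiki/JSON",
            "support": [{"chFrom": 10, "chTo": 13}],
        },
    ]
}

for link in get_links(toy):
    print(link)
```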

__A7.__ _(5 points)_ Your next job is to create an `embed_link(doc, link)` function that takes the full document output from __A5__ and a given link (list element) from __A6__'s output, and returns an updated document with the hyperlink embedded. Exhibit the function of `embed_link` using `IPython`'s `HTML()` function: `from IPython.core.display import HTML`
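
A single embedding amounts to string slicing around the character range. The sketch below assumes that `chTo` is the index of the last character of the range (inclusive); check this against the Wikifier docs and your own output. The document and URL are hand-made.

```python
def embed_link(doc, link):
    """Wrap the characters chFrom..chTo of doc in an HTML anchor tag."""
    ch_from, ch_to, url = link
    end = ch_to + 1  # assumption: chTo is inclusive, so the slice ends one past it
    return (
        doc[:ch_from]
        + '<a href="' + url + '">'
        + doc[ch_from:end]
        + "</a>"
        + doc[end:]
    )

doc = "Python is fun."
linked = embed_link(doc, (0, 5, "https://en.wikipedia.org/wiki/Python"))
print(linked)
```

In a notebook, passing `linked` to `HTML()` renders the result as a clickable link.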

__A8.__ _(2 points)_ The responses received from Wikifier are really only _predictions_ of where links should go. This means that multiple link predictions might span overlapping portions of the source text. So, we'll have to make sure our code does some light decision making to avoid overlapping hyperlinks and broken HTML code. Review the Wikifier response data and discuss any viable approach to dealing with the possible overlapping-link problem in the response box below. \[__Hint__: One such approach is taken in problem __A9__, which you may discuss here. If you do so, you are required to provide a detailed description of _how_ __A9__ resolves the overlapping-link problem and any limitations of its approach.\]

__Note__: Even if you have not yet generated any Wikifier output in the earlier problems, you may complete this part for full credit using the sample data provided in `./data/example_text.json`.

_Response_.

__A9.__ _(8 points)_ Here is where we're going to build a wrapper function, called `embed_links()`, to manage the link embedding process, being sure to avoid the overlapping link problem. Specifically, your function _must_ do the following, in order:

1. Take a Wikifier response, `wiki`, as input.
2. Build the document using `build_doc()` and store its output as a string called `doc`.
3. Apply the `get_links()` function, storing the `links` output.
4. _Reverse_ sort the `links` by their ending points, i.e., `support["chTo"]` values.
5. Loop through the _sorted_ `links`.
6. Conditionally use the `embed_link()` function to embed links into `doc` in the loop. Specifically, only embed a link if its `support["chTo"]` value is less than the previous one's `support["chFrom"]` value, or the link to embed is the first in the loop.
7. `return` the link-enhanced `doc` and exhibit its output using `HTML()`.
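
Steps 4 through 6 can be sketched on their own, starting from an already built `doc` and `links`. This is an illustration under assumptions, not the required function: it takes `doc` and `links` directly to stay self-contained, treats `chTo` as inclusive, and reads "the previous one" as the most recently embedded link. The name `embed_links_sketch` and the sample data are hypothetical.

```python
def embed_links_sketch(doc, links):
    """Embed non-overlapping links, working from the end of doc to the start."""
    # Step 4: reverse sort by ending point (chTo is the second tuple element).
    ordered = sorted(links, key=lambda link: link[1], reverse=True)
    last_from = None  # chFrom of the most recently embedded link
    # Step 5: loop through the sorted links.
    for ch_from, ch_to, url in ordered:
        # Step 6: embed the first link, or any link ending before the last one starts.
        if last_from is None or ch_to < last_from:
            end = ch_to + 1  # assumption: chTo is inclusive
            doc = doc[:ch_from] + f'<a href="{url}">' + doc[ch_from:end] + "</a>" + doc[end:]
            last_from = ch_from
    return doc

doc = "New York City is large."
links = [
    (0, 7, "https://en.wikipedia.org/wiki/New_York_(state)"),  # "New York"
    (0, 12, "https://en.wikipedia.org/wiki/New_York_City"),    # "New York City"
    (17, 21, "https://en.wikipedia.org/wiki/Size"),            # "large"
]
print(embed_links_sketch(doc, links))
```

Here the two candidate links over "New York" overlap, so only the one with the later ending point is embedded.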

__Note__: If you have not yet completed the earlier problems, you can complete this part for full credit using the sample file, `./data/example_text.json`. This file has the correct input format for your function (a Wikifier JSON output).

__A10.__ _(2 points)_ The `embed_links()` function's design overcomes an additional challenge. Your job here is to explain in detail how it does so. Here's the challenge: since we are using character indices to insert the links over ranges of the input document, adding the markup required for HTML _necessarily_ creates a longer document and shifts the document's characters away from their original index positions. How does __A9__'s design avoid this _index modification_ problem and simply utilize the exact character indices provided by Wikifier?

_Response_.

__A11.__ _(5 points)_ It's time to package your code up so someone else can use it! Specifically, the goal here is to place _all of the functions_ defined within this module in a file called `./wikifier.py`. Doing so will technically create our very own _Python module_!

When this is complete, exhibit your code's function by loading the module, i.e., `import wikifier`, and executing the two controller methods:

- `wikification = wikifier.get_wikification("example_text.txt", USER_KEY)`
- `annotated_doc = wikifier.embed_links(wikification)`

Note: Your `./wikifier.py` Python file must also contain any necessary module imports, and since it is being distributed for _others_ to use, it absolutely must use relative file paths, assuming the `./data/texts/` and `./data/wikifications/` file structure. So if these items aren't in place in your functions above, you must modify your code!

\[__Hint.__ To write the contents of a cell to a file (instead of executing it in the IPython kernel), place the following magic command at the top of the cell: `%%writefile filename.extension`\]