# Assignment Group 4

## Module B _(49 points)_
For this assignment you will complete a pipeline for the backend of an e-reader enhancement. Specifically, you'll use the provided code to apply Wikifier (which embeds hyperlink references to encyclopedia articles) to the content areas of open-domain HTML eBooks.

__B1.__ _(5 points)_ First, we're going to set up access for Project Gutenberg data. Each book in the Project Gutenberg collection has a unique ID associated with it. So, your first job is to write a function called `download_book(book_id)` that takes as input the `book_id` of a specific Project Gutenberg book and then downloads the full HTML version of the book. 

\[__Hint__: To work out the URL structure, find and observe the URLs of a few (HTML-format) books on Gutenberg: https://www.gutenberg.org/ and generalize the pattern in your code, bringing in the book's ID number as an argument into the final URL as needed.\]
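
The idea of generalizing a URL pattern can be sketched as follows. This is a minimal illustration, not the required solution: the `URL_TEMPLATE` value is an assumption about one layout used on gutenberg.org and should be verified against real book pages, and the helper names `build_book_url` and `fetch_text` are hypothetical.

```python
import urllib.request

# ASSUMPTION: this template reflects one URL layout observed on gutenberg.org.
# Check it against a few real HTML-format books before relying on it.
URL_TEMPLATE = "https://www.gutenberg.org/files/{book_id}/{book_id}-h/{book_id}-h.htm"


def build_book_url(book_id, template=URL_TEMPLATE):
    """Fill the book's ID number into the URL pattern (hypothetical helper)."""
    return template.format(book_id=book_id)


def fetch_text(url, timeout=30):
    """Download a page and decode it to text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Keeping the template as a parameter makes it easy to swap in a different pattern if the one assumed here does not match the books you inspect.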

__B2.__ _(2 points)_ Let's use the function you wrote in the previous part to download a book! Use this to access _Alice in Wonderland_ (ID number `19033`). To prove your code's function, exhibit the first 1000 characters of the resulting document below.

__B3.__ _(7 pts)_ Now that we have a downloader, the next step is to wrap it in another function that manages the files we've downloaded as a local, file-based database.

In particular, write a function called `get_html(book_id)` that does the following:

1. Checks to see if a corresponding file (i.e., one named `book_id + '.html'`) exists in the `./data/html/` directory, and
    1. loads the html file as text (only if it exists!!) or
    2. uses `download_book()` to access the document from Gutenberg (only if the file doesn't exist!) and stores the accessed html copy in `./data/html/`,
2. but always `return`s the text of the book requested by the user.

\[__Hint.__ To check and see if a file exists, you can use the `os` module's `os.path.exists(PATH)` function.\]
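
The check-then-load-or-download logic is a general caching pattern. Here is a minimal sketch of it, assuming a hypothetical helper `load_or_fetch` that takes the download step as a callable so the pattern can be tried without touching the network; `fake_download` is a stand-in, not `download_book()`.

```python
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return the text stored at `path`; call `fetch()` only if the file is missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = fetch()
    # Create the target directory on first use, then store a local copy.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


calls = []


def fake_download():
    # Stand-in for a real download; records how often it is invoked.
    calls.append(1)
    return "<html>demo</html>"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data", "html", "19033.html")
    first = load_or_fetch(path, fake_download)   # file missing: fetches and stores
    second = load_or_fetch(path, fake_download)  # file present: reads from disk
```

In this demo the second call returns the same text without invoking the download step again.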

__B4.__ _(3 pts)_ To perform Wikifications of our book we'll be using a 'homemade' client module (see __Module A__ for full development) to manage the API made available by [wikifier.org](http://wikifier.org). To use this module, you'll still have to register for a `USER_KEY`. So, follow the instructions at http://wikifier.org/info.html to obtain a `USER_KEY`, and to test your setup apply the module's code to the file `example_text.txt`, inside of `./data/texts/`.

To do this, utilize the two methods:
- `wikification = wikifier.get_wikification(file_name, USER_KEY)`
- `annotated_doc = wikifier.embed_links(wikification)`

Note: the 'homemade' client assumes the plain text (to be wikified) exists inside of `./data/texts/`, and automatically caches any wikification output in `./data/wikifications/`. This file structure is essential!
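
Since the directory layout matters, it may help to create it up front. The following is a small sketch under the assumption that all three folders live under a `data` directory relative to some base path; the helper name `ensure_data_dirs` is hypothetical and not part of the `wikifier` module.

```python
import os

# Sub-directories of ./data/ referred to in this module.
DATA_DIRS = ("html", "texts", "wikifications")


def ensure_data_dirs(base="."):
    """Create the data folders under `base` if missing and return their paths."""
    paths = []
    for name in DATA_DIRS:
        path = os.path.join(base, "data", name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths
```

Calling `ensure_data_dirs()` with no argument works relative to the current directory.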

__B5.__ _(5 pts)_ Now, since our `wikifier` module only works from plain text files, we'll need to have some utility in place that extracts separated paragraphs from a given Project Gutenberg book's html in `./data/html/` and stores them in a corresponding file in `./data/texts/`. In particular, write a function called `extract_paragraphs(book_id)` that does the following:

1. Utilizes `get_html()` to load the given eBook,
2. applies `BeautifulSoup` to extract a list of `paragraph`s as strings, i.e., the string objects returned from the `.text` attributes of `<p>...</p>` tagged blocks, and
3. writes the extracted paragraphs to separate files, called `book_id + '-' + p_num + '.txt'` in `./data/texts/`, where each `p_num` refers to the paragraph's list index.
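
The extraction and file-writing steps can be sketched as below. This is an illustration only: it takes the html as a string rather than calling `get_html()`, the helper name `write_paragraphs` is hypothetical, and it assumes the `bs4` package is installed. Note that an f-string is used for the filename because `p_num` from `enumerate()` is an integer and cannot be concatenated to a string directly.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def write_paragraphs(html, book_id, out_dir):
    """Write each <p> block's text to '<book_id>-<p_num>.txt' inside out_dir."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.text for p in soup.find_all("p")]
    os.makedirs(out_dir, exist_ok=True)
    for p_num, paragraph in enumerate(paragraphs):
        path = os.path.join(out_dir, f"{book_id}-{p_num}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(paragraph)
    return paragraphs


sample_html = "<html><body><h1>Title</h1><p>First.</p><p>Second <i>one</i>.</p></body></html>"
with tempfile.TemporaryDirectory() as tmp:
    paragraphs = write_paragraphs(sample_html, 19033, tmp)
    written = sorted(os.listdir(tmp))
```

The `.text` attribute drops nested tags such as `<i>` and keeps only their string content.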

__B6.__ _(3 points)_ Now we want to write a function called `wikify_text(book_id, p_num, KEY)`, where `KEY` is an argument that refers to your wikifier.org `USER_KEY`, which will allow you to pass it to `wikifier.get_wikification`. Since we're already working with our `wikifier` module's data file structure all we have to do is utilize our new functions with the module's existing methods. In particular, make sure your function does the following:

1. Runs `extract_paragraphs(book_id)` to ensure the plain text content is present in the right directory, and
2. checks to see if a corresponding file (i.e., one named `book_id + '-' + p_num + '.json'`) exists in the `./data/wikifications/` directory, and
    1. loads the json file (only if it exists!!) or
    2. uses our module's wikification method, i.e., `wikification = wikifier.get_wikification(file_name, KEY)`, and stores the `wikification` as a json file called `book_id + '-' + p_num + '.json'` in `./data/wikifications/`,
3. but always `return`s the wikification to the user.
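
The JSON variant of the caching step might look like the sketch below. The helper names `wikification_path` and `load_or_compute_json` are hypothetical, and `fake_wikify` is a stand-in for the real API call; the structure of its return value is made up for the demo and is not the wikifier.org response format.

```python
import json
import os
import tempfile


def wikification_path(book_id, p_num, base="."):
    """Build '<base>/data/wikifications/<book_id>-<p_num>.json'."""
    return os.path.join(base, "data", "wikifications", f"{book_id}-{p_num}.json")


def load_or_compute_json(path, compute):
    """Load JSON from `path` if present; otherwise call `compute()` and store it."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    result = compute()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result


api_calls = []


def fake_wikify():
    # Stand-in for the API request; the returned structure is invented.
    api_calls.append(1)
    return {"annotations": [{"title": "Rabbit"}]}


with tempfile.TemporaryDirectory() as tmp:
    json_path = wikification_path(19033, 7, base=tmp)
    first_result = load_or_compute_json(json_path, fake_wikify)
    second_result = load_or_compute_json(json_path, fake_wikify)
```

Only the first call reaches the stand-in API function; the second is served from the stored file.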

__B7.__ _(10 points)_ Finally (and this is a big one), we'd like to replace html paragraphs with ones that we've annotated with the URLs pointing to the Wikipedia articles. To do so, write a function called `get_enhanced_html(book_id, KEY)` that does the following:

1. Utilizes our `get_html(book_id)` function to access the book's `html` and interprets the book's `html` with BeautifulSoup,
2. uses `BeautifulSoup` to loop over the `html` `paragraph`s, i.e., the `<p>...</p>` tagged blocks with the `enumerate()` function (to capture indices, i.e., `p_num`),
3. utilizes our `wikify_text(book_id, p_num, KEY)` function to access/generate the paragraph's `wikification`,
4. builds an enhanced document for the paragraph, using 
    - `annotated_doc = wikifier.embed_links(wikification)`,
5. encloses each corresponding enhanced paragraph in `<p>` tags, and applies `BeautifulSoup()`, storing the interpreted object as the `enhanced_paragraph`,
6. replaces each original `paragraph` with its corresponding `enhanced_paragraph` using the `.replace_with()` method for `BeautifulSoup` objects, and 
7. after the loop finishes, returns the enhanced html document (`BeautifulSoup` object).

__Note__: For more information on the `.replace_with()` method, check out the documentation at:
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/#replace-with
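
The replacement loop can be tried on a toy document first. In this sketch `fake_embed_links` is a hypothetical stand-in for `wikifier.embed_links`, and `enhance_paragraphs` is a made-up helper that works on an html string instead of calling `get_html()`; it assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup


def fake_embed_links(text):
    # Stand-in annotator: links one known word to its Wikipedia article.
    link = '<a href="https://en.wikipedia.org/wiki/Rabbit">Rabbit</a>'
    return text.replace("Rabbit", link)


def enhance_paragraphs(html, annotate):
    """Replace every <p> block with a re-parsed, annotated copy."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so replacing nodes inside the loop is safe.
    for p_num, paragraph in enumerate(soup.find_all("p")):
        annotated_doc = annotate(paragraph.text)
        # Parse the annotated text and take the resulting <p> tag.
        enhanced_paragraph = BeautifulSoup("<p>" + annotated_doc + "</p>", "html.parser").p
        paragraph.replace_with(enhanced_paragraph)
    return soup


sample = "<html><body><p>Alice was tired.</p><p>The Rabbit ran.</p></body></html>"
enhanced = enhance_paragraphs(sample, fake_embed_links)
```

Here the `<p>` tag is pulled out of the re-parsed fragment before the swap, which is one way to hand `.replace_with()` a single element.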

__B8.__ _(5 points)_ From an efficient access and distribution perspective, explain why part __B3__'s Step 1 A/B and __B6__'s Step 2 A/B make this tool's processing minimal (in conjunction with, e.g., the `./data/texts/` file structure), and most of all, nice to the servers responding to our requests. Record this discussion in the response box below.

_Response_.

__B9.__ _(5 pts)_ It's time to package your code up so someone else can use it! Specifically, the goal here is to place _all of the functions_ defined within this module in a file called `./EnhanceEBook.py`. Doing so will technically create our very own _Python module_!

When this is complete, exhibit your code's function by loading the module, i.e., `import EnhanceEBook`, and executing the one controller method:

- `enhanced_html = EnhanceEBook.get_enhanced_html(book_id, KEY)`

and then, after running `from IPython.core.display import HTML`, use the `HTML` function to display your `enhanced_html`.

Note: Your `./EnhanceEBook.py` Python module must also contain any necessary module loads, including our 'homemade' `wikifier.py` module (in the current directory). Also, since this is being distributed for _others_ to use it absolutely must use relative file paths, assuming the `./data/texts/` and `./data/wikifications/` file structure. So if these items aren't in place in your functions above, you must modify your code!

\[__Hint.__ To write the contents of a cell to a file (instead of executing in the ipython kernel), place the following magic command at the top of the cell: `%%writefile filename.extension`\]
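
To see why a saved `.py` file becomes an importable module, here is a plain-Python sketch that does by hand roughly what `%%writefile` followed by `import` accomplishes. The module name `demo_module` and its `greet` function are invented for the demo, and a temporary directory is used so nothing is left behind.

```python
import importlib
import os
import sys
import tempfile

# Source text for a tiny module; in a notebook, %%writefile would save a cell like this.
module_source = '''
def greet(name):
    return "Hello, " + name
'''

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo_module.py"), "w", encoding="utf-8") as f:
        f.write(module_source)
    sys.path.insert(0, tmp)        # make the folder visible to the import system
    importlib.invalidate_caches()  # the file was created after Python started
    try:
        demo_module = importlib.import_module("demo_module")
        greeting = demo_module.greet("reader")
    finally:
        sys.path.remove(tmp)
```

When the file sits in the current working directory, as `./EnhanceEBook.py` does, a plain `import` statement is typically enough and the `sys.path` step is not needed.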

__B10.__ _(3 points)_ Now that you have a functioning and self-contained data pipeline, think of possible models for distribution of said data. Describe the data distribution model you believe to be the best. Should the software download and store all data locally on a user's machine, or should it always access data on the fly from the web?

_Response._

__B11.__ _(3 points)_ Finally, compare the possible advantages and weaknesses of each of the data distribution models you considered in the previous part.

_Response._