### Data Types and Functions Exercises - W/out Answers


#### Case: Gutenberg Project Popular Downloads

Project Gutenberg is a project that aims to distribute public-domain e-books online. It is useful because it gives you free online access to many works of classic literature.


Its most popular downloads page lists the most frequently downloaded e-books on the website: https://www.gutenberg.org/ebooks/search/%3Fsort_order%3Ddownloads


We prepared a small application to scrape some information from that page. The dataset looks like this:

```
{
    "b_0": {"book_name": "A Christmas Carol in Prose; Being a Ghost Story of Christmas",
        "book_author": "Charles Dickens",
        "book_link": "/ebooks/46",
        "book_downloads": "65436 downloads",
        "book_image": "/cache/epub/46/pg46.cover.small.jpg"},
    "b_1": {"book_name": "Pride and Prejudice",
        "book_author": "Jane Austen",
        "book_link": "/ebooks/1342",
        "book_downloads": "39523 downloads",
        "book_image": "/cache/epub/1342/pg1342.cover.small.jpg"}
}

```
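
As a minimal sketch of how a dataset shaped like this can be handled with the `json` library, the snippet below parses the two-book excerpt from a string and reads a few fields. The names `raw_json`, `sample_books` and `first_book` are hypothetical and used only for illustration; the real dataset comes from the scraper.

```python
import json

# The two-book excerpt from above, stored as a JSON string.
raw_json = """
{
    "b_0": {"book_name": "A Christmas Carol in Prose; Being a Ghost Story of Christmas",
        "book_author": "Charles Dickens",
        "book_link": "/ebooks/46",
        "book_downloads": "65436 downloads",
        "book_image": "/cache/epub/46/pg46.cover.small.jpg"},
    "b_1": {"book_name": "Pride and Prejudice",
        "book_author": "Jane Austen",
        "book_link": "/ebooks/1342",
        "book_downloads": "39523 downloads",
        "book_image": "/cache/epub/1342/pg1342.cover.small.jpg"}
}
"""

# json.loads turns the JSON text into nested Python dictionaries.
sample_books = json.loads(raw_json)

first_book = sample_books["b_0"]
print(first_book["book_author"])            # Charles Dickens
print(type(first_book["book_downloads"]))   # <class 'str'>
```

Note that `book_downloads` is a string, not a number, which matters for the sorting exercise later on.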

##### Goal:

We will mostly use Python to practice the following topics:

* Using json, re, os, copy libraries
* List and Dict comprehensions
* Using Functions
* Loops
* Reading and Writing Files
    * .json
    * .txt
    * .csv
* And some general Pythonic conventions
    * import this
    * requirements.txt
    * import modules


##### Exercises Preparation:

Go to your terminal and follow these steps:

1. Change the working directory to the current path:
```
cd <THIS_JUPYTER_NOTEBOOK_PATH>
```

2. Activate your Python 3.7 environment:
```
conda activate <YOUR_PYTHON_3.7_VENV_NAME>
```

3. Install required packages to your environment:
```
pip3 install -r requirements.txt
```

In [3]:
from data_collector.gutenberg_scraper import create_main_dict_for_popular_books

MAIN_PAGE = "https://www.gutenberg.org"
PATH_GUTENBERG_MOST_POP = (MAIN_PAGE +
                           "/ebooks/search/%3Fsort_order%3Ddownloads")

main_dict_for_popular_books = create_main_dict_for_popular_books(PATH_GUTENBERG_MOST_POP)

2020-02-06 09:54:04: INFO Cannot get the book author in: Beowulf: An Anglo-Saxon Epic Poem


In [None]:
from data_collector.gutenberg_scraper import (
    get_book_text_link, request_page_content)

In [35]:
import re

WORD_REGEX = re.compile(r"(\w+)")


def simple_tokenizer(text):
    """Lowercase the text and return its word tokens as a list."""
    return WORD_REGEX.findall(text.lower())
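
To get a feel for what the tokenizer returns, here is a small sketch. The function is repeated so that the snippet runs on its own, and the input sentence is made up for illustration.

```python
import re


def simple_tokenizer(text):
    # Same logic as the notebook cell: lowercase, then collect runs of \w characters.
    word_regex = re.compile(r"(\w+)")
    return re.findall(word_regex, text.lower())


tokens = simple_tokenizer("Hello, World! It's 2020.")
print(tokens)  # ['hello', 'world', 'it', 's', '2020']
```

Two things are worth noticing: punctuation is dropped, so an apostrophe splits "It's" into two tokens, and digits count as word characters.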

### Exercises

**1** Write a function that creates a list of tuples, including only the book_name and book_downloads.

So the function is going to take:

```
{
    "b_0": {"book_name": "A Christmas Carol in Prose; Being a Ghost Story of Christmas",
        "book_author": "Charles Dickens",
        "book_link": "/ebooks/46",
        "book_downloads": "65436 downloads",
        "book_image": "/cache/epub/46/pg46.cover.small.jpg"},
    "b_1": {"book_name": "Pride and Prejudice",
        "book_author": "Jane Austen",
        "book_link": "/ebooks/1342",
        "book_downloads": "39523 downloads",
        "book_image": "/cache/epub/1342/pg1342.cover.small.jpg"}
}
```

And return:

```
[('A Christmas Carol in Prose; Being a Ghost Story of Christmas',
  '65436 downloads'),
 ('Pride and Prejudice', '39523 downloads')]
```

In [None]:
results = []
for key, values in main_dict_for_popular_books.items():
    results.append((values["book_name"], values["book_downloads"]))

In [12]:
def create_a_list_of_book_name_and_downloads(popular_books_dict):
    return [(book_item["book_name"], book_item["book_downloads"])
            for book_item in popular_books_dict.values()]

**2** Write a function that sorts the book names and download counts in ascending order of downloads.

* Hint: You can use list sort, by converting the downloads string to an integer.

* Hint2: You can use `create_a_list_of_book_name_and_downloads` from the previous exercise.

So the function is going to take:

```
{
    "b_0": {"book_name": "A Christmas Carol in Prose; Being a Ghost Story of Christmas",
        "book_author": "Charles Dickens",
        "book_link": "/ebooks/46",
        "book_downloads": "65436 downloads",
        "book_image": "/cache/epub/46/pg46.cover.small.jpg"},
    "b_1": {"book_name": "Pride and Prejudice",
        "book_author": "Jane Austen",
        "book_link": "/ebooks/1342",
        "book_downloads": "39523 downloads",
        "book_image": "/cache/epub/1342/pg1342.cover.small.jpg"}
}
```

And return:

```
[(39523, 'Pride and Prejudice'),
 (65436, 'A Christmas Carol in Prose; Being a Ghost Story of Christmas')]
```

In [33]:
def sort_ascending_the_download_counts_per_book(popular_books_dict):
    necessary_info = create_a_list_of_book_name_and_downloads(
        popular_books_dict)
    # "65436 downloads" -> 65436: take the leading number instead of
    # relying on str.strip, which removes a set of characters, not a suffix.
    return sorted([(int(book_download.split()[0]), book_name)
                   for book_name, book_download in necessary_info])
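
An alternative sketch of the sorting step: instead of rebuilding the tuples, you can pass a `key` function to `sorted` and keep the original `(book_name, book_downloads)` pairs. The names `sample_pairs` and `download_count` are hypothetical and used only for this illustration.

```python
# (book_name, book_downloads) pairs in the shape produced by exercise 1.
sample_pairs = [
    ("A Christmas Carol in Prose; Being a Ghost Story of Christmas",
     "65436 downloads"),
    ("Pride and Prejudice", "39523 downloads"),
]


def download_count(pair):
    """Return the leading integer of a string such as '39523 downloads'."""
    return int(pair[1].split()[0])


# Ascending order by download count; the tuples themselves stay unchanged.
sorted_pairs = sorted(sample_pairs, key=download_count)

# reverse=True gives the most downloaded book first.
most_downloaded_first = sorted(sample_pairs, key=download_count, reverse=True)

print(sorted_pairs[0][0])           # Pride and Prejudice
print(most_downloaded_first[0][1])  # 65436 downloads
```

With a `key` function the download string is only converted for comparison, so the output keeps the same shape as the input.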

**3** Write a function to count words in a text. Use `simple_tokenizer` in the function to make your life easier.

So the function is going to take:

```
"I love Justin Bieber. He is so talented."
```

And return:

```
8
```

In [None]:
def word_count(text):
    pass

**4** Write a function that returns a dictionary stating how many times each word is used in the text. Use `simple_tokenizer` in the function to make your life easier.

So the function is going to take:

```
"I love Justin Bieber. He is so talented."
```

And return:

```
{'i': 1,
 'love': 1,
 'justin': 1,
 'bieber': 1,
 'he': 1,
 'is': 1,
 'so': 1,
 'talented': 1}
```

In [None]:
def word_count_with_dict(text):
    # Hint: Check .update or .get methods 
    # for dictionary data type: https://www.w3schools.com/python/ref_dictionary_update.asp
    pass
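
The hint in the stub above points to `dict.get`. As a small illustration on a different task, counting the letters of a word, the sketch below shows the counting pattern that `.get` makes possible. `count_letters` is a hypothetical helper, not part of the exercise files.

```python
def count_letters(word):
    """Count how many times each letter appears in a word."""
    counts = {}
    for letter in word:
        # .get returns the current count, or 0 if the letter is not a key yet.
        counts[letter] = counts.get(letter, 0) + 1
    return counts


print(count_letters("banana"))  # {'b': 1, 'a': 3, 'n': 2}
```

Without `.get`, the first occurrence of each letter would raise a `KeyError`, so you would need an explicit `if letter in counts` check.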

### Extra Exercises (A bit more tiring)

Here we have created a function that downloads the book text, as a string, for a given book link.

The `download_book_text` function takes "https://www.gutenberg.org/ebooks/1080" and "A Modest Proposal" as parameters, and downloads A Modest Proposal by Jonathan Swift to the "./downloaded_books/A Modest Proposal.txt" location.

There are two extra questions:

1. Use `download_book_text` to download all the popular books, saving each one under its own book name.

    a. Use dictionary comprehensions to reach the `book_link` and combine it with the `MAIN_PAGE`.

    b. You can create a list of tuples to keep track of each `book_link` and `book_name`.


2. Using the book texts downloaded in the previous step, count the words in each book. That can make for a really interesting analysis.

    a. You can use a list of tuples to store all the relevant information about the books.

    b. When you know the `book_name` of a book, you can open the corresponding .txt file.

    c. Once you have opened a .txt file, you can apply `word_count` to its contents.
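
Two of the building blocks mentioned in the hints can be sketched in isolation: a dictionary comprehension that combines each `book_link` with `MAIN_PAGE`, and reading a .txt file back into a string. This is only a sketch under assumptions: `sample_books` is a trimmed, hypothetical copy of the dataset, and a temporary file with one made-up sentence stands in for a downloaded book, so nothing is fetched from the website.

```python
import os
import tempfile

MAIN_PAGE = "https://www.gutenberg.org"

# Trimmed stand-in for the scraped dictionary (only the fields used here).
sample_books = {
    "b_0": {"book_name": "A Christmas Carol in Prose; Being a Ghost Story of Christmas",
            "book_link": "/ebooks/46"},
    "b_1": {"book_name": "Pride and Prejudice",
            "book_link": "/ebooks/1342"},
}

# Dictionary comprehension: map each book name to its absolute URL.
full_links = {book["book_name"]: MAIN_PAGE + book["book_link"]
              for book in sample_books.values()}

# Write a tiny stand-in "book" to a temporary folder, then read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Pride and Prejudice.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("It is a truth universally acknowledged")
    with open(path, encoding="utf-8") as f:
        book_text = f.read()

print(full_links["Pride and Prejudice"])  # https://www.gutenberg.org/ebooks/1342
print(len(book_text.split()))             # 6
```

In the exercise itself the files live in `downloaded_books/`, and the string you read from each file is what you would pass to your own `word_count`.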

In [1]:
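import os


def download_book_text(book_link_str, path_name):
    """Download a book's text and save it to downloaded_books/<path_name>.txt."""
    book_text_link_path = get_book_text_link(book_link_str)
    requested_book_text = request_page_content(MAIN_PAGE + book_text_link_path)
    requested_book_text_decoded = requested_book_text.decode("utf-8")
    # Create the target folder if it does not exist yet.
    os.makedirs("downloaded_books", exist_ok=True)
    with open(f"downloaded_books/{path_name}.txt", "w", encoding="utf-8") as f:
        f.write(requested_book_text_decoded)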

In [None]:
def download_all_books_as_text(popular_books_dict):
    pass

In [3]:
def count_words_per_book(popular_books_dict):
    pass