# Homework 3 - Which book would you recommend?

*Stefano D'Arrigo 1960500, Alessio Sentinelli, Iyuele*

---

![goodreads image](./images/goodrreads.jpg)

## Notes before starting

To keep this notebook tidy and easy to read, most of the code we wrote to complete the tasks is not included here but is provided in the `scripts` folder. Nevertheless, the crucial pieces of code are executed or shown and commented directly in this notebook. For further details on each operation and choice we made, please refer to the comments in the code.

---

## 1. Data collection

### 1.1. Get the list of books

Please describe here what are the characteristics of the website structure you exploited to get the needed information, what are the choices you made and what you did. Comment and refer to the code below

In [None]:
from bs4 import BeautifulSoup
import requests

BASE_URL = "https://www.goodreads.com"

# collect the URL of every book listed in the 300 pages of the ranking
with open("url_list.txt", "w") as f:
    for k in range(1, 301):  # pages 1..300
        page = requests.get(BASE_URL + "/list/show/1.Best_Books_Ever?page=" + str(k))
        soup = BeautifulSoup(page.content, features="lxml")
        # each book entry is wrapped in a div with these two classes
        book_divs = soup.find_all('div', class_="js-tooltipTrigger tooltipTrigger")
        for div in book_divs:  # up to 100 books per page
            link = div.find('a', href=True)  # first link in the div, assumed to point to the book page
            if link is not None:
                f.write(BASE_URL + link['href'] + "\n")

Comment briefly on the output of this task and say where the reader can find the file
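
As a quick sanity check of the output, the URL list can be counted and scanned for duplicates. The sketch below is only illustrative: the helper name `summarize_urls` is hypothetical, and the expected total of 30000 assumes that each of the 300 pages lists exactly 100 books.

```python
from collections import Counter

def summarize_urls(lines):
    """Return (number of URLs, list of URLs that appear more than once)."""
    urls = [line.strip() for line in lines if line.strip()]  # ignore blank lines
    counts = Counter(urls)
    duplicates = [url for url, n in counts.items() if n > 1]
    return len(urls), duplicates

# Usage, assuming url_list.txt is in the current working directory:
# with open("url_list.txt") as f:
#     total, duplicates = summarize_urls(f)
#     print(total, len(duplicates))  # 300 pages x 100 books would give 30000 URLs
```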

### 1.2. Crawl books

The goal of this task is to retrieve all the `HTML` pages of the books, reading the `url_list.txt` file that we created in the previous task.

To get around possible anti-scraping measures, we used the `selenium` library, which drives a real browser in an automated way.

To complete this task, we wrote a class `DataCollector`, included in `data_collection.py`. The methods of this class receive the user's input, compute the offset from which to start reading the URL file, and save the HTML pages.

The core logic of this class is in the following method:

~~~python
def __save_html_pages(self, start_from, stop_at):
    """
    Start collecting from line start_from and stop at line stop_at.
    """
    with open(os.path.join(self.root_dir, 'url_list.txt'), 'r') as urls_file:
        urls = urls_file.readlines()[start_from:stop_at]  # select the lines to collect
    if not urls:  # slicing past the end of the file yields an empty list, not an exception
        print('Error: reached file end!')
        exit(-1)
    # `driver` is the selenium WebDriver instance (created outside this method)
    for url, i in zip(urls, tqdm(range(start_from, stop_at))):
        if i % 100 == 0:
            self.__make_dir(i // 100 + 1)  # a new folder every 100 pages
        try:
            driver.get(url.strip())  # drop the trailing newline read from the file
            page_html = driver.page_source
            with open(os.path.join(self.html_dir, f'article_{i + 1:05d}.html'), 'w') as out_file:
                out_file.write(page_html)
        except Exception:
            # record the failed URL so that it can be handled manually later
            with open('./log/log.csv', 'a') as log:
                log.write(f'[{datetime.datetime.now()}], {i+1}, {url.strip()}\n')
            continue
    driver.close()
~~~

Using the parameter `start_from`, the user can decide from which document to start crawling. Any errors in retrieving the pages were recorded in a log file and handled manually after the script finished.
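
To revisit the failed downloads, the log can be read back into (document number, URL) pairs. The following is a minimal sketch that assumes each log line has the `[timestamp], number, url` layout written by `__save_html_pages`; the helper name `read_failed_urls` and the sample line are made up for illustration.

```python
def read_failed_urls(log_lines):
    """Return (document number, URL) pairs from lines formatted as '[timestamp], number, url'."""
    failed = []
    for line in log_lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        # split into at most three fields, so commas inside the URL are preserved
        _timestamp, number, url = line.split(", ", 2)
        failed.append((int(number), url))
    return failed

# made-up sample line, only to show the expected layout
sample = [
    "[2020-11-15 10:21:03.123456], 42, https://www.goodreads.com/book/show/EXAMPLE\n",
    "\n",
]
print(read_failed_urls(sample))
```

The same function could be applied to the lines of `./log/log.csv` to rebuild the list of pages to retry.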

The output of this method is the set of collected pages, organized as follows:

```
- html/
    - 1/
        - article_00001.html
        - article_00002.html
        - ...
        - article_00100.html
    - 2/
        - article_00101.html
        - ...
        - article_00200.html
    - ...
    - 300/
        - article_29901.html
        - ...
        - article_30000.html
```
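
This layout follows from the indexing used in `__save_html_pages`: the zero-based line index `i` selects the folder `i // 100 + 1` and the file name `article_{i + 1:05d}.html`. A minimal sketch of that mapping, where the helper name `html_path_for` and the default root `html` are illustrative assumptions rather than part of `data_collection.py`:

```python
import os

def html_path_for(i, root="html"):
    """Return the path of the page saved for the zero-based URL index i."""
    folder = str(i // 100 + 1)              # 100 pages per folder, folders start at 1
    filename = f"article_{i + 1:05d}.html"  # file names are one-based and zero-padded
    return os.path.join(root, folder, filename)

print(html_path_for(0))      # first page, stored in folder 1
print(html_path_for(29999))  # last page, stored in folder 300
```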

### 1.3. Parse downloaded pages

Please describe here what are the characteristics of the website structure you exploited to get the needed information, what are the choices you made and what you did. Comment and refer to the code below

In [1]:
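import re

from bs4 import BeautifulSoup
from langdetect import detect  # assumption: `detect` is the language detector of the langdetect package

def book_scraping(html_source): # this takes the html content and returns a list with the useful info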

    soup = BeautifulSoup(html_source, features='lxml') # instantiate a BeautifulSoup object for HTML parsing

    bookTitle = soup.find_all('h1', id='bookTitle')[0].contents[0].strip() # get the book title

    # if bookSeries is not present, then set it to the empty string
    try:
        bookSeries = soup.find_all('h2', id='bookSeries')[0].contents[1].contents[0].strip()[1:-1]
    except:
        bookSeries = ''

    # if bookAuthors is not present, then set it to the empty string
    try:
        bookAuthors = soup.find_all('span', itemprop='name')[0].contents[0].strip()
    except:
        bookAuthors = ''
    
    # the plot of the book is essential; if something goes wrong with the plot, the error propagates to the caller
    descr = soup.find_all('div', id='description')[0].contents # get the plot
    descr_fil = list(filter(lambda s: s != '\n', descr)) # clean it from newline chars
    if len(descr_fil) == 1:
        Plot = ''.join(descr_fil[0].contents[0]) # join the filtered plot into a string
    else:
        descr_fil = descr_fil[1:-1]
        x = [j for i in descr_fil for j in i.contents if isinstance(j, str)]
        Plot = ''.join(x) # join the filtered plot into a string
    if detect(Plot) != 'en':
        raise ValueError('the plot is not in English') # if the plot is not in English, raise an error

    # if NumberofPages is not present, then set it to the empty string
    try:
        NumberofPages = soup.find_all('span', itemprop='numberOfPages')[0].contents[0].split()[0]
    except:
        NumberofPages = ''
    
    # if ratingValue is not present, then set it to the empty string
    try:
        ratingValue = soup.find_all('span', itemprop='ratingValue')[0].contents[0].strip()
    except:
        ratingValue = ''
    
    # if ratingCount or reviewCount is not present, then set it to the empty string
    ratingCount = ''
    reviewCount = ''
    try:
        ratings_reviews = soup.find_all('a', href='#other_reviews')
        for i in ratings_reviews:
            if i.find_all('meta',itemprop='ratingCount'):
                ratingCount = i.contents[2].split()[0]
            if i.find_all('meta',itemprop='reviewCount'):
                reviewCount = i.contents[2].split()[0]
    except:
        pass # keep the values parsed before the error (empty string by default)

    # if Published is not present, then set it to the empty string
    try:        
        pub = soup.find_all('div', class_='row')[1].contents[0].split()[1:4]
        Published = ' '.join(pub) # join the three tokens that follow the first word of the row (the publication date)
    except:
        Published = ''
    
    # if Character is not present, then set it to the empty string
    try:
        char = soup.find_all('a', href=re.compile('characters')) # find the links whose href matches the regular expression 'characters'
        if len(char) == 0:
            Characters = '' # no characters in char
        else:
            Characters = ', '.join([i.contents[0] for i in char])
    except:
        Characters = '' # something went wrong with char
    
    # if Setting is not present, then set it to the empty string
    try:
        sett = soup.find_all('a', href=re.compile('places')) # find the links whose href matches the regular expression 'places'
        if len(sett) == 0:
            Setting = ''
        else:
            Setting = ', '.join([i.contents[0] for i in sett])
    except:
        Setting = '' # something went wrong with Setting
    
    # get the URL to the page
    Url = soup.find_all('link', rel='canonical')[0].get('href')

    return [bookTitle, bookSeries, bookAuthors, ratingValue, ratingCount, reviewCount, Plot, NumberofPages, Published, Characters, Setting, Url]

Comment briefly on the output of this task
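
The list returned by `book_scraping` has twelve fields, in the order given by its `return` statement. One possible way to store it is a tab-separated file with a header line. The sketch below is only an illustration under that assumption: the helper name `write_book_tsv` and the choice of the TSV format are not taken from the code above.

```python
import csv
import io

# field names, in the same order as the list returned by book_scraping
FIELDS = ["bookTitle", "bookSeries", "bookAuthors", "ratingValue", "ratingCount",
          "reviewCount", "Plot", "NumberofPages", "Published", "Characters",
          "Setting", "Url"]

def write_book_tsv(book_info, out_file):
    """Write a header line and one row of book fields in tab-separated format."""
    if len(book_info) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(book_info)}")
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerow(book_info)

# demo with placeholder values written to an in-memory buffer
buffer = io.StringIO()
write_book_tsv(["placeholder"] * len(FIELDS), buffer)
print(buffer.getvalue())
```

Passing an open file instead of the in-memory buffer would store the row on disk.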