# Homework 3 - Which book would you recommend?

*Stefano D'Arrigo 1960500, Alessio Sentinelli, Iyuele Alemu Korsaye*

---

![goodreads image](./images/goodrreads.jpg)

## Notes before starting

To keep this notebook tidy and easy to read, most of the code we wrote to complete the tasks is not included here; it is provided in the folder `scripts`. The crucial pieces of code are nevertheless executed or shown and commented directly in this notebook. For the details of each operation and of the choices we made, please refer to the comments in the code.

---

## 1. Data collection

#### Loading libraries

In [2]:
from bs4 import BeautifulSoup  # imported under its full name, as used in the cells below
import requests
import re
from selenium import webdriver
import chromedriver_binary
import spacy
from spacy_fastlang import LanguageDetector

### 1.1. Get the list of books

First, we loop over the 300 pages of the list and download each of them with the `requests` library.

Then, using the third-party library Beautiful Soup with the `lxml` parser, we select the `div` tags with class `js-tooltipTrigger tooltipTrigger`, which contain the URL of each book.
Finally, inside the loop we write the 30k collected URLs to a text file, which is closed at the end.



In [None]:
f = open("url_list.txt","w")
for k in range (1,301):  #301 for the pages
    page = requests.get("https://www.goodreads.com/list/show/1.Best_Books_Ever?page=" + str(k))
    soup = BeautifulSoup(page.content, features="lxml")
    URL_con3 = soup.find_all('div', class_="js-tooltipTrigger tooltipTrigger")
    for j in range (0,100): #100 for the books
        URL_str = str(URL_con3[j]) 
        list_split = URL_str.split(" ")
        result = list_split[5] # it seems it is always 5
        result_clean = result.split("\"")[1]
        f.write("https://www.goodreads.com" + result_clean + "\n")
f.close()

The result is a text file of 30k lines, one book URL per line, saved in the data folder.
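The cell above recovers each link by splitting the tag's string form at a fixed position, which depends on the exact attribute order. A possibly more robust alternative is to read the `href` attribute of the first `<a>` tag inside each block. The sketch below illustrates the idea with the standard-library `html.parser` so that it runs without extra dependencies; the class name `BookLinkParser` and the HTML fragment are hypothetical, and the fragment only mimics the structure implied by the code above.

```python
from html.parser import HTMLParser

class BookLinkParser(HTMLParser):
    """Collect the href of the first <a> inside each tooltip-trigger <div>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._inside_trigger = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "js-tooltipTrigger" in classes:
            self._inside_trigger = True
        elif tag == "a" and self._inside_trigger and attrs.get("href"):
            self.links.append("https://www.goodreads.com" + attrs["href"])
            self._inside_trigger = False  # keep only the first link of each block

# Hypothetical fragment mimicking one entry of the list page
sample = '''
<div class="js-tooltipTrigger tooltipTrigger" data-resource-id="1">
  <a href="/book/show/1-example-book" title="Example"><img src="cover.jpg"></a>
</div>
'''

parser = BookLinkParser()
parser.feed(sample)
print(parser.links)
```

With Beautiful Soup, the same idea would amount to looking up the first `a` tag of each selected `div` and reading its `href` attribute.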

### 1.2. Crawl books

The goal of this task is to retrieve all the `HTML` pages of the books, reading the `url_list.txt` file that we created in the previous task.

To get around possible anti-scraping measures, we used the library `selenium`, which drives a full, automated web browser.

To complete this task, we wrote a class `DataCollector`, included in `data_collection.py`. The methods of this class receive the user's input, compute the offset from which to start reading the URLs file and save the HTML pages.

The core logic of this class is in the following method:

~~~python
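def __save_html_pages(self, start_from, stop_at):
    """
    Start collecting from line start_from and stop at line stop_at.
    """
    # `driver` is assumed to be the Selenium WebDriver instance created elsewhere in data_collection.py
    with open(os.path.join(self.root_dir, 'url_list.txt'), 'r') as urls_file:
        urls = urls_file.readlines()[start_from:] # select the line from which to start collecting
    if not urls: # slicing past the end of the file yields an empty list, not an exception
        print('Error: reached file end!')
        exit(-1)
    # zip stops at the shorter sequence, so at most stop_at - start_from pages are fetched
    for url, i in zip(urls, tqdm(range(start_from, stop_at))):
        if i % 100 == 0:
            self.__make_dir(i // 100 + 1) # a new folder every 100 pages
        try:
            driver.get(url.strip()) # drop the trailing newline read from the file
            page_html = driver.page_source
            with open(os.path.join(self.html_dir, f'article_{i + 1:05d}.html'), 'w') as out_file:
                out_file.write(page_html)
        except Exception: # log the failure and move on to the next URL
            with open('./log/log.csv', 'a') as log:
                log.write(f'[{datetime.datetime.now()}], {i+1}, {url.strip()}\n')
            continue
    driver.close()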
~~~

Using the parameter `start_from`, the user can decide from which document to start crawling. Any errors in retrieving the pages were recorded in a log file and handled manually after the execution of the script.

The output of this method is the set of collected pages, organized as follows:
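The mapping between a line of `url_list.txt` and the file where its page is stored can be sketched as follows. The helper names `html_path` and `first_missing` are hypothetical and are not part of `data_collection.py`; the second one shows one possible way to pick a value for `start_from` when resuming an interrupted crawl.

```python
import os

PAGES_PER_DIR = 100  # the notebook stores 100 pages per folder

def html_path(root, i):
    """Return the output path for the i-th URL (0-based line index)."""
    folder = str(i // PAGES_PER_DIR + 1)
    return os.path.join(root, folder, f"article_{i + 1:05d}.html")

def first_missing(root, total):
    """Return the first index whose page has not been saved yet (a resume point)."""
    for i in range(total):
        if not os.path.exists(html_path(root, i)):
            return i
    return total

print(html_path("html", 0))      # first page, stored in folder 1
print(html_path("html", 29999))  # last page, stored in folder 300
```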

```
- html/
    - 1/
        - article_00001.html
        - article_00002.html
        - ...
        - article_00100.html
    - 2/
        - article_00101.html
        - ...
        - article_00200.html
    - ...
    - 300/
        - article_29901.html
        - ...
        - article_30000.html
```

### 1.3. Parse downloaded pages

Once we have the HTML content of all 30,000 books, we are left with the task of parsing it. Since HTML is nested, we cannot reliably extract the data through plain string processing; we need a parser that builds a tree structure from the markup. Several HTML parsers are available; the one used in our function `book_scraping` is `lxml`.

To navigate and search the parse tree we use another third-party Python library, Beautiful Soup, which pulls data out of HTML and XML files. Beautiful Soup sits on top of parsing libraries such as `lxml` and `html5lib`, so the parser is chosen when the `BeautifulSoup` object is created: `soup = BeautifulSoup(html_source, features='lxml')`.

We are now ready to extract the information needed to build a book recommendation engine. The `soup` object holds the page as a nested structure, which `book_scraping` walks to retrieve and save, for each of the 30k books, the following fields: book title, book series, book author, rating value, rating count, review count, plot, number of pages, published date, characters, settings and URL.


In [1]:
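def book_scraping(html_source, nlp): # this takes the html content and returns a list with the useful info
    # nlp is the spaCy pipeline, created once by the caller so that the model is not reloaded for every book:
    #   nlp = spacy.load('en_core_web_sm'); nlp.add_pipe(LanguageDetector())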

    soup = BeautifulSoup(html_source, features='lxml') # instantiate a BeautifulSoup object for HTML parsing

    bookTitle = soup.find_all('h1', id='bookTitle')[0].contents[0].strip() # get the book title

    # if bookSeries is not present, then set it to the empty string
    try:
        bookSeries = soup.find_all('h2', id='bookSeries')[0].contents[1].contents[0].strip()[1:-1]
    except:
        bookSeries = ''

    # if bookAuthors is not present, then set it to the empty string
    try:
        bookAuthors = soup.find_all('span', itemprop='name')[0].contents[0].strip()
    except:
        bookAuthors = ''
    
    # the plot of the book is essential for the search engine; if something goes wrong, fall back to the empty string
    try:
        Plot = soup.find_all('div', id='description')[0].contents  # get the main tag where the plot is found
        filter_plot = list(filter(lambda i: i!='\n', Plot))  # drop the newline nodes, which don't contain the description
        if len(filter_plot) == 1:
            Plot = filter_plot[0].text
        else:                                    # otherwise take the plot from the second element
            Plot = filter_plot[1].text
    except:
        Plot = ''                                # return an empty string if there is no description

    doc = nlp(Plot)
    if doc._.language != 'en':          # return an empty string if the description is not written in English
        Plot = ''

    # if NumberofPages is not present, then set it to the empty string
    try:
        NumberofPages = soup.find_all('span', itemprop='numberOfPages')[0].contents[0].split()[0]
    except:
        NumberofPages = ''
    
    # if ratingValue is not present, then set it to the empty string
    try:
        ratingValue = soup.find_all('span', itemprop='ratingValue')[0].contents[0].strip()
    except:
        ratingValue = ''
    
    # if ratingCount or reviewCount is not present, then set it to the empty string
    ratingCount = ''
    reviewCount = ''
    try:
        ratings_reviews = soup.find_all('a', href='#other_reviews')
        for i in ratings_reviews:
            if i.find_all('meta',itemprop='ratingCount'):
                ratingCount = i.contents[2].split()[0]
            if i.find_all('meta',itemprop='reviewCount'):
                reviewCount = i.contents[2].split()[0]
    except Exception:
        pass # keep the defaults (or the values collected so far)

    # if Published is not present, then set it to the empty string
    try:        
        pub = soup.find_all('div', class_='row')[1].contents[0].split()[1:4]
        Published = ' '.join(pub) # join the tokens of the publication date
    except:
        Published = ''
    
    # if Character is not present, then set it to the empty string
    try:
        char = soup.find_all('a', href=re.compile('characters')) # find the regular expression(re) 'characters' within the attribute href 
        if len(char) == 0:
            Characters = '' # no characters in char
        else:
            Characters = ', '.join([i.contents[0] for i in char])
    except:
        Characters = '' # something went wrong with char
    
    # if Setting is not present, then set it to the empty string
    try:
        sett = soup.find_all('a', href=re.compile('places')) # find the regular expression(re) 'places' within the attribute href 
        if len(sett) == 0:
            Setting = ''
        else:
            Setting = ', '.join([i.contents[0] for i in sett])
    except:
        Setting = '' # something went wrong with Setting
    
    # get the URL to the page
    Url = soup.find_all('link', rel='canonical')[0].get('href')

    return [bookTitle, bookSeries, bookAuthors, ratingValue, ratingCount, reviewCount, Plot, NumberofPages, Published, Characters, Setting, Url]

For each book, the list returned by the function is saved as a tab-separated values (`.tsv`) file, ready to be fed into the next stage.
During the scraping procedure, whenever a piece of information was not available, an empty string was returned; books whose description is written in a language other than English were discarded.
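As an illustration of the storage step, the following sketch writes one record to a tab-separated file and reads it back with the standard `csv` module. The function names `write_tsv` and `read_tsv` and the sample record are hypothetical; only the field names are taken from the list returned by `book_scraping`. The `csv` module is used here because it quotes fields that contain tabs or newlines, which plots often do.

```python
import csv
import io

FIELDS = ["bookTitle", "bookSeries", "bookAuthors", "ratingValue", "ratingCount",
          "reviewCount", "Plot", "NumberofPages", "Published", "Characters",
          "Setting", "Url"]

def write_tsv(record, out_file):
    """Write a header row followed by one book record."""
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerow(record)

def read_tsv(in_file):
    """Read the records back as a list of dictionaries keyed by field name."""
    return list(csv.DictReader(in_file, delimiter="\t"))

# Hypothetical record; missing information is stored as an empty string
record = ["Example Title", "", "Example Author", "4.10", "1200", "85",
          "A plot with a\ttab and a\nnewline.", "320", "May 1st 2001", "", "",
          "https://www.goodreads.com/book/show/1-example-book"]

buffer = io.StringIO()
write_tsv(record, buffer)
buffer.seek(0)
rows = read_tsv(buffer)
print(rows[0]["bookTitle"], "|", repr(rows[0]["bookSeries"]))
```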


## 2. Search Engine

Once we have collected all the raw HTML pages and scraped the target information, we apply some natural language processing (NLP) techniques in order to create the files which the search engine will work on.

The text tools we use are available in the `nltk` library:

In [1]:
import nltk

nltk.download('punkt')
nltk.download('stopwords') 

First of all, we tokenize the text, i.e. we split it into a list of single words, according to punctuation and word formation rules:

In [2]:
def tokenize(text:str):
    return nltk.word_tokenize(text)

Then, we keep only the tokens made up entirely of alphanumeric characters and convert them to lower case:

In [3]:
def alphanum(text:list): 
    text_result = []
    for w in text:
        if w.isalnum():
            text_result.append(w.lower())
    return text_result

The input text is full of very common words which carry very little information (in the sense of self-information in information theory). These words are known as "stop words". The following function removes them by comparing each word against a pre-built stop-word list:

In [4]:
def stopwords(text:list):
    text_result = []
    stop_words = set(nltk.corpus.stopwords.words('english')) # a set gives constant-time membership tests
    for w in text:
        if w not in stop_words:
            text_result.append(w)
    return text_result

The last step of the natural language processing pipeline that we consider useful at this point is mapping each word to its root, so that words derived from the same root are mapped to it; e.g. verb forms are mapped to the base form of the verb. This task can be carried out with two different techniques: stemming and lemmatization. The former applies fixed word formation rules to each word in order to remove prefixes and suffixes; it is fast, but sometimes inaccurate, producing strings that are not real words. The latter is more accurate, because it relies on the vocabulary of the language to perform a morphological analysis of the words, but it is slower. For the purpose of this exercise, a stemmer is enough. Thanks to the modular nature of our code, it would be easy to plug in a lemmatizer in place of the stemmer should the need arise.

In [5]:
def stemming(text:list):
    stemmer = nltk.stem.PorterStemmer()
    return [stemmer.stem(w) for w in text]

Finally, to summarize and to clarify, the NLP pipeline we follow is:
1. tokenization
2. alphanumeric filtering and lower case conversion
3. stop-word removal
4. stemming

The function below summarizes this pipeline, taking a raw text as input and returning the processed text:

In [7]:
def pre_process(text:str):
    return ' '.join(stemming(stopwords(alphanum(tokenize(text)))))

### 2.1. Conjunctive query

From this section on, we make use of the tools provided by the class `TfidfVectorizer` from the module `sklearn.feature_extraction.text`. 

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

#### 2.1.1) Create your index!

`TfidfVectorizer` takes the documents as input, builds a vocabulary from them unless one is provided explicitly, and computes the tfIdf score of each term in each document (the idf weighting is optional and controlled by `use_idf`).

The class `IndexBuilder` wraps some methods of `TfidfVectorizer` and contains the logic to save the vocabulary and the inverted indexes.

In [13]:
import pandas as pd 
from scripts.utilities import FileContentGetter
import json

class IndexBuilder:

    def __init__(self, vocabulary=None):
        self.count_vect = TfidfVectorizer(vocabulary=vocabulary, use_idf=True)


    def concatenate_dataset(self, data_path, fields):
        """
        Read all the .tsv files, making use of the utility FileContentGetter, 
        and return a pandas.DataFrame containing all of them sorted by the file number
        """
        content_getter = FileContentGetter(data_path) # utility object
        article = content_getter.get(fields=fields, file_ext='tsv') # get a pandas.DataFrame with the file content
        articles = [] # empty list to hold all the articles' DataFrames
        while article is not None: # when there are no more articles to gather, FileContentGetter.get() returns None
            articles.append(article) # save the DataFrame with the article content
            article = content_getter.get(fields=fields, file_ext='tsv') # get a new article
        return pd.concat(articles).sort_values(by='file_num', ignore_index=True) # concat all the articles' DataFrame and sort the rows by the file number
    

    def vectorize_dataset(self, dataset):
        """
        Initialize the document-term matrix from the DataFrame of documents
        """
        self.document_term_matrix = self.count_vect.fit_transform(dataset)


    def save_vocabulary(self, file_path='./data/vocabulary.json'):
        """
        Save the vocabulary computed by TfidfVectorizer into a .json file
        """
        with open(file_path, 'w') as vocabulary:
            json.dump(self.count_vect.vocabulary_, vocabulary)


    def load_vocabulary(self, file_path='./data/vocabulary.json'):
        """
        Load a pre-build vocabulary from the filesystem
        """
        with open(file_path, 'r') as vocabulary:
            vocabulary = json.load(vocabulary)
            self.count_vect = TfidfVectorizer(vocabulary=vocabulary, use_idf=True)
    

    def __select_index_type(self, document_index, document_number, term_id, tfidf):
        """
        Append the document to the posting list of term_id, in the format of the selected inverted index type.
        """
        score = self.document_term_matrix[document_index, term_id]
        if score == 0: # if 0, then term term_id is not in document
            return
        if tfidf:
            self.inverted_index[term_id].append((document_number, score))
        else:
            self.inverted_index[term_id].append(document_number)


    def save_inverted_index(self, document_numbers, file_path='./data/inverted_index_2_1_2.json', tfidf=False):
        """
        Save the inverted index.
        The inverted index types available are:
            * {
                term_id_1:[document_1, document_2, document_4],
                term_id_2:[document_1, document_3, document_5, document_6],
                ...}
            * {
                term_id_1:[(document1, tfIdf_{term,document1}), (document2, tfIdf_{term,document2}), (document4, tfIdf_{term,document4}), ...],
                term_id_2:[(document1, tfIdf_{term,document1}), (document3, tfIdf_{term,document3}), (document5, tfIdf_{term,document5}), (document6, tfIdf_{term,document6}), ...],
                ...}
        """
        self.inverted_index = dict() # store the inverted index to be dumped into a .json file. A dictionary data structure is useful for this purpose
        for term_id in range(self.document_term_matrix.shape[1]):
            self.inverted_index[term_id] = [] # the list of all the documents which contain the term term_id
            for document_index, document_number in zip(range(self.document_term_matrix.shape[0]), document_numbers): # pair each matrix row with its document number
                self.__select_index_type(document_index, document_number, term_id, tfidf)
        with open(file_path, 'w') as out_file:
            json.dump(self.inverted_index, out_file)

We instantiate `IndexBuilder`:

In [14]:
index_builder = IndexBuilder()

The method `concatenate_dataset(..)`, as described above, returns the dataset of the books:

In [15]:
dataset = index_builder.concatenate_dataset('./data/tsv/*/*.tsv', fields=None)

In [16]:
dataset.head()

| Unnamed: 0 | bookTitle | bookSeries | bookAuthors | ratingValue | ratingCount | reviewCount | Plot | NumberofPages | Published | Characters | Setting | Url | file_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Hunger Games | The Hunger Games #1 | Suzanne Collins | 4.33 | 6413302 | 172615 | In the ruins of a place once known as North Am... | 374.0 | September 14th 2008 | Katniss Everdeen, Peeta Mellark, Cato (Hunger ... | District 12, Panem, Capitol, Panem, Panem | https://www.goodreads.com/book/show/2767052-th... | 1 |
| 1 | Harry Potter and the Order of the Phoenix | Harry Potter #5 | J.K. Rowling | 4.5 | 2527001 | 42768 | There is a door at the end of a silent corrido... | 870.0 | September 2004 by | Sirius Black, Draco Malfoy, Ron Weasley, Petun... | Hogwarts School of Witchcraft and Wizardry, Lo... | https://www.goodreads.com/book/show/2.Harry_Po... | 2 |
| 2 | To Kill a Mockingbird | To Kill a Mockingbird | Harper Lee | 4.28 | 4530963 | 91866 | The unforgettable novel of a childhood in a sl... | 324.0 | May 23rd 2006 | Scout Finch, Atticus Finch, Jem Finch, Arthur ... | Maycomb, Alabama | https://www.goodreads.com/book/show/2657.To_Ki... | 3 |
| 3 | Pride and Prejudice | | Jane Austen | 4.26 | 3020392 | 67869 | Since its immediate success in 1813, has rema... | 279.0 | October 10th 2000 | Mr. Bennet, Mrs. Bennet, Jane Bennet, Elizabet... | United Kingdom, Derbyshire, England, England, ... | https://www.goodreads.com/book/show/1885.Pride... | 4 |
| 4 | Twilight | The Twilight Saga #1 | Stephenie Meyer | 3.6 | 4993492 | 104954 | About three things I was absolutely positive. | 501.0 | September 6th 2006 | Edward Cullen, Jacob Black, Laurent, Renee, Be... | Forks, Washington, Phoenix, Arizona, Washingto... | https://www.goodreads.com/book/show/41865.Twil... | 5 |


Before building the vocabulary, we apply to the `Plot` column the natural language processing techniques:

In [17]:
dataset['Plot'] = dataset['Plot'].map(pre_process)
dataset['Plot'].head()

0    ruin place known north america lie nation pane...
1    door end silent corridor haunt harri pottter d...
2    unforgett novel childhood sleepi southern town...
3    sinc immedi success 1813 remain one popular no...
4                            three thing absolut posit
Name: Plot, dtype: object

Then, the dataset is passed to the method `vectorize_dataset(..)`, which creates the document-term matrix, a compressed sparse matrix holding the pre-calculated tfIdf score; in this section we simply consider a non-zero value of this score as the sign that the i-th word is present in the j-th document. In the next section, we will leverage this score and we will explain in detail how this score is computed.

In [18]:
index_builder.vectorize_dataset(dataset['Plot'])

Finally, we are able to create a new file holding the vocabulary:

In [19]:
index_builder.save_vocabulary()

 and to create a simple inverted index:

In [None]:
index_builder.save_inverted_index(dataset['file_num'])
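To make the structure of the simple inverted index concrete, here is a minimal, dependency-free sketch that builds the same kind of mapping (term id to list of document numbers) from already pre-processed texts, and then uses it to answer a conjunctive query by intersecting posting lists. It is only an illustration: the function names and the toy documents are hypothetical, the term ids are assigned in order of appearance rather than as `TfidfVectorizer` assigns them, and the query terms are assumed to have gone through the same pre-processing as the plots.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps a document number to its pre-processed text (space-separated terms)."""
    vocabulary = {}                     # term -> term_id
    inverted_index = defaultdict(list)  # term_id -> sorted list of document numbers
    for doc_num in sorted(docs):
        for term in sorted(set(docs[doc_num].split())):
            term_id = vocabulary.setdefault(term, len(vocabulary))
            inverted_index[term_id].append(doc_num)
    return vocabulary, dict(inverted_index)

def conjunctive_query(terms, vocabulary, inverted_index):
    """Return the documents that contain every query term."""
    postings = []
    for term in terms:
        if term not in vocabulary:
            return []  # an unknown term cannot match any document
        postings.append(set(inverted_index[vocabulary[term]]))
    return sorted(set.intersection(*postings)) if postings else []

# Toy, already pre-processed documents (hypothetical)
docs = {1: "ruin place known north", 2: "door end silent corridor", 3: "place end north"}
vocabulary, inverted_index = build_inverted_index(docs)
print(conjunctive_query(["place", "north"], vocabulary, inverted_index))
```

Iterating over the terms of each document, rather than over every (term, document) pair, keeps the construction proportional to the total number of terms in the collection.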

#### 2.1.2) Execute the query

## 5. Algorithmic Question

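This problem (finding the length of the longest increasing subsequence of a string) can be solved with a recursive approach.

Given a string $S$ and $i\in [0,length(S)-1]$, let $X[i]$ be the length of the longest increasing subsequence ending at position $i$.

Then, $X[i]=1+\max\{X[j] : j\in[0,i-1],\ S[j]<S[i]\}$, with $X[i]=1$ when no such $j$ exists. The answer is $\max_i X[i]$.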

In [32]:
max_len = 1

def max_length_recursive(s, i):
    global max_len
    max_len_i = 1
    for j in range(0, i):
        res = max_length_recursive(s, j)
        if s[j] < s[i]:
            if res+1 > max_len_i:
                max_len_i = res + 1
    max_len = max(max_len, max_len_i)
    return max_len_i

In [6]:
def max_length(s):
    max_len = 0
    for i in range(len(s)): # start from 0 so that the first character is considered too
        res = max_length_recursive(s, i)
        if res > max_len:
            max_len = res
    return max_len

In [34]:
s = 'CADFECEILGJHABNOPSTIRYOEABILCNR' # 'ZQABWARSA' 
max_length_recursive(s, len(s)-1)
max_len

KeyboardInterrupt: 

In [91]:
l = []
for i in range(0, len(s)):
    l.append(max_length(s, i))
max(l)

5
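The plain recursion recomputes the same values $X[j]$ many times, so its running time grows exponentially with the length of $S$, which is consistent with the interrupted run on the 31-character string above. A possible remedy is to store each $X[i]$ once it has been computed. The sketch below applies the same recurrence bottom-up, in $O(n^2)$ time; the function name `longest_increasing_subsequence` is ours and the comparison is assumed to be strict, as in the recurrence.

```python
def longest_increasing_subsequence(s):
    """Length of the longest strictly increasing subsequence of s, in O(n^2) time."""
    x = [1] * len(s)  # x[i] = length of the longest increasing subsequence ending at i
    for i in range(len(s)):
        for j in range(i):
            if s[j] < s[i]:
                x[i] = max(x[i], x[j] + 1)
    return max(x, default=0)  # an empty string has no subsequence

print(longest_increasing_subsequence("ABCBD"))  # prints 4 (A, B, C, D)
```

Each entry of `x` is computed exactly once from the entries before it, so strings of a few thousand characters are handled without trouble.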