## Homework 3 - Which book would you recommend?

-------------

### Importing related libraries

In [None]:
import functions

In [1]:

import requests
import os
import re
import pandas as pd
import nltk
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from langdetect import detect
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer


## 1. Data collection

#### 1.1. Get the list of books

* From [best books ever list](https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1) we want to collect the url associated to each book in the list and retrieve only the urls of the books listed in the first 300 pages.



* The output of this step is a `.txt` file whose single line corresponds to a book's url.

In [None]:
#inizialize an empty list
url = []

#for each page, save the corresponding web page, find the anchor elements and save the corresponding tags  
for i in range(1, 301):
    page = requests.get("https://www.goodreads.com/list/show/1.Best_Books_Ever?page="+ str(i))
    soup = BeautifulSoup(page.content, features='lxml')
    tag_a = soup.find_all('a', {"class": "bookTitle"}, itemprop = "url")
    
#for each book, save the corresponding url into the array
    for j in range(0,100):
        url.append("https://www.goodreads.com"+ tag_a[j]['href'])
        
#create a txt file where for each row there is a book's url 
with open("url.txt", 'w') as f:
    f.write("\n".join(map(str, url)))


#### 1.2. Crawl books

1. Download the html corresponding to each of the collected urls.


2. After you collect a single page, immediatly save its `html` in a file. In this way, if your program stops, for any reason, you will not loose the data collected up to the stopping point.


3. Organize the entire set of downloaded `html` pages into folders. Each folder will contain the `htmls` of the books in page 1, page 2, ... of the list of books.

# Warning :

### Do not run the below cell because it'll download over 20 GB files !
#### Before running modify the `for` loop range to a lower number (e.g. 3) 

In [None]:
#First we open the url.txt file, reading the lines and then close the file
f = open("url.txt", "r")
lines = f.readlines()
f.close()

#Setting our parent directory and the directory associated with each page number from the list
## PLEASE CHANGE THE PARENT DIRECTORY ACCORDING TO YOUR SYSTEM
parent_dir = "C:/Users/engme/OneDrive/Desktop/ALL Materials/Data Science - Sapienza/1st Semester/ALGORITHMIC METHODS OF DATA MINING AND LABORATORY/Labs/6/Links"
directory = "Page_"
page_num = 0


#Looping for 300 times that corresponds to the number of pages
for i in range(300):
    #Incrementing page number according to each page
    page_num += 1
    #Setting the current working directory
    directory = "Page_" + str(page_num)
    #Setting the main path to create the directory
    path = os.path.join(parent_dir, directory)
    #Creating new directory
    os.makedirs(path)
    
    #Looping for 100 times, which is the number of articles per page
    for i in range(100):
        #Selecting the corresponding link
        link = lines[i][:-1]
        #Dowloading the article
        r = requests.get(link, allow_redirects=True)
        #Setting the name file to keep the track
        file_name = "article_" + str(i+1) + ".html"
        #Saving the html file to its corresponding directory
        open(parent_dir + "/" + directory + "/" + file_name, 'wb').write(r.content)

-----------

#### 1.3 Parse downloaded pages

* Extracting the books information for each book as following:

1. Title (to save as `bookTitle`)
2. Series (to save as `bookSeries`)
3. Author(s), the first box in the picture below (to save as `bookAuthors`)
4. Ratings, average stars (to save as `ratingValue`)
5. Number of givent ratings (to save as `ratingCount`)
6. Number of reviews (to save as `reviewCount`)
7. The entire plot (to save as `Plot`)
8. Number of pages (to save as `NumberofPages`)
9. Published (Publishing Date)
10. Characters
11. Setting
12. Url

In [None]:
## Custom function designed to 
parent_dir = "C:/Users/engme/OneDrive/Desktop/ALL Materials/Data Science - Sapienza/1st Semester/ALGORITHMIC METHODS OF DATA MINING AND LABORATORY/Labs/6"

functions.info_parser(parent_dir, pages=300, tsv_articles= "tsv_articles", links= "Links", url= 'url')

All tsv files generated sucessfully in tsv_articles directory

--------------

## 2. Search Engine

Now, we want to create two different Search Engines that, given as input a query, return the books that match the query.

First, you must pre-process all the information collected for each book by :



1. Removing stopwords
2. Removing punctuation
3. Stemming
4. Anything else you think it's needed


For this purpose, you can use the [nltk](https://www.nltk.org/) library.

In [3]:

#initialize an empty dictionary
processed_docs = {}

parent_dir = "C:\\Users\\elisa\\Desktop\\Algorithmic Methods of Data Mining\\ADM-HW3\\tsv_articles\\tsv_articles"

#initialize a list that will contain the lists of tokens, one per plot 
tokenizer = nltk.RegexpTokenizer(r"\w+")
stop_words = set(stopwords.words('english')) 
ps = PorterStemmer()

#for every article_i.tsv file, extract the Plot, tokenize it and preprocess it 
for n_art in range(1, 30001):
    directory = "article_" + str(n_art) + ".tsv"
    path = os.path.join(parent_dir, directory)
    if os.path.exists(path):
        plot = pd.read_csv(path, delimiter = '\t', usecols = ['Plot'])
        tokens = tokenizer.tokenize(str(plot))
        processed_doc = []
        for token in tokens:
            if (token != 'Plot') & (token != '0') & (not token in stop_words):
                processed_doc.append(ps.stem(token))

        processed_docs['document_'+str(n_art)] = processed_doc


In [4]:

#initialize an empty dictionary
vocabulary = {}

term_id = 1

#for every document (for every plot in our case)
for doc in processed_docs.values():
    
#for every token in the document
    for tok in doc:
        
#if the token is not present in the dictionary yet...
        if tok not in vocabulary:
        
#...add it and set term_id as his id
            vocabulary[tok] =  term_id
            term_id += 1
            

In [5]:

#initialize an empty dictionary
inv_index1 = {}

doc_id = 0 
#for every document (for every plot in our case)
for doc in processed_docs.values():

    #increase the id of the document
    doc_id+=1
    
    directory = "article_" + str(doc_id) + ".tsv"
    path = os.path.join(parent_dir, directory)
    if os.path.exists(path):
        
    #for every token in the document
        for tok in doc:

    #if the id of that specific token is not present in the dictionary yet...
            if vocabulary[tok] not in inv_index1:

    #...add it to the dictionary as a key and let document_doc_id be one of its values:
                inv_index1[vocabulary[tok]] = ["document_"+str(doc_id)]

    #else, if this token is present in the dictionary but document_doc_id is not one of his values yet...
            elif "document_"+str(doc_id) not in inv_index1.get(vocabulary[tok]):

    #append document_doc_id to his values
                inv_index1[vocabulary[tok]].append("document_"+str(doc_id))


In [6]:

my_dict = {}
for i in range(1,30001):
    directory = "article_" + str(i) + ".tsv"
    path = os.path.join(parent_dir, directory)
    if os.path.exists(path):
        my_dict["document_"+str(i)] = "article_"+str(i)
        

In [7]:
def Search_Engine(query):
    
    #stem the tokens of the query in order to create a new query: my_query
    my_query = []
    for tok in query:
        my_query.append(ps.stem(tok))

    #create a new dictionary which contains just the keys present in my_query        
    my_invertedId = {}
    for tok in my_query:
        if tok in vocabulary:
            my_invertedId[tok] = inv_index1.get(vocabulary[tok])
            
    #if any of the query's tokens is not present into the vocabulary, give an Error Message to the user
        elif tok not in vocabulary:
            return("The query is not present in any plot")
      
    #define a list of sets where each set represents the documents that contain each token of the query
    my_sets = []
    for key in my_invertedId:
        my_sets.append(set(my_invertedId[key]))
    result = set()

    for i in range(1, 30001):
        result.add('document_'+str(i))
        
    for my_set in my_sets:
        result = result.intersection(my_set)
    
    if result == set():
        return("The query is not present in any plot")
    else:
        found = list(result)

        i = 0
        for item in found:
            directory = my_dict[item]+".tsv"
            path = os.path.join(parent_dir, directory)
            if i == 0:
                data = pd.read_csv(path, delimiter = '\t', usecols = ['bookTitle', 'Plot', 'Url'])
            else:
                data = data.append(pd.read_csv(path, delimiter = '\t', usecols = ['bookTitle', 'Plot', 'Url']))
               
            data = data.rename(index = {0:'book_'+str(i+1)})
            
            i+=1
                            
        return(data)


In [8]:
Search_Engine(input().split())

could


Unnamed: 0,bookTitle,Plot,Url
book_1,Existence,"The daughter of a spy labeled ""missing in acti...",https://www.goodreads.com/book/show/332613.One...
book_2,"The Inquisitor's Tale: Or, the Three Magical C...","1242. On a dark night, travelers from across F...",https://www.goodreads.com/book/show/7604.Lolita
book_3,The Tenant of Wildfell Hall,Note: Editions of The Tenant that start with: ...,https://www.goodreads.com/book/show/1953.A_Tal...
book_4,The Information,"Fame, envy, lust, violence, intrigues literary...",https://www.goodreads.com/book/show/5107.The_C...
book_5,You’re the Password to My Life,We all have that one person in our lives in wh...,https://www.goodreads.com/book/show/13496.A_Ga...
...,...,...,...
book_81,Behind Her Eyes,Why is everyone talking about the ending of Sa...,https://www.goodreads.com/book/show/22628.The_...
book_82,The Peacemaker,With war scars that no one could see and that ...,https://www.goodreads.com/book/show/662.Atlas_...
book_83,Stillhouse Lake,Gina Royal is the definition of average—a shy ...,https://www.goodreads.com/book/show/4214.Life_...
book_84,Encyclopedia of Things That Never Were: Creatu...,"Hardcover sales of more than 70,000 copies hav...",https://www.goodreads.com/book/show/15823480-a...


In [14]:

#initialize an empty dictionary
inv_index2 = {}

n = len(processed_docs) # number of documents

#for every term_id belonging to the dictionary my_dict
for term_id in inv_index1:
    line = [term_id, inv_index1[term_id]]
    N = len(inv_index1[term_id]) #number of documents with the term corrisponding to the id term_id
    
    #for every document that contains that term
    for doc in line[1]:
        tf = processed_docs[doc].count(list(vocabulary.keys())[term_id-1]) #term frequency
        Idf = np.log(n/N) #inverse document frequency
        tfIdf = tf*Idf
        
        if term_id not in inv_index2:
            inv_index2[term_id] = [(doc,tfIdf)]
        else:
            inv_index2[term_id].append((doc,tfIdf))
        

In [18]:
inv_index2

{1: [('document_1', 5.7457520457526225),
  ('document_151', 5.7457520457526225),
  ('document_1155', 0.0),
  ('document_1792', 0.0),
  ('document_1954', 0.0),
  ('document_1961', 0.0),
  ('document_2041', 0.0),
  ('document_2054', 0.0),
  ('document_2652', 0.0),
  ('document_2888', 0.0),
  ('document_2921', 0.0),
  ('document_3114', 0.0),
  ('document_3441', 0.0),
  ('document_3953', 0.0),
  ('document_4783', 0.0),
  ('document_5318', 0.0),
  ('document_5418', 0.0),
  ('document_5527', 0.0),
  ('document_5627', 0.0),
  ('document_5728', 0.0),
  ('document_6374', 0.0),
  ('document_6581', 0.0),
  ('document_6602', 0.0),
  ('document_6679', 0.0),
  ('document_6923', 0.0),
  ('document_6975', 0.0),
  ('document_7026', 0.0),
  ('document_7440', 0.0),
  ('document_7706', 0.0),
  ('document_8067', 0.0),
  ('document_8159', 0.0),
  ('document_9057', 0.0),
  ('document_9291', 0.0),
  ('document_9432', 0.0),
  ('document_9485', 0.0),
  ('document_9637', 0.0),
  ('document_9807', 0.0),
  ('docum