# Book Scraper
This scraper fetches Arabic books data from [Abjjad](http://abjjad.com) website by 2 methods. 



**First method: **

You can use it if you want to collect the data by yourself, and you can decide which categories of books you want to collect. The general steps of this method are:

1. Open a list (category) of books (like language, arts, biography...etc)

2. Open all pages in the list

3. Getting all books URLs within the list

4. Scrape each book data and save it into a dataframe and csv

**Second method:** 

You can use it if you have a list of books URLs and you want to scrape the books directly. The examples of the ***input*** of this method are in [Data/Lists_of_Links](https://github.com/iHaifaa/Arabic_Books_Recommendation_System/tree/main/Data/Lists_of_Links) folder. 

The features of the obtained data are:

| Feature | Description |
|:----------------:|:----------------------------------------------------------------------------------------------:|
| ISBN | ISBN or ISBN 13 of the book. |
| Title | The whole title of the book. |
| Author | The author name/s. Like the main author, co-author, and translator. |
| Authors_Number | Number of authors, it is 1 if the book has only one author. |
| Description | The text description of the book, taken from the book summary or written by an admin. |
| Genres | The category of the book. |
| Average_Ratings | Average rate for a given book. |
| Reviews_Number | Number of written reviews by users. |
| Quotes_Number | Number of quotes taken from the book and published by users. |
| Community_Size | Number of users who added the book to their shelves (as read, currently reading, or to_read) |
| Pages_Number | Amount of pages per title. |
| Editions | Number of different editions for the book, not the current one. |
| Publication_Year | Year of the first publication. |
| Publisher | Publisher's name. |
| URL | Direct link to the book's page on Abjjad. |
| Cover_URL | Direct link to the book's image on Abjjad. |

Finally, I provided a sample of the ***output*** of this process in [Data/Books_Data](https://github.com/iHaifaa/Arabic_Books_Recommendation_System/tree/main/Data/Books_Data) folder and called **Books_Category9.csv** and **Books_Category9.csv** . 


In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

from lxml import html
#from time import sleep

import pandas as pd

import csv

In [4]:
browser = webdriver.Firefox()

browser.get('https://www.abjjad.com/search')

browser.find_element_by_xpath('/html/body/header/div[2]/nav[2]/ul/li[2]/div/span').click() # go to books tab

browser.find_element_by_xpath('/html/body/header/div[2]/nav[2]/ul/li[2]/div/div/a[1]').click() # go to all books option

In [5]:
'''signin = browser.find_element_by_xpath("/html/body/header/div[2]/nav[1]/ul/li[3]/a")
signin.click()

email = browser.find_element_by_xpath("//*[@id='Email']")
email.send_keys("your email") 

password = browser.find_element_by_xpath("//*[@id='Password']")
password.send_keys("your password")

submit = browser.find_element_by_xpath("/html/body/div[2]/div/div/div/form[1]/section/fieldset/ol/li[3]/input")
submit.click()

sleep(3)'''

'signin = browser.find_element_by_xpath("/html/body/header/div[2]/nav[1]/ul/li[3]/a")\nsignin.click()\n\nemail = browser.find_element_by_xpath("//*[@id=\'Email\']")\nemail.send_keys("your email") \n\npassword = browser.find_element_by_xpath("//*[@id=\'Password\']")\npassword.send_keys("your password")\n\nsubmit = browser.find_element_by_xpath("/html/body/div[2]/div/div/div/form[1]/section/fieldset/ol/li[3]/input")\nsubmit.click()\n\nsleep(3)'

## Essential functions

### Books Metadata

In [6]:
def get_isbn():
    raw_html = browser.page_source
    html_source = html.fromstring(raw_html)
    try:
        return browser.find_element_by_css_selector('.about > ul:nth-child(2) > li:nth-child(3)').text
    except:
        return None

In [7]:
def get_title():
    try:
     #return browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[1]/div/div/div[3]/h1').text # book title
     return browser.find_element_by_css_selector('.bookeditbadge > h1:nth-child(1)').text
    except: 
        return None

In [8]:
def get_author():
    try:
        return browser.find_element_by_class_name('author').text
    except:
        return None

In [9]:
def get_authors():
    try:
        return browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[4]/div/ul/li[5]').text 
    except:
        return None

In [10]:
def get_description():
    try:
        raw_html = browser.page_source
        html_source = html.fromstring(raw_html)
        text_1 = ''.join(html_source.xpath("//span[@class='content']/text()")).strip()
        text_2 = ''.join(html_source.xpath("//span[@class='guts']/text()")).strip()
        return f'{text_1} {text_2}'
    except:
        return None

In [11]:
def get_genre():
    try:
        return browser.find_element_by_css_selector('.cats > ul:nth-child(2)').text # genre
    except:
        return None

In [12]:
def get_year():
    try:
        return browser.find_element_by_css_selector('.about > ul:nth-child(2) > li:nth-child(1)').text
    except:
        return None

In [13]:
def get_pages():
    try:
        return browser.find_element_by_css_selector('.about > ul:nth-child(2) > li:nth-child(2)').text
    except:
        return None

In [14]:
def get_publisher():
    try:
        return browser.find_element_by_css_selector('.about > ul:nth-child(2) > li:nth-child(4) > a:nth-child(1)').text # publisher
    except:
        return None

In [15]:
def get_rating():
    try:
        return browser.find_element_by_class_name('rating').text # number of ratings
    except:
        return None

In [16]:
def get_reviews():
    try:
        return browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[4]/div/ul/li[1]').text
    except: 
        return None

In [17]:
def get_quotes():
    try:
        return browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[4]/div/ul/li[2]').text
    except:
        return None

In [18]:
def get_community():
    try:
        return browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[4]/div/ul/li[3]').text
    except:
        return None

In [19]:
def get_actual_readers():
    browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[1]/div/div/div[3]/div[3]/div/span').click()
    try:
        return browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[1]/div/div/div[4]/div[3]/div/ul/li[4]').text # readers
    except:
        return None

In [20]:
def get_editions():
    try:
        return browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[4]/div/ul/li[4]').text # number of editions
    except:
        return None

In [21]:
def get_img():
    try:
        return browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[1]/div/div/div[1]/a/img').get_attribute("src") #img
    except:
        return None

### Scraper function

In [22]:
books_dataset = pd.DataFrame(columns = ['ISBN', 'Title', 'Author', 'Authors_Number', 'Description', 'Genres', 'Average_Ratings', 'Reviews_Number', 'Quotes_Number', 'Community_Size', 'Pages_Number', 'Editions', 'Publication_Year', 'Publisher', 'URL', 'Cover_URL']) #'Readers_Number', 

In [23]:
def book_scraper(book_index): 
    #===========================================================
    try:
        more = browser.find_element_by_xpath("//div/span[@class='a h']") 
        more.click()
    except:
        print("There is no more button")

    # ================= Appending all metadata to the dataframe =================
    global books_dataset 
    books_dataset = books_dataset.append({'ISBN':get_isbn(), 'Title':get_title(), 'Author':get_author(), 'Authors_Number':get_authors(), 
                                        'Description':get_description(), 'Genres':get_genre(), 'Average_Ratings':get_rating(), 
                                        'Reviews_Number':get_reviews(), 'Quotes_Number':get_quotes(), 'Community_Size':get_community(),                                                             'Pages_Number':get_pages(),'Editions':get_editions(), 'Publication_Year':get_year(), 
                                        'Publisher':get_publisher(), 'URL':browser.current_url, 'Cover_URL':get_img()}, ignore_index=True)
    
    books_dataset.to_csv('Data/Books_Data/Books_Category9.csv')
    
    print('================= Book number: ', book_index, ' was successfully scraped =================')

## Method 1: Scrape by browseing website's categories

### Pager function

In [None]:
# To iterate over the category pages

def pager(category_no):
    
    print('===================== Category Number: ', category_no, '=====================')

    # ================= Open all pages within a category =================
    while True:
        try:
            next_page_btn =browser.find_elements_by_xpath("/html/body/div[2]/div/div[1]/div[2]/div[2]/div[2]/div/div[2]")
            element =WebDriverWait(browser,5).until(expected_conditions.element_to_be_clickable((By.XPATH,
            '/html/body/div[2]/div/div[1]/div[2]/div[2]/div[2]/div/div[2]')))
            browser.execute_script("return arguments[0].scrollIntoView();", element)
            element.click()        
        except:
            #if len(next_page_btn) <1:
            print("No more pages left")
            break

In [23]:
# We will use the first 27 categories of books from Abjjad
# So, we collect the categories xpath to open the links of categories
# Then, getting books urls

for i in range(1,28):
    browser.find_element_by_xpath('/html/body/header/div[2]/nav[2]/ul/li[2]/div/span').click() # go to books tab
    browser.find_element_by_xpath('/html/body/header/div[2]/nav[2]/ul/li[2]/div/div/a[1]').click() # go to all books option
    browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[2]/div[2]/div[1]/section/div/ul/li['+str(i)+']').click() # open a category page
    #print("Number of books in this category is: ", browser.find_element_by_xpath('/html/body/div[2]/div/div[1]/div[2]/div[1]/p/span').text)
    pager(i)

    # ================= Collect all books urls within a specific category =================
    urls = browser.find_elements_by_css_selector("div.thebookbadge h5 a")
    list_of_urls = []
    for url in urls:
        #print(url.get_attribute("href")) 
        list_of_urls.append(url.get_attribute("href"))
        
    # ================== Save the list of urls into a text file ===================
    with open('Data/Lists_of_Links/list_of_urls'+str(i)+'.txt', "w") as output:
        output.write(str(list_of_urls))
    
    # ================= Scrape each book by its url within a category =================
    for book_index, book in enumerate(list_of_urls):
        # Open the book link
        try:
            browser.get(book)
        # Call a function to scrape each book metadata by pass the previous list
            book_scraper(book_index+1)
        except:
            print('The book number:', book_index, 'was cancelled due to an exception')

on
The book number: 893 was cancelled due to an exception
The book number: 894 was cancelled due to an exception
The book number: 895 was cancelled due to an exception
The book number: 896 was cancelled due to an exception
The book number: 897 was cancelled due to an exception
The book number: 898 was cancelled due to an exception
The book number: 899 was cancelled due to an exception
The book number: 900 was cancelled due to an exception
The book number: 901 was cancelled due to an exception
The book number: 902 was cancelled due to an exception
The book number: 903 was cancelled due to an exception
The book number: 904 was cancelled due to an exception
The book number: 905 was cancelled due to an exception
The book number: 906 was cancelled due to an exception
The book number: 907 was cancelled due to an exception
The book number: 908 was cancelled due to an exception
The book number: 909 was cancelled due to an exception
The book number: 910 was cancelled due to an exception
The boo

## Method 2: Scrape using lists of books urls

In [24]:
# Read the text file 
textFile = open("Data/Lists_of_Links/list_of_urls10.txt", "r")
list_of_urls = textFile.read()
# Convert it into a list
list_of_urls = list_of_urls.split(",")
textFile.close()

In [25]:
len(list_of_urls)

31901

In [27]:
# ================= Scrape each book by its url within a specific category =================
book_index = 1360
while book_index < len(list_of_urls):
    # Open the book link
    try:
        browser.get(list_of_urls[book_index][2:-1]) # there are additional space and qoutation mark, we need to remove them.
        # Call a function to scrape each book metadata by pass the previous list
        book_scraper(book_index)
    except:
        print("The book number: ", book_index, 'was cancelled due to an exception with the link')
    book_index += 1



InvalidArgumentException: Message: Malformed URL: URL constructor: on is not a valid URL.


## Overview of the resulted dataframe

In [28]:
books_dataset.shape

(198, 16)

In [29]:
books_dataset.head()

Unnamed: 0,ISBN,Title,Author,Authors_Number,Description,Genres,Average_Ratings,Reviews_Number,Quotes_Number,Community_Size,Pages_Number,Editions,Publication_Year,Publisher,URL,Cover_URL
0,ISBN 13 9789953895697,روايتي لروايتي,تأليف سحر خليفة (تأليف),المؤلفون\n1,الأديب، كأيِّ إنسان، ابن البيئة. وهي، أي البيئ...,أدب مختارات أدبية,3.3,مراجعات\n2,اقتباسات\n12,القرّاء\n76,366 صفحة,طبعات\n1,نشر سنة 2018,دار الآداب,https://www.abjjad.com/book/2710601731/%D8%B1%...,https://abjjadst.blob.core.windows.net/pub/792...
1,ISBN 13 978-977-6765-14-7,امرأة من طهران,تأليف فريبا وفي (تأليف) المترجم أحمد موسى (ترجمة),المؤلفون\n2,امرأة من طهران تروي حياة النساء حبيسات الجدران...,أدب أدب مترجم إضافات جديدة جديد,4.0,مراجعات\n2,اقتباسات\n2,القرّاء\n76,254 صفحة,طبعات\n1,نشر سنة 2020,منشورات الربيع,https://www.abjjad.com/book/2735898624/%D8%A7%...,https://abjjadst.blob.core.windows.net/pub/7e9...
2,ISBN 13 9789953211961,مختارات #5: في صحبة المتنبي ورفاقه,تأليف الطيب صالح (تأليف),المؤلفون\n1,أي صحبة خير من صحبة أبي الطيب المتنبي الذي شغل...,أدب نقد أدبي,4.8,مراجعات\n1,اقتباسات\n5,القرّاء\n75,442 صفحة,طبعات\n1,نشر سنة 2005,دار رياض الريس للكتب والنشر,https://www.abjjad.com/book/15445051/%D9%85%D8...,https://abjjadst.blob.core.windows.net/pub/5a9...
3,المؤسسة العربية للدراسات والنشر,حين تركنا الجسر,تأليف عبد الرحمن منيف (تأليف),المؤلفون\n1,"""أية مشاعر للفرح يحملها قلب الإنسان؟ أية غبطة ...",أدب روايات,3.8,مراجعات\n8,اقتباسات\n5,القرّاء\n75,217 صفحة,طبعات\n1,نشر سنة 1999,,https://www.abjjad.com/book/15445384/%D8%AD%D9...,https://abjjadst.blob.core.windows.net/pub/395...
4,ISBN 9960497887,متعة الحديث 1,تأليف عبد الله محمد الداوود (تأليف),المؤلفون\n1,متعة الحديث مجموعة مكونة من كتابين ذكر فيها ال...,أدب مختارات أدبية,3.8,مراجعات\n1,اقتباسات\n12,القرّاء\n75,314 صفحة,طبعات\n1,نشر سنة 2006,,https://www.abjjad.com/book/15446139/%D9%85%D8...,https://abjjadst.blob.core.windows.net/pub/77b...
