<a href="https://colab.research.google.com/github/matoussi-roua/openlibrary-spider-scraper/blob/main/OpenLibrary_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Install Scrapy and Playwright**


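*   scrapy → Python framework for crawling websites and extracting structured data.

*   scrapy-playwright → Lets Scrapy handle dynamic websites (those that load content with JavaScript) by driving a real browser (Chromium, Firefox, or WebKit) through Playwright.

*   !playwright install → Downloads the browser binaries Playwright needs to render pages. It does not install operating-system libraries; on Linux hosts, playwright install --with-deps (or playwright install-deps) adds them, which is what the "Host system is missing dependencies" message in the output below refers to.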
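After the install cell has run, it can be useful to confirm that the packages are actually visible to the current kernel. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an assumption made for illustration and is not part of Scrapy or Playwright.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they are passed to pip in the install cell.
for name in ("scrapy", "scrapy-playwright", "playwright"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

This only checks the Python packages. It says nothing about whether the browser binaries or the system libraries they depend on are present.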



In [1]:
!pip install scrapy scrapy-playwright
!playwright install

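# Install the asyncio-based Twisted reactor. This must run before anything
# imports twisted.internet.reactor, which would install the default reactor.
from twisted.internet import asyncioreactor
asyncioreactor.install()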


╔══════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers. ║
║ Missing libraries:                                   ║
║     libwoff2dec.so.1.0.2                             ║
║     libgstgl-1.0.so.0                                ║
║     libgstcodecparsers-1.0.so.0                      ║
║     libavif.so.13                                    ║
║     libharfbuzz-icu.so.0                             ║
║     libenchant-2.so.2                                ║
║     libsecret-1.so.0                                 ║
║     libhyphen.so.0                                   ║
║     libmanette-0.2.so.0                              ║
╚══════════════════════════════════════════════════════╝
    at validateDependenciesLinux (/usr/local/lib/python3.12/dist-packages/playwright/driver/package/lib/server/registry/dependencies.js:269:9)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async Registry._

# **Import Scrapy and CrawlerProcess**
scrapy provides the Spider base class and the selector tools used to extract data, and CrawlerProcess lets you run a spider from Python code instead of the scrapy command line.

In [2]:
import scrapy
from scrapy.crawler import CrawlerProcess


# **Setup for Asynchronous Scrapy Execution**

asyncio: Python's standard library for cooperative concurrency. A task gives up control while it waits on I/O so that other tasks can run.

reactor: Twisted’s event loop, which drives the asynchronous network operations inside Scrapy.

asyncioreactor: A Twisted reactor that runs on top of an asyncio event loop, so Scrapy can share the loop that Jupyter/Colab already runs.
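To make the "concurrently without blocking" point concrete, here is a small sketch that uses only asyncio. The `fetch` coroutine is a stand-in for a network request (it just sleeps); it is not part of Scrapy or of this notebook's spider.

```python
import asyncio
import time


async def fetch(name, delay):
    """Pretend to wait on the network for `delay` seconds, then return `name`."""
    await asyncio.sleep(delay)
    return name


async def main():
    start = time.perf_counter()
    # The three waits overlap, so the total is close to one delay, not three.
    results = await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    # In a plain script. Inside Jupyter/Colab a loop is already running,
    # so use `results, elapsed = await main()` in a cell instead.
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```

`asyncio.gather` returns the results in the order the awaitables were passed, regardless of which finished first.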

In [3]:
import asyncio
from twisted.internet import reactor
from twisted.internet import asyncioreactor

# The asyncio reactor was already installed in the first cell. Calling
# asyncioreactor.install() again raises ReactorAlreadyInstalledError,
# so the explicit install is left commented out here.
# try:
#     asyncioreactor.install()
# except Exception as e:
#     print(f"Could not install asyncioreactor: {e}")


# **OpenLibrarySpider**

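This Scrapy spider scrapes book information from Open Library.

It starts at https://openlibrary.org/ and selects every book shown in the carousels on that page.

It then follows each book link and extracts the details: title, subtitle, author, rating, rating count, publication date, publisher, language, page count, description, and subjects.

Each yield hands one item to Scrapy, which writes it to every configured feed (here, both JSON and CSV).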
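Following a book link means turning the href found on the page into an absolute URL first. Scrapy's `response.urljoin` is essentially a wrapper around the standard library's `urljoin`, so the resolution step can be sketched without Scrapy. The href values below are made up for illustration and do not refer to real Open Library records.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each non-empty href against base_url, skipping missing values."""
    return [urljoin(base_url, href) for href in hrefs if href]


base = "https://openlibrary.org/"
# Hypothetical hrefs: one relative, one already absolute, one missing.
hrefs = [
    "/works/EXAMPLE_ID/example-title",
    "https://openlibrary.org/works/OTHER_ID",
    None,
]

for url in absolute_links(base, hrefs):
    print(url)
```

Relative paths are resolved against the page URL, absolute URLs pass through unchanged, and missing hrefs are dropped, which mirrors the `if relative_url:` guard in the spider.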

In [4]:


class OpenLibrarySpider(scrapy.Spider):
    name = "openlibrary"
    start_urls = [
        'https://openlibrary.org/'  # replace with your starting page
    ]

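    def parse(self, response):
        # Each carousel entry on the start page wraps one book cover link.
        books = response.css('div.book.carousel__item')
        for book in books:
            relative_url = book.css('div.book-cover a::attr(href)').get()
            # .get() returns None when the selector matches nothing, so skip those.
            if relative_url:
                full_url = response.urljoin(relative_url)
                print(f"[STEP] Following book link: {full_url}")  # Debug output
                # Request the book page and parse it with parse_book_details.
                yield response.follow(full_url, callback=self.parse_book_details)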

    def parse_book_details(self, response):
        title = response.css('h1.work-title::text').get()
        print(f"[STEP] Parsing book details: {title}")  # Debug output
        subtitle = response.css('h2.work-subtitle::text').get()
        author = response.css('h2.edition-byline a::text').get()
        rating = response.css('ul.readers-stats meta[itemprop="ratingValue"]::attr(content)').get()
        rating_count = response.css('ul.readers-stats meta[itemprop="ratingCount"]::attr(content)').get()
        pub_date = response.css('div.edition-omniline-item span[itemprop="datePublished"]::text').get()
        publisher = response.css('div.edition-omniline-item a[itemprop="publisher"]::text').get()
        language = response.css('div.edition-omniline-item span[itemprop="inLanguage"] a::text').get()
        pages = response.css('div.edition-omniline-item span[itemprop="numberOfPages"]::text').get()
        description = response.css('div.book-description .read-more__content > p::text').get(default='').strip()
        subjects = response.css('div.subjects-content a::text').getall()
        print(f"[STEP] Yielding data for: {title}")  # Debug output
        yield {
            'title': title,
            'subtitle': subtitle,
            'author': author,
            'rating': rating,
            'rating_count': rating_count,
            'publication_date': pub_date,
            'publisher': publisher,
            'language': language,
            'pages': pages,
            'description': description,
            'subjects': subjects
            }
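The selectors in the spider return strings, or None when nothing matches, so numeric fields such as rating and pages arrive as text. Below is a minimal post-processing sketch, assuming you want numbers in the final data. The helper names (`to_int`, `to_float`, `clean_item`) and the sample values are hypothetical and are not part of the original notebook.

```python
import re


def to_int(text):
    """Return the first run of digits in `text` as an int, or None."""
    if not text:
        return None
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


def to_float(text):
    """Return `text` as a float, or None if it is missing or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def clean_item(item):
    """Return a copy of a scraped item with trimmed and typed fields."""
    cleaned = dict(item)
    cleaned["title"] = (item.get("title") or "").strip() or None
    cleaned["rating"] = to_float(item.get("rating"))
    cleaned["rating_count"] = to_int(item.get("rating_count"))
    cleaned["pages"] = to_int(item.get("pages"))
    return cleaned


# Made-up values in the shape the spider yields.
sample = {"title": "  Example Title ", "rating": "4.2",
          "rating_count": "1,234", "pages": " 320 "}
print(clean_item(sample))
```

One way to use this would be to call `clean_item` on the dictionary just before the `yield` in `parse_book_details`; a Scrapy item pipeline is another option.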



# **Run Spider Programmatically**
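run_spider(): Coroutine that configures and starts the Scrapy crawler.

CrawlerProcess(settings=...): Configures the feed exports (books.json and books.csv), the log level, and the asyncio-based Twisted reactor.

process.crawl(OpenLibrarySpider): Schedules the spider to run.

await process.start(stop_after_crawl=True): Starts the crawl. stop_after_crawl=True tells Scrapy to stop the reactor once all scheduled crawlers have finished.

**Asyncio loop handling:**

If a loop is already running (the usual case in Colab/Jupyter), run_spider() is added to that loop as a task. The cell then returns immediately while the crawl proceeds in the background, so wait for the "Spider closed" line in the log before running the cells that read the output files.

Otherwise, the spider runs in a new loop via asyncio.run().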
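The loop-handling logic can be looked at in isolation. This sketch swaps the real crawl for a trivial stand-in coroutine (`fake_crawl`, hypothetical) so that only the "is a loop already running?" decision is shown. The `launch` helper is likewise an illustration, not a Scrapy API.

```python
import asyncio


async def fake_crawl():
    """Stand-in for the real crawl coroutine; finishes immediately."""
    await asyncio.sleep(0)
    return "done"


def launch(coro_factory):
    """Run the coroutine on the current loop if there is one, else in a new loop."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script): block until the coroutine finishes.
        return asyncio.run(coro_factory())
    # A loop is running (Jupyter/Colab): schedule it and return the Task at once.
    return loop.create_task(coro_factory())


result = launch(fake_crawl)
print(result)
```

In a plain script `result` is the coroutine's return value. In a notebook it is an `asyncio.Task` that is still pending when the cell returns, and `await result` in a later cell waits for it.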

In [5]:
# -----------------------------
# Run spider programmatically
# -----------------------------

async def run_spider():
    process = CrawlerProcess(settings={
        "FEEDS": {
            "books.json": {"format": "json", "encoding": "utf-8", "indent": 4},
            "books.csv": {"format": "csv", "encoding": "utf-8"},
        },
        "LOG_LEVEL": "INFO",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor", # Ensure this is also in process settings
    })

    # Schedule the spider to be run
    process.crawl(OpenLibrarySpider)

    # Start the crawler within the asyncio loop
    await process.start(stop_after_crawl=True)


# Get the current running loop and create a task to run the spider
try:
    loop = asyncio.get_running_loop()
    loop.create_task(run_spider())
except RuntimeError:
    # If no loop is running (less common in Colab), use asyncio.run()
    asyncio.run(run_spider())
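Because the cell above may only schedule the crawl as a background task, the feed files might not exist yet, or might still be growing, when the next cells run. Here is a small sketch for checking them first. The helper name `output_status` is an assumption; the paths are the ones used by the download cell.

```python
from pathlib import Path


def output_status(paths):
    """Map each path to its size in bytes, or None if the file is missing."""
    status = {}
    for p in paths:
        path = Path(p)
        status[str(p)] = path.stat().st_size if path.is_file() else None
    return status


for name, size in output_status(["/content/books.csv", "/content/books.json"]).items():
    print(name, "missing" if size is None else f"{size} bytes")
```

A file that exists is not necessarily complete, so treat this as a rough signal. The "Spider closed" line in the Scrapy log is the more reliable indication that the crawl has ended.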

In [6]:
from google.colab import files
files.download("/content/books.csv")
files.download("/content/books.json")


In [7]:
import pandas as pd
df = pd.read_csv("/content/books.csv")
df.to_excel("books.xlsx", index=False)
print("Excel file generated: books.xlsx")

Excel file generated: books.xlsx
