# Scrapy

Scrapy is a web scraping framework for Python. It provides a complete set of tools for crawling websites and extracting structured data from them efficiently.

```python
import scrapy
```

### Core Concepts

#### Spider

- The main class for defining how a particular site (or group of sites) will be scraped.

#### Item

- Container for scraped data; defines the fields to be extracted.

#### Item Pipeline

- Component for processing scraped items after extraction.

#### Selector

- Used to extract data from HTML/XML sources using XPath or CSS expressions.

#### Middleware

- Hooks into Scrapy's request/response processing.

### Common Scrapy Components

#### Spider

- `scrapy.Spider`: Base class for Scrapy spiders
  - `name`: Unique identifier for the spider
  - `start_urls`: List of URLs where the spider begins crawling
  - `parse(response)`: Default callback that handles the response downloaded for each request

#### Selectors

- `response.css()`: Select elements using CSS selectors
- `response.xpath()`: Select elements using XPath expressions

#### Items

- `scrapy.Item`: Base class for Scrapy items
- `scrapy.Field()`: Used to specify fields in an item

#### Item Pipeline

- `process_item(item, spider)`: Method to process each scraped item
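
A pipeline is an ordinary class that exposes `process_item`. The sketch below assumes items carry a `rating` field holding a word such as `"Three"`; the class name `RatingPipeline` and the mapping are illustrative, not part of Scrapy.

```python
class RatingPipeline:
    """Hypothetical pipeline that turns rating words into integers."""

    WORD_TO_NUMBER = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    def process_item(self, item, spider):
        # Unknown or missing ratings become None instead of raising.
        item["rating"] = self.WORD_TO_NUMBER.get(item.get("rating"))
        # A pipeline returns the item so later pipelines can keep processing it.
        return item


# Call the method directly with a plain dict standing in for a scraped item.
cleaned = RatingPipeline().process_item(
    {"title": "Example", "rating": "Three"}, spider=None
)
print(cleaned)
```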

#### Settings

- `ROBOTSTXT_OBEY`: Whether to respect robots.txt rules
- `CONCURRENT_REQUESTS`: The maximum number of concurrent (i.e. simultaneous) requests that will be performed by the Scrapy downloader

The following Python script demonstrates these components together. Scrapy projects are normally split across separate files and directories, but for demonstration purposes everything here lives in a single script:


```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field

# Define Item
class BookItem(Item):
    title = Field()
    price = Field()
    rating = Field()

# Define Spider
class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            item = BookItem()
            item['title'] = book.css('h3 a::attr(title)').get()
            item['price'] = book.css('p.price_color::text').get()
            item['rating'] = book.css('p.star-rating::attr(class)').get().split()[-1]
            yield item

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

# Define Item Pipeline
class PricePipeline:
    def process_item(self, item, spider):
        # Strip the leading currency symbol (e.g. '£51.77') and convert to float
        item['price'] = float(item['price'][1:])
        return item

# Define settings
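settings = {
    # FEEDS replaces the deprecated FEED_FORMAT / FEED_URI pair (Scrapy 2.1+)
    'FEEDS': {'books.json': {'format': 'json'}},
    'ROBOTSTXT_OBEY': True,
    'CONCURRENT_REQUESTS': 1,
    # Lower numbers run earlier when several pipelines are enabled
    'ITEM_PIPELINES': {PricePipeline: 300},
}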

# Set up the crawler and start the spider
process = CrawlerProcess(settings)
process.crawl(BookSpider)
process.start()

print("Scraping completed. Check 'books.json' for results.")
```

This script demonstrates several key Scrapy concepts:

1. **Item Definition**: We define a `BookItem` class to structure the data we're scraping.

2. **Spider**: The `BookSpider` class defines how to scrape the target website (books.toscrape.com in this case).

3. **Parsing**: The `parse` method shows how to use CSS selectors to extract data from the HTML response.

4. **Pagination**: We demonstrate how to follow links to the next page to scrape multiple pages.

5. **Item Pipeline**: The `PricePipeline` class shows how to process items after they've been scraped (in this case, converting the price to a float).

6. **Settings**: We define various settings, including output format, concurrency, and pipeline configuration.

7. **Crawler Process**: We set up and run the crawler using `CrawlerProcess`.

To run this script, you'll need to have Scrapy installed. You can install it using pip:

```
pip install scrapy
```

When run, the script scrapes books.toscrape.com and saves the results to `books.json` in the current working directory.

### Assignment 1: Web Scraper for a single page
 * Install Scrapy and create a new project.
 * Create a spider that scrapes the page title and all `<h1>` heading elements.
 * Store the scraped data in a CSV file.

```python
# Install required libraries (notebook shell commands)
!pip install scrapy
!pip install selenium
!apt-get install chromium-browser

# Import necessary libraries
import csv
from urllib.parse import quote_plus

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Define search query and parameters
search_query = "your_search_query"
num_images = 100  # Number of images to scrape (not enforced by this spider)

# Construct Google Images search URL (the query is URL-encoded)
url = f"https://www.google.com/search?q={quote_plus(search_query)}&tbm=isch&ijn=0"

# Create Scrapy Spider
class GoogleImagesSpider(scrapy.Spider):
    name = "google_images"
    start_urls = [url]

    def parse(self, response):
        # Use Selenium to render JavaScript-heavy pages
        options = Options()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        options.add_argument("--window-size=1920,1080")

        driver = webdriver.Chrome(options=options)
        try:
            driver.get(response.url)

            # Extract image URLs (Selenium 4 locator API; the class names
            # are site-specific and may change)
            image_urls = []
            for img in driver.find_elements(By.XPATH, "//img[@class='rg_i Q4LuWd']"):
                image_url = img.get_attribute("src")
                image_urls.append(image_url)
                print(image_url)  # Print image URL to console

            # Look up the pagination link while the driver is still open
            try:
                next_page = driver.find_element(
                    By.XPATH, "//a[@class='rnypbc']"
                ).get_attribute("href")
            except NoSuchElementException:
                next_page = None
        finally:
            # Always close the Selenium driver, even if extraction fails
            driver.quit()

        # Save image URLs to CSV file
        with open('image_urls.csv', 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(["Image URL", "Search Query"])  # Header
            for image_url in image_urls:
                writer.writerow([image_url, search_query])

        # Follow pagination links (if any)
        if next_page:
            yield Request(next_page, callback=self.parse)

# Create CrawlerProcess with settings
process = CrawlerProcess(settings={
    "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.3",
    "DOWNLOAD_DELAY": 3,
    "CONCURRENT_REQUESTS": 16,
})

# Run Spider
process.crawl(GoogleImagesSpider)
process.start()
```

```
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-browser is already the newest version (1:85.0.4183.83-0ubuntu2.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 48 not upgraded.

2024-09-19 09:14:28 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-09-19 09:14:28 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-6.1.85+-x86_64-with-glibc2.35
2024-09-19 09:14:28 [scrapy.addons] INFO: Enabled addons:
[]
2024-09-19 09:14:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor

ReactorNotRestartable:
```

The run ends with `ReactorNotRestartable` because `process.start()` runs the Twisted reactor, which cannot be started again in the same process once it has stopped. In a notebook this typically means a crawl has already been run in the current kernel; restarting the kernel, or running the spider as a standalone script, avoids the error.