# Most Used Functions in Scrapy

Scrapy is an open-source web crawling and scraping framework for Python, used to extract structured data from websites. This notebook covers some of its most commonly used commands, methods, and techniques.

## 1. Creating a Scrapy Project

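To start a new Scrapy project, use the `startproject` command. It creates a `myproject` directory containing `scrapy.cfg` and a Python package with, among other files, `items.py`, `pipelines.py`, `settings.py`, and a `spiders/` folder.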

In [None]:
# Command to create a new Scrapy project
!scrapy startproject myproject

## 2. Creating a Spider

A spider is a class that subclasses `scrapy.Spider`. It defines which URLs to start from and how each downloaded response is parsed into items or follow-up requests.

In [None]:
# Example Spider code
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ['http://example.com']

    def parse(self, response):
        title = response.css('title::text').get()
        yield {'title': title}

## 3. Running a Spider

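To run a spider, use the `crawl` command followed by the spider's `name` attribute. Run it from inside the project directory (the one containing `scrapy.cfg`), and make sure the spider class is saved in a module under `myproject/spiders/`. A class defined only in a notebook cell is not visible to the command.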

In [None]:
# Command to run a spider
!scrapy crawl myspider

## 4. Selecting Elements with CSS Selectors

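`response.css()` selects elements with CSS selectors. Scrapy adds the `::text` and `::attr(name)` pseudo-elements to extract text and attribute values. Calling `.get()` returns the first match (or `None` if nothing matched), and `.getall()` returns every match as a list.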

In [None]:
# Example of using CSS selectors
class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ['http://example.com']

    def parse(self, response):
        for item in response.css('div.item'):
            title = item.css('h2::text').get()
            yield {'title': title}

## 5. Selecting Elements with XPath

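`response.xpath()` selects elements with XPath expressions, which can express conditions that CSS selectors cannot, such as matching on text content. Note that `@class="item"` matches only when the attribute value is exactly `item`, whereas the CSS selector `div.item` also matches elements that carry additional classes.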

In [None]:
# Example of using XPath selectors
class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ['http://example.com']

    def parse(self, response):
        for item in response.xpath('//div[@class="item"]'):
            title = item.xpath('h2/text()').get()
            yield {'title': title}

## 6. Following Links

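To follow links, use `response.follow`, which accepts relative URLs (or the selectors that point to them) and returns a new request whose response is handled by the given callback.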

In [None]:
# Example of following links
class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ['http://example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse_detail)

    def parse_detail(self, response):
        title = response.css('title::text').get()
        yield {'title': title}

## 7. Storing the Scraped Data

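Scrapy's feed exports can write the scraped items to formats such as JSON, JSON Lines, CSV, and XML, with the format inferred from the file extension passed to `-o`. Note that `-o` appends to an existing file, which produces invalid JSON on a second run. Recent Scrapy versions also accept `-O` (capital letter) to overwrite the file instead.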

In [None]:
# Command to store the scraped data in JSON format
!scrapy crawl myspider -o output.json

## 8. Handling Request and Response

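Overriding `start_requests` lets you build the initial requests yourself instead of relying on `start_urls`, and each `scrapy.Request` names the callback that will receive its response. Downloader and spider middleware offer further hooks for processing requests and responses, but they are not covered in this example.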

In [None]:
# Example of handling requests and responses
class MySpider(scrapy.Spider):
    name = "myspider"
    # start_urls is omitted: it is not used once start_requests() is overridden

    def start_requests(self):
        urls = [
            'http://example.com/page1',
            'http://example.com/page2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title = response.css('title::text').get()
        yield {'title': title}

## 9. Using Scrapy Shell

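The Scrapy shell is an interactive console that downloads a page and exposes it as `response`, so you can try selectors before putting them in a spider. Because it needs an interactive terminal, it is usually run outside the notebook.

In [None]:
# Command to open the Scrapy shell (in a terminal, drop the leading "!";
# on Windows, use double quotes around the URL)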
!scrapy shell 'http://example.com'

## 10. Using Scrapy Pipelines

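Item pipelines process each item after a spider yields it, for example to clean, validate, deduplicate, or store it. A pipeline is a class with a `process_item(self, item, spider)` method, and it must be enabled through the `ITEM_PIPELINES` setting.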

In [None]:
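# Example of a Scrapy Pipeline
class MyPipeline:
    def process_item(self, item, spider):
        # .get() in the spider returns None when nothing matched,
        # so guard before calling .upper()
        if item.get('title'):
            item['title'] = item['title'].upper()
        return item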