# HTTP

## Headers

HTTP request headers are key-value pairs, sent in the request's header section, that carry additional information about the request or the client making it. Some common request headers include:
- **User-Agent**: Provides information about the client making the request, such as the type of browser, operating system, and version.
- **Host**: Specifies the domain name of the server being requested.
- **Accept**: Indicates the types of content the client can accept in response.
- **Content-Type**: Specifies the type of content being sent in the request body, such as "application/json" or "text/html".
- **Authorization**: Contains credentials for authenticating the client with the server.
- **Cookie**: Sends previously stored cookies back to the server.
- **Referer**: Indicates the URL of the webpage that linked to the resource being requested.
- **Accept-Language**: Specifies the preferred language(s) for the response.
- **Cache-Control**: Carries directives for caching mechanisms; it is used in both requests and responses.
- **Connection**: Controls whether the connection to the server should be kept open or closed after the request has been processed.

These headers, among others, help the server understand the context of the request and determine how to respond appropriately.
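As a minimal sketch of how such key-value pairs are attached to a request, the standard library's `urllib.request.Request` accepts them as a dictionary. The URL and header values below are hypothetical, and nothing is sent over the network:

```python
from urllib.request import Request

# Hypothetical URL and header values, chosen only for illustration.
request = Request(
    "https://example.com/api/items",
    headers={
        "User-Agent": "notes-example/0.1",
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.8",
        "Cache-Control": "no-cache",
    },
)

# urllib stores header names in capitalized form ("User-agent"),
# which is harmless because HTTP header names are case-insensitive.
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Headers such as `Host` are added by the client library when the request is actually sent, so they do not appear in this listing.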

## Cookies

In the context of web development and HTTP, "cookies" refer to small pieces of data that websites store on a user's computer or device. Cookies are sent from a web server to a user's web browser and are then stored locally on the user's device. They are commonly used to keep track of user sessions, preferences, authentication information, and other browsing data.

Here are some key points about cookies:
- **Persistence**: Cookies can be either session cookies or persistent cookies. Session cookies are temporary and are deleted when the user closes their browser, while persistent cookies remain on the user's device for a specified period or until manually deleted.
- **Purpose**: Cookies serve various purposes, including session management (maintaining login state), personalization (remembering user preferences), tracking (analyzing user behavior), and advertising (delivering targeted ads).
- **Structure**: Cookies typically consist of a name-value pair along with additional attributes such as expiration time, domain, path, and security settings.
- **Usage**: A server sets cookies with the `Set-Cookie` header in its response. The browser then sends them back in the `Cookie` header of later requests to the same site, allowing the server to recognize the user and personalize the experience.
- **Security and Privacy**: While cookies are essential for many web applications, they can also raise privacy concerns. Users can control cookie behavior through browser settings and extensions, including blocking or deleting cookies and managing cookie permissions.

Overall, cookies play a fundamental role in modern web development by enabling stateful interactions between web servers and clients, facilitating personalization and user tracking, and enhancing the user experience on the web.
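The name-value pair and its attributes can be inspected with Python's standard `http.cookies` module. This is a small sketch; the `Set-Cookie` value and the helper `is_session_cookie` are made up for illustration:

```python
from http.cookies import SimpleCookie

# A hypothetical Set-Cookie value as a server might send it.
set_cookie_value = "session_id=abc123; Path=/; Max-Age=3600; Secure; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_value)

morsel = jar["session_id"]
print(morsel.value)       # the cookie's value
print(morsel["path"])     # attribute: path scope
print(morsel["max-age"])  # attribute: lifetime in seconds


def is_session_cookie(m):
    """A cookie with neither Max-Age nor Expires lasts only for the browser session."""
    return not m["max-age"] and not m["expires"]


# What a client would send back in the Cookie request header.
cookie_header = "; ".join(f"{name}={m.value}" for name, m in jar.items())
print(cookie_header)
```

Only the name-value pair goes back to the server; attributes such as `Path` or `Max-Age` are instructions for the browser and are not echoed in the `Cookie` header.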

# Web scraping tools comparisons


## Selenium and Scrapy

Selenium and Scrapy are both popular tools for web scraping but they have some key differences in terms of ease of use, functionality, and performance.

- Selenium:
    - Selenium is primarily used for web testing and automation, but it can also be used for web scraping.
    - Selenium is a browser automation tool that allows you to control a web browser programmatically, allowing you to interact with websites as a real user would.
    - Selenium is great for scraping websites that heavily rely on JavaScript, as it can render JavaScript content and execute complex interactions on the page.
    - However, because Selenium relies on controlling a browser, it can be slower and more resource-intensive compared to other scraping tools.
    - Selenium is a good option for websites that require a lot of interaction or have complex layouts that are difficult to scrape with traditional scraping tools.
- Scrapy:
    - Scrapy is a powerful and fast web scraping framework written in Python.
    - Scrapy is specifically designed for web scraping and offers a wide range of tools and features for scraping and parsing websites.
    - Scrapy allows you to easily crawl websites, extract data, and store it in various formats such as JSON, XML, or CSV.
    - Scrapy is more lightweight and efficient compared to Selenium, as it does not require a browser to run.
    - Scrapy is great for scraping large volumes of data from multiple websites and for more advanced scraping tasks.
    - Scrapy provides a robust system for handling requests, managing cookies, and following links automatically.

In conclusion, Selenium is a better choice for scraping websites with heavy JavaScript content or complex interactions, while Scrapy is a more efficient and versatile tool for general web scraping tasks. Ultimately, the choice between Selenium and Scrapy will depend on the specific requirements of the scraping project and the technical skills of the user.

## BeautifulSoup and Scrapy

Beautiful Soup (bs4) and Scrapy are both popular Python libraries used for web scraping, but they have some key differences:

- **Usage:** Beautiful Soup is primarily used for parsing HTML and XML documents, making it a good choice for extracting specific information from web pages. Scrapy, on the other hand, is a full-fledged web scraping framework that allows for more advanced web scraping tasks such as crawling multiple pages, handling asynchronous requests, and storing data in databases.

- **Speed:** Beautiful Soup only parses documents; fetching pages requires a separate HTTP client such as Requests, which typically handles one request at a time. Scrapy downloads pages asynchronously with many requests in flight, so it is generally faster for scraping large amounts of data or websites with many pages.

- **Scalability:** Scrapy is more scalable than Beautiful Soup, as it is built to handle large-scale web scraping tasks. Its built-in scheduling and parallel processing features make it well-suited for scraping websites with a lot of pages or for scraping data from multiple websites simultaneously.

- **Learning curve:** Beautiful Soup is easier to learn and understand, especially for beginners, as it provides a simple and intuitive way to navigate and extract data from HTML documents. Scrapy, on the other hand, has a steeper learning curve due to its more advanced features and concepts, such as spiders, pipelines, and middleware.

Scrapy has built-in selectors (CSS and XPath) for parsing and extracting data from HTML documents, so in most cases you do not need BeautifulSoup (bs4) alongside it.

However, the two can be combined: you can pass the response body to BeautifulSoup inside a Scrapy callback if you prefer its navigation API or need to handle badly formed, non-standard HTML.

Ultimately, whether you need to use BeautifulSoup with Scrapy will depend on the specific requirements of your web scraping project.

In conclusion, Beautiful Soup is a good choice for simple web scraping tasks or for projects where you only need to extract specific information from a few web pages. Scrapy, on the other hand, is more suitable for larger and more complex web scraping tasks that require crawling multiple pages, handling asynchronous requests, and storing data in databases. Ultimately, the choice between Beautiful Soup and Scrapy will depend on the specific requirements of your web scraping project and your level of experience with web scraping.

## Requests and Scrapy

Scrapy is a higher-level web scraping framework in Python that provides its own networking and request handling capabilities.

Scrapy handles the sending of HTTP requests, managing cookies, following redirects, and other networking tasks internally, allowing you to focus on writing the scraping logic and data extraction rules.

While the Requests library is often used for sending HTTP requests and handling responses in Python, it is typically not necessary when using Scrapy as it provides its own tools for making requests and handling responses efficiently.

In summary, when using Scrapy for web scraping, you can rely on its built-in capabilities for handling HTTP requests and responses without the need for the Requests library.

# Scrapy usage

## Items and spider

In a Scrapy project, the `items.py` file is used to define the data structure of the items that you want to scrape from websites. It's essentially a schema or blueprint for the structured data you intend to extract.

Here's why you need the `items.py` file:

- **Data Structure Definition**: It provides a central location to define the structure of the items you intend to scrape. Each item corresponds to a specific type of data you want to extract from web pages.

- **Item Class Definitions**: Inside the items.py file, you define Python classes that represent the different types of items. These classes typically subclass Scrapy's scrapy.Item class.

- **Field Definitions**: Within each item class, you define attributes (fields) that represent the different pieces of data you want to extract. A `scrapy.Field` does not enforce a data type; it is a plain dictionary of metadata (for example a serializer or input/output processors) that other components can read.

- **Consistency and Reusability**: By defining item structures in a separate file, you ensure consistency across your spider implementations. It also promotes reusability, as you can reuse item definitions across multiple spiders within the same project.

Here's a basic example of what the `items.py` file might look like:

```python
import scrapy

class MyItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
    image_url = scrapy.Field()
```

In summary, while the `parse_item` method within the spider class is responsible for extracting and processing individual items from responses, the `items.py` file is used to define the structure of those items in a centralized and reusable manner.

The `items.py` file serves as a blueprint for the structure of your items, specifying the fields they will contain and any metadata attached to those fields. It's a separate file where you define your item classes.

On the other hand, the parsing logic, which involves extracting data from web pages and populating these item objects with the extracted data, is typically implemented within the spider itself, usually in the parse method or other custom parsing methods within the spider class.

Here's a breakdown:

- `items.py`: This file defines the structure of your items using Scrapy's Item class. It specifies the fields that each item will contain and any additional field metadata. It's essentially a declaration of what kind of data you expect to extract.

- `spider.py`: This file contains the spider class where you implement the logic for crawling web pages, extracting data, and populating item objects. You write parsing logic within methods like parse or custom parsing methods such as parse_item. In these methods, you use Scrapy's selectors (XPath or CSS selectors) to extract data from the HTML responses and populate the item objects with the extracted data.

By separating the item definition (in `items.py`) from the parsing logic (in `spider.py`), you can maintain a clean and organized project structure. It also promotes reusability, as you can define item structures once and reuse them across multiple spiders within the same project.

## Concurrent requests politeness

In general, a polite level of concurrency in Scrapy falls within `1` to `8` concurrent requests per domain. This is controlled by the `CONCURRENT_REQUESTS_PER_DOMAIN` setting (default `8`); `CONCURRENT_REQUESTS` (default `16`) is the global cap across all domains. The right value depends on factors such as the target website's capacity, the nature of the scraping task, and the resources available to your spider. Here's a breakdown of what can be considered polite:

- **Low Concurrency (1-3):**
    - This range is very conservative and suitable for scraping from websites with limited server capacity, strict rate limits, or where you want to be extra cautious to avoid overloading the server.
    - It's also appropriate when adhering closely to the website's robots.txt directives, particularly if the website specifies a low crawl delay or imposes restrictions on simultaneous connections.
- **Moderate Concurrency (4-6):**
    - This range strikes a balance between politeness and efficiency, allowing for a modest level of concurrency without putting undue strain on the target server.
    - It's suitable for most scraping tasks on websites with moderate server capacity and where you want to balance speed with politeness.
- **High Concurrency (7-8):**
    - This range is more aggressive and suitable for scraping from websites with robust server infrastructure and higher capacity to handle concurrent requests.
    - It's appropriate when you prioritize speed and efficiency over being overly cautious, but it's essential to monitor server response times and error rates to ensure that the website can handle the load.
- **Customized Concurrency:**
    - Depending on the specific requirements of your scraping task and the characteristics of the target website, you may need to adjust the concurrency level outside of these ranges.
    - Factors such as the website's crawl delay, responsiveness, and tolerance for simultaneous connections should be taken into account when determining the appropriate concurrency level.

It's important to note that while higher concurrency levels can speed up the scraping process, they also increase the risk of overloading the server, triggering rate limits, or violating the website's terms of service. Therefore, it's crucial to strike a balance between speed and politeness and to always prioritize being respectful to the target website and its server. Additionally, monitoring server response times and adjusting the concurrency level as needed can help ensure smooth and efficient scraping operations.
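As a sketch, a moderately polite configuration could look like this in `settings.py`. The values are illustrative only and should be tuned for the site being crawled:

```python
# settings.py (illustrative values, not recommendations for any specific site)

ROBOTSTXT_OBEY = True                 # respect the site's robots.txt rules
CONCURRENT_REQUESTS = 16              # global cap across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap for any single domain
DOWNLOAD_DELAY = 0.5                  # seconds between requests to the same site
```

The per-domain cap is the one that matters for politeness toward a single website; the global cap only limits the total load your own crawler generates.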

## Autothrottle and Download_delay

Whether to use autothrottle or download_delay settings in your Scrapy project depends on your scraping requirements, the website you're scraping, and the resources available to you. Here are some considerations to help you decide:

- **Scraping Speed**: If you're scraping a website that does not have rate limits or restrictions on scraping speed, you may not need to use autothrottle or download_delay. However, if you're scraping a website that limits the number of requests you can make within a certain time period, or if you're concerned about overwhelming the server with too many requests, it's a good idea to implement some form of throttling.

- **Respecting Robots.txt**: If you're following the guidelines set out in the website's robots.txt file, you may want to adjust your scraping speed accordingly. Some websites specify crawl-delay directives in their robots.txt file to indicate how quickly you should crawl the site. In such cases, you can use autothrottle or download_delay to ensure that you're not making requests too frequently and adhering to the guidelines.

- **Server Load**: Scraping too quickly can put a strain on the server hosting the website you're scraping. If you're scraping a high-traffic website or a website with limited server resources, it's important to be mindful of the server load and adjust your scraping speed accordingly. Using autothrottle or download_delay can help distribute your requests more evenly and reduce the load on the server.

- **Concurrency**: If you're running multiple concurrent spider instances or scraping multiple websites simultaneously, using autothrottle or download_delay can help manage concurrency and prevent overloading the server with too many requests at once.

- **Testing and Monitoring**: Regardless of whether you choose to use autothrottle or download_delay, it's important to monitor your scraping activity and adjust your settings as needed. You may need to experiment with different settings to find the optimal balance between scraping speed and server load.

In summary, using autothrottle or download_delay settings in your Scrapy project can help manage scraping speed, prevent server overload, and ensure that you're scraping responsibly and ethically. Consider the specific requirements of your scraping task and the characteristics of the website you're scraping when deciding whether to implement throttling mechanisms.
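The two approaches map to the following Scrapy settings. The values here are illustrative, and each group can be used on its own:

```python
# settings.py (illustrative values)

# Option 1: a fixed delay between requests to the same site.
DOWNLOAD_DELAY = 2.0
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5x and 1.5x of DOWNLOAD_DELAY

# Option 2: let AutoThrottle adapt the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = False              # set True to log each throttling decision
```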

_**And what if I set both settings?**_

Setting both autothrottle and download_delay settings in your Scrapy project can provide additional control and flexibility over your scraping speed. However, you should be cautious about potential conflicts or redundancy between these settings. Here's how they interact:

- **Autothrottle**: The AutoThrottle extension adjusts the download delay dynamically, based mainly on the latency of the server's responses. It aims to keep a target number of parallel requests per site (`AUTOTHROTTLE_TARGET_CONCURRENCY`) and slows down when the server responds slowly, which reduces server load and the risk of being blocked.

- **Download Delay**: `DOWNLOAD_DELAY` is a fixed delay (in seconds) between consecutive requests to the same website. By default Scrapy randomizes it slightly (`RANDOMIZE_DOWNLOAD_DELAY`), waiting between 0.5 and 1.5 times the configured value. This is useful for limiting the speed of your scraping and ensuring that you're not making requests too frequently.

If you set both autothrottle and download_delay settings:

- **No Conflict**: The two settings do not fight each other. AutoThrottle honors `DOWNLOAD_DELAY` as a lower bound: it never sets a delay below it, and never one above `AUTOTHROTTLE_MAX_DELAY`.

- **Redundancy**: If you have no hard minimum in mind, a fixed delay adds little, because AutoThrottle already adjusts the delay dynamically. In many cases AutoThrottle alone is sufficient to manage scraping speed.

- **Fine-Tuning**: If you want to guarantee a minimum delay between requests regardless of server response times, setting both gives you that floor while still letting AutoThrottle slow down further when the server is struggling.

In summary, setting both is safe: `DOWNLOAD_DELAY` acts as the minimum delay and AutoThrottle adapts above it. Consider your specific scraping requirements and experiment with different configurations to find the optimal balance between scraping speed and server load.
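To make the interaction concrete, here is a simplified restatement of the delay rule described in Scrapy's AutoThrottle documentation. It is not Scrapy's actual code, and the function name `next_delay` and its parameters are assumptions for this sketch. It shows `DOWNLOAD_DELAY` (`min_delay`) acting as a floor:

```python
def next_delay(prev_delay, latency, target_concurrency, min_delay, max_delay, status=200):
    """Simplified restatement of the delay rule described in Scrapy's AutoThrottle docs."""
    # Delay that would yield the target number of parallel requests.
    target = latency / target_concurrency
    # Move halfway from the previous delay toward the target.
    candidate = (prev_delay + target) / 2.0
    # Clamp: DOWNLOAD_DELAY is the floor, AUTOTHROTTLE_MAX_DELAY the ceiling.
    candidate = min(max(candidate, min_delay), max_delay)
    # Latencies of non-200 responses are not allowed to decrease the delay.
    if status != 200 and candidate < prev_delay:
        return prev_delay
    return candidate


delay = 5.0  # plays the role of AUTOTHROTTLE_START_DELAY
for _ in range(3):
    delay = next_delay(delay, latency=1.0, target_concurrency=1.0,
                       min_delay=2.0, max_delay=60.0)
    print(delay)
```

With a steady one-second latency the delay shrinks from the start value toward the target, but stops at the two-second floor set by `min_delay`.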

## Collect and parse links spider

Combining the functionality of both spiders into a single spider might be a good approach, especially if the tasks are closely related or if there's a dependency between them. Here's how you could rewrite your spiders:

- **Collecting Links and Following Pagination**: You can integrate the logic of collecting links from the search page and following pagination into the parse method of a single spider. This spider would start by sending a request to the search page, extract links from the response, and then follow pagination links if they exist.

- **Parsing Collected Links**: Once the spider has collected all the links from the search pages, it can parse them directly within the same parse method or in a separate callback method. This callback would handle the parsing of each individual link and extracting the desired data.

Here's a simplified example of how you might structure your combined spider:

```python
import scrapy

class CombinedSpider(scrapy.Spider):
    name = "combined_spider"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/search"]

    def parse(self, response):
        # Extract links from the search page
        for link in response.css('a::attr(href)').getall():
            # response.follow resolves relative URLs against the current page
            yield response.follow(link, callback=self.parse_item)

        # Follow pagination links if they exist
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_item(self, response):
        # Parse each individual link and extract desired data
        item = {}
        # Add parsing logic here
        yield item
```

In this combined spider:

- The parse method handles both the collection of links from the search page and the following of pagination links.
- The parse_item method is responsible for parsing each individual link and extracting the desired data.

By combining the functionality into a single spider, you simplify your project structure and potentially improve code maintainability. It may also make it easier to manage the execution flow and handle dependencies between tasks. However, it's essential to consider the complexity and scalability of your project when making this decision. If the tasks are fundamentally different or require significantly different logic, separating them into multiple spiders may still be appropriate.

_**And how can I add the commands to send the items to PostgreSQL database?**_

_ChatGPT_:  
To send items to a PostgreSQL database in a Scrapy project, you can use the Pipeline component to define a custom pipeline that processes and stores the items in the database. Here's a general outline of how you can achieve this:

- **Define Custom Pipeline**: Create a custom pipeline class that connects to the PostgreSQL database and inserts items into the appropriate tables. Here's an example of how you can define a pipeline:

```python
import psycopg2

class PostgreSQLPipeline:
    def open_spider(self, spider):
        # Connect to PostgreSQL database
        self.conn = psycopg2.connect(
            dbname='your_database_name',
            user='your_username',
            password='your_password',
            host='your_host',
            port='your_port'
        )
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        # Close database connection
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert item into database
        try:
            self.cur.execute("""
                INSERT INTO your_table_name (column1, column2, ...)
                VALUES (%s, %s, ...)
            """, (
                item['value1'],
                item['value2'],
                ...
            ))
            self.conn.commit()
        except Exception as e:
            self.conn.rollback()
            spider.logger.error(f"Error inserting item into database: {e}")
        return item
```

Replace 'your_database_name', 'your_username', 'your_password', 'your_host', 'your_port', 'your_table_name', and column1, column2, etc. with your actual database credentials and table information.

- **Enable the Pipeline**: Enable the custom pipeline in your Scrapy settings by adding it to the ITEM_PIPELINES setting:

```python
ITEM_PIPELINES = {
    'your_project_name.pipelines.PostgreSQLPipeline': 300,
}
```

Replace 'your_project_name.pipelines.PostgreSQLPipeline' with the import path to your custom pipeline class.

With these steps, Scrapy will process items through your custom pipeline, which will insert them into your PostgreSQL database. Make sure to handle database connections properly, including error handling and closing connections to avoid resource leaks. Additionally, consider implementing data validation and sanitization to ensure data integrity and security.

_**And if I have multiple relations to populate from one item object and even some more operations with the database, like extracting foreign keys for these data, how can I implement all these methods in the pipeline?**_

_ChatGPT:_  
If you have multiple relations to populate from one item object and additional database operations such as extracting foreign keys, you can still implement these functionalities within the custom pipeline class. Here's a general approach you can take:

- **Define Custom Pipeline:** Define a custom pipeline class that handles the processing of items and database operations. You can structure the pipeline to perform various tasks such as inserting data into multiple tables, extracting foreign keys, and executing additional database operations.

```python
import psycopg2

class PostgreSQLPipeline:
    def open_spider(self, spider):
        # Connect to PostgreSQL database
        self.conn = psycopg2.connect(
            dbname='your_database_name',
            user='your_username',
            password='your_password',
            host='your_host',
            port='your_port'
        )
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        # Close database connection
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        try:
            # Insert data into main table
            main_table_query = """
                INSERT INTO main_table (column1, column2, ...)
                VALUES (%s, %s, ...)
                RETURNING id
            """
            self.cur.execute(main_table_query, (
                item['value1'],
                item['value2'],
                ...
            ))
            main_table_id = self.cur.fetchone()[0]

            # Extract foreign keys or additional data
            foreign_key_value = self.extract_foreign_key(item)

            # Insert data into related table using main table ID and foreign key
            related_table_query = """
                INSERT INTO related_table (main_table_id, foreign_key_column, ...)
                VALUES (%s, %s, ...)
            """
            self.cur.execute(related_table_query, (
                main_table_id,
                foreign_key_value,
                ...
            ))

            # Commit the transaction
            self.conn.commit()
        except Exception as e:
            self.conn.rollback()
            spider.logger.error(f"Error processing item: {e}")
        return item

    def extract_foreign_key(self, item):
        # Extract foreign key value from item or perform additional database operations
        foreign_key_value = ...
        return foreign_key_value
```

In this example, main_table represents the main table where the primary data is inserted, and related_table represents a related table with a foreign key relationship to main_table. The process_item method inserts data into both tables, and the extract_foreign_key method extracts the foreign key value from the item or performs additional database operations to obtain it.

- **Enable the Pipeline**: Enable the custom pipeline in your Scrapy settings as described in the previous response.

By implementing these methods within the custom pipeline class, you can handle multiple relations, foreign key extraction, and additional database operations in a structured and organized manner. Adjust the code as necessary to fit the specific requirements and structure of your database schema and data processing logic.
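The following is a runnable sketch of the same pattern. It uses the standard library's `sqlite3` as a stand-in for PostgreSQL so that it works without a database server; the table names, columns, and the helper `get_or_create_category` are hypothetical. With `psycopg2` the placeholders would be `%s` instead of `?`, and the new row's id would come from `RETURNING id` rather than `lastrowid`.

```python
import sqlite3


class RelationalPipeline:
    """Same shape as a Scrapy item pipeline; sqlite3 stands in for PostgreSQL."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("PRAGMA foreign_keys = ON")
        self.conn.executescript(
            """
            CREATE TABLE categories (
                id INTEGER PRIMARY KEY,
                name TEXT UNIQUE NOT NULL
            );
            CREATE TABLE products (
                id INTEGER PRIMARY KEY,
                title TEXT NOT NULL,
                category_id INTEGER NOT NULL REFERENCES categories(id)
            );
            """
        )

    def close_spider(self, spider):
        self.conn.close()

    def get_or_create_category(self, name):
        # Look the foreign key up first; insert only when it is missing.
        row = self.conn.execute(
            "SELECT id FROM categories WHERE name = ?", (name,)
        ).fetchone()
        if row:
            return row[0]
        cur = self.conn.execute("INSERT INTO categories (name) VALUES (?)", (name,))
        return cur.lastrowid

    def process_item(self, item, spider):
        try:
            category_id = self.get_or_create_category(item["category"])
            self.conn.execute(
                "INSERT INTO products (title, category_id) VALUES (?, ?)",
                (item["title"], category_id),
            )
            # Both inserts succeed or fail together.
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
            raise
        return item
```

Keeping the lookup in its own method leaves `process_item` short, and committing once per item means a failure in the second insert also undoes the first.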

_**So, the `process_item` method plays the role of the `main` function in the regular program in the sense it implements all the mechanics, correct?**_

_ChatGPT:_  
Yes, that's correct. In a Scrapy pipeline, the `process_item` method acts as the main function that orchestrates the processing of each item obtained from the spider. It receives each item as input, performs necessary operations such as data manipulation, validation, and database interactions, and then returns the item, possibly modified, for further processing or storage.

The `process_item` method typically contains the core logic for handling each item, including any necessary database operations, data transformations, or validations. It allows you to encapsulate item processing logic in a modular and reusable way within the pipeline.

While the process_item method serves as the central processing unit for each item, the pipeline class as a whole can contain additional methods and logic for tasks such as opening and closing database connections, extracting foreign keys, or performing other preprocessing or postprocessing tasks.

By structuring the pipeline class effectively and implementing the necessary functionality within the process_item method, you can achieve a clean and efficient workflow for processing items in your Scrapy project.

## Random user-agent

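Install the `scrapy-user-agents` package inside your virtual environment:

```sh
pip install scrapy-user-agents
```

Then, in `settings.py`, disable Scrapy's default user-agent middleware and enable the random one:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
```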

## Stop and run crawler

Scrapy has built-in support for pausing and resuming crawls; see https://docs.scrapy.org/en/latest/topics/jobs.html

Here's an example of a simple Scrapy crawler that can be stopped and resumed from the point where it was stopped. It relies on that built-in support, which saves and loads the state of the crawl.

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']
    
    # You can define any additional logic here

    def parse(self, response):
        # Your parsing logic here
        # For demonstration purposes, we'll just print the URL
        print(response.url)

if __name__ == "__main__":
    # You can adjust these settings according to your needs
    settings = get_project_settings()
    settings.set('JOBDIR', 'crawls/mycrawler')

    process = CrawlerProcess(settings)
    process.crawl(MySpider)
    process.start()
```

In this example:

`MySpider` is a basic Scrapy spider that starts from `http://example.com` and prints the URLs it visits in the parse method.

We set the `JOBDIR` setting to `crawls/mycrawler`. This tells Scrapy to store information about the crawl (such as visited URLs and states) in this directory.

The crawler process is started with `CrawlerProcess` using the specified settings and spider.

Now, when you run this script, Scrapy will save its state in the `crawls/mycrawler` directory. If you stop the script gracefully (press Ctrl-C once and wait for Scrapy to finish shutting down; a second Ctrl-C forces an unclean exit) and run it again later, Scrapy will resume the crawl from where it left off, rather than starting from scratch. This allows you to stop and resume the crawl as needed.

## Use browser automation in scrapy

You can use Selenium in conjunction with Scrapy to scrape web pages that require JavaScript rendering or interaction. While Scrapy is a powerful web scraping framework primarily focused on making HTTP requests and parsing HTML/XML responses, it does not provide built-in support for JavaScript rendering or browser automation, which Selenium excels at.

Here's how you can combine Selenium with Scrapy:

- **Use Selenium to Navigate and Interact with Pages**: Write Selenium scripts to navigate to the pages you want to scrape, interact with elements (such as clicking buttons or filling forms), and retrieve the dynamically loaded content.

- **Extract Content with Scrapy**: Once you've loaded the desired content with Selenium, you can extract the relevant data using Scrapy's parsing capabilities. You can either parse the HTML directly with Scrapy's built-in selectors or extract data from Selenium's WebDriver.

- **Combine Both Approaches**: You can either run Selenium scripts within Scrapy's spider callbacks or integrate Selenium with Scrapy's middleware to seamlessly use both frameworks together.

Here's a basic example of how you might integrate Selenium with Scrapy:

```python
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        # Extract data using Scrapy's selectors
        title = response.xpath('//h1/text()').get()

        # Extract data using Selenium
        driver = response.request.meta['driver']
        selenium_title = driver.find_element(By.XPATH, '//h1').text

        yield {
            'scrapy_title': title,
            'selenium_title': selenium_title
        }
```

In this example, we use the `SeleniumRequest` class provided by the `scrapy-selenium` package to make requests using Selenium. We then extract data both using Scrapy's selectors (`response.xpath()`) and Selenium's WebDriver (`driver.find_element()`).

Remember to install `scrapy-selenium` and configure Selenium in your project before using this approach. Additionally, be mindful of performance considerations when using Selenium, as it can be slower than pure Scrapy scraping due to browser automation overhead.

## [Disable cookies](https://docs.scrapy.org/en/latest/topics/broad-crawls.html#disable-cookies)

Disable cookies unless you really need them. Cookies are often not needed when doing broad crawls (search engine crawlers ignore them), and disabling them improves performance by saving some CPU cycles and reducing the memory footprint of your Scrapy crawler.

To disable cookies, set the following in `settings.py`:
```python
COOKIES_ENABLED = False
```

## [Reduce download timeout](https://docs.scrapy.org/en/latest/topics/broad-crawls.html#reduce-download-timeout)

Unless you are crawling from a very slow connection (which shouldn’t be the case for broad crawls) reduce the download timeout so that stuck requests are discarded quickly and free up capacity to process the next ones.

To reduce the download timeout, set the following in `settings.py`:
```python
DOWNLOAD_TIMEOUT = 15
```

## [Disable redirects](https://docs.scrapy.org/en/latest/topics/broad-crawls.html#disable-redirects)

Consider disabling redirects, unless you are interested in following them. When doing broad crawls it’s common to save redirects and resolve them when revisiting the site at a later crawl. This also helps to keep the number of requests constant per crawl batch; otherwise redirect loops may cause the crawler to dedicate too many resources to any specific domain.

To disable redirects, set the following in `settings.py`:
```python
REDIRECT_ENABLED = False
```