# Main Reference 
[This Tutorial](https://scrapfly.io/blog/how-to-scrape-amazon/)

We debugged their code and added a few things of our own.

# Request vs response
First, we send a GET request to the Amazon search page to retrieve the results for a search query. For this we need to install the requests library, and we need the URL of the search page.

`pip install requests`

In [1]:
# We replace the spaces in the search query (Keywords) by '+' 

search_query = 'Refrigerator'.replace(' ', '+')
# search_query = 't-shirt women'.replace(' ', '+') --> 't-shirt+women'

In [9]:
# The search URL always has this standard format
#  (sometimes with optional extra parameters)
search_url = f"https://www.amazon.com/s?k={search_query}&page=1"
print(search_url)

https://www.amazon.com/s?k=Refrigerator&page=1
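
Replacing spaces by hand works for simple keywords, but it does not escape characters such as `&`. As a minimal sketch, the standard library's `urlencode` can build the same query string; `build_search_url` is a hypothetical helper name, not part of the original notebook.

```python
from urllib.parse import urlencode


def build_search_url(keywords: str, page: int = 1) -> str:
    """Build a search URL; urlencode turns spaces into '+' and escapes special characters."""
    # build_search_url is a hypothetical helper introduced for this example
    params = urlencode({"k": keywords, "page": page})
    return f"https://www.amazon.com/s?{params}"


print(build_search_url("t-shirt women"))
print(build_search_url("salt & pepper", page=2))
```

For plain keywords the result is the same as with `.replace(' ', '+')`; the difference only shows up when the query contains reserved characters.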


In [None]:
import requests
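# We use browser-like headers for our requests to avoid being blocked;
# accept-encoding tells the server which compressed encodings we can decode
# (note: 'br' responses are only decoded if a Brotli package is installed).
# Here we set the headers of the Chrome browser on Windows.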

HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}

# the GET request returns a Response object which has .content (raw bytes) and .text (decoded string) attributes

response = requests.get(search_url, headers=HEADERS) # retrieve the results from the first page
# check the type of the object
print(type(response))
# check the content and text attributes
# print(response.content)
print(response.text)
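
The difference between `response.content` and `response.text` is the difference between raw bytes and a decoded string. The sketch below imitates that with plain Python values instead of a real response; `raw_body` and `decoded_body` are made-up names, and UTF-8 is assumed as the encoding.

```python
# Stand-in for response.content: the raw bytes a server would send
raw_body = "<title>Kühlschrank</title>".encode("utf-8")

# Stand-in for response.text: the same bytes decoded into a str
decoded_body = raw_body.decode("utf-8")

print(type(raw_body).__name__, len(raw_body))
print(type(decoded_body).__name__, len(decoded_body))
```

The byte string is longer than the text because non-ASCII characters take more than one byte in UTF-8. For HTML parsing we want the decoded text, which is why `parse_search` uses `resp.text`.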

# Setup

In this tutorial we'll be using Python and two major community packages:

- httpx - HTTP client library which will let us communicate with amazon.com's servers
- parsel - HTML parsing library which will help us to parse our web scraped HTML files. In this tutorial we'll be using a mixture of css and xpath selectors to parse HTML - both of which are supported by parsel.

Optionally we'll also use: 
- loguru - a pretty logging library that'll help us keep track of what's going on.

These packages can be easily installed via pip command:

`pip install httpx parsel loguru`

## More details:

We need to parse the search results in the response object. For this we will use the parsel library, which is [documented here](https://parsel.readthedocs.io/en/v1.0.1/parsel.html).

We will also use a logger from the loguru library to print log information about the scraper's progress to the screen.

This information appears as colored text in the shell.

The third library we need is [the httpx library](https://www.python-httpx.org/).
It is an HTTP client that offers an alternative to the requests library, with sync and async APIs and support for both HTTP/1.1 and HTTP/2.

When requesting many result pages from a website, the asynchronous approach is better than the usual blocking requests: it retrieves the results [concurrently](https://stackoverflow.com/questions/5017392/what-does-concurrent-requests-really-mean) and awaits them all, so the fast requests wait for the slow ones and the HTTP client gets the chance to collect all the data.
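
To make the idea of concurrent requests concrete, here is a small sketch that uses only the standard library. `fake_request` is a hypothetical stand-in that sleeps instead of touching the network, so the numbers only illustrate the principle.

```python
import asyncio
import time


async def fake_request(name: str, delay: float) -> str:
    """Stand-in for an HTTP request: asyncio.sleep simulates waiting on the network."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main() -> list[str]:
    # gather starts all three coroutines and waits until every one has finished
    return await asyncio.gather(
        fake_request("page 1", 0.2),
        fake_request("page 2", 0.2),
        fake_request("page 3", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(main())  # in a notebook cell use: results = await main()
elapsed = time.perf_counter() - start

print(results)
print(f"took about {elapsed:.1f}s, three sequential waits would take about 0.6s")
```

The three waits overlap, so the total time is close to the longest single wait rather than the sum of all of them.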

Install the above mentioned libraries if you did not install the whole requirements yet.

The parse_search function below parses the items of a single page of search results (one response), but **it skips the ads (sponsored results)**.

In [15]:
# The Selector class parses the response via CSS and XPath selectors (CSS selectors are the ones normally used to style an HTML page)
from parsel import Selector
# The logger prints colored text in the shell, which gives information about the results and helps debug the code
from loguru import logger as log
# urljoin resolves a relative URL (such as an href) against a base URL
from urllib.parse import urljoin

# This function parses the response page using the Selector,
# as an alternative to Beautiful Soup.
# It takes any response page as an argument and returns a list of dictionaries
# with the ASIN, title and URL of each product, which we will use later to get the reviews
def parse_search(resp):
    """Parse search result page for product previews"""
    previews = []
    sel = Selector(text=resp.text)

    # find the box of each product preview

    # Open the browser's developer tools and inspect the results: they are
    # inside div boxes with the class s-result-item
    product_boxes = sel.css("div.s-result-item[data-component-type=s-search-result]")

    for box in product_boxes:
        asin = box.xpath('@data-asin').get()
        # get the url of every search item in the search result;
        # sponsored items and ads are included too, and their links contain "/slredirect/" or "sspa"
        href = urljoin(str(resp.url), box.css("h2>a::attr(href)").get())
        box_url = href.split("?")[0]

        # the standard url of any product is:
        url = f"https://www.amazon.com/dp/{asin}"

        # skip ads etc. by checking the scraped link (the rebuilt dp url never contains these markers)
        if len(href.split("/")) != 6 and "/slredirect/" not in box_url and "sspa" not in box_url:
            previews.append(
                {
                    "asin": asin,
                    "title": box.css("h2>a>span::text").get(),
                    "url": url,
                }
            )
    log.debug(f"found {len(previews)} product listings in {resp.url}") # write the summary to the debug log
    return previews

In [None]:
# main scope: call the function to run it
response = requests.get(search_url, headers=HEADERS) # search_url is declared in the first part
parse_search(response)
# parse_search works for any kind of response, whether it comes from requests.get or from an httpx client
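
Inside `parse_search`, the `href` of each result is relative, so it is resolved against the page URL and then cut at the `?`. The sketch below shows that step in isolation with the standard library; the href and the ASIN `B000000000` are placeholders made up for illustration.

```python
from urllib.parse import urljoin

# Both values are made up for illustration; B000000000 is a placeholder ASIN
page_url = "https://www.amazon.com/s?k=Refrigerator&page=1"
relative_href = "/Some-Fridge/dp/B000000000/ref=sr_1_1?keywords=Refrigerator"

absolute_url = urljoin(page_url, relative_href)  # resolve the href against the page URL
clean_url = absolute_url.split("?")[0]           # drop the query string

print(absolute_url)
print(clean_url)
```

Because the href starts with `/`, it replaces the whole path and query of the page URL and keeps only the scheme and host.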

# Httpx Client

Let us try parse_search with a response from an httpx client.

In [None]:
import httpx
response= httpx.get(search_url, headers=HEADERS) # you will get an error about the url 
parse_search(response)

# Httpx AsyncClient

From [Async Support](https://www.python-httpx.org/async/)

HTTPX offers a standard synchronous API by default, but also gives you the option of an async client if you need it.

Async is a concurrency model that is far more efficient than multi-threading, and can provide significant performance benefits and enable the use of long-lived network connections such as WebSockets.

If you're working with an async web framework then you'll also want to use an async client for sending outgoing HTTP requests.
# Making Async requests

To make asynchronous requests, you'll need an AsyncClient.

See the above reference to understand the async requests better.

Here is an example of the syntax.

In [87]:
async with httpx.AsyncClient() as client:
   r = await client.get('https://www.example.com/')
print(r)   

<Response [200 OK]>


In the following we will call the AsyncClient 'session'.

Let's parse the response from this session using our parse_search function.

In [None]:
session= httpx.AsyncClient()
first_page = await session.get(search_url, headers=HEADERS)
parse_search(first_page)

# Get the results for the other pages
So far we have tried parse_search with the search_url of the first page, and it worked for the AsyncClient session.

Now we need to get the results for the other pages.
- We need to find out how many pages there are in total
- We need to loop over those pages

The reference had a bug in getting the total number of result pages, which has been fixed below.
The function **search** does this: it takes a search query as an argument and appends the results to a list.
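
The page count is read from the pagination texts with `int(last_page.getall()[0])`, which raises an `IndexError` when the list is empty. As a hedged sketch, a small helper could fall back to a single page in that case. `parse_total_pages` is a hypothetical name, and the assumption that a missing number means one page has not been checked against the live site.

```python
def parse_total_pages(pagination_texts: list[str]) -> int:
    """Return the largest page number found in the texts, or 1 when none is present."""
    # parse_total_pages is a hypothetical helper; falling back to 1 is an assumption
    stripped = (text.strip() for text in pagination_texts)
    numbers = [int(text) for text in stripped if text.isdecimal()]
    return max(numbers, default=1)


print(parse_total_pages(["20"]))
print(parse_total_pages(["Previous", "..."]))
print(parse_total_pages([]))
```

In the scraper it would be called as `parse_total_pages(last_page.getall())`.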

The asyncio library is used to run the requests concurrently, so that the fast requests wait until the slower ones have finished.

asyncio is part of the Python standard library, so there is nothing to install.

In [96]:
import asyncio
async def search(query:str, session:httpx.AsyncClient):
    
    log.info(f"{query}: scraping first page")

    # first, let's scrape the first query page to find out how many pages we have in total:
    search_url = f"https://www.amazon.com/s?k={query}&page=1"
    first_page = await session.get(search_url, headers=HEADERS)
    sel = Selector(text=first_page.text)
    # print(sel.getall())

    # The following part of the reference tutorial was wrong and returned fewer pages than it should:
    # _page_numbers = sel.xpath('//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()').getall()
    # print(f"page numbers{_page_numbers}")

    # On the first page the last page number has no hyperlink (no <a> element); it sits in a
    # disabled <span>, and the pages just before it are not shown in the pagination list
    last_page = sel.xpath('//span[has-class("s-pagination-disabled")][not(has-class("s-pagination-previous"))]/text()')
    # print(last_page.getall())
    total_pages = int(last_page.getall()[0]) # the wrong solution was max(int(number) for number in _page_numbers)
    # print(f"total_pages are {total_pages}")
    log.info(f"{query}: found {total_pages} pages, scraping them one after another")

    # now we can scrape the remaining pages
    # (the concurrent asyncio.gather version is commented out to avoid a runtime error;
    # the pages are requested one after another instead)
    # other_pages = await asyncio.gather(
    #     *[session.get(f"https://www.amazon.com/s?k={query}&page={page}") for page in range(2, total_pages + 1)]
    # )
    other_pages = []
    for page_number in range(2, total_pages + 1):
        page = await session.get(f"https://www.amazon.com/s?k={query}&page={page_number}", headers=HEADERS)
        other_pages.append(page)
    # print(other_pages)
    # print(len(other_pages))
    # parse all search pages for product preview data:
    previews = []
    for response in [first_page, *other_pages]:
        previews.extend(parse_search(response))

    log.info(f"{query}: found total of {len(previews)} product previews")
    return previews
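
The function above requests the remaining pages one after another. If you want to go back to `asyncio.gather`, one common way to keep the number of simultaneous requests small is an `asyncio.Semaphore`. This is only a sketch of that idea and not the approach used in this notebook; `fetch_page` and `fetch_other_pages` are hypothetical names, and `fetch_page` sleeps instead of calling `session.get`.

```python
import asyncio


async def fetch_page(page_number: int, limiter: asyncio.Semaphore) -> str:
    """Hypothetical stand-in for session.get(...); sleeps instead of using the network."""
    async with limiter:  # only a limited number of coroutines get past this point at once
        await asyncio.sleep(0.05)
        return f"html of page {page_number}"


async def fetch_other_pages(total_pages: int, max_concurrent: int = 3) -> list[str]:
    limiter = asyncio.Semaphore(max_concurrent)
    tasks = [fetch_page(number, limiter) for number in range(2, total_pages + 1)]
    # gather returns the results in the order the coroutines were passed in
    return await asyncio.gather(*tasks)


# in a notebook cell use: pages = await fetch_other_pages(total_pages=7)
pages = asyncio.run(fetch_other_pages(total_pages=7))
print(len(pages), pages[0], pages[-1])
```

With a real client, the body of `fetch_page` would await the request while holding the semaphore, so at most `max_concurrent` requests are in flight at any time.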

In [None]:

query = 'Refrigerator'.replace(' ', '+')
# search(query)
# asyncio.run(search(query))
parsed = await search(query, session)
# if you want to print the results, do it in another cell to avoid a connect timeout

In [None]:
print(parsed)

The above code can be run in Jupyter Notebook cells or in a console that supports top-level await (such as IPython), but not inside a Python script.

Python will give you an error message which says that you cannot use await outside an async function.

Therefore the next step wraps the search in a single coroutine, which can then be run from a Python script with asyncio.run.


In [101]:
async def get_product_search_list(query):
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.5), headers=HEADERS) as session:
        parsed_data = await search(query, session=session)
    return parsed_data

In [None]:
parsed2 = await get_product_search_list(query)

In [None]:
print(parsed2)
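
The cells above still use a top-level `await`, which only works in a notebook. In a plain Python script, the usual entry point is `asyncio.run`. The sketch below shows the pattern with a hypothetical stand-in, `fake_product_search_list`, which returns placeholder data so that the example runs without network access.

```python
import asyncio


async def fake_product_search_list(query: str) -> list[dict]:
    """Hypothetical stand-in for get_product_search_list; returns placeholder data."""
    await asyncio.sleep(0)  # yield to the event loop once before returning
    return [
        {
            "asin": "B000000000",  # placeholder ASIN
            "title": f"{query} (placeholder)",
            "url": "https://www.amazon.com/dp/B000000000",
        }
    ]


def main() -> list[dict]:
    # asyncio.run creates an event loop, runs the coroutine to completion and closes the loop
    return asyncio.run(fake_product_search_list("Refrigerator"))


if __name__ == "__main__":
    print(main())
```

In a real script you would call `asyncio.run(get_product_search_list(query))` instead. Inside a notebook `asyncio.run` raises a `RuntimeError` because an event loop is already running, so keep using `await` there.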

# Write the data into a JSON file and a pandas DataFrame (Excel file)

A prompt will ask you to provide i (the file number, used to avoid overwriting other files when running the code again).

In [103]:
import json
import pandas as pd

In [86]:
# write the data to a json file and an Excel file
i = int(input("Enter the file number for the output: "))

# dump to a json file
with open(f'query_results_{i}.json', 'w') as file:
    json.dump(parsed, file, indent=2)
# print(json.dumps(parsed, indent=2))  # an alternative to the line above that prints the json in the shell

# write to excel (pandas needs an Excel engine such as openpyxl for .xlsx files)
df = pd.DataFrame(parsed)
df.to_excel(f"query_results_{i}.xlsx", index=False)
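
If you would rather not type a file number every time, the number can also be chosen automatically. The following is a minimal sketch under that assumption: `next_free_path` is a hypothetical helper, the rows are placeholders shaped like the output of `parse_search`, and a temporary folder is used so the example leaves no files behind.

```python
import json
import tempfile
from pathlib import Path


def next_free_path(folder: Path, stem: str, suffix: str) -> Path:
    """Return folder/stem_N<suffix> for the smallest N >= 1 that does not exist yet."""
    number = 1
    while (folder / f"{stem}_{number}{suffix}").exists():
        number += 1
    return folder / f"{stem}_{number}{suffix}"


# Placeholder rows shaped like the output of parse_search
rows = [{"asin": "B000000000", "title": "placeholder", "url": "https://www.amazon.com/dp/B000000000"}]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    first = next_free_path(folder, "query_results", ".json")
    first.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    second = next_free_path(folder, "query_results", ".json")  # the first name is taken now
    loaded = json.loads(first.read_text(encoding="utf-8"))
    print(first.name, second.name, loaded == rows)
```

Reading the file back with `json.loads` is a quick check that the written data matches the scraped list.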
    