From [this reference](https://scrapfly.io/blog/how-to-scrape-amazon/)

# Setup

In this tutorial we'll use Python and two major community packages:

- httpx - an HTTP client library that lets us communicate with amazon.com's servers.
- parsel - an HTML parsing library that helps us parse the scraped HTML. We'll use a mixture of CSS and XPath selectors, both of which parsel supports.

Optionally, we'll also use:
- loguru - a pretty logging library that helps us keep track of what's going on.

These packages can be installed with pip:

`pip install httpx parsel loguru`

# Scraping Reviews

To scrape product reviews, let's first look at where to find them. At the bottom of a product page there is a link that says "See All Reviews". Clicking it takes us to a URL that follows this format:
![review page](https://scrapfly.io/blog/content/images/how-to-scrape-amazon_url-reviews.svg)

We can see that, just like for product information, all we need is the ASIN identifier to find a product's review page. Let's add this logic to our scraper:

In [12]:
import asyncio
import math
from urllib.parse import urljoin

import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict

In [13]:
class ReviewData(TypedDict):
    """storage type hint for a single Amazon review"""
    title: str
    text: str
    location_and_date: str
    verified: bool
    rating: float


In [14]:
def parse_reviews(response) -> list[ReviewData]:
    """parse all reviews from a single review page"""
    sel = Selector(text=response.text)
    review_boxes = sel.css("#cm_cr-review_list div.review")
    parsed = []
    for box in review_boxes:
        parsed.append({
            "text": "".join(box.css("span[data-hook=review-body] ::text").getall()).strip(),
            "title": box.css("*[data-hook=review-title]>span::text").get(),
            "location_and_date": box.css("span[data-hook=review-date] ::text").get(),
            "verified": bool(box.css("span[data-hook=avp-badge] ::text").get()),
            # convert the matched text to float so it agrees with ReviewData.rating
            "rating": float(box.css("*[data-hook*=review-star-rating] ::text").re(r"(\d+\.*\d*) out")[0]),
        })
    return parsed
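
The rating field above relies on a regular expression to pull the number out of the star-rating text. Here is a minimal standalone sketch of that step, using only the standard library. The sample string `"4.0 out of 5 stars"` and the helper name `parse_rating` are assumptions for illustration and are not part of the original scraper:

```python
import re
from typing import Optional

# Same idea as the pattern passed to .re() in parse_reviews:
# capture the number that precedes " out"
RATING_PATTERN = re.compile(r"(\d+\.?\d*) out")


def parse_rating(text: str) -> Optional[float]:
    """Return the numeric rating found in text, or None when nothing matches."""
    match = RATING_PATTERN.search(text)
    return float(match.group(1)) if match else None


print(parse_rating("4.0 out of 5 stars"))  # assumed sample of the star-rating text
print(parse_rating("no rating here"))
```

Returning `None` rather than indexing `[0]` avoids an `IndexError` when a review box has no rating element; whether that case occurs on real pages is not something this sketch verifies.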

In [15]:
async def scrape_reviews(asin: str, session: httpx.AsyncClient) -> list[ReviewData]:
    """scrape all reviews of a given ASIN of an amazon product"""
    url = f"https://www.amazon.com/product-reviews/{asin}/"
    log.info(f"scraping review page: {url}")
    # fetch the first page
    first_page = await session.get(url)
    sel = Selector(text=first_page.text)
    # find the total number of reviews and derive the page count (10 reviews per page);
    # [1] picks the second number in the text, which this scraper treats as the review count
    total_reviews = sel.css("div[data-hook=cr-filter-info-review-rating-count] ::text").re(r"(\d+,*\d*)")[1]
    total_reviews = int(total_reviews.replace(",", ""))
    total_pages = int(math.ceil(total_reviews / 10.0))

    log.info(f"found total {total_reviews} reviews across {total_pages} pages -> scraping")
    # urljoin() returns the base URL when the href is None, so check the href before joining
    next_page_href = sel.css(".a-pagination .a-last>a::attr(href)").get()
    if next_page_href:
        _next_page = urljoin(url, next_page_href)
        next_page_urls = [_next_page.replace("pageNumber=2", f"pageNumber={i}") for i in range(2, total_pages + 1)]
        assert len(set(next_page_urls)) == len(next_page_urls)
        other_pages = await asyncio.gather(*[session.get(page_url) for page_url in next_page_urls])
    else:
        other_pages = []
    reviews = []
    for response in [first_page, *other_pages]:
        reviews.extend(parse_reviews(response))
    log.info(f"scraped total {len(reviews)} reviews")
    return reviews

In the above scraper we are putting together everything we've learned in this tutorial:

- To scrape pagination we use the same technique as in scraping search: scrape the first page, find the total number of pages, and scrape the rest concurrently.
- To parse reviews we also use the same technique as in parsing search: iterate through each box containing a review and parse the data using CSS selectors.
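
The pagination arithmetic can be tried in isolation, without any network calls. The sketch below assumes 10 reviews per page (as in `scrape_reviews`) and a `pageNumber` query parameter on the review URL. The helper name `build_review_page_urls` is hypothetical:

```python
import math
from typing import List

REVIEWS_PER_PAGE = 10  # assumption taken from the division by 10.0 in scrape_reviews


def build_review_page_urls(asin: str, total_reviews: int) -> List[str]:
    """Build URLs for review pages 2..N; page 1 is the URL that was already fetched."""
    base_url = f"https://www.amazon.com/product-reviews/{asin}/"
    total_pages = math.ceil(total_reviews / REVIEWS_PER_PAGE)
    return [f"{base_url}?pageNumber={page}" for page in range(2, total_pages + 1)]


urls = build_review_page_urls("B08QVPBFCS", total_reviews=35)
print(len(urls))  # 35 reviews -> 4 pages -> 3 extra URLs
print(urls[0])
```

The real scraper takes the next-page link from the page itself, and that link may carry extra query parameters; this sketch ignores them.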

Let's run this scraper and see what output it generates:

In [9]:
import json
import asyncio

# We need to use browser-like headers for our requests to avoid being blocked
# here we set headers of Chrome browser on Windows:
BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}

In [10]:
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_reviews("B08QVPBFCS", session=session)

    print(json.dumps(data, indent=2))
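
`httpx.Limits(max_connections=5)` caps how many connections the client keeps open at once, so the `asyncio.gather` call in `scrape_reviews` does not send every page request simultaneously. The same idea can be illustrated with only the standard library. This is an analogy built on `asyncio.Semaphore`, not how httpx implements its limits, and `fake_fetch` is a stand-in for a real request:

```python
import asyncio
from typing import List, Tuple

MAX_CONCURRENT = 5  # mirrors max_connections=5 in run()


async def fetch_all(page_count: int) -> Tuple[List[int], int]:
    """Run page_count fake requests and report the peak number in flight."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    in_flight = 0
    peak = 0

    async def fake_fetch(page: int) -> int:
        nonlocal in_flight, peak
        async with semaphore:  # at most MAX_CONCURRENT tasks pass this point at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network latency
            in_flight -= 1
        return page

    # gather() returns results in the order the awaitables were passed in
    pages = await asyncio.gather(*[fake_fetch(page) for page in range(1, page_count + 1)])
    return list(pages), peak


# Run as a plain script; in a notebook cell use `pages, peak = await fetch_all(12)` instead
pages, peak = asyncio.run(fetch_all(12))
print(pages, peak)
```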

In [11]:
asyncio.run(run())

RuntimeError: asyncio.run() cannot be called from a running event loop

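_The last code block raises this error only in an IPython notebook: the kernel already runs an event loop, and `asyncio.run()` cannot start a second one. In a notebook cell, call `await run()` instead; run as a regular Python script, the code works as written._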
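
Here is a small sketch of why the error appears and how one might guard against it. It checks for a running event loop with `asyncio.get_running_loop()`, which raises `RuntimeError` when no loop is running. The helper name `loop_is_running` and the `main` coroutine are hypothetical:

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running event loop (e.g. a notebook cell)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def main() -> bool:
    # Inside a coroutine the loop is running, so calling asyncio.run() here would fail
    return loop_is_running()


if loop_is_running():
    # Notebook case: a loop already exists, so awaiting the coroutine is the way to go
    print("event loop already running: use `await main()` in a notebook cell")
else:
    # Script case: no loop yet, so asyncio.run() can create one
    print("no running loop, inside coroutine:", asyncio.run(main()))
```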