# Main Reference
This notebook is based on [this reference](https://scrapfly.io/blog/how-to-scrape-amazon/).
We debugged the original code and added a few things of our own.

# Setup

In this tutorial we'll be using Python and two major community packages:

- httpx - HTTP client library which will let us communicate with amazon.com's servers
- parsel - HTML parsing library which will help us parse the HTML we scrape. In this tutorial we'll use a mixture of CSS and XPath selectors, both of which parsel supports.

Optionally we'll also use: 
- loguru - a pretty logging library that'll help us keep track of what's going on.

These packages can be installed with pip:

`pip install httpx parsel loguru`

See [our scrape search notebook](https://github.com/morschulik/scrape_amazon_reviews/blob/main/Jupyter_notebook/scrap_search.ipynb).

We advise you to start there: it explains the logic used below and covers these libraries in more detail.

# Scraping Reviews

To scrape product reviews, let's first look at where we can find them. At the bottom of a product page there is a link that says "See All Reviews". Clicking it takes us to a URL that follows this format:
![review page](https://scrapfly.io/blog/content/images/how-to-scrape-amazon_url-reviews.svg)

or, in short:
`url = f"https://www.amazon.com/product-reviews/{asin}/"`
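
As a small sketch, the short form can be wrapped in a helper that also handles later pages. The name `build_review_url` is hypothetical (it is not part of the scraper below), and `pageNumber` is assumed to be the query parameter that Amazon's pagination links use:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.com/product-reviews/"


def build_review_url(asin: str, page: int = 1) -> str:
    """Build the review-page URL for an ASIN (hypothetical helper)."""
    url = f"{BASE_URL}{asin}/"
    if page > 1:
        # assumption: later pages are selected with the pageNumber query parameter
        url += "?" + urlencode({"pageNumber": page})
    return url


print(build_review_url("B08QVPBFCS"))
print(build_review_url("B08QVPBFCS", page=3))
```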

Just as for product information, the ASIN identifier is all we need to find a product's review page. Let's add this logic to our scraper:

In [6]:
import math
from typing_extensions import TypedDict
import httpx
from loguru import logger as log
from parsel import Selector
from urllib.parse import urljoin

You can pass any response to a review-page request to the following function. It parses the page into a list of dictionaries with the keys body, title, location, date, verified and star_rating.

In [7]:
def parse_reviews(response):
    """parse review from single review page"""
    sel = Selector(text=response.text)
    review_boxes = sel.css("#cm_cr-review_list div.review") # inspect them in the browser
    parsed = []
    for box in review_boxes:
        # the date line combines location and date: "Reviewed in <location> on <date>"
        date_text = box.css("span[data-hook=review-date] ::text").get() or ""
        location, _, date = date_text.partition(" on ")
        parsed.append({
            "body": "".join(box.css("span[data-hook=review-body] ::text").getall()).strip(),
            "title": box.css("*[data-hook=review-title]>span::text").get(),
            "location": location.replace("Reviewed in ", "").strip(),  # drop the "Reviewed in" prefix
            "date": date.strip(),
            "verified": bool(box.css("span[data-hook=avp-badge] ::text").get()),
            # re_first returns None instead of raising when a review has no rating text
            "star_rating": box.css("*[data-hook*=review-star-rating] ::text").re_first(r"(\d+\.*\d*) out"),
        })
    return parsed
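
The two trickiest fields, the combined location/date line and the star rating, can be tried offline without any request. The sketch below uses only the standard library. The helper names `split_review_date` and `parse_star_rating` are hypothetical, and the sample strings are assumed examples of what the selectors return, not captured output:

```python
import re
from typing import Optional, Tuple


def split_review_date(text: str) -> Tuple[str, str]:
    """Split 'Reviewed in <location> on <date>' into (location, date)."""
    location, _, date = text.partition(" on ")
    return location.replace("Reviewed in ", "").strip(), date.strip()


def parse_star_rating(text: str) -> Optional[float]:
    """Extract the number from text such as '4.0 out of 5 stars'."""
    match = re.search(r"(\d+\.*\d*) out", text)
    return float(match.group(1)) if match else None


# assumed sample strings, for illustration only
print(split_review_date("Reviewed in the United States on March 3, 2021"))
print(parse_star_rating("4.0 out of 5 stars"))
```

Returning `None` for a missing rating keeps one malformed review from aborting the whole page.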

Let us run it:

In [8]:
asin = "B08QVPBFCS"
response = httpx.get(f"https://www.amazon.com/product-reviews/{asin}/")  # returns the first review page
parsed = parse_reviews(response)

In [None]:
print(parsed)

The following function:
- takes the ASIN (the product ID that appears in the URL) as an argument.
- takes an httpx.AsyncClient session as a second argument, so that the review pages can be retrieved concurrently.
- requests the first page and passes the response to the parse_reviews function.
- goes through all the remaining pages.
- finally collects the data in a list and returns it.

To parse reviews we use the same technique as for parsing search results: iterate through each box containing a review and extract the data using CSS selectors.

In [10]:
import asyncio
async def scrape_reviews(asin, session: httpx.AsyncClient):# -> ReviewData:
    """scrape all reviews of a given ASIN of an amazon product"""
    url = f"https://www.amazon.com/product-reviews/{asin}/"
    log.info(f"scraping review page: {url}")
    # find first page
    first_page = await session.get(url)
    sel = Selector(text=first_page.text)
    # find the total number of reviews, then derive the page count (10 reviews per page)
    total_reviews = sel.css("div[data-hook=cr-filter-info-review-rating-count] ::text").re(r"(\d+,*\d*)")[1]
    total_reviews = int(total_reviews.replace(",", ""))
    total_pages = math.ceil(total_reviews / 10)

    log.info(f"found total {total_reviews} reviews across {total_pages} pages -> scraping")
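    # check the raw href first: urljoin(url, None) returns url itself, so the joined value is always truthy
    next_href = sel.css(".a-pagination .a-last>a::attr(href)").get()
    if next_href:
        # the "next" link points to page 2; reuse it as a template for all remaining pages
        _next_page = urljoin(url, next_href)
        next_page_urls = [_next_page.replace("pageNumber=2", f"pageNumber={i}") for i in range(2, total_pages + 1)]
        assert len(set(next_page_urls)) == len(next_page_urls)
        other_pages = await asyncio.gather(*[session.get(page_url) for page_url in next_page_urls])
    else:
        other_pages = []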
    reviews = []
    for response in [first_page, *other_pages]:
        reviews.extend(parse_reviews(response))
    log.info(f"scraped total {len(reviews)} reviews")
    return reviews
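
The pagination arithmetic in `scrape_reviews` can be checked without touching the network. This is a minimal sketch that assumes 10 reviews per page, as the function above does. The helper names `count_pages` and `page_urls` are hypothetical, and the `next_href` value is a made-up example of a "next page" link:

```python
import math
from urllib.parse import urljoin

REVIEWS_PER_PAGE = 10  # same page size the scraper assumes


def count_pages(total_reviews_text: str) -> int:
    """Turn a count such as '1,234' into the number of review pages."""
    total = int(total_reviews_text.replace(",", ""))
    return math.ceil(total / REVIEWS_PER_PAGE)


def page_urls(base_url: str, next_href: str, total_pages: int) -> list:
    """Use the page-2 link as a template for pages 2..total_pages."""
    template = urljoin(base_url, next_href)
    return [template.replace("pageNumber=2", f"pageNumber={i}") for i in range(2, total_pages + 1)]


base = "https://www.amazon.com/product-reviews/B08QVPBFCS/"
next_href = "/product-reviews/B08QVPBFCS/?pageNumber=2"  # made-up example href
print(count_pages("1,234"))
print(page_urls(base, next_href, count_pages("25")))
```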

Let's run this scraper and see what output it generates:

In [None]:
product_asin = "B08QVPBFCS"
session = httpx.AsyncClient()
parsed_reviews = await scrape_reviews(product_asin, session)
# to print the results, use a separate cell to avoid a connection timeout

In [None]:
print(parsed_reviews)

Let us set connection limits and a timeout on the session, and run the function in a way that also works in Python scripts (in a script you cannot use `await` outside an async function).

In [13]:
import json
import asyncio

# We need to use browser-like headers for our requests to avoid being blocked
# here we set headers of Chrome browser on Windows:
BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}

In [14]:
async def get_reviews(some_asin):
    """scrape all reviews of one ASIN with a connection-limited session and browser-like headers"""
    limits = httpx.Limits(max_connections=5)  # at most 5 concurrent connections
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        review_data = await scrape_reviews(some_asin, session=session)
    # print(json.dumps(review_data, indent=2))
    return review_data

In [None]:
# in a Python script, use instead: review_data = asyncio.run(get_reviews(some_asin))
some_asin = "B08QVPBFCS"
review_data = await get_reviews(some_asin)

In [None]:
print(review_data)

_Calling **`asyncio.run(get_reviews(some_asin))`** raises an error only in the IPython notebook, because the kernel is already running an event loop. Use `await` in the notebook and `asyncio.run` in a Python script._
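
For a plain Python script, the entry point could look like the sketch below. To stay runnable without network access it replaces the real request with a stand-in coroutine: `fake_fetch` and `main` are hypothetical names, and the `asyncio.Semaphore` only mimics the concurrency cap that `httpx.Limits(max_connections=5)` applies in the real session.

```python
import asyncio


async def fake_fetch(page: int, limiter: asyncio.Semaphore) -> str:
    """Stand-in for session.get(); sleeps instead of touching the network."""
    async with limiter:
        await asyncio.sleep(0.01)
        return f"page {page}"


async def main() -> list:
    limiter = asyncio.Semaphore(5)  # at most 5 coroutines inside the block at once
    # gather returns results in the order the coroutines were passed in
    return await asyncio.gather(*[fake_fetch(i, limiter) for i in range(1, 4)])


if __name__ == "__main__":
    # asyncio.run() creates an event loop, runs main() to completion, then closes the loop
    print(asyncio.run(main()))
```

In the real scraper, the body of `main` would be the call to `get_reviews`.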

# Write the data to a JSON file and a pandas DataFrame (Excel file)

A prompt will ask you for i, a file number that keeps this run from overwriting the files of earlier runs.

In [16]:
import json
import pandas as pd

In [20]:
i = int(input("Enter the file number for the output: "))
# write the data to a JSON file
with open(f"product_reviews_{i}.json", "w") as file:
    json.dump(review_data, file, indent=2)
# write the data to an Excel file (pandas needs the openpyxl package to write .xlsx)
df = pd.DataFrame(review_data, columns=['title', 'star_rating', 'verified', 'location', 'date', 'body'])
df.to_excel(f"product_reviews_{i}.xlsx", index=False)
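
If you would rather not type a file number on every run, the number can be picked automatically. The sketch below is one possible variant, not the notebook's method: `next_free_number` and `save_reviews` are hypothetical helpers, CSV (standard library) stands in for the Excel output, and `SAMPLE` is a made-up row in the shape that `parse_reviews` returns.

```python
import csv
import json
import tempfile
from pathlib import Path

FIELDS = ["title", "star_rating", "verified", "location", "date", "body"]

# made-up sample row, for illustration only
SAMPLE = [{
    "title": "Great", "star_rating": "5.0", "verified": True,
    "location": "the United States", "date": "March 3, 2021", "body": "Works well",
}]


def next_free_number(directory: Path, prefix: str = "product_reviews") -> int:
    """Return the smallest n for which <prefix>_<n>.json does not exist yet."""
    n = 1
    while (directory / f"{prefix}_{n}.json").exists():
        n += 1
    return n


def save_reviews(reviews: list, directory: Path) -> Path:
    """Write reviews to numbered JSON and CSV files; return the JSON path."""
    n = next_free_number(directory)
    json_path = directory / f"product_reviews_{n}.json"
    csv_path = directory / f"product_reviews_{n}.csv"
    with open(json_path, "w", encoding="utf-8") as file:
        json.dump(reviews, file, indent=2)
    with open(csv_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(reviews)
    return json_path


if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp())  # throwaway directory for the demo
    print(save_reviews(SAMPLE, out_dir).name)
    print(save_reviews(SAMPLE, out_dir).name)
```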