## Matthew Grace

# Amazon product review web scraper. I scrape customer reviews from Amazon product pages using Beautiful Soup (bs4) and store them as a csv file.

Import statements and initial setup

In [46]:
#import statements
import requests
import bs4
import html5lib  # not called directly; bs4 uses it as the 'html5lib' parser
import pandas
import numpy as np


In [47]:
# checking that a request to an amazon product page works
url = 'https://www.amazon.com/Brazilian-Jiu-Jitsu-Uniform-Preshrunk/dp/B07CZSBLMV/ref=sr_1_4?dchild=1&keywords=mma+gi&qid=1581012491&sr=8-4'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}
test = requests.get(url,headers=header)
test.reason

'OK'

The request succeeds, so I can scrape Amazon product pages.

In [28]:
# checking to make sure beautifulsoup works appropriately 
soup = bs4.BeautifulSoup(test.text,'html5lib')
prod_name = soup.find('span',{'class':"a-size-large"})
prod_name.text.strip()

'Brazilian Jiu Jitsu Gi BJJ Gi for Men & Women Grappling gi Uniform Kimonos Ultra Light, Preshrunk, Free White Belt!!!'

The following function returns the text of a single customer review from a product's review page.

In [29]:
# soupp: bs4 object built from a review page of a certain product
# n: index of the review on the page
# (0 through 9 on a full page of 10 reviews; the last page may hold fewer)
# returns a string with the text of the n-th customer review
def get_review(soupp, n):
    rev = soupp.find_all('span',{'data-hook':"review-body"})
    return rev[n].text.strip()


The following functions take the same arguments as get_review, but return other desired data from the customer review.

In [49]:
# returns a string of the date the review was posted
def get_review_date(soupp,n):
    d = soupp.find_all('span',{'data-hook':"review-date"})
    # the only date in the html for the page is inside a string of the form
    # "Reviewed in the United States on *date*", so the 33-character prefix is sliced off
    return d[n].text[33:]
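
The fixed offset of 33 only works while the text before the date keeps the same length. As a hedged alternative, the sketch below splits on the last " on " instead and also converts the result to a date object. The name parse_review_date and the sample string are my own assumptions for illustration, and month names are parsed under the default English locale.

```python
from datetime import datetime

def parse_review_date(text):
    """Return the date that follows the last ' on ' in a review-date string."""
    # rpartition splits on the final ' on ', so the length of the prefix does not matter
    _, _, date_part = text.rpartition(' on ')
    return datetime.strptime(date_part.strip(), '%B %d, %Y').date()

# sample string is hand-written, not fetched from a live page
print(parse_review_date('Reviewed in the United States on September 11, 2019'))
```

Returning a date object rather than a string makes later sorting and filtering by date simpler, at the cost of raising a ValueError when the text does not end in a date of this form.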


In [31]:
# returns a float of the review's rating
def get_review_rating(soupp,n):
    r = soupp.find_all('i',{'data-hook':"review-star-rating"})
    return float(r[n].text.strip().split()[0])

In [32]:
# returns the total number of reviews as an integer; works on any of the review pages
def num_reviews(soupp):
    ans = int(soupp.find('span',{'data-hook':"cr-filter-info-review-count"}).text.split()[-2])    
    return ans
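
Taking the second-to-last word and calling int() on it would fail if the count contained a thousands separator such as "1,234". A minimal sketch of a more tolerant parser is below. The function name parse_review_count and the sample strings are assumptions of mine; the real text inside the cr-filter-info-review-count span may be worded differently.

```python
import re

def parse_review_count(text):
    """Return the last number in `text` that is directly followed by the word 'reviews'."""
    # the number must start with a digit and may contain thousands separators
    matches = re.findall(r'(\d[\d,]*)\s+reviews', text)
    if not matches:
        raise ValueError('no review count found in: ' + repr(text))
    return int(matches[-1].replace(',', ''))

# sample strings are hand-written for illustration
print(parse_review_count('Showing 1-10 of 25 reviews'))
print(parse_review_count('Showing 1-10 of 1,234 reviews'))
```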


In [33]:
# returns the username of the customer who wrote the review
def get_review_author(soupp,n):
    a = soupp.find_all('span',{'class':"a-profile-name"})
    return a[n].text.strip()


In [34]:
# returns the name of the product the review was written for
def get_prod_name(soupp):
    pn = soupp.find('a',{'data-hook':"product-link"})
    return pn.text
    

Now I can combine these helper functions into a new function that returns the review data for one page of reviews as a list of lists.

In [35]:
#soupp: bs4 object to parse
#url: the url to store alongside each review in the returned data
#n: number of reviews to read from the page
#returns a list of lists, one inner list per review.
def get_review_data_page(soupp, url,n):
    out = []
    for i in range(0,n):
        out.append([get_prod_name(soupp), get_review_rating(soupp,i), get_review_date(soupp,i),
                 get_review_author(soupp,i), get_review(soupp,i), url])
    return out
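
The helpers above each call find_all again for every review index. As a sketch of one alternative, the block below runs each find_all once per page and zips the results, so the number of reviews on the page does not need to be passed in. The html fragment is hand-written for illustration and is not real Amazon markup, reviews_on_page is a hypothetical name, and 'html.parser' is used only to keep the sketch independent of html5lib.

```python
import bs4

# hand-written fragment that mimics the data-hook attributes used by the helpers
SAMPLE = '''
<div>
  <span class="a-profile-name">Alice</span>
  <i data-hook="review-star-rating"><span>5.0 out of 5 stars</span></i>
  <span data-hook="review-date">Reviewed in the United States on March 13, 2019</span>
  <span data-hook="review-body"><span> Well made. </span></span>
</div>
<div>
  <span class="a-profile-name">Bob</span>
  <i data-hook="review-star-rating"><span>1.0 out of 5 stars</span></i>
  <span data-hook="review-date">Reviewed in the United States on January 3, 2020</span>
  <span data-hook="review-body"><span> Runs small. </span></span>
</div>
'''

def reviews_on_page(soup):
    """Return [rating, date, author, body] for every review found in `soup`."""
    ratings = soup.find_all('i', {'data-hook': 'review-star-rating'})
    dates = soup.find_all('span', {'data-hook': 'review-date'})
    authors = soup.find_all('span', {'class': 'a-profile-name'})
    bodies = soup.find_all('span', {'data-hook': 'review-body'})
    rows = []
    # zip stops at the shortest list, so a short last page needs no special case
    for r, d, a, b in zip(ratings, dates, authors, bodies):
        rows.append([float(r.text.split()[0]), d.text[33:],
                     a.text.strip(), b.text.strip()])
    return rows

soup = bs4.BeautifulSoup(SAMPLE, 'html.parser')
rows = reviews_on_page(soup)
for row in rows:
    print(row)
```

This relies on the four lists lining up one-to-one, which holds for the fragment above but would need checking against a real review page.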

Now that I have a function for getting all of the review data on a particular page, the next step is to do this for multiple pages of reviews. After failed attempts to do this through the html tags on the review page, I discovered that the review url has the same form for all Amazon products, with the exception of the product name and an identifier. These can be combined with the page counter at the end of the review url, which is incremented to step through each page of reviews.

In [36]:
# url: url of the product's page
# returns [name, id], the product name and identifier found in the url
def strip_id_and_name(url):
    id1 = url.find('.com')
    id2 = url.find('/dp')
    name = url[id1+5:id2]
    id3 = url.find('/ref')
    id_final = url[id2+4:id3]
    return [name,id_final]
# testing on an amazon product
print(strip_id_and_name('https://www.amazon.com/Brazilian-Jiu-Jitsu-Uniform-Preshrunk/dp/B07CZSBLMV/ref=sr_1_4?dchild=1&keywords=mma+gi&qid=1581012491&sr=8-4'))

['Brazilian-Jiu-Jitsu-Uniform-Preshrunk', 'B07CZSBLMV']
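
The same extraction can also be sketched with the standard library's urllib.parse, which looks at the path segments around "dp" rather than at string offsets, and the result can be dropped into the review url pattern used in get_all_reviews. The names split_product_url, review_page_url and REVIEW_URL are hypothetical helpers of mine; the url template itself is copied from this notebook.

```python
from urllib.parse import urlsplit

# review url pattern taken from get_all_reviews, with the variable parts as fields
REVIEW_URL = ('https://amazon.com/{name}/product-reviews/{prod_id}'
              '/ref=cm_r_arp_d_paging_btm_next_2?ie=UTF8'
              '&reviewerType=all_reviews&pageNumber={page}')

def split_product_url(url):
    """Return (name, product id): the path segments before and after 'dp'."""
    parts = urlsplit(url).path.strip('/').split('/')
    i = parts.index('dp')  # raises ValueError if the url has no /dp/ segment
    if i == 0 or i + 1 >= len(parts):
        raise ValueError('unexpected product url: ' + url)
    return parts[i - 1], parts[i + 1]

def review_page_url(url, page):
    """Build the url of review page number `page` for the product at `url`."""
    name, prod_id = split_product_url(url)
    return REVIEW_URL.format(name=name, prod_id=prod_id, page=page)

example = ('https://www.amazon.com/Brazilian-Jiu-Jitsu-Uniform-Preshrunk/dp/'
           'B07CZSBLMV/ref=sr_1_4?dchild=1&keywords=mma+gi&qid=1581012491&sr=8-4')
print(split_product_url(example))
print(review_page_url(example, 3))
```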


This final function retrieves my desired review data for a given product across all of its reviews. There is a 2.5 second delay between requests to successive review pages to ensure that Amazon's servers do not block my requests, as previous attempts resulted in my connection being blocked.
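
The delay logic can be sketched separately from the scraping itself. In the block below, fetch_politely is a hypothetical helper that pauses between calls to any fetch function. The fetch and sleep arguments are stand-ins so the demo runs without touching the network; in real use fetch would wrap requests.get and sleep would be left as time.sleep.

```python
import time

def fetch_politely(urls, fetch, delay=2.5, sleep=time.sleep):
    """Call fetch(url) for each url, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results

# demo: str.upper stands in for the request, pauses.append records the delays
pauses = []
pages = fetch_politely(['page1', 'page2', 'page3'], fetch=str.upper,
                       delay=2.5, sleep=pauses.append)
print(pages)
print(pauses)
```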

In [37]:
# url: url of the product's page
# return a list of lists for all review data given a product
def get_all_reviews(url): 
    import time
    name, prod_id = strip_id_and_name(url)
    
    # the reviews url to be manipulated. As previously mentioned, the product review
    # pages can be stepped through with the page number at the end of the url, but given
    # a certain product, a name and product id are needed as well, hence the use of the
    # strip_id_and_name function
    rev_url = 'https://amazon.com/'+name+'/product-reviews/'+prod_id+'/ref=cm_r_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1'
    header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}
    start = requests.get(rev_url,headers=header)
    soupp = bs4.BeautifulSoup(start.text,'html5lib')
    out = []
    n_reviews = num_reviews(soupp)
    num_pages = (n_reviews + 9) // 10 # 10 reviews per page, rounded up
    
    for i in range(num_pages):
        # every page holds 10 reviews except possibly the last one
        n = min(10, n_reviews - 10*i)
        out.extend(get_review_data_page(soupp, rev_url, n))
        
        if i == num_pages-1:
            break # last page done, so no further request is needed
        # replace the page number at the end of the url with the next page's number
        ind = rev_url.rfind('=')
        rev_url = rev_url[:ind+1] + str(i+2)
        time.sleep(2.5) # delay to allow requests through
        page = requests.get(rev_url,headers=header)
        soupp = bs4.BeautifulSoup(page.text,'html5lib')
    return out
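
The page arithmetic can be checked in isolation. The sketch below assumes 10 reviews per page, as in the function above, and uses a hypothetical helper named page_sizes to list how many reviews each page should hold for a given total.

```python
def page_sizes(n_reviews, per_page=10):
    """Return the number of reviews expected on each page, in page order."""
    full, rest = divmod(n_reviews, per_page)
    # `full` complete pages, plus one shorter page if anything is left over
    return [per_page] * full + ([rest] if rest else [])

print(page_sizes(25))
print(page_sizes(20))
```

When the total is an exact multiple of 10 there is no partial last page, so the list simply ends with a full page.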
    

With all necessary functions created, the returned list of review data is converted to a dataframe so that it can be saved as a csv file.

In [38]:
# l: a list, whose contents are also lists. The internal lists have the desired review data
def to_df(l):
    out = pandas.DataFrame(data=l,columns=['Product Name','Rating','Date','Author','Body','Link'])
    return out

The final block of code is the script I used to run a list of products through my functions and save the results as a csv file.

In [52]:
# script to add product dfs to one dataframe and save it

urls = ['https://www.amazon.com/RHINO-RUGBY-Fitted-Stretch-Performance/dp/B07QM2J8NX/ref=pd_cp_200_4/130-7457756-6224769?_encoding=UTF8&pd_rd_i=B07QM2J8NX&pd_rd_r=e5b7fc1e-09a5-4e0a-aea0-2185ba8bcd0f&pd_rd_w=6086Q&pd_rd_wg=zTx8n&pf_rd_p=0e5324e1-c848-4872-bbd5-5be6baedf80e&pf_rd_r=3GDN6M2R680300VFQ8ZZ&refRID=3GDN6M2R680300VFQ8ZZ',
        'https://www.amazon.com/Nike-Strike-Soccer-White-Racer/dp/B07BT7QPFS/ref=sxin_3_ac_d_pm?ac_md=3-2-QWJvdmUgJDI1-ac_d_pm&keywords=soccer+ball&pd_rd_i=B07BT7QPFS&pd_rd_r=8320be32-68b8-453f-9af2-15a287de225f&pd_rd_w=tSLCD&pd_rd_wg=GQ2Kp&pf_rd_p=24d053a8-30a1-4822-a2ff-4d1ab2b984fc&pf_rd_r=35ACZ9SZKHYRV29CBNK1&psc=1&qid=1571430388&s=sporting-goods',
        'https://www.amazon.com/NIKE-Premier-League-Pitch-Soccer/dp/B07DM3XSV5/ref=pd_cp_200_4/130-7457756-6224769?_encoding=UTF8&pd_rd_i=B07DM3XSV5&pd_rd_r=d8198aa7-f84e-4ec7-b6b2-e5f8e9e2cda1&pd_rd_w=yP0jA&pd_rd_wg=hjDqJ&pf_rd_p=0e5324e1-c848-4872-bbd5-5be6baedf80e&pf_rd_r=YA05G2Y2SPS3VK20PBSJ&psc=1&refRID=YA05G2Y2SPS3VK20PBSJ',
        'https://www.amazon.com/NIKE-Pitch-Soccer-Ball/dp/B07C2M9GBF/ref=pd_day0_hl_200_5/130-7457756-6224769?_encoding=UTF8&pd_rd_i=B07C2M9GBF&pd_rd_r=ee203f0e-9dd6-471d-803b-e3efa9be5d60&pd_rd_w=gJZJF&pd_rd_wg=yvy8s&pf_rd_p=0501877d-5f8c-4ec8-9861-e0476eecc53e&pf_rd_r=NDFH279PEJVTGA2SAFMZ&refRID=NDFH279PEJVTGA2SAFMZ',
        'https://www.amazon.com/Equalizer-Soccer-Shorts-Black-Medium/dp/B00SLSRQYQ/ref=sr_1_12?dchild=1&keywords=soccer+shorts&qid=1571430495&s=sporting-goods&sr=1-12',
        'https://www.amazon.com/adidas-Blacks-Rugby-Jersey-Medium/dp/B07KPSDZ9Z/ref=sr_1_4?dchild=1&keywords=rugby+jersey%27&qid=1571430549&s=sporting-goods&sr=1-4',
        'https://www.amazon.com/Irish-Rugby-Shirt-Green-Shamrock/dp/B00UY5AX5I/ref=sr_1_3?dchild=1&keywords=rugby+shirt&qid=1571430583&s=sporting-goods&sr=1-3',
        'https://www.amazon.com/Guinness-Heritage-Charcoal-Sleeve-Jersey/dp/B07JPXS1ZF/ref=sr_1_2?dchild=1&keywords=rugby+shirt&qid=1571430650&s=sporting-goods&sr=1-2',
        'https://www.amazon.com/adidas-Entrada-Jersey-White-Medium/dp/B071GWPKXY/ref=sr_1_5?dchild=1&keywords=soccer+jersey&qid=1571430724&s=sporting-goods&sr=1-5',
        'https://www.amazon.com/adidas-Soccer-Madrid-Jersey-Medium/dp/B078H8696L/ref=sr_1_6?dchild=1&keywords=soccer+jersey&qid=1571430764&s=sporting-goods&sr=1-6',
        'https://www.amazon.com/adidas-Predator-Ground-Soccer-Metallic/dp/B07KWW74XT/ref=sr_1_3?dchild=1&keywords=soccer+cleats+mens&qid=1571436002&rnid=2941120011&s=apparel&sr=1-3',
        'https://www.amazon.com/adidas-Goletto-Ground-Black-Scarlet/dp/B07D9CXNNQ/ref=sr_1_8?dchild=1&keywords=soccer+cleats+mens&qid=1571436002&rnid=2941120011&s=apparel&sr=1-8',
        'https://www.amazon.com/adidas-Gloro-Ground-Black-Yellow/dp/B07D9H3Z95/ref=sr_1_9?dchild=1&keywords=soccer+cleats+mens&qid=1571436002&rnid=2941120011&s=apparel&sr=1-9',
        'https://www.amazon.com/G-Form-Youth-PRO-S-Compact-Shinguard-Blk-L/dp/B01N9HKXK2/ref=sr_1_29?dchild=1&keywords=shin+guards+soccer&qid=1571436126&s=apparel&sr=8-29',
        'https://www.amazon.com/Sportout-Comprehensive-Protection-Cushioned-Injuries/dp/B07JCS6YKB/ref=sr_1_5?dchild=1&keywords=shin+guards+soccer&qid=1571436184&s=apparel&sr=8-5'
        ]
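import os
for i in urls:
    reviews = to_df(get_all_reviews(i))
    # add to the reviews already saved, if any (DataFrame.append was removed in pandas 2.0)
    if os.path.exists('amazon_reviews.csv'):
        save = pandas.concat([pandas.read_csv('amazon_reviews.csv'), reviews], ignore_index=True)
    else:
        save = reviews
    save.to_csv('amazon_reviews.csv',header=True,index=False)
# checking csv
pandas.read_csv('amazon_reviews.csv').head()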


Unnamed: 0,Product Name,Rating,Date,Author,Body,Link
0,RHINO RUGBY Fitted Stretch Performance Game Da...,5.0,"September 11, 2019",XtinaG,There is great stretch and comfort to these sh...,https://amazon.com/RHINO-RUGBY-Fitted-Stretch-...
1,RHINO RUGBY Fitted Stretch Performance Game Da...,1.0,"January 3, 2020",Amazon Customer,Im a medium 31 or 32 waist on every thing i bu...,https://amazon.com/RHINO-RUGBY-Fitted-Stretch-...
2,RHINO RUGBY Fitted Stretch Performance Game Da...,5.0,"August 4, 2019",XtinaG,This was my first pair of rugby shorts. I tend...,https://amazon.com/RHINO-RUGBY-Fitted-Stretch-...
3,RHINO RUGBY Fitted Stretch Performance Game Da...,5.0,"April 14, 2019",Amazon Customer,I did not want shorts that are tight on the wa...,https://amazon.com/RHINO-RUGBY-Fitted-Stretch-...
4,RHINO RUGBY Fitted Stretch Performance Game Da...,5.0,"March 13, 2019",Garrett,"Well made, heavy fabric. The flexible materia...",https://amazon.com/RHINO-RUGBY-Fitted-Stretch-...
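
As a sketch of another way to build the file, the per-product dataframes can be collected in a list and combined once with pandas.concat, instead of re-reading the csv for every product. The helper name combine_and_save and the rows below are made up for illustration and are not scraped data; the demo writes to a temporary directory so it does not touch amazon_reviews.csv.

```python
import os
import tempfile
import pandas

COLUMNS = ['Product Name', 'Rating', 'Date', 'Author', 'Body', 'Link']

def combine_and_save(frames, path):
    """Concatenate the per-product dataframes and write them to one csv file."""
    combined = pandas.concat(frames, ignore_index=True)
    combined.to_csv(path, index=False)
    return combined

# made-up rows standing in for the output of to_df(get_all_reviews(url))
a = pandas.DataFrame([['Ball', 5.0, 'March 13, 2019', 'Alice', 'Great', 'url-a']],
                     columns=COLUMNS)
b = pandas.DataFrame([['Shirt', 1.0, 'January 3, 2020', 'Bob', 'Small', 'url-b'],
                      ['Shirt', 4.0, 'April 14, 2019', 'Cara', 'Nice', 'url-b']],
                     columns=COLUMNS)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'reviews_demo.csv')
    combined = combine_and_save([a, b], path)
    reloaded = pandas.read_csv(path)

print(reloaded.shape)
```

The trade-off is that nothing is written until every product has been scraped, so saving after each product, as the script above does, keeps partial results if a request is blocked partway through.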
