# Assignment 02: Web Scraping
Your Name: Hyeong-gi Hong  
Your Class: INST 447  
Your Section: TTh 0102  

In [1]:
import requests
from bs4 import BeautifulSoup
import sqlite3
import time

## Are Amazon reviews fake?
Should we trust Amazon reviews? 

That is a question that seems like we could answer if we only had enough data. So, lets collect some data. Your assignment is to scrape the reviews for five (5) similar products and to save those reviews into a sqllite database, that I have provided. The database will store:

**Product Table:**  
product_id - int - Primary key (this auto-increments).  
amazon_identifier - text - The identifier for the product.  
product_name - text - The name of the product.  
product_price - text - The price of the product. (Text because sometimes it is a range)  
scraper_name - text - Your name...  

**Review Table:**  
review_id - int - Primary key (this auto-increments).  
review_date - text - The date of the review.  
review_title - text - The title of the review.  
number_of_stars - int - The number of stars the review gave.  
verified_purchase - bool - Was the it a "Verifed Purchase"?  
review_body - text - The text of the review.  
number_found_helpful - int - The number of people that found the review helpful.  
product_id - int - Foreign key for the product.  

Since you are using my database structure, this means that I can run your code and fill up a single database with all of our data.

** YOU DO NOT HAVE TO DO ANYTHING WITH amazon-page_dump.db IF YOU ARE GETTING RESPONSES FROM AMAZON ** 
I have also given you a database named 'amazon-page_dump.db'. If you keep getting errors from Amazon, then tell me what that error is including the status-code, the reason given, and what you think that error means. Then use the pages saved in the page_dump table to complete the assignment. You will have to figure out how to best adapt the code framework I gave you to pull out the pages to parse.

**Database: amazon-page_dump.db  
Table: page_dump**  
dump_id - integer - Primary key (this auto-increments)  
amazon_identifier - text - The Amazon product identifier  
page_url - text - The url for the page  
page_html - text - The html of the page  

This is only in case you are getting only errors from Amazon and cannot access the pages to scrape. ** YOU DO NOT HAVE TO DO ANYTHING WITH amazon-page_dump.db IF YOU ARE GETTING RESPONSES FROM AMAZON ** 

In [2]:
# Create the amazon.db database, if it does not exist.
conn = sqlite3.connect('amazon.db')
c = conn.cursor()

In [3]:
# Create the products table 
c.execute('''
    CREATE TABLE IF NOT EXISTS products (
        product_id INTEGER PRIMARY KEY AUTOINCREMENT,
        amazon_identifier TEXT,
        product_name TEXT,
        product_price TEXT,
        scraper_name TEXT
        );
''')
# Create the reviews table
c.execute('''
    CREATE TABLE IF NOT EXISTS reviews (
        review_id INTEGER PRIMARY KEY AUTOINCREMENT,
        review_date TEXT, 
        review_title TEXT, 
        number_of_stars INTEGER, 
        verified_purchase BOOLEAN, 
        review_body TEXT, 
        number_found_helpful INTEGER,
        product_id INTEGER,
        FOREIGN KEY(product_id) REFERENCES products(product_id)
        )
''')

<sqlite3.Cursor at 0x781cae0>

## Your Task

Find 5 similar products on Amazon that have more than 5 reviews each (my example uses hotsauce, so you can't). Your products must be PG.
Grab their product identifiers and replace mine in the list I have below.

It is in the URL and it looks like these:

In [4]:
product_lists = ['B013WC0P2A', 'B001DHECXA', 'B075B4KRJT', 'B075M3YY18', 'B0072NZ292'] 

And the Urls for the reviews look like: https://www.amazon.com/product-reviews/B00AIR3Q38/?reviewerType=all_reviews&pageNumber=1

Notice that there is a spot in the url where the Amazon identifier goes and that there is an argument to be able to set the number of the page that is accesed.

In [5]:
# URL template
url = 'https://www.amazon.com/product-reviews/%s/'
# Default URL arguments
url_args = {'reviewerType': 'all_reviews',
            'pageNumber': 1,
            'sortBy': 'recent'}
# Pretend to be a browser
headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'}

Now loop through the identifiers and save the specified data into the database.

I have provided a shell. Fill out the bits that have comments that are like:
>#TASK: Get (x) data

In [6]:
# Loop through the Amazon product identifiers
for amazon_identifier in product_lists:
    # Get the first page
    url_args['pageNumber'] = 1
    r = requests.get(url % amazon_identifier, url_args)
    print(r.url)
    
    # Check if the request was successful.
    if r.status_code == requests.codes.ok:
        page_soup = BeautifulSoup(r.content, 'lxml')
        
        # TASK: Get the number of the last page and save it into the variable max_page number
        # note: make sure you handle the cases where there is a last button and when there isn't a last button
        pages = page_soup.find_all('li', attrs={'class': 'page-button'})
        max_page_number = 1
        
        if len(pages) != 0:
            for page in pages:
                n_page = page.text.replace(',', '');
                if max_page_number < int(n_page):
                    max_page_number = int(n_page)
        
        # TASK: Get the product name and save to the variable below
        product_name = page_soup.find('div', attrs={'class': 'a-row product-title'}).text
        
        # TASK: Get the product price and save to the variable below
        product_price = page_soup.find('span', attrs={'class': 'a-color-price arp-price'}).text
        
        # TASK: Change this to your name
        scraper_name = 'Hyeong-gi Hong'
        
        # Try to insert it. If this does not work, then we won't have a product_id to associate the reviews with and shouldn't save them.
        try:
            c.execute('INSERT INTO products (amazon_identifier, product_name, product_price, scraper_name) VALUES (?, ?, ?, ?)', 
                      (amazon_identifier, product_name, product_price, scraper_name))
            # We have to commit the transaction, or it won't be saved.
            conn.commit()
            # Save the last primary key inserted as the product_id
            product_id = c.lastrowid

            # If there are more than 5 pages, then stop at 5. 
            if max_page_number > 5:
                max_page_number = 5
            
            # Loop through all of the pages (1 through max)
            for page in range(1, max_page_number): 
                # Set the page number for the url
                url_args['pageNumber'] = page
                # Get the next page of reviews
                r = requests.get(url % amazon_identifier, url_args)
                print(r.url)
                # Check if we got a response
                if r.status_code == requests.codes.ok:
                    review_soup = BeautifulSoup(r.content, 'lxml')
                    review_pages = review_soup.find_all('div', attrs={'class': 'a-section review'})
                    
                    month_num = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6,
                                 'July': 7, 'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12}
                    
                    # TASK: Loop through the reviews (replace the [] with your code)
                    for review in review_pages:
                        
                        # TASK: Get the review date - convert it if necessary to YYYY-MM-DD
                        date_div = review.find('span', attrs={'class':'a-size-base a-color-secondary review-date'})
                        date = date_div.text.split(' ')
                        year = date[3]
                        month = date[1]
                        for m in month_num:
                            if month == m:
                                month = month_num[m]
                        day = date[2].strip(',')

                        review_date = "{}-{:02d}-{:02d}".format(year, month, int(day))

                        # TASK: Get the review title
                        review_title = review.find('a', attrs={'class': 'a-size-base a-link-normal review-title a-color-base a-text-bold'}).text
                        
                        # TASK: Get the number of stars - and make sure it is an int.
                        number_of_stars = int(review.find('a', attrs={'class': 'a-link-normal'}).text[0])
                        
                        # TASK: Get whether it is a verified purchase or not
                        verified_purchase = False
                        if review.find('span', attrs={'class': 'a-size-mini a-color-state a-text-boldd'}) == None:
                            verified_purchase = True
                            
                        # TASK: Get the actual text of the review
                        review_body = review.find('div', attrs={'class': 'a-row review-data'}).text
                        
                        # TASK: Get the number of people that found the review helpful
                        number_found_helpful = 0
                        review_votes = review.find('span', attrs={'class': 'review-votes'})
                        if review_votes != None:
                            num = review_votes.text.replace('\n', '').strip().split(' ')[0]
                            if num == 'One':
                                number_found_helpful = 1
                            else:
                                number_found_helpful = int(num)
                        
                        # Try to insert the review into the database. If it doesn't work. Then tell us why.
                        try:
                            c.execute('''INSERT INTO reviews 
                                            (product_id, review_date, review_title, number_of_stars, verified_purchase, review_body, number_found_helpful) 
                                         VALUES (?, ?, ?, ?, ?, ?, ?)''', 
                                      (product_id, review_date, review_title, number_of_stars, verified_purchase, review_body, number_found_helpful))
                            conn.commit()    
                        except sqlite3.DatabaseError as err:
                            print('SQL Error: {0}'.format(err))
                else: 
                    print('Error %s for %s on page %s' % (r.status_code, amazon_identifier, page))
                    
                # Slow things down.
                time.sleep(0.5)
        except sqlite3.DatabaseError as err:
            print('SQL Error: {0}'.format(err))
    else:
        print('Error %s for %s' % (r.status_code, amazon_identifier))
#     # Slow things down.
    time.sleep(0.5)

https://www.amazon.com/product-reviews/B013WC0P2A/?reviewerType=all_reviews&pageNumber=1&sortBy=recent
https://www.amazon.com/product-reviews/B013WC0P2A/?reviewerType=all_reviews&pageNumber=1&sortBy=recent
https://www.amazon.com/product-reviews/B013WC0P2A/?reviewerType=all_reviews&pageNumber=2&sortBy=recent
https://www.amazon.com/product-reviews/B013WC0P2A/?reviewerType=all_reviews&pageNumber=3&sortBy=recent
https://www.amazon.com/product-reviews/B013WC0P2A/?reviewerType=all_reviews&pageNumber=4&sortBy=recent
https://www.amazon.com/product-reviews/B001DHECXA/?reviewerType=all_reviews&pageNumber=1&sortBy=recent
https://www.amazon.com/product-reviews/B001DHECXA/?reviewerType=all_reviews&pageNumber=1&sortBy=recent
https://www.amazon.com/product-reviews/B001DHECXA/?reviewerType=all_reviews&pageNumber=2&sortBy=recent
https://www.amazon.com/product-reviews/B001DHECXA/?reviewerType=all_reviews&pageNumber=3&sortBy=recent
https://www.amazon.com/product-reviews/B001DHECXA/?reviewerType=all_revie

Checking to see if there are products

In [7]:
c.execute('SELECT COUNT(*) FROM products;')
c.fetchone()

(5,)

In [8]:
c.execute('SELECT * FROM products;')
c.fetchall()

[(1,
  'B013WC0P2A',
  'VicTsing MM057 2.4G Wireless Portable Mobile Mouse Optical Mice with USB Receiver, 5 Adjustable DPI Levels, 6 Buttons for Notebook, PC, Laptop, Computer, Macbook - Black',
  '$9.99',
  'Hyeong-gi Hong'),
 (2,
  'B001DHECXA',
  'TeckNet Classic 2.4G Portable Optical Wireless Mouse with USB Nano Receiver for Notebook,PC,Laptop,Computer,6 Buttons,30 Months Battery Life,4800 DPI,6 Adjustment Levels',
  '$12.49',
  'Hyeong-gi Hong'),
 (3,
  'B075B4KRJT',
  'Bluetooth Mouse, DINOWIN 3.0 Portable Mouse with Rechargeable Wireless USB Mouse Silent for Bluetooth-compatible Laptop,Mac,iMac,Macbook Android Tablet,PC Adjustable DPI 1000/1400/1600 (Black)',
  '$19.99',
  'Hyeong-gi Hong'),
 (4,
  'B075M3YY18',
  'VicTsing Wireless Gaming Mouse with Unique Silent Click, Breathing Backlit, 6 Programmable Buttons, 2400 DPI, Ergonomic Grips, 7 Buttons- Black',
  '$19.99',
  'Hyeong-gi Hong'),
 (5,
  'B0072NZ292',
  'HP A0X35AA#ABA Wireless Mouse X4000 with Laser Sensor',
  '$18.9

Checking to see if there are reviews

In [9]:
c.execute('SELECT COUNT(*) FROM reviews;')
c.fetchone()

(150,)

In [10]:
c.execute('SELECT * FROM reviews;')
c.fetchall()

[(1,
  '2018-04-12',
  'Great mouse',
  5,
  1,
  'Works great. Easy installation',
  0,
  1),
 (2,
  '2018-04-12',
  'Loved it till it broke 6 months later... buying another one :)',
  5,
  1,
  'bought it last October. Love everything about it! Very comfy, the back and forward buttons right about the thumb, adjustable cpi etc. but today it stopped working. dont know, mid game (tf2 haha) lost connection, changed batteries and nothing. figured it was a faulty one or due to my 2 little kids constantly dropping it. oh well at 10 bucks lets buy another one. and it will be here later tonight!',
  0,
  1),
 (3,
  '2018-04-12',
  'Scroll Wheel is completely defunct.',
  1,
  1,
  'I’d give 0 out of 10 if I could. Scroll wheel would jump a little initially but now it’s hit the point where scrolling doesn’t work at all and there were no solutions to how to fix it. Don’t waste your money, the constant headache isnt worth it.It either scrolls and stops and then wont scroll in that specific windo

Closing the database connection.

In [12]:
conn.close()