## Starting Off

With a partner, answer the following question:

Is it legal to scrape data from websites?

# Advanced Webscraping: How to make sure you don't get blocked.

## Aims:

- Write scripts that scrape websites, handle errors, and minimize the likelihood of your IP address getting blocked.


## Agenda

- Talk about the legality of scraping
- Practice scraping
- Look at ways to programmatically avoid getting banned
- Set up the selenium webdriver
- Learn how to use Selenium

## 1. Check 200 status code
It is always good to check the HTTP status code early and proceed accordingly.

This is good:

~~~
if response.status_code == 200:
    pass  # proceed further
~~~

This is better:

~~~
# inside a function: bail out early on any non-200 response
if response.status_code != 200:
    return False
~~~
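
As a minimal sketch of the early-exit check, the helper below returns the response only when the status is 200. The name `fetch_if_ok` and the optional `http` argument (anything with a `get` method, defaulting to the `requests` module) are assumptions made for illustration; they are not part of the original script.

```python
import requests


def fetch_if_ok(url, http=requests):
    """Return the response when the status is 200, otherwise None."""
    response = http.get(url, timeout=5)
    if response.status_code != 200:
        # Report the unexpected status and let the caller decide what to do
        print(f"{url} returned HTTP {response.status_code}")
        return None
    return response
```

If you would rather raise an exception than return `None`, `response.raise_for_status()` is an alternative worth looking at.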

In [8]:
for url in urls:
    page = requests.get(url)
    # include code to do status check
    if page.status_code != 200:
        print( page.status_code)
        break
    # more code to process the results


## 2. Never Trust HTML

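Especially if you can’t control it. Web scraping depends on the HTML DOM, so a simple change to an element or class name can break your entire script. The best way to deal with this is to check that your selector actually returned something: `select()` returns an empty list when nothing matches, while `find()` and `select_one()` return `None`.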

~~~
page_count = soup.select('.pager-pages > li > a')
if page_count:
    pass  # do your stuff
else:
    print("ALERT!! Send notification to Admin")
~~~

Here I am checking whether the CSS selector returned something legitimate; if it did, the script proceeds.
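
The difference between an empty result and `None` can be sketched on a small inline HTML string. The HTML snippet and the helper name `first_text` are made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<ul class='pager-pages'><li><a href='/p/2'>2</a></li></ul>"
soup = BeautifulSoup(html, "html.parser")

links = soup.select(".pager-pages > li > a")  # a list, possibly empty
missing = soup.select(".no-such-class")       # empty list when nothing matches
first = soup.select_one(".no-such-class")     # None when nothing matches


def first_text(soup, selector, default=None):
    """Return the text of the first match, or `default` if nothing matched."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


print(first_text(soup, ".pager-pages a"))
print(first_text(soup, ".no-such-class", default="n/a"))
```

Guarding every lookup this way means a changed class name shows up as a missing value you can log, instead of an `AttributeError` halfway through a run.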

In [None]:
for url in urls:
    page = requests.get(url)
    # include code to do status check
    if page.status_code != 200:
        print( page.status_code)
        break
    # more code to process the results
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    if items:
        #continue processing the data
        pass
    else:
        print("Data is coming back blank")

## 3. Set headers

`requests` does not force you to send request headers, but some websites will not return anything useful unless certain headers are set. I once ran into a site where the HTML I saw in the browser was different from what my script received, kind of like magic, huh? So it is always good to make your requests look as legitimate as you can. The least you should do is set a `User-Agent`.

~~~
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

response = requests.get(url, headers=headers, timeout=5)

~~~
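
If every request should carry the same headers, one option is to set them once on a `requests.Session`. This is a sketch, not the original author's approach; the `Accept-Language` value and the `example.com` URL are placeholders. The request is only prepared, not sent, so you can inspect the headers without touching the network.

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",  # assumed value, adjust as needed
})

# Build (but do not send) a request to see which headers would go out
prepared = session.prepare_request(requests.Request("GET", "https://example.com"))
print(prepared.headers["User-Agent"])
```

After this, `session.get(url, timeout=5)` sends those headers on every call.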

In [None]:
# Best practice: send non-blank headers to reduce the chance of getting blocked
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url, headers = headers)
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
    
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        pass  # continue processing the data
    else:
        print("Data is coming back blank")

## 4. Set timeout

One issue with `requests` is that if you don’t set a **timeout**, it will wait on an unresponsive server indefinitely. That might be acceptable in a few situations, but not in most. It is therefore always good to set a timeout value for each request. Here I am setting the timeout to 5 seconds.

~~~
response = requests.get(url, headers=headers, timeout=5)
~~~
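
`requests` also accepts a `(connect, read)` tuple if you want separate limits for opening the connection and for waiting on data. The sketch below assumes illustrative values; the helper name `get_with_timeout` and its `http` argument (anything with a `get` method) are hypothetical.

```python
import requests

CONNECT_TIMEOUT = 3.05  # seconds allowed to establish the connection
READ_TIMEOUT = 10       # seconds allowed between bytes sent by the server


def get_with_timeout(url, http=requests):
    """Fetch `url`, returning None instead of hanging or raising on a timeout."""
    try:
        return http.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
    except requests.Timeout:
        # Covers both connect timeouts and read timeouts
        print(f"Timed out: {url}")
        return None
```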

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    page = requests.get(url, headers = headers, timeout=5)
    # include code to do status check
    if page.status_code != 200:
        print(page.status_code)
    
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        pass  # continue processing the data
    else:
        print("Data is coming back blank")

## 5. Exception handling

It is always good to implement exception handling. It not only keeps the script from exiting unexpectedly but also helps you log errors and informational messages. When using Python `requests`, I prefer to catch exceptions like this:

~~~
try:
    pass  # your logic is here

except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
except KeyboardInterrupt:
    print("Someone closed the program") 
~~~

Check the very last one. It tells the program that if someone terminates it with Ctrl+C, it should wrap things up first and then exit. This is useful if you are storing information in a file and want to dump it all at the time of exit.
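
The "wrap things up, then exit" idea can be sketched with `try`/`except`/`finally`. The function name `scrape_all`, the `fetch` callable, and the JSON output format are assumptions for illustration; the point is that the `finally` block writes whatever was collected, even after Ctrl+C.

```python
import json


def scrape_all(urls, fetch, out_path):
    """Call fetch(url) for each url and always write what was collected."""
    results = []
    try:
        for url in urls:
            results.append(fetch(url))
    except KeyboardInterrupt:
        print("Interrupted; saving what we have so far")
    finally:
        # Runs on normal completion, on Ctrl+C, and on any other error
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(results, f)
    return results
```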

In [10]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    try:
        page = requests.get(url, headers = headers, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(page.status_code)
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
        continue  # no page to parse, move on to the next url
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
        continue
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
        continue
    except KeyboardInterrupt:
        print("Someone closed the program")
        break  # stop the loop instead of parsing a page we never received

    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        print('test') #continue processing the data
    else:
        print("Data is coming back blank")


This code is starting to get long and hard to read. So let's start to modularize it.  

In [26]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}


def get_page(url):
    page = None  # returned as-is if the request fails
    try:
        page = requests.get(url, headers = headers, timeout=5)
    # include code to do status check
        if page.status_code != 200:
            print(page.status_code)
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
    except KeyboardInterrupt:
        print("Someone closed the program") 
        
        
    return page
    

We can replace a chunk of our code with this function:

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    #use our new function to process each url
    page = get_page(url)
    if page is None:
        continue  # the request failed; move on to the next url

    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        pass  # continue processing the data
    else:
        print("Data is coming back blank")
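
A possible extension, not part of the original function, is to let `requests` retry transient failures on its own. The sketch below mounts a `urllib3` `Retry` policy on a session; the helper name `make_session`, the retry count, and the list of status codes are assumptions you would tune per site.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3, backoff_factor=1):
    """Build a session that retries failed requests with increasing delays."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # wait longer after each failed attempt
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Inside `get_page`, `session.get(url, headers=headers, timeout=5)` could then stand in for `requests.get(...)`.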

## 6. Regulate your request pace

Many websites limit how many requests you can make within a minute, hour, or day. You want to be aware of that limit and adjust your script to account for it.

One option is the `sleep()` function from the `time` module, which pauses your script for a set amount of time.

~~~
import time
 
 
## Start loop ##
for url in urls:

    # try to make request here.
    
 
    #### Delay for 1 second ####
    time.sleep(1)
        
~~~
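
A fixed one-second delay produces a very regular request pattern. One variation, sketched here as an assumption rather than a rule from any particular site, is to sleep for a random interval; the function name `polite_pause` and the default bounds are illustrative.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Calling `polite_pause()` at the end of each loop iteration replaces the fixed `time.sleep(1)`.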

In [None]:
import time
 
 
## Start loop ##
for url in urls:
    print("Current date & time " + time.strftime("%c"))

    #use our new function to process each url
    page = get_page(url)
             
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        pass  # continue processing the data
    else:
        print("Data is coming back blank")
    
    time.sleep(1)
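
Some sites publish their crawling rules, sometimes including a preferred delay, in `robots.txt`. Python's standard library can read that file. The rules below are made-up sample text parsed from a string; in a real script you would point the parser at the site with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

print(parser.can_fetch("*", "https://example.com/private/report"))
print(parser.can_fetch("*", "https://example.com/products"))
print(parser.crawl_delay("*"))  # seconds to wait between requests, if declared
```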

## 7. Save as you go

Your script might break halfway through a scrape, so make sure you are saving your data as you go.

~~~ 
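import csv
import os
...
# open() does not expand "~", so expand it to the home directory first
with open(os.path.expanduser("~/Desktop/output.csv"), "w", newline="") as f:
    writer = csv.writer(f)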

    # collected_items = [
    #   ["Product #1", "10", "http://example.com/product-1"],
    #   ["Product #2", "25", "http://example.com/product-2"],
    #   ...
    # ]

    for item_property_list in collected_items:
        writer.writerow(item_property_list)
~~~
~~~
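import csv
import os
...
field_names = ["Product Name", "Price", "Detail URL"]
# open() does not expand "~", so expand it to the home directory first
with open(os.path.expanduser("~/Desktop/output.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, field_names)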

    # collected_items = [
    #   {
    #       "Product Name": "Product #1",
    #       "Price": "10",
    #       "Detail URL": "http://example.com/product-1"
    #   },
    #   ...
    # ]

    # Write a header row
    writer.writeheader()

    for item_property_dict in collected_items:
        writer.writerow(item_property_dict)
~~~
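
Both snippets above open the file in `"w"` mode, which starts the file over each time. To save as you go across many pages, one approach is to append and write the header only when the file is new. This is a sketch; the helper name `append_rows` is hypothetical.

```python
import csv
import os


def append_rows(path, rows, field_names):
    """Append dict rows to a CSV file, writing the header only once."""
    path = os.path.expanduser(path)  # open() does not expand "~" by itself
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=field_names)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Calling `append_rows(...)` once per scraped page means a crash on page 30 still leaves pages 1 to 29 on disk.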

In [7]:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for url in urls:
    print("Current date & time " + time.strftime("%c"))

    #use our new function to process each url
    page = get_page(url)
             
    #imagine we have gotten the contents of the page in the soup variable
    soup = BeautifulSoup(page.content, 'html.parser')
    items = soup.select(' .specific_class')
    
    #check to make sure we have items of that class
    if items:
        print(' ') #continue processing the data
    else:
        print("Data is coming back blank")
    
    #Saving your data as you go
    
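    # Option 1: append this page's rows to a csv file
    # (needs `import csv` and `import os` at the top of the script)
    # "a" keeps rows from earlier pages; "w" would overwrite the file on every pass
    with open(os.path.expanduser("~/Desktop/output.csv"), "a", newline="") as f:
        writer = csv.writer(f)
        # write while the file is still open
        for item in items:
            writer.writerow(item)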
        
    # Option 2: Inserting the data into a DB
    # This code uses a hypothetical module, sql_helpers.
    # The functions below are examples and will not run.
    import sql_helpers as sql
    
    sql.create_connection()
    for  item in items:
        data = item
        query = "INSERT INTO table_name VALUES (%s,%s,%s,%s)"
        sql.insert_data(db, query, data )
    # must be in the loop, so each item gets committed asap
    sql.commit()
    
    #Taking a one second pause to help slow down your requests 
    time.sleep(1)
sql.close()
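
Since `sql_helpers` is only a stand-in, here is a runnable sketch of the same idea using the standard library's `sqlite3`. The table name, column names, and sample row are assumptions; note that `sqlite3` uses `?` placeholders rather than `%s`.

```python
import sqlite3


def save_items(conn, items):
    """Insert one page's worth of (name, price, url) tuples and commit."""
    conn.executemany(
        "INSERT INTO products (name, price, url) VALUES (?, ?, ?)", items
    )
    conn.commit()  # commit per page so a later crash cannot lose this data


conn = sqlite3.connect(":memory:")  # use a file path to keep the data on disk
conn.execute("CREATE TABLE products (name TEXT, price TEXT, url TEXT)")
save_items(conn, [("Product #1", "10", "http://example.com/product-1")])
print(conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])
```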



In [4]:
from bs4 import BeautifulSoup
import requests

In [43]:
html_page = requests.get("https://www.the-numbers.com/bankability") #Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing
argh = soup.findAll('table', style="width:380px;")
for i in argh:
    for j in i.children:
        print(j)
    #print(i.a.string)

In [6]:
html_page = requests.get("https://www.the-numbers.com/bankability") #Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing
print(soup.prettify()[:1000])  # preview the parsed HTML; a bare `soup.prettify` without parentheses only displays the bound method

In [7]:
dirty_actors_list = list(soup.findAll('span', style="font-size:200%;"))
clean_actors_list = []
for line in dirty_actors_list:
    clean_actors_list.append(line.a.string)
clean_actors_list # test-print

['Tom Cruise',
 'Will Smith',
 'Robert Downey, Jr.',
 'Sandra Bullock',
 'Kathleen Kennedy',
 'Clint Eastwood',
 'Denzel Washington',
 'Ben Affleck',
 'Vin Diesel',
 'Kevin Feige',
 'Leonardo DiCaprio',
 'John Williams',
 'Steven Spielberg',
 'Tom Hanks',
 'Michael Keaton',
 'Jon Favreau',
 'Kenneth Branagh',
 'Angelina Jolie Pitt',
 'Bradley Cooper',
 'Gwyneth Paltrow',
 'David Heyman',
 'John Lasseter',
 'Alan Silvestri',
 'Scarlett Johansson',
 'Emma Watson',
 'Jim Carrey',
 'Adam Sandler',
 'Matt Damon',
 'Dwayne Johnson',
 'George Clooney',
 'Eddie Redmayne',
 'Ryan Reynolds',
 'Cameron Diaz',
 'Gal Gadot',
 'Samuel L. Jackson',
 'Djimon Hounsou',
 'Brad Pitt',
 "Lupita Nyong'o",
 'Russell Crowe',
 'Julia Roberts',
 'Tom Hardy',
 'Frank Marshall',
 'Harrison Ford',
 'Hugh Jackman',
 'James Wan',
 'Reese Witherspoon',
 'Mark Wahlberg',
 'Johnny Depp',
 'Chris Evans',
 'Mark Ruffalo']

In [8]:
dirty_bank_list = list(soup.findAll('div', style="font-size:200%;"))
clean_bank_list = []
for line in dirty_bank_list:
    clean_bank_list.append(int(line.text[1:].replace(',','')))
clean_bank_list # test-print

[22537572,
 20593743,
 16602313,
 15694181,
 15416491,
 14466692,
 14189296,
 13770962,
 13610706,
 13144845,
 11501979,
 11486027,
 11175085,
 11091235,
 10979173,
 10851176,
 10802765,
 10786097,
 10482036,
 10223870,
 10080186,
 9864583,
 9679797,
 9675619,
 9467350,
 9083733,
 8899915,
 8838814,
 8679831,
 8678721,
 8459158,
 8454752,
 8415932,
 8402849,
 8180346,
 8130231,
 8100147,
 8081050,
 8023745,
 7982614,
 7924721,
 7891161,
 7876773,
 7781327,
 7728167,
 7663786,
 7581985,
 7566729,
 7527266,
 7352303]

In [9]:
NS = soup.findAll('div', style="font-size:200%;")
movs_per_yr_list = []
bank_per_yr_list = []
bank_per_mov_list = []
for n in NS:
    string_container = n.nextSibling.nextSibling.get_text()
    print(string_container, "END \n")  # temp
    # parse through loop and retrieve movies per year for each actor
    divider1 = string_container.find(" movies/year")
    movs_per_yr_list.append(float(string_container[2:divider1]))
    # parse through loop to retrieve bankability per year for each actor
    divider2 = string_container.find("/year", divider1)+divider1
    divider3 = string_container.find("/year", divider2+divider1)
    bank_per_yr_list.append(int(string_container[divider2+divider1:divider3].replace(',','')))
    bank_per_mov_list.append(round(bank_per_yr_list[-1]/movs_per_yr_list[-1], 2))
    
    
print(movs_per_yr_list, len(movs_per_yr_list))
print(bank_per_yr_list, len(bank_per_yr_list))
print(bank_per_mov_list, len(bank_per_mov_list))
    #    print(string_container,start,end)


	1.1 movies/year
	
	$24,791,329/year
	
ChangeValue: +$1,791,379
 END 


	1.1 movies/year
	
	$22,653,117/year
	
ChangeValue: +$491,681
 END 


	1.4 movies/year
	
	$23,243,238/year
	
ChangeValue: +$91,901
 END 


	1.0 movies/year
	
	$15,694,181/year
	
ChangeValue: -$1,828
 END 


	1.3 movies/year
	
	$20,041,438/year
	
ChangeValue: -$15,067
 END 


	1.1 movies/year
	
	$15,913,361/year
	
ChangeValue: -$26,924
 END 


	1.2 movies/year
	
	$17,027,155/year
	
ChangeValue: +$1,060,927
 END 


	1.2 movies/year
	
	$16,525,154/year
	
ChangeValue: +$1,031,091
 END 


	1.2 movies/year
	
	$16,332,847/year
	
ChangeValue: -$17,392
 END 


	2.2 movies/year
	
	$28,918,660/year
	
ChangeValue: +$47,396
 END 


	1.8 movies/year
	
	$20,703,562/year
	
ChangeValue: +$1,145,631
 END 


	1.3 movies/year
	
	$14,931,835/year
	
ChangeValue: +$815,601
 END 


	2.9 movies/year
	
	$32,407,746/year
	
ChangeValue: -$380,879
 END 


	2.1 movies/year
	
	$23,291,593/year
	
ChangeValue: +$488,640
 END 


	1.3 movies/year
	

In [12]:
import re
from collections import defaultdict

films_dict = defaultdict(list)
rank = 1
print('length = ', len(list(soup.find_all(id="col2mid")))) # temp
for l in list(soup.find_all(id="col2mid")):
    for r in list(re.findall(r'summary">(.*?)<', str(l))):
        if r.find('â') != -1:
            if r[r.find('â')+3] == 's':
                r = r[:r.find('â')]+"'"+r[r.find('â')+3:]
            else:
                r = r[:r.find('â')]+":"+r[r.find('â')+2:]
        # print('X94! ', r.find('x94')) # r.replace('\\x94','')
        if r not in films_dict[str(rank)]: # check if duplicate, skip if it is
            films_dict[str(rank)].append(r)    
    rank +=1
# print(rank, films_dict) # WHY?!?!

for rank in films_dict:
    for film in films_dict[rank]:
        print(rank, film)

length =  50
1 Mission: Impossible:Fallout
1 Mission: Impossible:Rogue Nation
1 Mission: Impossible:Ghost Protocol
1 War of the Worlds
1 Austin Powers in Goldmember
1 Mission: Impossible 2
1 Mission: Impossible III
2 Aladdin
2 Suicide Squad
2 I am Legend
2 Hancock
2 Men in Black 3
2 Hitch
2 The Pursuit of Happyness
2 Annie
2 I, Robot
3 Avengers: Endgame
3 Avengers: Infinity War
3 The Avengers
3 Avengers: Age of Ultron
3 Captain America: Civil War
3 The Judge
3 A Guide to Recognizing Your Saints
4 Minions
4 Gravity
4 The Blind Side
4 Ocean's 8
4 The Heat
4 The Proposal
4 Miss Congeniality
4 Two Weeks Notice
4 Miss Congeniality 2: Armed and Fabulous
4 Hope Floats
5 Star Wars Ep. VII: The Force Awakens
5 Star Wars Ep. VIII: The Last Jedi
5 Rogue One: A Star Wars Story
5 Solo: A Star Wars Story
5 Indiana Jones and the Kingdom of the Crystal Skull
5 Contact
6 The Mule
6 Gran Torino
6 Million Dollar Baby
6 Space Cowboys
6 In the Line of Fire
6 American Sniper
6 Sully
7 The Equalizer 2
7 S
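
The stray `â` characters handled above look like UTF-8 bytes that were decoded as Latin-1 (an assumption based on the leftover `\x94` in later output). If that is the cause, re-encoding and decoding can repair the text without per-character patching. The function name `fix_mojibake` and the sample title are illustrative.

```python
def fix_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 damage; return the text unchanged if it is fine."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not the damage pattern we know how to repair
        return text


# "\u00e2\u0080\u0094" is how an em dash (U+2014) looks after the bad decode
damaged = "Mission: Impossible\u00e2\u0080\u0094Fallout"
print(fix_mojibake(damaged))
print(fix_mojibake("Ocean's 8"))
```

Setting the right encoding before parsing is cleaner still, when you can work out which one the page actually uses.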

In [14]:
import copy

main_dict = {}
attributes = {}
roles_list = []
if len(clean_actors_list) == len(clean_bank_list):
    for i in range(len(clean_actors_list)):
        attributes['rank'] = i+1
        attributes['bankability'] = clean_bank_list[i]
        attributes['bank per movie'] = bank_per_mov_list[i]
        attributes['roles'] = films_dict[str(i+1)]
        main_dict[clean_actors_list[i]] = attributes.copy()
else:
    raise ValueError
main_dict # test-print

{'Tom Cruise': {'rank': 1,
  'bankability': 22537572,
  'bank per movie': 22537571.82,
  'roles': ['Mission: Impossible:\x94Fallout',
   'Mission: Impossible:\x94Rogue Nation',
   'Mission: Impossible:\x94Ghost Protocol',
   'War of the Worlds',
   'Austin Powers in Goldmember',
   'Mission: Impossible 2',
   'Mission: Impossible III']},
 'Will Smith': {'rank': 2,
  'bankability': 20593743,
  'bank per movie': 20593742.73,
  'roles': ['Aladdin',
   'Suicide Squad',
   'I am Legend',
   'Hancock',
   'Men in Black 3',
   'Hitch',
   'The Pursuit of Happyness',
   'Annie',
   'I, Robot']},
 'Robert Downey, Jr.': {'rank': 3,
  'bankability': 16602313,
  'bank per movie': 16602312.86,
  'roles': ['Avengers: Endgame',
   'Avengers: Infinity War',
   'The Avengers',
   'Avengers: Age of Ultron',
   'Captain America: Civil War',
   'The Judge',
   'A Guide to Recognizing Your Saints']},
 'Sandra Bullock': {'rank': 4,
  'bankability': 15694181,
  'bank per movie': 15694181.0,
  'roles': ['Mini

In [10]:
html_page = requests.get("https://www.the-numbers.com/bankability") #Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing
soup.prettify
actors_list = list(soup.findAll('div', style="font-size:200%;"))
main_dict = {}
#for line in argh:
#    print(type(line))
#     main_dict[line.a.string] = 



## More Resources 
- [More advanced issues](https://blog.hartleybrody.com/web-scraping-cheat-sheet/)
- [Request Advanced Usage](http://docs.python-requests.org/en/master/user/advanced/#)

Web scraping with Python often requires nothing more than the Beautiful Soup module. Beautiful Soup is a popular Python library that makes web scraping easier by letting you traverse the DOM (document object model).

## Applied: Scraping Amazon's Best Sellers list:


Amazon keeps track of the best sellers for 41 different categories of products. We want to grab that data from Amazon so that we can keep track of which products are on that list and stock our mom and pop store with them.  


Deliverable: a file that contains all of the products on Amazon's best seller list. 

```
[{'name': 'A top selling product',
  'url': 'http://the_url_to_the_product.com'},
 {'name': 'A top selling product',
  'url': 'http://the_url_to_the_product.com'}]
```
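
One straightforward way to produce that file is JSON, which maps directly onto a list of dictionaries. The helper names `save_products` and `load_products` are hypothetical, and JSON itself is an assumption since the exercise does not name a file format.

```python
import json


def save_products(products, path):
    """Write a list of {'name': ..., 'url': ...} dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(products, f, indent=2)


def load_products(path):
    """Read the list back, e.g. to resume or to check what was saved."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```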

In [68]:
# headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

def get_page(url):
    page = None  # returned as-is if the request fails
    try:
#         page = requests.get(url, headers = headers, timeout=5)
        page = requests.get(url, timeout=5)
        # include code to do status check
        if page.status_code != 200:
            print(page.status_code)
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
        print(str(e))
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        print(str(e))
    except requests.RequestException as e:
        print("OOPS!! General Error")
        print(str(e))
    except KeyboardInterrupt:
        print("Someone closed the program") 
    return page

In [69]:
import requests
from bs4 import BeautifulSoup as BS

First we start by grabbing the page where all of the best seller lists are located.

In [70]:
url="https://www.amazon.com/Best-Sellers/zgbs"

#let's use the function we already created
page = get_page(url)
page

<Response [200]>

In [71]:
soup = BS(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr">
 <head>
  <link href="https://images-na.ssl-images-amazon.com/images/I/21doGy6C0kL._RC|01KD4yyr5LL.css_.css?AUIClients/ZeitgeistPageAssets-zeitgeistHome" rel="stylesheet"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/41gCbfiTdaL._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01tgK36lpGL.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01Xl9KigtzL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.css,01SHjPML6tL.css,111-D2qRjiL.css,01QrWuRrZ-L.css,310Imb6LqFL.css,11Z1a0FxSIL.css,01Alnvtt1zL.css,21mOLw+nYYL.css,01L8Y-JFEhL.css_.css?AUIClients/AmazonUI#us.not-trident.206347-T1" rel="stylesheet"/>
  <script>
   (function(g,h,R,z){function G(a){x&&x.tag&&x.tag(q(":","aui",a))}function v(a,b){x&&x.count&&x.count("aui:"+a,0===b?0:b||(x.count("aui:"+a)||0)+1)}function n(a){try{return a.test(navigator.userAgent)}catch(b){return

Now that we have this page, we want to find the urls of all the other pages to scrape those.  

In [73]:
#using the select statement to find the elements containing each url
urls = soup.select('ul#zg_browseRoot a')  # links inside the <ul id="zg_browseRoot"> element (found by searching through the html)
urls
# print(urls[0].text, '\n',urls[0]['href'])

[<a href="https://www.amazon.com/Best-Sellers/zgbs/amazon-devices">Amazon Devices &amp; Accessories</a>,
 <a href="https://www.amazon.com/Best-Sellers-Amazon-Launchpad/zgbs/boost">Amazon Launchpad</a>,
 <a href="https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances">Appliances</a>,
 <a href="https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Apps &amp; Games</a>,
 <a href="https://www.amazon.com/Best-Sellers-Arts-Crafts-Sewing/zgbs/arts-crafts">Arts, Crafts &amp; Sewing</a>,
 <a href="https://www.amazon.com/Best-Sellers-Audible-Audiobooks/zgbs/audible">Audible Books &amp; Originals</a>,
 <a href="https://www.amazon.com/Best-Sellers-Automotive/zgbs/automotive">Automotive</a>,
 <a href="https://www.amazon.com/Best-Sellers-Baby/zgbs/baby-products">Baby</a>,
 <a href="https://www.amazon.com/Best-Sellers-Beauty/zgbs/beauty">Beauty &amp; Personal Care</a>,
 <a href="https://www.amazon.com/best-sellers-books-Amazon/zgbs/books">Books</a>,
 <a href="https://www.amaz

In [74]:
#list of all best seller urls
urls = [url['href'] for url in urls]

Select a url/product category that you want to investigate, and let's build our script to parse one page. Then we can apply it to all of the pages.

In [75]:
urls[3]

'https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps'

In [77]:
url=urls[0] #['href']  # gives us the specific individual cell on amazon

apps = get_page(url)
apps

<Response [200]>

In [78]:
app_soup = BS(apps.content, 'html.parser')
print(app_soup.prettify())

<!DOCTYPE doctype html>
<html class="a-no-js" data-19ax5a9jf="dingo">
 <head>
  <script>
   var aPageStart = (new Date()).getTime();
  </script>
  <meta charset="utf-8"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/21doGy6C0kL._RC|01WTbMujHuL.css_.css?AUIClients/ZeitgeistPageAssets-zeitgeistList" rel="stylesheet"/>
  <link href="https://images-na.ssl-images-amazon.com/images/I/41gCbfiTdaL._RC|516fcOUE-HL.css,01evdoiemkL.css,01K+Ps1DeEL.css,31pdJv9iSzL.css,01tgK36lpGL.css,11UGC+GXOPL.css,21LK7jaicML.css,11L58Qpo0GL.css,21kyTi1FabL.css,01Xl9KigtzL.css,01YhS3Cs-hL.css,21GwE3cR-yL.css,019SHZnt8RL.css,01wAWQRgXzL.css,21bWcRJYNIL.css,11WgRxUdJRL.css,01dU8+SPlFL.css,11ocrgKoE-L.css,01SHjPML6tL.css,111-D2qRjiL.css,01QrWuRrZ-L.css,310Imb6LqFL.css,11Z1a0FxSIL.css,01Alnvtt1zL.css,21mOLw+nYYL.css,01L8Y-JFEhL.css_.css?AUIClients/AmazonUI#us.not-trident.206347-T1" rel="stylesheet"/>
  <script>
   (function(g,h,R,z){function G(a){x&&x.tag&&x.tag(q(":","aui",a))}function v(a,b){x&&x

Inspect the actual webpage to determine the data you want and the corresponding elements you want to parse out. Then use that element tag or class to pull those elements out of the page. 

In [79]:
# your code here
product_links = app_soup.select('.a-link-normal')  # too broad, find its child
product_links

[<a class="a-link-normal" href="/Fire-TV-Stick-with-Alexa-Voice-Remote/dp/B0791TX5P5?_encoding=UTF8&amp;psc=1"><span class="zg-text-center-align"><div class="a-section a-spacing-small"><img alt="Fire TV Stick with Alexa Voice Remote, streaming media player" height="200" src="https://images-na.ssl-images-amazon.com/images/I/51ZdmnHKukL._AC_UL200_SR200,200_.jpg" width="200"/></div></span>
 <div aria-hidden="true" class="p13n-sc-truncate p13n-sc-line-clamp-2" data-rows="2">
             Fire TV Stick with Alexa Voice Remote, streaming media player
         </div>
 </a>,
 <a class="a-link-normal" href="/product-reviews/B0791TX5P5" title="4.5 out of 5 stars">
 <i class="a-icon a-icon-star a-star-4-5"><span class="a-icon-alt">4.5 out of 5 stars</span></i>
 </a>,
 <a class="a-size-small a-link-normal" href="/product-reviews/B0791TX5P5">17,319</a>,
 <a class="a-link-normal a-text-normal" href="/Fire-TV-Stick-with-Alexa-Voice-Remote/dp/B0791TX5P5?_encoding=UTF8&amp;psc=1"><span class="a-size-ba

Now that you can access all the data you need, let's put this into a loop so that we can process all of the products and create one list with all of the data.

In [83]:
# your code here
products = []
for i in range(len(product_links)):
    if i % 4 == 0:
        info ={'name': product_links[i].text,
               'link': product_links[i]['href']}
        products.append(info)
print(len(products))

51
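
The `href` values in the output above are relative (`/Fire-TV-Stick-.../dp/...`) and the link text is padded with whitespace. A small cleanup step, sketched below, makes each record usable on its own. The helper name `clean_product` is hypothetical; the base URL is taken from the pages scraped above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.com"


def clean_product(name, href, base_url=BASE_URL):
    """Collapse whitespace in the name and turn a relative href into a full URL."""
    return {
        "name": " ".join(name.split()),
        "link": urljoin(base_url, href),
    }


print(clean_product(
    "\n            Fire TV Stick with Alexa Voice Remote, streaming media player\n        ",
    "/Fire-TV-Stick-with-Alexa-Voice-Remote/dp/B0791TX5P5?_encoding=UTF8&psc=1",
))
```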


Now that we have each individual part working, let's wrap this all up in a function that we can run for each product category.


In [None]:
def parse_bestseller_cat(___):
    #your code here
    
    return ___

The next step is to add this function to the larger script we have from above.

## Selenium

The Selenium package is used to automate web browser interaction from Python: a script can open pages, fill in forms, and click buttons in a real browser.

In [None]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

In [None]:
driver = webdriver.Chrome()
driver.get("https://www.instagram.com/accounts/login/")


In [None]:
username = ''
pw = ''

In [None]:
# Selenium 4.3 and later removed the find_element(s)_by_* helpers;
# use find_element(s) together with By instead

#find the element where you input your email
email = driver.find_elements(By.CSS_SELECTOR, 'form input')[0]

#find the element where you input your password
password = driver.find_elements(By.CSS_SELECTOR, 'form input')[1]

#send your keys to those elements
email.send_keys(username)
password.send_keys(pw)

#find the button to login
login = driver.find_element(By.XPATH, '//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[3]/button')

#have the program 'click' on the login button
login.click()


#looking for an interstitial page
try: 
    not_now = WebDriverWait(driver, 15).until(
        lambda d: d.find_element(By.XPATH, '//button[text()="Not Now"]')
    )
    not_now.click()
except Exception:
    # no "Not Now" prompt appeared within 15 seconds
    pass

#now you are logged in, navigate to a new page
driver.get("https://www.instagram.com/foodandprobability")

### Transitioning to Beautiful Soup
Beautiful Soup remains the best way to traverse the DOM and scrape the data. After using Selenium to handle the interactive parts, it is time to ask Beautiful Soup to grab the data that you need.
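
The handoff itself is short: Selenium exposes the current page's HTML as `driver.page_source`, which Beautiful Soup can parse like any other string. The helper name `soup_from_driver` is hypothetical.

```python
from bs4 import BeautifulSoup


def soup_from_driver(driver):
    """Parse whatever page the Selenium driver is currently showing."""
    return BeautifulSoup(driver.page_source, "html.parser")
```

With the `driver` from the cells above, `soup = soup_from_driver(driver)` gives you a soup object, and from there `soup.select(...)` works exactly as in the earlier sections.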