
#**Project: mapIt.py with the webbrowser Module**#
##*The webbrowser module’s open() function can launch a new browser to a specified URL.

In [None]:
import webbrowser

webbrowser.open('https://inventwithpython.com/')

False

##*A web browser tab will open to the URL https://inventwithpython.com/. This is about the only thing the webbrowser module can do. Even so, the open() function does make some interesting things possible. For example, it’s tedious to copy a street address to the clipboard and bring up a map of it on Google Maps. You could take a few steps out of this task by writing a simple script to automatically launch the map in your browser using the contents of your clipboard. This way, you only have to copy the address to the clipboard and run the script, and the map will be loaded for you.

##*This is what your program does:

##*Gets a street address from the command line arguments or clipboard
##*Opens the web browser to the Google Maps page for the address
##*This means your code will need to do the following:

##*Read the command line arguments from sys.argv.
##*Read the clipboard contents.
##*Call the webbrowser.open() function to open the web browser.

#**Step 1: Figure Out the URL**#
##*Based on the instructions in Appendix B, set up mapIt.py so that when you run it from the command line, like so . . .

C:\> mapit 870 Valencia St, San Francisco, CA 94110

##*. . . the script will use the command line arguments instead of the clipboard. If there are no command line arguments, then the program will know to use the contents of the clipboard.

##*First you need to figure out what URL to use for a given street address. When you load https://maps.google.com/ in the browser and search for an address, the URL in the address bar looks something like this: https://www.google.com/maps/place/870+Valencia+St/@37.7590311,-122.4215096,17z/data=!3m1!4b1!4m2!3m1!1s0x808f7e3dadc07a37:0xc86b0b2bb93b73d8.

##*The address is in the URL, but there’s a lot of additional text there as well. Websites often add extra data to URLs to help track visitors or customize sites. But if you try just going to https://www.google.com/maps/place/870+Valencia+St+San+Francisco+CA/, you’ll find that it still brings up the correct page. So your program can be set to open a web browser to 'https://www.google.com/maps/place/your_address_string' (where your_address_string is the address you want to map).
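##*As a minimal sketch of building that URL, the helper below joins the prefix and the address. The name build_maps_url and the use of urllib.parse.quote_plus() are assumptions for illustration; the original script simply concatenates the raw address and lets the browser encode it.

```python
from urllib.parse import quote_plus

MAPS_PREFIX = 'https://www.google.com/maps/place/'

def build_maps_url(address):
    """Return a Google Maps URL for a free-form street address."""
    # quote_plus() turns spaces into '+' and percent-encodes reserved
    # characters such as commas.
    return MAPS_PREFIX + quote_plus(address)

print(build_maps_url('870 Valencia St, San Francisco, CA 94110'))
# https://www.google.com/maps/place/870+Valencia+St%2C+San+Francisco%2C+CA+94110
```

##*Encoding the address yourself keeps characters such as & or # in an address from being misread as part of the URL syntax.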

#**Step 2: Handle the Command Line Arguments**#

In [None]:
import webbrowser, sys

if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])

##*The sys.argv variable stores a list of the program’s filename and command line arguments. If this list has more than just the filename in it, then len(sys.argv) evaluates to an integer greater than 1, meaning that command line arguments have indeed been provided.

##*Command line arguments are usually separated by spaces, but in this case, you want to interpret all of the arguments as a single string. Since sys.argv is a list of strings, you can pass it to the join() method, which returns a single string value. You don’t want the program name in this string, so instead of sys.argv, you should pass sys.argv[1:] to chop off the first element of the list. The final string that this expression evaluates to is stored in the address variable.

##*If you run the program by entering this into the command line . . .

mapit 870 Valencia St, San Francisco, CA 94110

. . . the sys.argv variable will contain this list value:

['mapIt.py', '870', 'Valencia', 'St,', 'San', 'Francisco,', 'CA', '94110']

##*The address variable will contain the string '870 Valencia St, San Francisco, CA 94110'.
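##*To see the splitting and rejoining without running a script from a terminal, the sketch below imitates the shell with shlex.split(). This is an approximation: shlex follows POSIX-style rules, and the Windows command prompt quotes arguments slightly differently. The names typed and fake_argv are made up for the example.

```python
import shlex

# Split the typed command roughly the way a POSIX-style shell would.
typed = 'mapit 870 Valencia St, San Francisco, CA 94110'
words = shlex.split(typed)

# In a real run, sys.argv[0] holds the script's filename.
fake_argv = ['mapIt.py'] + words[1:]
address = ' '.join(fake_argv[1:])

print(fake_argv)
print(address)
```

##*The shell splits on whitespace only, so the commas stay attached to the words before them and the joined string matches what was typed.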

#**Step 3: Handle the Clipboard Content and Launch the Browser**#

In [None]:
import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()
webbrowser.open('https://www.google.com/maps/place/' + address)


False
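##*One way to make the argument-or-clipboard decision easy to check is to put it in a function that receives the clipboard reader as a parameter. This is only a sketch: choose_address and read_clipboard are hypothetical names, and in the real script you would pass pyperclip.paste as the reader.

```python
def choose_address(argv, read_clipboard):
    """Return the address from argv if present, else from the clipboard."""
    if len(argv) > 1:
        return ' '.join(argv[1:])
    # Copied text often carries a trailing newline, so strip it.
    return read_clipboard().strip()

# Lambdas stand in for pyperclip.paste so the example runs anywhere.
print(choose_address(['mapIt.py', '870', 'Valencia', 'St'], lambda: 'unused'))
print(choose_address(['mapIt.py'], lambda: '870 Valencia St\n'))
```

##*Both calls print 870 Valencia St: the first takes it from the arguments, the second from the stand-in clipboard.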

#**Downloading Files from the Web with the requests Module**#
##*The requests module lets you easily download files from the web without having to worry about complicated issues such as network errors, connection problems, and data compression. The requests module doesn’t come with Python, so you’ll have to install it first. From the command line, run pip install --user requests.

In [None]:
!pip install requests

In [None]:
import requests

#**Downloading a Web Page with the requests.get() Function**#
##*The requests.get() function takes a string of a URL to download. By calling type() on requests.get()’s return value, you can see that it returns a Response object, which contains the response that the web server gave for your request.

In [None]:
import requests

res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
type(res)
res.status_code == requests.codes.ok
print(len(res.text))
print(res.text[:1000])


178978
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org/license


Title: Romeo and Juliet

Author: William Shakespeare

Posting Date: May 25, 2012 [EBook #1112]
Release Date: November, 1997  [Etext #1112]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***













*Project Gutenberg is proud to cooperate with The World Library*
in the presentation of The Complete Works of William Shakespeare
for your reading for education and entertainment.  HOWEVER, THIS
IS NEITHER SHAREWARE NOR PUBLIC DOMAIN. . .AND UNDER THE LIBRARY
OF THE FUTURE CONDITIONS OF THIS PRESENTATION. . .NO CHARGES MAY
BE MADE FOR *ANY* ACCESS TO THIS MATERIAL.  YOU ARE ENCOURAGED!

##*The URL goes to a text web page for the entire play of Romeo and Juliet, provided on this book’s site. You can tell that the request for this web page succeeded by checking the status_code attribute of the Response object. If it is equal to the value of requests.codes.ok, then everything went fine. (Incidentally, the status code for “OK” in the HTTP protocol is 200. You may already be familiar with the 404 status code for “Not Found.”) You can find a complete list of HTTP status codes and their meanings at https://en.wikipedia.org/wiki/List_of_HTTP_status_codes.
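##*If you want to look up what a numeric status code means without leaving Python, the standard library’s http.HTTPStatus enum is one option. Note that this is a separate module from requests, shown here only as a lookup aid; the text above compares against requests.codes.ok instead.

```python
from http import HTTPStatus

for code in (200, 404):
    status = HTTPStatus(code)
    print(code, status.name, status.phrase)

# HTTPStatus members compare equal to plain integers.
print(HTTPStatus.OK == 200)
```

##*Because the members behave like integers, a comparison such as res.status_code == HTTPStatus.OK should work the same way as comparing against 200.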

##*If the request succeeded, the downloaded web page is stored as a string in the Response object’s text variable. This variable holds a large string of the entire play; the call to len(res.text) shows you that it is more than 178,000 characters long. Finally, calling print(res.text[:1000]) displays only the first 1,000 characters.

##*If the request failed and displayed an error message, like “Failed to establish a new connection” or “Max retries exceeded,” then check your internet connection. Connecting to servers can be quite complicated, and I can’t give a full list of possible problems here. You can find common causes of your error by doing a web search of the error message in quotes.

In [None]:
res = requests.get('https://inventwithpython.com/page_that_does_not_exist')
res.raise_for_status()

HTTPError: 404 Client Error: Not Found for url: https://inventwithpython.com/page_that_does_not_exist

##*The raise_for_status() method is a good way to ensure that a program halts if a bad download occurs. This is a good thing: You want your program to stop as soon as some unexpected error happens. If a failed download isn’t a deal breaker for your program, you can wrap the raise_for_status() line with try and except statements to handle this error case without crashing.

In [None]:
import requests

res = requests.get('https://inventwithpython.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))


There was a problem: 404 Client Error: Not Found for url: https://inventwithpython.com/page_that_does_not_exist


##*This raise_for_status() method call causes the program to output the following:

There was a problem: 404 Client Error: Not Found for url: https://inventwithpython.com/page_that_does_not_exist

##*Always call raise_for_status() after calling requests.get(). You want to be sure that the download has actually worked before your program continues.

#**Saving Downloaded Files to the Hard Drive**#
##*From here, you can save the web page to a file on your hard drive with the standard open() function and write() method. There are some slight differences, though. First, you must open the file in write binary mode by passing the string 'wb' as the second argument to open(). Even if the page is in plaintext (such as the Romeo and Juliet text you downloaded earlier), you need to write binary data instead of text data in order to maintain the Unicode encoding of the text.

##*To write the web page to a file, you can use a for loop with the Response object’s iter_content() method.

In [None]:
import requests

res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
res.raise_for_status()
playFile = open('RomeoAndJuliet.txt', 'wb')
for chunk in res.iter_content(100000):
    print(playFile.write(chunk))
playFile.close()    # flush the data and release the file

100000
78978


##*The iter_content() method returns “chunks” of the content on each iteration through the loop. Each chunk is of the bytes data type, and you get to specify how many bytes each chunk will contain. One hundred thousand bytes is generally a good size, so pass 100000 as the argument to iter_content().

##*The file RomeoAndJuliet.txt will now exist in the current working directory. Note that while the filename on the website was rj.txt, the file on your hard drive has a different filename. The requests module simply handles downloading the contents of web pages. Once the page is downloaded, it is simply data in your program. Even if you were to lose your internet connection after downloading the web page, all the page data would still be on your computer.

##*The write() method returns the number of bytes written to the file. In the previous example, there were 100,000 bytes in the first chunk, and the remaining part of the file needed only 78,978 bytes.
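##*The chunking arithmetic can be reproduced offline. In the sketch below, iter_chunks is a hypothetical stand-in for iter_content() that slices an in-memory bytes value, and the payload is dummy data of the same size as the downloaded play; nothing is fetched from the web.

```python
import os
import tempfile

def iter_chunks(data, chunk_size):
    """Yield successive chunk_size-byte slices of data."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

payload = b'x' * 178978                  # dummy bytes, same size as the play
path = os.path.join(tempfile.mkdtemp(), 'payload.bin')

written = []
with open(path, 'wb') as out_file:       # the with block closes the file
    for chunk in iter_chunks(payload, 100000):
        written.append(out_file.write(chunk))

print(written)                  # [100000, 78978]
print(os.path.getsize(path))    # 178978
```

##*Only one chunk is held in memory at a time, which is the reason for looping instead of writing the whole download in a single call.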

##*To review, here’s the complete process for downloading and saving a file:

##*Call requests.get() to download the file.
##*Call open() with 'wb' to create a new file in write binary mode.
##*Loop over the Response object’s iter_content() method.
##*Call write() on each iteration to write the content to the file.
##*Call close() to close the file.
##*That’s all there is to the requests module! The for loop and iter_content() stuff may seem complicated compared to the open()/write()/close() workflow you’ve been using to write text files, but it’s to ensure that the requests module doesn’t eat up too much memory even if you download massive files. You can learn about the requests module’s other features from https://requests.readthedocs.org/.

#**HTML**#
##*Before you pick apart web pages, you’ll learn some HTML basics. You’ll also see how to access your web browser’s powerful developer tools, which will make scraping information from the web much easier.

##*A Quick Refresher
##*In case it’s been a while since you’ve looked at any HTML, here’s a quick overview of the basics. An HTML file is a plaintext file with the .html file extension. The text in these files is surrounded by tags, which are words enclosed in angle brackets. The tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags. For example, the following HTML will display Hello, world! in the browser, with Hello in bold:

##**<strong>Hello</strong>, world!

##**The opening <strong> tag says that the enclosed text will appear in bold. The closing </strong> tag tells the browser where the end of the bold text is.

##*There are many different tags in HTML. Some of these tags have extra properties in the form of attributes within the angle brackets. For example, the <a> tag encloses text that should be a link. The URL that the text links to is determined by the href attribute. Here’s an example:

Al's free <a href="https://inventwithpython.com">Python books</a>.
##*Some elements have an id attribute that is used to uniquely identify the element in the page. You will often instruct your programs to seek out an element by its id attribute, so figuring out an element’s id attribute using the browser’s developer tools is a common task in writing web scraping programs.
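##*To make tags and attributes concrete, the sketch below lists every start tag and its attributes in a small HTML string. It uses the standard library’s html.parser module rather than Beautiful Soup, purely so that it runs without extra installs; AttrCollector is a hypothetical helper class.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Record (tag, attribute dict) for every start tag encountered."""
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.found.append((tag, dict(attrs)))

html = ('Al\'s free <a href="https://inventwithpython.com">Python books</a>. '
        'By <span id="author">Al Sweigart</span>')

collector = AttrCollector()
collector.feed(html)
for tag, attrs in collector.found:
    print(tag, attrs)
```

##*The output shows the href attribute on the <a> element and the id attribute on the <span> element, the two kinds of attribute a scraper most often looks for.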

##*Viewing the Source HTML of a Web Page
##*You’ll need to look at the HTML source of the web pages that your programs will work with. To do this, right-click (or CTRL-click on macOS) any web page in your web browser, and select View Source or View page source to see the HTML text of the page (see Figure 12-3). This is the text your browser actually receives. The browser knows how to display, or render, the web page from this HTML.
##*Using the Developer Tools to Find HTML Elements
##*Once your program has downloaded a web page using the requests module, you will have the page’s HTML content as a single string value. Now you need to figure out which part of the HTML corresponds to the information on the web page you’re interested in.

##*This is where the browser’s developer tools can help. Say you want to write a program to pull weather forecast data from https://weather.gov/. Before writing any code, do a little research. If you visit the site and search for the 94105 ZIP code, the site will take you to a page showing the forecast for that area.

##*What if you’re interested in scraping the weather information for that ZIP code? Right-click where it is on the page (or CONTROL-click on macOS) and select Inspect Element from the context menu that appears. This will bring up the Developer Tools window, which shows you the HTML that produces this particular part of the web page. Figure 12-5 shows the developer tools open to the HTML of the nearest forecast. Note that if the https://weather.gov/ site changes the design of its web pages, you’ll need to repeat this process to inspect the new elements.
##*From the developer tools, you can see that the HTML responsible for the forecast part of the web page is <div class="col-sm-10 forecast-text">Sunny, with a high near 64. West wind 11 to 16 mph, with gusts as high as 21 mph.</div>. This is exactly what you were looking for! It seems that the forecast information is contained inside a <div> element with the forecast-text CSS class. Right-click on this element in the browser’s developer console, and from the context menu that appears, select Copy ▸ CSS Selector. This will copy a string such as 'div.row-odd:nth-child(1) > div:nth-child(2)' to the clipboard. You can use this string for Beautiful Soup’s select() or Selenium’s find_element_by_css_selector() methods, as explained later in this chapter. Now that you know what you’re looking for, the Beautiful Soup module will help you find it in the string.

<!-- This is the example.html example file. -->

<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="https://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>

#**Creating a BeautifulSoup Object from HTML**#
##*The bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will parse. The bs4.BeautifulSoup() function returns a BeautifulSoup object.

In [None]:
!pip install --user beautifulsoup4



In [None]:
import requests, bs4

res = requests.get('https://nostarch.com')
res.raise_for_status()
noStarchSoup = bs4.BeautifulSoup(res.text, 'html.parser')
type(noStarchSoup)

##*This code uses requests.get() to download the main page from the No Starch Press website and then passes the text attribute of the response to bs4.BeautifulSoup(). The BeautifulSoup object that it returns is stored in a variable named noStarchSoup.

##*You can also load an HTML file from your hard drive by passing a File object to bs4.BeautifulSoup() along with a second argument that tells Beautiful Soup which parser to use to analyze the HTML.

In [None]:
exampleFile = open('example.html')
exampleSoup = bs4.BeautifulSoup(exampleFile, 'html.parser')
type(exampleSoup)

#**Finding an Element with the select() Method**#
##*You can retrieve a web page element from a BeautifulSoup object by calling the select() method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: they specify a pattern to look for, in this case in HTML pages instead of general text strings.

Examples of CSS Selectors

| Selector passed to the select() method | Will match . . . |
|---|---|
| `soup.select('div')` | All elements named `<div>` |
| `soup.select('#author')` | The element with an id attribute of author |
| `soup.select('.notice')` | All elements that use a CSS class attribute named notice |
| `soup.select('div span')` | All elements named `<span>` that are within an element named `<div>` |
| `soup.select('div > span')` | All elements named `<span>` that are directly within an element named `<div>`, with no other element in between |
| `soup.select('input[name]')` | All elements named `<input>` that have a name attribute with any value |
| `soup.select('input[type="button"]')` | All elements named `<input>` that have an attribute named type with value button |

##*The various selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element. Instead of writing the selector yourself, you can also right-click on the element in your browser and select Inspect Element. When the browser’s developer console opens, right-click on the element’s HTML and select Copy ▸ CSS Selector to copy the selector string to the clipboard and paste it into your source code.

##*The select() method will return a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object’s HTML. Tag values can be passed to the str() function to show the HTML tags they represent. Tag values also have an attrs attribute that shows all the HTML attributes of the tag as a dictionary.

In [None]:
import bs4

exampleFile = open('example.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
elems = exampleSoup.select('#author')
type(elems)

#**Getting Data from an Element’s Attributes**#
##*The get() method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute’s value.

In [None]:
import bs4

soup = bs4.BeautifulSoup(open('example.html'), 'html.parser')
spanElem = soup.select('span')[0]
str(spanElem)

spanElem.get('id')
spanElem.get('some_nonexistent_addr') == None
spanElem.attrs

#**Project: Opening All Search Results**#
##*Whenever I search a topic on Google, I don’t look at just one search result at a time. By middle-clicking a search result link (or clicking while holding CTRL), I open the first several links in a bunch of new tabs to read later. I search Google often enough that this workflow (opening my browser, searching for a topic, and middle-clicking several links one by one) is tedious. It would be nice if I could simply type a search term on the command line and have my computer automatically open a browser with all the top search results in new tabs. Let’s write a script to do this with the search results page for the Python Package Index at https://pypi.org/. A program like this can be adapted to many other websites, although Google and DuckDuckGo often employ measures that make scraping their search results pages difficult.

##*This is what your program does:

##*Gets search keywords from the command line arguments
##*Retrieves the search results page
##*Opens a browser tab for each result
##*This means your code will need to do the following:

##*Read the command line arguments from sys.argv.
##*Fetch the search result page with the requests module.
##*Find the links to each search result.
##*Call the webbrowser.open() function to open the web browser.

#**Step 1: Get the Command Line Arguments and Request the Search Page**#
##*Before coding anything, you first need to know the URL of the search result page. By looking at the browser’s address bar after doing a search, you can see that the result page has a URL like https://pypi.org/search/?q=<SEARCH_TERM_HERE>. The requests module can download this page and then you can use Beautiful Soup to find the search result links in the HTML. Finally, you’ll use the webbrowser module to open those links in browser tabs.
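##*A small sketch of assembling that search URL follows. The helper name build_search_url is an assumption, and urllib.parse.urlencode() is used here to escape the search words, whereas the script in this section concatenates them directly.

```python
from urllib.parse import urlencode

def build_search_url(terms):
    """Return the PyPI search URL for a list of search words."""
    query = urlencode({'q': ' '.join(terms)})
    return 'https://pypi.org/search/?' + query

print(build_search_url(['beautiful', 'soup']))
# https://pypi.org/search/?q=beautiful+soup
```

##*In the real script, the list passed in would be sys.argv[1:].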

In [None]:
import requests, sys, webbrowser, bs4

print('Searching...')    # display text while downloading the search result page
res = requests.get('https://pypi.org/search/?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
# TODO: Retrieve top search result links.

# TODO: Open a browser tab for each result.

Searching...


#**Step 2: Find All the Results**#
##*Now you need to use Beautiful Soup to extract the top search result links from your downloaded HTML. But how do you figure out the right selector for the job? For example, you can’t just search for all <a> tags, because there are lots of links you don’t care about in the HTML. Instead, you must inspect the search result page with the browser’s developer tools to try to find a selector that will pick out only the links you want.

##*After doing a search for Beautiful Soup, you can open the browser’s developer tools and inspect some of the link elements on the page. They can look complicated, something like this: <a class="package-snippet" href="/project/pyautogui/">.

##*It doesn’t matter that the element looks incredibly complicated. You just need to find the pattern that all the search result links have.

In [None]:
import requests, sys, webbrowser, bs4

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('.package-snippet')

#**Step 3: Open Web Browsers for Each Result**#
##*Finally, we’ll tell the program to open web browser tabs for our results.

In [None]:
import requests, sys, webbrowser, bs4

# Open a browser tab for each result.
linkElems = soup.select('.package-snippet')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    urlToOpen = 'https://pypi.org' + linkElems[i].get('href')
    print('Opening', urlToOpen)
    webbrowser.open(urlToOpen)
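##*The URL-building part of that loop can be tried without opening any tabs. The sketch below is an assumption-laden variant: result_urls is a hypothetical helper, the href values are made up to resemble the one shown in Step 2, and urllib.parse.urljoin() replaces the plain string concatenation.

```python
from urllib.parse import urljoin

def result_urls(hrefs, base='https://pypi.org', limit=5):
    """Return absolute URLs for at most `limit` result links."""
    return [urljoin(base, href) for href in hrefs[:limit]]

# Hypothetical href values, shaped like the one shown in Step 2.
hrefs = ['/project/pyautogui/', '/project/requests/', '/project/beautifulsoup4/']
for url in result_urls(hrefs):
    print('Opening', url)
```

##*Slicing with hrefs[:limit] has the same effect as min(5, len(linkElems)): it never reads past the end of a short result list.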

#**Project: Downloading All XKCD Comics**#
##*Blogs and other regularly updating websites usually have a front page with the most recent post as well as a Previous button on the page that takes you to the previous post. Then that post will also have a Previous button, and so on, creating a trail from the most recent page to the first post on the site. If you wanted a copy of the site’s content to read when you’re not online, you could manually navigate over every page and save each one. But this is pretty boring work, so let’s write a program to do it instead.

##*XKCD is a popular geek webcomic with a website that fits this structure (see Figure 12-6). The front page at https://xkcd.com/ has a Prev button that guides the user back through prior comics. Downloading each comic by hand would take forever, but you can write a script to do this in a couple of minutes.
##**Here’s what your program does:

##*Loads the XKCD home page
##*Saves the comic image on that page
##*Follows the Previous Comic link
##*Repeats until it reaches the first comic
##**This means your code will need to do the following:

##*Download pages with the requests module.
##*Find the URL of the comic image for a page using Beautiful Soup.
##*Download and save the comic image to the hard drive with iter_content().
##*Find the URL of the Previous Comic link, and repeat.

#**Step 1: Design the Program**#
##**If you open the browser’s developer tools and inspect the elements on the page, you’ll find the following:

##*The URL of the comic’s image file is given by the src attribute of an <img> element.
##*The <img> element is inside a <div id="comic"> element.
##*The Prev button has a rel HTML attribute with the value prev.
##*The first comic’s Prev button links to the https://xkcd.com/# URL, indicating that there are no more previous pages.

In [None]:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

url = 'https://xkcd.com'               # starting url
os.makedirs('xkcd', exist_ok=True)    # store comics in ./xkcd
while not url.endswith('#'):
    # TODO: Download the page.

    # TODO: Find the URL of the comic image.

    # TODO: Download the image.

    # TODO: Save the image to ./xkcd.

    # TODO: Get the Prev button's url.
    break    # Placeholder so the skeleton runs; remove once the loop body is written.

print('Done.')

##*You’ll have a url variable that starts with the value 'https://xkcd.com' and repeatedly update it (in a while loop) with the URL of the current page’s Prev link. At every step in the loop, you’ll download the comic at url. You’ll know to end the loop when url ends with '#'.

##*You will download the image files to a folder in the current working directory named xkcd. The call os.makedirs() ensures that this folder exists, and the exist_ok=True keyword argument prevents the function from throwing an exception if this folder already exists. The remaining code is just comments that outline the rest of your program.
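##*The effect of exist_ok=True is easy to check in a scratch directory. The sketch below uses tempfile.mkdtemp() so that it does not touch your working directory; the variable names are arbitrary.

```python
import os
import tempfile

base = tempfile.mkdtemp()               # scratch directory for the demo
target = os.path.join(base, 'xkcd')

os.makedirs(target, exist_ok=True)      # creates the folder
os.makedirs(target, exist_ok=True)      # second call does nothing

try:
    os.makedirs(target)                 # exist_ok defaults to False
except FileExistsError:
    print('Folder already exists')
```

##*This is why the download script can be rerun safely: the second run finds the xkcd folder already in place and carries on.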

#**Step 2: Download the Web Page**#
##*Let’s implement the code for downloading the page.

In [None]:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

url = 'https://xkcd.com'               # starting url
os.makedirs('xkcd', exist_ok=True)    # store comics in ./xkcd
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    break    # Temporary: url is not updated until Step 4, so stop after one page.

print('Done.')

##*First, print url so that the user knows which URL the program is about to download; then use the requests module’s requests.get() function to download it. As always, you immediately call the Response object’s raise_for_status() method to throw an exception and end the program if something went wrong with the download. Otherwise, you create a BeautifulSoup object from the text of the downloaded page.

#**Step 3: Find and Download the Comic Image**#

In [None]:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

# Find the URL of the comic image.
comicElem = soup.select('#comic img')
if comicElem == []:
    print('Could not find comic image.')
else:
    comicUrl = 'https:' + comicElem[0].get('src')
    # Download the image.
    print('Downloading image %s...' % (comicUrl))
    res = requests.get(comicUrl)
    res.raise_for_status()

Downloading image https://imgs.xkcd.com/comics/survey_marker.png...


#**Step 4: Save the Image and Find the Previous Comic**#

In [None]:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

# Save the image to ./xkcd.
imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
for chunk in res.iter_content(100000):
    imageFile.write(chunk)
imageFile.close()
# Get the Prev button's url.
prevLink = soup.select('a[rel="prev"]')[0]
url = 'https://xkcd.com' + prevLink.get('href')
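##*To see how the pieces fit together as a loop, the sketch below walks a tiny made-up site held in a dictionary instead of the real XKCD pages. FAKE_PAGES, its image names, and its Prev links are all hypothetical; only the URL handling and the stopping condition mirror the steps above.

```python
import os

# Hypothetical stand-in for the site: page URL -> (image src, Prev href).
FAKE_PAGES = {
    'https://xkcd.com': ('//imgs.xkcd.com/comics/third.png', '/2/'),
    'https://xkcd.com/2/': ('//imgs.xkcd.com/comics/second.png', '/1/'),
    'https://xkcd.com/1/': ('//imgs.xkcd.com/comics/first.png', '#'),
}

url = 'https://xkcd.com'
saved = []
while not url.endswith('#'):
    src, prev_href = FAKE_PAGES[url]
    comic_url = 'https:' + src
    # basename() keeps only the last path component for the local filename.
    saved.append(os.path.join('xkcd', os.path.basename(comic_url)))
    url = 'https://xkcd.com' + prev_href

print(saved)
print(url)
```

##*The loop visits three pages and stops when the first comic’s Prev link turns url into 'https://xkcd.com#'.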

#**Controlling the Browser with the selenium Module**#
##*The selenium module lets Python directly control the browser by programmatically clicking links and filling in login information, almost as though there were a human user interacting with the page. Using selenium, you can interact with web pages in a much more advanced way than with requests and bs4; but because it launches a web browser, it is a bit slower and hard to run in the background if, say, you just need to download some files from the web.

##*Still, if you need to interact with a web page in a way that, say, depends on the JavaScript code that updates the page, you’ll need to use selenium instead of requests. That’s because major ecommerce websites such as Amazon almost certainly have software systems to recognize traffic that they suspect is a script harvesting their info or signing up for multiple free accounts. These sites may refuse to serve pages to you after a while, breaking any scripts you’ve made. The selenium module is much more likely to function on these sites long-term than requests.

##*A major “tell” to websites that you’re using a script is the user-agent string, which identifies the web browser and is included in all HTTP requests. For example, the user-agent string for the requests module is something like 'python-requests/2.21.0'. You can visit a site such as https://www.whatsmyua.info/ to see your user-agent string. Using selenium, you’re much more likely to “pass for human” because not only is Selenium’s user-agent the same as a regular browser’s (for instance, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'), but it has the same traffic patterns: a selenium-controlled browser will download images, advertisements, cookies, and privacy-invading trackers just like a regular browser. However, selenium can still be detected by websites, and major ticketing and ecommerce websites often block browsers controlled by selenium to prevent web scraping of their pages.

#**Starting a selenium-Controlled Browser**#
##*The following examples will show you how to control the Firefox web browser. If you don’t already have Firefox, you can download it for free from https://getfirefox.com/. You can install selenium by running pip install --user selenium from a command line terminal.

In [19]:
!pip install selenium


In [None]:
from selenium import webdriver

browser = webdriver.Firefox()
type(browser)
browser.get('https://inventwithpython.com')

##*You’ll notice when webdriver.Firefox() is called, the Firefox web browser starts up. Calling type() on the value webdriver.Firefox() reveals it’s of the WebDriver data type. And calling browser.get('https://inventwithpython.com') directs the browser to https://inventwithpython.com/.

#**Finding Elements on the Page**#
##*WebDriver objects have quite a few methods for finding elements on a page. They are divided into the find_element_* and find_elements_* methods. The find_element_* methods return a single WebElement object, representing the first element on the page that matches your query. The find_elements_* methods return a list of WebElement objects for every matching element on the page. These method names come from Selenium 3; Selenium 4.3 and later removed them in favor of find_element() and find_elements(), which take a By locator such as By.ID as their first argument.

##**Selenium’s WebDriver Methods for Finding Elements

| Method name | WebElement object/list returned |
|---|---|
| `browser.find_element_by_class_name(name)`<br>`browser.find_elements_by_class_name(name)` | Elements that use the CSS class name |
| `browser.find_element_by_css_selector(selector)`<br>`browser.find_elements_by_css_selector(selector)` | Elements that match the CSS selector |
| `browser.find_element_by_id(id)`<br>`browser.find_elements_by_id(id)` | Elements with a matching id attribute value |
| `browser.find_element_by_link_text(text)`<br>`browser.find_elements_by_link_text(text)` | `<a>` elements that completely match the text provided |
| `browser.find_element_by_partial_link_text(text)`<br>`browser.find_elements_by_partial_link_text(text)` | `<a>` elements that contain the text provided |
| `browser.find_element_by_name(name)`<br>`browser.find_elements_by_name(name)` | Elements with a matching name attribute value |
| `browser.find_element_by_tag_name(name)`<br>`browser.find_elements_by_tag_name(name)` | Elements with a matching tag name (case-insensitive; an `<a>` element is matched by 'a' and 'A') |