## Web Scraping

Web scraping is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. In other words, instead of manually saving the data from websites, the web scraping software will automatically load and extract data from multiple websites as per our requirement.

In this section, we discuss useful Python libraries for web scraping.

The first question to ask before getting started with any Python application is ‘Which libraries do I need?’

For web scraping there are a few different libraries to consider, including:

- Beautiful Soup
- Requests
- Scrapy
- Selenium
- Urllib3

### Requests
Requests is a simple, efficient HTTP library for Python, used for accessing web pages. With Requests we can get the raw HTML of a web page, which can then be parsed to retrieve the data. Before using Requests, let us install it.

#### Installation of Requests
To install Requests, run this command in your terminal of choice:

`pip install requests`



In [1]:
!pip install requests



#### Example
In this example, we make an HTTP GET request for a web page. First we need to import the requests library:


In [2]:
import requests

In the following lines of code, we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/:


In [3]:
url  = 'https://authoraditiagarwal.com/'
response = requests.get(url)


Now we can retrieve the content with the `.text` property. Here we show only its first 150 characters:

In [5]:
response.text[:150]

'<!DOCTYPE html><html lang="en-US" id="html"><head><meta charset="UTF-8" /><meta http-equiv="X-UA-Compatible" content="IE=10" /><link rel="profile" hre'

Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic.
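As a small illustration of the automatic query-string handling, the sketch below builds a request without sending it, so it runs offline. The URL `https://example.com/search` and the parameter names are placeholders chosen for this example, not part of the original notebook.

```python
import requests
from urllib.parse import urlsplit, parse_qs

# Hypothetical endpoint and parameters, used only for illustration.
base_url = "https://example.com/search"
params = {"q": "web scraping", "page": 2}

# Build and prepare the request without sending it over the network.
prepared = requests.Request("GET", base_url, params=params).prepare()

# Requests has encoded the parameters and appended them to the URL.
print(prepared.url)

# Decode the query string again to confirm nothing was lost.
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```

Passing the same `params` dictionary to `requests.get()` would send the request with the same encoded URL.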

We will discuss Requests in more detail later; this was just an introduction to the library. For more, see the quickstart guide:

https://docs.python-requests.org/en/master/user/quickstart/


### Urllib3
urllib3 is another Python library that can be used for retrieving data from URLs, similar to the requests library. You can read more in its technical documentation <a href='https://urllib3.readthedocs.io/en/latest/'>here</a>.

#### Installing
urllib3 can be installed with <a href = 'https://pip.pypa.io/'>pip</a>.

`$ python -m pip install urllib3`

In [7]:
!pip install urllib3



#### Example: Scraping using Urllib3 and BeautifulSoup

In the following example, we scrape a web page by using Urllib3 and BeautifulSoup. We use Urllib3 in place of the requests library to get the raw data (HTML) from the web page. Then we use BeautifulSoup to parse that HTML data.

Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup such as non-closed tags (it is named after "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

If you want to know more about Beautiful Soup, please refer to its documentation <a href = 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'>here</a>.

In [8]:
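import urllib3
from bs4 import BeautifulSoup

# A PoolManager handles connection pooling for all requests
http = urllib3.PoolManager()
# Fetch the raw page; response.data holds the body as bytes
response = http.request('GET', 'https://authoraditiagarwal.com')
# Parse the HTML with the lxml parser (requires the lxml package)
soup = BeautifulSoup(response.data, 'lxml')
print(soup.title)       # the <title> tag itself
print(soup.title.text)  # only the text inside the tag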

<title>Learn and grow together</title>
Learn and grow together


### Selenium
It is an open source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software. We have selenium bindings for Python, Java, C#, Ruby and JavaScript. Here we are going to perform web scraping by using selenium and its Python bindings. 

Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome, Remote etc. The currently supported Python versions are 2.7, 3.5 and above.

For documentation, please visit https://pypi.org/project/selenium/.

#### Installing Selenium

Using the pip command, we can install selenium either in our virtual environment or globally.

`pip install selenium`

In [9]:
!pip install selenium

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


As Selenium requires a driver to interface with the chosen browser, we need to download it. The following list shows different browsers and the links for downloading their drivers.

- Chrome https://sites.google.com/a/chromium.org/

- Edge https://developer.microsoft.com/

- Firefox https://github.com/

- Safari https://webkit.org/

#### Example

This example shows web scraping using Selenium. Selenium can also be used for testing, which is called Selenium testing.

After downloading the driver for the specific version of our browser, we can start programming in Python.

First, we need to import webdriver from selenium as follows:

`from selenium import webdriver`

In [10]:
from selenium import webdriver

Now, provide the path of web driver which we have downloaded as per our requirement −

`path = r'C:\Users\nijat\Desktop\Data Science\Preparation For Interview\Technicl Skil\web Scraping\Chromedriver'`<br>
`browser = webdriver.Chrome(executable_path = path)`

In [11]:
path = r'C:\Users\nijat\Desktop\Data Science\Preparation For Interview\Technicl Skil\web Scraping\Chromedriver'
browser = webdriver.Chrome(executable_path = path)

Now, provide the url which we want to open in that web browser now controlled by our Python script.

In [12]:
browser.get('https://authoraditiagarwal.com/leadershipmanagement')

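We can also locate a particular element by providing its XPath, as we would in lxml. (The `find_element_by_xpath` method used here belongs to Selenium 3, the version installed above; Selenium 4 replaces it with `find_element(By.XPATH, ...)`.)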

In [13]:
browser.find_element_by_xpath('/html/body').click()

You can check the browser, controlled by Python script, for output.

### Scrapy
Scrapy is a fast, open-source web crawling framework written in Python, used to extract the data from the web page with the help of selectors based on XPath. Scrapy was first released on June 26, 2008 licensed under BSD, with a milestone 1.0 releasing in June 2015. It provides us all the tools we need to extract, process and structure the data from websites.

If you want to know more about Scrapy, visit <a href="https://docs.scrapy.org/en/latest/">here</a>.

#### Installing Scrapy
Using the pip command, we can install Scrapy either in our virtual environment or globally.

`pip install scrapy`

We will be using the `Requests` and `Beautiful Soup` modules. But before we start web scraping, let's discuss some important topics.

## Legality of Web Scraping

With Python, we can scrape any website or particular elements of a web page, but is it legal? Before scraping any website, we must know about the legality of web scraping.

Generally, if you are going to use the scraped data for personal use, then there may not be any problem. But if you are going to republish that data, then before doing so you should ask the owner for permission, or do some background research on the site's policies and on the data you are going to scrape.

### Research Required Prior to Scraping
If we are targeting a website for scraping data from it, we need to understand its scale and structure. Following are some of the things we need to analyze before starting web scraping.

#### 1. Analyzing robots.txt
Most publishers allow programmers to crawl their websites to some extent. In other words, publishers want only specific portions of their websites to be crawled. To define this, a website states rules for which portions can be crawled and which cannot. Such rules are defined in a file called robots.txt.

robots.txt is a human-readable file that identifies which portions of a website crawlers are allowed to scrape and which they are not. Publishers can adapt its rules to their own needs. We can check the robots.txt file of a website by appending a slash and robots.txt to the site's URL. For example, to check it for Google.com, we type https://www.google.com/robots.txt and get something like the following:
<code>
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
and so on……..
</code>
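
Rules like these can also be checked from Python. The sketch below uses the standard library's `urllib.robotparser` on a few of the rules from the excerpt, parsed from a local string so that no network access is needed. The `/maps` path is a made-up example of a URL that none of these rules mention.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the excerpt above; parsed locally, no network access.
rules = """\
User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(useragent, url) reports whether the rules permit fetching the URL.
for path in ("/search", "/groups", "/maps"):
    url = "https://www.google.com" + path
    print(path, parser.can_fetch("*", url))

blocked = parser.can_fetch("*", "https://www.google.com/groups")
allowed = parser.can_fetch("*", "https://www.google.com/maps")
```

For a live site, `set_url()` followed by `read()` downloads the real file instead. Note that this parser does plain prefix matching in rule order, so wildcard rules such as `/?hl=*&` may not be interpreted the way the site's own crawler interprets them.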

#### 2. Analyzing Sitemap files
What should you do if you want to crawl a website for updated information? You could crawl every web page to find the updates, but this increases the traffic on that website's server. That is why websites provide sitemap files, which help crawlers locate updated content without needing to crawl every web page. The sitemap standard is defined at http://www.sitemaps.org/protocol.html.
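
To make the idea concrete, here is a minimal sketch that reads a sitemap with the standard library and keeps only the recently modified pages. The sitemap content, the `example.com` URLs and the cutoff date are all invented for illustration; a real sitemap would be downloaded from the website first.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up sitemap in the sitemaps.org format.
sitemap_xml = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-04-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/post-1</loc>
    <lastmod>2021-04-07</lastmod>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(sitemap_xml)

# Collect (location, last modification date) for every listed page.
pages = [
    (url.findtext("sm:loc", namespaces=NS), url.findtext("sm:lastmod", namespaces=NS))
    for url in root.findall("sm:url", NS)
]

# ISO dates (YYYY-MM-DD) compare correctly as plain strings.
cutoff = "2021-04-05"
recent = [loc for loc, lastmod in pages if lastmod >= cutoff]
print(recent)
```

Only the pages in `recent` would need to be fetched again, which is the traffic saving the sitemap is meant to provide.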


#### 3. What is the Size of Website?
Does the size of a website, i.e. the number of its web pages, affect the way we crawl? Certainly yes. If we have only a small number of web pages to crawl, efficiency is not a serious issue. But if the website has millions of web pages, for example Microsoft.com, then downloading each page sequentially would take several months, and efficiency becomes a serious concern.

#### 4. Checking Website’s Size
By checking the number of results from Google’s crawler, we can estimate the size of a website. The results can be filtered with the `site` keyword in a Google search. For example, an estimate of the size of https://authoraditiagarwal.com/ is shown below:

<img src = 'https://www.tutorialspoint.com/python_web_scraping/images/checking_the_size.jpg'>


You can see there are around 60 results, which means it is not a big website and crawling it would not lead to efficiency issues.

#### 5. Which technology is used by website?
Another important question is whether the technology used by a website affects the way we crawl. Yes, it does. But how can we check which technology a website uses? There is a Python library named builtwith that helps us find out.

##### Example

In this example we check the technology used by the website https://authoraditiagarwal.com with the help of the Python library builtwith. Before using this library, we need to install it:

`pip install builtwith`

Now, with the following few lines of code, we can check the technology used by a particular website:

In [14]:
import builtwith
builtwith.parse('http://authoraditiagarwal.com')

{'web-servers': ['Apache'],
 'advertising-networks': ['Google AdSense'],
 'javascript-frameworks': ['Prototype', 'jQuery'],
 'ecommerce': ['WooCommerce'],
 'cms': ['WordPress'],
 'programming-languages': ['PHP'],
 'blogs': ['PHP', 'WordPress']}

#### 6. Who is the owner of website?

The owner of the website also matters: if the owner is known for blocking crawlers, then crawlers must be careful while scraping data from the website. There is a protocol named Whois with which we can find out about the owner of a website.

Let's first install the python-whois library:

`pip install python-whois`

#### Example

In this example we check the owner of a website, say `microsoft.com`, with the help of Whois:

In [16]:
import whois
print (whois.whois('microsoft.com'))

{
  "domain_name": [
    "MICROSOFT.COM",
    "microsoft.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2021-03-12 23:25:32",
    "2021-04-07 12:58:15"
  ],
  "creation_date": [
    "1991-05-02 04:00:00",
    "1991-05-01 21:00:00"
  ],
  "expiration_date": [
    "2022-05-03 04:00:00",
    "2022-05-02 00:00:00"
  ],
  "name_servers": [
    "NS1-205.AZURE-DNS.COM",
    "NS2-205.AZURE-DNS.NET",
    "NS3-205.AZURE-DNS.ORG",
    "NS4-205.AZURE-DNS.INFO",
    "ns1-205.azure-dns.com",
    "ns2-205.azure-dns.net",
    "ns4-205.azure-dns.info",
    "ns3-205.azure-dns.org"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTrans

From here on we will work with the Requests and Beautiful Soup modules. We installed Requests already, so let's install Beautiful Soup now.


### Beautiful Soup

Suppose we want to collect all the hyperlinks from a web page. We can use a parser called BeautifulSoup, documented in detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In simple words, BeautifulSoup is a Python library for pulling data out of HTML and XML files. It is typically used together with Requests, because it cannot fetch a web page by itself and needs an input document to create a soup object. A short Python script can then gather the title of a web page and its hyperlinks.
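
A minimal sketch of such a script is shown here. It parses a small made-up HTML string rather than a live page so that it runs offline; in practice the string would come from an HTTP response such as `response.text`. The page content and the `example.com` links are invented for this example.

```python
from bs4 import BeautifulSoup

# Made-up page used for illustration; normally this would be the body
# of an HTTP response fetched with Requests or urllib3.
page = """
<html><head><title>Sample page</title></head>
<body>
<a href="https://example.com/a">First</a>
<a href="https://example.com/b">Second</a>
<a>No link target</a>
</body></html>
"""

page_soup = BeautifulSoup(page, "html.parser")

# The text of the <title> tag.
title = page_soup.title.string

# href=True keeps only the <a> tags that actually carry an href attribute.
links = [a["href"] for a in page_soup.find_all("a", href=True)]

print(title)
print(links)
```

Filtering with `href=True` avoids a `KeyError` on anchors that have no `href`, such as the third one above.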

#### Installing Beautiful Soup

Using the pip command, we can install Beautiful Soup either in our virtual environment or globally. The package is published on PyPI as `beautifulsoup4`; the `bs4` name used below is a thin wrapper that pulls it in. The `html5lib` and `lxml` parsers used in some examples are separate packages and must be installed as well.

`pip install bs4`

Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It represents the HTML as a set of objects with methods used to parse the HTML. We can navigate the HTML as a tree and/or filter out what we are looking for.

Consider the following HTML:



#### Example 1

In [2]:
%%html
<!DOCTYPE html>
<html>
<head>
<title> Page Title</title>
</head>
<body>
<h3><b id='boldest'>Mansoor Nijatullah</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Osman Mansoor</h3>
<p> Salary: $85,000, 000 </p>
<h3> Zaid Mansoor </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

We can store it as a string in the variable `html`:

In [101]:
html = "<!DOCTYPE html><html><head><title> Page Title</title></head><body><h3><b id='boldest'>Mansoor Nijatullah</b></h3><p> Salary: $ 92,000,000 </p><h3> Osman Mansoor</h3><p> Salary: $85,000, 000 </p><h3> Zaid Mansoor </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. This returns a <code>BeautifulSoup</code> object, which represents the document as a nested data structure:

In [26]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'html5lib')

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters. Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. The <code>BeautifulSoup</code> object can create other types of objects. In this lab, we will cover <code>BeautifulSoup</code> and <code>Tag</code> objects, which for the purposes of this lab are identical, and <code>NavigableString</code> objects.

We can use the method <code>prettify()</code> to display the HTML in the nested structure:

In [5]:
print(soup.prettify())  # Pretty-print this PageElement as a string

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Mansoor Nijatullah
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Osman Mansoor
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Zaid Mansoor
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>


## Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: `Tag`, `NavigableString`, `BeautifulSoup`, and `Comment`.

## Tags
Let's say we want the title of the page and the name of the top-paid data analyst. For this we can use the <code>Tag</code> object, which corresponds to an HTML tag in the original document, for example the tag <code>title</code>.

In [14]:
tag_object = soup.title
print("Tag object:",tag_object)

Tag object: <title> Page Title</title>


Tags have a lot of attributes and methods. You can see most of them in <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree'>Navigating the tree</a> and <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree'>Searching the tree</a>. For now, the most important features of a tag are its name and attributes.

#### Name
Every tag has a name, accessible as `.name`:

In [15]:
tag_object.name

'title'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [16]:
tag = soup.b
tag

<b id="boldest">Mansoor Nijatullah</b>

In [17]:
tag.name = "blockquote"
tag

<blockquote id="boldest">Mansoor Nijatullah</blockquote>

## Attributes
A tag may have any number of attributes. The tag `<b id="boldest">` has an attribute `id` whose value is `boldest`. You can access a tag’s attributes by treating the tag like a dictionary:

First, we can check the type of a tag, which is <code>bs4.element.Tag</code>:

In [10]:
print("Tag object type:",type(tag_object))

Tag object type: <class 'bs4.element.Tag'>


If there is more than one <code>Tag</code> with the same name, the first element with that <code>Tag</code> name is returned. Here this corresponds to the highest-paid data analyst:

In [27]:
tag_object = soup.h3
tag_object

<h3><b id="boldest">Mansoor Nijatullah</b></h3>

The name is enclosed in the bold tag <code>b</code>; it helps to use the tree representation here. We can navigate down the tree to the child tag to get the name. Returning to attributes, dictionary-style access looks like this:

In [18]:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']
# 'boldest'

'boldest'

You can access that dictionary directly as `.attrs`:

In [19]:
tag.attrs

{'id': 'boldest'}

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [20]:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag

<b another-attribute="1" id="verybold">bold</b>

In [21]:
del tag['id']
del tag['another-attribute']
tag

<b>bold</b>

In [22]:
tag['id']

KeyError: 'id'

In [28]:
print(tag.get('id'))

None


### Multi-valued attributes
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:


In [35]:
css_soup = BeautifulSoup('<p class="body"></p>','html.parser')
css_soup.p['class']

['body']

In [36]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>','html.parser')
css_soup.p['class']

['body', 'strikeout']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [37]:
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']

'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

In [38]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>','html.parser')
rel_soup.a['rel']

['index']

In [39]:
rel_soup.a['rel'] = ['index','contents']
print(rel_soup.p)

<p>Back to the <a rel="index contents">homepage</a></p>


You can disable this by passing `multi_valued_attributes=None` as a keyword argument into the BeautifulSoup constructor:

In [40]:
no_list_soup = BeautifulSoup('<p class="body strikeout"></p>','html.parser',multi_valued_attributes=None)
no_list_soup.p['class']

'body strikeout'

You can use `get_attribute_list` to get a value that’s always a list, whether or not it’s a multi-valued attribute:

In [41]:
id_soup.p.get_attribute_list('id')

['my id']

If you parse a document as XML, there are no multi-valued attributes:

In [42]:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

'body strikeout'

Again, you can configure this using the `multi_valued_attributes` argument:

In [44]:
class_is_multi= { '*' : 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']

['body', 'strikeout']

## NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the `NavigableString` class to contain these bits of text:

In [46]:
html = '<b class="boldes">Extremely bold</b>'
soup = BeautifulSoup(html,'html.parser')
tag = soup.b
tag.string

'Extremely bold'

In [47]:
# Let's check its type.
type(tag.string)

bs4.element.NavigableString

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree'>Navigating the tree</a> and <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree'>Searching the tree</a>. You can convert a NavigableString to a Unicode string with `unicode()` (in Python 2) or `str()` (in Python 3):


In [48]:
unicode_string = str(tag.string)
print(unicode_string)
print("Type of :",type(unicode_string))

Extremely bold
Type of : <class 'str'>


You can't edit a string in place, but you can replace one string with another using `replace_with()`:

In [49]:
tag.string.replace_with("No longer bold")
tag

<b class="boldes">No longer bold</b>

NavigableString supports most of the features described in `Navigating the tree` and `Searching the tree`, but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the `.contents` or `.string` attributes, or the `find()` method.

If you want to use a NavigableString outside of Beautiful Soup, you should call `str()` on it (`unicode()` in Python 2) to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.

### BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in `Navigating the tree` and `Searching the tree`.

You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents:


In [51]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
# Find the placeholder string and swap in the parsed footer document
doc.find(string="INSERT FOOTER HERE").replace_with(footer)
# 'INSERT FOOTER HERE'
print(doc)

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>


Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:

In [52]:
soup.name

'[document]'

### Comments and other special strings
`Tag`, `NavigableString`, and `BeautifulSoup` cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The main one you’ll probably encounter is the comment:

In [53]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)

bs4.element.Comment

In [54]:
comment

'Hey, buddy. Want to buy a used parser?'

The Comment object is just a special type of NavigableString, but when it appears as part of an HTML document, a Comment is displayed with special formatting:

In [55]:
print(soup.b.prettify())

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>


Beautiful Soup also defines classes called `Stylesheet`, `Script`, and `TemplateString`, for embedded CSS stylesheets (any strings found inside a `<style>` tag), embedded Javascript (any strings found in a `<script>` tag), and HTML templates (any strings inside a `<template>` tag). These classes work exactly the same way as NavigableString; their only purpose is to make it easier to pick out the main body of the page, by ignoring strings that represent something else. (These classes are new in Beautiful Soup 4.9.0, and the html5lib parser doesn’t use them.)

Beautiful Soup defines classes for anything else that might show up in an XML document: `CData`, `ProcessingInstruction`, `Declaration`, and `Doctype`. Like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here’s an example that replaces the comment with a CDATA block:

In [56]:
from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())

<b>
 <![CDATA[A CDATA block]]>
</b>


#### Navigating the tree

Here’s the “three sisters” HTML document, which serves as the example throughout this section:


In [25]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')

### Going Down

Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

### Navigating using tag names
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say soup.head:

In [58]:
soup.head

<head><title>The Dormouse's story</title></head>

In [59]:
soup.title

<title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first `<b>` tag beneath the `<body>` tag:

In [60]:
soup.body.b

<b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:



In [62]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the `<a>` tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as `find_all()`:

In [63]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

### .contents and .children
A tag's children are available in a list called `.contents`:

In [64]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [65]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [71]:
title_tag = head_tag.contents[0]
title_tag.contents

["The Dormouse's story"]

The BeautifulSoup object itself has children. In this case, the `<html>` tag is a child of the BeautifulSoup object; because `html_doc` begins with a newline, that leading newline string is counted as a child too:

In [72]:
len(soup.contents)

2

In [73]:
soup.contents[0].name

A string does not have .contents, because it can't contain anything:

In [74]:
text = title_tag.contents[0]
text.contents

AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag's children using the .children generator:

In [76]:
for child in title_tag.children:
    print(child)

The Dormouse's story


### .descendants

The .contents and .children attributes only consider a tag's direct children. For instance, the `<head>` tag has a single direct child, the `<title>` tag:

In [77]:
head_tag.contents

[<title>The Dormouse's story</title>]

But the `<title>` tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the `<head>` tag. The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

In [78]:
for child in head_tag.descendants:
    print(child)

<title>The Dormouse's story</title>
The Dormouse's story


The `<head>` tag has only one child, but it has two descendants: the `<title>` tag and the `<title>` tag’s child. The BeautifulSoup object has only two direct children here (the leading newline and the `<html>` tag), but it has a whole lot of descendants:

In [79]:
len(list(soup.children))

2

In [80]:
len(list(soup.descendants))

27

### .string
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

In [81]:
title_tag.string

"The Dormouse's story"

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

In [82]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [83]:
head_tag.string

"The Dormouse's story"

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:


In [84]:
print(soup.html.string)

None


### .strings and stripped_strings

If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

In [85]:
for string in soup.strings:
    print(repr(string))
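
The `.stripped_strings` generator named in the heading works like `.strings`, except that it trims surrounding whitespace and skips strings that consist of whitespace only. The sketch below contrasts the two on a small made-up fragment (not the “three sisters” document), stored in its own variable so it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# Small made-up fragment with extra whitespace around the text.
fragment = BeautifulSoup("<p>  Hello <b> world </b>\n</p>", "html.parser")

# .strings yields every string exactly as it appears, whitespace included.
raw = list(fragment.strings)

# .stripped_strings trims each string and skips the whitespace-only ones.
clean = list(fragment.stripped_strings)

print(raw)
print(clean)
```

Here `raw` holds three strings, including the lone newline before `</p>`, while `clean` holds only `'Hello'` and `'world'`.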

### Going Up

Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.

### .parent
You can access an element's parent with the .parent attribute. In the example "three sisters" document, the `<head>` tag is the parent of the `<title>` tag:


In [86]:
title_tag = soup.title
title_tag

<title>The Dormouse's story</title>

In [87]:
title_tag.parent

<head><title>The Dormouse's story</title></head>

In [88]:
title_tag.string.parent

<title>The Dormouse's story</title>

The parent of the top level tag like `<html>` is the BeautifulSoup object itself.

In [89]:
html_tag = soup.html
type(html_tag.parent)

bs4.BeautifulSoup

In [91]:
print(soup.parent)

None


### .parents

You can iterate over all of an element’s parents with .parents. This example uses .parents to travel from an` <a> `tag buried deep within the document, to the very top of the document:

In [92]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [93]:
for parent in link.parents:
    print(parent.name)

p
body
html
[document]


### Going sideways
Consider a simple document like this:

In [94]:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())

<a>
 <b>
  text1
 </b>
 <c>
  text2
 </c>
</a>


The `<b>` tag and the `<c>` tag are at the same level: they’re both direct children of the same tag. We call them siblings. When a document is `pretty-printed,` siblings show up at the same indentation level. You can also use this relationship in the code you write.
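
As a minimal sketch of using the sibling relationship in code, the `.next_sibling` and `.previous_sibling` attributes move between elements on the same level. The example re-creates the small `<a><b>...</b><c>...</c></a>` document under a separate name, `demo`, so that it is self-contained.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

# <c> comes right after <b>, and <b> right before <c>.
next_of_b = demo.b.next_sibling
prev_of_c = demo.c.previous_sibling
print(next_of_b)
print(prev_of_c)

# <b> has nothing before it and <c> nothing after it, so both are None.
print(demo.b.previous_sibling, demo.c.next_sibling)
```

In real documents the next sibling of a tag is often a whitespace string rather than another tag, which is why the iteration examples in the following section print commas and newlines between the links.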

### .next_siblings and .previous_siblings

You can iterate over a tag’s siblings with .next_siblings or `.previous_siblings`:

In [96]:
type(soup.a.next_siblings)

generator

In [97]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'


In [98]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
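
As the output shows, the siblings of a tag include the text between tags as well as the tags themselves. When only the tags are of interest, they can be filtered out. The sketch below uses a shortened stand-in for the "three sisters" paragraph (an assumption) and shows two ways to keep only tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the "three sisters" paragraph (hypothetical).
doc = (
    "<p><a id='link1'>Elsie</a>,\n"
    "<a id='link2'>Lacie</a> and\n"
    "<a id='link3'>Tillie</a>;</p>"
)
soup = BeautifulSoup(doc, "html.parser")
first = soup.find(id="link1")

# Option 1: keep only siblings that are Tag objects, skipping the strings.
sibling_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

# Option 2: let Beautiful Soup filter the siblings by tag name.
same_ids = [s["id"] for s in first.find_next_siblings("a")]

print(sibling_ids, same_ids)
```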


### Going back and forth
Beautiful Soup also provides `.next_element` and `.previous_element` for moving through the document in the order it was parsed. They are not covered here; to learn more, see the Beautiful Soup documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

### Children, Parents, and Siblings

As stated above, a <code>Tag</code> object is part of a tree of objects. We can access the child of a tag, or navigate down the branch, as follows:

In [100]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
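<h3><b id='boldest'>Mansoor Nijatullah</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Osman Mansoor</h3>
<p> Salary: $85,000, 000 </p>
<h3> Zaid Mansoor </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

In [24]:
# store the document above as a string so that it can be parsed
html = "<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Mansoor Nijatullah</b></h3><p> Salary: $ 92,000,000 </p><h3> Osman Mansoor</h3><p> Salary: $85,000, 000 </p><h3> Zaid Mansoor </h3><p> Salary: $73,200, 000</p></body></html>"
soup = BeautifulSoup(html,'html5lib')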

In [106]:
tag_object=soup.title
print("tag object:",tag_object)

tag object: <title> Page Title</title>


In [107]:
tag_object=soup.h3
tag_object

<h3><b id="boldest">Mansoor Nijatullah</b></h3>

In [108]:
tag_child =tag_object.b
tag_child

<b id="boldest">Mansoor Nijatullah</b>

You can access the parent with the `parent` attribute:

In [109]:
parent_tag=tag_child.parent
parent_tag

<h3><b id="boldest">Mansoor Nijatullah</b></h3>

The parent of <code>tag_object</code> is the <code>body</code> element:

In [110]:
tag_object.parent

<body><h3><b id="boldest">Mansoor Nijatullah</b></h3><p> Salary: $ 92,000,000 </p><h3> Osman Mansoor</h3><p> Salary: $85,000, 000 </p><h3> Zaid Mansoor </h3><p> Salary: $73,200, 000</p></body>

The next sibling of <code>tag_object</code> is the paragraph (<code>p</code>) element:

In [111]:
sibling_1 = tag_object.next_sibling
sibling_1

<p> Salary: $ 92,000,000 </p>

`sibling_2` is the next heading (`h3`) element, which is a sibling of both `sibling_1` and `tag_object`:

In [34]:
sibling_2=sibling_1.next_sibling
sibling_2

<h3> Osman Mansoor</h3>

<h3 id="first_question">Exercise: <code>next_sibling</code></h3>

Use the object <code>sibling_2</code> and the attribute <code>next_sibling</code> to find the salary of Osman Mansoor:

In [124]:
sibling_2.next_sibling.contents

[' Salary: $85,000, 000 ']

### HTML Attributes
If a tag has attributes, you can read them from the tag. For example, the <code>b</code> tag above, written with <code>id="boldest"</code>, has an attribute <code>id</code> whose value is <code>boldest</code>. You can access a tag’s attributes by treating the tag like a dictionary:


In [125]:
tag_child['id']

'boldest'

You can access that dictionary directly as <code>attrs</code>:

In [126]:
tag_child.attrs

{'id': 'boldest'}
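
Dictionary-style access raises a `KeyError` when the attribute is missing, so it can help to use `get`, which behaves like `dict.get`. A minimal sketch, assuming a one-tag document built only for this illustration:

```python
from bs4 import BeautifulSoup

# One-tag document used only for this illustration.
tag = BeautifulSoup("<b id='boldest'>Mansoor Nijatullah</b>", "html.parser").b

present = tag.get("id")          # value of an attribute that exists
missing = tag.get("class")       # None, because <b> has no class attribute
fallback = tag.get("class", [])  # a default can be supplied, as with dict.get

try:
    tag["class"]                 # dictionary-style access fails on a missing attribute
    raised = False
except KeyError:
    raised = True

print(present, missing, fallback, raised, tag.has_attr("id"))
```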

<h2 id="filter">Kinds of filters</h2>

Filters allow you to find complex patterns; the simplest filter is a string. In this section we pass a string to different filter methods, and Beautiful Soup performs a match against that exact string. Consider the following HTML of rocket launches:

In [1]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td>
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |


We can store it as a string in the variable <code>table</code>:


In [3]:
table="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"

In [4]:
from bs4 import BeautifulSoup
table_bs = BeautifulSoup(table, 'html5lib')
soup = BeautifulSoup(table,'html5lib')

### find_all()

The <code>find_all()</code> method looks through a tag’s descendants and retrieves all descendants that match your filters.

<p>
The method signature is <code>find_all(name, attrs, recursive, string, limit, **kwargs)</code>.
</p>


### Name
When we set the <code>name</code> parameter to a tag name, the method returns every tag with that name, each together with its children.

In [17]:
table_rows = table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>]

The result is a Python iterable that behaves like a list; each element is a <code>Tag</code> object:


In [18]:
first_row =table_rows[0]
print(type(table_rows))
first_row

<class 'bs4.element.ResultSet'>


<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>

We can obtain the first <code>td</code> child of the row:

In [139]:
first_row.td

<td id="flight">Flight No</td>

If we iterate through the list, each element corresponds to a row in the table:

In [19]:
for i, row in enumerate(table_rows):
    print("row",i,"is",row)

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>


As <code>row</code> is a <code>Tag</code> object, we can apply the method <code>find_all</code> to it and extract the table cells into the object <code>cells</code> using the tag name <code>td</code>; this returns all the descendants named <code>td</code>. The result is a list in which each element corresponds to a cell and is a <code>Tag</code> object, so we can iterate through this list as well. The text of a cell is available through its <code>string</code> attribute.


In [20]:
for i,row in enumerate(table_rows):
    print("row",i)
    cells=row.find_all('td')
    for j,cell in enumerate(cells):
        print('column',j,"cell",cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>
column 2 cell <td>80 kg</td>
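
The nested loop above prints whole `td` tags. To collect only the text of each cell, one option is `get_text`, which also works when a cell holds more than one child (in that case `string` returns `None`). The sketch below uses a cleaned-up two-row version of the table; the extra text `(US)` is a hypothetical addition that creates such a mixed cell.

```python
from bs4 import BeautifulSoup

# Cleaned-up excerpt of the launch table; "(US)" is added only to create a mixed cell.
table = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a> (US)</td><td>94 kg</td></tr>"
    "</table>"
)
soup = BeautifulSoup(table, "html.parser")

# One list of cell texts per table row.
rows = [
    [cell.get_text(" ", strip=True) for cell in row.find_all("td")]
    for row in soup.find_all("tr")
]

# This cell contains a tag and a string, so .string cannot pick a single value.
mixed_cell = soup.find_all("tr")[2].find_all("td")[1]

for row in rows:
    print(row)
print(mixed_cell.string)
```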


If we pass a list, Beautiful Soup matches against any item in that list:

In [142]:
list_input = table_bs.find_all(name=['tr','td'])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>,
 <td>80 kg</td>]

### Attributes

If a keyword argument is not recognized, it is turned into a filter on the tag’s attributes. For example, if we pass the <code>id</code> argument, Beautiful Soup filters against each tag’s <code>id</code> attribute. The first <code>td</code> element has an <code>id</code> value of <code>flight</code>, so we can filter on that <code>id</code> value:


In [5]:
table_bs.find_all(id='flight')

[<td id="flight">Flight No</td>]

We can find all the elements that have links to the Florida Wikipedia page:

In [6]:
list_input = table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

If we set the <code>href</code> argument to True, the code finds all tags that have an <code>href</code> attribute, regardless of its value:

In [7]:
table_bs.find_all(href=True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

There are other methods for dealing with attributes and related searches; see the <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors'>Beautiful Soup documentation</a>.

<h3 id="exer_type">Exercise: <code>find_all</code></h3>

Using the logic above, find all the elements without an <code>href</code> value. Listing every <code>a</code> tag shows that two of them have no <code>href</code>:

In [13]:
table_bs.find_all('a')

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a></a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a> </a>]

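Alternatively, we can pass <code>href=False</code>, which matches every tag in the document that has no <code>href</code> attribute: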

In [21]:
table_bs.find_all(href=False)

[<html><head></head><body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body></html>,
 <head></head>,
 <body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body>,
 <table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <t
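
The search with `href=False` returns every tag that lacks an `href`, including `html`, `head` and `body`. If only the anchor tags without a link are wanted, one possible approach is to list the `a` tags first and then keep those for which `has_attr("href")` is false. The single-cell document below is a stand-in built for this sketch.

```python
from bs4 import BeautifulSoup

# One table cell with a linked anchor followed by an empty anchor (stand-in data).
cell = "<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a><a></a></td>"
soup = BeautifulSoup(cell, "html.parser")

anchors = soup.find_all("a")
# Keep only the anchors that carry no href attribute at all.
without_href = [a for a in anchors if not a.has_attr("href")]

print(len(anchors), without_href)
```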

Using the soup object <code>soup</code>, find the element with the <code>id</code> attribute content set to <code>"boldest"</code>.

In [5]:
# html_doc is a minimal stand-in (an assumption) for the document parsed earlier
html_doc = "<h3><b id='boldest'>Mansoor Nijatullah</b></h3>"
soup = BeautifulSoup(html_doc,'html5lib')
soup.find_all(id='boldest')

[<b id="boldest">Mansoor Nijatullah</b>]

### string

With <code>string</code> you can search for strings instead of tags. Here we find all the strings that are exactly Florida:

In [35]:
table_bs.find_all(string="Florida")

['Florida', 'Florida']
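
A plain string only matches text that is exactly equal to it, so a cell such as `Florida ` with a trailing space would be skipped. Passing a compiled regular expression to `string` matches partial text instead. A short sketch on a one-row stand-in document:

```python
import re

from bs4 import BeautifulSoup

# One-row stand-in; note the trailing space in the last cell.
row = "<tr><td>Florida</td><td>Texas</td><td>Florida </td></tr>"
soup = BeautifulSoup(row, "html.parser")

exact = soup.find_all(string="Florida")                # exact match only
partial = soup.find_all(string=re.compile("Florida"))  # any string containing "Florida"

print(exact, partial)
```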

## find
The <code>find_all()</code> method scans the entire document looking for results. If you are looking for only one element, you can use the <code>find()</code> method to find the first matching element in the document. Consider the following two tables:


In [36]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>


| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza Place | Orders | Slices |
|---|---|---|
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |


In [6]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

We create a `BeautifulSoup` object `tow_tables_bs`:

In [7]:
tow_tables_bs = BeautifulSoup(two_tables,'html.parser')
tow_tables_bs

<h3>Rocket Launch </h3><p><table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table></p>

We can find the first table using the tag name `table`:

In [42]:
tow_tables_bs.find("table")

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

We can filter on the class attribute to find the second table. Because `class` is a reserved keyword in Python, the argument is spelled `class_`, with a trailing underscore:

In [9]:
tow_tables_bs.find("table",class_='pizza')

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
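
The same filter can also be written with the `attrs` dictionary, which avoids the keyword clash altogether. It is also worth knowing that `find` returns `None` when nothing matches. The sketch below uses a shortened two-table document; the class name `burger` is hypothetical and chosen so that no table matches.

```python
from bs4 import BeautifulSoup

# Shortened two-table stand-in for the document above.
doc = (
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Orders</td></tr></table>"
)
soup = BeautifulSoup(doc, "html.parser")

pizza = soup.find("table", class_="pizza")
same = soup.find("table", attrs={"class": "pizza"})  # equivalent spelling
missing = soup.find("table", class_="burger")        # no such table, so None

print(pizza["class"], pizza is same, missing)
```

Checking the result for `None` before using it avoids an `AttributeError` when the expected element is not on the page.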

<h2 id="DSCW">Downloading And Scraping The Contents Of A Web Page</h2> 

We download the contents of the web page:

`url="http://www.ibm.com"`

We use <code>get</code> to download the contents of the web page in text format and store them in a variable called <code>data</code>:

`data = requests.get(url).text` 

We create the BeautifulSoup object using the <code>BeautifulSoup</code> constructor:

`soup = BeautifulSoup(data,"html5lib")`


In [10]:
import requests
url = "http://www.ibm.com"
data  = requests.get(url).text 
soup = BeautifulSoup(data,"html5lib")  # create a soup object using the variable 'data'
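
The cell above assumes the download succeeds. A slightly more defensive variant sets a timeout and raises an error on a failed HTTP status before parsing. The helper name `fetch_soup` and the 10 second default are assumptions made for this sketch, and the built-in `html.parser` is used here instead of `html5lib`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return a BeautifulSoup object.

    Raises requests.HTTPError for 4xx/5xx responses and
    requests.Timeout if the server does not answer in time.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

With this helper, the cell above could be written as `soup = fetch_soup("http://www.ibm.com")`.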

Let's scrape all the links:

In [19]:
for link in soup.find_all('a',href=True): # in HTML an anchor/link is represented by the tag <a>
    print(link.get('href'))

https://www.ibm.com/in/en
https://www.ibm.com/sitemap/in/en
https://www.ibm.com/in-en/services/applications?lnk=inhpv11
https://www.ibm.com/in-en/about#2825811
https://in.newsroom.ibm.com/2021-05-26-IBMs-response-to-COVID-19?lnk=inhpv11
/in-en/security/zero-trust
/in-en/events/cloud-podcasts
https://developer.ibm.com/callforcode/?utm_medium=OSocial&utm_source=Blog&utm_content=000031SE&utm_term=10008401&utm_id=Homepage-inside
https://www.ibm.com/blogs/journey-to-ai/2021/03/ibm-is-named-a-leader-2021-magic-quadrant-for-data-science-and-machine-learning-platforms/
/in-en/products/offers-and-discounts
/in-en/cloud/free
/in-en/products/cloud-pak-for-data
/in-en/security/identity-access-management/cloud-identity
https://www.ibm.com/account/reg/signup?formid=urx-46597&lnk=STW_IN_HP_T2_BLK&psrc=NONE&pexp=DEF&lnk2=trial_RoboticProcess
/products/digital-learning-subscription/pricing?lnk=STW_IN_HP_T5_BLK&psrc=NONE&pexp=DEF&lnk2=trial_training
/in-en/cloud/aspera
https://developer.ibm.com/depmodel
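
Several of the links above are relative (they start with `/`), and image sources may start with `//`. To turn them into full URLs, `urllib.parse.urljoin` from the standard library can combine each one with the address of the page. In this sketch the base URL and the small page fragment, including the file name `logo.jpg`, are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed address of the page the links were scraped from.
base_url = "https://www.ibm.com/in-en"

# Small stand-in page with a relative link, an absolute link and a
# protocol-relative image source.
page = (
    "<a href='/in-en/cloud/free'>Free tier</a>"
    "<a href='https://developer.ibm.com/depmodel'>Developer</a>"
    "<img src='//1.cms.s81c.com/sites/default/files/logo.jpg'/>"
)
soup = BeautifulSoup(page, "html.parser")

# urljoin leaves absolute URLs untouched and completes the others.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
images = [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]

print(links)
print(images)
```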

Scrape all image tags:

In [22]:
for link in soup.find_all('img'): 
    print(link)
    print(link.get('src'))
    print()

<img alt="A woman wearing mask while working" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-05-26/Small-rbf07708.jpg"/>
//1.cms.s81c.com/sites/default/files/2021-05-26/Small-rbf07708.jpg

<img alt="finger touching digital lock" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-07-12/Small-finger-padlock-orbit%20%281%29.jpg"/>
//1.cms.s81c.com/sites/default/files/2021-07-12/Small-finger-padlock-orbit%20%281%29.jpg

<img alt="a mic for recording" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-03-26/Logo-Lockup-444-w-x-320h-px.jpg"/>
//1.cms.s81c.com/sites/default/files/2021-03-26/Logo-Lockup-444-w-x-320h-px.jpg

<img alt="a green field" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-05-02/Original-CFC-444x320%20%281%29.jpg"/>
//1.cms.s81c.com/sites/default/files/2021-05-02/Original-CFC-444x320%20%281%29.jpg

<img alt="A leader climbing stairs" class="" loading="lazy" src="//1.cms.s81c.com/sites/

## Scrape data from HTML tables

In [23]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before proceeding to scrape a web site, you need to examine the contents, and the way data is organized on the website. Open the above url in your browser and check how many rows and columns are there in the color table.

In [26]:
# get the contents of the webpage in text format and store in a variable called data
data = requests.get(url).text

In [27]:
soup = BeautifulSoup(data,'html5lib')

Find the first table in the web page:

In [37]:
table = soup.find('table')
type(table)

bs4.element.Tag

In [38]:
#Get all rows from the table
for row in table.find_all('tr'): # in HTML a table row is represented by the tag <tr>
    # get all the cells in each row
    cols = row.find_all('td') # in HTML a table cell is represented by the tag <td>
    color_name = cols[2].string  # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{} ---->{}".format(color_name,color_code))

Color Name ---->None
lightsalmon ---->#FFA07A
salmon ---->#FA8072
darksalmon ---->#E9967A
lightcoral ---->#F08080
coral ---->#FF7F50
tomato ---->#FF6347
orangered ---->#FF4500
gold ---->#FFD700
orange ---->#FFA500
darkorange ---->#FF8C00
lightyellow ---->#FFFFE0
lemonchiffon ---->#FFFACD
papayawhip ---->#FFEFD5
moccasin ---->#FFE4B5
peachpuff ---->#FFDAB9
palegoldenrod ---->#EEE8AA
khaki ---->#F0E68C
darkkhaki ---->#BDB76B
yellow ---->#FFFF00
lawngreen ---->#7CFC00
chartreuse ---->#7FFF00
limegreen ---->#32CD32
lime ---->#00FF00
forestgreen ---->#228B22
green ---->#008000
powderblue ---->#B0E0E6
lightblue ---->#ADD8E6
lightskyblue ---->#87CEFA
skyblue ---->#87CEEB
deepskyblue ---->#00BFFF
lightsteelblue ---->#B0C4DE
dodgerblue ---->#1E90FF
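
The first line of the output pairs `Color Name` with `None`: the header row is treated like data, and `string` returns `None` when a cell has more than one child. One way around both is to skip the first row and collect the pairs in a dictionary. The table below is a small hypothetical stand-in; the real page layout, in particular the header cell with a line break, is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt in the same four-column layout as the color table.
html = (
    "<table>"
    "<tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code<br>#RRGGBB</td></tr>"
    "<tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>"
    "<tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>"
    "</table>"
)
table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# The header cell holds two strings and a <br> tag, so .string is None.
header_code = rows[0].find_all("td")[3].string

colors = {}
for row in rows[1:]:              # skip the header row
    cols = row.find_all("td")
    if len(cols) < 4:             # ignore rows that do not have all four cells
        continue
    colors[cols[2].get_text(strip=True)] = cols[3].get_text(strip=True)

print(header_code)
print(colors)
```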


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas

In [39]:
import pandas as pd

In [40]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

Before proceeding to scrape a web site, you need to examine the contents, and the way data is organized on the website. Open the above url in your browser and check the tables on the webpage.

In [45]:
# get the contents of the webpage in text format and store in a variable called data
data = requests.get(url).text

In [46]:
soup = BeautifulSoup(data,'html5lib')

In [47]:
#find all html tables in the web page.
tables = soup.find_all('table')  # in html table is represented by the tag <table>

In [50]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

26

Assume that we are looking for the `10 most densely populated countries` table. We can look through the `tables` list and find the right one based on the data in each table, or we can search for the table name if it appears in the table, although this option might not always work.

In [52]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

5
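
The loop above leaves `table_index` undefined when no table matches, and it keeps the last match if several tables contain the text. A possible alternative, sketched here on a tiny stand-in page, looks only at each table's `caption` and falls back to `None` when nothing is found. The first caption in the stand-in page is hypothetical.

```python
from bs4 import BeautifulSoup

# Tiny stand-in page with two captioned tables.
page = (
    "<table><caption>World population milestones</caption></table>"
    "<table><caption>10 most densely populated countries</caption></table>"
)
tables = BeautifulSoup(page, "html.parser").find_all("table")

wanted = "10 most densely populated countries"

# Index of the first table whose caption contains the text, or None.
table_index = next(
    (
        index
        for index, table in enumerate(tables)
        if table.caption is not None and wanted in table.caption.get_text()
    ),
    None,
)

print(table_index)
```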


See if you can locate the table caption, `10 most densely populated countries`, in the output below.

In [53]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
 </caption>
 <tbody>
  <tr>
   <th>
    Rank
   </th>
   <th>
    Country
   </th>
   <th>
    Population
   </th>
   <th>
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th>
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="3456" data-file-width="5184" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/35px-Flag_of_Singapore.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singa

In [74]:
rows = []  # one dictionary per table row

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if (col != []):  # the header row contains only <th> cells, so it is skipped
        rank = col[0].text
        country = col[1].text
        population = col[2].text.strip()
        area = col[3].text.strip()
        density = col[4].text.strip()
        rows.append({"Rank":rank, "Country":country, "Population":population, "Area":area, "Density":density})

# DataFrame.append was removed in pandas 2.0, so the DataFrame is built from the list of rows
population_data = pd.DataFrame(rows, columns=["Rank", "Country", "Population", "Area", "Density"])


In [75]:
population_data

|   | Rank | Country | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 171040000 | 143998 | 1188 |
| 2 | 3 | Lebanon | 6856000 | 10452 | 656 |
| 3 | 4 | Taiwan | 23604000 | 36193 | 652 |
| 4 | 5 | South Korea | 51781000 | 99538 | 520 |
| 5 | 6 | Rwanda | 12374000 | 26338 | 470 |
| 6 | 7 | Haiti | 11578000 | 27065 | 428 |
| 7 | 8 | Netherlands | 17620000 | 41526 | 424 |
| 8 | 9 | Israel | 9370000 | 22072 | 425 |
| 9 | 10 | India | 1379660000 | 3287240 | 420 |
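
Values taken from `.text` are strings, so the columns cannot be summed or sorted numerically until they are converted. A minimal sketch of the conversion with `pd.to_numeric` follows; the two rows reuse figures from the table above, and the thousands separators are an assumption about how the raw cell text might look.

```python
import pandas as pd

# Two rows in the same shape as population_data; the commas are assumed.
population_data = pd.DataFrame({
    "Country": ["Singapore", "Bangladesh"],
    "Population": ["5,704,000", "171,040,000"],
    "Density": ["8,033", "1,188"],
})

for column in ["Population", "Density"]:
    # Drop the separators, then convert the strings to numbers.
    cleaned = population_data[column].str.replace(",", "", regex=False)
    population_data[column] = pd.to_numeric(cleaned)

print(population_data.dtypes)
```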


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html

Using the same `url`, `data`, `soup`, and `tables` objects as in the last section, we can use the `read_html` function to create a DataFrame.

Remember the table we need is located in `tables[table_index]`

We can now use the `pandas` function `read_html` and give it the string version of the table as well as the `flavor` which is the parsing engine `bs4`.


In [89]:
x = pd.read_html(str(tables[5]), flavor='bs4')
x[0]

|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 171040000 | 143998 | 1188 |
| 2 | 3 | Lebanon | 6856000 | 10452 | 656 |
| 3 | 4 | Taiwan | 23604000 | 36193 | 652 |
| 4 | 5 | South Korea | 51781000 | 99538 | 520 |
| 5 | 6 | Rwanda | 12374000 | 26338 | 470 |
| 6 | 7 | Haiti | 11578000 | 27065 | 428 |
| 7 | 8 | Netherlands | 17620000 | 41526 | 424 |
| 8 | 9 | Israel | 9370000 | 22072 | 425 |
| 9 | 10 | India | 1379660000 | 3287240 | 420 |


The function `read_html` always returns a list of DataFrames so we must pick the one we want out of the list.

In [93]:
population_data_read_html = pd.read_html(str(tables[5]),flavor='bs4')[0]
population_data_read_html

|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 171040000 | 143998 | 1188 |
| 2 | 3 | Lebanon | 6856000 | 10452 | 656 |
| 3 | 4 | Taiwan | 23604000 | 36193 | 652 |
| 4 | 5 | South Korea | 51781000 | 99538 | 520 |
| 5 | 6 | Rwanda | 12374000 | 26338 | 470 |
| 6 | 7 | Haiti | 11578000 | 27065 | 428 |
| 7 | 8 | Netherlands | 17620000 | 41526 | 424 |
| 8 | 9 | Israel | 9370000 | 22072 | 425 |
| 9 | 10 | India | 1379660000 | 3287240 | 420 |
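
In pandas 2.1 and later, passing a literal HTML string to `read_html` is deprecated; the string is expected to be wrapped in `io.StringIO`. The sketch below shows that form on a small hand-written table (an assumption made to keep the example offline). It relies on whichever HTML parser pandas finds installed, such as `lxml` or `bs4` with `html5lib`.

```python
from io import StringIO

import pandas as pd

# Hand-written stand-in for str(tables[table_index]).
table_html = """
<table>
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td>Singapore</td></tr>
  <tr><td>2</td><td>Bangladesh</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table found.
frames = pd.read_html(StringIO(table_html))
first = frames[0]

print(len(frames))
print(first)
```

Applied to the cells above, the call would take the form `pd.read_html(StringIO(str(tables[table_index])), flavor='bs4')`.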


## Scrape data from HTML tables into a DataFrame using read_html

We can also use the `read_html` function to directly get DataFrames from a `url`.



In [94]:
dataframe_list = pd.read_html(url,flavor='bs4')

We can see there are 26 DataFrames, the same number of tables we found when we used `find_all` on the `soup` object.

In [95]:
len(dataframe_list)

26

We can also use the `match` parameter to select the specific table we want. If the table contains a string matching the text it will be read.

In [96]:
pd.read_html(url,match="10 most densely populated countries",flavor='bs4')[0]

|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 171040000 | 143998 | 1188 |
| 2 | 3 | Lebanon | 6856000 | 10452 | 656 |
| 3 | 4 | Taiwan | 23604000 | 36193 | 652 |
| 4 | 5 | South Korea | 51781000 | 99538 | 520 |
| 5 | 6 | Rwanda | 12374000 | 26338 | 470 |
| 6 | 7 | Haiti | 11578000 | 27065 | 428 |
| 7 | 8 | Netherlands | 17620000 | 41526 | 424 |
| 8 | 9 | Israel | 9370000 | 22072 | 425 |
| 9 | 10 | India | 1379660000 | 3287240 | 420 |


## Let's do a small project

## Project Overview