Step 1: Inspect the website (if using Chrome, you can right-click and select inspect)

Step 2: Request the website's URL from code and download the HTML content of the page

Step 3: Format the downloaded content into a readable format

Step 4: Extract the useful information and save it in a structured format

Step 5: For information displayed on multiple pages of the website, you may need to repeat steps 2–4 to have the complete information.
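
The steps above can be sketched end to end as a loop. This is only a minimal illustration: so that it runs without network access, a small dictionary (`FAKE_SITE`, hypothetical) stands in for the downloads of step 2, and the markup, the `li` element with class `next`, and the helper names are assumptions made for the example. Against a real site, `fake_fetch` would be replaced by a function that calls `requests.get`.

```python
from bs4 import BeautifulSoup

# Hypothetical two-page site held in memory so the sketch runs offline.
FAKE_SITE = {
    "/": '<span class="text">Quote A</span>'
         '<li class="next"><a href="/page/2/">Next</a></li>',
    "/page/2/": '<span class="text">Quote B</span>',
}

def fake_fetch(path):
    """Stand-in for step 2; a real version would download the page."""
    return FAKE_SITE[path]

def scrape_all_pages(start_path, fetch):
    """Repeat steps 2-4 for every page reachable through the 'next' link."""
    quotes, path = [], start_path
    while path is not None:
        html = fetch(path)                                # step 2: download
        soup = BeautifulSoup(html, "html.parser")         # step 3: parse
        quotes += [s.text for s in soup.find_all("span", class_="text")]  # step 4
        next_item = soup.find("li", class_="next")        # step 5: another page?
        path = next_item.a["href"] if next_item else None
    return quotes

all_quotes = scrape_all_pages("/", fake_fetch)
print(all_quotes)
```

The loop stops as soon as a page has no "next" item, so the number of pages does not need to be known in advance.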

In [1]:
# import sys
# !{sys.executable} -m pip install beautifulsoup4 lxml

### Import Libraries

We first import the libraries needed to download a website's HTML and extract information from it.

In [42]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Making a GET Request

First, we issue an HTTP request to the URL to get the HTML source code.

In [3]:
url = "https://quotes.toscrape.com/"

In [4]:
# makes a request to the web page and gets its HTML
response = requests.get(url)
response.status_code

200
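
A status code of 200 means the request succeeded. Rather than checking the code by eye, a small helper can fail loudly on an error response. The sketch below is one possible way to do this; the function name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML text, raising if the request fails."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
    return response.text

# Example call (needs network access):
# html = fetch_html("https://quotes.toscrape.com/")
```

Passing `timeout` keeps the call from hanging indefinitely if the server does not answer.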

##### Making the soup: using the BeautifulSoup library to parse the HTML of a webpage

Like with lxml, we can query tags by name or attribute, and we can narrow our search to the ancestors and descendants of specific tags.

In [5]:
# stores the HTML page in 'soup', a BeautifulSoup object
# Convert the HTML to a BeautifulSoup object so that content can be parsed out of it more easily.
# 'lxml' is a third-party parser that must be installed separately; 'html.parser' is the one bundled with Python.
soup = BeautifulSoup(response.content, 'lxml')
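
If `lxml` may not be installed everywhere the notebook runs, one option is to fall back to the built-in parser. The helper below is a sketch under that assumption; `make_soup` is a hypothetical name, and the two parsers can differ slightly in how they repair broken HTML.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse with lxml when it is installed, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not available.
        return BeautifulSoup(html, "html.parser")

soup_demo = make_soup("<html><head><title>Quotes to Scrape</title></head></html>")
print(soup_demo.title.text)
```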


In [6]:
# The soup variable (BeautifulSoup object) we defined earlier can be seen as representing the whole document
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="t

##### prettify(): The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string

In [7]:
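# Note: prettify is a method, so it must be called with parentheses: print(soup.prettify())
# Without them, as here, the notebook shows the bound method's repr instead of the formatted string.
soup.prettify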

<bound method Tag.prettify of <!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" ite

### Searching and Navigating the HTML Tree
- Searching using find and find_all methods

#### find() method:

The find method returns the first tag that matches the given name or attributes as a `bs4.element.Tag` object, or `None` if nothing matches.

#### find_all() method:

##### Finding all instances of a tag at once
The find_all method returns every tag that matches the given name or attributes as a `bs4.element.ResultSet`, a list-like collection of `Tag` objects (empty if nothing matches).

In [8]:
soup.find('head')  # soup.head

<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
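
The difference between the two methods is easiest to see on a tiny document. The HTML string below is made up for illustration and is not taken from the scraped page.

```python
from bs4 import BeautifulSoup

html = '<div><a href="/one">one</a><a href="/two">two</a></div>'
demo = BeautifulSoup(html, "html.parser")

first_link = demo.find("a")          # a single Tag: the first match
all_links = demo.find_all("a")       # a list-like ResultSet of Tags
missing = demo.find("table")         # None, because nothing matches
no_tables = demo.find_all("table")   # an empty ResultSet

print(type(first_link).__name__, len(all_links), missing, len(no_tables))
```

Because `find` can return `None`, chaining an attribute access such as `.text` onto a failed search raises an `AttributeError`; checking for `None` first avoids that.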

### Navigating using specific tags
##### Accessing the title tag of the HTML element using the soup object
This returns the entire html element

In [9]:
title = soup.title
title

<title>Quotes to Scrape</title>

#### To access only the content of the HTML element, append the text attribute to the tag

In [10]:
title = soup.title.text
title

'Quotes to Scrape'
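
The `text` attribute returns the text exactly as it appears in the markup, including surrounding whitespace. When that whitespace is unwanted, `get_text` with `strip=True` is one alternative. The snippet below uses a made-up paragraph to show the difference.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p> Hello <b>world</b> </p>", "html.parser")

raw = demo.p.text                          # keeps the surrounding whitespace
clean = demo.p.get_text(" ", strip=True)   # strips each piece, joins them with a space

print(repr(raw))
print(repr(clean))
```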

#### Navigating to the div tag of the soup object

In [11]:
match = soup.div
match

<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
<

### Searching the HTML div element with class `quote` using the find method

In [12]:
match = soup.find('div', class_='quote')
match

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

# Exploring the first div element which contains other tags like span, small, a

#### Accessing the span element inside the div with dot notation (`match.span`) to display the first quote

In [13]:
#match.span
quote = match.span.text
quote

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [14]:

quote = soup.find('span', class_="text")
quote.text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

#### Searching for the `small` element with the class attribute `author` using the find method

In [15]:
author = match.find('small', class_='author')
author

<small class="author" itemprop="author">Albert Einstein</small>

#### Getting the inner text using the text attribute

In [16]:
author.text

'Albert Einstein'

#### Searching the HTML 'a' element using the find method

In [17]:
link = match.find('a').text
link

'(about)'

#### Searching for all the 'a' HTML elements using the find_all method

In [18]:
match.find_all('a')

[<a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>]
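
Besides the text, each tag also exposes its attributes. The sketch below, on made-up markup, reads the `href` of every tag link; the third link deliberately has no `href` to show what happens when an attribute is missing.

```python
from bs4 import BeautifulSoup

html = '''
<div class="tags">
  <a class="tag" href="/tag/change/page/1/">change</a>
  <a class="tag" href="/tag/world/page/1/">world</a>
  <a class="tag">no-link</a>
</div>
'''
demo = BeautifulSoup(html, "html.parser")

# tag["href"] raises KeyError when the attribute is missing; tag.get("href") returns None.
hrefs = [a.get("href") for a in demo.find_all("a", class_="tag")]
print(hrefs)
```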

# ----------------------------------------------------------------------------------------

### Method-1

#### Accessing all the div tags of the class 'quote' from the soup object 
#### Iterate over all the div elements and print the quote which is available in the span tag.

In [19]:
quotes = []
# each div of class 'quote' holds one quote; its first span contains the quote text
for quote_div in soup.find_all('div', class_='quote'):
    quote = quote_div.span.text
    quotes.append(quote)
quotes

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

In [20]:
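# list.count() counts elements equal to the whole string, not substring matches, so this returns 0.
# To count the quotes that contain the word, use: sum('thinking' in q for q in quotes)
quotes.count('thinking')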

0
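
The result is 0 because `list.count` compares whole elements. A short sketch on two sample strings shows the difference between that and a substring test; `quotes_demo` is a made-up list for illustration.

```python
quotes_demo = [
    "It cannot be changed without changing our thinking.",
    "Try not to become a man of success.",
]

exact_matches = quotes_demo.count("thinking")            # whole-element equality
containing = sum("thinking" in q for q in quotes_demo)   # substring test per quote

print(exact_matches, containing)
```

The substring test is case-sensitive; lowercasing each quote first with `q.lower()` would make it case-insensitive.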

# ----------------------------------------------------------------------------------------

### Method-2

#### To display the quotes, access all the span elements of the class `text` from the soup object using the find_all method

#### This returns the entire HTML span elements rather than just the quote text

In [21]:
quotes = soup.find_all('span', class_="text")
quotes

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

### To access the quote (the inner text) we can use the text attribute.

#### As we have multiple quotes here, we are running a list comprehension and accessing text of every span element 

In [22]:
quotes1 = [quote.text for quote in quotes]
quotes1

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

In [23]:
quotes[0]

<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>

In [24]:
quotes1[0][1:-1]
# [0] selects the first quote; [1:-1] drops its first and last characters (the curly quotation marks)

'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'

##### Removing the surrounding quotation marks from every quote
- Slice the text of each span with [1:-1]

In [25]:
quotes = [quote.text[1:-1] for quote in quotes]
quotes

['The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.',
 'It is our choices, Harry, that show what we truly are, far more than our abilities.',
 'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.',
 'The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.',
 "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.",
 'Try not to become a man of success. Rather become a man of value.',
 'It is better to be hated for what you are than to be loved for what you are not.',
 "I have not failed. I've just found 10,000 ways that won't work.",
 "A woman is like a tea bag; you never know how strong it is until it's in hot water.",
 'A day without sunshine is like, you know, night.']
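
Slicing with `[1:-1]` removes the first and last character whatever they are. A slightly more defensive variant, sketched below on made-up strings, uses `str.strip` with the two curly quotation marks so that text without them is left untouched.

```python
raw_quotes = [
    "“Try not to become a man of success.”",
    "No quotation marks here",
]

# str.strip removes only the listed characters, and only from the two ends.
cleaned = [q.strip("“”") for q in raw_quotes]
print(cleaned)
```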

# ----------------------------------------------------------------------------------------

# Authors

### Accessing the author names from the soup object using the HTML element `small` with the class attribute `author`
- find_all returns the entire html element

In [26]:
authors = soup.find_all('small', class_='author')
authors

[<small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">J.K. Rowling</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Marilyn Monroe</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">André Gide</small>,
 <small class="author" itemprop="author">Thomas A. Edison</small>,
 <small class="author" itemprop="author">Eleanor Roosevelt</small>,
 <small class="author" itemprop="author">Steve Martin</small>]

#### To access the text content of the author elements

In [27]:
authors = [author.text for author in authors]
authors

['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

#### Tags

#### Accessing the tags from the soup object using the HTML element `div` with the class attribute `tags`
- find_all returns the entire html element

In [28]:
tags = soup.find_all('div', class_='tags')
tags

[<div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>,
 <div class="tags">
             Tags:
             <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>
 <a class="tag" href="/tag/choices/page/1/">choices</a>
 </div>,
 <div class="tags">
             Tags:
             <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 <a class="tag" href="/tag/life/page/1/">life</a>
 <a class="tag" href="/tag/live/page/1/">live</a>
 <a class="tag" href="/tag/miracle/page/1/">miracl

Accessing the first tag using indexing

In [29]:
tags[0]

<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>

#### Extracting the tag names from the first quote tag. As there are multiple tags for each quote, we are using a for loop

In [30]:
for i in tags[0].find_all('a', class_='tag'):
    print(i.text)

change
deep-thoughts
thinking
world


#### Extracting the tag names of all the quotes using two for loops

In [31]:
total_tags = []
for tag_div in tags:
    # collect the tag names of one quote, then join them into a single string
    names = [a.text for a in tag_div.find_all('a', class_='tag')]
    total_tags.append(','.join(names))
total_tags

['change,deep-thoughts,thinking,world',
 'abilities,choices',
 'inspirational,life,live,miracle,miracles',
 'aliteracy,books,classic,humor',
 'be-yourself,inspirational',
 'adulthood,success,value',
 'life,love',
 'edison,failure,inspirational,paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor,obvious,simile']
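
Quotes, authors, and tags have so far been collected in three separate lists. One way to keep them aligned and save them in a structured format is to build one record per quote block and write the records as CSV. The sketch below assumes markup shaped like the page above but uses a short made-up sample, and writes to an in-memory buffer instead of a file.

```python
import csv
import io
from bs4 import BeautifulSoup

html = '''
<div class="quote">
  <span class="text">“Quote one.”</span>
  <small class="author">Author A</small>
  <div class="tags"><a class="tag">life</a><a class="tag">love</a></div>
</div>
<div class="quote">
  <span class="text">“Quote two.”</span>
  <small class="author">Author B</small>
  <div class="tags"></div>
</div>
'''
demo = BeautifulSoup(html, "html.parser")

records = []
for block in demo.find_all("div", class_="quote"):
    records.append({
        "quote": block.find("span", class_="text").text,
        "author": block.find("small", class_="author").text,
        "tags": ",".join(a.text for a in block.find_all("a", class_="tag")),
    })

# Write the records as CSV text; the buffer stands in for an open file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["quote", "author", "tags"])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

Searching inside each quote block, rather than across the whole page, means a quote with no tags still produces a row with an empty tags field instead of shifting the lists out of step.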

# ----------------------------------------------------------------------------------------

## Extracting Image URLs

###### `<img>` specifies an image. The path to the image file is specified in the src= attribute

In [32]:
from bs4 import BeautifulSoup
import requests
image_urls=[]
url = 'https://en.wikipedia.org/wiki/Transhumanism'

# get contents from url
content = requests.get(url).content

# get soup
soup = BeautifulSoup(content,'lxml') # choose lxml parser

# find every <img ...> tag (find_all is the current name; findAll is a deprecated alias)
image_tags = soup.find_all('img')

# collect the value of each tag's src attribute
for image_tag in image_tags:
    image_urls.append(image_tag.get('src'))
image_urls

['//upload.wikimedia.org/wikipedia/commons/thumb/4/47/Sound-icon.svg/20px-Sound-icon.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Transhumanism_h%2B.svg/200px-Transhumanism_h%2B.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/2/2a/2010_Utopien_arche04.jpg/200px-2010_Utopien_arche04.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/9/98/Ambox_current_red.svg/42px-Ambox_current_red.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/6/69/Hux-Oxon-72.jpg/220px-Hux-Oxon-72.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/7/71/PPTCountdowntoSingularityLog.jpg/300px-PPTCountdowntoSingularityLog.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/2/22/Da_Vinci_Vitruve_Luc_Viatour.jpg/200px-Da_Vinci_Vitruve_Luc_Viatour.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/33/Biohacker_Neil_Harbisson.jpg/190px-Biohacker_Neil_Harbisson.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/83/Amish_vs_modern_transportation.jpg/220px-Amish_v

In [33]:
from bs4 import BeautifulSoup
import requests
img_urls=[]
html = requests.get('https://en.wikipedia.org/wiki/Peter_Jeffrey').content
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img')
for image in images:
    # image['src'] raises KeyError if a tag has no src; image.get('src') would return None instead
    img_urls.append(image['src'])
img_urls

['//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/0/01/Peter_Jeffrey.jpg/220px-Peter_Jeffrey.jpg',
 '//upload.wikimedia.org/wikipedia/en/thumb/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg/10px-OOjs_UI_icon_edit-ltr-progressive.svg.png',
 '//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1',
 '/static/images/footer/wikimedia-button.png',
 '/static/images/footer/poweredby_mediawiki_88x31.png']
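
The `src` values above are protocol-relative (`//...`) or site-relative (`/...`), so they cannot be downloaded as they stand. A minimal sketch of turning them into absolute URLs with the standard library's `urljoin`, using two of the values from the output above:

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Peter_Jeffrey"
srcs = [
    "//upload.wikimedia.org/wikipedia/en/thumb/0/01/Peter_Jeffrey.jpg/220px-Peter_Jeffrey.jpg",
    "/static/images/footer/wikimedia-button.png",
]

# urljoin resolves each src against the address of the page it was found on.
absolute_urls = [urljoin(page_url, src) for src in srcs]
for u in absolute_urls:
    print(u)
```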

## Extracting Links

##### `<a>` specifies a hyperlink. The text enclosed between `<a>` and `</a>` is the text of the link that appears, while the URL is specified in the href= attribute of the tag.

In [34]:
# importing the libraries
from bs4 import BeautifulSoup
import requests

url="https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify()) # print the parsed data of html

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of countries by GDP (nominal) - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3bdf0521-aa09-4d53-aa25-063e9087d707","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_countries_by_GDP_(nominal)","wgTitle":"List of countries by GDP (nominal)","wgCurRevisionId":1085637635,"wgRevisionId":1085637635,"wgArticleId":380845,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wik

##### Extracting the inner text, title, and href of every link on the page

In [35]:
for link in soup.find_all("a"):
    print("Inner Text: {}".format(link.text))
    print("Title:. {}".format(link.get("title")))
    print("href: {}".format(link.get("href")))

Inner Text: 
Title:. None
href: None
Inner Text: 
Title:. This article is semi-protected.
href: /wiki/Wikipedia:Protection_policy#semi
Inner Text: Jump to navigation
Title:. None
href: #mw-head
Inner Text: Jump to search
Title:. None
href: #searchInput
Inner Text: List of countries by GDP (PPP)
Title:. List of countries by GDP (PPP)
href: /wiki/List_of_countries_by_GDP_(PPP)
Inner Text: 
Title:. None
href: /wiki/File:Nominal_GDP_of_Countries_Crimea_edited.svg
Inner Text: 
Title:. Enlarge
href: /wiki/File:Nominal_GDP_of_Countries_Crimea_edited.svg
Inner Text: [n 1]
Title:. None
href: #cite_note-1
Inner Text: Gross domestic product
Title:. Gross domestic product
href: /wiki/Gross_domestic_product
Inner Text: market value
Title:. Market value
href: /wiki/Market_value
Inner Text: [1]
Title:. None
href: #cite_note-2
Inner Text: exchange rates
Title:. Exchange rate
href: /wiki/Exchange_rate
Inner Text: cost of living
Title:. Cost of living
href: /wiki/Cost_of_living
Inner Text: exchange rate
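
As the output shows, many `<a>` tags have no `href` at all or only point to an anchor within the same page. A possible way to keep only links to other pages is sketched below on a made-up three-link sample.

```python
from bs4 import BeautifulSoup

html = '''
<a>no href</a>
<a href="#mw-head">Jump to navigation</a>
<a href="/wiki/Market_value" title="Market value">market value</a>
'''
demo = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that have an href attribute at all.
page_links = [
    (a.get_text(strip=True), a["href"])
    for a in demo.find_all("a", href=True)
    if not a["href"].startswith("#")   # skip in-page anchors
]
print(page_links)
```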

## Extract Population Data Table from the Wikipedia page

##### `<table>` specifies a table. The rows of the table are specified by `<tr>` tags nested inside the `<table>` tag, while the cells in each row are specified by `<td>` tags nested inside each `<tr>` tag

In [36]:
from bs4 import BeautifulSoup
import requests

url="https://en.wikipedia.org/wiki/World_population"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify()) # print the parsed data of html

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   World population - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"53052d27-7c56-4d8d-8979-470af461d66a","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"World_population","wgTitle":"World population","wgCurRevisionId":1085722328,"wgRevisionId":1085722328,"wgArticleId":19017269,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with non-numeric formatnum arguments","CS1 maint: url-status","Webarchive template wayback links","Articles with short 

### Extract only specific columns from the table

In [40]:
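table = soup.find('table')
countries = []
for row in table.find_all('tr')[2:]:
    # collapse the blank lines between cells, then split the row text on whitespace
    temp = row.text.replace('\n\n', " ").strip()
    temp_list = temp.split()
    # caution: splitting on whitespace breaks multi-word names ('United States' becomes
    # 'United' and shifts the columns) and also picks up the footnote row at the end
    countries.append((temp_list[1], temp_list[3]))
countries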

[('India', '1,311'),
 ('United', '283'),
 ('Indonesia', '258'),
 ('Pakistan', '208'),
 ('Brazil', '206'),
 ('Nigeria', '182'),
 ('Bangladesh', '161'),
 ('Russia', '146'),
 ('Mexico', '127'),
 ('total', '7,349'),
 ('^', '=')]

### Extract all the columns from the table

In [38]:
new_table=[]
for row in table.find_all('tr')[2:-1]:  #tr = table rows...
    columns = row.find_all('td')
    new_table.append([column.text for column in columns[1:]])
new_table

[[' India', '1,053', '1,311', '1,528\n'],
 [' United States', '283', '322', '356\n'],
 [' Indonesia', '212', '258', '295\n'],
 [' Pakistan', '136', '208', '245\n'],
 [' Brazil', '176', '206', '228\n'],
 [' Nigeria', '123', '182', '263\n'],
 [' Bangladesh', '131', '161', '186\n'],
 [' Russia', '146', '146', '149\n'],
 [' Mexico', '103', '127', '148\n'],
 ['World total', '6,127', '7,349', '8,501\n']]
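
Every value in the rows above is still a string, with stray spaces, thousands separators, and a trailing newline. If the numbers are needed for arithmetic, they can be converted as in the sketch below; `to_int` is a hypothetical helper and the two rows are copied from the output above.

```python
rows = [[" India", "1,053", "1,311", "1,528\n"],
        [" Mexico", "103", "127", "148\n"]]

def to_int(cell):
    """Turn a string such as '1,528' (possibly with a trailing newline) into 1528."""
    return int(cell.strip().replace(",", ""))

# Strip the name and convert every remaining cell of the row.
clean_rows = [[name.strip(), *map(to_int, numbers)] for name, *numbers in rows]
print(clean_rows)
```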

### Converting the table data into a structured data format

In [44]:
df = pd.DataFrame(new_table, columns=['Top ten most populous countries','2000', '2015', '2030'])

# remove the trailing "\n" from the values in the "2030" column
df['2030'] = df['2030'].str.replace('\n','')

df

| | Top ten most populous countries | 2000 | 2015 | 2030 |
|---|---|---|---|---|
| 0 | India | 1053 | 1311 | 1528 |
| 1 | United States | 283 | 322 | 356 |
| 2 | Indonesia | 212 | 258 | 295 |
| 3 | Pakistan | 136 | 208 | 245 |
| 4 | Brazil | 176 | 206 | 228 |
| 5 | Nigeria | 123 | 182 | 263 |
| 6 | Bangladesh | 131 | 161 | 186 |
| 7 | Russia | 146 | 146 | 149 |
| 8 | Mexico | 103 | 127 | 148 |
| 9 | World total | 6127 | 7349 | 8501 |


In [45]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# download the wiki page
wikipage = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_continent_(data_file)"
result = requests.get(wikipage)

# stop with an HTTPError if the download failed, so that a stale `soup` from an earlier cell is never reused
result.raise_for_status()

# parse the download into a BeautifulSoup object, which allows easy manipulation
soup = BeautifulSoup(result.content, "html.parser")

# find the first table whose class attribute is exactly "wikitable sortable"
table = soup.find('table',{'class':'wikitable sortable'})
table


<table class="wikitable sortable">
<tbody><tr>
<th>CC</th>
<th>a-2</th>
<th>a-3</th>
<th>#</th>
<th>Name
</th></tr>
<tr>
<td>AS</td>
<td>AF</td>
<td>AFG</td>
<td>004</td>
<td>Afghanistan, Islamic Republic of
</td></tr>
<tr>
<td>EU</td>
<td>AL</td>
<td>ALB</td>
<td>008</td>
<td>Albania, Republic of
</td></tr>
<tr>
<td>AN</td>
<td>AQ</td>
<td>ATA</td>
<td>010</td>
<td>Antarctica (the territory South of 60 deg S)
</td></tr>
<tr>
<td>AF</td>
<td>DZ</td>
<td>DZA</td>
<td>012</td>
<td>Algeria, People's Democratic Republic of
</td></tr>
<tr>
<td>OC</td>
<td>AS</td>
<td>ASM</td>
<td>016</td>
<td>American Samoa
</td></tr>
<tr>
<td>EU</td>
<td>AD</td>
<td>AND</td>
<td>020</td>
<td>Andorra, Principality of
</td></tr>
<tr>
<td>AF</td>
<td>AO</td>
<td>AGO</td>
<td>024</td>
<td>Angola, Republic of
</td></tr>
<tr>
<td>NA</td>
<td>AG</td>
<td>ATG</td>
<td>028</td>
<td>Antigua and Barbuda
</td></tr>
<tr>
<td>AS</td>
<td>AZ</td>
<td>AZE</td>
<td>031</td>
<td>Azerbaijan, Republic of
</td></tr>
<tr>
<td>E

In [47]:
new_table = []
for row in table.find_all('tr')[1:]:
    columns = row.find_all('td')
    new_table.append([column.text for column in columns])
    
new_table

[['AS', 'AF', 'AFG', '004', 'Afghanistan, Islamic Republic of\n'],
 ['EU', 'AL', 'ALB', '008', 'Albania, Republic of\n'],
 ['AN', 'AQ', 'ATA', '010', 'Antarctica (the territory South of 60 deg S)\n'],
 ['AF', 'DZ', 'DZA', '012', "Algeria, People's Democratic Republic of\n"],
 ['OC', 'AS', 'ASM', '016', 'American Samoa\n'],
 ['EU', 'AD', 'AND', '020', 'Andorra, Principality of\n'],
 ['AF', 'AO', 'AGO', '024', 'Angola, Republic of\n'],
 ['NA', 'AG', 'ATG', '028', 'Antigua and Barbuda\n'],
 ['AS', 'AZ', 'AZE', '031', 'Azerbaijan, Republic of\n'],
 ['EU', 'AZ', 'AZE', '031', 'Azerbaijan, Republic of\n'],
 ['SA', 'AR', 'ARG', '032', 'Argentina, Argentine Republic\n'],
 ['OC', 'AU', 'AUS', '036', 'Australia, Commonwealth of\n'],
 ['EU', 'AT', 'AUT', '040', 'Austria, Republic of\n'],
 ['NA', 'BS', 'BHS', '044', 'Bahamas, Commonwealth of the\n'],
 ['AS', 'BH', 'BHR', '048', 'Bahrain, Kingdom of\n'],
 ['AS', 'BD', 'BGD', '050', "Bangladesh, People's Republic of\n"],
 ['AS', 'AM', 'ARM', '051', 
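
The last cell of every row above ends in `\n`. Instead of cleaning this up afterwards, the whitespace can be removed while the rows are built. The sketch below shows one way on a made-up two-row table; it also skips the header row without relying on its position.

```python
from bs4 import BeautifulSoup

html = '''
<table class="wikitable sortable">
<tr><th>CC</th><th>a-2</th><th>Name
</th></tr>
<tr><td>EU</td><td>AL</td><td>Albania, Republic of
</td></tr>
</table>
'''
demo = BeautifulSoup(html, "html.parser")

rows = []
for tr in demo.find("table", class_="wikitable").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:   # header rows contain only <th>, so they yield no cells
        rows.append(cells)
print(rows)
```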

In [48]:
# the '#' column of the source table holds the three-digit ISO 3166-1 numeric country code, not a phone code
df = pd.DataFrame(new_table, columns=['ContinentCode','Alpha2','Alpha3','NumericCode','Name'])
df['Name'] = df['Name'].str.replace('\n','')
df

| | ContinentCode | Alpha2 | Alpha3 | NumericCode | Name |
|---|---|---|---|---|---|
| 0 | AS | AF | AFG | 004 | Afghanistan, Islamic Republic of |
| 1 | EU | AL | ALB | 008 | Albania, Republic of |
| 2 | AN | AQ | ATA | 010 | Antarctica (the territory South of 60 deg S) |
| 3 | AF | DZ | DZA | 012 | Algeria, People's Democratic Republic of |
| 4 | OC | AS | ASM | 016 | American Samoa |
| ... | ... | ... | ... | ... | ... |
| 257 | AS | YE | YEM | 887 | Yemen |
| 258 | AF | ZM | ZMB | 894 | Zambia, Republic of |
| 259 | AS | XD | | | United Nations Neutral Zone |
| 260 | AS | XS | | | Spratly Islands |


In [None]:
import requests
from bs4 import BeautifulSoup

# URL for web scraping
url = "http://quotes.toscrape.com/"

# Perform GET request
response = requests.get(url)

# Parse HTML from the response
soup = BeautifulSoup(response.text, 'lxml')

# Extract the quote and author HTML elements
quotes_html = soup.find_all('span', class_="text")
authors_html = soup.find_all('small', class_="author")

#Extract quotes into a list
quotes = list()
for quote in quotes_html:
    quotes.append(quote.text)

#Extract authors into a list    
authors = list()
for author in authors_html:
    authors.append(author.text) 

# Make a quote / author tuple for printing    
for t in zip(quotes, authors):
    print(t)   

In [None]:
url='https://en.wikipedia.org/wiki/Tourism'

In [None]:
response=requests.get(url)
soup=BeautifulSoup(response.content, 'lxml')
tags = soup.find_all('img')
for tag in tags:
    # <img> tags carry their address in src; they have no href attribute
    print(tag.get('src'))