# **Web Scraping Lab Practice**


## Objectives


After completing this lab you will be able to:


-   Download a web page using the requests module
-   Scrape all links from a web page
-   Scrape all image URLs from a web page
-   Scrape data from HTML tables


Import the required modules and functions


In [13]:
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a web page

Download the contents of the web page


In [14]:
url = "http://www.ibm.com"

In [15]:
# Get the contents of the web page as text and store it in a variable called data
data = requests.get(url).text
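The one-line download above does not check whether the request succeeded and can wait indefinitely on a slow server. A minimal sketch of a more defensive variant follows; the helper name `fetch_html`, its `session` parameter, and the 10-second default timeout are illustrative assumptions, not part of the lab.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and its parameters are hypothetical names used for illustration.
    """
    # Use the caller's session if one is given, otherwise the module-level requests API.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, `data = fetch_html(url)` could stand in for `requests.get(url).text`, and a failed download would surface as an exception instead of an error page being parsed as if it were the real content.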

Create a soup object using the class BeautifulSoup


In [16]:
soup = BeautifulSoup(data, "html5lib")  # create a soup object from 'data'; the html5lib parser must be installed separately

Scrape all links


In [17]:
for link in soup.find_all('a'):  # in HTML, an anchor (link) is represented by the <a> tag
    print(link.get('href'))

https://www.ibm.com/pk/en
https://www.ibm.com/sitemap/pk/en
/pk-en/node/1706826
https://1.dam.s81c.com/public/content/dam/worldwide-content/events/internal-events/ul/g/4d/d3/4dd3b087-e58e-452d-a26ec1a36c755d2f.jpg
https://1.dam.s81c.com/public/content/dam/worldwide-content/events/internal-events/ul/g/4d/d3/4dd3b087-e58e-452d-a26ec1a36c755d2f.jpg
https://www.ibm.com/pk-en/cloud?lnk=STW_PK_HP_L1_&psrc=NONE&pexp=DEF&lnk2=learn_Cloud
/taxonomy/term/85416
/pk-en/node/1706856
https://ibmvirtualsummit.vfairs.com/?utm_source=Organic&utm_medium=Home_News_Stripe&utm_campaign=Virtual_Summit&utm_content=PK
/taxonomy/term/85416
/pk-en/node/1706851
/pk-en/node/1706836
/ae-en/node/1706831
/pk-en/node/1706846
/ae-en/node/1706841
/taxonomy/term/85416
/pk-en/node/2193047
/ae-en/node/2193039
/ae-en/node/2193041
/ae-en/node/2193043
/ae-en/node/2398590
/pk-en/node/1706821
/taxonomy/term/85416
https://www.ibm.com/employment/ae-en/?lnk=fab
/pk-en/node/1706816
https://www.ibm.com/ae-en/products/offers-and-dis
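The output above mixes absolute URLs with site-relative paths such as `/pk-en/node/1706826`, and `link.get('href')` returns `None` for an `<a>` tag that has no `href` attribute. A small sketch of how such values could be normalized with the standard library's `urljoin` is shown below; the helper name `absolutize`, the sample list, and the choice of `https://www.ibm.com` as the base are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Assumed base URL; in practice, use the final URL of the page that was fetched.
base_url = "https://www.ibm.com"


def absolutize(hrefs, base):
    """Resolve each href against `base`, skipping missing or empty values."""
    resolved = []
    for href in hrefs:
        if not href:  # None (no href attribute) or an empty string
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative paths.
        resolved.append(urljoin(base, href))
    return resolved


# A few values in the style of the output above, plus one missing href.
sample_hrefs = [
    "https://www.ibm.com/pk/en",
    "/pk-en/node/1706826",
    None,
    "/taxonomy/term/85416",
]
absolute_links = absolutize(sample_hrefs, base_url)
for link in absolute_links:
    print(link)
```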

Scrape all images


In [18]:
for img in soup.find_all('img'):  # in HTML, an image is represented by the <img> tag
    print(img.get('src'))

https://1.dam.s81c.com/public/content/dam/worldwide-content/events/internal-events/ul/g/4d/d3/4dd3b087-e58e-452d-a26ec1a36c755d2f.jpg
https://1.dam.s81c.com/public/content/dam/worldwide-content/other/ul/g/84/7e/847eafdb-51ac-4815-8172baba40d48fb8.jpg
https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/27/2d/this-week-at-ibm-cloud-seminer-20210308.jpg
https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/93/ef/Migrating_SAP_to_cloud.jpg
https://1.cms.s81c.com/sites/default/files/2020-11-13/033fe6cf-f5fb-42ad-8a1cdcb680b7a970.jpg
https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/29/ec/29ec397a-6f63-4f9d-b74f59fd20651e0d.jpg
https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/5b/af/public_cloud_canada_homepage.jpg
https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/54/ef/Data_privacy_blog_Claude_card.jpg
https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/99/
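Pages often repeat the same image or include `<img>` tags without a `src` attribute, in which case `get('src')` yields `None`. The sketch below shows one possible way to collect only unique, non-missing image URLs. It runs on a small made-up HTML snippet (the `example.com` URLs are placeholders) and uses Python's built-in `html.parser` rather than `html5lib` so that it needs no extra parser package.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; it is not taken from the page scraped above.
sample_html = """
<html><body>
  <img src="https://example.com/a.jpg" alt="first">
  <img alt="no source attribute">
  <img src="https://example.com/a.jpg" alt="duplicate">
  <img src="/images/b.png" alt="second">
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

image_urls = []
# src=True matches only <img> tags that actually have a src attribute.
for img in sample_soup.find_all("img", src=True):
    src = img["src"]
    if src not in image_urls:  # keep the first occurrence, preserve page order
        image_urls.append(src)

print(image_urls)
```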

## Scrape data from HTML tables


In [19]:
# The URL below points to a page containing an HTML table with data about colors and color codes.

In [20]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before scraping a website, examine its contents and how the data is organized. Open the URL above in your browser and check how many rows and columns the color table has.


In [21]:
# Get the contents of the web page as text and store it in a variable called data
data = requests.get(url).text

In [22]:
soup = BeautifulSoup(data, "html5lib")

In [23]:
# Find the first HTML table in the web page
table = soup.find('table')  # in HTML, a table is represented by the <table> tag

In [24]:
# Get all rows from the table
for row in table.find_all('tr'):  # in HTML, a table row is represented by the <tr> tag
    # Get all cells in the row
    cols = row.find_all('td')  # in HTML, a table cell is represented by the <td> tag
    color_name = cols[2].getText()  # store the value in column 3 as color_name
    color_code = cols[3].getText()  # store the value in column 4 as color_code
    print("{}--->{}".format(color_name, color_code))

Color Name--->Hex Code#RRGGBB
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
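The loop above prints the header row along with the data and would raise an `IndexError` on any row with fewer than four cells. A sketch of a slightly more defensive variant is given below: it skips the header, ignores short rows, and stores the result in a dictionary. The sample table is invented for illustration (its first two header names are placeholders, and only the third and fourth columns mirror the output above), and `html.parser` is used in place of `html5lib` to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Invented sample table; "Number" and "Color" are placeholder header names.
sample_table_html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code#RRGGBB</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
  <tr><td>incomplete row</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_table_html, "html.parser").find("table")
rows = sample_table.find_all("tr")

colors = {}
for row in rows[1:]:  # skip the first row, which holds the column headers
    cols = row.find_all("td")
    if len(cols) < 4:  # ignore rows that lack the name or code column
        continue
    name = cols[2].get_text(strip=True)  # column 3: color name
    code = cols[3].get_text(strip=True)  # column 4: hex color code
    colors[name] = code

print(colors)
```

Keeping the values in a dictionary instead of printing them makes later lookups such as `colors["salmon"]` possible.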


## Authors


Ramesh Sannareddy


### Other Contributors


Rav Ahuja


## Change Log


| Date (YYYY-MM-DD) | Version | Changed By        | Change Description                 |
| ----------------- | ------- | ----------------- | ---------------------------------- |
| 2020-10-17        | 0.1     | Ramesh Sannareddy | Created initial version of the lab |


Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).
