# Work with Web Data using Requests and Beautiful Soup packages

I followed two very useful DigitalOcean tutorials: [1](https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3), [2](https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3).

In [1]:
import requests

In [44]:
url = 'https://assets.digitalocean.com/articles/eng_python/beautiful-soup/mockturtle.html'
page = requests.get(url)

In [45]:
page

<Response [200]>

The HTTP status code of this response is 200, which means the page was downloaded successfully.
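To make the meaning of a status code concrete, here is a small sketch that uses only the standard library's `http.HTTPStatus` enum. The helper names `describe_status` and `is_success` are illustrative and not part of Requests. On a real response, `page.status_code` holds the number, and `page.raise_for_status()` raises an exception for 4xx and 5xx codes.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '200 OK'."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code: int) -> bool:
    """Codes in the 2xx range mean the request succeeded."""
    return 200 <= code < 300


print(describe_status(200))  # 200 OK
print(is_success(200))       # True
print(describe_status(404))  # 404 Not Found
print(is_success(404))       # False
```

In the notebook, calling `is_success(page.status_code)` before parsing would be one way to avoid feeding an error page to Beautiful Soup.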

In [46]:
page.text #read the content

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=us-ascii" />\n\n  <title>Turtle Soup</title>\n</head>\n\n<body>\n  <h1>Turtle Soup</h1>\n\n  <p class="verse" id="first">Beautiful Soup, so rich and green,<br />\n  Waiting in a hot tureen!<br />\n  Who for such dainties would not stoop?<br />\n  Soup of the evening, beautiful Soup!<br />\n  Soup of the evening, beautiful Soup!<br /></p>\n\n  <p class="chorus" id="second">Beau--ootiful Soo--oop!<br />\n  Beau--ootiful Soo--oop!<br />\n  Soo--oop of the e--e--evening,<br />\n  Beautiful, beautiful Soup!<br /></p>\n\n  <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br />\n  Game or any other dish?<br />\n  Who would not give all else for two<br />\n  Pennyworth only of Beautiful Soup?<br />\n  Pennyworth only of

In [2]:
from bs4 import BeautifulSoup

In [48]:
soup = BeautifulSoup(page.text, 'html.parser')  # build a parse tree from the downloaded HTML

In [49]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Turtle Soup
  </title>
 </head>
 <body>
  <h1>
   Turtle Soup
  </h1>
  <p class="verse" id="first">
   Beautiful Soup, so rich and green,
   <br/>
   Waiting in a hot tureen!
   <br/>
   Who for such dainties would not stoop?
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
  </p>
  <p class="chorus" id="second">
   Beau--ootiful Soo--oop!
   <br/>
   Beau--ootiful Soo--oop!
   <br/>
   Soo--oop of the e--e--evening,
   <br/>
   Beautiful, beautiful Soup!
   <br/>
  </p>
  <p class="verse" id="third">
   Beautiful Soup! Who cares for fish,
   <br/>
   Game or any other dish?
   <br/>
   Who would not give all else for two
   <br/>
   Pennyworth only of 

In [50]:
soup.find_all('p') #find instances of a tag

[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>,
 <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]

In [51]:
soup.find_all('p')[2].get_text() #extract the text of the third <p> element 

'Beautiful Soup! Who cares for fish,\n  Game or any other dish?\n  Who would not give all else for two\n  Pennyworth only of Beautiful Soup?\n  Pennyworth only of beautiful Soup?'

In [52]:
soup.find_all(class_='chorus')

[<p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]

In [53]:
soup.find_all('p', class_='chorus')

[<p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]

In [54]:
soup.find_all(id='third')

[<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>]

# Collect and Parse a Web Page

In [None]:
# Collect first page of artists’ list
page = requests.get('http://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm')

In [68]:
# Create a BeautifulSoup object (a parse tree)
soup = BeautifulSoup(page.text, 'html.parser')

We’ll collect the artists’ names and the links associated with them. Using the browser's Inspect menu item, we first see that the table of names sits inside a div tag with class="BodyText". This is important to note, because it lets us restrict the search to this section of the web page. We also notice that the name Zabaglia, Niccola is inside a link tag, since the name points to a web page that describes the artist. The same is true for every artist’s name, so the <a> tags are what we want to extract.
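The effect of restricting the search to one section can be seen on a tiny stand-in document. The sketch below assumes a made-up HTML snippet (the `Header` div and the `/artist?id=...` links are hypothetical, not taken from the archived page) and compares a page-wide `find_all('a')` with one scoped to the BodyText div.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the archived page: one link outside the
# BodyText div and two artist links inside it.
demo_html = """
<div class="Header"><a href="/home">Home</a></div>
<div class="BodyText">
  <table>
    <tr><td><a href="/artist?id=1">Zabaglia, Niccola</a></td></tr>
    <tr><td><a href="/artist?id=2">Zaccone, Fabian</a></td></tr>
  </table>
</div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

all_links = demo_soup.find_all("a")           # every link in the document
body = demo_soup.find(class_="BodyText")      # the section we care about
artist_links = body.find_all("a")             # only links inside that section

print(len(all_links))                          # 3
print(len(artist_links))                       # 2
print([a.get_text() for a in artist_links])    # ['Zabaglia, Niccola', 'Zaccone, Fabian']
```

Calling `find_all` on the div rather than on the whole soup is what keeps unrelated links, such as the header link here, out of the result.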

In [57]:
# Find the BodyText div
artist_name_list = soup.find(class_='BodyText')
# Find all instances of the <a> tag within the BodyText div
artist_name_list_items = artist_name_list.find_all('a')

In [58]:
# Create for loop to print out all artists' names
for artist_name in artist_name_list_items:
    print(artist_name.prettify())

<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11630">
 Zabaglia, Niccola
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34202">
 Zaccone, Fabian
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3475">
 Zadkine, Ossip
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=25135">
 Zaech, Bernhard
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=2298">
 Zagar, Jacob
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=23988">
 Zagroba, Idalia
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=8232">
 Zaidenberg, A.
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34154">
 Zaidenberg, Arthur
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=4910">
 Zaisinger, Matthäus
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=

The links at the bottom of the page do not give any information about the artists, so we want to remove them. Inspecting the DOM shows that these links are contained in an HTML table: table class="AlphaNav"
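As a rough illustration of what removing that table does to the parse tree, the sketch below runs `decompose()` on a small made-up snippet. The HTML and the `/index-a`, `/index-b` links are hypothetical stand-ins for the real page. It also guards against `find()` returning `None`, which happens when no element has the requested class.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in: one artist link plus an AlphaNav table of
# navigation links, both inside the BodyText div.
demo_html = """
<div class="BodyText">
  <table><tr><td><a href="/artist?id=1">Zabaglia, Niccola</a></td></tr></table>
  <table class="AlphaNav">
    <tr><td><a href="/index-a">A</a></td><td><a href="/index-b">B</a></td></tr>
  </table>
</div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")
body = demo_soup.find(class_="BodyText")

before = [a.get_text() for a in body.find_all("a")]

nav = demo_soup.find(class_="AlphaNav")
if nav is not None:      # find() returns None when nothing matches
    nav.decompose()      # remove the table and everything inside it

after = [a.get_text() for a in body.find_all("a")]

print(before)  # ['Zabaglia, Niccola', 'A', 'B']
print(after)   # ['Zabaglia, Niccola']
```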

In [69]:
# Remove bottom links
last_links = soup.find(class_='AlphaNav')
last_links.decompose()


In [70]:
# Find the BodyText div
artist_name_list = soup.find(class_='BodyText')
# Find all instances of the <a> tag within the BodyText div
artist_name_list_items = artist_name_list.find_all('a')

# Create for loop to print out all artists' names
for artist_name in artist_name_list_items:
    print(artist_name.prettify())

<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11630">
 Zabaglia, Niccola
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34202">
 Zaccone, Fabian
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3475">
 Zadkine, Ossip
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=25135">
 Zaech, Bernhard
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=2298">
 Zagar, Jacob
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=23988">
 Zagroba, Idalia
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=8232">
 Zaidenberg, A.
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34154">
 Zaidenberg, Arthur
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=4910">
 Zaisinger, Matthäus
</a>
<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=

In [62]:
# Use .contents to pull out the <a> tag’s children
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href') # capture the links as well
    print(names)
    print(links)

Zabaglia, Niccola
https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11630
Zaccone, Fabian
https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34202
Zadkine, Ossip
https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3475
Zaech, Bernhard
https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=25135
Zagar, Jacob
https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=2298
Zagroba, Idalia
https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=23988
Zaidenberg, A.
https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=8232
Zaidenberg, Arthur
https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34154
Zaisinger, Matthäus
https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=4910
Zajac, Jack
https:/
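The loop above builds absolute links by string concatenation, which works because every href on this page starts with a slash. A possible alternative, shown here only as a sketch, is `urllib.parse.urljoin` from the standard library, which also copes with hrefs that are already absolute. The `example.com` address is a made-up value for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://web.archive.org"

# An href as it appears in the page output above
href = "/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11630"

concatenated = BASE_URL + href
joined = urljoin(BASE_URL, href)

print(joined)
print(joined == concatenated)  # True

# Hypothetical case: the href is already an absolute URL.
absolute_href = "https://example.com/page"
print(urljoin(BASE_URL, absolute_href))  # https://example.com/page
```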

### Writing the Data to a CSV File

In [4]:
import csv

In [5]:
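# newline='' keeps the csv module from writing blank rows on Windows
csv_file = open('z-artist-names.csv', 'w', newline='', encoding='utf-8')
f = csv.writer(csv_file)
f.writerow(['Name','Link'])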

11

In [71]:
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')


    # Add each artist’s name and associated link to a row
    f.writerow([names, links])
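The cells above leave the CSV file open, so rows may not reach the disk until the kernel exits. One possible variant, sketched below, wraps the writing in a `with` block so the file is closed as soon as the rows are written. The function name `write_artists` and the demo file name are illustrative assumptions. The two sample rows are copied from the output above.

```python
import csv
import tempfile
from pathlib import Path

# Two rows copied from the notebook output above
sample_rows = [
    ("Zabaglia, Niccola",
     "https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11630"),
    ("Zaisinger, Matthäus",
     "https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=4910"),
]


def write_artists(path, rows):
    """Write a header row and the given (name, link) rows to a CSV file."""
    # newline='' is what the csv module expects; utf-8 keeps names like
    # 'Matthäus' intact regardless of the platform default encoding.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Name", "Link"])
        writer.writerows(rows)


# Hypothetical output location in the system temp directory
demo_path = Path(tempfile.gettempdir()) / "z-artist-names-demo.csv"
write_artists(demo_path, sample_rows)

# Read the file back to check what was written
with open(demo_path, newline="", encoding="utf-8") as csv_file:
    written = list(csv.reader(csv_file))

for row in written:
    print(row)
```

Because the names contain commas, the writer quotes them in the file, and `csv.reader` returns them as single fields again.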

* The complete code written so far:

In [74]:
import csv

import requests
from bs4 import BeautifulSoup

page = requests.get('http://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

last_links = soup.find(class_='AlphaNav')
last_links.decompose()

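# Create a file to write to, add headers row
# (newline='' keeps the csv module from writing blank rows on Windows)
csv_file = open('z-artist-names.csv', 'w', newline='', encoding='utf-8')
f = csv.writer(csv_file)
f.writerow(['Name', 'Link'])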

artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')


    # Add each artist’s name and associated link to a row
    f.writerow([names, links])
    

Now we will collect data from all 4 pages.

With the CSV file created and its first row set to Name and Link, we rewrite the code with for loops in order to scrape the data from all 4 pages:

In [6]:
pages = []

for i in range(1, 5):
    url = 'http://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)


for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()

    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')

    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')

        f.writerow([names, links])
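The multi-page loop above can also be split into small functions, which makes it possible to try the parsing step without downloading anything. The sketch below is one such arrangement under stated assumptions: `extract_artists` and `write_all` are hypothetical helper names, the sample HTML is a made-up stand-in shaped like the archived pages, and the `anA1.htm` navigation link is invented for illustration. The artist hrefs are taken from the output earlier in the notebook.

```python
import csv
import io

from bs4 import BeautifulSoup

BASE_URL = "https://web.archive.org"


def extract_artists(html):
    """Return (name, absolute_link) pairs found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    nav = soup.find(class_="AlphaNav")
    if nav is not None:              # a page may lack the navigation table
        nav.decompose()
    body = soup.find(class_="BodyText")
    if body is None:                 # nothing to extract on this page
        return []
    return [(a.get_text(strip=True), BASE_URL + a["href"])
            for a in body.find_all("a", href=True)]


def write_all(pages_html, out_file):
    """Write a header plus all artists from every page; return the row count."""
    writer = csv.writer(out_file)
    writer.writerow(["Name", "Link"])
    count = 0
    for html in pages_html:
        for row in extract_artists(html):
            writer.writerow(row)
            count += 1
    return count


# Hypothetical stand-in pages; the second one has no AlphaNav table.
SAMPLE_PAGES = [
    """<div class="BodyText">
         <a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11630">Zabaglia, Niccola</a>
         <a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34202">Zaccone, Fabian</a>
         <table class="AlphaNav"><tr><td>
           <a href="/web/20121007172955/http://www.nga.gov/collection/anA1.htm">A</a>
         </td></tr></table>
       </div>""",
    """<div class="BodyText">
         <a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3475">Zadkine, Ossip</a>
       </div>""",
]

buffer = io.StringIO()               # in-memory stand-in for the CSV file
total = write_all(SAMPLE_PAGES, buffer)

print(total)  # 3
print(buffer.getvalue())
```

In the notebook, the HTML strings would come from the downloaded pages instead, for example `write_all((requests.get(url).text for url in pages), csv_file)` with `csv_file` opened in a `with` block.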