# Week 4 Day 3: XML, HTML, and Beautiful Soup

## What is a Web Scraper?

Sometimes websites do not have an Application Programming Interface (API). In these cases, one can build a **Web Scraper**. Web scraping is the process of using bots to extract content and data from a website, specifically from the underlying HTML code.

With an API, you have a communication tool where you, the *User*, communicate with the *Client*, the computer that sends the request to the *Server*, the computer that responds to your request. With a Web Scraper, you interact directly with the Server: you request the web page itself and pull the data out of its HTML.

To do this, we will need a couple of libraries. One is `requests`, a library that allows you to send HTTP requests to a server. We will use it to request the HTML content of a specific web page from the server. [Here is the documentation](https://pypi.org/project/requests/).
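As a minimal sketch of how `requests` is typically used, the helper below wraps `requests.get` with a timeout and an HTTP status check. The function name `fetch_text` and the `example.com` URL are placeholders chosen for illustration, not part of this lesson's data. The last lines build a GET request without sending it, so you can see what would go to the server without needing a network connection.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Build (but do not send) a GET request to inspect what would go over the wire.
# The URL and query parameter are placeholders.
prepared = requests.Request(
    "GET", "https://example.com/page", params={"q": "soup"}
).prepare()

print(prepared.method)
print(prepared.url)
```

Calling `fetch_text(...)` on a real URL would perform the actual request; the timeout keeps the notebook from hanging if the server never answers.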

In [1]:
import requests

### XML

The [Extensible Markup Language](https://www.w3.org/XML/) (XML) is a markup language for representing data structures. XML was all the rage at the turn of the century: "many software designers can barely contain their excitement over its potential to establish a real Internet lingua franca" (*The New York Times* in 2000: "[The Next Big Step? It's Called XML](https://www.nytimes.com/2000/06/07/business/the-next-big-leap-it-s-called-xml.html)"). That obviously did not come to pass. But XML remains a robust and open—though verbose—standard for representing structured data.
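To make "a markup language for representing data structures" concrete, here is a small sketch using Python's built-in `xml.etree.ElementTree`. The XML fragment is a hand-written sample in the style of the House member data, and `ElementTree` is used only for illustration; it is not the parser used in the rest of this notebook.

```python
import xml.etree.ElementTree as ET

# A hand-written sample: elements nest inside each other,
# and an element can carry attributes (postal-code) as well as text.
xml_doc = """<members>
  <member>
    <lastname>Begich</lastname>
    <state postal-code="AK">
      <state-fullname>Alaska</state-fullname>
    </state>
  </member>
</members>"""

root = ET.fromstring(xml_doc)           # parse the string into a tree
member = root.find("member")            # first <member> child of <members>
lastname = member.find("lastname").text
state = member.find("state")
postal_code = state.get("postal-code")  # read an attribute
fullname = state.find("state-fullname").text

print(root.tag, lastname, postal_code, fullname)
```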

XML has taken on something of an afterlife as the official data standard for the U.S. Congress. The [House](http://clerk.house.gov/index.aspx) and [Senate](https://www.senate.gov/general/XML.htm) both release information about members, committees, schedules, legislation, and votes in XML. These are immaculately formatted and documented and remarkably up-to-date: the data for members of the 119th Congress are already posted.

[Congress MemberData XML schema](https://clerk.house.gov/member_info/MemberData_UserGuide.pdf)

Use the `requests` library to make an HTTP GET request to the House's web server and get the list of current member data.

## House XML

In [5]:
# let's grab some XML
house_raw = requests.get('http://clerk.house.gov/xml/lists/MemberData.xml').text  # this grabs the XML as plain text

In [7]:
#view it

house_raw

'<?xml version="1.0" encoding="UTF-8"?><MemberData publish-date="June 6, 2025"><title-info><congress-num>119</congress-num><congress-text>One Hundred Nineteenth Congress</congress-text><session>1</session><majority>R</majority><minority>D</minority><clerk>KEVIN F. McCUMBER</clerk><weburl>https://clerk.house.gov</weburl></title-info><members><member><statedistrict>AK00</statedistrict><member-info><namelist>Begich, Nicholas</namelist><bioguideID>B001323</bioguideID><lastname>Begich</lastname><firstname>Nicholas</firstname><middlename>J.</middlename><sort-name>BEGICH,NICHOLAS</sort-name><suffix>III</suffix><courtesy>Mr.</courtesy><prior-congress>0</prior-congress><official-name>Nicholas J. Begich III</official-name><formal-name>Mr. Begich</formal-name><party>R</party><caucus>R</caucus><state postal-code="AK"><state-fullname>Alaska</state-fullname></state><district>At Large</district><townname>Chugiak</townname><office-building>CHOB</office-building><office-room>153</office-room><office-zi

In [9]:
import pprint

In [11]:
pprint.pprint(house_raw)

('<?xml version="1.0" encoding="UTF-8"?><MemberData publish-date="June 6, '
 '2025"><title-info><congress-num>119</congress-num><congress-text>One Hundred '
 'Nineteenth '
 'Congress</congress-text><session>1</session><majority>R</majority><minority>D</minority><clerk>KEVIN '
 'F. '
 'McCUMBER</clerk><weburl>https://clerk.house.gov</weburl></title-info><members><member><statedistrict>AK00</statedistrict><member-info><namelist>Begich, '
 'Nicholas</namelist><bioguideID>B001323</bioguideID><lastname>Begich</lastname><firstname>Nicholas</firstname><middlename>J.</middlename><sort-name>BEGICH,NICHOLAS</sort-name><suffix>III</suffix><courtesy>Mr.</courtesy><prior-congress>0</prior-congress><official-name>Nicholas '
 'J. Begich III</official-name><formal-name>Mr. '
 'Begich</formal-name><party>R</party><caucus>R</caucus><state '
 'postal-code="AK"><state-fullname>Alaska</state-fullname></state><district>At '
 'Large</district><townname>Chugiak</townname><office-building>CHOB</office-building

In [15]:
# the last 1000 characters (the whole document is one long string, so we slice by character)

house_raw[-1000:]

In [13]:
#what data type is the raw one

type(house_raw)

str

This data is still in a string format (`type(house_raw)`), so it's difficult to search and navigate. Let's make our first soup together using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).


## Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In [17]:
from bs4 import BeautifulSoup

You may need to install an XML parser. **If you do, you may need to close your Jupyter notebook after you install it and open it back up again.**

In [25]:
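# how do we make sense of all these tags? beautiful soup!
# "lxml" is lxml's HTML parser: it lowercases tag names and wraps the document in <html><body>
houseSoup = BeautifulSoup(house_raw, "lxml")  # create a BeautifulSoup object out of the XML string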

In [22]:
pip install lxml

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [23]:
pip install lxml bs4

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.

Collecting bs4
  Obtaining dependency information for bs4 from https://files.pythonhosted.org/packages/51/bb/bf7aab772a159614954d84aa832c129624ba6c32faa559dfb200a534e50b/bs4-0.0.2-py2.py3-none-any.whl.metadata
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


What's so great about this soup-ified string? We now have a suite of new functions and methods that let us navigate the tree. First, let's inspect the different tags/elements in this tree of House member data. This is the full tree of data.
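The main navigation tools can be sketched on a tiny hand-written sample before trying them on the full House data. The snippet below is made up for illustration (the second member, "Doe", is invented), and it uses Python's built-in `html.parser` so it runs without installing `lxml`.

```python
from bs4 import BeautifulSoup

# Hand-written sample in the style of the House data; "Doe" is an invented entry.
snippet = (
    "<members>"
    '<member><lastname>Begich</lastname>'
    '<state postal-code="AK"><state-fullname>Alaska</state-fullname></state></member>'
    '<member><lastname>Doe</lastname>'
    '<state postal-code="AL"><state-fullname>Alabama</state-fullname></state></member>'
    "</members>"
)

soup = BeautifulSoup(snippet, "html.parser")

first = soup.members.contents[0]                      # dotted access, then the list of children
last_name = first.lastname.text                       # text inside a tag
state_name = first.find("state-fullname").text        # find() works for names containing hyphens
postal_code = first.state["postal-code"]              # attributes are read like dictionary keys
all_names = [m.lastname.text for m in soup.find_all("member")]  # find_all() returns every match

print(last_name, state_name, postal_code, all_names)
```

Dotted access such as `first.lastname` only works for tag names that are valid Python identifiers, which is why hyphenated names like `state-fullname` go through `find()`.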

In [28]:
#what data type is the soup object?

type(houseSoup)


bs4.BeautifulSoup

**.prettify()**

lets you print a 'pretty' version of the data, with one tag per line and indentation that shows the nesting.
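As a quick sketch of the difference, compare the plain string form of a soup with its prettified form. The one-member sample is hand-written, and `html.parser` is used only so the example runs without extra installs.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<member><lastname>Begich</lastname></member>", "html.parser")

compact = str(soup)        # everything on one line, exactly as parsed
pretty = soup.prettify()   # one tag or string per line, indented by depth

print(compact)
print(pretty)
```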

In [30]:
print(houseSoup.prettify())

<?xml version="1.0" encoding="UTF-8"?>
<html>
 <body>
  <memberdata publish-date="June 6, 2025">
   <title-info>
    <congress-num>
     119
    </congress-num>
    <congress-text>
     One Hundred Nineteenth Congress
    </congress-text>
    <session>
     1
    </session>
    <majority>
     R
    </majority>
    <minority>
     D
    </minority>
    <clerk>
     KEVIN F. McCUMBER
    </clerk>
    <weburl>
     https://clerk.house.gov
    </weburl>
   </title-info>
   <members>
    <member>
     <statedistrict>
      AK00
     </statedistrict>
     <member-info>
      <namelist>
       Begich, Nicholas
      </namelist>
      <bioguideid>
       B001323
      </bioguideid>
      <lastname>
       Begich
      </lastname>
      <firstname>
       Nicholas
      </firstname>
      <middlename>
       J.
      </middlename>
      <sort-name>
       BEGICH,NICHOLAS
      </sort-name>
      <suffix>
       III
      </suffix>
      <courtesy>
       Mr.
      </courtesy>
      <prior-cong

In [32]:
#print a 'prettify' version of the members tag
print(houseSoup.members.prettify())

<members>
 <member>
  <statedistrict>
   AK00
  </statedistrict>
  <member-info>
   <namelist>
    Begich, Nicholas
   </namelist>
   <bioguideid>
    B001323
   </bioguideid>
   <lastname>
    Begich
   </lastname>
   <firstname>
    Nicholas
   </firstname>
   <middlename>
    J.
   </middlename>
   <sort-name>
    BEGICH,NICHOLAS
   </sort-name>
   <suffix>
    III
   </suffix>
   <courtesy>
    Mr.
   </courtesy>
   <prior-congress>
    0
   </prior-congress>
   <official-name>
    Nicholas J. Begich III
   </official-name>
   <formal-name>
    Mr. Begich
   </formal-name>
   <party>
    R
   </party>
   <caucus>
    R
   </caucus>
   <state postal-code="AK">
    <state-fullname>
     Alaska
    </state-fullname>
   </state>
   <district>
    At Large
   </district>
   <townname>
    Chugiak
   </townname>
   <office-building>
    CHOB
   </office-building>
   <office-room>
    153
   </office-room>
   <office-zip>
    20515
   </office-zip>
   <office-zip-suffix>
    0200
   </off

In [44]:
#what is .members?
houseSoup.members

<members><member><statedistrict>AK00</statedistrict><member-info><namelist>Begich, Nicholas</namelist><bioguideid>B001323</bioguideid><lastname>Begich</lastname><firstname>Nicholas</firstname><middlename>J.</middlename><sort-name>BEGICH,NICHOLAS</sort-name><suffix>III</suffix><courtesy>Mr.</courtesy><prior-congress>0</prior-congress><official-name>Nicholas J. Begich III</official-name><formal-name>Mr. Begich</formal-name><party>R</party><caucus>R</caucus><state postal-code="AK"><state-fullname>Alaska</state-fullname></state><district>At Large</district><townname>Chugiak</townname><office-building>CHOB</office-building><office-room>153</office-room><office-zip>20515</office-zip><office-zip-suffix>0200</office-zip-suffix><phone>(202) 225-5765</phone><elected-date date="20241105">November  5, 2024</elected-date><sworn-date date="20250103">January  3, 2025</sworn-date></member-info><committee-assignments><committee comcode="II00" rank="23"></committee><committee comcode="PW00" rank="26"></

In [42]:
#get the contents of .members
houseSoup.members.contents

[<member><statedistrict>AK00</statedistrict><member-info><namelist>Begich, Nicholas</namelist><bioguideid>B001323</bioguideid><lastname>Begich</lastname><firstname>Nicholas</firstname><middlename>J.</middlename><sort-name>BEGICH,NICHOLAS</sort-name><suffix>III</suffix><courtesy>Mr.</courtesy><prior-congress>0</prior-congress><official-name>Nicholas J. Begich III</official-name><formal-name>Mr. Begich</formal-name><party>R</party><caucus>R</caucus><state postal-code="AK"><state-fullname>Alaska</state-fullname></state><district>At Large</district><townname>Chugiak</townname><office-building>CHOB</office-building><office-room>153</office-room><office-zip>20515</office-zip><office-zip-suffix>0200</office-zip-suffix><phone>(202) 225-5765</phone><elected-date date="20241105">November  5, 2024</elected-date><sworn-date date="20250103">January  3, 2025</sworn-date></member-info><committee-assignments><committee comcode="II00" rank="23"></committee><committee comcode="PW00" rank="26"></committe

In [34]:
#what type is the contents

type(houseSoup.members.contents)

list

In [40]:
#what is the first one?

houseSoup.members.contents[0]

<member><statedistrict>AK00</statedistrict><member-info><namelist>Begich, Nicholas</namelist><bioguideid>B001323</bioguideid><lastname>Begich</lastname><firstname>Nicholas</firstname><middlename>J.</middlename><sort-name>BEGICH,NICHOLAS</sort-name><suffix>III</suffix><courtesy>Mr.</courtesy><prior-congress>0</prior-congress><official-name>Nicholas J. Begich III</official-name><formal-name>Mr. Begich</formal-name><party>R</party><caucus>R</caucus><state postal-code="AK"><state-fullname>Alaska</state-fullname></state><district>At Large</district><townname>Chugiak</townname><office-building>CHOB</office-building><office-room>153</office-room><office-zip>20515</office-zip><office-zip-suffix>0200</office-zip-suffix><phone>(202) 225-5765</phone><elected-date date="20241105">November  5, 2024</elected-date><sworn-date date="20250103">January  3, 2025</sworn-date></member-info><committee-assignments><committee comcode="II00" rank="23"></committee><committee comcode="PW00" rank="26"></committee

In [38]:
# how many people are in the house of representatives?
len(houseSoup.members.contents)

441
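Counting with `len(... .contents)` works here because the raw House string shown earlier has no whitespace between tags, so every entry in `.contents` is a `<member>` tag. A hedged sketch of what can happen with an indented file (the sample is hand-written): whitespace between tags becomes its own entry in `.contents`, while `find_all` with `recursive=False` counts only the direct child tags.

```python
from bs4 import BeautifulSoup

# Hand-written sample with newlines and indentation between the tags.
indented_xml = """<members>
  <member><lastname>A</lastname></member>
  <member><lastname>B</lastname></member>
</members>"""

soup = BeautifulSoup(indented_xml, "html.parser")

children = soup.members.contents                              # tags AND whitespace strings
only_tags = soup.members.find_all("member", recursive=False)  # direct <member> children only

print(len(children), len(only_tags))
```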

In [46]:
# we can get stuff out of tags using the find method
#the state full name?

houseSoup.members.contents[0].find('state-fullname')


<state-fullname>Alaska</state-fullname>

In [50]:
#find the first member's last name (lastname.text)
# it is under member-info

houseSoup.members.contents[0].find('member-info').lastname


<lastname>Begich</lastname>

In [54]:
#what data type is this?

type(houseSoup.members.contents[0].find('member-info').lastname.text)

str

In [58]:
#get the state's full name

houseSoup.members.contents[0].find('state-fullname').text


'Alaska'
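One thing to keep in mind: `find()` returns `None` when nothing matches, and calling `.text` on `None` raises an `AttributeError`. A small sketch of guarding against that, on a hand-written sample where the second member deliberately has no state:

```python
from bs4 import BeautifulSoup

# Hand-written sample: the second <member> has no <state-fullname> tag.
snippet = (
    "<members>"
    "<member><state-fullname>Alaska</state-fullname></member>"
    "<member></member>"
    "</members>"
)
soup = BeautifulSoup(snippet, "html.parser")

names = []
for member in soup.find_all("member"):
    tag = member.find("state-fullname")   # a Tag, or None if nothing matched
    names.append(tag.text if tag is not None else None)

print(names)
```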

In [62]:
# let's use a set to store the unique state fullnames
# iterate through the list; member will store each tag in the list one at a time

states = set()

for member in houseSoup.members.contents:

    states.add(member.find('state-fullname').text)

print(states)

{'Rhode Island', 'New York', 'Guam', 'Colorado', 'New Hampshire', 'Washington', 'Idaho', 'Northern Mariana Islands', 'Pennsylvania', 'Delaware', 'South Dakota', 'Tennessee', 'Texas', 'Virgin Islands', 'Maine', 'Nebraska', 'Florida', 'Vermont', 'Alaska', 'Mississippi', 'Puerto Rico', 'Maryland', 'Arkansas', 'California', 'South Carolina', 'District of Columbia', 'Arizona', 'Iowa', 'Montana', 'Alabama', 'Georgia', 'Kansas', 'Michigan', 'Nevada', 'Kentucky', 'North Carolina', 'Oklahoma', 'American Samoa', 'Louisiana', 'Indiana', 'Connecticut', 'Oregon', 'Illinois', 'Missouri', 'Utah', 'North Dakota', 'Hawaii', 'Wisconsin', 'Virginia', 'Minnesota', 'New Jersey', 'Ohio', 'Massachusetts', 'New Mexico', 'West Virginia', 'Wyoming'}


In [66]:
# how many committees are there?

len(houseSoup.committees.contents)


26

In [68]:
#what is the first committee?
houseSoup.committees.contents[0]

<committee com-building-code="LHOB" com-header-text="The chairman and ranking minority member are ex officio members of all subcommittees." com-phone="225-2171" com-room="1301" com-zip="20515" com-zip-suffix="6001" comcode="AG00" type="standing"><committee-fullname>Committee on Agriculture</committee-fullname><ratio><majority>29</majority><minority>25</minority></ratio><subcommittee subcom-building-code="LHOB" subcom-phone="225-2171" subcom-room="1301" subcom-zip="20515" subcom-zip-suffix="0" subcomcode="AG03"><subcommittee-fullname>Nutrition and Foreign Agriculture</subcommittee-fullname><ratio><majority>11</majority><minority>8</minority></ratio></subcommittee><subcommittee subcom-building-code="LHOB" subcom-phone="225-2171" subcom-room="1301" subcom-zip="20515" subcom-zip-suffix="0" subcomcode="AG14"><subcommittee-fullname>Conservation, Research, and Biotechnology</subcommittee-fullname><ratio><majority>10</majority><minority>9</minority></ratio></subcommittee><subcommittee subcom-b

In [70]:
#get the name of the committee
#hint - tag name is 'committee-fullname'

houseSoup.committees.contents[0].find('committee-fullname')


<committee-fullname>Committee on Agriculture</committee-fullname>

### Exercise 1: 

Print out all committee names.

In [80]:
for committee in houseSoup.committees.contents:
    print(committee.find('committee-fullname').text)

Committee on Agriculture
Committee on Appropriations
Committee on Armed Services
Committee on Financial Services
Committee on the Budget
Committee on Education and Workforce
Committee on Foreign Affairs
Committee on Oversight and Government Reform
Committee on House Administration
Committee on Homeland Security
Committee on Energy and Commerce
Committee on Natural Resources
Committee on the Judiciary
Committee on Transportation and Infrastructure
Committee on Rules
Committee on Small Business
Committee on Ethics
Committee on Science, Space, and Technology
Committee on Veterans' Affairs
Committee on Ways and Means
Permanent Select Committee on Intelligence
Select Committee on the  Strategic Competition Between the United States and the Chinese Communist Party
Joint Economic Committee
Joint Committee on Taxation
Joint Committee on the Library
Joint Committee on Printing


### Exercise 2: 

Print out all committee names with their subcommittees.

In [90]:
for committee in houseSoup.committees.contents:
    print(committee.find('committee-fullname').text)
    for subCom in committee.find_all('subcommittee-fullname'):
        print('\t' + subCom.text)

Committee on Agriculture
	Nutrition and Foreign Agriculture
	Conservation, Research, and Biotechnology
	Forestry and Horticulture
	General Farm Commodities, Risk Management, and Credit
	Commodity Markets, Digital Assets, and Rural Development
	Livestock, Dairy, and Poultry
Committee on Appropriations
	Agriculture, Rural Development, Food and Drug Administration, and Related Agencies
	Defense
	National Security, Department of State, and Related Programs
	Interior, Environment, and Related Agencies
	Labor, Health and Human Services, Education, and Related Agencies
	Energy and Water Development and Related Agencies
	Homeland Security
	Military Construction, Veterans Affairs, and Related Agencies
	Commerce, Justice, Science, and Related Agencies
	Transportation, Housing and Urban Development, and Related Agencies
	Financial Services and General Government
	Legislative Branch
Committee on Armed Services
	Military Personnel
	Readiness
	Tactical Air and Land Forces
	Intelligence and Special O
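Instead of printing, the same `find` / `find_all` pattern can collect the results into a dictionary. The sketch below runs on a shortened, hand-copied sample of the committee data shown above rather than on the live file. It also restricts the search to the direct children of `<committees>`, since `<committee>` tags appear under each member's `committee-assignments` as well.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the first committee in the output above.
sample = (
    "<committees>"
    '<committee comcode="AG00" type="standing">'
    "<committee-fullname>Committee on Agriculture</committee-fullname>"
    '<subcommittee subcomcode="AG03">'
    "<subcommittee-fullname>Nutrition and Foreign Agriculture</subcommittee-fullname>"
    "</subcommittee>"
    '<subcommittee subcomcode="AG14">'
    "<subcommittee-fullname>Conservation, Research, and Biotechnology</subcommittee-fullname>"
    "</subcommittee>"
    "</committee>"
    "</committees>"
)
soup = BeautifulSoup(sample, "html.parser")

subcommittees_by_committee = {}
codes = {}
for committee in soup.committees.find_all("committee", recursive=False):
    name = committee.find("committee-fullname").text
    subs = [s.text for s in committee.find_all("subcommittee-fullname")]
    subcommittees_by_committee[name] = subs
    codes[committee["comcode"]] = name   # attribute lookup on the tag

print(subcommittees_by_committee)
print(codes)
```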

<!--  -->

## HTML Text

[HyperText Markup Language](https://www.w3schools.com/html/html_intro.asp) is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. HTML elements tell the browser how to display the content. HTML elements label pieces of content such as "this is a heading", "this is a paragraph", "this is a link", etc.

In [92]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [94]:
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [96]:
# get the title of the html doc

soup.title.text

"The Dormouse's story"

In [98]:
#get the name of the title tag

soup.title.name

'title'

In [100]:
#get the string associated with the title

soup.title.string

"The Dormouse's story"

In [104]:
#get the name of the parent tag

soup.title.parent.name


'head'

In [106]:
# get the first paragraph tag (soup.p returns only the first match)
soup.p


<p class="title"><b>The Dormouse's story</b></p>

In [112]:
#get the 'class' attribute of the paragraph

soup.p['class']


['title']
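Notice that `class` comes back as a list: an HTML tag can have several classes, so Beautiful Soup treats `class` as a multi-valued attribute, while single-valued attributes such as `id` and `href` come back as plain strings. A short sketch using one of the links from the document above:

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(html, "html.parser").a

classes = link["class"]      # multi-valued attribute: a list
link_id = link["id"]         # single-valued attribute: a string
title = link.get("title")    # .get() returns None for a missing attribute
                             # (link["title"] would raise KeyError)

print(classes, link_id, title)
print(link.attrs)            # all attributes as a dictionary
```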

In [114]:
#get the first 'a' tag
soup.a


<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [116]:
#find all 'a' tags

soup.find_all('a')


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [124]:
# go through all of the 'a' tags and get the 'href' (link)

for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [126]:
#print the text in the soup

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



### Exercise 3: 

Get the `p` tags with the class `story`. Use the documentation if you need any help.
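As a hint, here is a sketch of class-based searching on an unrelated, made-up snippet. Because `class` is a reserved word in Python, `find_all` takes the keyword `class_`; `select` accepts CSS selectors and gives the same result here.

```python
from bs4 import BeautifulSoup

# Made-up snippet, unrelated to the exercise document.
html = """<ul>
<li class="fruit">apple</li>
<li class="vegetable">carrot</li>
<li class="fruit ripe">banana</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("li", class_="fruit")   # matches any tag whose classes include "fruit"
by_selector = soup.select("li.fruit")              # the same search as a CSS selector

keyword_text = [li.text for li in by_keyword]
selector_text = [li.text for li in by_selector]
print(keyword_text, selector_text)
```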

<!--  -->

## NY Times

In [None]:
#get the soup of the nytimes
#note: you have to use an html parser


In [29]:
# look at the soup object


In [30]:
#find all the 'div' tags


In [31]:
#find all heading 3 tags


### Exercise 4: 

Print out all the headlines
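The general pattern can be sketched offline on a made-up page. The HTML below is invented and is not the structure of the NY Times site, which you will need to inspect yourself. `get_text(" ", strip=True)` joins the pieces of text inside each heading and trims the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Invented page; the real site will use different tags and nesting.
html = """
<div><h3> First made-up headline </h3><p>Body text</p></div>
<div><h3>Second <em>made-up</em> headline</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

headlines = [h3.get_text(" ", strip=True) for h3 in soup.find_all("h3")]

for headline in headlines:
    print(headline)
```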

<!--  -->

## Music lyrics

We're using songlyrics.com.

In [130]:
songHTML=requests.get("https://www.songlyrics.com/kygo-selena-gomez/it-ain-t-me-lyrics/").text

#create a navigable tree of objects using beautiful soup
songSoup = BeautifulSoup(songHTML, 'html.parser')

In [131]:
#pretty printing



In [132]:
#Find all of the tags on the lyrics page. 

tags = set()

for tag in songSoup.find_all():
    tags.add(tag.name)

print(tags)

In [None]:
# What do the tags do? What does the <br> tag do?  The <h3> tag?

In [35]:
#Print the top level tags and the text associated with them for the <p> tags. Does this help you locate the lyrics?



### Exercise 5: 

Retrieve the tag containing the lyrics. Remove the HTML tags and print the lyrics.
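As a hint for the "remove the HTML tags" step, here is a sketch on a made-up snippet. The `id="lyrics"` attribute and the text are invented, so you will still need to find the real tag on the page. Passing a separator to `get_text` puts each piece of text that was split by a `<br>` on its own line.

```python
from bs4 import BeautifulSoup

# Invented snippet: lines separated by <br> tags inside one <p>.
html = '<p id="lyrics">first made-up line<br>second made-up line<br/>third made-up line</p>'
soup = BeautifulSoup(html, "html.parser")

lyrics_tag = soup.find("p", id="lyrics")                  # hypothetical id, for illustration only
lyrics = lyrics_tag.get_text(separator="\n", strip=True)  # drop the tags, keep the line breaks

print(lyrics)
```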

<!--  -->

### Exercise 6 - Pedals

Fuzz pedals are great. Let's grab some information about different fuzz pedals from a web page: http://www.guitarsite.com/fuzz-pedals/

#### 6a. Problem 1: More Fuzz

- Make a request to the fuzz-pedals page
- Make it a soup object

#### 6b. Get Info

There's some information about fuzz pedals in one of the HTML tables on the page. One line of code will retrieve all of the "table" tags on the page.

- Find the number of tables on the page

#### 6c. Image Descriptions

Find the right images and descriptions of the first pedal. 

Hint: `alt` is for alternative description.
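The table and image lookups can be sketched on an invented page. The file names and `alt` text below are made up and will not match the real site. Using `.get("alt")` avoids an error for images that have no `alt` attribute.

```python
from bs4 import BeautifulSoup

# Invented page with two tables and two images, one of them without alt text.
html = """
<table><tr><td><img src="pedal-a.jpg" alt="Made-up Pedal A"></td><td>Pedal A</td></tr></table>
<table><tr><td>Another table</td></tr></table>
<img src="logo.png">
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")           # every <table> tag on the page
first_image = tables[0].find("img")       # first image inside the first table
alts = [img.get("alt") for img in soup.find_all("img")]  # None where alt is missing

print(len(tables))
print(first_image["src"], first_image["alt"])
print(alts)
```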
