# Week 4 Day 3: XML, HTML, and Beautiful Soup

## What is a Web Scraper?

Sometimes a website does not offer an Application Programming Interface (API). In these cases, one can build a **Web Scraper**. Web scraping is the process of using bots to extract content and data from a website, specifically from the underlying HTML code.

With an API, you have a communication tool: you, the *User*, communicate with the *Client*, the computer that sends the request to the *Server*, the computer that responds to your request. With a Web Scraper, you interact directly with the Server.

To do this, we will need a couple of libraries. One is `requests`, a library that allows you to send HTTP requests to a server. We will use it to request the HTML content of a specific web page from the server. [Here is the documentation](https://pypi.org/project/requests/).

In [1]:
import requests
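
As a minimal sketch of how `requests` is typically used (the helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of this notebook), the function below sends a GET request, raises an error on a 4xx or 5xx status, and returns the body as text. The last lines build the same kind of request without sending it, which lets you inspect the method and URL offline.

```python
import requests


def fetch_html(url, timeout=10):
    """Send a GET request and return the response body as text.

    `fetch_html` and the default timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise requests.HTTPError on a 4xx or 5xx status
    return response.text


# Build (but do not send) a request to see its method and final URL.
prepared = requests.Request(
    "GET", "http://clerk.house.gov/xml/lists/MemberData.xml"
).prepare()
print(prepared.method, prepared.url)
```

Calling `fetch_html(prepared.url)` would perform the actual download, so it needs a network connection.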

### XML

The [Extensible Markup Language](https://www.w3.org/XML/) (XML) is a markup language for representing data structures. XML was all the rage at the turn of the century: "many software designers can barely contain their excitement over its potential to establish a real Internet lingua franca" (*The New York Times* in 2000: "[The Next Big Step? It's Called XML](https://www.nytimes.com/2000/06/07/business/the-next-big-leap-it-s-called-xml.html)"). That obviously did not come to pass. But XML remains a robust and open—though verbose—standard for representing structured data.

XML has taken on something of an afterlife as the official data standard for the U.S. Congress. The [House](http://clerk.house.gov/index.aspx) and [Senate](https://www.senate.gov/general/XML.htm) both release information about members, committees, schedules, legislation, and votes in XML. These are immaculately formatted and documented and remarkably up-to-date: the data for members of the 118th Congress are already posted.

[Congress MemberData XML schema](https://clerk.house.gov/member_info/MemberData_UserGuide.pdf)

Use the `requests` library to make an HTTP GET request to the House's web server and get the list of current member data.

## House XML

In [2]:
# let's grab some XML
house_raw = requests.get('http://clerk.house.gov/xml/lists/MemberData.xml').text  # this grabs the XML as plain text

In [1]:
#view it


In [2]:
#the last 1000 lines 


In [3]:
#what data type is the raw one

This data is still in a string format (`type(house_raw)`), so it's difficult to search and navigate. Let's make our first soup together using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).


## Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In [6]:
from bs4 import BeautifulSoup

You may need to install an XML parser. **If you do, you may need to close your Jupyter notebook after you install it and open it back up again.**

In [7]:
# how do we make sense of all these tags? beautiful soup!
houseSoup = BeautifulSoup(house_raw, "lxml")  # create a BeautifulSoup object from the XML text, parsed with lxml



In [12]:
# pip install lxml

In [13]:
# pip install lxml bs4

What's so great about this soup-ified string? We now have a suite of new functions and methods that let us navigate the tree. First, let's inspect the different tags/elements in this tree of House member data. This is the full tree of data.
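
To make the idea of navigating the tree concrete, here is a small sketch on a hand-written sample. The sample is hypothetical: its tag names are loosely modelled on the ones mentioned in this notebook, the `postal-code` attribute and the member names are made up, and it uses Python's built-in `html.parser` rather than `lxml` so it runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written sample; the real MemberData.xml is much larger.
sample_xml = """
<members>
  <member>
    <member-info>
      <lastname>Doe</lastname>
      <state postal-code="AK"><state-fullname>Alaska</state-fullname></state>
    </member-info>
  </member>
  <member>
    <member-info>
      <lastname>Roe</lastname>
      <state postal-code="AL"><state-fullname>Alabama</state-fullname></state>
    </member-info>
  </member>
  <member>
    <member-info>
      <lastname>Poe</lastname>
      <state postal-code="AL"><state-fullname>Alabama</state-fullname></state>
    </member-info>
  </member>
</members>
"""

# The built-in "html.parser" keeps this sketch free of extra installs.
sample_soup = BeautifulSoup(sample_xml, "html.parser")

member_tags = sample_soup.find_all("member")  # every <member> tag, in document order
first_member = member_tags[0]

print(len(member_tags))                           # number of members in the sample
print(first_member.find("lastname").text)         # text inside the first <lastname>
print(first_member.find("state")["postal-code"])  # attributes are read like dict keys

# A set keeps one copy of each state name, however many members share it.
state_names = {m.find("state-fullname").text for m in member_tags}
print(sorted(state_names))
```

The same calls (`find`, `find_all`, `.text`, and dictionary-style attribute access) apply to `houseSoup`, though the real tag and attribute names should be checked against the data itself.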

In [4]:
#what data type is the soup object?


**.prettify()**

`.prettify()` returns the parse tree as a nicely indented string, with each tag and each piece of text on its own line.

In [None]:
print(houseSoup.prettify())
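
Printing the whole tree can be overwhelming. As a small sketch (the one-line markup below is hypothetical and parsed with the built-in `html.parser`), `.prettify()` can also be called on a single tag to format just that subtree.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup, parsed with the built-in "html.parser".
tiny_soup = BeautifulSoup(
    "<members><member><lastname>Doe</lastname></member></members>",
    "html.parser",
)

# Attribute access returns the first tag with that name; prettify formats only its subtree.
pretty_member = tiny_soup.member.prettify()
print(pretty_member)
```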

In [5]:
#print a 'prettify' version of the members tag


In [6]:
#what is .members?

In [7]:
#get the contents of .members


In [8]:
#what type is the contents


In [9]:
#what is the first one?

In [10]:
# how many people are in the house of representatives?


In [11]:
# we can get stuff out of tags using the find method
#the state full name?



In [12]:
#find the first member's last name (lastname.text)
# it is under member-info


In [13]:
#what data type is this?


In [14]:
#get the state's full name



In [15]:
# let's use a set to store the unique state fullnames
# iterate through the list; member will store each tag in the list one at a time



In [16]:
# how many committees are there?


In [17]:
#what is the first committee?


In [18]:
#get the name of the committee
#hint - tag name is 'committee-fullname'


### Exercise 1: 

Print out all committee names.

### Exercise 2: 

Print out all committee names with their subcommittees.


## HTML Text

[HyperText Markup Language](https://www.w3schools.com/html/html_intro.asp) (HTML) is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. HTML elements tell the browser how to display the content by labeling pieces of it, such as "this is a heading", "this is a paragraph", "this is a link", etc.
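
As a quick sketch of how those labels show up in Beautiful Soup (the snippet and its URL are hypothetical, and the built-in `html.parser` is used), each element becomes a tag object with a name, attributes, and text.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with one heading, one paragraph, and one link.
snippet = '<h1>Fuzz 101</h1><p>Read the <a href="http://example.com/guide">guide</a>.</p>'
snippet_soup = BeautifulSoup(snippet, "html.parser")

print(snippet_soup.h1.name)       # the tag's name
print(snippet_soup.h1.string)     # the heading's text
print(snippet_soup.a["href"])     # the link target stored in the href attribute
print(snippet_soup.p.get_text())  # all text inside the paragraph, tags removed
```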

In [47]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [48]:
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [19]:
# get the title of the html doc



In [20]:
#get the name of the title tag



In [21]:
#get the string associated with the title



In [22]:
#get the name of the title tag's parent



In [23]:
# get the paragraph tags



In [24]:
#get the 'class' within the paragraph




In [25]:
#get the 'a' tag



In [26]:
#find all 'a' tags



In [27]:
# go through all of the 'a' tags and get the 'href' (link)


In [28]:
#print the text in the soup

### Exercise 3: 

Get the `p` tags with the class `story`. Use the documentation if you need any help.


## NY Times

In [None]:
#get the soup of the nytimes
#note: you have to use an html parser


In [29]:
# look at the soup object


In [30]:
#find all the 'div' tags


In [31]:
#find all heading 3 tags


### Exercise 4: 

Print out all the headlines


## Music lyrics

We're using songlyrics.com.

In [None]:
songHTML = requests.get("https://www.songlyrics.com/kygo-selena-gomez/it-ain-t-me-lyrics/").text

#create a navigable tree of objects using beautiful soup


In [33]:
#pretty printing



In [34]:
#Find all of the tags on the lyrics page. 


In [None]:
# What do the tags do? What does the <br> tag do? The <h3> tag?
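
As a hint for thinking about `<br>`, here is a minimal sketch on made-up markup (the `id` value `verse` and the text are hypothetical, not taken from the lyrics page). A `<br>` tag is a line break with no text of its own, so the text around it arrives as separate strings that can be rejoined with newlines.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id "verse" and the text are made up for illustration.
verse_html = '<h3>Song title</h3><p id="verse">First line<br>Second line<br>Third line</p>'
verse_soup = BeautifulSoup(verse_html, "html.parser")

verse_tag = verse_soup.find("p", id="verse")

# Join the strings on either side of each <br> with a newline.
verse_text = verse_tag.get_text(separator="\n")
print(verse_text)
```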

In [35]:
#Print the top level tags and the text associated with them for the <p> tags. Does this help you locate the lyrics?



### Exercise 5: 

Retrieve the tag containing the lyrics. Remove the HTML tags and print the lyrics.


### Exercise 6 - Pedals

Fuzz pedals are great. Let's grab some information about different fuzz pedals from a web page: http://www.guitarsite.com/fuzz-pedals/

#### 6a. Problem 1: More Fuzz

- Make a request for the fuzz-pedals page.
- Turn the response into a soup object.

#### 6b. Get Info

There's some information about fuzz pedals in one of the html tables on the page.  One line of code will retrieve all of the "table" tags on the page.

- Find the number of tables on the page.
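
A minimal sketch of the idea on a hypothetical page (the markup below is made up, and the real page's tables will differ): `find_all("table")` returns every table tag in one call, and each table can then be read row by row.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; the real page's markup will differ.
page_html = """
<table><tr><th>Pedal</th><th>Type</th></tr><tr><td>Pedal A</td><td>Fuzz</td></tr></table>
<p>Some text between tables.</p>
<table><tr><td>Another table</td></tr></table>
"""
page_soup = BeautifulSoup(page_html, "html.parser")

tables = page_soup.find_all("table")  # one call returns every <table> tag
print(len(tables))

# Read the first table row by row, keeping the text of each header or data cell.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in tables[0].find_all("tr")
]
print(rows)
```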

#### 6c. Image Descriptions

Find the right images and descriptions of the first pedal. 

Hint: the `alt` attribute holds an image's alternative text description.
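
As a sketch of reading `alt` text (the file names and descriptions below are hypothetical), `.get("alt")` is a safe way to read an attribute that some images may not have.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; file names and alt text are made up for illustration.
gallery_html = """
<img src="pedal-a.jpg" alt="Pedal A, front view">
<img src="spacer.gif">
<img src="pedal-b.jpg" alt="Pedal B, front view">
"""
gallery_soup = BeautifulSoup(gallery_html, "html.parser")

# .get("alt") returns None when the attribute is missing, so no KeyError is raised.
descriptions = [
    (img["src"], img.get("alt"))
    for img in gallery_soup.find_all("img")
    if img.get("alt")
]
print(descriptions)
```

On the real page, you would still need to work out which images belong to the first pedal, for example by looking at the tags that surround them.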
