# Web Scraping: BeautifulSoup
_Collecting data from the internet and parsing it into meaningful (often tabular) form._

### Docs

- [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Installation

If you are not using the Metis kernel, install the following libraries.

With conda:
- `conda install beautifulsoup4 requests lxml html5lib`

With pip:
- `pip install beautifulsoup4 requests lxml html5lib`

If you have installed everything correctly, you should now be able to import these libraries. (You may need to shut down and restart this notebook for BeautifulSoup to recognize the `lxml` and `html5lib` parsers.)

In [1]:
from bs4 import BeautifulSoup
import requests

## What is BeautifulSoup?

- Python library
- **HTML parser**:  Interprets structure of HTML file 
- Does not actually get pages from the web.  Use the `requests` library for that.

<br>
<img src="images/web_scraping_pipeline.png" alt="Web Scraping Pipeline" style="width: 650px;"/>

## Intro to HTML

_Basic language used to create a webpage._

- Tells browser what, where, and how to display text, images, and other media
- Structured, hierarchical nature 
- Composed of "elements" with properties

Example HTML element
```html
<tag-name attr1="value of attr1" attr2="value of attr2" .... attrN="value of attrN">
    Inner text of the tag
</tag-name>
```

### Tags,  `<tag-name>`

- Elements are labeled with tags
- Tells us what type of "thing" to render

    
| Tag | Use |
|--- | --- |
|`h1`, `h2`, ..., `h6`| headers|
|`p`| paragraphs |
|`a`| anchors (e.g. links) |
|`div`| divisions (or sections) of a page |
|`img`| images |
|`li` | list items |


### Attributes, `attr1`

- Special properties we want this tag to have
- Typically appear as `attribute name = "attribute value"` pair

| Attribute | Use | Notes |
|---   | ---       | ---|
|`href` | hyperlink reference | Clicking this element sends the user to the URL given as the value |
|`class`| style class | Many elements may have the same class |
|`id`| unique identifier | Only one element per id! |
|`style`| extra element styling | Bad practice, use CSS instead |

### Inner HTML Text

- Text that appears between tags
- Often the information we want to extract during web scraping
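
To make these three pieces concrete, here is a minimal sketch that parses a single element and reads back its tag name, attributes, and inner text. The sample markup is invented for illustration, and `html.parser` (Python's built-in parser) is used only so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical element, made up for this illustration
snippet = '<a href="https://www.example.com" class="external" id="demo-link">An example link</a>'

element = BeautifulSoup(snippet, "html.parser").find("a")

print(element.name)    # the tag name
print(element.attrs)   # the attributes, as a dictionary
print(element.text)    # the inner text
```

Note that `class` comes back as a list rather than a string, because an element may carry several classes at once.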

### HTML Structure

A full HTML document has a structure similar to this:

```html
<html> 
  <head> </head>
  <body>
     <h1>This is a header</h1>
     <p style="color:red;" id="learning_paragraph">You are learning HTML</p>
     <a href="https://www.google.com">A link to Google</a>
  </body>
</html>
```

**QUESTIONS**
> How many elements do we have within the HTML body?  What are their tags?

> What is the inner HTML of the header element?

> What attribute(s) does the paragraph have?  And the attribute value(s)?

Saving this code as a .html file and opening it in a browser should yield:

<br>
<img src="images/example_html.png" alt="Rendering of Example HTML" style="width: 300px;" align="left"/>

## Learn to Scrape with Dummy HTML

Let's begin learning how to scrape by working with some dummy HTML, written below as a string.

In [2]:
my_html = """
<html>

<head>
</head>

<body>
    <div style="border: 1px solid">
        There isn't much in this file, except a list of to-do items. 
        <ul>
          <li>Make coffee</li>
          <li>Sweep the floor</li>
          <li>Go to the store</li>
          <li>Write BeautifulSoup lecture</li>
        </ul>
    </div>
</body>

</html>
"""

Let's take a look at this simple webpage.

In [3]:
from IPython.display import display, HTML
display(HTML(my_html))     # make sure Jupyter knows to display it as HTML

If we want to grab the four items on our to-do list and analyze them, we can use Beautiful Soup!

In [4]:
soup = BeautifulSoup(my_html, "html5lib")

Simply looking at `soup` isn't very useful -- it's just our HTML repeated back to us.

In [5]:
soup

<html><head>
</head>

<body>
    <div style="border: 1px solid">
        There isn't much in this file, except a list of to-do items. 
        <ul>
          <li>Make coffee</li>
          <li>Sweep the floor</li>
          <li>Go to the store</li>
          <li>Write BeautifulSoup lecture</li>
        </ul>
    </div>



</body></html>

### `.find()`

But Beautiful Soup also knows how to navigate this HTML.  We can use the `find` method to get to a specific element.

In [6]:
soup.find('li')  #Grabs the first element tagged as li

<li>Make coffee</li>

In [7]:
type(soup.find('li'))

bs4.element.Tag

`find` returns a tagged element, but we can go further and just select this element's inner HTML text.

In [8]:
soup.find('li').text

'Make coffee'

In [9]:
type(soup.find('li').text)

str
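
One caveat worth sketching: when nothing matches, `find` returns `None`, so chaining `.text` directly onto it raises an `AttributeError`. The snippet below uses a tiny stand-in document (not the to-do page above) and a simple guard; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Make coffee</li></ul>"   # tiny stand-in document
soup_demo = BeautifulSoup(html, "html.parser")

missing = soup_demo.find("table")        # there is no <table> here
print(missing)                           # None

# Guard before touching .text so a missing element does not raise
first_item = soup_demo.find("li")
item_text = first_item.text if first_item is not None else None
table_text = missing.text if missing is not None else None
```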

### `.find_all()`

Instead of selecting just one of our list items, we can get all of them by using `find_all`.  

This method searches the entire document for elements matching our criteria and returns them in a list-like `ResultSet`.

In [10]:
soup.find_all('li')

[<li>Make coffee</li>,
 <li>Sweep the floor</li>,
 <li>Go to the store</li>,
 <li>Write BeautifulSoup lecture</li>]

To analyze our to-do list, we probably just want the text from each tagged element.  How can we do that?

One approach is to loop through the list and apply `.text` to each element:

In [11]:
todos=[]

for element in soup.find_all('li'):
    todos.append(element.text)
    
print(todos)

['Make coffee', 'Sweep the floor', 'Go to the store', 'Write BeautifulSoup lecture']


Or we could use a list comprehension:

In [12]:
todos=[element.text for element in soup.find_all('li')]

todos

['Make coffee',
 'Sweep the floor',
 'Go to the store',
 'Write BeautifulSoup lecture']

Now we have a clean list of strings, ready for analysis!
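
Real pages are rarely this tidy: inner text often carries indentation and newlines. As a small sketch (with made-up markup), `get_text(strip=True)` trims that whitespace, whereas `.text` returns it untouched.

```python
from bs4 import BeautifulSoup

# Invented markup with stray whitespace inside the list items
messy_html = """
<ul>
  <li>
      Make coffee
  </li>
  <li>  Sweep the floor </li>
</ul>
"""
messy_soup = BeautifulSoup(messy_html, "html.parser")

raw = [li.text for li in messy_soup.find_all("li")]                    # whitespace kept
clean = [li.get_text(strip=True) for li in messy_soup.find_all("li")]  # whitespace trimmed

print(raw)
print(clean)
```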

## Scrape Select Items on a Test Webpage

Now on to a more complicated example.  Take a look at `test_webpage/page.html`.  Let's try to grab all of the article links, like the ones for Starbucks and Bitcoin.

First get the HTML and then parse it with BeautifulSoup.

In [13]:
webpage_string = 'test_webpage/page.html'

In [14]:
# open() reads local files only; URLs must be fetched with requests instead
with open(webpage_string) as page:
    test_html = page.read()

soup = BeautifulSoup(test_html, 'lxml')

Links show up as `a` tags.  Let's just try to grab all of them.

In [None]:
soup.find_all('a')

Uh oh!  Looks like there are some links on the sidebar and in the footer, too.  We want only the ones in the articles so we'll need a better strategy.

Digging into the source code, it turns out that each of the articles lives within a `div` labeled with the class `article`.  Let's try to get those.

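### `class_` and `id`
The `find` and `find_all` methods take optional attribute arguments so you can filter down to elements with specific attributes like classes and ids. Because `class` is a reserved word in Python, the keyword is spelled `class_`; `id` needs no underscore.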

In [None]:
soup.find_all('div', class_='article')

There are our articles!  

Each of these `div` elements is a `Tag` object that supports the same search methods as the soup, so we can now query these `div`s to drill down further to just the links.

In [None]:
for div in soup.find_all('div', class_='article'):
    for link in div.find_all('a'):
        print(link)

Excellent!  What if we want to print out the link text and the url it points to?


### `.get()`
The `get` method gives you access to any attribute of the element and returns `None` if the attribute is absent.

In [None]:
for div in soup.find_all('div', class_='article'):
    for link in div.find_all('a'):
        print(f'{link.text:20s} ---> {link.get("href")}')
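
If `test_webpage/page.html` is not at hand, the same ideas can be tried on a self-contained snippet. This is only a sketch: the markup below is invented and merely assumes a layout similar to the test page (articles in `div`s with class `article`, plus a sidebar).

```python
from bs4 import BeautifulSoup

# Invented markup; the structure of the real test page is an assumption
sample_page = """
<div id="sidebar"><a href="/about">About</a></div>
<div class="article"><a href="/starbucks">Starbucks</a></div>
<div class="article featured"><a href="/bitcoin">Bitcoin</a></div>
"""
sample_soup = BeautifulSoup(sample_page, "html.parser")

# class_ matches any element whose class list contains "article"
articles = sample_soup.find_all("div", class_="article")

# id is passed without an underscore
sidebar = sample_soup.find("div", id="sidebar")

# Collect (link text, href) pairs from the article divs only
article_links = [(a.text, a.get("href")) for div in articles for a in div.find_all("a")]
print(article_links)

# get returns None for an attribute the element does not have
missing_attr = sidebar.find("a").get("title")
```

The sidebar link is excluded because the search for `a` tags is run only inside the matched `div`s.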

## Scrape the Web

So far we've used BeautifulSoup to parse our own HTML strings and files.  Now let's scrape a live site: the IMSLP category page for Ludwig van Beethoven.

First let's take a look at the page's source code.

- Navigate to https://imslp.org/wiki/Category:Beethoven,_Ludwig_van in your browser, preferably Chrome
- Right click and select "Inspect"
- Alternatively, you can "View Page Source"

To retrieve the HTML for this webpage, we will use `requests`.

### `requests`

The `requests` library allows us to grab information from the web.  There are two common types of requests:
- `get` -- simply requests information, akin to putting a url in your browser
- `post` -- sends information to the website, for example, submitting a form

We will be using `get` to retrieve a page's HTML.

In [15]:
url = 'https://imslp.org/wiki/Category:Beethoven,_Ludwig_van'
response = requests.get(url)

The response we got back is an object that gives us access to:
- `response.text` -- the returned HTML (if any)
- `response.json()` -- the returned JSON parsed into Python objects (if any), typical for APIs
- `response.status_code` -- a [code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) telling you whether the request succeeded or an error occurred; 2XX indicates success, while 404 means not found

In [16]:
response.status_code  #200 = success!

200
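
Checking the status code by eye works in a notebook, but in a script it can help to fail loudly instead. The helper below is a hedged sketch, not part of `requests` itself: `fetch_html` is a hypothetical name, and the 10-second default timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML, raising an exception on HTTP errors.

    fetch_html is a hypothetical helper name; the timeout default is arbitrary.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4XX/5XX responses
    return response.text
```

Calling `fetch_html(url)` with the `url` defined above would then return the same string as `response.text`, or raise if the server reports an error.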

In [17]:
response.text[:1000]  #First 1000 characters of the HTML

'<!DOCTYPE html>\n<html lang="en" dir="ltr" class="client-nojs">\n<head>\n<title>Category:Beethoven, Ludwig van - IMSLP: Free Sheet Music PDF Download</title>\n<meta charset="UTF-8" />\n<meta name="generator" content="MediaWiki 1.18.1" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\n<meta name="smartbanner:title" content="IMSLP" />\n<meta name="smartbanner:author" content="Project Petrucci LLC" />\n<meta name="smartbanner:price" content="FREE" />\n<meta name="smartbanner:button" content="VIEW" />\n<meta name="smartbanner:hide-ttl" content="3600000" />\n<meta name="smartbanner:enable-ios" content="1" />\n<meta name="smartbanner:price-suffix-apple" content=" - On the App Store" />\n<meta name="smartbanner:icon-apple" content="https://imslp.org/images/c/cf/Iosicon.jpg" />\n<meta name="smartbanner:button-url-apple" content="https://itunes.apple.com/us/app/imslp/id1373671782" />\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />\n<link rel=

In [26]:
page = response.text

In [27]:
display(HTML(page)) 

[Output: the IMSLP category page rendered inline. It consists of alphabetical tables of work titles, beginning "Abbé Stadler, WoO 178 (Beethoven, Ludwig van)", "Abendlied unter'm gestirnten Himmel, WoO 150 (Beethoven, Ludwig van)", and so on.]


### `BeautifulSoup` Basics

Now that we have the HTML, let's learn its structure by parsing with BeautifulSoup.

In [19]:
soup = BeautifulSoup(page, "lxml")

In [20]:
print(soup)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<title>Category:Beethoven, Ludwig van - IMSLP: Free Sheet Music PDF Download</title>
<meta charset="utf-8"/>
<meta content="MediaWiki 1.18.1" name="generator"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="IMSLP" name="smartbanner:title"/>
<meta content="Project Petrucci LLC" name="smartbanner:author"/>
<meta content="FREE" name="smartbanner:price"/>
<meta content="VIEW" name="smartbanner:button"/>
<meta content="3600000" name="smartbanner:hide-ttl"/>
<meta content="1" name="smartbanner:enable-ios"/>
<meta content=" - On the App Store" name="smartbanner:price-suffix-apple"/>
<meta content="https://imslp.org/images/c/cf/Iosicon.jpg" name="smartbanner:icon-apple"/>
<meta content="https://itunes.apple.com/us/app/imslp/id1373671782" name="smartbanner:button-url-apple"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
<link href="/apple-touch-icon.png" rel="a

The `prettify` method turns the soup into a nicely formatted Unicode string with one tag on each line for readability.

In [21]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <title>
   Category:Beethoven, Ludwig van - IMSLP: Free Sheet Music PDF Download
  </title>
  <meta charset="utf-8"/>
  <meta content="MediaWiki 1.18.1" name="generator"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="IMSLP" name="smartbanner:title"/>
  <meta content="Project Petrucci LLC" name="smartbanner:author"/>
  <meta content="FREE" name="smartbanner:price"/>
  <meta content="VIEW" name="smartbanner:button"/>
  <meta content="3600000" name="smartbanner:hide-ttl"/>
  <meta content="1" name="smartbanner:enable-ios"/>
  <meta content=" - On the App Store" name="smartbanner:price-suffix-apple"/>
  <meta content="https://imslp.org/images/c/cf/Iosicon.jpg" name="smartbanner:icon-apple"/>
  <meta content="https://itunes.apple.com/us/app/imslp/id1373671782" name="smartbanner:button-url-apple"/>
  <meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
  <li
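
To see what parsing and `prettify` do without the noise of a full page, here is a minimal sketch on a small, made-up HTML string. The markup and the variable names are assumptions for illustration, and it uses the built-in `html.parser` so it runs even if `lxml` is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
html = '<html><head><title>Demo page</title></head><body><p id="intro">Hello</p></body></html>'

# "html.parser" ships with Python; "lxml" would also work if installed
demo_soup = BeautifulSoup(html, "html.parser")

# prettify returns a plain str with one tag per line, indented by depth
pretty = demo_soup.prettify()
print(pretty)

# The parsed tree can be navigated by tag name
print(demo_soup.title.text)
```

Printing the soup directly and printing `prettify()` show the same document; only the whitespace differs.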

# Start here

### To do next
- from each page extract 
    - year of composition
    - opus number
    - posthumous or not
    - first publication
    - year of first performance
    - duration
- put into pandas df


In [25]:
# catalog = soup.find_all('div', class_='catpglnksp1')
unordered_lists = soup.find_all('ul')

# item 21 on this list is the list of compositions starting with the letter A
#print(unordered_lists[21])

first_ul_that_is_a_composition = 21
last_ul_on_page_that_is_a_composition = 26-10

for alphabetic_position in range(first_ul_that_is_a_composition, 21+last_ul_on_page_that_is_a_composition):
    compositions_list = unordered_lists[alphabetic_position].find_all('a')

    for work in compositions_list:
        work_url = 'https://imslp.org/' + work.get('href')
        print(work_url)
        work_page_response = requests.get(work_url)
        work_soup = BeautifulSoup(work_page_response.text, "lxml")
        print(work_soup.find('title').text )
        print(work)
        print(work_soup.find_all('th')[1].text, work_soup.find_all('th')[1].findNext().text )
        print('\n')


https://imslp.org//wiki/Abb%C3%A9_Stadler,_WoO_178_(Beethoven,_Ludwig_van)
Abbé Stadler, WoO 178 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Abb%C3%A9_Stadler,_WoO_178_(Beethoven,_Ludwig_van)" title="Abbé Stadler, WoO 178 (Beethoven, Ludwig van)">Abbé Stadler, WoO 178 (Beethoven, Ludwig van)</a>
Composition Year
 1820



https://imslp.org//wiki/Abendlied_unter%27m_gestirnten_Himmel,_WoO_150_(Beethoven,_Ludwig_van)
Abendlied unter'm gestirnten Himmel, WoO 150 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Abendlied_unter%27m_gestirnten_Himmel,_WoO_150_(Beethoven,_Ludwig_van)" title="Abendlied unter'm gestirnten Himmel, WoO 150 (Beethoven, Ludwig van)">Abendlied unter'm gestirnten Himmel, WoO 150 (Beethoven, Ludwig van)</a>
Composition Year
 1820 March 4



https://imslp.org//wiki/Abschiedsgesang_an_Wiens_B%C3%BCrger,_WoO_121_(Beethoven,_Ludwig_van)
Abschiedsgesang an Wiens 

Andante favori, WoO 57 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Andante_favori,_WoO_57_(Beethoven,_Ludwig_van)" title="Andante favori, WoO 57 (Beethoven, Ludwig van)">Andante favori, WoO 57 (Beethoven, Ludwig van)</a>
Composition Year
 1803-1804



https://imslp.org//wiki/Andenken,_WoO_136_(Beethoven,_Ludwig_van)
Andenken, WoO 136 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Andenken,_WoO_136_(Beethoven,_Ludwig_van)" title="Andenken, WoO 136 (Beethoven, Ludwig van)">Andenken, WoO 136 (Beethoven, Ludwig van)</a>
Composition Year
 1808



https://imslp.org//wiki/4_Arietten_und_ein_Duett,_Op.82_(Beethoven,_Ludwig_van)
4 Arietten und ein Duett, Op.82 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/4_Arietten_und_ein_Duett,_Op.82_(Beethoven,_Ludwig_van)" title="4 Arietten und ein Duett, Op.82 (Beethoven, Ludwig van)">4

Cello Sonata No.5, Op.102 No.2 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Cello_Sonata_No.5,_Op.102_No.2_(Beethoven,_Ludwig_van)" title="Cello Sonata No.5, Op.102 No.2 (Beethoven, Ludwig van)">Cello Sonata No.5, Op.102 No.2 (Beethoven, Ludwig van)</a>
Composition Year
 1815



https://imslp.org//wiki/Chorus_for_the_Allied_Princes,_WoO_95_(Beethoven,_Ludwig_van)
Chorus for the Allied Princes, WoO 95 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Chorus_for_the_Allied_Princes,_WoO_95_(Beethoven,_Ludwig_van)" title="Chorus for the Allied Princes, WoO 95 (Beethoven, Ludwig van)">Chorus for the Allied Princes, WoO 95 (Beethoven, Ludwig van)</a>
Composition Year
 1814



https://imslp.org//wiki/Christus_am_%C3%96lberge,_Op.85_(Beethoven,_Ludwig_van)
Christus am Ölberge, Op.85 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/

Fantasia for Piano, Op.77 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Fantasia_for_Piano,_Op.77_(Beethoven,_Ludwig_van)" title="Fantasia for Piano, Op.77 (Beethoven, Ludwig van)">Fantasia for Piano, Op.77 (Beethoven, Ludwig van)</a>
Composition Year
 1809



https://imslp.org//wiki/Fantasia_in_C_minor,_Op.80_(Beethoven,_Ludwig_van)
Fantasia in C minor, Op.80 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Fantasia_in_C_minor,_Op.80_(Beethoven,_Ludwig_van)" title="Fantasia in C minor, Op.80 (Beethoven, Ludwig van)">Fantasia in C minor, Op.80 (Beethoven, Ludwig van)</a>
Composition Year
 1808



https://imslp.org//wiki/Farewell_to_the_Piano_(Beethoven,_Ludwig_van)
Waltz for Piano in F Major, Anh.15 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="mw-redirect categorypagelink" href="/wiki/Farewell_to_the_Piano_(Beethoven,_Ludwig_van)" title="Farewell to

Der glorreiche Augenblick, Op.136 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Der_glorreiche_Augenblick,_Op.136_(Beethoven,_Ludwig_van)" title="Der glorreiche Augenblick, Op.136 (Beethoven, Ludwig van)">Der glorreiche Augenblick, Op.136 (Beethoven, Ludwig van)</a>
Composition Year
 1814



https://imslp.org//wiki/Das_Gl%C3%BCck_der_Freundschaft,_Op.88_(Beethoven,_Ludwig_van)
Das Glück der Freundschaft, Op.88 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Das_Gl%C3%BCck_der_Freundschaft,_Op.88_(Beethoven,_Ludwig_van)" title="Das Glück der Freundschaft, Op.88 (Beethoven, Ludwig van)">Das Glück der Freundschaft, Op.88 (Beethoven, Ludwig van)</a>
Composition Year
 1803



https://imslp.org//wiki/Gl%C3%BCck_zum_neuen_Jahr,_WoO_165_(Beethoven,_Ludwig_van)
Glück zum neuen Jahr, WoO 165 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href

Kühl, nicht lau, WoO 191 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/K%C3%BChl,_nicht_lau,_WoO_191_(Beethoven,_Ludwig_van)" title="Kühl, nicht lau, WoO 191 (Beethoven, Ludwig van)">Kühl, nicht lau, WoO 191 (Beethoven, Ludwig van)</a>
Composition Year
 1825



https://imslp.org//wiki/Kurz_ist_der_Schmerz,_und_ewig_ist_die_Freude,_WoO_163_(Beethoven,_Ludwig_van)
Kurz ist der Schmerz, und ewig ist die Freude, WoO 163 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Kurz_ist_der_Schmerz,_und_ewig_ist_die_Freude,_WoO_163_(Beethoven,_Ludwig_van)" title="Kurz ist der Schmerz, und ewig ist die Freude, WoO 163 (Beethoven, Ludwig van)">Kurz ist der Schmerz, und ewig ist die Freude, WoO 163 (Beethoven, Ludwig van)</a>
Composition Year
 1813



https://imslp.org//wiki/Kurz_ist_der_Schmerz,_und_ewig_ist_die_Freude,_WoO_166_(Beethoven,_Ludwig_van)
Kurz ist der Schmerz, und ewig ist die Fr

2 Marches for Military Band, WoO 18-19 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/2_Marches_for_Military_Band,_WoO_18-19_(Beethoven,_Ludwig_van)" title="2 Marches for Military Band, WoO 18-19 (Beethoven, Ludwig van)">2 Marches for Military Band, WoO 18-19 (Beethoven, Ludwig van)</a>
Composition Year
 1808



https://imslp.org//wiki/3_Marches,_Op.45_(Beethoven,_Ludwig_van)
3 Marches, Op.45 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/3_Marches,_Op.45_(Beethoven,_Ludwig_van)" title="3 Marches, Op.45 (Beethoven, Ludwig van)">3 Marches, Op.45 (Beethoven, Ludwig van)</a>
Composition Year
 1803



https://imslp.org//wiki/Mass_in_C_major,_Op.86_(Beethoven,_Ludwig_van)
Mass in C major, Op.86 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Mass_in_C_major,_Op.86_(Beethoven,_Ludwig_van)" title="Mass in C major, Op.86 (Beethov

Overture in C major, Op.115 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Overture_in_C_major,_Op.115_(Beethoven,_Ludwig_van)" title="Overture in C major, Op.115 (Beethoven, Ludwig van)">Overture in C major, Op.115 (Beethoven, Ludwig van)</a>
Composition Year
 1814-15 (October-March)



https://imslp.org//wiki/La_Partenza,_WoO_124_(Beethoven,_Ludwig_van)
La Partenza, WoO 124 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/La_Partenza,_WoO_124_(Beethoven,_Ludwig_van)" title="La Partenza, WoO 124 (Beethoven, Ludwig van)">La Partenza, WoO 124 (Beethoven, Ludwig van)</a>
Composition Year
 1795–96



https://imslp.org//wiki/Piano_Concerto_in_D_major,_Op.61a_(Beethoven,_Ludwig_van)
Piano Concerto in D major, Op.61a (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Piano_Concerto_in_D_major,_Op.61a_(Beethoven,_Ludwig_van)" title="

Piano Sonata No.9, Op.14 No.1 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Piano_Sonata_No.9,_Op.14_No.1_(Beethoven,_Ludwig_van)" title="Piano Sonata No.9, Op.14 No.1 (Beethoven, Ludwig van)">Piano Sonata No.9, Op.14 No.1 (Beethoven, Ludwig van)</a>
Movements/SectionsMov'ts/Sec's
 Movements/Sections


https://imslp.org//wiki/Piano_Sonata_No.10,_Op.14_No.2_(Beethoven,_Ludwig_van)
Piano Sonata No.10, Op.14 No.2 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Piano_Sonata_No.10,_Op.14_No.2_(Beethoven,_Ludwig_van)" title="Piano Sonata No.10, Op.14 No.2 (Beethoven, Ludwig van)">Piano Sonata No.10, Op.14 No.2 (Beethoven, Ludwig van)</a>
Movements/SectionsMov'ts/Sec's
 Movements/Sections


https://imslp.org//wiki/Piano_Sonata_No.11,_Op.22_(Beethoven,_Ludwig_van)
Piano Sonata No.11, Op.22 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href

Piano Sonata No.29, Op.106 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Piano_Sonata_No.29,_Op.106_(Beethoven,_Ludwig_van)" title="Piano Sonata No.29, Op.106 (Beethoven, Ludwig van)">Piano Sonata No.29, Op.106 (Beethoven, Ludwig van)</a>
Movements/SectionsMov'ts/Sec's
 Movements/Sections


https://imslp.org//wiki/Piano_Sonata_No.30,_Op.109_(Beethoven,_Ludwig_van)
Piano Sonata No.30, Op.109 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Piano_Sonata_No.30,_Op.109_(Beethoven,_Ludwig_van)" title="Piano Sonata No.30, Op.109 (Beethoven, Ludwig van)">Piano Sonata No.30, Op.109 (Beethoven, Ludwig van)</a>
Movements/SectionsMov'ts/Sec's
 Movements/Sections


https://imslp.org//wiki/Piano_Sonata_No.31,_Op.110_(Beethoven,_Ludwig_van)
Piano Sonata No.31, Op.110 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download
<a class="categorypagelink" href="/wiki/Piano_Sonata_No.31,_Op
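
Two things stand out in the output above. The printed URLs contain `//wiki` because the base string ends in `/` and every `href` starts with `/`. And on the piano sonata pages `find_all('th')[1]` returns "Movements/Sections" instead of the composition year, because the second header cell is not the same field on every page. The sketch below shows one possible way around both: `urljoin` for building the URL, and looking up a header cell by its label instead of its position. The `get_field` helper and the HTML snippet are hypothetical; the real IMSLP markup may differ.

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# urljoin takes care of the slash between the base and the path
work_url = urljoin('https://imslp.org/', '/wiki/Andenken,_WoO_136_(Beethoven,_Ludwig_van)')
print(work_url)

# Hypothetical info table; the layout of the real page is an assumption
html = '''
<table>
  <tr><th>Work Title</th><td>Andenken</td></tr>
  <tr><th>Movements/Sections</th><td>1</td></tr>
  <tr><th>Composition Year</th><td> 1808</td></tr>
</table>
'''
work_soup = BeautifulSoup(html, 'html.parser')

def get_field(soup, label):
    '''Return the text of the cell that follows the <th> containing `label`, or None.'''
    header = soup.find('th', string=re.compile(label))
    if header is None:
        return None
    value_cell = header.find_next('td')
    return value_cell.get_text(strip=True) if value_cell else None

print(get_field(work_soup, 'Composition Year'))
print(get_field(work_soup, 'Duration'))   # field not present in this snippet
```

Looking fields up by label also makes it easy to record `None` for a page that lacks a field, rather than silently storing the wrong value.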

In [None]:
catalog = soup.find_all('span', id='article')
# catalog_ = soup.find_all('span').find('catmsgendp1')
print(len(catalog))

In [None]:
# Print only the h2 heading that introduces the list of compositions
for h2 in soup.find_all('h2'):
    if 'Compositions by: Beethoven, Ludwig van' in h2.get_text():
        print(h2)

In [None]:
# print(soup.find( id = 'mw-pages'))
rows = [row for row in soup.find_all('h2')] 
print(len(rows))
for row in rows:
    print(row)

In [None]:
soup.find('div', id="mw-pages") 

### FINALLY the next line gets us somewhere

In [None]:
# soup.find('div', lang="en", dir="ltr").find_all('div', class_='catpglnksp1')
soup.find_all('div', lang="en", dir="ltr")

In [None]:
soup.find('div', lang="en", dir="ltr").find_all('div', id='mw-pages')

In [None]:
#soup.find_all('div')
for item in soup.find_all('div'):
    print(item.prettify())
#     for in_item in item.find_all('div'):
#         print('\n\n')
#         print(item.prettify())

In [None]:
soup.find_all('a')

In [None]:
soup.find(title="Category:Beethoven, Ludwig van")

**QUESTIONS**

> Select the first link on the page.

> Now select the LAST link on the page.  Can you get the text and the URL associated with this link?

In [None]:
for link in soup.find_all('a')[:100]:
    print(link, '\n')

Remember `find` gets only one match, but `find_all` retrieves all matches in a list.

In [None]:
for link in soup.find_all('a')[:5]:
    print(link, '\n')
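
The difference in return types is easiest to see on a tiny example. This is a minimal sketch on made-up HTML (the snippet and variable names are assumptions, not part of the page scraped above).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with three links
html = '''
<ul>
  <li><a href="/wiki/A">First</a></li>
  <li><a href="/wiki/B">Second</a></li>
  <li><a href="/wiki/C">Third</a></li>
</ul>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

first_link = demo_soup.find('a')      # a single Tag
all_links = demo_soup.find_all('a')   # a list-like ResultSet of Tags

print(first_link.text, first_link['href'])
print(len(all_links))
print(all_links[-1].text, all_links[-1].get('href'))

# With no match, find returns None and find_all returns an empty list
print(demo_soup.find('table'))
print(demo_soup.find_all('table'))
```

Because `find` can return `None`, chaining another call onto it raises an `AttributeError` when the element is missing, which is worth keeping in mind when scraping many pages.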

You can also match only those elements with a specific `id` or `class`.  Here are all the elements labeled with the "mojo-navigation-tab" class.

In [None]:
for element in soup.find_all(class_='mojo-navigation-tab'):
    print(element.prettify())
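
As a minimal sketch of attribute filtering, the example below uses a made-up snippet (the class names and `id` are hypothetical). Note that the keyword is `class_`, with a trailing underscore, while `id` is written without one.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation markup
html = '''
<div id="tabs">
  <a class="nav-tab active" href="/a">Summary</a>
  <a class="nav-tab" href="/b">Daily</a>
  <a class="other" href="/c">About</a>
</div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

# class_ has a trailing underscore because `class` is a reserved word in Python
tabs = demo_soup.find_all(class_='nav-tab')
print([tab.text for tab in tabs])

# class_ matches any one of an element's classes
print(demo_soup.find(class_='active').text)

# id is an ordinary keyword argument, no underscore needed
container = demo_soup.find(id='tabs')
print(container.name)
```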

It's important to remember `find` and `find_all` return BeautifulSoup elements. You can continue searching these elements, thus chaining commands together.

Basic earnings information can be found in the `div` with the "mojo-performance-summary-table" class.  Let's extract the domestic gross from this element.

<br>
<img src="images/biglebow_table.png" alt="Big Lebowski Table" style="width: 500px;"/>

In [None]:
print(soup.find(class_='mojo-performance-summary-table').prettify())

In [None]:
soup.find(class_='mojo-performance-summary-table').find_all('span', class_='money')

Text needs to be extracted from one element at a time.  To get the domestic gross:

In [None]:
soup.find(class_='mojo-performance-summary-table').find_all('span', class_='money')[0].text
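
Chained lookups like the one above can be sketched on a small, made-up snippet. The class names and dollar amounts below are invented for illustration. The example also guards against a missing element, since calling `find_all` on `None` would raise an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two money spans inside the summary div, one outside it
html = '''
<div class="summary-table">
  <div><span>Domestic</span> <span class="money">$1,234,567</span></div>
  <div><span>International</span> <span class="money">$7,654</span></div>
</div>
<span class="money">$99</span>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

summary = demo_soup.find(class_='summary-table')

# Searching from `summary` only looks inside that div
money_spans = summary.find_all('span', class_='money') if summary else []
print(len(money_spans))

# .text belongs to a single element, so index into the list first
domestic = money_spans[0].text if money_spans else None
print(domestic)
```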

You can also find using an `id`; remember id should be unique to just one element.

In [None]:
print(soup.find(id='tabs').prettify())

### Web Scraping Pipeline

Now that we have the basics, let's practice web scraping.  **The main goal of web scraping is to extract data by taking advantage of a site's consistent format.**  That is, the code you write for one page on a website can hopefully be used on multiple pages to gather more information automatically.

Let's create code to get the following information for the movies on Box Office Mojo:
- Movie title
- Domestic gross
- Runtime
- MPAA rating
- Release date

#### Movie Title

In [None]:
soup.find('title')

In [None]:
title_string = soup.find('title').text

title_string

In [None]:
title_string.split('-')

In [None]:
title = title_string.split('-')[0].strip()

title
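
Splitting on every `-` works only when the title itself contains no hyphen. A slightly more defensive sketch, using two page titles from the IMSLP output earlier in this notebook, splits once from the right on `' - '` instead. The `clean_title` helper is hypothetical.

```python
def clean_title(title_string, separator=' - '):
    '''Drop the trailing site name from a page title (hypothetical helper).'''
    # rsplit with maxsplit=1 cuts only at the last separator,
    # so hyphens inside the title itself are left alone
    return title_string.rsplit(separator, 1)[0].strip()

plain = 'Andenken, WoO 136 (Beethoven, Ludwig van) - IMSLP: Free Sheet Music PDF Download'
hyphenated = ('2 Marches for Military Band, WoO 18-19 (Beethoven, Ludwig van)'
              ' - IMSLP: Free Sheet Music PDF Download')

print(clean_title(plain))
print(hyphenated.split('-')[0].strip())   # naive split cuts the title short
print(clean_title(hyphenated))
```

This still assumes the site name comes last and contains no `' - '` of its own.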

#### Domestic Gross: 

As we saw previously, the domestic gross can be found in a `span` within the "mojo-performance-summary-table" `div`.

In [None]:
dtg = soup.find(class_='mojo-performance-summary-table').find_all('span', class_='money')[0].text
dtg

The remainder of the information lives in this neighboring `div`.

<img src="images/biglebow_info.png" alt="Big Lebowski Information" style="width: 500px;"/>

#### Runtime: `.findNext()`

Sometimes you can find the information you are looking for by using text matching.  But note this must be an exact match!

In [None]:
soup.find(text='Run')  #does not match

In [None]:
soup.find(text='Running Time')  

Alternatively, we could use [regular expressions](https://docs.python.org/3/library/re.html).

In [None]:
import re
runtime_regex = re.compile('Run')
soup.find(text=runtime_regex)

In [None]:
rt_string = soup.find(text=re.compile('Run'))
print(rt_string)

In [None]:
type(rt_string)

The string we found is still a Beautiful Soup element. This means we can use it to navigate to the next element in the HTML, which is a `span` containing the actual runtime.

In [None]:
rt_string.findNext()

The `.findNext()` method can be incredibly useful when the information you want to find doesn't have an obvious tag, class, id, etc.

Let's clean this value up into usable data.

In [None]:
rt = rt_string.findNext().text
rt = rt.split()
minutes = int(rt[0])*60 + int(rt[2])
print(minutes)

#### MPAA Rating, Release Date

_**STEP 1:** Create function to grab values_ 

The text matching method can also help us get the runtime, rating, and release date, so let's make a reusable function.

In [None]:
def get_movie_value(soup, field_name):
    
    '''Grab a value from Box Office Mojo HTML
    
    Takes the label of a movie attribute on the page (matched as a regular expression)
    and returns the text of the next element after it, or None if nothing is found.
    '''
    
    obj = soup.find(text=re.compile(field_name))
    
    if not obj: 
        return None
    
    # this works for most of the values
    next_element = obj.findNext()
    
    if next_element:
        return next_element.text 
    else:
        return None
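
To show the mechanics in isolation, here is a minimal sketch of the same label-then-next-element idea on made-up HTML. The `value_after_label` helper and the snippet are hypothetical. It uses `string=` and `find_next()`, which are the current names for `text=` and `findNext()`, and it escapes the label so that characters such as `.` or `(` are matched literally.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with label/value pairs in neighboring spans
html = '''
<div>
  <div><span>Running Time</span><span>1 hr 57 min</span></div>
  <div><span>MPAA</span><span>R</span></div>
</div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

def value_after_label(soup, label):
    '''Hypothetical variant of get_movie_value: text of the element after a label.'''
    # re.escape keeps regex metacharacters in the label from being interpreted
    label_string = soup.find(string=re.compile(re.escape(label)))
    if label_string is None:
        return None
    next_element = label_string.find_next()   # first tag after the matched string
    return next_element.get_text(strip=True) if next_element else None

print(value_after_label(demo_soup, 'Run'))
print(value_after_label(demo_soup, 'MPAA'))
print(value_after_label(demo_soup, 'Budget'))   # label not on the page
```

Only the first string that matches is used, so a short pattern such as `'Run'` can pick up the wrong label on a page where that text appears earlier.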

In [None]:
# runtime
runtime = get_movie_value(soup,'Run')
print(runtime)

In [None]:
# rating
rating = get_movie_value(soup,'MPAA')
print(rating)

In [None]:
release_date = get_movie_value(soup,'Release Date')
print(release_date)

In [None]:
release_date = release_date.split('\n')[0]  # Select only the date
print(release_date)

_**STEP 2:** Create helper functions to parse strings into appropriate data types_

The returned values all need a bit of formatting before we can work with this data.  Here are a few helper functions.

In [None]:
import dateutil.parser

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

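def runtime_to_minutes(runtimestring):
    # Expects a string like '1 hr 57 min'; returns None for anything else (including None)
    try:
        runtime = runtimestring.split()
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (AttributeError, IndexError, ValueError):
        return None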

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date
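
Positional parsing such as `runtime[0]` and `runtime[2]` depends on the string always having the same shape. As a hedged alternative, the sketch below uses regular expressions so that a value with only hours or only minutes still parses. The function names and the extra input formats ("2 hr", "45 min") are assumptions, not something observed on the site.

```python
import re

def parse_runtime(runtime_string):
    '''Convert strings like "1 hr 57 min", "2 hr" or "45 min" to minutes, else None.'''
    if not runtime_string:
        return None
    hours = re.search(r'(\d+)\s*hr', runtime_string)
    mins = re.search(r'(\d+)\s*min', runtime_string)
    if not hours and not mins:
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if mins:
        total += int(mins.group(1))
    return total

def parse_money(money_string):
    '''Convert "$1,234,567" to 1234567; return None when there are no digits.'''
    digits = re.sub(r'[^\d]', '', money_string or '')
    return int(digits) if digits else None

print(parse_runtime('1 hr 57 min'))
print(parse_runtime('45 min'))
print(parse_money('$1,234,567'))
```

Returning `None` for unparseable input keeps a single odd page from stopping a long scraping run; the missing values can be inspected afterwards.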

_**STEP 3:** Apply these conversions_

Let's get these values again and format them all in one swoop. (Note: Rating is already correct as a string.)

In [None]:
raw_domestic_total_gross = dtg
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Running')
runtime = runtime_to_minutes(raw_runtime)

raw_release_date = get_movie_value(soup,'Release Date').split('\n')[0]
release_date = to_date(raw_release_date)

#### Put Results in Dictionary

Now that we have results for all five quantities, we can store them in a dictionary.

In [None]:
headers = ['movie title', 'domestic total gross',
           'runtime (mins)', 'rating', 'release date']

movie_data = []
movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                runtime,
                                rating, 
                                release_date]))

movie_data.append(movie_dict)
movie_data

**QUESTION**

> Why might we want to store these data in a dictionary?  Why did we put the dictionary in a list?

### Scraping Tables

Let's take a look at the [top G-rated movies](https://www.boxofficemojo.com/chart/mpaa_title_lifetime_gross/?by_mpaa=G) of Box Office Mojo.  How could we pull all the data from this main page?

First request the HTML and parse it with Beautiful Soup.

In [None]:
url = 'https://www.boxofficemojo.com/chart/mpaa_title_lifetime_gross/?by_mpaa=G'

response = requests.get(url)
page = response.text

soup = BeautifulSoup(page,"lxml")

Now find the main table; it's the only `table` on the page.

In [None]:
table = soup.find('table')
table

In [None]:
rows = [row for row in table.find_all('tr')]  # tr tag is for rows

Each row contains the information we want but requires more parsing.

In [None]:
rows[1]

Remember: you can chain methods together to look for information!

In [None]:
rows[1].find_all('td')[0].find('a')['href']

Now grab data for the first 5 movies with a loop.

In [None]:
movies = {}

for row in rows[1:6]:
    items = row.find_all('td')
    link = items[0].find('a')
    title, url = link.text, link['href']
    movies[title] = [url] + [i.text for i in items]
    
movies
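
A variation on this loop, sketched on a made-up table, reads the column names from the header row and stores each movie as its own dictionary. The table contents, link stubs, and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical table with a header row and two data rows
html = '''
<table>
  <tr><th>Rank</th><th>Title</th><th>Year</th></tr>
  <tr><td>1</td><td><a href="/title/tt0000001/">Movie A</a></td><td>2001</td></tr>
  <tr><td>2</td><td><a href="/title/tt0000002/">Movie B</a></td><td>1999</td></tr>
</table>
'''
table = BeautifulSoup(html, 'html.parser').find('table')
rows = table.find_all('tr')

# Column names come from the <th> cells of the first row
column_names = [th.get_text(strip=True) for th in rows[0].find_all('th')]

records = []
for row in rows[1:]:
    cells = row.find_all('td')
    record = dict(zip(column_names, (cell.get_text(strip=True) for cell in cells)))
    link = row.find('a')
    record['link_stub'] = link['href'] if link else None
    records.append(record)

print(records)
```

A list of dictionaries keeps every row, whereas a dictionary keyed on title would overwrite an earlier movie if two movies shared a title.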

### Scraping Multiple Pages

Now that we have the links for several G-rated movies we can visit each link to extract even more information about each movie.  Let's use `pandas` to help.

In [None]:
import pandas as pd

In [None]:
g_movies = pd.DataFrame(movies).T  #transpose
g_movies.columns = ['link_stub', 'title', 'rank_g_movies', 
                    'lifetime_gross', 'rank_overall', 'year']

g_movies.head()

We'll also combine all previous steps into one helper function.

In [None]:
def get_movie_dict(link):
    '''
    From BoxOfficeMojo link stub, request movie html, parse with BeautifulSoup, and
    collect 
        - title 
        - domestic gross
        - runtime 
        - MPAA rating
        - full release date
    Return information as a dictionary.
    '''
    
    base_url = 'https://www.boxofficemojo.com'
    
    #Create full url to scrape
    url = base_url + link
    
    #Request HTML and parse
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page,"lxml")

    
    headers = ['movie_title', 'domestic_total_gross',
               'runtime_minutes', 'rating', 'release_date']
    
    #Get title
    title_string = soup.find('title').text
    title = title_string.split('-')[0].strip()

    #Get domestic gross
    raw_domestic_total_gross = (soup.find(class_='mojo-performance-summary-table')
                                    .find_all('span', class_='money')[0]
                                    .text
                               )
    domestic_total_gross = money_to_int(raw_domestic_total_gross)

    #Get runtime
    raw_runtime = get_movie_value(soup,'Running')
    runtime = runtime_to_minutes(raw_runtime)
    
    #Get rating
    rating = get_movie_value(soup,'MPAA')

    #Get release date
    raw_release_date = get_movie_value(soup,'Release Date').split('\n')[0]
    release_date = to_date(raw_release_date)
    
    #Create movie dictionary and return
    movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                runtime,
                                rating, 
                                release_date]))

    return movie_dict

Now we just need to pass each link stub to this function.

In [None]:
g_movies_page_info_list = []

for link in g_movies.link_stub:
    g_movies_page_info_list.append(get_movie_dict(link))
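
When a loop like this visits many pages, it is considerate to pause between requests, and practical to keep going when one page fails (for example, `get_movie_dict` raises an `AttributeError` if a page lacks the summary table). The sketch below is one possible wrapper; `scrape_all` and `fake_scrape` are hypothetical names, and the stand-in scraper is used so the example runs without network access.

```python
import time

def scrape_all(link_stubs, scrape_one, delay_seconds=1.0):
    '''Call scrape_one on each link stub, pausing between calls and collecting failures.'''
    results, failures = [], []
    for i, stub in enumerate(link_stubs):
        if i > 0:
            time.sleep(delay_seconds)   # space out requests to the server
        try:
            results.append(scrape_one(stub))
        except Exception as error:      # one bad page should not stop the whole run
            failures.append((stub, repr(error)))
    return results, failures

# Stand-in for get_movie_dict so the sketch runs offline
def fake_scrape(stub):
    if stub == '/bad/':
        raise AttributeError('page layout not as expected')
    return {'link_stub': stub}

results, failures = scrape_all(['/a/', '/bad/', '/b/'], fake_scrape, delay_seconds=0)
print(results)
print(failures)
```

With the real function this would be called as `scrape_all(g_movies.link_stub, get_movie_dict)`, and the `failures` list shows which pages need a closer look.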

In [None]:
g_movies_page_info_list

In [None]:
g_movies_page_info = pd.DataFrame(g_movies_page_info_list)  #convert list of dict to df
g_movies_page_info.set_index('movie_title', inplace=True)

g_movies_page_info

(Note: the rating is indeed missing from a few of these pages!  How could you fix that?)

We can now match this back up with the movie information collected from the table by merging these dataframes.

In [None]:
g_movies = g_movies.merge(g_movies_page_info, left_index=True, right_index=True)

g_movies

## Recap

- Beautiful Soup is a powerful HTML parser
- You can locate one element with `.find()` or all matching elements with `.find_all()`
- To select specific elements, you can filter by attributes like `class` or `id` 
- You can also find items using text matching and `.findNext()`, `.findNextSibling()`, `.findChild()`, etc.
- Once you know how to scrape one page, you can scale up by systematically visiting other similar pages.

### Limitations
Beautiful Soup has its limitations though.  For example, we can't use Beautiful Soup if a page:
- Requires us to input a password
- Reveals information we want only when we interact with it
- Is generated dynamically (with JavaScript) rather than served as static HTML

For these situations we need a different tool, like **Selenium** -- coming soon!