# Naxos Music Library


## Background

Based on a brief conversation with Mackenzie Miller (then a UM Interactive Media grad student) in early 2019, I decided to build a scraper to get data from the Naxos Music Library about classical music in movies. I originally built the scraper in early 2019, and now, in October 2019, I'm updating the code so it can be reused. This effort is also an example of how annoying it can be to scrape certain websites (I'll get more into this later).

With Naxos, you can see the data organized by either [Movie Title](https://www.naxos.com/musicinmovieslist.asp?letter=A) or [Composer](https://www.naxos.com/musicinmoviescomplist.asp?letter=A). I've chosen the Composer arrangement because that's how Mackenzie and I were considering the data at first, and it seems like there are fewer composers than movies (although I haven't verified this).


## The Plan

1. Get the HTML from each page and save it as a backup. Because no composer's name starts with a number (yet), I only need to scrape the 26 pages that correspond to the letters of the alphabet. For a few letters, no composer has a last name starting with that letter: N, O, U, and X. **I'm not sure whether classical music refers only to music made in a certain period or whether composers today also write classical music. Because of this, I'll scrape those pages now too, in case composers are added whose last names start with these letters.**
1. For each page, find the table inside the `<td>` element with class `style5` and then work to get the data from there. Save the data as a CSV file.

@@TODO: Figure out how to save the data as a JSON file.
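
One possible approach to the JSON TODO is the standard-library `json` module. This is only a sketch: it assumes the scraped data ends up as a list of dictionaries, and the field names (`composer`, `work`, `movie`) and the helper names `save_json` and `load_json` are hypothetical, chosen here for illustration.

```python
import json

# Hypothetical rows; the real field names depend on how the scraper is written.
rows = [
    {
        "composer": "ADAM, ADOLPHE",
        "work": "Giselle: Apparition de Giselle",
        "movie": "Red Shoes (The) (1948)",
    },
]


def save_json(rows, path):
    """Write a list of dictionaries to `path` as UTF-8 encoded JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(rows, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read the JSON file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `save_json(rows, "naxos.json")` would write the file next to the notebook; the file name is also just a placeholder.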


---

The first thing we'll do is import the various libraries we'll be working with:

* @@TODO: explain each of the libraries being used.
* @@TODO: add this info to the .README file

In [1]:
import requests
from bs4 import BeautifulSoup
import lxml
from time import sleep
import csv
from string import ascii_uppercase

The url structure for the composers whose last names start with the letter `A` is:

`https://www.naxos.com/musicinmoviescomplist.asp?letter=A`

If we want to get the composers whose last names start with the letter `B` we just have to change what comes after the `=` in the url. We can conveniently get the entire alphabet in uppercase as a string by using `ascii_uppercase` from the `string` module.
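
As a quick illustration of the URL pattern, the sketch below builds a letter-to-URL lookup without making any requests. The dictionary name `urls_by_letter` is an assumption made for this example.

```python
from string import ascii_uppercase

base_url = "https://www.naxos.com/musicinmoviescomplist.asp?letter="

# Map each uppercase letter to the composer page for that letter.
urls_by_letter = {letter: base_url + letter for letter in ascii_uppercase}

print(len(urls_by_letter))
print(urls_by_letter["B"])
```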

To get the html for the pages we'll loop through the string and request each page. We'll save the scraped HTML to a folder that will follow the pattern: `YYYY-MM-pages`. I'm doing this in October 2019, so the folder will be `2019-10-pages`. Currently, this is a manual process, but I hope to automate it.

* @@TODO: automate the month/year pattern for folder creation
* @@TODO: automate the folder creation in the notebook
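
Both TODOs could be handled with the standard library. The sketch below is one way to do it, not necessarily how the notebook will end up doing it; the function names `pages_folder` and `make_pages_folder` are hypothetical. Note that it zero-pads the month to match the `YYYY-MM-pages` pattern, so January would give `2020-01-pages` rather than `2020-1-pages`.

```python
from datetime import date
from pathlib import Path


def pages_folder(today=None):
    """Return a folder name such as '2019-10-pages' for the given date."""
    today = today or date.today()
    return f"{today.year}-{today.month:02d}-pages"


def make_pages_folder(parent=".", today=None):
    """Create the folder under `parent` if it does not exist and return its path."""
    folder = Path(parent) / pages_folder(today)
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

With this in place, `out_folder = str(make_pages_folder())` would create and name the folder for the current month.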


In [16]:
base_url = "https://www.naxos.com/musicinmoviescomplist.asp?letter="
year = 2019
month = 10
out_folder = "-".join([str(year), str(month), "pages"])

We'll use the `requests` library to request the html code of each page and save it for processing. This part of the code requires an internet connection, but the rest of the notebook does not. We'll append the path to each file to a list that we'll use when working with our data.

In [74]:
pages = []

for letter in ascii_uppercase:
    out_file = out_folder + "/" + letter + ".html"

    complete_url = base_url + letter
    response = requests.get(complete_url)
    page = response.text
    
    pages.append(out_file)
    # Use a context manager so the file is closed after writing
    with open(out_file, "w") as f:
        f.write(page)

Before we process the 26 files, let's look at some examples so we can build out the overall scraper. The two big cases we have to write code for are:

1. When 

The files are encoded with the charset `ISO-8859-1` and not `UTF-8` when we first get them, so the first thing we'll do is change the encoding to `UTF-8` before turning the page into a BeautifulSoup object. Additionally, I'll remove the invisible characters `\n`, `\t`, and `\r` that appear in the text before passing it on. These characters are usually harmless, but they show up when trying to find the next sibling of a row, so we'll clean them out.
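
The cleanup described above can be sketched as a small helper that works on raw bytes. This is an illustration under assumptions, not the notebook's actual code: the function name `clean_html` and the sample input are made up for this example.

```python
def clean_html(raw_bytes, source_encoding="ISO-8859-1"):
    """Decode raw page bytes, drop newlines, tabs and carriage returns,
    and return the result re-encoded as UTF-8 bytes."""
    text = raw_bytes.decode(source_encoding)
    for ch in ("\n", "\t", "\r"):
        text = text.replace(ch, "")
    return text.encode("utf-8")


# Made-up sample: an accented character plus stray whitespace characters
raw = "<td>Entr\xe9e\r\n\t</td>".encode("ISO-8859-1")
cleaned = clean_html(raw)
print(cleaned.decode("utf-8"))
```

When fetching with `requests`, the raw bytes of a response are available as `response.content`, which could be passed to a helper like this one.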

The website is built with a lot of nested tables. The one we care about, which holds the composer and music info, is the table that is a child of a `<td class="style5"></td>` element.

Because the table is not well made, we can't do an orderly scraping of each row and get the data. To find things properly, I did the following:

1. All the composers are within `<b></b>` tags. Find those in the table.
1. For each composer, find the `<tr></tr>` they are in. The music and movie data we want is in the following row.
1. In that following row, there is a div for each piece of music that was used in a movie.
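
The three steps above can be sketched on a small, hand-written HTML fragment. The fragment is a simplified assumption modeled on the page structure, and the sketch uses `find_parent` and `find_next_sibling` as an alternative to chaining `.parent` and `.next_sibling`; it also uses Python's built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one composer and one piece of music (an assumption)
html = (
    '<table><tr><td class="style5"><table>'
    '<tr><td bgcolor="#EEEEEE"><b>ADAM, ADOLPHE</b></td></tr>'
    '<tr><td bgcolor="#ffffff" class="style5">'
    '<div>Giselle: Apparition de Giselle <span class="style1">('
    '<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56">'
    '8.550755-56</a> )</span><br/><i>Red Shoes (The) (1948)</i><br/></div>'
    '</td></tr>'
    '</table></td></tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
main_table = soup.find("td", class_="style5").find("table")

records = []
# Step 1: every composer name sits inside a <b> tag
for composer in main_table.find_all("b"):
    # Step 2: the music data is in the row after the composer's row
    composer_row = composer.find_parent("tr")
    music_row = composer_row.find_next_sibling("tr")
    # Step 3: one <div> per piece of music used in a movie
    for div in music_row.find_all("div"):
        work = div.contents[0].strip()  # text that precedes the <span>
        movie = div.find("i").get_text(strip=True)
        records.append((composer.get_text(strip=True), work, movie))

print(records)
```

Taking the first child of each `<div>` as the work title assumes the title always comes before the `<span>` holding the catalogue link, as it does in the sample output.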

In [72]:
letter = ascii_uppercase[0]
# for letter in ascii_uppercase:
out_file = out_folder + "/" + letter + ".html"

complete_url = base_url + letter
response = requests.get(complete_url)
page = response.text

# Use a context manager so the file is closed after writing
with open(out_file, "w") as f:
    f.write(page)

page = page.replace("\n", "").replace("\t", "").replace("\r", "").encode("utf-8")
soup = BeautifulSoup(page,"lxml")

main_table = soup.find('td',class_='style5').find("table")
composer_list = main_table.find_all("b")  # `table` was undefined; the table is `main_table`

composer = composer_list[0]
# for composer in composer_list:
composer_row = main_table.find(text=composer.text).parent.parent.parent
music_row = composer_row.next_sibling.next_sibling
music_row


<tr><td bgcolor="#ffffff" class="style5"> <div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">Giselle: Apparition de Giselle <span class="style1">(<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56">8.550755-56</a> )</span><br/><i>Red Shoes (The) (1948)</i><br/> </div> <div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">Giselle: Entree d'Hilarion, scene et fugue des Wilis <span class="style1">(<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56">8.550755-56</a> )</span><br/><i>Red Shoes (The) (1948)</i><br/> </div> <div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">Giselle: Pas de deux des jeunes paysans <span class="style1">(<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56">8.550755-56</a> )</span><br/><i>Red Shoes (The) (1948)</i><br/> </div> <div s

In [29]:
composers = main_table.find_all("b")

In [56]:
a = main_table.find(text=composers[0].text)

In [58]:
a.parent.parent.parent

<tr><td bgcolor="#EEEEEE"><b>ADAM, ADOLPHE</b></td></tr>