# What's the point of this?

Based on a brief conversation with Mackenzie Miller from UM's interactive media program, I want to scrape the Naxos Records pages that list classical music used in movies, organized by composer: https://www.naxos.com/musicinmoviescomplist.asp?letter=A

# What's the plan?
There is one page per letter of the alphabet, A-Z. First I'll build the URL for each letter and save each page, then gather the entries for each letter of the alphabet.

Each entry is a composer's name, formatted as *last name*, *first name* *middle name*.
Under each composer there are subentries, and each subentry is the useful thing we want. It includes the name of the music piece, the item code that links to it (not fully clear what that means), and the name of the movie it appears in, along with the year the movie came out.
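
To keep the target in mind while scraping, here is a minimal sketch of what one cleaned subentry could look like as a record. The class name `MovieUse`, its field names, and the placeholder values are assumptions made for illustration; nothing here comes from the Naxos pages themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MovieUse:
    """One subentry: a piece by a composer used in one movie (hypothetical layout)."""
    composer: str          # "Last, First Middle", as listed on the page
    piece: str             # name of the music piece
    item_code: str         # the code that links to the piece
    movie: str             # movie title
    year: Optional[int] = None  # release year, if it can be parsed

    def as_row(self):
        # Flat list, convenient for csv.writer().writerow(...)
        return [self.composer, self.piece, self.item_code, self.movie, self.year]


# Placeholder values only, not real data from the site.
example = MovieUse("Lastname, Firstname", "Placeholder piece",
                   "ITEM-CODE", "Placeholder Movie", 1999)
```

A flat row per subentry means a piece used in several movies simply becomes several rows, which should make the later cleanup easier.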

## So let's start

I'll scrape the pages like I did with my News21 work.

In [3]:
import requests
import urllib.parse as parse
from bs4 import BeautifulSoup
import lxml
from time import sleep
import csv
from pathlib import Path

In [4]:
baseurl = "https://www.naxos.com/musicinmoviescomplist.asp?letter="

Now I need to iterate through the alphabet uppercase...

In [5]:
from string import ascii_uppercase

# the uppercase alphabet is already available as a constant
alphabet = ascii_uppercase

In [6]:
alphabet

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [8]:
out_url_list = []
for letter in alphabet:
    # one listing page per letter
    testurl = baseurl + letter
    out_url_list.append(testurl)
    print(testurl)

https://www.naxos.com/musicinmoviescomplist.asp?letter=A
https://www.naxos.com/musicinmoviescomplist.asp?letter=B
https://www.naxos.com/musicinmoviescomplist.asp?letter=C
https://www.naxos.com/musicinmoviescomplist.asp?letter=D
https://www.naxos.com/musicinmoviescomplist.asp?letter=E
https://www.naxos.com/musicinmoviescomplist.asp?letter=F
https://www.naxos.com/musicinmoviescomplist.asp?letter=G
https://www.naxos.com/musicinmoviescomplist.asp?letter=H
https://www.naxos.com/musicinmoviescomplist.asp?letter=I
https://www.naxos.com/musicinmoviescomplist.asp?letter=J
https://www.naxos.com/musicinmoviescomplist.asp?letter=K
https://www.naxos.com/musicinmoviescomplist.asp?letter=L
https://www.naxos.com/musicinmoviescomplist.asp?letter=M
https://www.naxos.com/musicinmoviescomplist.asp?letter=N
https://www.naxos.com/musicinmoviescomplist.asp?letter=O
https://www.naxos.com/musicinmoviescomplist.asp?letter=P
https://www.naxos.com/musicinmoviescomplist.asp?letter=Q
https://www.naxos.com/musicinmo

That worked, so let's keep going and extract the html for each of those pages:

In [12]:
in_url_list = []
Path("savedPages").mkdir(exist_ok=True)  # make sure the output folder exists
for letter in alphabet:
    outfile = "savedPages/naxos-records-" + letter + ".html"
    path_to_outfile = Path(outfile)
    # record every page path, so list positions match alphabet order even on a rerun
    in_url_list.append(outfile)

    if path_to_outfile.is_file():
        print("We have processed <<" + outfile + ">> already.")
    else:
        # download this letter's page and save the html for later
        testurl = baseurl + letter
        response = requests.get(testurl)
        a = parse.unquote(response.text)
        with open(outfile, 'w') as out:
            out.write(a)
        print("page '" + outfile + "' has been saved for later")

page 'savedPages/naxos-records-A.html' has been saved for later
page 'savedPages/naxos-records-B.html' has been saved for later
page 'savedPages/naxos-records-C.html' has been saved for later
page 'savedPages/naxos-records-D.html' has been saved for later
page 'savedPages/naxos-records-E.html' has been saved for later
page 'savedPages/naxos-records-F.html' has been saved for later
page 'savedPages/naxos-records-G.html' has been saved for later
page 'savedPages/naxos-records-H.html' has been saved for later
page 'savedPages/naxos-records-I.html' has been saved for later
page 'savedPages/naxos-records-J.html' has been saved for later
page 'savedPages/naxos-records-K.html' has been saved for later
page 'savedPages/naxos-records-L.html' has been saved for later
page 'savedPages/naxos-records-M.html' has been saved for later
page 'savedPages/naxos-records-N.html' has been saved for later
page 'savedPages/naxos-records-O.html' has been saved for later
page 'savedPages/naxos-records-P.html' h
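
The download step could also be wrapped in a small helper that pauses between requests, which is what the `sleep` import is for. This is only a sketch: `save_pages`, its `fetch` parameter, and `fake_fetch` are hypothetical names, and the fetch function is passed in so the example runs without touching the network.

```python
import tempfile
from pathlib import Path
from time import sleep


def save_pages(letters, fetch, out_dir, delay=1.0):
    """Save one html file per letter, skipping files that already exist.

    `fetch` is any function that takes a letter and returns the page html.
    Returns the list of file paths, in the same order as `letters`.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for letter in letters:
        target = out_dir / f"naxos-records-{letter}.html"
        if not target.is_file():
            target.write_text(fetch(letter), encoding="utf-8")
            sleep(delay)  # be polite: wait before the next request
        saved.append(target)
    return saved


def fake_fetch(letter):
    # Stand-in for a real request; returns placeholder html.
    return f"<html><body>placeholder page {letter}</body></html>"


demo_dir = tempfile.mkdtemp()
first = save_pages("AB", fake_fetch, demo_dir, delay=0)
second = save_pages("ABC", fake_fetch, demo_dir, delay=0)  # A and B are skipped
```

With `requests`, the fetch argument could be something like `lambda letter: requests.get(baseurl + letter).text`.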

Now that we have the html for each one, let's look at one of those pages, like the letter A...

We want to identify the pieces we'll extract.
So it's all a table. 

The composer is the `<b>` text that's a child of a `td` with `bgcolor="#EEEEEE"`

All the composer's music that appears in movies is in `td` cells with `class="style5"`. This is pretty messy because several pieces by one composer may be used in a single movie, and the same piece can be used in different movies.

We can try to scrape these two kinds of `td` cells first and work out how to clean them up later.
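
As a rough sketch of that first pass, the cells can be walked in document order, attaching each `style5` cell to the most recent composer cell. The html in `SAMPLE_HTML` is made up to imitate the structure described above and the real Naxos markup may differ; `group_by_composer` is a hypothetical helper name. It uses Python's built-in `html.parser` so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up html that only imitates the structure described above;
# the real page is likely messier (nested tables, extra cells).
SAMPLE_HTML = """
<table>
  <tr><td bgcolor="#EEEEEE"><b>Lastname, Firstname</b></td></tr>
  <tr><td class="style5">Placeholder piece A</td></tr>
  <tr><td class="style5">Placeholder piece B</td></tr>
  <tr><td bgcolor="#EEEEEE"><b>Othername, Given</b></td></tr>
  <tr><td class="style5">Placeholder piece C</td></tr>
</table>
"""


def group_by_composer(html):
    """Return {composer: [raw subentry text, ...]} in document order."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    current = None  # composer whose subentries we are currently collecting
    for td in soup.find_all("td"):
        if td.get("bgcolor") == "#EEEEEE" and td.find("b"):
            # composer cell: start a new group
            current = td.find("b").get_text(strip=True)
            grouped.setdefault(current, [])
        elif "style5" in td.get("class", []) and current is not None:
            # subentry cell: keep the raw text, clean it later
            grouped[current].append(td.get_text(" ", strip=True))
    return grouped


grouped = group_by_composer(SAMPLE_HTML)
```

The subentry text is kept raw on purpose, since splitting it into piece, item code, movie, and year depends on how the real cells are laid out.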

To get the process right, I'll look at the letter I which only has 4 composers.

In [13]:
in_url_list[8]


'savedPages/naxos-records-I.html'

Now that I verified which item in `in_url_list` corresponds to letter *I*, let's load the html for it.

In [20]:
# read the saved page and close the file when done
with open(in_url_list[8]) as page:
    soup = BeautifulSoup(page.read(), "lxml")

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
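
That `FeatureNotFound` error means BeautifulSoup could not find the lxml parser in the running environment. Installing lxml is one fix; another is to fall back to Python's built-in `html.parser`, as in this small sketch. `make_soup` is a hypothetical helper name, and the two parsers can build slightly different trees from messy html.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when available, otherwise use the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed in this environment
        return BeautifulSoup(markup, "html.parser")


soup_demo = make_soup("<p>hi</p>")
```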

In [21]:
soup.prettify()

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="https://www.w3.org/1999/xhtml">\n <head>\n  <script type="text/javascript">\n   var _sf_startpt=(new Date()).getTime()\n  </script>\n  <title>\n   Classical Music in Movies :  I - Classical Soundtrack and Classical Background Music.\n  </title>\n  <meta content="Classical Music in Movies and Classical Soundtrack: I. Choose from a wide variety of Classical Background Music, Classical movie songs and traditional musical background in mp3, dvd and cds for your own personal Classical Soundtrack collection from Naxos.com" name="Description"/>\n  <meta content="I - Classical Music in Movies Classical Soundtrack Classical Background Music classical music movies classic soundtracks and classical background music movie songs classic traditional background Classical movie songs" name="keywords"/>\n  <meta content="I - Classical Music in Movies Classical Soundt