# What's the point of this?

Based on a brief conversation with Mackenzie Miller from UM's interactive media program, I want to scrape the Naxos Records pages that list classical music in movies by composer: https://www.naxos.com/musicinmoviescomplist.asp?letter=A

# What's the plan?
There is one page per letter of the alphabet, A-Z. The first thing I'll do is build the URL for each page, scrape it, and then gather the entries for each letter of the alphabet.

Each entry is a composer's name, written as *last name*, *first name* *middle name*.
Under the composer we also have the subentries. Each subentry is the useful thing we want. It includes the name of the music piece, the item code that links to it (not fully clear what that means), and the name of the movie it appears in, along with the year the movie came out.

## so let's start

I'll scrape the pages like I did with my News21 work.

In [1]:
import requests
import urllib.parse as parse
from bs4 import BeautifulSoup
import lxml
from time import sleep
import csv
from pathlib import Path

In [2]:
baseurl = "https://www.naxos.com/musicinmoviescomplist.asp?letter="

Now I need to iterate through the alphabet uppercase...

In [3]:
from string import ascii_uppercase

alphabet = ascii_uppercase

In [4]:
alphabet

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [5]:
out_url_list = []
for letter in alphabet:
    testurl = baseurl + letter
    print(testurl)

https://www.naxos.com/musicinmoviescomplist.asp?letter=A
https://www.naxos.com/musicinmoviescomplist.asp?letter=B
https://www.naxos.com/musicinmoviescomplist.asp?letter=C
https://www.naxos.com/musicinmoviescomplist.asp?letter=D
https://www.naxos.com/musicinmoviescomplist.asp?letter=E
https://www.naxos.com/musicinmoviescomplist.asp?letter=F
https://www.naxos.com/musicinmoviescomplist.asp?letter=G
https://www.naxos.com/musicinmoviescomplist.asp?letter=H
https://www.naxos.com/musicinmoviescomplist.asp?letter=I
https://www.naxos.com/musicinmoviescomplist.asp?letter=J
https://www.naxos.com/musicinmoviescomplist.asp?letter=K
https://www.naxos.com/musicinmoviescomplist.asp?letter=L
https://www.naxos.com/musicinmoviescomplist.asp?letter=M
https://www.naxos.com/musicinmoviescomplist.asp?letter=N
https://www.naxos.com/musicinmoviescomplist.asp?letter=O
https://www.naxos.com/musicinmoviescomplist.asp?letter=P
https://www.naxos.com/musicinmoviescomplist.asp?letter=Q
https://www.naxos.com/musicinmo

That worked, so let's keep going and extract the html for each of those pages:

In [6]:
in_url_list = []
# make sure the output folder exists before writing into it
Path("savedPages").mkdir(exist_ok=True)
for letter in alphabet:
    outfile = "savedPages/naxos-records-" + letter + ".html"
    path_to_outfile = Path(outfile)
    # track every page, saved now or on an earlier run, so later cells can index this list
    in_url_list.append(outfile)

    if path_to_outfile.is_file():
        print("We have processed <<"+outfile+">> already.")
    else:
        # cycle through the different pages
        testurl = baseurl + letter
        response = requests.get(testurl)
        a = response.text
        # drop non-ASCII characters, then decode any percent-escapes
        a = a.encode('ascii','ignore').decode('utf-8')
        a = parse.unquote(a)
        with open(outfile, 'w') as out:
            out.write(a)
        print("page \'" + outfile +"\' has been saved for later")

page 'savedPages/naxos-records-A.html' has been saved for later
page 'savedPages/naxos-records-B.html' has been saved for later
page 'savedPages/naxos-records-C.html' has been saved for later
page 'savedPages/naxos-records-D.html' has been saved for later
page 'savedPages/naxos-records-E.html' has been saved for later
page 'savedPages/naxos-records-F.html' has been saved for later
page 'savedPages/naxos-records-G.html' has been saved for later
page 'savedPages/naxos-records-H.html' has been saved for later
page 'savedPages/naxos-records-I.html' has been saved for later
page 'savedPages/naxos-records-J.html' has been saved for later
page 'savedPages/naxos-records-K.html' has been saved for later
page 'savedPages/naxos-records-L.html' has been saved for later
page 'savedPages/naxos-records-M.html' has been saved for later
page 'savedPages/naxos-records-N.html' has been saved for later
page 'savedPages/naxos-records-O.html' has been saved for later
page 'savedPages/naxos-records-P.html' h
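
The same save-once logic could also be wrapped in a small helper. This is only a sketch, not what the notebook ran: `save_page`, its `fetch` parameter, and the `delay` value are hypothetical names introduced here. Passing the downloader in as a function lets the caching be tried without hitting naxos.com, and the pause between requests puts the imported `sleep` to use.

```python
from pathlib import Path
from time import sleep


def save_page(letter, fetch, out_dir="savedPages", delay=1.0):
    """Return the path of the saved page for `letter`, downloading only if needed.

    `fetch` is any callable that takes a URL and returns the page text
    (hypothetical parameter; in the notebook it would wrap requests.get).
    """
    baseurl = "https://www.naxos.com/musicinmoviescomplist.asp?letter="
    out_path = Path(out_dir) / f"naxos-records-{letter}.html"
    if out_path.is_file():
        return out_path  # already cached, so no request is made
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(fetch(baseurl + letter), encoding="utf-8")
    sleep(delay)  # be polite: pause before the next request
    return out_path
```

In the notebook this might be called as `save_page(letter, lambda url: requests.get(url, timeout=30).text)` for each letter in `alphabet`.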

Now that we have the html for each one, let's look at one of those pages, like the letter A...

We want to identify the pieces we'll extract.
So it's all a table. 

The composer is the `<b>` text that's a child of a `td` with `bgcolor="#EEEEEE"`

All of a composer's music that appears in movies is in a `td` with `class="style5"`. This is pretty messy because several pieces by one composer may be used in a single movie, and the same piece may be used in different movies.

We can try to scrape these two `td` elements first and see how to clean the result later.

To get the process right, I'll look at the letter I which only has 4 composers.

In [7]:
in_url_list[8]


'savedPages/naxos-records-I.html'

Now that I verified which item in `in_url_list` corresponds to letter *I*, let's load the html for it.

In [8]:
soup = BeautifulSoup(open(in_url_list[8]).read(),"lxml")

In [9]:
soup.prettify()

<bound method Tag.prettify of <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="https://www.w3.org/1999/xhtml">
<head>
<script type="text/javascript">var _sf_startpt=(new Date()).getTime()</script>
<title>Classical Music in Movies :  I - Classical Soundtrack and Classical Background Music.</title>
<meta content="Classical Music in Movies and Classical Soundtrack: I. Choose from a wide variety of Classical Background Music, Classical movie songs and traditional musical background in mp3, dvd and cds for your own personal Classical Soundtrack collection from Naxos.com" name="Description"/>
<meta content="I - Classical Music in Movies Classical Soundtrack Classical Background Music classical music movies classic soundtracks and classical background music movie songs classic traditional background Classical movie songs" name="keywords"/>
<meta content="I - Classical Music in Movies Classical Soundtrack Cl

Now let's try to get the different composers from this page.

In [20]:
table = soup.find('td',class_='style5')
print(table)

<td class="style5">
<!-- START REVIEWS -->
<table align="center" bgcolor="#CCCCCC" border="0" cellpadding="4" cellspacing="1" width="650">
<tr><td bgcolor="#EEEEEE"><b>IBERT, JACQUES</b></td></tr>
<tr><td bgcolor="#ffffff" class="style5">
<div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
							Flute Concerto: II. Andante <span class="style1">(Capriccio <a href="https://www.naxos.com/catalogue/item.asp?item_code=C71028">C71028</a> )</span><br/><i>Dionysia (A) (2015)</i><br/>
</div>
</td></tr>
<tr><td bgcolor="#EEEEEE"><b>IPPOLITOV-IVANOV, MIKHAIL MIKHAYLOVICH</b></td></tr>
<tr><td bgcolor="#ffffff" class="style5">
<div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
							Caucasian Sketches, Suite No. 1, Op. 10: IV. Procession of the Sardar <span class="style1">(<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.553405">8.553405</a> )</span><br/><i>Curse of

Well this is the table of reviews. Let's see if we can extract the composer part from this table item.

In [21]:
type(table)

bs4.element.Tag

In [44]:
composer = table.find_all('b')

In [45]:
composer

[<b>IBERT, JACQUES</b>,
 <b>IPPOLITOV-IVANOV, MIKHAIL MIKHAYLOVICH</b>,
 <b>IVANOVICI, IOSIF</b>,
 <b>IVES, CHARLES</b>]

Nice, found them. Done in one step:

In [46]:
composers = soup.find('td',class_='style5').find_all('b')

In [47]:
composers

[<b>IBERT, JACQUES</b>,
 <b>IPPOLITOV-IVANOV, MIKHAIL MIKHAYLOVICH</b>,
 <b>IVANOVICI, IOSIF</b>,
 <b>IVES, CHARLES</b>]

So now that we have the composers in a list, we know we can iterate through them all. Let's move on to finding the other text associated with each composer.

In [48]:
table

<td class="style5">
<!-- START REVIEWS -->
<table align="center" bgcolor="#CCCCCC" border="0" cellpadding="4" cellspacing="1" width="650">
<tr><td bgcolor="#EEEEEE"><b>IBERT, JACQUES</b></td></tr>
<tr><td bgcolor="#ffffff" class="style5">
<div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
							Flute Concerto: II. Andante <span class="style1">(Capriccio <a href="https://www.naxos.com/catalogue/item.asp?item_code=C71028">C71028</a> )</span><br/><i>Dionysia (A) (2015)</i><br/>
</div>
</td></tr>
<tr><td bgcolor="#EEEEEE"><b>IPPOLITOV-IVANOV, MIKHAIL MIKHAYLOVICH</b></td></tr>
<tr><td bgcolor="#ffffff" class="style5">
<div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
							Caucasian Sketches, Suite No. 1, Op. 10: IV. Procession of the Sardar <span class="style1">(<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.553405">8.553405</a> )</span><br/><i>Curse of

In [49]:
name = composers[0].text

In [50]:
name

'IBERT, JACQUES'

In [78]:
a = table.find(text=name).parent.parent.parent

In [79]:
a

<tr><td bgcolor="#EEEEEE"><b>IBERT, JACQUES</b></td></tr>

In [82]:
b = a.next_sibling.next_sibling

In [83]:
b

<tr><td bgcolor="#ffffff" class="style5">
<div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
							Flute Concerto: II. Andante <span class="style1">(Capriccio <a href="https://www.naxos.com/catalogue/item.asp?item_code=C71028">C71028</a> )</span><br/><i>Dionysia (A) (2015)</i><br/>
</div>
</td></tr>

This is an ugly way of doing things, but it works... I wonder if there's a better way?
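
One possibly cleaner option is to let BeautifulSoup do the walking with `find_parent` and `find_next_sibling`, which skip over the whitespace text nodes that force the double `.next_sibling`. The sketch below runs on a hand-trimmed copy of the markup shown above (the `html` string is an assumption standing in for the saved page) and uses the built-in `html.parser` so it needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the saved page (assumption: same row layout).
html = """
<table>
<tr><td bgcolor="#EEEEEE"><b>IBERT, JACQUES</b></td></tr>
<tr><td bgcolor="#ffffff" class="style5">
<div>Flute Concerto: II. Andante <span class="style1">(Capriccio <a href="https://www.naxos.com/catalogue/item.asp?item_code=C71028">C71028</a> )</span><br/><i>Dionysia (A) (2015)</i><br/></div>
</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")
composer_tag = soup_demo.find("b")

# Climb to the enclosing row, then jump to the next <tr>, ignoring whitespace nodes.
header_row = composer_tag.find_parent("tr")
content_row = header_row.find_next_sibling("tr")

piece_div = content_row.find("div")
print(composer_tag.text, "|", piece_div.find("a").text)
```

Starting from the `<b>` tag itself also avoids searching the table for the composer's name a second time.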

Now that it's done, let's try to extract that div into its own variable.

In [84]:
uses = b.find('div')

In [85]:
uses

<div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
							Flute Concerto: II. Andante <span class="style1">(Capriccio <a href="https://www.naxos.com/catalogue/item.asp?item_code=C71028">C71028</a> )</span><br/><i>Dionysia (A) (2015)</i><br/>
</div>

From this we can see that the first text item is the name of piece. We also get the item_code, the ID of the piece in their system, in the text associated with the link. The last thing we get is the movie(s) the piece appears in within the `<i>` tags. Let's try to extract those 3 things from the `uses` variable.

In [88]:
piece = uses.find('span',class_='style1').previous_sibling
print(piece)


							Flute Concerto: II. Andante 


Okay, we get the text with some whitespace. That can be cleaned in Excel or Google Sheets.

In [89]:
itemCode = uses.find('a').text
print(itemCode)

C71028


We got the item code!

In [90]:
movies = uses.find('i').text

In [91]:
movies

'Dionysia (A) (2015)'

We got them. These things can now be appended as rows to a large csv file!

## Time to build out.

With a quick glimpse at the page for the letter J, I noticed we can have one composer with multiple pieces being used in films. Each piece gets its own div. Let's see if I can combine everything I have done so far and check whether `find_all` works when I define the `uses` variable.

In [100]:
page = BeautifulSoup(open(in_url_list[0]).read(),"lxml")
table = page.find('td',class_='style5')
composers = page.find('td',class_='style5').find_all('b')

composer = composers[0].text
content = table.find(text=composer).parent.parent.parent.next_sibling.next_sibling
pieces = content.find_all('div')

In [101]:
pieces

[<div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
 							Giselle: Apparition de Giselle <span class="style1">(<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56">8.550755-56</a> )</span><br/><i>Red Shoes (The) (1948)</i><br/>
 </div>,
 <div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
 							Giselle: Entree d'Hilarion, scene et fugue des Wilis <span class="style1">(<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56">8.550755-56</a> )</span><br/><i>Red Shoes (The) (1948)</i><br/>
 </div>,
 <div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
 							Giselle: Pas de deux des jeunes paysans <span class="style1">(<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56">8.550755-56</a> )</span><br/><i>Red Shoes (The) (1948)</i><br/>
 </div>,
 <div style=

This works!

In [105]:
piece = pieces[0]
pieceName = piece.find('span',class_='style1').previous_sibling
pieceCode = piece.find('a').text
pieceMovies = piece.find('i').text

print(composer,pieceName.strip(),pieceCode,pieceMovies)

ADAM, ADOLPHE Giselle: Apparition de Giselle 8.550755-56 Red Shoes (The) (1948)


So it works for one div, let's iterate across pieces:

In [106]:
for piece in pieces:
    pieceName = piece.find('span',class_='style1').previous_sibling
    pieceCode = piece.find('a').text
    pieceMovies = piece.find('i').text
    
    print(composer,pieceName.strip(),pieceCode,pieceMovies) 

ADAM, ADOLPHE Giselle: Apparition de Giselle 8.550755-56 Red Shoes (The) (1948)
ADAM, ADOLPHE Giselle: Entree d'Hilarion, scene et fugue des Wilis 8.550755-56 Red Shoes (The) (1948)
ADAM, ADOLPHE Giselle: Pas de deux des jeunes paysans 8.550755-56 Red Shoes (The) (1948)
ADAM, ADOLPHE Giselle: Pas des premieres Wilis 8.550755-56 Red Shoes (The) (1948)


And can we iterate across composers?

In [108]:
for composer in composers:
    composer_name = composer.text
    content = table.find(text=composer_name).parent.parent.parent.next_sibling.next_sibling
    pieces = content.find_all('div')
    
    for piece in pieces:
        pieceName = piece.find('span',class_='style1').previous_sibling
        pieceCode = piece.find('a').text
        pieceMovies = piece.find('i').text

        print(composer_name,pieceName.strip(),pieceCode,pieceMovies)

ADAM, ADOLPHE Giselle: Apparition de Giselle 8.550755-56 Red Shoes (The) (1948)
ADAM, ADOLPHE Giselle: Entree d'Hilarion, scene et fugue des Wilis 8.550755-56 Red Shoes (The) (1948)
ADAM, ADOLPHE Giselle: Pas de deux des jeunes paysans 8.550755-56 Red Shoes (The) (1948)
ADAM, ADOLPHE Giselle: Pas des premieres Wilis 8.550755-56 Red Shoes (The) (1948)
ADAMS, JOHN China Gates 8.559285 Call Me by Your Name (2017)
ADAMS, JOHN Hallelujah Junction 8.559285 Call Me by Your Name (2017)
ADAMS, JOHN Harmonium: No. 3. Wild Nights 603497121168 Birdman: Or (The Unexpected Virtue of Ignorance) (2014)
ADAMS, JOHN Phrygian Gates 8.559285 Call Me by Your Name (2017)
ADAMS, JOHN The Death of Klinghoffer: Chorus of the Exiled Palestinians 603497121168 Birdman: Or (The Unexpected Virtue of Ignorance) (2014)
ALBENIZ, ISAAC Asturias 8.553999 Talk of Angels (1998)
ALBENIZ, ISAAC Chant d'Espagne, Op. 232: No. 4. Cordoba ODE752-2 Girlfight (2000)
ALBINONI, TOMASO GIOVANNI Adagio 8.550014 Doors (The) (1991)  Ga

Awesome. Now we know we can get all the info for a page. One thing we can add is the url of the item code. Let's look at the last value of `piece`.

In [109]:
piece

<div style="border-top:1px dotted #999999; border-bottom:1px dotted #999999; padding:3px 0px; margin:0px 0px">
							Rule Brittania <span class="style1">(<a href="https://www.naxos.com/catalogue/item.asp?item_code=8.553961">8.553961</a> )</span><br/><i>Alphabet Murders (The) (1965) <br/> BFG (The) (2016) <br/> Charge of the Light Brigade (The) <br/> Into the Arms of Strangers (2007) <br/> Minions (2015) <br/> Three Men and a Little Lady (1990)</i><br/>
</div>

The url for the item code is: `https://www.naxos.com/catalogue/item.asp?item_code=8.553961`. To what we were outputting before, we can add another column for the item_url. This item_url is the link to the item in the Naxos catalog and is formed by adding the item code we already extract to the baseurl `https://www.naxos.com/catalogue/item.asp?item_code=`

In [110]:
item_url = 'https://www.naxos.com/catalogue/item.asp?item_code='+pieceCode

In [111]:
print(item_url)

https://www.naxos.com/catalogue/item.asp?item_code=8.553961


We can also try this idea:

In [122]:
alink = piece.find('a', href=True)['href']
print(alink)

https://www.naxos.com/catalogue/item.asp?item_code=8.553961


That works too. 
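
Since `urllib.parse` is already imported, the item code and its URL can also be converted in both directions without string concatenation. A small sketch; `code_from_url`, `url_from_code`, and `ITEM_BASE` are hypothetical names, not part of the notebook.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ITEM_BASE = "https://www.naxos.com/catalogue/item.asp"


def code_from_url(url):
    """Pull the item_code query parameter out of a catalogue link."""
    return parse_qs(urlparse(url).query)["item_code"][0]


def url_from_code(code):
    """Build the catalogue link for an item code, escaping it if needed."""
    return ITEM_BASE + "?" + urlencode({"item_code": code})


link = "https://www.naxos.com/catalogue/item.asp?item_code=8.553961"
print(code_from_url(link))
print(url_from_code("8.553961") == link)
```

Either direction is enough for the csv; having both makes it easy to double-check that the link text and the `href` agree.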

Now that we have all this, it's important to keep in mind that it all depends on getting a list of composers, which comes from `soup.find('td',class_='style5').find_all('b')`. From looking at the pages, there are some letters for which there are no composers (for example, 'X' and 'Y'). Let's test if the composer list comes out empty for one of these.

In [124]:
page2= BeautifulSoup(open(in_url_list[-2]).read(),"lxml")
table = page2.find('td',class_='style5')
composers = page2.find('td',class_='style5').find_all('b')

In [125]:
composers

[]

In [126]:
len(composers)

0

Yeah, you get an empty list. We can use that to make a conditional.


So let's try it all together now:

In [130]:
naxos_list = []
naxos_list.append(['composerNameFull, pieceName, pieceCode, pieceURL, pieceMovies'])
for url in in_url_list:
    page = BeautifulSoup(open(url).read(),"lxml")
    table = page.find('td',class_='style5')
    composers = page.find('td',class_='style5').find_all('b')
    if len(composers) > 0:
        for composer in composers:
            composer_name = composer.text
            content = table.find(text=composer_name).parent.parent.parent.next_sibling.next_sibling
            pieces = content.find_all('div')

            for piece in pieces:
                pieceName = piece.find('span',class_='style1').previous_sibling
                pieceCode = piece.find('a').text
                pieceURL = piece.find('a', href=True)['href']
                pieceMovies = piece.find('i').text
                naxos_list.append([composer_name, pieceName, pieceCode, pieceURL, pieceMovies])
        else:
            print("no composers in file:" + url )


no composers in file:savedPages/naxos-records-A.html
no composers in file:savedPages/naxos-records-B.html
no composers in file:savedPages/naxos-records-C.html
no composers in file:savedPages/naxos-records-D.html
no composers in file:savedPages/naxos-records-E.html
no composers in file:savedPages/naxos-records-F.html
no composers in file:savedPages/naxos-records-G.html
no composers in file:savedPages/naxos-records-H.html
no composers in file:savedPages/naxos-records-I.html
no composers in file:savedPages/naxos-records-J.html
no composers in file:savedPages/naxos-records-K.html
no composers in file:savedPages/naxos-records-L.html
no composers in file:savedPages/naxos-records-M.html
no composers in file:savedPages/naxos-records-O.html
no composers in file:savedPages/naxos-records-P.html
no composers in file:savedPages/naxos-records-R.html
no composers in file:savedPages/naxos-records-S.html
no composers in file:savedPages/naxos-records-T.html
no composers in file:savedPages/naxos-records-

Well, that didn't work: the pages that do have composers were the ones reported as empty. Let's try debugging that.

In [133]:
for url in in_url_list:
    page = BeautifulSoup(open(url).read(),"lxml")
    table = page.find('td',class_='style5')
    composers = page.find('td',class_='style5').find_all('b')
    print(url, len(composers))

savedPages/naxos-records-A.html 8
savedPages/naxos-records-B.html 22
savedPages/naxos-records-C.html 14
savedPages/naxos-records-D.html 13
savedPages/naxos-records-E.html 5
savedPages/naxos-records-F.html 4
savedPages/naxos-records-G.html 16
savedPages/naxos-records-H.html 9
savedPages/naxos-records-I.html 4
savedPages/naxos-records-J.html 2
savedPages/naxos-records-K.html 6
savedPages/naxos-records-L.html 10
savedPages/naxos-records-M.html 17
savedPages/naxos-records-N.html 0
savedPages/naxos-records-O.html 2
savedPages/naxos-records-P.html 16
savedPages/naxos-records-Q.html 0
savedPages/naxos-records-R.html 8
savedPages/naxos-records-S.html 31
savedPages/naxos-records-T.html 8
savedPages/naxos-records-U.html 0
savedPages/naxos-records-V.html 5
savedPages/naxos-records-W.html 10
savedPages/naxos-records-X.html 0
savedPages/naxos-records-Y.html 0
savedPages/naxos-records-Z.html 1


I used the solution from [here](https://stackoverflow.com/questions/53513/how-do-i-check-if-a-list-is-empty) to check if the list is empty. An empty list is falsy in Python, so `not composers` evaluates to `True` exactly when no composers were found.
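
To see the truthiness rule in action, here is a small sketch with made-up sample lists. It also shows a related trap: an `else` indented under a `for` belongs to the loop and runs whenever the loop finishes without `break`, which matches what the earlier attempt printed. `report` is a hypothetical helper used only for this demo.

```python
def report(composers):
    """Describe a composer list using the `not composers` test."""
    if not composers:  # an empty list is falsy, so `not []` is True
        return "no composers"
    return f"{len(composers)} composers"


print(report([]))
print(report(["IBERT, JACQUES", "IVES, CHARLES"]))

# for/else: the else branch runs because the loop ended without `break`.
messages = []
for name in ["IBERT, JACQUES"]:
    messages.append(name)
else:
    messages.append("else branch ran")
print(messages)
```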

In [139]:
for url in in_url_list:
    page = BeautifulSoup(open(url).read(),"lxml")
    table = page.find('td',class_='style5')
    composers = page.find('td',class_='style5').find_all('b')
    if not composers:
        print(url, len(composers),'false')
    else:
        print(url, len(composers), 'true')

savedPages/naxos-records-A.html 8 true
savedPages/naxos-records-B.html 22 true
savedPages/naxos-records-C.html 14 true
savedPages/naxos-records-D.html 13 true
savedPages/naxos-records-E.html 5 true
savedPages/naxos-records-F.html 4 true
savedPages/naxos-records-G.html 16 true
savedPages/naxos-records-H.html 9 true
savedPages/naxos-records-I.html 4 true
savedPages/naxos-records-J.html 2 true
savedPages/naxos-records-K.html 6 true
savedPages/naxos-records-L.html 10 true
savedPages/naxos-records-M.html 17 true
savedPages/naxos-records-N.html 0 false
savedPages/naxos-records-O.html 2 true
savedPages/naxos-records-P.html 16 true
savedPages/naxos-records-Q.html 0 false
savedPages/naxos-records-R.html 8 true
savedPages/naxos-records-S.html 31 true
savedPages/naxos-records-T.html 8 true
savedPages/naxos-records-U.html 0 false
savedPages/naxos-records-V.html 5 true
savedPages/naxos-records-W.html 10 true
savedPages/naxos-records-X.html 0 false
savedPages/naxos-records-Y.html 0 false
savedPages/

So this is how we'll check for the composers before proceeding.

In [144]:
naxos_list = []
naxos_list.append(['composerNameFull, pieceName, pieceCode, pieceURL, pieceMovies'])
for url in in_url_list:
    page = BeautifulSoup(open(url).read(),"lxml")
    table = page.find('td',class_='style5')
    composers = page.find('td',class_='style5').find_all('b')
    if not composers:
        print("no composers in file:" + url )
    else:
        for composer in composers:
            composer_name = composer.text
            content = table.find(text=composer_name).parent.parent.parent.next_sibling.next_sibling
            pieces = content.find_all('div')

            for piece in pieces:
                pieceName = piece.find('span',class_='style1').previous_sibling
                pieceCode = piece.find('a').text
                pieceURL = piece.find('a', href=True)['href']
                pieceMovies = piece.find('i').text
                naxos_list.append([composer_name, pieceName, pieceCode, pieceURL, pieceMovies])

no composers in file:savedPages/naxos-records-N.html
no composers in file:savedPages/naxos-records-Q.html
no composers in file:savedPages/naxos-records-U.html
no composers in file:savedPages/naxos-records-X.html
no composers in file:savedPages/naxos-records-Y.html


In [145]:
naxos_list

[['composerNameFull, pieceName, pieceCode, pieceURL, pieceMovies'],
 ['ADAM, ADOLPHE',
  '\n\t\t\t\t\t\t\tGiselle: Apparition de Giselle ',
  '8.550755-56',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56',
  'Red Shoes (The) (1948)'],
 ['ADAM, ADOLPHE',
  "\n\t\t\t\t\t\t\tGiselle: Entree d'Hilarion, scene et fugue des Wilis ",
  '8.550755-56',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56',
  'Red Shoes (The) (1948)'],
 ['ADAM, ADOLPHE',
  '\n\t\t\t\t\t\t\tGiselle: Pas de deux des jeunes paysans ',
  '8.550755-56',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56',
  'Red Shoes (The) (1948)'],
 ['ADAM, ADOLPHE',
  '\n\t\t\t\t\t\t\tGiselle: Pas des premieres Wilis ',
  '8.550755-56',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56',
  'Red Shoes (The) (1948)'],
 ['ADAMS, JOHN',
  '\n\t\t\t\t\t\t\tChina Gates ',
  '8.559285',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.559285',
  'Call Me by Your Name (

Looking at this, I realize I forgot to strip the whitespace from the piece title. Let's add that to the next version.

In [146]:
naxos_list = []
naxos_list.append(['composerNameFull, pieceName, pieceCode, pieceURL, pieceMovies'])
for url in in_url_list:
    page = BeautifulSoup(open(url).read(),"lxml")
    table = page.find('td',class_='style5')
    composers = page.find('td',class_='style5').find_all('b')
    if not composers:
        print("no composers in file:" + url )
    else:
        for composer in composers:
            composer_name = composer.text
            content = table.find(text=composer_name).parent.parent.parent.next_sibling.next_sibling
            pieces = content.find_all('div')

            for piece in pieces:
                pieceName = piece.find('span',class_='style1').previous_sibling.strip()
                pieceCode = piece.find('a').text
                pieceURL = piece.find('a', href=True)['href']
                pieceMovies = piece.find('i').text
                naxos_list.append([composer_name, pieceName, pieceCode, pieceURL, pieceMovies])

no composers in file:savedPages/naxos-records-N.html
no composers in file:savedPages/naxos-records-Q.html
no composers in file:savedPages/naxos-records-U.html
no composers in file:savedPages/naxos-records-X.html
no composers in file:savedPages/naxos-records-Y.html


In [147]:
naxos_list

[['composerNameFull, pieceName, pieceCode, pieceURL, pieceMovies'],
 ['ADAM, ADOLPHE',
  'Giselle: Apparition de Giselle',
  '8.550755-56',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56',
  'Red Shoes (The) (1948)'],
 ['ADAM, ADOLPHE',
  "Giselle: Entree d'Hilarion, scene et fugue des Wilis",
  '8.550755-56',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56',
  'Red Shoes (The) (1948)'],
 ['ADAM, ADOLPHE',
  'Giselle: Pas de deux des jeunes paysans',
  '8.550755-56',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56',
  'Red Shoes (The) (1948)'],
 ['ADAM, ADOLPHE',
  'Giselle: Pas des premieres Wilis',
  '8.550755-56',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.550755-56',
  'Red Shoes (The) (1948)'],
 ['ADAMS, JOHN',
  'China Gates',
  '8.559285',
  'https://www.naxos.com/catalogue/item.asp?item_code=8.559285',
  'Call Me by Your Name (2017)'],
 ['ADAMS, JOHN',
  'Hallelujah Junction',
  '8.559285',
  'https://www.naxos

This list can now be written to a csv. It's not tidy yet because there are cases where one piece is used in multiple movies and we haven't separated those yet. I'm not sure how often that happens, but I'll look at that in Excel.
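
If the spreadsheet route gets tedious, the movies could be separated at scrape time instead. A sketch, assuming the movies inside `<i>` are always separated by `<br/>` as in the 'Rule Brittania' entry above, and that a trailing four-digit number in parentheses is the year. `split_movies` is a hypothetical helper and the `html` string is a shortened copy of that entry.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the multi-movie entry shown earlier (assumption: same layout).
html = (
    "<div>Rule Brittania <i>Alphabet Murders (The) (1965) <br/> "
    "BFG (The) (2016) <br/> Charge of the Light Brigade (The)</i><br/></div>"
)

YEAR_AT_END = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)$")


def split_movies(i_tag):
    """Return (title, year) pairs, one per movie; year is None when absent."""
    rows = []
    for text in i_tag.stripped_strings:  # one string per <br/>-separated movie
        match = YEAR_AT_END.match(text)
        if match:
            rows.append((match.group("title"), match.group("year")))
        else:
            rows.append((text, None))
    return rows


i_tag = BeautifulSoup(html, "html.parser").find("i")
for title, year in split_movies(i_tag):
    print(title, year)
```

Each pair could then become its own csv row, repeating the composer, piece, and item code, which would give one row per piece-movie combination.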

In [149]:
import csv

In [151]:
with open("out_url_list.csv","w",newline="") as my_csv:  # newline="" avoids blank rows on Windows
    csvWriter = csv.writer(my_csv,delimiter=',')
    csvWriter.writerows(naxos_list)

Looking at the file, the header row came out as a single cell instead of five columns, but everything else looks workable. I should have checked earlier, but let's fix it now.
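
The header went wrong because it was one string holding all five names, so `csv.writer` treated it as a single field and quoted it. A quick in-memory check makes the difference visible. Using `io.StringIO` instead of a file is a choice made just for this demo, and `as_csv_line` is a hypothetical helper.

```python
import csv
import io

bad_header = ['composerNameFull, pieceName, pieceCode, pieceURL, pieceMovies']
good_header = ['composerNameFull', 'pieceName', 'pieceCode', 'pieceURL', 'pieceMovies']


def as_csv_line(row):
    """Write one row with csv.writer and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=',').writerow(row)
    return buffer.getvalue()


print(repr(as_csv_line(bad_header)))   # one quoted field
print(repr(as_csv_line(good_header)))  # five separate fields
```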

In [152]:
naxos_list = []
naxos_list.append(['composerNameFull', 'pieceName', 'pieceCode', 'pieceURL', 'pieceMovies'])
for url in in_url_list:
    page = BeautifulSoup(open(url).read(),"lxml")
    table = page.find('td',class_='style5')
    composers = page.find('td',class_='style5').find_all('b')
    if not composers:
        print("no composers in file:" + url )
    else:
        for composer in composers:
            composer_name = composer.text
            content = table.find(text=composer_name).parent.parent.parent.next_sibling.next_sibling
            pieces = content.find_all('div')

            for piece in pieces:
                pieceName = piece.find('span',class_='style1').previous_sibling.strip()
                pieceCode = piece.find('a').text
                pieceURL = piece.find('a', href=True)['href']
                pieceMovies = piece.find('i').text
                naxos_list.append([composer_name, pieceName, pieceCode, pieceURL, pieceMovies])

no composers in file:savedPages/naxos-records-N.html
no composers in file:savedPages/naxos-records-Q.html
no composers in file:savedPages/naxos-records-U.html
no composers in file:savedPages/naxos-records-X.html
no composers in file:savedPages/naxos-records-Y.html


In [153]:
naxos_list[0]

['composerNameFull', 'pieceName', 'pieceCode', 'pieceURL', 'pieceMovies']

Much better. Let's write this now to a file and label it as untidy.

In [154]:
with open("naxosClassicalMusicUntidy.csv","w",newline="") as my_csv:  # newline="" avoids blank rows on Windows
    csvWriter = csv.writer(my_csv,delimiter=',')
    csvWriter.writerows(naxos_list)