# Get a complete list of "The National" albums and songs

The goal of this code is to get a list of non-repeating songs by "The National" from Wikipedia, stripped of any additional information: just the song titles. We focus only on songs from albums and EPs.

First we need to build a SPARQL query to get the list of "The National" albums.



In [1]:
#GOAL: get a list of albums and songs


#useful links
#http://docs.python-requests.org/en/master/user/quickstart/
#https://janakiev.com/blog/wikidata-mayors/


import requests
import csv
#test sparql query here: https://query.wikidata.org/
url = 'https://query.wikidata.org/sparql'

#the below is only getting us albums, to get more songs I would also like to
#add EPs, extended plays

# query = """

# SELECT ?album ?albumLabel ?article WHERE {
#   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
 
#   ?album wdt:P31 wd:Q482994.  #?album wdt:P31 wd:Q482994. #is an Album
#   ?album wdt:P175 wd:Q1142566.  # hasPerformerproperty "The National"
    
#   ?article schema:about ?album .
#   ?article schema:inLanguage "en" .
#   ?article schema:isPartOf <https://en.wikipedia.org/> .
  
# }

# """
#############################################################
#query including EPs (extended plays) below
 

query = """

SELECT ?album ?albumLabel ?article WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  { ?album wdt:P31 wd:Q482994. }
  UNION
  { ?album wdt:P31 wd:Q169930. }
  ?album wdt:P175 wd:Q1142566.
  ?article schema:about ?album.
  ?article schema:inLanguage "en".
  ?article schema:isPartOf <https://en.wikipedia.org/>.
}

"""

r = requests.get(url, params = {'format': 'json', 'query': query})
r.raise_for_status() #stop here if the endpoint returned an HTTP error
data = r.json()


import pandas as pd
from collections import OrderedDict

albums = []
for item in data['results']['bindings']:
    albums.append(OrderedDict({
        'album': item['albumLabel']['value'],
        'article': item['article']['value']}))

df = pd.DataFrame(albums)
#df.set_index('albumLabel', inplace=True)
#a list of The National Albums in a df
df


Unnamed: 0,album,article
0,The Virginia EP,https://en.wikipedia.org/wiki/The_Virginia_EP
1,Cherry Tree,https://en.wikipedia.org/wiki/Cherry_Tree_(EP)
2,Sad Songs for Dirty Lovers,https://en.wikipedia.org/wiki/Sad_Songs_for_Di...
3,Alligator,https://en.wikipedia.org/wiki/Alligator_(The_N...
4,Boxer,https://en.wikipedia.org/wiki/Boxer_(The_Natio...
5,High Violet,https://en.wikipedia.org/wiki/High_Violet
6,The National,https://en.wikipedia.org/wiki/The_National_(al...
7,Trouble Will Find Me,https://en.wikipedia.org/wiki/Trouble_Will_Fin...
8,Sleep Well Beast,https://en.wikipedia.org/wiki/Sleep_Well_Beast


Now we need to get the HTML of the web pages we are interested in. We could use the Wikipedia API or BeautifulSoup for this purpose; here I'm combining the two. The Wikipedia API fetches the page and BeautifulSoup lets me extract the parts I'm interested in. First, let's get the songs from one album:

In [2]:
import wikipedia
from bs4 import BeautifulSoup
#print(wikipedia.summary("Sad Songs for Dirty Lovers"))
#print(wikipedia.content("Sad Songs for Dirty Lovers"))


#print(wikipedia.WikipediaPage(title = 'Metropolis (1927 film)').summary)
#this will take page_id, get it!!!!


#example
#pageContent = wikipedia.WikipediaPage('Sad Songs for Dirty Lovers').content
#
#get content of a section
#section = wikipedia.WikipediaPage('Sleep Well Beast').section("Promotion")
#print(wikipedia.WikipediaPage('Euclid').sections)
#print(wikipedia.WikipediaPage('Sleep Well Beast').images)
#print(wikipedia.WikipediaPage('Sleep Well Beast').sections)


#print(wikipedia.WikipediaPage('Sleep Well Beast').html())
# won't work, because section() returns a plain string that has no html() method:
# print(wikipedia.WikipediaPage('Sleep Well Beast').section("Track listing").html())
content=wikipedia.WikipediaPage('Sleep Well Beast').html()

soup = BeautifulSoup(content, 'html.parser')
table = soup.find('table',{'class':'tracklist'}) #hopefully the same for all pages
#print(table)
songs = table.find_all('td',{'style':'vertical-align:top'})

for s in songs:
    print(s.string)
#print(songs)


"Nobody Else Will Be There"
None
None
None
"Born to Beg"
"Turtleneck"
"Empire Line"
"I'll Still Destroy You"
"Guilty Party"
"Carin at the Liquor Store"
"Dark Side of the Gym"
"Sleep Well Beast"


The .string attribute returns None for cells that contain additional HTML elements (that is, more than one child), but the .text attribute used below does the job. ".text gets all the child strings and return concatenated using the given separator", from https://stackoverflow.com/questions/25327693/difference-between-string-and-text-beautifulsoup
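
The difference is easy to reproduce on a small piece of hand-written HTML. This is only a sketch: the markup below is a simplified assumption about what a track-listing cell looks like, not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Two simplified table cells: the first holds plain text only, the second
# holds text plus a nested <span>, so it has more than one child node.
html = (
    '<table>'
    '<tr><td>"Born to Beg"</td></tr>'
    '<tr><td>"Day I Die" <span>(composed by The National)</span></td></tr>'
    '</table>'
)

soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('td')

for cell in cells:
    # .string is None as soon as the tag has more than one child;
    # .text concatenates every string found inside the tag.
    print(repr(cell.string), '|', repr(cell.text))
```

The second cell prints `None` for `.string` but the full text for `.text`, which matches the `None` lines in the output of the previous cell.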

In [3]:
from bs4 import BeautifulSoup
import urllib.request
def make_soup(url):
    thepage=urllib.request.urlopen(url)
    soupdata=BeautifulSoup(thepage, "html.parser")
    return soupdata

soup= make_soup("https://en.wikipedia.org/wiki/Sleep_Well_Beast")
songsInAlbum=""
# find all table ,get the first
table = soup.find_all('table', class_="tracklist")[0]  # Only use the first table
#table
#iter over it
for record in table.findAll('tr'):
    albumdata=""
    #for data in record.findAll('td'):
    for data in record.findAll('td', style="vertical-align:top"):
        print(data.text)


"Nobody Else Will Be There"
"Day I Die" (composed by The National)
"Walk It Back" (includes an excerpt from the article "Faith, Certainty and the Presidency of George W. Bush" by Ron Suskind, first published in The New York Times[43])
"The System Only Dreams in Total Darkness"
"Born to Beg"
"Turtleneck"
"Empire Line"
"I'll Still Destroy You"
"Guilty Party"
"Carin at the Liquor Store"
"Dark Side of the Gym"
"Sleep Well Beast"


In [4]:
from bs4 import BeautifulSoup
import urllib.request
def make_soup(url):
    thepage=urllib.request.urlopen(url)
    soupdata=BeautifulSoup(thepage, "html.parser")
    return soupdata


for index, row in df.iterrows():
    soup= make_soup(row['article'])
#songsInAlbum=""
# find all table ,get the first
    table = soup.find_all('table', class_="tracklist")[0]  # Only use the first table
#table
#iter over it
    for record in table.findAll('tr'):
        #albumdata=""
        for data in record.findAll('td', style="vertical-align:top"):
            print(data.text)


"You've Done It Again, Virginia"
 
"Santa Clara"
 
"Blank Slate"
 
"Tall Saint" (Demo)
 
"Without Permission"
Caroline Martin
"Forever After Days" (Demo)
 
"Rest of Years" (Demo)
 
"Slow Show" (Demo)
 
"Lucky You" (Daytrotter Session)
 
"Mansion on the Hill" (Live)
Bruce Springsteen
"Fake Empire" (Live)
 
"About Today" (Live)
 
"Wasp Nest"
 
"All the Wine"
 
"All Dolled-Up in Straps"
 
"Cherry Tree"
 
"About Today"
 
"Murder Me Rachael" (Live)
 
"A Reasonable Man (I Don't Mind)"
Padma Newsome
"Cardinal Song"
"Slipping Husband"
"90-Mile Water Wall"
"It Never Happened"
"Murder Me Rachael"
"Thirsty"
"Available"
"Sugar Wife"
"Trophy Wife"
"Fashion Coat"
"Patterns of Fairytales"
"Lucky You"
"Secret Meeting" (Berninger, A. Dessner, Scott Devendorf)
"Karen" (Berninger, Bryce Dessner)
"Lit Up"
"Looking for Astronauts"
"Daughters of the SoHo Riots" (Berninger, B. Dessner, S. Devendorf)
"Baby, We'll Be Fine"
"Friend of Mine"
"Val Jester" (Berninger, B. Dessner)
"All the Wine" (The National)
"Abe

The code above gets all the song titles from all the albums and EPs. Now we need to store them in some data structure, trying with a list first.

In [5]:
#list of albums and eps with sublists of songs
#print(df)
from bs4 import BeautifulSoup
import urllib.request
def make_soup(url):
    thepage=urllib.request.urlopen(url)
    soupdata=BeautifulSoup(thepage, "html.parser")
    return soupdata
#careful: the loop assigns to row['album'], so iterate over a copy to avoid modifying the original df
df_copy =df.copy()
albumdata = []
for index, row in df_copy.iterrows():
    soup= make_soup(row['article'])
    
    row['album'] = []
    table = soup.find_all('table', class_="tracklist")[0]  # Only use the first table

    for record in table.findAll('tr'):
        
        for data in record.findAll('td', style="vertical-align:top"):
            #print(data.text)
            row['album'].append(data.text)

    albumdata.append(row['album'])

print(albumdata)
#I could do it this way but I'm losing album names which is not ideal.
#a dictionary is probably better suited for this

[['"You\'ve Done It Again, Virginia"', '\xa0', '"Santa Clara"', '\xa0', '"Blank Slate"', '\xa0', '"Tall Saint" (Demo)', '\xa0', '"Without Permission"', 'Caroline Martin', '"Forever After Days" (Demo)', '\xa0', '"Rest of Years" (Demo)', '\xa0', '"Slow Show" (Demo)', '\xa0', '"Lucky You" (Daytrotter Session)', '\xa0', '"Mansion on the Hill" (Live)', 'Bruce Springsteen', '"Fake Empire" (Live)', '\xa0', '"About Today" (Live)', '\xa0'], ['"Wasp Nest"', '\xa0', '"All the Wine"', '\xa0', '"All Dolled-Up in Straps"', '\xa0', '"Cherry Tree"', '\xa0', '"About Today"', '\xa0', '"Murder Me Rachael" (Live)', '\xa0', '"A Reasonable Man (I Don\'t Mind)"', 'Padma Newsome'], ['"Cardinal Song"', '"Slipping Husband"', '"90-Mile Water Wall"', '"It Never Happened"', '"Murder Me Rachael"', '"Thirsty"', '"Available"', '"Sugar Wife"', '"Trophy Wife"', '"Fashion Coat"', '"Patterns of Fairytales"', '"Lucky You"'], ['"Secret Meeting" (Berninger, A. Dessner, Scott Devendorf)', '"Karen" (Berninger, Bryce Dessner)'

The above looks fine but the best way to store albums and songs is probably to have a dictionary where the keys are album names and values are songs of a given album.
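
The album-to-songs mapping can be tried out without any scraping. The sketch below uses `collections.defaultdict`, which is an alternative to the approaches in the following cells and is not what the notebook itself uses; the `pairs` list is a small hand-picked sample taken from the output above.

```python
from collections import defaultdict

# (album, song) pairs as they would come out of the scraping loop.
pairs = [
    ("Cherry Tree", '"Wasp Nest"'),
    ("Cherry Tree", '"All the Wine"'),
    ("Sad Songs for Dirty Lovers", '"Cardinal Song"'),
]

# defaultdict(list) creates an empty list the first time a key is accessed,
# so no explicit "is the album already there?" check is needed.
albums_to_songs = defaultdict(list)
for album, song in pairs:
    albums_to_songs[album].append(song)

print(dict(albums_to_songs))
```

Unlike a flat list of lists, this keeps the album name attached to its songs.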

In [6]:
from bs4 import BeautifulSoup
import urllib.request
def make_soup(url):
    thepage=urllib.request.urlopen(url)
    soupdata=BeautifulSoup(thepage, "html.parser")
    return soupdata


#album_dict = {} #both ways to initialise a dictionary are OK
album_dict = dict()
for index, row in df.iterrows():
    soup= make_soup(row['article'])
    article =row['article']
    album =row['album']

    table = soup.find_all('table', class_="tracklist")[0]  # Only use the first table

    for record in table.findAll('tr'):
    #songsInAlbum=""
        
        for data in record.findAll('td', style="vertical-align:top"):

            album_dict.setdefault(album, []).append(data.text)

#recipe here; http://code.activestate.com/recipes/52219-associating-multiple-values-with-each-key-in-a-dic/
#https://stackoverflow.com/questions/1024847/add-new-keys-to-a-dictionary

    
print(album_dict)


{'The Virginia EP': ['"You\'ve Done It Again, Virginia"', '\xa0', '"Santa Clara"', '\xa0', '"Blank Slate"', '\xa0', '"Tall Saint" (Demo)', '\xa0', '"Without Permission"', 'Caroline Martin', '"Forever After Days" (Demo)', '\xa0', '"Rest of Years" (Demo)', '\xa0', '"Slow Show" (Demo)', '\xa0', '"Lucky You" (Daytrotter Session)', '\xa0', '"Mansion on the Hill" (Live)', 'Bruce Springsteen', '"Fake Empire" (Live)', '\xa0', '"About Today" (Live)', '\xa0'], 'Cherry Tree': ['"Wasp Nest"', '\xa0', '"All the Wine"', '\xa0', '"All Dolled-Up in Straps"', '\xa0', '"Cherry Tree"', '\xa0', '"About Today"', '\xa0', '"Murder Me Rachael" (Live)', '\xa0', '"A Reasonable Man (I Don\'t Mind)"', 'Padma Newsome'], 'Sad Songs for Dirty Lovers': ['"Cardinal Song"', '"Slipping Husband"', '"90-Mile Water Wall"', '"It Never Happened"', '"Murder Me Rachael"', '"Thirsty"', '"Available"', '"Sugar Wife"', '"Trophy Wife"', '"Fashion Coat"', '"Patterns of Fairytales"', '"Lucky You"'], 'Alligator': ['"Secret Meeting" (B

In [7]:
#another way to do this

from bs4 import BeautifulSoup
import urllib.request
def make_soup(url):
    thepage=urllib.request.urlopen(url)
    soupdata=BeautifulSoup(thepage, "html.parser")
    return soupdata


#album_dict = dict() #both ways to initialise a dictionary are OK
album_dict = {}
for index, row in df.iterrows():
    soup= make_soup(row['article'])
    article =row['article']
    album =row['album']

    table = soup.find_all('table', class_="tracklist")[0]  # Only use the first table

    for record in table.findAll('tr'):
    #songsInAlbum=""
        
        for data in record.findAll('td', style="vertical-align:top"):

            if album in album_dict:
        # append the new song  to the existing album 
                album_dict[album].append(data.text)
            else:
        # create a new album 
                album_dict[album] = [data.text]

#recipe here; https://stackoverflow.com/questions/3199171/append-multiple-values-for-one-key-in-a-dictionary
#https://stackoverflow.com/questions/1024847/add-new-keys-to-a-dictionary

    
print(album_dict)

{'The Virginia EP': ['"You\'ve Done It Again, Virginia"', '\xa0', '"Santa Clara"', '\xa0', '"Blank Slate"', '\xa0', '"Tall Saint" (Demo)', '\xa0', '"Without Permission"', 'Caroline Martin', '"Forever After Days" (Demo)', '\xa0', '"Rest of Years" (Demo)', '\xa0', '"Slow Show" (Demo)', '\xa0', '"Lucky You" (Daytrotter Session)', '\xa0', '"Mansion on the Hill" (Live)', 'Bruce Springsteen', '"Fake Empire" (Live)', '\xa0', '"About Today" (Live)', '\xa0'], 'Cherry Tree': ['"Wasp Nest"', '\xa0', '"All the Wine"', '\xa0', '"All Dolled-Up in Straps"', '\xa0', '"Cherry Tree"', '\xa0', '"About Today"', '\xa0', '"Murder Me Rachael" (Live)', '\xa0', '"A Reasonable Man (I Don\'t Mind)"', 'Padma Newsome'], 'Sad Songs for Dirty Lovers': ['"Cardinal Song"', '"Slipping Husband"', '"90-Mile Water Wall"', '"It Never Happened"', '"Murder Me Rachael"', '"Thirsty"', '"Available"', '"Sugar Wife"', '"Trophy Wife"', '"Fashion Coat"', '"Patterns of Fairytales"', '"Lucky You"'], 'Alligator': ['"Secret Meeting" (B

Now I want to clean the songs (keep only the song names) and save them in a new dictionary.
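
The cleaning rules can be sketched on a handful of cell texts taken from the output above. `clean_title` is a hypothetical helper, and unlike the notebook cells it drops the surrounding double quotes by using `match.group(1)`; treat it as one possible approach under those assumptions.

```python
import re

TITLE_RE = re.compile(r'"([^"]*)"')   # first run of text between double quotes
FOOTNOTE_RE = re.compile(r'\[\d+\]')  # footnote markers such as [25]


def clean_title(cell_text):
    """Return the bare song title, or None if the cell holds no quoted title."""
    match = TITLE_RE.search(cell_text)  # search() stops at the first match
    if match is None:
        return None
    return FOOTNOTE_RE.sub('', match.group(1)).strip()


samples = [
    '"Tall Saint" (Demo)',
    '\xa0',                 # empty cell (non-breaking space)
    'Caroline Martin',      # writer credit, not a song
    '"Walk It Back" (includes an excerpt from the article "Faith, Certainty '
    'and the Presidency of George W. Bush" by Ron Suskind, first published '
    'in The New York Times[43])',
    '"Slow Show[25]"',
]

cleaned = [title for title in map(clean_title, samples) if title is not None]
print(cleaned)
```

Because only the first quoted string is taken, the article title quoted inside the "Walk It Back" note is ignored.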

In [8]:
#get only song names
#songs are only in double quotes, extract everything in double quotes 

# this line is problematic: "Walk It Back" (includes an excerpt from the article 
#"Faith, Certainty and the Presidency of George W. Bush" by Ron Suskind, first published in The New York Times[43])
# the code that gets everything in double quotes needs to be fixed to take only first instance in double quotes
import re
#re.findall(r'"([^"]*)"', inputString)


#also need to remove 25 from "Slow Show[25]"
album_dict_cleaned = {}
for key in album_dict:
    #print(album_dict[key])
    #only things in double quotes, get rid of double backslashes
    pattern = r'\[.*?\]' #this pattern matches anything in square brackets, e.g. the [25] in "Slow Show[25]", so that it can be removed
   # value_list =album_dict[key] #this is a list
    value_list = re.findall(r'"([^"]*)"', str(album_dict[key]).replace("\\", ""))
    #the above removes any backslashes and then finds everything in double quotes
    for element in value_list:       
        element_f = re.sub(pattern, '', element)
        
        if key in album_dict_cleaned:
        # append the new song  to the existing album 
            album_dict_cleaned[key].append(element_f)
        else:
        # create a new album 
            album_dict_cleaned[key] = [element_f]
    
    
#print(album_dict_cleaned)    
   #OK!

In [42]:
#get only song names
#better - final version below
#songs are only in double quotes, extract everything in double quotes 

# this line is problematic: "Walk It Back" (includes an excerpt from the article 
#"Faith, Certainty and the Presidency of George W. Bush" by Ron Suskind, first published in The New York Times[43])
# the code that gets everything in double quotes needs to be fixed to take only first instance in double quotes
import re
#re.findall(r'"([^"]*)"', inputString)


#also need to remove 25 from "Slow Show[25]"
album_dict_cleaned = {}
counter = 0
for key in album_dict:
    #print(album_dict[key])
    #only things in double quotes, get rid of double backslashes
    pattern = r'\[.*?\]' #this pattern matches anything in square brackets, e.g. the [25] in "Slow Show[25]" (not used in this version)
   # value_list =album_dict[key] #this is a list
   # value_list = str(album_dict[key]).replace("\\", "")
    value_list = album_dict[key]
    #print(value_list)
    #print(type(value_list))
    #break
    #value_list is the raw list of cell texts for this album; each element is searched separately below
    for element in value_list: 
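        #re.findall(r'"([^"]*)"', element)
        #re.search returns only the first quoted string, which fixes the "Walk It Back" problem;
        #match.group() keeps the surrounding quotes, match.group(1) would drop them
        match = re.search(r'"([^"]*)"',element) 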
        counter =counter+1
        
        if match:
            #print((match.group())+str(counter))
            
            if key in album_dict_cleaned:
        # append the new song  to the existing album 
                album_dict_cleaned[key].append(match.group())
                print(match.group())
            else:
        # create a new album 
                album_dict_cleaned[key] = [match.group()]
                print(match.group())
        else:
            print ('did not find')

    
#print(album_dict_cleaned)    
   #OK!

"You've Done It Again, Virginia"
did not find
"Santa Clara"
did not find
"Blank Slate"
did not find
"Tall Saint"
did not find
"Without Permission"
did not find
"Forever After Days"
did not find
"Rest of Years"
did not find
"Slow Show"
did not find
"Lucky You"
did not find
"Mansion on the Hill"
did not find
"Fake Empire"
did not find
"About Today"
did not find
"Wasp Nest"
did not find
"All the Wine"
did not find
"All Dolled-Up in Straps"
did not find
"Cherry Tree"
did not find
"About Today"
did not find
"Murder Me Rachael"
did not find
"A Reasonable Man (I Don't Mind)"
did not find
"Cardinal Song"
"Slipping Husband"
"90-Mile Water Wall"
"It Never Happened"
"Murder Me Rachael"
"Thirsty"
"Available"
"Sugar Wife"
"Trophy Wife"
"Fashion Coat"
"Patterns of Fairytales"
"Lucky You"
"Secret Meeting"
"Karen"
"Lit Up"
"Looking for Astronauts"
"Daughters of the SoHo Riots"
"Baby, We'll Be Fine"
"Friend of Mine"
"Val Jester"
"All the Wine"
"Abel"
"The Geese of Beverly Road"
"City Middle"
"Mr. Novem

Save everything in a pickle file.

In [43]:
import pickle

with open('theNationalSongDictionary.pickle', 'wb') as handle:
    pickle.dump(album_dict_cleaned, handle, protocol=pickle.HIGHEST_PROTOCOL)
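
The stated goal is a list of non-repeating songs, and the same title can appear on more than one release ("All the Wine" is on both Cherry Tree and Alligator). Below is a minimal sketch of loading a pickled dictionary and flattening it into a list of unique titles. The `sample` dictionary and the temporary file name are made up for illustration; with the real data, the file written above would be opened instead.

```python
import os
import pickle
import tempfile

# Small stand-in for album_dict_cleaned, using titles from the output above.
sample = {
    "Cherry Tree": ['"All the Wine"', '"About Today"'],
    "Alligator": ['"Secret Meeting"', '"All the Wine"'],
}

# Round-trip through a pickle file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample_songs.pickle")
with open(path, "wb") as handle:
    pickle.dump(sample, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as handle:
    loaded = pickle.load(handle)

# Flatten to unique titles, keeping the order of first appearance.
unique_songs = []
seen = set()
for songs in loaded.values():
    for song in songs:
        title = song.strip('"')  # drop the surrounding double quotes
        if title not in seen:
            seen.add(title)
            unique_songs.append(title)

print(unique_songs)
```

This only removes exact duplicates; whether live or demo versions should count as separate songs is a separate decision.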