# Nasa Wake up Music Project
This notebook cleans and extends the Nasa Wakeup Songs archive, which was produced by Nasa's history department and digitised by Ross Spencer. 
Notebook by Matthew Allinson. Contact me via [mattallinson.com](http://mattallinson.com)

## 1. Getting genre data from Discogs

The first part of this notebook gets data from [discogs](https://www.discogs.com) to supplement the song information. I have added genre, style and release year to my extended database, along with the discogs `master_url` and `uri` for each match, so it will be easier to get more data in the future. This method wasn't perfect, though, and required some manual polishing afterwards.

In [1]:
import pandas as pd
import requests

### 1.1 Get initial data

In [2]:
#Initialization

df = pd.read_csv('./v2/nasawakeupcalls_no-genre.csv')

for col in ['year','genre', 'style', 'master_url', 'uri']:
    #adds the columns we're going to scrape from discogs
    if col not in df:
        df.insert(len(df.columns),col,None)
    
DISCOGS_URL = 'https://www.discogs.com'
DISCOGS_API = 'https://api.discogs.com/database/search?'

with open('token.txt') as token_file: # Get an API token from discogs and save as token.txt
    DISCOGS_TOKEN = token_file.read().strip() # strip() drops any trailing newline, which would break the auth header

def save_csv(filename):    
    with open(filename,'w') as outfile:
        df.to_csv(outfile, index=False, sep=',')

###  1.2 Clean up data

A visual inspection of the csv file showed that the converter from PDF had left some song names with trailing punctuation or the stray word "Performed". The cell below tidies these up and only needs to be run once.

Some cleaning was also done by hand in Microsoft Excel to consolidate the various duplicate spellings and names of repeated songs, such as Anchors Aweigh (Anchors Away, the Naval Hymn, etc.).

In [None]:
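for row in df.itertuples(): 
    song = row.Song
    if not isinstance(song, str): #to do, work out why some of these are NaN (a float)
        continue
    # str.strip('performed') strips characters, not the word, so remove the word itself
    song = song.replace('Performed','').replace('performed','')
    # row.Index is the row label; the tuple itself is not a valid label for df.at
    df.at[row.Index,'Song'] = song.strip(',- ').title()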
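
One thing to watch when cleaning titles like this: `str.strip` treats its argument as a set of characters to remove from both ends, not as a word. The sketch below shows the difference on made-up titles; `clean_title` is a hypothetical helper for illustration, not part of the original notebook.

```python
import re

def clean_title(song):
    """Hypothetical helper: tidy a single song title."""
    # Remove the whole word "performed" in any case, wherever it appears
    song = re.sub(r'\bperformed\b', '', song, flags=re.IGNORECASE)
    # Then trim stray punctuation and spaces, and title-case the result
    return song.strip(',- ').title()

# strip() removes any of the characters p, e, r, f, o, m, d from both ends,
# so it also eats the start of the title
print(repr('reveille performed'.strip('performed')))  # 'veille '
print(repr(clean_title('Reveille, Performed')))       # 'Reveille'
```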

### 1.3 Getting genre data from Discogs

We're going to search discogs by song title (passed as `release_title`) and, where one is known, by artist, then use the top result to find genres for the track.

In [20]:
def discogs_api(song=None, artist=None):
    '''retrieves the top search result from discogs for a given song &
    artist combination. 
    '''
    header = {'Authorization':'Discogs token='+DISCOGS_TOKEN}
    payload = {'per_page':1,'page':1}
    
    if artist is not None:
        payload.update({'artist':artist})
    if song is not None:
        payload.update({'release_title':song})

    req = requests.get(DISCOGS_API,headers=header,params=payload)
    
    return req.json()['results']

def get_song_info(index):
    '''Searches the discogs database using the data for the
    given index in the data frame
    '''
    row = df.iloc[index]
    song = row['Song']
    artist = row['Artist']
    
    # Doesn't bother searching if song information is missing or bad quality
    if not isinstance(song, str): # missing titles are NaN (a float)
        return
    for u in ["unknown","unidentified","untitled","medley"]:   
        if u in song.lower():
            return
    
    # Stops "Unidentified" (or a missing artist) being handed to the search API
    if not isinstance(artist, str) or artist == "Unidentified":
        artist = None
    
    data = discogs_api(song, artist) #does the search
    
    if len(data) == 0: #no results
        return
    
    else:
        data = data[0]
        
    for key in ['genre','style','year','master_url']:
        if key not in data:
            continue # if the data is missing, leave as None
        elif type(data[key]) != list:
            df.at[index,key] = data[key] #if it's not a list, take the value
        elif len(data[key]) == 0: 
            continue # if it's an empty list, leave as None
        else:
            df.at[index,key] = data[key][0] #if it is a list, take the 1st value
    
    if 'uri' in data: #special case for formatting URI
        df.at[index,'uri'] = DISCOGS_URL + data['uri']
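
The field handling in `get_song_info` can be tried out without calling the API. The sketch below applies the same rule (take a plain value as-is, take the first entry of a list, leave missing or empty fields as `None`) to a single result dictionary. `extract_fields` is a hypothetical helper, and `sample` is made-up data in the shape the code above expects, not real discogs output.

```python
def extract_fields(result, keys=('genre', 'style', 'year', 'master_url')):
    """Hypothetical helper: pick one value per field from a search result."""
    fields = {}
    for key in keys:
        value = result.get(key)  # None if the field is missing
        if isinstance(value, list):
            # first entry of a list, or None when the list is empty
            value = value[0] if value else None
        fields[key] = value
    return fields

# Made-up result for illustration only
sample = {'genre': ['Rock', 'Pop'], 'style': [], 'year': '1969', 'uri': '/release/0'}
print(extract_fields(sample))
```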
        

### 1.4 Do the search!

In [15]:
start = 0 #for restarting at a later point if discogs API throws a fit

for i in range(start,len(df)):
    get_song_info(i)
    prog = (i+1)/len(df)*100 #percentage of rows processed so far
    save_csv('nasawakeupcalls.csv') #this could maybe be more efficient?
    
    if i%10 == 0: #keep us updated
        print(int(prog),'% complete')
        

100 % complete
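
Rather than restarting by hand from `start` when the API rejects a request, the call could be retried after a pause. This is a minimal sketch, assuming a failed request surfaces as an exception inside `get_song_info`; `call_with_retry` and its parameters are hypothetical and not part of the original notebook.

```python
import time

def call_with_retry(func, *args, retries=3, delay=1.0, sleep=time.sleep):
    """Hypothetical helper: call func(*args), pausing and retrying if it raises."""
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            sleep(delay * (attempt + 1))  # wait a little longer each time

# Demonstration with a stand-in function that fails twice before succeeding
attempts = []
def flaky(value):
    attempts.append(value)
    if len(attempts) < 3:
        raise KeyError('results')
    return value

print(call_with_retry(flaky, 'ok', sleep=lambda seconds: None))  # ok
```

In the loop above, this would mean calling `call_with_retry(get_song_info, i)` in place of `get_song_info(i)`.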


## 2. Sorting out the JSON

Takes all the genre data and adds it back into the JSON format. The final version of the csv had _even more_ hand work done on it. For Shuttle missions, if there was no genre data for a song, I found it manually using a discogs search of the artist. I was unable to find a simple automated way of doing this given the time constraints of the job I was doing it for; perhaps this can be revisited in the future.

In [8]:
import json
import pandas as pd
from datetime import datetime 

### 2.1 Get initial data

In [2]:
#gets the metadata and structure from the original json
#loads up the data from the new, cleaned-up and expanded csv file

with open('./v2/nasawakeupcalls_no-genre.json') as f:
    song_json = json.load(f)

import_df = pd.read_csv('nasawakeupcalls.csv')
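songs_df = import_df.astype(object).where(pd.notnull(import_df), None) #Replaces NaNs with None; casting to object first stops numeric columns turning None back into NaN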
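
The reason for swapping NaN for `None` is that the two are written to JSON differently. The sketch below uses only the standard library to show this on a made-up record; `nan_to_none` is a hypothetical helper, not something the notebook defines.

```python
import json
import math

def nan_to_none(value):
    """Hypothetical helper: swap a float NaN for None, leave anything else alone."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

# Made-up record with a missing year
record = {'Song': 'Reveille', 'year': float('nan')}
clean = {key: nan_to_none(value) for key, value in record.items()}

print(json.dumps(record))  # {"Song": "Reveille", "year": NaN}
print(json.dumps(clean))   # {"Song": "Reveille", "year": null}
```

A bare `NaN` is not part of the JSON standard, so stricter parsers may reject the first form.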

### 2.2 Create a structured dictionary
Uses the form ```{'Program title':data}``` where ```data``` is the list of missions for that program, matching the format used in the original json. I fear that this makes sense to me now, but given that it's a hideous mess of nested loops and if statements, it's going to be nonsense if I ever have to return to it. I hope my commenting is good.

In [3]:
programs = [i['Title'] for i in song_json['Programs']]
program_data = {}

#but-it-runs-meme.jpg
for p in programs:
    mission_data = [] # a blank list that will contain the mission data
    missions = list(songs_df.loc[songs_df['Program'] == p].Mission.unique()) #the labels of all the missions
    p_df = songs_df.loc[songs_df['Program'] == p].drop(columns=['Program']) # a data frame of all the info from that program
    for m in missions: 
        m_df = p_df.loc[p_df['Mission'] == m].drop(columns=['Mission']) # a data frame of all the info from that mission
        mission_dictionary = m_df.to_dict('records') # Makes a dictionary of that data
        wakeup_data = [] #a blank list that will contain the data for each day
        for wakeup_song in mission_dictionary:
            key = wakeup_song.pop('Dates')
            if len(wakeup_data) == 0 or key not in wakeup_data[-1]: # first song for a new date
                wakeup_data.append({key:[wakeup_song]}) #a dictionary {date:[{song info}]}
            else: # another song on the same date as the previous row
                wakeup_data[-1][key].append(wakeup_song)
                #a dictionary {date:[{song1 info},{song2 info}]}
        mission_data.append({'Mission':m,"WakeupCalls":wakeup_data}) 
    program_data[p]=mission_data
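
The innermost loop, which collects consecutive rows sharing a date into one entry, could also be written with `itertools.groupby`. This is a sketch of an alternative rather than the code used for the published data; `group_wakeup_calls` is a hypothetical name and the records are made up.

```python
from itertools import groupby

def group_wakeup_calls(records):
    """Hypothetical helper: group consecutive song records that share a 'Dates' value."""
    wakeup_data = []
    for date, songs in groupby(records, key=lambda record: record['Dates']):
        # drop the 'Dates' field from each song, as pop('Dates') does above
        entries = [{k: v for k, v in song.items() if k != 'Dates'} for song in songs]
        wakeup_data.append({date: entries})
    return wakeup_data

# Made-up records for illustration
records = [
    {'Dates': 'day 1', 'Song': 'A'},
    {'Dates': 'day 1', 'Song': 'B'},
    {'Dates': 'day 2', 'Song': 'C'},
]
print(group_wakeup_calls(records))
```

Like the loop above, this only merges rows that are next to each other, so it relies on the csv being in date order within each mission.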

### 2.3 Add the data to the JSON

In [4]:
for program in song_json['Programs']:
    program_name = program['Title']
    program['Missions'] = program_data[program_name]

### 2.4 Update the Metadata and save

In [5]:
song_json['Metadata'][0]['UpdatedBy'] = 'Ross Spencer & Matthew Allinson'
song_json['Metadata'][0]['LastUpdatedDate'] = datetime.now().strftime('%Y-%m-%d')

In [7]:
with open('nasawakeupcalls.json', 'w') as outfile:
    json.dump(song_json, outfile, sort_keys=True, indent=4, separators=(",", ": "))
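
As a quick sanity check on the saved structure, the songs in the nested JSON can be counted per program and compared with the row counts in the csv. This is a minimal sketch assuming the `Programs` / `Missions` / `WakeupCalls` layout built in section 2.2; `count_songs` is a hypothetical helper and the sample data is made up.

```python
def count_songs(song_json):
    """Hypothetical helper: count song entries per program in the nested structure."""
    counts = {}
    for program in song_json['Programs']:
        total = 0
        for mission in program['Missions']:
            for day in mission['WakeupCalls']:   # each day is {date: [songs]}
                for songs in day.values():
                    total += len(songs)
        counts[program['Title']] = total
    return counts

# Made-up structure for illustration
sample = {'Programs': [{'Title': 'Example Program', 'Missions': [
    {'Mission': 'Example 1', 'WakeupCalls': [
        {'day 1': [{'Song': 'A'}, {'Song': 'B'}]},
        {'day 2': [{'Song': 'C'}]},
    ]},
]}]}
print(count_songs(sample))  # {'Example Program': 3}
```

The result for the real data could then be compared against `songs_df.groupby('Program').size()`.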