# Data Collection Pipeline
    

## 1. Spotify API 

The motivation for using the Spotify API is that its large database of folk songs often includes matched lyrics.


**Note**: The Spotify API authorisation requires client credentials associated with a unique Spotify user. These must be manually set for the code to run.


In [1]:
import os
client_id = os.getenv('SPOTIFY_CLIENT_ID')
client_secret = os.getenv('SPOTIFY_CLIENT_SECRET')


Each API call requires an *access token*, which expires after a certain amount of time has passed. The following code can be re-run once it has expired to generate a new access token. 

### Get Access Token:

In [2]:
import requests

data = {
    'grant_type': 'client_credentials',
    'client_id': client_id,
    'client_secret': client_secret,
}

response = requests.post('https://accounts.spotify.com/api/token', data=data)
response = response.json()
access_token = response['access_token']
print(access_token)


BQDkHV-jcrGqak-KqLYBuOK6EBDfabbgiDr61lRxrGlIDTdehsqfSUx9o5R5geJWp1ZZrnR4nIDglHUZlqmTQcxoMZjuF4WHg8OSXgHdBjk-vEPInbA
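Rather than re-running the cell by hand, the refresh could be automated. The following is a minimal sketch, assuming the token response also carries an `expires_in` field (in seconds) next to `access_token`. `TokenCache`, `fetch_token`, `margin_seconds` and `clock` are hypothetical names introduced for illustration and are not part of the Spotify API.

```python
import time

class TokenCache:
    """Caches an access token and fetches a new one shortly before it expires."""

    def __init__(self, fetch_token, margin_seconds=60, clock=time.monotonic):
        # fetch_token: hypothetical callable returning a dict with
        # the keys 'access_token' and 'expires_in' (seconds)
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            payload = self._fetch_token()
            self._token = payload['access_token']
            # refresh slightly early so a nearly expired token is never used
            self._expires_at = now + payload['expires_in'] - self._margin
        return self._token
```

With the request above, this could be used as `cache = TokenCache(lambda: requests.post('https://accounts.spotify.com/api/token', data=data).json())`, calling `cache.get()` wherever `access_token` is needed.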


To find Swiss folk songs that meet our criteria, we require that a song appears in multiple playlists returned for a search query such as "schweizer volkslieder".

We therefore retrieve the playlists matching this and similar queries and collect the unique IDs and names of the songs they contain. The song IDs will be used by Syrics (https://github.com/akashrchandran/syrics), a third-party module, to fetch the lyrics.

### Get playlist IDs:

In [5]:

authorization = 'Bearer ' + access_token
headers = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': authorization,
}

# Query strings 'q' used to search for playlists; extend this list with further variants as needed
queries = ['schweizer volksmusik', 'schweizer volkslieder', 'schwiizer volkslieder', 'mundart volkslieder']
all_playlist_ids = set()
for q in queries:
    params = {
        'q': q,
        'type': 'playlist',
        'market': 'CH',
        'limit': '50',
        'offset': '0',
    }
    result = requests.get('https://api.spotify.com/v1/search', params=params, headers=headers)
    result = result.json()
    playlists = result['playlists']['items']
    # collect the ID of every playlist returned, skipping empty entries
    all_playlist_ids.update(pl['id'] for pl in playlists if pl)
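The search above requests a single page of up to 50 results per query (`limit` 50, `offset` 0). If more playlists are wanted, the `offset` parameter can be advanced page by page. The sketch below shows that loop in isolation. `collect_all_items` and `fetch_page` are hypothetical names, and `max_items` is an arbitrary safety cap rather than a documented limit.

```python
def collect_all_items(fetch_page, limit=50, max_items=1000):
    """Collects items page by page until a short page signals the end.

    fetch_page(offset, limit) is a hypothetical callable that returns
    the list of items for one page of results.
    """
    items = []
    offset = 0
    while offset < max_items:
        page = fetch_page(offset, limit)
        # skip empty (None) entries defensively
        items.extend(item for item in page if item is not None)
        if len(page) < limit:
            break  # fewer items than requested: this was the last page
        offset += limit
    return items
```

Here `fetch_page` would wrap the `requests.get` call to `https://api.spotify.com/v1/search`, pass `offset` and `limit` in `params`, and return `result['playlists']['items']`.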


Once we have the playlist IDs, each playlist will be searched for songs.

### Get track IDs:

In [6]:
from collections import defaultdict

track_counts = defaultdict(int)
track_id2name = defaultdict(str)
track_name2id = defaultdict(str)
for playlist_id in all_playlist_ids:
    playlist_params = {
        'playlist_id': playlist_id,
        'additional_types': 'track',
        'market': 'CH',
    }

    playlist_json = requests.get('https://api.spotify.com/v1/playlists/' + playlist_id, params=playlist_params, headers=headers)
    playlist_json = playlist_json.json()
    tracks = playlist_json['tracks']['items']
    # only save track IDs and names that occur in more than one playlist:
    for item in tracks:
        track = item['track']
        if track is None or track['id'] is None:  # skip entries without a usable track ID
            continue
        # count this occurrence first, so that the second playlist already qualifies the track
        track_counts[track['id']] += 1
        if track_counts[track['id']] >= 2:
            track_id2name[track['id']] = track['name']
            track_name2id[track['name']] = track['id']
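The rule of keeping only tracks that occur in more than one playlist can also be written as a small function that is easy to test without calling the API. This is a sketch under the assumption that the playlist contents have already been fetched into a dictionary. `tracks_in_multiple_playlists` and `min_playlists` are hypothetical names.

```python
from collections import Counter

def tracks_in_multiple_playlists(playlists, min_playlists=2):
    """Returns {track_id: track_name} for tracks found in enough playlists.

    playlists maps a playlist ID to a list of (track_id, track_name) pairs.
    """
    counts = Counter()
    names = {}
    for tracks in playlists.values():
        # a set ensures a track repeated within one playlist is counted once
        for track_id, name in set(tracks):
            counts[track_id] += 1
            names[track_id] = name
    return {tid: names[tid] for tid, n in counts.items() if n >= min_playlists}
```

Unlike a plain running count, this version does not let a track that appears twice in the same playlist pass the threshold on its own.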


The track IDs are passed as arguments to the Spotify lyrics module, Syrics.
This module requires the 'sp_dc' value of your Spotify browser session. It can be read from the browser's developer tools on an open Spotify page where the user is signed in.


In [7]:
from syrics.api import Spotify
# read the session cookie from the environment rather than hard-coding it;
# the variable name SPOTIFY_SP_DC is an arbitrary choice and must be set manually
sp_dc = os.getenv('SPOTIFY_SP_DC')
sp = Spotify(sp_dc)

The lyrics of each track are saved as a '.txt' file whose name consists of the prefix 'lyrics', a unique file ID and the track's name.

A simple language identification check is done to ensure that:

1) Folk songs from other languages are not included

2) Standard German folk songs, with no non-standard dialect words, are excluded. The assumption here is that the language detector will misidentify many Swiss German words as belonging to other languages, so a song in which every line is detected as German is treated as Standard German.

For songs where no lyrics were found, the song names will provide a useful starting point for a subsequent manual search to expand the dataset.
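The two checks above can be expressed as a single decision on the list of language codes detected for the lines of a song. The following is a minimal sketch of that decision only. `keep_song` and `target` are hypothetical names, and the language codes are assumed to be the ISO 639-1 strings that langdetect returns.

```python
from collections import Counter

def keep_song(line_langs, target='de'):
    """Decides whether a song passes the filter, given one language code per line."""
    if not line_langs:
        return False  # no line could be classified
    majority, _ = Counter(line_langs).most_common(1)[0]
    if majority != target:
        return False  # check 1: the song is mostly in another language
    if set(line_langs) == {target}:
        return False  # check 2: every line looks like Standard German
    return True
```

A song is therefore kept only when German is the most frequent label but at least one line was assigned to a different language, which is used here as a rough signal for dialect.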

### Get desired lyrics and save to files:

In [None]:
from langdetect import detect, LangDetectException

destination_folder = 'data/' # rename the folder as desired

def most_common(lst):
    return max(set(lst), key=lst.count)

langs_detected = set()
file_id = 0
for track_id, trackname in track_id2name.items():
    filename = trackname.replace(" ", "_")
    response = sp.get_lyrics(track_id)
    if not response:
        # no lyrics found: keep the title for a subsequent manual search
        with open(destination_folder+'moretitles.txt', 'a', encoding='utf-8') as titlesfile:
            titlesfile.write(trackname+"\n")
        continue
    file_id += 1
    lines = response['lyrics']['lines']
    words_per_line = [dict(line)['words'] for line in lines]
    # detect the language of each line; detect() is called once per line
    # because langdetect is not deterministic by default
    langs = []
    for l in words_per_line:
        try:
            langs.append(detect(l))
        except LangDetectException:
            continue
    langs_detected.update(langs)
    if not langs:                   # no line could be classified
        continue
    language = most_common(langs)
    if language != 'de':            # this removes songs in other languages (e.g. Spanish-language folk songs)
        continue
    elif len(set(langs)) == 1:      # this removes songs entirely in Standard German.
        continue
    # save the file with the title on the first line and an empty line before the full lyrics of the song
    with open(destination_folder+'lyrics_'+str(file_id)+'_'+filename+'.txt', 'w', encoding='utf-8') as newfile:
        newfile.write(trackname+"\n"+"\n")
        for l in words_per_line:
            newfile.write(l+"\n")
print(langs_detected)

## 2. Im Röseligarte

The second resource is a historical volume of folk songs known as 'Im Röseligarte'. I found a folder with a PDF corresponding to each song in this collection:
https://www.sins942.ch/noten_im_roeseligarte.html

These PDFs were downloaded into a folder entitled 'Im_Roeseligarte'. I used the PdfReader class from PyPDF2 (https://pypdf2.readthedocs.io/en/3.0.0/modules/PdfReader.html)
to scrape the text from the PDFs. The non-lyric artefacts were fortunately easy to remove.
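One of the artefacts in the extracted text is the verse numbering ("1.", "2." and so on). The following is a minimal sketch of removing it with a regular expression, assuming the numbers appear at the start of a line. `strip_verse_numbers` is a hypothetical helper and the sample text is a placeholder, not taken from the collection.

```python
import re

# a verse number: optional indentation, digits, a literal dot, optional spaces
VERSE_NUMBER = re.compile(r"(?m)^[ \t]*\d+\.[ \t]*")

def strip_verse_numbers(text):
    """Removes leading verse numbers such as '1.' or '12.' from each line."""
    return VERSE_NUMBER.sub("", text)

sample = "1. first line\nsecond line\n2. third line"
cleaned = strip_verse_numbers(sample)
```

The dot is escaped so that it matches only a literal full stop. An unescaped `.` would also consume whatever character follows the digits.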


In [None]:
from PyPDF2 import PdfReader
import os
import re
folder = r'Im_Roeseligarte/'

for file_name in sorted(os.listdir(folder)):
    reader = PdfReader(folder+file_name)
    # continue the file numbering from the Spotify step (file_id) and
    # drop the part of the PDF name before the first underscore
    newfilename = 'lyrics_'+str(file_id)+'_'+file_name[file_name.find("_")+1:]
    print(newfilename)
    with open(newfilename+'.txt', 'w') as lyricsfile:
        for page in reader.pages:
            text = page.extract_text()
            # remove page numbering and authorship text
            title = text[text.find("2016")+4:]
            lyrics = text[text.find("1."):text.find("Paul")]
            # remove verse numbers such as "1." (the dot is escaped to match a literal full stop)
            lyrics = re.sub(r"[0-9]+\.", "", lyrics)
            # write to file
            lyricsfile.write(title+"\n"+"\n")
            lyricsfile.write(lyrics)
    file_id += 1
    if file_id >154: # subsequent handful of PDFs were in a different format and had to be manually copy/pasted 
        break

## 3. Schweizerische Volkslieder - Tobel, 1882


## 4. Other Internet resources



In [None]:
import requests
from bs4 import BeautifulSoup

def getdata(url):
    r = requests.get(url)
    return r.content

def get_lyrics_muster(lyric_page):
    i = 400  # file IDs for this source start at 400
    html_data = getdata(lyric_page)
    page_soup = BeautifulSoup(html_data, "html.parser")
    tables = page_soup.find_all('table')
    texts = []
    # save the text of each table on the page to its own lyrics file
    for table in tables:
        table_text = table.get_text()
        texts.append(table_text)
        newfilename = 'lyrics_'+str(i)
        with open('Edimuster/'+newfilename+'.txt', 'w') as lyricsfile:
            lyricsfile.write(table_text)
        i += 1
    return texts

website = "http://www.edimuster.ch/baernduetsch/lied.htm"
get_lyrics_muster(website)

In [None]:

def get_links_swissmom(website_link):
    html_data = getdata(website_link)
    soup = BeautifulSoup(html_data, "html.parser")
    list_links = []
    for link in soup.find_all("a", href=True):
        if str(link["href"]).startswith("/lieder/"):
            link_with_www = 'https://chindermusigwaelt.swissmom.ch/' + link["href"][1:]
            list_links.append(link_with_www)
    return list_links

website = "https://chindermusigwaelt.swissmom.ch/liedertexte/volkslieder/"

links = get_links_swissmom(website)

def get_lyrics_swissmom(lyric_page):
    html_data = getdata(lyric_page)
    page_soup = BeautifulSoup(html_data, "html.parser")
    h2_tag = page_soup.find('h2')
    if h2_tag:
        # the lyrics are the paragraphs between the title (h1) and the first h2
        title_tag = h2_tag.find_previous_sibling('h1')
        p_tags = h2_tag.find_previous_siblings('p')
        # previous siblings are returned nearest first, so restore document order
        p_tags = reversed(p_tags)
    else:
        title_tag = page_soup.find('h1')
        p_tags = page_soup.find_all('p')
    title = title_tag.text if title_tag else ''
    lyrics = [title, "\n"]
    for p in p_tags:
        text = p.get_text(separator = '\n', strip = True)
        lines = text.split('\n')
        lyrics.extend(lines)
    return lyrics

i = 302
for link in links:
    lyrics = get_lyrics_swissmom(link)
    title = lyrics[0]
    newfilename = 'lyrics_'+str(i)+'_'+ title.replace(" ","_")
    print(newfilename)
    with open('Chindermusig/'+newfilename+'.txt', 'w') as lyricsfile:
        # title on the first line, then an empty line, then the lyrics
        # (lyrics[0] and lyrics[1] hold the title and separator, so skip them here)
        lyricsfile.write(title+"\n"+"\n")
        for l in lyrics[2:]:
            lyricsfile.write(l+"\n")
    i += 1
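Several steps in this pipeline build file names directly from song titles by replacing spaces with underscores. Titles that contain characters such as `/` or `?` can make `open` fail or write to an unintended path. The sketch below shows one possible way to make titles safe. `safe_filename` and `max_length` are hypothetical names, and the set of removed characters is an assumption about which characters are problematic on common file systems.

```python
import re
import unicodedata

def safe_filename(title, max_length=80):
    """Turns a song title into a file-system-friendly name."""
    # normalise Unicode so that visually identical titles map to the same name
    name = unicodedata.normalize("NFC", title).strip()
    # replace each run of whitespace with a single underscore
    name = re.sub(r"\s+", "_", name)
    # drop characters that are invalid in file names on common systems
    name = re.sub(r'[\\/:*?"<>|]', "", name)
    # fall back to a placeholder if nothing is left
    return name[:max_length] or "untitled"
```

It could replace the `title.replace(" ", "_")` calls, for example `'lyrics_' + str(i) + '_' + safe_filename(title)`.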
