# Making Sense of Data: Genres + Tracks
Dear Diary,
It is October now. We've had some cold rain.

The purpose of this notebook is to start reading some tracks (now that the UCI dataset is partially back).


In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
import numpy as np
import pandas as pd

# change filepath if running on another machine, this is local to mine
tracks = pd.read_csv("/Users/mkarroqe/Desktop/github/dancing-screen/fma_metadata/raw_tracks.csv")
tracks.head()

Unnamed: 0,track_id,album_id,album_title,album_url,artist_id,artist_name,artist_url,artist_website,license_image_file,license_image_file_large,...,track_information,track_instrumental,track_interest,track_language_code,track_listens,track_lyricist,track_number,track_publisher,track_title,track_url
0,2,1.0,AWOL - A Way Of Life,http://freemusicarchive.org/music/AWOL/AWOL_-_...,1,AWOL,http://freemusicarchive.org/music/AWOL/,http://www.AzillionRecords.blogspot.com,http://i.creativecommons.org/l/by-nc-sa/3.0/us...,http://fma-files.s3.amazonaws.com/resources/im...,...,,0,4656,en,1293,,3,,Food,http://freemusicarchive.org/music/AWOL/AWOL_-_...
1,3,1.0,AWOL - A Way Of Life,http://freemusicarchive.org/music/AWOL/AWOL_-_...,1,AWOL,http://freemusicarchive.org/music/AWOL/,http://www.AzillionRecords.blogspot.com,http://i.creativecommons.org/l/by-nc-sa/3.0/us...,http://fma-files.s3.amazonaws.com/resources/im...,...,,0,1470,en,514,,4,,Electric Ave,http://freemusicarchive.org/music/AWOL/AWOL_-_...
2,5,1.0,AWOL - A Way Of Life,http://freemusicarchive.org/music/AWOL/AWOL_-_...,1,AWOL,http://freemusicarchive.org/music/AWOL/,http://www.AzillionRecords.blogspot.com,http://i.creativecommons.org/l/by-nc-sa/3.0/us...,http://fma-files.s3.amazonaws.com/resources/im...,...,,0,1933,en,1151,,6,,This World,http://freemusicarchive.org/music/AWOL/AWOL_-_...
3,10,6.0,Constant Hitmaker,http://freemusicarchive.org/music/Kurt_Vile/Co...,6,Kurt Vile,http://freemusicarchive.org/music/Kurt_Vile/,http://kurtvile.com,http://i.creativecommons.org/l/by-nc-nd/3.0/88...,http://fma-files.s3.amazonaws.com/resources/im...,...,,0,54881,en,50135,,1,,Freeway,http://freemusicarchive.org/music/Kurt_Vile/Co...
4,20,4.0,Niris,http://freemusicarchive.org/music/Chris_and_Ni...,4,Nicky Cook,http://freemusicarchive.org/music/Chris_and_Ni...,,http://i.creativecommons.org/l/by-nc-nd/3.0/88...,http://fma-files.s3.amazonaws.com/resources/im...,...,,0,978,en,361,,3,,Spiritual Level,http://freemusicarchive.org/music/Chris_and_Ni...


Let's confirm that we have access to an audio file now before we play with using it in a dataset:

In [5]:
url = tracks['track_url'][0]
trackid = tracks['track_id'][0]
print(trackid, url)

2 http://freemusicarchive.org/music/AWOL/AWOL_-_A_Way_Of_Life/Food


Small problem: it appears that in the dataframe, urls are saved as:

    http://freemusicarchive.org/music/AWOL/AWOL_-_A_Way_Of_Life/Food
    
However, the audio files cannot be downloaded at this link, as it leads to the following landing page:

> ![](images/fma-working.png)
    
Which requires an additional click to get to this downloadable `.mp3` file:

    https://files.freemusicarchive.org/storage-freemusicarchive-org/music/WFMU/AWOL/AWOL_-_A_Way_Of_Life/AWOL_-_03_-_Food.mp3

# Getting Correct Track URL

Let's see if there's a way to derive the downloadable url from the one in the dataframe..
http://freemusicarchive.org/music/AWOL/AWOL_-_A_Way_Of_Life/Food

https://files.freemusicarchive.org/storage-freemusicarchive-org/music/WFMU/AWOL/AWOL_-_A_Way_Of_Life/AWOL_-_03_-_Food.mp3

Fortunately, the curator info can be found on the webpage, and in turn, in its source code:

| Webpage | Source Code |
| - | - |
| ![](images/fma-source-pic.png) | ![](images/fma-source-code.png) |

So let's go ahead and scrape it:

In [39]:
from lxml import html
import requests

def get_curator(source_url):
    # given the source url, pulls curator name from html source code
    # https://docs.python-guide.org/scenarios/scrape/
    # https://www.w3schools.com/xml/xpath_syntax.asp
    curator = 'not found yet'
    page = requests.get(source_url)
    tree = html.fromstring(page.content)
    
    all_elems = tree.xpath('//*/text()')
    for i in range(0, len(all_elems)):
        if all_elems[i] == "Curators":
            curator = all_elems[i + 3]
            break

    return curator

source_url = tracks["track_url"][0]
get_curator(source_url)

'WFMU'

In [56]:
def get_url(row):
    BASE = "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/"
    source_url = tracks["track_url"][row]    
    curator = get_curator(source_url) + '/'
    
    artist_name = tracks["artist_name"][row].replace(' ', '_')
    album_title = tracks["album_title"][row].replace(' ', '_')
    
    track_number = tracks["track_number"][row]
    if track_number != 0:
        track_number = str(track_number).zfill(2)
    
    track_title = tracks["track_title"][row].replace(' ', '_')
    
    pad = "_-_"
    track_string = artist_name + pad + track_number + pad + track_title
    
    # BASE/curator/artist_name/album_title/(artist_name - track_number - track_title) + .mp3 ?
    url = BASE + curator + artist_name + '/' + album_title + '/' + track_string + ".mp3"
    return url

Let's test if my parsing was correct:

In [41]:
url = get_url(0)
goal = "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/WFMU/AWOL/AWOL_-_A_Way_Of_Life/AWOL_-_03_-_Food.mp3"

if url == goal:
    print(url, "is TRUE")
else:
    print("Output:", url)
    print("Expected:", goal)

https://files.freemusicarchive.org/storage-freemusicarchive-org/music/WFMU/AWOL/AWOL_-_A_Way_Of_Life/AWOL_-_03_-_Food.mp3 is TRUE


# Robustness of `get_curator()`
Huzzah! Now let's test its robustness with other URLs..

In [57]:
urls = []
for i in range(0, 25):
    url = get_url(i)
    urls.append(url)

bad_urls = 0
for i in range(0, len(urls)):
    status = url_ok(urls[i])
    if status is False:
        print(i, urls[i])
        print("CORRECT:", tracks["track_url"][i], '\n')
        bad_urls += 1
        
print(bad_urls, '/', len(urls), "bad urls")

4 https://files.freemusicarchive.org/storage-freemusicarchive-org/music/not found yet/Nicky_Cook/Niris/Nicky_Cook_-_03_-_Spiritual_Level.mp3
CORRECT: http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Spiritual_Level 

5 https://files.freemusicarchive.org/storage-freemusicarchive-org/music/not found yet/Nicky_Cook/Niris/Nicky_Cook_-_04_-_Where_is_your_Love?.mp3
CORRECT: http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Where_is_your_Love 

6 https://files.freemusicarchive.org/storage-freemusicarchive-org/music/not found yet/Nicky_Cook/Niris/Nicky_Cook_-_05_-_Too_Happy.mp3
CORRECT: http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Too_Happy 

7 https://files.freemusicarchive.org/storage-freemusicarchive-org/music/not found yet/Nicky_Cook/Niris/Nicky_Cook_-_08_-_Yosemite.mp3
CORRECT: http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Yosemite 

8 https://files.freemusicarchive.org/storage-freemusicarchive-org/music/not found yet/Nicky

### Okay I've identified two problems here:
1. some songs are still not live due to the Tribe of Noise/FMA merger
2. some urls follow different path naming conventions (grr)

After some more source code digging however, I think I may be able to pull the download urls just through the webpage... 

In [131]:
def url_ok(url):
    if url == 'nan':
        return False
    
    r = requests.head(url)
    valid_codes = [200, 301]
    return (r.status_code in valid_codes)

In [113]:
def get_dl_url(row):
    # given the track row, pulls download url from html source code
    source_url = tracks["track_url"][row] 
    status = url_ok(source_url)
    
    if status is False:
        return None
    else:
        page = requests.get(source_url)
        tree = html.fromstring(page.content)

        all_elems = tree.xpath('//span[@class="playicn"]/a[1]/@href')
        if len(all_elems) == 0:
            return all_elems
        return all_elems[0]

source_url = tracks["track_url"][0]
print(source_url)
print(get_dl_url(0))

http://freemusicarchive.org/music/AWOL/AWOL_-_A_Way_Of_Life/Food
https://files.freemusicarchive.org/storage-freemusicarchive-org/music/WFMU/AWOL/AWOL_-_A_Way_Of_Life/AWOL_-_03_-_Food.mp3


### yay! Now for that same check..

In [137]:
def gen_urls(start, end):
# returns list of valid urls and track_ids of 404 pages
    urls = []
    bad = []
    for i in range(start, end):
        if i % 50 == 0:
            print(str(i).zfill(3), "/", end)
            
        url = get_dl_url(i)
        if len(url) != 0:
            urls.append(url)
        else:
            bad.append(i)
    return urls, bad

In [134]:
urls, bad = gen_urls(0, 25)
print(len(urls), "/", end, "valid urls")

0 / 25
20 / 25 valid urls


Let's examine the `bad` urls and confirm that the reason they were invalid was that the FMA site hadn't uploaded them yet:

In [128]:
for url in bad:
    print(tracks["track_url"][url])

http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Spiritual_Level
http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Where_is_your_Love
http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Too_Happy
http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Yosemite
http://freemusicarchive.org/music/Chris_and_Nicky_Andrews/Niris/Light_of_Light


It's great to see this message when you're expecting to :-)

<img src="images/fma-bummer.png" width=200px />

# Creating Data Subset
Now the next step is to create our full subset of downloadable mp3 files along with relevant attributes. My first gut feeling is to throw them into a dataframe, so let's start there (and then probably pickle them)

In [138]:
# making a copy of tracks and will edit it
track_attr = tracks
urls, bad = gen_urls(0, 500)
print(len(urls), "/", end, "valid urls")

000 / 500
050 / 500
100 / 500
150 / 500
200 / 500
250 / 500
300 / 500
350 / 500


MissingSchema: Invalid URL 'nan': No schema supplied. Perhaps you meant http://nan?

In [None]:
# import urllib.request
dest = "/audio_files"

# opener=urllib.request.build_opener()
# # SO hero: https://stackoverflow.com/a/36663971 for 403 Error
# opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
# urllib.request.install_opener(opener)
# urllib.request.urlretrieve(url, dest)

Troubleshooting the following error:
```PermissionError: [Errno 13] Permission denied: '/audio_files'```

1. Change permissions `audio_files` directory:
  - `sudo chmod g+wx audio_files`
  - This did not automatically work. I will try restarting the notebook.
  - Restarting did not solve the problem.
  
2. Let's try using the `requests` library instead:

***Update: both now work after running `sudo jupyter notebook` !*** Sticking with `requests`


In [None]:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36'}
print(url)
r = requests.get(url)
# r.content

In [None]:
# The below code will download the media at url and save it to dest
open(dest, 'wb').write(r.content)

Yay that worked! The file was saved into the notebooks directory as `dest.mp3`. Let's fix where we're downloading:

In [None]:
import os
from os import path

dest = "/audio_files2"
if not os.path.exists(dest):
    os.makedirs(dest)
open(dest, 'wb').write(r.content)

In [None]:
# some useful code for faster parsing (https://www.codementor.io/aviaryan/downloading-files-from-urls-in-python-77q3bs0un)
# without downloading the file every time
def is_downloadable(url):
    """
    Does the url contain a downloadable resource
    """
    h = requests.head(url, allow_redirects=True)
    header = h.headers
    content_type = header.get('content-type')
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True

### Audio File Resources:
- To convert .mp3 to .wav. pydub (https://github.com/jiaaro/pydub)

- To read .wav files to numpy array soundfile (https://pypi.python.org/pypi/SoundFile/0.8.1)

- To play numpy arrays sounddevice (https://pypi.python.org/pypi/sounddevice)

- Numpy has some goodies too (DFT, correlation, convolution etc.)