# Scraping Spotify Data for All Songs from All Albums by a Producer

## End Goal:

Take every page in Wikipedia's [Category:Albums by producer](https://en.wikipedia.org/wiki/Category:Albums_by_producer), scrape the album info from each, and then fetch audio features for every song from Spotify's API.


### Goal for Today (March 28, 2019):

Take a page like Wikipedia's [Category:Albums produced by Rick Rubin](https://en.wikipedia.org/wiki/Category:Albums_produced_by_Rick_Rubin) and return Spotify audio features for every song on every album in the list.

In [1]:
# Standard Imports

import numpy as np
import pandas as pd
import os
import sys
from collections import defaultdict
from importlib import reload
from bs4 import BeautifulSoup
import requests

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'


# Load MongoDB

from pymongo import MongoClient
client = MongoClient()
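# Access/Initiate Database
db = client['producer_db']
# Access/Initiate Collection (MongoDB's equivalent of a table)
tab = db['songs']
# db.tab would open a collection literally named "tab", so reuse the handle above
collection = tab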

# Authorize Spotify API

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

client_id = os.environ['SPOTIFY_CLIENT_ID']
client_secret = os.environ['SPOTIFY_CLIENT_SECRET']
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [31]:
def get_category_links(wiki_url):
    """
    Takes the URL of a Wikipedia category page and returns a list of full URLs,
    one per page or subcategory linked from it (following "next page" links).
    """
    
    wiki_urls = []
    domain = 'https://en.wikipedia.org'
    
    html = requests.get(wiki_url).content
    soup = BeautifulSoup(html, 'html.parser')
    
    wiki_links = soup.find_all('div', class_="mw-category")[0].find_all('a')
    for link in wiki_links:
        path = link['href']
        url = domain + path
        wiki_urls.append(url)
        
    # check for a "next page" link
    next_page_url = ''
    # pages without subcategories (e.g. a single producer's albums) have no such div
    subcats = soup.find('div', {'id': "mw-subcategories"})
    next_page_links = subcats.find_all('a')[:5] if subcats else []  # next page link will be within the first 5 links
    for link in next_page_links:
        if link.text == 'next page':
            next_page_path = link['href']
            next_page_url = domain + next_page_path
        
    # Append links from next pages recursively
    if next_page_url: 
        print('getting links from {}'.format(next_page_url))
        next_page_wiki_urls = get_category_links(next_page_url)
        wiki_urls = wiki_urls + next_page_wiki_urls
    
    
    return wiki_urls
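
The recursion in `get_category_links` adds one stack frame per category page. A minimal sketch of the same "follow the next page link" idea written as a loop is below. `collect_paginated` and its `fetch_page` argument are hypothetical names introduced here for illustration, not part of the notebook: `fetch_page` is assumed to take a URL and return `(links, next_url)`, with an empty `next_url` on the last page. The demo uses a fake in-memory category so it runs without network access.

```python
def collect_paginated(first_url, fetch_page, max_pages=100):
    """Follow "next page" links in a loop and gather every link.

    fetch_page is a hypothetical callable: url -> (links, next_url),
    where next_url is '' or None once there are no more pages.
    """
    all_links = []
    seen = set()  # guards against a page that links back to an earlier one
    url = first_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        links, url = fetch_page(url)
        all_links.extend(links)
    return all_links


# Offline demo: a fake three-page category keyed by page name
fake_pages = {
    'p1': (['a', 'b'], 'p2'),
    'p2': (['c'], 'p3'),
    'p3': (['d', 'e'], ''),
}
demo_links = collect_paginated('p1', fake_pages.get)
print(demo_links)
```

In the notebook, `fetch_page` could wrap the `requests` and `BeautifulSoup` steps already used in `get_category_links`; the `max_pages` cap is an assumed safety limit, not a Wikipedia constraint.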

In [32]:
wiki_url2 = 'https://en.wikipedia.org/wiki/Category:Albums_by_producer'
links = get_category_links(wiki_url2)
len(links)

getting links from https://en.wikipedia.org/w/index.php?title=Category:Albums_by_producer&subcatfrom=Bellotte%2C+Pete%0AAlbums+produced+by+Pete+Bellotte#mw-subcategories
getting links from https://en.wikipedia.org/w/index.php?title=Category:Albums_by_producer&subcatfrom=Chess%2C+Leonard%0AAlbums+produced+by+Leonard+Chess#mw-subcategories
getting links from https://en.wikipedia.org/w/index.php?title=Category:Albums_by_producer&subcatfrom=Dj+Premier%0AAlbums+produced+by+DJ+Premier#mw-subcategories
getting links from https://en.wikipedia.org/w/index.php?title=Category:Albums_by_producer&subcatfrom=Frayne%2C+George%0AAlbums+produced+by+George+Frayne#mw-subcategories
getting links from https://en.wikipedia.org/w/index.php?title=Category:Albums_by_producer&subcatfrom=Hayton%2C+Lennie%0AAlbums+produced+by+Lennie+Hayton#mw-subcategories
getting links from https://en.wikipedia.org/w/index.php?title=Category:Albums_by_producer&subcatfrom=Juicy+J%0AAlbums+produced+by+Juicy+J#mw-subcategories
gett

2581

In [33]:
links[:10]

['https://en.wikipedia.org/wiki/Category:Albums_produced_by_4th_Disciple',
 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_8Ball_%26_MJG',
 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_9th_Wonder',
 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_The_45_King',
 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_88-Keys',
 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_Kenny_Aaronson',
 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_Jim_Abbiss',
 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_Dave_Abbruzzese',
 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_Daniel_Abraham_(record_producer)',
 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_Josh_Abraham']
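
Each of these URLs encodes the producer's name, which would be handy as a field when storing songs later. A small sketch of recovering it with the standard library is below. The helper name `producer_from_category_url` is an assumption made for illustration, and it relies on the `Category:Albums_produced_by_` prefix seen in the output above.

```python
from urllib.parse import unquote, urlparse

PREFIX = 'Category:Albums_produced_by_'


def producer_from_category_url(url):
    """Return the producer name encoded in a category URL, or None."""
    # Last path segment, with percent-escapes such as %26 decoded
    page = unquote(urlparse(url).path.rsplit('/', 1)[-1])
    if not page.startswith(PREFIX):
        return None
    return page[len(PREFIX):].replace('_', ' ')


print(producer_from_category_url(
    'https://en.wikipedia.org/wiki/Category:Albums_produced_by_8Ball_%26_MJG'))
# → 8Ball & MJG
```

Disambiguators are kept as-is, so the Daniel Abraham link yields `Daniel Abraham (record producer)`.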

In [16]:
wiki_url = 'https://en.wikipedia.org/w/index.php?title=Category:Albums_by_producer&subcatfrom=Bellotte%2C+Pete%0AAlbums+produced+by+Pete+Bellotte#mw-subcategories'
wiki_urls = []
domain = 'https://en.wikipedia.org'

html = requests.get(wiki_url).content
soup = BeautifulSoup(html, 'html.parser')

wiki_links = soup.find_all('div', {'id':"mw-subcategories"})[0].find_all('a')[:5]

for link in wiki_links:
    if link.text == 'next page':
        print(link['href'], link.text)

/w/index.php?title=Category:Albums_by_producer&subcatfrom=Chess%2C+Leonard%0AAlbums+produced+by+Leonard+Chess#mw-subcategories next page


In [6]:
wiki_url = 'https://en.wikipedia.org/wiki/Category:Albums_produced_by_Rick_Rubin'
rr_albums = get_category_links(wiki_url)

In [7]:
rr_albums

['https://en.wikipedia.org/wiki/12_Songs_(Neil_Diamond_album)',
 'https://en.wikipedia.org/wiki/13_(Black_Sabbath_album)',
 'https://en.wikipedia.org/wiki/21_(Adele_album)',
 'https://en.wikipedia.org/wiki/All_World_2',
 'https://en.wikipedia.org/wiki/All_World:_Greatest_Hits',
 'https://en.wikipedia.org/wiki/American_IV:_The_Man_Comes_Around',
 'https://en.wikipedia.org/wiki/American_Grafishy',
 'https://en.wikipedia.org/wiki/American_III:_Solitary_Man',
 'https://en.wikipedia.org/wiki/American_Recordings_(album)',
 'https://en.wikipedia.org/wiki/American_V:_A_Hundred_Highways',
 'https://en.wikipedia.org/wiki/American_VI:_Ain%27t_No_Grave',
 'https://en.wikipedia.org/wiki/Amethyst_Rock_Star',
 'https://en.wikipedia.org/wiki/Angus_%26_Julia_Stone_(album)',
 'https://en.wikipedia.org/wiki/Antennas_to_Hell',
 'https://en.wikipedia.org/wiki/Anthology:_Through_the_Years',
 'https://en.wikipedia.org/wiki/Armed_Love',
 'https://en.wikipedia.org/wiki/Artpop',
 'https://en.wikipedia.org/wiki/

In [26]:
wiki_url2 = 'https://en.wikipedia.org/wiki/Category:Albums_by_producer'
links = get_category_links(wiki_url2)

In [27]:
len(links)

201

Cool! We now have a way to get lists of albums. Can Spotify give us a list of songs in an album?

In [47]:
album_id = sp.search(q='album:just in capes',type='album')['albums']['items'][0]['id']
for song in sp.album_tracks(album_id)['items']:
    print(song['name'], '\t|\t', song['id'])

Drumming Song 	|	 3DQGus4N9O9lk343trfNno
Emotions 	|	 2yrHacdhmKCntWO8pdfj6t
Cry Me a River 	|	 1rj6DjmbzgIj5muwrRvoX5
Renegade 	|	 2UlAJ40lg23zpfAF2fMhLL
Robots 	|	 4y6RREsxp7FFI0TWvDqkGm
Valerie 	|	 7r9CtQbliORexZwbOMeDgO
Gone 	|	 6SYptf8cpTTsVmmI0GWj2F
Forget You 	|	 0IqTlV1tMqAG7fhnv4xRcZ
Mashup (Love the Way You Lie/Dynamite/Teenage Dream) 	|	 1lZW1mmm6cQlHOG89Uoj4s
Sparkling Diamonds 	|	 12SEa6vgBhlpgwM8F91GQs


Affirmative.
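
The search above used a hand-typed query. To drive it from the scraped album URLs, each URL has to become a search string first. The sketch below is one heuristic way to do that, assuming the URL shapes seen in `rr_albums`: a title, optionally followed by a disambiguator like `(Adele_album)` or `(album)`. The function name `spotify_query_from_wiki_url` is hypothetical, and the `album:` / `artist:` field filters are the same query syntax used in the search call above.

```python
import re
from urllib.parse import unquote, urlparse


def spotify_query_from_wiki_url(url):
    """Build a Spotify album search query from a Wikipedia album URL."""
    # Last path segment, decoded, with underscores turned into spaces
    page = unquote(urlparse(url).path.rsplit('/', 1)[-1]).replace('_', ' ')
    # Split off a trailing disambiguator such as "(Adele album)" or "(album)"
    match = re.match(r'^(.*?)\s*\(([^()]*)\)$', page)
    if not match:
        return 'album:{}'.format(page)
    title, hint = match.group(1), match.group(2)
    # "Adele album" -> "Adele"; a bare "album" leaves no artist
    artist = re.sub(r'\s*album$', '', hint).strip()
    if not artist or artist.isdigit():  # skip year-only hints like "2013 album"
        return 'album:{}'.format(title)
    return 'album:{} artist:{}'.format(title, artist)


print(spotify_query_from_wiki_url('https://en.wikipedia.org/wiki/21_(Adele_album)'))
# → album:21 artist:Adele
```

The result can be passed as `q` to `sp.search(q=..., type='album')`. Disambiguators that are not of the form "artist album" will produce a wrong artist filter, so results should be spot-checked.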

In [None]:
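# audio_analysis needs a track ID; this one is "Drumming Song" from the album above
sp.audio_analysis('3DQGus4N9O9lk343trfNno')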
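
`sp.audio_analysis` works on one track at a time. For a compact feature vector per song, spotipy also provides `sp.audio_features`, which accepts a list of track IDs. A rough sketch of collecting features for a whole album is below. The helper names `chunked` and `album_audio_features` are assumptions made for illustration, the batch size of 100 reflects the per-request ID limit of Spotify's audio-features endpoint, and whether the endpoint is available may depend on your Spotify app's access level.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def album_audio_features(sp, album_id, batch_size=100):
    """Return a list of audio-feature dicts for the tracks on one album.

    sp is a spotipy.Spotify client (or any object with the same
    album_tracks / audio_features methods).
    """
    # First page of tracks only; very long albums would need paging
    track_ids = [t['id'] for t in sp.album_tracks(album_id)['items']]
    features = []
    for batch in chunked(track_ids, batch_size):
        # audio_features returns one entry per ID, None when unavailable
        features.extend(f for f in sp.audio_features(batch) if f)
    return features
```

With the notebook's client this would be called as `album_audio_features(sp, album_id)`, and each returned dict could then be inserted into the MongoDB collection.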