## Grouping

Some URLs are repeated because a song can belong to several genres. When a song is repeated, the chords and the uuid are filled in for only one of its rows, so the rows must be grouped by song.

In [11]:
dataset_path = '../data/chords.csv'

In [12]:
SEPARATOR = '^'

In [13]:
import pandas as pd

In [14]:
df = pd.read_csv(dataset_path, sep=SEPARATOR)

In [15]:
df

Unnamed: 0,url,name,genre,decade,chords,uuid
0,https://tabs.ultimate-guitar.com/tab/kodaline/...,All I Want,Rock,2010s,"['C', 'F', 'C', 'G/B', 'Am', 'F', 'C', 'F', 'C...",62d4ec07-9d59-4e00-a312-e98ce4f3b2fd
1,https://tabs.ultimate-guitar.com/tab/hozier/ta...,Take Me To Church (ver 2),Rock,2010s,"['F#', 'Em', 'Am', 'Em', 'Am', 'G', 'Am', 'Em'...",30af0a56-524d-49b7-ac65-be4caddbe097
2,https://tabs.ultimate-guitar.com/tab/imagine-d...,Radioactive,Rock,2010s,"['G6sus2', 'Am', 'C', 'G6', 'G6sus2', 'Am', 'C...",672a1a12-3de2-4b68-bb17-55df130ce0e0
3,https://tabs.ultimate-guitar.com/tab/a_great_b...,Say Something (ver 3),Rock,2010s,"['Am', 'F', 'C', 'Gsus4', 'Am', 'F', 'C', 'Gsu...",9edbfa0f-6d6d-44a4-b3d7-189d33010e15
4,https://tabs.ultimate-guitar.com/tab/cage-the-...,Cigarette Daydreams,Rock,2010s,"['D', 'Dmaj7', 'Em', 'G', 'A', 'D', 'Dmaj7', '...",eab426dd-9c57-4fad-b1b1-2b94601ff6ac
...,...,...,...,...,...,...
5371,https://tabs.ultimate-guitar.com/tab/harry-bel...,Jamaica Farewell (ver 2),World Music,1950s,,
5372,https://tabs.ultimate-guitar.com/tab/harry-bel...,Man Smart Woman Smarter,World Music,1950s,,
5373,https://tabs.ultimate-guitar.com/tab/harry-bel...,Mama Look-A Boo-Boo,World Music,1950s,"['F#', 'Bb', 'F#', 'Bb', 'F#', 'B', 'C#', 'F#'...",c3ae271a-0eaf-4eed-b88d-e6fe8b590767
5374,https://tabs.ultimate-guitar.com/tab/harry-bel...,Jamaica Farewell (ver 3),World Music,1950s,"['C', 'C/G', 'F', 'G', 'C', 'C/G', 'C', 'C/G',...",3658fce5-4e31-43a4-a84e-a18a39c22cda


In [16]:
df[df["chords"].isnull()]

Unnamed: 0,url,name,genre,decade,chords,uuid
71,https://tabs.ultimate-guitar.com/tab/tom_odell...,Another Love,Folk,2010s,,
86,https://tabs.ultimate-guitar.com/tab/eddie-ved...,Tonight You Belong To Me,Folk,2010s,,
101,https://tabs.ultimate-guitar.com/tab/ed-sheera...,Photograph,Pop,2010s,,
102,https://tabs.ultimate-guitar.com/tab/ed-sheera...,Thinking Out Loud,Pop,2010s,,
113,https://tabs.ultimate-guitar.com/tab/ed-sheera...,Photograph (ver 2),Pop,2010s,,
...,...,...,...,...,...,...
5364,https://tabs.ultimate-guitar.com/tab/leadbelly...,John Henry,Blues,1950s,,
5365,https://tabs.ultimate-guitar.com/tab/leadbelly...,John Henry (ver 2),Blues,1950s,,
5370,https://tabs.ultimate-guitar.com/tab/harry-bel...,Jamaica Farewell,World Music,1950s,,
5371,https://tabs.ultimate-guitar.com/tab/harry-bel...,Jamaica Farewell (ver 2),World Music,1950s,,


In [17]:
henry = df[df["name"] == 'John Henry']
henry

Unnamed: 0,url,name,genre,decade,chords,uuid
5300,https://tabs.ultimate-guitar.com/tab/leadbelly...,John Henry,Religious Music,1950s,"['D', 'D5', 'D6', 'Dmaj7', 'A', 'A5', 'A6', 'A...",75aec3f7-fafd-49a3-b077-c93cda668352
5364,https://tabs.ultimate-guitar.com/tab/leadbelly...,John Henry,Blues,1950s,,


In [18]:
len(df["url"].unique())

3968
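
Before grouping, it may be worth checking the assumption the aggregation relies on: each URL should carry at most one distinct non-null `chords` value. The sketch below runs that check on a small made-up frame; the rows and the name `toy` are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the scraped data; the values are made up for illustration.
toy = pd.DataFrame({
    "url": ["u1", "u1", "u2", "u3", "u3"],
    "genre": ["Blues", "Religious Music", "Rock", "Folk", "Pop"],
    "chords": ["['D', 'A']", None, "['C', 'G']", None, None],
})

# Rows per URL: anything above 1 is a repeated song.
rows_per_url = toy.groupby("url").size()
repeated_urls = rows_per_url[rows_per_url > 1].index.tolist()

# Distinct non-null chord strings per URL (nunique ignores nulls by default).
distinct_chords = toy.groupby("url")["chords"].nunique()
conflicting_urls = distinct_chords[distinct_chords > 1].index.tolist()
urls_without_chords = distinct_chords[distinct_chords == 0].index.tolist()

print(repeated_urls)
print(conflicting_urls)
print(urls_without_chords)
```

An empty `conflicting_urls` list would suggest that grouping can safely keep the single non-null value per song.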

In [19]:
def genres(series):
    # Join every genre listed for the song into a single '%%'-separated string.
    return series.str.cat(sep='%%')

def extract_single_not_null(series):
    # Return the only distinct non-null value in the series, or None if there is none.
    no_nulls = series[series.notnull()].unique()

    if len(no_nulls) > 1:
        raise ValueError(f'More than one distinct non-null element: {no_nulls}')

    if len(no_nulls) == 0:
        return None

    return no_nulls[0]

def chords(series):
    return extract_single_not_null(series)

def uuid(series):
    return extract_single_not_null(series)

# Try the aggregation on a single repeated song first.
result = (henry
          .groupby(['url', 'name', 'decade'])
          .agg({'genre': genres, 'chords': chords, 'uuid': uuid}))
result

url,name,decade,genre,chords,uuid
https://tabs.ultimate-guitar.com/tab/leadbelly/john-henry-chords-706042,John Henry,1950s,Religious Music%%Blues,"['D', 'D5', 'D6', 'Dmaj7', 'A', 'A5', 'A6', 'A...",75aec3f7-fafd-49a3-b077-c93cda668352


In [20]:
clean = (df
         .groupby(['url', 'name', 'decade'])
         .agg({'genre': genres, 'chords': chords, 'uuid': uuid}))

In [39]:
clean = clean.reset_index()
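
Since the genres of each song are now packed into one `%%`-separated string, later steps will probably need to unpack them. A song listed twice under the same genre would also yield a repeated entry such as `Pop%%Pop`. The sketch below shows one possible way to split the string and drop repeats; `split_genres` and `GENRE_SEPARATOR` are hypothetical names, not part of the notebook.

```python
import pandas as pd

GENRE_SEPARATOR = "%%"  # assumed to match the separator used by genres()

def split_genres(joined):
    """Split a '%%'-joined genre string into a list without repeats, keeping order."""
    # dict.fromkeys keeps the first occurrence of each genre and preserves order.
    return list(dict.fromkeys(joined.split(GENRE_SEPARATOR)))

# Illustrative values in the same shape as the aggregated genre column.
toy = pd.Series(["Religious Music%%Blues", "Pop%%Pop", "Jazz"])
genre_lists = toy.apply(split_genres).tolist()
print(genre_lists)
```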

## Filling empty

During scraping, some songs were left without chords because the site denied some of the requests. We can fill them in now:

In [40]:
without_chords = clean[clean["chords"].isnull()]
without_chords

Unnamed: 0,url,name,decade,genre,chords,uuid
381,https://tabs.ultimate-guitar.com/tab/billy_bra...,A New England (ver 2),1980s,Folk,,
545,https://tabs.ultimate-guitar.com/tab/bruce_spr...,Brilliant Disguise,1980s,Folk,,
547,https://tabs.ultimate-guitar.com/tab/bruce_spr...,Hungry Heart (ver 4),1980s,Folk,,
815,https://tabs.ultimate-guitar.com/tab/daniel-jo...,True Love Will Find You In The End,1980s,Folk,,
816,https://tabs.ultimate-guitar.com/tab/daniel-jo...,True Love Will Find You In The End (ver 2),1980s,Folk,,
1623,https://tabs.ultimate-guitar.com/tab/hannes-wa...,Es Ist An Der Zeit (ver 2),1980s,Folk,,
2209,https://tabs.ultimate-guitar.com/tab/leonard-c...,First We Take Manhattan,1980s,Folk,,
2213,https://tabs.ultimate-guitar.com/tab/leonard-c...,Im Your Man,1980s,Folk,,
2604,https://tabs.ultimate-guitar.com/tab/neil-youn...,Tell Me Why (ver 2),1980s,Folk,,
2762,https://tabs.ultimate-guitar.com/tab/paul-simo...,Graceland (ver 2),1980s,Folk,,


In [133]:
%%writefile song_data.py
import ast
import os

import pandas as pd


class SongData:
    SEPARATOR = '^'

    def __init__(self, initial_data_path=None, df=None):
        self.df = pd.DataFrame(data=[], columns=['url', 'name', 'genre', 'decade', 'chords', 'uuid'])

        if initial_data_path is not None and os.path.isfile(initial_data_path):
            self.df = pd.read_csv(initial_data_path, sep=self.SEPARATOR)

        # An explicit DataFrame takes precedence over the file.
        if df is not None:
            self.df = df

    def add_basic_data(self, basic_data):
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
        # Assumes basic_data is a dict (or Series) with one value per column.
        new_row = pd.DataFrame([basic_data])
        self.df = pd.concat([self.df, new_row], ignore_index=True)

    def add_details(self, details):
        matches = self.df['url'] == details["url"]
        # The chord list is stored as its string representation.
        self.df.loc[matches, "chords"] = str(details["chords"])
        self.df.loc[matches, "uuid"] = details["uuid"]

    def has_basic_data(self, url):
        return (self.df['url'] == url).any()

    def has_chords(self, url):
        return ((self.df['url'] == url) & (self.df['chords'].notnull())).any()

    def get_chords(self, url):
        # .iloc[0] takes the first match by position; [0] would look up index label 0.
        # ast.literal_eval parses the stored list without executing arbitrary code.
        return ast.literal_eval(self.df[self.df['url'] == url]['chords'].iloc[0])

    def has_genre_and_decade(self, genre, decade):
        return ((self.df['genre'] == genre) & (self.df['decade'] == decade)).any()

    def save(self, path):
        self.df.to_csv(path, index=False, sep=self.SEPARATOR)

Overwriting song_data.py
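
Because `add_details` stores the chord list as its string representation, reading the chords back means parsing that string. The sketch below walks through the round trip on a made-up row, using an in-memory buffer instead of a file. It uses `ast.literal_eval`, which only accepts Python literals; the frame contents are illustrative.

```python
import ast
import io

import pandas as pd

SEPARATOR = "^"

# Made-up row: the chord list is saved as text, as add_details does.
chords = ["C", "G/B", "Am"]
df = pd.DataFrame({"url": ["u1"], "chords": [str(chords)]})

# Write to and read from an in-memory buffer with the same separator.
buffer = io.StringIO()
df.to_csv(buffer, index=False, sep=SEPARATOR)
buffer.seek(0)
restored = pd.read_csv(buffer, sep=SEPARATOR)

# The column comes back as plain text, so it has to be parsed into a list.
stored_text = restored["chords"].iloc[0]
parsed = ast.literal_eval(stored_text)
print(parsed)
```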


In [53]:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from functools import reduce
import random
import jl_io as io
import os
import uuid

class ChordExtractor:

    def __init__(self, raw_html_output_directory):
        self.raw_html_output_directory = raw_html_output_directory
        self.driver = self.create_chrome_driver()
        self.first_time = True
        
        if not os.path.isdir(self.raw_html_output_directory):
            os.mkdir(self.raw_html_output_directory)
            
    def extract_song_data(self,url):
        chords_spans = self.get_chord_spans(url)
        
        chords = [span.decode_contents() for span in chords_spans]
        
        song_uuid = str(uuid.uuid4())
        # With data, as with a pig, everything gets kept.
        with open(f"{self.raw_html_output_directory}/{song_uuid}.html", "w") as file:
            file.write(self.driver.page_source)
    
        info = {
            "url":url,
            "chords":chords,
            "uuid":song_uuid
        }
        
        return info
    
    def get_chord_spans(self,url):
        self.driver.get(url)
        
        # An empty page means the site refused to serve the request.
        if self.driver.page_source == '<html><head></head><body></body></html>':
            raise Exception('Request denied: the server returned an empty page')

        if self.first_time:
            self.click_on_accept_cookies()
            self.first_time = False

        soup = BeautifulSoup(self.driver.page_source, 'lxml')

        # The code takes the fourth <article> element of the page.
        article = soup.find_all('article')[3]

        return article.find_all('span', {"style": "color: rgb(0, 0, 0);"})
    
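    def click_on_accept_cookies(self):
        try:
            # find_element('xpath', ...) works in both Selenium 3 and 4;
            # the find_element_by_* helpers were removed in Selenium 4.3.
            button = self.driver.find_element('xpath', '//button[contains(text(), "thanks")]')

            button.click()
        except Exception:
            print('cookies banner not found. Ignored')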
            
    
    def create_chrome_driver(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--incognito')
        options.add_argument('--headless')
        
        driver = webdriver.Chrome("./chromedriver", options=options)
        return driver

In [54]:
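# Work on a copy: without_chords is a slice of clean, and writing into it triggers SettingWithCopyWarning.
song_data = SongData(df=without_chords.copy())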

In [55]:
raw_html_output = '../data/raw_html'

In [56]:
extractor = ChordExtractor(raw_html_output)

In [59]:
for url in without_chords['url']:
    print(url)
    details = extractor.extract_song_data(url)
    song_data.add_details(details)

https://tabs.ultimate-guitar.com/tab/billy_bragg/a_new_england_chords_1106407


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


https://tabs.ultimate-guitar.com/tab/bruce_springsteen/brilliant_disguise_chords_320034
https://tabs.ultimate-guitar.com/tab/bruce_springsteen/hungry_heart_chords_1046494
https://tabs.ultimate-guitar.com/tab/daniel-johnston/true-love-will-find-you-in-the-end-chords-438749
https://tabs.ultimate-guitar.com/tab/daniel-johnston/true-love-will-find-you-in-the-end-chords-932929
https://tabs.ultimate-guitar.com/tab/hannes-wader/es-ist-an-der-zeit-chords-1400854
https://tabs.ultimate-guitar.com/tab/leonard-cohen/first-we-take-manhattan-chords-64976
https://tabs.ultimate-guitar.com/tab/leonard-cohen/im-your-man-chords-64981
https://tabs.ultimate-guitar.com/tab/neil-young/tell-me-why-chords-1088598
https://tabs.ultimate-guitar.com/tab/paul-simon/graceland-chords-1057907
https://tabs.ultimate-guitar.com/tab/paul-simon/you-can-call-me-al-chords-84865
https://tabs.ultimate-guitar.com/tab/renaud/des-que-le-vent-soufflera-chords-81904
https://tabs.ultimate-guitar.com/tab/renaud/mistral-gagnant-chords
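
The loop above stops at the first denied request. One possible way to make it more tolerant is to retry each URL with a growing pause, sketched below. `extract_with_retries`, its parameters, and the `flaky_extract` stand-in are hypothetical; in the notebook, the function passed in would be `extractor.extract_song_data`.

```python
import time

def extract_with_retries(extract, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call extract(url), retrying with a doubling pause whenever it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return extract(url)
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed for {url}: {error}; retrying in {delay}s")
            sleep(delay)

# Hypothetical stand-in for the real extractor: fails twice, then succeeds.
calls = {"count": 0}

def flaky_extract(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise Exception("request denied")
    return {"url": url, "chords": ["G", "D"], "uuid": "hypothetical-uuid"}

# Record the pauses instead of sleeping so the example runs instantly.
waits = []
details = extract_with_retries(flaky_extract, "https://example.com/song", sleep=waits.append)
print(details["chords"], waits)
```

Passing `sleep` as a parameter keeps the example fast; with the default `time.sleep` the pauses would be real.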

In [60]:
song_data.df

Unnamed: 0,url,name,decade,genre,chords,uuid
381,https://tabs.ultimate-guitar.com/tab/billy_bra...,A New England (ver 2),1980s,Folk,"['G', 'D', 'Em', 'C', 'G', 'D', 'C', 'G', 'G',...",edaf713b-f139-4e4e-a2f9-7427812a3d3a
545,https://tabs.ultimate-guitar.com/tab/bruce_spr...,Brilliant Disguise,1980s,Folk,"['A', 'Asus2', 'Asus4', 'A', 'A', 'Asus2', 'As...",32f1a13b-dbad-405f-a9df-0f808653117f
547,https://tabs.ultimate-guitar.com/tab/bruce_spr...,Hungry Heart (ver 4),1980s,Folk,"['C', 'Am', 'Dm7', 'G', 'C', 'Am', 'Dm7', 'G7'...",2bdc99fa-d553-4b99-b6ba-e04e0d3285b9
815,https://tabs.ultimate-guitar.com/tab/daniel-jo...,True Love Will Find You In The End,1980s,Folk,"['G', 'C', 'G', 'Em', 'Am', 'C', 'G', 'G', 'C'...",4f74e253-3ca8-4b13-8a9c-51fd41fda865
816,https://tabs.ultimate-guitar.com/tab/daniel-jo...,True Love Will Find You In The End (ver 2),1980s,Folk,"['G', 'C', 'G', 'Em', 'A7', 'C', 'C', 'G', 'G'...",eb58325e-e1c9-400a-ae2c-e3dc32090718
1623,https://tabs.ultimate-guitar.com/tab/hannes-wa...,Es Ist An Der Zeit (ver 2),1980s,Folk,"['G', 'C', 'Am', 'D', 'G', 'C', 'G', 'G', 'C',...",1534c007-ddae-416e-ad09-72a67f87e102
2209,https://tabs.ultimate-guitar.com/tab/leonard-c...,First We Take Manhattan,1980s,Folk,"['Dm', 'Am', 'Dm', 'Am', 'Dm', 'Am', 'G', 'F',...",232fec91-e80e-4b57-a569-3dcdd3c1a747
2213,https://tabs.ultimate-guitar.com/tab/leonard-c...,Im Your Man,1980s,Folk,"['Em', 'Bm', 'G', 'F#', 'Bm', 'Em', 'D', 'Em',...",1c502ff6-adcb-4a51-aee7-c649467523d7
2604,https://tabs.ultimate-guitar.com/tab/neil-youn...,Tell Me Why (ver 2),1980s,Folk,"['F#', 'G#', 'A', 'C', 'D', 'Am', 'C', 'G', 'C...",5173ca4d-397b-40df-9a31-e2ec312213b6
2762,https://tabs.ultimate-guitar.com/tab/paul-simo...,Graceland (ver 2),1980s,Folk,"['D', 'G', 'Bm', 'A', 'D', 'D', 'G', 'Bm', 'A'...",4b220027-9556-4aea-8f29-71b8ed64361f


In [61]:
with_chords = clean[clean["chords"].notnull()]

In [62]:
with_chords

Unnamed: 0,url,name,decade,genre,chords,uuid
0,https://tabs.ultimate-guitar.com/tab/1055161,Time To Say Goodbye Con Te Partirò,1990s,Pop%%Classical,"['G', 'D', 'Em', 'C', 'G', 'D', 'Em', 'C', 'G'...",d42333bf-e925-4ad1-a4b9-bfe2e2df209e
1,https://tabs.ultimate-guitar.com/tab/1060259,El Mañana (ver 3),2000s,Electronic,"['Am', 'Em/G', 'F', 'Em', 'Bm', 'Dm', 'Am', 'E...",d6123fcf-d023-4425-8644-0a04afd88ebf
2,https://tabs.ultimate-guitar.com/tab/10cc/im_n...,Im Not In Love,1970s,Pop%%Pop,"['F#m7/B', 'B6', 'F#m7/B', 'B6', 'F#m7/B', 'B6...",277077be-828c-46ab-93c0-5ed82f514e3b
3,https://tabs.ultimate-guitar.com/tab/1238388,Cajuína,1970s,World Music,"['Cm', 'Fm', 'G', 'Cm', 'C', 'Fm', 'Bb', 'Eb',...",c245ecf7-dd2e-40ca-baa9-5717c789b1a5
4,https://tabs.ultimate-guitar.com/tab/1510590,Balladen Om Herr Fredrik Åkare Och Den Söta Fr...,1960s,Jazz,"['Em', 'Am', 'C', 'B7', 'Em', 'Am', 'D7', 'G',...",7122de89-1f8f-4a03-b2d1-aa9d0568338b
...,...,...,...,...,...,...
3963,https://tabs.ultimate-guitar.com/tab/zaho/je_t...,Je Te Promets,2000s,Contemporary R&b,"['Am', 'Dm', 'Am', 'Am', 'Dm', 'Am', 'Am', 'F'...",5ee7945f-d3df-44cc-aa97-82e6ef1cba15
3964,https://tabs.ultimate-guitar.com/tab/ziggy_mar...,Beach In Hawaii,2000s,Reggae,"['G', 'F', 'F', 'G', 'F', 'Am', 'Am', 'Am', 'G...",49efaf87-b9e5-49f2-8eea-ec31fc63f564
3965,https://tabs.ultimate-guitar.com/tab/ziggy_mar...,Cry Cry Cry,2000s,Reggae,"['E', 'A', 'E', 'A', 'E', 'A', 'E', 'A', 'E', ...",445faed9-c30d-467e-b5db-ef1265430ae3
3966,https://tabs.ultimate-guitar.com/tab/ziggy_mar...,True To Myself,2000s,Reggae,"['A', 'E', 'Bm', 'D', 'A', 'E', 'Bm', 'D', 'A'...",c0d75c52-ae53-4b47-bc8e-a723ba542d00


In [64]:
final = pd.concat([with_chords, song_data.df])

In [65]:
final

Unnamed: 0,url,name,decade,genre,chords,uuid
0,https://tabs.ultimate-guitar.com/tab/1055161,Time To Say Goodbye Con Te Partirò,1990s,Pop%%Classical,"['G', 'D', 'Em', 'C', 'G', 'D', 'Em', 'C', 'G'...",d42333bf-e925-4ad1-a4b9-bfe2e2df209e
1,https://tabs.ultimate-guitar.com/tab/1060259,El Mañana (ver 3),2000s,Electronic,"['Am', 'Em/G', 'F', 'Em', 'Bm', 'Dm', 'Am', 'E...",d6123fcf-d023-4425-8644-0a04afd88ebf
2,https://tabs.ultimate-guitar.com/tab/10cc/im_n...,Im Not In Love,1970s,Pop%%Pop,"['F#m7/B', 'B6', 'F#m7/B', 'B6', 'F#m7/B', 'B6...",277077be-828c-46ab-93c0-5ed82f514e3b
3,https://tabs.ultimate-guitar.com/tab/1238388,Cajuína,1970s,World Music,"['Cm', 'Fm', 'G', 'Cm', 'C', 'Fm', 'Bb', 'Eb',...",c245ecf7-dd2e-40ca-baa9-5717c789b1a5
4,https://tabs.ultimate-guitar.com/tab/1510590,Balladen Om Herr Fredrik Åkare Och Den Söta Fr...,1960s,Jazz,"['Em', 'Am', 'C', 'B7', 'Em', 'Am', 'D7', 'G',...",7122de89-1f8f-4a03-b2d1-aa9d0568338b
...,...,...,...,...,...,...
3652,https://tabs.ultimate-guitar.com/tab/tom_waits...,Time,1980s,Folk,"['D', 'A7', 'D', 'D', 'G', 'A7', 'D', 'D', 'A7...",1fbff9ae-083d-4f8d-a7a3-ca6a0e54f398
3686,https://tabs.ultimate-guitar.com/tab/tracy_cha...,Baby Can I Hold You (ver 3),1980s,Folk,"['D', 'Em7', 'A7', 'D', 'Em7', 'A7', 'D', 'Em7...",bef7ccf1-326b-47b1-aabe-8c7de90c8785
3693,https://tabs.ultimate-guitar.com/tab/tracy_cha...,Talkin Bout A Revolution (ver 2),1980s,Folk,"['G', 'Cadd9', 'Em', 'D', 'Dsus4', 'G', 'Cadd9...",a5af1c23-675a-4502-ac17-99dc5f1caef0
3694,https://tabs.ultimate-guitar.com/tab/tracy_cha...,Talkin Bout A Revolution (ver 3),1980s,Folk,"['G', 'C', 'Em', 'D', 'G', 'C', 'Em', 'D', 'G'...",6efc5bd2-ea54-4386-94f9-53c6a5be5bbf


In [66]:
song_data = SongData(df=final)

In [67]:
song_data.save('../data/chords_clean.csv')
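
After saving, a quick sanity check can confirm that the clean file has one row per URL and no missing chords. The sketch below applies such a check to a tiny made-up CSV held in memory; `check_clean` is a hypothetical helper and the rows are illustrative.

```python
import io

import pandas as pd

SEPARATOR = "^"

def check_clean(df):
    """Return a list of problems found in a cleaned songs frame (empty if none)."""
    problems = []
    if df["url"].duplicated().any():
        problems.append("duplicated urls")
    if df["chords"].isnull().any():
        problems.append("songs without chords")
    if df["uuid"].isnull().any():
        problems.append("songs without uuid")
    return problems

# Made-up file contents: the second song has no chords and no uuid.
csv_text = (
    "url^name^decade^genre^chords^uuid\n"
    "u1^Song A^1980s^Folk^['G', 'D']^id-1\n"
    "u2^Song B^1990s^Pop%%Rock^^\n"
)
toy = pd.read_csv(io.StringIO(csv_text), sep=SEPARATOR)
problems = check_clean(toy)
print(problems)
```

On the real data, the same function could be applied to the frame read back from `../data/chords_clean.csv` with `sep='^'`.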