
Welcome to the GeniusScraper wiki!

Development

My first step was finding the page I wanted to start scraping. In my case, I wanted to scrape Genius album pages for the links to all of their songs, and then check whether any of those songs contained a keyword.

I tried finding a tutorial for this and came across the first-web-scraper tutorial. It was a good introduction to which Python libraries to use, but its use case was too different from mine, so I abandoned it quickly. It did, however, teach me what to look for when scraping websites: unless a site does something fancy, most of its text should be right in the HTML. This means that to find out whether a rapper has ever said a given word, I just need to check whether that word appears, or doesn't, in the HTML of the page the lyrics are on. So at this point I needed to confirm that the lyrics in Genius transcriptions were indeed in the HTML.

I used Travis Scott's Astroworld as the starting point, and both viewed the page source and inspected the page. First I clicked on a random song and viewed its page source to confirm that the lyrics of a song really are in the HTML. Since they were, I knew this was possible. I then went back to the album page, inspected it, and looked specifically for where in the HTML the table of songs was located. I noticed the pattern that each row had its own div class, namely chart_row-content.

Just to be sure, I Ctrl+F'd every instance of chart_row-content in the page source. The number of occurrences matched the number of songs in the album. Now I just needed to scrape each instance and get the link within the div. I knew from the tutorial that the scraping would be done with the BeautifulSoup library, so the next step was learning how to use it. A Google search directed me to the documentation. It might be because I've been spoiled by Google's Android documentation, but I found the BeautifulSoup documentation really poor. I searched for a while and couldn't figure out how to search by class attribute rather than by HTML element. I gave up trying to find it myself, looked it up on Stack Overflow, and sure enough found the answer.
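For reference, a minimal sketch of that class-attribute search with BeautifulSoup might look like the following. The album URL is only an example placeholder; the class name chart_row-content is the one found by inspecting the album page:

```python
# Minimal sketch: filter on the HTML class attribute with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

album_url = "https://genius.com/albums/Travis-scott/Astroworld"  # example URL
soup = BeautifulSoup(requests.get(album_url).text, "html.parser")

# class_ (with the trailing underscore) is how BeautifulSoup filters on the
# class attribute, since "class" is a reserved word in Python.
rows = soup.find_all("div", class_="chart_row-content")
print(len(rows))  # should match the number of songs on the album
```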

At this point I knew enough to find the links within a div with a class of "chart_row-content". Luckily, each instance of this class contained only one link, so all I had to do was write a for loop that went through each instance of the class and saved that one link. With the links to all the songs in hand, I just had to repeat the request for each URL to get its HTML as a string, then check whether a given word appeared in that HTML. If it did, I saved the link and printed it to the terminal.
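Roughly, that loop might look like the sketch below; the function and variable names here are illustrative, not the actual GeniusScraper code:

```python
# Sketch of the per-album search: pull the single link out of each
# "chart_row-content" div, fetch that song page, and check the raw HTML
# for the keyword.
import requests
from bs4 import BeautifulSoup

def search_album(album_url, keyword):
    soup = BeautifulSoup(requests.get(album_url).text, "html.parser")
    matches = []
    for row in soup.find_all("div", class_="chart_row-content"):
        link = row.find("a")              # each row contains exactly one link
        if link is None:
            continue
        song_url = link["href"]
        song_html = requests.get(song_url).text
        if keyword in song_html:          # naive check against the whole page
            matches.append(song_url)
            print(song_url)               # print hits to the terminal
    return matches
```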

Of course, this approach has a very obvious limitation: if I search for any HTML-related keyword, the algorithm will always return true, because the word appears in the markup itself. But for my use case of finding out when Offset has said the word "Patek", I was pretty confident no developer would use that as a class or variable name. This limitation should be kept in mind when using GeniusScraper.

The other, less obvious limitation is that HTML files are pretty large, so searching for a given word in the whole file is time consuming. This approach gave me the correct output but took 22 seconds to search through 17 songs. I tried optimizing it by again using BeautifulSoup to first find the section of the HTML where the lyrics live, which by inspecting the page again I found to be a `<p>` element with class = "lyrics", who would've guessed? Once I got that p element I called getText() to convert it to a string and then checked whether the keyword was in that string. Surprisingly, this approach took slightly over 24 seconds on the same album. I thought about it for a while and concluded that BeautifulSoup's findAll() method must just be doing a linear search, so searching the entire HTML document for a keyword costs about the same as first finding the element the keyword should be in and then searching for the keyword there. I reverted to the original method of just searching for the keyword in the whole HTML, since it was both slightly faster and simpler to understand.
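For comparison, a sketch of that lyrics-only check might look like this. The class name "lyrics" is the one found by inspecting the page at the time, and Genius's markup may well have changed since:

```python
# Sketch of the lyrics-only check: narrow the search to the element with
# class "lyrics" before looking for the keyword, instead of scanning the page.
import requests
from bs4 import BeautifulSoup

def song_contains_keyword(song_url, keyword):
    soup = BeautifulSoup(requests.get(song_url).text, "html.parser")
    lyrics = soup.find(class_="lyrics")
    if lyrics is None:                 # markup changed or request failed
        return False
    return keyword in lyrics.getText()
```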

The next step was to make the algorithm depend only on an artist's name and the keyword. I started by refactoring the code into two classes, one for searching albums and the other for searching artists. The albums class works exactly as described above. The artist class follows a similar approach but starts from the link to the artist page. From the artist page, I inspect the page and use BeautifulSoup to get the link to the list of albums by that artist. Once I get to that page, I again use the same procedure of inspecting it and telling BeautifulSoup where to look to get the links to all of the albums. Once I have those links, I feed them one by one to the albums class to finish off the process. Searching for keywords starting from the artist page is where the run time gets noticeably slow: at an average of slightly over a second per song, a full search can take up to a few minutes, and there are also many requests just to reach the album pages.
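Under those assumptions, the artist-level flow might be sketched as follows, reusing the search_album function from the earlier sketch. The class names "albums-link" and "album-link" are placeholders, since the actual selectors on the artist and albums pages aren't spelled out above:

```python
# Rough shape of the artist-level search: artist page -> albums listing ->
# each album fed to the per-album search. Class names are placeholders, not
# the selectors actually used by GeniusScraper.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def search_artist(artist_url, keyword):
    artist_soup = BeautifulSoup(requests.get(artist_url).text, "html.parser")

    # Placeholder selector for the link to the artist's full album list.
    albums_link = artist_soup.find("a", class_="albums-link")
    albums_url = urljoin(artist_url, albums_link["href"])
    albums_soup = BeautifulSoup(requests.get(albums_url).text, "html.parser")

    results = []
    # Placeholder selector for each album link on the albums listing page.
    for album in albums_soup.find_all("a", class_="album-link"):
        album_url = urljoin(albums_url, album["href"])
        results.extend(search_album(album_url, keyword))  # per-album search from earlier
    return results
```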
