<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Scope" data-toc-modified-id="Scope-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Scope</a></span></li><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#Website's-permission" data-toc-modified-id="Website's-permission-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Website's permission</a></span></li><li><span><a href="#WebScraping" data-toc-modified-id="WebScraping-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>WebScraping</a></span></li></ul></div>

### Scope

In this Jupyter notebook, I plan to extract the names of rappers and their song names from AZLyrics by visiting each artist's page URL. This will be done with **Selenium**.

I didn't have to reinvent the wheel for this web scraping notebook, since I found this [notebook](https://github.com/aakashbansal/Songs-Lyrics-Web-Scraper/blob/master/Songs%20Names%20Scraper.ipynb) and reused a bit of my old code from this [web scraping notebook](https://github.com/gsagararatne/DataSciencePortfolio/blob/main/PythonWebScraping/ScrapingASong/WebScrapingASong.ipynb).

### Import libraries

In [8]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from time import sleep
import csv
import json
import re

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.by import By
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

### Website's permission

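Before scraping, check what the site allows crawlers to access. A site's crawling rules are published in its robots.txt file, which can be viewed by appending `/robots.txt` to the site's root URL: https://www.azlyrics.com/robots.txt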
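The same kind of check can be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`. The helper name `is_allowed` and the rules in `sample_rules` are assumptions made up for illustration; they are not the actual contents of AZLyrics' robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Illustrative rules only (assumption): NOT the real contents of
# https://www.azlyrics.com/robots.txt
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "https://www.example.com/h/hopsin.html"))
print(is_allowed(sample_rules, "https://www.example.com/private/page.html"))
```

To check a live file instead, `RobotFileParser` also offers `set_url()` and `read()`, which download and parse the rules over the network.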

### WebScraping

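The artist pages follow a consistent URL format: https://www.azlyrics.com/{1}/{2}.html

where {2} is the identifier AZLyrics uses for the artist in the URL, usually the name in lowercase without spaces (for example, `bigsean` for Big Sean, but `busta` for Busta Rhymes), and {1} is the first letter of that identifier.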
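As a small sketch of this pattern, the URL for one artist can be built as shown here. The helper name `build_artist_url` is hypothetical; the template string is the same one used in the scraping cell.

```python
BASE_URL = "https://www.azlyrics.com/{}/{}.html"


def build_artist_url(artist_id):
    """Fill the template with the first letter of the identifier and the identifier itself."""
    artist_id = artist_id.lower()
    return BASE_URL.format(artist_id[0], artist_id)


print(build_artist_url("Hopsin"))     # https://www.azlyrics.com/h/hopsin.html
print(build_artist_url("YBNCordae"))  # https://www.azlyrics.com/y/ybncordae.html
```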

In [10]:
# url to scrape the songs list from
base_url = "https://www.azlyrics.com/{}/{}.html"

# lyrical rap artist list
lyrical_artists = ["Hopsin","NF","TI","anderson-paak",
                   "andre3000","bigl","bigsean","blackthought","busta",
                   "chancetherapper","YBNCordae","denzelcurry",
                   "eminem","icecube","jayz","jcole","joynerlucas",
                   "kendricklamar","lildicky","lilwayne","logic",
                   "macklemore","marloncraft","meekmill","methodman",
                   "mfdoom","mosdef","nas","natedogg","nipseyhussle",
                   "notorious","pushat","rakim","royceda59",
                   "snoopdogg","token","wale"]

# empty dict for songs to be mapped into
lyrical_songs_dict = {}

In [11]:
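# assumes Selenium 4+, where the driver path is passed through a Service object
from selenium.webdriver.chrome.service import Service

# start from an empty dict so re-running this cell does not duplicate songs
lyrical_songs_dict = {}

# start a fresh Chrome session
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

try:
    for artist in lyrical_artists:

        artist_url = base_url.format(artist[0].lower(), artist.lower())
        print("Going to url : ", artist_url)

        driver.get(artist_url)

        # Get the text of the artist name element in the page heading
        artist_name = driver.find_element(By.XPATH, './html/body/div[2]/div/div[2]/h1/strong').get_attribute('textContent')

        # Remove the trailing " Lyrics" (7 characters) from the artist name
        artist_name = artist_name[:-7]

        lyrical_songs_dict[artist_name] = []

        # Get the song name elements
        song_names = driver.find_elements(By.CLASS_NAME, "listalbum-item")

        for name in song_names:
            lyrical_songs_dict[artist_name].append(name.get_attribute('textContent'))

        # pause between requests to avoid spamming the site
        sleep(5)
finally:
    # close the browser even if a page fails to load
    driver.quit()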



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [C:\Users\gsagararatne\.wdm\drivers\chromedriver\win32\96.0.4664.45\chromedriver.exe] found in cache


Going to url :  https://www.azlyrics.com/h/hopsin.html
Going to url :  https://www.azlyrics.com/n/nf.html
Going to url :  https://www.azlyrics.com/t/ti.html
Going to url :  https://www.azlyrics.com/a/anderson-paak.html
Going to url :  https://www.azlyrics.com/a/andre3000.html
Going to url :  https://www.azlyrics.com/b/bigl.html
Going to url :  https://www.azlyrics.com/b/bigsean.html
Going to url :  https://www.azlyrics.com/b/blackthought.html
Going to url :  https://www.azlyrics.com/b/busta.html
Going to url :  https://www.azlyrics.com/c/chancetherapper.html
Going to url :  https://www.azlyrics.com/y/ybncordae.html
Going to url :  https://www.azlyrics.com/d/denzelcurry.html
Going to url :  https://www.azlyrics.com/e/eminem.html
Going to url :  https://www.azlyrics.com/i/icecube.html
Going to url :  https://www.azlyrics.com/j/jayz.html
Going to url :  https://www.azlyrics.com/j/jcole.html
Going to url :  https://www.azlyrics.com/j/joynerlucas.html
Going to url :  https://www.azlyrics.co

In [12]:
for key,val in lyrical_songs_dict.items():
    print(key,len(val))

Hopsin 120
NF (Nathan Feuerstein) 108
T.I. 298
Anderson .Paak 89
Andre 3000 16
Big L 61
Big Sean 221
Black Thought 45
Busta Rhymes 286
Chance The Rapper 137
YBN Cordae (Entendre) 96
Denzel Curry 131
Eminem 398
Ice Cube 195
Jay-Z 311
J. Cole 252
Joyner Lucas 107
Kendrick Lamar 171
Lil Dicky 81
Lil Wayne 750
Logic 268
Macklemore 97
Marlon Craft 181
Meek Mill (Meek Millz) 296
Method Man 257
MF Doom 199
Mos Def (Yasiin Bey) 80
Nas 349
Nate Dogg 61
Nipsey Hussle 212
Notorious B.I.G. (Biggie Smalls) 130
Pusha T 84
Rakim 53
Royce Da 5'9" 209
Snoop Dogg 509
Token 97
Wale 357


Lil Wayne has the most songs in this list with 750, followed by Snoop Dogg with 509.
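A statement like this is easy to verify in code rather than by scanning the printed counts. Below is a minimal sketch; the helper name `artist_with_most_songs` is hypothetical, and `sample` is placeholder data standing in for `lyrical_songs_dict`.

```python
def artist_with_most_songs(songs_by_artist):
    """Return (artist, song_count) for the artist with the longest song list."""
    artist = max(songs_by_artist, key=lambda name: len(songs_by_artist[name]))
    return artist, len(songs_by_artist[artist])


# Placeholder data (assumption): stands in for lyrical_songs_dict
sample = {
    "Artist A": ["s1", "s2"],
    "Artist B": ["s1", "s2", "s3"],
    "Artist C": ["s1"],
}

print(artist_with_most_songs(sample))  # ('Artist B', 3)
```

In the notebook, the same function could be called on `lyrical_songs_dict` directly.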

In [14]:
# naming file
json_file = "lyrical_rap_songs.json"

# Saving file
with open(json_file, 'w') as file:
    json.dump(lyrical_songs_dict, file)

In [15]:
# Reading file to check
# file containing artists - songs mapping
lyrical_rap_songs_json = "lyrical_rap_songs.json"
lyrical_dict = {}

with open(lyrical_rap_songs_json) as file:
    lyrical_dict = json.load(file)
    
# Check
# lyrical_dict

Extracted format: the JSON file maps each artist's name to the list of song names extracted for that artist, i.e. {name of artist: [names of songs extracted]}.
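To make the structure concrete, here is a small round-trip sketch. The artist and song names are placeholders, and the file is written to a temporary directory (both are assumptions for illustration) so that the real `lyrical_rap_songs.json` is not touched.

```python
import json
import os
import tempfile

# Placeholder data (assumption) with the same shape as lyrical_songs_dict:
# each key is an artist name, each value is a list of song names
sample = {
    "Artist A": ["Song One", "Song Two"],
    "Artist B": ["Song Three"],
}

# Write to a temporary file rather than the notebook's real output file
path = os.path.join(tempfile.mkdtemp(), "sample_songs.json")
with open(path, "w", encoding="utf-8") as file:
    json.dump(sample, file, ensure_ascii=False, indent=2)

# Read it back and confirm nothing changed
with open(path, encoding="utf-8") as file:
    loaded = json.load(file)

print(loaded == sample)  # True
for artist, songs in loaded.items():
    print(artist, len(songs))
```

Passing `encoding="utf-8"` together with `ensure_ascii=False` is one way to keep accented characters in names readable in the saved file; the notebook's own cells use the defaults.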