Issue on page /04-Data-Collection/08-Collect-Genius-Lyrics.html #36

Open
adamlporter opened this issue Feb 24, 2023 · 2 comments

@adamlporter

When I tried to work through this page, I got an error when trying to execute

artist = LyricsGenius.search_artist("Missy Elliott", max_songs=6)

The error is

HTTPError: 403 Client Error: Forbidden for url: https://genius.com/api/search/multi?q=Missy+Elliott

Apparently, genius.com has changed one (or more) of their settings, so that LyricsGenius no longer works. See
https://stackoverflow.com/questions/72078610/getting-lyrics-from-genius-api-gives-error
johnwmillr/LyricsGenius#190
johnwmillr/LyricsGenius#220
The conclusion from these is (unhappily) not to use LyricsGenius.


adamlporter commented Mar 1, 2023

The procedures clean_up() and get_all_songs_from_album() still work. I rewrote Melanie Walsh's download_album_lyrics() procedure so that it works without accessing LyricsGenius.

import re
from pathlib import Path

import requests
from bs4 import BeautifulSoup

def download_album_lyrics(artist, album_name):
    clean_songs = get_all_songs_from_album(artist, album_name)

    artist = artist.replace(" ", "-")
    album_name = album_name.replace(" ", "-")
    # One folder path, used for both mkdir() and the output files
    folder = Path(f"{artist}_{album_name}")

    for song in clean_songs:
        song_title = re.sub(r"[^\w\s]", "", song)  # get rid of punctuation
        song_title = song_title.replace(" ", "-")
        url = f"https://genius.com/{artist}-{song_title}-lyrics"
        try:
            response = requests.get(url)
            if response.status_code == 200:
                folder.mkdir(parents=True, exist_ok=True)
                document = BeautifulSoup(response.text, "html.parser")
                # Match a div whose class is "lyrics" or contains "Lyrics__Root"
                div = document.find("div", class_=re.compile("^lyrics$|Lyrics__Root"))
                try:
                    lyrics = div.get_text("\n")
                    filen = folder / f"{song_title}.txt"
                    with open(filen, "w", encoding="utf-8") as file:
                        file.write(lyrics)
                    print(f"saving {filen}")
                except AttributeError:
                    # find() returned None: no matching div on the page
                    print(f"No lyrics found for {song_title}")
            else:
                print(f"problem getting lyrics for {artist} - {song_title}")
                print(f"error code was {response.status_code}")
        except FileNotFoundError:
            print(f"{url} is not found")

I have tested this and it works -- sort of. I was able to download the lyrics for three albums, then the requests.get(url) started throwing FileNotFoundErrors.

I suspect genius.com is tracking IP addresses and starts blacklisting them if they make too many requests (either in total or within a specific period of time). Interestingly, even after download_album_lyrics() stops working, get_all_songs_from_album() continues to work.
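
If the failures really are caused by rate limiting (an assumption; nothing in this thread confirms how genius.com throttles clients), spacing out the requests and retrying with a growing delay might help. Below is a minimal sketch, not a tested fix. `polite_get`, its parameters, and the choice of 403 and 429 as "slow down" signals are all hypothetical. `fetch` is any callable that returns an object with a `status_code`, such as `requests.get`.

```python
import time


def polite_get(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), waiting and retrying when the server refuses the request.

    fetch: a callable such as requests.get (returns an object with .status_code).
    Returns the last response received, whether or not it succeeded.
    """
    response = None
    for attempt in range(retries + 1):
        response = fetch(url)
        # Treating 403 and 429 as rate-limit signals is a guess, not documented behavior.
        if response.status_code not in (403, 429):
            return response
        if attempt < retries:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    return response
```

In download_album_lyrics(), this could stand in for the direct call, e.g. `response = polite_get(requests.get, url)`. Whether any delay is long enough to avoid being blocked would have to be found by experiment.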

@adamlporter

It might be possible to replace genius.com with lyrics.com. The latter site has an easier HTML structure that makes it possible to extract lyrics text without using a regular expression. (This may be similar to what genius.com used when Melanie first wrote the textbook.)

response = requests.get("https://www.lyrics.com/lyric/8237688")
html = response.text
document = BeautifulSoup(html, "html.parser")
print(document.find('pre').text)
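
One caveat with the snippet above: `document.find('pre')` returns None when the page has no `<pre>` element (for instance, if the request was refused), and `.text` then raises an AttributeError. Here is a small sketch that guards against that case. `extract_pre_lyrics` is a hypothetical helper name, and the idea that lyrics.com keeps lyrics in the first `<pre>` tag is assumed only from the one example above.

```python
from bs4 import BeautifulSoup


def extract_pre_lyrics(html):
    """Return the text of the first <pre> element in html, or None if there is none."""
    document = BeautifulSoup(html, "html.parser")
    pre = document.find("pre")
    return pre.get_text() if pre is not None else None
```

It would be called as `extract_pre_lyrics(response.text)`, with a None result treated the same way as the "No lyrics found" case in download_album_lyrics().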
