# Data pipeline for lyric analysis for RotY

## Retrieve lyric data from Genius.com
The list of artists is read from `artists.txt` in the current directory.

Lyrics scraped with `lyricsgenius` are mostly album tracks but **may include songs that do not belong to any album**.

In [1]:
from scrape_lyrics import scrape_lyrics

scrape_lyrics('artists.txt')

Scraping lyrics from Genius.com...


KeyboardInterrupt: 
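`scrape_lyrics` itself is not shown in this notebook. As a rough idea of what such a step might look like, here is a minimal sketch, not the author's implementation. The helper names `read_artists` and `scrape_artists`, the one-artist-per-line file layout, and the `token` and `max_songs` parameters are all assumptions. The `lyricsgenius` calls (`Genius`, `search_artist`, `save_lyrics`) are used as I understand them and should be checked against the installed version.

```python
from pathlib import Path


def read_artists(path):
    """Return the non-empty, stripped lines of an artist list file.

    Assumes one artist name per line (hypothetical layout of artists.txt).
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def scrape_artists(artists, token, max_songs=None):
    """Save one .json file per artist and stop cleanly on Ctrl-C.

    `token` is a Genius API access token supplied by the caller.
    """
    # Third-party package, imported lazily so read_artists works without it.
    import lyricsgenius

    genius = lyricsgenius.Genius(token)
    done = []
    try:
        for name in artists:
            artist = genius.search_artist(name, max_songs=max_songs)
            if artist is not None:
                artist.save_lyrics()  # writes a .json file by default
                done.append(name)
    except KeyboardInterrupt:
        # Report progress instead of ending with a bare traceback.
        print(f"Interrupted after {len(done)} of {len(artists)} artists")
    return done
```

Catching `KeyboardInterrupt` inside the loop keeps the files already written and makes it clear which artists still need to be scraped on the next run.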

## Take .json files and create dataframe

The .json files written by the previous cell are then processed and combined into a single .csv file containing the following fields:

- `lyrics` -> string
- `artist` -> string
- `song_title` -> string
- `producers` -> [string]
- `year_of_release` -> int
- `album` -> string
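The schema above can be sketched as a small table-building step. This is an illustration only: `records_to_df` is a hypothetical helper, and it assumes each song has already been reduced to a flat dict with exactly these keys. The layout of the scraped .json files is not shown here, so the parsing step is left out.

```python
import pandas as pd

# Column order follows the field list above.
COLUMNS = ["artist", "song_title", "lyrics", "producers", "year_of_release", "album"]


def records_to_df(records, remove_na=True):
    """Build a DataFrame from flat song dicts (assumed input shape).

    With remove_na=True, rows with any missing field are dropped and
    year_of_release is cast to int, matching the schema above.
    """
    df = pd.DataFrame(records, columns=COLUMNS)
    if remove_na:
        # An empty producers list ([]) is not treated as missing.
        df = df.dropna().astype({"year_of_release": int})
    return df
```

Note that `dropna` keeps the original index labels, so labels can exceed the number of remaining rows unless the index is reset.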

In [1]:
from make_df import make_df
import pandas as pd

df = make_df('artist_files', remove_na=True)
df.sample(5)

| | artist | song_title | lyrics | producers | year_of_release | album |
|---|---|---|---|---|---|---|
| 6550 | 2Pac | Troublesome 21 | Born to wreck shit, poppin' bubble gum\nOthers... | [JZ] | 1997 | Def Jam’s How to Be a Player Soundtrack |
| 4492 | SnoopDogg | Back Up Off Me | What's up y'all? It's the Mean-ster Green-ster... | [Carlos Stephens] | 2000 | Tha Last Meal |
| 3586 | LilWayne | Luv | Shout out to all my niggas\nYou already know\n... | [] | 2013 | Dedication 5 |
| 4584 | SnoopDogg | Different Languages | You make me say \nOhh\nYou make me say \n\nI'm... | [Scoop DeVille, Teddy Riley] | 2009 | Malice N Wonderland |
| 7225 | Eminem | Seduction | Like a verbal seduction when\nSeduction when I... | [Boi-1da] | 2010 | Recovery |


In [4]:
df.shape

(4129, 6)

## Transliterate lyrics to ARPAbet
Transliterate each raw lyrics string into an ARPAbet transcription, marking word boundaries with `<b>` and line boundaries with `<l>`.

In [7]:
from transcribe import transcribe_to_phonemes

df['arpa_transcription'] = df['lyrics'].apply(transcribe_to_phonemes)
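The internals of `transcribe_to_phonemes` are not shown, so the following is only a sketch of the boundary scheme described above. It assumes a pronunciation lexicon passed in as a plain dict from lowercase words to ARPAbet phone lists. The function name `to_arpabet`, the `<unk>` placeholder for out-of-lexicon words, the punctuation stripping, and the exact spacing around `<b>` and `<l>` are all assumptions, not the author's format.

```python
def to_arpabet(lyrics, lexicon, unknown="<unk>"):
    """Transliterate lyrics using <b> between words and <l> between lines.

    `lexicon` maps lowercase words to lists of ARPAbet phones (assumed input).
    Words missing from the lexicon become `unknown` (hypothetical token).
    """
    lines = []
    for line in lyrics.splitlines():
        words = []
        for token in line.split():
            # Strip surrounding punctuation before the lookup.
            key = token.strip(".,!?;:\"()").lower()
            if not key:
                continue
            phones = lexicon.get(key)
            words.append(" ".join(phones) if phones else unknown)
        if words:  # skip blank lines so no empty <l> segments appear
            lines.append(" <b> ".join(words))
    return " <l> ".join(lines)


toy_lexicon = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}
print(to_arpabet("Hello, world\nhello", toy_lexicon))
# prints: HH AH0 L OW1 <b> W ER1 L D <l> HH AH0 L OW1
```

A real lexicon such as the CMU Pronouncing Dictionary could be loaded into the same dict shape; how slang and out-of-vocabulary words are handled is a design choice this sketch does not settle.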

In [9]:
df.to_csv('songs_with_transcription.csv', index=False)
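One thing to watch when reloading this file: CSV has no list type, so `to_csv` writes each `producers` list as its string form (for example `"['JZ']"`). A minimal sketch of reading the column back as real lists, assuming the values are plain Python list literals; `read_songs` is a hypothetical helper name:

```python
import ast

import pandas as pd


def read_songs(source):
    """Read the saved CSV, parsing `producers` back into Python lists.

    Assumes every producers cell holds a list literal such as "['JZ']" or "[]".
    """
    return pd.read_csv(source, converters={"producers": ast.literal_eval})
```

For example, `read_songs('songs_with_transcription.csv')` would return a frame whose `producers` cells are lists again rather than strings.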