## Validate Genius Lyrics

This notebook contains the processing pipeline that validates the scraped lyrics and saves them to a single CSV file that is accessible over the web.

This notebook must be run on a Google Dataproc cluster with the PySpark shell embedded in the Jupyter Notebook.

Cluster Config:
```
gcloud dataproc clusters create w210-capstone \
  --metadata "JUPYTER_CONDA_CHANNELS=conda-forge,JUPYTER_CONDA_PACKAGES=pandas:tqdm:beautifulsoup4:python-dotenv" \
  --bucket w210-capstone \
  --subnet default \
  --zone $ZONE \
  --master-machine-type n1-standard-1 \
  --master-boot-disk-size 80 \
  --num-workers $WORKERS \
  --worker-machine-type n1-standard-1 \
  --worker-boot-disk-size 80 \
  --image-version 1.2 \
  --project w261-215522 \
  --initialization-actions \
      'gs://dataproc-initialization-actions/jupyter/jupyter.sh'
```

`$ZONE` == GCP Zone  
`$WORKERS` == Number of nodes in cluster

In [1]:
sc

In [2]:
spark

In [3]:
import re

## Bring in all the data

In [4]:
metapath = 'gs:///genius-meta/genius_metadata.csv'
meta = sc.textFile(metapath)
meta = meta.map(lambda x: (x.split(',')[0], x.split(',')[2]))\
    .filter(lambda x: x[0] != 'msd_id')\
    .filter(lambda x: x[1] != '')
meta.take(5)

[('TRMMMYQ128F932D901',
  'https://genius.com/Faster-pussycat-silent-night-lyrics'),
 ('TRMMMRX128F93187D9',
  'https://genius.com/Hudson-mohawke-no-one-could-ever-lyrics'),
 ('TRMMMCH128F425532C', 'https://genius.com/Alan-lvb-la-vaina-brava-lyrics'),
 ('TRMMMBB12903CB7D21',
  'https://genius.com/Kris-kross-2-da-beat-chyall-lyrics'),
 ('TRMMMXJ12903CBF111',
  'https://genius.com/Jorge-negrete-el-hijo-del-pueblo-lyrics')]
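The `split(',')` call in the cell above assumes that no field contains a quoted comma. As a hedged alternative, the sketch below parses one line with Python's `csv` module instead. The helper name `parse_meta_line` and the sample line's middle column are made up for illustration; the column positions (`msd_id` in column 0, URL in column 2) are taken from the cell above. It still works line by line, so it would not handle fields that contain embedded newlines.

```python
import csv

def parse_meta_line(line):
    """Return (msd_id, url) for one CSV line, or None for header, short, or blank-URL rows."""
    # csv.reader honors double-quoted fields, so a comma inside quotes stays in its field
    fields = next(csv.reader([line]), [])
    if len(fields) < 3:
        return None
    msd_id, url = fields[0], fields[2]
    if msd_id == 'msd_id' or url == '':
        return None
    return (msd_id, url)

# Hypothetical line: the quoted middle column is invented to show the comma case
sample = 'TRMMMYQ128F932D901,"Silent Night, Live",https://genius.com/Faster-pussycat-silent-night-lyrics'
print(parse_meta_line(sample))
```

In the pipeline this could stand in for the `map`/`filter` chain, for example `sc.textFile(metapath).map(parse_meta_line).filter(lambda x: x is not None)`.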

In [5]:
msdpath = 'gs:///MSD/unique_tracks.txt'
msd = sc.textFile(msdpath)
# msd = msd.map(lambda x: (x.split('<SEP>')[0], x.split('<SEP>')[1], x.split('<SEP>')[2], x.split('<SEP>')[3]))
msd = msd.map(lambda x: (x.split('<SEP>')[0], x.split('<SEP>')[1:]))
msd.take(5)

[('TRMMMYQ128F932D901',
  ['SOQMMHC12AB0180CB8', 'Faster Pussy cat', 'Silent Night']),
 ('TRMMMKD128F425225D',
  ['SOVFVAK12A8C1350D9', 'Karkkiautomaatti', 'Tanssi vaan']),
 ('TRMMMRX128F93187D9',
  ['SOGTUKN12AB017F4F1', 'Hudson Mohawke', 'No One Could Ever']),
 ('TRMMMCH128F425532C',
  ['SOBNYVR12A8C13558C', 'Yerba Brava', 'Si Vos Querés']),
 ('TRMMMWA128F426B589',
  ['SOHSBXH12A8C13B0DF', 'Der Mystic', 'Tangle Of Aspens'])]

In [6]:
# pull in main lyrics
lyricspath = 'gs:///distributed-scrape'
lyrics = spark.read.format("CSV")\
    .option("header","true")\
    .option("multiLine", "true")\
    .load(lyricspath)

# pull in additional lyrics
lyricspath2 = 'gs:///genius-lyrics/genius_lyrics.csv'
lyrics2 = spark.read.format("CSV")\
    .option("header","true")\
    .option("multiLine", "true")\
    .option("escape", '"')\
    .load(lyricspath2)

lyricspath3 = 'gs:///subset-10k/genius_lyrics.csv'
lyrics3 = spark.read.format("CSV")\
    .option("header","true")\
    .option("multiLine", "true")\
    .option("escape", '"')\
    .load(lyricspath3)

lyrics_all = lyrics.union(lyrics2).union(lyrics3)
lyrics_all.count()

327298

## Parse meta data and filter

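1. Parse the Genius URL into space-separated, lowercase words
2. Remove parenthesized text from the MSD artist and title, then concatenate them
3. Compute the Jaccard similarity between (1) and (2) and drop pairs that score 0.25 or lower (the `DistJaccard` function and `jac_dist` column below measure similarity, despite their names)
4. Only keep lyrics shorter than 5000 characters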
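To make the score in step 3 concrete (shared tokens divided by total distinct tokens), here is a minimal pure-Python sketch applied to string pairs taken from the sample rows shown in this notebook. The names `jaccard_similarity` and `THRESHOLD` are illustrative assumptions and are not part of the pipeline code; the empty-input guard is likewise an addition.

```python
# Illustrative sketch only: the names below are hypothetical, not the notebook's own.
THRESHOLD = 0.25  # cut-off from step 3

def jaccard_similarity(a, b):
    """Share of distinct whitespace-separated tokens common to both strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    if not union:  # both strings empty: treat as no overlap (assumption)
        return 0.0
    return len(tokens_a & tokens_b) / len(union)

pairs = [
    ("aerosmith remember walking in the sand", "aerosmith remember"),
    ("ghost jigolo har megiddo", "hex subtek"),
    ("al green i say a little prayer", "al green i say a little prayer"),
]
for url_parsed, msd_parsed in pairs:
    score = jaccard_similarity(url_parsed, msd_parsed)
    verdict = "keep" if score > THRESHOLD else "drop"
    print(f"{score:.2f} {verdict}: {url_parsed!r} vs {msd_parsed!r}")
```

The first pair shares 2 of 6 distinct tokens, so it scores about 0.33 and is kept; the second shares none and is dropped; the third is an exact match and scores 1.0.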

In [7]:
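def DistJaccard(str1, str2):
    # Jaccard similarity of the two token sets: |A & B| / |A | B|
    str1 = set(str1.split())
    str2 = set(str2.split())
    union = str1 | str2
    if not union:
        # both strings empty: avoid ZeroDivisionError
        return 0.0
    return float(len(str1 & str2)) / len(union)

def remove_parentheses(phrase):
    # drop parenthesized text, e.g. "(Live)"
    return re.sub(r'\([^)]+\)', '', phrase).strip()

def parse_genius_url(url_string):
    # 'https://genius.com/Artist-name-song-title-lyrics' -> 'artist name song title'
    url = re.search(r'https://genius\.com/(.+)', url_string).group(1)
    url = re.search(r'(.*)-lyrics', url).group(1)
    return ' '.join(url.split('-')).lower()

def parse_msd(artist, title):
    # normalize the MSD artist and title into one lowercase string
    a = remove_parentheses(artist).lower().strip()
    t = remove_parentheses(title).lower().strip()
    return a + ' ' + t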

In [8]:
meta2 = meta.mapValues(parse_genius_url)
meta2.take(5)

[('TRMMMYQ128F932D901', 'faster pussycat silent night'),
 ('TRMMMRX128F93187D9', 'hudson mohawke no one could ever'),
 ('TRMMMCH128F425532C', 'alan lvb la vaina brava'),
 ('TRMMMBB12903CB7D21', 'kris kross 2 da beat chyall'),
 ('TRMMMXJ12903CBF111', 'jorge negrete el hijo del pueblo')]

In [9]:
msd2 = msd.mapValues(lambda x: parse_msd(x[1], x[2]))
msd2.take(5)

[('TRMMMYQ128F932D901', 'faster pussy cat silent night'),
 ('TRMMMKD128F425225D', 'karkkiautomaatti tanssi vaan'),
 ('TRMMMRX128F93187D9', 'hudson mohawke no one could ever'),
 ('TRMMMCH128F425532C', 'yerba brava si vos querés'),
 ('TRMMMWA128F426B589', 'der mystic tangle of aspens')]

In [10]:
# join by msd_id and compute the Jaccard similarity of each (url, msd) pair
meta_msd = meta2.join(msd2, numPartitions=24)
meta_msd2 = meta_msd.map(lambda x: (x[0], x[1][0], x[1][1], DistJaccard(x[1][0], x[1][1])))
# meta_msd2.take(5)
df_meta_msd = meta_msd2.toDF(['msd_id', 'url_parsed', 'msd_parsed', 'jac_dist'])
print(df_meta_msd.count())
df_meta_msd.take(5)

327463


[Row(msd_id='TRMMWJS12903CBB7F5', url_parsed='aerosmith remember walking in the sand', msd_parsed='aerosmith remember', jac_dist=0.3333333333333333),
 Row(msd_id='TRMMWFG128F92DFAA2', url_parsed='ghost jigolo har megiddo', msd_parsed='hex subtek', jac_dist=0.0),
 Row(msd_id='TRMMWYJ128EF358EA7', url_parsed='retard o bot derelict', msd_parsed='retard-o-bot something from nothing', jac_dist=0.0),
 Row(msd_id='TRMMGDP128F933E59A', url_parsed='al green i say a little prayer', msd_parsed='al green i say a little prayer', jac_dist=1.0),
 Row(msd_id='TRMMGEG128F9300DC5', url_parsed='autechre acroyear2', msd_parsed='autechre acroyearii', jac_dist=0.3333333333333333)]

Join the metadata with the lyrics, keeping only rows that pass the similarity and length filters

In [11]:
df_meta_msd.createOrReplaceTempView('meta')
lyrics_all.createOrReplaceTempView('lyrics')

results = spark.sql("""
SELECT l.msd_id AS msd_id, l.lyrics AS lyrics
FROM meta AS m
INNER JOIN lyrics AS l
  ON l.msd_id = m.msd_id
WHERE length(l.lyrics) < 5000 AND m.jac_dist > 0.25
""").cache()
results.count()

217325

In [12]:
results.take(5)

[Row(msd_id='TRAADJU128F92F58E1', lyrics='I hear you praying with your hands clasped over your chest\nI hear men slaying while they say "keep doing your best"\nI hear the laughter of someone up above\nWho\'s playing games in the name of love\nI hear you laughing\nI hear you laughing\nI see people dying in the blood and the dust\nAnd the gunshots of vicious murderous lust\nI feel the sunshine as it heats up my blood\nI feel it burning like my hate if I could\nI hear you laughing\nI hear you laughing\nI hear the silence of a kid\'s suicide\nWho couldn\'t find any place he could hide\nI hear you laughing\nI hear you laughing\nI hear the chang ring as it hits your steel tills\nAnd all the loving you save for your dollar bills\nI hear my heart beat as I talk to myself\nI\'m just statistics to help you add to your wealth\nI hear you laughing\nI hear you laughing'),
 Row(msd_id='TRAADQX128F422B4CF', lyrics="\n\nIf you ever make it back to Nashville\nRemember you have still got a friend\nI'll 

Write the results to GCS in a distributed manner

In [14]:
lyrics_valid = 'gs:///lyrics-valid'
results.write.format("CSV")\
    .option('header', 'false')\
    .option("multiLine", "true")\
    .mode('overwrite')\
    .save(lyrics_valid)

Concatenate all the CSVs into a single accessible file

In [20]:
!hdfs dfs -cat gs://w210-capstone/lyrics-valid/part-* > lyrics-valid.csv
!gsutil cp lyrics-valid.csv gs://w210-capstone/data/lyrics-valid.csv

18/10/17 22:01:39 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.10-hadoop2
18/10/17 22:02:17 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.10-hadoop2
copyFromLocal: `gs://w210-capstone/data/lyrics-valid.csv': No such file or directory
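The concatenation step can also be sketched in plain Python, assuming the part files have first been copied to a local directory. The helper name `concat_parts` and the directory name in the usage comment are hypothetical. Byte-level appending is only safe here because the write step above uses `header` set to `false`, so no header rows end up in the middle of the combined file.

```python
import glob
import os
import shutil

def concat_parts(parts_dir, out_path, pattern="part-*"):
    """Append every matching part file, in sorted name order, into out_path."""
    part_files = sorted(glob.glob(os.path.join(parts_dir, pattern)))
    with open(out_path, "wb") as out:
        for part in part_files:
            with open(part, "rb") as src:
                # stream the bytes so large part files are never fully loaded into memory
                shutil.copyfileobj(src, out)
    return len(part_files)

# Hypothetical usage, with "lyrics-valid-parts" as an assumed local copy of the output folder:
# concat_parts("lyrics-valid-parts", "lyrics-valid.csv")
```

The `part-*` pattern matches every part file regardless of how many partitions Spark wrote, and it skips marker files such as `_SUCCESS`. The output path should sit outside `parts_dir` so it is not picked up by the pattern.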
