# HHWeb

A network graph of rap artists' associations by shared phrases. These shared phrases could come about as a result of paying homage, intertextual allusion, or simple plagiarism. Outside the realm of modern intellectual property, this is a common and well-accepted practice in blues music, a genre in which rap music has deep roots.

I listen to a lot of rap music, and I've noticed this popping up.  It seemed like a fun way to jump into NLP. Really, something about juxtaposing AAVE and technical/academic language just tickles me.

This project will show the directionality of borrowed phrases, inferred from release dates, and cluster artists around the influential artists whose phrases are sampled or borrowed most often.
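A minimal sketch of how direction might be inferred once release dates are available. The function name `directed_edges`, the `release_year` field, and the sample records are all hypothetical: the scraped headers shown in this notebook do not carry a date, so that data would have to come from somewhere else.

```python
from itertools import combinations

def directed_edges(uses):
    """Turn the uses of one shared phrase into (source, borrower) edges.

    `uses` is a list of (artist, release_year) tuples. This is hypothetical
    data: the scraped headers do not include a release date.
    """
    edges = []
    for (a1, y1), (a2, y2) in combinations(uses, 2):
        if a1 == a2 or y1 == y2:
            continue  # same artist, or no way to tell who came first
        source, borrower = (a1, a2) if y1 < y2 else (a2, a1)
        edges.append((source, borrower))
    return edges

# Made-up artists and years, for illustration only
uses = [("Artist A", 1990), ("Artist B", 1999), ("Artist C", 2013)]
print(directed_edges(uses))
```

This version draws an edge between every pair of users of a phrase. Pointing every later artist only at the earliest one would be an equally reasonable choice.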

The number of times a phrase may appear and still count as significant will be proportional to its length. This means that a short phrase must be highly unique to count as a link (to ensure it is not simply a common expression), while longer phrases can be shared more widely, as their probability of _not_ being attributable is much lower.
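One way the length rule could look in code. This is a sketch under my own assumptions: the function name `is_significant` and both thresholds (`min_words`, `allowance_per_word`) are illustrative guesses, not tuned values.

```python
def is_significant(n_words, times_seen, min_words=3, allowance_per_word=2):
    """Return True if a shared phrase is rare enough, for its length, to count as a link.

    Both default parameters are illustrative guesses, not tuned values.
    """
    if n_words < min_words:
        return False  # too short to ever be attributable
    # Longer phrases are allowed to show up more often and still count
    max_times = allowance_per_word * (n_words - min_words + 1)
    return times_seen <= max_times

print(is_significant(3, 2))   # short and rare
print(is_significant(3, 10))  # short and common
print(is_significant(8, 10))  # long, so more sharing is tolerated
```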

## Run this after you have the scraped corpus (hhweb_scraper)

In [9]:
import os
import pandas as pd
import numpy as np
import glob
import re
from collections import defaultdict
import networkx as nx
import time
import random
from nltk.corpus import stopwords
import traceback
from difflib import SequenceMatcher

In [3]:
NGRAM_LEN = 5

In [4]:
# This is just for testing, before I pull in the corpus
# big_ego = "Artist: Dr. Dre f/ Hitman Album: The Chronic 2001 Song: Big Ego's Typed by: OHHLA Webmaster DJ Flash [Dr. Dre] I got mo' class than most of em, ran wit the best of em Forgave the less of em, and blazed at the rest of em What can I say? Cal-i-for-ni-A Where niggaz die everyday over some shit they say Disconnected from the streets forever As long as I got a baretta, nigga, I'm down for whateva I roll wit my shit off safety - for niggaz that been hatin me lately and the bitches that wanna break me If Cali blew up, I'd be in the Aftermath Bumpin gangsta rap shit, down to blast for cash Cause from Eazy-E, to D.O.C., to D.P.G. started from that S.O.B., D.R.E. Like Dub-C I'm rich rollin, pistol holdin Pockets swoll nigga, that's how I'm rollin Put the flame to the killer nigga Worldwide homicide mob figure and a builder, for real I'm hittin switches, makin bitches eat bitches See me grab my dick everytime I pose for pictures I own acres, floor seats watchin The Lakers I'm cool with eses who got AK's in cases Dedicated to all of those with big ego's Never fakin, we get the dough and live legal Haters hate this, we sip the Mo' and yank the heezos 1 - Niggaz play this in they Rovers Jeeps and Regals 2 - Bitches play this in they Benzes Jeeps and Geos {repeat 2X} [Hitman] I bust a Mr. Toughy, slash a Smoothy Doobie Crash and flex on Tuesday's, harassin hoes at movies Passin by with uzis - and who you aimin at? 
That shady bitch and that bitch nigga that was claimin that Rat-ta-tat-tat {*automatic gunfire and screaming*} {*more screaming as tires peel out*} I don't sympathize for wack hoes and wimpy guys You got to recognize Hitman is a enterprise Cali pride, born to ride and South Centralized The Henny got me energized - smoke the guys tryin to focus on mines - poke they eyes out I'm L.A.'s loc'est - hope they don't have to find out the hard way like snitch niggaz in the pen that get hit when the guards look the other way We hittin HARD, Hitman and Dre You playin games, I suggest you know the rules We puttin guns to fools, make you run yo' jewels Take yo' honey and cruise to the snootiest snooze, Cabos Pop coochie til the nut oozes, you shouldn't fuck wit crews that's sick, Aftermath cause we rule shit I'm Big Hit, don't confuse me wit no other by the flow motherfucker Dedicated to all of those with big ego's Never fakin, we get the dough and live legal Haters hate this, we sip the Mo' and yank the heezos 1 - Niggaz play this in they Rovers Jeeps and Regals 2 - Bitches play this in they Benzes Jeeps and Geos {repeat 2X}"
# tribe = "[Hook: Q-Tip] Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Well, I'm gone (Go on then!) [Verse 1: Q-Tip] Can I kick it? To all the people who can Quest like A Tribe does Before this, did you really know what live was? Comprehend to the track, for it's why cuz Gettin measures on the tip of the vibers Rock and roll to the beat of the funk fuzz Wipe your feet really good on the rhythm rug If you feel the urge to freak, do the jitterbug Come and spread your arms if you really need a hug Afrocentric living is a big shrug A life filled with fun that's what I love A lower plateau is what we're above If you diss us, we won't even think of Will Nipper the doggy give a big shove? This rhythm really fits like a snug glove Like a box of positives it's a plus, love As the Tribe flies high like a dove (Can I kick it?) [Hook: Phife Dawg] Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Can I kick it? (Yes, you can!) Well, I'm gone (Go on then!) [Verse 2: Phife Dawg] Can I kick it? To my Tribe that flows in layers Right now, Phife is a poem sayer At times, I'm a studio conveyor Mr. Dinkins, would you please be my mayor? You'll be doing us a really big favor Boy this track really has a lot of flavor When it comes to rhythms, Quest is your savior Follow us for the funky behavior Make a note on the rhythm we gave ya Feel free, drop your pants, check your ha-ir Do you like the garments that we wear? I instruct you to be the obeyer A rhythm recipe that you'll savor Doesn't matter if you're minor or major Yes, the Tribe of the game we're a player As you inhale like a breath of fresh air (Can I kick it?)"
# sage = "Can I kick it? (yes you can) [x3] Well I'm gone (go on then) Can I kick it, to all my people who get wicked like Sage does before this did you know what my real name was Paul Francis acting like he's on the same drugs Never even felt the authects of a strange buzz You never ever catch me holding a beer mug Your talking shit like as if you was a real thug if that's true lick a shot BUCK feel the slug that's what you get for totin guns like you were Elmer Fudd I'm selling tapes for three bones wanna catch a dub? this shit is dope kid it makes you wanna cut the rug Illuminati's got every part of my body bugged the micro chip is in your wrist now give it a tug be nice to females, give a bitch a hug Triple X styles comin cleaner than your tub you better tell your girl about it because she's a scrub A big brow never had a nip in the bud droppin me her seven digits while i'm in the club talkin bout I look I need a back rub son she's a natural disaster like a flash flood i ain't playin dawg you better go test her blood until your positive she's negative don't make no love with or without a glove, you know what i'm speaking of the cub scouts try and jump into the briney shrubs behind the bush turn a back push into a shove what you thinkin tryin bring the underground above? AOI make you cry like a dove,for that shit,for that shit "
# denance = "[Intro] Last year I was Dr-Drib- dribble down the court Dr-Drib- dribble down the court This year I'm kicking it I'ma kick it for like a motherfucking soccer ball [Verse 1] I'm crazy, I lost my mind I can't find it But that's OK, cause being normal's not a fucking option Cause if it was, then rap wouldn't be my main focus I'd have a 9-to-5, a wife that'll hang my clothes up I'd have a couple kids, a house to call my home, but Something crazy happened, rap became my home, yup! Every since the evidence became so relevant That I was meant to set mics on fire, I've been hesitant But that's over and I'm killing what the hell has sent If you have an issue 'bout who I say I'm better than You can try to write a song, diss me if you ever can But the only thing you got on me is this Eminem (chka chka) It's getting old, we don't share no pens So stop all these dumb accusations and comparisons We ain't nothing alike, we just white So what's the problem between us, that's causing this fight? [Hook] Can I kick it? 
(Yes, you can!)(x3) Now let me show the whole world that I ain't playing around (x2) [Verse 2] I need a U-Haul to carry this weight I bury the hate, inside of a very big crate Too scary to stay You better be very afraid I carry a cape, I'm Superman, American made You a fairy with a glare and it's gay You compare yourself to the best when you barely can slay I bring urgent care when I rap, don't you get carried away \"Son, sit down, get a job\", something your parents will say And when I eat MC's, that's really only an errand to me You ain't even half decent, boy/girl, you're half retarded You're like a turtle next to me, I'm an Aston Martin These kids are hopin' to cash out with the rappin' art when They realize 20 years down the road, they haven't started A career, then its clear that you in fact, are garbage So, please sit down or walk yourself inside of coffin Let the pros handle the hustle while you stand there stalking Hating on every move we make, hoping we don't reach stardom [Hook] Can I kick it? (Yes, you can!)(x3) Now let me show the whole world that I ain't playing around (x2)"

In [5]:
# There's probably a library out there that does this, but where's the fun in that?
# n = number of words in the cluster, ly = lyrics to split into ngrams (str)
def ngram(n, ly):
    lst = ly.split()
    # We'll make the ngrams by zipping together a series of shifted lists
    lists_to_zip = [lst]

    # NOTE: this loop adds n shifted copies on top of the original list,
    # so each gram that comes out is n + 1 words long
    for i in range(n):
        # Each list should have one more padding token than the previously created one
        new_list = ['*padding*'] + lists_to_zip[-1]
        lists_to_zip.append(new_list)
    zipped_lists = zip(*lists_to_zip[::-1])
    # Drop any window that still contains padding (it runs off the start of the lyrics)
    ngram_list = [' '.join(gram) for gram in zipped_lists if "*padding*" not in gram]
    return ngram_list
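For comparison, the same sliding window can be built from slices instead of padding tokens. This is a sketch, not a drop-in replacement: `sliding_ngrams` is a hypothetical name, and it yields exactly `n` words per gram, so its output will not line up with `ngram` for the same `n`.

```python
def sliding_ngrams(n, text):
    """Return every run of n consecutive words in `text`, in order."""
    words = text.split()
    # words[0:], words[1:], ... words[n-1:] are the same list shifted left.
    # zip stops at the shortest slice, so incomplete windows are dropped.
    return [' '.join(gram) for gram in zip(*(words[i:] for i in range(n)))]

print(sliding_ngrams(3, "can i kick it yes you can"))
# ['can i kick', 'i kick it', 'kick it yes', 'it yes you', 'yes you can']
```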

In [18]:
# TODO talk to Christopher about borrowing this https://github.com/cing/rapwords/blob/master/RapWordsTalk.ipynb
# It's MIT licensed, but it would be nice to reach out
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser set up its own state (it calls reset() internally)
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

df_data = defaultdict(list)
stpwrds = stopwords.words('english')
song_count = len(os.listdir('./lyrics'))

# The pattern has no '**', so a recursive flag would have no effect here
for i, filename in enumerate(glob.iglob('lyrics/*.txt')):
    if i % 250 == 0:
        print('Working on song {} of {}'.format(i, song_count))
    with open(filename, 'r') as f:
        stripped_lyrics = f.read()
        
        # Raw strings keep '\s' from being read as an (invalid) string escape
        artist = re.search(r'Artist:\s*(.*)\s*\n', stripped_lyrics)
        song = re.search(r'Song:\s*(.*)\s*\n', stripped_lyrics)
        lyrics = re.search(r'Typed by:\s*(.*)\s*\n([\s\S]*)', stripped_lyrics)

        if (artist is not None) and (song is not None) and (lyrics is not None):
            artist = artist.group(1)
            song = song.group(1)
            # Remixes are messing everything up.  Let's filter them out from the start
            if 'remix' not in song.lower():
                lyrics = lyrics.group(2).replace('\n', ' ')  # group(1) is the transcriber
                lyrics = re.sub(r'[^0-9A-Za-z\s]', ' ', lyrics)  # Everything but letters, digits and whitespace becomes a space
                # The ing --> in mutation will be a shitshow. So long as it's consistent, the
                # "thing", "ring", etc. issues should be negligible with a large enough word group length
                lyrics = re.sub('ing ', 'in ', lyrics) 
                lyrics = re.sub('az ', 'as ', lyrics)
                # TODO Take care of the --> da
                #      as --> az (nigg)
                no_stop = ' '.join([w for w in lyrics.lower().split() if w not in stpwrds])

                df_data["filename"].append(filename)
                df_data["artist"].append(''.join(re.findall(r'[a-zA-Z0-9\s]', artist)))
                df_data["song"].append(''.join(re.findall(r'[a-zA-Z0-9\s]', song)))
                # We can always refer back to the pretty lyrics once we find a match
                df_data["lyrics"].append(lyrics)
                # We can use the lowercase, punctuation-free, stopword-free for comparison
                df_data["no_stop"].append(no_stop)
                df_data["ngram"].append(ngram(NGRAM_LEN, no_stop))

Working on song 0 of 60086
Working on song 100 of 60086
Working on song 200 of 60086
Working on song 300 of 60086
Working on song 400 of 60086
Working on song 500 of 60086
Working on song 600 of 60086
Working on song 700 of 60086
Working on song 800 of 60086
Working on song 900 of 60086
Working on song 1000 of 60086
Working on song 1100 of 60086
Working on song 1200 of 60086
Working on song 1300 of 60086
Working on song 1400 of 60086
Working on song 1500 of 60086
Working on song 1600 of 60086
Working on song 1700 of 60086
Working on song 1800 of 60086
Working on song 1900 of 60086
Working on song 2000 of 60086
Working on song 2100 of 60086
Working on song 2200 of 60086
Working on song 2300 of 60086
Working on song 2400 of 60086
Working on song 2500 of 60086
Working on song 2600 of 60086
Working on song 2700 of 60086
Working on song 2800 of 60086
Working on song 2900 of 60086
Working on song 3000 of 60086
Working on song 3100 of 60086
Working on song 3200 of 60086
Working on song 3300 o
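To make the cleaning steps easier to poke at, here they are pulled out into two small functions and run on a single made-up file. This is a sketch: `parse_song`, `normalize`, the sample text, and the tiny `STOPWORDS` set (a stand-in for the NLTK list) are all assumptions for illustration.

```python
import re

# Tiny stand-in for nltk's English stopword list (assumption, for illustration)
STOPWORDS = {"i", "m", "it", "with", "the"}

def parse_song(raw):
    """Split an OHHLA-style file into (artist, song, lyrics), or None if a header is missing."""
    artist = re.search(r'Artist:\s*(.*)\s*\n', raw)
    song = re.search(r'Song:\s*(.*)\s*\n', raw)
    body = re.search(r'Typed by:\s*(.*)\s*\n([\s\S]*)', raw)
    if not (artist and song and body):
        return None
    return artist.group(1), song.group(1), body.group(2)

def normalize(lyrics):
    """Lowercase, strip punctuation, apply the ing --> in rule, drop stopwords."""
    text = lyrics.replace('\n', ' ')
    text = re.sub(r'[^0-9A-Za-z\s]', ' ', text)  # punctuation becomes spaces
    text = re.sub(r'ing ', 'in ', text)          # "rolling " becomes "rollin "
    return ' '.join(w for w in text.lower().split() if w not in STOPWORDS)

# Made-up file contents in the same header layout as the corpus
raw = ("Artist: Some Artist\nAlbum: Some Album\nSong: Some Song\n"
       "Typed by: someone\n\nI'm kickin it, rolling with the crew\n")

artist, song, lyrics = parse_song(raw)
print(artist, '/', song)
print(normalize(lyrics))
```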

In [19]:
lydf = pd.DataFrame(df_data).sort_values(by='song').reset_index(drop=True)
# display(lydf[lydf['song'].duplicated()])
display(lydf[lydf['artist'] == 'Dej Loaf'])

Unnamed: 0,artist,filename,lyrics,ngram,no_stop,song
3245,Dej Loaf,lyrics/ayo.dej.txt,Henny by the 5th gone and take your pick I m ...,"[henny 5th gone take pick pausin, 5th gone tak...",henny 5th gone take pick pausin cause fly feel...,Ayo
4453,Dej Loaf,lyrics/been_on.dej.txt,I ve been on my hustle It ain t too many peopl...,"[hustle many people look grindin grindin, many...",hustle many people look grindin grindin grindi...,Been On My Grind
4976,Dej Loaf,lyrics/birdcall.dej.txt,Aye that s that bird call All these haters go...,"[aye bird call haters got nerves, bird call ha...",aye bird call haters got nerves bad writin rhy...,Bird Call
5063,Dej Loaf,lyrics/b_please.dej.txt,Bitch please Rolled on ten speed bike like Yo...,"[bitch please rolled ten speed bike, please ro...",bitch please rolled ten speed bike like done r...,Bitch Please
6986,Dej Loaf,lyrics/butterfl.dej.txt,Intro You left me time Oooh ohhh This from ...,"[intro left time oooh ohhh oooh, left time ooo...",intro left time oooh ohhh oooh ohhh oooh ohhh ...,Butterflies
8111,Dej Loaf,lyrics/chasemin.dej.txt,Lately I been on my rap shit Had to stop singi...,"[lately rap shit stop singin back, rap shit st...",lately rap shit stop singin back bitch takin n...,Chase Mine
11012,Dej Loaf,lyrics/desire.dej.txt,I don t ask no questions I just handle busines...,"[ask questions handle business ask favors, que...",ask questions handle business ask favors ask n...,Desire
11198,Dej Loaf,lyrics/die4it.dej.txt,I be done killed me a nigga tryin to take some...,"[done killed nigga tryin take somethin, killed...",done killed nigga tryin take somethin away wor...,Die 4 It
13123,Dej Loaf,lyrics/easylove.dej.txt,Chorus Love ain t ever been so easy Love ain...,"[chorus love ever easy love ever, love ever ea...",chorus love ever easy love ever easy love make...,Easy Love
15298,Dej Loaf,lyrics/fools_fl.dej.txt,Intro Why do fools fall in love Why do fools...,"[intro fools fall love fools fall, fools fall ...",intro fools fall love fools fall love somebody...,Fools Fall In Love


In [None]:
# Check the collection of ngrams for intersections
# To avoid an O(n^2) pairwise comparison of songs, we keep a dict of all of the previously seen ngrams
# If the one we are looking at has been seen before, record its information
# Else add it to the dict
master_list = dict()

def how_similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

song_count = len(df_data['ngram'])
for i, song in enumerate(df_data['ngram']): # loop through each song
    if i % 25 == 0:
        print('Working on song ', i, ' of ', song_count, ' -- Song:', [df_data['song'][i]])
    collision_list = dict() # We hold potential collisions here until we can determine if they are a result of a duplicate/remix song
    unique_list = dict() # We hold things here for now to avoid comparing a song to itself (repeating chorus, etc.)
    for j, n in enumerate(song): # Loop through each ngram
        if n not in master_list: # Any repeated ngrams _within_ the song will overwrite the previous one.  This is OK.
            unique_list[n] = {'index':i, 'artist': [df_data['artist'][i]], 'song': [df_data['song'][i]], 'no_stop': df_data['no_stop'][i], 'ngram':n, 'count':1}
        else:
            # Increment the count, append the artist and song to the original
            # This approach seems redundant, but I'm not sure how to access the dict that triggered the else
#             print('Found ngram collision! Artists: {} / {}, Song: {} / {}, ngram:{}'.format(master_list[n]['artist'], df_data['artist'][i], master_list[n]['song'], df_data['song'][i], n))
            split = n.split(' ')
            found_collision = False
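            # This is a collision. But is it relevant?
            for k, prev_song in enumerate(master_list[n]['song']):
                # 'no_stop' is stored as a single string (the lyrics of the song that first
                # introduced this ngram), so indexing it with [-1] would return one character
                saved_no_stop = master_list[n]['no_stop']
                # The last song and artist we saw with this ngram
                saved_title = master_list[n]['song'][-1]
                saved_artist = master_list[n]['artist'][-1]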
                # TODO do this for the entire song text.  Slow, but we'll see. Maybe use quick sequence matcher?
                no_stop_similarity = how_similar(df_data['no_stop'][i], saved_no_stop)
                title_similarity = how_similar(df_data['song'][i], saved_title)
                artist_similarity = how_similar(df_data['artist'][i], saved_artist)
                
                
                if  (
                      df_data['artist'][i] not in saved_artist and # Is one in the other?
                      saved_artist not in df_data['artist'][i] and
                    # TODO This needs to be generalized for different ngram lengths
                    # Is the match a repeating (pair of) words? This works better with an even number of ngrams.  
                    # This filters 'oooh ahhh's and 'la la la la's
                      not (split[0:1] == split[2:3]) and  
                      no_stop_similarity < 0.5 and 
                      title_similarity < 0.5 and 
                      artist_similarity < 0.5
                    ):
                        
                    collision_list[n] = master_list[n]
                    collision_list[n]['artist'].append(df_data['artist'][i])
                    collision_list[n]['song'].append(df_data['song'][i])
                    collision_list[n]['count'] += 1
                    found_collision = True
                    break  # For now we will just record that there is a connection, not all of the connections
#                 else:
#                     print('Our filter kicked this out! Not saved! Similarity was ', song_similarity)
            
    if collision_list:
        print('Collision(s) found: ', collision_list.keys())
    # We have extracted all of the new ngrams, merge it into the master list
    master_list = {**master_list, **unique_list, **collision_list}
            
# TODO --> If a song has more than ... 15? 20? matching ngrams, throw the whole song out.  It's a remix or an alternate version.  


Working on song  0  of  56673  -- Song: ['004']
Working on song  25  of  56673  -- Song: ['1000 OClock']
Working on song  50  of  56673  -- Song: ['100 Percent of Something']
Working on song  75  of  56673  -- Song: ['100 Thousand Sold Pt 2']
Working on song  100  of  56673  -- Song: ['10 Bricks']
Working on song  125  of  56673  -- Song: ['1 2 3']
Working on song  150  of  56673  -- Song: ['1 2 Many']
Working on song  175  of  56673  -- Song: ['15 After Da Hour']
Collision(s) found:  dict_keys(['1 2 3 4 5 6', '2 3 4 5 6 7', '3 4 5 6 7 8'])
Collision(s) found:  dict_keys(['7 8 9 10 11 12'])
Working on song  200  of  56673  -- Song: ['187 Dance']
Working on song  225  of  56673  -- Song: ['1 2 3']
Working on song  250  of  56673  -- Song: ['1nce Again']
Working on song  275  of  56673  -- Song: ['First Brick']
Collision(s) found:  dict_keys(['1 2 3 4 5 6', '2 3 4 5 6 7', '3 4 5 6 7 8'])
Working on song  300  of  56673  -- Song: ['202088']
Working on song  325  of  56673  -- Song: ['Twen
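Stripped of the similarity filters, the core idea of the cell above is one dict lookup per ngram instead of comparing every pair of songs. A compact sketch of just that part, with a hypothetical function name (`shared_ngrams`) and made-up input:

```python
from collections import defaultdict

def shared_ngrams(songs):
    """Map each ngram to the set of artists using it, keeping only shared ones.

    `songs` is a list of (artist, ngram_list) pairs.
    """
    seen = defaultdict(set)
    for artist, grams in songs:
        for gram in set(grams):  # a repeated chorus counts once per song
            seen[gram].add(artist)
    # An ngram only links artists if more than one distinct artist used it
    return {gram: artists for gram, artists in seen.items() if len(artists) > 1}

# Made-up data, for illustration only
songs = [
    ("Artist A", ["can kick yes", "kick yes can"]),
    ("Artist B", ["can kick yes", "well gone go"]),
    ("Artist A", ["well gone go"]),
]
print(shared_ngrams(songs))
```

Keying on the artist rather than the song means an artist reusing their own line never shows up as a collision. Remixes and guest features still need the extra filters.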

In [None]:
how_similar('1 2 Step Remix Fingazz Version', '1 2 Step')
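The ratio for this pair comes out at roughly 0.42 (8 matching characters, counted twice, over 38 total), which is under the 0.5 cut-off used in the filter. A long remix suffix can therefore slip past a pure ratio test. One possible workaround is to also check whether one title contains the other. This is a sketch: `same_title` and its threshold are my own assumptions.

```python
from difflib import SequenceMatcher

def same_title(a, b, threshold=0.5):
    """Treat two titles as the same song if one contains the other or they are close."""
    a, b = a.lower().strip(), b.lower().strip()
    if a and b and (a in b or b in a):
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold

ratio = SequenceMatcher(None, '1 2 Step Remix Fingazz Version', '1 2 Step').ratio()
print(round(ratio, 3))
print(same_title('1 2 Step Remix Fingazz Version', '1 2 Step'))
```

The containment check is blunt: a very short title will match any longer title that happens to include it, so a minimum length may be needed.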

In [None]:
matches = [[x, master_list[x]['artist'], master_list[x]['song']] for x in master_list if len(master_list[x]['artist'])>1]
for o in matches:
    print(o, '\n')

In [None]:
matches_dict = defaultdict(list)
for o in matches:
    new_key = (', '.join(o[1]))
    matches_dict[new_key] = o
    
print(len(matches_dict))

for m in matches_dict:
    print(matches_dict[m], '\n')
    



In [None]:
print(len(matches))
print(len(matches_dict))

os.makedirs('matches', exist_ok=True)  # open() will not create the directory for us
with open('matches/raw_matches.txt', 'w+') as output:
    for o in matches:
        output.write(str(o) + '\n')

with open('matches/by_artist_matches.txt', 'w+') as output:
    for m in matches_dict:
        output.write(str(matches_dict[m][0]) + '\n')#+ ' --> ' + str(matches_dict[m]) + '\n')

    ['wa da da dang wa da da da dang', ['Brother Ali', 'Cormega f Big Daddy Kane Grand Puba Kool DJ Red Alert KRSOne Parrish Smith'], ['Nine Double Em', 'Fresh']] 
    ['got clean underwear somebody say oh yeah oh yeah', ['Boot Camp Clik', 'Afroman'], ['Down by Law', 'Dopefiend']] 

## TODO
1. ~~If one artist's name is in another's, throw out the match~~
1. Currently checking if an ngram has 3 identical 2-word groups. Below is ideal
1. If an ngram repeats the same phrase more than len(ngram)/2 times, throw it out
1. ~~Checked for number words in ngram. This doesn't take into account unique patterns of two words.  Try again for the previous goal~~
1. ~~Lower ngram length.  With stopwords removed, 8 is too high.~~
1. 6 is still too high. "Yay though I walk through the Shadow of Death" comes up 34 times but is too short. Don't underestimate the stopwords.
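One possible reading of the repetition item above, as a sketch: call an ngram repetitive when it is built from only a few distinct words. The name `is_repetitive` and the "at most half distinct" cut-off are my assumptions and would need checking against real matches.

```python
def is_repetitive(gram):
    """True if at most half of the words in the ngram are distinct."""
    words = gram.split()
    return len(set(words)) <= len(words) / 2

print(is_repetitive("la la la la"))                     # one distinct word out of four
print(is_repetitive("oooh ohhh oooh ohhh oooh ohhh"))   # two distinct words out of six
print(is_repetitive("henny 5th gone take pick pausin")) # all six words distinct
```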

In [None]:
# For each song, generate a dict of ngrams keyed by n, where x <= n < y for rng=(x, y)
def dict_ngram(lyrics, rng):
    res = {}
    for i in range(*rng):
        res[i] = ngram(i, lyrics)
    return res

# print(dict_ngram(tribe, rng=(3, 8) ))
# print(dict_ngram(sage, rng=(3, 8) ))
# NOTE: `tribe` and `sage` are the test strings commented out near the top; uncomment them first
tribe_dict = dict_ngram(tribe, rng=(2, 8))
sage_dict = dict_ngram(sage, rng=(2, 8))

print(tribe_dict)
# TODO If we are doing this with ngrams of multiple lengths,
# we don't want shorter ones that are a subset of the longer ones.
# Find a way to only keep the longest ngram
# def longest_ngram(dict_x, dict_y, rng):
#     for i in reversed(range(*rng)):
#         print(i, set(dict_x[i]).intersection(dict_y[i]))
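A sketch of one way to keep only the longest shared ngrams, assuming both dicts have the shape `dict_ngram` returns and that a larger key always means longer grams. `longest_shared` is a hypothetical name, and the sample dicts are made up.

```python
def longest_shared(dict_x, dict_y):
    """Return shared ngrams, dropping any that sit inside a longer shared ngram.

    dict_x and dict_y map a length key to a list of ngrams.
    """
    kept = []
    # Walk from the longest grams down, so longer matches are recorded first
    for key in sorted(set(dict_x) & set(dict_y), reverse=True):
        for gram in sorted(set(dict_x[key]) & set(dict_y[key])):
            # Pad with spaces so we only match on whole-word boundaries
            padded = ' {} '.format(gram)
            if not any(padded in ' {} '.format(longer) for longer in kept):
                kept.append(gram)
    return kept

# Made-up data, for illustration only
x = {2: ["can i", "i kick", "kick it", "yes you"], 3: ["can i kick", "i kick it"]}
y = {2: ["can i", "i kick", "it yes", "yes you"], 3: ["can i kick", "kick it to"]}
print(longest_shared(x, y))
# ['can i kick', 'yes you']
```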
    


In [None]:
# Testing out the graphing library
import matplotlib.pyplot as plt
%matplotlib inline

G = nx.Graph()

edges = []
for i in range(2, 8):
    shared = set(tribe_dict[i]).intersection(sage_dict[i])
    for edg in shared:
        print('Edge for {}gram'.format(i), edg)
        # edg is already a space-joined string; joining it again would split it into characters
        edges.append(edg)

print('All edges \n', edges)
for edg in edges:
#     G.add_edge('tribe', 'sage')
    # A plain Graph keeps one edge per pair of nodes, so each call overwrites the previous lyric
    G.add_edge('tribe', 'sage', lyric=edg)

nx.draw(G)