# Songs Dataset

Some useful links:
- [Genius API documentation](https://docs.genius.com/)
- [ChatGPT API documentation](https://openai.com/blog/introducing-chatgpt-and-whisper-apis)

## 1. Import Libraries & Authentication

In [None]:
# Download external libraries
!pip install transformers

In [None]:
# Import necessary libraries
import pandas as pd
from google.colab  import files
from google.cloud  import bigquery
from google.oauth2 import service_account
from tqdm          import tqdm

# Upload authentication key
gc_key = files.upload()

Saving songs-dataset-key.json to songs-dataset-key.json


In [None]:
# Define access token
access_token = "YFvVdTqoCGAr1Ry8YNQZvbRMQJirr59PhZJ2vC5gEgQjnybUuLqPvgxVGlLKOJPT"

# Set google bigquery parameters and client
key_path    = "./songs-dataset-key.json"
credentials = service_account.Credentials.from_service_account_file(key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"])
client      = bigquery.Client(credentials=credentials, project=credentials.project_id)
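
Hard-coding the Genius access token in a shared notebook makes it easy to leak. A minimal sketch of one alternative, assuming the token has been exported beforehand in an environment variable called `GENIUS_ACCESS_TOKEN` (a hypothetical name chosen here, not something the Genius API prescribes):

```python
import os


def load_access_token(var_name="GENIUS_ACCESS_TOKEN"):
    """Return the token stored in the environment variable `var_name`.

    The default variable name is an assumption made for this sketch.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token
```

With this in place, the cell above could use `access_token = load_access_token()` instead of a literal string.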

## 2. Import Data

### 2.1 Artist Data

In [None]:
# Set SQL query to get all data from artists database
query = """
        SELECT *
        FROM `songs-dataset.data.artists`
        """

# Get queried data as pandas dataframe (might take a while)
df_artists = client.query(query).to_dataframe()

# Print number of artists
print(f"Number of artists in dataset: {df_artists.shape[0]:,}")

Number of artists in dataset: 915,850


### 2.2 Songs Data

In [None]:
# Set SQL query to get all data from songs database
query = """
        SELECT *
        FROM `songs-dataset.data.songs`
        """

# Get queried data as pandas dataframe (might take a while)
df_songs = client.query(query).to_dataframe()

# Print number of songs
print(f"Number of songs in dataset: {df_songs.shape[0]:,}")

Number of songs in dataset: 2,314,858


## 3. Explore Data

In [None]:
# Print variable types and names for artists
df_artists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 915850 entries, 0 to 915849
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   artist_name       915850 non-null  object 
 1   alternate_name    96563 non-null   object 
 2   artist_url        915850 non-null  object 
 3   artist_id         915850 non-null  Int64  
 4   place_birth       915850 non-null  object 
 5   place_live        915850 non-null  object 
 6   place_death       915850 non-null  object 
 7   date_birth        915850 non-null  object 
 8   date_death        915850 non-null  object 
 9   gender            915850 non-null  object 
 10  race              915850 non-null  object 
 11  is_verified       915850 non-null  boolean
 12  about_artist      184210 non-null  object 
 13  genius_followers  915850 non-null  Int64  
 14  facebook_name     101509 non-null  object 
 15  instagram_name    189061 non-null  object 
 16  twitter_name      10

In [None]:
# Print variable types and names for songs
df_songs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2314858 entries, 0 to 2314857
Data columns (total 11 columns):
 #   Column              Dtype  
---  ------              -----  
 0   song_id             Int64  
 1   artist_id           Int64  
 2   album_id            Int64  
 3   feature_artists_id  object 
 4   url                 object 
 5   language            object 
 6   release_date        object 
 7   title               object 
 8   lyrics_state        object 
 9   has_lyrics          boolean
 10  lyrics              object 
dtypes: Int64(3), boolean(1), object(7)
memory usage: 187.6+ MB


In [None]:
# Set variables for query (change this to play around)
artist_name = "Beyoncé"

# Example: query information for Beyonce
df_artists.query(f'artist_name=="{artist_name}"')

Unnamed: 0,artist_name,alternate_name,artist_url,artist_id,place_birth,place_live,place_death,date_birth,date_death,gender,race,is_verified,about_artist,genius_followers,facebook_name,instagram_name,twitter_name
278502,Beyoncé,Beyoncé Knowles / Sasha Fierce / B.B. Homemake...,https://genius.com/artists/Beyonce,498,,,,,,,,False,Beyoncé Giselle Knowles-Carter (born September...,3969,beyonce,beyonce,Beyonce


In [None]:
# Get artist id for Beyonce
artist_id = df_artists.query(f'artist_name=="{artist_name}"').artist_id.values[0]

# See list of Beyonce songs
df_songs.query(f'artist_id=={artist_id}', engine="python")

Unnamed: 0,song_id,artist_id,album_id,feature_artists_id,url,language,release_date,title,lyrics_state,has_lyrics,lyrics
183463,5822137,498,662831,[],https://genius.com/Beyonce-at-last-karmatronic...,en,February 2011,At Last (Karmatronic Remix),complete,True,At last\nMy love has come along\nMy lonely day...
183465,3014456,498,0,[],https://genius.com/Beyonce-love-drought-and-sa...,en,2017,Love Drought & Sandcastles (Live from the 2017...,complete,True,Pt. 1 Love Drought\n\n[Intro: Lemonade Speech]...
185172,4321583,498,505314,[],https://genius.com/Beyonce-work-it-out-azzas-n...,en,"June 11, 2002",Work It Out (Azza’s Nu Soul Mix),complete,True,[Intro]\nBeyoncé\nIt's the Azza y’all\nLet's w...
185173,5120905,498,588283,[],https://genius.com/Beyonce-green-light-hba-rem...,en,"January 1, 2012",Green Light (HBA Remix),complete,True,[Intro: Beyoncé]\nGreen light\n\n[Verse: Beyon...
191382,2715227,498,0,[1421],https://genius.com/Beyonce-freedom-2016-bet-aw...,en,"June 26, 2016",FREEDOM (2016 BET Awards),complete,True,[Intro: Martin Luther King Jr.]\nWhen the arch...
...,...,...,...,...,...,...,...,...,...,...,...
480301,5745509,498,0,[],https://genius.com/Beyonce-hold-my-beer-lyrics,,,Hold My Beer,unreleased,True,
481358,4585892,498,445897,[],https://genius.com/Beyonce-upgrade-u-live-lyrics,en,"November 26, 2010",Upgrade U (Live),complete,True,"[Intro: JAY-Z & (Beyoncé)]\nHehehe! Yeah, B!\n..."
485334,8219796,498,917339,[],https://genius.com/Beyonce-im-that-girl-lyrics,en,"July 29, 2022",I’M THAT GIRL,complete,True,"[Intro]\nPlease, motherfuckers ain't stop—, pl..."
511060,8467972,498,960143,[],https://genius.com/Beyonce-cuff-it-mixed-lyrics,en,"October 7, 2022",CUFF IT (Mixed),complete,True,[Verse 1]\nI feel like fallin' in love (Fallin...


In [None]:
# Print lyrics for "Single Ladies (Put a Ring on It)"
print(df_songs.query(f'artist_id=={artist_id} & title=="Single Ladies (Put a Ring on It)"', engine="python").lyrics.values[0])

[Produced by Beyoncé, The-Dream & Tricky Stewart]

[Intro]
All the single ladies (All the single ladies)
All the single ladies (All the single ladies)
All the single ladies (All the single ladies)
All the single ladies
Now put your hands up

[Verse 1]
Up in the club (club), just broke up (up)
I’m doing my own little thing
Decided to dip (dip), but now you wanna trip (trip)
'Cause another brother noticed me
I’m up on him (him), he up on me (me)
Don’t pay him any attention
Cried my tears (tears), for three good years (years)
You can’t be mad at me
[Chorus]
'Cause if you like it, then you shoulda put a ring on it
If you like it, then you shoulda put a ring on it
Don’t be mad once you see that he want it
If you like it, then you shoulda put a ring on it
Whoa, oh, oh, oh, oh-oh, oh, oh, oh, oh, oh, oh
Whoa, oh, oh, oh, oh-oh, oh, oh, oh, oh, oh, oh
'Cause if you like it, then you shoulda put a ring on it
If you like it, then you shoulda put a ring on it
Don’t be mad once you see that he wan

In [None]:
# Merge artist data with number of songs into artists dataframe
df_artists = pd.merge(df_artists, df_songs[["artist_id", "song_id"]].groupby(["artist_id"], as_index=False).count().rename(columns={"song_id": "songs"}), on="artist_id", how="left")

## 4. Retrieving Data (TODO for Mike and Daniel)

# New section

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
df_artists.sort_values("songs", ascending=False).head(50)

Unnamed: 0,artist_name,alternate_name,artist_url,artist_id,place_birth,place_live,place_death,date_birth,date_death,gender,race,is_verified,about_artist,genius_followers,facebook_name,instagram_name,twitter_name,songs
744832,The Grateful Dead,,https://genius.com/artists/The-grateful-dead,21900,,,,,,,,False,Amidst the growing counter-culture scene in th...,134,,,,2331.0
101652,Juice WRLD,JuiceTheKidd / Jarad “Juice WRLD” Higgins / Ja...,https://genius.com/artists/Juice-wrld,1237094,,,,,,,,True,"Jarad Anthony Higgins (December 2, 1998 – Dece...",6465,juiceworldddd,juicewrld999,JuiceWorlddd,1384.0
92155,Holy Bible (KJV),,https://genius.com/artists/Holy-bible-kjv,31651,,,,,,,,False,The [King James Version](https://www.britannic...,65,,,,1191.0
274341,La Bible de Jérusalem,,https://genius.com/artists/La-bible-de-jerusalem,82086,,,,,,,,False,###La Bible est un ensemble de textes considér...,2,,,,1189.0
155230,John Debney,John Cardon Debney,https://genius.com/artists/John-debney,160692,,,,,,,,False,John Cardon Debney is an American composer and...,1,pages/John-debney,,johndebney,1095.0
155761,Johnny Cash,John R. Cash / J.R. Cash,https://genius.com/artists/Johnny-cash,1167,,,,,,,,False,Johnny Cash (26 February 1932 – 12 September 2...,238,,,,1055.0
234049,Lorne Balfe,,https://genius.com/artists/Lorne-balfe,522601,,,,,,,,False,Lorne Balfe is a Grammy Award-winning composer...,3,lornebalfemusic,lornebalfe,Lornebalfe,997.0
274475,Лев Толстой (Leo Tolstoy),,https://genius.com/artists/Leo-tolstoy,46632,,,,,,,,False,"Count Lev Nikolayevich Tolstoy, usually referr...",4,,,,991.0
92153,Hans Zimmer,Hans Florian Zimmer,https://genius.com/artists/Hans-zimmer,36539,,,,,,,,False,[Hans Zimmer](https://en.wikipedia.org/wiki/Ha...,64,hanszimmer,,HansZimmer,944.0
163502,The Rolling Stones,,https://genius.com/artists/The-rolling-stones,774,,,,,,,,False,"Formed in London in 1962, The Rolling Stones l...",441,therollingstones,therollingstones,RollingStones,938.0


In [None]:
!pip install wikipedia-api

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import wikipediaapi

wiki_wiki = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI)
page_py   = wiki_wiki.page('Jay-Z')

In [None]:
page_py.summary

"Shawn Corey Carter (born December 4, 1969), known professionally as Jay-Z, is an American rapper, record producer, and entrepreneur. Declared the greatest rapper of all time by Billboard, he has been central to the creative and commercial success of artists including Kanye West, Rihanna, and J. Cole. He is the founder and chairman of entertainment company Roc Nation, and was the president and chief executive officer of Def Jam Recordings from 2004 to 2007.Born and raised in New York City, Jay-Z began his musical career in the late 1980s; he co-founded the record label Roc-A-Fella Records in 1995 and released his debut studio album Reasonable Doubt in 1996. The album was released to widespread critical success, and solidified his standing in the music industry. He went on to release twelve additional albums, including the acclaimed albums The Blueprint (2001), The Black Album (2003), American Gangster (2007), and 4:44 (2017). He also released the full-length collaborative albums Watch 

In [None]:
context = page_py.summary

In [None]:
question = "Is Beyonce alive?"
qa_model(question = question, context = context)

{'score': 0.14341650903224945,
 'start': 2289,
 'end': 2323,
 'answer': 'she became the first female artist'}

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/496M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

In [None]:
artist = "Jay-Z"
QA_input = {
    'question': f"What is {artist}'s gender?",
    'context': page_py.summary
}
res = nlp(QA_input)

In [None]:
res

{'score': 0.33680063486099243,
 'start': 0,
 'end': 18,
 'answer': 'Shawn Corey Carter'}
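
Both extractive QA calls above return an answer even when the context does not contain one (scores of about 0.14 and 0.34). A small sketch of a guard that only keeps confident answers; the 0.5 threshold is an arbitrary assumption and would need tuning on real examples:

```python
def accept_answer(result, min_score=0.5):
    """Return the answer text if the pipeline score reaches `min_score`, else None.

    `result` is a dict shaped like the question-answering outputs shown above.
    The default threshold is an assumption, not a recommended value.
    """
    if result.get("score", 0.0) >= min_score:
        return result.get("answer")
    return None
```

For the Jay-Z gender question this returns `None`, which is easier to handle downstream than a wrong span such as the artist's birth name.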

In [None]:
import requests
from bs4 import BeautifulSoup

artist = "twenty one pilots"
res = requests.get(f"https://en.wikipedia.org/wiki/{artist}")

In [None]:
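def artist_lookup(artist):
  # Fetch the artist's Wikipedia page and locate its infobox table
  res = requests.get(f"https://en.wikipedia.org/wiki/{artist}")
  soup = BeautifulSoup(res.text)
  infobox = soup.find_all("table", class_={"infobox"})
  # Return the rows of the first infobox (same as the completed version below)
  return infobox[0].tbody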

# BeautifulSoup: returning selected info

In [None]:
import requests
from bs4 import BeautifulSoup

def artist_lookup(artist):
  res = requests.get(f"https://en.wikipedia.org/wiki/{artist}")
  soup = BeautifulSoup(res.text)
  infobox = soup.find_all("table", class_={"infobox"})
  return infobox[0].tbody

artist_keys = ["Born", "Genres", "Died"]
band_keys = ["Origin", "Genres", "Members"]
band_key = ["Origin"]
artist_list = ["twenty one pilots","Beyonce", "Michael Jackson", 'blink 182', 'john lennon']

band_dict = {x:{tr.th.text: tr.td.text for tr in artist_lookup(x) if tr.td and tr.th and tr.th.text in band_keys} for x in artist_list}
band_dict

{'twenty one pilots': {'Origin': 'Columbus, Ohio, U.S.',
  'Genres': '\nAlternative rock\nalternative hip hop\nelectropop\nrap rock\nindie pop\npop rock\nemo[1] (early)\n',
  'Members': '\nTyler Joseph\nJosh Dun\n'},
 'Beyonce': {'Genres': '\nR&B\npop\nhip hop\n'},
 'Michael Jackson': {'Genres': 'Popsoulfunkrhythm and bluesrockdiscopost-discodance-popnew jack swing'},
 'blink 182': {'Origin': 'Poway, California, U.S.',
  'Genres': '\nPop-punk\nalternative rock\npunk rock\nskate punk\n',
  'Members': '\nMark Hoppus\nTom DeLonge\nTravis Barker\n'},
 'john lennon': {'Genres': 'Rockpopexperimental'}}
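
The raw infobox values above still carry newlines and citation markers such as `[1]`. A minimal cleaning sketch, assuming each value is a plain string like the ones in `band_dict` (the function name is made up for illustration):

```python
import re


def clean_infobox_value(raw):
    """Split a raw infobox cell into a list of clean entries."""
    # Drop citation markers such as "[1]"
    text = re.sub(r"\[\d+\]", "", raw)
    # Cells that hold an HTML list come back newline-separated
    parts = [part.strip() for part in text.split("\n")]
    # Discard the empty strings left by leading/trailing newlines
    return [part for part in parts if part]
```

This cannot recover values that were already fused at extraction time (for example `'Rockpopexperimental'`). For those, extracting with `tr.td.get_text("\n")` instead of `tr.td.text` may help, since it inserts a separator between child elements.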

In [None]:
artist_lookup("twenty one pilots")
{tr.th.text: tr.td.text for tr in artist_lookup("twenty one pilots") if tr.td and tr.th}

{'Origin': 'Columbus, Ohio, U.S.',
 'Genres': '\nAlternative rock\nalternative hip hop\nelectropop\nrap rock\nindie pop\npop rock\nemo[1] (early)\n',
 'Years active': '2009–present',
 'Labels': '\nFueled by Ramen\nElektra\nAtlantic\n',
 'Members': '\nTyler Joseph\nJosh Dun\n',
 'Past members': '\nNick Thomas\nChris Salih\n',
 'Website': 'twentyonepilots.com'}

In [None]:
band_dict = {tr.th.text: tr.td.text for tr in infobox[0].tbody if tr.td and tr.th}
band_dict


{'Origin': 'Columbus, Ohio, U.S.',
 'Genres': '\nAlternative rock\nalternative hip hop\nelectropop\nrap rock\nindie pop\npop rock\nemo[1] (early)\n',
 'Years active': '2009–present',
 'Labels': '\nFueled by Ramen\nElektra\nAtlantic\n',
 'Members': '\nTyler Joseph\nJosh Dun\n',
 'Past members': '\nNick Thomas\nChris Salih\n',
 'Website': 'twentyonepilots.com'}

# script to return artist/band dict

In [None]:
!pip install wptools
import requests
import openpyxl
import pandas as pd
import wptools
import string


# List of bands
band_df = pd.read_excel('bands.xlsx')
band_results = ''

# wiki api needed to make band member list
S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"

# initiate band dictionary
bands_dict = {}

# define function that returns information on a given individual
def get_wikidata(artist_name):

    page = wptools.page(artist_name)
    page.get_parse()
    infobox = page.data['infobox']

    if 'name' in infobox:
        a = infobox['name'].translate(str.maketrans('', '', string.punctuation))
    elif 'birth_name' in infobox:
        a = infobox['birth_name'].translate(str.maketrans('', '', string.punctuation))
    else:
        a = None

    if 'birth_place' in infobox:
        b = infobox['birth_place'].translate(str.maketrans('', '', string.punctuation))
    else:
        b = None

    if 'birth_date' in infobox:
        c = infobox['birth_date']
    else:
        c = None

    if 'death_place' in infobox:
        d = infobox['death_place'].translate(str.maketrans('', '', string.punctuation))
    else:
        d = None

    if 'death_date' in infobox:
        e = infobox['death_date']
    else:
        e = None

    data_dict = {"name"     : {a},
               "birth_place": {b},
               "birth_date" : {c},
               "death_place": {d},
               "death_date" : {e}}

    return data_dict

# define huge function to return band member info + band genre
def wiki_band(band):
    # first part returns member list of band given band/name (past and current, if indeed a band of more than 1 person)
    PARAMS = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": f"Category:{band} members",
        "cmlimit": 500
    }
    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    MEMBERS = [member['title'] for member in DATA['query']['categorymembers']]

    # define local sub-dictionary to hold information on a given band
    bnd = {}

    # this first part runs if members list is not empty (i.e. band containing more than 1 person)
    if MEMBERS != []:
        # return genre of band
        page = wptools.page(band)
        page.get_parse()
        bnd['genre']= page.data['infobox']['genre'].translate(str.maketrans('', '', string.punctuation))
        # loop to return info of each band member in a given band in the form of a dictionary
        bnd['members']= []
        for k in MEMBERS:
            if k[:9] == "Category:":
                k_n = k[9:]
                try:
                    page = wptools.page(k_n)
                    page.get_parse()
                    infobox = page.data['infobox']
                    if infobox != None:
                        try:
                            bnd['members'].append(get_wikidata(k_n))
                        except LookupError:
                            print('not avail')
                except LookupError:
                    print("not avail")
            else:
                try:
                    page = wptools.page(k)
                    page.get_parse()
                    infobox= page.data['infobox']
                    if infobox != None:
                        try:
                            bnd['members'].append(get_wikidata(k))
                        except LookupError:
                            print('not avail')
                except LookupError:
                    print("not avail")
    # this part runs if members list is empty (i.e. band containing 1 person, artist)
    else:
        bnd['artist']= []
        bnd['artist'].append(get_wikidata(band))
        # returns genre of artist
        page = wptools.page(band)
        page.get_parse()
        infobox = page.data['infobox']
        if 'module' in infobox.keys():
            infobox = page.data['infobox']['module']
            end = '}}'
            start='}}'
            end_index= infobox.rfind(end)
            snd_to_last = infobox.rfind(end, 0, end_index)
            third_to_last= infobox.rfind(end, 0, snd_to_last)
            results = infobox[infobox.find(start) + len(start):third_to_last]
            bnd['genre']= results.translate(str.maketrans('', '', string.punctuation))
        elif 'genre' in infobox.keys():
            bnd['genre']= page.data['infobox']['genre'].translate(str.maketrans('', '', string.punctuation))
    return bnd

bands_name = [x for x in list(band_df.bands) if str(x) != 'nan']

for x in bands_name:
    bands_dict[x] = wiki_band(x)

In [None]:
bands_dict

{'The Beatles': {'genre': 'hlistRock musicRockPop musicpopbeat musicbeatPsychedelic musicpsychedelia',
  'members': [{'name': {'Pete Best'},
    'birth_place': {'Madras British India'},
    'birth_date': {'{{Birth date and age|df|=|y|1941|11|24}}'},
    'death_place': {None},
    'death_date': {None}},
   {'name': {'Norman Chapman'},
    'birth_place': {None},
    'birth_date': {'1937'},
    'death_place': {None},
    'death_date': {'July 1995 (aged 58)'}},
   {'name': {'George Harrison'},
    'birth_place': {'Liverpool England'},
    'birth_date': {'{{birth date|df|=|yes|1943|02|25}}'},
    'death_place': {'Los Angeles California US'},
    'death_date': {'{{death date and age|df|=|yes|2001|11|29|1943|2|25}}'}},
   {'name': {'John Lennon'},
    'birth_place': {'Liverpool England'},
    'birth_date': {'{{Birth date|1940|10|9|df|=|yes}}'},
    'death_place': {'New York City New York US'},
    'death_date': {'{{Death date and age|1980|12|8|1940|10|9|df|=|yes}}'}},
   {'name': {'Paul McCar
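
The `birth_date` and `death_date` values above are raw wikitext templates. A rough sketch for turning them into dates, assuming the first three all-digit fields of the template are year, month and day (true for the examples shown, but not guaranteed for every template). The values in `bands_dict` are one-element sets, so the string has to be pulled out first, e.g. with `next(iter(value))`.

```python
from datetime import date


def parse_date_template(raw):
    """Return the first year/month/day triple of a wikitext date template.

    Returns None for missing values or strings without three numeric
    fields, such as '1937' or 'July 1995 (aged 58)'.
    """
    if raw is None:
        return None
    # Strip the surrounding braces and split the template into its fields
    fields = [field.strip() for field in raw.strip("{}").split("|")]
    numbers = [int(field) for field in fields if field.isdigit()]
    if len(numbers) < 3:
        return None
    year, month, day = numbers[:3]
    try:
        return date(year, month, day)
    except ValueError:
        # The three numbers do not form a valid calendar date
        return None
```

For "death date and age" templates the death date comes first, so the first triple is the one wanted here.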

# MusicBrainz

In [None]:
%pip install musicbrainzngs

import musicbrainzngs as mbz
import openpyxl
import pandas as pd

mbz.set_useragent("mikethebike", "0", "http://miketheebike.com/")

'''df_band = pd.read_excel('bands.xlsx')
bands = [x for x in df_band.bands if str(x) !='nan']
'''

def art_return(x):
    res = mbz.search_artists(artist=x)
    artist_id = res['artist-list'][0]['id']
    artist = mbz.get_artist_by_id(artist_id)
    return artist["artist"]

key_list = ['gender','life-span', 'country']
artist_list = ['Bad bunny', 'Beyonce', 'Blink 182']

# Query MusicBrainz once per artist, then keep only the keys of interest
band_dict = {}
for x in artist_list:
    info = art_return(x)
    band_dict[x] = {ar: info[ar] for ar in info if ar in key_list}

band_dict

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


{'Bad bunny': {'gender': 'Male',
  'country': 'PR',
  'life-span': {'begin': '1994-03-10'}},
 'Beyonce': {'gender': 'Female',
  'country': 'US',
  'life-span': {'begin': '1981-09-04'}},
 'Blink 182': {'country': 'GB', 'life-span': {'begin': '2019'}}}
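
Taking the first search hit can pick the wrong artist: the `'Blink 182'` entry above (country `GB`, begin `2019`) does not look like the band from Poway, California. A sketch of a stricter selection, assuming the search result is a dict with an `'artist-list'` of candidates that each carry a `'name'`, as used in `art_return`. The helper names are hypothetical.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase a name, fold accents, and keep only ASCII letters and digits."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "", ascii_only.lower())


def pick_artist(search_result, query):
    """Return the first candidate whose normalized name equals the query's, else None."""
    target = normalize_name(query)
    if not target:
        # Names without ASCII letters (e.g. Cyrillic) cannot be compared this way
        return None
    for candidate in search_result.get("artist-list", []):
        if normalize_name(candidate.get("name", "")) == target:
            return candidate
    return None
```

Returning `None` instead of a doubtful match keeps wrong metadata out of the dataset, at the cost of more missing values.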

In [50]:
#%pip install musicbrainzngs
import musicbrainzngs as mbz

mbz.set_useragent("mikethebike", "0", "http://miketheebike.com/")
artist = mbz.get_artist_by_id('166cb987-3156-4166-b0d3-f1ad5a6c514c', includes=['aliases'])
aliases = artist['artist']['alias-list']

aliases

name_list = []

for x in range(len(aliases)):
  name_list.append(aliases[x]['alias'])

separator = ', '

result = separator.join(name_list)

str(result.lower())

'jaden christopher syre smith, jaden smith, syre'

In [51]:
#### return mbz_names
def return_names(mbid):
  # Fetch the artist with its aliases; 'alias-list' may be missing when there are none
  artist = mbz.get_artist_by_id(mbid, includes=['aliases'])
  aliases = artist['artist'].get('alias-list', [])
  name_list = [alias['alias'] for alias in aliases]
  # Join outside any loop so an empty alias list yields '' instead of a NameError
  return ', '.join(name_list).lower()

In [60]:
mbz_df.query('artist_name == "Jaden"')['alternate_name'].iloc[0]

'SYRE / Jaden Christopher Syre Smith / Jaden Smith'
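
The two alias strings above hold the same names with different separators, casing and order. A small sketch for comparing them as sets, assuming one string is `/`-separated and the other `,`-separated as in these outputs (names that themselves contain a slash or comma would be split incorrectly):

```python
def alias_set(raw, separator):
    """Split an alias string on `separator` into lowercased, stripped names."""
    return {part.strip().lower() for part in raw.split(separator) if part.strip()}


def alias_overlap(slash_separated, comma_separated):
    """Jaccard overlap (0.0 to 1.0) between two alias strings."""
    a = alias_set(slash_separated, "/")
    b = alias_set(comma_separated, ",")
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

An overlap of 1.0 means both sources list exactly the same aliases, which could serve as one signal when matching artists across the two datasets.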

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.0-py3-none-any.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 41.3 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 224.5/224.5 kB 25.0 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 56.0 MB/s eta 0:00:00
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.0


In [None]:
from transformers import pipeline

qa_model = pipeline("question-answering")
question = "Where do I live?"
context = "My name is Merve and I live in İstanbul."
qa_model(question = question, context = context)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Downloading (…)lve/main/config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/261M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

{'score': 0.9538118243217468, 'start': 31, 'end': 39, 'answer': 'İstanbul'}

In [None]:
# TODO:
# 1. Use Wikipedia API to search for artist date of birth, place of birth, years active, place of death and genres.
# 2. Write function to query ChatGPT-3 (or 3.5) and Davinci model for race only.

In [None]:
# Wikipedia idea: let's try to get the formatted information from Wikipedia. Look at
# the page for Jimi Hendrix: https://en.wikipedia.org/wiki/Jimi_Hendrix
# There is a box in the right corner with some formatted information, let's try to
# programmatically retrieve that + the introductory paragraph.

In [None]:
# Define function to query wikipedia API and search for a given name
def get_wikidata(artist_name):
  """
  This function should query the Wikipedia API, search for a string and return a
  dictionary of relevant information.

  Arguments:
    - artist_name : a string with the artist name to be queried
    - others      : potentially some API key and other parameters.

  Outputs:
    - data_dict   : a dictionary with relevant information.

  """

  # Initiate dictionary to hold relevant data
  data_dict = {"place_birth": [],
               "place_live" : [],
               "place_death": [],
               "date_birth" : [],
               "date_death" : [],
               "gender"     : [],
               "other_info" : []}

  # TODO
  # 1. Query wikipedia API and search for artist_name string
  # 2. This query might give you more than 1 result, how do we figure out the right one?
  # 3. If artist is found:
    # 3.1. Retrieve relevant information from formatted box.
    # 3.2. Get textual information from first few paragraphs.
    # 3.3. Put information in list for right dictionary key.
  # 4. If artist is not found return empty dictionary.

  # Things to look out for:
    # 1. make sure you check if the request has returned with status 200 (success) and only continue query if so.
    # 2. some labels might be different (is place of birth always spelled Born? Look at a few examples to get an idea).
    # 3. be careful with names that could refer to more than one person or page.

  # Return dictionary data
  return data_dict
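
TODO steps 1 and 2 (search for the name, then deal with several hits) could start from the MediaWiki search endpoint. A minimal sketch using only the standard library; the `User-Agent` string is a placeholder assumption, and picking the right page among the returned titles is left open:

```python
import json
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(artist_name, limit=5):
    """Build a MediaWiki full-text search URL for `artist_name`."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": artist_name,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def search_titles(artist_name, limit=5):
    """Return candidate page titles, or an empty list if the request fails."""
    # Placeholder User-Agent; replace with something identifying the project
    request = Request(build_search_url(artist_name, limit),
                      headers={"User-Agent": "songs-dataset-notebook/0.1"})
    try:
        with urlopen(request, timeout=10) as response:
            # Only continue when the request came back with status 200
            if response.status != 200:
                return []
            data = json.load(response)
    except URLError:
        return []
    return [hit["title"] for hit in data.get("query", {}).get("search", [])]
```

`search_titles("Jimi Hendrix")` would return a short list of page titles to disambiguate before fetching the infobox.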

# MusicBrainz - Genius function


1. Add test data, Genius and MusicBrainz

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [3]:
import pandas as pd
genius_df = pd.read_csv('genius_data_frame_id.csv')

mbz_df = pd.read_csv('musicbrainz_artists.csv')

In [3]:
gen_name = list(genius_df.artist_name)
# mbz_df has no `name` column (the likely source of the AttributeError); use `artist_name`
mbz_name = list(mbz_df.artist_name)
gen_alt_name = list(genius_df.alternate_name)

# Lowercase cleaned alternate name

In [78]:
def lower_name(row):
    name = str(row['alternate_name'])
    if '/' in name:
        name = name.replace('/', r' ')
    name = name.lower()
    words = name.split()
    unique = list(set(words))
    return ' '.join(unique)
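
`lower_name` deduplicates with `set`, so the word order of its output is arbitrary, and a missing `alternate_name` becomes the string `'nan'`. A sketch of an order-preserving variant that works on the raw value rather than the row (the function name is hypothetical):

```python
def lower_name_ordered(alternate_name):
    """Lowercase, split on '/' and whitespace, and drop repeated words.

    Words keep their first-seen order; non-string input (e.g. NaN)
    gives an empty string.
    """
    if not isinstance(alternate_name, str):
        return ""
    words = alternate_name.replace("/", " ").lower().split()
    # dict.fromkeys keeps insertion order while removing duplicates
    return " ".join(dict.fromkeys(words))
```

It could be applied column-wise, e.g. `genius_df['alternate_name'].map(lower_name_ordered)`, which also avoids the row-wise `apply`.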

In [79]:
from tqdm import tqdm
tqdm.pandas()
genius_df['lower_alt_name'] = genius_df.progress_apply(lambda row: lower_name(row), axis=1)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429675/429675 [00:06<00:00, 65464.74it/s]


In [80]:
mbz_df['lower_alt_name'] = mbz_df.progress_apply(lambda row: lower_name(row), axis=1)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2145300/2145300 [00:37<00:00, 57243.63it/s]


# save updated dataset

In [35]:
genius_df.to_csv('genius_artists_v3.csv', index = False)

# run function for gender


In [378]:
male = ['he', 'him', 'his']
female = ['she', 'her', 'hers']
group = ['they', 'them', 'their', 'theirs']

In [393]:
# define gender function (based off of about_artist text)
def gender_column(row):
    gend_text = str(row['first_five_about_artist'])
    doc = nlp(gend_text)
    # Collect the lowercased pronouns found in the text
    subj_list = [token.text.lower() for token in doc if token.pos_ == "PRON"]
    # No pronouns at all: leave the value missing
    if not subj_list:
        return float('nan')
    has_male = any(elem in male for elem in subj_list)
    has_female = any(elem in female for elem in subj_list)
    if has_male and has_female:
        return 'both'
    if has_male:
        return 'male'
    if has_female:
        return 'female'
    if any(elem in group for elem in subj_list):
        return 'group'
    return 'undetermined'
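
`gender_column` assigns a label as soon as any matching pronoun appears, so a single "he" referring to a producer can decide the result. One possible alternative, sketched here with a plain regex tokenizer instead of spaCy, is a majority vote over pronoun counts. This is a different technique from the function above, not a drop-in equivalent, and the tie rule is an assumption:

```python
import re
from collections import Counter

PRONOUN_GROUPS = {
    "male": {"he", "him", "his"},
    "female": {"she", "her", "hers"},
    "group": {"they", "them", "their", "theirs"},
}


def pronoun_counts(text):
    """Count how many pronouns of each group occur in `text`."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for label, pronouns in PRONOUN_GROUPS.items():
            if word in pronouns:
                counts[label] += 1
    return counts


def majority_gender(text):
    """Return the most frequent pronoun group, or 'undetermined' on ties or no pronouns."""
    ranked = pronoun_counts(text).most_common()
    if not ranked:
        return "undetermined"
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "undetermined"
    return ranked[0][0]
```

Pronouns are only a weak proxy for an artist's gender, so the output is better treated as a hint to verify against MusicBrainz than as ground truth.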


In [288]:
def ff_column(row):
    gend_text = str(row['about_artist'])
    sentences = sent_tokenize(gend_text)
    first_five = sentences[:5]
    first_five_string = ' '.join(first_five)
    return first_five_string

In [395]:
#genius_df = genius_df.drop('new_gender',axis=1)

import spacy
from tqdm import tqdm
import math
import nltk

nltk.download('punkt')
from nltk.tokenize import sent_tokenize
nlp = spacy.load('en_core_web_sm')
genius_df['new_gender'] = genius_df.progress_apply(lambda row: gender_column(row), axis=1)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mnuno\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429675/429675 [52:23<00:00, 136.67it/s]


In [337]:
genius_df.columns = ['artist_name', 'alternate_name', 'artist_url', 'artist_id',
       'place_birth', 'place_live', 'place_death', 'date_birth', 'date_death',
       'gender', 'race', 'is_verified', 'about_artist', 'genius_followers',
       'facebook_name', 'instagram_name', 'twitter_name', 'new_gender',
       'cleaned_artist_name', 'query_id', 'lower_alternate_name',
       'query_fuzzy_id', 'comb_query_id', 'mbz_lower_alternate_name', 'first_five_about_artist']

In [343]:
genius_df['first_five_about_artist'] = genius_df.progress_apply(lambda row: float("nan") if row['first_five_about_artist'] == 'nan' else row['first_five_about_artist'], axis=1)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429675/429675 [00:07<00:00, 56702.53it/s]


# fuzzy match

In [286]:
genius_df['new_gender'].value_counts()

undetermined    407231
male             17924
female            4520
Name: new_gender, dtype: int64

In [36]:
%pip install rapidfuzz
from rapidfuzz import fuzz, process

Defaulting to user installation because normal site-packages is not writeable
Collecting rapidfuzz
  Downloading rapidfuzz-3.1.1-cp39-cp39-win_amd64.whl (1.8 MB)
     ---------------------------------------- 1.8/1.8 MB 56.8 MB/s eta 0:00:00
Installing collected packages: rapidfuzz
Successfully installed rapidfuzz-3.1.1
Note: you may need to restart the kernel to use updated packages.


In [195]:
mbz_df.query('artist_name=="Jelly Roll"')['lower_alt_name']

6936      jellyroll jason jelly roll bradley deford
249299        l. jelly roll david drew d. jellyroll
Name: lower_alt_name, dtype: object

In [266]:
genius_df.query('cleaned_art_name=="jeff laliberte"')

Unnamed: 0,artist_name,alternate_name,artist_url,artist_id,place_birth,place_live,place_death,date_birth,date_death,gender,...,new_gender,matched_name_id,matched_name,matched_id,cleaned_art_name,query_id1,lower_alt_name,query_id2,query_id,mbz_lower_alt_name
1556,Jeff Laliberte,Jeffrey Laliberte,https://genius.com/artists/Jeff-laliberte,1197045,,,,,,,...,undetermined,"tamelier jeffrey, 57ff84e4-7355-481f-ab17-114a...",tamelier jeffrey,57ff84e4-7355-481f-ab17-114abf697528,jeff laliberte,,jeffrey laliberte,,,


In [272]:
jason_man = mbz_df.query('cleaned_art_name=="jeff laliberte"')
process.extract('jeffrey laliberte', mbz_df['cleaned_art_name'],scorer = fuzz.token_sort_ratio)


[('jeffrey kallberg', 78.78787878787878, 1916606),
 ('lee jeffrey', 78.57142857142857, 583481),
 ('jeffrey lai', 78.57142857142857, 1815669),
 ('jeffrey oliver', 77.41935483870968, 190920),
 ('jeffrey miller', 77.41935483870968, 488503)]
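
None of the candidates above is a real match: they score in the high 70s only because they share the token "jeffrey". To illustrate how a token-sort comparison behaves, here is a small sketch that uses the standard library's `difflib` as a stand-in for `fuzz.token_sort_ratio`. The function name `token_sort_ratio_sketch` is hypothetical, and `difflib` scores can differ from rapidfuzz's.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(a: str, b: str) -> float:
    """Compare two names after sorting their whitespace-separated tokens."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    # SequenceMatcher.ratio() lies in [0, 1]; scale to 0-100 like rapidfuzz
    return 100 * SequenceMatcher(None, a_sorted, b_sorted).ratio()


# The same tokens in a different order compare as identical
same_tokens = token_sort_ratio_sketch("jack michael antonoff", "antonoff jack michael")

# A shared first name alone still produces a fairly high score
shared_first_name = token_sort_ratio_sketch("jeffrey laliberte", "jeffrey lai")

print(same_tokens, shared_first_name)
```

This is why a score threshold (the later cells use 50) is only a coarse filter for short names.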

In [255]:
jason_man

Unnamed: 0,id,type,artist_name,alternate_name,gender,country,begin_area_type,begin_area_name,end_area_type,end_area_name,lifespan_begin,lifespan_end,tags,cleaned_art_name,lower_alt_name
33458,46b19f7c-661a-4670-9fbf-e7a85607a069,Person,Jay-P,Jay P / Paul Omiria Epeju,male,UG,City,Kampala,,,1987-01-03,,"[rap, american, hip-hop, hip hop, hiphop, soul...",jay-p,jay epeju omiria p paul
1216898,69de008c-d3c9-4dca-a099-a5ec7b39b79c,Person,Jay-P,,male,US,,,,,,,[nan],jay-p,


In [228]:
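# The first positional argument of value_counts is `normalize`; the truthy string 'nan'
# therefore returned relative frequencies, which is made explicit here
genius_df['about_artist'].value_counts(normalize=True)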


In [218]:
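def return_mbzid1(row):
  """Fuzzy-match a Genius alternate name against short_list; return 'name, id' or 'nan'."""
  if row['lower_alt_name'] != 'nan':
    # extractOne returns (matched choice, score, key); for a pandas Series the key is the index label
    name = process.extractOne(row['lower_alt_name'], short_list, scorer = fuzz.token_sort_ratio)
    matched_name = name[0]
    # Look up by label rather than position, and avoid shadowing the built-in id()
    mbz_id = mbz_df['id'].loc[name[2]]
    comb = f'{matched_name}, {mbz_id}'
  else:
    comb = 'nan'
  return comb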

In [24]:
short_list = mbz_list_df['alternate_name']
genius_df['matched_name_id'] = genius_df.progress_apply(lambda row: return_mbzid1(row), axis = 1)

100%|██████████| 429675/429675 [3:07:39<00:00, 38.16it/s]
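
The run above took more than three hours because every one of the 429,675 rows is scored against every entry of `short_list`. One common way to reduce that cost is blocking: score only the candidates that share at least one token with the query. The sketch below shows the idea in plain Python, with `difflib` standing in for rapidfuzz; `build_token_index`, `best_match` and the toy `mbz_names` list are hypothetical names, not part of the notebook.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_token_index(names):
    """Map each token to the set of positions of the names containing it."""
    index = defaultdict(set)
    for pos, name in enumerate(names):
        for token in name.split():
            index[token].add(pos)
    return index


def best_match(query, names, index):
    """Return (name, score, position) of the best candidate sharing a token, or None."""
    query_tokens = query.split()
    candidates = set()
    for token in query_tokens:
        candidates |= index.get(token, set())
    query_sorted = " ".join(sorted(query_tokens))
    best = None
    for pos in sorted(candidates):
        name_sorted = " ".join(sorted(names[pos].split()))
        score = 100 * SequenceMatcher(None, query_sorted, name_sorted).ratio()
        if best is None or score > best[1]:
            best = (names[pos], score, pos)
    return best


# Toy stand-in for mbz_df['lower_alt_name']
mbz_names = ["antonoff jack michael", "garrett robert r.j.", "omi"]
token_index = build_token_index(mbz_names)

print(best_match("jack m. michael antonoff", mbz_names, token_index))
print(best_match("somi", mbz_names, token_index))  # no shared token, so None
```

A side effect of blocking is that a query such as "somi" gets no candidate at all instead of a near-string such as "omi". Whether that trade-off is acceptable depends on how many true matches differ in every token.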


# separate id and name


In [30]:
# Split on the first comma only; recent pandas versions require n and expand as keyword arguments
genius_df[['matched_name', 'matched_id']] = genius_df['matched_name_id'].str.split(',', n=1, expand=True)


Unnamed: 0,artist_name,alternate_name,artist_url,artist_id,place_birth,place_live,place_death,date_birth,date_death,gender,...,about_artist,genius_followers,facebook_name,instagram_name,twitter_name,new_gender,lower_alt_name,matched_name_id,matched_name,matched_id
0,Josylvio,Yussef Abdelgalil Dowib / Joost Theo Sylvio,https://genius.com/artists/Josylvio,626456,,,,,,,...,"Josylvio, artiestennaam van Joost Theo Sylvio ...",51,,josylvio,Josylvio,undetermined,sylvio joost abdelgalil yussef theo dowib,"sylvio joost abdelgalil yussef theo dowib, 217...",sylvio joost abdelgalil yussef theo dowib,217c844d-dc5c-439d-b27a-a227ff3039bb
1,Jaden,Jaden Syre Smith / Jaden Smith / Jaden Christo...,https://genius.com/artists/Jaden,12232,,,,,,,...,Jaden Christopher Syre Smith is a well-known a...,1346,officialjaden,c.syresmith,jaden,male,jaden christopher smith syre,"jaden christopher smith syre, 166cb987-3156-41...",jaden christopher smith syre,166cb987-3156-4166-b0d3-f1ad5a6c514c
2,Jack Antonoff,Jack M. Antonoff / Jack Michael Antonoff,https://genius.com/artists/Jack-antonoff,264329,,,,,,,...,"Jack Michael Antonoff (born March 31, 1984) is...",326,,jackantonoff,jackantonoff,male,jack m. michael antonoff,"antonoff jack michael, 1155a7a9-a7b4-46ba-ba27...",antonoff jack michael,1155a7a9-a7b4-46ba-ba27-6b6c98be487e
3,Jabari Manwa,Jabari Manwarring / Jabari Chester Manwarring ...,https://genius.com/artists/Jabari-manwa,323655,,,,,,,...,"Jabari Chester Manwarring (born September 3rd,...",74,,jabarimanwa,JabariManwa,male,che$ter jab$ manwarring jabari chester,"manwarring jabari chester, dab626f0-8589-4c71-...",manwarring jabari chester,dab626f0-8589-4c71-93b4-2dc38afe69c7
4,JEON SOMI,SOMI,https://genius.com/artists/Jeon-somi,1120353,,,,,,,...,JEON SOMI (전소미) is a Korean-Canadian singer ba...,85,,somsomi0309,somi_official_,female,somi,"omi, 3b3dccff-9725-4947-9b56-c1f1db789e39",omi,3b3dccff-9725-4947-9b56-c1f1db789e39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429670,Jesse McCartney,,https://genius.com/artists/Jesse-mccartney,1810,,,,,,,...,With the release of Jesse McCartney’s new sing...,40,JesseMcCartney,JesseMcCartney,JesseMcCartney,male,,,,
429671,Jack Garratt,Jack Robert Garratt,https://genius.com/artists/Jack-garratt,285174,,,,,,,...,In 2016 Jack Garratt released his debut album ...,42,jackgarrattmusic,jackgarratt,JackGarratt,undetermined,garratt robert jack,"garrett robert r.j., d2453242-cc03-4abc-a4cc-3...",garrett robert r.j.,d2453242-cc03-4abc-a4cc-39152d193de4
429672,JAHKOY,,https://genius.com/artists/Jahkoy,187600,,,,,,,...,"Jahkoy is a R&B singer from Toronto, CA. He's ...",43,,jahkoy,JAHKOY,undetermined,,,,
429673,Jordin Sparks,Jordin Sparks Thomas / Jordin Brianna Sparks,https://genius.com/artists/Jordin-sparks,10207,,,,,,,...,"Jordin Brianna Sparks (born December 22, 1989)...",44,jordinsparks,jordinsparks,jordinsparks,female,sparks jordin thomas brianna,"sparks jordin jordon brianna, 5508631d-697f-48...",sparks jordin jordon brianna,5508631d-697f-4839-a669-06637e5bcb90
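
The combined value is built as `'{matched_name}, {id}'`, so splitting on the first comma leaves a leading space on the ID unless it is stripped, and would split in the wrong place for any matched name that itself contains a comma. A minimal sketch of a more defensive split, assuming the ID part never contains `', '` (the helper name `split_name_and_id` and the comma-containing example are hypothetical):

```python
def split_name_and_id(combined: str):
    """Split 'name, id' on the last ', ' so commas inside the name are kept."""
    name, sep, mbz_id = combined.rpartition(", ")
    if not sep:
        # No separator found: treat the value as unmatched
        return None, None
    return name, mbz_id


print(split_name_and_id("omi, 3b3dccff-9725-4947-9b56-c1f1db789e39"))
# Hypothetical matched name that contains a comma
print(split_name_and_id("smith, john, 00000000-0000-0000-0000-000000000000"))
print(split_name_and_id("nan"))
```

On a DataFrame column, `str.rsplit(', ', n=1, expand=True)` would be the closest vectorized counterpart.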


# clean art names

In [4]:
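def clean_art(row):
  """Return the artist name lowercased, with single and double quotes removed."""
  n1 = str(row['artist_name'])
  # Quotes are stripped so the name can later be embedded in a quoted mbz_df.query string
  n1 = n1.replace('"', '').replace("'", '')
  return n1.lower()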
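
The same cleaning can be written with a translation table, which removes both quote characters in one pass. This is a sketch in plain Python; `QUOTE_TABLE`, `clean_name` and the sample inputs are illustrative and not taken from the dataset.

```python
# Translation table that deletes double and single quotes
QUOTE_TABLE = str.maketrans("", "", "\"'")


def clean_name(name) -> str:
    """Lowercase a name and drop quote characters, mirroring clean_art."""
    return str(name).translate(QUOTE_TABLE).lower()


print(clean_name('"Weird Al" Yankovic'))
print(clean_name("Guns N' Roses"))
# A missing value becomes the literal string 'nan' after str()
print(clean_name(float("nan")))
```

The last call shows why later cells compare against the string `'nan'`: converting a missing value with `str()` yields that literal.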



In [7]:
from tqdm import tqdm
tqdm.pandas()
mbz_df['cleaned_art_name'] = mbz_df.progress_apply(lambda row: clean_art(row), axis=1)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2145300/2145300 [00:17<00:00, 125845.56it/s]


In [8]:
genius_df['cleaned_art_name'] = genius_df.progress_apply(lambda row: clean_art(row), axis=1)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429675/429675 [00:04<00:00, 106066.04it/s]


# query match


In [87]:
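def return_mbzid(row):
    """Return the MusicBrainz id whose cleaned name exactly matches this row, or 'nan'."""
    mbzq_matches = mbz_df.query(f'cleaned_art_name == "{row["cleaned_art_name"]}"')
    mbz_id = 'nan'  # default when there is no match, a weak match, or an error
    try:
        if len(mbzq_matches) == 1:
            mbz_id = mbzq_matches.id.iloc[0]
        elif len(mbzq_matches) > 1:
            # row['lower_alt_name'] is already a scalar, so it is passed directly (no .iloc[0])
            match_q = process.extractOne(row['lower_alt_name'], mbzq_matches['lower_alt_name'], scorer=fuzz.token_sort_ratio)
            if match_q[1] > 50:
                # match_q[2] is an index label of mbz_df, so look it up with .loc
                mbz_id = mbz_df['id'].loc[match_q[2]]
    except Exception:
        mbz_id = 'nan'
    return mbz_id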


In [159]:
## return match after query if results > 1
def return_mbzid_alt(row):
    """Resolve rows with several exact name matches by fuzzy-matching their alternate names."""
    mbz_id = 'nan'  # default when the row is skipped, unmatched, or raises an error
    if row['query_id1'] == 'nan' and row['lower_alt_name'] != 'nan':
        try:
            mbzq_matches = mbz_df.query(f'cleaned_art_name == "{row["cleaned_art_name"]}"')
            if len(mbzq_matches) > 1:
                match_q = process.extractOne(row['lower_alt_name'], mbzq_matches['lower_alt_name'], scorer=fuzz.token_sort_ratio)
                if match_q[1] > 50:
                    # Return the id, not the alternate name: query_id2 is later looked up against mbz_df['id']
                    mbz_id = mbz_df['id'].loc[match_q[2]]
        except Exception:
            mbz_id = 'nan'
    return mbz_id

In [181]:
genius_df['mbz_lower_alternate_name'] = genius_df.progress_apply(lambda row: 'nan' if row['query_id2'] == 'nan' else mbz_df.query(f"id == '{row['query_id2']}'")['lower_alt_name'].iloc[0],axis=1)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429675/429675 [01:48<00:00, 3977.75it/s]


In [160]:
genius_df['query_id2'] = genius_df.progress_apply(lambda row: return_mbzid_alt(row), axis=1)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429675/429675 [36:56<00:00, 193.89it/s]
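
Each `mbz_df.query(...)` call inside the row-wise functions scans the whole MusicBrainz table, which is a likely reason this pass takes about 37 minutes. A sketch of an alternative, in plain Python: group the IDs by cleaned name once, so that every lookup becomes a dictionary access. The names `mbz_rows`, `ids_by_name` and `exact_match_id` are hypothetical; the sample rows reuse IDs that appear in the outputs of this notebook.

```python
from collections import defaultdict

# Toy stand-in for (cleaned_art_name, id) pairs from mbz_df
mbz_rows = [
    ("jack antonoff", "1155a7a9-a7b4-46ba-ba27-6b6c98be487e"),
    ("jay-p", "46b19f7c-661a-4670-9fbf-e7a85607a069"),
    ("jay-p", "69de008c-d3c9-4dca-a099-a5ec7b39b79c"),
]

# Build the lookup once: cleaned name -> list of MusicBrainz ids
ids_by_name = defaultdict(list)
for cleaned_name, mbz_id in mbz_rows:
    ids_by_name[cleaned_name].append(mbz_id)


def exact_match_id(cleaned_name):
    """Return the id for a unique exact match, else None."""
    ids = ids_by_name.get(cleaned_name, [])
    return ids[0] if len(ids) == 1 else None


print(exact_match_id("jack antonoff"))  # unique match
print(exact_match_id("jay-p"))          # two candidates, so None
print(exact_match_id("jahkoy"))         # not in the toy data, so None
```

Ambiguous names (more than one ID) would still need the fuzzy step on alternate names, but only over the short candidate list.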


In [165]:
# combined query id results
genius_df = genius_df.drop('comb_query_id', axis=1)
genius_df['query_id'] = genius_df.apply(lambda row: row['query_id1'] if row['query_id1'] != 'nan' else row['query_id2'], axis=1)
genius_df

Unnamed: 0,artist_name,alternate_name,artist_url,artist_id,place_birth,place_live,place_death,date_birth,date_death,gender,...,twitter_name,new_gender,matched_name_id,matched_name,matched_id,cleaned_art_name,query_id1,lower_alt_name,query_id2,query_id
0,Josylvio,Yussef Abdelgalil Dowib / Joost Theo Sylvio,https://genius.com/artists/Josylvio,626456,,,,,,,...,Josylvio,undetermined,"sylvio joost abdelgalil yussef theo dowib, 217...",sylvio joost abdelgalil yussef theo dowib,217c844d-dc5c-439d-b27a-a227ff3039bb,josylvio,217c844d-dc5c-439d-b27a-a227ff3039bb,abdelgalil theo joost sylvio dowib yussef,,217c844d-dc5c-439d-b27a-a227ff3039bb
1,Jaden,Jaden Syre Smith / Jaden Smith / Jaden Christo...,https://genius.com/artists/Jaden,12232,,,,,,,...,jaden,male,"jaden christopher smith syre, 166cb987-3156-41...",jaden christopher smith syre,166cb987-3156-4166-b0d3-f1ad5a6c514c,jaden,,syre christopher jaden smith,166cb987-3156-4166-b0d3-f1ad5a6c514c,166cb987-3156-4166-b0d3-f1ad5a6c514c
2,Jack Antonoff,Jack M. Antonoff / Jack Michael Antonoff,https://genius.com/artists/Jack-antonoff,264329,,,,,,,...,jackantonoff,male,"antonoff jack michael, 1155a7a9-a7b4-46ba-ba27...",antonoff jack michael,1155a7a9-a7b4-46ba-ba27-6b6c98be487e,jack antonoff,1155a7a9-a7b4-46ba-ba27-6b6c98be487e,michael jack antonoff m.,,1155a7a9-a7b4-46ba-ba27-6b6c98be487e
3,Jabari Manwa,Jabari Manwarring / Jabari Chester Manwarring ...,https://genius.com/artists/Jabari-manwa,323655,,,,,,,...,JabariManwa,male,"manwarring jabari chester, dab626f0-8589-4c71-...",manwarring jabari chester,dab626f0-8589-4c71-93b4-2dc38afe69c7,jabari manwa,dab626f0-8589-4c71-93b4-2dc38afe69c7,che$ter jab$ jabari manwarring chester,,dab626f0-8589-4c71-93b4-2dc38afe69c7
4,JEON SOMI,SOMI,https://genius.com/artists/Jeon-somi,1120353,,,,,,,...,somi_official_,female,"omi, 3b3dccff-9725-4947-9b56-c1f1db789e39",omi,3b3dccff-9725-4947-9b56-c1f1db789e39,jeon somi,8d4e6a86-fa1c-4a1b-b8c4-3b58a46dde9f,somi,,8d4e6a86-fa1c-4a1b-b8c4-3b58a46dde9f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429670,Jesse McCartney,,https://genius.com/artists/Jesse-mccartney,1810,,,,,,,...,JesseMcCartney,male,,,,jesse mccartney,9d075d6e-9492-4343-ac95-93a808f61477,,,9d075d6e-9492-4343-ac95-93a808f61477
429671,Jack Garratt,Jack Robert Garratt,https://genius.com/artists/Jack-garratt,285174,,,,,,,...,JackGarratt,undetermined,"garrett robert r.j., d2453242-cc03-4abc-a4cc-3...",garrett robert r.j.,d2453242-cc03-4abc-a4cc-39152d193de4,jack garratt,fcba6ab6-4a51-4688-b90b-57d6eab55b46,jack garratt robert,,fcba6ab6-4a51-4688-b90b-57d6eab55b46
429672,JAHKOY,,https://genius.com/artists/Jahkoy,187600,,,,,,,...,JAHKOY,undetermined,,,,jahkoy,8b694b3a-8ccc-42bf-a468-1f8058ba98f1,,,8b694b3a-8ccc-42bf-a468-1f8058ba98f1
429673,Jordin Sparks,Jordin Sparks Thomas / Jordin Brianna Sparks,https://genius.com/artists/Jordin-sparks,10207,,,,,,,...,jordinsparks,female,"sparks jordin jordon brianna, 5508631d-697f-48...",sparks jordin jordon brianna,5508631d-697f-4839-a669-06637e5bcb90,jordin sparks,5508631d-697f-4839-a669-06637e5bcb90,sparks jordin brianna thomas,,5508631d-697f-4839-a669-06637e5bcb90
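
The combination step keeps `query_id1` where it is set and falls back to `query_id2` otherwise. The sketch below expresses that rule as a small helper which treats the placeholder string `'nan'`, `None` and a real floating-point NaN all as missing. The helper names `is_missing` and `coalesce_ids` are hypothetical.

```python
import math


def is_missing(value) -> bool:
    """True for None, float NaN, or the placeholder string 'nan'."""
    if value is None or value == "nan":
        return True
    return isinstance(value, float) and math.isnan(value)


def coalesce_ids(*candidates):
    """Return the first candidate that is not missing, else 'nan'."""
    for value in candidates:
        if not is_missing(value):
            return value
    return "nan"


print(coalesce_ids("nan", "166cb987-3156-4166-b0d3-f1ad5a6c514c"))
print(coalesce_ids(float("nan"), "nan"))
```

Handling both representations matters because a plain `!= 'nan'` comparison lets a real NaN through as if it were a valid ID.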


In [121]:
mbz_df.query(f'cleaned_art_name == "{genius_df["cleaned_art_name"][2]}"')

Unnamed: 0,id,type,artist_name,alternate_name,gender,country,begin_area_type,begin_area_name,end_area_type,end_area_name,lifespan_begin,lifespan_end,tags,lower_alt_name,row_number,cleaned_art_name
608075,1155a7a9-a7b4-46ba-ba27-6b6c98be487e,Person,Jack Antonoff,Jack Michael Antonoff,male,US,County,Bergen County,,,1984-03-31,,[nan],antonoff jack michael,608075,jack antonoff


# column with index


In [20]:
import pandas as pd

mbz_df = mbz_df.assign(row_number=range(len(mbz_df)))
mbz_df

Unnamed: 0,id,type,artist_name,alternate_name,gender,country,begin_area_type,begin_area_name,end_area_type,end_area_name,lifespan_begin,lifespan_end,tags,lower_alt_name,row_number
0,89ad4ac3-39f7-470e-963a-56509c546377,Other,Various Artists,Variuos / [различные исполнители] / Multi‐inte...,,,,,,,,,"[rock, trance, electronic, classical, pop, spe...",blandede collectif razlichnye nhiều artistas l...,0
1,125ec42a-7229-4250-afc5-e057484327fe,Other,[unknown],Karavīru dz. / חבורת בנים בנות / חבורה / Group...,not applicable,XW,,,,,,,"[soundtrack, classical, special purpose artist...",participants hidden la orchestra divorce frequ...,1
2,24f1766e-9635-4d58-a4d4-9413f9f98a4c,Person,Johann Sebastian Bach,Johnann Sebastian Bach / Jean-Sébastien Bach /...,male,DE,City,Eisenach,City,Leipzig,1685-03-21,1750-07-28,"[classical, german, german composer, orchestra...",j johann savastian (1685-1750) سباستيان jan se...,2
3,b972f589-fb0e-474e-b64a-803b0364fa75,Person,Wolfgang Amadeus Mozart,モーツァルト / ヴォルフガンク・アマデウス・モーツァルト / W.A.Mozart / W...,male,AT,City,Salzburg,City,Wien,1756-01-27,1791-12-05,"[classical, german, opera, austrian, composer,...",anadeus w.a.mozart mosart mozart w.a. моцарт 모...,3
4,70248960-cb53-4ea4-943a-edb18f7d336f,Person,Bruce Springsteen,The Boss / Bruce Springsteen / Bruce Springtee...,male,US,City,Long Branch,,,1949-09-23,,"[rock, folk, cotm, american, singer-songwriter...",bruce frederick the joseph springteen boss spr...,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2145295,1c2a1fe5-d7f5-4b0e-a134-05bd704bab54,Person,Steffi Hartnigk,,female,DE,,,,,,,[nan],,2145295
2145296,aaab5219-79fa-4ac1-ac25-c7809a958ae4,Person,Uta Herfurth,,female,DE,,,,,,,[nan],,2145296
2145297,62463fc2-2292-4af3-8217-4a6dc4adee72,Person,Ulrike Böhmer,,female,DE,,,,,,,[nan],,2145297
2145298,9ce8a434-9a76-419f-b5e6-af6eb87df636,Person,Balázs Maróti,,male,DE,,,,,,,[nan],,2145298


In [91]:
mbz_list_df = pd.DataFrame({'art_name': mbz_df['artist_name'], 'alternate_name': mbz_df['lower_alt_name'],'index': mbz_df['row_number'], 'id': mbz_df['id']})

In [99]:
#mbz_list_df= mbz_list_df[mbz_list_df['alternate_name'] != 'nan']
mbz_list_df['art_name'].isna().sum()

26

In [103]:
genius_df

Unnamed: 0,artist_name,alternate_name,artist_url,artist_id,place_birth,place_live,place_death,date_birth,date_death,gender,...,facebook_name,instagram_name,twitter_name,new_gender,lower_alt_name,matched_name_id,matched_name,matched_id,query_id,cleaned_art_name
0,Josylvio,Yussef Abdelgalil Dowib / Joost Theo Sylvio,https://genius.com/artists/Josylvio,626456,,,,,,,...,,josylvio,Josylvio,undetermined,sylvio joost abdelgalil yussef theo dowib,"sylvio joost abdelgalil yussef theo dowib, 217...",sylvio joost abdelgalil yussef theo dowib,217c844d-dc5c-439d-b27a-a227ff3039bb,,josylvio
1,Jaden,Jaden Syre Smith / Jaden Smith / Jaden Christo...,https://genius.com/artists/Jaden,12232,,,,,,,...,officialjaden,c.syresmith,jaden,male,jaden christopher smith syre,"jaden christopher smith syre, 166cb987-3156-41...",jaden christopher smith syre,166cb987-3156-4166-b0d3-f1ad5a6c514c,,jaden
2,Jack Antonoff,Jack M. Antonoff / Jack Michael Antonoff,https://genius.com/artists/Jack-antonoff,264329,,,,,,,...,,jackantonoff,jackantonoff,male,jack m. michael antonoff,"antonoff jack michael, 1155a7a9-a7b4-46ba-ba27...",antonoff jack michael,1155a7a9-a7b4-46ba-ba27-6b6c98be487e,,jack antonoff
3,Jabari Manwa,Jabari Manwarring / Jabari Chester Manwarring ...,https://genius.com/artists/Jabari-manwa,323655,,,,,,,...,,jabarimanwa,JabariManwa,male,che$ter jab$ manwarring jabari chester,"manwarring jabari chester, dab626f0-8589-4c71-...",manwarring jabari chester,dab626f0-8589-4c71-93b4-2dc38afe69c7,,jabari manwa
4,JEON SOMI,SOMI,https://genius.com/artists/Jeon-somi,1120353,,,,,,,...,,somsomi0309,somi_official_,female,somi,"omi, 3b3dccff-9725-4947-9b56-c1f1db789e39",omi,3b3dccff-9725-4947-9b56-c1f1db789e39,,jeon somi
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429670,Jesse McCartney,,https://genius.com/artists/Jesse-mccartney,1810,,,,,,,...,JesseMcCartney,JesseMcCartney,JesseMcCartney,male,,,,,,jesse mccartney
429671,Jack Garratt,Jack Robert Garratt,https://genius.com/artists/Jack-garratt,285174,,,,,,,...,jackgarrattmusic,jackgarratt,JackGarratt,undetermined,garratt robert jack,"garrett robert r.j., d2453242-cc03-4abc-a4cc-3...",garrett robert r.j.,d2453242-cc03-4abc-a4cc-39152d193de4,,jack garratt
429672,JAHKOY,,https://genius.com/artists/Jahkoy,187600,,,,,,,...,,jahkoy,JAHKOY,undetermined,,,,,,jahkoy
429673,Jordin Sparks,Jordin Sparks Thomas / Jordin Brianna Sparks,https://genius.com/artists/Jordin-sparks,10207,,,,,,,...,jordinsparks,jordinsparks,jordinsparks,female,sparks jordin thomas brianna,"sparks jordin jordon brianna, 5508631d-697f-48...",sparks jordin jordon brianna,5508631d-697f-4839-a669-06637e5bcb90,,jordin sparks


# bert


In [None]:
%pip install racebert
from racebert import RaceBERT

model = RaceBERT()

# To predict race
model.predict_race("Barack Obama")


[{'label': 'nh_black', 'score': 0.9808529019355774}]
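
The prediction is a list of dictionaries with a `label` and a `score`. If such predictions were used to fill a column, one option is to keep a label only when its score clears a threshold and otherwise fall back to a neutral value. This is a sketch only: the `top_label` helper, the 0.9 threshold and the `"undetermined"` fallback (borrowed from the `new_gender` column) are assumptions, not part of racebert.

```python
def top_label(prediction, min_score=0.9, fallback="undetermined"):
    """Return the highest-scoring label, or the fallback below min_score."""
    if not prediction:
        return fallback
    best = max(prediction, key=lambda item: item["score"])
    return best["label"] if best["score"] >= min_score else fallback


# Output copied from the cell above
prediction = [{"label": "nh_black", "score": 0.9808529019355774}]

print(top_label(prediction))
print(top_label(prediction, min_score=0.99))
```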