# Building the dataset

## Importing libraries

In [1]:
import pandas as pd
from pandasql import sqldf

## Loading the primary dataset

The primary dataset is the file **spotify_millsongdata.csv**, available at https://www.kaggle.com/datasets/notshrirang/spotify-million-song-dataset. It contains 57,650 records. Because BERT-based language models take a fairly long time to process data, we will select only relatively short song lyrics (between 300 and 600 characters per sample).

In [3]:
songs = pd.read_csv('spotify_millsongdata.csv')
songs

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \r\nA..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \r\nTouch me gen..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \r\nWhy I had...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...
...,...,...,...,...
57645,Ziggy Marley,Good Old Days,/z/ziggy+marley/good+old+days_10198588.html,Irie days come on play \r\nLet the angels fly...
57646,Ziggy Marley,Hand To Mouth,/z/ziggy+marley/hand+to+mouth_20531167.html,Power to the workers \r\nMore power \r\nPowe...
57647,Zwan,Come With Me,/z/zwan/come+with+me_20148981.html,all you need \r\nis something i'll believe \...
57648,Zwan,Desire,/z/zwan/desire_20148986.html,northern star \r\nam i frightened \r\nwhere ...
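
The 300 to 600 character window mentioned earlier can also be inspected directly in pandas before any SQL is written. The following is a minimal sketch on a made-up data frame (`toy` and its contents are illustrative, not taken from the dataset). Note that characters such as `\r\n` inside the lyrics count toward the length.

```python
import pandas as pd

# Toy stand-in for the real data frame; these lyrics are invented for illustration.
toy = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "text": ["x" * 299, "y" * 300, "z" * 601],
})

# Character count of each lyric. Series.between is inclusive on both ends
# by default, like SQL's BETWEEN.
lengths = toy["text"].str.len()
short = toy[lengths.between(300, 600)]
print(short["artist"].tolist())
```

Only the 300-character lyric falls inside the window here; the 299- and 601-character ones are excluded.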


## Extracting unique values from the dataset

Let's check how many unique values the dataset contains. The pandasql library helps here: it lets us write SQL queries as strings and pass them to a function that returns the query result as a data frame.

In [4]:
query = '''
SELECT DISTINCT(artist) FROM songs
'''
sqldf(query)

Unnamed: 0,artist
0,ABBA
1,Ace Of Base
2,Adam Sandler
3,Adele
4,Aerosmith
...,...
638,Joseph And The Amazing Technicolor Dreamcoat
639,Soundtracks
640,Van Der Graaf Generator
641,Various Artists


In [5]:
query = '''
SELECT DISTINCT(song) FROM songs
'''
sqldf(query)

Unnamed: 0,song
0,Ahe's My Kind Of Girl
1,"Andante, Andante"
2,As Good As New
3,Bang
4,Bang-A-Boomerang
...,...
44819,Mental Health
44820,The Setup
44821,Freedom Road
44822,G7


In [6]:
query = '''
SELECT DISTINCT(text) FROM songs
'''
sqldf(query)

Unnamed: 0,text
0,"Look at her face, it's a wonderful face \r\nA..."
1,"Take it easy with me, please \r\nTouch me gen..."
2,I'll never know why I had to go \r\nWhy I had...
3,Making somebody happy is a question of give an...
4,Making somebody happy is a question of give an...
...,...
57489,Irie days come on play \r\nLet the angels fly...
57490,Power to the workers \r\nMore power \r\nPowe...
57491,all you need \r\nis something i'll believe \...
57492,northern star \r\nam i frightened \r\nwhere ...


We have 643 artists, 44,824 unique song titles (some titles may be shared by different songs, so a repeated title is not necessarily another performance of the same song), and 57,494 unique song lyrics. We want to rule out covers in the dataset used for the main model, so we will select only unique song lyrics (within the length range given above) together with their performers.
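
The counts quoted above can be cross-checked without SQL. This is a small sketch on invented data (the `toy` frame is hypothetical); on the real data frame the same call would be applied to the `artist`, `song`, and `text` columns.

```python
import pandas as pd

# Invented rows, only to show the mechanics.
toy = pd.DataFrame({
    "artist": ["ABBA", "ABBA", "Zwan", "Zao"],
    "song": ["Bang", "Desire", "Desire", "Desire"],
    "text": ["la la", "oh oh", "na na", "la la"],
})

# nunique() counts distinct non-null values per column, which matches the
# number of rows SELECT DISTINCT returns when the column has no NULLs.
counts = toy[["artist", "song", "text"]].nunique()
print(counts)
```

One difference worth keeping in mind: `nunique()` ignores missing values by default, while `SELECT DISTINCT` returns NULL as a row of its own.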

In [7]:
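# Keep lyrics of 300 to 600 characters. DISTINCT applies to the whole
# (song, artist, text) row, not to the song column alone.
query = '''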
SELECT DISTINCT(song), artist, text FROM songs
WHERE LENGTH(text) BETWEEN 300 AND 600
'''
songs = sqldf(query)
songs

Unnamed: 0,song,artist,text
0,Burning My Bridges,ABBA,"Well, you hoot and you holler and you make me ..."
1,Free As A Bumble Bee,ABBA,I'm down and I feel depressed \r\nSitting her...
2,Just A Notion,ABBA,Just a notion that's all \r\nJust a feeling t...
3,Rubber Ball Man,ABBA,The poster on the wall of a dear friend \r\nI...
4,Take A Chance,ABBA,Suzy was nine and I was ten \r\nRight at the ...
...,...,...,...
4715,Too Numb To Cry,Zakk Wylde,I see you across the room \r\nSearch for some...
4716,World Of Trouble,Zakk Wylde,"2, 3, 4, 1, drank all my fuckin' brew \r\nAnd..."
4717,5 Year Winter,Zao,"Dear Tiffany, \r\nYou've mad me nauseous for ..."
4718,"Lies Of Serpents, A River Of Tears",Zao,Your eyes \r\nYour eyes \r\nYour eyes search...
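
It may help to see what `DISTINCT` actually deduplicates in a multi-column select. pandasql runs queries on an in-memory SQLite database by default, so the sketch below uses Python's built-in `sqlite3` module with three invented rows (the artist names and lyrics are hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (song TEXT, artist TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?)",
    [
        ("Desire", "Artist A", "same words"),
        ("Desire", "Artist B", "same words"),  # hypothetical cover
        ("Desire", "Artist A", "same words"),  # exact duplicate row
    ],
)

# The parentheses around song are just an expression grouping;
# DISTINCT compares the full (song, artist, text) row.
rows = conn.execute("SELECT DISTINCT(song), artist, text FROM songs").fetchall()

# Distinct (text, artist) pairs: the cover by Artist B is still a separate row.
pairs = conn.execute("SELECT DISTINCT text, artist FROM songs").fetchall()

# Distinct lyrics only: one row per text.
texts = conn.execute("SELECT DISTINCT text FROM songs").fetchall()
conn.close()

print(len(rows), len(pairs), len(texts))
```

The exact duplicate row is removed in every case, but the same lyric by a second artist disappears only when `text` is the sole selected column.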


In [8]:
# Drop the song title; DISTINCT now removes repeated (text, artist) pairs.
query = '''
SELECT DISTINCT(text), artist FROM songs
'''
songs = sqldf(query)
songs

Unnamed: 0,text,artist
0,"Well, you hoot and you holler and you make me ...",ABBA
1,I'm down and I feel depressed \r\nSitting her...,ABBA
2,Just a notion that's all \r\nJust a feeling t...,ABBA
3,The poster on the wall of a dear friend \r\nI...,ABBA
4,Suzy was nine and I was ten \r\nRight at the ...,ABBA
...,...,...
4713,I see you across the room \r\nSearch for some...,Zakk Wylde
4714,"2, 3, 4, 1, drank all my fuckin' brew \r\nAnd...",Zakk Wylde
4715,"Dear Tiffany, \r\nYou've mad me nauseous for ...",Zao
4716,Your eyes \r\nYour eyes \r\nYour eyes search...,Zao
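
Since the query above keeps distinct (text, artist) pairs, an identical lyric recorded by two different performers would remain as two rows. If the intent is at most one row per lyric, one possible alternative (not what the notebook does) is `DataFrame.drop_duplicates` with `subset="text"`. A sketch on invented data:

```python
import pandas as pd

# Invented rows: the first two share a lyric, as a cover would.
toy = pd.DataFrame({
    "text": ["same words", "same words", "other words"],
    "artist": ["Artist A", "Artist B", "Artist C"],
})

# keep="first" retains the first performer seen for each lyric.
unique_lyrics = toy.drop_duplicates(subset="text", keep="first").reset_index(drop=True)

# keep=False removes every lyric that occurs more than once.
no_shared = toy.drop_duplicates(subset="text", keep=False).reset_index(drop=True)

print(unique_lyrics["artist"].tolist())
print(no_shared["artist"].tolist())
```

Which variant is appropriate depends on whether a covered song should stay in the dataset under one performer or be excluded entirely.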


## Saving the dataset to a .csv file

In [9]:
songs.to_csv('songs.csv', index=False)