<a href="https://colab.research.google.com/github/jiahao303/music-classifier/blob/main/Music_Classifier_Nov_16.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Music Classifier Step 3: TensorFlow Datasets

In [13]:
import numpy as np
import pandas as pd
import tensorflow as tf
import re
import string

from tensorflow.keras import layers
from tensorflow.keras import losses

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import StringLookup

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

First, we inspect one of the lyrics. Most neutral words, such as pronouns (e.g. we, you, I) and auxiliary verbs, are omitted from the text because they contribute little to identifying sentiment.

In [9]:

url = 'https://raw.githubusercontent.com/jiahao303/music-classifier/main/tcc_ceds_music.csv'

df = pd.read_csv(url)

In [10]:

df = df.drop(["Unnamed: 0", "release_date"], axis=1)
df.at[5, "lyrics"]

'convoy light dead ahead merchantmen trump diesels hammer oily kill grind knuckle white eye alight slam hatch deadly night cunning chicken lair hound hell devil care run silent run deep final prayer warriors secret sleep merchantman nightmare silent death lie wait run silent run deep sink final sleep chill hearts fight open ocean wonder lethal silver fish boat shiver cast millions play killer victim fool obey order rehearse lifeboat shatter hull tear black smell burn jones eye watch crosswire tube ready medal chest weeks dead like rest run silent run deep final prayer warriors secret sleep merchantman nightmare'
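Since each lyric is stored as a single string of space-separated tokens, a plain whitespace split is enough for a first look at its contents. The sketch below counts tokens in a short excerpt of the lyric printed above; the variable names are illustrative and not part of the original notebook.

```python
from collections import Counter

# Excerpt copied from the lyric printed above (row 5 of the dataset).
excerpt = "run silent run deep final prayer warriors secret sleep"

# The sample contains no punctuation or capital letters,
# so splitting on whitespace yields one token per word.
tokens = excerpt.split()
token_counts = Counter(tokens)

print(len(tokens))                  # number of tokens in the excerpt
print(token_counts.most_common(1))  # most frequent token and its count
```

The same two lines could be applied to a full lyric string to get a rough sense of its length and its most repeated words.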

Next, we restrict the DataFrame to the two columns we need: the topic column and the lyrics column.

In [11]:
sentiment = df[["topic", "lyrics"]]
sentiment

| | topic | lyrics |
|---|---|---|
| 0 | sadness | hold time feel break feel untrue convince spea... |
| 1 | world/life | believe drop rain fall grow believe darkest ni... |
| 2 | music | sweetheart send letter goodbye secret feel bet... |
| 3 | romantic | kiss lips want stroll charm mambo chacha merin... |
| 4 | romantic | till darling till matter know till dream live ... |
| ... | ... | ... |
| 28367 | obscene | cause fuck leave scar tick tock clock come kno... |
| 28368 | obscene | minks things chain ring braclets yap fame come... |
| 28369 | obscene | get ban get ban stick crack relax plan attack ... |
| 28370 | obscene | check check yeah yeah hear thing call switch g... |


Our task is to train a model that predicts the topic of a song from the text of its lyrics.

In [15]:
sentiment.groupby("topic").size()

topic
feelings       612
music         2303
night/time    1825
obscene       4882
romantic      1524
sadness       6096
violence      5710
world/life    5420
dtype: int64
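The classes are clearly imbalanced, so it helps to know what accuracy a trivial model would reach. The sketch below hard-codes the counts printed above and computes the accuracy of always predicting the most frequent topic; the names topic_counts and baseline_accuracy are illustrative assumptions, not part of the original notebook.

```python
# Class counts copied from the groupby output above.
topic_counts = {
    "feelings": 612,
    "music": 2303,
    "night/time": 1825,
    "obscene": 4882,
    "romantic": 1524,
    "sadness": 6096,
    "violence": 5710,
    "world/life": 5420,
}

total = sum(topic_counts.values())

# A model that always predicts the most frequent topic is the simplest baseline.
majority_topic = max(topic_counts, key=topic_counts.get)
baseline_accuracy = topic_counts[majority_topic] / total

print(total, majority_topic, round(baseline_accuracy, 3))
```

Any classifier trained later should beat this majority-class figure to be considered useful.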

Encode the "topic" column values with integers.

In [16]:
le = LabelEncoder()
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
sentiment = sentiment.copy()
sentiment["topic"] = le.fit_transform(sentiment["topic"])
sentiment.head()


| | topic | lyrics |
|---|---|---|
| 0 | 5 | hold time feel break feel untrue convince spea... |
| 1 | 7 | believe drop rain fall grow believe darkest ni... |
| 2 | 1 | sweetheart send letter goodbye secret feel bet... |
| 3 | 4 | kiss lips want stroll charm mambo chacha merin... |
| 4 | 4 | till darling till matter know till dream live ... |


Inspect which integers correspond to which classes using the classes_ attribute of the encoder.

In [17]:
le.classes_

array(['feelings', 'music', 'night/time', 'obscene', 'romantic',
       'sadness', 'violence', 'world/life'], dtype=object)
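The position of each name in this array is the integer assigned to it. As a minimal sketch, the mapping can be written out by hand from the array above and used to decode the first five encoded rows; the dictionary names are illustrative and not part of the original notebook.

```python
# Class names in the order reported by le.classes_ above.
classes = ['feelings', 'music', 'night/time', 'obscene', 'romantic',
           'sadness', 'violence', 'world/life']

# The integer code of a class is its position in the array.
index_to_topic = dict(enumerate(classes))
topic_to_index = {name: i for i, name in index_to_topic.items()}

# Encoded topics of the first five rows, copied from the table above.
encoded_head = [5, 7, 1, 4, 4]
decoded_head = [index_to_topic[i] for i in encoded_head]

print(decoded_head)
```

With the fitted encoder at hand, le.inverse_transform performs the same decoding directly.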

In [18]:
# Build a Dataset whose elements are (topic, lyrics) pairs
data = tf.data.Dataset.from_tensor_slices((sentiment["topic"], sentiment["lyrics"]))

Iterate over the first five (topic, lyrics) pairs.

In [19]:
for topic,lyrics in data.take(5):
    print(topic)
    print(lyrics)
    print("")

tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(b'hold time feel break feel untrue convince speak voice tear try hold hurt try forgive okay play break string feel heart want feel tell real truth hurt lie worse anymore little turn dust play house ruin run leave save like chase train late late tear try hold hurt try forgive okay play break string feel heart want feel tell real truth hurt lie worse anymore little run leave save like chase train know late late play break string feel heart want feel tell real truth hurt lie worse anymore little know little hold time feel', shape=(), dtype=string)

tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(b'believe drop rain fall grow believe darkest night candle glow believe go astray come believe believe believe smallest prayer hear believe great hear word time hear bear baby touch leaf believe believe believe lord heaven guide sin hide believe calvary die pierce believe death rise meet heaven loud amen know believe', shape=(), dtype=string)

...

We have now created a TensorFlow Dataset.
Next, we shuffle it and split it into training (70%), validation (10%), and test (the remaining 20%) sets.

In [20]:
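# Shuffle once; reshuffle_each_iteration=False keeps the order fixed so that
# the take/skip splits below remain disjoint every time the data is iterated.
data = data.shuffle(buffer_size=len(data), reshuffle_each_iteration=False)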

In [21]:
train_size = int(0.7*len(data))
val_size   = int(0.1*len(data))

train = data.take(train_size)
val   = data.skip(train_size).take(val_size)
test  = data.skip(train_size + val_size)

In [22]:
len(train), len(val), len(test)

(19860, 2837, 5675)
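Splitting with take and skip only yields disjoint sets if the shuffled order is the same every time the dataset is iterated. By default, shuffle reshuffles on each iteration, which can let examples move between splits. The sketch below checks disjointness on a toy dataset of ten integers, assuming reshuffle_each_iteration=False and a fixed seed; the toy data, the seed, and the helper as_set are illustrative and not part of the original notebook.

```python
import tensorflow as tf

# Toy dataset of 10 integers standing in for the (topic, lyrics) pairs.
toy = tf.data.Dataset.range(10)

# Fix the shuffle order so that take/skip always see the same sequence.
toy = toy.shuffle(buffer_size=10, seed=0, reshuffle_each_iteration=False)

train_size, val_size = 7, 1
toy_train = toy.take(train_size)
toy_val = toy.skip(train_size).take(val_size)
toy_test = toy.skip(train_size + val_size)


def as_set(ds):
    """Collect the elements of a small dataset into a Python set."""
    return {int(x) for x in ds.as_numpy_iterator()}


train_ids, val_ids, test_ids = as_set(toy_train), as_set(toy_val), as_set(toy_test)

# Every element should appear in exactly one split.
print(len(train_ids), len(val_ids), len(test_ids))
print(train_ids.isdisjoint(val_ids), train_ids.isdisjoint(test_ids), val_ids.isdisjoint(test_ids))
```

Running the same check on the real splits would be slower but follows the same pattern.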