<a href="https://colab.research.google.com/github/jiahao303/music-classifier/blob/main/Music_Classifier_Nov_16.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Music Classifier Step 3: TensorFlow Datasets

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import re
import string

from tensorflow.keras import layers
from tensorflow.keras import losses

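# In recent TensorFlow releases these layers live directly under tensorflow.keras.layers;
# the older experimental.preprocessing path may no longer be available
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import StringLookup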

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

First, we load the data and inspect one of the lyrics. Most neutral words, such as pronouns (e.g. we, you, I) and auxiliaries, have already been removed because they contribute little to identifying sentiment.

In [2]:

url = 'https://raw.githubusercontent.com/jiahao303/music-classifier/main/tcc_ceds_music.csv'

df = pd.read_csv(url)

In [3]:

df = df.drop(["Unnamed: 0", "release_date"], axis=1)
df.at[5,"lyrics"]

'convoy light dead ahead merchantmen trump diesels hammer oily kill grind knuckle white eye alight slam hatch deadly night cunning chicken lair hound hell devil care run silent run deep final prayer warriors secret sleep merchantman nightmare silent death lie wait run silent run deep sink final sleep chill hearts fight open ocean wonder lethal silver fish boat shiver cast millions play killer victim fool obey order rehearse lifeboat shatter hull tear black smell burn jones eye watch crosswire tube ready medal chest weeks dead like rest run silent run deep final prayer warriors secret sleep merchantman nightmare'

Next, we reduce the DataFrame to the two columns we need: the lyrics and the topic.

In [4]:
sentiment = df[["lyrics", "topic"]]
sentiment

,lyrics,topic
0,hold time feel break feel untrue convince spea...,sadness
1,believe drop rain fall grow believe darkest ni...,world/life
2,sweetheart send letter goodbye secret feel bet...,music
3,kiss lips want stroll charm mambo chacha merin...,romantic
4,till darling till matter know till dream live ...,romantic
...,...,...
28367,cause fuck leave scar tick tock clock come kno...,obscene
28368,minks things chain ring braclets yap fame come...,obscene
28369,get ban get ban stick crack relax plan attack ...,obscene
28370,check check yeah yeah hear thing call switch g...,obscene


Our task will be to teach an algorithm to classify lyrics by predicting the topic based on the text of the lyrics.

In [5]:
sentiment.groupby("topic").size()

topic
feelings       612
music         2303
night/time    1825
obscene       4882
romantic      1524
sadness       6096
violence      5710
world/life    5420
dtype: int64

Encode the "topic" column values with integers.

In [6]:
le = LabelEncoder()
# work on an explicit copy so the assignment below does not raise
# pandas' SettingWithCopyWarning
sentiment = sentiment.copy()
sentiment["topic"] = le.fit_transform(sentiment["topic"])
sentiment.head()


,lyrics,topic
0,hold time feel break feel untrue convince spea...,5
1,believe drop rain fall grow believe darkest ni...,7
2,sweetheart send letter goodbye secret feel bet...,1
3,kiss lips want stroll charm mambo chacha merin...,4
4,till darling till matter know till dream live ...,4


Inspect which integers correspond to which classes using the classes_ attribute of the encoder.

In [7]:
le.classes_

array(['feelings', 'music', 'night/time', 'obscene', 'romantic',
       'sadness', 'violence', 'world/life'], dtype=object)
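
As a minimal sketch of how this encoding behaves, the toy example below fits a `LabelEncoder` on a made-up list (`toy_topics` is an assumption for illustration, not data from the CSV) and maps the integers back to names with `inverse_transform`. Classes are numbered in sorted order, which is consistent with `feelings` being 0 and `world/life` being 7 above.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical labels, for illustration only
toy_topics = ["sadness", "music", "sadness", "romantic"]

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_topics)

print(toy_le.classes_)  # class names, in sorted order
print(toy_codes)        # one integer code per input label

# map the integer codes back to the original names
decoded = toy_le.inverse_transform(toy_codes)
print(decoded)
```

The same `inverse_transform` call on `le` can be used later to turn predicted integers back into topic names.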

In [8]:
# each element is a (topic, lyrics) pair, in the order the columns are passed here
data = tf.data.Dataset.from_tensor_slices((sentiment["topic"], sentiment["lyrics"]))

Iterate over the first few elements. Each one is a (topic, lyrics) pair.

In [16]:
for topic, lyrics in data.take(5):
    print(topic)
    print(lyrics)
    print("")

tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(b'filthy hand desert brezhnev take afghanistan begin take beirut galtieri take union lunch take cruiser hand apparently', shape=(), dtype=string)

tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(b'catch twist star plot line faulty bring columbus york betwixt east call wear leather vest earth squeal shudder halt crucifix help fear leave soul rent inside pant hide clean mess drop life lithesome want want want want rodriguez square shoulder curse run comb black ponytail think lonely room sink give stink smell perfume eye voice like outside street steam crack dealers dream score betcha light good say little diaz brother tote downtown hood damn good italians need lesson teach die harlem think warnin dance brain street manhattan garbage latin write say hard shit days manhattan sink like filthy shock write book say like ancient rome perfume burn eye hold tightly thighs flicker minute vanish go', shape=(), dtype=string)

tf.Tensor(7, shape=(), dtype=int64)
...
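
The order of each element follows the order of the tuple passed to `from_tensor_slices`. A small sketch with made-up values (`toy_topics` and `toy_lyrics` are assumptions, not rows from the dataset) makes this visible:

```python
import tensorflow as tf

# hypothetical values, for illustration only
toy_topics = [5, 1]
toy_lyrics = ["hold time feel", "sing a song"]

# labels are passed first, so every element unpacks as (topic, lyrics)
toy_data = tf.data.Dataset.from_tensor_slices((toy_topics, toy_lyrics))

pairs = []
for topic, lyrics in toy_data:
    # convert the tensors to plain Python values for printing
    pairs.append((int(topic.numpy()), lyrics.numpy().decode("utf-8")))

print(pairs)
```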

We now have a TensorFlow Dataset whose elements are (topic, lyrics) pairs.
Next, we shuffle it and split it into training, validation, and test sets.

In [10]:
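# reshuffle_each_iteration=False keeps the shuffled order fixed, so the
# take/skip splits below stay disjoint every time the dataset is iterated
data = data.shuffle(buffer_size=len(data), reshuffle_each_iteration=False)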

In [11]:
train_size = int(0.7*len(data))
val_size   = int(0.1*len(data))

train = data.take(train_size)
val   = data.skip(train_size).take(val_size)
test  = data.skip(train_size + val_size)

In [12]:
len(train), len(val), len(test)

(19860, 2837, 5675)
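
The `take`/`skip` pattern can be checked on a toy dataset. The sketch below is not part of the original notebook: it uses `tf.data.Dataset.range(10)` as stand-in data and passes `reshuffle_each_iteration=False` to `shuffle`, which keeps the order fixed across iterations so the three subsets cannot overlap.

```python
import tensorflow as tf

# stand-in data: the integers 0..9 play the role of observations
toy = tf.data.Dataset.range(10)
toy = toy.shuffle(buffer_size=10, seed=0, reshuffle_each_iteration=False)

train_n, val_n = 7, 1
toy_train = toy.take(train_n)
toy_val = toy.skip(train_n).take(val_n)
toy_test = toy.skip(train_n + val_n)

def to_list(ds):
    # materialize a small dataset as a list of Python ints
    return [int(x) for x in ds.as_numpy_iterator()]

train_ids = to_list(toy_train)
val_ids = to_list(toy_val)
test_ids = to_list(toy_test)

print(len(train_ids), len(val_ids), len(test_ids))
# every observation lands in exactly one subset
print(sorted(train_ids + val_ids + test_ids) == list(range(10)))
```

With the default `reshuffle_each_iteration=True`, each pass over the dataset draws a new order, so an observation seen in training during one pass could show up in the test subset during another.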

In [13]:
def standardization(input_data):
    # lowercase the text, then strip every punctuation character
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(
        lowercase, '[%s]' % re.escape(string.punctuation), '')
    return no_punctuation
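
To see what this function does, the sketch below applies the same two steps to a couple of made-up strings (`sample` is an assumption for illustration; the lyrics in the dataset are already lowercase and largely free of punctuation).

```python
import re
import string
import tensorflow as tf

def standardization(input_data):
    # lowercase, then delete every character in string.punctuation
    lowercase = tf.strings.lower(input_data)
    return tf.strings.regex_replace(
        lowercase, '[%s]' % re.escape(string.punctuation), '')

# hypothetical inputs, for illustration only
sample = tf.constant(["Run Silent, Run Deep!", "Don't stop..."])
cleaned = [s.decode("utf-8") for s in standardization(sample).numpy()]
print(cleaned)
```

Note that apostrophes count as punctuation, so a contraction such as "Don't" becomes "dont".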

In [14]:
max_tokens = 2000
# each lyric will be a vector of length 40
sequence_length = 40

vectorize_layer = TextVectorization(
    max_tokens = max_tokens,
    standardize=standardization,
    output_mode='int',
    output_sequence_length=sequence_length)

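We need to adapt the vectorization layer to the lyrics. During adaptation, the layer learns which words are most common in the lyrics and builds its vocabulary from them. We adapt on the training set only, so that nothing from the validation or test sets leaks into the vocabulary.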

In [17]:
# keep only the lyrics (second item) of each (topic, lyrics) pair
lyrics = train.map(lambda topic, text: text)
vectorize_layer.adapt(lyrics)
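
A minimal sketch of what adaptation produces, using a made-up one-line corpus (`toy_layer` and its inputs are assumptions for illustration). In `int` output mode, index 0 is reserved for padding and index 1 for out-of-vocabulary words; the remaining words are ranked by frequency.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

toy_layer = TextVectorization(
    max_tokens=10,
    output_mode="int",
    output_sequence_length=4)

# hypothetical corpus: "run" appears 3 times, "silent" twice, "deep" once
toy_layer.adapt(tf.constant(["run run run silent silent deep"]))

vocab = toy_layer.get_vocabulary()
print(vocab)

# "ocean" was never seen during adapt, so it maps to the unknown-word index;
# the sequence is padded with zeros up to output_sequence_length
encoded = toy_layer(tf.constant(["run deep ocean"])).numpy().tolist()
print(encoded)
```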

We define a helper function that operates on our Datasets. Each element of our Dataset is a tuple of the form (topic, lyrics). Our helper function therefore accepts two arguments and returns two values: the vectorized lyrics, followed by the label.

In [19]:
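def vectorize_lyrics(label, text):
    # add a trailing dimension so the layer receives a batch of one string
    text = tf.expand_dims(text, -1)
    # return (features, label), the order Keras expects for training
    return vectorize_layer(text), [label]

train_vec = train.map(vectorize_lyrics)
val_vec   = val.map(vectorize_lyrics)
test_vec  = test.map(vectorize_lyrics)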
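
To check the shapes this kind of mapping produces, here is a small self-contained sketch. The names `toy`, `toy_layer`, and `to_model_input` are hypothetical stand-ins for the dataset, the vectorization layer, and the helper above, and the two lyrics are made up.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# hypothetical (topic, lyrics) pairs, for illustration only
toy = tf.data.Dataset.from_tensor_slices(
    ([5, 1], ["run silent run deep", "sing a song"]))

toy_layer = TextVectorization(
    max_tokens=20, output_mode="int", output_sequence_length=6)
# adapt on the text component only, batched for the adapt call
toy_layer.adapt(toy.map(lambda label, text: text).batch(2))

def to_model_input(label, text):
    text = tf.expand_dims(text, -1)   # scalar string -> shape (1,)
    return toy_layer(text), [label]   # tokens first, then the label

toy_vec = toy.map(to_model_input)
tokens, label = next(iter(toy_vec))
print(tokens.shape, label.shape)
```

Each mapped element holds one row of `output_sequence_length` token indices and a length-one label vector.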