In [1]:
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


Let's import all the necessary libraries into the project.

In [2]:
import os
import pandas as pd
!pip install pymorphy2 nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pymorphy2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now, let's list all of the artists that will be used in the project.

In [3]:
jazz_artists = ["Billie Holiday", "Chet Baker", "Ella Fitzgerald", "Louis Armstrong"]
pop_artists = ["Billie Eilish", "Bruno Mars", "BTS", "Dua Lipa", "Ed Sheeran", "Harry Styles", "ITZY", "Lady Gaga", "Lizzo", "Taylor Swift"]
rap_artists = ["Drake", "Eminem", "Kanye West", "Kendrick Lamar", "Post Malone", "Tyler, The Creator"]
rock_artists = ["AC-DC", "Def Leppard", "Fleetwood Mac", "Glass Animals", "Imagine Dragons", "Journey", "Lynyrd Skynyrd", "Queen"]
artists = {'jazz': jazz_artists, 'pop': pop_artists, 'rap': rap_artists, 'rock': rock_artists}

Build the dataset by looping over every lyrics file and appending one record per song to a list, which is then converted into a DataFrame with the pandas library.

In [12]:
data = []

for genre in artists.keys():
  genre_path = os.path.join("/content/gdrive/MyDrive/FP/", genre)
  for artist in artists[genre]:
    lyrics_path = os.path.join(genre_path, artist)
    lyrics_files = os.listdir(lyrics_path)
    for lyrics in lyrics_files:
      # Read the whole file; the with-block closes it even if reading fails
      with open(os.path.join(lyrics_path, lyrics)) as f:
        lines = f.read()
      # Compare names by value (`is` tests object identity); BTS and ITZY are tagged Korean ('ko')
      language = 'ko' if artist in ('ITZY', 'BTS') else 'en'
      # lyrics[:-4] drops the last four characters of the file name (its extension)
      data.append({'Artist': artist, 'Title': lyrics[:-4], 'Genre': genre, 'Language': language, 'Lyrics': lines})
        

In [13]:
songs = pd.DataFrame(data)
songs

Unnamed: 0,Artist,Title,Genre,Language,Lyrics
0,Billie Holiday,’Tain’t Nobody’s Bizness If I Do,jazz,en,There ain't nothing I can do\nOr nothing I can...
1,Billie Holiday,(I Don’t Stand A) Ghost of a Chance,jazz,en,"I need your love so badly\nI love you, oh, so ..."
2,Billie Holiday,(This Is) My Last Affair,jazz,en,Can't you see\nWhat love and romance have done...
3,Billie Holiday,24 Hours A Day,jazz,en,Like a little old fashioned music box\nWith ju...
4,Billie Holiday,A Fine Romance,jazz,en,"A fine romance with no kisses\nA fine romance,..."
...,...,...,...,...,...
5104,Queen,Wishing Well,rock,en,"Throw down your hat, kick off your shoes\nI kn..."
5105,Queen,You and I,rock,en,Ooh-ooh\n\nMusic is playing in the darkness\nA...
5106,Queen,You Don’t Fool Me,rock,en,"Oh\n\nYou don't fool me, you don't fool me\nYo..."
5107,Queen,You Take My Breath Away,rock,en,"Ooh\nOoh, take it, take it all away\nOoh\nOoh,..."


Checking the genre distribution. Note that jazz and pop have noticeably fewer songs than rap and rock.

In [14]:
songs.groupby('Genre').size()

Genre
jazz    1145
pop     1102
rap     1450
rock    1412
dtype: int64

# Processing text
The first step is to lowercase all the text so that it is uniform.

In [15]:
songs.Lyrics = songs.Lyrics.str.lower()
songs.Lyrics[:10]

0    there ain't nothing i can do\nor nothing i can...
1    i need your love so badly\ni love you, oh, so ...
2    can't you see\nwhat love and romance have done...
3    like a little old fashioned music box\nwith ju...
4    a fine romance with no kisses\na fine romance,...
5    a sailboat in the moonlight and you\nwouldn't ...
6    a sunbonnet blue and a yellow straw hat\nshy l...
7    my yiddishe momme\ni need her more then ever n...
8    no one to talk with\nall by myself\nno one to ...
9    there ain't nothing i can do\nor nothing i can...
Name: Lyrics, dtype: object

Using a regex pattern, replace all punctuation, digits, and any other characters that are not letters of the English alphabet with spaces.

In [16]:
# A-Za-z rather than A-z: the range A-z also covers [ \ ] ^ _ ` ; regex=True states explicitly that the pattern is a regex
songs.Lyrics = songs.Lyrics.str.replace(r"[^A-Za-z]", " ", regex=True)
songs.Lyrics[:10]

0    there ain t nothing i can do or nothing i can ...
1    i need your love so badly i love you  oh  so m...
2    can t you see what love and romance have done ...
3    like a little old fashioned music box with jus...
4    a fine romance with no kisses a fine romance  ...
5    a sailboat in the moonlight and you wouldn t t...
6    a sunbonnet blue and a yellow straw hat shy li...
7    my yiddishe momme i need her more then ever no...
8    no one to talk with all by myself no one to wa...
9    there ain t nothing i can do or nothing i can ...
Name: Lyrics, dtype: object
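
The character class deserves a closer look. In ASCII, the range `A-z` runs from code 65 to code 122 and therefore also covers the six characters that sit between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick), so a class written as `[^A-za-z]` would leave those in the text. A minimal sketch with the standard `re` module; the sample string is made up for illustration and is not taken from the dataset:

```python
import re

# Hypothetical sample line containing bracket, underscore and caret characters
sample = "[chorus] la_la ^oh^ it's 4 a.m."

loose = re.sub(r"[^A-za-z]", " ", sample)   # range A-z keeps [ \ ] ^ _ `
strict = re.sub(r"[^A-Za-z]", " ", sample)  # only ASCII letters survive

print(repr(loose))
print(repr(strict))
```

Only the second pattern yields text made purely of letters and spaces, which is what the tokenizer in the next step is meant to receive.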

Tokenize every song. Tokenization splits a text into smaller units called tokens; here each song's lyrics become a list of words, which lets the following steps filter and transform the text one word at a time.

In [17]:
songs.Lyrics = list(map(word_tokenize, songs.Lyrics))
songs.Lyrics[:10]

0    [there, ain, t, nothing, i, can, do, or, nothi...
1    [i, need, your, love, so, badly, i, love, you,...
2    [can, t, you, see, what, love, and, romance, h...
3    [like, a, little, old, fashioned, music, box, ...
4    [a, fine, romance, with, no, kisses, a, fine, ...
5    [a, sailboat, in, the, moonlight, and, you, wo...
6    [a, sunbonnet, blue, and, a, yellow, straw, ha...
7    [my, yiddishe, momme, i, need, her, more, then...
8    [no, one, to, talk, with, all, by, myself, no,...
9    [there, ain, t, nothing, i, can, do, or, nothi...
Name: Lyrics, dtype: object

Now let's remove all the stopwords from the text. They do not add much meaning to the sentence, so we won't be needing them.

In [18]:
stop_words = set(stopwords.words('english'))
def delete_stopword(words):
    # Keep only the tokens that are not in the English stopword set
    return [word for word in words if word not in stop_words]
  
songs.Lyrics = list(map(delete_stopword, songs.Lyrics))
songs.Lyrics[:10]

0    [nothing, nothing, say, want, anyway, care, pe...
1    [need, love, badly, love, oh, madly, stand, gh...
2    [see, love, romance, done, used, last, affair,...
3    [like, little, old, fashioned, music, box, one...
4    [fine, romance, kisses, fine, romance, friend,...
5    [sailboat, moonlight, heaven, heaven, two, sof...
6    [sunbonnet, blue, yellow, straw, hat, shy, lit...
7    [yiddishe, momme, need, ever, yiddishe, momme,...
8    [one, talk, one, walk, happy, shelf, misbehavi...
9    [nothing, nothing, say, folks, criticize, goin...
Name: Lyrics, dtype: object

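Then, let's transform words into their normal, i.e. dictionary, form. Note that pymorphy2 is a morphological analyzer for Russian, so the English tokens come back effectively unchanged: the output below is identical to that of the previous step. Lemmatizing this text would require an English lemmatizer such as NLTK's WordNetLemmatizer.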

In [19]:
morph = pymorphy2.MorphAnalyzer()

def lemmatization(words):
    global morph
    new_s = [morph.parse(word)[0].normal_form for word in words]
    return new_s

songs.Lyrics = list(map(lemmatization, songs.Lyrics))
songs.Lyrics[:10]

0    [nothing, nothing, say, want, anyway, care, pe...
1    [need, love, badly, love, oh, madly, stand, gh...
2    [see, love, romance, done, used, last, affair,...
3    [like, little, old, fashioned, music, box, one...
4    [fine, romance, kisses, fine, romance, friend,...
5    [sailboat, moonlight, heaven, heaven, two, sof...
6    [sunbonnet, blue, yellow, straw, hat, shy, lit...
7    [yiddishe, momme, need, ever, yiddishe, momme,...
8    [one, talk, one, walk, happy, shelf, misbehavi...
9    [nothing, nothing, say, folks, criticize, goin...
Name: Lyrics, dtype: object

Delete all the words that appear only once in the whole corpus: they are unlikely to help classification and are often unique names or misspelled words.

In [20]:
from nltk.probability import FreqDist

def to_str(s):
    new_s = ' '.join(j for j in s)
    return new_s

text_tokens = word_tokenize(' '.join(j for j in list(map(to_str, songs.Lyrics))))
text = nltk.Text(text_tokens)
fdist = FreqDist(text)
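# Store the once-only words in a set so the membership test below is fast
words_to_del = {word for word, count in fdist.items() if count == 1}

def delete_word(words):
    # Drop every token that occurs only once in the whole corpus
    return [word for word in words if word not in words_to_del]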

songs.Lyrics = list(map(delete_word, songs.Lyrics))
songs.Lyrics = list(map(to_str, songs.Lyrics))
songs.Lyrics[:10]

0    nothing nothing say want anyway care people ma...
1    need love badly love oh madly stand ghost chan...
2    see love romance done used last affair tragedy...
3    like little old fashioned music box one tune p...
4    fine romance kisses fine romance friend like c...
5    sailboat moonlight heaven heaven two soft bree...
6    sunbonnet blue yellow straw hat shy little dec...
7    yiddishe momme need ever yiddishe momme long k...
8    one talk one walk happy shelf misbehavin savin...
9    nothing nothing say folks criticize going want...
Name: Lyrics, dtype: object
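
The same filtering idea can be sketched with `collections.Counter` from the standard library, counting directly on the token lists without joining and re-tokenizing the text. The three short token lists below are hypothetical stand-ins for `songs.Lyrics`:

```python
from collections import Counter

# Hypothetical tokenized songs standing in for songs.Lyrics
docs = [
    ["love", "love", "heaven", "sailboat"],
    ["love", "heaven", "yiddishe", "momme"],
    ["heaven", "momme", "blue"],
]

# Count every token across the whole corpus in one pass
counts = Counter(token for doc in docs for token in doc)

# Words seen exactly once, stored in a set for fast lookup
hapaxes = {word for word, n in counts.items() if n == 1}

# Remove those words from every song
filtered = [[w for w in doc if w not in hapaxes] for doc in docs]

print(sorted(hapaxes))  # ['blue', 'sailboat', 'yiddishe']
print(filtered)
```

Because the lookup structure is a set, the cost of filtering grows with the number of tokens rather than with the number of tokens times the number of rare words.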

Let's also encode the genre labels as integers, the numeric form in which the classifiers below will consume them.

In [21]:
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
songs.Genre = label_encoder.fit_transform(songs.Genre)
  
songs.Genre.unique()

array([0, 1, 2, 3])
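
`LabelEncoder` assigns the integer codes to the classes in sorted order, which is what makes it possible to map the codes back to genre names when reading the classification reports further down. The sketch below reproduces that convention in plain Python instead of calling scikit-learn; on a fitted encoder the same ordering is available as `label_encoder.classes_`.

```python
# Sample labels using the four genres of this project
genres = ["jazz", "pop", "rap", "rock", "pop", "jazz"]

# Sorted unique labels; the position of each label is its integer code
classes = sorted(set(genres))
code_of = {label: code for code, label in enumerate(classes)}

encoded = [code_of[g] for g in genres]
decoded = [classes[c] for c in encoded]

print(code_of)   # {'jazz': 0, 'pop': 1, 'rap': 2, 'rock': 3}
print(encoded)   # [0, 1, 2, 3, 1, 0]
```

So in the reports, class 0 is jazz, 1 is pop, 2 is rap and 3 is rock.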

Splitting the data into training and test, reserving 20% of the data for test:

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(songs.Lyrics, songs.Genre, test_size=0.2, random_state=42)
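
A plain random split does not guarantee that each genre keeps its share of songs in the test set. scikit-learn's `train_test_split` accepts a `stratify` argument for that purpose; the idea behind it can be sketched in plain Python as follows. The helper `stratified_split` and the toy labels are hypothetical and only illustrate the principle:

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within one class
        n_test = round(len(indices) * test_size)  # this class's share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 50 jazz songs and 100 rap songs
labels = ["jazz"] * 50 + ["rap"] * 100
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Each class contributes 20% of its own songs to the test set, so the class proportions are the same in both parts.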

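Vectorizing the data with two different methods, raw word counts (CountVectorizer) and TF-IDF weights (TfidfVectorizer). Each vectorizer is fitted on the training set only and then applied to the test set: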

In [23]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = CountVectorizer()
cv_train = cv.fit_transform(X_train)
cv_test = cv.transform(X_test)

tfidf = TfidfVectorizer()
tfidf_train = tfidf.fit_transform(X_train)
tfidf_test = tfidf.transform(X_test)
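
To make the difference between the two representations concrete, here is a from-scratch sketch on three made-up documents. It assumes the default settings of `TfidfVectorizer` (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the helper names `count_vector` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical preprocessed lyrics, one string per song
docs = ["love love heaven", "love moonlight", "heaven blue blue"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: number of songs that contain each word
df = Counter(w for doc in tokenized for w in set(doc))

def count_vector(doc):
    # Bag of words: how often each vocabulary word occurs in the song
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf_vector(doc):
    # Smoothed idf, then L2 normalisation (assumed scikit-learn defaults)
    raw = [tf * (math.log((1 + n_docs) / (1 + df[w])) + 1)
           for tf, w in zip(count_vector(doc), vocab)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

print(vocab)                       # ['blue', 'heaven', 'love', 'moonlight']
print(count_vector(tokenized[0]))  # [0, 1, 2, 0]
print([round(x, 3) for x in tfidf_vector(tokenized[0])])
```

In the third document, "blue" occurs twice and appears in no other song, so its TF-IDF weight is more than twice that of "heaven", which is shared with another song; the raw counts alone would give a ratio of exactly two.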

# Logistic Regression

Let's build a logistic regression model, using the count vectorizer data and look at its accuracy.

In [24]:
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42)
lr.fit(cv_train, y_train)
cv_pred = lr.predict(cv_test)
print('cv test')
print(classification_report(y_test, cv_pred))

cv test
              precision    recall  f1-score   support

           0       0.85      0.85      0.85       246
           1       0.62      0.60      0.61       220
           2       0.83      0.82      0.83       272
           3       0.70      0.73      0.72       284

    accuracy                           0.76      1022
   macro avg       0.75      0.75      0.75      1022
weighted avg       0.75      0.76      0.76      1022



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Now let's do the same but with the TF-IDF data.

In [25]:
lr = LogisticRegression(random_state=42)
lr.fit(tfidf_train, y_train)
tfidf_pred = lr.predict(tfidf_test)
print('tf-idf test')
print(classification_report(y_test, tfidf_pred))

tf-idf test
              precision    recall  f1-score   support

           0       0.83      0.84      0.83       246
           1       0.70      0.61      0.65       220
           2       0.88      0.85      0.86       272
           3       0.70      0.79      0.74       284

    accuracy                           0.78      1022
   macro avg       0.78      0.77      0.77      1022
weighted avg       0.78      0.78      0.78      1022
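
The columns of the report are related: the F1 score of a class is the harmonic mean of its precision and recall, and the macro average is the unweighted mean over the four classes. A short check using the rounded figures from the TF-IDF report above (since the inputs are already rounded to two decimals, the last digit is not guaranteed to match in general):

```python
# Per-class precision and recall copied from the TF-IDF report above
precision = [0.83, 0.70, 0.88, 0.70]
recall = [0.84, 0.61, 0.85, 0.79]

def f1(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

f1_scores = [f1(p, r) for p, r in zip(precision, recall)]
macro_f1 = sum(f1_scores) / len(f1_scores)

print([round(s, 2) for s in f1_scores])  # [0.83, 0.65, 0.86, 0.74]
print(round(macro_f1, 2))                # 0.77
```

Class 1 (pop) has the lowest F1 mainly because of its low recall: only 61% of the pop songs in the test set were recognised as pop.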



# Random Forest Classifier
Let's build a random forest classifier model, using the count vectorizer data and look at its accuracy.

In [26]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=3, n_jobs=-1, random_state=21)
forest.fit(cv_train, y_train)
cv_pred = forest.predict(cv_test)
print('cv test')
print(classification_report(y_test, cv_pred))

cv test
              precision    recall  f1-score   support

           0       0.52      0.71      0.60       246
           1       0.42      0.35      0.38       220
           2       0.83      0.63      0.72       272
           3       0.50      0.52      0.51       284

    accuracy                           0.56      1022
   macro avg       0.57      0.55      0.55      1022
weighted avg       0.57      0.56      0.56      1022



Now let's do the same but with the TF-IDF data.

In [27]:
forest.fit(tfidf_train, y_train)
tfidf_pred = forest.predict(tfidf_test)
print('tf-idf test')
print(classification_report(y_test, tfidf_pred))

tf-idf test
              precision    recall  f1-score   support

           0       0.50      0.70      0.58       246
           1       0.45      0.37      0.40       220
           2       0.82      0.65      0.73       272
           3       0.51      0.50      0.50       284

    accuracy                           0.56      1022
   macro avg       0.57      0.55      0.55      1022
weighted avg       0.58      0.56      0.56      1022



# Support Vector Machine

Let's build an SVC model, using the count vectorizer data and look at its accuracy.

In [28]:
from sklearn import svm
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(cv_train, y_train)
cv_pred = clf.predict(cv_test)
print('cv test')
print(classification_report(y_test, cv_pred))

cv test
              precision    recall  f1-score   support

           0       0.80      0.85      0.83       246
           1       0.57      0.58      0.58       220
           2       0.82      0.79      0.80       272
           3       0.68      0.67      0.67       284

    accuracy                           0.72      1022
   macro avg       0.72      0.72      0.72      1022
weighted avg       0.72      0.72      0.72      1022



Now let's do the same but with the TF-IDF data.

In [29]:
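clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(tfidf_train, y_train)
# Store the SVC predictions in tfidf_pred, the variable that the report reads
tfidf_pred = clf.predict(tfidf_test)
print('tf-idf test')
print(classification_report(y_test, tfidf_pred))

Note: the report below is not the SVC result. It was printed from the random forest's `tfidf_pred` while the SVC predictions were assigned to `cv_pred`, which is why it matches the random forest TF-IDF report; rerun the corrected cell to get the SVC figures.
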
tf-idf test
              precision    recall  f1-score   support

           0       0.50      0.70      0.58       246
           1       0.45      0.37      0.40       220
           2       0.82      0.65      0.73       272
           3       0.51      0.50      0.50       284

    accuracy                           0.56      1022
   macro avg       0.57      0.55      0.55      1022
weighted avg       0.58      0.56      0.56      1022



# XGB Model

Let's build an XGB model, using the count vectorizer data and look at its accuracy.

In [30]:
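from xgboost import XGBClassifier
from sklearn.metrics import classification_report

xgb = XGBClassifier(max_depth=10, n_estimators=50)
xgb.fit(cv_train, y_train)
# Predict with the fitted XGBoost model, not with the earlier logistic regression `lr`
cv_pred = xgb.predict(cv_test)
print('cv test')
print(classification_report(y_test, cv_pred))

Note: the report below is not the XGBoost result. It was produced by `lr`, the logistic regression last fitted on the TF-IDF features, applied to the count features; rerun the corrected cell to get the XGBoost figures.
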
cv test
              precision    recall  f1-score   support

           0       0.92      0.44      0.59       246
           1       0.53      0.63      0.57       220
           2       0.56      0.94      0.70       272
           3       0.68      0.43      0.53       284

    accuracy                           0.61      1022
   macro avg       0.67      0.61      0.60      1022
weighted avg       0.67      0.61      0.60      1022



Now let's do the same but with the TF-IDF data.

In [31]:
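xgb = XGBClassifier(max_depth=10, n_estimators=50)
xgb.fit(tfidf_train, y_train)
# Predict with the fitted XGBoost model, not with the earlier logistic regression `lr`
tfidf_pred = xgb.predict(tfidf_test)
print('tf-idf test')
print(classification_report(y_test, tfidf_pred))

Note: the report below is not the XGBoost result. It was produced by `lr` and is therefore identical to the logistic regression TF-IDF report; rerun the corrected cell to get the XGBoost figures.
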
tf-idf test
              precision    recall  f1-score   support

           0       0.83      0.84      0.83       246
           1       0.70      0.61      0.65       220
           2       0.88      0.85      0.86       272
           3       0.70      0.79      0.74       284

    accuracy                           0.78      1022
   macro avg       0.78      0.77      0.77      1022
weighted avg       0.78      0.78      0.78      1022

