# Pop (1) vs. Rock (0) Lyrics Classification (NLP + ML)

## Objective
Build a binary text-classification model in Python to predict music genre using lyrics:
- **Pop = 1**
- **Rock = 0**

---

## 1) Dataset Loading

### Steps
1. Load the provided dataset (lyrics + label).
2. Split the dataset into:
   - **Training set: 70%**
   - **Test set: 30%**
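
The split described above can be sketched as follows. This is a minimal illustration, assuming a DataFrame with `text` and `label` columns; the toy rows, the name `df_demo`, `random_state=42`, and the use of `stratify` are assumptions made for the example, not requirements of the brief.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real CSV (hypothetical rows, for illustration only).
df_demo = pd.DataFrame({
    "text": [f"rock lyric {i}" for i in range(10)] + [f"pop lyric {i}" for i in range(10)],
    "label": [0] * 10 + [1] * 10,
})

# test_size=0.3 gives the 70/30 split; stratify keeps the class ratio
# the same in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    df_demo["text"],
    df_demo["label"],
    test_size=0.3,
    random_state=42,
    stratify=df_demo["label"],
)

print(len(X_tr), len(X_te))
```

Stratifying is optional, but it avoids a test set whose class balance drifts from the training set when the dataset is small.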

---

## 2) Data Pre-processing

### Steps
Apply the following preprocessing to the text:

- Convert text to **lowercase**
- **Tokenize** the text (split into tokens/words)
- Remove:
  - **stopwords**
  - **punctuation**
- Apply **lemmatization** (or stemming)

> **Important:** Fit any learned transformation (for example, the vectorizer vocabulary and IDF weights) on the training set only, then apply the fitted transformation unchanged to the test set, so that nothing is learned from the test data.
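
The preprocessing steps above can be sketched as a single function. To keep the example free of corpus downloads, it swaps in two substitutes that are assumptions of this sketch: scikit-learn's built-in `ENGLISH_STOP_WORDS` list instead of the NLTK stopword corpus, and NLTK's `PorterStemmer` (stemming) instead of WordNet lemmatization. The function name `preprocess` is hypothetical.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

_stemmer = PorterStemmer()  # rule-based, needs no downloaded corpus


def preprocess(text):
    """Lowercase, tokenize, drop stopwords and punctuation, then stem."""
    # \w+ keeps runs of letters, digits and underscores, so punctuation
    # is discarded as part of tokenization.
    tokens = re.findall(r"\w+", text.lower())
    kept = [tok for tok in tokens if tok not in ENGLISH_STOP_WORDS]
    return [_stemmer.stem(tok) for tok in kept]


print(preprocess("The WINDOWS are rolling!"))
```

Because these steps are rule-based and learn nothing from the data, applying the same function to the training and test text does not leak information between the two.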

---

## 3) Feature Engineering

Convert processed text into numerical vectors using **one** (or compare both):

### Option A — Bag of Words (BoW)
- Represents text as word counts
- Common baseline for text classification

### Option B — TF-IDF
- Weights words by importance across documents
- Often improves results vs raw counts for classification

---

## 4) Model Training

Train **all three** binary classifiers:

1. **Logistic Regression**
2. **Linear Support Vector Machine (SVM)**
3. **Naive Bayes**
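
One way to train the three classifiers on the same features is to wrap each in a pipeline. The sketch below is illustrative only: the toy texts, the dictionary names, and the choice of `MultinomialNB` as the Naive Bayes variant are assumptions, not part of the brief.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical, already-preprocessed training texts (0 = rock, 1 = pop).
train_texts = ["rock hell highway", "jungle rock game", "baby love song", "love dream dancing"]
train_labels = [0, 0, 1, 1]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
}

fitted = {}
for name, clf in models.items():
    # Each pipeline fits its own vectorizer on the training texts only.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(train_texts, train_labels)
    fitted[name] = pipe

for name, pipe in fitted.items():
    print(name, pipe.predict(["highway rock", "love song"]))
```

Keeping the vectorizer inside the pipeline means `predict` applies the training-time vocabulary automatically, which helps avoid fitting on test text by mistake.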

---

## 5) Model Evaluation

Evaluate performance on the **test set** using:

- **Precision**
- **Recall**
- **F1-score**

Output using:
- `classification_report`

---

## 6) Conclusions

Write conclusions based on observed results, such as:

- Which model performs best overall (macro/weighted F1)?
- Which model balances precision vs recall better?
- Does TF-IDF outperform BoW (or vice versa)?
- Any common error patterns (e.g., misclassified songs with ambiguous lyrics)?
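
To answer the first and third questions side by side, the macro F1 of every feature/model combination can be collected into one table. The following is a sketch under stated assumptions: the toy texts are invented, and the names `vectorizers`, `classifiers`, and `results` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical preprocessed texts (0 = rock, 1 = pop).
train_texts = ["rock hell highway", "jungle rock game", "pounding rock ground",
               "baby love song", "love dream dancing", "baby dream forever"]
train_labels = [0, 0, 0, 1, 1, 1]
test_texts = ["highway rock jungle", "love song forever"]
test_labels = [0, 1]

# Factories, so that every combination gets a fresh, unfitted estimator.
vectorizers = {"BoW": CountVectorizer, "TF-IDF": TfidfVectorizer}
classifiers = {
    "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC,
    "Naive Bayes": MultinomialNB,
}

rows = []
for vec_name, make_vec in vectorizers.items():
    for clf_name, make_clf in classifiers.items():
        pipe = make_pipeline(make_vec(), make_clf())
        pipe.fit(train_texts, train_labels)
        pred = pipe.predict(test_texts)
        rows.append({
            "features": vec_name,
            "model": clf_name,
            "macro_f1": f1_score(test_labels, pred, average="macro"),
        })

results = pd.DataFrame(rows).sort_values("macro_f1", ascending=False)
print(results)
```

On real data, the rows of this table give the evidence for stating which model and which feature representation performed best.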

---

## Deliverables Checklist ✅
- [ ] Dataset loaded correctly  
- [ ] 70/30 split completed  
- [ ] Preprocessing applied (lowercase, tokenize, stopwords, punctuation, lemma/stem)  
- [ ] Feature extraction (BoW and/or TF-IDF)  
- [ ] Trained 3 models (LR, Linear SVM, Naive Bayes)  
- [ ] Printed classification report for each model  
- [ ] Written conclusions based on metrics and comparisons  

In [1]:
import numpy as np
import pandas as pd
import nltk

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

nltk.download("punkt")
nltk.download("stopwords")
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\300312139\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\300312139\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\300312139\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
df = pd.read_csv("pop_vs_rock_lyrics_classification.csv")
df

Unnamed: 0,text,label
0,"Another one bites the dust, and another's gone.",0
1,"We will, we yeah rock you, pounding our feet o...",0
2,"I'm on the highway to hell, and I'm you know d...",0
3,"Sweet child o' mine, umm got eyes of the blues...",0
4,"oh keep dancing on my own, but now you're gone.",1
...,...,...
1995,"I'm on you know of the world, waiting for you ...",1
1996,"uhh will, we will rock you, pounding our feet ...",0
1997,"Just baby a dream, the best thing in my life.",1
1998,"Knockin' on heaven's door, can't take this any...",0


In [3]:
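X = df['text']
y = df['label']
# NOTE: the brief asks for a 70/30 split (test_size=0.3). The value 0.33333 used here
# produces the 1333/667 split seen in the outputs below.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33333, random_state=42, shuffle=True)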

In [4]:
X_train = X_train.str.lower()
X_test = X_test.str.lower()
X_train

1677    you're beautiful, it's true, i can't take my o...
329     we found love in a hopeless place, bright and uhh
6       you're beautiful, it's true, i can't take baby...
745     knockin' on heaven's door, can't take this any...
1553    you're the one that i want, can't get you out ...
                              ...                        
1130       don't stop believin', hold on to that feeling.
1294    baby, umm a song, you make me wanna roll my wi...
860     baby, oh a song, you make me wanna roll my win...
1459    you're the one baby i want, can't get you out ...
1126         welcome to the jungle, we got fun and games.
Name: text, Length: 1333, dtype: object

In [5]:
def tokenize_words(x):
    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(x)

X_train = X_train.apply(tokenize_words)
X_test = X_test.apply(tokenize_words)
X_train

1677    [you, re, beautiful, it, s, true, i, can, t, t...
329     [we, found, love, in, a, hopeless, place, brig...
6       [you, re, beautiful, it, s, true, i, can, t, t...
745     [knockin, on, heaven, s, door, can, t, take, t...
1553    [you, re, the, one, that, i, want, can, t, get...
                              ...                        
1130    [don, t, stop, believin, hold, on, to, that, f...
1294    [baby, umm, a, song, you, make, me, wanna, rol...
860     [baby, oh, a, song, you, make, me, wanna, roll...
1459    [you, re, the, one, baby, i, want, can, t, get...
1126    [welcome, to, the, jungle, we, got, fun, and, ...
Name: text, Length: 1333, dtype: object

In [6]:
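# Use a distinct name so the imported `stopwords` module is not overwritten
# (otherwise re-running this cell raises an AttributeError).
stop_words = set(stopwords.words('english'))

def remove_stopwords(x):
    return [word for word in x if word not in stop_words]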

X_train = X_train.apply(remove_stopwords)
X_test = X_test.apply(remove_stopwords)

In [7]:
X_train 

1677                      [beautiful, true, take, oh]
329       [found, love, hopeless, place, bright, uhh]
6                 [beautiful, true, take, baby, eyes]
745            [knockin, heaven, door, take, anymore]
1553                      [one, want, get, uhh, head]
                            ...                      
1130                  [stop, believin, hold, feeling]
1294    [baby, umm, song, make, wanna, roll, windows]
860      [baby, oh, song, make, wanna, roll, windows]
1459                     [one, baby, want, get, head]
1126               [welcome, jungle, got, fun, games]
Name: text, Length: 1333, dtype: object

In [8]:
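lemmatizer = WordNetLemmatizer()  # create once and reuse for every row

def lemmatize_tokens(x):
    # The default part of speech is noun, so verb forms such as "going" are left unchanged.
    return [lemmatizer.lemmatize(word) for word in x]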

X_train = X_train.apply(lemmatize_tokens)
X_test = X_test.apply(lemmatize_tokens)
X_train

1677                     [beautiful, true, take, oh]
329      [found, love, hopeless, place, bright, uhh]
6                 [beautiful, true, take, baby, eye]
745           [knockin, heaven, door, take, anymore]
1553                     [one, want, get, uhh, head]
                            ...                     
1130                 [stop, believin, hold, feeling]
1294    [baby, umm, song, make, wanna, roll, window]
860      [baby, oh, song, make, wanna, roll, window]
1459                    [one, baby, want, get, head]
1126               [welcome, jungle, got, fun, game]
Name: text, Length: 1333, dtype: object

In [9]:
X_test

1860      [another, one, bite, dust, another, gone]
353       [another, one, bite, dust, another, gone]
1333                            [yeah, hell, going]
905          [knockin, heaven, door, take, anymore]
1289            [get, satisfaction, yeah, try, try]
                           ...                     
1018    [sweet, child, yeah, got, eye, bluest, sky]
380             [uhh, rock, pounding, foot, ground]
1029        [want, know, moment, feel, light, love]
1688           [baby, rock, pounding, foot, ground]
84                      [highway, umm, hell, going]
Name: text, Length: 667, dtype: object

In [10]:
def join_text(x):
    return ' '.join(x)

X_train = X_train.apply(join_text)
X_test = X_test.apply(join_text)

In [11]:
bow_vectorizer = CountVectorizer(
    analyzer = 'word',
    ngram_range=(1,1)
)

X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)
X_train_bow.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 1, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [12]:
bow_vectorizer.get_feature_names_out()

array(['another', 'anymore', 'baby', 'beautiful', 'believin', 'best',
       'bite', 'bluest', 'bright', 'burst', 'call', 'child', 'color',
       'come', 'dancing', 'denial', 'door', 'dream', 'dust', 'eye',
       'feel', 'feeling', 'firework', 'foot', 'forever', 'found', 'fun',
       'game', 'get', 'going', 'gone', 'gonna', 'got', 'ground', 'head',
       'heaven', 'hell', 'highway', 'hold', 'hopeless', 'jungle', 'keep',
       'knockin', 'know', 'let', 'life', 'light', 'like', 'live', 'love',
       'make', 'mine', 'moment', 'never', 'oh', 'one', 'place',
       'pounding', 'rock', 'roll', 'satisfaction', 'sky', 'smell', 'song',
       'spirit', 'stop', 'sweet', 'take', 'talk', 'teen', 'thing', 'top',
       'true', 'try', 'uhh', 'umm', 'used', 'waiting', 'wanna', 'want',
       'welcome', 'window', 'world', 'yeah'], dtype=object)

In [13]:
tfidf_vectorizer = TfidfVectorizer()

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
X_train_tfidf.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.37857506, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.29119678, ..., 0.40535325, 0.        ,
        0.        ],
       [0.        , 0.        , 0.38187003, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [14]:
tfidf_vectorizer.get_feature_names_out()

array(['another', 'anymore', 'baby', 'beautiful', 'believin', 'best',
       'bite', 'bluest', 'bright', 'burst', 'call', 'child', 'color',
       'come', 'dancing', 'denial', 'door', 'dream', 'dust', 'eye',
       'feel', 'feeling', 'firework', 'foot', 'forever', 'found', 'fun',
       'game', 'get', 'going', 'gone', 'gonna', 'got', 'ground', 'head',
       'heaven', 'hell', 'highway', 'hold', 'hopeless', 'jungle', 'keep',
       'knockin', 'know', 'let', 'life', 'light', 'like', 'live', 'love',
       'make', 'mine', 'moment', 'never', 'oh', 'one', 'place',
       'pounding', 'rock', 'roll', 'satisfaction', 'sky', 'smell', 'song',
       'spirit', 'stop', 'sweet', 'take', 'talk', 'teen', 'thing', 'top',
       'true', 'try', 'uhh', 'umm', 'used', 'waiting', 'wanna', 'want',
       'welcome', 'window', 'world', 'yeah'], dtype=object)

In [15]:
tfidf_list = dict(zip (tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_))

def get_idf_values(x):
    return x[1]

tfidf_sorted = sorted(tfidf_list.items() , key=get_idf_values, reverse = True)
print("Rarest words across documents (highest IDF first)")
for el1, el2 in tfidf_sorted:
    print(el1, ":", el2)

Rarest words across documents (highest IDF first)
top : 4.324736215567678
waiting : 4.283914221047423
world : 4.2641115937512435
call : 4.225645312923447
going : 4.225645312923447
true : 4.188604041243098
forever : 4.17058553574042
live : 4.17058553574042
keep : 4.1528859586410185
satisfaction : 4.1528859586410185
thing : 4.13549421592915
wanna : 4.13549421592915
foot : 4.118399782569849
best : 4.101592664253468
dancing : 4.101592664253468
denial : 4.101592664253468
dream : 4.101592664253468
hell : 4.101592664253468
highway : 4.101592664253468
never : 4.101592664253468
pounding : 4.101592664253468
rock : 4.101592664253468
smell : 4.101592664253468
spirit : 4.101592664253468
teen : 4.101592664253468
place : 4.085063362302257
bright : 4.068802841430477
ground : 4.068802841430477
make : 4.068802841430477
window : 4.068802841430477
found : 4.052802500084036
gonna : 4.052802500084036
head : 4.052802500084036
roll : 4.052802500084036
hopeless : 4.021549956579932
moment : 4.021549956579932
song : 4.021549956579932

In [17]:
logModel = LogisticRegression(max_iter=10000, random_state=0)

logModel.fit(X_train_tfidf, y_train)

y_pred = logModel.predict(X_test_tfidf)
train_acc = accuracy_score(y_train, logModel.predict(X_train_tfidf))
test_acc  = accuracy_score(y_test, y_pred)  # reuse the test predictions computed above

print("Train:", train_acc)
print("Test:", test_acc)

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Train: 1.0
Test: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       332
           1       1.00      1.00      1.00       335

    accuracy                           1.00       667
   macro avg       1.00      1.00      1.00       667
weighted avg       1.00      1.00      1.00       667



In [None]:
logModel2 = LogisticRegression(max_iter=10000, random_state=0)

# Train and evaluate on the Bag-of-Words features (not the TF-IDF ones)
logModel2.fit(X_train_bow, y_train)

y_pred_bow = logModel2.predict(X_test_bow)
train_acc_bow = accuracy_score(y_train, logModel2.predict(X_train_bow))
test_acc_bow  = accuracy_score(y_test, y_pred_bow)

print("Train:", train_acc_bow)
print("Test:", test_acc_bow)

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_bow))

In [18]:
from sklearn.svm import LinearSVC

# Initialize the model
svmModel = LinearSVC(random_state=0)

# Train
svmModel.fit(X_train_tfidf, y_train)

# Predict
y_pred_svm = svmModel.predict(X_test_tfidf)

# Accuracy
train_acc_svm = accuracy_score(y_train, svmModel.predict(X_train_tfidf))
test_acc_svm  = accuracy_score(y_test, y_pred_svm)

print("SVM Train:", train_acc_svm)
print("SVM Test:", test_acc_svm)

# Classification report
print(classification_report(y_test, y_pred_svm))

SVM Train: 1.0
SVM Test: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       332
           1       1.00      1.00      1.00       335

    accuracy                           1.00       667
   macro avg       1.00      1.00      1.00       667
weighted avg       1.00      1.00      1.00       667
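
Perfect scores on both the training and the test set are unusual. One possible explanation, stated here as an assumption rather than a verified finding, is that the same lyric lines occur in both splits; the previews above do show repeated rows (for example, test rows 1860 and 353 are identical after preprocessing). A minimal sketch of such a check, with toy Series (`train_demo`, `test_demo`) standing in for the joined `X_train` and `X_test`:

```python
import pandas as pd

# Hypothetical stand-ins for the preprocessed, space-joined train/test texts.
train_demo = pd.Series(["beautiful true take oh", "one want get uhh head", "stop believin hold feeling"])
test_demo = pd.Series(["one want get uhh head", "yeah hell going", "one want get uhh head"])

# Flag every test row whose exact text also appears in the training set.
overlap = test_demo.isin(set(train_demo))

print("test rows also in train:", int(overlap.sum()), "of", len(test_demo))
print("share of test set:", overlap.mean())
```

If a large share of the test rows also appears in the training data, the reported metrics measure memorization more than generalization, and deduplicating before the split would give a more informative estimate.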



In [19]:
X_test_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 3281 stored elements and shape (667, 84)>

In [20]:
X_test

1860     another one bite dust another gone
353      another one bite dust another gone
1333                        yeah hell going
905        knockin heaven door take anymore
1289          get satisfaction yeah try try
                       ...                 
1018    sweet child yeah got eye bluest sky
380           uhh rock pounding foot ground
1029       want know moment feel light love
1688         baby rock pounding foot ground
84                   highway umm hell going
Name: text, Length: 667, dtype: object

In [21]:
X_train

1677                  beautiful true take oh
329     found love hopeless place bright uhh
6               beautiful true take baby eye
745         knockin heaven door take anymore
1553                   one want get uhh head
                        ...                 
1130              stop believin hold feeling
1294    baby umm song make wanna roll window
860      baby oh song make wanna roll window
1459                  one baby want get head
1126             welcome jungle got fun game
Name: text, Length: 1333, dtype: object