## Fell + Sporleder (2014) n-gram Baseline Replication

In [None]:
import pandas as pd
english = pd.read_csv("../data/poptrag_lyrics_genres_corpus_filtered_english.csv")

### Preprocessing and n-gram Feature Extraction
1. Extract unigram, bigram, and trigram features from the lyrics.
- Lowercase the text and remove punctuation (except apostrophes).
- Do not split contractions like "don't" during tokenization.
2. Calculate tf-idf for all n-grams, treating each genre (cat[n]) as one document; an n-gram's frequency increases by 1 for each track that contains it.
3. Rank all n-grams by tf-idf score within each genre.
4. "Downrank artist-specific n-grams" by removing n-grams that occur in the lyrics of fewer than x (e.g., 50) different artists.
5. Per genre and n, select the top 100 n-grams as binary features (present / not present in the lyrics). This yields at most 2,700 features for 9 genres (3 n-gram orders × 9 genres × 100), possibly fewer because of overlaps.
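
The steps above can be sketched in plain Python. This is a minimal illustration, not the code in `helpers.n_gram_features`: the function names, the toy tracks, and the exact tf-idf variant (track frequency times `log(n_genres / genres_containing_ngram)`) are assumptions. The sketch applies the artist filter before ranking, which is the order the log output of `build_ngram_features` suggests.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and drop punctuation except apostrophes, so "don't" stays one token."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_features(tracks, n, min_artists, top_n):
    """tracks: iterable of (lyrics, genre, artist). Returns {genre: [ngram, ...]}."""
    track_freq = defaultdict(Counter)     # genre -> ngram -> number of tracks containing it
    artists_per_ngram = defaultdict(set)  # ngram -> distinct artists using it
    for lyrics, genre, artist in tracks:
        # set(): each track contributes +1 per n-gram, regardless of repetitions
        for gram in set(ngrams(tokenize(lyrics), n)):
            track_freq[genre][gram] += 1
            artists_per_ngram[gram].add(artist)

    n_genres = len(track_freq)
    # In how many genres ("documents") does each n-gram occur?
    genres_per_ngram = Counter(gram for counts in track_freq.values() for gram in counts)

    selected = {}
    for genre, counts in track_freq.items():
        # Assumed tf-idf variant: track frequency * log(n_genres / genre frequency)
        scores = {
            gram: tf * math.log(n_genres / genres_per_ngram[gram])
            for gram, tf in counts.items()
            if len(artists_per_ngram[gram]) >= min_artists  # drop artist-specific n-grams
        }
        # Rank by score (descending); ties are broken alphabetically for determinism
        ranked = sorted(scores, key=lambda gram: (-scores[gram], gram))
        selected[genre] = ranked[:top_n]
    return selected


# Hypothetical toy corpus: (lyrics, genre, artist)
toy_tracks = [
    ("Don't stop the beat, don't stop!", "pop", "artist_a"),
    ("Don't stop the music", "pop", "artist_b"),
    ("Ride the lightning", "metal", "artist_c"),
    ("Ride the storm, ride!", "metal", "artist_d"),
]
top_bigrams = select_features(toy_tracks, n=2, min_artists=2, top_n=1)
print(top_bigrams)
```

With this plain idf, an n-gram that occurs in every genre scores 0, so very common n-grams would only survive under a smoothed variant; which variant the helper uses is not stated here.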

In [None]:
from helpers.n_gram_features import build_ngram_features
build_ngram_features(corpus=english, granularity=5, min_artists=50, top_n=100)

Extracting n-grams from all lyrics...

✓ Extracted unigrams:
  - Number of unique unigrams: 108,604
  - Matrix shape: (111938, 108604)
  - Example unigrams: ['shully', 'cape', 'allair', "ubangi's", 'freaky']
✓ Extracted bigrams:
  - Number of unique bigrams: 2,052,265
  - Matrix shape: (111938, 2052265)
  - Example bigrams: ['pour smell', 'box city', 'alight just', 'skynyrd tunes', 'faithless night']
✓ Extracted trigrams:
  - Number of unique trigrams: 6,582,416
  - Matrix shape: (111938, 6582416)
  - Example trigrams: ['the twilight in', 'cafés in the', 'aisles fill again', 'wishing wishing further', 'her beauty like']
Calculating tf-idf for combinations of n-grams and G5 genres...

Calculating genre-level TF-IDF for unigrams with cat5 genres ...
✓ Calculated TF-IDF for 211,691 genre-ngram pairs
Calculating genre-level TF-IDF for bigrams with cat5 genres ...
✓ Calculated TF-IDF for 2,920,224 genre-ngram pairs
Calculating genre-level TF-IDF for trigrams with cat5 genres ...
✓ Calculate

100%|██████████| 108604/108604 [00:18<00:00, 5844.69it/s]


✓ Calculated artist diversity for 108,604 n-grams
Counting artists per n-gram...


100%|██████████| 2052265/2052265 [05:29<00:00, 6221.73it/s]


✓ Calculated artist diversity for 2,052,265 n-grams
Counting artists per n-gram...


100%|██████████| 6582416/6582416 [25:12<00:00, 4353.07it/s] 


✓ Calculated artist diversity for 6,582,416 n-grams
Filtering ngrams occurring in at least 50 artists...

Ranking ngrams by genre and tfidf.

Total unique ngrams selected: 148
Total unique ngrams selected: 183
Total unique ngrams selected: 248
Total unique ngrams in final feature set: 579
{"i'm the", 'of the earth', 'i wanna be', "you ain't", 'he', 'on me', 'your life', 'up in my', 'the way you', 'to the ground', 'into the', 'my eyes', 'you were', 'if you', 'from the', 'to see', 'my mind', 'so hard to', 'put', 'oh', 'see', 'the dead', 'i know you', 'so i can', "i don't wanna", 'for you', 'the sun', 'the time', 'out of', 'cause', 'yeah yeah yeah', 'it to the', 'as the', 'where', 'with my', 'if i', 'you know that', 'will', 'i know', 'know what i', 'the way that', 'through the', 'right', 'what', 'i am a', 'again', "don't have to", 'wanna', 'i know i', 'me to the', 'of a', 'feel the', "i ain't", 'left', 'from', 'you see', 'you and i', 'let', "don't know", 'this', 'that i', 'i had to', 'i s

Unnamed: 0,a,a little,a little bit,a lot of,a nigga,a part of,a world of,about,again,ain't,...,you to,you want,you want me,you want to,you were,you will,you're,your,your eyes,your life
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,15,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111933,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
111934,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
111935,0,0,0,0,0,0,0,1,1,0,...,1,0,0,0,0,0,0,0,0,0
111936,4,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
from helpers.n_gram_features import build_ngram_features
build_ngram_features(corpus=english, granularity=12, min_artists=50, top_n=100)

In [None]:
from helpers.n_gram_features import build_ngram_features
build_ngram_features(corpus=english, granularity=25, min_artists=50, top_n=100)

In [None]:
from helpers.n_gram_features import build_ngram_features
build_ngram_features(corpus=english, granularity=32, min_artists=50, top_n=100)

## Train SVM Model
- Train an SVM with a linear kernel on the n-gram count features, with parameter C=1.
- Use a one-vs-rest strategy for multi-class classification.
<!-- - use 5-fold cross validation to evaluate performance
- report accuracy, precision, recall, F1-score per genre and overall -->
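
As a rough sketch of this setup, assuming scikit-learn is available: `LinearSVC` fits one-vs-rest linear classifiers by default, so `C=1` is the only parameter set explicitly. The toy arrays are hypothetical stand-ins for the feature matrix, genre labels, and artist IDs. The artist-disjoint split via `GroupShuffleSplit` is an assumption about how `perform_linear_SVC` separates train and test data, not a description of its actual code.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical toy data: 60 tracks, 8 n-gram count features, 3 genres, 12 artists
X = rng.integers(0, 5, size=(60, 8))
y = np.arange(60) % 3
artists = np.repeat(np.arange(12), 5)

# Hold out about 20% of the artists so that no artist appears on both sides
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=artists))
artist_overlap = set(artists[train_idx]) & set(artists[test_idx])

# Linear SVM with C=1; multi-class handling is one-vs-rest by default
clf = LinearSVC(C=1.0)
clf.fit(X[train_idx], y[train_idx])
predictions = clf.predict(X[test_idx])
accuracy = float((predictions == y[test_idx]).mean())

print(f"Artist overlap (should be 0): {len(artist_overlap)}")
print(f"Accuracy on toy data: {accuracy:.2f}")
```

The toy labels are unrelated to the features, so the accuracy is not meaningful; the sketch only shows the mechanics of the split and the fit.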

In [22]:
import pandas as pd
from helpers.simple_linear_SVC import perform_linear_SVC

labels_and_artists = pd.read_csv("../data/poptrag_lyrics_genres_corpus_filtered_english.csv")
features5 = pd.read_csv("../data/FS_G5_lyrics_n_gram_features.csv")
# features12 = pd.read_csv("../data/FS_G12_lyrics_n_gram_features.csv")
# features25 = pd.read_csv("../data/FS_G25_lyrics_n_gram_features.csv")
# features32 = pd.read_csv("../data/FS_G32_lyrics_n_gram_features.csv")

perform_linear_SVC(features5, labels_and_artists, granularity=5)
# perform_linear_SVC(features12, labels_and_artists, granularity=12)
# perform_linear_SVC(features25, labels_and_artists, granularity=25)
# perform_linear_SVC(features32, labels_and_artists, granularity=32)

Training set size: 89012
Test set size: 22926
Number of unique artists in train: 5389
Number of unique artists in test: 1348
Artist overlap (should be 0): 0


OSError: Cannot save file into a non-existent directory: '..\models\fell_spohrleder_svm'
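
The error means the output directory does not exist when the helper tries to save a file into it. One possible fix is to create the directory before writing. The sketch below demonstrates this in a temporary directory; `ensure_parent_dir` and the file name `results.csv` are hypothetical, since the actual file written by `perform_linear_SVC` is not shown.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path (and any missing ancestors)."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway directory; in the notebook the target would sit
# under ../models/fell_spohrleder_svm instead.
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "models" / "fell_spohrleder_svm" / "results.csv")
    target.write_text("genre,f1\n")  # hypothetical output file
    created = target.exists()

print(created)
```

Calling such a helper inside the SVM routine, just before the save, would keep the notebook runnable on a fresh checkout where `../models` has not been created yet.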