### Preceding Lyrics Cleaning Steps (already done):
- trim whitespace and remove empty lines
- (check for translations if line numbers match)
- remove curly and square brackets and everything in between
- remove asterisks from common profane words
- remove repetition indicators from lyrics (e.g., "2x", "x2", "(3)", "2 x")
- remove position markers wrongfully placed in lyrics (e.g., "chorus", "verse", "hook")
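
The cleaning code itself is not part of this notebook. As a rough illustration only, the bracket, repetition-indicator, and position-marker steps could be sketched as below. The regular expressions and the function name `clean_lyrics` are assumptions made for this example, not the patterns actually used; the translation check and the asterisk step are omitted because they depend on details not given here.

```python
import re

# Hypothetical patterns: the notebook does not show the original ones.
BRACKETS = re.compile(r"\[[^\]]*\]|\{[^}]*\}")  # [..] and {..} with content
REPETITION = re.compile(r"\(\s*\d+\s*\)|\b\d+\s*x\b|\bx\s*\d+\b", re.IGNORECASE)
MARKERS = re.compile(r"^\s*(chorus|verse|hook)\s*\d*\s*:?\s*$", re.IGNORECASE)


def clean_lyrics(text: str) -> str:
    """Apply a subset of the cleaning steps line by line."""
    cleaned = []
    for line in text.splitlines():
        line = BRACKETS.sub("", line)    # drop bracketed annotations
        line = REPETITION.sub("", line)  # drop "2x", "x2", "(3)", "2 x"
        line = re.sub(r"\s+", " ", line).strip()  # normalise whitespace
        if not line or MARKERS.match(line):
            continue  # skip empty lines and stray position markers
        cleaned.append(line)
    return "\n".join(cleaned)


sample = "  [Intro]\nHello world x2\n\nChorus\nGoodbye {laughs} now (3)\n"
print(clean_lyrics(sample))
```

On the sample above this keeps only the two lyric lines, "Hello world" and "Goodbye now". A marker is removed only when it makes up the entire line, so that ordinary lyric lines containing the word "chorus" or "hook" are left untouched.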

In [1]:
import pandas as pd

# read data from CSV file
corpus = pd.read_csv(
    "../data-raw/poptrag_lyrics_genres_corpus_20260118.csv", delimiter=","
)
cols_to_string = [
    c for c in corpus.columns if not (c.startswith("pmax") or c.startswith("nmax"))
]
corpus[cols_to_string] = corpus[cols_to_string].astype("string")
print(corpus.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254334 entries, 0 to 254333
Data columns (total 20 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   track.s.id                254334 non-null  string 
 1   track.s.title             254333 non-null  string 
 2   track.s.firstartist.name  254334 non-null  string 
 3   album.s.title             254327 non-null  string 
 4   album.s.releaseyear       254334 non-null  string 
 5   track.s.popularity        254334 non-null  string 
 6   track.language            240968 non-null  string 
 7   full_lyrics               187850 non-null  string 
 8   cat5                      167721 non-null  string 
 9   pmax5                     167721 non-null  float64
 10  nmax5                     167721 non-null  float64
 11  cat12                     167721 non-null  string 
 12  pmax12                    167721 non-null  float64
 13  nmax12                    167721 non-null  f

### Reduce Corpus for Analysis
- only tracks that have both lyrics and genre information
- only tracks in English
- remove the album whose lyrics consist only of "woof woof"
- remove German "schlager" genre

In [2]:
# filter: track must have an MB label, be in English (German later), and have non-missing lyrics
print(f"initial no. of tracks in data set: {corpus.shape[0]:,}")

english = corpus[
    (corpus["cat5"].notna())
    & (corpus["full_lyrics"].notna())
    & (corpus["track.language"].isin(["English"]))
    & (
        corpus["album.s.title"] != "No Grave but the Sea (Deluxe Edition)"
    )  # contains only "woof woof"
    & (corpus["cat12"] != "schlager")  # German Genre
]

print(f"no. of tracks after filtering: {english.shape[0]:,}")

english.to_csv("../data/poptrag_lyrics_genres_corpus_filtered_english.csv", index=True)

initial no. of tracks in data set: 254,334
no. of tracks after filtering: 111,938
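
To see how much each condition contributes to the drop from 254,334 to 111,938 tracks, the mask can be built up one condition at a time. The following is a minimal sketch on a made-up five-row frame, not on the real corpus; `toy`, `steps`, and `remaining` are hypothetical names. With the `string` dtype, a comparison against a missing value evaluates to `<NA>`, so the sketch fills those explicitly with `True`, which is a choice made for this example rather than something taken from the notebook.

```python
import pandas as pd

# Made-up toy data that mirrors the relevant corpus columns.
toy = pd.DataFrame(
    {
        "cat5": ["rock", None, "pop", "pop", "rock"],
        "cat12": ["metal", None, "schlager", "pop", "rock"],
        "full_lyrics": ["la la", "la", "la", None, "woof woof"],
        "track.language": ["English", "English", "English", "English", "German"],
    }
).astype("string")

# One named boolean condition per filter step, applied in this order.
steps = {
    "has genre label": toy["cat5"].notna(),
    "has lyrics": toy["full_lyrics"].notna(),
    "is English": toy["track.language"].isin(["English"]),
    # Comparison with <NA> yields <NA>; treat missing genres as "not schlager".
    "not schlager": (toy["cat12"] != "schlager").fillna(True),
}

mask = pd.Series(True, index=toy.index)
remaining = []
for name, cond in steps.items():
    mask &= cond.astype(bool)  # cumulative AND over all steps so far
    n = int(mask.sum())
    remaining.append((name, n))
    print(f"{name:<16} {n:>7,} tracks remaining")
```

Applied to `corpus` instead of `toy`, the same loop would report the number of tracks left after each condition; the album-title condition from the cell above could be added to `steps` in the same way.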
