## **4. Preprocessing**

#### **Step 1: Lyrics File (CSV)**

In [45]:
import pandas as pd
import pickle
from slugify import slugify

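artist_name = 'Avenged Sevenfold'
artist_slug = slugify(artist_name)  # slug form of the name, used in the file name

# Load the semicolon-separated lyrics file for this artist
lyric_corpus = pd.read_csv(f'unduhan/{artist_slug}-with-lyrics.csv', sep=";")
# Sort by song title ('Judul'), then drop every column except the lyrics ('Lirik')
lyric_corpus = lyric_corpus.sort_values('Judul')
lyric_corpus = lyric_corpus.drop(columns=['Unnamed: 0', 'Judul', 'album'])

lyric_corpus.head()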

| | Lirik |
|---|---|
| 0 | Watch your tongue or have it cut from your hea... |
| 1 | Skull!\n\nThey all know\nThey all know\n\nSorr... |
| 2 | Finished with my woman, 'cause she couldn't he... |
| 3 | You live your whole life staring at a wall\nYo... |
| 4 | Open, blurry, nurture, loving\nCrawling, walki... |


#### **Step 2: Lyric Tokenization**

In [46]:
from nltk.tokenize import RegexpTokenizer

lyric_corpus_tokenized = []
# Keep runs of word characters only, which drops punctuation
tokenizer = RegexpTokenizer(r'\w+')
for lyric in lyric_corpus['Lirik']:
    # Lowercase each lyric before splitting it into tokens
    tokenized_lyric = tokenizer.tokenize(lyric.lower())
    lyric_corpus_tokenized.append(tokenized_lyric)

lyric_corpus['Token'] = lyric_corpus_tokenized
lyric_corpus.tail()

| | Lirik | Token |
|---|---|---|
| 105 | Come back to me, this is unconceivable\nBreaki... | [come, back, to, me, this, is, unconceivable, ... |
| 106 | We keep writing, talking and planning, but eve... | [we, keep, writing, talking, and, planning, bu... |
| 107 | I sit stoic\nTouch of the divine upon my neck\... | [i, sit, stoic, touch, of, the, divine, upon, ... |
| 108 | I feel insane\nEvery single time I'm asked to ... | [i, feel, insane, every, single, time, i, m, a... |
| 109 | Standing in the shade of altruism, answering t... | [standing, in, the, shade, of, altruism, answe... |
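
To see why single-letter fragments such as `m` appear in the `Token` column, the sketch below approximates the `\w+` pattern with the standard library's `re.findall`. It is an illustration only: it assumes `RegexpTokenizer(r'\w+')` behaves like `re.findall` for this pattern, the sample text is the visible start of row 108 above, and `simple_tokenize` is a hypothetical helper name.

```python
import re

def simple_tokenize(text):
    # Lowercase first, then keep every run of word characters
    # (letters, digits, underscore); everything else acts as a separator.
    return re.findall(r'\w+', text.lower())

sample = "I feel insane\nEvery single time I'm asked"
tokens = simple_tokenize(sample)
print(tokens)
# ['i', 'feel', 'insane', 'every', 'single', 'time', 'i', 'm', 'asked']
```

The apostrophe is not a word character, so `I'm` splits into `i` and `m`. Short fragments like these are removed in Step 3.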


#### **Step 3: Remove Numbers and Words Shorter Than 3 Characters**

Next, we remove numeric tokens and tokens with fewer than 3 characters (such as the melodic sounds 'oh', 'na', 'la', 'da'), which can noticeably distort the topic modeling results.

In [47]:
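for s, song in enumerate(lyric_corpus_tokenized):
    filtered_song = []
    for token in song:
        # Keep tokens with at least 3 characters that are not purely numeric
        if len(token) > 2 and not token.isnumeric():
            filtered_song.append(token)
    # Replace the song's token list with the filtered one
    lyric_corpus_tokenized[s] = filtered_song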
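
As a small illustration of the filter above, the sketch below applies the same condition to a made-up token list. The tokens and the helper name `keep_token` are assumptions for demonstration, not taken from the corpus.

```python
def keep_token(token):
    # Same rule as the cell above: at least 3 characters and not purely numeric.
    return len(token) > 2 and not token.isnumeric()

sample_tokens = ['oh', 'na', 'walk', 'the', 'road', '1999', 'm', 'la', '4th']
filtered = [t for t in sample_tokens if keep_token(t)]
print(filtered)
# ['walk', 'the', 'road', '4th']
```

Mixed tokens such as `4th` survive because `str.isnumeric()` is only true when every character is numeric, and three-letter words such as `the` are left for the stop word step.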

#### **Step 4: Lemmatization**

In [48]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for s, song in enumerate(lyric_corpus_tokenized):
    lemmatized_tokens = []
    for token in song:
        # lemmatize() treats each token as a noun by default (pos='n')
        lemmatized_tokens.append(lemmatizer.lemmatize(token))
    lyric_corpus_tokenized[s] = lemmatized_tokens

#### **Step 5: Remove Stop Words**

Finally, we remove all words that carry no particular meaning for topic modeling in order to reduce noise. We use NLTK's English stop word list, since all of the lyrics are in English.

In [49]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
# Extend the NLTK list with vocal fillers that are common in lyrics
new_stop_words = ['ooh','yeah','hey','whoa','woah', 'ohh', 'hmm']
stop_words.extend(new_stop_words)
for s, song in enumerate(lyric_corpus_tokenized):
    filtered_text = []
    for token in song:
        # Keep only tokens that are not stop words
        if token not in stop_words:
            filtered_text.append(token)
    lyric_corpus_tokenized[s] = filtered_text
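
The membership test `token not in stop_words` scans a list on every check. Converting the list to a `set` gives the same result with constant-time lookups. The sketch below uses a tiny hand-written stop word set as a stand-in for the NLTK list, an assumption made so the example runs without downloading NLTK data. The sample songs are made up as well.

```python
# Stand-in for stopwords.words('english') plus the custom additions;
# this is NOT the real NLTK list.
stop_word_set = {'the', 'and', 'you', 'your', 'yeah', 'whoa'}

songs = [
    ['watch', 'your', 'tongue', 'yeah'],
    ['you', 'live', 'your', 'whole', 'life', 'whoa'],
]

# Rebuild each song, keeping only tokens that are not stop words.
filtered_songs = [[t for t in song if t not in stop_word_set] for song in songs]
print(filtered_songs)
# [['watch', 'tongue'], ['live', 'whole', 'life']]
```

With the real list, the equivalent change would be `stop_word_set = set(stop_words)` after the `extend` call.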

#### **Step 6: Pickle It!**

We bundle the tokens into a single pickle file for later use.

In [52]:
with open('pickle/tokenized.pkl', 'wb') as f:
    pickle.dump(lyric_corpus_tokenized, f)

At this point, the output of the preprocessing is the pickle file `tokenized.pkl`.

In [53]:
print(lyric_corpus_tokenized)
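
For the later steps, the token lists can be read back with `pickle.load`. The sketch below is a self-contained round trip that writes to a temporary directory instead of `pickle/tokenized.pkl`, and the sample tokens are made up. Note that `pickle.load` should only be used on files you trust.

```python
import os
import pickle
import tempfile

# Made-up stand-in for lyric_corpus_tokenized: one token list per song.
sample_tokens = [['watch', 'tongue'], ['live', 'whole', 'life']]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tokenized.pkl')
    # Write the nested list in binary mode, as in Step 6.
    with open(path, 'wb') as f:
        pickle.dump(sample_tokens, f)
    # Read it back in binary mode.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == sample_tokens)
```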

