# Project Pipeline

Lucovica Schaerf, Antònio Mendes, Jaël Kortekaas

A large part of our code is adapted from: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

## Introduction

This file contains the preprocessing pipeline for our project.
First, we import the data and filter out all the songs that we do not need
for our analysis. Second, we implement the 'standard' pipeline. Once we have
the most common words per album, author, year, and so on, we move to another
file for the clustering and topic analysis.
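
The goal stated above, the most common words per album, author or year, can be sketched as follows. This is only an illustration on made-up records, assuming the lyrics have already been tokenized. The helper `most_common_per_group`, the record layout and the sample data are hypothetical and not part of the project code.

```python
from collections import Counter, defaultdict

def most_common_per_group(records, key, n=3):
    """Return the n most frequent tokens for each value of `key`.

    `records` is a list of dicts with a `tokens` list; this layout is an
    assumption made for the example, not the project's actual data format.
    """
    counts = defaultdict(Counter)
    for record in records:
        counts[record[key]].update(record["tokens"])
    return {group: counter.most_common(n) for group, counter in counts.items()}

# Made-up toy data, for illustration only.
records = [
    {"artist": "a", "year": 1970, "tokens": ["star", "man", "star"]},
    {"artist": "a", "year": 1971, "tokens": ["star", "sky"]},
    {"artist": "b", "year": 1970, "tokens": ["iron", "man", "iron"]},
]

top_by_artist = most_common_per_group(records, key="artist", n=1)
print(top_by_artist)
```

Passing `key="year"` instead groups the same records by year.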

## Import

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

plot_dir = Path("./figures")
data_dir = Path("./data")

In [26]:
songs = pd.read_csv(data_dir / 'lyrics.csv', encoding='utf-8')
    
print(songs.columns)

Index(['index', 'song', 'year', 'artist', 'genre', 'lyrics'], dtype='object')


In [27]:
artists = ['joy-division', 'metallica', 'black-sabbath', 'pink-floyd', 'david-bowie']

david_bowie = songs[songs['artist'] == 'david-bowie']
black_sabbath = songs[songs['artist'] == 'black-sabbath']

print(david_bowie.shape, black_sabbath.shape)

(599, 6) (210, 6)
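
The `artists` list above is defined but the cell only filters two artists by hand. One possible way to select all listed artists at once is `Series.isin`. The sketch below runs on a small made-up DataFrame (`toy_songs` is hypothetical and stands in for the real `songs` table).

```python
import pandas as pd

# Made-up rows standing in for lyrics.csv; only the 'artist' column matters here.
toy_songs = pd.DataFrame({
    "song": ["s1", "s2", "s3", "s4"],
    "artist": ["david-bowie", "some-other-artist", "metallica", "david-bowie"],
})
artists = ["joy-division", "metallica", "black-sabbath", "pink-floyd", "david-bowie"]

# Keep only the rows whose artist appears in the list.
selected = toy_songs[toy_songs["artist"].isin(artists)]

# Number of songs per selected artist.
songs_per_artist = selected["artist"].value_counts()
print(songs_per_artist)
```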


## Processing Pipeline

In [8]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# filter out common filler words/sounds and words describing song structure
stop_words.extend(['oh', 'yeah', 'hey', 'doo', 'oo', 'uh', 'la', 'verse', 'chorus', 'bridge'])
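
The stop-word list is not yet applied in this part of the notebook. A minimal sketch of how it could be applied to a tokenized line is shown below. Because NLTK's English stop words are lowercase, the comparison lowercases each token first. The helper `remove_stop_words` and the short `toy_stop_words` list are hypothetical stand-ins.

```python
def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in `stop_words`."""
    stop_set = set(stop_words)  # set membership is faster than list lookup
    return [token for token in tokens if token.lower() not in stop_set]

# A tiny stand-in list; the notebook uses NLTK's English list plus its own additions.
toy_stop_words = ["oh", "how", "you", "yeah"]

tokens = ["Oh", "baby", "how", "you", "doing"]
filtered = remove_stop_words(tokens, toy_stop_words)
print(filtered)
```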

In [94]:
import re
import string

lyrics = songs['lyrics'].tolist()
# replace hyphens with a space: they often join filler words/sounds (e.g. "la-la"), and splitting them lets the stop-word filter catch these later
lyrics = [re.sub(r'-', ' ', str(lyric)) for lyric in lyrics]
lyrics = [re.sub(r'[.,?!()]', '', str(lyric)) for lyric in lyrics] # remove common punctuation: . , ? ! ( )
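
The cell above imports `string` but does not use it. If the intent is to remove every ASCII punctuation character, one option is `str.translate` with `string.punctuation`. This is a sketch of an alternative, not the method used in this notebook. Note that it also deletes apostrophes, which the cell above leaves in place, so contractions such as "don't" would become "dont" before tokenization.

```python
import string

# Translation table that deletes every character in string.punctuation.
# Caveat: this includes apostrophes and hyphens.
strip_punct = str.maketrans("", "", string.punctuation)

sample = "Why? Well, you got the key (to my heart)!"
cleaned = sample.translate(strip_punct)
print(cleaned)
```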

In [109]:
# Sentence splitting in songs is hard because sentence ends are not marked with periods.
# We split on newlines instead, as a line break is the closest indicator of a sentence boundary.

sent_split_lyrics = []

for lyric in lyrics[0:10]:
    sent_split_lyric = lyric.split('\n')
    sent_split_lyrics.append(sent_split_lyric)
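
Splitting on `'\n'` yields an empty string for every blank line, for example between verses. A small variant that drops blank lines and trims surrounding whitespace could look like this. The helper `split_lines` and the sample text are hypothetical.

```python
def split_lines(lyric):
    """Split a lyric into non-empty, whitespace-trimmed lines."""
    return [line.strip() for line in lyric.splitlines() if line.strip()]

sample_lyric = "Oh baby how you doing\n\nIt's on baby let's get lost \n"
lines = split_lines(sample_lyric)
print(lines)
```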

In [110]:
from nltk.tokenize import word_tokenize

lyrics_words = []

for lyric in sent_split_lyrics:
    sentence_words = []  # start a fresh list per song so sentences do not accumulate across songs
    for sentence in lyric:
        sentence_words.append(word_tokenize(sentence))
    lyrics_words.append(sentence_words)
    
print(lyrics_words[:10])

[[['Oh', 'baby', 'how', 'you', 'doing'], ['You', 'know', 'I', "'m", 'gon', 'na', 'cut', 'right', 'to', 'the', 'chase'], ['Some', 'women', 'were', 'made', 'but', 'me', 'myself'], ['I', 'like', 'to', 'think', 'that', 'I', 'was', 'created', 'for', 'a', 'special', 'purpose'], ['You', 'know', 'what', "'s", 'more', 'special', 'than', 'you', 'You', 'feel', 'me'], ['It', "'s", 'on', 'baby', 'let', "'s", 'get', 'lost'], ['You', 'do', "n't", 'need', 'to', 'call', 'into', 'work', "'cause", 'you', "'re", 'the', 'boss'], ['For', 'real', 'want', 'you', 'to', 'show', 'me', 'how', 'you', 'feel'], ['I', 'consider', 'myself', 'lucky', 'that', "'s", 'a', 'big', 'deal'], ['Why', 'Well', 'you', 'got', 'the', 'key', 'to', 'my', 'heart'], ['But', 'you', 'ai', "n't", 'gon', 'na', 'need', 'it', 'I', "'d", 'rather', 'you', 'open', 'up', 'my', 'body'], ['And', 'show', 'me', 'secrets', 'you', 'did', "n't", 'know', 'was', 'inside'], ['No', 'need', 'for', 'me', 'to', 'lie'], ['It', "'s", 'too', 'big', 'it', "'s", '