# Contents

[1. Data Preparation](#Data-Preparation)  
&emsp;[a. Read Data](#Read-Data)  
&emsp;[b. Recommended Pre-Processing](#Recommended-Pre-Processing)  
&emsp;&emsp;[i. Filling Gaps](#Filling-Gaps)  
&emsp;&emsp;[ii. Lyrics Normalisation](#Lyrics-Normalisation)  
[2. Filtering Data](#Filtering-Data)  
&emsp;[a. Get Unique Tracks](#Get-Unique-Tracks) 
    

# Data Preparation

## Read Data

In [43]:
import pandas as pd
df = pd.read_csv('lyrics.csv')


## Recommended Pre-Processing

### Filling Gaps

In [44]:
# mark all missing fields explicitly as 'NA'
df = df.fillna('NA')

# parse the album release date; values that cannot be parsed,
# such as the 'NA' placeholder, become NaT
df['album_rd'] = pd.to_datetime(df['album_rd'], errors='coerce')

# drop skits and album notes, which have no lyrics
df = df[~df['eng_track_title'].str.contains('skit|note', case=False)]
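
To illustrate the title filter above, here is a minimal sketch on a made-up DataFrame. Only the column name `eng_track_title` comes from the notebook; the titles and the names `toy`, `mask` and `kept` are invented for the example. It uses a single `'skit|note'` pattern, which should behave the same as chaining two separate `str.contains` calls.

```python
import pandas as pd

# Toy data: these titles are invented for illustration only
toy = pd.DataFrame({
    'eng_track_title': ['First Song', 'Skit: Backstage', 'Album Notes', 'Second Song'],
})

# One case-insensitive regex that matches either 'skit' or 'note'
mask = toy['eng_track_title'].str.contains('skit|note', case=False)

# Keep only the rows whose title matches neither word
kept = toy[~mask]

print(kept['eng_track_title'].tolist())  # ['First Song', 'Second Song']
```

Note that this is a substring match, so a title such as "Keynote" would also be dropped. Whether that matters depends on the titles in the real data.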


### Lyrics Normalisation

Method adapted from :

In [45]:
import re


def normalise(text, remove_punc=True):
    """Lowercase the text and, optionally, replace punctuation with spaces."""
    text = text.lower()

    if remove_punc:
        # replace every non-word character (anything except letters, digits and '_') with a space
        text = re.sub(r'\W', ' ', text)
        # collapse the runs of whitespace left behind by the removed punctuation
        text = re.sub(r'\s+', ' ', text)

    # strip last, so that trailing punctuation does not leave a trailing space
    return text.strip()


In [46]:
# normalise lyrics
df['lyrics'] = df['lyrics'].apply(normalise)
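
The effect of the normalisation is easiest to see on a single string. The sketch below re-declares a version of `normalise` (lowercase, replace non-word characters, collapse whitespace, strip) so that it runs on its own; the sample line is made up. It shows two side effects of `\W` that are worth knowing about: apostrophes are replaced, so contractions are split, and underscores are kept, because `_` counts as a word character.

```python
import re


def normalise(text, remove_punc=True):
    """Standalone version of the normalisation steps, for illustration."""
    text = text.lower()
    if remove_punc:
        # every non-word character becomes a space
        text = re.sub(r'\W', ' ', text)
        # collapse repeated whitespace into a single space
        text = re.sub(r'\s+', ' ', text)
    return text.strip()


# Made-up lyric line with padding, a contraction, a newline and an underscore
sample = "  Don't stop,\nkeep_going  "

cleaned = normalise(sample)
kept_punc = normalise(sample, remove_punc=False)

print(repr(cleaned))    # 'don t stop keep_going'
print(repr(kept_punc))  # "don't stop,\nkeep_going"
```

If contractions need to stay intact for later analysis, one option is a character class that excludes the apostrophe, but that is a design choice outside what the notebook does.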


# Filtering Data

## Get Unique Tracks

1. Using Pandas' built-in duplicates

In [49]:
df.drop_duplicates(subset='track_title', inplace=True)


2. Using the 'repackaged' column

Duplicated tracks are labelled `True` in the `repackaged` column.

In [50]:
df = df[~df['repackaged']]
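
The two approaches do not always give the same result. The following sketch compares them on an invented DataFrame; the column names `track_title` and `repackaged` come from the notebook, while `album`, the rows and the variable names are assumptions made for the example. It also assumes `repackaged` holds real booleans, since `~` does not negate a column of strings.

```python
import pandas as pd

# Toy data: invented rows, with one re-released track and one title shared by two different tracks
toy = pd.DataFrame({
    'album': ['Album A', 'Album A', 'Album A (Repackage)', 'Album B'],
    'track_title': ['Intro', 'Song One', 'Song One', 'Intro'],
    'repackaged': [False, False, True, False],
})

# Approach 1: keep the first row for each title
by_title = toy.drop_duplicates(subset='track_title')

# Approach 2: drop the rows flagged as repackaged
by_flag = toy[~toy['repackaged']]

print(by_title['track_title'].tolist())  # ['Intro', 'Song One']
print(by_flag['track_title'].tolist())   # ['Intro', 'Song One', 'Intro']
```

In this toy case, `drop_duplicates` also removes the second "Intro", even though it belongs to a different album, whereas the `repackaged` flag removes only the re-release. Which behaviour is wanted depends on how titles are reused in the real data.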
