# Contents
* [Data Preprocessing](#Preprocessing)
    * [New features](#newfeats)
    * [Refine lyrics](#refine)

# Data Preprocessing <a class="anchor" id="Preprocessing"></a>

In [1]:
# IMPORT DEPENDENCIES 
import pandas as pd
import numpy as np
import pycountry_convert as pc
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') #download resources for the lemmatizer
from textblob import Word
from unidecode import unidecode

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kayan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kayan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# READ DATA 
song_data = pd.read_csv('../Data/merged_final_top100.csv') 

# REMOVE SOME COLUMNS 
song_data = song_data.drop(['Unnamed: 0', 'id', 'previous_rank','peak_rank'], axis=1) #id = track_id

# CHECK VARIABLE TYPES 
song_data.dtypes

track_id               object
artist_names           object
track_name             object
source                 object
rank                    int64
weeks_on_chart          int64
streams                 int64
country                object
danceability          float64
energy                float64
key                   float64
loudness              float64
mode                  float64
speechiness           float64
acousticness          float64
instrumentalness      float64
liveness              float64
valence               float64
tempo                 float64
duration_ms           float64
time_signature        float64
album_release_date     object
lyrics                 object
lyrics_trans           object
dtype: object

In [3]:
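# REPLACE PLACEHOLDER STRINGS ('None' / 'none') WITH NULL 
# assign the result back: replace(..., inplace=True) on a selected column
# is not guaranteed to update song_data in recent pandas versions
song_data['lyrics'] = song_data['lyrics'].replace('None', np.nan)
song_data['lyrics_trans'] = song_data['lyrics_trans'].replace('none', np.nan)
# COUNT NULL VALUES IN EACH COLUMN 
song_data.isna().sum()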

track_id                0
artist_names            0
track_name              0
source                  0
rank                    0
weeks_on_chart          0
streams                 0
country                 0
danceability            1
energy                  1
key                     1
loudness                1
mode                    1
speechiness             1
acousticness            1
instrumentalness        1
liveness                1
valence                 1
tempo                   1
duration_ms             1
time_signature          1
album_release_date      1
lyrics                489
lyrics_trans          489
dtype: int64

## New features <a class="anchor" id="newfeats"></a>

Several new features are added to the dataframe: the continent and the ISO alpha-3 country code of each chart's country (for mapping purposes), and the number of words in the original and translated lyrics.

**Continent and country codes** 

In [4]:
def get_continent(country_name):
    '''get continent names'''
    country_alpha2 = pc.country_name_to_country_alpha2(country_name)
    continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    continent_name = pc.convert_continent_code_to_continent_name(continent_code)
    return continent_name
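
`get_continent` is applied row by row, although the chart data repeats each country name many times. Below is a minimal sketch that resolves each distinct name once and tolerates names the lookup does not recognise. The names `map_unique`, `toy_lookup` and `default` are hypothetical. The sketch assumes the lookup signals an unknown name with a `KeyError`, which should be checked against `pycountry_convert` before relying on it.

```python
def map_unique(values, lookup, default=None):
    """Call lookup once per distinct value and reuse the result for repeats."""
    cache = {}
    for value in values:
        if value not in cache:
            try:
                cache[value] = lookup(value)
            except KeyError:  # assumed failure mode for unknown names
                cache[value] = default
    return [cache[value] for value in values]


# toy stand-in for get_continent, for illustration only
toy_table = {"Argentina": "South America", "Vietnam": "Asia"}

def toy_lookup(name):
    return toy_table[name]

countries = ["Vietnam", "Vietnam", "Argentina", "Atlantis"]
continents = map_unique(countries, toy_lookup)
```

With the real function this could be used as `song_data['continent'] = map_unique(song_data['country'], get_continent)`.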

In [5]:
# ADD NEW DATAFRAME COLUMNS 
song_data['continent']= song_data['country'].apply(get_continent) #add to df
song_data['iso_alpha3']= [pc.country_name_to_country_alpha3(i, cn_name_format="default") for i in song_data['country']]
song_data

Unnamed: 0,track_id,artist_names,track_name,source,rank,weeks_on_chart,streams,country,danceability,energy,...,liveness,valence,tempo,duration_ms,time_signature,album_release_date,lyrics,lyrics_trans,continent,iso_alpha3
0,0yLdNVWF3Srea0uzk55zFn,Miley Cyrus,Flowers,Columbia,1,5,124198,United Arab Emirates,0.707,0.681,...,0.0322,0.646,117.999,200455.0,4.0,2023-01-13,"We were good, we were gold\nKinda dream that c...",we were good we were gold kinda dream that can...,Asia,ARE
1,1Qrg8KqiBpW07V7PNxwwwL,SZA,Kill Bill,Top Dawg Entertainment/RCA Records,2,10,106927,United Arab Emirates,0.644,0.735,...,0.1610,0.418,88.980,153947.0,4.0,2022-12-08,I'm still a fan even though I was salty\nHate ...,im still a fan even though i was salty hate to...,Asia,ARE
2,6AQbmUe0Qwf5PZnt4HmTXv,"PinkPantheress, Ice Spice",Boy's a liar Pt. 2,Warner Records,3,2,83627,United Arab Emirates,0.696,0.809,...,0.2480,0.857,132.962,131013.0,4.0,2023-02-03,Take a look inside your heart\nIs there any ro...,take a look inside your heart is there any roo...,Asia,ARE
3,0WtM2NBVQNNJLh6scP13H8,"Rema, Selena Gomez",Calm Down (with Selena Gomez),Mavin Records / Jonzing World,4,25,79714,United Arab Emirates,0.801,0.806,...,0.1140,0.802,106.999,239318.0,4.0,2022-08-25,"Vibez\nOh, no\nAnother banger\nBaby, calm down...",vibez oh no another banger baby calm down calm...,Asia,ARE
4,2dHHgzDwk4BJdRwy9uXhTO,"Metro Boomin, The Weeknd, 21 Savage",Creepin' (with The Weeknd & 21 Savage),Republic Records,5,11,79488,United Arab Emirates,0.715,0.620,...,0.0822,0.172,97.950,221520.0,4.0,2022-12-02,"Ooh, ooh-ooh\nOoh-ooh-ooh, ooh, ooh-ooh (Just ...",ooh oohooh oohoohooh ooh oohooh just cant beli...,Asia,ARE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7295,7ErtOGQ9DwyQa3lwP77j4u,Ruger,Asiwaju,Columbia,96,4,54026,South Africa,0.727,0.600,...,0.1060,0.754,199.796,216000.0,4.0,2022-11-14,Cook that thing\nMan getting high till I fade ...,cook that thing man getting high till i fade o...,Africa,ZAF
7296,4EI8VuxUuIHKfafU72emqz,Mariah Carey,We Belong Together,Island Records,97,50,53828,South Africa,0.840,0.476,...,0.0865,0.767,139.987,201400.0,4.0,2005,"Sweet love, yeah\nI didn't mean it when I said...",sweet love yeah i didnt mean it when i said i ...,Africa,ZAF
7297,3Puq6i4xIRH4lrPvJxIC83,"Deep London, Nkosazana Daughter, Murumba Pitch...",Piano Ngijabulise,Cycad Wave,98,14,53752,South Africa,0.835,0.454,...,0.0241,0.433,112.010,416037.0,4.0,2022-09-30,Okokuqala ukuhlakanipha\nUkumesaba uJehova\nAy...,first is wisdom to fear jehovah they dont hear...,Africa,ZAF
7298,7DQMBUK4oX9gV1qIzpoRz6,Aymos,Mama,DJs Production,99,14,53733,South Africa,0.802,0.469,...,0.0895,0.314,113.008,450304.0,4.0,2022-08-12,Mama mama mama mama\nMama mama mama mama\nMama...,mother mother mother mother mother mother moth...,Africa,ZAF


**Length of lyrics** 

The number of words in the original lyrics is usually close to the number of words in the translated lyrics, but the two counts can differ for several reasons. A sentence or phrase may need more or fewer words in English than in the original language. Some words or phrases have no direct English equivalent and take several words to paraphrase, which raises the word count of the translation. Conversely, a single English word can cover a meaning that takes several words in the original language, which lowers it. Finally, how literally the translation library (`googletrans`) renders each line also affects the length of the translation.

In [6]:
# ADD NUMBER OF WORDS IN LYRICS AS FEATURES 
count1 = [len(str(s).split()) for s in song_data['lyrics']] #original language 
count2 = [len(str(s).split()) for s in song_data['lyrics_trans']] #translated to english 
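
Called without arguments, `str.split()` splits on any run of whitespace, so the newlines in the original lyrics separate words just as spaces do. A side effect of wrapping each value in `str()` is that a missing value becomes the string 'nan' and is counted as one word; this is corrected a few cells further down. A small sketch of both behaviours (the helper name `count_words` and the sample text are hypothetical):

```python
def count_words(value):
    """Whitespace word count, mirroring len(str(s).split())."""
    return len(str(value).split())

sample = "first line here\nsecond line"
missing = float("nan")

n_sample = count_words(sample)    # newline and spaces both separate words
n_missing = count_words(missing)  # str(nan) is 'nan', counted as one word
```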

In [7]:
# CHECK WHETHER THE NUMBER OF WORDS IN THE ORIGINAL EQUALS THE TRANSLATED LYRICS 
for n_orig, n_trans in zip(count1, count2): 
    if n_orig != n_trans: 
        print('original: ' + str(n_orig) + ' | ' + 'translated: ' + str(n_trans))

original: 458 | translated: 456
original: 691 | translated: 690
original: 340 | translated: 339
original: 260 | translated: 257
original: 496 | translated: 541
original: 173 | translated: 148
original: 641 | translated: 622
original: 425 | translated: 391
original: 406 | translated: 466
original: 269 | translated: 295
original: 358 | translated: 346
original: 261 | translated: 263
original: 231 | translated: 232
original: 360 | translated: 261
original: 687 | translated: 686
original: 99 | translated: 260
original: 687 | translated: 671
original: 150 | translated: 123
original: 219 | translated: 279
original: 426 | translated: 470
original: 347 | translated: 374
original: 406 | translated: 466
original: 401 | translated: 447
original: 412 | translated: 475
original: 276 | translated: 323
original: 279 | translated: 314
original: 661 | translated: 714
original: 465 | translated: 520
original: 251 | translated: 268
original: 286 | translated: 318
original: 488 | translated: 565
original:
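
Printing every mismatch produces a long listing. One possible alternative is to summarise the differences instead. The function name `summarize_count_diffs` and the choice of statistics are assumptions made for this sketch, and the sample values are taken from outputs shown in this notebook.

```python
def summarize_count_diffs(orig_counts, trans_counts):
    """Summarise how far translated word counts deviate from the originals."""
    diffs = [t - o for o, t in zip(orig_counts, trans_counts) if t != o]
    if not diffs:
        return {"mismatches": 0, "max_abs_diff": 0, "longer_translations": 0}
    return {
        "mismatches": len(diffs),
        "max_abs_diff": max(abs(d) for d in diffs),
        "longer_translations": sum(d > 0 for d in diffs),
    }

# (original, translated) pairs: 458/456, 691/690, 340/339 and one equal pair
summary = summarize_count_diffs([458, 691, 340, 334], [456, 690, 339, 334])
```

With the full data the call would be `summarize_count_diffs(count1, count2)`.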

In [8]:
# ADD TO DATAFRAME AS NEW COLUMNS 
song_data['len_words_orig']= count1
song_data['len_words_trans']= count2

In [9]:
song_data

Unnamed: 0,track_id,artist_names,track_name,source,rank,weeks_on_chart,streams,country,danceability,energy,...,tempo,duration_ms,time_signature,album_release_date,lyrics,lyrics_trans,continent,iso_alpha3,len_words_orig,len_words_trans
0,0yLdNVWF3Srea0uzk55zFn,Miley Cyrus,Flowers,Columbia,1,5,124198,United Arab Emirates,0.707,0.681,...,117.999,200455.0,4.0,2023-01-13,"We were good, we were gold\nKinda dream that c...",we were good we were gold kinda dream that can...,Asia,ARE,334,334
1,1Qrg8KqiBpW07V7PNxwwwL,SZA,Kill Bill,Top Dawg Entertainment/RCA Records,2,10,106927,United Arab Emirates,0.644,0.735,...,88.980,153947.0,4.0,2022-12-08,I'm still a fan even though I was salty\nHate ...,im still a fan even though i was salty hate to...,Asia,ARE,362,362
2,6AQbmUe0Qwf5PZnt4HmTXv,"PinkPantheress, Ice Spice",Boy's a liar Pt. 2,Warner Records,3,2,83627,United Arab Emirates,0.696,0.809,...,132.962,131013.0,4.0,2023-02-03,Take a look inside your heart\nIs there any ro...,take a look inside your heart is there any roo...,Asia,ARE,372,372
3,0WtM2NBVQNNJLh6scP13H8,"Rema, Selena Gomez",Calm Down (with Selena Gomez),Mavin Records / Jonzing World,4,25,79714,United Arab Emirates,0.801,0.806,...,106.999,239318.0,4.0,2022-08-25,"Vibez\nOh, no\nAnother banger\nBaby, calm down...",vibez oh no another banger baby calm down calm...,Asia,ARE,495,495
4,2dHHgzDwk4BJdRwy9uXhTO,"Metro Boomin, The Weeknd, 21 Savage",Creepin' (with The Weeknd & 21 Savage),Republic Records,5,11,79488,United Arab Emirates,0.715,0.620,...,97.950,221520.0,4.0,2022-12-02,"Ooh, ooh-ooh\nOoh-ooh-ooh, ooh, ooh-ooh (Just ...",ooh oohooh oohoohooh ooh oohooh just cant beli...,Asia,ARE,458,456
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7295,7ErtOGQ9DwyQa3lwP77j4u,Ruger,Asiwaju,Columbia,96,4,54026,South Africa,0.727,0.600,...,199.796,216000.0,4.0,2022-11-14,Cook that thing\nMan getting high till I fade ...,cook that thing man getting high till i fade o...,Africa,ZAF,591,591
7296,4EI8VuxUuIHKfafU72emqz,Mariah Carey,We Belong Together,Island Records,97,50,53828,South Africa,0.840,0.476,...,139.987,201400.0,4.0,2005,"Sweet love, yeah\nI didn't mean it when I said...",sweet love yeah i didnt mean it when i said i ...,Africa,ZAF,497,497
7297,3Puq6i4xIRH4lrPvJxIC83,"Deep London, Nkosazana Daughter, Murumba Pitch...",Piano Ngijabulise,Cycad Wave,98,14,53752,South Africa,0.835,0.454,...,112.010,416037.0,4.0,2022-09-30,Okokuqala ukuhlakanipha\nUkumesaba uJehova\nAy...,first is wisdom to fear jehovah they dont hear...,Africa,ZAF,256,414
7298,7DQMBUK4oX9gV1qIzpoRz6,Aymos,Mama,DJs Production,99,14,53733,South Africa,0.802,0.469,...,113.008,450304.0,4.0,2022-08-12,Mama mama mama mama\nMama mama mama mama\nMama...,mother mother mother mother mother mother moth...,Africa,ZAF,410,466


In [10]:
# PRINT TRACKS WITH UNRELEASED LYRICS 
song_data.loc[song_data['lyrics'].isna()]

Unnamed: 0,track_id,artist_names,track_name,source,rank,weeks_on_chart,streams,country,danceability,energy,...,tempo,duration_ms,time_signature,album_release_date,lyrics,lyrics_trans,continent,iso_alpha3,len_words_orig,len_words_trans
6,3l6K9SW5VFJyA5jBtioFFt,3GAR BABY,HUSTLE NA MUST,TGFG ENTERTAINMENT,7,1,60525,United Arab Emirates,0.735,0.648,...,104.996,152040.0,4.0,2023-02-10,,,Asia,ARE,1,1
24,0CtZpaOhtzvLV3FfcsVpQo,"Vishal-Shekhar, Shilpa Rao, Caralisa Monteiro,...","Besharam Rang (From ""Pathaan"")",YRF Music,25,9,36508,United Arab Emirates,0.773,0.795,...,115.997,258474.0,4.0,2022-12-12,,,Asia,ARE,1,1
31,6FAYpZ4jve8vpvTwUvjK6H,"Vishal-Shekhar, Arijit Singh, Sukriti Kakar, V...",Jhoome Jo Pathaan,YRF Music,32,7,32846,United Arab Emirates,0.817,0.738,...,104.964,208164.0,4.0,2022-12-22,,,Asia,ARE,1,1
76,72zHuDxFQTjbL51qJQSA7j,"Jasleen Royal, B Praak, Romy, Anvita Dutt","Ranjha (From ""Shershaah"")",Sony Music Entertainment India Pvt. Ltd.,77,29,20935,United Arab Emirates,0.603,0.573,...,82.941,228855.0,4.0,2021-08-05,,,Asia,ARE,1,1
141,3eUtQSdde3wNmXOW2OESKi,"El Polaco, La China",Ya No Quiero Verte,Columbia,42,16,1088456,Argentina,0.619,0.731,...,139.875,165635.0,3.0,2022-10-28,,,South America,ARG,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7109,27fqy8VruqYZlKiK1qfwEd,"tlinh, 2pillz, Wokeupat4am",ghệ iu dấu của em ơi,Universal Music Indochina Distributed Labels,10,1,311416,Vietnam,0.584,0.462,...,105.076,205996.0,4.0,2023-02-10,,,Asia,VNM,1,1
7110,3ukrFH17Zl6iEZ2QJ1Zwiy,"RPT Orijinn, Ronboogz",Don't Côi,Rapital,11,8,303515,Vietnam,0.808,0.410,...,110.079,148880.0,4.0,2022-11-20,,,Asia,VNM,1,1
7144,3wUp8eCTshIrJcYbjWaoyP,Phuong Ly,ThichThich,Phuong Ly,45,30,171778,Vietnam,0.898,0.467,...,124.072,241935.0,4.0,2022-07-24,,,Asia,VNM,1,1
7153,2fjqdDz6jJn6VPgrSDDMvp,"GPG msmy, AK49",YOU (feat. AK49),GePolyG,54,2,156338,Vietnam,0.759,0.323,...,130.074,147692.0,4.0,2023-02-02,,,Asia,VNM,1,1


In [11]:
# CHANGE UNRELEASED LYRICS LENGTH TO 0 
song_data.loc[song_data['lyrics'].isna(), 'len_words_orig']= 0 
song_data.loc[song_data['lyrics'].isna(), 'len_words_trans'] = 0 
song_data.loc[song_data['lyrics'].isna()]

Unnamed: 0,track_id,artist_names,track_name,source,rank,weeks_on_chart,streams,country,danceability,energy,...,tempo,duration_ms,time_signature,album_release_date,lyrics,lyrics_trans,continent,iso_alpha3,len_words_orig,len_words_trans
6,3l6K9SW5VFJyA5jBtioFFt,3GAR BABY,HUSTLE NA MUST,TGFG ENTERTAINMENT,7,1,60525,United Arab Emirates,0.735,0.648,...,104.996,152040.0,4.0,2023-02-10,,,Asia,ARE,0,0
24,0CtZpaOhtzvLV3FfcsVpQo,"Vishal-Shekhar, Shilpa Rao, Caralisa Monteiro,...","Besharam Rang (From ""Pathaan"")",YRF Music,25,9,36508,United Arab Emirates,0.773,0.795,...,115.997,258474.0,4.0,2022-12-12,,,Asia,ARE,0,0
31,6FAYpZ4jve8vpvTwUvjK6H,"Vishal-Shekhar, Arijit Singh, Sukriti Kakar, V...",Jhoome Jo Pathaan,YRF Music,32,7,32846,United Arab Emirates,0.817,0.738,...,104.964,208164.0,4.0,2022-12-22,,,Asia,ARE,0,0
76,72zHuDxFQTjbL51qJQSA7j,"Jasleen Royal, B Praak, Romy, Anvita Dutt","Ranjha (From ""Shershaah"")",Sony Music Entertainment India Pvt. Ltd.,77,29,20935,United Arab Emirates,0.603,0.573,...,82.941,228855.0,4.0,2021-08-05,,,Asia,ARE,0,0
141,3eUtQSdde3wNmXOW2OESKi,"El Polaco, La China",Ya No Quiero Verte,Columbia,42,16,1088456,Argentina,0.619,0.731,...,139.875,165635.0,3.0,2022-10-28,,,South America,ARG,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7109,27fqy8VruqYZlKiK1qfwEd,"tlinh, 2pillz, Wokeupat4am",ghệ iu dấu của em ơi,Universal Music Indochina Distributed Labels,10,1,311416,Vietnam,0.584,0.462,...,105.076,205996.0,4.0,2023-02-10,,,Asia,VNM,0,0
7110,3ukrFH17Zl6iEZ2QJ1Zwiy,"RPT Orijinn, Ronboogz",Don't Côi,Rapital,11,8,303515,Vietnam,0.808,0.410,...,110.079,148880.0,4.0,2022-11-20,,,Asia,VNM,0,0
7144,3wUp8eCTshIrJcYbjWaoyP,Phuong Ly,ThichThich,Phuong Ly,45,30,171778,Vietnam,0.898,0.467,...,124.072,241935.0,4.0,2022-07-24,,,Asia,VNM,0,0
7153,2fjqdDz6jJn6VPgrSDDMvp,"GPG msmy, AK49",YOU (feat. AK49),GePolyG,54,2,156338,Vietnam,0.759,0.323,...,130.074,147692.0,4.0,2023-02-02,,,Asia,VNM,0,0


## Refine lyrics <a class="anchor" id="refine"></a>

To make the analysis more reliable and the text easier to work with, the translated lyrics are preprocessed and simplified. Contractions such as 'dont' and 'wasnt' are first expanded into their full forms so that the stopword removal step catches them. After the stopwords are removed, the words are lemmatized, i.e. reduced to their dictionary form (for example, 'children' becomes 'child'). Filler words such as 'oh', which are common in song lyrics but carry little meaning, are removed. Words that `googletrans` left untranslated are dropped as well. Lastly, verbs are converted to their base (present-tense) form and any remaining diacritics are removed.

**Split contractions** 

Contractions are frequent in song lyrics. Because punctuation has already been stripped from the translated lyrics, a contraction such as "don't" appears as 'dont', which no longer matches the corresponding stopword and would survive stopword removal. To prevent this, contractions are replaced with their expanded forms.

In [12]:
# EXPAND CONTRACTIONS (apostrophes were already stripped from lyrics_trans) 
# caution: str.replace matches substrings, so "ill" -> "i will" also turns
# "still" into "sti will" (hence the 'sti' tokens in the outputs below)
clean = [] 
for i in song_data['lyrics_trans']:
    if isinstance(i, str): 
        text = i.replace("dont", "do not")
        text = text.replace("doesnt", "does not")
        text = text.replace("wont", "would not")  # strictly, "won't" expands to "will not"
        text = text.replace("didnt", "did not") 
        text = text.replace("couldnt", "could not")
        text = text.replace("shouldnt", "should not")
        text = text.replace("ill", "i will")
        text = text.replace("cant", "can not")
        text = text.replace("thats", "that is")
        text = text.replace("werent", "were not")
        text = text.replace("youve", "you have")
        text = text.replace("wasnt", "was not")
        text = re.sub(r'\d+', '', text) #remove numbers 
        clean.append(text)
    else: 
        clean.append(i)  # missing lyrics (NaN) are passed through unchanged
song_data['lyrics_clean'] = clean
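
Chained `str.replace` calls match substrings, so 'ill' is also rewritten inside 'still'. The following is a sketch of a word-boundary variant using `re`, not the method used in this notebook. The mapping `CONTRACTIONS` is hypothetical, covers only the contractions listed above, and expands 'wont' to 'will not'. Tokens such as 'ill', which is also an English word in its own right, are still expanded, so this remains a heuristic. Using it would change the outputs recorded in the following cells.

```python
import re

# hypothetical mapping: apostrophe-free contraction -> expanded form
CONTRACTIONS = {
    "dont": "do not", "doesnt": "does not", "wont": "will not",
    "didnt": "did not", "couldnt": "could not", "shouldnt": "should not",
    "ill": "i will", "cant": "can not", "thats": "that is",
    "werent": "were not", "youve": "you have", "wasnt": "was not",
}

# \b restricts every alternative to whole words
_PATTERN = re.compile(r"\b(" + "|".join(CONTRACTIONS) + r")\b")

def expand_contractions(text):
    """Replace whole-word contractions, then drop digits."""
    text = _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    return re.sub(r"\d+", "", text)

expanded = expand_contractions("im still a fan and i wont say ill leave")
```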

**Remove stopwords**

Removing stopwords from text data can help to reduce noise in the data and improve the accuracy of text analysis or NLP algorithms by focusing on more meaningful words or phrases.

In [13]:
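# REMOVE STOPWORDS FROM TRANSLATED LYRICS 
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
# str(x) turns missing lyrics (NaN) into the string 'nan', which is reset to null in a later cell
song_data['lyrics_clean'] = song_data['lyrics_clean'].apply(lambda x: 
                            ' '.join([word for word in str(x).split() if word not in stop_words]))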
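
A token is removed only if it matches a stopword exactly, which is why the contractions are expanded first. The sketch below illustrates this with a small hand-picked set, `TOY_STOPWORDS`, standing in for the NLTK list (an assumption for illustration; the real list is much longer).

```python
TOY_STOPWORDS = {"i", "do", "not", "a", "the", "was"}

def remove_stopwords(text, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

kept_contraction = remove_stopwords("i dont know", TOY_STOPWORDS)
kept_expanded = remove_stopwords("i do not know", TOY_STOPWORDS)
```

In the first call 'dont' survives because it is not in the set; in the second call both 'do' and 'not' are filtered out.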

In [14]:
song_data['lyrics_clean'] 

0       good gold kinda dream sold right til built hom...
1       im sti fan even though salty hate see broad kn...
2       take look inside heart room room would hold br...
3       vibez oh another banger baby calm calm girl bo...
4       ooh oohooh oohoohooh ooh oohooh believe man me...
                              ...                        
7295    cook thing man getting high ti fade vibe pure ...
7296    sweet love yeah mean said love shoulda held ti...
7297    first wisdom fear jehovah hear children playin...
7298    mother mother mother mother mother mother moth...
7299    okay lets go dude know work im going closet im...
Name: lyrics_clean, Length: 7300, dtype: object

**Remove suffixes (Lemmatization)**

In [15]:
def lemmatize_word(text): 
    '''lemmatize each word (WordNetLemmatizer treats words as nouns by default)'''
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

In [16]:
# LEMMATIZE WORDS IN LYRICS 
song_data['lyrics_clean'] = song_data['lyrics_clean'].apply(lemmatize_word)
song_data['lyrics_clean']

0       good gold kinda dream sold right til built hom...
1       im sti fan even though salty hate see broad kn...
2       take look inside heart room room would hold br...
3       vibez oh another banger baby calm calm girl bo...
4       ooh oohooh oohoohooh ooh oohooh believe man me...
                              ...                        
7295    cook thing man getting high ti fade vibe pure ...
7296    sweet love yeah mean said love shoulda held ti...
7297    first wisdom fear jehovah hear child playing p...
7298    mother mother mother mother mother mother moth...
7299    okay let go dude know work im going closet im ...
Name: lyrics_clean, Length: 7300, dtype: object

**Remove words with insignificant meaning**

Words such as 'oh', 'ya', 'na' and 'oooh' can be treated as stopwords since they carry little meaning. Words of one or two letters are also rarely meaningful, so removing them reduces noise in the text analysis.

In [17]:
def remove_two_letter_words(text):
    '''remove words of one or two letters'''
    return ' '.join([word for word in text.split() if len(word) > 2])

In [18]:
# REMOVE WORDS OF ONE OR TWO LETTERS 
song_data['lyrics_clean'] = song_data['lyrics_clean'].apply(remove_two_letter_words)

In [19]:
song_data['lyrics_clean']

0       good gold kinda dream sold right til built hom...
1       sti fan even though salty hate see broad know ...
2       take look inside heart room room would hold br...
3       vibez another banger baby calm calm girl body ...
4       ooh oohooh oohoohooh ooh oohooh believe man me...
                              ...                        
7295    cook thing man getting high fade vibe pure way...
7296    sweet love yeah mean said love shoulda held ti...
7297    first wisdom fear jehovah hear child playing p...
7298    mother mother mother mother mother mother moth...
7299    okay let dude know work going closet going bre...
Name: lyrics_clean, Length: 7300, dtype: object

In [20]:
# REMOVE 'ooh' (substring match, so it is also cut out of tokens such as 'oohooh')
song_data['lyrics_clean'] = song_data['lyrics_clean'].str.replace('ooh', '')
song_data['lyrics_clean'] 

0       good gold kinda dream sold right til built hom...
1       sti fan even though salty hate see broad know ...
2       take look inside heart room room would hold br...
3       vibez another banger baby calm calm girl body ...
4            believe man metro boomin want nigga someb...
                              ...                        
7295    cook thing man getting high fade vibe pure way...
7296    sweet love yeah mean said love shoulda held ti...
7297    first wisdom fear jehovah hear child playing p...
7298    mother mother mother mother mother mother moth...
7299    okay let dude know work going closet going bre...
Name: lyrics_clean, Length: 7300, dtype: object

In [21]:
# REMOVE EXTRA WHITE SPACES 
song_data['lyrics_clean'] = song_data['lyrics_clean'].str.replace(r'\s+', ' ', regex=True).str.strip()
song_data['lyrics_clean']

0       good gold kinda dream sold right til built hom...
1       sti fan even though salty hate see broad know ...
2       take look inside heart room room would hold br...
3       vibez another banger baby calm calm girl body ...
4       believe man metro boomin want nigga somebody s...
                              ...                        
7295    cook thing man getting high fade vibe pure way...
7296    sweet love yeah mean said love shoulda held ti...
7297    first wisdom fear jehovah hear child playing p...
7298    mother mother mother mother mother mother moth...
7299    okay let dude know work going closet going bre...
Name: lyrics_clean, Length: 7300, dtype: object
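
The three steps above (dropping short words, removing 'ooh', collapsing whitespace) can also be done in one pass over the tokens. This is a sketch rather than the method used in this notebook. The pattern `(?:o+h+)+` is an assumption about which interjections count as filler, and because it must match a whole token, longer words that merely contain 'ooh' are left untouched.

```python
import re

# whole-token match for 'oh', 'ooh', 'oooh', 'oohooh', ... (assumed filler pattern)
FILLER = re.compile(r"(?:o+h+)+")

def drop_filler(text, min_len=3):
    """Remove short tokens and 'ooh'-style interjections; rejoin with single spaces."""
    kept = [
        word for word in text.split()
        if len(word) >= min_len and not FILLER.fullmatch(word)
    ]
    return " ".join(kept)

cleaned_line = drop_filler("ooh oohooh oohoohooh ooh  believe man oh smooth")
```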

**Remove non-english words from lyrics**

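Where `googletrans` leaves a foreign word untranslated, that word is dropped from the lyrics. Each token is checked against the NLTK `words` corpus, so slang and names that are missing from that word list (for example 'kinda' and 'vibez') are removed as well.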

In [22]:
def remove_noneng(text, english_words): 
    '''remove non-english words'''
    return " ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in english_words or not w.isalpha())
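
A dependency-free sketch of the same filter is shown below. It assumes that the regular expression `\w+|[^\w\s]+` splits text the way `nltk.wordpunct_tokenize` does, and it uses a tiny hypothetical vocabulary in place of the NLTK word list. Tokens that are not purely alphabetic, such as numbers, pass through the filter.

```python
import re

def keep_known_words(text, vocabulary):
    """Keep tokens found in vocabulary, plus any token that is not purely alphabetic."""
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return " ".join(
        token for token in tokens
        if token.lower() in vocabulary or not token.isalpha()
    )

toy_vocabulary = {"another", "banger", "baby", "calm"}
filtered = keep_known_words("vibez another banger baby calm 2", toy_vocabulary)
```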

In [23]:
# REMOVE NON-ENGLISH WORDS NOT TRANSLATED BY GOOGLETRANS 
nltk.download('words', quiet=True)  # word list required by nltk.corpus.words
english_words = set(nltk.corpus.words.words())
song_data['lyrics_clean'] = [remove_noneng(l, english_words) for l in song_data['lyrics_clean']] 
song_data['lyrics_clean']

0       good gold dream sold right til built home watc...
1       fan even though salty hate see broad know happ...
2       take look inside heart room room would hold br...
3       another banger baby calm calm girl body put he...
4       believe man want somebody said saw person kiss...
                              ...                        
7295    cook thing man getting high fade pure way hull...
7296    sweet love yeah mean said love tight never let...
7297    first wisdom fear hear child piano first wisdo...
7298    mother mother mother mother mother mother moth...
7299    let dude know work going closet going break bo...
Name: lyrics_clean, Length: 7300, dtype: object

In [24]:
# SET AS NULL VALUES 
song_data['lyrics_clean'] = song_data['lyrics_clean'].replace('nan', np.nan)
song_data[song_data['lyrics_clean'].isnull()]

Unnamed: 0,track_id,artist_names,track_name,source,rank,weeks_on_chart,streams,country,danceability,energy,...,duration_ms,time_signature,album_release_date,lyrics,lyrics_trans,continent,iso_alpha3,len_words_orig,len_words_trans,lyrics_clean
6,3l6K9SW5VFJyA5jBtioFFt,3GAR BABY,HUSTLE NA MUST,TGFG ENTERTAINMENT,7,1,60525,United Arab Emirates,0.735,0.648,...,152040.0,4.0,2023-02-10,,,Asia,ARE,0,0,
24,0CtZpaOhtzvLV3FfcsVpQo,"Vishal-Shekhar, Shilpa Rao, Caralisa Monteiro,...","Besharam Rang (From ""Pathaan"")",YRF Music,25,9,36508,United Arab Emirates,0.773,0.795,...,258474.0,4.0,2022-12-12,,,Asia,ARE,0,0,
31,6FAYpZ4jve8vpvTwUvjK6H,"Vishal-Shekhar, Arijit Singh, Sukriti Kakar, V...",Jhoome Jo Pathaan,YRF Music,32,7,32846,United Arab Emirates,0.817,0.738,...,208164.0,4.0,2022-12-22,,,Asia,ARE,0,0,
76,72zHuDxFQTjbL51qJQSA7j,"Jasleen Royal, B Praak, Romy, Anvita Dutt","Ranjha (From ""Shershaah"")",Sony Music Entertainment India Pvt. Ltd.,77,29,20935,United Arab Emirates,0.603,0.573,...,228855.0,4.0,2021-08-05,,,Asia,ARE,0,0,
141,3eUtQSdde3wNmXOW2OESKi,"El Polaco, La China",Ya No Quiero Verte,Columbia,42,16,1088456,Argentina,0.619,0.731,...,165635.0,3.0,2022-10-28,,,South America,ARG,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7109,27fqy8VruqYZlKiK1qfwEd,"tlinh, 2pillz, Wokeupat4am",ghệ iu dấu của em ơi,Universal Music Indochina Distributed Labels,10,1,311416,Vietnam,0.584,0.462,...,205996.0,4.0,2023-02-10,,,Asia,VNM,0,0,
7110,3ukrFH17Zl6iEZ2QJ1Zwiy,"RPT Orijinn, Ronboogz",Don't Côi,Rapital,11,8,303515,Vietnam,0.808,0.410,...,148880.0,4.0,2022-11-20,,,Asia,VNM,0,0,
7144,3wUp8eCTshIrJcYbjWaoyP,Phuong Ly,ThichThich,Phuong Ly,45,30,171778,Vietnam,0.898,0.467,...,241935.0,4.0,2022-07-24,,,Asia,VNM,0,0,
7153,2fjqdDz6jJn6VPgrSDDMvp,"GPG msmy, AK49",YOU (feat. AK49),GePolyG,54,2,156338,Vietnam,0.759,0.323,...,147692.0,4.0,2023-02-02,,,Asia,VNM,0,0,


**Convert words in past tense to present tense** 

In [25]:
def present_tense(text):
    '''convert verbs to their base (present-tense) form'''
    result = []
    for word in text.split():
        w = Word(word)
        base_form = w.lemmatize("v") # "v" stands for verb
        result.append(base_form)
    return " ".join(result)

In [26]:
# TEST FUNCTION 
print(present_tense('cried abandoned'))

cry abandon


In [27]:
# CONVERT PAST TO PRESENT TENSE IN LYRICS 
cleaned = [] 
for text in song_data['lyrics_clean']: 
    if isinstance(text,str):
        t = present_tense(text)
        cleaned.append(t)
    else: 
        cleaned.append('nan')  # string placeholder for missing lyrics
cleaned

['good gold dream sell right til build home watch burn leave lie cry buy flower write name sand talk hour say thing understand take dance hold hand yeah love better love better love better baby love better love better baby paint nail match rise leave remorse regret forgive every word say leave baby fight cry buy flower write name sand talk hour yeah say thing understand take dance yeah hold hand yeah love better love better love better baby love better love better baby love better love better baby love better might also like leave fight cry buy flower write name sand talk hour yeah say thing understand better take dance yeah hold hand yeah love better yeah love better love better love better baby love better love better baby love better love better baby love better',
 'fan even though salty hate see broad know happy hate see happy one mature mature mature get therapist tell there men want none want one might might best idea new next get might love though rather jail alone get sense los

In [28]:
# PRINT LENGTH 
len(cleaned)

7300

In [29]:
# UPDATE DATAFRAME COLUMN 
song_data['lyrics_clean'] = cleaned 

**Remove diacritics** 

Because the lyrics were translated from other languages, some characters may still carry diacritics (accent marks). To simplify the lyrics data, `unidecode` transliterates any remaining non-ASCII characters to their closest ASCII equivalents.

In [30]:
# REMOVE DIACRITICS 
song_data['lyrics_clean'] = song_data['lyrics_clean'].apply(unidecode)
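
If the extra `unidecode` dependency is not wanted, accents can also be stripped with the standard library. This is an alternative sketch, not what the notebook uses. It decomposes each character (NFKD normalization) and drops the combining marks. Unlike `unidecode`, it leaves characters that have no decomposition unchanged. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters and drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_diacritics("café Côi dấu")
```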

## Final data

In [31]:
# EXPORT FINAL DATA (relative path, same Data folder the input was read from)
song_data.to_csv('../Data/merged_finaltop100_revised.csv') 