## Lyrics Feature Extraction

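This notebook extracts features from the Eurovision 2025 lyrics in their original languages. After cleaning and tokenizing the text, it measures vocabulary richness with the Type-Token Ratio (TTR) and repetitiveness with two metrics: the size reduction achieved by LZW compression and the average n-gram repetition.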

#### 1. Load data

In [22]:
import pandas as pd
df = pd.read_csv('2025_data_lyrics_og.csv')
df.head()

Unnamed: 0,idx,year,sf_num,to_country,performer,song,running_final,running_sf,LY_SF_reciprocation,LY_SF_vote,LY_final_reciprocation,LY_final_vote,lyrics,lyrics_eng_translation,lyrics_all_english,lyrics_english,lyrics_english_mix,lyrics_url
0,0,2025,1.0,Albania,Shkodra Elektronike,Zjerm,,12.0,0.1,0.210526,0.0,0.0,"""Në këtë minutë, në këtë çast no paranoia Pas ...","At this minute, at this moment, no more parano...","At this minute, at this moment, no more parano...",0,0,https://eurovisionworld.com/eurovision/2025/al...
1,1,2025,2.0,Armenia,Parg,Survivor,,5.0,0.5,1.0,0.6,0.783784,"""Survivor\nI got my bad shades on Jet-black am...",,"""Survivor\nI got my bad shades on Jet-black am...",1,0,https://eurovisionworld.com/eurovision/2025/ar...
2,2,2025,2.0,Australia,Go-Jo,Milkshake Man,,1.0,0.35,0.722222,0.0,0.0,"""Come and take a sip from my special cup I hea...",,"""Come and take a sip from my special cup I hea...",1,0,https://eurovisionworld.com/eurovision/2025/au...
3,3,2025,2.0,Austria,JJ,Wasted Love,,6.0,0.5,0.789474,0.15,0.135135,"""I'm an ocean of love And you're scared of wat...",,"""I'm an ocean of love And you're scared of wat...",1,0,https://eurovisionworld.com/eurovision/2025/au...
4,4,2025,1.0,Azerbaijan,Mamagama,Run With U,,10.0,0.2,0.333333,0.0,0.0,"""Rhythm pulls me out Don't know where to start...",,"""Rhythm pulls me out Don't know where to start...",1,0,https://eurovisionworld.com/eurovision/2025/az...


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   idx                     37 non-null     int64  
 1   year                    37 non-null     int64  
 2   sf_num                  31 non-null     float64
 3   to_country              37 non-null     object 
 4   performer               37 non-null     object 
 5   song                    37 non-null     object 
 6   running_final           1 non-null      float64
 7   running_sf              31 non-null     float64
 8   LY_SF_reciprocation     37 non-null     float64
 9   LY_SF_vote              37 non-null     float64
 10  LY_final_reciprocation  37 non-null     float64
 11  LY_final_vote           37 non-null     float64
 12  lyrics                  37 non-null     object 
 13  lyrics_eng_translation  23 non-null     object 
 14  lyrics_all_english      37 non-null     obje

#### 2. Text pre-processing
Cleaning and Tokenisation

In [25]:
df['lyrics']

0     "Në këtë minutë, në këtë çast no paranoia Pas ...
1     "Survivor\nI got my bad shades on Jet-black am...
2     "Come and take a sip from my special cup I hea...
3     "I'm an ocean of love And you're scared of wat...
4     "Rhythm pulls me out Don't know where to start...
5     "Strobe lights, gettin' lost in your eyes Cott...
6     "It's cool now in the kitchen That happens whe...
7     "I've got golden locks and eyes so captivating...
8     "Blow me a kiss goodbye I don't want my tears ...
9     "You show me more More than meets the eye You ...
10    "Mi amore Mi amore Espresso macchiato, macchia...
11    "(Ich komme) On yö, sydän lyö Hän loveen lanke...
12    "Y'a plus d'amants Y'a plus de lits Finalement...
13    "(Mze, tsa)\nTavisupleba\n(Ani da bani, gani d...
14    "Ich ballalalalalalaler Löcher in die Nacht St...
15    "Asteri mou Asteri mou\nGlykia mou mana mi mou...
16    "Róandi hér, róandi þar Róa í gegnum öldurnar ...
17    "(Ready for take off)\nYou have probably h

In [26]:
### CLEAN TEXT
import re

def clean_lyrics(text):
    """
    Cleans the provided text by:
    - Replacing newline characters and escaped newlines with spaces
    - Removing punctuation and special characters
    - Converting text to lowercase
    """
    # Replace newline characters with spaces
    text = re.sub(r'\\n|\n', ' ', text)
    
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    
    # Convert text to lowercase
    text = text.lower()
    
    return text

In [27]:
df['cleaned_lyrics'] = df['lyrics'].apply(clean_lyrics)
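
To make the effect of the cleaning step concrete, the sketch below repeats the same three steps on two short strings. The function body mirrors `clean_lyrics` above so the example runs on its own; the first sample string is made up for illustration, modelled on the escaped `\n` sequences visible in the raw `lyrics` column.

```python
import re

def clean_lyrics(text):
    """Same steps as the notebook's clean_lyrics, repeated so this example is standalone."""
    text = re.sub(r'\\n|\n', ' ', text)  # literal "\n" sequences and real newlines -> space
    text = re.sub(r'[^\w\s]', '', text)  # drop anything that is not a word character or whitespace
    return text.lower()

# Illustrative input: a leading quote, an escaped newline, a comma, a hyphen
sample = '"Survivor\\nI got my bad shades on, Jet-black!'
cleaned = clean_lyrics(sample)
print(cleaned)  # survivor i got my bad shades on jetblack

# \w is Unicode-aware for Python 3 strings, so accented letters are kept
albanian = clean_lyrics("Në këtë minutë, në këtë çast")
print(albanian)  # në këtë minutë në këtë çast
```

Note that removing punctuation also joins hyphenated words and contractions ("Jet-black" becomes "jetblack", "don't" becomes "dont"), which is visible in the `cleaned_lyrics` column later in the notebook.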

In [28]:
### TOKENISE
import nltk

# Ensure that the NLTK tokenizers are downloaded
nltk.download('punkt')

def tokenize_lyrics(text):
    """
    Tokenizes the text into individual words using NLTK's word_tokenize.
    """
    tokens = nltk.word_tokenize(text)
    return tokens


[nltk_data] Downloading package punkt to /Users/kasia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [29]:
df['lyrics_tokens'] = df['cleaned_lyrics'].apply(tokenize_lyrics)

#### 3. Vocabulary Richness
Calculate the Type-Token Ratio (TTR): the number of unique tokens divided by the total number of tokens.

In [30]:
def analyze_lyrics(lyrics):
    """Return the token count, unique-token count, and Type-Token Ratio for a list of tokens."""
    # Count all words
    total_words = len(lyrics)
    
    # Count all unique words
    unique_words = len(set(lyrics))
    
    # Calculate Type-Token Ratio (TTR)
    ttr = unique_words / total_words if total_words > 0 else 0
    
    return {
        'total_words': total_words,
        'unique_words': unique_words,
        'type_token_ratio': ttr
    }

In [31]:
# Apply the analysis function to the 'lyrics_tokens' column
df_analysis = df['lyrics_tokens'].apply(lambda x: pd.Series(analyze_lyrics(x)))

# Join the analysis results with the original DataFrame
df = pd.concat([df, df_analysis], axis=1)
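
As a small worked example of the ratio, the sketch below computes the TTR for a made-up token list and then re-derives the value reported for the first row of the output table (134 unique tokens out of 241). The helper name `type_token_ratio` is hypothetical and is not part of the notebook.

```python
def type_token_ratio(tokens):
    """Unique tokens divided by total tokens; 0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0

# Made-up line: 6 tokens, 3 distinct ("la", "love", "me")
toy_tokens = "la la la love love me".split()
toy_ttr = type_token_ratio(toy_tokens)
print(toy_ttr)  # 0.5

# Counts taken from the first row of the table: 134 unique / 241 total
row0_ttr = round(134 / 241, 6)
print(row0_ttr)  # 0.556017
```

A value close to 1 means almost every token is new; a value close to 0 means the same few tokens are repeated many times.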

In [32]:
df.head()

Unnamed: 0,idx,year,sf_num,to_country,performer,song,running_final,running_sf,LY_SF_reciprocation,LY_SF_vote,...,lyrics_eng_translation,lyrics_all_english,lyrics_english,lyrics_english_mix,lyrics_url,cleaned_lyrics,lyrics_tokens,total_words,unique_words,type_token_ratio
0,0,2025,1.0,Albania,Shkodra Elektronike,Zjerm,,12.0,0.1,0.210526,...,"At this minute, at this moment, no more parano...","At this minute, at this moment, no more parano...",0,0,https://eurovisionworld.com/eurovision/2025/al...,në këtë minutë në këtë çast no paranoia pas sh...,"[në, këtë, minutë, në, këtë, çast, no, paranoi...",241.0,134.0,0.556017
1,1,2025,2.0,Armenia,Parg,Survivor,,5.0,0.5,1.0,...,,"""Survivor\nI got my bad shades on Jet-black am...",1,0,https://eurovisionworld.com/eurovision/2025/ar...,survivor i got my bad shades on jetblack am in...,"[survivor, i, got, my, bad, shades, on, jetbla...",315.0,122.0,0.387302
2,2,2025,2.0,Australia,Go-Jo,Milkshake Man,,1.0,0.35,0.722222,...,,"""Come and take a sip from my special cup I hea...",1,0,https://eurovisionworld.com/eurovision/2025/au...,come and take a sip from my special cup i hear...,"[come, and, take, a, sip, from, my, special, c...",365.0,110.0,0.30137
3,3,2025,2.0,Austria,JJ,Wasted Love,,6.0,0.5,0.789474,...,,"""I'm an ocean of love And you're scared of wat...",1,0,https://eurovisionworld.com/eurovision/2025/au...,im an ocean of love and youre scared of water ...,"[im, an, ocean, of, love, and, youre, scared, ...",145.0,66.0,0.455172
4,4,2025,1.0,Azerbaijan,Mamagama,Run With U,,10.0,0.2,0.333333,...,,"""Rhythm pulls me out Don't know where to start...",1,0,https://eurovisionworld.com/eurovision/2025/az...,rhythm pulls me out dont know where to start y...,"[rhythm, pulls, me, out, dont, know, where, to...",256.0,110.0,0.429688


#### 4. Repetitiveness

Two repetitiveness measures: the size reduction achieved by LZW compression, and the average n-gram repetitiveness for n = 2 to 10.

In [33]:
def LZW_compress(input_string):
    # Initialize the dictionary with the 256 single characters chr(0)..chr(255);
    # characters outside that range are added in the loop below
    dictionary = {chr(i): i for i in range(256)}
    current_string = ""
    codes = []
    code = 256  # Starting code for new entries
    
    # Extend the dictionary to include any unique characters in the input string not already in the dictionary
    for character in set(input_string):
        if character not in dictionary:
            dictionary[character] = code
            code += 1

    for character in input_string:
        new_string = current_string + character
        if new_string in dictionary:
            current_string = new_string
        else:
            codes.append(dictionary[current_string])
            dictionary[new_string] = code
            code += 1
            current_string = character
    if current_string:
        codes.append(dictionary[current_string])
    return codes, len(dictionary)

def calculate_compression(original, compressed):
    original_size = len(original) * 8  # Assuming 8 bits per character
    compressed_size = len(compressed) * 12  # Assuming 12 bits per LZW code
    # Fraction of the original size saved (0.25 means 25% smaller);
    # negative when the encoded output is larger than the input
    reduction = (original_size - compressed_size) / original_size
    return reduction

def apply_compression_and_calculate_reduction(row):
    compressed_lyrics, _ = LZW_compress(row['cleaned_lyrics'])
    # The result is a fraction, not a percentage
    reduction = calculate_compression(row['cleaned_lyrics'], compressed_lyrics)
    return reduction

In [34]:
df['compression_size_reduction'] = df.apply(apply_compression_and_calculate_reduction, axis=1)
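
The sketch below traces the compression metric on three tiny strings to show how repetition drives the score. It is a compact restatement of the logic above under the same 8-bit and 12-bit assumptions; the names `lzw_codes` and `size_reduction` are hypothetical and introduced only for this example.

```python
def lzw_codes(text):
    """Compact restatement of the notebook's LZW_compress; returns only the code list."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    # Give characters outside chr(0)..chr(255) their own codes first
    # (sorted only so that code numbers are deterministic in this sketch)
    for ch in sorted(set(text)):
        if ch not in dictionary:
            dictionary[ch] = next_code
            next_code += 1
    current, codes = "", []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate            # keep extending the known phrase
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code  # learn the new phrase
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

def size_reduction(text):
    """Fraction saved, assuming 8 bits per character and 12 bits per code."""
    original_bits = len(text) * 8
    return (original_bits - len(lzw_codes(text)) * 12) / original_bits

print(lzw_codes("abababab"))       # [97, 98, 256, 258, 98]
print(size_reduction("abababab"))  # 0.0625
print(size_reduction("abcdefgh"))  # -0.5
repetitive = size_reduction("la " * 50)
print(repetitive)
```

Because each code is assumed to cost 12 bits against 8 bits per character, a short text with no repeated phrases gets a negative score, while a long, highly repetitive text approaches 1.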

In [35]:
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from collections import Counter
import numpy as np
import pandas as pd
import nltk

# Ensure you have the necessary NLTK data
nltk.download('punkt')

def calculate_avg_ngram_repetitiveness(lyrics, n_values=(2, 3, 4, 5, 6, 7, 8, 9, 10)):
    """
    Averages, over the n-gram sizes in n_values, the share of n-gram occurrences
    that belong to an n-gram appearing more than once in the lyrics.
    """
    tokens = word_tokenize(lyrics)
    
    repetitiveness_scores = []
    
    for n in n_values:
        if len(tokens) < n:
            # If there are fewer tokens than the size of the n-gram, skip this n-value
            continue
            
        # Generate n-grams from the list of tokens
        n_grams = list(ngrams(tokens, n))
        # Count the frequency of each n-gram
        n_gram_counts = Counter(n_grams)
        
        # Repetitiveness: share of n-gram occurrences whose n-gram appears more than once
        # (every occurrence of a repeated n-gram is counted, not only distinct n-grams)
        total_n_grams = len(n_grams)
        repeated_n_grams = sum(count for count in n_gram_counts.values() if count > 1)
        
        if total_n_grams > 0:  # Check to prevent division by zero
            repetitiveness_score = repeated_n_grams / total_n_grams
            repetitiveness_scores.append(repetitiveness_score)
    # Calculate the average repetitiveness score across all n-gram sizes
    avg_repetitiveness = np.mean(repetitiveness_scores) if repetitiveness_scores else 0
    
    return avg_repetitiveness


[nltk_data] Downloading package punkt to /Users/kasia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [36]:
df['n_gram_repetitiveness'] = df['cleaned_lyrics'].apply(calculate_avg_ngram_repetitiveness)
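
The n-gram score can be checked by hand on a short made-up line. The sketch below is an illustration under simplifying assumptions: it builds n-grams with plain slicing and splits on whitespace instead of using NLTK's `ngrams` and `word_tokenize`, and the helper name `ngram_repetitiveness` is hypothetical.

```python
from collections import Counter

def ngram_repetitiveness(tokens, n):
    """Share of n-gram occurrences whose n-gram appears more than once."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return None  # fewer tokens than n, so there is nothing to score
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)

# Made-up line with a repeated three-word phrase
tokens = "la la da la la da oh no".split()

# 7 bigrams; ("la","la") and ("la","da") each occur twice -> 4 / 7
bigram_score = ngram_repetitiveness(tokens, 2)
# 6 trigrams; only ("la","la","da") occurs twice -> 2 / 6
trigram_score = ngram_repetitiveness(tokens, 3)

average = (bigram_score + trigram_score) / 2
print(bigram_score, trigram_score, average)
```

Longer n-grams repeat less often than shorter ones, so averaging over n = 2 to 10 rewards lyrics that repeat whole lines rather than just pairs of words.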

In [37]:
df

Unnamed: 0,idx,year,sf_num,to_country,performer,song,running_final,running_sf,LY_SF_reciprocation,LY_SF_vote,...,lyrics_english,lyrics_english_mix,lyrics_url,cleaned_lyrics,lyrics_tokens,total_words,unique_words,type_token_ratio,compression_size_reduction,n_gram_repetitiveness
0,0,2025,1.0,Albania,Shkodra Elektronike,Zjerm,,12.0,0.1,0.210526,...,0,0,https://eurovisionworld.com/eurovision/2025/al...,në këtë minutë në këtë çast no paranoia pas sh...,"[në, këtë, minutë, në, këtë, çast, no, paranoi...",241.0,134.0,0.556017,0.264637,0.411685
1,1,2025,2.0,Armenia,Parg,Survivor,,5.0,0.5,1.0,...,1,0,https://eurovisionworld.com/eurovision/2025/ar...,survivor i got my bad shades on jetblack am in...,"[survivor, i, got, my, bad, shades, on, jetbla...",315.0,122.0,0.387302,0.335901,0.368144
2,2,2025,2.0,Australia,Go-Jo,Milkshake Man,,1.0,0.35,0.722222,...,1,0,https://eurovisionworld.com/eurovision/2025/au...,come and take a sip from my special cup i hear...,"[come, and, take, a, sip, from, my, special, c...",365.0,110.0,0.30137,0.406426,0.520278
3,3,2025,2.0,Austria,JJ,Wasted Love,,6.0,0.5,0.789474,...,1,0,https://eurovisionworld.com/eurovision/2025/au...,im an ocean of love and youre scared of water ...,"[im, an, ocean, of, love, and, youre, scared, ...",145.0,66.0,0.455172,0.257532,0.341697
4,4,2025,1.0,Azerbaijan,Mamagama,Run With U,,10.0,0.2,0.333333,...,1,0,https://eurovisionworld.com/eurovision/2025/az...,rhythm pulls me out dont know where to start y...,"[rhythm, pulls, me, out, dont, know, where, to...",256.0,110.0,0.429688,0.289267,0.427451
5,5,2025,1.0,Belgium,Red Sebastian,Strobe Lights,,9.0,0.25,0.368421,...,1,0,https://eurovisionworld.com/eurovision/2025/be...,strobe lights gettin lost in your eyes cotton ...,"[strobe, lights, gettin, lost, in, your, eyes,...",185.0,76.0,0.410811,0.246926,0.692625
6,6,2025,1.0,Croatia,Marko Bošnjak,Poison Cake,,14.0,0.5,1.0,...,1,0,https://eurovisionworld.com/eurovision/2025/cr...,its cool now in the kitchen that happens when ...,"[its, cool, now, in, the, kitchen, that, happe...",260.0,111.0,0.426923,0.321663,0.408045
7,7,2025,1.0,Cyprus,Theo Evan,Shh,,15.0,0.4,0.722222,...,1,0,https://eurovisionworld.com/eurovision/2025/cy...,ive got golden locks and eyes so captivating i...,"[ive, got, golden, locks, and, eyes, so, capti...",272.0,76.0,0.279412,0.335052,0.617678
8,8,2025,2.0,Czechia,Adonxs,Kiss Kiss Goodbye,,12.0,0.4,0.736842,...,1,0,https://eurovisionworld.com/eurovision/2025/cz...,blow me a kiss goodbye i dont want my tears to...,"[blow, me, a, kiss, goodbye, i, dont, want, my...",222.0,82.0,0.369369,0.333185,0.579693
9,9,2025,2.0,Denmark,Sissal,Hallucination,,11.0,0.35,0.473684,...,1,0,https://eurovisionworld.com/eurovision/2025/de...,you show me more more than meets the eye you o...,"[you, show, me, more, more, than, meets, the, ...",167.0,77.0,0.461078,0.313112,0.243982


In [38]:
#save csv file
#df.to_csv('2025data_lyrics.csv')