## Lyrics Sentiment & Topic Analysis

This notebook applies sentiment and topic analysis to Eurovision song lyrics that are either originally in English or have been translated to English.

**Sentiment analysis** is performed using the `distilbert-base-uncased-emotion` model: https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion?text=I+feel+a+bit+let+down

**Topic modelling** is performed with LDA (Latent Dirichlet Allocation) on a Gensim corpus: https://radimrehurek.com/gensim/models/ldamodel.html

#### 1. Load data

In [None]:
import pandas as pd
df = pd.read_csv('eng_lyrics_all.csv')
print(df.shape)
df.head()

#### 2. Text pre-processing
Models like `distilbert-base-uncased-emotion` come with their own tokenizer, so heavy preprocessing is not needed.

Light cleaning (removing newlines, URLs and extra whitespace) still helps keep the input consistent and reduces noise.

In [None]:
import re

def clean_text(text):
    text = str(text).strip()
    text = text.replace('\n', ' ')       # replace real newline characters with a space
    text = text.replace('\\n', ' ')      # replace literal backslash-n sequences with a space
    text = re.sub(r'http\S+', '', text)  # remove URLs
    text = re.sub(r'\s+', ' ', text)     # remove extra whitespace
    return text

df['lyrics_all_english_clean'] = df['lyrics_all_english'].apply(clean_text)


In [None]:
print(df['lyrics_all_english_clean'][400])
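
As a quick check of what the cleaning step does, the sketch below runs the same `clean_text` logic on a made-up string (the sample text is an assumption, not taken from the dataset). It also shows a side effect worth knowing about: because the input is passed through `str()`, a missing value such as `NaN` becomes the literal string `'nan'` rather than being skipped.

```python
import re

def clean_text(text):
    text = str(text).strip()
    text = text.replace('\n', ' ')       # real newline characters
    text = text.replace('\\n', ' ')      # literal backslash-n sequences
    text = re.sub(r'http\S+', '', text)  # URLs
    text = re.sub(r'\s+', ' ', text)     # runs of whitespace
    return text

# Made-up sample containing both kinds of newline, a URL and extra spaces
sample = "  Hello\\nworld\n see http://example.com/a  now "
cleaned = clean_text(sample)

# A missing value is converted to text instead of being dropped
missing = clean_text(float('nan'))

print(repr(cleaned))
print(repr(missing))
```

If the lyrics column contains missing values, it may be worth dropping or filling them before applying the function.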

#### 3. Sentiment analysis
Model set up


In [None]:
%pip install transformers torch

In [None]:
!pip install --upgrade --force-reinstall torch torchvision torchaudio

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

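# top_k=None returns the score for every emotion label (it replaces the deprecated return_all_scores=True);
# truncation=True cuts lyrics that exceed the model's maximum input length
classifier = pipeline("text-classification", model='bhadresh-savani/distilbert-base-uncased-emotion', top_k=None, truncation=True)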

In [None]:
from tqdm.notebook import tqdm

# Apply tqdm to the iterator
results = []
for text in tqdm(df['lyrics_all_english_clean'], desc="Processing lyrics"):
    scores = classifier(text)[0]
    result_series = pd.Series({item['label']: item['score'] for item in scores})
    results.append(result_series)

# Combine results into a DataFrame and join
scores_df = pd.DataFrame(results)
df = df.join(scores_df)
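
To make the reshaping step concrete: when all scores are requested, the classifier yields one `{'label': ..., 'score': ...}` dict per emotion for each text, and the loop above turns that list into one column per emotion. The sketch below mimics this with plain dictionaries. The score values are invented for illustration, and `scores_to_row` and `dominant_emotion` are hypothetical helper names, not part of `transformers`.

```python
# Invented scores for one song, shaped like the classifier output for a single text
example_scores = [
    {"label": "sadness", "score": 0.05},
    {"label": "joy", "score": 0.62},
    {"label": "love", "score": 0.25},
    {"label": "anger", "score": 0.04},
    {"label": "fear", "score": 0.03},
    {"label": "surprise", "score": 0.01},
]

def scores_to_row(scores):
    """Turn a list of label/score dicts into a single {label: score} mapping."""
    return {item["label"]: item["score"] for item in scores}

def dominant_emotion(row):
    """Return the label with the highest score."""
    return max(row, key=row.get)

row = scores_to_row(example_scores)
print(row)
print(dominant_emotion(row))
```

On the real results, the same selection should be available in pandas as `scores_df.idxmax(axis=1)`.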


In [None]:
df

#### 4. Topic Modelling

- pre-process text: remove stopwords, tokenise, lemmatise

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = tokenizer.tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words and len(token) > 3]
    return tokens

# 'df' is the DataFrame and 'lyrics_all_english_clean' is the column with the cleaned song lyrics
df['processed_lyrics'] = df['lyrics_all_english_clean'].apply(preprocess)
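
The effect of the token filter can be seen without NLTK. The sketch below covers the tokenise-and-filter part only, under stated assumptions: `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`, `TOY_STOP_WORDS` is a tiny hand-picked set standing in for NLTK's English stopword list, and lemmatisation is left out. The input sentence is made up.

```python
import re

# Hand-picked stand-in for NLTK's English stopword list (assumption)
TOY_STOP_WORDS = {"the", "and", "this", "that", "with", "will"}

def toy_preprocess(text):
    """Lowercase, split on word characters, drop stopwords and tokens of 3 characters or fewer."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in TOY_STOP_WORDS and len(t) > 3]

tokens = toy_preprocess("The sun will rise and I cry with this burning heart")
print(tokens)
```

Note that `len(token) > 3` removes short content words such as "sun" and "cry" along with the noise, which may or may not be desirable for lyrics.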

- create dictionary and corpus

In [None]:
from gensim import corpora

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(df['processed_lyrics'])

# Filter out words that occur in fewer than 100 documents or in more than 80% of the documents
dictionary.filter_extremes(no_below=100, no_above=0.8)

# Create a bag-of-words representation of each document
corpus = [dictionary.doc2bow(doc) for doc in df['processed_lyrics']]
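
What the document-frequency filter and the bag-of-words step do can be illustrated on a toy corpus with the standard library. This is a simplified sketch of the idea, not Gensim's implementation: `filter_by_doc_freq` and `to_bow` are hypothetical helpers, the three documents are made up, and Gensim additionally caps the vocabulary size (`keep_n`) and maps tokens to integer ids.

```python
from collections import Counter

# Made-up tokenised documents
docs = [
    ["love", "heart", "night"],
    ["love", "fire", "night"],
    ["love", "heart", "dream"],
]

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and at most `no_above` (a fraction) of them."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    return {t for t, n in doc_freq.items() if no_below <= n <= max_docs}

def to_bow(doc, vocab):
    """Count the tokens of one document that survived filtering."""
    counts = Counter(t for t in doc if t in vocab)
    return sorted(counts.items())

vocab = filter_by_doc_freq(docs, no_below=2, no_above=0.8)
bows = [to_bow(doc, vocab) for doc in docs]
print(sorted(vocab))
print(bows)
```

Here "love" is dropped because it appears in every document, while "fire" and "dream" are dropped as too rare. With the notebook's `no_below=100`, a word must appear in at least 100 songs to be kept, so it is worth checking `len(dictionary)` after filtering.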

- find the optimal number of topics

In [None]:
from gensim.models import LdaModel, CoherenceModel
import matplotlib.pyplot as plt

# Function to compute coherence scores for various numbers of topics
def compute_coherence_scores(dictionary, corpus, texts, start=2, limit=10, step=1):
    coherence_scores = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=20, random_state=42, iterations = 400)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherencemodel.get_coherence())

    return model_list, coherence_scores

# Assuming 'dictionary' and 'corpus' are already defined, and 'processed_lyrics' is the tokenized text
model_list, coherence_scores = compute_coherence_scores(dictionary=dictionary, corpus=corpus, texts=df['processed_lyrics'], start=2, limit=14, step=1)

# Plotting the coherence scores
x = range(2, 14, 1)
plt.plot(x, coherence_scores)
plt.xlabel("Number of Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values",), loc='best')  # trailing comma makes this a one-element tuple, not a string
plt.show()
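
Once the coherence scores are computed, picking a candidate number of topics amounts to an argmax over the list. A minimal sketch, using made-up coherence values (not results from this notebook) and a hypothetical helper `best_num_topics`:

```python
def best_num_topics(coherence_scores, start=2, step=1):
    """Map the position of the highest coherence score back to its number of topics."""
    best_index = max(range(len(coherence_scores)), key=coherence_scores.__getitem__)
    return start + best_index * step

# Invented scores for 2, 3, 4, 5 and 6 topics
example_scores = [0.41, 0.38, 0.36, 0.39, 0.35]
best_k = best_num_topics(example_scores, start=2, step=1)
print(best_k)
```

The `start` and `step` arguments should match those passed to `compute_coherence_scores`. The highest coherence is a guide rather than a rule, so the resulting topics are still worth inspecting by eye.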

In [None]:
from gensim.models import LdaModel

# Train the LDA model
num_topics = 2  # Adjust this according to how many topics you want to extract
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=20, random_state=42, iterations=400)

In [None]:
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
    print("")

- visualise topic clusters

In [None]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# Prepare the visualization
pyLDAvis.enable_notebook()  # If you are using a Jupyter notebook
vis = gensimvis.prepare(lda_model, corpus, dictionary)

# Display the visualization
pyLDAvis.display(vis)

- assign a label to each topic

In [None]:
topic_names = {
    0: 'reflection',
    1: 'love'
    # Add more mappings as needed for each topic
}

- assign topic labels to each song

In [None]:
def dominant_topic_with_names(lda_model, corpus, topic_names):
    topics = []
    for bow in corpus:
        topic_probs = lda_model.get_document_topics(bow)
        dominant_topic_index = sorted(topic_probs, key=lambda x: x[1], reverse=True)[0][0]
        # Use the topic_names mapping to get the descriptive name for the dominant topic index
        dominant_topic_name = topic_names.get(dominant_topic_index, "Unknown")  # "Unknown" is a default value
        topics.append(dominant_topic_name)
    return topics

# Apply the function with topic names to your DataFrame
df['dominant_topic'] = dominant_topic_with_names(lda_model, corpus, topic_names)
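
`get_document_topics` returns a list of `(topic_id, probability)` pairs for a document, and the function above keeps the pair with the highest probability. The sketch below performs the same selection on hand-written pairs (the probabilities are invented), using `max` instead of a full sort. `dominant_topic_name` is a hypothetical helper, and the guard for an empty list is an added safety assumption.

```python
topic_names = {0: "reflection", 1: "love"}

def dominant_topic_name(topic_probs, topic_names, default="Unknown"):
    """Return the name of the most probable topic, or `default` if it is unmapped or missing."""
    if not topic_probs:
        return default
    topic_id, _ = max(topic_probs, key=lambda pair: pair[1])
    return topic_names.get(topic_id, default)

mapped = dominant_topic_name([(0, 0.27), (1, 0.73)], topic_names)
unmapped = dominant_topic_name([(2, 0.90)], topic_names)
empty = dominant_topic_name([], topic_names)
print(mapped, unmapped, empty)
```

If `num_topics` is increased, `topic_names` needs a matching entry for each new topic index; otherwise those songs are labelled "Unknown".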

In [None]:
#save to csv file
#df.to_csv('eng_lyrics_sentiment_topic.csv')