## Introduction

Music and lyrics subtly shape our emotions, thoughts, and perceptions. The growing field of lyric analysis in data science is beginning to gain attention in academic research, exploring patterns in songwriting and even predicting musical success. This case study will touch upon some of these emerging ideas while focusing on the extensive discography of one of the most legendary and influential Egyptian artists, Amr Diab. 

With a career spanning over four decades, Amr Diab, also known as "El Hadaba", has shaped the modern Arabic music scene with a fusion of traditional Arabic melodies and contemporary Western influences that resonates across generations. He is regarded as one of the pioneers of modern Arabic pop music, has dominated the Arabic music charts, and has won multiple World Music Awards. His songs cover a diverse range of themes, from love and nostalgia to personal growth and cultural identity.

In this project, we will explore the insights that can be drawn from Amr Diab’s songwriting. From common words and themes to sentiment trends over time, we aim to shed light on the artist's contribution to Arabic music through a data-driven lens, using statistical analysis, natural language processing (NLP), and visualization techniques.

The dataset consists of lyrics collected from Genius.com, which offers broad coverage of Amr Diab’s discography. This study aims to provide insights into his artistic evolution and lyrical tendencies, contributing to a deeper appreciation of his music.

## Data Collection

The song titles and their corresponding lyrics were collected from https://genius.com/artists/Amr-diab, carefully checked for inconsistencies, and saved to a .csv file.

## Imports

This section imports the necessary libraries for handling and preprocessing Arabic text. It begins with Pandas, a powerful library for managing structured data, and re, which provides regular expression support for text processing.

In this project, we will mostly be using **Camel Tools**, a specialized NLP library designed for Arabic language processing. Unlike general-purpose NLP libraries, Camel Tools provides linguistically informed preprocessing functions tailored for Arabic.

Tools we'll be using here:
- Unicode normalization
- Text normalization: for example, converting أ, إ, آ → ا
- Diacritic Removal (harakat)
- Tokenization

In [1]:
import pandas as pd
import re

#%pip install camel-tools
from camel_tools.utils.normalize import normalize_unicode
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.utils.dediac import dediac_ar

## Reading Dataset

In [2]:
# Define file paths
DATA_PATH = "./data/amr_diab_songs.csv"
STOP_WORDS_PATH = "./data/stop_words.txt"
PREPROC_PATH = "./data/amr_diab_songs_proc.csv"

# Load the dataset
songs = pd.read_csv(DATA_PATH, header=0, encoding="utf-8", na_values="")

songs

Unnamed: 0,Year,Composer,Lyricist,Song,Lyrics
0,2023,محمد أحمد فؤاد,تامر حسين,بيوحشنا,ملازمنا، ملازمنا خياله وطيفه فين ما نروح\nملاز...
1,2023,أحمد إبراهيم,أيمن بهجت قمر,معرفش حد بالأسم ده,ما أعرفش حد بالإسم دا\nأنا اللي تاه عقله ولقاه...
2,2023,محمد يحيي,بهاء الدين محمد,ظبط مودها,لما تظبط مودها\nأطلب حتى عينيها تاخدها\nوأؤمر ...
3,2023,محمد يحيي,محمد القاياتي,سلامك وصلي,سلامك وصلي\nوأتاريني واحشك زي ما إنت واحشني يا...
4,2023,محمد يحيي,محمد البوغة,واخدين راحتهم,واخذين راحتهم قاعدين في قلبي مربعين وبيعصروه\n...
...,...,...,...,...,...
305,1983,هاني شنودة,هاني ذكي,الزمن,الزمن بينسى دايماً، مع الزمن مفيش وعود\nاللي ك...
306,1983,هاني شنودة,عبد الرحيم منصور,نور يا ليل,نور يا ليل الأسرار\nيا اللي عشقناك وإحنا صغار\...
307,1983,عزمي الكيلاني,عصام عبدالله,وقت وعشناه,وقت وعشناه إنتي وأنا، جرح حفرناه لبقية عمرنا\n...
308,1983,ياسر عبد الحليم,عوض الرخاوي,أحلى دنيا,إمتى نشوف البسمة الحلوة\nمالية شفايف كل الناس\...


We also load a predefined list of **stopwords**, because we'll be removing them from the lyrics of each song later on in this notebook. 

When it comes to data mining in NLP, there is a common belief that "the most frequent words are the most important". This seems logical at first, but in reality, it's usually the less frequent words that carry more meaning. Think about it: in a news article, words like “breaking” or “crisis” are far more informative than generic words like “the” or “is.”

These common generic words are called **stopwords**. They typically carry little meaning on their own and are often removed during text preprocessing in NLP tasks. In Arabic, stopwords include words like "و" (and), "في" (in), and "على" (on), which appear frequently but contribute little to understanding the main content of a text. Removing them helps reduce noise, improve computational efficiency, and let downstream models focus on more meaningful words.
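As a minimal sketch of the point above, the snippet below counts word frequencies with and without stopword filtering. The token list and the three-word stopword set are toy values invented for illustration; they are not taken from the dataset or from the stopword file used in this notebook.

```python
from collections import Counter

# Toy tokens and stopwords, invented for illustration only
tokens = ["و", "قلبي", "في", "و", "حبيبي", "و", "قلبي", "على"]
toy_stopwords = {"و", "في", "على"}

# Raw frequencies are dominated by the function word "و"
raw_counts = Counter(tokens)

# After filtering, the most frequent token is a content word
filtered_counts = Counter(t for t in tokens if t not in toy_stopwords)

print(raw_counts.most_common(1))
print(filtered_counts.most_common(1))
```

In this toy example, the top raw token is the conjunction "و", while the top filtered token is "قلبي" (my heart), which says far more about the content.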

In [3]:
# Load stopwords
my_stopwords = pd.read_csv(STOP_WORDS_PATH, header=None, names=["Word"], encoding="utf-8")
stopwords_set = set(my_stopwords["Word"]) # Convert stopwords into a set for faster lookup

my_stopwords

Unnamed: 0,Word
0,،
1,ء
2,ا
3,اب
4,اذار
...,...
11009,بكام
11010,لكام
11011,يل
11012,ويلا


## Inspecting Data

To ensure data quality, we first inspect the column types and non-null counts with info(), which shows one missing value each in Composer and Lyricist. We then drop every row containing a missing value, preventing potential errors during further analysis, and confirm that none remain by displaying the count of null values in each column.

In [4]:
# Inspect the dataset
songs.info()

# Drop rows with any missing data
songs.dropna(inplace=True)

# Check for missing values
songs.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310 entries, 0 to 309
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Year      310 non-null    int64 
 1   Composer  309 non-null    object
 2   Lyricist  309 non-null    object
 3   Song      310 non-null    object
 4   Lyrics    310 non-null    object
dtypes: int64(1), object(4)
memory usage: 12.2+ KB


Year        0
Composer    0
Lyricist    0
Song        0
Lyrics      0
dtype: int64

## Removing Stop Words

In [5]:
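def remove_stopwords(text):
    """Return the text with every word found in stopwords_set removed."""
    # \w+ keeps runs of letters, digits, and underscores, so punctuation is dropped here too
    words = re.findall(r'\w+', text)
    return " ".join(word for word in words if word not in stopwords_set)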

In [6]:
sample_text = 'أنا اللي تاه عقله، و لقاه ما أعرفش حد بالإسم دا موهوم       بقى وربك هداه هعتبره وقت فراغ و فاتABC 123'

print("Before Removing Stop Words:", sample_text)
sample_text = remove_stopwords(sample_text)  # Filter once and keep the result for the next steps
print("After Removing Stop Words:", sample_text)

Before Removing Stop Words: أنا اللي تاه عقله، و لقاه ما أعرفش حد بالإسم دا موهوم       بقى وربك هداه هعتبره وقت فراغ و فاتABC 123
After Removing Stop Words: أنا تاه عقله لقاه أعرفش بالإسم موهوم بقى وربك هداه هعتبره وقت فراغ فاتABC 123


The output here demonstrates the effect of the **remove_stopwords** function on a sample Arabic text. 

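The original text includes common stopwords such as "اللي" (which), "ما" (not), "حد" (someone), and "دا" (this), as well as English letters (ABC) and numbers (123) at the end. The filtered text drops these stopwords, the standalone "و", and the punctuation, while keeping the remaining words. However, the numbers ("123") and English text ("ABC") remain, so further processing is needed to remove them. The pronoun "أنا" (I) also survives this first pass, presumably because the stopword list stores it without the hamza; this is why the full pipeline later applies stopword removal a second time, after cleaning.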
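One possible alternative, sketched below, is to drop Latin letters and digits at extraction time by matching only Arabic letters. The helper name `arabic_words` and the choice of the Unicode range U+0621 to U+064A are assumptions made for this illustration; the notebook itself handles this in the cleaning step that follows.

```python
import re

# Assumption: the basic Arabic letter range U+0621..U+064A covers the lyrics.
# Diacritics (U+064B and above) are outside this range, so a diacritized
# word would be split at each mark.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")

def arabic_words(text):
    """Return only the runs of Arabic letters found in text."""
    return ARABIC_WORD.findall(text)

print(arabic_words("وقت فراغ و فاتABC 123"))
```

Here the Latin letters attached to "فات" and the standalone number are left out, while the four Arabic words are kept.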

## Cleaning Text

The **clean_text** function performs a series of preprocessing steps to clean and normalize Arabic text for NLP tasks:
- It first ensures the input is a string and removes English characters, punctuation, numbers, and extra spaces using regular expressions.
- It then applies Unicode normalization to standardize character encoding, followed by orthographic normalization to unify Arabic letter variations, such as different forms of "Alef" (ا, أ, إ) and "Teh Marbuta" (ة to ه).
- Next, it addresses text elongation by reducing any run of a repeated character to two, and then collapsing the doubled letters "وو", "يي", and "اا" to "و", "ي", and "ا".
- Finally, it removes diacritics to simplify the text and make it more uniform for analysis. These steps help improve text consistency, reduce noise, and enhance the performance of NLP models.
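The elongation step can be sketched in isolation as follows. The helper name `reduce_elongation` is hypothetical; it reuses the same regular expression and the same three replacements as **clean_text**, applied to a made-up elongated word.

```python
import re

# Any character followed by one or more copies of itself
ELONGATION = re.compile(r"(.)\1+")

def reduce_elongation(text):
    """Collapse repeated characters the same way clean_text does."""
    # Step 1: shrink runs of a repeated character to exactly two
    text = ELONGATION.sub(r"\1\1", text)
    # Step 2: collapse doubled long vowels to a single letter
    for doubled, single in (("وو", "و"), ("يي", "ي"), ("اا", "ا")):
        text = text.replace(doubled, single)
    return text

elongated = "حب" + "ي" * 5 + "بي"  # "habibi" with a stretched vowel
print(reduce_elongation(elongated))
```

One trade-off worth keeping in mind is that step 2 also collapses doubles that are part of a word's normal spelling, so "داوود" becomes "داود". Other doubled letters, such as "مم", are left as two.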

In [7]:
def clean_text(text):
    text = str(text)  # Ensure input is a string
    text = re.sub(r"[A-Za-z]", "", text)  # Remove English text
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    text = re.sub(r"\d+", "", text)  # Remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces

    # Unicode normalization
    text = normalize_unicode(text)  

    # Orthographic normalization
    text = normalize_alef_ar(text)
    text = normalize_alef_maksura_ar(text)
    text = normalize_teh_marbuta_ar(text)

    # Reduce elongation: collapse any run of a repeated character to two
    p_longation = re.compile(r"(.)\1+")
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)

    # Collapse doubled long vowels (و, ي, ا) to a single letter
    text = text.replace("وو", "و")
    text = text.replace("يي", "ي")
    text = text.replace("اا", "ا")

    # Remove diacritics
    text = dediac_ar(text)  

    return text

In [8]:
print("Before Cleaning:", sample_text)
sample_text = clean_text(sample_text)
print("After Cleaning:", sample_text)

Before Cleaning: أنا تاه عقله لقاه أعرفش بالإسم موهوم بقى وربك هداه هعتبره وقت فراغ فاتABC 123
After Cleaning: انا تاه عقله لقاه اعرفش بالاسم موهوم بقي وربك هداه هعتبره وقت فراغ فات


The output here shows the effect of the text cleaning function on the given Arabic text.

Before cleaning, the text contains Arabic words, English characters ("ABC"), numbers ("123"), and different forms of Arabic letters (e.g., the alef with hamza in "بالإسم" and the alef maksura in "بقى").

After cleaning:
- English characters and numbers are removed ("ABC 123" is gone).
- Diacritics would be stripped as well, although this sample contains none.
- Certain Arabic letters are normalized, such as "بالإسم" → "بالاسم" and "بقى" → "بقي", to standardize spelling variations.

This preprocessing step is essential for improving text consistency and preparing it for further NLP tasks, such as tokenization or analysis.

## Tokenization

**Tokenization** is a crucial preprocessing step in NLP that involves breaking text into smaller units called **tokens**, which can be words, phrases, or subwords. This process standardizes text and makes it easier to analyze by separating words from punctuation, special characters, and unnecessary elements. Tokenization enables models to process and understand text effectively, and it plays a key role in improving machine learning performance by converting text into structured representations suitable for embedding and vectorization.

We use Camel Tools’ simple_word_tokenize() to split the cleaned Arabic text into a list of word tokens.
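To make the idea concrete without depending on Camel Tools, the sketch below uses a plain regular expression that splits on whitespace and separates punctuation marks into their own tokens. It is only a rough stand-in written for this explanation, not Camel Tools' implementation, and the name `simple_tokenize_sketch` is hypothetical. It also assumes diacritics have already been removed, because Python's `\w` does not match combining marks.

```python
import re

# A token is either a run of word characters or a single punctuation mark
TOKEN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize_sketch(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN.findall(text)

print(simple_tokenize_sketch("تاه عقله، ولقاه"))
```

The Arabic comma attached to the second word comes out as its own token, giving four tokens in total. On text that has already passed through **clean_text**, no punctuation is left, so the result is simply the list of words.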

In [9]:
def tokenize_text(text):
    return simple_word_tokenize(text)    # Returns a list of tokens

In [10]:
print("Before Tokenization:", sample_text)
sample_text = tokenize_text(sample_text)
print("After Tokenization:", sample_text)

Before Tokenization: انا تاه عقله لقاه اعرفش بالاسم موهوم بقي وربك هداه هعتبره وقت فراغ فات
After Tokenization: ['انا', 'تاه', 'عقله', 'لقاه', 'اعرفش', 'بالاسم', 'موهوم', 'بقي', 'وربك', 'هداه', 'هعتبره', 'وقت', 'فراغ', 'فات']


## Text Processing

Text processing is especially important in Arabic because the language poses challenges that make NLP trickier than in English. Arabic is highly inflected, meaning words change form depending on tense, gender, and case. On top of that, diacritics and alternative letter forms mean the same word can be written in several ways.

For example, short vowel marks like "َ" (fatha) or "ِ" (kasra) can completely change a word’s meaning: "عَلَم" (ʿalam) means "flag," while "عِلْم" (ʿilm) means "knowledge." Since diacritics are often omitted in writing, text processing needs to either restore them or develop methods to handle the ambiguity.

Another challenge is spelling normalization. Arabic has multiple ways to write the same word due to variations in "Alif" (ا vs. أ vs. إ) and "Ya" (ي vs. ى). For instance, "الإسم" and "الاسم" differ only in the form of Alif. Hamza placement creates similar pairs, such as "مسؤولية" and "مسئولية" (both meaning "responsibility"), although that kind of variation is not covered by the normalization steps used in this notebook. If we don’t normalize such variants, a model might treat each pair as separate words!

All of these complexities make text preprocessing crucial in Arabic NLP. With careful text processing, we help models focus on the meaning of words rather than getting lost in small variations.
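The two effects described above can be illustrated with a small standard-library sketch. The helper `strip_diacritics` and the translation table are simplified stand-ins written for this example, not the Camel Tools functions; the strings are spelled with Unicode escapes so the diacritics are unambiguous.

```python
import re

# Tanween, short vowels, shadda, and sukun (U+064B..U+0652)
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text):
    """Remove Arabic short-vowel marks from text."""
    return DIACRITICS.sub("", text)

flag = "\u0639\u064E\u0644\u064E\u0645"       # "flag" (ain, fatha, lam, fatha, meem)
knowledge = "\u0639\u0650\u0644\u0652\u0645"  # "knowledge" (ain, kasra, lam, sukun, meem)

# Different words with diacritics, identical strings without them
print(flag == knowledge)
print(strip_diacritics(flag) == strip_diacritics(knowledge))

# Map hamza/madda Alif forms to bare Alif, and Alif Maksura to Ya
LETTER_FORMS = str.maketrans({
    "\u0623": "\u0627",  # Alif with hamza above
    "\u0625": "\u0627",  # Alif with hamza below
    "\u0622": "\u0627",  # Alif with madda
    "\u0649": "\u064A",  # Alif Maksura to Ya
})

# Three spellings of the same word that differ only in the Alif form
variants = ["\u0625\u0646\u062A", "\u0627\u0646\u062A", "\u0623\u0646\u062A"]
normalized = {v.translate(LETTER_FORMS) for v in variants}
print(len(set(variants)), "->", len(normalized))
```

Stripping diacritics merges the two different words into one string, which is the ambiguity mentioned above, while letter normalization merges three spellings of one word into a single form, which is the desired effect.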

In [11]:
def find_non_normalized(text):
    """Finds and returns non-normalized Arabic letters and diacritics in the text."""
    non_normalized_chars = {
        "أ": "Alef with Hamza Above",
        "إ": "Alef with Hamza Below",
        "آ": "Alef with Madda",
        "ى": "Final Yeh (should be ي)",
        "ة": "Teh Marbuta (should be ه)"
    }
    
    diacritics_pattern = r"[\u064B-\u065F]"  # Arabic diacritics (Harakat)

    found_chars = {}

    # Check for non-normalized letters
    for char, desc in non_normalized_chars.items():
        count = text.count(char)
        if count > 0:
            found_chars[char] = (desc, count)

    # Check for diacritics
    diacritics_matches = re.findall(diacritics_pattern, text)
    if diacritics_matches:
        found_chars["Diacritics"] = ("Arabic Harakat", len(diacritics_matches))

    return found_chars

#### Original Text

In [12]:
# Apply to the original dataset
non_normalized_counts = songs["Lyrics"].apply(find_non_normalized)

# Aggregate occurrences across all lyrics
summary = {}
for entry in non_normalized_counts:
    for key, value in entry.items():
        if key in summary:
            summary[key] = (value[0], summary[key][1] + value[1])
        else:
            summary[key] = value

# Display results
for char, (desc, count) in summary.items():
    print(f"{char} ({desc}): {count} occurrences")

if len(summary) == 0:
    print("Lyrics are now clean")

أ (Alef with Hamza Above): 5092 occurrences
إ (Alef with Hamza Below): 3389 occurrences
ى (Final Yeh (should be ي)): 2269 occurrences
ة (Teh Marbuta (should be ه)): 3405 occurrences
Diacritics (Arabic Harakat): 204 occurrences
آ (Alef with Madda): 794 occurrences
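The aggregation loop above could also be written with `collections.Counter`, as in the sketch below. The `per_song` list is hypothetical sample data shaped like the output of find_non_normalized(); the counts are made up and unrelated to the real totals.

```python
from collections import Counter

# Hypothetical per-song results, shaped like find_non_normalized() output
per_song = [
    {"أ": ("Alef with Hamza Above", 3), "ة": ("Teh Marbuta (should be ه)", 1)},
    {"أ": ("Alef with Hamza Above", 2)},
    {},  # a song with nothing left to normalize
]

totals = Counter()
descriptions = {}
for entry in per_song:
    for char, (desc, count) in entry.items():
        totals[char] += count      # Counter starts missing keys at 0
        descriptions[char] = desc  # remember the label for display

# most_common() lists characters from most to least frequent
for char, count in totals.most_common():
    print(f"{char} ({descriptions[char]}): {count} occurrences")
```

Besides removing the explicit if/else, this version prints the characters sorted by frequency rather than in the order they were first encountered.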


In [13]:
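# Apply text preprocessing. Stopwords are removed both before and after cleaning,
# so that words whose spelling changes during normalization are checked again.
songs["Lyrics"] = songs["Lyrics"].apply(remove_stopwords).apply(clean_text).apply(remove_stopwords).apply(tokenize_text)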

# Display processed data
songs.head()

Unnamed: 0,Year,Composer,Lyricist,Song,Lyrics
0,2023,محمد أحمد فؤاد,تامر حسين,بيوحشنا,"[ملازمنا, ملازمنا, خياله, وطيفه, فين, نروح, مل..."
1,2023,أحمد إبراهيم,أيمن بهجت قمر,معرفش حد بالأسم ده,"[اعرفش, بالاسم, تاه, عقله, ولقاه, اعرفش, بالاس..."
2,2023,محمد يحيي,بهاء الدين محمد,ظبط مودها,"[تظبط, مودها, اطلب, عينيها, تاخدها, واؤمر, واح..."
3,2023,محمد يحيي,محمد القاياتي,سلامك وصلي,"[سلامك, وصلي, واتاريني, واحشك, واحشني, وهتفضل,..."
4,2023,محمد يحيي,محمد البوغة,واخدين راحتهم,"[واخذين, راحتهم, قاعدين, قلبي, مربعين, وبيعصرو..."


#### After Text Processing

In [14]:
# Apply to the tokenized dataset
non_normalized_counts = songs["Lyrics"].apply(lambda tokens: find_non_normalized(" ".join(tokens)))

# Aggregate occurrences across all lyrics
summary = {}
for entry in non_normalized_counts:
    for key, value in entry.items():
        if key in summary:
            summary[key] = (value[0], summary[key][1] + value[1])
        else:
            summary[key] = value

# Display results
for char, (desc, count) in summary.items():
    print(f"{char} ({desc}): {count} occurrences")

if len(summary) == 0:
    print("Lyrics are now clean")

Lyrics are now clean


## Categorizing Decades

The purpose of categorizing decades is to support the analysis of trends and patterns in Amr Diab's songs across different time periods. Each song is assigned to a five-year period (the early or late half of a decade) based on its release year, so we can observe how lyrical themes and composer-lyricist collaborations have evolved over time.

This categorization also allows for comparative analysis, such as tracking shifts in lyrical complexity from one period to the next.

Additionally, grouping by period aids data visualization, making it easier to spot overarching trends in the dataset, such as the dominance of specific composers or lyricists in different periods.

In [15]:
def categorize_decade(year):
    if 1980 <= year <= 1984:
        return "Early 1980s"
    elif 1985 <= year <= 1989:
        return "Late 1980s"
    elif 1990 <= year <= 1994:
        return "Early 1990s"
    elif 1995 <= year <= 1999:
        return "Late 1990s"
    elif 2000 <= year <= 2004:
        return "Early 2000s"
    elif 2005 <= year <= 2009:
        return "Late 2000s"
    elif 2010 <= year <= 2014:
        return "Early 2010s"
    elif 2015 <= year <= 2019:
        return "Late 2010s"
    elif 2020 <= year <= 2025:
        return "Early 2020s"
    else:
        return None

In [16]:
# Categorize songs by decade
songs["Decade"] = songs["Year"].apply(categorize_decade)
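The same labels can be derived arithmetically instead of with a chain of conditions. The function below is a hypothetical alternative, not the notebook's method, and it differs from categorize_decade in two ways: it labels 2025 as "Late 2020s" rather than "Early 2020s", and it does not return None for years outside 1980 to 2025.

```python
def half_decade_label(year):
    """Label a year as the early or late half of its decade."""
    decade = year // 10 * 10                     # e.g. 1997 -> 1990
    half = "Early" if year % 10 < 5 else "Late"  # years ending 0-4 vs 5-9
    return f"{half} {decade}s"

for year in (1983, 1999, 2023):
    print(year, "->", half_decade_label(year))
```

This keeps the labels consistent if songs from later years are added, at the cost of no longer flagging out-of-range years.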

## Splitting Names

In [17]:
def split_name(name):
    if pd.isna(name):
        return None, None
    parts = name.split()
    first_name = parts[0]
    last_name = " ".join(parts[1:]) if len(parts) > 1 else None
    return first_name, last_name


In [18]:
# Extract first and last names
songs["Composer_first_name"], songs["Composer_last_name"] = zip(*songs["Composer"].apply(split_name))
songs["Lyricist_first_name"], songs["Lyricist_last_name"] = zip(*songs["Lyricist"].apply(split_name))
#songs.drop(columns=["Year", "Composer", "Lyricist"], inplace=True)
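As a quick illustration of how the first-token split behaves, the sketch below reimplements the same logic without pandas and applies it to names from the dataset. The function name `split_name_once` is hypothetical, and unlike split_name it also returns (None, None) for empty or whitespace-only strings.

```python
def split_name_once(name):
    """Split a full name into (first token, remaining tokens joined by a space)."""
    if not isinstance(name, str):
        return None, None  # covers missing values such as NaN
    parts = name.split()
    if not parts:
        return None, None  # empty or whitespace-only input
    first, *rest = parts
    return first, " ".join(rest) or None

for full_name in ("تامر حسين", "بهاء الدين محمد", "عبد الرحيم منصور"):
    print(split_name_once(full_name))
```

Note that compound given names are cut after their first word: "بهاء الدين محمد" yields "بهاء" as the first name and "الدين محمد" as the last name. Any analysis that groups by first or last name should keep this limitation in mind.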


In [19]:
# Save the processed data
songs.to_csv(PREPROC_PATH, index=False, encoding="utf-8")

# Display processed data
songs.head()

Unnamed: 0,Year,Composer,Lyricist,Song,Lyrics,Decade,Composer_first_name,Composer_last_name,Lyricist_first_name,Lyricist_last_name
0,2023,محمد أحمد فؤاد,تامر حسين,بيوحشنا,"[ملازمنا, ملازمنا, خياله, وطيفه, فين, نروح, مل...",Early 2020s,محمد,أحمد فؤاد,تامر,حسين
1,2023,أحمد إبراهيم,أيمن بهجت قمر,معرفش حد بالأسم ده,"[اعرفش, بالاسم, تاه, عقله, ولقاه, اعرفش, بالاس...",Early 2020s,أحمد,إبراهيم,أيمن,بهجت قمر
2,2023,محمد يحيي,بهاء الدين محمد,ظبط مودها,"[تظبط, مودها, اطلب, عينيها, تاخدها, واؤمر, واح...",Early 2020s,محمد,يحيي,بهاء,الدين محمد
3,2023,محمد يحيي,محمد القاياتي,سلامك وصلي,"[سلامك, وصلي, واتاريني, واحشك, واحشني, وهتفضل,...",Early 2020s,محمد,يحيي,محمد,القاياتي
4,2023,محمد يحيي,محمد البوغة,واخدين راحتهم,"[واخذين, راحتهم, قاعدين, قلبي, مربعين, وبيعصرو...",Early 2020s,محمد,يحيي,محمد,البوغة
