# Song Lyrics Data Exploration

This notebook explores the raw song lyrics dataset and performs initial data preprocessing steps.

## Data Overview
The dataset contains song lyrics with the following features:
- title: Song title
- tag: Genre/category tag
- artist: Song artist
- year: Release year
- views: Number of views
- features: Collaborating artists
- lyrics: Song lyrics text
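- language: Detected language of lyrics
- language_cld3: Language of the lyrics as detected by CLD3
- language_ft: Language of the lyrics as detected by fastText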
- id: Unique song identifier assigned by the Genius platform

## Setup
Import required libraries

In [2]:
import pandas as pd

## Load Raw Data

In [2]:
raw_data = pd.read_csv("../../data/raw/song_lyrics.csv")
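
If the raw file is too large to load comfortably in one call, reading it in chunks and filtering each chunk is one option. The sketch below is an illustration only: it uses a small in-memory CSV with hypothetical rows as a stand-in for `song_lyrics.csv`, and the chunk size is arbitrary.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for song_lyrics.csv (hypothetical rows).
sample_csv = io.StringIO(
    "title,artist,lyrics,language\n"
    "Song A,Artist 1,hello world,en\n"
    "Song B,Artist 2,hola mundo,es\n"
    "Song C,Artist 1,good morning,en\n"
)

english_chunks = []
total_rows = 0

# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(sample_csv, chunksize=2):
    total_rows += len(chunk)
    english_chunks.append(chunk[chunk["language"] == "en"])

english_sample = pd.concat(english_chunks, ignore_index=True)
print("Total rows:", total_rows)
print("English rows:", len(english_sample))
```

With the real file, the `io.StringIO` object would be replaced by the path passed to `pd.read_csv` above.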

## Initial Data Inspection
Let's look at the first few rows of our dataset

In [3]:
raw_data.head(5)

## Filter English Lyrics
Extract only English language songs for our analysis

In [4]:
english_lyrics = raw_data[raw_data["language"] == "en"]
english_lyrics.head(5)
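
Before filtering, it can help to see how many rows each language has, including rows with no detected language. The following is a minimal sketch on hypothetical data; it also takes a `.copy()` of the filtered frame, which avoids pandas' `SettingWithCopyWarning` if columns are assigned later.

```python
import pandas as pd

# Hypothetical rows; None stands for a song with no detected language.
sample = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C", "Song D"],
        "language": ["en", "es", None, "en"],
    }
)

# dropna=False keeps missing values as their own category in the counts.
language_counts = sample["language"].value_counts(dropna=False)
print(language_counts)

# Rows with a missing language never compare equal to "en",
# so they are excluded by this filter.
english_only = sample[sample["language"] == "en"].copy()
print("English rows:", len(english_only))
```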

## Data Statistics
Count the English and total songs, then save the English subset to an interim CSV file

In [5]:
print('English lyrics: ', len(english_lyrics))
print('Total lyrics: ', len(raw_data))

# save the english lyrics to a csv file
english_lyrics.to_csv("../../data/interim/english_lyrics.csv", index=False)

## Load Preprocessed Data
Load the saved English lyrics for further analysis

In [3]:
english_lyrics = pd.read_csv("../../data/interim/english_lyrics.csv")

## Dataset Composition Analysis

In [6]:
# ratio of english lyrics
print('Ratio of english lyrics: ', len(english_lyrics) / len(raw_data))

## Artist and Collaboration Analysis
Analyze the distribution of artists and collaborations in the dataset

In [4]:
# count the number of songs per artist (most frequent first)
artist_counts = english_lyrics["artist"].value_counts()
print("songs per artist: ", artist_counts)

# count how often each distinct value of the features column occurs
featured_collaborators = english_lyrics["features"].value_counts()
print("songs per features value: ", featured_collaborators)
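
`value_counts()` returns the number of songs per artist, not the number of distinct artists. A small sketch on hypothetical data shows the difference between it and `nunique()`:

```python
import pandas as pd

# Hypothetical artist column.
sample = pd.DataFrame({"artist": ["A", "B", "A", "C", "A"]})

# One row per distinct artist, sorted by song count in descending order.
songs_per_artist = sample["artist"].value_counts()

# A single integer: how many distinct artists there are.
n_unique_artists = sample["artist"].nunique()

print(songs_per_artist.head(2))
print("Unique artists:", n_unique_artists)
```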

## Filter Solo Songs
Remove songs with featured artists to focus on solo performances

In [5]:
# keep only rows whose features value is "{}" (no featured artists)
english_lyrics = english_lyrics[english_lyrics["features"] == "{}"]

print('Number of rows with no featured artists: ', len(english_lyrics))
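
To see how much data this filter removes, the boolean mask can be summarized before it is applied. This is a sketch on hypothetical rows; the format of the non-empty `features` value is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical rows; the non-empty "features" format is an assumption.
sample = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C", "Song D"],
        "features": ["{}", '{"Guest Artist"}', "{}", None],
    }
)

# True where the song has no featured artists.
# A missing value is not equal to "{}", so it counts as "not solo".
is_solo = sample["features"] == "{}"

n_solo = int(is_solo.sum())
n_removed = int((~is_solo).sum())
print(f"solo: {n_solo}, removed: {n_removed}, solo share: {is_solo.mean():.0%}")
```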

## Data Cleanup
Remove unnecessary columns and clean the data

In [6]:
columns_to_drop = [
    "views",
    "features",
    "language",
    "language_cld3",
    "language_ft",
    "id",
    "year",
]

english_lyrics = english_lyrics.drop(columns=columns_to_drop)
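
`DataFrame.drop` raises a `KeyError` if any listed column is absent, for example when the cell is re-run on a frame that has already been trimmed. One possible way to make the step repeatable is `errors="ignore"`, sketched here on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame that lacks one of the columns to drop.
sample = pd.DataFrame(
    {
        "title": ["Song A"],
        "artist": ["Artist 1"],
        "views": [10],
        "year": [2020],
    }
)

columns_to_drop = ["views", "language_cld3", "year"]

# Report which requested columns are not present.
missing = [col for col in columns_to_drop if col not in sample.columns]
print("Not present:", missing)

# errors="ignore" skips absent labels instead of raising KeyError.
trimmed = sample.drop(columns=columns_to_drop, errors="ignore")
print(list(trimmed.columns))
```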

## Lyrics Text Cleaning
Clean the lyrics text by removing section markers and formatting

In [7]:
import re

def clean_lyrics(lyrics):
    # Missing lyrics are read from CSV as NaN (a float); treat them as empty
    if not isinstance(lyrics, str):
        return ""

    # Remove section tags like [Intro], [Verse 1], etc.
    cleaned = re.sub(r"\[.*?\]", "", lyrics)

    # Drop credits or any other text after the first "---"
    cleaned = cleaned.split("---")[0]

    # Remove punctuation and symbols, keeping word characters and whitespace
    # (this also strips apostrophes, so "don't" becomes "dont")
    cleaned = re.sub(r"[^\w\s]", "", cleaned)

    # Collapse blank lines and trim leading/trailing whitespace
    cleaned = re.sub(r"\n\s*\n", "\n", cleaned).strip()

    return cleaned

# Apply the cleaning function
english_lyrics["lyrics"] = english_lyrics["lyrics"].apply(clean_lyrics)
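
A short worked example makes the effect of the cleaning steps concrete. The snippet repeats the same four regex steps as the cell above so that it runs on its own; the input string is made up for illustration.

```python
import re


def clean_lyrics(lyrics):
    # Same four steps as the notebook cell above.
    cleaned = re.sub(r"\[.*?\]", "", lyrics)             # drop [Section] tags
    cleaned = re.split(r"---", cleaned)[0]               # drop text after "---"
    cleaned = re.sub(r"[^\w\s\n]", "", cleaned)          # drop punctuation
    cleaned = re.sub(r"\n\s*\n", "\n", cleaned).strip()  # collapse blank lines
    return cleaned


# Made-up lyrics with section tags, punctuation, and a credits block.
raw = "[Verse 1]\nHello, world!\n\n[Chorus]\nDon't stop\n---\nCredits: someone"

result = clean_lyrics(raw)
print(repr(result))  # 'Hello world\nDont stop'
```

Note that the apostrophe in "Don't" is removed along with the other punctuation, which may matter for later tokenization.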

## Save Processed Data
Save the cleaned and processed dataset

In [None]:
# save the cleaned english lyrics to a csv file
english_lyrics.to_csv("../../data/processed/english_lyrics_cleaned.csv", index=False)
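
One thing worth checking after saving: lyrics that the cleaning step reduces to an empty string are written as an empty CSV field, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below demonstrates this on hypothetical data with an in-memory buffer instead of a file, and shows one possible way to drop such rows before saving.

```python
import io

import pandas as pd

# Hypothetical cleaned frame; the second song has empty lyrics.
cleaned = pd.DataFrame(
    {
        "title": ["Song A", "Song B"],
        "lyrics": ["hello world", ""],
    }
)

# Round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string comes back as NaN.
n_missing = int(reloaded["lyrics"].isna().sum())
print("Missing lyrics after reload:", n_missing)

# Optional: keep only rows with non-empty lyrics before saving.
non_empty = cleaned[cleaned["lyrics"].str.strip() != ""]
print("Rows with lyrics:", len(non_empty))
```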