# CIS 5450 Project: Difficulty Topics
**Group Members:**
* **Alon Jacoby**
* **Trey Elder**
* **Nicky Desai**

> This notebook documents how we implemented difficulty topics in our project. Use the link button in the top right when you select a cell to get a **hyperlink**.


https://colab.research.google.com/drive/1DdtT9NmdUAO-zXHJ9C5HkocWQOC8MWhv

## Topic 1: Feature Engineering
[Hyperlink](https://colab.research.google.com/drive/1DdtT9NmdUAO-zXHJ9C5HkocWQOC8MWhv#scrollTo=cM49fbucqg11&line=1&uniqifier=1)

### **Why we used this concept**
The raw Spotify dataset contains audio features, metadata, and track names—but many of these variables do not directly encode the underlying musical or structural characteristics of each song. From our density plots and exploratory analysis, we observed:

* Several audio variables are quasi-binary or clustered, suggesting they should be binned rather than treated as continuous  
* Categorical audio metadata (`key`, `mode`, `time_signature`) is stored as integer codes whose magnitudes carry no numeric meaning, so it must be encoded before modeling  
* Artist-level patterns (frequency of appearance, stylistic consistency) influence genre identity but are not encoded in the base attributes  

Feature engineering allows us to transform raw inputs into **model-ready signals** that better reflect musically meaningful structure. These transformations improve clustering behavior, model performance, and interpretability.

### **How we implemented it**
We engineered four major families of features:

#### 1. Binning Quasi-Binary Audio Features
* **Motivation:** Density plots showed strong bimodality for `instrumentalness`, `acousticness`, and `speechiness`.  
* **Process:**
  - Convert `instrumentalness` → binary (`> 0.5`)
  - Convert `acousticness` → binary (`> 0.5`)
  - Convert `speechiness` → 3-level ordinal category  
* **Output:** Clean categorical variables that reflect meaningful musical distinctions.
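
A minimal sketch of this binning step, assuming pandas and the column names above. The `0.5` thresholds come from the description; the `speechiness` cut points (`0.33`, `0.66`) and the output column names are illustrative assumptions, not necessarily the values used in the notebook.

```python
import pandas as pd


def bin_audio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with binned versions of three audio features."""
    out = df.copy()
    # Binary flags using the 0.5 threshold described above.
    out["instrumentalness_bin"] = (out["instrumentalness"] > 0.5).astype(int)
    out["acousticness_bin"] = (out["acousticness"] > 0.5).astype(int)
    # Three ordered levels; the 0.33 / 0.66 cut points are assumed for illustration.
    out["speechiness_level"] = pd.cut(
        out["speechiness"],
        bins=[-float("inf"), 0.33, 0.66, float("inf")],
        labels=["low", "medium", "high"],
    )
    return out


# Toy rows standing in for the Spotify data.
demo = pd.DataFrame(
    {
        "instrumentalness": [0.01, 0.92, 0.40],
        "acousticness": [0.80, 0.10, 0.55],
        "speechiness": [0.04, 0.45, 0.90],
    }
)
binned = bin_audio_features(demo)
```

`pd.cut` returns an ordered categorical here, so the three `speechiness` levels keep their low-to-high ordering if a later step needs it.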

#### 2. One-Hot Encoding of Categorical Features
* **Encoder:** `OneHotEncoder(sparse_output=False, handle_unknown='ignore')`  
* **Categorical features encoded:**  
  - `mode`, `time_signature`, `key`  
  - engineered `speechiness`, `instrumentalness`, `acousticness`
* **Output:** A fully numeric representation of categorical musical attributes.

#### 3. Higher-Level Engineered Features (DuckDB)
* **Method:** SQL query executed via DuckDB  
* **Features created:**  
  - `artist_track_count`  
  - `energy_danceability` interaction term  
  - `title_length` (character count)  
* **Output:** Contextual features capturing artist prominence, rhythmic intensity, and titling patterns.

#### 4. Artist Group Identification (Regex-Based Feature)
* **Motivation:** Artist names often include structural cues indicating solo vs. group performers (“&”, “and”, “band”, “trio”).  
* **Output:** `artist_is_group` boolean variable for distinguishing ensemble-based artists.
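
A sketch of how the cues listed above could be turned into a flag. The pattern below is an illustrative guess built from the four listed cues, not the notebook's actual regex; word boundaries keep names like "Brandy" from matching on the embedded "and".

```python
import re

import pandas as pd

# "&" anywhere, or the whole words "and", "band", "trio" (case-insensitive).
# This pattern is an assumption for illustration.
GROUP_PATTERN = re.compile(r"&|\b(?:and|band|trio)\b", flags=re.IGNORECASE)


def is_group(name: str) -> bool:
    """Return True if the artist name contains a group-style cue."""
    return bool(GROUP_PATTERN.search(name))


artists = pd.Series(
    [
        "Simon & Garfunkel",
        "Dave Matthews Band",
        "Brandy",
        "Florence and the Machine",
        "Adele",
    ]
)
artist_is_group = artists.map(is_group)
```

A name-based heuristic like this will miss groups without such cues in their names, so the resulting flag is best read as a noisy signal rather than ground truth.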

### **Results & Interpretation**
- **Binned and one-hot encoded features** clarified the structure of audio attributes, improving the model’s interpretability around “instrumental vs. vocal,” “acoustic vs. electronic,” and “spoken vs. musical” distinctions.
- **DuckDB-derived features** enriched the dataset with musically relevant context, such as artist prominence, rhythmic interactions, and title styling cues.
- **Artist group identification** introduced a new dimension of interpretability, helping models differentiate ensemble-driven genres from solo-driven styles.

These engineered features are used directly in the modeling pipeline (Part 5) and help explain several patterns observed in our results—such as clusters dominated by acoustic singer-songwriters, high-energy rock bands, or speech-heavy spoken-word tracks.





## Topic 2: Entity Linking / External Model Integration  
[Hyperlink](https://colab.research.google.com/drive/1DdtT9NmdUAO-zXHJ9C5HkocWQOC8MWhv#scrollTo=ZM-SoQ0PBrsP)

### **Why we used this concept**
Track titles contain rich implicit information about genre, mood, theme, and stylistic intent—yet this meaning is not available in numerical form within the raw dataset. To meaningfully incorporate this semantic signal, we needed a method to **link each Spotify record to external knowledge** that captures linguistic and contextual patterns learned from large text corpora.

Record linking via pretrained language models allows us to enrich the dataset with **semantic embeddings** that encode:
- emotional tone  
- genre-specific keywords  
- cultural or thematic associations  
- stylistic similarities between tracks  

Integrating this external semantic representation provides information not captured by audio features alone and strengthens downstream clustering and modeling tasks.

### **How we implemented it**
We applied a transformer-based encoding pipeline using the pretrained `all-MiniLM-L6-v2` model from the Sentence-Transformers library. The process involved:

1. **Tokenizing track titles** into model-ready sequences  
2. **Encoding titles using the transformer**, leveraging either GPU or CPU depending on availability  
3. **Applying mean pooling** to obtain a single dense vector per track  
4. **Batch processing** to ensure efficiency  
5. **Appending the resulting embeddings** (384-dimensional vectors) back into the dataset as new semantic features  

The transformer acts as an *external knowledge source*, effectively linking each track to a position in a semantic space learned from large text corpora.
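
The mean pooling in step 3 can be illustrated without loading the model. The sketch below uses NumPy with tiny made-up token vectors (2 dimensions instead of 384); the function name and inputs are hypothetical. It averages token embeddings per title while ignoring padded positions, which is the usual way mean pooling is done for padded batches.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, skipping padding.

    token_embeddings: shape (batch, seq_len, dim)
    attention_mask:   shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Two toy "titles": the first has one padded position that must be ignored.
tokens = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
        [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
    ]
)
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = mean_pool(tokens, mask)
```

In practice, calling `SentenceTransformer("all-MiniLM-L6-v2").encode(titles, batch_size=...)` performs tokenization, batching, and pooling internally, so explicit pooling like this is only needed when working with the underlying transformer directly.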

### **Results & Interpretation**
- These embeddings captured song-level meaning far better than raw text, enabling the model to differentiate between thematic categories such as:
  - reflective singer-songwriter ballads  
  - high-energy party songs  
  - orchestral or classical works  
  - country tracks with genre-specific vocabulary  
- The enriched dataset allowed clusters to form along **semantic as well as acoustic axes**, leading to more coherent genre-like groupings.
- The external model integration improved interpretability by revealing connections between text-derived and audio-derived patterns.

Overall, this record linking step added a powerful second modality to our dataset, enabling the project to use textual semantics alongside audio structure to produce deeper insights and better model performance.


## Topic 3: Hyperparameter Tuning
[Hyperlink](https://colab.research.google.com/drive/1DdtT9NmdUAO-zXHJ9C5HkocWQOC8MWhv#scrollTo=FUdNqETlEugx)

### **Why we used this concept**
Clustering algorithms such as **DBSCAN** and **K-Means** are highly sensitive to their configuration settings. Small changes in parameters can completely alter cluster structure, cause groups to collapse, or produce meaningless partitions—especially in a feature space as complex as ours, which blends audio descriptors with semantic embeddings.

To ensure our clusters reflected **true underlying musical and semantic patterns**, rather than arbitrary defaults, we used hyperparameter tuning to systematically search for the settings that produced the strongest—and most interpretable—structure. Tuning allowed us to:

- identify density thresholds appropriate for high-dimensional embeddings  
- prevent DBSCAN from assigning all points to noise  
- find stable values for `eps`, `min_samples`, and number of clusters  
- evaluate model behavior across different distance metrics  
- strengthen alignment between clusters and known genre patterns  

Hyperparameter tuning was therefore essential for producing **meaningful, robust, and musically coherent** clusters.

### **How we implemented it**
We carried out three forms of hyperparameter tuning tailored to the needs of each clustering method:

1. **Sweeping over the number of clusters for K-Means**  
   - Evaluated a range of `k` values on both raw and standardized features  
   - Computed **inertia** and **homogeneity score** for each `k`  
   - Identified cluster counts that balanced compactness with interpretability  

2. **Grid search for DBSCAN hyperparameters**  
   - Constructed a search grid for `eps` and `min_samples`  
   - Fit DBSCAN on a subsampled training set for each combination  
   - Required at least two clusters to apply the silhouette score  
   - Selected the configuration that maximized **silhouette score**  
   - Functioned as a manual, unsupervised analog to **GridSearchCV**  

3. **Bayesian Optimization for DBSCAN hyperparameters**
   - Used the grid search results to narrow the parameter ranges, allowing a finer-grained search within that subset  
   - Ran the search for 50 iterations, fitting DBSCAN once per iteration  
   - Defined the objective as maximizing the **silhouette score**
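
The K-Means sweep in item 1 might look roughly like the following. This is a sketch on synthetic blobs, where the generating centers stand in for genre labels; the range of `k`, `n_init`, and the random seeds are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix; y plays the role of genre labels.
X, y = make_blobs(n_samples=300, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

sweep = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    sweep.append(
        {
            "k": k,
            "inertia": km.inertia_,  # within-cluster sum of squares
            "homogeneity": homogeneity_score(y, km.labels_),
        }
    )

for row in sweep:
    print(f"k={row['k']}  inertia={row['inertia']:.1f}  homogeneity={row['homogeneity']:.3f}")
```

Inertia tends to keep falling as `k` grows regardless of cluster quality, so it is usually read for an elbow and paired with a label-aware metric such as homogeneity.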

These tuning strategies allowed us to explore parameter landscapes systematically and avoid unreliable, default-driven outcomes.
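
The DBSCAN grid search could be sketched as below, again on synthetic data. The grid values are illustrative, and scoring only the non-noise points is an assumption made here; the notebook may handle noise differently.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the subsampled training set.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

best = {"score": -1.0, "eps": None, "min_samples": None}
for eps, min_samples in product([0.1, 0.2, 0.3, 0.5], [3, 5, 10]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clustered = labels != -1  # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[clustered]))
    if n_clusters < 2:
        continue  # silhouette score is undefined for fewer than two clusters
    # Assumption: score only the points that were assigned to a cluster.
    score = silhouette_score(X[clustered], labels[clustered])
    if score > best["score"]:
        best = {"score": score, "eps": eps, "min_samples": min_samples}

print(best)
```

Because there are no labels to cross-validate against, the loop plays the role that `GridSearchCV` would play in a supervised setting, with silhouette score as the selection criterion.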

### **Results & Interpretation**
- **DBSCAN**  
  - Default settings produced trivial or unstable clustering (e.g., excessive noise points).  
  - Tuning revealed parameter regions where DBSCAN discovered **clear density-based groupings** consistent with musical structure—such as acoustic ballads, electronic tracks, orchestral pieces, and lyric-centric pop.  
  - The optimal configuration balanced local density sensitivity with global separability.

- **K-Means**  
  - Varying `k` showed that certain cluster counts aligned far better with genre labels and the semantic–acoustic structure of the dataset.  
  - Improvements in inertia and homogeneity indicated that tuning `k` helped K-Means partition the space more coherently.  
  - Overly small or overly large `k` values degraded interpretability—tuning identified a reasonable middle ground.

Overall, hyperparameter tuning transformed clustering from a **parameter-guessing exercise** into a structured, data-driven process, enabling our models to produce richer, more interpretable, and musically meaningful insights.

