# Feature Engineering Specification — Spotify Cross-Border Diffusion Model

---

## 1. Prediction Objective

The objective of this project is to model the **cross-border diffusion of tracks across Spotify markets**.

For each track and each possible target country, we predict:

- **Classification task:**  
  `did_enter_target_country` → Will the track ever chart in this country?

- **Regression task:**  
  `days_to_enter_target_country` → How many days after `reference_time` will it first chart there?

Each row in the training dataset represents one prediction scenario:

```
(track_id, target_country, reference_time)
```

This simulates making a prediction at a specific moment in time, using only information available at that time.

---

## 2. Reference Time Definition

### Definition

We define:

```
reference_time = first global chart appearance of the track (Top 200 only)
```

SQL construction:

```sql
SELECT track_id, MIN(date) AS reference_time
FROM spotify_data
WHERE chart = 'top200'
GROUP BY track_id;
```
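As a sketch of how the query above can be extended to one row per prediction scenario, the example below cross joins each track's `reference_time` with a list of candidate countries. It runs on Python's built-in `sqlite3` only so that it is self-contained. The rows are made up, and taking the candidate countries from the distinct `region` values is an assumption rather than part of this spec.

```python
import sqlite3

# Hypothetical toy data: column names follow the spec, the rows are invented.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE spotify_data (track_id TEXT, region TEXT, date TEXT, chart TEXT)"
)
con.executemany(
    "INSERT INTO spotify_data VALUES (?, ?, ?, ?)",
    [
        ("t1", "US", "2021-01-01", "top200"),
        ("t1", "DE", "2021-01-05", "top200"),
        ("t2", "DE", "2021-02-01", "top200"),
        ("t2", "US", "2021-01-20", "viral50"),  # not Top 200, so ignored for reference_time
    ],
)

scenarios = con.execute(
    """
    WITH ref AS (
        SELECT track_id, MIN(date) AS reference_time
        FROM spotify_data
        WHERE chart = 'top200'
        GROUP BY track_id
    ),
    countries AS (
        -- Assumption: candidate targets are all regions seen in the data.
        SELECT DISTINCT region AS target_country FROM spotify_data
    )
    SELECT ref.track_id, countries.target_country, ref.reference_time
    FROM ref CROSS JOIN countries
    ORDER BY ref.track_id, countries.target_country
    """
).fetchall()

for row in scenarios:
    print(row)
```

The SQL is deliberately plain, so it should port to DuckDB with little or no change.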

### Why this reference time is chosen

At this moment, the following information is **known**:
- Track intrinsic properties
- Artist historical performance
- Market characteristics
- Origin country (first chart appearance)

At this moment, the following information is **NOT yet known**:
- Whether the track will enter other countries later
- When it will enter other countries

This ensures that the model does not use future information and prevents temporal leakage.

---

## 3. Feature Categories Overview

We engineer features in the following logical groups:

| Category | Purpose |
|----------|---------|
| Identifiers and targets | Define training dataset structure |
| Track intrinsic features | Describe inherent properties of the track |
| Artist features | Capture artist strength and global reach |
| Diffusion state features | Capture how far the track has already spread |
| Target country features | Describe structural properties of the destination market |
| Geographic relationship features | Describe relationship between origin and target country |
| Temporal features | Capture seasonal timing effects |

Together, these model:

```
track attractiveness × artist strength × current diffusion state × market receptiveness × geographic proximity
```

---

## 4. Feature Definitions and Construction

---

### Category A — Identifiers and Targets

These define dataset structure and prediction labels.

| Feature | Description | Source Columns | How to Build |
|---------|-------------|----------------|--------------|
| `track_id` | Track identifier | `track_id` | Direct column |
| `target_country` | Country to predict entry into | `region` | Cross join tracks with all countries |
| `reference_time` | First global chart date (Top 200) | `date`, `chart` | `MIN(date) WHERE chart = 'top200' GROUP BY track_id` |
| `origin_country` | First charting country | `region`, `date` | `region WHERE date = reference_time`; a tie-break rule is needed when a track debuts in several regions on the same day |
| `did_enter_target_country` | Classification target | `region`, `date` | `1` if track appears in `target_country` Top 200 at `date > reference_time`, else `0` |
| `days_to_enter_target_country` | Regression target | `region`, `date` | `MIN(date_target_country) − reference_time` in days; `NULL` if the track never enters |
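
One possible way to derive both targets in a single query is sketched below, again using Python's built-in `sqlite3` on invented rows. The `countries` list and the `first_entry` CTE are illustrative names, not part of the spec, and `julianday` is SQLite-specific (in DuckDB a date subtraction or date-difference function would take its place).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE spotify_data (track_id TEXT, region TEXT, date TEXT, chart TEXT)"
)
con.executemany(
    "INSERT INTO spotify_data VALUES (?, ?, ?, ?)",
    [  # hypothetical rows
        ("t1", "US", "2021-01-01", "top200"),
        ("t1", "US", "2021-01-02", "top200"),
        ("t1", "DE", "2021-01-05", "top200"),
        ("t1", "DE", "2021-01-06", "top200"),
    ],
)

labels = con.execute(
    """
    WITH ref AS (
        SELECT track_id, MIN(date) AS reference_time
        FROM spotify_data
        WHERE chart = 'top200'
        GROUP BY track_id
    ),
    countries(target_country) AS (VALUES ('DE'), ('FR')),  -- hypothetical targets
    first_entry AS (
        -- First Top 200 date per region strictly after reference_time.
        SELECT s.track_id, s.region, MIN(s.date) AS entry_date
        FROM spotify_data s
        JOIN ref r ON r.track_id = s.track_id
        WHERE s.chart = 'top200' AND s.date > r.reference_time
        GROUP BY s.track_id, s.region
    )
    SELECT r.track_id,
           c.target_country,
           CASE WHEN f.entry_date IS NULL THEN 0 ELSE 1 END
               AS did_enter_target_country,
           CAST(julianday(f.entry_date) - julianday(r.reference_time) AS INTEGER)
               AS days_to_enter_target_country
    FROM ref r
    CROSS JOIN countries c
    LEFT JOIN first_entry f
           ON f.track_id = r.track_id AND f.region = c.target_country
    ORDER BY c.target_country
    """
).fetchall()

print(labels)
```

The `LEFT JOIN` keeps scenarios where the track never enters, which leaves the regression target `NULL` for those rows.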

---

### Category B — Track Intrinsic Features

These features describe inherent properties of the track and are known immediately.

| Feature | Description | Source Columns | How to Build |
|---------|-------------|----------------|--------------|
| `song_language` | Language of the song | `artist`, `origin_country` | Inferred from artist metadata and origin country |
| `days_since_release` | Track age at prediction time | `release_date`, `date` | `reference_time − release_date` |
| `available_markets_count` | Number of markets track is available in | `available_markets` | Count number of markets in `available_markets` |
| `duration_ms` | Track length | `duration_ms` | `TRY_CAST(duration_ms AS DOUBLE)` |
| `explicit_flag` | Explicit content indicator | `explicit` | Convert to binary (1 if true, else 0) |
| `release_day_of_week` | Day of week the song was released | `release_date` | `EXTRACT(DOW FROM release_date)` |

Audio features (intrinsic track properties):

| Feature | Source | How to Build |
|---------|--------|--------------|
| `af_danceability` | `af_danceability` | Value `WHERE date = reference_time` |
| `af_energy` | `af_energy` | Same |
| `af_valence` | `af_valence` | Same |
| `af_tempo` | `af_tempo` | Same |
| `af_loudness` | `af_loudness` | Same |
| `af_acousticness` | `af_acousticness` | Same |
| `af_speechiness` | `af_speechiness` | Same |
| `af_instrumentalness` | `af_instrumentalness` | Same |
| `af_liveness` | `af_liveness` | Same |

> **Note:** `popularity_at_reference_time` was excluded. Spotify's popularity score is algorithmically updated based on recent streams and could act as a leaky proxy for the target variable.

---

### Category C — Artist-Level Features

These describe artist historical success and global reach.

**Computed using only observations BEFORE `reference_time`.**

| Feature | Description | Source Columns | How to Build |
|---------|-------------|----------------|--------------|
| `artist_prior_chart_count` | Total prior chart entries | `artist`, `date` | `COUNT(*) WHERE date < reference_time` |
| `artist_prior_unique_regions` | Number of countries artist charted in | `artist`, `region`, `date` | `COUNT(DISTINCT region) WHERE date < reference_time` |
| `artist_prior_best_rank` | Best rank achieved historically | `rank`, `artist`, `date` | `MIN(rank) WHERE date < reference_time` |
| `artist_prior_unique_tracks` | Number of tracks charted | `track_id`, `artist`, `date` | `COUNT(DISTINCT track_id) WHERE date < reference_time` |
| `multi_artist_flag` | Collaboration indicator | `artist` | `1` if `artist` contains "feat.", "&", or ",", else `0` |
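
The strict `date < reference_time` cutoff can be illustrated with a small sketch. It uses Python's built-in `sqlite3` and invented rows; passing `artist` and `reference_time` as query parameters is a simplification for the example, whereas a full pipeline would more likely join against the per-track reference table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE spotify_data "
    "(artist TEXT, track_id TEXT, region TEXT, date TEXT, rank INTEGER)"
)
con.executemany(
    "INSERT INTO spotify_data VALUES (?, ?, ?, ?, ?)",
    [  # hypothetical rows
        ("Artist A", "old1", "US", "2020-06-01", 50),
        ("Artist A", "old1", "GB", "2020-06-02", 80),
        ("Artist A", "old2", "US", "2020-09-01", 12),
        ("Artist A", "new", "US", "2021-01-01", 30),  # debut day: excluded by '<'
        ("Artist A", "new", "DE", "2021-01-03", 5),   # future: excluded
    ],
)

artist, reference_time = "Artist A", "2021-01-01"
artist_features = con.execute(
    """
    SELECT COUNT(*)                 AS artist_prior_chart_count,
           COUNT(DISTINCT region)   AS artist_prior_unique_regions,
           MIN(rank)                AS artist_prior_best_rank,
           COUNT(DISTINCT track_id) AS artist_prior_unique_tracks
    FROM spotify_data
    WHERE artist = ? AND date < ?
    """,
    (artist, reference_time),
).fetchone()

print(artist_features)
```

Note that the rank-5 row is the artist's best position overall, but it lies after `reference_time` and therefore must not influence `artist_prior_best_rank`.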

---

### Category D — Diffusion State Features

These capture the track's initial charting strength at `reference_time`.

| Feature | Description | Source Columns | How to Build |
|---------|-------------|----------------|--------------|
| `num_regions_entered_so_far` | Countries already entered | `region`, `date` | `COUNT(DISTINCT region) WHERE date ≤ reference_time` |
| `num_continents_entered_so_far` | Continents entered | `country_continent` | `COUNT(DISTINCT country_continent) WHERE date ≤ reference_time` |
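| `origin_is_major_market` | Origin country is US, UK, or DE | `origin_country` | `1` if origin in ('US', 'UK', 'DE'), else `0`; the codes must match the encoding used in `region` (ISO 3166-1 uses `GB` for the United Kingdom) |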
| `best_rank_so_far` | Best chart position on debut | `rank`, `date` | `MIN(rank) WHERE date ≤ reference_time` |
| `streams_sum_so_far` | Total streams on debut | `streams`, `date` | `SUM(streams) WHERE date ≤ reference_time` |
| `is_viral50_flag` | Track is on the Viral 50 chart at reference time | `chart`, `date` | `1` if track appears in `chart = 'viral50'` at `date ≤ reference_time` |
| `same_language_country_entered_count` | Countries with same language already entered | `region`, `date`, `country_official_language` | `COUNT(DISTINCT region) WHERE date ≤ reference_time` and the region's official language matches that of `origin_country` |

> **Note:** At `reference_time` (the first chart day), many diffusion features have limited variation; for example, `num_regions_entered_so_far` is often 1. They still capture debut strength. `is_viral50_flag` adds a distinct signal: organic buzz rather than raw streaming volume.

> **Note:** `streams_sum_so_far` is borderline, because first-day streams may partially reflect which markets are already engaged. Keep it, but flag it in the report and test the model with and without it.
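
A minimal sketch of several diffusion state features computed in one pass is shown below, using Python's built-in `sqlite3` and invented rows. Restricting the region count, rank, and stream sum to Top 200 rows is an assumption made here; the table above does not state which chart those aggregates should use.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE spotify_data (track_id TEXT, region TEXT, date TEXT, "
    "chart TEXT, rank INTEGER, streams INTEGER)"
)
con.executemany(
    "INSERT INTO spotify_data VALUES (?, ?, ?, ?, ?, ?)",
    [  # hypothetical rows
        ("t1", "US", "2021-01-01", "top200", 40, 500000),
        ("t1", "CA", "2021-01-01", "top200", 120, 60000),
        ("t1", "US", "2020-12-30", "viral50", 10, None),
        ("t1", "DE", "2021-01-04", "top200", 15, 300000),  # future: excluded
    ],
)

diffusion = con.execute(
    """
    WITH ref AS (
        SELECT track_id, MIN(date) AS reference_time
        FROM spotify_data
        WHERE chart = 'top200'
        GROUP BY track_id
    )
    SELECT s.track_id,
           COUNT(DISTINCT CASE WHEN s.chart = 'top200' THEN s.region END)
               AS num_regions_entered_so_far,
           MIN(CASE WHEN s.chart = 'top200' THEN s.rank END)
               AS best_rank_so_far,
           SUM(CASE WHEN s.chart = 'top200' THEN s.streams END)
               AS streams_sum_so_far,
           MAX(CASE WHEN s.chart = 'viral50' THEN 1 ELSE 0 END)
               AS is_viral50_flag
    FROM spotify_data s
    JOIN ref r ON r.track_id = s.track_id
    WHERE s.date <= r.reference_time   -- nothing after the reference point
    GROUP BY s.track_id
    """
).fetchone()

print(diffusion)
```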

---

### Category E — Target Country Features

These describe structural properties of the destination market.

| Feature | Description | Source Columns | How to Build |
|---------|-------------|----------------|--------------|
| `target_country_population` | Market size proxy | `country_population` | Join on `target_country` |
| `target_country_continent` | Geographic region | `country_continent` | Join |
| `target_country_language` | Official language(s) | `country_official_language` | Join |
| `target_country_chart_turnover_rate` | Market volatility | `trend`, `region`, `date` | Fraction of `NEW_ENTRY` rows in a trailing window ending at `reference_time` |
| `target_country_total_streams_recent` | Market activity level | `streams`, `region`, `date` | `SUM(streams)` over the same trailing window |
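
The turnover rate could be computed along the following lines. This is only a sketch on Python's built-in `sqlite3` with invented rows: the 28-day window length (`WINDOW_DAYS`) is a placeholder, since the spec does not fix it, and every `trend` value other than `NEW_ENTRY` is assumed for illustration.

```python
import sqlite3

WINDOW_DAYS = 28  # hypothetical window length; not specified in this document

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE spotify_data (region TEXT, date TEXT, trend TEXT)")
con.executemany(
    "INSERT INTO spotify_data VALUES (?, ?, ?)",
    [  # hypothetical rows
        ("DE", "2020-11-01", "NEW_ENTRY"),      # before the window: excluded
        ("DE", "2020-12-20", "NEW_ENTRY"),
        ("DE", "2020-12-25", "MOVE_UP"),
        ("DE", "2020-12-28", "NEW_ENTRY"),
        ("DE", "2020-12-31", "SAME_POSITION"),
        ("DE", "2021-01-02", "NEW_ENTRY"),      # after reference_time: excluded
    ],
)

target_country, reference_time = "DE", "2021-01-01"
turnover_rate = con.execute(
    """
    SELECT AVG(CASE WHEN trend = 'NEW_ENTRY' THEN 1.0 ELSE 0.0 END)
    FROM spotify_data
    WHERE region = ?
      AND date <= ?
      AND date >= date(?, ?)   -- SQLite date arithmetic, e.g. '-28 days'
    """,
    (target_country, reference_time, reference_time, f"-{WINDOW_DAYS} days"),
).fetchone()[0]

print(turnover_rate)
```

Bounding the window at `reference_time` keeps this market-level feature consistent with the leakage rules in Section 5.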

---

### Category F — Geographic & Cultural Relationship Features

These capture proximity between the song's origin and the target country.

| Feature | Description | Source Columns | How to Build |
|---------|-------------|----------------|--------------|
| `same_language_flag` | Song language matches target country language | `song_language`, `country_official_language` | Compare song language and target language |
| `same_continent_flag` | Shared continent | `country_continent` | Compare origin and target continent |
| `artist_prior_success_in_target_country` | Number of the artist's tracks that charted in this target country before `reference_time` | `artist`, `region`, `date` | `COUNT(DISTINCT track_id) WHERE region = target_country AND date < reference_time` |
| `neighbor_country_entered_flag` | Neighbor diffusion signal | `cultural_top5_targets` | Check if culturally close country already entered |
| `cultural_distance_origin_target` | Cultural distance measure | cultural distance table | Join origin and target country |
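
The two comparison flags can be sketched as a self-join on a country lookup table. The `country_info` table, its columns, and its rows are hypothetical, and each country is given a single official language for simplicity, which real data (for example Canada) would not satisfy. The example uses Python's built-in `sqlite3`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical lookup table; one language per country is a simplification.
con.execute("CREATE TABLE country_info (country TEXT, continent TEXT, language TEXT)")
con.executemany(
    "INSERT INTO country_info VALUES (?, ?, ?)",
    [
        ("US", "North America", "en"),
        ("CA", "North America", "en"),
        ("GB", "Europe", "en"),
        ("DE", "Europe", "de"),
    ],
)

flags = con.execute(
    """
    SELECT t.country AS target_country,
           CASE WHEN t.continent = o.continent THEN 1 ELSE 0 END
               AS same_continent_flag,
           CASE WHEN t.language = :song_language THEN 1 ELSE 0 END
               AS same_language_flag
    FROM country_info t
    JOIN country_info o ON o.country = :origin   -- origin country's row
    WHERE t.country <> :origin
    ORDER BY t.country
    """,
    {"origin": "US", "song_language": "en"},
).fetchall()

print(flags)
```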

---

### Category G — Temporal Features

These capture seasonal timing effects.

| Feature | Description | Source Columns | How to Build |
|---------|-------------|----------------|--------------|
| `reference_month` | Month of reference_time | `reference_time` | `EXTRACT(MONTH FROM reference_time)` |
| `reference_day_of_week` | Day of week of first chart appearance | `reference_time` | `EXTRACT(DOW FROM reference_time)` |
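
Day-of-week encodings differ between tools, so it is worth pinning the convention down. The sketch below uses Python's built-in `sqlite3`, whose `strftime('%w', ...)` counts from Sunday = 0; the example date is arbitrary.

```python
import sqlite3

con = sqlite3.connect(":memory:")
reference_time = "2021-01-01"  # a Friday

# %m gives the month, %w the day of week with Sunday = 0 ... Saturday = 6.
reference_month, reference_day_of_week = con.execute(
    "SELECT CAST(strftime('%m', ?) AS INTEGER), CAST(strftime('%w', ?) AS INTEGER)",
    (reference_time, reference_time),
).fetchone()

print(reference_month, reference_day_of_week)
```

Python's own `datetime.date.weekday()` counts from Monday = 0 and would return 4 for the same date. DuckDB's `EXTRACT(DOW ...)` is understood to follow the Sunday = 0 convention as well, but that should be confirmed against its documentation before mixing sources.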

---

## 5. Leakage Prevention Rules

Strict rules must be followed:

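- Only use data where `date ≤ reference_time` (artist-level features use the stricter `date < reference_time`)
- Never use the track's rank in `target_country`: it only exists once the track has entered, so it reveals the outcome
- Never use the track's streams in `target_country`, for the same reason
- Never use Spotify `popularity` score (algorithmically updated, reflects current performance)
- Never use any future information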

This ensures realistic prediction conditions.
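
These rules lend themselves to an automated check before training. The helper below is a hypothetical sketch (the function name and row layout are not part of the spec) that reports any feature-source row dated after its track's `reference_time`.

```python
from datetime import date


def find_leaky_rows(feature_rows, reference_times):
    """Return the rows whose observation date lies after the track's reference_time."""
    return [
        row
        for row in feature_rows
        if row["date"] > reference_times[row["track_id"]]
    ]


# Hypothetical inputs: one track with a reference_time of 2021-01-01.
reference_times = {"t1": date(2021, 1, 1)}
feature_rows = [
    {"track_id": "t1", "region": "US", "date": date(2021, 1, 1)},
    {"track_id": "t1", "region": "US", "date": date(2020, 12, 30)},
    {"track_id": "t1", "region": "DE", "date": date(2021, 1, 4)},  # future row
]

leaky = find_leaky_rows(feature_rows, reference_times)
print(leaky)
```

An empty result is the expected state; any returned row points at a feature query whose date filter needs tightening.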

---

## 6. Final Training Dataset Structure

Each row contains:

```
track_id
target_country
reference_time
origin_country

[all engineered features from Categories B–G]

did_enter_target_country          ← classification target
days_to_enter_target_country      ← regression target
```

Estimated dataset size:

~10,000–20,000 unique tracks × 35 countries ≈ 350,000–700,000 rows

DuckDB will be used to construct and query this dataset efficiently.