# Step 4: Data Homogeneity Review

This notebook assesses key fields from the MusicBrainz dataset to determine whether each one is homogeneous enough to be suitable for downstream use in the Capstone Project.

Each test runs a SQL query that returns the ten most frequent values of a field. The results are used to assess:
- Data types and integrity
- Redundancy or inconsistency
- Analytical value
- Relevance to soundtrack-matching logic


In [None]:
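# Assumes `engine` is an existing database connection (e.g. a SQLAlchemy engine).
import pandas as pd

pd.read_sql("""
SELECT "gender", COUNT(*) as count
FROM artist
GROUP BY "gender"
ORDER BY count DESC
LIMIT 10;
""", engine)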

### `artist.gender`

- **Type:** `Smallint`
- **Assessment:** Very high null rate. Extended codes are not clearly documented.
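
The null rate can be measured directly instead of being read off the top-10 listing. The following is a minimal sketch, not the query used in this review: `null_rate` is a hypothetical helper, and the in-memory SQLite table with invented rows only stands in for `artist` so the example runs on its own. The same `SELECT` could be passed to `pd.read_sql` with the notebook's `engine`.

```python
import sqlite3


def null_rate(conn, table, column):
    """Return (total_rows, null_rows, null_fraction) for one column."""
    # COUNT(column) skips NULLs, so the difference from COUNT(*) is the NULL count.
    # Identifiers cannot be bound as parameters, so they are quoted inline;
    # only pass trusted table and column names.
    sql = f'SELECT COUNT(*), COUNT(*) - COUNT("{column}") FROM "{table}"'
    total, nulls = conn.execute(sql).fetchone()
    return total, nulls, (nulls / total if total else 0.0)


# Toy stand-in for the artist table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, gender INTEGER)")
conn.executemany(
    "INSERT INTO artist (gender) VALUES (?)",
    [(1,), (2,), (None,), (None,), (None,)],
)

total, nulls, fraction = null_rate(conn, "artist", "gender")
print(f"{nulls}/{total} rows are NULL ({fraction:.0%})")
```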


In [None]:
pd.read_sql("""
SELECT "name", COUNT(*) as count
FROM artist
GROUP BY "name"
ORDER BY count DESC
LIMIT 10;
""", engine)

### `artist.name`

- **Type:** `Text`
- **Assessment:** High redundancy. The same name (e.g., Indigo) occurs hundreds of times, so a name alone does not identify an artist.
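
One way to make the redundancy concrete is to count how many distinct ids share each name. This is a sketch under assumptions: the in-memory SQLite table and its rows are invented, and only the `id` and `name` columns of `artist` are modeled.

```python
import sqlite3

# Toy stand-in for the artist table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO artist (id, name) VALUES (?, ?)",
    [(1, "Indigo"), (2, "Indigo"), (3, "Indigo"), (4, "Unique Band")],
)

# Names that map to more than one artist id cannot be matched on name alone.
ambiguous = conn.execute("""
    SELECT name, COUNT(DISTINCT id) AS ids
    FROM artist
    GROUP BY name
    HAVING COUNT(DISTINCT id) > 1
    ORDER BY ids DESC
""").fetchall()

for name, ids in ambiguous:
    print(f"{name!r} is shared by {ids} artist ids")
```

Any name returned here needs a second key (the artist `id`, or additional context) before it can be linked reliably.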


In [None]:
pd.read_sql("""
SELECT "type", COUNT(*) as count
FROM release_group
GROUP BY "type"
ORDER BY count DESC
LIMIT 10;
""", engine)

### `release_group.type`

- **Type:** `Smallint`
- **Assessment:** Clean coded field, modest nulls.


In [None]:
pd.read_sql("""
SELECT "name", COUNT(*) as count
FROM release_group
GROUP BY "name"
ORDER BY count DESC
LIMIT 10;
""", engine)

### `release_group.name`

- **Type:** `Text`
- **Assessment:** Common names like 'Greatest Hits' make this field ambiguous.


In [None]:
pd.read_sql("""
SELECT "barcode", COUNT(*) as count
FROM release
GROUP BY "barcode"
ORDER BY count DESC
LIMIT 10;
""", engine)

### `release.barcode`

- **Type:** `Text`
- **Assessment:** Mostly nulls. Many duplicates.
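
A top-10 listing lumps missing values and repeated barcodes together. The sketch below separates them. It assumes the column may hold empty strings as well as NULLs, which is worth checking against the real table; the SQLite table and its rows are invented so the example runs standalone.

```python
import sqlite3

# Toy stand-in for the release table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE release (id INTEGER PRIMARY KEY, barcode TEXT)")
conn.executemany(
    "INSERT INTO release (barcode) VALUES (?)",
    [(None,), (None,), ("",), ("0001",), ("0001",), ("0002",)],
)

# Count NULLs, empty strings, and distinct non-empty barcodes in one pass.
null_rows, empty_rows, distinct_codes = conn.execute("""
    SELECT
        SUM(CASE WHEN barcode IS NULL THEN 1 ELSE 0 END),
        SUM(CASE WHEN barcode = '' THEN 1 ELSE 0 END),
        COUNT(DISTINCT NULLIF(barcode, ''))
    FROM release
""").fetchone()

# Barcodes that appear on more than one release.
shared = conn.execute("""
    SELECT barcode, COUNT(*) AS n
    FROM release
    WHERE barcode IS NOT NULL AND barcode <> ''
    GROUP BY barcode
    HAVING COUNT(*) > 1
    ORDER BY n DESC
""").fetchall()

print(null_rows, empty_rows, distinct_codes, shared)
```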


In [None]:
pd.read_sql("""
SELECT "language", COUNT(*) as count
FROM release
GROUP BY "language"
ORDER BY count DESC
LIMIT 10;
""", engine)

### `release.language`

- **Type:** `Smallint`
- **Assessment:** Dominated by code 120. The codes are not self-describing and must be decoded by joining to the language lookup table.
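
The decoding join could look like the sketch below. It assumes a lookup table named `language` with `id` and `name` columns; the toy rows, including the id-to-name pairs, are illustrative and should be checked against the real lookup table. SQLite is used only so the example runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins; table shapes and rows are assumptions for illustration.
    CREATE TABLE language (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE release (id INTEGER PRIMARY KEY, language INTEGER);
    INSERT INTO language (id, name) VALUES (120, 'English'), (198, 'Japanese');
    INSERT INTO release (language) VALUES (120), (120), (120), (198), (NULL);
""")

# LEFT JOIN keeps releases whose language is NULL or has no lookup row.
decoded = conn.execute("""
    SELECT r.language AS code, l.name AS language_name, COUNT(*) AS n
    FROM release AS r
    LEFT JOIN language AS l ON l.id = r.language
    GROUP BY r.language, l.name
    ORDER BY n DESC
""").fetchall()

for code, language_name, n in decoded:
    print(code, language_name, n)
```

A `LEFT JOIN` is used rather than an inner join so that undecodable or missing codes stay visible in the distribution instead of silently dropping out.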


In [None]:
pd.read_sql("""
SELECT "last_updated", COUNT(*) as count
FROM release
GROUP BY "last_updated"
ORDER BY count DESC
LIMIT 10;
""", engine)

### `release.last_updated`

- **Type:** `Timestamp`
- **Assessment:** High cardinality. Some batch timestamps repeat.
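
Grouping on the raw timestamp yields almost one group per row, so bucketing by day gives a more readable picture of when batch updates happened. This is a sketch with invented rows in SQLite, where `date()` truncates to the day; on PostgreSQL the equivalent would be `date_trunc('day', last_updated)`.

```python
import sqlite3

# Toy stand-in for the release table; the timestamps are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE release (id INTEGER PRIMARY KEY, last_updated TEXT)")
conn.executemany(
    "INSERT INTO release (last_updated) VALUES (?)",
    [
        ("2012-05-15 19:01:58",),
        ("2012-05-15 19:01:58",),
        ("2012-05-15 20:30:00",),
        ("2020-01-02 08:00:00",),
    ],
)

# Collapse timestamps to calendar days before counting.
per_day = conn.execute("""
    SELECT date(last_updated) AS day, COUNT(*) AS n
    FROM release
    GROUP BY day
    ORDER BY n DESC
""").fetchall()

for day, n in per_day:
    print(day, n)
```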


In [None]:
pd.read_sql("""
SELECT "name", COUNT(*) as count
FROM artist_credit
GROUP BY "name"
ORDER BY count DESC
LIMIT 10;
""", engine)

### `artist_credit.name`

- **Type:** `Text`
- **Assessment:** Similar to `artist.name`. Use it together with the ID for accurate linkage.


In [None]:
pd.read_sql("""
SELECT "secondary_type", COUNT(*) as count
FROM release_group_secondary_type_join
GROUP BY "secondary_type"
ORDER BY count DESC
LIMIT 10;
""", engine)

### `release_group_secondary_type_join.secondary_type`

- **Type:** `Smallint`
- **Assessment:** Clean lookup values. Critical for soundtrack detection.
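
Soundtrack detection with this field could be sketched as a join from the release group through the join table to the lookup table. The column name `release_group` on the join table and the `id` columns are assumptions about the schema, and all rows are invented; SQLite is used only so the example runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins; column names and rows are assumptions for illustration.
    CREATE TABLE release_group (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE release_group_secondary_type (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE release_group_secondary_type_join (
        release_group INTEGER,
        secondary_type INTEGER
    );
    INSERT INTO release_group VALUES
        (1, 'Film Score A'), (2, 'Greatest Hits'), (3, 'Live Film Score B');
    INSERT INTO release_group_secondary_type VALUES
        (1, 'Compilation'), (2, 'Soundtrack'), (6, 'Live');
    INSERT INTO release_group_secondary_type_join VALUES
        (1, 2), (2, 1), (3, 2), (3, 6);
""")

# Filter on the decoded name rather than a hard-coded numeric code.
soundtracks = conn.execute("""
    SELECT rg.id, rg.name
    FROM release_group AS rg
    JOIN release_group_secondary_type_join AS j ON j.release_group = rg.id
    JOIN release_group_secondary_type AS t ON t.id = j.secondary_type
    WHERE t.name = 'Soundtrack'
    ORDER BY rg.id
""").fetchall()

print(soundtracks)
```

A release group can carry several secondary types, so release group 3 above still qualifies even though it is also tagged as live.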


In [None]:
pd.read_sql("""
SELECT "name", COUNT(*) as count
FROM release_group_secondary_type
GROUP BY "name"
ORDER BY count DESC
LIMIT 10;
""", engine)

### `release_group_secondary_type.name`

- **Type:** `Text`
- **Assessment:** Clean lookup table. Each name maps one-to-one to a type code.


---

## ✅ Conclusion

This audit confirms that while the MusicBrainz schema is well-structured, many fields are:
- Heavily null-filled
- Ambiguous without a lookup table
- Repetitive due to open community entry

Only a few fields (like `secondary_type` or normalized IDs) are safe to use for core logic.

This homogeneity review informed which fields were ultimately used in the fuzzy matching pipeline and soundtrack enrichment strategies.

