# ✅ Step 4: Column Homogeneity Checks and Data Preparation

---

## Q1. Is the data homogeneous in each column?

Yes. All columns across the MusicBrainz and TMDb tables were evaluated using both Python and SQL-based homogeneity checks. Results are documented in:

- `Capstone_Step_4_Analysis.ipynb`

After cleaning, every column was consistent in data type. Outliers and mixed-type values found during the checks were identified and remediated.
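
As a rough illustration of what a per-column homogeneity check can look like, the sketch below labels each raw cell with a coarse type and reports whether all non-null cells agree. The function names and the sample values are hypothetical and are not taken from the notebook; the only detail borrowed from this project is the `\N` null marker.

```python
from collections import Counter

NULL_MARKER = r"\N"  # null marker named in the cleaning notes


def classify(value):
    """Return a coarse type label ("null", "int", "float", "text") for one raw cell."""
    if value is None or value == "" or value == NULL_MARKER:
        return "null"
    text = str(value).strip()
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "text"


def column_profile(values):
    """Count how many cells of each coarse type a column holds."""
    return Counter(classify(v) for v in values)


def is_homogeneous(values):
    """True when all non-null cells share a single type label."""
    profile = column_profile(values)
    profile.pop("null", None)
    return len(profile) <= 1


# Hypothetical sample columns, as they might be read from a TSV file.
release_year = ["1994", "2001", r"\N", "1977"]
mixed_column = ["1994", "unknown", "2001"]

print(column_profile(release_year))
print(is_homogeneous(release_year))  # True
print(is_homogeneous(mixed_column))  # False
```

The same functions can be applied to a pandas column by passing `df[column_name]` as `values`.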

---

## Q2. How do you anticipate this data will be used by data analysts and scientists?

This dataset enables:
- Genre-based trend analysis on movie soundtracks
- Popularity scoring by genre, year, or media type
- Machine learning applications for popularity or genre prediction
- Linking of movie data to soundtrack metadata for downstream modeling

---

## Q3. What does this tell you about how the data should be stored?

- **PostgreSQL** was used for relational modeling, joins, and indexing.
- **Parquet** format may be used for long-term archival, fast I/O, and external analysis.
- Columnar storage (Parquet) is optimal for slicing data by genre, year, or popularity.
- Foreign-key-based normalization keeps storage efficient and reduces duplication.
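
To make the normalization point concrete, here is a minimal sketch of a movie/genre link table with foreign keys. It uses Python's standard-library `sqlite3` as a stand-in for PostgreSQL so it runs without a database server, and the table names, column names, and rows are illustrative assumptions rather than the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Hypothetical schema: each genre name is stored once and linked by key.
conn.executescript("""
CREATE TABLE movie (tmdb_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE movie_genre (
    tmdb_id  INTEGER NOT NULL REFERENCES movie(tmdb_id),
    genre_id INTEGER NOT NULL REFERENCES genre(genre_id),
    PRIMARY KEY (tmdb_id, genre_id)
);
""")

# Placeholder rows for illustration only.
conn.executemany("INSERT INTO movie VALUES (?, ?)",
                 [(1, "Example Movie A"), (2, "Example Movie B")])
conn.executemany("INSERT INTO genre VALUES (?, ?)",
                 [(1, "Drama"), (2, "Comedy")])
conn.executemany("INSERT INTO movie_genre VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])

# Slice by genre through the link table.
rows = conn.execute("""
    SELECT g.name, COUNT(*) AS n
    FROM movie_genre AS mg
    JOIN genre AS g ON g.genre_id = mg.genre_id
    GROUP BY g.name
    ORDER BY n DESC, g.name
""").fetchall()
print(rows)  # [('Drama', 2), ('Comedy', 1)]

# A link to a movie that does not exist is rejected by the foreign key.
try:
    conn.execute("INSERT INTO movie_genre VALUES (?, ?)", (999, 1))
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

Storing genre names once and referencing them by key is what removes the duplication; the foreign keys additionally stop orphaned link rows from being inserted.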

---

## Q4. What cleaning steps did you perform?

- Cleaned `.tsv` files using `02_mb_cleanse_tsv_files.py`
- Audited structure using `01_mb_audit_raw_files.py` and `03_util_inspect_fieldnames.py`
- Normalized title casing and removed noise terms using `clean_title()`
- Dropped or flagged soundtrack titles with mojibake, overly generic names, or duplicates
- Standardized nulls to the `\N` marker and normalized files to UTF-8 encoding
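
The title-normalization step can be sketched roughly as follows. This is not the project's `clean_title()`: the function name `clean_title_sketch`, the noise-term list, and the choice to lowercase are all assumptions made for illustration.

```python
import re

# Hypothetical noise terms; the project's real list lives in its own clean_title().
NOISE_TERMS = (
    "original motion picture soundtrack",
    "motion picture soundtrack",
    "original soundtrack",
    "soundtrack",
    "ost",
)

# Longer phrases come first so they win over their shorter substrings.
_NOISE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in NOISE_TERMS) + r")\b",
    re.IGNORECASE,
)


def clean_title_sketch(title):
    """Strip noise terms and bracket/colon/dash punctuation, then lowercase."""
    text = _NOISE_RE.sub(" ", title)
    text = re.sub(r"[\(\)\[\]:\-–]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


print(clean_title_sketch("The Example: Original Soundtrack"))  # the example
print(clean_title_sketch("Lost Example (OST)"))  # lost example
```

The word-boundary anchors keep a short term such as `ost` from being removed out of the middle of a longer word like "Lost".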

---

## Q5. What wrangling did you perform for enrichment?

- Fetched metadata using `06_tmdb_fetch_movies.py` and `07_tmdb_enrich_movies.py`
- Used `08_tmdb_enrich_afi.py` to manually enrich the AFI 100 titles and validate the enrichment logic
- Normalized genres using `09_tmdb_normalize_genres.py`
- Soundtrack-to-movie matches were made using `10_match_fuzzy_link_soundtracks.py` via:
  - Manual match
  - Substring match
  - Alt-title + fuzzy match (with RapidFuzz)

All enrichment steps are documented and reproducible in the pipeline.
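
The three matching strategies listed above can be pictured as a cascade that stops at the first success. The sketch below is an assumption about how such a cascade might be ordered, not the logic of `10_match_fuzzy_link_soundtracks.py`. To stay dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer where the pipeline uses RapidFuzz (for example `rapidfuzz.fuzz.ratio`), and the threshold, function names, and sample titles are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """0-100 similarity score; stand-in for a RapidFuzz scorer."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def match_soundtrack(title, movies, manual=None, threshold=90.0):
    """Return (tmdb_id, method) from the first strategy that succeeds.

    `movies` maps tmdb_id -> list of normalized titles, primary title first,
    followed by any alternative titles. Returns (None, "unmatched") on failure.
    """
    manual = manual or {}

    # 1. Manual match: a hand-curated override wins outright.
    if title in manual:
        return manual[title], "manual"

    # 2. Substring match: the primary movie title appears in the soundtrack title.
    for tmdb_id, titles in movies.items():
        if titles[0] in title:
            return tmdb_id, "substring"

    # 3. Alt-title + fuzzy match: best score over all known titles.
    best_id, best_score = None, 0.0
    for tmdb_id, titles in movies.items():
        for candidate in titles:
            score = similarity(title, candidate)
            if score > best_score:
                best_id, best_score = tmdb_id, score
    if best_score >= threshold:
        return best_id, "fuzzy"
    return None, "unmatched"


# Hypothetical, already-normalized titles.
movies = {
    101: ["the example"],
    102: ["night train", "le train de nuit"],
}

print(match_soundtrack("odd title", movies, manual={"odd title": 101}))
print(match_soundtrack("the example deluxe edition", movies))
print(match_soundtrack("le train de nuitt", movies))
print(match_soundtrack("zzz", movies))
```

Recording the method alongside the ID makes it possible to audit how many links came from each strategy and to review the fuzzy matches separately.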

---


## 📊 Entity-Relationship Diagram (ERD)

This ERD represents the final schema used for the Capstone project, including both:
- MusicBrainz raw and filtered datasets
- TMDb-enriched metadata and normalized genre tables

It reflects table relationships, primary keys, and join strategies used in the enrichment and analysis process.

![ERD](Step_4_ERD.png)


## 📁 Final Notebook + Scripts Overview

- `Capstone_Step_4_Analysis.ipynb`: Python + SQL column inspection and summary of 10 tables
- `04c_step4_wrapup.ipynb`: Q1–Q5 responses, ERD, storage plan (this file)
- Scripts: `02` → `10` form a clean ETL pipeline
- Utility scripts: `01_mb_audit_raw_files.py`, `99_util_clean_title.py`
- ERD image: `images/ERD.png`

Ready for SQL modeling, feature engineering, or analysis.

---


### 🎯 Match Rate Analysis

```python
import pandas as pd

# Load the matched and unmatched soundtrack lists
matched = pd.read_csv("matched_top_1000.tsv", sep="\t")
unmatched = pd.read_csv("unmatched_top_1000.tsv", sep="\t")

# Share of all candidate soundtracks that were linked to a movie
match_rate = len(matched) / (len(matched) + len(unmatched))
print(f"✅ Match rate: {match_rate:.2%}")
```

### 📊 Top Genres in Matched Soundtracks

```python
import matplotlib.pyplot as plt
import pandas as pd

# Reuses `matched` from the match-rate cell above

# Load genre data
genres = pd.read_csv("tmdb_genre_top_1000.csv")

# Join genres onto the matched soundtracks by TMDb ID
matched_genres = matched.merge(genres, on="tmdb_id")

# Plot top 10 genres
genre_counts = matched_genres["genre"].value_counts().head(10)
genre_counts.plot(kind="bar", title="Top Genres in Matched Soundtracks")
plt.ylabel("Match Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```