### Title Akas

Exploring and Profiling data


## Cleaned table - transformation and standardisation

- standardise field names (snake_case)
- replace market-rollup region codes with readable region names

- cast `isOriginalTitle` to BOOLEAN



In [0]:
  SELECT
    titleId AS id_number,
    ordering,
    title AS primary_title,

    -- Replace market rollups and other non-standard region codes with readable names
    CASE region
      WHEN 'XWW' THEN 'International'
      WHEN 'XNA' THEN 'North America'
      WHEN 'XEU' THEN 'Europe'
      WHEN 'XAF' THEN 'Africa'
      WHEN 'XAS' THEN 'Asia'
      WHEN 'XSA' THEN 'South America'
      WHEN 'XAU' THEN 'Australasia'
      WHEN 'XKO' THEN 'Korea'
      WHEN 'XPI' THEN 'Palestinian Territory'
      WHEN 'XKX' THEN 'Kosovo'
      WHEN 'XSI' THEN 'Other'
      ELSE region
    END AS region,

    language,
    types,
    attributes,
    CAST(isOriginalTitle AS BOOLEAN) AS is_original_title

  FROM imdb.raw.title_akas

## Data Validation Checks  



In [0]:
WITH

-- Base table
  base_table AS (
    SELECT * FROM imdb.raw.title_akas
  ),

-- Row count
  row_count AS (
    SELECT COUNT(*) AS row_count
    FROM base_table
  ),

-- Distinct counts
  distinct_count_titleId AS (
    SELECT COUNT(DISTINCT titleId) AS distinct_count_titleId
    FROM base_table
  ),                   -- lower than the row count: the same title released in different regions has one row per release

  distinct_count_isOriginalTitle AS (
    SELECT COUNT(DISTINCT isOriginalTitle) AS distinct_count_isOriginalTitle
    FROM base_table
  ),                   -- only two values (0 and 1)

-- Distinct values of each field
  distinct_value_region AS (
    SELECT DISTINCT region FROM base_table
  ),                   -- many region codes are longer than 2 characters; needs investigating

  distinct_value_language AS (
    SELECT DISTINCT language FROM base_table
  ),

-- NULL counts
  null_count_language AS (
    SELECT COUNT(*) AS null_count_language FROM base_table WHERE language IS NULL
  ),                   -- a large share of rows are NULL

  null_count_attributes AS (
    SELECT COUNT(*) AS null_count_attributes FROM base_table WHERE attributes IS NULL
  ),                   -- a large share of rows are NULL

-- isOriginalTitle check
  isOriginalTitle_check AS (
    SELECT * FROM base_table WHERE isOriginalTitle = 1
  )                    -- isOriginalTitle = 1 iff (if and only if) ordering = 1

-- Change the CTE name below to inspect a different check
SELECT * FROM isOriginalTitle_check

In [0]:
DESCRIBE imdb.staging.title_akas

In [0]:
-- Investigating regions with letter count > 2
SELECT DISTINCT region FROM imdb.raw.title_akas WHERE LEN(region) > 2
-- 20 unique regions with character count > 2

Region is the country or area in which the title was released. Region codes longer than two characters fall into three groups:

1) Countries that no longer exist, kept for historical accuracy
- SUHH: Soviet Union (USSR)
- CSHH: Czechoslovakia
- XYU / YUCS: Yugoslavia (Federal and Socialist Republics)
- VDVN: North Vietnam
- BUMM: Burma (modern Myanmar)
- CSXX: Serbia and Montenegro
- DDDE: East Germany (GDR)
- ZRCD: Zaire (modern DRC)
- XWG: West Germany (pre-unification Germany)

2) XSI
- Only 7 entries with XSI as region code
- Replace with "Other"

3) Market Rollups - categorical groupings for multiple regions aggregated into a single region code
- XWW: Replace with International
- XNA: Replace with North America
- XEU: Replace with Europe
- XAF: Replace with Africa
- XAS: Replace with Asia
- XSA: Replace with South America
- XAU: Replace with Australasia
- XKO: Replace with Korea
- XPI: Replace with Palestinian Territory
- XKX: Replace with Kosovo
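
The replacements listed above could also be kept in a small lookup table rather than a long CASE expression. The following is a minimal sketch, using Python's built-in `sqlite3` as a stand-in for the warehouse. The `region_lookup` table, the `clean_regions` helper and the sample rows are hypothetical and only illustrate the join; this is not the notebook's actual implementation.

```python
import sqlite3

# Region codes and replacement names taken from the lists above.
REGION_NAMES = {
    "XWW": "International",
    "XNA": "North America",
    "XEU": "Europe",
    "XAF": "Africa",
    "XAS": "Asia",
    "XSA": "South America",
    "XAU": "Australasia",
    "XKO": "Korea",
    "XPI": "Palestinian Territory",
    "XKX": "Kosovo",
    "XSI": "Other",
}


def clean_regions(rows):
    """Return (titleId, region) pairs with the listed codes replaced.

    `rows` is an iterable of (titleId, region) tuples; it is hypothetical
    sample data standing in for imdb.raw.title_akas.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE title_akas (titleId TEXT, region TEXT)")
    con.execute("CREATE TABLE region_lookup (code TEXT PRIMARY KEY, name TEXT)")
    con.executemany("INSERT INTO title_akas VALUES (?, ?)", rows)
    con.executemany("INSERT INTO region_lookup VALUES (?, ?)", REGION_NAMES.items())
    # LEFT JOIN keeps every row; COALESCE falls back to the original value
    # (including NULL) when the code has no lookup entry.
    result = con.execute(
        """
        SELECT a.titleId, COALESCE(l.name, a.region) AS region
        FROM title_akas AS a
        LEFT JOIN region_lookup AS l ON a.region = l.code
        ORDER BY a.rowid
        """
    ).fetchall()
    con.close()
    return result


sample = [
    ("tt0000001", "XWW"),
    ("tt0000001", "US"),
    ("tt0000002", None),
    ("tt0000003", "XSI"),
]
cleaned = clean_regions(sample)
```

With a lookup table, adding or changing a replacement becomes a data change rather than a query change. Two-letter codes and NULLs pass through untouched, the same as the `ELSE region` branch.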

In [0]:
-- Language field investigation: large proportion of NULLs
  SELECT * FROM imdb.raw.title_akas WHERE language IS NOT NULL
    -- Non-null values likely belong to multilingual regions, where the title's language needs to be stated explicitly

In [0]:
-- Attributes field investigation: large null proportion
  SELECT 
    COUNT(*) AS total_rows,
    COUNT(attributes) AS non_null_counts,
     (COUNT(*) - COUNT(attributes)) / COUNT(*) AS null_proportion
  FROM imdb.raw.title_akas

-- Over 99% of the 'attributes' field is NULL.
-- Drop this column in the staging -> transformed notebook
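
The null-proportion calculation above can be tried on a tiny in-memory table. This is a sketch only: it uses Python's built-in `sqlite3` rather than the warehouse engine, and the `null_proportion` helper, the table `t` and the sample values are hypothetical. SQLite truncates when dividing two integers, hence the `* 1.0`; whether that is needed depends on the engine.

```python
import sqlite3


def null_proportion(values):
    """Return (total_rows, non_null_count, null_proportion) for one column."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (attributes TEXT)")
    con.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    row = con.execute(
        """
        SELECT
          COUNT(*) AS total_rows,
          COUNT(attributes) AS non_null_count,  -- COUNT(col) skips NULLs
          -- multiply by 1.0 so SQLite does not use integer division
          (COUNT(*) - COUNT(attributes)) * 1.0 / COUNT(*) AS null_proportion
        FROM t
        """
    ).fetchone()
    con.close()
    return row


# Hypothetical sample: 3 of the 4 values are NULL.
stats = null_proportion([None, None, "literal title", None])
```

Here `stats` holds the total row count, the non-null count and the share of NULLs, in that order.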

In [0]:
-- Investigation: the 'title' field looks similar to the 'primaryTitle' field in imdb.raw.title_basics

-- How many 'primaryTitle' values also appear in the 'title' field of title_akas?
  WITH

  -- number of unique primaryTitle values
  total_count AS (
    SELECT COUNT(DISTINCT primaryTitle) AS total_count_A FROM imdb.raw.title_basics
  ),

  -- number of unique primaryTitle values that also exist in imdb.raw.title_akas
  count_join AS (
    SELECT COUNT(DISTINCT A.primaryTitle) AS count_join
    FROM imdb.raw.title_basics A
      INNER JOIN imdb.raw.title_akas B
      ON A.tconst = B.titleId
      AND TRIM(LOWER(A.primaryTitle)) = TRIM(LOWER(B.title))
  )

  -- proportion of unique primaryTitle values that exist in both tables
  SELECT
    total_count_A,
    count_join,
    (count_join / total_count_A) AS proportion
  FROM total_count
  CROSS JOIN count_join

-- Proportion is close to 100%, so we can rename 'title' -> 'primary_title' to align with imdb.raw.title_basics

**Investigation: ordering and isOriginalTitle fields**

Pattern: (isOriginalTitle = 1) iff (ordering = 1) 
- IMDb usually enters the "Original Title" first  
  

ordering: technical field
  - not needed in the transformed table (delete during staging -> transformed)

isOriginalTitle: semantic field
  - keep this field  
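
The "iff" pattern above can be checked in both directions by counting the rows where the two conditions disagree. A count of zero supports the pattern. This is a minimal sketch using Python's built-in `sqlite3`; the `count_iff_violations` helper and the sample rows are hypothetical, and rows with a NULL in either field are not counted as violations here.

```python
import sqlite3


def count_iff_violations(rows):
    """Count rows where (isOriginalTitle = 1) and (ordering = 1) disagree."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE title_akas (titleId TEXT, ordering INTEGER, isOriginalTitle INTEGER)"
    )
    con.executemany("INSERT INTO title_akas VALUES (?, ?, ?)", rows)
    (violations,) = con.execute(
        """
        SELECT COUNT(*)
        FROM title_akas
        -- true when exactly one of the two conditions holds
        WHERE (isOriginalTitle = 1) <> (ordering = 1)
        """
    ).fetchone()
    con.close()
    return violations


# Hypothetical rows: (titleId, ordering, isOriginalTitle)
consistent = [("tt1", 1, 1), ("tt1", 2, 0), ("tt2", 1, 1)]
inconsistent = consistent + [("tt3", 2, 1), ("tt4", 1, 0)]

ok_count = count_iff_violations(consistent)
bad_count = count_iff_violations(inconsistent)
```

Filtering on `isOriginalTitle = 1` alone only shows one direction of the pattern, so a disagreement count is a stricter test.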



ChatGPT:

"A technical field (like ordering) manages database mechanics, such as sequence or sorting. A semantic field (like isOriginalTitle) conveys real-world meaning, making data interpretable for humans. While often redundant, technical fields ensure system integrity, whereas semantic fields provide the business context necessary for accurate analysis and reporting."  
Dimensional Data Modeling (Kimball)
