# Cleaning CDDB

This notebook is an annotated walkthrough of cleaning the Compact Disc Database (CDDB) dataset.

In [11]:
import logging

import pandas as pd
import pandera as pa

import clean_cddb
from clean_cddb.utils import get_failure_cases_summary_as_formatted_table, get_check_func_descriptions

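def df_to_var(df, var_name):
    """Store `df` in a global variable named `var_name` and return it unchanged.

    Lets a method chain save an intermediate dataframe mid-pipeline.
    """
    globals()[var_name] = df
    return df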

pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(process)d - %(levelname)s - %(message)s",
)
filepath = "../data/input/cddb.tsv"
source_df = pd.read_csv(filepath, sep="\t", dtype="str", encoding='latin1')

## Apply validation checks (pandera schema) and review failure cases

In [12]:
try:
    validated_df = clean_cddb.schema(source_df, lazy=True)
    logging.info("Validation success. No failure cases detected.")
except pa.errors.SchemaErrors as err:
    logging.info("Validation failure. Failure cases detected.")
    logging.debug(err)
    failure_cases_df = err.failure_cases

failure_cases_df = failure_cases_df.pipe(get_check_func_descriptions, clean_cddb.schema)

2023-07-30 15:16:06,624 - 84544 - INFO - Validation failure. Failure cases detected.


`failure_cases_df`
* `failure_cases_df` lists, for each failure case, the column name, the check that failed, the offending value (an example), and the row index of that value in the original dataframe.
* The index supports bulk operations, such as joining to or querying the original dataframe for failure cases, or rejecting every row whose index appears in the set of failure-case indices.
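
As a rough illustration of those bulk operations, the sketch below uses a toy dataframe and a hand-built failure-case table. Both are invented for this example and only mimic the `column`, `check`, `failure_case`, and `index` fields shown in this notebook. The `dropna()` call assumes that some failure cases (for example schema-level ones) may carry no row index.

```python
import pandas as pd

# Toy stand-ins for source_df and failure_cases_df (values are invented).
toy_source_df = pd.DataFrame(
    {
        "artist": ["Backstreet Boys", None, "Emanuel"],
        "id": ["10000", "100001", "100004"],
    }
)
toy_failure_cases_df = pd.DataFrame(
    {
        "column": ["artist", "id"],
        "check": ["not_nullable", "Check that the length of 'id' is 6 characters."],
        "failure_case": [None, "10000"],
        "index": [1, 0],
    }
)

# Row labels that failed at least one check.
failed_idx = toy_failure_cases_df["index"].dropna().astype(int).unique()

# Query the original rows behind the failure cases ...
failing_rows = toy_source_df.loc[failed_idx]

# ... join each failure case to the full source row it came from ...
failing_rows_with_checks = toy_failure_cases_df.merge(
    toy_source_df, left_on="index", right_index=True, how="left"
)

# ... or reject every row that appears in the set of failure-case indices.
passing_rows = toy_source_df.drop(index=failed_idx)
```

Here `passing_rows` keeps only the row that triggered no check, while `failing_rows_with_checks` has one line per failure case with the source columns attached.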

In [13]:
(
    failure_cases_df
    .loc[
        :, ["schema_context", "column", "check", "failure_case", "index"]
    ]
    .sample(10, random_state=1)
)

Unnamed: 0,schema_context,column,check,failure_case,index
5567,Column,title,Check for *possibly* invalid symbols.,UltrasÃÂ³nica,9646
1145,Column,artist,Check for *possibly* invalid symbols.,BjÃÂ¶rk,6862
5766,Column,id,Check that the length of 'id' is 6 characters.,3931,9397
5684,Column,id,Check that the length of 'id' is 6 characters.,10938,8938
1302,Column,artist,Check for *possibly* invalid symbols.,Gilbert MontagnÃÂ©,9211
4822,Column,title,not_nullable,,7705
4431,Column,genre,not_nullable,,8596
4949,Column,title,Check for *possibly* invalid symbols.,Gympa PÃÂ¥,1402
3222,Column,genre,not_nullable,,5074
5216,Column,title,Check for *possibly* invalid symbols.,ÃâÃÂ¥ÃÂ¢ÃÂ®ÃÂ·ÃÂªÃÂ ÃË ÃïÃÂ»ÃÂ±ÃÂ¼,4708


### Summary of failure cases

Here we aggregate the number of failure cases per column and validation check.

In [14]:
failure_cases_summary = (
    failure_cases_df.groupby(["column", "check"], as_index=False)
    .size()
    .sort_values(by=["column", "check"])
    .rename(columns={"size": "counts"})
)

failure_cases_summary

|    | column   | check                                          |   counts |
|---:|:---------|:-----------------------------------------------|---------:|
|  0 | artist   | Check for *possibly* invalid symbols.          |      639 |
|  1 | artist   | Check for invalid artist values.               |      697 |
|  2 | artist   | not_nullable                                   |        1 |
|  3 | category | Check for invalid categories.                  |       89 |
|  4 | genre    | Check for invalid genres.                      |        1 |
|  5 | genre    | not_nullable                                   |     3388 |
|  6 | id       | Check that the length of 'id' is 6 characters. |      477 |
|  7 | title    | Check for *possibly* invalid symbols.          |      747 |
|  8 | title    | not_nullable                                   |        8 |
|  9 | year     | Check if year is numeric.                      |       28 |


A helper function also displays the source code of each check function alongside its summary counts.

In [15]:
# Report summary counts
failure_cases_summary_table = get_failure_cases_summary_as_formatted_table(failure_cases_df)

print(failure_cases_summary_table)

+----------+------------------------------------------------+----------+--------------------------------------------------------------------------------------+
| column   | check                                          |   counts | check_source_code                                                                    |
| artist   | Check for *possibly* invalid symbols.          |      639 | def check_col_has_valid_characters(x: Any) -> bool:                                  |
|          |                                                |          |     """Check for *possibly* invalid symbols."""                                      |
|          |                                                |          |                                                                                      |
|          |                                                |          |     # consider NaNs and floats to be invalid                                         |
|          |                            

# Cleaning step

We can use the same checks from the pandera validation schema to trigger cleaning actions such as:
* do nothing (ignore the value)
* transform the value, e.g., replace it with a substitute such as 'N/A'
* reject the entire record

Here we apply several cleaning functions to `source_df` via `.pipe(Callable)`.
* Each function takes a dataframe and returns a dataframe, so the cleaning operations can be chained.
* Later, we will
  1. compare `source_df` and `clean_df` as a before/after check
  2. re-apply our validation checks (pandera schema) to the new `clean_df` to verify that the transformations improved data quality
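
The three kinds of action and the dataframe-in, dataframe-out convention can be sketched on a toy frame. The function names below (`pad_id`, `fill_missing_genre`, `mark_missing_artist`) are hypothetical and are not the `clean_cddb` implementations, which are not shown in this notebook. For brevity, the marker is written to the `id` field only.

```python
import pandas as pd

REJECT_PREFIX = "REJECT_ROW"  # marker text assumed for this sketch


def pad_id(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: left-pad short ids with zeros to 6 characters."""
    return df.assign(id=df["id"].str.zfill(6))


def fill_missing_genre(df: pd.DataFrame) -> pd.DataFrame:
    """Substitute: replace missing genres with 'N/A'."""
    return df.assign(genre=df["genre"].fillna("N/A"))


def mark_missing_artist(df: pd.DataFrame) -> pd.DataFrame:
    """Reject: flag rows with no artist so they can be dropped later."""
    out = df.copy()
    out.loc[out["artist"].isna(), "id"] = f"{REJECT_PREFIX} - missing artist"
    return out


toy_df = pd.DataFrame(
    {
        "artist": ["Backstreet Boys", None],
        "genre": ["Pop", None],
        "id": ["10000", "100001"],
    }
)

# Each step takes a dataframe and returns a new one, so the steps chain.
toy_clean_df = (
    toy_df
    .pipe(pad_id)
    .pipe(fill_missing_genre)
    .pipe(mark_missing_artist)
)
```

Because every step returns a new dataframe, `toy_df` is left untouched and can still serve as the "before" side of a comparison.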

In [16]:
from clean_cddb.utils import log_df_change

clean_df = (
    source_df

    .pipe(df_to_var, '_df_before').pipe(clean_cddb.clean_df_standardize_various_artists)
    .pipe(log_df_change, before_df=_df_before, operation_label="Cleaning with 'clean_cddb.clean_df_standardize_various_artists' procedure")
        
    .pipe(df_to_var, '_df_before').pipe(clean_cddb.clean_df_try_to_fix_encoding_errors, "artist")
    .pipe(log_df_change, before_df=_df_before, operation_label="Cleaning with 'clean_cddb.clean_df_try_to_fix_encoding_errors' procedure")
    .pipe(df_to_var, 'clean_df_artist_transforms_only')
    
    .pipe(df_to_var, '_df_before').pipe(clean_cddb.clean_df_invalid_symbols)
    .pipe(log_df_change, before_df=_df_before, operation_label="Cleaning with 'clean_cddb.clean_df_invalid_symbols' procedure")
    
    .pipe(df_to_var, '_df_before').pipe(clean_cddb.clean_df_invalid_categories)
    .pipe(log_df_change, before_df=_df_before, operation_label="Cleaning with 'clean_cddb.clean_df_invalid_categories' procedure")
    
    .pipe(df_to_var, '_df_before').pipe(clean_cddb.clean_df_id_zero_padding)
    .pipe(log_df_change, before_df=_df_before, operation_label="Cleaning with 'clean_cddb.clean_df_id_zero_padding' procedure")
    
    .pipe(df_to_var, '_df_before').pipe(clean_cddb.clean_df_genre_invalid) 
    .pipe(log_df_change, before_df=_df_before, operation_label="Cleaning with 'clean_cddb.clean_df_genre_invalid' procedure")
        
    .pipe(df_to_var, '_df_before').pipe(clean_cddb.clean_df_year)
    .pipe(log_df_change, before_df=_df_before, operation_label="Cleaning with 'clean_cddb.clean_df_year' procedure")
    
    .pipe(df_to_var, '_df_before').pipe(clean_cddb.clean_df_title)
    .pipe(log_df_change, before_df=_df_before, operation_label="Cleaning with 'clean_cddb.clean_df_title' procedure")
    
    .pipe(df_to_var, '_df_before').pipe(clean_cddb.clean_df_genre_coalesce_with_category)
    .pipe(log_df_change, before_df=_df_before, operation_label="Cleaning with 'clean_cddb.clean_df_genre_coalesce_with_category' procedure")
    
    # Save an intermediate dataframe prior to dropping records
    # so we can compare with source_df later
    .pipe(df_to_var, 'clean_df_before_drops')
    
    # Drop rows with "REJECT_ROW*" prefix
    .query("~id.str.contains('REJECT_ROW')")    
    .drop(columns=['merged_values'])
)

2023-07-30 15:16:06,678 - 84544 - INFO - Cleaning operation
operation_label: Cleaning with 'clean_cddb.clean_df_try_to_fix_encoding_errors' procedure
cleaning operation_label: Cleaning with 'clean_cddb.clean_df_try_to_fix_encoding_errors' procedure
Number of rows affected: 703
Columns affected: {'artist'}
Examples:
|      | ('artist', 'before')   | ('artist', 'after')   |
|-----:|:-----------------------|:----------------------|
| 7375 | Various Artists        | Various               |
| 6256 | Various Artists        | Various               |
| 7495 | Various Artists        | Various               |
| 8171 | Various Artist         | Various               |
| 7012 | Various Artists        | Various               |

2023-07-30 15:16:06,734 - 84544 - INFO - Cleaning operation
operation_label: Cleaning with 'clean_cddb.clean_df_try_to_fix_encoding_errors' procedure
cleaning operation_label: Cleaning with 'clean_cddb.clean_df_try_to_fix_encoding_errors' procedure
Number of rows affected: 29

Inspecting the cleanup of the `artist` field: (1) standardization to "Various" and (2) fixing of encoding issues.

In [17]:
comps_df_sample_markdown: str = (source_df
                                 .compare(clean_df_artist_transforms_only, result_names=('before', 'after'))
                                 .sample(5, random_state=0)
                                 .to_markdown()
                                 )

print("Example diffs between before-and-after")
print(comps_df_sample_markdown)

Example diffs between before-and-after
|      | ('artist', 'before')   | ('artist', 'after')   |
|-----:|:-----------------------|:----------------------|
| 3371 | Various Artists        | Various               |
| 9401 | Los CaÃÂ±os                        | Los Caños             |
|  316 | MaÃÂ±a's                        | Maña's                |
| 6546 | various                | Various               |
| 4434 | Various Artists        | Various               |
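
One common way to repair this kind of garbling is to reverse the wrong decoding step: re-encode the text as Latin-1 and decode the bytes as UTF-8, repeating while the text keeps changing. The sketch below assumes that mechanism. The helper name `fix_mojibake` is hypothetical, and this is not necessarily how `clean_cddb.clean_df_try_to_fix_encoding_errors` works.

```python
def fix_mojibake(text: str, max_rounds: int = 3) -> str:
    """Undo UTF-8 text that was mis-decoded as Latin-1, possibly more than once."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode("latin1").decode("utf-8")
        except UnicodeError:
            # Either a character is outside Latin-1 or the bytes are not valid
            # UTF-8: treat the text as already correct and stop.
            break
        if repaired == text:
            break
        text = repaired
    return text


# Build a doubly mis-decoded string programmatically rather than hard-coding it.
garbled = "Los Caños"
for _ in range(2):
    garbled = garbled.encode("utf-8").decode("latin1")

print(fix_mojibake(garbled))
```

The round trip is a heuristic: a string that legitimately contains such character sequences would also be rewritten, which is why the schema labels these values only as *possibly* invalid.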


In [18]:
# apply schema to clean_df
try:
    validated_df = clean_cddb.schema(clean_df, lazy=True)
    logging.info("Validation success. No failure cases detected.")
except pa.errors.SchemaErrors as err:
    logging.info("Validation failure. Failure cases detected.")
    logging.debug(err)
    after_cleaning_failure_cases_df = err.failure_cases

after_cleaning_failure_cases_df = after_cleaning_failure_cases_df.pipe(
    get_check_func_descriptions, clean_cddb.schema
)

after_cleaning_failure_cases_summary = (
    after_cleaning_failure_cases_df.groupby(["column", "check"], as_index=False)
    .size()
    .sort_values(by=["column", "check"])
    .rename(columns={"size": "counts"})
)

2023-07-30 15:16:07,415 - 84544 - INFO - Validation failure. Failure cases detected.


# Evaluation

Counts of Failure Cases Before vs After Data Cleaning

In [19]:
(
    failure_cases_summary
    .merge(
        after_cleaning_failure_cases_summary,
        on=['column', 'check'],
        how='outer',
        suffixes=['_before_cleaning', '_after_cleaning'],
    )
    .fillna('')
)

|    | column   | check                                          |   counts_before_cleaning | counts_after_cleaning   |
|---:|:---------|:-----------------------------------------------|-------------------------:|:------------------------|
|  0 | artist   | Check for *possibly* invalid symbols.          |                      639 |                         |
|  1 | artist   | Check for invalid artist values.               |                      697 |                         |
|  2 | artist   | not_nullable                                   |                        1 |                         |
|  3 | category | Check for invalid categories.                  |                       89 |                         |
|  4 | genre    | Check for invalid genres.                      |                        1 |                         |
|  5 | genre    | not_nullable                                   |                     3388 |                         |
|  6 | id       | Check that the length of 'id' is 6 characters. |                      477 |                         |
|  7 | title    | Check for *possibly* invalid symbols.          |                      747 | 497.0                   |
|  8 | title    | not_nullable                                   |                        8 |                         |
|  9 | year     | Check if year is numeric.                      |                       28 |                         |


Comparing `source_df` and `clean_df`

* For the comparison we use an intermediate dataframe, `clean_df_before_drops`, which has the same dimensions as the original dataframe.
    * Before dirty records are dropped, their values in `clean_df_before_drops` are overwritten with a "REJECT_ROW" prefix.
    * This makes side-by-side comparison easier.
* The final output, `clean_df`, omits the records that we intend to drop.
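
The mark-then-drop idea can be sketched on a toy frame. The rejection rule below (an artist consisting only of question marks) and the toy values are invented for illustration. Only the marker text mirrors the one visible in this notebook's output.

```python
import pandas as pd

toy_source_df = pd.DataFrame(
    {
        "artist": ["Emanuel", "???"],
        "title": ["Felicidade", "???"],
        "id": ["100004", "100003"],
    }
)

# Hypothetical rule for this sketch only: an artist made up solely of "?" is invalid.
is_invalid = toy_source_df["artist"].str.fullmatch(r"\?+")

# Overwrite every field of the rejected rows with a marker; the shape stays the same.
toy_before_drops = toy_source_df.copy()
toy_before_drops.loc[is_invalid, :] = "REJECT_ROW - invalid artist"

# Same shape and labels, so DataFrame.compare can line the frames up cell by cell.
diff = toy_source_df.compare(toy_before_drops, result_names=("before", "after"))

# The final output omits the marked rows.
toy_clean_df = toy_before_drops.loc[
    ~toy_before_drops["id"].str.startswith("REJECT_ROW")
]
```

Keeping the rejected rows in place until the very end is what allows `compare` to run, since it requires both dataframes to be identically labeled.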


In [20]:
source_df.head()

Unnamed: 0,artist,category,genre,title,tracks,year,id,merged_values
0,Backstreet Boys,blues,Pop,Millennium,Larger Than Life | I Want It That Way | Show Me The Meaning Of Being Lonely | It's Gotta Be You | I Need You Tonight | Don't Want You Back | Don't Wanna Lose You Now | The One | Back To Your Heart | Spanish Eyes | No One Else Comes Close | The Perfect Fan | I'll Be There For You | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |,,10000,
1,Various,data,,Frankfurt Trance Vol. 04 cd1,DJ Tom Stevens VS. Fridge - Outface 2000 (Radio Mix) | Alice Deejay - Better Off Alone (Signum Remix) | Tillmann Uhrmacher Feat. Peter Ries - Bassfly (Original Mix) | DJ 2 L 8 - Too Late | Time Square - Invisible Girl (Future Breeze Remix) | Cirillo - Across The Soundline | DJ Leon & Jam X - Hold It | Sean Dexter - Synthetica (Extended Mix) | DJ BjÃÂ¶rn - On A Mission (Original Mix) | 8Voice - Music Hypnotizes 2000 | Alex Apollo - Jahr 2000 | Headroom - Utopia (Radio Mix) | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |,,100001,
2,NO RETURN,data,Data,Self Mutilation,Do or Die | Truth and Reality | Lost | Soul Extractor | Sadistic Desire | The True Way | Fanatic Mind | Individualistic Ideal | One Life | Trail of Blood | Sect | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |,,100002,
3,ÃÂ¤ÃÂ¸ÃÂ­ÃÂ¦?Ã¢â¬ËÃÂ©Ã¢â¬ÂºÃ¢â¬Â¦ÃÂ¤ÃÂ¿ÃÂ,data,Pop,ÃÂ¦ÃâÃÂ³ÃÂ£?Ã¢â¬Å¾ÃÂ¥Ã¢â¬Â¡ÃÂºÃÂ£?ÃÂ®ÃÂ£?Ã¢â¬Â¹ÃÂ£?Ã¢â¬ËÃÂ£Ã¢â¬Å¡Ã¢â¬Â°,ÃÂ§Ã¢â¬ÂºÃ¢â¬Â ÃÂ¥ÃÂ¸ÃÂ°ÃÂ£Ã¢â¬Å¡ÃÂ | ÃÂ£?Ã¢â¬Å¾ÃÂ£?ÃÂ¤ÃÂ£?Ã¢â¬Â¹ÃÂ¨ÃÂ¡Ã¢â¬âÃÂ£?ÃÂ§ÃÂ¤ÃÂ¼ÃÂ¡ÃÂ£?ÃÂ£ÃÂ£?ÃÂ¸ÃÂ£?ÃÂªÃÂ£Ã¢â¬Å¡Ã¢â¬Â° | ÃÂ©ÃÂ¢ÃÂ¨ÃÂ£?ÃÂ®ÃÂ£?ÃÂªÃÂ£?Ã¢â¬Å¾ÃÂ¦Ã¢â¬âÃÂ¥ | ÃÂ£?ÃÂ¸ÃÂ£?ÃÂ ÃÂ£?ÃÂ ÃÂ¥Ã¢â¬Â°?ÃÂ£?ÃâÃÂ£?Ã¢â¬Å¾ÃÂ£?Ã¢â¬Å¾ | ÃÂ£?ÃÂµÃÂ£Ã¢â¬Å¡ÃâÃÂ£?Ã¢â¬Å¡ÃÂ£?Ã¢â¬Å¾ | ÃÂ£?Ã¢â¬Å¡ÃÂ£Ã¢â¬Å¡?ÃÂ©?Ã¢â¬â¢ÃÂ¦ÃÅÃÂ¥ | ÃÂ¤ÃÂ¿ÃÂºÃÂ£?ÃÂ¸ÃÂ£?ÃÂ¡ÃÂ£?ÃÂ®ÃÂ¦Ã¢â¬âÃ¢â¬Â¦ | ÃÂ§Ã¢âÂ¢ÃÂ½ÃÂ£?Ã¢â¬Å¾ÃÂ¥ÃÂ¯ÃÂ«ÃÂ§ÃâÃÂ¾ÃÂ©ÃÂ¤ÃÂ¨ | ÃÂ£?Ã¢â¬Â¢ÃÂ£?Ã¢âÂ¢ÃÂ£Ã¢â¬Å¡Ã¢â¬Â°ÃÂ£?Ã¢â¬Å¾ÃÂ¦Ã¢âÂ¢Ã¢â¬Å¡ÃÂ¤ÃÂ»ÃÂ£ | ÃÂ¥ÃÂ¤ÃâÃÂ¨ÃÂ¡ÃâÃÂ¥Ãâ Ã¢â¬âÃÂ¨ÃÂ»ÃÂ | ÃÂ£?Ã¢â¬Å¡ÃÂ£?ÃÂªÃÂ£?ÃÂ¸ÃÂ£Ã¢â¬Å¡Ã¢â¬â¢ÃÂ¦Ã¢â¬Å¾Ã¢â¬ÂºÃÂ£?Ã¢âÂ¢ÃÂ£Ã¢â¬Å¡Ã¢â¬Â¹ÃÂ§ÃÂ§? | ÃÂ©?Ã¢â¬â¢ÃÂ¦ÃÅÃÂ¥ÃÂ¨ÃÂ²ÃÂ´ÃÂ¦Ã¢â¬â? | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |,1989.0,100003,
4,Emanuel,data,Data,Felicidade,Felicidade quando o telefone toca | Vem bailar o tic tic | Quero que sejas minha e de mais ninguem | Eu sei que me amas | O melhor que hÃÂ¡ | Minha vizinha deixa me a bater mal | S. JoÃÂ£o ÃÂ© foliÃÂ£o | SÃÂ³ quero o teu carinho | tudo farei para ter a tua paixÃÂ£o | serÃÂ¡s sempre minha | Vem bailar o tic tic verÃÂ§ÃÂ£o dance | Mix | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |,1998.0,100004,


In [21]:
clean_df_before_drops.head()

Unnamed: 0,artist,category,genre,title,tracks,year,id,merged_values
0,Backstreet Boys,blues,Pop,Millennium,Larger Than Life | I Want It That Way | Show Me The Meaning Of Being Lonely | It's Gotta Be You | I Need You Tonight | Don't Want You Back | Don't Wanna Lose You Now | The One | Back To Your Heart | Spanish Eyes | No One Else Comes Close | The Perfect Fan | I'll Be There For You | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |,,010000,
1,Various,,,Frankfurt Trance Vol. 04 cd1,DJ Tom Stevens VS. Fridge - Outface 2000 (Radio Mix) | Alice Deejay - Better Off Alone (Signum Remix) | Tillmann Uhrmacher Feat. Peter Ries - Bassfly (Original Mix) | DJ 2 L 8 - Too Late | Time Square - Invisible Girl (Future Breeze Remix) | Cirillo - Across The Soundline | DJ Leon & Jam X - Hold It | Sean Dexter - Synthetica (Extended Mix) | DJ BjÃÂ¶rn - On A Mission (Original Mix) | 8Voice - Music Hypnotizes 2000 | Alex Apollo - Jahr 2000 | Headroom - Utopia (Radio Mix) | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |,,100001,
2,NO RETURN,,,Self Mutilation,Do or Die | Truth and Reality | Lost | Soul Extractor | Sadistic Desire | The True Way | Fanatic Mind | Individualistic Ideal | One Life | Trail of Blood | Sect | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |,,100002,
3,REJECT_ROW - invalid artist,,REJECT_ROW - invalid artist,REJECT_ROW - invalid artist,REJECT_ROW - invalid artist,,REJECT_ROW - invalid artist,REJECT_ROW - invalid artist
4,Emanuel,,,Felicidade,Felicidade quando o telefone toca | Vem bailar o tic tic | Quero que sejas minha e de mais ninguem | Eu sei que me amas | O melhor que hÃÂ¡ | Minha vizinha deixa me a bater mal | S. JoÃÂ£o ÃÂ© foliÃÂ£o | SÃÂ³ quero o teu carinho | tudo farei para ter a tua paixÃÂ£o | serÃÂ¡s sempre minha | Vem bailar o tic tic verÃÂ§ÃÂ£o dance | Mix | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |,1998.0,100004,


In [22]:
pd.set_option('display.max_colwidth', 50)

comps_df = (
    source_df.sort_index()
    .compare(clean_df_before_drops.sort_index(), result_names=('before_cleaning', 'after_cleaning'))
    .astype('object')
    .fillna('')
)

# Show before/after comps

In [23]:
pd.set_option('display.max_colwidth', None)

columns_to_compare = ['artist', 'category', 'genre', 'title', 'tracks', 'year', 'id']

comps_df_formatted = (
    comps_df
    .astype(str)
    .stack()
    .reset_index()
    .rename(columns={'level_0': 'row_id', 'level_1': 'before_or_after'})
    .drop(columns=['merged_values'])
    .groupby(['row_id'], as_index=False)[columns_to_compare]
    .agg(lambda row: '  =>  '.join(row))
    .replace('^(  =>  )$', '', regex=True)
)

comps_df_formatted.sample(5, random_state=1)

Unnamed: 0,row_id,artist,category,genre,title,tracks,year,id
4320,4904,,,,,,1994 => 1994,
6257,7096,,,=> N/A,,,,
1924,2187,,,,,,1977 => 1977,
6332,7184,Die schÃÂ¶nsten Westernmelodien => Die schönsten Westernmelodien,,,,,,
3350,3813,,,,,,2000 => 2000,


In [24]:
print("Number of rows changed per column")

for column in clean_df.columns:
    n_rows_changed = comps_df[column][comps_df[column]['before_cleaning'] != comps_df[column]['after_cleaning']].shape[0]
    print(f"{column:.<15}{n_rows_changed}")

Number of rows changed per column
artist.........1326
category.......421
genre..........3684
title..........355
tracks.........347
year...........5231
id.............808


### Sample transformations

* Here we can see that "Various Artists" and "<various>" are transformed to "Various".
* Encoding errors are also repaired, e.g., the garbled "JÃ¶rg Hilbert & Felix Janosa" becomes "Jörg Hilbert & Felix Janosa".
* This example isolates the artist transformations by using `clean_df_artist_transforms_only`, the intermediate dataframe saved right after those steps.

In [25]:
example_idxs = [7629, 1822, 117, 4129]
source_df.compare(clean_df_artist_transforms_only).loc[example_idxs, :].fillna(pd.NA)

Unnamed: 0_level_0,artist,artist
Unnamed: 0_level_1,self,other
7629,Various Artists,Various
1822,JÃÂ¶rg Hilbert & Felix Janosa,Jörg Hilbert & Felix Janosa
117,Various Artists,Various
4129,<various>,Various
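
A standardization like the one above could be approximated with a single regular expression. The pattern below is an assumption that covers only the spellings visible in this notebook's samples. The function name `standardize_various` is hypothetical and is not the `clean_cddb` implementation.

```python
import re

# Assumed pattern: covers "Various Artists", "Various Artist", "various", "<various>".
VARIOUS_PATTERN = re.compile(r"<?\s*various(\s+artists?)?\s*>?", flags=re.IGNORECASE)


def standardize_various(artist: str) -> str:
    """Return 'Various' for known spellings of a compilation artist, else the input."""
    if VARIOUS_PATTERN.fullmatch(artist.strip()):
        return "Various"
    return artist


for raw in ["Various Artists", "Various Artist", "various", "<various>", "Emanuel"]:
    print(f"{raw!r} -> {standardize_various(raw)!r}")
```

Using `fullmatch` rather than `search` keeps names that merely contain the word "various" from being rewritten.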


In [26]:
comps_df.sample(10, random_state=3)

Unnamed: 0_level_0,artist,artist,category,category,genre,genre,title,title,tracks,tracks,year,year,id,id,merged_values,merged_values
Unnamed: 0_level_1,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning
6337,,,,,,,,,,,,,,,,
7462,,,,,,,,,,,1996.0,1996.0,,,,
906,,,,,,,,,,,2002.0,2002.0,,,,
6771,,,,,,,,,,,2002.0,2002.0,,,,
3271,,,,,,,,,,,,,,,,
7271,,,,,,,,,,,2001.0,2001.0,,,,
8111,,,,,,,,,,,1997.0,1997.0,,,,
9438,ÃÅÃÂ³ÃÂ°ÃÂ§ÃÂ¨ÃÂ«ÃÂªÃÂ¨ International,REJECT_ROW - invalid artist,misc,,Russian Pop,REJECT_ROW - invalid artist,ÃÅÃÂ³ÃÂ°ÃÂ§ÃÂ¨ÃÂ«ÃÂªÃÂ¨ +,REJECT_ROW - invalid artist,"ÃÅÃÂ¥ÃÂ«ÃÂ¼ÃÂ­ÃÂ¨ÃÂ¶ÃÂ feat. ÃË.ÃïÃÂ¨ÃÂªÃÂ®ÃÂ«ÃÂ ÃÂ¥ÃÂ¢ | ÃÅ ÃÂ®ÃÂ¬ÃÂ ÃÂ°ÃÂ®ÃÂ¢ÃÂ® feat. ÃË.ÃâÃÂªÃÂ«ÃÂ¿ÃÂ° | ÃâÃÂ ÃÂ£ÃÂ ÃÂ­ÃÂªÃÂ feat. ÃÅ.ÃËÃÂ³ÃÂ´ÃÂ³ÃÂ²ÃÂ¨ÃÂ­ÃÂ±ÃÂªÃÂ¨ÃÂ© | ÃïÃÂ ÃÂ¡ÃÂ³ÃÂ¸ÃÂªÃÂ¨_ÃÂ±ÃÂ²ÃÂ ÃÂ°ÃÂ³ÃÂ¸ÃÂªÃÂ¨ feat. Ãâ.ÃâÃÂ®ÃÂ¡ÃÂ°ÃÂ»ÃÂ­ÃÂ¨ÃÂ­ | ÃâÃÂ°ÃÂ ÃÂ¢ÃÂ ÃÂ³ ÃÂ¤ÃÂ®ÃÂ¬ÃÂ feat. Ãâ¡ÃÂ¥ÃÂ¬ÃÂ«ÃÂ¿ÃÂ­ÃÂ¥ | ÃÅÃÂ£ÃÂ­ÃÂ®ÃÂ¢ÃÂ¥ÃÂ­ÃÂ¨ÃÂ¿ feat. Ãï.ÃïÃÂ ÃÂ±ÃÂªÃÂ®ÃÂ¢ | ÃâÃÂ®ÃÂ¦ÃÂ¤ÃÂ¨ feat. ÃË.ÃÅ ÃÂ®ÃÂ°ÃÂ­ÃÂ¥ÃÂ«ÃÂ¾ÃÂª | ÃïÃÂ³ÃÂªÃÂ¥ÃÂ² feat. Ãâ¬.ÃïÃÂ ÃÂ°ÃÂ»ÃÂªÃÂ¨ÃÂ­ | 22 ÃÂ¯ÃÂ°ÃÂ¨ÃÂ²ÃÂ®ÃÂ¯ÃÂ feat. Ãâ.ÃÅÃÂ¨ÃÂ­ÃÂ ÃÂ¥ÃÂ¢ | ÃÅ¸ ÃÂ«ÃÂ¾ÃÂ¡ÃÂ«ÃÂ¾ ÃÂ¢ÃÂ ÃÂ±, ÃÂ¤ÃÂ¥ÃÂ¢ÃÂ®ÃÂ·ÃÂªÃÂ¨ feat. Ãï.Ãâ ÃÂ³ÃÂªÃÂ®ÃÂ¢ | Hands up feat. Ottawan | Video | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |",REJECT_ROW - invalid artist,2003.0,,4733.0,REJECT_ROW - invalid artist,,REJECT_ROW - invalid artist
3909,,,,,,,,,,,,,,,,
8211,,,,,,,,,,,1975.0,1975.0,,,,


In [27]:
comps_df.sample(5, random_state=4)

Unnamed: 0_level_0,artist,artist,category,category,genre,genre,title,title,tracks,tracks,year,year,id,id,merged_values,merged_values
Unnamed: 0_level_1,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning,before_cleaning,after_cleaning
6112,,,,,,,,,,,1999.0,1999.0,,,,
6082,Various Artists,Various,,,,,,,,,1991.0,1991.0,,,,
4633,,,,,,,,,,,1999.0,1999.0,,,,
1034,,,,,,,,,,,,,,,,
1581,Various Artists,Various,,,,,,,,,1992.0,1992.0,,,,


# Transform to track-level data

We want a separate track-level dataset that can be joined to the album-level data in `clean_df`.

In [30]:
# Transform to track-level data
track_level_df = (
    # Start with the cleaned dataframe
    clean_df
    # Split 'tracks' on pipe into an array; we can "explode" it later
    .pipe(lambda _df: _df.assign(tracks=_df["tracks"].str.split("|")))
    # "explode"/expand the "tracks" array into 1 observation per track, then
    # self-join to the CD data set; the CD-level data will repeat for each track
    .pipe(
        lambda _df: _df.merge(
            _df["tracks"].explode(), left_index=True, right_index=True
        )
    )
    .pipe(df_to_var, "df_after_explode")
    # Make new 'tracks' field; strip ' ' empty space track names to '' empty string
    .pipe(lambda _df: _df.assign(tracks=_df["tracks_y"].str.strip()))
    # Don't need these fields anymore
    .drop(columns=["tracks_x", "tracks_y"])
    # Filter out empty string track names
    .query("tracks!=''")
    .pipe(df_to_var, "df_after_empty_track_name_filter")
    .reset_index(drop=True)
    .loc[:, ['id', 'tracks']]
    .reset_index()
    .rename(columns={'id': 'album_row_id', 
                    'index': 'track_id',
                    'tracks': 'track_name',
                    }
            )
)

In [31]:
# Demo joining track_level_df and clean_df
(track_level_df
 .head(15)
 .merge(clean_df, left_on=['album_row_id'], right_on=['id'], how='inner')
 .loc[:, ['album_row_id', 'track_id', 'title', 'track_name']]
)

Unnamed: 0,album_row_id,track_id,title,track_name
0,10000,0,Millennium,Larger Than Life
1,10000,1,Millennium,I Want It That Way
2,10000,2,Millennium,Show Me The Meaning Of Being Lonely
3,10000,3,Millennium,It's Gotta Be You
4,10000,4,Millennium,I Need You Tonight
5,10000,5,Millennium,Don't Want You Back
6,10000,6,Millennium,Don't Wanna Lose You Now
7,10000,7,Millennium,The One
8,10000,8,Millennium,Back To Your Heart
9,10000,9,Millennium,Spanish Eyes
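
For comparison, the same reshaping can be sketched more directly with `DataFrame.explode`, which repeats the album-level columns for each list element without a self-join. The toy data below is invented. The output column names follow the ones used in this notebook.

```python
import pandas as pd

toy_albums_df = pd.DataFrame(
    {
        "id": ["010000", "100002"],
        "tracks": ["Larger Than Life | I Want It That Way | | ", "Do or Die | Lost | "],
    }
)

toy_tracks_df = (
    toy_albums_df
    # One list of track names per album
    .assign(track_name=lambda df: df["tracks"].str.split("|"))
    # One row per list element; the album id repeats for each track
    .explode("track_name")
    # Trim whitespace, then drop the empty padding slots
    .assign(track_name=lambda df: df["track_name"].str.strip())
    .query("track_name != ''")
    .loc[:, ["id", "track_name"]]
    .rename(columns={"id": "album_row_id"})
    .reset_index(drop=True)
    # Running row number as a surrogate track id
    .rename_axis("track_id")
    .reset_index()
)
```

As in the notebook's version, `track_id` here is a running number across all albums, not a position within each album.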


# End