# Data Quality report: Spotify API response with Taylor Swift's discography

### Import basic libraries and return the data

In [1]:
# Resources
import pandas as pd
from firstPart import Spotify_reader

In [2]:
# Import the data
filename = "taylor_swift_spotify.json"
spotify_wrapper = Spotify_reader(filename)
spotify_wrapper.data_breakdown()
spotify_wrapper.export_csv()

## 1. Completeness

First, count the rows that contain at least one null or empty value, which gives the number of records with complete data. Then list the columns that contain null or empty values, so that each can be compared with the documentation to verify whether the API is expected to return empty data for it.

In [3]:
# Pre-process the data to get a standard definition of empty
empty_synonym = ["", None, float("nan"), "null", "<NA>"]
spotify_wrapper.export_db.replace(empty_synonym, pd.NA, inplace=True)

In [4]:
# Get basic numbers
number_of_fields = spotify_wrapper.export_db.shape[0]
null_rows_db = spotify_wrapper.export_db[
    spotify_wrapper.export_db.apply(lambda x: x.isna().any(), axis=1)
]
null_rows = null_rows_db.shape[0]
columns_with_null = spotify_wrapper.export_db.columns[
    spotify_wrapper.export_db.apply(lambda x: x.isna().any())
]

# Print data
print("Initial screening:")
print(
    f"Out of {number_of_fields} records there are {null_rows} with at least one empty value, that's roughly {int(null_rows*100/number_of_fields)}% of the data."
)
print(f"Columns with unfilled data {columns_with_null.tolist()}")

Initial screening:
Out of 539 records there are 74 with at least one empty value, that's roughly 13% of the data.
Columns with unfilled data ['track_id', 'track_name', 'audio_features.danceability', 'audio_features.energy', 'audio_features.key', 'audio_features.loudness', 'audio_features.speechiness', 'audio_features.acousticness', 'audio_features.liveness', 'audio_features.tempo', 'audio_features.time_signature', 'album_name']


First off, let's check if it is possible to get empty data for the `album_name`. Looking at the [documentation](https://developer.spotify.com/documentation/web-api/reference/get-an-album), it can be seen that the `album_name` could be empty 'in case of an album takedown.' This indicates that some albums in the data frame were taken down when the data was fetched.

Next, the [Spotify documentation](https://developer.spotify.com/documentation/web-api/reference/get-track) about tracks does not mark `track_id` or `track_name` as nullable. In other words, it shouldn't be possible to get empty data for these fields. Moreover, a request to the endpoint `GET /tracks/{id}` requires the `track_id` to be passed, making it virtually impossible to get an empty or null `track_id` by API design. Let's check whether the tracks with empty data match the albums that were taken down.

In [5]:
column_to_filter = ["album_name"]
column_to_check = ["track_id"]
taken_down_albums_songs = spotify_wrapper.export_db[
    spotify_wrapper.export_db[column_to_filter].apply(lambda x: x.isna().any(), axis=1)
]
number_non_na_songs = (
    taken_down_albums_songs[column_to_check]
    .apply(lambda x: x.notna().any(), axis=1)
    .sum()
)
print(
    f"Out of {taken_down_albums_songs.shape[0]} songs from the taken-down albums, {number_non_na_songs} have a track_id."
)

Out of 62 songs from the taken-down albums, 59 have a track_id.


The last part is to review the audio features of a song. In the Spotify [documentation](https://developer.spotify.com/documentation/web-api/reference/get-audio-features) of the get track's audio features endpoint, none of the response attributes are marked as nullable.
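
Since none of the audio feature attributes should be empty, a per-column count of missing values can show which features are affected most. The following is a minimal sketch on a made-up frame; `toy_db` and its values are assumptions standing in for `spotify_wrapper.export_db`.

```python
import pandas as pd

# Toy frame standing in for spotify_wrapper.export_db (values are made up).
toy_db = pd.DataFrame(
    {
        "track_id": ["a1", None, "c3", "d4"],
        "audio_features.energy": [0.5, 0.7, None, None],
        "audio_features.tempo": [120.0, 98.5, 101.2, None],
    }
)

# Select only the audio feature columns by their shared prefix.
audio_columns = [c for c in toy_db.columns if c.startswith("audio_features.")]

# Count missing values per audio feature column, largest count first.
missing_per_column = toy_db[audio_columns].isna().sum().sort_values(ascending=False)
print(missing_per_column)
```

Applied to the real data frame, the same two lines would give a count for each of the audio feature columns reported above.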

With the previous findings, let's assess the completeness ratio of the dataframe again, acknowledging that it is perfectly normal to get an empty `album_name`. Notice that 59 of the 62 songs from taken-down albums still have a `track_id`, so there is no evidence that an empty `album_name` leads to a missing `track_id`.

In [6]:
# Get the rows with nulls again, but do not include album_name
columns_with_null_fixed = list(columns_with_null)
columns_with_null_fixed.remove("album_name")

null_rows_fixed_db = spotify_wrapper.export_db[
    spotify_wrapper.export_db[columns_with_null_fixed].apply(
        lambda x: x.isna().any(), axis=1
    )
]
null_rows_fixed = null_rows_fixed_db.shape[0]

# Print data
print(
    f"Completeness ratio of {round((number_of_fields-null_rows_fixed)*100/number_of_fields,2)}%.\nOut of {number_of_fields} records there are {null_rows_fixed} with at least one empty value.",
    f"\nColumns with unfilled data {columns_with_null_fixed}",
)

Completeness ratio of 96.29%.
Out of 539 records there are 20 with at least one empty value. 
Columns with unfilled data ['track_id', 'track_name', 'audio_features.danceability', 'audio_features.energy', 'audio_features.key', 'audio_features.loudness', 'audio_features.speechiness', 'audio_features.acousticness', 'audio_features.liveness', 'audio_features.tempo', 'audio_features.time_signature']


Note: There are several other fields/attributes that are not in this JSON file but are present in the documentation. For the purpose of this analysis, this factor was not taken into account.
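
If this factor were taken into account, one possible approach would be a set comparison between the documented fields and the columns present in the data. The sketch below is only illustrative: `expected_fields` is a hypothetical, hand-written subset, not the full documented schema, and `toy_db` stands in for the exported data frame.

```python
import pandas as pd

# Hypothetical subset of documented fields, written by hand for illustration.
expected_fields = {"track_id", "track_name", "duration_ms", "is_local"}

# Empty toy frame that only carries column names (stand-in for the export).
toy_db = pd.DataFrame(columns=["track_id", "track_name", "duration_ms", "explicit"])

present_fields = set(toy_db.columns)

# Fields that are documented (in our list) but absent from the data.
missing_from_data = sorted(expected_fields - present_fields)
# Columns in the data that our hand-written list does not cover.
not_in_expected_list = sorted(present_fields - expected_fields)

print(f"Documented but missing: {missing_from_data}")
print(f"Present but not listed: {not_in_expected_list}")
```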

## 2. Uniqueness

These are the criteria that will be followed to assess the uniqueness of the dataset:
* Ensure there are no repeated album_id.
* Allow for repeated songs, but only if they belong to different albums.
* Ensure the properties of a song are consistent between repetitions.
* Avoid having two songs with the same 'track_id' but mismatching audio_features.
* The track_id should always match with the audio_features.id.
* There shouldn't be repeated track numbers.
* The album_total_tracks should match the number of songs in the album.

In [7]:
# Check that there are no repeated `album_id`.
count_of_album_id = spotify_wrapper.albums_db["album_id"].value_counts()
albums_above_1 = (
    count_of_album_id[count_of_album_id > 1]
    .reset_index(name="count")
    .merge(spotify_wrapper.albums_db, how="left", on="album_id")
    .drop_duplicates()
)
print(
    f"There are {count_of_album_id.size} albums in the response, and {albums_above_1.shape[0]} repetitions."
)
albums_above_1

There are 26 albums in the response, and 1 repetitions.


Unnamed: 0,album_id,count,album_name,album_release_date,album_total_tracks,album.artist_id
0,1NAmidJlEaVgA3MpcPFYGq,2,Lover,2019-08-23,18,06HL4z0CvFAxyc27GX


In [8]:
# There could be repeated songs, as long as they belong to different albums.
# Remove the repeated album_ids
albums_track_id = ["album_id", "track_id"]
albums_track_id_repeated = (
    spotify_wrapper.export_db.groupby(albums_track_id).size().reset_index(name="count")
)
albums_track_id_repeated_filtered = albums_track_id_repeated[
    (albums_track_id_repeated["count"] > 1)
    & (~albums_track_id_repeated["album_id"].isin(albums_above_1["album_id"]))
]
print(
    f"There are {albums_track_id_repeated_filtered.shape[0]} songs that are repeated in the same album."
)
albums_track_id_repeated_filtered

There are 1 songs that are repeated in the same album.


Unnamed: 0,album_id,track_id,count
156,1fnJ7k0bllNfL1kVdNVW1A,3xYJScVfxByb61dYHTwiby,2


In [9]:
# The properties of a song should be consistent between repetitions.
duplicate_tracks_db = spotify_wrapper.tracks_db[
    spotify_wrapper.tracks_db.duplicated(subset="track_id", keep=False)
]
bad_tracks = []

# Group by track_id and check if other columns are the same
for track_id, group in duplicate_tracks_db.groupby("track_id"):
    if group.drop("track_id", axis=1).nunique().eq(1).all():
        pass
    else:
        bad_tracks.append(track_id)
# Empty track_ids are not real repetitions; remove them only if present
if "" in bad_tracks:
    bad_tracks.remove("")
bad_repetitions_track_db = spotify_wrapper.tracks_db[
    spotify_wrapper.tracks_db["track_id"].isin(bad_tracks)
]
print(
    f"There are {len(bad_tracks)} songs whose information is not consistent between entries"
)
bad_repetitions_track_db

There are 1 songs whose information is not consistent between entries


Unnamed: 0,disc_number,duration_ms,explicit,track_number,track_popularity,track_id,track_name,track.album_id
278,1,178426,False,2,99,1BxfuPKGuaTgP7aM0Bbdwr,Cruel Summer,1NAmidJlEaVgA3MpcPFYGq
296,1,178426,No,2,99,1BxfuPKGuaTgP7aM0Bbdwr,Cruel Summer,1NAmidJlEaVgA3MpcPFYGq
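
The loop above could also be written without explicit iteration. The following is a sketch of that alternative on made-up rows, not the method used in this report: it counts distinct values per `track_id` with `groupby(...).nunique()` and flags any track where some column has more than one value. `toy_tracks` is an assumption standing in for `spotify_wrapper.tracks_db`.

```python
import pandas as pd

# Made-up rows: track "t1" has conflicting `explicit` values, "t2" is consistent.
toy_tracks = pd.DataFrame(
    {
        "track_id": ["t1", "t1", "t2", "t2"],
        "explicit": [False, "No", True, True],
        "duration_ms": [178426, 178426, 200000, 200000],
    }
)

# Number of distinct values per column within each track_id.
# dropna=False counts a missing value as its own distinct value.
distinct_per_track = toy_tracks.groupby("track_id").nunique(dropna=False)

# A track is inconsistent if any column holds more than one distinct value.
is_inconsistent = (distinct_per_track > 1).any(axis=1)
inconsistent_ids = distinct_per_track[is_inconsistent].index.tolist()
print(inconsistent_ids)
```

Counting missing values as distinct is a design choice: it flags a repetition where one entry has a value and the other does not, which the `nunique()` call in the loop would ignore.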


In [10]:
# Avoid having songs with the same 'track_id' but mismatching audio_features.
track_audio_features_columns = ["track_id"] + [
    column for column in spotify_wrapper.export_db.columns if "audio_features" in column
]
track_audio_features_db = spotify_wrapper.export_db[track_audio_features_columns]
track_audio_features_repeated_db = track_audio_features_db[
    track_audio_features_db.duplicated("track_id", keep=False)
]

bad_audio_features = []

# Group by track_id and check if other columns are the same
for track_id, group in track_audio_features_repeated_db.groupby("track_id"):
    if group.drop("track_id", axis=1).nunique().eq(1).all():
        pass
    else:
        bad_audio_features.append(track_id)
if "" in bad_audio_features:
    bad_audio_features.remove("")
bad_repetitions_features = track_audio_features_db[
    track_audio_features_db["track_id"].isin(bad_audio_features)
]

print(
    f"There are {len(bad_audio_features)} tracks whose audio_features are not consistent between entries"
)
bad_repetitions_features

There are 0 tracks whose audio_features are not consistent between entries


Unnamed: 0,track_id,audio_features.danceability,audio_features.energy,audio_features.key,audio_features.loudness,audio_features.mode,audio_features.speechiness,audio_features.acousticness,audio_features.instrumentalness,audio_features.liveness,audio_features.valence,audio_features.tempo,audio_features.id,audio_features.time_signature


In [11]:
# The track_id should always match with the audio_features.id
missmatch_audio_features_db = spotify_wrapper.export_db[
    spotify_wrapper.export_db["track_id"]
    != spotify_wrapper.export_db["audio_features.id"]
]
audio_features_missmatch_track_id = (
    missmatch_audio_features_db["audio_features.id"].value_counts().size
)
print(
    f"There are {audio_features_missmatch_track_id} audio_features without a matching track_id (this often happens when track_id==pd.NA)"
)
missmatch_audio_features_db

There are 8 audio_features without a matching track_id (this often happens when track_id==pd.NA)


Unnamed: 0,disc_number,duration_ms,explicit,track_number,track_popularity,track_id,track_name,audio_features.danceability,audio_features.energy,audio_features.key,...,audio_features.tempo,audio_features.id,audio_features.time_signature,artist_id,artist_name,artist_popularity,album_id,album_name,album_release_date,album_total_tracks
321,1,209680,False,8,84,,Gorgeous,0.8,0.535,7.0,...,92.027,1ZY1PqizIl78geGM4xWlEA,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,6DEjYFkNZh67HP7R9PSZvv,reputation,2017-11-10,15
363,1,238093,False,35,32,,Jump Then Fall,0.617,,2.0,...,80.007,5zytSTR2g0I9psX2Z12ex6,,06HL4z0CvFAxyc27GX,Taylor Swift,120,1MPAXuTVL2Ej5x0JHiSPq8,,2017-11-09,46
375,1,212600,False,1,60,,Welcome To New York,0.789,0.634,7.0,...,116.992,3nRmDz7qGCvsMS30rGGY0x,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,1yGbNOtRIgdIiGHOEBaZWf,1989 (Deluxe),2014-01-01,19
379,1,193293,False,5,60,,All You Had To Do Was Stay,0.605,0.725,5.0,...,96.97,6aLOekfwbytwWvQftxTEF0,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,1yGbNOtRIgdIiGHOEBaZWf,1989 (Deluxe),2014-01-01,19
382,1,211933,False,8,61,,Bad Blood,0.646,0.794,7.0,...,170.216,2NlmmAjGYrrjAp0MED5rGx,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,1yGbNOtRIgdIiGHOEBaZWf,1989 (Deluxe),2014-01-01,19
434,1,362826,False,6,54,,Back To December/Apologize/You're Not Sorry - ...,0.374,0.516,2.0,...,142.057,1IsquhJFJ0qcFZI7FeAEuN,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,6fyR4wBPwLHKcRtxgd4sGh,,2010-10-25,16
442,1,389213,False,14,49,,Enchanted - Live/2011,0.34,0.663,8.0,...,163.678,3lm4L3pPL32PFy74dR17OR,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,6fyR4wBPwLHKcRtxgd4sGh,,2010-10-25,16
445,1,230546,False,1,49,,Mine - POP Mix,0.696,0.768,7.0,...,121.05,0GxW5K0qzrq7L1jwSY5OmY,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,6Ar2o9KCqcyYF9J0aQP3au,Speak Now,2010-10-25,14


In [12]:
# There shouldn't be repeated track numbers within the same album, and there shouldn't be missing songs.
albums_track_number = ["album_id", "track_number", "disc_number"]
albums_track_number_repeated = (
    spotify_wrapper.export_db.groupby(albums_track_number)
    .size()
    .reset_index(name="count")
)
albums_track_number_repeated_filtered = albums_track_number_repeated[
    (albums_track_number_repeated["count"] > 1)
    & (~albums_track_number_repeated["album_id"].isin(albums_above_1["album_id"]))
]
print(
    f"There are {albums_track_number_repeated_filtered.shape[0]} track numbers that are repeated in the same disc_number of an album."
)
albums_track_number_repeated_filtered

There are 2 track numbers that are repeated in the same disc_number of an album.


Unnamed: 0,album_id,track_number,disc_number,count
169,1fnJ7k0bllNfL1kVdNVW1A,21,1,2
443,6DEjYFkNZh67HP7R9PSZvv,8,1,2


In [13]:
# The album_total_tracks should match the number of songs in the album
spotify_wrapper.export_db["album_total_tracks_fixed"] = spotify_wrapper.export_db[
    "album_total_tracks"
].replace("Thirteen", 13)
spotify_wrapper.albums_db["album_total_tracks_fixed"] = spotify_wrapper.albums_db[
    "album_total_tracks"
].replace("Thirteen", 13)
export_db_no_duplicate_albums = spotify_wrapper.export_db.drop_duplicates(
    subset=["album_id","disc_number","track_id", "track_name"], keep="first"
)
album_track_count = (
    export_db_no_duplicate_albums.groupby("album_id")["track_number"]
    .size()
    .reset_index()
)
album_track_count.columns = ["album_id", "actual_track_count"]
spotify_wrapper.albums_db = spotify_wrapper.albums_db.merge(
    album_track_count, on="album_id"
)
spotify_wrapper.albums_db[
    "track_difference_spotify_count"
] = spotify_wrapper.albums_db.apply(
    lambda x: x["actual_track_count"]-x["album_total_tracks_fixed"], axis=1
)
print(
    f"There are {(spotify_wrapper.albums_db['track_difference_spotify_count'] != 0).sum()} albums where the 'album_total_tracks' attribute does not match the count of non-repeated tracks for each album."
)
spotify_wrapper.albums_db[
    spotify_wrapper.albums_db["track_difference_spotify_count"] != 0
][
    [
        "album_id",
        "album_name",
        "album_total_tracks_fixed",
        "actual_track_count",
        "track_difference_spotify_count",
    ]
]

There are 5 albums where the 'album_total_tracks' attribute does not match the count of non-repeated tracks for each album.


Unnamed: 0,album_id,album_name,album_total_tracks_fixed,actual_track_count,track_difference_spotify_count
3,1fnJ7k0bllNfL1kVdNVW1A,Midnights (The Til Dawn Edition),24,23,-1
6,6kZ42qRrzov54LcAk4onW9,Red (Taylor's Version),34,30,-4
9,2Xoteh7uEpea4TohMxjtaq,evermore,10,15,5
15,6DEjYFkNZh67HP7R9PSZvv,reputation,15,16,1
26,5eyZZoQEFQWRHkV2xgAeBw,Taylor Swift,13,15,2


## 3. Timeliness

There is only one field with a date datatype: `album_release_date`. As a simple sanity check (not a full validation of the dates), each album should have been released after Taylor Swift's date of birth.

In [14]:
taylor_swift_birthday = pd.to_datetime("1989-12-13")
spotify_wrapper.export_db["album_release_date_formated"] = pd.to_datetime(
    spotify_wrapper.export_db["album_release_date"], format="%Y-%m-%d"
)
# Note: despite its name, this frame holds the rows released BEFORE the birth date
albums_after_birthday = spotify_wrapper.export_db[
    spotify_wrapper.export_db["album_release_date_formated"] < taylor_swift_birthday
]
print(
    f"Timeliness ratio: {round((number_of_fields-albums_after_birthday.shape[0])*100/number_of_fields,2)}% \nThere are {len(albums_after_birthday['album_id'].unique())} albums that were made before Taylor Swift was born. Tracks affected: {albums_after_birthday.shape[0]}.\nList of albums: {albums_after_birthday['album_id'].unique()}."
)

Timeliness ratio: 97.22% 
There are 1 albums that were made before Taylor Swift was born. Tracks affected: 15.
List of albums: ['5eyZZoQEFQWRHkV2xgAeBw'].
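
The cell above assumes every release date is a full `YYYY-MM-DD` string. Spotify's album object also carries a `release_date_precision` field that can be `year`, `month` or `day`, so shorter dates are possible in general. The sketch below, on made-up dates, shows one way to separate full dates from the rest before parsing; the variable names are assumptions and this is not part of the original check.

```python
import pandas as pd

# Made-up release dates with different precisions.
release_dates = pd.Series(["2019-08-23", "2014", "2010-10", "1975-05-01"])

# Keep only strings that are complete YYYY-MM-DD dates.
is_full_date = release_dates.str.fullmatch(r"\d{4}-\d{2}-\d{2}")

# Parse the full dates; everything else becomes NaT instead of raising.
parsed = pd.to_datetime(
    release_dates.where(is_full_date), format="%Y-%m-%d", errors="coerce"
)

birth_date = pd.to_datetime("1989-12-13")
# Comparisons with NaT are False, so unparsed rows are not flagged here.
released_before_birth = release_dates[parsed < birth_date].tolist()
not_full_precision = release_dates[~is_full_date].tolist()

print(f"Released before birth date: {released_before_birth}")
print(f"Not a full date: {not_full_precision}")
```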


## 4. Validity and accuracy

* Validity is determined by calculating the ratio of records with data types matching those specified in the documentation to the total number of records.
* Accuracy is defined as the ratio of valid records falling within the accepted range of values to the total number of records.

Please note that records with `pd.NA` are considered both invalid and inaccurate.
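
To make the two definitions concrete, here is a small worked example on a made-up column of five values. The values and variable names are assumptions chosen for illustration; the minimum of 1 mirrors the rule applied to `duration_ms` in this report.

```python
import pandas as pd

# Five made-up records: two good ints, one negative int, one string, one missing.
durations = pd.Series([178426, -107133, "n/a", pd.NA, 209680], dtype=object)

total = durations.size                      # 5 records in total
present = durations.dropna()                # 4 records that are not missing
# Valid: the value has the documented data type (int).
valid = present[present.apply(lambda v: isinstance(v, int))]
# Accurate: the value is valid AND falls within the accepted range (>= 1).
accurate = valid[valid.apply(lambda v: v >= 1)]

validity_ratio = round(valid.size * 100 / total, 2)
accuracy_ratio = round(accurate.size * 100 / total, 2)
print(f"Validity: {validity_ratio}%, accuracy: {accuracy_ratio}%")
```

Three of the five records are integers (60% validity) and two of those are at least 1 (40% accuracy). The missing record lowers both ratios because the denominator is always the total number of records.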

In [15]:
# Prepare the data. Two columns are cast to float because of the pd.NA values. I double-checked the JSON response with a JSONPath query to validate that there were no decimals.
spotify_wrapper.export_db["audio_features.key"] = spotify_wrapper.export_db[
    "audio_features.key"
].apply(lambda x: int(x) if not pd.isna(x) else pd.NA)
spotify_wrapper.export_db["audio_features.time_signature"] = spotify_wrapper.export_db[
    "audio_features.time_signature"
].apply(lambda x: int(x) if not pd.isna(x) else pd.NA)
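
As an aside, pandas also offers a nullable integer dtype, `Int64`, that keeps whole numbers and missing values in the same column. The sketch below shows it on made-up values; it is an alternative to the `apply` calls above, not what this report uses, and the `isinstance(x, int)` check in the validator may need adjusting if this dtype were adopted.

```python
import pandas as pd

# Made-up key values; the missing entry forces a float64 column.
keys_as_float = pd.Series([7.0, None, 2.0])
print(keys_as_float.dtype)

# Cast to the nullable integer dtype: whole numbers stay integers,
# and the missing entry becomes <NA> instead of NaN.
keys_as_int = keys_as_float.astype("Int64")
print(keys_as_int)
```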

In [16]:
# Validation dictionary. Abstracted from Spotify's documentation
validation_dict = [
    {
        "columns": ["disc_number", "track_number", "album_total_tracks", "duration_ms"],
        "datatype": int,
        "min_value": 1,
    },
    {"columns": ["explicit"], "datatype": bool},
    {"columns": ["audio_features.tempo"], "datatype": float},
    {
        "columns": [
            "track_id",
            "artist_id",
            "album_id",
            "album_name",
            "artist_name",
            "audio_features.id",
            "track_name",
        ],
        "datatype": str,
    },
    {
        "columns": ["track_popularity", "artist_popularity"],
        "datatype": int,
        "max_value": 100,
        "min_value": 0,
    },
    {
        "columns": [
            "audio_features.danceability",
            "audio_features.energy",
            "audio_features.speechiness",
            "audio_features.acousticness",
            "audio_features.liveness",
            "audio_features.valence",
            "audio_features.instrumentalness",
        ],
        "datatype": float,
        "max_value": 1.0,
        "min_value": 0.0,
    },
    {
        "columns": ["audio_features.key"],
        "datatype": int,
        "max_value": 11,
        "min_value": -1,
    },
    {
        "columns": ["audio_features.loudness"],
        "datatype": float,
        "max_value": 0.0,
        "min_value": -60.0,
    },
    {
        "columns": ["audio_features.mode"],
        "datatype": int,
        "max_value": 1,
        "min_value": 0,
    },
    {
        "columns": ["audio_features.time_signature"],
        "datatype": int,
        "max_value": 7,
        "min_value": 3,
    },
]

In [17]:
def data_validator(data_frame, validation_dict):
    outdict = {
        "columns": [],
        "na_ratios": [],
        "validity_ratios": [],
        "accuracy_ratios": [],
        "bad_formatted_data": [],
        "bad_accuracy_data": [],
    }
    for validator in validation_dict:
        columns = validator.get("columns")
        validator_datatype = validator.get("datatype")

        for column in columns:
            values = data_frame[column].dropna()
            check_datatype = values.apply(lambda x: isinstance(x, validator_datatype))
            good_formatted, bad_formatted = (
                values[check_datatype],
                values[~check_datatype],
            )
            na_ratio = round(
                (data_frame[column].size - values.size) * 100 / data_frame[column].size,
                2,
            )
            validity_ratio = round(
                good_formatted.size * 100 / data_frame[column].size, 2
            )

            if not good_formatted.empty:
                min_value, max_value = (
                    validator.get("min_value"),
                    validator.get("max_value"),
                )
                if min_value is not None and max_value is not None:
                    check_accuracy = (min_value <= good_formatted) & (
                        good_formatted <= max_value
                    )
                elif min_value is not None:
                    check_accuracy = min_value <= good_formatted
                elif max_value is not None:
                    check_accuracy = good_formatted <= max_value
                else:
                    check_accuracy = pd.Series(True, index=good_formatted.index)

                good_accuracy, bad_accuracy = (
                    good_formatted[check_accuracy],
                    good_formatted[~check_accuracy],
                )

            else:
                # Explicit dtype avoids the empty-Series dtype warning in some pandas versions
                good_accuracy, bad_accuracy = pd.Series(dtype=object), bad_formatted

            accuracy_ratio = round(
                good_accuracy.size * 100 / data_frame[column].size, 2
            )

            outdict["columns"].append(column)
            outdict["na_ratios"].append(na_ratio)
            outdict["validity_ratios"].append(validity_ratio)
            outdict["accuracy_ratios"].append(accuracy_ratio)
            outdict["bad_formatted_data"].append(list(set(bad_formatted.tolist())))
            outdict["bad_accuracy_data"].append(list(set(bad_accuracy.tolist())))

    out_df = pd.DataFrame.from_dict(outdict)

    return out_df

In [18]:
validator_accuracy_df = data_validator(spotify_wrapper.export_db, validation_dict)
validator_accuracy_df

Unnamed: 0,columns,na_ratios,validity_ratios,accuracy_ratios,bad_formatted_data,bad_accuracy_data
0,disc_number,0.0,100.0,100.0,[],[]
1,track_number,0.0,100.0,100.0,[],[]
2,album_total_tracks,0.0,97.22,97.22,[Thirteen],[]
3,duration_ms,0.0,100.0,99.63,[],"[-107133, -223093]"
4,explicit,0.0,99.07,99.07,"[No, Si]",[]
5,audio_features.tempo,0.19,99.81,99.81,[],[]
6,track_id,1.48,98.52,98.52,[],[]
7,artist_id,0.0,100.0,100.0,[],[]
8,album_id,0.0,100.0,100.0,[],[]
9,album_name,11.5,88.5,88.5,[],[]


## 5. Consistency

This dimension is not evaluated, since the purpose of this report is to identify data-related issues within the data itself rather than against external sources. However, a quick way to assess it would be to compare the given dataset with Taylor Swift's [Wikipedia](https://en.wikipedia.org/wiki/List_of_songs_by_Taylor_Swift#Y) list of songs and compute a matching score for the number of songs and albums. This could be done with the wikipedia-api and beautifulsoup4 packages.
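
A matching score of this kind could be as simple as the share of reference titles that also appear in the dataset. The following is a minimal sketch under stated assumptions: both title lists are made up and hard-coded, no page is fetched or parsed, and `normalize` is a hypothetical helper that only handles case and whitespace.

```python
def normalize(title: str) -> str:
    """Lower-case a title and collapse whitespace so near-identical titles match."""
    return " ".join(title.casefold().split())


# Made-up stand-ins for the dataset's track names and an external reference list.
dataset_titles = ["Cruel Summer", "gorgeous ", "Bad Blood"]
reference_titles = ["Cruel Summer", "Gorgeous", "Lover", "Bad Blood"]

dataset_set = {normalize(t) for t in dataset_titles}
reference_set = {normalize(t) for t in reference_titles}

# Titles found in both sources, and reference titles the dataset lacks.
matched = dataset_set & reference_set
missing_from_dataset = sorted(reference_set - dataset_set)

matching_score = round(len(matched) * 100 / len(reference_set), 2)
print(f"Matching score: {matching_score}%")
print(f"Missing from dataset: {missing_from_dataset}")
```

Real titles would likely need more normalization (for example, suffixes such as "(Taylor's Version)"), which this sketch does not attempt.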

## 6. Quality Issues Summary

* Completeness

In [19]:
print(
    f"Completeness ratio of {round((number_of_fields-null_rows_fixed)*100/number_of_fields,2)}%.\nOut of {number_of_fields} records there are {null_rows_fixed} with at least one empty value.",
    f"\nColumns with unfilled data {columns_with_null_fixed}",
)

Completeness ratio of 96.29%.
Out of 539 records there are 20 with at least one empty value. 
Columns with unfilled data ['track_id', 'track_name', 'audio_features.danceability', 'audio_features.energy', 'audio_features.key', 'audio_features.loudness', 'audio_features.speechiness', 'audio_features.acousticness', 'audio_features.liveness', 'audio_features.tempo', 'audio_features.time_signature']


* Uniqueness

In [20]:
print(
    f"There are {count_of_album_id.size} albums in the response, and {albums_above_1.shape[0]} repetitions."
)
albums_above_1

There are 26 albums in the response, and 1 repetitions.


Unnamed: 0,album_id,count,album_name,album_release_date,album_total_tracks,album.artist_id
0,1NAmidJlEaVgA3MpcPFYGq,2,Lover,2019-08-23,18,06HL4z0CvFAxyc27GX


In [21]:
print(
    f"There are {albums_track_id_repeated_filtered.shape[0]} songs that are repeated in the same album."
)
albums_track_id_repeated_filtered

There are 1 songs that are repeated in the same album.


Unnamed: 0,album_id,track_id,count
156,1fnJ7k0bllNfL1kVdNVW1A,3xYJScVfxByb61dYHTwiby,2


In [22]:
print(
    f"There are {len(bad_tracks)} songs whose information is not consistent between entries"
)
bad_repetitions_track_db

There are 1 songs whose information is not consistent between entries


Unnamed: 0,disc_number,duration_ms,explicit,track_number,track_popularity,track_id,track_name,track.album_id
278,1,178426,False,2,99,1BxfuPKGuaTgP7aM0Bbdwr,Cruel Summer,1NAmidJlEaVgA3MpcPFYGq
296,1,178426,No,2,99,1BxfuPKGuaTgP7aM0Bbdwr,Cruel Summer,1NAmidJlEaVgA3MpcPFYGq


In [23]:
print(
    f"There are {len(bad_audio_features)} tracks whose audio_features are not consistent between entries"
)
bad_repetitions_features

There are 0 tracks whose audio_features are not consistent between entries


Unnamed: 0,track_id,audio_features.danceability,audio_features.energy,audio_features.key,audio_features.loudness,audio_features.mode,audio_features.speechiness,audio_features.acousticness,audio_features.instrumentalness,audio_features.liveness,audio_features.valence,audio_features.tempo,audio_features.id,audio_features.time_signature


In [24]:
print(
    f"There are {audio_features_missmatch_track_id} audio_features without a matching track_id (this often happens when track_id==pd.NA)"
)
missmatch_audio_features_db

There are 8 audio_features without a matching track_id (this often happens when track_id==pd.NA)


Unnamed: 0,disc_number,duration_ms,explicit,track_number,track_popularity,track_id,track_name,audio_features.danceability,audio_features.energy,audio_features.key,...,audio_features.tempo,audio_features.id,audio_features.time_signature,artist_id,artist_name,artist_popularity,album_id,album_name,album_release_date,album_total_tracks
321,1,209680,False,8,84,,Gorgeous,0.8,0.535,7.0,...,92.027,1ZY1PqizIl78geGM4xWlEA,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,6DEjYFkNZh67HP7R9PSZvv,reputation,2017-11-10,15
363,1,238093,False,35,32,,Jump Then Fall,0.617,,2.0,...,80.007,5zytSTR2g0I9psX2Z12ex6,,06HL4z0CvFAxyc27GX,Taylor Swift,120,1MPAXuTVL2Ej5x0JHiSPq8,,2017-11-09,46
375,1,212600,False,1,60,,Welcome To New York,0.789,0.634,7.0,...,116.992,3nRmDz7qGCvsMS30rGGY0x,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,1yGbNOtRIgdIiGHOEBaZWf,1989 (Deluxe),2014-01-01,19
379,1,193293,False,5,60,,All You Had To Do Was Stay,0.605,0.725,5.0,...,96.97,6aLOekfwbytwWvQftxTEF0,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,1yGbNOtRIgdIiGHOEBaZWf,1989 (Deluxe),2014-01-01,19
382,1,211933,False,8,61,,Bad Blood,0.646,0.794,7.0,...,170.216,2NlmmAjGYrrjAp0MED5rGx,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,1yGbNOtRIgdIiGHOEBaZWf,1989 (Deluxe),2014-01-01,19
434,1,362826,False,6,54,,Back To December/Apologize/You're Not Sorry - ...,0.374,0.516,2.0,...,142.057,1IsquhJFJ0qcFZI7FeAEuN,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,6fyR4wBPwLHKcRtxgd4sGh,,2010-10-25,16
442,1,389213,False,14,49,,Enchanted - Live/2011,0.34,0.663,8.0,...,163.678,3lm4L3pPL32PFy74dR17OR,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,6fyR4wBPwLHKcRtxgd4sGh,,2010-10-25,16
445,1,230546,False,1,49,,Mine - POP Mix,0.696,0.768,7.0,...,121.05,0GxW5K0qzrq7L1jwSY5OmY,4.0,06HL4z0CvFAxyc27GX,Taylor Swift,120,6Ar2o9KCqcyYF9J0aQP3au,Speak Now,2010-10-25,14


In [25]:
print(
    f"There are {albums_track_number_repeated_filtered.shape[0]} track numbers that are repeated in the same disc_number of an album."
)
albums_track_number_repeated_filtered

There are 2 track numbers that are repeated in the same disc_number of an album.


Unnamed: 0,album_id,track_number,disc_number,count
169,1fnJ7k0bllNfL1kVdNVW1A,21,1,2
443,6DEjYFkNZh67HP7R9PSZvv,8,1,2


In [26]:
print(
    f"There are {(spotify_wrapper.albums_db['track_difference_spotify_count'] != 0).sum()} albums where the 'album_total_tracks' attribute does not match the count of non-repeated tracks for each album."
)
spotify_wrapper.albums_db[
    spotify_wrapper.albums_db["track_difference_spotify_count"] != 0
][
    [
        "album_id",
        "album_name",
        "album_total_tracks_fixed",
        "actual_track_count",
        "track_difference_spotify_count",
    ]
]

There are 5 albums where the 'album_total_tracks' attribute does not match the count of non-repeated tracks for each album.


Unnamed: 0,album_id,album_name,album_total_tracks_fixed,actual_track_count,track_difference_spotify_count
3,1fnJ7k0bllNfL1kVdNVW1A,Midnights (The Til Dawn Edition),24,23,-1
6,6kZ42qRrzov54LcAk4onW9,Red (Taylor's Version),34,30,-4
9,2Xoteh7uEpea4TohMxjtaq,evermore,10,15,5
15,6DEjYFkNZh67HP7R9PSZvv,reputation,15,16,1
26,5eyZZoQEFQWRHkV2xgAeBw,Taylor Swift,13,15,2


* Timeliness

In [27]:
print(
    f"Timeliness ratio: {round((number_of_fields-albums_after_birthday.shape[0])*100/number_of_fields,2)}% \nThere are {len(albums_after_birthday['album_id'].unique())} albums that were made before Taylor Swift was born. Tracks affected: {albums_after_birthday.shape[0]}.\nList of albums: {albums_after_birthday['album_id'].unique()}."
)

Timeliness ratio: 97.22% 
There are 1 albums that were made before Taylor Swift was born. Tracks affected: 15.
List of albums: ['5eyZZoQEFQWRHkV2xgAeBw'].


* Validity and accuracy 

In [28]:
validator_accuracy_df

Unnamed: 0,columns,na_ratios,validity_ratios,accuracy_ratios,bad_formatted_data,bad_accuracy_data
0,disc_number,0.0,100.0,100.0,[],[]
1,track_number,0.0,100.0,100.0,[],[]
2,album_total_tracks,0.0,97.22,97.22,[Thirteen],[]
3,duration_ms,0.0,100.0,99.63,[],"[-107133, -223093]"
4,explicit,0.0,99.07,99.07,"[No, Si]",[]
5,audio_features.tempo,0.19,99.81,99.81,[],[]
6,track_id,1.48,98.52,98.52,[],[]
7,artist_id,0.0,100.0,100.0,[],[]
8,album_id,0.0,100.0,100.0,[],[]
9,album_name,11.5,88.5,88.5,[],[]
