## Spotify Song-Album Popularity Data Cleaning

This notebook loads the merged CSV produced by the data collection step, removes duplicate tracks, coerces column types, and drops rows with missing values.

In [70]:
import pandas as pd

In [71]:
csv = pd.read_csv("raw_data/merged_eda_data.csv")

### Remove Duplicates

In [85]:
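# Keep the first row seen for each track_id; .copy() makes an independent frame
# so later column assignments act on it rather than on a slice of `csv`
deduped_csv = csv.drop_duplicates(subset="track_id").copy()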

In [86]:
# len() counts rows (tracks); DataFrame.size would count rows * columns
n_original = len(csv)
n_deduped = len(deduped_csv)
print(f"There were {n_original} tracks in the original merged file.")
print(f"There are {n_deduped} tracks in the deduped file.")
print()
print(f"Hence, there were {n_original - n_deduped} duplicate tracks during data collection.")
print(f"That corresponds to ~{int(((n_original - n_deduped) / n_original) * 100)} % of the data collected.")

There were 75761 tracks in the original merged file.
There are 40415 tracks in the deduped file.

Hence, there were 35346 duplicate tracks during data collection.
That corresponds to ~46 % of the data collected.
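
When counting tracks, it helps to remember that `DataFrame.size` is the number of cells (rows times columns), while `len(df)` is the number of rows. The sketch below illustrates the difference on a small invented frame; the name `toy` and its values are assumptions made for illustration and are not part of the collected data.

```python
import pandas as pd

# Toy stand-in for the merged file; column names mirror the notebook,
# but the values are invented for illustration.
toy = pd.DataFrame(
    {
        "track_id": ["a", "b", "a", "c", "b"],
        "album_id": ["x", "x", "y", "z", "x"],
        "track_pop": [10.0, 20.0, 11.0, 30.0, 20.0],
    }
)

# Keep the first row seen for each track_id.
toy_deduped = toy.drop_duplicates(subset="track_id", keep="first").copy()

n_rows_before = len(toy)         # number of rows (tracks)
n_rows_after = len(toy_deduped)  # rows left after deduplication
n_cells_before = toy.size        # rows * columns, not a track count

print(f"{n_rows_before - n_rows_after} duplicate rows removed")
print(f"rows: {n_rows_before}, cells: {n_cells_before}")
```

Because every row has the same number of columns, a percentage computed from `.size` matches one computed from `len()`, but the absolute counts differ by a factor equal to the column count.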


In [87]:
deduped_csv.dtypes

track_id             object
album_id             object
track_number          int64
track_count         float64
duration            float64
explicit             object
track_pop           float64
album_pop            object
comparative_pop     float64
danceability        float64
energy              float64
loudness            float64
speechiness         float64
acousticness        float64
instrumentalness     object
liveness            float64
valence             float64
tempo                object
dtype: object
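
An `object` dtype on columns that should be numeric (here `album_pop`, `instrumentalness`, and `tempo`) usually means at least one value could not be parsed as a number. One plausible cause, which the notebook does not confirm, is header rows repeated when the CSVs were merged. A minimal sketch of inspecting the offending values before coercing them, using an invented series:

```python
import pandas as pd

# Hypothetical column resembling `tempo`: mostly numeric strings plus one
# stray non-numeric entry (values invented for illustration).
tempo_raw = pd.Series(["120.5", "98.0", "tempo", "140"], name="tempo")

# errors="coerce" turns anything unparseable into NaN instead of raising.
tempo_num = pd.to_numeric(tempo_raw, errors="coerce")

# Values that were present in the raw data but failed to parse.
bad_values = tempo_raw[tempo_num.isna() & tempo_raw.notna()]

print(tempo_num.dtype)
print(bad_values.tolist())
```

Looking at `bad_values` first shows whether coercion discards genuine measurements or only junk rows.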

### Clean up data types

In [88]:
# Encode the explicit flag as 1/0; the mapping covers both 'TRUE'/'FALSE' strings and real booleans
deduped_csv['explicit'] = deduped_csv['explicit'].replace({'TRUE': 1, 'FALSE': 0, True: 1, False: 0})

In [89]:
deduped_csv['album_pop'] = pd.to_numeric(deduped_csv['album_pop'], errors='coerce')

In [90]:
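# Cast the float-typed track counts and durations to int;
# astype(int) truncates any fractional part and raises if a value is NaN
deduped_csv['track_count'] = deduped_csv['track_count'].astype(int)
deduped_csv['duration'] = deduped_csv['duration'].astype(int)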

In [91]:
deduped_csv['instrumentalness'] = pd.to_numeric(deduped_csv['instrumentalness'], errors='coerce')

In [92]:
deduped_csv['tempo'] = pd.to_numeric(deduped_csv['tempo'], errors='coerce')

### Drop incomplete rows and report the differences

In [93]:
# Drop every row with a missing value in any column, including values coerced to NaN above
clean_data = deduped_csv.dropna()

In [94]:
clean_data.dtypes

track_id             object
album_id             object
track_number          int64
track_count           int64
duration              int64
explicit              int64
track_pop           float64
album_pop           float64
comparative_pop     float64
danceability        float64
energy              float64
loudness            float64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
dtype: object

In [95]:
# len() counts rows (tracks); DataFrame.size would count rows * columns
n_deduped = len(deduped_csv)
n_clean = len(clean_data)
print(f"There were {n_deduped} tracks in the deduped file.")
print(f"There are {n_clean} tracks in the cleaned file.")
print()
print(f"Hence, there were {n_deduped - n_clean} unclean records.")
print(f"That corresponds to ~{int(((n_deduped - n_clean) / n_deduped) * 100)} % of the deduplicated data.")

There were 40415 tracks in the deduped file.
There are 37664 tracks in the cleaned file.

Hence, there were 2751 unclean records.
That corresponds to ~6 % of the deduplicated data.
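
To see which columns account for the dropped rows, the missing values can be counted per column before calling `dropna()`. The following is a sketch on an invented frame; `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented frame with gaps in two columns.
toy = pd.DataFrame(
    {
        "track_id": ["a", "b", "c", "d"],
        "album_pop": [55.0, np.nan, 40.0, np.nan],
        "tempo": [120.0, 98.0, np.nan, np.nan],
    }
)

# Missing values per column.
missing_per_column = toy.isna().sum()

# Rows that dropna() will remove: those with at least one missing value.
rows_with_any_missing = int(toy.isna().any(axis=1).sum())

cleaned = toy.dropna()

print(missing_per_column)
print(f"{rows_with_any_missing} rows dropped, {len(cleaned)} kept")
```

The per-column counts can sum to more than the number of dropped rows, because a single row may be missing several values.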


In [96]:
# Write the cleaned data to a CSV
clean_data.to_csv("raw_data/full_eda_data.csv", index=False)
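
A quick round-trip check can confirm that the written file has the properties this notebook is meant to guarantee. The sketch below uses an in-memory buffer and an invented frame (`toy_clean`) rather than the real output file, so it is an illustration of the idea and not a check that was run on the actual data.

```python
import io

import pandas as pd

# Invented stand-in for the cleaned data.
toy_clean = pd.DataFrame(
    {
        "track_id": ["a", "b", "c"],
        "explicit": [1, 0, 1],
        "tempo": [120.0, 98.5, 140.0],
    }
)

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy_clean.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Properties the cleaned file should have.
no_duplicate_tracks = bool(reloaded["track_id"].is_unique)
no_missing_values = not bool(reloaded.isna().any().any())
explicit_is_int = str(reloaded["explicit"].dtype) == "int64"
tempo_is_float = str(reloaded["tempo"].dtype) == "float64"

print(no_duplicate_tracks, no_missing_values, explicit_is_int, tempo_is_float)
```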