In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [10, 5]

In [None]:
billboard_csv = 'billboard_hot_100_1991-01-05_to_2022-10-01.csv'
df = pd.read_csv(billboard_csv, parse_dates=['date'])

View data before any processing:

In [None]:
df

Reorder columns (any column not listed here is dropped):

In [None]:
df = df.filter(['date', 'pos', 'pos_prev', 'pos_peak', 'weeks', 'artist', 'song'])

In [None]:
df.head()
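
As a side note on the `filter` call above, here is a minimal sketch of how it behaves on a toy frame. The values are made up, and `toy` and `reordered` are hypothetical names used only for illustration. It shows that the columns come back in the requested order and that names missing from the frame are skipped rather than raising an error.

```python
import pandas as pd

# Toy frame with made-up values; the column names mirror the notebook's.
toy = pd.DataFrame({
    'song': ['Song A', 'Song B'],
    'artist': ['Artist X', 'Artist Y'],
    'pos': [1, 2],
})

# `filter` returns columns in the order requested and silently skips
# names that are not present ('weeks' does not exist in this toy frame).
reordered = toy.filter(['pos', 'weeks', 'artist', 'song'])
print(list(reordered.columns))
```

Plain indexing with the same list (`toy[[...]]`) would raise a `KeyError` for the missing name, so `filter` is the more forgiving choice if the CSV layout might vary.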

Add some columns for convenience:
* Add a "year", derived from the "date" column, and use it as the index.
* Add an "artist_song" column, derived from the "artist" and "song" columns. (We'll use this to determine unique songs in each year.)

In [None]:
df['year'] = df['date'].dt.year
df.set_index('year', inplace=True)
df['artist_song'] = df['artist'] + ': ' + df['song']
df[['artist', 'song', 'artist_song']].head()
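
To illustrate why the combined key is useful, here is a small sketch on made-up data (the `toy` frame and its values are assumptions, not rows from the Billboard file). Two different artists share a song title, so counting by `song` alone would merge them, while `artist_song` keeps them apart.

```python
import pandas as pd

# Made-up rows: two artists with an identically titled song.
toy = pd.DataFrame({
    'artist': ['Artist X', 'Artist Y', 'Artist X'],
    'song': ['Same Title', 'Same Title', 'Same Title'],
})
toy['artist_song'] = toy['artist'] + ': ' + toy['song']

unique_titles = toy['song'].nunique()        # titles only
unique_songs = toy['artist_song'].nunique()  # artist + title
print(unique_titles, unique_songs)
```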

Remove the original 'artist' and 'song' columns, since we now have the combined 'artist_song':

In [None]:
df.drop(columns=['artist', 'song'], inplace=True)

In [None]:
# `count()` tallies every chart row, so at this point each song is counted once per week it charted.
_ = df.groupby(['year'])['artist_song'].count().plot.bar(title='Number of chart entries by year')

Sort all chart weeks by year (earliest first), then by peak chart position (best peak, i.e. lowest number, first), then by the number of weeks on the chart (most weeks first).

Finally, drop all rows (chart weeks) with a duplicate year/artist_song pair, keeping only the first of each. After the sort, that first row is the one with the best peak position and the highest week count for that song in that year.

In [None]:
num_rows_raw = len(df)
# `drop_duplicates` only accepts column labels in its subset, so we move the
# `year` index into a regular column first and restore it as the index afterwards.
df = (
    df.sort_values(['year', 'pos_peak', 'weeks'], ascending=[True, True, False])
    .reset_index()
    .drop_duplicates(['year', 'artist_song'])
    .set_index('year')
)
num_rows_unique_by_year = len(df)
df

In [None]:
print('Num rows (raw):', num_rows_raw)
print('Num rows (with unique songs per-year):', num_rows_unique_by_year)
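
The sort-then-deduplicate step can be checked on a small example. This is only a sketch with made-up values (the `toy` and `deduped` names are hypothetical), using the same chain of calls as above: one song charts for three weeks in 2000 and one week in 2001, and a single row per year should survive.

```python
import pandas as pd

# Made-up chart weeks for a single song across two years.
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001],
    'pos_peak': [5, 3, 3, 3],
    'weeks': [1, 2, 3, 4],
    'artist_song': ['Artist X: Song A'] * 4,
}).set_index('year')

deduped = (
    toy.sort_values(['year', 'pos_peak', 'weeks'], ascending=[True, True, False])
    .reset_index()
    .drop_duplicates(['year', 'artist_song'])  # keeps the first row of each pair
    .set_index('year')
)
print(deduped)
```

For 2000, the surviving row is the one with peak position 3 and 3 weeks on the chart; the 2001 row is kept as-is because it is the only entry for that year.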

Find the number of unique charting songs (by 'artist'/'song') for each year:

In [None]:
# Could also use `count()` instead of `nunique()` here, since we've already dropped duplicate songs by year,
# but this way works correctly on the dataframe even before dropping duplicates.
_ = df.groupby(['year'])['artist_song'].nunique().plot.bar(title='Number of unique charting songs by year')
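
The difference between `count()` and `nunique()` on data that has not been deduplicated can be seen in a small sketch. The rows below are made up, and `toy`, `counts`, and `uniques` are hypothetical names used only here.

```python
import pandas as pd

# Made-up chart rows: in 2000, Song A appears in two separate weeks.
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001],
    'artist_song': [
        'Artist X: Song A',
        'Artist X: Song A',
        'Artist Y: Song B',
        'Artist X: Song A',
    ],
}).set_index('year')

grouped = toy.groupby('year')['artist_song']
counts = grouped.count()     # every row, i.e. every chart week
uniques = grouped.nunique()  # distinct songs only
print(counts.to_dict(), uniques.to_dict())
```

Once duplicates within a year are dropped, the two results coincide, which is why either call would work at this point in the notebook.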

One thing that sticks out to me here is that more recent years have more unique charting songs, by a substantial margin.

My naive interpretation is that our tastes are becoming more eclectic, but maybe it has more to do with changes in the music industry and in distribution than with our collective listening patterns or tastes? For example, independent artists may be getting more exposure through Soundcloud/Bandcamp. ([Chart rankings are based on sales (physical and digital), radio play, and online streaming in the United States.](https://www.billboard.com/pro/billboard-changes-streaming-weighting-hot-100-billboard-200/))