# Homework 4 - Tidy and Process the Billboard Dataset
The Billboard dataset comes with **76 weekly columns**, `x1st.week` through `x76th.week`, each holding a song's chart position for that week. This is a classic example of **wide** data that needs to be **melted** (unpivoted) into a long (tidy) format.

### Instructions
1. Follow the instructions on how to set up your Python and Jupyter (or VSCode) environment and how to clone or download our repository. Instructions can be found in the class notes.
2. Fill in the missing pieces of code in the provided notebook.
3. Run the notebook and make sure everything works.


### Dataset Overview
The dataset consists of songs and their weekly chart positions on the Billboard Hot 100. The dataset contains the following columns:
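- `year`: The chart year the song is listed under.
- `artist.inverted`: The artist of the song, with the surname first where applicable (e.g. `Aguilera, Christina`).
- `track`: The title of the song.
- `time`: The duration of the song (`m:ss`).
- `genre`: The genre of the song.
- `date.entered`: The date the song entered the chart.
- `date.peaked`: The date the song reached its peak chart position.
- `x1st.week` to `x76th.week`: The chart position of the song for each week.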

### Goals

1. **Load** the Billboard dataset from CSV.
2. **Tidy** the data so each row represents one song in one week.
3. **Calculate** the actual date for each week using `date.entered + week * 7 days`.
4. **Split** the data into two tables:
   - A **songs** table with static song information.
   - A **positions** table with `(song_id, week, rank, date)`.
5. **Save** the tidy data to **Feather** format in the same directory with `_tidy` suffix.

### Submission Guidelines

- Submit your completed notebook as an HTML export or a PDF file.

To export to HTML, if you are on Jupyter, select `File` > `Export Notebook As` > `HTML`.

If you are on VSCode, you can use the `Jupyter: Export to HTML` command:
- Open the command palette (Ctrl+Shift+P, or Cmd+Shift+P on Mac).
- Search for `Jupyter: Export to HTML`.
- Save the HTML file to your computer and submit it via Canvas.

---

In [1]:
import os
# Local directory
print(os.getcwd())

c:\Ricardo\2025-02 SP25 USABLE ARTIFICIAL INTELLIGENCE\GitHub\usable_ai\Homework


In [2]:
import pandas as pd

# 1. Load the Billboard dataset
df_bill = pd.read_csv("../Datasets/billboard.csv")

# Let's check a few columns to see the structure.
df_bill.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,...,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,...,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,...,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,...,,,,,,,,,,


 The dataset has columns like:

 - **year**, **artist.inverted**, **track**, **time**, **genre** … (song info)

 - **date.entered**, **date.peaked** … (chart-related dates)

 - **x1st.week** through **x76th.week** … (chart positions over 76 weeks)



 We want to **melt** these weekly columns into a single `week` and `rank` column.
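
As a small illustration of what melting does, here is a sketch on a made-up table with two songs and three weekly columns. The frame `wide` and its values are hypothetical and are not taken from the Billboard file.

```python
import pandas as pd

# Toy wide table: two songs, three weekly columns (values are made up).
wide = pd.DataFrame({
    "track": ["Song A", "Song B"],
    "x1st.week": [90, 75],
    "x2nd.week": [80, None],
    "x3rd.week": [70, None],
})

# Every column not listed in id_vars is unpivoted into (week, rank) pairs.
long = wide.melt(id_vars=["track"], var_name="week", value_name="rank")

# Weeks in which a song was off the chart show up as NaN and can be dropped.
long = long.dropna(subset=["rank"]).reset_index(drop=True)
print(long)
```

Song B only charted for one week in this toy table, so it contributes a single row after the `dropna` step, while Song A contributes three.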

In [3]:
# Your code here
# Melt the dataset to long format
df_melted = df_bill.melt(
    id_vars=['year', 'artist.inverted', 'track', 'time', 'genre', 'date.entered', 'date.peaked'], 
    var_name='week', 
    value_name='rank'
)

# Remove rows where rank is NaN (i.e., the song was not on the chart that week)
df_melted.dropna(subset=['rank'], inplace=True)

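# Sort values for better readability. 'week' is still a string label here, so weeks
# within a song sort lexicographically (e.g. 'x10th.week' comes before 'x1st.week').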
df_melted = df_melted.sort_values(by=['artist.inverted', 'track', 'week'])

# Reset index
df_melted.reset_index(drop=True, inplace=True)

# Display the tidied data
df_melted.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,x1st.week,87.0
1,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,x2nd.week,82.0
2,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,x3rd.week,72.0
3,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,x4th.week,77.0
4,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,x5th.week,87.0


 Notice how each row is now **one song** in **one week**. However, the `week` column currently contains strings like `"x1st.week"`, `"x2nd.week"`, etc. Let's clean those up and create a numeric week column.

In [4]:
# Your code here

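# Convert 'week' column to numerical format (extracting the week number).
# A raw string avoids the invalid-escape warning for '\d'; expand=False returns a Series.
df_melted['week'] = df_melted['week'].str.extract(r'(\d+)', expand=False).astype(int)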

df_melted.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,1,87.0
1,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,2,82.0
2,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,3,72.0
3,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,4,77.0
4,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,5,87.0
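
While the week labels are strings, sorting on them is lexicographic rather than chronological. The sketch below shows the difference on a few hypothetical labels, and how a sort on the numeric column behaves after the conversion. For the real data, re-sorting by `['artist.inverted', 'track', 'week']` after this step would be one way to get chronological order within each song.

```python
import pandas as pd

labels = pd.Series(["x1st.week", "x2nd.week", "x10th.week", "x9th.week"])

# As strings, "x10th.week" sorts before "x1st.week" because '0' < 's'.
print(sorted(labels))

# After extracting the digits, the order is numeric.
weeks = labels.str.extract(r"(\d+)", expand=False).astype(int)
print(sorted(weeks))

# Sorting a frame on the numeric column gives chronological order.
toy = pd.DataFrame({"track": ["Song A"] * 4, "week": weeks})
toy = toy.sort_values(["track", "week"]).reset_index(drop=True)
print(toy)
```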


 Now, `week = 1, 2, 3, ... 76`. Next, we want to calculate the **exact date** on the chart for each row by adding `week * 7` days to `date.entered`. Create a column named "date" to hold the result. See the expected result in our lecture materials for tidy data.

In [5]:
# Note that after doing that, you should have a new column called date
# Your code here

# transform "date.entered" to datetime format and add 7 * "week" as days
df_melted["date"] = pd.to_datetime(df_melted["date.entered"]) + pd.to_timedelta(df_melted["week"] * 7, unit="D")

df_melted.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank,date
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,1,87.0,2000-03-04
1,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,2,82.0,2000-03-11
2,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,3,72.0,2000-03-18
3,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,4,77.0,2000-03-25
4,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,5,87.0,2000-04-01
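
The date arithmetic can be checked on a toy frame. With `week * 7`, week 1 lands seven days after `date.entered`, which matches the output above. The second column in this sketch is an alternative convention, shown only for comparison and not the formula this assignment asks for: it treats `date.entered` as the date of week 1 itself. The frame `toy` and the column name `date_week1_is_entry` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "date.entered": ["2000-02-26", "2000-02-26", "2000-02-26"],
    "week": [1, 2, 3],
})
entered = pd.to_datetime(toy["date.entered"])

# Formula used in this notebook: date.entered + week * 7 days.
toy["date"] = entered + pd.to_timedelta(toy["week"] * 7, unit="D")

# Alternative convention (an assumption, for comparison only):
# week 1 falls on date.entered, so the offset is (week - 1) * 7 days.
toy["date_week1_is_entry"] = entered + pd.to_timedelta((toy["week"] - 1) * 7, unit="D")

print(toy)
```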


 ### Split into Two Tables



 **Why split?** We often separate the **static** song info (e.g., artist, track, time, genre) from the **weekly** chart performance (week, rank, date).



 - **Songs Table**: Contains unique identifiers for each song plus basic metadata.

 - **Positions Table**: Contains `(song_id, week, rank, date)`, referencing the **song_id** from the songs table.

In [6]:
# Your code here

# Create a Songs table with unique song_id
df_songs = df_melted[["artist.inverted", "track", "time", "genre"]].drop_duplicates().reset_index(drop=True)
df_songs["song_id"] = df_songs.index  # Assign a unique ID

print('df_songs shape:', df_songs.shape,'\n')

df_songs.head()

df_songs shape: (317, 5) 



Unnamed: 0,artist.inverted,track,time,genre,song_id
0,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,0
1,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B,1
2,3 Doors Down,Kryptonite,3:53,Rock,2
3,3 Doors Down,Loser,4:24,Rock,3
4,504 Boyz,Wobble Wobble,3:35,Rap,4


Next, we merge this `song_id` back into `df_melted` so we can create the positions table.

In [7]:
# Your code here

# Create the Positions table
# Merge to get song_id in the main dataframe
df_positions = df_melted.merge(df_songs, on=["artist.inverted", "track", "time", "genre"], how="left")

print('df_positions shape:',df_positions.shape,'\n')

df_positions.head()

df_positions shape: (5307, 11) 



Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank,date,song_id
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,1,87.0,2000-03-04,0
1,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,2,82.0,2000-03-11,0
2,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,3,72.0,2000-03-18,0
3,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,4,77.0,2000-03-25,0
4,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,5,87.0,2000-04-01,0


 ### Create the Positions Table



 We only keep the **relevant columns** for weekly positions: `song_id`, `week`, `rank`, and `date`.

In [8]:
# Your code here

# Keep only relevant columns in Positions table
df_positions = df_positions[["song_id", "week", "rank", "date"]]

print('df_positions shape:',df_positions.shape,'\n')

df_positions.head()

df_positions shape: (5307, 4) 



Unnamed: 0,song_id,week,rank,date
0,0,1,87.0,2000-03-04
1,0,2,82.0,2000-03-11
2,0,3,72.0,2000-03-18
3,0,4,77.0,2000-03-25
4,0,5,87.0,2000-04-01
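
A quick sanity check for a split like this is to join the positions back to the songs and confirm that no rows are gained or lost. The sketch below uses two small made-up tables (`songs` and `positions` are hypothetical stand-ins for `df_songs` and `df_positions`) and the `validate` argument of `merge`, which raises an error if `song_id` is not unique on the songs side.

```python
import pandas as pd

songs = pd.DataFrame({"song_id": [0, 1], "track": ["Song A", "Song B"]})
positions = pd.DataFrame({
    "song_id": [0, 0, 1],
    "week": [1, 2, 1],
    "rank": [90.0, 80.0, 75.0],
})

# validate="many_to_one" raises MergeError if song_id is duplicated in songs.
rejoined = positions.merge(songs, on="song_id", how="left", validate="many_to_one")

# A left join on a unique key must not add or drop rows,
# and every position row should find its song.
assert len(rejoined) == len(positions)
assert rejoined["track"].notna().all()
print(rejoined)
```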


## 8. Playing with the data
Now that we have our data in a tidy format, let's do some analysis.

### Only songs that reached top 10
We can use `query()` to filter the data for songs that reached the top 10 at least once. We will merge this back to the songs table to get the song details.

Get a dataframe with the songs that reached the top 10 and their details.

In [9]:
# Your code here

# Keep every weekly record with rank <= 10 (lower is better); one row per song per week
df_top_10_songs = df_positions.query("rank <= 10")[["song_id","rank"]]

print('All records in df_positions where Rank <=10 - shape:', df_top_10_songs.shape)

display(df_top_10_songs)

df_top_10_with_details = df_top_10_songs.merge(df_songs, on="song_id", how="left")

# Display the resulting dataframe
print(df_top_10_with_details)

All records in df_positions where Rank <=10 - shape: (524, 2)


Unnamed: 0,song_id,rank
23,2,7.0
24,2,6.0
25,2,6.0
26,2,6.0
27,2,5.0
...,...,...
5279,316,5.0
5280,316,4.0
5281,316,4.0
5282,316,6.0


     song_id  rank  artist.inverted       track  time genre
0          2   7.0     3 Doors Down  Kryptonite  3:53  Rock
1          2   6.0     3 Doors Down  Kryptonite  3:53  Rock
2          2   6.0     3 Doors Down  Kryptonite  3:53  Rock
3          2   6.0     3 Doors Down  Kryptonite  3:53  Rock
4          2   5.0     3 Doors Down  Kryptonite  3:53  Rock
..       ...   ...              ...         ...   ...   ...
519      316   5.0  matchbox twenty        Bent  4:12  Rock
520      316   4.0  matchbox twenty        Bent  4:12  Rock
521      316   4.0  matchbox twenty        Bent  4:12  Rock
522      316   6.0  matchbox twenty        Bent  4:12  Rock
523      316   9.0  matchbox twenty        Bent  4:12  Rock

[524 rows x 6 columns]


You may want to remove duplicates to get a list of unique songs that reached the top 10. See `df.drop_duplicates()` for more details.

In [10]:
# Your code here

# Drop the 'rank' column before removing duplicates
df_unique_top_10_songs = df_top_10_with_details.drop(columns=["rank"]).drop_duplicates(subset=["song_id"])

print('All unique top 10 songs - shape:', df_unique_top_10_songs.shape)

display(df_unique_top_10_songs)


All unique top 10 songs - shape: (53, 5)


Unnamed: 0,song_id,artist.inverted,track,time,genre
0,2,3 Doors Down,Kryptonite,3:53,Rock
18,5,98°,Give Me Just One Night (Una Noche),3:24,Rock
23,8,Aaliyah,Try Again,4:03,Rock
38,11,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock
46,12,"Aguilera, Christina",I Turn To You,4:00,Rock
50,13,"Aguilera, Christina",What A Girl Wants,3:18,Rock
57,19,"Anthony, Marc",You Sang To Me,3:50,Latin
63,23,"Backstreet Boys, The",Shape Of My Heart,3:49,Rock
66,24,"Backstreet Boys, The",Show Me The Meaning Of Being Lonely,3:54,Rock
74,26,"Badu, Erkyah",Bag Lady,5:03,Rock


### How long did each song stay in the top 10?
Add to the current dataframe, or create a new dataframe, with the following columns:
- `song_id` : the song id
- `weeks_in_top_10` : the number of weeks the song was in the top 10

In [11]:
# Your code here

# Group by song_id and count the number of weeks the song was in the top 10
weeks_in_top_10 = df_top_10_songs.groupby("song_id")["rank"].count().reset_index(name="weeks_in_top_10")

display(weeks_in_top_10)


Unnamed: 0,song_id,weeks_in_top_10
0,2,18
1,5,5
2,8,15
3,11,8
4,12,4
5,13,7
6,19,6
7,23,3
8,24,8
9,26,4


### In which week did each song reach the top 10?
Create a new dataframe, or add to an existing one, with the following column:
- `week_reached_top_10` : the week in which the song reached the top 10 for the first time

In [12]:
# Your code here

# Merge the result with df_unique_top_10_songs to add the 'weeks_in_top_10' column
df_unique_top_10_songs = df_unique_top_10_songs.merge(weeks_in_top_10, on="song_id", how="left")

print('All unique top 10 songs - shape:', df_unique_top_10_songs.shape)

# Display the resulting dataframe
display(df_unique_top_10_songs)

All unique top 10 songs - shape: (53, 6)


Unnamed: 0,song_id,artist.inverted,track,time,genre,weeks_in_top_10
0,2,3 Doors Down,Kryptonite,3:53,Rock,18
1,5,98°,Give Me Just One Night (Una Noche),3:24,Rock,5
2,8,Aaliyah,Try Again,4:03,Rock,15
3,11,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,8
4,12,"Aguilera, Christina",I Turn To You,4:00,Rock,4
5,13,"Aguilera, Christina",What A Girl Wants,3:18,Rock,7
6,19,"Anthony, Marc",You Sang To Me,3:50,Latin,6
7,23,"Backstreet Boys, The",Shape Of My Heart,3:49,Rock,3
8,24,"Backstreet Boys, The",Show Me The Meaning Of Being Lonely,3:54,Rock,8
9,26,"Badu, Erkyah",Bag Lady,5:03,Rock,4
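
One way to obtain `week_reached_top_10` is to take the smallest `week` among each song's top-10 rows. The sketch below uses a made-up positions table rather than the real data. Note that it filters from a table that still has the `week` column, which `df_top_10_songs` as built earlier does not keep.

```python
import pandas as pd

# Toy positions table; the values are made up for illustration.
positions = pd.DataFrame({
    "song_id": [0, 0, 0, 1, 1, 2],
    "week":    [1, 2, 3, 1, 2, 1],
    "rank":    [15.0, 9.0, 4.0, 8.0, 12.0, 50.0],
})

# Keep only the weeks in which a song was ranked 10 or better.
top_10 = positions[positions["rank"] <= 10]

# The first top-10 week per song is the minimum week among those rows.
week_reached_top_10 = (
    top_10.groupby("song_id")["week"]
    .min()
    .reset_index(name="week_reached_top_10")
)
print(week_reached_top_10)
```

Song 2 never reaches the top 10 in this toy table, so it does not appear in the result. The result could then be merged into the unique-songs dataframe on `song_id`, in the same way as `weeks_in_top_10`.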


 ### 9. Save Tidy Data to Feather



 We want to save:

 - The **tidy** DataFrame (`df_tidy`) to a single file with the suffix `_tidy`.

 - (Optionally) Also save **songs** and **positions** as separate Feather files if needed.
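
If the `_tidy` suffix is meant to be applied to the source file name (an assumption; the cell below uses fixed file names instead), the output path can be derived from the input path. This is a minimal sketch, and `src` and `out_path` are hypothetical names.

```python
from pathlib import PurePosixPath

# Path of the CSV loaded at the start of the notebook.
src = PurePosixPath("../Datasets/billboard.csv")

# Same directory, same base name, with "_tidy" appended and a .feather extension.
out_path = src.with_name(src.stem + "_tidy.feather")
print(out_path)
```

The resulting path could then be passed to `df_tidy.to_feather(str(out_path))`.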

In [13]:
# Your code here

print("df_melted shape:", df_melted.shape)
display(df_melted.head())

# Take out columns 'date.entered' and 'date.peaked' so it becomes df_tidy
df_tidy = df_melted.drop(columns=["date.entered", "date.peaked"])

print("\ndf_tidy shape:", df_tidy.shape)
display(df_tidy)

# Store under the Datasets directory as Feather files
df_tidy.to_feather("../Datasets/df_tidy.feather")

df_songs.to_feather("../Datasets/df_songs.feather")

df_positions.to_feather("../Datasets/df_positions.feather")






df_melted shape: (5307, 10)


Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank,date
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,1,87.0,2000-03-04
1,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,2,82.0,2000-03-11
2,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,3,72.0,2000-03-18
3,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,4,77.0,2000-03-25
4,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2000-02-26,2000-03-11,5,87.0,2000-04-01



df_tidy shape: (5307, 8)


Unnamed: 0,year,artist.inverted,track,time,genre,week,rank,date
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,1,87.0,2000-03-04
1,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2,82.0,2000-03-11
2,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,3,72.0,2000-03-18
3,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,4,77.0,2000-03-25
4,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,5,87.0,2000-04-01
...,...,...,...,...,...,...,...,...
5302,2000,matchbox twenty,Bent,4:12,Rock,5,22.0,2000-06-03
5303,2000,matchbox twenty,Bent,4:12,Rock,6,21.0,2000-06-10
5304,2000,matchbox twenty,Bent,4:12,Rock,7,18.0,2000-06-17
5305,2000,matchbox twenty,Bent,4:12,Rock,8,16.0,2000-06-24
