# Motivational Qualities of Songs for Daily Activities

In this assignment you will work on a study on song features and how they can be used as the basis for recommendations for specific daily activities. The study is:

* Kim, Y., Aiello, L.M. & Quercia, D. PepMusic: motivational qualities of songs for daily activities. EPJ Data Sci. 9, 13 (2020). https://doi.org/10.1140/epjds/s13688-020-0221-9

You can download the study from the above link. You can use the dataset provided by the authors, which is available inside the present folder at [data_archive_20190201.json](./data_archive_20190201.json).

---

> Panos Louridas, Associate Professor <br />
> Department of Management Science and Technology <br />
> Athens University of Economics and Business <br />
> louridas@aueb.gr

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from matplotlib.lines import Line2D
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

from sklearn.metrics import classification_report, accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

In [None]:
json_data = pd.read_json('./data_archive_20190201.json').transpose()
json_data.head()

In [None]:
# Extract all the information in `features` into separate columns
features_df = pd.json_normalize(json_data['features'])

# Keep the important information in a separate dataframe
df = pd.concat([json_data[['trackId', 'artists', 'songTitle']], features_df], axis=1)
df.head()

## Questions

### Q1: Clustering 

You will perform a clustering on the songs, using KMeans. The authors identify the optimum number of clusters by using the elbow method (gives four clusters) and the silhouette score (gives two clusters) and taking their average, i.e., three clusters.

Use both methods, like the authors, check the results, and then use three clusters. Visualize the clusters by using PCA on two dimensions.

Note that the data given by the authors contain the results of their clustering. Of course this will not be a feature that you will use for your clustering. The features you will use for clustering will be:

* `chordsScale`

* `chordsKey`

* `bpm`

* `rhythmHist`

* `regularity`

* `rhythmPattern`

* `keyKey`

* `loudness`

* `pitchBiHist`

* `keyScale`

Not all of these features are atomic, and not all of these features are numerical, so you should make the necessary transformations in the data so that you get all features in a single two-dimensional matrix.

Once you finish your clustering, compare the clusters that you have found with the clusters that the authors have found; how similar are your clusters to theirs? The authors assign activities, given by `activityType`, to clusters as in Table 2. Interpret your clusters like the authors do in the text of the paper and in figures 5, 6, as best as you can. 

We can see that some columns have categorical values. We will need to get the data in the correct format, since clustering algorithms like KMeans expect a flat, numerical matrix where each feature corresponds to a single numeric value.
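As a rough illustration of what "flat, numerical matrix" means here, the sketch below flattens a single made-up song record in plain Python. The record, the category values, and the `flatten_song` helper are hypothetical. The column naming follows the convention used later in this notebook (`keyScale_major`, `rhythmHist_1`, and so on).

```python
def flatten_song(song, categories):
    """Turn one song record into a flat {column: number} row."""
    row = {}
    for name, value in song.items():
        if isinstance(value, str):
            # One-hot encode: one 0/1 column per known category value.
            for option in categories[name]:
                row[f"{name}_{option}"] = 1.0 if value == option else 0.0
        elif isinstance(value, list):
            # Expand a flat list into one numbered column per element.
            for i, item in enumerate(value, start=1):
                row[f"{name}_{i}"] = float(item)
        else:
            # Plain numbers are kept as a single column.
            row[name] = float(value)
    return row


# Hypothetical record and category values, for illustration only.
song = {'bpm': 120, 'keyScale': 'minor', 'rhythmHist': [0.1, 0.7]}
row = flatten_song(song, {'keyScale': ['major', 'minor']})
print(row)
```

Every value in `row` is a single number, which is the shape KMeans expects for each sample.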

In [None]:
df.info()

In [None]:
# Function that one-hot encodes the categorical columns and returns a dataframe with each column expanded into multiple ones
def encode_categorical_values(df, columns_list):
  encoder = OneHotEncoder()
  categorical_encoded = encoder.fit_transform(df[columns_list]).toarray()
  df = pd.DataFrame(categorical_encoded, columns=encoder.get_feature_names_out(columns_list))
  return df

# Use the method for our categorical data
categorical_columns = ['chordsScale', 'chordsKey', 'keyKey', 'keyScale']
categorical_df = encode_categorical_values(df, categorical_columns)
categorical_df

In [None]:
# Expand columns that hold list values, into multiple columns
def expand_list_columns(df, columns_list):
  expanded_df = []
  for column in columns_list:
    expanded = pd.DataFrame(
      df[column].tolist(),
      columns=[f"{column}_{i+1}" for i in range(len(df[column].iloc[0]))]
    )
    expanded_df.append(expanded)

  expanded_df = pd.concat(expanded_df, axis=1)
  return expanded_df

list_columns = ['rhythmHist', 'rhythmPattern']
lists_df = expand_list_columns(df, list_columns)
lists_df

The `pitchBiHist` is different because it contains **nested lists**, while other columns like `rhythmHist` have flat lists. Instead of expanding it into many columns (which will increase complexity by a lot), we calculate summary statistics like **mean, standard deviation** to capture its overall behavior in a simpler way.
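To make the summary-statistics idea concrete, here is a small sketch on a made-up nested list. The toy values are hypothetical and much smaller than a real `pitchBiHist`. It contrasts summarizing every bin with summarizing only the first row, since the two choices can give quite different numbers. `pstdev` is the population standard deviation, which is also what `np.std` computes by default.

```python
from statistics import mean, pstdev

# Hypothetical toy value standing in for one song's nested `pitchBiHist`.
pitch_bi_hist = [[0.0, 0.2, 0.4], [0.6, 0.8, 1.0]]

# Flatten every row into a single list of bins.
flat = [value for row in pitch_bi_hist for value in row]

# Summary over all bins versus summary over the first row only.
summary_all = (mean(flat), pstdev(flat))
summary_first_row = (mean(pitch_bi_hist[0]), pstdev(pitch_bi_hist[0]))

print(summary_all)
print(summary_first_row)
```

Which variant is appropriate depends on how the nested lists are laid out in the dataset, so it is worth inspecting one value before committing to either.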

In [None]:
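# Aggregate list features; the statistics are taken over the first row (x[0]) of each nested list.
# The Series are renamed so that the two summary columns get distinct names after concatenation.
dpitchBiHist_mean_df = df['pitchBiHist'].apply(lambda x: np.mean(x[0])).rename('pitchBiHist_mean')
dpitchBiHist_std_df = df['pitchBiHist'].apply(lambda x: np.std(x[0])).rename('pitchBiHist_std')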

In [None]:
numerical_columns = ['bpm', 'regularity', 'loudness']

We will now combine all columns into a **flat, numerical matrix** where each feature corresponds to a single numeric value.


In [None]:
processed_df = pd.concat([
    df[numerical_columns], 
    categorical_df, 
    lists_df,
    dpitchBiHist_mean_df,
    dpitchBiHist_std_df
  ], axis=1)
processed_df

In [None]:
processed_df.describe()

We can see that the variances vary a lot, so we need to scale the data using **Standard Scaler** for optimal results.
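For intuition, standard scaling replaces each value with its z-score: subtract the column mean and divide by the column's population standard deviation. The sketch below does this in plain Python on made-up numbers. The `standardize` helper is hypothetical and only mirrors the idea, not the scikit-learn implementation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


bpm = [60.0, 120.0, 180.0]        # made-up tempos
loudness = [-30.0, -20.0, -10.0]  # made-up loudness values

z_bpm = standardize(bpm)
z_loudness = standardize(loudness)
print(z_bpm)
print(z_loudness)
```

Although the two toy columns live on very different scales, their z-scores coincide, so neither would dominate a Euclidean distance.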

In [None]:
scaler = StandardScaler()
scaled_df = scaler.fit_transform(processed_df)

scaled_df = pd.DataFrame(
  scaled_df,
  index=processed_df.index,
  columns=processed_df.columns
)
scaled_df

In [None]:
scaled_df.describe()

After scaling, the features are on a comparable scale (zero mean and unit variance), so no single feature dominates the distance computations. This means that we can proceed with finding the optimal number of clusters in our data.

We apply the **Elbow Method** in order to find the optimal number of clusters to use in our KMeans.
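The quantity usually plotted in an elbow chart is the within-cluster sum of squared distances (often called inertia or distortion). The sketch below computes it in plain Python for one-dimensional toy data. The cluster assignments are hand-picked for illustration, not produced by KMeans, and the `inertia` helper is hypothetical.

```python
def inertia(points, labels):
    """Sum of squared distances from each 1-D point to its cluster mean."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, l in zip(points, labels) if l == cluster]
        centroid = sum(members) / len(members)
        total += sum((p - centroid) ** 2 for p in members)
    return total


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # toy data: two obvious groups

inertia_k1 = inertia(points, [0, 0, 0, 0, 0, 0])
inertia_k2 = inertia(points, [0, 0, 0, 1, 1, 1])
inertia_k3 = inertia(points, [0, 0, 1, 2, 2, 2])
print(inertia_k1, inertia_k2, inertia_k3)
```

The drop from one to two clusters is much larger than the drop from two to three. That bend is the "elbow" we look for in the plot.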

In [None]:
kmeans = KMeans(random_state=0)
visualizer = KElbowVisualizer(kmeans, k=(1,11))

visualizer.fit(scaled_df)
_ = visualizer.show()

We can see from the elbow method's results that one may choose 3 or 4 as k.
We also apply the **Silhouette Score method** to compare the results.

In [None]:
plt.figure(figsize=(2 * 5,  10 * 4))

scores = {}
for n_clusters in range(2, 11):
    plt.subplot(10, 2, n_clusters - 1)
    kmeans = KMeans(n_clusters, random_state=42)
    visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick')
    visualizer.fit(scaled_df)
    scores[n_clusters] = visualizer.silhouette_score_
    plt.title(f'clusters: {n_clusters} score: {visualizer.silhouette_score_}')

In [None]:
# Sort the numbers of clusters by their overall silhouette score, best first
sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

We can see from the results of the Silhouette Score method that one may choose 2 as k.
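As a reminder of what the score measures: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the silhouette coefficient is `(b - a) / max(a, b)`. The plain-Python sketch below applies this to one-dimensional toy data with hand-picked assignments. The `silhouette` helper is hypothetical and is not the implementation used by yellowbrick or scikit-learn.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points (absolute-difference distance)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        own = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
               if l == label and j != i]
        if not own:  # a singleton cluster gets a score of 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # toy data: two obvious groups
score_k2 = silhouette(points, [0, 0, 0, 1, 1, 1])
score_k3 = silhouette(points, [0, 0, 1, 2, 2, 2])
print(round(score_k2, 3), round(score_k3, 3))
```

Splitting one of the two natural groups lowers the mean score, which is why the silhouette method tends to favor fewer, well-separated clusters.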

Like the authors, we settle on the value between the two suggestions and fit our KMeans model with **number of clusters = 3**.

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(scaled_df)
kmeans

We will now map each song to its cluster and add the result as new column in the original dataframe.

In [None]:
# Mapping dictionary
cluster_names = {0: 'intense', 1: 'calm', 2: 'vibrant'}

df['clusteringLabel'] = [cluster_names[cluster] for cluster in clusters]
df.head()

In [None]:
np.unique(kmeans.labels_, return_counts=True)

The songs are split across the three clusters as follows:
* 344 songs were assigned to the **Intense** cluster.
* 257 songs were assigned to the **Calm** cluster.
* 506 songs were assigned to the **Vibrant** cluster.

In [None]:
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_df)
pca.explained_variance_ratio_

* The explained variance ratios for the two principal components are **0.3721** and **0.0392**, respectively.
* The first principal component captures approximately **37.2%** of the total variance in the data, by far the largest share of the two.
* The second principal component accounts for only **3.9%** of the variance, suggesting diminishing returns in additional dimensions.
* Together, these two components explain **41.1%** of the dataset's variance. More components would be needed for a more comprehensive representation, but these two are adequate for visual exploration.
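The arithmetic behind these percentages is simple enough to check directly. The sketch below only uses the two ratios reported above.

```python
# Explained-variance ratios reported above for the two principal components.
ratios = [0.3721, 0.0392]

cumulative = sum(ratios)  # share of variance kept by the 2-D projection
lost = 1 - cumulative     # share of variance the scatterplot cannot show

print(f"kept: {cumulative:.1%}, not shown: {lost:.1%}")
```

More than half of the variance is not visible in the two-dimensional plot, so apparent overlaps or gaps between clusters there should be read with some caution.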

In [None]:
plt.figure(figsize=(6, 4))

mapped_labels = [cluster_names[label] for label in kmeans.labels_]

_ = sns.scatterplot(
  x=pca_data[:, 0], 
  y=pca_data[:, 1], 
  hue=mapped_labels, 
  palette=sns.color_palette('muted', n_colors=3)
  )

plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.show()

From the scatterplot, we can observe that the three clusters occupy distinct regions, with clear separation between the groups. The points within each cluster appear dense and well-defined, suggesting that KMeans has performed well. The PCA transformation reduces dimensionality while maintaining significant structure in the data, providing a simplified yet informative view of the underlying patterns. The separation between clusters highlights the distinctiveness of the groups, which can be useful for further analysis or classification tasks.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot a boxplot showing the bpm for each cluster label
sns.boxplot(
  x='clusteringLabel', 
  y='bpm', 
  data=df, 
  ax=axes[0],
  boxprops=dict(facecolor='none', edgecolor='black')
  )

# Plot a boxplot showing the loudness for each cluster label
sns.boxplot(
  x='clusteringLabel', 
  y='loudness', 
  data=df, 
  ax=axes[1],
  boxprops=dict(facecolor='none', edgecolor='black')
  )

# Plot a boxplot showing the regularity (used here for danceability) for each cluster label
sns.boxplot(
  x='clusteringLabel', 
  y='regularity', 
  data=df, 
  ax=axes[2],
  boxprops=dict(facecolor='none', edgecolor='black')
  )

axes[0].set_ylabel('BPM')
axes[1].set_ylabel('Loudness')
axes[2].set_ylabel('Danceability')

# Remove x-axis labels and grid lines in each subplot
for ax in axes:
    ax.set_xlabel('')
    ax.grid(False)

plt.tight_layout()
plt.show()

The analysis of tempo, loudness, and danceability reveals notable differences across the three music archetypes: calm, vibrant, and intense.
* For both loudness and danceability, the intense group exhibits the highest values, while the calm group has the lowest. The vibrant group falls in between.
* In contrast, the tempo, measured in beats per minute, is highest for the calm group, followed by vibrant and intense. This is counterintuitive; one possible explanation is that the calm group, consisting of instrumental genres like classical and meditation music, tends to feature more complex rhythms.

In [None]:
def radar_factory(keys, values, axes=None, color='b', fontsize=10, figsize=(5, 5)):
    
    """Draw one closed radar (spider) polygon for `values` on polar axes, one axis per key."""
    
    if axes is None:
        # Initialise the spider plot.
        plt.figure(figsize=figsize)
        ax = plt.subplot(111, polar=True)
    else:
        ax = axes
        
    num_categories = len(keys)
 
    # What will be the angle of each axis in the plot? 
    # We divide the plot / number of variables + 1.
    # We add one because we want to do a complete circle.
    angles = np.arange(num_categories + 1)
    angles = angles / num_categories * 2 * np.pi 
    angles[-1] = angles[0]
 
    # We want the first axis to be on top.
    ax.set_theta_offset(np.pi / 2)
    # Theta direction is -1 clockwise,
    # and 1 counterclockwise
    ax.set_theta_direction(-1)
 
    # Draw one axis per variable and add x labels
    ax.tick_params(labelsize=fontsize)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(keys)
 
    # Draw y labels
    ax.set_rlabel_position(0)

    # Close the polygon without mutating the caller's list
    values = list(values) + [values[0]]
    ax.plot(list(angles), values, 
            linewidth=1, linestyle='solid', color=color)

    ax.fill(angles, values, alpha=0.2, color=color)
    yticks = ax.get_yticks()
    ax.set_yticklabels([])
    ylims = ax.get_ylim()
    ax.set_ylim(min(yticks[0], ylims[0]), 
                max(yticks[-1], ylims[1]))

In [None]:
labels = ["C major", "F major", "G major", "A minor", "A# major", "A major", "E major"]
colors = ['b', 'r', 'g']

fig, ax = plt.subplots(figsize=(5, 5), subplot_kw=dict(polar=True))

# Iterate over each clustering label and plot the radar chart
for label, color in zip(df['clusteringLabel'].unique(), colors):
    cluster_data = scaled_df[df['clusteringLabel'] == label]
    values = [cluster_data['chordsKey_C'].mean(), 
              cluster_data['chordsKey_F'].mean(),
              cluster_data['chordsKey_G'].mean(),
              cluster_data['chordsKey_A'].mean(),
              cluster_data['chordsKey_A#'].mean(),
              cluster_data['chordsKey_E'].mean(),
              cluster_data['keyScale_major'].mean()]
    
    radar_factory(labels, values, axes=ax, color=color)

# Disable circular grid lines
ax.yaxis.grid(False)

# Build the legend from the same label/color pairing used in the loop above
custom_legend = [
    Line2D([0], [0], color=color, lw=2, label=label.capitalize())
    for label, color in zip(df['clusteringLabel'].unique(), colors)
]

plt.legend(
    handles=custom_legend, 
    bbox_to_anchor=(1.1, 1.1),
    frameon=True, 
    edgecolor='black'
    )

For the three music archetypes, the top musical keys for each group are as follows:

- **Calm Group:** The most common keys are A Major, A# Major, and A minor.
- **Intense Group:** The leading keys are E Major, G Major, and A# Major.
- **Vibrant Group:** The most frequent keys are A minor, E Major, and F Major.

These key choices reflect the distinct musical preferences and overall mood characteristics within each group.

In [None]:
matches = (json_data['clusteringLabel'] == df['clusteringLabel']).sum()
total = len(json_data)

print(f'{matches} of {total} songs have the same label as in the paper ({matches / total:.1%})')

This level of agreement indicates that the clustering performed with KMeans on the dataset has a moderate match with the authors' labels, but it is far from perfect. The difference between these results and the article's clusters may stem from variations in feature selection, scaling methods, or the clustering algorithm used. Their approach may have utilized a more sophisticated method or domain-specific considerations that yielded different clusters. Therefore, the relatively low percentage suggests that this model could benefit from further refinement and feature adjustments.
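One caveat when comparing cluster labels: KMeans numbers its clusters arbitrarily, so the names attached to clusters 0, 1, and 2 may not line up with the authors' names even when the groupings are similar. A possible check, sketched below in plain Python, is to try every renaming of our labels and keep the best agreement. The label lists are made up and the `best_agreement` helper is hypothetical.

```python
from itertools import permutations


def best_agreement(ours, theirs):
    """Highest fraction of matching labels over all relabelings of `ours`."""
    names = sorted(set(ours) | set(theirs))
    best = 0.0
    for perm in permutations(names):
        mapping = dict(zip(names, perm))  # one candidate renaming of our clusters
        matches = sum(mapping[o] == t for o, t in zip(ours, theirs))
        best = max(best, matches / len(ours))
    return best


# Hypothetical labels: almost the same grouping as the reference, but with names swapped.
ours = ['calm', 'calm', 'intense', 'intense', 'vibrant', 'vibrant']
theirs = ['vibrant', 'vibrant', 'calm', 'calm', 'intense', 'calm']

naive = sum(o == t for o, t in zip(ours, theirs)) / len(ours)
print(naive, best_agreement(ours, theirs))
```

In this toy case the naive comparison finds no matches at all, while the best renaming agrees on five of the six songs. With three clusters there are only six renamings to try, so the exhaustive search is cheap.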

### Q2: Classification

Following the clustering, the authors build a classifier to predict the class (defined as the cluster) of a song. The authors build their classifier using Random Forests and they use a series of models, described in Table 3. Do the same, for all models, using scikit-learn, XGBoost, LightGBM, and CatBoost. Report your results.

Beyond the tree-based classifiers, proceed to build a neural-network classifier using TensorFlow or PyTorch. Report also your results.

In [None]:
scaled_df['clusteringLabel'] = df['clusteringLabel']
scaled_df

In [None]:
# TODO: remove before submission. Keeps only the first 100 rows to speed up experimentation.
scaled_df = scaled_df.iloc[:100]
scaled_df

In [None]:
# Group columns by their prefix
grouped_columns = {
  'chordsScale': [col for col in scaled_df.columns if col.startswith('chordsScale')],
  'chordsKey': [col for col in scaled_df.columns if col.startswith('chordsKey')],
  'keyKey': [col for col in scaled_df.columns if col.startswith('keyKey')],
  'keyScale': [col for col in scaled_df.columns if col.startswith('keyScale')],
  'rhythmPattern': [col for col in scaled_df.columns if col.startswith('rhythmPattern')],
  'rhythmHist': [col for col in scaled_df.columns if col.startswith('rhythmHist')],
  'pitchBiHist': [col for col in scaled_df.columns if col.startswith('pitchBiHist')], 
  'bpm': ['bpm'],
  'regularity': ['regularity'],
  'loudness': ['loudness']
}

In [None]:
categories = {
    'Rhythm Histogram (RH)': (grouped_columns['rhythmHist']),
    'Rhythm Patterns (RP)': (grouped_columns['rhythmPattern']),
    'Stimulative Loudness': (grouped_columns['loudness']),
    'Stimulative Tempo': (grouped_columns['bpm']),
    'Danceability': (
      grouped_columns['bpm'] +
      grouped_columns['regularity'] +
      grouped_columns['rhythmHist'] +
      grouped_columns['rhythmPattern']),
    'Melody': (grouped_columns['pitchBiHist']),
    'Harmony': (
      grouped_columns['chordsScale'] +
      grouped_columns['chordsKey'] +
      grouped_columns['keyKey'] +
      grouped_columns['keyScale']),
    'All except RH and RP': (
      grouped_columns['loudness'] +
      grouped_columns['bpm'] +
      grouped_columns['regularity'] +
      grouped_columns['pitchBiHist'] +
      grouped_columns['chordsScale'] +
      grouped_columns['chordsKey'] +
      grouped_columns['keyKey'] +
      grouped_columns['keyScale']),
    'All': (
      grouped_columns['chordsScale'] +
      grouped_columns['chordsKey'] +
      grouped_columns['keyKey'] +
      grouped_columns['keyScale'] +
      grouped_columns['rhythmPattern'] +
      grouped_columns['rhythmHist'] +
      grouped_columns['pitchBiHist'] +
      grouped_columns['bpm'] +
      grouped_columns['regularity'] +
      grouped_columns['loudness'])
}

# Dictionary that maps each category to a dataframe with its columns plus the label
get_columns_by_category = {
  group: scaled_df[columns + ['clusteringLabel']] for group, columns in categories.items()
  }

In [None]:
def perform_baseline(df):
    X = df.drop('clusteringLabel', axis=1)
    y = df['clusteringLabel']
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, 
        y, 
        test_size=0.3, 
        random_state=42
    )
    
    # Initialize the DummyClassifier with the 'stratified' strategy
    baseline_clf = DummyClassifier(strategy='stratified', random_state=42)
    baseline_clf.fit(X_train, y_train)
    
    y_pred = baseline_clf.predict(X_test)
    
    # Get the classification report
    class_report_dict = classification_report(y_test, y_pred, output_dict=True)

    # Initialize a dictionary to store the metrics
    metrics = {}

    # Add per-class Accuracy, Precision, Recall, F1-Score for each category
    for class_name, metrics_data in class_report_dict.items():
        if class_name not in ['accuracy', 'macro avg', 'weighted avg']:  # Skip non-class metrics
            class_accuracy = (y_pred == y_test)[y_test == class_name].mean()
            metrics[class_name] = {
                'accuracy': class_accuracy,
                'precision': metrics_data.get('precision', None),
                'recall': metrics_data.get('recall', None),
                'f1-score': metrics_data.get('f1-score', None)
            }
    
    return metrics

In [909]:
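def perform_regression(df, model, n_splits=10):
    """Cross-validate a one-vs-rest classifier and return per-class metrics averaged over the folds.

    Despite the name, this is a classification task, not a regression.
    """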
    X = df.drop('clusteringLabel', axis=1)
    y = df['clusteringLabel']

    # Initialize StratifiedKFold with given number of splits
    skf = StratifiedKFold(n_splits, shuffle=True, random_state=42)

    # Store metrics for each fold
    all_metrics = []

    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        clf = OneVsRestClassifier(model)
        # Train the classifier
        clf.fit(X_train, y_train)

        # Predict and get classification report
        y_pred = clf.predict(X_test)
        class_report = classification_report(y_test, y_pred, output_dict=True)

        # Collect fold metrics
        fold_metrics = {'accuracy': accuracy_score(y_test, y_pred)}
        for class_name, metrics in class_report.items():
            if class_name not in ['accuracy', 'macro avg', 'weighted avg']:
                class_accuracy = (y_pred == y_test)[y_test == class_name].mean()
                fold_metrics[class_name] = {
                    'accuracy': class_accuracy,
                    'precision': metrics.get('precision'),
                    'recall': metrics.get('recall'),
                    'f1-score': metrics.get('f1-score')
                }

        all_metrics.append(fold_metrics)

    # Average metrics across folds
    avg_metrics = {}

    for fold_metrics in all_metrics:
        # Average per-class metrics
        for class_name in fold_metrics:
            if class_name not in ['accuracy']:
                if class_name not in avg_metrics:
                    avg_metrics[class_name] = {'accuracy': [], 'precision': [], 'recall': [], 'f1-score': []}

                avg_metrics[class_name]['accuracy'].append(fold_metrics[class_name]['accuracy'])
                avg_metrics[class_name]['precision'].append(fold_metrics[class_name]['precision'])
                avg_metrics[class_name]['recall'].append(fold_metrics[class_name]['recall'])
                avg_metrics[class_name]['f1-score'].append(fold_metrics[class_name]['f1-score'])

    # Calculate the averages
    for class_name, metrics in avg_metrics.items():
        avg_metrics[class_name]['accuracy'] = np.mean(metrics['accuracy'])
        avg_metrics[class_name]['precision'] = np.mean(metrics['precision'])
        avg_metrics[class_name]['recall'] = np.mean(metrics['recall'])
        avg_metrics[class_name]['f1-score'] = np.mean(metrics['f1-score'])

    return avg_metrics

In [910]:
def produce_results_table(model, n_splits=10):
    results = []

    # Assuming get_columns_by_category is your dictionary with categories and DataFrames
    for category, df in get_columns_by_category.items():
        category_metrics = perform_regression(df, model, n_splits)

        # Iterate over each class in the metrics
        for class_name, metrics in category_metrics.items():
            if class_name not in ['accuracy', 'macro avg', 'weighted avg']:  # Exclude non-class metrics
                row = {
                    'Category': category,
                    'Class': class_name,
                    'Accuracy': metrics.get('accuracy', None),
                    'Precision': metrics.get('precision', None),
                    'Recall': metrics.get('recall', None),
                    'F1-Score': metrics.get('f1-score', None),
                }
                results.append(row)

    # Now, get the baseline results for each class, using the dataframe with all the features
    baseline_metrics = perform_baseline(get_columns_by_category['All'])

    # Iterate over each class in the baseline metrics
    for class_name, metrics in baseline_metrics.items():
        if class_name not in ['accuracy', 'macro avg', 'weighted avg']:  # Exclude non-class metrics
            row = {
                'Category': 'Baseline',
                'Class': class_name,
                'Accuracy': metrics.get('accuracy', None),
                'Precision': metrics.get('precision', None),
                'Recall': metrics.get('recall', None),
                'F1-Score': metrics.get('f1-score', None),
            }
            results.append(row)

    # Convert the list of rows into a DataFrame and return it
    return pd.DataFrame(results)

In [912]:
model = RandomForestClassifier(random_state=42, verbose=0)
results_table = produce_results_table(model)
results_table


|   | Category | Class | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
| 0 | Rhythm Histogram (RH) | calm | 0.983333 | 0.985714 | 0.983333 | 0.983217 |
| 1 | Rhythm Histogram (RH) | intense | 0.75 | 0.75 | 0.75 | 0.733333 |
| 2 | Rhythm Histogram (RH) | vibrant | 0.916667 | 0.866667 | 0.916667 | 0.888095 |
| 3 | Rhythm Patterns (RP) | calm | 0.966667 | 0.954762 | 0.966667 | 0.958858 |
| 4 | Rhythm Patterns (RP) | intense | 0.55 | 0.633333 | 0.55 | 0.56 |
| 5 | Rhythm Patterns (RP) | vibrant | 0.783333 | 0.75 | 0.783333 | 0.751429 |
| 6 | Stimulative Loudness | calm | 0.883333 | 0.842857 | 0.883333 | 0.852404 |
| 7 | Stimulative Loudness | intense | 0.45 | 0.516667 | 0.45 | 0.446667 |
| 8 | Stimulative Loudness | vibrant | 0.566667 | 0.636667 | 0.566667 | 0.528571 |
| 9 | Stimulative Tempo | calm | 0.65 | 0.682262 | 0.65 | 0.653231 |
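In the rows above, the Accuracy and Recall columns coincide. This appears to follow from how the per-class accuracy is computed in `perform_regression`: it is the share of correct predictions among the samples that truly belong to a class, which is the definition of recall. The plain-Python sketch below illustrates the equivalence on made-up labels; both helper functions are hypothetical.

```python
def per_class_accuracy(y_true, y_pred, cls):
    """Share of correct predictions among samples whose true class is `cls`."""
    hits = [t == p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(hits) / len(hits)


def recall(y_true, y_pred, cls):
    """True positives divided by all samples whose true class is `cls`."""
    true_pos = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    actual = sum(t == cls for t in y_true)
    return true_pos / actual


# Made-up labels, for illustration only.
y_true = ['calm', 'calm', 'intense', 'vibrant', 'vibrant', 'vibrant']
y_pred = ['calm', 'intense', 'intense', 'vibrant', 'calm', 'vibrant']

for cls in ('calm', 'intense', 'vibrant'):
    print(cls, per_class_accuracy(y_true, y_pred, cls), recall(y_true, y_pred, cls))
```

If a distinct per-class accuracy is wanted, one option is to treat each class as a binary problem (this class versus the rest) and count true negatives as correct predictions too.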


In [None]:
model = XGBClassifier(random_state=42, verbosity=0)
results_table = produce_results_table(model, n_splits=2)
results_table

In [None]:
model = LGBMClassifier(metric='auc', random_state=42, verbose=-1, n_jobs=1)
results_table = produce_results_table(model)
results_table

In [None]:
model = CatBoostClassifier(verbose=0, random_state=42)
# StratifiedKFold needs at least two splits
results_table = produce_results_table(model, n_splits=2)
results_table

## Submission Instructions

You will submit a Jupyter notebook that will contain all your code, data, and analysis. Ensure that the notebook will run correctly in a computer that is not your own. That means, among other things, that it does not contain absolute paths. Remember that a notebook is not a collection of code cells thrown together; it should contain as much text as necessary for a person to understand what you are doing.

## Honor Code

You understand that this is an individual assignment, and as such you must carry it out alone. You may seek help on the Internet, on ChatGPT/Gemini/etc., by Googling or searching in StackOverflow for general questions pertaining to the use of Python and pandas libraries and idioms. However, it is not right to ask direct questions that relate to the assignment and where people will actually solve your problem by answering them. You may discuss with your colleagues in order to better understand the questions, if they are not clear enough, but you should not ask them to share their answers with you, or to help you by giving specific advice.