# RecAgent-Music Tutorial

## What is it? 

This repository is inspired by Lei Wang et al. (2023), ["When Large Language Model based Agent Meets User Behavior Analysis: A Novel User Simulation Paradigm"](https://arxiv.org/abs/2306.02552), and its accompanying [repository](https://github.com/RUC-GSAI/YuLan-Rec).

We aim to simulate the interaction between a user and a music-based recommender system (e.g. Spotify or Deezer). We do so by extracting metadata about a user (e.g. their demographic, favourite artists, etc.), sampling a sequence of their listening history, and then prompting ChatGPT to predict whether the user will play or skip the next song (the "recommended song") in the sequence. In this way, ChatGPT acts as the "mind" of a user. 

Given a sufficiently large dataset, we can thus compare ChatGPT's predictions against the actual actions of the given users. The comparative advantage of using a large language model (LLM) for such a simulation is its ability to pick up on latent features in textual data - for example, understanding a mood shift from one song to the next by analysing the respective lyrics.

*Note: This simulation environment is based on a dataset of real user streaming history from Deezer - unfortunately though, it is not for public consumption. In any case, synthetic data with the same structure can be used in its place.*

## Setup

I have not set this repo up as a package, as I think it'll be used in a fairly standalone way. If any future user desperately needs this, please get in contact with me. Otherwise, here are the necessary steps for setting it up on your local machine:

- I recommend running the code in a virtual environment. To do this, navigate to the project directory, run the command `python3 -m venv .venv` in the terminal, activate the virtual environment with `source .venv/bin/activate`, and then install the dependencies with `pip install -r requirements.txt`.
- I recommend running simulations in a notebook environment like the examples below. (It is not set up like other experiments to instantiate runs in the terminal, though you could of course do this by running a script with the given parameters). 
- Create a `.env` file and add the line `OPENAI_API_KEY=abcdef` (supposing your key is `abcdef` in this example). If you want to use a model name other than `gpt-3.5-turbo-16k`, you can also include `MODEL_NAME` in the `.env` file.
- All three CSV files - `simcares_catalog.csv`, `simcares_streams.csv` and `simcares_users.csv` (or their "big" variants, e.g. `simcares_20FEB2024_catalog.csv`) - should be in the `simcares_data` folder. The three dataframes are linked via the `song_id` and `user_id` columns.
- Ensure an `experiments` folder exists in the project directory - this is where experimental results are saved. (I usually think of each _EXPXXX_ as containing different runs of the same version of the code.)
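
As a rough illustration of the `.env` step, here is a minimal sketch of how the two settings could be resolved, with `MODEL_NAME` falling back to the default. The helper names `parse_env_text` and `resolve_settings` are hypothetical and not part of this repo, which may well load the file differently.

```python
# Hypothetical helpers for illustration; the repo may load .env differently.
DEFAULT_MODEL_NAME = "gpt-3.5-turbo-16k"  # default model named in the setup notes


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def resolve_settings(env):
    """Return (api_key, model_name); fail early if the key is missing."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env")
    return api_key, env.get("MODEL_NAME", DEFAULT_MODEL_NAME)


api_key, model_name = resolve_settings(parse_env_text("OPENAI_API_KEY=abcdef\n"))
print(model_name)  # gpt-3.5-turbo-16k
```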

## Simulation parameters

The simulation has the following parameters: 
- `num_trials`: The number of trials to run in the simulation. Each trial corresponds to a single random user, and potentially many sub-trials based on the other hyperparameters. 
- `length_history_sequence`: The length $L$ of the song history sequence (for the first recommended song). 
- `num_recommendations`: The number of recommended songs $R$ to test per trial (see below note). 
- `only_smartradio`: Whether to only use "smartradio" instances in the Deezer data, which correspond to genuine recommendation sequences (as opposed to iterating through a user-defined playlist, etc.)
- `lyric_options`: The different kinds of song summaries: 
    - `no_lyric_summary`: No song summary or lyrics are included in the prompt.
    - `chatgpt_memory_summary`: Asking ChatGPT to summarise the song _based on its training_, i.e. "Summarise Hello Goodbye by The Beatles" without any extra information. Prompt structure seen in `RecAgent.get_chatgpt_memory_summary(song)`. 
    - `chatgpt_scraped_lyrics_summary`: After scraping the lyrics from azlyrics.com, ask ChatGPT to summarise said lyrics. Prompt structure seen in `RecAgent.get_chatgpt_scraped_lyrics_summary(song)`. 
    - `first_n_lyric_lines`: Condense the scraped lyrics into the first $n$ lines (defaults to $n=4$, controlled with `n_lyric_lines` in Simulation). 
- `user_profile_options`: Controls the verbosity of the user profile based on their traits: 
    - `simple`: Only includes age, gender and all-time favourite genres. 
    - `expanded`: Additionally includes their favourite genres in the morning, in the evening, and recently; their favourite artists; and their favourite decade of music. 
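
To make the `first_n_lyric_lines` option concrete, here is a minimal sketch of condensing lyrics to their first $n$ lines. The function below is an illustrative stand-in, not the repo's implementation; in particular, counting blank lines towards $n$ and joining with spaces are assumptions.

```python
def first_n_lyric_lines(lyrics, n_lyric_lines=4):
    """Keep the first n lines of the lyrics and join them with spaces.

    Assumption: blank lines count towards n and are kept as empty strings.
    """
    lines = lyrics.splitlines()[:n_lyric_lines]
    return " ".join(line.strip() for line in lines)


# Placeholder text, not real lyrics.
example_lyrics = "line one\nline two\n\nline three\nline four"
print(first_n_lyric_lines(example_lyrics))
```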

Note that the simulation operates by iterating over each hyperparameter in `lyric_options`, `user_profile_options` and each of the $R$ recommendations, for each sampled user. In this way, one can directly compare performance between hyperparameters for the same given user instance. 
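
The iteration described above amounts to a Cartesian product over trials, recommendations and the two option lists. The sketch below only counts the resulting sub-trials; the loop order is an assumption, and the real code nests these loops inside `Simulation.run_trial()` rather than building a flat list.

```python
from itertools import product

lyric_options = [
    "no_lyric_summary",
    "chatgpt_memory_summary",
    "chatgpt_scraped_lyrics_summary",
    "first_n_lyric_lines",
]
user_profile_options = ["simple", "expanded"]
num_trials = 5
num_recommendations = 3

# One sub-trial per (user, recommendation index, lyric option, profile option).
sub_trials = list(product(range(num_trials),
                          range(num_recommendations),
                          lyric_options,
                          user_profile_options))
print(len(sub_trials))  # 120
```

With these settings the count is 5 × 3 × 4 × 2 = 120, so the number of ChatGPT prediction calls grows multiplicatively with each option added.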

### Aside on `num_recommendations`

This parameter is a little confusing. Naively, one would think that iterating over a number of recommended songs implies setting up a "conversation" with ChatGPT, where we give it the first $L$ songs and ask it to predict song $S_{L+1}$, then say "yes, that's right, now how about song $S_{L+2}$", etc. However, that next prompt to ChatGPT would still contain all of the same text as the first prompt - we do not gain any "efficiency".

With this in mind, `num_recommendations` becomes a proxy for studying the effect of the length of sequence data for the same random user and sequence of songs. We first ask ChatGPT to predict $S_{L+1}$ given history $\{S_{l}\}_{l=1}^L$, and then ask it to predict $S_{L+2}$ given $\{ S_{l} \}_{l=1}^{L+1}$, all the way up to $S_{L+R}$ where $R$ is `num_recommendations`. As can be seen in the function `Simulation.run_trial()`, the same prompt structure is used each time, just with an extended song history. 

The decision to structure it like this is simply because I don't believe there is any predictive benefit to ChatGPT being told whether it guessed right or wrong on a given song. (With the ability of transformers to do "in-context learning", this would perhaps be different if we ran the simulation for a very long time, but at this scale I don't believe there is any benefit). 
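
The growing-history scheme can be sketched as a generator that yields one (history, target) pair per recommendation. The function name `growing_history_windows` is hypothetical and the songs are plain strings here; the repo works with Song instances.

```python
def growing_history_windows(songs, length_history_sequence, num_recommendations):
    """Yield (history, target) pairs, extending the history by one song each time."""
    needed = length_history_sequence + num_recommendations
    if len(songs) < needed:
        raise ValueError(f"need at least {needed} songs, got {len(songs)}")
    for idx_rec in range(num_recommendations):
        split = length_history_sequence + idx_rec
        yield songs[:split], songs[split]


songs = [f"S{i}" for i in range(1, 9)]  # S1 .. S8
for history, target in growing_history_windows(songs, 5, 3):
    print(len(history), target)
# 5 S6
# 6 S7
# 7 S8
```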

## Structure of Simulation

A simulation trial in `Simulation.run_trial()` operates in the following way: 

1. Select a random user from the data and retrieve their user traits. (The data is pre-processed to filter out any users with a listening history shorter than `length_history_sequence + num_recommendations`). 
2. Sample a sequence of songs from their listening history $\{S_{l}\}_{l=1}^L$, and the songs to be recommended $\{S_{l}\}_{l=L+1}^{L+R}$, stored as lists of Song instances containing traits. (To avoid superfluous prompts to ChatGPT, the song summaries are stored in a cache in case the same song is queried more than once). 
3. For each combination of `lyric_option` and `user_profile_option`, ask ChatGPT to predict whether the next recommended song $S_{L+1}$ is to be played or skipped by the user given their history $\{ S_{l}\}_{l=1}^L$ - this takes place in `RecAgent.predict_play_or_skip`. Parse ChatGPT's response into either 'play' or 'skip', and calculate the score based on the user's actual action (1 if correct, 0 if not). Log the results to the `SimulationManager`. 
4. Append song $S_{L+1}$ to the history, and repeat the process for predicting song $S_{L+2}$ given $\{ S_{l} \}_{l=1}^{L+1}$. Repeat for all `num_recommendations`. 
5. Finalise results, save CSV, print accuracies and p-values across different hyperparameters. 
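
Steps 3 and 4 above can be sketched as follows, with the ChatGPT call replaced by a stub. Everything here is illustrative: `parse_action`, `run_trial_sketch` and `always_play` are hypothetical names, the keyword-based parsing is an assumption, and the real logic lives in `Simulation.run_trial()` and `RecAgent.predict_play_or_skip`.

```python
def parse_action(response):
    """Map a free-text model reply to 'play' or 'skip'; None if ambiguous."""
    text = response.lower()
    says_play, says_skip = "play" in text, "skip" in text
    if says_play == says_skip:  # neither word, or both words, present
        return None
    return "play" if says_play else "skip"


def run_trial_sketch(songs, actions, length_history_sequence,
                     num_recommendations, option_pairs, predict):
    """Score one prediction per recommendation index and option pair."""
    results = []
    for idx_rec in range(num_recommendations):
        split = length_history_sequence + idx_rec
        history = list(zip(songs[:split], actions[:split]))  # grows each step
        target_song, actual_action = songs[split], actions[split]
        for lyric_option, user_profile_option in option_pairs:
            reply = predict(history, target_song, lyric_option, user_profile_option)
            results.append({
                "idx_recommended_song": idx_rec,
                "lyric_option": lyric_option,
                "user_profile_option": user_profile_option,
                "score": int(parse_action(reply) == actual_action),
            })
    return results


def always_play(history, target_song, lyric_option, user_profile_option):
    """Stub standing in for the ChatGPT call."""
    return "I think the user will PLAY this song."


songs = [f"song_{i}" for i in range(1, 8)]
actions = ["play", "skip", "skip", "skip", "play", "play", "skip"]
results = run_trial_sketch(songs, actions, 5, 2,
                           [("no_lyric_summary", "simple")], always_play)
print([row["score"] for row in results])  # [1, 0]
```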

*Note: Sometimes azlyrics.com blocks access when it suspects a bot is scraping its site. To ensure the simulation doesn't run endlessly when this happens, there is an automatic stopping criterion: the simulation halts if lyric scraping has failed for every one of the last 10 song instances.* 
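
One way such a stopping criterion could be written is with a fixed-size window of recent scrape outcomes. This is a sketch under assumptions: the class name `ScrapeFailureMonitor` is hypothetical and the repo's actual check may be structured differently.

```python
from collections import deque


class ScrapeFailureMonitor:
    """Track recent lyric-scrape outcomes and flag a run of pure failures."""

    def __init__(self, window=10):
        self.recent = deque(maxlen=window)  # oldest outcome drops off automatically

    def record(self, scrape_succeeded):
        self.recent.append(bool(scrape_succeeded))

    def should_stop(self):
        # Stop only once the window is full and contains no success at all.
        return len(self.recent) == self.recent.maxlen and not any(self.recent)


monitor = ScrapeFailureMonitor(window=10)
for _ in range(10):
    monitor.record(False)
print(monitor.should_stop())  # True
```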


## Example

The simulation is easy to run. I personally prefer to run it in a notebook environment for ease. Let's run it with all possible hyperparameters for one user and output the text log of prompts and responses. 

```python
from simulation import Simulation

num_trials = 1
num_recs = 1
length_history_sequence = 5
simulation = Simulation(name=f'trials-{num_trials}_recs-{num_recs}_history-{length_history_sequence}',
                        exp_name='EXP000',
                        num_trials=num_trials,
                        length_history_sequence=length_history_sequence,
                        num_recommendations=num_recs,
                        lyric_options=["no_lyric_summary",
                                       "chatgpt_memory_summary",
                                       "chatgpt_scraped_lyrics_summary",
                                       "first_n_lyric_lines"],
                        user_profile_options=['simple', 'expanded'],
                        big_or_small_data='small',
                        only_smartradio=False,
                        debug=True)  # debug=True prints the prompts and responses
simulation.run_simulation()
```



trial_id=0, idx_rec=0, lyric_option=no_lyric_summary, user_profile_option=simple

PROMPT

The user is a 24 year old female who likes pop, r&b, adult contemporary, hip hop and rap music. They have just listened to the following songs in the evening:

Let Me Love You by Mario, classified under genres pop and r&b, which was released in the year 2004, which goes for a duration of 249.0 seconds. The user played.

It's All Coming Back to Me Now (Radio Version) by Céline Dion, classified under genres pop rock, pop and adult contemporary, which was released in the year 2019, which goes for a duration of 329.0 seconds. The user skipped.

Solteiro Nunca Está Só by MC Kekel, classified under genres funk carioca, which was released in the year 2018, which goes for a duration of 147.0 seconds. The user skipped.

China by Anuel AA, classified under genres latino and reggaeton, which was released in the year 2019, which goes for a duration of 301.0 seconds. The user skipped.

Calma (Alan Walker Remix

Running Simulation: 100%|██████████| 1/1 [00:29<00:00, 29.16s/it]

trial_id=0, idx_rec=0, lyric_option=first_n_lyric_lines, user_profile_option=expanded

PROMPT

The user is a 24 year old female, who typically enjoys pop, r&b, adult contemporary, hip hop and rap music. In the mornings, they prefer pop, rock and alternative, while in the evenings, they lean towards pop, hip hop and adult contemporary. Recently, they've shown more interest in pop, r&b and hip hop. Their top artists include Pitbull, Soraia Ramos and Selena Gomez, and they have a particular fondness for music from the 2010s. They have just listened to the following songs in the evening:

Let Me Love You by Mario, classified under genres pop and r&b, which was released in the year 2004, which goes for a duration of 249.0 seconds. The first few lines of the song are 'Mmmm ..... Mmmmm.... Yeah....Mmmmm....Yeah, Yeah, Yeah Mmmm ..... Mmmmm.... Yeah....Mmmmm....Yeah, Yeah, Yeah  Baby, I just don't get it'. The user played.

It's All Coming Back to Me Now (Radio Version) by Céline Dion, classif
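
The song lines in the debug prompts above follow a fixed template. The sketch below reproduces that template for the no-lyric case; `join_with_and` and `describe_song` are hypothetical helper names, and the repo's own prompt-building code may differ.

```python
def join_with_and(items):
    """Join items as 'a', 'a and b', or 'a, b and c'."""
    items = list(items)
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + " and " + items[-1]


def describe_song(title, artist, genres, year, duration_s, action=None):
    """Build one song line; `action` is 'played' or 'skipped' for history songs."""
    text = (f"{title} by {artist}, classified under genres {join_with_and(genres)}, "
            f"which was released in the year {year}, "
            f"which goes for a duration of {duration_s} seconds.")
    if action is not None:
        text += f" The user {action}."
    return text


print(describe_song("Let Me Love You", "Mario", ["pop", "r&b"], 2004, 249.0, "played"))
```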




This time let's run it with more users and more recommendations. Remember, if the p-value is _less than_ a chosen significance level (say 0.05), we say that there is a statistically significant difference between ChatGPT's predictions and random chance (a coin-flip guess); if it is greater than or equal to that level, the results are not statistically significant. 

```python
num_trials = 5
num_recs = 3
length_history_sequence = 5
simulation = Simulation(name=f'trials-{num_trials}_recs-{num_recs}_history-{length_history_sequence}',
                        exp_name='EXP000',
                        num_trials=num_trials,
                        length_history_sequence=length_history_sequence,
                        num_recommendations=num_recs,
                        lyric_options=["no_lyric_summary",
                                       "chatgpt_memory_summary",
                                       "chatgpt_scraped_lyrics_summary",
                                       "first_n_lyric_lines"],
                        user_profile_options=['simple', 'expanded'],
                        big_or_small_data='small',
                        only_smartradio=False,
                        debug=False)
simulation.run_simulation()
```

```
Running Simulation: 100%|██████████| 5/5 [03:29<00:00, 41.86s/it]

total_trials = 120; num_trials: 5; num_recommendations: 3, length_history_sequence: 5

Overall Accuracy: 0.50
P-value (Overall vs Random Chance): 1.0000
  ChatGPT's overall performance is not statistically significantly different from random chance (p>=0.05).

Comparing 'user_gender' Average Scores:
  female: 0.40, P-value: 0.1934
  male: 0.57, P-value: 0.2888

Comparing 'idx_recommended_song' Average Scores:
  0: 0.38, P-value: 0.1539
  1: 0.47, P-value: 0.8746
  2: 0.65, P-value: 0.0807

Comparing 'lyric_option' Average Scores:
  chatgpt_memory_summary: 0.50, P-value: 1.0000
  chatgpt_scraped_lyrics_summary: 0.57, P-value: 0.5847
  first_n_lyric_lines: 0.50, P-value: 1.0000
  no_lyric_summary: 0.43, P-value: 0.5847

Comparing 'user_profile_option' Average Scores:
  expanded: 0.43, P-value: 0.3663
  simple: 0.57, P-value: 0.3663
```
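
This README does not state which statistical test produces the p-values. As a sketch of one test against a coin flip, here is an exact two-sided binomial test using only the standard library. The function name is hypothetical, and the counts in the example (19 correct out of 40) are an assumption inferred from the rounded accuracy of 0.47 over 5 × 4 × 2 = 40 predictions per recommendation index.

```python
from math import comb


def binomial_p_value_vs_coinflip(num_correct, num_total):
    """Exact two-sided binomial p-value against a success probability of 0.5.

    With p = 0.5 the distribution is symmetric, so the two-sided p-value is
    twice the smaller tail, capped at 1.
    """
    smaller_tail = min(num_correct, num_total - num_correct)
    tail_count = sum(comb(num_total, i) for i in range(smaller_tail + 1))
    return min(1.0, 2 * tail_count / 2 ** num_total)


print(round(binomial_p_value_vs_coinflip(19, 40), 4))  # 0.8746
print(binomial_p_value_vs_coinflip(60, 120))           # 1.0
```

With so few predictions per group, an accuracy has to be far from 0.5 before such a test reports significance, so more trials are needed before reading much into differences between hyperparameters.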





## Structure of Data

Since the Deezer data is not for public consumption, I have to wait until there is synthetic data in order to properly show the structure of the datasets here. 