#### Data imports & data cleaning  
The dataset we are using for the project is [LFM-2b](http://www.cp.jku.at/datasets/LFM-2b/).
- Sample from a bigger dataset that has less missing data
- Link demographic data of the artists from [musicbrainz.org](https://musicbrainz.org/doc/Style/Artist)
- Spotify Web API: https://developer.spotify.com/documentation/web-api


## Preprocessing - Amir
#### Extract
Download the following tables - [Data Source](http://www.cp.jku.at/datasets/LFM-2b/)
1. `lyrics-features.json.bz2` (full version) (~6.1GB)
2. `user_track_playcount.zip` (Adapted for Track Recommendation) (~8.6GB)
3. `song_ids.zip` (Adapted for Track Recommendation) (~2.6GB)
4. `user_demographics.zip` (Adapted for Track Recommendation) (4.0MB)

`lyrics-features.json.bz2` must first be extracted into a `.json` file. Since I use a Windows device, I use 7-Zip. The files are too large to work with directly, so I load them into a PostgreSQL database through DBeaver. However, I could not import the `.json` file into PostgreSQL directly, so I write a Python script to convert the `.json` into a `.csv` file. The file is quite large because it contains a BERT embedding vector of the lyrics for every song. I exclude the BERT embedding when converting the JSON to a CSV, and instead save the embeddings to a `.pt` file so that they can be used for modeling and analysis.
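The conversion can be sketched roughly as below. This is a minimal sketch, not the actual helper script: it assumes the extracted file holds one JSON object per line, and the embedding field name `bert_embedding` and the `word_count` column are hypothetical placeholders. Only the `_id.artist` / `_id.track` fields come from the key format used later in these notes.

```python
import csv
import io
import json


def split_lyrics_features(lines, csv_file, embedding_field="bert_embedding"):
    """Write the non-embedding fields to CSV and return {"artist - track": embedding}.

    `embedding_field` is an assumed name; check the real JSON before using it.
    """
    embeddings = {}
    writer = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        json_obj = json.loads(line)
        key = f"{json_obj['_id']['artist']} - {json_obj['_id']['track']}"
        # Take the embedding out of the record so it never reaches the CSV.
        vector = json_obj.pop(embedding_field, None)
        if vector is not None:
            embeddings[key] = vector
        ids = json_obj.pop("_id")
        row = {"artist": ids["artist"], "track": ids["track"], **json_obj}
        if writer is None:
            # The first record decides the CSV columns; later extras are ignored.
            writer = csv.DictWriter(csv_file, fieldnames=list(row), extrasaction="ignore")
            writer.writeheader()
        writer.writerow(row)
    return embeddings


# Toy record standing in for one line of the extracted file.
sample_lines = [
    '{"_id": {"artist": "A", "track": "T"}, "word_count": 120, "bert_embedding": [0.1, 0.2]}'
]
csv_out = io.StringIO()
embeddings = split_lyrics_features(sample_lines, csv_out)
# With PyTorch installed, the dictionary could then be stored with
# torch.save(embeddings, "embeddings.pt").
```

Streaming line by line keeps memory use flat for the CSV part. The embedding dictionary still has to fit in memory, so for the full file it may need to be written in chunks.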

The three other data tables must also be converted into CSV files because their columns are not labeled neatly. So, I write another Python script to convert the `.tsv` files into `.csv`. The column names are hard-coded in the script, as these files are too large to edit in Excel.
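A minimal sketch of that TSV-to-CSV step, assuming the input files have no header row. The column names used here (`user_id`, `track_id`, `count`) are placeholders and not necessarily the ones in the actual script.

```python
import csv
import io


def tsv_to_csv(tsv_file, csv_file, column_names):
    """Stream a headerless TSV into a CSV that starts with `column_names`."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(csv_file)
    writer.writerow(column_names)
    rows = 0
    for row in reader:
        writer.writerow(row)
        rows += 1
    return rows


# Toy input standing in for a large .tsv file.
tsv_in = io.StringIO("1\t10\t3\n2\t11\t1\n")
tsv_csv_out = io.StringIO()
rows_written = tsv_to_csv(tsv_in, tsv_csv_out, ["user_id", "track_id", "count"])
```

Because rows are copied one at a time, the file never has to be loaded in full, which is the point when it does not fit in Excel.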

#### Transform
Next, I hop into SQL and do the following:
1. Create the tables
2. Import the data
3. Determine the primary & foreign keys
4. Merge
5. Sample
6. Export
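
The steps above can be sketched end to end as follows. To keep the example self-contained it uses Python's built-in `sqlite3` instead of PostgreSQL, and the table names, column names, and toy rows are all made up for illustration. The real schema lives in the SQL helper files.

```python
import csv
import io
import random
import sqlite3

conn = sqlite3.connect(":memory:")

# Steps 1 and 3: create the tables with primary and foreign keys (hypothetical schema).
conn.executescript("""
CREATE TABLE users  (user_id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE tracks (track_id INTEGER PRIMARY KEY, artist TEXT, track TEXT);
CREATE TABLE playcounts (
    user_id   INTEGER REFERENCES users(user_id),
    track_id  INTEGER REFERENCES tracks(track_id),
    playcount INTEGER,
    PRIMARY KEY (user_id, track_id)
);
""")

# Step 2: import the data (toy rows standing in for the converted CSV files).
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "f"), (2, "m")])
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [(10, "Artist A", "Song X"), (11, "Artist B", "Song Y")],
)
conn.executemany(
    "INSERT INTO playcounts VALUES (?, ?, ?)",
    [(1, 10, 3), (1, 11, 1), (2, 10, 7)],
)

# Step 4: merge the tables on their keys.
merged = conn.execute("""
SELECT p.user_id, u.gender, t.artist, t.track, p.playcount
FROM playcounts AS p
JOIN users  AS u ON u.user_id = p.user_id
JOIN tracks AS t ON t.track_id = p.track_id
ORDER BY p.user_id, p.track_id
""").fetchall()

# Step 5: draw a reproducible random sample of the merged rows.
sample = random.Random(0).sample(merged, k=2)

# Step 6: export the sample as CSV.
export_buffer = io.StringIO()
export_writer = csv.writer(export_buffer)
export_writer.writerow(["user_id", "gender", "artist", "track", "playcount"])
export_writer.writerows(sample)
```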



#### More Extracting
Now remember, we still have the BERT embeddings, stored in a `.pt` file. Since we've sampled the data, I go back and recreate the `.pt` file so that it only includes songs that appear in the sample. A song's embedding can be accessed via the key,

`key = f"{json_obj['_id']['artist']} - {json_obj['_id']['track']}"`

that is, `artist - track` is the key. The embedding is a numeric representation of the lyrics: each song's lyrics become a vector in space, so the text can be quantified in terms of direction and magnitude with respect to the English language. We can use cosine similarity to measure how similar two vectors (and therefore two sets of lyrics) are. In this case, we'll use it to build a recommendation model.
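
As a rough illustration of both ideas (restricting the embeddings to the sample, then comparing lyrics with cosine similarity), here is a small sketch in plain Python. The two-dimensional vectors and song keys are toy values, and the `most_similar` helper is hypothetical. The real embeddings are much longer and would normally be handled with PyTorch or NumPy.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(key, embeddings, top_k=5):
    """Rank every other song by cosine similarity to the song stored under `key`."""
    query = embeddings[key]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != key
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy embeddings keyed by "artist - track".
all_embeddings = {
    "A - x": [1.0, 0.0],
    "B - y": [1.0, 1.0],
    "C - z": [0.0, 1.0],
    "D - w": [5.0, 5.0],
}

# Keep only the songs that are part of the sample.
sampled_keys = {"A - x", "B - y", "C - z"}
sample_embeddings = {k: v for k, v in all_embeddings.items() if k in sampled_keys}

neighbours = most_similar("A - x", sample_embeddings, top_k=2)
```

Cosine similarity ignores vector length, so two songs whose embeddings point in the same direction score 1.0 even if their magnitudes differ.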

You can find all the helper files in the drive. The original tables are obviously too large to import into Google Drive, but you can download them yourself and run the scripts from there. Just ensure you have enough space, and double check file paths in the Python code.

#### Loading

Well, here ya go.


## Come up with a new topic (music recommendation)
Keywords: artists, gender, music recommendation  
Designing Fair Algorithms SP24, Phase 2  
Team: Amir ElTabakh (ae362), Isabella Wang (xw574), Amber Tsao (ct649), Zhixuan Qi (zq83)  
Introduction:
[TODO, A small paragraph about the project] [Amber]


#### Summary statistics

#### Research Question, Hypotheses, and Analysis Plan
- RQ:  (**gender**, subgenre)

## Modeling

- Collaborative filtering: ALS [Amber]
- Content-based recommendation: [ItemKNN](https://recbole.io/docs/user_guide/model/general/itemknn.html) (tags) [Zhixuan]
- Content-based recommendation: ItemKNN (Spotify, optional)


#### Baseline: **POP** (https://recbole.io/docs/user_guide/model/general/pop.html) - Isabella
This algorithm records the popularity of the songs in the dataset and recommends the most popular tracks to every user.
The popularity of an item is usually determined by how often users interact with it (here, listening counts). We use the algorithm as the baseline for our project because it contains no personalization. It shows how a recommender performs without any optimization or personalization involved.
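
A minimal sketch of the idea in plain Python, independent of RecBole. It assumes interactions come as `(user_id, track_id, playcount)` tuples and measures popularity as the summed play count. That is one possible definition; counting the number of distinct listeners would be another.

```python
from collections import Counter


def top_popular(interactions, k=10):
    """Return the k track ids with the highest total play count."""
    popularity = Counter()
    for _user_id, track_id, playcount in interactions:
        popularity[track_id] += playcount
    return [track_id for track_id, _ in popularity.most_common(k)]


# Toy interactions: (user_id, track_id, playcount)
toy_interactions = [(1, "a", 3), (2, "a", 1), (1, "b", 5), (3, "c", 1)]
pop_recommendations = top_popular(toy_interactions, k=2)
```

Every user receives the same list, which is what makes this a useful non-personalized reference point.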

In [1]:
```python
!pip install recbole
!pip install ray
!pip install kmeans-pytorch
from recbole.quick_start import run_recbole
```



In [3]:
```python
# import pandas as pd
# df = pd.read_csv('path_to_ml-100k/u.data', sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])

# # Count the number of interactions per item
# item_popularity = df['item_id'].value_counts()
# top_10_items = item_popularity.head(10).index.tolist()
# print("Top 10 Popular Items:", top_10_items)
```


#### BPR (https://recbole.io/docs/user_guide/model/general/bpr.html) -  Isabella

Bayesian Personalized Ranking (BPR) is a common algorithm used in recommendation systems. It generates a personalized ranking for each user from implicit feedback such as clicks, views, or purchases. In our case, that input is listening counts (the number of times a user listens to a track). The assumption is that a user is more likely to prefer items similar to the ones they've interacted with in the past.

Assumption: The algorithm uses pairwise preference assessment. The idea is that a randomly chosen interacted item should be ranked higher than a randomly chosen uninteracted item.  
Input: It encodes inputs using matrix factorization techniques. User and item interactions are decomposed into latent factors.  
Optimizer: Stochastic Gradient Descent (SGD)  
Objective Function: Maximum a posteriori (MAP) estimation
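
To make the pairwise idea concrete, here is a small sketch of the loss for a single (user, interacted item, uninteracted item) triple, written in plain Python rather than taken from RecBole. Scores are dot products of latent factor vectors. The regularization term that belongs to the full MAP objective is left out to keep the sketch short.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bpr_pair_loss(user_vec, pos_item_vec, neg_item_vec):
    """-log(sigmoid(score_pos - score_neg)) for one training triple."""
    diff = dot(user_vec, pos_item_vec) - dot(user_vec, neg_item_vec)
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


user = [1.0, 0.0]
# The interacted item scores higher than the uninteracted one: small loss.
loss_good = bpr_pair_loss(user, [2.0, 0.0], [0.0, 0.0])
# Both items score the same: the model cannot tell them apart.
loss_tie = bpr_pair_loss(user, [1.0, 0.0], [1.0, 0.0])
# The uninteracted item scores higher: large loss.
loss_bad = bpr_pair_loss(user, [0.0, 0.0], [2.0, 0.0])
```

SGD lowers this loss by nudging the user and item vectors so that interacted items end up scoring above sampled uninteracted ones.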

In [4]:
```python
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.general_recommender import BPR

# Load the dataset and model configurations.
# `dataset` is a dataset name, not a file name: by default RecBole looks for
# atomic files such as dataset/mock_data/mock_data.inter.
config = Config(model='BPR', dataset='mock_data', config_file_list=['mock_data.yaml', 'bpr_algo.yaml'])

# Create the dataset and print its summary
dataset = create_dataset(config)
print(dataset)

# Split into training, validation, and test data
train_data, valid_data, test_data = data_preparation(config, dataset)

# Initialize the model
model = BPR(config, train_data.dataset).to(config['device'])
```

Passing the file name `mock_data.csv` as `dataset` fails with:

ValueError: Neither [dataset/mock_data.csv] exists in the device nor [mock_data.csv] a known dataset name.
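
RecBole does not read plain CSV files. It expects tab-separated "atomic files" (for interactions, `<name>.inter`) whose header carries a type for every column, for example `user_id:token`. A rough sketch of converting our CSV into that format is below. The CSV column names are placeholders, and mapping play counts onto a `rating:float` field is an assumption that would have to match the field settings in `mock_data.yaml`.

```python
import csv
import io


def csv_to_inter(csv_file, inter_file, columns):
    """Write a RecBole-style .inter file.

    `columns` is a list of (csv_column, recbole_field, recbole_type) triples.
    """
    reader = csv.DictReader(csv_file)
    writer = csv.writer(inter_file, delimiter="\t", lineterminator="\n")
    # Header row in the "field:type" form used by atomic files.
    writer.writerow([f"{field}:{ftype}" for _, field, ftype in columns])
    for row in reader:
        writer.writerow([row[source] for source, _, _ in columns])


# Toy CSV standing in for the exported sample.
mock_csv = io.StringIO("user_id,track_id,playcount\n1,10,3\n2,11,1\n")
inter_out = io.StringIO()
csv_to_inter(
    mock_csv,
    inter_out,
    [
        ("user_id", "user_id", "token"),
        ("track_id", "item_id", "token"),
        ("playcount", "rating", "float"),
    ],
)
```

The output would then be saved as `dataset/mock_data/mock_data.inter` so that `dataset='mock_data'` can find it.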

#### Results
- Performance Metrics: HitRatio@10, nDCG@10
- Fairness Metrics: average position of the first female artist, average position of the first male artist, % female recommendation coverage
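
For reference, here is one way these metrics could be computed for a single user's recommendation list. This is a sketch under assumptions: relevance is treated as binary, and "% female recommendation coverage" is read as the share of recommended slots occupied by female artists. The team's final definitions may differ.

```python
import math


def hit_ratio_at_k(recommended, relevant, k=10):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0


def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance nDCG: DCG of the top k divided by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def first_position(genders, group):
    """1-based rank of the first artist from `group`, or None if absent."""
    for rank, gender in enumerate(genders, start=1):
        if gender == group:
            return rank
    return None


def group_share(genders, group):
    """Fraction of recommended slots occupied by artists from `group`."""
    return sum(g == group for g in genders) / len(genders) if genders else 0.0


# Toy recommendation list for one user, with the artist gender per slot.
recommended = ["a", "b", "c", "d"]
relevant = {"b", "d"}
genders = ["male", "female", "male", "female"]

hit = hit_ratio_at_k(recommended, relevant, k=3)
ndcg = ndcg_at_k(recommended, relevant, k=3)
first_female = first_position(genders, "female")
first_male = first_position(genders, "male")
female_share = group_share(genders, "female")
```

The per-user values would then be averaged over all test users to get the reported numbers.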


#### Contribution Notes

#### Sources cited