## Tarun Nadipalli - 705.603 Creating AI-Enabled Systems Final Project

For my final project, I built a content-based song recommendation application with the Spotify API. Users input a link to a Spotify playlist containing songs they like, and the application sends back a link to a playlist of recommended songs.

In the previous guide, `data_analysis.ipynb`, I described the data I used and the design decisions I made (mainly excluding genre and artists from my dataset). In this notebook, we will walk through the full recommendation engine in `recommend.py` by describing the TensorFlow models we created and how they fit into the two steps of creating recommendations: retrieval and ranking.

Ultimately, recommender systems use a user's feedback on certain items (note: we only have implicit feedback in our use case) to extrapolate and predict how that user would rate other items. The items with the highest predicted ratings are then recommended to the user. To do so, these systems consist of two stages: retrieval and ranking.

In the retrieval stage, the model is trained on the input data and whittles down the large dataset of potential recommendations to a far smaller set (1000 songs in our case). Once the candidates that wouldn't interest the user are weeded out, we send the smaller set to the ranking stage.

In the ranking stage, the small set of potential recommendations is analyzed to find the likelihood that the user will enjoy that recommendation. The recommendations with the highest likelihood are then returned to the user.

Note: People often refer to the user data/preferences as query data, whereas potential recommendations are referred to as candidates.
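
To make the two-stage idea concrete, here is a minimal pure-Python sketch. It is not the code in `recommend.py`; the `recommend` function, the toy songs, and both scoring functions are hypothetical stand-ins for the trained models described later.

```python
import heapq


def recommend(query, candidates, retrieval_score, ranking_score, k=3, n=2):
    """Two-stage sketch: a cheap pass over every candidate, then a finer pass over the survivors."""
    # Stage 1 (retrieval): keep the k candidates with the highest cheap score.
    shortlist = heapq.nlargest(k, candidates, key=lambda c: retrieval_score(query, c))
    # Stage 2 (ranking): re-score only the shortlist and keep the best n.
    return heapq.nlargest(n, shortlist, key=lambda c: ranking_score(query, c))


# Toy data: each "song" is a (name, energy, tempo) tuple (hypothetical values).
songs = [("a", 0.9, 120.0), ("b", 0.2, 70.0), ("c", 0.8, 125.0),
         ("d", 0.85, 90.0), ("e", 0.1, 180.0)]
query = ("q", 0.85, 122.0)


def cheap_score(q, c):
    # Retrieval stand-in: only looks at energy.
    return -abs(q[1] - c[1])


def precise_score(q, c):
    # Ranking stand-in: looks at energy and tempo.
    return -(abs(q[1] - c[1]) + abs(q[2] - c[2]) / 100.0)


top = recommend(query, songs, cheap_score, precise_score, k=3, n=2)
print([name for name, _, _ in top])  # ['a', 'c']
```

Song "d" survives retrieval because its energy matches the query exactly, but the ranking pass drops it once tempo is taken into account.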

In `recommend.py` we use TensorFlow, the TensorFlow Recommenders library, and Keras to complete our recommendation workflow.

#### Retrieval 

The retrieval stage has a few different steps and uses two different model types. The first is the query model, which computes a representation of all the query features. In our case, the query model is built from the features of the songs in the user-supplied playlist. The second is the candidate model, which produces equally sized embeddings to represent all the candidate data (our pool of 300k songs to recommend from). In both cases, we use the 'SongModel' defined in `models.py` to instantiate a tf.keras.Model object that holds and computes our embeddings based on the input songs.

The next model is a combination of the query and candidate models, the 'RetrievalModel' in `models.py`. This model takes the outputs of both the query and candidate models and multiplies them to calculate a score that helps us identify potential matches. The higher the score, the more likely the two songs are to be similar. It also implements a loss function to evaluate how well our model is training on our query data.
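
The score-and-loss idea can be sketched in plain Python. The embeddings below are made up, and the loss is an assumption: `models.py` is not shown here, so this uses the softmax cross-entropy that is common for retrieval tasks, where each query's matching candidate is treated as the positive and the other candidates in the batch as negatives.

```python
import math


def dot(u, v):
    """Affinity score between one query embedding and one candidate embedding."""
    return sum(a * b for a, b in zip(u, v))


def score_matrix(query_embs, cand_embs):
    """scores[i][j] is the affinity of query i with candidate j."""
    return [[dot(q, c) for c in cand_embs] for q in query_embs]


def in_batch_softmax_loss(scores):
    """Mean cross-entropy, assuming row i's true match is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(scores)


# Hypothetical 2-dimensional embeddings for two query songs and two candidates.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
cand_embs = [[2.0, 0.0], [0.0, 2.0]]

scores = score_matrix(query_embs, cand_embs)
loss = in_batch_softmax_loss(scores)
print(scores)  # [[2.0, 0.0], [0.0, 2.0]]
```

The loss shrinks as the diagonal (matching) scores grow relative to the off-diagonal ones, which is what pushes similar songs toward similar embeddings.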

The 'RetrievalModel' also employs the tfrs.metrics.FactorizedTopK metric in its task layer, using the computed scores from above, to help weed out the potential recommendations from hundreds of thousands to about a thousand. FactorizedTopK compares the score that the model calculates for a song-song pair to the scores for all the other possible candidate songs; if the score for the pair from the query input is higher, then we know our model has found a good song to include for the ranking stage.
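
The top-K comparison described above can be illustrated without TensorFlow. This is a simplified sketch of the idea rather than the library's implementation; the function names and scores are hypothetical.

```python
def top_k_hit(positive_score, all_candidate_scores, k):
    """True if fewer than k candidates score strictly higher than the true pair."""
    higher = sum(1 for s in all_candidate_scores if s > positive_score)
    return higher < k


def top_k_accuracy(positive_scores, candidate_score_rows, k):
    """Fraction of queries whose true pair lands in the top k."""
    hits = [top_k_hit(p, row, k) for p, row in zip(positive_scores, candidate_score_rows)]
    return sum(hits) / len(hits)


# Score of each query's true song-song pair (made-up numbers).
positives = [0.9, 0.2]
# Scores of each query against every other candidate song.
rows = [[0.1, 0.5, 0.95, 0.3], [0.8, 0.7, 0.6, 0.1]]

acc_at_2 = top_k_accuracy(positives, rows, k=2)
print(acc_at_2)  # 0.5
```

Only one candidate outscores the first pair, so it counts as a top-2 hit; three candidates outscore the second pair, so it does not.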

Our retrieval function yields a DataFrame containing 1000 of the original 300k songs to be ranked in the ranking stage. Let's see how that works below.


In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

import warnings
warnings.filterwarnings("ignore")

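import tensorflow as tf

from data import *
from spotify import Spotify
from recommend import *
tf.get_logger().setLevel('ERROR')

# read the Spotify credentials from the environment instead of hard-coding them
# (the variable names SPOTIFY_CLIENT_ID / SPOTIFY_CLIENT_SECRET are an assumption)
sp = Spotify(client_id=os.environ['SPOTIFY_CLIENT_ID'],
             client_secret=os.environ['SPOTIFY_CLIENT_SECRET'],
             scope="playlist-modify-public")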
    
# creating SQLite db connection to access song data
conn = create_connection(db_file)

# going to paste the RapCaviar Spotify Playlist as example
# https://open.spotify.com/playlist/37i9dQZF1DX0XUsuxWHRQd?si=b21c55631c15457b
user_df = sp.get_spotify_data_from_user()
songs_df = get_table_df(conn, 'features', '300000').drop(['track_id'], axis=1)

retrieved_recs_df = retrieval(user_df, songs_df)
retrieved_recs_df.head()


Got it! The playlist you've submitted is: https://open.spotify.com/playlist/37i9dQZF1DX0XUsuxWHRQd?si=b21c55631c15457b


| | track_uri | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 79906 | 1YBA6PuLlIrjNr9Hxl7qcj | 0.0796 | 0.996 | 1.0 | -16.873 | 1.0 | 0.18 | 0.13 | 0.743 | 0.38 | 0.00602 | 83.718 | 5279768 | 4 |
| 186541 | 4zQ2iVzOx7BiuA9JSUIr3S | 0.0619 | 0.715 | 1.0 | -15.423 | 1.0 | 0.276 | 0.625 | 0.938 | 0.456 | 1e-05 | 75.895 | 4680000 | 3 |
| 211274 | 1gNlPGGyQ4FKTqGBTrUgEg | 0.236 | 0.913 | 7.0 | -16.058 | 1.0 | 0.567 | 0.218 | 0.568 | 0.597 | 0.0363 | 167.558 | 5100018 | 4 |
| 80767 | 1wXkXhrYKpf8sNVtaZcS4L | 0.0949 | 0.0833 | 1.0 | -21.237 | 1.0 | 0.0435 | 0.0324 | 0.82 | 0.201 | 0.0235 | 88.757 | 3627182 | 4 |
| 262563 | 6yfA0Vw02XiuDDRhmqimER | 0.783 | 0.918 | 1.0 | -2.7 | 1.0 | 0.182 | 0.00656 | 0.0165 | 0.319 | 0.419 | 129.078 | 4677906 | 4 |


As you can see, we printed out the first 5 of 1000 candidate songs for ranking. Let's move on and describe the ranking workflow.

#### Ranking

As mentioned before, the ranking stage takes the output from the retrieval model and narrows the selection down to what it considers the best candidates. It does so with a ranking task that predicts the ratings of potential candidates and returns the songs with the highest predicted ratings.

This involves two main models defined in `models.py`. The first is the 'RankingModel', which combines the query and candidate SongModels from the retrieval step, with two major changes in this stage:
1. We add a column 'ratings' to the user input dataframe of songs all with a value of 1 so we can essentially tell the model that the user likes all of these songs.
2. The candidate SongModel input is not the dataframe of 300k songs, but the 1k songs from the retrieval stage output.

In this 'RankingModel' we define two stacked dense layers and a final layer that makes the rating predictions. 

The second model is the 'FullRankingModel', which uses the 'RankingModel' as one layer and adds a loss and a metric. The loss is calculated with MeanSquaredError, and the metric is RootMeanSquaredError. Combined, these let us train on the user data and extrapolate what the model learns to the other candidate songs from the retrieval step.
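
As a small worked example of the two quantities named above, here is how mean squared error and its square root behave when every label is 1 (the user likes every song in their playlist). The predictions are made up for illustration.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Every song in the user's playlist carries an implicit rating of 1.
labels = [1.0, 1.0, 1.0, 1.0]
# Hypothetical predicted ratings from a partially trained ranking model.
predictions = [0.5, 1.0, 1.5, 1.0]

mse = mean_squared_error(labels, predictions)
rmse = math.sqrt(mse)
print(mse)  # 0.125
```

RMSE is in the same units as the rating itself, which makes it easier to read than the squared loss when monitoring training.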

It is also worth mentioning that both the retrieval and ranking stages contain preprocessing steps, such as encoding the categorical variables and normalizing the numerical variables. With that in mind, let's run the ranking stage and see the output.
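
Before running it, the two preprocessing steps just mentioned can be pictured roughly as follows. This is a pure-Python sketch under assumptions: the project's actual preprocessing is not shown in this notebook, and the helper names `fit_normalizer` and `fit_lookup` are hypothetical.

```python
import statistics


def fit_normalizer(values):
    """Return a function that z-score normalizes a numerical feature."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)

    def normalize(x):
        # Guard against a constant column, where the standard deviation is 0.
        return (x - mean) / std if std else 0.0

    return normalize


def fit_lookup(values):
    """Return a function that maps each category to an integer index (0 = unseen)."""
    vocab = {v: i + 1 for i, v in enumerate(sorted(set(values)))}

    def encode(x):
        return vocab.get(x, 0)

    return encode


# Toy columns resembling the 'tempo' (numerical) and 'key' (categorical) features.
tempos = [80.0, 100.0, 120.0]
keys = [1, 7, 1]

normalize_tempo = fit_normalizer(tempos)
encode_key = fit_lookup(keys)

print(normalize_tempo(100.0))  # 0.0
print(encode_key(7))           # 2
```

Normalizing keeps large-valued features such as `duration_ms` from dominating the embeddings, while the integer index gives categorical features such as `key` a form the model can embed.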

The output is a list of the 50 song IDs with the highest predicted ratings, which we can pass to the Spotify API to create our playlist!

In [2]:
ranked_recs = ranking(user_df, retrieved_recs_df)
print(ranked_recs)

['7fEHXf9gdC9GgdEnOplSMk' '04vKZAQjAl1xiwRpJUwcNi'
 '7jJwZljv47X3MtcBs8J0kK' '2TGccSSywyDDVBdVuruJfv'
 '7Dwznt3vxaMm9h6NqLggMG' '21gwamXOkYbGvc2pNujxwI'
 '1lRUE7Wvr2kPfd6T5dyx6V' '3KgOabOHuokOIaNizvMGAR'
 '4yRrCssPj4FJp4BWkFNugX' '3IJw5ZeS3ZcCYb36aIFWyZ'
 '04KJ4NGb2T22y26sTO41Q0' '42excP3MyVua6modv3v9Pz'
 '54YIcwm4UHK2CjrPdqG5ET' '3CGxYL47S3A4ouA11u3zoB'
 '4JxTsAC3TNQy9BQdR1vpsj' '4H0XhTWA9SjTynyFHDITjF'
 '0E5loSIAWRO56lW9X6r4mc' '0z4tMEEWIewXIwnMLVedIY'
 '5oTv8AtuFfHrJE5qZnF69S' '11iOB65szCoy6e5dscKrai'
 '4V367XIbBKqsBaE2KhYFW0' '7JbEO66T2aBBlscNt8Sqt5'
 '6pTqwdzpEMuz6G1gu6gYOg' '2PmF8UiRxg8a2L08TpQtwJ'
 '02c6Po7W4uthFjUWkukl0z' '5e574bhjycX1eH2l4Auage'
 '4Be5OppEnVognKlHUIN0v6' '4txwQCkCJMWLwpEXqjp6dq'
 '2ZRJRe82aZaVhOKKlbJr4v' '4iYRa2btalAzPZoSYfROqF'
 '2SHDnvo78qHFyZVgx9ZAjP' '1L49J3hzJeSGTqgtUb7vYD'
 '5YWQWLEUtqRbUDs7bzaUb0' '3njpLvANriMsdv3dgADEad'
 '5zvwWU7CdXdJTV9Y35qX1u' '2N1bBBvrtuZOVWCYY532ys'
 '3N40rGv0fQYA0erA4aCRTW' '0gUBZ1HzJo1Ha0K6TJam3j'
 '4UZzJnI07nFm07zWhqyDOm' '6DWd

The final step in this project is creating the playlist. Let's do it!

In [3]:
sp.create_playlist(ranked_recs)

Done! Here are your recommendations: https://open.spotify.com/playlist/5bRdXusLShAR8e7JTpeLQN


'https://open.spotify.com/playlist/5bRdXusLShAR8e7JTpeLQN'

If you open this playlist, you'll see many rap songs similar to those in the original input [RapCaviar](https://open.spotify.com/playlist/37i9dQZF1DX0XUsuxWHRQd?si=b21c55631c15457b) playlist, often by similar artists, but we also see recommendations from other genres and artists that would not normally be recommended.

The final notebook in this series is `main_guide.ipynb`. Please refer to that notebook to understand what I've learned, how this system could improve, future work, and more.

#### References

1. [TensorFlow Retrieval Documentation](https://www.tensorflow.org/recommenders/examples/basic_retrieval)
2. [TensorFlow Ranking Documentation](https://www.tensorflow.org/recommenders/examples/basic_ranking)
3. [TensorFlow Feature Preprocessing Documentation](https://www.tensorflow.org/recommenders/examples/featurization)