In [29]:
import pandas as pd
import numpy as np
from mixability import *
from sort_songs import *

## **What Data Are We Working With?**

**Here are the main things to consider when selecting your next song:**

1) Tempo: For a typical transition, you want to stay within 5 bpm of the current song. If the tempo of the next song is very different from the current one, you are forced to speed up or slow down the current song, which can be noticeable to people.

2) Key: Matching the key, or finding a song with a similar key, will help your transition sound much cleaner. While most people won't notice or care, selecting the right key will take a good transition to an amazing one.

3) Energy: You don't want to transition from high energy to low energy, or vice versa. A change in energy is fine, but an abrupt switch will throw people off.

4) Genre: Songs can differ greatly within genres, but staying within one genre or switching between similar genres will help with finding songs that go well together. Even if your tempo and key match, a transition from Techno to Classical could be a little questionable.

5) Popularity: Sometimes popularity is all you need. If people like a song, they probably won't even care how good your mix is.

## **Cleaning The Data**

**Let's turn our CSVs into DataFrames.**

We have two CSVs from Kaggle. These datasets were collected using the Spotify API and include song names and their associated audio and descriptive features.

- We have "popular_songs.csv" which includes songs deemed as popular.

- And we have "unpopular_songs.csv" which includes songs deemed as unpopular.

In [None]:
POPULAR_SONGS = pd.read_csv("data/popular_songs.csv")
UNPOPULAR_SONGS = pd.read_csv("data/unpopular_songs.csv")

Let's start first with our popular songs dataset.

In [31]:
print(POPULAR_SONGS.columns)
print(f"Number of columns: {len(POPULAR_SONGS.columns)}")


Index(['energy', 'tempo', 'danceability', 'playlist_genre', 'loudness', 'liveness', 'valence', 'track_artist',
       'time_signature', 'speechiness', 'track_popularity', 'track_href', 'uri', 'track_album_name', 'playlist_name',
       'analysis_url', 'track_id', 'track_name', 'track_album_release_date', 'instrumentalness', 'track_album_id',
       'mode', 'key', 'duration_ms', 'acousticness', 'id', 'playlist_subgenre', 'type', 'playlist_id'],
      dtype='object')
Number of columns: 29


**Some things we can clean up right away are:**

- We currently have 29 columns of song features, which is definitely more than we need, so we should probably remove some (such as the Spotify IDs, acousticness, and duration).

- We might prefer to see the song and artist names at the front, followed by key features. I think I would want something that would look like this:
    [song, artist, genre, tempo, key, energy, ...]

- I also might want to shorten some of the column names so they are easier to work with.

So let's try that.

In [32]:
POPULAR_SONGS = POPULAR_SONGS[["track_name", "track_artist", "playlist_genre", "tempo", "key", 
                               "energy", "track_popularity", "danceability", "liveness"]]

POPULAR_SONGS = POPULAR_SONGS.rename(columns={"track_name": "track", "track_artist": "artist",
                                              "playlist_genre": "genre", "track_popularity": "popularity"})

POPULAR_SONGS.head(5)

Unnamed: 0,track,artist,genre,tempo,key,energy,popularity,danceability,liveness
0,Die With A Smile,"Lady Gaga, Bruno Mars",pop,157.969,6,0.592,100,0.521,0.122
1,BIRDS OF A FEATHER,Billie Eilish,pop,104.978,2,0.507,97,0.747,0.117
2,That’s So True,Gracie Abrams,pop,108.548,1,0.808,93,0.554,0.159
3,Taste,Sabrina Carpenter,pop,112.966,0,0.91,81,0.67,0.304
4,APT.,"ROSÉ, Bruno Mars",pop,149.027,0,0.783,98,0.777,0.355


Some issues I notice with our current data are:

1) Energy, danceability, and liveness are on a scale of 0-1, while popularity is on a scale of 0-100, so let's rescale popularity to match the other categories.

2) Once they are all on the same scale, notice that energy, danceability, and liveness are given to three decimal places. Let's round everything to two decimal places.

3) Tempo is also overly specific. We can just round it to the nearest whole number.

In [33]:
POPULAR_SONGS["popularity"] *= 0.01
POPULAR_SONGS["danceability"] = POPULAR_SONGS["danceability"].round(2)
POPULAR_SONGS["liveness"] = POPULAR_SONGS["liveness"].round(2)
POPULAR_SONGS["energy"] = POPULAR_SONGS["energy"].round(2)
POPULAR_SONGS["tempo"] = POPULAR_SONGS["tempo"].round()

POPULAR_SONGS.head(5)

Unnamed: 0,track,artist,genre,tempo,key,energy,popularity,danceability,liveness
0,Die With A Smile,"Lady Gaga, Bruno Mars",pop,158.0,6,0.59,1.0,0.52,0.12
1,BIRDS OF A FEATHER,Billie Eilish,pop,105.0,2,0.51,0.97,0.75,0.12
2,That’s So True,Gracie Abrams,pop,109.0,1,0.81,0.93,0.55,0.16
3,Taste,Sabrina Carpenter,pop,113.0,0,0.91,0.81,0.67,0.3
4,APT.,"ROSÉ, Bruno Mars",pop,149.0,0,0.78,0.98,0.78,0.36


**I decided to put all these steps into a function called "clean_data()" in the "clean_dataset.py" file, so we can reuse it on any dataframe of this format.**
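The real "clean_data()" isn't shown in this notebook, so here is only a minimal sketch of what a function like it could look like, assuming it just bundles the select, rename, rescale, and round steps from the cells above. The name `clean_data_sketch` and the `KEEP_AND_RENAME` mapping are made up for illustration.

```python
import pandas as pd

# Columns to keep, in display order, mapped to their shorter names.
# (Hypothetical helper mapping; not taken from clean_dataset.py.)
KEEP_AND_RENAME = {
    "track_name": "track",
    "track_artist": "artist",
    "playlist_genre": "genre",
    "tempo": "tempo",
    "key": "key",
    "energy": "energy",
    "track_popularity": "popularity",
    "danceability": "danceability",
    "liveness": "liveness",
}


def clean_data_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Select, rename, rescale, and round the columns used for mixing."""
    # Selecting and renaming both return new DataFrames, so raw is untouched.
    df = raw[list(KEEP_AND_RENAME)].rename(columns=KEEP_AND_RENAME)
    df["popularity"] = (df["popularity"] / 100).round(2)  # 0-100 -> 0-1
    for col in ["energy", "danceability", "liveness"]:
        df[col] = df[col].round(2)                        # two decimal places
    df["tempo"] = df["tempo"].round()                     # whole bpm
    return df
```

Because it never modifies its input, the same sketch could be run on both the popular and the unpopular CSV without one affecting the other.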

## **Planning Our Suggestion System**

Now that we have a clean, easy-to-use dataframe, we have to figure out how we want our song suggestion system to work. Some questions I asked myself are:

1) What is the most important feature?

2) How should we work with genre?

3) Do we want one big calculation? 

4) Should we create a new value that gets outputted by our suggestion system?

The importance of different features depends on the person, but I have come up with my own way of organizing and weighing the categories.  


#### 1) I would probably split up our categories into:
- closeness to the current song

- the rating of the category

The reason for this is that for a category like tempo, we want to keep the bpm similar to the current song, but popularity should always be maximized and shouldn't depend on the rating of the current song.

#### 2) After splitting them up, here is how I rank them in importance:

**Closeness:**
1) Tempo: We want a similar speed, so we should probably just filter out songs whose tempo is too far from the current song right away.

2) Key: While not noticed by most people, a mix that is in key will always sound better compared to one that isn't.

3) Energy: For simple, smooth transitions, we would like to keep the energy around the same. But we also don't want to limit ourselves if there is an opportunity to bump up the energy.
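For the tempo idea, here is a rough sketch of filtering candidates up front. It assumes the 5 bpm rule of thumb from the first section; `within_tempo` and `max_bpm_gap` are names I made up for illustration, and the tempos are the rounded values from the preview table.

```python
import pandas as pd


def within_tempo(songs: pd.DataFrame, current_tempo: float,
                 max_bpm_gap: float = 5.0) -> pd.DataFrame:
    """Keep only songs whose tempo is within max_bpm_gap of current_tempo."""
    gap = (songs["tempo"] - current_tempo).abs()
    return songs[gap <= max_bpm_gap]


demo = pd.DataFrame({
    "track": ["Die With A Smile", "That’s So True", "Taste"],
    "tempo": [158.0, 109.0, 113.0],
})

# Current song at 105 bpm: only the 109 bpm track is within 5 bpm.
candidates = within_tempo(demo, current_tempo=105.0)
print(candidates["track"].tolist())
```

A hard filter like this is stricter than a weighted tempo score: a song that is 6 bpm away never gets scored at all, no matter how popular it is.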

**Rating:**
1) Popularity: If a song is popular, there is a higher chance people will like it.

2) Danceability: When DJing, you probably are looking for fun, danceable songs. But sometimes danceability comes with popularity.

3) Liveness: This data set describes liveness as "likelihood of a track being performed live". This doesn't matter as much to us compared to if a band were deciding what song to play live. But it does relate to our question of "what songs do people want to hear?" 



**Genre**
For now, I am not going to mess with genre. We'll still leave it in our dataframe, but it can just be a category the user can consider when using the program. I also noticed afterwards that "genre" in this dataset describes the genre of the playlist the song was taken from, so it may not even be very accurate.

**Now let's weigh our categories:**

Closeness (0.55):
1. Tempo (0.45)
2. Key (0.4)
3. Energy (0.15)

Rating (0.45):
1. Popularity (0.6)
2. Danceability (0.35)
3. Liveness (0.05)
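To make the weights concrete, here is a minimal sketch of how they could combine into one score. It assumes every category has already been turned into a sub-score between 0 and 1 (how closeness is actually measured is left to "mixability.py"). `combine_scores`, `weighted_sum`, and the example numbers are hypothetical.

```python
# Group weights and per-category weights from the lists above.
CLOSENESS_WEIGHT, RATING_WEIGHT = 0.55, 0.45
CLOSENESS_WEIGHTS = {"tempo": 0.45, "key": 0.40, "energy": 0.15}
RATING_WEIGHTS = {"popularity": 0.60, "danceability": 0.35, "liveness": 0.05}


def weighted_sum(scores: dict, weights: dict) -> float:
    """Sum of weight * sub-score over every category in weights."""
    return sum(weights[name] * scores[name] for name in weights)


def combine_scores(closeness: dict, rating: dict) -> float:
    """Blend the two groups into one score in [0, 1], rounded to 2 places."""
    total = (CLOSENESS_WEIGHT * weighted_sum(closeness, CLOSENESS_WEIGHTS)
             + RATING_WEIGHT * weighted_sum(rating, RATING_WEIGHTS))
    return round(total, 2)


# Made-up sub-scores for one candidate song (not real data).
closeness = {"tempo": 1.0, "key": 0.5, "energy": 0.8}
rating = {"popularity": 0.9, "danceability": 0.7, "liveness": 0.1}
print(combine_scores(closeness, rating))  # prints 0.78
```

Since each set of weights sums to 1, the final score only reaches 1.0 when every sub-score is 1.0, so even a perfect closeness match is pulled down by a mediocre rating.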

## **Coding The System**

Now we just create a function for each of our categories and then put them all together to output one final mixability score.

Our final "run_mixability()" function in the "mixability.py" file takes 2 songs, calculates the score, and returns it.
It looks something like this:

    (current_song: pd.Series, candidate_song: pd.Series) -> mixability_score: float


You can look through "mixability.py" and see how all the tests work.

In [34]:
current = POPULAR_SONGS.loc[1]                                          # Grab the song at index 1 (the second row)
for idx, candidate in POPULAR_SONGS.iloc[:10].iterrows():               # Loop through the first 10 songs
    score = run_mixability(current, candidate)
    print(f"{current['track']} -> {candidate['track']} has mixability of {score}")  # Print each mixability score

BIRDS OF A FEATHER -> Die With A Smile has mixability of 0.44
BIRDS OF A FEATHER -> BIRDS OF A FEATHER has mixability of 0.88
BIRDS OF A FEATHER -> That’s So True has mixability of 0.71
BIRDS OF A FEATHER -> Taste has mixability of 0.52
BIRDS OF A FEATHER -> APT. has mixability of 0.54
BIRDS OF A FEATHER -> Good Luck, Babe! has mixability of 0.49
BIRDS OF A FEATHER -> Diet Pepsi has mixability of 0.43
BIRDS OF A FEATHER -> WILDFLOWER has mixability of 0.41
BIRDS OF A FEATHER -> Sailor Song has mixability of 0.46
BIRDS OF A FEATHER -> Timeless (with Playboi Carti) has mixability of 0.48


Something you might notice is that mixing a song with itself gets a score of 0.88 (88%), which is weird because we would expect 100%. But remember that our scoring also takes the song's own feature values into account. "BIRDS OF A FEATHER" has good popularity, but not as much energy, which is why that score is being brought down.

I don't think this is a big issue, because when we actually implement this program, we can just tell it to skip itself in the search process.

**Let's move onto a full implementation of a dataframe.**

I think this program should create a copy of our original dataframe, and add a mixability column with all the scores. 
It would probably look something like this:

1. Create copy, with mixability column
2. Set current song
3. Run test on each song and set mixability score
4. Sort dataframe by mixability scores


Now, here are some things to also consider:
1. Again, we don't want to recommend the current song, so we can remove it from the dataframe before running the tests
2. We don't want to just fully remove the current song, so we could keep track of played songs in a separate dataframe

    a. Our played songs dataframe probably shouldn't include the mixability column. We should also drop it from our current
        song before adding it to the played songs dataframe

**run_mixability_scores()**

First, set up a cleaned dataframe of all the songs and an empty played songs dataframe.

Then here is what the final function does:
1. Input current song, dataframe, and played songs
2. Create a copy of dataframe with mixability column
3. Remove current song from copy
4. Loop through all songs and set mixability scores
5. Sort songs by mixability scores
6. Add current song to played songs dataframe (after removing mixability score)
7. Return sorted dataframe and played songs dataframe

You can look at the function in more detail in the "sort_songs.py" file.
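If you just want the shape of it, the steps above can be sketched like this. This is not the code from "sort_songs.py": `rank_candidates` is a made-up name, the scoring function is passed in as a parameter so the block runs on its own, and `toy_score` is a placeholder that only looks at tempo.

```python
from typing import Callable, Tuple

import pandas as pd


def rank_candidates(
    current: pd.Series,
    songs: pd.DataFrame,
    played: pd.DataFrame,
    score_fn: Callable[[pd.Series, pd.Series], float],
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # Steps 2-3: work on a new dataframe without the current song
    # (current.name is its index label in songs).
    ranked = songs.drop(index=current.name).copy()
    # Step 4: score every remaining song against the current one.
    ranked["mixability"] = ranked.apply(lambda row: score_fn(current, row), axis=1)
    # Step 5: best matches first.
    ranked = ranked.sort_values("mixability", ascending=False)
    # Step 6: record the current song as played, without a mixability value.
    current_row = current.drop(labels="mixability", errors="ignore").to_frame().T
    played = current_row if played.empty else pd.concat([played, current_row])
    # Step 7: hand back both dataframes.
    return ranked, played


def toy_score(cur: pd.Series, cand: pd.Series) -> float:
    """Placeholder scorer: 1.0 at equal tempo, falling to 0 at 50+ bpm apart."""
    return round(1 - min(abs(cur["tempo"] - cand["tempo"]) / 50, 1), 2)


songs = pd.DataFrame({"track": ["a", "b", "c"], "tempo": [100.0, 104.0, 130.0]})
played = pd.DataFrame(columns=songs.columns)

ranked, played = rank_candidates(songs.loc[0], songs, played, toy_score)
print(ranked)
print(played)
```

Taking the scorer as an argument is just a convenience for the sketch; it also makes it easy to try a different weighting without touching the ranking logic.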

In [35]:
pd.set_option("display.max_columns", None)                      # Just to make sure we see the entire dataframe
pd.set_option("display.width", 120)

popular_songs = pd.read_csv("popular_songs.csv")
cleaned = clean_data(popular_songs)
played = pd.DataFrame(columns=cleaned.columns)                      # Set up our 2 dataframes

current = cleaned.loc[922]
result, played = run_mixability_scores(current, cleaned, played)    # Run tests, return our sorted df and played songs df

result = result.drop(columns=["artist"])        # Just for easier viewing of dataframe
result["track"] = result["track"].str[:30]      # Just to shorten long track names

print(current)
print(result.head(10))
print("...")
print(played.head(1))


track           Poker Face
artist           Lady Gaga
genre                  pop
tempo                119.0
key                      4
energy                0.81
popularity            0.81
danceability          0.85
liveness              0.12
Name: 922, dtype: object
                               track       genre  tempo  key  energy  popularity  danceability  liveness  mixability
399   Cheerleader (Felix Jaehn Remix  electronic  118.0    4    0.68        0.80          0.78      0.16        0.84
24                          NEW DROP         pop  120.0    4    0.63        0.85          0.76      0.22        0.84
150                         NEW DROP     hip-hop  120.0    4    0.63        0.85          0.76      0.22        0.84
1430            WHINE IN BRAZIL FUNK   brazilian  117.0    4    0.91        0.72          0.86      0.18        0.83
1028                        Dynamite         pop  120.0    4    0.78        0.72          0.75      0.04        0.82
439                         Dy

Should I check for duplicate songs in the dataset first? "NEW DROP" shows up twice above under different genres. It might not be that big of an issue, and a filter function might take a long time to run.
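One option worth trying is pandas' built-in `drop_duplicates`, sketched below on a tiny made-up frame. It assumes a duplicate means the same track and artist appearing under more than one playlist genre; the artist names are placeholders. It is a single vectorized call, so it is unlikely to be slow on a dataset of this size, but I haven't timed it here.

```python
import pandas as pd

songs = pd.DataFrame({
    "track": ["NEW DROP", "NEW DROP", "Dynamite"],
    "artist": ["Artist A", "Artist A", "Artist B"],  # placeholder artist names
    "genre": ["pop", "hip-hop", "pop"],
})

# Keep the first row for each (track, artist) pair; later repeats are dropped.
deduped = songs.drop_duplicates(subset=["track", "artist"], keep="first")
print(len(songs), "->", len(deduped))
```

The trade-off is that the surviving row keeps whichever genre came first, which is probably fine since genre here is only the playlist's genre anyway.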