# Building the recommendation dataset

In this notebook we take the pre-processed data and build our recommendation dataset.

### Libraries

- pandas
- scikit-learn
- textblob

### Installing and importing dependencies

In [1]:
pip install textblob


Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob


### Loading and organizing our data

In this step we load the `.csv` created in the `02-get-features-data.ipynb` notebook and organize it to build our recommendation dataset.

In [3]:
data_df = pd.read_csv("../data/processed_data.csv")
data_df.head()


Unnamed: 0,artist_name,track_name,track_url,danceability,energy,key,loudness,mode,speechiness,acousticness,...,type,id,uri,track_href,analysis_url,duration_ms,time_signature,artist_popularity,track_popularity,genres
0,Shakira,Chantaje (feat. Maluma),6mICuAdrwEjh6Y6lroV2Kg,0.852,0.773,8,-2.921,0,0.0776,0.187,...,audio_features,6mICuAdrwEjh6Y6lroV2Kg,spotify:track:6mICuAdrwEjh6Y6lroV2Kg,https://api.spotify.com/v1/tracks/6mICuAdrwEjh...,https://api.spotify.com/v1/audio-analysis/6mIC...,195840,4,94,76,colombian_pop dance_pop latin_pop
1,Ricky Martin,Vente Pa' Ca (feat. Maluma),7DM4BPaS7uofFul3ywMe46,0.663,0.92,11,-4.07,0,0.226,0.00431,...,audio_features,7DM4BPaS7uofFul3ywMe46,spotify:track:7DM4BPaS7uofFul3ywMe46,https://api.spotify.com/v1/tracks/7DM4BPaS7uof...,https://api.spotify.com/v1/audio-analysis/7DM4...,259196,4,76,70,latin_pop mexican_pop puerto_rican_pop
2,CNCO,Reggaetón Lento (Bailemos),3AEZUABDXNtecAOSC1qTfo,0.761,0.838,4,-3.073,0,0.0502,0.4,...,audio_features,3AEZUABDXNtecAOSC1qTfo,spotify:track:3AEZUABDXNtecAOSC1qTfo,https://api.spotify.com/v1/tracks/3AEZUABDXNte...,https://api.spotify.com/v1/audio-analysis/3AEZ...,222560,4,72,71,boy_band latin_pop reggaeton
3,"J Balvin, Pharrell Williams, BIA, Sky",Safari,6rQSrBHf7HlZjtcMZ4S4bO,0.508,0.687,0,-4.361,1,0.326,0.551,...,audio_features,6rQSrBHf7HlZjtcMZ4S4bO,spotify:track:6rQSrBHf7HlZjtcMZ4S4bO,https://api.spotify.com/v1/tracks/6rQSrBHf7HlZ...,https://api.spotify.com/v1/audio-analysis/6rQS...,205600,4,89,0,reggaeton reggaeton_colombiano urbano_latino
4,Daddy Yankee,Shaky Shaky,58IL315gMSTD37DOZPJ2hf,0.899,0.626,6,-4.228,0,0.292,0.076,...,audio_features,58IL315gMSTD37DOZPJ2hf,spotify:track:58IL315gMSTD37DOZPJ2hf,https://api.spotify.com/v1/tracks/58IL315gMSTD...,https://api.spotify.com/v1/audio-analysis/58IL...,234320,4,90,0,latin_hip_hop reggaeton urbano_latino


In [4]:
def drop_duplicates(df):
    df['artists_song'] = df.apply(lambda row: str(
        row['artist_name']) + str(row['track_name']), axis=1)
    return df.drop_duplicates('artists_song')


def drop_duplicates_ids(df):
    return df.drop_duplicates("id")


In [5]:
tracks_df = drop_duplicates(data_df)
print("Are all songs unique:", len(
    pd.unique(tracks_df.artists_song)) == len(tracks_df))

tracks_df = drop_duplicates_ids(tracks_df)
print("Are all songs unique:", len(pd.unique(tracks_df.id)) == len(tracks_df))


Are all songs unique: True
Are all songs unique: True
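
As a hedged aside, the same two-step de-duplication can be sketched with `drop_duplicates(subset=...)`, which compares the columns directly instead of a concatenated string (so, for example, "ab" + "c" and "a" + "bc" are not treated as the same song). The tiny DataFrame below is made up for illustration and is not taken from `processed_data.csv`.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from processed_data.csv).
toy = pd.DataFrame({
    "artist_name": ["A", "A", "B", "C"],
    "track_name": ["Song", "Song", "Song", "Other"],
    "id": ["x1", "x2", "x3", "x3"],
})

# Step 1: keep the first row for each (artist, track) pair.
by_song = toy.drop_duplicates(subset=["artist_name", "track_name"])

# Step 2: keep the first row for each track id.
unique_tracks = by_song.drop_duplicates(subset="id")

print(unique_tracks["id"].tolist())
```

Unlike the helper above, this variant does not add an `artists_song` column to the input DataFrame.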


Next we keep only the columns used by the recommendation system and group them into 3 categories (`Metadata`, `Audio` and `Text`):

1. Metadata
    - id
    - genres
    - artist_popularity
    - track_popularity
2. Audio
    - **Mood:** danceability, valence, energy, tempo
    - **Properties:** loudness, speechiness, instrumentalness
    - **Context:** liveness, acousticness
    - **Metadata:** key, mode
3. Text
    - track_name

In [6]:
def select_columns(df):
    return df[["artist_name", "id", "track_name", "danceability", "energy", "key", "loudness", "mode",
               "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "artist_popularity", "genres", "track_popularity"]]


tracks_df = select_columns(tracks_df)
tracks_df.head()


Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_popularity,genres,track_popularity
0,Shakira,6mICuAdrwEjh6Y6lroV2Kg,Chantaje (feat. Maluma),0.852,0.773,8,-2.921,0,0.0776,0.187,3e-05,0.159,0.907,102.034,94,colombian_pop dance_pop latin_pop,76
1,Ricky Martin,7DM4BPaS7uofFul3ywMe46,Vente Pa' Ca (feat. Maluma),0.663,0.92,11,-4.07,0,0.226,0.00431,1.7e-05,0.101,0.533,99.935,76,latin_pop mexican_pop puerto_rican_pop,70
2,CNCO,3AEZUABDXNtecAOSC1qTfo,Reggaetón Lento (Bailemos),0.761,0.838,4,-3.073,0,0.0502,0.4,0.0,0.176,0.71,93.974,72,boy_band latin_pop reggaeton,71
3,"J Balvin, Pharrell Williams, BIA, Sky",6rQSrBHf7HlZjtcMZ4S4bO,Safari,0.508,0.687,0,-4.361,1,0.326,0.551,3e-06,0.126,0.555,180.044,89,reggaeton reggaeton_colombiano urbano_latino,0
4,Daddy Yankee,58IL315gMSTD37DOZPJ2hf,Shaky Shaky,0.899,0.626,6,-4.228,0,0.292,0.076,0.0,0.0631,0.873,88.007,90,latin_hip_hop reggaeton urbano_latino,0


With the columns selected, we start by organizing the `genres` column, the only one in the `Metadata` category that needs to be sanitized.

In [7]:
def genre_preprocess(df):
    df["genres_list"] = df["genres"].apply(lambda x: str(x).split(" "))
    return df


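# Note: this passes data_df rather than tracks_df, so from here on tracks_df
# again holds every original row and column (plus artists_song); the
# de-duplication and column selection above are not carried forward.
tracks_df = genre_preprocess(data_df)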
tracks_df["genres_list"].head()


0               [colombian_pop, dance_pop, latin_pop]
1          [latin_pop, mexican_pop, puerto_rican_pop]
2                    [boy_band, latin_pop, reggaeton]
3    [reggaeton, reggaeton_colombiano, urbano_latino]
4           [latin_hip_hop, reggaeton, urbano_latino]
Name: genres_list, dtype: object
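
One detail worth keeping in mind: `str(x).split(" ")` turns a missing value (`NaN`) into the one-element list `["nan"]`. Below is a minimal sketch of an alternative, assuming missing genres should map to an `unknown` token; the toy Series is hypothetical.

```python
import pandas as pd

# Toy genres column with one missing entry (hypothetical data).
genres = pd.Series(["latin_pop reggaeton", None, "boy_band"])

# Replace missing values with a placeholder token, then split on whitespace.
genres_list = genres.fillna("unknown").str.split()

print(genres_list.tolist())
```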

With the `Metadata` data organized, we move on to the only item in the `Text` category, `track_name`.

In [8]:
def get_subjectivity(text):
    return TextBlob(str(text)).sentiment.subjectivity


def get_polarity(text):
    return TextBlob(str(text)).sentiment.polarity


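def get_subjectivity_analysis(score):
    # TextBlob subjectivity lies in [0, 1]; split it into three bands.
    # Assumed cut-offs: below 1/3 is low, above 2/3 is high, otherwise medium.
    if score < 1/3:
        return "low"
    elif score > 2/3:
        return "high"
    else:
        return "medium"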


def get_polarity_analysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'


def sentiment_analysis(df, column_name):
    # Score each title with TextBlob, then map the score to a label.
    df["subjectivity"] = df[column_name].apply(
        get_subjectivity).apply(get_subjectivity_analysis)
    df["polarity"] = df[column_name].apply(
        get_polarity).apply(get_polarity_analysis)

    return df


In [9]:
sentiment_df = sentiment_analysis(tracks_df, "track_name")
sentiment_df.head()

Unnamed: 0,artist_name,track_name,track_url,danceability,energy,key,loudness,mode,speechiness,acousticness,...,analysis_url,duration_ms,time_signature,artist_popularity,track_popularity,genres,artists_song,genres_list,subjectivity,polarity
0,Shakira,Chantaje (feat. Maluma),6mICuAdrwEjh6Y6lroV2Kg,0.852,0.773,8,-2.921,0,0.0776,0.187,...,https://api.spotify.com/v1/audio-analysis/6mIC...,195840,4,94,76,colombian_pop dance_pop latin_pop,ShakiraChantaje (feat. Maluma),"[colombian_pop, dance_pop, latin_pop]",low,Neutral
1,Ricky Martin,Vente Pa' Ca (feat. Maluma),7DM4BPaS7uofFul3ywMe46,0.663,0.92,11,-4.07,0,0.226,0.00431,...,https://api.spotify.com/v1/audio-analysis/7DM4...,259196,4,76,70,latin_pop mexican_pop puerto_rican_pop,Ricky MartinVente Pa' Ca (feat. Maluma),"[latin_pop, mexican_pop, puerto_rican_pop]",low,Neutral
2,CNCO,Reggaetón Lento (Bailemos),3AEZUABDXNtecAOSC1qTfo,0.761,0.838,4,-3.073,0,0.0502,0.4,...,https://api.spotify.com/v1/audio-analysis/3AEZ...,222560,4,72,71,boy_band latin_pop reggaeton,CNCOReggaetón Lento (Bailemos),"[boy_band, latin_pop, reggaeton]",low,Neutral
3,"J Balvin, Pharrell Williams, BIA, Sky",Safari,6rQSrBHf7HlZjtcMZ4S4bO,0.508,0.687,0,-4.361,1,0.326,0.551,...,https://api.spotify.com/v1/audio-analysis/6rQS...,205600,4,89,0,reggaeton reggaeton_colombiano urbano_latino,"J Balvin, Pharrell Williams, BIA, SkySafari","[reggaeton, reggaeton_colombiano, urbano_latino]",low,Neutral
4,Daddy Yankee,Shaky Shaky,58IL315gMSTD37DOZPJ2hf,0.899,0.626,6,-4.228,0,0.292,0.076,...,https://api.spotify.com/v1/audio-analysis/58IL...,234320,4,90,0,latin_hip_hop reggaeton urbano_latino,Daddy YankeeShaky Shaky,"[latin_hip_hop, reggaeton, urbano_latino]",high,Negative
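
For illustration, mapping a score in [0, 1] to three labels can also be sketched in vectorized form with `numpy.select`. The scores are made up, and the 1/3 and 2/3 cut-offs are an assumption rather than something fixed by TextBlob.

```python
import numpy as np

# Example subjectivity scores in [0, 1] (made-up values).
scores = np.array([0.0, 1 / 3, 0.5, 0.9])

# Assumed bands: below 1/3 is "low", above 2/3 is "high", the rest "medium".
labels = np.select(
    [scores < 1 / 3, scores > 2 / 3],
    ["low", "high"],
    default="medium",
)

print(labels.tolist())
```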


Now we move on to the last category, `Audio`.

In [10]:
def ohe_preparation(df, column, new_name):
    term_frequency_df = pd.get_dummies(df[column])
    feature_names = term_frequency_df.columns

    term_frequency_df.columns = [
        new_name + "|" + str(i) for i in feature_names]
    term_frequency_df.reset_index(drop=True, inplace=True)

    return term_frequency_df


In [11]:
subject_ohe = ohe_preparation(sentiment_df, 'subjectivity', 'subject')
subject_ohe.iloc[0]


subject|high      0
subject|low       1
subject|medium    0
Name: 0, dtype: uint8
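
As a small sketch of the same idea, `pd.get_dummies` can build the prefixed column names itself through its `prefix` and `prefix_sep` parameters. The toy Series below is hypothetical.

```python
import pandas as pd

# Toy categorical column (hypothetical values).
subjectivity = pd.Series(["low", "high", "low"])

# prefix and prefix_sep build the "subject|<value>" column names directly.
ohe = pd.get_dummies(subjectivity, prefix="subject", prefix_sep="|")

print(list(ohe.columns))
```

Only the values that actually occur in the column get a dummy column, so a label that never appears (here `medium`) produces no column.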

In [12]:
tf_idf = TfidfVectorizer()
tf_idf_matrix = tf_idf.fit_transform(
    tracks_df["genres_list"].apply(lambda x: " ".join(x)))

genre_df = pd.DataFrame(tf_idf_matrix.toarray())
genre_df.columns = ["genre" + "|" + i for i in tf_idf.get_feature_names_out()]
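# Note: drop() returns a new DataFrame, so without reassignment this line
# leaves genre_df unchanged and "genre|unknown" stays among the columns.
genre_df.drop(columns="genre|unknown")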
genre_df.reset_index(drop=True, inplace=True)

genre_df.iloc[0]


genre|21st_century_classical    0.0
genre|432hz                     0.0
genre|48g                       0.0
genre|5th_wave_emo              0.0
genre|8d                        0.0
                               ... 
genre|zouk                      0.0
genre|zouk_riddim               0.0
genre|zurich_indie              0.0
genre|zxc                       0.0
genre|zydeco                    0.0
Name: 0, Length: 5210, dtype: float64

In [13]:
print(tracks_df["artist_popularity"].describe())


count    330387.000000
mean         47.861472
std          20.912843
min           0.000000
25%          34.000000
50%          49.000000
75%          63.000000
max         100.000000
Name: artist_popularity, dtype: float64


In [14]:
popularity = tracks_df[["artist_popularity"]].reset_index(drop=True)
scaler = MinMaxScaler()

popularity_scaled = pd.DataFrame(scaler.fit_transform(
    popularity), columns=popularity.columns)
popularity_scaled.head()


Unnamed: 0,artist_popularity
0,0.94
1,0.76
2,0.72
3,0.89
4,0.9
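
Min-max scaling maps each value `x` to `(x - min) / (max - min)`. The sketch below spells that out in plain Python on made-up values. `min_max_scale` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map every entry to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# With a minimum of 0 and a maximum of 100, a popularity of 94 maps to 0.94.
print(min_max_scale([0, 94, 76, 100]))
```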


With all the functions in place, let's build the recommendation dataset.

In [17]:
def create_features_dataset(df, float_cols):
    tf_idf = TfidfVectorizer()
    tf_idf_matrix = tf_idf.fit_transform(
        df["genres_list"].apply(lambda x: " ".join(x)))

    genre_df = pd.DataFrame(tf_idf_matrix.toarray())
    genre_df.columns = ["genre" + "|" +
                        i for i in tf_idf.get_feature_names_out()]
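    # Drop the placeholder genre column; reassign because drop() is not in-place.
    genre_df = genre_df.drop(columns="genre|unknown", errors="ignore")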
    genre_df.reset_index(drop=True, inplace=True)

    df = sentiment_analysis(df, "track_name")

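    # One-hot encode the categorical features; the multiplier on each block
    # controls how much weight that block carries in the final feature vector.
    subject_ohe = ohe_preparation(df, "subjectivity", "subject") * 0.3
    polar_ohe = ohe_preparation(df, "polarity", "polar") * 0.5
    key_ohe = ohe_preparation(df, "key", "key") * 0.5
    mode_ohe = ohe_preparation(df, "mode", "mode") * 0.5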

    popularity = df[["artist_popularity", "track_popularity"]
                    ].reset_index(drop=True)
    scaler = MinMaxScaler()
    popularity_scaled = pd.DataFrame(scaler.fit_transform(
        popularity), columns=popularity.columns) * 0.2

    floats = df[float_cols].reset_index(drop=True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(
        floats), columns=floats.columns) * 0.2

    final_df = pd.concat([genre_df, floats_scaled, popularity_scaled,
                          subject_ohe, polar_ohe, key_ohe, mode_ohe], axis=1)
    final_df["id"] = df["id"].values

    return final_df


In [18]:
float_cols = tracks_df.dtypes[tracks_df.dtypes == "float64"].index.values
tracks_df.to_csv("tracks.csv", index=False)

features_df = create_features_dataset(tracks_df, float_cols=float_cols)
features_df.to_csv("features.csv")