## TF-IDF — Term Frequency-Inverse Document Frequency

**TF-IDF (term frequency-inverse document frequency)** is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

Text vectorization transforms the words in a document into numeric importance scores.

TF-IDF gives us a way to associate each word in a document with **a number that represents how relevant that word is in that document**. Documents that share relevant words then end up with similar vectors, which is the property a similarity-based machine learning algorithm relies on.
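To make the "similar words, similar vectors" idea concrete, here is a minimal sketch on a made-up corpus. The three strings and the `toy_*` names are illustrative assumptions, not data from this notebook; `TfidfVectorizer` and `cosine_similarity` are the same scikit-learn utilities imported further down.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: each "document" is one artist's genre string
toy_docs = [
    "tango vintage tango",
    "tango argentine folk",
    "big band swing",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

# Pairwise cosine similarity between the three document vectors
toy_similarity = cosine_similarity(toy_matrix)
print(toy_similarity.round(2))
```

The first two documents share the term `tango`, so their similarity is positive, while the third shares no term with either and scores zero against both.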

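\begin{gather*}
\text{Let:}\\
t := \text{term (word)}\\
d := \text{document}\\
D := \text{document set}\\
N := |D| \text{ (number of documents in } D \text{)}\\
\\
\text{tf-idf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)\\
\\
\text{where:}\\
\text{tf}(t, d) = \log(1 + \text{freq}(t, d))\\
\text{idf}(t, D) = \log\left(\frac{N}{|\{d \in D : t \in d\}|}\right)
\end{gather*}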
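The definition above can be evaluated by hand in a few lines. This is a sketch of the textbook formula only; the helper names and the toy corpus are assumptions made for illustration. Note that scikit-learn's `TfidfVectorizer` uses a different variant by default (raw counts for the term frequency, a smoothed idf of `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each row), so its numbers will not match these.

```python
import math

def tf(term, doc_tokens):
    """Log-scaled term frequency: log(1 + freq(t, d))."""
    return math.log(1 + doc_tokens.count(term))

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing t)."""
    containing = sum(1 for doc_tokens in corpus if term in doc_tokens)
    # Assumes the term occurs in at least one document (otherwise division by zero)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus of three tokenized documents
toy_corpus = [
    ["tango", "vintage", "tango"],
    ["tango", "folk"],
    ["swing", "band"],
]

print(tf_idf("tango", toy_corpus[0], toy_corpus))    # log(3) * log(3/2)
print(tf_idf("vintage", toy_corpus[0], toy_corpus))  # log(2) * log(3)
```

Although `tango` appears twice in the first document, `vintage` gets the higher score because it occurs in only one of the three documents.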



In [54]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import json

pd.set_option('display.max_rows', 100)

In [55]:
df = pd.read_csv('../complete_data.csv')

In [56]:
df.head()

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,instrumentalness,liveness,valence,tempo,time_signature,followers,genre_artist,name_artist,popularity_artist,duration_mins
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],45tIt06XoI0Iio4LBEVpls,1922-02-22,0.645,0.445,...,0.744,0.151,0.127,104.851,3,91.0,[''],Uli,4.0,2.11505
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],14jtPCOoNZwquk5wd9DxrY,1922-06-01,0.695,0.263,...,0.0,0.148,0.655,102.009,1,3.0,[''],Fernando Pessoa,0.0,1.636667
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],5LiOoJbxVSAMkBS2fUm3X2,1922-03-21,0.434,0.177,...,0.0218,0.212,0.457,130.418,5,3528.0,"['tango', 'vintage tango']",Ignacio Corsini,23.0,3.027333
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],5LiOoJbxVSAMkBS2fUm3X2,1922-03-21,0.321,0.0946,...,0.918,0.104,0.397,169.98,3,3528.0,"['tango', 'vintage tango']",Ignacio Corsini,23.0,2.94845
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],3BiJGZsyX9sJchTqcSA7Su,1922-01-01,0.402,0.158,...,0.13,0.311,0.196,103.22,4,11327.0,"['adult standards', 'big band', 'easy listenin...",Dick Haymes,35.0,2.718


In [58]:
song_attributes_df = df[['danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'time_signature']]

Let's work with the bag of words contained in the `genre_artist` column.

In [60]:
genres = df[['genre_artist']]

In [61]:
def str_to_list(row):
    """Convert a stringified list such as "['a', 'b']" into a list of strings."""
    return str(row).strip("[]").replace("'", "").split(", ")
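String stripping leaves stray quotes behind when a genre name itself contains an apostrophe, because pandas then writes that entry with double quotes. A possible alternative, sketched here under the assumption that every cell holds a valid Python list literal, is to parse the cell with `ast.literal_eval`. The function name `parse_genre_list` is hypothetical, and unlike `str_to_list` it returns an empty list (not `['']`) for artists without genres.

```python
import ast

def parse_genre_list(raw):
    """Parse a stringified Python list such as "['tango', 'vintage tango']"."""
    try:
        parsed = ast.literal_eval(str(raw))
    except (ValueError, SyntaxError):
        return []  # unparseable or missing value (e.g. NaN)
    if not isinstance(parsed, list):
        return []
    # Drop empty strings such as the one in "['']"
    return [genre for genre in parsed if genre]

print(parse_genre_list("['tango', 'vintage tango']"))
print(parse_genre_list("['']"))
print(parse_genre_list('["children\'s music", \'pop\']'))
```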

In [8]:
def _replace_spaces_to_dash(row):
    cleaned = []
    for genre in row:
        cleaned.append(genre.replace(' ', '-'))
    return cleaned

In [9]:
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning; multi-word genres are hyphenated
genres = genres.assign(genre_artist=genres['genre_artist'].apply(str_to_list).apply(_replace_spaces_to_dash))


In [38]:
genres_list = genres['genre_artist'].to_list()

In [None]:
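from collections import Counter
# Count how often each genre occurs across all tracks and show the most frequent ones
Counter(genre for sublist in genres_list for genre in sublist).most_common(10)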

In [40]:
flat_list = {item for sublist in genres_list for item in sublist}  # unique genres (a set, despite the name)

In [43]:
flat_list

{'',
 'irish-pub-song',
 'chicago-soul',
 'lldm',
 'malian-blues',
 'musica-antigua',
 'kolsche-karneval',
 'show-tunes',
 'pokemon',
 'british-black-metal',
 'finnish-edm',
 'brass-ensemble',
 'nitzhonot',
 'post-teen-pop',
 'surf-punk',
 'galician-rock',
 'metal-guitar',
 'belgian-indie',
 'electronica-argentina',
 'indian-instrumental',
 'melodic-techno',
 'bongo-flava',
 'deep-new-americana',
 'british-power-metal',
 'rajasthani-pop',
 'japanese-classical',
 'christian-punk',
 'australian-metalcore',
 'german-oi',
 'ecuadorian-indie',
 'rap-maroc',
 'baja-indie',
 'música-pitiusa',
 'techno-kayo',
 'klapa',
 'christian-a-cappella',
 'australian-hardcore',
 'lithuanian-hip-hop',
 'modern-rock',
 'wu-fam',
 'punk-mexicano',
 'sotalaulut',
 'italian-lounge',
 'new-comedy',
 'western-swing',
 'indonesian-psychedelia',
 'jazz-saxophone',
 'trance-mexicano',
 'italian-hardcore',
 'south-carolina-indie',
 'gypsy',
 'oaxaca-indie',
 'j-rock',
 'christian-alternative-rock',
 '"childrens-mus

In [11]:
genres[['genre_artist']]

Unnamed: 0,genre_artist
0,[]
1,[]
2,"[tango, vintage-tango]"
3,"[tango, vintage-tango]"
4,"[adult-standards, big-band, easy-listening, lo..."
...,...
586667,[chinese-viral-pop]
586668,"[alt-z, alternative-r&b, bedroom-pop, indie-ca..."
586669,"[alt-z, electropop, indie-pop, la-indie, pop, ..."
586670,"[chill-r&b, indie-cafe-pop, singaporean-pop]"


In [12]:
max(genres['genre_artist'].apply(lambda x: " ".join(x)).to_list(), key=len)

'alternative-dance alternative-rock art-pop atlanta-indie chillwave dance-punk dream-pop experimental-rock freak-folk garage-psych indie-pop indie-rock lo-fi modern-alternative-rock modern-rock neo-psychedelic new-rave noise-pop noise-rock nu-gaze shoegaze'

    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(df['consolidates_genre_lists'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    genre_df.reset_index(drop=True, inplace=True)

In [28]:
def tf_idf_vectorizer(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a TF-IDF weighted DataFrame built from a column of token lists."""
    _tfidf = TfidfVectorizer()
    # Join each list of genres into one space-separated string per row
    weighted_matrix = _tfidf.fit_transform(df[column].apply(lambda x: " ".join(x)))

    return pd.DataFrame(weighted_matrix.toarray())

In [29]:
weighted_genre_df = tf_idf_vectorizer(genres, 'genre_artist')

In [13]:
tfidf = TfidfVectorizer()
weighted_matrix = tfidf.fit_transform(genres['genre_artist'].apply(lambda x: " ".join(x)))


In [14]:
weighted_genre_df = pd.DataFrame(weighted_matrix.toarray())

In [15]:
weighted_genre_df.columns = ['genre' + "-" + i for i in tfidf.get_feature_names_out()]
weighted_genre_df.reset_index(drop = True, inplace=True)
weighted_genre_df.head()

Unnamed: 0,genre-150,genre-21st,genre-432hz,genre-48g,genre-abc,genre-abstract,genre-acadienne,genre-accordeon,genre-accordion,genre-aceh,...,genre-zikir,genre-zillertal,genre-zim,genre-zither,genre-zolo,genre-zouglou,genre-zouk,genre-zuliana,genre-zurich,genre-zydeco
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
song_attributes_df

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,0.645,0.4450,0,-13.338,1,0.4510,0.674,0.744000,0.1510,0.1270,104.851,3
1,0.695,0.2630,0,-22.136,1,0.9570,0.797,0.000000,0.1480,0.6550,102.009,1
2,0.434,0.1770,1,-21.180,1,0.0512,0.994,0.021800,0.2120,0.4570,130.418,5
3,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918000,0.1040,0.3970,169.980,3
4,0.402,0.1580,3,-16.900,0,0.0390,0.989,0.130000,0.3110,0.1960,103.220,4
...,...,...,...,...,...,...,...,...,...,...,...,...
586667,0.560,0.5180,0,-7.471,0,0.0292,0.785,0.000000,0.0648,0.2110,131.896,4
586668,0.765,0.6630,0,-5.223,1,0.0652,0.141,0.000297,0.0924,0.6860,150.091,4
586669,0.535,0.3140,7,-12.823,0,0.0408,0.895,0.000150,0.0874,0.0663,145.095,4
586670,0.696,0.6150,10,-6.212,1,0.0345,0.206,0.000003,0.3050,0.4380,90.029,4


In [25]:
scaler = MinMaxScaler()

scaled = scaler.fit_transform(song_attributes_df)
scaled_attributes_df = pd.DataFrame(data=scaled, columns=song_attributes_df.columns)
scaled_attributes_df

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,0.650858,0.4450,0.000000,0.713748,1.0,0.464470,0.676707,0.744000,0.1510,0.1270,0.425564,0.6
1,0.701312,0.2630,0.000000,0.579173,1.0,0.985582,0.800201,0.000000,0.1480,0.6550,0.414029,0.2
2,0.437941,0.1770,0.090909,0.593796,1.0,0.052729,0.997992,0.021800,0.2120,0.4570,0.529335,1.0
3,0.323915,0.0946,0.636364,0.490073,1.0,0.051905,0.998996,0.918000,0.1040,0.3970,0.689907,0.6
4,0.405651,0.1580,0.272727,0.659263,0.0,0.040165,0.992972,0.130000,0.3110,0.1960,0.418945,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...
586667,0.565086,0.5180,0.000000,0.803491,0.0,0.030072,0.788153,0.000000,0.0648,0.2110,0.535333,0.8
586668,0.771948,0.6630,0.000000,0.837876,1.0,0.067147,0.141566,0.000297,0.0924,0.6860,0.609183,0.8
586669,0.539859,0.3140,0.636364,0.721626,0.0,0.042019,0.898594,0.000150,0.0874,0.0663,0.588905,0.8
586670,0.702321,0.6150,0.909091,0.822748,1.0,0.035530,0.206827,0.000003,0.3050,0.4380,0.365406,0.8
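`MinMaxScaler` rescales each column independently with `(x - min) / (max - min)`. Columns already spanning 0 to 1 (such as `energy`) are left unchanged, while a `time_signature` of 3 becomes 0.6, consistent with a column minimum of 0 and maximum of 5. A small check on made-up tempo values (the array below is an illustrative assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column feature, e.g. tempo in BPM
toy_tempo = np.array([[90.0], [120.0], [150.0]])

# Manual min-max scaling, column by column
col_min = toy_tempo.min(axis=0)
col_max = toy_tempo.max(axis=0)
manual = (toy_tempo - col_min) / (col_max - col_min)

scaled_toy = MinMaxScaler().fit_transform(toy_tempo)
print(scaled_toy.ravel())
```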


In [None]:
# cluster_pipeline = Pipeline([('Scaler', StandardScaler()), ('Cosine_similarity', cosine_similarity())])

In [26]:
model_df = pd.concat([weighted_genre_df, song_attributes_df, df[['id']]], axis=1)
model_df

Unnamed: 0,genre-150,genre-21st,genre-432hz,genre-48g,genre-abc,genre-abstract,genre-acadienne,genre-accordeon,genre-accordion,genre-aceh,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-13.338,1,0.4510,0.674,0.744000,0.1510,0.1270,104.851,3,35iwgR4jXetI318WEWsa1Q
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-22.136,1,0.9570,0.797,0.000000,0.1480,0.6550,102.009,1,021ht4sdgPcrDgSk7JTbKY
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-21.180,1,0.0512,0.994,0.021800,0.2120,0.4570,130.418,5,07A5yehtSnoedViJAZkNnc
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-27.961,1,0.0504,0.995,0.918000,0.1040,0.3970,169.980,3,08FmqUhxtyLTn6pAh6bk45
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-16.900,0,0.0390,0.989,0.130000,0.3110,0.1960,103.220,4,08y9GfoqCWfOGsKdwojr5e
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-7.471,0,0.0292,0.785,0.000000,0.0648,0.2110,131.896,4,5rgu12WBIHQtvej2MdHSH0
586668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-5.223,1,0.0652,0.141,0.000297,0.0924,0.6860,150.091,4,0NuWgxEp51CutD2pJoF4OM
586669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-12.823,0,0.0408,0.895,0.000150,0.0874,0.0663,145.095,4,27Y1N4Q4U3EfDU5Ubw8ws2
586670,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-6.212,1,0.0345,0.206,0.000003,0.3050,0.4380,90.029,4,45XJsGpFTyzbzeWK8VzR8S
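Calling `.toarray()` on the TF-IDF matrix creates a dense array with one row per track and one column per genre term, which can take a lot of memory at this dataset's size. A possible alternative, sketched below on hypothetical miniature data, keeps the genre block sparse, stacks it with the attributes using `scipy.sparse.hstack`, and computes cosine similarity for one query track at a time. Note the assumption that the scaled attributes are used here, whereas the cell above concatenates the unscaled `song_attributes_df`.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature stand-ins for the genre strings and audio attributes
toy_genres = ["tango vintage tango", "tango", "big band swing", "swing jazz"]
toy_attributes = np.array([
    [0.43, 130.4],
    [0.32, 170.0],
    [0.40, 103.2],
    [0.65, 104.9],
])  # columns: danceability, tempo

genre_sparse = TfidfVectorizer().fit_transform(toy_genres)  # stays sparse
attr_scaled = MinMaxScaler().fit_transform(toy_attributes)  # each column in [0, 1]

# Stack both feature blocks column-wise without densifying the TF-IDF part
features = hstack([genre_sparse, csr_matrix(attr_scaled)]).tocsr()

# Similarity of the first track to every track, then rank the other tracks
scores = cosine_similarity(features[0], features).ravel()
ranking = [int(i) for i in np.argsort(-scores) if i != 0]
print(ranking)
```

The second track ranks first because it shares the `tango` term with the query; the remaining order is driven by the scaled audio attributes alone.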


In [None]:
testing_df = pd.read_csv('./testing-data.csv')

In [17]:
# 
# df_copy = pd.DataFrame(index=df_original.index,columns=df_original.columns)