# Similar Song Recommendation 101 - 2. Intersection, Jaccard Index, TF-IDF

Let's look at set intersection, the first method covered in [the blog post](https://blog2.lucent.me/ml/music-recommender-systems-101-intro), along with two methods that address its problems: the Jaccard index and TF-IDF.

## Loading the data

Let's read the data we saved [last time](https://github.com/kcy1019/MusicRecSys-101/blob/master/01%20-%20Getting%20the%20Dataset.ipynb).

In [1]:
import scipy.sparse as sparse
import pandas as pd
import numpy as np

In [2]:
play_count_matrix = sparse.load_npz('play_count_matrix.npz')
metadata = pd.read_hdf('metadata.hdf', key='data')

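## Size of the intersection

To count the people who have listened to both of two songs, we first need to binarize the play count matrix: 1 if a user has ever played the song, 0 otherwise.

Once the matrix is binarized (rows are songs, columns are users), the matrix product X · X.T gives, for every pair of songs, the number of users who have listened to both.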

In [3]:
# Binarize: 1 if the user ever played the song, 0 otherwise
play_count_binary = play_count_matrix.copy().tocsr()
play_count_binary[play_count_binary > 0] = 1
# Intersection sizes for all song pairs (@ is matrix multiplication)
all_pair_intersections = play_count_binary @ play_count_binary.transpose()
all_pair_intersections = all_pair_intersections.toarray()
all_pair_intersections.shape

(10000, 10000)
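To make the matrix product concrete, here is a minimal sketch on a tiny made-up matrix. The names `toy_counts`, `toy_binary`, and `toy_intersections` and the data itself are assumptions for illustration only; they are not part of the dataset.

```python
import numpy as np
import scipy.sparse as sparse

# Hypothetical toy data: 3 songs (rows) x 4 users (columns), values are play counts.
toy_counts = sparse.csr_matrix(np.array([
    [3, 0, 1, 0],   # song 0 was played by users 0 and 2
    [1, 2, 0, 0],   # song 1 was played by users 0 and 1
    [0, 0, 5, 7],   # song 2 was played by users 2 and 3
]))

# Binarize: any positive play count becomes 1.
toy_binary = (toy_counts > 0).astype(np.int64)

# Entry (i, j) counts the users who played both song i and song j.
toy_intersections = (toy_binary @ toy_binary.T).toarray()
print(toy_intersections)
# [[2 1 1]
#  [1 2 0]
#  [1 0 2]]
```

The diagonal holds each song's own listener count, which is why the query functions have to drop the query song from its own result list.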

### Querying the k largest intersections

In [4]:
def query_intersection(song_name, return_k=10):
    idx = metadata[metadata.title == song_name].song.cat.codes.values[0]
    cand = all_pair_intersections[idx].argsort()[::-1][:return_k+1]
    cand = list(cand[cand != idx][:return_k])
    return [idx] + cand  # Put the query song itself at the top.
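The query sorts all 10,000 scores just to keep the top few. As an optional alternative, a partial sort with `np.argpartition` should give the same top-k set with less work. The helper name `top_k_excluding` below is hypothetical and this is only a sketch, not the notebook's method.

```python
import numpy as np

def top_k_excluding(scores, exclude_idx, k=10):
    """Return indices of the k largest scores, skipping exclude_idx (hypothetical helper)."""
    scores = np.asarray(scores, dtype=np.float64).copy()
    scores[exclude_idx] = -np.inf            # the query itself must never be picked
    k = min(k, scores.size - 1)
    if k <= 0:
        return np.array([], dtype=np.intp)
    # argpartition moves the k largest entries into the last k slots, unordered.
    top = np.argpartition(scores, -k)[-k:]
    # Sort only those k entries, highest score first.
    return top[np.argsort(scores[top])[::-1]]

print(top_k_excluding([5, 1, 9, 7, 3], exclude_idx=2, k=2))  # [3 0]
```

Inside `query_intersection` this could stand in for the two `argsort` lines, for example `top_k_excluding(all_pair_intersections[idx], idx, return_k)`. Tied scores may come out in a different order than with a full sort.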

In [5]:
metadata.loc[query_intersection('You Belong With Me')]

code,track,song,artist,title,style,genre
7330,TRJPXVB128F9316916,SOSROFB12AAF3B4C5D,Taylor Swift,You Belong With Me,Country_Traditional,Country
7761,TRSDRPY128F933E202,SOTWSXL12A8C143349,Taylor Swift,Love Story,,
7761,TROPUMP128F92EC162,SOTWSXL12A8C143349,Taylor Swift,Love Story,Country_Traditional,Country
5509,TROAQBZ128F9326213,SONYKOW12AB01849C9,OneRepublic,Secrets,Rock_Contemporary,Pop_Rock
4414,TRVSBTV12903CC6670,SOLFXKT12AB017E3E0,Charttraxx Karaoke,Fireflies,,
355,TRHKJNX12903CEFCDF,SOAXGDH12A8C13F8A1,Florence + The Machine,Dog Days Are Over (Radio Edit),,
8567,TRIEXMF128F92FDD60,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,Pop_Rock
8567,TRTEHXL128F931687B,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,
620,TRCPAGR128F423A01A,SOBOAFP12A8C131F36,Jason Mraz & Colbie Caillat,Lucky (Album Version),,
8065,TRSLDDC12903CC36E7,SOUSMXX12AB0185C24,Usher featuring will.i.am,OMG,,


In [6]:
metadata.loc[query_intersection('Fix You')]

code,track,song,artist,title,style,genre
8597,TRYVBMA128E0789D39,SOWEJXA12A6701C574,Coldplay,Fix You,,
1125,TRENTGL128E0780C8E,SOCVTLJ12A6310F0FD,Coldplay,Clocks,Rock_College,Pop_Rock
4114,TRQFXKD128E0780CAE,SOKLRPJ12A8C13C3FE,Coldplay,The Scientist,Pop_Contemporary,Pop_Rock
6284,TRIKGRK128E0780DB0,SOPXKYD12A6D4FA876,Coldplay,Yellow,Pop_Contemporary,Pop_Rock
6284,TRTZNQZ12903CD044C,SOPXKYD12A6D4FA876,Coldplay,Yellow,,
5509,TROAQBZ128F9326213,SONYKOW12AB01849C9,OneRepublic,Secrets,Rock_Contemporary,Pop_Rock
355,TRHKJNX12903CEFCDF,SOAXGDH12A8C13F8A1,Florence + The Machine,Dog Days Are Over (Radio Edit),,
8567,TRIEXMF128F92FDD60,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,Pop_Rock
8567,TRTEHXL128F931687B,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,
4414,TRVSBTV12903CC6670,SOLFXKT12AB017E3E0,Charttraxx Karaoke,Fireflies,,


In [7]:
metadata.loc[query_intersection('Welcome To The Black Parade (Album Version)')]

code,track,song,artist,title,style,genre
6562,TRKJHNE128F42380FC,SOQQAAQ12A67ADE34D,My Chemical Romance,Welcome To The Black Parade (Album Version),Grunge_Emo,Pop_Rock
4714,TRYZMOC128F423E58D,SOLXXZI12A8AE4733A,My Chemical Romance,Helena (So Long & Goodnight) (Album Version),Grunge_Emo,Pop_Rock
5509,TROAQBZ128F9326213,SONYKOW12AB01849C9,OneRepublic,Secrets,Rock_Contemporary,Pop_Rock
4414,TRVSBTV12903CC6670,SOLFXKT12AB017E3E0,Charttraxx Karaoke,Fireflies,,
6078,TRRJZWL128F146D790,SOPKPSQ12A58A7A5E4,My Chemical Romance,I'm Not Okay (I Promise) (Live From Sessions@AOL),Grunge_Emo,Pop_Rock
968,TRQTLTB128F92F785B,SOCKSGZ12A58A7CA4B,Linkin Park,Bleed It Out [Live At Milton Keynes],,
9932,TRNTALF128EF343800,SOZVCRW12A67ADA0B7,The Killers,When You Were Young,,
7935,TRRNFHH128F92D262D,SOUJVIT12A8C1451C1,Rise Against,Savior,Punk,Pop_Rock
1125,TRENTGL128E0780C8E,SOCVTLJ12A6310F0FD,Coldplay,Clocks,Rock_College,Pop_Rock
355,TRHKJNX12903CEFCDF,SOAXGDH12A8C13F8A1,Florence + The Machine,Dog Days Are Over (Radio Edit),,


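## Jaccard index

The results above show the same problem as in [the blog post](https://blog2.lucent.me/ml/music-recommender-systems-101-intro): popular songs keep ranking near the top regardless of how similar they actually are.
One way to mitigate this somewhat is the [Jaccard Index](https://en.wikipedia.org/wiki/Jaccard_index), which does not compare raw intersection sizes but instead uses
**(size of the intersection) / (size of the union)**. In other words, you can think of it as normalizing by the size of the sets.

Fortunately, Jaccard queries can also be computed easily from the intersection matrix.
Represent the number of listeners of each song as a vector, add it along both the rows and the columns of the matrix, and subtract the intersection: that gives the size of the union just as easily. (See: [numpy broadcasting](http://cs231n.github.io/python-numpy-tutorial/#numpy-broadcasting))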

In [8]:
play_count_binary_per_song = play_count_binary.sum(axis=1)
play_count_binary_per_song.shape

(10000, 1)

In [9]:
all_pair_union = play_count_binary_per_song - all_pair_intersections + play_count_binary_per_song.transpose()
all_pair_union.shape

(10000, 10000)

In [10]:
all_pair_jaccard = np.array(all_pair_intersections / all_pair_union)
all_pair_jaccard.shape

(10000, 10000)
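The three cells above rely on the identity |A ∪ B| = |A| + |B| - |A ∩ B| plus broadcasting. Here is a small sketch of the same steps on made-up data; `toy_binary` and the other names are hypothetical. Unlike the cells above, it also guards against an empty union, which would otherwise produce 0/0 for a song nobody has played.

```python
import numpy as np

# Hypothetical toy data: 3 songs x 4 users, already binarized.
toy_binary = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

intersections = toy_binary @ toy_binary.T            # shape (3, 3)
listeners = toy_binary.sum(axis=1, keepdims=True)    # shape (3, 1)

# |A| + |B| - |A ∩ B|: the (3, 1) and (1, 3) vectors broadcast to (3, 3).
unions = listeners + listeners.T - intersections

# Keep the score at 0 where the union is empty instead of dividing by zero.
jaccard = np.divide(intersections, unions,
                    out=np.zeros(unions.shape, dtype=np.float64),
                    where=unions > 0)
print(np.round(jaccard, 3))
```

Songs 0 and 1 share one listener out of three distinct listeners, so their score is 1/3, while songs 1 and 2 share none and score 0.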

### Jaccard index queries

Same as before: just sort and we're done.

In [11]:
def query_jaccard(song_name, return_k=10):
    idx = metadata[metadata.title == song_name].song.cat.codes.values[0]
    cand = all_pair_jaccard[idx].argsort()[::-1][:return_k+1]
    cand = list(cand[cand != idx][:return_k])
    return [idx] + cand

In [12]:
metadata.loc[query_jaccard('You Belong With Me')]

code,track,song,artist,title,style,genre
7330,TRJPXVB128F9316916,SOSROFB12AAF3B4C5D,Taylor Swift,You Belong With Me,Country_Traditional,Country
7761,TRSDRPY128F933E202,SOTWSXL12A8C143349,Taylor Swift,Love Story,,
7761,TROPUMP128F92EC162,SOTWSXL12A8C143349,Taylor Swift,Love Story,Country_Traditional,Country
977,TRGMZNT128F92DE267,SOCLMAD12AB017FC09,Taylor Swift,Tim McGraw,Country_Traditional,Country
2664,TRLVQME128F931BAF3,SOGTQNI12AB0184A5C,Owl City,Vanilla Twilight,Grunge_Emo,Pop_Rock
3418,TRBTNPR12903D13765,SOISNSU12AC468C0D8,Adam Lambert,If I Had You,,
2195,TRBNYBX128F422EC61,SOFRCGW12A81C21EA6,Plain White T's,Hey There Delilah,,
8691,TRLCLEM128F93402D3,SOWKQYL12AB0183B15,Jason Derulo,Whatcha Say,Hip_Hop_Rap,RnB
620,TRCPAGR128F423A01A,SOBOAFP12A8C131F36,Jason Mraz & Colbie Caillat,Lucky (Album Version),,
5676,TRMOYCC128C7196947,SOOJJCT12A6310E1C0,3 Doors Down,Here Without You,,Pop_Rock


**Listen**

- https://playlist.radio/en/t/TRJPXVB128F9316916/1
- https://playlist.radio/en/t/TRLVQME128F931BAF3/2
- https://playlist.radio/en/t/TRBTNPR12903D13765/3

In [13]:
metadata.loc[query_jaccard('Fix You')]

code,track,song,artist,title,style,genre
8597,TRYVBMA128E0789D39,SOWEJXA12A6701C574,Coldplay,Fix You,,
4114,TRQFXKD128E0780CAE,SOKLRPJ12A8C13C3FE,Coldplay,The Scientist,Pop_Contemporary,Pop_Rock
1125,TRENTGL128E0780C8E,SOCVTLJ12A6310F0FD,Coldplay,Clocks,Rock_College,Pop_Rock
6284,TRIKGRK128E0780DB0,SOPXKYD12A6D4FA876,Coldplay,Yellow,Pop_Contemporary,Pop_Rock
6284,TRTZNQZ12903CD044C,SOPXKYD12A6D4FA876,Coldplay,Yellow,,
7873,TRXWAZC128F9314B3E,SOUFPNI12A8C142D19,John Mayer,Heartbreak Warfare,Rock_Contemporary,Pop_Rock
620,TRCPAGR128F423A01A,SOBOAFP12A8C131F36,Jason Mraz & Colbie Caillat,Lucky (Album Version),,
4254,TRCBRTN12903CC4BD1,SOKUPAO12AB018D576,Paramore,The Only Exception (Album Version),,
8567,TRIEXMF128F92FDD60,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,Pop_Rock
8567,TRTEHXL128F931687B,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,


**Listen**

- https://playlist.radio/en/t/80cc0a6909/1
- https://playlist.radio/en/t/TRCPAGR128F423A01A/2
- https://playlist.radio/en/t/TRTEHXL128F931687B/3

In [14]:
metadata.loc[query_jaccard('Welcome To The Black Parade (Album Version)')]

code,track,song,artist,title,style,genre
6562,TRKJHNE128F42380FC,SOQQAAQ12A67ADE34D,My Chemical Romance,Welcome To The Black Parade (Album Version),Grunge_Emo,Pop_Rock
6078,TRRJZWL128F146D790,SOPKPSQ12A58A7A5E4,My Chemical Romance,I'm Not Okay (I Promise) (Live From Sessions@AOL),Grunge_Emo,Pop_Rock
4714,TRYZMOC128F423E58D,SOLXXZI12A8AE4733A,My Chemical Romance,Helena (So Long & Goodnight) (Album Version),Grunge_Emo,Pop_Rock
2583,TRHWCVS128F14895A3,SOGOZLT12A6D4FB302,My Chemical Romance,Teenagers (Album Version),Grunge_Emo,Pop_Rock
8774,TRWLOCY128F9338118,SOWPLVJ12AB0183586,Fall Out Boy,Sugar_ We're Goin Down,,
8774,TRSSSJW128F146D5DC,SOWPLVJ12AB0183586,Fall Out Boy,Sugar_ We're Goin Down,,
9415,TRSJBHU128F1496F3E,SOYIOZB12A58A797FC,Fall Out Boy,This Ain't A Scene_ It's An Arms Race,,
5882,TRMAFWC128F423E58F,SOOXMSN12A58A7A8D3,My Chemical Romance,To The End (Album Version),Grunge_Emo,Pop_Rock
3796,TRZIQMY128F146E70D,SOJRFWQ12AB0183582,Fall Out Boy,Dance_ Dance,,
3796,TRJQKHL128F9304295,SOJRFWQ12AB0183582,Fall Out Boy,Dance_ Dance,,


**Listen**

- https://playlist.radio/en/t/TROBZMA12903CF2DEB/1
- https://playlist.radio/en/t/TRWLOCY128F9338118/2
- https://playlist.radio/en/t/TRAKBYP128F42385F0/3

Now we get songs that can be called similar, at least in a broad sense.
There are still some out-of-place results, though, such as "Whatcha Say" showing up for "You Belong With Me".

## TF-IDF

As we saw in [the blog post](https://blog2.lucent.me/ml/music-recommender-systems-101-intro), let's compute weights with TF-IDF and then query the resulting vectors using cosine similarity.

In [15]:
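def tf_idf(matrix):
    # TF: square root of the play count, which dampens large play counts
    tf = matrix.sqrt()
    # IDF per song (row); minlength keeps one entry per row even if trailing rows are empty
    n_listeners = np.bincount(matrix.tocoo().row, minlength=matrix.shape[0])
    idf = 1 + np.log1p(matrix.shape[-1]) - np.log1p(n_listeners)
    return tf.multiply(idf.reshape(-1, 1)).tocsr()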

In [16]:
play_count_tfidf = tf_idf(play_count_matrix)
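To get a feel for the weights, here is a small worked example of the same two formulas on made-up numbers. The values of `n_users`, `listeners_per_song`, and `play_counts` are assumptions chosen only for illustration.

```python
import numpy as np

n_users = 4                                    # hypothetical total number of users
listeners_per_song = np.array([4, 2, 1])       # hypothetical listener counts per song
play_counts = np.array([9, 4, 1])              # hypothetical play counts by one user

# Same formulas as tf_idf: square-root TF and a log-based IDF.
tf = np.sqrt(play_counts)
idf = 1 + np.log1p(n_users) - np.log1p(listeners_per_song)

print(tf)
print(np.round(idf, 3))
```

A song that every user has played gets an IDF of exactly 1, and the weight grows as the song gets rarer (about 1.511 and 1.916 for the other two songs here), while nine plays only count three times as much as one.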

In [17]:
def cosine_similarity_sparse(v1, v2):
    return v1.dot(v2.transpose())[0, 0] / np.sqrt(v1.dot(v1.transpose())[0, 0]) / np.sqrt(v2.dot(v2.transpose())[0, 0])

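### TF-IDF queries

When comparing vectors this large, the sensible approach is to use [Approximate Nearest Neighbor](https://github.com/facebookresearch/pysparnn) search rather than comparing them one by one, but for now let's just do it the straightforward way.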

In [18]:
def query_tfidf(song_name, return_k=10):
    idx = metadata[metadata.title == song_name].song.cat.codes.values[0]
    target = play_count_tfidf[idx]
    sim = np.array([cosine_similarity_sparse(target, play_count_tfidf[i]) for i in range(play_count_tfidf.shape[0])], dtype=np.float64)
    cand = sim.argsort()[::-1][:return_k+1]
    cand = list(cand[cand != idx][:return_k])
    return [idx] + cand
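The Python loop in `query_tfidf` calls the similarity function once per song. As a sketch of a possible speed-up (not the notebook's code), all similarities to one row can be computed with a single sparse product. The helper name `cosine_to_all_rows` and the toy matrix are hypothetical.

```python
import numpy as np
import scipy.sparse as sparse

def cosine_to_all_rows(matrix, idx):
    """Cosine similarity between row idx and every row of a sparse matrix (hypothetical helper)."""
    matrix = sparse.csr_matrix(matrix, dtype=np.float64)
    # Euclidean norm of each row; empty rows get norm 1 so they score 0 instead of NaN.
    norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0
    # One sparse matrix-vector product yields the dot product with every row.
    dots = (matrix @ matrix[idx].T).toarray().ravel()
    return dots / (norms * norms[idx])

# Hypothetical toy rows: row 1 is a multiple of row 0, row 2 is orthogonal, row 3 is empty.
toy_rows = [[1, 0, 1], [2, 0, 2], [0, 3, 0], [0, 0, 0]]
toy_sims = cosine_to_all_rows(toy_rows, 0)
print(np.round(toy_sims, 3))
```

In `query_tfidf`, the list comprehension could then be replaced by something like `sim = cosine_to_all_rows(play_count_tfidf, idx)`, assuming the matrix fits comfortably in memory.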

In [19]:
metadata.loc[query_tfidf('You Belong With Me')]

code,track,song,artist,title,style,genre
7330,TRJPXVB128F9316916,SOSROFB12AAF3B4C5D,Taylor Swift,You Belong With Me,Country_Traditional,Country
7761,TRSDRPY128F933E202,SOTWSXL12A8C143349,Taylor Swift,Love Story,,
7761,TROPUMP128F92EC162,SOTWSXL12A8C143349,Taylor Swift,Love Story,Country_Traditional,Country
977,TRGMZNT128F92DE267,SOCLMAD12AB017FC09,Taylor Swift,Tim McGraw,Country_Traditional,Country
5241,TRFSCMH128F425CB85,SONHLJN12A81C2169B,Mickie Krause,Orange Trägt Nur Die Müllabfuhr (Go West),Pop_Contemporary,Pop_Rock
4414,TRVSBTV12903CC6670,SOLFXKT12AB017E3E0,Charttraxx Karaoke,Fireflies,,
6216,TROHFJK12903CC4BCE,SOPTLQL12AB018D56F,Travie McCoy,Billionaire [feat. Bruno Mars] (Explicit Albu...,,
8474,TRGCHLH12903CB7352,SOVWADY12AB0189C63,Miley Cyrus,Party In The U.S.A.,,Pop_Rock
620,TRCPAGR128F423A01A,SOBOAFP12A8C131F36,Jason Mraz & Colbie Caillat,Lucky (Album Version),,
5509,TROAQBZ128F9326213,SONYKOW12AB01849C9,OneRepublic,Secrets,Rock_Contemporary,Pop_Rock


**Listen**

- https://playlist.radio/en/t/TROBZMA12903CF2DEB/1
- https://www.youtube.com/watch?v=ou6SJe5vBrA
- https://playlist.radio/en/t/TROHFJK12903CC4BCE/2

In [20]:
metadata.loc[query_tfidf('Fix You')]

code,track,song,artist,title,style,genre
8597,TRYVBMA128E0789D39,SOWEJXA12A6701C574,Coldplay,Fix You,,
4114,TRQFXKD128E0780CAE,SOKLRPJ12A8C13C3FE,Coldplay,The Scientist,Pop_Contemporary,Pop_Rock
1125,TRENTGL128E0780C8E,SOCVTLJ12A6310F0FD,Coldplay,Clocks,Rock_College,Pop_Rock
6284,TRIKGRK128E0780DB0,SOPXKYD12A6D4FA876,Coldplay,Yellow,Pop_Contemporary,Pop_Rock
6284,TRTZNQZ12903CD044C,SOPXKYD12A6D4FA876,Coldplay,Yellow,,
7944,TRYNYSX128E07897B3,SOUKJBT12A6701C4D6,Coldplay,Speed Of Sound,Pop_Contemporary,Pop_Rock
5509,TROAQBZ128F9326213,SONYKOW12AB01849C9,OneRepublic,Secrets,Rock_Contemporary,Pop_Rock
4414,TRVSBTV12903CC6670,SOLFXKT12AB017E3E0,Charttraxx Karaoke,Fireflies,,
8567,TRIEXMF128F92FDD60,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,Pop_Rock
8567,TRTEHXL128F931687B,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,


**Listen**

- https://playlist.radio/en/t/80cc0a6909/1
- https://playlist.radio/en/t/TROAQBZ128F9326213/2
- https://playlist.radio/en/t/TRTEHXL128F931687B/3

In [21]:
metadata.loc[query_tfidf('Welcome To The Black Parade (Album Version)')]

code,track,song,artist,title,style,genre
6562,TRKJHNE128F42380FC,SOQQAAQ12A67ADE34D,My Chemical Romance,Welcome To The Black Parade (Album Version),Grunge_Emo,Pop_Rock
6078,TRRJZWL128F146D790,SOPKPSQ12A58A7A5E4,My Chemical Romance,I'm Not Okay (I Promise) (Live From Sessions@AOL),Grunge_Emo,Pop_Rock
4714,TRYZMOC128F423E58D,SOLXXZI12A8AE4733A,My Chemical Romance,Helena (So Long & Goodnight) (Album Version),Grunge_Emo,Pop_Rock
2583,TRHWCVS128F14895A3,SOGOZLT12A6D4FB302,My Chemical Romance,Teenagers (Album Version),Grunge_Emo,Pop_Rock
5882,TRMAFWC128F423E58F,SOOXMSN12A58A7A8D3,My Chemical Romance,To The End (Album Version),Grunge_Emo,Pop_Rock
9415,TRSJBHU128F1496F3E,SOYIOZB12A58A797FC,Fall Out Boy,This Ain't A Scene_ It's An Arms Race,,
7303,TRQFMAE128F92FA8F2,SOSPING12A58A7B4FF,Lou Reed & John Cale,I Believe (LP Version),,
3796,TRZIQMY128F146E70D,SOJRFWQ12AB0183582,Fall Out Boy,Dance_ Dance,,
3796,TRJQKHL128F9304295,SOJRFWQ12AB0183582,Fall Out Boy,Dance_ Dance,,
3796,TRYMVOW128F92ECE27,SOJRFWQ12AB0183582,Fall Out Boy,Dance_ Dance,,


**Listen**

- https://playlist.radio/en/t/TROBZMA12903CF2DEB/1
- https://www.youtube.com/watch?v=GNm5drtAQXs
- https://www.youtube.com/watch?v=bKyNkjRumqY

A complete mess. The original problem keeps occurring with hardly any mitigation, and as a result the recommendation quality is actually worse than with the Jaccard index.
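One observation that may help explain this, offered as a reading of the code rather than something the original post states: `tf_idf` multiplies each song's whole row by a single positive IDF value, and cosine similarity does not change when either vector is rescaled by a positive constant. Under that reading, the ranking would be driven by the square-root-damped play counts alone. The sketch below checks the invariance on made-up numbers; `tf`, `idf`, and `weighted` are hypothetical stand-ins.

```python
import numpy as np

def cosine(u, v):
    """Plain cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
tf = rng.random((3, 5))                 # hypothetical damped play counts: 3 songs x 5 users
idf = np.array([1.5, 4.0, 9.0])         # hypothetical positive per-song weights
weighted = tf * idf.reshape(-1, 1)      # row-wise scaling, as in tf_idf

# Scaling each row by its own positive constant leaves the cosine unchanged.
print(np.isclose(cosine(tf[0], tf[1]), cosine(weighted[0], weighted[1])))  # True
```

If the weights were instead applied per user (per column), they would change the direction of each song vector and could affect the ranking; whether that helps on this dataset is not something this sketch can tell.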

## Latent Semantic Analysis

This time, let's use SVD to uncover the structure of taste hidden in the sparse matrix.

In [22]:
from sklearn.decomposition import TruncatedSVD
def svd(matrix, K):
    transformer = TruncatedSVD(n_components=K, n_iter=10)
    return transformer.fit_transform(matrix)

In [23]:
play_count_lsa = svd(play_count_tfidf, K=50)
play_count_lsa.shape

(10000, 50)
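Each row of the result is a song's coordinates along the top K singular directions. The sketch below shows the idea with numpy's exact SVD on a tiny made-up matrix; `X`, `K`, and `embedding` are hypothetical names. `TruncatedSVD` computes an approximation of the same quantity, so its values can differ in sign and slightly in magnitude.

```python
import numpy as np

# Hypothetical toy matrix: 4 songs x 3 users of weighted play counts.
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.5],
    [0.0, 3.0, 0.0],
    [0.0, 1.0, 0.2],
])

# Thin SVD: X = U @ diag(s) @ Vt, with singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

K = 2
# Keep only the K strongest directions; each row is a K-dimensional song vector.
embedding = U[:, :K] * s[:K]

# Equivalent view: project every song onto the top K right singular vectors.
print(np.allclose(embedding, X @ Vt[:K].T))  # True
```

Songs with similar listening patterns end up pointing in similar directions in this low-dimensional space, which is what the cosine query below compares.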

In [24]:
def cosine_similarity_dense(v1, v2):
    return v1.dot(v2.transpose()) / np.sqrt(v1.dot(v1.transpose())) / np.sqrt(v2.dot(v2.transpose()))

### LSA queries

Nearest-neighbor search is again the usual choice for vectors like these, but with only 10,000 songs in 50 dimensions, we'll just compare them directly.

In [25]:
def query_lsa(song_name, return_k=10):
    idx = metadata[metadata.title == song_name].song.cat.codes.values[0]
    target = play_count_lsa[idx]
    sim = np.apply_along_axis(lambda v: cosine_similarity_dense(target, v), 1, play_count_lsa)
    cand = sim.argsort()[::-1][:return_k+1]
    cand = list(cand[cand != idx][:return_k])
    return [idx] + cand

In [26]:
metadata.loc[query_lsa('You Belong With Me')]

code,track,song,artist,title,style,genre
7330,TRJPXVB128F9316916,SOSROFB12AAF3B4C5D,Taylor Swift,You Belong With Me,Country_Traditional,Country
977,TRGMZNT128F92DE267,SOCLMAD12AB017FC09,Taylor Swift,Tim McGraw,Country_Traditional,Country
7761,TRSDRPY128F933E202,SOTWSXL12A8C143349,Taylor Swift,Love Story,,
7761,TROPUMP128F92EC162,SOTWSXL12A8C143349,Taylor Swift,Love Story,Country_Traditional,Country
747,TRIUFKY12903CE6EA6,SOBWSGV12AB018B5E0,Selena Gomez & The Scene,Naturally,,
2195,TRBNYBX128F422EC61,SOFRCGW12A81C21EA6,Plain White T's,Hey There Delilah,,
4968,TRHCDSY128F931692E,SOMPTCI12AB017C416,Taylor Swift,Forever & Always,Country_Traditional,Country
3418,TRBTNPR12903D13765,SOISNSU12AC468C0D8,Adam Lambert,If I Had You,,
5241,TRFSCMH128F425CB85,SONHLJN12A81C2169B,Mickie Krause,Orange Trägt Nur Die Müllabfuhr (Go West),Pop_Contemporary,Pop_Rock
6421,TRKQRVO128F92D630B,SOQGJZA12A8C1367AE,Katy Perry,Thinking Of You,,


**Listen**

- https://playlist.radio/en/t/TRJPXVB128F9316916/1
- https://playlist.radio/en/t/TRIUFKY12903CE6EA6/2
- https://playlist.radio/en/t/TRKQRVO128F92D630B/3

In [27]:
metadata.loc[query_lsa('Fix You')]

code,track,song,artist,title,style,genre
8597,TRYVBMA128E0789D39,SOWEJXA12A6701C574,Coldplay,Fix You,,
4114,TRQFXKD128E0780CAE,SOKLRPJ12A8C13C3FE,Coldplay,The Scientist,Pop_Contemporary,Pop_Rock
1125,TRENTGL128E0780C8E,SOCVTLJ12A6310F0FD,Coldplay,Clocks,Rock_College,Pop_Rock
6284,TRIKGRK128E0780DB0,SOPXKYD12A6D4FA876,Coldplay,Yellow,Pop_Contemporary,Pop_Rock
6284,TRTZNQZ12903CD044C,SOPXKYD12A6D4FA876,Coldplay,Yellow,,
7944,TRYNYSX128E07897B3,SOUKJBT12A6701C4D6,Coldplay,Speed Of Sound,Pop_Contemporary,Pop_Rock
4486,TRNGAAK128F147DF92,SOLJWIQ12A6D4FA875,Coldplay,Sparks,Pop_Contemporary,Pop_Rock
8567,TRIEXMF128F92FDD60,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,Pop_Rock
8567,TRTEHXL128F931687B,SOWCKVR12A8C142411,Kings Of Leon,Use Somebody,,
5509,TROAQBZ128F9326213,SONYKOW12AB01849C9,OneRepublic,Secrets,Rock_Contemporary,Pop_Rock


**Listen**

- https://playlist.radio/en/t/80cc0a6909/1
- https://playlist.radio/en/t/TRAALAH128E078234A/2
- https://www.youtube.com/watch?v=lZiNtbgm9oM

In [28]:
metadata.loc[query_lsa('Welcome To The Black Parade (Album Version)')]

code,track,song,artist,title,style,genre
6562,TRKJHNE128F42380FC,SOQQAAQ12A67ADE34D,My Chemical Romance,Welcome To The Black Parade (Album Version),Grunge_Emo,Pop_Rock
2583,TRHWCVS128F14895A3,SOGOZLT12A6D4FB302,My Chemical Romance,Teenagers (Album Version),Grunge_Emo,Pop_Rock
3796,TRZIQMY128F146E70D,SOJRFWQ12AB0183582,Fall Out Boy,Dance_ Dance,,
3796,TRJQKHL128F9304295,SOJRFWQ12AB0183582,Fall Out Boy,Dance_ Dance,,
3796,TRYMVOW128F92ECE27,SOJRFWQ12AB0183582,Fall Out Boy,Dance_ Dance,,
6078,TRRJZWL128F146D790,SOPKPSQ12A58A7A5E4,My Chemical Romance,I'm Not Okay (I Promise) (Live From Sessions@AOL),Grunge_Emo,Pop_Rock
9595,TRUNAXR128F93119C6,SOYWRLV12AB0186090,Panic At The Disco,I Write Sins Not Tragedies [Live In Chicago],,
8774,TRWLOCY128F9338118,SOWPLVJ12AB0183586,Fall Out Boy,Sugar_ We're Goin Down,,
8774,TRSSSJW128F146D5DC,SOWPLVJ12AB0183586,Fall Out Boy,Sugar_ We're Goin Down,,
9415,TRSJBHU128F1496F3E,SOYIOZB12A58A797FC,Fall Out Boy,This Ain't A Scene_ It's An Arms Race,,


**Listen**

- https://playlist.radio/en/t/TROBZMA12903CF2DEB/1
- https://www.youtube.com/watch?v=lrFLekQ2lMU
- https://playlist.radio/en/t/TRIAOSK128E0785308/3

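The results look similar to those from the Jaccard index and TF-IDF, but for "You Belong With Me" there is a clear improvement, given that Katy Perry and Selena Gomez now show up. The results for the other two songs, however, still show the same frustrating problems.
In the next post and notebook, we'll look at other methods that can address these issues.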