# Theory

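## Why recommender systems are needed
- They learn each user's tastes, which makes the service more convenient to use.
> This is why people keep staying on YouTube.
- They raise the purchase conversion rate.
> This is why Danggeun Market (당근마켓) stays so active.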

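## Types of recommender systems
- Content-based filtering (CB)<br>
  : "You watched 'My Mister' (나의 아저씨), and 'this' is a drama too..."
  > Recommends content of a similar kind
- Item-based collaborative filtering (CF)<br>
  : "People who liked 'My Mister' the way you did also liked 'this'..."
  > Recommends content based on users' tastes (behavior-based recommendation)
- Latent factor based collaborative filtering (LF)<br>
  : "People whose tastes resemble yours also liked 'this'..."
  > Uses space more efficiently than plain CF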


## Similarity measures
A similarity measure gives the machine a number that expresses how much two documents have in common.


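### Cosine similarity
Computed from the cosine of the angle between two vectors. The value is 1 when the vectors point in the same direction, approaches 0 as the angle approaches 90 degrees, and is -1 when they point in opposite directions. <br><br>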
$similarity = cos(\theta) = \frac{A\cdot B}{\parallel A \parallel \parallel B \parallel} = \frac { \sum_{i=1}^{n} A_i \times B_i} {\sqrt {\sum_{i=1}^{n} ( A_i)^2 }\times \sqrt {\sum_{i=1}^{n} ( B_i)^2}}$
<br><br>
![Dot product](http://localhost:8888/files/Desktop/Study/Python/6_Recommandation/image/cos1.jpg?_xsrf=2%7Cfd330b07%7C87d4fcc012a82ba8b2ba1b9de6c9e182%7C1646247877)
![](http://localhost:8888/files/Desktop/Study/Python/6_Recommandation/image/cos2.jpg?_xsrf=2%7Cfd330b07%7C87d4fcc012a82ba8b2ba1b9de6c9e182%7C1646247877)
<br>


In [11]:
from numpy import dot
from numpy.linalg import norm
import numpy as np
def cos_sim(A, B):
       return dot(A, B)/(norm(A)*norm(B))
    
doc1=np.array([0,1,1,1])
doc2=np.array([1,0,1,1])
doc3=np.array([2,0,2,2])

print(f"{doc1, doc2} :{cos_sim(doc1, doc2): 0.2f}") # cosine similarity of doc1 and doc2
print(f"{doc1, doc3}:{cos_sim(doc1, doc3): 0.2f}") # cosine similarity of doc1 and doc3
print(f"{doc2, doc3}:{cos_sim(doc2, doc3): 0.2f}") # cosine similarity of doc2 and doc3

(array([0, 1, 1, 1]), array([1, 0, 1, 1])) : 0.67
(array([0, 1, 1, 1]), array([2, 0, 2, 2])): 0.67
(array([1, 0, 1, 1]), array([2, 0, 2, 2])): 1.00


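doc1 has the same cosine similarity (0.67) to both doc2 and doc3.
doc2 and doc3 score 1.00, a perfect match: doc3 is simply doc2 multiplied by 2, and cosine similarity only checks whether two vectors point in the same direction, not how long they are.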
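
The three reference values of cosine similarity (same direction, perpendicular, opposite) can be checked with a small sketch. The vector `v` and the helper below are illustrative choices, not part of the original notebook. Count or rating vectors are non-negative, so in practice they only produce values between 0 and 1.

```python
import numpy as np

def cos_sim(a, b):
    # Cosine of the angle between a and b (undefined for zero vectors).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0])                         # illustrative vector
same = cos_sim(v, 3 * v)                         # same direction, different length
orthogonal = cos_sim(v, np.array([-2.0, 1.0]))   # dot product is 0
opposite = cos_sim(v, -v)                        # reversed direction

print(f"same={same:.2f}, orthogonal={orthogonal:.2f}, opposite={opposite:.2f}")
```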

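### Jaccard similarity
Given two sets A and B, it is the size of the intersection divided by the size of the union. The value lies between 0 and 1: it is 1 when the sets are identical and 0 when they share no elements.

$J(A,B)=\frac{|A \cap B|}{|A \cup B|}=\frac{|A \cap B|}{|A|+|B|-|A \cap B|}$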




In [25]:
doc1 = "apple banana everyone like likey watch card holder"
doc2 = "apple banana coupon passport love you"

# Tokenize
tokenized_doc1 = doc1.split()
tokenized_doc2 = doc2.split()
# Turn each document's tokens into a set, then compute the union and intersection

union = set(tokenized_doc1).union(set(tokenized_doc2))
print(f"|A∪B|= {len(union),union}") 

intersection = set(tokenized_doc1).intersection(set(tokenized_doc2))
print(f"|A∩B|= {len(intersection),intersection}")

print(f"{len(intersection)/len(union): 0.5f}")

|A∪B|= (12, {'coupon', 'everyone', 'passport', 'banana', 'like', 'card', 'holder', 'apple', 'love', 'likey', 'you', 'watch'})
|A∩B|= (2, {'banana', 'apple'})
 0.16667
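
The same calculation can be wrapped in a reusable helper. This is a minimal sketch: the function name `jaccard` and the choice to return 0.0 for two empty sets are assumptions made here, since the ratio is undefined in that case.

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both sets empty: the ratio is undefined; 0.0 is a convention chosen here.
        return 0.0
    return len(set_a & set_b) / len(union)

doc1 = "apple banana everyone like likey watch card holder".split()
doc2 = "apple banana coupon passport love you".split()

score = jaccard(doc1, doc2)
# Same value through |A| + |B| - |A ∩ B| in the denominator
common = len(set(doc1) & set(doc2))
score_alt = common / (len(set(doc1)) + len(set(doc2)) - common)
print(f"{score:.5f} {score_alt:.5f}")
```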


### Pearson correlation coefficient
A number that quantifies the linear relationship between two variables X and Y. It takes values between -1 and +1.

$R_{xy} = \frac {Cov_{xy}}{\sigma_x \sigma_y}$

In [58]:
doc1=np.array([0,1,1,1])
doc2=np.array([1,0,1,1])
doc3=np.array([2,0,2,2])

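def corr(A, B):
    # Use the same ddof for the covariance and both standard deviations:
    # np.cov defaults to ddof=1, while np.std defaults to ddof=0.
    return np.cov(A, B)[0][1] / (np.std(A, ddof=1) * np.std(B, ddof=1))


print(f"{doc1, doc2} :{corr(doc1, doc2): 0.2f}") # correlation coefficient of doc1 and doc2
print(f"{doc1, doc3}:{corr(doc1, doc3): 0.2f}") # correlation coefficient of doc1 and doc3
print(f"{doc2, doc3}:{corr(doc2, doc3): 0.2f}") # correlation coefficient of doc2 and doc3

(array([0, 1, 1, 1]), array([1, 0, 1, 1])) :-0.33
(array([0, 1, 1, 1]), array([2, 0, 2, 2])):-0.33
(array([1, 0, 1, 1]), array([2, 0, 2, 2])): 1.00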
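
As a cross-check, the formula can be compared against `np.corrcoef`, which returns the Pearson coefficient directly. The sketch below assumes the covariance and both standard deviations are computed with the same `ddof`. Mixing NumPy's defaults (`np.cov` uses `ddof=1`, `np.std` uses `ddof=0`) inflates the result by a factor of n/(n-1). The helper name `pearson` is introduced here for illustration.

```python
import numpy as np

doc1 = np.array([0, 1, 1, 1])
doc2 = np.array([1, 0, 1, 1])
doc3 = np.array([2, 0, 2, 2])

def pearson(a, b):
    # Sample covariance divided by sample standard deviations (all ddof=1).
    return float(np.cov(a, b, ddof=1)[0, 1] / (np.std(a, ddof=1) * np.std(b, ddof=1)))

r12 = pearson(doc1, doc2)
r12_numpy = float(np.corrcoef(doc1, doc2)[0, 1])  # built-in Pearson coefficient
r23 = pearson(doc2, doc3)                         # doc3 is doc2 scaled by 2

print(f"{r12:.4f} {r12_numpy:.4f} {r23:.4f}")
```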


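### Euclidean distance
The straight-line distance between two real-valued vectors.

$\overline{XY} = \sqrt {(x_1-y_1)^2 + (x_2-y_2)^2 + \dots + (x_n-y_n)^2 } = \sqrt { \sum_{i=1}^{n}(x_i-y_i)^2}$

When computing distances, take the units of each component into account, because large differences in scale between components strongly affect the result. For example, if the second component is measured on a much larger scale than the others, it dominates the distance, whereas if all three components are on comparable scales, each of them contributes.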
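
The effect of units can be illustrated with a small sketch. The three-component data below is hypothetical, with the second column deliberately on a much larger scale, and z-score standardization is one common remedy assumed here rather than something prescribed by the notes.

```python
import numpy as np

def euclidean(x, y):
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

# Hypothetical rows; column 1 is on a scale about 1000x larger than the others.
data = np.array([[1.0, 1000.0, 2.0],
                 [2.0, 3000.0, 1.0],
                 [3.0, 2000.0, 3.0]])

raw = euclidean(data[0], data[1])
share_raw = (data[0, 1] - data[1, 1]) ** 2 / raw ** 2      # fraction due to column 1

# Standardize each column (zero mean, unit variance), then measure again.
z = (data - data.mean(axis=0)) / data.std(axis=0)
scaled = euclidean(z[0], z[1])
share_scaled = (z[0, 1] - z[1, 1]) ** 2 / scaled ** 2

print(f"raw={raw:.2f} (column 1 share {share_raw:.3f})")
print(f"scaled={scaled:.2f} (column 1 share {share_scaled:.3f})")
```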

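### Mahalanobis distance

Used between two real-valued vectors. Compared with Euclidean distance, it has the advantage of being independent of units and of taking correlation between components into account. With $S$ the covariance matrix of the data that $x$ and $y$ come from, it is computed as follows.

$d(x,y) = \sqrt {(x - y)^T S^{-1} (x - y) }$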
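
A minimal NumPy sketch of the formula follows. The helper name `mahalanobis` and the two covariance matrices are illustrative assumptions. In practice $S$ would be estimated from data, for example with `np.cov(data, rowvar=False)`.

```python
import numpy as np

def mahalanobis(x, y, cov):
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # np.linalg.solve(cov, diff) computes S^{-1} (x - y) without forming the inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

x, y = [0.0, 0.0], [3.0, 4.0]

# With the identity matrix as S, the result equals the Euclidean distance.
d_identity = mahalanobis(x, y, np.eye(2))

# With variances 9 and 16, each component is rescaled by its standard deviation.
d_scaled = mahalanobis(x, y, np.diag([9.0, 16.0]))

print(f"{d_identity:.4f} {d_scaled:.4f}")
```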

# Implementation

In [45]:
import pandas as pd # DataFrame handling
import numpy as np # Pearson correlation via np.corrcoef

from sklearn.decomposition import TruncatedSVD # dimensionality reduction (latent factors)
# Similarity function: cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

In [46]:
ratings = pd.read_csv('./dataset/movielens100k/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [47]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100004.0,100004.0,100004.0,100004.0
mean,347.01131,12548.664363,3.543608,1129639000.0
std,195.163838,26369.198969,1.058064,191685800.0
min,1.0,1.0,0.5,789652000.0
25%,182.0,1028.0,3.0,965847800.0
50%,367.0,2406.5,4.0,1110422000.0
75%,520.0,5418.0,4.0,1296192000.0
max,671.0,163949.0,5.0,1476641000.0


In [48]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [49]:
movie_df = pd.read_csv('./dataset/movielens100k/movies.csv')
movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [50]:
ratings = pd.merge(ratings, movie_df, on = 'movieId')
ratings

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,31,2.5,1260759144,Dangerous Minds (1995),Drama
1,7,31,3.0,851868750,Dangerous Minds (1995),Drama
2,31,31,4.0,1273541953,Dangerous Minds (1995),Drama
3,32,31,4.0,834828440,Dangerous Minds (1995),Drama
4,36,31,3.0,847057202,Dangerous Minds (1995),Drama
...,...,...,...,...,...,...
99999,664,64997,2.5,1343761859,War of the Worlds (2005),Action|Sci-Fi
100000,664,72380,3.5,1344435977,"Box, The (2009)",Drama|Horror|Mystery|Sci-Fi|Thriller
100001,665,129,3.0,995232528,Pie in the Sky (1996),Comedy|Romance
100002,665,4736,1.0,1010197684,Summer Catch (2001),Comedy|Drama|Romance


In [51]:
ratings["userId"].unique()[:10]

array([  1,   7,  31,  32,  36,  39,  73,  88,  96, 110], dtype=int64)

In [52]:
len(ratings["userId"].unique()) # 671 users

671

In [53]:
len(ratings["movieId"].unique()) # rated 9066 movies; about 149 ratings per user (100004 / 671)

9066
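
The ratings-per-user figure and the density of the rating matrix follow from the counts above. The sketch below first shows the calculation on a small hypothetical table (`toy`), then plugs in the totals reported by this notebook (100004 ratings, 671 users, 9066 movies).

```python
import pandas as pd

# Hypothetical ratings table with the same column names as ratings.csv
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 40, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

n_users = toy["userId"].nunique()
n_movies = toy["movieId"].nunique()
ratings_per_user = len(toy) / n_users        # average number of ratings per user
density = len(toy) / (n_users * n_movies)    # share of user-movie cells that are filled

# Same formulas with the totals printed in the notebook
full_per_user = 100004 / 671
full_density = 100004 / (671 * 9066)
print(f"{ratings_per_user:.1f} {density:.2f} {full_per_user:.1f} {full_density:.4f}")
```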


In [55]:
# Build the item-user matrix (rows: titles, columns: users); unrated cells become 0
item_mtx= pd.pivot_table(data=ratings, values = 'rating', columns='userId',
                         index='title').fillna(0)
item_mtx

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"""Great Performances"" Cats (1998)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
$9.99 (2008),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Neath the Arizona Skies (1934),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
xXx (2002),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
xXx: State of the Union (2005),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
¡Three Amigos! (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
À nous la liberté (Freedom for Us) (1931),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
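
What the pivot does is easier to see on a tiny table. The data below is hypothetical. Note two things: `pd.pivot_table` averages duplicate (title, user) pairs by default, and `fillna(0)` makes "not rated" indistinguishable from a rating of 0.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, title) rating
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title":  ["A", "B", "A", "B", "C"],
    "rating": [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Rows become titles, columns become users, cells hold the rating
toy_mtx = pd.pivot_table(data=toy, values="rating",
                         index="title", columns="userId").fillna(0)
print(toy_mtx)
```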


## Item-based collaborative filtering (CF)
To recommend movies whose user ratings are similar, we use [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).

In [56]:
# Item-based CF: cosine similarity between every pair of movies' rating vectors
item_based= cosine_similarity(item_mtx)
item_cf = pd.DataFrame(data = item_based, index= item_mtx.index , columns= item_mtx.index)
item_cf

title,"""Great Performances"" Cats (1998)",$9.99 (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu (1964),Zulu (2013),[REC] (2007),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931),İtirazım Var (2014)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"""Great Performances"" Cats (1998)",1.000000,0.000000,0.0,0.164399,0.020391,0.0,0.014046,0.000000,0.000000,0.003166,...,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0000
$9.99 (2008),0.000000,1.000000,0.0,0.000000,0.000000,0.0,0.000000,0.079474,0.000000,0.156330,...,0.000000,0.000000,0.0,0.000000,0.0,0.013899,0.000000,0.058218,0.0,0.0000
'Hellboy': The Seeds of Creation (2004),0.000000,0.000000,1.0,0.000000,0.000000,1.0,0.000000,0.217357,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0000
'Neath the Arizona Skies (1934),0.164399,0.000000,0.0,1.000000,0.124035,0.0,0.085436,0.000000,0.000000,0.019259,...,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0000
'Round Midnight (1986),0.020391,0.000000,0.0,0.124035,1.000000,0.0,0.010597,0.143786,0.000000,0.136163,...,0.000000,0.000000,0.0,0.121567,0.0,0.000000,0.000000,0.000000,0.0,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
xXx (2002),0.000000,0.013899,0.0,0.000000,0.000000,0.0,0.000000,0.123940,0.000000,0.144961,...,0.161281,0.076029,0.0,0.017465,0.0,1.000000,0.152057,0.140222,0.0,0.2661
xXx: State of the Union (2005),0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.134815,...,0.000000,0.000000,0.0,0.000000,0.0,0.152057,1.000000,0.000000,0.0,0.0000
¡Three Amigos! (1986),0.000000,0.058218,0.0,0.000000,0.000000,0.0,0.081620,0.331663,0.214498,0.064908,...,0.112588,0.159223,0.0,0.166622,0.0,0.140222,0.000000,1.000000,0.0,0.0000
À nous la liberté (Freedom for Us) (1931),0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,1.0,0.0000


In [57]:
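# Return the 5 most similar movies for a title
def movie_based_cf(title):
    # Position 0 is normally the movie itself (similarity 1.0), so skip it
    return item_cf[title].sort_values(ascending=False)[1:6]

# Test
movie_based_cf("$9.99 (2008)")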

title
Kirikou and the Sorceress (Kirikou et la sorcière) (1998)    0.658145
Life is a Miracle (Zivot je cudo) (2004)                     0.658145
Wind Will Carry Us, The (Bad ma ra khahad bord) (1999)       0.658145
Head-On (Gegen die Wand) (2004)                              0.658145
Caramel (Sukkar banat) (2007)                                0.658145
Name: $9.99 (2008), dtype: float64
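
Slicing from position 1 assumes the queried movie sorts first. When another movie has an identical rating vector (similarity 1.0), that is not guaranteed. A possible variant, sketched on a hypothetical four-movie matrix, drops the queried title by label instead. The names `toy_mtx`, `sim`, and `top_similar` are introduced here for illustration.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item-user matrix: "A_copy" has exactly the same ratings as "A"
toy_mtx = pd.DataFrame(
    [[5, 4, 0], [5, 4, 0], [0, 1, 5], [4, 5, 1]],
    index=["A", "A_copy", "B", "C"], columns=[1, 2, 3], dtype=float,
)
sim = pd.DataFrame(cosine_similarity(toy_mtx),
                   index=toy_mtx.index, columns=toy_mtx.index)

def top_similar(title, k=2):
    # Remove the queried title by label, then take the k highest scores
    return sim[title].drop(title).sort_values(ascending=False).head(k)

result = top_similar("A")
print(result)
```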

## Latent factor based collaborative filtering (LF)
To recommend movies similar to a given movie, we use SVD.
* [Singular Value Decomposition](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)

In [58]:
# Build the latent factor matrix for the movies
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=12) # describe each of the 9064 titles with 12 latent components
svd_mtx = svd.fit_transform(item_mtx)
svd_mtx, svd_mtx.shape # 9064 titles, each reduced to 12 latent factors

(array([[ 1.22749118e-02,  2.50750723e-03,  1.55491022e-02, ...,
         -2.11516418e-02,  1.15716688e-02, -9.82116584e-03],
        [ 4.23038324e-01, -6.91303550e-03, -4.06776954e-01, ...,
          3.07284219e-02, -5.17607013e-01, -2.47386945e-01],
        [ 1.66327790e-01,  1.08319322e-01,  6.02884205e-02, ...,
          2.07044569e-01,  6.72679759e-02,  3.25703681e-04],
        ...,
        [ 8.51772029e+00, -3.31479444e+00,  9.03726294e-01, ...,
         -1.39859490e-01,  7.20581576e-01,  2.21434761e+00],
        [ 3.19207068e-01,  3.05400066e-01, -5.41655365e-01, ...,
         -2.52231456e-02, -5.86005359e-01, -1.60120225e-01],
        [ 1.04837791e-01,  2.96769593e-01, -2.18217738e-01, ...,
         -1.88351847e-01,  1.10677040e-01,  1.17621116e-01]]),
 (9064, 12))

Plain item-based CF compares movies across all 671 users, whereas LF works with only 12 latent factors per movie, <br>
so LF is more efficient than plain CF.
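
The size reduction can be seen on a small random matrix. This is only a sketch: the 50 x 20 matrix of integer ratings is synthetic, and `n_components=5` is an arbitrary choice. The `explained_variance_ratio_` attribute indicates how much of the original variance the retained components keep.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Synthetic item-user matrix: 50 items x 20 users, ratings 0..5
toy = rng.integers(0, 6, size=(50, 20)).astype(float)

svd = TruncatedSVD(n_components=5, random_state=0)
reduced = svd.fit_transform(toy)          # each item now has 5 values instead of 20

explained = float(svd.explained_variance_ratio_.sum())
print(toy.shape, "->", reduced.shape, f"explained variance: {explained:.2f}")
```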

In [59]:
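# Pearson correlation coefficient
corr = np.corrcoef(svd_mtx) # correlation between every pair of movies in the latent factor matrix
corr.shape

# List of titles, in the same order as the matrix rows
title_list = list(item_mtx.index) 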

In [60]:
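def simular_movies(movie):
    idx = title_list.index(movie)  # row index of the movie
    corr_movie = corr[idx]         # its correlation with every title
    threshold = np.quantile(corr_movie, 0.95)
    # keep titles above the 95th percentile (top 5% of the 9064 titles)
    return [t for t, c in zip(title_list, corr_movie) if c > threshold]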

simular_movies("$9.99 (2008)")

["'Round Midnight (1986)",
 '20 Feet from Stardom (Twenty Feet from Stardom) (2013)',
 '200 Motels (1971)',
 '3 Women (Three Women) (1977)',
 '3 Worlds of Gulliver, The (1960)',
 '45 Years (2015)',
 '99 Homes (2014)',
 'A Deadly Adoption (2015)',
 'A Most Violent Year (2014)',
 'A Very Murray Christmas (2015)',
 'Agony and the Ecstasy of Phil Spector, The (2009)',
 'Alice (1990)',
 'Alice in the Cities (Alice in den Stadten) (1974)',
 "Alice's Restaurant (1969)",
 'America, America (1963)',
 'Angel at My Table, An (1990)',
 'Apprenticeship of Duddy Kravitz, The (1974)',
 'April Love (1957)',
 'Artists and Models (1955)',
 'Baby Take a Bow (1934)',
 'Bachelorette (2012)',
 'Bad Timing: A Sensual Obsession (1980)',
 'Bambi Meets Godzilla (1969)',
 'Basic Instinct 2 (2006)',
 'Battered Bastards of Baseball, The (2014)',
 'Bed & Board (Domicile conjugal) (1970)',
 'Bedlam (1946)',
 'Best of Everything, The (1959)',
 'Beyond the Lights (2014)',
 'Big Combo, The (1955)',
 'Big Knife, The (19

In [61]:
len(simular_movies("$9.99 (2008)"))

454
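
A quantile threshold returns a fixed share of the catalogue (here 5% of the titles, hence several hundred movies) and includes the queried movie itself, whose correlation with itself is 1. If a short list is wanted instead, a top-N selection is one alternative. The sketch below uses synthetic latent factors, and every name in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 4))     # 100 hypothetical items, 4 latent factors
corr_toy = np.corrcoef(latent)         # item-item Pearson correlation, shape (100, 100)

query = 0
row = corr_toy[query]

# Quantile rule: everything above the 95th percentile (includes the query itself)
above = np.flatnonzero(row > np.quantile(row, 0.95))

# Top-N rule: sort by correlation, drop the query, keep the best 5
order = np.argsort(row)[::-1]
top5 = [int(i) for i in order if i != query][:5]

print(len(above), top5)
```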

# References

[Reference 1. Theory: recommender systems and CB/CF](https://lsjsj92.tistory.com/563) <br>
[Reference 2. Theory: latent factor collaborative filtering (LF)](https://lsjsj92.tistory.com/564)<br>
[Reference 3. CF and deep learning implementation example](https://www.kaggle.com/code/rajmehra03/cf-based-recsys-by-low-rank-matrix-factorization)<br>
[Reference 4. Similarity functions](https://forensics.tistory.com/49)