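# Association Rules Algorithm

Association analysis builds on three concepts: support, confidence, and lift.
Because the code uses movie data, the explanations below are framed in terms of a movie recommendation algorithm.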

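* <strong>Support</strong>: how often movie A and movie B are watched together, out of all viewing records.  

For example, if many viewers watch both '곡성' (The Wailing) and '매트릭스' (The Matrix), the support of that pair is high.  
Support measures how often two given movies are watched together and is used to gauge how strongly they are associated.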

$$\text{support}(A \rightarrow B) = { \text{number of times A and B are watched together} \over \text{total number of viewings} }$$

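* <strong>Confidence</strong>: among the people who watch one movie, the share who also watch a particular other movie.  

In other words, it is "the proportion of people who watched A who also watched B".   
For example, if a high share of the people who watch '곡성' (The Wailing) also watch '파묘' (Exhuma), we can say that viewers of '곡성' tend to watch '파묘'. Confidence measures how often the audience of one movie also watches another movie and is used to gauge how strongly they are associated.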

$$\text{confidence}(A \rightarrow B) = { \text{number of times A and B are watched together} \over \text{number of times A is watched} }$$

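* <strong>Lift</strong>: how much watching movie A raises the likelihood of watching movie B.  
Lift is the ratio between the share of A's viewers who also watch B and the share of all viewers who watch B.   

For example, if the share of '곡성' (The Wailing) viewers who watch '파묘' (Exhuma) is higher than the share of all viewers who watch '파묘', then viewers of '곡성' are more inclined than average to watch '파묘'.   
Lift measures how much more often two movies are watched together than would be expected if they were watched independently, and is used to gauge how strongly they are associated.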

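$$\text{lift}(A \rightarrow B) = { \text{confidence}(A \rightarrow B) \over \text{support}(B) } = { \text{number of times A and B are watched together} \times \text{total number of viewings} \over \text{number of times A is watched} \times \text{number of times B is watched} }$$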

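When lift is greater than 1, the two items are positively associated; when it is less than 1, they are negatively associated. The closer lift is to 1, the closer the two items are to being independent.

For example, a rule with a lift of 2 means that the two items occur together twice as often as they would if they were independent of each other.

These concepts are used to analyze product sales data, identify associations between products, and support decisions such as giving customers better recommendations or optimizing product placement.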
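
The three measures defined above can be checked by hand on a very small dataset. The following is a minimal sketch, assuming each transaction is the set of movies one viewer watched; the data and the helper names `support`, `confidence`, and `lift` are made up for illustration and are not part of the notebook's code.

```python
# Toy viewing histories: each set is one viewer's watched movies (made-up data).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence(A -> B) / support(B); a value near 1 suggests independence."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


print("support(A, B)    =", support({"A", "B"}, transactions))
print("confidence(A->B) =", confidence({"A"}, {"B"}, transactions))
print("lift(A->B)       =", lift({"A"}, {"B"}, transactions))
```

In this toy data A and B appear together in 3 of 5 histories, A in 4 and B in 4, so the lift comes out slightly below 1: the pair is watched together a little less often than independence would predict.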

In [1]:
# Add the parent folder to the import path
import sys; sys.path.insert(0, '..')

from util.data_loader import DataLoader
from util.metric_calculator import MetricCalculator

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load the MovieLens data
data_loader = DataLoader(num_users=1000, num_test_items=5, data_path='../data/ml-10M100K/')
movielens = data_loader.load()

In [3]:
movielens
# train data: [127830 rows x 8 columns]
# test data: [5000 rows x 8 columns]
# test_user2items: {1: [122, 362, 466, 520, 616], ..... 1053: [457, 2028, 1242, 2501, 5418]}
# item_content: [10681 rows x 4 columns]

Dataset(train=        user_id  movie_id  rating   timestamp  \
1           139       122     3.0   974302621   
2           149       122     2.5  1112342322   
3           182       122     3.0   943458784   
4           215       122     4.5  1102493547   
6           281       122     3.0   844437024   
...         ...       ...     ...         ...   
132825     1045     57949     0.5  1215617256   
132826     1045     58291     0.5  1215616991   
132827     1045     59306     3.0  1215617137   
132828     1045     60286     3.0  1215617037   
132829     1047     52952     4.0  1203399887   

                                      title                        genre  \
1                          Boomerang (1992)            [Comedy, Romance]   
2                          Boomerang (1992)            [Comedy, Romance]   
3                          Boomerang (1992)            [Comedy, Romance]   
4                          Boomerang (1992)            [Comedy, Romance]   
6                

In [4]:
movielens.train

,user_id,movie_id,rating,timestamp,title,genre,tag,rating_order
1,139,122,3.0,974302621,Boomerang (1992),"[Comedy, Romance]","[dating, nudity (topless - brief), can't remem...",408.0
2,149,122,2.5,1112342322,Boomerang (1992),"[Comedy, Romance]","[dating, nudity (topless - brief), can't remem...",84.0
3,182,122,3.0,943458784,Boomerang (1992),"[Comedy, Romance]","[dating, nudity (topless - brief), can't remem...",1104.0
4,215,122,4.5,1102493547,Boomerang (1992),"[Comedy, Romance]","[dating, nudity (topless - brief), can't remem...",320.0
6,281,122,3.0,844437024,Boomerang (1992),"[Comedy, Romance]","[dating, nudity (topless - brief), can't remem...",42.0
...,...,...,...,...,...,...,...,...
132825,1045,57949,0.5,1215617256,"Welcome Home, Roscoe Jenkins (2008)",[Comedy],,76.0
132826,1045,58291,0.5,1215616991,College Road Trip (2008),[Comedy],"[road trip, movie to see]",128.0
132827,1045,59306,3.0,1215617137,Prom Night (2008),"[Horror, Mystery, Thriller]",[remake],102.0
132828,1045,60286,3.0,1215617037,Finding Amanda (2008),"[Comedy, Drama]",,120.0


In [5]:
# Convert to a user x movie matrix
user_movie_matrix = movielens.train.pivot(index='user_id', columns='movie_id', values='rating')
user_movie_matrix

user_id / movie_id,1,2,3,4,5,6,7,8,9,10,...,62000,62113,62293,62344,62394,62801,62803,63113,63992,64716
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,1.0,,,,,,3.0,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1048,,,,,,,,,,,...,,,,,,,,,,
1050,,3.0,,,,3.0,,,,3.0,...,,,,,,,,,,
1051,5.0,,3.0,,3.0,,4.0,,,,...,,,,,,,,,,
1052,,,,,,,,,,,...,,,,,,,,,,


In [6]:
# To use the library, set ratings of 4 or higher to 1,
# and ratings below 4 and missing values to 0
user_movie_matrix[user_movie_matrix < 4] = 0
user_movie_matrix[user_movie_matrix.isnull()] = 0
user_movie_matrix[user_movie_matrix >= 4] = 1

user_movie_matrix

user_id / movie_id,1,2,3,4,5,6,7,8,9,10,...,62000,62113,62293,62344,62394,62801,62803,63113,63992,64716
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1050,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1051,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1052,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
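
The thresholding above amounts to recording, for each user, which movies they rated 4 or higher. The following is a minimal sketch of the same idea without pandas, assuming hypothetical `(user_id, movie_id, rating)` records; the data and the function name `to_binary_rows` are illustrative only.

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id, rating) records, not taken from MovieLens.
ratings = [
    (1, 10, 5.0), (1, 20, 3.5), (1, 30, 4.0),
    (2, 10, 4.5), (2, 30, 2.0),
    (3, 20, 4.0),
]

LIKE_THRESHOLD = 4.0  # same cut-off as the notebook: rating >= 4 becomes 1


def to_binary_rows(ratings, threshold=LIKE_THRESHOLD):
    """Return (movie_ids, {user_id: [0/1 per movie]}); unrated movies become 0."""
    movie_ids = sorted({movie for _, movie, _ in ratings})
    liked = defaultdict(set)
    for user, movie, rating in ratings:
        movies = liked[user]  # creates an empty set so every user gets a row
        if rating >= threshold:
            movies.add(movie)
    rows = {
        user: [int(movie in movies) for movie in movie_ids]
        for user, movies in sorted(liked.items())
    }
    return movie_ids, rows


movie_ids, rows = to_binary_rows(ratings)
print(movie_ids)
for user, row in rows.items():
    print(user, row)
```

Each row is one transaction in the association-analysis sense: the 1s mark the movies that user liked.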


In [7]:
# !pip install --user mlxtend

In [8]:
from mlxtend.frequent_patterns import apriori

# Show the movies (itemsets) whose support is at least 0.1
freq_movies = apriori(user_movie_matrix, min_support=0.1, use_colnames=True)

freq_movies.sort_values('support', ascending=False).head()



,support,itemsets
42,0.415,(593)
23,0.379,(318)
21,0.369,(296)
19,0.361,(260)
25,0.319,(356)
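
To make the frequent-itemset step less of a black box, here is a minimal sketch of a level-wise, Apriori-style search in plain Python. It is not mlxtend's implementation; the function name `frequent_itemsets` and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {frozenset: support}."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in current:
            support = sum(candidate <= t for t in transactions) / n
            if support >= min_support:
                level[candidate] = support
        result.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate with an infrequent k-subset (the Apriori property).
        size = len(next(iter(level))) + 1 if level else 0
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return result


# Made-up transactions: each set holds the movie ids one user liked.
toy_transactions = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 4}]
toy_result = frequent_itemsets(toy_transactions, min_support=0.5)
for itemset, support in sorted(toy_result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Raising `min_support` shrinks the result quickly, which is why the threshold has such a strong effect on which rules can be generated later.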


In [9]:
# Check the title of movie_id=593 (The Silence of the Lambs)
movielens.item_content[movielens.item_content.movie_id == 593]


,movie_id,title,genre,tag
587,593,"Silence of the Lambs, The (1991)","[Crime, Horror, Thriller]","[based on a book, anthony hopkins, demme, psyc..."


In [10]:
from mlxtend.frequent_patterns import association_rules
# Compute the association rules (shown in descending order of lift)

rules = association_rules(freq_movies, metric='lift', min_threshold=1) # freq_movies: itemsets whose support is at least 0.1

rules.sort_values('lift', ascending=False).head()[['antecedents', 'consequents', 'lift']] # sort by lift, highest first


,antecedents,consequents,lift
649,(4993),(5952),5.45977
648,(5952),(4993),5.45977
1462,"(1196, 1198)","(1291, 260)",4.669188
1463,"(1291, 260)","(1196, 1198)",4.669188
1460,"(1291, 1196)","(260, 1198)",4.171359
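
The table above lists each pair in both directions with the same lift, which follows from lift being symmetric in A and B. The following is a minimal sketch of deriving rules from itemset supports and ranking them by lift; it is not mlxtend's code, and `rules_by_lift` and the support values are made up for illustration.

```python
from itertools import combinations


def rules_by_lift(supports, min_lift=1.0):
    """Derive rules A -> B from {frozenset: support}, sorted by lift (descending)."""
    rules = []
    for itemset, joint in supports.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                consequent = itemset - antecedent
                # Subsets of a frequent itemset should be frequent too, but
                # skip the rule if `supports` happens to be incomplete.
                if antecedent not in supports or consequent not in supports:
                    continue
                lift = joint / (supports[antecedent] * supports[consequent])
                if lift >= min_lift:
                    rules.append((antecedent, consequent, lift))
    return sorted(rules, key=lambda rule: rule[2], reverse=True)


# Made-up supports for three movies and two pairs.
toy_supports = {
    frozenset(["A"]): 0.4,
    frozenset(["B"]): 0.5,
    frozenset(["C"]): 0.5,
    frozenset(["A", "B"]): 0.3,  # lift = 0.3 / (0.4 * 0.5), about 1.5
    frozenset(["A", "C"]): 0.1,  # lift = 0.1 / (0.4 * 0.5), about 0.5, filtered out
}
toy_rules = rules_by_lift(toy_supports)
for antecedent, consequent, lift in toy_rules:
    print(sorted(antecedent), "->", sorted(consequent), round(lift, 3))
```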


In [11]:
# Recommend with association rules
from src.association import AssociationRecommender

recommender = AssociationRecommender()
recommend_result = recommender.recommend(movielens)
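
The implementation of `AssociationRecommender` is not shown in this notebook, so the following is only a guess at one way rules could be turned into recommendations: a rule "fires" when the user liked every movie in its antecedent, each fired rule votes for its consequent movies, and the most-voted unseen movies are returned. The function name `recommend_from_rules`, its parameters, and the data are all hypothetical.

```python
from collections import Counter


def recommend_from_rules(rules, liked_movies, seen_movies, top_k=10):
    """rules: iterable of (antecedent, consequent, lift) with frozenset sides.

    Hypothetical scheme: count how many fired rules point at each unseen movie.
    """
    liked = set(liked_movies)
    seen = set(seen_movies)
    votes = Counter()
    for antecedent, consequent, _lift in rules:
        if antecedent <= liked:  # the user liked every antecedent movie
            for movie in consequent:
                if movie not in seen:
                    votes[movie] += 1
    return [movie for movie, _ in votes.most_common(top_k)]


# Made-up rules and user history.
toy_rules_for_user = [
    (frozenset([1]), frozenset([2]), 2.0),
    (frozenset([3]), frozenset([2]), 1.5),
    (frozenset([1, 3]), frozenset([4]), 3.0),
    (frozenset([5]), frozenset([6]), 4.0),  # does not fire: movie 5 was not liked
    (frozenset([1]), frozenset([3]), 1.2),  # fires, but movie 3 is already seen
]
toy_recommendation = recommend_from_rules(
    toy_rules_for_user, liked_movies=[1, 3], seen_movies=[1, 3]
)
print(toy_recommendation)
```

Other designs are equally plausible, for example weighting votes by lift or using only a user's most recent ratings; the notebook does not say which one the actual class uses.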



# Evaluation

In [12]:
# Create a MetricCalculator object to compute the evaluation metrics.
metric_calculator = MetricCalculator()

# Use the calc method to evaluate the recommendation results.
# The evaluation here is based on the movielens dataset.

# The actual ratings in the test set (movielens.test.rating) and the recommended ratings (recommend_result.rating) are converted to lists.
# The per-user item dictionaries of the movielens dataset and of the recommendation results are passed as well.

# k=10 means the evaluation uses the Top-K items with K=10.
metrics = metric_calculator.calc(
    movielens.test.rating.tolist(), recommend_result.rating.tolist(),
    movielens.test_user2items, recommend_result.user2items, k=10)

# Print the computed metrics.
print(metrics)

rmse=0.000, Precision@K=0.011, Recall@K=0.036
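
For readers unfamiliar with the ranking metrics printed above, here is a minimal sketch of the usual definitions of Precision@K and Recall@K, averaged over users. The internals of `MetricCalculator` are not shown in this notebook, so details such as which test items count as relevant may differ; the function names and data below are illustrative assumptions.

```python
def precision_at_k(true_items, pred_items, k):
    """Share of the top-k recommended items that are relevant."""
    if k == 0:
        return 0.0
    hits = len(set(true_items) & set(pred_items[:k]))
    return hits / k


def recall_at_k(true_items, pred_items, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    if not true_items or k == 0:
        return 0.0
    hits = len(set(true_items) & set(pred_items[:k]))
    return hits / len(true_items)


def mean_over_users(metric, true_user2items, pred_user2items, k):
    """Average a per-user metric; users without recommendations score 0."""
    users = list(true_user2items)
    total = sum(
        metric(true_user2items[user], pred_user2items.get(user, []), k)
        for user in users
    )
    return total / len(users)


# Made-up ground truth and recommendations for two users.
toy_true = {1: [10, 20], 2: [30]}
toy_pred = {1: [10, 99, 20], 2: [40, 50, 60]}
print("Precision@2 =", mean_over_users(precision_at_k, toy_true, toy_pred, k=2))
print("Recall@2    =", mean_over_users(recall_at_k, toy_true, toy_pred, k=2))
```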



In [13]:
# Specify the minimum support values (min_support) as a list.
# Run the recommender for each minimum support value and evaluate the result.
for min_support in [0.06, 0.07, 0.08, 0.09, 0.1, 0.11]:
    # Run the recommender and get its result.
    recommend_result = recommender.recommend(movielens, min_support=min_support)
    
    # Compute the evaluation metrics from the recommendation result.
    # The evaluation here is based on the movielens dataset.
    
    # The actual ratings in the test set (movielens.test.rating) and the recommended ratings (recommend_result.rating) are converted to lists.
    # The per-user item dictionaries of the movielens dataset and of the recommendation results are passed as well.
    
    # k=10 means the evaluation uses the Top-K items with K=10.
    metrics = metric_calculator.calc(
        movielens.test.rating.tolist(), recommend_result.rating.tolist(),
        movielens.test_user2items, recommend_result.user2items, k=10)
    
    # Print the computed metrics.
    print(metrics)


rmse=0.000, Precision@K=0.015, Recall@K=0.048
rmse=0.000, Precision@K=0.014, Recall@K=0.042
rmse=0.000, Precision@K=0.014, Recall@K=0.043
rmse=0.000, Precision@K=0.013, Recall@K=0.040
rmse=0.000, Precision@K=0.011, Recall@K=0.036
rmse=0.000, Precision@K=0.010, Recall@K=0.034
