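## Assignment 3: User-based collaborative filtering recommender system
- Uses the KMRD dataset
- For a given user (user_id), refer to the high_corr users with the highest correlation and recommend n_items movies
- Modify the program 4_KNN_bias_CF.ipynb (switch the similarity measure to the correlation coefficient, extract the high_corr most highly correlated users, recommend n_items movies)
    - `recommender(user=2, n_items=5, high_corr=10)`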

In [145]:
import pandas as pd
import numpy as np

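## Recommend n_items movies to a user (user_id) based on the high_corr most highly correlated users
- The candidates are the movies preferred by other users whose rating patterns correlate most strongly with the target user's
- n_items: the number of movies to recommend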

### Data preprocessing

In [187]:
# File paths
movies_file_path = '/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/University/Ulsan university/Electrical & Electronic Engineering/4학년/2학기/Ai추천시스템/과제/kmrd/build/lib/kmr_dataset/datafile/kmrd-small/movies.txt'
peoples_file_path = '/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/University/Ulsan university/Electrical & Electronic Engineering/4학년/2학기/Ai추천시스템/과제/kmrd/build/lib/kmr_dataset/datafile/kmrd-small/peoples.txt'
countries_file_path = '/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/University/Ulsan university/Electrical & Electronic Engineering/4학년/2학기/Ai추천시스템/과제/kmrd/build/lib/kmr_dataset/datafile/kmrd-small/countries.csv'
genres_file_path = '/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/University/Ulsan university/Electrical & Electronic Engineering/4학년/2학기/Ai추천시스템/과제/kmrd/build/lib/kmr_dataset/datafile/kmrd-small/genres.csv'
castings_file_path = '/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/University/Ulsan university/Electrical & Electronic Engineering/4학년/2학기/Ai추천시스템/과제/kmrd/build/lib/kmr_dataset/datafile/kmrd-small/castings.csv'
rates_file_path = '/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/University/Ulsan university/Electrical & Electronic Engineering/4학년/2학기/Ai추천시스템/과제/kmrd/kmr_dataset/datafile/kmrd-small/rates.csv'

In [188]:
m_cols = ['movie', 'title', 'title_eng', 'year', 'grade']
movies = pd.read_csv(movies_file_path, sep='\t', names=m_cols, skiprows=1)
movies.head()

index,movie,title,title_eng,year,grade
0,10001,시네마 천국,"Cinema Paradiso , 1988",2013.0,전체 관람가
1,10002,빽 투 더 퓨쳐,"Back To The Future , 1985",2015.0,12세 관람가
2,10003,빽 투 더 퓨쳐 2,"Back To The Future Part 2 , 1989",2015.0,12세 관람가
3,10004,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",1990.0,전체 관람가
4,10005,스타워즈 에피소드 4 - 새로운 희망,"Star Wars , 1977",1997.0,PG


In [189]:
# Keep only movie and title; drop the remaining columns
movies = movies.drop(columns=['title_eng', 'year', 'grade'])
movies

index,movie,title
0,10001,시네마 천국
1,10002,빽 투 더 퓨쳐
2,10003,빽 투 더 퓨쳐 2
3,10004,빽 투 더 퓨쳐 3
4,10005,스타워즈 에피소드 4 - 새로운 희망
...,...,...
994,10995,공포의 여정
995,10996,버스틴 루즈
996,10997,블랙 엔젤
997,10998,폭주 기관차


In [149]:
# Load the ratings data
r_cols = ['user', 'movie', 'rate', 'time']
ratings = pd.read_csv(rates_file_path, sep=',', names=r_cols, skiprows=1)
ratings.head()

index,user,movie,rate,time
0,0,10003,7,1494128040
1,0,10004,7,1467529800
2,0,10018,9,1513344120
3,0,10021,9,1424497980
4,0,10022,7,1427627340


In [150]:
# Drop the time column
ratings = ratings.drop('time', axis=1)
ratings.head()

index,user,movie,rate
0,0,10003,7
1,0,10004,7
2,0,10018,9
3,0,10021,9
4,0,10022,7


In [151]:
ratings.shape

(140710, 3)

In [152]:
duplicates = ratings.duplicated(subset=['user', 'movie'], keep=False)

# Use boolean indexing to show only the duplicated rows
ratings[duplicates]

index,user,movie,rate
273,7,10936,10
274,7,10936,10
305,9,10758,10
306,9,10758,10
308,9,10970,10
...,...,...,...
139204,50530,10936,10
139687,51009,10962,1
139688,51009,10962,10
139953,51273,10962,10


#### In the ratings DataFrame, keep only the highest rating when a user rated the same movie more than once

In [153]:
# Keep only the row with the highest rating (rate) for each (user, movie) pair
ratings = ratings.sort_values('rate', ascending=False)
ratings = ratings.drop_duplicates(subset=['user', 'movie'], keep='first').reset_index(drop=True)

In [154]:
ratings.shape

(134331, 3)

### Train/test split

In [155]:
from sklearn.model_selection import train_test_split
X = ratings.copy()
y = ratings['user']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state= 42)

- Error raised when `stratify=y` was used for the split
    - `ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.`
    - Every class of the target variable `y` passed to `stratify` needs at least two members, but at least one user here has only a single rating, so a stratified split is not possible. The split above therefore omits `stratify`.
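
One possible way around the error, sketched on hypothetical toy data rather than the KMRD ratings, is to keep only users with at least two ratings before stratifying. The names `toy`, `counts`, and `eligible` are illustrative and do not appear in the notebook, and dropping single-rating users is an assumption about what is acceptable for the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy ratings: user 3 has only a single rating.
toy = pd.DataFrame({
    "user":  [1, 1, 1, 1, 2, 2, 2, 2, 3],
    "movie": [10, 11, 12, 13, 10, 11, 12, 13, 10],
    "rate":  [9, 8, 7, 10, 6, 5, 9, 8, 10],
})

# Keep only users with at least two ratings so every class can be split.
counts = toy["user"].value_counts()
eligible = toy[toy["user"].map(counts) >= 2]

# With single-rating users removed, stratifying on the user ID works.
train, test = train_test_split(
    eligible, test_size=0.25, stratify=eligible["user"], random_state=42
)
```

The trade-off is that users removed this way never appear in either split, so they cannot receive recommendations from the resulting model.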
    

### Full matrix of the train data (user x movie x rating)
- Pivot table: a summary table that condenses a large table of data

In [156]:
# Build the full matrix from the train data
rating_matrix = X_train.pivot(index='user', columns='movie', values='rate')
rating_matrix.head(10)

user,10001,10002,10003,10004,10005,10006,10007,10008,10009,10011,...,10978,10979,10980,10981,10982,10983,10985,10988,10994,10998
0,,,7.0,7.0,,,,,,,...,,,7.0,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,10.0,,,,,,,,,,...,,,,,,,,,,
3,,,,,10.0,,9.0,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,8.0
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,10.0,,,,,,,,,...,,,,,,,,,,
8,10.0,10.0,,,,,,,,,...,,,,,,,,,,
9,10.0,,,,,,,,,,...,,,,,,,,,,


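- `ValueError: Index contains duplicate entries, cannot reshape`: when a user has rated the same movie more than once, the (user, movie) index contains duplicates and `pivot` cannot build the table
    - The duplicates have to be resolved so that each user has exactly one rating per movie; this notebook does so in the preprocessing step above by keeping the highest rating
    - The pivot table is then built again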
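
The error and two ways of resolving it can be reproduced on a small example. This is a minimal sketch on hypothetical toy data; `toy`, `matrix_max`, and `matrix_mean` are illustrative names. Option 1 mirrors the keep-the-highest-rating step used in this notebook, and option 2 shows averaging as an alternative.

```python
import pandas as pd

# Hypothetical toy ratings with one duplicated (user, movie) pair.
toy = pd.DataFrame({
    "user":  [7, 7, 9],
    "movie": [10936, 10936, 10758],
    "rate":  [10, 8, 10],
})

# pivot() refuses to reshape when the (index, columns) pairs are not unique.
try:
    toy.pivot(index="user", columns="movie", values="rate")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Option 1: keep the highest rating per (user, movie) pair, then pivot.
deduped = (toy.sort_values("rate", ascending=False)
              .drop_duplicates(subset=["user", "movie"], keep="first"))
matrix_max = deduped.pivot(index="user", columns="movie", values="rate")

# Option 2: let pivot_table aggregate the duplicates, here with the mean.
matrix_mean = toy.pivot_table(index="user", columns="movie",
                              values="rate", aggfunc="mean")
```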

### User x user correlation coefficients across all users
- The correlation is computed from the ratings that each user and the other users gave to movies

In [157]:
matrix_dummy = rating_matrix.copy().fillna(0)  # replace NaN with 0; the matrix must not contain NaN

# Column x column correlation: the columns are movies, so this is a movie x movie matrix
movie_similarity = matrix_dummy.corr(method='pearson')
movie_similarity

movie,10001,10002,10003,10004,10005,10006,10007,10008,10009,10011,...,10978,10979,10980,10981,10982,10983,10985,10988,10994,10998
10001,1.000000,0.068047,0.058454,0.059170,0.051467,0.044037,0.049286,0.019231,0.032061,0.010774,...,0.011767,0.004665,0.044868,-0.003744,-0.001166,0.009183,-0.002222,0.000258,0.015157,0.028751
10002,0.068047,1.000000,0.269646,0.253202,0.079013,0.065996,0.065427,0.059018,0.042432,0.030339,...,0.007527,0.003844,0.061099,-0.000124,-0.001073,0.009936,0.013491,-0.003669,0.035181,0.025494
10003,0.058454,0.269646,1.000000,0.414538,0.081917,0.084829,0.073971,0.048387,0.034988,0.033871,...,-0.001221,0.004764,0.078843,0.022983,-0.000628,0.001613,0.024918,-0.002146,0.012612,0.027945
10004,0.059170,0.253202,0.414538,1.000000,0.082654,0.091197,0.079287,0.059296,0.056718,0.056293,...,0.016368,0.007400,0.087845,0.019448,-0.000523,0.014318,-0.000997,-0.001787,0.028702,0.037640
10005,0.051467,0.079013,0.081917,0.082654,1.000000,0.411955,0.362275,0.067296,0.100693,0.066162,...,0.029716,0.007457,0.067387,0.013409,-0.000607,-0.003927,0.026194,0.010687,0.017423,0.014315
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10983,0.009183,0.009936,0.001613,0.014318,-0.003927,-0.003546,-0.003307,0.007140,-0.001365,0.013945,...,-0.000296,-0.001665,-0.002467,-0.000488,-0.000152,1.000000,-0.000290,-0.000519,0.050276,0.040749
10985,-0.002222,0.013491,0.024918,-0.000997,0.026194,-0.001045,-0.000975,-0.000592,0.081924,-0.000286,...,-0.000087,-0.000491,-0.000727,-0.000144,-0.000045,-0.000290,1.000000,-0.000153,-0.000268,-0.000447
10988,0.000258,-0.003669,-0.002146,-0.001787,0.010687,0.012099,0.013453,-0.001062,0.037691,-0.000513,...,-0.000156,-0.000880,-0.001303,-0.000258,-0.000080,-0.000519,-0.000153,1.000000,-0.000480,-0.000801
10994,0.015157,0.035181,0.012612,0.028702,0.017423,0.005831,0.021426,0.012970,0.043828,0.011792,...,-0.000273,-0.001540,0.035914,-0.000452,-0.000141,0.050276,-0.000268,-0.000480,1.000000,0.018248


- The matrix above is the Pearson correlation matrix between movies, not between users.
    - It shows numerically how each movie relates to every other movie.

In [158]:
user_similarity=matrix_dummy.T.corr(method='pearson')
user_similarity

user,0,1,2,3,4,5,6,7,8,9,...,52015,52016,52017,52018,52020,52021,52023,52024,52025,52026
0,1.000000,0.113817,0.073553,0.003146,0.035033,0.113817,0.081815,0.047065,0.142819,0.075568,...,-0.013444,-0.013444,-0.013444,-0.013444,-0.013444,-0.013444,-0.013444,-0.013444,-0.013444,-0.013444
1,0.113817,1.000000,-0.005454,0.149564,0.143156,1.000000,-0.002921,0.513434,0.242796,-0.006598,...,-0.001684,-0.001684,-0.001684,-0.001684,-0.001684,-0.001684,-0.001684,-0.001684,-0.001684,-0.001684
2,0.073553,-0.005454,1.000000,0.025202,-0.004463,-0.005454,-0.009462,-0.007488,0.180079,0.077217,...,-0.005454,-0.005454,-0.005454,-0.005454,-0.005454,-0.005454,-0.005454,-0.005454,-0.005454,-0.005454
3,0.003146,0.149564,0.025202,1.000000,0.012908,0.149564,-0.023000,0.065631,0.039255,0.028403,...,-0.013257,-0.013257,-0.013257,-0.013257,-0.013257,-0.013257,-0.013257,-0.013257,-0.013257,-0.013257
4,0.035033,0.143156,-0.004463,0.012908,1.000000,0.143156,-0.022042,0.062806,0.030809,0.092745,...,0.143156,0.143156,0.143156,0.143156,0.143156,0.143156,0.143156,0.143156,0.143156,0.143156
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52021,-0.013444,-0.001684,-0.005454,-0.013257,0.143156,-0.001684,-0.002921,-0.002311,-0.005851,-0.006598,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
52023,-0.013444,-0.001684,-0.005454,-0.013257,0.143156,-0.001684,-0.002921,-0.002311,-0.005851,-0.006598,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
52024,-0.013444,-0.001684,-0.005454,-0.013257,0.143156,-0.001684,-0.002921,-0.002311,-0.005851,-0.006598,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
52025,-0.013444,-0.001684,-0.005454,-0.013257,0.143156,-0.001684,-0.002921,-0.002311,-0.005851,-0.006598,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000


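- The matrix above holds the Pearson correlation coefficients that show how similarly each user rates movies compared with every other user; transposing the matrix first makes the users the columns that `corr` compares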
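
The difference between the two calls can be checked on a small matrix. This is a minimal sketch with hypothetical ratings; `toy`, `item_corr`, and `user_corr` are illustrative names. Because `fillna(0)` treats an unrated movie as a rating of 0, users who rated the same few movies end up with a correlation close to 1, which likely explains the blocks of 1.000000 in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix; NaN means "not rated".
toy = pd.DataFrame(
    [[10, 8, np.nan],
     [9, 7, np.nan],
     [np.nan, 2, 9]],
    index=pd.Index([0, 1, 2], name="user"),
    columns=pd.Index([10001, 10002, 10003], name="movie"),
)
filled = toy.fillna(0)

# corr() compares columns, so the orientation decides what is compared.
item_corr = filled.corr(method="pearson")    # movie x movie
user_corr = filled.T.corr(method="pearson")  # user x user
```

Users 0 and 1 rated the same two movies similarly and correlate strongly, while user 2, who rated a different movie highly, correlates negatively with user 0.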

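### Find the top high_corr users for a given user and generate movie recommendations
- Extract the users with the highest correlation to the target user and recommend based on the items they prefer

- `get_high_corr_users()`: finds the users most highly correlated with a given target_user_id in user_similarity_matrix
- `recommend_items()`: builds the recommendation list from the items that the highly correlated users rated highly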

In [159]:
def get_high_corr_users(user_similarity_matrix, target_user_id, top_n=10):
    # Correlations between target_user and every other user
    correlations = user_similarity_matrix[target_user_id]
    # IDs of the top-N users by correlation; the target is dropped explicitly
    # because a tie at 1.0 could otherwise sort another user ahead of it
    high_corr_users = correlations.drop(target_user_id).sort_values(ascending=False).head(top_n)
    return high_corr_users.index

def recommend_items(user_ratings, high_corr_users):
    # Ratings given by the highly correlated users
    recommended_items = user_ratings.loc[high_corr_users]
    # Mean rating per item (NaN entries are skipped)
    item_scores = recommended_items.mean().sort_values(ascending=False)
    return item_scores[item_scores >= 4].index  # recommend only items with a mean rating of 4 or higher

# Find the highly correlated users for target_user_id = 2
high_corr_users = get_high_corr_users(user_similarity, target_user_id=2)

# Select the recommended items
recommended_items = recommend_items(rating_matrix, high_corr_users)

#### Extracted highly correlated users

In [160]:
high_corr_users

Index([1923, 3372, 2713, 15093, 1804, 14958, 2762, 2190, 2772, 15349], dtype='int64', name='user')

#### Mean ratings over the items rated by the highly correlated users, and the resulting movie recommendations

In [161]:
recommended_items

Index([10215, 10936, 10001, 10244, 10038], dtype='int64', name='movie')
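
The list above contains movie 10001, which user 2 has already rated in the training matrix shown earlier, so the function can recommend movies the user has seen. A minimal sketch of a variant that filters those out follows; `recommend_unseen` and its `min_score` parameter are hypothetical and not part of the original notebook, and the toy matrix is made-up data.

```python
import numpy as np
import pandas as pd

def recommend_unseen(user_ratings, target_user_id, neighbors, min_score=4):
    """Average the neighbors' ratings and drop movies the target already rated.

    Hypothetical variant of recommend_items(); not the notebook's own function.
    """
    # Mean rating per movie over the neighbors; movies nobody rated are dropped.
    scores = user_ratings.loc[neighbors].mean().dropna()
    # Movies the target user has already rated.
    seen = user_ratings.loc[target_user_id].dropna().index
    scores = scores.drop(seen, errors="ignore").sort_values(ascending=False)
    return scores[scores >= min_score].index

# Hypothetical user x movie matrix: user 2 has rated only movie 10001.
toy = pd.DataFrame(
    [[10, np.nan, np.nan],
     [9, 8, np.nan],
     [10, 6, 3]],
    index=pd.Index([2, 5, 8], name="user"),
    columns=pd.Index([10001, 10002, 10003], name="movie"),
)
picked = recommend_unseen(toy, target_user_id=2, neighbors=[5, 8])
```

In the toy data, movie 10001 has the highest neighbor average but is excluded because user 2 already rated it, and movie 10003 falls below the threshold.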

---
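## Modifying the program 4_KNN_bias_CF.ipynb (switch the similarity measure to the correlation coefficient, extract the high_corr most highly correlated users, recommend n_items movies)
- Modify the program file
    - recommender(user=2, n_items=5, high_corr=10)
- Deviation-based prediction: predictions are made from rating deviations
    - CF_knn_bias corrects each prediction using the user's mean rating and the neighbors' rating deviations.
    - This can give more refined predictions than averaging raw ratings, because it compensates for users who rate consistently high or low.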
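
Written as a formula, the prediction computed by `CF_knn_bias` below can be summarized as follows. The notation is an assumption introduced here for illustration: $s_{u,v}$ is the Pearson correlation between users $u$ and $v$ (`user_similarity`), $\bar{r}_u$ is the mean rating of user $u$ (`rating_mean`), and $N_k(u,i)$ is the set of at most $k$ = `high_corr` users with the highest correlation to $u$ among those who rated movie $i$.

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N_k(u,i)} s_{u,v}\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v \in N_k(u,i)} s_{u,v}}
$$

When the similarities in the denominator sum to 0, or when movie $i$ has no ratings in the training matrix, the function falls back to $\hat{r}_{u,i} = \bar{r}_u$.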

### `Uses the same matrices built above`

### Mean rating per user and rating deviation per movie

In [162]:
# Compute each user's mean rating and the rating deviations from the train data
rating_mean = rating_matrix.mean(axis=1)   # mean rating per user; axis=1 averages across the columns of each row
rating_bias = (rating_matrix.T - rating_mean).T   # each rating minus that user's mean rating
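
The double transpose is needed because subtracting a Series from a DataFrame aligns on the columns by default. A small sketch on hypothetical data shows the result and an equivalent form using `DataFrame.sub` with `axis=0`; `toy`, `user_mean`, `bias_t`, and `bias_sub` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix; NaN means "not rated".
toy = pd.DataFrame(
    [[10, 8, np.nan],
     [np.nan, 2, 6]],
    index=pd.Index([0, 1], name="user"),
    columns=pd.Index([10001, 10002, 10003], name="movie"),
)

user_mean = toy.mean(axis=1)            # NaN entries are skipped: 9.0 and 4.0
bias_t = (toy.T - user_mean).T          # form used in the notebook
bias_sub = toy.sub(user_mean, axis=0)   # same result without transposing
```

Unrated entries stay NaN in both forms, so a deviation is only defined where a rating exists.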

### Function that computes the weighted-average rating for a given movie

In [182]:
def CF_knn_bias(user, movie, high_corr=10):  # use the top high_corr most similar users
    if movie in rating_bias:
        # Similarity between the current user and every other user
        sim_scores = user_similarity[user].copy()
        # Rating deviations for the current movie
        movie_ratings = rating_bias[movie].copy()
        # Drop users who have no rating for the current movie
        none_rating_idx = movie_ratings[movie_ratings.isnull()].index
        movie_ratings = movie_ratings.drop(none_rating_idx)
        sim_scores = sim_scores.drop(none_rating_idx)

        # Choose the neighbors from the users with the highest correlation
        high_corr = min(high_corr, len(sim_scores))
        sim_scores = sim_scores.nlargest(high_corr)  # users with the highest correlation
        movie_ratings = movie_ratings[sim_scores.index]

        if sim_scores.sum() == 0:  # the similarities to the user can sum to 0
            prediction = rating_mean[user]  # fall back to the user's mean rating
        else:
            # Predicted deviation: similarity-weighted average of the neighbors' deviations
            prediction = np.dot(sim_scores, movie_ratings) / sim_scores.sum()
            # Add the current user's mean rating to the predicted deviation
            prediction = prediction + rating_mean[user]
    else:
        prediction = rating_mean[user]
    return prediction

In [190]:
# Run once, outside the function
movies.set_index('movie', inplace=True)

In [195]:
# Setting movie as the index fixes the error caused by the DataFrame index not being the movie ID
movies

movie,title
10001,시네마 천국
10002,빽 투 더 퓨쳐
10003,빽 투 더 퓨쳐 2
10004,빽 투 더 퓨쳐 3
10005,스타워즈 에피소드 4 - 새로운 희망
...,...
10995,공포의 여정
10996,버스틴 루즈
10997,블랙 엔젤
10998,폭주 기관차



### Movie recommendation

In [191]:
# Define the recommender function
def recommender(user, n_items=10, high_corr=10):
    # Predict ratings for every item the current user has not rated yet
    predictions = []
    rated_index = rating_matrix.loc[user][rating_matrix.loc[user] > 0].index  # movies the user has already rated
    items = rating_matrix.loc[user].drop(rated_index)
    for item in items.index:
        predictions.append(CF_knn_bias(user, item, high_corr))  # predicted rating
    recommendations = pd.Series(data=predictions, index=items.index, dtype=float)
    recommendations = recommendations.sort_values(ascending=False)[:n_items]  # keep the movies with the highest predicted ratings

    # Look up the titles of the recommended movies
    recommended_items = movies.loc[recommendations.index]['title']
    return recommended_items

In [194]:
recommender(user=2, n_items=5, high_corr=10)

movie
10445              등대여명
10315                모정
10810     모두가 죽이고 싶던 여자
10415    쇼처럼 즐거운 인생은 없다
10352        세 남자와 아기 2
Name: title, dtype: object