## **Item-Based Nearest-Neighbor Collaborative Filtering**

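> **Nearest-neighbor collaborative filtering is divided into user-based and item-based collaborative filtering. In general, item-based collaborative filtering gives more accurate recommendations.**
- Nearest-neighbor collaborative filtering: predicts the rating a user would give to an item they have not yet rated, based on the ratings that user has already given to other items.

- The user-item rating matrix is a high-dimensional matrix with many items as columns. Because users rate only a small fraction of the items, it is a sparse matrix.
- Collaborative filtering is one of the most widely used techniques in recommender systems. Nearest-neighbor collaborative filtering makes recommendations based on similarity between users or between items, which gives two main approaches: user-based and item-based.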
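
The sparsity mentioned above can be measured directly. The following is a minimal sketch on made-up data (the `toy_ratings` frame and its values are assumptions for illustration, not MovieLens data):

```python
import pandas as pd

# Toy ratings in long format (hypothetical data, not MovieLens).
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 20, 10, 30],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Users as rows, items as columns; unrated cells become NaN.
toy_matrix = toy_ratings.pivot_table('rating', index='userId', columns='movieId')

# Sparsity = share of user-item cells that have no rating.
sparsity = toy_matrix.isna().to_numpy().mean()
print(f'shape={toy_matrix.shape}, sparsity={sparsity:.2f}')
```

On a real rating data set the share of empty cells is usually far higher than in this toy case, because each user rates only a handful of the available items.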

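1. **User-based collaborative filtering**: "Customers whose purchase history is similar to yours also bought these products."
   - This approach recommends based on similarity between users.
   - For example, if user A and user B prefer similar items, new items that A is likely to enjoy can be predicted from the items B already prefers.
   - Similarity between users is usually measured with Euclidean distance, cosine similarity, and similar metrics.

2. **Item-based collaborative filtering**: "Customers who chose this product also bought these products." (Or: "This product is frequently bought together with these products.") In other words, find items that users rate similarly to a given item, and recommend them to people who have not yet seen them.
   - **Item attributes play no role. The only criterion for judging two items similar is whether users like or dislike them in the same way.**
   - This approach recommends based on similarity between items.
   - For example, if user A prefers a particular item, other items similar to it can be recommended.
   - Similarity between items is usually computed as a correlation or similarity score between their rating vectors.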

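> Each approach has trade-offs. User-based collaborative filtering builds on relationships between users, so it can provide personalized recommendations, but the computational cost grows as the number of users grows. Item-based collaborative filtering builds on similarity between items, which works well when there are many items, but recommendations for new items or new users can be limited.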

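- Item-based nearest-neighbor collaborative filtering analyzes user behavior data to recommend similar items.

- Each item is represented as a feature vector built from user behavior such as ratings or purchases.
- If a user likes a particular item, other items similar to it are recommended.
- This approach works well when user behavior data is available, and it can reflect each user's individual taste.
- Item-based collaborative filtering needs to know which items a given user prefers.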
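
To make the "item as a feature vector" idea concrete, here is a small sketch on a hypothetical 3-user, 3-item rating matrix. Cosine similarity is computed by hand with NumPy so each step is visible; scikit-learn's `cosine_similarity` should give the same matrix.

```python
import numpy as np

# Rows are users, columns are items A, B, C (hypothetical ratings, 0 = not rated).
toy = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
])

# Each item is described by its column of user ratings, so transpose first.
item_vectors = toy.T

# Cosine similarity: normalize every item vector, then take dot products.
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
unit = item_vectors / norms
toy_sim = unit @ unit.T

print(np.round(toy_sim, 3))
```

Items A and B are rated alike by the same users, so they end up far more similar to each other than either is to item C, even though nothing about the items' own attributes was used.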

> [Collaborative filtering](https://western-sky.tistory.com/47)

> [MovieLens Data Set](https://grouplens.org/datasets/movielens/latest/)

In [74]:
import pandas as pd
import numpy as np

movies = pd.read_csv('/content/drive/MyDrive/Kaggle - 파이썬 머신러닝 완벽 가이드/kaggleData/ml-latest-small/movies.csv')
ratings = pd.read_csv('/content/drive/MyDrive/Kaggle - 파이썬 머신러닝 완벽 가이드/kaggleData/ml-latest-small/ratings.csv')

In [75]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [76]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


> ### **Converting row-level data to a row-column layout with pivot_table()**

- The `pivot_table()` method creates a pivot table from a DataFrame.
- It lets you group and aggregate data along chosen rows and columns. A brief explanation of how to use `pivot_table()` follows:

1. **Basic syntax**:
   ```python
   DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
   ```
   - `values`: the column(s) to aggregate.
   - `index`: the column(s) to use as the row index.
   - `columns`: the column(s) to use as the column index.
   - `aggfunc`: the aggregation function. The default is 'mean'; other options include 'sum', 'count', 'min', and 'max'.
   - `fill_value`: the value used to replace NA/NaN in the result.
   - `margins`: whether to add subtotal and grand-total rows and columns.
   - `dropna`: whether to leave out columns whose entries are all NaN.
   - `margins_name`: the name of the totals row and column.

2. **Example**:
   ```python
   import pandas as pd

   # Sample DataFrame
   df = pd.DataFrame({
       'City': ['Seoul', 'Seoul', 'Busan', 'Busan', 'Seoul'],
       'Month': ['Jan', 'Feb', 'Jan', 'Feb', 'Jan'],
       'Sales': [100, 150, 200, 250, 300]
   })

   # Creating a pivot table
   pivot = df.pivot_table(values='Sales', index='City', columns='Month', aggfunc='sum')

   print(pivot)
   ```

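   In this example, the 'City' column becomes the row index, the 'Month' column becomes the column index, and the 'Sales' column is aggregated. The result is a pivot table holding the total sales for each city and month.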


> The two calls below are equivalent:
> 1. `pd.pivot_table(df, values='Sales', index='City', columns='Month')`
> 2. `df.pivot_table(values='Sales', index='City', columns='Month')`

In [77]:
ratings = ratings[['userId', 'movieId', 'rating']]

In [78]:
ratings_matrix = ratings.pivot_table('rating', index = 'userId', columns = 'movieId')
ratings_matrix.head(3)

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,


In [79]:
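# The columns are movieId values, so it is not clear which movie each one is => switch the columns to the title from the movies data set
# Join ratings with movies to bring in the title column, then pass title as the columns argument of pivot_table
# Movies a user has not rated are NaN => the lowest possible rating is 0.5, so NaN is replaced with 0

# Join with movies to get the title column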
rating_movies = pd.merge(ratings, movies, on = 'movieId')
rating_movies.head(3)

,userId,movieId,rating,title,genres
0,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [80]:
# Pivot with columns='title' so that movie titles become the columns.
ratings_matrix = rating_movies.pivot_table('rating', index = 'userId', columns = 'title')

# Replace every NaN with 0
ratings_matrix = ratings_matrix.fillna(0)
ratings_matrix.head(3)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
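
One side effect of pivoting on `title` instead of `movieId` is worth keeping in mind: `pivot_table` aggregates with the mean by default, so if two different `movieId` values share the same title, their ratings are merged into a single column. The sketch below shows this on made-up data (the frames and the title `'Same Title (2000)'` are hypothetical):

```python
import pandas as pd

# Hypothetical data: movieId 1 and 2 are different entries that share one title.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': [1, 2, 1],
    'rating':  [4.0, 2.0, 5.0],
})
toy_movies = pd.DataFrame({
    'movieId': [1, 2],
    'title':   ['Same Title (2000)', 'Same Title (2000)'],
})

merged = pd.merge(toy_ratings, toy_movies, on='movieId')

by_id = merged.pivot_table('rating', index='userId', columns='movieId')
by_title = merged.pivot_table('rating', index='userId', columns='title')

print(by_id.shape)     # one column per movieId
print(by_title.shape)  # both entries collapsed into one title column
print(by_title.loc[1, 'Same Title (2000)'])  # user 1's two ratings averaged
```

If the real data set contains such duplicate titles, the title-based matrix will have fewer columns than the movieId-based one.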


### **Computing movie-to-movie similarity**

In [81]:
# cosine_similarity() compares rows with each other => ratings_matrix has one row per userId, so transpose it to get one row per movie.
ratings_matrix_T = ratings_matrix.transpose()
ratings_matrix_T.head(3)

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [82]:
# Usually sliced with loc/iloc
print("Ratings given to '71 (2014):", ratings_matrix_T.iloc[0].unique())

Ratings given to '71 (2014): [0. 4.]


In [83]:
# Compute item-based cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

item_sim = cosine_similarity(ratings_matrix_T, ratings_matrix_T)

# Wrap the NumPy array returned by cosine_similarity() in a DataFrame labeled with movie titles
item_sim_df = pd.DataFrame(data = item_sim, index = ratings_matrix.columns, columns = ratings_matrix.columns)
print(item_sim_df.shape)

(9719, 9719)


In [84]:
item_sim_df.head(3)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [85]:
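# Top 5 movies most similar to "Godfather, The (1972)" (position 0 is the movie itself, so it is skipped)
item_sim_df["Godfather, The (1972)"].sort_values(ascending=False)[1:6]

# Recommends movies that the 610 users rated in a similar way
# Note that this can recommend a movie the user has already seen.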

title
Godfather: Part II, The (1974)               0.821773
Goodfellas (1990)                            0.664841
One Flew Over the Cuckoo's Nest (1975)       0.620536
Star Wars: Episode IV - A New Hope (1977)    0.595317
Fargo (1996)                                 0.588614
Name: Godfather, The (1972), dtype: float64
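
The `[1:6]` slice above relies on the movie itself sorting into first place. A slightly more defensive variant drops the movie by label before sorting. The helper name `top_similar` and the 3x3 matrix below are assumptions made for illustration:

```python
import pandas as pd

def top_similar(sim_df, title, k=5):
    """Return the k items most similar to `title`, excluding `title` itself."""
    scores = sim_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(k)

# Hypothetical 3x3 similarity matrix for illustration only.
toy_titles = ['A', 'B', 'C']
toy_sim_df = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=toy_titles, columns=toy_titles,
)

result = top_similar(toy_sim_df, 'A', k=2)
print(result)
```

Dropping by label keeps the result correct even if another item happens to tie with the query item at a similarity of 1.0.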

### **Personalized movie recommendations with item-based nearest-neighbor collaborative filtering**

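- The previous exercise ignored individual taste and recommended movies purely from the similarity between their rating patterns.

- Here the movie similarity data is used with nearest-neighbor collaborative filtering to build recommendations tailored to each individual.
- The point is to recommend movies the person has not watched yet.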

In [86]:
def predict_rating(ratings_arr, item_sim_arr):
  ratings_pred = ratings_arr.dot(item_sim_arr) / np.array([np.abs(item_sim_arr).sum(axis=1)])
  return ratings_pred
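
Written out, `predict_rating` appears to compute the following for user $u$ and item $i$, where $R$ is the user-item rating matrix and $S$ is the item similarity matrix (the symbols are notation introduced here for illustration, not taken from the source):

```latex
\hat{R}_{u,i} = \frac{\sum_{j} R_{u,j}\, S_{j,i}}{\sum_{j} \lvert S_{i,j} \rvert}
```

The sums run over every item $j$, including the ones user $u$ never rated. Those items contribute 0 to the numerator but still add to the denominator, which pulls the predicted ratings toward 0.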

In [87]:
ratings_pred = predict_rating(ratings_matrix.values, item_sim_df.values)
ratings_pred_matrix = pd.DataFrame(data=ratings_pred, index= ratings_matrix.index, columns = ratings_matrix.columns)
ratings_pred_matrix.head(3)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId
1,0.070345,0.577855,0.321696,0.227055,0.206958,0.194615,0.249883,0.102542,0.157084,0.178197,...,0.113608,0.181738,0.133962,0.128574,0.006179,0.21207,0.192921,0.136024,0.292955,0.720347
2,0.01826,0.042744,0.018861,0.0,0.0,0.035995,0.013413,0.002314,0.032213,0.014863,...,0.01564,0.020855,0.020119,0.015745,0.049983,0.014876,0.021616,0.024528,0.017563,0.0
3,0.011884,0.030279,0.064437,0.003762,0.003749,0.002722,0.014625,0.002085,0.005666,0.006272,...,0.006923,0.011665,0.0118,0.012225,0.0,0.008194,0.007017,0.009229,0.01042,0.084501


In [88]:
ratings_matrix.values

array([[0. , 0. , 0. , ..., 0. , 4. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       ...,
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [4. , 0. , 0. , ..., 1.5, 0. , 0. ]])

> Difference between the predicted and actual ratings

In [89]:
from sklearn.metrics import mean_squared_error

# Compute the MSE only over movies that the user actually rated.
def get_mse(pred, actual):
  # keep only the entries where the actual rating is nonzero
  pred = pred[actual.nonzero()].flatten() # flatten to a 1-D array
  actual = actual[actual.nonzero()].flatten()
  return mean_squared_error(actual, pred)

print('Item-based, all neighbors MSE: ', get_mse(ratings_pred, ratings_matrix.values ))

Item-based, all neighbors MSE:  9.895354759094706
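
The masking idea behind `get_mse` can be checked by hand on a tiny case. The sketch below uses a NumPy-only helper, `masked_mse`, which is a hypothetical stand-in written for this illustration; the arrays are made up.

```python
import numpy as np

def masked_mse(pred, actual):
    """MSE over the cells where `actual` is nonzero (the user gave a rating)."""
    mask = actual != 0
    return np.mean((pred[mask] - actual[mask]) ** 2)

# Hypothetical 2x3 example: 0 marks an unrated cell.
actual = np.array([[4.0, 0.0, 2.0],
                   [0.0, 5.0, 0.0]])
pred = np.array([[3.0, 1.0, 2.0],
                 [9.0, 4.0, 9.0]])

# Only the three rated cells count; the large errors on unrated cells are ignored.
print(masked_mse(pred, actual))
```

The three rated cells have squared errors of 1, 0, and 1, so the result is 2/3 no matter how far off the predictions for unrated cells are.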


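- Collaborative filtering makes recommendations from interaction data between users and items.
- Because it relies only on interactions such as ratings or purchase history, it does not need users' personal information or detailed item attributes. Unlike content-based filtering, it can therefore be applied without any item metadata, which is one reason it is so widely used.

- However, collaborative filtering suffers from data sparsity and the cold-start problem. Data sparsity means users have rated too few items for reliable recommendations. Cold start means it is hard to recommend for new users or new items that have no interaction history yet.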

In [90]:
def predict_rating_topsim(ratings_arr, item_sim_arr, n = 20):
  # Initialize the prediction matrix with zeros, same shape as the user-item rating matrix
  pred = np.zeros(ratings_arr.shape)

  # Loop over the columns (items) of the user-item rating matrix.
  for col in range(ratings_arr.shape[1]):
    # Indices of the n items most similar to this item, most similar first
    # (a plain 1-D index array; wrapping it in a list triggers NumPy indexing warnings)
    top_n_items = np.argsort(item_sim_arr[:, col])[:-n-1:-1]
    # Compute the personalized predicted rating
    for row in range(ratings_arr.shape[0]):
      pred[row, col] = item_sim_arr[col, :][top_n_items].dot(ratings_arr[row,:][top_n_items])
      pred[row,col] /= np.sum(np.abs(item_sim_arr[col,:][top_n_items]))

  return pred
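
The slice `[:-n-1:-1]` used to pick the top-n neighbors is compact but easy to misread. A small sketch on made-up similarity scores shows what it selects:

```python
import numpy as np

# Hypothetical similarity scores of five items to one target item.
sims = np.array([0.2, 0.9, 0.5, 1.0, 0.1])
n = 3

# argsort is ascending; [:-n-1:-1] walks backwards from the end and stops after n steps.
top_n_items = np.argsort(sims)[:-n-1:-1]

print(top_n_items)        # indices of the n largest scores, largest first
print(sims[top_n_items])  # the corresponding scores
```

Because an item's similarity to itself is 1.0, the target item is normally among its own top-n neighbors. For cells the user has already rated, the prediction therefore leans partly on that very rating, which is worth keeping in mind when reading an MSE computed over rated cells only.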

In [91]:
ratings_pred = predict_rating_topsim(ratings_matrix.values , item_sim_df.values, n=20)
print('Item-based, top-20 neighbors MSE: ', get_mse(ratings_pred, ratings_matrix.values ))

# Rebuild the predicted ratings as a DataFrame
ratings_pred_matrix = pd.DataFrame(data=ratings_pred, index= ratings_matrix.index,
                                   columns = ratings_matrix.columns)

Item-based, top-20 neighbors MSE:  3.6949827608772314


In [92]:
ratings_pred_matrix

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId
1,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.220798,0.000000,0.000000,1.677291,0.284372
2,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.220798,0.000000,0.000000,0.194828,0.000000
5,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.0,0.149633,0.0,0.418273,0.16678,0.0,0.130033,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.344930,0.268465,0.000000,0.694944,0.189602
607,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.194948,0.000000,0.000000,0.000000,0.000000
608,0.0,0.000000,0.0,0.159451,0.00000,0.0,0.243703,0.0,0.000000,0.0,...,0.0,0.129289,0.000000,0.112856,0.0,1.587302,2.988072,0.175489,0.702430,0.000000
609,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000


In [93]:
user_rating_id = ratings_matrix.loc[9,:]
user_rating_id[user_rating_id > 0].sort_values(ascending = False)[:10]

title
Adaptation (2002)                                                                 5.0
Citizen Kane (1941)                                                               5.0
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    5.0
Producers, The (1968)                                                             5.0
Lord of the Rings: The Two Towers, The (2002)                                     5.0
Lord of the Rings: The Fellowship of the Ring, The (2001)                         5.0
Back to the Future (1985)                                                         5.0
Austin Powers in Goldmember (2002)                                                5.0
Minority Report (2002)                                                            4.0
Witness (1985)                                                                    4.0
Name: 9, dtype: float64

In [94]:
def get_unseen_movies(ratings_matrix, userId):
    # Extract every rating for the given userId as a Series.
    # The returned user_rating is a Series indexed by movie title.
    user_rating = ratings_matrix.loc[userId,:]

    # A rating above 0 means the movie was already watched. Collect those index labels into a list
    already_seen = user_rating[ user_rating > 0].index.tolist()

    # Make a list of all movie titles.
    movies_list = ratings_matrix.columns.tolist()

    # Use a list comprehension to exclude the movies in already_seen from movies_list.
    unseen_list = [ movie for movie in movies_list if movie not in already_seen]

    return unseen_list

In [95]:
def recomm_movie_by_userid(pred_df, userId, unseen_list, top_n = 10):
  # From the predicted-rating DataFrame, select the row for userId and the columns in unseen_list,
  # then sort by predicted rating, highest first.
  recomm_movies = pred_df.loc[userId, unseen_list].sort_values(ascending = False)[:top_n]
  return recomm_movies

# Get the titles of movies the user has not watched
unseen_list = get_unseen_movies(ratings_matrix, 9)

# Recommend movies with item-based nearest-neighbor collaborative filtering
recomm_movies = recomm_movie_by_userid(ratings_pred_matrix, 9, unseen_list, top_n=10)

# Build a DataFrame from the predicted ratings.
recomm_movies = pd.DataFrame(data=recomm_movies.values,index=recomm_movies.index,columns=['pred_score'])
recomm_movies

,pred_score
title
Shrek (2001),0.866202
Spider-Man (2002),0.857854
"Last Samurai, The (2003)",0.817473
Indiana Jones and the Temple of Doom (1984),0.816626
"Matrix Reloaded, The (2003)",0.80099
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001),0.765159
Gladiator (2000),0.740956
"Matrix, The (1999)",0.732693
Pirates of the Caribbean: The Curse of the Black Pearl (2003),0.689591
"Lord of the Rings: The Return of the King, The (2003)",0.676711


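> In summary: for each item a user has not yet rated, take the n items most similar to it, compute a predicted rating as the similarity-weighted average of the ratings the user has already given to those n items, and recommend the items with the highest predicted ratings.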
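
The whole procedure can be sketched end to end on a tiny matrix. This is a NumPy-only illustration under stated assumptions: the ratings are made up, and the helpers `cosine_sim` and `predict_top_n` are hypothetical names written for this sketch rather than the notebook's own functions.

```python
import numpy as np

def cosine_sim(mat):
    """Row-wise cosine similarity; all-zero rows get similarity 0."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = mat / norms
    return unit @ unit.T

def predict_top_n(ratings, item_sim, n):
    """Predict each cell from the n items most similar to the target item."""
    pred = np.zeros_like(ratings, dtype=float)
    for col in range(ratings.shape[1]):
        top = np.argsort(item_sim[:, col])[:-n-1:-1]
        weights = item_sim[col, top]
        denom = np.abs(weights).sum()
        if denom > 0:
            # Similarity-weighted average of every user's ratings on the neighbors.
            pred[:, col] = ratings[:, top] @ weights / denom
    return pred

# Hypothetical ratings: 3 users x 4 items, 0 = not rated.
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

item_sim = cosine_sim(ratings.T)              # items as rows
pred = predict_top_n(ratings, item_sim, n=3)

user = 0
unseen = np.flatnonzero(ratings[user] == 0)   # items this user has not rated
ranked = unseen[np.argsort(pred[user, unseen])[::-1]]
print(ranked)                                 # unseen items, best prediction first
```

The inner loop over users is replaced here by a matrix product per item, which should give the same predictions as a cell-by-cell loop while being much faster on a matrix of MovieLens size.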