## User-based collaborative filtering

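- When you want a movie recommendation
    - Look for movies by directors, in genres, or with keywords you already like = `content based filtering`
    - Look for movies watched by friends whose tastes resemble yours = `collaborative filtering`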

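- Characteristics of collaborative filtering
    - Assumption: what people with tastes similar to mine like, I am also likely to like.
    - Key point: it uses preference information gathered from many users
        - Users' preference information = collective intelligence
        - Recommendations are based on the accumulated collective intelligence of users
        - e.g., other products also bought by users who purchased product A
    - Collective intelligence
        - Relies on the choices and tastes of a group rather than an individual
        - Reflects the opinions of many people in aggregate
        - Many opinions lead to a better choice (one expert's opinion < the opinions of many ordinary people)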

### Memory based Approach
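- `User-based collaborative filtering`: pick a target user and find other users with similar tastes.
    - Then recommend *items the target user has not yet rated*, based on the items those similar users liked or rated highly.
- `Item-based collaborative filtering`: pick a target item, find items similar to it, and recommend those.
- Properties shared by both memory-based methods
    - Characteristics: no optimization procedure and no learned parameters (no training needed); only simple arithmetic is used
    - Method: similarity is measured with cosine similarity or Pearson correlation
    - Pros: easy to implement, easy to explain, and not tied to a particular domain.
    - Cons: performs poorly when little data has accumulated or when the data is sparse; scales poorly.
        - Long-tail problem: typically addressed with model-based collaborative filtering
            - Pareto principle: for many outcomes, roughly 80% of the results come from 20% of the causes
            - Recommendations concentrate on the small set of items that most users are interested in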

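- Measuring similarity
    - Both methods work by measuring a similarity (or distance)
    - There are many ways to measure it
    - Examples include Euclidean distance, cosine similarity, and the Pearson correlation coefficient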
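
The three measures listed above can be sketched with NumPy alone. The two rating vectors are made-up values for illustration, not rows from the MovieLens data, and this is only one reasonable way to write the formulas.

```python
import numpy as np

# Hypothetical ratings from two users on the same five movies (illustrative values only)
u = np.array([5.0, 3.0, 4.0, 4.0, 1.0])
v = np.array([4.0, 2.0, 5.0, 3.0, 1.0])

# Euclidean distance: a distance, so smaller means more similar
euclidean = np.linalg.norm(u - v)

# Cosine similarity: cosine of the angle between the two vectors
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Pearson correlation: cosine similarity of the mean-centered vectors
uc, vc = u - u.mean(), v - v.mean()
pearson = np.dot(uc, vc) / (np.linalg.norm(uc) * np.linalg.norm(vc))

print(euclidean, cosine, pearson)
```

Euclidean distance grows as users disagree, while cosine and Pearson grow as they agree, so the direction of "more similar" has to be kept in mind when swapping one measure for another.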

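### User-based vs Item-based CF comparison
- Accuracy
    - Fewer users than items: user-based tends to do better
    - More users than items: item-based tends to do better
- Model robustness
    - user-based: vulnerable when users' tastes change often
    - item-based: vulnerable when item content changes often
- Explainability
    - user-based: easy to see the actual tastes of users similar to a given user
    - item-based: easy to see which items are similar to the ones a given user already rated
- Novelty of recommendations
    - user-based: draws on many users' data, so it can surface more novel recommendations
    - item-based: depends on past item data, so recommending new items is hard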


In [1]:
import numpy as np
import pandas as pd

### Data Load

In [2]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.user', sep='|', names=u_cols, encoding='latin-1')

i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDB URL', 'unknown', 
          'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 
          'Thriller', 'War', 'Western']
movies = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.item', sep='|', names=i_cols, encoding='latin-1')

r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('/Users/jun/Library/Mobile Documents/com~apple~CloudDocs/Github/ai _recommendation _system/data/u.data', sep='\t', names=r_cols, encoding='latin-1')


In [3]:
# Drop timestamp (the time information is not used)
ratings = ratings.drop('timestamp', axis=1)
ratings.head()

,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


### Restructuring the data: `[movie_id, title]`

In [4]:
# Keep only movie_id and title; drop the other columns
movies = movies[['movie_id', 'title']]
movies

,movie_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)
...,...,...
1677,1678,Mat' i syn (1997)
1678,1679,B. Monkey (1998)
1679,1680,Sliding Doors (1998)
1680,1681,You So Crazy (1994)


### RMSE
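- Root Mean Square Error (RMSE): the square root of the mean of the squared differences between predicted and actual values (lower is better)
    - $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
    - Error: the difference between the actual value and the predicted value
    - Sum of squared errors: individual errors can be negative or positive, so each error is squared before summing.
    - Root mean square error: the square root of the mean of those squared errors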

In [5]:
# Function that computes accuracy as RMSE
def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred))**2))
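
As a small worked example of the formula, the snippet below repeats the same `RMSE` function so that it runs on its own and applies it to four made-up ratings. The numbers are illustrative only.

```python
import numpy as np

def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred)) ** 2))

# Hypothetical actual and predicted ratings
y_true = [5, 3, 4, 1]
y_pred = [3, 3, 2, 1]

# Errors are 2, 0, 2, 0 -> squares 4, 0, 4, 0 -> mean 2 -> square root of 2
print(RMSE(y_true, y_pred))   # about 1.414
```

Two predictions are exact and two are off by 2, yet the RMSE is above the mean absolute error of 1.0, because squaring gives the larger errors more weight.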

### Computing prediction accuracy against the actual test data for each model

In [6]:
# Function that computes the RMSE of a given model
def score(model):
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])
    y_pred = np.array([model(user, movie) for (user, movie) in id_pairs])
    y_true = np.array(X_test['rating'])
    return RMSE(y_true, y_pred)

- `id_pairs = zip(X_test['user_id'], X_test['movie_id'])`: pairs up each user_id and movie_id in the test data
- `y_pred = np.array([model(user, movie) for (user, movie) in id_pairs])`: calls `model(user, movie)` for every (user, movie) pair in `id_pairs` and stores the model's predicted ratings in `y_pred`
- `y_true = np.array(X_test['rating'])`: stores the actual ratings from the test data in `y_true`

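### Splitting into train and test sets
- Split the data into training and test sets at a fixed ratio, using user_id as the stratification key
- `stratify=y`: because `y` is user_id, each user's ratings are divided between train and test in roughly the same proportion as the overall split.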

In [7]:
from sklearn.model_selection import train_test_split
X = ratings.copy()
y = ratings['user_id']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state= 42, stratify=y)
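
To see what stratifying on user_id does, here is a minimal sketch on a made-up table of 3 users with 8 ratings each. The `toy` frame and its values are assumptions for illustration; only `train_test_split` and its `stratify` argument come from the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3 users with 8 ratings each
toy = pd.DataFrame({
    "user_id": [u for u in (1, 2, 3) for _ in range(8)],
    "movie_id": list(range(8)) * 3,
    "rating": [3, 4, 5, 2, 1, 4, 3, 5] * 3,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["user_id"]
)

# With stratification each user contributes 6 rows to train and 2 rows to test
print(toy_train["user_id"].value_counts().sort_index())
print(toy_test["user_id"].value_counts().sort_index())
```

Because every user keeps most of their ratings in the training set, every test user also has a row in the rating matrix built from the training data.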

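### Training data as a (user x movie) rating matrix
- Pivot table: a table that summarizes a large amount of data into a meaningful report

**Why structure the data with a pivot table**
- Structuring the data: the pivot puts user_id on the rows, movie_id on the columns, and the rating in each cell, organizing the data as a matrix -> **it is easy to see which rating each user gave each movie**
- Handling missing values: most CF systems must cope with missing values (movies a user has not rated). -> the pivot table shows which users have not rated which movies
- Efficient similarity computation: the structured matrix serves as the **input data** for similarity computation, so similarities between users or between items can be computed efficiently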

In [8]:
# Build the full rating matrix from the train data
rating_matrix = X_train.pivot(index='user_id', columns='movie_id', values='rating')
rating_matrix.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1676,1677,1679,1680,1681,1682
user_id
1,5.0,3.0,4.0,,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,
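
The same reshaping is easier to follow on a tiny example. The four rows below are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Hypothetical long-format ratings (one row per user/movie pair)
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating":   [5, 3, 4, 2],
})

# Rows become users, columns become movies; unrated pairs show up as NaN
toy_matrix = toy.pivot(index="user_id", columns="movie_id", values="rating")
print(toy_matrix)
```

`pivot` expects each (user_id, movie_id) pair to occur at most once and raises an error on duplicates, which fits rating data where a user rates a movie once.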


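### Similarity between all users in the training data (cosine similarity)
- A matrix that stores the similarity score for every pair of users
    - Both rows and columns are user IDs, and each value is the similarity between the two users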

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

matrix_dummy = rating_matrix.copy().fillna(0) # Replace NaN with 0; the input must not contain NaN
user_similarity = cosine_similarity(matrix_dummy, matrix_dummy)
user_similarity = pd.DataFrame(user_similarity, index=rating_matrix.index, columns=rating_matrix.index)
user_similarity

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
user_id
1,1.000000,0.108361,0.046638,0.029577,0.245753,0.335853,0.344724,0.191582,0.057149,0.251979,...,0.257073,0.069412,0.231643,0.108093,0.176842,0.104799,0.232472,0.051528,0.129555,0.256333
2,0.108361,1.000000,0.057613,0.130237,0.054918,0.190552,0.079399,0.076146,0.167992,0.147376,...,0.136993,0.252887,0.255454,0.285193,0.232751,0.149088,0.102807,0.062386,0.109143,0.107686
3,0.046638,0.057613,1.000000,0.139805,0.000000,0.032485,0.043869,0.080968,0.022263,0.059925,...,0.027402,0.000000,0.175060,0.010343,0.105635,0.019052,0.127099,0.023917,0.060392,0.000000
4,0.029577,0.130237,0.139805,1.000000,0.000000,0.045190,0.088586,0.199526,0.135013,0.026919,...,0.055392,0.049773,0.076549,0.139382,0.113886,0.000000,0.130343,0.077357,0.157890,0.063911
5,0.245753,0.054918,0.000000,0.000000,1.000000,0.176443,0.281860,0.132205,0.038790,0.134200,...,0.183969,0.019305,0.073714,0.041807,0.081088,0.029743,0.188392,0.068342,0.055557,0.207259
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.104799,0.149088,0.019052,0.000000,0.029743,0.086464,0.075012,0.095736,0.000000,0.080883,...,0.061061,0.299811,0.158064,0.221251,0.323989,1.000000,0.047368,0.162173,0.058828,0.124548
940,0.232472,0.102807,0.127099,0.130343,0.188392,0.230566,0.270071,0.164157,0.131458,0.255758,...,0.195863,0.113346,0.144570,0.173568,0.139877,0.047368,1.000000,0.092911,0.199881,0.135868
941,0.051528,0.062386,0.023917,0.077357,0.068342,0.095478,0.020036,0.076269,0.106763,0.063461,...,0.021901,0.055348,0.226017,0.170493,0.249612,0.162173,0.092911,1.000000,0.072402,0.099200
942,0.129555,0.109143,0.060392,0.157890,0.055557,0.197307,0.236086,0.089871,0.089297,0.169309,...,0.111291,0.078263,0.051882,0.137759,0.069516,0.058828,0.199881,0.072402,1.000000,0.142812
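
To make the structure of this matrix concrete, the sketch below builds a user x user cosine matrix by hand for a made-up 3 x 3 rating matrix. It uses plain NumPy (row normalization followed by a matrix product) instead of `cosine_similarity`; the two should agree, but the toy values and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user x 3-movie rating matrix with missing ratings
toy_matrix = pd.DataFrame(
    [[5.0, 3.0, np.nan],
     [4.0, np.nan, np.nan],
     [np.nan, np.nan, 2.0]],
    index=[1, 2, 3], columns=[10, 20, 30],
)

filled = toy_matrix.fillna(0).to_numpy()

# Normalize each user's row to unit length; dot products of unit rows are cosines
norms = np.linalg.norm(filled, axis=1, keepdims=True)
unit = filled / norms
toy_sim = pd.DataFrame(unit @ unit.T, index=toy_matrix.index, columns=toy_matrix.index)
print(toy_sim.round(3))
```

Users 1 and 3 share no rated movie, so their similarity is exactly 0 after the zero fill. The same mechanism explains the 0.000000 entries in the output above.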


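### A function that computes the weighted-average rating for a given movie (movie_id)
- Weighted average: the weights are the similarities between the given user and the other users (user_similarity)
    - A way of computing the predicted rating for a particular movie, using user-to-user similarity as the weight
    - Each similarity score weights that neighbor's rating of the movie, and the weighted result is the predicted rating for the given user.
- Performs rating prediction for a (user_id, movie_id) pair

##### Basic CF algorithm
: returns a predicted rating for the given user, computed as a weighted average of other users' ratings using their similarity to that user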

In [10]:
def CF_simple(user_id, movie_id):
    
    if movie_id in rating_matrix: 
        # Similarities between the current user and every other user
        sim_scores = user_similarity[user_id].copy()
        # Ratings from all users for the current movie
        movie_ratings = rating_matrix[movie_id].copy()
        # Index of the users who have not rated the current movie
        none_rating_idx = movie_ratings[movie_ratings.isnull()].index
        # Drop the (null) ratings of users who have not rated the movie
        movie_ratings = movie_ratings.dropna()
        # Drop the similarities of users who have not rated the movie
        sim_scores = sim_scores.drop(none_rating_idx)
        
        # Weighted average over all users who rated the movie
        if sim_scores.sum() == 0:    # Every remaining similarity to user_id can be 0; fall back to a default
            mean_rating = 3.0
        else:
            mean_rating = np.dot(sim_scores, movie_ratings) / sim_scores.sum()
    else:
        # The movie does not appear in the training matrix; fall back to a default
        mean_rating = 3.0
    # Return the predicted rating for a movie the current user has not rated
    return mean_rating

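- Filtering out users without a rating: users who did not rate the movie are removed from sim_scores (= only the users who rated the movie remain)
- Computing the predicted rating: a weighted average of the remaining users' ratings, weighted by their similarity, gives **the current user's predicted rating for the movie**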
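
The weighted average in the last step can be checked by hand. The similarities and ratings below are made-up numbers for three hypothetical neighbors who rated the movie.

```python
import numpy as np

# Hypothetical neighbors who rated the target movie
sim_scores = np.array([0.8, 0.5, 0.1])      # similarity of each neighbor to the target user
movie_ratings = np.array([5.0, 3.0, 1.0])   # each neighbor's rating of the movie

# Weighted average: sum(sim * rating) / sum(sim)
predicted = np.dot(sim_scores, movie_ratings) / sim_scores.sum()
print(round(predicted, 3))   # 4.0, versus a plain mean of 3.0
```

The prediction is pulled toward the rating of the most similar neighbor, which is the point of weighting by similarity rather than taking a simple average.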


In [11]:
print(score(CF_simple))

1.0237210431087944


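- Interpreting the score() result: this is the RMSE obtained with the collaborative filtering algorithm.
    - RMSE is one way of quantifying a model's prediction error.
    - Here the predictions deviate from the actual ratings by about 1.02 rating points in the RMSE sense (large errors count more than they would in a plain average of absolute errors)
    - The lower the RMSE, the more accurate the model's predictions.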

---
## Computing similarity with the Pearson correlation coefficient

In [23]:
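# General formula
def pearson_sim(a, b):
    # Dot product of the mean-centered vectors (the covariance up to a constant factor)
    covariance = np.dot((a - np.mean(a)), (b - np.mean(b)))
    # Product of the norms of the mean-centered vectors (the standard deviations up to the same factor)
    std_deviation = np.linalg.norm(a - np.mean(a)) * np.linalg.norm(b - np.mean(b))
    return covariance/std_deviation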
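
As a sanity check, a function of this form should agree with NumPy's `np.corrcoef`. The snippet restates the same formula so that it runs standalone; the two vectors are hypothetical.

```python
import numpy as np

def pearson_sim(a, b):
    a_centered = a - np.mean(a)
    b_centered = b - np.mean(b)
    return np.dot(a_centered, b_centered) / (
        np.linalg.norm(a_centered) * np.linalg.norm(b_centered)
    )

# Hypothetical rating vectors (illustrative values only)
a = np.array([5.0, 3.0, 4.0, 0.0, 3.0])
b = np.array([4.0, 0.0, 0.0, 0.0, 2.0])

print(pearson_sim(a, b))
print(np.corrcoef(a, b)[0, 1])   # should match up to floating-point error
```

Note that a constant vector (for example a user who gave every movie the same rating) has zero variance, so the denominator becomes 0 and the result is not a usable number.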

In [26]:
matrix_dummy

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1676,1677,1679,1680,1681,1682
user_id
1,5.0,3.0,4.0,0.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,0.0,0.0,0.0,2.0,0.0,0.0,4.0,5.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
941,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
942,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
matrix_dummy.iloc[0]

movie_id
1       5.0
2       3.0
3       4.0
4       0.0
5       3.0
       ... 
1677    0.0
1679    0.0
1680    0.0
1681    0.0
1682    0.0
Name: 1, Length: 1641, dtype: float64

In [25]:
# Example: Pearson similarity between the first two users
a=np.array(matrix_dummy.iloc[0])
b=np.array(matrix_dummy.iloc[1])
sim_a_b = pearson_sim(a,b)
sim_a_b

np.float64(0.0581323816375836)

### user x user correlation coefficients
- `DataFrame.corr()`: the pandas method that computes pairwise correlation coefficients between columns

In [29]:
# corr() works column by column; with movies as columns this gives a movie x movie matrix
user_similarity_pearson=matrix_dummy.corr(method='pearson')
user_similarity_pearson

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1676,1677,1679,1680,1681,1682
movie_id
1,1.000000,0.112726,0.170990,0.150170,0.059776,0.008702,0.257786,0.145397,0.177087,0.066789,...,-0.023749,0.013940,0.026652,-0.023749,-0.023749,-0.023749,-0.023749,-0.023749,0.043453,-0.023749
2,0.112726,1.000000,0.117057,0.281553,0.200174,0.067902,0.159067,0.174119,0.054455,0.029120,...,-0.010031,-0.014193,-0.010031,-0.010031,-0.010031,-0.010031,-0.010031,-0.010031,-0.010031,0.090241
3,0.170990,0.117057,1.000000,0.182698,0.109595,0.035381,0.194911,0.081412,0.121559,0.066402,...,-0.008612,-0.012185,-0.008612,-0.008612,-0.008612,-0.008612,-0.008612,-0.008612,-0.008612,0.102128
4,0.150170,0.281553,0.182698,1.000000,0.110953,0.005220,0.190862,0.249643,0.187213,0.106566,...,-0.013680,0.031253,-0.013680,-0.013680,0.105545,-0.013680,-0.013680,-0.013680,0.057855,0.081700
5,0.059776,0.200174,0.109595,0.110953,1.000000,0.033365,0.179203,0.095602,0.171591,-0.014578,...,-0.008073,-0.011423,-0.008073,-0.008073,-0.008073,-0.008073,-0.008073,-0.008073,-0.008073,-0.008073
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,-0.023749,-0.010031,-0.008612,-0.013680,-0.008073,-0.003815,-0.020947,0.089197,0.075683,-0.008245,...,-0.001062,-0.001502,-0.001062,-0.001062,-0.001062,1.000000,-0.001062,-0.001062,-0.001062,-0.001062
1679,-0.023749,-0.010031,-0.008612,-0.013680,-0.008073,-0.003815,-0.020947,-0.014467,-0.017830,-0.008245,...,-0.001062,-0.001502,-0.001062,-0.001062,-0.001062,-0.001062,1.000000,1.000000,-0.001062,-0.001062
1680,-0.023749,-0.010031,-0.008612,-0.013680,-0.008073,-0.003815,-0.020947,-0.014467,-0.017830,-0.008245,...,-0.001062,-0.001502,-0.001062,-0.001062,-0.001062,-0.001062,1.000000,1.000000,-0.001062,-0.001062
1681,0.043453,-0.010031,-0.008612,0.057855,-0.008073,-0.003815,0.049979,-0.014467,-0.017830,-0.008245,...,-0.001062,0.706731,-0.001062,-0.001062,-0.001062,-0.001062,-0.001062,-0.001062,1.000000,-0.001062


In [30]:
user_similarity_pearson=matrix_dummy.T.corr(method='pearson')
user_similarity_pearson

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
user_id
1,1.000000,0.058132,-0.001306,-0.004666,0.177313,0.262910,0.241265,0.150274,0.025869,0.173354,...,0.183580,0.028064,0.163780,0.070644,0.116565,0.060393,0.176032,0.019892,0.072988,0.186864
2,0.058132,1.000000,0.034856,0.115658,0.014203,0.150486,0.012888,0.052068,0.154626,0.106237,...,0.097363,0.236786,0.224978,0.270282,0.205844,0.128299,0.070756,0.047055,0.081128,0.068492
3,-0.001306,0.034856,1.000000,0.127033,-0.038105,-0.011251,-0.017721,0.059910,0.008101,0.019499,...,-0.012516,-0.019123,0.145425,-0.008057,0.078027,-0.002046,0.099878,0.009786,0.034439,-0.039159
4,-0.004666,0.115658,0.127033,1.000000,-0.026895,0.015383,0.050974,0.186867,0.126156,-0.002609,...,0.028728,0.036968,0.053200,0.128130,0.095096,-0.015181,0.111804,0.067912,0.142022,0.038887
5,0.177313,0.014203,-0.038105,-0.026895,1.000000,0.109227,0.201002,0.097219,0.014269,0.066390,...,0.123565,-0.014175,0.012016,0.010213,0.029866,-0.007502,0.142665,0.044973,0.008800,0.151584
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.060393,0.128299,-0.002046,-0.015181,-0.007502,0.045369,0.016065,0.074758,-0.014688,0.041091,...,0.022342,0.286295,0.127282,0.206635,0.303367,1.000000,0.017061,0.150055,0.032503,0.090716
940,0.176032,0.070756,0.099878,0.111804,0.142665,0.179434,0.205045,0.135997,0.113628,0.208288,...,0.146979,0.088575,0.097512,0.151423,0.100129,0.017061,1.000000,0.074001,0.167070,0.085811
941,0.019892,0.047055,0.009786,0.067912,0.044973,0.069379,-0.023233,0.061941,0.097915,0.036730,...,-0.005319,0.043043,0.209037,0.160030,0.235249,0.150055,0.074001,1.000000,0.055161,0.076491
942,0.072988,0.081128,0.034439,0.142022,0.008800,0.150970,0.177522,0.062738,0.072438,0.122948,...,0.063928,0.055427,0.006106,0.117130,0.031790,0.032503,0.167070,0.055161,1.000000,0.099495
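
The difference between the two calls above comes down to orientation: `DataFrame.corr()` correlates columns, so the matrix has to be transposed to correlate users. A small sketch on a made-up 3-user x 4-movie frame:

```python
import pandas as pd

# Hypothetical matrix: rows are users, columns are movies
toy = pd.DataFrame(
    [[5.0, 3.0, 4.0, 1.0],
     [4.0, 2.0, 5.0, 1.0],
     [1.0, 5.0, 2.0, 4.0]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movie_id"),
)

movie_by_movie = toy.corr(method="pearson")    # 4 x 4, indexed by movie_id
user_by_user = toy.T.corr(method="pearson")    # 3 x 3, indexed by user_id

print(movie_by_movie.shape, user_by_user.shape)
```

Checking the shape or the index name of the result is a quick way to confirm that the similarity matrix is between users rather than between movies.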


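### Correlation coefficients when NaN values are present
- The correlation coefficient requires two lists of equal length with `no NaN values`
- For the NaN case, see [correlation coefficients with missing values](https://zephyrus1111.tistory.com/209)
- `DataFrame.corr()` handles this itself by excluding missing values pairwise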

In [33]:
matrix_dummy2 = rating_matrix.copy()
user_similarity_pearson2=matrix_dummy2.T.corr(method='pearson')
user_similarity_pearson2

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
user_id
1,1.000000,0.340997,0.281425,-0.688247,0.392864,0.246536,0.310992,8.440722e-01,3.845925e-16,-0.126156,...,0.066879,-0.682981,0.511009,0.353553,0.358116,0.468046,0.178983,-1.000000,-0.201196,0.001392
2,0.340997,1.000000,-0.866025,,0.866025,0.531666,0.507630,8.660254e-01,-4.440892e-16,0.450000,...,0.131306,-0.475191,0.047809,0.145479,0.503367,-0.612372,0.872872,,0.408248,0.241523
3,0.281425,-0.866025,1.000000,0.000000,,-0.492366,-0.073985,1.110223e-16,,,...,,,-0.240663,,0.511891,,0.340503,,,
4,-0.688247,,0.000000,1.000000,,,-0.204124,2.000000e-01,,,...,,,-0.577350,,0.688247,,0.917985,1.000000,0.133631,
5,0.392864,0.866025,,,1.000000,0.278851,0.346360,5.007831e-01,,0.367806,...,0.137002,,0.496929,1.000000,0.112756,1.000000,-0.190618,1.000000,0.904534,0.249344
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.468046,-0.612372,,,1.000000,0.359864,0.211538,-5.000000e-01,,0.408248,...,0.564076,0.433555,0.375000,-0.471405,-0.108961,1.000000,-1.000000,-0.132453,-0.500000,0.000000
940,0.178983,0.872872,0.340503,0.917985,-0.190618,-0.073746,0.088755,3.015113e-01,2.182179e-01,0.194688,...,0.077784,0.456435,-0.324514,-0.101768,0.376867,-1.000000,1.000000,1.000000,0.133131,0.527986
941,-1.000000,,,1.000000,1.000000,0.361158,,,,0.866025,...,,,0.298807,0.414039,-0.417029,-0.132453,1.000000,1.000000,,-0.426401
942,-0.201196,0.408248,,0.133631,0.904534,0.018260,0.521077,2.151657e-01,8.660254e-01,0.435761,...,0.306073,,-0.218218,,0.763763,-0.500000,0.133131,,1.000000,0.199205


In [35]:
a=np.array(matrix_dummy2.iloc[0])
b=np.array(matrix_dummy2.iloc[1])
a = np.ma.masked_invalid(a) # Mask invalid values such as NaN
b = np.ma.masked_invalid(b)
np.ma.corrcoef(a,b)

masked_array(
  data=[[0.9999999999999999, 0.3409971697352367],
        [0.3409971697352367, 1.0]],
  mask=[[False, False],
        [False, False]],
  fill_value=1e+20)

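- Used to compute the correlation coefficient between two rows of data
- `np.ma.masked_invalid()`: automatically masks the invalid values (NaN, inf) in an array.
    - A masked array lets the invalid values be excluded from calculations, which prevents errors during analysis or computation.
- `np.ma.corrcoef()`: computes the correlation coefficient on masked arrays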
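
The pairwise exclusion can also be written out explicitly: keep only the movies both users rated, then correlate. The sketch below uses two hypothetical rating series and compares the manual result with `Series.corr`, which drops missing pairs in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; NaN marks a movie the user has not rated
a = pd.Series([5.0, 3.0, np.nan, 4.0, 1.0, np.nan])
b = pd.Series([4.0, np.nan, 2.0, 5.0, 2.0, 3.0])

# Keep only the positions where both users have a rating
both = a.notna() & b.notna()
manual = np.corrcoef(a[both], b[both])[0, 1]

# Series.corr excludes the NaN pairs as well
pairwise = a.corr(b)

print(manual, pairwise, int(both.sum()))
```

Here only 3 of the 6 movies are co-rated. With so few shared ratings a correlation is unstable, which is a plausible reason for the many exact 1.000000 and -1.000000 entries in the NaN-aware matrix above. The `min_periods` argument of `DataFrame.corr()` can be used to require a minimum number of co-rated movies per pair.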