1) Data preparation and preprocessing
---
The MovieLens data in ratings.dat is already cleanly organized as indexed user-movie-rating records.

Rubric

The project is evaluated against the criteria below.


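|Evaluation item|Detailed criteria|
|:------:|:--------:|
|1. The CSR matrix was built correctly.|It was created with the exact size implied by the number of users and items.|
|2. The MF model trained properly and produced plausible recommendations.|The dot products of user and item vectors took on meaningful values.|
|3. Finding similar movies and recommending to a user both worked correctly.|The user preferences, item-to-item similarities, and contributions predicted by the MF model were measured meaningfully.|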

In [387]:
import os
import pandas as pd

rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
orginal_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [388]:
# Keep only ratings of 3 or higher.
ratings = ratings[ratings['rating'] >= 3].copy()  # copy so the later in-place rename does not act on a slice
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [389]:
# Rename the rating column to count.
ratings.rename(columns={'rating':'count'}, inplace=True)

In [390]:
ratings['count']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: count, Length: 836478, dtype: int64

In [391]:
# Load the movie metadata so that titles can be displayed.
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


2) Let's analyze the data.

In [392]:
# movie_id values go up to 3952
movies.tail()

Unnamed: 0,movie_id,title,genre
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama
3882,3952,"Contender, The (2000)",Drama|Thriller


In [393]:
# 1) Number of unique movies in ratings
ratings['movie_id'].nunique()

3628

In [394]:
# 2) Number of unique users in ratings
ratings['user_id'].nunique()

6039

In [395]:
# 3) Movies ranked by popularity (number of ratings), most popular first; chain .head(30) to show only the top 30
ratings.groupby('movie_id')['user_id'].count().sort_values(ascending=False)

movie_id
2858    3211
260     2910
1196    2885
1210    2716
2028    2561
        ... 
1553       1
1548       1
2486       1
138        1
3876       1
Name: user_id, Length: 3628, dtype: int64
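
The ranking above is keyed by movie_id only and lists every movie. A minimal sketch of how the top-N rows could be cut out and joined to their titles follows. The frames `toy_ratings` and `toy_movies` and the helper `top_n_movies` are made-up stand-ins for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the real `ratings` and `movies` frames (values are made up).
toy_ratings = pd.DataFrame({
    'user_id':  [1, 2, 3, 1, 2, 1],
    'movie_id': [10, 10, 10, 20, 20, 30],
})
toy_movies = pd.DataFrame({
    'movie_id': [10, 20, 30],
    'title': ['Movie A (1990)', 'Movie B (1995)', 'Movie C (2000)'],
})

def top_n_movies(ratings_df, movies_df, n=30):
    """Return the n most-rated movies with their titles, most popular first."""
    counts = (ratings_df.groupby('movie_id')['user_id']
              .count()
              .sort_values(ascending=False)
              .head(n)
              .rename('num_ratings')
              .reset_index())
    # A left merge keeps the popularity order of `counts`.
    return counts.merge(movies_df[['movie_id', 'title']], on='movie_id', how='left')

top2 = top_n_movies(toy_ratings, toy_movies, n=2)
print(top2)
```

On the real data the same helper would be called as `top_n_movies(ratings, movies, n=30)`, assuming both frames still carry a `movie_id` column.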

In [396]:
ratings

Unnamed: 0,user_id,movie_id,count,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000203,6040,1090,3,956715518
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


In [398]:
ratings.tail()

Unnamed: 0,user_id,movie_id,count
1000203,6040,1090,3
1000205,6040,1094,5
1000206,6040,562,5
1000207,6040,1096,4
1000208,6040,1097,4


In [399]:
ratings = ratings.merge(movies, how='left', on='movie_id')

In [400]:
movie = 'E.T. the Extra-Terrestrial (1982)'

In [401]:
ratings[ratings['title'].apply(lambda x : x.split('(')[0].rstrip() in movie)]

Unnamed: 0,user_id,movie_id,count,title,genre
26,1,1097,4,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
225,4,1097,4,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
1076,10,1097,5,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
1309,13,1097,5,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
1712,17,1097,3,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
...,...,...,...,...,...
835166,6035,1097,4,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
835873,6036,1097,4,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
836064,6037,1097,5,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
836201,6039,1097,4,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi


In [402]:
# Release years range from 1919 to 2000
ratings['title'].apply(lambda x : int(x.split('(')[-1][:4])).sort_values()

371730    1919
395335    1919
835412    1919
587578    1919
665294    1919
          ... 
413662    2000
77813     2000
309771    2000
309769    2000
762702    2000
Name: title, Length: 836478, dtype: int64
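
The slice `x.split('(')[-1][:4]` relies on every title ending in "(YYYY)". A hedged alternative is a regular expression that states this assumption explicitly; the series `toy_titles` below is a made-up sample, not the real data.

```python
import pandas as pd

# Made-up sample titles in the same 'Title (Year)' layout as movies.dat.
toy_titles = pd.Series([
    'Toy Story (1995)',
    'Life Is Beautiful (La Vita è bella) (1997)',
    'Contender, The (2000)',
])

# Capture the four digits inside the final pair of parentheses.
# A title without a trailing "(YYYY)" would yield NaN here, and the
# astype(int) call would then fail, which makes bad rows easy to spot.
years = toy_titles.str.extract(r'\((\d{4})\)\s*$', expand=False).astype(int)
print(years.tolist())
```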

In [403]:
genres = list(movies['genre'].value_counts().keys())

In [404]:
def unique_genre(genres):
    uniq_genre = []  # list that collects every genre token
    for genre in genres:
        uniq_genre.extend(genre.split('|'))
    return list(set(uniq_genre))

In [405]:
genres = unique_genre(genres)

In [406]:
# Genres are pipe-separated and are selected from the following genres:
genres.sort()
genres

['Action',
 'Adventure',
 'Animation',
 "Children's",
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']
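
The same list of unique genres can be sketched with pandas string methods instead of a hand-written loop. The frame `toy_movies` below is a made-up stand-in; with the real `movies` frame the expression would be applied to `movies['genre']`.

```python
import pandas as pd

# Made-up stand-in for the movies frame.
toy_movies = pd.DataFrame({
    'title': ['Movie A (1990)', 'Movie B (1995)', 'Movie C (2000)'],
    'genre': ['Comedy|Drama', 'Drama', 'Action|Sci-Fi|Thriller'],
})

# Split each pipe-separated string into a list, flatten with explode(),
# then keep the distinct values in sorted order.
unique_genres = sorted(toy_movies['genre'].str.split('|').explode().unique())
print(unique_genres)
```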

3) Pick five movies I like and add them to ratings.

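- user_id : existing ids go up to 6040, so my id is set to 6041.
- movie_id : if the chosen movie already exists in the list, use its id;
otherwise assign an id after 3952.
- count : these are movies I like, so each gets a score of 5.
- title : preprocess to match the title format 'Title (Year)'.
- genre : when a movie has several genres, join them with |.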
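
Looking up the movie_id and the exact stored title is the step most likely to go wrong, because the dataset moves leading articles to the end ("Matrix, The"). A minimal sketch of a keyword lookup is shown here; `find_movies` and `toy_movies` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Made-up excerpt in the layout of movies.dat.
toy_movies = pd.DataFrame({
    'movie_id': [318, 2571, 1721],
    'title': ['Shawshank Redemption, The (1994)', 'Matrix, The (1999)', 'Titanic (1997)'],
    'genre': ['Drama', 'Action|Sci-Fi|Thriller', 'Drama|Romance'],
})

def find_movies(movies_df, keyword):
    """Return rows whose title contains `keyword`, ignoring case."""
    mask = movies_df['title'].str.contains(keyword, case=False, regex=False)
    return movies_df[mask]

hits = find_movies(toy_movies, 'shawshank')
print(hits)
```

Searching by a distinctive word rather than the full title avoids having to guess where the article was placed.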

In [407]:
# Choose five favorite movies released in or before 2000.

# Saving Private Ryan, 1998, War, Action, Drama
# Life Is Beautiful, 1997, Drama, Comedy
# The Shawshank Redemption, 1994, Drama
# Titanic, 1997, Romance, Drama
# The Matrix, 1999, Sci-Fi, Action

# Search for each movie one by one and check whether the title matches the dataset.

favorite_movies = ['Saving Private Ryan (1998)',
                   'Life Is Beautiful (1997)',
                   'The Shawshank Redemption (1994)',
                   'Titanic (1997)',
                   'The Matrix (1999)']

In [408]:
# Saving Private Ryan
movie_name = 'Saving Private Ryan (1998)'
ratings[ratings['title'].apply(lambda x : x == movie_name)]

Unnamed: 0,user_id,movie_id,count,title,genre
48,1,2028,5,Saving Private Ryan (1998),Action|Drama|War
102,2,2028,4,Saving Private Ryan (1998),Action|Drama|War
221,4,2028,5,Saving Private Ryan (1998),Action|Drama|War
473,7,2028,5,Saving Private Ryan (1998),Action|Drama|War
571,8,2028,5,Saving Private Ryan (1998),Action|Drama|War
...,...,...,...,...,...
834644,6027,2028,5,Saving Private Ryan (1998),Action|Drama|War
835008,6033,2028,5,Saving Private Ryan (1998),Action|Drama|War
835862,6036,2028,5,Saving Private Ryan (1998),Action|Drama|War
836059,6037,2028,4,Saving Private Ryan (1998),Action|Drama|War


In [409]:
# Life Is Beautiful
# 'Life Is Beautiful (1997)' does not match any title in the dataset.
# Change the title to 'Life Is Beautiful (La Vita è bella) (1997)'.

movie_name = 'Life Is Beautiful (1997)'
favorite_movies[1] = 'Life Is Beautiful (La Vita è bella) (1997)'
ratings[ratings['title'].apply(lambda x : x[:4] == 'Life')]

Unnamed: 0,user_id,movie_id,count,title,genre
595,8,2324,3,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama
698,9,2324,5,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama
885,10,2324,5,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama
1673,17,2377,5,Lifeforce (1985),Horror|Sci-Fi
2078,19,2324,5,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama
...,...,...,...,...,...
835133,6035,2324,5,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama
835728,6036,2324,4,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama
835756,6036,2377,3,Lifeforce (1985),Horror|Sci-Fi
836038,6037,2324,4,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama


In [410]:
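# The Shawshank Redemption
# The dataset stores this title as 'Shawshank Redemption, The (1994)', so the title has to be corrected.
# Caution: the assignment below targets `movies`, not `favorite_movies`. It adds a column named 2 to `movies`
# and leaves favorite_movies[2] as 'The Shawshank Redemption (1994)'.
# The intended statement is: favorite_movies[2] = 'Shawshank Redemption, The (1994)'
movie_name = 'Shawshank Redemption (1994)'
movies[2] = 'Shawshank Redemption, The (1994)'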
movies[movies['title'].apply(lambda x : (x[:4] == 'Shaw'))] 

Unnamed: 0,movie_id,title,genre,2
315,318,"Shawshank Redemption, The (1994)",Drama,"Shawshank Redemption, The (1994)"


In [411]:
# Titanic
movie_name = 'Titanic (1997)'
ratings[ratings['title'].apply(lambda x : x == movie_name)]

Unnamed: 0,user_id,movie_id,count,title,genre
27,1,1721,4,Titanic (1997),Drama|Romance
557,8,1721,5,Titanic (1997),Drama|Romance
672,9,1721,5,Titanic (1997),Drama|Romance
1027,10,1721,3,Titanic (1997),Drama|Romance
1934,18,1721,4,Titanic (1997),Drama|Romance
...,...,...,...,...,...
833544,6016,1721,3,Titanic (1997),Drama|Romance
834214,6023,1721,5,Titanic (1997),Drama|Romance
834500,6025,1721,4,Titanic (1997),Drama|Romance
834642,6027,1721,4,Titanic (1997),Drama|Romance


In [412]:
# The Matrix
# The dataset stores this title as 'Matrix, The (1999)', so change it accordingly.
movie_name = 'The Matrix (1999)'
favorite_movies[4] = 'Matrix, The (1999)' 
ratings[ratings['title'].apply(lambda x : x[:6] == 'Matrix')]

Unnamed: 0,user_id,movie_id,count,title,genre
127,2,2571,4,"Matrix, The (1999)",Action|Sci-Fi|Thriller
286,5,2571,5,"Matrix, The (1999)",Action|Sci-Fi|Thriller
459,7,2571,5,"Matrix, The (1999)",Action|Sci-Fi|Thriller
517,8,2571,5,"Matrix, The (1999)",Action|Sci-Fi|Thriller
640,9,2571,5,"Matrix, The (1999)",Action|Sci-Fi|Thriller
...,...,...,...,...,...
834757,6030,2571,5,"Matrix, The (1999)",Action|Sci-Fi|Thriller
834829,6031,2571,5,"Matrix, The (1999)",Action|Sci-Fi|Thriller
835072,6035,2571,5,"Matrix, The (1999)",Action|Sci-Fi|Thriller
835456,6036,2571,3,"Matrix, The (1999)",Action|Sci-Fi|Thriller


In [413]:
favorite_movies

['Saving Private Ryan (1998)',
 'Life Is Beautiful (La Vita è bella) (1997)',
 'The Shawshank Redemption (1994)',
 'Titanic (1997)',
 'Matrix, The (1999)']
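
In the list above, 'The Shawshank Redemption (1994)' still differs from the dataset's 'Shawshank Redemption, The (1994)'. Because the later indexing step keys on the title string, an unmatched title is treated as a brand-new movie. A small check along the following lines could catch this before the rows are appended; `missing_titles`, `toy_movies` and `toy_favorites` are hypothetical names used only for this sketch.

```python
import pandas as pd

# Made-up excerpt of the movies frame and a shortened favorites list.
toy_movies = pd.DataFrame({
    'movie_id': [318, 2571],
    'title': ['Shawshank Redemption, The (1994)', 'Matrix, The (1999)'],
})
toy_favorites = ['The Shawshank Redemption (1994)', 'Matrix, The (1999)']

def missing_titles(titles, movies_df):
    """Return the titles that have no exact match in movies_df['title']."""
    known = set(movies_df['title'])
    return [t for t in titles if t not in known]

missing = missing_titles(toy_favorites, toy_movies)
print(missing)
```

An empty result would indicate that every favorite title maps onto an existing movie.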

In [414]:
my_movies = {
    'user_id' : [6041] * 5,
    'movie_id' : [2028, 2324, 318, 1721, 2571],
    'count' : [5]*5,
    'title' : favorite_movies,
    'genre': ['Action|Drama|War',
             'Comedy|Drama',
             'Drama',
             'Drama|Romance',
             'Action|Sci-Fi|Thriller']
}

In [415]:
my_movies_df = pd.DataFrame(my_movies)

In [416]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
ratings = pd.concat([ratings, my_movies_df], ignore_index=True)

In [421]:
ratings.tail(10)

Unnamed: 0,user_id,movie_id,count,title,genre
836473,6040,1090,3,Platoon (1986),Drama|War
836474,6040,1094,5,"Crying Game, The (1992)",Drama|Romance|War
836475,6040,562,5,Welcome to the Dollhouse (1995),Comedy|Drama
836476,6040,1096,4,Sophie's Choice (1982),Drama
836477,6040,1097,4,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
836478,6041,2028,5,Saving Private Ryan (1998),Action|Drama|War
836479,6041,2324,5,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama
836480,6041,318,5,The Shawshank Redemption (1994),Drama
836481,6041,1721,5,Titanic (1997),Drama|Romance
836482,6041,2571,5,"Matrix, The (1999)",Action|Sci-Fi|Thriller


In [422]:
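# Drop columns not needed for modeling; errors='ignore' skips any that were already removed.
ratings.drop(['timestamp', 'genre'], axis=1, inplace=True, errors='ignore')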

In [423]:
ratings

Unnamed: 0,user_id,movie_id,count,title
0,1,1193,5,One Flew Over the Cuckoo's Nest (1975)
1,1,661,3,James and the Giant Peach (1996)
2,1,914,3,My Fair Lady (1964)
3,1,3408,4,Erin Brockovich (2000)
4,1,2355,5,"Bug's Life, A (1998)"
...,...,...,...,...
836478,6041,2028,5,Saving Private Ryan (1998)
836479,6041,2324,5,Life Is Beautiful (La Vita è bella) (1997)
836480,6041,318,5,The Shawshank Redemption (1994)
836481,6041,1721,5,Titanic (1997)


In [424]:
# Unique users and unique movie titles
user_unique = ratings['user_id'].unique()
movie_unique = ratings['title'].unique()

# Map each user id and movie title to a 0-based index
user_to_idx = {v:k for k,v in enumerate(user_unique)}
movie_to_idx = {v:k for k,v in enumerate(movie_unique)}

In [425]:
# Replace the column values with their indexed counterparts.

# Use user_to_idx.get to build a Series holding the index of every user_id value.
# Any row that failed to index would come out as NaN, so dropna() removes it.
temp_user_data = ratings['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(ratings):   # every row was indexed successfully
    print('user_id column indexing OK!!')
    ratings['user_id'] = temp_user_data   # replace ratings['user_id'] with the indexed Series
else:
    print('user_id column indexing Fail!!')

# Index the title column the same way with movie_to_idx.
temp_movie_data = ratings['title'].map(movie_to_idx.get).dropna()
if len(temp_movie_data) == len(ratings):
    print('title column indexing OK!!')
    ratings['title'] = temp_movie_data
else:
    print('title column indexing Fail!!')

ratings

user_id column indexing OK!!
title column indexing OK!!


Unnamed: 0,user_id,movie_id,count,title
0,0,1193,5,0
1,0,661,3,1
2,0,914,3,2
3,0,3408,4,3
4,0,2355,5,4
...,...,...,...,...
836478,6039,2028,5,48
836479,6039,2324,5,450
836480,6039,318,5,3628
836481,6039,1721,5,27
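
The dictionary-based indexing above can also be sketched with `pd.factorize`, which assigns 0-based codes in order of first appearance, the same order that `unique()` plus `enumerate` produces. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Made-up ratings excerpt with non-contiguous user ids.
toy = pd.DataFrame({
    'user_id': [1, 1, 6041, 7],
    'title': ['A (1990)', 'B (1995)', 'A (1990)', 'C (2000)'],
})

# factorize returns (codes, uniques); codes follow order of first appearance.
user_codes, user_unique = pd.factorize(toy['user_id'])
title_codes, title_unique = pd.factorize(toy['title'])

# Lookup tables equivalent to user_to_idx / movie_to_idx in the notebook.
user_to_idx = {v: i for i, v in enumerate(user_unique)}
movie_to_idx = {v: i for i, v in enumerate(title_unique)}

print(user_codes.tolist(), title_codes.tolist())
```

Keeping the lookup tables is still useful, since they are needed later to go from a raw user id or title to its row or column in the matrix.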


In [426]:
from scipy.sparse import csr_matrix

# Build the CSR matrix
num_user = ratings['user_id'].nunique()
num_movie = ratings['title'].nunique()

csr_data = csr_matrix((ratings['count'], (ratings.user_id, ratings.title)), shape= (num_user, num_movie))
csr_data

<6040x3629 sparse matrix of type '<class 'numpy.longlong'>'
	with 836483 stored elements in Compressed Sparse Row format>
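
To make the shape logic concrete, here is a minimal sketch on made-up indices. The shape has to cover the largest index plus one on each axis, and the stored-element count equals the number of distinct (user, item) pairs, since duplicate pairs are summed during construction.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 0-based indices and counts for four interactions.
user_idx = np.array([0, 0, 1, 2])
item_idx = np.array([0, 1, 0, 2])
counts = np.array([5, 3, 4, 5])

# One row per user and one column per item.
num_user = int(user_idx.max()) + 1
num_item = int(item_idx.max()) + 1

toy_csr = csr_matrix((counts, (user_idx, item_idx)), shape=(num_user, num_item))
print(toy_csr.shape, toy_csr.nnz)
print(toy_csr.toarray())
```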

In [427]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# Settings recommended by the implicit library; unrelated to the learning content itself.
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

In [428]:
# Declare the implicit AlternatingLeastSquares model
als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=15, dtype=np.float32)

In [429]:
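# The ALS model here takes an item x user matrix as input, so transpose it.
# This matches implicit releases before 0.5; newer releases expect user x item, so check the installed version.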
csr_data_transpose = csr_data.T
csr_data_transpose

<3629x6040 sparse matrix of type '<class 'numpy.longlong'>'
	with 836483 stored elements in Compressed Sparse Column format>

4) Let's build the CSR matrix ourselves.

In [430]:
# Train the model
als_model.fit(csr_data_transpose)

  0%|          | 0/15 [00:00<?, ?it/s]

In [432]:

# Look up my user vector and the movie's item vector from the trained model
my_idx, matrix_idx = user_to_idx[6041], movie_to_idx['Matrix, The (1999)']
my_vector, matrix_vector = als_model.user_factors[my_idx], als_model.item_factors[matrix_idx]

In [433]:
my_vector

array([ 1.22833335e+00,  5.62239215e-02, -1.59503385e-01, -1.84889898e-01,
       -1.68979481e-01, -1.22353330e-01,  5.43722630e-01, -4.75693122e-02,
       -3.78764898e-01,  1.08256984e+00,  3.49917680e-01,  1.73650086e-01,
       -6.38982713e-01,  4.06543374e-01, -5.75075984e-01,  1.19217755e-02,
       -1.12571216e+00, -9.31851938e-02,  3.72354329e-01,  4.67013538e-01,
        1.51442021e-01,  5.52071571e-01,  3.53909194e-01,  7.51697600e-01,
       -3.21185254e-02,  6.82726979e-01, -1.98759526e-01, -2.10195243e-01,
        1.49604063e-02, -1.28654987e-01,  4.93119657e-01, -3.40344220e-01,
       -9.28556204e-01,  7.36328736e-02, -1.07435822e+00, -2.92284042e-01,
       -3.29814136e-01, -4.88126546e-01,  8.33079040e-01, -1.47266045e-01,
       -3.69759887e-01, -1.08713157e-01, -3.21160585e-01, -3.55445325e-01,
        1.37913799e+00, -1.71435475e-02,  5.18075377e-02, -4.99633908e-01,
        7.13892341e-01,  9.01375934e-02, -1.04771398e-01,  5.90528488e-01,
       -9.65893507e-01,  

In [215]:
num_movie

3628

In [205]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 836483 entries, 0 to 836482
Data columns (total 5 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   user_id   836483 non-null  int64 
 1   movie_id  836483 non-null  object
 2   count     836483 non-null  object
 3   title     836483 non-null  int64 
 4   genre     836483 non-null  object
dtypes: int64(2), object(3)
memory usage: 31.9+ MB


In [None]:
# 5) Configure and train an AlternatingLeastSquares model (als_model) yourself.

In [9]:
# 6) Pick one of my five favorite movies and one other movie, and check the preference the trained model predicts for me.
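
In a matrix factorization model the predicted preference is the dot product of a user vector and an item vector. The sketch below uses small made-up arrays in place of `als_model.user_factors` and `als_model.item_factors`; `predicted_preference` is a hypothetical helper, not a library call.

```python
import numpy as np

# Made-up factors: 2 users and 3 items, each with 2 latent dimensions.
toy_user_factors = np.array([[1.0, 0.0],
                             [0.5, 0.5]])
toy_item_factors = np.array([[0.8, 0.1],
                             [0.0, 1.0],
                             [0.2, 0.2]])

def predicted_preference(user_factors, item_factors, user_idx, item_idx):
    """Dot product of one user vector and one item vector."""
    return float(np.dot(user_factors[user_idx], item_factors[item_idx]))

liked = predicted_preference(toy_user_factors, toy_item_factors, 0, 0)
other = predicted_preference(toy_user_factors, toy_item_factors, 0, 1)
print(liked, other)
```

With the trained model, the indices would come from `user_to_idx[6041]` and `movie_to_idx[...]`, and a movie I rated would be expected to score noticeably higher than an unrelated one.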

In [10]:
# 7) Get recommendations for movies similar to one I like.
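
Item-to-item similarity can be illustrated with cosine similarity between item vectors. This is a NumPy sketch on made-up factors; whether it reproduces the library's own similarity scoring exactly is an assumption, and `most_similar` is a hypothetical helper.

```python
import numpy as np

# Made-up item factors: 4 items with 2 latent dimensions (no zero vectors).
toy_item_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

def most_similar(item_factors, item_idx, n=3):
    """Return the n items with the highest cosine similarity to item_idx."""
    norms = np.linalg.norm(item_factors, axis=1)
    sims = item_factors @ item_factors[item_idx] / (norms * norms[item_idx])
    order = np.argsort(-sims)[:n]
    return [(int(i), float(sims[i])) for i in order]

similar = most_similar(toy_item_factors, 0, n=3)
print(similar)
```

The query item itself comes back first with similarity 1, so it is usually skipped when presenting results. Mapping indices back to titles needs an inverse of `movie_to_idx`.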

In [11]:
# 8) Get recommendations for the movies I am most likely to enjoy.
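
Recommending for a user amounts to scoring every item against the user vector, masking items the user has already rated, and keeping the top scores. The following NumPy sketch uses made-up factors and a hypothetical `recommend_top_n` helper; it illustrates the idea and is not the library's implementation.

```python
import numpy as np

# Made-up factors: 2 users, 4 items, 2 latent dimensions.
toy_user_factors = np.array([[1.0, 0.0],
                             [0.0, 1.0]])
toy_item_factors = np.array([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.7, 0.3],
                             [0.1, 0.1]])
# Items each user has already rated (user index -> set of item indices).
toy_seen = {0: {0}, 1: {1}}

def recommend_top_n(user_idx, user_factors, item_factors, seen, n=2):
    """Rank unseen items by dot-product score for one user."""
    scores = (item_factors @ user_factors[user_idx]).astype(float)
    seen_items = list(seen.get(user_idx, ()))
    if seen_items:
        scores[seen_items] = -np.inf  # exclude already-rated items
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]

recs = recommend_top_n(0, toy_user_factors, toy_item_factors, toy_seen, n=2)
print(recs)
```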