

Overview
Memory-based collaborative filtering using user review scores.

Data source
https://erogamescape.dyndns.org/~ap2/ero/toukei_kaiseki/



Setup

In [1]:
import numpy as np
import pandas as pd
from google.colab import drive
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df = pd.read_csv('drive/My Drive/dev/20230424_recommend_erogame/userbase.csv', encoding='utf-8')
# Build the user x game matrix (cast to a smaller dtype to save memory)
df = df.pivot(index='uid', columns='game_id', values='score').astype('Int16')
# Fill missing values with 0
df.fillna(0, inplace=True)

In [4]:
df.head()

uid \ game_id,1,2,3,4,5,6,7,8,9,10,...,34250,34251,34252,34253,34254,34255,34258,34266,34268,34273
Daile,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
#9.1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
$howka,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
&dagger;mmv&dagger;,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
&eacute;toile,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17821 entries,  Daile to ﾘﾘｨ
Columns: 25610 entries, 1 to 34273
dtypes: Int16(25610)
memory usage: 1.3+ GB


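With this many games, most cells in the matrix are missing (unrated).
This implementation handles that by ignoring missing values: similarity is computed only over the games the target user has rated.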
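
To put a number on how sparse such a matrix is, the share of unrated cells can be computed directly. The following is only a minimal sketch: `toy` is a small made-up frame, not the real `df`, and it assumes, as in the cells above, that 0 means "not rated".

```python
import pandas as pd

# Toy user x game matrix (hypothetical data); 0 means "not rated"
toy = pd.DataFrame(
    [[80, 0, 0, 0], [0, 0, 75, 0], [0, 90, 0, 0]],
    index=["u1", "u2", "u3"],
    columns=[1, 2, 3, 4],
)

# Fraction of cells that carry no rating
sparsity = float((toy == 0).to_numpy().mean())
print(f"{sparsity:.2%} of the cells are unrated")
```

The same expression should work on the full matrix, although it builds a temporary boolean copy of it.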

### Flow
* Extract the games the target user has already rated from the rating matrix
* Rebuild the matrix restricted to those games
* Compute the similarity
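
The three steps above can be sketched on a toy matrix. This is an illustration under stated assumptions, not the notebook's implementation: the frame `toy` and the user names `target`, `a`, and `b` are made up, and the cosine similarity is written out with NumPy instead of calling scikit-learn's `cosine_similarity`, which computes the same quantity.

```python
import numpy as np
import pandas as pd

# Toy rating matrix (hypothetical data); 0 means "not rated"
toy = pd.DataFrame(
    [[80, 60, 0, 0],
     [80, 60, 90, 0],
     [0, 70, 0, 85]],
    index=["target", "a", "b"],
    columns=[1, 2, 3, 4],
)

# 1. Games the target user has rated
target_row = toy.loc["target"]
rated_games = target_row[target_row > 0].index.tolist()

# 2. Restrict the matrix to those games
base = toy[rated_games]

# 3. Cosine similarity between the target user and every other user.
#    Assumes each other user rated at least one of these games;
#    an all-zero row would produce NaN in this hand-written formula.
others = base.drop(index="target")
t = base.loc["target"].to_numpy(dtype=float)
m = others.to_numpy(dtype=float)
sims = m @ t / (np.linalg.norm(m, axis=1) * np.linalg.norm(t))
similarity = pd.Series(sims, index=others.index).sort_values(ascending=False)
print(similarity)
```

Because only the target user's rated games are kept, user `a` comes out as a perfect match even though `a` has also rated game 3; ratings outside the target's games do not affect the similarity.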

### Reference
https://techblog.gmo-ap.jp/2019/12/17/machine-learning-missing/

In [11]:
# Drop users who have rated too many games (more than 1000)
def get_recommend_source(df):
  # Count the games each user has rated (score > 0)
  rated_counts = (df > 0).sum(axis=1)
  # Keep only the users at or below the threshold
  return df[rated_counts <= 1000]

recommend_source_df = get_recommend_source(df)

In [30]:
def get_recommend_game(df, recommend_source, target_user):
  # Get the IDs of the games the target user has rated
  target_games = []
  for column_name, item in df.loc[target_user].items():
    if item > 0:
      target_games.append(column_name)
  # Rebuild the matrix restricted to those games
  recommend_base_df = recommend_source[target_games]
  # Compute the similarity with every other user
  score_list = {}
  target_df = df.loc[target_user][target_games]
  for index, _ in recommend_base_df.iterrows():
    if index == target_user:
      continue
    tmp_df = recommend_base_df.loc[index]
    similarity = cosine_similarity(np.array([target_df,tmp_df]))
    score_list[index] = similarity[0,1]
  score_list = sorted(score_list.items(), key=lambda x:x[1], reverse=True)
  # Take the 10 most similar users and extract their highly rated games
  pickup_users = score_list[:10]
  print(pickup_users)
  recommend_games = {}
  for tmp_pickup_users in pickup_users:
    key = tmp_pickup_users[0]
    value = tmp_pickup_users[1]
    for key2, value2 in df.loc[key].items():
      if value2==0:
        continue
      if (key2 not in target_games):
        if (key2 not in recommend_games):
          recommend_games[key2] = []
        recommend_games[key2].append(value2*value)
  recommend_games_result = {}
  for key, value in recommend_games.items():
    recommend_games_result[key] = sum(value) / len(value)
  recommend_games_result = sorted(recommend_games_result.items(), key=lambda x:x[1], reverse=True)
  print(recommend_games_result)

# Test
# get_recommend_game(df, recommend_source_df, 'XXXX')

[('tenchi', 0.8151762417560416), ('andrea', 0.8098887312225803), ('x-rated', 0.800354664696457), ('heart', 0.7933207560881542), ('smsksk', 0.7818081736984784), ('kad', 0.7805994510404689), ('Lumis.Eterne', 0.7671259295330675), ('mint', 0.7487959669706881), ('wanbe', 0.7408556912849291), ('usagi19', 0.7398447037802508)]
[(5378, 80.0354664696457), (5432, 76.93942946614513), (6644, 76.03369314616342), (7441, 75.36547182837465), (19952, 75.36547182837465), (190, 74.93754729988501), (6087, 74.27177650135545), (620, 74.15694784884455), (3868, 74.0855691284929), (10504, 74.0855691284929), (12591, 74.01390607882922), (3150, 73.63262915207405), (5358, 73.63262915207405), (19666, 73.24462567424483), (8204, 72.98550956011019), (5424, 72.83227448737759), (619, 72.5957489467636), (7668, 72.5957489467636), (6940, 72.27523532145811), (467, 72.03191982268113), (2093, 71.83011039626341), (7467, 71.39886804793387), (20312, 71.39886804793387), (6722, 71.35535268090351), (12972, 71.1792955195502), (16506,