# Building a Recommender System with Doc2Vec

This notebook shows how to build a recommender system with Doc2Vec. The dataset is the [CMU Book Summary Dataset](https://www.cs.cmu.edu/~dbamman/booksummaries.html), which contains plot summaries of 16,559 books extracted from Wikipedia. The data is tab-separated, and each record holds the following fields.

1. Wikipedia article ID
2. Freebase ID
3. Book title
4. Author
5. Publication date
6. Book genres (Freebase ID:name tuples)
7. Plot summary


## Setup

### Installing packages

In [1]:
!pip install -q nltk==3.2.5 gensim==4.1.2 pandas==1.1.5


### Imports

In [15]:
from pprint import pprint

import nltk
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Preparing the dataset

First, download and extract the dataset.

In [3]:
!wget https://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
!tar xvfz booksummaries.tar.gz

--2021-09-26 09:50:39--  https://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
Resolving www.cs.cmu.edu (www.cs.cmu.edu)... 128.2.42.95
Connecting to www.cs.cmu.edu (www.cs.cmu.edu)|128.2.42.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16795330 (16M) [application/x-gzip]
Saving to: ‘booksummaries.tar.gz’


2021-09-26 09:51:12 (502 KB/s) - ‘booksummaries.tar.gz’ saved [16795330/16795330]

booksummaries/
booksummaries/README
booksummaries/booksummaries.txt


Let's take a look at the contents.

In [4]:
!head -1 booksummaries/booksummaries.txt

620	/m/0hhy	Animal Farm	George Orwell	1945-08-17	{"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt": "Satire", "/m/0dwly": "Children's literature", "/m/014dfn": "Speculative fiction", "/m/02xlf": "Fiction"}	 Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs an

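Because the file is tab-separated, we can load it directly with pandas' `read_csv`. The file has no header row, so the column names are supplied through `names`.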

In [7]:
df = pd.read_csv(
    "booksummaries/booksummaries.txt",
    sep="\t",
    encoding="utf-8",
    names=["wikipediaId", "freebaseId", "title", "author", "date", "genres", "summary"]
)
df.head()

| | wikipediaId | freebaseId | title | author | date | genres | summary |
|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | | | The argument of the Enquiry proceeds by a ser... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... |
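
The `genres` column stores Freebase ID:name pairs as a JSON string, as the first record of the file shows. The following is a minimal sketch of how the genre names could be pulled out with the standard `json` module. The helper name `parse_genres` is hypothetical and not part of the original notebook, and the sketch assumes that missing genres arrive as a non-string value such as NaN.

```python
import json


def parse_genres(value):
    """Return the genre names stored in one `genres` cell (hypothetical helper)."""
    # Assumption: missing genres are represented by a non-string value (e.g. NaN).
    if not isinstance(value, str):
        return []
    # The cell is a JSON object mapping Freebase IDs to genre names.
    return list(json.loads(value).values())


# Genres of "Animal Farm", copied from the first record of the dataset.
sample_genres = (
    '{"/m/016lj8": "Roman \\u00e0 clef", "/m/06nbt": "Satire", '
    '"/m/0dwly": "Children\'s literature", "/m/014dfn": "Speculative fiction", '
    '"/m/02xlf": "Fiction"}'
)
genre_names = parse_genres(sample_genres)
print(genre_names)
```

With the DataFrame loaded above, the same helper could be applied to the whole column with `df["genres"].apply(parse_genres)`.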


## Preprocessing

To train Doc2Vec, we split each summary into words and build a list of `TaggedDocument` objects. `TaggedDocument` is the input format for Doc2Vec and consists of a list of words and a list of tags. Here we use the book title as the tag.

In [10]:
# Tokenize each summary and tag it with the book title.
train_doc2vec = [
    TaggedDocument(word_tokenize(row.summary), tags=[row.title])
    for _, row in df.iterrows()
]
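
Because the title serves as the tag, books that happen to share a title would be trained into a single shared vector. The sketch below, using only the standard library, shows one way to check for that. The `titles` list is a made-up stand-in for `df["title"]`; whether the real dataset contains duplicate titles is not verified here.

```python
from collections import Counter

# Hypothetical titles standing in for df["title"]; not taken from the dataset.
titles = ["Animal Farm", "The Plague", "Emma", "Emma"]

# Count how often each title occurs and keep those that appear more than once.
counts = Counter(titles)
duplicates = {title: n for title, n in counts.items() if n > 1}
print(duplicates)
```

If duplicates turn out to matter, a unique column such as `wikipediaId` could serve as the tag instead, at the cost of having to look up the title afterwards.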

## Training the model

In [12]:
model = Doc2Vec(vector_size=50, alpha=0.025, min_count=10, dm=1, epochs=100)
model.build_vocab(train_doc2vec)
model.train(train_doc2vec, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v.model")

## Recommending books

Now let's use the trained model to recommend books. Here we display the titles of the books most similar to a given sentence, along with their similarity scores.

In [14]:
# Load the trained model
model = Doc2Vec.load("d2v.model")

In [17]:
# A sentence taken from the plot summary of "Animal Farm" on Wikipedia
# https://en.wikipedia.org/wiki/Animal_Farm
sample = """
Napoleon enacts changes to the governance structure of the farm, replacing meetings with a committee of pigs who will run the farm.
"""
new_vector = model.infer_vector(word_tokenize(sample))
sims = model.dv.most_similar([new_vector])
pprint(sims)

[('Animal Farm', 0.6877216100692749),
 ('The Wild Irish Girl', 0.6764125227928162),
 ('Ponni', 0.6193090677261353),
 ('Walk in My Soul', 0.5841655731201172),
 ('Sweet Thursday', 0.5803889036178589),
 ('Payback: Debt and the Shadow Side of Wealth', 0.5785144567489624),
 ("Family Guy: Stewie's Guide to World Domination", 0.5723893642425537),
 ('Ọba kò so', 0.5715106725692749),
 ("Snowball's Chance", 0.5708845853805542),
 ('The Evil Empire: 101 Ways That England Ruined the World',
  0.5648933053016663)]
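
The scores in this output are cosine similarities between the inferred vector and each document vector. The ranking can be sketched in plain Python as follows. The function names and the three-dimensional toy vectors are made up for illustration, whereas the model above uses 50-dimensional vectors learned from the summaries.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, doc_vectors, topn=10):
    """Return (tag, similarity) pairs sorted from most to least similar."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, not taken from the trained model.
toy_vectors = {
    "Book A": [1.0, 0.0, 0.0],
    "Book B": [0.0, 1.0, 0.0],
    "Book C": [1.0, 1.0, 0.0],
}
ranking = rank_by_similarity([1.0, 0.1, 0.0], toy_vectors, topn=2)
print(ranking)
```

A score near 1 means the two vectors point in almost the same direction, which is why "Animal Farm" appearing at the top of the list above is a reasonable sanity check for the model.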


More recently, models that can generate multilingual embeddings have become available, such as [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3) and [LaBSE](https://tfhub.dev/google/LaBSE/2), so it may be interesting to try them as well.