# Training Topic Models with gensim

In this notebook, we demonstrate LDA using gensim. For the dataset link, see `BookSummaries_Link.md` under the `data` folder of Chapter 7.

## Setup

### Installing packages

In [5]:
!pip install -q nltk==3.2.5 gensim==4.1.2 pandas==1.1.5

Collecting gensim
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |████████████████████████████████| 24.1 MB 2.6 kB/s 
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.1.2


### Imports

In [53]:
from pprint import pprint

import nltk
import pandas as pd
import spacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LsiModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [54]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 3.7 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


### Preparing the dataset

As the dataset for training the topic model, we use the [CMU Book Summary Dataset](https://www.cs.cmu.edu/~dbamman/booksummaries.html). It consists of plot summaries for 16,559 books extracted from Wikipedia. The file is tab-separated and contains the following fields:

1. Wikipedia article ID
2. Freebase ID
3. Book title
4. Author
5. Publication date
6. Book genres (Freebase ID:name tuples)
7. Plot summary

First, let's download and extract it.

In [1]:
!wget https://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
!tar xvfz booksummaries.tar.gz

--2021-09-23 10:41:50--  https://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
Resolving www.cs.cmu.edu (www.cs.cmu.edu)... 128.2.42.95
Connecting to www.cs.cmu.edu (www.cs.cmu.edu)|128.2.42.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16795330 (16M) [application/x-gzip]
Saving to: ‘booksummaries.tar.gz’


2021-09-23 10:43:03 (226 KB/s) - ‘booksummaries.tar.gz’ saved [16795330/16795330]

booksummaries/
booksummaries/README
booksummaries/booksummaries.txt


Let's check the contents.

In [4]:
!head -1 booksummaries/booksummaries.txt

620	/m/0hhy	Animal Farm	George Orwell	1945-08-17	{"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt": "Satire", "/m/0dwly": "Children's literature", "/m/014dfn": "Speculative fiction", "/m/02xlf": "Fiction"}	 Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs an

Since the file is tab-separated, we can load it with `read_csv` from `pandas`.

In [82]:
df = pd.read_csv(
    "booksummaries/booksummaries.txt",
    sep="\t",
    encoding="utf-8",
    names=["wikipediaId", "freebaseId", "title", "author", "date", "genres", "summary"]
)
df.head()

| | wikipediaId | freebaseId | title | author | date | genres | summary |
|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | | | The argument of the Enquiry proceeds by a ser... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... |
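
The `genres` column stores a JSON object that maps Freebase IDs to genre names, and it is missing for some books (row 3 above). A minimal sketch of decoding one cell with the standard library follows. `parse_genres` is a hypothetical helper, not part of the original notebook, and the sample string is copied from the first line of the file shown earlier.

```python
import json


def parse_genres(cell):
    """Return the genre names stored in one `genres` cell.

    Missing cells (None, or NaN floats once loaded by pandas) yield an empty list.
    """
    if not isinstance(cell, str):
        return []
    return list(json.loads(cell).values())


# Sample cell taken from the Animal Farm record.
sample = (
    '{"/m/016lj8": "Roman \\u00e0 clef", "/m/06nbt": "Satire", '
    '"/m/0dwly": "Children\'s literature", "/m/014dfn": "Speculative fiction", '
    '"/m/02xlf": "Fiction"}'
)
genres = parse_genres(sample)
print(genres)
```

With the DataFrame above, the helper could be applied as `df.genres.apply(parse_genres)`.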


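## Preprocessing

Preprocessing converts the text into a form that gensim can train on. We apply the following three text-cleaning steps:

- Stopword removal
- Removal of non-alphabetic tokens
- Lowercasing

Let's define functions for these steps, along with a lemmatization step that keeps only nouns, adjectives, verbs, and adverbs.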

In [83]:
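# Load the spaCy pipeline; the parser and NER are not needed for lemmatization.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# Build the stopword set once instead of on every call.
stops = set(stopwords.words("english"))


def lemmatization(words, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    """Lemmatize tokens and keep only those with an allowed POS tag."""
    doc = nlp(" ".join(words))
    return [token.lemma_ for token in doc if token.pos_ in allowed_postags]


def preprocess(textstring):
    """Tokenize, lowercase, and drop stopwords and non-alphabetic tokens."""
    # Lowercase before the stopword check: NLTK's stopword list is lowercase.
    tokens = (token.lower() for token in word_tokenize(textstring))
    return [token for token in tokens if token.isalpha() and token not in stops]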
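
One detail worth checking in `preprocess` is the order of lowercasing and stopword filtering. NLTK's English stopword list is all lowercase, so a capitalized token such as "The" only matches after it has been lowercased. The sketch below illustrates the difference in plain Python; the tiny `STOPS` set and whitespace tokenization are simplifying assumptions standing in for `stopwords.words("english")` and `word_tokenize`.

```python
STOPS = {"the", "on", "a"}  # hypothetical stand-in for NLTK's stopword list


def filter_then_lower(text):
    """Check stopwords against the original casing, then lowercase."""
    return [t.lower() for t in text.split() if t.isalpha() and t not in STOPS]


def lower_then_filter(text):
    """Lowercase first, then check stopwords."""
    tokens = (t.lower() for t in text.split())
    return [t for t in tokens if t.isalpha() and t not in STOPS]


text = "The animals revolt on the farm"
print(filter_then_lower(text))  # the capitalized "The" slips through as "the"
print(lower_then_filter(text))
```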

In [84]:
df.summary = df.summary.apply(preprocess).apply(lemmatization)
df.head()

| | wikipediaId | freebaseId | title | author | date | genres | summary |
|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | [manor, farm, call, animal, farm, meeting, hum... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | [gang, friend, droog, taste, music, novel, dro... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | [text, plague, part, town, thousand, rat, popu... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | | | [argument, step, chapter, one, epistemology, h... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | | {"/m/03lrw": "Hard science fiction", "/m/06n90... | [novel, posit, space, way, layer, zone, law, p... |


Next, we build a gensim corpus from the list of tokenized texts.

In [91]:
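# Build the dictionary (token-to-ID mapping) from the tokenized texts
dictionary = Dictionary(df.summary.values)
# Drop rare tokens (in fewer than 5 documents) and frequent tokens (in more than 50% of documents)
dictionary.filter_extremes(no_below=5, no_above=0.5)
# Convert each summary into a bag-of-words vector
corpus = [dictionary.doc2bow(summary) for summary in df.summary.values]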
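
`filter_extremes` prunes the vocabulary by document frequency: `no_below=5` keeps tokens that occur in at least 5 documents, and `no_above=0.5` keeps tokens that occur in at most 50% of all documents. The pure-Python sketch below mirrors that rule on a toy corpus. `filter_by_doc_freq` is a hypothetical helper written for illustration, not gensim's implementation, and it ignores gensim's additional `keep_n` cap on vocabulary size.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within the bounds."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, n in doc_freq.items() if no_below <= n <= max_docs}


docs = [
    ["ship", "crew", "story"],
    ["ship", "planet", "story"],
    ["family", "mother", "story"],
    ["family", "ship", "story"],
]
# "story" appears in every document and is dropped; tokens seen once are dropped too.
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```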

In [92]:
corpus[0][:10]

[(0, 1),
 (1, 4),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 35),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1)]
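
Each pair in the output is `(token_id, count)`: for example, `(5, 35)` means that the token with ID 5 occurs 35 times in the first summary. A minimal pure-Python sketch of this bag-of-words conversion is shown below. `to_bow` and the `token2id` mapping are hypothetical, and tokens missing from the mapping are dropped, as `doc2bow` does by default.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Convert a token list into (token_id, count) pairs sorted by ID."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"animal": 0, "farm": 1, "pig": 2}  # hypothetical vocabulary
bow = to_bow(["farm", "animal", "farm", "revolt", "pig"], token2id)
print(bow)
```

To read the real corpus the same way, the IDs can be mapped back to tokens through the `Dictionary`, for example `[(dictionary[i], n) for i, n in corpus[0][:10]]`.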

In [93]:
id2word = dict(dictionary.items())

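## Training the model

### LDA

Now we pass the corpus to gensim's `LdaModel` and train the model. Besides the corpus, we specify the following:

- `iterations`: the maximum number of iterations over the corpus when inferring topic distributions
- `num_topics`: the number of topics
- `id2word`: the mapping from word IDs to words, used for displaying topics
- `random_state`: the random seed, fixed so that results are reproducible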

In [94]:
model = LdaModel(corpus=corpus, id2word=id2word, iterations=400, num_topics=10, random_state=2021)

In [95]:
model.show_topics()

[(0,
  '0.011*"man" + 0.009*"time" + 0.009*"ship" + 0.006*"year" + 0.006*"group" + 0.006*"way" + 0.006*"story" + 0.006*"life" + 0.006*"people" + 0.005*"friend"'),
 (1,
  '0.035*"family" + 0.027*"mother" + 0.018*"child" + 0.014*"life" + 0.014*"brother" + 0.013*"friend" + 0.013*"year" + 0.011*"man" + 0.011*"time" + 0.011*"love"'),
 (2,
  '0.011*"day" + 0.010*"book" + 0.010*"story" + 0.009*"man" + 0.009*"time" + 0.008*"friend" + 0.008*"boy" + 0.008*"life" + 0.007*"job" + 0.007*"night"'),
 (3,
  '0.030*"ship" + 0.014*"earth" + 0.014*"time" + 0.012*"crew" + 0.010*"world" + 0.009*"planet" + 0.009*"space" + 0.007*"human" + 0.006*"year" + 0.006*"race"'),
 (4,
  '0.013*"world" + 0.013*"war" + 0.011*"force" + 0.008*"time" + 0.008*"people" + 0.007*"order" + 0.007*"power" + 0.007*"government" + 0.006*"story" + 0.006*"group"'),
 (5,
  '0.026*"book" + 0.019*"story" + 0.014*"life" + 0.013*"novel" + 0.011*"character" + 0.011*"chapter" + 0.011*"woman" + 0.009*"man" + 0.009*"year" + 0.008*"author"'),
 (
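
`show_topics` returns each topic as a formatted string of `weight*"word"` terms joined by ` + `. When the numbers are needed programmatically, `model.show_topic(topic_id)` returns `(word, probability)` pairs directly. The regex-based sketch below only illustrates how the string form is structured; `parse_topic` is a hypothetical helper, and the sample string is the beginning of topic 3 from the output above.

```python
import re

# One term looks like 0.030*"ship": a weight, an asterisk, and a quoted word.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Split a gensim-style topic string into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


topic = '0.030*"ship" + 0.014*"earth" + 0.014*"time" + 0.012*"crew"'
pairs = parse_topic(topic)
print(pairs)
```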

In [96]:
top_topics = list(model.top_topics(corpus))
pprint(top_topics)

[([(0.03484744, 'family'),
   (0.027079526, 'mother'),
   (0.01845889, 'child'),
   (0.014007284, 'life'),
   (0.013950966, 'brother'),
   (0.013115266, 'friend'),
   (0.012610523, 'year'),
   (0.011448137, 'man'),
   (0.010960756, 'time'),
   (0.010862761, 'love'),
   (0.010634469, 'father'),
   (0.010614914, 'woman'),
   (0.010437573, 'son'),
   (0.009884664, 'day'),
   (0.009521468, 'wife'),
   (0.00949569, 'daughter'),
   (0.009491833, 'story'),
   (0.0087956395, 'death'),
   (0.008644367, 'girl'),
   (0.008607693, 'sister')],
  -1.2143685824930421),
 ([(0.011421871, 'man'),
   (0.009466162, 'time'),
   (0.009384962, 'ship'),
   (0.0063382485, 'year'),
   (0.006289876, 'group'),
   (0.0062795454, 'way'),
   (0.0062499307, 'story'),
   (0.006032021, 'life'),
   (0.0059705796, 'people'),
   (0.0052855946, 'friend'),
   (0.004933433, 'woman'),
   (0.004860364, 'order'),
   (0.004858799, 'day'),
   (0.0048566605, 'village'),
   (0.004849657, 'child'),
   (0.004829891, 'book'),
   (0.00
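
`top_topics` returns one `(terms, coherence)` tuple per topic, where `terms` is a list of `(probability, word)` pairs. The list is ordered from the most to the least coherent topic, using the `u_mass` measure by default, for which higher (less negative) values indicate more coherent topics. A small sketch of unpacking that structure follows. The sample holds the first three terms of the first topic printed above, and `summarize` is a hypothetical helper.

```python
def summarize(top_topics, topn=3):
    """Return (coherence, [words]) for each topic, keeping the given order."""
    return [
        (coherence, [word for _prob, word in terms[:topn]])
        for terms, coherence in top_topics
    ]


# Same shape as the output of model.top_topics(corpus), truncated for brevity.
sample = [
    (
        [(0.03484744, "family"), (0.027079526, "mother"), (0.01845889, "child")],
        -1.2143685824930421,
    ),
]
for coherence, words in summarize(sample):
    print(f"{coherence:.3f}: {', '.join(words)}")
```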

In [None]:
for idx in range(10):
    print("Topic #%s:" % idx, model.print_topic(idx, 10))
print("=" * 20)

Topic #0: 0.013*"jacky" + 0.006*"dahlia" + 0.005*"novel" + 0.005*"one" + 0.004*"story" + 0.004*"also" + 0.004*"book" + 0.004*"team" + 0.004*"narrator" + 0.003*"jeremy"
Topic #1: 0.010*"book" + 0.009*"war" + 0.006*"in" + 0.006*"world" + 0.005*"novel" + 0.005*"states" + 0.004*"also" + 0.004*"new" + 0.004*"chapter" + 0.004*"story"
Topic #2: 0.008*"he" + 0.007*"she" + 0.006*"mother" + 0.006*"one" + 0.005*"tells" + 0.005*"back" + 0.005*"house" + 0.005*"father" + 0.005*"school" + 0.005*"go"
Topic #3: 0.007*"life" + 0.006*"love" + 0.006*"family" + 0.006*"father" + 0.006*"he" + 0.006*"novel" + 0.005*"young" + 0.005*"she" + 0.004*"story" + 0.004*"one"
Topic #4: 0.007*"he" + 0.006*"one" + 0.004*"murder" + 0.004*"police" + 0.004*"man" + 0.003*"two" + 0.003*"case" + 0.003*"also" + 0.003*"would" + 0.003*"time"
Topic #5: 0.007*"earth" + 0.006*"one" + 0.005*"time" + 0.005*"human" + 0.005*"world" + 0.004*"new" + 0.004*"planet" + 0.004*"life" + 0.003*"space" + 0.003*"he"
Topic #6: 0.006*"he" + 0.005*"t

## References

- [Latent Dirichlet Allocation](https://radimrehurek.com/gensim/models/ldamodel.html)