# Combined Topic Modeling

In this tutorial, we use a **Combined Topic Model** to extract topics from a collection of documents.

## Topic Models

Topic models extract the latent topics of a document collection in an unsupervised way.

## Contextualized Topic Models
What are Contextualized Topic Models (CTMs)? CTMs are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised learning of topic models to extract topics from documents.

# Installing Contextualized Topic Models (CTM)

Let's install the contextualized topic models library.

In [7]:
!pip install contextualized-topic-models==2.2.0



In [8]:
!pip install pyldavis



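## Restarting the Notebook

The notebook needs to be restarted for the rest of the tutorial to run smoothly.

From the top menu, click Runtime > Restart runtime.

# Data

We need data to train on. Specifically, we need a file that contains one document per line. If you don't have your own data, you can follow along with the file prepared here.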
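
If your own corpus is held in memory as a list of strings, the sketch below shows one way to write it in the one-document-per-line format. The helper names `write_one_document_per_line` and `read_one_document_per_line` and the file name `my_corpus.txt` are hypothetical and not part of the library. Internal line breaks are collapsed to spaces so that each document stays on a single line.

```python
import os
import tempfile


def write_one_document_per_line(documents, path):
    """Write each document on its own line, collapsing internal whitespace."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            # split() without arguments also removes newlines and tabs
            line = " ".join(doc.split())
            if line:  # skip documents that are empty after cleaning
                f.write(line + "\n")


def read_one_document_per_line(path):
    """Read the file back as a list of documents, one per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Hypothetical example corpus; the second document contains a line break
# and the third one is blank.
docs = ["First document.", "Second document,\nspread over two lines.", "   "]
path = os.path.join(tempfile.mkdtemp(), "my_corpus.txt")
write_one_document_per_line(docs, path)
loaded = read_one_document_per_line(path)
print(loaded)
```

The resulting path can then be assigned to `text_file` in place of the sample file.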

In [10]:
!pip install wget



In [16]:
!python -m wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt


Saved under dbpedia_sample_abstract_20k_unprep.txt


In [19]:
%%bash
head -n 1 dbpedia_sample_abstract_20k_unprep.txt

The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry


In [21]:
%%bash
head -n 3 dbpedia_sample_abstract_20k_unprep.txt

The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry
Monte Zucker (died March 15, 2007) was an American photographer. He specialized in wedding photography, entering it as a profession in 1947. In the 1970s he operated a studio in Silver Spring, Maryland. Later he lived in Florida. He was Brides Magazine's Wedding Photographer of the Year for 1990 and
Henry Howard, 13th Earl of Suffolk, 6th Earl of Berkshire (8 August 1779 – 10 August 1779) was a British peer, the son of Henry Howard, 12th Earl of Suffolk. His father died on 7 March 1779, leaving behind his pregnant widow. The Earldom of Suffolk became dormant until she


In [22]:
%%bash
head -n 5 dbpedia_sample_abstract_20k_unprep.txt

The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry
Monte Zucker (died March 15, 2007) was an American photographer. He specialized in wedding photography, entering it as a profession in 1947. In the 1970s he operated a studio in Silver Spring, Maryland. Later he lived in Florida. He was Brides Magazine's Wedding Photographer of the Year for 1990 and
Henry Howard, 13th Earl of Suffolk, 6th Earl of Berkshire (8 August 1779 – 10 August 1779) was a British peer, the son of Henry Howard, 12th Earl of Suffolk. His father died on 7 March 1779, leaving behind his pregnant widow. The Earldom of Suffolk became dormant until she
Marinko Matošević (Croatian pronunciation: [marǐːŋko matǒʃeʋitɕ]; born 8 August 1985) is an

In [23]:
text_file = "dbpedia_sample_abstract_20k_unprep.txt" # EDIT THIS WITH THE FILE YOU UPLOAD

# Importing What We Need

In [24]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk

## Preprocessing

Why do we use preprocessed text here? Building the bag of words requires text without special characters, and it is better to keep only the frequent words than to use every word.

In [25]:
nltk.download('stopwords')

documents = [line.strip() for line in open(text_file, encoding="utf-8").readlines()]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jikim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
# documents after normalization (preprocessing)
preprocessed_documents[:2]

['mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry',
 'died march american photographer specialized photography operated studio silver spring maryland later lived florida magazine photographer year']
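
The output above shows lowercased text with punctuation, stopwords, and infrequent words removed. The sketch below illustrates that kind of transformation in plain Python. It is not the implementation of `WhiteSpacePreprocessing`; the function name `simple_preprocess`, the tiny stopword set, and the example sentences are assumptions made for illustration.

```python
import string
from collections import Counter


def simple_preprocess(documents, stopwords, vocabulary_size):
    """Lowercase, strip punctuation, drop stopwords, keep only frequent words."""
    # map every punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokenized = []
    for doc in documents:
        tokens = doc.lower().translate(table).split()
        # keep alphabetic tokens that are not stopwords
        tokenized.append([t for t in tokens if t.isalpha() and t not in stopwords])

    # the vocabulary is the `vocabulary_size` most frequent remaining words
    counts = Counter(t for tokens in tokenized for t in tokens)
    vocab = {word for word, _ in counts.most_common(vocabulary_size)}
    cleaned = [" ".join(t for t in tokens if t in vocab) for tokens in tokenized]
    return cleaned, sorted(vocab)


# Hypothetical two-document corpus and stopword list.
docs = [
    "The highway connects the city to the highway network.",
    "A new highway, and a new city park!",
]
stopwords = {"the", "a", "to", "and"}
cleaned, vocab = simple_preprocess(docs, stopwords, vocabulary_size=3)
print(cleaned)  # ['highway city highway', 'new highway new city']
print(vocab)    # ['city', 'highway', 'new']
```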

In [27]:
# documents before preprocessing (the original text from `documents`)
unpreprocessed_corpus[:2]

['The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry',
 "Monte Zucker (died March 15, 2007) was an American photographer. He specialized in wedding photography, entering it as a profession in 1947. In the 1970s he operated a studio in Silver Spring, Maryland. Later he lived in Florida. He was Brides Magazine's Wedding Photographer of the Year for 1990 and"]

In [28]:
# size of the vocabulary
print('Size of the vocabulary used for the bag of words:', len(vocab))

Size of the vocabulary used for the bag of words: 2000


In [29]:
vocab[:5]

['da', 'across', 'products', 'sports', 'township']

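The unpreprocessed documents must not be thrown away, because they are the input used to obtain the contextualized document embeddings.

We pass both the unpreprocessed and the preprocessed documents to a TopicModelDataPreparation object. This object builds the bag of words and computes the contextualized embeddings of the documents. The pretrained BERT-style model used here is paraphrase-distilroberta-base-v1.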
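
To make the bag-of-words half of this step concrete, here is a plain-Python sketch of the data structure: one count vector per document, with one column per vocabulary word. The function `bag_of_words` and the toy inputs are hypothetical. The library builds its own representation internally, and this is not its implementation.

```python
def bag_of_words(preprocessed_documents, vocab):
    """Return one count vector per document, with columns ordered like `vocab`."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in preprocessed_documents:
        row = [0] * len(vocab)
        for token in doc.split():
            if token in index:  # words outside the vocabulary are ignored
                row[index[token]] += 1
        matrix.append(row)
    return matrix


# Hypothetical toy vocabulary and preprocessed documents.
toy_vocab = ["city", "highway", "new"]
toy_docs = ["highway city highway", "new highway new city"]
bow = bag_of_words(toy_docs, toy_vocab)
print(bow)  # [[1, 2, 0], [1, 1, 2]]
```

The contextual half is a separate fixed-length embedding vector per document, computed from the unpreprocessed text.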

In [30]:
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)


In [31]:
tp.vocab[:10]

['abbreviated',
 'academic',
 'academy',
 'access',
 'according',
 'achieved',
 'acquired',
 'acre',
 'acres',
 'across']

In [32]:
len(tp.vocab)

2000

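The two cells above print the first 10 words of tp.vocab and its size. tp.vocab and the earlier vocab contain the same words when compared as sets; only their order differs.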

In [33]:
set(vocab) == set(tp.vocab)

True

## Training the Combined TM
Now we train the topic model. The number of topics (n_components) is a hyperparameter, and here we set it to 50.

In [34]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=50, num_epochs=20)
ctm.fit(training_dataset) # run the model

Epoch: [20/20]	 Seen Samples: [400000/400000]	Train Loss: 135.41401330566407	Time: 0:00:19.770967: : 20it [06:37, 19.88s/it]
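
The numbers in the training log above are consistent with every document being visited once per epoch. A quick check, assuming the training set holds 20,000 documents (inferred from the `20k` in the file name, not stated explicitly in the log):

```python
num_documents = 20_000     # assumption: inferred from the "20k" in the file name
num_epochs = 20            # from the CombinedTM(...) call above
seconds_per_epoch = 19.88  # from the progress bar in the log

# total samples seen over all epochs
seen_samples = num_documents * num_epochs
print(seen_samples)  # 400000

# total training time as (minutes, seconds)
minutes, seconds = divmod(int(num_epochs * seconds_per_epoch), 60)
print(minutes, seconds)  # 6 37
```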


# Topics

After training, use the method below to see the topics the model found.

```
get_topic_lists
```
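This method takes a parameter that specifies how many words to show for each topic. Here we choose 5. The topics below were obtained from Wikipedia text (general subject matter). Because we trained on English documents, the words in each topic are English as well.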
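
Conceptually, listing the top words of a topic means ranking the vocabulary by that topic's word weights and keeping the first few. The sketch below illustrates the idea with a hypothetical weight matrix. The function `top_words_per_topic` and its inputs are made up for illustration, and this is not how `get_topic_lists` is implemented.

```python
def top_words_per_topic(topic_word_weights, vocab, k):
    """For each topic, return the k vocabulary words with the largest weights."""
    topics = []
    for weights in topic_word_weights:
        # indices of the vocabulary sorted by descending weight
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:k]])
    return topics


# Hypothetical vocabulary and topic-word weights (one row per topic).
toy_vocab = ["film", "river", "directed", "park", "stars"]
toy_weights = [
    [0.9, 0.1, 0.7, 0.0, 0.5],  # a film-like topic
    [0.0, 0.8, 0.1, 0.6, 0.2],  # a geography-like topic
]
toy_topics = top_words_per_topic(toy_weights, toy_vocab, k=2)
print(toy_topics)  # [['film', 'directed'], ['river', 'park']]
```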

In [35]:
ctm.get_topic_lists(5)

[['film', 'directed', 'written', 'produced', 'stars'],
 ['mi', 'district', 'approximately', 'kilometres', 'south'],
 ['game', 'series', 'developed', 'video', 'games'],
 ['area', 'municipality', 'town', 'region', 'located'],
 ['team', 'season', 'division', 'head', 'games'],
 ['church', 'built', 'st', 'class', 'cathedral'],
 ['book', 'published', 'novel', 'written', 'fiction'],
 ['played', 'first', 'born', 'made', 'english'],
 ['known', 'american', 'born', 'best', 'york'],
 ['school', 'high', 'located', 'college', 'public'],
 ['played', 'born', 'former', 'professional', 'league'],
 ['series', 'television', 'show', 'released', 'aired'],
 ['world', 'european', 'national', 'championship', 'competed'],
 ['state', 'river', 'park', 'road', 'highway'],
 ['km', 'mi', 'north', 'west', 'within'],
 ['company', 'founded', 'based', 'group', 'headquartered'],
 ['system', 'software', 'developed', 'company', 'systems'],
 ['born', 'world', 'summer', 'olympics', 'silver'],
 ['district', 'also', 'populatio

# Visualization

To visualize our topics we use PyLDAvis.

In [37]:
import pyLDAvis as vis

lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

ctm_pd = vis.prepare(**lda_vis_data)
vis.display(ctm_pd)

Sampling: [10/10]: : 10it [02:52, 17.27s/it]


Now we can take any document and check which topic was assigned to it. For example, let's predict the topic of the first preprocessed document, which is about a peninsula.

In [38]:
topics_predictions = ctm.get_thetas(training_dataset, n_samples=5) # get all the topic predictions

Sampling: [5/5]: : 5it [01:25, 17.06s/it]


In [39]:
# the first preprocessed document
print(preprocessed_documents[0])

mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry


In [40]:
import numpy as np
topic_number = np.argmax(topics_predictions[0]) # get the topic id of the first document

In [41]:
ctm.get_topic_lists(5)[topic_number] #and the topic should be about natural location related things

['state', 'river', 'park', 'road', 'highway']
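
Taking the argmax keeps only the single most probable topic. To see a few runner-up topics as well, one could rank a document's topic distribution, as in the sketch below. The function `top_topics` and the toy distribution are hypothetical.

```python
def top_topics(theta, k=3):
    """Return (topic_id, probability) pairs for the k most probable topics."""
    ranked = sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)
    return [(i, theta[i]) for i in ranked[:k]]


# Hypothetical distribution of one document over 4 topics.
toy_theta = [0.05, 0.10, 0.60, 0.25]
best = top_topics(toy_theta, k=2)
print(best)  # [(2, 0.6), (3, 0.25)]
```

With the objects from this notebook, this could be called as `top_topics(list(topics_predictions[0]))`, assuming `get_thetas` returns one row of topic probabilities per document, as the indexing in the cell above suggests.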

# Saving the Model for Later Use

In [42]:
ctm.save(models_dir="./")



In [43]:
# let's remove the trained model
del ctm

In [44]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, num_epochs=100, n_components=50)

ctm.load(
    "contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99",
    epoch=19,
)
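
The directory name passed to `load` is long and easy to mistype. A possible alternative is to look it up on disk instead of hard-coding it. The sketch below assumes that saved models live in directories whose names start with `contextualized_topic_model_nc_`, as the name above suggests. The helper `find_saved_model_dirs` is hypothetical and not part of the library, and a temporary directory stands in for the real `models_dir`.

```python
import glob
import os
import tempfile


def find_saved_model_dirs(models_dir):
    """Return saved-model directories under models_dir, newest first."""
    # assumption: saved-model directory names share this prefix
    pattern = os.path.join(glob.escape(models_dir), "contextualized_topic_model_nc_*")
    dirs = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    return sorted(dirs, key=os.path.getmtime, reverse=True)


# Demonstration with a temporary directory standing in for models_dir.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "contextualized_topic_model_nc_50_example"))
os.makedirs(os.path.join(demo_dir, "unrelated_directory"))
found = find_saved_model_dirs(demo_dir)
print([os.path.basename(p) for p in found])
```

In this notebook the result could then be used as `ctm.load(find_saved_model_dirs("./")[0], epoch=19)`.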



In [45]:
ctm.get_topic_lists(5)

[['film', 'directed', 'written', 'produced', 'stars'],
 ['mi', 'district', 'approximately', 'kilometres', 'south'],
 ['game', 'series', 'developed', 'video', 'games'],
 ['area', 'municipality', 'town', 'region', 'located'],
 ['team', 'season', 'division', 'head', 'games'],
 ['church', 'built', 'st', 'class', 'cathedral'],
 ['book', 'published', 'novel', 'written', 'fiction'],
 ['played', 'first', 'born', 'made', 'english'],
 ['known', 'american', 'born', 'best', 'york'],
 ['school', 'high', 'located', 'college', 'public'],
 ['played', 'born', 'former', 'professional', 'league'],
 ['series', 'television', 'show', 'released', 'aired'],
 ['world', 'european', 'national', 'championship', 'competed'],
 ['state', 'river', 'park', 'road', 'highway'],
 ['km', 'mi', 'north', 'west', 'within'],
 ['company', 'founded', 'based', 'group', 'headquartered'],
 ['system', 'software', 'developed', 'company', 'systems'],
 ['born', 'world', 'summer', 'olympics', 'silver'],
 ['district', 'also', 'populatio

Reference: https://github.com/MilaNLProc/contextualized-topic-models