## Readme

このノートブックでは、[BERTopic](https://maartengr.github.io/BERTopic/index.html)を用いて文章同士の関係性の可視化を簡易的に行う。


In [4]:
# !pip install -Uqq bertopic

In [59]:
import pandas as pd
from umap import UMAP
from bertopic import BERTopic
from IPython.display import display
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
# データの取得
docs_dict = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Pandas DataFrameへ変換
df_docs = pd.DataFrame({
    "data": docs_dict["data"],
    "target": docs_dict["target"]
})
df_docs["num words"] = [len(data.split(" ")) for data in df_docs["data"]]

print("Num words stats")
print(df_docs["num words"].mean(), df_docs["num words"].median())

# 2件だけデータを見てみる
print("## DataExample")
print("```")
for i in [0, 1]:
    print(df_docs.loc[i, "data"])
    print("---")
print("```")

# 計算量的な問題により、1000件のみに絞る
df_docs = df_docs.sample(1000, random_state=0)

# BERTopicによるクラスタリング
# 結果を固定したい場合は umapのrandom_stateを0にして固定
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=0)

# 各Topicの代表的なワードを絞る際にStopwordsを除く
vectorizer_model = CountVectorizer(stop_words="english")

# BERTopicモデルの規定
topic_model = BERTopic(
    representation_model="thenlper/gte-base", # 日本語ならば "intfloat/multilingual-e5-base" などが候補になる
    umap_model=umap_model,
    vectorizer_model=vectorizer_model
)

# topicやprobabilityの計算
topics, probs = topic_model.fit_transform(df_docs["data"])

df_docs["topic_predicted"] = topics
display(df_docs)

# 各Topicの概要
display(topic_model.get_topic_info())

Num words stats
203.21134458240476 84.0
## DataExample
```


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


---
My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local Bus

  - Orchid Farenheit 1280

  - ATI Graphi

Unnamed: 0,data,target,num words,topic_predicted
14736,Uh... slight clarification: That should be ...,2,21,11
15780,I am trying to obtain a HI-FI copy of Guns N' ...,6,39,0
7127,\n\nAh yes. California. Did the San Francisc...,10,61,2
2778,Can someone tell me where to find 120volt 3 wa...,12,153,0
14477,\nHmm...has anyone of us computer geeks (me in...,2,206,0
...,...,...,...,...
17059,"\nThose are pretty typical, I believe.\n",2,6,-1
16204,\nPat> In article <SHAFER.93Apr6094402@rigel.d...,14,115,3
14033,"\nIf no-one looks at the results, or acknowled...",19,164,-1
8112,For Sale:\n\nInformix WingZ Graphic Spreadshee...,6,28,11


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,198,-1_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_stephanopou...,"[maxaxaxaxaxaxaxaxaxaxaxaxaxaxax, stephanopoul...",[The following was posted and no doubt retyped...
1,0,297,0_use_openwindows_windows_file,"[use, openwindows, windows, file, using, run, ...","[\n\nTime for a new discussion, maybe ? I aske..."
2,1,92,1_god_jesus_say_sin,"[god, jesus, say, sin, people, believe, bible,...",[\nI see what you are getting at (or at least ...
3,2,89,2_game_gm_play_team,"[game, gm, play, team, period, players, games,...",[Philadelphia 1 1 2 1--5\n...
4,3,79,3_car_bike_fuel_new,"[car, bike, fuel, new, like, jj, engine, bmw, ...","[\nIf the clutch is in, then a large chunk of ..."
5,4,48,4_cancer_patients_treatment_breast,"[cancer, patients, treatment, breast, pages, a...",[KS> From: keith@actrix.gen.nz (Keith Stewart)...
6,5,43,5_hear_seconded_hovig_deletion,"[hear, seconded, hovig, deletion, motion, say,...","[, \n\n\n[ ... ]\n\n\nTo which I say:\nHear, h..."
7,6,37,6_armenian_armenians_people_know,"[armenian, armenians, people, know, azerbaijan...","[\n\nSo, did the Jews kill the Germans? \nYou ..."
8,7,34,7_space_ether_balloon_moon,"[space, ether, balloon, moon, dr, like, time, ...","[Forwarded from Neal Ausman, Galileo Mission D..."
9,8,33,8_key_encryption_chip_security,"[key, encryption, chip, security, bits, keys, ...","[[An article from comp.org.eff.news, EFFector ..."


In [58]:
# 各dataの可視化
topic_model.visualize_documents(df_docs["data"].values)

Batches:   0%|          | 0/32 [00:00<?, ?it/s]