# Top2Vec

* [Top2Vec Github](https://github.com/ddangelov/Top2Vec)

## Tutorial

* [The Best Way to do Topic Modeling in Python - Top2Vec Introduction and Tutorial](https://www.youtube.com/watch?v=bEaxKSQ4Av8&list=PL2VXyKi-KpYt4Bb2dDZZoBLG4SQkrAz9g)

In [23]:
import os
import numpy as np
import pandas as pd
from top2vec import Top2Vec

In [2]:
print(Top2Vec.__doc__)


    Top2Vec

    Creates jointly embedded topic, document and word vectors.


    Parameters
    ----------
    documents: List of str
        Input corpus, should be a list of strings.

    min_count: int (Optional, default 50)
        Ignores all words with total frequency lower than this. For smaller
        corpora a smaller min_count will be necessary.

    topic_merge_delta: float (default 0.1)
        Merges topic vectors which have a cosine distance smaller than
        topic_merge_delta using dbscan. The epsilon parameter of dbscan is
        set to the topic_merge_delta.

    ngram_vocab: bool (Optional, default False)
        Add phrases to topic descriptions.

        Uses gensim phrases to find common phrases in the corpus and adds them
        to the vocabulary.

        For more information visit:
        https://radimrehurek.com/gensim/models/phrases.html

    ngram_vocab_args: dict (Optional, default None)
        Pass custom arguments to gensim phrases.

        For 

In [3]:
np.set_printoptions(threshold=np.inf)
np.set_printoptions(linewidth=1024)

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option("max_colwidth", None)
pd.set_option("max_seq_items", None)

# Terminology

* topic_num: ID of a topic. If 12 topics are identified, the IDs run from 0 to 11.
* topic_nums: List of topic IDs.

# Data

* [Kaggle News Articles Categorization data](https://www.kaggle.com/competitions/learn-ai-bbc/data)

In [4]:
bbc = pd.read_csv("/Volumes/SSD/kaggle/bbc/BBCNewsTrain.csv")
bbc[:3]

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defending former worldcom chief bernie ebbers against a battery of fraud charges have called a company whistleblower as their first witness. cynthia cooper worldcom s ex-head of internal accounting alerted directors to irregular accounting practices at the us telecoms giant in 2002. her warnings led to the collapse of the firm following the discovery of an $11bn (£5.7bn) accounting fraud. mr ebbers has pleaded not guilty to charges of fraud and conspiracy. prosecution lawyers have argued that mr ebbers orchestrated a series of accounting tricks at worldcom ordering employees to hide expenses and inflate revenues to meet wall street earnings estimates. but ms cooper who now runs her own consulting business told a jury in new york on wednesday that external auditors arthur andersen had approved worldcom s accounting in early 2001 and 2002. she said andersen had given a green light to the procedures and practices used by worldcom. mr ebber s lawyers have said he was unaware of the fraud arguing that auditors did not alert him to any problems. ms cooper also said that during shareholder meetings mr ebbers often passed over technical questions to the company s finance chief giving only brief answers himself. the prosecution s star witness former worldcom financial chief scott sullivan has said that mr ebbers ordered accounting adjustments at the firm telling him to hit our books . however ms cooper said mr sullivan had not mentioned anything uncomfortable about worldcom s accounting during a 2001 audit committee meeting. mr ebbers could face a jail sentence of 85 years if convicted of all the charges he is facing. worldcom emerged from bankruptcy protection in 2004 and is now known as mci. last week mci agreed to a buyout by verizon communications in a deal valued at $6.75bn.,business
1,154,german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy. munich-based research institute ifo said that its confidence index fell to 95.5 in february from 97.5 in january its first decline in three months. the study found that the outlook in both the manufacturing and retail sectors had worsened. observers had been hoping that a more confident business sector would signal that economic activity was picking up. we re surprised that the ifo index has taken such a knock said dz bank economist bernd weidensteiner. the main reason is probably that the domestic economy is still weak particularly in the retail trade. economy and labour minister wolfgang clement called the dip in february s ifo confidence figure a very mild decline . he said that despite the retreat the index remained at a relatively high level and that he expected a modest economic upswing to continue. germany s economy grew 1.6% last year after shrinking in 2003. however the economy contracted by 0.2% during the last three months of 2004 mainly due to the reluctance of consumers to spend. latest indications are that growth is still proving elusive and ifo president hans-werner sinn said any improvement in german domestic demand was sluggish. exports had kept things going during the first half of 2004 but demand for exports was then hit as the value of the euro hit record levels making german products less competitive overseas. on top of that the unemployment rate has been stuck at close to 10% and manufacturing firms including daimlerchrysler siemens and volkswagen have been negotiating with unions over cost cutting measures. analysts said that the ifo figures and germany s continuing problems may delay an interest rate rise by the european central bank. 
eurozone interest rates are at 2% but comments from senior officials have recently focused on the threat of inflation prompting fears that interest rates may rise.,business
2,1101,bbc poll indicates economic gloom citizens in a majority of nations surveyed in a bbc world service poll believe the world economy is worsening. most respondents also said their national economy was getting worse. but when asked about their own family s financial outlook a majority in 14 countries said they were positive about the future. almost 23 000 people in 22 countries were questioned for the poll which was mostly conducted before the asian tsunami disaster. the poll found that a majority or plurality of people in 13 countries believed the economy was going downhill compared with respondents in nine countries who believed it was improving. those surveyed in three countries were split. in percentage terms an average of 44% of respondents in each country said the world economy was getting worse compared to 34% who said it was improving. similarly 48% were pessimistic about their national economy while 41% were optimistic. and 47% saw their family s economic conditions improving as against 36% who said they were getting worse. the poll of 22 953 people was conducted by the international polling firm globescan together with the program on international policy attitudes (pipa) at the university of maryland. while the world economy has picked up from difficult times just a few years ago people seem to not have fully absorbed this development though they are personally experiencing its effects said pipa director steven kull. people around the world are saying: i m ok but the world isn t . there may be a perception that war terrorism and religious and political divisions are making the world a worse place even though that has not so far been reflected in global economic performance says the bbc s elizabeth blunt. the countries where people were most optimistic both for the world and for their own families were two fast-growing developing economies china and india followed by indonesia. 
china has seen two decades of blistering economic growth which has led to wealth creation on a huge scale says the bbc s louisa lim in beijing. but the results also may reflect the untrammelled confidence of people who are subject to endless government propaganda about their country s rosy economic future our correspondent says. south korea was the most pessimistic while respondents in italy and mexico were also quite gloomy. the bbc s david willey in rome says one reason for that result is the changeover from the lira to the euro in 2001 which is widely viewed as the biggest reason why their wages and salaries are worth less than they used to be. the philippines was among the most upbeat countries on prospects for respondents families but one of the most pessimistic about the world economy. pipa conducted the poll from 15 november 2004 to 3 january 2005 across 22 countries in face-to-face or telephone interviews. the interviews took place between 15 november 2004 and 5 january 2005. the margin of error is between 2.5 and 4 points depending on the country. in eight of the countries the sample was limited to major metropolitan areas.,business


# Model

In [5]:
# Does not work
# model = Top2Vec(bbc['Text'].tolist(), embedding_model='universal-sentence-encoder', speed="learn", workers=8)
model = Top2Vec(
    documents=bbc['Text'].tolist(),
    document_ids=bbc.index.tolist(),
    embedding_model='distiluse-base-multilingual-cased'
)

2023-03-19 15:40:58,574 - top2vec - INFO - Pre-processing documents for training
2023-03-19 15:40:59,268 - top2vec - INFO - Downloading distiluse-base-multilingual-cased model
2023-03-19 15:41:00,858 - top2vec - INFO - Creating joint document/word embedding
2023-03-19 15:41:35,498 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-19 15:41:39,627 - top2vec - INFO - Finding dense areas of documents
2023-03-19 15:41:39,640 - top2vec - INFO - Finding topics


---
# Topic

A **Topic (Topic Vector)** is the mean of a cluster of document vectors identified by HDBSCAN. It is a **thought vector** that is characterized by the context words (topic words) nearby, NOT by a single concrete word or sentence. It would be desirable to have one categorical keyword that represents each topic, e.g. **Sport** for the 2nd topic, but Top2Vec does not provide such distillation.

<img src="./image/top2vec_topic_vector.png" align="left" width=600/>
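
As a rough illustration of the idea above, the sketch below averages the document vectors in each cluster to get one topic vector per cluster. The vectors and labels are toy values, not Top2Vec output, and `topic_vectors_from_clusters` is a hypothetical helper written for this note. It assumes HDBSCAN-style labels, where `-1` marks noise points that belong to no cluster.

```python
from collections import defaultdict


def topic_vectors_from_clusters(doc_vectors, labels):
    """Return {label: mean vector} per cluster; label -1 (noise) is skipped."""
    grouped = defaultdict(list)
    for vector, label in zip(doc_vectors, labels):
        if label != -1:
            grouped[label].append(vector)
    # Average each dimension (column) over the vectors of a cluster.
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


# Toy 2-D "document vectors" and cluster labels (assumed values for illustration).
doc_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [9.0, 9.0]]
labels = [0, 0, 1, 1, -1]

topic_vectors = topic_vectors_from_clusters(doc_vectors, labels)
print(topic_vectors)
```

Each resulting vector lives in the same space as the documents and words, which is why a topic can be described by whichever word vectors lie closest to it.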



## Number of topics identified in the documents


In [6]:
topic_sizes, topic_ids = model.get_topic_sizes()
print(f"number of topics identified:[{len(topic_ids)}]")

number of topics identified:[4]


## Topic words 

List the context words that identify each topic.

In [7]:
topics_words, word_scores, topic_ids = model.get_topics()
for topic_id, words, scores in zip(topic_ids, topics_words, word_scores):
    print("-" * 80)
    print(f"Topic ID:{topic_id}")
    print("-" * 80)
    for word, score in zip(words[:10], scores[:10]):
        print(f"{word:20} {score}")

--------------------------------------------------------------------------------
Topic ID:0
--------------------------------------------------------------------------------
parliament           0.10377583652734756
politicians          0.10281675308942795
britain              0.10191775858402252
election             0.09515437483787537
elections            0.0923602283000946
no                   0.08872390538454056
non                  0.0843275785446167
voters               0.08393856137990952
british              0.08337553590536118
bbc                  0.08136938512325287
--------------------------------------------------------------------------------
Topic ID:1
--------------------------------------------------------------------------------
rugby                0.22808963060379028
mourinho             0.2248014509677887
football             0.21560746431350708
britain              0.17336198687553406
coach                0.15696346759796143
england              0.1561509370803833
to

## Topics of a document

Find the **Topic Words** that identify the **Topic** that is closest to the query document.


In [9]:
document_id = 1
query = bbc.iloc[document_id]['Text']
query

'german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy.  munich-based research institute ifo said that its confidence index fell to 95.5 in february from 97.5 in january  its first decline in three months. the study found that the outlook in both the manufacturing and retail sectors had worsened. observers had been hoping that a more confident business sector would signal that economic activity was picking up.   we re surprised that the ifo index has taken such a knock   said dz bank economist bernd weidensteiner.  the main reason is probably that the domestic economy is still weak  particularly in the retail trade.  economy and labour minister wolfgang clement called the dip in february s ifo confidence figure  a very mild decline . he said that despite the retreat  the index remained at a relatively high level and that he expected  a modest economic upswing  to continue.  germany s economy grew 1.

In [10]:
topic_nums, topic_score, topics_words, word_scores = model.get_documents_topics([document_id], reduced=False)
print(f"topic_nums:{topic_nums}, topic_score: {topic_score}")
for word, score in zip(topics_words[0][:10], word_scores[0][:10]):
    print(f"{word:20}: {score}")

topic_nums:[0], topic_score: [0.3969033]
parliament          : 0.10377583652734756
politicians         : 0.10281675308942795
britain             : 0.10191775858402252
election            : 0.09515437483787537
elections           : 0.0923602283000946
no                  : 0.08872390538454056
non                 : 0.0843275785446167
voters              : 0.08393856137990952
british             : 0.08337553590536118
bbc                 : 0.08136938512325287


To use a text query instead of a ```document_id```, call ```query_topics```. According to the API documentation, ```topic_nums``` holds the unique ID of each returned topic.

* [query_topics(query, num_topics, reduced=False, tokenizer=None)](https://top2vec.readthedocs.io/en/latest/api.html?highlight=api#top2vec.Top2Vec.Top2Vec.query_topics)

> ```topic_nums``` (array of int, shape(num_topics)) – The **unique number** of every topic will be returned.
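
To make the relation between a query and the returned ```topic_nums``` more concrete, here is a minimal sketch of ranking topics against a query vector. It assumes the topic score is the cosine similarity between the query embedding and each topic vector. The vectors are toy values, and `rank_topics` is a hypothetical helper, not part of the Top2Vec API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_topics(query_vector, topic_vectors, num_topics=1):
    """Return (topic_nums, topic_scores) for the num_topics most similar topics."""
    scored = sorted(
        (
            (cosine_similarity(query_vector, vector), topic_num)
            for topic_num, vector in enumerate(topic_vectors)
        ),
        reverse=True,
    )[:num_topics]
    return [num for _, num in scored], [score for score, _ in scored]


# Toy topic vectors; the list position plays the role of the topic ID (assumption).
topic_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [2.0, 0.1]

topic_nums, topic_scores = rank_topics(query_vector, topic_vectors, num_topics=2)
print(topic_nums, topic_scores)
```

In this sketch a topic ID is simply the position of the topic vector in the list, which matches the terminology above (IDs from 0 to the number of topics minus one).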



In [11]:
topics_words, word_scores, topic_score, topic_nums = model.query_topics(query=query, num_topics=1)
print(f"topic_nums:{topic_nums}, topic_score: {topic_score}")
for word, score in zip(topics_words[0][:10], word_scores[0][:10]):
    print(f"{word:20}: {score}")

topic_nums:[0], topic_score: [0.36120173]
parliament          : 0.10377583652734756
politicians         : 0.10281675308942795
britain             : 0.10191775858402252
election            : 0.09515437483787537
elections           : 0.0923602283000946
no                  : 0.08872390538454056
non                 : 0.0843275785446167
voters              : 0.08393856137990952
british             : 0.08337553590536118
bbc                 : 0.08136938512325287


## Documents related to a topic

Find documents close to a topic.

* [search_documents_by_topic(topic_num, num_docs, return_documents=True, reduced=False)](https://top2vec.readthedocs.io/en/latest/api.html?highlight=Top2Vec#top2vec.Top2Vec.Top2Vec.search_documents_by_topic)

In [12]:
topic_id = 0
documents, scores, ids = model.search_documents_by_topic(
    topic_num=topic_id, 
    num_docs=3, 
    return_documents=True, 
    reduced=False
)
# Avoid shadowing the built-in id() and drop the unused index
for document, score, doc_id in zip(documents, scores, ids):
    print("-" * 80)
    print(f"document_id:{doc_id:5} score:{score}")
    print(" ".join(document.split()[:100]))

--------------------------------------------------------------------------------
document_id:  658 score:0.6199405193328857
pre-poll clash on tax and spend labour and the tories have clashed over tax and spending plans as the row over gordon brown s budget turned into a full scale pre-election battle. tony blair claimed a tory government would cut £35bn from public services hitting schools hospitals and police. tory chairman liam fox accused labour of at best misrepresentation at worst a downright lie and said the smear tactics were a sign of desperation. the lib dems accused mr brown of ducking the issue of council tax rises. appearing together at a labour poster launch the prime minister hailed his
--------------------------------------------------------------------------------
document_id:  128 score:0.6122792959213257
howard and blair tax pledge clash tony blair has said voters will have to wait for labour s manifesto to see if the party has plans to increase tax. the premier was r

---
# Similarity Search

## Similar documents



In [14]:
query = bbc.iloc[document_id]['Text']
documents, scores, ids = model.query_documents(query=query, num_docs=5)
for index, doc in [
    (_id, " ".join(documents[_i].split()[:15]))    # First 15 words only
    for _i, _id in enumerate(ids)
    if _id != document_id                          # Remove the query itself
]:
    print(f"{index}: {doc}")

57: german growth goes into reverse germany s economy shrank 0.2% in the last three months
1155: imf cuts german growth estimate the international monetary fund is to cut its 2005 growth
422: economy strong in election year uk businesses are set to prosper during the next few
1458: economy strong in election year uk businesses are set to prosper during the next few


In [15]:
documents, scores, ids = model.search_documents_by_documents(
    doc_ids=[document_id],
    num_docs=5
)
for index, doc in [
    (_id, " ".join(documents[_i].split()[:15]))    # First 15 words only
    for _i, _id in enumerate(ids)
    if _id != document_id                          # Remove the query itself
]:
    print(f"{index}: {doc}")

57: german growth goes into reverse germany s economy shrank 0.2% in the last three months
570: slowdown hits us factory growth us industrial production increased for the 21st month in a
1155: imf cuts german growth estimate the international monetary fund is to cut its 2005 growth
422: economy strong in election year uk businesses are set to prosper during the next few
1458: economy strong in election year uk businesses are set to prosper during the next few


## Similar words

In [17]:
model.similar_words(keywords=["german"], num_words=5)

(array(['deutsche', 'germany', 'england', 'english', 'japan'], dtype='<U8'),
 array([0.97359509, 0.95568295, 0.73594704, 0.6911235 , 0.65781959]))

---

# Custom document_id

By default, Top2Vec assigns sequential IDs starting from 0. Use custom document IDs (string or int) to identify the documents.

In [18]:
import uuid
from sklearn.datasets import fetch_20newsgroups

## 20 Newsgroups dataset

In [19]:
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

In [20]:
dir(newsgroups)

['DESCR', 'data', 'filenames', 'target', 'target_names']

### News Text

In [21]:
" ".join(newsgroups.data[1].split('\n')).strip()

'My brother is in the market for a high-performance video card that supports VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:    - Diamond Stealth Pro Local Bus    - Orchid Farenheit 1280    - ATI Graphics Ultra Pro    - Any other high-performance VLB card   Please post or email.  Thank you!    - Matt'

### Filename

In [24]:
os.path.basename(newsgroups.filenames[1])

'60215'

### Target Category

In [25]:
newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## UUID as custom document_id

Use a UUID string as the ID of each document.

In [26]:
document_ids = [
    str(uuid.uuid4()) for _ in range(len(newsgroups.data))
]
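
Because `uuid.uuid4()` is random, the IDs above change on every run. If reproducible IDs are wanted, one option is to derive them from a stable key per document with `uuid.uuid5`. The sketch below assumes such keys exist and are unique. The keys shown are made up for illustration, and `stable_document_ids` is a hypothetical helper.

```python
import uuid


def stable_document_ids(keys):
    """Derive one UUID string per key; the same key always yields the same ID."""
    ids = [str(uuid.uuid5(uuid.NAMESPACE_URL, key)) for key in keys]
    if len(set(ids)) != len(ids):
        raise ValueError("keys must be unique to produce unique document IDs")
    return ids


# Hypothetical stable keys, e.g. a relative path per document (assumption).
keys = ["group-a/0001", "group-a/0002", "group-b/0001"]
stable_ids = stable_document_ids(keys)
print(stable_ids)
```

The resulting list could then be passed as `document_ids` in place of the random UUIDs, provided it has one entry per document.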

In [27]:
news = Top2Vec(
    documents=newsgroups.data,
    document_ids=document_ids,
    embedding_model='distiluse-base-multilingual-cased'
)

2023-03-19 15:42:37,998 - top2vec - INFO - Pre-processing documents for training
2023-03-19 15:42:42,349 - top2vec - INFO - Downloading distiluse-base-multilingual-cased model
2023-03-19 15:42:43,661 - top2vec - INFO - Creating joint document/word embedding
2023-03-19 15:47:57,956 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-19 15:48:06,927 - top2vec - INFO - Finding dense areas of documents


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

2023-03-19 15:48:07,882 - top2vec - INFO - Finding topics
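
The huggingface/tokenizers warning in the log above suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal way to do that from the notebook, assuming the cell runs before the model is trained, is:

```python
import os

# Set this before the tokenizers library is first used (i.e. before training),
# so the fork made during training no longer triggers the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"true"` is the other option named in the warning. `"false"` is the conservative choice given the deadlock risk it mentions.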
