# Project: I Wanna Know

This project is an application for inspiration, in-depth reading, and exploration of popular topics. The plan is to start with 100 categories, each containing 100 carefully curated topics. For example, the psychology category would include topics such as Gestalt psychology, each covered by a concise 5-page summary of the essentials, with follow-up prompts potentially added later.

## Implementation:
A specialized model processes millions of lines of text from the internet and groups them into clusters through topic modeling. Each cluster is then analyzed to extract and develop detailed topics. The approach is inspired by the deep research methodology seen in platforms like Perplexity (have you tried it?) and aims to offer a unique, customizable format for engaging with content.
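The "cluster, then analyze each cluster" step can be sketched as a simple grouping of documents by their assigned cluster label. This is a minimal sketch, not the project's actual pipeline: `group_by_cluster` is a hypothetical helper, and it assumes the labels come from a topic model that marks outliers with `-1`, as BERTopic does in the notebook cells further down.

```python
from collections import defaultdict


def group_by_cluster(docs, labels, skip_outliers=True):
    """Group documents by cluster label (hypothetical helper).

    docs and labels are parallel sequences; label -1 is treated as
    "no cluster" and dropped when skip_outliers is True.
    """
    clusters = defaultdict(list)
    for doc, label in zip(docs, labels):
        if skip_outliers and label == -1:
            continue
        clusters[label].append(doc)
    return dict(clusters)
```

Each value in the returned dictionary is the list of documents that a later step could analyze to develop one detailed topic.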

## Running the project:

You will need conda (for example, installed through Miniconda) as your Python environment manager.
1. Install [Miniconda](https://docs.conda.io/projects/conda/en/stable/index.html)
2. Create and activate your environment:
    - `conda create -n bertopic_env python=3.9`
    - `conda activate bertopic_env`
    - `conda install jupyter`
3. Run Jupyter:
    - `jupyter notebook`
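Before opening the notebook, it can help to confirm that the activated environment matches the steps above. The following is a minimal sketch using only the standard library. `check_environment` is a hypothetical helper, and the module names `notebook` and `bertopic` are assumptions about what the installed packages expose for import.

```python
import importlib.util
import sys


def check_environment(required=("notebook", "bertopic"), min_python=(3, 9)):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"Python {found} is older than the expected minimum")
    for name in required:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not importable: {name}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running it inside `bertopic_env` before `%pip install bertopic` would be expected to report `bertopic` as missing, since that package is installed from the first notebook cell.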

## Known issues:
1. `conda activate` fails and asks you to run `conda init` first:
    - [https://community.anaconda.cloud/](https://community.anaconda.cloud/t/unable-to-activate-environment-prompted-to-run-conda-init-before-conda-activate-but-it-doesnt-work/68677)



In [None]:
%pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Using cached hdbscan-0.8.40-cp39-cp39-macosx_10_9_universal2.whl
Collecting numpy>=1.20.0 (from bertopic)
  Downloading numpy-2.0.2-cp39-cp39-macosx_14_0_arm64.whl.metadata (60 kB)
Collecting pandas>=1.1.5 (from bertopic)
  Downloading pandas-2.2.3-cp39-cp39-macosx_11_0_arm64.whl.metadata (89 kB)
Collecting plotly>=4.7.0 (from bertopic)
  Downloading plotly-6.0.0-py3-none-any.whl.metadata (5.6 kB)
Collecting scikit-learn>=0.22.2.post1 (from bertopic)
  Downloading scikit_learn-1.6.1-cp39-cp39-macosx_12_0_arm64.whl.metadata (31 kB)
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-3.4.1-py3-none-any.whl.metadata (10 kB)
Collecting tqdm>=4.41.1 (from bertopic)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata 

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

In [None]:
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
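Stripping headers, footers, and quotes can leave some posts nearly empty. The `get_topic_info()` output further down shows one topic whose representative documents are just "Huh?" and "Yep.". One possible mitigation is to drop very short documents before fitting. This is a sketch only: `drop_short_docs` is a hypothetical helper, and the 50-character threshold is an arbitrary assumption, not a value used by the project.

```python
def drop_short_docs(docs, min_chars=50):
    """Collapse whitespace and keep only documents of at least min_chars characters."""
    cleaned = []
    for doc in docs:
        text = " ".join(doc.split())  # collapse runs of whitespace and newlines
        if len(text) >= min_chars:
            cleaned.append(text)
    return cleaned
```

Usage would be `docs = drop_short_docs(docs)` before calling `fit_transform`. Note that this also normalizes whitespace inside the documents that are kept.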

In [None]:
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got f
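The repeated `huggingface/tokenizers` warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. The sketch below does that before the model is built. `build_topic_model` is a hypothetical helper. `nr_topics=100` is an assumption that mirrors the 100-entry target in the project description, and `min_topic_size=10` is simply BERTopic's default written out.

```python
import os

# Set this before tokenizers is first used, i.e. before fitting the model.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


def build_topic_model(nr_topics=100, min_topic_size=10):
    """Create a BERTopic model that merges topics down to nr_topics after fitting."""
    # Imported lazily so this cell can be defined before bertopic is installed.
    from bertopic import BERTopic

    return BERTopic(nr_topics=nr_topics, min_topic_size=min_topic_size, verbose=True)
```

With this in place, `topic_model = build_topic_model()` would replace the bare `BERTopic()` call. Whether 100 merged topics are a good fit for this corpus is something to verify against the results.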

In [None]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6640 | `-1_to_the_is_and` | `[to, the, is, and, of, for, you, in, it, that]` | `[It's like refusing 'God's kingdom come'.\n\nI...` |
| 1 | 0 | 1828 | `0_game_team_games_he` | `[game, team, games, he, players, season, hocke...` | `[\n\n"Deeply rooted rivalry?" Ahem, Jokerit ha...` |
| 2 | 1 | 580 | `1_key_clipper_chip_encryption` | `[key, clipper, chip, encryption, keys, escrow,...` | `[The following document summarizes the Clipper...` |
| 3 | 2 | 526 | `2_ites_cheek_yep_huh` | `[ites, cheek, yep, huh, ken, forget, why, lets...` | `[\nHuh?, \nYep.\n, ites:]` |
| 4 | 3 | 482 | `3_israel_israeli_jews_arab` | `[israel, israeli, jews, arab, jewish, arabs, p...` | `[From: Center for Policy Research <cpr>\nSubje...` |
| ... | ... | ... | ... | ... | ... |
| 210 | 209 | 10 | `209_table_tables_sale_2end` | `[table, tables, sale, 2end, 1coffee, foldable,...` | `[Moving Sale: Must sell before May 5:\nFuton: ...` |
| 211 | 210 | 10 | `210_dod_denizens_doom_motorcycle` | `[dod, denizens, doom, motorcycle, muck, recmot...` | `[This is probably a stupid question but as I a...` |
| 212 | 211 | 10 | `211_w4wg_network_lan_windows` | `[w4wg, network, lan, windows, wfw, workgroups,...` | `[This may be a simple question but:\n\nWe have...` |
| 213 | 212 | 10 | `212_media_publications_spiking_digging` | `[media, publications, spiking, digging, contri...` | `[\n\n\nIs this the same Monolithic, Centrally ...` |
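In the output above, topic `-1` is the outlier bucket and holds 6640 documents, more than any real topic. A quick way to track this while tuning is to compute the outlier share directly from the `topics` list returned by `fit_transform`. This is a minimal sketch, and `outlier_share` is a hypothetical helper.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic -1."""
    if len(topics) == 0:
        return 0.0
    return Counter(topics)[-1] / len(topics)
```

For example, `outlier_share(topics)` could be printed after each run. Recent BERTopic versions also provide a `reduce_outliers` method for reassigning outlier documents, which may be worth evaluating if the share stays high.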
