Applying the code to my own dataset #31
Hi @nassera2014! Unfortunately, I do not know much about Arabic, so I am basing my experience on other languages that can be tokenized on the whitespace character. I am assuming you have your documents in a text file.

**Preprocessing**

First things first, you need to preprocess the documents. We usually use our preprocessing pipeline (which removes stopwords and keeps only the 2000 most frequent tokens in the vocabulary), but you can also run the preprocessing in whatever way you prefer. However, I am not sure how words are tokenized in Arabic; we currently tokenize on whitespace, and Arabic might require more steps. If whitespace tokenization is OK for Arabic, this snippet should be a good starting point:

```python
from contextualized_topic_models.utils.preprocessing import SimplePreprocessing

documents = [line.strip() for line in open("arabic_documents.txt").readlines()]
sp = SimplePreprocessing(documents, stopwords_language="arabic")
preprocessed_documents, unpreprocessed_documents, vocab = sp.preprocess()
```

**Training**

Then you can use the rest of our code to run the topic model. For this step, you need:
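If you want to understand (or adapt) what the preprocessing step does before trying it on Arabic, the idea can be sketched in plain Python: whitespace tokenization, stopword removal, and keeping only the most frequent tokens. This is a minimal sketch of the general idea, not the library's actual implementation; the helper name `simple_preprocess` and the toy English documents are mine.

```python
from collections import Counter

def simple_preprocess(documents, stopwords, vocab_size=2000):
    # Tokenize each document on whitespace and drop stopwords.
    tokenized = [[t for t in doc.split() if t not in stopwords]
                 for doc in documents]
    # Keep only the `vocab_size` most frequent tokens as the vocabulary.
    counts = Counter(t for doc in tokenized for t in doc)
    vocab = {t for t, _ in counts.most_common(vocab_size)}
    # Rebuild each document from the tokens that survived.
    preprocessed = [" ".join(t for t in doc if t in vocab)
                    for doc in tokenized]
    return preprocessed, vocab

docs = ["the cat sat", "the dog sat down"]
preprocessed, vocab = simple_preprocess(docs, stopwords={"the"}, vocab_size=2000)
```

If whitespace splitting is not enough for Arabic, you would swap the `doc.split()` step for a proper Arabic tokenizer and keep the rest of the pipeline unchanged.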
```python
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list
from contextualized_topic_models.datasets.dataset import CTMDataset

handler = TextHandler(sentences=preprocessed_documents)
handler.prepare()  # create vocabulary and training data

# generate BERT data
training_bert = bert_embeddings_from_list(unpreprocessed_documents, "asafaya/bert-base-arabic")

training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)

ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)
ctm.fit(training_dataset)  # run the model
```

After the model has been fitted, you can get the topics like this: `ctm.get_topic_lists()`.

I hope this helps, but feel free to ask questions if something's not clear :)
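For context on what `ctm.get_topic_lists()` gives you: each topic is represented by its highest-weight words under the topic-word distribution. The underlying idea can be sketched in plain Python; the helper `top_words_per_topic` and the toy weights below are hypothetical illustrations, not the library's actual code.

```python
def top_words_per_topic(beta, idx2token, k=5):
    # beta: one list of per-token weights for each topic.
    # For each topic, return the k tokens with the highest weight.
    topics = []
    for weights in beta:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([idx2token[i] for i in ranked[:k]])
    return topics

idx2token = {0: "cat", 1: "dog", 2: "house"}          # toy vocabulary
beta = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]             # toy topic-word weights
topics = top_words_per_topic(beta, idx2token, k=2)
```

With a real fitted model you would of course just call `ctm.get_topic_lists()` rather than reimplementing this.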
Closing this for now :) feel free to ping me again.
Thank you so much for your detailed and well-explained response. When I run the following program, I get this error:

```
/home/nassera/PycharmProjects/Facebook/venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
FileNotFoundError: [Errno 2] No such file or directory: '/home/nassera/.cache/torch/sentence_transformers/sbert.net_models_aubmindlab_bert-base-arabertv01'
```

Note that I use Python 3.6. Thank you.
Hi! Silvia
Hi, thank you for this great work. I'm a beginner with BERT, and I want to use it for topic modeling (extracting topics from Arabic text). Do you have an idea of how I can do this using your code? Thank you so much.
BR