Applying code to my own dataset #31

Closed
nassera2014 opened this issue Nov 21, 2020 · 4 comments

Hi, thank you for this great work. I'm a beginner with BERT, and I want to use it for topic modeling (extracting topics from Arabic text). Do you have an idea of how I can do this using your code? Thank you so much.

BR

vinid commented Nov 21, 2020

Hi @nassera2014!

I unfortunately do not know much about Arabic, so I am basing this on my experience with other languages that can be tokenized on whitespace. I am assuming you have your documents in a text file arabic_documents.txt, one document per line.

Preprocessing

First thing first, you need to preprocess the documents.

What we usually do is use our preprocessing pipeline (which removes stopwords and keeps only the 2000 most frequent tokens in the vocabulary), but you can also run preprocessing in the way you prefer. However, I am not sure how words are tokenized in Arabic, and we currently tokenize based on whitespace; Arabic might require more steps.

If whitespace tokenization is ok for Arabic, this snippet should be a good starting point:

from contextualized_topic_models.utils.preprocessing import SimplePreprocessing

documents = [line.strip() for line in open("arabic_documents.txt").readlines()]  # one document per line
sp = SimplePreprocessing(documents, stopwords_language="arabic")
preprocessed_documents, unpreprocessed_documents, vocab = sp.preprocess()

Training

Then you can use the rest of our code to run the topic model. For this step, you need:

  1. the documents we prepared in the previous step
  2. an Arabic BERT model (I am using asafaya/bert-base-arabic to create the document representations)
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list
from contextualized_topic_models.datasets.dataset import CTMDataset

handler = TextHandler(sentences=preprocessed_documents)
handler.prepare() # create vocabulary and training data

# generate BERT data
training_bert = bert_embeddings_from_list(unpreprocessed_documents, "asafaya/bert-base-arabic")

training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)

ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)

ctm.fit(training_dataset) # run the model 

After the model has been fitted, you can get topics like this:

ctm.get_topic_lists()
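
For example, to print the top ten words of each topic (a minimal sketch; the argument to get_topic_lists, the number of words per topic, is an assumption about the version of the package you have installed):

# print the most relevant words of every topic
for i, topic in enumerate(ctm.get_topic_lists(10)):
    print("Topic", i, ":", " ".join(topic))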

I hope this helps, but feel free to ask questions if something's not clear :)

@vinid
Copy link
Contributor

vinid commented Nov 23, 2020

closing this for now :) feel free to ping me again

vinid closed this as completed Nov 23, 2020
nassera2014 commented Nov 27, 2020

Thank you so much for your detailed and well-explained response. When I run the following program:

from contextualized_topic_models_master.contextualized_topic_models.utils.preprocessing import SimplePreprocessing

# collect the text of each post retrieved from MongoDB
arText = []
for post in posts:
    arText.append(post['text'])

documents = arText
# documents = [line.strip() for line in open("doc.txt").readlines()]
sp = SimplePreprocessing(documents, stopwords_language="arabic")
preprocessed_documents, unpreprocessed_documents, vocab = sp.preprocess()
print("proce : ",preprocessed_documents)
print("unproc : ",unpreprocessed_documents)
print("vocab:",vocab)

from contextualized_topic_models_master.contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models_master.contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models_master.contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list
from contextualized_topic_models_master.contextualized_topic_models.datasets.dataset import CTMDataset

handler = TextHandler(sentences=preprocessed_documents)
handler.prepare() # create vocabulary and training data
from transformers import AutoTokenizer, AutoModel

# generate BERT data
training_bert = bert_embeddings_from_list(unpreprocessed_documents, "aubmindlab/bert-base-arabertv01")
# asafaya/bert-base-arabic
training_dataset = CTMDataset(handler.bow, training_bert, handler.idx2token)

ctm = CombinedTM(input_size=len(handler.vocab), bert_input_size=768, n_components=50)

ctm.fit(training_dataset) # run the model
print('topics : ',ctm.get_topics())

I get this error:

/home/nassera/PycharmProjects/Facebook/venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0

FileNotFoundError: [Errno 2] No such file or directory: '/home/nassera/.cache/torch/sentence_transformers/sbert.net_models_aubmindlab_bert-base-arabertv01'

Note that I am using Python 3.6 on an Ubuntu 18.10 virtual machine, and my dataset is stored in a MongoDB database.

Thank you.

silviatti commented Nov 27, 2020

Hi!
We tried to replicate the code in a Colab notebook using AraBERT, and it seems to work. Please check and upgrade your versions of the contextualized-topic-models and sentence-transformers packages. This might solve your problem :)
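
For reference, the upgrade should look something like this (a sketch, assuming both packages are installed from PyPI under these names):

pip install --upgrade contextualized-topic-models sentence-transformers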

Silvia

vinid reopened this Nov 27, 2020
vinid closed this as completed Dec 2, 2020