python3.7, tensorflow, numpy, pandas, scipy, nltk, seqeval, transformers, scikit-learn, gensim, spacy, bcubed, matplotlib, tensorflow-addons, hdbscan, umap-learn
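The pip-installable dependencies can be set up with, for example:

```
pip install tensorflow numpy pandas scipy nltk seqeval transformers scikit-learn gensim spacy bcubed matplotlib tensorflow-addons hdbscan umap-learn
```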
Calculating embeddings with BERT can take quite a while. They can therefore be calculated once in advance instead of being recomputed for every training run. To do so, adjust and execute a Python script with the following content:
```python
import sys

project_path = ''  # insert absolute path to project here
sys.path.append(project_path)
sys.path.append(project_path + '/scripts')
sys.path.append(project_path + '/classes')

import sentence_dataset as sd
import argument_mining as AM
from argument_mining import Clustering, Segmentation

path_data = project_path + '/data/' + ''  # insert dataset name (plus slash) here
path_embed = ''  # insert path here

# Uncomment exactly one of the two blocks below and adjust its parameters.

# 1 Embeddings for segmentation
#arg_model = Segmentation(path_data, path_embed, file_size=100, dir_size=1000, mode='debatepediaSEG')
#arg_model.set_generators(batch_size=300, shuffle=False, stratify=False)
#arg_model.compute_and_save_embeddings(type='bert', separate_embedding=None, word_embedding=False)

# 2 Embeddings for clustering
#arg_model = Clustering(path_data, path_embed, file_size=100, dir_size=1000, mode='debatepediaDS')
#arg_model.set_generators(batch_size=300, shuffle=False, stratify=False)
#arg_model.compute_and_save_embeddings(type='bert', separate_embedding=True, word_embedding=True)
```
For information about the usage of the parameters, refer to the documentation in the code in `argument_mining.py` and `sentence_dataset.py`.
One row in the resulting CSV files constitutes one embedding vector. For the clustering tasks, one embedding is calculated per discussion pair (DS) or per argument pair (AS). For the segmentation task (SEG), the embeddings are calculated sentence-wise and one file corresponds to one text document (resulting in different file sizes).
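For instance, a pre-computed embedding file can be inspected as follows (a minimal sketch, assuming plain CSV files without a header row; the file name is illustrative):

```python
import pandas as pd

# Illustrative only: load one pre-computed embedding file;
# each row is assumed to be one embedding vector.
embeddings = pd.read_csv('embeddings_0.csv', header=None).to_numpy()
print(embeddings.shape)  # (number of pairs/sentences, embedding dimension)
```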
From the command line, navigate to the location of `main.py` and execute it with the respective parameter settings (all ten parameters must be set). See the example usage below.

```
python main.py execution_mode dataset task embed_type machine batch_size epochs eval_partition model_type layer [layers]
```
- `execution_mode`: Either `train`, `resume`, `evaluate`, `cluster` or `plot_history`.
- `dataset`: Name of the dataset; it is used in the paths to locate the data.
- `task`: Either `AS` (argument clustering), `DS` (discussion clustering) or `SEG` (segmentation into arguments).
- `embed_type`: If using pre-calculated embeddings, this string is used to distinguish the different embedding variants per dataset (it is used in the path).
- `machine`: Either `default` or `custom`. Determines where to find the pre-calculated embeddings. If using `custom`, set the respective location in `main.py`. The default location of pre-calculated embeddings is `dirname(PARENT_DIR) + '/data/'` (see the sketch after this list).
- `batch_size`: The batch size for training the model.
- `epochs`: The number of epochs to train the model.
- `eval_partition`: Either `val`, `train` or `test`. The partition to use for evaluating the training.
- `model_type`: Either `FNN`, `LINEAR` or `BILSTM`.
- `layer`: Number of neurons in the first layer.
- `layers`: Optional: number of neurons for each additional layer when using `FNN` or `BILSTM`.
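The default path expression above is presumably resolved relative to the script location; a minimal sketch of the assumed construction (`PARENT_DIR` is not defined in this README, so treating it as the directory containing `main.py` is an assumption):

```python
from os.path import dirname, abspath

# Assumption: PARENT_DIR is the directory containing main.py, so the
# default embedding location is a 'data' folder one level above it.
PARENT_DIR = dirname(abspath(__file__))
path_embed = dirname(PARENT_DIR) + '/data/'
```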
Example:

```
python main.py train debatepedia DS BERT default 64 10 val FNN 300 200 100
```

This trains an `FNN` with three layers of 300, 200 and 100 neurons on the `debatepedia` dataset for the `DS` task, using pre-calculated BERT embeddings from the default location, with batch size 64 for 10 epochs, evaluating on the `val` partition.
`main.py` is the main entry point to all functionality apart from pre-calculating the embeddings. `sentence_dataset.py` calculates BERT embeddings using the huggingface transformers library. The `data` directory contains the data; it holds one subdirectory for each dataset, each of which contains three files: `train.json`, `val.json` and `test.json`. The trained models are saved in a separate directory.

The main functionality is located in `argument_mining.py`. `data_generator.py` is used to process the data in batches. `create_features.py` processes the JSON files and `preprocess_general.py` contains helper functions; these are used in other files throughout the project.
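As a rough illustration of the batching idea only (not the actual `data_generator.py` implementation), a Keras `Sequence` serving pre-computed embeddings in batches could look like this:

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class EmbeddingBatchGenerator(Sequence):
    """Illustrative sketch only, not the project's data_generator.py:
    serves rows of a pre-computed embedding matrix and their labels in batches."""

    def __init__(self, embeddings, labels, batch_size=64):
        self.embeddings = embeddings  # shape: (n_samples, embed_dim)
        self.labels = labels          # shape: (n_samples,)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.embeddings) / self.batch_size))

    def __getitem__(self, idx):
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.embeddings[batch], self.labels[batch]
```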
The JSON files must be split into `train.json`, `val.json` and `test.json` and must have the following structure:
```
{'ID': topic ID,
 'topic': discussion title,
 'subtopics': [{'ID': subtopic ID,
                'title': sub heading,
                'arguments': [{'claim': claim,
                               'premise': premise,
                               'stance': pro/con}]
              }]
}
```
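A minimal sketch for reading one split, assuming each file holds a list of topic objects in the structure above (the path and dataset name are illustrative):

```python
import json

with open('data/debatepedia/train.json') as f:
    topics = json.load(f)  # assumed: a list of topic dicts as shown above

for topic in topics:
    for subtopic in topic['subtopics']:
        for argument in subtopic['arguments']:
            print(topic['topic'], subtopic['title'],
                  argument['stance'], argument['claim'])
```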