python3.7, tensorflow, numpy, pandas, scipy, nltk, seqeval, transformers, scikit-learn, gensim, spacy, bcubed, matplotlib, tensorflow-addons, hdbscan, umap-learn
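The pip-installable dependencies can be set up with, for example:

```
pip install tensorflow numpy pandas scipy nltk seqeval transformers scikit-learn gensim spacy bcubed matplotlib tensorflow-addons hdbscan umap-learn
```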
Calculating embeddings with BERT can take quite a while. They can therefore be calculated once in advance instead of being recomputed for every training run. To do so, adjust and execute a Python script with the following content:
```python
import sys

project_path = ''  # insert absolute path to project here
sys.path.append(project_path)
sys.path.append(project_path + '/scripts')
sys.path.append(project_path + '/classes')

import sentence_dataset as sd
import argument_mining as AM
from argument_mining import Clustering, Segmentation

path_data = project_path + '/data/' + ''  # insert dataset name (plus slash) here
path_embed = ''  # insert path here

# Uncomment exactly one of the two blocks below and adjust its parameters.

# 1 Embeddings for segmentation
#arg_model = Segmentation(path_data, path_embed, file_size=100, dir_size=1000, mode='debatepediaSEG')
#arg_model.set_generators(batch_size=300, shuffle=False, stratify=False)
#arg_model.compute_and_save_embeddings(type='bert', separate_embedding=None, word_embedding=False)

# 2 Embeddings for clustering
#arg_model = Clustering(path_data, path_embed, file_size=100, dir_size=1000, mode='debatepediaDS')
#arg_model.set_generators(batch_size=300, shuffle=False, stratify=False)
#arg_model.compute_and_save_embeddings(type='bert', separate_embedding=True, word_embedding=True)
```
For information about the usage of the parameters, refer to the documentation in the code in `argument_mining.py` and `sentence_dataset.py`.
One row in the resulting CSV files constitutes one embedding vector. For the clustering tasks, one embedding is calculated per discussion pair (DS) or per argument pair (AS). For the segmentation task (SEG), the embeddings are calculated sentence-wise and one file corresponds to one text document (resulting in different file sizes).
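For instance, a pre-computed embedding file can be inspected as follows (a minimal sketch, assuming plain CSV files without a header row; the file name is illustrative):

```python
import pandas as pd

# Illustrative only: load one pre-computed embedding file;
# each row is assumed to be one embedding vector.
embeddings = pd.read_csv('embeddings_0.csv', header=None).to_numpy()
print(embeddings.shape)  # (number of pairs/sentences, embedding dimension)
```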
From the command line, navigate to the location of `main.py` and execute it with the respective parameter settings (all ten parameters must be set). See the example usage below.

```
python main.py execution_mode dataset task embed_type machine batch_size epochs eval_partition model_type layer [layers]
```
- `execution_mode`: Either `train`, `resume`, `evaluate`, `cluster` or `plot_history`.
- `dataset`: Name of the dataset; it is used in the paths to locate the data.
- `task`: Either `AS` (argument clustering), `DS` (discussion clustering) or `SEG` (segmentation into arguments).
- `embed_type`: If using pre-calculated embeddings, this string is used to distinguish the different embedding variants per dataset (it is used in the path).
- `machine`: Either `default` or `custom`. Determines where to find the pre-calculated embeddings. If using `custom`, set the respective location in `main.py`. The default location of pre-calculated embeddings is `dirname(PARENT_DIR) + '/data/'` (see the sketch after this list).
- `batch_size`: The batch size for training the model.
- `epochs`: The number of epochs to train the model.
- `eval_partition`: Either `val`, `train` or `test`. The partition to use for evaluating the training.
- `model_type`: Either `FNN`, `LINEAR` or `BILSTM`.
- `layer`: Number of neurons in the first layer.
- `layers`: Optional: number of neurons for each additional layer when using `FNN` or `BILSTM`.
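The default path expression above is presumably resolved relative to the script location; a minimal sketch of the assumed construction (`PARENT_DIR` is not defined in this README, so treating it as the directory containing `main.py` is an assumption):

```python
from os.path import dirname, abspath

# Assumption: PARENT_DIR is the directory containing main.py, so the
# default embedding location is a 'data' folder one level above it.
PARENT_DIR = dirname(abspath(__file__))
path_embed = dirname(PARENT_DIR) + '/data/'
```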
Example:

```
python main.py train debatepedia DS BERT default 64 10 val FNN 300 200 100
```

This trains an `FNN` with three layers of 300, 200 and 100 neurons on the `debatepedia` dataset for the `DS` task, using pre-calculated BERT embeddings from the default location, with batch size 64 for 10 epochs, evaluating on the `val` partition.
`main.py` is the main entry point to all functionality apart from pre-calculating the embeddings. `sentence_dataset.py` calculates BERT embeddings using the huggingface transformers library. The `data` directory contains the data; it holds one subdirectory for each dataset, each of which contains three files: `train.json`, `val.json` and `test.json`. The trained models are saved in a separate directory.

The main functionality is located in `argument_mining.py`. `data_generator.py` is used to process the data in batches. `create_features.py` processes the JSON files and `preprocess_general.py` contains helper functions; these are used in other files throughout the project.
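As a rough illustration of the batching idea only (not the actual `data_generator.py` implementation), a Keras `Sequence` serving pre-computed embeddings in batches could look like this:

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class EmbeddingBatchGenerator(Sequence):
    """Illustrative sketch only, not the project's data_generator.py:
    serves rows of a pre-computed embedding matrix and their labels in batches."""

    def __init__(self, embeddings, labels, batch_size=64):
        self.embeddings = embeddings  # shape: (n_samples, embed_dim)
        self.labels = labels          # shape: (n_samples,)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.embeddings) / self.batch_size))

    def __getitem__(self, idx):
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.embeddings[batch], self.labels[batch]
```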
The JSON files must be split into `train.json`, `val.json` and `test.json` and must have the following structure:
```
{'ID': topic ID,
 'topic': discussion title,
 'subtopics': [{'ID': subtopic ID,
                'title': sub heading,
                'arguments': [{'claim': claim,
                               'premise': premise,
                               'stance': pro/con}]
              }]
}
```
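A minimal sketch for reading one split, assuming each file holds a list of topic objects in the structure above (the path and dataset name are illustrative):

```python
import json

with open('data/debatepedia/train.json') as f:
    topics = json.load(f)  # assumed: a list of topic dicts as shown above

for topic in topics:
    for subtopic in topic['subtopics']:
        for argument in subtopic['arguments']:
            print(topic['topic'], subtopic['title'],
                  argument['stance'], argument['claim'])
```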