
# Quick introduction to text2network package

## Prerequisites

The following Python packages need to be installed to run all components:

- Generally: PyTorch & NumPy
- Preprocessing: tables (hdf5)
- Analysis: networkx & python-louvain & pandas

Finally, processing requires a running Neo4j server, version 4.02, that is accessible over HTTP. We currently use a custom HTTP connector, which
is faster than the default interface. Unfortunately, the connector does not work with versions above 4.02.
We are in the process of upgrading to a standard Bolt connector.
You can choose the version of the database in the Neo4j Desktop app.

## Components

This package is organized into four classes, corresponding to four steps of analysis.

1. Preprocessing: Read files from folders, process the text and save it into a database

2. Training of BERT models: According to a chosen hierarchy, train a number of BERT models

3. Processing / Network extraction: Use trained models to extract semantic networks and save into 
a Neo4j graph database.

4. Network analysis: Condition and work with the data in the Neo4j graph database.

We begin by setting up the configuration of the project. This configuration file can be passed along to all classes.


# Configuration

We use the standard Python configuration parser to read an ini file.



In [1]:
configuration_path='D:/NLP/COCA/cocaBERT/config/config.ini'

# Load Configuration file
import configparser
config = configparser.ConfigParser()
config.read(configuration_path)

[]
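
Note that `ConfigParser.read` returns the list of files it parsed successfully and silently skips missing ones, so an empty list like the one above would indicate that no file was read. A minimal check could look like the following sketch; the path used here is a deliberately non-existent placeholder.

```python
import configparser

config_check = configparser.ConfigParser()

# read() returns the list of files that were parsed successfully.
# The path below is a placeholder that is not expected to exist.
files_read = config_check.read("does/not/exist/config.ini")

if not files_read:
    print("Configuration file not found; check configuration_path")
```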

Inside the configuration file are a number of fields. 
It is essential to set the correct paths.

<code>import_folder</code> holds the txt data

<code>pretrained_bert</code> holds the pre-trained BERT PyTorch model that is used for all divisions of the corpus.

<code>trained_berts</code> will store the fine-tuned BERT models.

<code>processing_cache</code> keeps track of which subcorpora have already been processed into the Neo4j graph database.

<code>database</code> holds the processed text in an HDF5 database.

<code>log</code> is the folder for the log.
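
As a rough illustration, the path fields could be laid out as in the sketch below. The section name `Paths` and all folder values are assumptions made for this example; the actual ini file shipped with the package may be organized differently.

```python
import configparser

# Hypothetical ini content: the section name "Paths" and every value are
# placeholders for illustration, not the package's actual layout.
example_ini = """
[Paths]
import_folder = data/txt
pretrained_bert = models/pretrained
trained_berts = models/trained
processing_cache = cache/processing
database = data/db
log = logs
"""

example_config = configparser.ConfigParser()
example_config.read_string(example_ini)

# Fields are then available by section and key.
paths = dict(example_config["Paths"])
print(paths["import_folder"])
```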



# Pre-processing

Once text files are comfortably situated in a folder, the text can be pre-processed.
Sentences that are too long are split, tags and other nuisance characters are deleted and so forth.

Most importantly, each sentence is saved in a database, together with its metadata.
This always includes the following:

<code>Year</code>: An integer time variable, typically YYYY, although YYYYMMDD could also be used.

<code>Source</code>: Name of the txt file.

<code>p1 through p4</code>: Up to four parameters coming from the file name.

<code>run_index</code>: An index across all sentences in all text files.

<code>seq_id</code>: An index across sentences within a given text file.

<code>text</code>: The sentence, capped at a maximum number of characters.

Since each sentence is saved as a row in the database, we can decide at a later
stage how to query and split the corpus into subcorpora (e.g. by year and parameter 1).

First, we need to use the configuration file to define the properties of the text we are going to use. In particular,
we need to define what the file names mean.
Two options for the file structure are possible:

First, the import folder could include sub-folders of years.

    import_folder/
        import_folder/year1/
        ------p1_p2_p3_p4.txt
        ------p1_p2_p3_p4.txt
        (...)
        import_folder/year2/
        ------p1_p2_p3_p4.txt
        ------p1_p2_p3_p4.txt
        (...)
     
Alternatively, all txt files can also reside in a single folder.

    import_folder/
        ------year1_p1_p2_p3_p4.txt
        ------year1_p1_p2_p3_p4.txt
        ------year2_p1_p2_p3_p4.txt
        ------year3_p1_p2_p3_p4.txt
        (...)
        
Accordingly, we set the following parameters in the configuration file. <code>split_symbol</code>
is the symbol that separates parameters (here "\_"). <code>number_params</code> denotes the number
of parameters (here 4). If we had only two parameters, our text files might be
of the form <code>p1_p2.txt</code> and we would set that value to 2.
Finally, <code>max_seq_length</code> denotes the maximum length of a sentence, counted in words.
<code>char_mult</code> is a multiplier that determines how many letters the average word can have.
The total sequence length in letters (symbols) is given by <code>max_seq_length*char_mult</code>.
Having a fixed-length format here is helpful for performance. Sequences can, of course, be shorter. Later components
will also re-split sentences if smaller batch sizes are desired. Setting the sequence size very high
ensures that no sentence will be unduly split; however, this will increase file size.
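
To make the naming convention concrete, here is a minimal sketch of how a file name could be decomposed and how the character cap follows from the two length settings. `parse_filename` is a hypothetical helper written for this example, not the package's own parser. Padding missing parameters with empty strings and the values 50 and 10 are likewise assumptions.

```python
from pathlib import Path

def parse_filename(path, split_symbol="_", number_params=4, year_in_name=True):
    """Split a file name into a year and up to `number_params` parameters.

    Hypothetical helper that mirrors the convention described above.
    """
    parts = Path(path).stem.split(split_symbol)
    if year_in_name:
        # Single-folder layout: the year is the first element of the name.
        year, parts = int(parts[0]), parts[1:]
    else:
        # Sub-folder layout: the year is the name of the enclosing folder.
        year = int(Path(path).parent.name)
    params = parts[:number_params]
    # Assumption: pad missing parameters so every row has the same columns.
    params += [""] * (number_params - len(params))
    return year, params

year, params = parse_filename("import_folder/1990_news_nyt_a_b.txt")

# Character cap per stored sentence, using example values for both settings.
max_seq_length, char_mult = 50, 10
max_chars = max_seq_length * char_mult
```

With these example values, each stored sentence would be capped at 500 characters.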

We begin by instantiating the preprocessing class.
At this stage, we will also set up logging.

In [None]:
from src.classes.nw_preprocessor import nw_preprocessor

# Set up preprocessor
preprocessor = nw_preprocessor(config)
# Set up logging
preprocessor.setup_logger()


Note that it is sufficient to pass the <code>config</code>. However, the class also
takes optional parameters in case we want to override values from the configuration file.
This is the standard behavior for all modules. So, for example, one could instead do:

    preprocessor = nw_preprocessor(config, max_seq_length=50)
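
The general pattern behind such overrides can be sketched as follows. `ExampleModule` and the section name `Preprocessing` are hypothetical stand-ins chosen for this illustration; they are not taken from the package.

```python
import configparser

class ExampleModule:
    """Hypothetical stand-in for a package class; not the real implementation."""

    def __init__(self, config, max_seq_length=None):
        # Use the keyword argument if given, otherwise fall back to the
        # configuration file ("Preprocessing" is an assumed section name).
        if max_seq_length is None:
            max_seq_length = config.getint("Preprocessing", "max_seq_length")
        self.max_seq_length = max_seq_length

example_config = configparser.ConfigParser()
example_config.read_dict({"Preprocessing": {"max_seq_length": "30"}})

from_file = ExampleModule(example_config)
overridden = ExampleModule(example_config, max_seq_length=50)
print(from_file.max_seq_length, overridden.max_seq_length)
```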


Next, we can process the text files and create the database.
If our text files are split among multiple sub-folders, with years as folder names,
we call the <code>preprocess_folders</code> method:


In [None]:
preprocessor.preprocess_folders(overwrite=True,excludelist=['checked', 'Error'])

Here, <code>overwrite</code> indicates that we wish to overwrite any existing database.
<code>excludelist</code> is a list of strings corresponding to any of the parameters
in the file name. File names that include elements from this list are not processed.
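
The exclusion rule can be pictured with the short sketch below. `is_excluded` is a hypothetical helper, and it assumes that a file is skipped when one of its file-name parameters exactly matches an entry of the list; the package may match differently (for instance on substrings).

```python
from pathlib import Path

def is_excluded(filename, excludelist, split_symbol="_"):
    """Return True if any file-name parameter appears in excludelist.

    Hypothetical helper; exact matching on parameters is an assumption.
    """
    params = Path(filename).stem.split(split_symbol)
    return any(p in excludelist for p in params)

# Made-up file names for illustration.
files = ["1990_news_checked.txt", "1990_news_final.txt", "1991_Error_mag.txt"]
kept = [f for f in files if not is_excluded(f, ["checked", "Error"])]
print(kept)
```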

If, instead, all files are in a single folder, we run:



In [None]:
preprocessor.preprocess_files(overwrite=True,excludelist=['checked', 'Error'])

Note that both functions also take a <code>folder</code> argument, in case we do not want to use the folder from the configuration file.
In this way, pre-processing can also be done across many sources. Note, however, that the
file names of the txt files are essential and need to follow the same convention:
either folders with year names, or files starting with years, followed by up to four parameters.

The module will try to take care of encodings and other matters. If a file cannot be read, an error will be
returned.

Once done, a <code>db.h5</code> file will be created in the <code>database</code> folder, which includes all
individual sentences and their meta-data.

# Training BERT


## Understanding split hierarchy

We will train one BERT model for each logical division of the corpus. This subdivision is carried through all subsequent steps, so processing a certain subdivision requires that a corresponding BERT model has been trained. Several different divisions can be trained side by side, as each is saved in a distinct folder.

Subdivisions are specified via the <code>split_hierarchy</code> option in the configuration file.

It is a list of parameters by which to split the corpus and train the models. All parameters are always saved as meta-data, but we might want to aggregate across them when training BERT.

The simplest division is by year:

    split_hierarchy=["year"]

This will train one BERT per year.
However, we might also train one BERT per combination of year, p1 and p2, e.g.

    split_hierarchy=["year","p1","p2"]


By setting this parameter, the trainer module can ascertain how many BERTs are required, and which sentences it should train on.
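
The following sketch shows how a split hierarchy translates into a set of subdivisions, and hence into a number of models. The rows are made-up metadata, and `subdivisions` is a hypothetical helper, not the trainer's own query logic.

```python
# Made-up metadata rows; in practice these would come from the database.
rows = [
    {"year": 1990, "p1": "news", "p2": "a"},
    {"year": 1990, "p1": "news", "p2": "b"},
    {"year": 1990, "p1": "mag", "p2": "a"},
    {"year": 1991, "p1": "news", "p2": "a"},
    {"year": 1991, "p1": "news", "p2": "a"},
]

def subdivisions(rows, split_hierarchy):
    """Return the distinct combinations of the split fields, sorted."""
    return sorted({tuple(row[k] for k in split_hierarchy) for row in rows})

per_year = subdivisions(rows, ["year"])
per_combo = subdivisions(rows, ["year", "p1", "p2"])

# One model per subdivision.
print(len(per_year), len(per_combo))
```

In this toy data, splitting by year alone yields two subdivisions, whereas splitting by year, p1 and p2 yields four.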

## Training process

We do not wish to use word-pieces, so the pre-trained BERT has word-pieces disabled. For that reason, the vocabulary needs to be amended. It is desirable, although not strictly necessary, to use the same vocabulary across all models. To keep the vocabulary size manageable, set <code>new_word_cutoff</code> for large corpora. Only words that occur more often than this cutoff will be included in the vocabulary.
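
A frequency cutoff of this kind can be sketched as follows. `build_vocabulary` is a hypothetical helper: lower-casing, whitespace tokenization and appending the new words in alphabetical order are all simplifying assumptions, not the trainer's actual procedure.

```python
from collections import Counter

def build_vocabulary(sentences, base_vocab, new_word_cutoff):
    """Extend base_vocab with words occurring more than new_word_cutoff times.

    Hypothetical helper; tokenization here is a plain whitespace split.
    """
    counts = Counter(word for s in sentences for word in s.lower().split())
    new_words = sorted(
        w for w, c in counts.items()
        if c > new_word_cutoff and w not in base_vocab
    )
    return list(base_vocab) + new_words

# Made-up corpus and base vocabulary for illustration.
sentences = ["the cat sat", "the cat ran", "a dog ran"]
vocab = build_vocabulary(sentences, base_vocab=["the", "a"], new_word_cutoff=1)
print(vocab)
```

Here "cat" and "ran" occur twice and are added, while "sat" and "dog" occur only once and are left out.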

The training process first creates one shared vocabulary, then resizes the BERT models and finally trains them individually.

Each model is trained until either <code>eval_loss_limit</code> or <code>loss_limit</code> is reached. The first refers to the loss across test sequences, the second to the loss in the current batch during training. The configuration file also includes the usual model parameters, which should be set according to GPU size and corpus size.
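
The stopping rule can be illustrated with a small sketch. It assumes that a limit counts as "reached" once the corresponding loss falls to or below it; `should_stop` and the loss values are made up for this example and do not reflect the trainer's internals.

```python
def should_stop(eval_loss, batch_loss, eval_loss_limit, loss_limit):
    """Stop once either loss has fallen to or below its limit (assumption)."""
    return eval_loss <= eval_loss_limit or batch_loss <= loss_limit

# Made-up (eval_loss, batch_loss) pairs, one per evaluation step.
loss_history = [(2.0, 2.1), (1.0, 0.9), (0.45, 0.6), (0.3, 0.2)]

stopped_at = next(
    step for step, (eval_loss, batch_loss) in enumerate(loss_history)
    if should_stop(eval_loss, batch_loss, eval_loss_limit=0.5, loss_limit=0.3)
)
print(stopped_at)
```

In this toy history, training stops at step 2, where the evaluation loss first drops below its limit even though the batch loss has not yet reached its own.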

To train all BERTs, we initialize the trainer and run the training.
Again, attributes may be given via the config file or as individual parameters.



In [None]:
from src.classes.bert_trainer import bert_trainer

trainer=bert_trainer(config)
trainer.train_berts()


# Network processing

Having trained BERTs, we need to extract semantic networks. This involves running inference across the subdivisions of the corpus and saving network ties in the Neo4j database.

All interfacing with Neo4j is done via the network class, which we initialize first.



In [None]:
from src.classes.neo4jnw import neo4j_network
neograph = neo4j_network(config)


The network is, of course, entirely empty at this stage. To fill it, we also create a processor that takes the network interface as input.



In [None]:
from src.classes.nw_processor import nw_processor
processor = nw_processor(config, neograph)

Since all options are already specified in the configuration file, we can directly process our semantic networks.


In [None]:
processor.run_all_queries(delete_incomplete=True, delete_all=False)


Here, we can specify whether to clean the graph database first, so that ties are not duplicated.

Note that the processor remembers whether a BERT model has already been processed to completion. If <code>delete_incomplete</code> is set, the processor will first clean the graph database of subdivisions that were not completed.
This is useful if processing was interrupted.

Conversely, <code>delete_all</code> cleans the graph entirely for a fresh start.
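
The bookkeeping behind these two options might be pictured as in the sketch below. `plan_processing` and its arguments are hypothetical; the sketch only mirrors the behavior described above and does not touch Neo4j or the package's actual cache format.

```python
def plan_processing(all_subdivisions, completed, in_graph, delete_all=False):
    """Decide which subdivisions to remove from the graph and which to process.

    Hypothetical helper for illustration; the package's own bookkeeping
    may differ.
    """
    if delete_all:
        # Fresh start: remove everything and reprocess every subdivision.
        return list(in_graph), list(all_subdivisions)
    # Subdivisions with ties in the graph but no completion record are partial.
    to_delete = [s for s in in_graph if s not in completed]
    to_process = [s for s in all_subdivisions if s not in completed]
    return to_delete, to_process

# Made-up state: 1990 finished, 1991 was interrupted, 1992 not yet started.
to_delete, to_process = plan_processing(
    all_subdivisions=["1990", "1991", "1992"],
    completed={"1990"},
    in_graph=["1990", "1991"],
)
print(to_delete, to_process)
```

In this example, the partially processed 1991 subdivision is removed and then processed again together with 1992, while the completed 1990 subdivision is left untouched.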

