## Using the complex word sequence labeller

To use the complex word models, you must first download the sequence labeller files, available [here](https://github.com/marekrei/sequence-labeler). If you use these models for research, please cite both the sequence labeller paper and the CWI sequence labelling paper.

Below is example code showing each method of the `Complexity_labeller` class.

In [4]:
import sys
sys.path.insert(0, './sequence-labeler-master')

from complex_labeller import Complexity_labeller
model_path = './cwi_seq.model'
temp_path = './temp_file.txt'

In [5]:
model = Complexity_labeller(model_path, temp_path)

Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


2021-10-03 11:36:05.000259: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-03 11:36:05.002471: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2021-10-03 11:36:05.024249: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-10-03 11:36:05.059278: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3793005000 Hz


There are two options for converting text to CoNLL-type tab-separated format:

- `convert_format_string`, which takes a raw string
- `convert_format_token`, which takes a list of tokens

In [6]:
Complexity_labeller.convert_format_string(model, 'You can convert a string like this')

In [7]:
Complexity_labeller.convert_format_token(model, ['You','can','convert','tokens','like','this'])
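
For orientation, here is a minimal sketch of what CoNLL-type tab-separated text generally looks like: one token per line, a tab, then a label column, with a blank line between sentences. The helper `to_conll` and the placeholder label `"N"` are hypothetical names chosen for illustration; the exact file that `Complexity_labeller` writes to `temp_path` may differ.

```python
def to_conll(sentences, placeholder_label="N"):
    """Render tokenized sentences as CoNLL-type text.

    Each token goes on its own line as 'token<TAB>label'.
    Sentences are separated by one blank line.
    The placeholder label is an assumption, not the labeller's actual value.
    """
    blocks = []
    for tokens in sentences:
        lines = [f"{token}\t{placeholder_label}" for token in tokens]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


conll_text = to_conll([["You", "can", "convert", "tokens", "like", "this"]])
print(conll_text)
```

The label column is only a stand-in here, since the labels are what the model is asked to predict.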

Once the text has been converted, there are three methods to access complexity information:

- `get_dataframe`
- `get_bin_labels`
- `get_prob_labels`

In [8]:
# Convert the example sentence: 'Based in an armoured train parked in its sidings, he met with numerous ministers'

Complexity_labeller.convert_format_string(model,'Based in an armoured train parked in its sidings, he met with numerous ministers')

The `get_dataframe` method returns a dataframe containing the original tokenized sentence, binary complexity labels and complex class probabilities.

If a word receives a binary label of 1, it has been classified as a complex word.

In [9]:
dataframe = Complexity_labeller.get_dataframe(model)

In [10]:
dataframe

| | index | sentences | labels | probs |
|---|---|---|---|---|
| 0 | 0 | [Based, in, an, armoured, train, parked, in, i... | [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0] | [[0.96680665, 0.033193372], [0.99995637, 4.359... |


The example below shows how to pair each token with its binary label using the dataframe:

In [11]:
list(zip(dataframe['sentences'].values[0],dataframe['labels'].values[0]))

[('Based', 0),
 ('in', 0),
 ('an', 0),
 ('armoured', 1),
 ('train', 0),
 ('parked', 0),
 ('in', 0),
 ('its', 0),
 ('sidings', 1),
 (',', 0),
 ('he', 0),
 ('met', 0),
 ('with', 0),
 ('numerous', 1),
 ('ministers', 0)]
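
Building on the token and label pairs above, a short sketch of how the complex words alone could be pulled out. The helper name `complex_words` is hypothetical, and the two lists are copied from the output shown here rather than read from a live model.

```python
# Tokens and binary labels copied from the example output above.
tokens = ['Based', 'in', 'an', 'armoured', 'train', 'parked', 'in', 'its',
          'sidings', ',', 'he', 'met', 'with', 'numerous', 'ministers']
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]


def complex_words(tokens, labels):
    """Return the tokens whose binary label is 1 (classified as complex)."""
    return [token for token, label in zip(tokens, labels) if label == 1]


flagged = complex_words(tokens, labels)
print(flagged)  # ['armoured', 'sidings', 'numerous']
```

With a real dataframe, the same helper could be called as `complex_words(dataframe['sentences'].values[0], dataframe['labels'].values[0])`.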

The `get_bin_labels` method returns the binary complexity labels for the input.

In [12]:
Complexity_labeller.get_bin_labels(model)

[array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0])]

The `get_prob_labels` method returns the probability of each token belonging to the complex class.

In [13]:
Complexity_labeller.get_prob_labels(model)

[0.033193372,
 4.3595624e-05,
 0.000119937315,
 0.9801681,
 0.01585573,
 0.2678754,
 4.052542e-05,
 0.00021037956,
 0.8165311,
 6.47893e-05,
 0.000112162525,
 0.010358474,
 6.746332e-05,
 0.89688677,
 0.4075551]
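
In the outputs above, the binary labels coincide with the complex-class probabilities being greater than 0.5. The sketch below checks that relationship on the values shown and illustrates how a different cut-off would change which words are flagged. The helper `binarize` and the adjustable threshold are assumptions for illustration, not part of `Complexity_labeller`, and the 0.5 rule is inferred from this one example rather than from the model code.

```python
# Complex-class probabilities and binary labels copied from the outputs above.
probs = [0.033193372, 4.3595624e-05, 0.000119937315, 0.9801681, 0.01585573,
         0.2678754, 4.052542e-05, 0.00021037956, 0.8165311, 6.47893e-05,
         0.000112162525, 0.010358474, 6.746332e-05, 0.89688677, 0.4075551]
bin_labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]


def binarize(probabilities, threshold=0.5):
    """Label a token 1 when its complex-class probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


# A 0.5 cut-off reproduces the binary labels shown for this sentence.
default_labels = binarize(probs)
print(default_labels == bin_labels)  # True

# A lower, hypothetical cut-off also flags 'parked' and 'ministers'.
lenient_labels = binarize(probs, threshold=0.25)
print(lenient_labels)  # [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1]
```

Working from the probabilities rather than the binary labels can be useful when borderline words such as 'ministers' (0.41) should be treated differently from clearly simple ones.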