This notebook demonstrates how negation and uncertainty patterns can be created for the CheXpert labeler. This is illustrated using a small set of sample data on weather forecasts. The text data was created using weather forecast transcripts which can be found [here](https://learnenglishteens.britishcouncil.org/sites/teens/files/weather_forecast_-_transcript_4.pdf).

First, the data has to be loaded from a Knodle dataset collection. The data consists of "phrases", which include mentions and unmentions of eight different weather labels, and "patterns", which are divided into pre-negation uncertainty, negation and post-negation uncertainty patterns.
While the mentions serve as the main rules for the corresponding labels and are represented in the T and Z matrices, the unmentions and the patterns are used to fine-tune the matches. The unmentions, like the mentions, are stored in plain text files of simple keywords corresponding to the respective labels. The patterns are stored in three text files made up of [SemgrexPatterns](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html), which match node and edge configurations of a dependency graph. Each pattern is composed of nodes, which represent IndexedWords, and the relations between them, which represent edges in a SemanticGraph. For more detailed information, please have a look at the syntax [here](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).
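
To make the role of the T and Z matrices more concrete, here is a minimal sketch in plain Python. It assumes the usual Knodle convention that Z records which rule matched which document (documents x rules) and T maps each rule to its label (rules x labels). The phrase lists in `LABEL_PHRASES` and the helper names are hypothetical and are not the content of the downloaded files or Knodle's own code.

```python
import re

# Hypothetical mention phrases per label (the real files may differ).
LABEL_PHRASES = {
    "rain": ["rain", "rainy"],
    "snow": ["snow"],
    "wind": ["wind", "windy"],
}

labels = sorted(LABEL_PHRASES)
# Every phrase is treated as one rule that votes for exactly one label.
rules = [(phrase, label) for label in labels for phrase in LABEL_PHRASES[label]]

# T: one row per rule, one column per label.
T = [[1 if rule_label == label else 0 for label in labels] for _, rule_label in rules]


def build_z(documents):
    """Z: one row per document, one column per rule (1 = phrase found)."""
    return [
        [
            1 if re.search(rf"\b{re.escape(phrase)}\b", doc, re.IGNORECASE) else 0
            for phrase, _ in rules
        ]
        for doc in documents
    ]


def label_votes(z, t):
    """Multiply Z by T to count, per document, how many rules vote for each label."""
    return [
        [sum(z_row[r] * t[r][c] for r in range(len(t))) for c in range(len(t[0]))]
        for z_row in z
    ]


Z = build_z(["It is rainy all day.", "It is very windy and cold."])
votes = label_votes(Z, T)
```

Here `votes` has one row per document and one column per label in `labels`, so a non-zero entry means at least one mention phrase of that label was found.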

In [166]:
# Imports
import os
import pandas as pd
from tqdm.auto import tqdm
from minio import Minio
from knodle.labeler.CheXpert.label import Labeler

# Client to access the dataset collection
client = Minio("knodle.cc", secure=False)

## Download Data

Specify the directory in which all files should be saved as `CHEXPERT_DATA_DIR`. The default path is given below; adjust it if you wish.

In [167]:
# DATA DIRECTORY -------------------------------------------------------------------------------------
CHEXPERT_DATA_DIR = os.path.join(os.getcwd(), "examples", "labeler", "chexpert")

The next step is downloading the data from Minio. In the next section of code, the `mention`, `unmention` & `pattern` directories are first created and then the text files are saved to them. In case you want to download other data, just change the path to the correct Minio folder and adjust the file names.

In [168]:
# RULE DIRECTORIES -----------------------------------------------------------------------------------
MENTION_DATA_DIR = os.path.join(CHEXPERT_DATA_DIR, "phrases", "mention")
os.makedirs(MENTION_DATA_DIR, exist_ok=True)
files_mention = [
    "clouds.txt", "cold.txt", "rain.txt", "snow.txt",
    "storm.txt", "sun.txt", "warm.txt", "wind.txt"
]
for file in tqdm(files_mention):
    client.fget_object(
        bucket_name="knodle",
        object_name=os.path.join("datasets/weather/phrases/mention/", file),
        file_path=os.path.join(MENTION_DATA_DIR, file),
    )

UNMENTION_DATA_DIR = os.path.join(CHEXPERT_DATA_DIR, "phrases", "unmention")
os.makedirs(UNMENTION_DATA_DIR, exist_ok=True)
files_unmention = [
    "rain.txt"
]
for file in tqdm(files_unmention):
    client.fget_object(
        bucket_name="knodle",
        object_name=os.path.join("datasets/weather/phrases/unmention/", file),
        file_path=os.path.join(UNMENTION_DATA_DIR, file),
    )


# PATTERN DIRECTORY ----------------------------------------------------------------------------------
PATTERNS_DIR = os.path.join(CHEXPERT_DATA_DIR, "patterns")
os.makedirs(PATTERNS_DIR, exist_ok=True)
files_patterns = [
    "pre_negation_uncertainty.txt", "negation.txt", "post_negation_uncertainty.txt"
]
for file in tqdm(files_patterns):
    client.fget_object(
        bucket_name="knodle",
        object_name=os.path.join("datasets/weather/patterns/", file),
        file_path=os.path.join(PATTERNS_DIR, file),
    )

2022-01-24 16:04:36,094 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/phrases/mention/clouds.txt -> https://knodle.cc/knodle/datasets/weather/phrases/mention/clouds.txt
2022-01-24 16:04:36,193 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/phrases/mention/clouds.txt -> https://knodle.cc/knodle/datasets/weather/phrases/mention/clouds.txt
2022-01-24 16:04:36,217 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/phrases/mention/cold.txt -> https://knodle.cc/knodle/datasets/weather/phrases/mention/cold.txt
2022-01-24 16:04:36,239 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/phrases/mention/cold.txt -> https://knodle.cc/knodle/datasets/weather/phrases/mention/cold.txt
2022-01-24 16:04:36,262 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/phrases/mention/rain.txt -> https://knodle.cc/knodle/datasets/weather/phrases/ment




2022-01-24 16:04:36,550 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/phrases/unmention/rain.txt -> https://knodle.cc/knodle/datasets/weather/phrases/unmention/rain.txt
2022-01-24 16:04:36,571 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/phrases/unmention/rain.txt -> https://knodle.cc/knodle/datasets/weather/phrases/unmention/rain.txt





2022-01-24 16:04:36,606 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/patterns/pre_negation_uncertainty.txt -> https://knodle.cc/knodle/datasets/weather/patterns/pre_negation_uncertainty.txt
2022-01-24 16:04:36,625 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/patterns/pre_negation_uncertainty.txt -> https://knodle.cc/knodle/datasets/weather/patterns/pre_negation_uncertainty.txt
2022-01-24 16:04:36,644 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/patterns/negation.txt -> https://knodle.cc/knodle/datasets/weather/patterns/negation.txt
2022-01-24 16:04:36,664 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/patterns/negation.txt -> https://knodle.cc/knodle/datasets/weather/patterns/negation.txt
2022-01-24 16:04:36,694 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/patterns/post_negation_uncertainty.txt -> https://




Following the same steps as above, the sample data, for which we use weather forecasts, is downloaded and stored. The sample data, in contrast to the other files, needs to be provided in a csv file.

In [169]:
# SAMPLE DIRECTORY -----------------------------------------------------------------------------------
SAMPLE_DIR = os.path.join(CHEXPERT_DATA_DIR, "reports")
os.makedirs(SAMPLE_DIR, exist_ok=True)
files_sample = [
    "weather_forecast.csv"
]
for file in tqdm(files_sample):
    client.fget_object(
        bucket_name="knodle",
        object_name=os.path.join("datasets/weather/reports/", file),
        file_path=os.path.join(SAMPLE_DIR, file),
    )

2022-01-24 16:04:36,757 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/reports/weather_forecast.csv -> https://knodle.cc/knodle/datasets/weather/reports/weather_forecast.csv
2022-01-24 16:04:36,778 urllib3.poolmanager INFO     Redirecting http://knodle.cc/knodle/datasets/weather/reports/weather_forecast.csv -> https://knodle.cc/knodle/datasets/weather/reports/weather_forecast.csv
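
The download cells above all repeat the same loop, so when adapting the notebook to other data it may be convenient to wrap that loop in a helper. The following is only a sketch: the function name `download_files` and its parameters are hypothetical, and it assumes a client object exposing `fget_object(bucket_name, object_name, file_path)` as used in the cells above. Object names are joined with `posixpath` so that they always use forward slashes, independent of the local operating system.

```python
import os
import posixpath


def download_files(client, remote_prefix, local_dir, files, bucket="knodle"):
    """Download each file below `remote_prefix` into `local_dir`.

    Returns the list of local file paths that were written.
    """
    os.makedirs(local_dir, exist_ok=True)
    saved = []
    for name in files:
        target = os.path.join(local_dir, name)
        client.fget_object(
            bucket_name=bucket,
            object_name=posixpath.join(remote_prefix, name),
            file_path=target,
        )
        saved.append(target)
    return saved
```

With the `client` and directories defined earlier, a call such as `download_files(client, "datasets/weather/patterns/", PATTERNS_DIR, files_patterns)` would replace one of the loops.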





Finally, the directory where you want the output matrices X, T and Z to be stored, needs to be specified.

In [170]:
# OUTPUT DIRECTORY -----------------------------------------------------------------------------------
OUTPUT_DIR = os.path.join(CHEXPERT_DATA_DIR, "output")
os.makedirs(OUTPUT_DIR, exist_ok=True)

## Preview Dataset & Patterns

The sample data, or X matrix as it is called within Knodle, is shown below. It consists of seven lines, or documents as they are called within the labeler code, each including information about the weather.

In [171]:
pd.read_csv(os.path.join(SAMPLE_DIR, "weather_forecast.csv"), header=None)

Unnamed: 0,0
0,It is very windy and cold. There is going to be rain or snow.
1,It is rainy all day. There may be a thunderstorm in the afternoon.
2,"The weather is dry, but cloudy. So no rain today."
3,"It is cold, but snow is still not likely."
4,"The weather is acting up today, even a storm is possible."
5,"The weather is getting better, no development of rain today."
6,"The clouds have cleared, it is going to be a sunny day."


There are eight different weather labels, each with its own text file of match phrases, that could be assigned to each of these sentences. The labels are: "clouds", "cold", "rain", "snow", "storm", "sun", "warm" and "wind". Take, for example, the first line of the data: "It is very windy and cold. There is going to be rain or snow." If we had to assign positive, negative or uncertain to each of the labels here, we would probably choose the following:
- **positive**: cold, wind
- **negative**: clouds, storm, sun, warm
- **uncertain**: rain, snow

The author of the sentence mentions the weather conditions "wind", "cold", "rain" and "snow", so the labeler will find these labels based on the match phrases in the corresponding text files. However, a mention does not always mean that the label should be positive. The example sentence says "there is going to be rain or snow", so there is in fact uncertainty about the labels "rain" and "snow". We want the labeler to recognize this uncertainty, and we achieve that by providing the necessary uncertainty pattern.
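
How the three pattern groups interact can be sketched as a cascade. In the original CheXpert labeler, a mention is checked against the pre-negation uncertainty rules first, then the negation rules, then the post-negation uncertainty rules, and it is positive if nothing matches. The sketch below mimics that order with regular expressions standing in for Semgrex patterns. The toy rules and the names `regex_rule` and `classify_mention` are hypothetical; they are not the content of the downloaded pattern files and not Knodle's implementation.

```python
import re

POSITIVE, NEGATIVE, UNCERTAIN = 1, 0, -1


def regex_rule(template):
    """Build a toy rule from a regex template containing the placeholder {m}."""
    def rule(sentence, mention):
        pattern = template.replace("{m}", re.escape(mention))
        return re.search(pattern, sentence, flags=re.IGNORECASE) is not None
    return rule


# Hypothetical stand-ins for the three pattern files (NOT their real content).
PRE_NEGATION_UNCERTAINTY = [regex_rule(r"\b{m}\b.*\bnot likely\b")]
NEGATION = [regex_rule(r"\bno\b[^.]*\b{m}\b")]
POST_NEGATION_UNCERTAINTY = [
    regex_rule(r"\b{m} or \w+"),  # mention before the "or"
    regex_rule(r"\w+ or {m}\b"),  # mention after the "or"
]


def classify_mention(sentence, mention):
    """Apply the three stages in order; the first stage that matches decides."""
    if any(rule(sentence, mention) for rule in PRE_NEGATION_UNCERTAINTY):
        return UNCERTAIN
    if any(rule(sentence, mention) for rule in NEGATION):
        return NEGATIVE
    if any(rule(sentence, mention) for rule in POST_NEGATION_UNCERTAINTY):
        return UNCERTAIN
    return POSITIVE
```

For instance, `classify_mention("There is going to be rain or snow.", "rain")` falls through the first two stages and is caught by the post-negation uncertainty stage, while `classify_mention("So no rain today.", "rain")` is already decided by the negation stage.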

Below, the provided post-negation uncertainty patterns are shown. The last two lines of the file contain the patterns responsible for "rain" and "snow" being labelled as uncertain. The first of the two patterns makes sure that "rain", the word before the "or", is labelled as uncertain, and the second does the same for "snow", the word after the "or". As mentioned before, a detailed explanation of Semgrex can be found on the corresponding [website](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). In this example, however, the pattern is quite simple. The following explanation of the node relations is taken from the Semgrex website:
- `A >reln B`: A is the governor of a relation reln with B
- `A <reln B`: A is the dependent of a relation reln with B

In the relation "rain or snow", "rain" is the governor and "snow" the dependent, therefore we need the ">reln" relation to label "rain" as uncertain and "<reln" to label "snow" as uncertain.
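
The difference between the two relation operators can be illustrated with a toy dependency edge. This is not Semgrex itself, only a plain-Python sketch; the `Edge` type, the helper names and the relation name `conj` are assumptions made for illustration, and a real parse would come from a dependency parser.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    governor: str
    relation: str
    dependent: str


# Hand-written toy edge for "rain or snow" (relation name assumed).
EDGES = [Edge(governor="rain", relation="conj", dependent="snow")]


def match_governor(edges, relation):
    """Nodes A for which 'A >relation B' holds: A governs the relation."""
    return [e.governor for e in edges if e.relation == relation]


def match_dependent(edges, relation):
    """Nodes A for which 'A <relation B' holds: A depends on B."""
    return [e.dependent for e in edges if e.relation == relation]


governors = match_governor(EDGES, "conj")
dependents = match_dependent(EDGES, "conj")
```

In this toy graph, `governors` contains only "rain" and `dependents` contains only "snow", which is why one pattern per direction is needed to mark both words as uncertain.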

In [172]:
pd.read_csv(os.path.join(PATTERNS_DIR, "post_negation_uncertainty.txt"), header=None)

Unnamed: 0,0
0,# Rain cannot be excluded.
1,{} < {} ({lemma:/excluded/} > {dependency:/neg/} {})
2,# May/might/would/could be XXX
3,{} > {} {lemma:/may|might|would|could/}
4,# '{} >{dependency:/cop/} {lemma:/may|would|could/}
5,# A Storm would be possible
6,{} <{} {lemma:/possible/}
7,# may be XXX
8,{} <{} {lemma:/be/} >{} {lemma:/may|could|would/}
9,# XXX or YYY


Have a look at the other example weather forecasts as well, along with their corresponding negation or uncertainty patterns. A very helpful resource for figuring out dependencies or getting part-of-speech tags for a text is the [CoreNLP website](https://corenlp.run/). Referring to it when creating patterns yourself can be quite useful.

## Labelling

Now that the data is loaded, use the adjusted [CheXpert labeler](https://github.com/stanfordmlgroup/chexpert-labeler) in Knodle to label the weather forecasts. The labeler is started by instantiating the `Labeler` class and then calling its `label()` method. Since we are not using the original CheXpert data, `chexpert_bool` is set to `False`, and the `config_pattern_tutorial.py` file is specified as the config file for the labeler.
You can choose how the uncertain labels are handled. By default, the labeler transforms all uncertain labels into positive labels, but this can be changed through the `uncertain` argument of `label()`: specify `-1` to keep the uncertain labels or `0` to change them to negative labels. Note that, as of now, the other Knodle modules (e.g. the trainers) can only handle Z matrices containing zeros and ones.
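
The effect of the three options on a Z matrix can be sketched as follows. This is a plain-Python illustration, not the labeler's code: the function name `resolve_uncertain` is hypothetical, and using `1` as the default value is an assumption that mirrors the default behaviour described above.

```python
def resolve_uncertain(z, uncertain=1):
    """Replace every uncertain entry (-1) in z by the chosen value.

    uncertain=1  -> uncertain labels become positive
    uncertain=0  -> uncertain labels become negative
    uncertain=-1 -> uncertain labels are kept
    """
    if uncertain not in (-1, 0, 1):
        raise ValueError("uncertain must be -1, 0 or 1")
    return [[uncertain if value == -1 else value for value in row] for row in z]


# Toy matrix: one row per document, one column per rule.
z_raw = [[1, -1, 0], [-1, 0, 1]]
z_positive = resolve_uncertain(z_raw)         # [[1, 1, 0], [1, 0, 1]]
z_negative = resolve_uncertain(z_raw, 0)      # [[1, 0, 0], [0, 0, 1]]
z_unchanged = resolve_uncertain(z_raw, -1)    # [[1, -1, 0], [-1, 0, 1]]
```

Only the first two variants contain nothing but zeros and ones, which is what the other Knodle modules currently expect.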

In [173]:
# The labeler class is instantiated without passing a config.py file, so the default one is used.
labeler = Labeler()

# The label function is run, outputting the matrices X, T and Z.
labeler.label(transform_patterns=False, uncertain=-1, chexpert_bool=False)

NameError: name 'Labeler' is not defined

Unfortunately, because of NegBio, the code cannot be run from this notebook; it must be run as a Python file from the terminal. Please run the following commands in your terminal:

**0) Go to home directory**
`cd`

**1) Clone the repository**
`git clone https://github.com/ncbi-nlp/NegBio.git`

**2) Add the NegBio directory to your PYTHONPATH**
`export PYTHONPATH="${PYTHONPATH}:${HOME}/NegBio"`

**3) Create the virtual environment**
`cd ~/PycharmProjects/knodle/knodle/labeler/CheXpert`
`conda env create -f environment.yml`

**4) Activate the virtual environment**
`conda activate chexpert-label`

**5) Install NLTK data**
`python -m nltk.downloader universal_tagset punkt wordnet`

**6) Go to the root directory of the Knodle repository**
`cd ~/PycharmProjects/knodle`

**7) Convert the notebook to a Python file**
`jupyter nbconvert --to python examples/labeler/chexpert/chexpert_patterns_tutorial.ipynb`

**8) Run the converted Python file**
`python examples/labeler/chexpert/chexpert_patterns_tutorial.py`