# Introduction
This code automatically transforms an annotated corpus into a dataset in `csv` format.
While we do have a standard for annotations, instructions for adapting the code to your annotation syntax are available below.

If you find a problem or want to request a feature, please use [this form](https://github.com/nicofirst1/TranscriptionTagger/issues/new/choose).

# Setup

Run this cell to copy the necessary files into your Colab session.
If it prints something like:
```fatal: destination path 'TranscriptionTagger' already exists and is not an empty directory.```
you can ignore it.

In [None]:
!git clone https://github.com/nicofirst1/TranscriptionTagger
!pip install nltk
!pip install -e TranscriptionTagger/

Import necessary libraries

In [None]:
import csv
import json
import re
from collections import Counter
from copy import copy
from TranscriptionTagger.utils import remove_features, get_name, get_ngram, find_repetitions, multi_corpus_upload
from google.colab import files

# Custom files and settings

In this section we will define some settings for your transcription!
We will start with the variables you defined and move on to how to find them in your annotated corpus (REGEX).

## Variables

### Defining variables with JSON
When annotating your data, you will certainly use some variables. These variables may have a hierarchical structure, where one category includes many variations. To allow the program to find your variables you have to build a "dictionary" where you specify them.
Here, we use JSON files, which let you define as many categories and variables as you want in a clear and well-defined manner. Check out the [introduction to JSON tutorial](https://www.w3schools.com/js/js_json_intro.asp) if you are not familiar with it.
You can also look at the [dependent variables](./dependent_variables.json) we are using in this project.
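
As a rough illustration of the expected shape, a variable file maps each category name to its variations. The category and variable names below are invented for this sketch and are not taken from the project's files.

```python
import json

# Hypothetical variable file: every category maps to a list of variations.
# All names here are made up for illustration.
example_json = """
{
    "category_a": ["VAR1", "VAR2"],
    "category_b": ["VAR3"]
}
"""

example_variables = json.loads(example_json)
for category, variations in example_variables.items():
    print(category, "->", variations)
```

Any valid JSON object with this category-to-variations layout should work the same way.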

### Your files
Now that you are familiar with how the JSON syntax works, we need to define what your variables are. Here you have two options; depending on the option you choose, you will need to run different cells.

#### 1. Use the default variables file
At the moment there are two files in this code, one for [dependent variables](./dependent_variables.json) and one for [independent variables](./independent_variables.json). If you open these links you will see the original files, which cannot be modified there. To modify them you need to open them from this page: click on the folder symbol on the left of this code (check [this video](./includes/changing_variables.gif) for a how-to and point 4 of [this tutorial](https://neptune.ai/blog/google-colab-dealing-with-files) on how to access the local file system from Google Colab), then click on TranscriptionTagger (this will open the directory) and you will see the two files. Click on them to modify their content.

In [None]:
dependent_variable_path = 'TranscriptionTagger/dependent_variables.json'
independent_variable_path = 'TranscriptionTagger/independent_variables.json'

variable_files = []
for path in (dependent_variable_path, independent_variable_path):
    # read each file and close it right away
    with open(path, "r") as f:
        variable_files.append(f.read())
variable_files

#### 2. Upload your own files
Here you can upload your own variable files, as long as they are in JSON format. You can upload as many files as you want.

In [None]:
variable_files=files.upload()

for fn in variable_files.keys():
    if ".json" not in fn:
        raise FileNotFoundError(f"File {fn} is not a JSON file!")
    print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(variable_files[fn])))

variable_files=list(variable_files.values())

# decode
variable_files=[x.decode("utf8") for x in variable_files]
variable_files

## Regular Expressions (REGEX)

Now it's time to define the regular expressions (REGEX) used to find your annotations in the corpus.

In our project each annotation is contained in square brackets and starts with a dollar sign. The variables are divided by a dot (without spaces) and the last element after the dot is the annotated word (with spaces).
For example : `[$variable1.variable2.annotated word]`

To come up with a new tag REGEX you can use [regex101](https://regex101.com/). To see how it works, open the site, copy-paste the
content of `square_regex` (`(\[\$[\S ]*?\])`) into the regular expression bar (at the top) and a sample paragraph into the test string field (below it), e.g.:
```
S    akiid, akiid, bi l [$G-OTH.fooxinende], aa, yaʕni il jumʕa la bass ʕidna sabit uu aħħad iħna [$DEM-HAAY.haay] [$G-OTH.daayrakt] leen il alwaad ʕidna       aa, iða j jaww [$IA.kulliʃ] [$IA.zeen] insawwi maʃaawi, w iða j jaww mu [$IA.zeen], insawwi yaʕni l aklaat illi tijmaʕ il ʕaaʔila, tiðakkiriin iħna l ʕiraaqiyiin         id dooLMa, w is [$CK.simaC], (laughing), w il lamma l ħilwa w il aħfaad, fa insawwi [$IA.CK.hiiCi]          bass il yoom la yaʕni innu aani w il ħajji [$GQ.gaaʕdiin], akθar il marraat nugʕud iS SuBiħ [$IA.nitrayyag], baʕdeen il, il gahwa uu baʕdeen nuqʕud insoolif, inʃuuf [$IA.ʃinu] ʕidna maʃaariiʕ, niTLaʕ maθalan irruuħ nimʃi [$IA.fadd] niSS saaʕa saaʕa
```

Define the REGEXs:

In [21]:
# regex to find the complete annotation rule
square_regex = re.compile(r"(\[\$[\S ]*?\])")
# regex to find the content of an annotation
feat_regex = re.compile(r'\[\$([\S ]*?)\]')
# regex to find the token in the annotation
annotated_token_regex=re.compile(r"\.[\S ]*?\]")
# regex to unambiguously find the speaker name at the start of a paragraph
name_regex= re.compile(r"(^[A-Z]) ")
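
To see what two of these patterns return, here is a small self-contained sketch. It repeats the definitions of `square_regex` and `feat_regex` from the cell above and runs them on a shortened, ASCII-only sample line; the variable names `sample`, `full_tags`, `contents` and `parsed` are introduced only for this illustration.

```python
import re

# Same patterns as in the cell above, repeated so the example runs on its own
square_regex = re.compile(r"(\[\$[\S ]*?\])")
feat_regex = re.compile(r"\[\$([\S ]*?)\]")

sample = "S    bi l [$G-OTH.fooxinende], w is [$CK.simaC], fa insawwi [$IA.CK.hiiCi]"

# complete annotations, brackets and dollar sign included
full_tags = square_regex.findall(sample)
# annotation contents, without the surrounding "[$" and "]"
contents = feat_regex.findall(sample)
# split each content at the last dot: variables on the left, annotated token on the right
parsed = [tuple(c.rsplit(".", 1)) for c in contents]

print(full_tags)  # ['[$G-OTH.fooxinende]', '[$CK.simaC]', '[$IA.CK.hiiCi]']
print(parsed)     # [('G-OTH', 'fooxinende'), ('CK', 'simaC'), ('IA.CK', 'hiiCi')]
```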

## Corpus file
Finally, you need to upload your corpus file. This is the file containing your annotated tokens.


In [None]:
# DON'T USE THIS, IT IS JUST FOR DEBUGGING

if False:
    corpus_path="/Users/giulia/Downloads/Telegram Desktop/input.txt"
    with open(corpus_path, "r+", encoding="utf16") as f:
        corpus_text=f.read()


In [None]:
corpus_path = files.upload()
corpus_text=multi_corpus_upload(corpus_path)

print("\n\n First 400 characters:\n")
corpus_text[:400]

## Output file

In order to obtain the comma separated values (`csv`) file, you can specify the separator. By default the separator value is a comma (as the name implies), but you can also use:
- semicolon: `;`
- comma: `,`
- tab : `\t`

Be aware that having the same character in your corpus may break the `csv` visualization. We suggest using a symbol that does not appear in your corpus, and then setting the separator manually in the program you use to view the file (e.g. Excel).
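
For reference, Python's `csv` writer quotes any field that contains the separator, so the file itself stays parseable even when the separator also occurs in the text; whether a viewer displays it correctly is a separate matter. The sketch below uses made-up rows and an in-memory buffer instead of a real file.

```python
import csv
import io

# Made-up rows: the second field of the data row contains the separator itself
rows = [["token", "context"], ["zeen", "first part; second part"]]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerows(rows)
written = buffer.getvalue()
print(written)

# Reading back with the same separator restores the original fields
read_back = list(csv.reader(io.StringIO(written), delimiter=";"))
print(read_back == rows)  # True
```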

In [None]:
separator = ';'

# Main program
Time to start the main program. First, let us merge all the variables we loaded before.

In [None]:
variable_files=[json.loads(f) for f in variable_files]

variable_dict = {}

for j in variable_files:
    for k,v in j.items():
        if k in variable_dict.keys():
            print(f"Warning : I have found a duplicate variable named '{k}' with values '{v}'. "
                  f"If this is expected ignore this message, otherwise check the variables files for duplicates!")
        variable_dict[k]=v



# get an inverse of the dependent variable
idv = {}
for k, v in variable_dict.items():
    if isinstance(v, list):
        for i in v:
            idv[i] = k
    else:
        idv[v] = k

print("Correctly loaded all the variables. Check them out to see if there are any errors")
print(json.dumps(variable_dict,sort_keys=True, indent=4))
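
The inverse mapping built above turns "category to variations" into "variation to category", which is what allows a tag such as `VAR1` to be placed in the right column later on. A toy sketch with invented names (`toy_variables` and `toy_inverse` are hypothetical):

```python
# Invented categories; a value may be a list or a single string
toy_variables = {"category_a": ["VAR1", "VAR2"], "category_b": "VAR3"}

toy_inverse = {}
for category, values in toy_variables.items():
    # treat a single value like a one-element list
    if not isinstance(values, list):
        values = [values]
    for value in values:
        toy_inverse[value] = category

print(toy_inverse)
# {'VAR1': 'category_a', 'VAR2': 'category_a', 'VAR3': 'category_b'}
```

If the same variation were listed under two categories, the category seen last would overwrite the earlier one in this lookup.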

## Preprocessing transcriptions
Depending on your type of annotated corpus, you will need to preprocess the file.
Feel free to comment out (add `#` at the start of the line) any preprocessing that does not fit your criteria.
To give you a sense of our transcriptions, here is what the first few paragraphs look like:

```
B       marħaba ʕeeni ʃ axbaariC, iħna niʃakkariC ihwaaya lennahu, ey    fitaħtuulna beetkum uu istaqbaltuuna             hm, iħCiilna l yoom iʃ sawweeti, ʃinu Caan maʃruuʕiC aSLan, (laughing)     hm  ey  hm  ii bi l ʕaafya      ey  ahh  (laughing)  Caan huwwa imxaTTiT innu tijiin inti haaða l isbuuʕ
S           halow ħabiibti, halow ʕeeni, il ħam..., (...) baSiiTa itdallili uu haaða abSaT ʃii insawwiilkumiyaa            yaa miyyat hala biikum    il yoom? il yoom ma ʕindi ʃii, aa, mit... mittafqiin ʕal [$MAAL.mawʕid maalatkum] il yoom, fa [$SS-DIF.gaʕadt] iS Subiħ, aa sawweet [$OTH-SA.SS-DIF.ifTuur] aani w il ħajji, rayyagta ab... aLLa [$CK.yʕaafiiC] uu baʕdeen ijeet ʕala mawʕidna [??ihnaana], [$GQ.gaaʕda] antiDurkum, SaLLeet, ma ʕindi ʃii baʕad, da antiDurkum, (laughing)      ey liʔannahu mittafqiin ʕa l mawʕid fa ma [$GQ.nigdar] inɣayyra


B   eh, la, yaʕni aa, waħħad ʕan il mawʕid Caan inti yaʕni ʕindiC ɣeer barnaamij maθalan bi l fooxinende        ey  hm  ee, id doo...,(laughing), ii, (laughing), w il lamma l ħilwa, (laughing)   ee  ey          hm  ey  hm, ħeel zeen
S    akiid, akiid, bi l [$G-OTH.fooxinende], aa, yaʕni il jumʕa la bass ʕidna sabit uu aħħad iħna [$DEM-HAAY.haay] [$G-OTH.daayrakt] leen il alwaad ʕidna       aa, iða j jaww [$IA.kulliʃ] [$IA.zeen] insawwi maʃaawi, w iða j jaww mu [$IA.zeen], insawwi yaʕni l aklaat illi tijmaʕ il ʕaaʔila, tiðakkiriin iħna l ʕiraaqiyiin         id dooLMa, w is [$CK.simaC], (laughing), w il lamma l ħilwa w il aħfaad, fa insawwi [$IA.CK.hiiCi]          bass il yoom la yaʕni innu aani w il ħajji [$GQ.gaaʕdiin], akθar il marraat nugʕud iS SuBiħ [$IA.nitrayyag], baʕdeen il, il gahwa uu baʕdeen nuqʕud insoolif, inʃuuf [$IA.ʃinu] ʕidna maʃaariiʕ, niTLaʕ maθalan irruuħ nimʃi [$IA.fadd] niSS saaʕa saaʕa

```
As you can see, our file has a repeating structure of the kind:
- interviewer name (`B`), space, paragraph, newline
- interviewee name (`S`), space, paragraph, newline
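
The structure above can be sketched as follows. The helper `speaker_of` is hypothetical and only mimics what a speaker lookup with `name_regex` might do; it is not the implementation of `get_name` from the repository.

```python
import re

# Same pattern as name_regex: one capital letter at the start, followed by a space
name_regex = re.compile(r"(^[A-Z]) ")

def speaker_of(paragraph):
    """Hypothetical helper: return the speaker label, or '' if none is detected."""
    match = name_regex.search(paragraph)
    return match.group(1) if match else ""

raw = "B   eh, la\n\nS    akiid, akiid\n(laughing)\n"

# split on newlines, trim spaces, drop empty paragraphs
paragraphs = [p.strip() for p in raw.split("\n") if p.strip()]
speakers = [speaker_of(p) for p in paragraphs]

print(speakers)  # ['B', 'S', '']
```

The last paragraph has no detectable speaker, so it would be filtered out by the preprocessing step that keeps only paragraphs with a speaker.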

In [None]:
# step 1: split the whole corpus in different elements every new line, creating a list of paragraphs. For us this means splitting interviewer and interviewee in different paragraphs
corpus= corpus_text.split("\n")
# step 2: remove spaces at the start and end of each paragraph
corpus = [x.strip() for x in corpus]
# step 3 : remove empty paragraphs from the list
corpus = [x for x in corpus if x != '']
# step 4 : filter out all the paragraphs that do not have any DETECTED speaker
prev_c=copy(corpus)
corpus=[x for x in corpus if get_name(x,name_regex)]

removed_sentences=len(prev_c)-len(corpus)
print(f"I removed {removed_sentences} paragraphs, since I could not detect a speaker")
if removed_sentences>0:
    diff=set(prev_c)-set(corpus)
    print("the removed lines are: ")
    print(diff)
    print("\n\n")


# take a peek at the first three paragraphs
print("First paragraphs are")
for paragraph in corpus[:3]:
    print(paragraph)

### Interviewers and Interviewees
As you can see from the previous example, we consider a file with only one interviewer and one interviewee who alternate with each other. But you may have multiple interviewers and interviewees in random order. In this case we need to know the names of the interviewers in order to separate them from the interviewees.

If you need to add multiple interviewers run the following cell after adding the names


In [None]:
#interviewers = "name1,name2,...,nameN"
interviewers=""

Get the interviewers

In [None]:
# if user specified the interviewer's names then take that
if len(interviewers) > 0:
    interviewers = interviewers.split(',')
    # remove spaces
    interviewers = [x.strip() for x in interviewers]
else:
    # else use the first character of the transcription
    interviewers = [corpus[0][0]]
     # filter out empty names
    interviewers = [x for x in interviewers if x != '']
    interviewers=[interviewers[0]]

print(f"The selected interviewers are: {', '.join(interviewers)}")

### Output file settings
Now it's time to create the output files.

There are different outputs and here we specify their names.

In [None]:
# get the name of the output path
dataset_path="dataset.csv"
annoation_info_path="annotation_info.csv"
not_annotated_path="not_annotated_log.csv"

It can be useful to add the previous paragraph to the final csv file. For example, when examining an annotation, you may want to know what the previous speaker said before the current one. If you want this information in the final output, set `previous_line` to `True`; otherwise leave it `False`.

In [None]:
#previous_line = True
previous_line = False

Additionally, we include a context for every annotated token in your data. The context is made of the words before and after the annotated one, following an n-gram rule. Below you can set the size of your context:

In [None]:
# ngram_params = (previous words, next words)
ngram_params=(10,5)
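
To make the two numbers concrete, here is a minimal sketch of a context window. The function `context_window` is hypothetical and is not the implementation of `get_ngram`; it assumes whitespace-separated words and an exact match on the token.

```python
def context_window(token, paragraph, params):
    """Hypothetical sketch: return the token with `before` words to its left
    and `after` words to its right, or '' if the token is not found."""
    before, after = params
    words = paragraph.split()
    token_words = token.split()
    n = len(token_words)
    for i in range(len(words) - n + 1):
        # first position where the (possibly multi-word) token matches exactly
        if words[i:i + n] == token_words:
            start = max(0, i - before)
            return " ".join(words[start:i + n + after])
    return ""

print(context_window("d", "a b c d e f g h", (2, 1)))    # b c d e
print(context_window("d e", "a b c d e f g h", (1, 1)))  # c d e f
```

With `(10, 5)` the same idea would keep up to ten words before and five words after the annotated token.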

The output is a csv file and it needs a header. For this reason here we define the header as the following elements:
- the `token` for the annotated words
- the complete list of variables
- (Optional) the previous line
- The context in which the text was found
- an `unk` (unknown) category for variables that were found in the annotations but not present in the variable files (useful to catch some errors)

In [None]:
# start the header with one column per variable category
csv_header = list(variable_dict.keys())

# define the end of the csv
csv_end = ['context', 'unk']
if previous_line:
    csv_end.insert(0, 'previous line')
csv_header = ["token"] + csv_header + csv_end
csv_file = [csv_header]
unk_categories = []

print(f"The csv header looks like this")
csv_header

### Finding interviewees

This part looks for all the names present in the file

In [None]:

# get interviewer/interviewees names
names = [get_name(x, name_regex).strip() for x in corpus]
names = set(names)
# filter out empty names
names = [x for x in names if x != '']

# remove all mentions of interviewers from names
for i in interviewers:
    names = [x for x in names if i not in x]
interviewees = list(names)

# notify user about names
print(f"I found the following interviewees names: {', '.join(interviewees)}")

### Annotation information

Here we build our counter and logger for the annotations. We want to count how often each annotated token appears in the corpus and who speaks it. Moreover, for logging purposes, we want to report the tokens that are annotated somewhere but also appear outside the annotation rule.

In [None]:

# ### Finding all the annotated words
whole_corpus="\n".join(corpus)
annotations=feat_regex.findall(whole_corpus)
annotations=[x.split(".")[-1] for x in annotations]

# count the number of annotations
annotation_counter = Counter(annotations)
annotation_counter={k:dict(annotated=v) for k,v in annotation_counter.items()}

print(f"The number of distinct annotated tokens is {len(annotation_counter)}")

Now we check, for each annotated token, the number of times it appears inside the annotation format vs. outside.

In [None]:

# check if there are any annotations not annotated
for k,v in annotation_counter.items():
    # check for annotation repetitions
    wild_rep,ann_rep,_= find_repetitions(whole_corpus, k, annotated_token_regex)
    total_rep=wild_rep+ann_rep
    not_annotated=total_rep-v['annotated']
    annotation_counter[k]['not_annotated']=not_annotated
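
The idea of "inside vs. outside the annotation format" can be illustrated with a simplified counter. The function `count_occurrences` is hypothetical and does not reproduce `find_repetitions`; it assumes a token counts as "outside" when it appears as a whole word once all annotations are removed.

```python
import re

square_regex = re.compile(r"(\[\$[\S ]*?\])")
feat_regex = re.compile(r"\[\$([\S ]*?)\]")

def count_occurrences(text, token):
    """Hypothetical sketch: return (annotated, not_annotated) counts for a token."""
    # annotated: the token is the last dot-separated element of an annotation
    annotated = sum(
        1 for content in feat_regex.findall(text)
        if content.rsplit(".", 1)[-1] == token
    )
    # not annotated: whole-word matches left after removing every annotation
    stripped = square_regex.sub(" ", text)
    pattern = re.compile(r"(?<!\w)" + re.escape(token) + r"(?!\w)")
    not_annotated = len(pattern.findall(stripped))
    return annotated, not_annotated

text = "S zeen w [$IA.zeen] mu zeen, [$IA.kulliS]"
print(count_occurrences(text, "zeen"))  # (1, 2)
```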


While the previous cell counts all the tokens in the whole corpus, we probably want to differentiate between speakers. For example, we may not annotate tokens from interviewers at all and focus only on interviewees.

For this reason, here we need to specify where we want to look for the missed tokens.

In [None]:
# choose which speaker to check for annotations, you can uncomment one of the following lines:
# speakers=interviewers+interviewees
# speakers=interviewers
# speakers=["name1","name2",...,"nameN"]

speakers=interviewees

Now it's time to find those missing annotations

In [None]:
not_annotated_log={}

for k,v in annotation_counter.items():

    res=[find_repetitions(x, k, annotated_token_regex) for x in corpus if get_name(x,name_regex) in speakers]
    _,_,wild_not_annotated=zip(*res)

    # unfold the list
    wild_not_annotated=[item for sublist in wild_not_annotated for item in sublist]

    # if there are not annotated words
    if len(wild_not_annotated) > 0:
        not_annotated_log[k]=wild_not_annotated

Finally, some post-processing

In [None]:
# augment annotation_counter with speakers and add total number
for k in annotation_counter.keys():
    annotation_counter[k]['total'] = annotation_counter[k]['annotated'] + annotation_counter[k]['not_annotated']
    for speaker in interviewees+interviewers:
        annotation_counter[k][speaker] = 0


## Starting the main loop
This part starts the main loop.

In [None]:

# for every paragraph in the transcript
for idx in range(len(corpus)):
    c = corpus[idx]
    cur_speaker = get_name(c,name_regex)

    # add speaker related metrics
    for k, v in annotation_counter.items():
        wild_rep, ann_rep, _ = find_repetitions(c, k, annotated_token_regex)
        rep = wild_rep + ann_rep
        annotation_counter[k][cur_speaker] += rep

    # get the paragraph without features
    if cur_speaker in interviewees:
        sp = corpus[idx - 1]
    else:
        continue
    clean_p = remove_features(c, square_regex)

    # get the features
    tags = feat_regex.finditer(c)

    # for every tags with features in the paragraph
    for t in tags:
        # get index of result + tag
        index = t.start()
        t = t.group(1)

        # initialize empty row
        csv_line = ["" for _ in range(len(csv_header))]

        # get the features
        feats = t.rsplit(".", 1)
        text = feats[1]
        feats = feats[0]

        context=get_ngram(text,clean_p,ngram_params)

        # for every feature in the word
        for f in feats.split("."):
            # if the category is not present in the dict, then add to unk
            if f not in idv.keys():
                unk_categories.append(f)
                csv_line[-1] = csv_line[-1] + f + ","
            else:
                category = idv[f]
                cat_idx = csv_header.index(category)
                csv_line[cat_idx] = f

        # add initial infos and final unk to the line
        csv_line[0] = text
        csv_line[-2] = context
        if previous_line:
            csv_line[-3] = sp

        csv_line[-1] = csv_line[-1].strip(",")
        csv_file.append(csv_line)
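
To show how one annotation ends up as one csv row, here is a condensed sketch of the inner loop on toy data. The header, the lookup table and the function `build_row` are all invented for illustration; only the splitting logic follows the cell above.

```python
# Invented header and inverse lookup (variation -> category)
header = ["token", "category_a", "category_b", "context", "unk"]
inverse = {"VAR1": "category_a", "VAR3": "category_b"}

def build_row(tag_content, context):
    """Hypothetical sketch: turn 'VAR.VAR.annotated word' into a csv row."""
    # everything before the last dot is the variables, the rest is the token
    feats, text = tag_content.rsplit(".", 1)
    row = [""] * len(header)
    unknown = []
    for feat in feats.split("."):
        if feat in inverse:
            # known variation: write it into the column of its category
            row[header.index(inverse[feat])] = feat
        else:
            # unknown variation: collect it for the 'unk' column
            unknown.append(feat)
    row[0] = text
    row[-2] = context
    row[-1] = ",".join(unknown)
    return row

row = build_row("VAR1.XYZ.some word", "before some word after")
print(row)  # ['some word', 'VAR1', '', 'before some word after', 'XYZ']
```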

## Saving the output
Finally, we need to save the output in the csv file for all our results

In [None]:

# write the csv
with open(dataset_path, "w", newline="", encoding="utf16") as f:
    writer = csv.writer(f, delimiter=separator)
    writer.writerows(csv_file)


# generate the annotation info file
header=["token"]+list(list(annotation_counter.values())[0].keys())

with open(annoation_info_path, "w", newline="", encoding="utf16") as f:
    writer = csv.writer(f, delimiter=separator)
    writer.writerow(header)

    for k,v in annotation_counter.items():
        writer.writerow([k]+list(v.values()))


# save the not annotated log
# default=0 avoids an error when no missed annotation was found
header=["token"]+[f"context {i}" for i in range(max((len(x) for x in not_annotated_log.values()), default=0))]
with open(not_annotated_path, "w", newline="", encoding="utf16") as f:
    writer = csv.writer(f, delimiter=separator)
    writer.writerow(header)
    for k, v in not_annotated_log.items():
        writer.writerow([k]+v)
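
Because the files are written with `utf16` encoding and the chosen separator, both have to be supplied when reading them back in Python. A minimal round-trip sketch, using a temporary file and made-up rows rather than the real output:

```python
import csv
import os
import tempfile

# Made-up rows and a throwaway path, for illustration only
rows = [["token", "context"], ["zeen", "mu zeen"]]
path = os.path.join(tempfile.mkdtemp(), "dataset.csv")

# write with the same settings used for the output files above
with open(path, "w", newline="", encoding="utf16") as f:
    csv.writer(f, delimiter=";").writerows(rows)

# read back with the same encoding and separator
with open(path, newline="", encoding="utf16") as f:
    loaded = list(csv.reader(f, delimiter=";"))

print(loaded == rows)  # True
```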

If you want to download the output files right away, run this cell:

In [None]:
files.download(dataset_path)
files.download(annoation_info_path)
files.download(not_annotated_path)

### Unk categories
Here, we print the unknown categories, if we found any.

In [None]:

if len(unk_categories) > 0:
    unk_categories = set(unk_categories)
    unk_categories = sorted(unk_categories)
    print(
        f"I have found several categories not listed in your variable file.\n"
        f"Following in alphabetical order:")
    for idx,c in enumerate(unk_categories):
        print(idx,f"'{c}'")