# Import necessary libraries

In [4]:
import csv
import json
import re
from copy import copy
from google.colab import files
import os


# Custom files and settings

In this section we will define some settings for your transcription!
We will start with the variables you defined and move on to how to find them in your annotated corpus (REGEX).

## Variables

### Defining variables with JSON
When annotating your data, you will use a set of variables. These variables may have a hierarchical structure, where one category includes many variations. To allow the program to find your variables, you have to build a "dictionary" in which you specify them.
Here, we use JSON files, which let you define as many categories and variables as you want in a clear and structured manner. Check out the [introduction to JSON tutorial](https://www.w3schools.com/js/js_json_intro.asp) if you are not familiar with it.
You can also look at the [dependent variables](./dependent_variables.json) we are using in this project.
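
As a rough illustration of the structure described above, the sketch below parses a small variables dictionary with Python's `json` module. The category names (`category_A`, `category_B`) and the way tags are assigned to them are hypothetical and are not taken from the project's files.

```python
import json

# Hypothetical variables file content: each key is a category and each value
# lists the tags that belong to it (a single tag may also be a plain string).
example_json = """
{
    "category_A": ["IA", "CK"],
    "category_B": "GQ"
}
"""

variables = json.loads(example_json)
for category, tags in variables.items():
    print(category, "->", tags)
```

A category can hold either a list of values or a single value, so code that reads the file has to handle both cases.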

### Your files
Now that you are familiar with how the JSON syntax works, we need to define what your variables are. You have two options, and the cells you need to run depend on the option you choose.

#### 1. Use the default variables file
At the moment there are two files in this code, one for [dependent variables](./dependent_variables.json) and one for [independent variables](./independent_variables.json). These links open the original files, which cannot be modified. To edit them, open them from this web page on the top right (see point 4 of [this tutorial](https://neptune.ai/blog/google-colab-dealing-with-files) on how to access the local file system from Google Colab); you can find the files inside the `TranscriptionTagger` directory.

In [None]:
dependent_variable_path = 'TranscriptionTagger/dependent_variables.json'
independent_variable_path = 'TranscriptionTagger/independent_variables.json'

variable_files=[dependent_variable_path, independent_variable_path]
variable_files=[open(path, "r+") for path in variable_files]


#### 2. Upload your own files
Here you can upload your own variable files, as long as they are in JSON format. You can upload as many files as you want.

In [3]:
variable_files=files.upload()

for fn in variable_files.keys():
    if ".json" not in fn:
        raise FileNotFoundError(f"File {fn} is not a JSON file!")
    print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(variable_files[fn])))

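
Checking the file name alone does not guarantee that the content is valid JSON. The sketch below is one possible stricter check, assuming (as in the cell above) that the upload result is a dictionary mapping file names to their raw content. The helper `check_uploaded_json` and the `fake_upload` data are hypothetical and only stand in for a real upload.

```python
import json
import os

def check_uploaded_json(uploaded):
    """Parse every uploaded file, raising ValueError for non-JSON input."""
    parsed = {}
    for name, content in uploaded.items():
        # compare the real extension instead of searching for ".json" anywhere
        if os.path.splitext(name)[1].lower() != ".json":
            raise ValueError(f"File {name} does not have a .json extension")
        try:
            parsed[name] = json.loads(content)
        except json.JSONDecodeError as err:
            raise ValueError(f"File {name} is not valid JSON: {err}") from err
    return parsed

# Stand-in for the upload result; no real upload happens here.
fake_upload = {"my_variables.json": b'{"category_A": ["IA", "CK"]}'}
parsed = check_uploaded_json(fake_upload)
print(parsed)
```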

## Regular Expressions (REGEX)

Now it's time to define the regular expressions (REGEX) that find your annotations in the corpus.

In our project each annotation is contained in square brackets and starts with a dollar sign. The variables are divided by a dot (without spaces) and the last element after the dot is the annotated word (with spaces).
For example: `[$variable1.variable2.annotated word]`
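
To make the format concrete, here is a minimal sketch of how such a tag can be split into its variables and its annotated word. The helper name `split_tag` is hypothetical and is not part of the notebook.

```python
def split_tag(tag):
    """Split '[$var1.var2.annotated word]' into (['var1', 'var2'], 'annotated word')."""
    inner = tag[2:-1]                        # drop the leading '[$' and the trailing ']'
    variables, word = inner.rsplit(".", 1)   # the last dot separates the annotated word
    return variables.split("."), word

variables, word = split_tag("[$variable1.variable2.annotated word]")
print(variables, word)
```

A tag without any dot cannot be unpacked this way and raises a `ValueError`, which is why every tag needs at least one variable before the word.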

To come up with a new tag REGEX you can use [regex101](https://regex101.com/). To see how it works, open the site, copy-paste the content of `square_regex` (`(\[\$[\S ]*?\])`) into the regular expression bar (at the top), and paste a sample paragraph into the test string field (at the bottom), e.g.:
```
S    akiid, akiid, bi l [$G-OTH.fooxinende], aa, yaʕni il jumʕa la bass ʕidna sabit uu aħħad iħna [$DEM-HAAY.haay] [$G-OTH.daayrakt] leen il alwaad ʕidna       aa, iða j jaww [$IA.kulliʃ] [$IA.zeen] insawwi maʃaawi, w iða j jaww mu [$IA.zeen], insawwi yaʕni l aklaat illi tijmaʕ il ʕaaʔila, tiðakkiriin iħna l ʕiraaqiyiin         id dooLMa, w is [$CK.simaC], (laughing), w il lamma l ħilwa w il aħfaad, fa insawwi [$IA.CK.hiiCi]          bass il yoom la yaʕni innu aani w il ħajji [$GQ.gaaʕdiin], akθar il marraat nugʕud iS SuBiħ [$IA.nitrayyag], baʕdeen il, il gahwa uu baʕdeen nuqʕud insoolif, inʃuuf [$IA.ʃinu] ʕidna maʃaariiʕ, niTLaʕ maθalan irruuħ nimʃi [$IA.fadd] niSS saaʕa saaʕa
```

Define the REGEXs:

In [None]:
square_regex = re.compile(r"(\[\$[\S ]*?\])")
feat_regex = re.compile(r'\[\$([\S ]*?)\]')
sequence_regex = re.compile(r"({[\S ]+})")
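
As a quick check, the sketch below applies the first two patterns to a short excerpt of the sample paragraph. `square_regex` captures the whole tag including the brackets, while `feat_regex` captures only what is inside them.

```python
import re

square_regex = re.compile(r"(\[\$[\S ]*?\])")
feat_regex = re.compile(r'\[\$([\S ]*?)\]')

sample = "iða j jaww [$IA.kulliʃ] [$IA.zeen] insawwi maʃaawi"

full_tags = square_regex.findall(sample)      # tags with brackets and dollar sign
tag_contents = feat_regex.findall(sample)     # only the text inside the brackets

print(full_tags)
print(tag_contents)
```

The `*?` quantifier is lazy, so each match stops at the first closing bracket instead of swallowing several tags at once.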

## Transcription file
Finally, you need to upload your transcription file. This is the file containing your annotated corpus.


In [None]:
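# files.upload() returns a dictionary keyed by file name, so keep the first name as the path
transcription_path = next(iter(files.upload()))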

## Output file

When writing the comma-separated values (`csv`) file, you can specify the separator. By default the separator is a comma (as the name implies), but you can choose among:
- semicolon: `;`
- comma: `,`
- tab: `\t`

In our project we use the tab, since it is the separator that allows Excel to display the file (for Germany the default value is tab).

In [None]:
separator = '\t'
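
The effect of the separator can be seen with a small sketch that writes two rows to an in-memory buffer instead of a file. The row contents are made up for illustration.

```python
import csv
import io

separator = "\t"
rows = [["text", "category_A"], ["zeen", "IA"]]   # hypothetical header and data row

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=separator)
writer.writerows(rows)

output = buffer.getvalue()
print(repr(output))
```

Changing `separator` to `";"` or `","` changes only the character placed between the fields, and the rest of the code stays the same.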

# Helper functions
The following helper functions are used later in the code.

In [None]:
def remove_features(corpus):
    """
    Remove the features from the corpus, keeping only the annotated words
    """
    words = square_regex.findall(corpus)
    for w in words:
        try:
            # keep the text after the last dot and drop the closing bracket
            text = w.rsplit(".", 1)[1][:-1]
        except IndexError:
            # a tag without a dot cannot be split into features and word
            raise ValueError(f"I found an error for the tag '{w}'. Maybe it does not have a dot in it?\n"
                             f"Please check the tag and try again.")
        corpus = corpus.replace(w, text)
    return corpus

In [None]:
def get_name(line):
    return line.split(" ")[0]
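
To show what these helpers produce, the sketch below strips the tags from one line and extracts the speaker name. It uses a compact `re.sub` variant (`strip_tags`, a hypothetical name) rather than the notebook's own `remove_features`, and the sample line is a shortened excerpt.

```python
import re

square_regex = re.compile(r"(\[\$[\S ]*?\])")

def strip_tags(line):
    """Replace every '[$features.word]' tag with just its word."""
    # for each tag, keep the text after the last dot and drop the closing ']'
    return square_regex.sub(lambda m: m.group(1).rsplit(".", 1)[1][:-1], line)

line = "S insawwi [$IA.CK.hiiCi] bass"
cleaned = strip_tags(line)
speaker = line.split(" ")[0]   # same logic as get_name

print(cleaned)
print(speaker)
```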

# Main program
Time to start the main program. First, let us merge all the variables we uploaded before.

In [None]:
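# load every variable file: handles opened files (option 1) and uploaded content (option 2)
if isinstance(variable_files, dict):
    variable_files = list(variable_files.values())
variable_files = [json.loads(f) if isinstance(f, (bytes, str)) else json.load(f)
                  for f in variable_files]

variable_dict = {}

# the loop variable must not be called 'json', or it would shadow the json module used below
for variables in variable_files:
    for k, v in variables.items():
        if k in variable_dict:
            print(f"Warning : I have found a duplicate variable named '{k}' with values '{v}'. "
                  f"If this is expected ignore this message, otherwise check the variables files for duplicates!")
        variable_dict[k] = v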



# build the inverse mapping: from each variable value to its category
idv = {}
for k, v in variable_dict.items():
    if isinstance(v, list):
        for i in v:
            idv[i] = k
    else:
        idv[v] = k

print("Correctly loaded all the variables. Check them out to see if there are any errors")
print(json.dumps(variable_dict))
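
The inverse mapping can be illustrated on a toy dictionary. The category names below are hypothetical; the point is only that each value ends up pointing back to the category that contains it.

```python
# Hypothetical merged variables: one category with a list, one with a single value
variable_dict = {"category_A": ["IA", "CK"], "category_B": "GQ"}

idv = {}
for category, values in variable_dict.items():
    if not isinstance(values, list):
        values = [values]          # treat a single value like a one-element list
    for value in values:
        idv[value] = category

print(idv)
```

With this lookup, finding the column of a tag is a single dictionary access, for example `idv["CK"]`.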

In [None]:

# get the name of the output path
output_path=os.path.basename(transcription_path)
output_path=os.path.splitext(output_path)[0]+"_output.csv"

# open the file
with open(transcription_path, 'r', encoding="utf16") as f:
    trans = f.readlines()
trans = trans[2:]
trans = [x.strip() for x in trans]
trans = [x for x in trans if x != '']

# ask for the interviewers' names
interviewers = input('Add the name(s) of the interviewer(s) (separated by comma), '
                     'leave empty for classical interviewer-interviewees structure: ')
print("\n")
if len(interviewers) > 0:
    interviewers = [name.strip() for name in interviewers.split(',') if name.strip()]
else:
    # default: the first character of the first line identifies the interviewer
    interviewers = [trans[0][0]]

# ask whether to include the previous line
previous_line = input('When generating the final csv file, I can also include the speaker utterance.'
                      ' Do you want me to include it? (y/n): ')
print("\n")
previous_line = previous_line.strip().lower() == 'y'

# get speak/list names
names = [get_name(x).strip() for x in trans]
names = set(names)

# remove all mentions of interviewers in names
for i in interviewers:
    names = [x for x in names if i not in x]
interviewees = list(names)

# notify user about names
print(f"I found the following names: {', '.join(interviewees)}")

# build the csv header from the variable categories
csv_header = list(variable_dict.keys())

# define the end of the csv
csv_end = ['sequence in sentence', 'unk']
if previous_line:
    csv_end.insert(0, 'previous line')
csv_header = ["text"] + csv_header + csv_end
csv_file = [csv_header]
unk_categories = []

# for every paragraph in the transcript
for idx in range(len(trans)):
    c = trans[idx]

    # get the paragraph without features
    if get_name(c) in interviewees:
        sp = trans[idx - 1]
    else:
        continue
    clean_p = remove_features(c)

    # capture all the sequences
    sequences = sequence_regex.finditer(clean_p)
    sequences = [(x.start(), x.end(), x.group()) for x in sequences]

    # get the features
    tags = feat_regex.finditer(c)

    # for every tags with features in the paragraph
    for t in tags:
        # get index of result + tag
        index = t.start()
        t = t.group(1)

        # initialize empty row
        csv_line = ["" for _ in range(len(csv_header))]

        # get the features
        feats = t.rsplit(".", 1)
        text = feats[1]
        feats = feats[0]

        # for every feature in the word
        for f in feats.split("."):
            # if the category is not present in the dict, then add to unk
            if f not in idv.keys():
                unk_categories.append(f)
                csv_line[-1] = csv_line[-1] + f + ","
            else:
                category = idv[f]
                cat_idx = csv_header.index(category)
                csv_line[cat_idx] = f

        # add initial infos and final unk to the line
        csv_line[0] = text
        if previous_line:
            csv_line[-3] = sp

        # add the sequence to the line
        if len(sequences) != 0:
            # check every captured sequence, not only the first one
            for seq_start, seq_end, seq in sequences:
                if seq_start < index < seq_end:
                    seq = seq.replace("{", "").replace("}", "")
                    csv_line[-2] = seq
        csv_line[-1] = csv_line[-1].strip(",")
        csv_file.append(csv_line)

# write the csv
with open(output_path, "w", newline="", encoding="utf16") as f:
    writer = csv.writer(f, delimiter=separator)
    writer.writerows(csv_file)
print(f"Done!\nFile has been saved in '{output_path}'")
if len(unk_categories) > 0:
    unk_categories = sorted(set(unk_categories))
    print("I have found several categories not listed in the variable files.\n"
          "Following in alphabetical order:")
    for c in unk_categories:
        print(c.strip())
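
The row-building step of the loop above can be illustrated in isolation. The sketch below turns the content of one tag into one csv row, using a hypothetical header and inverse mapping; `tag_to_row` is not a function of the notebook, and `XYZ` is a made-up feature that is missing from the variables.

```python
# Hypothetical header and inverse mapping (value -> category)
csv_header = ["text", "category_A", "category_B", "sequence in sentence", "unk"]
idv = {"IA": "category_A", "CK": "category_A", "GQ": "category_B"}

def tag_to_row(tag_content):
    """Build one csv row from the text inside a tag, e.g. 'GQ.gaaʕdiin'."""
    features, word = tag_content.rsplit(".", 1)
    row = [""] * len(csv_header)
    row[0] = word
    unknown = []
    for feature in features.split("."):
        if feature in idv:
            # known feature: write it into the column of its category
            row[csv_header.index(idv[feature])] = feature
        else:
            # unknown feature: collect it for the last column
            unknown.append(feature)
    row[-1] = ",".join(unknown)
    return row

row = tag_to_row("GQ.XYZ.gaaʕdiin")
print(row)
```

Since each category has a single column, two features of the same category in one tag overwrite each other in this sketch, and only the last one is kept.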