In [47]:
!git clone https://github.com/nicofirst1/TranscriptionTagger

Cloning into 'TranscriptionTagger'...
remote: Enumerating objects: 56, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 56 (delta 30), reused 26 (delta 0), pack-reused 0
Receiving objects: 100% (56/56), 38.21 KiB | 1.66 MiB/s, done.
Resolving deltas: 100% (30/30), done.


Import necessary libraries

In [61]:
import csv
import json
import re
from copy import copy
import os
from google.colab import files

ModuleNotFoundError: No module named 'google'

# Custom files and settings

In this section we will define some settings for your transcription!
We will start with the variables you defined and move on to how to find them in your annotated corpus (REGEX).

## Variables

### Defining variables with JSON
When annotating your data, you will use a set of variables. These variables may have a hierarchical structure, where one category includes several variants. For the program to find your variables, you have to build a "dictionary" that lists them.
Here we use JSON files, which let you define as many categories and variables as you want in a clear and structured way. Check out the [introduction to JSON tutorial](https://www.w3schools.com/js/js_json_intro.asp) if you are not familiar with it.
You can also look at the [dependent variables](./dependent_variables.json) we are using in this project.
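
As a small illustration, the sketch below parses a two-entry excerpt of a variables file with Python's `json` module. The excerpt only shows the two shapes a category can take, a single tag (string) or several tags (list); the names `example_json` and `example_variables` are made up for this example.

```python
import json

# Two-entry excerpt: one category with a single tag, one with several tags.
example_json = """
{
  "Typ IA": "IA",
  "K/Č": ["KC", "CK"]
}
"""

example_variables = json.loads(example_json)

for category, tags in example_variables.items():
    # treat a single tag as a one-element list so both shapes are handled alike
    tag_list = tags if isinstance(tags, list) else [tags]
    print(f"{category}: {tag_list}")
```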

### Your files
Now that you are familiar with the JSON syntax, we need to define what your variables are. You have two options; depending on the one you choose, you will need to run different cells.

#### 1. Use the default variables file
At the moment there are two files in this code, one for [dependent variables](./dependent_variables.json) and one for [independent variables](./independent_variables.json). The links above show the original files, which cannot be modified. To edit them, open the file browser of this web page (top right; see point 4 of [this tutorial](https://neptune.ai/blog/google-colab-dealing-with-files) on accessing the local file system from Google Colab): the files are inside the `TranscriptionTagger` directory.

In [50]:
dependent_variable_path = 'TranscriptionTagger/dependent_variables.json'
independent_variable_path = 'TranscriptionTagger/independent_variables.json'

variable_files = []
for path in (dependent_variable_path, independent_variable_path):
    # read-only access is enough; the context manager closes the file afterwards
    with open(path, "r", encoding="utf8") as f:
        variable_files.append(f.read())
variable_files

['{\n  "Typ IA": "IA",\n  "Oth L" : "OTH-L",\n  "SA":  "OTH-SA",\n  "German" : ["G-SCHOOL", "G-JOB", "G-FRIE", "G-GER", "G-COV", "G-OTH"],\n  "Relig Phrase": "RELIG",\n  "Oth Variety": "OTH-DIAL",\n  "Demonstratives": ["DEM-HA-BEG", "DEM-A-END", "DEM-NO-HA-BEG", "DEM-NO-A-END", "DEM-HAL", "DEM-HAAY", "DEM-HAAYA", "DEM-HA", "DEM-I-END", "DEM-DHOOLE"],\n  "maal": ["MAAL", "NO-MAAL"],\n  "mu": "MU",\n  "Präfix" : ["DA", "JAAY", "GAAM"],\n  "Suffix": ["NO-N", "SUF-IINA", "SUF-NO-IINA"],\n  "K/Č": ["KC", "CK"],\n  "Q/G": ["QG", "GQ"],\n  "Q/K" : ["QK", "KQ"],\n  "Syllable Structure": ["SS-DIF", "SS-LEV"]\n\n}',
 '{\n  "Something": "something else"\n\n}']

#### 2. Upload your own files
Here you can upload your own variable files, as long as they are in JSON format. You can upload as many files as you want.

In [None]:
variable_files=files.upload()

for fn in variable_files.keys():
    # check the file extension rather than any occurrence of ".json" in the name
    if not fn.lower().endswith(".json"):
        raise FileNotFoundError(f"File {fn} is not a JSON file!")
    print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(variable_files[fn])))

variable_files=list(variable_files.values())

# decode
variable_files=[x.decode("utf8") for x in variable_files]
variable_files

## Regular Expressions (REGEX)

Now it's time to define the regular expressions (REGEX) used to find your annotations in the corpus.

In our project each annotation is contained in square brackets and starts with a dollar sign. The variables are divided by a dot (without spaces) and the last element after the dot is the annotated word (with spaces).
For example : `[$variable1.variable2.annotated word]`

To come up with a new tag REGEX you can use [regex101](https://regex101.com/). To see how it works, open the site, copy-paste the content of `square_regex` (`(\[\$[\S ]*?\])`) into the regular expression bar (at the top) and a sample paragraph into the test string field (at the bottom), e.g.:
```
S    akiid, akiid, bi l [$G-OTH.fooxinende], aa, yaʕni il jumʕa la bass ʕidna sabit uu aħħad iħna [$DEM-HAAY.haay] [$G-OTH.daayrakt] leen il alwaad ʕidna       aa, iða j jaww [$IA.kulliʃ] [$IA.zeen] insawwi maʃaawi, w iða j jaww mu [$IA.zeen], insawwi yaʕni l aklaat illi tijmaʕ il ʕaaʔila, tiðakkiriin iħna l ʕiraaqiyiin         id dooLMa, w is [$CK.simaC], (laughing), w il lamma l ħilwa w il aħfaad, fa insawwi [$IA.CK.hiiCi]          bass il yoom la yaʕni innu aani w il ħajji [$GQ.gaaʕdiin], akθar il marraat nugʕud iS SuBiħ [$IA.nitrayyag], baʕdeen il, il gahwa uu baʕdeen nuqʕud insoolif, inʃuuf [$IA.ʃinu] ʕidna maʃaariiʕ, niTLaʕ maθalan irruuħ nimʃi [$IA.fadd] niSS saaʕa saaʕa
```
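
The same check can also be done directly in Python. The sketch below applies the two tag patterns used in this notebook to a shortened version of the sample sentence; the variable names `sample`, `whole_tags` and `tag_bodies` are only illustrative.

```python
import re

# the same tag patterns as in this notebook
square_regex = re.compile(r"(\[\$[\S ]*?\])")
feat_regex = re.compile(r"\[\$([\S ]*?)\]")

# shortened sample sentence containing two annotations
sample = "fa insawwi [$IA.CK.hiiCi] bass il yoom [$GQ.gaaʕdiin]"

# square_regex returns the whole tag, brackets and dollar sign included
whole_tags = square_regex.findall(sample)
# feat_regex returns only what is inside the tag
tag_bodies = feat_regex.findall(sample)

print(whole_tags)
print(tag_bodies)
```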

Define the REGEXs:

In [51]:
square_regex = re.compile(r"(\[\$[\S ]*?\])")
feat_regex = re.compile(r'\[\$([\S ]*?)\]')
sequence_regex = re.compile(r"({[\S ]+})")

## Transcription file
Finally, you need to upload your transcription file. This is the file containing your annotated corpus.


In [70]:
# DON'T USE THIS, IT IS JUST FOR DEBUGGING

if True:
    transcription_path="/home/dizzi/Downloads/Aya.txt"
    # the file is only read, so read-only mode is enough
    with open(transcription_path, "r", encoding="utf16") as f:
        transcription_text=f.read()


In [None]:
uploaded = files.upload()

assert len(uploaded) == 1, "Support for multiple transcription files is not available! Please upload just one file"
# keep the file name (used later to name the output file) and decode the content
transcription_path, transcription_text = list(uploaded.items())[0]
transcription_text = transcription_text.decode("utf16")

# print the first 400 characters of the transcription file
print("\n\n First 400 characters:\n")
transcription_text[:400]

## Output file

The output is a comma separated values (`csv`) file, for which you can specify the separator. By default the separator is a comma (as the name implies), but you can choose among:
- semicolon: `;`
- comma: `,`
- tab : `\t`

In our project we use the tab, since it is the separator that lets Excel display the file correctly (for Germany the default value is tab).
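
To see what the separator changes in practice, here is a minimal sketch that writes the same two rows with different separators using the standard `csv` module. The helper `to_csv_text` and the example rows are hypothetical and only serve the illustration.

```python
import csv
import io

# made-up rows: a header and one annotated word
rows = [["text", "Typ IA"], ["zeen", "IA"]]

def to_csv_text(rows, separator):
    """Write the rows to an in-memory buffer and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=separator, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

tab_text = to_csv_text(rows, "\t")
semicolon_text = to_csv_text(rows, ";")

# repr makes the tab characters visible
print(repr(tab_text))
print(repr(semicolon_text))
```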

In [53]:
separator = '\t'

# Helper functions
The following are some helper functions that we will use later in the code.

In [55]:
def remove_features(corpus):
    """
    Remove the features from the corpus, keeping only the annotated words
    """
    words = square_regex.findall(corpus)
    for w in words:
        try:
            # keep the text after the last dot, without the closing bracket
            text = w.rsplit(".", 1)[1][:-1]
        except IndexError:
            # stop with a clear message instead of killing the kernel with exit()
            raise ValueError(f"I found an error for the tag '{w}'. Maybe it does not have a point in it?\n"
                             f"Please check the tag and try again.")
        corpus = corpus.replace(w, text)
    return corpus

def get_name(line):
    return line.split(" ")[0]
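
The effect of removing the features can be illustrated on a short sample. The sketch below is an alternative formulation based on `re.sub`, not the function defined above; the helper name `strip_tag` is hypothetical.

```python
import re

# the same tag pattern as in this notebook
square_regex = re.compile(r"(\[\$[\S ]*?\])")

def strip_tag(match):
    # keep only the text after the last dot, without the closing bracket
    return match.group(0).rsplit(".", 1)[1][:-1]

sample = "w is [$CK.simaC], fa insawwi [$IA.CK.hiiCi]"
cleaned = square_regex.sub(strip_tag, sample)
print(cleaned)
```

As in the function above, a tag without a dot makes the `rsplit` lookup fail with an `IndexError`.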

# Main program
Time to start the main program. First, let us merge all the variables we uploaded before.

In [56]:
variable_files=[json.loads(f) for f in variable_files]

variable_dict = {}

for j in variable_files:
    for k,v in j.items():
        if k in variable_dict.keys():
            print(f"Warning : I have found a duplicate variable named '{k}' with values '{v}'. "
                  f"If this is expected ignore this message, otherwise check the variables files for duplicates!")
        variable_dict[k]=v



# build the inverse lookup of all variables: tag -> category
idv = {}
for k, v in variable_dict.items():
    if isinstance(v, list):
        for i in v:
            idv[i] = k
    else:
        idv[v] = k

print("Correctly loaded all the variables. Check them out to see if there are any errors")
print(json.dumps(variable_dict,sort_keys=True, indent=4))

Correctly loaded all the variables. Check them out to see if there are any errors
{
    "Demonstratives": [
        "DEM-HA-BEG",
        "DEM-A-END",
        "DEM-NO-HA-BEG",
        "DEM-NO-A-END",
        "DEM-HAL",
        "DEM-HAAY",
        "DEM-HAAYA",
        "DEM-HA",
        "DEM-I-END",
        "DEM-DHOOLE"
    ],
    "German": [
        "G-SCHOOL",
        "G-JOB",
        "G-FRIE",
        "G-GER",
        "G-COV",
        "G-OTH"
    ],
    "K/\u010c": [
        "KC",
        "CK"
    ],
    "Oth L": "OTH-L",
    "Oth Variety": "OTH-DIAL",
    "Pr\u00e4fix": [
        "DA",
        "JAAY",
        "GAAM"
    ],
    "Q/G": [
        "QG",
        "GQ"
    ],
    "Q/K": [
        "QK",
        "KQ"
    ],
    "Relig Phrase": "RELIG",
    "SA": "OTH-SA",
    "Something": "something else",
    "Suffix": [
        "NO-N",
        "SUF-IINA",
        "SUF-NO-IINA"
    ],
    "Syllable Structure": [
        "SS-DIF",
        "SS-LEV"
    ],
    "Typ IA": "IA",
    "maal": [
    

## Preprocessing transcriptions
Depending on your type of annotated corpus (aka transcription), you will need to preprocess the file. Here we have 4 steps.
Feel free to comment out (add `#` at the start of the line) any preprocessing step that does not fit your criteria.
To give you a sense of our transcriptions, here is what the beginning of one of our files looks like:

```
file:///aishug294879ryshfda9763afo8947a5gf
2022 Oct 21, Fri 10:30

B       marħaba ʕeeni ʃ axbaariC, iħna niʃakkariC ihwaaya lennahu, ey    fitaħtuulna beetkum uu istaqbaltuuna             hm, iħCiilna l yoom iʃ sawweeti, ʃinu Caan maʃruuʕiC aSLan, (laughing)     hm  ey  hm  ii bi l ʕaafya      ey  ahh  (laughing)  Caan huwwa imxaTTiT innu tijiin inti haaða l isbuuʕ
S           halow ħabiibti, halow ʕeeni, il ħam..., (...) baSiiTa itdallili uu haaða abSaT ʃii insawwiilkumiyaa            yaa miyyat hala biikum    il yoom? il yoom ma ʕindi ʃii, aa, mit... mittafqiin ʕal [$MAAL.mawʕid maalatkum] il yoom, fa [$SS-DIF.gaʕadt] iS Subiħ, aa sawweet [$OTH-SA.SS-DIF.ifTuur] aani w il ħajji, rayyagta ab... aLLa [$CK.yʕaafiiC] uu baʕdeen ijeet ʕala mawʕidna [??ihnaana], [$GQ.gaaʕda] antiDurkum, SaLLeet, ma ʕindi ʃii baʕad, da antiDurkum, (laughing)      ey liʔannahu mittafqiin ʕa l mawʕid fa ma [$GQ.nigdar] inɣayyra


B   eh, la, yaʕni aa, waħħad ʕan il mawʕid Caan inti yaʕni ʕindiC ɣeer barnaamij maθalan bi l fooxinende        ey  hm  ee, id doo...,(laughing), ii, (laughing), w il lamma l ħilwa, (laughing)   ee  ey          hm  ey  hm, ħeel zeen
S    akiid, akiid, bi l [$G-OTH.fooxinende], aa, yaʕni il jumʕa la bass ʕidna sabit uu aħħad iħna [$DEM-HAAY.haay] [$G-OTH.daayrakt] leen il alwaad ʕidna       aa, iða j jaww [$IA.kulliʃ] [$IA.zeen] insawwi maʃaawi, w iða j jaww mu [$IA.zeen], insawwi yaʕni l aklaat illi tijmaʕ il ʕaaʔila, tiðakkiriin iħna l ʕiraaqiyiin         id dooLMa, w is [$CK.simaC], (laughing), w il lamma l ħilwa w il aħfaad, fa insawwi [$IA.CK.hiiCi]          bass il yoom la yaʕni innu aani w il ħajji [$GQ.gaaʕdiin], akθar il marraat nugʕud iS SuBiħ [$IA.nitrayyag], baʕdeen il, il gahwa uu baʕdeen nuqʕud insoolif, inʃuuf [$IA.ʃinu] ʕidna maʃaariiʕ, niTLaʕ maθalan irruuħ nimʃi [$IA.fadd] niSS saaʕa saaʕa

```
As you can see, our file structure contains two lines with the file name and date, then an empty line, and then a repeating structure of the kind:
- interviewer name (`B`), tab, paragraph, newline
- interviewee name (`S`), tab, paragraph, newline
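
A minimal sketch of how a file with this structure can be reduced to a list of speaker paragraphs, run on a made-up toy string (split on newlines, drop the two header lines, strip whitespace, drop empty paragraphs). The toy content and the variable names are assumptions for illustration only.

```python
# Toy transcription: two header lines, an empty line, then speaker paragraphs.
toy_text = (
    "file:///example\n"
    "2022 Oct 21, Fri 10:30\n"
    "\n"
    "B\thello there  \n"
    "S\t fine, thanks\n"
    "\n"
)

paragraphs = toy_text.split("\n")                 # step 1: one element per line
paragraphs = paragraphs[2:]                       # step 2: drop file name and date
paragraphs = [p.strip() for p in paragraphs]      # step 3: trim surrounding whitespace
paragraphs = [p for p in paragraphs if p != ""]   # step 4: drop empty paragraphs

for p in paragraphs:
    print(repr(p))
```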

In [71]:
# step 1: split the whole corpus in different elements every new line, creating a list of paragraphs. For us this means splitting interviewer and interviewee in different paragraphs
trans= transcription_text.split("\n")
# step 2: our first two paragraphs are the file name and the date, which we don't need so discard them
trans = trans[2:]
# step 3: remove spaces at the start and end of each paragraph
trans = [x.strip() for x in trans]
# step 4 : remove empty paragraphs from the list
trans = [x for x in trans if x != '']

# take a peek at the first three paragraphs
for idx in range(3):
    print(trans[idx])

B  haloow aaya, ʃ ʃooniC? zeena ħamd il laa, ʃ axbaariC, (laughing)   furSa... (laughing), huwwa, hammaateen da ariid agulliC innu furSa saʕiida innu tʕarrafit ʕaleeC waLLa, aLLa ysallmiC  hassa iħCiili, ʃ sawweeti ma sawweeti l yoom, qabil ma nijiiC iħna? ah  hm   ahm(laughing)
A [haloow $IA.ʃooniC], [$RELIG.ħamd-il-laa] b xeer inti haloow [$IA.ʃooniC]? waLLa [$RELIG.l ħamd il laa], tamaam, furSa saʕiida tʕarrafit [$CK.ʕaleeC], (laughing) aani l asʕad waLLa tʃarrafit [$CK.biiC] il yoom [$QG.qaʕadit] saaʕa tisʕa, aahm, ʃwayya wiyya l ʕaaʔila leen yoom [$GE.wooxenende], ijjammaʕna ʕa rayuug uu baʕdeen dirasit, uu ħadd ma intiDHaritkum tijuun uu baʕad iltiqeet b ħadaratkum, baʕad ma sawweet ʃii liʔann [$CK.baaCir] ʕindi [$G-SCHOOL.teest] fa laazim aħaDDir, uu [$DA.da adrus] ʃwayya ʕindi [$GE.SCHOOL.ʃitres], (laughing)
B   ooh, (laughing), maʕnaatha iħna l yoom raaħ niʃuɣLiC ʃwayya, (laughing), raaħ naaxið min waqtiC hm in ʃaa La ya raBBi kull il tawfiiq (...), uu b il ʕaada yaʕni haaða r

### Interviewers and Interviewees
As you can see from the previous example, we consider a file with only one interviewer and one interviewee alternating with each other. However, you may have multiple interviewers and interviewees in random order. In this case we need to know the names of the interviewers in order to separate them from the interviewees.

If you need to add multiple interviewers, add their names to the following cell and run it.


In [58]:
#interviewers = "name1,name2,...,nameN"
interviewers=""

Get the interviewers

In [59]:
# if user specified the interviewer's names then take that
if len(interviewers) > 0:
    interviewers = interviewers.split(',')
    # remove spaces
    interviewers = [x.strip() for x in interviewers]
else:
    # else use the first character of the transcription
    interviewers = [trans[0][0]]

print(f"The selected interviewers are {', '.join(interviewers)}")

The selected interviewers are B


### Output file settings
Now it's time to create the output file.

The output file takes the name of the transcription file, with the extension replaced by `_output.csv` (e.g. `Aya.txt` becomes `Aya_output.csv`).

In [73]:
# get the name of the output path
output_path=os.path.basename(transcription_path)
output_path=os.path.splitext(output_path)[0]+"_output.csv"
output_path

'Aya_output.csv'

It can be useful to add the previous paragraph to the final csv file: when examining an annotation, you may want to know what the previous speaker said. If you want this information in the final output, set `previous_line` to `True`; otherwise leave it `False`.

In [63]:
#previous_line = True
previous_line = False

The output is a csv file and it needs a header. We define the header as the following elements:
- the `text` for the annotated words
- the complete list of variables
- (optional) the previous line
- the sentence in which the text was found
- an `unk` (unknown) category for variables that were found in the annotations but are not present in the variable files (useful to catch some errors)

In [64]:

# start the header with one column per variable category
csv_header = list(variable_dict.keys())

# define the end of the csv
csv_end = ['sequence in sentence', 'unk']
if previous_line:
    csv_end.insert(0, 'previous line')
csv_header = ["text"] + csv_header + csv_end
csv_file = [csv_header]
unk_categories = []

print("The csv header looks like this")
csv_header

The csv header looks like this


['text',
 'Typ IA',
 'Oth L',
 'SA',
 'German',
 'Relig Phrase',
 'Oth Variety',
 'Demonstratives',
 'maal',
 'mu',
 'Präfix',
 'Suffix',
 'K/Č',
 'Q/G',
 'Q/K',
 'Syllable Structure',
 'Something',
 'sequence in sentence',
 'unk']

### Finding interviewees

This part looks for all the speaker names present in the file and removes the interviewers from them.

In [65]:

# get interviewer/interviewees names
names = [get_name(x).strip() for x in trans]
names = set(names)

# remove all mentions of interviewers from names
for i in interviewers:
    names = [x for x in names if i not in x]
interviewees = list(names)

# notify user about names
print(f"I found the following interviewees names: {', '.join(interviewees)}")

I found the following interviewees names: A, A(laughing)


### Starting the main loop
This part starts the main loop.
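
To make the row-building logic easier to follow, here is a self-contained sketch that turns the inside of a single annotation into one csv row. It uses a hypothetical miniature setup with only two categories, and the helper `tag_to_row` does not exist in the notebook; it only mirrors the idea of the loop.

```python
# Hypothetical miniature setup: a two-category subset of the variables.
variable_dict = {"Typ IA": "IA", "K/Č": ["KC", "CK"]}
csv_header = ["text"] + list(variable_dict.keys()) + ["sequence in sentence", "unk"]

# inverse lookup: tag -> category
idv = {}
for category, tags in variable_dict.items():
    for tag in (tags if isinstance(tags, list) else [tags]):
        idv[tag] = category

def tag_to_row(tag_body):
    """Turn the inside of one annotation, e.g. 'IA.CK.hiiCi', into a csv row."""
    # everything before the last dot are features, the rest is the annotated text
    feats, text = tag_body.rsplit(".", 1)
    row = [""] * len(csv_header)
    row[0] = text
    unknown = []
    for f in feats.split("."):
        if f in idv:
            # write the tag into the column of its category
            row[csv_header.index(idv[f])] = f
        else:
            # tags missing from the variable files end up in the 'unk' column
            unknown.append(f)
    row[-1] = ",".join(unknown)
    return row

row = tag_to_row("IA.CK.hiiCi")
print(dict(zip(csv_header, row)))
```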

In [74]:

# for every paragraph in the transcript
for idx in range(len(trans)):
    c = trans[idx]

    # only process interviewee paragraphs and remember the previous paragraph
    if get_name(c) in interviewees:
        sp = trans[idx - 1]
    else:
        continue
    # get the paragraph without features
    clean_p = remove_features(c)

    # capture all the sequences
    sequences = sequence_regex.finditer(clean_p)
    sequences = [(x.start(), x.end(), x.group()) for x in sequences]

    # get the features
    tags = feat_regex.finditer(c)

    # for every tags with features in the paragraph
    for t in tags:
        # get index of result + tag
        index = t.start()
        t = t.group(1)

        # initialize empty row
        csv_line = ["" for _ in range(len(csv_header))]

        # get the features
        feats = t.rsplit(".", 1)
        text = feats[1]
        feats = feats[0]

        # for every feature in the word
        for f in feats.split("."):
            # if the category is not present in the dict, then add to unk
            if f not in idv.keys():
                unk_categories.append(f)
                csv_line[-1] = csv_line[-1] + f + ","
            else:
                category = idv[f]
                cat_idx = csv_header.index(category)
                csv_line[cat_idx] = f

        # add initial infos and final unk to the line
        csv_line[0] = text
        if previous_line:
            csv_line[-3] = sp

        # add the sequence to the line
        if len(sequences) != 0:
            for s in sequences:
                # unpack the current sequence (not always the first one)
                seq_start, seq_end, seq = s
                if seq_start < index < seq_end:
                    seq = seq.replace("{", "").replace("}", "")
                    csv_line[-2] = seq
        csv_line[-1] = csv_line[-1].strip(",")
        csv_file.append(csv_line)

## Saving the output
Finally, we need to save the output to the file.

In [75]:

# write the csv
with open(output_path, "w", newline="", encoding="utf16") as f:
    writer = csv.writer(f, delimiter=separator)
    writer.writerows(csv_file)
print(f"Done!\nFile has been saved in '{output_path}'")


Done!
File has been saved in 'Aya_output.csv'


If there were some `unk` categories, print them here.

In [76]:

if len(unk_categories) > 0:
    unk_categories = set(unk_categories)
    unk_categories = sorted(unk_categories)
    # a single print is enough: print(print(...)) also prints the returned None
    print(
        f"I have found several categories not listed in '{independent_variable_path}' or in '{dependent_variable_path}'.\n"
        f"Following in alphabetical order:")
    for c in unk_categories:
        print(c.strip())

I have found several categories not listed in 'TranscriptionTagger/independent_variables.json' or in 'TranscriptionTagger/dependent_variables.json'.
Following in alphabetical order:

OTH-DIAL
DEM-HA-GEB
DEM-HAADHI
DL
ENGL-SCHOOL
EPENTH-VOW
G
G-SCHOOl
GA
GE
IA-AYA
JOB
K
L
MAAl
NO-A-END
NO-HA
NO-HA-SUF
NO-HA-SUFF
NO-IINA
NO-MAAl
NO-UUN-SUF
No-IINA
No-MAAL
OT-SA
OTH
OTH-SA-FEAT
OTH_SA
SCHOOL
SS-DIFF
SS-EPENTH-VOW
SS-HA
SS-RAISE
SUF-WITH-IINA
SUF-YA
higiina ʃ
