# Introduction

This tool is meant for researchers in Corpus Linguistics who work on, for example, language variation.

Given your corpus and a list of variables, this code will build a structured dataset (in `csv` format) for your future analysis. It will also help you spot inconsistencies in your annotations and give a clear overview of how your annotated features correlate with speakers.

While we do have a standard for annotations (ELAN and Praat), instructions to adapt the code to your own syntax are available below.

If you find a problem or would like to request a feature, please let us know through [this form](https://github.com/nicofirst1/CorpusCompass/issues/new/choose).

## How does it work?

This file is a Google Colab notebook. It lets you run the code (Python in this case) interactively by clicking on each `code cell`. If you are not familiar with Google Colab, we advise you to check the [tutorial](https://colab.research.google.com/notebooks/intro.ipynb).

When you run a cell, you will usually see additional information about the status of the program, e.g. the number of annotated words, variables, etc.



# Setup

Run this cell to copy the necessary files into your Colab.
If it prints something like `fatal: destination path 'CorpusCompass' already exists and is not an empty directory.`, you can ignore it.

In [None]:
!git clone -b dev https://github.com/nicofirst1/CorpusCompass
!pip install nltk
!pip install -e CorpusCompass/

Import necessary libraries

In [None]:
import csv
import json
import re
from collections import Counter
from CorpusCompass.utils import multi_corpus_upload, split_paragraphs, get_name, find_repetitions, remove_features, get_ngram
from copy import copy
from google.colab import files

# Custom files and settings

In this section, we will define some settings for your transcription!

We will start with the variables you defined and move on to how to find them in your annotated corpus (REGEX).

## Variables

Investigating language variation means that there is more than one way of saying the same thing. Speakers may vary pronunciation, morphology, word choice, etc.

In linguistic research, we usually work with a number of variables (dependent and independent).
In simple terms, an independent variable is the "input": a controlled factor that is given (e.g. age, sex, education). A dependent variable is the "output": the outcome being measured, which results from the set of independent variables (e.g. the pronunciation of a phoneme, morpheme or word).

### Defining variables with JSON
When annotating your data, you will use a set of variables. These variables may have a hierarchical structure, where one category includes many variants. To allow the program to find your variables, you have to build a "dictionary" that lists them.
Here, we use JSON files, which let you define as many categories and variables as you want in a clear and structured manner. Check out the [introduction to JSON tutorial](https://www.w3schools.com/js/js_json_intro.asp) if you are not familiar with it.
You can also look at the [dependent variables](./dependent_variables.json) we are using in this project for an example.
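
As a minimal sketch of the hierarchical structure described above, the snippet below parses a small JSON dictionary in which each category maps to a list of variables. The category names (`loanword_domain`, `typology`) are hypothetical and chosen only for illustration; the files shipped with this project may be organised differently.

```python
import json

# Hypothetical variable file: each key is a category, each value lists its variables.
example_json = """
{
    "loanword_domain": ["G-JOB", "G-EDU", "G-DL"],
    "typology": ["TYP-IA"]
}
"""

example_variables = json.loads(example_json)

# Print one line per category with the variables it contains.
for category, variables in example_variables.items():
    print(f"{category}: {', '.join(variables)}")
```

If `json.loads` raises an error, the file is not valid JSON; a missing comma or quotation mark is the most common cause.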

### Your files
Now that you are familiar with how the JSON syntax works, we need to define what your variables are. You have two options, and depending on the one you choose you will need to run different cells.

##### 1. Use the default variables file
At the moment there are two files in this code, one for [dependent variables](./dependent_variables.json) and one for [independent variables](./independent_variables.json). If you open these links you will see the original files, which cannot be modified there. To modify them, you need to open them from this page: click the folder symbol on the left of this code (check [this video](./includes/changing_variables.gif) for a how-to, and point 4 of [this tutorial](https://neptune.ai/blog/google-colab-dealing-with-files) on how to access the local file system in Google Colab), then click on CorpusCompass (this will open the directory) and you will see the two files. Click on them to modify their content.

##### 2. Upload your own files
Here you can upload your own variable files as long as they are still in JSON format. You can upload as many files as you want.

##### Choose your method
You can choose either method 1 or 2 by changing the value of `variable_method` to 1 or 2. By default the method is set to 1 (using the variables defined here).

It should look like this: `variable_method=1` or this: `variable_method=2`.

In [None]:
# set to 1 (use the default variable files) or 2 (upload your own files)

variable_method=1

assert variable_method in [1,2], f"Invalid method number {variable_method}! Choose between 1 and 2"

In [None]:

if variable_method==1:
    # method 1, load from files
    dependent_variable_path = 'CorpusCompass/dependent_variables.json'
    independent_variable_path = 'CorpusCompass/independent_variables.json'

    variable_paths=[dependent_variable_path, independent_variable_path]
    variable_files=[]
    for path in variable_paths:
        # read each file and close it right away
        with open(path, "r", encoding="utf8") as f:
            variable_files.append(f.read())
else:
    # method 2, upload files
    variable_files=files.upload()

    for fn in variable_files.keys():
        if not fn.lower().endswith(".json"):
            raise FileNotFoundError(f"File {fn} is not a JSON file!")
        print('User uploaded file "{name}" with length {length} bytes'.format(
          name=fn, length=len(variable_files[fn])))

    variable_files=list(variable_files.values())

    # decode
    variable_files=[x.decode("utf8") for x in variable_files]

# show variables
variable_files

## Regular Expressions (REGEX)

Now it's time to define the regular expressions (REGEX) to find your annotation in the corpus.

In this example, each annotation is enclosed in square brackets and starts with a dollar sign. The variables are separated by dots (without spaces), and the last element after the final dot is the annotated word (which may contain spaces).
Formally:

```[$variable1.variable2.annotated word]```

An example: 

`[$verb.ing.playing]`
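
To make the format concrete, here is a small sketch that parses the example above. The pattern is the same one this notebook uses for `feat_regex`; the sample sentence is made up for illustration.

```python
import re

# Same pattern as the notebook's default `feat_regex`:
# it captures the text between "[$" and the next "]".
feat_regex = re.compile(r"\[\$([\S ]*?)\]")

sample = "she was [$verb.ing.playing] outside"
content = feat_regex.findall(sample)[0]

# Everything before the last dot is a variable; the rest is the annotated word.
variables, annotated_word = content.rsplit(".", 1)
variables = variables.split(".")

print(variables)       # ['verb', 'ing']
print(annotated_word)  # playing
```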


If your annotations follow a different rule, you need to come up with a new REGEX.

You can use [regex101](https://regex101.com/) for this. Follow these instructions to test it:
- Open  [regex101](https://regex101.com/)
- Copy-paste the content of the `square_regex` (defined below) (`(\[\$[\S ]*?\])`) into the regular expression bar (on the top)
- Copy-paste a sample paragraph into the test string field (on the bottom), e.g.:
```
A: akiid aani asawwi  [$G-JOB.awsbildung]  bass igulluuli innu l ʕaluum il baaylooji yaʕni li huwwa taħlilaat ṣaʕub bass iða ṣaʕub asawwi ɣeera akiid yaʕni, innu musaaʕadat doktoor asnaan ħaaba  [$TYP-IA.hammeen], haaða ʃ ʃii [$TYP-IA.hamma] aah, ey waḷḷa, aḷḷa ysallmič  ʕindi tlaθ aṭfaal, aah, il ʕindi ṭifla čibiira sitt isniin bi l madrasa uw ʕali ʕumra tlaθ isniin, arbaʕ isniin [$G-EDU.kindargaartin] uw samiir santeen uw nuṣṣ, santeen uw θmann iʃhuur [$G-EDU.kindarkriipa]

```

Define our REGEXs:

In [None]:
# regex to find the complete annotation rule
square_regex = re.compile(r"(\[\$[\S ]*?\])")
# regex to find the content of an annotation
feat_regex = re.compile(r'\[\$([\S ]*?)\]')
# regex to find the token in the annotation
annotated_token_regex=re.compile(r"\.[\S ]*?\]")
# regex to unambiguously find the speaker name in the paragraph
# uncomment if you don't have speakers at the start of each paragraph
# name_regex= re.compile(r"^")
name_regex= re.compile(r"(^[A-Z]): ")


## Corpus file
Finally, you need to upload your corpus file. This is the file containing your annotated tokens.


In [None]:
# Change use_example_corpus to False, in order to upload your own corpus
# use_example_corpus=False
use_example_corpus=True


if use_example_corpus:
    import pathlib
    cur_path=pathlib.Path().resolve()
    corpus_path=cur_path.joinpath("CorpusCompass/includes/corpus_example.txt")
    with open(corpus_path, "r", encoding="utf8") as f:
        corpus_text=f.read()
else:
    corpus_path = files.upload()
    corpus_text=multi_corpus_upload(corpus_path)

In [None]:

print("\n\n First 400 characters:\n")
corpus_text[:400]

## Output file

When writing the comma-separated values (`csv`) file, you can choose the separator. Despite the name, it does not have to be a comma; the cell below sets it to a semicolon, but you can use any of:
- semicolon: `;`
- comma: `,`
- tab : `\t`

Be aware that if the separator character also appears in your corpus, the `csv` file may be displayed incorrectly. We suggest choosing a symbol that does not appear in your corpus and then manually setting the separator in the program you use to view the file (e.g. Excel).
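
The following sketch illustrates the issue with a made-up row whose context field contains a semicolon. Python's `csv` module wraps such a field in quotes, so reading the file back with the same delimiter recovers the columns. A viewer that splits lines on the separator without honouring quotes may still misalign them, which is why a separator that is absent from the corpus is the safer choice.

```python
import csv
import io

# Made-up rows: the second field of the data row contains the separator itself.
rows = [["token", "context"], ["hamma", "haaða ʃ ʃii; hamma aah"]]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading back with the same delimiter recovers the original fields.
parsed = list(csv.reader(io.StringIO(text), delimiter=";"))
print(parsed == rows)
```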

In [None]:
separator = ';'

# Main program
Time to start the main program. First, let us merge all the variables we loaded before.

In [None]:
variable_files=[json.loads(f) for f in variable_files]

variable_dict = {}

for j in variable_files:
    for k,v in j.items():
        if k in variable_dict.keys():
            print(f"Warning : I have found a duplicate variable named '{k}' with values '{v}'. "
                  f"If this is expected ignore this message, otherwise check the variables files for duplicates!")
        variable_dict[k]=v


print("Correctly loaded all the variables. Check them out to see if there are any errors")
print(json.dumps(variable_dict,sort_keys=True, indent=4))

## Preprocessing transcriptions
Depending on your type of annotated corpus, you will need to preprocess the file.
Feel free to comment out (add `#` at the start of the line) any preprocessing that does not fit your criteria.
To give you a sense of our transcriptions, here is what the first paragraphs look like:

```
A: akiid aani asawwi  [$G-JOB.awsbildung]  bass igulluuli innu l ʕaluum il baaylooji yaʕni li huwwa taħlilaat ṣaʕub bass iða ṣaʕub asawwi ɣeera akiid yaʕni, innu musaaʕadat doktoor asnaan ħaaba  [$TYP-IA.hammeen], haaða ʃ ʃii [$TYP-IA.hamma] aah, ey waḷḷa, aḷḷa ysallmič  ʕindi tlaθ aṭfaal, aah, il ʕindi ṭifla čibiira sitt isniin bi l madrasa uw ʕali ʕumra tlaθ isniin, arbaʕ isniin [$G-EDU.kindargaartin] uw samiir santeen uw nuṣṣ, santeen uw θmann iʃhuur [$G-EDU.kindarkriipa]

B: sawweet yaʕni baarħa innu gidarit adrus bi l leel ʕala muud awaffir waqti l yoom ilkum  ayy, ʃukran ilič tislamiin  la, [$G-DL.rootiin] il yoomi ʕindi dawaam aani daaʔiman min iθ θmaaniya w nuṣṣ li s saaʕa arbaʕa asawwi [$G-JOB.awsbildung] tamriiḍ li huwwa [$G-JOB.kraankenbifleegaaa] baʕd id dawaam arjaʕ mumkin asawwi [$G-DL.aaynkawfin] loo maθalan asawwi [$G-DL.teermiin] aw ʃii baʕadeen arjaʕ li l beet, mumkin aṭbux, aakul, leen aani saakna waħdi fa ʃwayya tkuun ṣaʕba bi n nisba ili

C: ħiluw aʃyaaʔ [$TYP-IA.aku] ħilwa aa bass atmanna n naas yaʕni, aa [$TYP-IA.Q-WORD.ʃoon]  yistaxdimuuha b ṣuura ṣaħiiħa ey haaða yaʕni il mafruuḍ il kull tfakkir bii ey hm, naʕam, naʕam, naʕam, hm bali, ṣaħħ naʕam, ey ey ey, ṣaħiiħ, ey ṣaħħ, ṣaħiiħ ey ħamd-il-laa ey, naʕam, hmm la, la ṭabʕan akiid, ma, yaʕni ṣaar, aa,  [$TYP-IA.fadd]  ʃii wiyya l ħayaat yaʕni [$TYP-IA.Q-WORDʃoon] ka anna ma ʃurbat il ṃayy ma tgidriin, yaʕni aani ḍiʕit

D: yaʕni n naas aa, mitfahhimiin iʃ ʃii uw yaʕni ma [$TYP-IA.da] ydaxxiluun iʃ ʃaɣḷaat yaʕni ḍiddhum haaði fa huwwa l ʕeeb, il ʕeeb bi l baʃar illi il ʕarabi il [$TYP-IA.da] yiji ey, aha, bi ḍ ḍaBiṭ, naʕam bi ḍ ḍaBiṭ la waḷḷa [$TYP-IA.da] ruuħ aa, bass aa l [$G-EDU.koors] xiḷas, ey ma ṭaḷḷaʕit natiija yaʕni, ey ey, ey waḷḷa, waḷḷa yaʕni aa, min, aa xijalit min nafsi yaʕni, gumit aa, ħatta ħatta nafsiiti tiʕbat yaʕni aa, yaʕni ħatta l ħijiyya uw aṣdiqaaʔi yaʕni, igulli ʕammu [$TYP-IA.Q-WORD.ʃinu] yaʕni

```
As you can see, our file has a repeating structure of the kind:
- speaker name (`A`, `B`, `C`, `D`), colon, space, paragraph, newline
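
The preprocessing steps can be sketched on a toy text as follows. The pattern is the notebook's default `name_regex`; `speaker_of` is a hypothetical stand-in written for this example and is not the `get_name` function from `CorpusCompass.utils`, and splitting on newlines is only an assumption about what `split_paragraphs` does.

```python
import re

# Same pattern as the notebook's default `name_regex`.
name_regex = re.compile(r"(^[A-Z]): ")

def speaker_of(paragraph):
    """Hypothetical helper: return the speaker label, or '' if none is found."""
    match = name_regex.search(paragraph)
    return match.group(1) if match else ""

raw = "A: akiid aani asawwi\n\n  B: sawweet yaʕni baarħa  \n(laughter)\n"

# Split on newlines, trim surrounding spaces, and drop empty paragraphs.
paragraphs = [p.strip() for p in raw.split("\n")]
paragraphs = [p for p in paragraphs if p]

# Keep only the paragraphs that start with a recognised speaker label.
kept = [p for p in paragraphs if speaker_of(p)]
speakers = [speaker_of(p) for p in kept]
print(speakers)  # ['A', 'B']
```

The line `(laughter)` is dropped because it does not start with a capital letter followed by a colon and a space.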

In [None]:
# step 1: split the whole corpus in different elements every new line, creating a list of paragraphs. For us this means splitting interviewer and interviewee in different paragraphs
corpus= split_paragraphs(corpus_text)
# step 2: remove spaces at the start and end of each paragraph
corpus = [x.strip() for x in corpus]
# step 3 : remove empty paragraphs from the list
corpus = [x for x in corpus if x != '']
# step 4 : filter out all the paragraphs that do not have any DETECTED speaker
prev_c=copy(corpus)
corpus=[x for x in corpus if get_name(x,name_regex)]

removed_sentences=len(prev_c)-len(corpus)
print(f"I removed {removed_sentences} paragraphs, since I could not detect a speaker\n"
      f"I will show them here, sorted by their position in the corpus:")
if removed_sentences>0:
    diff=set(prev_c)-set(corpus)
    diff=sorted(diff,key=lambda x: prev_c.index(x))
    for i in diff:
        print(f"{prev_c.index(i)}: {i}")
    print("\n\n")
    print("If you see paragraphs you are interested in, consider manually changing them in the corpus, or expanding the 'name_regex' rule")


if len(corpus)==0:
    print("All the paragraphs in the corpus have been deleted! You should review your regex rules")
else:
    # take a peek at the first three paragraphs (or fewer, if the corpus is shorter)
    print("First paragraphs are")
    for paragraph in corpus[:3]:
        print(paragraph)

### Speaker of interest
As you can see from the previous example, we consider a file where speakers alternate. However, you may have multiple speakers in random order. In this case, you can specify the names of the speakers you are interested in; note that the others will be skipped in the final output.

If you need to add multiple speakers run the following cell after adding the names. It should look something like:


`speakers_of_interest = "name1,name2,...,nameN"`

In [None]:
speakers_of_interest=""

Get the interviewers

In [None]:
# if user specified the interviewer's names then take that
if len(speakers_of_interest) > 0:
    speakers_of_interest = speakers_of_interest.split(',')
    # remove spaces
    speakers_of_interest = [x.strip() for x in speakers_of_interest]
else:
    # else use the first character of the second paragraph as the speaker name
    speakers_of_interest = [corpus[1][0]]
    # filter out empty names
    speakers_of_interest = [x for x in speakers_of_interest if x != '']
    speakers_of_interest=[speakers_of_interest[0]]

print(f"The selected speakers of interest are: {', '.join(speakers_of_interest)}")

### Finding all speakers

This part looks for all the names present in the file and checks the independent variables
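
As a toy illustration of this step, the sketch below separates speaker entries from category entries and builds an inverse lookup from each variable value to its category. The category names and speaker values are hypothetical and not taken from the project's variable files.

```python
# Hypothetical merged variables: two categories plus one entry per speaker.
variable_dict = {
    "gender": ["female", "male"],
    "typology": ["TYP-IA"],
    "A": ["female"],
    "B": ["male"],
}
all_speakers = ["A", "B"]

# Entries whose key is a speaker name describe that speaker (independent variables).
independent = {k: v for k, v in variable_dict.items() if k in all_speakers}
categories = {k: v for k, v in variable_dict.items() if k not in all_speakers}

# Inverse lookup: variable value -> category name.
inverse = {value: category
           for category, values in categories.items()
           for value in values}

print(independent)
print(inverse)
```

With the inverse lookup, a value such as `female` can be placed in the `gender` column without searching every category.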

In [None]:

# get interviewer/interviewees names
all_speakers = [get_name(x, name_regex).strip() for x in corpus]
all_speakers = set(all_speakers)
# filter out empty speaker names
all_speakers = [x for x in all_speakers if x != '']

# notify user about names
print(f"I found the following speakers names: {', '.join(all_speakers)}")


# get the independent variables
independent_variable_dict={k:v for k,v in variable_dict.items() if k in all_speakers}
# remove from variable_dict the independent variables
variable_dict={k:v for k,v in variable_dict.items() if k not in all_speakers}


# build an inverse lookup from each variable value to its category
idv = {}
for k, v in variable_dict.items():
    if isinstance(v, list):
        for i in v:
            idv[i] = k
    else:
        idv[v] = k

### Output file settings
Now it's time to create the output files.

There are different outputs and here we specify their names.

In [None]:
# get the name of the output path
dataset_path="dataset.csv"
annoation_info_path="annotation_info.csv"
not_annotated_path="not_annotated_log.csv"

It can be useful to add the previous paragraph in the final csv file. For example, when examining an annotation, you want to know what the previous speaker said before the current one (turn taking). If you are interested in this information being in the final output set `previous_line` to `True`, else leave it `False`
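
A minimal sketch of the idea, using made-up paragraphs: each paragraph is paired with the one that precedes it, and the first paragraph gets an empty string because nothing comes before it.

```python
paragraphs = ["A: first question", "B: first answer", "A: second question"]

# Pair each paragraph with the one before it; the first has no predecessor.
pairs = [
    (paragraphs[i - 1] if i > 0 else "", paragraphs[i])
    for i in range(len(paragraphs))
]

for previous, current in pairs:
    print(f"previous={previous!r} current={current!r}")
```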

In [None]:
previous_line = False

Additionally, we include a context for every annotated token in your data. The context consists of the words before and after the annotated one, with the window size set as follows:

`ngram_params = (number previous words, number next words)`

Here we decided to take into account 10 words appearing before the annotated one and 5 after.
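
To show what such a window looks like, here is a sketch with a hypothetical helper, `context_window`. It is not the implementation of `get_ngram` from `CorpusCompass.utils`, which may tokenise differently or exclude the annotated word itself; it only illustrates the idea of taking a fixed number of words on each side.

```python
def context_window(words, index, before, after):
    """Hypothetical helper: the word at `index` plus its neighbours."""
    start = max(0, index - before)  # do not run past the start of the list
    return words[start:index + after + 1]

words = "haaða ʃ ʃii hamma aah ey waḷḷa".split()
window = context_window(words, words.index("hamma"), before=2, after=1)
print(" ".join(window))  # ʃ ʃii hamma aah
```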

In [None]:
ngram_params=(10,5)

The output is a csv file and it needs a header. For this reason, here we define the header with the following elements:
- The `token` for the annotated words
- The complete list of your variables
- (Optional) the previous line
- The speaker's name
- The context in which the token was found
- An `unk` (unknown) category for variables that were found in the annotations but not present in the variable files (useful to catch some errors)
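
For illustration, this is how the header would come out for two hypothetical categories when the previous line is included. The category names are assumptions made for this example.

```python
categories = ["loanword_domain", "typology"]  # hypothetical category names
previous_line = True

# Fixed columns that close every row.
tail = ["speaker", "context", "unk"]
if previous_line:
    tail.insert(0, "previous line")

header = ["token"] + categories + tail
print(header)
```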

In [None]:
# start the header with the variable categories
csv_header = list(variable_dict.keys())

# define the end of the csv
csv_end = ["speaker",'context', 'unk']
if previous_line:
    csv_end.insert(0, 'previous line')
csv_header = ["token"] + csv_header + csv_end
csv_file = [csv_header]
unk_categories = []

print("The csv header looks like this")
csv_header

### Annotation information

Here we build our counter and logger for the annotations. We want to count how often each annotated token appears in the corpus and who uses it. Moreover, for logging purposes, we want to list the occurrences of annotated tokens that appear outside the annotation rule (i.e. left unannotated).
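
The counting step can be sketched on a short made-up string. The pattern is the notebook's default `feat_regex`, and the token is taken to be whatever follows the last dot inside each annotation.

```python
import re
from collections import Counter

# Same pattern as the notebook's default `feat_regex`.
feat_regex = re.compile(r"\[\$([\S ]*?)\]")

text = "ħaaba [$TYP-IA.hammeen], haaða ʃ ʃii [$TYP-IA.hamma] aah [$TYP-IA.hamma]"

# Keep only the annotated word, i.e. the part after the last dot.
tokens = [content.split(".")[-1] for content in feat_regex.findall(text)]
counts = Counter(tokens)
print(counts)
```

Here `hamma` is counted twice and `hammeen` once, so there are two distinct annotated tokens and three annotations in total.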

In [None]:

# ### Finding all the annotated words
whole_corpus="\n".join(corpus)
annotations=feat_regex.findall(whole_corpus)
annotations=[x.split(".")[-1] for x in annotations]

# count the number of annotations
annotation_counter = Counter(annotations)
annotation_counter={k:dict(annotated=v) for k,v in annotation_counter.items()}

print(f"The number of distinct annotated words is {len(annotation_counter)}")

Now, for each annotated token, we compare the number of times it appears annotated (following the REGEX rule) with the number of times it appears without annotation.
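
One possible way to make this comparison is sketched below on a made-up sentence. It is only an approximation of the idea and not the logic of `find_repetitions` from `CorpusCompass.utils`, which may match tokens differently. The pattern is the notebook's default `square_regex`.

```python
import re

# Same pattern as the notebook's default `square_regex`.
square_regex = re.compile(r"(\[\$[\S ]*?\])")

text = "uw hamma gaal [$TYP-IA.hamma] bass hamma"
token = "hamma"

# Annotated occurrences: annotations whose final element is the token.
annotated = sum(
    1 for tag in square_regex.findall(text) if tag.endswith("." + token + "]")
)

# Remove the annotations, then count whitespace-delimited occurrences in the rest.
plain_text = square_regex.sub(" ", text)
pattern = r"(?<!\S)" + re.escape(token) + r"(?!\S)"
not_annotated = len(re.findall(pattern, plain_text))

print(annotated, not_annotated)  # 1 2
```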

In [None]:

# check whether annotated tokens also occur without annotation
for k,v in annotation_counter.items():
    # check for annotation repetitions
    wild_rep,ann_rep,_= find_repetitions(whole_corpus, k, annotated_token_regex)
    total_rep=wild_rep+ann_rep
    not_annotated=total_rep-v['annotated']
    annotation_counter[k]['not_annotated']=not_annotated

print(f"The total repetitions of annotated words is {sum([x['annotated'] for x in annotation_counter.values()])}")
print(f"The total repetitions of not annotated words is {sum([x['not_annotated'] for x in annotation_counter.values()])}")

While the previous cell counts all the tokens in the whole corpus, we probably want to differentiate between speakers.
For this reason, we need to specify which speakers' paragraphs to search for missed tokens.

You have the following options:

- Choose all the speakers found in your corpus with: `speakers=all_speakers`
- Choose only the speakers you are interested in (defined previously):
`speakers=speakers_of_interest`
- Choose other speakers you are interested in, by manually enumerating them: `speakers= ["name1","name2",...,"nameN"]`

By default, we check for all the speakers

In [None]:
# choose which speakers to check for annotations by uncommenting exactly one of the following lines:
speakers=all_speakers
# speakers=speakers_of_interest
# speakers=["name1","name2"]

Now it's time to find those missing annotations

In [None]:
not_annotated_log={}

for k,v in annotation_counter.items():

    res=[find_repetitions(x, k, annotated_token_regex) for x in corpus if get_name(x,name_regex) in speakers]
    # nothing to unpack if no paragraph belongs to the selected speakers
    if not res:
        continue
    _,_,wild_not_annotated=zip(*res)

    # unfold the list
    wild_not_annotated=[item for sublist in wild_not_annotated for item in sublist]

    # if there are not annotated words
    if len(wild_not_annotated) > 0:
        not_annotated_log[k]=wild_not_annotated

print(f"The total number of not annotated words is {len(not_annotated_log)}")

Finally, some pre-processing

In [None]:
# augment annotation_counter with speakers and add total number
for k in annotation_counter.keys():
    annotation_counter[k]['total'] = annotation_counter[k]['annotated'] + annotation_counter[k]['not_annotated']
    for speaker in all_speakers:
        annotation_counter[k][speaker] = 0


## Starting the main loop
This part starts the main loop. You don't need to change anything here; if you are interested, check out the comments.
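
To give an idea of what happens to each annotation, the sketch below turns one annotation into a row. The header and the lookup table are hypothetical; in this toy setup `Q-WORD` is deliberately left out of the lookup so that it ends up in the `unk` column.

```python
header = ["token", "typology", "speaker", "context", "unk"]  # hypothetical header
inverse = {"TYP-IA": "typology"}                             # variable -> category

tag_content = "TYP-IA.Q-WORD.ʃinu"

# Split off the annotated word; what remains are the dot-separated variables.
features, token = tag_content.rsplit(".", 1)

row = [""] * len(header)
row[0] = token
unknown = []
for feature in features.split("."):
    if feature in inverse:
        # known variable: write it in the column of its category
        row[header.index(inverse[feature])] = feature
    else:
        # unknown variable: collect it for the last column
        unknown.append(feature)
row[-1] = ",".join(unknown)

print(row)
```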

In [None]:

# for every paragraph in the transcript
for idx in range(len(corpus)):
    c = corpus[idx]
    cur_speaker = get_name(c,name_regex)

    # add speaker related metrics
    for k, v in annotation_counter.items():
        wild_rep, ann_rep, _ = find_repetitions(c, k, annotated_token_regex)
        rep = wild_rep + ann_rep
        annotation_counter[k][cur_speaker] += rep

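    # skip speakers we are not interested in
    if cur_speaker not in speakers_of_interest:
        continue
    # previous paragraph (empty for the first one, to avoid wrapping around to the last)
    sp = corpus[idx - 1] if idx > 0 else ""
    # get the paragraph without features
    clean_p = remove_features(c, square_regex)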

    # get the features
    tags = feat_regex.finditer(c)

    # for every tags with features in the paragraph
    for t in tags:
        # get index of result + tag
        index = t.start()
        t = t.group(1)

        # initialize empty row
        csv_line = ["" for _ in range(len(csv_header))]

        # get independent variable information
        for k, v in independent_variable_dict.items():
            if cur_speaker!=k:
                continue
            for var in v:

                category = idv[var]
                cat_idx = csv_header.index(category)
                csv_line[cat_idx] = var


        # get the features
        feats = t.rsplit(".", 1)
        text = feats[1]
        feats = feats[0]

        context=get_ngram(text,clean_p,ngram_params)

        # for every feature in the word
        for f in feats.split("."):
            # if the category is not present in the dict, then add to unk
            if f not in idv.keys():
                unk_categories.append(f)
                csv_line[-1] = csv_line[-1] + f + ","
            else:
                category = idv[f]
                cat_idx = csv_header.index(category)
                csv_line[cat_idx] = f

        # add initial infos and final unk to the line
        csv_line[0] = text
        csv_line[-2] = context
        csv_line[-3] = cur_speaker
        if previous_line:
            csv_line[-4] = sp

        csv_line[-1] = csv_line[-1].strip(",")
        csv_file.append(csv_line)

## Saving the output
Finally, we save all our results to csv files.
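
The files are written with UTF-16 encoding, so any program that reads them needs the same encoding and separator. As a small sketch with made-up rows and a temporary file, this is how such a file can be written and read back in Python:

```python
import csv
import os
import tempfile

rows = [["token", "speaker"], ["ʃinu", "D"]]
path = os.path.join(tempfile.mkdtemp(), "dataset.csv")

# Write with the same encoding and a semicolon separator.
with open(path, "w", newline="", encoding="utf16") as f:
    csv.writer(f, delimiter=";").writerows(rows)

# Read back: encoding and delimiter must match the ones used for writing.
with open(path, "r", newline="", encoding="utf16") as f:
    loaded = list(csv.reader(f, delimiter=";"))

print(loaded)
```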

In [None]:

# write the csv
with open(dataset_path, "w", newline="", encoding="utf16") as f:
    writer = csv.writer(f, delimiter=separator)
    writer.writerows(csv_file)


# generate the annotation info file
header=["token"]+list(list(annotation_counter.values())[0].keys())

with open(annoation_info_path, "w", newline="", encoding="utf16") as f:
    writer = csv.writer(f, delimiter=separator)
    writer.writerow(header)

    for k,v in annotation_counter.items():
        writer.writerow([k]+list(v.values()))


# save the not annotated log
header=["token"]+[f"context {i}" for i in range(max((len(x) for x in not_annotated_log.values()), default=0))]
with open(not_annotated_path, "w", newline="", encoding="utf16") as f:
    writer = csv.writer(f, delimiter=separator)
    writer.writerow(header)
    for k, v in not_annotated_log.items():
        writer.writerow([k]+v)

If you want to download the files right away, run this cell:

In [None]:
files.download(dataset_path)
files.download(annoation_info_path)
files.download(not_annotated_path)

### Unknown categories
Here, we show the unknown categories, if any were found.

In [None]:

if len(unk_categories) > 0:
    unk_categories = set(unk_categories)
    unk_categories = sorted(unk_categories)
    print(
        "I have found several categories not listed in your variable file.\n"
        "Following in alphabetical order:")
    for idx,c in enumerate(unk_categories):
        print(idx,f"'{c}'")