# CANDOR corpus annotation with Stanza and analysis of annotated files

In this tutorial we analyse grammar and sentiment and perform Named Entity Recognition on the [CANDOR](https://www.science.org/doi/10.1126/sciadv.adf3197) transcripts using [Stanza](https://stanfordnlp.github.io/stanza/).

See the other notebooks in this repository for instructions on file preparation.

## Annotation with Stanza

Now, let's annotate our files!

First, install Stanza and download the English models. This needs to be done only once per session. To use this code for another language, change the language code (`'en'`) accordingly.

In [None]:
!pip install stanza
import stanza
stanza.download('en')

Now we can process the transcript utterances. The code block below runs the processors listed in `stanza.Pipeline()`:
- tokenization
- part-of-speech tagging
- lemmatization
- dependency parsing
- named entity recognition
- sentiment analysis

The output is formatted according to the CoNLL-U Plus specification.
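
To make the target format concrete, here is a minimal sketch of how one token line could be assembled. The helper name `format_conllu_row` is hypothetical and not part of Stanza. It only illustrates that each token becomes one tab-separated line with eleven columns (the ten CoNLL-U columns plus the extra named-entity column used in this notebook), and that missing values are written as `_`. The token values are made up for the example.

```python
# Hypothetical helper (not part of Stanza): builds one token line.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
           "HEAD", "DEPREL", "DEPS", "MISC", "NAMEDENTITY"]


def format_conllu_row(values):
    """Join column values with tabs, writing '_' for missing (None or empty) values."""
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return "\t".join("_" if v is None or v == "" else str(v) for v in values)


# Illustrative values for a single token (made up for this example)
row = format_conllu_row([1, "Hello", "hello", "INTJ", "UH", None, 0, "root", None, None, "O"])
print(row)
```

Checking the number of columns up front is a design choice that helps catch a row that would otherwise silently shift columns in the output file.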

Running this code on a local machine might be faster than in a hosted notebook.

Note that a single sentence might be too short for reliable sentiment analysis; analysing the whole utterance or a larger chunk probably makes more sense.
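
One simple way to obtain a value per utterance is to aggregate the sentence-level labels afterwards. Stanza's English sentiment model labels each sentence as 0 (negative), 1 (neutral) or 2 (positive). The sketch below, with the hypothetical helper `utterance_sentiment`, averages these labels over an utterance. This is an aggregation of sentence-level predictions, not a re-run of the classifier on a longer text, so treat it as a rough approximation. The labels are made up for the example.

```python
from statistics import mean


def utterance_sentiment(sentence_labels):
    """Average sentence-level labels (0 = negative, 1 = neutral, 2 = positive).

    Returns None for an utterance without sentences.
    """
    if not sentence_labels:
        return None
    return mean(sentence_labels)


# Made-up labels for a three-sentence utterance: neutral, positive, positive
score = utterance_sentiment([1, 2, 2])
print(round(score, 2))
```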

In [None]:
import os
import pandas as pd
import stanza

# Load the Stanza pipeline
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse,ner,sentiment')


def process_transcript(transcript_path, output_directory, file_id):
    """
    Processes a single transcript file and outputs it in CoNLL-U Plus format with metadata.
    """
    # Load the transcript CSV
    df = pd.read_csv(transcript_path)

    # Ensure output directory exists
    os.makedirs(output_directory, exist_ok=True)

    # Get the base name for the output file
    base_name = os.path.splitext(os.path.basename(transcript_path))[0]
    output_file = os.path.join(output_directory, f"{base_name}_stanza.conllu")

    # Open the output file for writing
    with open(output_file, 'w', encoding='utf-8') as f_out:

        # Write the column declaration first (CoNLL-U Plus expects it on
        # the first line of the file), then the document metadata
        f_out.write("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC NAMEDENTITY\n")
        f_out.write(f"# newdoc id = {file_id}\n")

        for _, row in df.iterrows():
            # Extract metadata
            turn_id = row['turn_id']
            speaker = row['speaker']
            start = row['start']
            stop = row['stop']
            utterance = row['utterance']

            # Skip rows with empty utterances
            if pd.isna(utterance):
                continue

            # Annotate the utterance using Stanza
            doc = nlp(utterance)


            # Write token annotations in CoNLL-U Plus format;
            # sentences are numbered from 1 within each turn
            sent_id = 1

            for sentence in doc.sentences:

                # Write metadata for the sentence
                f_out.write(f"# turn = {turn_id}\n")
                f_out.write(f"# sent = {sent_id}\n")
                f_out.write(f"# text = {sentence.text}\n")
                f_out.write(f"# speaker = {speaker}\n")
                f_out.write(f"# turn_start = {start}\n")
                f_out.write(f"# turn_end = {stop}\n")
                f_out.write(f"# sentiment = {sentence.sentiment}\n")

                sent_id += 1

                for token in sentence.tokens:
                    # The NER tag belongs to the token and is shared by all of its words
                    for word in token.words:
                        f_out.write("\t".join([
                            str(word.id),  # ID
                            word.text,  # FORM
                            word.lemma or '_',  # LEMMA
                            word.upos or '_',  # UPOS
                            word.xpos or '_',  # XPOS
                            word.feats or '_',  # FEATS
                            str(word.head),  # HEAD
                            word.deprel or '_',  # DEPREL
                            '_',  # DEPS
                            '_',  # MISC
                            token.ner or '_',  # NAMEDENTITY
                        ]) + "\n")
                f_out.write("\n")  # Blank line after each sentence
            f_out.write("\n")  # Blank line after each utterance

    print(f"Processed and saved: {output_file}")


def process_all_transcripts(transcripts_directory, output_directory):
    """
    Processes all transcript files in a directory and saves the annotated files.
    """
    for file in os.listdir(transcripts_directory):
        # For testing, only one file is processed here. Change the
        # condition to 'if "transcription" in file' to process all files.
        if "_1_transcription" in file:
            file_id = file.split("transcript")[0].rstrip('_')
            transcript_path = os.path.join(transcripts_directory, file)
            process_transcript(transcript_path, output_directory, file_id)


# Directories
transcripts_directory = "/content/drive/MyDrive/DSI_multimodal_HS24/CANDOR_flattened"
output_directory = "/content/drive/MyDrive/DSI_multimodal_HS24/candor_stanza"

# Process all transcripts
process_all_transcripts(transcripts_directory, output_directory)


INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| pos       | combined_charlm           |
| lemma     | combined_nocharlm         |
| depparse  | combined_charlm           |
| sentiment | sstplus_charlm            |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
  checkpoint = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Loading: mwt
  checkpoint = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Loading: pos
  checkpoint = torch.load(filename, lambda storage, loc: storage)
  data = torch.load(self.filename, lambda storage, loc: storage)
  state = torch.load(filename, lambda storage, loc: storage)
INFO:stanza:Loading: lemma
  checkpoint = torch.load(file

Processed and saved: /content/drive/MyDrive/DSI_multimodal_HS24/candor_stanza/intrvw_1_transcription_transcript_cliffhanger_stanza.conllu


ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 2


## Multimodal analysis with Stanza files

Let's repeat the query from the multimodal analysis file (see repository), now using the richly annotated data stored in the CoNLL-U files.

First, we need the library for working with the CoNLL-U format:

In [None]:
!pip install conllu



Collecting conllu
  Downloading conllu-6.0.0-py3-none-any.whl.metadata (21 kB)
Downloading conllu-6.0.0-py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-6.0.0
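
Before using the library, it may help to see the structure of the files written above. The following sketch reads such a file with plain Python only. It illustrates the layout (comment lines with `key = value` metadata, one tab-separated line per token, blank lines between sentences) and is not a replacement for the `conllu` library. The function name `read_sentences` and the sample text are made up for this example.

```python
import io

COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
           "HEAD", "DEPREL", "DEPS", "MISC", "NAMEDENTITY"]


def read_sentences(lines):
    """Yield (metadata, tokens) for each sentence block in an iterable of lines."""
    metadata, tokens = {}, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence block, if any
            if tokens:
                yield metadata, tokens
            metadata, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines carry metadata in the form "# key = value"
            key, _, value = line[1:].partition("=")
            metadata[key.strip()] = value.strip()
        else:
            # Token lines: one tab-separated value per column
            tokens.append(dict(zip(COLUMNS, line.split("\t"))))
    if tokens:
        yield metadata, tokens


# Made-up sample in the same layout as the files written above
sample = io.StringIO(
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC NAMEDENTITY\n"
    "# newdoc id = example\n"
    "# turn = 1\n"
    "# sent = 1\n"
    "# text = Hi there\n"
    "# speaker = A\n"
    "# sentiment = 1\n"
    "1\tHi\thi\tINTJ\tUH\t_\t0\troot\t_\t_\tO\n"
    "2\tthere\tthere\tADV\tRB\t_\t1\tadvmod\t_\t_\tO\n"
    "\n"
)
sentences = list(read_sentences(sample))
for meta, toks in sentences:
    print(meta["speaker"], meta["sentiment"], [t["FORM"] for t in toks])
```

Note that all values, including `sentiment` and `HEAD`, come back as strings here and would need to be converted before any numeric analysis.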
