# Data management (individual QDPX project $\to$ CSV files)

STAGE 1 OF THE DATA PIPELINE

Take an ATLAS.ti project and extract the annotations into a long-form CSV file (plus auxiliary information in other CSV files).

Things that happen in this script:
1. Walk XML trees in the QDPX project and generate corresponding rectangular dataframes
2. Combine `Codebook` (code info) and `Sources` (document + annotation info) into a single dataframe
3. Translate each "guid" into the human-readable interpretation (e.g., code or document name)
4. Filter out (document, annotator) pairs not listed as "completed" in the Google spreadsheet
5. Extract full quote text from chat transcripts (those stored in the XML file are truncated to a certain number of characters)
6. In cases where the annotator failed to highlight full lines of text, fill out quotes using the chat transcript
7. Extract the speaker identity from the quote text (as a nonnegative integer)

## Flags

In [None]:
output = False

input_version = 2   # 2, 3, or 4 (different versions have different documents)
input_release = 2   # this increments with updates to the data
output_version = 17

descriptions = (input_version == 2) # True for v2, False for v3 and v4

## Baseline setup

In [None]:
import xml.etree.ElementTree as ET
import os
import numpy as np
import pandas as pd
import itertools

In [None]:
ET.VERSION

In [None]:
datadir = "../data"
qdpxdir = "full-project-data-{}.{}".format(input_version, input_release)
inputdocsdir = "sources"
qdefile = "project.qde"

document_metadata_file = os.path.join(datadir, "annotation-timeline.csv")

outputparentdir = "../output"
outputchilddir = "v{}".format(output_version)
outputdir = os.path.join(outputparentdir, outputchilddir)

# stage 1
usersfile = "users.csv"
codesfile = "codes.csv"
sourcesfile = "raw-masked-annotations.csv"
samplesourcesfile = "small-" + sourcesfile
notesfile = "notes.csv"
linksfile = "links.csv"
setsfile = "sets.csv"

# stage 2
speakererrorsfile = "ambiguous-speaker-quotations.txt"
annotationsfile = "human-readable-annotations.csv"
sampleannotationsfile = "small-" + annotationsfile

In [None]:
if output:
    try:
        os.mkdir(outputparentdir)
    except FileExistsError:
        print("High-level output directory already exists; no action taken.")
    
    try:
        os.mkdir(outputdir)
    except FileExistsError:
        print("WARNING: low-level output directory already exists. You might want to increment your version number.")

## Read in the raw data
We'll read the whole file into one big tree structure, then take a look at it.
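As a minimal sketch of what the parsed tree looks like, the snippet below parses a tiny hand-written document in the same `urn:QDA-XML:project:1.0` namespace and lists each top-level child with the namespace stripped. The sample XML and the helper `local_name` are illustrative assumptions, not part of the project file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature project; the real project.qde is far larger.
SAMPLE = """<Project xmlns="urn:QDA-XML:project:1.0" name="demo">
  <Users><User guid="U1" name="Annotator_0"/></Users>
  <CodeBook><Codes><Code guid="C1" name="greeting"/></Codes></CodeBook>
</Project>"""

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on every tag."""
    return tag.split("}", 1)[1] if "}" in tag else tag

sample_root = ET.fromstring(SAMPLE)
top_level = [local_name(child.tag) for child in sample_root]
print(local_name(sample_root.tag), top_level)
```

Every tag in the real tree carries the same `{urn:QDA-XML:project:1.0}` prefix, which is why later cells compare against the fully qualified tag names.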

In [None]:
fin = os.path.join(datadir, qdpxdir, qdefile)
tree = ET.parse(fin)
root = tree.getroot()

In [None]:
root.tag

In [None]:
root.attrib

In [None]:
for child in root:
    print(child.tag, "\n\t", child.attrib)

### 0. Users

In [None]:
# only the Users
for user in root[0]: # "User"
    print("User", user.attrib)

### 1. Codebook

In [None]:
# only the Codebook > Codes
# there are 70 of these - we'll only look at the first 3
# root[1] is the CodeBook node
# root[1][0] is its only child node - the Codes node - and its own children are Code nodes
# root[1][0][0] is a Code node - access its name/label using `root[1][0][0].attrib["name"]`
for code in root[1][0][0:3]: # "Code"
    print("Code", code.attrib)

In [None]:
for code in root[1][0][4]: # "Code"
    print("Sub-Code", code.attrib)

In [None]:
for code in root[1][0][4][0]: # "Code"
    print("Sub-Sub-Code", code.attrib)

### 2. Sources (annotated documents)

In [None]:
# doing "only the Sources (annotated documents)" would give a lot of
# output, so instead I'm only doing the first 3 quotes of the first
# document

print("NOTE: These aren't the real XML tags! They were too long.\n")

doc = root[2][0] # "TextSource"
print("Doc", doc.attrib, "\n")
for quote in doc[0:3]: # "PlaintextSelection"
    print("\tQuote", quote.attrib)
    for code in quote: # "Coding"
        print("\t\tCode", code.attrib)
        for ref in code: # "CodeRef"
            print("\t\t\tCodeRef", ref.attrib)
    print()

### 3. Notes

In [None]:
# only the Notes (the first 3)
for note in root[3][0:3]: # "Note"
    print("Note", note.attrib, "\n")

### 4. Links

In [None]:
# only the Links
if not input_version in {4}:
    for link in root[4]: # "Link"
        print("Link", link.attrib, "\n")

### 5. Sets

In [None]:
# only the Sets
if not input_version in {4}:
    for codeSet in root[5]: # "Set"
        print("Set", codeSet.attrib)
        for code in codeSet[0:min(len(codeSet), 3)]: # "MemberCode"
            print("\tMemberCode", code.attrib)

## Create raw dataframes (sometimes within directories)
Here, we must split the big XML file into subtrees before reading
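The pattern used throughout this section can be sketched on a toy subtree: serialize one subtree with `ET.tostring`, then let `pd.read_xml` turn its children into rows. The sample XML is made up. Wrapping the bytes in `io.BytesIO` and passing `parser="etree"` are assumptions chosen so that the sketch runs without `lxml` and on pandas versions that no longer accept literal XML input.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical miniature project with a single Users subtree.
SAMPLE = """<Project xmlns="urn:QDA-XML:project:1.0">
  <Users>
    <User guid="U1" name="Annotator_0"/>
    <User guid="U2" name="Annotator_1"/>
  </Users>
</Project>"""

sample_root = ET.fromstring(SAMPLE)

# Serialize only the Users subtree, then read its children as rows.
users_bytes = ET.tostring(sample_root[0], encoding="utf8", method="xml")
users_df = pd.read_xml(io.BytesIO(users_bytes), parser="etree")
print(users_df)
```

Each child element becomes one row and each XML attribute becomes one column, which is why flat subtrees such as the users need no further processing.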

### 0. Users

In [None]:
# cols: guid, name
usersStr = ET.tostring(root[0], encoding='utf8', method='xml')

usersDa = pd.read_xml(usersStr)

usersDa

### 1. Codebook
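The cells in this section flatten the nested codebook into one row per leaf code, recording the path of names that leads to it. A minimal sketch of that idea on a made-up codebook follows. Unlike the notebook's `dfs`, which keeps a shared `trace` stack, this hypothetical `leaf_rows` helper passes the path down as an argument.

```python
import xml.etree.ElementTree as ET

NS = "{urn:QDA-XML:project:1.0}"

# Hypothetical nested codebook with one Description node to skip.
SAMPLE = """<Codes xmlns="urn:QDA-XML:project:1.0">
  <Code guid="C1" name="social">
    <Code guid="C2" name="greeting"><Description/></Code>
    <Code guid="C3" name="farewell"/>
  </Code>
  <Code guid="C4" name="technical"/>
</Codes>"""

def leaf_rows(node, path=()):
    """Yield (guid, 'a > b > c') for every leaf Code at or below `node`."""
    # Only Code children count; Description children are ignored.
    children = [c for c in node if c.tag == NS + "Code"]
    here = path + (node.attrib["name"],)
    if not children:
        yield node.attrib["guid"], " > ".join(here)
    for child in children:
        yield from leaf_rows(child, here)

codes = ET.fromstring(SAMPLE)
rows = [row for top in codes for row in leaf_rows(top)]
print(rows)
```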

In [None]:
root[1][0][3][0].attrib # note: the codebook is nested (codes can contain sub-codes)

In [None]:
def is_description(node):
    if node.tag == "{urn:QDA-XML:project:1.0}Description":
        assert(len(node) == 0 and len(node.attrib) == 0)
        return True
    return False

In [None]:
def is_leaf(node):
    if len(node) == 0:
        return True
    if len(node) == 1 and is_description(node[0]):
        return True
    return False

In [None]:
codebookDepth = 0
codebookSize = 0 # only leaf nodes
bfsq = [(root[1][0], 0)]
curDepth = 1

while len(bfsq) > 0:
    # debugging
    prevDepth = curDepth
    
    # regular stuff
    curNode, curDepth = bfsq.pop(0)
    
    # debugging
    if prevDepth != curDepth and curDepth > 0:
        print("\n\n{}.".format(curDepth), end = " ")
    try:
        print(curNode.attrib["name"], end = "     ")
    except(KeyError):
        print("FIXME (Node type: {}; Attributes: {})".format(curNode.tag.split("}")[1], curNode.attrib), end = "     ")
    
    # remove Description nodes, as they're annoying
    if is_description(curNode):
        continue
    
    # regular stuff
    codebookDepth = curDepth
    if is_leaf(curNode):
        codebookSize += 1
    for child in curNode:
        bfsq.append((child, curDepth+1))
    
    # debugging
    #if not "name" in curNode.attrib.keys():
    #    print(curNode.tag, end=", ")
    #    print(curNode.attrib)

print("\n")
print("Codebook depth: {}".format(codebookDepth))
print("Number of interesting leaf nodes: {}".format(codebookSize))

In [None]:
codebookCols = ["guid", "color", "isCodable", "name"] + ["lvl_{}".format(k+1) for k in range(codebookDepth)]
print(codebookCols)

In [None]:
codebookArr = [None] * codebookSize

In [None]:
def get_node_field(node, field):
    try:
        return node.attrib[field]
    except(KeyError):
        if node.tag == "{urn:QDA-XML:project:1.0}Codes":
            return "*"
        return None

In [None]:
trace = []
idx = 0

# pre-order dfs
def dfs(cur, idx):
    if not is_description(cur):
        # leaf node
        if is_leaf(cur):
            # debugging
            #print("{}. ".format(idx) + " > ".join([get_node_field(node, "name") for node in trace]))
            names = [get_node_field(node, "name") for node in trace]
            row = [get_node_field(cur, col) for col in codebookCols[:3]] + [" > ".join(names)] + names
            codebookArr[idx] = row
            idx += 1
        # internal node
        else:
            for child in cur:
                trace.append(child)
                idx = dfs(child, idx)
    trace.pop()  # backtrack: remove the current node from the path
    return idx

for node in root[1][0]:
    trace.append(node)
    idx = dfs(node, idx)

In [None]:
codebookArr[-2]

In [None]:
codebookDa = pd.DataFrame(data = codebookArr, columns = codebookCols)

In [None]:
codebookDa.head()

### 2. Sources (annotated documents)

In [None]:
# cols: guid, name, creatingUser, creationDateTime, plainTextPath, richTextPath
sourcesStr = ET.tostring(root[2], encoding='utf8', method='xml')

sourcesDa = pd.read_xml(sourcesStr)

assert(sourcesDa["PlainTextSelection"].dropna().shape[0] == 0)
sourcesDa = sourcesDa.drop("PlainTextSelection", axis=1)

sourcesDa

In [None]:
#sourcesDa["Description"].value_counts() # just checking

In [None]:
# Sources are deeply nested compared to the other stuff. This directory will look like:
# sourcesDir = {Doc guid > (quotesDa, quotesDir)} where for each Doc,
# quotesDir = {Quote guid > codesDa} where for each Quote, codeRefs have been pivoted
#                                    into codesDa (I think there's only one per code)
sourcesDir = {}

In [None]:
for k in range(5):
    print(root[2][0][k].attrib["name"])

In [None]:
# a quote
quote = root[2][0][0]
print(quote.tag)
quote.attrib # earlier exports lacked the quote text; it is now stored in the "name" attribute (see the cell above)

In [None]:
for idx, ann in enumerate(quote):
    print(idx, ":", ann)
    print(ann.attrib)
    print()

In [None]:
# an annotation (coding)
ann = quote[0]
print(ann.tag)
ann.attrib

In [None]:
# a code reference
ref = ann[0]
print(ref.tag)
ref.attrib

In [None]:
def get_doc_text(idx):
    """
    Return the full text of a source document as a string.

    idx : index of the document (order of upload to Atlas)
    """
    fin = sourcesDa.at[idx, "plainTextPath"].split("//")[1]
    fin = os.path.join(datadir, qdpxdir, inputdocsdir, fin)

    # the context manager closes the file even if reading fails
    with open(fin) as f:
        return f.read()

In [None]:
#quote.attrib
"targetGUID" in code.attrib.keys()

In [None]:
doc.attrib

In [None]:
codebookDa[codebookDa["lvl_2"].isna()]

In [None]:
# cols of these: guid, name, creatingUser, creationDateTime, startPosition, endPosition
c = 0
for doc_idx, doc in enumerate(root[2]):
    # print(doc.attrib["name"])
    quotesStr = ET.tostring(doc, encoding='utf8', method='xml')
    
    try:
        quotesDa = pd.read_xml(quotesStr)
        try:
            assert(quotesDa["Coding"].dropna().shape[0] == 0) # should be able to do this with len() instead
            quotesDa = quotesDa.drop("Coding", axis=1)
        except(KeyError):
            print("WARNING: Document {} has no Coding's.\n".format(doc.attrib["name"]))
    
    except(ValueError):
        # document hasn't been annotated
        sourcesDa.drop(sourcesDa.index[sourcesDa["guid"] == doc.attrib["guid"]], inplace=True)
        continue
    
    # remove rows that are just Descriptions
    if "Description" in quotesDa.columns:
        print("Description found -- File: {}, GUID: {}".format(doc.attrib["name"], doc.attrib["guid"]))
        assert(doc.attrib["guid"] == "B889885D-0073-4C1F-B748-106B7C01FD10") # THIS MAY CHANGE
        quotesDa = quotesDa[quotesDa["Description"].isna()].drop(columns=["Description"])
        #display(quotesDa)
    
    quotesDir = {}
    
    for quote in doc:
        if c < 3:
            print("quote: {}".format(quote.attrib["guid"]))
            c += 1
        codesStr = ET.tostring(quote, encoding='utf8', method='xml')
        try:
            codesDa = pd.read_xml(codesStr)
        except(ValueError):
            # Sometimes we delete a coding but the orphaned quotation
            # stays in the file. This is uninteresting so we skip it.
            if "guid" in quote.attrib.keys():
                quotesDa = quotesDa.loc[quotesDa["guid"] != quote.attrib["guid"]].reset_index(drop=True)
            continue
        
        # initialize derived columns to store information in XML child nodes
        codesDa["isCode"] = False
        codesDa["isNote"] = False
        codesDa["CodeRef.targetGUID"] = np.nan
        codesDa["NoteRef.targetGUID"] = np.nan
        
        # upward reference to the quote that all these codes were assigned to
        codesDa["quoteGUID"] = quote.attrib["guid"]
        
        # code- and note-specific actions
        for idx, coding in enumerate(quote): # "Coding"
            if len(coding) == 1:
                code = coding[0]
            else:
                # abort: everything below assumes exactly one Ref per Coding
                raise ValueError("Coding {} has {} Refs (expected exactly 1)".format(coding.attrib, len(coding)))
            
            if code.tag == "{urn:QDA-XML:project:1.0}CodeRef":
                codesDa.at[idx, "isCode"] = True
                if "targetGUID" in code.attrib:
                    codesDa.at[idx, "CodeRef.targetGUID"] = code.attrib["targetGUID"]
                else:
                    # debugging
                    warning = "WARNING: Document {} > Quote \"{}\" > Code \"{}\" has {} references\n".format(
                        doc.attrib["name"], 
                        quote.attrib["name"], 
                        code.attrib, 
                        len(code))
                    print(warning)
            elif code.tag == "{urn:QDA-XML:project:1.0}NoteRef":
                codesDa.at[idx, "isNote"] = True
                codesDa.at[idx, "NoteRef.targetGUID"] = code.attrib["targetGUID"]
            else:
                # debugging
                warning = "WARNING: unrecognized XML tag in Document {} > Quote {} > {} {}\n".format(
                    doc.attrib["name"], 
                    quote.attrib["name"], 
                    code.tag, 
                    code.attrib)
                print(warning)
        
        # default-initialize any columns we need for merging later
        tagset = {coding[0].tag for coding in quote}
        
        if not "{urn:QDA-XML:project:1.0}CodeRef" in tagset:
            codesDa["guid"] = np.nan
            codesDa["creatingUser"] = np.nan
            codesDa["creationDateTime"] = np.nan
        if not "{urn:QDA-XML:project:1.0}NoteRef" in tagset:
            codesDa["targetGUID"] = np.nan
        else:
            print("Document {} > quote {} has notes".format(doc.attrib["name"], quote.attrib["name"]))
        
        # write output
        quotesDir[quote.attrib["guid"]] = codesDa

    # write more output
    sourcesDir[doc.attrib["guid"]] = (quotesDa, quotesDir)

In [None]:
# This cell only runs correctly for Project Version 3 (I hardcoded the index for testing)
if input_version in {3}:
    quotesDa = sourcesDir["E266595E-8846-4BB3-904F-A818FDD5DC0B"][0]
    display(quotesDa[quotesDa["guid"] == "D087E98C-3500-429E-A6A0-43EB9388E7B1"])
    display(quotesDa[quotesDa["guid"] == "B4E46844-CBFA-431C-BCFD-3EB82155E6CA"])
    display(quotesDa[quotesDa["guid"] == "799E4C5D-76F8-4CFA-A438-BE8A1B92157B"])

In [None]:
quotesDa.head(2)

In [None]:
codesDa

In [None]:
coding.attrib

In [None]:
code.attrib

In [None]:
quote.attrib

In [None]:
d = 0
q = 1
(sourcesDir[list(sourcesDir.keys())[d]][q])[list(sourcesDir[list(sourcesDir.keys())[d]][q].keys())[2]]
#len(sourcesDir[list(sourcesDir.keys())[d]][q])

In [None]:
tagset

In [None]:
sourcesDa.shape

See the first dataframe below for a "standard" `quotesDa` (all elements are either `Coding`s or `NoteRef`s).

See the second dataframe below for a "standard" `codesDa`.

In [None]:
# double check I did it right
#display(sourcesDir["378A15D0-C2D3-4E73-AC3E-DC9B260BD9D4"][0].head(3)) # quotesDa
#display(sourcesDir["378A15D0-C2D3-4E73-AC3E-DC9B260BD9D4"][1]["49EB5814-CAAE-43DD-B03D-E77B98C7753C"]) # codesDa
src = list(sourcesDir.keys())[0]
display(sourcesDir[src][0].head(3))
display(sourcesDir[src][1][list(sourcesDir[src][1].keys())[0]]) # codesDa

### 3. Notes

In [None]:
# cols: guid, name, creatingUser, creationDateTime, modifyingUser, modifiedDateTime, plainTextPath, richTextPath
notesStr = ET.tostring(root[3], encoding='utf8', method='xml')

notesDa = pd.read_xml(notesStr)

notesDa

### 4. Links

In [None]:
# cols: guid, name, color, direction, originGUID, targetGUID
if not input_version in {4}:
    linksStr = ET.tostring(root[4], encoding='utf8', method='xml')

    linksDa = pd.read_xml(linksStr)

    display(linksDa)

### 5. Sets

In [None]:
# cols: name, guid
if not input_version in {4}:
    setsStr = ET.tostring(root[5], encoding='utf8', method='xml')

    setsDa = pd.read_xml(setsStr)

    memberTypes = list(setsDa.columns)[2:]

    for memberType in memberTypes:
        newColName = memberType + ".targetGUIDs"
        setsDa.rename(columns={memberType : newColName}, inplace=True)
        setsDa[newColName] = "N/A"

    display(setsDa)

In [None]:
if not input_version in {4}:
    for idx, codeSet in enumerate(root[5]):
        members = {member.tag : [] for member in codeSet}
        for member in codeSet:
            members[member.tag].append(member.attrib["targetGUID"])
        #print(members, "\n")
        for tag, targetGUIDs in members.items():
            col = tag.split('}')[1] + ".targetGUIDs"
            setsDa.at[idx, col] = targetGUIDs

    display(setsDa)

## Now that we have all the data out of XML, we need to consolidate it
Specifically, No. 2: Sources

### 2. Sources
We want to merge all the different dictionaries and dataframes into a single dataframe of annotations.
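As a rough sketch of the consolidation for a single document: stack the per-quote annotation frames, then join them to the quote table on the quote GUID, prefixing columns so that the two `guid` columns stay distinguishable. The frames below are hypothetical stand-ins for `quotesDa` and `quotesDir`.

```python
import pandas as pd

# Hypothetical stand-ins for one document's quotesDa and quotesDir.
quotes = pd.DataFrame({"guid": ["q1", "q2"], "startPosition": [0, 40]})
codes_by_quote = {
    "q1": pd.DataFrame({"guid": ["a1", "a2"], "quoteGUID": ["q1", "q1"]}),
    "q2": pd.DataFrame({"guid": ["a3"], "quoteGUID": ["q2"]}),
}

# One row per annotation, across all quotes in the document.
codes = pd.concat(list(codes_by_quote.values()), ignore_index=True)

# Outer join keeps quotes without annotations (and vice versa) visible.
merged = codes.add_prefix("annotation.").merge(
    quotes.add_prefix("quote."),
    left_on="annotation.quoteGUID",
    right_on="quote.guid",
    how="outer",
)
print(merged)
```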

In [None]:
# the relevant data structures (for now) are sourcesDa and sourcesDir

In [None]:
# sourcesDir = doc guid -> (quotesDa, quotesDir)
# quotesDa = quote guid x [text, start, end, time, doc, etc.]
# quotesDir = quote guid -> codesDa
# codesDa = code/noteref x [code vs note flag, note target guid, quote guid]

In [None]:
# reminding myself what they look like...
src_guid_ex = sourcesDa["guid"][0]
quotesDa_ex = sourcesDir[src_guid_ex][0]
quotesDir_ex = sourcesDir[src_guid_ex][1]
quote_guid_ex = list(quotesDir_ex.keys())[0]
codesDa_ex = quotesDir_ex[quote_guid_ex]

display("sources", sourcesDa.head(3)) # documentsDa
display("quotes", quotesDa_ex.head(3)) # quotesDa
display("annotations", codesDa_ex)

In [None]:
#"22BDC312-CA2D-47C3-ABF8-453195276C54" in sourcesDa["guid"]

In [None]:
#"22BDC312-CA2D-47C3-ABF8-453195276C54" in quotesDa_ex["guid"]

In [None]:
#"22BDC312-CA2D-47C3-ABF8-453195276C54" in codesDa_ex["targetGUID"]

In [None]:
quotesDa_ex.shape

In [None]:
len(quotesDir_ex)

In [None]:
# FIXME: why do I have a CodeRef column and both {}.targetGUID columns, but
# no NoteRef column? Need to check whether Notes were taken at all
for key, val in quotesDir_ex.items():
    display(val)
    break

In [None]:
count = 0
for doc, (quotesDa, quotesDir) in sourcesDir.items():
    #display(quotesDa)
    for quote, codesDa in quotesDir.items():
        count += 1
        #print(quote)
        #display(codesDa)
    print("Finished document {} (total {} quotes)".format(doc, count))
print(count)

In [None]:
"B889885D-0073-4C1F-B748-106B7C01FD10" in sourcesDir.keys()

In [None]:
docDir = {}

for doc, (quotesDa, quotesDir) in sourcesDir.items():
    #da = None
    #for quote, codesDa in quotesDir.items():
    #    tmpDa = codesDa.set_index("guid")
    #    if da is None:
    #        da = tmpDa
    #    else:
    #        da = da.append(tmpDa)
    #print(doc)
    codesDa = pd.concat(quotesDir.values(), ignore_index=True)
    docDir[doc] = codesDa.add_prefix("annotation.").merge(quotesDa.add_prefix("quote."), 
                                left_on="annotation.quoteGUID", 
                                right_on="quote.guid", 
                                suffixes=("__ERROR-left", "__ERROR-right"), 
                                how="outer")
    docDir[doc]["quote.documentGUID"] = doc

In [None]:
display(docDir[src_guid_ex].head(3))

In [None]:
da = pd.concat(docDir.values(), ignore_index=True).astype({"quote.startPosition": "int64", 
                                                           "quote.endPosition": "int64"})

In [None]:
if descriptions:
    display(sourcesDa[sourcesDa["Description"].notna()])

In [None]:
if descriptions:
    display(sourcesDa["Description"].value_counts())

In [None]:
da = da.merge(sourcesDa.add_prefix("document."), 
              left_on="quote.documentGUID", 
              right_on="document.guid", 
              suffixes=("__ERROR-left", "__ERROR-right"), 
              how="outer")
da.shape

In [None]:
da["annotation.isCode"].value_counts()

In [None]:
da["annotation.isNote"].value_counts()

In [None]:
pd.set_option("display.max_columns", None)
display(da.head(5))
pd.reset_option("max_columns")

In [None]:
assert(len(da["annotation.CodeRef"].value_counts()) == 0) # remove this in the next cell
display(da["annotation.isCode"].value_counts()) # FIXME check that the mechanism I'm using to decide this is still valid
assert(len(da["annotation.targetGUID"].value_counts()) == 0) # remove this in the next cell

# these are fine, just rare
if descriptions:
    #display(da["quote.Description"].value_counts()) # FIXME this makes input version 5 break
    display(da["document.Description"].value_counts())

In [None]:
da = da.drop(columns=["annotation.CodeRef", "annotation.targetGUID"])

In [None]:
# from annotation.CodeRef.guid
codebookDa[codebookDa["guid"] == "AE184BD2-6DF4-492B-B4FE-F7D446C30B51"] # yay!

### Output sources and all the other data as-is

In [None]:
if output:
    usersDa.to_csv(os.path.join(outputdir, usersfile))
    codebookDa.to_csv(os.path.join(outputdir, codesfile))
    da.to_csv(os.path.join(outputdir, sourcesfile))
    da.head(20).to_csv(os.path.join(outputdir, samplesourcesfile)) # for easy visualization on GitHub
    notesDa.to_csv(os.path.join(outputdir, notesfile))
    if not input_version in {4}:
        linksDa.to_csv(os.path.join(outputdir, linksfile))
        setsDa.to_csv(os.path.join(outputdir, setsfile))

## Consolidate even more

Instead of 5 dataframes, we want 1 (or $<$5).

### Drop value-less columns

In [None]:
dropcols = []
for col in da.columns:
    #print(col, ":", len(da[col].unique()))
    if len(da[col].unique()) == 1:
        dropcols = dropcols + [col]
print(dropcols)
da1 = da.drop(columns=dropcols)

In [None]:
equiv_cols = {#"annotation.quoteGUID" : "quote.guid", 
              #"annotation.targetGUID" : "annotation.NoteRef.targetGUID", 
              "quote.documentGUID" : "document.guid"}

for left, right in equiv_cols.items():
    if (da1[left].eq(da1[right]) | (da1[left].isna() & da1[right].isna())).all():
        da1.drop(columns=left, inplace=True)
    else:
        print("oops, {} doesn't always equal {}".format(left, right))
        display(da1[da1[left].ne(da1[right])][left].value_counts())
        display(da1[da1[left].ne(da1[right])][right].value_counts())

In [None]:
rename = {"annotation.creatingUser" : "annotation.creatingUserGUID", 
          "quote.creatingUser" : "quote.creatingUserGUID", 
          "quote.modifyingUser" : "quote.modifyingUserGUID", 
          "document.creatingUser" : "document.creatingUserGUID"}
da1.rename(columns=rename, inplace=True)

In [None]:
pd.set_option("display.max_columns", None)
display(da1.head(5))
pd.reset_option("max_columns")

### Translate GUIDs into words, where possible
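The `guid_to_identifier` function defined in this section filters the whole lookup table once per GUID. A possible alternative, sketched here and not what the notebook does, builds a dictionary once and fails loudly on duplicate or unknown GUIDs. The names `build_lookup` and `translate` and the sample GUIDs are hypothetical.

```python
import math

def build_lookup(pairs, id_type):
    """Map guid -> name, raising if a guid appears more than once."""
    lookup = {}
    for guid, name in pairs:
        if guid in lookup:
            raise ValueError("duplicate {} guid {}".format(id_type, guid))
        lookup[guid] = name
    return lookup

def translate(guid, lookup, id_type):
    """Return the name for `guid`; missing values (None/NaN) come back as NaN."""
    if guid is None or (isinstance(guid, float) and math.isnan(guid)):
        return float("nan")
    if guid not in lookup:
        raise KeyError("no {} with guid {}".format(id_type, guid))
    return lookup[guid]

users = build_lookup([("U1", "Annotator_0"), ("U2", "Annotator_1")], "user")
print(translate("U2", users, "user"))
```

The trade-off is that the dictionary must be rebuilt whenever the lookup table changes, whereas the dataframe query always sees the current table.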

In [None]:
def guid_to_identifier(guid, df, guid_col, id_col, id_type):
    rows = df[df[guid_col] == guid].reset_index()
    if len(rows) != 1:
        #if guid is np.nan or guid is None:
        if pd.isnull(guid):
            return np.nan
        err = "ERROR query for {} guid {} produced {} results with the following identifier(s): \n\t{}".format(
            id_type, guid, len(rows), "\n\t".join(rows[id_col]))
        raise Exception(err)
    return rows.at[0, id_col]

In [None]:
print(guid_to_identifier("57500D78-CB6B-4955-9A3C-4A3940F6263A", usersDa, "guid", "name", "user"))

In [None]:
# NOTE this code block is *supposed* to produce an error
try:
    print(guid_to_identifier("fake-guid", usersDa, "guid", "name", "user"))
except Exception as e:
    assert(str(e).startswith('ERROR query for user guid fake-guid produced 0 results with the following identifier(s)'))

In [None]:
def guid_to_user(guid):
    return guid_to_identifier(guid, usersDa, "guid", "name", "user")

In [None]:
def guid_to_code(guid):
    return guid_to_identifier(guid, codebookDa, "guid", "name", "code")

In [None]:
def guid_to_note(guid):
    return guid_to_identifier(guid, notesDa, "guid", "name", "note")

In [None]:
# test them each once
print(guid_to_user("8F219B13-6EC7-4DBD-A8B7-73F4C1A66B69"))
print(guid_to_code("AE184BD2-6DF4-492B-B4FE-F7D446C30B51"))
#print(guid_to_note("F3ACD375-0E92-4324-BE15-727C4651C1EE")) # not using notes anymore apparently

In [None]:
# this is for display purposes - to visualize the columns with GUID values
cols = ["annotation.creationDateTime", 
        #"annotation.isCode", # got rid of these cause we only have Codes now(?)
        #"annotation.isNote", 
        "quote.name", # not sure why this started causing errors all of a sudden
        "quote.creationDateTime",
        "quote.startPosition", 
        "quote.endPosition", 
        "quote.modifiedDateTime", 
        "document.name", 
        "document.creationDateTime", 
        "document.plainTextPath", 
        "document.richTextPath"]
display(da1.drop(columns=[col for col in cols if col in da1.columns]).head(3))

In [None]:
#print(da1["annotation.NoteRef.targetGUID"].value_counts(), "\n")
print(da1["quote.modifyingUserGUID"].value_counts(), "\n")

#### Users

In [None]:
# see what's going on
#da1[da1["annotation.creatingUserGUID"].isna()]

In [None]:
# all the user-based ones
# Series.map applies the function elementwise (DataFrame.applymap is deprecated in recent pandas)
da1["annotation.creatingUser"] = da1["annotation.creatingUserGUID"].map(guid_to_user)
da1["quote.creatingUser"] = da1["quote.creatingUserGUID"].map(guid_to_user)
da1["quote.modifyingUser"] = da1["quote.modifyingUserGUID"].map(guid_to_user)
if not input_version in {2, 4}:
    da1["document.creatingUser"] = da1["document.creatingUserGUID"].map(guid_to_user)

In [None]:
pd.set_option("display.max_columns", None)
display(da1.head(3))
pd.reset_option("max_columns")

In [None]:
print(da1["annotation.creatingUser"].value_counts(), "\n") # cool!

#### Annotations (Codes and Notes)

In [None]:
da1[da1["annotation.CodeRef.targetGUID"].isna()]

In [None]:
# all the code-based ones
da1["annotation.CodeRef.target"] = da1["annotation.CodeRef.targetGUID"].map(guid_to_code)
da1.head(3)

In [None]:
# all the note-based ones (there's only one)
#da1["annotation.NoteRef.target"] = da1[["annotation.NoteRef.targetGUID"]].applymap(guid_to_note)["annotation.NoteRef.targetGUID"]
#da1.tail(3)

In [None]:
da1.columns

In [None]:
cols = ["quote.name", 
        "annotation.isCode", 
        "annotation.isNote", 
        "annotation.CodeRef.target", 
        "annotation.NoteRef.target", 
        "annotation.creatingUser", 
        "annotation.creationDateTime", 
        "quote.startPosition", 
        "quote.endPosition", 
        "quote.creatingUser", 
        "quote.creationDateTime", 
        "quote.modifyingUser", 
        "quote.modifiedDateTime", 
        "document.name", 
        "document.creatingUser", 
        "document.creationDateTime", 
        "document.modifyingUser",
        "document.modifiedDateTime", 
        "document.plainTextPath", 
        "document.richTextPath", 
        "annotation.guid", 
        "annotation.CodeRef.targetGUID", 
        "quote.guid", 
        "document.guid"]

da2 = da1.copy()
da2 = da2[[col for col in cols if col in da2.columns]]

rename = {"quote.name" : "quote.text",
          "annotation.CodeRef.target" : "annotation.code", 
          #"annotation.NoteRef.target" : "annotation.note", 
          "annotation.CodeRef.targetGUID" : "annotation.codeRef.guid"}

da2.rename(columns=rename, inplace=True)

In [None]:
pd.set_option("display.max_columns", None)
display(da2.head(3))
pd.reset_option("max_columns")

### Throw out document-annotator pairs not marked as part of the intentional dataset
This removes documents that are incomplete, annotated under different schemes, etc.

From now on, we only work with data from documents whose annotations are complete according to the spreadsheet.
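The filter can be sketched without pandas as a set-membership test on (document, annotator) pairs. The rows and the completion table below are made up; in the notebook they come from the annotation dataframe and the metadata spreadsheet.

```python
# Hypothetical annotation rows and per-document completion flags.
rows = [
    {"document": "doc_a", "annotator": "Annotator_0", "code": "greeting"},
    {"document": "doc_a", "annotator": "Annotator_1", "code": "farewell"},
    {"document": "doc_b", "annotator": "Annotator_0", "code": "greeting"},
]
completed_by = {
    "doc_a": {"Annotator_0": True, "Annotator_1": False},
    "doc_b": {"Annotator_0": True, "Annotator_1": True},
}

# Collect every (document, annotator) pair marked as completed.
completed_pairs = {
    (doc, annotator)
    for doc, flags in completed_by.items()
    for annotator, done in flags.items()
    if done
}

# Keep only rows whose pair is in the completed set.
kept = [r for r in rows if (r["document"], r["annotator"]) in completed_pairs]
print(len(rows), "->", len(kept))
```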

In [None]:
# print(da2[da2["document.name"].str.endswith(".txt")].shape)
# print(da2[~da2["document.name"].str.endswith(".txt")].shape)
# da3 = da2[da2["document.name"].str.endswith(".txt")]

In [None]:
annotators = ["Annotator_0", "Annotator_1", "Annotator_2"]

Read in the metadata file:

In [None]:
document_metadata = pd.read_csv(document_metadata_file)
document_metadata = document_metadata.set_index("Document Name")
document_metadata = document_metadata.drop(index="103", columns=["Unnamed: 10", "Unnamed: 11"])
document_metadata = document_metadata.fillna({"Notes" : ""})
document_metadata = document_metadata.fillna({annotator : False for annotator in annotators})

display(document_metadata.head(3))
# display(document_metadata.tail(8)) 

In [None]:
document_metadata["Annotator_2"].value_counts()

First pass: keep only the annotations whose document is listed in the metadata file. This filter is sound but incomplete; the (document, annotator) filter follows.

In [None]:
# sound but incomplete filtering
da3 = da2[da2["document.name"].isin(document_metadata.index.unique())]
da3 = da3.reset_index(drop=True)
print("{} to {}".format(da2.shape, da3.shape))
da3.head(1)

In [None]:
# read from the dataframe
du = pd.Series(list(zip(da3["document.name"], 
                        da3["annotation.creatingUser"].str.split(" ").str[0])), # first names only
               index = da3.index)

# read from the metadata
completed = np.concatenate([list(zip(document_metadata.index[document_metadata[annotator]], 
                                    itertools.repeat(annotator))) 
                           for annotator in annotators], axis=0)

# np.concatenate turned the tuples into a 2-D array; convert back to a list of (document, annotator) tuples
completed = completed.T
completed = list(zip(completed[0], completed[1]))
#print(len(completed))

fil = du.isin(completed)
#print(fil.value_counts())

print(da3.shape)
da3 = da3[fil]
print(da3.shape)

### Get more text from the original documents

Read in all the documents
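A minimal sketch of reading one source document follows, assuming the path value has the form `<scheme>//<file name>` (which is what the `split("//")[1]` in this notebook implies) and that the files are UTF-8 encoded. The helper `read_source`, the file name, and the encoding are assumptions for illustration.

```python
import os
import tempfile

def read_source(sources_dir, plain_text_path, encoding="utf-8"):
    """Read one source document given a '<scheme>//<file name>' style path."""
    fname = plain_text_path.split("//")[1]
    # The context manager closes the file even if reading fails.
    with open(os.path.join(sources_dir, fname), encoding=encoding) as f:
        return f.read()

# Demo with a temporary directory and a hypothetical path value.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "chat.txt"), "w", encoding="utf-8") as f:
        f.write("hello\nworld\n")
    text = read_source(tmp, "internal://chat.txt")
print(repr(text))
```

Passing the encoding explicitly avoids depending on the platform default, which matters because the quote positions are later used as indices into the decoded string.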

In [None]:
doctext = {}
docs = da3["document.plainTextPath"].unique()
docfnames = pd.Series(docs).str.split("//").str[1]

for i, doc in enumerate(docfnames):
    #print(doc)
    fin = os.path.join(datadir, qdpxdir, inputdocsdir, doc)
    with open(fin) as f:
        doctext[docs[i]] = f.read() # source document as a string

print(len(doctext), len(docs))

Force the `quote.text` column to contain the entire quote
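The idea is to ignore the truncated text stored in the XML and slice the full document between the quote's start and end positions. A small sketch with made-up data:

```python
# Hypothetical document text keyed by its path, plus one quote record.
doctext_demo = {
    "internal://chat.txt": "line one\nthe full quoted sentence\nline three\n",
}
quote = {
    "path": "internal://chat.txt",
    "start": 9,              # index of the first quoted character
    "end": 33,               # index one past the last quoted character
    "name": "the full quo",  # truncated text as stored in the XML
}

# Slice the source document to recover the untruncated quote.
full_text = doctext_demo[quote["path"]][quote["start"]:quote["end"]]
print(full_text)
```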

In [None]:
quote_info = pd.Series(zip(da3["document.plainTextPath"], da3["quote.startPosition"], da3["quote.endPosition"]), index=da3.index)
quote_text = quote_info.map(lambda v, doctext=doctext : doctext[v[0]][v[1]:v[2]])
da3["quote.text"] = quote_text

Extract the speaker
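Before the speaker can be read off, each quote is widened to the whole transcript entry that contains it. The sketch below mirrors the `rfind`/`find` logic of the following cells on a made-up transcript. The timestamp format, the helper name `expand_to_entry`, and the end-of-file guard are assumptions added for illustration.

```python
def expand_to_entry(text, start, end, marker="\n2019-"):
    """Widen [start, end) to the whole transcript entry containing it."""
    # Search up to start+5 so a quote that begins exactly at the timestamp
    # still finds the newline just before it; rfind's -1 (no match) becomes 0.
    entry_start = text.rfind(marker, 0, start + 5) + 1
    entry_end = text.find("\n", end)
    if entry_end == -1:  # quote runs to the end of the file
        entry_end = len(text)
    return entry_start, entry_end

# Hypothetical two-entry transcript.
demo = (
    "2019-01-01T10:00:00.000Z person 1: hello there\n"
    "2019-01-01T10:00:05.000Z person 2: hi, how are you?\n"
)
start = demo.index("how are")
end = start + len("how are")
s, e = expand_to_entry(demo, start, end)
print(demo[s:e])
```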

In [None]:
#parstart = quote_info.map(lambda v, doctext=doctext : doctext[v[0]].rfind("\n", 0, v[1]) + 1)
parstart = quote_info.map(lambda v, doctext=doctext : doctext[v[0]].rfind("\n2019-", 0, v[1]+5) + 1)
(parstart == da3["quote.startPosition"]).value_counts()

In [None]:
parend = quote_info.map(lambda v, doctext=doctext : doctext[v[0]].find("\n", v[2]))
(parend == da3["quote.endPosition"]).value_counts()

In [None]:
par_info = pd.Series(zip(da3["document.plainTextPath"], parstart, parend), index=da3.index)
par_text = par_info.map(lambda v, doctext=doctext : doctext[v[0]][v[1]:v[2]])

Debug the speaker scraping step
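The checks in the following cells rely on fixed character offsets. A regular-expression version of the speaker count might look like the sketch below. It assumes that each entry starts with a timestamp containing no spaces, optionally followed by a `code change :` or `executed code :` label, and it does not skip greeting lines the way the loop in this notebook does. The pattern and the helper `count_speakers` are hypothetical.

```python
import re
from collections import Counter

# Assumed entry header: a timestamp starting "2019-", an optional
# "code change :" or "executed code :" label, then "person <digits>".
HEADER = re.compile(
    r"^2019-\S+ (?:code change : |executed code : )?person (\d+)",
    re.IGNORECASE | re.MULTILINE,
)

def count_speakers(entry_text):
    """Count how often each speaker ID heads a line in `entry_text`."""
    return Counter(HEADER.findall(entry_text))

# Hypothetical three-line excerpt with two distinct speakers.
demo = (
    "2019-01-01T10:00:00.000Z person 1: ready?\n"
    "2019-01-01T10:00:05.000Z code change : person 2\n"
    "2019-01-01T10:00:09.000Z Person 1: done\n"
)
print(count_speakers(demo))
```

A quote is unambiguous when the resulting counter has exactly one key.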

In [None]:
par_text[0]

In [None]:
# check for items that don't have a speaker (usually they're code)
err = "**** MISSING SPEAKERS ****\n\n"
fil = ~par_text.str.lower().str.contains(" person ")
tmp = par_text[fil]
for i in tmp.index:
    err += "[{}] {}:{}, {}, {}, {}\n".format(i, parstart[i], parend[i], 
                                             da3.loc[i, "document.name"], 
                                             da3.loc[i, "annotation.creatingUser"], 
                                             da3.loc[i, "annotation.code"])
    err += tmp[i] + "\n\n"
print(err)

In [None]:
# check for items that don't have a speaker at the right index (results from the previous check are excluded)
# this is now redundant to the next cell
fil = ~fil & ~(par_text.str[24:32].str.lower() == " person ")
fil &= ~(par_text.str[24:46].str.lower() == " code change : person ")
fil &= ~(par_text.str[24:48].str.lower() == " executed code : person ")
tmp = par_text[fil]
tmp

In [None]:
# check for items that have multiple speakers (we don't exclude the previous ones here, as there are few)
#fil = ~fil & (par_text.str.find("2019-") != par_text.str.rfind("2019-"))
fil = par_text.str.find("2019-") != par_text.str.rfind("2019-")
tmp = par_text[fil]
print(len(tmp), "items checked")
problems = {}
for j in tmp.index: # this could be done more concisely using re, but I don't feel like scrolling all the way up to load the package
    s = tmp[j]
    i = 0
    speakers = {}
    while i != -1:
        k, l = None, None
        if s[i+24 : i+32].lower() == " person ":
            k = i + 32
            l = s.find(":", i+32)
            line = s[l+1:s.find("\n", l+1)].strip()
            if line == "(hello)" or line == "(bye)":
                k, l = None, None # skip these lines
        elif s[i+24 : i+46].lower() == " code change : person ": # the label is 22 characters long
            k = i+46
            l = s.find("\n", i+46)
        elif s[i+24 : i+48].lower() == " executed code : person ":
            k = i+47
            l = s.find("#########", i+47)
        else:
            #print("~[{}] {}:{}, {}, {}, {}".format(j, parstart[j], parend[j], 
            #                                      da3.loc[j, "document.name"], 
            #                                      da3.loc[j, "annotation.creatingUser"], 
            #                                      da3.loc[j, "annotation.code"]))
            #print(s[i:], "\n")
            problems[(da3.loc[j, "document.name"], da3.loc[j, "quote.startPosition"], 
                      da3.loc[j, "quote.endPosition"])] = (da3.loc[j, "document.plainTextPath"], 
                                                           parstart[j], parend[j], speakers.copy())
        
        if k is not None and l is not None and k < l < len(s) and s[k:l].strip().isdigit():
            key = s[k:l].strip()
            if key in speakers.keys():
                speakers[key] += 1
            else:
                speakers[key] = 1 # this is where the person ID is
        
        i = s.find("2019-", i+33)
    if len(speakers) != 1:
        #print("[{}] {}:{}, {}, {}, {}".format(j, parstart[j], parend[j], 
        #                                      da3.loc[j, "document.name"], 
        #                                      da3.loc[j, "annotation.creatingUser"], 
        #                                      da3.loc[j, "annotation.code"]))
        #print(speakers)
        #print(s, "\n")
        problems[(da3.loc[j, "document.name"], da3.loc[j, "quote.startPosition"], 
                  da3.loc[j, "quote.endPosition"])] = (da3.loc[j, "document.plainTextPath"], 
                                                       parstart[j], parend[j], speakers.copy())
    #else:
    #    print("[{}] ok".format(j))
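
For reference, a rough sketch of the `re` version mentioned in the loop's comment. It assumes each line opens with a 24-character timestamp starting with `2019-` (as the slice offsets above imply), and the sample lines are made up. Unlike the loop, it does not skip `(hello)`/`(bye)` lines, and `count_speakers` is a hypothetical helper name.

```python
import re
from collections import Counter

# A 24-character timestamp, an optional event label, then "PERSON <id>".
SPEAKER_RE = re.compile(
    r"^2019-.{19} (?:code change : |executed code : )?person\s+(\d+)",
    flags=re.IGNORECASE | re.MULTILINE,
)


def count_speakers(paragraph):
    """Count how many timestamped lines each speaker ID opens."""
    return Counter(SPEAKER_RE.findall(paragraph))


# Hypothetical paragraph spanning several timestamped lines.
sample = (
    "2019-01-01T00:00:00.000Z PERSON 0 : does this loop look right\n"
    "2019-01-01T00:00:05.000Z PERSON 1 : check the upper bound\n"
    "2019-01-01T00:00:09.000Z CODE CHANGE : PERSON 0\n"
    "for i in range(n):"
)

counts = count_speakers(sample)
print(counts)
print("ambiguous speaker:", len(counts) != 1)
```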

In [None]:
print(len(problems), "problem quotations")

In [None]:
err += "\n**** MULTIPLE SPEAKERS ****\n\n"
for i, ((doc, qstart, qstop), (path, pstart, pstop, speakers)) in enumerate(problems.items()):
    err += "{}) Document {} [{}:{}], speaker counts = {}".format(i, doc, qstart, qstop, speakers) + "\n"
    err += doctext[path][qstart:qstop] + "\n\n"
    if (qstart, qstop) != (pstart, pstop):
        err += "**context**\n"
        err += doctext[path][pstart:pstop] + "\n\n"
print(err)

Write the results

In [None]:
da3["quote.paragraphStartPosition"] = parstart
da3["quote.paragraphEndPosition"] = parend
da3["quote.paragraphText"] = par_text

In [None]:
# this gets the first speaker, which is usually fine (check the error output above to confirm)
speaker = par_text.str.lower().str.split(" person ").str[1].str.split(n=1).str[0].fillna("-1")
# different documents have different spacing around the ":", so it's sometimes left trailing by the above
speaker = speaker.str.split(":").str[0] # probs would have been more efficient to do this first but whatever
speaker.value_counts()

In [None]:
speaker = speaker.astype(np.int64)
da3["quote.speaker"] = speaker

In [None]:
# speaker -1 (no speaker found) is inferred to be the learner, because usually it's a code comment that we assume the learner wrote
# the `<= 0` test below flags it along with speaker 0
da3["quote.speakerIsLearner"] = speaker <= 0
da3["quote.speakerIsLearner"].value_counts()
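
A small worked example of the speaker extraction above, on made-up paragraphs (the real transcripts may differ). It only illustrates how the split chain handles the two spacings around ":" and the no-speaker case.

```python
import numpy as np
import pandas as pd

# Hypothetical paragraphs: two chat lines with different spacing, one code comment.
demo_text = pd.Series([
    "2019-01-01T00:00:00.000Z PERSON 1 : check the upper bound",
    "2019-01-01T00:00:05.000Z PERSON 0: spacing differs here",
    "    # off by one",
])

demo_speaker = (
    demo_text.str.lower()
    .str.split(" person ").str[1]   # text after the first " person " (NaN if absent)
    .str.split(n=1).str[0]          # first whitespace-delimited token
    .fillna("-1")                   # no speaker found
    .str.split(":").str[0]          # drop a trailing ":" when there was no space before it
    .astype(np.int64)
)
demo_is_learner = demo_speaker <= 0

print(demo_speaker.tolist())
print(demo_is_learner.tolist())
```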

### Output the resulting dataframe

In [None]:
if output:
    da3.to_csv(os.path.join(outputdir, annotationsfile))
    da3.head(20).to_csv(os.path.join(outputdir, sampleannotationsfile)) # for easy visualization on GitHub
    
    with open(os.path.join(outputdir, speakererrorsfile), "w") as f:
        f.write(err)

## Consolidate even more even more

FIXME BOOKMARK still need to incorporate `Link`s and `Set`s. 
Also need to take a look at the text files associated with `Code`s.

In [None]:
raise Exception('stopping point')

In [None]:
tmp = pd.read_csv(os.path.join(outputdir, annotationsfile), index_col=0)

In [None]:
tmp["quote.text"]

# Sandbox

In [None]:
import platform
print(platform.python_version())

Get the quote text because Atlas hates me :'(

Getting context for quotes/annotations

In [None]:
idx = 6 #79 #49 #17

In [None]:
fin = da2.at[idx, "document.plainTextPath"].split("//")[1]
print(fin)
fin = os.path.join(datadir, qdpxdir, inputdocsdir, fin)
print(fin)

# source document as a string
with open(fin) as f:
    docstr = f.read()

In [None]:
docstr

In [None]:
# quote start and end
start = da2.at[idx, "quote.startPosition"]
end = da2.at[idx, "quote.endPosition"]
print(start, ":", end)

In [None]:
# quote text
quote = docstr[start:end]
print(quote)

In [None]:
# minimal sequence of contiguous full lines containing the quote (start, end, and text)
parstart = docstr.rfind("\n", 0, start) + 1
parend = docstr.find("\n", end)
print(parstart, ":", parend)

par = docstr[parstart:parend]
print(par)

In [None]:
# par (above) splitting the lines into separate list elements
fulllines = par.split("\n")

# quote (above) splitting the lines into separate list elements
quotelines = quote.split("\n")

display(fulllines)
display(quotelines)

In [None]:
def get_speaker_old(line):
    # line is a code comment or some such
    if line[:6] != "PERSON":
        return None
    
    ls = line.split(" : ")
    if len(ls) > 0:
        ls2 = ls[0].split(" ")
        if len(ls2) > 1:
            return int(ls2[1])
    
    # unknown issue
    print("UNKNOWN ERROR : get_speaker({})".format(line))

print(get_speaker_old("PERSON 0 : or so i should remove or"))
print(get_speaker_old("PERSON 1 : (bye)"))
print(get_speaker_old("PERSON 1 : "))
print(get_speaker_old("    # off by one"))
print(get_speaker_old("RETURN : sum of squares of the elements in the matrix"))


In [None]:
def get_speaker(line):
    ret = {"speaker" : None, "text" : None}
    # line is a code comment or some such
    if line[:7] != "PERSON ":
        ret["text"] = line
    else:
        try:
            # split only on the first " : " so the message text can itself contain " : "
            ls1 = line.split(" : ", 1)
            ls2 = ls1[0].split(" ")
            ret["speaker"] = int(ls2[1])
            ret["text"] = ls1[1]
        # line is a code comment or some such, but happened to
        # start with the string "PERSON " (unlikely)
        except (IndexError, ValueError):
            print("UNKNOWN ERROR : get_speaker({})".format(line))
    
    return ret

print(get_speaker("PERSON 0 : or so i should remove or"))
print(get_speaker("PERSON 1 : (bye)"))
print(get_speaker("PERSON 1 : "))
print(get_speaker("    # off by one"))
print(get_speaker("RETURN : sum of squares of the elements in the matrix"))

In [None]:
# speaker of each line
speakers = [get_speaker(line) for line in fulllines]

# quote text of each line
quotetextlines = [(line[0] if len(line) < 2 else line[1]) for line in 
             [line.split(" : ") for line in quotelines]]

# full text of each line
fulltextlines = [(line[0] if len(line) < 2 else line[1]) for line in 
                [line.split(" : ") for line in fulllines]]

print(speakers)
print(quotetextlines)
print(fulltextlines)

In [None]:
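# get_speaker returns dicts, which aren't hashable, so count the distinct "speaker" values instead
numspeakers = len({s["speaker"] for s in speakers})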
print(numspeakers)

In [None]:
if numspeakers == 1:
    quotetext = "\n".join(quotetextlines)
    print(quotetext)
else:
    # selectively join things together somehow?
    raise Exception(":)")

In [None]:
# df[newcols] = df.groupby(documentID).transform(f)
# pass in a function f that will take in dx (a shorter dataframe where the document is constant) and return an equally-sized dataframe
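
A sketch of the per-document idea in the comment above, on a toy dataframe. Note that `groupby(...).transform` passes columns one at a time, so this version loops over the groups explicitly and lets index alignment put the results back; `f`, the column names, and the values are all hypothetical.

```python
import pandas as pd

# Toy data: two documents with a few quotes each.
df = pd.DataFrame({
    "document": ["a", "a", "b"],
    "start": [5, 12, 3],
})


def f(dx):
    """Take the rows for one document and return an equally-sized frame on the same index."""
    return pd.DataFrame({"start.rank": dx["start"].rank()}, index=dx.index)


# Apply f document by document, then stitch the pieces back together.
parts = [f(dx) for _, dx in df.groupby("document")]
newcols = pd.concat(parts)

# Assignment aligns on the index, so the row order of `parts` doesn't matter.
df["start.rank"] = newcols["start.rank"]
print(df)
```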

In [None]:
setsDa.head()

In [None]:
#help(pd.DataFrame.drop)
#help(pd.DataFrame.reset_index)
help(str.rfind)

In [None]:
#help(pd.Series.isin)
#help(pd.DataFrame.merge)
#help(pd.concat)
help(pd.DataFrame.applymap)

# Stuff that didn't work

## First attempt: `portableqda`
Doesn't work, feel free to skip

In [None]:
# https://pypi.org/project/portableqda/

In [None]:
import portableqda

In [None]:
help(portableqda)

## Second attempt: `xml.etree.ElementTree`
WIP

In [None]:
import xml.etree.ElementTree as ET
import os

In [None]:
datadir = "../data"
qdpxdir = "full-project-data"
qdefile = "project.qde"

In [None]:
fin = os.path.join(datadir, qdpxdir, qdefile)
tree = ET.parse(fin)
root = tree.getroot()

In [None]:
root.tag

In [None]:
root.attrib

In [None]:
for child in root:
    print(child.tag, "\n\t", child.attrib)

In [None]:
# yess you can see the "plainTextPath" attribute for each source! now we
# know the mapping and don't have to fuck around with bash scripting
# things I've forgotten how to do

for child in root:
    print(child.tag)
    for grandchild in child:
        print("\t", grandchild.tag)
        print("\t\t", grandchild.attrib)

In [None]:
# only the Users
for child in root[0]:
    print(child.attrib)

In [None]:
# only the Codebook
for child in root[1][0]:
    print(child.attrib)

In [None]:
# doing "only the Sources (annotated documents)" would give a lot of
# output, so instead I'm only doing the first document

"""
for child in root[2]:
    print(child.attrib)
    for grandchild in child:
        print("\t", grandchild.attrib)
    print()
"""

child = root[2][0]
print(child.attrib)
for grandchild in child:
    print("\t", grandchild.attrib)

In [None]:
# only the Notes
for child in root[3]:
    print(child.attrib)
    print()
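
The attribute dictionaries printed in this section can be collected straight into a dataframe. A minimal sketch on a made-up XML fragment: the element and attribute names here are illustrative, and `local_name` is a hypothetical helper that copes with tags carrying a `{namespace}` prefix.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Made-up fragment standing in for part of a project file.
xml_text = """
<Project>
  <Users>
    <User guid="u-1" name="annotator A"/>
    <User guid="u-2" name="annotator B"/>
  </Users>
</Project>
""".strip()


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.split("}")[-1]


demo_root = ET.fromstring(xml_text)

# One row per <User> element, one column per XML attribute.
rows = [el.attrib for el in demo_root.iter() if local_name(el.tag) == "User"]
users_demo = pd.DataFrame(rows)
print(users_demo)
```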

## Third attempt: `pandas`
Simply sticking in the whole `.qde` file doesn't work

In [None]:
import pandas as pd

In [None]:
da = pd.read_xml(os.path.join(datadir, qdpxdir, qdefile))

da

In [None]:
da["User"].unique()

In [None]:
da["Codes"].unique()

In [None]:
da["TextSource"].unique()

In [None]:
for col in da.columns:
    print(col, "\t : ", da[col].unique())

## Debugging the extra rows issue

In [None]:
for doc, (quotesDa, quotesDir) in sourcesDir.items():
    for quote, codesDa in quotesDir.items():
        if quote == "A1FF467C-2857-4848-A993-5C838B2F6491":
            print("doc guid:", doc)
            
            print("quotes:")
            display(quotesDa)
            
            print("quote:", quote)
            
            print("codes:")
            display(codesDa)

In [None]:
for doc, (quotesDa, quotesDir) in sourcesDir.items():
    for quote, codesDa in quotesDir.items():
        if quote == "B20E50DC-8E26-4552-A0B4-0FBABB45A8E9":
            print("doc guid:", doc)
            
            print("quotes:")
            display(quotesDa)
            
            print("quote:", quote)
            
            print("codes:")
            display(codesDa)

As far as I can tell, the orphaned quotes are nowhere in the data at this point. How the fuck are they getting back in?

Answer: restarting the kernel fixed the problem smh... still not sure how rerunning cells didn't - might have a hidden bug somewhere.

...

There are 2340 `<Coding>` XML nodes and 2 `<NoteRef>` XML nodes, so something's going on here.

In [None]:
sum([docDir[doc].shape[0] for doc in docDir.keys()])

So the problem exists prior to merging in the document information from `sourcesDa`.

...

So the extra rows are getting some system default initialization for these cells.

In [None]:
#da[da["annotation.isCode"] != True]
extraDa = da[~da["annotation.isCode"].isin([True, False])] # "~" is negation

extraDa

So there are 45 such extra rows. Some documents appear more than once, and some don't appear at all. I could just drop them, but I kind of want to know what's going on.

In [None]:
extraDa.columns

In [None]:
extraDa[["quote.guid", "quote.name", "document.guid", "document.name"]]

In [None]:
#extraDa.to_csv("extra_rows.csv")

Manual inspection reveals that these are simply quotations with no associated codes. I thought I already removed those - need to go back and double-check that I did it right.
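
One way rows like these can arise (an assumption, not confirmed against this project's data) is a left merge of quotes onto codings, which keeps uncoded quotes and fills their code columns with NaN. A toy sketch with made-up guids:

```python
import pandas as pd

# Hypothetical quotes; q2 has no coding attached.
quotes = pd.DataFrame({"quote.guid": ["q1", "q2", "q3"]})
codings = pd.DataFrame({
    "quote.guid": ["q1", "q1", "q3"],
    "annotation.isCode": [True, True, False],
})

# A left merge keeps every quote; indicator=True records where each row came from.
merged = quotes.merge(codings, on="quote.guid", how="left", indicator=True)
uncoded = merged[merged["_merge"] == "left_only"]

# Same rows, found the way the cell above does it: isCode is neither True nor False.
extra = merged[~merged["annotation.isCode"].isin([True, False])]

# An inner merge drops the uncoded quotes altogether.
coded_only = quotes.merge(codings, on="quote.guid", how="inner")

print(uncoded["quote.guid"].tolist())
print(len(merged), len(coded_only))
```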

In [None]:
# S : set of people
def eval_q5d(S):
    ret = True
    for x in S: # forall x
        if I(x): # if I(x) then
            ret2 = False
            for y in S: # there exists a y such that
                ret4 = True
                for z in S: # forall z
                    if z != y: # if z != y then
                        ret5 = not F(x, z) or I(z)
                    else:
                        ret5 = True
                    ret4 = ret4 and ret5
                ret3 = F(x, y) and not I(y) and ret4
                ret2 = ret2 or ret3
            ret = ret and ret2 # the implication has to hold for every x
        # otherwise the implication is vacuously true, so ret is unchanged
    return ret