# Calculating annotation statistics from CAS JSON files

This notebook demonstrates how to access annotated texts in the CAS JSON format and how to extract some statistical data from them. Example data is included in the `data/named_entity_json_cas` folder. To use your own data from INCEpTION with this notebook, open the document on the annotation page, click the *Export* button in the action bar, and export the data in the *UIMA CAS JSON* format.

Start by installing the prerequisites.

In [None]:
# Install the current development version of dkpro-cassis from GitHub.
# To use the latest release instead, run: !pip install dkpro-cassis
!pip install --force-reinstall git+https://github.com/dkpro/dkpro-cassis.git
!pip install pandas
!pip install plotly

from cassis import load_cas_from_json
import os
import pandas as pd
import plotly.graph_objects as go

Now, let's load the CAS JSON data into memory so we can process it more easily in the subsequent steps.

In [2]:
directory = "data/named_entity_json_cas"
documents_by_name = dict()

# Load the data files
for filename in os.listdir(directory):
    filepath = os.path.join(directory, filename)
    if os.path.isfile(filepath):
        with open(filepath, 'rb') as f:
            documents_by_name[filename] = load_cas_from_json(f)
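
The loop above loads every regular file in the folder, so a stray non-JSON file (for example `.DS_Store`) would be handed to the loader as well. A minimal sketch of a more defensive variant follows. `load_documents` and its `loader` and `suffix` parameters are hypothetical names introduced here for illustration; in this notebook you would pass `load_cas_from_json` as the loader.

```python
from pathlib import Path


def load_documents(directory, loader, suffix=".json"):
    """Load every file in `directory` whose name ends in `suffix`.

    `loader` is any callable that accepts a binary file object
    (assumption: the caller passes e.g. cassis' `load_cas_from_json`).
    """
    documents = {}
    # Sort the paths so the document order is stable across runs
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix == suffix:
            with path.open("rb") as f:
                documents[path.name] = loader(f)
    return documents
```

With this helper, the loading cell could shrink to `documents_by_name = load_documents(directory, load_cas_from_json)`.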

As a first step, let's look at which annotation types exist in our data. There may be quite a few more than we would usually see in the INCEpTION user interface, but the list also includes the types for the layers from our project settings.

In [3]:
all_type_names = set()
for doc in documents_by_name.values():
  # Use `t` rather than `type` to avoid shadowing the Python builtin
  for t in doc.typesystem:
    all_type_names.add(t.name)

display(all_type_names)

{'de.tudarmstadt.ukp.clarin.webanno.api.type.FeatureDefinition',
 'de.tudarmstadt.ukp.clarin.webanno.api.type.LayerDefinition',
 'de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.morph.MorphologicalFeatures',
 'de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS',
 'de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData',
 'de.tudarmstadt.ukp.dkpro.core.api.metadata.type.TagDescription',
 'de.tudarmstadt.ukp.dkpro.core.api.metadata.type.TagsetDescription',
 'de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity',
 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma',
 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence',
 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Stem',
 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token',
 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.TokenForm',
 'uima.tcas.DocumentAnnotation'}
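
The fully qualified names are long, and the part after the last dot matches the short names used as column headers in the next step. As a rough sketch using plain string handling, the names can be grouped by package to get a quicker overview. `group_by_package` is a hypothetical helper introduced here, and `sample` is a small subset copied from the output above.

```python
from collections import defaultdict


def group_by_package(type_names):
    """Map each package prefix to the sorted short names it contains."""
    groups = defaultdict(list)
    for name in type_names:
        # Split at the last dot: package on the left, short name on the right
        package, _, short_name = name.rpartition(".")
        groups[package].append(short_name)
    return {package: sorted(names) for package, names in groups.items()}


sample = {
    "de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity",
    "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence",
    "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token",
    "uima.tcas.DocumentAnnotation",
}
for package, short_names in sorted(group_by_package(sample).items()):
    print(package, short_names)
```

In the notebook, the same helper could be called on `all_type_names`.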

Now let us count how many annotations of each type we have per document.

In [4]:
data = []
for filename, doc in documents_by_name.items():
  row = { "Document" : filename }
  data.append(row)
  for t in sorted(doc.typesystem):
    row[t.short_name] = len(doc.select(t))

display(pd.DataFrame(data))

|   | Document | FeatureDefinition | LayerDefinition | MorphologicalFeatures | POS | DocumentMetaData | TagDescription | TagsetDescription | NamedEntity | Lemma | Sentence | Stem | Token | TokenForm | DocumentAnnotation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | foodista_blog_2019_08_13_northern-british-colu... | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 17 | 0 | 19 | 0 | 326 | 0 | 1 |
| 1 | foodista_blog_2019_10_22_lewiston-clarkstons-n... | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 34 | 0 | 27 | 0 | 535 | 0 | 1 |
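
Several types, such as `POS` and `Lemma`, have a count of zero in every document. A small sketch of how such columns could be dropped from the list of row dictionaries before building the DataFrame follows. `drop_empty_columns` is a hypothetical helper, and the sample rows reuse a few counts from the table above with placeholder document names.

```python
def drop_empty_columns(rows, key="Document"):
    """Remove count columns that are zero (or missing) in every row."""
    # A column is "used" if at least one row has a non-zero value for it
    used = {
        column
        for row in rows
        for column, value in row.items()
        if column != key and value
    }
    return [
        {c: v for c, v in row.items() if c == key or c in used}
        for row in rows
    ]


# Placeholder document names; counts taken from the table above
rows = [
    {"Document": "doc_a", "POS": 0, "NamedEntity": 17, "Lemma": 0, "Token": 326},
    {"Document": "doc_b", "POS": 0, "NamedEntity": 34, "Lemma": 0, "Token": 535},
]
print(drop_empty_columns(rows))
```

In the notebook, this would amount to `display(pd.DataFrame(drop_empty_columns(data)))`.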


Our example data consists of texts annotated for *named entities*. We have already seen that there is a `NamedEntity` type. The named entity label is stored in its `value` feature. So let us now extract these labels from all `NamedEntity` annotations and count them per document.

In [14]:
from collections import Counter

data = []
for filename, doc in documents_by_name.items():
  data.append({ 
    "Document" : filename,
    **Counter(entity.get("value") for entity in doc.select("NamedEntity"))
  })

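# Keep a copy indexed by document name for the chart further down
df = pd.DataFrame(data)
df.set_index("Document", inplace=True)
# Display the table with "Document" as a regular column
display(pd.DataFrame(data))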

|   | Document | Location | Person | Date | Route | Product | Product Category | Event | Organization |
|---|---|---|---|---|---|---|---|---|---|
| 0 | foodista_blog_2019_08_13_northern-british-colu... | 5 | 2 | 1 | 2.0 | 5 | 2 | NaN | NaN |
| 1 | foodista_blog_2019_10_22_lewiston-clarkstons-n... | 10 | 7 | 6 | NaN | 5 | 1 | 2.0 | 3.0 |
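
The missing values (and the float formatting such as `2.0`) arise because a `Counter` only contains the labels that actually occur in a document, so pandas has nothing to put in the other cells. One possible way to get explicit zeros is to fill in every label for every document before building the DataFrame. This is only a sketch: `fill_missing_labels` is a hypothetical helper, and the label lists below are toy stand-ins rather than the real annotations.

```python
from collections import Counter


def fill_missing_labels(counts_by_document):
    """Return one row per document with an explicit 0 for absent labels."""
    # Union of all labels seen in any document, sorted for a stable column order
    all_labels = sorted(set().union(*counts_by_document.values()))
    return [
        # Counter returns 0 for labels it has never seen
        {"Document": name, **{label: counts[label] for label in all_labels}}
        for name, counts in counts_by_document.items()
    ]


# Toy data, not taken from the example corpus
counts_by_document = {
    "doc_a": Counter(["Location", "Location", "Route"]),
    "doc_b": Counter(["Location", "Event"]),
}
rows = fill_missing_labels(counts_by_document)
print(rows)
```

Alternatively, staying within pandas, `df.fillna(0).astype(int)` replaces the missing values after the fact and restores integer columns.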


We can also use a library like plotly to render the results as a chart.

In [18]:
df_transposed = df.transpose()

fig_data = []
for document in df_transposed.columns:
    fig_data.append(go.Bar(name=document, x=df_transposed.index, y=df_transposed[document]))

layout = go.Layout(
    title='Entities per Document',
    xaxis_title='Category',
    yaxis_title='Count',
    barmode='stack'  # Stack bars on top of each other
)

fig = go.Figure(data=fig_data, layout=layout)

fig.show()