<img align="right" width="400" src="https://www.fhnw.ch/de/++theme++web16theme/assets/media/img/fachhochschule-nordwestschweiz-fhnw-logo.svg" alt="FHNW Logo">


# Import Annotated Data from Inception

by Fabian Märki

## Summary
The aim of this notebook is to show how data annotated with [Inception](https://inception-project.github.io) can be imported into a Jupyter Notebook.

### Links
- https://inception-project.github.io/example-projects/python/
- https://inception-project.github.io/releases/21.1/docs/user-guide.html#sect_formats_uimaxmi

This notebook does not contain assignments: <font color='red'>Enjoy.</font>

<a href="https://colab.research.google.com/github/markif/2021_HS_CAS_NLP_LAB_Notebooks/blob/master/01_a_Import_Annotated_Data_from_Inception.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
%%capture

!pip install 'fhnw-nlp-utils>=0.2.13,<0.3.0'

from fhnw.nlp.utils.storage import download
from fhnw.nlp.utils.storage import save_dataframe


import pandas as pd
import numpy as np

In [2]:
from fhnw.nlp.utils.system import system_info
print(system_info())

OS name: posix
Platform name: Linux
Platform release: 5.11.0-40-generic
Python version: 3.6.9
Tensorflow version: 2.5.1
GPU is available


For your own project, you would first export the annotated data as **UIMA CAS XMI (XML 1.0)** from your Inception server and download it.

Here, we use a prepared dataset that is stored in the cloud instead.

In [3]:
import zipfile

download("https://drive.google.com/uc?id=1HJ1TDDcyD2STQjn_QwkyjzAJxhBRae8X", "data/german_doctor_reviews_annotated_text.zip")

with zipfile.ZipFile("data/german_doctor_reviews_annotated_text.zip","r") as zip_ref:
    zip_ref.extractall("data")
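
The later cells read `data/TypeSystem.xml` and `data/german_doctor_reviews.xmi`, so it can help to check that an export really contains both files before extracting it. The following is a minimal sketch, assuming both files sit at the top level of the archive (inferred from the paths used later); `missing_members` is a hypothetical helper, not part of `fhnw-nlp-utils`.

```python
import io
import zipfile

# File names taken from the cells that load the type system and the XMI document.
EXPECTED_MEMBERS = ("TypeSystem.xml", "german_doctor_reviews.xmi")


def missing_members(archive, expected=EXPECTED_MEMBERS):
    """Return the expected file names that are not present in the archive.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive, "r") as zip_ref:
        present = set(zip_ref.namelist())
    return [name for name in expected if name not in present]


# Self-contained demo with an in-memory archive (hypothetical content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_ref:
    zip_ref.writestr("TypeSystem.xml", "<typeSystemDescription/>")
buffer.seek(0)

print(missing_members(buffer))  # ['german_doctor_reviews.xmi']
```

In the notebook, the same helper could be called with the path `"data/german_doctor_reviews_annotated_text.zip"`; an empty list would mean that both files are present.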

Extract the annotations from the XMI file using `dkpro-cassis`...

In [4]:
!pip install dkpro-cassis


In [5]:
from cassis import load_typesystem, load_cas_from_xmi

with open("data/TypeSystem.xml", "rb") as f:
    typesystem = load_typesystem(f)
    
with open("data/german_doctor_reviews.xmi", "rb") as f:
    doc = load_cas_from_xmi(f, typesystem=typesystem)
    
# Since Sentiment is a sentence-level annotation in INCEpTION, we get
# one annotation per sentence. So we can simply iterate over the
# Sentiment annotations and write their polarity and covered text
# to the output file (tab-separated, UTF-8 because of the German umlauts).
with open("data/german_doctor_reviews.csv", "w", encoding="utf-8") as f:
    for sentiment in doc.select("webanno.custom.Sentiment"):
        f.write(f"{sentiment.Polarity}\t{sentiment.get_covered_text()}\n")
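
Writing the lines by hand works as long as the covered text contains neither tabs nor line breaks. If that cannot be guaranteed, one option is to let Python's `csv` module do the quoting. The sketch below assumes the annotations are available as `(polarity, text)` pairs; `write_rows` and the second sample row are hypothetical and only serve as illustration.

```python
import csv
import io


def write_rows(rows, handle):
    """Write (polarity, text) pairs as tab-separated values.

    csv.writer quotes fields that contain tabs, quotes or line breaks,
    so every annotation stays one logical record.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Hypothetical annotations; in the notebook they would come from doc.select(...).
rows = [
    ("Very Negative", "Dies ist eine schlechter Arzt."),
    ("Neutral", "Erste Zeile\nzweite Zeile\tmit Tabulator"),
]

buffer = io.StringIO()
write_rows(rows, buffer)

# Reading the data back yields the original pairs.
buffer.seek(0)
restored = [tuple(row) for row in csv.reader(buffer, delimiter="\t")]
print(restored == rows)  # True
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""`. `pd.read_csv(..., sep="\t")` understands this kind of quoting with its default settings.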

Load the data into a Pandas DataFrame...

In [6]:
data = pd.read_csv("data/german_doctor_reviews.csv", sep="\t", header=None, names=["label", "text"])

In [7]:
data.head(3)

|   | label | text |
|---|---|---|
| 0 | Very Negative | Dies ist eine schlechter Arzt. |
| 1 | Very Positive | Dies ist ein super Arzt. Ich bin sehr zufrieden. |


### Setup Sentiment

Bring the labels in line with the original data, which contains a **rating** column with the following values:
- 1 = Very Positive
- 2 = Positive
- 3 = Neutral
- 4 = Negative
- 5 = Very Negative

This might be a bit counterintuitive: I would have expected a higher number to represent a better review (the more stars the better).

In [8]:
data["rating"] = 1
data.loc[data["label"] == "Positive", "rating"] = 2
data.loc[data["label"] == "Neutral", "rating"] = 3
data.loc[data["label"] == "Negative", "rating"] = 4
data.loc[data["label"] == "Very Negative", "rating"] = 5
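
Note that the cell above starts with `rating = 1` for every row, so a misspelt or missing label would silently end up as "Very Positive". A possible alternative is an explicit lookup table that fails loudly on unknown labels. This is a minimal sketch in plain Python; `RATING_BY_LABEL` and `to_ratings` are hypothetical names introduced for illustration.

```python
# Rating scale as listed above (1 = Very Positive ... 5 = Very Negative).
RATING_BY_LABEL = {
    "Very Positive": 1,
    "Positive": 2,
    "Neutral": 3,
    "Negative": 4,
    "Very Negative": 5,
}


def to_ratings(labels):
    """Translate annotation labels into ratings, failing loudly on unknown labels."""
    # dict.fromkeys removes duplicates while keeping the order of first appearance.
    unknown = [label for label in dict.fromkeys(labels) if label not in RATING_BY_LABEL]
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [RATING_BY_LABEL[label] for label in labels]


print(to_ratings(["Very Negative", "Very Positive", "Neutral"]))  # [5, 1, 3]
```

With a DataFrame, `data["label"].map(RATING_BY_LABEL)` performs the same translation and leaves `NaN` for labels that are not in the table.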

The classification should recognize whether a comment has a positive or a negative sentiment. Let's assume that a good rating (1-2) carries a positive message whereas a bad rating (4-5) carries a negative one (later we will exclude neutral ratings, i.e. the task becomes a binary classification).

In [9]:
data["label"] = "positive"
data.loc[data["rating"] >= 3, "label"] = "neutral"
data.loc[data["rating"] >= 4, "label"] = "negative"

data["sentiment"] = data["label"].apply(lambda x: 1 if x == "positive" else (-1 if x == "negative" else 0))
data = data.astype({"sentiment": "int32"})

In [10]:
data.head(3)

|   | label | text | rating | sentiment |
|---|---|---|---|---|
| 0 | negative | Dies ist eine schlechter Arzt. | 5 | -1 |
| 1 | positive | Dies ist ein super Arzt. Ich bin sehr zufrieden. | 1 | 1 |


In [11]:
save_dataframe(data, "data/german_doctor_reviews_annotated.parq")