# HUMANNOTATOR
*Example notebook*  
Lawrence Vriend  
  
Build easy custom annotators for your Jupyter/pandas workflow!

In [1]:
import sys
sys.path.insert(0, '../')
from humannotator import Annotator, task_factory, load_data
import pandas as pd

---
### Load the data
You can pass a `list`, `dict`, `Series` or `DataFrame` object into the Annotator.  
Here we will load a dataframe with a few newspaper articles.

In [2]:
df = pd.read_csv('news.csv', index_col=0)
data = load_data(df, id_col='news_id')

These articles are long strings, so a DataFrame is not a great way to view them.  
But we can look at the records in our data one by one by passing the data into the annotator.  
- Long strings will automatically be **truncated** by the annotator.  
- When using the annotator in a Jupyter notebook, you can expand/collapse these items by clicking on them.
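
To give a feel for what truncation does to a long string, here is a minimal sketch using the standard library's `textwrap.shorten`. It is an illustration only and not how humannotator implements truncation; the `truncate` helper, its `width` default and the sample text are all hypothetical.

```python
import textwrap


def truncate(text, width=80, placeholder=" [...]"):
    """Collapse whitespace and cut `text` to at most `width` characters.

    Hypothetical helper for illustration; not part of humannotator.
    """
    return textwrap.shorten(text, width=width, placeholder=placeholder)


long_text = "word " * 50          # stand-in for a long article body
preview = truncate(long_text, width=40)
print(preview)                    # a shortened string ending in "[...]"
print(truncate("short text"))     # short strings are returned unchanged
```

Text that already fits within `width` passes through untouched, so only the long fields of a record are shortened.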

Navigate through the records using 'x' for next, 'z' for previous and '.' to exit:

In [3]:
annotator = Annotator(data=data)
annotator()

### Set up some tasks
Of course, we don't only want to look at the data, we want to annotate it.  
In order to do so, we must set up some annotation tasks.  
We can create tasks using the `task_factory`:

In [4]:
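# Categorical task: each key is the input to type, each value is the label that is stored.
choices = {
    '0': 'not toxic media',
    '1': 'toxic media',
    '3': 'exclude from dataset',
}
instruction = "Is the topic political in nature?"
task1 = task_factory(choices, 'Toxic media')
# Boolean task with an instruction; nullable means it may be left empty.
task2 = task_factory(bool, 'Political', instruction=instruction, nullable=True)

annotator = Annotator([task1, task2], data)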

Alternatively, we can access and add tasks through **subscription**.  
We can set up a task by passing in the `kind` and optionally:
- an `instruction`,
- whether the task is `nullable`,
- whether it has any `dependencies`.

For now let's add a task that takes in a string: 

In [5]:
annotator.tasks['Politician'] = str, 'Who is the main political figure?'

You can check and change the order of the tasks by accessing the `order` attribute on `tasks`.

### Dependencies between tasks
In this case it may be a good idea to add some dependencies to our workflow.  
If we mark the record to be excluded, then there is no need to perform any subsequent tasks.  
Also, if an article is not political, then we don't need to name a politician.  
Let's set that up:

In [6]:
dependency1 = ("`Toxic media` == 'exclude from dataset'", None)
dependency2 = ("Political == False", None)
annotator.tasks['Political'] = bool, "Is the topic political in nature?", True, dependency1
annotator.tasks['Politician'] = str, "Who is the main political figure?", True, [dependency1, dependency2]

A dependency consists of two parts:
- A [pandas query statement](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query)
- The value to assign if the statement evaluates to True

Multiple dependencies can be added.  
Dependencies will be evaluated in order.  
If the statement evaluates to True, then the assignment is performed.  
Once this happens, no other dependencies will be checked.
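
The first-match rule described above can be sketched with plain pandas. This is an assumption about the general idea, not humannotator's actual implementation: the `first_matching_dependency` helper and the sample records are hypothetical, and each record is wrapped in a one-row DataFrame so that `DataFrame.query` can evaluate the statement.

```python
import pandas as pd


def first_matching_dependency(record, dependencies):
    """Return the first (condition, value) pair whose query matches, else None.

    Hypothetical helper for illustration; not part of humannotator.
    """
    row = pd.DataFrame([record])  # one-row frame so DataFrame.query can be used
    for condition, value in dependencies:
        if not row.query(condition, engine="python").empty:
            return condition, value  # stop here: later dependencies are skipped
    return None


dependency1 = ("`Toxic media` == 'exclude from dataset'", None)
dependency2 = ("Political == False", None)
dependencies = [dependency1, dependency2]

# Hypothetical partially annotated records.
excluded = {"Toxic media": "exclude from dataset", "Political": False}
not_political = {"Toxic media": "toxic media", "Political": False}
political = {"Toxic media": "toxic media", "Political": True}

print(first_matching_dependency(excluded, dependencies))       # dependency1 wins
print(first_matching_dependency(not_political, dependencies))  # dependency2
print(first_matching_dependency(political, dependencies))      # no match
```

For the `excluded` record both statements are true, but only the first one is applied, which is why the order of the dependencies matters.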

---
### Run the annotator by calling it
The annotator keeps track of where you were.  
Pass the annotator a list of ids if you only want to annotate specific records.  
You can exit the annotator and it will continue where you left off when you run it again.  
An annotation is only stored if ALL tasks were performed.

Let's add a user too:

In [7]:
annotator(user='LV')

---
### Highlighter
We can use the highlighter to highlight specific phrases.  
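Pass `phrases` as a keyword argument to the annotator call to do so.  
The phrases in this example are regular expressions; `flags=2` corresponds to `re.IGNORECASE`, assuming the flags are passed on to Python's `re` module.  
Alternatively, we could have instantiated the annotator with these arguments.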

In [8]:
# Raw strings keep the regex escapes (\w) intact.
phrases = ['trump', 'news', 'drone', 'judge', r'\w*rand', r'\w*com\w*']
annotator(phrases=phrases, flags=2)
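
To see which words such phrases would pick up, here is a minimal sketch with the standard `re` module. It assumes the phrases are combined into one case-insensitive pattern; the `highlight` helper and the `**` marker are hypothetical and do not reflect how humannotator renders highlights.

```python
import re

phrases = ['trump', 'news', 'drone', 'judge', r'\w*rand', r'\w*com\w*']

# Combine the phrases into a single alternation; re.IGNORECASE has the value 2.
pattern = re.compile("|".join(f"(?:{p})" for p in phrases), flags=re.IGNORECASE)


def highlight(text, marker="**"):
    """Wrap every match in `marker` (hypothetical helper for illustration)."""
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)


sample = "Trump comments on drone news"
print(highlight(sample))  # **Trump** **comments** on **drone** **news**
```

Note that `\w*com\w*` matches the whole word "comments", while the literal phrases match only their own characters.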

---
### Access your annotations
The annotations are stored in a dataframe.

In [9]:
annotator.annotated

|  | Toxic media | Political | Politician | timestamp | user |
|---|---|---|---|---|---|
| 052632_2015-02-28 | not toxic media | True | Rand Paul | 2019-09-24 23:36:27.777639936 | LV |
| 071607_2016-12-12 | exclude from dataset |  |  | 2019-09-24 23:36:55.789925120 | LV |
| 141694_2016-02-10 | not toxic media | False |  | 2019-09-24 23:36:50.162441984 | LV |
| 137157_2017-02-09 | not toxic media | True | Donald Trump | 2019-09-24 23:37:14.469518080 | LV |
| 034187_2016-09-27 | toxic media | False |  | 2019-09-24 23:37:22.318926080 | LV |
| 018678_2017-04-23 | toxic media | False |  | 2019-09-24 23:37:25.850139904 | LV |
| 120386_2016-11-14 | not toxic media | False |  | 2019-09-24 23:37:31.623525888 | LV |


---
### Merge your annotations with the data

In [10]:
annotator.merged()

|  | DATA | DATA | DATA | ANNOTATIONS | ANNOTATIONS | ANNOTATIONS | ANNOTATIONS | ANNOTATIONS |
|---|---|---|---|---|---|---|---|---|
| news_id | title | date | text | Toxic media | Political | Politician | timestamp | user |
| 052632_2015-02-28 | Rand Paul wins 2015 CPAC straw poll | 2015-02-28 | [Washington (CNN)Sen. Rand Paul won the Conser... | not toxic media | True | Rand Paul | 2019-09-24 23:36:27.777639936 | LV |
| 071607_2016-12-12 | Can Singing Mice Reveal the Roots of Human Spe... | 2016-12-12 | [One chilly day in February 1877, a British co... | exclude from dataset |  |  | 2019-09-24 23:36:55.789925120 | LV |
| 141694_2016-02-10 | Dollar hits 15-month low against yen after Yel... | 2016-02-10 | The dollar fell to a 15-month low against the... | not toxic media | False |  | 2019-09-24 23:36:50.162441984 | LV |
| 137157_2017-02-09 | Trump's Supreme Court pick dispirited by presi... | 2017-02-09 | Donald Trump's Supreme Court nominee, Neil Go... | not toxic media | True | Donald Trump | 2019-09-24 23:37:14.469518080 | LV |
| 034187_2016-09-27 | FULL TEXT: 10 Things Milo Hates About Islam - ... | 2016-09-27 | I’m Milo Yiannopoulos, thank you for coming. T... | toxic media | False |  | 2019-09-24 23:37:22.318926080 | LV |
| 018678_2017-04-23 | 5 Border Horrors Establishment Media Mostly Ig... | 2017-04-23 | The brutality that comes from the open border ... | toxic media | False |  | 2019-09-24 23:37:25.850139904 | LV |
| 120386_2016-11-14 | Crew members injured as plane avoids near coll... | 2016-11-14 | A Canadian airliner with 54 passengers on boar... | not toxic media | False |  | 2019-09-24 23:37:31.623525888 | LV |
| 135236_2016-11-10 | Bodies Of Missing Married Couple Found On Susp... | 2016-11-10 | [The bodies of two more presumed victims of To... |  |  |  | NaT |  |
| 184514_2017-03-17 | 350 Square Feet, Two Kids, Two Cats and a Rabb... | 2017-03-17 | Maligned though New York’s rental market may b... |  |  |  | NaT |  |
| 106098_2017-06-02 | CDC warns about deadly mushrooms amid surge in... | 2017-06-02 | Dangerous wild “death cap” mushrooms in Califo... |  |  |  | NaT |  |
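
A two-level column layout like the one above can be reproduced in plain pandas with `pd.concat` and its `keys` argument. This is a sketch under the assumption that the merge is an index-aligned, side-by-side join; it is not necessarily what `merged()` does internally, and the two small frames are made-up stand-ins.

```python
import pandas as pd

# Made-up stand-ins for the loaded data and the stored annotations.
data = pd.DataFrame(
    {"title": ["First article", "Second article"]},
    index=pd.Index(["id_1", "id_2"], name="news_id"),
)
annotations = pd.DataFrame(
    {"Political": [True], "user": ["LV"]},
    index=pd.Index(["id_1"], name="news_id"),
)

# Align on the index and label each block of columns with a top-level key.
merged = pd.concat([data, annotations], axis=1, keys=["DATA", "ANNOTATIONS"])
print(merged)
```

Records that have not been annotated yet, such as `id_2` here, keep their data columns and get missing values in the annotation columns.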


---
### Save and load your data

In [11]:
annotator.save('annotator.pkl')

In [12]:
annotator2 = Annotator.load('annotator.pkl')

We can access our annotations:

In [13]:
annotator2.annotated

|  | Toxic media | Political | Politician | timestamp | user |
|---|---|---|---|---|---|
| 052632_2015-02-28 | not toxic media | True | Rand Paul | 2019-09-24 23:36:27.777639936 | LV |
| 071607_2016-12-12 | exclude from dataset |  |  | 2019-09-24 23:36:55.789925120 | LV |
| 141694_2016-02-10 | not toxic media | False |  | 2019-09-24 23:36:50.162441984 | LV |
| 137157_2017-02-09 | not toxic media | True | Donald Trump | 2019-09-24 23:37:14.469518080 | LV |
| 034187_2016-09-27 | toxic media | False |  | 2019-09-24 23:37:22.318926080 | LV |
| 018678_2017-04-23 | toxic media | False |  | 2019-09-24 23:37:25.850139904 | LV |
| 120386_2016-11-14 | not toxic media | False |  | 2019-09-24 23:37:31.623525888 | LV |


But when we try to access the data, something unexpected happens:

In [14]:
annotator2.data

NO DATA LOADED
Load the data first by assigning it to the `data` property of the annotator.


By default, humannotator does not store the data when you pickle the annotator.  
After unpickling, we need to load the data back in before the annotator can be used:

In [15]:
annotator2.data = data

Now we can continue where we left off.  
Let's set the annotator to **text mode** as well.  
This is what the annotator looks like from the terminal:

In [None]:
annotator2(text_display=True)

HUMANNOTATOR                                                       user: LV
id: 135236_2016-11-10                                                 1 / 3
item: 
    title: Bodies Of Missing Married Couple Found On Suspected S.C.
        Kidnapper's Land
    date: 2016-11-10
    text: [The bodies of two more presumed victims of Todd Kohlhepp, the
        South Carolina man who has confessed to multiple murders, have been
        identified as a young married couple who went missing in [...]

Task 1 / 3
Toxic media (category)
  
[0] - not toxic media  
[1] - toxic media  
[3] - exclude from dataset  
  
[.] - exit  




If you do want to store the data with the annotator, set the `save_data` flag to `True`.
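
One common way to get this kind of behaviour in Python is to drop the data in `__getstate__` before pickling. The sketch below illustrates that pattern with a hypothetical `SketchAnnotator` class. It is not humannotator's code, and treating `save_data` as a constructor argument is an assumption made for the example.

```python
import pickle


class SketchAnnotator:
    """Hypothetical stand-in for an annotator; not humannotator's class."""

    def __init__(self, data=None, save_data=False):
        self.data = data
        self.annotations = {}
        self.save_data = save_data  # assumed location of the flag

    def __getstate__(self):
        # pickle calls this to decide what gets stored.
        state = self.__dict__.copy()
        if not self.save_data:
            state["data"] = None  # leave the (possibly large) data out
        return state


records = {"id_1": "some long article text"}

light = SketchAnnotator(data=records)
light.annotations["id_1"] = "not toxic media"
restored = pickle.loads(pickle.dumps(light))
print(restored.annotations)  # annotations survive the round trip
print(restored.data)         # None: the data must be assigned again
restored.data = records

full = SketchAnnotator(data=records, save_data=True)
restored_full = pickle.loads(pickle.dumps(full))
print(restored_full.data)    # data was stored alongside the annotations
```

Keeping the data out of the pickle keeps the saved file small, at the cost of having to reassign the data after loading.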