# Linalgo Annotate SDK demo

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib.pyplot as plt  # explicit import instead of the discouraged %pylab magic

## Setting up the environment

First, import the `linalgo` library, which lets you manipulate the tasks created online. Installation is described [here](https://linalgo.github.io/annotate-sdk/).

In [None]:
import os 
from linalgo.client import LinalgoClient

Then you can connect to the backend.
- You need to know the address of your annotate instance (localhost in this case).
- To get the authentication token, you must connect to the platform and retrieve it from the developer dashboard.

In [None]:
token = os.getenv('LIN_TOKEN')
api_url = 'http://localhost:8000'
linalgo_client = LinalgoClient(token=token, api_url=api_url)
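
`os.getenv` returns `None` when the variable is not set, so a missing token would only surface later as an authentication error. A minimal sketch of an early check, where `require_env` is a hypothetical helper (not part of the `linalgo` SDK):

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    # `require_env` is a hypothetical helper, not part of the linalgo SDK.
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, `token = require_env('LIN_TOKEN')` fails immediately with a clear message instead of passing `None` to the client.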

We are now ready to list the tasks we have access to and select the ones we would like to work on.

In [None]:
tasks = linalgo_client.get_tasks(task_ids=[14])
for task in tasks:
    print(f"id: {task.id}, name: {task.name}")

## Training a machine learning algorithm

Let's look at the different types of annotations that have been made on the tasks we selected.

In [None]:
# `task` is the last task from the loop above
entities = task.entities
for entity in entities:
    print(f"id: {entity['id']}, name: {entity['title']}")

In this study, we're going to train one algorithm per entity type (binary classification). We'll start with the `CT Ideate` type (id 4) and keep only these annotations using the `task.transform()` method.

In [None]:
label = 4
data, target = [], []
for task in tasks:
    docs, labels = task.transform(target='binary', label=label)
    data.extend(docs)
    target.extend(labels)
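
To make the binary target concrete, here is a minimal sketch of the idea on made-up data. It is not the SDK's implementation of `task.transform()`: the `to_binary_target` function, the `(text, entity_ids)` structure and the sample sentences are all assumptions used for illustration.

```python
def to_binary_target(annotated_docs, label):
    """Split (text, entity_ids) pairs into texts and 0/1 labels for one entity type."""
    # Hypothetical helper: a document is positive when it carries the requested entity id.
    docs = [text for text, _ in annotated_docs]
    labels = [1 if label in entity_ids else 0 for _, entity_ids in annotated_docs]
    return docs, labels


# Made-up documents with the set of entity ids annotated on each of them
sample = [
    ("We could prototype a paper version first.", {4}),
    ("The meeting is at 3pm.", set()),
    ("What if we combined both ideas?", {4, 7}),
]

sample_docs, sample_labels = to_binary_target(sample, label=4)
```

Training one classifier per entity type then amounts to repeating this step with a different `label` each time.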

In [None]:
print(f"number of docs: {len(data)}")
print('----------------------------')
print(f"1: 'data': {data[0]}, 'label': {target[0]}")
print(f"2: 'data': {data[1]}, 'label': {target[1]}")
print(f"3: 'data': {data[2]}, 'label': {target[2]}")
print('...')

We have 1220 documents and their associated labels for training. We can now use our favorite classifier from scikit-learn and fit it to our data.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=43)

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])

text_clf.fit(X_train, y_train)
y_score = text_clf.decision_function(X_test)

## Evaluating the model

Now that we have a trained algorithm, we would like to know how well it performs on our current dataset. We'll use the [AuROC](http://gim.unmc.edu/dxtests/roc3.htm) metric for that. [Other metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) might also be appropriate depending on the type of task that we're automating.

In [None]:
from sklearn.metrics import roc_curve, auc

In [None]:
fpr, tpr, thres = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(fpr, tpr, color='darkorange',
         lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
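
As mentioned above, other metrics can be more appropriate than AuROC, for example when the positive class is rare. The sketch below computes precision, recall and average precision with scikit-learn on small made-up arrays; in the notebook, `y_true` and `scores` would be replaced by `y_test` and `y_score`.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth and decision scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([-1.2, 0.3, 0.8, -0.1, -0.7, 1.5])

# Threshold-free metrics work directly on the scores
auroc = roc_auc_score(y_true, scores)
avg_precision = average_precision_score(y_true, scores)

# Precision and recall need hard predictions, here with a cut-off at 0
y_pred = (scores > 0).astype(int)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"AuROC: {auroc:.2f}, average precision: {avg_precision:.2f}")
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```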

## Automating the annotation process

Now that we have a model with good performance (~0.93 AuROC), we can use it to mimic the behaviour of a human annotator. To do so, we'll instantiate the `Annotator` class and call it `Rob v1`, a name that reminds us this is our first model.

In [None]:
from linalgo.annotate import Annotator

In [None]:
annotator = Annotator(name='rob_v1', model=text_clf, annotation_type_id=label, threshold=0, owner_id=2)
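
A note on `threshold=0`: we assume here that the `Annotator` compares the model's `decision_function` score against the threshold, which this notebook does not confirm. Under that assumption, a cut-off of 0 on a logistic regression score is the same decision as a predicted probability above 0.5. The sketch below checks this on synthetic data, independently of the SDK.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
scores = clf.decision_function(X)    # signed distance to the decision boundary
probs = clf.predict_proba(X)[:, 1]   # probability of the positive class

# A score above 0 and a probability above 0.5 select the same documents
same_as_proba = np.array_equal(scores > 0, probs > 0.5)
same_as_predict = np.array_equal((scores > 0).astype(int), clf.predict(X))
print(same_as_proba, same_as_predict)
```

Raising the threshold above 0 would make the annotator more conservative (fewer positives, higher precision), and lowering it would do the opposite.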

We're going to use task 46 to test Rob v1.

In [None]:
task = linalgo_client.get_task(46)
annotator.assign_task(task)
print(f"Task name: {task.name}, \nNumber of docs: {len(task.documents)}")

Finally, let's use `Rob v1` to annotate all the documents in that task.

In [None]:
annotations, r = [], []
for doc in task.documents:
    annotation = annotator._get_annotation(doc)
    annotations.append(annotation)
    # Here we save the documents and annotations for quick local visualization
    verdict = "YES" if annotation.type_id != -1 else "NO"
    r.append({'doc': doc.content, 'CT Ideate': verdict, 'score': annotation.score})

Let's visualize the newly annotated documents.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
d = pd.DataFrame(r)
d.loc[:6, ['doc', 'CT Ideate', 'score']]

If we want to, we can upload the annotations to the LinHub website.

In [None]:
linalgo_client.upload([anno.to_json() for anno in annotations][:10])

## Understanding our model's mistakes

Our model is pretty good, but not perfect, and we're always interested in making it better. To that end, it is usually quite informative to look at documents that have been incorrectly annotated by `Rob v1`.

In [None]:
#TODO: DEMO how to compare manual and Rob's annotations.
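
Until that demo is written, here is a minimal sketch of what such a comparison could look like. The documents and labels below are made up, and how to retrieve the manual annotations through the SDK is not shown in this notebook, so the three lists are placeholders.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up documents with manual labels and model predictions (1 = CT Ideate)
docs = [
    "What if we tried a subscription model?",
    "The report is due on Friday.",
    "Maybe we could reuse the old packaging.",
    "Thanks for the update.",
    "Let's sketch three alternative layouts.",
]
manual = [1, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1]

comparison = pd.DataFrame({"doc": docs, "manual": manual, "rob_v1": predicted})

# Documents where the model disagrees with the human annotator
mistakes = comparison[comparison["manual"] != comparison["rob_v1"]]

# Counts of true negatives, false positives, false negatives and true positives
tn, fp, fn, tp = confusion_matrix(manual, predicted, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(mistakes)
```

Reading through the rows of `mistakes` is often the quickest way to spot systematic errors, such as a phrasing the model consistently misses.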