# Annotating Data to Fit SpaCy Specifications

I'm the kind of programmer who needs a good idea of everything that goes on in my code. I can't just have a black box where I put one thing in and get one thing out; I need a comprehensive overview of every step in between. That's why I write these guides and tutorials: laying out my thinking process, the steps, the data analysis, and so on helps me bolster my own understanding of what's happening under the hood.

SpaCy's models expect data in their own JSON format, outlined [here](https://spacy.io/api/annotation#json-input). Since the documentation gets pretty technical without a lot of step-by-step explanation, I decided to work through a simpler case as a demo. In this use case, I have a PDF of a clinical paper in which I want to identify some simple named entities like 'condition name = cdn', 'treatment options = top', and 'negative qualities = ntq'. Obviously, I don't have an appropriate dataset ready to train on, so I have to make one.
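To make the goal concrete, here is a minimal sketch of what one annotated sentence could look like as character-offset spans. The sentence and the `find_span` helper are made up for illustration (they are not from the clinical paper or from SpaCy); only the three labels come from the use case above.

```python
# Hypothetical example sentence, NOT taken from the actual clinical paper.
text = "Chronic migraine is often treated with topiramate, which can cause fatigue."


def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text.

    Hypothetical helper: it only saves counting characters by hand.
    """
    start = text.index(phrase)  # raises ValueError if the phrase is missing
    return (start, start + len(phrase), label)


# One span per entity, using the three labels from this tutorial.
annotations = [
    find_span(text, "Chronic migraine", "cdn"),  # condition name
    find_span(text, "topiramate", "top"),        # treatment option
    find_span(text, "fatigue", "ntq"),           # negative quality
]

for start, end, label in annotations:
    print(f"{label}: {text[start:end]!r} at [{start}, {end})")
```

Every annotation tool discussed below is, in one way or another, a way of producing spans like these without counting characters yourself.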

### What are my options?
Medium author [Kaustumbh Jaiswal](https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718) writes about adding custom labels to SpaCy _but he brings in a pre-made dataset in CSV form_. I talk a little bit about why the dataset you use is so important in the next post, but for now, this kind of pre-made setup isn't useful for our purposes because we want to create something that works in an entirely custom environment. We don't want to slightly improve SpaCy's model; we want to change it completely. Therefore, we need a tool, program, or schema for converting text into the annotated JSON format.

Based on a quick Google search, there are a couple of easy methods for annotating text into a workable JSON format. A fair number of the options out there fall into a group I call __'click, drag, highlight, select' apps__, which I go into in a bit of detail below.

Medium author [Manivannan Murugavel](https://medium.com/@manivannan_data) has a decent text prep application [available on GitHub](https://github.com/ManivannanMurugavel/spacy-ner-annotator) and an [overview](https://medium.com/@manivannan_data/how-to-train-ner-with-custom-training-data-using-spacy-188e0e508c6) of how to work it. However, the JSON produced by his application doesn't seem to match the JSON format that SpaCy outlines in the annotation guidelines I referenced above. There are some issues that I mention below, but if you're having trouble visualizing what SpaCy wants you to do, running Murugavel's code is how I began to understand it.

![one.png](attachment:one.png)

vs.

![two.png](attachment:two.png)

This might be because I didn't tag everything the right way, since the data I'm working with wasn't in a Q/A format. But regardless of how I tagged things, the output wouldn't have matched: there's no identifier called 'content' in SpaCy's example structure, nor 'entities'. They may also have changed their formatting since March 2019.
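To see how far apart the two shapes are, here is a rough sketch of converting a record with 'content' and 'entities' keys into something resembling the nested structure in SpaCy's JSON documentation. Several things here are assumptions on my part: the shape of each entity as `(start, end, label)`, the regex tokenizer (SpaCy's own tokenizer is more elaborate), the function names `biluo_tags` and `to_spacy_json`, and the example sentence. The nested field names (`id`, `paragraphs`, `raw`, `sentences`, `tokens`, `orth`, `ner`) follow my reading of the linked format, and I only fill in a subset of the token fields.

```python
import json
import re


def biluo_tags(text, spans):
    """Tokenize text and assign a BILUO tag to every token.

    B = begin, I = inside, L = last, U = single-token entity, O = outside.
    """
    # Simplified tokenizer (assumption): runs of word characters, or single punctuation marks.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # Indices of the tokens that lie fully inside this character span.
        inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
        if not inside:
            continue  # span does not line up with any token; leave it untagged
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"
        else:
            tags[inside[0]] = f"B-{label}"
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"
            tags[inside[-1]] = f"L-{label}"
    return tokens, tags


def to_spacy_json(doc_id, text, spans):
    """Build one document in a nested, token-level structure (subset of fields only)."""
    tokens, tags = biluo_tags(text, spans)
    return {
        "id": doc_id,
        "paragraphs": [{
            "raw": text,
            "sentences": [{
                "tokens": [
                    {"id": i, "orth": tok, "ner": tag}
                    for i, ((tok, _, _), tag) in enumerate(zip(tokens, tags))
                ],
            }],
        }],
    }


# Hypothetical annotator-style record using the 'content' and 'entities' keys.
record = {
    "content": "Chronic migraine is treated with topiramate.",
    "entities": [(0, 16, "cdn"), (33, 43, "top")],
}

doc = to_spacy_json(0, record["content"], record["entities"])
print(json.dumps([doc], indent=2))
```

The takeaway is that the annotator's output is span-level (character offsets per entity) while the documented JSON is token-level (one `ner` tag per token), so some tokenization step has to sit between the two.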

I found this [tool](https://dataturks.com/features/document-ner-annotation.php) by DataTurks with an [explanation](https://dataturks.com/help/dataturks-ner-json-to-spacy-train.php) of how to convert their JSON files to ones used by SpaCy, but it still looks like annotating the data by hand, and I don't want to do that unless I absolutely have to.

### Click, Drag, Highlight, Select applications are annoying for large texts. Is there an easier way to assemble a training dataset?
SpaCy maintainer Ines Montani got this [exact question](https://stackoverflow.com/questions/46826541/methods-for-creating-training-data-for-spacy-models) on Stack Overflow in 2017 and gave a pretty comprehensive answer. She recommended [Prodigy](https://prodi.gy/), a tool from the makers of SpaCy that amplifies an existing tag set through active learning and user input. Prodigy seems like a good option once I already have a set of tags up and running and want a machine learning model to annotate everything for me and improve as it goes. __Its cost, though, makes it too pricey to use on a case-by-case basis or in the initial stages of a project__. I don't even have my example tags set up, so I don't think it's a good option right now. She also recommended [Brat](http://brat.nlplab.org/), but that's more for highlighting and annotating spans of text than for creating NER training data, and she was unsure what the eventual output would be. If anybody's had experience using Brat, let me know in the comments below.

