News Labs – Senior Software Engineer – Coding Exercise
- The API should allow supporting documents with multiple languages
- The API should allow supporting multiple file types
- We can't rely on fixed standards or conventions for formatting and annotations, so the tool should stay flexible enough to be tweaked once we can test it against a larger sample set.
- The API should return structured data in a way that lets the consumer filter the data it wishes to use. For example, it should allow separating spoken text (which I will call speech) from technical notes (which I will call annotations).
- For this stage of development I decided to discard formatting entirely, reduce the document(s) to plain text, and split the text into lines (which I call segments).
- To aid language recognition I assumed that the languages contained in a file are known in advance. For example, for the given file I assume we know it contains English and Farsi.
I decided on an output JSON format I would like to achieve and compiled a few lines as an example (src/fixtures/farsi-english-script.json).
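As a rough illustration of the idea (the field names below are assumptions for this sketch, not necessarily the exact contents of the fixture file), each segment might carry its text, a language code, and a type that a consumer can filter on:

```javascript
// Hypothetical segment shape; the real fixture in
// src/fixtures/farsi-english-script.json may differ.
const segments = [
  { text: 'INT. LIVING ROOM - DAY', language: 'eng', type: 'annotation' },
  { text: 'Hello, how are you?', language: 'eng', type: 'speech' }
];

// A consumer could then keep only the spoken lines:
const speechOnly = segments.filter((s) => s.type === 'speech');
```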
To allow me to try to satisfy the problem I set up some tests using Jest and tried out a few different libraries to perform language recognition on each line.
I quickly confirmed my assumption that both recognising the language of each line and separating spoken text from annotations were not easy.
For language detection I decided to use franc, which uses trigrams (character triples and their frequency in each language). The accuracy is not completely satisfying, especially with the shorter texts that are common here. Distinguishing Farsi from English could be made simpler by evaluating the character set of each line, but I wanted to try a solution that would support other languages too.
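A minimal sketch of the trigram idea franc is built on (this is a toy illustration, not franc's actual model): extract character triples from a sample and see how much they overlap with a per-language frequency profile.

```javascript
// Toy illustration of trigram-based language scoring.
// Not franc's real implementation; just the underlying idea.
function trigrams(text) {
  const padded = ` ${text.toLowerCase().trim()} `;
  const counts = new Map();
  for (let i = 0; i <= padded.length - 3; i++) {
    const tri = padded.slice(i, i + 3);
    counts.set(tri, (counts.get(tri) || 0) + 1);
  }
  return counts;
}

// Score: how many of the sample's trigrams appear in a language profile.
// A real detector compares ranked frequencies across many languages.
function overlap(sample, profile) {
  let score = 0;
  for (const tri of sample.keys()) {
    if (profile.has(tri)) score++;
  }
  return score;
}

const englishProfile = trigrams('the quick brown fox jumps over the lazy dog');
const sampleScore = overlap(trigrams('the fox'), englishProfile);
```

Short inputs produce very few trigrams, which is why accuracy degrades on the short lines typical of a script.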
For distinguishing speech from annotations I first tried writing functions that take into account language, text case, and the presence of certain keywords. However, I thought this approach would be hard to maintain and extend.
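A sketch of that first rule-based approach (the keyword list and rules are illustrative assumptions, chosen to show why this gets brittle quickly):

```javascript
// Naive rule-based classifier. The keywords are hypothetical examples of
// screenplay conventions; real documents would need many more rules.
const ANNOTATION_KEYWORDS = ['CUT TO', 'FADE IN', 'INT.', 'EXT.', 'V.O.'];

function classifySegment(text) {
  const trimmed = text.trim();
  // All-caps lines are often stage directions or technical notes.
  const letters = trimmed.replace(/[^a-zA-Z]/g, '');
  const isAllCaps = letters.length > 0 && letters === letters.toUpperCase();
  const hasKeyword = ANNOTATION_KEYWORDS.some((k) => trimmed.includes(k));
  return isAllCaps || hasKeyword ? 'annotation' : 'speech';
}
```

Every new document convention means another hand-written rule, which is exactly the maintenance problem that pushed me towards a classifier.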
My second approach was to use a text classifier. Given a larger set of documents, the classifier could be trained on one set of segments and then evaluated against another. The model could then be made part of the distributable module and applied to larger datasets.
Given the time allocated and my limited practical experience with text-classification tooling, I decided to keep things simple and adopt a Naive Bayes module that I think would be enough for a POC.
I did some testing by providing a few lines of training data and saw some improvement.
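For reference, the Naive Bayes idea can be sketched in a few dozen lines (this is a self-contained illustration, not the module actually used, whose API will differ):

```javascript
// Minimal multinomial Naive Bayes over word tokens, with add-one smoothing.
class NaiveBayes {
  constructor() {
    this.docCounts = {};   // label -> number of training segments
    this.wordCounts = {};  // label -> { word -> count }
    this.totalWords = {};  // label -> total word count
    this.vocab = new Set();
    this.totalDocs = 0;
  }

  tokenize(text) {
    return text.toLowerCase().match(/\S+/g) || [];
  }

  train(text, label) {
    this.docCounts[label] = (this.docCounts[label] || 0) + 1;
    this.totalDocs++;
    this.wordCounts[label] = this.wordCounts[label] || {};
    this.totalWords[label] = this.totalWords[label] || 0;
    for (const w of this.tokenize(text)) {
      this.wordCounts[label][w] = (this.wordCounts[label][w] || 0) + 1;
      this.totalWords[label]++;
      this.vocab.add(w);
    }
  }

  classify(text) {
    let best = null;
    let bestScore = -Infinity;
    for (const label of Object.keys(this.docCounts)) {
      // log prior + sum of log likelihoods (add-one smoothing)
      let score = Math.log(this.docCounts[label] / this.totalDocs);
      for (const w of this.tokenize(text)) {
        const count = (this.wordCounts[label][w] || 0) + 1;
        score += Math.log(count / (this.totalWords[label] + this.vocab.size));
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}

// Train on a handful of made-up segments, as in my manual testing.
const nb = new NaiveBayes();
nb.train('FADE IN INT LIVING ROOM DAY', 'annotation');
nb.train('CUT TO EXT STREET NIGHT', 'annotation');
nb.train('hello how are you today', 'speech');
nb.train('i am fine thank you', 'speech');
```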
Training the model will be a main challenge going forward, so I set up a simple Express/React/Redux app to expose the segments and allow confirming or changing the type returned by the classifier.
The training app would also make use of the module itself to load and parse documents, making the result available to the UI.
A Node.js module that, given a file, will extract text and return a structured JSON representation of it.
- Returns metadata about the file and a list of segments in a structured JSON format
- Uses ISO 639-3 language codes
- Leverages franc-min for detecting languages (supports 82 languages; could be swapped for the full franc build to support up to 402 languages)
- Leverages the textract module for extracting text from different document formats (pdf, doc, docx, rtf, ...) but relies on some extra dependencies (see Usage).
- Node (8.x or higher) and npm (6.x or higher) installed
- unrtf installed. On OSX with brew run:

```shell
brew install unrtf
```
```shell
# clone the repo

# install dependencies for the classifier, the module and the training app projects
npm --prefix classifier i
npm --prefix src i
npm --prefix training-app i

# run the training app
cd training-app
npm run start
# open localhost:3000 in a browser
# stop the server with ctrl+c

# generate the model
cd ../classifier
node generate.js

# run the training app again to see results
```
The module exports a function; example usage is present in
At this stage the module is a POC but it could be packaged and distributed so that it could be used as follows:
```javascript
const extractor = require('../src');

// extractor returns a promise, so await it inside an async context
(async () => {
  const response = await extractor('path/to/file.rtf', ['eng', 'fas']);
})();
```
The unit tests are left incomplete.
- React, Redux