News Labs – Senior Software Engineer – Coding Exercise

Assumptions

  • The API should allow supporting documents with multiple languages
  • The API should allow supporting multiple file types
  • The API should be a module that can be used to expose different kinds of interfaces (e.g. JavaScript module, CLI, REST API, microservice...)
  • We can't rely on fixed standards and conventions for formatting and annotations, so the tool should stay flexible enough to tweak once we can test it against a larger test set.
  • The API should return structured data in such a way that the consumer can filter the data it wishes to use. For example, it should allow separating spoken text (which I will call speech) from technical notes (which I will call annotations).

Limitations

  • For this stage of development I decided to discard formatting entirely, reduce the document(s) to plain text and split them into lines (which I call segments).
  • To aid language recognition I assumed that the languages contained in a file are known in advance. For example, for the given file I'm assuming we know it contains English and Farsi.

Process

I decided on an output JSON format I wanted to achieve and compiled a few lines as an example (src/fixtures/farsi-english-script.json).
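
Roughly, the shape I was aiming for is file-level metadata plus a list of segments, each carrying its text, detected language and type (the field names and sample lines below are illustrative; the fixture is the reference):

{
  "file": {
    "name": "farsi-english-script.rtf",
    "languages": ["eng", "fas"]
  },
  "segments": [
    { "text": "SOUND UP", "language": "eng", "type": "annotation" },
    { "text": "Hello and welcome to the programme.", "language": "eng", "type": "speech" }
  ]
}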

To explore the problem I set up some tests using Jest and tried out a few different libraries to perform language recognition on each line.

I quickly confirmed my assumption that both recognising the language of each line and separating spoken text from annotations were not easy.

For language detection I decided to use franc, which uses trigrams (character triples and their frequency in each language). The accuracy is not completely satisfying, especially with shorter texts, which are frequent here. Distinguishing Farsi from English could be done more simply by evaluating the character set of each line, but I wanted to try a solution that would support other languages too.
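
A minimal sketch of the per-line detection (assuming the CommonJS franc-min API, which returns an ISO 639-3 code, or 'und' when it cannot decide):

const franc = require('franc-min');

// Returns an ISO 639-3 code such as 'eng' or 'fas', or 'und' when undetermined.
// Short segments are the weak spot: the trigrams have little text to work with.
// Depending on the franc version, an option (whitelist/only) can restrict
// detection to the languages the caller says are expected.
const detectLanguage = (line) => franc(line);

detectLanguage('Hello and welcome to the programme.'); // most likely 'eng'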

For distinguishing speech from annotations I first tried writing functions that would take into consideration language, text case and the presence of certain keywords. However, I thought this would be hard to maintain and extend.
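
For illustration only, this is the kind of rule I mean (the keywords and thresholds are hypothetical), and it shows why the approach would sprawl as new scripts arrive:

// Hypothetical heuristic: treat upper-case English lines containing
// production keywords as annotations, and everything else as speech.
const ANNOTATION_KEYWORDS = ['SOT', 'GFX', 'VT', 'SOUND UP']; // illustrative list

const looksLikeAnnotation = (segment) =>
  segment.language === 'eng' &&
  (segment.text === segment.text.toUpperCase() ||
    ANNOTATION_KEYWORDS.some((keyword) => segment.text.includes(keyword)));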

My second approach was to use a text classifier. Given a larger set of documents, the classifier could be trained on a set of segments and then evaluated against another set. The model could then be made part of the distributable module and applied to larger datasets.

Given the time allocated and my limited practical experience with text classification tools, I decided to keep things simple and adopt a Naive Bayes module that I think would be enough for a POC.
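
As a sketch of the idea (using the BayesClassifier from the natural package here purely as an example; the module actually used in classifier/ may differ):

const natural = require('natural');

const classifier = new natural.BayesClassifier();

// Training data: segments labelled by hand, or confirmed via the training app below.
classifier.addDocument('SOUND UP', 'annotation');
classifier.addDocument('Hello and welcome to the programme.', 'speech');
classifier.train();

classifier.classify('Thank you for watching.'); // e.g. 'speech'
// The trained model can be saved to JSON and shipped with the distributable module.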

I did some testing by providing a few lines of training data and saw some improvement.

Training the model will be the main challenge going forward, so I set up a simple Express/React/Redux app to expose the segments and allow confirming or changing the type returned by the classifier.

Video walkthrough

The training app would also make use of the module itself to load and parse documents, making the result available to the UI.
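
A minimal sketch of that wiring (the endpoint, port and file path are illustrative; training-app/server.js is the reference):

const express = require('express');
const extractor = require('../src');

const app = express();

// Parse a document with the module and hand the segments to the React UI,
// where a human can confirm or correct the type assigned by the classifier.
app.get('/api/segments', async (req, res) => {
  const result = await extractor('path/to/script.rtf', ['eng', 'fas']);
  res.json(result);
});

app.listen(3000);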

Solution

A Node.js module that, given a file, extracts the text and returns a structured JSON representation of it.

Features:

  • Returns metadata about the file and a list of segments in a structured JSON format
  • Uses ISO 639-3 language codes
  • Leverages franc-min for detecting languages (supports 82 languages, but could be replaced to support up to 402 languages)
  • Leverages the textract module for extracting text from different document formats (pdf, doc, docx, rtf...) but relies on some extra dependencies (see Usage); a sketch of the underlying call is shown below.
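
For reference, the underlying extraction step looks roughly like this (wrapped in a promise here; the real module layers language detection and classification on top):

const textract = require('textract');

// textract picks an extractor based on the file type; some formats (e.g. rtf)
// shell out to external tools such as unrtf, hence the extra dependencies.
const extractText = (filePath) =>
  new Promise((resolve, reject) =>
    textract.fromFileWithPath(filePath, (err, text) =>
      err ? reject(err) : resolve(text)
    )
  );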

Usage

Requirements:

  • Node (8.x or higher) and npm (6.x or higher) installed
  • unrtf installed. On macOS with Homebrew, run brew install unrtf
# clone the repo
# install dependencies for the classifier, the module and the training app projects
npm --prefix classifier i
npm --prefix src i
npm --prefix training-app i

# run the training app
cd training-app
npm run start
# open in browser localhost:3000
# stop the server with ctrl+c

# generate the model
cd ../classifier
node generate.js

# run the training app again to see results

The module exports a function; example usage can be found in training-app/server.js.

At this stage the module is a POC, but it could be packaged and distributed so that it can be used as follows:

const extractor = require('../src');

(async () => {
  const response = await extractor('path/to/file.rtf', ['eng', 'fas']);
})();

The unit tests are left incomplete.

Tech used

  • Node & JavaScript
  • Jest
  • React, Redux
  • Blueprintjs