A small project for the SENIOR NLP ENGINEER position at CompagnyXYZ (obfuscated name)
- Clone the project.
git clone http://github.com/ierezell/CompagnyXYZTakeHomeAssessment
- Then from the project root :
- Create a venv :
python -m venv .venv
- Activate it :
source ./.venv/bin/activate
- install the requirements :
pip install -r requirements.txt
-
Install poetry (a python project manager) : https://python-poetry.org/docs/master/#installation
-
Install the project from the root with
poetry install
-
Run commands in the project with
poetry run my_command
or activate a shell withpoetry shell
and thenmy_command
-
The main entry point is
cli
to run all the the commands. (get help withcli --help
).
- Use the cli to split the corpus to passive and active sentences.
- Then train the model or contact me to get pre-trained weights.
- Optimize the model (or contact me to get onnx weights).
- Finally launch the server
- (Optional) Use
poetry build
to generate a wheel to distribute the package.
Will show the swagger documentation created by fastAPI. You can also use this UI to test the routes with the "Try out" button.
Expect a json body parameter named text with one string. Returns a json with one field : "logits" containing the classification result (float between 0 and 1). Close to 1 means quite certain of the text to be a passive sentence.
All the tests were made with pyTest. Please refer to their doc for all the options.
To run all the tests, use : poetry run pytest .
(or pytest .
)
With the finetuned model I can get almost 100% of accuracy by having a threshold (above 0.5 is passive, under is active). We can see the score results here : Results
Model can still be trained and gain performance but overall, the raw accuracy is around 70-80% while the thresholded one is almost 100%.
I can make all the training metrics available if needed (it's on weight and biases).
A request to the server is done in 15ms (wall time & locally) which is good enough for a poc.
- There is a small bias in the data : most of the passive sentences are longer than the non-passive ones.
- There is also more non-passive sentences but I re-balanced the dataset.
- DataLoader could be improved (more shuffling/sampling/augmentation) to get better results.
- To train the classification model, I first used bert with a classification head but it's too much parameters for so few data. Freezing the embedding part and training only the head lead to correct results. Once more data is gathered/generated, we could finetune the embedding layer to obtain better performance.
- For this task in particular, the order of the words are important so any bag of word / non positional encoding seems less efficient (thus the use of bert). I'm afraid that those models would use other feature (like sentence length) to classify but I didn't found the time to test it.
- I used an "MLP-ish" head as I was already using pytorch lightning but any SVM/Regression or other algo could do.
- I didn't removed punctuation or lemmatize/stem words as I trust bert embeddings to deal with that. However, with spacy or nltk it would be easy to do so.
- Api was made using fastApi for speed (uvloop and uvicorn), type hints and easy testing.
- The api was made to host the trained classifier but in a production grade environment, for equal performance (+-5%) using the rule base method would be faster/lighter and thus more suitable.
- The model was compressed using onnx to gain speed and size (""production like"").
- Deployment option could be : lambda function, aws inferentia or other Ec2 based (gpu or not), with onnx or equivalent compilation/optimization.
- For the sentence splitter, the custom rule base one is fine but spacy one (or nltk) would be better.