Description

A small project for the SENIOR NLP ENGINEER position at CompagnyXYZ (obfuscated name)

How to run

Installation

Clone the project: git clone http://github.com/ierezell/CompagnyXYZTakeHomeAssessment
Then from the project root :

With pip

Create a venv : python -m venv .venv
Activate it : source ./.venv/bin/activate
Install the requirements : pip install -r requirements.txt

With poetry

Install Poetry (a Python project manager) : https://python-poetry.org/docs/master/#installation
Install the project from the root with poetry install
Run a command in the project with poetry run my_command, or activate a shell with poetry shell and then run my_command
The main entry point for all the commands is cli (get help with cli --help).

Developer Flow :

Use the cli to split the corpus into passive and active sentences.
Then train the model, or contact me to get pre-trained weights.
Optimize the model (or contact me to get the ONNX weights).
Finally, launch the server.
(Optional) Use poetry build to generate a wheel to distribute the package.

HTTP API

/ or /docs

Shows the Swagger documentation generated by FastAPI. You can also use this UI to test the routes with the "Try it out" button.

/inference

Expects a JSON body with a parameter named text containing one string. Returns a JSON object with one field, "logits", containing the classification result (a float between 0 and 1). A value close to 1 means the model is quite certain that the text is a passive sentence.
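As an illustration of the request and response shape described above, here is a minimal client sketch using only the standard library. The HTTP method (POST) and the base URL http://localhost:8000 (the uvicorn default port) are assumptions, not taken from the project; check /docs for the actual route definition.

```python
import json
import urllib.request


def build_inference_request(
    text: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the request for /inference (POST and base_url are assumptions)."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/inference",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = "http://localhost:8000") -> float:
    """Send the text to a running server and return the "logits" field."""
    request = build_inference_request(text, base_url)
    with urllib.request.urlopen(request) as response:
        return float(json.loads(response.read())["logits"])
```

With the server running, predict("The ball was thrown by John.") should return a value close to 1 if the model considers the sentence passive.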

Testing

All the tests were written with pytest. Please refer to its documentation for all the options.

To run all the tests, use : poetry run pytest . (or pytest .)

Results :

With the fine-tuned model I get almost 100% accuracy when applying a threshold (a score above 0.5 is passive, below is active). The score results can be seen here : Results

The model could still be trained further to gain performance, but overall the raw accuracy is around 70-80% while the thresholded accuracy is almost 100%.
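The thresholding step can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the function names are hypothetical, and treating a score of exactly 0.5 as active is an assumption, since only "above" and "under" are specified.

```python
from typing import List


def classify(score: float, threshold: float = 0.5) -> str:
    """Map a score in [0, 1] to a label; a score equal to the threshold is
    treated as active (assumption)."""
    return "passive" if score > threshold else "active"


def thresholded_accuracy(
    scores: List[float], labels: List[str], threshold: float = 0.5
) -> float:
    """Fraction of scores whose thresholded label matches the expected label."""
    hits = sum(classify(s, threshold) == y for s, y in zip(scores, labels))
    return hits / len(labels)
```

For example, classify(0.93) gives "passive" and classify(0.12) gives "active".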

I can make all the training metrics available if needed (they are on Weights & Biases).

A request to the server takes about 15 ms (wall time, measured locally), which is good enough for a PoC.
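A wall-time figure like this can be reproduced with a small timing helper. The sketch below is an assumption about how one might measure it, not the method used for the number above; measure_latency_ms is a hypothetical name, and call stands for any function that performs one request.

```python
import time
from statistics import median
from typing import Callable


def measure_latency_ms(call: Callable[[], object], repeats: int = 20) -> float:
    """Return the median wall time of call() in milliseconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)
```

Using the median rather than the mean keeps a single slow first request (for example a cold model load) from dominating the result.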

Thoughts

Classifier

There is a small bias in the data : most of the passive sentences are longer than the non-passive ones.
There are also more non-passive sentences, but I re-balanced the dataset.
The DataLoader could be improved (more shuffling/sampling/augmentation) to get better results.
To train the classification model, I first used BERT with a classification head, but that is too many parameters for so little data. Freezing the embedding part and training only the head led to correct results. Once more data is gathered or generated, we could fine-tune the embedding layer to obtain better performance.
For this task in particular, word order is important, so any bag-of-words or non-positional encoding seems less effective (hence the use of BERT). I'm afraid such models would rely on other features (like sentence length) to classify, but I didn't find the time to test it.
I used an "MLP-ish" head as I was already using PyTorch Lightning, but an SVM, a regression or another algorithm could do.
I didn't remove punctuation or lemmatize/stem words, as I trust the BERT embeddings to deal with that. However, it would be easy to do with spaCy or NLTK.
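One way to check how much the sentence-length bias could matter is a length-only baseline: predict "passive" whenever a sentence has at least a given number of words, and see how accurate that alone is. The sketch below is hypothetical and was not run on the project's data; the function names and the toy samples are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[str, int]  # (sentence, label) with 1 = passive, 0 = active


def length_only_accuracy(samples: List[Sample], min_words: int) -> float:
    """Accuracy when predicting passive for sentences with >= min_words words."""
    hits = 0
    for sentence, label in samples:
        predicted = 1 if len(sentence.split()) >= min_words else 0
        hits += int(predicted == label)
    return hits / len(samples)


def best_length_cutoff(samples: List[Sample]) -> Tuple[float, int]:
    """Try every observed sentence length as cutoff; return (accuracy, cutoff)."""
    candidates = sorted({len(sentence.split()) for sentence, _ in samples})
    return max((length_only_accuracy(samples, n), n) for n in candidates)


# Toy data, invented for this example only.
TOY_SAMPLES: List[Sample] = [
    ("The ball was thrown by the boy", 1),
    ("The boy threw the ball", 0),
    ("The cake was eaten by the hungry children", 1),
    ("She reads", 0),
]
```

If such a baseline scores close to the classifier on the real corpus, the model may be relying on length rather than on the passive construction itself.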

Server

The API was built with FastAPI for speed (uvloop and uvicorn), type hints and easy testing.
The API was made to host the trained classifier, but in a production-grade environment, for equal performance (±5%), the rule-based method would be faster and lighter, and thus more suitable.
The model was compressed using ONNX to gain speed and reduce size ("production-like").
Deployment options could be : a Lambda function, AWS Inferentia or another EC2-based instance (GPU or not), with ONNX or an equivalent compilation/optimization.
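To give an idea of why a rule-based detector is so light, here is a rough sketch of one possible heuristic: a form of "to be" followed by a past participle. This is an assumption for illustration only, not the rule-based method of this project; the word lists are deliberately tiny and the pattern over-triggers on adjectives ("She is tired").

```python
import re

# Forms of "to be" that can introduce a passive construction.
BE_FORMS = r"(?:am|is|are|was|were|be|been|being)"
# A few irregular past participles; a real rule set would need many more.
IRREGULAR = r"(?:written|thrown|made|done|seen|taken|given|built|sent|known)"
# be-form, optional -ly adverb, then a regular (-ed) or listed irregular participle.
PASSIVE_RE = re.compile(
    rf"\b{BE_FORMS}\s+(?:\w+ly\s+)?(?:\w+ed|{IRREGULAR})\b",
    re.IGNORECASE,
)


def looks_passive(sentence: str) -> bool:
    """Heuristic check: True if the sentence contains a be-form + participle."""
    return PASSIVE_RE.search(sentence) is not None
```

A single compiled regular expression like this runs in microseconds and needs no model weights, which is the trade-off mentioned above.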

Utils

For the sentence splitter, the custom rule-based one is fine, but the spaCy one (or NLTK) would be better.
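For reference, a rule-based splitter can be as small as the sketch below. It is a hypothetical example and not the project's implementation: it splits on whitespace that follows ".", "!" or "?", so it will wrongly split after abbreviations such as "Dr.", which is the kind of case spaCy or NLTK handle better.

```python
import re
from typing import List

# Split on whitespace that directly follows sentence-ending punctuation.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the ending punctuation."""
    return [part for part in _SENTENCE_BOUNDARY.split(text.strip()) if part]
```

For example, split_sentences("The ball was thrown. Who threw it? John did!") yields three sentences.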
