In this document, I give some directions for future work. The focus is on engineering rather than machine learning / data science,
as the notebooks already cover the latter.

About the TODOs:
- I added almost no checks for file formats, and a function to check the IMDB directory format is still pending implementation
- Implement the DL model, making it sklearn-friendly, so that the base class is compatible
- Make the `n_jobs` dependent on the number of cores the machine has
- It's more efficient to read just the small sample rather than loading all data and then subsampling.
- Add tests for all the code
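
For the `n_jobs` item above, a minimal sketch of deriving the value from the machine, using only the standard library. The helper name `default_n_jobs` and the `reserve` parameter are hypothetical and not part of the current code. Note that scikit-learn estimators also accept `n_jobs=-1` to use all processors, which may be enough on its own.

```python
import os


def default_n_jobs(reserve: int = 1) -> int:
    """Return a worker count based on the number of CPUs on this machine.

    `reserve` (hypothetical parameter) is the number of cores left free for
    the rest of the system. The result is never lower than 1.
    """
    total = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, total - reserve)


if __name__ == "__main__":
    print(f"Using n_jobs={default_n_jobs()}")
```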

About CLI:
- Add a verbosity parameter to silence the logger by setting its level to warning
- Better documentation about what the options in the CLI do
- Add option to switch between classical and DL model
- Allow user to get soft predictions (probability) from CLI
- Turn the config file into a plain-text file, and allow the user to pass a config file to use alternative settings
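
The CLI items above could look roughly like the following `argparse` sketch. All flag names (`--quiet`, `--model`, `--proba`, `--config`) are assumptions made for illustration; the current CLI may use a different framework and different option names.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options suggested in the list above (hypothetical names)."""
    parser = argparse.ArgumentParser(description="Prediction CLI (illustrative sketch)")
    parser.add_argument("--quiet", action="store_true",
                        help="only log warnings and errors")
    parser.add_argument("--model", choices=["classical", "dl"], default="classical",
                        help="which model family to use")
    parser.add_argument("--proba", action="store_true",
                        help="output probabilities instead of hard labels")
    parser.add_argument("--config", default=None,
                        help="path to a config file with alternative settings")
    return parser


def configure_logging(quiet: bool) -> None:
    """Set the root logger to WARNING when quiet, INFO otherwise."""
    logging.getLogger().setLevel(logging.WARNING if quiet else logging.INFO)


if __name__ == "__main__":
    args = build_parser().parse_args()
    configure_logging(args.quiet)
    logging.info("model=%s proba=%s config=%s", args.model, args.proba, args.config)
```

The `help` strings double as the "better documentation" item, since `argparse` prints them with `--help`.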


About better packaging:
- I added `data/sample` inside the package, but it should live somewhere else. I'm not sure how to solve this with poetry. In fact, this is an issue in poetry still to be solved: https://github.com/python-poetry/poetry/issues/890
- Cythonise code to speed up and obfuscate IP
- I'm not an expert on this, but there has to be a way to bundle the dependencies together with the package, so that it can be installed offline
- Add the wheel to a (potentially company-internal) repository, rather than to the GitHub repo.


About production-ready code:
- Add missing docstrings
- Implement black pre-commit hooks
- Use python typing
- Use a linter, e.g. pylint, together with isort
- Enable continuous integration

About production-ready system / scaling:
- For very big data, it would probably be better to use something like Spark rather than pandas. Dask can also be useful. I'm not very experienced with this, but I'm actually going to https://pages.databricks.com/202001-EU-EV-UnifyDataMLworkshopLondon_04.WaitlistPage.html next month :)
- Set up a flask / django server to enable API calls, so that the model doesn't need to run on the same machine as the caller
- Allow data to live in a database, which scales better than reading from disk
- Use something like `MLflow` to version the model and data used to train it
- If speed is critical
   - Enable parallelised predictions, usually this is easy and we use
   - Some models, like SVM, allow decentralised/parallelised training, where not all data needs to be on the same node
   - TF-IDF, for example, allows easy parallelisation. So depending on the final ML pipeline we might parallelise (either across cores or different physical machines) some bits to improve speed/memory
- If data is continuously arriving, allow the system to periodically auto-retrain with its own outputs, logging metrics. In each of those iterations, one might want the system to send a random sample of the outputs (some non-confident and some confident) to a human, who can check that everything is going fine.
- In a really big system we should create a Docker image, which is more easily deployed and machine-agnostic. Then, we can use something like Kubernetes for orchestration
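
As an illustration of the parallelised-predictions item above, here is a minimal sketch that splits the input into chunks and predicts each chunk concurrently. The function names and the `predict` callable are hypothetical; `predict` stands in for whatever the final pipeline exposes. Threads are used here only to keep the example simple: for CPU-bound pure-Python models, a process pool or joblib would likely be the better choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunked(items: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split `items` into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def predict_in_parallel(
    predict: Callable[[Sequence[str]], List[int]],
    texts: Sequence[str],
    chunk_size: int = 1000,
    max_workers: int = 4,
) -> List[int]:
    """Run `predict` on each chunk concurrently and return labels in input order."""
    chunks = chunked(texts, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input chunks.
        results = list(pool.map(predict, chunks))
    return [label for chunk_result in results for label in chunk_result]


if __name__ == "__main__":
    # Toy stand-in for a real model: label is 1 if the text has odd length.
    def toy_predict(batch: Sequence[str]) -> List[int]:
        return [len(text) % 2 for text in batch]

    print(predict_in_parallel(toy_predict, ["a", "bb", "ccc"], chunk_size=2))
```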
