Telegram ML competition 2023

The task in this competition is to create a library that detects the programming or markup language of a code snippet.

My solution is inspired by this article about character-level CNNs for programming language classification.
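
As a rough illustration of what "char-level" means here, the sketch below turns a snippet into a fixed-length sequence of integer ids, which is the kind of input a character-level CNN typically consumes. The alphabet, the `MAX_LEN` value, and the `encode_snippet` helper are assumptions made for this example; they are not taken from the repository.

```python
import string

# Assumed alphabet: printable ASCII. Id 0 is reserved for padding,
# id 1 for any character outside the alphabet.
ALPHABET = string.printable
PAD_ID, UNK_ID = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 16  # hypothetical; a real model would use a much longer window


def encode_snippet(text, max_len=MAX_LEN):
    """Map each character to an integer id, then truncate or pad to max_len."""
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in text[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


encoded = encode_snippet("print('hi')")
print(len(encoded), encoded)
```

Short snippets are right-padded with `PAD_ID` and long ones are cut at `max_len`, so every example has the same shape when batched.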

Steps:

The histogram below shows the number of code snippets for each language (the numbers on the X axis match the descriptions in the common.py file).

Then I created a model and trained it on ~150k examples for 10 epochs.
After training, the model reached 79% accuracy on the validation dataset (~37k samples).
Finally, I created a Telegram bot to test the model in real-world use.
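
The per-language counts behind the histogram and the validation accuracy mentioned above can both be computed in a few lines. This is a minimal sketch in plain Python with invented label ids and predictions; the real ids are described in common.py, and the `class_counts` and `accuracy` helpers are hypothetical names used only for illustration.

```python
from collections import Counter


def class_counts(labels):
    """Number of snippets per language id (the data behind a histogram)."""
    return Counter(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy data, invented for illustration only.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

print(class_counts(y_true))
print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
```

With an imbalanced dataset, looking at the counts next to the overall accuracy helps show whether a few frequent languages dominate the score.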
