This IPython notebook builds and trains a deep model in Keras to predict the language of a line of text. For now, the scope is restricted to programming languages, since their syntax is much stricter than that of natural languages.
The training data for each language comes from one large codebase:
- Python: Django
- Java: Java off-heap cache
- C: FreeBSD
- C++: Caffe
To add or modify a language or its configuration for this model, you only need to edit the Config cell in the notebook. You may use any sufficiently large codebase for each language (on the order of tens of thousands of lines, depending on language complexity). Set the CODE_DIRS dictionary to point at the root directories of these repos. The notebook will retrieve all of the desired code from the repos and use it as data.
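
As a rough illustration of what such a configuration and the retrieval step might look like, here is a minimal sketch. Only the name CODE_DIRS comes from the notebook; its exact structure, the local paths, the file extensions, and the helpers `collect_lines` and `build_dataset` are assumptions made for this example and may differ from the actual Config cell.

```python
import os

# Hypothetical layout: language label -> (repo root, file extensions to keep).
# The paths below are placeholders for wherever the repos are checked out.
CODE_DIRS = {
    "python": ("data/django", (".py",)),
    "java": ("data/java-cache", (".java",)),
    "c": ("data/freebsd", (".c",)),
    "c++": ("data/caffe", (".cpp", ".hpp", ".cc")),
}


def collect_lines(root, extensions):
    """Walk `root` recursively and return the non-blank lines of matching files."""
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(extensions):  # str.endswith accepts a tuple
                path = os.path.join(dirpath, name)
                # Ignore undecodable bytes so a stray binary blob does not abort the walk.
                with open(path, encoding="utf-8", errors="ignore") as f:
                    lines.extend(line.rstrip("\n") for line in f if line.strip())
    return lines


def build_dataset(code_dirs):
    """Return parallel lists: one line of code per sample, with its language label."""
    samples, labels = [], []
    for language, (root, extensions) in code_dirs.items():
        for line in collect_lines(root, extensions):
            samples.append(line)
            labels.append(language)
    return samples, labels
```

Under these assumptions, calling `build_dataset(CODE_DIRS)` yields one labeled sample per non-blank line of code, and adding a language amounts to adding one entry to the dictionary.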