# JetBrains ML Engineer Intern (CLPD) Test Assignment

__Task Description:__
Please, send us a link to a git repository with the self-contained, reproducible Jupyter notebook that includes code and text comparing any two existing open source solutions for the problem described in the position posting (e.g enry and pygments) on some small but realistic dataset of your choice (e.g clone several OSS git repositories). It must be possible to re-run all the cells in the notebook without any errors to get similar results. Be prepared to discuss the implementation details and trade-offs of each of the chosen approaches.  

__Job posting link:__ https://www.jetbrains.com/careers/jobs/ml-engineer-intern-cpld-705/  

__Candidate Name:__ Mudit Chaudhary  
__Candidate Email:__ mchaudhary@umass.edu

## 1. Chosen open source solutions for the programming language detection task

In this assignment, I chose the following open source solutions for the programming language detection task:

1. __Enry:__ A programming language detector and toolbox for ignoring binary or vendored files. enry started as a Go port of the original Linguist Ruby library and reports a 2x performance improvement over it. [Link](https://github.com/go-enry/go-enry)

2. __GuessLang:__ Guesslang is a neural-network-based programming language detector that supports over 50 programming languages. It is also used by Microsoft VS Code for programming language detection. [Link](https://guesslang.readthedocs.io/en/latest/)

### 1.1 Motivation to choose these solutions
These two solutions were chosen because they approach the problem in inherently different ways. Enry takes a multi-faceted approach, using not only the file content but also other features such as the file name and extension, whereas Guesslang relies only on the file content. We discuss the differences between their implementations throughout this notebook.
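To make the contrast concrete, the following is a minimal sketch of a multi-strategy detector that tries cheap file-level signals first and only falls back to a content-based model when they fail. It is an illustration only: the lookup tables, the function names (`by_extension`, `by_shebang`, `detect`), and the first-match ordering are assumptions made for this example and do not reproduce enry's actual strategies or data.

```python
import os
import re
from typing import Callable, Optional

# Hypothetical lookup tables for illustration only; they are not enry's data.
EXTENSION_MAP = {".py": "Python", ".rb": "Ruby", ".cs": "C#", ".js": "JavaScript"}
INTERPRETER_MAP = {"python": "Python", "python3": "Python", "ruby": "Ruby",
                   "node": "JavaScript", "bash": "Shell", "sh": "Shell"}

# Matches "#!/bin/bash" as well as "#!/usr/bin/env python3".
SHEBANG_RE = re.compile(r"^#!\s*(?:/usr/bin/env\s+)?(\S+)")


def by_extension(filename: str, content: str) -> Optional[str]:
    """Guess the language from the file extension alone."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_MAP.get(ext.lower())


def by_shebang(filename: str, content: str) -> Optional[str]:
    """Guess the language from the interpreter named on the first line."""
    first_line = content.split("\n", 1)[0]
    match = SHEBANG_RE.match(first_line)
    if match is None:
        return None
    return INTERPRETER_MAP.get(os.path.basename(match.group(1)))


def detect(filename: str, content: str,
           fallback: Optional[Callable[[str], Optional[str]]] = None) -> Optional[str]:
    """Return the first answer from the cheap strategies, else ask the fallback."""
    for strategy in (by_extension, by_shebang):
        language = strategy(filename, content)
        if language is not None:
            return language
    # A content-only model (the role Guesslang plays) would be plugged in here.
    return fallback(content) if fallback is not None else None
```

A content-only detector corresponds to calling just the `fallback`, which is why it is the only option when a snippet arrives without a file name.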

## 2. Requirements to run this notebook

The Jupyter notebook assumes the following programming languages are already installed on the system:
1. Python (Tested on 3.7 or later)
2. Go (Tested on 1.17.3)

Most of the requirements are installed within this notebook. However, if JetBrains' organizational constraints prevent installing packages from within the Jupyter notebook, the required packages are listed below:


1. guesslang == 2.2.1


The following code cell installs the required packages. Please do not run it if the packages are already installed separately.

In [19]:
# Install packages
!pip install guesslang==2.2.1
!go mod init jetbrain_oa.com/enryExample
!go get github.com/go-enry/go-enry/v2

go: creating new go.mod: module jetbrain_oa.com/enryExample


## 3. Dataset 
The dataset on which I evaluate and compare the two solutions is collected from open source GitHub repositories. It is small and consists of the following repositories:

1. [facebookresearch/DPR](https://github.com/facebookresearch/DPR)
2. [Homebrew/brew](https://github.com/Homebrew/brew)
3. [newrelic/opensource-website](https://github.com/newrelic/opensource-website)
4. [kkos/oniguruma](https://github.com/kkos/oniguruma)
5. [Azure/azure-storage-net-data-movement](https://github.com/Azure/azure-storage-net-data-movement)
6. [laravel/laravel](https://github.com/laravel/laravel)

This collection of repositories is deliberately chosen to increase the diversity of languages and repository structures. The data is available in the ```data``` folder.
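If the ```data``` folder is missing, the repositories listed above could be fetched again with a short helper like the one below. This is a sketch under assumptions: the helper names (`clone_command`, `clone_all`) are made up for this example, and a shallow clone of the current default branch is used because the exact commits behind the recorded results are not stated in this notebook, so the numbers may differ slightly.

```python
import subprocess
from pathlib import Path
from typing import List

# Repository URLs taken from the dataset list above.
REPOS = [
    "https://github.com/facebookresearch/DPR",
    "https://github.com/Homebrew/brew",
    "https://github.com/newrelic/opensource-website",
    "https://github.com/kkos/oniguruma",
    "https://github.com/Azure/azure-storage-net-data-movement",
    "https://github.com/laravel/laravel",
]


def clone_command(url: str, data_dir: str = "./data") -> List[str]:
    """Build the git command that shallow-clones one repository into data_dir."""
    name = url.rstrip("/").split("/")[-1]
    return ["git", "clone", "--depth", "1", url, str(Path(data_dir) / name)]


def clone_all(data_dir: str = "./data") -> None:
    """Clone every repository that is not already present locally."""
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    for url in REPOS:
        command = clone_command(url, data_dir)
        if Path(command[-1]).exists():
            continue  # keep an existing checkout untouched
        subprocess.run(command, check=True)


# Usage (requires git and network access):
# clone_all("./data")
```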

In [20]:
# Import statement
import time
import guesslang
from typing import List, Tuple, Optional
import os

----------------------------------------------------------------------------------------------  
----------------------------------------------------------------------------------------------  
## 4. Comparison 


### 4.1 Comparison of language detection features

In this section, we compare the major language detection features that the two solutions offer. After describing the differences, we demonstrate those features in code. We do not compare the lists of supported languages in this section.

![alt text](./images/feature_comparison.png)

At first glance, we might conclude that Guesslang falls well short of enry. However, the two solutions are at very different stages of development and follow slightly different design philosophies.

#### 4.1.1 Using Guesslang on a single code file

We pick a single code file, ```./data/DPR/generate_dense_embeddings.py```, which is a Python source file. We also measure the time Guesslang takes to process it.

In [21]:
def guessLangFile(filepath: str, verbose: bool = False, topk: int = 3,
                  model: Optional[guesslang.Guess] = None) -> List[Tuple[str, float]]:
    """
    Given a file path, return the top-k (language, probability) pairs.
    Timings are printed when verbose is True.
    """
    start = time.time()
    # Reuse a preloaded model when one is passed in; loading it is the slow step.
    guessLangModel = guesslang.Guess() if model is None else model
    endModelInit = time.time()

    # The context manager closes the file even if reading fails.
    with open(filepath, "r") as codeFile:
        code = codeFile.read()
    probs = guessLangModel.probabilities(code)
    end = time.time()

    if verbose:
        if model is None:
            print("Time to intialize guessLang model(secs) " + str(endModelInit - start))
        print("Total time to execute(secs) " + str(end - start))

    return probs[:topk]

#### Probabilities and time taken to execute with model initialized inside the function

In [22]:
guessLangFile("./data/DPR/generate_dense_embeddings.py", verbose = True)

Time to intialize guessLang model(secs) 0.6882758140563965
Total time to execute(secs) 0.8142340183258057


[('Python', 1.0),
 ('Julia', 1.2100569435347097e-09),
 ('Lua', 3.5541371928848875e-11)]

#### Probabilities and time taken to execute with model initialized outside the function

In [23]:
model = guesslang.Guess()
guessLangFile("./data/DPR/generate_dense_embeddings.py", verbose = True, model=model)

Total time to execute(secs) 0.11790800094604492


[('Python', 1.0),
 ('Julia', 1.2100569435347097e-09),
 ('Lua', 3.5541371928848875e-11)]
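The timings above come from a single call each, so they are sensitive to noise and warm-up effects. A slightly more robust measurement could repeat the call and report summary statistics, as in the sketch below. The helper name `time_call` and its parameters are assumptions for this example; it works with any zero-argument callable.

```python
import statistics
import time
from typing import Callable, Dict


def time_call(func: Callable[[], object], repeats: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Run func a few times and return min/median/max wall-clock seconds."""
    for _ in range(warmup):
        func()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


# Usage with the preloaded model from the cell above (assumes `model` and `code` exist):
# time_call(lambda: model.probabilities(code), repeats=10)
```

Using `time.perf_counter` instead of `time.time` is a deliberate choice here, since it is the clock intended for measuring short durations.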

#### 4.1.2 Using Enry on a single code file

We pick the same file, ```./data/DPR/generate_dense_embeddings.py```, which is a Python source file, and also measure the time enry takes to process it.
As enry does not provide reliable Python bindings, we use the Go code in the file ```./code/enry_singleFile.go```.

In [24]:
!go run code/enry_singleFile.go ./data/DPR/generate_dense_embeddings.py

Filename:  ./data/DPR/generate_dense_embeddings.py
Predicted language:  Python
2021/11/11 21:02:21 Total execution time(secs)  0.000210
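To use the Go program's result from Python rather than only reading it on screen, its output could be parsed as sketched below. The parser relies solely on the three-line output format shown above. The names `parse_enry_output` and `run_enry` are assumptions for this example. The timestamped line looks like output from Go's `log` package, which writes to stderr by default, so both streams are parsed.

```python
import re
import subprocess
from typing import Dict

TIME_RE = re.compile(r"Total execution time\(secs\)\s+([0-9.]+)")


def parse_enry_output(output: str) -> Dict[str, object]:
    """Extract filename, predicted language and seconds from the program output."""
    result: Dict[str, object] = {"filename": None, "language": None, "seconds": None}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Filename:"):
            result["filename"] = line[len("Filename:"):].strip()
        elif line.startswith("Predicted language:"):
            result["language"] = line[len("Predicted language:"):].strip()
        else:
            match = TIME_RE.search(line)
            if match:
                result["seconds"] = float(match.group(1))
    return result


def run_enry(filepath: str, program: str = "code/enry_singleFile.go") -> Dict[str, object]:
    """Run the Go program on one file and parse stdout and stderr together."""
    proc = subprocess.run(["go", "run", program, filepath],
                          capture_output=True, text=True, check=True)
    return parse_enry_output(proc.stdout + proc.stderr)


# Usage (requires Go and the notebook's code folder):
# run_enry("./data/DPR/generate_dense_embeddings.py")
```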


#### 4.1.3 Using Enry and Guesslang on files with different sizes
We will run enry and Guesslang on the following files with different sizes.

1. ```./data/azure-storage-net-data-movement/lib/NativeMD5.cs``` (8776 bytes, C#)
2. ```./data/brew/Library/Homebrew/utils.rb``` (15079 bytes, Ruby)
3. ```./data/brew/Library/Homebrew/tap_constants.rb``` (959 bytes, Ruby)
4. ```./data/opensource-website/jest-preprocess.js``` (169 bytes, JavaScript)
5. ```./data/DPR/dpr/models/pytext_models.py``` (4915 bytes, Python)

In [25]:
files = [
    "./data/brew/Library/Homebrew/utils.rb",
    "./data/azure-storage-net-data-movement/lib/NativeMD5.cs",
    "./data/brew/Library/Homebrew/tap_constants.rb",
    "./data/opensource-website/jest-preprocess.js",
    "./data/DPR/dpr/models/pytext_models.py",
]

for file in files:
    print("Filename: " + file)
    print(guessLangFile(file, verbose=True, model=model, topk=1))

Filename: ./data/brew/Library/Homebrew/utils.rb
Total time to execute(secs) 0.008651018142700195
[('Ruby', 1.0)]
Filename: ./data/azure-storage-net-data-movement/lib/NativeMD5.cs
Total time to execute(secs) 0.0034759044647216797
[('C#', 0.9999887943267822)]
Filename: ./data/brew/Library/Homebrew/tap_constants.rb
Total time to execute(secs) 0.002330303192138672
[('Shell', 0.3526136875152588)]
Filename: ./data/opensource-website/jest-preprocess.js
Total time to execute(secs) 0.0023140907287597656
[('JavaScript', 0.3400322496891022)]
Filename: ./data/DPR/dpr/models/pytext_models.py
Total time to execute(secs) 0.0049626827239990234
[('Python', 1.0)]


In [26]:
!go run code/enry_singleFile.go ./data/azure-storage-net-data-movement/lib/NativeMD5.cs

!go run code/enry_singleFile.go ./data/brew/Library/Homebrew/utils.rb

!go run code/enry_singleFile.go ./data/brew/Library/Homebrew/tap_constants.rb 

!go run code/enry_singleFile.go ./data/opensource-website/jest-preprocess.js

!go run code/enry_singleFile.go ./data/DPR/dpr/models/pytext_models.py

Filename:  ./data/azure-storage-net-data-movement/lib/NativeMD5.cs
Predicted language:  C#
2021/11/11 21:02:26 Total execution time(secs)  0.000215
Filename:  ./data/brew/Library/Homebrew/utils.rb
Predicted language:  Ruby
2021/11/11 21:02:27 Total execution time(secs)  0.000149
Filename:  ./data/brew/Library/Homebrew/tap_constants.rb
Predicted language:  Ruby
2021/11/11 21:02:28 Total execution time(secs)  0.000206
Filename:  ./data/opensource-website/jest-preprocess.js
Predicted language:  JavaScript
2021/11/11 21:02:28 Total execution time(secs)  0.000153
Filename:  ./data/DPR/dpr/models/pytext_models.py
Predicted language:  Python
2021/11/11 21:02:29 Total execution time(secs)  0.000174
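The predictions recorded above can be scored against the expected languages listed at the start of this subsection. The sketch below copies the labels from the two outputs by hand; the helper name `accuracy` is an assumption for this example. With only five files, the result is an illustration rather than a benchmark.

```python
from typing import Dict

# Expected language per file, from the file list in this subsection.
expected = {
    "NativeMD5.cs": "C#",
    "utils.rb": "Ruby",
    "tap_constants.rb": "Ruby",
    "jest-preprocess.js": "JavaScript",
    "pytext_models.py": "Python",
}

# Top-1 predictions copied from the recorded cell outputs.
guesslang_pred = {
    "NativeMD5.cs": "C#",
    "utils.rb": "Ruby",
    "tap_constants.rb": "Shell",
    "jest-preprocess.js": "JavaScript",
    "pytext_models.py": "Python",
}
enry_pred = {
    "NativeMD5.cs": "C#",
    "utils.rb": "Ruby",
    "tap_constants.rb": "Ruby",
    "jest-preprocess.js": "JavaScript",
    "pytext_models.py": "Python",
}


def accuracy(predicted: Dict[str, str], truth: Dict[str, str]) -> float:
    """Fraction of files whose predicted language equals the expected one."""
    correct = sum(predicted[name] == language for name, language in truth.items())
    return correct / len(truth)


print("GuessLang top-1 accuracy:", accuracy(guesslang_pred, expected))  # 0.8
print("enry top-1 accuracy:", accuracy(enry_pred, expected))            # 1.0
```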


#### 4.1.4 Observations

In this section, I present my observations from the mini-experiment above. I also present a graph of the time each solution took on the different file sizes (run on a CPU-only system, without a GPU).

![alt text](./images/executionTimeComparison.png)

We can make the following observations:
1. GuessLang's execution time is more strongly positively correlated with file size than enry's. This is likely because GuessLang relies only on the file content for prediction, whereas enry can also use other strategies, as described above.
2. GuessLang has a warm-up cost on first use: loading the model took about 0.69 s of the 0.81 s total in the first run, whereas the same call with a preloaded model took about 0.12 s.
3. GuessLang gets one of the five predictions wrong: for ```./data/brew/Library/Homebrew/tap_constants.rb``` it predicts ```Shell``` (probability 0.35) instead of ```Ruby```. Its correct prediction for ```jest-preprocess.js``` also has low confidence (0.34). These are the two smallest files (959 and 169 bytes).
4. Enry is faster than GuessLang on every file tested: roughly 0.15 to 0.22 ms per file versus 2.3 to 8.7 ms for GuessLang with a preloaded model.
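The correlation claim in the first observation can be checked directly on the recorded numbers. The sketch below computes the Pearson coefficient by hand so that it runs on the Python versions this notebook targets; the function name `pearson` and the variable names are assumptions for this example, and the timings are rounded copies of the cell outputs. Five data points are far too few for a firm conclusion, so this is only a sanity check.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# File sizes in bytes and timings in seconds, copied from the recorded outputs
# (order: utils.rb, NativeMD5.cs, tap_constants.rb, jest-preprocess.js, pytext_models.py).
file_sizes = [15079, 8776, 959, 169, 4915]
guesslang_secs = [0.008651, 0.003476, 0.002330, 0.002314, 0.004963]
enry_secs = [0.000149, 0.000215, 0.000206, 0.000153, 0.000174]

print("GuessLang r =", round(pearson(file_sizes, guesslang_secs), 2))
print("enry r      =", round(pearson(file_sizes, enry_secs), 2))
```

On these five files the coefficient is strongly positive for GuessLang and weak for enry, which is consistent with the first observation.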

### 4.2 Comparison of filtering features

Enry offers filtering features that make it possible to ignore files such as binaries, configuration, documentation, etc.
GuessLang does not provide such filtering options.

Enry provides the following filters:
1. IsBinary
2. IsVendor
3. IsConfiguration
4. IsDocumentation
5. IsDotFile
6. IsImage
7. IsTest
8. IsGenerated

These filters are useful when running an automated code repository analysis.

#### 4.2.1 Demonstrating filtering features
In this section, we show how some of enry's filtering features work.
We apply the configuration and documentation filters to the following files:

1. ```./data/DPR/conf/extractive_reader_train_cfg.yaml``` (Configuration file)
2. ```./data/opensource-website/README.md``` (Documentation file)

In [27]:
!go run ./code/enry_filter.go ./data/DPR/conf/extractive_reader_train_cfg.yaml

Filename:  ./data/DPR/conf/extractive_reader_train_cfg.yaml
Is Config?  true
Is Documentation?  false
2021/11/11 21:02:41 Total execution time(secs)  0.000156


In [28]:
!go run ./code/enry_filter.go ./data/opensource-website/README.md

Filename:  ./data/opensource-website/README.md
Is Config?  false
Is Documentation?  true
2021/11/11 21:02:42 Total execution time(secs)  0.000067


### 4.3 Command line tools

Both enry and Guesslang provide command line tools. However, the command line tools for GuessLang are limited in comparison. GuessLang can only run the analysis on a single file, whereas enry can run a full-fledged code analysis of a repo without additional coding.

#### 4.3.1 Demonstrating the command line tool for enry

In this section, we run enry's command line tool on repositories from our dataset for code analysis. Because installing the enry CLI caused some issues, I run its CLI source code directly with ```go run``` instead.

In [29]:
# On Repo Brew
start = time.time()
!(cd ./data/brew && go run ../../enry/main.go)
end = time.time()

print("Time to execute 1 enry command over 1 repo(secs) " + str(end-start))

93.46%	Ruby
2.29%	fish
1.97%	Shell
1.71%	Roff Manpage
0.50%	HTML+ERB
0.03%	Swift
0.03%	Dockerfile
0.01%	PostScript
Time to execute 1 enry command over 1 repo(secs) 1.662355899810791


In [30]:
# On Repo azure-storage-net-data-movement
!(cd ./data/azure-storage-net-data-movement && go run ../../enry/main.go)

99.80%	C#
0.14%	Batchfile
0.05%	PowerShell


In [31]:
# On Repo laravel
!(cd ./data/laravel && go run ../../enry/main.go)

81.71%	PHP
17.73%	Blade
0.56%	JavaScript
0.00%	CSS


In [32]:
# On Repo oniguruma
!(cd ./data/oniguruma && go run ../../enry/main.go)

94.22%	C
2.41%	Shell
1.85%	Python
0.65%	HTML
0.38%	CMake
0.32%	Makefile
0.10%	M4Sugar
0.04%	C++
0.02%	Batchfile


In [33]:
# On Repo opensource-website
!(cd ./data/opensource-website && go run ../../enry/main.go)

85.16%	JavaScript
14.34%	SCSS
0.49%	CSS


#### 4.3.2 Demonstrating the command line tool for GuessLang

GuessLang CLI cannot perform repository analysis like enry. It can be used to perform programming language identification per file.

In [34]:
start = time.time()
!guesslang ./data/DPR/generate_dense_embeddings.py
!guesslang ./data/brew/Library/Homebrew/brew.rb
!guesslang ./data/brew/Library/Homebrew/tap_constants.rb
end = time.time()

print("Time to execute 3 GuessLang CLI commands(secs) " + str(end-start))

Programming language: Python
Programming language: Ruby
Programming language: Shell
Time to execute 3 GuessLang CLI commands(secs) 15.340967178344727


#### 4.3.3 Observations

We can observe the following about the enry and GuessLang command line tools:
1. The GuessLang CLI tool's functionality is fairly limited in comparison to enry's.
2. Enry's CLI tool is much faster: analysing the whole brew repository took about 1.7 s, whereas three single-file GuessLang CLI calls took about 15.3 s. This might be because GuessLang's CLI tool has to load the model every time the command is called.
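Both limitations could be worked around from Python by loading the model once and aggregating per-file predictions into a per-repository breakdown. The sketch below takes the predictor as a parameter so it is not tied to one library. The function name `language_breakdown` is an assumption for this example, and weighting each file by its size in bytes is also an assumption that may not match how enry computes its percentages.

```python
import os
from collections import Counter
from typing import Callable, Dict, Optional


def language_breakdown(root: str,
                       predict: Callable[[str], Optional[str]]) -> Dict[str, float]:
    """Walk root, predict a language per text file, and return byte-weighted percentages."""
    sizes: Counter = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories such as .git by pruning them in place.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8") as handle:
                    code = handle.read()
            except (UnicodeDecodeError, OSError):
                continue  # binary or unreadable files are ignored
            language = predict(code)
            if language is not None:
                sizes[language] += len(code.encode("utf-8"))
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {language: 100.0 * size / total for language, size in sizes.most_common()}


# Usage with GuessLang, loading the model a single time:
# model = guesslang.Guess()
# language_breakdown("./data/brew", model.language_name)
```

Unlike enry's analysis, this sketch applies no vendored, generated, or documentation filtering, so its percentages are not directly comparable to the enry output above.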


### 4.4 Difference between content-based classifiers

Enry uses a naive Bayes classifier for its content-based classification, whereas GuessLang uses a TensorFlow model that combines a deep neural network with a linear classifier.
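To illustrate what a naive Bayes content classifier does, the following toy example trains on two tiny snippets and scores new code by summing log-probabilities of its tokens with add-one smoothing. It is a sketch only: the tokenizer, the class name `NaiveBayes`, and the training snippets are assumptions made for this example and are not enry's tokenizer, training data, or implementation.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Identifiers/keywords become one token each; punctuation is tokenized per character.
# Digits and whitespace are dropped.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(code: str) -> List[str]:
    return TOKEN_RE.findall(code)


class NaiveBayes:
    """Multinomial naive Bayes over code tokens with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.token_counts: Dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        self.vocabulary = set()

    def fit(self, samples: List[Tuple[str, str]]) -> None:
        for language, code in samples:
            self.doc_counts[language] += 1
            for token in tokenize(code):
                self.token_counts[language][token] += 1
                self.vocabulary.add(token)

    def predict(self, code: str) -> str:
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = tokenize(code)
        scores = {}
        for language, docs in self.doc_counts.items():
            counts = self.token_counts[language]
            total_tokens = sum(counts.values())
            score = math.log(docs / total_docs)  # log prior
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            scores[language] = score
        return max(scores, key=scores.get)


# Toy training data, invented for this illustration.
classifier = NaiveBayes()
classifier.fit([
    ("Python", "def add(a, b):\n    return a + b\nimport os\nprint(add(1, 2))"),
    ("Ruby", "def add(a, b)\n  a + b\nend\nrequire 'json'\nputs add(1, 2)"),
])
print(classifier.predict("import sys\nprint(sys.argv)"))
print(classifier.predict("require 'set'\nputs 1\nend"))
```

The model has no notion of token order or context, which is part of why additional signals such as file names and extensions matter for a classifier of this kind.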

## 5. Summary

We can observe that enry is faster than GuessLang. Although enry uses a simpler classifier model, it performs well because of additional strategies such as using the filename, extension, shebang, modeline, etc. However, its performance might suffer if only a code snippet is available. Enry also provides more functionality, such as filtering and repository code analysis.

GuessLang is a simpler solution which only provides content-based classification using a neural network. Because it was developed for situations where only code snippets are available, it performs well under those conditions. It is reported to have 93.45% accuracy over a test dataset of 230,000 source code files.