# Choosing 

Before deciding which frameworks to test, I considered the following candidates:

Linguist (GitHub), Enry, Pygments, Guesslang (Microsoft/VS Code)

Linguist is used by GitHub to detect the language of the files in a repository. It is written in Ruby, and I found that loading the Linguist gem takes a significant amount of time (around 2 seconds on a modern Ryzen 7 processor with an NVMe SSD). It uses a number of different strategies to build a list of candidate languages, which it then feeds into a Bayesian classifier to make the final prediction. Enry is a Go port of Linguist that works virtually the same way but runs faster. Pygments is primarily a syntax highlighter, but it can also guess the language of source code. After trying it on a small portion of my test data, I observed that its prediction was wrong most of the time, so I did not consider it any further. Lastly, I tested Guesslang, which is used in VS Code to automatically predict the language of a file. It uses a neural network to make the prediction, and my initial tests looked very promising.

In conclusion, I chose Enry over Linguist because of its performance advantage, so the comparison is between <b>Enry and Guesslang</b>.

## Imports and helper functions

In [1]:
from wrappers.guesslang_wrapper import detect_guesslang
from wrappers.enry_wrapper import detect_enry

import os
from collections import defaultdict

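# Collect the file paths per language. Each subdirectory of data_path
# is named after the language of the files it contains.
paths = defaultdict(list)
data_path = "./test_data/"

for lang_dir in os.listdir(data_path):
    for file in os.listdir(os.path.join(data_path, lang_dir)):
        paths[lang_dir].append(os.path.join(data_path, lang_dir, file))

# Flatten into two parallel lists: ground_truth[i] is the label of paths_flat[i].
ground_truth = []
paths_flat = []

for k in paths:
    for p in paths[k]:
        ground_truth.append(k)
        paths_flat.append(p)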

def accuracy(preds, gtruth):
    """Return the share of predictions matching the ground truth, as a display string."""
    wrongly_classified_idcs = []

    for idx, t in enumerate(gtruth):
        if t != preds[idx]:
            wrongly_classified_idcs.append(idx)

    return "Accuracy: " + str(1 - len(wrongly_classified_idcs) / len(gtruth)) 

While comparing the two frameworks, we consider two use cases. In the first, the source files are passed in with their file extensions intact:

In [2]:
# Make predictions with Enry
preds_enry = [p.decode("utf-8") for p in detect_enry(paths_flat, no_ext=False)]
accuracy(preds_enry, ground_truth)

Enry ran in 2 seconds.


'Accuracy: 0.9938938618925831'

In [3]:
# Make predictions with Guesslang
preds_guesslang = detect_guesslang(paths_flat)
accuracy(preds_guesslang, ground_truth)

100%|██████████| 31280/31280 [00:53<00:00, 587.98it/s]

Guesslang ran in 54 seconds.





'Accuracy: 0.8032608695652174'

We can see that Enry achieves a much higher accuracy. This is because it applies a sequence of matching strategies, such as reading the file extension from the filename or extracting the interpreter from a shebang, to narrow down the number of possible matches before falling back on the Bayesian classifier. This greatly increases both accuracy and speed in this particular use case. Guesslang instead uses a neural network to predict one of 50 possible classes, using only the source code as its input features.
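To make the idea of a strategy cascade concrete, here is a minimal sketch of how candidates can be narrowed step by step. It is not Enry's actual code or strategy order: the lookup tables `EXTENSIONS`, `INTERPRETERS`, and `KEYWORDS`, the function names, and the keyword-counting fallback (a crude stand-in for a real Bayesian classifier) are all assumptions made for illustration.

```python
import os

# Hypothetical lookup tables for illustration; real tools ship far larger ones.
EXTENSIONS = {".py": ["Python"], ".rb": ["Ruby"], ".h": ["C", "C++", "Objective-C"]}
INTERPRETERS = {"python": "Python", "python3": "Python", "ruby": "Ruby"}
KEYWORDS = {
    "Python": ["def ", "import "],
    "Ruby": ["puts ", "end"],
    "C": ["#include", "int main"],
    "C++": ["std::", "template"],
    "Objective-C": ["@interface", "@end"],
}


def by_extension(filename, content, candidates):
    """Narrow the candidates using the file extension, if it is known."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext, candidates)


def by_shebang(filename, content, candidates):
    """Narrow the candidates using the interpreter named in a shebang line."""
    first_line = content.split("\n", 1)[0]
    if not first_line.startswith("#!"):
        return candidates
    parts = first_line[2:].split()
    if not parts:
        return candidates
    interpreter = os.path.basename(parts[0])
    # "#!/usr/bin/env python3" names the interpreter in the second token.
    if interpreter == "env" and len(parts) > 1:
        interpreter = parts[1]
    language = INTERPRETERS.get(interpreter)
    return [language] if language else candidates


def fallback_classifier(content, candidates):
    """Stand-in for a statistical classifier: pick the best keyword score."""
    scores = {
        lang: sum(content.count(kw) for kw in KEYWORDS.get(lang, []))
        for lang in candidates
    }
    return max(candidates, key=lambda lang: scores[lang])


def detect(filename, content, all_languages):
    """Apply cheap strategies first; classify only if several candidates remain."""
    candidates = list(all_languages)
    for strategy in (by_extension, by_shebang):
        candidates = strategy(filename, content, candidates)
        if len(candidates) == 1:
            return candidates[0]
    return fallback_classifier(content, candidates)
```

With a known, unambiguous extension the cascade returns immediately and never touches the file content, which is consistent with the speed difference observed above. Only ambiguous or missing extensions reach the more expensive classification step.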

<h1>TODO: Add further comparisons: a confusion matrix, and an analysis of randomly selected misclassified files</h1>
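The comparisons listed in the note above could start from a sketch like the following. It uses only the standard library; the helper names `confusion_counts`, `most_confused`, and `sample_misclassified` are hypothetical, and the inputs are assumed to be parallel lists like `paths_flat`, `preds_guesslang`, and `ground_truth`.

```python
import random
from collections import Counter


def confusion_counts(preds, gtruth):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(gtruth, preds))


def most_confused(preds, gtruth, top=5):
    """Return the most frequent wrong (true label, predicted label) pairs."""
    counts = confusion_counts(preds, gtruth)
    errors = Counter({pair: n for pair, n in counts.items() if pair[0] != pair[1]})
    return errors.most_common(top)


def sample_misclassified(paths, preds, gtruth, k=5, seed=0):
    """Randomly pick up to k (path, true label, predicted label) errors to inspect."""
    wrong = [(p, t, y) for p, t, y in zip(paths, gtruth, preds) if t != y]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(wrong, min(k, len(wrong)))
```

For example, `most_confused(preds_guesslang, ground_truth)` would show which language pairs Guesslang mixes up most often, and `sample_misclassified(paths_flat, preds_guesslang, ground_truth)` would return a handful of files to open and inspect by hand.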

For the second use case, we assume that only the source code is available, without a file extension. Since Guesslang does not consider any features other than the code itself, we only rerun the Enry classifier, this time with the <b>no-ext</b> flag.

In [4]:
preds_enry_no_ext = [p.decode("utf-8") for p in detect_enry(paths_flat, no_ext=True)]
accuracy(preds_enry_no_ext, ground_truth)

Enry ran in 235 seconds.


'Accuracy: 0.4943734015345268'

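The accuracy, and especially the execution speed, of the Enry classifier suffer considerably when no file extension is available: accuracy drops from about 99.4% to 49.4%, and the runtime rises from 2 to 235 seconds. In this setting Guesslang, at about 80.3% accuracy in 54 seconds, is both more accurate and faster.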
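To compare accuracy and runtime side by side for any detector, a small helper along these lines could be used. This is a sketch, not part of the notebook: the name `evaluate` is hypothetical, and it assumes the detector is a callable that takes a list of paths and returns one predicted label per path.

```python
import time


def evaluate(detector, paths, gtruth):
    """Run a detector over all paths and return (accuracy, elapsed seconds)."""
    start = time.perf_counter()
    preds = list(detector(paths))
    elapsed = time.perf_counter() - start

    if len(preds) != len(gtruth):
        raise ValueError("detector must return one prediction per path")

    correct = sum(1 for pred, truth in zip(preds, gtruth) if pred == truth)
    return correct / len(gtruth), elapsed


# Possible usage with the wrappers from this notebook:
# evaluate(lambda ps: [p.decode("utf-8") for p in detect_enry(ps, no_ext=True)],
#          paths_flat, ground_truth)
```

Returning the accuracy as a float rather than a formatted string makes it easier to tabulate or plot the results of several runs.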