**Note:** the first two code cells take several hours to execute, so avoid re-running them.

# Main task

Preprocess each project separately to turn it into a dataset for the trained model:

In [None]:
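import os
import subprocess

# Run the preprocessing script once for every project directory in sources/.
# Passing the arguments as a list avoids shell quoting problems in project names.
for project_name in os.listdir('sources'):
    subprocess.run(["bash", "preprocess.sh", project_name])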

To calculate the quality metrics, evaluate the trained model, passing each preprocessed project as test data:

In [8]:
import csv
import os
import subprocess

result_file = "quality_metrics.csv"
header = ["Project", "F1", "Precision", "Recall"]
# Write only the header row; the file is closed automatically when the with block ends.
with open(result_file, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=header)
    writer.writeheader()
# Evaluate the trained model on the test set of each preprocessed project.
for project_name in os.listdir('sources'):
    subprocess.run(["python3", "code2seq.py", "--project", project_name,
                    "--load", "models/java-large-model/model_iter52.release",
                    "--test", "data/%s/%s.test.c2s" % (project_name, project_name)])

Let's look at the projects where the recognition results were the best and the worst:

In [9]:
import pandas as pd

data = pd.read_csv(result_file)
data = data.sort_values(by="F1", ascending=False)
data

Unnamed: 0,Project,F1,Precision,Recall
80,Devlight__InfiniteCycleViewPager,0.907839,0.915598,0.900210
89,Tencent__MSEC,0.904964,0.910945,0.899061
71,Trinea__android-auto-scroll-view-pager,0.878378,0.902778,0.855263
72,daimajia__NumberProgressBar,0.867925,0.877863,0.858209
48,hugeterry__CoordinatorTabLayout,0.855072,0.867647,0.842857
...,...,...,...,...
60,square__keywhiz,0.396026,0.422675,0.372539
57,libgdx__packr,0.389381,0.448980,0.343750
32,EnterpriseQualityCoding__FizzBuzzEnterpriseEdi...,0.379221,0.392473,0.366834
77,rest-assured__rest-assured,0.327217,0.437792,0.261236
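
The three metric columns are not independent. Assuming F1 here is the usual harmonic mean of precision and recall (the notebook does not state the formula, but the rows above are consistent with it), it can be recomputed from the other two columns. A minimal sketch using two rows copied from the table; the helper `f1_score` is introduced only for illustration:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (project, reported F1, precision, recall), copied from the table above
rows = [
    ("Devlight__InfiniteCycleViewPager", 0.907839, 0.915598, 0.900210),
    ("rest-assured__rest-assured", 0.327217, 0.437792, 0.261236),
]

for project, reported_f1, precision, recall in rows:
    recomputed = f1_score(precision, recall)
    print(project, "reported", reported_f1, "recomputed", round(recomputed, 6))
```

Because a harmonic mean is pulled towards the smaller of its two inputs, a project such as rest-assured, whose recall is much lower than its precision, ends up with an F1 closer to the recall.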


Now compute the mean and the spread (standard deviation) of each quality metric:

In [10]:
import numpy as np

def get_mean_and_spread(metric):
    print(metric+":", "mean", np.mean(data[metric]), "spread", np.std(data[metric]))

get_mean_and_spread("F1")
get_mean_and_spread("Precision")
get_mean_and_spread("Recall")

F1: mean 0.6487602105263157 spread 0.13753456932061955
Precision: mean 0.6802209578947369 spread 0.12633969306886522
Recall: mean 0.6226689368421051 spread 0.14871029515239287
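
One detail worth keeping in mind when reading the spread: `np.std` uses the population standard deviation (`ddof=0`) by default, while pandas' `Series.std` uses the sample standard deviation (`ddof=1`), so the two give slightly different numbers for the same column. A small sketch on the five best F1 scores copied from the sorted table (not the full dataset, so the values differ from the output above):

```python
import numpy as np
import pandas as pd

# F1 of the five best projects, copied from the sorted table
f1 = pd.Series([0.907839, 0.904964, 0.878378, 0.867925, 0.855072], name="F1")

population_std = np.std(f1.to_numpy())  # ddof=0, the convention used by np.std
sample_std = f1.std()                   # ddof=1, the pandas default

print("mean", f1.mean())
print("population std (ddof=0)", population_std)
print("sample std (ddof=1)", sample_std)
```

With many projects the difference is small, but it is noticeable for short samples like this one.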


# Additional research
Let's look for features that are common to the projects with the **best** recognition results:

In [11]:
import os
import numpy as np

def get_size(start_path):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)

    return total_size

def get_number_of_methods(project_name):
    # GetPaths.sh produces <project_name>.test.raw.txt; each line is counted as one method
    os.system("bash GetPaths.sh "+project_name)
    with open(project_name+".test.raw.txt", "r") as f:
        return sum(1 for _ in f)

def get_mean_size_of_tree(project_name):
    # Relies on the file created by get_number_of_methods();
    # the length of a line is used as a proxy for the size of the method's syntax tree
    with open(project_name+".test.raw.txt", "r") as f:
        return np.mean([len(line) for line in f])

def get_projects_info(project_names):
    rows = []
    for project_name in project_names:
        rows.append({
           "Project": project_name,
           "Size": get_size("sources/"+project_name),
           "Number of methods": get_number_of_methods(project_name),
           "Mean size of tree" : get_mean_size_of_tree(project_name)
        })
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
    return pd.DataFrame(rows, columns=["Project", "Size", "Number of methods", "Mean size of tree"])

best_project_names = data.head(5)["Project"].tolist()
get_projects_info(best_project_names)

Unnamed: 0,Project,Size,Number of methods,Mean size of tree
0,Devlight__InfiniteCycleViewPager,189661,294,9012.132653
1,Tencent__MSEC,5911724,12141,4359.744337
2,Trinea__android-auto-scroll-view-pager,14764,25,3427.6
3,daimajia__NumberProgressBar,18650,43,4888.674419
4,hugeterry__CoordinatorTabLayout,25449,47,7749.595745


The same features for the projects with the **worst** recognition results:

In [12]:
worst_project_names = data.tail(5)["Project"].tolist()
get_projects_info(worst_project_names)

Unnamed: 0,Project,Size,Number of methods,Mean size of tree
0,square__keywhiz,1076135,1301,11406.160646
1,libgdx__packr,29681,28,19566.642857
2,EnterpriseQualityCoding__FizzBuzzEnterpriseEdi...,73804,71,4603.323944
3,rest-assured__rest-assured,2326817,3464,5281.786952
4,socketio__socket.io-client-java,146636,377,9351.6313


As we can see, the recognition results do not appear to depend on the size of the project or on the number of methods in its code. The complexity of the methods seems to matter more: this can be seen from the mean size of the syntax trees, which is largest for some of the worst projects.

Reading the descriptions of these projects on GitHub suggests that the model works better on projects where the GUI is the main part (there are many similar patterns describing the behaviour of graphical elements).
The projects with the worst recognition results are mostly back-end-focused applications with more complex purposes.
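
Comparing five best and five worst projects by eye makes it hard to judge how strong these dependencies are. One possible check, sketched here as an assumption and not as part of the original analysis, is a rank correlation between F1 and each feature. The sketch uses only the nine projects for which both an F1 value and the features are shown in the tables above; for a real check, `get_projects_info` would have to be run over all projects.

```python
import pandas as pd

# (Project, F1, Size, Number of methods, Mean size of tree), copied from the tables above
rows = [
    ("Devlight__InfiniteCycleViewPager", 0.907839, 189661, 294, 9012.132653),
    ("Tencent__MSEC", 0.904964, 5911724, 12141, 4359.744337),
    ("Trinea__android-auto-scroll-view-pager", 0.878378, 14764, 25, 3427.6),
    ("daimajia__NumberProgressBar", 0.867925, 18650, 43, 4888.674419),
    ("hugeterry__CoordinatorTabLayout", 0.855072, 25449, 47, 7749.595745),
    ("square__keywhiz", 0.396026, 1076135, 1301, 11406.160646),
    ("libgdx__packr", 0.389381, 29681, 28, 19566.642857),
    ("EnterpriseQualityCoding__FizzBuzzEnterpriseEdi...", 0.379221, 73804, 71, 4603.323944),
    ("rest-assured__rest-assured", 0.327217, 2326817, 3464, 5281.786952),
]
features = ["Size", "Number of methods", "Mean size of tree"]
df = pd.DataFrame(rows, columns=["Project", "F1"] + features)

# Spearman rank correlation, computed as the Pearson correlation of the ranks
ranks = df[["F1"] + features].rank()
rank_corr = ranks.corr()["F1"].drop("F1")
print(rank_corr)
```

Values close to zero would support the claim that a feature has no influence, while a clearly negative value for the mean size of tree would support the claim about method complexity. With so few projects, any such number is only a rough indication.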