<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dataset-Loading-and-Exploration" data-toc-modified-id="Dataset-Loading-and-Exploration-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dataset Loading and Exploration</a></span><ul class="toc-item"><li><span><a href="#Declutter-dataset" data-toc-modified-id="Declutter-dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Declutter dataset</a></span></li><li><span><a href="#Extracing-more-comments-from-the-JabRef-source-code" data-toc-modified-id="Extracing-more-comments-from-the-JabRef-source-code-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Extracting more comments from the JabRef source code</a></span><ul class="toc-item"><li><span><a href="#Code-to-extract-comments-from-JabRef-Source-code-along-with-their-classes" data-toc-modified-id="Code-to-extract-comments-from-JabRef-Source-code-along-with-their-classes-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Code to extract comments from JabRef Source code along with their classes</a></span><ul class="toc-item"><li><span><a href="#Helper-functions" data-toc-modified-id="Helper-functions-1.2.1.1"><span class="toc-item-num">1.2.1.1&nbsp;&nbsp;</span>Helper functions</a></span></li></ul></li></ul></li><li><span><a href="#Supplementary-Dataset-(from-Ref2:-Classifying-code-comments-in-Java)" data-toc-modified-id="Supplementary-Dataset-(from-Ref2:-Classifying-code-comments-in-Java)-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Supplementary Dataset (from Ref2: Classifying code comments in Java)</a></span></li></ul></li><li><span><a href="#Data-Merging-and-Cleaning" data-toc-modified-id="Data-Merging-and-Cleaning-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Merging and Cleaning</a></span></li><li><span><a href="#Training-a-word-embedding-on-the-sample-comments" data-toc-modified-id="Training-a-word-embedding-on-the-sample-comments-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Training a word embedding on the sample comments</a></span></li><li><span><a href="#References" data-toc-modified-id="References-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>References</a></span></li></ul></div>

# Dataset Loading and Exploration

In [1]:
import glob
import pandas as pd
from datetime import datetime as dt
from tqdm import tqdm

In [2]:
import os
os.listdir()

['.ipynb_checkpoints',
 'CS515-Declutter.ipynb',
 'declutter-gold_DevelopmentSet.csv',
 'gitJabRef',
 'JavaCommentsClassification',
 'JavaCommentsClassification.tar']

## Declutter dataset

We first explore the declutter dataset, a skeletal dataset that contains little more than each comment, its type, a link to its source file with the line number, and its label. It consists of 1050 rows.

In [46]:
declutter = pd.read_csv("declutter-gold_DevelopmentSet_orig.csv")
print(len(declutter))
declutter.head()

1050


| Unnamed: 0 | ID | type | path_to_file | begin_line | link_to_comment | comment | non-information |
|---|---|---|---|---|---|---|---|
| 0 | FR587 | Javadoc | https://github.com/nnovielli/jabref/blob/maste... | 151 | https://github.com/nnovielli/jabref/blob/maste... | @implNote taken from {@link com.sun.javafx.sce... | yes |
| 1 | FR974 | Line | https://github.com/nnovielli/jabref/blob/maste... | 95 | https://github.com/nnovielli/jabref/blob/maste... | icon.setToolTipText(printedViewModel.getLocali... | yes |
| 2 | FR1359 | Line | https://github.com/nnovielli/jabref/blob/maste... | 45 | https://github.com/nnovielli/jabref/blob/maste... | Synchronize changes of the underlying date val... | no |
| 3 | FR30 | Javadoc | https://github.com/nnovielli/jabref/blob/maste... | 1102 | https://github.com/nnovielli/jabref/blob/maste... | Ask if the user really wants to close the give... | yes |
| 4 | FR774 | Block | https://github.com/nnovielli/jabref/blob/maste... | 227 | https://github.com/nnovielli/jabref/blob/maste... | css: information * | no |


## Extracting more comments from the JabRef source code

We expect to learn more about the comments if we can also extract related information, such as whether a comment documents a specific class.

In [47]:
allfiles = glob.glob(f'./gitjabref/**/*.java', recursive=True)
print(f'Read {len(allfiles)} java source files from jabref repo. \nFirst 2 {allfiles[:2]}')


Read 1524 java source files from jabref repo. 
First 2 ['./gitjabref\\jabref\\src\\jmh\\java\\org\\jabref\\benchmarks\\Benchmarks.java', './gitjabref\\jabref\\src\\main\\java\\module-info.java']


In [48]:
srcfiles = glob.glob(r'.\gitjabref\jabref\src/**/*.java', recursive=True)  # raw string keeps the backslashes literal
print(f'Read {len(srcfiles)} java source files from jabref source folder.')
srcfiles[:2]

Read 1524 java source files from jabref source folder.


['.\\gitjabref\\jabref\\src\\jmh\\java\\org\\jabref\\benchmarks\\Benchmarks.java',
 '.\\gitjabref\\jabref\\src\\main\\java\\module-info.java']

Both searches return 1524 files, so there are no Java source files outside the src folder.
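Equal counts settle the question only because the second search pattern is a subset of the first. A minimal sketch of a direct check follows, assuming path lists shaped like `allfiles` from the cells above. The helper name `outside_root` is hypothetical, and the comparison ignores case on the assumption that the paths come from a Windows file system, as the printed output suggests.

```python
from pathlib import PureWindowsPath


def _parts(path):
    """Split a Windows-style path into lower-cased components."""
    return tuple(part.lower() for part in PureWindowsPath(path).parts)


def outside_root(paths, root):
    """Return the paths that do not lie below root (hypothetical helper)."""
    root_parts = _parts(root)
    return [p for p in paths if _parts(p)[:len(root_parts)] != root_parts]


# Sample paths in the style printed above; a real call would pass allfiles.
sample = [
    './gitjabref\\jabref\\src\\jmh\\java\\org\\jabref\\benchmarks\\Benchmarks.java',
    './gitjabref\\jabref\\src\\main\\java\\module-info.java',
]
print(outside_root(sample, './gitjabref/jabref/src'))
```

An empty list means every file sits below the chosen root.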

In [49]:
logic = glob.glob(r'D:\Temp\JavaCommentsClassification.tar\gitJabRef\jabref\src\main\java\org\jabref\logic/**/*.java', recursive=True)  # raw string keeps the backslashes literal
print(f'Read {len(logic)} java source files from jabref logic folder.')
logic[:2]

Read 450 java source files from jabref logic folder.


['D:\\Temp\\JavaCommentsClassification.tar\\gitJabRef\\jabref\\src\\main\\java\\org\\jabref\\logic\\TypedBibEntry.java',
 'D:\\Temp\\JavaCommentsClassification.tar\\gitJabRef\\jabref\\src\\main\\java\\org\\jabref\\logic\\autosaveandbackup\\AutosaveManager.java']

We can see that 450 of the source code files are in the logic folder.
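To see how those files spread over the subfolders of a folder, one option is to tally each path by the first directory below the root. The sketch below is only an illustration: `count_by_subfolder` is a hypothetical helper, it assumes every path really lies below `root`, and the two sample paths are the ones printed above.

```python
from collections import Counter
from pathlib import PureWindowsPath


def count_by_subfolder(paths, root):
    """Tally paths by the first folder below root (hypothetical helper)."""
    root_len = len(PureWindowsPath(root).parts)
    counts = Counter()
    for p in paths:
        rest = PureWindowsPath(p).parts[root_len:]
        # Files directly inside root have no subfolder; group them under '.'.
        counts[rest[0] if len(rest) > 1 else '.'] += 1
    return counts


root = r'D:\Temp\JavaCommentsClassification.tar\gitJabRef\jabref\src\main\java\org\jabref\logic'
sample = [
    root + r'\TypedBibEntry.java',
    root + r'\autosaveandbackup\AutosaveManager.java',
]
counts = count_by_subfolder(sample, root)
print(counts)
```

Passing the full `logic` list instead of `sample` would give one count per subfolder, which could then feed a bar plot.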

In [50]:
li = ['cat','dog','boy']
li.index('dog')

1

### Code to extract comments from JabRef Source code along with their classes

In [41]:
start = dt.now()
print(start)

# Trial run on the first file of the logic folder only.
for file in logic[:1]:
    linenos = []    # indices of lines that open or close a block comment
    classname = []  # index of the line right after each comment end
    with open(file) as reader:
        lines = reader.readlines()
    for counter, line in enumerate(lines):
        if "/*" in line:
            linenos.append(counter)
        if "*/" in line:
            linenos.append(counter)
            # The line after the comment is usually the declaration it documents.
            classname.append(counter + 1)

print(f'Source File: {file}')
print(f'Comments start and end line numbers: {linenos}')
for n in linenos:
    print(lines[n])

for n in classname:
    print(lines[n])
print(f'time taken:{dt.now()-start}')

2020-05-07 13:48:37.808850
Source File: D:\Temp\JavaCommentsClassification.tar\gitJabRef\jabref\src\main\java\org\jabref\logic\TypedBibEntry.java
Comments start and end line numbers: [12, 14, 35, 39, 49, 51]
/**

 */

    /**

     */

    /**

     */

public class TypedBibEntry {

    public boolean hasAllRequiredFields(BibEntryTypesManager entryTypesManager) {

    public String getTypeForDisplay() {

time taken:0:00:00.005943
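The cell above records only line indices. A possible next step, sketched here under simplifying assumptions, is to keep the comment text together with the first non-blank line that follows it. The function name `block_comments_with_context` is hypothetical, and the scan ignores comment markers inside string literals and several comments on one line. The sample input is an abbreviated version of the Javadoc at the top of TypedBibEntry.java.

```python
def block_comments_with_context(lines):
    """Pair each /* ... */ comment with the next non-blank line after it."""
    results = []
    start = None  # index of the line that opened the current comment
    for i, line in enumerate(lines):
        if start is None and "/*" in line:
            start = i
        if start is not None and "*/" in line:
            text = "".join(lines[start:i + 1])
            # The documented declaration is usually the next non-blank line.
            following = next((l.strip() for l in lines[i + 1:] if l.strip()), "")
            results.append((start, i, text, following))
            start = None
    return results


sample = [
    "/**\n",
    " * Wrapper around a {@link BibEntry}\n",
    " */\n",
    "public class TypedBibEntry {\n",
]
for first, last, text, following in block_comments_with_context(sample):
    print(first, last, following)
```

Each tuple holds the start index, the end index, the raw comment text and the line that follows the comment.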


#### Helper functions

In [42]:
def print_lineNos(files):
    """Print every line of each file, prefixed with its zero-based index."""
    start = dt.now()
    print(start)
    for file in files:
        print(f'\n\n{file}:\n\n')
        with open(file) as reader:
            for number, line in enumerate(reader):
                print(number, line.strip())
    print(f'time taken:{dt.now()-start}')

In [12]:
print_lineNos(logic[:1])

2020-05-06 13:45:58.321244


D:\Temp\JavaCommentsClassification.tar\gitJabRef\jabref\src\main\java\org\jabref\logic\TypedBibEntry.java:


0 package org.jabref.logic;
1 
2 import java.util.Objects;
3 import java.util.Optional;
4 
5 import org.jabref.model.database.BibDatabase;
6 import org.jabref.model.database.BibDatabaseContext;
7 import org.jabref.model.database.BibDatabaseMode;
8 import org.jabref.model.entry.BibEntry;
9 import org.jabref.model.entry.BibEntryType;
10 import org.jabref.model.entry.BibEntryTypesManager;
11 
12 /**
13 * Wrapper around a {@link BibEntry} offering methods for {@link BibDatabaseMode} dependend results
14 */
15 public class TypedBibEntry {
16 
17 private final BibEntry entry;
18 private final Optional<BibDatabase> database;
19 private final BibDatabaseMode mode;
20 
21 public TypedBibEntry(BibEntry entry, BibDatabaseMode mode) {
22 this(entry, Optional.empty(), mode);
23 }
24 
25 private TypedBibEntry(BibEntry entry, Optional<BibDatabase> database, BibData

In [13]:
import matplotlib.pyplot as plt #maybe plot number of source files per directory

[Count plot ref](https://stackoverflow.com/questions/2632205/how-to-count-the-number-of-files-in-a-directory-using-python)

In [14]:
counts = {}
for x in os.listdir():
    if os.path.isdir(x):
        # Count the .java files anywhere below this directory.
        counts[x] = sum(
            name.endswith(".java")
            for _, _, names in os.walk(x)
            for name in names
        )
counts

## Supplementary Dataset (from Ref2: Classifying code comments in Java)


We write a small class to read the comments into, if required.

In [66]:
class Comment:
    """Container for one extracted comment."""
    def __init__(self, id=0, path="", text=""):
        self.id = id
        self.path = path
        self.text = text
    def __repr__(self):
        return f'Comment object with id={self.id},path = {self.path}, and text = {self.text}'

In [67]:
Comment(1003,text = "/TODO autogen block",path=".")

Comment object with id=1003,path = ., and text = /TODO autogen block

Exploring the dataset

In [73]:
automatic_ide_comments_path = "D:/Temp/JavaCommentsClassification.tar/JavaCommentsClassification/List of comments/cat/automatic generated by ide" 
os.chdir(automatic_ide_comments_path)

In [74]:
idecomments = []
counter = 0
total_files = len(os.listdir())
for file in os.listdir():
    if os.path.isfile(file):
        with open(file) as reader:
            text = reader.read().replace("\n"," ")
            #commentsdict[file.split(".txt")[0]] = 
            print(f'{counter} out of {total_files} processed',file,text)
            idecomments.append(str(text))
    counter += 1
        

0 out of 205 processed 1003.txt // TODO Auto-generated catch block
1 out of 205 processed 10040.txt      * (non-Javadoc)
2 out of 205 processed 10041.txt      * (non-Javadoc)
3 out of 205 processed 10042.txt      * (non-Javadoc)
4 out of 205 processed 10043.txt      * (non-Javadoc)
5 out of 205 processed 10044.txt      * (non-Javadoc)
6 out of 205 processed 10096.txt     /* (non-Javadoc)
7 out of 205 processed 10115.txt 		/* (non-Javadoc)
8 out of 205 processed 10116.txt 		/* (non-Javadoc)
9 out of 205 processed 10117.txt 		/* (non-Javadoc)
10 out of 205 processed 10118.txt 		/* (non-Javadoc)
11 out of 205 processed 10119.txt 		/* (non-Javadoc)
12 out of 205 processed 10120.txt 		/* (non-Javadoc)
13 out of 205 processed 10288.txt 	/* (non-Javadoc)
14 out of 205 processed 10289.txt 	/* (non-Javadoc)
15 out of 205 processed 10290.txt 				// TODO Auto-generated method stub
16 out of 205 processed 10291.txt 				// TODO Auto-generated method stub
17 out of 205 processed 10384.txt 	/* (non-J

200 out of 205 processed 9736.txt 	/* (non-Javadoc)
201 out of 205 processed 980.txt // TODO Auto-generated method stub
202 out of 205 processed 9835.txt     /* (non-Javadoc)
203 out of 205 processed 9836.txt     /* (non-Javadoc)
204 out of 205 processed 9837.txt     /* (non-Javadoc)


In [26]:
idecomments

['// TODO Auto-generated catch block',
 '     * (non-Javadoc)',
 '     * (non-Javadoc)',
 '     * (non-Javadoc)',
 '     * (non-Javadoc)']

In [69]:
turtle_comments_path = "D:/Temp/JavaCommentsClassification.tar/JavaCommentsClassification/List of comments/cat/turtle"
os.chdir(turtle_comments_path)

In [70]:
turtlecomments = []
counter = 0
total_files = len(os.listdir())
for file in os.listdir():
    if os.path.isfile(file):
        with open(file) as reader:
            text = reader.read().replace("\n"," ")
            #commentsdict[file.split(".txt")[0]] = 
            print(f'{counter} out of {total_files} processed',file,text)
            turtlecomments.append(str(text))
    counter += 1
        

0 out of 20 processed 1419.txt   // Ditto
1 out of 20 processed 1802.txt // Nothing
2 out of 20 processed 1968.txt #DEPTH_*
3 out of 20 processed 2570.txt /**  * COMMIT3 Response  */
4 out of 20 processed 2806.txt 		// nothing
5 out of 20 processed 4267.txt // Tool
6 out of 20 processed 4390.txt // Tool
7 out of 20 processed 6393.txt  $NON-NLS-2$ $NON-NLS-3$
8 out of 20 processed 6399.txt  $NON-NLS-2$ $NON-NLS-3$
9 out of 20 processed 6411.txt $NON-NLS-2$ $NON-NLS-3$
10 out of 20 processed 8796.txt //#ifdef exercises
11 out of 20 processed 8797.txt //#else
12 out of 20 processed 8798.txt //#package org.eclipse.cdt.examples.dsf.dataviewer.answers;
13 out of 20 processed 8799.txt //#endif
14 out of 20 processed 9179.txt //#ifdef exercises
15 out of 20 processed 9180.txt //#else
16 out of 20 processed 9181.txt //#package org.eclipse.cdt.examples.dsf.dataviewer.answers;
17 out of 20 processed 9182.txt //#endif
18 out of 20 processed 990.txt // good
19 out of 20 processed 993.txt // PASS


In [79]:
turtlecomments[:5]

['  // Ditto',
 '// Nothing',
 '#DEPTH_*',
 '/**  * COMMIT3 Response  */',
 '\t\t// nothing']
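The two reading loops above differ only in the folder they visit, so they could share one helper that takes the folder as an argument and avoids `os.chdir`. The following is a sketch, not the notebook's own code: `read_comment_files` is a hypothetical name, and reading as UTF-8 with `errors="replace"` is an assumption because the encoding of the comment files is not stated. The demo writes one file to a temporary folder so the block runs on its own.

```python
import tempfile
from pathlib import Path


def read_comment_files(folder):
    """Return {file stem: comment text} for every regular file in folder."""
    comments = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # Assumed encoding; undecodable bytes are replaced, not fatal.
            text = path.read_text(encoding="utf-8", errors="replace")
            # Same flattening as the loops above: newlines become spaces.
            comments[path.stem] = text.replace("\n", " ")
    return comments


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1802.txt").write_text("// Nothing", encoding="utf-8")
    print(read_comment_files(tmp))
```

Keying the result by file stem keeps the original comment id (for example `1802`) next to its text, which the plain lists above discard.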

In [81]:
import csv
os.chdir(r'D:\Temp\JavaCommentsClassification.tar')
# newline="" stops the csv module from writing blank rows on Windows.
with open("turtle_comments.csv", "w", newline="") as csvfile:
    fieldnames = ['index', 'comment', 'non-information']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # Every comment in this category is labelled as non-informative.
    for counter, comment in enumerate(turtlecomments):
        writer.writerow({'index': counter, 'comment': comment, 'non-information': "yes"})


In [52]:
all_comments_path = "D:/Temp/JavaCommentsClassification.tar/JavaCommentsClassification/List of comments/int"
os.chdir(all_comments_path)

In [None]:
# commentsdict = {}
# allcomments = []
# counter = 0
# total_files = len(os.listdir())
# for file in os.listdir():
#     if os.path.isfile(file):
#         with open(file) as reader:
#             text = reader.read().replace("\n"," ")
#             #commentsdict[file.split(".txt")[0]] = 
#             print(f'{counter} out of {total_files} processed',file,text)
#             allcomments.append(str(text))
#         counter += 1
        

In [75]:
import csv
os.chdir(r'D:\Temp\JavaCommentsClassification.tar')
# newline="" stops the csv module from writing blank rows on Windows.
with open("autogen_comments.csv", "w", newline="") as csvfile:
    fieldnames = ['index', 'comment', 'non-information']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # Every comment in this category is labelled as non-informative.
    for counter, comment in enumerate(idecomments):
        writer.writerow({'index': counter, 'comment': comment, 'non-information': "yes"})




# Data Merging and Cleaning

After extracting and labelling the comments from the supplementary dataset, we manually appended the new observations and their labels to the declutter dataset. Let us see what it looks like now.

In [43]:
declutter = pd.read_csv("declutter-gold_DevelopmentSet.csv")
declutter.head()

| Unnamed: 0 | ID | type | path_to_file | begin_line | link_to_comment | comment | non-information |
|---|---|---|---|---|---|---|---|
| 0 | FR587 | Javadoc | https://github.com/nnovielli/jabref/blob/maste... | 151.0 | https://github.com/nnovielli/jabref/blob/maste... | @implNote taken from {@link com.sun.javafx.sce... | yes |
| 1 | FR974 | Line | https://github.com/nnovielli/jabref/blob/maste... | 95.0 | https://github.com/nnovielli/jabref/blob/maste... | icon.setToolTipText(printedViewModel.getLocali... | yes |
| 2 | FR1359 | Line | https://github.com/nnovielli/jabref/blob/maste... | 45.0 | https://github.com/nnovielli/jabref/blob/maste... | Synchronize changes of the underlying date val... | no |
| 3 | FR30 | Javadoc | https://github.com/nnovielli/jabref/blob/maste... | 1102.0 | https://github.com/nnovielli/jabref/blob/maste... | Ask if the user really wants to close the give... | yes |
| 4 | FR774 | Block | https://github.com/nnovielli/jabref/blob/maste... | 227.0 | https://github.com/nnovielli/jabref/blob/maste... | css: information * | no |
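The notebook states that the supplementary rows were appended by hand. As an alternative, the append could be scripted. The sketch below uses only the standard `csv` module, and `append_rows` and the output file name `merged.csv` are hypothetical. Columns that the supplementary files lack, such as `begin_line`, are left blank, which is consistent with pandas now showing that column as floats (151.0 instead of 151).

```python
import csv


def append_rows(main_csv, extra_csv, out_csv):
    """Append rows of extra_csv to main_csv, leaving unknown columns blank."""
    with open(main_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    with open(extra_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Keep only columns the main file knows; missing ones become "".
            rows.append({name: row.get(name, "") for name in fieldnames})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Hypothetical call, using file names that appear earlier in this notebook:
# append_rows("declutter-gold_DevelopmentSet_orig.csv",
#             "turtle_comments.csv", "merged.csv")
```

The function returns the total number of rows written, which gives a quick sanity check against the sizes of the two inputs.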


# Training a word embedding on the sample comments

In [None]:
import sklearn  # scikit-learn is imported under the name sklearn

# References

1. Zhai, J., Xu, X., Shi, Y., Tao, G., Pan, M., Ma, S., Xu, L., Zhang, W., Tan, L. & Zhang, X. (2020). CPC: automatically classifying and propagating natural language comments via program analysis. Retrieved from http://dx.doi.org/doi:10.7282/t3nsjs-4386. [Link](https://rucore.libraries.rutgers.edu/rutgers-lib/61591/PDF/1/play/)
2. Pascarella, L., Bruntink, M. & Bacchelli, A. (2019). Classifying code comments in Java software systems. Empirical Software Engineering 24, 1499–1537. https://doi.org/10.1007/s10664-019-09694-w [Link](https://link.springer.com/article/10.1007/s10664-019-09694-w)