# Code Comments Generator using RNN

Code comments are often as important as codes in a program, especially when the projects are done by a team. 
* It's important to communicate what the code SHOULD be doing
* Comments show respect for the next developer to read your code
* Comments indicate potential problem areas to avoid

# Evaluation Index


**MSE for RNN model**

**BLEU-4 for Comments**

BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations.
BLEU is essentially to calculate the frequency of co-occurrence words in two sentences.


## RNN(Recurrent Neural Network)

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence. 

Basic RNNs are a network of neuron-like nodes organized into successive "layers", each node in a given layer is connected with a directed (one-way) connection to every other node in the next successive layer. Each node (neuron) has a time-varying real-valued activation. Each connection (synapse) has a modifiable real-valued weight. Nodes are either input nodes (receiving data from outside the network), output nodes (yielding results), or hidden nodes (that modify the data en route from input to output).

from __[Recurrent neural network - Wikipedia)](https://en.wikipedia.org/wiki/Recurrent_neural_network)__

![RNN](data/RNN.png)

## Introduction to Tensorflow

TensorFlow is a  leading deep learning and neural network computation framework. It is based on a C++ low level backend but is usually controlled via Python. 

Tensorflow has a multi-hierarchy structure, which can be deployed in various servers, PC terminals and web pages, and supports high-performance numerical computation of GPU and TPU. Tensorflow is widely used in Google internal product development and scientific research in various fields.

## Learning strategies

First, covert the raw data( training and testing) to a sequence of vectors with 50 digits during preprocessing. After encoding the input sequence to a vector , feed the encoded data to RNN model and get predicting sequences. Finally decode the output sequence of vectors and the results are expected comments. 

![workflow](data/workflow.png)

We will read the text file, then split the content into an array which each element is a word, and store it into data variable. Next, we will create a new array called characterlist to store the unique values for training and testing data separately.

![vectorization](data/vectorization.png)

## Prepare the dataset

Our dataset was collected from [UCI](https://www.ics.uci.edu/~lopes/datasets/). The dataset contains 470486 minutes of data.

In [1]:
%matplotlib inline
from collections import Counter
import random

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [185]:
# Read raw data
rawdata=pd.read_json('data/train.json',lines=True)

In [186]:
# Show format of dataset
rawdata

Unnamed: 0,code,nl
0,protected void tearDown(){\n}\n,"Tears down the fixture, for example, close a n..."
1,"@KnownFailure(""Fixed on DonutBurger, Wrong Exc...","javax.net.ssl.SSLEngine#unwrap(ByteBuffer src,..."
2,public static CstFloat make(int bits){\n retu...,Makes an instance for the given value. This ma...
3,public long size(){\n long size=0;\n if (par...,"Returns the ""size"" of the chromosome, namely, ..."
4,public void increment(View view){\n if (quant...,This method is called when the plus button is ...
5,public void trimToSize(){\n ++modCount;\n if...,Trims the capacity of this <tt>ArrayHashList</...
6,public SyncValueResponseMessage(SyncValueRespo...,Performs a deep copy on <i>other</i>.
7,public void clearParsers(){\n if (parserManag...,Removes all parsers from this text area.
8,@Override public void run(){\n while (doWork)...,This is the code for the thread. It delivers d...
9,private byte[] calculateUValue(byte[] generalK...,Calculate what the U value should consist of g...


In [187]:
# Make data a np.array
rawdata.dropna()
rawdata = rawdata.values

In [442]:
# Build X and y
X = rawdata[:, 0]
y = rawdata[:, 1:]

In [443]:
X

array(['protected void tearDown(){\n}\n',
       '@KnownFailure("Fixed on DonutBurger, Wrong Exception thrown") public void test_unwrap_ByteBuffer$ByteBuffer_02(){\n  String host="new host";\n  int port=8080;\n  ByteBuffer bbs=ByteBuffer.allocate(10);\n  ByteBuffer bbR=ByteBuffer.allocate(100).asReadOnlyBuffer();\n  ByteBuffer[] bbA={bbR,ByteBuffer.allocate(10),ByteBuffer.allocate(100)};\n  SSLEngine sse=getEngine(host,port);\n  sse.setUseClientMode(true);\n  try {\n    sse.unwrap(bbs,bbA);\n    fail("ReadOnlyBufferException wasn\'t thrown");\n  }\n catch (  ReadOnlyBufferException iobe) {\n  }\ncatch (  Exception e) {\n    fail(e + " was thrown instead of ReadOnlyBufferException");\n  }\n}\n',
       'public static CstFloat make(int bits){\n  return new CstFloat(bits);\n}\n',
       ...,
       'public Optional<Boolean> eagerCheck(){\n  return Optional.ofNullable(this.eagerCheck);\n}\n',
       'public static NewPlaylistFragment newInstance(Song song){\n  NewPlaylistFragment fragment=

In [190]:
y

array([[ 'Tears down the fixture, for example, close a network connection. This method is called after a test is executed.'],
       [ 'javax.net.ssl.SSLEngine#unwrap(ByteBuffer src, ByteBuffer[] dsts) ReadOnlyBufferException should be thrown.'],
       [ 'Makes an instance for the given value. This may (but does not necessarily) return an already-allocated instance.'],
       ..., 
       ['whether check errors as more as possible'],
       [ 'Creates a new instance of the New Playlist dialog fragment to create a new playlist and add a song to it.'],
       ['Report the current cursor location in its window.']], dtype=object)

## Tokenization & Generating Vocabulary

We tokenized features data and label data to two vocabulary separately, because the format of java codes and human words are quite different.

In [25]:
from collections import Counter
import time
from itertools import chain
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize import ToktokTokenizer
toktok = ToktokTokenizer()

### Get Features Vocabulary

In [444]:
%%time
start=time.time()
Chracterlist_code = []
i = 0
for s in X:
    i = i+1
    Chracterlist_code.extend(toktok.tokenize(s))
    if i % 100000 == 0:
        print('----------------------------------------')
        print('Pieces of data: %s'%(i))
        print('The length of List: %s'%(len(Chracterlist_code)))
        end=time.time()
        print('Running time: %s Seconds'%(end-start))

----------------------------------------
Pieces of data: 100000
The length of List: 7152616
Running time: 32.0409209728241 Seconds
----------------------------------------
Pieces of data: 200000
The length of List: 14354948
Running time: 63.855900049209595 Seconds
----------------------------------------
Pieces of data: 300000
The length of List: 21568075
Running time: 91.99899792671204 Seconds
----------------------------------------
Pieces of data: 400000
The length of List: 28778707
Running time: 121.17291712760925 Seconds
CPU times: user 2min 18s, sys: 2.9 s, total: 2min 21s
Wall time: 2min 24s


In [446]:
#Vocabulary for vectorization and  use Vectorized data to fit model
vocabs_code = sorted(list(set(Chracterlist_code)))
code_to_int = dict((c, i) for i, c in enumerate(vocabs_code))   #Vectorize the text
int_to_code = dict((i, c) for i, c in enumerate(vocabs_code))
data_vocabs_code=pd.DataFrame(data=vocabs_code)#
data_vocabs_code.to_csv('vocabs_code.csv',encoding='gbk')

### Get Labels Vocabulary

In [437]:
%%time
start=time.time()
Chracterlist_comment = []
i = 0
for s in y[:,0]:
    i = i+1
    Chracterlist_comment.extend(toktok.tokenize(s))
    if i % 100000 == 0:
        print('----------------------------------------')
        print('Pieces of data: %s'%(i))
        print('The length of List: %s'%(len(Chracterlist_comment)))
        end=time.time()
        print('Running time: %s Seconds'%(end-start))

----------------------------------------
Pieces of data: 100000
The length of List: 1652981
Running time: 8.005395889282227 Seconds
----------------------------------------
Pieces of data: 200000
The length of List: 3303995
Running time: 16.8194260597229 Seconds
----------------------------------------
Pieces of data: 300000
The length of List: 4952205
Running time: 25.460472106933594 Seconds
----------------------------------------
Pieces of data: 400000
The length of List: 6602312
Running time: 33.247819900512695 Seconds
CPU times: user 36.4 s, sys: 1.76 s, total: 38.1 s
Wall time: 39.9 s


In [438]:
#Vocabulary for predicting comments
vocabs_comment = sorted(list(set(Chracterlist_comment)))
comment_to_int = dict((c, i) for i, c in enumerate(vocabs_comment))   #Vectorize the text
int_to_comment = dict((i, c) for i, c in enumerate(vocabs_comment))
data_vocabs_comment=pd.DataFrame(data=vocabs_comment)#
data_vocabs_comment.to_csv('vocabs_comment.csv',encoding='gbk')

# Vectorization

Here we create 10 arrays to store the data for each stage. Because we need to keep inputing elements to our array and each time the array size grows, data migration occurs, which can be time-consuming. It took nearly two days when we first tried it. So we came up with this idea.

In [451]:
%%time
start = time.time()
i = 0
for s in X:
    encoded_s = np.array([vocab_to_int[c] for c in toktok.tokenize(s)], dtype=np.int32)
    if encoded_s.shape[0]<50:
        l = encoded_s.shape[0]
        encoded_s = np.pad(encoded_s, (50-l,0), 'constant')
    else:
        encoded_s = encoded_s[:50]
    i = i+1    
    
    
    if i<50000:
        encoded1 = np.vstack((encoded1, encoded_s))
    elif i>=50000 and i<100000:
        encoded2 = np.vstack((encoded2, encoded_s))  
    elif i>=100000 and i<150000:   
        encoded3 = np.vstack((encoded3, encoded_s))         
    elif i>=150000 and i<200000:   
        encoded4 = np.vstack((encoded4, encoded_s))         
    elif i>=200000 and i<250000:   
        encoded5 = np.vstack((encoded5, encoded_s)) 
    elif i>=250000 and i<300000:   
        encoded6 = np.vstack((encoded6, encoded_s))         
    elif i>=300000 and i<350000:   
        encoded7 = np.vstack((encoded7, encoded_s))         
    elif i>=350000 and i<400000:   
        encoded8 = np.vstack((encoded8, encoded_s))
    elif i>=400000 and i<450000:   
        encoded9 = np.vstack((encoded9, encoded_s))        
    else:   
        encoded10 = np.vstack((encoded10, encoded_s))        
         
    if i % 100000 == 0:
        print('----------------------------------------')
        print('Pieces of data Processed: %s'%(i))
#         print('The Shape of Encoded Array: %s'% (encoded.shape,))
        end=time.time()
        print('Running time: %s Seconds'%(end-start))

----------------------------------------
Pieces of data Processed: 100000
Running time: 489.3179728984833 Seconds
----------------------------------------
Pieces of data Processed: 200000
Running time: 951.5639998912811 Seconds
----------------------------------------
Pieces of data Processed: 300000
Running time: 1408.0064408779144 Seconds
----------------------------------------
Pieces of data Processed: 400000
Running time: 1871.5140528678894 Seconds
CPU times: user 22min 22s, sys: 12min 44s, total: 35min 7s
Wall time: 35min 14s
Parser   : 146 ms


In [452]:
encoded11 = np.empty([50])
encoded12 = np.empty([50])
encoded13 = np.empty([50])
encoded14 = np.empty([50])
encoded15 = np.empty([50])
encoded16 = np.empty([50])
encoded17 = np.empty([50])
encoded18 = np.empty([50])
encoded19 = np.empty([50])
encoded20 = np.empty([50])

In [453]:
%%time
start = time.time()
i = 0
for s in y[:,0]:
    encoded_s = np.array([vocab_to_int[c] for c in toktok.tokenize(s)], dtype=np.int32)
    if encoded_s.shape[0]<50:
        l = encoded_s.shape[0]
        encoded_s = np.pad(encoded_s, (50-l,0), 'constant')
    else:
        encoded_s = encoded_s[:50]
    i = i+1    
    
    
    if i<50000:
        encoded11 = np.vstack((encoded11, encoded_s))
    elif i>=50000 and i<100000:
        encoded12 = np.vstack((encoded12, encoded_s))  
    elif i>=100000 and i<150000:   
        encoded13 = np.vstack((encoded13, encoded_s))         
    elif i>=150000 and i<200000:   
        encoded14 = np.vstack((encoded14, encoded_s))         
    elif i>=200000 and i<250000:   
        encoded15 = np.vstack((encoded15, encoded_s)) 
    elif i>=250000 and i<300000:   
        encoded16 = np.vstack((encoded16, encoded_s))         
    elif i>=300000 and i<350000:   
        encoded17 = np.vstack((encoded17, encoded_s))         
    elif i>=350000 and i<400000:   
        encoded18 = np.vstack((encoded18, encoded_s))
    elif i>=400000 and i<450000:   
        encoded19 = np.vstack((encoded19, encoded_s))        
    else:   
        encoded20 = np.vstack((encoded20, encoded_s))        
         
    if i % 100000 == 0:
        print('----------------------------------------')
        print('Pieces of data Processed: %s'%(i))
#         print('The Shape of Encoded Array: %s'% (encoded.shape,))
        end=time.time()
        print('Running time: %s Seconds'%(end-start))

----------------------------------------
Pieces of data Processed: 100000
Running time: 420.38994693756104 Seconds
----------------------------------------
Pieces of data Processed: 200000
Running time: 857.4740488529205 Seconds
----------------------------------------
Pieces of data Processed: 300000
Running time: 1293.0160129070282 Seconds
----------------------------------------
Pieces of data Processed: 400000
Running time: 1750.2411909103394 Seconds
CPU times: user 20min 31s, sys: 13min 8s, total: 33min 39s
Wall time: 33min 46s


After vectorization, merge sub-arrays into one

In [456]:
encodedx = np.vstack((np.delete(encoded1, [0], axis = 0), np.delete(encoded2, [0], axis = 0), np.delete(encoded3, [0], axis = 0), np.delete(encoded4, [0], axis = 0), np.delete(encoded5, [0], axis = 0), np.delete(encoded6, [0], axis = 0), np.delete(encoded7, [0], axis = 0), np.delete(encoded8, [0], axis = 0), np.delete(encoded9, [0], axis = 0), np.delete(encoded10, [0], axis = 0)))

In [457]:
encodedy = np.vstack((np.delete(encoded11, [0], axis = 0), np.delete(encoded12, [0], axis = 0), np.delete(encoded13, [0], axis = 0), np.delete(encoded14, [0], axis = 0), np.delete(encoded15, [0], axis = 0), np.delete(encoded16, [0], axis = 0), np.delete(encoded17, [0], axis = 0), np.delete(encoded18, [0], axis = 0), np.delete(encoded19, [0], axis = 0), np.delete(encoded20, [0], axis = 0)))

Save the vectorized data

In [458]:
data_encoded_x=pd.DataFrame(data=encodedx)#
data_encoded_x.to_csv('encodedx.csv',encoding='gbk')
data_encoded_y=pd.DataFrame(data=encodedy)#
data_encoded_y.to_csv('encodedy.csv',encoding='gbk')

### Split dataset into training and testing

In [459]:
X_train_start = 0
X_train_end = int(np.floor(encodedx.shape[0]*0.9))
X_test_end = int(encodedx.shape[0])
Y_train_start = 0
Y_train_end = int(np.floor(encodedy.shape[0]*0.9))
Y_test_end = int(encodedy.shape[0])
X_train_encoded = X_encoded[X_train_start: X_train_end]
X_test_encoded = X_encoded[X_train_end: X_test_end]
Y_train_encoded = Y_encoded[Y_train_start: Y_train_end]
Y_test_encoded = Y_encoded[Y_train_end: Y_test_end]

Check the formats 

In [460]:
print(X_train_encoded.shape)
print(Y_train_encoded.shape)
print(X_test_encoded.shape)
print(Y_test_encoded.shape)

(423437, 50)
(423437, 50)
(47049, 50)
(47049, 50)


## Create RNN Model and Generate Comments

In [461]:
# Import Tensorflow
import tensorflow as tf

In [462]:
# Number of features in training
n_code = X_train_encoded.shape[1]

# Neurons
n_neurons_1 = 1024
n_neurons_2 = 512
n_neurons_3 = 256
n_neurons_4 = 128

#Traget dimensions
n_target = 50

In [463]:
# Session
net = tf.InteractiveSession()

The shape of the placeholders correspond to [None, n_code] , meaning that the inputs and outputs are 2-dimensional matrix vector. And [None] means the size is undefined and can be defined later.

In [464]:
# Placeholder
X = tf.placeholder(dtype=tf.float32, shape=[None, n_code])
Y = tf.placeholder(dtype=tf.float32, shape=[None, n_code])

The ultimate goal of neural network training is to get the best parameters and make the target function get the minimum value. Initialization of parameters is also important.

In [465]:
# Initializers
sigma = 1
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg", distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()

Our model consists of 4 hidden layers and we define the amount in every layers from 1024 to 128. The number of neurons in each subsequent layer reduces the amount of information that the network identifies in the previous layer.

In [466]:
# Hidden weights
W_hidden_1 = tf.Variable(weight_initializer([n_code, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))

Weights and biases are represented as variables in order to adapt during training.

In [467]:
# Output weights
W_out = tf.Variable(weight_initializer([n_neurons_4, n_target]))
bias_out = tf.Variable(bias_initializer([n_target]))

Activation functions are important elements. Because without the Activation function, the multi-layer inputs and ouputs are just in a linear transformation. Since the expression ability of linear model is not enough, the Activation function can introduce nonlinear factors.

We choose ReLU(Rectified Linear Unit) in our model, one of the most common Activation functions.

In [468]:
# # Hidden layer
# hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
# hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2), bias_hidden_2))
# hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3), bias_hidden_3))
# hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4), bias_hidden_4))
# # Output layer (transpose to fit the shape!)
# out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))

In [469]:
model = Sequential()

model.add(Conv2D(128, (3, 3), strides=(1,1), padding='same', 
					input_shape=x_train.shape[1:]))
model.add(BatchNormalization())
model.add(Conv2D(128, (3, 3), strides=(1,1), padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same'))
model.add(Activation('relu'))

model.add(Conv2D(128, (3, 3), strides=(1,1), padding='same'))
model.add(BatchNormalization())
model.add(Conv2D(128, (3, 3), strides=(1,1), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same'))

model.add(Conv2D(128, (3, 3), strides=(1,1), padding='same'))
model.add(BatchNormalization())
model.add(Conv2D(128, (3, 3), strides=(1,1), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same'))

model.add(Conv2D(256, (3, 3), strides=(1,1), padding='same'))
model.add(BatchNormalization())
model.add(Conv2D(512, (3, 3), strides=(1,1), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same'))

# 3 Fully connect Layer
model.add(Flatten())
model.add(Dense(4096))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(1024))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

In [470]:
# Cost function
mse = tf.reduce_sum(tf.reduce_mean(tf.squared_difference(tf.transpose(out), Y)))

The learning rate is inversely proportional to the number of layers in the network. The more layers there are, the lower the learning rate is.

If the learning rate is too high, the loss begins to increase and then diverges to infinity

Deep learning is common for gradient optimization, that is to say, the optimizer is actually all kinds of gradient descent algorithm optimization. It indicates the direction in which the weights and biases have to be changed during training in order to minimize the network’s cost function.

Adam(adaptive moment estimation) dynamically adjusts the learning rate of each parameter according to the first moment estimation and the second moment estimation of the gradient of each parameter based on the loss function. TensorFlow offers tf.train.adamoptimizer to control the learning speed. Adam is also a method based on gradient descent, but the learning step size of parameters in each iteration has a certain range, which will not lead to a large learning step size due to a large gradient. The value of parameters is relatively stable.

In [471]:
# Optimizer
# Adam
learning_rate = 0.0005
opt = tf.train.AdamOptimizer(learning_rate).minimize(mse)

In [472]:
# Run initializer
net.run(tf.global_variables_initializer())

In [473]:
init = tf.global_variables_initializer()

In [690]:
# Batch size
batch_size = 256
mse_train = []
mse_test = []

In [475]:
%%time
epochs = 100 #number of iterations or training cycles

saver = tf.train.Saver(max_to_keep=4)
with tf.Session() as sess:
    init.run()
    for e in range(epochs):
        sess.run(opt, feed_dict={X: X_train_encoded, Y: Y_train_encoded})
        if e % 10 == 0:
            print("-----------------------")
            loss = mse.eval(feed_dict={X: X_train_encoded, Y: Y_train_encoded})
            print(e, "\tMSE:", loss)
            print("----------save the model")
            saver.save(sess, "model/my-model", global_step=e)   
    y_pred = sess.run(out, feed_dict={X: X_test_encoded})

-----------------------
0 	MSE: 4.16052e+11
----------save the model
-----------------------
10 	MSE: 3.41653e+11
----------save the model
-----------------------
20 	MSE: 2.99512e+11
----------save the model
-----------------------
30 	MSE: 2.77965e+11
----------save the model
-----------------------
40 	MSE: 2.61161e+11
----------save the model
-----------------------
50 	MSE: 2.50663e+11
----------save the model
-----------------------
60 	MSE: 2.4382e+11
----------save the model
-----------------------
70 	MSE: 2.40229e+11
----------save the model
-----------------------
80 	MSE: 2.38388e+11
----------save the model
-----------------------
90 	MSE: 2.37596e+11
----------save the model
CPU times: user 2h 36min 13s, sys: 13min 59s, total: 2h 50min 13s
Wall time: 28min


## Decoding Prediction Results

We store the decoded comments in list Comment_pred, and merge prediction data and test data to check the accuracy.

In [476]:
y_pred = y_pred.T

In [615]:
import string
length = y_pred.shape[0]
Comment_pred = []
for x in y_pred:
    comment = ''
    for y in x:
        try:
            co = int_to_comment[int(y)]
        except:
            continue
        co = co.strip(string.punctuation)    
        comment = comment+' '+co
    
    Comment_pred = Comment_pred + [comment]

In [616]:
Comment_pred = np.array(Comment_pred)

In [632]:
y = rawdata[:, 1:]
Comment_test = y[:,0][Y_train_end: Y_test_end]
Test_result = np.vstack((Comment_test, Comment_pred)).T

In [633]:
Result=pd.DataFrame(data=Test_result, columns = ['Comment','Prediction'])#
Result

Unnamed: 0,Comment,Prediction
0,<!-- begin-user-doc --> <!-- end-user-doc -->,Business mca tt>iface</tt code>_gfortran_abor...
1,The default flow direction. Normally (which is...,code>Vector</code IINCs code>s2</code subDire...
2,Size the window and positioning it relative to...,OpenMRS ADWIN code>foreach</code CWD bR Eleme...
3,Returns the contents of the cell at rowNumber ...,beHealthyMember 0x00080003</li ValidationErro...
4,Method strategyStarted.,code>retryCount</code PKCS1 code>t</code IntA...
5,Checks if this column is signed. It always ret...,CopyStreamListener owner-write desactive exam...
6,Merges authorization configurations by putting...,code>hrs</code patient fetch defineFont imple...
7,<!-- begin-user-doc --> <!-- end-user-doc -->,consults finalized Gmail 21/#F2 gm PUSHF code...
8,Synchronized read of the current number of tas...,Vissim-Inp-File Notice li>Scan DirectedGraph ...
9,Schedule a task for repeated fixed-rate execut...,code>XmlWriter</code fValue RiskoGefaehrdungs...


## Evaluate results by BLEU-4

We evaluate our results through nltk-bleu scoring method and get scores.

In [622]:
from nltk.translate.bleu_score import sentence_bleu

In [641]:
scores = []
for i in range(Comment_test.shape[0]):
    reference = toktok.tokenize(Comment_test[i])
    candidate = toktok.tokenize(Comment_pred[i])
    score = sentence_bleu(reference, candidate)
    scores = scores + [score]

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [685]:
scores = np.array(scores)
Final_result = np.concatenate(([scores],Test_result.T)).T
Final_Result=pd.DataFrame(data=Final_result, columns = ['Score','Prediction','Comment'])#
Final_Result

Unnamed: 0,Score,Prediction,Comment
0,0,<!-- begin-user-doc --> <!-- end-user-doc -->,Business mca tt>iface</tt code>_gfortran_abor...
1,0,The default flow direction. Normally (which is...,code>Vector</code IINCs code>s2</code subDire...
2,0,Size the window and positioning it relative to...,OpenMRS ADWIN code>foreach</code CWD bR Eleme...
3,0,Returns the contents of the cell at rowNumber ...,beHealthyMember 0x00080003</li ValidationErro...
4,0,Method strategyStarted.,code>retryCount</code PKCS1 code>t</code IntA...
5,0,Checks if this column is signed. It always ret...,CopyStreamListener owner-write desactive exam...
6,0,Merges authorization configurations by putting...,code>hrs</code patient fetch defineFont imple...
7,0,<!-- begin-user-doc --> <!-- end-user-doc -->,consults finalized Gmail 21/#F2 gm PUSHF code...
8,0,Synchronized read of the current number of tas...,Vissim-Inp-File Notice li>Scan DirectedGraph ...
9,0,Schedule a task for repeated fixed-rate execut...,code>XmlWriter</code fValue RiskoGefaehrdungs...


In [689]:
Final_Result[Final_Result['Score']>0]

Unnamed: 0,Score,Prediction,Comment
23,0.472871,Drag and drop,Differ non-Locals code>method</code coveringR...
92,0.456634,Put string data to shared preferences in priva...,Caret dl_vlan m lockGrantorIdLock DTLZ fist i...
185,0.423799,Creates an empty set of values using the defau...,1111 li>user=username</li AuthnProviderParams...
228,0.415677,Always throws RejectedExecutionException.,DownloadStatusConfiguration MapKey MediaRoute...
250,0.447214,Action Listener,breaker 1e-13ish MessagingArea tt>target.size...
335,0.492479,Strips source routing. According to RFC-2821 i...,LOCALAWARE_CLUSTERPERSISTENT CommandsWindow r...
347,0.467138,<p>Search the specified classloader for the gi...,FailedTask Initialitzes LogReaderTask prefix ...
419,0.447214,Mouse Clicked. Start AssignmentDialog,code>quad.properties</code code>InvalidJobExc...
475,0.434721,Checks whether this line in the cql script ind...,Pnts SSDP tt>ras</tt complied bootrecord Tran...
503,0.417226,"Has this upstream started or ""onSubscribed"" ?",HGAtomRef JDlgUploadProducts tt>other</tt>val...


## Conclusion

The model had difficulties with this dataset, as the achieved accuracy was relatively low. However, the performance of this vectorization approach mainly relies on the dataset, and it still has much room for improvement.

## Reference

[1] Xing Hu, Ge li, Zhi Jin. Deep Code Comment Generation. 2018. Key Laboratory of High Confidence Software Technologies (Peking University), MoE.

[2] Tjalling Haije. Automatic Comment Generation using a Neural Translation Model. 2016. University of Amsterdam Faculty of Science.

[3] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.