# Final Project for MP

**(d)** (N points) Let's add some extra features. Use features that you believe will help detect code clones. (For example, counts of Java keywords: for a given Java function, how many 'if', how many 'try', and so on.) Feel free to design your own features.

Here are several steps you might want to take to add the new features (you are welcome to use a different approach):
* Write a function that, given a code snippet, returns the features.
* Write a function to create a dataframe for the features you designed.
* Write a function that uses ``ColumnTransformer`` to combine the unigram features with the designed features.
* Re-train your logistic regression model over the combined features.

What is the 10-fold cross-validation accuracy? (Make sure to use the same folds as before, so the results are comparable!)
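
As one possible starting point for the first step above, here is a minimal sketch that counts a few Java keywords in a snippet. The names `KEYWORDS` and `keyword_counts` are hypothetical, and the regex tokenizer is a simplification (it does not skip comments or string literals).

```python
import re
from collections import Counter

# Hypothetical, deliberately short keyword list; extend as needed.
KEYWORDS = ("if", "else", "for", "while", "try", "catch", "return")

def keyword_counts(snippet):
    """Return a dict mapping each keyword in KEYWORDS to its count in the snippet."""
    # \w+ pulls out identifier-like tokens, so 'if(' still yields 'if'.
    counts = Counter(re.findall(r"\w+", snippet))
    return {kw: counts[kw] for kw in KEYWORDS}

snippet = "if (x > 0) { try { return f(x); } catch (Exception e) { return 0; } }"
features = keyword_counts(snippet)
print(features)
```

Stacking one such dict per snippet with `pd.DataFrame(list_of_dicts)` gives one column per keyword, which could then be combined with the unigram counts.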
 
**(e)** (N points) Finally, some analysis. We have trained two models: LR with unigrams and bigrams, and LR with unigrams and the features you designed. This setup lets us compare the two kinds of added features (bigrams versus the features you designed). Which one is the better feature for the code clone detection task? Explain why this may be true.

In [1]:
# Imports

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from sklearn.base import BaseEstimator
from nltk import word_tokenize
from scipy.spatial import distance

import pandas as pd
import numpy as np

import re

## LR with unigrams and bigrams

In [2]:
# Load dataset

dataset = load_files("code-clone/")

In [3]:
def evaluate_cross_val_score(df, vectorizer):
    model = LogisticRegression(random_state=3, solver='liblinear', penalty="l1")
    
    X_train = vectorizer.fit_transform(df)
    y_train = dataset.target
    
    lr_cross_val_score = cross_val_score(model, X_train, y_train, cv=10)
    
    print("Accuracy: %0.4f (+/- %0.4f)" % (lr_cross_val_score.mean(), lr_cross_val_score.std() * 2))

In [4]:
vectorizer = CountVectorizer(ngram_range=(1, 2))

evaluate_cross_val_score(dataset.data, vectorizer)

Accuracy: 0.7680 (+/- 0.0784)


# LR with extra features

## Baseline (only unigrams)

In [5]:
vectorizer = CountVectorizer()

evaluate_cross_val_score(dataset.data, vectorizer)

Accuracy: 0.7680 (+/- 0.0709)


## ✅ Counting the number of classes in each method

In [6]:
# Given the raw files from sklearn.datasets.load_files(), extract the set of unique class names.
# Current strategy: look for 'new'; the token after it is probably a class name.
def extract_classes(code):
    retVal = set()
    for file in code:
        tokens = word_tokenize(file.decode())
        for index, token in enumerate(tokens):
            # Guard against 'new' being the last token in the file
            if token == 'new' and index + 1 < len(tokens):
                retVal.add(tokens[index+1])
    return retVal

class SimilarityScorer(BaseEstimator):
    def fit(self, data, unused):
        self.unique_classes = extract_classes(data)
        self.unique_classes.update({"byte", "short", "int", "long", "float", "double", "boolean", "String"})
        return self
        
    def transform(self, doc):
        retVal = []
        
        
        for file in doc:
            tokens = word_tokenize(file.decode())
            score = 0
            class_count = {}
            
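            # First function: count the number of times we
            # see a class being used
            done = False
            index = 0
            while not done and index < len(tokens):
                token = tokens[index]
                # Stop at the '[CLS]' separator, which is expected to arrive as
                # the tokens '[', 'CLS', ']'; the bounds check avoids wrapping
                # around at index 0 and running past the last token
                if (token.lower() == "cls" and 0 < index < len(tokens) - 1
                        and tokens[index-1] == '[' and tokens[index+1] == ']'):
                    done = True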
                if token in self.unique_classes:
                    if token in class_count:
                        class_count[token] += 1
                    else:
                        class_count[token] = 1
                index += 1
                        
            # Here, we are on the 2nd function (usually),
            # or this file doesn't have a 2nd function (unlikely).
            # Now we want to see whether the classes encountered here
            # were also used in the first function.
            # Add 1 to the similarity score for each class occurrence
            # that both functions share
            while index < len(tokens):
                token = tokens[index]
                if token in self.unique_classes and token in class_count and class_count[token] > 0:
                    class_count[token] -= 1
                    score += 1
                index += 1
                
            retVal.append(score)
        return pd.DataFrame(retVal)

In [7]:
vectorizer = ColumnTransformer([
    ("body", CountVectorizer(), "body"),
    ("classes", SimilarityScorer(), "classes")
])

df = pd.DataFrame({"body": dataset.data, "classes": dataset.data})

evaluate_cross_val_score(df, vectorizer)

Accuracy: 0.8260 (+/- 0.0460)
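
The score computed in `SimilarityScorer.transform` amounts to the size of the multiset intersection of class tokens in the two methods: each class occurrence in the second method is matched against at most one occurrence in the first. A minimal sketch of that idea follows, assuming a simple regex tokenizer in place of `word_tokenize` and a hypothetical helper name `shared_class_score`:

```python
import re
from collections import Counter

def shared_class_score(pair, known_classes, sep="[CLS]"):
    """Count class-name occurrences that appear in both halves of `pair`."""
    left, _, right = pair.partition(sep)
    left_counts = Counter(t for t in re.findall(r"\w+", left) if t in known_classes)
    right_counts = Counter(t for t in re.findall(r"\w+", right) if t in known_classes)
    # Counter & Counter keeps the minimum count per key (multiset intersection).
    return sum((left_counts & right_counts).values())

known = {"String", "int", "ArrayList"}
pair = ("String join(int n) { return new String(); } [CLS] "
        "int size(ArrayList a, String s) { return a.size(); }")
score = shared_class_score(pair, known)
print(score)
```

Here the left method uses `String` twice and `int` once, the right method uses `int`, `ArrayList`, and `String` once each, so the shared occurrences are one `String` and one `int`.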


## ✅ Counting the Java keywords in each method

In [8]:
java_kwords = {
    'abstract', 'continue', 'for', 'new',
    'switch', 'assert', 'default', 'goto',
    'package', 'synchronized', 'boolean', 'do',
    'if', 'private', 'this','break', 'double',
    'implements', 'protected', 'throw', 'byte', 
    'else', 'import', 'public', 'throws', 'case',
    'enum', 'instanceof', 'return', 'transient',
    'catch', 'extends', 'int', 'short', 'try', 
    'char', 'final', 'interface', 'static', 'void',
    'class', 'finally', 'long', 'strictfp', 'while',
    'volatile', 'const', 'float', 'native', 'super'
}

java_kword_vectorizer = CountVectorizer()
java_kword_vectorizer.fit(java_kwords)

def get_keyword_vectorizer(code):
    features = list()
    tokens = word_tokenize(code)
    for token in tokens:
        if token in java_kwords:
            features.append(token)
    return java_kword_vectorizer.transform([' '.join(features)])

In [9]:
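def separate_methods(docs):
    for doc in docs:
        document = doc.decode()
        # partition() splits on the first '[CLS]' and does not raise if the
        # separator is missing (right is then an empty string)
        left, _, right = document.partition('[CLS]')
        yield left, right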

class SimilarityScorerKeywords(BaseEstimator):
    def fit(self, data, unused):
        return self
        
    def transform(self, documents):
        retVal = []
        
        for left, right in separate_methods(documents):
            left_counter = get_keyword_vectorizer(left)            
            right_counter = get_keyword_vectorizer(right)
            # cosine_similarity accepts sparse input directly, so no np.matrix is created
            score = cosine_similarity(left_counter, right_counter)
            retVal.append(score[0][0])
            # this returns the count as an array
            #score = np.hstack([left_counter.todense()[0], right_counter.todense()[0]])
            #retVal.append(np.array(score.flatten())[0])
        return pd.DataFrame(retVal)

In [10]:
import warnings
warnings.filterwarnings('ignore') # Warning caused by dependency on numpy

vectorizer = ColumnTransformer([
    ("body", CountVectorizer(ngram_range=(1,2)), "body"),
    ("kwords", SimilarityScorerKeywords(), "kwords"),
])

df = pd.DataFrame({"body": dataset.data, "kwords": dataset.data})

evaluate_cross_val_score(df, vectorizer)

Accuracy: 0.7720 (+/- 0.0700)
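
The keyword feature above is a single number per pair: the cosine similarity between the keyword count vectors of the two methods, i.e. the dot product divided by the product of the vector norms. A minimal standard-library sketch of that computation on a toy pair follows; the helper name `cosine` and the choice to return 0 for an all-zero vector are assumptions of this sketch.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # Assumption of this sketch: an all-zero vector has similarity 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

left = Counter({"if": 2, "return": 1})   # keyword counts of the first method
right = Counter({"if": 1, "return": 2})  # keyword counts of the second method
similarity = cosine(left, right)
print(round(similarity, 4))
```

For this pair the dot product is 4 and both norms are sqrt(5), so the similarity is 4/5.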


## ✅ Counting number of arguments and return type

In [11]:
regex = (r"^[ \t]*(?:(?:public|private|protected|static|final|native"
         r"|synchronized|abstract|transient|@Override|@Test)+\s+)+"
         r"[$_\w<>\[\]\s]*\s+[\$_\w]+\([^\)]*\)?\s*")
non_kword_regex = r"^[ \t]*[$_\w<>\[\]\s]*\s+[\$_\w]+\([^\)]*\)?\s*"
def get_java_func_signature(code):
    match = re.match(regex, code)
    ret_val_idx = -2
    if not match:
        match = re.match(non_kword_regex, code)
        ret_val_idx = -1
        if not match:
            return "void", []
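    # Drop surrounding whitespace and the closing ')' of the parameter list
    string = match.group(0).strip()
    if string.endswith(")"):
        string = string[:-1]
    signature, _, args = string.partition("(")
    # An empty parameter list means zero arguments, not one empty argument
    args = [arg for arg in args.split(", ") if arg.strip()]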
    signature = signature.split(" ")
    ret_value = signature[ret_val_idx]
    return ret_value, args

In [12]:
class SimilarityScoreArgCount(BaseEstimator):
    """
    Evaluates based on the number of arguments that each method has.
    The score is len(left_args) - len(right_args)
    """
    def fit(self, data, unused):
        return self
        
    def transform(self, documents):
        args_diff = []
        for left, right in separate_methods(documents):
            _, largs = get_java_func_signature(left)
            _, rargs = get_java_func_signature(right)
            args_diff.append(len(largs) - len(rargs))
        return pd.DataFrame(args_diff)


class SimilarityScoreRetVals(BaseEstimator):
    """
    Score based on the return type of each method.
    """
    def fit(self, data, unused):
        self.unique_classes = set()
        for left, right in separate_methods(data):
            left_ret_val, _ = get_java_func_signature(left)
            self.unique_classes.add(left_ret_val)
            right_ret_val, _ = get_java_func_signature(right)
            self.unique_classes.add(right_ret_val)
        return self
        
    def transform(self, documents):
        vals = []
        vectorizer = CountVectorizer()
        vectorizer.fit(self.unique_classes)
        vocab = vectorizer.vocabulary_
        for left, right in separate_methods(documents):
            ret_type_l, _ = get_java_func_signature(left)
            ret_type_r, _ = get_java_func_signature(right)
            lret_id = vocab.get(ret_type_l.lower(), -1)
            rret_id = vocab.get(ret_type_r.lower(), -1)
            vals.append([lret_id, rret_id])
        return pd.DataFrame(vals)

In [13]:
vectorizer = ColumnTransformer([
    ("body", CountVectorizer(), "body"),
    ("argcount", SimilarityScoreArgCount(), "argcount"),
])

df = pd.DataFrame({"body": dataset.data, "argcount": dataset.data})

evaluate_cross_val_score(df, vectorizer)

Accuracy: 0.7705 (+/- 0.0662)
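
To make the argument-count feature concrete, here is a small self-contained sketch that parses two toy method headers and computes the difference in their argument counts. The pattern `SIGNATURE` and the helper `parse_signature` are hypothetical simplifications, not the regexes used above.

```python
import re

# Simplified pattern: return type, method name, then the parameter list.
SIGNATURE = re.compile(r"([\w<>\[\]]+)\s+(\w+)\s*\(([^)]*)\)")

def parse_signature(code):
    """Return (return_type, [parameters]) for the first method header found."""
    match = SIGNATURE.search(code)
    if match is None:
        return "void", []
    return_type, _name, params = match.groups()
    # Splitting on ',' miscounts generic types such as Map<String, Integer>.
    args = [p.strip() for p in params.split(",") if p.strip()]
    return return_type, args

left = "public static int add(int a, int b) { return a + b; }"
right = "void run() { }"
left_type, largs = parse_signature(left)
right_type, rargs = parse_signature(right)
arg_diff = len(largs) - len(rargs)
print(left_type, right_type, arg_diff)
```

Note that the difference is signed, so swapping the two methods flips the sign of the feature; taking the absolute value would be one way to make it symmetric.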


## ❌ Using w2v embeddings of words in the method names

> Note: run the code under methods-w2v.ipynb to generate the embeddings and save them to files.

In [14]:
left_method_embeddings = load_files('code-clone-method-embeddings-left', encoding="utf-8")
right_method_embeddings = load_files('code-clone-method-embeddings-right', encoding="utf-8")

In [15]:
def populate_pd_series(data, col_prefix):
    df = pd.Series(data, dtype='str', name='left')
    df = df.str.split(',', expand=True)
    
    col_count = df.shape[1]
    col_names = dict(zip(range(col_count), [f'{col_prefix}{i}' for i in range(col_count)]))
    df = df.rename(columns=col_names).astype('float')
    return df
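
For a single row, `populate_pd_series` does roughly the following: it splits a comma-separated string of numbers and names each value after the prefix and its position. A plain-Python sketch, assuming each embedding file holds one such comma-separated line (the helper name `parse_embedding_row` is hypothetical):

```python
def parse_embedding_row(row, col_prefix):
    """Turn '0.1,0.2' into {'<prefix>0': 0.1, '<prefix>1': 0.2}."""
    values = [float(v) for v in row.split(",")]
    return {f"{col_prefix}{i}": v for i, v in enumerate(values)}

row = parse_embedding_row("0.5,-1.25,3.0", "left_method")
print(row)
```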

In [16]:
left_method_series = populate_pd_series(left_method_embeddings.data, 'left_method')
right_method_series = populate_pd_series(right_method_embeddings.data, 'right_method')

body = pd.Series(dataset.data, name='body')

df = pd.concat([body, left_method_series, right_method_series], axis=1)

In [17]:
vectorizer = ColumnTransformer([
    ("body", CountVectorizer(), "body"),
], remainder='passthrough')

evaluate_cross_val_score(df, vectorizer)

Accuracy: 0.7590 (+/- 0.0595)


## ✅ Using w2v embeddings for all Java tokens in each method

> Note: run the code under tokens-w2v.ipynb to generate the embeddings and save them to files.

In [18]:
left_token_embeddings = load_files('code-clone-embeddings-left', encoding="utf-8")
right_token_embeddings = load_files('code-clone-embeddings-right', encoding="utf-8")

In [19]:
left_token_series = populate_pd_series(left_token_embeddings.data, 'left_tokens')
right_token_series = populate_pd_series(right_token_embeddings.data, 'right_tokens')

body = pd.Series(dataset.data, name='body')

df = pd.concat([body, left_token_series, right_token_series], axis=1)

In [20]:
vectorizer = ColumnTransformer([
    ("body", CountVectorizer(), "body"),
], remainder='passthrough')

evaluate_cross_val_score(df, vectorizer)

Accuracy: 0.7680 (+/- 0.0697)


# 🎉 Combining the good features 🎉

In [21]:
vectorizer = ColumnTransformer([
    ("body", CountVectorizer(), "body"),
    ("argcount", SimilarityScoreArgCount(), "argcount"),
    ("kwords", SimilarityScorerKeywords(), "kwords"),
    ("classes", SimilarityScorer(), "classes")
    
], remainder="passthrough")

df = pd.DataFrame({
    "body": dataset.data, 
    "argcount": dataset.data,
    "kwords": dataset.data,
    "classes": dataset.data,    
})

df = pd.concat([
    df, 
    left_token_series,
    right_token_series,
], axis=1)

evaluate_cross_val_score(df, vectorizer)

Accuracy: 0.8330 (+/- 0.0448)
