# Final Project for MP

**(d)** (N points) Let's add some extra features. Use features that you believe will help detect code clones. (For example, counts of Java keywords: for a given Java function, how many times does 'if' occur, how many times does 'try' occur, and so on.) Feel free to design your own features.

Here are several steps you might take to add the new features (you are welcome to use a different approach):
* Write a function that, given a code snippet, returns its features.
* Write a function to create a dataframe for the features you designed.
* Write a function that uses ``ColumnTransformer`` to combine the unigram features with the features you designed.
* Re-train your logistic regression model over the combined features.

What is the 10-fold cross-validation accuracy? (Make sure to use the same folds as before, so that the results are comparable!)
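
The first step in the list above (a function that maps a code snippet to features) could look roughly like the sketch below. The name `keyword_counts`, the `kw_` column prefix, and the short keyword list are illustrative assumptions, not part of the assignment. The tokenizer is deliberately crude: it also counts keywords that appear inside comments and string literals.

```python
import re
from collections import Counter

# Illustrative subset of Java keywords (an assumption; extend as needed).
JAVA_KEYWORDS = ("if", "else", "for", "while", "try", "catch", "return", "new")

# Matches identifier-like tokens, so "iffy" is not counted as "if".
_TOKEN = re.compile(r"[A-Za-z_]\w*")


def keyword_counts(code):
    """Return a dict such as {'kw_if': 2, ...} for one snippet (str or bytes)."""
    if isinstance(code, bytes):
        code = code.decode("utf-8", errors="replace")
    counts = Counter(_TOKEN.findall(code))
    return {f"kw_{kw}": counts[kw] for kw in JAVA_KEYWORDS}
```

Because each call returns a flat dict with the same keys, a list of these dicts can be passed directly to `pd.DataFrame` to build one numeric column per keyword.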
 
**(e)** (N points) Finally, some analysis. We have now trained two models: LR with unigrams and bigrams, and LR with unigrams and the features you designed. This setup lets us compare the power of the two kinds of added features (bigrams versus the features you designed). Which is the better feature for the code clone detection task? Explain why this may be true.
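
One way to make this comparison concrete is to score both models on an explicitly shared fold object, as in the sketch below. It assumes a dataframe with a text column named `body` plus numeric feature columns; the function name `compare_on_same_folds` and the `feature_cols` parameter are hypothetical. Wrapping the vectorizer in a pipeline is a design choice made here, not something the assignment requires: it means the vocabulary is learned from the training folds only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def compare_on_same_folds(df, y, feature_cols, n_splits=10):
    """Return per-fold accuracies for (unigram+bigram LR, unigram+features LR)."""
    # Without shuffling, the splits are deterministic and shared by both models.
    folds = StratifiedKFold(n_splits=n_splits)

    bigram_model = make_pipeline(
        ColumnTransformer([("body", CountVectorizer(ngram_range=(1, 2)), "body")]),
        LogisticRegression(random_state=3, solver="liblinear"),
    )
    feature_model = make_pipeline(
        # Unigram counts for the text; numeric feature columns pass through as-is.
        ColumnTransformer([("body", CountVectorizer(), "body")], remainder="passthrough"),
        LogisticRegression(random_state=3, solver="liblinear"),
    )

    bigram_scores = cross_val_score(bigram_model, df[["body"]], y, cv=folds)
    feature_scores = cross_val_score(
        feature_model, df[["body"] + list(feature_cols)], y, cv=folds
    )
    return bigram_scores, feature_scores
```

Since both score arrays come from the same folds, they can be compared fold by fold (for example, `feature_scores - bigram_scores`) rather than only through their means.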

## LR with unigrams and bigrams

In [1]:
# Imports

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

In [2]:
# Load dataset

dataset = load_files("code-clone/")

In [11]:
lr_model = LogisticRegression(random_state=3, solver='liblinear')

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(dataset.data)
y_train = dataset.target

lr_cross_val_score = cross_val_score(lr_model, X_train, y_train, cv=10)

In [13]:
if lr_cross_val_score is not None:
    print("Accuracy for Logistic Regression: %0.4f (+/- %0.4f)" % (lr_cross_val_score.mean(), lr_cross_val_score.std() * 2))

Accuracy for Logistic Regression: 0.7565 (+/- 0.0663)


## LR with extra features

### Using w2v embeddings

In [40]:
# Imports

import pandas as pd
from sklearn.compose import ColumnTransformer

In [41]:
left_embeddings = load_files('code-clone-embeddings-left', encoding="utf-8")
right_embeddings = load_files('code-clone-embeddings-right', encoding="utf-8")

In [53]:
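def rename_cols(prefix):
    # Map integer column labels 0..127 to names such as 'l0' or 'r127'.
    return {i: f'{prefix}{i}' for i in range(128)}

# Row labels: position -> source file name.
index = dict(enumerate(dataset.filenames))

# Each embedding file holds one comma-separated vector; split it into float columns.
# This assumes all three load_files calls return samples in the same order.
left_raw = pd.Series(left_embeddings.data, dtype='str', name='left')
left = left_raw.str.split(',', expand=True)
left = left.rename(columns=rename_cols('l')).astype('float')

right_raw = pd.Series(right_embeddings.data, dtype='str', name='right')
right = right_raw.str.split(',', expand=True)
right = right.rename(columns=rename_cols('r')).astype('float')

body = pd.Series(dataset.data, name='body')

df = pd.concat([body, left, right], axis=1).rename(index=index)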

In [66]:
print(df.head(1).T)

                   code-clone/neg/19035955-23162862.txt
body  b'public static void main(String[] args) throw...
l0                                            -0.051221
l1                                            -0.122162
l2                                             0.101226
l3                                            -0.021237
...                                                 ...
r123                                          -0.049509
r124                                          -0.022985
r125                                           0.131543
r126                                           0.082626
r127                                           0.082103

[257 rows x 1 columns]


In [67]:
vectorizer = ColumnTransformer([
    ("body", CountVectorizer(), "body"),
], remainder='passthrough')

In [68]:
lr_model = LogisticRegression(random_state=3, solver='liblinear')

X_train = vectorizer.fit_transform(df)
y_train = dataset.target

lr_cross_val_score = cross_val_score(lr_model, X_train, y_train, cv=10)

In [69]:
if lr_cross_val_score is not None:
    print("Accuracy for Logistic Regression: %0.4f (+/- %0.4f)" % (lr_cross_val_score.mean(), lr_cross_val_score.std() * 2))

Accuracy for Logistic Regression: 0.7520 (+/- 0.0552)
