# Code comment classifiers
_Author: Sanket Mehrotra_

In this notebook, we fit a classifier to our labelled dataset of code comments, with the aim of classifying new comments based on their textual similarity to the labelled ones.

In [10]:
import pandas as pd
import numpy as np
from tqdm import tqdm

In [4]:
training_data = pd.read_csv("./labeldata.csv")

In [6]:
training_data

| Unnamed: 0 | comment | non-information |
|---|---|---|
| 0 | @implNote taken from {@link com.sun.javafx.sce... | yes |
| 1 | icon.setToolTipText(printedViewModel.getLocali... | yes |
| 2 | Synchronize changes of the underlying date val... | no |
| 3 | Ask if the user really wants to close the give... | yes |
| 4 | css: information * | no |
| ... | ... | ... |
| 12277 | /* (non-Javadoc) | yes |
| 12278 | // TODO Auto-generated method stub | yes |
| 12279 | /* (non-Javadoc) | yes |
| 12280 | /* (non-Javadoc) | yes |


We encode the comments with the widely used TF-IDF scores, in the hope that the similarity between comments can be measured by comparing the resulting vectors.
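To make the idea of comparing TF-IDF values concrete, here is a minimal sketch on a hypothetical toy corpus (the three comments below are illustrative and not taken from `labeldata.csv`). It assumes that cosine similarity is the comparison of interest, which the notebook itself does not state; `TfidfVectorizer` is used only as a shortcut for the `CountVectorizer` plus `TfidfTransformer` combination.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy comments, chosen only for illustration
toy_comments = [
    "// TODO Auto-generated method stub",
    "// TODO Auto-generated catch block",
    "// copy remote values to local entry",
]

# TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer
toy_tfidf = TfidfVectorizer().fit_transform(toy_comments)

# Pairwise cosine similarity between all rows (dense 3x3 array)
similarity = cosine_similarity(toy_tfidf)
print(similarity.round(2))
```

The two auto-generated comments share several tokens and therefore score higher with each other than with the third comment, which shares no tokens with them.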

In [69]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer

In [70]:
count_vect = CountVectorizer(decode_error='replace', encoding='utf-8')

In [71]:
X_train_counts = count_vect.fit_transform(training_data["comment"].values.astype(str))

In [72]:
X_train_counts.shape

(12282, 7809)

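A `CountVectorizer` simply counts how often each unique token in the dataset's vocabulary occurs in each comment, so `X_train_counts` has one row per comment (12282) and one column per vocabulary token (7809).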

In [73]:
count_vect.vocabulary_.get('synchronize')  #index of the word synchronize in the generated dictionary

6966
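The following small sketch shows what the count matrix and the `vocabulary_` mapping look like. The two comments and the `toy_` names are hypothetical; the vectorizer uses scikit-learn's default tokenisation (lowercasing, tokens of two or more word characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy comments, chosen only for illustration
toy_comments = ["// TODO fix fix this", "/* TODO: Nothing"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_comments)

# vocabulary_ maps each token to its column index in the count matrix
print(sorted(toy_vect.vocabulary_.items(), key=lambda kv: kv[1]))

# toarray() is fine for a toy matrix; the real matrix is large and sparse
print(toy_counts.toarray())
```

Each row is one comment and each column is one token, so the repeated word `fix` produces a count of 2 in the first row.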

In [74]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(12282, 7809)
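As a sanity check on what `TfidfTransformer` does to the counts, the sketch below recomputes the scores by hand on a hypothetical 3x3 count matrix. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is L2-normalised.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0]])

# Library result with default parameters
sk_tfidf = TfidfTransformer().fit_transform(counts).toarray()

# Manual computation under the same defaults
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf
manual = counts * idf                            # raw term count times idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise rows

print(np.allclose(sk_tfidf, manual))
```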

In [75]:
from sklearn.naive_bayes import ComplementNB
clf = ComplementNB().fit(X_train_tfidf, training_data['non-information'])

Now that the classifier is fitted, we can test it on a couple of hand-written comments.

In [76]:
docs_new = ['//Print()', '/* TODO: Nothing']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

In [77]:
predicted = clf.predict(X_new_tfidf)

In [78]:
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, category))

'//Print()' => no
'/* TODO: Nothing' => yes
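The count, TF-IDF and classification steps above always run in the same order, so they could also be chained in a scikit-learn `Pipeline`. This is only a sketch of that alternative, not what the notebook does: the training comments, their labels, and the step names (`counts`, `tfidf`, `clf`) are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set; labels mimic the 'non-information' column
toy_comments = [
    "// TODO Auto-generated method stub",
    "/* (non-Javadoc)",
    "// TODO Auto-generated catch block",
    "// copy remote values to local entry",
    "// retry the upload when the network times out",
    "// cache the parsed result to avoid reparsing",
]
toy_labels = ["yes", "yes", "yes", "no", "no", "no"]

# Each step's output feeds the next; predict() reuses the fitted vocabulary and idf
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", ComplementNB()),
])
text_clf.fit(toy_comments, toy_labels)

toy_predicted = text_clf.predict(["// TODO Auto-generated stub",
                                  "// copy values to the cache"])
print(list(toy_predicted))
```

With a pipeline, new comments are passed in as raw strings, which removes the risk of forgetting one of the two `transform` calls before `predict`.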


In [79]:
testing = pd.read_csv("jabref_single_comments0-1000.csv",header=None,usecols=[0],names=["comment"])

In [80]:
testing

| Unnamed: 0 | comment |
|---|---|
| 0 | // Not needed for connection, but stored for f... |
| 1 | // Some DBMS require a non-null value as a pas... |
| 2 | // Calling dbmsProcessor.setupSharedDatabase()... |
| 3 | // copy remote values to local entry |
| 4 | // in case entries should be added into the lo... |
| ... | ... |
| 1550 | //drugg.fgg.uni-lj.si/701/1/GEV_0199_Sajovic.p... |
| 1551 | //github.com/FXMisc/RichTextFX/blob/master/LIC... |
| 1552 | // does not accept a property, so this is usin... |
| 1553 | // see https://stackoverflow.com/questions/286... |


In [90]:
X_new_counts = count_vect.transform(testing["comment"].values.astype(str))
X_new_counts.shape

(1555, 7809)

In [91]:
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
X_new_tfidf.shape

(1555, 7809)

In [92]:
predicted = clf.predict(X_new_tfidf)
predicted

array(['no', 'no', 'no', ..., 'no', 'no', 'no'], dtype='<U3')

In [93]:
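# Note: iterating over the DataFrame `testing` yields its column labels, not its rows,
# so this loop pairs only the label 'comment' with the first prediction.
for doc, category in zip(testing, predicted):
    print('%r => %s' % (doc, category))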

'comment' => no


In [94]:
for d in testing:
    print(d)

comment
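The single line of output comes from how pandas iterates: looping over a DataFrame yields its column labels, so `zip` stops after one pair. A minimal sketch of the likely intended loop, using hypothetical stand-ins (`toy_testing`, `toy_predicted`) for `testing` and `predicted`, iterates over the `comment` column instead:

```python
import pandas as pd

# Hypothetical stand-ins for `testing` and `predicted`
toy_testing = pd.DataFrame({"comment": ["// copy remote values to local entry",
                                        "// TODO Auto-generated method stub"]})
toy_predicted = ["no", "yes"]

# Iterating over the DataFrame itself yields the column labels
print(list(toy_testing))

# Iterating over the column yields the comment strings, one per row
pairs = list(zip(toy_testing["comment"], toy_predicted))
for doc, category in pairs:
    print('%r => %s' % (doc, category))
```

Applied to the notebook's own variables, this would be `zip(testing["comment"], predicted)`.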


Ref: [Link](https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34)

In [None]:
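# Remove rows whose comment is missing; dropna(inplace=True) on a single column
# does not reliably modify the DataFrame, so reassign the filtered frame instead
training_data = training_data.dropna(subset=["comment"]).copy()
# Lowercase all comments with the vectorised string accessor
training_data["comment"] = training_data["comment"].str.lower()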
