## Import All the Things

In [1]:
import glob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd


In [2]:
languages = {'csharp':'C#', 'sbcl':'Common Lisp', 'clojure':'Clojure',
             'java':'Java', 'javascript':'Javascript', 'perl':'Perl',
             'py':'Python', 'jruby':'Ruby', 'php':'PHP', 'yarv':'Ruby',
             'hack':'PHP', 'racket':'Scheme', 'ml':'OCaml', 'mli':'OCaml',
             'python3':'Python', 'gcc':'C', 'c':'C', 'scala':'Scala',
             'ocaml':'OCaml', 'sc':'Scala', 'tcl':'Tcl'}

## Read in All the Things

In [3]:
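def file_compile():
    """Read every benchmark file and return (sources, labels) as NumPy arrays."""
    X = []
    y = []
    for fext, language in languages.items():
        # Match training files by extension, e.g. bench/*.py
        files = glob.glob('bench/*.{}'.format(fext))
        for file in files:
            # latin_1 maps every byte to a character, so reading never raises a decode error
            with open(file, encoding='latin_1') as f:
                X.append(f.read())
                y.append(language)
    return np.array(X), np.array(y)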

In [4]:
X, y = file_compile()

In [5]:
len(X)

552

In [6]:
len(y)

552

In [7]:
type(X)

numpy.ndarray

In [8]:
type(y)

numpy.ndarray

## Naive Bayes Classifier
#### Preferred classifier as seen on http://scikit-learn.org/stable/tutorial/machine_learning_map/

The pipeline below first vectorizes the input text into token counts, where a token is either a word of three or more letters or a run of special characters. It then feeds those counts to a Multinomial Naive Bayes classifier, which builds a model that predicts the class of out-of-sample input from the conditional probabilities estimated on the training data.
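To make the tokenization step concrete, here is a minimal sketch that applies the same regular expression with Python's `re` module. It assumes, as `CountVectorizer` does by default (`lowercase=True`), that the text is lowercased before the pattern is applied. The `tokenize` helper and the sample snippet are made up for illustration.

```python
import re

# Same pattern as the vectorizer in the pipeline:
# words of 3+ letters, or runs of characters that are neither word characters nor whitespace.
TOKEN_PATTERN = r'[a-zA-Z]{3,}|[^\w\d\s]+'


def tokenize(source):
    """Hypothetical helper: lowercase the text, then extract tokens with the pattern."""
    return re.findall(TOKEN_PATTERN, source.lower())


sample = '(defn -main [& args] (println "hi"))'
tokens = tokenize(sample)
print(tokens)
# "hi" is dropped because it is shorter than three letters,
# while punctuation runs such as "[&" survive as single tokens.
```

Keeping punctuation runs as tokens is what lets the classifier pick up on language-specific syntax such as parentheses, braces, and operators.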

In [9]:
bayes_pipe = Pipeline([('vectorizer', CountVectorizer(token_pattern=r'[a-zA-Z]{3,}|[^\w\d\s]+')),
                       ('multinom', MultinomialNB())])
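
The idea behind the classifier step can be sketched in a few lines of plain Python. This is not how scikit-learn implements `MultinomialNB` internally; it is a toy version of the same technique, assuming pre-tokenized documents and additive smoothing with `alpha=1.0`. The names `train_nb` and `predict_nb` and the two training documents are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Count documents per class and tokens per class (hypothetical helper)."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab, alpha


def predict_nb(model, tokens):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, token_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log P(class)
        total_tokens = sum(token_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # tokens never seen in training are ignored
            # Smoothed estimate of P(token | class)
            likelihood = (token_counts[label][tok] + alpha) / (total_tokens + alpha * len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration
toy_docs = [['def', 'print', ':'], ['defn', '(', '(']]
toy_labels = ['python', 'clojure']
toy_model = train_nb(toy_docs, toy_labels)
prediction = predict_nb(toy_model, ['defn', '('])
print(prediction)
```

Smoothing matters here: without it, a single token that never appeared in a class during training would drive that class's probability to zero.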

In [10]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score

In [11]:
scores = cross_val_score(bayes_pipe, X, y, cv=10, scoring='accuracy')
scores

array([ 0.82258065,  0.91666667,  0.98333333,  0.93103448,  0.875     ,
        0.86792453,  0.94230769,  0.96153846,  0.92      ,  0.95918367])

In [12]:
scores.mean()

0.91795694835373387

91.8% isn't bad. Let's now fit our pipeline on the whole training data set, then test it against the provided test data.
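
Before moving on, note that the mean hides how much the individual folds vary. The following sketch uses only the ten fold scores printed above and the standard library's `statistics` module to report the spread as well. The variable names are chosen for this example.

```python
import statistics

# The ten cross-validation scores as printed by the notebook
fold_scores = [0.82258065, 0.91666667, 0.98333333, 0.93103448, 0.875,
               0.86792453, 0.94230769, 0.96153846, 0.92, 0.95918367]

mean_score = statistics.mean(fold_scores)
spread = statistics.stdev(fold_scores)  # sample standard deviation across folds

print('mean accuracy:    {:.1%}'.format(mean_score))
print('std across folds: {:.1%}'.format(spread))
print('worst fold: {:.1%}, best fold: {:.1%}'.format(min(fold_scores), max(fold_scores)))
```

With only a few hundred files, a single fold holds roughly 55 of them, so a handful of misclassifications moves a fold's score by several points.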

## Testing the Classifier

In [13]:
bayes_pipe.fit(X, y)

Pipeline(steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='[a-zA-Z]{3,}|[^\\w\\d\\s]+',
        tokenizer=None, vocabulary=None)), ('multinom', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [14]:
testdf = pd.read_csv('test.csv', names=['index', 'y_test'])

In [15]:
X_test = []

for number in testdf['index']:
    with open('test/{}'.format(number), encoding='latin_1') as f:
        X_test.append(f.read())

testdf['X_test'] = X_test

In [16]:
testdf.head()

|   | index | y_test | X_test |
|---|-------|--------|--------|
| 0 | 1 | clojure | `(defn cf-settings\n "Setup settings for campf...` |
| 1 | 2 | clojure | `(ns my-cli.core)\n\n(defn -main [& args]\n (p...` |
| 2 | 3 | clojure | `(extend-type String\n Person\n (first-name [...` |
| 3 | 4 | clojure | `(require '[overtone.live :as overtone])\n\n(de...` |
| 4 | 5 | python | `from pkgutil import iter_modules\nfrom subproc...` |


In [17]:
y_pred = bayes_pipe.predict(testdf.X_test)
y_pred = np.array([element.lower() for element in y_pred])

In [18]:
from sklearn.metrics import accuracy_score

In [19]:
accuracy_score(testdf.y_test, y_pred)

0.75

It appears that our classifier was only 75% accurate on the test data. Let's see where we missed the mark.

In [29]:
comparison = pd.DataFrame({'y_test':testdf.y_test, 'y_pred':y_pred})

In [35]:
comparison[comparison['y_test'] != comparison['y_pred']]

|    | y_pred | y_test |
|----|--------|--------|
| 10 | scala | javascript |
| 11 | scala | javascript |
| 15 | scala | haskell |
| 16 | ocaml | haskell |
| 17 | ocaml | haskell |
| 22 | javascript | java |
| 25 | php | tcl |
| 26 | perl | tcl |


Well, there's something to notice here: there were no 'tcl' or 'haskell' files in our training data, so the classifier could never have labeled those files correctly. Those 5 'errors' (3 'haskell' and 2 'tcl') account for most of the 8 misses, so accuracy on the languages we actually trained on should be noticeably higher.
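
A check like this can be automated rather than spotted by eye. The sketch below compares the set of training labels with the labels in the test set. The training label list is based on the values of the `languages` mapping with 'Tcl' left out, on the assumption (taken from the text above) that no `.tcl` files were found in `bench/`. The test labels are the `y_test` values of the misclassified rows, and `unseen_labels` is a name made up for this example.

```python
from collections import Counter


def unseen_labels(train_labels, test_labels):
    """Return test labels that never occur in the training labels (case-insensitive)."""
    seen = {label.lower() for label in train_labels}
    return sorted(set(test_labels) - seen)


# Assumed training labels: values of the `languages` mapping, without 'Tcl'
train_labels = ['C#', 'Common Lisp', 'Clojure', 'Java', 'Javascript', 'Perl',
                'Python', 'Ruby', 'PHP', 'Scheme', 'OCaml', 'C', 'Scala']

# y_test values of the misclassified rows shown above
missed_y_test = ['javascript', 'javascript', 'haskell', 'haskell', 'haskell',
                 'java', 'tcl', 'tcl']

unseen = unseen_labels(train_labels, missed_y_test)
unseen_file_count = sum(Counter(missed_y_test)[label] for label in unseen)
print(unseen, unseen_file_count)
```

In the notebook itself the same comparison could be run on `y` and `testdf.y_test` directly.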

In [36]:
len(comparison.y_test)

32

In [37]:
.75 * 32

24.0

In [38]:
24 / 27

0.8888888888888888

So there are 32 files to test in total. If 75% of those files were correctly identified, then 24 were correct and 8 were incorrect. If we ignore the 5 files in our test data that were written in languages our classifier was not trained on, a fairer measure of accuracy is 24/27, or 88.9%.
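
The same adjustment can be written as a small function instead of hand arithmetic. This is a sketch under one stated assumption: the notebook only lists the 8 misclassified rows, so the 24 correctly classified rows are represented by a placeholder label ('clojure') purely to reproduce the counts. `accuracy_excluding` is a hypothetical helper name.

```python
def accuracy_excluding(y_true, y_pred, excluded=()):
    """Accuracy over the rows whose true label is not in `excluded`."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t not in excluded]
    return sum(t == p for t, p in kept) / len(kept)


# (y_test, y_pred) pairs for the 8 misclassified rows shown earlier
mismatches = [('javascript', 'scala'), ('javascript', 'scala'),
              ('haskell', 'scala'), ('haskell', 'ocaml'), ('haskell', 'ocaml'),
              ('java', 'javascript'), ('tcl', 'php'), ('tcl', 'perl')]

# Placeholder rows: the real labels of the 24 correct predictions are not listed
correct_rows = [('clojure', 'clojure')] * 24

rows = mismatches + correct_rows
y_true = [t for t, _ in rows]
y_hat = [p for _, p in rows]

raw_accuracy = accuracy_excluding(y_true, y_hat)
adjusted_accuracy = accuracy_excluding(y_true, y_hat, excluded={'haskell', 'tcl'})
print('raw: {:.1%}, adjusted: {:.1%}'.format(raw_accuracy, adjusted_accuracy))
```

Filtering on the true label keeps the comparison fair: only files the classifier had no chance of labeling correctly are removed, and the remaining three misses still count against it.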