# Evaluation 

The goal of this session is to wrap up the current approach and evaluate it. I felt like I could still make some improvements to it before moving on to another approach. 

1. Will look for some low-hanging fruit.
2. Will try to understand the shortcomings of the final version.
3. Will evaluate current approach with metrics.

##### Data 

Gotta load it in first. Remember that you can find this dataset [here](https://www.kaggle.com/stackoverflow/stacksample).

In [9]:
import pandas as pd

df = (pd.read_csv("Questions.csv", nrows=2_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

# Approach

Here's our `spacy.matcher.Matcher` object.

In [10]:
import spacy 
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

In [11]:
obj_c_pattern1 = [{'LOWER': 'objective'},
                  {'IS_PUNCT': True, 'OP': '?'},
                  {'LOWER': 'c'}]
obj_c_pattern2 = [{'LOWER': 'objectivec'}]
python_pattern = [{'LOWER': 'python'}]
go_pattern1    = [{'LOWER': 'go', 'POS': {'NOT_IN': ['VERB']}}]
go_pattern2    = [{'LOWER': 'golang'}]
ruby_pattern   = [{'LOWER': 'ruby'}]
js_pattern     = [{'LOWER': {'IN': ['js', 'javascript']}}]

matcher = Matcher(nlp.vocab, validate=True)
matcher.add("OBJ_C_LANG", None, obj_c_pattern1, obj_c_pattern2)
matcher.add("PYTHON_LANG", None, python_pattern)
matcher.add("GO_LANG", None, go_pattern1, go_pattern2)
matcher.add("JS_LANG", None, js_pattern)
matcher.add("RUBY_LANG", None, ruby_pattern)

Below is a bit of custom code that will highlight the output of the matcher.

In [1]:
from IPython.display import HTML as html_print

def style(s, bold=False):
    blob = f"<text>{s}</text>"
    if bold:
        blob = f"<b style='background-color: #fff59d'>{blob}</b>"
    return blob

def html_generator(g, n=10):
    blob = ""
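    # zip stops quietly if the generator yields fewer than n docs
    for _, doc in zip(range(n), g):
        # one [token, is_matched] pair per token
        state = [[t, False] for t in doc]
        for _match_id, start, end in matcher(doc):
            # separate index name so the outer loop variable is not shadowed
            for j in range(start, end):
                state[j][1] = True
        blob += style(' '.join([style(str(t[0]), bold=t[1]) for t in state]) + '<br>') 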
    return blob
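
The flagging step inside `html_generator` can be tried without spaCy. The sketch below uses a hand-written token list and a made-up `(match_id, start, end)` triple standing in for the matcher output; `flag_tokens` is a hypothetical helper name, not something from this notebook.

```python
def flag_tokens(tokens, matches):
    """Return [token, is_matched] pairs; `end` is exclusive, like a slice."""
    state = [[tok, False] for tok in tokens]
    for _match_id, start, end in matches:
        for j in range(start, end):
            state[j][1] = True
    return state

# Hand-written stand-ins, not real spaCy output.
tokens = ["How", "to", "sort", "in", "objective", "-", "c", "?"]
matches = [("OBJ_C_LANG", 4, 7)]

flagged = flag_tokens(tokens, matches)
print([tok for tok, hit in flagged if hit])
```

Overlapping matches are harmless here: a token that is covered twice simply stays flagged.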

In [13]:
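def it_matches(doc):  # not defined in the cells shown; assumed to mean "at least one match"
    return len(matcher(doc)) > 0

titles = (_ for _ in df['Title'])
g = (d for d in nlp.pipe(titles) if it_matches(d))
html_print(html_generator(g, n=10))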

### Results of Evaluation 1

I used this function to quickly loop over a few examples to see when we were doing something right and when we were doing something wrong. 

By looking at items where nothing is matched we get a gentle reminder of what we should add to the matcher. By repeating this a few times I've been able to add a few languages: **C#**, **F#**, **Perl**, **SQL** and **.NET** as well as a few others.

In [14]:
import spacy 
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

In [15]:
obj_c_pattern1 = [{'LOWER': 'objective'},
                  {'IS_PUNCT': True, 'OP': '?'},
                  {'LOWER': 'c'}]
obj_c_pattern2 = [{'LOWER': 'objectivec'}]

csharp_pattern1 = [{'LOWER': 'c'}, {'LOWER': '#'}]
csharp_pattern2 = [{'LOWER': 'c'}, {'LOWER': 'sharp'}]
csharp_pattern3 = [{'LOWER': 'c#'}]

fsharp_pattern1 = [{'LOWER': 'f'}, {'LOWER': '#'}]
fsharp_pattern2 = [{'LOWER': 'f'}, {'LOWER': 'sharp'}]
fsharp_pattern3 = [{'LOWER': 'f#'}]
 
dot_net_pattern = [{'LOWER': '.net'}]

php_pattern = [{'LOWER': 'php'}]

asp_net_pattern = [{'LOWER': {'IN': ['asp.net', 'asp']}}]

python_pattern = [{'LOWER': 'python'}]

lisp_pattern1  = [{'LOWER': 'lisp'}]
lisp_pattern2  = [{'LOWER': 'common'}, {'LOWER': 'lisp'}]

go_pattern1    = [{'LOWER': 'go', 'POS': {'NOT_IN': ['VERB']}}]
go_pattern2    = [{'LOWER': 'golang'}]

ruby_pattern   = [{'LOWER': 'ruby'}]

sql_pattern    = [{'LOWER': 'sql'}]

matlab_pattern = [{'LOWER': 'matlab'}]

perl_pattern   = [{'LOWER': 'perl'}]

html_pattern   = [{'LOWER': 'html'}]

css_pattern   = [{'LOWER': 'css'}]

js_pattern     = [{'LOWER': {'IN': ['js', 'javascript']}}]

java_pattern   = [{'LOWER': 'java'}]

c_pattern      = [{'LOWER': 'c'}]

cpp_pattern    = [{'LOWER': 'c++'}]

matcher = Matcher(nlp.vocab, validate=True)
matcher.add("OBJ_C_LANG", None, obj_c_pattern1, obj_c_pattern2)
matcher.add("PYTHON_LANG", None, python_pattern)
matcher.add("GO_LANG", None, go_pattern1, go_pattern2)
matcher.add("CSHARP_LANG", None, csharp_pattern1, csharp_pattern2, csharp_pattern3)
matcher.add("FSHARP_LANG", None, fsharp_pattern1, fsharp_pattern2, fsharp_pattern3)
matcher.add("JS_LANG", None, js_pattern)
matcher.add("JAVA_LANG", None, java_pattern)
matcher.add("RUBY_LANG", None, ruby_pattern)
matcher.add("SQL_LANG", None, sql_pattern)
matcher.add("C_LANG", None, c_pattern)
matcher.add("CPP_LANG", None, cpp_pattern)
matcher.add("PHP_LANG", None, php_pattern)
matcher.add("MATLAB_LANG", None, matlab_pattern)
matcher.add("PERL_LANG", None, perl_pattern)
matcher.add("LISP_LANG", None, lisp_pattern1, lisp_pattern2)
matcher.add("HTML_LANG", None, html_pattern)
matcher.add("CSS_LANG", None, css_pattern)

## Actual Labelling

In [16]:
titles = (_ for _ in df['Title'][:2000])
sum(1 for d in nlp.pipe(titles) if it_matches(d))

510

- I got about 500 labels in 10 minutes while drinking tea. The timing suggests I should be able to do about 2000 in 30-45 minutes. This is so worth **my** time. No need to wait around and have somebody else label for me. 
- I also think labelling myself is useful. Labelling isn't super easy per se and it helps me understand the problem a bit better. 
    - I thought that `.NET` was a programming language but after seeing so many examples during labelling where it said `.NET ASP` and `C# .NET` I figured I'd just check what kind of a language it was. It turns out that it is actually a framework and *not* a language. This was a cool lesson.
    - Some programming languages have a version number attached. Like `python3`.
    - Technically my labelling approach is a bit janky since there can be multiple programming languages in a single title. Made a note of this mentally.
    - Be sure to handle labelled files with care. Maybe give them different names to reduce the risk of overwriting one. Don't want to lose valuable labels. Also, it helps to turn the file into a tab-separated one now, because many of the titles contain commas and Excel doesn't always play nice with that.
- I figured that after 500 rows it'd be a good idea to check the evaluation so far, just to see if there's anything in there that I should look into.
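
The two file-handling notes above (tab-separated output, no accidental overwrites) could look roughly like this with the standard library. The file name pattern, the `save_labels` helper and the example rows are all made up for illustration; the notebook itself does not show how the labelled file was written.

```python
import csv
import tempfile
from pathlib import Path

def save_labels(rows, directory, stem="have_label"):
    """Write (label, title) rows to a new tab-separated file, never overwriting."""
    directory = Path(directory)
    version = 1
    while True:
        path = directory / f"{stem}_v{version}.txt"
        try:
            # mode "x" raises FileExistsError instead of overwriting
            with open(path, "x", newline="", encoding="utf-8") as f:
                writer = csv.writer(f, delimiter="\t")
                writer.writerow(["Label", "Title"])
                writer.writerows(rows)
            return path
        except FileExistsError:
            version += 1

# Invented example rows; both titles contain a comma on purpose.
rows = [(1, "Python, pandas: how to read a csv?"), (0, "Why is my build slow, again?")]
with tempfile.TemporaryDirectory() as tmp:
    first = save_labels(rows, tmp)
    second = save_labels(rows, tmp)   # gets a new name, the first file is untouched
    with open(first, newline="", encoding="utf-8") as f:
        loaded = list(csv.reader(f, delimiter="\t"))
    print(first.name, second.name, loaded[1])
```

With a tab as delimiter the commas inside the titles need no quoting, which is the part that tends to trip up spreadsheet tools.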

## Language Version 

To address the version numbers I've made a function. It creates multiple patterns out of a single language name.

In [17]:
def create_versioned(name):
    # three variants: bare name, name glued to a version, name followed by a version token
    return [
        [{'LOWER': name}], 
        [{'LOWER': {'REGEX': rf'({name}\d+\.?\d*\.?\d*)'}}], 
        [{'LOWER': name}, {'TEXT': {'REGEX': r'(\d+\.?\d*\.?\d*)'}}],
    ]

create_versioned('python')

[[{'LOWER': 'python'}],
 [{'LOWER': {'REGEX': '(python\\d+\\.?\\d*\\.?\\d*)'}}],
 [{'LOWER': 'python'}, {'TEXT': {'REGEX': '(\\d+\\.?\\d*\\.?\\d*)'}}]]

You can confirm that it works.

In [18]:
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("PYTHON_LANG", None, *create_versioned('python'))
g = nlp.pipe(["i use python, python3.7, python 3.6.6", 
              "also python3, python 2 and python3.2.1", 
              "not bypython, pythonn and py36"])
html_print(html_generator(g, n=3))
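
The two regular expressions can also be checked on their own with `re`, without loading a model. The token lists below are split by hand and only approximate what spaCy's tokenizer would produce, and using `re.search` reflects how I understand the `REGEX` operator to behave; treat both as assumptions. The dots are escaped here so that they only match a literal `.`, and `is_versioned` is a made-up helper.

```python
import re

name = "python"
glued = re.compile(rf"{name}\d+\.?\d*\.?\d*")   # e.g. python3.7 as a single token
separate = re.compile(r"\d+\.?\d*\.?\d*")       # e.g. the 3.6.6 in "python 3.6.6"

def is_versioned(tokens):
    """True if a token, or a name followed by a version token, looks versioned."""
    lowered = [t.lower() for t in tokens]
    for i, tok in enumerate(lowered):
        if glued.search(tok):
            return True
        if tok == name and i + 1 < len(lowered) and separate.search(lowered[i + 1]):
            return True
    return False

print(is_versioned(["i", "use", "python3.7"]))
print(is_versioned(["python", "3.6.6"]))
print(is_versioned(["pythonn", "and", "py36"]))
```

Because `search` is not anchored, a token such as `bypython3` also counts as a hit in this sketch. Anchoring the expression with `^` and `$` would be one way to tighten that if it turns out to matter.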

# Putting this in a new matcher.

Note that I am creating **lots** of patterns here. The downside is that it looks a bit chaotic but the benefit is that it is very easy to add a pattern. It might be good to refactor later on though.

In [19]:
obj_c_pattern1 = [{'LOWER': 'objective'},
                  {'IS_PUNCT': True, 'OP': '?'},
                  {'LOWER': 'c'}]
obj_c_pattern2 = [{'LOWER': 'objectivec'}]

csharp_pattern1 = [{'LOWER': 'c'}, {'LOWER': '#'}]
csharp_pattern2 = [{'LOWER': 'c'}, {'LOWER': 'sharp'}]
csharp_pattern3 = [{'LOWER': 'c#'}]

fsharp_pattern1 = [{'LOWER': 'f'}, {'LOWER': '#'}]
fsharp_pattern2 = [{'LOWER': 'f'}, {'LOWER': 'sharp'}]
fsharp_pattern3 = [{'LOWER': 'f#'}]

lisp_pattern1  = [{'LOWER': 'lisp'}]
lisp_pattern2  = [{'LOWER': 'common'}, {'LOWER': 'lisp'}]

go_pattern1    = [{'LOWER': 'go', 'POS': {'NOT_IN': ['VERB']}}]
go_pattern2    = [{'LOWER': 'golang'}]

html_pattern   = [{'LOWER': 'html'}]
css_pattern    = [{'LOWER': 'css'}]
sql_pattern    = [{'LOWER': 'sql'}]
js_pattern     = [{'LOWER': {'IN': ['js', 'javascript']}}]

cpp_pattern    = [{'LOWER': 'c++'}]


versioned_languages = ['ruby', 'php', 'python', 'perl', 'java', 'haskell', 
                       'scala', 'c', 'cpp', 'matlab', 'bash', 'delphi']
flatten = lambda l: [item for sublist in l for item in sublist]
versioned_patterns = flatten([create_versioned(lang) for lang in versioned_languages])

matcher = Matcher(nlp.vocab, validate=True)
matcher.add("PROG_LANG", None, 
            obj_c_pattern1, obj_c_pattern2,
            go_pattern1, go_pattern2,
            lisp_pattern1, lisp_pattern2,
            csharp_pattern1, csharp_pattern2, csharp_pattern3,
            fsharp_pattern1, fsharp_pattern2, fsharp_pattern3,
            html_pattern, css_pattern, sql_pattern, js_pattern,
            cpp_pattern, *versioned_patterns)
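
One way the refactor mentioned above might go: keep the single-token languages in a plain list and generate their patterns, so that only the multi-token cases are spelled out. This is only a sketch; `single_token_languages`, `single_token_patterns`, `sharp_patterns` and `all_patterns` are names I made up, and since it only builds plain Python lists it runs without spaCy.

```python
# Languages matched by one lowercase token (taken from the cell above).
single_token_languages = ['objectivec', 'golang', 'lisp', 'html', 'css',
                          'sql', 'c++', 'c#', 'f#']

# One pattern per language, plus the existing IN-pattern for JavaScript.
single_token_patterns = [[{'LOWER': lang}] for lang in single_token_languages]
single_token_patterns.append([{'LOWER': {'IN': ['js', 'javascript']}}])

# The two-token "c #" / "c sharp" / "f #" / "f sharp" variants.
sharp_patterns = [[{'LOWER': letter}, {'LOWER': suffix}]
                  for letter in ('c', 'f')
                  for suffix in ('#', 'sharp')]

all_patterns = single_token_patterns + sharp_patterns
print(len(all_patterns))
```

The resulting list could then be unpacked into the `matcher.add("PROG_LANG", None, ...)` call next to the patterns that still need to be written by hand, such as the Objective-C and Go ones.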

# On to evaluating

I am using `sklearn` (an amazing library) so that I don't have to write boilerplate code.

In [55]:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

label_df = (pd.read_csv('have_label.txt', delimiter="\t")[['Label', 'Title']]
 .iloc[:968]
 # Pred is True when the matcher finds at least one match in the title
 .assign(Pred=lambda d: [len(matcher(doc)) > 0 for doc in nlp.pipe(d['Title'])])
 .assign(Label=lambda d: d['Label'].astype(np.int8))
 .assign(Pred=lambda d: d['Pred'].astype(np.int8)))

This is the confusion matrix.

<img src="images/confusion.png" width=400>

In [56]:
confusion_matrix(label_df['Label'], label_df['Pred'])

array([[711,  30],
       [  2, 225]])

### Before

```
              precision    recall  f1-score   support

           0       1.00      0.95      0.97       396
           1       0.83      0.99      0.91       106

    accuracy                           0.96       502
   macro avg       0.92      0.97      0.94       502
weighted avg       0.96      0.96      0.96       502
```

### After 

The improvements seem to have had an effect.

In [58]:
print(classification_report(label_df['Label'], label_df['Pred']))

              precision    recall  f1-score   support

           0       1.00      0.96      0.98       741
           1       0.88      0.99      0.93       227

    accuracy                           0.97       968
   macro avg       0.94      0.98      0.96       968
weighted avg       0.97      0.97      0.97       968



This summary lists two important metrics: precision and recall.

<img src="images/prerec.png" width=500>

There's another one listed, the f1-score, which is the harmonic mean of the two.

<img src="images/f1.png" width=500>
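
As a sanity check, the numbers for class `1` in the report can be recomputed by hand from the confusion matrix shown earlier. In sklearn's layout the rows are the true labels and the columns the predictions, so the four counts unpack as below; the variable names are mine.

```python
# Counts copied from the confusion matrix: rows = true label, columns = prediction.
tn, fp = 711, 30
fn, tp = 2, 225

precision = tp / (tp + fp)   # share of predicted 1s that are labelled 1
recall = tp / (tp + fn)      # share of labelled 1s that were predicted as 1
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

These round to the 0.88, 0.99 and 0.93 listed for class `1` in the report.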

A few extra notes:

- **support** is merely the number of true examples in each class
- **macro avg** is just the plain (unweighted) average of a metric over the two classes
- **weighted avg** is the average of a metric over the two classes, weighted by support
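
The two averages can be reproduced in the same hand-rolled way. The sketch below does it for precision only, using per-class counts taken from the confusion matrix; the variable names are mine.

```python
# Counts copied from the confusion matrix: rows = true label, columns = prediction.
tn, fp, fn, tp = 711, 30, 2, 225

precision = {0: tn / (tn + fn), 1: tp / (tp + fp)}   # per-class precision
support = {0: tn + fp, 1: fn + tp}                   # true examples per class
total = sum(support.values())

macro_avg = sum(precision.values()) / len(precision)
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(round(macro_avg, 2), round(weighted_avg, 2))
```

This gives 0.94 and 0.97, matching the precision column of the report. Because class `0` is roughly three times larger, the weighted average sits closer to that class's score.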

In [50]:
mistakes = (label_df
            .loc[lambda d: d['Pred'] == 1]
            .loc[lambda d: d['Label'] == 0]
            ['Title'])

html_print(html_generator((nlp(i) for i in mistakes), n=21))

This `SQL Server` stuff is something to keep in mind. Still, I am happy with the lessons learned so far. 