# Prediction Algorithms | Verb Prediction

***Please find the assignments to submit following the sample code.***

This notebook contains working code as the basis for your assignment. It fetches and loads the necessary data you will work on during this assignment, and demonstrates all the initial steps.

## Prerequisites
+ **Python 3.6.x.** You may install Python via the [Anaconda distribution](https://www.anaconda.com/distribution/#download-section), which also installs many useful Python libraries automatically, but this is not mandatory. A library must be _installed_ before a notebook or Python file can import it, so having the libraries used here installed is a prerequisite to running this notebook. Anaconda installs most or all of the libraries imported by this notebook out of the box. The latest version of Anaconda comes with Python 3.7, so you will have to look for the previous version of Anaconda.
<br><Br>
+ Python packages used in this notebook should be installed. <br><br>
    + To install any missing or helpful Python packages with Anaconda, you would typically issue the command given below. This command assumes that the package you are after has been made publicly available on conda-forge, a community-maintained conda channel. The `conda` command is the package manager that ships with the Anaconda distribution. If you installed Anaconda, this command should be available in your command prompt.
```
conda install -c conda-forge <package name>
```
    + Further setup notes:
         + When run, the `conda install` command may take a while to resolve all dependencies.
         + Some packages are not available through Anaconda, in which case use the `pip` utility to install the missing packages into your Python environment.
         + You are entirely free to install Python and the necessary packages in other ways.
<br><br>
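
Before running the notebook, it can help to check which of the required packages are still missing. The following is a minimal sketch rather than part of the assignment code; the helper name `missing_packages` is hypothetical, and the module list is simply the set of top-level modules imported by this notebook.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Top-level modules imported by this notebook.
required = ["tqdm", "pyconll", "sklearn", "pandas"]

# An empty list means every required module can be imported.
print(missing_packages(required))
```

Any name that is printed still needs to be installed with `conda` or `pip`.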

## Getting Started with this notebook
+ Follow the cells and their outputs to confirm you understand what they are doing. 
+ Use it as a learning opportunity about idiomatic python, if you are relatively new to python!
+ Make a local copy of the notebook, <br>and make sure it runs on your machine, with outputs similar to those already appearing in this pre-run copy.

### Imports

In [1]:
from tqdm import tqdm_notebook # progress bars
import pyconll # library parsing CoNLL-U files
from pprint import pprint # slightly nicer printing of data structures
from sklearn.metrics import confusion_matrix
import random
import itertools 
import sklearn.metrics

## Downloading and loading the HTB corpus
To future-proof, we standardize on a specific version of the HTB. <br>Those with a keen eye will notice we use here the quick-and-dirty ⚡ way of launching OS commands directly from within the notebook. <br>⚙ Make sure you have git installed and working on your OS before proceeding. The `%%script false` line at the top of the cell keeps it from running; remove that line the first time you need to fetch the data.

In [2]:
%%script false
!git clone https://github.com/UniversalDependencies/UD_Hebrew-HTB
!cd UD_Hebrew-HTB && git checkout 82591c955e86222e32531336ff23e36c220b5846

In [3]:
conllu1 = pyconll.load_from_file('UD_Hebrew-HTB/he_htb-ud-dev.conllu')
conllu2 = pyconll.load_from_file('UD_Hebrew-HTB/he_htb-ud-test.conllu')
conllu3 = pyconll.load_from_file('UD_Hebrew-HTB/he_htb-ud-train.conllu')

conllu = [conllu1, conllu2, conllu3]

## Quick data exploration
Let's quantify how many verbs there are per sentence.

In [4]:
counts = []
for sentence in itertools.chain(*conllu): # the asterisk unpacks the list into separate arguments
    verbs = 0
    for token in sentence:
        if token.upos == 'VERB':
            verbs += 1
    counts.append(verbs)

In [5]:
import pandas as pd
counts = pd.Series(counts)
counts.value_counts().sort_index()

0      511
1     1673
2     1748
3     1099
4      604
5      321
6      140
7       66
8       26
9       15
10       6
11       3
12       3
15       1
dtype: int64

In [6]:
verbs = {}
non_verbs = {}

# note: conllu chains the dev, test and train files, so the counts below cover all three
for sentence in itertools.chain(*conllu):
    for token in sentence:
        # count each surface form, starting from 1 on its first occurrence
        if token.upos == 'VERB':
            verbs[token.form] = verbs.get(token.form, 0) + 1
        else:
            non_verbs[token.form] = non_verbs.get(token.form, 0) + 1

    
print('{:,} unique verbs in training data'.format(len(verbs)))
print('{:,} unique non-verbs in training data'.format(len(non_verbs)))

ambiguous = set(verbs.keys()) & set(non_verbs.keys())

print('{:,} words are ambiguous'.format(len(ambiguous)))
print()
#print('ambiguous words:\n' + str(ambiguous))

5,528 unique verbs in training data
28,205 unique non-verbs in training data
632 words are ambiguous
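
The same tally can be written more compactly with `collections.Counter`. The sketch below is an illustration only: the `(form, upos)` pairs in `toy_tokens` are hypothetical stand-ins for the tokens that pyconll yields, not data from the HTB.

```python
from collections import Counter

# Hypothetical (form, upos) pairs standing in for tokens read by pyconll.
toy_tokens = [
    ("run", "VERB"), ("run", "NOUN"), ("eat", "VERB"),
    ("eat", "VERB"), ("table", "NOUN"),
]

# Counter starts every key at zero, so no first-occurrence special case is needed.
verb_counts = Counter(form for form, upos in toy_tokens if upos == "VERB")
non_verb_counts = Counter(form for form, upos in toy_tokens if upos != "VERB")

# Forms seen both as a verb and as something else are ambiguous.
ambiguous_forms = set(verb_counts) & set(non_verb_counts)

print(verb_counts["eat"])   # 2
print(ambiguous_forms)      # {'run'}
```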



In [7]:
verbs     = {k:v for (k,v) in verbs.items() if k not in ambiguous} # this is called a dict comprehension
non_verbs = {k:v for (k,v) in non_verbs.items() if k not in ambiguous}

print('after removing ambiguous words:\n')
print('{:,} unique verbs in training data'.format(len(verbs)))
print('{:,} unique non-verbs in training data'.format(len(non_verbs)))

after removing ambiguous words:

4,896 unique verbs in training data
27,573 unique non-verbs in training data


In [8]:
characters = dict()

lexicon = list(verbs.keys()) + list(non_verbs.keys())

for word in tqdm_notebook(lexicon):
    for char in word:
        # count occurrences of each character, starting from 1 on its first occurrence
        characters[char] = characters.get(char, 0) + 1


In [9]:
alphabet = characters.keys()
alphabet = sorted(alphabet)
print('the alphabet size in this corpus is {}'.format(len(alphabet)))
print('alphabet:\n' + str(alphabet))

the alphabet size in this corpus is 49
alphabet:
['!', '"', '%', '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '_', 'א', 'ב', 'ג', 'ד', 'ה', 'ו', 'ז', 'ח', 'ט', 'י', 'ך', 'כ', 'ל', 'ם', 'מ', 'ן', 'נ', 'ס', 'ע', 'ף', 'פ', 'ץ', 'צ', 'ק', 'ר', 'ש', 'ת']


## Getting down to business
In the following, we test whether character n-grams are enough for classifying words as verbs vs. non-verbs. This demonstrates the entire classification flow.

In [10]:
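# note: combinations_with_replacement yields each pair only once, in sorted order,
# so a bigram whose characters appear in the reverse order is not among the features
_1grams = list(itertools.combinations_with_replacement(alphabet, 1))
_2grams = list(itertools.combinations_with_replacement(alphabet, 2))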
ngrams = _1grams + _2grams
ngrams = list(map(lambda ngram: ''.join(ngram), ngrams))
print('we have {} ngram features'.format(len(ngrams)))

we have 1274 ngram features
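
The feature count above follows from how `itertools.combinations_with_replacement` works: it yields each pair once, in sorted order, whereas `itertools.product` yields every ordered pair. A small sketch on a hypothetical three-letter alphabet shows the difference:

```python
import itertools

toy_alphabet = ["a", "b", "c"]  # hypothetical stand-in for the corpus alphabet

# Pairs in sorted order only: n * (n + 1) / 2 of them.
sorted_pairs = ["".join(p) for p in
                itertools.combinations_with_replacement(toy_alphabet, 2)]

# Every ordered pair: n * n of them.
all_pairs = ["".join(p) for p in itertools.product(toy_alphabet, repeat=2)]

print(sorted_pairs)                              # ['aa', 'ab', 'ac', 'bb', 'bc', 'cc']
print(len(all_pairs))                            # 9
print("ba" in sorted_pairs, "ba" in all_pairs)   # False True
```

For the 49-symbol alphabet this gives 49 + 49 * 50 / 2 = 1274 features, matching the printed count, against 49 + 49 * 49 = 2450 if all ordered bigrams were used. Whether to switch is a design choice, and doing so would change every result reported further down.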


In [11]:
def vectorize(word):
    ''' turn word into a vector '''
    
    # ngram occurrences anywhere in the word
    vec1 = [0] * len(ngrams)
    for idx, ngram in enumerate(ngrams):
        if ngram in word:
            vec1[idx] = 1
            
    # ngram occurrences as prefix
    vec2 = [0] * len(ngrams)
    for idx, ngram in enumerate(ngrams):
        if word.startswith(ngram):
            vec2[idx] = 1

    # ngram occurrences as suffix
    vec3 = [0] * len(ngrams)
    for idx, ngram in enumerate(ngrams):
        if word.endswith(ngram):
            vec3[idx] = 1
        
    return vec1 + vec2 + vec3
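
To see what such a vector looks like, here is a self-contained sketch of the same idea on a toy feature list. The function `toy_vectorize` and the list `toy_ngrams` are hypothetical and exist only for this illustration; unlike `vectorize`, the feature list is passed in as a parameter.

```python
def toy_vectorize(word, ngrams):
    """Concatenate contains / starts-with / ends-with indicators per ngram."""
    contains = [int(g in word) for g in ngrams]
    prefix = [int(word.startswith(g)) for g in ngrams]
    suffix = [int(word.endswith(g)) for g in ngrams]
    return contains + prefix + suffix


toy_ngrams = ["a", "b", "ab", "ba"]  # hypothetical feature list
vec = toy_vectorize("abb", toy_ngrams)

print(vec)       # [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
print(len(vec))  # 12, i.e. three blocks of len(toy_ngrams) each
```

The first block marks n-grams found anywhere in the word, the second marks prefixes, and the third marks suffixes.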

In [12]:
def train_test_split(X, y, train_proportion=0.8):
    ''' split the given data into train and test sets '''

    assert len(X) == len(y), 'input data should have exactly one prediction per input'
    assert 0 < train_proportion < 1, 'train_proportion must be strictly between zero and one'

    data_indices = range(len(X))
    data_count = len(data_indices)

    # random.sample needs a sequence; sampling directly from a set fails on Python 3.11+
    train_indices = set(random.sample(data_indices, int(data_count * train_proportion)))
    test_indices  = set(data_indices) - train_indices
    
    X_train = [X[idx] for idx in train_indices]
    X_test  = [X[idx] for idx in test_indices]
    
    y_train = [y[idx] for idx in train_indices]
    y_test  = [y[idx] for idx in test_indices]

    assert len(X_train) + len(X_test) == len(X)
    
    return X_train, X_test, y_train, y_test
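
Because the split above draws from the global random state, every run produces a different train/test partition. One possible way to make the split repeatable is to use a private, seeded generator. This is a sketch under that assumption, not the notebook's own method; the name `seeded_split` and its `seed` parameter are hypothetical.

```python
import random


def seeded_split(X, y, train_proportion=0.8, seed=0):
    """Split X and y into train and test parts, reproducibly for a given seed."""
    assert len(X) == len(y)
    rng = random.Random(seed)      # private generator; global state is untouched
    indices = list(range(len(X)))
    rng.shuffle(indices)
    cut = int(len(indices) * train_proportion)
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])


X_toy = list("abcdefghij")
y_toy = [0, 1] * 5

first = seeded_split(X_toy, y_toy, seed=42)
second = seeded_split(X_toy, y_toy, seed=42)

print(first == second)               # True
print(len(first[0]), len(first[1]))  # 8 2
```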

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

X_positive = list(map(vectorize, verbs.keys()))
X_negative = list(map(vectorize, non_verbs.keys()))

y_positive = [1] * len(X_positive)
y_negative = [0] * len(X_negative)

X_train_pos, X_test_pos, y_train_pos, y_test_pos = train_test_split(X_positive, y_positive)
X_train_neg, X_test_neg, y_train_neg, y_test_neg = train_test_split(X_negative, y_negative)

X_train = X_train_pos + X_train_neg
y_train = y_train_pos + y_train_neg

In [14]:
pos_sample_weight = 1
neg_sample_weight = 1
sample_weights = ([pos_sample_weight] * len(X_train_pos)) + ([neg_sample_weight] * len(X_train_neg))

In [15]:
clf = MultinomialNB()
#clf = LogisticRegression()

clf.fit(X_train, y_train, sample_weights)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
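# coef_[0] holds 3 * len(ngrams) values (one block per feature group in vectorize);
# zip stops at len(ngrams), so only the "occurs anywhere in the word" block is paired here.
# on recent scikit-learn versions, where MultinomialNB no longer exposes coef_,
# clf.feature_log_prob_[1] holds the same values
model_coefficients = list(zip(clf.coef_[0], ngrams))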
random.sample(list(model_coefficients), 30)

[(-10.662633125602696, '5מ'),
 (-10.662633125602696, 'טץ'),
 (-10.662633125602696, 'ךק'),
 (-10.662633125602696, '8ן'),
 (-10.662633125602696, '"ל'),
 (-10.662633125602696, ','),
 (-8.465408548266476, 'סע'),
 (-10.662633125602696, '18'),
 (-10.662633125602696, '6ז'),
 (-10.662633125602696, '1ל'),
 (-10.662633125602696, '__'),
 (-10.662633125602696, '%4'),
 (-9.276338764482805, 'צק'),
 (-10.662633125602696, ',כ'),
 (-5.339623146464287, 'הת'),
 (-10.662633125602696, 'ףש'),
 (-10.662633125602696, '2פ'),
 (-10.662633125602696, '00'),
 (-10.662633125602696, 'םם'),
 (-10.662633125602696, '(_'),
 (-10.662633125602696, '0ג'),
 (-10.662633125602696, '?ש'),
 (-10.662633125602696, '"ה'),
 (-3.4349706268740414, 'ה'),
 (-10.662633125602696, '%ז'),
 (-10.662633125602696, ')ב'),
 (-6.73080749287837, 'חר'),
 (-10.662633125602696, '9כ'),
 (-10.662633125602696, '.4'),
 (-10.662633125602696, '4צ')]

In [17]:
sorted(model_coefficients, reverse=True, key=lambda pair: pair[0])

[(-3.0753226195800805, 'ו'),
 (-3.1310807441954065, 'י'),
 (-3.4349706268740414, 'ה'),
 (-3.487143411978474, 'ת'),
 (-3.5826066256801052, 'מ'),
 (-3.634431693544691, 'ל'),
 (-3.6523212582954665, 'ר'),
 (-3.8770454805947665, 'נ'),
 (-4.0059066014243045, 'ש'),
 (-4.131755497876811, 'ע'),
 (-4.1613434550623065, 'ב'),
 (-4.201164949248978, 'ח'),
 (-4.270716012210094, 'ק'),
 (-4.365523805668761, 'פ'),
 (-4.446027024517831, 'ם'),
 (-4.555610237860441, 'ד'),
 (-4.594207537358585, 'ס'),
 (-4.639185532641663, 'כ'),
 (-4.673671708712832, 'ים'),
 (-4.686282216304762, 'א'),
 (-4.709389791314911, 'צ'),
 (-4.765479257965955, 'ג'),
 (-4.985879323334414, 'ט'),
 (-5.161374915057968, 'ות'),
 (-5.296657110580845, 'ז'),
 (-5.296657110580845, 'יי'),
 (-5.339623146464287, 'הת'),
 (-5.369328300878204, 'הו'),
 (-5.587459310368869, 'מת'),
 (-5.638752604756419, 'ור'),
 (-5.67220053882396, 'יע'),
 (-5.713873235224527, 'יר'),
 (-5.9264346772082, 'ן'),
 (-5.971285243373552, 'בי'),
 (-6.037660312318425, 'מש'),
 (-6

In [18]:
# word = 'התלבשו'  # alternative word to try
word = 'הלבשה'
clf.predict([vectorize(word)])

array([0])

In [19]:
X_test = X_test_pos + X_test_neg
y_test = y_test_pos + y_test_neg

y_pred = clf.predict(X_test)

In [20]:
sklearn.metrics.accuracy_score(y_test, y_pred)

0.8640492686682063

In [21]:
print(sklearn.metrics.classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.90      0.94      0.92      5515
          1       0.56      0.43      0.49       980

avg / total       0.85      0.86      0.86      6495



In [22]:
confusion_matrix(y_test, y_pred)

array([[5189,  326],
       [ 557,  423]])
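
The figures in the classification report can be re-derived by hand from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with class 1 (verb) treated as the positive class.

```python
# Confusion matrix as printed above: rows are true classes, columns are predictions.
tn, fp = 5189, 326   # true class 0 (non-verb)
fn, tp = 557, 423    # true class 1 (verb)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_verb = tp / (tp + fp)   # of the words predicted as verbs, how many are verbs
recall_verb = tp / (tp + fn)      # of the actual verbs, how many were found
f1_verb = 2 * precision_verb * recall_verb / (precision_verb + recall_verb)

print(round(accuracy, 2))        # 0.86
print(round(precision_verb, 2))  # 0.56
print(round(recall_verb, 2))     # 0.43
print(round(f1_verb, 2))         # 0.49
```

These values match the accuracy score and the class 1 row of the report above.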

## _3 pts_ | Question 1 
To understand how well or badly a model is performing, it helps to have a naive baseline whenever there is no benchmark result to compare against. Assume you have no such benchmark and therefore need a baseline. ***Suggest and calculate a reasonable, even if naive, baseline over the test set.*** *Hint: think of the class distribution in the data.*

## _2 pts_ | Question 2 
What can you say about the performance obtained here, compared to your baseline?

## _3 pts_ | Question 3
In light of the above, suggest and implement an adjusted accuracy metric that better accounts for the class imbalance inherent in our data. Class imbalance means that the distribution of classes in the data is far from uniform.



## _8 pts_ | Question 4
Suggest better features based on properties of the Hebrew language. Which feature of the language helps here? (*Hint: think of features absent from, e.g., English.*) Implement the features and provide a notebook with your implementation and the resulting performance report showing their contribution.

## _4 pts_ | Question 5
Now, alternatively, suggest and apply features specifically for English. Provide a notebook with your implementation classifying over the English data of https://github.com/UniversalDependencies/UD_English-EWT, including a performance report.

*Use the following code cell to download and use the chosen version of this data.*

In [23]:
%%script false
!git clone https://github.com/UniversalDependencies/UD_English-EWT
!cd UD_English-EWT && git checkout 7be629932192bf1ceb35081fb29b8ecb0bd6d767

## _30 pts_ | Question 6 
_In this question you will implement a classifier that takes account of the context of a word_. This means, you will no longer classify a word, context-less, but rather, you will transition to classifying words as part of the sentence they appear in. It will open the way for bringing additional features into play in your model, and very hopefully, lead to substantially improved classification performance.

+ You may reuse the code of this notebook in any way, or write from scratch.

+ You will be graded based upon:

   + diligence and creativity in adding reasonable features
   + score on the test set, obtained when run for grading
   + reproducibility: does your submitted solution notebook actually run when downloaded for grading
   + cleanly structured, mildly self-explanatory or documented code
   

## _20 pts_ | Question 7
Implement the _Multinomial Naive Bayes_ algorithm itself from scratch. Reproduce the performance obtained with the ready-made Multinomial Naive Bayes algorithm that you have used so far.

What to submit for this?
+ a working notebook using your own implementation of the Naive Bayes classifier rather than the sklearn one, for each of the programming assignments you have already implemented above


## _20 pts_ | Question 8
Implement the _logistic regression_ algorithm itself from scratch. Reproduce the performance obtained with the ready-made logistic regression algorithm that you have used so far.

What to submit for this?
+ a working notebook using your own implementation of logistic regression rather than the sklearn one, for each of the programming assignments you have already implemented above


## _10 pts_ | Question 9
*Manual domain adaptation.* 
<br>
1. Obtain an unrelated Hebrew dataset;

    + this can be a liberated copy of your Gmail inbox ([Google data takeout](https://takeout.google.com/settings/takeout))
    + a WhatsApp data export
    + public Hebrew tweets
    + a Hebrew Wikipedia dump


2. Use your best accomplished model over some subset of that data.

<br><br>
What to submit for this?
1. Provide a simple report of its success rate over that data (comparing to the original results)
2. Analyze the difference in performance and try to describe it qualitatively; submit your analysis.
+ _15 pts bonus:_ modify the training data to improve performance over your data. Submit your modified training data here as well.

<span style="color:gray">what not to include in your submission?</span>
+ <span style="color:gray">You may not include in your new data, any data explicitly or implicitly disclosing the identity or personal information of any person/s. Anonymize any non-public data that you use as necessary to comply with this requirement.</span>
+ <span style="color:gray">You may not include any proprietary or confidential data in your new data.</span>

## <span style="color:orange">Submission Notes</span>

+ <span style="color:red">Note that non-reproducible results will be ignored.</span>
+ <span style="color:red">Any submission that does not actually run (halts with an error) will immediately incur a 5-pt deduction. Any re-submission that does not actually run (halts with an error) will incur an additional 5-pt deduction each time. So make sure to verify that your code runs before submitting it.</span>
+ <span style="color:gray">If any of your code takes more than 10 minutes to run, requires more than 16GB of memory, or relies on a GPU, please give notice in advance.</span>

The way to avoid these troubles is to bundle your solution into a compressed zip archive, unzip it into an empty folder, and verify that it runs from scratch from there.

