# Webmining - Assignment 1

This **Home Assignment** is to be submitted and you will be given points for each of the tasks. It familiarizes you with basics of *web crawling* and standard text preprocessing. It then takes a deep dive into *GloVe* one approach for obtaining word embeddings. To train GloVe we will frist construct the co-occurence matrix and then we will use adaptive stochastic gradient descent to minimize the cost function.

## Formalities
**Submit in a group of 2-3 people until 27.05.2020 23:59CET. The deadline is strict!**

## Evaluation and Grading
General advice for programming excercises at *CSSH*:
Evaluation of your submission is done semi automatically. Think of it as this notebook being 
executed once. Afterwards, some test functions are appended to this file and executed respectively.

Therefore:
* Submit valid _Python3_ code only!
* Use external libraries only when specified by task.
* Ensure your definitions (functions, classes, methods, variables) follow the specification if
  given. The concrete signature of e.g. a function usually can be inferred from task description, 
  code skeletons and test cases.
* Ensure the notebook does not rely on current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that 
    are not in scope anymore.
  * Double check if your code relies on presence of files or directories other than those mentioned
    in given tasks. Tests run under Linux, hence don't use Windows style paths 
    (`some\path`, `C:\another\path`). Also, use paths only that are relative to and within your
    working directory (OK: `some/path`, `./some/path`; NOT OK: `/home/alice/python`, 
    `../../python`).
* Keep your code idempotent! Running it or parts of it multiple times must not yield different
  results. Minimize usage of global variables.
* Ensure your code / notebook terminates in reasonable time.

**There's a story behind each of these points! Don't expect us to fix your stuff!**

Regarding the scores, you will get no points for a task if:
- your function throws an unexpected error (e.g. takes the wrong number of arguments)
- gets stuck in an infinite loop
- takes much much longer than expected (e.g. >1s to compute the mean of two numbers)
- does not produce the desired output (e.g. returns an descendingly sorted list even though we asked for ascending, returns the mean and the std even though we asked for only the mean, prints an output instead of returning it, ...)

In [2]:
# credentials of all team members (you may add or remove items from the dictionary)
team_members = [
    {
        'first_name': 'Alice',
        'last_name': 'Foo',
        'student_id': 12345
    },
    {
        'first_name': 'Bob',
        'last_name': 'Bar',
        'student_id': 54321
    }
]

## 1. Crawling web pages (total of 4 points)
Consider the top 150 stackoverflow questions tagged with data-mining ordered by votes (e.g. the  questions from the 10 first pages accessible from here: https://stackoverflow.com/questions/tagged/data-mining?tab=votes&pagesize=15). Use the `BeautifulSoup`  and `requests` package.

### a) Simple spidering (1.5)

Write a function ```get_questions``` that takes a tag (like ```"data-mining"```) and a number `n` and returns a list containing the hyperlinks (as strings) to the top n questions as explained above. Assume the tag exists.

(the first link for data-mining is: https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression)

### b) Processing a page (1.5)

Write a function ```process_question``` that takes a string hyperlink to a stackoverflow questions and returns a dictionary. It contains the text of the question, their comments and the answers to the question and comments. Keep in mind to remove: All images, tags, html tags, user_information, _code_ sections. Also remove information on edits, dates etc. Finally remove all functional text-content like share, edit, follow, flag, add comment (the kind of button things). Remove everything that is not inside the div with `id="mainbar"`.

The structure of the result is:

```python
{'title': 'The title',
 'question': {'text' : 'How to learn web-mining?',
              'comments':['Good question', 'Sounds\n interesting']},
 'answers' : [{'text':'Do a course at CSSH!', 
               'comments' : ['You will learn a lot', 'Good stuff']},
              {'text':'Learn on youtube', 'comments' : []}, ]}
```
You can also find an example of the  processed page https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression on moodle.

### c) Flatten the document (0.5)

Write a function `flatten_question` that takes a dict of the structure from c) and returns a list of strings for that dict. Therefor merge the title with the flattened question. Then add the one string for each answer.
The answer and question are flattened by first joining the comment strings with a `" "` and then joining the text and the comments in the same way. The text should preceed the comments.
The returned list should look like:

```
['The title How to learn web-mining? Good question Sounds\n interesting', 'Do a course at CSSH! You will learn a lot Good stuff', 'Learn on youtube ']
```

### d) Bringing it all together (0.5)

Write a function `process_top_questions` that takes a tag and a number n and that processes the top n questions by votes as explained above. It returns a single list of strings (concatenated from the list of strings for each single answer). Thereby use the previously defined functions.

Execute the function with the tag `"data-mining"` and n=150. Store the result in ```result_1```


In [3]:
from bs4 import BeautifulSoup
import requests

In [4]:
def get_questions(tag, n):
    # pass
    urls = []
    d = {'tab':'votes','pagesize': 50,'page':1 }
    request_time = 1
    while n > len(urls):
        r = requests.get(url = 'https://stackoverflow.com/questions/tagged/'+tag, params=d)
        print(str(request_time) + 'th ', 'request address :', r.url)
        soup = BeautifulSoup(r.content)
        questions = soup.findAll(class_ = "question-hyperlink")
        for q in questions:
            if q['class'] == ['question-hyperlink']:
                urls.append('https://stackoverflow.com' + q['href'])
        d['page'] += 1
        request_time += 1
    return urls[:n]

In [5]:
#test_case
get_questions('data-mining', 50)

1th  request address : https://stackoverflow.com/questions/tagged/data-mining?tab=votes&pagesize=50&page=1


['https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression',
 'https://stackoverflow.com/questions/1746501/can-someone-give-an-example-of-cosine-similarity-in-a-very-simple-graphical-wa',
 'https://stackoverflow.com/questions/5064928/difference-between-classification-and-clustering-in-data-mining',
 'https://stackoverflow.com/questions/2323768/how-does-the-amazon-recommendation-feature-work',
 'https://stackoverflow.com/questions/17469835/why-does-one-hot-encoding-improve-machine-learning-performance',
 'https://stackoverflow.com/questions/11808074/what-is-an-intuitive-explanation-of-the-expectation-maximization-technique',
 'https://stackoverflow.com/questions/26355942/why-is-the-f-measure-a-harmonic-mean-and-not-an-arithmetic-mean-of-the-precision',
 'https://stackoverflow.com/questions/11513484/1d-number-array-clustering',
 'https://stackoverflow.com/questions/12066761/what-is-the-difference-between-gradient-descent-and-ne

In [6]:
def process_question(api_url):
    data = {} # store all information of this page
    response = requests.get(url=api_url)
    soup = BeautifulSoup(response.text,'html.parser')
    # title
    data['title'] = soup.find('a', class_ = "question-hyperlink").string
    # quetions
    question = soup.find('div', class_ = 'question')
    s = question.find('div', class_='post-text')
    ques = {}
    ques['text'] = s.text # question text
    ques['comments'] = [] #question comments
    #find the first comments list
    ques_comm = question.find('ul',class_ = 'comments-list')
    q_coms = ques_comm.findAll('li', class_ = 'comment js-comment')
    for q_c in q_coms:
        com = q_c.find('span',class_ = 'comment-copy')
        ques['comments'].append(com.text)
    data['question'] = ques
    # all answers (including comments)
    all_answers = soup.find(id='answers')
    answers = all_answers.findAll(class_ = 'answer')
    data['answers'] = [] # the list of answers and corresponing comments
    for ans in answers:
        ans_com_dict = {} # {'test:'###', 'comments': [XX,XXXX]}
        temp = ans.find(class_ = 'post-text')
        comms = ans.findAll('li', class_ = 'comment js-comment')
        try:
            ans_com_dict['text'] = temp.text
        except:
            pass
        # each question may have multiple comments
        ans_com_dict['comments'] = []
        for c in comms:
            com = c.find('span',class_ = 'comment-copy')
            ans_com_dict['comments'].append(com.text)
        data['answers'].append(ans_com_dict)
    return data

In [7]:
#test_case
api_url = 'https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression'
process_question(api_url)

{'title': 'What is the difference between linear regression and logistic regression?',
 'question': {'text': '\nWhen we have to predict the value of a categorical (or discrete) outcome we use logistic regression. I believe we use linear regression to also predict the value of an outcome given the input values.\nThen, what is the difference between the two methodologies?\n',
  'comments': []},
 'answers': [{'text': "\n\nLinear regression output as probabilities\nIt's tempting to use the linear regression output as probabilities but it's a mistake because the output can be negative, and greater than 1 whereas probability can not. As regression might actually\nproduce probabilities that could be less than 0, or even bigger than\n1, logistic regression was introduced. \nSource: http://gerardnico.com/wiki/data_mining/simple_logistic_regression\n\nOutcome\nIn linear regression, the outcome (dependent variable) is continuous.\nIt can have any one of an infinite number of possible values. \nIn

In [8]:

# print(result.headers['Content-Type'])
# soup = BeautifulSoup(result.text,'html.parser')
# #title
# soup.find('a', class_ = "question-hyperlink").string
# #questions
# s = soup.find('div', class_='post-text')
# s.text
# # questions comments

# # answers text
# answers = soup.find(id='answers')
# ids  = answers.findAll('a')
#get all answer ids
# for i in ids:
#     try:
#         if int(i['name']):
#             index  = i['name']
#             print(index)
#             ans = answers.find(id = 'answer-'+idnex)
#             print(ans.text)
#     except:
#         continue
# answers = answers.findAll(class_ = 'answer')
# res = []
# for ans in answers:
#     re = ans.find(class_ = 'post-text')
#     try:
#         res.append(re.text)
#         # print(re.text)
#     except:
#         continue
# print(res)
# answer comments:
# for ans in answers:
#     re = ans.findAll('li', class_ = 'comment js-comment')
#     for comments in re:
#         com = comments.find('span',class_ = 'comment-copy')
#         print(com.text)
#         print('####')
#     print('!!!!!')
        

In [9]:
# example input
d={'title': 'The title',
 'question': {'text' : 'How to learn web-mining?',
              'comments':['Good question', 'Sounds\n interesting']},
 'answers' : [{'text':'Do a course at CSSH!', 
               'comments' : ['You will learn a lot', 'Good stuff']},
              {'text':'Learn on youtube', 'comments' : []}, ]}

def flatten_question(data):
    res = []
    # process title and question
    s = data['title'] + ' ' + data['question']['text']
    # process question comments
    for comm in data['question']['comments']:
        s += (' ' + comm)
    res.append(s)
    for ans in data['answers']:
        temp = ''
        temp += ans['text']
        for c in ans['comments']:
            temp += (' ' + c)
        res.append(temp)
    return res

In [10]:
#test_case
flatten_question(d)

['The title How to learn web-mining? Good question Sounds\n interesting',
 'Do a course at CSSH! You will learn a lot Good stuff',
 'Learn on youtube']

In [11]:
from tqdm.notebook import tqdm

In [12]:
def process_top_questions(tag, n):
    # pass
    urls = get_questions(tag, n)
    result = []
    for u in tqdm(urls):
        data = process_question(u)
        re = flatten_question(data)
        result.extend(re)
    return result

In [24]:
# test_case
result_1 = process_top_questions('data-mining', 150)
print(result_1)
# store result
with open('./result_1.txt', 'w', encoding='utf-8') as w:
    w.writelines(result_1)

1threquest address :https://stackoverflow.com/questions/tagged/data-mining?tab=votes&pagesize=50&page=1
2threquest address :https://stackoverflow.com/questions/tagged/data-mining?tab=votes&pagesize=50&page=2
3threquest address :https://stackoverflow.com/questions/tagged/data-mining?tab=votes&pagesize=50&page=3


HBox(children=(IntProgress(value=0, max=150), HTML(value='')))

eryone exactly everything im doing, I only wanted a simple way to remove some noise from Kmeans lol @JungleBoogie: don\'t be too quick to judge. I just tried it myself and it is working as advertised... And I have no idea what Javascript you are talking about, the code is right there in front of you! You don\'t like it, I can see three other implementations in the first page of a google search. Better yet implement your own...', 'Cosine similarity when one of vectors is all zeros \nHow to express the cosine similarity ( http://en.wikipedia.org/wiki/Cosine_similarity ) \nwhen one of the vectors is all zeros?\nv1 = [1, 1, 1, 1, 1]\nv2 = [0, 0, 0, 0, 0]\nWhen we calculate according to the classic formula we get division by zero:\nLet d1 = 0 0 0 0 0 0\nLet d2 = 1 1 1 1 1 1\nCosine Similarity (d1, d2) =  dot(d1, d2) / ||d1|| ||d2||dot(d1, d2) = (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) = 0\n\n||d1|| = sqrt((0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2) = 0\n\n||d2|| = sqrt((

# 2. Preprocessing (total of 2 points)
It is/was common to use stopword removal and stemming as a preparation for word_embeddings. You should use the stopword list provided by Python-nltk library.

## a) The usual preprocessing (1)

Write a function `to_tokens` that takes a string and performs tokenization, stopword removal and stemming. It returns a list of tokens. The tokens are in the same order they were in the initial string. Use the functions from the nltk library.
To transform a string into tokens use the function `nltk.word_tokenize`.

## b) Reduce vocabulary (1)
Write a function `process_and_filter_corpus` that takes a list of strings and an integer as input. It applies the `to_tokens` function on each of those. The return values are

1) the list of list of tokens. All tokens that appear less than min_support times (in the entire corpus) are __removed__

2) a set of tokens that were removed.

In [14]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [15]:
def to_tokens(sent):
    sent = sent.lower()
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    # print(stop_words)
    words = word_tokenize(sent)
    filtered_words = [ps.stem(w) for w in words if w not in stop_words]
    return filtered_words
#test case
print(to_tokens('I am a cat and I love fish and chips'))

['cat', 'love', 'fish', 'chip']


In [16]:
# without using any extern library except nltk
def process_and_filter_corpus(sents, min_support):
    hash_table = {}
    res = []
    for sent in sents:
        temp = []
        words = to_tokens(sent)
        for w in words:
            temp.append(w)
            if w not in hash_table:
                hash_table[w] = 1
            else:
                hash_table[w] += 1
        res.append(temp)
    removed_tokens = set(k for k,v in hash_table.items() if v < min_support)
    result = []
    for words in res:
        temp = []
        for w in words:
            if w not in removed_tokens:
                temp.append(w)
        result.append(temp)
    return result, removed_tokens

In [25]:
sents = ['Here we see that the pair of words than-done is a bigram','We will see how it works later','collocations are essentially just frequent bigrams']
process_and_filter_corpus(sents,2)

([['see', 'bigram'], ['see'], ['bigram']],
 {'colloc',
  'essenti',
  'frequent',
  'later',
  'pair',
  'than-don',
  'word',
  'work'})

## 3. Glove (total of 10 points)
### a) Computing the global word co-occurence matrix (3 = 2.5 + 0.5)
We will now explore some steps of required for the GloVe word embedding.

Write a function `get_cooc_matrix` that takes a list of list of tokens and an integer _context_size_ as input. It returns 1) a sparse co-occurrence matrix (a dict mapping pairs of integer indices to their cooc-score) and 2) a dictionary that maps a word to an index in the cooc matrix/dict. Do not use any non standard library for that. If two tokens are d places apart they get a score of 1/d. Only take into account words that are at most context_size apart from the central word.

Example:

For the corpus `[['bad', 'dog', 'bad', 'cat', 'thing'],['bad', 'dog']]` and context_size=2 the tokens 'dog' and 'thing' do not co-occur. While the tokens 'dog' and 'bad' have a cooc score of 3. The token pair 'bad' and 'bad' has a cooc score of 1.0.

Additionally write a function `cooc_to_numpy` that takes the dict-sparse-representation and returns three numpy arrays. The first two are of type int and they contain the values for i and j respectively. The third array contains the cooc-score for that entry. They are sorted in a way that num it is first sorted ascending according to i and then ascending according to j. This sorting is only for reproducability, not necessary for the glove. We call the output of this function a coord_tpl

In [15]:
import numpy as np
def get_cooc_matrix(context, context_size):
    #compute word to index dict
    w2i = {}
    text = []
    for sent in context:
        for word in sent:
            text.append(word)
    #store unique words
    text = list(dict.fromkeys(text))
    for i, w in enumerate(text):
        w2i[w] = i
    #compute co-courrence dict
    cs = context_size
    co_occurence = {} # mapping pairs of integer indices to their cooc-score
    for sent in context:
        for i, w in enumerate(sent):
            # get left context words and right context words
            center_word_id = w2i[w]
            left_context_words = sent[max(0,i-cs):i]
            right_context_words = sent[i+1:min(len(sent),i+cs+1)]
            # left context words score
            for left_i, left_w in enumerate(left_context_words):
                #calculate distance
                dist = len(left_context_words) - left_i
                incremment = 1/float(dist) # score
                left_word_id = w2i[left_w]
                # add score into co_occurence matrix dict
                if (center_word_id, left_word_id) not in co_occurence:
                    co_occurence[(center_word_id, left_word_id)] = incremment
                else:
                     co_occurence[(center_word_id, left_word_id)] += incremment
                # here we can also use this following calculation if we don't want to us right_context_words

                # if (left_word_id, center_word_id) not in co_occurence:
                #     co_occurence[(left_word_id,center_word_id)] = incremment
                # else:
                #     co_occurence[(left_word_id,center_word_id)] += incremment
                
            # right context words score
            for right_i, right_w in enumerate(right_context_words):
                dist = right_i + 1
                incremment = 1/float(dist)
                right_word_id = w2i[right_w]
                if (center_word_id, right_word_id) not in co_occurence:
                    co_occurence[(center_word_id, right_word_id)] = incremment
                else:
                     co_occurence[(center_word_id, right_word_id)] += incremment
    return co_occurence, w2i

def cooc_to_numpy(coocs):
    list_i = []
    list_j = []
    scores = []
    sorted_coocs = sorted(coocs.items(), key = lambda item: item[0][0])
    hash_table = {}
    for pair in sorted_coocs:
        i, j = pair[0]
        if i not in hash_table:
            hash_table[i] = [j]
        else:
            hash_table[i].append(j)
    for i, js in hash_table.items():
        for j in sorted(js):
            list_i.append(i)
            list_j.append(j)
            scores.append(coocs[(i,j)])
    np_i = np.array(list_i, dtype=int)
    np_j = np.array(list_j, dtype=int)
    np_scores = np.array(scores, dtype=float)
    # coord_tql = np.vstack((np_i, np_j, np_scores))
    return np_i, np_j, np_scores

#test_case
mini_corpus=[['bad', 'dog', 'bad', 'cat', 'thing'], ['bad', 'dog']]
#results for mini_corpus:
d,vocab = get_cooc_matrix(mini_corpus, 2)
print(d,'\n', vocab)
cooc_tpl = cooc_to_numpy(d)
print(cooc_tpl)
# print('cooc_tpl',(
#  np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3]),
#  np.array([0, 1, 2, 3, 0, 2, 0, 1, 3, 0, 2]),
#  np.array([1. , 3. , 1. , 0.5, 3. , 0.5, 1. , 0.5, 1. , 0.5, 1. ])))
# print('vocab', {'bad': 0, 'dog': 1, 'cat': 2, 'thing': 3})

{(0, 1): 3.0, (0, 0): 1.0, (1, 0): 3.0, (1, 2): 0.5, (0, 2): 1.0, (0, 3): 0.5, (2, 1): 0.5, (2, 0): 1.0, (2, 3): 1.0, (3, 0): 0.5, (3, 2): 1.0} 
 {'bad': 0, 'dog': 1, 'cat': 2, 'thing': 3}
(array([0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3]), array([0, 1, 2, 3, 0, 2, 0, 1, 3, 0, 2]), array([1. , 3. , 1. , 0.5, 3. , 0.5, 1. , 0.5, 1. , 0.5, 1. ]))


Now that we have obtained the global co-occurence matrix, it is time to use this information to train GloVe vectors.
For this task we will use an (adaptive) stochastic gradient descent method.

### b) Initialize Vectors and Gradients (1)
Write a function `init_matrix`  that takes three arguments: 1) the size of the vocabulary $v$ 2) the size of the desired Glove vectors $n$ and 3) a random number generator which is initialized like: (https://numpy.org/doc/1.18/reference/random/index.html). It returns two matrices: 1) The matrix of glove vectors where each row is of size $n+1$ reflecting one glove vector + bias. It is initialized with random values such that they lie uniformly on the interval $[-l, l)$ with $l=\frac{0.5}{n}$. Use one call to the generator only. 2) A matrix of similar shape initialized to all values being one. These are our corresponding squared gradients.

### c) Compute loss (1.5)
Write a function `compute_loss(v1, v2, b1, b2, cooc_score, max_score, alpha)` that computes the GloVe Loss for a single pair of vectors v1 and v2 their weights b1 and b2 with associated cooc_score and weighting parameters max_score and alpha. The function returns 1) the loss and 2) the part $g$ that is the part of the gradient, that is the same in both

The GloVe Loss for a pair of indices i,j is: $loss(i,j) = f(C_{i,j}) \cdot \left(v_i \cdot v_j + b_i + b_j -log(C_{i,j})\right)^2$ with $C_{i,j}$ the cooc-score. The function function $f$ is defined as

$f(x) = (\frac{x}{max\_score})^{\alpha}$ if $x< max\_score$

$f(x) = 1 $ otherwise

The shared part is: $g(i,j) = f(C_{i,j}) \cdot \left(v_i \cdot v_j + b_i + b_j -log(C_{i,j})\right)$

### d) Computing the updates (0.5 each)

Write a function `calc_gradient_vi(g, vj, eta, grad_clip)`  that takes the value $g$ and a vector $v_j$, the learning rate $\eta$ and a gradient clipping value. It computes the gradient update for the $v_i$ vector without the bias. Applies gradient clipping such that the absolute value of each element of the gradient vector is at most grad_clip. Thereafter the gradient is multiplied with the learning rate and finally the validify function is applied to the gradient vector which is then returned.

Write a similar function to compute the update for the bias b `calc_gradient_b(g, grad_clip)`. It computes the gradient update for the bias b. By clipping the gradient and therafter validifying the gradient. It then returns the gradient update.
To compute the gradient update, drop the factor 2 that arises when differentiating the squared term. Also the gradient update for the bias terms does not include the learning rate.


### e) Apply the update (1.5)

Write a function `one_update(W, W_grad, pair, cooc_score, max_score, alpha, eta, grad_clip)`
That performs an update of the vector matrix W and gradient matrix W_grad for a pair of word indices $(i,j)$=pair which with associated cooc_score. The remaining parameters should be clear from the previous functions.

The matrix W is updated using the previously computed updates divided by the square root of the corresponding entries in the W_grad matrix. Keep in mind you have to walk into the opposite direction of the gradient. 

The W_grad matrix is updated using the sqared entry of the corresponding update (prior to dividing).

The function returns the loss prior to upgrading. (The matrices W and W_grad are updated in place)

### f) Write a training function (1)

Write function `train(W, W_grad, cooc_tpl, n_epochs, rng, max_score, alpha, eta, grad_clip)` that trains W and W_grad matrix using the cooc_tpl for n_epochs. Before each epoch shuffle all arrays in the cocc_tpl in unison using one call to `rng.permutation`. 

### g) Bringing it all together (1)

Write a function `train_corpus` that takes a corpus in the form of a list of strings.
Preprocess each string as in task 2). Keep only words that have at least three occurences. 

Internally use: max_score=100, alpha=3/4, window size of 4, learning rate of 0.05, dimension of glove vectors = 50, grad_clip=100. Create a new `numpy.random.default_rng` instance, feed it a seed of 1. Train for 25 epochs.

The function returns the trained W matrix.

Apply it to your corpus that you obtained in task 1).

In [14]:
def compute_loss(v1, v2, b1, b2, cooc_score, max_score, alpha):
    if cooc_score < max_score:
        f = (cooc_score/max_score) ** alpha
    else:
        f = 1
    cost_inner = np.dot(v1,v2) + b1 + b2 - np.log(cooc_score)
    loss = f * (cost_inner **2)
    shared_gradient = f * cost_inner
    return loss, shared_gradient

In [37]:
np.log(1)
np.dot(np.array([1,2]),np.array([1,2]))

5

In [31]:
def validify(grad):
    if np.isnan(grad).sum()>0 or np.isinf(grad).sum()>0:
        print('Warning: invalid value')
        return np.zeros(np.shape(grad))
    return grad


def calc_gradient_vi(fdiff, v, eta, grad_clip):
    grad_vi = v * fdiff
    grad_vi[abs(grad_vi) > grad_clip] = grad_clip
    grad_vi_update = eta*grad_vi
    return validify(grad_vi_update)


def calc_gradient_b(fdiff, grad_clip):
    grad_b = fdiff
    if grad_b > grad_clip:
        grad_b = grad_clip
    return validify(grad_b)


def one_update(W, W_grad, pair, cooc_score, max_score, alpha, eta, grad_clip):
    (i, j) = pair
    w_i, b_i = W[i][:-1], W[i][-1]
    w_j, b_j = W[j][:-1], W[j][-1]
    grad_vi, grad_bi = W_grad[i][:-1], W_grad[i][-1]
    grad_vj, grad_bj = W_grad[j][:-1], W_grad[j][-1]
    #compute loss and shared gradient
    loss, g = compute_loss(w_i, w_j, b_i, b_j, cooc_score, max_score, alpha)
    #computr gradient update
    grad_vi_update = calc_gradient_vi(g, w_j, eta, grad_clip)
    grad_vj_update = calc_gradient_vi(g, w_i, eta, grad_clip)
    grad_bi_update = calc_gradient_b(g, grad_clip)
    grad_bj_update = calc_gradient_b(g, grad_clip)
    #update W
    W[i][:-1] -= (grad_vi_update/np.sqrt(grad_vi))
    W[j][:-1] -= (grad_vj_update/np.sqrt(grad_vj))
    W[i][-1] -= (grad_bi_update/np.sqrt(grad_bi))
    W[j][-1] -= (grad_bj_update/np.sqrt(grad_bj))
    #update W_grad
    W_grad[i][:-1] += np.square(grad_vi_update)
    W_grad[j][:-1] += np.square(grad_vj_update)
    W_grad[i][-1]  += grad_bi_update**2
    W_grad[j][-1] += grad_bj_update**2
    return loss


def init_matrices(n_vocab, n_dim, rng):
    l = 0.5/float(n_dim)
    # uniformly between [-l,l)
    W = 2*l*rng.random((n_vocab, n_dim+1)) - l
    grandient_squard = np.ones((n_vocab, n_dim+1), dtype=np.float64)
    return W, grandient_squard

In [12]:
from numpy.random import default_rng
import numpy as np
rg = default_rng(12345)
2*rg.random((10,1))-1
# d,f = init_matrices(4,2,rg)
rng = np.random.default_rng()
# rg.permutation(d, axis=0)
arr = np.arange(9).reshape((3, 3))
print(arr)
rng.permutation(arr,axis=1)
print(cooc_tpl)
s =rng.permutation(cooc_tpl,axis=1)
print(s)
rng.permutation(s,axis=1)

[[0 1 2]
 [3 4 5]
 [6 7 8]]
(array([0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3]), array([0, 1, 2, 3, 0, 2, 0, 1, 3, 0, 2]), array([1. , 3. , 1. , 0.5, 3. , 0.5, 1. , 0.5, 1. , 0.5, 1. ]))
[[0.  3.  1.  0.  0.  2.  2.  3.  0.  1.  2. ]
 [0.  0.  2.  2.  3.  1.  3.  2.  1.  0.  0. ]
 [1.  0.5 0.5 1.  0.5 0.5 1.  1.  3.  3.  1. ]]


array([[2. , 0. , 2. , 3. , 0. , 1. , 0. , 1. , 0. , 3. , 2. ],
       [3. , 3. , 0. , 2. , 1. , 0. , 2. , 2. , 0. , 0. , 1. ],
       [1. , 0.5, 1. , 1. , 3. , 3. , 1. , 0.5, 1. , 0.5, 0.5]])

In [56]:
def train(W, W_grad, cooc_tpl, n_epochs, rng, max_score, alpha, eta, grad_clip):
    for i in range(n_epochs):
        cooc_tpl = rng.permutation(cooc_tpl,axis=1)
        for j in range(len(cooc_tpl[0])):
        #select first pair of word as
            pair = (int(cooc_tpl[0][j]), int(cooc_tpl[1][j]))
            cooc_score = cooc_tpl[2][j]
            loss = one_update(W, W_grad, pair, cooc_score, max_score, alpha, eta, grad_clip)
        if i == n_epochs-1:
            print('the last loss is ', loss)

In [58]:
# Small example of how it works:
from numpy.random import default_rng

# cooc_stuff
cooc_dict, vocab = get_cooc_matrix(mini_corpus, 2)
cooc_tpl = cooc_to_numpy(cooc_dict)

rng = default_rng(1)
W, W_grad = init_matrices(len(vocab), 2, rng)
print(W)
train(W, W_grad, cooc_tpl, n_epochs=25, rng=rng, max_score=2, alpha=0.8, eta=0.1, grad_clip=1)
print(W)
print(W_grad)

[[ 0.00591081  0.22523185 -0.17792019]
 [ 0.22432472 -0.09408427 -0.03833678]
 [ 0.1638513  -0.04540043  0.02479684]
 [-0.23622044  0.12675655  0.01907166]]
the last loss is  0.16155238876341566
[[ 0.42636017  0.07392391  0.11268705]
 [ 0.86475964  0.05931378  0.48346455]
 [-0.58417307  0.04107205 -0.12712688]
 [-0.60295216  0.01171958 -0.35638232]]
[[ 1.01811777  1.0011147  17.32257018]
 [ 1.00686823  1.00206274 14.91833432]
 [ 1.01327751  1.00043868 11.30258586]
 [ 1.00239736  1.00057218  6.84297285]]


In [23]:
# What I obtain for code above

# W init
#[[ 0.00591081  0.22523185 -0.17792019]
# [ 0.22432472 -0.09408427 -0.03833678]
# [ 0.1638513  -0.04540043  0.02479684]
# [-0.23622044  0.12675655  0.01907166]]


# after training:
#loss on last pass 0.7861722683546024
#[[ 0.37861754  0.05811192  0.13090998]
# [ 0.67430958 -0.00918572  0.58465184]
# [-0.4713854   0.08926127 -0.31768868]
# [-0.58445333  0.02673079 -0.0975816 ]]
#[[ 1.01016694  1.0008865  12.1206226 ]
# [ 1.003497    1.00143482  9.65947768]
# [ 1.00864141  1.0002788   9.17531255]
# [ 1.00211807  1.00053237  6.48303266]]