
| Model | Embeddings | Private | Public | Local |
|:------ |:---------- | ------- | ------ | ----- |
| CapsuleNet | fasttext | 0.9855 | 0.9867 | 0.9896 |
| CapsuleNet | glove | 0.9860 | 0.9859 | 0.9899 |
| CapsuleNet | lexvec | 0.9855 | 0.9858 | 0.9898 |
| CapsuleNet | toxic | 0.9859 | 0.9863 | 0.9901 |

| Model | Embeddings | Private | Public | Local |
|:------ |:---------- | ------- | ------ | ----- |
| RNN Version 2 | fasttext | 0.9856 | 0.9864 | 0.9904 |
| RNN Version 2 | glove | 0.9858 | 0.9863 | 0.9902 |
| RNN Version 2 | lexvec | 0.9857 | 0.9859 | 0.9902 |
| RNN Version 2 | toxic | 0.9851 | 0.9855 | 0.9906 |

| Model | Embeddings | Private | Public | Local |
|:------ |:---------- | ------- | ------ | ----- |
| RNN Version 1 | fasttext | 0.9853 | 0.9859 | 0.9898 |
| RNN Version 1 | glove | 0.9855 | 0.9861 | 0.9901 |
| RNN Version 1 | lexvec | 0.9854 | 0.9857 | 0.9897 |
| RNN Version 1 | toxic | 0.9856 | 0.9861 | 0.9903 |

| Model | Embeddings | Private | Public | Local |
|:------ |:---------- | ------- | ------ | ----- |
| 2 Layer CNN | fasttext | 0.9826 | 0.9835 | 0.9886 |
| 2 Layer CNN | glove | 0.9827 | 0.9828 | 0.9883 |
| 2 Layer CNN | lexvec | 0.9824 | 0.9831 | 0.9880 |
| 2 Layer CNN | toxic | 0.9806 | 0.9789 | 0.9880 |
| SVM with NB features | NA | 0.9813 | 0.9813 | 0.9863 |

## Main Idea
For each missing word `w` in `missing_words`, find the most similar word `most_similar_word` in `shared_words` (comparing their local vectors) and reuse that word's external vector for `w`:

```
local = {local_words: local_vectors}
external = {external_words: external_vectors}
shared_words = intersect(local_words, external_words)
missing_words = setdiff(local_words, external_words)
reference_matrix = array(local[w] for w in shared_words).T

for w in missing_words:
    similarity = dot(local[w], reference_matrix)  # one score per shared word
    most_similar_word = shared_words[argmax(similarity)]
    external[w] = external[most_similar_word]

return {w: external[w] for w in local_words}
```
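The pseudocode above can be tried on a toy vocabulary. The following is a minimal sketch, not the notebook's implementation: the words, the 2-d local vectors, the 3-d external vectors, and the `unit` helper are all made up for illustration.

```python
import numpy as np

# Hypothetical local embedding: covers every word we need.
local = {
    "cat":  np.array([1.0, 0.0]),
    "dog":  np.array([0.0, 1.0]),
    "catt": np.array([0.9, 0.1]),  # misspelling, absent from the external set
}
# Hypothetical external embedding: has no entry for "catt".
external = {
    "cat": np.array([1.0, 2.0, 3.0]),
    "dog": np.array([4.0, 5.0, 6.0]),
}

shared_words = sorted(set(local) & set(external))
missing_words = sorted(set(local) - set(external))

def unit(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    return v / np.linalg.norm(v)

# Columns are the unit-length local vectors of the shared words.
reference_matrix = np.array([unit(local[w]) for w in shared_words]).T

filled = dict(external)
for w in missing_words:
    similarity = unit(local[w]) @ reference_matrix
    most_similar_word = shared_words[int(np.argmax(similarity))]
    filled[w] = external[most_similar_word]

# Keep only the words that occur in the local vocabulary.
result = {w: filled[w] for w in local}
print(missing_words, "->", result["catt"])
```

In this toy case `catt` lies closest to `cat` in the local space, so it inherits the external vector of `cat`.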

## Setup

In [1]:
import os
import torch
import timeit
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

os.environ['CUDA_DEVICE_ORDER']='PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES']='0'

In [2]:
# load data
def read_embedding(file_name):
  """Read a text embedding with one word followed by its vector per line."""
  print(f'Loading {file_name}')
  word_list, word_vectors = [], {}
  # the context manager closes the file even if parsing fails
  with open(file_name, encoding='utf-8') as f:
    for line in f:
      split_line = line.split()
      w = split_line[0]
      v = np.array([float(val) for val in split_line[1:]])

      word_list.append(w)
      word_vectors[w] = v
  return np.array(word_list), word_vectors
  
local_words, local_vectors = read_embedding('../../input/tmp/fasttext-local.txt')
external_words, external_vectors = read_embedding('../../input/tmp/glove.twitter.27B.200d.txt')  

Loading ../../input/tmp/fasttext-local.txt
Loading ../../input/tmp/glove.twitter.27B.200d.txt


In [3]:
# find missing and shared words
missing_words = np.setdiff1d(local_words, external_words)
shared_words = np.intersect1d(local_words, external_words)
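
A small illustration of what the two NumPy set functions return may help here. Both give sorted, de-duplicated arrays, so positions in `shared_words` do not correspond to positions in `local_words`; this is why the later cells build their matrices by iterating over `shared_words` and `missing_words` directly. The word lists below are invented for the example.

```python
import numpy as np

# Hypothetical vocabularies; note the duplicate and the unsorted order.
local_words = np.array(["zebra", "apple", "mango", "apple"])
external_words = np.array(["mango", "kiwi", "apple"])

missing_words = np.setdiff1d(local_words, external_words)   # in local only
shared_words = np.intersect1d(local_words, external_words)  # in both

print(missing_words)
print(shared_words)
```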

In [4]:
# create reference matrix
reference_matrix = np.array([local_vectors[w] for w in shared_words])
reference_matrix = normalize(reference_matrix).T # word vectors are columns

# create lookup matrix
lookup_matrix = np.array([local_vectors[w] for w in missing_words])
lookup_matrix = normalize(lookup_matrix)
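
Normalizing both matrices turns every dot product into a cosine similarity, so long vectors do not win the `argmax` merely because of their length. The sketch below checks this on random data; `l2_normalize_rows` is a hypothetical stand-in written in plain NumPy for the row-wise L2 scaling that `sklearn.preprocessing.normalize` applies by default.

```python
import numpy as np

def l2_normalize_rows(x):
    """Divide each row by its Euclidean norm (assumes no all-zero rows)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
reference = rng.normal(size=(5, 4))  # 5 hypothetical shared words, 4 dimensions
lookup = rng.normal(size=(3, 4))     # 3 hypothetical missing words

# Same layout as the notebook: lookup rows times reference columns.
similarity = l2_normalize_rows(lookup) @ l2_normalize_rows(reference).T

# Explicit cosine similarity for one (missing, shared) pair.
a, b = lookup[0], reference[2]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(similarity[0, 2], cosine))
```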

## Evaluate Results

In [5]:
# find words similar to random missing words
for w in np.random.choice(missing_words, 10):
  similarity = np.matmul(local_vectors[w], reference_matrix)
  similar_word = shared_words[np.argmax(similarity)]
  print('missing: {0: <10}   most similar: {1}'.format(w[:10], similar_word))

missing: succesors    most similar: succes
missing: ratul        most similar: satul
missing: donorsthat   most similar: donors
missing: pollypocke   most similar: lollypops
missing: controvert   most similar: contentious
missing: actualyy     most similar: actualy
missing: cjaed        most similar: cja
missing: labadii      most similar: hadii
missing: dreun        most similar: sik
missing: vadabalija   most similar: dijalankan


### Loop through missing words
* uses the least memory of the CPU implementations
* slowest implementation

In [9]:
def fill_missing_loop():
  for w in missing_words:
    similarity = np.matmul(local_vectors[w], reference_matrix)
    similar_word = shared_words[np.argmax(similarity)]
    external_vectors[w] = external_vectors[similar_word]

In [10]:
timeit.timeit(fill_missing_loop, number=1)

1124.9556284940045

### Vectorized computation
* needs more than 16 GB of memory
* about 5.5x faster than looping (206 s vs. 1125 s)

In [11]:
def fill_missing_vectorized():
  similarity = np.matmul(lookup_matrix, reference_matrix)
  similar_words = shared_words[np.argmax(similarity, axis=1)]
  for m,s in zip(missing_words, similar_words):
    external_vectors[m] = external_vectors[s]

In [12]:
timeit.timeit(fill_missing_vectorized, number=1)

205.92537692499172
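
The memory cost comes from materializing the full similarity matrix, which has one row per missing word and one column per shared word. The sketch below estimates that size and shows a chunked CPU variant that bounds it. Both helper names and the vocabulary sizes are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np

def similarity_matrix_gib(n_missing, n_shared, itemsize=8):
    """Size in GiB of a dense (n_missing x n_shared) matrix; float64 by default."""
    return n_missing * n_shared * itemsize / 2**30

def nearest_shared_index(lookup_matrix, reference_matrix, chunk_size):
    """Index of the most similar shared word per lookup row, chunk by chunk."""
    n_lookups = lookup_matrix.shape[0]
    best = np.empty(n_lookups, dtype=np.int64)
    for start in range(0, n_lookups, chunk_size):
        stop = min(start + chunk_size, n_lookups)
        # Only a (chunk_size x n_shared) block is held in memory at once.
        similarity = lookup_matrix[start:stop] @ reference_matrix
        best[start:stop] = np.argmax(similarity, axis=1)
    return best

# Hypothetical sizes: 50,000 missing words against 100,000 shared words.
print(round(similarity_matrix_gib(50_000, 100_000), 2), "GiB")
```

With the notebook's variables, `shared_words[nearest_shared_index(lookup_matrix, reference_matrix, chunk_size)]` would give the same words as the fully vectorized version while keeping peak memory proportional to `chunk_size`.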

### Computation on GPU
* requires PyTorch and a CUDA-capable GPU
* chunk size needs tuning (see the timing table below)
* fastest implementation, almost 4x faster than the vectorized CPU version

In [27]:
def fill_missing_gpu_chunks():

  # setup
  chunk_size = 500
  n_lookups = lookup_matrix.shape[0]
  n_chunks = (n_lookups + chunk_size - 1) // chunk_size  # ceiling division, no empty last chunk

  # convert numpy arrays to float32 torch tensors on the GPU
  def np2tc(x): return torch.from_numpy(x).float().cuda()
  reference_matrix_gpu = np2tc(reference_matrix)

  # iterate through chunks
  for i in range(n_chunks):
    chunk_indexs = slice(chunk_size*i, min(chunk_size*(i+1), n_lookups))
    similarity = torch.mm(np2tc(lookup_matrix[chunk_indexs]), reference_matrix_gpu)
    _, similar_indexs = torch.max(similarity, 1)
    # move the indices back to the CPU before indexing the numpy array
    similar_words = shared_words[similar_indexs.cpu().numpy()]
    for m,s in zip(missing_words[chunk_indexs], similar_words):
      external_vectors[m] = external_vectors[s]

In [28]:
timeit.timeit(fill_missing_gpu_chunks, number=1)

54.229991279993556

```
chunk_size  GPU RAM  duration (s)
100         2.0 GB   68.6401116699999
500         2.7 GB   52.75159321799583
1000        3.5 GB   53.69221766899864
2500        6.1 GB   54.229991279993556
```
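
A table like the one above can be produced with a small timing harness. The sketch below is one possible way to do it, shown on a CPU stand-in workload with made-up matrix sizes; `time_chunk_sizes` and `run` are hypothetical names. When timing GPU code, keep in mind that CUDA kernels run asynchronously, so the clock should only be read after the results have been copied back to the CPU or after a call to `torch.cuda.synchronize()`.

```python
import time
import numpy as np

def time_chunk_sizes(run, chunk_sizes):
    """Return {chunk_size: seconds} for a callable run(chunk_size)."""
    durations = {}
    for chunk_size in chunk_sizes:
        start = time.perf_counter()
        run(chunk_size)
        durations[chunk_size] = time.perf_counter() - start
    return durations

# Toy stand-in for the similarity search (sizes are arbitrary).
rng = np.random.default_rng(0)
lookup = rng.normal(size=(2000, 50))
reference = rng.normal(size=(50, 3000))

def run(chunk_size):
    for start in range(0, lookup.shape[0], chunk_size):
        np.argmax(lookup[start:start + chunk_size] @ reference, axis=1)

durations = time_chunk_sizes(run, [100, 500, 1000])
best = min(durations, key=durations.get)
print(durations, "fastest:", best)
```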

## Save Results

In [None]:
def write_embedding(save_name):
  # write as UTF-8 to match read_embedding; the context manager closes the file
  with open(save_name, 'w', encoding='utf-8') as fwrite:
    for word, vec in external_vectors.items():
      fwrite.write(word + ' ' + ' '.join(vec.astype(str)) + '\n')
  
write_embedding('../../input/tmp/output-embedding.txt')
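
Before using the saved file downstream, a quick round-trip check can confirm that it parses back to the same vectors. The sketch below is self-contained and uses simplified variants of the two helpers (here `write_embedding` takes the dictionary as a second argument, unlike the notebook version), a temporary directory, and two invented words.

```python
import os
import tempfile
import numpy as np

def write_embedding(save_name, vectors):
    """Write one 'word v1 v2 ...' line per entry (simplified variant)."""
    with open(save_name, 'w', encoding='utf-8') as f:
        for word, vec in vectors.items():
            f.write(word + ' ' + ' '.join(vec.astype(str)) + '\n')

def read_embedding(file_name):
    """Parse the same format back into a {word: vector} dictionary."""
    word_vectors = {}
    with open(file_name, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            word_vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return word_vectors

# Hypothetical data: "catt" reuses the vector of "cat".
vectors = {'cat': np.array([0.25, -1.5]), 'catt': np.array([0.25, -1.5])}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy-embedding.txt')
    write_embedding(path, vectors)
    restored = read_embedding(path)

print(all(np.allclose(vectors[w], restored[w]) for w in vectors))
```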