<a href="https://colab.research.google.com/github/katrina906/CS6120-Summarization-Project/blob/main/compress_fasttext.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Compress FastText Model
The full FastText model uses too much RAM for practical use (for example, in Colab), so this notebook compresses it.
- Package: https://github.com/avidale/compress-fasttext
- Based on this blog post: https://medium.com/@vasnetsov93/shrinking-fasttext-embeddings-so-that-it-fits-google-colab-cd59ab75959e
  - The basic idea is to keep only a smaller set of the most frequent vocabulary words. FastText can still generate embeddings for out-of-vocabulary words from their character n-grams.
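
To make the out-of-vocabulary point concrete, here is a simplified sketch of how a subword model can build a vector for an unseen word: split the word into character n-grams, hash each n-gram into a fixed number of buckets, and average the bucket vectors. The function names, the toy bucket matrix, and the n-gram range are assumptions for illustration; the real FastText and gensim implementations differ in detail.

```python
import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of '<word>' (boundary markers included)."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def fnv1a_32(text):
    """32-bit FNV-1a hash, used here as a stand-in for the library's n-gram hash."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def oov_vector(word, bucket_vectors):
    """Average the bucket vectors selected by the word's hashed n-grams."""
    num_buckets = bucket_vectors.shape[0]
    rows = [fnv1a_32(gram) % num_buckets for gram in char_ngrams(word)]
    return bucket_vectors[rows].mean(axis=0)

# Toy bucket matrix (hypothetical sizes): 1000 buckets, 8 dimensions.
rng = np.random.default_rng(0)
toy_buckets = rng.normal(size=(1000, 8))
vec = oov_vector("summarizer", toy_buckets)
```

Because the vector depends only on the n-gram buckets, a word can be dropped from the stored vocabulary and still receive an embedding.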

Achieves ~71% mean cosine similarity with the original model (0.711 in the evaluation below).

In [28]:
%%capture
!pip install compress-fasttext
!pip install pqkmeans
!pip install gensim==3.8.3

In [4]:
import os
import tqdm
import numpy as np
import gensim

from collections import defaultdict
from gensim.models.utils_any2vec import ft_ngram_hashes  
import compress_fasttext
import sys
import pqkmeans

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Download the English wiki word vectors from https://fasttext.cc/docs/en/pretrained-vectors.html

# One-time step: load the wiki FastText .bin and save only the word-vector object, which is smaller than the full model
#ft = gensim.models.FastText.load_fasttext_format("/content/drive/MyDrive/data/wiki.en.bin")
#ft.wv.save('/content/drive/MyDrive/data/wiki.en.model')

In [5]:
# load full model
ft = gensim.models.KeyedVectors.load("/content/drive/MyDrive/data/wiki.en.model")

In [None]:
# compress with svd_ft, which factorizes the embedding matrices using a truncated SVD
small_model = compress_fasttext.svd_ft(ft)
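
As a rough illustration of the idea behind SVD-based compression (not the package's actual implementation), the sketch below factorizes a toy matrix with NumPy and keeps only the leading components. The helper name `svd_compress` and the toy sizes are assumptions made for this example.

```python
import numpy as np

def svd_compress(matrix, n_components):
    """Return two thin factors whose product approximates `matrix`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    left = u[:, :n_components] * s[:n_components]   # shape (rows, n_components)
    right = vt[:n_components]                       # shape (n_components, cols)
    return left, right

# Toy matrix that is exactly rank 5, so 5 components reconstruct it fully.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 50))

left, right = svd_compress(toy, n_components=5)
restored = left @ right

stored_values = left.size + right.size   # 1250 numbers instead of 10000
```

Real embedding matrices are not exactly low rank, so keeping fewer components trades reconstruction accuracy for memory, which is why the compressed vectors are only approximately similar to the originals.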

In [28]:
# save
small_model.save("/content/drive/MyDrive/data/shrunk_fasttext_svd.model")

## Evaluate similarity between original and compressed model 
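Test on the 1,000,000 most frequent words: compute the cosine similarity between each word's original and compressed vector, then average.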
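
For reference, the cosine similarity used in the evaluation cell can be written out directly. This is a minimal NumPy sketch with made-up vectors; `cosine_similarity` is a hypothetical helper, not a gensim function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; NaN if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return float("nan")
    return float(a @ b / denom)

# Made-up "original" and "compressed" vectors for one word.
original = np.array([1.0, 2.0, 3.0])
compressed = np.array([1.1, 1.9, 3.2])
score = cosine_similarity(original, compressed)
```

A score of 1 means the two vectors point in the same direction, and 0 means they are orthogonal. The zero-norm case is why the evaluation filters out NaN values before averaging.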

In [29]:
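# sort the vocabulary by corpus frequency, most frequent first
sorted_vocab = sorted(ft.vocab.items(), key=lambda x: x[1].count, reverse=True)
sims = []
for test_word, _ in sorted_vocab[:1000000]:
    # cosine_similarities returns a length-1 array here; take the scalar
    sim = ft.cosine_similarities(ft.get_vector(test_word), [small_model.get_vector(test_word)])[0]
    # skip NaN results, which occur when a vector has zero norm
    if not np.isnan(sim):
        sims.append(sim)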

In [25]:
print("Similarity:", np.mean(sims))

Similarity: 0.71135896
