# 2. FastText demo

### About this notebook

This notebook was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.1 (22/03/2022)

This notebook demonstrates how you may reuse a pre-trained language model from a Python library (here, fasttext).

Most language models already have pre-trained versions available online, along with a few basic methods, e.g. for getting the 10 closest words to a given word or vector representation, solving analogies, etc.

This is based on the paper P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, "Enriching Word Vectors with Subword Information", 2017 (https://arxiv.org/abs/1607.04606).

It also follows the (very nice) documentation provided at https://fasttext.cc/docs/en/unsupervised-tutorial.html

**Requirements:**
- Python 3 (tested on v3.9.6)
- Matplotlib (tested on v3.5.1)
- Numpy (tested on v1.22.1)
- Torch (tested on v1.10.1)
- Torchvision (tested on v0.11.2)
- Fasttext (tested on v0.9.2)
- We also strongly recommend setting up CUDA on your machine!

Important: You might have to pip install the **fasttext** package.

### Imports

In [None]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import functools
import matplotlib.pyplot as plt
import fasttext
import fasttext.util
CUDA = torch.cuda.is_available()
device = torch.device("cuda" if CUDA else "cpu")

### Download the model

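This command downloads a pre-trained English language model and saves it to a file (cc.en.300.bin) in the current working directory. With if_exists = 'ignore', the download is skipped when the file is already present.

Note: the model is large (several gigabytes), so the download takes a while.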

In [None]:
# Download the pre-trained model (skipped if the file already exists)
lang = 'en'
fasttext.util.download_model(lang, if_exists = 'ignore')
# Load the model from the downloaded file
model = fasttext.load_model('cc.en.300.bin')

### Getting a vector embedding for a word

The command below returns the word embedding of any word, here as a 300-dimensional NumPy array.

In [None]:
# Get vector embedding for word
word = "hello"
v = model.get_word_vector(word)
print(v)
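The cited paper builds each word vector from the vectors of the word's character n-grams, which is why an embedding can be produced even for a word that was not seen during training. The cell below is a minimal sketch of that n-gram extraction only. The function name `char_ngrams` and its default range (3 to 6 characters) are assumptions chosen for illustration, and the real library additionally hashes the n-grams and keeps a vector for the whole word.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, with boundary markers.

    Illustrative sketch only: this is not the fasttext implementation.
    """
    # The paper wraps each word in '<' and '>' to mark its boundaries
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


# Example from the paper: the 3-grams of "where"
print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Under this scheme, two words sharing many n-grams (e.g. a word and its plural) also share many of the vectors that are summed to form their embeddings.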

### Getting the closest words to a given word or vector

You may use the get_nearest_neighbors() method to get the 10 closest words in the vocabulary to a given word, or to a given word vector (for instance one produced by another model, e.g. an autocompletion AI trying to predict the next word in a given sentence).

In order to find nearest neighbors, we need to compute a similarity score between words. Our words are represented by continuous word vectors, so we can apply simple similarity measures to them. In particular, we use the cosine of the angle between two vectors (to be discussed in W9S3).

This similarity is computed for all words in the vocabulary, and the 10 most similar words are shown. When the query is a word, fasttext leaves that word itself out of the results. A vector that coincides with the embedding of a vocabulary word has a similarity of 1 with that word, which therefore comes out on top.

In [None]:
# Show closest words and their similarity scores for a given word
l = model.get_nearest_neighbors('university')
print(l)

In [None]:
# Show closest words and their similarity scores for a given word vector
vec = model.get_word_vector("asparagus")
l2 = model.get_nearest_neighbors(vec)
print(l2)
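The similarity search described above can also be sketched by hand, which is useful when the query vector comes from another model. The cell below is a minimal illustration on a toy dictionary of made-up 3-dimensional embeddings. The names `cosine_similarity`, `nearest_neighbors` and `toy_embeddings` are hypothetical and not part of fasttext.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=10):
    """Return the k (similarity, word) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), word)
              for word, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up embeddings, for illustration only
toy_embeddings = {
    "asparagus": [0.9, 0.1, 0.0],
    "broccoli": [0.8, 0.2, 0.1],
    "university": [0.0, 0.1, 0.9],
}

neighbors = nearest_neighbors(toy_embeddings["asparagus"], toy_embeddings, k=2)
print(neighbors)
```

Because the query here is a raw vector, the word it came from is not excluded and ranks first with a similarity of (numerically) 1. With the real model, the same idea could be applied to the vectors returned by model.get_word_vector() for each word in model.get_words(), although a pure Python loop over the full vocabulary would be slow.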

### Word analogies

You may even query word analogies, e.g.: Paris is to France as which word is to Spain? The get_analogies() method returns the top 10 candidate words along with their similarity scores.

In [None]:
# Play with analogies
# Paris is to France as which word is to Spain?
l3 = model.get_analogies("paris", "france", "spain")
print(l3)
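Analogy queries of this kind are commonly explained as vector arithmetic: the query vector is roughly paris - france + spain, and the answer is the nearest remaining word by cosine similarity. The cell below is a minimal sketch of that idea on made-up 4-dimensional embeddings. The helper names and the toy vectors are hypothetical, and the normalization of each input vector is an assumption about how the library forms the query.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def analogy(word_a, word_b, word_c, embeddings, k=10):
    """Rank words by similarity to a - b + c, excluding the three inputs."""
    va, vb, vc = (normalize(embeddings[w]) for w in (word_a, word_b, word_c))
    query = [a - b + c for a, b, c in zip(va, vb, vc)]
    excluded = {word_a, word_b, word_c}
    scored = [(cosine_similarity(query, vec), word)
              for word, vec in embeddings.items() if word not in excluded]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up features: [france-ness, spain-ness, germany-ness, capital-ness]
toy_embeddings = {
    "paris": [1.0, 0.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0, 0.0],
    "madrid": [0.0, 1.0, 0.0, 1.0],
    "spain": [0.0, 1.0, 0.0, 0.0],
    "berlin": [0.0, 0.0, 1.0, 1.0],
    "germany": [0.0, 0.0, 1.0, 0.0],
}

result = analogy("paris", "france", "spain", toy_embeddings, k=3)
print(result)
```

In this toy space, subtracting france from paris leaves mostly the "capital" direction, and adding spain moves the query towards madrid, which ranks first.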