Supervised learning tutorial. Technique: Partial Least Squares Regression (PLSR)

Mapping semantic spaces for translation

In this tutorial, you will map from a small English semantic space to a Catalan semantic space. You may need a Catalan dictionary for the following exercises. Here's a good one: https://www.diccionaris.cat/.

Pen and paper exercise

The following are two toy semantic spaces, one for English, one for Catalan. Rows represent vectors, columns represent dimensions.

English:

            nature  argue
horse       0.3     0.0
dog         0.3     0.0
house       0.2     0.1
parliament  0.0     0.7
politics    0.1     0.9
right       0.1     0.6
wrong       0.1     0.7

Catalan:

            lluitar  arbre
cavall      0.1      0.3
gos         0.1      0.2
casa        0.0      0.2
parlament   0.5      0.0
política    0.6      0.0
correcte    0.4      0.0
equivocat   0.5      0.0

Now, you get a new vector in English, say:

       nature  argue
green  0.6     0.1

Which of the following two Catalan words do you think is the translation of green according to your semantic spaces? Why? NB: you don't need to know any calculus to solve this by hand.

         lluitar  arbre
verd     0.1      0.5
vermell  0.2      0.1
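
Once you have an answer by hand, you can check it with a few lines of plain Python. The following is only a sketch under stated assumptions: it fits an ordinary least-squares linear map with no intercept (a simpler stand-in for PLSR, and not necessarily what regression.py does), it assumes that the rows of the English and Catalan tables are aligned translation pairs, and it ranks the two candidates by cosine similarity. All function names are hypothetical.

```python
import math

# Toy spaces from the exercise; rows are assumed to be aligned translation pairs.
english = [[0.3, 0.0], [0.3, 0.0], [0.2, 0.1], [0.0, 0.7],
           [0.1, 0.9], [0.1, 0.6], [0.1, 0.7]]   # dimensions: nature, argue
catalan = [[0.1, 0.3], [0.1, 0.2], [0.0, 0.2], [0.5, 0.0],
           [0.6, 0.0], [0.4, 0.0], [0.5, 0.0]]   # dimensions: lluitar, arbre


def fit_linear_map(X, Y):
    """Least-squares 2x2 matrix W such that x @ W approximates y (no intercept)."""
    # Entries of the symmetric 2x2 matrix X^T X.
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]
    # X^T Y, then W = (X^T X)^-1 X^T Y (the normal equations).
    xty = [[sum(x[i] * y[j] for x, y in zip(X, Y)) for j in range(2)]
           for i in range(2)]
    return [[sum(inv[i][k] * xty[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def apply_map(x, W):
    """Map an English vector x into the Catalan space."""
    return [sum(x[i] * W[i][j] for i in range(2)) for j in range(2)]


def cosine(u, v):
    """Cosine similarity between two 2-dimensional vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


W = fit_linear_map(english, catalan)
predicted = apply_map([0.6, 0.1], W)             # the new English vector, green
candidates = {"verd": [0.1, 0.5], "vermell": [0.2, 0.1]}
best = max(candidates, key=lambda w: cosine(predicted, candidates[w]))
print(predicted, best)
```

The fit is done with explicit 2x2 normal equations only because the toy spaces have two dimensions; for real spaces you would use a linear algebra library instead.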

Running the visualisation code

Running the following will create pictures of the English and Catalan spaces in your directory.

python3 viz.py

You can also visualise the neighbourhood of a specific word, e.g.:

python3 viz.py en bird 20

(This will give you a graph of the 20 nearest neighbours of bird in the English space.)

For Catalan, you can similarly do:

python3 viz.py cat ocell 20
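
To make the notion of nearest neighbours concrete, here is a minimal sketch of a neighbour lookup on the English toy space from the pen and paper exercise. It assumes cosine similarity as the measure and leaves the query word out of its own neighbour list; viz.py may make different choices, and the function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(space, word, k):
    """Return the k words whose vectors are most similar to that of `word`."""
    target = space[word]
    others = [w for w in space if w != word]      # assumption: skip the query word
    others.sort(key=lambda w: cosine(target, space[w]), reverse=True)
    return others[:k]


# English toy space from the pen and paper exercise (dimensions: nature, argue).
english = {
    "horse": [0.3, 0.0], "dog": [0.3, 0.0], "house": [0.2, 0.1],
    "parliament": [0.0, 0.7], "politics": [0.1, 0.9],
    "right": [0.1, 0.6], "wrong": [0.1, 0.7],
}

print(nearest_neighbours(english, "horse", 2))
```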

Preliminaries to running the code

Familiarise yourself with the content of the data/ directory. The pairs.txt file contains gold standard translations from English to Catalan. english.subset.dm and catalan.subset.dm are subsets of an English and a Catalan semantic space corresponding to the words occurring in pairs.txt.

There are 166 pairs, which we will split into 120 pairs for training and 46 for testing. You can look at the test pairs by doing:

tail -46 data/pairs.txt
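
If you would rather inspect the data from Python, the sketch below shows one way to read the files and reproduce the split. It rests on assumptions that you should check against the actual files and utils.py: each line of a .dm file is taken to be a word followed by whitespace-separated numbers, each line of pairs.txt is taken to be an English word followed by its Catalan translation, and the test set is taken to be the last 46 pairs, as the tail command suggests. The function names are hypothetical.

```python
def read_dm(lines):
    """Parse lines of the assumed form 'word v1 v2 ...' into {word: vector}."""
    space = {}
    for line in lines:
        fields = line.split()
        if not fields:                      # skip blank lines
            continue
        space[fields[0]] = [float(v) for v in fields[1:]]
    return space


def read_pairs(lines):
    """Parse lines of the assumed form 'english catalan' into a list of tuples."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def split_pairs(pairs, n_train=120):
    """First n_train pairs for training, the rest for testing."""
    return pairs[:n_train], pairs[n_train:]


# Tiny in-memory demonstration, so the sketch runs without the data/ directory.
demo_space = read_dm(["horse 0.3 0.0", "dog 0.3 0.0"])
demo_pairs = read_pairs(["horse cavall", "dog gos", "house casa"])
train, test = split_pairs(demo_pairs, n_train=2)
print(demo_space, train, test)
```

With the real files you would pass an open file object instead of a list, for example `read_pairs(open("data/pairs.txt", encoding="utf-8"))`, and keep the default of 120 training pairs.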

Just looking at the data and the associated visualisation (before running anything), can you tell where the model might do well and where it might fail?

Running the regression code

Running the code will split the data into training and test sets, calculate the regression matrix on the training data and evaluate it on the test set:

python3 regression.py

The output first gives the predictions for each pair. For instance, it could be:

bird ocell ['arbre', 'peix', 'ocell', 'gos', 'animal'] 1

Here, bird should have been translated as ocell. The 5 nearest neighbours of the predicted vector are arbre, peix, ocell, gos and animal, so the gold translation is among them.

The last line gives the precision @ k, where k is the number of nearest neighbours considered for evaluation.
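
As an illustration of how such a score can be computed, here is a minimal sketch. It assumes one common reading of precision @ k, namely the fraction of test words whose gold translation appears among the k nearest neighbours of the predicted vector; regression.py may compute it differently, and the function name is hypothetical.

```python
def precision_at_k(results, k):
    """Fraction of (gold, neighbours) items whose gold word is in the top k."""
    hits = sum(1 for gold, neighbours in results if gold in neighbours[:k])
    return hits / len(results)


# The bird example from the output above: gold translation plus ranked neighbours.
results = [("ocell", ["arbre", "peix", "ocell", "gos", "animal"])]

print(precision_at_k(results, k=5))   # ocell is among the 5 nearest neighbours
print(precision_at_k(results, k=2))   # but not among the 2 nearest
```

The same prediction counts as a hit at k=5 and as a miss at k=2, which is why the value of k should always be reported alongside the score.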

What can you say about the system's errors? Do they confirm your hypotheses?

All too easy?

There is a small Italian semantic space in the data/ folder, with 1000 frequent Italian words. You can try and build the regression for a new train/test set for Italian!
