# Few-shot learning on textual data with siamese neural networks

If you're doing machine learning and meet a classification problem with many categories and only a few examples per category, the conventional wisdom is that you're in trouble. Unfortunately, acquiring more data to solve this issue is not always easy, or even doable. This problem of learning with only a few examples per category is called "few-shot learning", or "one-shot learning" in the extreme case of a single example per class (yes, you can even do this and [obtain decent results](https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf)!).

Most of the machine learning research on one-shot learning involves images, but some [recent research papers](https://arxiv.org/abs/1710.10280) address the same problem in the Natural Language Processing (NLP) realm.

In this blog post, I will use a siamese neural network to tackle few-shot learning, following a [method that was originally applied to images](https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf) and that is nicely explained [here](https://sorenbouma.github.io/blog/oneshot/).

A nice example of a few-shot learning problem in NLP is job title classification. If you want to group job titles into categories or "occupations" (e.g. gather "Programmer" and "Software engineer" in one occupation), then unless you have hundreds of job title examples per occupation, you are facing a few-shot learning problem. The U.S. government provides such a job title taxonomy: the [Standard Occupational Classification](https://www.bls.gov/soc/). I'll use it as a toy dataset to understand how few-shot learning with siamese neural networks works.

Let's start by downloading the taxonomy and checking what's in there.

In [36]:
from io import StringIO
import requests
import pandas as pd

# Download the Standard Occupation Classification
file_url = 'https://www.onetcenter.org/dl_files/database/db_20_1_text/Sample%20of%20Reported%20Titles.txt'
csv = StringIO(requests.get(file_url).text)

# Load it in a pandas DataFrame and drop a useless column
df = pd.read_csv(csv, sep='\t').drop('Shown in My Next Move', axis=1)

# Remove occupations for which we have only one example, because
# we can't have both a train and a test example for those
occupations_counts = df['O*NET-SOC Code'].value_counts()
multi_examples_occupations = occupations_counts[occupations_counts > 1].index
# .copy() avoids pandas' SettingWithCopyWarning when the column is modified below
df = df[df['O*NET-SOC Code'].isin(multi_examples_occupations)].copy()

# Lower all job titles for simplicity
df['Reported Job Title'] = df['Reported Job Title'].str.lower()

df.head(20)

| | O*NET-SOC Code | Reported Job Title |
|---|---|---|
| 0 | 11-1011.00 | chief diversity officer (cdo) |
| 1 | 11-1011.00 | chief executive officer (ceo) |
| 2 | 11-1011.00 | chief financial officer (cfo) |
| 3 | 11-1011.00 | chief nursing officer |
| 4 | 11-1011.00 | chief operating officer (coo) |
| 5 | 11-1011.00 | executive director |
| 6 | 11-1011.00 | executive vice president (evp) |
| 7 | 11-1011.00 | operations vice president |
| 8 | 11-1011.00 | president |
| 9 | 11-1011.00 | vice president |


The downloaded file contains job category codes (first column) and samples of job titles that belong to those categories. To get a category's description, type the occupation code [here](https://www.onetonline.org/help/online/search). For instance, the first occupation (11-1011.00) is called "Chief Executives", as one can guess from the corresponding job title examples.

Let's investigate a bit:

In [23]:
df.nunique()  # Count the number of distinct values in each column

O*NET-SOC Code         954
Reported Job Title    7172
dtype: int64

So we have 954 categories ("occupations") for 7172 examples, i.e. 7.5 examples per category on average. We are definitely in the few-shot learning setting. 

Before proceeding with modelling, let's create train and test sets by putting one example of each class in the test set.

In [46]:
test_set = df.groupby('O*NET-SOC Code', as_index=False)['Reported Job Title'].first()
train_set = df[~df['Reported Job Title'].isin(test_set['Reported Job Title'])]

x_train, y_train = train_set['Reported Job Title'], train_set['O*NET-SOC Code']
x_test, y_test = test_set['Reported Job Title'], test_set['O*NET-SOC Code']

## Building a baseline

Before experimenting with fancy models, let's establish a strong baseline. We can start by using word embeddings to get a vector representation of each job title, and feed it to a nearest neighbor classifier, which is less likely to overfit than tree-based models or parametric classifiers.

To get the representation of a sentence from pre-trained word embeddings I'll use [Zeugma](https://github.com/nkthiebaut/zeugma), an NLP Python library I've written that provides pre-trained word embeddings in the form of [scikit-learn transformers](http://scikit-learn.org/stable/modules/pipeline.html).

In [38]:
from zeugma import EmbeddingTransformer

# We'll use the GloVe pre-trained embeddings, using the sum of the word embeddings
# of a job title as the embedding vector
embedding = EmbeddingTransformer('glove', aggregation='sum')

Using TensorFlow backend.


In [47]:
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

# Our model is a nearest neighbor classifier, the input of which is the sum of the 
# embeddings of words in the job title.
clf = KNeighborsClassifier(n_neighbors=1)
baseline = make_pipeline(embedding, clf)
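
To make the idea behind this pipeline concrete, here is a minimal sketch of "sum the word vectors, then pick the closest training title". It does not use Zeugma or scikit-learn: the 3-dimensional vectors in `toy_vectors` are made up for illustration, the helpers `embed` and `predict_1nn` are hypothetical names, and skipping unknown words is an assumption of this sketch rather than a statement about how Zeugma handles them.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for this illustration only
toy_vectors = {
    "software": np.array([1.0, 0.0, 0.0]),
    "engineer": np.array([0.8, 0.2, 0.0]),
    "programmer": np.array([0.9, 0.1, 0.0]),
    "president": np.array([0.0, 1.0, 0.0]),
    "vice": np.array([0.0, 0.7, 0.3]),
}


def embed(title):
    """Sum the vectors of the known words in a title (unknown words are skipped)."""
    vectors = [toy_vectors[word] for word in title.split() if word in toy_vectors]
    return np.sum(vectors, axis=0) if vectors else np.zeros(3)


def predict_1nn(title, train_titles, train_labels):
    """Return the label of the training title closest in Euclidean distance."""
    query = embed(title)
    distances = [np.linalg.norm(query - embed(t)) for t in train_titles]
    return train_labels[int(np.argmin(distances))]


# Toy training set with made-up category names
train_titles = ["software engineer", "president"]
train_labels = ["software", "executive"]

prediction_programmer = predict_1nn("programmer", train_titles, train_labels)
prediction_vp = predict_1nn("vice president", train_titles, train_labels)
print(prediction_programmer, prediction_vp)
```

Note that with the sum aggregation the norm of a title vector grows with the number of words, so titles of very different lengths can end up far apart even when they share vocabulary.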

In [54]:
baseline.fit(x_train, y_train)
print('Train accuracy (baseline): {:.2f} %'.format(100*baseline.score(x_train, y_train)))
print('Test accuracy (baseline): {:.2f} %'.format(100*baseline.score(x_test, y_test)))

Train accuracy (baseline): 79.42 %
Test accuracy (baseline): 12.26 %


Definitely not great, but not too bad for a simple baseline model, considering that a random guess would reach an accuracy of only $\frac{1}{n_{\text{classes}}} \simeq 0.1 \%$.
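
As a quick sanity check of that comparison, the snippet below recomputes the chance level from the number of classes found earlier (954) and compares it with the baseline test accuracy reported above (12.26 %). The variable names are mine, and the chance level assumes a uniform random guess over all classes.

```python
n_classes = 954            # number of occupations, from df.nunique()
baseline_test_acc = 12.26  # baseline test accuracy reported above, in percent

# Accuracy (in percent) of a uniform random guess over all classes
chance_acc = 100 / n_classes

# How many times better than chance the baseline is
lift = baseline_test_acc / chance_acc

print('Chance accuracy: {:.2f} %'.format(chance_acc))
print('Baseline is about {:.0f} times better than chance'.format(lift))
```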

It may seem hard to beat this simple baseline with a deep learning model, given the high risk of overfitting on such a small dataset, but this is where siamese networks come to the rescue.

## Few-shot learning with siamese neural networks

The nearest neighbor model of the previous section performs quite well despite its simplicity, because it uses word embeddings learnt on the Twitter dataset. This is the basic principle of [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning).

Nevertheless, the embedding space used to determine nearest neighbors knows nothing about job titles in particular. There must be a way to learn an embedding space in which jobs belonging to the same occupation category are closer. This is where siamese networks come into play.

The main idea of siamese networks is to learn such a representation by training a model to discriminate between pairs of examples that are in the same category, and pairs of examples that come from different categories. 
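
To give a feel for what such training data could look like, here is a minimal sketch that samples labelled pairs of job titles: label 1 when both titles share a category, 0 otherwise. This is only one possible way to build pairs, not necessarily the one used in the rest of this series. The function `make_pairs`, the alternating positive/negative sampling, and the toy category names are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def make_pairs(titles, labels, n_pairs, seed=0):
    """Sample (title_a, title_b, same_category) triples, half positive, half negative."""
    rng = random.Random(seed)

    # Group titles by category
    by_label = defaultdict(list)
    for title, label in zip(titles, labels):
        by_label[label].append(title)

    # A positive pair needs a category with at least two titles
    multi_title_labels = [l for l, ts in by_label.items() if len(ts) >= 2]
    all_labels = list(by_label)

    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:
            # Positive pair: two different titles from the same category
            label = rng.choice(multi_title_labels)
            title_a, title_b = rng.sample(by_label[label], 2)
            pairs.append((title_a, title_b, 1))
        else:
            # Negative pair: one title from each of two different categories
            label_a, label_b = rng.sample(all_labels, 2)
            pairs.append((rng.choice(by_label[label_a]), rng.choice(by_label[label_b]), 0))
    return pairs


# Toy data with made-up category names
titles = ["programmer", "software engineer", "president", "vice president"]
labels = ["software", "software", "executive", "executive"]

pairs = make_pairs(titles, labels, n_pairs=6)
for title_a, title_b, same in pairs:
    print(title_a, '|', title_b, '|', same)
```

A siamese network would then be trained to predict the last element of each triple from the two titles, which forces it to place titles of the same occupation close together.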

To be continued, stay tuned!