## Few-Shot Text Classification with Pre-Trained Word Embeddings
This notebook provides code to reproduce the results from our paper, Few-Shot Text Classification with Pre-Trained Word Embeddings and a Human in the Loop, specifically the results obtained on the 20 Newsgroups dataset.

To begin, some setup...

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from operator import itemgetter
from itertools import cycle, islice
import pandas as pd
import numpy as np
import sif_embedding_wrapper
import utils
import itertools

### Load the pre-trained GloVe embeddings
Here we use the SIF code to load up the words, embeddings and weights we'll be using to vectorize our documents.

In [2]:
words, embs, weight4ind = sif_embedding_wrapper.load_embeddings("/tmp/glove.6B/glove.6B.300d.txt", 
                                                     '/tmp/enwiki_vocab_min200.txt')
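
For readers unfamiliar with SIF (smooth inverse frequency) weighting, the sketch below illustrates the general idea on toy data: each word vector is weighted by a / (a + p(w)), the weighted vectors are averaged, and the projection onto the first principal direction is removed. The helper name `sif_embed`, the value `a=1e-3`, the random toy vectors, and the word probabilities are all assumptions made for illustration. This is not the code inside `sif_embedding_wrapper`.

```python
import numpy as np

def sif_embed(sentences, word_vecs, word_prob, a=1e-3):
    """Toy SIF sentence embeddings (hypothetical helper, not the wrapper's API)."""
    dim = len(next(iter(word_vecs.values())))
    out = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        tokens = [t for t in sent.lower().split() if t in word_vecs]
        if not tokens:
            continue  # out-of-vocabulary sentences keep an all-zero vector
        # Frequent words get small weights, rare words get weights close to 1.
        weights = np.array([a / (a + word_prob[t]) for t in tokens])
        vecs = np.array([word_vecs[t] for t in tokens])
        out[i] = weights.dot(vecs) / len(tokens)
    # Remove each row's projection onto the first principal direction.
    _, _, vt = np.linalg.svd(out, full_matrices=False)
    pc = vt[0]
    return out - np.outer(out.dot(pc), pc)

# Made-up vocabulary, vectors, and unigram probabilities.
rng = np.random.RandomState(0)
vocab = ["car", "engine", "pitcher", "inning", "the"]
word_vecs = {w: rng.randn(4) for w in vocab}
word_prob = {"car": 0.01, "engine": 0.005, "pitcher": 0.004,
             "inning": 0.003, "the": 0.5}

sentence_vecs = sif_embed(
    ["the car engine", "the pitcher inning", "the car"], word_vecs, word_prob)
print(sentence_vecs.shape)
```

In this toy setup the word "the" receives a weight of roughly 0.002, so it contributes almost nothing to the sentence vector, which is the intended effect of the weighting.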

### Classification example with 20 Newsgroups
Here we define a function that builds a batch of documents to be classified from a subset of categories in the 20 Newsgroups dataset. It builds a dataframe holding the document ID and text for each document, and a dict mapping each document ID to its ground-truth category. It returns both, along with a list of the simplified category names.

In [3]:
def create_dataset_for_newsgroup_pair(category_pair):
    newsgroups_train = fetch_20newsgroups(subset='train', categories=category_pair, remove=('headers', 'footers', 'quotes'))
    docs = {}
    for i,text in enumerate(newsgroups_train.data):
        doc_id = str(i+1)
        docs[doc_id] = {
            "text": text.strip().strip('"'),
            "category_ind": newsgroups_train.target[i]
        }
    all_doc_ids = sorted(list(docs.keys()))
    df = pd.DataFrame({"text": [docs[d]["text"] for d in all_doc_ids], 
                       "category_ind": [docs[d]["category_ind"] for d in all_doc_ids], 
                       "doc_id": [d for d in all_doc_ids]})
    labels = []
    for i in df["category_ind"]:
        parts = newsgroups_train.target_names[i].split(".")
        if parts[-1] == "misc":
            labels.append(parts[-2])
        else:
            labels.append(parts[-1])
    df["label"] = labels
    categories = list(df["label"].unique())
    text_df = pd.DataFrame({"doc_id": df["doc_id"], "text": df["text"]})
    truth_df = pd.DataFrame({"doc_id": df["doc_id"], "gt": df["label"]})
    truth_dict = {str(rec["doc_id"]): rec["gt"] for rec in truth_df.to_dict(orient="records")}
    return text_df, truth_dict, categories
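
The label simplification inside the function can be seen in isolation in the small sketch below. The helper name `simplify_label` is hypothetical and only mirrors the `split(".")` logic above: it keeps the last component of the newsgroup name, unless that component is "misc", in which case it keeps the one before it.

```python
def simplify_label(newsgroup_name):
    """Return the last name component, skipping a trailing 'misc'."""
    parts = newsgroup_name.split(".")
    return parts[-2] if parts[-1] == "misc" else parts[-1]

for name in ["rec.autos", "rec.sport.baseball", "talk.politics.misc"]:
    print(name, "->", simplify_label(name))
```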


### Create a batch with two categories
Here we create a batch using the "autos" and "baseball" categories, and create document vectors of the text for each document.

In [5]:
df, truth_dict, categories = create_dataset_for_newsgroup_pair(["rec.autos","rec.sport.baseball"])
doc_embeddings = sif_embedding_wrapper.sentences2vecs(df["text"], embs, words, weight4ind)
df["vector"] = pd.Series(list(doc_embeddings))

The function below contains all the classification logic. Given a dataframe of documents (ID, text, and vector) and a dict listing the documents to use as representatives for each category, the function predicts a category for each remaining document. Documents shorter than `min_text_length` characters are skipped.

In [6]:
def auto_classify(docs, category_reps, min_text_length=80):
    # Exclude docs deemed too short to classify.
    skip_prediction = list(docs[docs["text"].map(len) < min_text_length].doc_id)  # use the `docs` argument, not the global `df`
    categories = []
    for k,v in category_reps.items():
        categories.append(k)
        skip_prediction.extend(v) # No need to predict manually labeled docs
    category_vecs = {}
    for c in categories:
        vectors = np.asarray(list(docs.loc[docs['doc_id'].isin(category_reps[c])].vector))
        category_vecs[c] = np.mean(vectors, axis=0)

    predictions = {}
    for idx, row in docs.iterrows():
        doc_id = row["doc_id"]
        if doc_id in skip_prediction:
            continue
        max_sim = 0
        winner = categories[0]
        for j in category_vecs:
            sim = cosine_similarity(row["vector"].reshape(1, -1), category_vecs[j].reshape(1, -1)).flatten()[0]
            if sim > max_sim:
                max_sim = sim
                winner = j
        predictions[doc_id] = winner
    return predictions
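
The rule implemented by `auto_classify` amounts to nearest-centroid classification under cosine similarity. The sketch below shows the same idea in vectorized form on made-up 2-D vectors. All `toy_` names and the `unit_rows` helper are assumptions for illustration, and cosine similarity is computed directly with NumPy rather than with `cosine_similarity`.

```python
import numpy as np

# Made-up 2-D "document vectors".
toy_vectors = {
    "1": np.array([1.0, 0.1]),   # hand-labeled representative for "autos"
    "2": np.array([0.1, 1.0]),   # hand-labeled representative for "baseball"
    "3": np.array([0.9, 0.3]),
    "4": np.array([0.2, 0.8]),
}
toy_reps = {"autos": ["1"], "baseball": ["2"]}

def unit_rows(m):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# One centroid per category: the mean of its representatives' vectors.
toy_names = list(toy_reps)
centroids = np.vstack([
    np.mean([toy_vectors[d] for d in toy_reps[c]], axis=0) for c in toy_names
])

# Score every unlabeled document against every centroid at once.
labeled = {d for reps in toy_reps.values() for d in reps}
unlabeled = [d for d in toy_vectors if d not in labeled]
doc_matrix = np.vstack([toy_vectors[d] for d in unlabeled])
sims = unit_rows(doc_matrix).dot(unit_rows(centroids).T)

toy_preds = {d: toy_names[i] for d, i in zip(unlabeled, sims.argmax(axis=1))}
print(toy_preds)  # {'3': 'autos', '4': 'baseball'}
```

One difference from `auto_classify`: `argmax` picks the highest similarity even when every similarity is zero or negative, whereas the loop above falls back to the first category in that case.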

We'll use the pair of representative documents that our brute-force search over all combinations found to give the highest accuracy.

In [7]:
preds = auto_classify(df, {"autos": ["351"], "baseball": ["171"]})

### Calculate accuracy
Here we define a function that simply determines the fraction of correctly predicted documents, and call it with our predictions and the dict containing the ground truth for each document.

In [8]:
def get_accuracy_score(predictions, truth_dict):
    scores = []
    for k,v in predictions.items():
        if v == truth_dict[k]:
            scores.append(1)
        else:
            scores.append(0)
    if len(scores) == 0:
        return 0.0
    return sum(scores) / float(len(scores))

get_accuracy_score(preds, truth_dict)

0.9703065134099617

### Topic inference
In the above case, we already knew a combination of documents that would produce a high level of accuracy, but in a real scenario we'll use topic inference to try to surface documents that are likely to be good category representatives. We define a function that performs Latent Dirichlet Allocation on the batch and returns the documents in descending order of "topiciness" (each document's highest topic probability), interleaved according to topic. We then call this function on our "autos, baseball" batch.

In [9]:
def infer_topics(docs, n_topics, min_text_length=80, max_iter=150, batch_size=128, learning_offset=300.):
    unclassifiable = list(docs[docs["text"].map(len) < min_text_length].doc_id)
    filtered = docs[~docs['doc_id'].isin(unclassifiable)]
    ids = [d for d in list(filtered.doc_id)[0:10]]
    n_features = 1000
    tf_vectorizer = TfidfVectorizer(
        stop_words='english',
        max_df=0.95,
        min_df=0.1,
        max_features=n_features)
    tf = tf_vectorizer.fit_transform(list(filtered.loc[:, 'text']))
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        max_iter=max_iter,
        batch_size=batch_size,
        learning_method='online',
        learning_offset=learning_offset,
        random_state=0)
    lda.fit(tf)
    doc_topics = lda.transform(tf)
    topic_leaders = {"topic_{}".format(i): [] for i in iter(range(n_topics))}
    for idx, probs in enumerate(doc_topics):
        score = max(probs)
        topic = np.argmax(probs)

        doc_id = filtered.loc[filtered.index[idx]].doc_id
        topic_leaders["topic_{}".format(topic)].append(
            {"doc_id": doc_id, "score": score})
    for i in iter(range(n_topics)):
        topic_leaders["topic_{}".format(i)] = sorted(
            topic_leaders["topic_{}".format(i)], key=itemgetter('score'), reverse=True)

    def roundrobin(*iterables):
        "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
        # Recipe credited to George Sakkis
        pending = len(iterables)
        nexts = cycle(iter(it).__next__ for it in iterables)  # Python 3 iterators expose __next__, not .next
        while pending:
            try:
                for next in nexts:
                    yield next()
            except StopIteration:
                pending -= 1
                nexts = cycle(islice(nexts, pending))

    return list(roundrobin(*topic_leaders.values()))

ordered_docs = infer_topics(df, 2)
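
To illustrate the ordering that `infer_topics` returns, the sketch below interleaves two made-up topic lists that are already sorted by score. It uses `itertools.zip_longest` instead of the `roundrobin` recipe above, and the `interleave` helper, doc IDs, and scores are hypothetical. For lists like these it yields the same order: the best document of each topic first, then the second best, and so on.

```python
from itertools import zip_longest

def interleave(*ranked_lists):
    """Take the next-best item from each list in turn, skipping exhausted lists."""
    missing = object()  # sentinel that cannot collide with real items
    return [item
            for group in zip_longest(*ranked_lists, fillvalue=missing)
            for item in group
            if item is not missing]

# Made-up topic leaders, each list sorted by descending score.
topic_0 = [{"doc_id": "12", "score": 0.97},
           {"doc_id": "40", "score": 0.91},
           {"doc_id": "7", "score": 0.80}]
topic_1 = [{"doc_id": "33", "score": 0.95}]

order = [d["doc_id"] for d in interleave(topic_0, topic_1)]
print(order)  # ['12', '33', '40', '7']
```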

Here we take the top n documents from our LDA ordering, group them by ground-truth category, and form every combination that picks one document per category. We then calculate prediction accuracy for each of those combinations.

In [10]:
def get_top_lda_combs(ordered_ids, docs_df, categories, truth_dict, top_n=12):
    representatives = {c:[] for c in categories}
    for doc_id in ordered_ids[:top_n]:
        gt = truth_dict[str(doc_id)]
        representatives[gt].append(doc_id)
    for c in categories:
        if len(representatives[c]) == 0:
            print("No representatives for %s" % c)
            return None
    values = [representatives[c] for c in categories]
    doc_combs = list(itertools.product(*values))
    return doc_combs

def get_lda_accuracies(categories, doc_combs, docs_df, truth_dict):
    accuracies = []
    for comb in doc_combs:
        category_reps = {}
        for i,c in enumerate(categories):
            category_reps[c] = [str(comb[i])]
        preds = auto_classify(docs_df, category_reps)
        acc = get_accuracy_score(preds, truth_dict)
        accuracies.append(acc)
    return accuracies


top_lda_combs = get_top_lda_combs([d["doc_id"] for d in ordered_docs], 
                                  df, categories, truth_dict)
lda_accs = get_lda_accuracies(categories, top_lda_combs, df, truth_dict)
max(lda_accs)

0.946360153256705
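
As a small illustration of how `get_top_lda_combs` builds its candidates, the sketch below applies `itertools.product` to made-up doc IDs grouped by label and converts one combination into the dict format that `auto_classify` expects. The `toy_` names and IDs are hypothetical.

```python
import itertools

# Made-up top-ranked doc IDs grouped by ground-truth label.
toy_representatives = {"autos": ["12", "40", "7"], "baseball": ["33", "58"]}
toy_categories = ["autos", "baseball"]

# Every way of picking one representative per category,
# in the same order as toy_categories.
toy_combs = list(itertools.product(
    *(toy_representatives[c] for c in toy_categories)))
print(len(toy_combs))  # 6
print(toy_combs[0])    # ('12', '33')

# Map one combination back to a {category: [doc_id]} dict.
toy_category_reps = {c: [doc_id]
                     for c, doc_id in zip(toy_categories, toy_combs[0])}
print(toy_category_reps)  # {'autos': ['12'], 'baseball': ['33']}
```

The number of combinations is the product of the per-category counts, so with `top_n=12` and two categories there are at most 36 candidate pairs to evaluate.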