# IC2S2 2020
## Tutorial: Advanced Text representation methods for CSS

Link to tutorial: https://github.com/orion-search/tutorials

## Orion
A research measurement and knowledge discovery tool that enables you to monitor progress in science, visually explore the scientific landscape and search for relevant publications.

Link to Orion: https://orion-search.org/  
Learn more about it here: https://youtu.be/m0s5sjlpfAY

## Team
### [Kostas Stathoulopoulos](https://twitter.com/kstathou)
Mozilla Open Science Fellow & Data Scientist at Nesta. I work at the intersection of machine learning, economics and policy. 

### [Zac Ioannidis](https://portfolio.izac.us/)
Creative technologist & data visualization engineer who designs and builds platforms and data-rich UIs that condense and allow for exploration of multi-dimensional datasets.  

His area of expertise lies in data visualization, information design, and creative uses of data. He is usually found working within multi-disciplinary teams in different stages of product development—from ideation and prototyping to production. 

### [Lilia Villafuerte](https://www.villafuerte.info/)
HCI researcher, interface designer and digital artist. Her work as a researcher and artist has been exhibited in Spain, Mexico, Egypt, Germany, Peru and the United States.

## Purpose of the tutorial
Learn how to transform textual data to vectors and use them in a variety of downstream tasks such as **clustering**, **semantic similarity** and **visualisation**.

![workflow](../figures/workflow.png)

## Outcomes
- How to use [transformers](https://github.com/huggingface/transformers) to create state-of-the-art vector representations of text data.
- How to build a [Faiss](https://github.com/facebookresearch/faiss) index for efficient similarity search and clustering of dense vectors.
- How to use [UMAP](https://github.com/lmcinnes/umap), a non-linear dimensionality reduction technique.
- How to cluster text vectors with [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/).

## How we will do this
We will do this by showcasing how **Orion's vector-based search engine** works and go through an example with **real data - academic publications on misinformation research.**

## Orion search engine
![orion_search_diagram](../figures/orion_search_diagram.png)

In this tutorial, we won't explain how Elasticsearch works but if you are interested in learning more about it, let me know by opening an issue on the [GitHub repo](https://github.com/orion-search/tutorials)!

## Let's begin!

## A *very* brief overview of text vectorisation methods
![embeddings](../figures/embeddings.png)

## Why transformers?

- Pretrained models
- Easy to use
- They come with their own tokenizers
- Generally better performance than word2vec and the other methods in the overview above

## How do we create sentence-level embeddings with BERT?
1. Average the embeddings of the last layer (similar to averaging word vectors in word2vec).
2. Use the special **CLS** token to represent the sentence.

**Both approaches tend to produce low-quality embeddings, often worse than averaging GloVe vectors.**
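
To make the two pooling strategies concrete, here is a minimal NumPy sketch. The token embeddings are random stand-ins for a transformer's last-layer output. The shapes, the `attention_mask` name and the convention that `[CLS]` sits at position 0 are assumptions that mirror common BERT usage, not code from this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the last hidden layer of a BERT-like model:
# 2 sentences, 5 token positions each, 768 dimensions per token.
token_embeddings = rng.normal(size=(2, 5, 768)).astype("float32")

# 1 marks a real token, 0 marks padding (the second sentence has 2 padded positions).
attention_mask = np.array([[1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0]], dtype="float32")

# Strategy 1: average the token vectors, ignoring padded positions.
mask = attention_mask[:, :, None]
mean_pooled = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# Strategy 2: take the vector of the first token, where BERT places [CLS].
cls_pooled = token_embeddings[:, 0, :]

print(mean_pooled.shape, cls_pooled.shape)  # (2, 768) (2, 768)
```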

## Let's code!

In [1]:
import umap
import torch
import faiss
import hdbscan
import numpy as np
import pandas as pd
import altair as alt
from sentence_transformers import SentenceTransformer

from tutorials.utils import id2details, vector_search, plot
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In this tutorial, we will use a subset of publications on **misinformation**, **disinformation** and **fake news** that were published between 2000 and 2020.  The documents were retrieved from [Microsoft Academic Graph (MAG)](https://www.microsoft.com/en-us/research/project/academic-knowledge).

We have done minimal data cleaning and dropped any publications that were missing an abstract, title or DOI. We are only showing a handful of columns: 
- Year
- Title
- Abstract
- Paper ID


In [26]:
df = pd.read_csv('../data/misinformation_papers.csv')
df.head()

| Unnamed: 0 | year | original_title | abstract | id |
|---|---|---|---|---|
| 0 | 2000 | Misinformation and the Currency of Democratic ... | Scholars have documented the deficiencies in p... | 2134599899 |
| 1 | 2000 | Memory conformity: Exploring misinformation ef... | Two experiments demonstrate that post-event in... | 2107611319 |
| 2 | 2000 | Women and the Internet: Promise and Perils | 683 THE INTERNET is empowering women in ways t... | 2057624592 |
| 3 | 2000 | Against the Odds: Breastfeeding Experiences of... | This qualitative study asked low income mother... | 2168906625 |
| 4 | 2000 | The Best of Both Worlds: An Online Self-Help G... | Online mental health groups can be classified ... | 2055049770 |


In [27]:
print(f'Number of unique papers: {df.shape[0]}')

Number of unique papers: 5501


In [28]:
# Store the abstracts in a list.
abstracts = list(df.abstract)

The [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library offers a variety of pretrained transformers. Check out this [spreadsheet](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit?usp=sharing) with all the available models. 

In this tutorial, we will use the `distilbert-base-nli-stsb-mean-tokens` model, which has the best performance on Semantic Textual Similarity tasks among the DistilBERT versions. Moreover, although it's slightly worse than BERT, it is considerably faster thanks to its smaller size. 

We use the same model in Orion's semantic search engine!

In [29]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')



Encoding an abstract is easy! The instantiated model has a `.encode` method which receives a list of texts, preprocesses them and returns a list of dense vectors. 

Each of these vectors will have a length of 768.

In [30]:
vector = model.encode([abstracts[1]])
print(f'Shape of the transformed abstract: {vector[0].shape}')

Shape of the transformed abstract: (768,)


In [7]:
%%time
# Encode all abstracts
# You can speed this up by using a GPU. Colab offers one for "free"
abstract_vectors = model.encode(abstracts)

CPU times: user 21min 23s, sys: 19 s, total: 21min 42s
Wall time: 5min 26s


In [8]:
len(abstract_vectors)

5501

## Vector similarity search with Faiss

## Why long-text searches?
- Keyword-based search engines are not expressive.
- Search by querying with a paragraph from a blog or paper abstract.
- Easier to discover unknown unknowns.

## What is Faiss?
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM.  

It scales to billions of vectors, given a GPU!

## Building a Faiss index
Faiss is built around the `Index` object which contains, and sometimes preprocesses, the searchable vectors. 

Faiss offers a large collection of [indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes), and it's possible to combine them into a [composite index](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes-(composite)).

Faiss handles collections of vectors of a fixed dimensionality d, typically a few 10s to 100s. These collections can be stored in matrices. 

**Note:** Faiss uses only 32-bit floating point matrices. This means that you will have to change the data type of the input before building the index.

In this tutorial, we will use `IndexFlatL2`:
- It's a simple index that performs a **brute-force L2 distance search**
- It **scales linearly**. Since we have ~5,000 vectors, it will work fine! If you are working with more vectors (>100,000) you might want to use some methods to [speed up your index](https://github.com/facebookresearch/faiss/wiki/Faster-search).
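
To illustrate what a brute-force L2 search with custom IDs computes, here is a minimal NumPy sketch. It is not how Faiss is implemented (Faiss is heavily optimised). The function name, the toy vectors and the IDs are made up for illustration. Like Faiss, which per its documentation reports squared L2 distances, the sketch skips the square root.

```python
import numpy as np

def brute_force_l2_search(vectors, ids, queries, k):
    """Return the k smallest squared L2 distances and the matching IDs per query."""
    # Pairwise squared distances, shape (n_queries, n_vectors).
    diffs = queries[:, None, :] - vectors[None, :, :]
    distances = (diffs ** 2).sum(axis=2)
    # Sort every row and keep the k closest vectors.
    order = np.argsort(distances, axis=1)[:, :k]
    return np.take_along_axis(distances, order, axis=1), ids[order]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 8)).astype("float32")  # toy stand-in for the abstract vectors
ids = np.arange(1000, 1100)                            # hypothetical custom IDs

# Query with the first stored vector: its own ID comes back first, at distance 0.
D, I = brute_force_l2_search(vectors, ids, vectors[:1], k=3)
print(D[0], I[0])
```

Every query is compared against every stored vector, which is why the cost grows linearly with the size of the collection.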

To create an index with the misinformation abstract vectors, we will have to:
1. Change the data type of the abstract vectors to `float32`.
2. Build an index and pass it the dimension of the vectors it will operate on.
3. Pass the index to `IndexIDMap`, an object that enables us to provide a custom list of IDs for the indexed vectors.
4. Add the abstract vectors and their ID mapping to the index. In this tutorial, we will map vectors to their paper IDs from MAG.

In [31]:
# Step 1: Change data type
abstract_vectors = np.array(abstract_vectors).astype("float32")

# Step 2: Instantiate the index
index = faiss.IndexFlatL2(abstract_vectors.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(abstract_vectors, df.id.values)

In [32]:
print(f'Number of vectors in the Faiss index: {index.ntotal}')

Number of vectors in the Faiss index: 5501


## Searching the index

The index we built will perform a k-nearest-neighbour search. We have to provide the number of neighbours to be returned. 

Let's query the index with an abstract from the misinformation dataset and retrieve the 10 most relevant documents. Since the query itself is in the index, the first result should be the query, at a distance of zero.

In [11]:
# Paper title
df['original_title'].iloc[2531]

'Detecting Fake News in Social Media Networks'

In [12]:
# Paper abstract
df['abstract'].iloc[2531]

'Abstract Fake news and hoaxes have been there since before the advent of the Internet. The widely accepted definition of Internet fake news is: fictitious articles deliberately fabricated to deceive readers”. Social media and news outlets publish fake news to increase readership or as part of psychological warfare. Ingeneral, the goal is profiting through clickbaits. Clickbaits lure users and entice curiosity with flashy headlines or designs to click links to increase advertisements revenues. This exposition analyzes the prevalence of fake news in light of the advances in communication made possible by the emergence of social networking sites. The purpose of the work is to come up with a solution that can be utilized by users to detect and filter out sites containing false and misleading information. We use simple and carefully selected features of the title and post to accurately identify fake posts. The experimental results show a 99.4% accuracy using logistic classifier.'

In [13]:
# Retrieve the 10 nearest neighbours
D, I = index.search(np.array([abstract_vectors[2531]]), k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 61.20379638671875, 62.467647552490234, 63.3008918762207, 63.46514892578125, 63.47710037231445, 64.8943862915039, 64.8943862915039, 65.30744171142578, 67.49905395507812]

MAG paper IDs: [2900144515, 3013743800, 2971345456, 2809857047, 2951104154, 2808510781, 3030055324, 3008702223, 2887232089, 3027288733]


In [14]:
# Fetching the paper titles based on their index
id2details(df, I, 'original_title')

[['Detecting Fake News in Social Media Networks'],
 ['Exploring the Role of Visual Content in Fake News Detection.'],
 ['Credibility investigation for tweets and its users'],
 ['A Taxonomy of Audiovisual Fake Multimedia Content Creation Technology'],
 ['Detecting fake news for reducing misinformation risks using analytics approaches'],
 ['A First Step Towards Combating Fake News over Online Social Media'],
 ['FakeDetector: Effective Fake News Detection with Deep Diffusive Neural Network'],
 ['Deep Diffusive Neural Network based Fake News Detection from Heterogeneous Social Networks'],
 ['Believability of News'],
 ['Fake News Detection in Social Networks Using Machine Learning and Deep Learning: Performance Evaluation']]
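
The `id2details` helper comes from `tutorials.utils` and its source is not shown here. A helper with this behaviour could look roughly like the sketch below. This is an assumption about how such a lookup might be written, not the actual implementation. The name `id2details_sketch` and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd

def id2details_sketch(df, I, column):
    """For each ID in the first row of I, list the values of `column` for matching papers."""
    return [list(df[df.id == idx][column]) for idx in I[0]]

# Toy dataframe with made-up IDs and titles.
toy_df = pd.DataFrame({"id": [11, 22, 33],
                       "original_title": ["Paper A", "Paper B", "Paper C"]})
# Shaped like the ID matrix returned by a search: one row per query.
toy_I = np.array([[33, 11]])

print(id2details_sketch(toy_df, toy_I, "original_title"))  # [['Paper C'], ['Paper A']]
```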

## Putting it all together
So far, we've built a Faiss index using the misinformation abstract vectors we encoded with a sentence-DistilBERT model. That's helpful, but in a real-world scenario we would have to work with unseen data. To query the index with an unseen query and retrieve its most relevant documents, we would have to do the following:

1. Encode the query with the same sentence-DistilBERT model we used for the rest of the abstract vectors.
2. Change its data type to `float32`.
3. Search the index with the encoded query.
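
The three steps above could be wrapped in a single function along these lines. This is a sketch, not the tutorial's own `vector_search` helper, whose source lives in `tutorials.utils`. The names `vector_search_sketch`, `ToyModel` and `ToyIndex` are hypothetical, and the toy classes only imitate the `encode` and `search` methods so that the example runs without downloading a model or installing Faiss.

```python
import numpy as np

def vector_search_sketch(query, model, index, num_results=10):
    """Encode a list of texts and return (distances, IDs) of their nearest neighbours."""
    # Steps 1 and 2: encode the query and cast it to float32.
    vectors = np.asarray(model.encode(list(query)), dtype="float32")
    # Step 3: search the index.
    return index.search(vectors, num_results)

class ToyModel:
    """Stand-in for a sentence encoder: maps each text to a 2-d vector."""
    def encode(self, texts):
        return [np.array([len(t), t.count(" ")], dtype="float64") for t in texts]

class ToyIndex:
    """Stand-in for an index holding two vectors with IDs 7 and 8."""
    vectors = np.array([[5.0, 0.0], [11.0, 1.0]], dtype="float32")
    ids = np.array([7, 8])

    def search(self, x, k):
        d = ((x[:, None, :] - self.vectors[None, :, :]) ** 2).sum(axis=2)
        order = np.argsort(d, axis=1)[:, :k]
        return np.take_along_axis(d, order, axis=1), self.ids[order]

# "hello world" has 11 characters and 1 space, so it matches the vector with ID 8 exactly.
D, I = vector_search_sketch(["hello world"], ToyModel(), ToyIndex(), num_results=1)
print(D, I)
```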

We will grab the very first paragraph of this [commentary](https://misinforeview.hks.harvard.edu/article/promoting-health-literacy-during-the-covid-19-pandemic-a-call-to-action-for-healthcare-professionals/) published in Harvard's Misinformation Review.

In [33]:
user_query = """The extraordinary spread of misinformation during the COVID-19 pandemic is impressive. 
And, to public health professionals like us, it’s worrying: We know that good information and good health 
go hand in hand. Knowing what we do about the practice of public health and what the science tells us about 
how people fall for misinformation, we see promising strategies for intervention in our own field. 
We therefore call on fellow healthcare professionals to take concerted action against misinformation, 
and we suggest here one lever our field is perfectly situated to address: health literacy. 
In this commentary, we propose concrete strategies for colleagues at four levels of practice: 
in healthcare organizations, community-based partnerships, cross-sector collaborations, and as individual 
healthcare providers."""

In [34]:
# For convenience, we've wrapped all steps in the vector_search function.
# It takes four arguments: 
# A query, the sentence-level transformer, the Faiss index and the number of requested results
D, I = vector_search([user_query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [64.16974639892578, 79.92680358886719, 81.83247375488281, 82.27056121826172, 90.5491943359375, 92.62487030029297, 93.32608795166016, 95.37651062011719, 96.64971923828125, 98.76470184326172]

MAG paper IDs: [3021999948, 2307445927, 2886045938, 3025366776, 3019619423, 3014943803, 3018986165, 3015865241, 3028762622, 2053060288]


In [17]:
# Fetching the paper titles based on their index
id2details(df, I, 'original_title')

[['Chatbots in the fight against the COVID-19 pandemic'],
 ['UNITY IN DIVERSITY'],
 ['Fighting the good fight: the fallout of fake news in infection prevention and why context matters.'],
 ['Vitamin D and SARS-CoV-2 virus/COVID-19 disease'],
 ['Paying SPECIAL consideration to the digital sharing of information during the COVID-19 pandemic and beyond'],
 ['The FPM International Awards for Medical Writing in Social Media: a step in the right direction.'],
 ['Understanding Antibody Testing for COVID-19.'],
 ['Misinformation During the Coronavirus Disease 2019 Outbreak: How Knowledge Emerges From Noise'],
 ['Learning from each other in the COVID-19 pandemic'],
 ['On-Line Medical Information and Service Delivery: Implications for Health Education']]

Now we know how to build a Faiss index and perform semantic searches by utilising the sentence-DistilBERT embeddings we created previously.

**Tip**: You could actually use Faiss with other vector representations too, for example **TF-IDF** vectors!

## Dimensionality reduction with UMAP

UMAP constructs a high-dimensional graph representation of the data, then optimizes a low-dimensional graph to be as structurally similar as possible. In that sense, it's similar to t-SNE!

The high-dimensional graph is a weighted graph, with edge weights representing the likelihood that two points are connected. To determine connectedness, UMAP extends a radius outwards from each point, connecting points when those radii overlap. Choosing this radius is critical - too small a choice will lead to small, isolated clusters, while too large a choice will connect everything together. UMAP overcomes this challenge by choosing a radius locally, based on the distance to each point's nth nearest neighbor. UMAP then makes the graph "fuzzy" by decreasing the likelihood of connection as the radius grows. Finally, by stipulating that each point must be connected to at least its closest neighbor, UMAP ensures that local structure is preserved in balance with global structure.
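
The idea of a locally chosen radius with decaying connection strength can be illustrated with a toy NumPy sketch. This is a deliberate simplification, not UMAP's actual algorithm: the real implementation calibrates the decay for each point and symmetrises the graph, among other things. The function name and the decay formula are assumptions made for illustration.

```python
import numpy as np

def local_connection_weights(points, n_neighbors):
    """Toy, directed connection weights with a per-point radius (not UMAP's exact formula)."""
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    sorted_dists = np.sort(dists, axis=1)
    nearest = sorted_dists[:, [0]]               # distance to the closest neighbour
    radius = sorted_dists[:, [n_neighbors - 1]]  # distance to the n-th nearest neighbour
    # Weight 1 for the closest neighbour, decaying as distance grows relative to the local radius.
    weights = np.exp(-np.maximum(dists - nearest, 0.0) / radius)
    # Keep edges only inside each point's local radius.
    weights[dists > radius] = 0.0
    return weights

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))
W = local_connection_weights(points, n_neighbors=5)

print(W.max(axis=1))        # every point keeps a weight-1 edge to its closest neighbour
print((W > 0).sum(axis=1))  # and n_neighbors outgoing edges (barring ties)
```

Because the radius adapts to each point, points in sparse regions still get connected, while points in dense regions do not connect to everything around them.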

## Notable hyperparameters

#### `n_neighbors`
It controls how UMAP balances local versus global structure in the data. It does this by constraining the size of the local neighborhood UMAP will look at when attempting to learn the manifold structure of the data. 

Low values -> local structure  
Large values -> global structure

#### `min_dist`
It controls how tightly UMAP is allowed to pack points together. It provides the minimum distance apart that points are allowed to be in the low dimensional representation. 

Low values -> clumpier embeddings (might be good for clustering)  
Large values -> more evenly spread points that preserve the broad topological structure

#### `n_components`
The number of dimensions of the reduced space. It plays the same role as the `n_components` parameter in `scikit-learn`'s dimensionality reduction classes.

#### `metric`
Controls how distance is computed in UMAP. See all metrics [here](https://umap-learn.readthedocs.io/en/latest/parameters.html#metric)

## Why UMAP and not t-SNE
It's **faster** and better at **preserving the data's global structure**.

![umap](../figures/umap.gif "segment")
[Link to gif](https://pair-code.github.io/understanding-umap/)

![orion-interface](../figures/orion.png)

To project the dense, abstract vectors to a low dimensional space, we will do the following:
1. Instantiate UMAP with a set of hyperparameters. Finding the right set requires experimentation! 
2. `fit_transform` the abstract vectors
3. Visualise them with [Altair](https://altair-viz.github.io/)

**Note:** Set `random_state` to get reproducible results.

In [35]:
# Step 1. Instantiate UMAP
reducer = umap.UMAP(n_neighbors=10, n_components=2, metric='cosine', min_dist=.01, random_state=42)

# Step 2. Project the abstract vectors to a low dimensional space
embeddings = reducer.fit_transform(abstract_vectors)

print(f'Shape of the UMAP embeddings: {embeddings.shape}')

Shape of the UMAP embeddings: (5501, 2)


In [36]:
# Store the embeddings in a dataframe and add the paper titles
embed = pd.DataFrame(embeddings, columns=['Component 1', 'Component 2'])
embed['title'] = df.original_title

# Visualise the embeddings - the figure is interactive!
plot(embed)

We will crudely treat these points as outliers and remove them from the dataset. We can do this by filtering out the rows where the value of the `Component 1` column is below zero. 

Let's redraw the figure and explore particle neighbourhoods.

In [37]:
embed = embed[embed['Component 1']>0]
plot(embed)

Our scatterplot looks neat! In the final part of the tutorial, let's cluster the 2-dimensional embeddings with HDBSCAN and colour the particle space.

## Clustering dense vectors with HDBSCAN
Instead of clustering the 768-dimensional dense vectors we created with the sentence-DistilBERT model, we will use the UMAP embeddings as a preprocessing step. 

This is a bit controversial ([see here for a discussion on t-SNE and clustering](https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne)). The most notable issue is that UMAP, like t-SNE, does not completely preserve density so using it as a preprocessing step might produce an inaccurate (usually larger) number of clusters.

You are advised to do a few sanity checks before reporting!
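
One possible sanity check, offered here as a suggestion rather than as part of the original workflow, is to go back to the high-dimensional vectors and verify that documents in the same cluster are more similar to each other than to documents in other clusters. The sketch below uses made-up toy data. The function name is hypothetical, and it assumes that noise points carry the label -1, as they do in HDBSCAN.

```python
import numpy as np

def within_vs_between_similarity(vectors, labels):
    """Mean cosine similarity of same-cluster pairs and of different-cluster pairs."""
    vectors = np.asarray(vectors, dtype="float64")
    labels = np.asarray(labels)
    # Ignore points tagged as noise.
    keep = labels != -1
    vectors, labels = vectors[keep], labels[keep]
    # Cosine similarity is the dot product of unit-length vectors.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    return sims[same & off_diagonal].mean(), sims[~same].mean()

# Two well-separated toy clusters in 16 dimensions.
rng = np.random.default_rng(1)
toy_vectors = np.vstack([rng.normal(loc=5.0, size=(30, 16)),
                         rng.normal(loc=-5.0, size=(30, 16))])
toy_labels = np.array([0] * 30 + [1] * 30)

within, between = within_vs_between_similarity(toy_vectors, toy_labels)
print(within > between)  # True
```

If the within-cluster similarity is not clearly higher than the between-cluster one, the clusters found in the 2-dimensional space may not reflect structure in the original vectors.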

Here, we will use [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/), a density-based clustering algorithm, to group vectors. 
1. Instantiate HDBSCAN and choose a hyperparameter set. Again, this requires experimentation!
2. `fit_predict` labels for each UMAP vector.
3. Plot the results

## Notable hyperparameters

#### `min_cluster_size`
It controls the smallest size grouping that you wish to consider a cluster.

#### `min_samples`
It controls how conservative you want your clustering to be. 

Large values -> More points will be tagged as noise and clusters will be restricted to progressively more dense areas.

In [38]:
# Step 1: Instantiate HDBSCAN
cluster = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=50)

# Step 2: Predict labels
labels = cluster.fit_predict(embed[['Component 1', 'Component 2']])

# HDBSCAN labels noise points as -1, so discard that label only if it is present
print(f'Number of clusters (excluding outliers): {len(set(labels) - {-1})}')

Number of clusters (excluding outliers): 23
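
Beyond the number of clusters, it can help to look at how large each cluster is and how many points were tagged as noise (HDBSCAN gives noise points the label -1). Here is a minimal sketch on made-up labels. `toy_labels` is a hypothetical stand-in for the `labels` array above.

```python
from collections import Counter

# Hypothetical labels, shaped like the output of a density-based clusterer; -1 marks noise.
toy_labels = [0, 0, 1, -1, 1, 1, 2, -1, 0, 2]

# Count the members of each cluster, leaving noise out.
sizes = Counter(label for label in toy_labels if label != -1)
noise_share = sum(label == -1 for label in toy_labels) / len(toy_labels)

print(f"Cluster sizes: {dict(sorted(sizes.items()))}")      # Cluster sizes: {0: 3, 1: 3, 2: 2}
print(f"Share of points tagged as noise: {noise_share:.0%}")  # Share of points tagged as noise: 20%
```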


In [39]:
# Step 3: Add the labels to the dataframe and redraw the figure
# `assign` returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the filtered `embed`
embed = embed.assign(color=[str(label) for label in labels])

# Interactive legend
selection = alt.selection_multi(fields=['color'], bind='legend')

# Plot figure
fig = (alt.Chart(embed)
       .mark_circle(size=20)
       .encode(
           alt.X("Component 1"),
           alt.Y("Component 2"),
           alt.Color("color", title="Cluster"),
           opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),
           tooltip=["title"],
       )
       .interactive()
       .properties(width=650, height=500)
       .add_selection(selection)
      )


In [25]:
fig

## Conclusion
In this tutorial, we covered how state-of-the-art text representation methods can be used in a variety of downstream tasks. In summary, we did the following:
- Transformed text to vectors without spending time on preprocessing the documents. 
- Developed a semantic search model by building a Faiss index and adding the dense vectors. 
- Reduced the dimensionality of the document vectors with UMAP and used HDBSCAN to cluster them.

Some things you could try:
- Fine-tune the sentence-DistilBERT embeddings to your dataset.
- Fit multiple UMAP models and choose the most suitable representation.
- Try a fuzzy clustering method.


I hope you enjoyed the tutorial! Get in touch if you have any questions or would like to learn more about Orion!

### Thank you!

**@kstathou**  
**kostas@mozillafoundation.org**