In this notebook, we move from numeric data to text. Text has received a lot of attention over the last year because of the buzz around ChatGPT. We will start a bit simpler by looking at two older approaches for extracting meaning from text, `word2vec` and `doc2vec`, before turning to transformer models. The following code block imports the libraries we will use.

In [None]:
import json
import pathlib
from pprint import pprint
import re
import string

from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity 
from tqdm.auto import tqdm

The following code block reads in the data we will consider, which is a dataset of abstracts for scholarly publications related to human trafficking, labor trafficking, and sex trafficking.

In [None]:
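data_filepath = pathlib.Path('abstract_data.json')

# Fail early with a clear message instead of a NameError on `data` below
if not data_filepath.exists():
    raise FileNotFoundError(f'Could not find {data_filepath}')

with open(data_filepath) as fin:
    data = json.load(fin)

len(data)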

The following code block prints an example of the data for one entry.

In [None]:
id_list = list(data.keys())

pprint(data[id_list[75]], width=120)

The following code block defines a simple function for cleaning the abstracts.

In [None]:
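def prepare_text(text):
    # Flatten newlines and lowercase
    text = text.replace('\n', ' ').lower()
    # Drop punctuation first, since removing it can leave doubled spaces
    text = ''.join([char for char in text if char not in string.punctuation])
    # split() with no argument collapses whitespace runs and avoids empty tokens
    text = text.split()

    return text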

The following code block demonstrates the function. As you can see, it is very rudimentary. If we were doing an analysis for production or a research project, I would invest much more effort in cleaning the data.

In [None]:
prepare_text(data[id_list[75]]['abstract'])
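
To see what this kind of cleaning does without loading the dataset, here is a minimal, self-contained sketch applied to a made-up sentence. The function name `clean_tokens` and the sample string are hypothetical and only for illustration; the steps are the same in spirit (lowercase, strip punctuation, split on whitespace).

```python
import string


def clean_tokens(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    # str.translate removes every character listed in string.punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # split() with no argument collapses runs of whitespace, including newlines
    return text.split()


# Made-up example string (not taken from the dataset)
sample = "Human-trafficking:  a review\n(2020)."
tokens = clean_tokens(sample)
print(tokens)
```

Note that the hyphenated term collapses into a single token (`humantrafficking`), which is one of the side effects a more careful cleaning pipeline would need to handle.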

The following code block applies the function to the data, adding a new `clean_abstract` entry to each record.

In [None]:
for key in tqdm(data, 'Cleaning abstracts'):
    data[key]['clean_abstract'] = prepare_text(data[key]['abstract'])

The following code block prints an example.

In [None]:
pprint(data[id_list[75]], width=120)

#### Word2Vec

We will first look at Word2Vec. See https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html for additional details. The following code block creates a list of the cleaned abstracts for modeling.

In [None]:
abstracts = [val['clean_abstract'] for val in data.values()]

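The following code block uses `gensim` to fit a `Word2Vec` model using the default hyperparameters. The only argument we set besides the data is `workers`, which controls the number of training threads.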

In [None]:
w2v_model = Word2Vec(
    sentences=abstracts,
    workers=4,
)

By default, the model uses a *Continuous Bag of Words* training scheme to determine 100-dimensional vector representations for words. A few examples are given in the following code blocks.

In [None]:
w2v_model.wv['pimp']

In [None]:
w2v_model.wv['trafficker']

In [None]:
w2v_model.wv['victim']
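
Indexing `wv` with a word the model did not keep raises a `KeyError`. This can happen for rare words, because `gensim` drops words that appear fewer than `min_count` times (5 by default). The sketch below shows one way to guard a lookup. The helper name `vector_or_none` is hypothetical, and a plain dictionary stands in for a fitted model so that the example runs on its own.

```python
from types import SimpleNamespace


def vector_or_none(model, word):
    """Return the word's vector, or None if the word is not in the vocabulary."""
    # `word in model.wv` checks vocabulary membership without raising
    if word in model.wv:
        return model.wv[word]
    return None


# Stand-in for a fitted model so the sketch runs without training anything;
# a plain dict mimics the lookup and membership behaviour used above.
toy_model = SimpleNamespace(wv={'victim': [0.1, 0.2, 0.3]})

print(vector_or_none(toy_model, 'victim'))
print(vector_or_none(toy_model, 'a-word-that-was-never-seen'))
```

With a real model, the same function could be called as `vector_or_none(w2v_model, 'pimp')`.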

The following code block defines some test cases we can use to understand what is captured in the vectors.

In [None]:
def w2v_test_cases(w2v_model_object):

    pimp_trafficker_similarity = cosine_similarity(
        w2v_model_object.wv['pimp'].reshape(1, -1), 
        w2v_model_object.wv['trafficker'].reshape(1, -1)
    )[0][0]
    print(f' - pimp/trafficker similarity: {pimp_trafficker_similarity:.5f}')

    pimp_victim_similarity = cosine_similarity(
        w2v_model_object.wv['pimp'].reshape(1, -1), 
        w2v_model_object.wv['victim'].reshape(1, -1)
    )[0][0]
    print(f' - pimp/victim similarity: {pimp_victim_similarity:.5f}')

    trafficker_most_similar = w2v_model_object.wv.most_similar(
        positive=['trafficker'], 
        topn=5,
    )
    print('\n - trafficker (most similar)')
    pprint(trafficker_most_similar, indent=4)

    victim_most_similar = w2v_model_object.wv.most_similar(
        positive=['victim'], 
        topn=5,
    )
    print('\n - victim (most similar)')
    pprint(victim_most_similar, indent=4)

    internet_most_similar = w2v_model_object.wv.most_similar(
        positive=['internet'], 
        topn=5,
    )
    print('\n - internet (most similar)')
    pprint(internet_most_similar, indent=4)

The following code block uses the function to examine the outputs for the model we previously fit.

In [None]:
w2v_test_cases(w2v_model)
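
The similarity scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths. As a rough illustration of what `cosine_similarity` computes for a single pair, here is a plain-Python sketch on made-up two-dimensional vectors. The function name `cosine` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors chosen to show the range of possible values
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 1.0], [-1.0, -1.0])       # opposite directions

print(f'{same_direction:.3f} {orthogonal:.3f} {opposite:.3f}')
```

Because the score depends only on direction and not on length, two word vectors can be very similar even if one has a much larger magnitude.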

The following code block fits another model with a larger *window* size and a larger number of *epochs*.

In [None]:
w2v_model = Word2Vec(
    sentences=abstracts, 
    window=10,
    epochs=25,
    workers=4,
)

Test results are computed in the following code block.

In [None]:
w2v_test_cases(w2v_model)

The following code block fits another model that uses a *skip-gram* training scheme.

In [None]:
w2v_model = Word2Vec(
    sentences=abstracts, 
    window=10,
    epochs=25,
    sg=1,
    workers=4,
)
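
To make the difference between the two training schemes concrete, the sketch below builds the training examples each one would see for a short, made-up token list. Continuous Bag of Words predicts a word from its surrounding context, while skip-gram predicts each context word from the center word. This is a simplification for illustration only: the function names are hypothetical, and details of real implementations (such as sampling tricks) are left out.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """(context words, target word) pairs: the context predicts the center."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """(center word, context word) pairs: the center predicts each neighbor."""
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]


# Made-up token list (not taken from the dataset)
tokens = ['victims', 'of', 'labor', 'trafficking']

cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow)
print(skipgram)
```

A larger `window` adds more distant words to each context, which is the setting we changed in the earlier model.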

Test results are computed in the following code block.

In [None]:
w2v_test_cases(w2v_model)

#### Doc2Vec

We will now look at *Doc2Vec*. Doc2Vec trains a set of *paragraph vectors*, one for each document, in addition to the word vectors. To implement this in `gensim`, we need to use `TaggedDocument` objects. These are defined in the following code block.

In [None]:
documents = [TaggedDocument(val['clean_abstract'], [key]) for key, val in data.items()]

The following code block trains a basic `Doc2Vec` model using the `gensim` defaults.

In [None]:
d2v_model = Doc2Vec(
    documents=documents,
    workers=4,
)

Since Doc2Vec still computes word vectors, we can still perform our tests.

In [None]:
w2v_test_cases(d2v_model)

We can get the *paragraph vector* for a document by indexing with the ID we used when defining the `TaggedDocument` objects.

In [None]:
d2v_model.dv[id_list[0]]

We can use the paragraph vectors to identify similar pieces of text. To demonstrate, here are the IDs for two papers I have been involved with.

In [None]:
my_ids = [
    'SCOPUS_ID:85147005648', 
    'SCOPUS_ID:85097137525',
]

As you might expect, the Doc2Vec model finds them to be very similar.

In [None]:
cosine_similarity(
    d2v_model.dv[my_ids[0]].reshape(1, -1),
    d2v_model.dv[my_ids[1]].reshape(1, -1),
)[0][0]

We can use the `most_similar` method to easily get back a list of the IDs for the most similar documents along with the similarity score. 

In [None]:
d2v_model.dv.most_similar(d2v_model.dv[my_ids[0]])

The following code block prints the titles for the most similar 25 documents for both of the targets.

In [None]:
for target in my_ids:
    target_paper = data[target]['title']
    abstract = data[target]['abstract']
    print(f"Target: {target_paper}\n")
    print(f"Abstract: {abstract}")
    print('\n')
    for sid, similarity in d2v_model.dv.most_similar(d2v_model.dv[target], topn=25):
        print(f" - {data[sid]['title']} ({similarity: .3f})")
    print('*'*100 + '\n')

#### Transformers

We will now look at using transformer models that are publicly available via HuggingFace (https://huggingface.co/). We will use PyTorch as the neural network framework. In class, we will look at how to install PyTorch for CUDA. The following code block imports PyTorch and checks to see if CUDA is available.

In [None]:
import torch

torch.cuda.is_available()

The following code block checks the number of CUDA devices available.

In [None]:
torch.cuda.device_count()
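
If you are running on a machine without a CUDA-capable GPU, cells that request `device='cuda'` will fail. One common pattern is to pick the device once and reuse it. The following is a minimal sketch of that idea; the variable names `cuda_ok` and `device` are hypothetical, and the `ImportError` fallback is only there so the snippet also runs where PyTorch is not installed.

```python
try:
    import torch
    # True only when PyTorch was built with CUDA support and a GPU is visible
    cuda_ok = torch.cuda.is_available()
except ImportError:
    # PyTorch is not installed, so fall back to the CPU
    cuda_ok = False

device = 'cuda' if cuda_ok else 'cpu'
print(device)
```

The resulting string can then be passed wherever a device name is expected.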

We will generate the embeddings for abstracts using the `sentence_transformers` package. The following code block imports the `SentenceTransformer` class and specifies that we will use the `all-MiniLM-L6-v2` model (see https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [None]:
from sentence_transformers import SentenceTransformer
st_model = SentenceTransformer('all-MiniLM-L6-v2')

The `SentenceTransformer` will handle tokenization, so we only need a list of the raw abstract texts.

In [None]:
abstract_strings = [val['abstract'] for val in data.values()]

The following code block times the generation of embeddings on 500 of the abstracts when using the CPU.

In [None]:
%%time

embeddings = st_model.encode(abstract_strings[:500], device='cpu')

The following code block times the generation of embeddings on 500 of the abstracts when using the GPU.

In [None]:
%%time

embeddings = st_model.encode(abstract_strings[:500], device='cuda')

The following code block generates the embeddings for all abstracts using the GPU.

In [None]:
%%time

embeddings = st_model.encode(abstract_strings, device='cuda')

The following code block creates a `DataFrame` of the embeddings.

In [None]:
embeddings_df = pd.DataFrame(
    embeddings,
    index=list(data.keys()),
)

The following code block prints the titles for the most similar 25 documents for both of the targets defined in `my_ids`.

In [None]:
for target in my_ids:

    similarities = cosine_similarity(
        embeddings_df.loc[target].values.reshape(1, -1),
        embeddings_df.values,
    )
    
    similar_articles = pd.Series(
        similarities.flatten(),
        index=list(data.keys()),
    ).nlargest(25).to_dict()
    
    target_paper = data[target]['title']
    abstract = data[target]['abstract']
    print(f"Target: {target_paper}\n")
    print(f"Abstract: {abstract}")
    print('\n')
    for sid, similarity in similar_articles.items():
        print(f" - {data[sid]['title']} ({similarity: .3f})")
    print('*'*100 + '\n')
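
When a document is compared against every row of the embedding matrix, it is also compared against itself, so it shows up at the top of its own list with a similarity of 1. The sketch below shows one way to rank neighbors while leaving the target out. It uses a small made-up matrix rather than the real embeddings, and the function name `top_k_excluding_self` is hypothetical.

```python
import numpy as np


def top_k_excluding_self(embeddings, target_row, k):
    """Return (row index, cosine similarity) for the k nearest other rows."""
    # Normalise rows so that a dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ unit[target_row]
    # Push the target's own score to the bottom so it is never selected
    scores[target_row] = -np.inf
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Made-up 2-D "embeddings": row 1 points almost the same way as row 0
toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

result = top_k_excluding_self(toy, target_row=0, k=2)
print(result)
```

With the real data, the row index would be mapped back to a document ID before looking up the title.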