<a href="https://colab.research.google.com/github/hsmidt/hicss-track-recommender/blob/main/hicss_minitrack_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Preliminaries

Resources:
- https://www.sbert.net/docs/pretrained_models.html#scientific-publications
- https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_publications.py
- https://www.sbert.net/examples/applications/computing-embeddings/README.html#storing-loading-embeddings


## Import Remote HICSS Data


In [2]:
import pandas as pd

In [3]:
minitracks = {}
minitracks['HICSS-54'] = pd.read_csv('https://raw.githubusercontent.com/hsmidt/hicss-track-recommender/main/data/hicss54_minitracks.csv')
minitracks['HICSS-55'] = pd.read_csv('https://raw.githubusercontent.com/hsmidt/hicss-track-recommender/main/data/hicss55_minitracks.csv')
minitracks['HICSS-56'] = pd.read_csv('https://raw.githubusercontent.com/hsmidt/hicss-track-recommender/main/data/hicss56_minitracks.csv')

minitracks['HICSS-55'].describe()

| | conference | track | trackname | id | name | description |
|---|---|---|---|---|---|---|
| count | 190 | 190 | 190 | 190 | 190 | 190 |
| unique | 1 | 12 | 12 | 190 | 190 | 188 |
| top | HICSS-55 | DA | Decision Analytics, Mobile Services, and Servi... | a8a72967-8e7e-49ef-a254-62f83f4fe368 | Digital Government and AI | Technical area: Decision Analysis, Big Data... |
| freq | 190 | 34 | 34 | 1 | 1 | 2 |


In [4]:
papers = {}
papers['HICSS-54'] = pd.read_csv('https://raw.githubusercontent.com/hsmidt/hicss-track-recommender/main/data/hicss54_papers.csv')
papers['HICSS-55'] = pd.read_csv('https://raw.githubusercontent.com/hsmidt/hicss-track-recommender/main/data/hicss55_papers.csv')
papers['HICSS-56'] = pd.read_csv('https://raw.githubusercontent.com/hsmidt/hicss-track-recommender/main/data/hicss56_papers.csv')

papers['HICSS-55'].describe()

| | paperid | minitrackid | minitrackname | minitrackdescription | track | conference | title | abstract |
|---|---|---|---|---|---|---|---|---|
| count | 1553 | 1553 | 1553 | 1553 | 1553 | 1553 | 1553 | 1553 |
| unique | 1553 | 186 | 186 | 184 | 12 | 1 | 1553 | 1553 |
| top | bb4e4b29-48ac-433c-b4d6-c0b2b1f642c4 | 8cbf14f0-696e-441b-820f-aa71fce4ed25 | Managing the Dynamics of Platforms and Ecosystems | MANAGING THE DYNAMICS OF PLATFORMS & ECOSYSTEM... | OS | HICSS-55 | Conceptualizing Design Knowledge in IS Researc... | Design science projects are of great interest ... |
| freq | 1 | 29 | 29 | 29 | 307 | 1553 | 1 | 1 |


## Load Pre-Trained Models


In [5]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.19.1-py3-none-any.whl (4.2 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_

In [6]:
from sentence_transformers import SentenceTransformer, util

In [7]:
# memoization copied from https://dbader.org/blog/python-memoization
def memoize(func):
  cache = dict()

  def memoized_func(*args):
    if args in cache:
      return cache[args]
    result = func(*args)
    cache[args] = result
    return result

  return memoized_func

def model_loader(name):
  return SentenceTransformer(name)

mem_model = memoize(model_loader)
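
The hand-written `memoize` helper above caches one result per argument tuple. The standard library's `functools.lru_cache` provides the same behavior, so a possible alternative is sketched below. This is only an illustration: `fake_loader` and `load_calls` are hypothetical stand-ins for `model_loader`, so the example runs without downloading a model.

```python
from functools import lru_cache

load_calls = []  # records every real (non-cached) load, for illustration only

@lru_cache(maxsize=None)
def fake_loader(name: str) -> dict:
    # Hypothetical stand-in for model_loader(name), i.e. SentenceTransformer(name)
    load_calls.append(name)
    return {"model_name": name}

first = fake_loader("allenai-specter")
second = fake_loader("allenai-specter")    # served from the cache, no second load
other = fake_loader("all-mpnet-base-v2")   # different argument, loaded once

print(load_calls)                 # names that triggered a real load
print(fake_loader.cache_info())   # hit and miss counters kept by lru_cache
```

Decorating `model_loader` this way would presumably make the separate `mem_model` wrapper unnecessary, though the notebook's own helper works just as well.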

In [8]:
# sample usage of memoized model
mod = mem_model('allenai-specter') 


In [8]:
model_general = ['all-mpnet-base-v2', 'multi-qa-mpnet-base-dot-v1', 'all-MiniLM-L6-v2', 'all-MiniLM-L12-v2']
model_semantic = ['multi-qa-MiniLM-L6-cos-v1']
model_science = ['allenai-specter']

## HICSS-56 Embeddings for Streamlit

See https://www.sbert.net/examples/applications/computing-embeddings/README.html#storing-loading-embeddings for details.

In [8]:
import pickle

In [9]:
model_names = ['allenai-specter', 'all-mpnet-base-v2']
hicss56_mt_embeddings = {}
for name in model_names:
  hicss56_mt_embeddings[name] = mem_model(name).encode(minitracks['HICSS-56']['description'], convert_to_tensor=True)


In [10]:
with open('hicss56_embeddings_2.pkl', "wb") as fOut:
    pickle.dump({'minitracks': minitracks['HICSS-56'], 'embeddings': hicss56_mt_embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)
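
The consuming side (for example the Streamlit app) has to read the file back with the same dictionary layout. The sketch below shows one way the round trip might look. `save_embeddings`, `load_embeddings`, the file name, and the dummy data are assumptions made for illustration; the real file stores a DataFrame and tensors rather than plain lists.

```python
import os
import pickle
import tempfile

def save_embeddings(path, minitracks, embeddings):
    # Same dictionary layout as the cell above: 'minitracks' and 'embeddings'
    with open(path, "wb") as f_out:
        pickle.dump({"minitracks": minitracks, "embeddings": embeddings},
                    f_out, protocol=pickle.HIGHEST_PROTOCOL)

def load_embeddings(path):
    # Returns the (minitracks, embeddings) pair stored by save_embeddings
    with open(path, "rb") as f_in:
        stored = pickle.load(f_in)
    return stored["minitracks"], stored["embeddings"]

# Placeholder data so the sketch runs without a model or the HICSS CSV files
dummy_minitracks = [{"id": "mt-1", "name": "Example minitrack"}]
dummy_embeddings = {"allenai-specter": [[0.1, 0.2, 0.3]]}

path = os.path.join(tempfile.mkdtemp(), "embeddings_demo.pkl")
save_embeddings(path, dummy_minitracks, dummy_embeddings)
loaded_minitracks, loaded_embeddings = load_embeddings(path)
print(sorted(loaded_embeddings.keys()))
```

Because `pickle.load` can execute arbitrary code, such a file should only be loaded when it comes from a trusted source.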

## HICSS-55 Data

### Compute Embeddings

In [14]:
#Compute embeddings
abstract_embeddings = mem_model('allenai-specter').encode(papers['HICSS-55']['abstract'], convert_to_tensor=True)
minitrack_embeddings = mem_model('allenai-specter').encode(minitracks['HICSS-55']['description'], convert_to_tensor=True)


### Sample Search

In [9]:
def labelInResults(conference, idx, papers, results):
  return papers['minitrackid'][idx] in results['id'].values

In [10]:
def get_result_df(search_results, results):
  return pd.DataFrame([(score, track) for score, track in zip([x['score'] for x in search_results[0]], results['name'].values)], columns=['score', 'minitrack'])

In [11]:
def print_search_results(conference, idx, papers, search_results):
  search_ind = [x['corpus_id'] for x in search_results[0]]
  mtracks = minitracks[conference].iloc[search_ind]
  # `papers` is already the DataFrame of the selected conference (see the call below)
  print('Success: ', labelInResults(conference, idx, papers, mtracks))
  print('Human label: ', papers['minitrackname'][idx])
  print('Semantic results: \n', get_result_df(search_results, mtracks))

In [154]:
search_results = util.semantic_search(abstract_embeddings[0], minitrack_embeddings, top_k=10)
print_search_results('HICSS-55', 0, papers['HICSS-55'], search_results)

Success:  True
Human label:  Advances in Design Science Research
Semantic results: 
       score                                          minitrack
0  0.548921                Advances in Design Science Research
1  0.474495  Reports from the Field: Knowledge and Learning...
2  0.463690  Design and Appropriation of Knowledge and AI S...
3  0.460038  Human-centered Design for Digital Innovations ...
4  0.452245                     Collaboration for Data Science
5  0.448823                  Informing Research: Where to Now?
6  0.442026  Innovation and Entrepreneurship: Theory and Pr...
7  0.429629  Innovation in Organizations:  Learning, Unlear...
8  0.426230  Practitioner Research Insights: Applications o...
9  0.416935  Socio-technical Issues in Organizational Infor...
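
`util.semantic_search` scores every corpus embedding against the query (cosine similarity by default) and returns the `top_k` best hits as dictionaries with `corpus_id` and `score`. The pure-Python sketch below illustrates that idea on toy two-dimensional vectors. It is not the library's implementation, and `cosine` and `top_k_search` are hypothetical helper names.

```python
import heapq
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_search(query, corpus, top_k=10):
    # Score every corpus vector, then keep the top_k highest-scoring hits
    scored = [{"corpus_id": i, "score": cosine(query, vec)}
              for i, vec in enumerate(corpus)]
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])

# Toy "minitrack" vectors and one toy "abstract" vector
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_search([2.0, 0.1], toy_corpus, top_k=2)
for hit in hits:
    print(hit["corpus_id"], round(hit["score"], 4))
```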


### Model Evaluation (sample)


In [12]:
def success_rate(count: int, items: int) -> float:
  # percentage of successful items, rounded to three decimals
  return round(count / items * 100, 3)

In [16]:
conference = 'HICSS-56'
score_function = util.cos_sim # [util.cos_sim, util.dot_score]

In [17]:
## 'bert-base-nli-mean-tokens' , 'all-MiniLM-L6-v2', 'allenai-specter'
abstract_embeddings = mem_model('all-mpnet-base-v2').encode(papers[conference]['abstract'], convert_to_tensor=True)
minitrack_embeddings = mem_model('all-mpnet-base-v2').encode(minitracks[conference]['description'], convert_to_tensor=True)

In [18]:
success_count = 0
max_items = len(abstract_embeddings)
for i in range(0,max_items):
  results = util.semantic_search(abstract_embeddings[i], minitrack_embeddings, top_k=10, score_function=score_function)
  mtracks = minitracks[conference].iloc[[x['corpus_id'] for x in results[0]]]
  success = labelInResults(conference, i, papers[conference], mtracks)
  if success: success_count += 1

print(f"Success: {success_count} / {max_items}, {success_rate(success_count, max_items)}%")                        

Success: 17 / 22, 77.273%
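
The metric computed above is a top-k hit rate: a paper counts as a success when its human-assigned minitrack is among the `top_k` returned minitracks. The sketch below isolates that computation. `top_k_hit_rate` and the toy data are hypothetical, and the input mimics the list-of-hit-dictionaries shape that `util.semantic_search` returns, with one inner list per query.

```python
def top_k_hit_rate(all_results, true_ids, corpus_ids):
    """Percentage of queries whose true label is among the returned hits."""
    hits = 0
    for results, true_id in zip(all_results, true_ids):
        # Map positional corpus indices back to minitrack ids
        returned = {corpus_ids[hit["corpus_id"]] for hit in results}
        if true_id in returned:
            hits += 1
    return round(hits / len(true_ids) * 100, 3)

# Toy data: three "papers", three "minitracks", top_k = 2
corpus_ids = ["mt-a", "mt-b", "mt-c"]
all_results = [
    [{"corpus_id": 0, "score": 0.9}, {"corpus_id": 2, "score": 0.4}],
    [{"corpus_id": 1, "score": 0.8}, {"corpus_id": 0, "score": 0.3}],
    [{"corpus_id": 2, "score": 0.7}, {"corpus_id": 1, "score": 0.6}],
]
true_ids = ["mt-a", "mt-c", "mt-b"]

rate = top_k_hit_rate(all_results, true_ids, corpus_ids)
print(f"{rate}%")
```

According to the library documentation, `util.semantic_search` also accepts a whole batch of query embeddings and returns one result list per query, so the per-paper loop could likely be replaced by a single call whose output is passed to a helper like this one.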


## Model Selection

In [167]:
from sklearn.model_selection import train_test_split
import numpy as np


In [169]:
conference = 'HICSS-55'
success_criteria = 10 # must be in top 10 results to be considered successful
model_space = model_science + model_semantic + model_general # evaluate 6 models
score_function_space = {'cosine': util.cos_sim, 'dot': util.dot_score} # evaluate 2 similarity scores

In [None]:
# train, test = train_test_split(papers[conference], test_size=0.2) # holding out 20% of 1553 papers

In [172]:
selection_results = [] # track model name, function, and success rate
for modelname in model_space:
  abstract_embeddings = mem_model(modelname).encode(papers[conference]['abstract'], convert_to_tensor=True)
  minitrack_embeddings = mem_model(modelname).encode(minitracks[conference]['description'], convert_to_tensor=True)
  for score_function in score_function_space.keys():
    success_count = 0
    abstract_items = len(abstract_embeddings)
    for i in range(abstract_items):  # plain integer indices are sufficient here
      results = util.semantic_search(abstract_embeddings[i], 
                                     minitrack_embeddings, 
                                     top_k=success_criteria, 
                                     score_function=score_function_space[score_function])
      mtracks = minitracks[conference].iloc[[x['corpus_id'] for x in results[0]]]
      success = labelInResults(conference, i, papers[conference], mtracks)
      if success: success_count += 1
    selection_results.append({'model': modelname, 'score_function': score_function, 'success_rate': success_rate(success_count, abstract_items)})


In [173]:
selection_results

[{'model': 'allenai-specter',
  'score_function': 'cosine',
  'success_rate': 66.838},
 {'model': 'allenai-specter', 'score_function': 'dot', 'success_rate': 66.774},
 {'model': 'multi-qa-MiniLM-L6-cos-v1',
  'score_function': 'cosine',
  'success_rate': 51.256},
 {'model': 'multi-qa-MiniLM-L6-cos-v1',
  'score_function': 'dot',
  'success_rate': 51.256},
 {'model': 'all-mpnet-base-v2',
  'score_function': 'cosine',
  'success_rate': 76.304},
 {'model': 'all-mpnet-base-v2',
  'score_function': 'dot',
  'success_rate': 76.304},
 {'model': 'multi-qa-mpnet-base-dot-v1',
  'score_function': 'cosine',
  'success_rate': 54.99},
 {'model': 'multi-qa-mpnet-base-dot-v1',
  'score_function': 'dot',
  'success_rate': 53.703},
 {'model': 'all-distilroberta-v1',
  'score_function': 'cosine',
  'success_rate': 69.285},
 {'model': 'all-distilroberta-v1',
  'score_function': 'dot',
  'success_rate': 69.285},
 {'model': 'all-MiniLM-L12-v2',
  'score_function': 'cosine',
  'success_rate': 66.645},
 {'mo
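
To compare the configurations at a glance, the result list can be sorted by success rate. The sketch below uses a handful of entries copied from the visible part of the output above (the printed list is truncated, so this is not the full result set). The variable names are hypothetical.

```python
# Subset of the selection results shown above (values copied from the output)
selection_results_sample = [
    {"model": "allenai-specter", "score_function": "cosine", "success_rate": 66.838},
    {"model": "allenai-specter", "score_function": "dot", "success_rate": 66.774},
    {"model": "multi-qa-MiniLM-L6-cos-v1", "score_function": "cosine", "success_rate": 51.256},
    {"model": "all-mpnet-base-v2", "score_function": "cosine", "success_rate": 76.304},
    {"model": "all-distilroberta-v1", "score_function": "cosine", "success_rate": 69.285},
]

# Highest success rate first
ranked = sorted(selection_results_sample,
                key=lambda row: row["success_rate"], reverse=True)
best = ranked[0]

for row in ranked:
    print(f"{row['success_rate']:7.3f}  {row['model']} ({row['score_function']})")
```

Among the entries visible above, `all-mpnet-base-v2` has the highest top-10 success rate.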

## Playground and Unused/Old Code



In [24]:
search_results.describe()
search_results.head()


| | corpus_id | score |
|---|---|---|
| 0 | 159 | 0.944916 |
| 1 | 135 | 0.89792 |
| 2 | 28 | 0.862984 |
| 3 | 168 | 0.855652 |
| 4 | 188 | 0.848771 |


In [None]:
#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.cos_sim(abstract_embeddings, minitrack_embeddings)

In [None]:
import torch

# papers and minitracks are dicts keyed by conference, so select the DataFrames first
paper_df = papers[conference]
minitrack_df = minitracks[conference]

print(cosine_scores.shape)
print(len(cosine_scores))
for i in range(len(cosine_scores)):
  # indices of the 10 minitracks most similar to paper i
  ordered = torch.argsort(cosine_scores[i], descending=True)[:10]
  print(f"\n\n{i}:{paper_df['minitrackname'][i]}, \t {paper_df['title'][i]}")
  print(f"{paper_df['abstract'][i][0:200]}")
  for j in ordered.tolist():
    print(f"{minitrack_df['name'][j]} \t\t {cosine_scores[i][j]}")

In [None]:
#Find the (paper, minitrack) pairs with the highest cosine similarity scores
# cosine_scores has one row per paper and one column per minitrack
pairs = []
num_papers, num_minitracks = cosine_scores.shape
for i in range(num_papers):
    for j in range(num_minitracks):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j].item()})

#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(papers[conference]['abstract'][i][0:60], minitracks[conference]['name'][j], pair['score']))

DevOps is a new shift in business and information technology 		 Cross-Organizational and Cross-Border IS/IT Collaboration 		 Score: 0.6965
A company wishing to exploit the ongoing digitalization in i 		 Data Science for Digital Collaboration 		 Score: 0.6395
DevOps has become the fundamental approach to IT development 		 Virtual Collaboration, Organizations, and Networks 		 Score: 0.6304
DevOps is a new shift in business and information technology 		 Business Intelligence for Innovative, Collaborative and Sustainable Development of Organizations in Digital Era 		 Score: 0.6103
DevOps is a new shift in business and information technology 		 Virtual Collaboration, Organizations, and Networks 		 Score: 0.5881
DevOps has become the fundamental approach to IT development 		 Data Science for Digital Collaboration 		 Score: 0.5661
With reference to the echo chamber concept contained within  		 Cross-Organizational and Cross-Border IS/IT Collaboration 		 Score: 0.5608
DevOps is a new shift in 