# Keyword Analysis with KeyBERT and Taipy

## 01 - Extraction of arXiv Abstracts with API
- https://github.com/lukasschwab/arxiv.py

In [34]:
import arxiv
import ast
import itertools
import pandas as pd
import sqlite3
from keybert import KeyBERT

In [35]:
search = arxiv.Search(
            query = 'artificial intelligence',
            max_results = 20,
            sort_by = arxiv.SortCriterion.SubmittedDate,
            sort_order = arxiv.SortOrder.Descending)

In [36]:
type(search)

arxiv.arxiv.Search

In [37]:
for result in search.results():
    print(result.entry_id)
    print(result.published)
    print(result.title)
    print(result.summary)

http://arxiv.org/abs/2303.13519v1
2023-03-23 17:59:54+00:00
Learning and Verification of Task Structure in Instructional Videos
Given the enormous number of instructional videos available online, learning
a diverse array of multi-step task models from videos is an appealing goal. We
introduce a new pre-trained video model, VideoTaskformer, focused on
representing the semantics and structure of instructional videos. We pre-train
VideoTaskformer using a simple and effective objective: predicting weakly
supervised textual labels for steps that are randomly masked out from an
instructional video (masked step modeling). Compared to prior work which learns
step representations locally, our approach involves learning them globally,
leveraging video of the entire surrounding task as context. From these learned
representations, we can verify if an unseen video correctly executes a given
task, as well as forecast which steps are likely to be taken after a given
step. We introduce two new benchma
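
The loop above calls `Search.results()`. Newer releases of arxiv.py (2.x, as far as I know) deprecate that method in favour of a reusable `arxiv.Client`. Below is a minimal sketch under that assumption; `result_to_record` and `fetch_records` are hypothetical helper names, not part of the library, and `fetch_records` needs network access, so only the offline helper is exercised here.

```python
from types import SimpleNamespace


def result_to_record(result):
    """Flatten one arXiv result into a plain dict (hypothetical helper)."""
    return {
        "entry_id": result.entry_id,
        "published": result.published,
        "title": result.title,
        "summary": result.summary,
    }


def fetch_records(query="artificial intelligence", max_results=20):
    """Query arXiv through a Client. Assumes arxiv.py 2.x and network access."""
    import arxiv  # imported lazily so result_to_record works without the package

    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending,
    )
    client = arxiv.Client()
    return [result_to_record(r) for r in client.results(search)]


# Offline check with a stand-in object shaped like an arxiv.Result
fake = SimpleNamespace(
    entry_id="http://arxiv.org/abs/2303.13519v1",
    published="2023-03-23 17:59:54+00:00",
    title="Learning and Verification of Task Structure in Instructional Videos",
    summary="Given the enormous number of instructional videos available online, "
            "learning a diverse array of multi-step task models from videos is "
            "an appealing goal.",
)
record = result_to_record(fake)
print(record["entry_id"])
print(record["title"])
```

With the package installed, `fetch_records()` should return one dict per paper, ready to pass to `pd.DataFrame`.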

___
## 02 - SQLite Database Setup
- https://www.digitalocean.com/community/tutorials/how-to-use-the-sqlite3-module-in-python-3

In [4]:
# connection = sqlite3.connect("../data/abstracts.db")
# cursor = connection.cursor()

In [5]:
# # Create new table in database
# cursor.execute("CREATE TABLE IF NOT EXISTS abstracts_ai (id TEXT PRIMARY KEY, \
#                                                          title TEXT, \
#                                                          date_published TEXT, \
#                                                          abstract TEXT)"
#               )

<sqlite3.Cursor at 0x2183b230b90>

In [6]:
# # Insert dummy row
# cursor.execute("INSERT INTO abstracts_ai VALUES ('a1', \
#                                                  'test_title', \
#                                                  '2023-02-16 18:16:09+00:00', \
#                                                  'test abstract text')"
#               )

<sqlite3.Cursor at 0x2183b230b90>

In [7]:
# # Fetch all rows
# query = "SELECT * FROM abstracts_ai"
# df = pd.read_sql_query(query, connection)
# df

Unnamed: 0,id,title,date_published,abstract
0,a1,test_title,2023-02-16 18:16:09+00:00,test abstract text


In [8]:
# # Delete dummy row
# cursor.execute(
#     "DELETE FROM abstracts_ai")

<sqlite3.Cursor at 0x2183b230b90>

In [9]:
# # Check all rows deleted
# query = "SELECT * FROM abstracts_ai"
# df = pd.read_sql_query(query, connection)
# df

Unnamed: 0,id,title,date_published,abstract
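
The create / insert / fetch / delete cycle from this section can be reproduced as a self-contained sketch. It assumes an in-memory database (`":memory:"`) instead of `../data/abstracts.db`, so nothing is written to disk, and it uses `?` placeholders rather than values embedded in the SQL string.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; the notebook uses a file path instead
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Same schema as the abstracts_ai table above
cursor.execute(
    """
    CREATE TABLE IF NOT EXISTS abstracts_ai (
        id TEXT PRIMARY KEY,
        title TEXT,
        date_published TEXT,
        abstract TEXT
    )
    """
)

# Insert the dummy row with placeholders instead of hand-built SQL
cursor.execute(
    "INSERT INTO abstracts_ai VALUES (?, ?, ?, ?)",
    ("a1", "test_title", "2023-02-16 18:16:09+00:00", "test abstract text"),
)
connection.commit()  # a file-backed database only keeps changes once committed

rows_after_insert = cursor.execute("SELECT * FROM abstracts_ai").fetchall()

# Delete the dummy row and confirm the table is empty again
cursor.execute("DELETE FROM abstracts_ai")
connection.commit()
rows_after_delete = cursor.execute("SELECT * FROM abstracts_ai").fetchall()

print(rows_after_insert)
print(rows_after_delete)
connection.close()
```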


___
## 03 - Retrieve and Store arXiv AI Article Abstracts

In [None]:
# for result in search.results():
#     entry_id = result.entry_id
#     uid = entry_id.split('.')[-1]
#     title = result.title
#     date_published = result.published
#     abstract = result.summary
    
#     query = 'INSERT OR REPLACE INTO abstracts_ai(id, title, date_published, abstract)' + \
#             ' VALUES(?, ?, ?, ?);'
    
#     fields = (uid, title, date_published, abstract)

#     cursor.execute(query, fields)

# connection.commit()  # persist the inserted rows to the database file

In [14]:
# # Fetch all rows
# query = "SELECT * FROM abstracts_ai"
# df = pd.read_sql_query(query, connection)
# df

Unnamed: 0,id,title,date_published,abstract
0,05512v1,PAC-NeRF: Physics Augmented Continuum Neural R...,2023-03-09 18:59:50+00:00,Existing approaches to system identification (...
1,05510v1,Planning with Large Language Models for Code G...,2023-03-09 18:59:47+00:00,Existing large language model-based code gener...
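
`INSERT OR REPLACE` together with the `id TEXT PRIMARY KEY` column means re-running the retrieval should not create duplicate rows. Here is a small sketch of that behaviour, assuming an in-memory database and placeholder records (the ids, titles and abstracts below are invented for illustration). Dates are converted to ISO strings before insertion, since the column is declared `TEXT`.

```python
import sqlite3
from datetime import datetime, timezone

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE abstracts_ai ("
    "id TEXT PRIMARY KEY, title TEXT, date_published TEXT, abstract TEXT)"
)

# Placeholder records; 'id-1' appears twice to show the REPLACE behaviour
published = datetime(2023, 3, 9, 18, 59, 50, tzinfo=timezone.utc)
records = [
    ("id-1", "first title", published, "first abstract"),
    ("id-2", "second title", published, "second abstract"),
    ("id-1", "first title (updated)", published, "first abstract"),
]

# Store datetimes as ISO 8601 text so the TEXT column holds a predictable format
rows = [(uid, title, date.isoformat(), abstract)
        for uid, title, date, abstract in records]

query = ("INSERT OR REPLACE INTO abstracts_ai(id, title, date_published, abstract) "
         "VALUES(?, ?, ?, ?);")

# Using the connection as a context manager commits on success, rolls back on error
with connection:
    connection.executemany(query, rows)

stored = connection.execute(
    "SELECT id, title, date_published FROM abstracts_ai ORDER BY id"
).fetchall()
for row in stored:
    print(row)
```

Only two rows remain, and the row for `id-1` carries the title from the later insert.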


## Alternative - Without SQLite

In [38]:
df_raw = pd.DataFrame()

In [39]:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []

for result in search.results():
    entry_id = result.entry_id
    uid = entry_id.split('.')[-1]
    title = result.title
    date_published = result.published
    abstract = result.summary
    
    result_dict = {'uid': uid,
                   'title': title,
                   'date_published': date_published,
                   'abstract': abstract
                  }
    
    rows.append(result_dict)

# Build the DataFrame once from the collected rows
df_raw = pd.DataFrame(rows)

In [40]:
df_raw

Unnamed: 0,uid,title,date_published,abstract
0,13519v1,Learning and Verification of Task Structure in...,2023-03-23 17:59:54+00:00,Given the enormous number of instructional vid...
1,13518v1,Three ways to improve feature alignment for op...,2023-03-23 17:59:53+00:00,The core problem in zero-shot open vocabulary ...
2,13512v1,Towards Solving Fuzzy Tasks with Human Feedbac...,2023-03-23 17:59:17+00:00,To facilitate research in the direction of fin...
3,13511v1,Neural Preset for Color Style Transfer,2023-03-23 17:59:10+00:00,"In this paper, we present a Neural Preset tech..."
4,13508v1,DreamBooth3D: Subject-Driven Text-to-3D Genera...,2023-03-23 17:59:00+00:00,"We present DreamBooth3D, an approach to person..."
5,13497v1,TriPlaneNet: An Encoder for EG3D Inversion,2023-03-23 17:56:20+00:00,Recent progress in NeRF-based GANs has introdu...
6,13496v1,The effectiveness of MAE pre-pretraining for b...,2023-03-23 17:56:12+00:00,This paper revisits the standard pretrain-then...
7,13494v1,Attention! Dynamic Epistemic Logic Models of (...,2023-03-23 17:55:32+00:00,Attention is the crucial cognitive ability tha...
8,13489v1,Boosting Reinforcement Learning and Planning w...,2023-03-23 17:53:44+00:00,Although reinforcement learning has seen treme...
9,13483v1,NS3D: Neuro-Symbolic Grounding of 3D Objects a...,2023-03-23 17:50:40+00:00,Grounding object properties and relations in 3...
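
A note on the `uid` column: `entry_id.split('.')[-1]` keeps only the part after the last dot, so the `YYMM` prefix of the arXiv identifier is dropped and papers from different months can end up with the same `uid`. Here is a sketch of an alternative that keeps the full identifier. `arxiv_short_id` is a hypothetical helper, and the second URL is made up to demonstrate the collision; arxiv.py also appears to offer `Result.get_short_id()` for the same purpose.

```python
def lossy_uid(entry_id):
    """The approach used in the notebook: text after the last dot."""
    return entry_id.split(".")[-1]


def arxiv_short_id(entry_id):
    """Everything after '/abs/', e.g. '2303.13519v1' (hypothetical helper)."""
    return entry_id.split("/abs/", 1)[-1]


march = "http://arxiv.org/abs/2303.13519v1"     # entry_id from the output above
february = "http://arxiv.org/abs/2302.13519v1"  # made-up URL for illustration

print(lossy_uid(march))       # 13519v1
print(arxiv_short_id(march))  # 2303.13519v1

# The lossy form cannot tell the two months apart; the full short id can
print(lossy_uid(march) == lossy_uid(february))
print(arxiv_short_id(march) == arxiv_short_id(february))
```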


___
## 04 - DataFrame Pre-Processing

In [41]:
df = df_raw.copy()
print(df.dtypes)

uid                            object
title                          object
date_published    datetime64[ns, UTC]
abstract                       object
dtype: object


In [42]:
df['date_published'] = pd.to_datetime(df['date_published'])

In [43]:
print(df.dtypes)

uid                            object
title                          object
date_published    datetime64[ns, UTC]
abstract                       object
dtype: object


In [44]:
# Create empty column to store keyword extraction output
df['keywords_and_scores'] = ''

# Create empty column to store top keywords
df['keywords'] = ''

In [45]:
df

Unnamed: 0,uid,title,date_published,abstract,keywords_and_scores,keywords
0,13519v1,Learning and Verification of Task Structure in...,2023-03-23 17:59:54+00:00,Given the enormous number of instructional vid...,,
1,13518v1,Three ways to improve feature alignment for op...,2023-03-23 17:59:53+00:00,The core problem in zero-shot open vocabulary ...,,
2,13512v1,Towards Solving Fuzzy Tasks with Human Feedbac...,2023-03-23 17:59:17+00:00,To facilitate research in the direction of fin...,,
3,13511v1,Neural Preset for Color Style Transfer,2023-03-23 17:59:10+00:00,"In this paper, we present a Neural Preset tech...",,
4,13508v1,DreamBooth3D: Subject-Driven Text-to-3D Genera...,2023-03-23 17:59:00+00:00,"We present DreamBooth3D, an approach to person...",,
5,13497v1,TriPlaneNet: An Encoder for EG3D Inversion,2023-03-23 17:56:20+00:00,Recent progress in NeRF-based GANs has introdu...,,
6,13496v1,The effectiveness of MAE pre-pretraining for b...,2023-03-23 17:56:12+00:00,This paper revisits the standard pretrain-then...,,
7,13494v1,Attention! Dynamic Epistemic Logic Models of (...,2023-03-23 17:55:32+00:00,Attention is the crucial cognitive ability tha...,,
8,13489v1,Boosting Reinforcement Learning and Planning w...,2023-03-23 17:53:44+00:00,Although reinforcement learning has seen treme...,,
9,13483v1,NS3D: Neuro-Symbolic Grounding of 3D Objects a...,2023-03-23 17:50:40+00:00,Grounding object properties and relations in 3...,,


___
## 05 - Keyword Extraction with KeyBERT
- https://github.com/MaartenGr/KeyBERT
- https://maartengr.github.io/KeyBERT/guides/embeddings.html

In [58]:
# Using 'all-MiniLM-L6-v2' given its speed and good quality
# https://www.sbert.net/docs/pretrained_models.html#model-overview
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

In [59]:
# Define parameters
stop_words = 'english'
ngram_lower_bound = 1
ngram_upper_bound = 2
use_mmr = True
diversity = 0.1
use_maxsum = False
nr_candidates = 20  # only relevant when use_maxsum=True; not passed to extract_keywords below
top_n = 8

In [60]:
for i, row in df.iterrows():
    abstract_text = row['abstract']
    kw_output = kw_model.extract_keywords(abstract_text,
                                          keyphrase_ngram_range=(ngram_lower_bound, ngram_upper_bound),
                                          stop_words=stop_words,
                                          use_mmr=use_mmr,
                                          use_maxsum=use_maxsum,
                                          diversity=diversity,
                                          top_n=top_n)
    df.at[i, 'keywords_and_scores'] = kw_output
    
    # Keep only the keyword from every (keyword, score) pair
    df.at[i, 'keywords'] = [keyword for keyword, _ in kw_output]

In [61]:
df

Unnamed: 0,uid,title,date_published,abstract,keywords_and_scores,keywords
0,13519v1,Learning and Verification of Task Structure in...,2023-03-23 17:59:54+00:00,Given the enormous number of instructional vid...,"[(train videotaskformer, 0.533), (learns step,...","[train videotaskformer, learns step, trained v..."
1,13518v1,Three ways to improve feature alignment for op...,2023-03-23 17:59:53+00:00,The core problem in zero-shot open vocabulary ...,"[(detection training, 0.4193), (vocabulary det...","[detection training, vocabulary detection, zer..."
2,13512v1,Towards Solving Fuzzy Tasks with Human Feedbac...,2023-03-23 17:59:17+00:00,To facilitate research in the direction of fin...,"[(minecraft competition, 0.5917), (human feedb...","[minecraft competition, human feedback, feedba..."
3,13511v1,Neural Preset for Color Style Transfer,2023-03-23 17:59:10+00:00,"In this paper, we present a Neural Preset tech...","[(color normalization, 0.5621), (adaptive colo...","[color normalization, adaptive color, color ma..."
4,13508v1,DreamBooth3D: Subject-Driven Text-to-3D Genera...,2023-03-23 17:59:00+00:00,"We present DreamBooth3D, an approach to person...","[(3d generative, 0.5317), (text 3d, 0.5204), (...","[3d generative, text 3d, 3d assets, dreambooth..."
5,13497v1,TriPlaneNet: An Encoder for EG3D Inversion,2023-03-23 17:56:20+00:00,Recent progress in NeRF-based GANs has introdu...,"[(3d gans, 0.6795), (2d gan, 0.6178), (gan inv...","[3d gans, 2d gan, gan inversion, space gan, ba..."
6,13496v1,The effectiveness of MAE pre-pretraining for b...,2023-03-23 17:56:12+00:00,This paper revisits the standard pretrain-then...,"[(models pretrained, 0.5777), (scale pretraini...","[models pretrained, scale pretraining, pretrai..."
7,13494v1,Attention! Dynamic Epistemic Logic Models of (...,2023-03-23 17:55:32+00:00,Attention is the crucial cognitive ability tha...,"[(propositional attention, 0.608), (attention ...","[propositional attention, attention axiomatiza..."
8,13489v1,Boosting Reinforcement Learning and Planning w...,2023-03-23 17:53:44+00:00,Although reinforcement learning has seen treme...,"[(demonstrations learning, 0.7309), (learning ...","[demonstrations learning, learning planning, d..."
9,13483v1,NS3D: Neuro-Symbolic Grounding of 3D Objects a...,2023-03-23 17:50:40+00:00,Grounding object properties and relations in 3...,"[(3d referring, 0.5461), (ns3d neuro, 0.5423),...","[3d referring, ns3d neuro, grounding ns3d, ns3..."
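
With `use_mmr=True` and `diversity=0.1`, the extraction leans heavily towards relevance, which may explain the near-duplicate phrases in the output (e.g. "3d gans", "2d gan", "gan inversion"). Here is a pure-Python sketch of Maximal Marginal Relevance as I understand it. It is not KeyBERT's actual code, and the toy 2-D vectors stand in for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr_select(doc_vec, cand_vecs, top_n, diversity):
    """Return indices of candidates chosen by Maximal Marginal Relevance."""
    doc_sim = [cosine(v, doc_vec) for v in cand_vecs]

    # Start with the candidate most similar to the document
    selected = [max(range(len(cand_vecs)), key=lambda i: doc_sim[i])]
    remaining = [i for i in range(len(cand_vecs)) if i != selected[0]]

    while remaining and len(selected) < top_n:
        def score(i):
            # Penalise similarity to anything that has already been picked
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates, candidate 2 is distinct
doc = (1.0, 0.0)
candidates = [(1.0, 0.1), (1.0, 0.2), (0.6, -0.8)]

low_diversity = mmr_select(doc, candidates, top_n=2, diversity=0.1)
high_diversity = mmr_select(doc, candidates, top_n=2, diversity=0.7)
print(low_diversity)   # [0, 1]: the near-duplicate wins
print(high_diversity)  # [0, 2]: the distinct candidate wins
```

Raising `diversity` in the parameter cell above would trade some relevance for less overlapping keywords.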


### Get value counts of keywords

In [65]:
keywords_count = pd.DataFrame(pd.Series([x for item in df.keywords for x in item]).value_counts()).reset_index()
keywords_count.columns = ['keyword', 'count']
keywords_count.head(10)

Unnamed: 0,keyword,count
0,human mesh,2
1,train videotaskformer,1
2,meshfree particle,1
3,person recordings,1
4,object localization,1
5,meshfree methods,1
6,volume meshfree,1
7,meshfree collocation,1
8,meshfree lagrangian,1
9,meshfree,1
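
The same tally can be produced with the standard library. This is a sketch using `collections.Counter` on a few keywords taken from the table above. The grouping into per-abstract lists is invented for illustration and stands in for `df.keywords`.

```python
from collections import Counter
from itertools import chain

# Stand-in for df.keywords: one list of keywords per abstract (grouping is illustrative)
keyword_lists = [
    ["human mesh", "train videotaskformer"],
    ["human mesh", "object localization"],
    ["meshfree", "meshfree methods"],
]

# Flatten the nested lists and count each keyword
counts = Counter(chain.from_iterable(keyword_lists))

top_keyword = counts.most_common(1)
print(top_keyword)
print(counts["meshfree"])
```

`counts.most_common(10)` would mirror `keywords_count.head(10)`.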
