# Keyword Analysis with KeyBERT and Taipy

## 01 - Extraction of arXiv Abstracts with API
- https://github.com/lukasschwab/arxiv.py

In [1]:
import arxiv
import sqlite3
import pandas as pd
from keybert import KeyBERT


In [2]:
search = arxiv.Search(
            query = 'artificial intelligence',
            max_results = 2,
            sort_by = arxiv.SortCriterion.SubmittedDate,
            sort_order = arxiv.SortOrder.Descending)

In [3]:
for result in search.results():
    print(result.entry_id)
    print(result.published)
    print(result.title)
    print(result.summary)

http://arxiv.org/abs/2302.08500v1
2023-02-16 18:55:21+00:00
Auditing large language models: a three-layered approach
The emergence of large language models (LLMs) represents a major advance in
artificial intelligence (AI) research. However, the widespread use of LLMs is
also coupled with significant ethical and social challenges. Previous research
has pointed towards auditing as a promising governance mechanism to help ensure
that AI systems are designed and deployed in ways that are ethical, legal, and
technically robust. However, existing auditing procedures fail to address the
governance challenges posed by LLMs, which are adaptable to a wide range of
downstream tasks. To help bridge that gap, we offer three contributions in this
article. First, we establish the need to develop new auditing procedures that
capture the risks posed by LLMs by analysing the affordances and constraints of
existing auditing procedures. Second, we outline a blueprint to audit LLMs in
feasible and effectiv

___
## 02 - SQLite Database Setup
- https://www.digitalocean.com/community/tutorials/how-to-use-the-sqlite3-module-in-python-3

In [8]:
connection = sqlite3.connect("../data/abstracts.db")
cursor = connection.cursor()

In [3]:
# Create new table in database
cursor.execute("CREATE TABLE IF NOT EXISTS abstracts_ai (id TEXT PRIMARY KEY, \
                                                         title TEXT, \
                                                         date_published TEXT, \
                                                         abstract TEXT)"
              )

<sqlite3.Cursor at 0x1e7288b22d0>

In [4]:
# Insert dummy row
cursor.execute("INSERT INTO abstracts_ai VALUES ('a1', \
                                                 'test_title', \
                                                 '2023-02-16 18:16:09+00:00', \
                                                 'test abstract text')"
              )

<sqlite3.Cursor at 0x1e7288b22d0>

In [5]:
# Fetch all rows
query = "SELECT * FROM abstracts_ai"
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,id,title,date_published,abstract
0,a1,test_title,2023-02-16 18:16:09+00:00,test abstract text


In [6]:
# Delete all rows (the table only holds the dummy row at this point)
cursor.execute(
    "DELETE FROM abstracts_ai")

<sqlite3.Cursor at 0x1e7288b22d0>

In [7]:
# Check all rows deleted
query = "SELECT * FROM abstracts_ai"
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,id,title,date_published,abstract
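
The create / insert / fetch / delete round trip above can also be sketched without pandas. This is a minimal illustration, not the notebook's own code: it assumes an in-memory database (`":memory:"`) in place of `../data/abstracts.db`, uses `?` placeholders for the values, and deletes the dummy row by its primary key rather than clearing the whole table.

```python
import sqlite3

# Assumption: in-memory database so the sketch leaves no file behind
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

cursor.execute(
    """CREATE TABLE IF NOT EXISTS abstracts_ai (
           id TEXT PRIMARY KEY,
           title TEXT,
           date_published TEXT,
           abstract TEXT)"""
)

# "?" placeholders let sqlite3 handle quoting of the values
dummy_row = ("a1", "test_title", "2023-02-16 18:16:09+00:00", "test abstract text")
cursor.execute("INSERT INTO abstracts_ai VALUES (?, ?, ?, ?)", dummy_row)
connection.commit()

rows_after_insert = cursor.execute("SELECT * FROM abstracts_ai").fetchall()

# Delete only the dummy row, identified by its primary key
cursor.execute("DELETE FROM abstracts_ai WHERE id = ?", ("a1",))
connection.commit()

rows_after_delete = cursor.execute("SELECT * FROM abstracts_ai").fetchall()

print(rows_after_insert)
print(rows_after_delete)
```

Deleting by `id` keeps any real rows intact if the cell is re-run after the table has been populated.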


___
## 03 - Retrieve and Store arXiv AI Article Abstracts

In [10]:
search = arxiv.Search(
            query = 'artificial intelligence',
            max_results = 2,
            sort_by = arxiv.SortCriterion.SubmittedDate,
            sort_order = arxiv.SortOrder.Descending)

In [32]:
for result in search.results():
    entry_id = result.entry_id
    uid = entry_id.split('.')[-1]
    title = result.title
    date_published = result.published
    abstract = result.summary
    
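    # Replace row if unique constraint in primary key is violated.
    # "?" placeholders keep titles or abstracts containing quotes from breaking the SQL.
    cursor.execute("INSERT OR REPLACE INTO abstracts_ai VALUES (?, ?, ?, ?)",
                   (uid, title, str(date_published), abstract))

# Persist the inserted rows to the database file
connection.commit()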

In [33]:
# Fetch all rows
query = "SELECT * FROM abstracts_ai"
df = pd.read_sql_query(query, connection)
df

Unnamed: 0,id,title,date_published,abstract
0,08500v1,Auditing large language models: a three-layere...,2023-02-16 18:55:21+00:00,The emergence of large language models (LLMs) ...
1,08481v1,Local-to-Global Information Communication for ...,2023-02-16 18:40:24+00:00,Neural Architecture Search (NAS) has shown gre...
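
The ids stored above (`08500v1`, `08481v1`) come from `entry_id.split('.')[-1]`, which drops the `2302` year-month prefix. Since arXiv sequence numbers restart every month, two papers from different months could end up with the same key and overwrite each other under `INSERT OR REPLACE`. A minimal sketch of an alternative follows; the helper name `arxiv_id_from_entry` is hypothetical and it assumes every `entry_id` contains `/abs/`, as in the output shown earlier.

```python
def arxiv_id_from_entry(entry_id: str) -> str:
    """Return everything after '/abs/', e.g. '2302.08500v1' (hypothetical helper)."""
    return entry_id.split("/abs/")[-1]


entry_id = "http://arxiv.org/abs/2302.08500v1"

# Current approach: only the sequence number and version survive
short_uid = entry_id.split(".")[-1]

# Alternative: keep the year-month prefix so the key stays unique across months
full_uid = arxiv_id_from_entry(entry_id)

print(short_uid)
print(full_uid)
```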


___
## 04 - DataFrame Pre-Processing

In [13]:
print(df.dtypes)

id                object
title             object
date_published    object
abstract          object
dtype: object


In [14]:
df['date_published'] = pd.to_datetime(df['date_published'])

In [15]:
print(df.dtypes)

id                             object
title                          object
date_published    datetime64[ns, UTC]
abstract                       object
dtype: object


In [16]:
# Create empty column to store keyword extraction output
df['keywords'] = ''

In [20]:
df

Unnamed: 0,id,title,date_published,abstract,keywords
0,08500v1,Auditing large language models: a three-layere...,2023-02-16 18:55:21+00:00,The emergence of large language models (LLMs) ...,
1,08481v1,Local-to-Global Information Communication for ...,2023-02-16 18:40:24+00:00,Neural Architecture Search (NAS) has shown gre...,


___
## 05 - Keyword Extraction with KeyBERT
- https://github.com/MaartenGr/KeyBERT
- https://maartengr.github.io/KeyBERT/guides/embeddings.html

In [17]:
kw_model = KeyBERT()

In [26]:
# Define parameters
stop_words = 'english'
ngram_lower_bound = 1
ngram_upper_bound = 2
use_mmr = True
diversity = 0.2
use_maxsum = False
nr_candidates = 20
top_n = 3
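
To give a feel for what `use_mmr` and `diversity` control, here is a minimal pure-Python sketch of Maximal Marginal Relevance as it is commonly described: pick the candidate most similar to the document first, then repeatedly pick the candidate that balances similarity to the document against similarity to what has already been picked. This is an illustration only, not KeyBERT's implementation; the function names and the two-dimensional toy vectors are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(doc_vec, cand_vecs, top_n, diversity):
    """Return indices of candidates chosen by a simple MMR rule (sketch)."""
    doc_sim = [cosine(v, doc_vec) for v in cand_vecs]
    # Start with the candidate closest to the document
    selected = [max(range(len(cand_vecs)), key=lambda i: doc_sim[i])]
    remaining = [i for i in range(len(cand_vecs)) if i not in selected]

    while remaining and len(selected) < top_n:
        def score(i):
            # Penalise candidates that resemble something already selected
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings (assumption): candidates 0 and 1 are near-duplicates, 2 is different
doc = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

low_diversity = mmr_select(doc, candidates, top_n=2, diversity=0.2)
high_diversity = mmr_select(doc, candidates, top_n=2, diversity=0.7)
print(low_diversity, high_diversity)
```

With a low `diversity` the near-duplicate is still preferred because it is closest to the document; raising `diversity` trades some relevance for a less redundant second pick.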

In [29]:
for i, row in df.iterrows():
    abstract_text = row['abstract']
    kw_output = kw_model.extract_keywords(abstract_text,
                                  keyphrase_ngram_range=(ngram_lower_bound, ngram_upper_bound),
                                  stop_words=stop_words,
                                  use_mmr=use_mmr,
                                  use_maxsum=use_maxsum,
                                  diversity=diversity,
                                  nr_candidates=nr_candidates,  # only relevant when use_maxsum=True
                                  top_n=top_n)
    # Store the list of (keyword, score) tuples in the 'keywords' column
    df.at[i, 'keywords'] = kw_output
    print(kw_output)

[('governance audits', 0.5767), ('model audits', 0.542), ('auditing promising', 0.5245)]
[('convolutional network', 0.4329), ('cityscapes dataset', 0.379), ('search lgcnet', 0.3707)]


In [30]:
df

Unnamed: 0,id,title,date_published,abstract,keywords
0,08500v1,Auditing large language models: a three-layere...,2023-02-16 18:55:21+00:00,The emergence of large language models (LLMs) ...,"[(governance audits, 0.5767), (model audits, 0..."
1,08481v1,Local-to-Global Information Communication for ...,2023-02-16 18:40:24+00:00,Neural Architecture Search (NAS) has shown gre...,"[(convolutional network, 0.4329), (cityscapes ..."
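
Each `keywords` cell now holds a list of `(keyword, score)` tuples. For display or storage it may be handier to split these into plain strings and numbers. A minimal sketch, reusing the first printed output above as sample data; the variable names are illustrative:

```python
# Sample data copied from the printed output for the first abstract
kw_output = [("governance audits", 0.5767),
             ("model audits", 0.542),
             ("auditing promising", 0.5245)]

# Separate the phrases from their similarity scores
keywords = [kw for kw, _ in kw_output]
scores = [score for _, score in kw_output]

# Single comma-separated string, e.g. for a text column
keywords_str = ", ".join(keywords)

print(keywords_str)
print(scores)
```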
