## KeyBERT Preprocessing
##### This notebook records keyword extraction from the Batteries Abstract data (training set only) from HuggingFace, using the KeyBERT framework.

### 1. Import Data

In [14]:
# Import packages
import pandas as pd
import numpy as np

# Import data
df = pd.read_excel(r"/Users/ishanisahama/Documents/Data Science/github_blog/keybert and h2o/input/df_model.xlsx") 

# Remove any junk columns
df.drop("Unnamed: 0", axis=1, inplace=True)

# View dataset
df.head(10)

,abstract,label,id,text_proc
0,A new ionic liquid-based electrolyte for lithi...,battery,0,new ionic liquid-based electrolyte lithium bat...
1,The interest in self-consumption of PV electri...,battery,1,interest self-consumption pv electricity grid-...
2,This paper explores the synergistic and cataly...,battery,2,paper explores synergistic catalytic propertie...
3,Li-rich layered oxides with micro-sized primar...,battery,3,li-rich layered oxides micro-sized primary par...
4,"In the present study, Al2O3 is utilized for th...",battery,4,"present study, alo utilized first time coating..."
5,Micron spherical Sn doping Li1.2Ni0.2Mn0.8O2 c...,battery,5,micron spherical sn doping li.ni.mn.o cathode ...
6,Based on the re-construction idea of carbon na...,battery,6,based re-construction idea carbon nanomaterial...
7,Li-rich layered transition metal oxides with t...,battery,7,li-rich layered transition metal oxides nomina...
8,Micrometre-size silicon particles are desirabl...,battery,8,micrometre-size silicon particles desirable ba...
9,"Globally, buildings are responsible for approx...",battery,9,"globally, buildings responsible approximately ..."


### 2. Text Splitting

The KeyBERT author recommends the “all-MiniLM-L6-v2” sentence transformer for English documents; it embeds both the documents and the candidate keywords. This model has a sequence length limit of 256, so longer text must be split (preferably into sentences) before KeyBERT is run on it. Sequence length here means the number of tokens, which are word pieces produced by the model's tokenizer rather than whole words, so a text can exceed the limit with fewer than 256 words. The KeyBERT author also notes that long text is not the intended use case for KeyBERT, because a long document tends to cover many different topics. For this project, the text is therefore split into sentences of variable length, and KeyBERT is then applied to each sentence.

Source 1: https://maartengr.github.io/KeyBERT/api/keybert.html. 

Source 2: https://maartengr.github.io/KeyBERT/guides/embeddings.html.

Source 3: https://github.com/MaartenGr/KeyBERT/issues/70.
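
As a rough illustration of the length check described above, the sketch below splits a text on full stops and flags any piece that exceeds a limit. The helper names are hypothetical, and the whitespace word count is only an assumed stand-in for the real token count: the model's tokenizer breaks words into word pieces, so the true count is usually higher.

```python
MAX_SEQ_LENGTH = 256  # limit quoted above for "all-MiniLM-L6-v2"

def split_into_sentences(text, sep=". "):
    """Split text on the separator and drop empty pieces."""
    return [piece.strip() for piece in text.split(sep) if piece.strip()]

def rough_length(sentence):
    """Whitespace word count: an assumed, rough proxy for the token count."""
    return len(sentence.split())

def oversized(sentences, limit=MAX_SEQ_LENGTH):
    """Return the sentences whose rough length exceeds the limit."""
    return [s for s in sentences if rough_length(s) > limit]

example = "new ionic liquid-based electrolyte lithium batteries. coulombic efficiency remains high."
sentences = split_into_sentences(example)
print(sentences)
print([rough_length(s) for s in sentences])
print(oversized(sentences))
```

An exact check would need the tokenizer that ships with the embedding model; this version only catches sentences that are clearly too long.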


In [15]:
# Subset data
df_proc = df[["id", "text_proc"]] # The "label" column was kept out as this dataset is a sample to test KeyBERT only.
df_proc.head(10)

,id,text_proc
0,0,new ionic liquid-based electrolyte lithium bat...
1,1,interest self-consumption pv electricity grid-...
2,2,paper explores synergistic catalytic propertie...
3,3,li-rich layered oxides micro-sized primary par...
4,4,"present study, alo utilized first time coating..."
5,5,micron spherical sn doping li.ni.mn.o cathode ...
6,6,based re-construction idea carbon nanomaterial...
7,7,li-rich layered transition metal oxides nomina...
8,8,micrometre-size silicon particles desirable ba...
9,9,"globally, buildings responsible approximately ..."


In [16]:
# Split text before running KeyBERT (split on the full stop that was kept in during data preprocessing for this purpose)
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the df_proc subset
df_proc = df_proc.assign(sentence_split=df_proc["text_proc"].apply(lambda x: x.split(". ")))


In [17]:
# Subset data
df_proc = df_proc[["id", "sentence_split"]] 

# Explode "sentence_split" column to separate rows for KeyBERT processing
df_proc = df_proc.explode("sentence_split")
df_proc.head(10)

# SOURCE: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html.


,id,sentence_split
0,0,new ionic liquid-based electrolyte lithium bat...
0,0,electrolyte contain volatile organic component...
0,0,.m lifsi–pptfsi electrolyte shown work effecti...
0,0,discharge capacity –mahg− achieved c/ rate cou...
0,0,electrolyte also compatible lico/mn/ni/o (nmc)...
0,0,"c/ rate, achieved discharge capacity mahg−"
0,0,coulombic efficiency remains high %.
1,1,interest self-consumption pv electricity grid-...
1,1,self-consumption defined share total pv produc...
1,1,decreased subsidies pv electricity several cou...


In [18]:
# Remove any empty rows from sentence splitting
# Both the empty string and a single space count as missing text
# Assigning the result back avoids chained in-place calls on a column, which newer pandas versions may not apply
df_proc["sentence_split"] = df_proc["sentence_split"].replace(["", " "], np.nan)
df_proc = df_proc.dropna(subset=["sentence_split"])
print("The dimensions of df_proc are:", df_proc.shape)

The dimensions of df_proc are: (5362, 2)
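
The clean-up above targets two exact values, the empty string and a single space. A sketch of a more general alternative, which was not used in this notebook, is to strip whitespace and drop anything that ends up empty. The `toy` frame below is made-up data for illustration.

```python
import pandas as pd

# Made-up rows, including three kinds of blank text
toy = pd.DataFrame({
    "id": [0, 0, 0, 0, 1],
    "sentence_split": ["first sentence", "", " ", "\t ", "another sentence"],
})

# A row is blank if it is missing or contains only whitespace
is_blank = toy["sentence_split"].isna() | toy["sentence_split"].str.strip().eq("")
cleaned = toy[~is_blank]
print(cleaned)
print("The dimensions of cleaned are:", cleaned.shape)
```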


### 3. Apply KeyBERT

In [19]:
# Import packages
from keybert import KeyBERT

# Instantiate the model
kw_model = KeyBERT(model = "all-MiniLM-L6-v2") # The "all-MiniLM-L6-v2" model has been recommended for English text with KeyBERT.

# Define the vectorizer
# from keyphrase_vectorizers import KeyphraseCountVectorizer
# vectorizer = KeyphraseCountVectorizer()

# from sklearn.feature_extraction.text import CountVectorizer
# vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")

'''
KeyphraseVectorizers extends the typical CountVectorizer method from scikit-learn: it extracts keyphrases using 
part-of-speech patterns (via the spaCy library), so an ngram range does not need to be specified. It tags the documents 
with part-of-speech labels and then applies a regex pattern to keep the keyphrases that match (default pattern: 
<J.*>*<N.*>+ ; extracts keyphrases that have 0 or more adjectives followed by 1 or more nouns). This regex pattern can be 
customised to suit the problem. Part-of-speech tagging also needs a language-specific model (here, English).

KeyphraseVectorizers and CountVectorizer were not used here because I could not get them to work for the workplace 
data problem: the preprocessing at this stage stripped every word from that data. In my previous runs of KeyBERT, I 
either preprocessed the text (for messy text) or preprocessed minimally and then applied KeyBERT directly. In this case, 
KeyBERT itself will extract unigrams and bigrams with minimal processing. Trigrams will not be used, to avoid niche 
keywords being selected by KeyBERT. 

SOURCE 1: https://discuss.python.org/t/countvectorizer-throwing-valueerror-empty-vocabulary-perhaps-the-documents-only-contain-stop-words/5388
SOURCE 2: https://maartengr.github.io/KeyBERT/guides/countvectorizer.html#:~:text=CountVectorizer%20Tips%20%26%20Tricks%201%20Basic%20Usage%20First%2C,of%20the%20resulting%20keywords.%20...%203%20KeyphraseVectorizers%20
'''



In [20]:
# Define KeyBERT function
def keybert(x, ngrams, mmr, n, diversity):
    """Extract keywords from text x with kw_model; returns a list of (keyword, score) pairs."""
    keywords = kw_model.extract_keywords(x, keyphrase_ngram_range=ngrams, use_mmr=mmr, top_n=n, diversity=diversity)
    return keywords
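
The `n` argument of the wrapper above is passed to KeyBERT as `top_n`. One way to set it is as a fraction of the unique whitespace-separated words in a sentence. The sketch below works through that arithmetic; the helper name `keyword_budget` is hypothetical and the sentence is one of the split sentences shown earlier.

```python
def keyword_budget(sentence, ratio):
    """Number of keywords to request: a fraction of the unique words in the sentence."""
    unique_words = set(sentence.split(" "))
    return round(len(unique_words) * ratio)

sentence = "coulombic efficiency remains high %."  # 5 unique words
for ratio in (0.75, 0.5, 0.25, 0.1):
    print(ratio, keyword_budget(sentence, ratio))
# 0.75 -> 4, 0.5 -> 2, 0.25 -> 1, 0.1 -> 0
```

Two details are worth noting. Python's `round` rounds halves to the nearest even integer, so 2.5 becomes 2 rather than 3. For short sentences, the smaller ratios can round down to 0, so it may be worth checking how many keywords come back for those rows.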

In [21]:
# Use Maximal Marginal Relevance
''' Maximal Marginal Relevance (MMR): This method diversifies the keywords extracted from text, so that they don't 
look similar to each other. Keywords/keyphrases are selected one at a time: each candidate is scored on its similarity to 
the document and penalised for its similarity to the keywords/keyphrases already selected in earlier steps. The result is 
a keyword selection that is similar to the document while the keywords differ from each other. The diversity parameter 
tunes this trade-off, from low diversity (closer to 0) to high diversity (closer to 1). KeyBERT's default diversity is 
0.5; a value of 0.8 is used here. In my own experience with KeyBERT, extracting keyword output that makes up 75% of the 
text length (measured here as the number of unique words) has been crucial for good downstream classification 
performance. That ratio suits longer text, so extraction at 10%, 25% and 50% of the text length will also be tried, to 
pick the best keyword output. Note that as more keywords are extracted, the cosine similarities tend to be low after the 
first five keywords per sentence.'''

# SOURCE: https://www.maartengrootendorst.com/blog/keybert/.

for ratio in (0.75, 0.5, 0.25, 0.1):
    # top_n is the given fraction of the unique words in the sentence
    df_proc[f"kw_proc_{ratio}"] = df_proc["sentence_split"].apply(
        lambda x, r=ratio: keybert(x, (1, 2), True, round(len(set(x.split(" "))) * r), 0.8)
    )
df_proc.head()


,id,sentence_split,kw_proc_0.75,kw_proc_0.5,kw_proc_0.25,kw_proc_0.1
0,0,new ionic liquid-based electrolyte lithium bat...,"[(ionic liquid, 0.5577), (lithium batteries, 0...","[(ionic liquid, 0.5577), (lithium batteries, 0...","[(ionic liquid, 0.5577), (lithium batteries, 0...","[(ionic liquid, 0.5577), (operating elevated, ..."
0,0,electrolyte contain volatile organic component...,"[(electrolyte contain, 0.6292), (ionically, 0....","[(electrolyte contain, 0.6292), (volatile orga...","[(electrolyte contain, 0.6292), (thermally sta...","[(electrolyte contain, 0.6292)]"
0,0,.m lifsi–pptfsi electrolyte shown work effecti...,"[(graphite anode, 0.6204), (electrolyte interf...","[(graphite anode, 0.6204), (electrolyte interf...","[(graphite anode, 0.6204), (lifsi pptfsi, 0.43...","[(graphite anode, 0.6204), (lifsi pptfsi, 0.43..."
0,0,discharge capacity –mahg− achieved c/ rate cou...,"[(discharge capacity, 0.6112), (coulombic effi...","[(discharge capacity, 0.6112), (mahg achieved,...","[(discharge capacity, 0.6112), (mahg achieved,...","[(discharge capacity, 0.6112)]"
0,0,electrolyte also compatible lico/mn/ni/o (nmc)...,"[(electrolyte compatible, 0.6412), (nmc cathod...","[(electrolyte compatible, 0.6412), (nmc cathod...","[(electrolyte compatible, 0.6412), (mn ni, 0.2...","[(electrolyte compatible, 0.6412)]"
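
To make the effect of the diversity setting concrete, here is a minimal pure-Python sketch of the Maximal Marginal Relevance idea. It is an illustration under stated assumptions, not KeyBERT's own code: the function names are hypothetical and the two-dimensional vectors are made up rather than real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr_select(doc_vec, candidates, top_n, diversity):
    """Pick up to top_n phrases from candidates (a dict of phrase -> vector)."""
    if top_n < 1 or not candidates:
        return []
    doc_sim = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    # Start with the phrase most similar to the document
    selected = [max(doc_sim, key=doc_sim.get)]
    remaining = [p for p in candidates if p != selected[0]]
    while remaining and len(selected) < top_n:
        def score(p):
            # Penalise similarity to the closest phrase already selected
            redundancy = max(cosine(candidates[p], candidates[s]) for s in selected)
            return (1 - diversity) * doc_sim[p] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up vectors: two near-duplicate phrases and one unrelated phrase
doc = (1.0, 0.0)
candidates = {
    "lithium battery": (1.0, 0.1),
    "lithium batteries": (1.0, 0.12),
    "solar panel": (0.6, 0.8),
}
print(mmr_select(doc, candidates, top_n=2, diversity=0.8))
print(mmr_select(doc, candidates, top_n=2, diversity=0.0))
```

With a diversity of 0.8 the second pick is the unrelated phrase, because the near-duplicate is heavily penalised. With a diversity of 0.0 the selection reduces to ranking by similarity to the document, so the near-duplicate is chosen.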


In [22]:
# Export dataset
df_proc.to_excel(r"/Users/ishanisahama/Documents/Data Science/github_blog/keybert and h2o/output/kb_proc.xlsx")