# Case Study 5.1 - 02 KeyBERT

In this script we use a library called KeyBERT to create semantically informed keyword representations of the data.

KeyBERT works by identifying the words or phrases in a block of text that are closest in meaning to the complete text.
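The idea of "closest in meaning" can be sketched as a similarity ranking. The example below is only an illustration under stated assumptions: the three-dimensional vectors are made up, the helper names `cosine` and `rank_by_similarity` are hypothetical, and a real run would take its vectors from the embedding model rather than from hand-written lists.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(doc_vec, candidate_vecs, top_n=2):
    """Return the top_n candidates most similar to the document vector."""
    scored = [(word, cosine(doc_vec, vec)) for word, vec in candidate_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up "embeddings" for illustration only (not produced by any model).
doc_vec = [1.0, 1.0, 0.0]
candidate_vecs = {
    "stressful": [0.9, 1.0, 0.1],
    "homework": [1.0, 0.2, 0.0],
    "banana": [0.0, 0.1, 1.0],
}

ranked = rank_by_similarity(doc_vec, candidate_vecs, top_n=2)
print(ranked)
```

Candidates whose vectors point in nearly the same direction as the document vector score highest, which is why each keyword comes back paired with a score.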

The original scripts for the content of this Notebook can be found here:
* [CaseStudy_5.1_02-01.py](CaseStudy_5.1_02-01.py)
* [CaseStudy_5.1_02-02.py](CaseStudy_5.1_02-02.py)

<span style="color: #FF0000;">Errata:</span> The listing presented in the book for the manipulation and display of the keywords contained several typos. These are corrected below.

In [5]:
from keybert import KeyBERT
from collections import Counter
import pandas as pd
import numpy as np

We use a different version of tqdm designed for Notebooks.
This line differs from the book and the original script.

In [6]:
from tqdm.notebook import tqdm

In [7]:
df = pd.read_csv("data/complete_with_features.csv")
test = df[df["RANDOM"]>=0.95]

text_list = test['text'].tolist()

## Part 01 - Key Word Extraction

We start by extracting keywords from across all of the text.

In [9]:
model_name = 'bert-base-uncased'
kw_model = KeyBERT(model=model_name)

def extract_keywords(texts, top_n=3):
    all_keywords = []
    for text in tqdm(texts, desc="Extracting keywords"):
        keywords = kw_model.extract_keywords(text, top_n=top_n, stop_words='english')
        all_keywords.extend([kw[0] for kw in keywords])
    return Counter(all_keywords)

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


**Note:** Due to a hardware/library compatibility issue (specific to M4 Macs), we include NumPy-specific warning suppressions.

If you are using Windows, Linux, or an Intel Mac, you do not need to wrap the `extract_keywords` function call in the `with np.errstate` block.
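As a small illustration of what `np.errstate` does, the sketch below triggers floating-point warnings on purpose and then repeats the same operation with them suppressed. The arrays `a` and `b` are made up for the example and have nothing to do with the keyword data.

```python
import warnings
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 0.0])

# Without suppression, NumPy reports the problem as RuntimeWarnings.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with np.errstate(all="warn"):
        unsuppressed = a / b

# With the same settings used in this Notebook, no warnings are raised.
with warnings.catch_warnings(record=True) as caught_quiet:
    warnings.simplefilter("always")
    with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
        suppressed = a / b

print(len(caught), len(caught_quiet))
print(suppressed)
```

The context manager only silences the warnings. The computed values are the same either way, and the previous error settings are restored when the `with` block ends.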

In [11]:
with np.errstate(divide='ignore', invalid='ignore', over='ignore'):
    keywds = extract_keywords(text_list, top_n=3)

Extracting keywords:   0%|          | 0/3244 [00:00<?, ?it/s]

In [18]:
df_keywords = pd.DataFrame({
    "keywords": pd.Series(keywds)
}).fillna(0).astype(int).reset_index()

df_keywords.columns=['keyword','count']
df_keywords = df_keywords.sort_values(by='count', ascending=False)

keywords = df_keywords[df_keywords['count']>=30]

print(keywords.head(5).to_markdown())

|      | keyword         |   count |
|-----:|:----------------|--------:|
|   25 | feel            |     189 |
|  715 | extracurricular |     100 |
| 1337 | student_name    |      86 |
|    8 | stressful       |      84 |
|  392 | hey             |      77 |
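
An alternative way to build the same kind of table is sketched below, using `Counter.most_common()` to get the rows already sorted. The counts in `toy_counts` and the threshold of 3 are made up for illustration. They stand in for the `Counter` returned by `extract_keywords` and the cut-off used above.

```python
from collections import Counter
import pandas as pd

# Made-up counts standing in for the Counter returned by extract_keywords.
toy_counts = Counter({"feel": 5, "stressful": 3, "hey": 1})

# most_common() yields (keyword, count) pairs sorted by descending count.
df_toy = pd.DataFrame(toy_counts.most_common(), columns=["keyword", "count"])

# Keep only keywords that reach a minimum count (hypothetical threshold).
frequent = df_toy[df_toy["count"] >= 3]
print(frequent)
```

One difference from the cell above: here the index reflects rank, whereas the original approach keeps an index based on the order in which each keyword was first seen.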


## Part 02 - Key Phrase Extraction

We modify our extraction function slightly to extract key phrases rather than words.

In [20]:
def extract_keywords2(texts, top_n=3):
    all_keywords = []
    for text in tqdm(texts, desc="Extracting keywords"):
        keywords = kw_model.extract_keywords(text, top_n=top_n, keyphrase_ngram_range=(2, 3), stop_words='english')
        all_keywords.extend([kw[0] for kw in keywords])
    return Counter(all_keywords)
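
To make the effect of `keyphrase_ngram_range=(2, 3)` more concrete, the sketch below generates candidate phrases of two and three words from a single sentence. It is a simplified stand-in, not KeyBERT's own code: to our understanding KeyBERT delegates this step to a scikit-learn `CountVectorizer`, while the helper `candidate_phrases`, the tiny stop-word set, and the example sentence here are all made up for illustration.

```python
def candidate_phrases(text, n_min=2, n_max=3, stop_words=frozenset()):
    """Build all word n-grams with n_min..n_max words, after dropping stop words."""
    tokens = [tok for tok in text.lower().split() if tok not in stop_words]
    phrases = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[start:start + n]))
    return phrases

# Hypothetical stop-word set and sentence, for illustration only.
toy_stop_words = frozenset({"the", "a", "of"})
phrases = candidate_phrases(
    "support abolishing the electoral college",
    stop_words=toy_stop_words,
)
print(phrases)
```

Because stop words are removed before the n-grams are formed, a candidate can join words that were not adjacent in the original sentence, which helps explain phrases that read as slightly truncated.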

In [21]:
with np.errstate(divide='ignore', invalid='ignore', over='ignore'):
    keywds2 = extract_keywords2(text_list, top_n=3)

Extracting keywords:   0%|          | 0/3244 [00:00<?, ?it/s]

In [22]:
df_keywords2 = pd.DataFrame({
    "keywords": pd.Series(keywds2)
}).fillna(0).astype(int).reset_index()

df_keywords2.columns=['keyphrase','count']
keywords2 = df_keywords2[df_keywords2['count']>=10]
keywords2 = keywords2.sort_values(by='count', ascending=False)

print(keywords2.head(5).to_markdown())

|      | keyphrase                    |   count |
|-----:|:-----------------------------|--------:|
| 3825 | make informed decisions      |      31 |
| 5423 | reduce traffic congestion    |      30 |
| 5959 | support abolishing electoral |      22 |
| 5738 | abolishing electoral college |      21 |
| 3612 | making decisions             |      16 |
