In [1]:
import pandas as pd
ZEPHYR = '/kaggle/input/q-and-a-with-zephyr-7b/zephyr.json'
df = pd.read_json(lines=True, path_or_buf=ZEPHYR)
df['prompt token count'] = df['prompt'].str.split().str.len()
df['response token count'] = df['response'].str.split().str.len()
df.head()

Unnamed: 0,prompt,response,prompt token count,response token count
0,Delve into the intricate ways Histochemistry i...,"Histochemistry, the study of the chemical comp...",10,546
1,How does Chemical engineering intersect with t...,Chemical engineering plays a significant role ...,11,226
2,What are the implications of Outline of paraps...,"The ""Outline of parapsychology"" is a scientifi...",11,276
3,How does Phytopathology influence the developm...,"Phytopathology, the scientific study of plant ...",9,107
4,Delve into the detailed ethical dilemmas posed...,The insurance industry has witnessed significa...,11,477


In [2]:
df.sample(n=10, random_state=2024)['prompt'].values

array(['Explain the impact of Chamber music on the perception of beauty and aesthetics.',
       'What are the ethical considerations in Literary journalism?',
       'How does Sound Engineering shape our understanding of the multiverse theory?',
       'How does Polymer engineering challenge our understanding of the world?',
       'What are the legal and regulatory challenges within Pre-Columbian era?',
       'What are the basic principles of Higher education?',
       'Explain the relationship between Graphic design and global justice systems.',
       'Explain how General systems theory intersects with the concept of identity in a digital age.Describe the role of General systems theory in redefining human relationships in modern society.',
       'What are the roles of Experiment in understanding dark matter and dark energy?',
       'Examine the comprehensive historical figures and movements in Business law.'],
      dtype=object)

Many of these prompts look script-generated, with two concepts stitched together (for example, Sound Engineering and the multiverse theory). As a result, we expect the response to focus sometimes on one concept and sometimes on the other.
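
To make that guess concrete, here is a minimal sketch of how such a script might work. The templates and concept lists are hypothetical, loosely modelled on the sampled prompts above; the script that actually produced the dataset is not known.

```python
import random

# Hypothetical templates and concept lists, loosely modelled on the sampled prompts.
# The script that actually produced the dataset is not known.
TEMPLATES = [
    'How does {field} shape our understanding of {theme}?',
    'Explain the relationship between {field} and {theme}.',
    'What are the ethical considerations in {field}?',  # single-concept template
]
FIELDS = ['Sound Engineering', 'Graphic design', 'Literary journalism']
THEMES = ['the multiverse theory', 'global justice systems']


def make_prompt(rng: random.Random) -> str:
    """Fill a randomly chosen template with a field and, if it has a slot, a theme."""
    template = rng.choice(TEMPLATES)
    # str.format ignores keyword arguments the template does not use
    return template.format(field=rng.choice(FIELDS), theme=rng.choice(THEMES))


rng = random.Random(2024)
generated = [make_prompt(rng) for _ in range(5)]
for prompt in generated:
    print(prompt)
```

If the prompts were built this way, a two-concept prompt gives a keyword extractor two plausible candidates, which is why the prompt keyword and the response keyword may disagree.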

Now we have a research question: how often do we find the same primary keyword in both the prompt and the response?

In [3]:
df.shape

(179759, 4)

This is much more data than we can graph in a scatter plot. Let's work our way toward a meaningful sample.

In [4]:
%env TOKENIZERS_PARALLELISM=false
!pip install --quiet keybert
print('pip install keybert complete.')

env: TOKENIZERS_PARALLELISM=false
pip install keybert complete.


In [5]:
df['prompt'].nunique() / df['response'].nunique(), df.nunique()

(0.9367764618183234,
 prompt                  168394
 response                179759
 prompt token count          22
 response token count      1198
 dtype: int64)

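Every response is unique, but there are only 168,394 unique prompts across 179,759 rows, so about 6% of the rows repeat a prompt that appears elsewhere. Let's drop the duplicates from our prompt data.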
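
The share of repeated prompts can also be read directly from `duplicated()`. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data for illustration only: 'a' appears three times and 'b' twice
toy = pd.DataFrame({'prompt': ['a', 'b', 'a', 'c', 'b', 'a'],
                    'response': ['u', 'v', 'w', 'x', 'y', 'z']})

# duplicated() flags every occurrence after the first, so its mean is the repeated share
repeat_share = toy['prompt'].duplicated().mean()
unique_share = toy['prompt'].nunique() / len(toy)

# keep one row per distinct prompt, as in the next cell
deduped = toy[['prompt']].drop_duplicates(ignore_index=True)
print(repeat_share, unique_share, len(deduped))
```

The two shares always sum to one, so either can be used to report how much de-duplication removes.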

In [6]:
prompt_df = df[['prompt']].drop_duplicates(ignore_index=True).copy().sample(n=40000, random_state=2024)

The sample size of 40,000 is arbitrary; we chose it for performance reasons and can tune it up or down depending on how long we are willing to wait.

In [7]:
from arrow import now
from keybert import KeyBERT
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_DF = 1.0
MIN_DF = 10 # ignore terms that appear in fewer than 10 documents; with this many documents we can shrink the vocabulary safely
MODEL = 'all-MiniLM-L12-v2'
STOP_WORDS = 'english'
# extract keywords from the sampled, de-duplicated prompts
DOCS = prompt_df['prompt'].values.tolist()

model_start = now()
model = KeyBERT(model=MODEL,)
# we will capture almost all of the content with the default max sequence length of 128
vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=STOP_WORDS, min_df=MIN_DF, max_df=MAX_DF, )
document_embeddings, word_embeddings = model.extract_embeddings(docs=DOCS, vectorizer=vectorizer, )
print('embedding time: {}'.format(now() - model_start))
print('we have {} documents and {} words.'.format(len(document_embeddings), len(word_embeddings)))
keywords = model.extract_keywords(docs=DOCS, top_n=1, stop_words=STOP_WORDS, vectorizer=vectorizer,
                                  doc_embeddings=document_embeddings, word_embeddings=word_embeddings, min_df=MIN_DF, )
print('model time: {}'.format(now() - model_start))
prompt_df['keyword'] = [keyword[0][0] if len(keyword) else '-none-' for keyword in keywords]

embedding time: 0:06:02.416808
we have 40000 documents and 1825 words.
model time: 0:06:34.917817
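
The timings above give a rough way to choose the sample size. The sketch below assumes cost grows linearly with the number of prompts, which is an assumption rather than something measured here; the embedding time also includes the one-off model download and load, so the per-prompt figure is an upper bound. The helper `estimate_embedding_time` is hypothetical.

```python
from datetime import timedelta

# Figures reported in the outputs above
embedding_time = timedelta(minutes=6, seconds=2.416808)
sampled_prompts = 40_000
unique_prompts = 168_394

# Assumption: cost grows linearly with the number of documents
seconds_per_prompt = embedding_time.total_seconds() / sampled_prompts


def estimate_embedding_time(n_prompts: int) -> timedelta:
    """Linear extrapolation of the embedding time to n_prompts documents."""
    return timedelta(seconds=seconds_per_prompt * n_prompts)


print('per prompt: {:.1f} ms'.format(1000 * seconds_per_prompt))
print('all unique prompts: {}'.format(estimate_embedding_time(unique_prompts)))
```

Under that assumption, embedding every unique prompt would take roughly 25 minutes on the same hardware.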


In [8]:
prompt_df['keyword'].value_counts(normalize=True).head(n=20)

keyword
sociology        0.024500
archaeology      0.017900
anthropology     0.013200
psychology       0.011750
ethics           0.011675
economics        0.011025
ethical          0.010025
quantum          0.009175
geography        0.008875
chemistry        0.008775
cultural         0.008550
climate          0.007950
history          0.007050
astronomy        0.006700
engineering      0.006575
economic         0.006550
governance       0.006100
consciousness    0.005475
innovation       0.005325
linguistics      0.005300
Name: proportion, dtype: float64

In [9]:
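from plotly.express import histogram
# count prompts per keyword, skipping prompts with no keyword, and plot the 40 most common
keyword_counts = prompt_df[prompt_df['keyword'] != '-none-']['keyword'].value_counts().to_frame().reset_index().head(n=40)
histogram(data_frame=keyword_counts, x='keyword', y='count', marginal='box')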

In [10]:
from umap import UMAP

IGNORE = {'-none-', }

umap_start = now()
umap_model = UMAP(n_components=2, random_state=2024, verbose=False, n_jobs=1)
prompt_df[['u0', 'u1']] = umap_model.fit_transform(X=document_embeddings)
print('UMAP time: {}'.format(now() - umap_start))

UMAP time: 0:01:06.203936


In [11]:
prompt_df.head()

Unnamed: 0,prompt,keyword,u0,u1
34551,What are the implications of Digital Marketing...,marketing,10.151413,0.779017
54975,How does Jewish studies interact with the envi...,environment,11.98149,0.938504
100544,What are some common myths and truths in Socio...,religion,8.460294,1.058826
144655,What are the key skills needed to excel in Anc...,egypt,8.250136,2.881632
47362,Detail the impact of Health informatics on sha...,informatics,12.751146,-0.56892


In [12]:
from plotly.express import scatter
top_keywords = prompt_df[prompt_df['keyword'] != '-none-']['keyword'].value_counts().head(n=12).index.tolist()
top_df = prompt_df[prompt_df['keyword'].isin(top_keywords)]
# visible=False hides the whole axis, including tick labels
scatter(data_frame=top_df, x='u0', y='u1', hover_name='prompt', color='keyword').update_xaxes(visible=False).update_yaxes(visible=False)

Some regions of the projection are coherent, with one keyword dominating, and others are incoherent, with several keywords mixed together.

Let's take the prompts that carry the top keywords and join them back to the original data to get their responses.
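
One thing to keep in mind when joining back: we de-duplicated the prompts before sampling, but the original data still contains repeated prompts, so an inner merge on `prompt` can return more than one row per sampled prompt. A minimal sketch on made-up toy frames shows the effect and one possible way to keep a single response per prompt, if that is what the analysis calls for:

```python
import pandas as pd

# Toy frames for illustration only; 'p1' has two responses in the original data
left = pd.DataFrame({'prompt': ['p1', 'p2'], 'keyword': ['sociology', 'ethics']})
right = pd.DataFrame({'prompt': ['p1', 'p1', 'p2', 'p3'],
                      'response': ['r1a', 'r1b', 'r2', 'r3']})

merged = left.merge(right=right, on='prompt', how='inner')
rows_per_prompt = merged.groupby('prompt').size()
print(len(left), len(merged))  # the merge yields more rows than sampled prompts

# optional: keep only the first response for each prompt
one_each = merged.drop_duplicates(subset='prompt', ignore_index=True)
print(len(one_each))
```

Whether to keep every response or only one per prompt is a design choice; keeping all of them gives repeated prompts more weight in any rate computed afterwards.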

In [13]:
top_response_df = top_df.merge(right=df, on='prompt', how='inner').drop(columns=['prompt token count', 'response token count'])
top_response_df.head()

Unnamed: 0,prompt,keyword,u0,u1,response
0,How does Sociology of the family shape the fut...,sociology,11.155166,3.40018,The Sociology of the Family does not directly ...
1,Evaluate the impact of Strategic geography on ...,geography,10.132791,1.905565,"Strategic geography, which refers to the locat..."
2,What are the impacts of Physical geography on ...,geography,16.018892,5.072301,"Physical geography, which refers to the natura..."
3,Investigate the role of Biblical archaeology i...,archaeology,7.453992,3.106962,The role of Biblical archaeology in the future...
4,Discuss the complexities of teaching and learn...,geography,4.489276,-0.952409,Teaching and learning geography can be complex...


Next we reuse the model and vectorizer from above to extract the response keywords.

In [14]:
time_start = now()
RESPONSES = top_response_df['response'].values.tolist()
response_document_embeddings, response_word_embeddings = model.extract_embeddings(docs=RESPONSES, vectorizer=vectorizer, )

response_keyword = model.extract_keywords(docs=RESPONSES, top_n=1, stop_words=STOP_WORDS, vectorizer=vectorizer, doc_embeddings=response_document_embeddings, 
                                          word_embeddings=response_word_embeddings, min_df=MIN_DF, )
top_response_df['response keyword'] = [keyword[0][0] if len(keyword) else '-none-' for keyword in response_keyword]
print('got response keywords in {}'.format(now() - time_start))

got response keywords in 0:07:23.662499


How often do we get the same keywords for both prompt and response?

In [15]:
len(top_response_df[top_response_df['keyword'] == top_response_df['response keyword']])/len(top_response_df)

0.578921568627451

This is about what we would expect given how we think the prompts were built: many of the prompts contain two concepts, and a small minority contain only one. The prompt keyword matches the response keyword about 58% of the time, a little more than half.
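
A natural follow-up is to break the match rate down by prompt keyword. The sketch below uses a made-up toy frame with the same column names as `top_response_df`; the values are illustrative and not results from the dataset.

```python
import pandas as pd

# Toy data for illustration only, using the column names from top_response_df
toy = pd.DataFrame({
    'keyword': ['sociology', 'sociology', 'ethics', 'ethics', 'ethics'],
    'response keyword': ['sociology', 'family', 'ethics', 'ethics', 'insurance'],
})

# True where the prompt keyword and the response keyword agree
toy['match'] = toy['keyword'] == toy['response keyword']

overall = toy['match'].mean()                         # same quantity as the ratio above
by_keyword = toy.groupby('keyword')['match'].mean()   # match rate per prompt keyword
print(overall)
print(by_keyword.sort_values(ascending=False))
```

Applied to the real data, a table like `by_keyword` would show whether some keywords are matched far more reliably than others.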