<a href="https://colab.research.google.com/github/ksachdeva11/llm/blob/main/KeyLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Keyword Extraction with Mistral 7B**

In [1]:
!pip install --upgrade git+https://github.com/UKPLab/sentence-transformers
!pip install keybert ctransformers[cuda]
!pip install --upgrade git+https://github.com/huggingface/transformers


**Loading the model**

In [2]:
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)


In [3]:
from transformers import AutoTokenizer, pipeline

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)



**Prompt Engineering**

In [5]:
response = generator("What is 5+5?")
print(response[0]["generated_text"])

What is 5+5?
A: 10


In [6]:
prompt = """
I have the following document:
* The classification model downloaded also expects an argument num_labels which is the number of classes in our data.
A linear layer is attached at the end of the bert model to give output equal to the number of classes.

Extract 5 keywords from that document.
"""
response = generator(prompt)
print(response[0]["generated_text"])


I have the following document:
* The classification model downloaded also expects an argument num_labels which is the number of classes in our data. 
A linear layer is attached at the end of the bert model to give output equal to the number of classes.

Extract 5 keywords from that document.

**Answer:**
1. Classification
2. Model
3. Download
4. Argument
5. Linear


In [7]:
example_prompt = """
<s>[INST]
I have the following document:
- The classification model downloaded also expects an argument num_labels which is the number of classes in our data.
A linear layer is attached at the end of the bert model to give output equal to the number of classes.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>"""

In [8]:
keyword_prompt = """
[INST]

I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""

In [9]:
prompt = example_prompt + keyword_prompt
print(prompt)


<s>[INST]
I have the following document:
- The classification model downloaded also expects an argument num_labels which is the number of classes in our data.
A linear layer is attached at the end of the bert model to give output equal to the number of classes.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>
[INST]

I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]



**Keyword Extraction with KeyLLM**

In [10]:
from keybert.llm import TextGeneration
from keybert import KeyLLM

# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)
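
The `[DOCUMENT]` tag in the prompt is a placeholder that is filled with each input document before the generator is called. The sketch below shows that substitution in isolation, assuming a plain string replacement. The helper name `fill_prompt` and the shortened template are hypothetical and only serve as illustration; they are not part of KeyBERT.

```python
def fill_prompt(template: str, document: str, tag: str = "[DOCUMENT]") -> str:
    """Return the template with the placeholder tag replaced by the document."""
    if tag not in template:
        raise ValueError(f"template does not contain the {tag} placeholder")
    return template.replace(tag, document)


# Shortened stand-in template (assumption); the notebook's full prompt behaves the same way.
template = (
    "[INST]\n"
    "I have the following document:\n"
    "- [DOCUMENT]\n\n"
    "Please give me the keywords that are present in this document.\n"
    "[/INST]"
)
filled = fill_prompt(template, "I received my package!")
print(filled)
```

Raising an error when the tag is missing is a design choice of this sketch: it makes a prompt without a placeholder fail loudly instead of sending the same text to the model for every document.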

In [14]:
documents = [
"As discussed above, for the training set, finer-grained instances in the training set are generally better than coarser-grained ones. This preference does not apply to classification time, i.e. the use of the classifier in the field. We should go ahead and predict the sentiment of whatever text we are given, be it a sentence or a chapter.",
"I received my package!",
"You clearly want to know what is being complained about and what is being liked."
]

keywords = kw_model.extract_keywords(documents); keywords

[['discussed',
  'above',
  'finer-grained',
  'instances',
  'training',
  'set',
  'better',
  'coarser-grained',
  'preference',
  'applies',
  'classification',
  'time',
  'field',
  'predict',
  'sentiment',
  'text',
  'sentence',
  'chapter.'],
 ['package',
  'received',
  'delivery',
  'shipment',
  'mail',
  'courier',
  'product',
  'order',
  'online',
  'store'],
 ['complained',
  'liked',
  'want',
  'know',
  'clear',
  'understand',
  'specific',
  'detail',
  'issue',
  'problem',
  'feedback',
  'opinion',
  'satisfaction',
  'enjoyment',
  'appreciation',
  'preference',
  'dislike',
  'dissatisfaction',
  'negative',
  'positive',
  'favorable',
  'unf']]

In [12]:
documents = [
"The website mentions that it only takes a couple of days to deliver but I still have not received mine.",
"I received my package!",
"Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license."
]

keywords = kw_model.extract_keywords(documents); keywords

[['website',
  'delivery',
  'days',
  'receive',
  'mention',
  'take',
  'couple',
  'still',
  "haven't",
  'received'],
 ['package',
  'received',
  'delivery',
  'shipment',
  'mail',
  'courier',
  'product',
  'order',
  'online',
  'store'],
 ['LLaMA',
  'model',
  'weights',
  'release',
  'noncommercial',
  'license',
  'research',
  'community',
  'powerful',
  'LLMs',
  'APIs',
  'limited',
  'accessibility.']]
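
Some of the returned keywords carry trailing punctuation (`'chapter.'`, `'accessibility.'`). A small post-processing step can tidy this up. The sketch below is one possible approach; `clean_keywords` is a hypothetical helper, not a KeyBERT function.

```python
import string


def clean_keywords(keywords: list[str]) -> list[str]:
    """Strip surrounding whitespace and punctuation, then drop
    case-insensitive duplicates while keeping the original order."""
    seen = set()
    cleaned = []
    for kw in keywords:
        kw = kw.strip().strip(string.punctuation).strip()
        key = kw.lower()
        if kw and key not in seen:
            seen.add(key)
            cleaned.append(kw)
    return cleaned


sample = ["LLMs", "APIs", "limited", "accessibility.", "limited"]
print(clean_keywords(sample))
# → ['LLMs', 'APIs', 'limited', 'accessibility']
```

Only punctuation at the ends of a keyword is removed, so hyphenated or contracted terms such as `finer-grained` and `haven't` are left intact.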

**Efficient way to extract keywords**

In [13]:
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings (a separate name avoids overwriting the LLM stored in `model`)
embedding_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = embedding_model.encode(documents, convert_to_tensor=True)


In [None]:
# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.5)
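
Passing `embeddings` together with a `threshold` lets similar documents share keywords, so the LLM only needs to be queried for one document per group of near-duplicates. The sketch below illustrates the idea with a simple greedy grouping on cosine similarity over toy vectors. It is a simplification under stated assumptions, not KeyBERT's actual clustering routine, and `group_by_similarity` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(vectors, threshold=0.5):
    """Greedily assign each vector to the first group whose
    representative (its first member) is at least `threshold` similar."""
    groups = []  # each group is a list of indices; group[0] is the representative
    for idx, vec in enumerate(vectors):
        for group in groups:
            if cosine(vectors[group[0]], vec) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])  # no similar group found, start a new one
    return groups


# Toy 2-D embeddings (assumption); real sentence embeddings have hundreds of dimensions.
toy_embeddings = [
    [1.0, 0.0],  # document 0
    [0.9, 0.1],  # document 1, close to document 0
    [0.0, 1.0],  # document 2, unrelated
]
groups = group_by_similarity(toy_embeddings, threshold=0.5)
print(groups)
# → [[0, 1], [2]]
```

With this grouping, only documents 0 and 2 would need an LLM call, and document 1 would reuse the keywords of document 0. A higher threshold produces smaller groups and therefore more LLM calls.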

In [15]:
keywords

[['discussed',
  'above',
  'finer-grained',
  'instances',
  'training',
  'set',
  'better',
  'coarser-grained',
  'preference',
  'applies',
  'classification',
  'time',
  'field',
  'predict',
  'sentiment',
  'text',
  'sentence',
  'chapter.'],
 ['package',
  'received',
  'delivery',
  'shipment',
  'mail',
  'courier',
  'product',
  'order',
  'online',
  'store'],
 ['complained',
  'liked',
  'want',
  'know',
  'clear',
  'understand',
  'specific',
  'detail',
  'issue',
  'problem',
  'feedback',
  'opinion',
  'satisfaction',
  'enjoyment',
  'appreciation',
  'preference',
  'dislike',
  'dissatisfaction',
  'negative',
  'positive',
  'favorable',
  'unf']]