# A cookbook for Named Entity Recognition with LLMs

## 1. Preliminary steps

### 1.1. Import information from the config file

This step reads three values from the `museum` section of the config file and stores them in variables:

* the name of the file
* the column with ids
* the column with texts

In [25]:
import yaml

with open('config.yaml', 'r') as file:
    cfg = yaml.safe_load(file)


my_file = cfg['museum']['file_name']
id_ = cfg['museum']['id_column_name']
txt = cfg['museum']['text_column_name']
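
The cell above expects `config.yaml` to have a `museum` section with three keys. The config file itself is not shown here, so the following is only a sketch of the expected shape: the key names come from the cell above, while the values (`museum.csv`, `id`, `description`) are hypothetical placeholders.

```python
import yaml

# Hypothetical config contents: key names match the cell above,
# values are placeholders and not taken from the real config file.
example_config = """
museum:
  file_name: museum.csv
  id_column_name: id
  text_column_name: description
"""

cfg_example = yaml.safe_load(example_config)

my_file_example = cfg_example['museum']['file_name']
id_example = cfg_example['museum']['id_column_name']
txt_example = cfg_example['museum']['text_column_name']
print(my_file_example, id_example, txt_example)
```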

### 1.2. Open the file and process it

In [26]:
import pandas as pd

df = pd.read_csv(my_file)
records = [{'_id':row[id_],'text': row[txt]} for _, row in df.iterrows()]
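
As an alternative to looping with `iterrows()`, the same list of records can probably be built by selecting and renaming the two columns. This is a minimal sketch on toy data; the column names `id` and `description` are assumptions standing in for `id_` and `txt`.

```python
import pandas as pd

# Toy data standing in for the real CSV; column names are hypothetical.
toy_df = pd.DataFrame({
    'id': ['item-1', 'item-2'],
    'description': ['First text', 'Second text'],
})
id_col, txt_col = 'id', 'description'

# Select the two columns, rename them, and convert each row to a dict.
toy_records = (
    toy_df[[id_col, txt_col]]
    .rename(columns={id_col: '_id', txt_col: 'text'})
    .to_dict('records')
)
print(toy_records)
```

Each element has the same `_id` and `text` keys as the records built above.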

## 2. Try the demo

### 2.1. Load the model

In this section we load the model and its tokenizer. You can try different models by changing their Hugging Face path under the following key in `config.yaml`:

    global:
        model_name:


In [24]:
import accelerate  # needed by device_map="auto" below
import regex as re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = cfg['global']['model_name']
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype=torch.float16, device_map="auto")

### 2.2. Create the prompt
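Here we build the prompt by inserting the text from `cfg['inference']['text']` into the template stored in `cfg['inference']['prompt']`. Take a look at the example in the config file.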


In [None]:
text_item = cfg['inference']['text']
prompt = cfg['inference']['prompt'].format(text=text_item)
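
Because the prompt is filled in with `str.format`, any literal braces in the template (for instance an example dictionary) have to be doubled. The template below is hypothetical, not the one in the config file; it only assumes that the prompt ends with an `Answer:` marker, which the inference cells later split on.

```python
# Hypothetical prompt template; the real one lives in config.yaml.
# Literal braces are doubled so that str.format leaves them alone.
prompt_template = (
    "Extract the entities from the text and return a dictionary "
    "such as {{'PERSON': [], 'LOCATION': []}}.\n"
    "Text: {text}\n"
    "Answer:"
)

example_prompt = prompt_template.format(text="Riccardo Patrese drove an Abarth.")
print(example_prompt)
```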

### 2.3a. Do the inference [NO MAC]
Use this cell to run the inference on any machine that is not a Mac.

In [None]:
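# Move the inputs to the device the model was loaded on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,  # temperature only has an effect when sampling
        temperature=0.3,
    )

# Decode the full sequence (prompt + generated tokens)
decoded_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Keep only the text after the last "Answer:"
model_output = decoded_text.split("Answer:")[-1]
print(model_output)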


### 2.3b. Do the inference [ONLY MAC]

Use this cell to run the inference on a Mac, where the inputs are moved to the `mps` device.

In [34]:
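# Move the inputs to the Apple GPU backend
inputs = tokenizer(prompt, return_tensors="pt").to('mps')

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,  # temperature only has an effect when sampling
        temperature=0.3,
    )

# Decode the full sequence (prompt + generated tokens)
decoded_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Keep only the text after the last "Answer:"
model_output = decoded_text.split("Answer:")[-1]
print(model_output)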



Example output. The model keeps repeating its answer, and the last block is cut off mid-string where generation stopped:

```json
{'PERSON': ['Abarth', 'Riccardo Patrese', 'Michele Alboreto', 'Nicola Larini'], 'LOCATION': ['CSAI', 'Fiat 124 Sport', 'Formula Italia Trophy']}
```
```json
{'PERSON': ['Abarth', 'Riccardo Patrese', 'Michele Alboreto', 'Nicola Larini'], 'LOCATION': ['CSAI', 'Fiat 124 Sport', 'Formula Italia Trophy']}
```
```json
{'PERSON': ['Abarth', 'Riccardo Patrese', 'Michele Alboreto', 'Nicola Larini'], 'LOCATION': ['CS
```
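
The output above uses single-quoted, Python-style dictionaries, so `json.loads` would reject it, and the last copy is incomplete. One possible way to recover a usable dictionary is to take the first complete `{...}` block and parse it with `ast.literal_eval`. The helper name `parse_first_dict` and the shortened sample string are illustrative assumptions, not part of the original notebook.

```python
import ast
import re

def parse_first_dict(model_output):
    """Return the first complete {...} block in the text as a dict, or None."""
    # [^{}]* means nested braces are not supported, which is enough here.
    for match in re.finditer(r"\{[^{}]*\}", model_output):
        try:
            parsed = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError):
            continue  # not a valid Python literal, try the next block
        if isinstance(parsed, dict):
            return parsed
    return None

# Shortened stand-in for the model output: one complete dict, one truncated.
sample_output = (
    "{'PERSON': ['Riccardo Patrese'], 'LOCATION': ['CSAI']}\n"
    "{'PERSON': ['Riccardo Patrese'], 'LOCATION': ['CS"
)
entities = parse_first_dict(sample_output)
print(entities)
```

The truncated second copy has no closing brace, so the pattern never matches it and only the first dictionary is returned.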


## 3. Assignment

1. Experiment with different models and parameters in the config file.
2. Implement a script that extracts entities from all the descriptions (you can also try different entity types, if you want).
3. Save the results in a CSV file with three columns: _id, entity, entity_type (e.g. abarth-c-se-025-formula-italia, Riccardo Patrese, PERSON), and put it in the output folder.
4. Materialize a KG with SPARQL Anything and save it in the output folder.
5. Create a new branch in the GitHub repository with your name or the name of your team, and send an email to Professor Damiano.
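
As a starting point for step 3, the sketch below flattens one record's entity dictionary into `(_id, entity, entity_type)` rows and writes them in CSV format. The helper `entities_to_rows` is hypothetical, and the rows are written to an in-memory buffer because the path of the output folder is not given here.

```python
import csv
import io

def entities_to_rows(record_id, entities):
    """Flatten {'TYPE': [entity, ...]} into (_id, entity, entity_type) rows."""
    return [
        (record_id, entity, entity_type)
        for entity_type, names in entities.items()
        for entity in names
    ]

rows = entities_to_rows(
    'abarth-c-se-025-formula-italia',
    {'PERSON': ['Riccardo Patrese', 'Michele Alboreto']},
)

# Written to an in-memory buffer here; for the assignment, replace it with
# a file opened in the output folder (newline='' and encoding='utf-8').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['_id', 'entity', 'entity_type'])
writer.writerows(rows)
print(buffer.getvalue())
```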