# Using LLMs to parse History and Philosophy of Science (HPS) books

In [None]:
import transformers
import torch
import json
from pathlib import Path
import markdown
from bs4 import BeautifulSoup

In [31]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

Using device: cuda

Tesla V100-SXM2-32GB
Memory Usage:
Allocated: 18.0 GB
Cached:    18.3 GB


### Load data

In [22]:
DATA_DIR = Path("./data/gleick2011")

In [23]:
out = {}
for fname in DATA_DIR.glob("*"):
    with open(fname, 'r', encoding='utf-8') as f:
        md_content = f.read()
        html = markdown.markdown(md_content)
        plain_text = BeautifulSoup(html, 'html.parser').get_text()
        out[fname.stem] = plain_text
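
`Path.glob` does not guarantee any particular order, so the chunks in `out` may not follow the order of the book. A minimal sketch of one way to recover that order, assuming the file stems are page or chunk numbers such as `20` and `21` (the helper name `sorted_chunk_keys` is hypothetical):

```python
def sorted_chunk_keys(chunks):
    """Return chunk keys with numeric stems in integer order, then any other keys alphabetically."""
    def sort_key(key):
        # Numeric stems come first and compare by value, so "9" precedes "10".
        return (0, int(key), "") if key.isdecimal() else (1, 0, key)
    return sorted(chunks, key=sort_key)

# Toy dictionary standing in for `out`; the keys are assumptions for illustration.
example = {"10": "...", "9": "...", "21": "...", "notes": "..."}
print(sorted_chunk_keys(example))
```

Iterating over `sorted_chunk_keys(out)` rather than over `out` directly keeps later results aligned with the reading order.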

In [None]:
print(out['20'])  # keys are file stems, i.e. strings

The Information
horizon. Whether Ong would have seen cyberspace as fundamentally oral or literary, he would surely have recognized it as transformative: not just a revitalization of older forms, not just an amplification, but something wholly new. He might have sensed a coming discontinuity akin to the emergence of literacy itself. Few understood better than Ong just how profound a discontinuity that had been.
When he began his studies, “oral literature” was a common phrase. It is an oxymoron laced with anachronism; the words imply an all-too-unconscious approach to the past by way of the present. Oral literature was generally treated as a variant of writing; this, Ong said, was “rather like thinking of horses as automobiles without wheels.”
You can, of course, undertake to do this. Imagine writing a treatise on horses (for people who have never seen a horse) which starts with the concept not of “horse” but of “automobile,” built on the readers’ direct experience of automobiles. It pro

### Load the model

In [None]:
model = "/gpfs1/llm/llama-3.2-hf/Meta-Llama-3.2-3B-Instruct"

classifier = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto"
)
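
Instruct-tuned models are usually trained on a chat template, and the `text-generation` pipeline also accepts a list of role/content messages instead of a raw string. The sketch below only builds such a list; the helper name `build_messages` and the choice to put the task in the system turn are assumptions, not something this notebook does.

```python
def build_messages(instruction, text):
    """Build chat-style messages: the task in the system turn, the passage in the user turn."""
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": text},
    ]

messages = build_messages("Extract people as JSON.", "Ong studied oral culture.")
print(messages)

# Hypothetical usage with the pipeline defined above (not run here):
# result = classifier(messages, max_new_tokens=400, do_sample=False)
# reply = result[0]["generated_text"][-1]["content"]
```

With chat input the pipeline returns the whole conversation, so the model's answer is the last message rather than a string that starts with the prompt.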

### Parse text

In [54]:
instruction = 'Extract people information of the following text as JSON format. I want name, time, place, and role, if available. Do not justify you answers, put the well formated JSON inside a ```json ``` delimiter.'

In [55]:
prompt = f"""### Instruction:
{instruction}

### Input:
{out['20']}
### Response:
"""

# Greedy decoding: `temperature` has no effect when do_sample=False, so it is omitted.
result = classifier(prompt, max_new_tokens=400, do_sample=False)
print(result[0]["generated_text"])

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


### Instruction:
Extract people information of the following text as JSON format. I want name, time, place, and role, if available. Do not justify you answers, put the well formated JSON inside a ```json ``` delimiter.

### Input:
The Information
horizon. Whether Ong would have seen cyberspace as fundamentally oral or literary, he would surely have recognized it as transformative: not just a revitalization of older forms, not just an amplification, but something wholly new. He might have sensed a coming discontinuity akin to the emergence of literacy itself. Few understood better than Ong just how profound a discontinuity that had been.
When he began his studies, “oral literature” was a common phrase. It is an oxymoron laced with anachronism; the words imply an all-too-unconscious approach to the past by way of the present. Oral literature was generally treated as a variant of writing; this, Ong said, was “rather like thinking of horses as automobiles without wheels.”
You can, of cours
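
As the output above shows, the pipeline echoes the whole prompt before the model's answer. Passing `return_full_text=False` to the pipeline call is one way to get only the completion. When the full text has already been generated, the prompt can also be stripped afterwards; the helper below is a small sketch of that, and its name `completion_only` is hypothetical.

```python
def completion_only(generated_text, prompt):
    """Return the part of the generated text that follows the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt was not echoed verbatim.
    return generated_text

# Toy strings standing in for the real prompt and pipeline output.
toy_prompt = "### Input:\nsome text\n### Response:\n"
toy_output = toy_prompt + '[{"name": "Ong"}]'
print(completion_only(toy_output, toy_prompt))
```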

In [56]:
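import re

def parse_json(text):
    """Print the JSON found in the first fenced block of the model's response."""
    # Look only at the completion: the prompt itself mentions ``` delimiters.
    response = text.partition('### Response:')[2] or text
    # Accept both a ```json and a bare ``` opening fence.
    match = re.search(r'```(?:json)?\s*\n(.*?)\n\s*```', response, re.DOTALL)
    if match is None:
        raise ValueError('No fenced JSON block found in the model output.')
    people = json.loads(match.group(1))
    print(json.dumps(people, indent=4))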

In [57]:
parse_json(result[0]['generated_text'])

[
    {
        "name": "Ong",
        "time": "when he began his studies",
        "place": "not specified",
        "role": "scholar"
    },
    {
        "name": "Jonathan Miller",
        "time": "not specified",
        "place": "not specified",
        "role": "philosopher"
    },
    {
        "name": "Plato",
        "time": "not specified",
        "place": "not specified",
        "role": "philosopher"
    },
    {
        "name": "Socrates",
        "time": "not specified",
        "place": "not specified",
        "role": "philosopher"
    }
]
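
The model fills missing fields with the string "not specified", and nothing in the pipeline checks that an extracted name actually occurs in the passage. A minimal post-processing sketch, assuming the parsed list is available as a Python object (the helper `clean_people` and the `in_source` field are hypothetical additions):

```python
def clean_people(people, source_text, placeholder="not specified"):
    """Replace placeholder values with None and flag names that do not appear in the source text."""
    cleaned = []
    for person in people:
        # Map the model's placeholder string to a proper missing value.
        record = {k: (None if v == placeholder else v) for k, v in person.items()}
        name = record.get("name")
        # Simple substring check; it will miss spelling variants and abbreviations.
        record["in_source"] = bool(name) and name in source_text
        cleaned.append(record)
    return cleaned

# Toy data for illustration only.
toy_people = [
    {"name": "Ong", "time": "not specified", "role": "scholar"},
    {"name": "Someone Else", "time": "1900", "role": "writer"},
]
for row in clean_people(toy_people, "Few understood better than Ong."):
    print(row)
```

Records with `in_source` set to `False` are candidates for manual review rather than proof of an error, since the substring check is deliberately crude.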


In [58]:
prompt = f"""### Instruction:
{instruction}

### Input:
{out['21']}
### Response:
"""

# Greedy decoding: `temperature` has no effect when do_sample=False, so it is omitted.
result = classifier(prompt, max_new_tokens=400, do_sample=False)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [59]:
# print(result[0]["generated_text"])

In [60]:
parse_json(result[0]["generated_text"])

[
    {
        "name": "Plato",
        "time": "Two thousand years ago",
        "place": "Western world",
        "role": "made this observation"
    },
    {
        "name": "McLuhan",
        "time": "End of the nineteenth century",
        "place": "Western world",
        "role": "said"
    },
    {
        "name": "Pliny",
        "time": "End of the nineteenth century",
        "place": "Western world",
        "role": "wrote"
    },
    {
        "name": "Samuel Butler",
        "time": "End of the nineteenth century",
        "place": "Western world",
        "role": "observed"
    },
    {
        "name": "Theseus",
        "time": "Ancient times",
        "place": "Ancient Greece",
        "role": "unwound Ariadne\u2019s thread"
    },
    {
        "name": "Ariadne",
        "time": "Ancient times",
        "place": "Ancient Greece",
        "role": "the subject of Ariadne\u2019s thread"
    }
]
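
The two cells above repeat the same prompt, generate, and parse steps for chunks `20` and `21`. The sketch below wraps those steps in a loop over all chunks and records failures instead of stopping at the first malformed answer. The names `extract_all`, `PROMPT_TEMPLATE`, and `fake_generate` are hypothetical, and the generator is a stand-in so that the example runs without a model.

```python
import json
import re

PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{text}\n### Response:\n"
FENCE = re.compile(r"```(?:json)?\s*\n(.*?)\n\s*```", re.DOTALL)

def extract_all(chunks, instruction, generate):
    """Run generate() on every chunk; return parsed people and error messages, both keyed by chunk."""
    results, failures = {}, {}
    for key, text in chunks.items():
        prompt = PROMPT_TEMPLATE.format(instruction=instruction, text=text)
        completion = generate(prompt)
        match = FENCE.search(completion)
        try:
            if match is None:
                raise ValueError("no fenced JSON block")
            results[key] = json.loads(match.group(1))
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            failures[key] = str(err)
    return results, failures

# Stand-in generator that always returns the same fenced JSON answer.
def fake_generate(prompt):
    return '```json\n[{"name": "Ong"}]\n```'

results, failures = extract_all({"20": "text a", "21": "text b"}, "Extract people.", fake_generate)
print(results)
print(failures)
```

With the real pipeline, `generate` could be a small function that calls `classifier(prompt, max_new_tokens=400, do_sample=False, return_full_text=False)` and returns `[0]["generated_text"]`. Answers cut off by `max_new_tokens` before the closing fence end up in `failures`.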
