In [1]:
import os

#paper_source_directory = '/home/louis/research/pdf_processor/processed_data/superconductivity_processed/physrevb.71.134526'
#paper_source_directory = '/home/louis/research/pdf_processor/processed_data/superconductivity_processed/physrevb.88.144511'
#paper_source_directory = '/home/louis/research/pdf_processor/processed_data/superconductivity_processed/physrevb.31.1329'
paper_source_directory = '/home/louis/research/pdf_processor/processed_data/superconductivity_processed/physrevb.28.1389'
#paper_source_directory = '/home/louis/research/pdf_processor/processed_data/superconductivity_processed/physrevb.85.214519'
#paper_source_directory = '/home/louis/research/pdf_processor/processed_data/superconductivity_processed/physrevb.98.094505'

file_name = 'text.txt'

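# Read the extracted text; the encoding is given explicitly so the result
# does not depend on the platform default.
with open(os.path.join(paper_source_directory, file_name), encoding="utf-8") as f:
    paper_text = f.read()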
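
The cell above switches between papers by commenting paths in and out. As a minimal sketch, assuming every processed paper sits in its own sub-directory containing a `text.txt` file (as the paths above suggest), a helper could load all of them at once. The name `load_paper_texts` and its parameters are hypothetical and not part of the original notebook.

```python
import os


def load_paper_texts(processed_root, file_name="text.txt"):
    """Return {paper_id: text} for each sub-directory of processed_root holding file_name."""
    texts = {}
    for paper_id in sorted(os.listdir(processed_root)):
        path = os.path.join(processed_root, paper_id, file_name)
        # Skip stray files and directories without an extracted text file.
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                texts[paper_id] = f.read()
    return texts
```

With this helper, `load_paper_texts(root)["physrevb.28.1389"]` would take the place of `paper_text`, where `root` is the `superconductivity_processed` directory.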


In [2]:
paper_text[:10000]

"Departement de Physique de la Matiere Condensee, Univerxite de Geneve 24 quai Ernest Ansermet, CH-12ll Geneve 4, Switzerlana Lalr,Si, has a high-temperature form crystallizing in the primitive tetragonal CaBe,Ge type structure (space group P4/nmm; a =4.191 A, c=9.944 A), and a low-temperature form of the body-centered tetragonal ThCr,Siz type (space group I4/mmm; a--4.109 A, c-10.299 A). Both structures are BaAl-type derivatives. The high-temperature modifica tion may be quenched from above the transformation temperature of 1720*C and becomes superconducting at 1.6 K, while the low-temperature modification is normal down to 1 K The structures of both modifications have been refined in single-crystal studies. The excep tional polymorphism of Lalr2Si, is discussed, emphasizing structural relationships and the occurrence of superconductivity in the high-temperature modification.\n Over the last decade, the field of superconductivi ty in ternary compounds has received growing interest. Th

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch


SYS_PROMPT = """You are a condensed matter physicist. You are given the extracted parts of a long document and a question. Read the document and don't make up an answer."""

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the weights in 4-bit NF4 with double quantization to lower GPU memory usage;
# computation still runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config,
)
# Stop generating at either the regular EOS token or Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [126]:
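def format_prompt(prompt, paper_text):
    """Put the question first, followed by the paper text as context."""
    return f"Question: {prompt}\nContext: " + paper_text


def generate(formatted_prompt):
    # Keep only the first 10,000 characters to avoid GPU OOM. The question comes
    # first in the prompt, so only the tail of the paper text is cut off.
    formatted_prompt = formatted_prompt[:10000]
    messages = [
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": formatted_prompt},
    ]
    # Apply the chat template and append the assistant header so that the model
    # starts writing its reply.
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    print("sequence length:", input_ids.shape)
    outputs = model.generate(
        input_ids,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=False,  # greedy decoding; sampling (temperature=0.6, top_p=0.9) is disabled
    )
    # Drop the prompt tokens and decode only the newly generated ones.
    response = outputs[0][input_ids.shape[-1]:]
    return tokenizer.decode(response, skip_special_tokens=True)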
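
Because `generate` cuts the prompt at 10,000 characters, anything later in the paper never reaches the model. One possible workaround, sketched here under the assumption that fixed-size character windows are good enough, is to split the paper text into overlapping chunks and ask the question once per chunk. `chunk_text` and its default sizes are hypothetical and were not used to produce the outputs in this notebook.

```python
def chunk_text(text, chunk_chars=8000, overlap_chars=500):
    """Split text into windows of at most chunk_chars characters.

    Consecutive windows share overlap_chars characters, so a sentence cut at
    one boundary still appears whole in the neighbouring window.
    """
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    if not 0 <= overlap_chars < chunk_chars:
        raise ValueError("overlap_chars must satisfy 0 <= overlap_chars < chunk_chars")
    step = chunk_chars - overlap_chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_chars])
        # Stop once a window has reached the end of the text.
        if start + chunk_chars >= len(text):
            break
    return chunks
```

Each chunk could then be passed through `format_prompt` and `generate` in turn. The default of 8,000 characters is only a guess at leaving room for the question within the 10,000-character limit.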

In [127]:
generate(format_prompt("What is the material studied in this paper? Format the answer as MATERIAL: {Chemical Formula}. If there are multiple materials, separate them with &. Just give a formula and do not provide any explanation. Here are some example outputs: 'MATERIAL: Ga3As4.5 & Al0.6Fe0.4 & TexS1-x & UF6', 'MATERIAL: PrOs4Sb12 & PrOs3Sb13'", paper_text))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


sequence length: torch.Size([1, 2676])


'MATERIAL: LaIrSi & LaIr2Si'
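
The prompt asks for the fixed format `MATERIAL: {Chemical Formula}` with `&` between formulas, so the reply can be turned into a Python list. The following is a minimal sketch that assumes the model follows that format on a single line; `parse_materials` is a hypothetical helper, not part of the original notebook.

```python
import re


def parse_materials(response):
    """Extract the formulas from a reply such as 'MATERIAL: A & B'."""
    match = re.search(r"MATERIAL:\s*(.+)", response)
    if match is None:
        # The model did not follow the requested format.
        return []
    return [part.strip() for part in match.group(1).split("&") if part.strip()]
```

For the reply above, `parse_materials` returns `['LaIrSi', 'LaIr2Si']`. It does not check whether the formulas are chemically sensible.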

In [128]:
generate(format_prompt("What is the critical temperature, which is the temperature at which a material becomes superconducting, of LaIr2Si2? Just give a number and do not provide any explanation. Do not make up an answer. Format the answer as {Number} K.", paper_text))


sequence length: torch.Size([1, 2654])


'1.55 K'
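
The reply is requested in the form `{Number} K`, so it can be converted to a float for later comparison with reference values. The following is a minimal sketch that assumes the number is written in plain decimal notation; `parse_kelvin` is a hypothetical helper, not part of the original notebook.

```python
import re


def parse_kelvin(response):
    """Return the first temperature written as '<number> K', or None if there is none."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*K\b", response)
    return float(match.group(1)) if match else None
```

A return value of `None` marks replies that did not follow the requested format, so they can be reviewed by hand.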

In [83]:
generate(format_prompt("What is the lattice spacing?", paper_text))


sequence length: torch.Size([1, 1699])


'Based on the provided document, the lattice spacing is not explicitly mentioned. The document discusses the onset of superconductivity in ultrathin amorphous Ga films and crystalline Sn films, and the existence of a threshold dependent only on the normal-state sheet resistance. The authors use a technique to build up a continuous film from isolated metallic clusters on the substrate, and they observe that the last film exhibiting finite resistance at low temperature had a normal-state resistance of the order of 7 to 8 kΩ/m, whereas the first superconducting film had a normal-state resistance of 4 to 5 kΩ/m. They also report that the pairs of normal-state resistances for seven sets of films appeared to bracket 6000 kΩ/√C, implying that the details of the film geometry, and in particular intercluster capacitance, were not relevant variables in determining the onset of global superconductivity.\n\nThe document does not mention lattice spacing, which is a characteristic of a crystalline s