In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
import os
from dotenv import load_dotenv
import json
import re
import torch
import pandas as pd

In [2]:
torch.cuda.empty_cache()

## Loading Model and Prompt

In [3]:
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # bnb_4bit_quant_type="nf4",
    # bnb_4bit_compute_dtype=getattr(torch, "float16"),
    # bnb_4bit_use_double_quant=False
)

In [4]:
# trained model paths; only model_name is loaded below (model_path is currently unused)
model_path = "../models/llama-3.2-3b-it-Perovskite-PaperExtractor"
model_name = "models/DeepSeek-R1-PSC-Extractor-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
tokenizer_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
tokenizer.model_max_length = 60000
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=None,
    top_p=None,
    do_sample=False,
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00,  5.61s/it]


In [5]:
# prompt that specifies what to return
PREFIX = """
"You are a scientific assistant and your task is to extract certain information from text. "
"We are in a scientific environment. You MUST be critical of the units of the variables. "
"Do not leave information behind. "
"Only extract the variables that were developed in this study. You must omit the ones extracted from the bibliography"
Your task is to extract relevant scientific data from the provided text about perovskite solar cells.
    Follow these guidelines:

    1. **If passivating molecules are mentioned:**
    - Create a JSON object for each passivating molecule tested.
    - Include stability test data for each molecule if available. There may be multiple stability tests for a single molecule.

    2. **If no passivating molecules are mentioned:**
    - Provide a JSON object with any other relevant data explicitly mentioned in the text.

    **JSON Structure:**
    Ensure the output adheres to the following structure and is parseable as valid JSON:

    {{
        "perovskite_composition": null, // Chemical formula of the perovskite (string).
        "electron_transport_layer": null, // Material used as the electron transport layer (string).
        "pin_nip_structure": null, // Whether the perovskite uses a PIN or NIP structure (values: "PIN" or "NIP").
        "hole_transport_layer": null, // Material used as the hole transport layer (string).
        "test_1": {{ // Include only if stability tests are mentioned. Use unique keys for each test (e.g., test_1, test_2, etc.).
            "test_name": null, // Must be one of: "ISOS-D", "ISOS-L", "ISOS-T", "ISOS-LC", "ISOS-LT".
            "temperature": null, // Temperature in Celsius (numeric or string, no units or symbols like ° or -).
            "time": null, // Duration of the test (string or numeric).
            "humidity": null, // Humidity level (string or numeric).
            "retained_proportion_cont": null, // Proportion of the PCE retained by the control perovskite after stability test (numeric).
            "retained_proportion_tret": null, // Proportion of the PCE retained by the treated perovskite after stability test (numeric).
            "passivating_molecule": null, // Name of the passivating molecule used in the test.
            "control_pce": null, // Power conversion efficiency for control perovskite (numeric).
            "control_voc": null, // Open-circuit voltage for control perovskite (numeric).
            "treated_pce": null, // Power conversion efficiency for treated perovskite (numeric).
            "treated_voc": null // Open-circuit voltage for treated perovskite (numeric).
        }}
    }}

    **Instructions:**
    - Be concise and accurate. Include only data explicitly present in the text.
    - For stability tests:
    - Infer the test type (e.g., ISOS-D, ISOS-L) based on the description if not explicitly stated.
    - Ensure all numeric values are parseable (e.g., no symbols like ° or -).
    - Use unique keys for each test (e.g., `test_1`, `test_2`, etc.).
    - If a field has no data, set it to `null`.
    - Make sure to only return a JSON object.
    - Make sure not to confuse the retained_proportion_cont/retained_proportion_tret variables with the control_pce/treated_pce variables. 
    The PCE values will almost never be above 30, while the proportion retained values will rarely be below 0.5. The retained proportion will not always be there, please leave these values as null if they cannot be found. DO NOT use the PCE for these values.

    Text to extract from:
"""
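SUFFIX = """\n\n{sample}\n\n"""  # not used below; the paper text is sent as the user message


def create_prompt(system, user):
    """Build a chat-format prompt, truncating the paper text to 23,000 tokens."""
    # Truncation prevents CUDA memory errors with the current GPU.
    # add_special_tokens=False keeps the BOS token out of the decoded text.
    tokens = tokenizer.encode(user, max_length=23000, truncation=True, add_special_tokens=False)
    truncated_user = tokenizer.decode(tokens)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": truncated_user},
    ]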
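PREFIX writes its JSON skeleton with doubled braces (`{{` and `}}`) and SUFFIX carries a `{sample}` placeholder, which is the usual shape of a template meant for `str.format`. Since `create_prompt` receives PREFIX unformatted, the doubled braces reach the model literally. The sketch below shows how such a template would render if it were formatted. `MINI_PREFIX`, `MINI_SUFFIX`, and `render_prompt` are hypothetical stand-ins, not part of the notebook, and whether the fine-tuned model expects single or doubled braces depends on how its training prompts were built, which is not shown here.

```python
# Hypothetical miniature of a brace-escaped template; not the notebook's full PREFIX.
MINI_PREFIX = 'Return JSON shaped like {{"test_1": {{"time": null}}}}.\nText to extract from:'
MINI_SUFFIX = "\n\n{sample}\n\n"


def render_prompt(prefix, suffix, sample):
    """Join prefix and suffix, then substitute the paper text for {sample}.

    str.format collapses each doubled brace to a single one. Braces inside
    the substituted sample are left untouched because only the template
    is parsed for replacement fields.
    """
    return (prefix + suffix).format(sample=sample)


rendered = render_prompt(MINI_PREFIX, MINI_SUFFIX, "PCE rose to 21.3 {see Fig. 2}.")
print(rendered)
```

If the doubled braces are intentional (for example, because the training prompts contained them), the template should be left as it is.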

## Generating Extracted Data

In [6]:
contents = {}
dataset = pd.read_csv('data/rag_filtered_150_papers.csv')

In [7]:
dataset

Unnamed: 0.1,Unnamed: 0,id,full_text,filtered_text
0,0,115,Paper #: 115\nLiquid medium annealing for fabr...,Liquid medium annealing for fabricating durabl...
1,1,112,Paper #: 112\nIodine reduction for reproducibl...,Iodine reduction for reproducible and high-per...
2,2,72,Paper #: 72\nSuppressed Ion Migration in Reduc...,T oday's perovskite solar cells (PSCs) require...
3,3,75,Paper #: 75\nDynamical Evolution of the 2D/3D ...,UV-VIS Absorption The absorption spectra of t...
4,4,40,Paper #: 40\nDamp heat-stable perovskite solar...,Damp heat-stable perovskite solar cells with t...
...,...,...,...,...
144,144,92,Paper #: 92\n\nBanners \n Banners New for 2018...,Electronic Table of Contents (eToC) \n Standa...
145,145,123,Paper #: 123\nConstructing heterojunctions by ...,P erovskite solar cells (PSCs) have reached po...
146,146,31,Paper #: 31\n\nWe performed 1 H-NMR to support...,We performed 1 H-NMR to support our proposed r...
147,147,36,Paper #: 36\nUnderstanding Performance Limitin...,Understanding Performance Limiting Interfacial...


In [None]:
dataset = pd.read_csv('data/rag_filtered_150_papers.csv')
for index, row in dataset.iterrows():
    paper_id = row["id"]
    # feed the filtered paper text into the model pipeline
    text = row["filtered_text"]
    instruction = create_prompt(PREFIX, text)
    json_string = pipe(instruction, max_new_tokens=1024)[0]["generated_text"][-1]['content']
    print(json_string)
    json_match = re.search(r"\{.*\}", json_string, re.DOTALL)
    if json_match:
        raw_json = json_match.group(0).strip()
    else:
        print("No JSON found")
        continue
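    # Try the extracted text as-is first. Only if that fails, retry after escaping a
    # double quote nested inside a string value; applying this repair unconditionally
    # can corrupt valid JSON that has several keys on one line.
    repaired = re.sub(r'":\s*"([^"]*?)(".*?"[^"]*?)"', r'": "\1\\\2"', raw_json)
    for candidate in (raw_json, repaired):
        try:
            contents[paper_id] = json.loads(candidate)
            break
        except json.JSONDecodeError as e:
            last_error = e
    else:
        print("Error creating JSON", last_error)
        print(raw_json)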
        

Alright, so I'm trying to extract information from this scientific text about perovskite solar cells. The user wants me to focus on extracting data related to the passivating molecules used in the devices and their stability tests. They also mentioned not to include data from the bibliography. 

First, I'll read through the provided text to understand the study's focus. It seems the study is about improving the efficiency and stability of perovskite solar cells (PSCs) by using a liquid medium annealing (LMA) process with anisole. They compared this method to the traditional annealing (Ref) process and found that LMA improved both efficiency and stability.

I need to look for sections where passivating molecules are mentioned. The text discusses the use of anisole as the liquid medium in the LMA process. It also mentions the use of phenethylammonium iodide (PEAI) as a passivation layer. There's a table (Table S1) that lists the fitted carrier lifetimes of perovskite films with different

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Alright, I need to extract relevant scientific data from the provided text about perovskite solar cells. The user has provided a detailed set of instructions on how to structure the JSON output. Let me go through the text step by step to identify the necessary information.

First, the text discusses the excitation source for photoluminescence (PL) measurements, which is a 445 nm continuous-wave laser. It mentions that the laser intensity is adjusted to match 1 sun equivalent intensity by illuminating a 1 cm² perovskite solar cell under short-circuit and matching the current density to the 1 sun simulator. This gives me the excitation source details.

Next, the text describes the setup for PL imaging measurements, including an integrating sphere and an Andor SR393i-B spectrometer with a silicon CCD camera. It also notes that the system was calibrated with a halogen lamp and a spectral correction factor was applied. This provides the instrumentation details for the PL measurements.

The 
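The replies printed above open with free-form reasoning before any JSON appears, so a greedy first-brace-to-last-brace match can pick up stray braces from that reasoning. One possible alternative, sketched here with the standard library only, is to drop everything up to a closing reasoning marker and then scan for the first span that actually parses as a JSON object. The `</think>` marker is an assumption about how this model delimits its reasoning, and `extract_first_json_object` and the sample replies are hypothetical.

```python
import json


def extract_first_json_object(reply):
    """Return the first parseable JSON object found in reply, or None."""
    # Assumption: reasoning, if present, ends with "</think>". When the
    # marker is absent, rpartition leaves the whole reply in `answer`.
    _, _, answer = reply.rpartition("</think>")
    decoder = json.JSONDecoder()
    start = answer.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing text such as a closing code fence.
            obj, _ = decoder.raw_decode(answer, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; move on to the next one.
            start = answer.find("{", start + 1)
    return None


# Made-up reply for illustration only
sample_reply = (
    "The text lists devices {control, treated}.</think>\n"
    '{"hole_transport_layer": "example-HTL", "test_1": {"time": 500}}\nDone.'
)
print(extract_first_json_object(sample_reply))
```

This does not recover replies that were cut off by `max_new_tokens` before the JSON was complete, and it does not handle `//` comments inside the JSON; both cases still return `None`.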

In [None]:
# exporting JSON
json_path = "data/deepseek_finetuned.json"
with open(json_path, "w") as file:
    json.dump(contents, file, indent=4)
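One detail to keep in mind when reading the exported file back: `json.dump` writes integer dictionary keys as strings, so the paper ids come back as `"115"` rather than `115`. The round trip below is a minimal sketch using a temporary file; `contents_example` is a made-up stand-in for `contents`.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the notebook's `contents` dict (paper id -> extracted data)
contents_example = {115: {"perovskite_composition": None, "pin_nip_structure": "NIP"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.json")
    with open(path, "w") as file:
        json.dump(contents_example, file, indent=4)
    with open(path) as file:
        reloaded = json.load(file)

# JSON object keys are always strings, so convert them back to integers
restored = {int(paper_id): data for paper_id, data in reloaded.items()}
print(list(reloaded), list(restored))
```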