In [None]:
from openai import OpenAI
import json 
import os
import re

In [17]:
prompt = """Help me extract data from the attached paper. Your goal is to locate specific data types and organize them into the specified format. If a particular data key is not present in the text, simply put the value as "not reported".

Note that the data will be formatted into a list of JSON documents. If there is only a single case study and single set of data reported, the list should be of length 1. 
If there are multiple individual datasets or cases from different locations or with different PV technologies, each should be reported as a separate element in the list.

Do not include data from the introduction section or references section of a document.

Here are the data types to extract:

• PV technology (raw text): raw text describing the technology of the PV module under study.
If multiple PV technologies are reported, each case should be represented as a separate dictionary in the final list.
If no information is given, return 'not reported'.

• PV technology: normalized, choose from ['mono-c-Si', 'multi-c-Si', 'a-Si', 'CdTe', 'CIGS', 'thin film', 'HIT', 'other', 'not reported'];
Normalize the fields as follows: treat 'poly-c-Si' and 'mc-Si' as 'multi-c-Si', and 'sc-Si' as 'mono-c-Si'. Normalize other forms of PV tech as needed so it corresponds to one of the elements of the list.

• PV technology detail: details of module tech or cell architecture if reported, e.g., PERC, HJT

• Scope of study: choose from ['cell level', 'module level', 'system level']; defines whether the study analyzes cells, modules, or systems.

• Duration (raw text): reported exposure or operation time (e.g., 'xx months', 'xx days', 'xx hours'). This should copy the raw text indicating duration in the original document.

• Duration in years: convert the above duration into years as a float (e.g., '1.00').

• Start year: starting year of the degradation study, or 'not reported'.

• End year: ending year of the study, or 'not reported'.

• System capacity (raw text): total capacity of the PV modules/system as reported, e.g. "353 kW". This should copy the raw text indicating system capacity in the original document.

• System capacity in Watts: total capacity of the PV modules/system in watts, formatted as a float.

• Number of PV modules in study: integer.

• PV module nominal power (raw text): rated power of a single module. This should copy the raw text indicating PV module power in the original document.

• PV module nominal power in Watts: rated power of a single module, formatted as a float. If the raw text is 'not reported' or nan, make this value also 'not reported'.

• Country: country of the study site. If multiple countries or locations are reported, each case should be represented as a separate dictionary in the final list.

• Location (raw text): raw text of the location, like city or country

• Location: latitude and longitude of the location as a String indicating floats, formatted as 'lat, lon' (e.g., '1.3521, 103.8198'); do not include compass letters (e.g., 'N', 'E').
  If coordinates are not reported, infer them from the city if one is indicated; if no city is reported, use the location of the reported country's capital.
  If multiple cities or locations are reported, each case should be represented as a separate dictionary in the final list.

• PV module name: actual name indicating the make and model of the solar module, or 'not reported'.

• Bifacial: choose from [True, False, or 'not reported'].

• Mounting: choose from ['roof', 'rack', 'tracker', 'not reported'].

• Annual power degradation rate (raw text): raw text from the original paper indicating the annual power degradation rate. If the rate is inferred rather than directly stated in the original paper, indicate 'inferred'.

• Annual power degradation rate in percent: annual power degradation rate in percent per year, as a float in xx.xx format
  - Only use values from studies conducted within the target paper (not from cited references).
  - Show as a negative value if performance decreased.
  - If a range is given (e.g., '0.5–1'), report the mean.

• Location of information: type of location in the paper where the degradation rate was reported, choose one or more from ['text', 'figure', 'table', 'inferred'].

• Confidence level: float (0–1), reflecting how confident you are in the accuracy of the extracted degradation rate.

• Annual degradation rate of other parameters: optional dictionary of other annual degradation rates for parameters other than the power such as open-circuit voltage, short-circuit current, shunt resistance, etc. (e.g., {'Voc': '0.50%', 'Isc': '-0.70%'}).

• Degradation metric: choose one or more from ['Performance Ratio', 'Efficiency', 'Power', 'Indoor I-V', 'Outdoor I-V', 'PVUSA', 'not reported'];
  Performance Ratio = ratio of actual to expected energy output,
  Efficiency = power conversion efficiency under standard conditions,
  Indoor I-V = I-V data measured indoors (e.g., simulator),
  Outdoor I-V = I-V data measured under real outdoor conditions,
  PVUSA = model-based performance from field data

• Analysis method: choose from ['LR', 'YOY', 'CSD', 'HW', 'ARIMA', 'LOESS', 'other', 'not reported'];  
  LR = linear regression (divide value change by time),
  CSD = Classical Seasonal Decomposition,
  HW = Holt–Winters Method,
  ARIMA = AutoRegressive Integrated Moving Average,
  LOESS = Locally Weighted Scatterplot Smoothing,
  YOY = Year-on-year degradation analysis

• Number of measurements for degradation analysis: integer or text (e.g., '1', '2', 'multiple').

• Source of initial power value: choose from ['nameplate', 'field measurement', 'not reported'].

• Other examination techs: list any used techniques (e.g., 'IR', 'EL', 'PL', 'UVF').
IR = Infrared (Thermography), EL = Electroluminescence, PL = Photoluminescence, UVF = Ultraviolet Fluorescence.

• Faults (raw text): raw text of the reported defects/faults related to the degradation

• Faults: list reported defects/faults related to the degradation (e.g., PID, delamination, soiling, shading, browning, corrosion, cracks, etc.) or 'not reported'.

• Grid-connected: whether the system was connected to the grid (True, False, or 'not reported')

• Materials and construction: an array of tags indicating any details specified regarding the bill of materials or construction of the modules. Examples include ["EVA encapsulant", "glass-glass construction", "PET backsheet"]

• Note: add any additional observations, especially regarding 
(i) how the degradation rate was determined or estimated and 
(ii) any notable observations regarding installation, faults, weather events, etc. that explain the observed degradation rate or lack thereof (e.g., frequent cleaning would mitigate expected soiling losses)


Output in JSON list format only, like this:

[
  {
    "PV technology (raw text)": "mono-c-Si",
    "PV technology": "mono-c-Si",
    "PV technology detail": "PERC",
    "scope of study": "cell level",
    "duration (raw text)": "3593 days",
    "duration in years": 9.843,
    "start year": 2001,
    "end year": 2009,
    "system capacity (raw text)": "353 kW",
    "system capacity in Watts": 353000.0,
    "number of PV modules in study": 1000,
    "PV module nominal power (raw text)": "353 W",
    "PV module nominal power in Watts": 353.0,
    "country": "France",
    "location (raw text)": "Paris",
    "location": "1.3521, 103.8198",
    "PV module name": "SunPower SPR-353",
    "bifacial": true,
    "mounting": "rack",
    "Annual power degradation rate (raw text)": "-0.80% per year",
    "Annual power degradation rate in percent": -0.80,
    "location of information": ["text", "figure"],
    "confidence level": 0.9,
    "Annual degradation rate of other parameters": {
      "Voc": "-0.30%", 
      "Isc": "-0.10%", 
      "Vmp": "-0.60%", 
      "Imp": "-0.50%", 
      "FF": "-0.20%"
    },
    "degradation metric": ["Power"],
    "analysis method": "LR",
    "number of measurements for degradation analysis": "multiple",
    "source of initial power value": "nameplate",
    "other examination techs": ["EL", "IR"],
    "faults (raw text)": "xxx",
    "faults": ["PID", "cracks"],
    "grid-connected": true,
    "materials and construction": ["EVA encapsulant", "glass-glass construction"],
    "note": "Degradation rate was calculated using linear regression of normalized power over time, confirmed through EL imaging and I-V curve analysis. Frequent rainfall may have reduced dust accumulation."
  }
]

Do not generate any reasoning or commentary, other than within the note field. Output JSON only.


"""

In [13]:
# --------------------------------------------------
# Collect PDF filenames (without extension)
# --------------------------------------------------
# List all files in the target directory and keep only
# those ending with ".pdf". The file extension is removed
# to simplify downstream indexing and access.

pdf_files = [os.path.splitext(f)[0] for f in os.listdir("./example_files/") if f.endswith('.pdf')]
# Sort by the leading integer in each filename; names without one are placed last.
pdf_files = sorted(pdf_files, key=lambda x: int(re.match(r'\d+', x).group()) if re.match(r'\d+', x) else float('inf'))

# --------------------------------------------------
# Construct the full path to a selected PDF file
# --------------------------------------------------
# Select the first PDF in the sorted list and
# build its relative path for downstream processing
# (e.g., upload to an API or data extraction).


test_file = f"./example_files/{pdf_files[0]}.pdf"
test_file

'./example_files/26-85-2018-C-Performance Analysis of First Grid Connected PV Power Plant in Subtropical Climate of Pakistan.pdf'
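
The sort key used above orders names by their leading integer and places names without one at the end. A small illustration on made-up filenames (the names below are hypothetical, not files from the example directory, and `leading_number` is an illustrative helper):

```python
import re


def leading_number(name):
    """Return the integer prefix of a name, or infinity if there is none."""
    match = re.match(r'\d+', name)
    return int(match.group()) if match else float('inf')


# Hypothetical filenames, without extension.
names = ['26-85-2018-C-paper', '3-12-2015-A-paper', 'notes', '100-1-2020-B-paper']
sorted_names = sorted(names, key=leading_number)
print(sorted_names)
# ['3-12-2015-A-paper', '26-85-2018-C-paper', '100-1-2020-B-paper', 'notes']
```

A plain string sort would instead place '100-...' before '26-...', which is why the numeric prefix is converted to an integer.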

In [14]:
# --------------------------------------------------
# Initialize OpenAI client
# --------------------------------------------------
# Create an OpenAI client instance using the API key
# stored in the environment variable OPENAI_API_KEY.
# This avoids hard-coding sensitive credentials.

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"]
)

In [15]:
# --------------------------------------------------
# Upload a local file to OpenAI
# --------------------------------------------------
# Upload a file (e.g., PDF, text, or data file) that
# will be referenced in the model prompt.
# The file is opened in binary mode ("rb").
# The purpose "user_data" indicates it will be used
# as input context for the model.

# A context manager closes the file handle once the upload finishes.
with open(test_file, "rb") as fh:
    file = client.files.create(
        file=fh,
        purpose="user_data"
    )

In [18]:
# --------------------------------------------------
# Create a model response using file + text input
# --------------------------------------------------
# Send a request to the Responses API, providing:
# 1) A file input (referenced by file_id)
# 2) A text-based question or instruction
#
# The model can jointly reason over both the uploaded
# file content and the user-provided text prompt.

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "file_id": file.id,
                },
                {
                    "type": "input_text",
                    "text": prompt,
                },
            ]
        }
    ]
)
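
The upload and request steps above could be wrapped in one function so that the same call can be repeated for other PDFs. This is a sketch only: `extract_from_pdf` is a hypothetical helper, and it relies on just the two client calls already used in this notebook (`client.files.create` and `client.responses.create`).

```python
def extract_from_pdf(client, pdf_path, prompt, model="gpt-4.1-mini"):
    """Upload one PDF and return the model's raw text output for the prompt.

    Hypothetical helper; `client` is assumed to be an OpenAI client
    like the one created earlier in the notebook.
    """
    # Upload the PDF; the handle is closed as soon as the upload returns.
    with open(pdf_path, "rb") as fh:
        uploaded = client.files.create(file=fh, purpose="user_data")

    # Send the uploaded file together with the extraction prompt.
    response = client.responses.create(
        model=model,
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_file", "file_id": uploaded.id},
                    {"type": "input_text", "text": prompt},
                ],
            }
        ],
    )
    return response.output_text
```

With this helper, a call such as `extract_from_pdf(client, test_file, prompt)` would cover the upload and request cells in a single step.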

In [19]:
# --------------------------------------------------
# Retrieve model output
# --------------------------------------------------
# Extract the aggregated text output generated
# by the model after processing the inputs.

response.output_text

'[\n  {\n    "PV technology (raw text)": "mono c-Si",\n    "PV technology": "mono-c-Si",\n    "PV technology detail": "not reported",\n    "scope of study": "system level",\n    "duration (raw text)": "May-2012 to June-2018",\n    "duration in years": 6.08,\n    "start year": 2012,\n    "end year": 2018,\n    "system capacity (raw text)": "178.08 KW",\n    "system capacity in Watts": 178080.0,\n    "number of PV modules in study": 848,\n    "PV module nominal power (raw text)": "210W",\n    "PV module nominal power in Watts": 210.0,\n    "country": "Pakistan",\n    "location (raw text)": "Islamabad (Latitude:33.6°N, Longitude: 73.1° E, Altitude:1667 ft)",\n    "location": "33.6, 73.1",\n    "PV module name": "not reported",\n    "bifacial": "not reported",\n    "mounting": "rack",\n    "Annual power degradation rate (raw text)": "1.15 %/year",\n    "Annual power degradation rate in percent": -1.15,\n    "location of information": ["text", "table", "figure"],\n    "confidence level": 0.
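
The returned text is expected to be bare JSON, but a model may occasionally wrap it in a Markdown code fence or return a single object instead of a list. A defensive parse could be sketched as follows; `parse_json_list` is a hypothetical helper, not part of the original notebook.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_list(text):
    """Parse model output into a list of dicts, tolerating a Markdown fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        cleaned = "\n".join(lines)

    data = json.loads(cleaned)  # raises json.JSONDecodeError on invalid JSON
    if isinstance(data, dict):
        data = [data]  # a single record is wrapped into a list
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of records")
    return data
```

Text that is cut off before the closing bracket still raises `json.JSONDecodeError`, so a truncated response is not silently accepted.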

In [20]:
# --------------------------------------------------
# Parse model output as JSON
# --------------------------------------------------
# The model response is expected to be a JSON-formatted string.
# Convert it into a list of Python dictionaries (one per case)
# for downstream processing.

json_data = json.loads(response.output_text)
json_data

[{'PV technology (raw text)': 'mono c-Si',
  'PV technology': 'mono-c-Si',
  'PV technology detail': 'not reported',
  'scope of study': 'system level',
  'duration (raw text)': 'May-2012 to June-2018',
  'duration in years': 6.08,
  'start year': 2012,
  'end year': 2018,
  'system capacity (raw text)': '178.08 KW',
  'system capacity in Watts': 178080.0,
  'number of PV modules in study': 848,
  'PV module nominal power (raw text)': '210W',
  'PV module nominal power in Watts': 210.0,
  'country': 'Pakistan',
  'location (raw text)': 'Islamabad (Latitude:33.6°N, Longitude: 73.1° E, Altitude:1667 ft)',
  'location': '33.6, 73.1',
  'PV module name': 'not reported',
  'bifacial': 'not reported',
  'mounting': 'rack',
  'Annual power degradation rate (raw text)': '1.15 %/year',
  'Annual power degradation rate in percent': -1.15,
  'location of information': ['text', 'table', 'figure'],
  'confidence level': 0.8,
  'Annual degradation rate of other parameters': {},
  'degradation metric