## Draft of Text Extraction

### PDF file to String

**To convert a PDF to a string, install:**  

`pip install PyPDF2` 
`# pip install transformers torch`

In [1]:
import PyPDF2

# define a function to get string of the PDF article
def pdf_to_string(filepath):
    text = ""
    with open(filepath, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        
        # Read each page; extract_text() may return nothing for image-only pages
        for page in pdf_reader.pages:
            text += page.extract_text() or ""
    return text

# Usage
pdf_text = pdf_to_string("testFile.pdf")

In [7]:
from transformers import pipeline
import pandas as pd

# Load the model and pipeline (FLAN-T5 is an instruction-tuned text-to-text model, used here for zero-shot Q&A)
qa_pipeline = pipeline("text2text-generation", model="google/flan-t5-small")

# Define the article text
text = '''Your long article text goes here'''

# Define prompts
prompts = [
    "What is the publication date of this paper?",
    "Which journal was this paper published in?",
    "What is the control efficiency mentioned in the paper?",
    "What is the treatment efficiency, and how much improvement did it provide?",
    "What is the VOC (Open Circuit Voltage) value?",
    "How many hours of stability does the paper report?",
    "What are the stability test conditions?",
    "What is the stabilized power output or SPO efficiency?",
    "What is the perovskite composition used in the study?",
    "Is the structure PIN or NIP?",
    "What is the molecule type (e.g., quasi-2D, passivation, etc.)?",
    "What is the concentration of the molecule mentioned?"
]

# Define function to extract each piece of information
def extract_info(prompts, text):
    results = {}
    for prompt in prompts:
        # Query the model with each prompt and the text
        response = qa_pipeline(f"{prompt}\n\n{text}")
        results[prompt] = response[0]['generated_text']
    return results

# Extract information and create a DataFrame
data = extract_info(prompts, text)
df = pd.DataFrame([data])

The model 'BertForQuestionAnswering' is not supported for text2text-generation. Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'MvpForConditionalGeneration', 'PegasusForConditionalGeneration', 'PegasusXForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'T5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration'].
Token indices sequence length is longer than the specified maximum sequence length for this model (14989 > 512). Running this sequence through the model will result in indexing errors


TypeError: The current model class (BertForQuestionAnswering) is not compatible with `.generate()`, as it doesn't have a language model head. Please use one of the following classes instead: {'BertLMHeadModel'}
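
The length warning above (14989 > 512) says the whole article was passed to the model in one piece. One possible workaround, sketched here under assumptions, is to split the text into overlapping windows and query each window separately. The helper name `chunk_words` and the values `max_words=300` and `overlap=30` are hypothetical, and word count is only a rough proxy for the model's token count.

```python
def chunk_words(text, max_words=300, overlap=30):
    """Split text into overlapping word windows.

    Word count is only an approximation of token count, so max_words
    should stay well below the model's token limit (assumed, not measured).
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 1000-word sample, used only to show the window sizes
sample = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(sample, max_words=300, overlap=30)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk could then be passed to `qa_pipeline` in place of the full text, with the per-chunk answers compared afterwards. Whether 300 words fits under 512 tokens depends on the tokenizer and the text, so the limit should be checked against the actual model.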

---
### Attempting to use matextract

**To use `matextract`, an LLM-based extraction workflow, do the following:**  

1\. Clone the repository
  
`git clone https://github.com/lamalab-org/matextract-book.git`

2\. Go into the folder
  
`cd matextract-book`  
  
3\. Install dependencies

`cd package && pip install .`


<font color='red'>***progress so far:** unable to import matextract into the notebook*</font>

---
### Attempting to use OpenAI

In [None]:
import openai
import pandas as pd

# Read the OpenAI API key from the environment; never hard-code a secret key in the notebook
import os
openai.api_key = os.environ["OPENAI_API_KEY"]

# Define the information you want to extract
prompts = [
    "Extract the date this paper was published.",
    "Identify the journal this paper was published in.",
    "Extract the control efficiency in the format: 'Efficiency of the control was on average x%'",
    "Extract the treatment efficiency in the format: 'By using y molecule, their efficiency increased x%'",
    "Extract the VOC (Open Circuit Voltage) value.",
    "Extract the stability information in hours.",
    "Identify the stability test conditions, e.g., temperature, sunlight, LED, or power point conditions.",
    "Extract the stabilized efficiencies, also known as Stabilized Power Output or SPO.",
    "Extract the perovskite composition including elements like Methylammonium, Formamidinium, Cesium, etc.",
    "Identify whether the structure is PIN or NIP.",
    "Extract the molecule type if available (e.g., quasi-2D, 2D/3D former, additive, passivation, passivant).",
    "Identify the concentration of the molecule if mentioned."
]


# Function to make a request to GPT-3.5-turbo to extract each piece of information
def extract_info(prompts, text):
    results = {}
    for prompt in prompts:
        # Make a call to the OpenAI API (ChatCompletion is the pre-1.0 `openai` package interface)
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "system", "content": "You are a data extraction assistant."},
                      {"role": "user", "content": f"{prompt}\n\n{text}"}],
            max_tokens=100
        )
        # Extract and store the response
        results[prompt] = response.choices[0].message['content']
    return results

# Extract information and structure it in a DataFrame
data = extract_info(prompts, pdf_text)
df = pd.DataFrame([data])

# Display the DataFrame
df
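
The loop above sends the full paper once per prompt, so twelve prompts mean twelve requests carrying the same text. A possible alternative, sketched here under assumptions, is to ask for all fields in one request and parse a JSON reply. The names `FIELDS`, `build_prompt`, and `parse_reply` are hypothetical, only four of the fields are shown, and the sample reply is made up for illustration; the model is not guaranteed to return valid JSON, which is why the parser falls back to empty values.

```python
import json

# Short column names mapped to what should be extracted (subset of the prompts above)
FIELDS = {
    "publication_date": "date the paper was published",
    "journal": "journal the paper was published in",
    "voc": "open-circuit voltage (VOC) value",
    "structure": "device structure, PIN or NIP",
}


def build_prompt(fields, text):
    """Build one prompt that requests every field as a JSON object."""
    lines = [f'- "{key}": {desc}' for key, desc in fields.items()]
    return (
        "Return a JSON object with exactly these keys, "
        "using null when the paper does not state a value:\n"
        + "\n".join(lines)
        + "\n\nPaper text:\n"
        + text
    )


def parse_reply(reply, fields):
    """Parse the model reply; unknown keys are dropped, missing ones become None."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {key: data.get(key) for key in fields}


# Made-up reply, used only to show the parsing behaviour
row = parse_reply('{"journal": "Example Journal", "extra": 1}', FIELDS)
print(row)
```

The result of `build_prompt(FIELDS, pdf_text)` would go into the `user` message, and `parse_reply` would be applied to the returned message content. The resulting dict can be passed to `pd.DataFrame([row])` to get short, stable column names instead of full prompts.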

In [None]:
# pip install openai

In [10]:
import openai