## Draft of Text Extraction

### PDF file to String

**To convert a PDF to a string, install PyPDF2:**  

`pip install PyPDF2` 

In [1]:
import PyPDF2

# Return the text of every page of a PDF article as a single string
def pdf_to_string(filepath):
    text = ""
    with open(filepath, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Read each page; extract_text() may yield nothing for image-only pages
        for page in pdf_reader.pages:
            text += (page.extract_text() or "") + "\n"
    return text

# Usage
pdf_text = pdf_to_string("testFile.pdf")

In [2]:
# pip install transformers torch

In [3]:
from transformers import pipeline
import pandas as pd

# Load the model and pipeline (FLAN-T5 is an instruction-tuned text-to-text model, used here for zero-shot Q&A)
qa_pipeline = pipeline("text2text-generation", model="google/flan-t5-small")

# Define the article text
text = '''Your long article text goes here'''

# Define prompts
prompts = [
    "What is the publication date of this paper?",
    "Which journal was this paper published in?",
    "What is the control efficiency mentioned in the paper?",
    "What is the treatment efficiency, and how much improvement did it provide?",
    "What is the VOC (Open Circuit Voltage) value?",
    "How many hours of stability does the paper report?",
    "What are the stability test conditions?",
    "What is the stabilized power output or SPO efficiency?",
    "What is the perovskite composition used in the study?",
    "Is the structure PIN or NIP?",
    "What is the molecule type (e.g., quasi-2D, passivation, etc.)?",
    "What is the concentration of the molecule mentioned?"
]

# Define function to extract each piece of information
def extract_info(prompts, text):
    results = {}
    for prompt in prompts:
        # Query the model with each prompt and the text
        response = qa_pipeline(f"{prompt}\n\n{text}")
        results[prompt] = response[0]['generated_text']
    return results

# Extract information and create a DataFrame
data = extract_info(prompts, text)
df = pd.DataFrame([data])
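
The call above sends the whole article to the model in a single prompt. T5-family models are usually run with inputs of roughly 512 tokens, so a full paper is likely to be truncated. A minimal sketch of one workaround follows, assuming word counts are an acceptable rough proxy for tokens. The names `chunk_words` and `ask_over_chunks` are hypothetical helpers and not part of `transformers`.

```python
def chunk_words(text, max_words=300, overlap=50):
    """Split text into overlapping chunks of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk already reaches the end of the text
        if start + max_words >= len(words):
            break
    return chunks


def ask_over_chunks(prompt, text, ask, **chunk_kwargs):
    """Ask the same prompt against every chunk and return all answers.

    `ask` is any callable that maps a prompt string to an answer string.
    """
    return [ask(f"{prompt}\n\n{chunk}") for chunk in chunk_words(text, **chunk_kwargs)]
```

With the pipeline defined in this notebook, `ask` could be `lambda p: qa_pipeline(p)[0]["generated_text"]`. How to merge the per-chunk answers (first non-empty, majority vote, manual review) is left open here.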

In [4]:
df

| Prompt | Answer (row 0) |
| --- | --- |
| What is the publication date of this paper? | September 13, 2017 |
| Which journal was this paper published in? | journal of medicine |
| What is the control efficiency mentioned in the paper? | The control efficiency mentioned in the paper ... |
| What is the treatment efficiency, and how much improvement did it provide? | The treatment efficiency was a little higher t... |
| What is the VOC (Open Circuit Voltage) value? | VOC (Open Circuit Voltage) is the value of a c... |
| How many hours of stability does the paper report? | The paper reports a total of 365 hours. |
| What are the stability test conditions? | The stability test is a test for the ability t... |
| What is the stabilized power output or SPO efficiency? | The SPO efficiency is the most efficient power... |
| What is the perovskite composition used in the study? | The perovskite composition is used in the study. |
| Is the structure PIN or NIP? | NIP is a symbiotic structure that is a sym |
| What is the molecule type (e.g., quasi-2D, passivation, etc.)? | molecule type (e.g., quasi-2D, passivation, etc. |
| What is the concentration of the molecule mentioned? | Concentration of the molecule mentioned is the... |
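
The result above uses each full prompt as a column name, which makes the table hard to read and to query. One possible tidy-up, sketched under assumptions: map short field keys to the prompts and key the results by field. The keys in `FIELDS` and the helper `extract_fields` are hypothetical names chosen for illustration, and only three of the twelve prompts are shown.

```python
# Hypothetical short keys for a few of the prompts used in this notebook
FIELDS = {
    "publication_date": "What is the publication date of this paper?",
    "journal": "Which journal was this paper published in?",
    "voc": "What is the VOC (Open Circuit Voltage) value?",
}


def extract_fields(fields, text, ask):
    """Return {field_key: answer} for every (key, question) pair.

    `ask` is any callable that maps a prompt string to an answer string,
    so the helper does not depend on a specific model.
    """
    return {key: ask(f"{question}\n\n{text}") for key, question in fields.items()}
```

Passing `ask = lambda p: qa_pipeline(p)[0]["generated_text"]` and wrapping the returned dict as `pd.DataFrame([result])` would give the same one-row layout as before, with short column names.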


### Attempting to use matextract

**To use `matextract`, an LLM-based extraction workflow, do the following:**  

1\. Clone the repository

`git clone https://github.com/lamalab-org/matextract-book.git`

2\. Go into the folder

`cd matextract-book`

3\. Install the dependencies

`cd package && pip install .`

---
***Progress so far:** unable to import matextract into the notebook*

---
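
A common cause of an import failure like the one noted above is that `pip install .` ran in a different Python environment from the one the notebook kernel uses. That has not been confirmed here. The sketch below is a quick diagnostic to run inside the notebook, assuming the installed package's import name is `matextract`; `is_importable` is a hypothetical helper.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the current interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter the kernel runs and whether it can see the package
print("Kernel interpreter:", sys.executable)
print("matextract importable:", is_importable("matextract"))
```

If this prints `False`, running `%pip install .` in a notebook cell from inside the `package` folder installs into the kernel's own environment, which may resolve the mismatch.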