### Step 1: Extract Content (Vision AI)
This notebook performs the following:
1. Converts PDF survey forms into individual JPEG images.
2. Uses Generative AI to extract text from each image.
3. Appends extracted text to a Word document.

#### Required Input:
- PDF files stored in the `To_process` folder.

#### Output:
- A `.docx` file containing the extracted text.

### Setup
This code:
1. Loads the OpenAI API key from the `.env` file.
2. Checks whether the API key was loaded and prints the result.
3. Configures Pandas options to enhance dataframe readability.

In [None]:
# SETUP
from dotenv import load_dotenv
import os

# Load environment variables from the .env file
load_dotenv()

# Access the API key
api_key = os.getenv("OPENAI_API_KEY")

# Check if the API key is loaded
if api_key:
    print("API key loaded successfully!")
else:
    print("Failed to load API key. Please check your .env file.")

import pandas as pd
# Set option to display full (non-truncated) dataframe information
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 0)  # Adapt the number as needed for your screen
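
The check above only prints a message, so later cells would still run with a missing key and fail at the API call. A minimal sketch of a stricter variant, assuming you would rather stop early; `require_env` is a hypothetical helper and not part of the original notebook:

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Hypothetical behavior: stop the notebook instead of only printing a warning
        raise RuntimeError(f"{name} is not set. Please check your .env file.")
    return value
```

Calling `api_key = require_env("OPENAI_API_KEY")` after `load_dotenv()` would then halt the run before any image is sent.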

### Convert PDFs to Images
This code finds the first PDF file in the `To_process` directory, converts each page into a JPEG image, and saves the images in the `Images_temp` folder.

In [None]:
import os
from pdf2image import convert_from_path

# Path to the directory containing the PDF that is placed there for processing
directory_path = "To_process"

# List all files in the directory
files = os.listdir(directory_path)

# Find the first PDF file in the directory
pdf_file = next((file for file in files if file.lower().endswith('.pdf')), None)

if pdf_file:
    # Create full path to the PDF file
    pdf_path = os.path.join(directory_path, pdf_file)
    
    # Convert PDF to images
    images = convert_from_path(pdf_path, dpi=200, output_folder=None, fmt='jpeg')
    
    # Make sure the output folder exists, then save each page as a separate JPEG
    os.makedirs("Images_temp", exist_ok=True)
    for i, image in enumerate(images):
        image_path = f"Images_temp/page_{i + 1}.jpeg"
        image.save(image_path, 'JPEG')
        print(f"Saved: {image_path}")
else:
    print("No PDF files found in the directory.")
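
`os.listdir()` returns names in arbitrary order, so "the first PDF" is not predictable when the folder holds more than one file. A minimal sketch of a deterministic listing, assuming alphabetical order is acceptable; `list_pdfs` is a hypothetical helper introduced only for illustration:

```python
import os

def list_pdfs(directory):
    """Return the PDF file names in `directory`, sorted alphabetically (case-insensitive)."""
    return sorted(
        (name for name in os.listdir(directory) if name.lower().endswith(".pdf")),
        key=str.lower,
    )
```

With this helper, `list_pdfs("To_process")[0]` would pick the same file on every run, and looping over the full list would be one way to handle several surveys in turn.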


### Extract Text from Images Using GPT-4o

The next section defines two functions for processing survey images and extracting their text content using GPT-4o:
1. **`encode_image()`**:
   - Converts an image into Base64 format, required by the GPT-4o API.

2. **`append_image_content_to_doc()`**:
   - Sends the encoded image to GPT-4o with the following instructions:
     - Retype all questions and their answer options into separate lines.
     - Mark each question with a "Q" and each answer option with an "A".
     - Keep question numbers at the start.
     - Avoid making up content: if the image is unreadable, the model is instructed to say so and write "error".
   - Appends the extracted text to a Word document.

#### Inputs:
- **Image Path**: Path to the JPEG image file.
- **Document Path**: Path to the Word document where extracted content will be stored.
- **API Key**: OpenAI API key loaded securely from the environment.

#### Outputs:
- Extracted content appended to the Word document.

#### Notes:
- The prompt asks GPT-4o to preserve the structure of survey questions and answers so that the output is easier to use in later analysis. The code does not validate the model's output.
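
The Base64 step can be checked in isolation. The sketch below builds the same `data:image/jpeg;base64,...` URL format that the request payload uses and decodes it back; `to_data_url` is a hypothetical helper, and the three sample bytes are a stand-in for a real image file:

```python
import base64

def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a Base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

sample = b"\xff\xd8\xff"  # the first bytes of a JPEG file, used as stand-in data
url = to_data_url(sample)
print(url)  # data:image/jpeg;base64,/9j/

# Decoding the part after the comma returns the original bytes
header, encoded = url.split(",", 1)
print(base64.b64decode(encoded) == sample)  # True
```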


In [None]:
# Functions that send an individual page image to GPT-4o and append the extracted content to a .docx file in the main folder
import os
import re
import requests
import csv
from PIL import Image
import base64
from docx import Document


# Define function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def append_image_content_to_doc(image_path, doc_path, api_key):
    # Encode the image using the function above
    base64_image = encode_image(image_path)

    headers = {
      "Content-Type": "application/json",
      "Authorization": f"Bearer {api_key}"
    }

    payload = {
      "model": "gpt-4o",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "This is a page from a survey questionnaire. "
                      "Retype all questions and their answer options into separate lines. "
                      "Mark each question with a Q and each answer option with an A. "
                      "Keep question numbers at the start. "
                      "Make sure to not invent content - if you cannot recognize the image, just state it and write 'error'."
            },
            {
              "type": "image_url",
              "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}"
              }
            }
          ]
        }
      ],
      "max_tokens": 1500
    }

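    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload, timeout=120)
    # Stop with a clear HTTP error (e.g. invalid key, rate limit) instead of a KeyError below
    response.raise_for_status()

    # Extract the content from the response
    content = response.json()['choices'][0]['message']['content']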
    print(content)

    # Check if the Word document exists, create a new one if not
    if not os.path.exists(doc_path):
        doc = Document()
    else:
        doc = Document(doc_path)

    # Append the content to the Word document
    doc.add_paragraph(content)

    # Save the updated document
    doc.save(doc_path)
    print(f"Content appended to {doc_path}")
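
The function reads the text from `choices[0].message.content` in the response. A minimal sketch of a more defensive extraction, assuming failed requests return a JSON body with an `error` object that carries a `message` field; `extract_content` is a hypothetical helper, not part of the original notebook:

```python
def extract_content(response_json):
    """Return the message text from a chat completions response dict.

    Raises RuntimeError with a readable message when the response
    reports an error or contains no choices.
    """
    if "error" in response_json:
        # Assumed error shape: {"error": {"message": "..."}}
        message = response_json["error"].get("message", "unknown error")
        raise RuntimeError(f"API request failed: {message}")
    choices = response_json.get("choices") or []
    if not choices:
        raise RuntimeError("API response contained no choices.")
    return choices[0]["message"]["content"]
```

Using `content = extract_content(response.json())` would surface the API's own error text rather than a `KeyError` on `'choices'`.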

### Process Images in Order and Extract Content

The next section ensures that images are processed in the correct numerical order to maintain the sequence of survey pages. Key steps include:
1. Sorting images numerically using a custom function (`sort_key`) that extracts numbers from filenames.
2. Iterating through the sorted list of images in the `Images_temp` directory.
3. For each image:
   - Checking if it is a `.jpeg` file.
   - Sending the image to GPT-4o for text extraction and appending the extracted content to the specified Word document.

#### Notes:
- Ensure the `doc_path` variable is updated to save the extracted content in a file corresponding to the survey's country or dataset.
- The Word document is updated iteratively, with each image's content appended in order.
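
One way to avoid editing `doc_path` by hand is to derive it from the PDF file name. This is only a sketch under the assumption that each PDF is named after its country or dataset; `doc_path_for`, the `Extracted_` prefix, and the sample file name are hypothetical:

```python
import os

def doc_path_for(pdf_file, prefix="Extracted_"):
    """Build a .docx file name from a PDF file name, e.g. for one document per survey."""
    stem = os.path.splitext(os.path.basename(pdf_file))[0]
    return f"{prefix}{stem}.docx"

print(doc_path_for("To_process/Kenya_2019.pdf"))  # Extracted_Kenya_2019.docx
```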

In [None]:
# Custom sort key function to ensure that images get picked in the right order
def sort_key(filename):
    # This regex extracts the numerical part of the filename
    numbers = re.findall(r'\d+', filename)
    return int(numbers[0]) if numbers else 0

directory_path = "Images_temp"
doc_path = "Extracted.docx"  # NOTE: change this for each new survey so the file name corresponds to the country

# Iterate over all files in the directory, sorted by the numerical part of the filename
for file_name in sorted(os.listdir(directory_path), key=sort_key):
    # Check if the file is a JPEG image
    if file_name.endswith(".jpeg"):
        image_path = os.path.join(directory_path, file_name)
        # Append the content of each image to the Word document
        append_image_content_to_doc(image_path, doc_path, api_key)
        print(f"Processed {image_path}")
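
To see why the custom sort key matters, compare plain alphabetical sorting with numerical sorting on a few sample file names. The snippet repeats `sort_key` so that it runs on its own:

```python
import re

def sort_key(filename):
    # Use the first number in the file name as the sort key (0 if there is none)
    numbers = re.findall(r'\d+', filename)
    return int(numbers[0]) if numbers else 0

names = ["page_10.jpeg", "page_2.jpeg", "page_1.jpeg"]

alphabetical = sorted(names)
numerical = sorted(names, key=sort_key)

print(alphabetical)  # ['page_1.jpeg', 'page_10.jpeg', 'page_2.jpeg']
print(numerical)     # ['page_1.jpeg', 'page_2.jpeg', 'page_10.jpeg']
```

Without the key, page 10 would be appended to the Word document before page 2.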


### Conclusion

In this notebook, we successfully:
1. Processed PDF files from the `To_process` folder by converting them into JPEG images.
2. Extracted text content from each image using GPT-4o.
3. Saved the extracted content into a Word document (`.docx`).

#### Inputs:
- PDF files located in the `To_process/` directory.

#### Outputs:
- A Word document (`Extracted.docx`) containing extracted text.
- Temporary images stored in the `Images_temp/` folder.
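
Because the page images stay in `Images_temp/`, a later run on a shorter PDF would leave higher-numbered pages from the previous survey in place, and the processing loop would append them as well. A minimal sketch of a cleanup step, assuming the folder holds only disposable page images; `clear_jpegs` is a hypothetical helper:

```python
import os

def clear_jpegs(directory):
    """Delete all .jpeg files in `directory` and return how many were removed."""
    removed = 0
    for name in os.listdir(directory):
        if name.lower().endswith(".jpeg"):
            os.remove(os.path.join(directory, name))
            removed += 1
    return removed
```

Running `clear_jpegs("Images_temp")` before converting the next PDF would keep pages from different surveys from mixing.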

### Next Steps
The extracted text can now be consolidated and structured in Step 2. Ensure the Word document `Extracted.docx` is available for use in the next stage.