# Analyze a single long document
- [Source](https://python.langchain.com/docs/use_cases/question_answering/analyze_document)
- use a vector DB to retrieve relevant persons based on similarity search
- use their profile page as input
- use BeautifulSoup to retrieve profile data

**Helper decorator timer**

In [33]:
import time

def timer(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        elapsed_time = end_time - start_time
        print(f"{func.__name__} took {elapsed_time:.5f} seconds to execute.")
        return result
    return wrapper
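
A minimal usage sketch for the decorator. This variant adds `functools.wraps` (so the wrapped function keeps its own name and docstring) and uses `time.perf_counter`; both are optional tweaks, not part of the cell above. `slow_square` is a hypothetical example function.

```python
import functools
import time

def timer(func):
    """Print how long each call to func takes, then return its result."""
    @functools.wraps(func)  # keep func.__name__ and docstring on the wrapper
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.5f} seconds to execute.")
        return result
    return wrapper

@timer
def slow_square(x):  # hypothetical example function
    time.sleep(0.01)
    return x * x

value = slow_square(4)  # prints the timing line, then returns the result
```

The decorator passes the return value through unchanged, so it can be put on any function without touching the calling code.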

**Get the profile data**

In [9]:
import re
import requests
from bs4 import BeautifulSoup

In [22]:
url = "https://www.zhaw.ch/de/ueber-uns/person/acke"

In [23]:
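response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
profile = soup.find('div', class_='zhaw-person')
if profile is None:  # find() returns None when the div is missing
    raise ValueError(f"No 'zhaw-person' div found at {url}")
raw_profile = profile.get_text()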

In [24]:
def preprocess_profile(raw_data: str) -> str:
    # Collapse every run of whitespace (newlines, tabs, spaces) into one space
    cleaned_str = re.sub(r'\s+', ' ', raw_data)

    # Remove the "Zurück" ("Back") link text that ends every profile
    # (note: this removes every occurrence of the word, not only the last one)
    cleaned_str = cleaned_str.replace('Zurück', '')

    # Insert a space between an ASCII letter and a directly following digit
    # (the separator got lost when extracting text from the original HTML)
    pattern = r'([a-zA-Z])(\d)'
    cleaned_str = re.sub(pattern, r'\1 \2', cleaned_str)

    return cleaned_str

In [25]:
cleaned_profile = preprocess_profile(raw_profile)
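
A small worked example of the cleaning steps on a made-up string (the sample text is hypothetical, not taken from a real profile). The variant below is a sketch that assumes letters with umlauts should also be separated from a following digit, which the `[a-zA-Z]` pattern above does not cover; it also strips the trailing space left after removing "Zurück".

```python
import re

def preprocess_profile_unicode(raw_data: str) -> str:
    """Variant of preprocess_profile; the Unicode-aware pattern is an assumption."""
    # Collapse runs of whitespace (newlines, tabs, spaces) into single spaces
    cleaned = re.sub(r"\s+", " ", raw_data)
    # Drop the "Zurück" (back) link text
    cleaned = cleaned.replace("Zurück", "")
    # [^\W\d_] matches any Unicode letter (word character that is
    # neither a digit nor an underscore), so "ü2" is split as well
    cleaned = re.sub(r"([^\W\d_])(\d)", r"\1 \2", cleaned)
    return cleaned.strip()

# Hypothetical sample input, for illustration only
sample = "Dr. Muster\n\n  Menü2023\tProjekte15  Zurück"
print(preprocess_profile_unicode(sample))
# prints: Dr. Muster Menü 2023 Projekte 15
```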

**The AnalyzeDocumentChain takes in a single document, splits it up, and then runs it through a CombineDocumentsChain.**
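
To make the "split, then combine" idea concrete, here is a conceptual sketch in plain Python. It is not LangChain's implementation: `split_text` and `analyze_document` are hypothetical helpers, the splitter is a naive character window, and LangChain's own text splitter works differently in detail.

```python
from typing import Callable, List

def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Cut text into overlapping character windows (simplified, hypothetical splitter)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def analyze_document(text: str, combine: Callable[[List[str]], str]) -> str:
    """Split one long document, then hand all chunks to a combine step."""
    chunks = split_text(text)
    return combine(chunks)

# Dummy 100-character document and a stand-in combine step that only counts chunks
text = "".join(str(i % 10) for i in range(100))
chunks = split_text(text)
summary = analyze_document(text, lambda parts: f"{len(parts)} chunks")
```

In the real chain, the combine step is the question-answering chain loaded further down, not a counter.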

In [14]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']

In [15]:
from langchain.llms import OpenAI
from langchain.chains import AnalyzeDocumentChain

llm = OpenAI(temperature=0)

In [16]:
from langchain.chains.question_answering import load_qa_chain

In [46]:
# Map reduce
qa_chain = load_qa_chain(llm, chain_type="map_reduce")  # took 8.94116 seconds to execute

# Refine
# qa_chain = load_qa_chain(llm, chain_type="refine")  # took 10.97927 seconds to execute.
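
How the two chain types differ in control flow, as a rough sketch with a stub in place of the LLM. The prompts and the helper names (`map_reduce`, `refine`, `fake_llm`) are made up for illustration and are not LangChain's actual prompts or API; both functions assume a non-empty list of chunks.

```python
from typing import Callable, List

LLM = Callable[[str], str]

def map_reduce(chunks: List[str], question: str, llm: LLM) -> str:
    # Map: answer the question on every chunk independently
    partials = [llm(f"Context: {c}\nQuestion: {question}") for c in chunks]
    # Reduce: one final call that merges the partial answers
    merged = "\n".join(partials)
    return llm(f"Combine these answers:\n{merged}\nQuestion: {question}")

def refine(chunks: List[str], question: str, llm: LLM) -> str:
    # Start with an answer from the first chunk ...
    answer = llm(f"Context: {chunks[0]}\nQuestion: {question}")
    # ... then update it once per remaining chunk, strictly in order
    for c in chunks[1:]:
        answer = llm(f"Existing answer: {answer}\nNew context: {c}\nQuestion: {question}")
    return answer

calls: List[str] = []

def fake_llm(prompt: str) -> str:
    """Stub LLM that records each prompt instead of calling a model."""
    calls.append(prompt)
    return f"answer#{len(calls)}"

chunks = ["part one", "part two", "part three"]
map_reduce(chunks, "q", fake_llm)
map_reduce_calls = len(calls)  # 3 map calls + 1 reduce call
calls.clear()
refine(chunks, "q", fake_llm)
refine_calls = len(calls)      # 1 initial call + 2 refinement calls
```

The map calls do not depend on each other, whereas every refine call needs the previous answer and therefore has to run sequentially.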

In [47]:
qa_document_chain = AnalyzeDocumentChain(combine_docs_chain=qa_chain)

In [52]:
query = """Would this person be a fitting partner for a collaboration \
on augmented reality and why? please elaborate and use some citations \
from the original document."""

In [53]:
@timer
def run_qa_document_chain(input_document, question):
    return qa_document_chain.run(input_document=input_document, question=question)

In [54]:
run_qa_document_chain(cleaned_profile, query)

run_qa_document_chain took 13.02925 seconds to execute.


' Dr. Philipp Ackermann is a fitting partner for a collaboration on augmented reality due to his expertise in the field. He is the Deputy Head of the Human-Centered Computing research focus at the ZHAW School of Engineering, and is a lecturer in Informatics in the same focus. He has worked on multiple projects related to augmented reality, such as "Mixed Reality im Anlagenbau" and "ARchi VR: Real-Time Collaboration in Augmented Spaces" (Moreno & Ackermann, 2023). He has also published multiple peer-reviewed papers on the topic, such as "AR Patterns: Event-Driven Design Patterns in Creating Augmented Reality Experiences" (Ackermann, 2023). Additionally, his 2015 paper, "Redesign of an Introductory Computer Graphics Course," which was published in the Eurographics Education Papers, demonstrates his expertise in computer graphics, which is a key component of augmented reality. His other publications, such as "PLM-integrated Configurators for Machine and Plant Construction" and "Product Kn

## ToDo

- try German
- complete pipeline from retrieval to answer
- script / OOP
- unit tests