# Analyze a single long document
- [Source](https://python.langchain.com/docs/use_cases/question_answering/analyze_document)
- Use a vector DB to retrieve relevant persons via similarity search
- Use each person's profile page as input
- Use BeautifulSoup to extract the profile data
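
The retrieval step in the plan above is not implemented in this notebook yet. As a rough sketch of what "similarity search" means here, the snippet below ranks a few hand-made vectors by cosine similarity to a query vector. The names `toy_index`, `query_embedding`, and `top_k`, as well as the three-dimensional vectors, are made up for illustration; a real setup would presumably use embeddings from a model and a proper vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k names whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings, one vector per person (not real data)
toy_index = {
    "person_a": [0.9, 0.1, 0.0],
    "person_b": [0.1, 0.8, 0.1],
    "person_c": [0.7, 0.3, 0.0],
}
query_embedding = [1.0, 0.0, 0.0]

best_matches = top_k(query_embedding, toy_index, k=2)
print(best_matches)
```

The profile pages of the best matches would then be fetched and analyzed as shown in the rest of this notebook.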

**Get the profile data**

In [9]:
import re
import requests
from bs4 import BeautifulSoup

In [22]:
url = "https://www.zhaw.ch/de/ueber-uns/person/acke"

In [23]:
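response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
profile = soup.find('div', class_='zhaw-person')
if profile is None:
    raise ValueError(f"No 'zhaw-person' element found at {url}")
raw_profile = profile.get_text()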

In [24]:
def preprocess_profile(raw_data: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single space
    cleaned_str = re.sub(r'\s+', ' ', raw_data)

    # Remove the "Zurück" ("Back") link text found at the end of every profile
    cleaned_str = cleaned_str.replace('Zurück', '')

    # Insert a space where a letter is directly followed by a digit
    # (the separator was lost when the HTML was flattened to text)
    pattern = r'([a-zA-Z])(\d)'
    cleaned_str = re.sub(pattern, r'\1 \2', cleaned_str)

    return cleaned_str

In [25]:
cleaned_profile = preprocess_profile(raw_profile)
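
To make the three cleaning steps concrete, here is a small offline example that applies the same substitutions to a made-up string instead of a live profile page. The function is restated in compact form only so the snippet runs on its own; `sample_raw` is invented test data, not text from the actual profile.

```python
import re


def preprocess_profile(raw_data: str) -> str:
    """Same three steps as above, restated so this snippet is self-contained."""
    cleaned_str = re.sub(r'\s+', ' ', raw_data)                   # collapse whitespace
    cleaned_str = cleaned_str.replace('Zurück', '')               # drop the "Back" link text
    cleaned_str = re.sub(r'([a-zA-Z])(\d)', r'\1 \2', cleaned_str)  # letter followed by digit
    return cleaned_str


# Invented input that mimics text flattened from HTML
sample_raw = "Example Person\n\n   Publications2020\t\tConference7  Zurück"

demo_cleaned = preprocess_profile(sample_raw)
print(repr(demo_cleaned))
```

Note that the pattern only covers a letter directly followed by a digit. A digit followed by a letter, or a non-ASCII letter such as `ä` before a digit, is left unchanged.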

**The AnalyzeDocumentChain takes in a single document, splits it up, and then runs it through a CombineDocumentsChain.**

In [14]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']

In [15]:
from langchain.llms import OpenAI
from langchain.chains import AnalyzeDocumentChain

llm = OpenAI(temperature=0)

In [16]:
from langchain.chains.question_answering import load_qa_chain

In [17]:
qa_chain = load_qa_chain(llm, chain_type="map_reduce")

In [18]:
qa_document_chain = AnalyzeDocumentChain(combine_docs_chain=qa_chain)
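
The split-then-combine idea behind the `map_reduce` chain type can be illustrated without any LLM. The sketch below is a toy stand-in, not LangChain's implementation: it splits a text into word chunks, applies a "map" function to each chunk, and "reduces" the partial results into one answer. In the real chain both the map and the reduce step would be LLM calls. The helpers `split_into_chunks`, `map_reduce`, and `count_reality`, and the text in `toy_document`, are hypothetical.

```python
from typing import Callable, List


def split_into_chunks(text: str, words_per_chunk: int) -> List[str]:
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def map_reduce(text: str, words_per_chunk: int,
               map_fn: Callable[[str], int],
               reduce_fn: Callable[[List[int]], int]) -> int:
    """Apply map_fn to every chunk, then combine the partial results."""
    chunks = split_into_chunks(text, words_per_chunk)
    partial_results = [map_fn(chunk) for chunk in chunks]  # "map" step
    return reduce_fn(partial_results)                      # "reduce" step


def count_reality(chunk: str) -> int:
    """Toy stand-in for a per-chunk LLM call: count the word 'reality'."""
    return chunk.split().count("reality")


# Invented text, not taken from the profile
toy_document = (
    "augmented reality conference 2020 virtual reality and computer "
    "graphics visual ehealth mixed reality"
)

toy_chunks = split_into_chunks(toy_document, 4)
toy_total = map_reduce(toy_document, 4, count_reality, sum)
print(len(toy_chunks), toy_total)
```

Because each chunk is processed independently, a long profile can be handled even when it does not fit into a single prompt. The trade-off is that information spread across chunk boundaries may be lost in the map step.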

In [28]:
query = "would this person be a fitting partner for a collaboration on augmented reality and why? please use some citations from the original document."

In [32]:
qa_document_chain.run(input_document=cleaned_profile, question=query)

' Yes, this person would be a suitable candidate for collaboration in the field of Augmented Reality, as they have participated in the 7th International Conference on Augmented Reality, Virtual Reality and Computer Graphics which took place in September 2020. They have also participated in other conferences and publications on topics such as Computer Graphics, Visual eHealth, Product Knowledge Management and Intelligent Building Configuration Model.'

## ToDo

- try German
- complete pipeline from retrieval to answer
- script / OOP
- unit tests