# Using Unstructured to parse documents
The [unstructured library](https://unstructured.io) provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more.

There are two ways to use unstructured:
- Install required components yourself
- Use the Unstructured API

Registration for the API is free; you can obtain a key here: [Obtain Unstructured Key](https://unstructured.io/#get-api-key). If you want to install Unstructured on your local machine, instructions for macOS are in the [LangChain example playbook](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file). For Windows, see the [Unstructured documentation](https://unstructured-io.github.io/unstructured/installation/full_installation.html). We recommend using the remote API for the workshop.


In [None]:
import os

from dotenv import load_dotenv

load_dotenv()

# The components
You start by looking at the individual components that make up the chain. After that, you combine all the components into a complete chain.

The components you have to use are:
- Document loaders: Multiple loaders are available. Later you use the CSV loader, but you start with a loader for the Unstructured library. With this loader you can extract content from a PDF.
- Document transformer: When you load the complete text of a document into a vector store, the text can be too big or have little meaning. Working with chunks of text overcomes that problem. You use a splitter to create chunks of text.
- Text embedding: Vector stores are great for semantic search. You create an embedding vector from the text that can be stored in a vector store.
- Vector stores: Accept documents with embeddings and provide a way to search the available vectors for the results most similar to the query.

## Document loader: load pdf using Unstructured
The code block below shows how to use the API version of Unstructured. If you forget to add the API key, the loader tries to call a local version. If you see a message about missing libraries, you might have forgotten to add the API key to the environment.

If you want to use a locally running version, remove the _api_key_ argument from the call and replace UnstructuredAPIFileLoader with UnstructuredFileLoader. You can find [more information](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) on the LangChain website.
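Because a missing key silently leads to a local call, it can help to check the environment before creating the loader. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of LangChain or Unstructured, and it assumes the key is stored under `UNSTRUCTURED_API_KEY` as in the cell below.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical usage: check for the key before constructing the loader.
missing = missing_env_vars(["UNSTRUCTURED_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

If the variable is reported as missing, check that your `.env` file is in the working directory and that `load_dotenv()` has run.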

In [None]:
from unstructured.cleaners.core import clean_extra_whitespace
from langchain.document_loaders import UnstructuredAPIFileLoader

unstructured_api_key = os.getenv('UNSTRUCTURED_API_KEY')
# TODO: Initialise the loader and add clean_extra_whitespace as a post-processor. The faq.pdf is in the data_sources folder.
loader = None

docs = loader.load()

## Document transformer: create chunks using a splitter
The complete document is too large to match specific queries well, and documents that are too large cannot be passed to an LLM. Therefore, you split the document into smaller chunks of knowledge.
- We suggest a chunk_size of 300 and a chunk_overlap of 100.
- Print the number of chunks after splitting the text of the PDF.

Use the RecursiveCharacterTextSplitter to split the text in the document into chunks. 

Play around with the chunk_size and the chunk_overlap, and notice how the output changes.
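To build some intuition for how chunk_size and chunk_overlap interact, here is a simplified sketch of a sliding-window splitter in plain Python. It is not the algorithm used by RecursiveCharacterTextSplitter, which also tries to split on separators such as paragraphs and sentences; the function name `split_with_overlap` is hypothetical and only illustrates the two parameters.

```python
def split_with_overlap(text, chunk_size=300, chunk_overlap=100):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 1000
print(len(split_with_overlap(sample, chunk_size=300, chunk_overlap=100)))
print(len(split_with_overlap(sample, chunk_size=300, chunk_overlap=0)))
```

A larger overlap means a smaller step between chunks, so the same text produces more chunks that share more content with their neighbours.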

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# TODO: Initialise the splitter
text_splitter = None

chunks = text_splitter.create_documents(texts=[docs[0].page_content], metadatas=[docs[0].metadata])

for chunk in chunks:
    print(f"--\n{chunk.page_content}\n--\n")

## Text embedding
There are multiple ways to generate embeddings; some are free, others are paid. We prefer the OpenAI embeddings. They are not expensive, but when embedding a lot of content you still have to monitor the costs.
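Whatever model you choose, an embedding is a fixed-length list of numbers, and the length of that list is the dimension of the embedding. The sketch below illustrates this with a toy embedding that hashes words into buckets. It is an assumption-laden stand-in for illustration only: `toy_embedding` is a hypothetical function, and it carries none of the semantic quality of the OpenAI or HuggingFace models.

```python
import hashlib
import math


def toy_embedding(text, dimension=8):
    """Map text to a fixed-length unit vector by hashing its words (illustrative only)."""
    vector = [0.0] * dimension
    for word in text.lower().split():
        # Hash each word to a stable bucket index and count it there.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dimension
        vector[index] += 1.0
    # Normalise to unit length; an empty text stays the zero vector.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


vector = toy_embedding("Who should I contact to become a sponsor?")
print(f"Dimension of the toy embedding: {len(vector)}")
```

Every text maps to a vector of the same length regardless of how long the text is, which is what makes the vectors comparable in a vector store.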


### Use OpenAI
In this block you create the embeddings using the OpenAI endpoint. You can choose whether to embed all chunks or just one. You can find more information [here](https://python.langchain.com/docs/integrations/text_embedding/openai).

In [None]:
from langchain.embeddings import OpenAIEmbeddings

openai_embeddings = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPEN_AI_API_KEY'),
    model="text-embedding-ada-002"
)

# TODO: Create the embeddings or vectors for the first chunk
vectors_openai = None

# TODO: Calculate the dimension of embedding with the OpenAI model
print(f"Dimension of the embedding for OpenAI '{None}'")

### Use HuggingFace Hub
In this block you create the embeddings using the HuggingFace Hub endpoint. You can choose whether to embed all chunks or just one. You can find more information [here](https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub).

In [None]:
from langchain.embeddings import HuggingFaceHubEmbeddings

repo_id = "sentence-transformers/all-mpnet-base-v2"

huggingface_embeddings_hub = HuggingFaceHubEmbeddings(
    repo_id=repo_id,
    task="feature-extraction",
    huggingfacehub_api_token=os.getenv('HUGGINGFACEHUB_API_TOKEN')
)

# TODO: Create the embeddings or vectors for the first chunk
vectors_hfh = None

# TODO: Calculate the dimension of embedding with the HuggingFace Hub model
print(f"Dimension of the embedding for HuggingFace Hub '{None}'")

### Use HuggingFace sentence transformers locally
Beware: the model will be downloaded to your local machine. The advantage is that you do not need an API key.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

model = "sentence-transformers/all-mpnet-base-v2"

huggingface_embeddings = HuggingFaceEmbeddings(model_name=model)

# TODO: Create the embeddings or vectors for the first chunk
vectors_hf = None

# TODO: Calculate the dimension of embedding with the HuggingFace model
print(f"Dimension of the embedding for HuggingFace '{None}'")

# Vector Stores: keeping the vectors safe and searchable
LangChain supports many vector stores. We start with _Chroma_, which you can easily run without installing a separate service. Vector stores like Weaviate or OpenSearch offer more functionality and scalability. One notable feature of Weaviate is hybrid search.

## Chroma Vectorstore
If you want, you can change the embedder that is used. 

In [None]:
from langchain.vectorstores import Chroma

chroma_vs = Chroma.from_documents(chunks, openai_embeddings)

## Weaviate vectorstore
You can find more information about the LangChain integration with Weaviate [here](https://python.langchain.com/docs/integrations/vectorstores/weaviate).

In [None]:
import weaviate
from langchain.vectorstores import Weaviate

weaviate_url = os.getenv('WEAVIATE_URL')

auth_config = weaviate.auth.AuthApiKey(
    api_key=os.getenv('WEAVIATE_API_KEY'),
)

weaviate_client = weaviate.Client(
    url=weaviate_url,
    auth_client_secret=auth_config,
    additional_headers={
        "X-OpenAI-Api-Key": os.getenv('OPEN_AI_API_KEY')
    }
)

# TODO: Initialise the Weaviate vector store
weaviate_vs = None

## Implement similarity search for one or both vector stores
Use one or both of the vector stores to find the most similar chunks. Notice that you get back a chunk of text. In the next assignment, you use these chunks to generate an answer using a RAG solution.
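To see what a similarity search does conceptually, here is a minimal sketch that ranks stored vectors by cosine similarity to a query vector. It is not how Chroma or Weaviate are implemented, since real vector stores typically use specialised indexes; the function names, the sample texts, and the three-dimensional vectors are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vector, stored, k=2):
    """Return the k best (score, text) pairs from a list of (text, vector) pairs, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up chunks with hand-written vectors standing in for real embeddings.
stored = [
    ("Sponsoring information", [0.9, 0.1, 0.0]),
    ("Venue and travel", [0.0, 0.2, 0.9]),
    ("Ticket prices", [0.1, 0.9, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]

for score, text in most_similar(query_vector, stored, k=2):
    print(f"{score:.3f}  {text}")
```

The vector store does the same kind of ranking for you: it embeds the query, compares it with the stored chunk vectors, and returns the text of the closest chunks.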

In [None]:
query = "Who should I contact to become a sponsor?"

# TODO: call the vector store and find the chunks that are most similar to the query above.
