# Handling Unstructured Data

In [4]:
from unstructured.partition.auto import partition
from unstructured.documents.elements import *
from unstructured.staging.weaviate import create_unstructured_weaviate_class, stage_for_weaviate

import weaviate
from weaviate.util import generate_uuid5

from dotenv import load_dotenv
import os
import json
from datetime import date
import tqdm
load_dotenv()

True

In [2]:
weaviate_url = os.getenv("WEAVIATE_URL")
weaviate_api_key = os.getenv("WEAVIATE_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")

## Weaviate schemas
Schemas are patterns for data. The existing template is a good base, but this layer is ripe for customization.

Allowing user input here can help with searching later on. The recommendation engine can use tags to put useful information in front of me. Auto-ingested data could also use tags to mark reliable sources (e.g. #media).

A blank text box (the upload note; I like the name "reverie") would be useful for experimenting and curating a database.

What follows is the definition of a generalized 'unstructured document' class for Weaviate. Once this is working, the next step is to make one for a specific use case, e.g. abc online articles.

In [35]:
unstructured_class = {
    'class': 'UnstructuredDocument',
    'description': 'General class for all documents (todo: add more specific classes)',
    'properties': [
        {'name': 'text', 'dataType': ['text']},
        {'name': 'category', 'dataType': ['text']},
        {'name': 'filename', 'dataType': ['text']},
        {'name': 'file_directory', 'dataType': ['text']},
        {'name': 'date', 'dataType': ['text']},
        {'name': 'filetype', 'dataType': ['text']},
        {'name': 'attached_to_filename', 'dataType': ['text']},
        {'name': 'page_number', 'dataType': ['int']},
        {'name': 'page_name', 'dataType': ['text']},
        {'name': 'url', 'dataType': ['text']},
        {'name': 'sent_from', 'dataType': ['text']},
        {'name': 'sent_to', 'dataType': ['text']},
        {'name': 'subject', 'dataType': ['text']},
        {'name': 'header_footer_type', 'dataType': ['text']},
        {'name': 'text_as_html', 'dataType': ['text']},
        {'name': 'regex_metadata', 'dataType': ['text']},
        {'name': 'tags', 'dataType': ['text']},
        {'name': 'upload_note', 'dataType': ['text']},
    ],
    'vectorizer': 'text2vec-openai',
    'moduleConfig': {
        'text2vec-openai': {
            'vectorizeClassName': False
        }
    },
}


client = weaviate.Client(
    url=weaviate_url,
    auth_client_secret=weaviate.AuthApiKey(api_key=weaviate_api_key),
    additional_headers={
        "X-OpenAI-Api-Key": openai_api_key,
    }
)


## Schemas vs. classes
It is important to differentiate between schemas and classes. From GPT-4:

>In Weaviate, a schema is a high-level structure that defines the types of data that can be stored in the database. It's like a blueprint for the data. A schema consists of classes and their properties.
>
>A class, on the other hand, is a part of the schema. It represents a concept or type of object in the database. For example, in a schema for a library database, you might have classes like "Book", "Author", and "Publisher". Each class has properties that define the characteristics of the objects of that class. For instance, the "Book" class might have properties like "title", "author", and "publication_date".
>
>Here's an analogy: If you think of the schema as a city plan, then the classes would be the different types of buildings (like houses, apartment buildings, and office buildings), and the properties would be the characteristics of those buildings (like the number of floors, the color, and the year of construction).
>
>In summary:
>- A schema is the overall structure of the data in Weaviate. It defines what kinds of objects (classes) can be stored and what characteristics (properties) those objects can have.
>- A class is a type of object in the schema. It represents a concept and has properties that define the characteristics of the objects of that class.
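
The same nesting shows up in the Python dictionaries used in this notebook. Below is a minimal sketch using the library example from the quote; the `Book` and `Author` classes and the `summarize_schema` helper are made up for illustration and are not part of this project.

```python
# Hypothetical schema for illustration: a schema is a dict with a "classes"
# list, and each class carries its own "properties" list.
library_schema = {
    "classes": [
        {
            "class": "Book",
            "properties": [
                {"name": "title", "dataType": ["text"]},
                {"name": "publication_date", "dataType": ["text"]},
            ],
        },
        {
            "class": "Author",
            "properties": [{"name": "name", "dataType": ["text"]}],
        },
    ]
}


def summarize_schema(schema):
    """Map each class name to the list of its property names."""
    return {
        c["class"]: [p["name"] for p in c.get("properties", [])]
        for c in schema.get("classes", [])
    }


print(summarize_schema(library_schema))
```

The dictionary returned by `client.schema.get()` has the same outer shape, so the helper should also work on a live schema.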

## Modifying the schema
The code blocks below update the schema in Weaviate. They don't need to be run every time.

### Create schema with the defined classes

In [None]:
schema = {"classes": [unstructured_class]}
client.schema.create(schema)
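
Since this cell should not be run every time, one option is to check whether the class already exists before creating it. The sketch below assumes the v3 Python client used in this notebook (`client.schema.get()` and `client.schema.create_class()`); the helper names `class_exists` and `create_class_if_missing` are hypothetical.

```python
def class_exists(schema, class_name):
    """Return True if a class with this name is present in a schema dict."""
    return any(c.get("class") == class_name for c in schema.get("classes", []))


def create_class_if_missing(client, class_def):
    """Create the class only when the server does not already have it.

    Returns True if the class was created, False if it was already there.
    """
    if class_exists(client.schema.get(), class_def["class"]):
        return False
    client.schema.create_class(class_def)
    return True
```

With this in place, `create_class_if_missing(client, unstructured_class)` could be left in the notebook and re-run without recreating an existing class.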

### Print defined schemas

In [None]:
schema = client.schema.get()
print(schema)

### Delete all classes in schema

In [11]:
schema = client.schema.get()

for class_info in schema['classes']:
    class_name = class_info['class']
    client.schema.delete_class(class_name)
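
The loop above removes every class, and deleting a class also removes the objects stored in it. A more targeted variant is sketched below; `delete_classes` is a hypothetical helper built on the same `client.schema.get()` and `client.schema.delete_class()` calls.

```python
def delete_classes(client, class_names):
    """Delete only the named classes that exist; return the names deleted."""
    existing = {c["class"] for c in client.schema.get().get("classes", [])}
    deleted = []
    for name in class_names:
        if name in existing:
            client.schema.delete_class(name)
            deleted.append(name)
    return deleted
```

For example, `delete_classes(client, ["UnstructuredDocument"])` would leave any other classes untouched.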

### Show properties of the first class in the schema

In [None]:
schema = client.schema.get()
schema['classes'][0]['properties']

## Ingest document data using Unstructured
Unstructured can take essentially any file and extract its text data. Testing on documents in `../data/`.

In [5]:
doc_elements = partition("../data/Politics and the English Language - George Orwell.pdf")
data_objects = stage_for_weaviate(doc_elements)

for key in ['filename', 'file_directory', 'filetype', 'page_number', 'text', 'category']:
    print("{0}: {1}".format(key, data_objects[0][key]))

filename: Politics and the English Language - George Orwell.pdf
file_directory: ../data
filetype: application/pdf
page_number: 1
text: Politics and the English Language - George Orwell
category: Title


Not sure how much of the above data actually needs to be stored. Store it all for now, but there's likely more interesting metadata to be added here. Examples might be a sentiment analysis score, GPT-generated summary, external links and internal links, etc. 
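
If it turns out that only some fields are worth storing, one approach is to keep just the keys that the class defines as properties. This is a rough sketch: `select_properties` is a hypothetical helper, and the sample dictionaries are trimmed-down stand-ins for `unstructured_class` and `data_objects[0]`.

```python
def select_properties(data_object, class_def):
    """Keep only keys that the class defines as properties, skipping None values."""
    allowed = {p["name"] for p in class_def["properties"]}
    return {
        key: value
        for key, value in data_object.items()
        if key in allowed and value is not None
    }


# Trimmed-down stand-ins for the class definition and one staged element.
sample_class = {
    "class": "UnstructuredDocument",
    "properties": [
        {"name": "text", "dataType": ["text"]},
        {"name": "category", "dataType": ["text"]},
        {"name": "page_number", "dataType": ["int"]},
    ],
}
sample_object = {
    "text": "Politics and the English Language - George Orwell",
    "category": "Title",
    "page_number": 1,
    "not_in_schema": "dropped",
    "url": None,
}

print(select_properties(sample_object, sample_class))
```

Extra metadata such as a summary or sentiment score would then only need a new entry in the class's `properties` list to be carried through.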

Upload to Weaviate with the uploaded data fitting the defined schema:

In [36]:
upload_note = "hello weaviate"
tags = "test, weaviate, python"

with client.batch(batch_size=10) as batch:
    for d in data_objects:
        properties = {
            'category': d['category'],
            'text': d['text'],
            'filename': d['filename'],
            'page_number': d['page_number'],
            'filetype': d['filetype'],
            'date': date.today().strftime("%Y-%m-%d"),
            'upload_note': upload_note, # testing
            'tags': tags # testing
        }
        batch.add_data_object(
            properties,
            'UnstructuredDocument',
            uuid=generate_uuid5(properties),
        )
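
Because `properties` includes the upload date, note, and tags, the UUID derived from it changes whenever any of those change, so re-uploading the same file on another day would likely create new objects rather than overwrite the old ones. One possible alternative, sketched here with the standard library's `uuid.uuid5` rather than `generate_uuid5`, derives the ID only from fields that identify the content. The `stable_uuid` helper and the choice of fields are assumptions, not something the notebook settles on.

```python
import uuid


def stable_uuid(filename, page_number, text):
    """Derive a deterministic UUID from content-identifying fields only."""
    key = "|".join([str(filename), str(page_number), str(text)])
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))


first = stable_uuid("essay.pdf", 1, "Dying metaphors.")
second = stable_uuid("essay.pdf", 1, "Dying metaphors.")
print(first == second)
```

The result could be passed as `uuid=` to `batch.add_data_object` in place of `generate_uuid5(properties)`.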

## Querying Weaviate
Once the data is in Weaviate, a `nearText` search (still to be researched) can be performed to find objects that are semantically similar to user input:

In [41]:
nearText = {"concepts": ["cliche"]}

response = (
    client.query  # start a new query
    .get("UnstructuredDocument", ["text"])  # get objects of the "UnstructuredDocument" class and retrieve their "text" property
    .with_near_text(nearText)  # find objects that are semantically similar to the text in "nearText"
    .with_limit(2)  # limit the results to the top 2 most similar objects
    .do()  # execute the query
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "UnstructuredDocument": [
                {
                    "text": "Dying metaphors. A newly invented metaphor assists thought by evoking a visual image, while on the other hand a metaphor which is technically \"dead\" (e.g. iron resolution) has in effect reverted to being an ordinary word and can generally be used without loss of vividness. But in between these two classes there is a huge dump of worn-out metaphors which have lost all evocative power and are merely used because they save people the trouble of inventing phrases for themselves. Examples are: Ring the changes on, take up the cudgel for, toe the line, ride roughshod over, stand shoulder to shoulder with, play into the hands of, no axe to grind, grist to the mill, fishing in troubled waters, on the order of the day, Achilles' heel, swan song, hotbed. Many of these are used without knowledge of their meaning (what is a \"rift,\" for instance?), and incompatible metaphors are