### **Step 4: Inserting and Analyzing Data with a Vector Database**

#### **Purpose**
In this step, we use a **vector database** to store and query survey data semantically. This allows us to perform concept-based searches and retrieve records based on their meaning, rather than exact matches. The results are then analyzed with GPT-4 to generate insights about specific themes in national surveys.

#### **What is a Vector Database?**
A vector database organizes data as embeddings—numerical representations of text or other data types that capture their meaning. This enables **semantic searches**, where queries retrieve conceptually similar records, even if phrased differently.
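To make the idea of "conceptually similar" concrete, here is a minimal sketch of how similarity between embeddings can be measured with cosine similarity. The three-dimensional vectors and the survey questions are invented for illustration; real embedding models return far longer vectors, and the database may use a different distance metric.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumption: values chosen by hand, not produced by a model)
toy_embeddings = {
    "How many weeks of maternity leave did you take?": [0.9, 0.1, 0.0],
    "Were you absent from work after childbirth?": [0.8, 0.3, 0.1],
    "What is your main mode of transport?": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedding of "Maternity leave"

# Rank the stored questions from most to least similar to the query
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
    reverse=True,
)
for text in ranked:
    score = cosine_similarity(query_vector, toy_embeddings[text])
    print(f"{score:.3f}  {text}")
```

The second question never mentions "maternity leave", yet it ranks close to the first because its vector points in a similar direction.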

#### **What We Are Doing**
1. **Data Preparation**: Clean and format survey data for insertion into the vector database.
2. **Schema Definition**: Define a class (`questions`) to organize the data.
3. **Data Insertion**: Add survey records, each represented by an embedding.
4. **Semantic Search**: Query the database for themes (e.g., "Maternity leave").
5. **Analysis**: Use GPT-4 to summarize and analyze the search results.


### Load and Prepare Data for Weaviate Insertion

   - The Excel file `ALL_PROCESSED_METADATA.xlsx` is loaded into a pandas DataFrame.
   - Problematic float values (`inf`, `-inf`) are replaced with `NaN`.
   - Missing values (`NaN`) are then filled with the placeholder `"N/A"` in every column, including `answers`, because `NaN` is not valid JSON.
   - A `json.dumps` call on the records checks that the cleaned data contains only JSON-serializable values.
   - The cleaned DataFrame is converted into a dictionary (`data_dict`) in a record-oriented format, where each row becomes a dictionary entry.
   - This step ensures that the data is properly formatted and compatible with Weaviate's JSON-based API for seamless insertion.
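The reason for these cleaning steps can be seen on a single record. The sketch below is a pure-Python analogue of the pandas operations, not the notebook's own code; the record and its `weight` field are hypothetical. By default `json.dumps` writes `NaN` and `Infinity` tokens, which are not valid JSON, so the example passes `allow_nan=False` to make the problem visible as a `ValueError`.

```python
import json
import math

def clean_value(value, placeholder="N/A"):
    """Replace NaN and infinite floats with a placeholder; leave other values unchanged."""
    if isinstance(value, float) and not math.isfinite(value):
        return placeholder
    return value

# Hypothetical record with one missing answer and one infinite value
raw_record = {"country": "France", "year": 2019, "answers": float("nan"), "weight": float("inf")}
clean_record = {key: clean_value(value) for key, value in raw_record.items()}

# allow_nan=False makes json.dumps raise ValueError instead of emitting NaN/Infinity tokens
try:
    json.dumps(raw_record, allow_nan=False)
    raw_is_valid = True
except ValueError:
    raw_is_valid = False

clean_json = json.dumps(clean_record, allow_nan=False)
print("Raw record is strict JSON:", raw_is_valid)
print("Cleaned record:", clean_json)
```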

In [None]:
import weaviate
import pandas as pd
import numpy as np
import json

# Load the Excel file into a pandas DataFrame
df = pd.read_excel("Database/ALL_PROCESSED_METADATA.xlsx")

# Replace problematic float values with NaN, then replace NaN with placeholders
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna('N/A', inplace=True)  # Replace NaN with "N/A" globally

# Verify JSON compatibility
try:
    json.dumps(df.to_dict(orient='records'))
except TypeError as e:
    print(f"Data contains non-serializable values: {e}")

# Convert the DataFrame into a dictionary for Weaviate insertion
data_dict = df.to_dict(orient='records')

# View the prepared data
print(data_dict[:5])  # Display first 5 records for verification

### Connect to Weaviate Cluster
   - Connect to the Weaviate cluster by providing its URL and the required authentication details. You must first create the cluster on Weaviate Cloud Services.
   - Use `additional_headers` to pass the OpenAI API key for operations that require embedding generation.
   - Set `OPENAI_API_KEY`, `WEAVIATE_URL`, `WEAVIATE_USERNAME`, and `WEAVIATE_PASSWORD` to your actual cluster and authentication details.

#### Security Note:
- Do not hardcode sensitive credentials (e.g., passwords, API keys) directly in the script. Use a `.env` file or environment variables to securely store and access them.
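One way to catch a missing credential early is to fail as soon as a variable is absent, rather than passing `None` to a client. The helper below is a hypothetical sketch (`require_env` and `DEMO_API_KEY` are names made up for this example, not part of `python-dotenv` or Weaviate).

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demonstration only: in practice the value would come from the shell or a .env file
os.environ["DEMO_API_KEY"] = "example-value"
print(require_env("DEMO_API_KEY"))
```

Calling `require_env("WEAVIATE_URL")` right after `load_dotenv()` would then stop the notebook with a clear message if the `.env` file is incomplete.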

In [None]:
from dotenv import load_dotenv
import os
import weaviate

# Load environment variables
load_dotenv()

# Fetch sensitive details securely
api_key = os.getenv("OPENAI_API_KEY")
weaviate_url = os.getenv("WEAVIATE_URL")
username = os.getenv("WEAVIATE_USERNAME")
password = os.getenv("WEAVIATE_PASSWORD")

# Initialize Weaviate client
client = weaviate.Client(
    url=weaviate_url,
    additional_headers={
        "X-OpenAI-Api-Key": api_key
    },
    auth_client_secret=weaviate.AuthClientPassword(
        username=username,
        password=password
    )
)

### Define and Add Schema to Weaviate
   - The line `client.schema.delete_class("questions")` is commented out but can be used to remove an existing class named `questions` before re-creating it.
   - A schema object (`class_obj`) is created for the `questions` class.
   - The class uses the `text2vec-openai` vectorizer for embedding generation.
   - The `client.schema.create_class` method adds the defined `questions` class to the Weaviate schema.
   - The `client.schema.get("questions")` call retrieves the schema for the `questions` class, which is printed to confirm successful creation.

#### Notes:
- The `text2vec-openai` vectorizer leverages OpenAI embeddings for semantic search capabilities. You can use other vectorizers.
- Ensure the `questions` class does not already exist in the schema unless overwriting is intentional.
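The class definition can also list its properties explicitly instead of leaving the types to be inferred from the first imported record. The sketch below builds such a definition and checks for an existing class without contacting a server. The property names follow the fields queried later in this step; typing every property as `text` is an assumption made here because the cleaning step can place `"N/A"` in any column. Both helper names are hypothetical.

```python
def build_class_definition(class_name, property_names, vectorizer="text2vec-openai"):
    """Build a class definition dict with every property typed as text (an assumption)."""
    return {
        "class": class_name,
        "vectorizer": vectorizer,
        "properties": [{"name": name, "dataType": ["text"]} for name in property_names],
    }

def class_exists(schema, class_name):
    """Case-insensitive check against a schema dict shaped like {'classes': [{'class': ...}]}."""
    existing = {cls["class"].lower() for cls in schema.get("classes", [])}
    return class_name.lower() in existing

fields = ["country", "year", "source", "fullPath", "question", "answers"]
definition = build_class_definition("questions", fields)

# Stand-in for the dict returned by client.schema.get()
sample_schema = {"classes": [{"class": "Questions"}]}
print(class_exists(sample_schema, "questions"))
print(definition["properties"][0])
```

The case-insensitive comparison matters because the stored class name may differ in capitalization from the name used when creating it.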


In [None]:
# Optional: Delete the class "questions" if it already exists (use with caution)
# Uncomment the next line if you intend to delete and recreate the class
# client.schema.delete_class("questions")

# Define the "questions" class schema
class_obj = {
    "class": "questions",
    "vectorizer": "text2vec-openai"  # Use OpenAI embeddings for semantic search
}

# Check if the class already exists in the schema.
# Weaviate stores the class name as "Questions", so compare case-insensitively.
existing_classes = [cls['class'].lower() for cls in client.schema.get().get('classes', [])]
if "questions" not in existing_classes:
    try:
        # Create the class
        client.schema.create_class(class_obj)
        print("Class 'questions' created successfully.")
    except Exception as e:
        print(f"Error creating class 'questions': {e}")
else:
    print("Class 'questions' already exists.")

# Retrieve and print the schema for the "questions" class to verify
try:
    schema = client.schema.get("questions")
    print("Retrieved schema for 'questions':", schema)
except Exception as e:
    print(f"Error retrieving schema: {e}")


### Import Data Objects to Weaviate Individually

1. Iterate through the `data_dict`, where each entry represents a single record.
2. Extract and prepare the properties for the `questions` class, including fields like `Country`, `Year`, `Source`, etc.
3. Use `client.data_object.create` to import each record into Weaviate.

#### Purpose:
- **Error Isolation**: By importing records individually, errors in one line do not halt the entire process, ensuring that as many records as possible are successfully imported.
- **Progress Tracking**: Each record's import status is printed, providing visibility into the process and helping identify problematic entries.

#### Notes:
- Missing or invalid fields are handled by providing default values (e.g., `Unknown` for missing `country` or `year` fields).
- Errors during import are logged with the specific line number and the error message for easier debugging.
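The error-isolation pattern described above can be sketched independently of Weaviate. In this hypothetical example, `import_records` accepts any callable with the same argument order as `client.data_object.create(properties, class_name)`, and `fake_create` stands in for the real client so the loop can be run without a cluster. Collecting failures in a list, rather than only printing them, makes it possible to retry just the failed records afterwards.

```python
def import_records(records, create_fn, class_name="questions"):
    """Call create_fn(record, class_name) per record; return (imported_count, failures)."""
    imported = 0
    failures = []
    for index, record in enumerate(records, start=1):
        try:
            create_fn(record, class_name)
            imported += 1
        except Exception as error:
            # Keep going: one bad record must not stop the remaining imports
            failures.append((index, str(error)))
    return imported, failures

def fake_create(properties, class_name):
    """Stand-in for the real client call; rejects records without question text."""
    if not properties.get("question"):
        raise ValueError("missing question text")

sample_records = [{"question": "Q1"}, {"question": ""}, {"question": "Q3"}]
imported, failures = import_records(sample_records, fake_create)
print(f"Imported {imported} records, {len(failures)} failed: {failures}")
```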

In [None]:
# Import lines individually
for i, d in enumerate(data_dict):
    print(f"Importing line: {i+1}")
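    # Fall back to default values when a field is missing from the record
    properties = {
        "Country": d.get("country", "Unknown"),
        "Year": d.get("year", "Unknown"),
        "Source": d.get("source", "N/A"),
        "FullPath": d.get("full_path", "N/A"),
        "Question": d.get("question", "N/A"),
        "Answers": d.get("answers", "N/A"),
        #"All_text": d["all_text"],
    }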
    try:
        client.data_object.create(properties, "questions")
        print(f"Line {i+1} imported successfully")
    except Exception as e:
        print(f"Error importing line {i+1}: {e}")

### Query and View Data from Weaviate

This section queries the `questions` class in the Weaviate database and displays the results in a user-friendly format.

   - Use the Weaviate client to query the `questions` class.
   - Retrieve fields like `country`, `year`, `source`, `question`, `answers`, and `fullPath` for up to 20 results.
   - Iterate over the query results and reorder the fields into a structured dictionary for better readability.
   - Use the `json.dumps` function to format the output with indentation for easier visualization.

#### Purpose:
- Quickly validate that the data was imported into Weaviate correctly.
- Ensure the fields are accessible and stored as expected in the database.

#### Notes:
- The query is limited to 20 results for quick inspection. Adjust the limit as needed for larger datasets.
- Weaviate capitalizes the first letter of class names, so the class created as `questions` appears under the key `Questions` (uppercase "Q") in the query results.
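Reading the nested response can be wrapped in a small helper that also surfaces query errors instead of failing with a `KeyError`. This is a sketch under the assumption that the response follows the usual GraphQL shape, with results under `data` → `Get` → class name and problems reported under a top-level `errors` key; `extract_objects` and the sample response are hypothetical.

```python
def extract_objects(result, class_name="Questions"):
    """Return the list of objects for class_name from a Get query response dict."""
    if "errors" in result:
        raise RuntimeError(f"Weaviate query failed: {result['errors']}")
    return result.get("data", {}).get("Get", {}).get(class_name) or []

# Hand-written response used only to exercise the helper
sample_result = {
    "data": {
        "Get": {
            "Questions": [
                {"country": "France", "year": "2019", "question": "Did you take parental leave?"}
            ]
        }
    }
}

for obj in extract_objects(sample_result):
    print(obj["country"], obj["year"], obj["question"])
```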

In [None]:
# View the data
import json

# Perform the query
result = (
    client.query
    .get("questions", ["country", "year", "source", "question", "answers", "fullPath"])  
    .with_limit(20)
    .do()
)

# Reorganize the display of results
for obj in result['data']['Get']['Questions']:  # Use 'Questions' with an uppercase 'Q'
    reordered_obj = {
        'year': obj['year'],
        'country': obj['country'],
        'source': obj['source'],
        'fullPath': obj['fullPath'],
        'question': obj['question'],
        'answers': obj['answers']
    }
    print(json.dumps(reordered_obj, indent=4))

### Perform a Semantic Search in Weaviate

This section demonstrates how to perform a semantic search in the `questions` class of the Weaviate database using a specific theme or concept.

   - Specify the theme or concept to search for (e.g., "Maternity leave").
   - Use the `with_near_text` method to search for questions and answers semantically related to the defined concept.
   - Retrieve relevant fields like `country`, `year`, `source`, `question`, `answers`, and `fullPath`.
   - Iterate over the search results and reorder the fields into a structured dictionary for better readability.
   - Format the output using `json.dumps` for clear and indented display.
   - Store the search results in a variable (`search_result`) for potential reuse in further analysis or visualization.

#### Purpose:
- Explore the database for entries semantically related to a specific concept.
- Validate the effectiveness of the semantic search functionality in retrieving meaningful results.

#### Notes:
- The search is limited to 10 results (`with_limit(10)`) for efficiency. Adjust the limit based on your needs.
- Weaviate capitalizes the first letter of class names, so the results are returned under the key `Questions` (uppercase "Q") even though the class was created as `questions`.
- Semantic search uses embeddings to find related entries, making it robust for retrieving conceptually similar data.
- Save the results (`search_result`) for further analysis or documentation purposes.
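Besides the result limit, a `nearText` filter can carry a similarity cut-off so that weak matches are dropped. The sketch below only builds the filter dictionary; it assumes the `distance` key of Weaviate's `nearText` argument, and the threshold `0.25` is an arbitrary value that would need tuning against real results. `build_near_text` is a hypothetical helper.

```python
def build_near_text(concepts, max_distance=None):
    """Build a nearText filter dict, optionally with a maximum distance cut-off."""
    if isinstance(concepts, str):
        concepts = [concepts]
    near_text = {"concepts": list(concepts)}
    if max_distance is not None:
        # Assumption: the server accepts a "distance" key as an upper bound on vector distance
        near_text["distance"] = max_distance
    return near_text

print(build_near_text("Maternity leave"))
print(build_near_text(["Maternity leave", "Parental leave"], max_distance=0.25))
```

The returned dictionary would be passed to `with_near_text` in place of the hand-written `nearText` variable.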

In [None]:
# Define the concept for the search
concept = "Maternity leave"

# Perform the nearText search
nearText = {"concepts": [concept]}

result = (
    client.query
    .get("questions", ["country", "year", "source", "question", "answers", "fullPath"])  
    .with_near_text(nearText)
    .with_limit(10)
    .do()
)

# Reorder the fields manually for display
for obj in result['data']['Get']['Questions']:  # Ensure to use 'Questions' with an uppercase 'Q'
    reordered_obj = {
        'year': obj['year'],
        'country': obj['country'],
        'source': obj['source'],
        'fullPath': obj['fullPath'],
        'question': obj['question'],
        'answers': obj['answers']
    }
    print(json.dumps(reordered_obj, indent=4))

# Save the results for further use
search_result = result

### Pass Semantic Search Results to GPT-4 for Analysis

This section sends the semantic search results to GPT-4 for generating an analytical summary about a specific concept based on national surveys.

   - Use the API key from Step 1 to initialize the OpenAI client.
   - Provide a system prompt describing GPT-4's role as a research assistant for a labor lawyer.
   - Send a user prompt containing:
     - The semantic search results (`search_result`).
     - The concept being analyzed (e.g., "Maternity leave").
     - Instructions to generate a well-structured analytical summary, referencing specific surveys.
   - Retrieve the analytical summary generated by GPT-4.
   - Use `textwrap.fill` to format the summary text for better readability.

#### Purpose:
- Leverage GPT-4's capabilities to analyze and summarize how a specific concept is reflected in national surveys.
- Automatically create structured insights from semantic search results for further use in legal or academic contexts.

#### Notes:
- Ensure the `search_result` variable contains valid data from the semantic search before passing it to GPT-4.
- The code uses the `gpt-4o` model; change the `model` argument if another model better fits your requirements.
- The analytical summary can be saved or processed further as required.
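Passing `str(search_result)` sends the whole nested response, including its wrapper keys, to the model. A possible alternative, sketched here, is to serialize only the list of hits as JSON before building the prompt. `build_user_prompt` is a hypothetical helper, the sample hit is invented, and the prompt wording is adapted from the one used in this step.

```python
import json

def build_user_prompt(hits, concept):
    """Combine the search hits (a list of dicts) and the concept into one prompt string."""
    records = json.dumps(hits, ensure_ascii=False, indent=2)
    return (
        "Look at these records:\n" + records + "\n"
        "They are the result of my semantic search of a vector database with questions "
        "from different national surveys. I searched for the concept " + concept + ". "
        "Based on this search result, write a well-structured analytical summary note about "
        "how the concept of " + concept + " is reflected in national surveys. "
        "Support your statements with references to the content of country surveys from the search."
    )

# Invented hit used only to show the resulting prompt
sample_hits = [{"country": "France", "year": "2019", "question": "Did you take maternity leave?"}]
prompt = build_user_prompt(sample_hits, "Maternity leave")
print(prompt)
```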


In [None]:
# Pass the selection to GPT-4
import textwrap
from openai import OpenAI

# Initialize the OpenAI client with the API key (from Step 1).
# A separate variable name keeps the Weaviate `client` from being overwritten.
openai_client = OpenAI(api_key=api_key)

try:
    response = openai_client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {"role": "system", 
             "content": "You will be a helpful research assistant to a labour lawyer."},
            {"role": "user", 
             "content": (
                 "Look at this text: " + str(search_result) + 
                 " It is the result of my semantic search of a vector database with questions from different national surveys." +
                 " I searched for the concept " + concept + 
                 ". Based on this search result, write a well-structured analytical summary note about how the concept of " + concept +
                 " is reflected in national surveys. Support your statements with references to the content of country surveys from the search."
             )}
        ]
    )
except Exception as e:
    print(f"Error generating summary: {e}")
    response = None

# Extract the summary text and wrap it for display, but only if the request succeeded
if response is not None:
    summary_text = response.choices[0].message.content.strip()
    wrapped_summary = textwrap.fill(summary_text, width=100)
    print(wrapped_summary)