# NLP PROJECT
Build a RAG for a specific advanced task of your choice.

Libraries installation

In [1]:
%%capture
!pip install chromadb tqdm fireworks-ai python-dotenv pandas
!pip install sentence-transformers

Fireworks API

In [2]:
import fireworks.client
import os
import dotenv
import chromadb
import json
from tqdm.auto import tqdm
import pandas as pd
import random

# you can set envs using Colab secrets
dotenv.load_dotenv()

fireworks.client.api_key = "SnicmsKvl8pCGxKQ723uWzPxUFp0Aun12nImYzHicf6ZZAC4"

Function to get completions from the Fireworks inference platform.

In [3]:
def get_completion(prompt, model=None, max_tokens=50):

    fw_model_dir = "accounts/fireworks/models/"

    if model is None:
        model = fw_model_dir + "llama-v2-7b"
    else:
        model = fw_model_dir + model

    completion = fireworks.client.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0,
    )

    return completion.choices[0].text

Let's first try the function with a simple prompt:

In [4]:
get_completion("Hello, my name is")

' Katie and I am a 20 year old student at the University of Leeds. I am currently studying a BA in English Literature and Creative Writing. I have been working as a tutor for over 3 years now and I'

## RAG Use Case: Generating travel itineraries to the space

For the RAG use case, we will be using [a dataset](https://www.kaggle.com/datasets/anthonytherrien/interstellar-travel-customer-satisfaction-analysis) that provides a comprehensive view of customer experiences in interstellar space travel. It encompasses approximately 500,000 records, each representing an individual space travel experience.

The user will provide their characteristics and preferences for a travel to the space. The input will then taken and compared against the dataset to find experiences of other travelers similar to the given request, and in the end the LLM will generate 5 suggested itineraries based on past travels.

N.B. To run the notebook, you need to have the csv file in the path *"./data/interstellar_travel.csv"* starting from the folder of this notebook.



### Step 1: Load the Dataset

Let's first load the dataset we will use:

In [6]:
'''from google.colab import drive
drive.mount('/content/drive')'''

Mounted at /content/drive


In [8]:
# load dataset from data/ folder to pandas dataframe
# dataset contains column names

interstellar_travel_data = pd.read_csv("./data/interstellar_travel.csv")


In [None]:
#show number of non-null values in the dataset
interstellar_travel_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547568 entries, 0 to 547567
Data columns (total 19 columns):
 #   Column                                 Non-Null Count   Dtype   
---  ------                                 --------------   -----   
 0   Age                                    547568 non-null  int64   
 1   Gender                                 547568 non-null  object  
 2   Occupation                             547568 non-null  object  
 3   Travel Class                           547568 non-null  object  
 4   Destination                            547568 non-null  object  
 5   Star System                            547568 non-null  object  
 6   Distance to Destination (Light-Years)  547568 non-null  float64 
 7   Duration of Stay (Earth Days)          547568 non-null  float64 
 8   Number of Companions                   547568 non-null  int64   
 9   Purpose of Travel                      547568 non-null  object  
 10  Transportation Type                    54756

Convert the DataFrame to a list of dictionaries, one for each record, for easier processing



In [9]:
# convert dataframe to list of dicts
interstellar_travel_data_list = interstellar_travel_data.to_dict(orient="records")

Print some example rows

In [10]:
interstellar_travel_data.head()

Unnamed: 0,Age,Gender,Occupation,Travel Class,Destination,Star System,Distance to Destination (Light-Years),Duration of Stay (Earth Days),Number of Companions,Purpose of Travel,Transportation Type,Price (Galactic Credits),Booking Date,Departure Date,Special Requests,Loyalty Program Member,Month,Customer Satisfaction Score
0,14,Female,Colonist,Business,Gliese 581,Cunningham Mountains,1.09,11.0,5,Tourism,Warp Drive,828.949275,2023-09-17,2025-01-07,Other,No,9,105.0
1,22,Male,Tourist,Economy,Alpha Centauri,Hayes Trace,5.7,23.0,0,Research,Solar Sailing,488.469135,2023-03-31,2025-12-26,Other,No,3,102.0
2,62,Female,Businessperson,Luxury,Alpha Centauri,Anna Port,0.37,4.0,1,Tourism,Ion Thruster,183.745881,2022-05-19,2025-01-04,,Yes,5,100.0
3,21,Female,Colonist,Economy,Lalande 21185,Henry Ville,0.32,23.0,1,Tourism,Warp Drive,358.754,2023-04-13,2024-02-09,,No,4,108.0
4,42,Male,Explorer,Luxury,Exotic Destination 10,Graves Mall,6.17,42.0,1,Colonization,Ion Thruster,3073.75992,2023-06-12,2024-03-15,Special Meal,No,6,97.0


Generating sentence-level embeddings, which are representations of the semantic meaning of sentences in a high-dimensional space:

As in the example notebook, we will be using SentenceTransformer for generating embeddings for our data that we will store to a chroma document store, which allows us to efficiently retrieve them later.


In [14]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

# Initialize a sentence transformer model for generating embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Define a custom embedding function that will be used by chromadb to generate embeddings from documents
class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Encode the input documents using the initialized sentence transformer model
        batch_embeddings = embedding_model.encode(input)
        return batch_embeddings.tolist()

embed_fn = MyEmbeddingFunction()

# Initialize the chromadb directory, and client.
client = chromadb.PersistentClient(path="./chromadb")

# create collection
collection = client.get_or_create_collection(
    name=f"interstellar_travel"
)

# Embedding generation
Performs a query and retrieves documents based on semantic similarity.

Semantic Search:  Use these embeddings to find the most semantically similar
documents to a given query

In [15]:
# Adjusted function to convert each record from our dataset to a textual description
def record_to_text(record):
    # Construct a descriptive text for each record, incorporating all relevant details
    # we are not interested in all records
    description = (
        f"Age: {record['Age']}, "
        f"Gender: {record['Gender']}, "
        f"Occupation: {record['Occupation']}, "
        f"Travel Class: {record['Travel Class']}, "
        f"Destination: {record['Destination']}, "
        f"Star System: {record['Star System']}, "
        f"Distance to Destination: {record['Distance to Destination (Light-Years)']} light-years, "
        f"Duration of Stay: {record['Duration of Stay (Earth Days)']} days, "
        f"Number of Companions: {record['Number of Companions']}, "
        f"Purpose of Travel: {record['Purpose of Travel']}, "
        f"Transportation Type: {record['Transportation Type']}, "
        f"Price: {record['Price (Galactic Credits)']} Galactic Credits, "
        f"Booking Date: {record['Booking Date']}, "
        f"Departure Date: {record['Departure Date']}, "
        #f"Special Requests: {record['Special Requests']}, "
        #f"Loyalty Program Member: {record['Loyalty Program Member']}, "
        f"Customer Satisfaction Score: {record['Customer Satisfaction Score']}."
    )
    return description


# Specify the batch size for processing records
# Each dictionary in 'interstellar_data' corresponds to a record in your dataset
batch_size = 50
n_samples =  1000 # This is the number of rows we are embedding, the higher the more amount of information we have
for i in tqdm(range(0, len(interstellar_travel_data_list[:n_samples]), batch_size)):
    i_end = min(i + batch_size, len(interstellar_travel_data_list[:n_samples]))
    batch = interstellar_travel_data_list[i:i_end]

    # Convert each record in the batch to a text description
    batch_texts = [record_to_text(record) for record in batch]
    # Generate unique IDs for each record, could be adjusted to your needs
    batch_ids = [str(random.randint(1, 10000000)) for _ in batch]

    # Generate embeddings for the batch of text descriptions
    batch_embeddings = embedding_model.encode(batch_texts)

    # Upsert the batch of records into the ChromaDB collection
    collection.upsert(
        ids=batch_ids,
        documents=batch_texts,
        embeddings=batch_embeddings.tolist(),
    )


  0%|          | 0/20 [00:00<?, ?it/s]




Retrieved Documents: The term "documents" refers to the results of this query, which are text strings that the ChromaDB has found to be most relevant to your query



Test the retriever:

In [None]:
collection = client.get_or_create_collection(
    name=f"interstellar_travel",
    embedding_function=embed_fn
)

retriever_results = collection.query(
    query_texts=["Destination: Gliese 581, Star System: Cunningham Mountains, Travel Class: Business, Duration of Stay: 15 days, Transportation Type: Warp Drive, Number of Companions: 2"],
    n_results=2,
)

print(retriever_results["documents"])

[['Age: 14, Gender: Female, Occupation: Colonist, Travel Class: Business, Destination: Gliese 581, Star System: Cunningham Mountains, Distance to Destination: 1.09 light-years, Duration of Stay: 11.0 days, Number of Companions: 5, Purpose of Travel: Tourism, Transportation Type: Warp Drive, Price: 828.949275 Galactic Credits, Booking Date: 2023-09-17, Departure Date: 2025-01-07, Special Requests: Other, Loyalty Program Member: No, Customer Satisfaction Score: 105.0.', 'Age: 14, Gender: Female, Occupation: Colonist, Travel Class: Business, Destination: Gliese 581, Star System: Cunningham Mountains, Distance to Destination: 1.09 light-years, Duration of Stay: 11.0 days, Number of Companions: 5, Purpose of Travel: Tourism, Transportation Type: Warp Drive, Price: 828.949275 Galactic Credits, Booking Date: 2023-09-17, Departure Date: 2025-01-07, Special Requests: Other, Loyalty Program Member: No, Customer Satisfaction Score: 0.8791540785498491.']]


A series of query to try with different details

This method performs a semantic search query on the "interstellar_travel" collection. It uses the embeddings generated by the embedding_function to find documents that are semantically similar to the provided query texts. This operation is crucial for applications that require semantic understanding and retrieval of documents, us our own.

In [None]:
#QUERY TO TRY

#Combining Demographic and Trip Details
query_1 = "Gender: Male, Age: 30, Occupation: Scientist, Travel Class: Economy, Destination: Proxima Centauri, Purpose of Travel: Research, Transportation Type: Hyperloop, Loyalty Program Member: Yes"
#Focused on Destination and Travel Preferences
query_2 = "Destination: Gliese 581, Star System: Cunningham Mountains, Travel Class: Business, Duration of Stay: 15 days, Transportation Type: Warp Drive, Number of Companions: 2"
#Leisure Travel with Family
query_3 = "Age: 45, Gender: Female, Occupation: Engineer, Purpose of Travel: Tourism, Destination: Alpha Centauri, Travel Class: First Class, Number of Companions: 4, Duration of Stay: 30 days, Special Requests: Child-Friendly Amenities, Loyalty Program Member: No"
#Adventure Seeker Profile
query_4 = "Occupation: Freelancer, Age: 25, Gender: Non-Binary, Purpose of Travel: Adventure, Destination: Europa, Travel Class: Economy, Special Requests: Extreme Sports Package, Duration of Stay: 10 days"
#Scientific Expedition:
query_5 = "Occupation: Researcher, Age: 40, Gender: Female, Purpose of Travel: Scientific Expedition, Destination: Mars, Travel Class: Economy, Duration of Stay: 60 days, Special Requests: Equipment Transport, Loyalty Program Member: Yes"

In [None]:
# user query
user_query = query_2

# Retrieving 20 results from the dataset given the user_query
results = collection.query(
    query_texts=[user_query],
    n_results=20,
)

#Since we can get the same results multiple times, I remove duplicates
unique_documents = []
seen = set()  # To track seen documents

for doc in results['documents'][0]:
    if doc not in seen:
        unique_documents.append(doc)
        seen.add(doc)

# Concatenate titles of unique documents into a single string
suggested_itineraries = '\n'.join(unique_documents)  # Adjust based on actual structure

# This ensures `suggested_itineraries` contains unique results



In [None]:
print(suggested_itineraries)

Age: 14, Gender: Female, Occupation: Colonist, Travel Class: Business, Destination: Gliese 581, Star System: Cunningham Mountains, Distance to Destination: 1.09 light-years, Duration of Stay: 11.0 days, Number of Companions: 5, Purpose of Travel: Tourism, Transportation Type: Warp Drive, Price: 828.949275 Galactic Credits, Booking Date: 2023-09-17, Departure Date: 2025-01-07, Customer Satisfaction Score: 105.0.
Age: 14, Gender: Female, Occupation: Colonist, Travel Class: Business, Destination: Gliese 581, Star System: Cunningham Mountains, Distance to Destination: 1.09 light-years, Duration of Stay: 11.0 days, Number of Companions: 5, Purpose of Travel: Tourism, Transportation Type: Warp Drive, Price: 828.949275 Galactic Credits, Booking Date: 2023-09-17, Departure Date: 2025-01-07, Special Requests: Other, Loyalty Program Member: No, Customer Satisfaction Score: 105.0.
Age: 48, Gender: Male, Occupation: Scientist, Travel Class: Economy, Destination: Gliese 581, Star System: Sullivan M

Translate the strings containing the information into JSON format such that I can use it in the prompt given to the LLM

In [None]:
import json

# Your input string
data_string = suggested_itineraries
# Split the string into records based on the newline character
records = data_string.strip().split('\n')

# Function to convert a single record string into a dictionary
def record_to_dict(record):
    # Split the record into key-value pairs based on commas
    fields = record.split(', ')
    record_dict = {}
    for field in fields:
        # Split each field into key and value based on the first colon
        key, value = field.split(': ', 1)
        # Assign the key-value pair to the dictionary
        record_dict[key] = value
    return record_dict

# Convert each record to a dictionary and collect them in a list
records_list = [record_to_dict(record) for record in records if record]

# Convert the list of dictionaries to JSON
json_output = json.dumps(records_list, indent=4)


Prepare the final prompt for the model to generate an itenary from.

PROMPT ENGINEERING :
The construction of the prompt involves embedding a structured JSON object, which contains our data relevant to potential travel itineraries. This data is used as the foundation for generating five suggested itineraries

In [None]:
prompt = f'''[INST]

Imagine you are a travel advisor for interstellar journeys, renowned for creating vivid and captivating descriptions of travel experiences.
Write a detailed and engaging paragraph for a potential traveler, highlighting the allure of an upcoming voyage. Include the destination and
the star system it resides in, the class of travel the journey offers, the duration of stay on the destination planet, the most commond booking period, the cost of the trip in Galactic Credits,
the type of transportation used for the journey, the distance to the destination in light-years, and the satisfaction score from previous travelers.
Your goal is to make the reader feel excited and informed about the possibility of this adventure, capturing the essence of the interstellar travel experience.

{json_output}

Use only the data provided in the JSON object above. Use the data to generate a suggested 5 itineraries for the traveler by using verbs like "the cost would be around", "you would travel by", "you would stay for", etc.

[/INST]
'''
mistral_llm = "new-mixtral-chat"
print(get_completion(prompt, model=mistral_llm, max_tokens=1000))

1. For a thrilling adventure, consider a trip to the Cunningham Mountains in the Gliese 581 star system! As a business-class traveler, you would travel by warp drive, reaching your destination in just 1.09 light-years. Your 11-day stay would immerse you in breathtaking landscapes and unique experiences. With an average booking period 2 years in advance, the cost would be around 828.95 Galactic Credits. Previous travelers have given this itinerary a satisfaction score of 105.0!

2. If you're looking for a quick getaway, try the Sullivan Mountain in Gliese 581! As an economy-class traveler, you would travel by warp drive, reaching your destination in 1.39 light-years. During your 1-day stay, you can explore the local attractions and enjoy the peaceful environment. With an average booking period 2 years in advance, the cost would be around 108.46 Galactic Credits. Previous travelers have given this itinerary a satisfaction score of 110.0!

3. For a longer vacation, consider the Clarke Val

In [None]:
responses = get_completion(prompt, model=mistral_llm, max_tokens=1000)

# Print the suggestions.
print("\n\n\nPrompt :")
print(prompt)
print("Model Suggestions:")
print(responses)




Prompt :
[INST]

Imagine you are a travel advisor for interstellar journeys, renowned for creating vivid and captivating descriptions of travel experiences.
Write a detailed and engaging paragraph for a potential traveler, highlighting the allure of an upcoming voyage. Include the destination and
the star system it resides in, the class of travel the journey offers, the duration of stay on the destination planet, the most commond booking period, the cost of the trip in Galactic Credits,
the type of transportation used for the journey, the distance to the destination in light-years, and the satisfaction score from previous travelers.
Your goal is to make the reader feel excited and informed about the possibility of this adventure, capturing the essence of the interstellar travel experience.

[
    {
        "Age": "14",
        "Gender": "Female",
        "Occupation": "Colonist",
        "Travel Class": "Business",
        "Destination": "Gliese 581",
        "Star System": "Cunning

#Generate A dynamic prompt to input to the model




The function's objective is to synthesize this data into a customized prompt that instructs a language model to generate a travel itinerary tailored to the specific characteristics and preferences of the traveler. By synthesizing detailed personal and travel-related information into customized prompts for a language model, this function facilitates a highly tailored approach to travel advice and itinerary suggestions

In [29]:
import random

def create_dynamic_prompt(row):
    # Extract relevant information from the row
    age = row['Age']
    gender = row['Gender']
    occupation = row['Occupation']
    travel_class = row['Travel Class']
    destination = row['Destination']
    purpose_of_travel = row['Purpose of Travel']
    companions = row['Number of Companions']

    # Generate the dynamic prompt
    prompt = (f"[INST] Generate a travel itinerary for a {age}-year-old {gender.lower()} {occupation.lower()} "
              f"traveling to {destination} in {travel_class.lower()} class for {purpose_of_travel.lower()}. "
              f"Consider they are traveling with {companions} companion(s). [/INST]")

    return prompt


# Select a random row from the dataset
random_row = interstellar_travel_data.iloc[random.randint(0, len(interstellar_travel_data) - 1)]

# Generate a dynamic prompt for the selected row
sample_prompt = create_dynamic_prompt(random_row)

print("Sample Dynamic Prompt:", sample_prompt)


retriever_results = collection.query(
    query_texts=[sample_prompt],
    n_results=2,
)

retriever_results

Sample Dynamic Prompt: [INST] Generate a travel itinerary for a 72-year-old female businessperson traveling to Proxima Centauri in economy class for colonization. Consider they are traveling with 1 companion(s). [/INST]


{'ids': [['2761685', '9544284']],
 'distances': [[0.7314045429229736, 0.7622081637382507]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['Age: 38, Gender: Male, Occupation: Explorer, Travel Class: Luxury, Destination: Proxima Centauri, Star System: Christopher Circles, Distance to Destination: 0.55 light-years, Duration of Stay: 8.0 days, Number of Companions: 1, Purpose of Travel: Business, Transportation Type: Other, Price: 621.64179 Galactic Credits, Booking Date: 2023-07-19, Departure Date: 2025-01-17, Customer Satisfaction Score: 102.0.',
   'Age: 21, Gender: Female, Occupation: Colonist, Travel Class: Business, Destination: Proxima Centauri, Star System: Jessica Spur, Distance to Destination: 3.84 light-years, Duration of Stay: 6.0 days, Number of Companions: 4, Purpose of Travel: Tourism, Transportation Type: Solar Sailing, Price: 957.00528 Galactic Credits, Booking Date: 2023-07-03, Departure Date: 2025-10-24, Customer Satisfaction Score: 97.0.']],
 'uris'

Personalise the prompth by choosing your details

In [32]:
import random

def create_dynamic_prompt_input():
    # Prompt the user for input and extract relevant information
    age = input('Enter your age: ')
    gender = input('Enter your gender: ')
    occupation = input('Enter your occupation: ')
    travel_class = input('Enter your preferred travel class: ')
    purpose_of_travel = input('Enter the purpose of your travel: ')
    companions = input('Enter the number of companions traveling with you: ')

    # Generate the dynamic prompt
    prompt = (f"[INST] Generate a travel itinerary for a {age}-year-old {gender.lower()} {occupation.lower()} "
              f" in {travel_class.lower()} class for {purpose_of_travel.lower()}. "
              f"Consider they are traveling with {companions} companion(s). [/INST]")

    return prompt

# Example of calling the function to get a dynamic prompt based on user input
sample_prompt = create_dynamic_prompt_input()
print("Sample Dynamic Prompt:", sample_prompt)


# Select a random row from the dataset
random_row = interstellar_travel_data.iloc[random.randint(0, len(interstellar_travel_data) - 1)]

# Generate a dynamic prompt for the selected row
sample_prompt = create_dynamic_prompt(random_row)

print("Sample Dynamic Prompt:", sample_prompt)

retriever_results = collection.query(
    query_texts=[sample_prompt],
    n_results=2,
)

retriever_results

Enter your age: 22
Enter your gender: Female
Enter your occupation: Researcher
Enter your preferred travel class: First
Enter the purpose of your travel: Business
Enter the number of companions traveling with you: 1
Sample Dynamic Prompt: [INST] Generate a travel itinerary for a 22-year-old female researcher  in first class for business. Consider they are traveling with 1 companion(s). [/INST]
Sample Dynamic Prompt: [INST] Generate a travel itinerary for a 9-year-old female other traveling to Zeta II Reticuli in business class for tourism. Consider they are traveling with 1 companion(s). [/INST]


{'ids': [['9964222', '2009071']],
 'distances': [[0.7696551084518433, 0.7788380980491638]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['Age: 11, Gender: Female, Occupation: Colonist, Travel Class: Business, Destination: Zeta II Reticuli, Star System: Jessica Cape, Distance to Destination: 0.05 light-years, Duration of Stay: 10.0 days, Number of Companions: 0, Purpose of Travel: Business, Transportation Type: Warp Drive, Price: 157.69215 Galactic Credits, Booking Date: 2022-10-06, Departure Date: 2024-05-13, Customer Satisfaction Score: 115.0.',
   'Age: 67, Gender: Female, Occupation: Businessperson, Travel Class: Business, Destination: Zeta II Reticuli, Star System: Whitney Walk, Distance to Destination: 13.52 light-years, Duration of Stay: 13.0 days, Number of Companions: 1, Purpose of Travel: Tourism, Transportation Type: Warp Drive, Price: 1757.686392 Galactic Credits, Booking Date: 2023-08-12, Departure Date: 2025-03-29, Customer Satisfaction Score: 100.0.'

We can see that the embedding still needs some work since the result does not match the prompt exactly. For example I said I was 22 years old, but gives an itenary for an 11 year old and a 67 year old.

# Text Generation

In [None]:
!pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.5/76.5 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: openai
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.[0m[31m
[0mSuccessfully installed openai-0.28.0



This cell demonstrates an advanced application of integrating text generation capabilities with data processing to produce contextually relevant and creative content. It employs a two-pronged approach to generate text from embeddings and further refine the process to produce sentences based on specific data entries. The overarching methodology utilizes the transformers library to leverage a pre-trained model, specifically gpt2, for text generation tasks.

Initially, the cell outlines a procedure to transform retrieved document embeddings into coherent text. This involves aggregating document data, which may be in various formats, into a unified string format. This string effectively serves as a prompt for the gpt2 model, instructing it to generate text that is contextually inspired by the input documents. The function designed for this purpose selects or combines these documents to form a comprehensive prompt, ensuring flexibility in handling the input data. The generated text is expected to reflect the themes or content present in the input documents, thereby creating new content that maintains relevance to the original data.

Subsequently, as we did before we introduce a more targeted text generation function that constructs prompts based on specific attributes of a data entry, such as

destination, occupation, and purpose of travel, among others. This function is designed to articulate why a given destination, characterized by its location, distance, and suitability for a person's occupation and travel purpose, is an ideal choice. It crafts a prompt incorporating these details and feeds it to the text generation model, specifying parameters to guide the creativity and diversity of the output (e.g., temperature, top_k, top_p). The model then generates text that explicates the appeal of the destination in a manner tailored to the specifics of the prompt.

Answers to the question by generating text : 'Explain why {destination} located in the star system of is {star_system},{distance} away, is a perfect destination for a {occupation.lower()} traveling for {purpose.lower()} purposes'
But you can ask any question related to the data



In [33]:
# Generate text from embeddings with chatgtp2 from hugging face
from transformers import pipeline

# Load a text generation model
text_generator = pipeline('text-generation', model='gpt2')

def generate_text_from_embeddings(retrieved_documents):
    # Combine the retrieved documents into a single prompt or select one to inspire the generation
    # Ensure each document is a string
    documents_as_strings = [" ".join(doc) if isinstance(doc, list) else doc for doc in retrieved_documents]

    # Now combine the document strings into a single prompt
    combined_text = " ".join(documents_as_strings)
    print(combined_text)

    generated_text = text_generator(combined_text, max_length=50)  # adjust max_length as needed
    return generated_text[0]['generated_text']

# Use the retrieved documents to generate new text
generated_text = generate_text_from_embeddings(retriever_results["documents"])
print(generated_text)

# Just to generate a prompt sentence as we need before
def generate_sentence(data_entry, max_length=250, temperature=0.7, top_k=25, top_p=0.99):
    destination = data_entry['Destination']
    occupation = data_entry['Occupation']
    purpose = data_entry['Purpose of Travel']
    star_system = data_entry['Star System']
    distance = data_entry['Distance to Destination']
    prompt = f"Explain why {destination} located in the star system of is {star_system},{distance} away, is a perfect destination for a {occupation.lower()} traveling in the interstellar for {purpose.lower()} purposes:"

    response = text_generator(prompt,
                              max_length=max_length,
                              temperature=temperature,
                              top_k=top_k,
                              top_p=top_p,
                              num_return_sequences=1)

    return response[0]['generated_text'].strip()

def parse_document_string(document_string):
    document_dict = {}
    key_value_pairs = document_string.split(', ')
    for pair in key_value_pairs:
        if ': ' in pair:
            key, value = pair.split(': ', 1)
            key = key.strip()
            value = value.strip('\'"')
            if value.isdigit():
                document_dict[key] = int(value)
            else:
                try:
                    document_dict[key] = float(value)
                except ValueError:
                    document_dict[key] = value
        else:
            continue
    return document_dict


# Assuming retriever_results["documents"] is a list of lists where each inner list contains a string
generated_sentences = []
for doc_list in retriever_results["documents"]:
    if isinstance(doc_list, list) and doc_list:
        # Extract the first string from the list (assuming there's only one)
        doc_string = doc_list[0]
        data_entry = parse_document_string(doc_string)
        sentence = generate_sentence(data_entry)
        generated_sentences.append(sentence)
    else:
        print(f"Document is not in the expected format: {doc_list}")

# Print the generated sentences
for sentence in generated_sentences:
    print(sentence)




config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Age: 11, Gender: Female, Occupation: Colonist, Travel Class: Business, Destination: Zeta II Reticuli, Star System: Jessica Cape, Distance to Destination: 0.05 light-years, Duration of Stay: 10.0 days, Number of Companions: 0, Purpose of Travel: Business, Transportation Type: Warp Drive, Price: 157.69215 Galactic Credits, Booking Date: 2022-10-06, Departure Date: 2024-05-13, Customer Satisfaction Score: 115.0. Age: 67, Gender: Female, Occupation: Businessperson, Travel Class: Business, Destination: Zeta II Reticuli, Star System: Whitney Walk, Distance to Destination: 13.52 light-years, Duration of Stay: 13.0 days, Number of Companions: 1, Purpose of Travel: Tourism, Transportation Type: Warp Drive, Price: 1757.686392 Galactic Credits, Booking Date: 2023-08-12, Departure Date: 2025-03-29, Customer Satisfaction Score: 100.0.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Age: 11, Gender: Female, Occupation: Colonist, Travel Class: Business, Destination: Zeta II Reticuli, Star System: Jessica Cape, Distance to Destination: 0.05 light-years, Duration of Stay: 10.0 days, Number of Companions: 0, Purpose of Travel: Business, Transportation Type: Warp Drive, Price: 157.69215 Galactic Credits, Booking Date: 2022-10-06, Departure Date: 2024-05-13, Customer Satisfaction Score: 115.0. Age: 67, Gender: Female, Occupation: Businessperson, Travel Class: Business, Destination: Zeta II Reticuli, Star System: Whitney Walk, Distance to Destination: 13.52 light-years, Duration of Stay: 13.0 days, Number of Companions: 1, Purpose of Travel: Tourism, Transportation Type: Warp Drive, Price: 1757.686392 Galactic Credits, Booking Date: 2023-08-12, Departure Date: 2025-03-29, Customer Satisfaction Score: 100.0. Age
Explain why Zeta II Reticuli located in the star system of is Jessica Cape,0.05 light-years away, is a perfect destination for a colonist traveling in the interst

---------------------------------------------

# Test: Modify Embedding Function

Since we saw that our embedding has a marginal error rate, we decided to try another embedding method with bert

A custom class, HuggingFaceEmbeddingFunction, is defined to encapsulate the embedding generation process. This class, designed to conform to a specific interface expected by the ChromaDB system

In [None]:
!pip install transformers



In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
from tqdm.auto import tqdm
import random
from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer
import torch
import uuid

# Initialize tokenizer and model from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")

# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


class HuggingFaceEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Tokenize input documents
        encoded_input = tokenizer(input, padding=True, truncation=True, return_tensors='pt')

        # Move the tokenized input to the same device as the model
        encoded_input = {key: val.to(device) for key, val in encoded_input.items()}

        # Compute token embeddings
        with torch.no_grad():
            model_output = model(**encoded_input)

        # Pooling strategy to get sentence embeddings (mean pooling here)
        sentence_embeddings = self.mean_pooling(model_output, encoded_input['attention_mask'])

        # Move embeddings back to CPU for further processing or storage
        return sentence_embeddings.cpu().tolist()


    @staticmethod
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output.last_hidden_state
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Instantiate the embedding function
embed_fn = HuggingFaceEmbeddingFunction()

# Assuming you have a function to convert records to a list of textual descriptions
def records_to_texts(records):
    return [record_to_text(record) for record in records]

# Assuming you have your data loaded into interstellar_travel_data_list
batch_size = 256
embedding_dim = 768

# Create a new collection with the specified embedding dimension
client = chromadb.PersistentClient(path="./chromadb")

# create collection
collection = client.get_or_create_collection(
    name=f"interstellar_travel-v2"
)# Adjusted function to convert each record from your dataset to a textual description

# Now you can encode your data using the new embedding function
for i in tqdm(range(0, len(interstellar_travel_data_list), batch_size)):
    batch_records = interstellar_travel_data_list[i:i+batch_size]
    batch_texts = records_to_texts(batch_records)

    # Generate embeddings for the batch of text descriptions
    batch_embeddings = embed_fn(batch_texts)

    # Generate unique IDs for each record in the batch
    batch_ids = [str(random.randint(1, 10000000)) for _ in batch_records]
    # Generate unique IDs for each record in the batch using uuid4
    batch_ids = [str(uuid.uuid4()) for _ in batch_records]
    # print(batch_ids)


    # Upsert the batch of records into the ChromaDB collection
    collection.upsert(
        ids=batch_ids,
        documents=batch_texts,
        embeddings=batch_embeddings,
    )

  0%|          | 0/2139 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
