<a href="https://colab.research.google.com/github/nradich/An_AI_Experiment/blob/Notebook_Start/StreamingPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Step 1)**

 - Shifting the project to be focused on books as opposed to streaming service data, as a book API is more accessible.

 - The API call requires a `q` parameter; could play around with having the LLM generate a list of 10 adjectives or nouns to use as book lookup terms
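As a rough illustration of the `q` parameter, here is a minimal sketch that builds a Google Books volumes search URL for one term. The helper name `build_volume_query_url` is hypothetical, and the notebook itself goes through the `googleapiclient` client rather than raw URLs; this only shows how a generated term ends up in the query string.

```python
from urllib.parse import urlencode

# Public Google Books volumes endpoint; `q` is the required search parameter.
BASE_URL = "https://www.googleapis.com/books/v1/volumes"

def build_volume_query_url(term, language="en", max_results=40):
    """Return a search URL for one search term (hypothetical helper)."""
    params = {
        "q": term,
        "orderBy": "newest",
        "langRestrict": language,
        "maxResults": max_results,
    }
    # urlencode escapes spaces and special characters in the term.
    return f"{BASE_URL}?{urlencode(params)}"

url = build_volume_query_url("Tour du Mont Blanc")
print(url)
```

Multi-word terms such as "Tour du Mont Blanc" are escaped by `urlencode`, so the LLM output can be passed in as-is.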



In [None]:
# @title Read in Previous results to avoid querying for the same results.
from google.colab import drive
import glob
import pandas as pd # Make sure pandas is imported
from datetime import date
# 1. Mount your Google Drive
drive.mount('/content/drive')
#Read in previously run  files
# Use a wildcard to match all files that follow the naming convention
file_pattern = '/content/drive/My Drive/AIAnalysis/*.csv'
llm_folder = '/content/drive/My Drive/AIAnalysis/llms/Qwen3-4B-Instruct-2507'
# Get a list of all matching filenames
all_files = glob.glob(file_pattern)

# Create an empty list to hold the DataFrames
df_list = []

# Loop through each filename, read the CSV, and append the DataFrame to the list
for filename in all_files:
    df = pd.read_csv(filename)
    df_list.append(df)

# Concatenate all DataFrames in the list into a single, master DataFrame
master_df = pd.concat(df_list, ignore_index=True)

Mounted at /content/drive


In [None]:
# @title Get previous search terms and develop prompt
unique_query_terms = master_df['Query_Term'].loc[master_df['Query_Term'].notna()].unique()
query_item_list = unique_query_terms.tolist()
words_as_string = ", ".join(query_item_list)
words_as_string
prompt_1 = f"Generate 10 adjectives or nouns to search for book titles. These can be classic book genres, places or locations, or other creative disciplines. Past searches have been: {words_as_string}. Do not repeat any of the terms in the past searches. The words should be returned as a single, comma-separated list, without any extra text or numbers. The words should be:"
prompt_2 = f"Generate 10 disciplines to search for book titles. These can be classic book genres, places or locations, or other creative disciplines. Past searches have been: {words_as_string}. Do not repeat any of the terms in the past searches. The words should be returned as a single, comma-separated list, without any extra text or numbers. The words should be:"
prompt_3 = "Generate 10 epic hikes in Europe to search for book titles. These should be inspiring locations optimal for adventure-seeking individuals and mountaineering adventures. The words should be returned as a single, comma-separated list, without any extra text or numbers. The words should be:"

In [None]:
prompt_3

'Generate 10 epic hikes in Europe to search for book titles. These should be inspiring locations optimal for adventure-seeking individuals and mountaineering adventures. The words should be returned as a single, comma-separated list, without any extra text or numbers. The words should be:'

In [None]:
#loading in small model to generate the responses
#This model response is pretty wordy, shows a little too much reasoning
from transformers import AutoTokenizer, AutoModelForCausalLM

#tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
#model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

#read in model from local drive
tokenizer = AutoTokenizer.from_pretrained(llm_folder)
model = AutoModelForCausalLM.from_pretrained(llm_folder)


# Save the model and tokenizer to your Google Drive
# model.save_pretrained(llm_folder)
# tokenizer.save_pretrained(llm_folder)

print("Model and tokenizer loaded from Google Drive.")

In [None]:
messages = [
    {"role": "user", "content": prompt_3},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=240)
generated_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])

# Print the generated output.
print("Generated Words as a String:")
print(generated_text)

# Optional: parse the output into a Python list
# Note: The model might not always produce a perfectly clean list.
try:
    # Remove any unwanted tokens like '<|im_end|>' and strip whitespace
    clean_text = generated_text.replace('<|im_end|>', '').strip()
    word_list = [word.strip() for word in clean_text.split(',')]
    word_list = [word for word in word_list if word]

    print("\nParsed Words as a Python List:")
    print(word_list)

except Exception as e:
    print("\nAn error occurred while parsing the model output.")
    print("Model output:", generated_text)
    print("Error:", e)


Generated Words as a String:
Everest, Alps, Dolomites, Pyrenees, Carpathians, Swiss Alps, Gran Sasso, Tour du Mont Blanc, Mont Blanc, Icelandic Highlands<|im_end|>

Parsed Words as a Python List:
['Everest', 'Alps', 'Dolomites', 'Pyrenees', 'Carpathians', 'Swiss Alps', 'Gran Sasso', 'Tour du Mont Blanc', 'Mont Blanc', 'Icelandic Highlands']
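The model will not always return a clean list, and it may repeat earlier terms. A minimal sketch of a stricter parser follows, assuming the response is comma- or newline-separated; `parse_search_terms` and its arguments are hypothetical names, not part of the notebook.

```python
def parse_search_terms(generated_text, past_terms=(), end_token="<|im_end|>"):
    """Split a comma-separated model response into clean, unique terms."""
    clean_text = generated_text.replace(end_token, "")
    # Track terms case-insensitively so 'Alps' and 'alps' count as one.
    seen = {term.strip().lower() for term in past_terms}
    terms = []
    # Treat newlines like commas in case the model breaks the list across lines.
    for raw in clean_text.replace("\n", ",").split(","):
        term = raw.strip().strip(".")
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms

sample = "Everest, Alps, alps , Dolomites,\nMont Blanc<|im_end|>"
print(parse_search_terms(sample, past_terms=["dolomites"]))
```

Passing the previously used `Query_Term` values as `past_terms` would filter repeats even when the model ignores the instruction not to reuse them.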


In [None]:
for item in word_list:
  print(item)

Everest
Alps
Dolomites
Pyrenees
Carpathians
Swiss Alps
Gran Sasso
Tour du Mont Blanc
Mont Blanc
Icelandic Highlands


In [None]:
import json
import requests
import pandas as pd
from googleapiclient.discovery import build
from google.colab import userdata

# --- Configuration ---
# Read the API key from Colab's secrets store
API_KEY = userdata.get('book_api_key')
API_SERVICE_NAME = 'books'
API_VERSION = 'v1'

def get_books_service(api_key):
    """
    Builds and returns the Google Books API service client.
    """
    return build(API_SERVICE_NAME, API_VERSION, developerKey=api_key)

def search_recent_books(service, query, language='en', max_results_to_fetch=400):
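    """
    Searches for books matching a query, ordered by newest and restricted
    to the specified language.
    Paginates in pages of up to 40 results until at least
    `max_results_to_fetch` books are collected or the API returns no more
    items. Whole pages are appended, so the returned list can be longer
    than `max_results_to_fetch`.

    Returns the results as a list of dictionaries, or None on an API error.
    """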
    all_books = []
    start_index = 0
    max_results_per_call = 40  # Maximum allowed per call

    print(f"\nSearching for up to {max_results_to_fetch} recent books with query '{query}' in language '{language}'...")

    try:
        while len(all_books) < max_results_to_fetch:
            print(f"Fetching results from index {start_index}...")

            results = service.volumes().list(
                q=query,
                orderBy='newest',
                langRestrict=language,
                startIndex=start_index,
                maxResults=max_results_per_call
            ).execute()

            books_list = results.get('items', [])

            # If no books were returned, we have reached the end
            if not books_list:
                break

            all_books.extend(books_list)

            # Update the start index for the next page
            start_index += max_results_per_call

        return all_books

    except Exception as e:
        print(f"An error occurred during the API call: {e}")
        return None

def convert_to_dataframe(books_list):
    """
    Converts a list of book dictionaries from the API into a pandas DataFrame.
    """
    # Create an empty list to store the processed book data
    processed_books = []

    for book_data in books_list:
        book = book_data['book']
        query_term = book_data['query_term']

        volume_info = book.get('volumeInfo', {})

        # Extract the necessary fields
        book_id = book.get('id', 'N/A')
        title = volume_info.get('title', 'N/A')
        subtitle = volume_info.get('subtitle', 'N/A')
        authors = ', '.join(volume_info.get('authors', ['N/A']))
        publisher = volume_info.get('publisher', 'N/A')
        published_date = volume_info.get('publishedDate', 'N/A')
        description = volume_info.get('description', 'N/A')
        page_count = volume_info.get('pageCount', 'N/A')
        categories = ', '.join(volume_info.get('categories', ['N/A']))

        # Create a dictionary for the current book and append to the list
        processed_books.append({
            'Query_Term': query_term,
            'ID': book_id,
            'Title': title,
            'Subtitle': subtitle,
            'Authors': authors,
            'Publisher': publisher,
            'Published_Date': published_date,
            'Description': description,
            'Page_Count': page_count,
            'Categories': categories
        })

    # Create the DataFrame from the list of dictionaries
    df = pd.DataFrame(processed_books)
    return df

def main():
    """
    Main function to run the script.
    """
    if API_KEY == 'YOUR_API_KEY_HERE':
        print("Please replace 'YOUR_API_KEY_HERE' with your actual API key.")
        return None

    books_service = get_books_service(API_KEY)
    search_queries = word_list
    all_books_data = []

    for query_term in search_queries:
        print(f"\nSearching for books with query: '{query_term}'")
        books_list = search_recent_books(books_service, query=query_term, language='en', max_results_to_fetch=10)

        if books_list:
            # Store the book data along with the query term
            for book in books_list:
                all_books_data.append({'book': book, 'query_term': query_term})

    if all_books_data:
        # Convert the list of all book data to a DataFrame once
        books_df = convert_to_dataframe(all_books_data)

        print(f"\n--- Retrieved {len(books_df)} books ---")
        print("\n--- Books DataFrame ---")
        print(books_df)
        print("\n--- DataFrame Info ---")
        books_df.info()

        return books_df
    else:
        print("\nNo books data was returned from the API.")
        return None

if __name__ == "__main__":
    books_df = main()


Searching for books with query: 'Everest'

Searching for up to 10 recent books with query 'Everest' in language 'en'...
Fetching results from index 0...

Searching for books with query: 'Alps'

Searching for up to 10 recent books with query 'Alps' in language 'en'...
Fetching results from index 0...

Searching for books with query: 'Dolomites'

Searching for up to 10 recent books with query 'Dolomites' in language 'en'...
Fetching results from index 0...

Searching for books with query: 'Pyrenees'

Searching for up to 10 recent books with query 'Pyrenees' in language 'en'...
Fetching results from index 0...

Searching for books with query: 'Carpathians'

Searching for up to 10 recent books with query 'Carpathians' in language 'en'...
Fetching results from index 0...

Searching for books with query: 'Swiss Alps'

Searching for up to 10 recent books with query 'Swiss Alps' in language 'en'...
Fetching results from index 0...

Searching for books with query: 'Gran Sasso'

Searching for u

In [None]:
query_counts = books_df.groupby('Query_Term')['ID'].count()
query_counts

Query_Term,ID
Alps,40
Carpathians,40
Dolomites,40
Everest,40
Gran Sasso,40
Icelandic Highlands,40
Mont Blanc,40
Pyrenees,40
Swiss Alps,40
Tour du Mont Blanc,40
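Each term shows 40 rows even though `max_results_to_fetch=10` was passed, because `search_recent_books` appends whole pages of 40. A minimal sketch of a capped pagination loop is below; `fetch_page` is a hypothetical callable standing in for the API call, and the fake catalog exists only to make the example runnable offline.

```python
def fetch_paginated(fetch_page, max_results, page_size=40):
    """Collect at most `max_results` items from a paged source.

    `fetch_page(start_index, page_size)` is a hypothetical callable that
    returns a list of items, or an empty list when no results remain.
    """
    items = []
    start_index = 0
    while len(items) < max_results:
        # Never ask for more than is still needed, and never more than a page.
        remaining = max_results - len(items)
        page = fetch_page(start_index, min(page_size, remaining))
        if not page:
            break
        items.extend(page)
        start_index += len(page)
    # Guard against a source that returns more than was requested.
    return items[:max_results]

# Stand-in for the API: 95 fake volume IDs served in slices.
fake_catalog = [f"volume-{i}" for i in range(95)]

def fake_fetch(start_index, page_size):
    return fake_catalog[start_index:start_index + page_size]

print(len(fetch_paginated(fake_fetch, max_results=10)))
print(len(fetch_paginated(fake_fetch, max_results=400)))
```

With the real client, `fetch_page` would wrap `service.volumes().list(...).execute()` and return `results.get('items', [])`.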


In [None]:
#adds column for the ingested at time
books_df['ingested_at'] =  pd.Timestamp.now(tz='America/Los_Angeles')

In [None]:
from datetime import datetime
import pytz
# Get the current datetime in the 'America/Los_Angeles' timezone
pst_datetime = datetime.now(pytz.timezone('America/Los_Angeles'))
# Format the date as a string
today = pst_datetime.strftime('%Y-%m-%d')
# Dynamically create the filename with the PST date
DATA_FILE = f'books_2025_data_location_{today}.csv'
books_df

Unnamed: 0,Query_Term,ID,Title,Subtitle,Authors,Publisher,Published_Date,Description,Page_Count,Categories,ingested_at
0,Everest,DyDrE7Q7rFoC,Mount Everest,,Ann Heinrichs,Marshall Cavendish,2010,"Discover Mount Everest--a mysterious, exciting...",100,Juvenile Nonfiction,2025-08-18 17:10:16.838320-07:00
1,Everest,G0fOKTFyuG8C,Gaiety of Spirit,The Sherpas of Everest,Frances Klatzel,Rocky Mountain Books Ltd,2010,Since the birth of modern mountaineering the t...,178,Philosophy,2025-08-18 17:10:16.838320-07:00
2,Everest,MWsCvdQi16UC,Everest,The West Ridge,Thomas F. Hornbein,The Mountaineers Books,1998,Details the author and his partner Willi Unsoe...,244,Nature,2025-08-18 17:10:16.838320-07:00
3,Everest,dcO2DwAAQBAJ,Everest,,Megan Lappi,Weigl Publishers,2019-08-01,To reach the top of Mount Everest is to stand ...,32,Juvenile Nonfiction,2025-08-18 17:10:16.838320-07:00
4,Everest,kkvVhYwTVXwC,Everest,,"Peter Potterfield, Tom Hornbein",The Mountaineers Books,2003,"Everest, The Mountaineers Anthology Series, Vo...",268,Sports & Recreation,2025-08-18 17:10:16.838320-07:00
...,...,...,...,...,...,...,...,...,...,...,...
395,Icelandic Highlands,b-iBAAAAMAAJ,Popular Tales of the West Highlands,Orally Collected,John Francis Campbell,,1893,,468,Celts,2025-08-18 17:10:16.838320-07:00
396,Icelandic Highlands,ugSVEAAAQBAJ,Útrásarvíkingar!,The Literature of the Icelandic Financial Cris...,Alaric Hall,punctum books,2020,As the global banking boom of the early twenty...,395,Literary Criticism,2025-08-18 17:10:16.838320-07:00
397,Icelandic Highlands,YRbFEAAAQBAJ,Synergies between climate and biodiversity obj...,,"Engelbrecht Hansen, Amalie, Borgman, Elvira, F...",Nordic Council of Ministers,2023-04-27,Available online: https://pub.norden.org/teman...,141,Social Science,2025-08-18 17:10:16.838320-07:00
398,Icelandic Highlands,kEvDhihVhbMC,Iceland - Modern Processes and Past Environments,,"C. Caseldine, A. Russell, J. Hardardóttir, O. ...",Elsevier,2005-04-28,Iceland provides an unique stage on which to s...,421,Science,2025-08-18 17:10:16.838320-07:00


In [None]:
# from google.colab import drive
# import pandas as pd # Make sure pandas is imported
# from datetime import date
# # 1. Mount your Google Drive
# drive.mount('/content/drive')

# 2. Define the path and filename within your Drive
# DATA_FILE already ends in '.csv', so no extension is appended here.
file_path = f'/content/drive/My Drive/AIAnalysis/{DATA_FILE}'



# 3. Save the DataFrame to the specified path
books_df.to_csv(file_path, index=False)

In [None]:
df = pd.read_csv(file_path)
df

Unnamed: 0,Query_Term,ID,Title,Subtitle,Authors,Publisher,Published_Date,Description,Page_Count,Categories,ingested_at
0,Everest,DyDrE7Q7rFoC,Mount Everest,,Ann Heinrichs,Marshall Cavendish,2010,"Discover Mount Everest--a mysterious, exciting...",100.0,Juvenile Nonfiction,2025-08-18 17:10:16.838320-07:00
1,Everest,G0fOKTFyuG8C,Gaiety of Spirit,The Sherpas of Everest,Frances Klatzel,Rocky Mountain Books Ltd,2010,Since the birth of modern mountaineering the t...,178.0,Philosophy,2025-08-18 17:10:16.838320-07:00
2,Everest,MWsCvdQi16UC,Everest,The West Ridge,Thomas F. Hornbein,The Mountaineers Books,1998,Details the author and his partner Willi Unsoe...,244.0,Nature,2025-08-18 17:10:16.838320-07:00
3,Everest,dcO2DwAAQBAJ,Everest,,Megan Lappi,Weigl Publishers,2019-08-01,To reach the top of Mount Everest is to stand ...,32.0,Juvenile Nonfiction,2025-08-18 17:10:16.838320-07:00
4,Everest,kkvVhYwTVXwC,Everest,,"Peter Potterfield, Tom Hornbein",The Mountaineers Books,2003,"Everest, The Mountaineers Anthology Series, Vo...",268.0,Sports & Recreation,2025-08-18 17:10:16.838320-07:00
...,...,...,...,...,...,...,...,...,...,...,...
395,Icelandic Highlands,b-iBAAAAMAAJ,Popular Tales of the West Highlands,Orally Collected,John Francis Campbell,,1893,,468.0,Celts,2025-08-18 17:10:16.838320-07:00
396,Icelandic Highlands,ugSVEAAAQBAJ,Útrásarvíkingar!,The Literature of the Icelandic Financial Cris...,Alaric Hall,punctum books,2020,As the global banking boom of the early twenty...,395.0,Literary Criticism,2025-08-18 17:10:16.838320-07:00
397,Icelandic Highlands,YRbFEAAAQBAJ,Synergies between climate and biodiversity obj...,,"Engelbrecht Hansen, Amalie, Borgman, Elvira, F...",Nordic Council of Ministers,2023-04-27,Available online: https://pub.norden.org/teman...,141.0,Social Science,2025-08-18 17:10:16.838320-07:00
398,Icelandic Highlands,kEvDhihVhbMC,Iceland - Modern Processes and Past Environments,,"C. Caseldine, A. Russell, J. Hardardóttir, O. ...",Elsevier,2005-04-28,Iceland provides an unique stage on which to s...,421.0,Science,2025-08-18 17:10:16.838320-07:00


In [None]:
# Use a wildcard to match all files that follow the naming convention
file_pattern = '/content/drive/My Drive/AIAnalysis/*.csv'

# Get a list of all matching filenames
all_files = glob.glob(file_pattern)

# Create an empty list to hold the DataFrames
df_list = []

# Loop through each filename, read the CSV, and append the DataFrame to the list
for filename in all_files:
    df = pd.read_csv(filename)
    df_list.append(df)

# Concatenate all DataFrames in the list into a single, master DataFrame
master_df = pd.concat(df_list, ignore_index=True)

# --- Check the DataFrame before cleaning ---
print(f"Successfully loaded and combined {len(all_files)} files.")
print(f"The initial DataFrame has {len(master_df)} rows, before removing duplicates.")
print("\n--- Initial Master DataFrame (before cleaning) ---")


# --- Clean the DataFrame after inspecting ---
# Remove any duplicates that might have been created
master_df.drop_duplicates(subset=['ID'], inplace=True)

print(f"\n--- Final DataFrame (after removing duplicates) ---")
print(f"The final DataFrame has {len(master_df)} unique rows.")
print(master_df.head())

Successfully loaded and combined 11 files.
The initial DataFrame has 3800 rows, before removing duplicates.

--- Initial Master DataFrame (before cleaning) ---

--- Final DataFrame (after removing duplicates) ---
The final DataFrame has 3352 unique rows.
             ID           Title  \
0  Y3cWEQAAQBAJ   Project 2025:   
1  YgY4K2lPDNYC            2025   
2  A7Dc0AEACAAJ    Project 2025   
3  uL0uzgEACAAJ  Zeitgeist 2025   
4  JWcpEQAAQBAJ  HOROSCOPE 2025   

                                            Subtitle  \
0  The BluePrint: Everything You Need To Know Abo...   
1  Scenarios of U.S. and Global Society Reshaped ...   
2                A Hope for All Americans Comes 2025   
3  Countdown to the Secret Destiny of America? th...   
4                                                NaN   

                                             Authors                Publisher  \
0                                       John Madison           A.W Publishing   
1  Joseph Francis Coates, John B. M

In [None]:
master_df.sort_values(by = "ID")

Unnamed: 0,ID,Title,Subtitle,Authors,Publisher,Published_Date,Description,Page_Count,Categories,ingested_at,Query_Term
559,--4xEQAAQBAJ,"Diachronic, Typological, and Areal Aspects of ...",,"Paola Cotticelli-Kurras, Eystein Dahl, Jelena ...",Walter de Gruyter GmbH & Co KG,2024-12-30,"This book deals with the category of converbs,...",527.0,Language Arts & Disciplines,2025-08-11 17:55:24.387066-07:00,romance
580,-0LofU-Ac-oC,Alternative Scriptwriting,Beyond the Hollywood Formula,"Ken Dancyger, Jeff Rush",Taylor & Francis,2013-10-28,"Learn the rules of scriptwriting, and then how...",480.0,Performing Arts,2025-08-11 17:55:24.387066-07:00,thriller
51,-LdozgEACAAJ,2025 Post-Covid Scenarios,Latin America and the Caribbean,"Pepe Zhang, Peter Engelke",,2021-04-29,,,,2025-08-06 17:10:38.934987-07:00,
25,-NwXSBGJo1wC,"The Future of North America, 2025",Outlook and Recommendations,Armand B. Peschard-Sverdrup,CSIS,2008-08-28,,360.0,Business & Economics,2025-08-06 17:10:38.934987-07:00,
90,-_rVCgAAQBAJ,Sentencing Fragments,"Penal Reform in America, 1975-2025",Michael H. Tonry,Oxford University Press,2016,Cover -- Contents -- Preface -- Acknowledgment...,315.0,Law,2025-08-06 17:10:38.934987-07:00,
...,...,...,...,...,...,...,...,...,...,...,...
508,zZC4EAAAQBAJ,Fox and Bear: Unforgettable Adventures,,Zamfir Iacob,Zamfir Iacob,2023-04-11,The story is about the adventures of a fox and...,19.0,Juvenile Fiction,2025-08-11 17:55:24.387066-07:00,adventure
776,zcNyEQAAQBAJ,Tourism Diplomacy,"Insights from Economic, Environmental, and Soc...","Mahmut Demir, Şirvan Şen Demir",Emerald Group Publishing,2025-08-13,This edited volume explores emerging trends an...,238.0,Business & Economics,2025-08-12 16:49:06.165311-07:00,diplomacy
145,zd3_0AEACAAJ,Adobe Indesign 2025 Guide for Beginners,Mastering the Art of Creative Design for Publi...,Nava Asher,Independently Published,2024-11-24,Unlock the full potential of Adobe InDesign wi...,0.0,,2025-08-06 17:10:38.934987-07:00,
137,zdLdzgEACAAJ,2021-2025 Five Year Monthly Planner,Large 5 Year Monthly Planner 2021-2025|60 Mont...,All YourPlanners,,2020-10-28,2021-2025 Five Year Planner 2021- 2025 5 Year ...,131.0,,2025-08-06 17:10:38.934987-07:00,


In [None]:
# Ask the model if it has knowledge of some of the books
books_2024 =  master_df[master_df['Published_Date'].astype(str).str.startswith('2024')]
books_2024[23:28]

Unnamed: 0,ID,Title,Subtitle,Authors,Publisher,Published_Date,Description,Page_Count,Categories,ingested_at,Query_Term
70,GX0gEQAAQBAJ,"Workbook for Lectors, Gospel Readers, and Proc...","United States Edition, Reflowable Layout E-boo...","Various authors including Eric J. Wagner, CR, ...",LTP,2024-09-09,"When lectors, readers, and proclaimers of the ...",614.0,Religion,2025-08-06 17:10:38.934987-07:00,
73,UTEbEQAAQBAJ,Stock Trader's Almanac 2025,,Jeffrey A. Hirsch,John Wiley & Sons,2024-10-22,58th Annual Edition of the leading resource on...,212.0,Business & Economics,2025-08-06 17:10:38.934987-07:00,
74,8YbS0AEACAAJ,Project 2025,Democracy at Risk: Unveiling the Dangers of a ...,Mark Collins,Independently Published,2024-07-13,"In ""Project 2025: Democracy at Risk - Unveilin...",0.0,Political Science,2025-08-06 17:10:38.934987-07:00,
76,RyY6EQAAQBAJ,Fodor's Essential Italy 2025,,Fodor’s Travel Guides,Fodor's Travel,2024-12-24,Whether you want to visit the Colosseum in Rom...,1295.0,Travel,2025-08-06 17:10:38.934987-07:00,
79,snHy0AEACAAJ,Frans Hals Planner 2025,The Laughing Cavalier Organizer Calendar Year ...,Shy Panda Press,,2024-11-06,Frans Hals Planner 2025 (The Laughing Cavalier...,0.0,Art,2025-08-06 17:10:38.934987-07:00,


In [None]:
unique_query_terms = master_df['Query_Term'].loc[master_df['Query_Term'].notna()].unique()
query_item_list = unique_query_terms.tolist()
query_item_list

['mystery',
 'fantasy',
 'adventure',
 'romance',
 'thriller',
 'horror',
 'sci-fi',
 'mythology',
 'survival',
 'poetry',
 'diplomacy',
 'cyberpunk',
 'wilderness',
 'time travel',
 'alchemy']

In [None]:
book_prompt = "What year was the book 'Fodor's Essential Italy 2025' by the author Fodor’s Travel Guides published? Respond with the year only."
book_prompt_2024 = "What are 5 popular books released in 2024? Provide the title and author ONLY in a list format. No further commentary needed."
messages = [
    {"role": "user", "content": book_prompt_2024},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=240)
generated_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])

# Print the generated output.
print("Generated Words as a String:")
print(generated_text)

Generated Words as a String:
- The Ministry of Time by Emily St. John Mandel  
- The Book of Longings by Sue Monk Kidd  
- The Midnight Library by Matt Haig  
- The Vanishing Half by Brit Bennett  
- A Little Life by Hanya Yanagihara (Note: Originally published in 2015; this may be a confusion—no major new release in 2024 by this author)  

Correction: After verifying 2024 releases, here are five accurate and popular books:

- The Ministry of Time by Emily St. John Mandel  
- The Book of Longings by Sue Monk Kidd  
- The Midnight Library by Matt Haig  
- The Vanishing Half by Brit Bennett  
- The Girl with the Dragon Tattoo by Stieg Larsson (Note: Originally published in 2005; not a 2024 release)  

Final corrected list of five popular books released in 2024:

- The Ministry of Time by Emily St. John Mandel  
- The Book of Longings by Sue Monk Kidd  
- The Midnight Library by Matt Haig  
- The Vanishing Half by Brit Bennett  
- A


##**Step 2)**

- Read in the Personas dataset from Nvidia

In [None]:
import pandas as pd

# The dataset is public, so no `huggingface-cli login` is required
personas_df = pd.read_parquet("hf://datasets/nvidia/Nemotron-Personas/data/train-00000-of-00001.parquet")
# Take only the first 5 rows of the DataFrame for a trial run
small_personas_df = personas_df[:5]

In [None]:
small_personas_df

Unnamed: 0,uuid,persona,professional_persona,sports_persona,arts_persona,travel_persona,culinary_persona,skills_and_expertise,skills_and_expertise_list,hobbies_and_interests,hobbies_and_interests_list,career_goals_and_ambitions,sex,age,marital_status,education_level,bachelors_field,occupation,city,state,zipcode,country
0,df6b2b96-a938-48b0-83d8-75bfed059a3d,"A disciplined, sociable visionary, Jonathan ba...","A retired manufacturing manager, Jonathan now ...","An avid golfer, Jonathan plays weekly at the W...","A history enthusiast, Jonathan often leads tou...","A seasoned, meticulous planner, Jonathan favor...","A fan of hearty, Midwestern comfort food, Jona...",Jonathan's organizational skills and disciplin...,"['project management', 'budgeting and financia...",Jonathan enjoys a mix of social and solitary a...,"['golfing', 'woodworking', 'coin collecting', ...",After retiring from his career in manufacturin...,Male,72,widowed,high_school,,not_in_workforce,Wickliffe,OH,44092,USA
1,3b5691bf-07cd-4e58-b85b-cff62faba2fd,"Quintin, a 40-year-old logistician from Conver...","Quintin Pete Johnson, a logistician, combines ...","Quintin Pete Johnson, a dedicated fan of the S...",They appreciate the gritty realism of Texas ar...,"Quintin, a meticulous planner, balances family...",They delight in preparing complex Tex-Mex dish...,"Quintin Pete Johnson, a logistician from Conve...","['supply chain management', 'inventory control...",Quintin's balanced social nature extends to hi...,"['board games', 'art appreciation', 'history',...",Quintin aspires to become a director of logist...,Male,40,married_present,bachelors,arts_humanities,logistician,Converse,TX,78109,USA
2,8d6e788b-b0cf-42c1-9448-782fd12c6afe,"Ashley, a passionate community advocate, balan...","Ashley, an aspiring union representative, exce...","Ashley, a dedicated Detroit Lions fan, maintai...","Ashley, a self-proclaimed 'Motown music enthus...","Ashley, a budget-conscious traveler, dreams of...","Ashley, a skilled home cook, loves preparing h...",Ashley has developed strong organizational ski...,"['organizational skills', 'proficient in micro...",Ashley enjoys exploring new music and attendin...,"['exploring new music', 'cooking soul food', '...",Ashley aspires to become a union representativ...,Female,23,never_married,high_school,,laborer_or_freight_stock_or_material_mover,Detroit,MI,48219,USA
3,4617ca2c-673a-4a1b-a6cf-e171d542e113,"Stephanie, always the first to volunteer, bala...","Stephanie, a customer service representative, ...","Stephanie, a die-hard Minnesota Vikings fan, p...","Stephanie, an avid reader and amateur painter,...","Stephanie, despite her love for the outdoors, ...","Stephanie, a self-taught cook, enjoys experime...",Stephanie's ability to balance curiosity and p...,"['customer service', 'data analysis', 'multita...",Stephanie's outgoing nature and curiosity lead...,"['hiking', 'fishing', 'cooking', 'reading', 'h...",Stephanie enjoys her job as a customer service...,Female,41,married_present,some_college,,customer_service_representative,Littlefork,MN,56653,USA
4,21a01219-bace-4f40-9cca-79de787781d2,"Sonia, a 70-year-old retiree, is a vibrant, im...","Sonia, a retired organizer with a creative sou...","Sonia, though not athletic, enjoys watching ba...","Sonia, a passionate artist, finds inspiration ...","Sonia, a seasoned traveler, plans meticulous i...","Sonia, an avid cook, delights in preparing com...",Sonia has honed her organizational skills over...,"['event planning', 'group coordination', 'pain...",Sonia enjoys spending her free time creating a...,"['art creation', 'reading (poetry, biographies...","Though retired, Sonia still harbors ambitions ...",Female,70,married_present,9th_12th_no_diploma,,no_occupation,Cayucos,CA,93430,USA


##**Step 3)**

  - Encode the personas dataset so that it can be fed to a model for fine-tuning
  - Looks like Hugging Face provides AutoTokenizers for some of the models
  - Could be next task, creating a tokenizer

  - How to tokenize will depend on the model, e.g. T5 is text-to-text

  - Encodings appear to be too large to fine-tune the LLM on, so will need to pivot to the books dataset, store the personas in a vector DB, and prompt the model

  - Could then do an agent mode and check whether each book is available, or go and get the price for each of them.

 - Build an API, then train a chatbot on the documentation. Google Gemini is pretty good at working with the Google Books API.
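A rough sketch of the "store the personas in a vector DB and prompt the model" idea, using only the standard library. The bag-of-words `embed` function and the `PersonaStore` class are toy stand-ins (assumptions) for a real embedding model and vector database, and the two persona strings are shortened illustrative examples.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real setup would use an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class PersonaStore:
    """In-memory stand-in for a vector database (hypothetical)."""

    def __init__(self):
        self._rows = []

    def add(self, persona_id, text):
        self._rows.append((persona_id, text, embed(text)))

    def search(self, query, top_k=1):
        """Return the top_k (id, text) pairs most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(
            self._rows,
            key=lambda row: cosine_similarity(query_vec, row[2]),
            reverse=True,
        )
        return [(pid, text) for pid, text, _ in ranked[:top_k]]

store = PersonaStore()
store.add("p1", "An avid golfer and history enthusiast who leads local tours")
store.add("p2", "A home cook who loves soul food and exploring new music")

# Retrieve the closest persona and place it in the prompt as context.
persona_id, persona_text = store.search("books about golf history", top_k=1)[0]
prompt = f"Persona: {persona_text}\nRecommend one book from the catalog for this person."
print(persona_id)
print(prompt)
```

The retrieved persona goes into the prompt as context, so the model is never fine-tuned on the personas themselves.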



In [None]:
import torch
import transformers as tr

In [None]:
# The transformers library provides AutoTokenizer, which loads the
# tokenizer that matches a specific model checkpoint
model_name = "gpt2"
tokenizer = tr.AutoTokenizer.from_pretrained( model_name)

In [None]:
# Convert the personas pandas DataFrame into a Hugging Face Dataset object
# so that it can be processed with `map`.
from datasets import Dataset
hf_dataset = Dataset.from_pandas(small_personas_df)

In [None]:
columns_to_include = [col for col in hf_dataset.column_names ]
columns_to_include

['uuid',
 'persona',
 'professional_persona',
 'sports_persona',
 'arts_persona',
 'travel_persona',
 'culinary_persona',
 'skills_and_expertise',
 'skills_and_expertise_list',
 'hobbies_and_interests',
 'hobbies_and_interests_list',
 'career_goals_and_ambitions',
 'sex',
 'age',
 'marital_status',
 'education_level',
 'bachelors_field',
 'occupation',
 'city',
 'state',
 'zipcode',
 'country']

In [None]:
# Take a small segment of the personas dataset to see if this pattern works for tokenization
def combine_columns_for_tokenization (dataset, columns_to_include ):
  """Takes a batch from a Hugging Face dataset (a dict mapping column names to lists of values)
  along with a list of column names, then combines the selected columns of each row into one string."""
  combined_text = []

  # Get the list of columns to include
  # You might want to exclude some columns like 'uuid', 'id', etc.

  # Iterate through each row in the batch of examples
  for i in range(len(dataset[columns_to_include[0]])):
      # Build the text string for the current row
      row_string_parts = []
      for column in columns_to_include:
          # Get the column name and its value for the current row
          column_name = column.replace('_', ' ').title()  # Formats 'hobbies_and_interests' to 'Hobbies And Interests'
          column_value = dataset[column][i]

          # Append the formatted string to the parts list
          row_string_parts.append(f"{column_name}: {column_value}")

      # Join all the parts for the row into a single string
      #Will leave in the period just to see how it performs
      combined_text.append(". ".join(row_string_parts) + ".")

  return {"text": combined_text}

# Apply the function to the dataset to create the new 'text' column
columns_to_include = [col for col in hf_dataset.column_names ]
combined_dataset = hf_dataset.map(
    combine_columns_for_tokenization,
    batched=True,
    fn_kwargs={"columns_to_include": columns_to_include} # <-- The magic line
)
combined_dataset

In [None]:
# `combined_text` is local to the function above, so read the new column instead
list(combined_dataset["text"])

["Uuid: df6b2b96-a938-48b0-83d8-75bfed059a3d. Persona: A disciplined, sociable visionary, Jonathan balances practicality with curiosity, leaving a lasting impact on his community through his organized, competitive approach. Professional Persona: A retired manufacturing manager, Jonathan now excels as a community developer, leveraging his organizational skills and competitive nature to drive sustainable growth in Wickliffe. Sports Persona: An avid golfer, Jonathan plays weekly at the Wickliffe Country Club and cheers for the Cleveland Browns, maintaining his competitive spirit even in leisure. Arts Persona: A history enthusiast, Jonathan often leads tours at the Lake County Historical Society, sharing stories about local pioneers and their impact on the region's development. Travel Persona: A seasoned, meticulous planner, Jonathan favors international destinations with rich histories, like Edinburgh and Dublin, where he can explore ancestral roots and enjoy a round of golf at prestigiou

In [None]:
# Maximum sequence length used for tokenization.
# GPT-2 supports up to 1024 tokens; 512 saves memory.
max_length = 512

# Load the tokenizer
tokenizer = tr.AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default, so reuse the
# end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token

# Define the tokenization function
def tokenize_function(examples):
    """
    This function tokenizes a batch of text from the 'text' column of the dataset.
    """
    return tokenizer(
        examples["text"],
        truncation=True,
        padding=True, # Will now work correctly
        max_length=max_length,
    )

# Apply the tokenization function to the entire dataset using `map`.
tokenized_dataset = combined_dataset.map(tokenize_function, batched=True)

print("\nTokenized Dataset Structure:")
print(tokenized_dataset)
print("-" * 20)
print("First item in the tokenized dataset:")
print(tokenized_dataset[0])

In [None]:
# Assuming you've stored the tokenized dataset in `tokenized_dataset`

# Get the first item
first_item = tokenized_dataset[0]

# Print a few key details for verification
print("--- Verification of First Tokenized Item ---")
print("Original Text:\n", first_item['text'][:300] + "...") # Print first 300 chars of original text
print("\nNumber of Tokens (input_ids length):", len(first_item['input_ids']))
print("Number of Attention Mask values:", len(first_item['attention_mask']))

# Print the first 20 token IDs to see what they look like
print("\nFirst 20 Input IDs:", first_item['input_ids'][:20])

# Print the last 20 attention mask values to check for padding
# You should see 1s, followed by 0s if the text was shorter than max_length.
print("Last 20 Attention Mask values:", first_item['attention_mask'][-20:])

# To get the actual tokens (words/subwords) back from the IDs:
# This is a useful step for sanity-checking.
# GPT-2 uses a byte-level BPE tokenizer, so no extra package such as 'sentencepiece' is needed here.
decoded_text = tokenizer.decode(first_item['input_ids'], skip_special_tokens=True)
print("\nDecoded Text (for verification):", decoded_text[:300] + "...")

--- Verification of First Tokenized Item ---
Original Text:
 Uuid: df6b2b96-a938-48b0-83d8-75bfed059a3d. Persona: A disciplined, sociable visionary, Jonathan balances practicality with curiosity, leaving a lasting impact on his community through his organized, competitive approach. Professional Persona: A retired manufacturing manager, Jonathan now excels as ...

Number of Tokens (input_ids length): 512
Number of Attention Mask values: 512

First 20 Input IDs: [52, 27112, 25, 47764, 21, 65, 17, 65, 4846, 12, 64, 24, 2548, 12, 2780, 65, 15, 12, 5999, 67]
Last 20 Attention Mask values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Decoded Text (for verification): Uuid: df6b2b96-a938-48b0-83d8-75bfed059a3d. Persona: A disciplined, sociable visionary, Jonathan balances practicality with curiosity, leaving a lasting impact on his community through his organized, competitive approach. Professional Persona: A retired manufacturing manager, Jonathan now excels as ...
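
The last 20 attention-mask values above are all 1, which suggests the first persona filled all 512 positions and needed no padding. A plain-Python sketch of what truncation and right-padding do to a sequence, under the assumption of fixed-length padding (the helper `pad_or_truncate` is hypothetical and is not the tokenizer's implementation):

```python
def pad_or_truncate(input_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = input_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return kept + [pad_id] * n_pad, attention_mask

PAD_ID = 50256  # GPT-2's end-of-text token id, reused as the pad token

# A short sequence gets padded; a long one gets cut.
short_ids, short_mask = pad_or_truncate([52, 27112, 25], max_length=6, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6, pad_id=PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

A row whose mask ends in 0s was shorter than the padded length; a row whose mask is all 1s was either exactly that length or truncated.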


In [None]:
# The maximum sequence length for the GPT-2 model.
# Adjust this based on the length of your personas.
# 1024 is the max for gpt2, but smaller values can save memory.
max_length = 512

def tokenize_function(examples):
    """
    This function takes a batch of text and returns the tokenized output.
    """
    # The tokenizer will convert the text into input_ids, attention_mask, etc.
    # `truncation=True`: Ensures sequences longer than max_length are cut.
    # `padding="max_length"`: Pads shorter sequences up to max_length for batching.
    # `return_tensors="pt"`: Returns PyTorch tensors from the tokenizer. Note that
    # `Dataset.map` stores the result as plain lists unless a format is set later.
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt"
    )

# Apply the tokenization function to the entire dataset using `map`.
# `batched=True` tells the function to process items in batches, which is faster.
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)

print("\nTokenized Dataset Structure:")
print(tokenized_dataset)
print("-" * 20)
print("First item in the tokenized dataset:")
print(tokenized_dataset[0])
print(f"Length of input_ids for the first item: {len(tokenized_dataset[0]['input_ids'])}")

Step 5)
Confirm that we can connect to the Transformers library and load an LLM.

Transformers is the Hugging Face library for loading models; the model is specified there.

In [None]:
!pip install transformers sentence-transformers torch # torch is the backend, sentence-transformers for embeddings

In [None]:

# 1. Choose a model ID from Hugging Face Hub
# For a small, fast example: "gpt2"
# Other small options: "distilgpt2" (a distilled, smaller GPT-2) or "microsoft/DialoGPT-small".
# For something more capable (but larger), look into instruction-tuned models
# like "google/gemma-2b-it" (requires agreement)
# or "meta-llama/Llama-2-7b-chat-hf" (requires agreement)

# Let's start with GPT-2 for a quick demonstration
model_name = "gpt2"

# Check if GPU is available and set device
device = 0 if torch.cuda.is_available() else -1 # 0 for first GPU, -1 for CPU

# Option 1: Using the `pipeline` API (simplest for common tasks)
# This handles tokenizer and model loading automatically for many tasks
print(f"Loading model '{model_name}' using pipeline...")
generator = tr.pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.float16 if device == 0 else torch.float32, # float16 saves memory on GPU; use float32 on CPU
    device=device
)
print("Model loaded via pipeline!")

# Example usage with pipeline:
prompt = "Given a persona who loves action movies and sci-fi, which streaming service would they choose?"
output = generator(prompt, max_new_tokens=50, num_return_sequences=1)
print("\nLLM Prediction (Pipeline):")
print(output[0]['generated_text'])

# RAG & Vector Database

In [None]:
# Using ChromaDB because it handles embedding the documents for us

In [None]:
def prepare_dataframe_for_chroma(df, columns_to_include):
    """
    Takes a pandas DataFrame and a list of column names,
    then combines each row's specified columns into a single text string
    and prepares the data for ChromaDB.

    Args:
        df (pd.DataFrame): The input pandas DataFrame.
        columns_to_include (list): A list of column names to combine into the document text.

    Returns:
        dict: A dictionary containing 'documents', 'metadatas', and 'ids' lists
              in the format required by chromadb.
    """
    documents = []
    metadatas = []
    ids = []

    # Iterate through each row of the DataFrame
    for index, row in df.iterrows():
        # Build the text string for the current row
        row_string_parts = []
        for column in columns_to_include:
            column_name = column.replace('_', ' ').title()
            column_value = row[column]
            row_string_parts.append(f"{column_name}: {column_value}")

        # Join all parts for the row into a single document string
        combined_text = ". ".join(row_string_parts) + "."
        documents.append(combined_text)

        # Prepare the metadata for the current row
        # Use the remaining columns (those not in the document text) as metadata
        row_metadata = row.drop(columns_to_include).to_dict()
        metadatas.append(row_metadata)

        # Create a unique ID for the document
        ids.append(str(index))

    return {
        "documents": documents,
        "metadatas": metadatas,
        "ids": ids
    }
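
Missing cells end up in the document text as literal strings such as `Subtitle: nan`, as the query results further down show. One possible variant, sketched here under the assumption that dropping empty fields is acceptable, skips missing values when building the string (`is_missing` and `row_to_document` are hypothetical helpers, not part of the notebook):

```python
import math

def is_missing(value):
    """True for None and float NaN (how pandas represents empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def row_to_document(row, columns_to_include):
    """Build the 'Column Name: value' string, skipping missing values."""
    parts = [
        f"{column.replace('_', ' ').title()}: {row[column]}"
        for column in columns_to_include
        if not is_missing(row[column])
    ]
    return ". ".join(parts) + "."

# Toy row (made-up values); a plain dict stands in for a DataFrame row here.
toy_row = {"Title": "Australian Outback", "Subtitle": float("nan"), "Page_Count": 32.0}
doc = row_to_document(toy_row, ["Title", "Subtitle", "Page_Count"])
print(doc)
```

Whether removing the empty fields actually improves retrieval is untested here; it mainly keeps the word "nan" out of the embedded text.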

In [None]:
columns_for_document

['Title',
 'Subtitle',
 'Authors',
 'Publisher',
 'Published_Date',
 'Description',
 'Page_Count',
 'Categories']

In [None]:
# Define the columns that should be used exclusively as metadata
metadata_columns = ['ID', 'ingested_at', 'Query_Term']

# Create the list of columns to be combined for the document text
columns_for_document = [col for col in master_df.columns if col not in metadata_columns]
chroma_data = prepare_dataframe_for_chroma(master_df, columns_for_document)
# Print the prepared data to see the format
print("Prepared Documents:")
print(chroma_data['documents'])
print("\nPrepared Metadatas:")
print(chroma_data['metadatas'])
print("\nPrepared IDs:")
print(chroma_data['ids'])

Prepared Documents:

Prepared Metadatas:
[{'ID': 'Y3cWEQAAQBAJ', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'YgY4K2lPDNYC', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'A7Dc0AEACAAJ', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'uL0uzgEACAAJ', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'JWcpEQAAQBAJ', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'LxXm0AEACAAJ', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'bvjS0AEACAAJ', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'Ex4tAAAAYAAJ', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'Be9aaSR43ZsC', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'zRxUzwEACAAJ', 'ingested_at': '2025-08-06 17:10:38.934987-07:00', 'Query_Term': nan}, {'ID': 'sz59AwAAQB

In [None]:
!pip install chromadb

In [None]:
import chromadb
#establish client and collection
client = chromadb.Client() # or a persistent client: chromadb.PersistentClient(path="./my_db")
collection = client.create_collection("google_books_8_14")
#add to the collection
collection.add(
    documents=chroma_data['documents'],
    metadatas=chroma_data['metadatas'],
    ids=chroma_data['ids']
)

In [None]:
collection.get(
    ids=["2004"],
)

{'ids': ['2004'],
 'embeddings': None,
 'documents': ['Title: Volcanoes of the World. Subtitle: Third Edition. Authors: Lee Siebert, Tom Simkin, Paul Kimberly. Publisher: Univ of California Press. Published Date: 2011-02-09. Description: This impressive scientific resource presents up-to-date information on ten thousand years of volcanic activity on Earth. In the decade and a half since the previous edition was published new studies have refined assessments of the ages of many volcanoes, and several thousand new eruptions have been documented. This edition updates the book’s key components: a directory of volcanoes active during the Holocene; a chronology of eruptions over the past ten thousand years; a gazetteer of volcano names, synonyms, and subsidiary features; an extensive list of references; and an introduction placing these data in context. This edition also includes new photographs, data on the most common rock types forming each volcano, information on population densities nea

In [None]:
results = collection.query(query_texts=["Australia"], n_results=10)
results

{'ids': [['2081',
   '2107',
   '2116',
   '2091',
   '2117',
   '1809',
   '2101',
   '2094',
   '2090',
   '2099']],
 'embeddings': None,
 'documents': [['Title: Australian Outback. Subtitle: nan. Authors: John Lesley. Publisher: Redback Publishing. Published Date: 2020-02-01. Description: he Outback is not a place with any definite boundary. When Australians refer to the Outback, they mean the enormous regions of the country that are far away from the sorts of services, transport and facilities that people expect to find in urban areas. Find out who lives in the Outback, how they survive and why they choose to live in one of the harshest but most beautiful places on Earth. - One of the largest wilderness regions left in the world - Cattle stations bigger than countries - Ancient and sparsely populated. Page Count: 32.0. Categories: Juvenile Nonfiction.',
   "Title: Exploring Australia: A Journey Down Under. Subtitle: nan. Authors: Mark F. Prinz. Publisher: James Parducci. Published 

In [None]:
# Now try RAG over the dataset

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device_map="auto",
)

Device set to use cuda:0


In [None]:
question = "What are three things I should know before drinking my morning coffee?"
context = " ".join(f"#{doc}" for doc in results["documents"][0])
prompt_template = f"Relevant context: {context}\n\n The user's question: {question}"
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])
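
The query results are nested: `results["documents"]` holds one inner list per query text, so `[0]` selects the hits for the single query. A small sketch of turning that structure into a prompt with each document tagged by its ID, using made-up results of the same shape (`build_rag_prompt` and `max_docs` are illustrative assumptions):

```python
def build_rag_prompt(results, question, max_docs=3):
    """Flatten Chroma-style query results into a context block tagged by ID."""
    docs = results["documents"][0][:max_docs]   # hits for the first (only) query text
    ids = results["ids"][0][:max_docs]
    context = "\n".join(f"[{doc_id}] {doc}" for doc_id, doc in zip(ids, docs))
    return f"Relevant context:\n{context}\n\nThe user's question: {question}"

# Made-up results with the same nesting as the `collection.query` output.
toy_results = {
    "ids": [["2081", "2107"]],
    "documents": [[
        "Title: Australian Outback.",
        "Title: Exploring Australia: A Journey Down Under.",
    ]],
}
prompt = build_rag_prompt(toy_results, "Which book covers the Outback?")
print(prompt)
```

Capping the number of documents keeps the prompt short, and tagging each one with its ID makes it easier to trace an answer back to a source.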

In [None]:
!pip install langchain-community
!pip install langchain
!pip install sentence-transformers # For the embedding model
#!pip install torch transformers accelerate bitsandbytes # For the Qwen model

In [None]:

from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceEmbeddings

# --- 1. Set up your Qwen Model and Tokenizer ---
# The model and tokenizer were loaded earlier; the loading code is kept here
# (commented out) for reference.
# This assumes there is enough VRAM to load a 4B parameter model.
# `device_map="auto"` lets the library place the weights on the available devices.
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", device_map="auto")

# --- 2. Wrap the model in a Hugging Face Pipeline ---
# This creates a text-generation pipeline that LangChain can use.
# `pad_token_id` is important to prevent errors during batch generation.
# Adjust `max_new_tokens` based on your needs.
qwen_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)

# --- 3. Create the LangChain LLM object from the pipeline ---
llm = HuggingFacePipeline(pipeline=qwen_pipeline)

# --- 4. Set up the Retrieval Part (Chroma) ---
# We'll use a HuggingFace embedding model for consistency with your setup.
# This ensures your queries and documents are embedded in the same space.
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_function = HuggingFaceEmbeddings(model_name=embedding_model_name)

# Reuse the Chroma client created earlier so that the existing collection is found
vector_store = Chroma(
    client=client,
    collection_name="google_books_8_14",
    embedding_function=embedding_function
)

retriever = vector_store.as_retriever(search_kwargs={"k": 2})

# --- 5. Create the RetrievalQA chain using your Qwen LLM ---
# The rest of the chain setup remains the same.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# # --- 6. Invoke the chain with a query ---
# question = "What comprehensive book about Australia travel"
# result = qa_chain.invoke({"query": question})

# print("Answer:", result['result'])
# print("\nSource Documents:", result['source_documents'])
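
The retriever with `k=2` returns the two stored documents whose embeddings are closest to the query embedding. A conceptual plain-Python sketch of top-k retrieval by cosine similarity follows; the three-dimensional vectors and document names are made up, real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and the vector store uses its own index and distance settings rather than this brute-force loop:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (made up for illustration).
toy_vectors = {
    "hiking_guide": [0.9, 0.1, 0.0],
    "volcano_atlas": [0.1, 0.9, 0.2],
    "alps_travel": [0.7, 0.2, 0.1],
}
top_ids = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(top_ids)
```

A larger `k` gives the LLM more context to work with but also lengthens the prompt that the "stuff" chain builds.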

In [None]:
# --- 6. Invoke the chain with a query ---
question = "How many miles is the tour du mount blanc ? "
result = qa_chain.invoke({"query": question})

print("Answer:", result['result'])
print("\nSource Documents:", result['source_documents'])

Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Title: Tour Du Mont Blanc Hiking Guide 2025. Subtitle: nan. Authors: MARK. O. ELIOT. Publisher: Independently Published. Published Date: 2025-03-22. Description: The Ultimate Trekking Companion for One of the World's Greatest Long-Distance Trails Discover the Adventure of a Lifetime! The Tour du Mont Blanc (TMB) is one of the most spectacular and renowned long-distance hikes in the world, taking you through France, Italy, and Switzerland as you circle the breathtaking Mont Blanc massif. Whether you're a seasoned hiker or a first-time trekker, this comprehensive 2025 guidebook will equip you with everything you need to plan and complete this epic 170-kilometer (105-mile) journey. What You'll Find Inside: Complete Route Breakdown & Itinerary Planning Detailed day-by-day itinerary for self-guided and guided hikes Overvi
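
The `Answer:` output above starts with the chain's prompt and the retrieved context, which suggests the pipeline is returning the prompt together with the completion. A hedged sketch of pulling out only the final answer with plain string handling; the `"Helpful Answer:"` marker is an assumption about how the prompt ends (the output above is truncated before that point), and the toy text is made up:

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the last marker, or the full text if it is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()

# Made-up output shaped like a prompt followed by the model's completion.
toy_output = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Title: Tour Du Mont Blanc Hiking Guide 2025.\n\n"
    "Question: How many miles is the trail?\n"
    "Helpful Answer: About 105 miles."
)
answer = extract_answer(toy_output)
print(answer)
```

If the marker differs in the installed LangChain version, passing a different `marker` value is enough to adapt the helper.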

In [None]:
question = "As an adventurous young adult, what are 3 destinations I should consider visiting? They should be easy to travel to from the west coast of the United States."
result = qa_chain.invoke({"query": question})

print("Answer:", result['result'])
print("\nSource Documents:", result['source_documents'])

In [None]:
# RAG lookup about a previously searched book term
question = "As an adventurous young adult, what are three things I should know about the Tour du Mont Blanc in Europe?"
result = qa_chain.invoke({"query": question})

print("Answer:", result['result'])
print("\nSource Documents:", result['source_documents'])

In [None]:
TMB = "How many miles is the tour du mount blanc ? "

messages = [
    {"role": "user", "content": TMB},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=240)
generated_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])

# Step 6: Print the generated output.
print("Generated Words as a String:")
print(generated_text)

Generated Words as a String:
The **Tour du Mont Blanc** is a popular long-distance hiking route that circles the Mont Blanc massif in the Alps, spanning across France, Italy, and Switzerland.

The total distance of the **Tour du Mont Blanc** is approximately **110 miles (177 kilometers)**.

This route typically takes about **5 to 7 days** to complete, depending on the hiker's pace and the specific path chosen (the standard route is about 110 miles, though variations exist). The route includes several key passes and scenic mountain trails, and it's one of the most iconic hikes in the world.

So, to answer your question directly:  
👉 **The Tour du Mont Blanc is about 110 miles (177 km) long.**<|im_end|>
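
The model's direct answer (110 miles, 177 km) differs from the figure in the retrieved guidebook description (170 kilometers, 105 miles). A quick arithmetic check, as a sketch, shows that each pair is an internally consistent conversion, so the disagreement is about the trail length itself and not the unit conversion:

```python
KM_PER_MILE = 1.609344  # exact, by definition of the international mile

def km_to_miles(km):
    """Convert kilometers to miles."""
    return km / KM_PER_MILE

for label, km in [("retrieved guidebook", 170), ("ungrounded model answer", 177)]:
    print(f"{label}: {km} km = {km_to_miles(km):.1f} miles")
```

Comparing the grounded and ungrounded answers this way is one simple check on whether the RAG context changes what the model reports.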
