# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 04: LLM Inference</span>
    
## 🗒️ This notebook is divided into the following sections:
1. Connect to the Hopsworks AI Lakehouse.
2. Retrieve the feature view and predictor.
3. Load the LLM.
4. Configure LangChain and the context manager.
5. Ask questions.

In [1]:
!pip install -r requirements.txt --quiet

In [2]:
import joblib

from functions.llm_chain import (
    load_model, 
    get_llm_chain, 
    generate_response,
)

## Connect to Hopsworks

In [3]:
# connect to Hopsworks

import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

2024-09-19 07:51:55,873 INFO: Python Engine initialized.

Logged in to project, explore it here https://hopsworks0.logicalclocks.com/p/119


## Get Departures Feature View and Predictor

In [4]:
# Retrieve the 'departures_agg' feature view
feature_view = fs.get_feature_view(
    name='departures_agg',
    version=1,
)

# Initialize batch scoring
feature_view.init_batch_scoring(1)

In [7]:
# Retrieve model serving
ms = project.get_model_serving()

# Retrieve the late-departure predictor deployment
model_deployment = ms.get_deployment("latedeparturemodel")

## ⬇️ LLM Loading

In [8]:
# Load the LLM and its corresponding tokenizer.
model_llm, tokenizer = load_model()




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


2024-09-19 07:52:01,785 INFO: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]




In [9]:
!nvidia-smi

Thu Sep 19 07:52:14 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla V100S-PCIE-32GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   43C    P0             41W /  250W |   28964MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
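
The `nvidia-smi` table above reports 28964 MiB of 32768 MiB in use once the model is loaded. As a quick sanity check, the arithmetic below works out how much headroom that leaves for activations and the generation cache. The helper name `gpu_headroom` is illustrative and not part of this notebook.

```python
def gpu_headroom(used_mib: int, total_mib: int) -> tuple:
    """Return (free MiB, fraction of memory in use)."""
    free_mib = total_mib - used_mib
    used_fraction = used_mib / total_mib
    return free_mib, used_fraction


# Values copied from the Memory-Usage column of the nvidia-smi output.
free_mib, used_fraction = gpu_headroom(used_mib=28964, total_mib=32768)
print(f"{free_mib} MiB free, {used_fraction:.1%} in use")
# prints "3804 MiB free, 88.4% in use"
```

That is close to the 90% share mentioned in the loading log, and it leaves under 4 GiB for everything that happens at inference time, so long prompts are likely to be the first thing to run out of memory.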
                                                

## ⛓️ LangChain

In [10]:
# Create and configure a language model chain.
llm_chain = get_llm_chain(
    tokenizer,
    model_llm,
)

## 🧬 Model Inference


In [11]:
QUESTION = "Hi! who are you?"

response = generate_response(
    QUESTION,
    feature_view,
    model_deployment,
    model_llm,
    tokenizer,
    llm_chain,
    verbose=True,
)

print(response)

🗓️ Today's date: Thursday, 2024-09-19
📖 

Hello! I am a route planner assistant that can help you with analyzing historical public transport departures in Stockholm. I can provide insights on the frequency and number of late departures, helping you make informed decisions about your travel plans.
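
Every question in this section passes the same five resources to `generate_response`. A minimal sketch of binding them once with a small closure follows. `make_asker` and the stand-in `fake_generate` are hypothetical names used only for illustration, and the sketch assumes `generate_response` keeps the positional order used in the cells here.

```python
def make_asker(generate_fn, *resources, **defaults):
    """Bind shared resources so callers only supply the question."""
    def ask(question, **overrides):
        # Per-call keyword arguments take precedence over the bound defaults.
        options = {**defaults, **overrides}
        return generate_fn(question, *resources, **options)
    return ask


# Stand-in for generate_response so the sketch runs without a GPU.
def fake_generate(question, feature_view, model_deployment, model_llm,
                  tokenizer, llm_chain, verbose=False):
    return f"(verbose={verbose}) {question}"


ask = make_asker(fake_generate, "fv", "deployment", "llm", "tok", "chain", verbose=True)
print(ask("Hi! who are you?"))

# In this notebook the real binding would look like:
# ask = make_asker(generate_response, feature_view, model_deployment,
#                  model_llm, tokenizer, llm_chain, verbose=True)
```

Keeping the explicit calls, as the cells here do, is equally valid; the wrapper only reduces repetition when many questions are asked in a row.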


In [12]:
QUESTION = "Are there expected late departures today?"

response = generate_response(
    QUESTION,
    feature_view,
    model_deployment,
    model_llm,
    tokenizer,
    llm_chain,
    verbose=True,
)

print(response)

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (0.76s) 
Scheduled time: 2024-09-19
Departure key: 9295-2024-09-19T05:14:45+02:00
🗓️ Today's date: Thursday, 2024-09-19
📖 {'lateness_probability': 0.00020051002502441406, 'scheduled_time': '2024-09-19'}

Based on historical data, late departures are quite rare today. However, it's always a good idea to plan for some extra time, just in case.
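
The context line above carries a `lateness_probability` of roughly 0.0002, which the LLM paraphrases as "quite rare". To make that number easier to read, the sketch below converts it to a percentage and an odds-style figure. The function name `describe_probability` is an illustrative assumption; nothing in `functions.llm_chain` is known to do this.

```python
def describe_probability(p: float) -> str:
    """Render a probability as a percentage plus 'about 1 in N' odds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p == 0.0:
        return "0.000%"
    return f"{p:.3%} (about 1 in {round(1 / p):,})"


# Value copied from the verbose context printed by the cell above.
print(describe_probability(0.00020051002502441406))
# prints "0.020% (about 1 in 4,987)"
```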


In [None]:
QUESTION = "Were there expected late departures yesterday?"

response = generate_response(
    QUESTION,
    feature_view,
    model_deployment,
    model_llm,
    tokenizer,
    llm_chain,
    verbose=True,
)

print(response)

In [13]:
QUESTION = "How many late departures were there from 2024-09-10 to 2024-09-19?"

response = generate_response(
    QUESTION,
    feature_view,
    model_deployment,
    model_llm,
    tokenizer,
    llm_chain,
    verbose=True,
)

print(response)

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (0.66s) 
🗓️ Today's date: Thursday, 2024-09-19
📖 Departures information between 2024-09-10 and 2024-09-19:
Date: 2024-09-18 00:13:00+00:00; Expected: 2024-09-18 00:15:09+00:00; Number of issues: 0.0; Number of late departures: 0.0; Number of deviations: 0.0; Deviations severity: 0.0;
Date: 2024-09-18 00:13:45+00:00; Expected: 2024-09-18 01:38:28+00:00; Number of issues: 0.0; Number of late departures: 0.0; Number of deviations: 0.0; Deviations severity: 0.0;
Date: 2024-09-18 00:53:52+00:00; Expected: 2024-09-18 02:18:18+00:00; Number of issues: 9.0; Number of late departures: 0.0; Number of deviations: 0.0; Deviations severity: 0.0;
Date: 2024-09-18 02:01:00+00:00; Expected: 2024-09-18 02:08:42+00:00; Number of issues: 5.0; Number of late departures: 0.0; Number of deviations: 1.0; Deviations severity: 7.0;
Date: 2024-09-18 02:16:00+00:00; Expected: 2024-09-18 02:16:00+00:00; Number of issues: 0.0; Number of l

OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 84.19 MiB is free. Process 1567720 has 31.65 GiB memory in use. Of the allocated memory 30.84 GiB is allocated by PyTorch, and 453.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
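
The date-range query above returns one context line per departure, and the run ends with a CUDA out-of-memory error. One plausible cause is that the resulting prompt is too long for the GPU memory left after loading the model. A possible mitigation, sketched below, is to cap the context at a fixed budget before it reaches the LLM. This is an assumption about where the problem lies, not a description of how `generate_response` works; `truncate_context` and the character budgets are hypothetical.

```python
def truncate_context(context: str, max_chars: int = 2000) -> str:
    """Keep whole lines from the start of `context` until the budget is spent."""
    lines = context.splitlines()
    kept, used = [], 0
    for line in lines:
        cost = len(line) + 1  # +1 for the newline that joins lines
        if used + cost > max_chars:
            # Tell the model that rows were dropped instead of cutting silently.
            kept.append(f"[truncated: {len(lines) - len(kept)} more lines]")
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


# Synthetic rows shaped loosely like the context printed above.
rows = [f"Date: 2024-09-18 00:{m:02d}:00+00:00; Number of late departures: 0.0;"
        for m in range(60)]
print(truncate_context("\n".join(rows), max_chars=300))
```

Aggregating the rows per day before building the prompt would shorten the context without dropping information and may be the better choice for counting questions like this one. The error message itself also suggests setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, which targets memory fragmentation rather than prompt length.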

In [None]:
QUESTION = "How many departure issues occurred last month?"

response = generate_response(
    QUESTION, 
    feature_view, 
    model_deployment,
    model_llm,
    tokenizer,
    llm_chain,
    verbose=True,
)

print(response)

In [None]:
QUESTION = "How many departure deviations were planned for last month?"

response = generate_response(
    QUESTION, 
    feature_view, 
    model_deployment,
    model_llm,
    tokenizer,
    llm_chain,
    verbose=True,
)

print(response)

In [None]:
QUESTION = "How long will it take me to commute if I leave now?"

response = generate_response(
    QUESTION,
    feature_view,
    model_deployment,
    model_llm,
    tokenizer,
    llm_chain,
    verbose=True,
)

print(response)

In [None]:
QUESTION = "At what time should I leave if I want to reach my office before 10:00?"

response = generate_response(
    QUESTION,
    feature_view,
    model_deployment,
    model_llm,
    tokenizer,
    llm_chain,
    verbose=True,
)

print(response)