Merged
4 changes: 3 additions & 1 deletion README.md
@@ -42,14 +42,16 @@ In order to understand the tutorials you need to be familiar with general concep
- [Iris](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/iris): Classify iris flower species.
- [Loan Approval](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/loan_approval): Predict loan approvals.
- Advanced Tutorials:
- [Air Quality](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/air_quality): Predict the Air Quality value (PM2.5) in Europe and USA using weather features and air quality features of the previous days.
- [Air Quality](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/air_quality): Creating an air quality AI assistant that displays and explains air quality indicators for specific dates or periods, using Function Calling for LLMs and a RAG approach without a vector database.
- [Bitcoin](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/bitcoin): Predict Bitcoin price using timeseries features and tweets sentiment analysis.
- [Citibike](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/citibike): Predict the number of citibike users on each citibike station in New York City.
- [Credit Scores](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/credit_scores): Predict clients' repayment abilities.
- [Electricity](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/electricity): Predict the electricity prices in several Swedish cities based on weather conditions, previous prices, and Swedish holidays.
- [NYC Taxi Fares](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/nyc_taxi_fares): Predict the fare amount for a taxi ride in New York City given the pickup and dropoff locations.
- [Recommender System](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/recommender-system): Build a recommender system for fashion items.
- [TimeSeries](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/timeseries): Timeseries price prediction.
- [LLM PDF](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/llm_pdfs): An AI assistant that utilizes a Retrieval-Augmented Generation (RAG) system to provide accurate answers to user questions by retrieving relevant context from PDF documents.
- [Fraud Cheque Detection](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/fraud_cheque_detection): Building an AI assistant that detects fraudulent scanned cheque images and generates explanations for the fraud classification, using a fine-tuned open-source LLM.
- [Keras model and Sklearn Transformation Functions with Hopsworks Model Registry](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/transformation_functions/keras): How to register Sklearn Transformation Functions and a Keras model in the Hopsworks Model Registry, then retrieve them and use them in training and inference pipelines.
- [PyTorch model and Sklearn Transformation Functions with Hopsworks Model Registry](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/transformation_functions/pytorch): How to register Sklearn Transformation Functions and a PyTorch model in the Hopsworks Model Registry, then retrieve them and use them in training and inference pipelines.
- [Sklearn Transformation Functions With Hopsworks Model Registry](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/advanced_tutorials/transformation_functions/sklearn): How to register an sklearn.pipeline with transformation functions and a classifier in the Hopsworks Model Registry and use it in training and inference pipelines.
285 changes: 285 additions & 0 deletions advanced_tutorials/llm_pdfs/1_feature_backfill.ipynb
@@ -0,0 +1,285 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "82622ee3",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27\">📝 Imports </span>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ade7fe1f",
"metadata": {},
"outputs": [],
"source": [
"!pip install -r requirements.txt -q"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ab771e2",
"metadata": {},
"outputs": [],
"source": [
"import PyPDF2\n",
"import pandas as pd\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"from functions.pdf_preprocess import (\n",
" download_files_to_folder, \n",
" process_pdf_file,\n",
")\n",
"from functions.text_preprocess import process_text_data\n",
"import config\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"id": "7e8f1796",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27\">💾 Download files from Google Drive </span>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea8c756e",
"metadata": {},
"outputs": [],
"source": [
"# Call the function to download files\n",
"new_files = download_files_to_folder(\n",
" config.FOLDER_ID, \n",
" config.DOWNLOAD_PATH,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f783e27e",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27\">🧬 Text Extraction </span>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b3b6715",
"metadata": {},
"outputs": [],
"source": [
"# Initialize an empty list\n",
"document_text = []\n",
"\n",
"for file in new_files:\n",
" process_pdf_file(\n",
" file, \n",
" document_text, \n",
" config.DOWNLOAD_PATH,\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "348b723e",
"metadata": {},
"outputs": [],
"source": [
"# Create a DataFrame\n",
"columns = [\"file_name\", \"file_link\", \"page_number\", \"text\"]\n",
"df_text = pd.DataFrame(\n",
" data=document_text,\n",
" columns=columns,\n",
")\n",
"# Display the DataFrame\n",
"df_text"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62a70763",
"metadata": {},
"outputs": [],
"source": [
"# Process text data using the process_text_data function\n",
"df_text_processed = process_text_data(df_text)\n",
"\n",
"# Display the processed DataFrame\n",
"df_text_processed"
]
},
{
"cell_type": "markdown",
"id": "10f9ea36",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27\">⚙️ Embeddings Creation </span>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9805c689",
"metadata": {},
"outputs": [],
"source": [
"# Load the SentenceTransformer model\n",
"model = SentenceTransformer(\n",
" config.MODEL_SENTENCE_TRANSFORMER,\n",
").to(config.DEVICE)\n",
"model.device"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1b7a89a",
"metadata": {},
"outputs": [],
"source": [
"# Generate embeddings for the 'text' column using the SentenceTransformer model\n",
"df_text_processed['embeddings'] = pd.Series(\n",
" model.encode(df_text_processed['text']).tolist(),\n",
")\n",
"\n",
"# Create a new column 'context_id' with values ranging from 0 to the number of rows in the DataFrame\n",
"df_text_processed['context_id'] = [*range(df_text_processed.shape[0])]\n",
"\n",
"# Display the resulting DataFrame with the added 'embeddings' and 'context_id' columns\n",
"df_text_processed"
]
},
{
"cell_type": "markdown",
"id": "d2bced31",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27;\"> 🔮 Connecting to Hopsworks Feature Store </span>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7caf764d",
"metadata": {},
"outputs": [],
"source": [
"import hopsworks\n",
"\n",
"project = hopsworks.login()\n",
"\n",
"fs = project.get_feature_store() "
]
},
{
"cell_type": "markdown",
"id": "0ed9ac69",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27;\"> 🪄 Feature Group Creation </span>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f5e486b",
"metadata": {},
"outputs": [],
"source": [
"from hsfs import embedding\n",
"\n",
"# Create the Embedding Index\n",
"emb = embedding.EmbeddingIndex()\n",
"\n",
"emb.add_embedding(\n",
" \"embeddings\", \n",
" model.get_sentence_embedding_dimension(),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e32b548",
"metadata": {},
"outputs": [],
"source": [
"# Get or create the 'documents_fg' feature group\n",
"documents_fg = fs.get_or_create_feature_group(\n",
" name=\"documents_fg\",\n",
" embedding_index=emb,\n",
" primary_key=['context_id'],\n",
" version=1,\n",
" description='Information from various files, presenting details like file names, source links, and structured text excerpts from different pages and paragraphs.',\n",
" online_enabled=True,\n",
")\n",
"\n",
"documents_fg.insert(df_text_processed)"
]
},
{
"cell_type": "markdown",
"id": "d39a9ed6",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27;\">🪄 Feature View Creation </span>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a7bc2f0",
"metadata": {},
"outputs": [],
"source": [
"# Get or create the 'documents' feature view\n",
"feature_view = fs.get_or_create_feature_view(\n",
" name=\"documents\",\n",
" version=1,\n",
" description='Chunked context for RAG system',\n",
" query=documents_fg.select([\"file_name\", \"file_link\", \"page_number\", \"paragraph\", \"text\"]),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "708b9a5f",
"metadata": {},
"source": [
"---"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
69 changes: 69 additions & 0 deletions advanced_tutorials/llm_pdfs/1a_feature_pipeline.py
@@ -0,0 +1,69 @@
import PyPDF2
import pandas as pd
from sentence_transformers import SentenceTransformer

from functions.pdf_preprocess import download_files_to_folder, process_pdf_file
from functions.text_preprocess import process_text_data
import config

import hopsworks

def pipeline():
    # Call the function to download files
    new_files = download_files_to_folder(
        config.FOLDER_ID,
        config.DOWNLOAD_PATH,
    )

    # Nothing new to process: stop before loading the model or logging in
    if len(new_files) == 0:
        print('⛳️ Your folder is up to date!')
        return

    # Initialize an empty list
    document_text = []

    for file in new_files:
        process_pdf_file(
            file,
            document_text,
            config.DOWNLOAD_PATH,
        )

    # Create a DataFrame (same columns as in 1_feature_backfill.ipynb,
    # so the rows match the 'documents_fg' schema)
    columns = ["file_name", "file_link", "page_number", "text"]
    df_text = pd.DataFrame(
        data=document_text,
        columns=columns,
    )

    # Process text data using the process_text_data function
    df_text_processed = process_text_data(df_text)

    # Retrieve a SentenceTransformer
    model = SentenceTransformer(
        config.MODEL_SENTENCE_TRANSFORMER,
    ).to(config.DEVICE)

    # Generate embeddings for the 'text' column using the SentenceTransformer model
    df_text_processed['embeddings'] = pd.Series(
        model.encode(df_text_processed['text']).tolist(),
    )

    # Create a new column 'context_id' with values ranging from 0 to the number of rows in the DataFrame
    df_text_processed['context_id'] = [*range(df_text_processed.shape[0])]


    project = hopsworks.login()

    fs = project.get_feature_store()

    # Retrieve the feature group created by the backfill notebook
    documents_fg = fs.get_feature_group(
        name="documents_fg",
        version=1,
    )

    documents_fg.insert(df_text_processed)
    return

if __name__ == '__main__':
    pipeline()
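
Note on `context_id`: `documents_fg` uses `context_id` as its primary key, and both the backfill notebook and this pipeline number rows starting from 0, so rows inserted by a later pipeline run would reuse keys written earlier. A minimal sketch of one way to avoid that, assuming the caller can look up the largest `context_id` already stored (how to obtain it is left open here). `next_context_ids` and `last_context_id` are hypothetical names for illustration, not part of the tutorial code:

```python
def next_context_ids(n_new_rows, last_context_id=None):
    """Return n_new_rows consecutive ids that continue after last_context_id.

    last_context_id is the largest id already stored, or None if nothing
    has been stored yet (hypothetical input; obtaining it is up to the caller).
    """
    if n_new_rows < 0:
        raise ValueError("n_new_rows must be non-negative")
    start = 0 if last_context_id is None else last_context_id + 1
    return list(range(start, start + n_new_rows))


# First run: nothing stored yet, so ids start at 0
print(next_context_ids(3))                      # [0, 1, 2]
# Later run: ids 0..9 already exist, so new rows continue from 10
print(next_context_ids(2, last_context_id=9))   # [10, 11]
```

In the pipeline this could stand in for the `range(df_text_processed.shape[0])` line, for example `df_text_processed['context_id'] = next_context_ids(len(df_text_processed), last_context_id)`.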