### SAP Machine Learning Embedding in OpenAI
##### Author: Sergiu Iatco. July, 2023
https://people.sap.com/iatco.sergiu <br>
https://www.linkedin.com/in/sergiuiatco/ <br>

#### Resources:
https://pypi.org/project/gpt-index/ <br>
https://github.com/jerryjliu/llama_index/blob/main/examples/langchain_demo/LangchainDemo.ipynb <br>
https://github.com/jerryjliu/llama_index/tree/main/examples <br>
https://github.com/jerryjliu/llama_index/blob/main/examples/vector_indices/SimpleIndexDemo-ChatGPT.ipynb <br>
https://gpt-index.readthedocs.io/en/stable/reference/service_context.html <br>
https://gpt-index.readthedocs.io/en/stable/reference/service_context/embeddings.html <br>
https://gpt-index.readthedocs.io/en/stable/getting_started/starter_example.html (store and load) <br>
https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html <br>

https://blog.streamlit.io/how-to-build-an-llm-powered-chatbot-with-streamlit/ <br>
https://github.com/dataprofessor/langchain-text-summarization <br>
https://github.com/dataprofessor <br>

Blogs: <br>
https://blogs.sap.com/2022/11/07/sap-community-call-sap-hana-cloud-machine-learning-challenge-i-quit-how-to-prevent-employee-churn/ <br>
https://blogs.sap.com/2022/11/28/i-quit-how-to-predict-employee-churn-sap-hana-cloud-machine-learning-challenge/ <br>
https://blogs.sap.com/2022/12/22/sap-hana-cloud-machine-learning-challenge-2022-the-winners-are/ <br>
"I quit!" - How to prevent employee churn | SAP Community Call | Kick-off <br>
https://www.youtube.com/watch?v=pgV_NFdokZ4 <br>
"How to prevent Employee Churn using SAP HANA Cloud | SAP Community Call | Solutions" <br>
https://www.youtube.com/watch?v=ul5ZqnB3qVw <br>

In [1]:
# !pip install llama-index

In [2]:
import os
from IPython.core.debugger import set_trace
# os.environ["OPENAI_API_KEY"] = '<OPENAI_API_KEY>'
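
The commented-out line above would hard-code the key in the notebook. A minimal sketch of an alternative, assuming the key has already been exported in the shell that started Jupyter; the helper name `require_env` is hypothetical and not part of llama_index or openai:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value

# Usage (uncomment once the variable is exported):
# api_key = require_env("OPENAI_API_KEY")
```

Failing early this way avoids a less obvious authentication error later, when the index is created.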

In [3]:
import os
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader
from llama_index import StorageContext, load_index_from_storage
import shutil
import pathlib

import logging
import sys

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.basicConfig(stream=sys.stdout, level=logging.INFO)

logging.basicConfig(stream=sys.stdout, level=logging.CRITICAL)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# There are five standard levels for logging in Python, listed here in increasing order of severity:
# DEBUG: Detailed information, typically of interest only when diagnosing problems.
# INFO: Confirmation that things are working as expected.
# WARNING: An indication that something unexpected happened or indicative of some problem in the near future (e.g., ‘disk space low’). The software is still working as expected.
# ERROR: Due to a more serious problem, the software has not been able to perform some function.
# CRITICAL: A very serious error, indicating that the program itself may be unable to continue running.

class llama_context():
    def __init__(self, path=None):
        
        # Fall back to the current working directory when no path is given
        self.path = path if path is not None else ''
        
        perisit_sub_dir = "storage"
        self.perisit_dir = os.path.join(self.path, perisit_sub_dir)
        if not os.path.exists(self.perisit_dir):
            os.makedirs(self.perisit_dir)
        data_sub_dir = "data"
        self.data_dir = os.path.join(self.path, data_sub_dir)
        self.data_dir_counter = 0
        
        self.cost_model_ada = "ada" # https://openai.com/pricing
        self.cost_model_davinci = "davinci" # https://openai.com/pricing
        self.price_ada_1k_tokens = 0.0004
        self.price_davinci_1k_tokens = 0.03 

        
    def load_data(self):
        self.documents = SimpleDirectoryReader(self.data_dir).load_data()
        print(f"Documents loaded: {len(self.documents)}.")
    def create_vector_store(self):
        self.index = GPTVectorStoreIndex.from_documents(self.documents)
        print("GPTVectorStoreIndex complete.")
    def save_index(self):
        self.index.storage_context.persist(persist_dir=self.perisit_dir)
        print(f"Index saved in path {self.perisit_dir}.")
    def load_index(self):
        storage_context = StorageContext.from_defaults(persist_dir=self.perisit_dir)
        self.index = load_index_from_storage(storage_context)
    def start_query_engine(self):
        self.query_engine = self.index.as_query_engine()
        print("Query_engine started.")
    def post_question(self, question, sleep = None):
        # Pause in seconds (trial: 20s); stored on the instance but not applied here
        self.sleep = 0 if sleep is None else sleep
        self.response_cls = self.query_engine.query(question)
        self.response = self.response_cls.response

    def del_data_dir(self):
        path = self.data_dir
        try:
            shutil.rmtree(path)
            print(f"{path} deleted successfully!")
        except OSError as error:
            print(f"Error deleting {path}: {error}")

    def copy_file_to_data_dir(self, file_extension ='.txt', verbose = 0):

        path_from = self.path
        path_to = self.data_dir

        if not os.path.exists(path_to):
            os.makedirs(path_to)

        for filename in os.listdir(path_from):
            if filename.endswith(file_extension):
                source_path = os.path.join(path_from, filename)
                dest_path = os.path.join(path_to, filename)
                shutil.copy(source_path, dest_path)
                if verbose == 1:
                    print(f"File {filename} copied successfully!")
    
        path_to_lib = pathlib.Path(path_to)
        path_to_lib_files = path_to_lib.glob(f"*{file_extension}")
        print(f"Files {len(list(path_to_lib_files))} copied in {path_to}.")
 
    def copy_path_from_to_data_dir(self, path_from, file_extension ='.txt', verbose = 0):

        path_to = self.data_dir # default data folder for llama
        start_counter = self.data_dir_counter
        
        if not os.path.exists(path_to):
            os.makedirs(path_to)

        padding_n = 5
        path_from_lib = pathlib.Path(path_from)
        path_from_lib_files = path_from_lib.glob(f"**/*{file_extension}")

        files_copied_n = 0
        counter = None
        for counter, file in enumerate(path_from_lib_files, start_counter):
            filename_path = os.path.split(file)[0] # path only
            filename = os.path.split(file)[1] # filename only
            filename_with_index = f'{str(counter).zfill(padding_n)}_{filename}'
            file_to_data_dir = os.path.join(path_to, filename_with_index)
            shutil.copy(file, file_to_data_dir)
            
            if os.path.exists(file_to_data_dir):
                files_copied_n += 1
                if verbose == 1:
                    print(f"File {filename} -> copied successfully!")
            else:
                if verbose == 1:
                    print(f"File {filename} was not copied!")
        
        # counter stays None when no file matched, so the running index is left unchanged
        if counter is not None:
            self.data_dir_counter = counter + 1 # continue numbering from the last file copied
        
        print(f"Files: {files_copied_n} copied to folder: {path_to}!")

    def estimate_tokens(self, text):
        words = text.split()

        num_words = int(len(words))
        tokens = int(( num_words / 0.75 ))
        tokens_1k = tokens / 1000
        cost_ada = tokens_1k * self.price_ada_1k_tokens
        cost_davinci = tokens_1k * self.price_davinci_1k_tokens
        return tokens, cost_ada, cost_davinci
    
    def estimate_cost(self):
        total_tokens = 0
        total_cost_ada = 0
        total_cost_davinci = 0
        costs_rounding = 8
        
        for doc in self.documents:
            text = doc.get_text()
            tokens, cost_ada, cost_davinci = self.estimate_tokens(text)
            total_tokens += tokens
            
            total_cost_ada += cost_ada
            total_cost_ada = round(total_cost_ada, costs_rounding)
            
            total_cost_davinci += cost_davinci
            total_cost_davinci = round(total_cost_davinci, costs_rounding)
            
        self.total_tokens = total_tokens
        self.total_cost_ada = total_cost_ada
        self.total_cost_davinci = total_cost_davinci
        print(f"Total tokens: {self.total_tokens}")
        print(f"Total estimated costs with model {self.cost_model_ada }: ${self.total_cost_ada}")
        print(f"Total estimated costs with model {self.cost_model_davinci }: ${self.total_cost_davinci}")
        

In [4]:
from llama_index import download_loader
YoutubeTranscriptReader = download_loader("YoutubeTranscriptReader")
loader = YoutubeTranscriptReader()

In [5]:
ytb_name = 'ytb_hana_ml_call_20221128'
ytb_link = 'https://www.youtube.com/watch?v=pgV_NFdokZ4'

In [6]:
ytb_doc = loader.load_data(ytlinks=[ytb_link])
ytb_content = ytb_doc[0].text
print(ytb_content)

and we're live
perfect so please take it away hi
everyone and a warm welcome to today's
sap Community call my name is Susan and
I am proud of the sap Hana product
management team and I will be your host
today and I'm really really excited
about the topic of today's call so we
will first get an introduction to
machine learning in sap Hana cloud and
afterwards we'll finally kick off our
sap Hana 12 machine learning challenge
so thanks already for joining here and
your big interest in the challenge so
let me introduce the most important
people today we're having in the back
end with me Savage whose competency for
data science and machine learning at sap
Andreas Foster our machine learning
expert in Sap's Global Center of
Excellence Yannick sharp our customer
advisor for machine learning and my
fellow product manager Christoph Morgan
senior director of product management
sap Hana predictive and machine learning
so our call will be 60 Minutes long and
the recording will be available under
t

In [7]:
# doc_ytb_hana_ml_call_20230126 = loader.load_data(ytlinks=[ytb_hana_ml_call_20230126])
# doc_ytb_hana_ml_call_20230126

In [8]:
import datetime

def time_now():
    now = datetime.datetime.now()
    formatted = now.strftime('%Y-%m-%d %H:%M:%S')
#     print(formatted)

time_now()

In [9]:
# logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# path_llama = "llama_mvp"
path_llama = 'llama' + '_' + ytb_name
path_from = os.path.join(path_llama, "source") # portable across operating systems
lct = llama_context(path=path_llama)

display(lct.path)
display(lct.data_dir)
display(lct.perisit_dir)
display(path_from)

'llama_ytb_hana_ml_call_20221128'

'llama_ytb_hana_ml_call_20221128\\data'

'llama_ytb_hana_ml_call_20221128\\storage'

'llama_ytb_hana_ml_call_20221128\\source'

In [10]:
if not os.path.exists(path_from):
    os.makedirs(path_from)

In [11]:
filename = ytb_name + '.txt'
ytb_file = os.path.join(path_from, filename)
ytb_file

'llama_ytb_hana_ml_call_20221128\\source\\ytb_hana_ml_call_20221128.txt'

In [12]:
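# Write the transcript as UTF-8 so non-ASCII characters do not depend on the platform default encoding
with open(ytb_file, "w", encoding="utf-8") as file:
    file.write(ytb_content)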

In [13]:
%time
# Delete data directory
time_now()
run_create_save = True
if run_create_save:
    lct.del_data_dir()

CPU times: total: 0 ns
Wall time: 0 ns
llama_ytb_hana_ml_call_20221128\data deleted successfully!
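
The timing above reads 0 ns, most likely because `%time` on a line by itself is a line magic with no statement to time; the cell magic `%%time`, placed on the first line of a cell, times the whole cell. As a notebook-independent alternative, here is a small sketch built on `time.perf_counter`; the `timed` helper is hypothetical:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} s")

# Example: time a short pause
with timed("sleep 0.01"):
    time.sleep(0.01)
```

Wrapping a call such as `lct.create_vector_store()` in `with timed("create index"):` would then report how long the embedding step takes.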


In [14]:
%time
time_now()
# Copy files from source to data directory
run_create_save = True
if run_create_save:
#     path_from = "llama_mvp/source"
    lct.copy_path_from_to_data_dir(path_from) # default extension *.txt

CPU times: total: 0 ns
Wall time: 0 ns
Files: 1 copied to folder: llama_ytb_hana_ml_call_20221128\data!
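
The numeric prefix added by `copy_path_from_to_data_dir` keeps files that share a name in different subfolders from overwriting each other in the flat data folder. A self-contained sketch of that idea using `pathlib` and a temporary directory; the function name `copy_with_prefix` is hypothetical and the logic is a simplification of the class method, not a replacement for it:

```python
import shutil
import tempfile
from pathlib import Path

def copy_with_prefix(src_dir, dst_dir, extension=".txt", start=0, width=5):
    """Copy matching files recursively, prefixing each with a zero-padded counter."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() makes the numbering deterministic across runs
    files = sorted(Path(src_dir).glob(f"**/*{extension}"))
    for counter, file in enumerate(files, start):
        target = dst / f"{str(counter).zfill(width)}_{file.name}"
        shutil.copy(file, target)
        copied.append(target.name)
    return copied

# Two files with the same name in different subfolders
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source"
    (src / "a").mkdir(parents=True)
    (src / "b").mkdir(parents=True)
    (src / "a" / "notes.txt").write_text("first")
    (src / "b" / "notes.txt").write_text("second")
    print(copy_with_prefix(src, Path(tmp) / "data"))
```

Passing the last counter plus one as `start` on the next call continues the numbering, which is what `data_dir_counter` does in the class.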


In [15]:
vars(lct).keys()

dict_keys(['path', 'perisit_dir', 'data_dir', 'data_dir_counter', 'cost_model_ada', 'cost_model_davinci', 'price_ada_1k_tokens', 'price_davinci_1k_tokens'])

In [16]:
%time
time_now()
# Load documents
run_create_save = True
if run_create_save:
    lct.load_data()

CPU times: total: 0 ns
Wall time: 0 ns
Documents loaded: 1.


In [17]:
%time
time_now()
# Estimate costs
run_create_save = True
if run_create_save:
    lct.estimate_cost()

CPU times: total: 0 ns
Wall time: 0 ns
Total tokens: 8490
Total estimated costs with model ada: $0.003396
Total estimated costs with model davinci: $0.2547
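
The figures above follow from the rule of thumb in `estimate_tokens` (one token is about 0.75 words) and the per-1k prices hard-coded in the class. A short sketch that reproduces the arithmetic from the reported token count; the prices are the notebook's own constants and may not match current OpenAI pricing:

```python
PRICE_ADA_1K = 0.0004      # from llama_context.price_ada_1k_tokens
PRICE_DAVINCI_1K = 0.03    # from llama_context.price_davinci_1k_tokens

def words_to_tokens(num_words: int) -> int:
    """Rule of thumb used in the notebook: one token is about 0.75 words."""
    return int(num_words / 0.75)

def cost_for_tokens(tokens: int, price_per_1k: float) -> float:
    """Cost in USD for a token count at a given price per 1,000 tokens."""
    return round(tokens / 1000 * price_per_1k, 8)

total_tokens = 8490  # value printed by estimate_cost() above
print(cost_for_tokens(total_tokens, PRICE_ADA_1K))      # 0.003396
print(cost_for_tokens(total_tokens, PRICE_DAVINCI_1K))  # 0.2547
```

The estimate is based on a word count only, so the number of tokens actually billed can differ.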


In [18]:
# https://platform.openai.com/account/api-keys
%time
time_now()
# Creating the vector index calls the embedding API and consumes tokens
run_create_save = True
if run_create_save:
    lct.create_vector_store()

CPU times: total: 0 ns
Wall time: 0 ns
GPTVectorStoreIndex complete.


In [19]:
%time
time_now()
# Save index
run_create_save = True
if run_create_save:
    lct.save_index()

CPU times: total: 0 ns
Wall time: 0 ns
Index saved in path llama_ytb_hana_ml_call_20221128\storage.


In [20]:
%time
time_now()
# load_index() restores the index saved by save_index() from disk, so the documents do not need to be loaded and embedded again
run_load = True
if run_load:
    lct.load_index()

CPU times: total: 0 ns
Wall time: 0 ns
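
Because `llama_context.__init__` always creates the `storage` folder, the mere existence of that folder does not show whether an index was saved. A sketch of a guard that could stand in for the manual `run_create_save` and `run_load` flags, under the assumption that a non-empty storage folder means a persisted index; `has_persisted_index` is a hypothetical helper, not part of llama_index:

```python
import os
import tempfile

def has_persisted_index(persist_dir: str) -> bool:
    """Return True if persist_dir exists and contains at least one file."""
    if not os.path.isdir(persist_dir):
        return False
    with os.scandir(persist_dir) as entries:
        return any(entry.is_file() for entry in entries)

with tempfile.TemporaryDirectory() as tmp:
    print(has_persisted_index(tmp))  # False: the folder is empty
    # Placeholder file name, used only for this demonstration
    with open(os.path.join(tmp, "placeholder.json"), "w") as f:
        f.write("{}")
    print(has_persisted_index(tmp))  # True

# Intended use with the class above (not executed here):
# if has_persisted_index(lct.perisit_dir):
#     lct.load_index()
# else:
#     lct.load_data()
#     lct.create_vector_store()
#     lct.save_index()
```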


In [21]:
# help(lct.index.vector_store)

In [22]:
# dir(lct)

In [23]:
# help(lct)

In [24]:
# lct.__dict__

In [25]:
%time
time_now()
# Start query engine
lct.start_query_engine()

CPU times: total: 0 ns
Wall time: 0 ns
Query_engine started.


In [26]:
len(lct.documents)

1

In [27]:
%time
time_now()
question = "What is content about?"
lct.post_question(question)
print(lct.response)

CPU times: total: 0 ns
Wall time: 0 ns


Token indices sequence length is longer than the specified maximum sequence length for this model (1947 > 1024). Running this sequence through the model will result in indexing errors



The content is about a challenge to use SAP Hana Machine Learning to predict employee churn within the next 12 months. The challenge includes using the automated predictive library or the predictive analysis library available on the Hana Cloud. Participants are also encouraged to create an appealing presentation with a clear message and to consider the business context and a Persona in mind for whom they are building the solution.


In [28]:
question = "Who can participate?"
lct.post_question(question)
print(lct.response)


Anyone who is interested in participating in the Employee Churn Challenge can participate. The challenge is open to people from all over the world, and the global team of experts is available to provide support. Participants must have access to the Hana Cloud and Data Warehouse Cloud, either through a free trial or their own system.


In [29]:
question = "Who are the organizers?"
lct.post_question(question)
print(lct.response)


The organizers of this challenge are Sarah, Kristoff, Raymond, Andres, and Andreas.


In [30]:
question = "Extract all technical terms."
lct.post_question(question)
print(lct.response)


-Hana Cloud
-Predictive Analysis Library
-Random Forest
-Hyperparameter Tuning
-SQL Scripts
-Data Preparation
-Machine Learning
-Training Set
-Testing Set
-Confusion Matrix
-Random Forest Classifier
-Predictions
-Variables
-Model Storage
-Automated Predictive Library


In [31]:
question = "Extract all unique HANA terms. Do not repeat terms."
lct.post_question(question)
print(lct.response)


Hana Cloud, Hana system, Hana Cloud and Data Warehouse Cloud, Hana Machine Learning, SAP Hana Cloud
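
The response comes back as a single comma-separated string. A sketch of turning such a string into a Python list with exact, case-insensitive duplicates removed; the function name `split_terms` is hypothetical, and it does not merge near-duplicates such as "Hana Cloud" and "SAP Hana Cloud":

```python
def split_terms(response: str) -> list:
    """Split a comma-separated response into unique terms, keeping first-seen order."""
    seen = set()
    terms = []
    for raw in response.split(","):
        term = raw.strip()
        key = term.casefold()  # case-insensitive comparison
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms

text = "Hana Cloud, Hana system, hana cloud, SAP Hana Cloud"
print(split_terms(text))
```

In the notebook this could be applied as `split_terms(lct.response)`.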
