<h2>
Requirement: Demonstrate simple text summarization using LLMs hosted in a local environment </h2>
<br>
<pre>
    <b>1. Priority - Data security</b>
        a. Run the LLMs in a local environment <br>
    <b>2. Resource constraint - Only one GPU (H100) with 16 GB of memory is available</b>
        a. Use a quantized LLM, which occupies about 5 GB of GPU memory and rises to roughly 7 to 8 GB during inference<br>
    <b>3. Assumptions - Simple text passage of 2000 words maximum (~6000 tokens)</b>
        a. Keep it simple and ignore the complexities of different chain types
       
</pre>
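As a rough sanity check on the memory constraint above, the weight footprint of a model can be estimated from its parameter count and the bits stored per weight. The sketch below assumes a round figure of 7 billion parameters (an assumption, not a measured value); `estimate_weight_memory_gb` is a hypothetical helper. The difference between this estimate and the ~5 GB observed is presumably runtime overhead such as quantization metadata, layers kept in higher precision, and the CUDA context.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1e9

# Assumed round figure: about 7 billion parameters for a "7B" model.
N_PARAMS = 7e9

fp16_gb = estimate_weight_memory_gb(N_PARAMS, 16)  # half-precision weights
int4_gb = estimate_weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```

Under this assumption, 16-bit weights alone would nearly fill a 16 GB budget, which is why a 4-bit model is the practical choice here.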

In [None]:
#import ('pysqlite3')
#import sys
#sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

#----threads & basics
import os
from threading import Thread
import torch

#----Gradio front-end
#from IPython.display import Image, display, HTML
import gradio as gr

#----llms
from transformers import AutoTokenizer, pipeline,TextIteratorStreamer 
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

#----embeddings & vectordb
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter


<pre><b>               Defining the LLM and its parameters; loading the LLM (Mistral 7B Instruct) onto the GPU</b></pre>


In [None]:
global_input = " "
model_id = 'mistralai/Mistral-7B-Instruct-v0.2' #----LLM used
tokenizer = AutoTokenizer.from_pretrained(model_id)

#----streams text as the LLM generates it, instead of waiting for the entire output to return.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True) 


In [None]:

llm = pipeline(
                "text-generation",
                model = model_id,
                tokenizer = tokenizer, #----Mistral's tokenizer
                device_map = 'auto', #----places the model on the available GPU(s); spreads it across several if present
                do_sample = False, #----greedy decoding for deterministic output (stands in for temperature = 0.0)
                max_length = 8000, #----max total tokens (prompt + generated summary)
                repetition_penalty = 1.3, #----discourages repeated tokens;
                                       #----Llama-2 and Mistral can emit endless blank lines,
                                       #----and the repetition penalty helps to curb that.
                streamer = streamer,
                model_kwargs={"torch_dtype": torch.bfloat16, "load_in_4bit": True}
                                       #----weights are quantized to 4 bits while loading (needs bitsandbytes),
                                       #----which reduces GPU memory usage
                )
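
Since `max_length` in Hugging Face generation caps the prompt and the generated tokens together, the room left for the summary depends on the prompt size. A minimal sketch of that arithmetic, where `remaining_generation_budget` is a hypothetical helper and the token counts are illustrative:

```python
def remaining_generation_budget(prompt_tokens: int, max_length: int = 8000) -> int:
    """Tokens left for generated text when max_length caps prompt + output."""
    return max(0, max_length - prompt_tokens)

# Worst case from the stated assumption: a passage of about 6000 tokens.
budget_typical = remaining_generation_budget(6000)

# A prompt longer than the cap leaves no room for any output.
budget_overflow = remaining_generation_budget(8500)

print(budget_typical, budget_overflow)
```

In practice the instruction text and template tokens also count toward the prompt, so the real budget is slightly smaller than this estimate.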


<pre><b>
                             1. Input texts to chunks;
                             2. Chunks to vector embeddings
                             </b></pre>


In [None]:
def defining_vector_store():
    global global_input
    #----Input texts to chunks
    text_splitter = RecursiveCharacterTextSplitter(separators=['\n'], chunk_size=3000, chunk_overlap=300) 
    input_tokens = global_input.strip()
    docs = text_splitter.create_documents([input_tokens])
    
    for i, x in enumerate(docs): #----tag each chunk with a source id, which some LangChain chains expect
        x.metadata["source"] = f"{i}-pl"

    embedding = HuggingFaceEmbeddings()
    vectordb = Chroma.from_documents(documents=docs, embedding = embedding) #----Converts chunks to vector embeddings
    return vectordb
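
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter in plain Python. It is only a sketch: `sliding_chunks` is a hypothetical helper, and it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on the given separators. Both measure size in characters by default.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghij" * 2  # 20 characters
chunks = sliding_chunks(sample, chunk_size=8, chunk_overlap=2)
for c in chunks:
    print(c)
```

Each chunk repeats the last two characters of the previous one, so context that straddles a boundary is not lost entirely.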


<pre><b>
Once the user enters a text passage in the GUI, the summarize function is invoked. It performs the following actions:
                       1. Defines the summarization prompt in the Mistral instruction format
                       2. Starts a thread that runs the LLM, then relays the output stream

</b></pre>



In [None]:
def summarize(user_input):
    torch.cuda.empty_cache()

    global global_input
    global_input = user_input
    user_prompt = "Write a concise summary of the following text."  #----Defining the prompt for summarization
    #----Mistral instruct format: the instruction and the text both go inside [INST] ... [/INST];
    #----the tokenizer adds the <s> token itself, and </s> is left for the model to emit.
    template = """[INST] {user_prompt}
Text: `{text}` [/INST]"""
    prompt = PromptTemplate(input_variables=['text', 'user_prompt'], template=template)
    prompt_format = prompt.format(text=user_input, user_prompt=user_prompt)

    local_llm = HuggingFacePipeline(pipeline=llm)
    t = Thread(target=local_llm.invoke, args=(prompt_format,)) #----Thread runs the LLM so the stream can be read here
    t.start()
    out = ""
    for new_text in streamer: #----text is sent to the GUI as soon as it is generated, instead of waiting for the entire output
        out += new_text
        yield out
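
The streaming pattern used in `summarize` (a worker thread produces text while the caller yields the accumulated output) can be tried without a model. The sketch below is a stand-in, not the actual `TextIteratorStreamer` implementation: `fake_generate` and `stream_summary` are hypothetical names, and a plain `queue.Queue` plays the role of the streamer.

```python
from queue import Queue
from threading import Thread

_SENTINEL = object()  # marks the end of the stream

def fake_generate(words, out_queue):
    """Stand-in for the LLM call: pushes pieces of text as they are 'generated'."""
    for word in words:
        out_queue.put(word + " ")
    out_queue.put(_SENTINEL)

def stream_summary(words):
    """Yield the accumulated text after each new piece arrives."""
    out_queue = Queue()
    worker = Thread(target=fake_generate, args=(words, out_queue))
    worker.start()
    out = ""
    while True:
        piece = out_queue.get()  # blocks until the worker produces something
        if piece is _SENTINEL:
            break
        out += piece
        yield out
    worker.join()

partials = list(stream_summary(["Google", "offers", "search."]))
print(partials[-1])
```

Because each yielded value is the full text so far, a Gradio output box can simply be overwritten with the latest value on every update.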


<pre><b>                           Defining the front-end (GUI)        </b></pre>



In [None]:
with gr.Blocks() as demo:
    with gr.Tab("Summarization"):
        gr.Markdown(
        "<center><h1>Text Summarization </h1>\n</center>\n"
        )
        with gr.Column():
            with gr.Row():
                user_input = gr.Textbox(label="User input", lines=10, scale=2, placeholder="Enter your text...")
                model_output = gr.Textbox(label="Model output", lines=10, scale=2, interactive=False)

            submit = gr.Button(value="Submit")
        submit.click(summarize, inputs=user_input, outputs=model_output)

demo.queue(max_size=16).launch(share=True, server_port=5000) #----launches the GUI and prints the link to it

In [23]:
import pandas as pd
data = pd.read_csv("./topical_chat.csv")

In [24]:
grp = data.groupby('conversation_id')

#----print the messages of the first conversation
print(grp.get_group(1)['message'].values)



['Are you a fan of Google or Microsoft?'
 'Both are excellent technology they are helpful in many ways. For the security purpose both are super.'
 " I'm not  a huge fan of Google, but I use it a lot because I have to. I think they are a monopoly in some sense. "
 ' Google provides online related services and products, which includes online ads, search engine and cloud computing.'
 " Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives. "
 'Google is leading the alphabet subsidiary and will continue to be the Umbrella company for Alphabet internet interest.'
 'Did you know Google had hundreds of live goats to cut the grass in the past?'
 ' It is very interesting. Google provide "Chrome OS" which is a light weight OS. Google provided a lot of hardware mainly in 2010 to 2015. '
 'I like Google Chrome. Do you use it as well for your browser?'
 ' Yes.Google is the biggest search engine and Google service figure out top 100 website, including Youtu

In [None]:
from langchain_community.llms import VLLM

llm = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=128,
    vllm_kwargs={"quantization": "awq"},
)

print(llm.invoke("What is the capital of France ?"))