## **Install Required Libraries**

In [1]:
!pip install transformers accelerate
!pip install flask
!pip install flask-ngrok
!pip install pyngrok
!pip install streamlit



In [2]:
from transformers import AutoTokenizer
import transformers
import torch

# Load the fine-tuned model that you saved to the Hugging Face Hub
model = "karanstha/Llama-2-7b-chat-finetune"
tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
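Loading with `torch_dtype=torch.float16` halves the memory the weights need compared with float32. The back-of-the-envelope sketch below assumes a round 7 billion parameters (Llama 2 7B has slightly fewer) and counts the weights only; activations and the generation cache come on top. `estimate_weight_memory_gb` is a hypothetical helper, not part of transformers.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough size of the model weights alone, in gigabytes (1 GB = 1e9 bytes).

    Ignores activations, the KV cache and framework overhead, so real usage is higher.
    """
    return num_params * bytes_per_param / 1e9


# Assumption: a round 7e9 parameters for a "7B" model.
NUM_PARAMS = 7e9

# Bytes per parameter for a few common weight formats.
for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label:>8}: ~{estimate_weight_memory_gb(NUM_PARAMS, nbytes):.0f} GB")
```

At roughly 14 GB in float16 the weights are a tight fit on a 16 GB GPU; `device_map="auto"` lets Accelerate place some layers on the CPU if the GPU runs out of room.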


## **Test the result before integrating with Streamlit**

In [3]:
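prompt = "What is a large language model?"
# Wrap the question in the Llama 2 chat [INST] ... [/INST] format
sequences = pipeline(
    f'<s>[INST] {prompt} [/INST]',
    do_sample=True,          # sample instead of greedy decoding, so answers vary between runs
    top_k=10,                # sample only from the 10 most likely next tokens
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,          # total tokens, prompt included (max_new_tokens would cap only the reply)
)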
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Result: <s>[INST] What is a large language model? [/INST] A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on vast amounts of text data to generate language outputs that are coherent and natural-sounding. LLMs are typically trained using deep learning techniques, such as transformer architectures, and are designed to perform a wide range of natural language processing (NLP) tasks, such as language translation, text summarization, and text generation.

LLMs are often used in applications such as chatbots, language translation, and content generation. They are also used in research to explore the limits of language understanding and generation, and to develop new techniques for NLP.

Some examples of large language models include:

* BERT (Bidirectional Encoder Representations from Transformers): A popular LLM developed by Google that has achieved state-of-the-
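Because the cell sets `do_sample=True` with `top_k=10`, each new token is drawn at random from only the 10 highest-scoring candidates, which is why rerunning the cell gives a different answer. A minimal pure-Python sketch of top-k sampling on a made-up 5-token vocabulary follows; the logits and the `top_k_sample` helper are illustrative assumptions, whereas transformers applies this filtering to the model's real logits inside `generate`.

```python
import math
import random


def top_k_sample(logits: list[float], k: int, rng: random.Random) -> int:
    """Sample a token id from the k highest-scoring logits (illustrative sketch)."""
    # Keep only the indices of the k largest logits; every other token gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits (subtracting the max keeps exp() numerically stable).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # random.choices normalises the weights itself, so no explicit division is needed.
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.5, -1.0, 3.0]  # hypothetical scores for a 5-token vocabulary
rng = random.Random(0)
draws = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
# Only token ids 4 and 0 (the two highest logits) can ever be drawn with k=2.
print(sorted(set(draws)))
```

Setting `k=1` reduces this to greedy decoding; a larger `k` allows more varied, but potentially less focused, text.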


In [4]:
prompt="explain to me in a simple to understand way what the equation for finding the nth triangle number is and how it can be proved by using only high school level math. please give each step of a proof using LaTeX."
sequences = pipeline(
    f'<s>[INST] {prompt} [/INST]',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=400,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: <s>[INST] explain to me in a simple to understand way what the equation for finding the nth triangle number is and how it can be proved by using only high school level math. please give each step of a proof using LaTeX. [/INST] The triangle number n is equal to the sum of the first n positive integers.

$$\text{Triangle Number }n = \sum_{i=1}^{n}i$$

This equation can be proven by using mathematical induction.

Step 1: Prove the base case

$$\text{Triangle Number }1 = 1 + 1 + 1 = 3$$

$$\text{Triangle Number }2 = 2 + 2 + 1 = 5$$

$$\text{Triangle Number }3 = 3 + 3 + 2 = 7$$

$$\text{Triangle Number }4 = 4 + 4 + 3 = 9$$

$$\text{Triangle Number }5 = 5 + 5 + 4 = 14$$

Step 2: Prove the inductive step

$$\text{Triangle Number }n = \sum_{i=1}^{n-1}i + \sum_{i=1}^{n-1}i + 1$$

$$\sum_{i=1}^{n-1}i = \frac{n(n-1)}{2}$$

$$\sum_{i=1}^{n-1}i + 1 = \frac{n(n-1)}{2} + 1$$

$$\text{Triangle Number }n = \frac{n(n-1)}{2} + 1$$

Step 3: Show that the base case and the inductive step imply the
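The sampled answer above is not a valid proof: for example, it gives the first triangle number as 3 (it is 1) and ends its inductive step with $\frac{n(n-1)}{2} + 1$ instead of the closed form. For reference, a correct high-school-level proof by induction of $T_n = \frac{n(n+1)}{2}$, using the same definition as the answer's first line, looks like this:

```latex
\begin{align*}
  &\text{Definition:}        && T_n = \sum_{i=1}^{n} i = 1 + 2 + \cdots + n\\
  &\text{Claim:}             && T_n = \frac{n(n+1)}{2}\\
  &\text{Base case } (n=1):  && T_1 = 1 = \frac{1 \cdot 2}{2}\\
  &\text{Hypothesis:}        && T_k = \frac{k(k+1)}{2} \text{ for some } k \ge 1\\
  &\text{Inductive step:}    && T_{k+1} = T_k + (k+1) = \frac{k(k+1)}{2} + \frac{2(k+1)}{2} = \frac{(k+1)(k+2)}{2}
\end{align*}
```

The last expression is the claim with $n = k+1$, so the formula holds for every $n \ge 1$. As a sanity check, $T_4 = 1 + 2 + 3 + 4 = 10 = \frac{4 \cdot 5}{2}$.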

## **Integrate the Model into the Application Backend**

In [2]:
# Writing the chatbot backend to a file
with open("chatbot_backend.py", "w") as f:
    f.write("""
print("Loading chatbot backend")
import torch
from transformers import AutoTokenizer, pipeline

MODEL_NAME = "karanstha/Llama-2-7b-chat-finetune"

# Check whether a GPU is available
use_gpu = torch.cuda.is_available()
if use_gpu:
    print(f"GPU is available: {torch.cuda.get_device_name(0)}")
else:
    print("GPU is not available, using CPU.")

print("Loading tokenizer")
# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Initialize the text-generation pipeline: float16 on GPU, float32 on CPU
# (half precision is poorly supported on CPU)
print("Loading pipeline")
text_pipeline = pipeline(
    "text-generation",
    model=MODEL_NAME,
    torch_dtype=torch.float16 if use_gpu else torch.float32,
    device=0 if use_gpu else -1,
)

def generate_response(user_input):
    # Wrap the user's message in the Llama 2 [INST] ... [/INST] instruction format
    prompt = f'<s>[INST] {user_input} [/INST]'
    sequences = text_pipeline(
        prompt,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=500,  # total tokens, prompt included
    )
    # The pipeline returns the prompt plus the completion; keep only the text after '[/INST] '
    startSeq = sequences[0]['generated_text'].index('[/INST]') + len('[/INST]')+1
    response = sequences[0]['generated_text'][startSeq:]
    return response
    """)
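The index arithmetic in `generate_response` assumes exactly one space follows `[/INST]`, and `str.index` raises `ValueError` if the marker is missing. A more forgiving, standalone sketch of the same "keep only the answer" step is below; `extract_reply` is a hypothetical helper, and stripping a literal `</s>` is a precaution in case the end-of-sequence token ends up in the decoded text.

```python
INST_CLOSE = "[/INST]"


def extract_reply(generated_text: str) -> str:
    """Return only the model's answer from a Llama-2-style '[INST] ... [/INST]' completion."""
    # rpartition splits on the LAST '[/INST]', so earlier turns or a missing space are handled;
    # if the marker is absent, rpartition puts the whole text in the last slot, unchanged.
    _, _, reply = generated_text.rpartition(INST_CLOSE)
    # Drop surrounding whitespace and a trailing end-of-sequence marker if it was decoded as text.
    reply = reply.strip()
    if reply.endswith("</s>"):
        reply = reply[: -len("</s>")].rstrip()
    return reply


print(extract_reply("<s>[INST] Hi [/INST] Hello there!</s>"))  # → Hello there!
```

In the backend, `response = extract_reply(sequences[0]['generated_text'])` would replace the two slicing lines.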


## **Create app.py for the Streamlit Frontend**

In [3]:
# Writing the Streamlit app code to a file
with open("app.py", "w") as f:
    f.write("""
print("Loading app.py")
import streamlit as st
from chatbot_backend import generate_response

# Set page configurations
st.set_page_config(page_title="Meet my Chatbot - The Chatur🤖", page_icon="🤖", layout="centered")

# Title for the chatbot
st.title("Meet my Chatbot - The Chatur🤖")

# Initialize the chat history in session state
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display previous messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Capture user input
if user_input := st.chat_input("Your Prompt:"):
    # Add user's message to the message list
    st.session_state.messages.append({"role": "user", "content": user_input})
    with st.chat_message("user"):
        st.markdown(user_input)

    # Generate a response from the chatbot backend (fine-tuned Llama 2 model)
    with st.chat_message("assistant"):
        chatbot_msg = st.empty()
        # Show a spinner while the model generates, since this can take a while
        with st.spinner("Thinking..."):
            full_response = generate_response(user_input)

        # Display the bot's response
        chatbot_msg.markdown(full_response)

        # Add the assistant's response to the message list
        st.session_state.messages.append({"role": "assistant", "content": full_response})
    """)
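In `app.py`, `generate_response(user_input)` only receives the newest message, so the model cannot see earlier turns even though the page displays them. One way to pass the history is to flatten `st.session_state.messages` into a single multi-turn prompt. The sketch below follows the common Llama 2 chat layout (`<s>[INST] ... [/INST] answer </s>` per turn, no system prompt); `build_llama2_prompt` is a hypothetical helper, and whether this fine-tuned checkpoint handles multi-turn prompts well is an assumption worth testing.

```python
def build_llama2_prompt(messages: list[dict[str, str]]) -> str:
    """Format a chat history into Llama-2 chat-style text (sketch, no system prompt).

    `messages` uses the same {"role": ..., "content": ...} dicts that app.py keeps in
    st.session_state.messages, and should end with the latest user message.
    """
    parts = []
    for msg in messages:
        content = msg["content"].strip()
        if msg["role"] == "user":
            # Each user turn opens a new <s>[INST] ... [/INST] block
            parts.append(f"<s>[INST] {content} [/INST]")
        elif msg["role"] == "assistant":
            # An earlier model answer follows its [/INST] and is closed with </s>
            parts.append(f" {content} </s>")
        else:
            raise ValueError(f"unexpected role: {msg['role']!r}")
    return "".join(parts)


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "What is an LLM?"},
]
print(build_llama2_prompt(history))
```

Using it would mean changing the backend to accept a ready-made prompt instead of wrapping `user_input` itself; long histories also eat into the `max_length=500` token budget, so older turns may need to be dropped.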


## **Expose the App with ngrok**

In [4]:
# pyngrok (installed in the first cell) exposes the local Streamlit app through an ngrok tunnel
from pyngrok import ngrok
import os
import threading
from google.colab import userdata

# Kill any previous ngrok tunnels/processes
ngrok.kill()

# Port for the Streamlit app (8501 is Streamlit's default)
port = 8501

# Run Streamlit in a background thread: os.system blocks until the server exits
def run_streamlit():
    print('starting streamlit')
    os.system(f"streamlit run app.py --server.port {port}")
    print(f'streamlit on port {port} has stopped')

threading.Thread(target=run_streamlit, daemon=True).start()

# Authenticate with the token stored in Colab secrets as NGROK_TOKEN, then open the tunnel
ngrok.set_auth_token(userdata.get('NGROK_TOKEN'))
public_url = ngrok.connect(addr=port, proto='http', bind_tls=True)
print(f"Streamlit is running at: {public_url}")


starting streamlit
Streamlit is running at: NgrokTunnel: "https://a0b3-34-124-167-164.ngrok-free.app" -> "http://localhost:8501"
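`ngrok.connect` opens the tunnel whether or not Streamlit is listening yet, so opening the URL immediately may show an ngrok error page until the server is up. A small stdlib-only sketch that polls the port before you share the link is below; `wait_for_port` is a hypothetical helper, not part of pyngrok.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until something accepts TCP connections on host:port, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a server is listening; the probe is closed immediately.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not up yet (e.g. connection refused); wait a little and retry.
            time.sleep(interval)
    return False
```

For example, call `wait_for_port("localhost", port, timeout=120)` before `ngrok.connect`. An open port only means the web server is running: `app.py` imports `chatbot_backend` when the script first runs, so loading the model may still make the first page visit slow.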


In [None]:
# Close the ngrok tunnel when you are done (the Streamlit server keeps running in the background)
ngrok.kill()