# Using GPT4All

*Notice* that [gpt4all](https://github.com/nomic-ai/gpt4all) uses the LLaMA model, which is not yet open for commercial use, only for educational purposes. You can play with the model, test it, and tune it if you like, but you shall not build commercial products on top of it!

This notebook was tested on, and works best with, a GPU shape and the `/home/datascience/conda/pytorch110_p38_gpu_v1` Conda environment.

The original source for this notebook was taken from here: https://colab.research.google.com/drive/1NWZN15plz8rxrk-9OcxNwwIk1V1MfBsJ?usp=sharing#scrollTo=vH2-YRnRSN98
... but it was modified to work using our pytorch conda environment.


Install the following dependencies on top of the `/home/datascience/conda/pytorch110_p38_gpu_v1` environment:

- Open a terminal in the notebook
- Install the conda environment: `odsc conda install -s pytorch110_p38_gpu_v1`
- Activate the conda environment: `conda activate /home/datascience/conda/pytorch110_p38_gpu_v1`
- Install the following dependencies:
    - `pip install datasets loralib sentencepiece`
    - `pip install git+https://github.com/huggingface/transformers.git`
    - `pip install git+https://github.com/huggingface/peft.git`
    - `pip install --upgrade tokenizers`    
    - `pip install bitsandbytes`
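
After running the install steps above, it can help to confirm which of the packages are actually visible from the notebook kernel. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
REQUIRED = ["datasets", "loralib", "sentencepiece", "transformers",
            "peft", "tokenizers", "bitsandbytes"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

If a package is reported as missing even though `pip install` succeeded, the notebook kernel is probably not running inside the activated conda environment.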

In [1]:
# Check that you are using a GPU-enabled conda environment
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Apr 11 09:04:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10          Off  | 00000000:00:04.0 Off |                    0 |
|  0%   43C    P8    16W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          Off  | 00000000:00:05.0 Off |                    0 |
|  0%   41C    P8    15W / 150W |      0MiB / 23028MiB |      0%      Default |
|       
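
The `!nvidia-smi` syntax used above only works inside IPython. As a rough alternative for plain Python scripts, the same check could be sketched with `subprocess`; the `query_gpus` helper is a hypothetical name introduced here for illustration.

```python
import shutil
import subprocess


def query_gpus():
    """Return the output of `nvidia-smi -L`, or None if no GPU is reachable."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver tools are not installed or not on PATH
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return None  # the tool exists but could not talk to a GPU
    return result.stdout.strip()


gpus = query_gpus()
print(gpus if gpus else "Not connected to a GPU")
```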

In [2]:
import torch

if torch.cuda.is_available():
    print("GPU is available")
else:
    print("GPU is not available")

GPU is available


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from peft import PeftModel, PeftConfig, PeftModelForCausalLM
# import textwrap

peft_model_id = "nomic-ai/gpt4all-lora"
config = PeftConfig.from_pretrained(peft_model_id)
print(config)

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, 
                                             return_dict=True, 
                                             torch_dtype=torch.float16,
#                                            torch_dtype=torch.float32,
                                             device_map='auto')

PeftConfig(peft_type='LORA', base_model_name_or_path='zpn/llama-7b', task_type='CAUSAL_LM', inference_mode=True)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [4]:
# loads the LLaMa tokenizer - takes time!
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

In [12]:
# You can skip this cell if you want to use the LLaMA model directly, without the tuning!

# Load the GPT4All LoRA-tuned model
model = PeftModelForCausalLM.from_pretrained(model,
                                             peft_model_id, 
                                             device_map="auto", 
                                             torch_dtype=torch.float16)
model.to(dtype=torch.float16)

# If you use float32, the model will use double the amount of memory and its responses will be slower!
# model = PeftModelForCausalLM.from_pretrained(model, peft_model_id, device_map="auto", torch_dtype=torch.float32)
# model.to(dtype=torch.float32)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear(
                in_features=4096, out_features=4096, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
              )
              (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (v_proj): Linear(
                in_features=4096, out_features=4096, bias=False
                (lora_dropout): M

In [13]:
print(f"Model Mem used: {model.get_memory_footprint() / 1024 / 1024 / 1024:.2f} GB")

Model Mem used: 12.59 GB
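
The reported footprint is close to what simple arithmetic predicts for half-precision weights. The sketch below assumes a parameter count of roughly 6.74 billion for the 7B base model; that figure is an assumption, not something the notebook reports, and it ignores the small LoRA adapters and any buffers.

```python
BYTES_PER_PARAM = {"float16": 2, "float32": 4}

# Assumption: approximate parameter count of the 7B LLaMA base model.
ASSUMED_PARAM_COUNT = 6_738_000_000


def weight_memory_gib(num_params, dtype):
    """Memory needed for the weights alone, in GiB (1024**3 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAM_COUNT, dtype):.2f} GiB")
```

Under this assumption the float16 estimate lands slightly below the measured 12.59 GB, and the float32 estimate is exactly twice as large, which would not fit on a single 24 GB A10.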


In [7]:
def generate_advance(text):    
    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()

#   More information about GenerationConfig can be found at: https://huggingface.co/docs/transformers/main_classes/text_generation
#   temperature - higher values make the output more creative, lower values make it more deterministic
#   top_p - if less than 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
#   top_k - the number of highest probability vocabulary tokens to keep for top-k filtering.
#   Note: temperature, top_p and top_k only take effect when sampling is enabled (do_sample=True).
    generation_config = GenerationConfig(
        temperature=0.6, 
        top_p=0.95, 
        top_k= 40,
        repetition_penalty=(1.0 / 0.85))

    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        # return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=512,
    )

    #wrapped_text = textwrap.fill(tokenizer.decode(generation_output[0]), width=100)
    wrapped_text = tokenizer.decode(generation_output[0])
    print(wrapped_text)

# Simplified
# def generate_from_model(text):
#     encoded_input = tokenizer(text, return_tensors='pt')
#     output_sequences = model.generate(input_ids=encoded_input['input_ids'].cuda(), max_new_tokens=512)
#     text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
#     print(text)
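
To make the `temperature`, `top_k` and `top_p` comments above more concrete, here is a toy sketch of how such filters could reshape a next-token distribution. It is a simplified illustration in plain Python, not the `transformers` implementation; the `filter_distribution` helper and the toy logits are made up for this example.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature scaling, top-k and top-p filtering."""
    # Temperature: dividing logits by a value < 1 sharpens the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Softmax (subtract the max for numerical stability).
    peak = max(scaled.values())
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = sorted(((tok, e / total) for tok, e in exp.items()),
                   key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k > 0:
        probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"wool": 2.0, "fiber": 1.5, "horns": 0.2, "banana": -3.0}
print(filter_distribution(toy_logits, temperature=0.6, top_k=40, top_p=0.95))
```

With these toy values only the two most likely tokens survive the `top_p=0.95` cut, which shows how the settings prune unlikely continuations before a token is sampled.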

In [8]:
%%time

# LLaMa
# PROMPT Option 1
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
What are the differences between alpacas and sheep?
### Response:"""

# PROMPT Option 2
# PROMPT = """
# What are the differences between alpacas and sheep?
# """

# # PROMPT Option 3
# PROMPT = ["What are the differences between alpacas and sheep?"]


# uses the additional settings
generate_advance(PROMPT)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
What are the differences between alpacas and sheep?
### Response:
Alpaca’s have longer, thicker coats than sheep do. Alpaca’s also don't shed their coat like sheep do. Sheep can be found in many different colors while alpaca’s come only in white or brown. The most common color of alpaca’s is black but they can also be grey, red, fawn, cream, silver gray, cinnamon, chocolate, blue-grey, light grey, dark grey, tan, and white. They both have two horns on top of their heads. Both animals live for about 20 years.</s>
CPU times: user 7.81 s, sys: 307 ms, total: 8.12 s
Wall time: 8.17 s
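
The instruction/response template is repeated by hand in most cells of this notebook. A small helper could build it from a plain question and strip it from the decoded output again. This is only a sketch; `build_prompt` and `extract_response` are hypothetical helpers that do not appear in the original notebook.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n"
    "{instruction}\n"
    "### Response:"
)


def build_prompt(instruction):
    """Wrap a plain question in the instruction/response template used in this notebook."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def extract_response(generated_text):
    """Return only the text after the response marker, without the </s> end token.

    Returns an empty string if the marker is not present.
    """
    _, _, answer = generated_text.partition("### Response:")
    return answer.replace("</s>", "").strip()


print(build_prompt("What are the differences between alpacas and sheep?"))
```

The result of `build_prompt(...)` can be passed to `generate_advance` in place of the hand-written `PROMPT` string.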


In [14]:
%%time

# gpt4all
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
What are the differences between alpacas and sheep?
### Response:"""

generate_advance(PROMPT)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
What are the differences between alpacas and sheep?
### Response:
Alpacas have longer, finer fiber than sheep, making them more suitable for clothing production. They also tend to be smaller in size with shorter necks and ears compared to sheep. Additionally, they are native to South America while sheep are native to Europe and Asia.</s>
CPU times: user 3.3 s, sys: 15 µs, total: 3.3 s
Wall time: 3.32 s


In [9]:
# LLaMa
PROMPT = ["what are the top 5 biggest banks in USA"]
generate_advance(PROMPT)

<s> what are the top 5 biggest banks in USA?
what is the difference between a bank and an investment company?</s>


In [15]:
# gpt4all
PROMPT = ["what are the top 5 biggest banks in USA"]
generate_advance(PROMPT)

<s> what are the top 5 biggest banks in USA?
The five largest banks in the United States by assets as of March 2019, according to FDIC data, were: JPMorgan Chase & Co., Bank of America Corp., Citigroup Inc., Wells Fargo & Company and Goldman Sachs Group Inc.</s>


In [10]:
# LLaMa
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Read the text below and sumaries the ciber security risks into one sentense?

Cyber security risks

We rely on technology in virtually all aspects of our business. Like those of many large businesses, certain of our information systems have been subject to computer viruses, malicious codes, unauthorized access, phishing efforts, denial-of- service attacks and other cyber-attacks. We expect to be subject to similar attacks in the future as such attacks become more sophisticated and frequent. A significant disruption or failure of our technology systems could result in service interruptions, safety failures, security events, regulatory compliance failures, an inability to protect information and assets against unauthorized users and other operational difficulties. Attacks perpetrated against our systems could result in loss of assets and critical information and expose us to remediation costs and reputational damage.
Although we have taken steps intended to mitigate these risks, including business continuity planning, disaster recovery planning and business impact analysis, a significant disruption or cyber intrusion at one or more of our significant operations could adversely affect our results of operations, financial condition and liquidity. Additionally, if we are unable to acquire, develop, implement, adopt or protect rights around new technology, we may suffer a competitive disadvantage, which could also have an adverse effect on our results of operations, financial condition and/or liquidity.
Cyber-attacks could further adversely affect our ability to operate facilities, information technology and business systems or compromise confidential customer and employee information. Political, economic, social or financial market instability or damage to or interference with our operating assets, customers or suppliers from cyber-attacks may result in business interruptions, lost revenues, higher commodity prices, disruption in fuel supplies, lower energy consumption, unstable markets, increased security, repair or other costs, or may materially adversely affect us in ways that cannot be predicted at this time. Any of these risks could materially affect our consolidated financial results. Furthermore, instability in the financial markets resulting from terrorism, sustained or significant cyber-attacks or war could also have a material adverse effect on our ability to raise capital. We share these risks with all businesses.

### Response:"""


# uses the generation settings defined in generate_advance
generate_advance(PROMPT)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Read the text below and sumaries the ciber security risks into one sentense?

Cyber security risks

We rely on technology in virtually all aspects of our business. Like those of many large businesses, certain of our information systems have been subject to computer viruses, malicious codes, unauthorized access, phishing efforts, denial-of- service attacks and other cyber-attacks. We expect to be subject to similar attacks in the future as such attacks become more sophisticated and frequent. A significant disruption or failure of our technology systems could result in service interruptions, safety failures, security events, regulatory compliance failures, an inability to protect information and assets against unauthorized users and other operational difficulties. Attacks perpetrated against our systems could result in loss of assets and critical information and 

In [16]:
# gpt4all
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Read the text below and sumaries the ciber security risks into one sentense?

Cyber security risks

We rely on technology in virtually all aspects of our business. Like those of many large businesses, certain of our information systems have been subject to computer viruses, malicious codes, unauthorized access, phishing efforts, denial-of- service attacks and other cyber-attacks. We expect to be subject to similar attacks in the future as such attacks become more sophisticated and frequent. A significant disruption or failure of our technology systems could result in service interruptions, safety failures, security events, regulatory compliance failures, an inability to protect information and assets against unauthorized users and other operational difficulties. Attacks perpetrated against our systems could result in loss of assets and critical information and expose us to remediation costs and reputational damage.
Although we have taken steps intended to mitigate these risks, including business continuity planning, disaster recovery planning and business impact analysis, a significant disruption or cyber intrusion at one or more of our significant operations could adversely affect our results of operations, financial condition and liquidity. Additionally, if we are unable to acquire, develop, implement, adopt or protect rights around new technology, we may suffer a competitive disadvantage, which could also have an adverse effect on our results of operations, financial condition and/or liquidity.
Cyber-attacks could further adversely affect our ability to operate facilities, information technology and business systems or compromise confidential customer and employee information. Political, economic, social or financial market instability or damage to or interference with our operating assets, customers or suppliers from cyber-attacks may result in business interruptions, lost revenues, higher commodity prices, disruption in fuel supplies, lower energy consumption, unstable markets, increased security, repair or other costs, or may materially adversely affect us in ways that cannot be predicted at this time. Any of these risks could materially affect our consolidated financial results. Furthermore, instability in the financial markets resulting from terrorism, sustained or significant cyber-attacks or war could also have a material adverse effect on our ability to raise capital. We share these risks with all businesses.

### Response:"""


# uses the generation settings defined in generate_advance
generate_advance(PROMPT)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Read the text below and sumaries the ciber security risks into one sentense?

Cyber security risks

We rely on technology in virtually all aspects of our business. Like those of many large businesses, certain of our information systems have been subject to computer viruses, malicious codes, unauthorized access, phishing efforts, denial-of- service attacks and other cyber-attacks. We expect to be subject to similar attacks in the future as such attacks become more sophisticated and frequent. A significant disruption or failure of our technology systems could result in service interruptions, safety failures, security events, regulatory compliance failures, an inability to protect information and assets against unauthorized users and other operational difficulties. Attacks perpetrated against our systems could result in loss of assets and critical information and 

In [144]:
#LLaMa

PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
\n
### Instruction:
Write a python function that the detects if a number is a prime number or not.
\n
### Response:"""
generate_advance(PROMPT)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.


### Instruction:
Write a python function that detects if a number is a prime number or not.


### Response:
Here's my Python code to check whether a given integer is prime or not:
```python
def is_prime(num):
    for i in range(2, num + 1):
        if (num % i) == 0:
            return False
    else:
        return True
```</s>


In [143]:
import math
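def is_prime(number):
    for i in range(2, int(math.sqrt(number))+1):
        if number % i == 0:
            return False
        else:
            # Bug: this returns on the first iteration, so only divisibility by 2 is tested
            return True
    # Bug: when the range is empty (e.g. number=3), the function falls through and returns None
    
print(is_prime(3))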

None


In [146]:
# gpt4all
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
\n
### Instruction:
Write a python function that the detects if a number is a prime number or not.
\n
### Response:"""
generate_advance(PROMPT)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.


### Instruction:
Write a python function that the detects if a number is a prime number or not.


### Response:
Here's my Python code to check whether a given integer is prime or not:
```python
def is_prime(num):
    for i in range(2, num + 1):
        if (num % i) == 0:
            return False
    else:
        return True
```</s>


In [151]:
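def is_prime(num):
    # Bug in the generated code: the range includes num itself, and num % num == 0,
    # so this returns False for every num >= 2, primes included.
    for i in range(2, num + 1):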
        if (num % i) == 0:
            return False
    else:
        return True

print(is_prime(21))

False
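
A single call such as `is_prime(21)` says little about whether generated code is correct. One way to check it more systematically is to compare it against a slow but obviously correct reference over a range of inputs. The sketch below does that; `is_prime_generated`, `is_prime_reference` and `mismatches` are names introduced here for illustration.

```python
def is_prime_generated(num):
    # Copy of the model-generated function, kept unchanged for testing.
    for i in range(2, num + 1):
        if (num % i) == 0:
            return False
    else:
        return True


def is_prime_reference(num):
    """Slow but obviously correct reference: a prime has no divisor in 2..num-1."""
    return num >= 2 and all(num % d != 0 for d in range(2, num))


def mismatches(candidate, limit=30):
    """Numbers below `limit` where the candidate disagrees with the reference."""
    return [n for n in range(limit) if bool(candidate(n)) != is_prime_reference(n)]


print(mismatches(is_prime_generated))
```

The printed list contains 0, 1 and every prime below 30, so the generated function is wrong on exactly the inputs that matter, even though it happens to answer `False` correctly for 21.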


In [152]:
# gpt4all

PROMPT = """You are an advanced python code developer. Respond in detail, provide comments in the code, as well as how to use the code.
### Instruction:
Write a python function that can detect if given number is prime.
### Response:"""
generate_advance(PROMPT)

<s> You are an advanced python code developer. Respond in detail, provide comments in the code, as well as how to use the code.
### Instruction:
Write a python function that can detect if given number is prime.
### Response:
```python
import math
def is_prime(num):
    for i in range(2, int(math.sqrt(num))+1):
        if num % i == 0:
            return False
return True
```</s>


In [163]:
import math
def is_prime(num):
    for i in range(2, int(math.sqrt(num))+1):
        if num % i == 0:
            return False
    return True

# 2, 3, 5, 7, and 11
print(is_prime(11))

True
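
The square-root version above is correct for inputs of 2 and larger, but it returns `True` for 0 and 1 and raises an error for negative numbers. A slightly hardened variant, offered as a sketch rather than as the model's output, handles those cases explicitly and uses `math.isqrt` (available from Python 3.8) to avoid floating-point rounding.

```python
import math


def is_prime(num):
    """Trial division up to the integer square root, with small inputs handled explicitly."""
    if num < 2:
        return False  # 0, 1 and negative numbers are not prime
    for i in range(2, math.isqrt(num) + 1):
        if num % i == 0:
            return False
    return True


print([n for n in range(30) if is_prime(n)])
```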


In [134]:
#gpt4all

PROMPT = """Below is an instruction that describes a mathematical task. Write a response that appropriately completes the request.
\n
### Instruction:
Is 15 a prime number, can you give example for another prime number, and describe what prime number is?
\n
### Response:"""

generate_advance(PROMPT)

<s> Below is an instruction that describes a mathematical task. Write a response that appropriately completes the request.


### Instruction:
Is 15 a prime number, can you give example for another prime number, and describe what prime number is?


### Response:
Yes, 15 is indeed a prime number because it has no factors other than itself and 1 (i.e., 1 x 15 = 15). Another prime number would be 7 since it also has only two factors (i.e., 1 x 7 = 7) but not including 1 or 0 as factors. A prime number is any whole number greater than 1 that cannot be divided evenly by any integer except 1 and itself without leaving a remainder of either 0 or 1.</s>
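
The answer above is factually wrong: 15 is divisible by 3 and 5. Claims like this are easy to verify with a few lines of code. The `divisors` helper below is a hypothetical name used only for this check.

```python
def divisors(n):
    """All positive divisors of n, found by simple trial division."""
    return [d for d in range(1, n + 1) if n % d == 0]


for n in (15, 7):
    found = divisors(n)
    verdict = "prime" if len(found) == 2 else "not prime"
    print(f"{n}: divisors {found} -> {verdict}")
```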


In [111]:
# gpt4all
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
\n
### Instruction:
Give me the first 5 prime numbers.
\n
### Response:"""
generate_advance(PROMPT)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.


### Instruction:
give me the first 5 prime numbers


### Response:
The first five prime numbers are: 2, 3, 5, 7, and 11</s>


In [18]:
# gpt4all
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
\n
### Instruction:
tokenizers>=0.13.3 is required for a normal functioning of this module, but found tokenizers==0.11.4
\n
### Response:"""
generate_advance(PROMPT)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.


### Instruction:
tokenizers>=0.13.3 is required for a normal functioning of this module, but found tokenizers==0.11.4


### Response:
The command line interface requires version 0.13.3 or higher to run properly. Please update your installation by running `pip install --upgrade tokenizers`.</s>
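
The error message in the prompt boils down to a version comparison. As a minimal sketch, plain dotted versions can be compared as integer tuples; `parse_version` and `meets_minimum` are illustrative helpers, and this simple parser does not handle pre-release suffixes such as `0.13.3rc1` (the `packaging` library covers those).

```python
def parse_version(text):
    """Turn a plain dotted version such as '0.13.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, required):
    """True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


print(meets_minimum("0.11.4", "0.13.3"))
print(meets_minimum("0.13.3", "0.13.3"))
```

Comparing tuples of integers avoids the string-comparison pitfall where `"0.9.0"` would sort after `"0.13.3"`.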


In [109]:
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
\n
### Instruction:
I am looking for a python code that can convert a pdf document into a text chunks that can be used for tokenization and explain the code.
\n
### Response:"""
generate_advance(PROMPT)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.


### Instruction:
I am looking for a python code that can convert a pdf document into a text chunks that can be used for tokenization and explain the code.


### Response:
Here's some Python code to help you with your task! 
```python
import PyPDF2 as pypdf
pdf_file = 'path/to/your/pdf.pdf' # replace with actual file path
text_chunks = []
with open(pdf_file, "rb") as fp:
    doc = pypdf.PdfFileReader(fp)
    pages = doc.getNumPages()
    
for page in range(pages):
    chunk = ''
    for i in range(doc.getPageSize().width * doc.getPageSize().height / 720):
        char = chr((ord('A') + (i % 26)) % 26).encode("utf-8").decode("ascii", "ignore")
        if len(chunk) > 150:
            text_chunks.append(chunk[:149] + '\n' + char)
            chunk = char
        
print('\n'.join(text_chunks))
```</s>


In [None]:
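# Try the generated code (adapted to the newer PyPDF2 API).
# Note: PdfReader has no getPageSize() or getLine() methods, so the loop below fails.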
import PyPDF2 as pypdf
pdf_file = 'berkshire-hathawa-10k-2022ar.pdf' # replace this with the path of your PDF file
text_chunks = []

with open(pdf_file, "rb") as fp:
    doc = pypdf.PdfReader(fp)
    pages = len(doc.pages)
    
for page in range(pages):
    chunk = ''
    for i in range(doc.getPageSize().width // 100 + 1):
        line = doc.getLine(page * (doc.getPageSize().height - 1), i).encode('utf-8')
        if len(line) > 50:
            chunk += line[:49] + '\n\n'
        
    text_chunks.append(chunk)
print("Text Chunk:", text_chunks[i])

In [None]:
import PyPDF2

# Open the PDF file in read-binary mode; the context manager closes it afterwards
text_chunks = []
with open('berkshire-hathawa-10k-2022ar.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Extract text from each page in the PDF document
    for page in pdf_reader.pages:
        text_chunks.append(page.extract_text())

# Tokenize the text chunks (simple whitespace split)
tokens = []
for chunk in text_chunks:
    tokens.extend(chunk.split())

In [None]:
print("Tokens:", tokens[:1000])
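
The prompt earlier asked for text chunks, while the cell above produces one flat list of whitespace-separated tokens. One possible next step, sketched here under the assumption that fixed-size overlapping windows are acceptable, is to group the tokens into chunks. The `chunk_tokens` helper and its default sizes are illustrative choices, not part of the original notebook.

```python
def chunk_tokens(tokens, chunk_size=200, overlap=20):
    """Split a token list into chunks of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks


sample = "We rely on technology in virtually all aspects of our business".split()
for chunk in chunk_tokens(sample, chunk_size=5, overlap=2):
    print(" ".join(chunk))
```

The overlap keeps some shared context between neighbouring chunks, which can help when each chunk is later passed to the model on its own.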