# M Science Project

## Instruction
Imagine you are a developer building an internal chatbot that has access to the latest
financial earnings transcript for a company. For example, an internal stakeholder may want
to ask questions like “summarize highlights in the transcript” or search for very specific
points that they have missed. Your task is to build a simple proof-of-concept
pipeline in code.

## Install packages

In [1]:
!pip install tika
!pip install sentence-transformers
!pip install chromadb


In [2]:
import re
from tika import parser
import chromadb
from transformers import pipeline




## Read in and process data

In [4]:
raw = parser.from_file('/content/23q3_sonyspeech.pdf')



In [5]:
text = raw['content']

In [6]:
# split the text by page (splits on a newline followed by a page number)
sections = re.split(r"\n\d+", text)

In [9]:
# keep only pages with at least 100 characters
cleaned_sections = [s for s in sections if len(s) >= 100]
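To make the split-and-filter step concrete, here is a small sketch on a made-up string. `sample_text` and the other `sample_*` names are invented for illustration and do not come from the transcript. It also shows a limitation of the pattern: `\n\d+` matches any newline followed by digits, so a line of body text that happens to start with a number is treated as a page break too.

```python
import re

# Invented stand-in for the text extracted from the PDF:
# two long "pages" and one short one, separated by page numbers.
sample_text = (
    "Opening remarks " + "a" * 100
    + "\n1"
    + "\nSegment results " + "b" * 100
    + "\n2"
    + "\nshort page"
)

# Same pattern as the notebook: newline followed by one or more digits.
sample_sections = re.split(r"\n\d+", sample_text)

# Same filter as the notebook: drop pages shorter than 100 characters.
sample_cleaned = [s for s in sample_sections if len(s) >= 100]

print(len(sample_sections), len(sample_cleaned))

# Limitation: a wrapped sentence whose next line starts with a number is
# split as well, and the number itself is dropped from the text.
sample_pitfall = re.split(r"\n\d+", "hardware sales were\n21 million units")
print(sample_pitfall)
```

If this turns out to matter for a given PDF, a stricter pattern (for example one that requires the number to sit alone on its line) may be worth trying.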


## Build Vector DB

In [12]:
chroma_client = chromadb.Client()

In [13]:
# get_or_create_collection lets this cell be re-run without an "already exists" error
collection = chroma_client.get_or_create_collection(name="my_collection3")


In [14]:
collection.add(
    documents=cleaned_sections,
    ids=[str(i) for i in range(len(cleaned_sections))],
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 69.5MiB/s]


## Load the LLM

In [15]:

prompt = "Write an email about an alpaca that likes flan"
model = pipeline(model="declare-lab/flan-alpaca-base")
model(prompt, max_length=128)



[{'generated_text': 'Dear [Name], I am writing to introduce you to an alpaca that loves flan. This alpaca is a small, brown, and white scaly scaly scaly scaly. It is a very small, brown, and white scaly scaly scaly scaly scaly scaly scaly scaly scaly scaly scaly scaly scaly scaly scaly scaly s'}]

In [18]:
def ask_question(question):
    """Answer a user's question about the transcript.

    Queries the vector DB for the pages most similar to the question, then
    prompts the LLM with the question and those pages as context.
    """
    # query() returns one list of documents per query text; we sent one query
    documents = collection.query(
        query_texts=[question],
        n_results=2,
    )['documents'][0]
    context = "\n\n".join(documents)
    prompt = f"""Given this question and context, answer the question.
question: {question}
context: {context}"""

    return model(prompt, max_length=128)

In [19]:
ask_question("how many PS5 sales were there?")

[{'generated_text': 'There were around 21 million units of PlayStation 5 hardware unit sales in the quarter.'}]
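The retrieved pages go into the prompt whole, and small sequence-to-sequence models typically handle only a limited amount of input well. A minimal sketch of building the prompt with a context budget follows. `build_prompt` and `max_context_chars` are hypothetical names, and the character limit is a crude, untuned stand-in for a proper token count.

```python
def build_prompt(question, documents, max_context_chars=1500):
    """Join retrieved pages into one context string and build the prompt.

    `max_context_chars` is an illustrative budget, not a tuned value;
    counting tokens with the model's tokenizer would be more precise.
    """
    # Join the pages as plain text instead of formatting a Python list.
    context = "\n\n".join(doc.strip() for doc in documents)
    # Cut the context off at the budget so the prompt stays short.
    context = context[:max_context_chars]
    return (
        "Given this question and context, answer the question.\n"
        f"question: {question}\n"
        f"context: {context}"
    )


# Invented page texts, for illustration only.
example_prompt = build_prompt(
    "how many PS5 sales were there?",
    ["  page one text  ", "page two text"],
)
print(example_prompt)
```

Cutting at a character count can end the context mid-sentence; truncating at a page or sentence boundary would be a natural refinement.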

## Possible Improvements


- Use a more capable LLM.
- Choose how many results the vector DB returns from a similarity score rather than a fixed count, so the prompt includes all of the relevant pages.
- If the vector DB returns no sufficiently similar results, reply that the PDF does not contain the information; this would help stop the LLM from hallucinating.
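
The last two ideas can be sketched together. The code below assumes the result dictionary has the shape `collection.query` returns for a single query when `distances` is part of the query's `include` list, with smaller distances meaning closer matches. `select_relevant`, `prompt_or_decline`, and the threshold of `1.0` are hypothetical and would need tuning against real queries.

```python
NO_INFO_MESSAGE = "No relevant information was found in the PDF."


def select_relevant(results, max_distance):
    """Keep only documents whose distance to the query is within max_distance.

    `results` is expected to look like {"documents": [[...]], "distances": [[...]]},
    i.e. one inner list per query text; only the first query is used here.
    """
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


def prompt_or_decline(question, results, max_distance=1.0):
    """Return a prompt for the LLM, or a fixed message if nothing is relevant."""
    relevant = select_relevant(results, max_distance)
    if not relevant:
        # Skip the LLM entirely so it cannot invent an answer.
        return NO_INFO_MESSAGE
    context = "\n\n".join(relevant)
    return f"question: {question}\ncontext: {context}"


# Invented query result, for illustration only.
fake_results = {
    "documents": [["close page", "far page"]],
    "distances": [[0.4, 1.7]],
}
print(prompt_or_decline("how many PS5 sales were there?", fake_results))
print(prompt_or_decline("what is the weather?", fake_results, max_distance=0.1))
```

In practice one would ask the vector DB for more candidates than needed (a larger `n_results`) and let the threshold decide how many reach the prompt.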