------------------------------------
#### Retrieval using embeddings (key points)
-----------------------------------

**Dataset**
- PDF of the Microsoft annual report (2022)
    - Plain text
    - Tables
    - Plots/graphs
    
**Extract data**
- Extract the text of each page

**Chunking**
- Two-step chunking strategy with LangChain: `RecursiveCharacterTextSplitter`, then `SentenceTransformersTokenTextSplitter`

**Embeddings**
- Sentence-transformer embeddings (from chromadb)

**Vector database**
- In-memory chromadb

**Query**
- Using embeddings
- Using RAG


In [1]:
#!pip install chromadb

In [2]:
#!pip install langchain

In [3]:
#!pip install pdfreader
#!pip install PyPDF2

In [4]:
#!pip install pypdf

In [5]:
from helper_utils import word_wrap

In [6]:
from pypdf import PdfReader

In [7]:
reader = PdfReader(r".\data\microsoft_annual_report_2022.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

In [8]:
len(pdf_texts)

93

In [9]:
# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

In [10]:
len(pdf_texts)

90

**RecursiveCharacterTextSplitter**

The `RecursiveCharacterTextSplitter` is a utility that helps split long text into smaller chunks while maintaining as much context as possible. Here's how it works:

##### Separators

The `separators` list defines the order in which the text will be split. In this example:

- It first attempts to split on two newlines (`"\n\n"`), which typically mark a paragraph break.
- If a resulting piece is still larger than the chunk size, it moves on to a single newline (`"\n"`), i.e. a line break.
- Then it tries a period followed by a space (`". "`), which marks a sentence boundary.
- After that, it splits on a space (`" "`), which breaks the text at the word level.
- Finally, it splits into individual characters (`""`) if none of the above yields a piece that meets the size requirement.

##### Chunk size and overlap

- `chunk_size=1000` means that each chunk will have a maximum of 1000 characters.
- `chunk_overlap=0` means there will be no overlap between consecutive chunks (i.e., no repeated content).
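
To make the two parameters concrete, here is a minimal sketch using plain fixed-size character windows. The helper `fixed_windows` is hypothetical and is not how LangChain applies overlap (LangChain carries over whole split pieces rather than raw character ranges); it only illustrates what "overlap" means for consecutive chunks.

```python
def fixed_windows(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    Illustrative helper only, not part of LangChain.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
no_overlap = fixed_windows(sample, chunk_size=10, chunk_overlap=0)
with_overlap = fixed_windows(sample, chunk_size=10, chunk_overlap=3)

print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, every character appears in exactly one window. With `chunk_overlap=3`, the last three characters of each window are repeated at the start of the next.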

##### Recursive splitting

The process is recursive: it starts with the coarsest separator (paragraph breaks) and, for any piece still longer than 1000 characters, repeats the split with the next finer separator (line breaks, sentences, words, and finally characters). Adjacent small pieces are then merged back together up to the limit, so the chunks end up as close to 1000 characters as possible while remaining coherent pieces of text.
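
The idea can be sketched in pure Python. This is a simplified illustration, not LangChain's implementation: the function names `recursive_split` and `merge_pieces` are made up here, separators are dropped at chunk boundaries (LangChain keeps them by default), and overlap is not handled.

```python
def merge_pieces(pieces: list[str], sep: str, chunk_size: int) -> list[str]:
    """Greedily join neighbouring pieces with sep while the result fits."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Simplified recursive character splitting (illustrative only)."""
    if len(text) <= chunk_size:
        return [text]

    # Use the first separator that occurs in the text; "" means hard cut.
    sep, rest = "", []
    for i, candidate in enumerate(separators):
        if candidate == "" or candidate in text:
            sep, rest = candidate, separators[i + 1:]
            break

    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, small = [], []
    for part in text.split(sep):
        if not part:
            continue
        if len(part) <= chunk_size:
            small.append(part)  # fits: keep for merging with its neighbours
        else:
            # Too big: flush what we have, then recurse with finer separators.
            chunks.extend(merge_pieces(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(part, rest, chunk_size))
    chunks.extend(merge_pieces(small, sep, chunk_size))
    return chunks


demo = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa lambda."
demo_chunks = recursive_split(demo, ["\n\n", "\n", ". ", " ", ""], chunk_size=20)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk))
```

A tiny `chunk_size` of 20 is used so that every level of the separator list comes into play on a short string.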


```python
text = "This is a long paragraph with multiple sentences. It discusses several topics and ideas, flowing continuously. For instance, it talks about machine learning, deep learning, and various AI applications. While doing so, it doesn’t include paragraph breaks or line breaks. Everything is packed in a single block."
```

##### Initial Split
It tries to split using `"\n\n"` (paragraph breaks). There are no `\n\n` in this text, so no split happens.

##### Next Split
It then looks for `"\n"` (line breaks). There are none here either.

##### Next Split
It tries `". "` (sentence breaks). Here, it successfully splits the text into five sentences:
- "This is a long paragraph with multiple sentences."
- "It discusses several topics and ideas, flowing continuously."
- "For instance, it talks about machine learning, deep learning, and various AI applications."
- "While doing so, it doesn’t include paragraph breaks or line breaks."
- "Everything is packed in a single block."

##### Final Chunks
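If any of these sentences exceeded 1000 characters, it would continue splitting by `" "` (spaces) and eventually by characters if necessary. None does here, and the whole text is well under 1000 characters, so with `chunk_size=1000` the sentences are merged back into a single chunk.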
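
The walkthrough for this example can be checked without LangChain. The sketch below only finds which separator in the list is the first to occur in the text and counts the resulting pieces; `first_matching_separator` is a hypothetical helper written for this illustration.

```python
separators = ["\n\n", "\n", ". ", " ", ""]

text = (
    "This is a long paragraph with multiple sentences. "
    "It discusses several topics and ideas, flowing continuously. "
    "For instance, it talks about machine learning, deep learning, "
    "and various AI applications. "
    "While doing so, it doesn’t include paragraph breaks or line breaks. "
    "Everything is packed in a single block."
)


def first_matching_separator(text: str, separators: list[str]) -> str:
    """Return the first separator that occurs in text ("" always matches)."""
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return ""


sep = first_matching_separator(text, separators)
pieces = text.split(sep)

print("separator used:", repr(sep))
print("number of pieces:", len(pieces))
print("total length:", len(text))
```

Because the total length is far below 1000, a splitter configured with `chunk_size=1000` has no reason to keep the pieces apart.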


```python
text = """Data science is an interdisciplinary field that uses various techniques to extract insights from data. It involves statistics, machine learning, and data analysis.

Machine learning is a subset of AI that enables systems to learn from data and improve from experience.

Deep learning, a branch of machine learning, uses neural networks to model complex patterns in data."""
```

##### Initial Split
The first separator `"\n\n"` (paragraph breaks) will be applied:
- "Data science is an interdisciplinary field that uses various techniques to extract insights from data. It involves statistics, machine learning, and data analysis."
- "Machine learning is a subset of AI that enables systems to learn from data and improve from experience."
- "Deep learning, a branch of machine learning, uses neural networks to model complex patterns in data."

##### Next Split
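If any paragraph exceeded 1000 characters, it would then be split further using `"\n"`, `". "`, and so on. None does here, and the three paragraphs together are still under 1000 characters, so with `chunk_size=1000` they are merged back into a single chunk.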
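
As a quick check of this second example, the sketch below splits on `"\n\n"` and flags which paragraphs would need further splitting at a given limit. The helper `oversized_flags` and the smaller limit of 120 are assumptions added for illustration; the notebook itself uses 1000.

```python
text = """Data science is an interdisciplinary field that uses various techniques to extract insights from data. It involves statistics, machine learning, and data analysis.

Machine learning is a subset of AI that enables systems to learn from data and improve from experience.

Deep learning, a branch of machine learning, uses neural networks to model complex patterns in data."""


def oversized_flags(paragraphs: list[str], chunk_size: int) -> list[bool]:
    """True for each paragraph that is longer than chunk_size characters."""
    return [len(p) > chunk_size for p in paragraphs]


paragraphs = [p for p in text.split("\n\n") if p.strip()]

flags_1000 = oversized_flags(paragraphs, chunk_size=1000)
flags_120 = oversized_flags(paragraphs, chunk_size=120)

print("paragraph lengths:", [len(p) for p in paragraphs])
print("needs further splitting at 1000:", flags_1000)
print("needs further splitting at 120:", flags_120)
```

At a limit of 1000 no paragraph is flagged. At 120 only the first, two-sentence paragraph is, and it is the one that would move on to the `". "` separator.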

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [12]:
character_splitter = RecursiveCharacterTextSplitter(
    separators   = ["\n\n", "\n", ". ", " ", ""],
    chunk_size   = 1000,
    chunk_overlap= 0
)

In [13]:
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

In [14]:
print(f"\nTotal chunks: {len(character_split_texts)}")


Total chunks: 344
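
A chunk count alone says little about chunk quality, so it can help to look at the length distribution as well. The helper below is a hypothetical sketch, and the sample list is stand-in data; in the notebook you would pass `character_split_texts` instead.

```python
def summarize_chunks(chunks: list[str], chunk_size: int) -> dict:
    """Basic length statistics for a list of text chunks."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "oversized": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        # Chunks longer than the configured limit would indicate a problem.
        "oversized": sum(length > chunk_size for length in lengths),
    }


# Stand-in data, not taken from the annual report.
sample_chunks = ["a" * 1000, "b" * 640, "c" * 87]
stats = summarize_chunks(sample_chunks, chunk_size=1000)
print(stats)
```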
