environment = `rag_env`

Even though we have ChatGPT, why don't we use it directly to summarise or process documents instead of creating an app?

1. It would involve too much copying and pasting of content from documents.
2. Copying and pasting into ChatGPT will not create an *aggregate knowledge base*, which I may want to use for a long time.
3. ChatGPT has a limit on the amount of text it can process at once. So, what if your content is larger than the token limit?
4. If we have an app, we can smartly send only the pertinent text *chunk* to the LLM to get the answer to our question, thereby saving cost.
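
Point 4 can be illustrated with a small sketch. It picks the chunk that shares the most words with the question. This is only an assumption made for illustration: real apps usually compare embeddings stored in a vector database rather than counting shared words, and the names `tokenize` and `most_relevant_chunk` are hypothetical helpers, not part of any library.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def most_relevant_chunk(question, chunks):
    """Return the chunk that shares the most words with the question."""
    question_words = tokenize(question)
    return max(chunks, key=lambda chunk: len(question_words & tokenize(chunk)))


chunks = [
    "Nvidia reported record data center revenue this quarter.",
    "The film Interstellar was directed by Christopher Nolan.",
    "HDFC Bank re-appointed its chief risk officer.",
]

# Only this one chunk would be sent to the LLM, not the whole knowledge base
print(most_relevant_chunk("Who directed Interstellar?", chunks))
```

Sending one short chunk instead of every document is what keeps the number of tokens, and therefore the cost, low.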

Sidenote: To reduce hallucination by the LLM, add an explicit instruction to your prompt, for example: **Who is Modi? Do not make things up!**

**Technical Architecture**

<span style = "color: yellow"> Image to add

Langchain has many types of `document_loaders`. They help in loading text, i.e. converting the content of a document into a machine-readable format.

For example:
- `TextLoader` - to read from *.txt* files
- `CSVLoader` - to read from *.csv* files
- `PyPDFLoader` - to read from *.pdf* files. `langchain` has several other modules for loading PDFs. It is a good idea to explore them later to see what works best for you.
- `UnstructuredURLLoader` - to read from *url* links
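
The list above maps naturally onto a small lookup. The sketch below only returns the *name* of the loader class as a string, so it runs without Langchain installed. `pick_loader_name` and `LOADER_BY_SUFFIX` are hypothetical names introduced here for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

# File extension -> name of the loader class listed above
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".csv": "CSVLoader",
    ".pdf": "PyPDFLoader",
}


def pick_loader_name(source):
    """Return the name of the loader class that suits a file path or URL."""
    # Web links go to the URL loader regardless of how the path ends
    if urlparse(source).scheme in ("http", "https"):
        return "UnstructuredURLLoader"
    suffix = Path(source).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"No loader known for {source!r}")
    return LOADER_BY_SUFFIX[suffix]


for source in ["nvda_news_1.txt", "movies.csv", "movies.pdf", "https://example.com/article.html"]:
    print(source, "->", pick_loader_name(source))
```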


<span style = "color: red"> Is loading a table with `CSVLoader` the same as loading it with `PyPDFLoader`, or different?</span>

#### Loading *.txt* file

In [23]:
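# The loader classes live in `langchain_community` (see the `type(loader)` output below); the older `langchain.document_loaders` path is deprecated in recent releases
from langchain_community.document_loaders import TextLoader, CSVLoader, PyPDFLoader, UnstructuredURLLoader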

In [4]:
loader = TextLoader("nvda_news_1.txt")
loader.load()



In [5]:
type(loader)

langchain_community.document_loaders.text.TextLoader

In [6]:
dir(loader)

['__abstractmethods__',
 '__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 'alazy_load',
 'aload',
 'autodetect_encoding',
 'encoding',
 'file_path',
 'lazy_load',
 'load',
 'load_and_split']

In [7]:
loader.file_path

'nvda_news_1.txt'

#### Loading CSVs

In [10]:
import pandas as pd

df = pd.read_csv('movies.csv')
df

Unnamed: 0,movie_id,title,industry,release_year,imdb_rating,studio,language_id,budget,revenue,unit,currency
0,101,K.G.F: Chapter 2,Bollywood,2022,8.4,Hombale Films,3,1,12.5,Billions,INR
1,102,Doctor Strange in the Multiverse of Madness,Hollywood,2022,7.0,Marvel Studios,5,200,954.8,Millions,USD
2,103,Thor: The Dark World,Hollywood,2013,6.8,Marvel Studios,5,165,644.8,Millions,USD
3,104,Thor: Ragnarok,Hollywood,2017,7.9,Marvel Studios,5,180,854,Millions,USD
4,105,Thor: Love and Thunder,Hollywood,2022,6.8,Marvel Studios,5,250,670,Millions,USD
5,106,Sholay,Bollywood,1975,8.1,United Producers,1,Not Available,Not Available,Not Available,Not Available
6,107,Dilwale Dulhania Le Jayenge,Bollywood,1995,8.0,Yash Raj Films,1,400,2000,Millions,INR
7,108,3 Idiots,Bollywood,2009,8.4,Vinod Chopra Films,1,550,4000,Millions,INR
8,109,Kabhi Khushi Kabhie Gham,Bollywood,2001,7.4,Dharma Productions,1,390,1360,Millions,INR


In [8]:
loader = CSVLoader(file_path="movies.csv")
data = loader.load()
data

[Document(metadata={'source': 'movies.csv', 'row': 0}, page_content='movie_id: 101\ntitle: K.G.F: Chapter 2\nindustry: Bollywood\nrelease_year: 2022\nimdb_rating: 8.4\nstudio: Hombale Films\nlanguage_id: 3\nbudget: 1\nrevenue: 12.5\nunit: Billions\ncurrency: INR'),
 Document(metadata={'source': 'movies.csv', 'row': 1}, page_content='movie_id: 102\ntitle: Doctor Strange in the Multiverse of Madness\nindustry: Hollywood\nrelease_year: 2022\nimdb_rating: 7\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 200\nrevenue: 954.8\nunit: Millions\ncurrency: USD'),
 Document(metadata={'source': 'movies.csv', 'row': 2}, page_content='movie_id: 103\ntitle: Thor: The Dark World\nindustry: Hollywood\nrelease_year: 2013\nimdb_rating: 6.8\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 165\nrevenue: 644.8\nunit: Millions\ncurrency: USD'),
 Document(metadata={'source': 'movies.csv', 'row': 3}, page_content='movie_id: 104\ntitle: Thor: Ragnarok\nindustry: Hollywood\nrelease_year: 2017\nimdb_rating: 7.9\

In [9]:
data[0]

Document(metadata={'source': 'movies.csv', 'row': 0}, page_content='movie_id: 101\ntitle: K.G.F: Chapter 2\nindustry: Bollywood\nrelease_year: 2022\nimdb_rating: 8.4\nstudio: Hombale Films\nlanguage_id: 3\nbudget: 1\nrevenue: 12.5\nunit: Billions\ncurrency: INR')

In [11]:
data[0].metadata

{'source': 'movies.csv', 'row': 0}

We can see that for every row the `source` in the metadata is `movies.csv`. Maybe we want to change that. Maybe we want the `title` as the source for each row.

In [12]:
loader = CSVLoader(file_path="movies.csv", source_column="title")
data = loader.load()
data

[Document(metadata={'source': 'K.G.F: Chapter 2', 'row': 0}, page_content='movie_id: 101\ntitle: K.G.F: Chapter 2\nindustry: Bollywood\nrelease_year: 2022\nimdb_rating: 8.4\nstudio: Hombale Films\nlanguage_id: 3\nbudget: 1\nrevenue: 12.5\nunit: Billions\ncurrency: INR'),
 Document(metadata={'source': 'Doctor Strange in the Multiverse of Madness', 'row': 1}, page_content='movie_id: 102\ntitle: Doctor Strange in the Multiverse of Madness\nindustry: Hollywood\nrelease_year: 2022\nimdb_rating: 7\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 200\nrevenue: 954.8\nunit: Millions\ncurrency: USD'),
 Document(metadata={'source': 'Thor: The Dark World', 'row': 2}, page_content='movie_id: 103\ntitle: Thor: The Dark World\nindustry: Hollywood\nrelease_year: 2013\nimdb_rating: 6.8\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 165\nrevenue: 644.8\nunit: Millions\ncurrency: USD'),
 Document(metadata={'source': 'Thor: Ragnarok', 'row': 3}, page_content='movie_id: 104\ntitle: Thor: Ragnarok\nindus

Now we can see from the output above that the metadata `source` has been updated.

In [14]:
data[0].metadata 

{'source': 'K.G.F: Chapter 2', 'row': 0}

In [16]:
data[0].metadata.get('source') # `get` works for dictionary

'K.G.F: Chapter 2'

In [18]:
data[5].page_content

'movie_id: 106\ntitle: Sholay\nindustry: Bollywood\nrelease_year: 1975\nimdb_rating: 8.1\nstudio: United Producers\nlanguage_id: 1\nbudget: Not Available\nrevenue: Not Available\nunit: Not Available\ncurrency: Not Available'

If we look closely, we can see that `page_content` holds each CSV column name, then a colon, then the value from that row, so every row is converted into plain text. This conversion attaches the context (column name) to each cell so that the LLM can interpret it.
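
The row-to-text conversion can be imitated with the standard `csv` module. This is a minimal sketch of the idea, not the actual `CSVLoader` implementation: `rows_to_documents` is a hypothetical helper, and plain dictionaries stand in for Langchain's `Document` objects.

```python
import csv
import io


def rows_to_documents(csv_text, source, source_column=None):
    """Turn each CSV row into a dict with 'column: value' text and metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # Pair every cell with its column name so the value keeps its context
        page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
        # Use the chosen column as the source if one is given, else the file name
        row_source = row[source_column] if source_column else source
        documents.append(
            {"page_content": page_content, "metadata": {"source": row_source, "row": row_index}}
        )
    return documents


sample = "movie_id,title,industry\n101,K.G.F: Chapter 2,Bollywood\n106,Sholay,Bollywood\n"
docs = rows_to_documents(sample, source="movies.csv", source_column="title")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Passing `source_column="title"` mirrors the `source_column` argument used with `CSVLoader` earlier: the metadata then names the movie instead of the file.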

<span style = 'color: yellow'>Maybe we should check how this data would look if we had loaded it from a PDF instead of a CSV. Would it still give us text in the context: cell value form?</span>

#### Load Table as PDF and compare it with CSVLoader

In [19]:
loader = PyPDFLoader(file_path='movies.pdf')
data = loader.load()
data

[Document(metadata={'source': 'movies.pdf', 'page': 0}, page_content='movie_id title industry release_yearimdb_ratingstudio language_idbudget revenue unit\n101 K.G.F: Chapter 2Bollywood 2022 8.4 Hombale Films 3 1 12.5 Billions\n102 Doctor Strange in the Multiverse of MadnessHollywood 2022 7 Marvel Studios 5 200 954.8 Millions\n103 Thor: The Dark WorldHollywood 2013 6.8 Marvel Studios 5 165 644.8 Millions\n104 Thor: RagnarokHollywood 2017 7.9 Marvel Studios 5 180 854 Millions\n105 Thor: Love and ThunderHollywood 2022 6.8 Marvel Studios 5 250 670 Millions\n106 Sholay Bollywood 1975 8.1 United Producers 1 Not AvailableNot AvailableNot Available\n107 Dilwale Dulhania Le JayengeBollywood 1995 8 Yash Raj Films 1 400 2000 Millions\n108 3 Idiots Bollywood 2009 8.4 Vinod Chopra Films 1 550 4000 Millions\n109 Kabhi Khushi Kabhie GhamBollywood 2001 7.4 Dharma Productions1 390 1360 Millions'),
 Document(metadata={'source': 'movies.pdf', 'page': 1}, page_content='currency\nINR\nUSD\nUSD\nUSD\nUSD\n

In [21]:
data[0].page_content

'movie_id title industry release_yearimdb_ratingstudio language_idbudget revenue unit\n101 K.G.F: Chapter 2Bollywood 2022 8.4 Hombale Films 3 1 12.5 Billions\n102 Doctor Strange in the Multiverse of MadnessHollywood 2022 7 Marvel Studios 5 200 954.8 Millions\n103 Thor: The Dark WorldHollywood 2013 6.8 Marvel Studios 5 165 644.8 Millions\n104 Thor: RagnarokHollywood 2017 7.9 Marvel Studios 5 180 854 Millions\n105 Thor: Love and ThunderHollywood 2022 6.8 Marvel Studios 5 250 670 Millions\n106 Sholay Bollywood 1975 8.1 United Producers 1 Not AvailableNot AvailableNot Available\n107 Dilwale Dulhania Le JayengeBollywood 1995 8 Yash Raj Films 1 400 2000 Millions\n108 3 Idiots Bollywood 2009 8.4 Vinod Chopra Films 1 550 4000 Millions\n109 Kabhi Khushi Kabhie GhamBollywood 2001 7.4 Dharma Productions1 390 1360 Millions'


<span style = 'color: yellow'>We can see that the conversion of a table in a PDF is not ideal: column names and values run together, and the `currency` column ends up on a separate page. So, if we need to process a PDF containing tables, we should design the workflow carefully, because we want a high-quality conversion of the PDF information into text that an LLM can read and understand.</span>

#### UnstructuredURLLoader

In [22]:
## Install required libraries
# !pip3 install unstructured libmagic python-magic python-magic-bin

In [24]:
loader = UnstructuredURLLoader(
    urls = [
        "https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html",
        "https://www.moneycontrol.com/news/business/markets/market-corrects-post-rbi-ups-inflation-forecast-icrr-bet-on-these-top-10-rate-sensitive-stocks-ideas-11142611.html"
    ]
)

In [25]:
data = loader.load()
len(data)

2

In [26]:
data[0].page_content[0:100]

'English\n\nHindi\n\nGujarati\n\nSpecials\n\nHello, Login\n\nHello, Login\n\nLog-inor Sign-Up\n\nMy Account\n\nMy Pro'

In [27]:
data[0].metadata

{'source': 'https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html'}

#### Text Splitters

Why do we need text splitters in the first place?

LLMs have token limits. Hence we need to split large text into small chunks so that each chunk is under the token limit. Langchain has various text splitter classes that allow us to do this.

**Later we will see that we also merge chunks. Why? Keep reading!**
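
To get a feel for the token limit, a rough check can be sketched as below. It assumes about 4 characters per token, which is only a common rule of thumb for English text. For exact counts, use the tokenizer of the specific model. `estimate_tokens` and `fits_in_context` are hypothetical helper names.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English text, not exact


def estimate_tokens(text):
    """Estimate the number of tokens from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text, token_limit):
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(text) <= token_limit


sample = "Interstellar is a 2014 epic science fiction film."
print(estimate_tokens(sample), fits_in_context(sample, token_limit=100))
```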

In [28]:
# Taking some random text from wikipedia

text = """Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan. 
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. 
Set in a dystopian future where humanity is embroiled in a catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for humankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007 and was originally set to be directed by Steven Spielberg. 
Kip Thorne, a Caltech theoretical physicist and 2017 Nobel laureate in Physics,[4] was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar. 
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm. Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles. 
Interstellar uses extensive practical and miniature effects, and the company Double Negative created additional digital effects.

Interstellar premiered in Los Angeles on October 26, 2014. In the United States, it was first released on film stock, expanding to venues using digital projectors. The film received generally positive reviews from critics and grossed over $677 million worldwide ($715 million after subsequent re-releases), making it the tenth-highest-grossing film of 2014. 
It has been praised by astronomers for its scientific accuracy and portrayal of theoretical astrophysics.[5][6][7] Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades."""

Manual approach of splitting text into chunks

In [29]:
# Say the LLM limit is 100 (counting characters here for simplicity, not tokens); in that case we can do a simple thing such as this

text[0:100]

'Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher N'

In [30]:
# But we want complete words, and we want to do this for the entire text; maybe we can use Python's split function

words = text.split(" ")
len(words)

264

In [31]:
chunks = []

s = ""
for word in words:
    s += word + " "
    if len(s)>200:
        chunks.append(s)
        s = ""              # reset
        
chunks.append(s)

In [32]:
for chunk in chunks:
    print(len(chunk))

202
202
201
203
206
201
209
212
139
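
Note that most of the chunks above are slightly longer than 200 characters, because the loop checks the length only after the word has already been appended. A minimal sketch that checks before appending is shown below. `split_into_chunks` is a hypothetical helper, and it uses `str.split()` with no argument, which splits on any whitespace rather than only on a single space.

```python
def split_into_chunks(text, max_chars=200):
    """Group whole words into chunks of at most max_chars characters."""
    chunks = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate  # the word still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            current = word  # a single word longer than max_chars is kept whole
    if current:
        chunks.append(current)
    return chunks


for piece in split_into_chunks("the quick brown fox jumps over the lazy dog", max_chars=10):
    print(len(piece), piece)
```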


**Splitting data into chunks can be done in native Python, but it is a tedious process. You may also need to experiment with various delimiters iteratively to ensure that each chunk does not exceed the token limit of the respective LLM.**

**Langchain provides a better way through text splitter classes.**

#### Using Text Splitter Classes from Langchain
#### CharacterTextSplitter

In [33]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size = 200, 
    chunk_overlap = 10  # overlap between consecutive chunks so that context is not lost at chunk boundaries
)
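
The idea behind `chunk_overlap` can be seen in isolation with a fixed-size sliding window. This is a simplified sketch and not how `CharacterTextSplitter` works internally, since that class splits on a separator first. `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Cut text into fixed-size windows where each window repeats the
    last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk starts with the last character of the previous one, so text near a boundary appears in both chunks.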

In [35]:
chunks = splitter.split_text(text)
len(chunks)

Created a chunk of size 210, which is longer than the specified 200
Created a chunk of size 208, which is longer than the specified 200
Created a chunk of size 358, which is longer than the specified 200


9

In [36]:
for chunk in chunks:
    print(len(chunk))

105
120
210
181
197
207
128
357
253


As you can see, although we gave 200 as the chunk size, the split was based only on `\n`, so it ended up creating chunks that are bigger than 200.
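
Why this happens can be seen in a simplified sketch of single-separator splitting. It ignores `chunk_overlap` and is not the library's actual code. `split_on_single_separator` is a hypothetical helper. Pieces are merged while they fit, but a piece that is already longer than `chunk_size` cannot be broken any further because there is no other separator to fall back on.

```python
def split_on_single_separator(text, separator="\n", chunk_size=200):
    """Split on one separator and merge neighbouring pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece fits, keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # may itself be longer than chunk_size
    if current:
        chunks.append(current)
    return chunks


demo = "aaaa\nbb\ncccccccccccc\ndd"
for chunk in split_on_single_separator(demo, separator="\n", chunk_size=8):
    print(len(chunk), repr(chunk))
```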

Another class from Langchain can be used to recursively split the text based on a list of separators. This class is **RecursiveCharacterTextSplitter**. Let's see how it works.

#### RecursiveCharacterTextSplitter

Here we can give a list of separators. The order of the separators in the list matters.

It starts with the first separator. If a resulting piece fits within `chunk_size`, it moves on to create the next chunk. If a piece is still larger than `chunk_size`, it takes the next separator from the list and divides that piece using it. Let's see.
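
The recursion described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: it ignores `chunk_overlap` and whitespace stripping, and `recursive_split` is a hypothetical helper, not the library's implementation.

```python
def recursive_split(text, separators, chunk_size):
    """Split text with the first separator; recurse with the remaining
    separators on any piece that is still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks = []
    current = ""
    for piece in text.split(separator):
        if len(piece) > chunk_size:
            # Flush what we have, then split the oversized piece more finely
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small pieces while they fit
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "one two three\nfour five\n\nsix seven eight nine ten"
for chunk in recursive_split(demo, ["\n\n", "\n", " "], chunk_size=12):
    print(len(chunk), repr(chunk))
```

Small leftovers such as a single trailing word are not merged across separator levels in this sketch, so some chunks come out much shorter than `chunk_size`.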

In [37]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", " "],  # separators tried in order (the library default is ["\n\n", "\n", " ", ""])
    chunk_size = 200,  # maximum size of each chunk
    chunk_overlap  = 0,  # overlap between consecutive chunks, used to preserve context
    length_function = len  # function used to measure chunk size; `len` counts characters, but any token counter can be passed
)

In [38]:
chunks = r_splitter.split_text(text)

for chunk in chunks:
    print(len(chunk))

105
120
199
10
181
197
198
8
128
191
165
198
54


**Let's understand how exactly it formed these chunks**

In [39]:
first_splitter_chunks = text.split("\n\n") # using the first separator
for chunk in first_splitter_chunks:
    print(len(chunk))

439
719
612


If we consider the first chunk here of size 439, since it is larger than 200, it will use the second separator ("\n") to further split it

In [40]:
first_splitter_chunks = text.split("\n\n") # using the first separator
for chunk in first_splitter_chunks:
    print('--')
    print('Len after first separator ', len(chunk))
    if len(chunk)>200:
        second_splitter_chunks = chunk.split("\n") # using the second separator
        for chunk_ in second_splitter_chunks:
            print(len(chunk_))



--
Len after first separator  439
106
121
210
--
Len after first separator  719
182
198
208
128
--
Len after first separator  612
358
253


Even now we see a few `chunk_` values larger than 200. So, we go to the third separator, which is a space (" ").

When you split on the space, it separates out each word and then merges the words back together so that each chunk's size is close to, but not more than, 200.


In [67]:
first_splitter_chunks = text.split("\n\n") # using the first separator
for chunk in first_splitter_chunks:
    print('--')
    print('Len after first separator ', len(chunk))
    if len(chunk)>200:
        second_splitter_chunks = chunk.split("\n") # using the second separator
        for chunk_ in second_splitter_chunks:
            print(len(chunk_))
            if len(chunk_)>200:
                words_third_splitter = chunk_.split(" ") # using the third separator SPACE - containing just words
                final_chunks = []
                s=""
                for word in words_third_splitter:
                    
                    if len(s) + len(word) + 1 > 199:
                        final_chunks.append(s.strip())
                        s = word + " "  # Start a new chunk with the current word
                    else:
                        s+= word+ " "

                # Append the remaining text as the last chunk
                if s.strip(): # means a non-empty string
                    final_chunks.append(s.strip())  # Append the final chunk without trailing space

                for _chunk_ in final_chunks:
                    print(' ->' , len(_chunk_))


            






--
Len after first separator  439
106
121
210
 -> 195
 -> 14
--
Len after first separator  719
182
198
208
 -> 198
 -> 8
128
--
Len after first separator  612
358
 -> 191
 -> 165
253
 -> 198
 -> 54


We can see that it is very similar to the output of `RecursiveCharacterTextSplitter`. The small differences (for example 106 vs 105) appear to come from trailing whitespace, which the Langchain splitter strips.

Handling every case explicitly, such as the SPACE separator, is tedious to do by hand every time. That is why we use the Langchain class.


But even after using Langchain's `RecursiveCharacterTextSplitter`, I (MG) think codebasics skipped the merging part: there are chunks of size 8, which is too small. So, for an LLM with a context limit of 4096, we could split into chunks of at most 3800, leaving some headroom for merging. We should also use a good number of separators.

If we change the order of the separators, the chunk sizes can change.

In [70]:
# Let's see what happens when the space separator comes first

r_splitter = RecursiveCharacterTextSplitter(
    separators = [" ", "\n\n", "\n"],  # same separators as before, but with the space first
    chunk_size = 200,  # maximum size of each chunk
    chunk_overlap  = 0,  # overlap between consecutive chunks, used to preserve context
    length_function = len  # function used to measure chunk size; `len` counts characters, but any token counter can be passed
)


chunks = r_splitter.split_text(text)

for chunk in chunks:
    print(len(chunk))

196
196
198
199
199
198
195
199
186


Yes, it does!!

In [76]:
# Restore the original separator order so that `chunks` matches the earlier output

r_splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", " "],  # separators tried in order (the library default is ["\n\n", "\n", " ", ""])
    chunk_size = 200,  # maximum size of each chunk
    chunk_overlap  = 0,  # overlap between consecutive chunks, used to preserve context
    length_function = len  # function used to measure chunk size; `len` counts characters, but any token counter can be passed
)


chunks = r_splitter.split_text(text)

for chunk in chunks:
    print(len(chunk))

105
120
199
10
181
197
198
8
128
191
165
198
54


In [74]:
chunk

'Visual Effects, and received numerous other accolades.'

My implementation

In [81]:
def merge_text_chunks(chunks, size_limit=256, min_size=50):
    """
    Merge neighbouring text chunks while respecting the size limit.

    Parameters:
        chunks (list[str]): List of text chunks.
        size_limit (int): Maximum allowed size for a merged chunk.
        min_size (int): Size below which a chunk is considered too small.
            Note: any chunk that fits is merged, so this value does not
            currently change the result.

    Returns:
        list[str]: List of adjusted text chunks.
    """
    merged_chunks = []
    current_chunk = ""  # Accumulates chunks until the next one no longer fits

    for chunk in chunks:
        # Account for the joining space when checking the size limit
        joined_len = len(current_chunk) + len(chunk) + (1 if current_chunk else 0)
        if joined_len <= size_limit:
            # The chunk fits: merge it into the current chunk
            current_chunk += (" " if current_chunk else "") + chunk
        else:
            # The chunk does not fit: save the current chunk and start a new one
            if current_chunk:
                merged_chunks.append(current_chunk)
            current_chunk = chunk

    # Add the last chunk if there's any leftover
    if current_chunk:
        merged_chunks.append(current_chunk)

    return merged_chunks


# Adjust chunks
adjusted_chunks = merge_text_chunks(chunks, size_limit=200, min_size=50)
print("Adjusted chunks:")
for idx, chunk in enumerate(adjusted_chunks, start=1):
    print(f"Chunk {idx} ({len(chunk)} chars): {chunk}")


Adjusted chunks:
Chunk 1 (105 chars): Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
Chunk 2 (120 chars): It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Chunk 3 (199 chars): Set in a dystopian future where humanity is embroiled in a catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for
Chunk 4 (192 chars): humankind. Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007 and was originally set to be directed by Steven Spielberg.
Chunk 5 (197 chars): Kip Thorne, a Caltech theoretical physicist and 2017 Nobel laureate in Physics,[4] was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Chunk 6 (198 chars): Cinematographer Hoyte van Hoytema shot 