In [None]:
!python -m pip install -r requirements.txt

## Extracting the knowledge base

This example extracts textual data from the man page of the curl command. Later, we will build an LLM agent that answers questions about using curl and its options.
Man pages are plain text, so they are not difficult to process. However, we will still do some processing to split the text into manageable pieces.

In [2]:
# load the file into memory
file_name = './man_curl.txt'
with open(file_name, 'r') as text_file:
    contents = text_file.read()
    print(contents[:400])

curl(1)                           curl Manual                          curl(1)

NAME
       curl - transfer a URL

SYNOPSIS
       curl [options / URLs]

DESCRIPTION
       curl is a tool for transferring data from or to a server using URLs. It
       supports these protocols: DICT, FILE, FTP, FTPS, GOPHER, GOPHERS, HTTP,
       HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S, RTMP, RTMPS, RTSP


## Chunk splitting

Since these chunks will be fed to an LLM as context, some care needs to be taken with the boundaries between chunks. For instance, if a chunk ends mid-sentence or mid-word, important information might be omitted, and the resulting context might not be enough for an LLM to answer a question.
Similarly, a chunk can be syntactically correct and still fail to convey the full meaning of the part of the text it comes from. This also degrades the quality of the context, and therefore of the answers.
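To make the first problem concrete, here is a minimal sketch of what happens when text is cut at a fixed width with no regard for boundaries. The `sample` string and the `naive_chunks` helper are hypothetical, introduced only for illustration. They are not part of this notebook's pipeline.

```python
# Hypothetical sample text, standing in for a slice of the man page.
sample = (
    "curl is a tool for transferring data from or to a server using URLs. "
    "curl is powered by libcurl for all transfer-related features."
)


def naive_chunks(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring word and sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


pieces = naive_chunks(sample, 40)
for piece in pieces:
    # repr() makes the cut points visible, including leading/trailing spaces
    print(repr(piece))
```

The first piece stops in the middle of the word "from", so neither it nor the next piece contains the complete sentence. This is the kind of boundary a smarter splitter should avoid.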

### Semantic text splitter

The following, rather simplistic approach mainly tackles the first problem above. The `semantic_text_splitter` library splits the text into pieces of roughly equal size and chooses the boundaries so that each chunk remains meaningful on its own.
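As a rough intuition for boundary-aware splitting, the sketch below greedily packs whole paragraphs into chunks up to a character limit. It is a simplified stand-in and not the algorithm `semantic_text_splitter` uses. The `pack_paragraphs` helper and the `sample` text are hypothetical names used only for this illustration.

```python
def pack_paragraphs(text: str, max_characters: int) -> list[str]:
    """Greedily merge blank-line-separated paragraphs into chunks.

    Each chunk stays at or below max_characters where possible. A single
    paragraph longer than the limit is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_characters:
            current = candidate  # the paragraph still fits in the current chunk
        else:
            if current:
                chunks.append(current)  # close the current chunk
            current = paragraph  # start a new chunk with this paragraph
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample, shaped like the start of a man page.
sample = (
    "NAME\n       curl - transfer a URL\n\n"
    "SYNOPSIS\n       curl [options / URLs]\n\n"
    "DESCRIPTION\n       curl is a tool for transferring data from or to a "
    "server using URLs."
)

demo_chunks = pack_paragraphs(sample, 80)
for index, chunk in enumerate(demo_chunks):
    print(f"--- chunk {index} ({len(chunk)} characters) ---")
    print(chunk)
```

Every chunk here ends on a paragraph boundary, at the cost of uneven chunk sizes. A dedicated splitter can fall back to finer boundaries, such as sentences or words, when a paragraph is too long.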

In [5]:
from semantic_text_splitter import TextSplitter

# Maximum number of characters in a chunk
max_characters = 2000
# Create the splitter with default settings (by default it trims whitespace from each chunk)
splitter = TextSplitter()

chunks = splitter.chunks(contents, max_characters)
print(f'First chunk: {chunks[0]}')

First chunk: curl(1)                           curl Manual                          curl(1)

NAME
       curl - transfer a URL

SYNOPSIS
       curl [options / URLs]

DESCRIPTION
       curl is a tool for transferring data from or to a server using URLs. It
       supports these protocols: DICT, FILE, FTP, FTPS, GOPHER, GOPHERS, HTTP,
       HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S, RTMP, RTMPS, RTSP,
       SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET, TFTP, WS and WSS.

       curl is powered by libcurl for all transfer-related features. See
       libcurl(3) for details.

URL
       The URL syntax is protocol-dependent. You find a detailed description
       in RFC 3986.

       If you provide a URL without a leading protocol:// scheme, curl guesses
       what protocol you want. It then defaults to HTTP but assumes others
       based on often-used host name prefixes. For example, for host names
       starting with "ftp." curl assumes you want FTP.

       You can specify a