In [1]:
from dotenv import load_dotenv, find_dotenv
from doc_chat.rag.document_loader import DocumentLoader
from doc_chat.rag.vector_store import VectorStore

_ = load_dotenv(find_dotenv())

In [2]:
file_path = "../data/pdf/The Hundred-Page Machine Learning Book.pdf"
doc_loader = DocumentLoader(file_path)
documents, splits = doc_loader.load_and_split()
print(f"Number of documents: {len(documents)}")
print(f"Number of splits: {len(splits)}")

Number of documents: 152
Number of splits: 385


In [3]:
# inspect the first loaded document: one Document per PDF page (metadata plus page text)
documents[0]

Document(metadata={'producer': '3-Heights(TM) PDF Optimization Shell 4.8.25.2 (http://www.pdf-tools.com)', 'creator': 'PyPDF', 'creationdate': '2018-12-18T05:07:46+00:00', 'moddate': '2019-01-22T19:51:34+00:00', 'source': '../data/pdf/The Hundred-Page Machine Learning Book.pdf', 'total_pages': 152, 'page': 0, 'page_label': '1'}, page_content='The\nHundred-\nPage\nMachine\nLearning\nBook\nAndriy Burkov')

In [4]:
# print first 300 characters of the third page
documents[2].page_content[:300]

'Preface\nLet’s start by telling the truth: machines don’t learn. What a typical “learning machine”\ndoes, is ﬁnding a mathematical formula, which, when applied to a collection of inputs (called\n“training data”), produces the desired outputs. This mathematical formula also generates the\ncorrect outputs'

In [5]:
# inspect the first split: same Document structure as a loaded page
splits[0]

Document(metadata={'producer': '3-Heights(TM) PDF Optimization Shell 4.8.25.2 (http://www.pdf-tools.com)', 'creator': 'PyPDF', 'creationdate': '2018-12-18T05:07:46+00:00', 'moddate': '2019-01-22T19:51:34+00:00', 'source': '../data/pdf/The Hundred-Page Machine Learning Book.pdf', 'total_pages': 152, 'page': 0, 'page_label': '1'}, page_content='The\nHundred-\nPage\nMachine\nLearning\nBook\nAndriy Burkov')

In [6]:
splits[2].page_content

'Preface\nLet’s start by telling the truth: machines don’t learn. What a typical “learning machine”\ndoes, is ﬁnding a mathematical formula, which, when applied to a collection of inputs (called\n“training data”), produces the desired outputs. This mathematical formula also generates the\ncorrect outputs for most other inputs (distinct from the training data) on the condition that\nthose inputs come from the same or a similar statistical distribution as the one the training\ndata was drawn from.\nWhy isn’t that learning? Because if you slightly distort the inputs, the output is very likely\nto become completely wrong. It’s not how learning in animals works. If you learned to play\na video game by looking straight at the screen, you would still be a good player if someone\nrotates the screen slightly. A machine learning algorithm, if it was trained by “looking”\nstraight at the screen, unless it was also trained to recognize rotation, will fail to play the\ngame on a rotated screen.'

In [7]:
splits[3].page_content

'straight at the screen, unless it was also trained to recognize rotation, will fail to play the\ngame on a rotated screen.\nSo why the name “machine learning” then? The reason, as is often the case, is marketing:\nArthur Samuel, an American pioneer in the ﬁeld of computer gaming and artiﬁcial intelligence,\ncoined the term in 1959 while at IBM. Similarly to how in the 2010s IBM tried to market\nthe term “cognitive computing” to stand out from competition, in the 1960s, IBM used the\nnew cool term “machine learning” to attract both clients and talented employees.\nAs you can see, just like artiﬁcial intelligence is not intelligence, machine learning is not\nlearning. However, machine learning is a universally recognized term that usually refers\nto the science and engineering of building machines capable of doing various useful things\nwithout being explicitly programmed to do so. So, the word “learning” in the term is used\nby analogy with the learning in animals rather than literally
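
The end of splits[2] reappears at the start of splits[3], which suggests the splitter keeps some overlap between neighbouring chunks. The sketch below measures that shared text on short excerpts copied from the two outputs above. The helper `shared_overlap` is a hypothetical name introduced for illustration only; it is not part of `DocumentLoader`.

```python
def shared_overlap(previous_chunk: str, next_chunk: str) -> str:
    """Return the longest suffix of previous_chunk that is also a prefix of next_chunk."""
    max_len = min(len(previous_chunk), len(next_chunk))
    # Try the longest candidate first and shrink until a match is found.
    for size in range(max_len, 0, -1):
        if previous_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""


# Excerpts taken from the outputs of splits[2] and splits[3] above.
end_of_split_2 = (
    "A machine learning algorithm, if it was trained by “looking”\n"
    "straight at the screen, unless it was also trained to recognize rotation, will fail to play the\n"
    "game on a rotated screen."
)
start_of_split_3 = (
    "straight at the screen, unless it was also trained to recognize rotation, will fail to play the\n"
    "game on a rotated screen.\n"
    "So why the name “machine learning” then?"
)

overlap = shared_overlap(end_of_split_2, start_of_split_3)
print(f"Overlap length: {len(overlap)} characters")
print(repr(overlap))
```

In the notebook itself, the same check could be run as `shared_overlap(splits[2].page_content, splits[3].page_content)`.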

In [8]:
# instantiate vector store
vector_store = VectorStore(documents=splits, token="foo")

In [16]:
conversational_chain = vector_store.create_conversational_retrieval_chain(k=2)
response = conversational_chain.invoke("Explain logistic regression?")
print(response['answer'])

 Logistic regression is a type of classification algorithm that uses the standard logistic function to model the relationship between a set of input variables and a binary output variable. By optimizing the values of the input variables and a bias term, the output of the logistic function can be interpreted as the probability of the output variable being positive. A threshold can then be chosen to classify the output as either positive or negative. This threshold may vary depending on the problem at hand. Logistic regression is often used in machine learning for binary classification tasks.


In [17]:
source_docs = response['source_documents']
print(f"Number of source documents: {len(source_docs)}")

for i, doc in enumerate(source_docs):
    print(f"Source document {i}")
    print(f"Page: {doc.metadata['page']}")
    print(f"Content: {doc.page_content[:10]}")
    print("-----")

source_docs

Number of source documents: 2
Source document 0
Page: 32
Content: 3.
By look
-----
Source document 1
Page: 32
Content: 3.
By look
-----


[Document(metadata={'creationdate': '2018-12-18T05:07:46+00:00', 'creator': 'PyPDF', 'moddate': '2019-01-22T19:51:34+00:00', 'page': 32, 'page_label': '33', 'producer': '3-Heights(TM) PDF Optimization Shell 4.8.25.2 (http://www.pdf-tools.com)', 'source': '../data/pdf/The Hundred-Page Machine Learning Book.pdf', 'total_pages': 152}, page_content='3.\nBy looking at the graph of the standard logistic function, we can see how well it ﬁts our\nclassiﬁcation purpose: if we optimize the values ofx and b appropriately, we could interpret\nthe output off(x) as the probability ofyi being positive. For example, if it’s higher than or\nequal to the threshold0.5 we would say that the class ofx is positive; otherwise, it’s negative.\nIn practice, the choice of the threshold could be di\x00erent depending on the problem. We\nreturn to this discussion in Chapter 5 when we talk about model performance assessment.\nSo our logistic regression model looks like this:\nAndriy Burkov The Hundred-Page Machine

In [18]:
source_docs[0]

Document(metadata={'creationdate': '2018-12-18T05:07:46+00:00', 'creator': 'PyPDF', 'moddate': '2019-01-22T19:51:34+00:00', 'page': 32, 'page_label': '33', 'producer': '3-Heights(TM) PDF Optimization Shell 4.8.25.2 (http://www.pdf-tools.com)', 'source': '../data/pdf/The Hundred-Page Machine Learning Book.pdf', 'total_pages': 152}, page_content='3.\nBy looking at the graph of the standard logistic function, we can see how well it ﬁts our\nclassiﬁcation purpose: if we optimize the values ofx and b appropriately, we could interpret\nthe output off(x) as the probability ofyi being positive. For example, if it’s higher than or\nequal to the threshold0.5 we would say that the class ofx is positive; otherwise, it’s negative.\nIn practice, the choice of the threshold could be di\x00erent depending on the problem. We\nreturn to this discussion in Chapter 5 when we talk about model performance assessment.\nSo our logistic regression model looks like this:\nAndriy Burkov The Hundred-Page Machine 

In [19]:
source_docs[1]

Document(metadata={'creationdate': '2018-12-18T05:07:46+00:00', 'creator': 'PyPDF', 'moddate': '2019-01-22T19:51:34+00:00', 'page': 32, 'page_label': '33', 'producer': '3-Heights(TM) PDF Optimization Shell 4.8.25.2 (http://www.pdf-tools.com)', 'source': '../data/pdf/The Hundred-Page Machine Learning Book.pdf', 'total_pages': 152}, page_content='3.\nBy looking at the graph of the standard logistic function, we can see how well it ﬁts our\nclassiﬁcation purpose: if we optimize the values ofx and b appropriately, we could interpret\nthe output off(x) as the probability ofyi being positive. For example, if it’s higher than or\nequal to the threshold0.5 we would say that the class ofx is positive; otherwise, it’s negative.\nIn practice, the choice of the threshold could be di\x00erent depending on the problem. We\nreturn to this discussion in Chapter 5 when we talk about model performance assessment.\nSo our logistic regression model looks like this:\nAndriy Burkov The Hundred-Page Machine 
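
The two source documents above are identical (same page, same text), so with k=2 the chain effectively received a single chunk of context. The sketch below shows one possible way to drop such duplicates after retrieval. It uses plain dictionaries as stand-ins for the `Document` objects, and `deduplicate` is a hypothetical helper, not something `VectorStore` is known to provide.

```python
def deduplicate(docs):
    """Keep the first occurrence of each (page, text) pair, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        key = (doc["page"], doc["page_content"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Stand-ins for the two retrieved documents shown above (text shortened).
retrieved = [
    {"page": 32, "page_content": "3.\nBy looking at the graph of the standard logistic function"},
    {"page": 32, "page_content": "3.\nBy looking at the graph of the standard logistic function"},
]

unique_docs = deduplicate(retrieved)
print(f"Retrieved: {len(retrieved)}, unique: {len(unique_docs)}")
```

With the real objects, the key would be built from `doc.metadata['page']` and `doc.page_content` instead of dictionary lookups.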

In [21]:
# conversational_chain can be used to ask follow-up questions
response = conversational_chain.invoke("Give me the formula")
print(response['answer'])

 The formula for logistic regression is f(x) = 1 / (1 + e^(-x)).


In [22]:
# qa chain does not use the history of the conversation
qa_chain = vector_store.create_qa_chain(k=2)
response = qa_chain.invoke("Explain logistic regression?")
print(response['result'])

 Logistic regression is a type of classification algorithm that uses the standard logistic function to model the relationship between a set of input variables and a binary output variable. By optimizing the values of the input variables and a bias term, the output of the logistic function can be interpreted as the probability of the output variable being positive. A threshold can then be chosen to classify the output as either positive or negative. This threshold may vary depending on the problem at hand. Logistic regression is often used in machine learning for binary classification tasks.


In [23]:
# a follow-up question to the QA chain gets no useful answer, because the chain keeps no conversation history
response = qa_chain.invoke("Give me the formula")
print(response['result'])

 I don't know.
