# Setup

In [1]:
import openai
import os
from dotenv import load_dotenv, find_dotenv
from IPython.display import display, Markdown
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch




In [2]:
_ = load_dotenv(find_dotenv()) # read local .env file
# Read the API key from the OPENAI_API_KEY environment variable and assign it to
# openai.api_key so that calls to the OpenAI service are authenticated.
openai.api_key = os.environ['OPENAI_API_KEY']
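Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier lookup follows; `require_env` and `DEMO_API_KEY` are hypothetical names used only for illustration and are not part of the notebook or of the `openai` library.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper: not part of the notebook or of any library.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Demonstration with a placeholder variable so the sketch runs without a real key.
os.environ["DEMO_API_KEY"] = "placeholder-value"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook this could be used as `openai.api_key = require_env('OPENAI_API_KEY')`, after `load_dotenv` has run.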


# Load file

In [2]:
file = 'data/pdfs/ISLRv2_website.pdf'

loader = PyPDFLoader(file)
document = loader.load()
document[99]

Document(page_content='90 3. Linear Regression\n0 50 100 150200 600 1000 1400IncomeBalance0 50 100 150200 600 1000 1400IncomeBalancestudentnon−student\nFIGURE 3.7.For theCreditdata, the least squares lines are shown for pre-diction ofbalancefromincomefor students and non-students.Left:The model(3.34) was ﬁt. There is no interaction betweenincomeandstudent.Right:Themodel (3.35) was ﬁt. There is an interaction term betweenincomeandstudent.takes the formbalancei≈β0+β1×incomei+/braceleft⎪iggβ2ifith person is a student0 ifith person is not a student=β1×incomei+/braceleft⎪iggβ0+β2ifith person is a studentβ0ifith person is not a student.(3.34)Notice that this amounts to ﬁtting two parallel lines to the data, one forstudents and one for non-students. The lines for students and non-studentshave diﬀerent intercepts,β0+β2versusβ0, but the same slope,β1. Thisis illustrated in the left-hand panel of Figure 3.7. The fact that the linesare parallel means that the average eﬀect onbalanceof a one-unit 

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_documents(document)
splits[99]

Document(page_content='FIGURE 2.9.Left:Data simulated fromf, shown in black. Three estimates offare shown: the linear regression line (orange curve), and two smoothing splineﬁts (blue and green curves).Right:Training MSE (grey curve), test MSE (redcurve), and minimum possible test MSE over all methods (dashed line). Squaresrepresent the training and test MSEs for the three ﬁts shown in the left-handpanel.since the training MSE and the test MSE appear to be closely related.Unfortunately, there is a fundamental problem with this strategy: thereis no guarantee that the method with the lowest training MSE will alsohave the lowest test MSE. Roughly speaking, the problem is that manystatistical methods speciﬁcally estimate coeﬃcients so as to minimize thetraining set MSE. For these methods, the training set MSE can be quitesmall, but the test MSE is often much larger.Figure 2.9 illustrates this phenomenon on a simple example. In the left-hand panel of Figure 2.9, we have generated observatio
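To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using a fixed-size sliding window over characters. This is an assumption-laden illustration, not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraphs and newlines. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; the LangChain splitter additionally
    prefers to break on natural separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters of toy text
chunks = sliding_window_chunks(sample, chunk_size=20, chunk_overlap=5)
```

Each chunk shares its last `chunk_overlap` characters with the start of the next one, so a sentence cut at a chunk boundary still appears intact in at least one chunk when it is shorter than the overlap.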

# Convert the split pages into embeddings and store them in a vector store

In [3]:
embeddings = OpenAIEmbeddings()


In [6]:
db = DocArrayInMemorySearch.from_documents(
    splits, 
    embeddings
)


# Query a store

## Pose a query, find pages similar to query

In [13]:
query = """What's the most complicated ML model"""

In [14]:
# similarity_search uses k (default 4) for the number of results; limit is not one of its parameters
similar_pages = db.similarity_search(query, k=10)
similar_pages

[Document(page_content='426 10. Deep LearningOℓ. We use the ﬁnal valueOLto predict the response: the sentiment of thereview.This is a simple RNN, and has relatively few parameters. If there areKhidden units, the common weight matrixWhasK×(m+ 1) parameters,the matrixUhasK×Kparameters, andBhas 2(K+ 1) for the two-classlogistic regression as in (10.15). These are used repeatedly as we processthe sequenceX={Xℓ}L1from left to right, much like we use a singleconvolution ﬁlter to process each patch in an image (Section 10.3.1). If theembedding layerEis learned, that adds an additionalm×Dparameters(D= 10,000 here), and is by far the biggest cost.We ﬁt the RNN as described in Figure 10.12 and the accompaying text totheIMDbdata. The model had an embedding matrixEwithm= 32 (whichwas learned in training as opposed to precomputed), followed by a singlerecurrent layer withK= 32 hidden units. The model was trained withdropout regularization on the 25,000 reviews in the designated trainingset, and ach
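Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. The metric, the toy vectors, and the helper names `cosine_similarity` and `top_k` are all assumptions for illustration; real embeddings come from `OpenAIEmbeddings` and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=4):
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy store of (text, embedding) pairs; the vectors are made up.
toy_store = [
    ("deep learning", [0.9, 0.1, 0.0]),
    ("linear regression", [0.1, 0.9, 0.0]),
    ("neural networks", [0.8, 0.2, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]
nearest = top_k(toy_query, toy_store, k=2)
```

Here `k` plays the same role as in the notebook's `similarity_search` call: it caps how many of the closest chunks are returned.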

## Retriever

In [9]:
retriever = db.as_retriever()


## Query the stored pages

In [15]:
llm = ChatOpenAI(temperature=0.0)

qpages = "".join(page.page_content for page in similar_pages)


In [16]:
response = llm.call_as_llm(f"{qpages} Question: {query}") 
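The prompt above concatenates the retrieved text and the question into one string. A minimal sketch of a more explicit layout follows, assuming it helps the model to see the context and the question as separate, labeled parts and assuming a character budget is an acceptable stand-in for a token limit. `build_prompt` and `max_context_chars` are hypothetical names, not part of LangChain.

```python
def build_prompt(pages, question, max_context_chars=6000):
    """Assemble a prompt with a labeled context section and a labeled question.

    Hypothetical helper; the character budget is a rough proxy for the
    model's token limit, not an exact calculation.
    """
    context = "\n\n".join(pages)           # keep page boundaries visible
    context = context[:max_context_chars]  # crude truncation to the budget
    return f"Context:\n{context}\n\nQuestion: {question}"


example_prompt = build_prompt(["page one", "page two"], "What is X?")
```

With the notebook's variables, this could be called as `build_prompt([p.page_content for p in similar_pages], query)` and the result passed to `llm.call_as_llm`.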


In [17]:
display(Markdown(response))

The most complicated machine learning model is difficult to determine as it depends on various factors such as the problem being solved, the size of the dataset, and the complexity of the features. However, deep learning models, such as deep neural networks with multiple hidden layers, are generally considered to be more complex compared to other machine learning models. These models have a large number of parameters and require significant computational resources for training. Other complex models include recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs).