# Setup

In [1]:
import openai
import os
from dotenv import load_dotenv, find_dotenv
from IPython.display import display, Markdown
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DocArrayInMemorySearch

In [2]:
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
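
Indexing `os.environ` directly raises a bare `KeyError` when the key is missing from the `.env` file. A minimal sketch of a clearer check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of `dotenv` or `openai`.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a readable message.

    Hypothetical helper for illustration; not part of any library used above.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value
```

With this in place, the assignment above could be written as `openai.api_key = require_env("OPENAI_API_KEY")`.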


# Load file

In [7]:
file = 'data/pdfs/ISLRv2_website.pdf'

loader = PyPDFLoader(file)
pages = loader.load_and_split()

In [8]:
pages[99]

Document(page_content='90 3. Linear Regression\n0 50 100 150200 600 1000 1400IncomeBalance0 50 100 150200 600 1000 1400IncomeBalancestudentnon−student\nFIGURE 3.7.For theCreditdata, the least squares lines are shown for pre-diction ofbalancefromincomefor students and non-students.Left:The model(3.34) was ﬁt. There is no interaction betweenincomeandstudent.Right:Themodel (3.35) was ﬁt. There is an interaction term betweenincomeandstudent.takes the formbalancei≈β0+β1×incomei+/braceleft⎪iggβ2ifith person is a student0 ifith person is not a student=β1×incomei+/braceleft⎪iggβ0+β2ifith person is a studentβ0ifith person is not a student.(3.34)Notice that this amounts to ﬁtting two parallel lines to the data, one forstudents and one for non-students. The lines for students and non-studentshave diﬀerent intercepts,β0+β2versusβ0, but the same slope,β1. Thisis illustrated in the left-hand panel of Figure 3.7. The fact that the linesare parallel means that the average eﬀect onbalanceof a one-unit 
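
The extracted text above contains typographic ligatures such as `ﬁ` and `ﬀ`, which can make otherwise identical words look different to downstream tooling. One possible cleanup step, sketched here under the assumption that Unicode NFKC normalization is acceptable for this text (it also folds other compatibility characters, such as superscripts), uses a hypothetical helper named `clean_page_text`. It does not restore the spaces lost during PDF extraction.

```python
import unicodedata


def clean_page_text(text: str) -> str:
    """Fold compatibility characters (e.g. the 'ﬁ' ligature) into plain letters.

    Hypothetical helper for illustration; not part of PyPDFLoader.
    """
    # NFKC maps U+FB01 (ﬁ) to "fi" and U+FB00 (ﬀ) to "ff".
    return unicodedata.normalize("NFKC", text)


print(clean_page_text("The model was ﬁt; the lines have diﬀerent intercepts."))
```

It could be applied to each page before embedding, for example by assigning `clean_page_text(page.page_content)` back to `page.page_content`.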

# Convert the pages into embeddings and store them in a vector store

In [9]:
embeddings = OpenAIEmbeddings()


In [12]:
db = DocArrayInMemorySearch.from_documents(
    pages, 
    embeddings
)



# Query the store

## Pose a query and find similar pages

In [13]:
query = """Explain bias-variance tradeoff for a linear regression model."""

In [14]:
similar_pages = db.similarity_search(query, k=10)  # k sets how many pages to return
similar_pages

[Document(page_content='2.2 Assessing Model Accuracy 35one of these data points may cause the estimateˆfto change considerably.In contrast, the orange least squares line is relatively inﬂexible and has lowvariance, because moving any single observation will likely cause only asmall shift in the position of the line.On the other hand,biasrefers to the error that is introduced by approxi-mating a real-life problem, which may be extremely complicated, by a muchsimpler model. For example, linear regression assumes that there is a linearrelationship betweenYandX1,X2,...,Xp. It is unlikely that any real-lifeproblem truly has such a simple linear relationship, and so performing lin-ear regression will undoubtedly result in some bias in the estimate off. InFigure 2.11, the truefis substantially non-linear, so no matter how manytraining observations we are given, it will not be possible to produce anaccurate estimate using linear regression. In other words, linear regressionresults in high bias
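
To make the ranking step concrete, here is a toy sketch of a top-k similarity search over embedding vectors in plain Python. It assumes cosine similarity as the metric and uses made-up two-dimensional vectors; `cosine_similarity` and `top_k` are illustrative names, and this is not the vector store's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy "embeddings" (assumed values, for illustration only).
toy_docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
toy_query = [0.9, 0.1]
print(top_k(toy_query, toy_docs, k=2))  # prints [0, 1]
```

Real embeddings have many more dimensions, but the ranking idea is the same: embed the query, score it against every stored page, and keep the best `k`.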

## Retriever

In [15]:
retriever = db.as_retriever()  # exposes the store through the retriever interface; not used by the cells below


## Query the stored pages

In [16]:
llm = ChatOpenAI(temperature = 0.0)

# Concatenate the retrieved pages, separated by blank lines, to use as context
qpages = "\n\n".join(page.page_content for page in similar_pages)


In [18]:
# Reuse the query variable instead of repeating the question text
response = llm.call_as_llm(f"{qpages}\n\nQuestion: {query}")
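
Concatenating every retrieved page can exceed the model's context window. A minimal sketch of one way to cap the context follows, assuming a simple character budget rather than a true token count; `build_prompt` and the default `max_chars` value are hypothetical choices, not LangChain features.

```python
def build_prompt(page_texts, question, max_chars=12000):
    """Join retrieved page texts into one prompt without exceeding max_chars.

    Hypothetical helper for illustration. The budget counts page text only,
    not the separators or the question.
    """
    context_parts = []
    used = 0
    for text in page_texts:
        if used + len(text) > max_chars:
            break  # drop this page and all lower-ranked pages after it
        context_parts.append(text)
        used += len(text)
    context = "\n\n".join(context_parts)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

It could be used as `llm.call_as_llm(build_prompt([p.page_content for p in similar_pages], query))`. Because the pages arrive ranked by similarity, stopping at the first page that does not fit keeps the most relevant ones.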


In [19]:
display(Markdown(response))

The bias-variance tradeoff refers to the relationship between the bias and variance of a statistical learning model. In the context of a linear regression model, bias refers to the error introduced by approximating a real-life problem with a simpler linear model. It represents the difference between the true relationship between the predictors and the response and the relationship estimated by the linear regression model. On the other hand, variance refers to the variability of the model's predictions when trained on different datasets. It represents the sensitivity of the model to changes in the training data.

In general, more flexible models, such as those with more complex relationships or more parameters, tend to have lower bias but higher variance. This means that they can capture more intricate patterns in the data but are also more likely to overfit the training data and perform poorly on new, unseen data. On the other hand, less flexible models, such as simple linear regression, have higher bias but lower variance. They may not capture all the nuances in the data but are more robust and generalize better to new data.

The tradeoff occurs because as we increase the flexibility of a model, the bias tends to decrease faster than the variance increases, resulting in a decrease in the expected test mean squared error (MSE). However, at some point, increasing flexibility has little impact on the bias but significantly increases the variance, causing the test MSE to increase. The goal is to find the optimal level of flexibility that minimizes the test MSE, striking a balance between bias and variance.

In the case of linear regression, if the true relationship between the predictors and the response is close to linear, the model will have low bias but may have high variance. This means that small changes in the training data can cause large changes in the estimated coefficients. On the other hand, if the true relationship is highly non-linear, a more flexible model may be able to capture the complexity and reduce bias, resulting in a lower test MSE.

Overall, the bias-variance tradeoff highlights the importance of finding the right level of model complexity that balances bias and variance to achieve the best predictive performance.
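
For reference, the tradeoff described in the response corresponds to the standard decomposition of the expected test MSE at a point $x_0$. The notation below is an assumption: $\hat{f}$ is the fitted model, $y_0$ the observed response at $x_0$, and $\varepsilon$ the irreducible error term.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \operatorname{Var}\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}(\varepsilon)
```

The first two terms depend on the model's flexibility and pull in opposite directions, while $\operatorname{Var}(\varepsilon)$ sets a floor that no model can go below.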