* https://docs.ragas.io/en/latest/concepts/testset_generation.html
* https://github.com/explodinggradients/ragas/blob/main/src/ragas/testset/generator.py

# Try with OpenAI LLM and embeddings

In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from tqdm import tqdm  # used by load_pdfs below

def load_pdfs(file_paths):
    """
    Load each PDF in file_paths (every path must end with .pdf).

    PyPDFLoader splits a PDF into pages and returns one Document per page,
    with the page number stored in the metadata. The split is not exact:
    the recorded page number can be off by one or two pages.

    Returns a dict mapping each file path to its list of Document objects.
    """
    documents_dict = {}
    for f in tqdm(file_paths):
        loader = PyPDFLoader(file_path=f)
        documents = loader.load()
        documents_dict[f] = documents
    return documents_dict

def chunk_list_of_documents(documents):
    """
    Split a list of Document objects into smaller chunks.

    Returns the chunks as a list of Document objects.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100,  # 20% of chunk_size is a reasonable starting point
        length_function=len,
        is_separator_regex=False,
        add_start_index=True,
    )

    chunks = text_splitter.split_documents(documents)
    return chunks
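
The effect of chunk_size=500 and chunk_overlap=100 can be illustrated with a much simpler splitter. The sketch below is not how RecursiveCharacterTextSplitter works internally (that class splits on separators and only aims to stay under chunk_size); it is a fixed-window simplification, and `sliding_window_chunks` is a hypothetical helper name. It only shows how consecutive chunks share 100 characters and how a start index can be recorded per chunk.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "x" * 1200
chunks = sliding_window_chunks(sample)
print([(c["start_index"], len(c["text"])) for c in chunks])
# [(0, 500), (400, 500), (800, 400)]
```

With a 500-character window and 100 characters of overlap, each new chunk starts 400 characters after the previous one, so a 1200-character text yields three chunks.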



In [2]:
import os
import glob
from tqdm import tqdm

# folder_path = "/Users/i748920/Desktop/llms-learning/pdf-chatbot-app/data/short-elements-of-statistical-learning-book"
# os.path.exists(folder_path)
# file_paths = glob.glob(f"{folder_path}/*.pdf")
documents_dict = load_pdfs(
    file_paths=[
        "/Users/i748920/Desktop/llms-learning/pdf-chatbot-app/data/short-chap18-elements-of-statistical-learning-book/chap18 copy.pdf"
    ]
)

100%|██████████| 1/1 [00:00<00:00,  6.58it/s]
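
The commented-out lines in the cell above collect every PDF in a folder with glob. A minimal alternative sketch using pathlib is shown here; `find_pdf_paths` is a hypothetical helper, not part of the notebook, and it enforces the "must end with .pdf" requirement from the load_pdfs docstring.

```python
from pathlib import Path


def find_pdf_paths(folder_path):
    """Return the sorted paths (as strings) of the .pdf files directly inside folder_path."""
    folder = Path(folder_path)
    if not folder.is_dir():
        raise FileNotFoundError(f"Not a directory: {folder}")
    return sorted(
        str(p) for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"  # accept .pdf and .PDF
    )
```

The returned list could then be passed as the file_paths argument of load_pdfs.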


In [3]:
# flatten the per-file lists into a single list of page-level Documents
documents = []
for docs in documents_dict.values():
    documents.extend(docs)

len(documents)

3

In [7]:
documents[0]

Document(metadata={'source': '/Users/i748920/Desktop/llms-learning/pdf-chatbot-app/data/short-chap18-elements-of-statistical-learning-book/chap18 copy.pdf', 'page': 0}, page_content='This is page 649\nPrinter: Opaque this\n18\nHigh-Dimensional Problems: p≫N\n18.1 When pis Much Bigger than N\nIn this chapter we discuss prediction problems in which the n umber of\nfeaturespis much larger than the number of observations N, often written\np≫N. Such problems have become of increasing importance, espec ially in\ngenomics and other areas of computational biology. We will s ee that high\nvariance and overﬁtting are a major concern in this setting. As a result,\nsimple, highly regularized approaches often become the met hods of choice.\nThe ﬁrst part of the chapter focuses on prediction in both the classiﬁcation\nand regression settings, while the second part discusses th e more basic\nproblem of feature selection and assessment.\nTo get us started, Figure 18.1 summarizes a small simulation stu

In [9]:
# add a per-page "filename" entry to the metadata (source path plus page number)
for document in documents:
    document.metadata["filename"] = f"{document.metadata['source']} - page: {document.metadata['page']}"

In [11]:
documents[0]

Document(metadata={'source': '/Users/i748920/Desktop/llms-learning/pdf-chatbot-app/data/short-chap18-elements-of-statistical-learning-book/chap18 copy.pdf', 'page': 0, 'filename': '/Users/i748920/Desktop/llms-learning/pdf-chatbot-app/data/short-chap18-elements-of-statistical-learning-book/chap18 copy.pdf - page: 0'}, page_content='This is page 649\nPrinter: Opaque this\n18\nHigh-Dimensional Problems: p≫N\n18.1 When pis Much Bigger than N\nIn this chapter we discuss prediction problems in which the n umber of\nfeaturespis much larger than the number of observations N, often written\np≫N. Such problems have become of increasing importance, espec ially in\ngenomics and other areas of computational biology. We will s ee that high\nvariance and overﬁtting are a major concern in this setting. As a result,\nsimple, highly regularized approaches often become the met hods of choice.\nThe ﬁrst part of the chapter focuses on prediction in both the classiﬁcation\nand regression settings, while the

In [13]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass()

 ········
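
When the notebook is re-run, the key may already be set in the environment. A small sketch that prompts only when needed is shown below; `ensure_api_key` is a hypothetical helper name, and the prompt text is an assumption.

```python
import getpass
import os


def ensure_api_key(var_name="OPENAI_API_KEY"):
    """Prompt for the key only when the environment variable is missing or empty."""
    if not os.environ.get(var_name):
        os.environ[var_name] = getpass.getpass(f"{var_name}: ")
    return bool(os.environ[var_name])


# Usage in a notebook cell (prompts only if the variable is not set yet):
# ensure_api_key()
```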


In [17]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

generator

TestsetGenerator(generator_llm=LangchainLLMWrapper(run_config=RunConfig(timeout=180, max_retries=10, max_wait=60, max_workers=16, exception_types=<class 'openai.RateLimitError'>, log_tenacity=False, seed=42)), critic_llm=LangchainLLMWrapper(run_config=RunConfig(timeout=180, max_retries=10, max_wait=60, max_workers=16, exception_types=<class 'openai.RateLimitError'>, log_tenacity=False, seed=42)), embeddings=<ragas.embeddings.base.LangchainEmbeddingsWrapper object at 0x32c37a0a0>, docstore=InMemoryDocumentStore(splitter=<langchain_text_splitters.base.TokenTextSplitter object at 0x32c37a160>, nodes=[], node_embeddings_list=[], node_map={}, run_config=RunConfig(timeout=180, max_retries=10, max_wait=60, max_workers=16, exception_types=(<class 'Exception'>,), log_tenacity=False, seed=42)))

In [19]:
# generate the test set: 10 questions spread over the three evolution types
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

Filename and doc_id are the same for all nodes.               
Generating: 100%|██████████| 10/10 [08:30<00:00, 51.07s/it] 
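
The distributions argument gives each evolution type a share of test_size. The sketch below computes the share-based counts and compares them with the evolution_type values that appear in the generated test set later in this notebook. The string keys are stand-ins for the ragas evolution objects, `expected_counts` is a hypothetical helper, and the check that the shares sum to 1 is an assumption about how the argument is meant to be used. How ragas turns fractional counts into whole questions is not covered here.

```python
from collections import Counter


def expected_counts(test_size, distributions):
    """Multiply test_size by each share; the shares are assumed to sum to 1."""
    total = sum(distributions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"shares sum to {total}, expected 1.0")
    return {name: test_size * share for name, share in distributions.items()}


distributions = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}
expected = expected_counts(10, distributions)

# evolution_type values as they appear in this notebook's generated test set
observed = Counter(["simple"] * 5 + ["reasoning"] * 2 + ["multi_context"] * 3)

for name in distributions:
    print(f"{name}: expected {expected[name]}, observed {observed[name]}")
# simple: expected 5.0, observed 5
# reasoning: expected 2.5, observed 2
# multi_context: expected 2.5, observed 3
```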


In [20]:
testset

TestDataset(test_data=[DataRow(question='What challenges arise when analyzing high-dimensional data?', contexts=['18.2 Nearest Shrunken Centroids 651\nlow bias. When p= 100, we can identify some non-zero coeﬃcients using\nmoderate shrinkage. Finally, when p= 1000, even though there are many\nnonzero coeﬃcients, we don’t have a hope for ﬁnding them and w e need\nto shrink all the way down. As evidence of this, let tj=ˆβj/ˆsej, whereˆβj\nis the ridge regression estimate and ˆsejits estimated standard error. Then\nusing the optimal ridge parameter in each of the three cases, the median\nvalue of|tj|was 2.0, 0.6 and 0.2, and the average number of |tj|values\nexceeding 2 was equal to 9.8, 1.2 and 0.0.\nRidge regression with λ= 0.001 successfully exploits the correlation in\nthe features when p<N, but cannot do so when p≫N. In the latter case\nthere is not enough information in the relatively small numb er of samples\nto eﬃciently estimate the high-dimensional covariance mat rix. In that cas

In [21]:
test_df = testset.to_pandas()
test_df

,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What challenges arise when analyzing high-dime...,[18.2 Nearest Shrunken Centroids 651\nlow bias...,The analysis of high-dimensional data requires...,simple,[{'source': '/Users/i748920/Desktop/llms-learn...,True
1,What is the purpose of feature selection in hi...,[18.2 Nearest Shrunken Centroids 651\nlow bias...,The purpose of feature selection in high-dimen...,simple,[{'source': '/Users/i748920/Desktop/llms-learn...,True
2,What is the role of Nearest Shrunken Centroids...,[18.2 Nearest Shrunken Centroids 651\nlow bias...,Nearest Shrunken Centroids is a method used fo...,simple,[{'source': '/Users/i748920/Desktop/llms-learn...,True
3,What is the role of a linear model in high-dim...,[This is page 649\nPrinter: Opaque this\n18\nH...,The role of a linear model in high-dimensional...,simple,[{'source': '/Users/i748920/Desktop/llms-learn...,True
4,What challenges arise when analyzing high-dime...,[18.2 Nearest Shrunken Centroids 651\nlow bias...,The analysis of high-dimensional data requires...,simple,[{'source': '/Users/i748920/Desktop/llms-learn...,True
5,What regularization method is used in linear d...,[18.2 Nearest Shrunken Centroids 651\nlow bias...,The answer to given question is not present in...,reasoning,[{'source': '/Users/i748920/Desktop/llms-learn...,True
6,What method achieves feature selection and reg...,[18.2 Nearest Shrunken Centroids 651\nlow bias...,Nearest Shrunken Centroids,reasoning,[{'source': '/Users/i748920/Desktop/llms-learn...,True
7,What is the role of regularized approaches in ...,[This is page 649\nPrinter: Opaque this\n18\nH...,Regularized approaches often become the method...,multi_context,[{'source': '/Users/i748920/Desktop/llms-learn...,True
8,What is the purpose of feature selection in hi...,[18.2 Nearest Shrunken Centroids 651\nlow bias...,Feature selection in high-dimensional data ana...,multi_context,[{'source': '/Users/i748920/Desktop/llms-learn...,True
9,What is the role of regularized approaches in ...,[This is page 649\nPrinter: Opaque this\n18\nH...,Regularized approaches often become the method...,multi_context,[{'source': '/Users/i748920/Desktop/llms-learn...,True
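
The table above contains repeated questions (for example rows 0 and 4) and one row whose ground_truth starts with "The answer to given question is not present". A minimal cleanup sketch follows. It works on plain dicts rather than the DataFrame, `clean_rows` is a hypothetical helper, and the sample rows are made up for illustration; they are not taken from the run.

```python
NOT_ANSWERABLE_PREFIX = "The answer to given question is not present"


def clean_rows(rows):
    """Drop rows without a usable ground truth and keep the first copy of each question."""
    seen_questions = set()
    kept = []
    for row in rows:
        if row["ground_truth"].startswith(NOT_ANSWERABLE_PREFIX):
            continue  # the generator could not answer this question from the context
        if row["question"] in seen_questions:
            continue  # duplicate question
        seen_questions.add(row["question"])
        kept.append(row)
    return kept


# hypothetical rows, for illustration only
rows = [
    {"question": "Q1", "ground_truth": "A1"},
    {"question": "Q1", "ground_truth": "A1"},
    {"question": "Q2", "ground_truth": "The answer to given question is not present in the context"},
    {"question": "Q3", "ground_truth": "A3"},
]
print([row["question"] for row in clean_rows(rows)])
# ['Q1', 'Q3']
```

On the DataFrame itself, the same idea could be expressed with `test_df.drop_duplicates(subset="question")` and a `str.startswith` filter on the ground_truth column.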


In [33]:
test_df.columns

Index(['question', 'contexts', 'ground_truth', 'evolution_type', 'metadata',
       'episode_done'],
      dtype='object')

In [31]:
test_df.iloc[0,0]

'What challenges arise when analyzing high-dimensional data?'

In [35]:
test_df.iloc[0,1]

['18.2 Nearest Shrunken Centroids 651\nlow bias. When p= 100, we can identify some non-zero coeﬃcients using\nmoderate shrinkage. Finally, when p= 1000, even though there are many\nnonzero coeﬃcients, we don’t have a hope for ﬁnding them and w e need\nto shrink all the way down. As evidence of this, let tj=ˆβj/ˆsej, whereˆβj\nis the ridge regression estimate and ˆsejits estimated standard error. Then\nusing the optimal ridge parameter in each of the three cases, the median\nvalue of|tj|was 2.0, 0.6 and 0.2, and the average number of |tj|values\nexceeding 2 was equal to 9.8, 1.2 and 0.0.\nRidge regression with λ= 0.001 successfully exploits the correlation in\nthe features when p<N, but cannot do so when p≫N. In the latter case\nthere is not enough information in the relatively small numb er of samples\nto eﬃciently estimate the high-dimensional covariance mat rix. In that case,\nmore regularization leads to superior prediction performa nce.\nThus it is not surprising that the analysis 

In [37]:
test_df.iloc[0,2]

'The analysis of high-dimensional data requires either modification of procedures designed for the N > p scenario, or entirely new procedures. In the latter case, there is not enough information in the relatively small number of samples to efficiently estimate the high-dimensional covariance matrix. More regularization leads to superior prediction performance.'

In [39]:
test_df.iloc[0,3]

'simple'

In [41]:
test_df.iloc[0,4]

[{'source': '/Users/i748920/Desktop/llms-learning/pdf-chatbot-app/data/short-chap18-elements-of-statistical-learning-book/chap18 copy.pdf',
  'page': 2,
  'filename': '/Users/i748920/Desktop/llms-learning/pdf-chatbot-app/data/short-chap18-elements-of-statistical-learning-book/chap18 copy.pdf - page: 2'}]

In [43]:
test_df.iloc[0,5]

True

In [54]:
for _, row in test_df.iterrows():
    print(row["question"])
    print(row["ground_truth"])
    print(row["evolution_type"])
    print(len(row["contexts"]))
    print()
    print()

What challenges arise when analyzing high-dimensional data?
The analysis of high-dimensional data requires either modification of procedures designed for the N > p scenario, or entirely new procedures. In the latter case, there is not enough information in the relatively small number of samples to efficiently estimate the high-dimensional covariance matrix. More regularization leads to superior prediction performance.
simple
1


What is the purpose of feature selection in high-dimensional data analysis?
The purpose of feature selection in high-dimensional data analysis is to regularize the analysis and improve prediction performance by selecting the most relevant features based on scientific contextual knowledge.
simple
1


What is the role of Nearest Shrunken Centroids in feature selection and regularization in high-dimensional data analysis?
Nearest Shrunken Centroids is a method used for feature selection and regularization in high-dimensional data analysis. It achieves feature sele

In [64]:
test_df.to_csv("sample_ragas_test_set.csv", index=False)
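
One thing to watch when reading this CSV back: the contexts and metadata columns hold Python lists, and a CSV file stores them as their string representation. The sketch below simulates that round trip with the standard csv module and restores the lists with ast.literal_eval. `parse_list_columns` is a hypothetical helper and the row content is made up for illustration.

```python
import ast
import csv
import io


def parse_list_columns(row, list_columns=("contexts", "metadata")):
    """Turn stringified Python literals (e.g. "['a', 'b']") back into real objects."""
    parsed = dict(row)
    for column in list_columns:
        parsed[column] = ast.literal_eval(parsed[column])
    return parsed


# simulate what ends up in the CSV: list cells are written as their str() form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "contexts", "metadata"])
writer.writeheader()
writer.writerow({
    "question": "hypothetical question",
    "contexts": str(["chunk one", "chunk two"]),
    "metadata": str([{"page": 2}]),
})

buffer.seek(0)
raw = next(csv.DictReader(buffer))
restored = parse_list_columns(raw)
print(type(raw["contexts"]).__name__, type(restored["contexts"]).__name__)
# str list
```

With pandas, the same conversion could be applied at load time through the converters argument of read_csv, for example `converters={"contexts": ast.literal_eval}`.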