## **Method1**

Here, we use LlamaParse with GPT-4o to parse PDF documents into markdown files.

Then, we chunk the markdown files using the MarkdownElementNodeParser.

### 1) Loading Documents

#### 1.1 Quizzes and Exams

In [2]:
from llama_parse import LlamaParse

In [148]:
# Get started with parsing PDF files: 
import nest_asyncio
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

nest_asyncio.apply()

openai_api_key = os.getenv("OPENAI_API_KEY")
llama_cloud_api_key = os.getenv("LLAMA_CLOUD_API_KEY")

In [4]:
# Instructions for Quizzes and Exams
parsingInstruction_QE = """The provided document is an exam or quiz for a course on introductory computer science and programming. 
There are different types of questions, including multiple choice, fill in the blank, and short answer questions. 
Answers to the questions may or may not be provided. 
The goal for you is to extract the concepts tested on the exam/quiz and what is being asked by each question. 
The goal is to help the students solve the questions by teaching the concept instead of giving them the answers.
"""

In [5]:
withInstructionParsing = LlamaParse(
                                result_type="markdown", 
                                parsing_instruction=parsingInstruction_QE,
                                gpt4o_mode=True
                            )

In [10]:
# List all pdf files in a directory
import glob
QE = glob.glob("GSU - CIS 3260 - Fall 2023/Quizzes and Exams/*.pdf")  # Adjust the path to your directory containing the PDF files

In [138]:
QE[3:]

['GSU - CIS 3260 - Fall 2023/Quizzes and Exams/Quiz 3 with KEY.pdf',
 'GSU - CIS 3260 - Fall 2023/Quizzes and Exams/Quiz 1 with KEY.pdf',
 'GSU - CIS 3260 - Fall 2023/Quizzes and Exams/Quiz 2 with KEY.pdf',
 'GSU - CIS 3260 - Fall 2023/Quizzes and Exams/Quiz 4 with KEY.pdf']

In [139]:
import pickle
for pdf_file in QE[3:]:
    # Generate a unique name for each .pkl file based on the PDF file name
    pkl_filename = os.path.basename(pdf_file).replace('.pdf', '.pkl')
    pkl_filepath = os.path.join("GSU - CIS 3260 - Fall 2023/Quizzes and Exams", pkl_filename)

    # Load and parse the data
    document = withInstructionParsing.load_data(pdf_file) 
    # Save the parsed document to a .pkl file
    with open(pkl_filepath, 'wb') as f:
        pickle.dump(document, f)
    print(f"Saved parsed document to {pkl_filepath}")

Started parsing the file under job_id db4819ef-fc07-4d27-a67c-ddae17a551e3
Saved parsed document to GSU - CIS 3260 - Fall 2023/Quizzes and Exams/Quiz 3 with KEY.pkl
Started parsing the file under job_id bc444d59-bf49-4c47-96ff-a637354f5a0c
Saved parsed document to GSU - CIS 3260 - Fall 2023/Quizzes and Exams/Quiz 1 with KEY.pkl
Started parsing the file under job_id d6b4f634-2595-46f6-b0cf-e5dcd2026455
Saved parsed document to GSU - CIS 3260 - Fall 2023/Quizzes and Exams/Quiz 2 with KEY.pkl
Started parsing the file under job_id 65c6867d-4081-4cc7-ad33-130221358fa6
Saved parsed document to GSU - CIS 3260 - Fall 2023/Quizzes and Exams/Quiz 4 with KEY.pkl
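
Since every call to the parser starts a remote job, it can help to skip files that already have a pickle next to them. The following is a minimal sketch and not part of the original notebook: `parse_with_cache` and its `parse_fn` argument are hypothetical names, and `parse_fn` stands in for a call such as `withInstructionParsing.load_data`.

```python
import pickle
from pathlib import Path


def parse_with_cache(pdf_path, parse_fn):
    """Return the parsed result for pdf_path, reusing a sibling .pkl if present.

    parse_fn is a hypothetical stand-in for a parser call such as
    withInstructionParsing.load_data; it receives the path as a string.
    """
    pdf_path = Path(pdf_path)
    pkl_path = pdf_path.with_suffix(".pkl")

    # Reuse the cached parse when it exists, so the file is not parsed again
    if pkl_path.exists():
        with open(pkl_path, "rb") as f:
            return pickle.load(f)

    # Otherwise parse once and store the result next to the PDF
    parsed = parse_fn(str(pdf_path))
    with open(pkl_path, "wb") as f:
        pickle.dump(parsed, f)
    return parsed
```

With this helper, rerunning the loop over `QE` would only parse the PDFs that have no `.pkl` yet.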


#### 1.2 Syllabus and Lecture Slides

In [154]:
# Instructions to Parser
parsingInstruction1 = """The provided document is a lecture slide for a course on introductory programming and computer science. 
Convert the provided document content into markdown file format. 

Note that there are embedded figures, bullet points, code snippets, flowcharts, etc. 

Infer what each page is trying convey and summarize the contents.

If you encounter a flowchart or an image, give a summary or explanation of it while also extracting text and structure of it. If possible, also related to the concept being conveyed or discussed.  

Also, try to extract the structure, hierarchy, and formatting of the contents where texts exist.
"""

In [155]:
withInstructionParsing1 = LlamaParse(
                                result_type="markdown", 
                                parsing_instruction=parsingInstruction1,
                                gpt4o_mode=True
                            )

In [141]:
# List all pdf files in a directory
import glob
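lec_slides = glob.glob("GSU - CIS 3260 - Fall 2023/Lecture Slides Parsed/*.pdf")  # Adjust the path to your directory containing the PDF files
import re
def extract_week_number(file_path):
    # Pull the integer after "Week" so that "Week 10" sorts after "Week 2"
    match = re.search(r"Week (\d+)", file_path)
    # Files without a week number sort last instead of raising a TypeError
    return int(match.group(1)) if match else float("inf")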

# Sort files based on the week number
sorted_files = sorted(lec_slides, key=extract_week_number)


In [143]:
sorted_files[5:7]

['GSU - CIS 3260 - Fall 2023/Lecture Slides Parsed/Week 7.pdf',
 'GSU - CIS 3260 - Fall 2023/Lecture Slides Parsed/Week 8.pdf']
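
Sorting on the extracted integer matters because a plain string sort places "Week 10" before "Week 2". A small sketch of the difference follows; the file names are hypothetical and only illustrate the ordering, and `week_number` is an illustrative variant that sends files without a week number to the end.

```python
import re


def week_number(file_path):
    """Extract the integer after 'Week'; files without one sort last."""
    match = re.search(r"Week (\d+)", file_path)
    return int(match.group(1)) if match else float("inf")


# Hypothetical file names, used only to show the ordering
files = ["Week 10.pdf", "Week 2.pdf", "Syllabus.pdf", "Week 1.pdf"]

by_string = sorted(files)               # lexicographic order
by_week = sorted(files, key=week_number)  # numeric week order

print(by_string)  # ['Syllabus.pdf', 'Week 1.pdf', 'Week 10.pdf', 'Week 2.pdf']
print(by_week)    # ['Week 1.pdf', 'Week 2.pdf', 'Week 10.pdf', 'Syllabus.pdf']
```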

In [160]:
document = withInstructionParsing1.load_data('GSU - CIS 3260 - Fall 2023/Lecture Slides Parsed/Week 8.pdf')

Started parsing the file under job_id c1d36ada-053f-41d2-80d6-79d0cf0203e9
.

In [161]:
pkl_filename = os.path.basename(sorted_files[6]).replace('.pdf', '.pkl')
pkl_filepath = os.path.join("GSU - CIS 3260 - Fall 2023/Lecture Slides Parsed", pkl_filename)


In [162]:
# Save the parsed document to a .pkl file
with open(pkl_filepath, 'wb') as f:
    pickle.dump(document, f)
print(f"Saved parsed document to {pkl_filepath}")

Saved parsed document to GSU - CIS 3260 - Fall 2023/Lecture Slides Parsed/Week 8.pkl


In [163]:
md_lec_slides = glob.glob("GSU - CIS 3260 - Fall 2023/Lecture Slides Parsed/*.pkl")
# List to store loaded documents
documents = []

# Load every pickled lecture-slide parse
for path in md_lec_slides:
    with open(path, "rb") as f:
        documents.append(pickle.load(f))

md_qe = glob.glob("GSU - CIS 3260 - Fall 2023/Quizzes and Exams/*.pkl")
# Load every pickled quiz/exam parse
for path in md_qe:
    with open(path, "rb") as f:
        documents.append(pickle.load(f))

# Load Syllabus
with open("GSU - CIS 3260 - Fall 2023/Syllabus Parsed/demo.pkl", "rb") as f:
    doc1 = pickle.load(f)
documents.append(doc1)

In [164]:
documents

[[Document(id_='720eb6d8-4f68-4a81-a717-8c42d2eb139c', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Objectives for class 14\n\n- Review IA4-IA7\n- Review Quiz 3 – Quiz 5\n- Final Exam Schedule\n- Group Project Schedule\n---\n# Individual Assignment 4 - 2\n\n- Enter a letter grade A/a, B/b, C/c, D/d, and then displays its corresponding numeric value 90, 80, 70, 60.\n\n## Sample Run\n```\nEnter a letter grade: B\nThe numeric value for grade B is 80\n```\n\n```python\nletter = input("Enter a letter grade: ")\n\nif letter in \'Aa\':\n    print("The numeric value for grade A is 90")\nelif letter in \'Bb\':\n    print("The numeric value for grade B is 80")\nelif letter in \'Cc\':\n    print("The numeric value for grade C is 70")\nelif letter in \'Dd\':\n    print("The numeric value for grade D is 60")\nelse:\n    print(letter, "is an invalid grade")\n```\n\n- Letter in “Aa”\n- Multiple way if statements\n---\n# Individ

In [165]:
len(documents)

21
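
The output above starts with `[[Document(`, so `documents` is a list of lists: one inner list per parsed file. Steps that expect a flat list of documents need the nesting removed first. A minimal sketch with placeholder strings in place of `Document` objects:

```python
import itertools

# Stand-in for the nested structure: one inner list per parsed file
# (the strings are placeholders for Document objects)
nested = [["slides-doc"], ["quiz-doc"], ["syllabus-doc"]]

# chain.from_iterable concatenates the inner lists into one sequence
flat = list(itertools.chain.from_iterable(nested))

print(len(nested), len(flat))
print(flat)
```

`itertools.chain(*nested)` gives the same result; `from_iterable` just avoids unpacking the outer list into arguments.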

### 2) Chunk 

#### 2.1 Semantic Chunker

https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/

In [179]:
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)

from llama_index.embeddings.openai import OpenAIEmbedding

# OpenAI Embedding Model 
embed_model = OpenAIEmbedding(model = "text-embedding-3-small")

# Semantic splitter
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
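
`breakpoint_percentile_threshold=95` can be read as: split wherever the distance between neighboring sentence embeddings exceeds the 95th percentile of all such distances. Below is a simplified sketch of that idea with toy 2-D vectors in place of real embeddings. It is an illustration only, not LlamaIndex's implementation: it ignores `buffer_size`, and all function names are made up for this example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_breakpoints(embeddings, threshold_percentile=95):
    """Indices at which a new chunk starts."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_percentile)
    return [i + 1 for i, d in enumerate(distances) if d > cutoff]


def split_at(sentences, breakpoints):
    """Group sentences into chunks that start at each breakpoint."""
    chunks, start = [], 0
    for bp in breakpoints + [len(sentences)]:
        chunks.append(sentences[start:bp])
        start = bp
    return chunks


# Toy data: three sentences about strings, then two about loops
sentences = ["s1", "s2", "s3", "s4", "s5"]
embeddings = [[1, 0], [0.99, 0.1], [0.98, 0.15], [0, 1], [0.05, 1]]

breaks = semantic_breakpoints(embeddings)
print(breaks)                       # [3]
print(split_at(sentences, breaks))  # [['s1', 's2', 's3'], ['s4', 's5']]
```

A lower percentile produces more, smaller chunks; a higher one keeps more sentences together.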

In [180]:
nodes_semantic = []

for doc in documents:
    pdoc = splitter.get_nodes_from_documents(doc)
    nodes_semantic.append(pdoc)

In [268]:
print(nodes_semantic[1][2])

Node ID: aea31121-69c0-467e-b304-adeda7e11d13
Text: - **String** literals enclosed in matching *single quotes* (`'`)
or *double quotes* (`"`).  ```python letter = 'A'  # Same as letter =
"A" numChar = '4'  # Same as numChar = "4" message = "Good morning" #
Same as message = 'Good morning' ``` --- # Unicode/ASCII Assigns a
Code to a Character  - A specification - List every character and
assign ea...


In [181]:
print(nodes_semantic[1][2].get_content())

- **String** literals enclosed in matching *single quotes* (`'`) or *double quotes* (`"`).

```python
letter = 'A'  # Same as letter = "A"
numChar = '4'  # Same as numChar = "4"
message = "Good morning"
# Same as message = 'Good morning'
```
---
# Unicode/ASCII Assigns a Code to a Character

- A specification
- List every character and assign each character a unique code.
- Rules translating characters into bytes are called **encoding**.

## Python

```
'4'
```

## Encoded as

```
00000000 00110100
00000000 00000100
```
---
# ASCII Table and Description

| Dec | Hx | Oct | Char | Description                  | Dec | Hx | Oct | Html  | Chr | Dec | Hx | Oct | Html  | Chr |
|-----|----|-----|------|------------------------------|-----|----|-----|-------|-----|-----|----|-----|-------|-----|
| 0   | 0  | 000 | NUL  | (null)                       | 32  | 20 | 040 | &#32; | Space | 64  | 40 | 100 | &#64;  | @   |
| 1   | 1  | 001 | SOH  | (start of heading)           | 33  | 21 | 041 | &#33;

#### 2.2 Markdown Parser

In [182]:
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4"), num_workers=8
)
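
Conceptually, this parser treats tables as separate elements from the surrounding text, which is why the next cell returns both base nodes and objects. The sketch below shows only the separation step on a tiny markdown string. It is a deliberately simplified illustration under the assumption that any run of lines starting with `|` is a table; it is not how `MarkdownElementNodeParser` is implemented, and `split_markdown_elements` is a hypothetical name.

```python
def split_markdown_elements(markdown_text):
    """Split markdown into ('text', ...) and ('table', ...) blocks.

    Simplified: any run of consecutive lines starting with '|' is a table.
    """
    blocks = []
    current_kind, current_lines = None, []

    for line in markdown_text.splitlines():
        kind = "table" if line.lstrip().startswith("|") else "text"
        # Close the running block whenever the kind of line changes
        if kind != current_kind and current_lines:
            blocks.append((current_kind, "\n".join(current_lines)))
            current_lines = []
        current_kind = kind
        current_lines.append(line)

    if current_lines:
        blocks.append((current_kind, "\n".join(current_lines)))
    return blocks


sample = (
    "# Exam topics\n\n"
    "| Topic | Total |\n|---|---|\n| While loop | 2 |\n\n"
    "Review the table above."
)
for kind, content in split_markdown_elements(sample):
    print(kind, repr(content))
```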

In [183]:
nodes = []
base_nodes = []
objects = []

for doc in documents:
    pdoc = node_parser.get_nodes_from_documents(doc)
    nodes.append(pdoc)
    base, obj = node_parser.get_nodes_and_objects(pdoc)
    base_nodes.append(base)
    objects.append(obj)

2it [00:00, 17697.49it/s]
  0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 2/2 [00:11<00:00,  5.67s/it]
4it [00:00, 40622.80it/s]
100%|██████████| 4/4 [00:13<00:00,  3.35s/it]
1it [00:00, 9776.93it/s]
100%|██████████| 1/1 [00:11<00:00, 11.95s/it]
4it [00:00, 46995.00it/s]
100%|██████████| 4/4 [00:10<00:00,  2.66s/it]
2it [00:00, 18558.87it/s]
100%|██████████| 2/2 [00:10<00:00,  5.44s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
5it [00:00, 53092.46it/s]
100%|██████████| 5/5 [00:05<00:00,  1.03s/it]
2it [00:00, 19463.13it/s]
100%|██████████| 2/2 [00:09<00:00,  4.64s/it]
5it [00:00, 46294.75it/s]
100%|██████████| 5/5 [00:11<00:00,  2.30s/it]
4it [00:00, 34521.02it/s]
100%|██████████| 4/4 [00:04<00:00,  1.14s/it]
6it [00:00, 57325.34it/s]
100%|██████████| 6/6 [00:25<00:00,  4.30s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
4it [00:00, 37449.14it/s]
100%|██████████| 4/4 [00:06<00:00,  1.65s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
1it [00:00, 9058.97it/s]
100%|██████████| 1/1 [00:06<00:00,  6.05s/it]
0it [00:00, 

#### 2.3 MarkdownElementNodeParser with Different Embedding Model

By default, LlamaIndex uses **text-embedding-ada-002**, an OpenAI embedding model. We change this to **text-embedding-3-small**.

In [184]:
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

node_parser_OPENAI = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4"), num_workers=8, 
    embed_model =  OpenAIEmbedding(model="text-embedding-3-small")
)

In [185]:
nodes_OA = []
base_nodes_OA  = []
objects_OA  = []

for doc in documents:
    pdoc_OA = node_parser_OPENAI.get_nodes_from_documents(doc)
    nodes_OA.append(pdoc_OA)
    base, obj = node_parser_OPENAI.get_nodes_and_objects(pdoc_OA)
    base_nodes_OA.append(base)
    objects_OA.append(obj)

2it [00:00, 21675.99it/s]
  0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 2/2 [00:12<00:00,  6.47s/it]
4it [00:00, 14678.23it/s]
100%|██████████| 4/4 [00:11<00:00,  2.78s/it]
1it [00:00, 5223.29it/s]
100%|██████████| 1/1 [00:11<00:00, 11.11s/it]
4it [00:00, 32140.26it/s]
100%|██████████| 4/4 [00:11<00:00,  2.76s/it]
2it [00:00, 19108.45it/s]
100%|██████████| 2/2 [00:13<00:00,  6.70s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
5it [00:00, 46500.04it/s]
100%|██████████| 5/5 [00:05<00:00,  1.17s/it]
2it [00:00, 12318.07it/s]
100%|██████████| 2/2 [00:07<00:00,  3.80s/it]
5it [00:00, 43690.67it/s]
100%|██████████| 5/5 [00:13<00:00,  2.79s/it]
4it [00:00, 26379.27it/s]
100%|██████████| 4/4 [00:05<00:00,  1.37s/it]
6it [00:00, 51150.05it/s]
100%|██████████| 6/6 [00:27<00:00,  4.50s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
4it [00:00, 41527.76it/s]
100%|██████████| 4/4 [00:06<00:00,  1.61s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
1it [00:00, 8886.24it/s]
100%|██████████| 1/1 [00:06<00:00,  6.37s/it]
0it [00:00, 

In [247]:
base_nodes_OA[0][0].metadata

{'section_summary': 'In this section, the key topics covered include converting letter grades to numeric values, determining the number of days in a given month, working with strings to find their length and last character, and making decisions based on user input. Additionally, the section includes examples of efficiently using lists to map input to specific outputs, such as determining a major and year based on user input. Finally, an example of converting an ISBN-9 to an ISBN-10 format without using functions is provided, showcasing mathematical calculations and conventions related to ISBN numbers.'}

In [250]:
objects_OA[0][1].metadata

{'col_schema': 'Column: Topic\nType: string\nSummary: Different programming and software development topics such as loops, string processing, function creation, and software development stages.\n\nColumn: Multiple Choices\nType: integer\nSummary: Points allocated for multiple choice questions in each topic.\n\nColumn: Fill In blanks\nType: integer\nSummary: Points allocated for fill in the blanks in each topic.\n\nColumn: Problem Solving\nType: integer\nSummary: Points allocated for problem-solving tasks in each topic.\n\nColumn: Total\nType: integer\nSummary: Total points for each topic.'}

In [264]:
objects_OA[0][1].relationships['3'].metadata

{'table_df': "{'                                      ': {0: ' For loop and range function          ', 1: ' While loop                           ', 2: ' String concepts and string processing', 3: ' Function creation and call           ', 4: ' 1D list                              ', 5: ' 2D list                              ', 6: ' Software Development [System analysis, System Design, Implementation] ', 7: ' **Total**                            '}, ' Multiple Choices ': {0: ' 4                ', 1: ' 2                ', 2: ' 6                ', 3: ' 12               ', 4: ' 4                ', 5: ' 2                ', 6: ' 0 ', 7: ' **30**           '}, ' Fill In blanks ': {0: ' 0              ', 1: ' 0              ', 2: ' 5              ', 3: ' 5              ', 4: ' 10             ', 5: ' 0              ', 6: ' 0 ', 7: ' **20**         '}, ' Problem Solving ': {0: ' 2               ', 1: ' 0               ', 2: ' 3               ', 3: ' 0               ', 4: ' 2               ', 5: '
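
The `table_df` value above is a string that holds the printed form of a nested dict, with padded column names and cells. One way to turn such a string back into usable data is `ast.literal_eval`. This is a sketch on a shortened, hypothetical string in the same shape; the column name `' Topic '` is an assumption made for readability.

```python
import ast

# Shortened, hypothetical string in the same shape as the 'table_df' value
table_df_str = (
    "{' Topic ': {0: ' While loop ', 1: ' 1D list '}, "
    "' Multiple Choices ': {0: ' 2 ', 1: ' 4 '}}"
)

# literal_eval only evaluates Python literals, so no arbitrary code runs
raw = ast.literal_eval(table_df_str)

# Strip the padding from column names and cell values
table = {
    column.strip(): {row: cell.strip() for row, cell in cells.items()}
    for column, cells in raw.items()
}

print(table)
```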

#### 2.4 Transformation:
1. MarkdownElementNodeParser as chunker 
2. Embedding model = "text...small" 
3. Summary extractor 
4. Base nodes, objects 
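
The steps listed above are applied in sequence, each one taking the output of the previous one. A conceptual sketch of that chaining follows, using toy stand-ins for the chunker, the extractor, and the embedding model. None of these functions come from LlamaIndex; they only show the order of operations.

```python
def run_pipeline(items, transformations):
    """Apply each transformation to the whole list, in order."""
    for transform in transformations:
        items = transform(items)
    return items


def chunk(texts):
    # Toy chunker: one chunk per paragraph
    return [{"text": p} for t in texts for p in t.split("\n\n") if p.strip()]


def add_summary(chunks):
    # Toy metadata extractor: the first five words act as a "summary"
    return [{**c, "summary": " ".join(c["text"].split()[:5])} for c in chunks]


def embed(chunks):
    # Toy embedding: a single number (text length) instead of a real vector
    return [{**c, "embedding": [float(len(c["text"]))]} for c in chunks]


nodes = run_pipeline(
    ["Lists store items.\n\nUse append to add one."],
    [chunk, add_summary, embed],
)
for node in nodes:
    print(node)
```

Because each step receives the previous step's output, the order of the list matters: metadata has to be attached before embedding if it should influence the vectors.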

In [271]:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.extractors.entity import EntityExtractor

entity_extractor = EntityExtractor(
    prediction_threshold=0.5,
    label_entities=False,  # include the entity label in the metadata (can be erroneous)
    device="cpu",  # set to "cuda" if you have a GPU
)

transformations = [
    node_parser,
    entity_extractor,  # reuse the extractor configured above
    OpenAIEmbedding(model="text-embedding-3-small"),
]

pipeline = IngestionPipeline(transformations=transformations)

nodes_4 = pipeline.run(documents=documents1)

2it [00:00, 4851.71it/s]
100%|██████████| 2/2 [00:09<00:00,  4.51s/it]
4it [00:00, 44034.69it/s]
100%|██████████| 4/4 [00:11<00:00,  2.90s/it]
1it [00:00, 10433.59it/s]
100%|██████████| 1/1 [00:12<00:00, 12.98s/it]
4it [00:00, 42366.71it/s]
100%|██████████| 4/4 [00:09<00:00,  2.36s/it]
2it [00:00, 21399.51it/s]
100%|██████████| 2/2 [00:10<00:00,  5.10s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
5it [00:00, 28807.03it/s]
100%|██████████| 5/5 [00:06<00:00,  1.21s/it]
2it [00:00, 18396.07it/s]
100%|██████████| 2/2 [00:06<00:00,  3.11s/it]
5it [00:00, 38764.36it/s]
100%|██████████| 5/5 [00:11<00:00,  2.37s/it]
4it [00:00, 35320.45it/s]
100%|██████████| 4/4 [00:05<00:00,  1.32s/it]
6it [00:00, 57985.77it/s]
100%|██████████| 6/6 [00:26<00:00,  4.40s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
4it [00:00, 31068.92it/s]
100%|██████████| 4/4 [00:06<00:00,  1.63s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
1it [00:00, 11715.93it/s]
100%|██████████| 1/1 [00:07<00:

ValueError: too many values to unpack (expected 2)

In [236]:
entity_extractor = EntityExtractor(
)




In [234]:
import itertools

base_nodes_OA1 = list(itertools.chain(*base_nodes_OA))

In [240]:
# or use a transformation in async
from llama_index.core.extractors import SummaryExtractor

nodes_OA2 = await SummaryExtractor().acall(base_nodes_OA1)

  0%|          | 0/83 [00:00<?, ?it/s]

100%|██████████| 83/83 [00:38<00:00,  2.13it/s]


In [242]:
base_nodes_OA1[0]

TextNode(id_='deb3dd7c-abef-4d24-9516-4eb7244b2874', embedding=None, metadata={'section_summary': 'In this section, the key topics covered include converting letter grades to numeric values, determining the number of days in a given month, working with strings to find their length and last character, and making decisions based on user input. Additionally, the section includes examples of efficiently using lists to map input to specific outputs, such as determining a major and year based on user input. Finally, an example of converting an ISBN-9 to an ISBN-10 format without using functions is provided, showcasing mathematical calculations and conventions related to ISBN numbers.'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='720eb6d8-4f68-4a81-a717-8c42d2eb139c', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='0415e58c945f60e23b7b1f527746c366f3dc32eb9904012b3ded2390d291981b'), <NodeRelat

In [241]:
nodes_OA2

[TextNode(id_='deb3dd7c-abef-4d24-9516-4eb7244b2874', embedding=None, metadata={'section_summary': 'In this section, the key topics covered include converting letter grades to numeric values, determining the number of days in a given month, working with strings to find their length and last character, and making decisions based on user input. Additionally, the section includes examples of efficiently using lists to map input to specific outputs, such as determining a major and year based on user input. Finally, an example of converting an ISBN-9 to an ISBN-10 format without using functions is provided, showcasing mathematical calculations and conventions related to ISBN numbers.'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='720eb6d8-4f68-4a81-a717-8c42d2eb139c', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='0415e58c945f60e23b7b1f527746c366f3dc32eb9904012b3ded2390d291981b'), <NodeRela

In [214]:
from llama_index.core import VectorStoreIndex

index1 = VectorStoreIndex(nodes=nodes_4)

In [215]:
# Set up query engine
recursive_query_engine_transf1 = index1.as_query_engine(
    similarity_top_k=5, verbose=True)

### 3) Index

#### 3.0 Manually defining the chunking strategy and embedding model

Transformations: https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/transformations/
* Combining with An Index

SummaryExtractor - Summary of nodes
* https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_metadata_extractor/

Using extractor in ingestion pipeline to create nodes: 
* https://docs.llamaindex.ai/en/stable/examples/metadata_extraction/EntityExtractionClimate/

In [204]:
documents1

[Document(id_='720eb6d8-4f68-4a81-a717-8c42d2eb139c', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Objectives for class 14\n\n- Review IA4-IA7\n- Review Quiz 3 – Quiz 5\n- Final Exam Schedule\n- Group Project Schedule\n---\n# Individual Assignment 4 - 2\n\n- Enter a letter grade A/a, B/b, C/c, D/d, and then displays its corresponding numeric value 90, 80, 70, 60.\n\n## Sample Run\n```\nEnter a letter grade: B\nThe numeric value for grade B is 80\n```\n\n```python\nletter = input("Enter a letter grade: ")\n\nif letter in \'Aa\':\n    print("The numeric value for grade A is 90")\nelif letter in \'Bb\':\n    print("The numeric value for grade B is 80")\nelif letter in \'Cc\':\n    print("The numeric value for grade C is 70")\nelif letter in \'Dd\':\n    print("The numeric value for grade D is 60")\nelse:\n    print(letter, "is an invalid grade")\n```\n\n- Letter in “Aa”\n- Multiple way if statements\n---\n# Individu

In [186]:
import itertools

# Flatten the per-file lists into a single list of documents
documents1 = list(itertools.chain(*documents))

In [187]:
from llama_index.core import VectorStoreIndex
from llama_index.core.extractors import SummaryExtractor

In [188]:
index = VectorStoreIndex.from_documents(
    documents1,
    transformations=[node_parser_OPENAI,
                     SummaryExtractor(),
                     OpenAIEmbedding(model="text-embedding-3-small")],
)

2it [00:00, 14716.86it/s]
  0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 2/2 [00:12<00:00,  6.01s/it]
4it [00:00, 41323.19it/s]
100%|██████████| 4/4 [00:24<00:00,  6.11s/it]
1it [00:00, 10485.76it/s]
100%|██████████| 1/1 [00:12<00:00, 12.07s/it]
4it [00:00, 38130.04it/s]
100%|██████████| 4/4 [00:12<00:00,  3.17s/it]
2it [00:00, 20213.51it/s]
100%|██████████| 2/2 [00:15<00:00,  7.77s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
5it [00:00, 38269.20it/s]
100%|██████████| 5/5 [00:05<00:00,  1.03s/it]
2it [00:00, 20815.40it/s]
100%|██████████| 2/2 [00:08<00:00,  4.25s/it]
5it [00:00, 36792.14it/s]
100%|██████████| 5/5 [00:15<00:00,  3.19s/it]
4it [00:00, 38391.80it/s]
100%|██████████| 4/4 [00:06<00:00,  1.54s/it]
6it [00:00, 54236.69it/s]
100%|██████████| 6/6 [00:30<00:00,  5.10s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
4it [00:00, 32832.13it/s]
100%|██████████| 4/4 [00:06<00:00,  1.74s/it]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
1it [00:00, 13273.11it/s]
100%|██████████| 1/1 [00:09<00:00,  9.80s/it]
0it [00:00

In [189]:
# Set up query engine
recursive_query_engine_transf = index.as_query_engine(
    similarity_top_k=5, verbose=True)

#### 3.1 Semantic Chunker Index

In [190]:
import itertools

nodes_semantic1 = list(itertools.chain(*nodes_semantic))

In [191]:
# Index and store our chunks using VectorStoreIndex
recursive_index_semantic = VectorStoreIndex(nodes= nodes_semantic1)

# Set up query engine
recursive_query_engine_semantic = recursive_index_semantic.as_query_engine(
    similarity_top_k=5, verbose=True)

#### 3.2 Markdown Parser Index

In [192]:
import itertools

base_nodes_mdp = list(itertools.chain(*base_nodes))
objects_mdp = list(itertools.chain(*objects))

In [193]:
from llama_index.core import VectorStoreIndex

# Index and store our chunks using VectorStoreIndex
recursive_index = VectorStoreIndex(nodes=base_nodes_mdp + objects_mdp)

# Set up query engine
recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, verbose=True)

#### 3.3 Markdown Parser Index with text-embedding-3-small

In [194]:
import itertools

base_nodes_Embed_OA = list(itertools.chain(*base_nodes_OA))
objects_Embed_OA = list(itertools.chain(*objects_OA))

In [195]:
from llama_index.core import VectorStoreIndex

# Index and store our chunks using VectorStoreIndex
recursive_embed = VectorStoreIndex(nodes=base_nodes_Embed_OA + objects_Embed_OA)

# Set up query engine
recursive_query_engine_embed = recursive_embed.as_query_engine(
    similarity_top_k=5, verbose=True)

### 4) Retrieval and Query

In [216]:
res1 = recursive_query_engine_transf.query(
    "who is the instructor for the course"
)
print(res1)

res1 = recursive_query_engine_transf1.query(
    "who is the instructor for the course"
)
print(res1)

res1 = recursive_query_engine_semantic.query(
    "who is the instructor for the course"
)
print(res1)

res = recursive_query_engine.query(
    "who is the instructor for the course"
)
print(res)

res = recursive_query_engine_embed.query(
    "who is the instructor for the course"
)
print(res)



The instructor for the course is not explicitly mentioned in the provided context information.
The instructor for the course is not explicitly mentioned in the provided context information.
Dr. Yuan Long
Retrieval entering f00b84d1-c3b9-44b5-84ab-d238d60bce05: TextNode
Retrieving from object TextNode with query who is the instructor for the course
The instructor for the course is not explicitly mentioned in the provided context information.
Retrieval entering 96bd4f61-5df3-408f-8deb-5e50e9c08a2b: TextNode
Retrieving from object TextNode with query who is the instructor for the course
Dr. Yuan Long
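
Since the same question is sent to every engine, the comparison can be wrapped in a small helper that labels each answer. This is a sketch, not part of the original notebook: `compare_engines` is a hypothetical name, and `EchoEngine` is a stub that exists only so the example runs without an index. The one assumption about real engines is that they expose `.query(...)`, as the cells above use.

```python
def compare_engines(engines, question):
    """Ask every engine the same question and collect answers by label.

    engines maps a label to any object with a .query(str) method.
    """
    answers = {}
    for label, engine in engines.items():
        answers[label] = str(engine.query(question))
    return answers


class EchoEngine:
    """Stub used only to make this sketch runnable without an index."""

    def __init__(self, prefix):
        self.prefix = prefix

    def query(self, question):
        return f"{self.prefix}: {question}"


engines = {"semantic": EchoEngine("A"), "markdown": EchoEngine("B")}
results = compare_engines(engines, "who is the instructor?")
for label, answer in results.items():
    print(f"[{label}] {answer}")
```

With the real engines, the dict could map labels such as `"semantic"` to `recursive_query_engine_semantic`, which makes it easier to tell which line of output belongs to which system.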


In [197]:
res1 = recursive_query_engine_transf.query(
    "i want to review the final exam. What did the exam cover. give me the concepts covered in it and point me to the class that coveres these materials"
)
print(res1)

res1 = recursive_query_engine_semantic.query(
    "i want to review the final exam. What did the exam cover. give me the concepts covered in it and point me to the class that coveres these materials"
)
print(res1)

res = recursive_query_engine.query(
    "i want to review the final exam. What did the exam cover. give me the concepts covered in it and point me to the class that coveres these materials"
)
print(res)

res1 = recursive_query_engine_embed.query(
    "i want to review the final exam. What did the exam cover. give me the concepts covered in it and point me to the class that coveres these materials"
)
print(res1)

The final exam covered concepts related to the execution flow of a Python program with conditional statements, specifically focusing on calculating loan payments based on user input. The exam included topics such as conditional statements (`if-else`), multiple-way decisions using `if-elif` blocks, boolean expressions, and relational operators for comparison in Python. These concepts were covered in the class that discusses the execution flow of Python programs for calculating loan payments based on user input.
The final exam covered concepts from Chapter 4 to Chapter 9 of the course textbook. The exam included topics such as mathematical functions, strings, objects, loops, functions, lists, multi-dimensional lists, objects, and classes. These concepts were covered in classes held during Week 6, Week 7, Week 8, Week 9, Week 10, Week 11, Week 12, Week 13, and Week 14.
Retrieval entering f00b84d1-c3b9-44b5-84ab-d238d60bce05: TextNode
Retriev

In [198]:
res1 = recursive_query_engine_transf.query(
    "What materials were covered in class 13? "
)
print(res1)

res1 = recursive_query_engine_semantic.query(
    "What materials were covered in class 13? "
)
print(res1)

res = recursive_query_engine.query(
    "What materials were covered in class 13? "
)
print(res)

res = recursive_query_engine_embed.query(
    "What materials were covered in class 13? "
)
print(res)


Class 13 covered implementing selection control with nested if and multi-way if-elif-else statements, combining conditions using logical operators (and, or, and not), using selection statements with combined conditions, and understanding how to develop a program with selections. The key topics included nested if statements, multi-way if-elif-else statements, logical operators, combined conditions, and the development of programs with selections. Students were encouraged to review Assignment 1 and Assignment 2, as well as take Quiz 1 to reinforce their understanding of the material.
Chapter 7 materials were covered in class 13.
Retrieval entering f00b84d1-c3b9-44b5-84ab-d238d60bce05: TextNode
Retrieving from object TextNode with query What materials were covered in class 13? 
Retrieval entering a459cff7-51e3-4a34-a2d5-26d98dd961e5: TextNode
Retrieving from object TextNode with query What 

In [199]:
res1 = recursive_query_engine_transf.query(
    "What materials were covered in class 14? "
)
print(res1)

res1 = recursive_query_engine_semantic.query(
    "What materials were covered in class 14? "
)
print(res1)

res = recursive_query_engine.query(
    "What materials were covered in class 14? "
)
print(res)

res = recursive_query_engine_embed.query(
    "What materials were covered in class 14? "
)
print(res)

Retrieval entering 95b15fba-122a-42bd-93b8-c966886a7f0b: TextNode
Retrieving from object TextNode with query What materials were covered in class 14? 
The materials covered in class 14 included topics related to computer fundamentals, programming concepts, and software development. The class likely discussed multiple-choice questions on binary numbers, programming languages, operating systems, and Python. Additionally, coding questions may have been included that required writing Boolean expressions and programs for tasks such as calculating minimum runway length for airplane take-off and currency exchange rate conversion. The focus was on testing understanding of basic programming concepts and problem-solving skills.
The concepts of computer hardware, programs, and operating systems, the history of Python, the basic syntax of a Python program, and writing and running a simple Python program were covered in class 14.

In [200]:
res1 = recursive_query_engine_transf.query(
    "How do we add an element to a list? Which lecture covers this topic?"
)
print(res1)

res1 = recursive_query_engine_semantic.query(
    "How do we add an element to a list? Which lecture covers this topic?"
)
print(res1)

res = recursive_query_engine.query(
    "How do we add an element to a list? Which lecture covers this topic? "
)
print(res)

res = recursive_query_engine_embed.query(
    "How do we add an element to a list? Which lecture covers this topic? "
)
print(res)

Retrieval entering 95b15fba-122a-42bd-93b8-c966886a7f0b: TextNode
Retrieving from object TextNode with query How do we add an element to a list? Which lecture covers this topic?
To add an element to a list in Python, you can use the `append()` method or the `insert()` method. The `append()` method adds the element at the end of the list, while the `insert()` method allows you to specify the index where you want to insert the element.

The lecture that covers this topic is likely related to basic Python programming concepts or data structures, as adding elements to a list is a fundamental operation in Python programming.
To add an element to a list, you can use the `append(x)` method, where `x` is the element you want to add. This topic is covered in Chapter 7 of the course material.
To add an element to a list, you can use the `append(x)`, `extend(anotherList)`, or `insert(index, x)` methods. This topic is covered in Chapter 7 under t

In [201]:
question = 'Are you aware of question 6 in quiz 5? Explain in detail the answer to this problem. Why is the answer correct?'

res1 = recursive_query_engine_transf.query(question)
print(res1)

res1 = recursive_query_engine_semantic.query(question)
print(res1)

res = recursive_query_engine.query(question)
print(res)

res = recursive_query_engine_embed.query(question)
print(res)

Retrieval entering 95b15fba-122a-42bd-93b8-c966886a7f0b: TextNode
Retrieving from object TextNode with query Are you aware of question 6 in quiz 5? Explain in detail the answer to this problem. Why is the answer correct?
Retrieval entering c86626c1-27fc-4831-b138-5f4e31e97b84: TextNode
Retrieving from object TextNode with query Are you aware of question 6 in quiz 5? Explain in detail the answer to this problem. Why is the answer correct?
I am not aware of question 6 in quiz 5 or its specific details.
Yes, I am aware of question 6 in quiz 5. The output of the code provided in question 6 will be 64 if the number is 8. This is because the code is calculating the square of the given number. In this case, when the number is 8, squaring it results in 64. Therefore, the correct answer to question 6 is 64.
Retrieval entering a459cff7-51e3-4a34-a2d5-26d98dd961e5: TextNod

So far, the RAG system that uses MarkdownNodeParser as its chunker gives the most accurate answers of the systems compared. Even so, it hallucinates at times, as do the other systems.

The system built with transformations is not picking up any content, the system with the semantic chunker fails on the last question about the quiz, and the markdown-based RAG systems do poorly on questions about material covered in lectures.
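
The outputs above are printed back to back without labels, which makes it hard to tell which engine produced which answer. A minimal sketch of a labelled comparison follows. `compare_engines` and `EchoEngine` are hypothetical names introduced here for illustration; the only assumption about a real engine is that it exposes a `.query(str)` method, as the engines used in this notebook do.

```python
from typing import Any, Dict


def compare_engines(engines: Dict[str, Any], question: str) -> Dict[str, str]:
    """Run one question through several query engines and label each answer."""
    answers = {}
    for label, engine in engines.items():
        # Assumption: each engine exposes .query(str) and its result can be
        # converted to text with str().
        answers[label] = str(engine.query(question))
    return answers


class EchoEngine:
    """Hypothetical stand-in so the sketch runs without an index or API key."""

    def __init__(self, name: str) -> None:
        self.name = name

    def query(self, question: str) -> str:
        return f"[{self.name}] {question}"


demo = compare_engines(
    {"engine_a": EchoEngine("engine_a"), "engine_b": EchoEngine("engine_b")},
    "How do we add an element to a list?",
)
for label, answer in demo.items():
    print(f"--- {label} ---")
    print(answer)
```

In the notebook, the same helper could be called with a dictionary such as `{"recursive_query_engine_transf": recursive_query_engine_transf, "recursive_query_engine_semantic": recursive_query_engine_semantic, ...}` so that each printed answer carries the name of the engine that produced it.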

Next, we want to test some retrieval methods.

### 5) Evaluator 

In [286]:
from llama_index.llms.openai import OpenAI
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.core.evaluation import RelevancyEvaluator


#### 5.1 Generate a list of questions from our documents

In [279]:
# create llm
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.0)

# define generator, generate questions
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents1,
    llm=llm,
    num_questions_per_chunk=3,  # number of questions to generate per node
)

In [280]:
rag_dataset = dataset_generator.generate_questions_from_nodes()
questions = [e.query for e in rag_dataset.examples]

In [285]:
questions[10:40]

['Count the number of ratings in the given list and display how many ratings have a specific value entered by the user.',
 'Create a program that counts the frequency of ratings (0 to 5) in a list and displays the count for each rating.',
 'What will be the output of the following code snippet: `print(s1[:3])`?',
 'Is the statement `s1.isupper()` true or false? Explain your answer.',
 "How many times does the character 'o' appear in the string `s1` when `s1` is evaluated using the `s1.count('o')` function?",
 'Explain the difference between accessing list elements by using indexed variables and obtaining a sublist from a larger list using the slicing operator. Provide examples to illustrate each concept.',
 'How can you traverse elements in a list using a for loop? Provide a code snippet to demonstrate this process and explain the benefits of using a for loop for list traversal.',
 'Discuss the importance of developing and invoking functions that pass list arguments. Provide a scenario

#### 5.2 Eval

In [287]:
# gpt-4
gpt4 = OpenAI(temperature=0, model="gpt-4")

evaluator_gpt4 = RelevancyEvaluator(llm=gpt4)

In [289]:
import pandas as pd
from IPython.display import display
from llama_index.core import Response

# define jupyter display function
def display_eval_df(query: str, response: Response, eval_result: str) -> None:
    eval_df = pd.DataFrame(
        {
            "Query": query,
            "Response": str(response),
            "Source": (
                response.source_nodes[0].node.get_content()[:1000] + "..."
            ),
            "Evaluation Result": eval_result,
        },
        index=[0],
    )
    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

In [298]:
questions[1]

'Develop a Python program that takes a year and a month as input and displays the number of days in that month for the given year. Make sure to consider leap years when determining the number of days in February.'

In [290]:
response_vector = recursive_query_engine_embed.query(questions[1])
eval_result = evaluator_gpt4.evaluate_response(
    query=questions[1], response=response_vector
)
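
The cell above evaluates a single question, and `eval_result` is not inspected afterwards. A minimal sketch of how results for several questions could be summarized is given below. It assumes each result exposes a boolean `passing` attribute, which is how LlamaIndex's `EvaluationResult` reports the relevancy verdict; `pass_rate` and the `SimpleNamespace` stand-ins are hypothetical and only there so the sketch runs without calling an LLM.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def pass_rate(results: Iterable[Any]) -> float:
    """Return the fraction of evaluation results whose `passing` flag is truthy."""
    results = list(results)
    if not results:
        return 0.0
    return sum(1 for r in results if r.passing) / len(results)


# Hypothetical stand-ins for evaluator outputs (not real evaluation data).
fake_results = [
    SimpleNamespace(passing=True, feedback="YES"),
    SimpleNamespace(passing=False, feedback="NO"),
    SimpleNamespace(passing=True, feedback="YES"),
]
print(f"pass rate: {pass_rate(fake_results):.2f}")
```

With real results, the list would be built by calling `evaluator_gpt4.evaluate_response(query=q, response=engine.query(q))` for each generated question `q`, which gives one number per query engine to compare.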

In [299]:
print(response_vector)

To develop a Python program that takes a year and a month as input and displays the number of days in that month for the given year, you can follow these steps:

1. Prompt the user to enter a year and a month.
2. Determine the number of days based on the month entered:
   - For months with 31 days (January, March, May, July, August, October, December), display "31 days."
   - For months with 30 days (April, June, September, November), display "30 days."
   - For February, check if the year is a leap year:
     - If it is a leap year (divisible by 4 but not by 100 unless also divisible by 400), display "29 days."
     - If it is not a leap year, display "28 days."
3. Consider displaying an error message for invalid inputs or incorrect month names.

By following these steps, you can create a Python program that accurately determines the number of days in a given month for a specified year, accounting for leap years in February.
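
One way to sanity-check the steps in this response is to write them out and compare against the standard library. The sketch below is an illustration of the leap-year rule the response states, not the course's reference solution; `is_leap_year` and `days_in_month` are names chosen here.

```python
import calendar


def is_leap_year(year: int) -> bool:
    """Divisible by 4, and not by 100 unless also divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year: int, month: int) -> int:
    """Number of days in `month` (1-12) of `year`."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if month == 2:
        return 29 if is_leap_year(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31


# Cross-check every month of a few years against calendar.monthrange.
for y in (1900, 2000, 2023, 2024):
    for m in range(1, 13):
        assert days_in_month(y, m) == calendar.monthrange(y, m)[1]
```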


### Advanced Retrieval Methods 

In [300]:
# import QueryBundle
from llama_index.core import QueryBundle

# import NodeWithScore
from llama_index.core.schema import NodeWithScore

# Retrievers
from llama_index.core.retrievers import (
    BaseRetriever,
    VectorIndexRetriever,
    KeywordTableSimpleRetriever,
)

from typing import List

In [302]:
class CustomRetriever(BaseRetriever):
    """Custom retriever that performs both semantic search and keyword search."""

    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
        keyword_retriever: KeywordTableSimpleRetriever,
        mode: str = "AND",
    ) -> None:
        """Init params."""

        self._vector_retriever = vector_retriever
        self._keyword_retriever = keyword_retriever
        if mode not in ("AND", "OR"):
            raise ValueError("Invalid mode.")
        self._mode = mode
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""

        vector_nodes = self._vector_retriever.retrieve(query_bundle)
        keyword_nodes = self._keyword_retriever.retrieve(query_bundle)

        vector_ids = {n.node.node_id for n in vector_nodes}
        keyword_ids = {n.node.node_id for n in keyword_nodes}

        combined_dict = {n.node.node_id: n for n in vector_nodes}
        combined_dict.update({n.node.node_id: n for n in keyword_nodes})

        # AND keeps nodes returned by both retrievers; OR keeps nodes returned by either
        if self._mode == "AND":
            retrieve_ids = vector_ids.intersection(keyword_ids)
        else:
            retrieve_ids = vector_ids.union(keyword_ids)

        retrieve_nodes = [combined_dict[rid] for rid in retrieve_ids]
        return retrieve_nodes
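
The effect of the two modes is easiest to see on plain sets of node IDs. The sketch below mirrors the set logic in `_retrieve` with a hypothetical helper, `combine_ids`, and made-up IDs; it does not touch any index.

```python
from typing import Set


def combine_ids(vector_ids: Set[str], keyword_ids: Set[str], mode: str = "AND") -> Set[str]:
    """Combine two sets of node IDs the same way CustomRetriever does."""
    if mode not in ("AND", "OR"):
        raise ValueError("Invalid mode.")
    if mode == "AND":
        # Only nodes found by both semantic and keyword search
        return vector_ids & keyword_ids
    # Nodes found by at least one of the two searches
    return vector_ids | keyword_ids


# Made-up node IDs for illustration
vector_ids = {"n1", "n2"}
keyword_ids = {"n2", "n3", "n4"}

print(sorted(combine_ids(vector_ids, keyword_ids, "AND")))
print(sorted(combine_ids(vector_ids, keyword_ids, "OR")))
```

In "AND" mode the result can be empty when the two retrievers share no nodes, which becomes more likely when the vector retriever returns only a few candidates. In that case the query engine has no context to synthesize from, so "OR" may be the safer choice for sparse keyword matches.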

In [306]:
from llama_index.core import SimpleKeywordTableIndex, VectorStoreIndex

keyword_index = SimpleKeywordTableIndex(nodes=base_nodes_Embed_OA + objects_Embed_OA)

In [307]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# define custom retriever
vector_retriever = VectorIndexRetriever(index=recursive_embed, similarity_top_k=2)  # with our MarkdownNodeParser
keyword_retriever = KeywordTableSimpleRetriever(index=keyword_index)
custom_retriever = CustomRetriever(vector_retriever, keyword_retriever)

In [308]:
# define response synthesizer
response_synthesizer = get_response_synthesizer()

# assemble query engine
custom_query_engine = RetrieverQueryEngine(
    retriever=custom_retriever,
    response_synthesizer=response_synthesizer,
)

In [310]:
question = 'Are you aware of question 6 in quiz 5? Explain in detail the answer to this problem. Why is the answer correct?'

res1 = custom_query_engine.query(question)
print(res1)


Yes, I am aware of question 6 in quiz 5. The error that occurs when a program needs to read data from a file that does not exist is a runtime error. This type of error happens during the execution of the program when it is unable to access the required file, leading to a disruption in the program's normal flow. In this scenario, the absence of the file causes the program to halt unexpectedly, resulting in a runtime error. The correct answer is A - runtime error because it accurately describes the situation where the program encounters an issue while running due to the missing file, distinguishing it from logic errors or syntax errors which are related to the program's structure and design.
