In [6]:
#!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

In [7]:
#!wget https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/parse-faq.ipynb

In [8]:
import minsearch
import json

In [9]:
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [10]:
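# Flatten the per-course FAQ lists into one list of documents,
# tagging each document with the course it belongs to.
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)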

In [11]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe next cohort starts in Jan 2025. More info at DTC Article.\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start？',
 'course': 'data-engineering-zoomcamp'}

In [12]:
Index = minsearch.Index(
    text_fields = ['question','section','text'],
    keyword_fields = ['course']
)

In [13]:
Index.fit(documents)

<minsearch.Index at 0x12fc06650>

In [14]:
ques = "how to parse document files with python?"

In [15]:
boost = {'question': 3, 'section' : 0.4}

results = Index.search(
    query=ques,
    filter_dict={'course':'mlops-zoomcamp'},
    boost_dict=boost,
    num_results=5
)

In [16]:
results

[{'text': 'If you create a folder data and download datasets or raw files in your local repository. Then to push all your code to remote repository without this files or folder please use gitignore file. The simple way to create it do the following steps\n1. Create empty .txt file (using text editor or command line)\n2. Safe as .gitignore (. must use the dot symbol)\n3. Add rules\n *.parquet - to ignore all parquet files\ndata/ - to ignore all files in folder data\n\nFor more pattern read GIT documentation\nhttps://git-scm.com/docs/gitignore\nAdded by Olga Rudakova (olgakurgan@gmail.com)',
  'section': 'Module 1: Introduction',
  'question': '.gitignore how-to',
  'course': 'mlops-zoomcamp'},
 {'text': 'I have faced a problem while reading the large parquet file. I tried some workarounds but they were NOT successful with Jupyter.\nThe error message is:\nIndexError: index 311297 is out of bounds for axis 0 with size 131743\nI solved it by performing the homework directly as a python scr
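
The search call above combines a keyword filter (`filter_dict`) with per-field weights (`boost_dict`). The block below is a minimal sketch of that idea only. It scores by plain token overlap, which is a simplification and not how `minsearch` ranks results. `tokenize`, `toy_search` and `toy_docs` are hypothetical names invented for this illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=5):
    """Rank docs by boosted token overlap (a stand-in for real relevance scoring)."""
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_tokens = tokenize(query)
    scored = []
    for doc in docs:
        # Keyword filter: skip the doc unless every filter field matches exactly.
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        score = 0.0
        for field in text_fields:
            overlap = len(query_tokens & tokenize(doc.get(field, "")))
            # Fields without an explicit boost keep a weight of 1.0.
            score += boost_dict.get(field, 1.0) * overlap
        if score > 0:
            scored.append((score, doc))
    # Sort on the score only, so the dicts themselves are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Hypothetical toy data, not taken from documents.json.
toy_docs = [
    {"question": "How to parse files?", "section": "Intro", "text": "Use python.", "course": "a"},
    {"question": "Other", "section": "Intro", "text": "parse files with python", "course": "a"},
    {"question": "How to parse files?", "section": "Intro", "text": "Use python.", "course": "b"},
]

toy_hits = toy_search(
    toy_docs,
    "parse files",
    text_fields=["question", "section", "text"],
    filter_dict={"course": "a"},
    boost_dict={"question": 3, "section": 0.4},
)
```

With these weights a match in `question` counts three times as much as the same match in `text`, so the first toy document ranks above the second, and the document from course `b` is dropped by the filter.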

#### Using the TextSearch class from the search script

In [17]:
from Search_uncln import TextSearch

index = TextSearch(
    text_fields=['question', 'section', 'text'],
)
index.fit(documents)

se_results = index.search(
    query=ques,
    n_results=2,
    filters={'course': 'mlops-zoomcamp'},
    boost={'question': 6, 'section': 0.2},
)

In [18]:
se_results

[{'text': 'If you create a folder data and download datasets or raw files in your local repository. Then to push all your code to remote repository without this files or folder please use gitignore file. The simple way to create it do the following steps\n1. Create empty .txt file (using text editor or command line)\n2. Safe as .gitignore (. must use the dot symbol)\n3. Add rules\n *.parquet - to ignore all parquet files\ndata/ - to ignore all files in folder data\n\nFor more pattern read GIT documentation\nhttps://git-scm.com/docs/gitignore\nAdded by Olga Rudakova (olgakurgan@gmail.com)',
  'section': 'Module 1: Introduction',
  'question': '.gitignore how-to',
  'course': 'mlops-zoomcamp'},
 {'text': 'I have faced a problem while reading the large parquet file. I tried some workarounds but they were NOT successful with Jupyter.\nThe error message is:\nIndexError: index 311297 is out of bounds for axis 0 with size 131743\nI solved it by performing the homework directly as a python scr

### Chat completion, prompt template, LLM API call

In [22]:
from openai import OpenAI

In [25]:
#client = OpenAI()

In [26]:
ques

'how to parse document files with python?'

In [None]:
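# Not executed: `client` is still commented out above and the model name is left blank.
response = client.chat.completions.create(
    model='',
    messages=[{"role": "user", "content": ques}],  # the keyword is `messages`, not `message`
)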

In [36]:
from dotenv import load_dotenv
from groq import Groq
import os

load_dotenv()

client = Groq(
    api_key=os.getenv("GROQ_API_KEY"),
)

response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": ques,
            #"content": "Explain the importance of fast language models",
        }
    ],
    model="llama3-8b-8192",
)

In [58]:
print(response.choices[0].message.content)

There are several ways to parse document files using Python, depending on the type of file and the structure of the document. Here are some common methods:

1. **XML parsing**:
	* Use the `xml.etree.ElementTree` module to parse XML documents.
	* Example: `root = ET.parse('file.xml').getroot()`
2. **JSON parsing**:
	* Use the `json` module to parse JSON documents.
	* Example: `data = json.load(open('file.json'))`
3. **HTML parsing**:
	* Use the `html.parser` module to parse HTML documents.
	* Example: `from html.parser import HTMLParser; parser = HTMLParser(); parser.feed('file.html')`
4. **PDF parsing**:
	* Use the `PyPDF2` library to parse PDF files.
	* Example: `pdf = PyPDF2.PdfFileReader('file.pdf'); pages = pdf.numPages`
5. **Microsoft Office document parsing**:
	* Use the `python-docx` library to parse Word documents (.docx).
	* Use the `openpyxl` library to parse Excel spreadsheets (.xlsx).
	* Use the `python-odf` library to parse OpenDocument files (.odt, .ods, .odg).
6. **Text 

In [89]:
prompt_template = """ 
You are an expert machine learning and MLOps engineer helping a junior engineer as an assistant and guide.
Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering.
If the CONTEXT does not contain the answer, output "Not FOUND in the context given" and explain your answer with reasons.

QUESTION: {question}

CONTEXT: {context}
""".strip()

In [84]:
context = ""

for doc in results:
    context += f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [85]:
print(context)

section: Module 1: Introduction
question: .gitignore how-to
answer: If you create a folder data and download datasets or raw files in your local repository. Then to push all your code to remote repository without this files or folder please use gitignore file. The simple way to create it do the following steps
1. Create empty .txt file (using text editor or command line)
2. Safe as .gitignore (. must use the dot symbol)
3. Add rules
 *.parquet - to ignore all parquet files
data/ - to ignore all files in folder data

For more pattern read GIT documentation
https://git-scm.com/docs/gitignore
Added by Olga Rudakova (olgakurgan@gmail.com)

section: Module 1: Introduction
question: Reading large parquet files
answer: I have faced a problem while reading the large parquet file. I tried some workarounds but they were NOT successful with Jupyter.
The error message is:
IndexError: index 311297 is out of bounds for axis 0 with size 131743
I solved it by performing the homework directly as a pyth

In [86]:
prompt = prompt_template.format(question=ques, context=context).strip()
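
The context-building loop and the `format` call above can be folded into two small helpers, so the prompt is rebuilt the same way for every question. This is a sketch under assumptions: `build_context`, `build_prompt`, `PROMPT_TEMPLATE` and `sample_hits` are hypothetical names, the template is a shortened stand-in for `prompt_template`, and the sample hit uses made-up answer text.

```python
# Shortened stand-in for the notebook's prompt_template (assumption).
PROMPT_TEMPLATE = """
Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering.

QUESTION: {question}

CONTEXT: {context}
""".strip()


def build_context(search_results):
    """Join each hit's section, question and answer text into one block."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    ]
    # Blank line between entries, no trailing separator.
    return "\n\n".join(entries)


def build_prompt(question, search_results, template=PROMPT_TEMPLATE):
    """Fill the template with the question and the formatted search hits."""
    return template.format(question=question, context=build_context(search_results))


# Hypothetical hit with the same keys as the search results above.
sample_hits = [
    {
        "section": "Module 1: Introduction",
        "question": ".gitignore how-to",
        "text": "Use a .gitignore file.",
    }
]

sample_prompt = build_prompt("how to parse document files with python?", sample_hits)
```

In the notebook, `build_prompt(ques, results, template=prompt_template)` would stand in for the two cells above.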

In [87]:
print(prompt)

You are an expert machine learning and MLOps engineer helping a junior engineer as an assistant and guide.
Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering.
If the CONTEXT does not contain the answer, output "Not FOUND in the context given" and give your reasons for that.

QUESTION: how to parse document files with python?

CONTEXT: section: Module 1: Introduction
question: .gitignore how-to
answer: If you create a folder data and download datasets or raw files in your local repository. Then to push all your code to remote repository without this files or folder please use gitignore file. The simple way to create it do the following steps
1. Create empty .txt file (using text editor or command line)
2. Safe as .gitignore (. must use the dot symbol)
3. Add rules
 *.parquet - to ignore all parquet files
data/ - to ignore all files in folder data

For more pattern read GIT documentation
https://git-scm.com/docs/gitignore

In [90]:
client = Groq(
    api_key=os.getenv("GROQ_API_KEY"),
)

response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="llama3-8b-8192",
)
response.choices[0].message.content

'I can help you with that! \n\nUnfortunately, there is no mention of parsing document files with Python in the given context. However, we can see that there is a mention of reading large parquet files with Pyspark library.\n\nSo, my answer would be: Not FOUND in the context given\n\nReason: The topic of parsing document files with Python was not discussed in the provided context.'
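
The three steps used so far (search, build the prompt, call the LLM) can be chained in one function. The sketch below takes each step as a callable so it runs without an index or an API key. `rag` and the `fake_*` helpers are hypothetical names used only for illustration; they are not part of `minsearch` or the Groq client.

```python
def rag(question, search_fn, prompt_fn, llm_fn):
    """Retrieve documents, build a prompt from them, and return the LLM's answer."""
    hits = search_fn(question)
    prompt = prompt_fn(question, hits)
    return llm_fn(prompt)


# Stand-in steps so the pipeline can be exercised offline (assumptions).
def fake_search(question):
    return [{"section": "S", "question": "Q", "text": "T"}]


def fake_prompt(question, hits):
    return f"QUESTION: {question}\nCONTEXT: {hits[0]['text']}"


def fake_llm(prompt):
    return f"echo: {prompt}"


answer = rag("how?", fake_search, fake_prompt, fake_llm)
```

In the notebook, `search_fn` could wrap `Index.search` with the course filter and boosts, and `llm_fn` could wrap `client.chat.completions.create` and return `response.choices[0].message.content`.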