# IMPORTANT DISCOVERY!!

July 16, 2023 update!

It appears my data is actually quite imperfect. At least in `chunks_from_html.csv`, it is apparent that there are a lot of duplicates! I will need to filter some of the webpages out so that semantic search doesn't return copies of the same text. My initial filter (code below) cut everything down by about half! I need to look at this more closely. However, after removing the duplicates, results seem to be much improved. This is the clear next step for prompt engineering.


```
import pandas as pd

data = pd.read_csv("../data-collection/data/chunks_from_html.csv", names=['source', 'data'])
# Match the URL fragment literally (regex=False), so '.' is not treated as a wildcard
admissions_questions = data[data['source'].str.contains('brockport.edu/admissions/apply', regex=False)]

# Drop rows whose chunk text is an exact duplicate, keeping the first occurrence
admissions_questions = admissions_questions.drop_duplicates(subset='data', keep='first')

out_dir = "/home/msaad/workspace/honors-thesis/data-collection/data/admissions_vectordb_split2"
for source, text in zip(admissions_questions['source'], admissions_questions['data']):
    # Build the file name from the source URL.
    # Caveat: rows that share a source URL map to the same file name,
    # so a later row overwrites an earlier one.
    path = source.replace('https://www2.', '').replace('/', '_')
    with open(f"{out_dir}/chunk_{path}.txt", 'w') as file:
        # write the chunk text of this row to the file
        file.write(text)
```
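
To look at the duplicates more closely, one option is to also catch copies that differ only in whitespace or capitalization, and to count how many rows get dropped. A minimal sketch using only the standard library; the `normalize` and `dedupe_chunks` helpers and the sample rows are hypothetical and not part of the filter above.

```python
import re

def normalize(text):
    """Collapse runs of whitespace and lowercase, so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe_chunks(rows):
    """Keep the first (source, text) pair per normalized text; return kept rows and the number dropped."""
    seen = set()
    kept = []
    for source, text in rows:
        key = normalize(text)
        if key in seen:
            continue
        seen.add(key)
        kept.append((source, text))
    return kept, len(rows) - len(kept)

# Hypothetical sample rows standing in for the CSV contents
rows = [
    ("https://example.edu/a", "Apply by May 1."),
    ("https://example.edu/b", "Apply  by May 1. "),
    ("https://example.edu/c", "SAT scores are optional."),
]
kept, dropped = dedupe_chunks(rows)
print(len(kept), dropped)
```

Comparing the dropped count under exact matching and under normalized matching would show how many near-copies the exact filter misses.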

In [22]:
from langchain.vectorstores import Chroma
from langchain.chains import VectorDBQA
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
from chromadb.utils import embedding_functions
import chromadb
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [23]:
data_dir = "/home/msaad/workspace/honors-thesis/data-collection/data/"
loader = DirectoryLoader(
    data_dir + "admissions_vectordb_split2", 
    glob="./*.txt",
    use_multithreading=True
)
doc = loader.load()

len(doc)

13

In [24]:
# Splitting the text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=200
)
texts = text_splitter.split_documents(doc)

# Count the number of chunks
len(texts)

25
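
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks, newlines, and spaces rather than at fixed character offsets, and `sliding_window_split` is a hypothetical helper.

```python
def sliding_window_split(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = sliding_window_split("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

With the overlap, text that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some characters twice.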

In [25]:
persist_directory = data_dir + "test_admissions2"

# By default uses 'hkunlp/instructor-large'
embedding_function = HuggingFaceInstructEmbeddings(
    query_instruction="Represent the School paragraph for retrieval: "
)

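# This also generates the embeddings, which is quite GPU-intensive.
# Note: this indexes the 13 unsplit documents in `doc`; pass `texts` instead
# to index the 25 chunks produced by the splitter above.
vectordb = Chroma.from_documents(
    documents=doc,
    embedding=embedding_function,
    persist_directory=persist_directory
)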

load INSTRUCTOR_Transformer
max_seq_length  512


In [26]:
vectordb.persist()

In [27]:
retriever = vectordb.as_retriever()

docs = retriever.get_relevant_documents("How can I apply as an undergraduate?")

docs

[Document(page_content='If you have attended any college after high school graduation, you should choose this option. If you are not a US citizen and interested in undergraduate or graduate study, you should choose this option.', metadata={'source': '/home/msaad/workspace/honors-thesis/data-collection/data/admissions_vectordb_split2/chunk_brockport.edu_admissions_apply-to-brockport.txt'}),
 Document(page_content='Please choose only ONE of the following applications to fill out. Both applications are equally accepted. This fee is required with either option. For information on fee waivers, please contact your high school guidance counselor or complete the After completing your application, submit any additional requirements, like your transcripts, letters of recommendation, additional essay or statement, and SAT/ACT scores (SAT/ACT scores are optional). Applicants are admitted to the University and not to a major. A separate application or prerequisite courses may be required for admiss
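
Why duplicates hurt retrieval: a retriever returns the k chunks closest to the query, so identical chunks take up several of those k slots with the same text. A toy sketch with made-up 2-D vectors and cosine similarity; `top_k` and the vectors are hypothetical, and the distance metric actually used by the vector store may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, items, k):
    """Return the texts of the k items most similar to query_vec."""
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up (text, embedding) pairs; the second entry is an exact duplicate of the first
items = [
    ("apply page", [1.0, 0.0]),
    ("apply page", [1.0, 0.0]),
    ("fees page", [0.8, 0.6]),
    ("sports page", [0.0, 1.0]),
]
query = [1.0, 0.1]

with_dupes = top_k(query, items, k=2)
without_dupes = top_k(query, [items[0]] + items[2:], k=2)
print(with_dupes)
print(without_dupes)
```

In this toy case the duplicate pushes the second distinct page out of the top two, which is the same effect as copies crowding the context given to the language model.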

# Start LangChain

In [28]:
retriever = vectordb.as_retriever(search_kwargs={"k": 5})

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    verbose=True
)
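
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. A rough sketch of that idea; `build_stuff_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_stuff_prompt(
    "Are SAT scores required to apply?",
    [
        "SAT/ACT scores are optional.",
        "Applicants are admitted to the University and not to a major.",
    ],
)
print(prompt)
```

Because every retrieved chunk goes into the same prompt, duplicate chunks spend context space on repeated text, which is one reason the deduplication step matters here.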

In [29]:
# Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [31]:
query = "Are SAT scores required to apply?"
llm_response = qa_chain(query)
process_llm_response(llm_response)



> Entering new RetrievalQA chain...

> Finished chain.
 No, SAT scores are not required to apply. They are optional and may be submitted via official high school transcript or directly from the testing agency.


Sources:
/home/msaad/workspace/honors-thesis/data-collection/data/admissions_vectordb_split2/chunk_brockport.edu_admissions_apply_first_year.html.txt
/home/msaad/workspace/honors-thesis/data-collection/data/admissions_vectordb_split2/chunk_brockport.edu_admissions_apply_status.html.txt
/home/msaad/workspace/honors-thesis/data-collection/data/admissions_vectordb_split2/chunk_brockport.edu_admissions_apply.txt
/home/msaad/workspace/honors-thesis/data-collection/data/admissions_vectordb_split2/chunk_brockport.edu_admissions_apply-to-brockport.txt
/home/msaad/workspace/honors-thesis/data-collection/data/admissions_vectordb_split2/chunk_brockport.edu_admissions_apply_special_admission_requirements.html.txt


In [34]:
llm_response['result']

' No, SAT scores are not required to apply. They are optional and may be submitted via official high school transcript or directly from the testing agency.'