In [11]:
import pandas as pd
import pickle
import requests
from bs4 import BeautifulSoup
import trafilatura
import re
import random
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import VectorDBQA
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.document_loaders.csv_loader import CSVLoader

df = pd.read_csv("../data-collection/data/clean_chunks_from_html.csv")

In [12]:
df

Unnamed: 0,url,data
0,https://www2.brockport.edu/academics/advisemen...,Computer Information Systems\nThe computer sci...
1,https://www2.brockport.edu/academics/advisemen...,Athletic Training\n*Fall 2019 will be the last...
2,https://www2.brockport.edu/academics/advisemen...,Social Studies Adolescent Education\nRecommend...
3,https://www2.brockport.edu/academics/kinesiology,"Training our minds, training our bodies. That’..."
4,https://www2.brockport.edu/support/employee-as...,We are all human and sometimes we have problem...
...,...,...
283,https://www2.brockport.edu/academics/advisemen...,Earth Science\nRequired Courses in the Freshma...
284,https://www2.brockport.edu/about/title-ix,Make a Report\nMake a Title IX related report ...
285,https://www2.brockport.edu/support/issue-center,This page is for non-emergencies only. If this...
286,https://www2.brockport.edu/academics/scholars-day,"Scholars Day is Wednesday, April 12, 2023\nThe..."


In [13]:
loader = CSVLoader(
    file_path="../data-collection/data/clean_chunks_from_html.csv",
    source_column='url')
data = loader.load()

# Important!

I need to keep the URL out of each Document's `page_content`. I don't think LangChain supports that natively, though, so I'll probably just rewrite the loader class altogether. After looking at the source code on GitHub, I see no real reason not to.

After all, the class is not really that complex... via https://github.com/hwchase17/langchain/blob/506b21bfc2c1ab0f0b341501b9d2ca741a1550e6/libs/langchain/langchain/schema/document.py#L11
```
class Document(Serializable):
    """Class for storing a piece of text and associated metadata."""

    page_content: str
    """String text."""
    metadata: dict = Field(default_factory=dict)
    """Arbitrary metadata about the page content (e.g., source, relationships to other
        documents, etc.).
    """
```
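
A rough sketch of what such a replacement loader might look like. This is only an illustration, not LangChain code: `SimpleDocument` and `load_rows` are hypothetical names, and the dataclass is a stand-in that mirrors the two fields of `Document` shown above. The idea is to put only the `data` column into `page_content` and keep the URL in `metadata`.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields of Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_rows(csv_file, content_column="data", source_column="url"):
    """Build one document per CSV row, keeping the URL only in metadata."""
    docs = []
    for row_number, row in enumerate(csv.DictReader(csv_file)):
        docs.append(
            SimpleDocument(
                page_content=row[content_column],
                metadata={"source": row[source_column], "row": row_number},
            )
        )
    return docs


# Tiny in-memory CSV with the same column names as the real file.
# The example.edu URLs are made up for the demonstration.
sample_csv = io.StringIO(
    "url,data\n"
    "https://example.edu/a,First page text\n"
    "https://example.edu/b,Second page text\n"
)
docs = load_rows(sample_csv)
print(docs[0].page_content)
print(docs[0].metadata)
```

With the real file, an open file handle would be passed in place of the `io.StringIO` sample, and `SimpleDocument` would presumably be swapped for LangChain's own `Document`.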

## ALSO!

Could just remove url with regex...
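
A minimal sketch of the regex option, assuming every `page_content` starts with the `url: <url>` line followed by `data: `, as in the loader output shown in the next cell. `strip_url_prefix` is a hypothetical helper name.

```python
import re

# Matches the leading "url: <url>" line and the "data: " label that follows it.
URL_PREFIX = re.compile(r"^url: \S+\s+data: ")


def strip_url_prefix(page_content: str) -> str:
    """Drop the leading 'url: ... data: ' prefix if it is present."""
    return URL_PREFIX.sub("", page_content, count=1)


sample = (
    "url: https://www2.brockport.edu/academics/kinesiology\n"
    "data: Training our minds, training our bodies."
)
print(strip_url_prefix(sample))
```

Text without the prefix passes through unchanged, so the helper could be applied to every document before chunking.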

In [14]:
data

[Document(page_content='url: https://www2.brockport.edu/academics/advisement/handbook/computer-information-systems\ndata: Computer Information Systems\nThe computer science and information systems courses recommended here are required in the major. For more information contact the Department of Computing Sciences. The Computer Information Systems major is accredited by the Computing Accreditation Commission (CAC) of ABET.\nRequired Courses in the Freshman Year\nRequired Courses in the Sophomore Year\nPrerequisite Courses\nGrade Requirement\nThis major/program requires a grade of “C” or above in the following courses. “C-” or lower grades are NOT acceptable and require repeats unless otherwise specified.\n- CSC 203\n- CSC 205\n- CSC 209\n- CIS 117\n- CIS 202\n- CIS 206\nAuxiliary Courses\nStudents are strongly recommended to take courses that have a writing component. Students are advised to take some basic business courses, such as BUS 325, 335, 345 and 365.', metadata={'source': 'http

In [15]:
# Chunk it all
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300, 
    chunk_overlap=50
)
texts = text_splitter.split_documents(data)
len(texts)

2420
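
To make the two splitter parameters concrete, here is a naive fixed-width illustration of `chunk_size` and `chunk_overlap`. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as newlines and spaces); `sliding_chunks` is a hypothetical helper that only shows the windowing idea.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=50):
    """Naive splitter: each chunk starts chunk_size - chunk_overlap after the last."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 50-character overlap plays in the cell above.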

In [16]:
import re

def standardize_string(input_string):
    # Step 1: Replace '\n' characters with spaces
    standardized_string = input_string.replace('\n', ' ')

    # Step 2: Standardize the number of spaces
    standardized_string = re.sub(r'\s+', ' ', standardized_string)

    # Step 3: Remove non-alphanumeric characters at the start of the string
    standardized_string = re.sub(r'^[^a-zA-Z0-9]+', '', standardized_string)

    return standardized_string.strip()

In [17]:
# Create new 'texts' with some additional filters
texts_cleaned = []

# Clean each chunk's page_content and keep only chunks longer than 100 characters.
for doc in texts:
    doc.page_content = standardize_string(doc.page_content)

    if len(doc.page_content) > 100:
        texts_cleaned.append(doc)

texts_cleaned

[Document(page_content='url: https://www2.brockport.edu/academics/advisement/handbook/computer-information-systems data: Computer Information Systems', metadata={'source': 'https://www2.brockport.edu/academics/advisement/handbook/computer-information-systems', 'row': 0}),
 Document(page_content='The computer science and information systems courses recommended here are required in the major. For more information contact the Department of Computing Sciences. The Computer Information Systems major is accredited by the Computing Accreditation Commission (CAC) of ABET.', metadata={'source': 'https://www2.brockport.edu/academics/advisement/handbook/computer-information-systems', 'row': 0}),
 Document(page_content='Required Courses in the Freshman Year Required Courses in the Sophomore Year Prerequisite Courses Grade Requirement This major/program requires a grade of “C” or above in the following courses. “C-” or lower grades are NOT acceptable and require repeats unless otherwise specified. 

# Now Create Embeddings

Going to try OpenAI embeddings this time around, at Morgan's suggestion that they're the best.

In [18]:
embedding_function = OpenAIEmbeddings()

vectordb = Chroma.from_documents(
    documents = texts_cleaned,
    embedding = embedding_function
)

In [19]:
retreiver = vectordb.as_retriever()
docs = retreiver.get_relevant_documents("Where do I apply")

docs

[Document(page_content='Choose Your Application Please choose only ONE of the following applications to fill out. Both applications are equally accepted. Track your Progress You can track your application status in our online system and check for any missing documents or additional required items. Other Audiences', metadata={'source': 'https://www2.brockport.edu/admissions/apply', 'row': 141}),
 Document(page_content='Choose Your Application Please choose only ONE of the following applications to fill out. Both applications are equally accepted. Track your Progress You can track your application status in our online system and check for any missing documents or additional required items. Other Audiences', metadata={'source': 'https://www2.brockport.edu/admissions/apply/index', 'row': 247}),
 Document(page_content='Complete Application The “Start Application” link will take you to the Application Management page. Please follow the prompts for First Time or Returning users. If you have s

In [20]:
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retreiver,
    return_source_documents=True,
    verbose=True
)

# Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [21]:
query = "I got accepted! Can you give me a list of what are my next steps are?"
llm_response = qa_chain(query)
process_llm_response(llm_response)



> Entering new RetrievalQA chain...

> Finished chain.
 Congratulations on your acceptance! Your next steps depend on what type of student you are: Freshman, Transfer, Educational Opportunity, or International. Please refer to the checklists provided in the context to complete your enrollment, select housing, register for orientation, and much more.


Sources:
https://www2.brockport.edu/admissions/brockport-bound
https://www2.brockport.edu/admissions/orientation/first-year
https://www2.brockport.edu/admissions/brockport-bound
https://www2.brockport.edu/admissions/apply/status


In [22]:
query = "Does Brockport have a computer science department?"
llm_response = qa_chain(query)
process_llm_response(llm_response)



> Entering new RetrievalQA chain...

> Finished chain.
 Yes, Brockport has a computer information systems department.


Sources:
https://www2.brockport.edu/academics/advisement/handbook/computer-science-major
https://www2.brockport.edu/academics/computing-sciences
https://www2.brockport.edu/academics/advisement/handbook/computer-information-systems
https://www2.brockport.edu/academics/online


In [23]:
query = "What are some of the different ways I can apply to Brockport?"
llm_response = qa_chain(query)
process_llm_response(llm_response)



> Entering new RetrievalQA chain...

> Finished chain.
 You can apply to SUNY Brockport online, through their partner schools in the Rochester area, or through an application assistance program.


Sources:
https://www2.brockport.edu/academics/international-education
https://www2.brockport.edu/academics/field-experience
https://www2.brockport.edu/admissions/information/counselors
https://www2.brockport.edu/graduate/about


In [24]:
query = "Can you give me a list of all the majors that Brockport offers??"
llm_response = qa_chain(query)
process_llm_response(llm_response)

# PROBLEM FOUND! The URL it provided contains a `?` that selects a filter box. In this case, the link was filtered to majors and other programs available online, not in person.
# It did return the correct response, though, so that's good anyway.



> Entering new RetrievalQA chain...

> Finished chain.
 Yes, Brockport offers 50 undergraduate majors, 55+ minors and 50+ graduate programs. You can find a full list of all the majors on the Brockport website at https://www2.brockport.edu/go/online.


Sources:
https://www2.brockport.edu/admissions
https://www2.brockport.edu/go/online
https://www2.brockport.edu
https://www2.brockport.edu/academics/programs
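
One possible way to handle the `?` problem noted in the comment above is to drop the query string and fragment from each URL before storing it as a source. This is only a sketch: `strip_query` is a hypothetical helper, and the `delivery=online` parameter in the example is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# The query parameter below is invented for the example.
print(strip_query("https://www2.brockport.edu/academics/programs?delivery=online#list"))
# https://www2.brockport.edu/academics/programs
```

Whether this is desirable depends on the page: some URLs may only make sense with their query string, so it might be safer to apply this selectively.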


In [33]:
query = "Can you transfer in? If so, what are the steps?"
llm_response = qa_chain(query)
process_llm_response(llm_response)



> Entering new RetrievalQA chain...

> Finished chain.
 Yes, you can transfer in to a college or university. The steps will depend on the type of student you are (Freshman, Transfer, Educational Opportunity, or International student). Refer to the above checklists for the necessary steps to complete your enrollment, select housing, register for orientation, and much more! You should also reach out to your academic advisor after your official transcript is on file in your student record.


Sources:
https://www2.brockport.edu/admissions/brockport-bound
https://www2.brockport.edu/graduate/current-student-faq
https://www2.brockport.edu/admissions/information/accepted
https://www2.brockport.edu/admissions/apply-to-brockport


In [26]:
query = "What is the DPT program?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

# Interesting. Seems right. Checked via https://www2.brockport.edu/academics/advisement/handbook/doctor-physical-therapy/



> Entering new RetrievalQA chain...

> Finished chain.
 The DPT program is the 3+3 Doctorate of Physical Therapy Program. It is an early decision program that requires 3 years of study at SUNY Brockport and 3 years of study at SUNY Upstate Medical University. Upon successful completion of three years at Brockport, the student will transfer to SUNY Upstate Medical and SUNY Brockport will grant a Bachelors Degree upon successful completion (at least a 2.0 GPA) of the first year of study at Upstate. GREs are waived for students accepted under the Early Admission Program for the 3+3 DPT and all science prerequisite courses must include laboratories, and are acceptable for transfer credit with grades of “C” or better.


Sources:
https://www2.brockport.edu/academics/advisement/handbook/doctor-physical-therapy
https://www2.brockport.edu/academics/advisement/handbook/doctor-physical-therapy
https://www2.brockport.edu/academics/advisement/handbook/doctor-physical-therapy
https

In [27]:
query = "Tell me about the CS department"
llm_response = qa_chain(query)
process_llm_response(llm_response)



> Entering new RetrievalQA chain...

> Finished chain.
 The Department of Computing Sciences offers a Computer Information Systems major accredited by the Computing Accreditation Commission (CAC) of ABET, as well as Software Development (SD) and Advanced Computing (AC) tracks of the Computer Science major. Students with a weak mathematical background are advised to take additional mathematics courses prior to enrolling in the first course of the major, CSC 203. This is a demanding course that requires about 10 hours of programming each week. For more information, contact the department.


Sources:
https://www2.brockport.edu/academics/advisement/handbook/computer-information-systems
https://www2.brockport.edu/academics/advisement/handbook/computer-science-major
https://www2.brockport.edu/academics/advisement/handbook/computer-science-major
https://www2.brockport.edu/academics/advisement/handbook/computer-science-major


In [28]:
def qa(question: str):
    llm_response = qa_chain(question)
    process_llm_response(llm_response)

In [29]:
qa("Does Brockport offer scholarships")



> Entering new RetrievalQA chain...

> Finished chain.
 Yes, Brockport offers several scholarships to first-year and transfer students.


Sources:
https://www2.brockport.edu/scholarships-aid
https://www2.brockport.edu/admissions/test-optional
https://www2.brockport.edu/about/facts
https://www2.brockport.edu/scholarships-aid/scholarships


In [30]:
qa("What kind of scholarships does Brockport offer?")



> Entering new RetrievalQA chain...

> Finished chain.
 Brockport offers merit-based Extraordinary Academic Scholarships, including the Honors College Prometheus Scholarship, that are test optional; Gold Scholarships with a high school GPA of 93+; Green Scholarships with a high school GPA of 88+; and Foundation Scholarships created by donors.


Sources:
https://www2.brockport.edu/scholarships-aid
https://www2.brockport.edu/admissions/test-optional
https://www2.brockport.edu/scholarships-aid
https://www2.brockport.edu/scholarships-aid/scholarships


In [31]:
qa("What can you tell me about the foundation scholarships?")

# Not strictly right, I don't think. Mainly towards the end.



> Entering new RetrievalQA chain...

> Finished chain.
 Foundation scholarships are created by donors to Brockport, such as alumni, friends, faculty, emeriti, staff and parents. They recognize students for their academic excellence, co-curricular activities, and professional promise. There are two types of foundation scholarships: the Fannie Barrier Williams Scholars ($1,500) and the Transfer Award ($1,000). If you have any questions, please contact the Foundation Scholarship Office at (585) 395-5105.


Sources:
https://www2.brockport.edu/scholarships-aid/scholarships
https://www2.brockport.edu/scholarships-aid/scholarships
https://www2.brockport.edu/scholarships-aid/scholarships
https://www2.brockport.edu/scholarships-aid


In [34]:
qa("Where can I learn more about admissions?")



> Entering new RetrievalQA chain...

> Finished chain.
 You can learn more about admissions by speaking with an Undergraduate Admissions advisor or by visiting the Office of Undergraduate Admissions website.


Sources:
https://www2.brockport.edu/admissions/apply/faq
https://www2.brockport.edu/admissions/apply/readmission
https://www2.brockport.edu/graduate/current-student-faq
https://www2.brockport.edu/admissions/apply/faq


In [35]:
qa("What is the website to apply on?")



> Entering new RetrievalQA chain...

> Finished chain.
 The website to apply on is https://www2.brockport.edu/graduate/admissions.


Sources:
https://www2.brockport.edu/admissions/apply
https://www2.brockport.edu/admissions/apply/index
https://www2.brockport.edu/graduate/admissions
https://www2.brockport.edu/graduate/admissions


## Testing some more.

Now trying to strip the URLs out of the data for semantic search. I don't think the URLs are super important, but they're kind of nice because on occasion they allow the bot to point someone in the right direction. Curious whether they're degrading the quality of semantic search.

Actually going to do this in a different doc.

In [38]:
data

[Document(page_content='url: https://www2.brockport.edu/academics/advisement/handbook/computer-information-systems\ndata: Computer Information Systems\nThe computer science and information systems courses recommended here are required in the major. For more information contact the Department of Computing Sciences. The Computer Information Systems major is accredited by the Computing Accreditation Commission (CAC) of ABET.\nRequired Courses in the Freshman Year\nRequired Courses in the Sophomore Year\nPrerequisite Courses\nGrade Requirement\nThis major/program requires a grade of “C” or above in the following courses. “C-” or lower grades are NOT acceptable and require repeats unless otherwise specified.\n- CSC 203\n- CSC 205\n- CSC 209\n- CIS 117\n- CIS 202\n- CIS 206\nAuxiliary Courses\nStudents are strongly recommended to take courses that have a writing component. Students are advised to take some basic business courses, such as BUS 325, 335, 345 and 365.', metadata={'source': 'http