This repository has been archived by the owner on Aug 24, 2023. It is now read-only.

kael558/WebIndexer



Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact

About The Project

I created this project for LabLab.ai's AI21 Labs Hackathon.

The search bar on most websites performs keyword search, a slow and arduous process in which the user has to read through a myriad of information before finding the tidbit they wanted in the first place.

My goal was to create a question answering tool that can be easily integrated into any website. It lets users find specific information, presents answers in a clear, understandable way, and includes sources and further information should the user need them.

Presenting Web Indexer, developed to significantly improve the user experience by providing a service that ChatGPT, Google and standard search bars do not.


Built With


Getting Started

To scrape a website:

  1. Navigate to the spiders directory inside the scraper directory
  2. Change the URLs and domains in the Spider class
  3. Run the following command in the terminal: scrapy crawl text -O ../data/{filename}.csv
  4. To limit the number of pages scraped, set CLOSESPIDER_PAGECOUNT (for example, CLOSESPIDER_PAGECOUNT = 10) in settings.py
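The crawl command above can also be assembled from a script. The following is a minimal sketch, not part of the project: the helper names build_crawl_command and run_crawl are hypothetical, and it assumes Scrapy is installed and that the script is started from the same directory as the manual command. It uses Scrapy's -s NAME=VALUE option to override CLOSESPIDER_PAGECOUNT for a single run instead of editing settings.py.

```python
import subprocess


def build_crawl_command(filename, page_limit=None):
    """Build the argument list for the crawl described in the steps above.

    `filename` is the CSV base name; `page_limit` is optional and, when
    given, overrides CLOSESPIDER_PAGECOUNT for this run only.
    """
    # "text" is the spider name used in the README's command.
    command = ["scrapy", "crawl", "text", "-O", f"../data/{filename}.csv"]
    if page_limit is not None:
        # -s NAME=VALUE overrides a Scrapy setting from the command line.
        command += ["-s", f"CLOSESPIDER_PAGECOUNT={page_limit}"]
    return command


def run_crawl(filename, page_limit=None):
    """Run the crawl; raises CalledProcessError if Scrapy exits with an error."""
    subprocess.run(build_crawl_command(filename, page_limit), check=True)
```

For example, run_crawl("example_site", page_limit=10) would write ../data/example_site.csv and stop after roughly ten pages, since Scrapy finishes requests already in flight when the limit is reached.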

To inspect the content of an HTML page:

  1. Navigate to the main directory
  2. Run scrapy shell <url>
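Inside the Scrapy shell, the fetched page is available as the response object, which can be queried interactively. To illustrate the kind of paragraph extraction a text spider performs, here is a small sketch that uses only Python's standard html.parser rather than Scrapy. The class and function names are hypothetical, and this is not the project's spider code.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of every <p> element in a page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0      # how many <p> elements are currently open
        self._buffer = []    # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace and skip empty paragraphs.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.paragraphs.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of each non-empty paragraph in `html`, in page order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Text inside nested inline tags such as bold or links is kept as part of the enclosing paragraph, while headings and other text outside paragraph tags are ignored.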


Usage

  • Choose the website you wish to scrape on the Streamlit server
  • Enter a question that you would like answered
  • Adjust the threshold and number of paragraphs to control the context
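One plausible reading of the threshold and paragraph-count controls is sketched below: paragraphs are scored against the question, those scoring below the threshold are dropped, and at most a fixed number of the best matches are kept as context. This is an assumption about how the controls behave, not the project's implementation. The embed function is a toy bag-of-words stand-in for a real embedding model, and all names and default values are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_context(question, paragraphs, threshold=0.2, max_paragraphs=3):
    """Keep paragraphs scoring at least `threshold`, best first, capped at `max_paragraphs`."""
    q = embed(question)
    scored = [(cosine(q, embed(p)), p) for p in paragraphs]
    kept = [(score, p) for score, p in scored if score >= threshold]
    # Stable sort: paragraphs with equal scores keep their original order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in kept[:max_paragraphs]]
```

Under this reading, raising the threshold makes the context stricter (fewer, more relevant paragraphs), while raising the paragraph count gives the answering model more material at the cost of a longer prompt.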


Roadmap

  • Create benchmarks
  • Summarize the context, which may improve accuracy
  • Support conversation-style queries, with prior questions as context
  • Fine-tune both the embedding and generation models
  • Access the attention layer to improve the relevance of returned links


Contributing

This repository is intended as an archive. No changes will be made to it in the future.

You may fork the project and work in your own repository.

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Rahel Gunaratne:

