Table of Contents

About The Project
- Built With
Getting Started
Usage
Roadmap
Contributing
License
Contact

About The Project

I created this project for LabLab.ai's AI21 Labs Hackathon.

The search bar on most websites performs keyword search, a slow and arduous process in which the user has to read through a large amount of information before finding the detail they wanted in the first place.

My goal was to create a question-answering tool that can be easily integrated into any website. It lets users find specific information, gives answers in a clear, understandable way, and includes sources and further information should the user need them.

Presenting Web Indexer, developed to improve the user experience by providing a service that ChatGPT, Google and standard search bars do not offer.

(back to top)

Built With

(back to top)

Getting Started

To scrape a website:

Navigate to the spiders directory inside the scraper directory
Change the URLs and domains in the Spider class
Run the following command in the terminal: scrapy crawl text -O ../data/{filename}.csv
Limit the number of pages scraped by setting CLOSESPIDER_PAGECOUNT = 10 in settings.py
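
The crawl step can also be scripted. The sketch below is not part of this repository: build_crawl_command is a hypothetical helper that assembles the same scrapy crawl command shown in the steps, and it assumes you would rather pass the page limit on the command line with Scrapy's -s NAME=VALUE option than edit settings.py.

```python
import shlex
from typing import List, Optional


def build_crawl_command(filename: str, page_count: Optional[int] = None) -> List[str]:
    """Assemble the crawl command from the steps above as an argument list."""
    # "text" is the spider name used in this README's command;
    # -O overwrites the output file if it already exists.
    cmd = ["scrapy", "crawl", "text", "-O", f"../data/{filename}.csv"]
    if page_count is not None:
        # -s NAME=VALUE overrides a Scrapy setting for this run only,
        # instead of setting CLOSESPIDER_PAGECOUNT in settings.py.
        cmd += ["-s", f"CLOSESPIDER_PAGECOUNT={page_count}"]
    return cmd


# "example" is a placeholder output name, not a file from the repository.
command = build_crawl_command("example", page_count=10)
print(shlex.join(command))
```

To actually start the crawl, the list could be handed to subprocess.run(command, check=True) from inside the spiders directory, since the output path is relative to it.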

To see the content of an HTML page:

Navigate to the main directory
Run scrapy shell <url>
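
Inside the Scrapy shell the fetched page is exposed as response, and selectors such as response.css("p::text").getall() show what text a spider would pick up. When Scrapy is not at hand, a rough stand-in is sketched below using only Python's standard html.parser module. This is a swapped-in technique for previewing visible text, not how the scraper in this repository works, and TextPreview and preview_text are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class TextPreview(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def preview_text(html: str) -> List[str]:
    """Return the non-empty text fragments found in an HTML string."""
    parser = TextPreview()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample page, used only to illustrate the helper.
sample = (
    "<html><head><style>p{}</style></head>"
    "<body><h1>Hours</h1><p>Open 9 to 5.</p></body></html>"
)
print(preview_text(sample))
```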

(back to top)

Usage

Choose the website you wish to scrape on the Streamlit server
Enter a question that you would like answered
Adjust the threshold and the number of paragraphs to control the context used to answer the question
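
How the two controls interact can be pictured with a small sketch. The README does not say how the threshold is applied, so this assumes it is a minimum cosine similarity between the question embedding and each paragraph embedding, and that the number of paragraphs caps how many of the best matches are kept. select_context and the toy vectors are hypothetical and are not taken from index.py or generation.py.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_context(
    question_vec: Sequence[float],
    paragraphs: List[Tuple[str, Sequence[float]]],
    threshold: float,
    max_paragraphs: int,
) -> List[str]:
    """Keep paragraphs scoring at least `threshold`, best first, capped at `max_paragraphs`."""
    scored = [(cosine_similarity(question_vec, vec), text) for text, vec in paragraphs]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:max_paragraphs]]


# Toy two-dimensional "embeddings"; a real embedding model returns far longer vectors.
question = [1.0, 0.0]
paragraphs = [
    ("Opening hours", [0.9, 0.1]),
    ("Pricing", [0.5, 0.5]),
    ("Careers", [0.0, 1.0]),
]
print(select_context(question, paragraphs, threshold=0.6, max_paragraphs=2))
```

Under these assumptions, raising the threshold drops weaker matches, while lowering the paragraph count shortens the context even when many paragraphs pass the threshold.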

(back to top)

Roadmap

Create benchmarks
Summarize the context, which may improve accuracy
Support conversation-style use, with prior questions as context
Fine-tune both the embedding and generation models
Access the attention layer to improve the relevance of returned links

(back to top)

Contributing

This repository is intended as an archive. No changes will be made to it in the future.

You may fork the project and work in your own repository.

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Rahel Gunaratne:

(back to top)
