Autonomous Semantic Search Engine

A search engine that autonomously crawls documents from a given domain, including its subdomains, analyzes them, and presents them in a search frontend. This implementation demonstrates the functionality on Stanford University's website. The project was implemented during the Next Iteration Hackathon 2018.

Web crawler

Python implementation with Scrapy here.
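The "given domain including its subdomains" scope can be sketched as a plain hostname check. This is a minimal illustration using only the standard library, not the project's crawler code. The helper name `in_scope` and the example URLs are hypothetical. Scrapy's `allowed_domains` spider attribute applies a comparable rule when it filters off-site requests.

```python
from urllib.parse import urlsplit


def in_scope(url: str, root_domain: str) -> bool:
    """Return True if the URL's host is root_domain or one of its subdomains."""
    # hostname is None for URLs without a network location (e.g. mailto: links).
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    root = root_domain.lower().strip(".")
    # Require an exact match or a dot boundary, so "notstanford.edu"
    # is not mistaken for a subdomain of "stanford.edu".
    return host == root or host.endswith("." + root)


if __name__ == "__main__":
    for candidate in (
        "https://cs.stanford.edu/people",
        "https://stanford.edu/",
        "https://notstanford.edu/",
    ):
        print(candidate, in_scope(candidate, "stanford.edu"))
```

Matching on a dot boundary rather than a bare suffix is the design choice that matters here: it keeps look-alike hosts out of the crawl.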

Document analysis

Python implementation using Watson NLU (for named entities and keywords), gensim (for summarization and semantic representation), and a custom document type classifier (Random Forest, with sklearn). The title, a thumbnail, and embedded images are also extracted from each document. See notebooks for specific implementations.
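As a rough illustration of a Random Forest document type classifier with sklearn, the sketch below trains on TF-IDF features of the document text. The feature choice, the label names, and the toy training data are all assumptions made for this example. The features and classes the project actually uses are in the notebooks.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only.
TRAIN_TEXTS = [
    "abstract introduction related work experiments conclusion references",
    "we propose a method and evaluate it on benchmark datasets",
    "lecture schedule homework deadlines grading policy office hours",
    "course syllabus prerequisites weekly readings and exams",
    "press release the university announced a new research center",
    "news campus event celebrates alumni and student achievements",
]
TRAIN_LABELS = ["paper", "paper", "syllabus", "syllabus", "news", "news"]


def build_classifier():
    """TF-IDF text features feeding a Random Forest (assumed setup)."""
    return make_pipeline(
        TfidfVectorizer(),
        RandomForestClassifier(n_estimators=50, random_state=0),
    )


clf = build_classifier()
clf.fit(TRAIN_TEXTS, TRAIN_LABELS)

# Classify an unseen document; the result is one of the training labels.
predicted = clf.predict(["homework deadlines and exam schedule for the course"])
print(predicted[0])
```

Wrapping the vectorizer and the forest in one pipeline keeps the vocabulary learned at training time attached to the model, so new documents are transformed the same way at prediction time.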

Web frontend

A React frontend that displays the information, with additional image information from Bing Image Search, here.

brew Dependencies

Swig etc. for Textract: https://textract.readthedocs.io/en/stable/installation.html
Ghostscript: https://wiki.scribus.net/canvas/Installation_and_Configuration_of_Ghostscript
ImageMagick 6: ImageMagick/ImageMagick#953
