# Autonomous Semantic Search Engine

A search engine that autonomously crawls documents from a given domain, including its subdomains, analyzes them, and renders the results in a search frontend. This implementation demonstrates the functionality on Stanford University's website. The project was built during the Next Iteration Hackathon 2018.

## Web crawler

Python implementation with Scrapy; see the `pdfcrawler` directory.
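
The sketch below is a minimal, hypothetical Scrapy spider in the spirit of this component: it stays within a domain and its subdomains and records the PDF documents it finds. The class name, seed URL, and PDF-only rule are illustrative assumptions, not the project's actual code.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DomainDocumentSpider(CrawlSpider):
    """Crawls a domain (and its subdomains) and yields the URLs of PDF documents it finds."""

    name = "domain_documents"
    allowed_domains = ["stanford.edu"]          # also covers subdomains such as cs.stanford.edu
    start_urls = ["https://www.stanford.edu/"]

    rules = (
        # Collect links that point at PDF files (deny_extensions=[] keeps them from being filtered out).
        Rule(
            LinkExtractor(allow=r"\.pdf$", deny_extensions=[]),
            callback="parse_document",
            follow=False,
        ),
        # Follow ordinary pages to keep discovering new documents.
        Rule(LinkExtractor(), follow=True),
    )

    def parse_document(self, response):
        # Record where the document was found; downstream analysis fetches the content.
        yield {"url": response.url}
```

Run as a standalone script with `scrapy runspider spider.py -o documents.json` to collect the discovered document URLs.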

## Document analysis

Python implementation using Watson NLU (for named entities and keywords), gensim (for summarization and a semantic representation), and a custom document-type classifier (random forest, built with scikit-learn). The title, a thumbnail, and embedded images are also extracted from each document. See the `notebooks` directory for the specific implementations.
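
A condensed sketch of these steps is shown below, assuming the current `ibm-watson` SDK naming, gensim < 4.0 (its summarization module was removed in 4.0), placeholder credentials, and an invented three-document training set. The project's real pipeline lives in the notebooks.

```python
from gensim.summarization import summarize          # gensim < 4.0
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    EntitiesOptions, Features, KeywordsOptions,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def extract_entities_and_keywords(text):
    """Named entities and keywords via Watson NLU (placeholder credentials)."""
    nlu = NaturalLanguageUnderstandingV1(
        version="2019-07-12",
        authenticator=IAMAuthenticator("YOUR_API_KEY"),
    )
    nlu.set_service_url("YOUR_SERVICE_URL")
    return nlu.analyze(
        text=text,
        features=Features(entities=EntitiesOptions(limit=10),
                          keywords=KeywordsOptions(limit=10)),
    ).get_result()


def summarize_document(text):
    """Extractive summary via gensim's TextRank-based summarizer."""
    return summarize(text, word_count=80)


# Document-type classifier: TF-IDF features fed into a random forest.
# The training texts and labels below are illustrative only.
train_texts = [
    "course syllabus for CS 101 with weekly readings ...",
    "annual financial report and budget overview ...",
    "research paper abstract, methods and results ...",
]
train_labels = ["syllabus", "report", "paper"]

doc_type_clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))
doc_type_clf.fit(train_texts, train_labels)
print(doc_type_clf.predict(["midterm exam schedule and reading list ..."]))
```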

## Web frontend

A React frontend that displays the extracted information, enriched with additional images retrieved via Bing Image Search; see the `search-frontend` directory.
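
The frontend itself is React, but the image lookup boils down to a single REST call. For consistency with the rest of this README, the sketch below shows an equivalent request in Python against the Bing Image Search v7 endpoint; the subscription key and query are placeholders, and the exact parameters the project uses are assumptions.

```python
# Illustrative Bing Image Search v7 request; not the frontend's actual code.
from typing import List

import requests

BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/images/search"
BING_KEY = "YOUR_SUBSCRIPTION_KEY"


def fetch_thumbnails(query: str, count: int = 5) -> List[str]:
    """Return thumbnail URLs for a search query."""
    response = requests.get(
        BING_ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": BING_KEY},
        params={"q": query, "count": count},
        timeout=10,
    )
    response.raise_for_status()
    return [hit["thumbnailUrl"] for hit in response.json().get("value", [])]


print(fetch_thumbnails("Stanford University campus"))
```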

## brew Dependencies