An easy-to-use news crawler that extracts structured information from The Guardian International Edition website. The crawler combines Python libraries such as requests, for making HTTP requests, and BeautifulSoup (bs4), for parsing HTML.
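As a rough illustration of how these two libraries fit together, the sketch below fetches a page and lists its outbound links; the URL and the markup handling here are examples, not the crawler's actual logic.

```python
# Minimal sketch: fetch a Guardian page with requests and parse it with
# BeautifulSoup. The URL and the tags inspected are illustrative only.
import requests
from bs4 import BeautifulSoup

url = "https://www.theguardian.com/international"  # example page
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Print the first few link targets found on the page; the real crawler
# would filter these down to article URLs.
for anchor in soup.find_all("a", href=True)[:10]:
    print(anchor["href"])
```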
The structured information scraped from each article is stored in MongoDB Atlas, a hosted MongoDB database. It includes Title, Link, Topic, Authors, Datetime, Description, Text, and Related topics links.
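For illustration, a stored article document might look like the sketch below; the field names mirror the list above, and every value is a placeholder rather than real scraped data.

```python
# Hypothetical shape of one article document; field names follow the list
# above, values are placeholders.
from datetime import datetime

article = {
    "title": "Example headline",
    "link": "https://www.theguardian.com/world/example",
    "topic": "World news",
    "authors": ["Jane Doe"],
    "datetime": datetime(2023, 1, 1, 12, 0),
    "description": "Short standfirst of the article.",
    "text": "Full article body...",
    "related_topics_links": ["https://www.theguardian.com/world/europe"],
}
```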
Before submitting data to MongoDB, the scraped text is cleaned up with the spaCy and NLTK libraries so that it can be used later by the search engine.
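A minimal cleaning sketch, assuming the pipeline lowercases the text, lemmatizes with spaCy, and drops NLTK stopwords; the project's actual cleaning steps may differ.

```python
# Illustrative cleaning pass: lowercase, lemmatize, drop stopwords and
# non-alphabetic tokens. Requires `python -m spacy download en_core_web_sm`
# and `nltk.download("stopwords")` to have been run once.
import spacy
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words("english"))

def clean_text(raw_text: str) -> str:
    doc = nlp(raw_text.lower())
    tokens = [
        token.lemma_
        for token in doc
        if token.is_alpha and token.lemma_ not in stop_words
    ]
    return " ".join(tokens)

print(clean_text("The reporters were covering the elections in Europe."))
```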
Before running the crawler, make sure to connect this Python application to your MongoDB Atlas cluster:
- To learn how to generate a URI connection string, see the MongoDB Atlas documentation.
- Insert this connection string in the 'url_connection' variable of the 'parameters.yaml' file (see the sketch after this list).
- Create a database and a collection and set their names in the 'parameters.yaml' file.
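A sketch of how the application might read these settings with PyYAML and pymongo; only 'url_connection' is named above, so the database and collection key names shown here are assumptions.

```python
# Load connection settings and open the collection with pymongo.
# Assumed parameters.yaml layout ('database_name' and 'collection_name'
# are hypothetical key names):
#
#   url_connection: "mongodb+srv://<user>:<password>@cluster0.example.mongodb.net"
#   database_name: "guardian_db"
#   collection_name: "articles"
import yaml
from pymongo import MongoClient

with open("parameters.yaml") as f:
    params = yaml.safe_load(f)

client = MongoClient(params["url_connection"])
collection = client[params["database_name"]][params["collection_name"]]
print(collection.estimated_document_count())
```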
To use the crawler, simply run:
git clone https://github.com/mcmxlix/the_guardian_crawler.git
cd the_guardian_crawler
pip install -r requirements.txt
cd Crawler
python __main__.py
A search engine API that connects to MongoDB Atlas and retrieves the crawled data. When searching with keywords, the engine compares these keywords with the data of each article (specifically the "clean_text" field) and returns the list of matching articles, sorted by relevance.
TF-IDF is the algorithm used to rank the pages. TF (term frequency) is the number of occurrences of a word in a document divided by the total number of words in that document. IDF (inverse document frequency) measures how rare the word is across all documents:
TF = (occurrences of the word in the document) / (total words in the document)
IDF = log((total documents) / (number of documents containing the word))
TF-IDF = TF × IDF
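A toy scorer that follows these formulas, assuming each document is an article's "clean_text" string; it is illustrative only, not the project's implementation.

```python
# Score documents against a query with plain TF-IDF, as defined above.
import math

def tf(term: str, tokens: list[str]) -> float:
    # occurrences of the term / total tokens in the document
    return tokens.count(term) / len(tokens)

def idf(term: str, corpus: list[list[str]]) -> float:
    # log(total documents / documents containing the term)
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing) if containing else 0.0

corpus = [doc.split() for doc in [
    "election result europe",
    "europe climate policy",
    "football result weekend",
]]

query = ["europe", "result"]
for i, tokens in enumerate(corpus):
    score = sum(tf(t, tokens) * idf(t, corpus) for t in query)
    print(f"doc {i}: {score:.3f}")
```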
The following commands continue from the previous ones:
cd ../Search_engine
python app.py