# The Hackernews Scraper

## Importing Libraries

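* Requests: to make HTTP(S) requests
* BeautifulSoup4: to parse the HTML of the fetched pages
* PyMongo: to interact with a MongoDB database
* pathlib and json (standard library): to write the scraped data to a JSON file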

In [None]:
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient
from pathlib import Path
import json

## Phase 1: Extracting the metadata of all posts from the pages requested by the user

First, let's create a function that fetches a link and returns its parsed HTML as a BeautifulSoup object, or `False` if the request fails.

In [None]:
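def link_to_soup(link):
    # A timeout keeps the notebook from hanging on an unresponsive server
    response = requests.get(link, timeout=10)
    if response.ok:
        return BeautifulSoup(response.text, 'html.parser')
    # Status code 400 or above: signal the failure to the caller
    return False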

Let's request the home page of TheHackerNews.com.

In [None]:
home_page_soup = link_to_soup('https://thehackernews.com/')
print(home_page_soup)

The home page is fetched and parsed successfully. Now let's ask the user how many pages to extract data from.

In [None]:
No_pages = int(input('How many pages to extract the data from? '))
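
`int(input(...))` raises `ValueError` on non-numeric input and happily accepts zero or negative counts. A minimal sketch of a stricter parser follows; the helper name `parse_page_count` and the upper bound of 50 are assumptions for illustration, not part of the original notebook.

```python
def parse_page_count(raw, maximum=50):
    """Return a page count between 1 and `maximum`, or raise ValueError."""
    count = int(raw.strip())  # raises ValueError for non-numeric text
    if not 1 <= count <= maximum:
        raise ValueError("page count must be between 1 and {}".format(maximum))
    return count


print(parse_page_count(" 3 "))
```

In the notebook this could wrap the `input(...)` call, for example inside a loop that asks again whenever `ValueError` is raised.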

We will fetch and parse the rest of the pages, appending each one to the list `pages`.

In [None]:
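pages = [home_page_soup]
# The home page is already fetched, so only No_pages - 1 more pages are needed
for i in range(No_pages - 1):
    # Each page links to the next-older page through its pager anchor
    next_page_link = pages[i].find("a", class_="blog-pager-older-link-mobile")['href']
    pages.append(link_to_soup(next_page_link))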
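
The loop above assumes that every fetch succeeds and that every page has an older-posts link. The sketch below shows the same follow-the-link pattern with both checks added. To stay runnable without network access it works on plain dictionaries: `collect_pages`, `fetch`, `next_link`, and the `FAKE_SITE` data are all hypothetical stand-ins, where `fetch` would be `link_to_soup` and `next_link` would wrap the `find(...)['href']` lookup.

```python
def collect_pages(first_page, fetch, next_link, limit):
    """Follow 'older posts' links from first_page, collecting at most `limit` pages."""
    pages = [first_page]
    while len(pages) < limit:
        link = next_link(pages[-1])
        if link is None:      # no older-posts link: last page reached
            break
        page = fetch(link)
        if not page:          # fetch failed (link_to_soup returns False)
            break
        pages.append(page)
    return pages


# Hypothetical stand-in for the site: each "page" names the next one.
FAKE_SITE = {
    "/p2": {"name": "page 2", "older": "/p3"},
    "/p3": {"name": "page 3", "older": None},
}
home = {"name": "home", "older": "/p2"}

pages = collect_pages(home, FAKE_SITE.get, lambda page: page["older"], limit=5)
print([page["name"] for page in pages])
```

Passing the fetch and link-lookup steps in as functions keeps the stopping logic separate from the scraping details, so it can be checked without hitting the real site.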
    

All the requested pages are now parsed, so let's extract the data from them.

In [None]:
posts_url_title_data = []
posts_url_others_data = []

for page in pages:
    # Search the current page, not just the home page
    posts_in_page = page.find_all("a", class_='story-link')
    for post in posts_in_page:
        posts_url_title_data.append({
            "url" : post['href'],
            "title" : post.find("h2", class_='home-title').text
        })

        posts_url_others_data.append({
            "url" : post['href'],
            "desc" : post.find("div", class_='home-desc').text,
            # Drop the first and last characters surrounding the author name
            "author" : post.find("span").text[1:-1],
            "imgSrc" : post.find("img")['data-src']
        })
    
print('URL of 1st post: ', posts_url_title_data[0]['url'])
print('Description of 1st post: ', posts_url_others_data[0]['desc'])
print('Author of the 1st post: ', posts_url_others_data[0]['author'])
print('Image source of the 1st post: ', posts_url_others_data[0]['imgSrc'])
print('Title of the 1st post: ', posts_url_title_data[0]['title'])

Metadata of all posts from all pages is extracted. Phase 1 completed!

## Phase 2: Store the data

### Phase 2a: Store the data in a MongoDB database

We'll store the data in two collections, whose documents hold:
1. URL and title of the post
2. URL, description, image source, and author of the post
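
The split into two collections can be pictured as follows. This is only an illustrative sketch: `split_post` is a hypothetical helper and the sample record is made-up data, but the field names match the dictionaries built in Phase 1. The shared `url` field is what ties the two halves back together.

```python
def split_post(post):
    """Split one scraped post into a (url-title, url-others) pair of documents."""
    title_doc = {"url": post["url"], "title": post["title"]}
    others_doc = {key: post[key] for key in ("url", "desc", "author", "imgSrc")}
    return title_doc, others_doc


# Made-up sample record, for illustration only
sample = {
    "url": "https://example.com/post-1",
    "title": "Sample title",
    "desc": "Sample description",
    "author": "Sample Author",
    "imgSrc": "https://example.com/post-1.jpg",
}
title_doc, others_doc = split_post(sample)
print(title_doc)
print(sorted(others_doc))
```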

First, let's connect to the MongoDB server by asking the user for the MongoDB URI.

In [None]:
mongoDbURI = input('Enter the mongoDbURI: ')
client = MongoClient(mongoDbURI) # for users running mongo server on local machine: 'mongodb://localhost:27017/'
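
Since the URI is typed in by hand, a quick sanity check before handing it to `MongoClient` can catch obvious mistakes early. The sketch below uses only the standard library; `looks_like_mongo_uri` is a hypothetical helper, and it checks just the scheme and the presence of a host, not whether a server is actually reachable.

```python
from urllib.parse import urlparse


def looks_like_mongo_uri(uri):
    """Return True if `uri` has a MongoDB scheme and a non-empty host part."""
    parsed = urlparse(uri.strip())
    return parsed.scheme in ("mongodb", "mongodb+srv") and bool(parsed.netloc)


print(looks_like_mongo_uri("mongodb://localhost:27017/"))
print(looks_like_mongo_uri("localhost:27017"))
```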

In [None]:
# Getting the database
db = client['hackernews']

# Getting the collections
urlTitle = db['url-title']
urlOthers = db['url-others']

With the client and collections set up, we'll now insert the data.

In [None]:
result1 = urlTitle.insert_many(posts_url_title_data)
result2 = urlOthers.insert_many(posts_url_others_data)

Let's check if the values were inserted properly.

In [None]:
print(result1.inserted_ids[:5])
print(result2.inserted_ids[:5])

We can see the ObjectIds of the inserted documents, which means the documents were inserted into the database successfully.

Phase 2a completed!

### Phase 2b: Save the data as JSON

We will save the data as a single JSON file, merging `posts_url_title_data` with `posts_url_others_data`.

In [None]:
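# Merge each title record with its matching details record (same index, same URL).
# If Phase 2a ran, insert_many() added an ObjectId '_id' to every dict; json cannot serialize it, so drop it.
posts_url_data = [
    {key: value for key, value in {**title, **others}.items() if key != "_id"}
    for title, others in zip(posts_url_title_data, posts_url_others_data)
]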
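
Merging by list position works here because both lists are filled in the same loop. If the two record sets were ever loaded separately (for example, read back from the two MongoDB collections), matching on the shared `url` field would be safer. A minimal sketch under that assumption; `merge_by_url` and the sample records are hypothetical:

```python
def merge_by_url(title_records, other_records):
    """Join two lists of records on their 'url' field, dropping any '_id' key."""
    others_by_url = {record["url"]: record for record in other_records}
    merged = []
    for title in title_records:
        combined = {**title, **others_by_url.get(title["url"], {})}
        combined.pop("_id", None)  # ObjectId values are not JSON serializable
        merged.append(combined)
    return merged


# Made-up records, deliberately in different orders
titles = [{"url": "/a", "title": "A"}, {"url": "/b", "title": "B"}]
others = [{"url": "/b", "author": "Bo"}, {"url": "/a", "author": "Al"}]
print(merge_by_url(titles, others))
```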

In [None]:
output_location = input("Enter the location to save the JSON to: ")
output_path = Path(output_location)

# Create any necessary parent directories
if not output_path.parent.exists():
    print("Directory {} does not exist, creating.".format(output_path.parent))
    output_path.parent.mkdir(parents=True)

# Save JSON
with open(output_path, "w") as output_file:
    print("Saving to {}.".format(output_path))
    json.dump(posts_url_data, output_file)

print("Done!")
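
To confirm that the file holds valid JSON, it can be read back and compared with the in-memory data. The sketch below does this round trip in a temporary directory with made-up records, so `records` and `posts.json` are placeholders rather than the notebook's actual data and output location.

```python
import json
import tempfile
from pathlib import Path

# Made-up records standing in for posts_url_data
records = [{"url": "https://example.com/post-1", "title": "Sample title"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Nested path, to exercise the parent-directory creation step
    output_path = Path(tmp_dir) / "out" / "posts.json"
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, "w") as output_file:
        json.dump(records, output_file)

    with open(output_path) as input_file:
        loaded = json.load(input_file)

print(loaded == records)
```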