## Discovering my interests in Hacker News with NLP
# Part I: Get my HN favorites (upvoted)

For scraping we use `requests` and BeautifulSoup rather than Selenium: the Hacker News pages we need are served as plain HTML, so no JavaScript rendering is required.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Set your `username` (aka `acct`) and `password` (aka `pw`)

__ACTION__: Replace YOUR_ACCOUNT and YOUR_PASSWORD with your HN credentials

In [2]:
creds = {'acct': 'YOUR_ACCOUNT', 'pw': 'YOUR_PASSWORD'}

We use a `requests.Session` so that the login cookies are kept and sent with every later request

In [3]:
sess = requests.Session()
resp = sess.post('https://news.ycombinator.com/login', data=creds)
assert resp.status_code == 200

Yeah! We are logged in
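
A 200 status only shows that the request completed; it does not by itself confirm that the credentials were accepted. Below is a minimal sketch of a stricter check. It assumes (this is not stated in the notebook) that HN shows a link whose `href` starts with `logout` only to authenticated users; `looks_logged_in` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def looks_logged_in(html):
    """Return True if the page contains a link whose href starts with 'logout'.

    Assumption: such a link is rendered only for authenticated users.
    """
    collector = _LinkCollector()
    collector.feed(html)
    return any(href.startswith("logout") for href in collector.hrefs)
```

In the notebook this could be called as `assert looks_logged_in(resp.text)` right after the login request.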

Time to get the favs (upvoted)

In [4]:
cnt = 0
items = []
nextUrl = f"https://news.ycombinator.com/upvoted?id={creds['acct']}"
while True:
    lnkMore = None
    lastTitle = None
    lastLink = None
    lastPoints = None
    lastComments = None
    resp = sess.get(nextUrl)
    nextUrl = None
    dom = BeautifulSoup(resp.content, 'html.parser')
    tbl = dom.find_all("table", attrs={'class': 'itemlist'})
    for tr in tbl[0].find_all('tr'):
        if tr.has_attr('class') and tr['class'][0] == 'athing':
            a = tr.select('td[class="title"] a[class="storylink"]')[0]
            lastTitle = a.get_text()
            lastLink = a['href']
        elif tr.has_attr('class') and tr['class'][0] == 'spacer':
            # Time to append to array
            items.append({'title': lastTitle, 'url': lastLink, 'points': lastPoints, 'comments': lastComments})
            lastTitle = None
            lastLink = None
            lastPoints = None
            lastComments = None
            cnt += 1
        elif not tr.has_attr('class'):
            # Subtext row: score, comment count and, on the last row, the "More" link
            sel = tr.select('span[class="score"]')
            try:
                lastPoints = int(sel[0].get_text().split(' ')[0])
            except (IndexError, ValueError):
                # No score element, or its text does not start with a number
                lastPoints = None
            for a in tr.find_all('a'):
                if 'comment' in a.get_text():
                    try:
                        lastComments = int(a.get_text().split('\xa0')[0])
                    except ValueError:
                        lastComments = None
            more = tr.select('a[class="morelink"]')
            if more:
                # Follow the pagination link on the next loop iteration
                nextUrl = 'https://news.ycombinator.com/' + more[0]['href']
    if nextUrl is None:
        break
print(f'Total items: {cnt}')

Total items: 214


After some time (12 seconds in my case), we have the favs (214 for me).
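
The loop above extracts the score and the comment count by splitting the link text and converting the first piece to an integer. A minimal alternative sketch using a regular expression is shown here; `leading_int` is a hypothetical helper that is not part of the original notebook.

```python
import re


def leading_int(text):
    """Return the integer at the start of `text`, or None if there is none."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


print(leading_int("176 points"))
print(leading_int("139\xa0comments"))
print(leading_int("discuss"))
```

Because the helper returns `None` instead of raising, it could replace both `try`/`except` blocks in the loop without changing what gets stored for rows that have no score or no comments.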

## Apply some filtering

Remove internal links (HN-hosted posts, whose `href` is a relative path instead of an external URL)

In [5]:
items = [x for x in items if x['url'].startswith('http')]
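
The filter above keeps every item whose URL starts with `http`. A slightly more defensive sketch, which also tolerates a missing URL, could look like this; `is_external` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_external(url):
    """Return True for absolute http(s) URLs, False for relative or missing ones."""
    if not url:
        return False
    return urlparse(url).scheme in ("http", "https")


# Hypothetical sample rows in the same shape as `items`
sample_items = [
    {"title": "External story", "url": "https://www.bbc.com/news/technology-44453382"},
    {"title": "Self post", "url": "item?id=123"},
    {"title": "Row without a link", "url": None},
]
external = [x for x in sample_items if is_external(x["url"])]
print([x["title"] for x in external])
```

In the notebook the same test would be applied as `items = [x for x in items if is_external(x['url'])]`.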

We use pandas to export the items to JSON, one object per row (the `records` orientation)

In [6]:
df = pd.DataFrame(items)

In [7]:
df.sample(5)

|     | title | url | points | comments |
|----:|-------|-----|-------:|---------:|
| 7 | RedHat Mandrel Makes Java Native | https://www.infoq.com/news/2020/07/mandrel-gra... | 176 | 139.0 |
| 165 | U.S. judge says LinkedIn cannot block startup ... | http://www.reuters.com/article/us-microsoft-li... | 779 | 288.0 |
| 127 | Spanish football league defends phone 'spying' | https://www.bbc.com/news/technology-44453382 | 128 | 51.0 |
| 101 | Startup idea checklist | https://www.defmacro.org/2019/03/26/startup-ch... | 1251 | 194.0 |
| 10 | I Just Hit $100k/year On GitHub Sponsors | https://calebporzio.com/i-just-hit-dollar-1000... | 1469 | 493.0 |


In [8]:
df.to_json('hn_favs.json', orient='records')
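
With the `records` orientation the file holds a JSON array with one object per row. As a rough sketch of how a later step might read it back without pandas, the snippet below loads such a file with the standard `json` module. The sample rows and the `load_records` helper are hypothetical and only illustrate the shape.

```python
import json
import os
import tempfile


def load_records(path):
    """Load a records-oriented JSON file: a list with one dict per row."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list) or not all(isinstance(row, dict) for row in data):
        raise ValueError("expected a JSON array of objects")
    return data


# Hypothetical sample rows, written in the same shape as the export
sample = [
    {"title": "Example story", "url": "https://example.com/a", "points": 10, "comments": 3.0},
    {"title": "Another story", "url": "https://example.com/b", "points": 5, "comments": None},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_favs.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh)
    rows = load_records(path)

print(len(rows), rows[0]["title"])
```

For the real export, `load_records('hn_favs.json')` would return the list of favorites.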

All right! The favorites are ready to be scraped. Continue with the next notebook.