# HN classification Demo

In [1]:
import os
import json
import requests
import pandas as pd

## Fetch Data

In [2]:
base_url = 'https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty'
# Parse the response as JSON instead of calling eval() on text received over the network.
top_story_ids = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json').json()
# One request per story ID, so this cell takes a while to run.
data = [requests.get(base_url.format(story_id)).json() for story_id in top_story_ids]

We'll start with naive classification: a hand-written list of words to exclude from URLs and titles. I didn't put much work into the lists, so the [results aren't great](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out). With more effort this method might still be a strong contender.

In [3]:
url_words = ['//github', '//wikipedia']
title_words = ['linux', 'hardware']

for d in data:
    # Keep link stories only; self posts carry a 'text' field.
    if d['type'] == 'story' and 'text' not in d:
        # Note: d['url'] raises KeyError for a story that has no link.
        if all(w not in d['url'] for w in url_words):
            if all(w not in d['title'] for w in title_words):
                print(d['title'] + '\n' + d['url'])

Planet Moons
https://www.go-astronomy.com/planets/planet-moons.htm
The German Tank Problem
https://www.eadan.net/blog/german-tank-problem/
Why factoring may be easier than you think
http://math.mit.edu/~cohn/Thoughts/factoring.html
Graphs and Geometry [pdf]
http://web.cs.elte.hu/~lovasz/bookxx/geomgraphbook/geombook2019.01.11.pdf
The Cost of JavaScript in 2019
https://v8.dev/blog/cost-of-javascript-2019
Open Letter to the Wikimedia Foundation Board of Trustees
https://en.wikipedia.org/wiki/Wikipedia:Arbitration_Committee/Noticeboard#Open_letter_to_the_WMF_Board
China pressured London police to arrest Tiananmen protester, says watchdog
https://www.theguardian.com/world/2019/jun/30/political-pressure-before-arrest-of-chinese-dissident-london
Vintage video game cartridges with built-in modems
https://writing.markchristian.org/2019/06/29/communicating-cartridges/
Live coding a vi for CP/M from scratch
http://cowlark.com/2019-06-28-cpm-vi/
The Energy Cost of Electric and Human-Powered Bicyc

KeyError: 'url'
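
The `KeyError` is raised by a story that has neither a `text` nor a `url` field. A minimal sketch of the same keyword filter that tolerates missing keys follows. The function name `passes_filter` and the sample items are made up for illustration, and lower-casing the title before matching is an added assumption, not something the cell above does.

```python
URL_WORDS = ['//github', '//wikipedia']
TITLE_WORDS = ['linux', 'hardware']

def passes_filter(item, url_words=URL_WORDS, title_words=TITLE_WORDS):
    """Return True for link stories whose URL and title avoid the excluded words."""
    if item.get('type') != 'story' or 'text' in item:
        return False
    url = item.get('url')
    if url is None:
        # No link to inspect: skip the item instead of raising KeyError.
        return False
    # Assumption: match title words case-insensitively.
    title = item.get('title', '').lower()
    return (all(w not in url for w in url_words)
            and all(w not in title for w in title_words))

# Hypothetical items shaped like the HN API responses.
sample = [
    {'type': 'story', 'title': 'Planet Moons',
     'url': 'https://www.go-astronomy.com/planets/planet-moons.htm'},
    {'type': 'story', 'title': 'A Linux story', 'url': 'https://example.com/a'},
    {'type': 'story', 'title': 'No link here'},
    {'type': 'story', 'title': 'Some repo', 'url': 'https://github.com/user/repo'},
]
kept = [s['title'] for s in sample if passes_filter(s)]
print(kept)
```

Only the first sample item survives: the second is dropped by its title, the third has no link, and the fourth matches `//github`.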

## Using a trained model

First, we need to train a model, and I don't feel like labelling a few thousand HN submissions `tech` / `not tech`, so let's use a famous dataset: [20 newsgroups](https://www.cs.umb.edu/~smimarog/textmining/datasets/).

In [4]:
TEXT_DATA_DIR = '20_newsgroup'
texts, label_text = [], []
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                with open(fpath, encoding='latin-1') as f:
                    t = f.read()
                    i = t.find('\n\n')  # skip the header block, which ends at the first blank line
                    if i > 0:
                        t = t[i:]
                    texts.append(t)
                label_text.append(name)
print(f'Found {len(texts)} texts.')

Found 19997 texts.
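
Each file starts with a block of headers, and the loop above cuts the text at the first blank line. Here is a small sketch of that step on a made-up message; the helper name `strip_header` and the sample text are illustrative only.

```python
def strip_header(raw):
    """Drop everything before the first blank line, mirroring the loop above."""
    i = raw.find('\n\n')
    # find() returns -1 when there is no blank line; the text is then kept whole.
    return raw[i:] if i > 0 else raw

message = "From: someone@example.com\nSubject: test\n\nBody line one.\nBody line two.\n"
body = strip_header(message)
print(repr(body))
```

The slice starts at the blank line itself, which is why every text in the resulting frame begins with `\n\n`.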


In [5]:
train = pd.DataFrame({'sent':texts, 'label': label_text})
train.head()

|   | sent | label |
|---|------|-------|
| 0 | `\n\nArchive-name: atheism/resources\nAlt-athei...` | alt.atheism |
| 1 | `\n\nArchive-name: atheism/introduction\nAlt-at...` | alt.atheism |
| 2 | `\n\nIn article <65974@mimsy.umd.edu>\nmangoe@c...` | alt.atheism |
| 3 | `\n\ndmn@kepler.unh.edu (...until kings become ...` | alt.atheism |
| 4 | `\n\nIn article <N4HY.93Apr5120934@harder.ccr-p...` | alt.atheism |


We have our labelled data. Now we add a binary `is_tech` column that is 1 for every post from a 'techy' newsgroup and 0 otherwise.

In [6]:
tech = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 
        'comp.windows.x', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space']
train['is_tech'] = 0
train.loc[train.label.isin(tech), 'is_tech'] = 1
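
As a sanity check on the labelling, here is the same rule in plain Python applied to the newsgroup names. The group list is typed out by hand and is assumed to match the directory names in the local copy of the dataset; `label_is_tech` is a hypothetical helper.

```python
TECH = {'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
        'comp.sys.mac.hardware', 'comp.windows.x', 'sci.crypt',
        'sci.electronics', 'sci.med', 'sci.space'}

# Assumed to match the sub-directories of TEXT_DATA_DIR.
GROUPS = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
          'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
          'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
          'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
          'sci.space', 'soc.religion.christian', 'talk.politics.guns',
          'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

def label_is_tech(group):
    """Return 1 for a group on the tech list, 0 otherwise."""
    return int(group in TECH)

labels = {g: label_is_tech(g) for g in GROUPS}
n_tech = sum(labels.values())
print(f'{n_tech} of {len(GROUPS)} groups are labelled tech')
```

If the groups are about the same size (19997 texts over 20 groups suggests roughly 1000 each), a bit under half of the posts end up in the positive class.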

Building the model:

In [7]:
import string
import en_core_web_sm
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [15]:
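punctuations = string.punctuation
nlp = en_core_web_sm.load()  # full English pipeline; loaded here but not used below
parser = English()           # bare English pipeline (no tagger or parser)

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    # Lower-case each lemma. "-PRON-" is the placeholder lemma spaCy 2.x assigns to
    # pronouns, so fall back to the lower-cased token text for those.
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation.
    mytokens = [ word for word in mytokens if word not in STOP_WORDS and word not in punctuations ]
    return mytokens

# Pipeline step that strips and lower-cases each document before vectorizing.
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

def clean_text(text):
    return text.strip().lower()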

X = train['sent']
ylabels = train['is_tech']
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
classifier = LogisticRegression(solver='liblinear')
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])
pipe.fit(X_train,y_train)
predicted = pipe.predict(X_test)

print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.9458333333333333
Logistic Regression Precision: 0.9319391634980989
Logistic Regression Recall: 0.9437812860993454
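
For reference, the three scores printed above are all ratios of confusion-matrix counts. Below is a minimal sketch in plain Python on made-up labels; the data and the function name `binary_scores` are illustrative and unrelated to the run above.

```python
def binary_scores(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted tech, how many are tech
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual tech, how many were found
    return accuracy, precision, recall

# Made-up labels: 4 tech and 4 non-tech documents.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
accuracy, precision, recall = binary_scores(y_true, y_pred)
print(accuracy, precision, recall)
```

Here the counts are 3 true positives, 2 true negatives, 2 false positives and 1 false negative, giving an accuracy of 5/8, a precision of 3/5 and a recall of 3/4.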


Wow, pretty good results on the test set. Now we run the model on our HN titles:

In [14]:
titles = [d['title'] for d in data]  # `titles` was never defined before this cell
df = pd.DataFrame({'text': titles, 'is_tech': pipe.predict(pd.Series(titles))})
df.head(20)

|   | text | is_tech |
|---|------|---------|
| 0 | Planet Moons | 1 |
| 1 | The German Tank Problem | 0 |
| 2 | Why factoring may be easier than you think | 1 |
| 3 | Graphs and Geometry [pdf] | 1 |
| 4 | The Cost of JavaScript in 2019 | 0 |
| 5 | Show HN: Web pages stored entirely in the URL | 0 |
| 6 | Open Letter to the Wikimedia Foundation Board ... | 0 |
| 7 | China pressured London police to arrest Tianan... | 0 |
| 8 | Vintage video game cartridges with built-in mo... | 0 |
| 9 | Live coding a vi for CP/M from scratch | 1 |


Hmm, not the best. Needs some work!