# Natural Language Processing with Bag-of-words Model: Predicting the Popularity of Hacker News Articles

Introduction: [Hacker News](https://news.ycombinator.com/) is a Reddit-like platform where users submit articles and upvote the ones they find interesting. The articles with the most upvotes make it to the front page.

Objective: Train a linear regression model that predicts the number of upvotes a headline will receive.

Method: The [dataset](https://github.com/arnauddri/hn) consists of submissions from 2006 to 2016. The headlines are converted into a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) dataframe (the diagram below shows how two sentences convert to a bag of words). A linear regression model is then fitted to the training set, and several thresholds for admitting tokens into the model are tried to find the one with the best predictive power.

Results: The predictions yield a root mean square error (RMSE) of 43.0, slightly lower than the standard deviation of the upvotes, 46.7.

Conclusion: The model offers only a small improvement over always predicting the mean.

![](https://trello-attachments.s3.amazonaws.com/598ccf452a2ec3ed73385ebf/59ac7d1b8d9ce427639802ca/e00faf8206644508048665565f3dc75e/Capture_d%E2%80%99e%CC%81cran_2017-09-03_a%CC%80_18.00.14.png)
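
As a minimal sketch of the conversion shown in the diagram, the snippet below builds a bag of words for two made-up sentences using only the standard library. The sentences and the helper name `bag_of_words` are illustrative assumptions and are not part of the project code that follows.

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, rows), with one count vector per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    # Vocabulary: every distinct token, sorted for a stable column order.
    vocabulary = sorted({tok for sent in tokenized for tok in sent})
    rows = []
    for sent in tokenized:
        counts = Counter(sent)
        # A token absent from this sentence gets a count of 0.
        rows.append([counts[tok] for tok in vocabulary])
    return vocabulary, rows

# Hypothetical example sentences.
vocabulary, rows = bag_of_words(["the cat sat", "the dog sat on the cat"])
print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
for row in rows:
    print(row)     # [1, 0, 0, 1, 1] then [1, 1, 1, 1, 2]
```

Each row has one column per vocabulary token, which is the same layout the `counts` dataframe uses further down, only at a much larger scale.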

In [407]:
import pandas
import numpy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Data Cleaning

In [416]:
submissions = pandas.read_csv('sel_hn_stories.csv', delimiter=',')
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()
submissions[:30]

Unnamed: 0,submission_time,upvotes,url,headline
0,2010-02-17T16:57:59Z,1,blog.jonasbandi.net,Software: Sadly we did adopt from the construc...
1,2014-02-04T02:36:30Z,1,blogs.wsj.com,Google’s Stock Split Means More Control for L...
2,2011-10-26T07:11:29Z,1,threatpost.com,SSL DOS attack tool released exploiting negoti...
3,2011-04-03T15:43:44Z,67,algorithm.com.au,Immutability and Blocks Lambdas and Closures
4,2013-01-13T16:49:20Z,1,winmacsofts.com,Comment optimiser la vitesse de Wordpress?
5,2013-09-04T12:10:52Z,1,theincidentaleconomist.com,ilk is not as good for you as you think
6,2012-03-09T20:25:42Z,1,worldometers.info,Worldometers - Real time world statistics
7,2010-04-22T13:23:10Z,26,docs.com,icrosoft strikes back: introduces docs for fac...
8,2012-05-06T16:08:46Z,2,blog.hackplanet.in,Net HTTP status codes
9,2014-12-23T00:55:31.000Z,1,curt-rice.com,Anecdata or how McKinsey’s story became Sheryl...


## Lowercasing, Eliminating Punctuation, and Removing Duplicates

In [417]:
# Split each headline into tokens on single spaces.
tokenized_headlines = [headline.split(' ') for headline in submissions['headline']]
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]

def clean(token):
    # Lowercase the token and strip the punctuation characters listed above.
    token = token.lower()
    for punc in punctuation:
        token = token.replace(punc, '')
    return token

clean_tokenized = [[clean(token) for token in headline] for headline in tokenized_headlines]
clean_tokenized[:30]

[['software',
  'sadly',
  'we',
  'did',
  'adopt',
  'from',
  'the',
  'construction',
  'analogy'],
 ['',
  'googles',
  'stock',
  'split',
  'means',
  'more',
  'control',
  'for',
  'larry',
  'and',
  'sergey',
  ''],
 ['ssl',
  'dos',
  'attack',
  'tool',
  'released',
  'exploiting',
  'negotiation',
  'overhead'],
 ['immutability', 'and', 'blocks', 'lambdas', 'and', 'closures'],
 ['comment', 'optimiser', 'la', 'vitesse', 'de', 'wordpress'],
 ['ilk', 'is', 'not', 'as', 'good', 'for', 'you', 'as', 'you', 'think'],
 ['worldometers', '', 'real', 'time', 'world', 'statistics'],
 ['icrosoft', 'strikes', 'back', 'introduces', 'docs', 'for', 'facebook'],
 ['net', 'http', 'status', 'codes', ''],
 ['anecdata',
  'or',
  'how',
  'mckinseys',
  'story',
  'became',
  'sheryl',
  'sandbergs',
  'fact'],
 ['immigration', 'overhaul', 'passes', 'in', 'senate'],
 ['what', 'matters', 'most', 'at', 'adtech', 'sf', '2014'],
 ['amazon',
  'silk',
  'revisited',
  'is',
  'the',
  'split',
  ...]]

In [419]:
unique_tokens = list(set([token for headline in clean_tokenized for token in headline]))
unique_tokens[:30]

['',
 '177147',
 'works',
 'lea',
 'protocols',
 'venus',
 'battle',
 'right',
 'sec',
 'potsmoking',
 'best',
 'trackify',
 'balaji',
 'paging',
 '[pdf]',
 'buys',
 'gittip',
 'analyzer',
 'tropical',
 '1944',
 'ycoin',
 'extend',
 'dropbox',
 'hipster',
 'bandwagon',
 'intervene',
 'reactive',
 'patentable',
 'important',
 'ie9s']

## Creating the Bag

In [420]:
counts = pandas.DataFrame(0, index=numpy.arange(len(clean_tokenized)), columns=unique_tokens)
counts[:30]

Unnamed: 0,Unnamed: 1,177147,works,lea,protocols,venus,battle,right,sec,potsmoking,...,electronics,energy,son,cobbler,peeks,ndc,rubys,launches,spector,dac
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Filling the Bag

In [421]:
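# Count each token's occurrences per headline. Every token is in unique_tokens
# by construction, so no membership check is needed. Using .loc avoids chained
# assignment, which can silently fail to update the dataframe.
for i, headline in enumerate(clean_tokenized):
    for token in headline:
        counts.loc[i, token] += 1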
counts[:30]

Unnamed: 0,Unnamed: 1,177147,works,lea,protocols,venus,battle,right,sec,potsmoking,...,electronics,energy,son,cobbler,peeks,ndc,rubys,launches,spector,dac
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Preventing Overfitting

In [422]:
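# Keep only tokens that occur at least 46 times across all headlines;
# rare tokens give the regression many features with very little evidence.
word_counts = counts.sum(axis=0)
counts = counts.loc[:, word_counts >= 46]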
counts[:30]

Unnamed: 0,Unnamed: 1,how,app,google,new,to,about,with,a,–,...,and,the,for,you,that,it,on,in,what,show
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,1,2,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
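
The introduction mentions that several admission thresholds were tried, but the notebook only shows the final value of 46. The sketch below illustrates how such a sweep might look. The function name `sweep_thresholds`, the candidate thresholds, and the synthetic data are all assumptions made for illustration; this is not the author's actual experiment.

```python
import numpy
import pandas
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def sweep_thresholds(counts, upvotes, thresholds):
    """Return {threshold: test RMSE} for each minimum token frequency."""
    word_counts = counts.sum(axis=0)
    results = {}
    for threshold in thresholds:
        kept = counts.loc[:, word_counts >= threshold]
        if kept.shape[1] == 0:
            continue  # no token is frequent enough; skip this threshold
        X_train, X_test, y_train, y_test = train_test_split(
            kept, upvotes, test_size=0.2, random_state=1)
        model = LinearRegression().fit(X_train, y_train)
        errors = model.predict(X_test) - y_test
        results[threshold] = float(numpy.sqrt(numpy.mean(errors ** 2)))
    return results

# Synthetic stand-in data (NOT the Hacker News dataset).
rng = numpy.random.default_rng(0)
toy_counts = pandas.DataFrame(rng.integers(0, 2, size=(200, 5)),
                              columns=["how", "app", "google", "new", "show"])
toy_upvotes = pandas.Series(rng.integers(1, 50, size=200))
results = sweep_thresholds(toy_counts, toy_upvotes, thresholds=[1, 50, 1000])
print(results)
```

One caveat with this design: choosing the threshold by its score on the test set makes the reported test error optimistic, so a separate validation split would give a fairer estimate.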


## Linear Regression

In [423]:
features_train, features_test, y_train, y_test = train_test_split(counts, submissions['upvotes'], test_size=0.2, random_state=1)

clf = LinearRegression()
clf.fit(features_train, y_train)

predictions = clf.predict(features_test)
predictions[:30]

array([ 11.55360384,   9.50115009,  12.55193456,  11.55360384,
         9.84006337,   5.03063474,  11.55360384,   9.95519766,
        13.67427861,  17.71537611,  12.76190095,  17.77825487,
        11.41132265,  11.41132265,   9.35886889,   8.98659531,
        11.41132265,   4.34643857,   9.04779055,  11.93880304,
         8.63386769,  11.55360384,  11.6610465 ,  13.3705296 ,
        11.55360384,  11.55360384,  11.41132265,   8.76321457,
         9.90710356,  11.55360384])

In [415]:
# Root mean square error: square root of the mean squared prediction error.
rmse = (sum((predictions - y_test) ** 2) / len(predictions)) ** .5
rmse

43.009368052021514
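
The introduction compares this RMSE with the standard deviation of the upvotes. The reason is that a model which always predicts the mean has an RMSE equal to the (population) standard deviation of the targets, so the standard deviation acts as a baseline. The sketch below shows that baseline on made-up numbers. The helper names `rmse_of` and `mean_baseline_rmse` and the toy values are assumptions, not part of the original notebook.

```python
import math

def rmse_of(predictions, targets):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mean_baseline_rmse(train_targets, test_targets):
    """RMSE of a baseline that always predicts the training-set mean."""
    mean = sum(train_targets) / len(train_targets)
    return rmse_of([mean] * len(test_targets), test_targets)

# Toy upvote counts, made up for illustration only.
toy_train = [1, 1, 2, 67, 1, 26]
toy_test = [1, 3, 40, 2]
baseline = mean_baseline_rmse(toy_train, toy_test)
print(round(baseline, 2))
```

In the notebook, the equivalent comparison would pass `y_train` and `y_test` to `mean_baseline_rmse`; a useful model should score below that value.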