## **This data set contains Hacker News posts from the 12 months up to September 26, 2016.**

It includes the following columns:

    title: title of the post (self explanatory)

    url: the url of the item being linked to

    num_points: the number of upvotes the post received

    num_comments: the number of comments the post received

    author: the name of the account that made the post

    created_at: the date and time the post was made (the time zone is Eastern Time in the US)
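
The created_at values are stored as text such as `9/26/2016 3:26`. Below is a minimal sketch of turning them into timezone-aware timestamps. It assumes the format is month/day/year hour:minute and that `US/Eastern` is the right zone name for "Eastern Time in the US". The `demo` frame is a made-up stand-in, not the real data.

```python
import pandas as pd

# Hypothetical two-row stand-in; the timestamp strings are taken from the data set.
demo = pd.DataFrame({"created_at": ["9/26/2016 3:26", "2/17/2016 8:38"]})

# Assumption: month/day/year hour:minute, without zero padding.
parsed = pd.to_datetime(demo["created_at"], format="%m/%d/%Y %H:%M")

# Attach the Eastern time zone. Clock times that are ambiguous or do not exist
# around daylight-saving changes become NaT instead of raising an error.
parsed = parsed.dt.tz_localize("US/Eastern", ambiguous="NaT", nonexistent="NaT")

print(parsed.dt.hour.tolist())
```

Once the column is a real datetime, features such as the hour or the weekday of a post are available through the `.dt` accessor.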

One fun project suggestion is a model to predict the number of votes a post will attract.

The scraper is written, so I can keep this up-to-date and add more historical data. I can also scrape the comments. Just make the request in this dataset's forum.

In [1]:
import numpy as np
import pandas as pd

In [2]:
hack = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')

In [3]:
hack.head(2)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24


In [4]:
hack.sort_values('num_points',ascending = False).head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
170017,11116274,A Message to Our Customers,http://www.apple.com/customer-letter/,5771,967,epaga,2/17/2016 8:38
69169,11966167,UK votes to leave EU,http://www.bbc.co.uk/news/uk-politics-36615028,3125,2531,dmmalam,6/24/2016 3:48
9263,12494998,Pardon Snowden,https://www.pardonsnowden.org/,2553,781,erlend_sh,9/14/2016 8:31
57128,12073675,Tell HN: New features and a moderator,,2381,451,dang,7/11/2016 19:34
136284,11390545,Ubuntu on Windows,http://blog.dustinkirkland.com/2016/03/ubuntu-...,2049,513,bpierre,3/30/2016 16:35


In [5]:
hack.shape

(293119, 7)

In [6]:
# preprocessing
hack.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293119 entries, 0 to 293118
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            293119 non-null  int64 
 1   title         293119 non-null  object
 2   url           279256 non-null  object
 3   num_points    293119 non-null  int64 
 4   num_comments  293119 non-null  int64 
 5   author        293119 non-null  object
 6   created_at    293119 non-null  object
dtypes: int64(3), object(4)
memory usage: 15.7+ MB


In [7]:
from sklearn.model_selection import train_test_split


In [8]:
hack.fillna('missing', inplace = True)
hack.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14


In [9]:
# check null values

In [10]:
hack.isna().sum()

id              0
title           0
url             0
num_points      0
num_comments    0
author          0
created_at      0
dtype: int64

# Import the needed libraries

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [12]:
import preprocess_kgptalkie as ps  # import the cleaning package
import re # regular expression

def get_clean(x):
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.cont_exp(x)
    x = ps.remove_emails(x)
    x = ps.remove_urls(x)
    x = ps.remove_html_tags(x)
    x = ps.remove_rt(x)
    x = ps.remove_accented_chars(x)
    x = ps.remove_special_chars(x)
    x = re.sub("(.)\\1{2,}", "\\1", x)
    return x
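
To see what the plain-Python steps of `get_clean` do on their own, here is a small sketch that leaves out every `ps.*` call. The name `basic_clean` is made up for this illustration, so the results are not what the full function returns.

```python
import re

def basic_clean(x):
    """Only the plain-Python steps of get_clean; the ps.* calls are left out."""
    # lowercase, drop backslashes, turn underscores into spaces
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    # collapse any character repeated three or more times in a row down to one
    x = re.sub(r"(.)\1{2,}", r"\1", x)
    return x

print(basic_clean("pavel_lishin"))    # pavel lishin
print(basic_clean("Sooooo coool"))    # so col
print(basic_clean("good"))            # good
print(basic_clean("www.sqlite.org"))  # w.sqlite.org
```

Note the last line: the repeated-character rule also shortens `www` to `w`, which matters if the same cleaning is applied to domain names.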

# Preprocessing -> cleaning up the title, url and author features

In [13]:
# work on a copy so later column assignments do not touch (or warn about) hack
X = hack.loc[:, ['title', 'url', 'author']].copy()
y = hack.num_points

In [14]:
# raw string so that \- and \. reach the regex engine unchanged
X.url = X.url.str.extract(r"^http[s]*://([0-9a-z\-\.]*)", expand=False)
X.url

0            www.regulations.gov
1                 www.sqlite.org
2                     medium.com
3                   cacm.acm.org
4                 www.talend.com
                   ...          
293114                       NaN
293115    people.cs.uchicago.edu
293116        dangerousminds.net
293117              www.zend.com
293118     newsroom.toyota.co.jp
Name: url, Length: 293119, dtype: object
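
The same pattern can be tried on single strings with the standard `re` module, which makes its edge cases easier to see. This is a sketch only; `extract_domain` is a hypothetical helper, not part of the notebook.

```python
import re

# Same pattern as the notebook, written as a raw string.
DOMAIN_RE = re.compile(r"^http[s]*://([0-9a-z\-\.]*)")

def extract_domain(url):
    """Return the host part of an http(s) URL, or None when the pattern does not match."""
    match = DOMAIN_RE.match(str(url))
    return match.group(1) if match else None

samples = [
    "https://www.sqlite.org/sqlar/doc/trunk/README.md",
    "http://www.apple.com/customer-letter/",
    "missing",             # placeholder written by fillna('missing')
    "http://Example.com",  # made-up URL with an uppercase letter in the host
]
domains = [extract_domain(u) for u in samples]
print(domains)
```

Two things show up here. Rows whose url is the `'missing'` placeholder do not match, which is why NaN appears in the output above. And because the character class only lists lowercase letters, a host that starts with an uppercase letter yields an empty string; lowercasing the url first would avoid that.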

In [15]:
X.title = X.title.apply(get_clean)
X.url = X.url.apply(get_clean)
X.author = X.author.apply(get_clean)



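# Training the model

# transforming the data to a sparse matrix
# (assumption: the three cleaned text columns are joined into one string per post,
# because TfidfVectorizer expects an iterable of strings, not a DataFrame)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(X.title + ' ' + X.url + ' ' + X.author)

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)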
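
`TfidfVectorizer` works on an iterable of strings. Passing a whole DataFrame would, as far as I can tell, make it iterate over the column names instead of the rows. The sketch below shows one possible way around that on a toy frame: join the text columns into one string per post, vectorize, then split. Joining the columns is my assumption, not necessarily the intended design, and the toy rows reuse uncleaned values from the outputs above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for X and y (five rows only, for illustration).
X_demo = pd.DataFrame({
    "title": ["sqlar the sqlite archiver", "algorithmic music", "ubuntu on windows",
              "pardon snowden", "uk votes to leave eu"],
    "url": ["www.sqlite.org", "cacm.acm.org", "blog.dustinkirkland.com",
            "www.pardonsnowden.org", "www.bbc.co.uk"],
    "author": ["blacksqr", "poindontcare", "bpierre", "erlend_sh", "dmmalam"],
})
y_demo = pd.Series([1, 1, 2049, 2553, 3125], name="num_points")

# One string per post, so the vectorizer sees rows rather than column names.
text = X_demo["title"] + " " + X_demo["url"] + " " + X_demo["author"]

tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(text)  # scipy sparse matrix, one row per post

X_train, X_test, y_train, y_test = train_test_split(
    X_sparse, y_demo, test_size=0.2, random_state=42
)
print(X_sparse.shape[0], X_train.shape[0], X_test.shape[0])
```

An alternative would be a separate vectorizer per column, so that title words and author names do not share one vocabulary; the joined version is just the shortest thing that runs.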