## **This data set contains Hacker News posts from the last 12 months (up to September 26, 2016).**

It includes the following columns:

    title: title of the post (self-explanatory)

    url: the URL of the item being linked to

    num_points: the number of upvotes the post received

    num_comments: the number of comments the post received

    author: the name of the account that made the post

    created_at: the date and time the post was made (the time zone is Eastern Time in the US)
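
Since `created_at` is stored as plain text in Eastern Time, it usually needs parsing before any time-based analysis. The following is a minimal sketch, assuming the `month/day/year hour:minute` layout seen in the sample rows and assuming `America/New_York` as the zone name (the description only says "Eastern Time"). The small `sample` frame is illustrative, not the real data set.

```python
import pandas as pd

# Two timestamps copied from the sample rows; the frame itself is illustrative.
sample = pd.DataFrame({"created_at": ["9/26/2016 3:26", "2/17/2016 8:38"]})

# Parse the text, then attach the Eastern time zone (assumed zone name).
parsed = pd.to_datetime(sample["created_at"], format="%m/%d/%Y %H:%M")
eastern = parsed.dt.tz_localize("America/New_York")

# Convert to UTC so posts can be compared independently of daylight saving.
utc = eastern.dt.tz_convert("UTC")
print(utc)
```

Converting to UTC is optional, but it avoids the one-hour shift between summer and winter timestamps when computing time differences.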

One fun project suggestion is a model to predict the number of votes a post will attract.

The scraper is written, so I can keep this up to date and add more historical data. I can also scrape the comments. Just make the request in this dataset's forum.

In [1]:
import numpy as np
import pandas as pd

In [2]:
hack = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')

In [3]:
hack.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14


In [4]:
hack.sort_values('num_points',ascending = False).head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
170017,11116274,A Message to Our Customers,http://www.apple.com/customer-letter/,5771,967,epaga,2/17/2016 8:38
69169,11966167,UK votes to leave EU,http://www.bbc.co.uk/news/uk-politics-36615028,3125,2531,dmmalam,6/24/2016 3:48
9263,12494998,Pardon Snowden,https://www.pardonsnowden.org/,2553,781,erlend_sh,9/14/2016 8:31
57128,12073675,Tell HN: New features and a moderator,,2381,451,dang,7/11/2016 19:34
136284,11390545,Ubuntu on Windows,http://blog.dustinkirkland.com/2016/03/ubuntu-...,2049,513,bpierre,3/30/2016 16:35


In [5]:
hack.shape

(293119, 7)

In [6]:
hack[hack.num_points == 1]

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14
...,...,...,...,...,...,...,...
293095,10177013,HTTP/2 demo,https://http2.akamai.com/demo,1,0,tomkwok,9/6/2015 7:28
293097,10177010,Canadian photographer publishes art book of So...,http://www.rt.com/news/314528-soviet-bus-stops...,1,0,edward,9/6/2015 7:24
293101,10176981,Why exactly did Bitcoin take off?,https://bitcoinrevolt.wordpress.com/2015/09/06...,1,0,gizi,9/6/2015 6:55
293107,10176960,Hands-On with Googles OnHub Router,http://techcrunch.com/2015/09/05/hands-on-with...,1,0,confiscate,9/6/2015 6:41


In [7]:
# preprocessing
hack.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293119 entries, 0 to 293118
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            293119 non-null  int64 
 1   title         293119 non-null  object
 2   url           279256 non-null  object
 3   num_points    293119 non-null  int64 
 4   num_comments  293119 non-null  int64 
 5   author        293119 non-null  object
 6   created_at    293119 non-null  object
dtypes: int64(3), object(4)
memory usage: 15.7+ MB


In [8]:
from sklearn.model_selection import train_test_split


In [9]:
hack.fillna('missing', inplace = True)
hack.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14


In [10]:
hack.isna().sum()

id              0
title           0
url             0
num_points      0
num_comments    0
author          0
created_at      0
dtype: int64

In [11]:
X = hack.drop(['id','num_points','created_at','num_comments'], axis = 1)
X.head()

Unnamed: 0,title,url,author
0,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,altstar
1,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,blacksqr
2,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,pavel_lishin
3,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,poindontcare
4,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,markgainor1


In [12]:
y = hack['num_points']
X.shape , y.shape

((293119, 3), (293119,))

# **Pre-processing**

In [13]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define the categorical features to encode
categorical_features = ['title', 'author', 'url']
one_hot = OneHotEncoder()
# ColumnTransformer takes a list of (name, transformer, columns) tuples
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder='passthrough')

transformed_X = transformer.fit_transform(X)
transformed_X    # X after one-hot encoding

<293119x568955 sparse matrix of type '<class 'numpy.float64'>'
	with 879357 stored elements in Compressed Sparse Row format>
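
The shape above follows from how one-hot encoding works: every distinct value in every encoded column gets its own output column, and each row stores exactly one 1 per encoded column (293119 rows × 3 columns = 879357 stored elements). Here is a toy sketch with made-up rows, not the real data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows standing in for the title/author/url features.
toy = pd.DataFrame({
    "title": ["post a", "post b", "post c", "post d"],
    "author": ["ann", "bob", "ann", "cat"],
    "url": ["x.com", "y.com", "missing", "missing"],
})

encoded = OneHotEncoder().fit_transform(toy)

# Columns = distinct values summed over the three features: 4 + 3 + 3.
print(encoded.shape)  # (4, 10)
# Stored elements = rows * features, because each row has one 1 per feature.
print(encoded.nnz)    # 12
```

Because nearly every title in the real data is unique, most of the resulting columns are likely to be set for a single row only, so they probably carry little information for unseen posts.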

In [14]:
pd.DataFrame(transformed_X)[:3]

Unnamed: 0,0
0,"(0, 266555)\t1.0\n (0, 277275)\t1.0\n (0, ..."
1,"(0, 180571)\t1.0\n (0, 280500)\t1.0\n (0, ..."
2,"(0, 252878)\t1.0\n (0, 306639)\t1.0\n (0, ..."


In [15]:
# One-hot encode the target as well. ColumnTransformer selects columns by name,
# so it needs a 2-D DataFrame rather than the 1-D Series `y`.
categorical_features = ['num_points']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder='passthrough')

transformed_y = transformer.fit_transform(y.to_frame())
transformed_y    # y after one-hot encoding (not used by the model below)

In [16]:
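#from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDClassifier

# seed for reproducibility
np.random.seed(42)

# split into train and test sets
# (unpacking as X_train, y_train, X_test, y_test mis-assigned the splits and raised
#  "ValueError: y should be a 1d array, got an array of shape () instead.")
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

# instantiate model and fit; note that SGDClassifier treats each distinct
# num_points value as a separate class
model = SGDClassifier()
model.fit(X_train, y_train)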
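
`train_test_split` returns its outputs grouped by input array: both splits of the first array, then both splits of the second. Here is a small sketch on synthetic data (the `_demo` arrays are placeholders, not the Hacker News features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 10 samples with 2 features each, and 10 targets.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Order of the returned values: X train, X test, y train, y test.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr_demo.shape, X_te_demo.shape)  # (8, 2) (2, 2)
print(y_tr_demo.shape, y_te_demo.shape)  # (8,) (2,)
```

If the second and third names are swapped, the model receives a feature matrix where it expects the 1-D target, which is consistent with the `y should be a 1d array` error.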