__Data:__ "hn_stories_select.csv" contains a selection of user submissions to Hacker News (http://news.ycombinator.com/): 3000 randomly sampled entries from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape these data, which are available from one of his GitHub repositories at https://github.com/arnauddri/hn.

Columns in the data set:
* submission_time - When the article was submitted
* upvotes - The number of upvotes the article received
* url - The base URL of the article
* headline - The article's headline

In [1]:
# Overview of the Data
import pandas as pd
submissions = pd.read_csv("hn_stories_select.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()

In [2]:
# Tokenizing the Headlines
tokenized_headlines = []

# Split each headline on single spaces to get a list of raw tokens
for headline in submissions["headline"]:
    tokenized_headlines.append(headline.split(" "))

In [3]:
# Preprocessing Tokens to Increase Accuracy
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []

for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)
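
The tokenizing and cleaning steps above can be tried on a single headline. The following is a minimal sketch, assuming a made-up headline and a hypothetical helper `clean_headline` (neither is part of the original notebook). It uses `str.translate` instead of the nested `replace` loop, which should give the same tokens for the punctuation list above.

```python
# Same punctuation characters as in the preprocessing cell
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]

# Translation table that deletes every punctuation character
strip_table = str.maketrans("", "", "".join(punctuation))


def clean_headline(headline):
    """Hypothetical helper: split on spaces, lowercase, strip punctuation."""
    return [token.lower().translate(strip_table) for token in headline.split(" ")]


# Made-up headline, not taken from the data set
sample_tokens = clean_headline("Show HN: A (very) simple Lisp-like language")
print(sample_tokens)
```

Note that a token made up only of punctuation (for example a lone "-") becomes an empty string after cleaning, so it may be worth filtering out empty tokens before counting.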

In [4]:
# Assembling a Matrix of Unique Words
import numpy as np

#flat_list = [item for sublist in clean_tokenized for item in sublist]

# Print top words in cleaned-up input data
#import collections
#ncount = 500
#print ( collections.Counter(flat_list).most_common(ncount) )

# Tokens seen once so far, and tokens seen at least twice
unique_tokens_firstencounter = []
unique_tokens = []

# Avoid naming the loop variable "list", which shadows the built-in type
for headline_tokens in clean_tokenized:
    for token in headline_tokens:
        if token in unique_tokens:
            continue
        elif token in unique_tokens_firstencounter:
            # Second encounter: the token occurs more than once in the corpus
            unique_tokens.append(token)
        else:
            unique_tokens_firstencounter.append(token)

# One row per headline, one column per repeated token, all counts start at zero
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)
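
The "keep only tokens that occur more than once" rule can also be expressed with `collections.Counter`. This is a sketch on made-up data (`clean_tokenized_demo` is a stand-in, not the real `clean_tokenized`). It selects the same set of tokens as the loop above, but lists them in order of first appearance rather than second appearance.

```python
from collections import Counter

# Made-up cleaned headlines standing in for clean_tokenized
clean_tokenized_demo = [
    ["show", "hn", "a", "new", "editor"],
    ["a", "new", "language"],
    ["ask", "hn", "best", "editor"],
]

# Count how often each token occurs across all headlines
token_totals = Counter(
    token for tokens in clean_tokenized_demo for token in tokens
)

# Keep tokens that occur at least twice, in order of first appearance
repeated_tokens = [token for token, n in token_totals.items() if n >= 2]
print(repeated_tokens)
```

Because `Counter` lookups are hash-based, this avoids the repeated list membership tests, which can become slow on the full data set.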

In [5]:
# Counting Token Occurrences

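# clean_tokenized and counts were built in the previous cells

for row, headline_tokens in enumerate(clean_tokenized):
    for token in headline_tokens:
        if token in unique_tokens:
            # Use a single .loc call: chained indexing such as
            # counts.iloc[row][token] = ... may write to a temporary copy
            counts.loc[row, token] += 1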
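
To make the structure of the count matrix concrete, here is a small pure-Python sketch of the same bag-of-words counting. The vocabulary and headlines are made up, and the names `vocabulary`, `column_index`, and `count_rows` are illustrative assumptions rather than names from the notebook.

```python
from collections import Counter

# Hypothetical vocabulary (the columns of the count matrix)
vocabulary = ["hn", "a", "new", "editor"]
column_index = {token: j for j, token in enumerate(vocabulary)}

# Made-up cleaned headlines (the rows of the count matrix)
headlines = [
    ["show", "hn", "a", "new", "editor"],
    ["a", "new", "a", "language"],
    ["ask", "hn", "best", "editor"],
]

count_rows = []
for tokens in headlines:
    row = [0] * len(vocabulary)
    # Count each token once per headline, then fill in the matching column
    for token, n in Counter(tokens).items():
        if token in column_index:
            row[column_index[token]] = n
    count_rows.append(row)

print(count_rows)
```

A list of rows like this can be passed to `pd.DataFrame(count_rows, columns=vocabulary)` to obtain a frame with the same layout as `counts`.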

In [6]:
# Removing Columns to Increase Accuracy

word_counts = counts.sum(axis=0)

# Keep only words that occur at least 5 and at most 100 times overall
word_counts_bool = (word_counts >= 5) & (word_counts <= 100)

counts = counts.loc[:, word_counts_bool]
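
The effect of the two thresholds can be seen on a toy example. The totals below are invented for illustration and do not come from the data set. Very rare words give the model too few examples to learn from, while very common words tend to carry little information about any particular headline.

```python
# Invented per-word totals, standing in for word_counts
column_totals = {"hn": 240, "python": 37, "the": 612, "lisp": 5, "zig": 2}

# Same thresholds as in the cell above
min_count, max_count = 5, 100

# Keep words whose total count lies inside the inclusive range
kept_words = [
    word for word, total in column_totals.items()
    if min_count <= total <= max_count
]
print(kept_words)
```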

In [7]:
# Split data set into train and test samples
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)



In [8]:
# Making Predictions With fit()
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

clf.fit( X_train, y_train )
predictions = clf.predict( X_test )



We use mean squared error (MSE, https://en.wikipedia.org/wiki/Mean_squared_error) to quantify the prediction error.
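
For reference, the standard definition is given below. The notation is an assumption made here, not taken from the notebook: $n$ is the number of test headlines, $\hat{y}_i$ is the predicted upvote count for headline $i$, and $y_i$ is the actual upvote count.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
```

Because the errors are squared, a few headlines with very large upvote counts can dominate the total.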

In [9]:
# Calculating Prediction Error

mse = np.sum( (predictions-y_test)**2 )
mse /= len( predictions )

print ( "Mean Squared Error: ", mse )

Mean Squared Error:  2652.60825125
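
MSE is measured in squared upvotes, so its square root is easier to interpret: the value above corresponds to a typical error of roughly 51.5 upvotes. A number like this is most useful when compared against a simple baseline. The sketch below, on made-up numbers, compares a model's MSE with that of always predicting the mean of the training targets. The helper `mean_squared_error` and all `*_demo` values are illustrative assumptions.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and targets."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Made-up upvote counts and model predictions
y_train_demo = [1, 3, 2, 10]
y_test_demo = [2, 4, 6]
model_predictions_demo = [3.0, 3.0, 5.0]

# Baseline: always predict the mean of the training targets
baseline_value = sum(y_train_demo) / len(y_train_demo)
baseline_predictions = [baseline_value] * len(y_test_demo)

baseline_mse = mean_squared_error(baseline_predictions, y_test_demo)
model_mse = mean_squared_error(model_predictions_demo, y_test_demo)

print("Baseline MSE:", baseline_mse)
print("Model MSE:", model_mse)
```

If the model's MSE is not clearly below the baseline's, the word counts are adding little predictive value.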


We can take several steps to reduce the error and explore natural language processing further. Here are some ideas for your next steps:

* Use the entire data set. While we used a sample in this mission, you could download the entire data set from the GitHub repository linked above. This should reduce the error substantially. A bag-of-words model has a very large number of features (one per word), and most words are rare. With more data, the same words are more likely to appear in both the training and test sets, which helps the model make better predictions.
* Add "meta" features like headline length and average word length.
* Use a random forest, or another more powerful machine learning technique.
* Explore different thresholds for removing extraneous columns.
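
As a starting point for the "meta" features idea, here is a minimal sketch that computes headline length and average word length. The function `meta_features` and the example headline are hypothetical, not part of the notebook.

```python
def meta_features(headline):
    """Hypothetical helper: simple numeric features describing a headline."""
    words = headline.split()
    n_words = len(words)
    # Guard against empty headlines to avoid dividing by zero
    avg_word_length = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {
        "headline_length": len(headline),  # number of characters
        "word_count": n_words,
        "avg_word_length": avg_word_length,
    }


# Made-up headline, not taken from the data set
features = meta_features("Ask HN: What are you working on?")
print(features)
```

Features like these could be added as extra columns next to the word counts before the train/test split.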
