# Using Natural Language Processing for Sentiment Analysis

## Background

Our data set consists of submissions users made to [Hacker News](https://news.ycombinator.com/) from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape the data, which you can find in one of his [GitHub repositories](https://github.com/arnauddri/hn). We've sampled 3000 rows from the data randomly, and removed all of the extraneous columns. Our data only has four columns:

 - __submission_time__ - When the article was submitted
 - __upvotes__ - The number of upvotes the article received
 - __url__ - The base URL of the article
 - __headline__ - The article's headline

Our aim is to predict the number of upvotes an article received, based on its headline. Because upvotes are an indicator of popularity, this should also show which types of articles tend to be the most popular.

In [40]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

## Get Data

In [52]:
raw_data = pd.read_csv("~/data/hacker_news_stories/sel_hn_stories.csv",header=None)

In [53]:
raw_data.shape

(3000, 4)

## Clean Data

In [57]:
submissions = raw_data.copy()
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()

In [58]:
submissions.head()

| | submission_time | upvotes | url | headline |
|---|---|---|---|---|
| 0 | 2014-06-24T05:50:40.000Z | 1 | flux7.com | 8 Ways to Use Docker in the Real World |
| 1 | 2010-02-17T16:57:59Z | 1 | blog.jonasbandi.net | Software: Sadly we did adopt from the construc... |
| 2 | 2014-02-04T02:36:30Z | 1 | blogs.wsj.com | Google’s Stock Split Means More Control for L... |
| 3 | 2011-10-26T07:11:29Z | 1 | threatpost.com | SSL DOS attack tool released exploiting negoti... |
| 4 | 2011-04-03T15:43:44Z | 67 | algorithm.com.au | Immutability and Blocks Lambdas and Closures |


In [59]:
### Tokenise (split each sentence into a list of individual words, or tokens)

tokenized_headlines = []

for line in submissions['headline']:
    split_line = line.split()
    tokenized_headlines.append(split_line)

In [60]:
### Lowercase and remove punctuation

punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []

for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for p in punctuation: 
            token = token.replace(p, "")
        tokens.append(token)
    clean_tokenized.append(tokens)
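
The nested replacement loop above can also be written with `str.translate`, which deletes every listed character in a single pass. What follows is only a minimal sketch on a made-up headline; `clean_headline` is a hypothetical helper and not part of the original notebook.

```python
# Same punctuation characters as the list used in the notebook.
PUNCTUATION = ",:;.'\"’?/-+&()"

# With a third argument, str.maketrans maps each listed character to None (deletion).
DELETE_PUNCTUATION = str.maketrans("", "", PUNCTUATION)

def clean_headline(headline):
    """Split on whitespace, lowercase, and strip punctuation (hypothetical helper)."""
    return [token.lower().translate(DELETE_PUNCTUATION) for token in headline.split()]

# Made-up headline, for illustration only.
cleaned = clean_headline("Software: Sadly, we (all) did adopt!")
print(cleaned)
```

Characters that are not in the list, such as `!`, are left in place by both approaches, and a token made up only of listed punctuation (a lone `-`, for example) becomes an empty string.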

## Create matrix of unique word counts

In [61]:
# Collect every token that occurs more than once across all headlines
unique_tokens = []
single_tokens = []
for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token not in unique_tokens:
            unique_tokens.append(token)

# Initialise the matrix with zeros: one row per headline, one column per repeated token
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

In [62]:
# Add in word counts 
for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
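            # Use a single .loc call; chained indexing (counts.iloc[i][token])
            # may update a temporary copy instead of the DataFrame itself.
            counts.loc[i, token] += 1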
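
Filling the matrix one cell at a time gets slow as the sample grows. One possible alternative, sketched here on toy input, counts tokens per headline with `collections.Counter` and lets pandas assemble the frame. The `toy_tokenized` list is made-up data and the variable names are assumptions, not the notebook's.

```python
from collections import Counter

import pandas as pd

# Made-up, already-cleaned token lists standing in for clean_tokenized.
toy_tokenized = [
    ["docker", "in", "the", "real", "world"],
    ["the", "real", "docker", "docker"],
    ["something", "else"],
]

# Count how often each token occurs across all headlines.
totals = Counter(token for tokens in toy_tokenized for token in tokens)

# Keep tokens that occur more than once overall, as the notebook does.
repeated = [token for token, n in totals.items() if n > 1]
repeated_set = set(repeated)

# One Counter per headline; tokens missing from a headline come out as NaN,
# so fill them with 0 and convert back to integers.
rows = [Counter(t for t in tokens if t in repeated_set) for tokens in toy_tokenized]
toy_counts = pd.DataFrame(rows, columns=repeated).fillna(0).astype(int)
print(toy_counts)
```

Each row of `toy_counts` corresponds to one headline and each column to one repeated token, which is the same layout as `counts`.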

In [63]:
# To reduce the number of features and enable the linear regression model to make better predictions,
# we'll remove any words that occur fewer than 5 times or more than 100 times.
word_counts = counts.sum()
counts = counts.loc[:,(word_counts >= 5) & (word_counts <= 100)]

## Split Data into Test and Train Sets

In [64]:
X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

## Make Predictions

In [65]:
clf = LinearRegression()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

## Validate Predictions

In [66]:
mse = mean_squared_error(y_test, predictions)
print(mse)

2378.081322808242


In [68]:
submissions['upvotes'].mean()

10.092109960728312

In [69]:
submissions['upvotes'].std()

39.49215218437226

In [72]:
rmse = mse ** (0.5)
print(rmse)

48.765575181763644


## Conclusion

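Our RMSE is about 49 upvotes, which is larger than the standard deviation of upvotes (about 39.5) and far larger than the mean (about 10). A model that always predicted the mean would be expected to have an RMSE close to the standard deviation, so the headline-based linear regression does not appear to beat even that simple baseline.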
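
The comparison against the spread of upvotes can be made concrete with a baseline that always predicts the training-set mean. The sketch below uses made-up numbers: `y_train_toy`, `y_test_toy` and `model_predictions_toy` are illustrative values, not results from this notebook, and `rmse` is a hypothetical helper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up upvote counts, for illustration only.
y_train_toy = [1, 1, 2, 4, 67]
y_test_toy = [1, 3, 5, 11]
model_predictions_toy = [2, 2, 6, 9]

# Baseline: predict the training mean for every test row.
baseline_predictions = np.full(len(y_test_toy), np.mean(y_train_toy))

model_rmse = rmse(y_test_toy, model_predictions_toy)
baseline_rmse = rmse(y_test_toy, baseline_predictions)
print("model RMSE:   ", model_rmse)
print("baseline RMSE:", baseline_rmse)
```

In the notebook itself, the same check would use `y_train`, `y_test` and `predictions`. A model is only worth keeping if its RMSE is clearly below the baseline's.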

## Next Steps

 - Add "meta" features like headline length and average word length.
 - Use a random forest, or another more powerful machine learning technique.
 - Explore different thresholds for removing extraneous columns.
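
The first two ideas could be prototyped roughly as follows. This is a sketch under stated assumptions rather than a tested improvement: the `toy` frame mixes two headlines from the sample with two made-up ones, the upvote values are illustrative, `add_meta_features` is a hypothetical helper, and the random forest settings are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data, for illustration only (the last two headlines are made up).
toy = pd.DataFrame({
    "headline": [
        "8 Ways to Use Docker in the Real World",
        "Immutability and Blocks Lambdas and Closures",
        "Show HN: A tiny static site generator",
        "Why we moved our API to Go",
    ],
    "upvotes": [1, 67, 12, 5],
})

def add_meta_features(frame):
    """Build simple per-headline features (hypothetical helper)."""
    words = frame["headline"].str.split()
    out = pd.DataFrame(index=frame.index)
    out["headline_length"] = frame["headline"].str.len()  # characters
    out["word_count"] = words.str.len()                   # whitespace-separated words
    out["avg_word_length"] = words.apply(
        lambda ws: float(np.mean([len(w) for w in ws])) if ws else 0.0
    )
    return out

meta = add_meta_features(toy)

# Arbitrary settings; random_state only makes the run repeatable.
forest = RandomForestRegressor(n_estimators=50, random_state=1)
forest.fit(meta, toy["upvotes"])
toy_predictions = forest.predict(meta)
print(meta)
```

In practice, the meta features would be joined to the word-count columns, and the forest would be fitted on the training split and scored on the test split, exactly as the linear regression is above.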