
### Introduction to Natural Language Processing



Natural Language Processing is the study of enabling computers to understand human languages. This field may involve teaching computers to automatically score essays, infer grammatical rules, or determine the emotions associated with text.

[Hacker News](https://news.ycombinator.com/) is a community where users can submit articles, and other users can upvote those articles. The articles with the most upvotes make it to the front page, where they're more visible to the community.

The data set consists of submissions users made to Hacker News from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape the data, which you can find in one of [his GitHub repositories](https://github.com/arnauddri/hn). We've randomly sampled 3,000 rows from the data and removed all of the extraneous columns. The data set used in this project is available at this [link](https://dsserver-prod-resources-1.s3.amazonaws.com/67/sel_hn_stories.csv?versionId=xcIAi.Ol72azTM8SF_Y_z5ovesmzz0wN). Our data has only four columns:

* `submission_time` - When the article was submitted
* `upvotes` - The number of upvotes the article received
* `url` - The base URL of the article
* `headline` - The article's headline

We'll predict the number of upvotes each article received based on its headline. Because upvotes are an indicator of popularity, this will help us discover which types of articles tend to be the most popular.


In [1]:
import pandas as pd
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()

Our goal is to train a linear regression algorithm that predicts the number of upvotes a headline would receive. To do this, we'll need to convert each headline to a numerical representation.

While there are several ways to accomplish this, we'll use a [bag of words model](https://en.wikipedia.org/wiki/Bag-of-words_model). A bag of words model represents each piece of text as a numerical vector of word counts, ignoring the order in which the words appear.
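
To make the idea concrete, here is a minimal sketch of a bag of words representation. The two headlines are invented for illustration and are not taken from the data set; the variable names are also just assumptions for this example.

```python
from collections import Counter

# Toy headlines, invented for illustration only.
headlines = ["the cat sat", "the cat saw the dog"]

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for h in headlines for word in h.split()})

# One count vector per headline, aligned with the vocabulary order.
vectors = []
for h in headlines:
    word_counts = Counter(h.split())
    vectors.append([word_counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each position in a vector corresponds to one vocabulary word, so the second headline has a 2 in the `the` position.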

The first step in creating a bag of words model is [tokenization](https://en.wikipedia.org/wiki/Tokenization). In tokenization, we break a sentence up into disconnected words.

In [3]:
tokenized_headlines = []
for item in submissions["headline"]:
    tokenized_headlines.append(item.split())

We now have tokens, but we need to process them a bit to make our predictions more accurate. We know that `Berlin`, `Berlin.`, and `berlin` all refer to the same word, but the computer doesn't know that. We'll need to convert those variations so that they're consistent.

We can do this by lowercasing (which will convert `Berlin` to `berlin`), and also by removing punctuation (so `Berlin.` becomes `Berlin`).
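
As a quick sanity check, the sketch below shows one possible way to apply both steps to the three variants. The `normalize` helper and the short `PUNCTUATION` string are assumptions made for this example, not part of the project code.

```python
# Punctuation to strip; a small illustrative subset, not a complete list.
PUNCTUATION = ".,:;?!"

def normalize(token):
    """Lowercase a token and delete the punctuation characters above."""
    return token.lower().translate(str.maketrans("", "", PUNCTUATION))

variants = ["Berlin", "Berlin.", "berlin"]
normalized = [normalize(t) for t in variants]
print(normalized)  # ['berlin', 'berlin', 'berlin']
```

All three variants collapse to the same token, which is what lets them share a single feature later on.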



Preprocessing doesn't have to be perfect, but the more we can help the computer group the same word together, the higher our prediction accuracy will be. Take a look through your tokens, and see if there are any instances of the same word that you haven't grouped together.

In [5]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []
for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)

Now that we have our tokens, we can begin converting the headlines to their numerical representations. First, we'll collect the distinct words from all of the headlines, keeping only those that occur more than once. Then, we'll create a matrix and assign those words as the column headers. We'll initialize all of the values in the matrix to 0.

In [6]:
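import numpy as np

# Tokens that occur at least twice across all headlines become columns.
unique_tokens = []
# Every token seen so far, used to detect repeats.
single_tokens = []
for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            # First sighting: remember the token, but don't add a column yet.
            single_tokens.append(token)
        elif token not in unique_tokens:
            # Second sighting: the token is repeated, so keep it as a column.
            unique_tokens.append(token)

# One row per headline, one column per repeated token, all counts start at 0.
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)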
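
The cell above checks list membership for every token, which gets slow as the vocabulary grows. As a hedged alternative, the sketch below finds the repeated tokens with `collections.Counter` in a single pass. `toy_tokenized` is a made-up stand-in for `clean_tokenized`, and the resulting order (first appearance) may differ from the order produced by the loop above.

```python
from collections import Counter

# Toy stand-in for clean_tokenized (hypothetical data).
toy_tokenized = [
    ["show", "hn", "my", "app"],
    ["ask", "hn", "my", "question"],
    ["show", "me"],
]

# Count every token across all headlines in a single pass.
token_totals = Counter(token for tokens in toy_tokenized for token in tokens)

# Keep tokens seen at least twice; Counter preserves first-seen order.
repeated_tokens = [token for token, total in token_totals.items() if total >= 2]
print(repeated_tokens)  # ['show', 'hn', 'my']
```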

Now that we have a matrix where all values are 0, we need to fill in the correct counts for each cell. This involves going through each set of tokens, and incrementing the column counters in the appropriate row.


When we're finished, we'll have a row vector for each headline that tells us how many times each token occurred in that headline.

To accomplish this, we can loop through each list of tokens in `clean_tokenized`, then loop through each token in the list and increment the proper cell.

In [7]:
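for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            # Index the row and column in one step; chained indexing
            # (counts.iloc[i][token]) can update a temporary copy instead.
            counts.loc[i, token] += 1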
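
Updating a DataFrame one cell at a time can be slow on larger data sets. One possible alternative, sketched here with plain Python lists, is to build each row of counts first and create the table afterwards. `toy_tokenized` and `toy_vocabulary` are hypothetical stand-ins for `clean_tokenized` and `unique_tokens`.

```python
from collections import Counter

# Hypothetical toy inputs standing in for clean_tokenized and unique_tokens.
toy_tokenized = [["show", "hn", "hn"], ["ask", "hn"], ["show", "me"]]
toy_vocabulary = ["show", "hn"]

rows = []
for tokens in toy_tokenized:
    per_headline = Counter(tokens)
    # Counter returns 0 for missing tokens, so absent words need no special case.
    rows.append([per_headline[token] for token in toy_vocabulary])

print(rows)  # [[1, 2], [0, 1], [1, 0]]
```

With pandas available, `pd.DataFrame(rows, columns=toy_vocabulary)` would turn these rows into a table with the same shape as `counts`.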

We have over 2000 columns in our matrix. This can make it very hard for a linear regression model to make good predictions. Too many columns will cause the model to fit to noise instead of the signal in the data.

There are two kinds of features that will reduce prediction accuracy. Features that occur only a few times will cause overfitting, because the model doesn't have enough information to accurately decide whether they're important. These features will probably correlate differently with upvotes in the test set and the training set.

Features that occur too many times can also cause issues. These are words like `and` and `to`, which occur in nearly every headline. These words don't add any information, because they don't necessarily correlate with upvotes. These types of words are sometimes called stopwords.

To reduce the number of features and enable the linear regression model to make better predictions, we'll remove any words that occur fewer than 5 times or more than 100 times.
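
The filtering rule itself can be sketched without pandas. The `select_tokens` helper below is hypothetical, written only to illustrate the thresholds, and the toy data uses smaller limits so the effect is visible. The project's own filtering step may instead operate directly on the `counts` DataFrame.

```python
from collections import Counter

def select_tokens(token_lists, min_count=5, max_count=100):
    """Return tokens whose total occurrences lie in [min_count, max_count]."""
    totals = Counter(token for tokens in token_lists for token in tokens)
    return [t for t, n in totals.items() if min_count <= n <= max_count]

# Toy data with small thresholds so the effect is visible.
toy_tokenized = [["the", "python"], ["the", "python"], ["the", "rust"], ["the"]]
kept = select_tokens(toy_tokenized, min_count=2, max_count=3)
print(kept)  # ['python']
```

Here `the` is dropped for being too common and `rust` for being too rare, leaving only `python`.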