When you hear your native language, you intuitively know the meaning of what you heard. However, many people who've tried to learn a second or third language find the process to be much more painful. They have to break the language down into components like tenses in order to understand it better. Many have to take years of language lessons to get to the point where they can have a conversation.

Learning a language is difficult because language has many complex rules. If we want computers to be able to understand language, we either need to explicitly teach computers the rules, or enable the computers to intuit the rules themselves. The former is a lot like learning a second language, and the latter is a lot like learning your native language.

Broadly speaking, natural language processing is the study of enabling computers to understand human languages. This field may involve teaching computers to automatically score essays, infer grammatical rules, or determine the emotions associated with text.

In this mission, we'll learn some of the basic building blocks of natural language processing. When we feed a computer written text, it has no idea what that text means. In order for a computer to begin making inferences from it, we'll need to convert the text to a numerical representation. This process will enable the computer to intuit grammatical rules, which is more akin to learning a first language.

We'll explore how to get from written text to a numerical representation, and how we can use that representation to make predictions.

In [1]:
import pandas as pd

In [2]:
submissions = pd.read_csv('sel_hn_stories.csv')
submissions.columns = ['submission_time', 'upvotes', 'url', 'headline']
submissions.dropna(inplace=True)

In [3]:
submissions.head()

Unnamed: 0,submission_time,upvotes,url,headline
0,2010-02-17T16:57:59Z,1,blog.jonasbandi.net,Software: Sadly we did adopt from the construc...
1,2014-02-04T02:36:30Z,1,blogs.wsj.com,Google’s Stock Split Means More Control for L...
2,2011-10-26T07:11:29Z,1,threatpost.com,SSL DOS attack tool released exploiting negoti...
3,2011-04-03T15:43:44Z,67,algorithm.com.au,Immutability and Blocks Lambdas and Closures
4,2013-01-13T16:49:20Z,1,winmacsofts.com,Comment optimiser la vitesse de Wordpress?


Our goal is to train a linear regression algorithm that predicts the number of upvotes a headline would receive. To do this, we'll need to convert each headline to a numerical representation.

While there are several ways to accomplish this, we'll use a <a href='https://en.wikipedia.org/wiki/Bag-of-words_model'>bag of words</a> model. A bag of words model represents each piece of text as a numerical vector.

We'll examine each step in the bag of words process in this mission. For now, here's a high-level diagram showing how two sentences, `I rode my horse to Berlin.` and `You rode my horse to Berlin in the winter.`, convert to a bag of words:

<img src='nlp/fig1.jpg', height=278, width=484>
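
As a rough end-to-end sketch of the idea in the diagram, the snippet below turns the two example sentences into count vectors using only the standard library. The preprocessing here (lowercasing and stripping the trailing period) and the variable names are simplifying assumptions for illustration; the mission builds each step more carefully below.

```python
from collections import Counter

sentences = [
    "I rode my horse to Berlin.",
    "You rode my horse to Berlin in the winter.",
]

# Simplified preprocessing: lowercase, strip the trailing period, split on spaces.
tokenized = [s.lower().rstrip(".").split(" ") for s in sentences]

# The vocabulary is every distinct token, kept in first-seen order.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Each sentence becomes a vector holding the count of every vocabulary word.
vectors = []
for tokens in tokenized:
    token_counts = Counter(tokens)
    vectors.append([token_counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Both vectors have one entry per vocabulary word, so sentences of different lengths end up with representations of the same size.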

The first step in creating a bag of words model is tokenization. In tokenization, we break a sentence up into disconnected words.

Here's a diagram in which we tokenize the two sentences we mentioned above:

<img src='nlp/fig2.jpg', height=234, width=461>

As you can see, all we're doing is splitting each sentence into a list of individual words, or tokens. The split occurs on the space character (`" "`).
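
Here is a minimal sketch of that split applied to the two example sentences. The variable names are placeholders chosen for this illustration.

```python
sentences = [
    "I rode my horse to Berlin.",
    "You rode my horse to Berlin in the winter.",
]

# Split each sentence on the space character to get a list of tokens.
tokenized_sentences = [sentence.split(" ") for sentence in sentences]

print(tokenized_sentences[0])
# ['I', 'rode', 'my', 'horse', 'to', 'Berlin.']

# Splitting on a single space keeps punctuation attached to the token,
# and two consecutive spaces produce an empty-string token.
print("a  b".split(" "))
# ['a', '', 'b']
```

Note that `Berlin.` still carries its period after the split, which is why a cleanup step is needed afterwards.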

<b>Instructions</b>
- Split each headline into individual words on the space character (" "), and append the resulting list to tokenized_headlines.
- When you're finished, tokenized_headlines should be a list of lists. Each list should contain the tokens for the headline located at the corresponding position in the submissions dataframe.

In [4]:
headlines = submissions['headline'].tolist()
tokenized_headlines = [headline.split(" ") for headline in headlines]

We now have tokens, but we need to process them a bit to make our predictions more accurate. We know that Berlin, Berlin., and berlin all refer to the same word, but the computer doesn't know that. We'll need to convert those variations so that they're consistent.

We can do this by lowercasing (which will convert Berlin to berlin), and also by removing punctuation (so Berlin. becomes Berlin).

<img src='nlp/fig3.jpg', height=236, width=458>

Preprocessing doesn't have to be perfect, but the more we can help the computer group the same word together, the higher our prediction accuracy will be. Take a look through your tokens, and see if there are any instances of the same word that you haven't grouped together.
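
The sketch below shows one way to do this cleanup on a single token, along with two edge cases worth checking for in your own tokens. The helper name `normalize` is hypothetical and not part of the mission.

```python
import string

def normalize(token):
    """Lowercase a token and drop ASCII punctuation characters."""
    return "".join(ch for ch in token if ch not in string.punctuation).lower()

variants = ["Berlin", "Berlin.", "berlin"]
print([normalize(token) for token in variants])
# ['berlin', 'berlin', 'berlin']

# A token made only of punctuation collapses to an empty string.
print(repr(normalize("--")))

# string.punctuation only covers ASCII, so a curly apostrophe is kept.
print(normalize("Google’s"))
```

Because `string.punctuation` is ASCII-only, tokens with typographic quotes (such as `Google’s`) will not be grouped with their plain-ASCII counterparts unless you handle those characters separately.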

<b>Instructions</b>
- Loop through each item in tokenized_headlines, which is a list of lists.
- For each list of tokens:
    - Convert each individual token to lowercase
    - Remove all of the items in the punctuation list from each individual token
    - Append the clean list to clean_tokenized
- clean_tokenized should now be a list of lists. Each list should contain the preprocessed tokens associated with the headline in the corresponding position of the submissions dataframe.

In [5]:
import string

clean_tokenized = []
exclude = set(string.punctuation)

for headline in tokenized_headlines:
    clean = []
    for word in headline:
        word = ''.join(ch for ch in word if ch not in exclude).lower()
        clean.append(word)
    clean_tokenized.append(clean)
    
# one liner: [[''.join(ch for ch in word if ch not in exclude).lower() for word in headline] for headline in tokenized_headlines]

Now that we have our tokens, we can begin converting the sentences to their numerical representations. First, we'll retrieve all of the unique words from all of the headlines. Then, we'll create a matrix, and assign those words as the column headers. We'll initialize all of the values in the matrix to 0.

<img src='nlp/fig4.jpg', height=400, width=500>

We'll use a pandas dataframe instead of a NumPy matrix. We can create a dataframe with all zero values using this syntax:

    pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

The code above will create a dataframe with as many rows as the number of items in clean_tokenized. Each column name will be a word from unique_tokens. This assumes that we already assigned the unique tokens to unique_tokens. Each cell in the dataframe will have the value 0. You can find more documentation on initializing a dataframe in the pandas documentation.
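
To make the shape of the result concrete, here is a small sketch on toy data. The token lists below are made up for illustration; in the mission they come from the earlier steps.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real variables.
clean_tokenized = [["i", "rode", "my", "horse"], ["you", "rode", "my", "horse"]]
unique_tokens = ["rode", "my", "horse"]

# One row per headline, one column per token, every cell set to 0.
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

print(counts.shape)
print(counts)
```

With two headlines and three tokens, the dataframe has two rows and three columns.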

<b>Instructions</b>
- Find all of the unique tokens in clean_tokenized, and assign the result to unique_tokens.
    - Only add tokens that occur more than once (across all of the headlines). Tokens that only occur once don't add anything to the model's prediction power, and removing them will make our algorithm run much more quickly.
    - To do this, you can keep a list of the tokens that occur once in the data, and a different list of the tokens that occur more than once. If a token is already in the first list when you encounter it and it's not in the second list, you should add it to the second list.
    - When you're finished, unique_tokens should contain any tokens that occur more than once across all of the headlines.
    - Each token in unique_tokens should only appear in the list a single time.
- Create a dataframe with as many rows as there are items in the clean_tokenized list. Each column name should be a token in unique_tokens. Initialize all of the cells to the value 0. Assign the dataframe to the variable counts.

In [6]:
import numpy as np

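from collections import Counter

# Count every token across all headlines
token_totals = Counter(token for headline in clean_tokenized for token in headline)

# Keep each token that occurs more than once; every token appears in the list a single time
unique_tokens = [token for token, total in token_totals.items() if total > 1]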

counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)
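
The two-list approach described in the instructions can be sketched as follows on toy data. The sample token lists and the name `seen_once` are assumptions made for this illustration, and the `Counter` comparison at the end is only a sanity check.

```python
from collections import Counter

clean_tokenized = [
    ["i", "rode", "my", "horse", "to", "berlin"],
    ["you", "rode", "my", "horse", "to", "berlin", "in", "the", "winter"],
]

seen_once = []       # tokens encountered at least once
unique_tokens = []   # tokens encountered more than once

for tokens in clean_tokenized:
    for token in tokens:
        if token in seen_once and token not in unique_tokens:
            # Second encounter: promote the token to the "more than once" list.
            unique_tokens.append(token)
        elif token not in seen_once:
            # First encounter: remember the token.
            seen_once.append(token)

print(unique_tokens)

# Sanity check: counting occurrences directly should give the same tokens.
token_totals = Counter(token for tokens in clean_tokenized for token in tokens)
repeated = [token for token, total in token_totals.items() if total > 1]
print(repeated == unique_tokens)
```

Membership tests on a list get slower as the list grows, so for larger data sets a `Counter` or a `set` is likely the more practical choice.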

Now that we have a matrix where all values are 0, we need to fill in the correct counts for each cell. This involves going through each set of tokens, and incrementing the column counters in the appropriate row.

<img src='nlp/fig5.jpg', height=300, width=450>

When we're finished, we'll have a row vector for each headline that tells us how many times each token occurred in that headline.

To accomplish this, we can loop through each list of tokens in clean_tokenized, then loop through each token in the list and increment the proper cell.
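
Here is a small sketch of that nested loop on toy data. The token lists are invented for illustration. The sketch assumes the dataframe's index is 0, 1, 2, ... so that the position from `enumerate()` can be used directly as the row label.

```python
import numpy as np
import pandas as pd

clean_tokenized = [
    ["i", "rode", "my", "horse", "to", "berlin"],
    ["you", "rode", "my", "horse", "to", "berlin", "in", "the", "winter"],
    ["to", "berlin", "or", "not", "to", "berlin"],
]
unique_tokens = ["rode", "my", "horse", "to", "berlin"]
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

for i, tokens in enumerate(clean_tokenized):
    for token in tokens:
        # Skip tokens that have no column in the dataframe.
        if token in counts.columns:
            # Address the cell by row label and column name in a single lookup.
            counts.loc[i, token] += 1

print(counts)
```

The third row ends up with a 2 under both `to` and `berlin`, because each of those tokens appears twice in that list.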

<b>Instructions</b>
- Loop through each list of tokens in clean_tokenized.
    - You should use the enumerate() function when writing the loop to get an index along with the list of tokens.
- Loop through each token in the list of tokens.
- Check whether the token is in unique_tokens. If not, it isn't a column in the dataframe, and you should ignore it.
- Increment the appropriate cell by indexing the row of counts, and finding the right column for the token. Add 1 to the cell to indicate that you found the token once.

In [7]:
for i, clean_list in enumerate(clean_tokenized):
    for token in clean_list:
        if token in unique_tokens:
            # Use a single .loc lookup; chained indexing such as counts[token].iloc[i] may modify a copy
            counts.loc[i, token] += 1

We have over 2000 columns in our matrix. This can make it very hard for a linear regression model to make good predictions. Too many columns will cause the model to fit to noise instead of the signal in the data.

There are two kinds of features that will reduce prediction accuracy. Features that occur only a few times will cause overfitting, because the model doesn't have enough information to accurately decide whether they're important. These features will probably correlate differently with upvotes in the test set and the training set.

Features that occur too many times can also cause issues. These are words like `and` and `to`, which occur in nearly every headline. These words don't add any information, because they don't necessarily correlate with upvotes. These types of words are sometimes called stopwords.

To reduce the number of features and enable the linear regression model to make better predictions, we'll remove any words that occur fewer than 5 times or more than 100 times.
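
The filtering idea can be sketched on a tiny dataframe. The column names, the data, and the thresholds of 2 and 3 below are made up so the effect is visible on four rows; the mission itself uses 5 and 100.

```python
import pandas as pd

counts = pd.DataFrame({
    "the": [1, 1, 1, 1],      # appears in every row (too common)
    "python": [1, 0, 1, 0],   # appears a moderate number of times
    "zebra": [0, 0, 0, 1],    # appears only once (too rare)
})

min_count, max_count = 2, 3  # hypothetical thresholds for this toy example

# Total occurrences of each word across all rows.
word_counts = counts.sum()

# Keep only the columns whose totals fall inside the allowed range.
keep = word_counts[(word_counts >= min_count) & (word_counts <= max_count)].index
filtered = counts.loc[:, keep]

print(list(filtered.columns))
# ['python']
```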

<b>Instructions</b>
- Generate a vector that contains the sum of each column in counts. This data will indicate how many times each word occurs in the headlines. You can use the sum() method on pandas dataframes to accomplish this. Assign this vector to word_counts.
- Use the vector to filter counts, removing any columns whose words occur fewer than 5 times or more than 100 times. You can use the loc method on dataframes to accomplish this.

In [8]:
word_counts = counts.sum()

cols = word_counts[(word_counts <= 100) & (word_counts >= 5)].index
counts = counts[cols]

Now we'll need to split the data into two sets so that we can evaluate our algorithm effectively. We'll train our algorithm on a training set, then test its performance on a test set.

The train_test_split() function from scikit-learn will help us accomplish this.

We'll pass in .2 for the test_size parameter to randomly select 20% of the rows for our test set, and 80% for our training set.

X_train and X_test contain the predictors, and y_train and y_test contain the value we're trying to predict (upvotes).
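
To see what the split returns, here is a minimal sketch on ten made-up rows. The toy `X` and `y` are assumptions chosen so that each predictor row is easy to match with its target.

```python
from sklearn.model_selection import train_test_split

# Ten toy rows: each predictor row [i] is paired with target i.
X = [[i] for i in range(10)]
y = list(range(10))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(len(X_train), len(X_test))

# Predictors and targets are shuffled together, so each pair stays aligned.
print(all(row[0] == target for row, target in zip(X_test, y_test)))
```

Fixing `random_state` makes the shuffle repeatable, which helps when comparing models across runs.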

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

Now that we have a training set and a test set, let's train a model and make test predictions. We'll use a linear regression algorithm from scikit-learn, which you can read more about in the scikit-learn documentation.

First we'll initialize the model using the LinearRegression class. Then, we'll use the fit() method on the model to train with X_train and y_train. Finally, we'll make predictions with X_test.

When we train a linear regression model, the model assigns a coefficient to each column. Essentially, the model is determining which words correlate with more upvotes, and which with fewer. By finding these correlations, the model will be able to predict which headlines will be highly upvoted in the future. While the algorithm won't have a high level of understanding of the text, linear regression can generate surprisingly good results.
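
As an illustration of what those coefficients look like, the sketch below fits a model on invented data. The two token columns ("python" and "boring") and the upvote values are hypothetical, constructed so that upvotes = 5 + 10 * python - 2 * boring exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row holds the counts of two hypothetical tokens: "python", "boring".
X = np.array([[1, 0], [0, 1], [1, 1], [2, 0], [0, 0]])

# Made-up upvotes generated as 5 + 10 * python - 2 * boring.
y = np.array([15, 3, 13, 25, 5])

model = LinearRegression()
model.fit(X, y)

# One coefficient per column, plus an intercept.
print(model.coef_)
print(model.intercept_)

# Predicted upvotes for a headline with one "python" and two "boring".
print(model.predict(np.array([[1, 2]])))
```

A positive coefficient means the token is associated with more upvotes in the training data, and a negative one with fewer. Real headline data is far noisier than this toy set, so the fitted coefficients there are much less clean.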

<b>Instructions</b>
- Train clf using the fit() method.
- Use the predict() method on clf to make predictions on X_test. Assign the result to predictions.

In [11]:
from sklearn.linear_model import LinearRegression

In [12]:
clf = LinearRegression()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)



Now that we have predictions, we can calculate our prediction error. We'll need to select an error metric first, though. We'll use mean squared error (MSE), which is a common error metric.

Here's the formula for MSE:

$MSE = \frac{1}{n}\sum_{i=1}^{n}(\hat{Y_{i}} - Y_{i})^{2}$

With MSE, we take the difference between each prediction and the actual value, square the results, and find the mean. Because the errors are squared, MSE penalizes errors further away from the actual value more than those close to the actual value. We want to use MSE because we'd like all of our predictions to be relatively close to the actual values.
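
A short worked example with made-up numbers shows how squaring lets one large miss dominate the metric. The three predictions and actual values below are invented for illustration.

```python
predictions = [3, 5, 10]
actual = [1, 5, 4]

# Squared difference between each prediction and its actual value.
squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actual)]
print(squared_errors)
# [4, 0, 36]

# The mean of the squared errors is the MSE.
mse = sum(squared_errors) / len(squared_errors)
print(mse)

# Share of the total squared error contributed by the largest miss.
print(max(squared_errors) / sum(squared_errors))
```

The error of 6 is three times the error of 2, but it contributes nine times as much to the sum (36 versus 4), which is 90% of the total here.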

<b>Instructions</b>
- Calculate the mean squared error associated with our predictions.
    - Subtract y_test from predictions.
    - Square each of the differences.
    - Add all of the squared differences together, and divide by the number of differences to get the mean.
    - Assign the result to mse.

In [13]:
from sklearn.metrics import mean_squared_error

In [14]:
mse = mean_squared_error(y_test, predictions)
print('MSE:', mse)

MSE: 2666.776742906371


Our MSE is about 2667, which is a fairly large value. There's no hard and fast rule about what a "good" error rate is, because it depends on the problem we're solving and our error tolerance.

In this case, the mean number of upvotes is 10, and the standard deviation is 39.5. If we take the square root of our MSE to calculate error in terms of upvotes, we get 51.6. This means that our predictions are typically about 51.6 upvotes away from the true value. This is higher than the standard deviation, so our predictions are often far off-base.
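
As a check on the arithmetic, the square root can be computed directly from the MSE value that the notebook printed above (2666.776742906371). The standard deviation of 39.5 is the figure quoted in the text.

```python
import math

mse = 2666.776742906371   # value printed by the notebook cell above
upvote_std = 39.5         # standard deviation of upvotes quoted in the text

# Root mean squared error, expressed in upvotes.
rmse = math.sqrt(mse)
print(round(rmse, 1))
# 51.6

# The typical error exceeds the spread of the upvotes themselves.
print(rmse > upvote_std)
# True
```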

We can take several steps to reduce the error and explore natural language processing further. Here are some ideas for your next steps:

- Use the entire data set. While we used a sample in this mission, you could download the entire data set from <a href='https://github.com/arnauddri/hn'>this</a> GitHub repository. This approach should help reduce the error rate. There are many features in natural language processing. Using more data makes it more likely that the model will find occurrences of the same features in both the test and training sets, which will help the model make better predictions.
- Add "meta" features like headline length and average word length.
- Use a random forest, or another more powerful machine learning technique.
- Explore different thresholds for removing extraneous columns.
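
As a starting point for the "meta" features idea, here is a minimal sketch that computes a few of them for one headline. The function name `meta_features` and the exact set of features are assumptions for illustration, not part of the mission.

```python
def meta_features(headline):
    """Return simple length-based features for a headline."""
    # Ignore empty strings produced by consecutive spaces.
    words = [word for word in headline.split(" ") if word]
    word_count = len(words)
    avg_word_length = sum(len(word) for word in words) / word_count if word_count else 0.0
    return {
        "headline_length": len(headline),
        "word_count": word_count,
        "avg_word_length": avg_word_length,
    }

features = meta_features("Some notes on SuperFish")
print(features)
# {'headline_length': 23, 'word_count': 4, 'avg_word_length': 5.0}
```

Features like these could be added as extra columns next to the word counts before splitting the data into training and test sets.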

In [15]:
# Try using more data

submissions = pd.read_csv('stories.csv', names=['id', 'created_at', 
                                                'created_at_i', 'author', 
                                                'points', 'url_hostname', 
                                                'num_comments', 'title'])

submissions.rename(columns={'created_at': 'submission_time',
                           'points': 'upvotes',
                           'url_hostname': 'url',
                           'title': 'headline'}, inplace=True)

submissions.drop(['id', 'created_at_i', 'author', 'num_comments'], axis=1, inplace=True)
submissions.dropna(inplace=True)
submissions = submissions.iloc[:10000]

In [16]:
submissions.head()

Unnamed: 0,submission_time,upvotes,url,headline
1,2015-02-20T11:34:22.000Z,1,startupjuncture.com,24sessions: live business advice over video-chat
2,2015-02-20T11:35:32.000Z,3,blog.erratasec.com,Some notes on SuperFish
3,2015-02-20T11:36:18.000Z,1,twitter.com,Apple Watch models could contain 29.16g of gold
4,2015-02-20T11:41:06.000Z,1,phpconference.co.uk,PHP UK Conference Diversity Scholarship Programme
5,2015-02-20T11:43:04.000Z,2,preview.onedrive.com,Microsoft giving away 100GB free OneDrive stor...


In [17]:
headlines = submissions['headline'].tolist()
tokenized_headlines = [headline.split(" ") for headline in headlines]

exclude = set(string.punctuation)
clean_tokenized = [[''.join(ch for ch in word if ch not in exclude).lower() for word in headline] for headline in tokenized_headlines]

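from collections import Counter

# Count every token across all headlines
token_totals = Counter(token for headline in clean_tokenized for token in headline)

# Keep each token that occurs more than once; every token appears in the list a single time
unique_tokens = [token for token, total in token_totals.items() if total > 1]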

counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

for i, clean_list in enumerate(clean_tokenized):
    for token in clean_list:
        if token in unique_tokens:
            # Use a single .loc lookup; chained indexing such as counts[token].iloc[i] may modify a copy
            counts.loc[i, token] += 1
            
word_counts = counts.sum()

cols = word_counts[(word_counts <= 100) & (word_counts >= 5)].index
counts = counts[cols]

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

clf = LinearRegression()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print('MSE:', mse)

MSE: 3757.6118251965304


In [18]:
submissions.shape

(10000, 4)