# HBR Headline exploration and prediction




----

In [None]:
import numpy as np
import pandas as pd

from nltk import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import nltk

import matplotlib.pyplot as plt

## Let's explore our headlines

The headlines data consists of 5,000 items with two features:

- **Page Title**: the article's headline text
- **Topic**: the editorially assigned topic


In [None]:
#Load headlines and preview the first few rows
data = pd.read_csv("headlines.csv")
print("Loaded {:,} headlines".format(len(data.index)))
data.head()


### Adding processing functions

Later we'll apply a [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) approach, so filtering out stop words ahead of time makes the titles more usable. The two helper functions below, `remove_stop_words` and `lowercase_tokens`, handle that cleanup.

In [None]:
#Some text functions
#Requires the NLTK stop word corpus: run nltk.download('stopwords') once

#Takes a list of words, removes common English stop words
def remove_stop_words(tokens):
    #Use a set so each membership check is fast
    stop_words = set(nltk.corpus.stopwords.words('english'))
    return [w for w in tokens if w.lower() not in stop_words]

#Takes a list of words, lowercases them
def lowercase_tokens(tokens):
    return [w.lower() for w in tokens]
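
To see what the two helpers do, here is a minimal, self-contained sketch run on one made-up headline. It swaps NLTK's English stop word list for a tiny hand-picked set (`SAMPLE_STOP_WORDS`, an assumption made so the example runs without downloading a corpus), and the `demo_` function names are hypothetical stand-ins, so the real functions will filter more words than this does.

```python
# Illustrative stand-in for nltk.corpus.stopwords.words('english').
# Assumption: this tiny subset is enough for the demo; the real list is much longer.
SAMPLE_STOP_WORDS = {"a", "the", "to", "of", "in", "your", "is", "how"}


def demo_lowercase(tokens):
    """Return a new list with every token lowercased."""
    return [token.lower() for token in tokens]


def demo_remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Drop tokens whose lowercase form appears in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]


# Hypothetical headline, already split into words.
headline_tokens = ["How", "to", "Manage", "Your", "Team", "in", "a", "Crisis"]
cleaned = demo_remove_stop_words(demo_lowercase(headline_tokens))
print(cleaned)
```

Only the content-bearing words survive, which is what the Bag of Words step cares about.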

With the headlines data loaded from the spreadsheet above, let's clean up the text with those functions and take a look at what we have.

In [None]:
#Count the most common words in headlines

#Concatenate headlines into one string
text = data['Page Title'].str.cat(sep=" ")

#Decode to unicode if the text came back as bytes (Python 2); Python 3 strings need no decoding
if isinstance(text, bytes):
    text = text.decode("utf-8")

#Create tokenizer that drops punctuation (\w+ keeps letters, digits, and underscores)
tokenizer = RegexpTokenizer(r'\w+')

#Tokenize into words, lowercase, remove stop words
tokens = tokenizer.tokenize(text)
tokens = lowercase_tokens(tokens)
tokens = remove_stop_words(tokens)

#Plot the most common words
fdist = FreqDist(tokens)
print("The most common words in our headlines are:")
# print(fdist.most_common(50))

plt.figure(figsize=(15, 4))  #figure size in inches (width, height)
fdist.plot(35, cumulative=False)
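
The counting step can also be sketched without NLTK, which may help when checking what the plot is based on. In this sketch `re.findall` with the same `\w+` pattern stands in for `RegexpTokenizer`, and `collections.Counter` stands in for `FreqDist` (both offer `most_common`). The headlines and the `demo_` names are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical headlines, invented for this example only.
sample_headlines = [
    "What Great Managers Do - HBR",
    "How Managers Can Build Trust - HBR",
    "5 Ways to Build a Great Team",
]

# \w+ matches runs of letters, digits, and underscores, so punctuation is
# dropped but numbers such as "5" are kept.
demo_text = " ".join(sample_headlines).lower()
demo_tokens = re.findall(r"\w+", demo_text)

demo_counts = Counter(demo_tokens)
top_three = demo_counts.most_common(3)
print(top_three)
```

Even on three headlines, "hbr" shows up among the most frequent tokens, which is the same pattern the full dataset reveals.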


## Additional cleanup, further processing the text
The exploration showed that lots of headlines have "HBR" in them. We need to fix that, look for similar issues, and then get rid of duplicates.

In [None]:
#Subset headlines that mention "HBR" or "Harvard Business Review" to see how they look

hbr = data[data['Page Title'].str.contains("HBR")]
print(hbr.shape)

hbr2 = data[data['Page Title'].str.contains("Harvard Business Review")]
print(hbr2.shape)

for a in hbr['Page Title'][0:5]:
    print(a)

In [None]:
#Fix headline formatting (duplicates are dropped in the next cell)
data['Clean Title'] = data['Page Title'].str.replace(' - HBR', '', regex=False)

#Check for remaining 'HBR' mentions
hbr = data[data['Clean Title'].str.contains("HBR")]
print(hbr.shape)
print(hbr.head())

#We've cut 400+ headlines down to 15. Now let's just remove "HBR" from those remaining heds
data['Clean Title'] = data['Clean Title'].str.replace('HBR', '', regex=False)

In [None]:
#Drop duplicate headlines
data = data.drop_duplicates(subset="Clean Title")
data.shape
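
The order of the two replacements matters: stripping the ` - HBR` suffix first avoids leaving a dangling ` - ` behind. Here is a small plain-Python sketch of the same clean-then-deduplicate idea on invented titles. The whitespace collapsing is an addition that the cells above do not perform, and `clean_title` is a hypothetical helper name.

```python
def clean_title(title):
    """Strip the ' - HBR' suffix first, then any remaining bare 'HBR'."""
    title = title.replace(" - HBR", "")
    title = title.replace("HBR", "")
    # Collapse whitespace left behind by the removals (not done in the notebook).
    return " ".join(title.split())


# Hypothetical titles, invented for this example only.
raw_titles = [
    "What Great Managers Do - HBR",
    "What Great Managers Do",
    "The HBR Guide to Better Meetings",
]

seen = set()
unique_titles = []
for title in map(clean_title, raw_titles):
    if title not in seen:  # keep the first occurrence, as drop_duplicates does by default
        seen.add(title)
        unique_titles.append(title)

print(unique_titles)
```

The first two titles only become duplicates after cleaning, which is why deduplication comes last.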


## Save the cleaned headlines to a new file
We've got about 4,500 cleaned-up headlines left. Time to save them to a new CSV.

In [None]:
data['Clean Title'].to_csv("clean_headlines.csv",header=True,index=False)

## Load social media data
To extend the dataset, load social media posts from Twitter and Facebook.

In [None]:
social = pd.read_csv("all_tweets_fb_2013-nov18.csv")
print(social.shape)

for column in social.columns:
    print(column)


display(social['Created By'].describe())

In [None]:
#Limit the dataset to tweets and Facebook posts by HBR editors, as opposed to marketing, etc.

editors = ['nicole.torres@harvardbusiness.org','alexandra.kephart@hbr.org', 'Ramsey.Khabbaz@harvardbusiness.org',
     'paige.cohen@hbr.org','nicole.blank@hbr.org','ggavett@hbr.org','etruxler@hbr.org','awieckowski@hbr.org',
     'duygu.mullin@hbr.org','walter.frick@harvardbusiness.org']
social = social[social['Created By'].isin(editors)]
social.shape

In [None]:
#Preview the first five messages
display(social['Message'].head(5))

## Combining text
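Let's combine the headlines and the social media messages into a single column of HBR text. Further on, this combined column is what gets labeled as HBR content for the classifier.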

In [None]:
#Combine text fields for HBR headlines and social media

tk = data['Clean Title']
tk2 = social['Message']

frames = [tk,tk2]

result = pd.concat(frames)
df = result.to_frame(name='Clean Title')
print(df.shape)
df.head()

## More cleanup! 
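The social media messages haven't been through the earlier cleanup, so they can still contain "HBR". We also remove the word "research", because the model will be used to evaluate research.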

In [None]:
#Remove mentions of HBR
df['Clean Title'] = df['Clean Title'].str.replace(' - HBR', '')
df['Clean Title'] = df['Clean Title'].str.replace('HBR', '')

#Remove mentions of research because the algorithm is being used to evaluate research
df['Clean Title'] = df['Clean Title'].str.replace('Research', '')
df['Clean Title'] = df['Clean Title'].str.replace('research', '')
df.head()
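
One caveat with plain substring removal is that it also cuts into longer words such as "Researchers". A possible alternative, sketched here with the standard `re` module, is to match "research" only as a whole word and ignore case. The `drop_research` helper and the sample titles are hypothetical, and this is a suggestion rather than what the cell above does.

```python
import re

# Matches 'research' as a whole word, in any letter case.
RESEARCH_WORD = re.compile(r"\bresearch\b", flags=re.IGNORECASE)


def drop_research(title):
    """Remove the standalone word 'research' and tidy the leftover spacing."""
    without_word = RESEARCH_WORD.sub("", title)
    return " ".join(without_word.split())


whole_word = drop_research("New Research on Remote Work")
longer_word = drop_research("What Researchers Get Wrong")
# For comparison, plain substring removal as in the cell above:
substring = "What Researchers Get Wrong".replace("Research", "")

print(whole_word)
print(longer_word)
print(substring)
```

Whether "Researchers" should stay or go depends on how strictly the model needs to be blind to research-related wording, so treat the whole-word pattern as one option to test.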


## Save the data! 
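Save the combined dataframe to CSV. Note that this overwrites the `clean_headlines.csv` written earlier, so the file now holds headlines and social media messages together.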

In [None]:
df['Clean Title'].to_csv("clean_headlines.csv",header=True,index=False)

-----

## From sheet 2: Non-HBR Headline Aggregator Exploration


In [None]:
#Load headline dataset and take a look
# https://www.kaggle.com/uciml/news-aggregator-dataset
# https://archive.ics.uci.edu/ml/datasets/News+Aggregator

df = pd.read_csv("uci-news-aggregator.csv")
print(df.shape)

display(df.head())

## Filter News Data
Let's winnow this data set down to two samples: a random sample of headlines from all categories, and a second sample restricted to 'Business' stories. In both cases we keep only the headlines.

In [None]:

#Randomly select 15,000 headlines (to match size of HBR dataset)
random_df = df.sample(n=15000)

#Save headlines from random set to csv
random_df['TITLE'].to_csv("other_headlines.csv",index=False,header=True)

#Do the same just for business headlines
biz = df.loc[df['CATEGORY'] == 'b']
random_biz = biz.sample(n=15000)

#Save random business headlines to csv
random_biz['TITLE'].to_csv("other_biz_headlines.csv",index=False,header=True)
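
Because the sampling above is unseeded, each run writes a different set of headlines. The sketch below shows the idea of a reproducible, size-safe sample using only the standard library. `sample_titles` and the stand-in data are hypothetical. In pandas, the comparable knob is the `random_state` argument of `DataFrame.sample`, which the train/test split later in this notebook already uses.

```python
import random


def sample_titles(titles, n, seed=1):
    """Return up to n titles chosen without replacement, reproducibly."""
    rng = random.Random(seed)   # fixed seed, so the same sample comes back every run
    k = min(n, len(titles))     # avoid an error when fewer than n titles exist
    return rng.sample(titles, k)


# Hypothetical stand-in for the aggregator's TITLE column.
all_titles = ["headline {}".format(i) for i in range(100)]

first = sample_titles(all_titles, 15)
second = sample_titles(all_titles, 15)
print(first == second)
print(len(sample_titles(all_titles, 500)))
```

The size guard is worth having for the business-only subset, since asking for more rows than exist raises an error when sampling without replacement.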


-----

# From sheet 3


## Combine clean HBR headlines with sampling of aggregator headlines and add labels

In [None]:
hbr = pd.read_csv("clean_headlines.csv")
hbr.columns = ['Headline']
hbr['HBR'] = 'Yes'
print(hbr.shape)
hbr.head()

In [None]:
others = pd.read_csv("other_headlines.csv")
others.columns = ['Headline']
others['HBR'] = 'No'
print(others.shape)
others.head()

In [None]:
combined = pd.concat([hbr, others])
print(combined.shape)
combined.head()

## Save the data!
Save the combined dataframe to csv

In [None]:
combined.to_csv("combined_headlines.csv",index=False,header=True)

## Repeat for business headlines only

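The same labeling and combining steps as above, this time pairing the HBR text with the business-only sample from `other_biz_headlines.csv`.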

In [None]:
biz = pd.read_csv("other_biz_headlines.csv")
biz.columns = ['Headline']
biz['HBR'] = 'No'
print(biz.shape)
biz.head()

In [None]:
combined_biz = pd.concat([hbr, biz])
print(combined_biz.shape)
combined_biz.head()

In [None]:
combined_biz.to_csv("combined_biz_headlines.csv",index=False,header=True)


# (from sheet 4) Train Bag of Words Model

In [None]:
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

## Load data and split into train/test

In [None]:
#Load and check data
heds = pd.read_csv("combined_headlines.csv")
print(heds.shape)
print(heds.describe())
heds.head()

In [None]:
#Split data into train/test (70% train, the remaining 30% test)
train = heds.sample(frac=0.7, random_state=1)
test = heds.loc[~heds.index.isin(train.index)]

print("Train shape:")
print(train.shape)
print("Test shape:")
print(test.shape)

## Vectorize and train model

In [None]:
#Vectorize data

#Initialize countvectorizer and fit to headlines
vectorizer = CountVectorizer(analyzer='word',
                             stop_words = 'english',
                             ngram_range=(1,2),
                             max_features=1000)

train_counts = vectorizer.fit_transform(train['Headline'])
test_counts = vectorizer.transform(test['Headline'])

#Initialize tfidf transformer and fit to counts
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
print(train_tfidf.shape)

test_tfidf = tfidf_transformer.transform(test_counts)
print(test_tfidf.shape)
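
To make the TF-IDF step less of a black box, here is a hand-rolled sketch of the weighting on three invented, pre-tokenized headlines. It follows the formula scikit-learn documents for `TfidfTransformer` with its default settings: smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. It is an illustration only and skips the bigrams, stop word removal, and `max_features` cap used above.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized headlines.
docs = [
    ["managers", "build", "trust"],
    ["managers", "lead", "teams"],
    ["stocks", "fall"],
]

n_docs = len(docs)
# Document frequency: in how many headlines each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """TF-IDF weights for one document, scaled to unit L2 length."""
    counts = Counter(doc)
    raw = {
        term: count * (math.log((1.0 + n_docs) / (1.0 + doc_freq[term])) + 1.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = tfidf(docs[0])
print(sorted(weights.items(), key=lambda item: item[1]))
```

"managers" appears in two of the three headlines, so it ends up with a lower weight than "build" or "trust", which each appear in only one.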

In [None]:
#Train and evaluate linear model

#Train logistic regression model on training set
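#The L1 penalty needs a solver that supports it; recent scikit-learn defaults (lbfgs) do not
logit = linear_model.LogisticRegression(penalty='l1',C=1,solver='liblinear')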
logit = logit.fit(train_tfidf,train['HBR'])

#Evaluate training performance
logit_train_results = logit.predict(train_tfidf)
train_output = pd.DataFrame(data={"Headline":train['Headline'],
                                 "HBR":train["HBR"],
                                 "Prediction":logit_train_results})
print(train_output.head(30))

print("\nThe accuracy of the model on the training set is:")
print(accuracy_score(train_output['HBR'],train_output['Prediction']))

In [None]:
#Make predictions against test data and assess performance
logit_test_results = logit.predict(test_tfidf)
test_output = pd.DataFrame(data={"Headline":test['Headline'],
                                 "HBR":test["HBR"],
                                 "Prediction":logit_test_results})
print(test_output.head(30))

print("\nThe accuracy of the model on the test set is:")
print(accuracy_score(test_output['HBR'],test_output['Prediction']))
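
An accuracy number is easier to interpret next to a baseline. Since the two classes need not be the same size, one useful reference point is the accuracy of always guessing the more common label. The sketch below computes that on an invented label list. `majority_baseline_accuracy` is a hypothetical helper, and the 3-to-10 ratio is made up rather than taken from the data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / float(len(labels))


# Hypothetical label column: 3 HBR headlines for every 10 others.
example_labels = ["Yes"] * 3 + ["No"] * 10
baseline = majority_baseline_accuracy(example_labels)
print(round(baseline, 3))
```

With the real data, passing `test['HBR']` to the same function would give the figure the model's test accuracy needs to beat.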

## Save the trained model

In [None]:
with open('hbr_logit.pkl','wb') as f:
    pickle.dump(logit,f)

In [None]:
#Turn text into dataframe to test
text = "How to manage a company"   
text = [text]
df = pd.DataFrame(text,columns =['text'])    
    
#Vectorize text
tk_features = vectorizer.transform(df['text'])
tk_features = tfidf_transformer.transform(tk_features)
    
#Return prediction
result = logit.predict(tk_features)
print(result[0])

## Redo the same process, but using a pipeline

In [None]:
#Pipeline makes it easier to vectorize new data later on -- to save the vectorizer along with the model

vect = CountVectorizer(analyzer='word',
                             stop_words = 'english',
                             ngram_range=(1,2),
                             max_features=1000)

#Define pipeline with countvectorizer, tfidftransformer, and logit model
test_pipe = Pipeline([
     ('vectorizer', vect),
     ('tfidf', TfidfTransformer()),
     ('logit', linear_model.LogisticRegression(penalty='l1',C=1,solver='liblinear'))
 ])

#Fit logistic regression model with training data
test_pipe.fit(train['Headline'], train["HBR"]) 

#Make predictions against test data and assess performance
predictions = test_pipe.predict(test['Headline'])
test_output = pd.DataFrame(data={"Headline":test['Headline'],
                                 "HBR":test["HBR"],
                                 "Prediction":predictions})
print(test_output.head(30))

print("\nThe accuracy of the model on the test set is:")
print(accuracy_score(test_output['HBR'],test_output['Prediction']))


In [None]:
#Print the features with the lowest and highest coefficients
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = logit.coef_[0].argsort()
print('Smallest Coefs: \n{}\n'.format(feature_names[sorted_coef_index[:100]]))
print('Largest Coefs: \n{}\n'.format(feature_names[sorted_coef_index[:-101:-1]]))
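
The two slices in that cell are easy to misread, so here is the same indexing on a six-item toy example in plain Python. The feature names and coefficient values are invented. With labels `'No'` and `'Yes'`, scikit-learn treats the alphabetically later label as the positive class, so the largest coefficients should be the features pushing a headline toward `'Yes'`.

```python
# Hypothetical feature names and coefficients, for illustration only.
demo_features = ["stocks", "shares", "market", "team", "managers", "leadership"]
demo_coefs = [-2.0, -1.5, -0.3, 0.4, 1.8, 2.5]

# Plain-Python equivalent of numpy's argsort(): indices from smallest to largest value.
sorted_index = sorted(range(len(demo_coefs)), key=lambda i: demo_coefs[i])

n = 2
smallest = [demo_features[i] for i in sorted_index[:n]]
# [:-(n + 1):-1] walks backwards from the end and stops after n items.
largest = [demo_features[i] for i in sorted_index[:-(n + 1):-1]]

print(smallest)
print(largest)
```

So `[:100]` in the cell above picks the 100 most negative coefficients, and `[:-101:-1]` picks the 100 most positive, largest first.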


In [None]:
#Test new text (defined above) with fitted model

print(type(text))
print(type(test['Headline']))

print(test_pipe.predict(pd.Series(text)))

In [None]:
#Save pipeline for later

with open('hbr_pipeline.pkl','wb') as f:
    pickle.dump(test_pipe,f)
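
As a quick check that saving and loading round-trips cleanly, the sketch below pickles a stand-in object to a temporary file and reads it back. The dictionary and the file name are hypothetical; a fitted pipeline is handled the same way, though it generally has to be loaded with a compatible scikit-learn version, and pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted pipeline; any picklable Python object behaves the same way.
fitted_object = {"vocabulary": ["managers", "team"], "weights": [1.8, 0.4]}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example_model.pkl")

with open(path, "wb") as f:   # pickle needs binary mode for writing
    pickle.dump(fitted_object, f)

with open(path, "rb") as f:   # and binary mode for reading
    restored = pickle.load(f)

print(restored == fitted_object)
```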

# Repeat pipeline building and training with business heds only

In [None]:
#Load and check data
heds = pd.read_csv("combined_biz_headlines.csv")
print(heds.shape)
print(heds.describe())
heds.head()

In [None]:
#Split data into train/test (70% train, the remaining 30% test)
train = heds.sample(frac=0.7, random_state=1)
test = heds.loc[~heds.index.isin(train.index)]

print("Train shape:")
print(train.shape)
print("Test shape:")
print(test.shape)

In [None]:
#Pipeline makes it easier to vectorize new data later on -- to save the vectorizer along with the model

vect = CountVectorizer(analyzer='word',
                             stop_words = 'english',
                             ngram_range=(1,2),
                             max_features=1000)

#Define pipeline with countvectorizer, tfidftransformer, and logit model
test_pipe = Pipeline([
     ('vectorizer', vect),
     ('tfidf', TfidfTransformer()),
     ('logit', linear_model.LogisticRegression(penalty='l1',C=1,solver='liblinear'))
 ])

#Fit logistic regression model with training data
test_pipe.fit(train['Headline'], train["HBR"]) 

#Make predictions against test data and assess performance
predictions = test_pipe.predict(test['Headline'])
test_output = pd.DataFrame(data={"Headline":test['Headline'],
                                 "HBR":test["HBR"],
                                 "Prediction":predictions})
print(test_output.head(30))

print("\nThe accuracy of the model on the test set is:")
print(accuracy_score(test_output['HBR'],test_output['Prediction']))


In [None]:
#Print the features with the lowest and highest coefficients
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = logit.coef_[0].argsort()
print('Smallest Coefs: \n{}\n'.format(feature_names[sorted_coef_index[:100]]))
print('Largest Coefs: \n{}\n'.format(feature_names[sorted_coef_index[:-101:-1]]))


In [None]:
#Test new text (defined above) with fitted model

print(type(text))
print(type(test['Headline']))

print(test_pipe.predict(pd.Series(text)))

In [None]:
#Save pipeline for later

with open('hbr_biz_pipeline.pkl','wb') as f:
    pickle.dump(test_pipe,f)

# (from sheet 5) Apply Model to Text

In [None]:
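import pickle

import pandas as pd

In [None]:
# Define a function to take text and score it using the model

def load_pickle(path):
    #The pipelines were saved with pickle.dump, so read them back with pickle.load
    #(sklearn.externals.joblib is no longer available in recent scikit-learn releases)
    with open(path, 'rb') as f:
        return pickle.load(f)

def predictor(text):
    #Takes a string, returns (all-news model prediction, business-only model prediction)
    pipeline = load_pickle('hbr_pipeline.pkl')
    pipeline2 = load_pickle('hbr_biz_pipeline.pkl')
    return pipeline.predict(pd.Series([text]))[0], pipeline2.predict(pd.Series([text]))[0]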
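
Note that `predictor` reads both model files from disk on every call, which adds up when scoring a whole feed. One possible way around that, sketched here with Python 3's `functools.lru_cache`, is to cache the loaded object per path. `load_pipeline`, the `load_calls` counter, and the demo file are all hypothetical and exist only to show that the second call skips the disk read.

```python
import os
import pickle
import tempfile
from functools import lru_cache

load_calls = []  # records each real disk read, for demonstration only


@lru_cache(maxsize=None)
def load_pipeline(path):
    """Unpickle the object at `path` once and reuse it on later calls."""
    load_calls.append(path)
    with open(path, "rb") as f:
        return pickle.load(f)


# Write a small stand-in object so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(demo_path, "wb") as f:
    pickle.dump({"label": "Yes"}, f)

first = load_pipeline(demo_path)
second = load_pipeline(demo_path)
print(first is second)
print(len(load_calls))
```

Calling `load_pipeline.cache_clear()` forces a fresh read after a model file has been retrained and overwritten.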
    

In [None]:
#Try any text and see what the model says
tk = 'Tesla is having major supply chain problems'
predictor(tk)

In [None]:
#Categorize results from an RSS feed as HBR-relevant or not

import feedparser
feed = 'http://www.nber.org/rss/new.xml'
#feed = 'http://feeds.hbr.org/harvardbusiness'
#feed = 'https://news.ycombinator.com/rss'
#feed = 'https://theconversation.com/us/articles.atom'
feed = feedparser.parse(feed)
for a in feed.entries:
    title = a.title
    print(str(predictor(title)) + ": " + title)