For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be a series of novels, chapters, or articles; anything you'd like. It just has to have multiple entries of varying characteristics. At least 100 should be good. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic or from the same branch of literature, something to make classification a bit more difficult than obviously different subjects).

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Then use those features to build models that attempt to classify your texts by author. Try different permutations of unsupervised and supervised techniques to see which combinations have the best performance.

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.

Your end result should be a write up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write up, and remember to include visuals to help tell your story!

In [98]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
import spacy
from collections import Counter
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import silhouette_score
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import SpectralClustering
from sklearn.cluster import AffinityPropagation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

In [31]:
# Loading the dataset.
news = pd.read_csv('C:\\Users\\kenne\\Desktop\\uci-news-aggregator.csv', encoding='utf-8')

I am using the News Aggregator Dataset, which contains headlines and categories of over 400,000 news stories scraped from the web between March 10, 2014 and August 10, 2014.

The news categories in this dataset are business, science and technology, entertainment, and health.

The columns included in this dataset are:

ID : the numeric ID of the article
TITLE : the headline of the article
URL : the URL of the article
PUBLISHER : the publisher of the article
CATEGORY : the category of the news item; one of: b (business), t (science and technology), e (entertainment), m (health)
STORY : alphanumeric ID of the news story that the article discusses
HOSTNAME : hostname where the article was posted
TIMESTAMP : approximate timestamp of the article's publication, given in Unix time (seconds since midnight on Jan 1, 1970)

This dataset comes from the UCI Machine Learning Repository. Any publications that use this data should cite the repository as follows:

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

In [32]:
news.describe(include='all')

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
count,422419.0,422419,422419,422417,422419,422419,422419,422419.0
unique,,406453,422223,10985,4,7230,11236,
top,,The article requested cannot be found! Please ...,http://www.japantimes.co.jp/news/2014/04/18/wo...,Reuters,e,dubwcJArLL_qAKML5LGPLiunKzNLM,in.reuters.com,
freq,,145,5,3902,152469,450,2877,
mean,211536.764594,,,,,,,1400445000000.0
std,122102.839707,,,,,,,3733062000.0
min,1.0,,,,,,,1394470000000.0
25%,105801.5,,,,,,,1397350000000.0
50%,211655.0,,,,,,,1399990000000.0
75%,317273.5,,,,,,,1403770000000.0


In [33]:
news.isnull().sum()

ID           0
TITLE        0
URL          0
PUBLISHER    2
CATEGORY     0
STORY        0
HOSTNAME     0
TIMESTAMP    0
dtype: int64

In [34]:
news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 422419 entries, 0 to 422418
Data columns (total 8 columns):
ID           422419 non-null int64
TITLE        422419 non-null object
URL          422419 non-null object
PUBLISHER    422417 non-null object
CATEGORY     422419 non-null object
STORY        422419 non-null object
HOSTNAME     422419 non-null object
TIMESTAMP    422419 non-null float64
dtypes: float64(1), int64(1), object(6)
memory usage: 25.8+ MB


In [35]:
news.head()

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470000000.0
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470000000.0
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470000000.0
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470000000.0
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470000000.0


In [36]:
# We're going to remove all columns except for TITLE and CATEGORY for the purposes of our research.
news.drop(['ID', 'URL', 'PUBLISHER', 'STORY', 'HOSTNAME', 'TIMESTAMP'], axis=1, inplace=True)

In [37]:
news.head()

Unnamed: 0,TITLE,CATEGORY
0,"Fed official says weak data caused by weather,...",b
1,Fed's Charles Plosser sees high bar for change...,b
2,US open: Stocks fall after Fed official hints ...,b
3,"Fed risks falling 'behind the curve', Charles ...",b
4,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,b


In [38]:
news.groupby('CATEGORY')['TITLE'].count()

CATEGORY
b    115967
e    152469
m     45639
t    108344
Name: TITLE, dtype: int64

Note that the 'm' category has significantly fewer samples and the 'e' category has considerably more.

In [39]:
# Take a sample of 1000 rows from each category to balance the samples.
news_sample = news.groupby('CATEGORY').apply(lambda x: x.sample(1000))

In [42]:
news_sample.groupby('CATEGORY')['TITLE'].count()

Defaulting to column, but this will raise an ambiguity error in a future version
  """Entry point for launching an IPython kernel.


CATEGORY
b    1000
e    1000
m    1000
t    1000
Name: TITLE, dtype: int64

In [43]:
news_sample.describe()

Unnamed: 0,TITLE,CATEGORY
count,4000,4000
unique,3994,4
top,Malaysia Moves to Contain Deadly MERS Virus,m
freq,2,1000


In [44]:
news_sample.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,TITLE,CATEGORY
CATEGORY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
t,152576,Think Samsung's Fingerprint Scanner Is Foolpro...,t
t,122500,"VW Applies Subtle Updates to 2015 Jetta, Gives...",t
t,322723,Internet TV marches on in wake of Aereo decision,t
t,213930,Google I/O 2014 schedule is now available,t
t,277776,NASA uses laser to beam video from space,t


In [45]:
news_sample.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4000 entries, (b, 334392) to (t, 277776)
Data columns (total 2 columns):
TITLE       4000 non-null object
CATEGORY    4000 non-null object
dtypes: object(2)
memory usage: 105.6+ KB


In [46]:
# Utility function to clean text.
def text_cleaner(text):
    
    # spaCy does not recognize the double dash.
    text = re.sub(r'--',' ',text)
    # Removing all numbers.
    text = re.sub(r'\d','',text)
    # Removing punctuation that is not word-internal (raw strings avoid invalid-escape warnings).
    text = re.sub(r'\s\W',' ',text)
    text = re.sub(r'\W\s',' ',text)
    # Making sure I didn't introduce any double spaces.
    text = re.sub(r'\s+',' ',text)
    # Converting all text to lower case.
    text = text.lower()
    # Removing extra whitespace.
    text = ' '.join(text.split())
    
    # Placeholder to remove everything not a letter (for consideration).
    #text = re.sub('[^a-zA-Z]',' ',text)
    
    return text

In [47]:
# Cleaning the data.
news_sample['TITLE'] = news_sample.TITLE.map(lambda x: text_cleaner(str(x)))

news_sample.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,TITLE,CATEGORY
CATEGORY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
b,334392,full recap daniel bryan injury update and bo d...,b
b,125145,housing starts rise slightly in march,b
b,9047,update ackman accuses herbalife of breaking la...,b
b,116163,tax day for those expecting a refund late pena...,b
b,309862,s.f to fight metered parking the hogging economy',b


In [48]:
news_sample.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4000 entries, (b, 334392) to (t, 277776)
Data columns (total 2 columns):
TITLE       4000 non-null object
CATEGORY    4000 non-null object
dtypes: object(2)
memory usage: 105.6+ KB


In [50]:
# Defining the features and the outcome variables.
X = news_sample['TITLE']
y = news_sample['CATEGORY']

# Splitting the data into train and test sets (reserving 25% of my corpus as a test set).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Term Frequency-Inverse Document Frequency (Tf-idf) vectorization converts human-readable text into numeric form and is fundamental to many unsupervised Natural Language Processing (NLP) tasks. Tf-idf weights each word by how often it appears in a headline and down-weights words that appear in many headlines, producing one vector per headline.
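
To make the weighting concrete, here is a minimal sketch of the smoothed formula that scikit-learn's TfidfVectorizer applies by default: raw term count times idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three toy headlines and the helper name `tfidf_vectors` are illustrative assumptions, not part of the dataset or the library, and the stop-word, min_df and max_df filters are left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw count times smoothed idf (the smooth_idf=True behaviour).
        weights = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, count in tf.items()}
        # L2 normalization so headline length does not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Toy headlines, invented for illustration.
toy_headlines = [
    "fed holds rates steady",
    "mortgage rates rise",
    "google unveils email tool",
]
vectors = tfidf_vectors(toy_headlines)
for headline, vec in zip(toy_headlines, vectors):
    print(headline, {t: round(w, 3) for t, w in vec.items()})
```

In the first toy headline, "rates" receives a lower weight than "fed" because it also appears in the second headline.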

In [53]:
# Tf-idf vectorization to turn our headlines into vectors.
vectorizer = TfidfVectorizer(max_df=0.5,               # Drops the words that appear in more than half the headlines.
                             min_df=2,                 # Uses only words that appear in at least two headlines.
                             stop_words='english',     # Drops English stop words.
                             lowercase=True,           # Converting everything to lower case (redundant after text_cleaner).
                             use_idf=True,             # Using inverse document frequencies in our weighting.
                             norm='l2',                # Scales each headline vector to unit length so longer and shorter headlines are treated equally.
                             smooth_idf=True           # Adds 1 to all document frequencies to prevent divide-by-zero errors.
                            )

# Applying the vectorizer.
X_tfidf = vectorizer.fit_transform(X)
print('Number of features: {}'.format(X_tfidf.get_shape()[1]))

# Splitting into train and test sets (reserving 25% of my corpus as a test set).
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)

# Convert the tf-idf matrix to CSR format for efficient row access.
X_train_tfidf_csr = X_train_tfidf.tocsr()

# Number of headings.
n = X_train_tfidf_csr.shape[0]

# A list of dictionaries, one per heading.
tfidf_bypara = [{} for _ in range(0,n)]

# List of features.
terms = vectorizer.get_feature_names()

# For each headline, lists the feature words and their tf-idf scores.
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

# Only nonzero entries are stored, so each dictionary holds just the vocabulary terms present in that headline.
# Note: X_train[593] and tfidf_bypara[0] refer to different headlines, so the sentence and vector printed
# here do not correspond; X_train.iloc[0] is the headline that matches tfidf_bypara[0].
print('Original sentence:', X_train[593])
print('Tf_idf vector:', tfidf_bypara[0])

Number of features: 3683
Original sentence: skin cancer rates on the rise in canada
Tf_idf vector: {'email': 0.4966881434531527, 'google': 0.29327234676946734, 'unveils': 0.43315123838080677, 'messages': 0.4826728826188848, 'tool': 0.4966881434531527}


In [54]:
# Normalize the data (rows are already unit length because the vectorizer used the l2 norm, so this is a safeguard).
X_norm = normalize(X_train_tfidf)

Tokenization is the process of breaking up text into individual meaningful pieces, such as words and punctuation. Note that we removed most punctuation earlier because we believed it wouldn't provide much value.

In [55]:
# Loading spaCy's English module.
nlp = spacy.load('en_core_web_sm')

In [57]:
# Scanning each row for tokens. Also capturing the length of each heading and identifying the different parts of speech.
X_train_words = []

for row in X_train:
    row_doc = nlp(row)
    heading_len = len(row_doc) 
    advs = 0
    verb = 0
    noun = 0
    adj = 0
    for token in row_doc:
        if token.pos_ == 'ADV':
            advs +=1
        elif token.pos_ == 'VERB':
            verb +=1
        elif token.pos_ == 'NOUN':
            noun +=1
        elif token.pos_ == 'ADJ':
            adj +=1

    X_train_words.append([row_doc, advs, verb, noun, adj, heading_len])

In [58]:
y_train_new = y_train.reset_index(drop=True)

In [59]:
# Capturing generated features in new dataframe with categories.
news_bow = pd.DataFrame(data=X_train_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'heading_length'])

news_bow = pd.concat([news_bow, y_train_new], ignore_index=False, axis=1)

In [60]:
news_bow.head()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,heading_length,CATEGORY
0,"(google, unveils, tool, to, encrypt, email, me...",1,1,4,0,7,t
1,"(pharrell, williams, named, coach, ahead, of, ...",1,1,5,0,12,e
2,"(current, mortgage, rates, at, hsbc, citizens,...",0,0,4,2,8,b
3,"(trade, -, ideas, navient, navi, is, today, 's...",0,1,8,1,13,b
4,"(game, of, thrones, season, episode, recap, to...",0,0,6,1,8,e


In [61]:
news_bow.tail()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,heading_length,CATEGORY
2995,"(outlander, review, adventure, romance, and, k...",0,2,6,0,13,e
2996,"(', an, act, of, war, ', north, korea, issues,...",0,1,7,2,16,e
2997,"(asian, shares, gain, as, chinese, manufacturi...",0,2,2,2,7,b
2998,"(spacex, resupply, mission, set, for, flight, ...",0,1,6,0,9,t
2999,"(samsung, releases, new, mini, galaxy, s)",0,1,3,2,6,t


In [62]:
# Including the Tf-idf vectors created earlier.
X_norm_df = pd.DataFrame(data=X_norm.toarray())
news_tfidf_bow = pd.concat([news_bow, X_norm_df], ignore_index=False, axis=1)

In [63]:
news_tfidf_bow.head()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,heading_length,CATEGORY,0,1,2,...,3673,3674,3675,3676,3677,3678,3679,3680,3681,3682
0,"(google, unveils, tool, to, encrypt, email, me...",1,1,4,0,7,t,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"(pharrell, williams, named, coach, ahead, of, ...",1,1,5,0,12,e,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"(current, mortgage, rates, at, hsbc, citizens,...",0,0,4,2,8,b,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"(trade, -, ideas, navient, navi, is, today, 's...",0,1,8,1,13,b,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"(game, of, thrones, season, episode, recap, to...",0,0,6,1,8,e,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
news_tfidf_bow.tail()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,heading_length,CATEGORY,0,1,2,...,3673,3674,3675,3676,3677,3678,3679,3680,3681,3682
2995,"(outlander, review, adventure, romance, and, k...",0,2,6,0,13,e,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2996,"(', an, act, of, war, ', north, korea, issues,...",0,1,7,2,16,e,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2997,"(asian, shares, gain, as, chinese, manufacturi...",0,2,2,2,7,b,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2998,"(spacex, resupply, mission, set, for, flight, ...",0,1,6,0,9,t,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2999,"(samsung, releases, new, mini, galaxy, s)",0,1,3,2,6,t,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [67]:
# Feature selection.
features = news_tfidf_bow.drop(['BOW', 'CATEGORY'], axis=1)

# Categories.
y2_train = news_tfidf_bow.CATEGORY

In [68]:
# Since there is a considerable number of features now, I will use k best with chi-squared to select the 400 best features.
kbest = SelectKBest(chi2, k=400)

X2_train = kbest.fit_transform(features, y2_train)

K-Means clustering groups data points into k clusters so that points within a cluster are close to one another. It uses a cost function called the inertia (the sum of squared distances from each point to its cluster mean), and the algorithm tries to choose means (called centroids) that minimize the inertia.
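
As a rough illustration of that cost function, the sketch below runs one assign-and-update step of the K-Means loop on six invented 2-D points and compares the inertia before and after. The function names and the starting centroids are assumptions made for this example; scikit-learn's KMeans repeats these steps until convergence and uses the smarter k-means++ initialization.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign_and_inertia(points, centroids):
    """Assign each point to its nearest centroid; return (labels, inertia)."""
    labels = []
    inertia = 0.0
    for p in points:
        distances = [squared_distance(p, c) for c in centroids]
        nearest = distances.index(min(distances))
        labels.append(nearest)
        inertia += distances[nearest]
    return labels, inertia

def update_centroids(points, labels, k):
    """Move each centroid to the mean of its members (assumes no empty cluster)."""
    centroids = []
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return centroids

# Toy data, invented for illustration: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # arbitrary starting guesses

labels, inertia_before = assign_and_inertia(points, centroids)
centroids = update_centroids(points, labels, k=2)
labels, inertia_after = assign_and_inertia(points, centroids)
print(labels, inertia_before, inertia_after)
```

Moving the centroids to the cluster means lowers the inertia, which is the quantity the algorithm keeps reducing until the assignments stop changing.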

In [70]:
# I will choose 4 clusters since we have labeled data with known classes (4), otherwise I would have chosen at random.
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
y_pred = kmeans.fit_predict(X2_train)

pd.crosstab(y2_train, y_pred)

col_0,0,1,2,3
CATEGORY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
b,88,245,225,175
e,190,196,144,213
m,91,257,247,164
t,127,283,178,177


In [71]:
# Displaying adjusted rand index and silhouette coefficient.
print('Adjusted Rand Score: {}'.format(adjusted_rand_score(y2_train, y_pred)))
print('Silhouette Score: {}'.format(silhouette_score(X2_train, y_pred, metric='euclidean')))

Adjusted Rand Score: 0.009740383806655357
Silhouette Score: 0.22921928734290828


The Adjusted Rand Score of roughly 0.01 indicates that the K-Means cluster assignments agree with the true categories barely better than random assignment would. A different algorithm may do better.

The Silhouette Score of roughly 0.23 (on a scale from -1 to 1) indicates that the K-Means clustering solution is somewhat reliable at grouping data points that are more similar to one another than they are to data points in other clusters. This is good!
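
For intuition about the Adjusted Rand Score discussed above, here is a small pure-Python sketch of the pair-counting formula. The function name `adjusted_rand_index` and the toy labels are assumptions for illustration; in the notebook itself the value comes from sklearn's `adjusted_rand_score`.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: 1.0 for identical groupings, near 0 for chance-level agreement."""
    n = len(labels_true)
    # Contingency counts: how many points share each (category, cluster) pair.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

# Toy labels, invented for illustration.
categories = ['b', 'b', 'e', 'e', 'm', 'm', 't', 't']
relabelled = [2, 2, 0, 0, 3, 3, 1, 1]  # same grouping, different cluster ids
mixed = [0, 1, 0, 1, 0, 1, 0, 1]       # every cluster cuts across every category

ari_relabelled = adjusted_rand_index(categories, relabelled)
ari_mixed = adjusted_rand_index(categories, mixed)
print(ari_relabelled, ari_mixed)
```

The score ignores what the cluster ids are called, so the relabelled grouping scores 1.0, while a grouping that cuts across every category lands at or below zero. A crosstab in which each cluster draws roughly equally from all four categories is the pattern that produces scores near zero.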

Rather than reducing dimensions using PCA on the data, I am going to use MiniBatchKMeans in sklearn, which randomly samples subsets of the training data in each iteration.

In [73]:
# I will use the same number of k clusters.
minibatchkmeans = MiniBatchKMeans(n_clusters=4, init='k-means++', random_state=42)
y_pred2 = minibatchkmeans.fit_predict(X2_train)

pd.crosstab(y2_train, y_pred2)

col_0,0,1,2,3
CATEGORY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
b,209,136,122,266
e,121,241,154,227
m,221,122,156,260
t,163,150,184,268


In [75]:
# Displaying adjusted rand index and silhouette coefficient.
print('Adjusted Rand Score: {}'.format(adjusted_rand_score(y2_train, y_pred2)))
print('Silhouette Score: {}'.format(silhouette_score(X2_train, y_pred2, metric='euclidean')))

Adjusted Rand Score: 0.009429119729525967
Silhouette Score: 0.22255351693037823


The Mini Batch K-Means clustering model is slightly worse than K-Means clustering across both scores.

Spectral clustering (and affinity propagation, which I will use next) is based on quantifying the similarity between pairs of data points. For text, headlines whose words often appear in the same context are one kind of "similarity" that these algorithms can potentially detect.
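
To show what "similarity between pairs of data points" looks like numerically, the sketch below builds the two pairwise matrices that scikit-learn uses by default: an RBF kernel, exp(-gamma * squared distance), for SpectralClustering, and the negative squared Euclidean distance for AffinityPropagation. The helper names and the three toy points are assumptions made for this example.

```python
import math

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def rbf_affinity(points, gamma=1.0):
    """RBF kernel matrix: 1.0 for identical points, approaching 0 for distant ones."""
    return [[math.exp(-gamma * squared_distance(p, q)) for q in points]
            for p in points]

def negative_squared_distance(points):
    """Similarity used by affinity propagation: 0 for identical points, more negative when farther apart."""
    return [[-squared_distance(p, q) for q in points] for p in points]

# Toy points, invented for illustration: two close neighbours and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]

affinity = rbf_affinity(points)
similarity = negative_squared_distance(points)
for row in affinity:
    print([round(value, 4) for value in row])
```

Both algorithms work from a matrix like this rather than from the raw feature columns, so the choice of similarity measure largely determines which points end up grouped together.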

In [77]:
# Keeping consistency with number of clusters since we know we're looking for four clusters.
n_clusters = 4

# Declare and fit the model.
sc = SpectralClustering(n_clusters=n_clusters)
y_pred3 = sc.fit_predict(X2_train)

pd.crosstab(y2_train, y_pred3)

col_0,0,1,2,3
CATEGORY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
b,154,340,76,163
e,126,344,54,219
m,167,353,73,166
t,149,382,56,178


In [78]:
# Displaying adjusted rand index and silhouette coefficient.
print('Adjusted Rand Score: {}'.format(adjusted_rand_score(y2_train, y_pred3)))
print('Silhouette Score: {}'.format(silhouette_score(X2_train, y_pred3, metric='euclidean')))

Adjusted Rand Score: 0.0014068616971487622
Silhouette Score: 0.02310685574815792


The Spectral clustering model performed worse than the K-Means clustering solution on both scores (Adjusted Rand Score of 0.001 vs. 0.010, Silhouette Score of 0.023 vs. 0.229).

As noted, I am going to attempt one more clustering model, Affinity Propagation.

Affinity Propagation is based on defining exemplars for data points. An exemplar is a data point similar enough to another data point that one could conceivably be represented by the other – they convey largely the same information. Affinity Propagation chooses the number of clusters based on the data.

In [79]:
# Declare and fit the model. Note, you can provide arguments to the model but I have chosen not to.
af = AffinityPropagation()
y_pred4 = af.fit_predict(X2_train)
print('Done')

# Pull the number of clusters and cluster assignments for each data point.
cluster_centers_indices = af.cluster_centers_indices_
n_clusters = len(cluster_centers_indices)
labels = af.labels_

print('Number of estimated clusters: {}'.format(n_clusters))

pd.crosstab(y2_train, y_pred4)

Done
Number of estimated clusters: 100


col_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
CATEGORY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
b,6,13,9,6,0,9,8,12,11,7,...,3,16,7,6,2,9,1,11,3,5
e,10,8,7,4,11,9,4,5,6,12,...,5,7,13,11,7,6,3,7,2,1
m,4,17,10,2,2,12,6,10,9,3,...,5,9,6,4,2,7,2,11,3,4
t,8,13,8,7,6,13,7,7,9,5,...,2,5,10,7,5,11,0,8,1,4


In [80]:
# Displaying adjusted rand index and silhouette coefficient.
print('Adjusted Rand Score: {}'.format(adjusted_rand_score(y2_train, y_pred4)))
print('Silhouette Score: {}'.format(silhouette_score(X2_train, y_pred4, metric='euclidean')))

Adjusted Rand Score: 0.0007045026862398638
Silhouette Score: 0.14098292111773575


The Affinity Propagation model has the lowest Adjusted Rand Score of all the clustering solutions. With 100 estimated clusters for only four categories, it also splits the data far too finely to be useful here.

All of the clustering solutions attempted (K-Means, Mini Batch K-Means, Spectral and Affinity Propagation) performed poorly at identifying the category of a news article (business, science and technology, entertainment, health) given only its headline.

This could be because the headlines alone do not offer enough information to distinguish between the different categories. We could look for improvement by using text from the body of the article, such as the first paragraph.

Of all the clustering models attempted, K-Means performed the best.

In [84]:
# Including K-Means clustering predictions as feature in our training set for Supervised Learning.
X2_train_cluster = pd.DataFrame(X2_train)
X2_train_cluster['kmeans_cluster'] = y_pred

Random Forest can be used for both classification and regression problems. Instead of building one decision tree, you build several, and each tree in the forest gets a vote on the outcome for a given observation. As a classifier, the most popular outcome (the mode) is returned.
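
The voting step can be sketched in a few lines. The per-tree predictions below are invented for illustration, and `majority_vote` is a hypothetical helper rather than a scikit-learn function.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions (the mode)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single headline.
votes = ['b', 't', 'b', 'm', 'b']
forest_prediction = majority_vote(votes)
print(forest_prediction)  # prints b
```

scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than counting hard votes, but the intuition is the same: no single tree decides the outcome.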

In [92]:
# Standard Random Forest Classifier.
rfc = RandomForestClassifier()
train = rfc.fit(X2_train, y2_train)
rfc_scores = cross_val_score(rfc, X2_train, y_train, cv=5)

print('\nTraining set score without clustering: {}'.format(rfc_scores.mean()))


Training set score without clustering: 0.6206722025361257


In [93]:
# Random Forest Classifier with clustering.
rfc_cluster = RandomForestClassifier()
train_cluster = rfc_cluster.fit(X2_train_cluster, y2_train)
rfc_cluster_scores = cross_val_score(rfc_cluster, X2_train_cluster, y_train, cv=5)

print('Training set score with clustering: {}'.format(rfc_cluster_scores.mean()))

Training set score with clustering: 0.6263655990646054


The Random Forest Classifier with clustering performed slightly better, but only by a small margin.

Since our outcome is categorical rather than continuous and we are interested in predicting the probability of an outcome, I am going to use Logistic Regression as a classifier.

In [94]:
# Standard Logistic Regression classifier.
lr = LogisticRegression()
train = lr.fit(X2_train, y2_train)
lr_scores = cross_val_score(lr, X2_train, y_train, cv=5)

print('\nTraining set score without clustering: {}'.format(lr_scores.mean()))


Training set score without clustering: 0.734989367909894


In [96]:
# Logistic Regression classifier with clustering.
lr_cluster = LogisticRegression()
train_cluster = lr_cluster.fit(X2_train_cluster, y2_train)
lr_cluster_scores = cross_val_score(lr_cluster, X2_train_cluster, y_train, cv=5)

print('Training set score with clustering: {}'.format(lr_cluster_scores.mean()))

Training set score with clustering: 0.7339926956876255


The Logistic Regression classifier without clustering performed best, but again by a small margin. Both Logistic Regression solutions, however, performed better than their Random Forest counterparts.

Gradient boosting can work with any combination of loss function and model type, as long as we can calculate the derivatives of the loss function with respect to the model's predictions. Most often, however, gradient boosting uses decision trees and minimizes either the squared residuals (regression trees) or the negative log-likelihood (classification trees).
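
The residual-fitting idea is easiest to see in the regression case. The sketch below is a deliberately simplified version that boosts one-split decision stumps under squared loss, where the negative gradient is just the residual. The data, the function names and the learning rate are assumptions made for this example; it is not how GradientBoostingClassifier is implemented, which uses deeper trees and a log-likelihood loss.

```python
LEARNING_RATE = 0.5  # illustrative value, not tuned

def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    # Skip the largest x so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20):
    """Repeatedly fit a stump to the current residuals and add it to the model."""
    base = sum(ys) / len(ys)
    stumps, losses = [], []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + LEARNING_RATE * stump(x) for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, losses

def predict(x, base, stumps):
    return base + LEARNING_RATE * sum(stump(x) for stump in stumps)

# Toy data, invented for illustration: a step from about 1 to about 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

base, stumps, losses = boost(xs, ys)
print(losses[0], losses[-1])
```

Each round corrects part of what the previous rounds got wrong, so the training loss shrinks as stumps are added.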

In [99]:
# Standard Gradient Boosting classifier.
clf = GradientBoostingClassifier()
train = clf.fit(X2_train, y2_train)
clf_scores = cross_val_score(clf, X2_train, y_train, cv=5)

print('\nTraining set score without clustering: {}'.format(clf_scores.mean()))


Training set score without clustering: 0.7040098095404408


In [100]:
# Gradient Boosting classifier with clustering.
clf_cluster = GradientBoostingClassifier()
train_cluster = clf_cluster.fit(X2_train_cluster, y2_train)
clf_cluster_scores = cross_val_score(clf_cluster, X2_train_cluster, y_train, cv=5)

print('Training set score with clustering: {}'.format(clf_cluster_scores.mean()))

Training set score with clustering: 0.7046698169479307


The Gradient Boosting classifier with clustering performed best, but only by the slightest margin once again. Each Gradient Boosting classifier produced a lower accuracy score than its Logistic Regression counterpart. It did, however, outperform both Random Forest classifiers, which I expected, since in many cases gradient boosted decision trees perform better than random forests.

In the end, among the supervised learning classification solutions, Logistic Regression performed the best. Now I will pivot from the training set to the test set, using the models that performed best among the supervised and unsupervised solutions attempted.

In [101]:
# Normalize the data.
X_test_norm = normalize(X_test_tfidf)

In [102]:
# Scanning each row for tokens. Also capturing the length of each heading and identifying the different parts of speech.
X_test_words = []

for row in X_test:
    row_doc = nlp(row)
    heading_len = len(row_doc) 
    advs = 0
    verb = 0
    noun = 0
    adj = 0
    for token in row_doc:
        if token.pos_ == 'ADV':
            advs +=1
        elif token.pos_ == 'VERB':
            verb +=1
        elif token.pos_ == 'NOUN':
            noun +=1
        elif token.pos_ == 'ADJ':
            adj +=1

    X_test_words.append([row_doc, advs, verb, noun, adj, heading_len])

In [103]:
y_test_new = y_test.reset_index(drop=True)

In [104]:
# Capturing generated features in new dataframe with categories.
news_bow_test = pd.DataFrame(data=X_test_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'heading_length'])

news_bow_test = pd.concat([news_bow_test, y_test_new], ignore_index=False, axis=1)

In [105]:
news_bow_test.head()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,heading_length,CATEGORY
0,"(kaepernick, denies, wrongdoing, details, star...",0,3,5,0,10,b
1,"(apple, agrees, to, conditional, million, eboo...",0,1,3,2,8,t
2,"(canadian, stocks, tumble, for, second, day, a...",0,1,3,2,11,b
3,"(today, in, apis, instagram, looking, to, repl...",0,2,4,3,12,t
4,"(half, of, us, adults, eligible, for, statins,...",0,0,6,1,11,m


In [106]:
news_bow_test.tail()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,heading_length,CATEGORY
995,"(a, blood, test, that, can, predict, alzheimer...",0,2,4,1,10,m
996,"(new, home, sales, on, the, rise)",0,0,3,1,6,b
997,"(fashion, disaster, kim, kardashian, flaunts, ...",0,1,5,1,12,e
998,"(burger, king, 's, gay, pride, whopper, photo,...",0,0,8,0,10,b
999,"(the, hilarious, apple, song, you, need, to, h...",0,2,3,1,9,t


In [107]:
# Including the Tf-idf vectors created earlier.
X_test_norm_df = pd.DataFrame(data=X_test_norm.toarray())
news_tfidf_bow_test = pd.concat([news_bow_test, X_test_norm_df], ignore_index=False, axis=1)

In [108]:
news_tfidf_bow_test.head()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,heading_length,CATEGORY,0,1,2,...,3673,3674,3675,3676,3677,3678,3679,3680,3681,3682
0,"(kaepernick, denies, wrongdoing, details, star...",0,3,5,0,10,b,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"(apple, agrees, to, conditional, million, eboo...",0,1,3,2,8,t,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"(canadian, stocks, tumble, for, second, day, a...",0,1,3,2,11,b,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"(today, in, apis, instagram, looking, to, repl...",0,2,4,3,12,t,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"(half, of, us, adults, eligible, for, statins,...",0,0,6,1,11,m,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [109]:
news_tfidf_bow_test.tail()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,heading_length,CATEGORY,0,1,2,...,3673,3674,3675,3676,3677,3678,3679,3680,3681,3682
995,"(a, blood, test, that, can, predict, alzheimer...",0,2,4,1,10,m,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
996,"(new, home, sales, on, the, rise)",0,0,3,1,6,b,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
997,"(fashion, disaster, kim, kardashian, flaunts, ...",0,1,5,1,12,e,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
998,"(burger, king, 's, gay, pride, whopper, photo,...",0,0,8,0,10,b,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
999,"(the, hilarious, apple, song, you, need, to, h...",0,2,3,1,9,t,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [110]:
# Feature selection.
features_test = news_tfidf_bow_test.drop(['BOW', 'CATEGORY'], axis=1)

# Categories.
y2_test = news_tfidf_bow_test.CATEGORY

In [111]:
# As before, using k best with chi-squared to select the 400 best features.
kbest_test = SelectKBest(chi2, k=400)

X2_test = kbest_test.fit_transform(features_test, y2_test)

In [132]:
# K-Means clustering on test set with 4 clusters.
kmeans_test = KMeans(n_clusters=4, init='k-means++', random_state=42)
y_pred_test = kmeans_test.fit_predict(X2_test)

pd.crosstab(y2_test, y_pred_test)

col_0,0,1,2,3
CATEGORY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
b,82,56,95,34
e,77,45,72,63
m,80,35,97,29
t,87,45,83,20


In [133]:
# Displaying adjusted rand index and silhouette coefficient.
print('Adjusted Rand Score: {}'.format(adjusted_rand_score(y2_test, y_pred_test)))
print('Silhouette Score: {}'.format(silhouette_score(X2_test, y_pred_test, metric='euclidean')))

Adjusted Rand Score: 0.0051089547521292605
Silhouette Score: 0.274217861996886


The Adjusted Rand Score fell by roughly 0.005 on the test set, while the Silhouette Score improved by nearly 0.05. Clustering within the train set was fairly evenly distributed among clusters 1, 2 and 3, with cluster 0 falling a bit behind. Within the test set, clustering is relatively evenly distributed between clusters 0 and 2, while clusters 1 and 3 fall behind.

In [134]:
# Standard Logistic Regression classifier on test set.
lr_test = LogisticRegression()
train_test = lr_test.fit(X2_test, y2_test)
lr_scores_test = cross_val_score(lr_test, X2_test, y_test, cv=5)

print('\nTesting set score without clustering: {}'.format(lr_scores_test.mean()))


Testing set score without clustering: 0.6479458521116493


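The standard Logistic Regression classifier scored noticeably lower on the test set than on the train set (0.648 vs. 0.735). Note that this score comes from five-fold cross-validation within the 1,000-headline test set, so each fold was trained on far fewer headlines than in the training run.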
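
The test-set cells above refit the feature selector and cross-validate a fresh classifier inside the holdout data, which measures something different from how the model trained on the training set generalizes. A minimal sketch of the alternative, assuming a scikit-learn Pipeline so that every step is fit on training headlines only and then applied unchanged to the holdout headlines. The toy headlines, labels and k=5 are invented for illustration, and the spaCy part-of-speech counts are left out to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, invented for illustration (two headlines per category).
train_headlines = [
    "stocks fall as fed hints at rate rise",
    "bank profits climb on mortgage lending",
    "new movie tops weekend box office",
    "singer announces summer concert tour",
    "study links diet to heart disease risk",
    "doctors warn of rising flu cases",
    "google unveils new phone software",
    "nasa tests laser link from space",
]
train_labels = ['b', 'b', 'e', 'e', 'm', 'm', 't', 't']

# Toy holdout data, never seen during fitting.
test_headlines = [
    "fed official says rate rise is coming",
    "new flu study worries doctors",
]
test_labels = ['b', 'm']

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),  # vocabulary learned from train only
    ('kbest', SelectKBest(chi2, k=5)),                 # features chosen from train only
    ('clf', LogisticRegression()),
])

pipeline.fit(train_headlines, train_labels)
predictions = pipeline.predict(test_headlines)
holdout_accuracy = pipeline.score(test_headlines, test_labels)
print(predictions, holdout_accuracy)
```

Because the vectorizer and selector are fit once, the holdout headlines are described by exactly the same columns the classifier was trained on, and the resulting accuracy is directly comparable to the cross-validated training score.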

In conclusion ...

As shown throughout this notebook, classifying articles into their respective categories using only the headline has proven to be difficult and unreliable. I started by attempting different clustering methods (K-Means, Mini Batch K-Means, Spectral and Affinity Propagation), but all four performed poorly. From there, I switched to different supervised learning classifiers: Random Forest, Logistic Regression and Gradient Boosting. Logistic Regression was the best performer and showed some promise but still left significant room for improvement.

The challenge with classifying articles into categories based on headlines alone may be due to the lack of text within the headlines themselves. Headlines, by their very nature, are meant to be short, quick attempts to grab your attention and hopefully get you to read the underlying article. I suggested earlier that one could expect more success by pulling a certain amount of text from the body of the article itself. Beyond the lack of text, there could also be a lack of diversity in the headlines, although I find that explanation less likely.