# Unsupervised learning Capstone (name TBA)
Author: Matthew Huh
    
## About the Data

Collection of 142,570 articles from 15 different publications...

## Research Question

...

## Overview

...

## Packages

In [1]:
# Basic imports
import os
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Clustering packages
import sklearn.cluster as cluster
from sklearn.cluster import KMeans
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.cluster import SpectralClustering
from sklearn.cluster import AffinityPropagation

# Natural Language processing
import re
import spacy
import nltk
from nltk.corpus import stopwords, twitter_samples, gutenberg
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_rcv1

# Machine Learning packages
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## Data Preview

In [2]:
# Create list of files from directory
filelist = os.listdir('articles')

# Import the files
df_list = [pd.read_csv(file) for file in filelist]

#concatenate them together
articles = pd.concat(df_list)

# Preview the data
articles.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [3]:
articles.shape

(142570, 10)

In [4]:
articles = articles.sample(frac=0.1)

In [5]:
articles.select_dtypes(include=['object']).nunique()

title          14253
publication       15
author          3896
date            1060
url             8551
content        14245
dtype: int64

In [6]:
# Drop variables that have no impact on the outcome
articles = articles[['title', 'publication', 'author', 'content']]

In [7]:
articles.groupby(['author']).size().sort_values(ascending=False)

author
Breitbart News                                          161
Pam Key                                                 137
Associated Press                                        114
Charlie Spiering                                         94
Jerome Hudson                                            83
Daniel Nussbaum                                          71
Camila Domonoske                                         71
AWR Hawkins                                              71
John Hayward                                             65
Post Editorial Board                                     64
Joel B. Pollak                                           61
Ian Hanchett                                             58
Alex Swoyer                                              57
Merrit Kennedy                                           55
Breitbart London                                         54
Reuters                                                  52
Warner Todd Huston               

Well, that partly explains how there are so many authors in this dataset. It seems as though there are over 15,000 authors, and many of them have only published one article, or have co-written multiple articles with other authors. This complicates the problem, so in order to best represent each author's writing style, let's see what happens if we simply remove all authors that only published one article as is.

## Feature Selection

In [8]:
# Drop author from the dataframe if they wrote less than 5 articles
vc = articles['author'].value_counts()
u  = [i not in set(vc[vc<=4].index) for i in articles['author']]
articles = articles[u]

In [9]:
# Reprint how many unique authors there are
articles.select_dtypes(include=['object']).nunique()

title          9448
publication      15
author          629
content        9443
dtype: int64

In [10]:
# View number of articles after feature selection
articles.shape

(9451, 4)

So after removing authors that composed fewer than 5 articles, we are left with 125k articles, or 87.8% of the data, and roughly 3k/15k of the authors. Now, we can create a better representation of each author since each author has at least 5 articles to evaluate from.

In [11]:
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text

In [12]:
articles['content'] = articles.content.map(lambda x: text_cleaner(str(x)))
articles.head()

Unnamed: 0,title,publication,author,content
28643,Donald Trump could severely restrict immigrati...,Vox,Dylan Matthews,"In many ways, a Donald Trump administration wo..."
37552,Justice Dept. to North Carolina: Law limiting ...,Washington Post,Matt Zapotosky,The federal government took on North Carolina’...
1413,South Korea’s Top Spies Give New Evidence in P...,New York Times,Choe Sang-Hun,"SEOUL, South Korea — Officials from North Kore..."
16138,CNN Host Lets Sanders Try Again At Saying How ...,Talking Points Memo,,CNN host Dana Bash told Democratic presidentia...
18197,Islamic State Magazine: Jesus Was ’A Slave of ...,Breitbart,Frances Martel,"In the latest issue of its magazine Dabiq, the..."


In [13]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Reduce all text to their lemmas
for article in articles['content']:
    article = lemmatizer.lemmatize(article)

In [14]:
# Identify predictor and target variables
X = articles['content']
y = articles['publication']

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [15]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.99, random_state=42)

## Tf-idf Vectorization

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=5, # only use words that appear at least twice
                             max_features=3000, # limit to 3000 best features
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case (since Alice in Wonderland has the HABIT of CAPITALIZING WORDS for EMPHASIS)
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )

#Applying the vectorizer
X_tfidf=vectorizer.fit_transform(X)
print("Number of features: %d" % X_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)

#Removes all zeros from the matrix
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]

#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

#List of features
terms = vectorizer.get_feature_names()

#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

Number of features: 3000


In [17]:
from sklearn.preprocessing import normalize
X_norm = normalize(X_train_tfidf)

In [18]:
# Instantiating spaCy
nlp = spacy.load('en')
X_train_words = []

for row in X_train:
    # Processing each row for tokens
    row_doc = nlp(row)
    # Calculating length of each sentence
    sent_len = len(row_doc) 
    # Initializing counts of different parts of speech
    advs = 0
    verb = 0
    noun = 0
    adj = 0
    for token in row_doc:
        # Identifying each part of speech and adding to counts
        if token.pos_ == 'ADV':
            advs +=1
        elif token.pos_ == 'VERB':
            verb +=1
        elif token.pos_ == 'NOUN':
            noun +=1
        elif token.pos_ == 'ADJ':
            adj +=1
    # Creating a list of all features for each sentence
    X_train_words.append([row_doc, advs, verb, noun, adj, sent_len])

In [19]:
X_counter = pd.DataFrame(data=X_train_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'sent_length'])

In [20]:
X_counter.head()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,sent_length
0,"(’’, ’Things, are, rough, over, at, Intel, ., ...",17,47,56,21,342
1,"(Monica, Crowley, ,, the, former, Fox, News, c...",10,41,51,24,277
2,"(A, migrant, has, threatened, to, throw, a, yo...",11,79,90,25,409
3,"(Two, days, after, Donald, Trump, dismissed, r...",32,128,114,57,718
4,"(According, to, the, Archbishop, of, Aleppo, ,...",6,11,6,5,54


In [21]:
X_normal  = pd.DataFrame(data=X_norm.toarray())

In [22]:
features = pd.concat([X_counter,X_normal], ignore_index=False, axis=1)
features.head()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,sent_length,0,1,2,3,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
0,"(’’, ’Things, are, rough, over, at, Intel, ., ...",17,47,56,21,342,0.0,0.0,0.071358,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"(Monica, Crowley, ,, the, former, Fox, News, c...",10,41,51,24,277,0.0,0.0,0.0,0.0,...,0.0,0.061457,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"(A, migrant, has, threatened, to, throw, a, yo...",11,79,90,25,409,0.0,0.079689,0.0,0.0,...,0.0,0.0,0.047811,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"(Two, days, after, Donald, Trump, dismissed, r...",32,128,114,57,718,0.0,0.0,0.0,0.0,...,0.0,0.033556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"(According, to, the, Archbishop, of, Aleppo, ,...",6,11,6,5,54,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
features = features.drop('BOW', axis=1)

In [24]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import chi2

# Instantiating and fitting the 150 best features
kbest = SelectKBest(f_classif, k=300)
X2_train = kbest.fit_transform(features, y_train)

# Clustering

### K-means

In [25]:
X_norm

<7088x3000 sparse matrix of type '<class 'numpy.float64'>'
	with 989371 stored elements in Compressed Sparse Row format>

In [26]:
# Calulate predicted values
kmeans = KMeans(n_clusters=15, init='k-means++', random_state=42, n_init=20)
y_pred = kmeans.fit_predict(X_norm)

pd.crosstab(y_train, y_pred)

col_0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
publication,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Atlantic,14,9,132,16,8,1,0,11,12,75,12,36,72,21,5
Breitbart,32,41,74,162,44,217,5,92,25,217,28,219,443,39,58
Business Insider,74,19,21,12,11,1,2,9,15,46,7,23,117,7,8
Buzzfeed News,18,3,10,5,5,0,4,24,1,21,3,14,83,27,1
CNN,4,26,61,38,25,1,17,63,0,75,37,85,229,15,7
Fox News,5,5,5,53,4,0,1,36,7,28,6,32,104,8,21
Guardian,8,9,38,8,8,0,1,15,0,34,25,28,66,9,0
NPR,19,15,197,22,16,1,7,17,9,39,12,55,124,27,11
National Review,2,7,69,27,8,2,2,4,13,54,2,24,44,15,14
New York Post,85,16,128,16,6,1,6,53,37,52,132,39,362,30,5


In [27]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import silhouette_score

print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X_norm, y_pred, metric='euclidean')))

Adjusted Rand Score: 0.02979696
Silhouette Score: 0.01446115


# Modelling

### Random Forest

In [28]:
tf_rfc = ensemble.RandomForestClassifier()
train = tf_rfc.fit(X_train_tfidf, y_train)

print('Training set score:', tf_rfc.score(X_train_tfidf, y_train))
print('\nTest set score:', tf_rfc.score(X_test_tfidf, y_test))

Training set score: 0.9928047404063205

Test set score: 0.4879390605162928


### Logistic Regression

In [29]:
tf_lr = LogisticRegression()
train = tf_lr.fit(X_train_tfidf, y_train)

print('Training set score:', tf_lr.score(X_train_tfidf, y_train))
print('\nTest set score:', tf_lr.score(X_test_tfidf, y_test))

Training set score: 0.7128950338600452

Test set score: 0.5450698264917477


# Source

https://www.kaggle.com/snapcrack/all-the-news