# Unsupervised learning Capstone (name TBA)
Author: Matthew Huh
    
## About the Data

Collection of 142,570 articles from 15 different publications...

## Research Question

...

## Overview

...

## Packages

In [1]:
# Basic imports
import os
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Clustering packages
import sklearn.cluster as cluster
from sklearn.cluster import KMeans
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.cluster import SpectralClustering
from sklearn.cluster import AffinityPropagation

# Natural Language processing
import re
import spacy
import nltk
from nltk.corpus import stopwords, twitter_samples, gutenberg
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_rcv1

# Machine Learning packages
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## Data Preview

In [2]:
# Create list of files from directory
filelist = os.listdir('articles')

# Import the files
df_list = [pd.read_csv(file) for file in filelist]

#concatenate them together
articles = pd.concat(df_list)

# Preview the data
articles.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [3]:
articles.shape

(142570, 10)

In [4]:
articles = articles.sample(frac=0.1)

In [5]:
articles.select_dtypes(include=['object']).nunique()

title          14250
publication       15
author          3921
date            1030
url             8548
content        14246
dtype: int64

In [6]:
# Drop variables that have no impact on the outcome
articles = articles[['title', 'publication', 'author', 'content']]

In [7]:
articles.groupby(['author']).size().sort_values(ascending=False)

author
Pam Key                                              163
Breitbart News                                       154
Associated Press                                     126
Charlie Spiering                                      93
Ian Hanchett                                          81
Jerome Hudson                                         80
John Hayward                                          77
AWR Hawkins                                           77
Daniel Nussbaum                                       73
Alex Swoyer                                           68
Post Editorial Board                                  68
Merrit Kennedy                                        60
Joel B. Pollak                                        58
Camila Domonoske                                      57
Warner Todd Huston                                    52
German Lopez                                          50
NPR Staff                                             48
Jeff Poor               

Well, that partly explains how there are so many authors in this dataset. It seems as though there are over 15,000 authors, and many of them have only published one article, or have co-written multiple articles with other authors. This complicates the problem, so in order to best represent each author's writing style, let's see what happens if we simply remove all authors that only published one article as is.

## Feature Selection

In [8]:
# Drop author from the dataframe if they wrote less than 5 articles
vc = articles['author'].value_counts()
u  = [i not in set(vc[vc<=4].index) for i in articles['author']]
articles = articles[u]

In [9]:
# Reprint how many unique authors there are
articles.select_dtypes(include=['object']).nunique()

title          9397
publication      15
author          616
content        9395
dtype: int64

In [10]:
# View number of articles after feature selection
articles.shape

(9402, 4)

So after removing authors that composed fewer than 5 articles, we are left with 125k articles, or 87.8% of the data, and roughly 3k/15k of the authors. Now, we can create a better representation of each author since each author has at least 5 articles to evaluate from.

In [11]:
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text

In [12]:
articles['content'] = articles.content.map(lambda x: text_cleaner(str(x)))
articles.head()

Unnamed: 0,title,publication,author,content
37379,Trump is inheriting a world that’s gone to hell,New York Post,Benny Avni,This is a fine time to pick a new secretary of...
29415,Samantha Bee snaps up $3.7M Upper West Side ap...,New York Post,Jennifer Gould Keil,Comedy couple Samantha Bee and Jason Jones hav...
15423,Beautiful Huntresses: Scientists Explain Why M...,NPR,Merrit Kennedy,Female orchid mantises are dazzlingly beautifu...
18041,Slack Is Adding Status Messages That Tell Peo...,Buzzfeed News,Alex Kantrowitz,"’Asked for his top five custom statuses, Butte..."
30228,Jacoby Ellsbury has run out of high-priced Yan...,New York Post,George A. King III,TAMPA — The legion of Yankees fans no longer h...


In [13]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for article in articles['content']:
    article = lemmatizer.lemmatize(article)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [14]:
# Identify predictor and target variables
X = articles['content']
y = articles['publication']

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

## Tf-idf Vectorization

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear at least twice
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case (since Alice in Wonderland has the HABIT of CAPITALIZING WORDS for EMPHASIS)
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )

#Applying the vectorizer
X_tfidf=vectorizer.fit_transform(X)
print("Number of features: %d" % X_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)

#Removes all zeros from the matrix
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]

#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

#List of features
terms = vectorizer.get_feature_names()

#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

Number of features: 44554


In [23]:
# Examining shapes 
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)

(7051, 44554)
(2351, 44554)


In [24]:
from sklearn.preprocessing import normalize
X_norm = normalize(X_train_tfidf)

In [25]:
X2_train

<7051x150 sparse matrix of type '<class 'numpy.float64'>'
	with 25103 stored elements in Compressed Sparse Row format>

In [26]:
# Instantiating spaCy
nlp = spacy.load('en')
X_train_words = []

for row in X_train:
    # Processing each row for tokens
    row_doc = nlp(row)
    # Calculating length of each sentence
    sent_len = len(row_doc) 
    # Initializing counts of different parts of speech
    advs = 0
    verb = 0
    noun = 0
    adj = 0
    for token in row_doc:
        # Identifying each part of speech and adding to counts
        if token.pos_ == 'ADV':
            advs +=1
        elif token.pos_ == 'VERB':
            verb +=1
        elif token.pos_ == 'NOUN':
            noun +=1
        elif token.pos_ == 'ADJ':
            adj +=1
    # Creating a list of all features for each sentence
    X_train_words.append([row_doc, advs, verb, noun, adj, sent_len])

In [34]:
features  = pd.DataFrame(data=X_norm.toarray())

In [36]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Instantiating and fitting the 150 best features
kbest = SelectKBest(chi2, k=150)
X_train = kbest.fit_transform(features, y_train)

# Clustering

### K-means

In [40]:
# Calulate predicted values
kmeans = KMeans(n_clusters=15, init='k-means++', random_state=42, n_init=20)
y_pred = kmeans.fit_predict(X_train)

pd.crosstab(y_train, y_pred)

col_0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
publication,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Atlantic,8,258,3,33,0,6,0,0,6,0,12,0,1,0,68
Breitbart,60,882,35,61,0,0,2,0,17,38,114,1,238,0,277
Business Insider,7,294,3,41,0,0,13,0,7,1,15,0,2,0,64
Buzzfeed News,4,160,3,4,0,0,3,0,3,0,5,0,4,0,24
CNN,23,497,1,40,0,0,1,0,13,0,10,0,0,0,92
Fox News,15,179,9,20,0,0,2,0,4,2,18,0,1,0,25
Guardian,3,216,0,13,0,0,1,0,5,0,2,0,0,0,45
NPR,11,237,4,15,1,0,3,1,140,4,9,0,2,0,47
National Review,9,166,11,37,0,0,0,0,4,4,21,0,0,0,40
New York Post,4,756,4,22,0,0,9,1,56,16,18,22,2,0,80


In [42]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import silhouette_score

print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X_train, y_pred, metric='euclidean')))

Adjusted Rand Score: 0.001667979
Silhouette Score: 0.2251684


# Modelling

### Random Forest

In [43]:
tf_rfc = ensemble.RandomForestClassifier()
train = tf_rfc.fit(X_train_tfidf, y_train)

print('Training set score:', tf_rfc.score(X_train_tfidf, y_train))
print('\nTest set score:', tf_rfc.score(X_test_tfidf, y_test))

Training set score: 0.992766983406609

Test set score: 0.41854529987239475


### Logistic Regression

In [44]:
tf_lr = LogisticRegression()
train = tf_lr.fit(X_train_tfidf, y_train)

print('Training set score:', tf_lr.score(X_train_tfidf, y_train))
print('\nTest set score:', tf_lr.score(X_test_tfidf, y_test))

Training set score: 0.7282654942561338

Test set score: 0.5150999574649086


# Source

https://www.kaggle.com/snapcrack/all-the-news