# Unsupervised learning Capstone (name TBA)
Author: Matthew Huh
    
## Overview

For the most part, people are free to choose what news outlets they read and follow. In the United States, there is a near-endless list of sites that people can choose from in order to get their daily news and over time, they develop preferences for sites that they are more attached to, and do their best to avoid. Now these affinities are developed through a combination of means ranging from affiliations, vocabulary, prose, and so forth.

What I would like to examine in this project is if it is possible to differentiate from several different publications with their respective perks / quirks. 

## About the Data

This dataset was obtained from Kaggle, and contains a collection of 142,570 articles from 15 different publications.

The publications within this dataset are
1. CNN
2. Breitbart
3. Vox
4. Washington Post
5. New York Post
6. National Review
7. NPR
8. Guardian
9. Talking Points Memo
10. Atlantic
11. Reuters
12. Fox News
13. Business Insider
14. Buzzfeed News
15. New York Times

## Research Question

As this is an unsupervised learning project first and foremost, the project will have 3 goals.

1. The first goal is to prepare the articles in the dataset for modelling using various Natural Language Processing (NLP) methods to re-represent the data in numbers rather than words
2. Cluster the data to determine if we can identify the articles and associate them as different groups.
3. Determine if we can predict the structure of the article based on the publisher.

## Packages

In [1]:
# Basic imports
import os
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Machine Learning packages
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import chi2
from sklearn.preprocessing import normalize
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Clustering packages
import sklearn.cluster as cluster
from sklearn.cluster import KMeans
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.cluster import SpectralClustering
from sklearn.cluster import AffinityPropagation

# Natural Language processing
import re
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_rcv1

## Data Preview

In [2]:
# Create list of files from directory
filelist = os.listdir('articles')

# Import the files
df_list = [pd.read_csv(file) for file in filelist]

#concatenate them together
articles = pd.concat(df_list)

# Preview the data
articles.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [3]:
# Print the size of the dataset
articles.shape

(142570, 10)

In [4]:
# # Sample the dataset for optimal performance
# articles = articles.sample(frac=0.1)

In [5]:
# Describe unique occurences for each categorical variable
articles.select_dtypes(include=['object']).nunique()

title          142132
publication        15
author          15647
date             1646
url             85559
content        142038
dtype: int64

In [6]:
# Drop variables that have no impact on the outcome
articles = articles[['title', 'publication', 'author', 'content']]

In [7]:
# View most frequently occurring authors
articles.groupby(['author']).size().sort_values(ascending=False)

author
Breitbart News                                                      1559
Pam Key                                                             1282
Associated Press                                                    1231
Charlie Spiering                                                     928
Jerome Hudson                                                        806
John Hayward                                                         747
Daniel Nussbaum                                                      735
AWR Hawkins                                                          720
Ian Hanchett                                                         647
Joel B. Pollak                                                       624
Post Editorial Board                                                 620
Alex Swoyer                                                          604
Camila Domonoske                                                     593
Warner Todd Huston                          

Well, that partly explains how there are so many authors in this dataset. It seems as though there are over 15,000 authors, and many of them have only published one article, or have co-written multiple articles with other authors. This complicates the problem, so in order to best represent each author's writing style, let's see what happens if we simply remove all authors that only published one article as is.

## Feature Selection

In [8]:
# Drop author from the dataframe if they wrote less than 5 articles
vc = articles['author'].value_counts()
u  = [i not in set(vc[vc<=4].index) for i in articles['author']]
articles = articles[u]

In [9]:
# Reprint how many unique authors there are
articles.select_dtypes(include=['object']).nunique()

title          124811
publication        15
author           3063
content        124724
dtype: int64

In [10]:
# View number of articles after feature selection
articles.shape

(125223, 4)

So after removing authors that composed fewer than 5 articles, we are left with 125k articles, or 87.8% of the data, and roughly 3k/15k of the authors. Now, we can create a better representation of each author since each author has at least 5 articles to evaluate from.

In [11]:
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text

In [12]:
# Remove annoying punctuation from the articles
articles['content'] = articles.content.map(lambda x: text_cleaner(str(x)))
articles.head()

Unnamed: 0,title,publication,author,content
0,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,WASHINGTON — Congressional Republicans have a ...
2,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,"When Walt Disney’s “Bambi” opened in 1942, cri..."
4,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,"SEOUL, South Korea — North Korea’s leader, Kim..."
5,"Sick With a Cold, Queen Elizabeth Misses New Y...",New York Times,Sewell Chan,"LONDON — Queen Elizabeth II, who has been batt..."
6,Taiwan’s President Accuses China of Renewed In...,New York Times,Javier C. Hernández,BEIJING — President Tsai of Taiwan sharply cri...


In [13]:
lemmatizer = WordNetLemmatizer()

# Reduce all text to their lemmas
for article in articles['content']:
    article = lemmatizer.lemmatize(article)

In [14]:
# Identify predictor and target variables
X = articles['content']
y = articles['publication']

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Tf-idf Vectorization

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=5, # only use words that appear at least twice
                             max_features=150, # limit to 300 best features
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case (since Alice in Wonderland has the HABIT of CAPITALIZING WORDS for EMPHASIS)
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )

#Applying the vectorizer
X_tfidf=vectorizer.fit_transform(X)
print("Number of features: %d" % X_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)

#Removes all zeros from the matrix
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]

#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

#List of features
terms = vectorizer.get_feature_names()

#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

# Normalize the dataset    
X_norm = normalize(X_train_tfidf)

# Convert from tf-idf matrix to dataframe
X_normal  = pd.DataFrame(data=X_norm.toarray())

Number of features: 150


### Phrase count with spacy

In [16]:
# # Instantiating spaCy
# nlp = spacy.load('en')
# X_train_words = []

# for row in X_train:
#     # Processing each row for tokens
#     row_doc = nlp(row)
#     # Calculating length of each sentence
#     sent_len = len(row_doc) 
#     # Initializing counts of different parts of speech
#     advs = 0
#     verb = 0
#     noun = 0
#     adj = 0
#     for token in row_doc:
#         # Identifying each part of speech and adding to counts
#         if token.pos_ == 'ADV':
#             advs +=1
#         elif token.pos_ == 'VERB':
#             verb +=1
#         elif token.pos_ == 'NOUN':
#             noun +=1
#         elif token.pos_ == 'ADJ':
#             adj +=1
#     # Creating a list of all features for each sentence
#     X_train_words.append([row_doc, advs, verb, noun, adj, sent_len])

# # Create dataframe with count of adverbs, verbs, nouns, and adjectives
# X_count = pd.DataFrame(data=X_train_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'sent_length'])

# # Change token count to token percentage
# for column in X_count.columns[1:5]:
#     X_count[column] = X_count[column] / X_count['sent_length']

# # Normalize X_count
# X_counter = normalize(X_count.drop('BOW',axis=1))
# X_counter  = pd.DataFrame(data=X_counter)

In [17]:
# # Combine tf-idf matrix and phrase count matrix
# features = pd.concat([X_counter,X_normal], ignore_index=False, axis=1)
# features.head()

In [18]:
# # Instantiating and fitting the 300 best features
# kbest = SelectKBest(f_classif, k=300)
# X2_train = kbest.fit_transform(features, y_train)

# Clustering

### K-means

In [19]:
# Calulate predicted values
kmeans = KMeans(n_clusters=15, init='k-means++', random_state=42, n_init=20)
y_pred = kmeans.fit_predict(X_normal)

pd.crosstab(y_train, y_pred)

col_0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
publication,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Atlantic,822,195,133,129,813,137,206,327,93,8,1250,147,229,221,153
Breitbart,3111,135,484,294,5013,417,1040,1264,304,259,1470,1055,1943,461,396
Business Insider,844,133,34,696,1224,105,153,213,73,6,904,137,252,57,74
Buzzfeed News,396,38,1,333,1007,116,92,147,102,7,398,302,108,131,205
CNN,1179,222,4,88,2449,289,404,510,196,9,1195,772,384,214,284
Fox News,483,58,82,37,831,113,120,188,41,18,181,333,508,32,89
Guardian,766,197,0,184,1062,269,156,284,75,21,1541,321,123,220,172
NPR,708,2176,144,176,1139,216,204,342,335,14,1555,268,242,271,237
National Review,836,60,146,15,529,46,401,516,96,10,662,56,356,141,129
New York Post,769,766,537,826,2447,643,242,204,108,26,3211,781,306,290,438


In [20]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import silhouette_score

print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X_normal, y_pred, metric='euclidean')))

Adjusted Rand Score: 0.03274882


MemoryError: 

### Spectral Clustering

In [None]:
sc = SpectralClustering(n_clusters=15)
y_pred3 = sc.fit_predict(X2_train)

pd.crosstab(y2_train, y_pred3)

In [None]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y2_train, y_pred3)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred3, metric='euclidean')))

### Affinity Propagation

In [None]:
af = Affinity Propagation
y_pred4 = af.fit_predict(X2_train)

pd.crosstab(y2_train, y_pred4)

In [None]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y2_train, y_pred4)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred4, metric='euclidean')))

# Modelling

### Random Forest

In [None]:
tf_rfc = ensemble.RandomForestClassifier()
train = tf_rfc.fit(X_normal, y_train)

print('Training set score:', tf_rfc.score(X_normal, y_train))
print('\nTest set score:', tf_rfc.score(X_normal, y_test))

### Logistic Regression

In [None]:
tf_lr = LogisticRegression()
train = tf_lr.fit(X_normal, y_train)

print('Training set score:', tf_lr.score(X_normal, y_train))
print('\nTest set score:', tf_lr.score(X_normal, y_test))

# Source

https://www.kaggle.com/snapcrack/all-the-news