Import the required libraries, including custom functions for scraping Fox News headlines and formatting the resulting CSV files

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from fox_gsdmm.gsdmm.gsdmm.mgp import MovieGroupProcess
from csv2list import csv2list
from Scrape import MakeHoney
#import spacy

Construct DataFrame from scraped headlines

In [2]:
#nlp = spacy.load('en_core_web_sm')

#scrape = MakeHoney(word_thresh=8,save_as='fox_scrape09292021')

files = ['scraped_pages/fox_scrape09092021.csv','scraped_pages/fox_scrape09102021.csv',
         'scraped_pages/fox_scrape09112021.csv','scraped_pages/fox_scrape09122021.csv',
         'scraped_pages/fox_scrape09122021B.csv','scraped_pages/fox_scrape09132021.csv',
         'scraped_pages/fox_scrape09142021.csv','scraped_pages/fox_scrape09152021.csv',
         'scraped_pages/fox_scrape09202021.csv','fox_scrape09212021.csv','fox_scrape09242021.csv',
         'fox_scrape09252021.csv','fox_scrape09292021.csv']

Fox = csv2list(files)

data = pd.DataFrame(Fox)
data.columns = ['Headline']
data.head()

Unnamed: 0,Headline
0,Biden admin trying to reverse all of Trump's a...
1,Chip Roy: Fentanyl overdoses skyrocketing beca...
2,Rep. Mike Gallagher on why Dr. Anthony Fauci m...
3,‘The Lost Calls of 9/11’ debuts on Fox Nation
4,Former CIA senior intel officer says moral obl...


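Create an instance of the GSDMM MovieGroupProcess, an iterative clustering algorithm that repeatedly reassigns each sample among up to K clusters (K is the maximum number of clusters). A sample's choice of cluster reflects two tendencies: a) a preference for clusters that already hold many samples, with alpha controlling how readily it joins an empty cluster, and b) a preference for clusters whose samples share its words, with beta controlling how strongly word overlap matters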

In [3]:
'''Cluster Method: GSDMM MovieGroupProcess'''

mgp = MovieGroupProcess(K=60, alpha=0.1, beta=0.1, n_iters=50)

#docs = [nlp(doc) for doc in Fox]
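
The reassignment step described above can be sketched in a few lines. This is a minimal illustration, not the `MovieGroupProcess` source: it assumes the simplified GSDMM sampling rule in which each word appears at most once per document, and the function name `cluster_probabilities` and the count structures `m_z`, `n_z`, and `n_z_w` are hypothetical names chosen for this example.

```python
import math

def cluster_probabilities(doc, m_z, n_z, n_z_w, alpha, beta, V, D):
    """Probability of a document joining each cluster (simplified GSDMM rule).

    Assumptions for this sketch: each word occurs at most once in `doc`, and
    all counts already exclude the document being reassigned.
      m_z[z]   - number of documents in cluster z
      n_z[z]   - number of word tokens in cluster z
      n_z_w[z] - dict mapping word -> count in cluster z
      V        - vocabulary size, D - total number of documents
    """
    K = len(m_z)
    log_scores = []
    for z in range(K):
        # Popularity term: larger clusters attract more documents;
        # alpha keeps empty clusters reachable.
        log_p = math.log(m_z[z] + alpha) - math.log(D - 1 + K * alpha)
        # Similarity term: words already frequent in the cluster raise the score.
        for w in doc:
            log_p += math.log(n_z_w[z].get(w, 0) + beta)
        for i in range(1, len(doc) + 1):
            log_p -= math.log(n_z[z] + V * beta + i - 1)
        log_scores.append(log_p)
    # Normalize in log space to avoid underflow on longer documents.
    mx = max(log_scores)
    weights = [math.exp(s - mx) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example: cluster 0 holds three documents about Biden and bills, cluster 1 is empty.
probs = cluster_probabilities(
    doc=['biden', 'bill'],
    m_z=[3, 0],
    n_z=[6, 0],
    n_z_w=[{'biden': 3, 'bill': 2, 'tax': 1}, {}],
    alpha=0.1, beta=0.1, V=10, D=4,
)
print(probs)
```

In the toy example the populated cluster that shares the document's words receives nearly all of the probability mass, while the empty cluster keeps a small nonzero chance because of alpha.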


Preprocess the text data: tokenization, stopword removal, and formatting for input to the clustering algorithm

In [4]:
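import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK tokenizer and stopword data (see nltk.download)
punc = {':', "'s", "'", ','}  # punctuation tokens to discard
stop_words = set(stopwords.words('english'))  # separate name, so the imported module is not shadowed

hlines = [word_tokenize(doc) for doc in Fox]

# One filtered token list per headline, in the format expected by mgp.fit
tokens = [[token for token in hline if token not in stop_words and token not in punc]
          for hline in hlines]

# The vocabulary is built from the unfiltered headlines, so it still counts stopwords
lex = [token for headline in hlines for token in headline if token not in punc]

vocab = set(lex)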
n_terms = len(vocab)
n_terms
#tokens[:20]
#hlines[:20]

5587

In [5]:
y = mgp.fit(tokens,n_terms)

In stage 0: transferred 1122 clusters with 60 clusters populated
In stage 1: transferred 620 clusters with 58 clusters populated
In stage 2: transferred 477 clusters with 55 clusters populated
In stage 3: transferred 415 clusters with 54 clusters populated
In stage 4: transferred 380 clusters with 54 clusters populated
In stage 5: transferred 383 clusters with 55 clusters populated
In stage 6: transferred 369 clusters with 53 clusters populated
In stage 7: transferred 345 clusters with 53 clusters populated
In stage 8: transferred 340 clusters with 53 clusters populated
In stage 9: transferred 328 clusters with 52 clusters populated
In stage 10: transferred 336 clusters with 51 clusters populated
In stage 11: transferred 326 clusters with 52 clusters populated
In stage 12: transferred 335 clusters with 52 clusters populated
In stage 13: transferred 328 clusters with 51 clusters populated
In stage 14: transferred 341 clusters with 51 clusters populated
In stage 15: transferred 331 clust

Display and Evaluate Cluster Contents

In [6]:
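doc_count = np.array(mgp.cluster_doc_count)

print('Number of Documents per Cluster: ', doc_count)

# argsort() is ascending, so [:50] picks the 50 least-populated clusters and
# [::-1] lists them largest first; the 10 most populated clusters are skipped.
# Use doc_count.argsort()[::-1][:50] to rank from the largest cluster instead.
top_ix = doc_count.argsort()[:50][::-1]

def top_words(cluster_word_distribution, top_cluster, values):
    """Print the `values` most frequent words for each cluster index in `top_cluster`."""
    for cluster in top_cluster:
        sort_dicts = sorted(cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]
        print('Cluster %s : %s' % (cluster, sort_dicts))
        print('-'*116)

top_words(mgp.cluster_word_distribution, top_ix, 20)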

Number of Documents per Cluster:  [15 21 35 16  0  8 48  6  4 19  8 14 17 17  0 24 14 23 13 19  6  0 24 53
 58  0 17  0  0 27  8 31 94  8 16  0  0 37 45  0 34  0 14 17  4 19 26  0
 23 50 12 28  8 67 73 15 34 75  0 89]
Cluster 37 : [('$', 12), ('Biden', 10), ('3.5T', 9), ('Democrats', 8), ('bill', 8), ('spending', 7), ('Dems', 7), ('reconciliation', 5), ('infrastructure', 5), ('America', 4), ('–', 4), ('Manchin', 4), ("n't", 4), ('Dem', 4), ('?', 3), ('may', 3), ('Supreme', 3), ('Court', 3), ('’', 3), ('plan', 3)]
--------------------------------------------------------------------------------------------------------------------
Cluster 2 : [('school', 17), ('mask', 6), ('.', 5), ('kids', 5), ('board', 5), ('State', 5), ('mandate', 4), ('parents', 4), ('Tennessee', 4), ('I', 4), ('lead', 4), ('Gov', 3), ('member', 3), ('student', 3), ('fired', 3), ('football', 3), ('beat', 3), ('coach', 3), ('back', 3), ('past', 3)]
-----------------------------------------------------------------------

Import and instantiate scikit-learn's KMeans algorithm. This clustering method represents the data as vectors, then chooses K initial centroids. Each sample is assigned to its nearest centroid, and each centroid is then moved to the mean of its assigned samples. The process repeats until the centroids stop moving ('convergence') or the iteration limit is reached. Preprocessing consists of tf-idf vectorization of the headline strings. Finally, display the top 12 terms from each cluster.
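
The assign-and-update loop described above can be sketched without any libraries. This is a minimal illustration of the idea, not scikit-learn's implementation (which uses a smarter initialization and works on sparse matrices); the function name `lloyd_kmeans` and its parameters are hypothetical.

```python
import math
import random

def lloyd_kmeans(points, k, n_iters=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recenter."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]

    def nearest(p):
        # Index of the centroid closest to point p (Euclidean distance).
        return min(range(k), key=lambda j: math.dist(p, centroids[j]))

    for _ in range(n_iters):
        labels = [nearest(p) for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the centroid to the mean of its assigned points.
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # convergence: no centroid moved
        centroids = new_centroids

    labels = [nearest(p) for p in points]
    return labels, centroids

# Two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = lloyd_kmeans(points, k=2)
print(labels, centroids)
```

In the notebook the points are tf-idf vectors with one dimension per vocabulary term, so each centroid's largest coordinates identify the terms that characterize its cluster.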

In [8]:
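from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert the raw headlines to tf-idf vectors, dropping English stopwords
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(Fox)

k = 50

kmeans = KMeans(n_clusters=k)
kmeans.fit(X)

print("Top terms per cluster:")
# Sort each centroid's term weights in descending order
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# on versions older than 1.0, call get_feature_names() instead
terms = vectorizer.get_feature_names_out()
for i in range(k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :12]:
        print(' %s' % terms[ind])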

Top terms per cluster:
Cluster 0:
 taliban
 make
 veteran
 china
 anti
 reveals
 kids
 lead
 win
 army
 machine
 blasts
Cluster 1:
 mandates
 year
 old
 vaccine
 biden
 governors
 workers
 state
 load
 carry
 larry
 schools
Cluster 2:
 laundrie
 brian
 florida
 gabby
 petito
 lawyer
 parents
 family
 fbi
 man
 manhunt
 judge
Cluster 3:
 flash
 headlines
 business
 fox
 september
 21
 24
 29
 14
 13
 15
 10
Cluster 4:
 working
 hard
 showcases
 exercise
 cosby
 release
 prison
 help
 weight
 think
 tv
 loss
Cluster 5:
 americans
 covid
 blames
 board
 biden
 lands
 time
 unvaccinated
 prove
 caste
 destroy
 emmys
Cluster 6:
 2021
 week
 open
 look
 crossword
 mortgage
 puzzle
 medvedev
 daniil
 carpet
 red
 record
Cluster 7:
 texas
 power
 leaves
 nicholas
 tropical
 abortion
 storm
 inside
 jew
 taliban
 takeover
 asks
Cluster 8:
 california
 recall
 newsom
 election
 voting
 polls
 close
 left
 candidate
 empire
 blowback
 voters
Cluster 9:
 19
 covid
 says
 watch
 really
 kids
 teste