# Clustering Articles

In [3]:
from collections import defaultdict
from bs4 import BeautifulSoup

import re
import math
import string

## Creating Documents

We are using Reuters-21578, a small corpus of Reuters articles in [SGML format](http://kdd.ics.uci.edu/databases/reuters21578/README.txt), to test our methodology. First, we split the corpus into individual documents.

In [5]:
with open('./data/reut2-000.sgm') as f:
    corpus = f.read()

In [9]:
soup = BeautifulSoup(corpus, 'html.parser')
allArticles = soup.find_all('reuters')

def preprocess(s):
    # Split on runs of non-word characters and digits (raw string avoids invalid-escape warnings)
    return re.split(r'[\W\d]+', s)

articles = [] # stories with a body
documents = []
for a in allArticles:
    if a.body:
        articles.append(a)
        # Each document is a list of words split
        documents.append(preprocess(a.body.string))
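
To see what this tokenization produces, here is a small sketch on a made-up sentence (the sample string is an assumption, not taken from the corpus). It shows that digits and punctuation are dropped and that `re.split` can leave an empty string at the edges, which is why later cells skip zero-length words.

```python
import re

def preprocess(s):
    # Same splitting rule as above: break on runs of non-word characters and digits
    return re.split(r'[\W\d]+', s)

sample = 'Oil prices rose 3.5 pct.'  # made-up sentence for illustration
tokens = preprocess(sample)
print(tokens)  # ['Oil', 'prices', 'rose', 'pct', '']

# Drop the empty strings left behind by a leading or trailing separator
nonempty = [t for t in tokens if t]
print(nonempty)  # ['Oil', 'prices', 'rose', 'pct']
```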

We now turn each document into a vector of W dimensions, where W is the number of unique words in the corpus. Each component of the vector represents the term frequency-inverse document frequency (TF-IDF) of one word in the document. TF-IDF is a weighting scheme based on a word's occurrence in a document (term frequency) and its rarity across the overall corpus (inverse document frequency). A higher TF-IDF usually means the word is more important to that document: it occurs often there but in few other documents.

There are [several different weighting schemes](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) we could use; the ones I've chosen are as follows. 

For term frequency (tf) we will use a type of weighting scheme called [double normalization](https://nlp.stanford.edu/IR-book/html/htmledition/maximum-tf-normalization-1.html). The first normalization divides a word's raw count by the maximum raw count of any word in that document. This reduces TF's bias towards longer documents. Then we normalize again by "smoothing", which prevents modest changes in tf from greatly affecting TF-IDF.

For inverse document frequency (idf) we will use the standard logarithmically scaled frequency.


In [4]:
%%latex

$$ tf = 0.4 + (1-0.4)\frac{N(v_i, d)}{\max_{v_j \in d}N(v_j, d)} $$

where $N(v_i, d)$ is the number of times word $v_i$ occurs in document $d$. We've used the common smoothing constant of 0.4.

$$ idf = \log \frac{N}{N(v_i)} $$

where $N$ is the number of documents in the corpus, and $N(v_i)$ is the number of documents that include $v_i$.

$$ TF\text{-}IDF = tf \times idf $$

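
As a worked example under assumed numbers (not taken from the corpus): suppose a word occurs 3 times in a document whose most frequent word occurs 6 times, and the word appears in 10 of 1000 documents. Using the base-10 logarithm, as the `getIdf` code does:

$$ tf = 0.4 + (1 - 0.4)\frac{3}{6} = 0.7 $$

$$ idf = \log_{10}\frac{1000}{10} = 2 $$

$$ TF\text{-}IDF = 0.7 \times 2 = 1.4 $$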

In [10]:
words = []
for doc in documents:
    words += doc
words = [w for w in set(words) if len(w) > 0]
numWords = len(words)

print(f'There are {numWords} unique words over {len(documents)} documents.')

There are 10252 unique words over 925 documents.


In [19]:
wordToLoc = {words[i]: i for i in range(numWords)} # a word to index lookup

# stores number of documents a word appears in
documentCounts = [0 for _ in range(numWords)]

a = 0.4 # our smoothing constant

# Calculates tf vector while building document counts
def tf(doc):    
    vector = [0 for _ in range(numWords)]
    counted = defaultdict(bool) # keeps track of whether we've counted a word in document counts
    for word in doc:
        if len(word) == 0:
            continue
        loc = wordToLoc[word]
        vector[loc] += 1
        if not counted[word]:
            documentCounts[loc] += 1
            counted[word] = True
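    maxTf = max(vector)
    # Words absent from the document keep tf = 0; smoothing applies only to words that occur
    return [a + (1 - a) * c / maxTf if c else 0 for c in vector]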

tfVectors = [tf(d) for d in documents]

In [21]:
idfCache = {}
def getIdf(i):
    if i not in idfCache:
        idfCache[i] = math.log10(len(documents) / documentCounts[i])
    return idfCache[i]

def normalizeByIdf(vector):
    return [vector[i] * getIdf(i) for i in range(len(vector))]
            
tfidf = [normalizeByIdf(v) for v in tfVectors]
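
The dense lists above hold one entry per vocabulary word for every document, and most of those entries belong to words the document never uses. Below is a minimal sketch of the same tf and idf formulas with sparse dictionaries. `sparse_tfidf` is a hypothetical helper that is not part of the original notebook, and the toy corpus stands in for the Reuters documents.

```python
import math
from collections import Counter

def sparse_tfidf(docs, a=0.4):
    """Return one {word: weight} dict per document (hypothetical helper)."""
    # Raw counts per document, skipping the empty strings re.split can produce
    counts = [Counter(w for w in doc if w) for doc in docs]

    # Document frequency: how many documents contain each word
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    vectors = []
    for c in counts:
        max_count = max(c.values(), default=1)
        # Double-normalized tf times log10 idf, stored only for words that occur
        vectors.append({
            w: (a + (1 - a) * n / max_count) * math.log10(n_docs / df[w])
            for w, n in c.items()
        })
    return vectors

# Toy corpus (an assumption for illustration)
toy_docs = [
    ['oil', 'oil', 'price'],
    ['oil', 'bank'],
    ['bank', 'rate', 'rate', 'rate'],
]
toy_vectors = sparse_tfidf(toy_docs)
print(sorted(toy_vectors[0]))  # ['oil', 'price']
```

Each dictionary stores only the words that occur in its document, so memory grows with document length rather than vocabulary size.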

## K-means clustering

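Now that we've converted each article into a vector, we can cluster them. Here is a [really good video](https://www.youtube.com/watch?v=_aWzGGNrcic) on how k-means clustering works. Rather than reuse the hand-built vectors, the cells below recompute TF-IDF with scikit-learn's `TfidfVectorizer`, whose default weighting differs from the scheme above, and pass the resulting sparse matrix to `KMeans`.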
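
For intuition, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments stop changing. `kmeans_sketch` and the toy points are assumptions for illustration only; scikit-learn's `KMeans` uses a more careful initialization and is what the notebook actually runs.

```python
import random

def squared_distance(p, q):
    # Squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans_sketch(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two well-separated toy groups
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

The first three points should share one label and the last three the other; which group gets label 0 depends on the random starting centroids.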

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

import numpy as np

In [24]:
tfidf_vectorizer = TfidfVectorizer(preprocessor=lambda x: ' '.join(preprocess(x)))
tfidf = tfidf_vectorizer.fit_transform([a.body.string for a in articles])

In [25]:
numClusters = 100
kmeans = KMeans(n_clusters=numClusters).fit(tfidf)

In [27]:
cluster_assignments = {}

for i in set(kmeans.labels_):
    cluster_assignments[i] = [documents[x] for x in np.where(kmeans.labels_ == i)[0]]
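
Before reading individual articles, one quick way to label a cluster is to list the highest-weighted terms of its centroid. A minimal sketch follows; `top_terms` is a hypothetical helper that is not in the original notebook, and the plain lists stand in for a centroid row and the vectorizer's vocabulary.

```python
def top_terms(centroid, vocabulary, n=5):
    """Return the n vocabulary terms with the largest centroid weights."""
    # Pair each weight with its term and sort from largest weight to smallest
    ranked = sorted(zip(centroid, vocabulary), reverse=True)
    return [term for _, term in ranked[:n]]

# Toy values (assumptions for illustration)
toy_vocabulary = ['bank', 'oil', 'price', 'rate', 'surplus']
toy_centroid = [0.10, 0.00, 0.05, 0.40, 0.30]
print(top_terms(toy_centroid, toy_vocabulary, n=3))  # ['rate', 'surplus', 'bank']
```

With the fitted objects above, the equivalent call would be `top_terms(kmeans.cluster_centers_[10], tfidf_vectorizer.get_feature_names_out())`, assuming scikit-learn 1.0 or later for `get_feature_names_out`.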

Let's explore our clusters.

In [29]:
cluster = cluster_assignments[10]

for i in range(len(cluster)):
    print('{}. {}\n'.format(i + 1, ' '.join(cluster[i][:50])))

1. New Zealand s official foreign reserves fell to billion N Z Dlrs in January from billion dlrs in December and compared with billion a year ago period the Reserve Bank said in its weekly statistical bulletin Reuter 

2. South Korea plans to take steps to keep its current account surplus below five billion dlrs Economic Planning Board Minister Kim Mahn je said Kim told reporters the government would repay loans ahead of schedule and encourage firms to increase imports and investment abroad to prevent the current account surplus

3. French state owned chemicals group Rhone Poulenc RHON PA said it will increase its capital with a billion franc issue of preferential investment certificates on March Company chairman Jean Rene Fourtou said mln francs of the issue will be placed in the U S Details of the issue will be

4. Switzerland recorded last year its first overall surplus in government finances since ending with a net gain worth mln Swiss francs the Finance Ministry said The surplus i