# Week 11 Assignment
Find a set of texts. Preprocess them and use k-means to generate a topic map.

## Data Loading
I'm using the 20 Newsgroups dataset that ships with scikit-learn. Each post belongs to exactly one newsgroup, so the group labels give me a ground truth for checking how well k-means clustering groups topics together.

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

newsgroups_databunch = fetch_20newsgroups(
    subset = 'train',
    # Selecting distinct groups here so we can see if k-means
    # can even separate distinct topics from each other
    categories = [
        'comp.graphics',
        'alt.atheism',
        'misc.forsale',
        'rec.sport.hockey',
        'sci.space',
        'talk.politics.guns'
    ], 
    shuffle = True, 
    random_state = 1
)

# `target` holds an integer label for each post and `target_names` maps
# those integers back to newsgroup names. This avoids parsing `filenames`,
# whose path layout depends on where scikit-learn caches the data.
target_names = pd.Series(newsgroups_databunch.target) \
    .map(lambda i: newsgroups_databunch.target_names[i])

newsgroups_data = pd.DataFrame(newsgroups_databunch.data, columns = ['text'])
newsgroups_data['category'] = target_names
newsgroups_data.head()

| | text | category |
|---|---|---|
| 0 | From: steinly@topaz.ucsc.edu (Steinn Sigurdsso... | sci.space |
| 1 | From: huot@cray.com (Tom Huot)\nSubject: Re: B... | rec.sport.hockey |
| 2 | From: Robert Angelo Pleshar <rp16+@andrew.cmu.... | rec.sport.hockey |
| 3 | From: halat@pooh.bears (Jim Halat)\nSubject: R... | alt.atheism |
| 4 | From: jkatz@access.digex.com (Jordan Katz)\nSu... | sci.space |


## Data Cleaning & Preprocessing
We are interested in the `text` column so we should clean up that column first using some text preprocessing steps.

In [2]:
import string
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Chaining some preprocessing steps together, mainly:
# 1) Lowercasing text field
# 2) Removing header information ("From", "Subject", etc.) by
#    dropping everything before the first blank line, since a
#    blank line appears to separate the headers from the body
# 3) Removing punctuation
# 4) Replacing newline characters with spaces
# 5) Removing numbers
newsgroups_data['text_cleaned'] = newsgroups_data['text'] \
    .str.lower() \
    .apply(lambda x: ' '.join(x.split('\n\n')[1:])) \
    .str.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation))) \
    .str.replace('\n', ' ') \
    .str.replace(r'\d+', '', regex=True)
    
# Remove stopwords from the cleaned up text field
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
newsgroups_data['text_cleaned'] = newsgroups_data['text_cleaned'].apply(
    lambda row: ' '.join([word for word in row.split() if word not in stop_words])
)

# Lemmatize words in the text field. This lemmatizing
# step isn't perfect because I am not determining the
# POS (part-of-speech) tag so by default, the lemmatizer
# assumes each word is a noun and tries to find the lemma
# for that form of the word.
lemmatizer = WordNetLemmatizer()
newsgroups_data['text_cleaned'] = newsgroups_data['text_cleaned'].apply(
    lambda row: ' '.join([lemmatizer.lemmatize(word) for word in row.split()])
)

print(f'=====TEXT BEFORE PROCESSING===== \n"{newsgroups_data["text"][0]}"')
print(f'=====TEXT AFTER PROCESSING===== \n"{newsgroups_data["text_cleaned"][0]}"')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
=====TEXT BEFORE PROCESSING===== 
"From: steinly@topaz.ucsc.edu (Steinn Sigurdsson)
Subject: Re: New planet/Kuiper object found?
Organization: Lick Observatory/UCO
Lines: 23
Distribution: sci
	<1r9de3INNjkv@gap.caltech.edu>
NNTP-Posting-Host: topaz.ucsc.edu
In-reply-to: jafoust@cco.caltech.edu's message of 23 Apr 1993 18:44:19 GMT

In article <1r9de3INNjkv@gap.caltech.edu> jafoust@cco.caltech.edu (Jeff Foust) writes:

   In a recent article jdnicoll@prism.ccs.uwo.ca (James Davis Nicoll) writes:
   >	If the  new  Kuiper belt object *is*  called 'Karla', the next
   >one  should be called 'Smiley'.

   Unless I'm imaging things, (always a possibility =) 1992 QB1, the Kuiper Belt
   object discovered last year, is known as Smiley.

As it happens the _second_ 
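
To make the cleaning steps above easier to follow, here is a minimal sketch that applies them to a single string using only the standard library. The `clean_text` helper, the `sample` post, and the tiny `TOY_STOP_WORDS` set are assumptions made for illustration; the notebook itself uses NLTK's full English stopword list and also lemmatizes, which this sketch skips.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration only; the notebook
# uses NLTK's full English list instead.
TOY_STOP_WORDS = {"the", "is", "a", "of", "in", "as", "it"}

# Translation table that maps every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(raw: str) -> str:
    """Apply the notebook's cleaning steps (minus lemmatizing) to one post."""
    text = raw.lower()
    # Drop everything before the first blank line (the header block)
    text = " ".join(text.split("\n\n")[1:])
    text = text.translate(PUNCT_TO_SPACE)
    text = text.replace("\n", " ")
    text = re.sub(r"\d+", "", text)
    return " ".join(w for w in text.split() if w not in TOY_STOP_WORDS)


sample = (
    "From: a@b.edu\nSubject: Re: test\n\n"
    "The Kuiper belt object 1992 QB1\nis known as 'Smiley'."
)
print(clean_text(sample))  # kuiper belt object qb known smiley
```

Note that removing digits turns `qb1` into `qb`, so identifiers that mix letters and numbers lose information in this pipeline.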

## Bag-of-Words
scikit-learn can build a bag-of-words representation of a collection of texts with its `CountVectorizer` class. I will use that in the following cells, with `min_df = 2` so that words appearing in fewer than two posts are dropped.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df = 2)
text_bow = vectorizer.fit_transform(newsgroups_data['text_cleaned'])
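# `get_feature_names()` was removed in scikit-learn 1.2;
# `get_feature_names_out()` is available from version 1.0 onward
text_bow_dense = pd.DataFrame(text_bow.toarray(), columns = vectorizer.get_feature_names_out())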
text_bow_dense.sample(5)

Unnamed: 0,aa,aaa,aaah,aadce,aamrl,aantal,aao,aargh,aario,aaron,ab,abad,abandon,abandoned,abbreviation,abc,abel,abetter,abide,abiding,ability,able,abo,aboard,abode,abolished,abomination,abort,aborted,abortion,abotu,aboyko,abraham,abraxis,abridge,abrupt,abruptly,absence,absent,absolut,...,zeta,zettler,zeus,zezel,zhamnov,zhao,zhenghao,zhitnik,zhivov,zi,zilch,zillion,zimring,zip,zipper,zipping,zippy,zlumber,zmed,zmolek,zo,zog,zola,zombie,zombo,zone,zoo,zoology,zoom,zoomed,zooming,zt,zu,zubov,zulu,zupancic,zurich,zvbww,zyeh,zyxel
893,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1966,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2738,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2867,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
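
To show what a bag-of-words matrix with a minimum document frequency contains, here is a rough pure-Python sketch on three made-up documents. The `docs` list and variable names are assumptions for illustration, and the whitespace tokenizing is a simplification: `CountVectorizer` uses its own token pattern and returns a sparse matrix.

```python
from collections import Counter

# Hypothetical toy corpus
docs = [
    "space shuttle launch",
    "hockey game tonight",
    "shuttle launch delayed",
]

tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Keep words that appear in at least `min_df` documents, sorted alphabetically
min_df = 2
vocabulary = sorted(word for word, df in doc_freq.items() if df >= min_df)

# One row per document, one column per vocabulary word
bow = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['launch', 'shuttle']
for row in bow:
    print(row)     # [1, 1], then [0, 0], then [1, 1]
```

The hockey document ends up as an all-zero row because none of its words survive the `min_df` filter, which hints at why most entries in the real matrix are zero.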


## k-Means Clustering
Let's try clustering the bag-of-words into topics and see how it does.

In [4]:
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters = 6, # There are six categories so we should try to make six clusters
    random_state = 1
)
kmeans.fit(text_bow)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=6, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)
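
For intuition about what `fit` is doing, here is a minimal sketch of the basic k-means loop (Lloyd's algorithm) on made-up 2-D points. It is not scikit-learn's implementation: the `kmeans` function below is hypothetical, it starts from the first `k` points instead of using `k-means++`, and it runs a fixed number of iterations rather than checking a tolerance.

```python
import math


def kmeans(points, k, n_iter=10):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Two well-separated toy groups, near (0, 0) and near (10, 10)
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Because both steps rely on Euclidean distance, the result depends heavily on how the feature vectors are scaled.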

In [5]:
# Get the predicted clusters from the k-Means
newsgroups_data['predicted_cluster'] = kmeans.predict(text_bow)
newsgroups_data.head()

| | text | category | text_cleaned | predicted_cluster |
|---|---|---|---|---|
| 0 | From: steinly@topaz.ucsc.edu (Steinn Sigurdsso... | sci.space | article rdeinnjkv gap caltech edu jafoust cco ... | 0 |
| 1 | From: huot@cray.com (Tom Huot)\nSubject: Re: B... | rec.sport.hockey | oh excuse wasting bandwidth referring original... | 0 |
| 2 | From: Robert Angelo Pleshar <rp16+@andrew.cmu.... | rec.sport.hockey | deal bill wirtz apparently blackhawks st louis... | 0 |
| 3 | From: halat@pooh.bears (Jim Halat)\nSubject: R... | alt.atheism | article bu edu jaeger buphy bu edu gregg jaege... | 0 |
| 4 | From: jkatz@access.digex.com (Jordan Katz)\nSu... | sci.space | ssrt rollout speech delivered col simon p word... | 0 |


## k-Means Clustering Evaluation
When ground-truth labels are available, three related metrics can be used to evaluate how well the clustering did: completeness, homogeneity, and V-measure.
* Completeness: A score from 0 to 1 indicating whether all points from a given category are assigned to the same cluster. Higher is better.
* Homogeneity: A score from 0 to 1 indicating whether each cluster contains only members of a single category. Higher is better.
* V-measure: A score from 0 to 1 that is the harmonic mean of completeness and homogeneity. Higher is better.
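
As a rough illustration of how these three scores relate, the sketch below computes them from entropies using the standard definitions: homogeneity is `1 - H(C|K) / H(C)` and completeness is `1 - H(K|C) / H(K)`, where `C` is the category and `K` the cluster. The helper names (`entropy`, `conditional_entropy`, `scores`) and the toy labels are assumptions for illustration; this is not scikit-learn's code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """H(labels | given): entropy of `labels` inside each `given` group."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    group_sizes = Counter(given)
    return -sum(
        (count / n) * math.log(count / group_sizes[g])
        for (g, _), count in joint.items()
    )


def scores(categories, clusters):
    """Return (homogeneity, completeness, v_measure)."""
    h_c, h_k = entropy(categories), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1 - conditional_entropy(categories, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1 - conditional_entropy(clusters, categories) / h_k
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


truth = ["a", "a", "b", "b"]
perfect = scores(truth, [0, 0, 1, 1])      # every score is 1
one_cluster = scores(truth, [0, 0, 0, 0])  # complete but not homogeneous
print(perfect)
print(one_cluster)
```

Putting every point in one cluster gives a completeness of 1 and a homogeneity of 0, so a completeness that is much higher than the homogeneity suggests that most points were lumped into a single cluster.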

In [6]:
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

print(f'Completeness Score: {completeness_score(newsgroups_data["category"], newsgroups_data["predicted_cluster"])}')
print(f'Homogeneity Score:  {homogeneity_score(newsgroups_data["category"], newsgroups_data["predicted_cluster"])}')
print(f'V-measure Score:    {v_measure_score(newsgroups_data["category"], newsgroups_data["predicted_cluster"])}')

Completeness Score: 0.20366589824401074
Homogeneity Score:  0.002036081833853821
V-measure Score:    0.004031856527906846


Looks like k-means did a poor job of recovering the categories from the bag-of-words features: homogeneity and V-measure are both close to 0, and all five rows shown earlier landed in cluster 0. :(
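
One plausible contributor, which I have not verified on this dataset, is that Euclidean distance on raw counts is dominated by post length rather than topic. The sketch below uses made-up count vectors over a hypothetical four-word vocabulary to show the effect, and how L2-normalizing each vector changes the picture.

```python
import math

# Toy count vectors over the hypothetical vocabulary [orbit, launch, puck, goal]
short_space = [1, 1, 0, 0]
long_space = [20, 20, 0, 0]   # same topic, but a much longer post
short_hockey = [0, 0, 1, 1]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Raw counts: the short space post sits closer to the hockey post
raw_same_topic = math.dist(short_space, long_space)
raw_other_topic = math.dist(short_space, short_hockey)
print(raw_same_topic > raw_other_topic)  # True

# Normalized: the two space posts coincide and hockey is farther away
norm_same_topic = math.dist(l2_normalize(short_space), l2_normalize(long_space))
norm_other_topic = math.dist(l2_normalize(short_space), l2_normalize(short_hockey))
print(norm_same_topic < norm_other_topic)  # True
```

If length is indeed the issue here, a few very long posts could end up in their own clusters while everything else falls into one, which would be consistent with the low homogeneity score.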