# Preface

In [13]:
# IMPORTS
import os
import string
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Task 5.1: News Article - Preprocessing


As a first step, we process news articles so that clustering algorithms can be applied to them. The news article corpus `corpus-30docs` contains 30 news articles from the domains of science, religion and politics (10 documents per domain). Import the documents into your process and vectorize them using stopword removal and stemming (Porter). If necessary, convert all tokens to lower case. How large are the resulting document vectors?

In [91]:
# Prep work
data_dir = "../data/"
doc_dirs = ["sci.space", "soc.religion.christian", "talk.politics.guns"]
doc_paths = []

# Get a list of all (relative) filepaths
for d in doc_dirs:
    for f in os.listdir(os.path.join(data_dir, d)):
        doc_paths.append(os.path.join(data_dir, d, f))
    
# Create a custom Porter Stemmer that suits sklearn
class PortStem(object):
    def __init__(self):
        self.ps = PorterStemmer()
    def __call__(self, doc):
        return [self.ps.stem(word) for word in word_tokenize(doc)]

# Initiate a new vectorizer
vectorizer = CountVectorizer(input="filename",
                             stop_words="english", 
                             tokenizer=PortStem())

# creating the bag of words
bow = vectorizer.fit_transform(doc_paths)

# evaluate num of features
print(bow.shape)

(30, 3027)


The resulting bag-of-words matrix has **3027** features (30 documents × 3027 terms). However, the combination of `PorterStemmer()` and `word_tokenize` does not remove punctuation or numbers, so these end up as features. Therefore a more advanced tokenizer is used that:
1. replaces punctuation and digits with blanks
2. stems the remaining tokens with the Porter algorithm

In [116]:
# Create a more advanced tokenizer
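class PortStemNoPunctNum(object):
    def __init__(self):
        self.ps = PorterStemmer()
        # Map every punctuation character and digit to a blank
        remove_chars = string.punctuation + string.digits
        self.table = str.maketrans(remove_chars, " " * len(remove_chars))
    def __call__(self, doc):
        # Blank out punctuation/digits, then tokenize and stem
        return [self.ps.stem(word)
                for word in word_tokenize(doc.translate(self.table))]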

In [120]:
# Initiate a new (advanced) vectorizer
vectorizer_adv = CountVectorizer(input="filename",
                                 stop_words="english",
                                 tokenizer=PortStemNoPunctNum())

# creating the bag of words
bow_adv = vectorizer_adv.fit_transform(doc_paths)

# evaluate num of features
print(bow_adv.shape)

(30, 2588)


The new bag-of-words no longer contains features with punctuation (``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``) or digits. **The overall number of features is 2588.**
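
The effect of the translation step can be checked on a single string. The following is a minimal sketch using only the standard library; the sample sentence and the helper name `blank_punct_and_digits` are made up for illustration, and plain `str.split` stands in for `word_tokenize`.

```python
import string

# Characters to blank out: all ASCII punctuation plus the ten digits
REMOVE_CHARS = string.punctuation + string.digits
TABLE = str.maketrans(REMOVE_CHARS, " " * len(REMOVE_CHARS))

def blank_punct_and_digits(text):
    """Replace punctuation and digits with blanks, then split on whitespace."""
    return text.translate(TABLE).split()

# Hypothetical sample sentence, not taken from the corpus
tokens = blank_punct_and_digits("Apollo-11 landed in 1969, didn't it?")
print(tokens)  # ['Apollo', 'landed', 'in', 'didn', 't', 'it']
```

Note that contractions are split as a side effect ("didn't" becomes "didn" and "t"), so some short fragments can remain as tokens.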

---
# Task 5.2: News Article - Clustering
Having imported the documents and transformed them into vectors, apply a k-means clustering using k = 3. How many documents ended up in the wrong cluster? What can you do to improve the clustering? 
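
Because k-means assigns arbitrary cluster ids, counting the documents in the wrong cluster requires matching clusters to domains first. One possible way, sketched here under the assumption that there are at most as many clusters as domains, is to try every assignment of domains to clusters and keep the one with the fewest errors. The function name `count_misclustered` and the example labels are hypothetical and not results from the corpus.

```python
from itertools import permutations

def count_misclustered(true_labels, cluster_ids):
    """Smallest number of mismatches over all one-to-one cluster-to-label mappings."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes; no one-to-one mapping exists")
    best = len(true_labels)
    # Try every way of naming the clusters after distinct classes
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        errors = sum(1 for t, c in zip(true_labels, cluster_ids) if mapping[c] != t)
        best = min(best, errors)
    return best

# Made-up example: 4 documents per domain, two of them in the wrong cluster
true_labels = ["sci"] * 4 + ["rel"] * 4 + ["gun"] * 4
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 2]
print(count_misclustered(true_labels, cluster_ids))  # 2
```

With k = 3 there are only six mappings to test, so the brute-force search is cheap; for many clusters a dedicated assignment algorithm would be the better choice.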

---
# Task 5.3: News Article - Improved Clustering
Does the distribution of frequent terms help you to further improve the clustering by using any of the prune methods of the Process Documents from File operator?
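
Pruning by document frequency drops terms that occur in too few or too many documents. The idea can be sketched in plain Python as follows; the function name `prune_by_document_frequency` and the toy documents are assumptions for illustration, and the thresholds are absolute document counts. In scikit-learn, the `min_df` and `max_df` parameters of `CountVectorizer` serve a comparable purpose.

```python
from collections import Counter

def prune_by_document_frequency(tokenized_docs, min_df=1, max_df=None):
    """Keep terms whose document frequency lies in [min_df, max_df] (absolute counts)."""
    if max_df is None:
        max_df = len(tokenized_docs)
    # Count in how many documents each term occurs (set() ignores repeats within a doc)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term for term, df in doc_freq.items() if min_df <= df <= max_df}

# Toy documents, not taken from the corpus
docs = [
    ["space", "orbit", "god"],
    ["space", "gun"],
    ["space", "god", "church"],
]
vocabulary = prune_by_document_frequency(docs, min_df=2, max_df=2)
print(sorted(vocabulary))  # ['god']
```

Here "space" is dropped because it occurs in every document and therefore cannot separate the domains, while "orbit", "gun" and "church" are dropped as too rare.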

---
# Task 5.4: Job Postings - Preprocessing
In a second step, we will focus on the classification of job postings. The jobpostings.xls file contains 500 descriptions of job postings belonging to 30 different job categories such as sales and real estate. Our main goal is to learn a classification model that is capable of predicting the correct category for a new job posting. Therefore, import the data into RapidMiner using the Read Excel and the Process Document from Data operators. Convert the textual description into a vector by applying tokenization and other preprocessing steps. In order to learn a good classification model, have a look at the generated attributes and set up basic preprocessing that removes noisy and misleading tokens.

---
# Task 5.5: Job Postings - Classification
What levels of accuracy can you reach by applying different classification methods and preprocessing settings?