# Q2.1. Reuters Dataset 

## Task 2.1.1

Reuters-21578 is a collection for text categorization research. We want to find out how many documents and categories the dataset contains and how many documents fall in each category, reporting the mean, standard deviation, minimum and maximum.
We'll use the `ReutersParser` class (the code is from the "out-of-core classification" tutorial of scikit-learn).

In [8]:
from __future__ import print_function
from glob import glob
import os.path
import re
import tarfile
import sys
from bs4 import BeautifulSoup as bs
import nltk
import numpy as np
import collections
# sklearn.externals.six was removed from recent scikit-learn releases;
# the Python 3 standard library provides the same names.
from html import parser as html_parser
from urllib.request import urlretrieve
from sklearn.datasets import get_data_home



def _not_in_sphinx():
    # Hack to detect whether we are running by the sphinx builder
    return '__file__' in globals()


class ReutersParser(html_parser.HTMLParser):
    """Utility class to parse a SGML file and yield documents one at a time."""

    def __init__(self, encoding='latin-1'):
        html_parser.HTMLParser.__init__(self)
        self._reset()
        self.encoding = encoding

    def handle_starttag(self, tag, attrs):
        method = 'start_' + tag
        getattr(self, method, lambda x: None)(attrs)

    def handle_endtag(self, tag):
        method = 'end_' + tag
        getattr(self, method, lambda: None)()

    def _reset(self):
        self.in_title = 0
        self.in_body = 0
        self.in_topics = 0
        self.in_topic_d = 0
        self.title = ""
        self.body = ""
        self.topics = []
        self.topic_d = ""

    def parse(self, fd):
        self.docs = []
        for chunk in fd:
            self.feed(chunk.decode(self.encoding))
            for doc in self.docs:
                yield doc
            self.docs = []
        self.close()

    def handle_data(self, data):
        if self.in_body:
            self.body += data
        elif self.in_title:
            self.title += data
        elif self.in_topic_d:
            self.topic_d += data

    def start_reuters(self, attributes):
        pass

    def end_reuters(self):
        self.body = re.sub(r'\s+', r' ', self.body)
        self.docs.append({'title': self.title,
                          'body': self.body,
                          'topics': self.topics})
        self._reset()

    def start_title(self, attributes):
        self.in_title = 1

    def end_title(self):
        self.in_title = 0

    def start_body(self, attributes):
        self.in_body = 1

    def end_body(self):
        self.in_body = 0

    def start_topics(self, attributes):
        self.in_topics = 1

    def end_topics(self):
        self.in_topics = 0

    def start_d(self, attributes):
        self.in_topic_d = 1

    def end_d(self):
        self.in_topic_d = 0
        self.topics.append(self.topic_d)
        self.topic_d = ""


def stream_reuters_documents(data_path=None):
    DOWNLOAD_URL = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                    'reuters21578-mld/reuters21578.tar.gz')
    ARCHIVE_FILENAME = 'reuters21578.tar.gz'

    if data_path is None:
        data_path = os.path.join(get_data_home(), "reuters")
    if not os.path.exists(data_path):
        """Download the dataset."""
        print("downloading dataset (once and for all) into %s" %
              data_path)
        os.mkdir(data_path)

        def progress(blocknum, bs, size):
            total_sz_mb = '%.2f MB' % (size / 1e6)
            current_sz_mb = '%.2f MB' % ((blocknum * bs) / 1e6)
            if _not_in_sphinx():
                sys.stdout.write(
                    '\rdownloaded %s / %s' % (current_sz_mb, total_sz_mb))

        archive_path = os.path.join(data_path, ARCHIVE_FILENAME)
        urlretrieve(DOWNLOAD_URL, filename=archive_path,
                    reporthook=progress)
        if _not_in_sphinx():
            sys.stdout.write('\r')
        print("untarring Reuters dataset...")
        tarfile.open(archive_path, 'r:gz').extractall(data_path)
        print("done.")

    parser = ReutersParser()
    for filename in glob(os.path.join(data_path, "*.sgm")):
        for doc in parser.parse(open(filename, 'rb')):
            yield doc



Now we can use the parser to load the document labels into a list. The i-th element of the list is itself a list containing the labels of the i-th document:

In [9]:
data_stream = stream_reuters_documents()


def get_docs_topics(doc_iter):
    # Return one list of labels per document
    # (an empty list if the iterator yields no documents).
    return [doc['topics'] for doc in doc_iter]

Next, we create a dictionary with the categories as keys and the number of documents labeled with each category as values.

In [10]:
topics = get_docs_topics(data_stream)
# Count the number of documents labeled with each category.
topics_dict = collections.Counter(
    top for doc_topics in topics for top in doc_topics)


Now, we can collect some statistics:

In [11]:
print("Number of documents is: " + str(len(topics)))
print("Number of categories is: " + str(len(topics_dict)))
print("Documents per categories: " + str((len(topics))/(len(topics_dict))))
# Per-category document counts as a NumPy array.
np_topics = np.array(list(topics_dict.values()))

print("Mean of docs per categories: " + str(np.mean(np_topics)))
print("STD of docs per categories: " + str(np.std(np_topics)))

topics_dict = collections.Counter(topics_dict)
print("Max category: " + str(topics_dict.most_common(1)))

print("Min category: " + str(topics_dict.most_common()[-1]))

Number of documents is: 21578
Number of categories is: 445
Documents per categories: 48.48988764044944
Mean of docs per categories: 89.87191011235954
STD of docs per categories: 643.9321684195976
Max category: [('usa', 12542)]
Min category: ('stich', 1)

Note that the "documents per categories" ratio (number of documents divided by number of categories) differs from the mean because a document may carry several labels or none. The parser also appends every `<D>` element it encounters, not only those inside `<TOPICS>`, so the 445 categories include values such as the place `usa`.

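As a minimal sketch of the statistics gathered for this task, the snippet below repeats the same computation on a tiny hand-made label list using only the standard library. The `toy_topics` data and all variable names are hypothetical and are not taken from the dataset. It illustrates why the plain ratio of documents to categories need not equal the mean number of documents per category when documents are multi-labeled or unlabeled.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical toy labels: one list of categories per document.
toy_topics = [["usa", "grain"], ["usa"], [], ["usa", "grain", "wheat"]]

# Count how many documents carry each category.
counts = Counter(label for doc in toy_topics for label in doc)
values = list(counts.values())

n_docs = len(toy_topics)
n_categories = len(counts)
docs_per_category = n_docs / n_categories  # plain ratio of the two totals
mean_per_category = mean(values)           # average number of docs per category
std_per_category = pstdev(values)          # population std, like np.std's default

largest = counts.most_common(1)[0]         # most frequent category
smallest = counts.most_common()[-1]        # least frequent category

print(n_docs, n_categories, docs_per_category)
print(mean_per_category, std_per_category)
print(largest, smallest)
```

`pstdev` is used here because `np.std` computes the population standard deviation by default; `statistics.stdev` would give the sample version instead.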

## Task 2.1.2
We want to explore how many characters and words are present in the documents of the dataset. For this purpose, we collect the contents (title and body) of the documents that have at least one label, concatenate them and strip any remaining tags.

In [12]:
data_stream = stream_reuters_documents()

def get_docs_body(doc_iter):
    # Keep the title and body of every document that has at least one label
    # (an empty list if no document qualifies).
    return [u'{title}\n\n{body}'.format(**doc)
            for doc in doc_iter
            if doc['topics']]


texts = get_docs_body(data_stream)
# Concatenate all documents into a single string.
all_text = "".join(texts)

clean_text = bs(all_text).get_text()
words = nltk.word_tokenize(clean_text)


Now, we count the total number of words in our concatenated string, and the total number of distinct words. 
We do the same for characters. 

In [13]:
print("Number of words in the dataset: " + str(len(words)))
print("Number of different words in the dataset: " + str(len(set(words))))

# Flatten the tokens into a list of characters (whitespace is not counted).
ch_list = [ch for w in words for ch in w]

print("Number of chars in the dataset: " + str(len(ch_list)))
print("Number of different chars in the dataset: " + str(len(set(ch_list))))

Number of words in the dataset: 2881884
Number of different words in the dataset: 83405
Number of chars in the dataset: 13213960
Number of different chars in the dataset: 89


## Task 2.1.3

The classifiers discussed in the tutorial code that support the `partial_fit` method are:
1. SGD Classifier
2. Perceptron
3. Multinomial Naive Bayes
4. Passive-Aggressive Classifier

The `partial_fit` method allows mini-batch learning. It is useful when the dataset is too big to fit in memory: we fit in smaller batches, and each batch updates the existing classifier. In the tutorial code, the learning algorithm takes 1000 documents at each step and then calls `partial_fit`. Each new batch refines the classifier, which is reflected in the charts at the end of the tutorial: the more examples the classifier has seen (and therefore the more batches), the better the accuracy.
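
The mini-batch idea can be sketched without scikit-learn. The class below is a hypothetical, pure-Python binary perceptron whose `partial_fit` updates the weights it already holds using only the batch it is given. `TinyPerceptron`, `iter_minibatches` and the toy `stream` are illustrative assumptions, not the tutorial's code or scikit-learn's implementation.

```python
class TinyPerceptron:
    """Hypothetical binary perceptron with an incremental partial_fit."""

    def __init__(self):
        self.weights = {}   # feature name -> weight
        self.bias = 0.0

    def _score(self, features):
        return self.bias + sum(
            self.weights.get(name, 0.0) * value
            for name, value in features.items())

    def predict(self, features):
        return 1 if self._score(features) > 0 else -1

    def partial_fit(self, batch_x, batch_y):
        # Update the existing weights using only this mini-batch.
        for features, label in zip(batch_x, batch_y):
            if self.predict(features) != label:
                for name, value in features.items():
                    self.weights[name] = (
                        self.weights.get(name, 0.0) + label * value)
                self.bias += label
        return self


def iter_minibatches(samples, size):
    """Yield consecutive slices of `samples` with at most `size` items."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


# Hypothetical toy stream of (bag-of-words, label) pairs.
stream = [
    ({"wheat": 1, "crop": 1}, 1),
    ({"bank": 1, "rate": 1}, -1),
    ({"wheat": 1, "export": 1}, 1),
    ({"rate": 1, "loan": 1}, -1),
]

clf = TinyPerceptron()
for batch in iter_minibatches(stream, size=2):
    xs, ys = zip(*batch)
    clf.partial_fit(xs, ys)   # the model is extended, never rebuilt

print(clf.predict({"wheat": 1}), clf.predict({"rate": 1}))
```

Only one batch is held in memory at a time, which is the property that makes this style of training suitable for data that cannot be loaded at once.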

## Task 2.1.4

The hashing vectorizer converts a collection of text documents to a matrix of token occurrences. It applies a hash function to each token to obtain its feature index, instead of storing a token-to-index vocabulary. Hashing is an efficient way of mapping terms to features, and because the mapping is stateless and the number of features is fixed in advance, the hashing vectorizer can be used in a streaming (`partial_fit`) setting: the feature space remains the same over time, at the cost of occasional collisions between tokens.
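
A minimal sketch of the hashing trick, using only the standard library, is shown below. The function `hash_vectorize`, the constant `N_FEATURES` and the choice of `zlib.crc32` are assumptions made for illustration; scikit-learn's `HashingVectorizer` uses a different hash function and offers further options.

```python
import zlib

N_FEATURES = 16  # hypothetical size, deliberately small


def hash_vectorize(text, n_features=N_FEATURES):
    """Map a document to a fixed-length count vector via token hashing."""
    vector = [0] * n_features
    for token in text.lower().split():
        # The index depends only on the token itself, so no vocabulary
        # has to be learned or stored between batches.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


first = hash_vectorize("wheat exports rise")
second = hash_vectorize("a brand new unseen vocabulary appears later")

# Both vectors have the same length, even though the second document
# contains only tokens that the first one never mentioned.
print(len(first), len(second))
```

Because distinct tokens can land on the same index, a small `n_features` increases collisions; a larger value trades memory for fewer of them.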