# Classification and Clustering: Practicing with Real Data

In this lab, we consider the 20 newsgroups text dataset from [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

In [3]:
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint

# Training set
cats = ['rec.sport.baseball', 'sci.electronics', 'misc.forsale']
train_data = fetch_20newsgroups(subset='train', categories=cats)

## Getting to know your data

In [4]:
print(train_data.target, len(train_data.target))
print(train_data.target[0:10])
print(train_data.target_names)

[0 0 0 ..., 2 0 2] 1773
[0 0 0 1 2 1 2 2 2 1]
['misc.forsale', 'rec.sport.baseball', 'sci.electronics']
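
Counting the labels per class is a quick way to see how balanced the categories are. The sketch below is only an illustration: it uses the first ten labels printed above as a stand-in for `train_data.target`, and the names `target` and `counts` are assumptions, not part of the lab code.

```python
import numpy as np

# Stand-in for train_data.target: the first ten labels shown above.
target = np.array([0, 0, 0, 1, 2, 1, 2, 2, 2, 1])
target_names = ['misc.forsale', 'rec.sport.baseball', 'sci.electronics']

# np.bincount counts how often each integer label occurs.
counts = np.bincount(target, minlength=len(target_names))
for name, count in zip(target_names, counts):
    print(f"{name}: {count}")
```

With the real data, replacing `target` by `train_data.target` should give the number of messages per newsgroup.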


In [5]:
print(train_data.target_names[train_data.target[0]])
print()
print(train_data.data[0])

misc.forsale

From: jrwaters@eos.ncsu.edu (JACK ROGERS WATERS)
Subject: Portable Color Television For Sale
Organization: North Carolina State University, Project Eos
Lines: 17


Hello Everyone,

    I have a Casio TV-470 LCD Color Television for sale.  It
is in mint condition.  Retail is $199 but I'm looking to
get about 1/2 of that for it, tops.  Highest bidder in 
a week gets it, assuming the highest bidder is at least $60.

TV comes with black case and uses 4 AA batteries.  They also
sell AC adaptor.  It has external jack for phones and external
antenna, etc.  The picture is very good and it has electronic
tuning so you don't have to screw with tuning a picture in, etc.
I have the box and all documentation.  This has seen less than
3 hours use as I have all but sworn off TV.

Best Regards
Jack Waters II



### Data Preprocessing

In a large text corpus, some words occur very frequently (e.g. “the”, “a”, “is” in English) and therefore carry very little meaningful information about the actual contents of a document. If we fed the raw counts directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.

To re-weight the count features into floating-point values suitable for a classifier, it is very common to use the tf–idf transform.

Tf means term frequency, while tf–idf means term frequency times inverse document frequency. It was originally a term-weighting scheme developed for information retrieval (as a ranking function for search engine results) and has also found good use in document classification and clustering.

$$\text{tf}(t,d) = \text{Number of times term }t \text{ occurs in document } d$$

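If $N$ is the total number of documents in the corpus $D$, then the inverse document frequency is the logarithm of the ratio between $N$ and the number of documents that contain $t$:

$$\text{idf}(t,D)=\log\frac{N}{|\{d\in D\mid t\in d \}|}$$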

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,D)$$
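
These quantities can be checked by hand on a very small corpus. The following is a minimal sketch, not the scikit-learn implementation: the corpus and the helper names `tf`, `df`, `idf` and `tf_idf` are made up for illustration, and idf is taken as the natural logarithm of $N$ divided by the document frequency.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the newsgroups data).
corpus = [
    "the tv is for sale",
    "the pitcher threw the ball",
    "the circuit needs a resistor",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf(term, doc):
    """Number of times `term` occurs in the tokenised document `doc`."""
    return Counter(doc)[term]

def df(term):
    """Number of documents that contain `term` at least once."""
    return sum(term in doc for doc in docs)

def idf(term):
    """Logarithmic inverse document frequency, log(N / df)."""
    return math.log(N / df(term))

def tf_idf(term, doc):
    """Term frequency weighted by inverse document frequency."""
    return tf(term, doc) * idf(term)

print(tf_idf("the", docs[1]))      # "the" occurs in every document
print(tf_idf("pitcher", docs[1]))  # "pitcher" occurs in one document only
```

A word present in every document gets weight zero, while a word confined to one document keeps a positive weight. By default `TfidfVectorizer` uses a smoothed idf and normalises each document vector, so its values will differ from this hand computation.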

TF-IDF for text documents: [scikit-learn feature extraction guide](http://scikit-learn.org/stable/modules/feature_extraction.html)

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', min_df=0.01, max_df=0.8)
text_train_data = vectorizer.fit_transform(train_data.data)

Explain each parameter passed to [`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) in the cell above (`stop_words`, `min_df`, `max_df`).

In [7]:
print(len(train_data.data))
print(text_train_data.shape)

1773
(1773, 1564)
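
The number of columns is the size of the vocabulary that survives the filtering done by `stop_words`, `min_df` and `max_df`. One way to build intuition is to fit the vectorizer on a tiny corpus and inspect the resulting vocabulary. The corpus and the helper `vocabulary` below are hypothetical and only meant as a sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "television" is in every document,
# "casio" and "antenna" each appear in only one.
corpus = [
    "the casio television is for sale",
    "the television has an antenna",
    "this television is in mint condition",
    "television for sale in mint condition",
]

def vocabulary(**params):
    """Fit a TfidfVectorizer with the given parameters and return its sorted terms."""
    vectorizer = TfidfVectorizer(**params)
    vectorizer.fit(corpus)
    return sorted(vectorizer.vocabulary_)

# Default settings: every token of two or more characters is kept.
print(vocabulary())
# stop_words='english' drops common English words such as "the".
print(vocabulary(stop_words='english'))
# An integer min_df keeps only terms found in at least that many documents.
print(vocabulary(min_df=2))
# A float max_df drops terms found in more than that fraction of documents.
print(vocabulary(max_df=0.8))
```

In both `min_df` and `max_df`, a float is read as a proportion of documents and an integer as an absolute document count.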


## Clustering

Use clustering techniques to identify groups in this dataset. Assume that the number of groups is unknown, then compare the results with the known groups.

In [1]:
# ...
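
One possible approach, shown here only as a sketch on a hypothetical toy corpus, is to run k-means for several values of k, keep the k with the highest silhouette score, and then measure agreement with the known groups using the adjusted Rand index. The choice of k-means and of the silhouette criterion is an assumption. Other algorithms (hierarchical clustering, DBSCAN) or other selection criteria may suit the real data better.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy corpus (hypothetical) standing in for train_data.data,
# with true_labels standing in for train_data.target.
corpus = [
    "pitcher threw the ball in the ninth inning",
    "the batter hit the ball over the fence",
    "the pitcher struck out the batter this inning",
    "resistor and capacitor on the circuit board",
    "the circuit needs a larger capacitor",
    "solder the resistor to the circuit board",
    "television for sale in mint condition",
    "selling a bike, mint condition, best offer",
    "printer for sale, best offer takes it",
]
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

X = TfidfVectorizer(stop_words='english').fit_transform(corpus)

# Try several values of k and keep the one with the highest silhouette score.
candidate_ks = [2, 3, 4, 5]
scores = {}
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# Refit with the selected k and compare against the known groups.
pred_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(true_labels, pred_labels)
print(f"selected k = {best_k}, adjusted Rand index = {ari:.2f}")
```

The adjusted Rand index does not depend on how the clusters are numbered, which matters here because cluster ids are arbitrary and need not match the category ids.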

## Classification

Build two classifiers (a decision tree and naive Bayes) that predict the category of a given message.

In [2]:
# ...
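
As a starting point, the two classifiers could be wrapped in pipelines so that raw text goes in and a category comes out. This is a sketch under assumptions: the toy messages are invented, `MultinomialNB` is chosen as the naive Bayes variant, and default hyperparameters are used throughout.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy training set (hypothetical) standing in for train_data.data / train_data.target.
train_texts = [
    "television for sale in mint condition",
    "bike for sale, best offer",
    "printer for sale with original box",
    "the pitcher threw the ball in the ninth inning",
    "the batter hit the ball over the fence",
    "the catcher dropped the ball at home plate",
    "resistor and capacitor on the circuit board",
    "the circuit needs a larger capacitor",
    "solder the resistor onto the circuit board",
]
train_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target_names = ['misc.forsale', 'rec.sport.baseball', 'sci.electronics']

# Each pipeline vectorizes the text and then applies one classifier.
tree_clf = make_pipeline(TfidfVectorizer(stop_words='english'),
                         DecisionTreeClassifier(random_state=0))
nb_clf = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())

tree_clf.fit(train_texts, train_labels)
nb_clf.fit(train_texts, train_labels)

# Predict the category of a new, unseen message.
message = ["the pitcher threw the ball"]
for name, clf in [("decision tree", tree_clf), ("naive Bayes", nb_clf)]:
    predicted = clf.predict(message)[0]
    print(name, "->", target_names[predicted])
```

With the real data, `train_texts` and `train_labels` would be replaced by `train_data.data` and `train_data.target`.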

Compare the performance of these two classifiers on the test set.

In [3]:
# ...
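
A comparison could look like the sketch below, which reports accuracy and a confusion matrix for each classifier. The toy train and test messages are hypothetical. With the real dataset, the test split can be loaded with `fetch_20newsgroups(subset='test', categories=cats)`. The vectorizer is fitted on the training set only and then reused on the test set, so that both matrices share the same columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical); labels follow the order of train_data.target_names.
train_texts = [
    "television for sale in mint condition",
    "bike for sale, best offer",
    "the pitcher threw the ball in the ninth inning",
    "the batter hit the ball over the fence",
    "resistor and capacitor on the circuit board",
    "the circuit needs a larger capacitor",
]
train_labels = [0, 0, 1, 1, 2, 2]
test_texts = [
    "printer for sale, make an offer",
    "the pitcher hit the batter with the ball",
    "a capacitor failed on the circuit board",
]
test_labels = [0, 1, 2]

# Fit the vocabulary and idf weights on the training set only.
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": MultinomialNB(),
}
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)
    predictions = clf.predict(X_test)
    accuracies[name] = accuracy_score(test_labels, predictions)
    print(name, "accuracy:", accuracies[name])
    # Rows are true classes, columns are predicted classes.
    print(confusion_matrix(test_labels, predictions, labels=[0, 1, 2]))
```

Accuracy alone can hide class-specific errors, which is why the confusion matrix is printed alongside it. `sklearn.metrics.classification_report` gives per-class precision and recall if more detail is needed.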