# Python machine learning classification tutorial
This tutorial explains the basic concepts necessary for using scikit-learn's classification tools.

## 1. Set-up
The `CountVectorizer` class implements both tokenization and occurrence counting.

In [56]:
from sklearn.feature_extraction.text import CountVectorizer

texts = [u'Hello I\'m a text and I am the first text in this collection of texts',
         u'And I am the second one. I like olives.',
         u'I hate olives. I am the grumpy third text.']

vect = CountVectorizer()
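
To see what the tokenization step does on its own, the sketch below calls the vectorizer's `build_analyzer()` method on one of the texts. It assumes the default settings, which lowercase the input and drop single-character tokens such as "I"; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that the vectorizer uses
# internally to turn a raw string into a list of tokens.
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("I hate olives. I am the grumpy third text.")
print(tokens)
# ['hate', 'olives', 'am', 'the', 'grumpy', 'third', 'text']
```

Note that "I" is missing from the result: with the default token pattern, only tokens of two or more word characters are kept.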

## 2. Fit and transform the data
The `fit_transform` method combines the `fit` and `transform` methods. It updates the vectorizer in place with the learned vocabulary and returns a sparse matrix of token counts.

* `fit`: Learn a vocabulary dictionary of all tokens in the raw documents.
* `transform`: Transform documents to a document-term matrix.

In [57]:
X = vect.fit_transform(texts)
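
As a rough check of how the two steps relate, the following sketch calls `fit` and `transform` separately and compares the result with `fit_transform`. The extra sentence "I like pickles" is a made-up example used to show that tokens outside the learned vocabulary are ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello I'm a text and I am the first text in this collection of texts",
         "And I am the second one. I like olives.",
         "I hate olives. I am the grumpy third text."]

# One step: learn the vocabulary and build the matrix.
X_a = CountVectorizer().fit_transform(texts)

# Two steps: learn the vocabulary first, then build the matrix.
vect_b = CountVectorizer()
vect_b.fit(texts)
X_b = vect_b.transform(texts)

same = bool((X_a.toarray() == X_b.toarray()).all())
print(same)

# A new document is encoded with the existing vocabulary;
# "pickles" was never seen during fit, so only "like" is counted.
new_row = vect_b.transform(["I like pickles"]).toarray()
print(new_row.sum())
```

Splitting the steps matters later on, when the vocabulary is learned from training data and then reused to transform unseen documents.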

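We can call `toarray()` on the matrix to better understand how the tokens are stored. Each row represents one document and contains as many elements as there are unique tokens in the data. Each element is the number of times that token occurs in the document, so an absent token is 0 and a repeated token is greater than 1 (note the 2 for "text" in the first row).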

In [58]:
X.toarray()

array([[1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 2, 1, 1, 0, 1],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0]])
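
To make the meaning of each row concrete, here is a minimal pure-Python sketch that rebuilds the same counts without scikit-learn. The `tokenize` helper and the regular expression are assumptions chosen to mimic the vectorizer's default behaviour (lowercasing, tokens of at least two word characters); this is not how the library is implemented internally.

```python
import re
from collections import Counter

texts = ["Hello I'm a text and I am the first text in this collection of texts",
         "And I am the second one. I like olives.",
         "I hate olives. I am the grumpy third text."]

def tokenize(text):
    # Hypothetical helper: lowercase, keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# The columns are the unique tokens in alphabetical order.
vocabulary = sorted(set(tok for text in texts for tok in tokenize(text)))

# Each row holds the count of every vocabulary token in one document.
rows = []
for text in texts:
    counts = Counter(tokenize(text))
    rows.append([counts[tok] for tok in vocabulary])

for row in rows:
    print(row)
```

The three printed rows should match the array above, including the 2 in the column for "text" in the first document.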

## 3. Vectorizer attributes and methods after fit_transform

### Vocabulary
The fitted vectorizer has a dict, `vocabulary_`, that maps each token to its column index in the document-term matrix.

In [21]:
print(vect.vocabulary_)

{u'and': 1, u'text': 13, u'am': 0, u'collection': 2, u'one': 11, u'texts': 14, u'second': 12, u'in': 7, u'hate': 5, u'the': 15, u'olives': 10, u'like': 8, u'third': 16, u'this': 17, u'of': 9, u'grumpy': 4, u'hello': 6, u'first': 3}


### Feature names
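The `get_feature_names` method returns a list of all the unique tokens, ordered by column index. In scikit-learn 1.0 and later it is replaced by `get_feature_names_out`, which returns a NumPy array instead of a list.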

In [20]:
print(vect.get_feature_names())

[u'am', u'and', u'collection', u'first', u'grumpy', u'hate', u'hello', u'in', u'like', u'of', u'olives', u'one', u'second', u'text', u'texts', u'the', u'third', u'this']
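
The vocabulary and the feature names describe the same mapping from two directions. The sketch below, which uses "olives" as an arbitrary example token, looks up a column index in `vocabulary_`, reads that column from the count matrix, and rebuilds the ordered token list by sorting the dict by index.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello I'm a text and I am the first text in this collection of texts",
         "And I am the second one. I like olives.",
         "I hate olives. I am the grumpy third text."]

vect = CountVectorizer()
X = vect.fit_transform(texts).toarray()

# Token -> column index.
col = vect.vocabulary_["olives"]
print(col)

# Counts of "olives" in each of the three documents.
olive_counts = X[:, col].tolist()
print(olive_counts)

# Column index -> token: sort the tokens by their index.
names = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(names[col])
```

Sorting the dict by value avoids depending on which feature-name method the installed scikit-learn version provides.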


## 4. Compute frequencies
To weight the tokens in each document, we use a computation called _term frequency times inverse document frequency_ (tf-idf). It computes how often each word occurs in each document, then scales that frequency down for words that appear in many of the documents.

In [60]:
from sklearn.feature_extraction.text import TfidfTransformer
X_tfidf = TfidfTransformer().fit_transform(X)
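
To show what this weighting amounts to, the sketch below recomputes the values by hand with NumPy and compares them with the transformer's output. It assumes the default `TfidfTransformer` settings: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. With other parameter values the formula differs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = ["Hello I'm a text and I am the first text in this collection of texts",
         "And I am the second one. I like olives.",
         "I hate olives. I am the grumpy third text."]

X = CountVectorizer().fit_transform(texts)
counts = X.toarray()

n_docs = counts.shape[0]
# Document frequency: in how many documents does each token occur?
df = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency (the assumed default formula).
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Scale the raw counts, then normalize each row to unit length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

X_tfidf = TfidfTransformer().fit_transform(X)
matches = bool(np.allclose(manual, X_tfidf.toarray()))
print(matches)
```

Tokens such as "am" and "the" occur in all three documents, so they receive the smallest idf factor, while tokens unique to one document are weighted up.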

## 5. T