# Contents
1) To compute `tf-idf`, consider a corpus $D$ with documents $d$ and terms $t$. The term frequency is:

$$tf(t,d) = \frac{f_{t,d}}{\sum_{\tau\in V} f_{\tau, d}}$$

Here $V$ is the vocabulary and $f_{t,d}$ is the number of times the term $t$ occurs in $d$. The inverse document frequency is
$$idf(t,D) = \ln\left( \frac{N}{|\{ d\in D:\ t\in d \}|}\right)$$
where $N=|D|$ is the number of documents in the corpus. Finally, $tfidf(t,d,D) = tf(t,d)\cdot idf(t,D)$.
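
The definitions above can be sketched directly in plain Python. This is only an illustration: the toy corpus `docs` and the helper names `tf`, `idf`, and `tfidf` are made up for the example, and the documents are assumed to be tokenized already.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Relative frequency of `term` in one tokenized document."""
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)

def idf(term, corpus_tokens):
    """ln(N / number of documents that contain `term`)."""
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    # Note: df == 0 (term absent from the corpus) would divide by zero.
    return math.log(n_docs / df)

def tfidf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
    ["the", "bird", "flew"],
]
score_cat = tfidf("cat", docs[0], docs)  # (1/3) * ln(3/1)
score_the = tfidf("the", docs[0], docs)  # (1/3) * ln(3/3) = 0
```

A term that occurs in every document, such as `"the"` here, gets an idf of zero under this definition, so its tf-idf vanishes regardless of how often it appears.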

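In sklearn's `TfidfVectorizer` with `norm=None` (and the default `smooth_idf=True`), the raw count is used as the term frequency and the idf is smoothed:
$$tfidf(t, d,D) = f_{t, d}\cdot\left(1+\ln\left(\frac{N+1}{df(t)+1}\right)\right)$$
where $df(t) = |\{ d\in D:\ t\in d \}|$ is the number of documents containing $t$. With `norm="l1"` each row is normalized, that is, each entry is divided by the sum of the tf-idf entries in its row (not by the number of terms in the document).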

## Workflow
1) Get the promath data, which has the following directory layout:

     - promath
     
       - math10
       
         - 1003_001.tar.gz
         - 1003_002.tar.gz
         - ...
         
  The function should take a file such as `1003_001.tar.gz` and a _list of phrases_, and output a file with the format:
```xml
<root>
  <article name="culito.tx">
      <para num="1"> text </para>
      <para num="2"> text </para>
  </article>
</root>
```
The text inside each `para` tag has been cleaned, tokenized, and re-joined.
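
One possible shape for such a function, as a rough sketch only. The names `archive_to_xml` and `clean_and_tokenize` are hypothetical, and several things are assumed rather than known: that the archive members are plain-text files, that paragraphs are separated by blank lines, and that the list of phrases is meant to keep multi-word phrases together as single tokens. The real cleaning step would replace the placeholder here.

```python
import re
import tarfile
import xml.etree.ElementTree as ET

def clean_and_tokenize(text, phrases):
    """Placeholder cleaning: lowercase, glue known phrases with '_', keep word tokens."""
    text = text.lower()
    for phrase in phrases:
        phrase = phrase.lower()
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return " ".join(re.findall(r"\w+", text))

def archive_to_xml(tar_path, phrases, out_path):
    """Write one <article> per file in the archive and one <para> per paragraph."""
    root = ET.Element("root")
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            raw = tar.extractfile(member).read().decode("utf-8", errors="replace")
            article = ET.SubElement(root, "article", name=member.name)
            # Assumption: paragraphs are separated by blank lines.
            paragraphs = [p for p in re.split(r"\n\s*\n", raw) if p.strip()]
            for num, para in enumerate(paragraphs, start=1):
                node = ET.SubElement(article, "para", num=str(num))
                node.text = clean_and_tokenize(para, phrases)
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
```

Building the tree with `ElementTree` rather than string concatenation means attribute values are quoted and special characters in the text are escaped automatically.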

In [6]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [44]:
vect = TfidfVectorizer(norm='l1')
X = vect.fit_transform(corpus)
vect.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [47]:
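# Compare sklearn's entry for 'document' in the second document with a naive guess,
# (term count / document length) * smoothed idf; the two do not agree.
X[1,1], 2/6*(1+np.log(5/4))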

(0.3322595913573379, 0.40771451710473655)
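
The two numbers differ because `2/6` divides the count by the document length, whereas `norm='l1'` divides the unnormalized tf-idf weight by the sum of all unnormalized tf-idf weights in the row. The sketch below recomputes the entry by hand without sklearn. It assumes the default tokenization amounts to lowercasing and keeping tokens of two or more word characters, and the names `tokenized`, `smoothed_idf`, `raw`, and `manual` are introduced only for this example.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumed tokenization: lowercase, tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
n_docs = len(tokenized)

def smoothed_idf(term):
    """1 + ln((N + 1) / (df(t) + 1)), the smoothed idf from the formula above."""
    df = sum(1 for doc in tokenized if term in doc)
    return 1 + math.log((n_docs + 1) / (df + 1))

# Unnormalized tf-idf weights for the second document (row 1).
counts = Counter(tokenized[1])
raw = {term: count * smoothed_idf(term) for term, count in counts.items()}

# l1 normalization: divide by the sum of the row's tf-idf weights.
row_sum = sum(raw.values())
manual = raw["document"] / row_sum
print(manual)  # close to 0.33226, the value of X[1,1] above
```

Here the row sum is $2(1+\ln\tfrac{5}{4}) + (1+\ln\tfrac{5}{2}) + 3 \approx 7.3626$, so the entry for `document` is $2.4463/7.3626 \approx 0.3323$.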

In [56]:
# The entries of an l1-normalized row sum to 1, up to floating-point error.
sum([X[1,l] for l in range(9)])

0.9999999999999999

In [53]:
X.shape

(4, 9)