# Contents
1) Join more phrases (target 300K)

2) Create the tf-idf transformation.

3) For each article and each term within, annotate each entry with its corresponding tf-idf score.


To compute the `tf-idf`, consider a corpus $D$ with documents $d$ and terms $t$. The term frequency is:
  $$tf(t,d) = \frac{f_{t,d}}{\sum_{\tau\in V} f_{\tau, d}}$$

Here $V$ is the vocabulary and $f_{t,d}$ is the count of the term $t$ in $d$. The inverse document frequency is
$$idf(t,D) = \ln\left( \frac{N}{|\{ d\in D:\ t\in d \}|}\right)$$
where $N=|D|$ is the number of documents in the corpus. Finally, $tfidf(t,d,D) = tf(t,d)\cdot idf(t,D)$.
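
The two definitions above can be sketched in plain Python. This is only an illustration of the textbook formulas on a toy corpus; the function name `tf_idf`, the whitespace tokenization, and the toy documents are assumptions and not part of the pipeline below.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: tf-idf} dict per document, using the textbook formulas."""
    docs = [doc.split() for doc in corpus]   # assumption: whitespace tokenization
    n_docs = len(docs)
    # df[t] = number of documents that contain t at least once
    df = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())         # sum of f_{tau,d} over the vocabulary
        scores.append({t: (c / total) * math.log(n_docs / df[t])
                       for t, c in counts.items()})
    return scores

toy_corpus = ["a b a", "a c", "a b d d"]
scores = tf_idf(toy_corpus)
for row in scores:
    print(row)
```

Note that a term present in every document (here `a`) gets $idf = \ln(1) = 0$, so its score vanishes regardless of how often it occurs.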

In the sklearn module (`TfidfVectorizer` with `norm = None` and the default `smooth_idf = True`),
$$tfidf(t, d,D) = f_{t, d}\cdot\left(1+\ln\left(\frac{N+1}{df(t)+1}\right)\right)$$
where $df(t)$ is the number of documents in which $t$ appears.

With `norm = "l1"` each row is normalized, that is, each entry is divided by the sum of the absolute values in its row, so the scores of every nonempty document sum to 1.
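
To make the difference from the textbook definition concrete, the smoothed formula and the `l1` normalization can be reproduced by hand. The sketch below does not call sklearn; it only re-implements the formula stated above, and the name `sklearn_style_tfidf` and the whitespace tokenization are assumptions made for the example.

```python
import math
from collections import Counter

def sklearn_style_tfidf(corpus, norm=None):
    """Hand computation of f_{t,d} * (1 + ln((N+1)/(df(t)+1))), optionally l1-normalized."""
    docs = [doc.split() for doc in corpus]   # assumption: whitespace tokenization
    n_docs = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = {t: c * (1.0 + math.log((n_docs + 1) / (df[t] + 1)))
               for t, c in counts.items()}
        if norm == "l1":
            # divide every entry by the sum of absolute values of its row
            total = sum(abs(v) for v in row.values())
            row = {t: v / total for t, v in row.items()}
        rows.append(row)
    return rows

toy_corpus = ["a b a", "a c", "a b d d"]
raw_rows = sklearn_style_tfidf(toy_corpus)
l1_rows = sklearn_style_tfidf(toy_corpus, norm="l1")
```

Unlike the unsmoothed $idf$, a term that occurs in every document keeps a weight of $1$ per occurrence here instead of dropping to $0$.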

## Workflow
1) Get the promath data that has the format:

     - promath
     
       - math10
       
         - 1003_001.tar.gz
         - 1003_002.tar.gz
         - ...
         
  The function should take the file `1003_001.tar.gz` and a _list of phrases_, and output a file with the format:
```xml
<root>
  <article name="macizo.tx">
      <parag num="1"> text </parag>
      <parag num="2"> text </parag>
  </article>
</root>
```
The text in the `parag` tags is clean, tokenized, and has its phrases joined.
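
A minimal sketch of how such a file could be produced, assuming the paragraphs have already been cleaned and joined. It uses the standard library `xml.etree.ElementTree` instead of `lxml` so that it runs on its own; the function name `build_articles_xml` and the article name `example.tex` are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_articles_xml(articles):
    """articles: {article_name: [paragraph_text, ...]} -> XML string in the format above."""
    root = ET.Element("root")
    for name, paragraphs in articles.items():
        art = ET.SubElement(root, "article", name=name)
        for num, text in enumerate(paragraphs, start=1):
            # attribute values must be strings, hence str(num)
            par = ET.SubElement(art, "parag", num=str(num))
            par.text = text
    return ET.tostring(root, encoding="unicode")

xml_text = build_articles_xml(
    {"example.tex": ["first cleaned paragraph", "second cleaned paragraph"]}
)
print(xml_text)
```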

In [1]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from lxml import etree
import gzip
from glob import glob
from collections import Counter
from tqdm import tqdm
import re
import random

import os, sys, inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, os.path.join(parentdir, 'embed'))
sys.path.insert(0, parentdir)

from clean_and_token_text import normalize_text, normalize_phrase, \
                                 token_phrases3, ReadGlossary, join_phrases
%load_ext autoreload
%autoreload 2
import parsing_xml as px
import peep_tar as peep

In [2]:
txmlst = []
for t in peep.tar_iter('../tests/few_actual_articles.tar.gz', '.xml'):
    txmlst.append(px.DefinitionsXML(t[1]).run_recutext_onall_para(cleaner_fun=normalize_text))

In [2]:
phrases_file = glob('/media/hd1/glossary/NN.v1/math*/*')

phrases_cnt = Counter()
for xml_path in tqdm(phrases_file):
    root = etree.parse(xml_path)
    # './/dfndum' searches the whole tree without relying on the deprecated leading '//'
    phrases_list_temp = [normalize_text(r.text)\
            for r in root.findall('.//dfndum') ]
    phrases_cnt.update([r for r in phrases_list_temp if len(r.split()) > 1])

phrases_lst = [pair[0] for pair in phrases_cnt.most_common()]
print(len(phrases_lst))
join_fun = lambda s: token_phrases3(s, phrase_lst=phrases_lst)

100%|██████████| 2816/2816 [00:51<00:00, 54.30it/s] 


759240
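
The idea behind joining phrases can be illustrated with a greedy longest-match pass over the tokens. This is not the implementation of `token_phrases3`, whose internals are not shown here; the function `join_known_phrases`, its `max_len` parameter, and the sample phrases are assumptions for the sake of the example.

```python
def join_known_phrases(text, phrases, max_len=4):
    """Replace every known multi-word phrase in text by its underscore-joined form."""
    phrase_set = {tuple(p.split()) for p in phrases}
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        # try the longest window first, down to two tokens
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # no phrase starts at position i: keep the single token
            out.append(tokens[i])
            i += 1
    return " ".join(out)

sample_phrases = ["lie groups", "line bundle",
                  "finitely generated", "finitely generated discrete group"]
joined = join_known_phrases("the lie groups act on a line bundle", sample_phrases)
print(joined)
```

Looking phrases up in a set keeps each position at a constant number of probes, which matters when the phrase list has hundreds of thousands of entries.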


In [222]:
%%timeit
#text = 'let us begin with a rough description of kozlov the general setting and the background of the problem .'
text = 'with a rough .'
#text = 'so weird what what what is going on of the let us begin this with background of the problem .'
#token_phrases3(text, phrase_lst=(of_lst + ['of the']))
token_phrases3(text, phrase_lst=phrases_lst)

ValueError: A phrase with only too few words was given. Phrase: lie

In [51]:
RG = ReadGlossary('/media/hd1/glossary/v3/math*/*.xml.gz', '/media/hd1/glossary/NN.v1/math*/*')
ph_dict = RG.first_word_dict(intersect = 'relative')

  1%|          | 21/2816 [00:00<00:16, 172.07it/s]

found [2816, 2816] files


100%|██████████| 2816/2816 [00:53<00:00, 52.30it/s] 
100%|██████████| 2816/2816 [00:55<00:00, 51.11it/s] 


In [52]:
RG.ntc_intersect('')

['_inline_math_ and',
 '_inline_math_ _inline_math_',
 'recent years',
 '_inline_math_ of',
 'for _inline_math_',
 '_inline_math_ if',
 '_inline_math_ also',
 'suppose that',
 'condition for',
 '_inline_math_ on',
 '_inline_math_ a',
 '_inline_math_ as',
 '_inline_math_ in',
 '_inline_math_ up',
 '_inline_math_ by',
 '_inline_math_ which',
 '_inline_math_ that',
 '_inline_math_ the',
 '_inline_math_ or',
 'family of',
 'of the',
 'all _inline_math_',
 'a _inline_math_',
 'well known',
 'group of',
 'r and',
 'a point',
 'a linear',
 ' ',
 '']

In [215]:
[p for p in si if (p[0].split())[0] == 'hales']

[('hales jewett number', 5.71612063521971e-07)]

In [53]:
vocab_set = RG.ntc_intersect('relative')

100%|██████████| 2816/2816 [00:53<00:00, 52.69it/s] 
100%|██████████| 2816/2816 [00:54<00:00, 51.38it/s] 


In [69]:
%%time
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.34 µs


In [44]:
vect = TfidfVectorizer(norm='l1')
X = vect.fit_transform(corpus) 
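# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
list(vect.get_feature_names_out())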

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [33]:
import datetime as dt

In [113]:
vocab = list(set([t.replace(' ', '_') for t in vocab_set.keys()]))
tvect = TfidfVectorizer(sublinear_tf=True, norm='l1', vocabulary=vocab)
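
The `sublinear_tf=True` option replaces the raw count $f_{t,d}$ by $1 + \ln(f_{t,d})$ before the idf weight is applied, which dampens very frequent tokens such as `_inline_math_`. A small stand-alone illustration follows; the helper `sublinear_tf` is written only for this example and is not a scikit-learn function.

```python
import math

def sublinear_tf(count):
    """Sublinear term-frequency scaling: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

# A hundredfold increase in the raw count changes the scaled value far less.
for count in (1, 10, 100):
    print(count, sublinear_tf(count))
```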

In [162]:
# Create the corpus: one string per article, made of its joined paragraphs.
corpus = []
article_names = []
problem_arts = []  # articles with an empty <parag>; kept across all files
path_lst = glob('/media/hd1/cleaned_text/joined_math12-34_04-01/math0*/*.xml.gz')
path_lst += glob('/media/hd1/cleaned_text/joined_math12-34_04-01/math9*/*.xml.gz')
for tar_path in path_lst:
    xobj = etree.parse(tar_path)
    for art in xobj.findall('.//article'):
        art_text = ''
        article_names.append(art.attrib['name'])
        for par in art.findall('.//parag'):
            try:
                art_text += (par.text + " ")
            except TypeError:
                # par.text is None for an empty paragraph: record the article and skip it
                problem_arts.append(art.attrib['name'])
        corpus.append(art_text)

In [163]:
%%time
ttrans = tvect.fit_transform(corpus)
ttrans.shape

CPU times: user 2min 48s, sys: 918 ms, total: 2min 49s
Wall time: 2min 49s


(85376, 347052)

In [171]:
w_lst = ['_inline_math_', 
         'exponential_map',
         'generalized_eigenspaces',
         'weak_hyperbolicity',
         'partially_hyperbolic',
         'finitely_generated_discrete_group',
         'global',
         'smooth_action',
         'nondegenerate',
        'good_pair',
        'stable_manifolds',
        'rank_one',
        'essential_subset',
        'semisimple',
        'orbit_equivalence',
        'higher_rank',
        'non_trivial',
        'lie_groups',
        'margulis',
        'line_bundle']
def tfidf(w):
    w_ind = vocab.index(w)
    return ttrans[0,w_ind]
sort_fun = lambda p: p[1]
s_lst = sorted([(w, tfidf(w)) for w in w_lst], key=sort_fun, reverse=True)
for p in s_lst:
    print(p)

('good_pair', 0.00859552057326182)
('partially_hyperbolic', 0.006482868736963943)
('stable_manifolds', 0.006073718429800212)
('generalized_eigenspaces', 0.0056295467163887695)
('finitely_generated_discrete_group', 0.004597134897897061)
('higher_rank', 0.004565492075269953)
('weak_hyperbolicity', 0.004222940648390717)
('semisimple', 0.004088095118946367)
('nondegenerate', 0.003725007543091549)
('margulis', 0.003543514391854277)
('essential_subset', 0.002550637320075592)
('_inline_math_', 0.0022122098236394697)
('exponential_map', 0.002209085680647651)
('orbit_equivalence', 0.0020292873331886115)
('rank_one', 0.0020172744867039765)
('lie_groups', 0.001991544206848132)
('smooth_action', 0.001831681027789851)
('global', 0.0008277823435205402)
('non_trivial', 0.0007194400649381925)
('line_bundle', 0.0)


In [172]:
N = dt.datetime.now()

In [173]:
tdelta = dt.datetime.now() - N

In [176]:
tdelta.seconds

14

In [177]:
10034//60

167

In [178]:
10034%60

14

In [179]:
167*60 + 14

10034
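
The minutes-and-seconds arithmetic in the last cells can be written in one step with `divmod`. The sketch below is an assumption about how the elapsed time might be reported, using the value 10034 from the cells above; note that `timedelta.seconds` only holds the seconds component within a day, while `total_seconds()` covers the whole interval.

```python
from datetime import timedelta

elapsed = timedelta(seconds=10034)

# divmod returns the quotient and the remainder in a single call
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"{minutes} min {seconds} s")
```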