# Contents
1) Join more phrases (target 300K)

2) Create new word embeddings.

3) Create new index for ANN

To compute `tf-idf`, consider a corpus $D$ with documents $d$ and terms $t$. The term frequency is
  $$tf(t,d) = \frac{f_{t,d}}{\sum_{\tau\in V} f_{\tau, d}}$$

where $V$ is the vocabulary and $f_{t,d}$ is the count of the term $t$ in $d$. The inverse document frequency is
$$idf(t,D) = \ln\left( \frac{N}{|\{ d\in D:\ t\in d \}|}\right)$$
where $N=|D|$ is the number of documents in the corpus. Finally, $tfidf(t,d,D) = tf(t,d)\cdot idf(t,D)$.
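
As a quick check of the definitions above, here is a minimal pure-Python sketch on a toy corpus. The corpus and the helper names `tf`, `idf` and `tfidf` are made up for illustration, and tokenization is a plain whitespace split.

```python
import math
from collections import Counter

# Toy corpus (illustrative only); tokens are obtained by whitespace splitting.
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
]
docs = [doc.split() for doc in corpus]

def tf(term, doc_tokens):
    """Count of `term` in the document divided by the document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, all_docs):
    """ln(N / number of documents containing `term`).

    Assumes `term` occurs in at least one document, otherwise this divides by zero.
    """
    n_containing = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / n_containing)

def tfidf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

print(tf('document', docs[1]))            # 2 occurrences out of 6 tokens
print(idf('this', docs))                  # 'this' is in every document, so ln(1)
print(tfidf('second', docs[1], docs))     # (1/6) * ln(3)
```

A term that appears in every document, such as `this`, gets weight zero under this definition.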

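In the sklearn module's `TfidfVectorizer` with `norm = None` (and the default smoothed idf), the weight is
$$tfidf(t, d,D) = f_{t, d}\cdot\left(1+\ln\left(\frac{N+1}{df(t)+1}\right)\right)$$
where $df(t) = |\{ d\in D:\ t\in d \}|$ is the number of documents that contain $t$. With `norm = "l1"` each row is normalized, that is, each entry is divided by the sum of its row.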
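
The smoothed weighting and the row normalization can be reproduced by hand. The sketch below is pure Python and does not call sklearn; the toy corpus and the helper names `smoothed_tfidf_row` and `l1_normalize` are assumptions for illustration, and a whitespace split stands in for sklearn's tokenizer.

```python
import math
from collections import Counter

# Toy corpus (illustrative only); whitespace tokenization is an assumption.
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
]
docs = [doc.split() for doc in corpus]
vocab = sorted({t for d in docs for t in d})
N = len(docs)
# df[t]: number of documents that contain the term t
df = {t: sum(1 for d in docs if t in d) for t in vocab}

def smoothed_tfidf_row(doc_tokens):
    """Raw count times (1 + ln((N + 1) / (df(t) + 1))), one entry per vocabulary term."""
    counts = Counter(doc_tokens)
    return [counts[t] * (1 + math.log((N + 1) / (df[t] + 1))) for t in vocab]

def l1_normalize(row):
    """Divide each entry by the sum of absolute values of the row."""
    total = sum(abs(x) for x in row)
    return [x / total for x in row] if total else row

raw_rows = [smoothed_tfidf_row(d) for d in docs]
l1_rows = [l1_normalize(r) for r in raw_rows]

for row in l1_rows:
    print(round(sum(row), 12))
```

Unlike the unsmoothed definition, a term that occurs in every document keeps a weight equal to its raw count, because the logarithm vanishes and the added 1 remains.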

## Workflow
1) Get the promath data, which is laid out as follows:

     - promath
     
       - math10
       
         - 1003_001.tar.gz
         - 1003_002.tar.gz
         - ...
         
  The function should take the file 1003_001.tar.gz and a _list of phrases_, and output a file with the format:
```xml
<root>
  <article name="macizo.tx">
      <parag index="1"> text </parag>
      <parag index="2"> text </parag>
  </article>
</root>
```
The text in the `parag` tags is cleaned, tokenized, and has its phrases joined.
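
A minimal sketch of how such a tree could be assembled and serialized. It uses the standard-library `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies; the helper `build_root_sketch` and the sample paragraphs are hypothetical, and the attribute name `index` follows the serialized `parag` element shown later in this notebook.

```python
import gzip
import xml.etree.ElementTree as ET

def build_root_sketch(articles):
    """Build <root><article name=...><parag index=...>text</parag>...</article></root>.

    `articles` is an iterable of (name, list_of_paragraph_strings) pairs.
    Hypothetical helper, not the project's own function.
    """
    root = ET.Element('root')
    for name, paragraphs in articles:
        art = ET.SubElement(root, 'article', name=name)
        for i, text in enumerate(paragraphs, start=1):
            par = ET.SubElement(art, 'parag', index=str(i))
            par.text = text
    return root

root = build_root_sketch([
    ('macizo.tx', ['first cleaned paragraph .', 'second cleaned paragraph .']),
])
xml_bytes = ET.tostring(root)

# Compress in memory; writing with gzip.open(path, 'wb') works the same way.
compressed = gzip.compress(xml_bytes)
print(xml_bytes.decode('utf8'))
```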

In [1]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from lxml import etree
import gzip
from glob import glob
from collections import Counter
from tqdm import tqdm
import re
import random

import os, sys, inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, os.path.join(parentdir, 'embed'))
sys.path.insert(0, parentdir)

from clean_and_token_text import normalize_text, normalize_phrase, \
                                 token_phrases3, ReadGlossary, join_phrases
%load_ext autoreload
%autoreload 2
import parsing_xml as px
import peep_tar as peep

In [2]:
txmlst = []
for t in peep.tar_iter('../tests/few_actual_articles.tar.gz', '.xml'):
    txmlst.append(px.DefinitionsXML(t[1]).run_recutext_onall_para(cleaner_fun=normalize_text))

In [2]:
phrases_file = glob('/media/hd1/glossary/NN.v1/math*/*')

phrases_cnt = Counter()
for xml_path in tqdm(phrases_file):
    root = etree.parse(xml_path)
    phrases_list_temp = [normalize_text(r.text)\
            for r in root.findall('//dfndum') ]
    phrases_cnt.update([r for r in phrases_list_temp if len(r.split()) > 1])

phrases_lst = [pair[0] for pair in phrases_cnt.most_common()]
print(len(phrases_lst))
join_fun = lambda s: token_phrases3(s, phrase_lst=phrases_lst)

100%|██████████| 2816/2816 [00:51<00:00, 54.30it/s] 


759240


In [222]:
%%timeit
#text = 'let us begin with a rough description of kozlov the general setting and the background of the problem .'
text = 'with a rough .'
#text = 'so weird what what what is going on of the let us begin this with background of the problem .'
#token_phrases3(text, phrase_lst=(of_lst + ['of the']))
token_phrases3(text, phrase_lst=phrases_lst)

ValueError: A phrase with only too few words was given. Phrase: lie

In [31]:
RG = ReadGlossary('/media/hd1/glossary/v3/math*/*.xml.gz', '/media/hd1/glossary/NN.v1/math*/*')
ph_dict = RG.first_word_dict(intersect = 'relative')

  1%|          | 21/2816 [00:00<00:16, 170.80it/s]

found [2816, 2816] files


100%|██████████| 2816/2816 [00:53<00:00, 52.89it/s] 
100%|██████████| 2816/2816 [00:54<00:00, 51.36it/s] 


In [231]:
si = sorted(ntci.items(), key = (lambda kv: kv[1]), reverse=False)
len(si)

347496

In [215]:
[p for p in si if (p[0].split())[0] == 'hales']

[('hales jewett number', 5.71612063521971e-07)]

In [30]:
#%%timeit
text = 'there exist positive integers smooth tangent field _inline_math_ _inline_math_ such that , for all _inline_math_ and all _inline_math_ , _inline_math_ is the multiplicity of _inline_math_ as a root of _inline_math_ . '
join_phrases(text, ph_dict)

'there exist positive_integers smooth_tangent_field _inline_math_ _inline_math_ such that , for all _inline_math_ and all _inline_math_ , _inline_math_ is the multiplicity of _inline_math__as a root_of _inline_math_ . '
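
For intuition only, here is a hypothetical sketch of phrase joining driven by a first-word dictionary. It is not the code in `clean_and_token_text`: the names `first_word_dict_sketch` and `join_phrases_sketch` are invented, and the greedy longest-match rule is an assumption, so its output can differ from what `join_phrases` returns above.

```python
from collections import defaultdict

def first_word_dict_sketch(phrases):
    """Map the first word of each multi-word phrase to its word tuples, longest first."""
    index = defaultdict(list)
    for phrase in phrases:
        words = tuple(phrase.split())
        if len(words) > 1:          # single words are never joined
            index[words[0]].append(words)
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return dict(index)

def join_phrases_sketch(text, index):
    """Scan left to right; at each token, join the longest phrase starting there."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        for cand in index.get(tokens[i], ()):
            if tuple(tokens[i:i + len(cand)]) == cand:
                out.append('_'.join(cand))
                i += len(cand)
                break
        else:
            # no phrase starts at this token, keep it unchanged
            out.append(tokens[i])
            i += 1
    return ' '.join(out)

index = first_word_dict_sketch(['positive integers', 'smooth tangent field', 'tangent field'])
joined = join_phrases_sketch(
    'there exist positive integers smooth tangent field _inline_math_ .', index)
print(joined)
```

Looking up only the phrases that share the current token's first word keeps the per-token cost small even when the phrase list has hundreds of thousands of entries.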

In [175]:
dex = px.DefinitionsXML('../tests/latexmled_files/enumerate_forms.xml')
dex.recutext(dex.para_list()[1])
outxml = dex.run_recutext_onall_para(cleaner_fun=normalize_text)
etree.tostring(outxml[3]).decode('utf8')

'<parag index="3"> there exist positive integers _inline_math_ such that , for all _inline_math_ and all _inline_math_ , _inline_math_ is the multiplicity of _inline_math_ as a root of _inline_math_ . </parag>'

In [148]:
ph_lst = RG.common_phrases_lst()

100%|██████████| 2915/2915 [00:57<00:00, 50.50it/s] 


In [70]:
%%time
root = etree.Element('root')
for t in peep.tar_iter('/media/hd1/promath/math14/1401_003.tar.gz', '.xml'):
    print(t)
    try:
        txml = px.DefinitionsXML(t[1], fname = t[0]).run_recutext_onall_para(cleaner_fun=normalize_text, joiner_fun=join_fun)
    except ValueError as ee:
        print(ee, f"-- On the {t[0]}")
        continue  # skip this article; otherwise a stale or undefined txml would be appended
    #print(txml.file_path)
    root.append(txml)
#print(etree.tostring(root, pretty_print=True).decode('utf8'))
with gzip.open('/home/luis/rm_me_tfidf/cosito.xml.gz', 'wb') as gfobj:
    gfobj.write(etree.tostring(root))

('1401_003/1401.1545/1401.1545.xml', <ExFileObject name='/media/hd1/promath/math14/1401_003.tar.gz'>)
('1401_003/1401.1442/EJU.xml', <ExFileObject name='/media/hd1/promath/math14/1401_003.tar.gz'>)
('1401_003/1401.1459/1401.1459.xml', <ExFileObject name='/media/hd1/promath/math14/1401_003.tar.gz'>)
('1401_003/1401.2005/1401.2005.xml', <ExFileObject name='/media/hd1/promath/math14/1401_003.tar.gz'>)
('1401_003/1401.1879/Rank4Revised.xml', <ExFileObject name='/media/hd1/promath/math14/1401_003.tar.gz'>)
('1401_003/1401.1471/1401.1471.xml', <ExFileObject name='/media/hd1/promath/math14/1401_003.tar.gz'>)
('1401_003/1401.1969/1401.1969.xml', <ExFileObject name='/media/hd1/promath/math14/1401_003.tar.gz'>)
('1401_003/1401.2031/1401.2031.xml', <ExFileObject name='/media/hd1/promath/math14/1401_003.tar.gz'>)
('1401_003/1401.1687/paper_IOP_final.xml', <ExFileObject name='/media/hd1/promath/math14/1401_003.tar.gz'>)
('1401_003/1401.1765/ADF-arxiv.xml', <ExFileObject name='/media/hd1/promath/mat

In [69]:
%%time
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.34 µs


In [44]:
vect = TfidfVectorizer(norm='l1')
X = vect.fit_transform(corpus) 
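# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vect.get_feature_names_out().tolist()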

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [33]:
import datetime as dt

In [42]:
N = dt.datetime.now()
N.hour, N.minute, N.second

(14, 25, 9)

In [43]:
4%1

0