# Introduction

This notebook is intended to process wikipedia dumps, with the goal of extract the maximum number of complete and significative sentences. This set of sentences could be used for any purpose in [NLP](https://en.wikipedia.org/wiki/Natural_language_processing), but information is also extracted which would be meaningful for _wikipedia_ itself.

The developed example is about _Galipedia_, the galician wikipedia, but it could be easily adapted for other languages just because it is _language-agnostic_.

The final product of the notebook is a file with the articles title, category, user and sentences. It is stored in a text file (_articles20221201.txt_) and a pickle file for convenience.

# Download data

From __[Wikimedia Downloads](https://dumps.wikimedia.org/mirrors.html)__

from the mirror _Academic Computer Club, Umeå University_ (Last 5 good XML dumps, 'other' datasets): glwiki-20221201-pages-articles.xml.bz2

Probably, the date must be updated if you want run the notebook

In [None]:
!wget http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/glwiki/20221201/glwiki-20221201-pages-articles.xml.bz2


!bunzip2 glwiki-20221201-pages-articles.xml.bz2

# Libraries & functions

The script `Galutils` contains a number of custom and convenience functions and constants

In [1]:
from Galutils import *

from random import choice, sample
from time import time


It could be difficult preserve the _language-agnostic_ nature for sentence tokenization. Two ways are plausible:
* import a sentence tokenizer from nltk or any other suitable library
* write a custom function, which can take into account especifities of the dump we work with

In this notebook a custom function, `sent_tok`, will be used

In [2]:

#import nltk
#sent_tok= nltk.sent_tokenize

def sent_tok(text,ends='[\?\!\*\#]',abrev=[('a.C.', 'aC.'),
                                            ('d.C.','dC.'),
                                            ('(n.','(nado en'),
                                            ('(m.','(morto en'),
                                            ('hab.','habitantes'),
                                            ('(ca.','circa'),
                                            ('(c.','(circa'),
                                            (' No. ',' nº '),
                                            (' op. ',' opus '),
                                            (' b.d. ',' banda deseñada '),
                                            (' || ',' # '),
                                          ]):
    '''
    custom sentence tokenizer
    '''

    
    def insert_EOL(txt,spans,post=False):
        ini0=fin=0
        sent=''
        for span in spans:
            ini,fin=span.span()
            ini=ini+1
            
            fin=fin+1 if post else fin-1
            
            sent+=txt[ini0:ini]+'\n'
            ini0=fin
        sent+=item[fin:] if fin else txt
        return sent
        
    if type(text)!=list:
        text=[text]
    res=[]
    for item in text:
        #preserve some common abreviatures
        for pat,rpl in abrev:
            item=item.replace(pat,rpl)
         
        #ellipsis can be tricky - could or be, or not, EOL without any other symbols
        sent=insert_EOL(item,re.finditer(r'… ?[A-ZÁÂÉÊÍÎÓÔÚÛÜÑÇ¿¡]',item),False)
        
        #sent=insert_EOL(sent, re.finditer(r'[^A-Zªº]\.',sent),True)
        sent=insert_EOL(sent, re.finditer(r'\.\W+[A-ZÁÂÉÊÍÎÓÔÚÛÜÑÇ¿¡]',sent),False)
        sent=sent.replace('--','\n')    
        sent=re.sub(r'={2,200}','\n',sent)

        res.append(re.sub(ends,'\n',sent).split('\n'))

    return ([clean for item in unravel(res) if (clean:=item.strip())])

The `get_links` function try to get information to remove reference patterns while preserves the text if it is a natural part of text.  
The `drop_rec` function removes tables and links which could be recursive and with unbalanced open and close tags.  
The `replace_chunk` function is an helper for removing sections, which can be overlaid with tables, references or links

In [3]:
def get_links(page,pat_open=r'\{\{',pat_close=r'\}\}'):
    
   
    
    lini=[item.span() for item in re.finditer(pat_open,page)]
    lfin=[item.span() for item in re.finditer(pat_close,page)]
    if len(lfin)==0 and len(lini):
        ini=lini[0][0] if len(lini) else 0
        fin=len(page)-1
        return [page[ini:fin]]
    
    chunks=[]
    if len(lini)!=len(lfin):


        indxf=0
        indxi=0
        while indxf<len(lfin) and indxi<len(lini):
            ini=lini[indxi][0]
            while indxi<len(lini) and lini[indxi][1]<lfin[indxf][0] :
                indxi+=1
            if indxi==len(lini):
                fin=lfin[-1][1]
            else:
                while indxf<len(lfin) and lfin[indxf][0]<lini[indxi][1]:
                    indxf+=1
                fin=lfin[indxf-1][1]
            chunks.append(page[ini:fin])
    else:
        posini=[]
        for indx,posfin in enumerate(lfin):
            posini+=[item for item in lini[indx:] if item[1]<posfin[0]]
            if len([item for item in lini[indx:] if item[1]<posfin[0]])==1:
                chunks.append(page[posini[0][0]:posfin[1]])
                posini=[]
            
    
    return sorted(chunks,key=lambda x: len(x), reverse=True)
    
    

In [4]:
def drop_rec(page,pat_open=r'\{\|',pat_close=r'\|\}'):
    
    pato=re.compile(pat_open)
    patc=re.compile(pat_close)

    ini0=ini=pato.search(page,0)
    fin0=fin=patc.search(page,0)
    if ini0==None and fin0==None:
        return page
    elif fin0==None:
        return page[:ini0.start()]
    elif ini0==None:
        while fin:
            fin0=fin
            fin=patc.search(page,fin.end())
            
        return page[min(fin0.end(),len(page)-1):]
 
    chunks=[]
    nest=0
   
    while fin:

        while ini!=None and ini.end()<fin0.start():
            nest+=1
            ini=pato.search(page,ini.end())

       
        if not ini:

            while fin:
                nest-=1
                fin0=fin
                fin=patc.search(page,fin.end())

        elif fin:       
            
            while fin and fin.start()<ini.end() :
                nest-=1                   
                fin0=fin
                fin=patc.search(page,fin.end())

            if fin:
                if nest<1:
                    chunks.append([ini0.start(),fin0.end()])
                    ini0=ini  
                fin0=fin

        else:
            fin=None
        
        nest=max(0,nest)

    chunks.append([ini0.start(),fin0.end()])

    if ini and ini.start()>fin0.end():
        chunks.append([ini.start(),len(page)-1])
    
    res=''
    if chunks[0][0] == 0:                                                                                          
        ini0=chunks[0][1]
        chunks.pop(0)
    else:
        ini0=0
    
    for ini,fin in chunks:
        res+=page[ini0:ini]
        ini0=fin
    if ini0<len(page)-1:
        res+=page[ini0:]
    return res

In [5]:
def replace_chunk(txt):
    pats=[(r'{{',r'}}'),(r'{|',r'|}'),(r'[[',r']]')]
    res=''
    for op,cl in pats:
        nop=txt.count(op) if op in txt else 0
        ncl=txt.count(cl) if cl in txt else 0
        if nop==ncl:
            continue
        elif nop>ncl:
            res+=f' {op} '*(nop-ncl)
        else:
            res+=f' {cl} '*(ncl-nop)
    return res


## Cleaning pages

The function `clean_page` is the core of this notebook. This function accepts a wiki page and process it. It returns 4 elements:
* `title`: page title, string
* `contributors`: list with creator username
* `category`: list with assigned categories
* `text`: clean text, cleaned as described below  

There are a number of language dependent patterns, in this notebook the  selected patterns work with _Galipedia_, but must be easy adapt them for other languages:
* `patt_category`: pattern to extract categories
* `pages_to_drop`: patterns to identify in the title internal pages of wikipedia , such as 'Help', 'Model', ..., which do not contains any significant text.
* `terminal_sections`: The wiki articles have a defined structure and there are sections located at the end of articles whithout any significant text, like 'Bibliography', 'Notes',... 
* `patt_citation`: pattern for extract textual citations and incorporate it into text


In [6]:
patt_category=r'\[\[Categoría:(.*?)\]\]'
pages_todrop=['Axuda:', 'Wikipedia:','MediaWiki:','Modelo:','Categoría:','Módulo:']
terminal_sections= ['Palmarés',  'Festividades',  r'Partidos históricos.*?',  'Filmografía',  r'Galería.*?',  'Notas',  'Véxase tamén',  
                 'Bibliografía',  'Outros artigos',  'Ligazóns externas']
patt_citation=r'\{\{cita ?\|(.*?\.)\}\}'

In [7]:
def clean_page(page):
    title=re.findall(r'<title>(.*?)</title>',page)
    contributors=re.findall(r'<username>(.*?)</username>',page)
    category=re.findall(patt_category,page)
    
    txt=re.findall(r'<text.*?>(.*?)</text>',page)
    if title:
        title=title[0]
        if any_in(title,pages_todrop):
            return '',[],[],[]
    else:
        return '',[],[],[]
    if txt:
        txt=txt[0]
    else:
        return '',[],[],[]
    
    pos=[]
    for pat in terminal_sections:
        pos+=[item.span()[0] for item in re.finditer(r'={2,3} {0,1}%s {0,1}={2,3}'%pat ,txt)]
                
    pos.sort()
    pos=pos[0] if len(pos) else len(txt)
    txt=txt[:pos]
    
    txt=txt.replace('&lt;table','&lt; {|').replace( '/table&gt;','|} &gt;')
    txt=txt.replace('&lt;TABLE','&lt; {|').replace( '/TABLE&gt;','|} &gt;')
    
   
    
    #Remove Boxes
    txt=re.sub(r'\{\{Start box\}\}.*?\{\{End box\}\}',' ',txt)
    txt=re.sub(r'\{\{S-start\}\}.*?\{\{End box\}\}',' ',txt)

    txt=re.sub(r'\{\{ ?columnas.*?columnas ?\}\}',' ',txt)
    
    #remove sections
    
    for tag in ['div','math', 'ref','nowiki','graph','timeline','center','Center','syntaxhighlight','sub','sup','span','time','small','big','gallery','imagemap','score']:
        patt=r'&lt; ?{0}.*?/{0}?&gt;'.format(tag)
        if (chunks:=re.findall(patt,txt)):
            for chunk in sorted(chunks,key= lambda x:len(x)):
                txt=txt.replace(chunk,replace_chunk(chunk),re.IGNORECASE)

        patt=r'&lt; ?{0}.*?/?&gt;'.format(tag)
        if (chunks:=re.findall(patt,txt)):
            for chunk in sorted(chunks,key= lambda x:len(x)):
                txt=txt.replace(chunk,replace_chunk(chunk),re.IGNORECASE)

    
    #Remove tables, links and references, potentially recursive and unmatched
    #remove citations
    if '{{' in txt:
        txt=drop_rec(txt,r'\{\{',r'\}\}')
        

    #remove tables
   
    if any_in(txt,[r'{|',r'|}']):
            txt=drop_rec(txt,r'\{\|',r'\|\}')
 
    #remove links
    if '[[' in txt:
        for l in get_links(txt,r'\[\[',r'\]\]'):

            rpl=''
            if not any_in(l,list(':/')):
                m=l.strip('[]')
                rpl=m.split('|')[1]  if '|' in l else m

            txt=txt.replace(l,rpl)
        
   
    #remove special cases of references

    txt=re.sub(r'&lt; ?br ?/&gt;','. ',txt)

    txt=re.sub(r'&lt;noinclude&gt;',' ',txt)
    txt=re.sub(r'{{nowrap.*?}}',' ',txt) 
    
    txt=re.sub(r'\[http.*?\]','',txt)
    
    txt=re.sub(r'\{.*?\|left','',txt)
    txt=re.sub(r'/{0,9}center {0,20}\|{0,9}','',txt)
    txt=re.sub(r'nbsp;',' ',txt)
    txt=re.sub(r'\|{1,20}left','',txt)
    txt=re.sub(r'/{1,20}math','',txt)
    txt=re.sub(r'\\frac','',txt)
   
    
    #rid off misswrited numbers

    for p in re.findall(r'([0-9]+\. {0,9}[0-9]+)',txt):

        q=re.sub('\. {0,9}','',p)
        txt=re.sub(p,q,txt)

        #Ad hoc
    txt=re.sub(r'\|wid.*?top\|','',txt);
    txt=re.sub(r'\|\|.*?;\|','',txt);
    txt=re.sub(r'\|?rowspan.*?\|[(Clas)|(Des)].*?[á|\|]','',txt);
    txt=re.sub(r'\| ?colspan.*?[\.|\|]','',txt);
    txt=re.sub(r'colspan=.*?[!|\|]','',txt);
    txt=re.sub(r'rowspan=.*?[!|\|]','',txt);
    txt=re.sub(r'\|[\-| ]bg.*?\.','',txt);
    txt=re.sub(r'\|vh?align.*?\|','',txt);
    txt=re.sub(r'\|? ?style=.*?[\.|\|]','',txt);
    txt=re.sub(r'| ?align=|','',txt)
    txt=re.sub(r'{| ?class=wikitable.*?\|-','',txt)
    txt=re.sub(r'{| ?class=wikitable.*?!','',txt)
    txt=re.sub(r'\| vtop \| {1,2}\| width=50% vtop \|','',txt)
    txt=txt.replace('\uf85010','')
    
    #Tidy text for sent tokenize
    txt=txt.replace('...','…')
    txt=re.sub(r'\([. ]*?\)','',txt)
    txt=re.sub(r'\. *?\.','. ',txt)
    txt=txt.replace('}',' ')
    txt=re.sub(r'\'{2,20}','\'',txt)
    txt=re.sub(r'&lt;.*?&gt;','',txt)
    txt=re.sub(r'&.*?;','',txt)
     
    
    txt=[clean for item in sent_tok(txt) if (clean:=item.strip())]
    
    return title, category, contributors, txt

In [8]:
def clean_text(sent):
    
    if isinstance(sent,str):
        sent=sent.split()
    
    sent_f=[clean for i in sent if (clean:=alpha_text(i))]
    if not(sent_f):
        return None
    sent_f=[sent_f[0] if len(sent_f[0])<2 or sent[0].istitle() else '']+[i for i in sent_f[1:] if  i.islower()] 
    sent_f=[i.lower() for i in sent_f if i]
    return sent_f
    

def basic_feat(text):
    
    nsent=0
    ntok=[]
    nword=[]
    sentences=[]
    stream=[]
    
    
    
    for item in text:
        sent=[s for s in item.split() if len(s)<MAX_CHAR_TOKEN]
        
        if len(sent)<MIN_TOKENS :
            continue
        
        
        if (sent_f:=clean_text(sent)):
            nsent+=1
            ntok.append(len(sent))
            nword.append(len(sent_f))
                                                        
            stream+=sent_f
            sentences.append(' '.join(sent))
        
    return nsent,ntok,nword,Counter(stream),sentences         



# Parameters
* `path`: Path object pointing to xml dumps directory, this notebook and auxiliary files must be in the same folder.
* `MIN_TOKENS`: minimum number of tokens for a valid sentence
* `MIN_SENTS`: minumum number of valid sentences for a valid document
* `MAX_CHAR_TOKEN`: maximum number of characters in a valid token
* `MIN_DOCS`: minimum number of document frequency to be included in tfidf

In [9]:
path=Path.cwd()
MIN_TOKENS=4
MIN_SENTS=2
MAX_CHAR_TOKEN=50
MIN_DOCS=3

In [4]:
path_data=Path.cwd().parent/'dataset'

In [None]:
%%time
cpgl=' '.join(list(path.glob('*.xml'))[0].read_text(encoding='UTF8').split())
pages=[]
for ini,fin in zip(re.finditer(r'\<page\>',cpgl),re.finditer(r'\</page\>',cpgl)):

    pages.append(cpgl[ini.end()+1:fin.start()])

In [None]:
with open('pages_wiki.pkl','wb') as fich:
    pickle.dump(pages,fich)

len(pages)

In [10]:

pages=pickle_var('pages_wiki.pkl')
len(pages)

386824

In [11]:
def thread_function(pages):
    docs=[]
    pgs=[]
    for indx,page in enumerate(pages):
        try:
            title,category,user,text=clean_page(page)
            #Filter internal pages
            if not title or not text:
                continue
           
            
            #Filter documentns by length
            test=[alpha_text(item,' -').split() for item in text]
            
            test=[item for item in test if len(item)>MIN_TOKENS]

            if len(test)>MIN_SENTS:
                docs.append((title,category,user,text))
                pgs.append(page)
        except Exception as e:
            print(f'Exception: {e}\nFailure: {page}')
           
    return docs,pgs


In [12]:
%%time
articles=[]
selected_pages=[]
for batch in vectorice(iterable_list=pages,  thread_function=thread_function, max_workers=1000):
    d,p=(batch)
    
    articles+=d
    selected_pages+=p

CPU times: user 7.13 s, sys: 2.65 s, total: 9.78 s
Wall time: 39.4 s


In [13]:
pickle_var('articles20221201_gl.pkl',articles)
len(articles)

154876

In [14]:
pickle_var('selected_pages.pkl',selected_pages)
len(selected_pages)

154876

In [5]:
articles=pickle_var(path_data/'articles20221201_gl.pkl')
len(articles)

154876

In [7]:
articles=[document(*item) for item in articles]


In [13]:
all_articles=[]
for art in articles:
    cad=f'Title: {art.title}'+'\n'
    cad+=f'Category: {", ".join(art.category)}'+'\n'
    cad+=f'User: {", ".join(art.user)}'+'\n'
    cad+=f'Text:'+"\n".join(art.text)+'\n\n'
    all_articles.append(cad)
    

In [15]:
(path_data/'articles20221201_gl.txt').write_text(''.join(all_articles),encoding='utf8')

378991565

# Contributors

In [17]:
users=Counter(unravel([item.user for item in articles])).most_common()

limit=0.75
print(f'Total number of users: {len(users)}')
print(f'Most active users ({100*limit}% of #articles):')
total=len(articles)
cum=0
print(f'{"User":<30}\t{"#articles"}\t{"%articles"}')
print('-'*60)
for user,val in users:
    print(f'{user:<30}\t{val:>8}\t{val/total:>8.3%}')
    cum+=val/total
    if cum>limit:
        break


Total number of users: 1222
Most active users (75.0% of #articles):
User                          	#articles	%articles
------------------------------------------------------------
InternetArchiveBot            	   26678	 17.225%
Breogan2008                   	   23639	 15.263%
BanjoBot 2.0                  	   12886	  8.320%
Breobot                       	   10450	  6.747%
Corribot                      	    8980	  5.798%
HombreDHojalata               	    6888	  4.447%
Estevoaei                     	    6867	  4.434%
Chairego apc                  	    6181	  3.991%
Zaosbot                       	    5330	  3.441%
Xanetas                       	    2790	  1.801%
Xas                           	    2763	  1.784%
Alfonso Márquez               	    2018	  1.303%
MAGHOI                        	    1870	  1.207%


If can be assumed that a user name that contains 'bot' in it identifies a bot...

In [18]:
bots={key:val for key,val in users if 'bot' in key or 'Bot' in key}
print(f'Total number of bots: {len(bots)}')
print(f'Total articles: {sum(bots.values())},  {100*sum(bots.values())/len(articles):0.3f}%')
sts=basic_stats(list(bots.values()))
print('number of articles by bot')
for key in ['mean','min','Q25','Q50','Q75','max','Sh']:
    print(f'\t{key}: {sts[key]:.4g}')
    
print('\n\n')
print('User name       \tnº articles \t%articles')
for key,val in Counter(bots).most_common():
    while len(key)<20:
        key+=' '
    print(f'{key} \t{val} \t\t{round(100*val/len(articles),3)}')


Total number of bots: 20
Total articles: 66741,  43.093%
number of articles by bot
	mean: 3337
	min: 1
	Q25: 7.5
	Q50: 25.5
	Q75: 2232
	max: 2.668e+04
	Sh: 2.692



User name       	nº articles 	%articles
InternetArchiveBot   	26678 		17.225
BanjoBot 2.0         	12886 		8.32
Breobot              	10450 		6.747
Corribot             	8980 		5.798
Zaosbot              	5330 		3.441
Chairebot            	1200 		0.775
Aosbot               	815 		0.526
Addbot               	227 		0.147
BotDHojalata         	76 		0.049
EmausBot             	37 		0.024
KLBot2               	14 		0.009
Escarbot             	11 		0.007
Xqbot                	11 		0.007
BanjoBot             	9 		0.006
Texvc2LaTeXBot       	9 		0.006
Prebot               	3 		0.002
MGA73bot             	2 		0.001
TohaomgBot           	1 		0.001
Hector Bottai        	1 		0.001
Jembot               	1 		0.001


In [19]:
human={key:val for key,val in users if not ('bot' in key or 'Bot' in key)}
print(f'Total number of human contributors: {len(human)}')
sts=basic_stats(list(human.values()))
print('number of articles by human')
for key in ['mean','min','Q25','Q50','Q75','max','Sh']:
    print(f'\t{key}: {sts[key]:.4g}')

print('\n\n')
limit=0.75
print(f'Users which accounts for {100*limit} % of human articles')
print('User name                 \tnº articles \t%articles')
acum=0
total=sum(list(human.values()))
for key,val in Counter(human).most_common():
    if acum>limit:
        break
    acum+=val/total
    while len(key)<25:
        key+=' '
    print(f'{key} \t{val} \t\t{round(100*val/len(articles),3)}')
    

Total number of human contributors: 1202
number of articles by human
	mean: 71.99
	min: 1
	Q25: 1
	Q50: 1
	Q75: 3
	max: 2.364e+04
	Sh: 1.989



Users which accounts for 75.0 % of human articles
User name                 	nº articles 	%articles
Breogan2008               	23639 		15.263
HombreDHojalata           	6888 		4.447
Estevoaei                 	6867 		4.434
Chairego apc              	6181 		3.991
Xanetas                   	2790 		1.801
Xas                       	2763 		1.784
Alfonso Márquez           	2018 		1.303
MAGHOI                    	1870 		1.207
Miguelferig               	1709 		1.103
RubenWGA                  	1605 		1.036
Beninho                   	1512 		0.976
Vitoriaogando             	1377 		0.889
Moedagalega               	1193 		0.77
Elisardojm                	1108 		0.715
Jglamela                  	1073 		0.693
HacheDous=0               	971 		0.627
Xosema                    	945 		0.61
Maria zaos                	910 		0.588


# Categories

In [20]:
categories=unravel([item.category for item in articles])
categories=Counter(categories).most_common()
print(f'Total number of categories used: {len(categories)}')
sts=basic_stats(list(dict(categories).values()))
print('number of articles by category')
for key in ['mean','min','Q25','Q50','Q75','max','Sh']:
    print(f'\t{key}: {sts[key]:.4g}')
print('\n\n')



print('Category                  \t\t\t\t\t\tnº articles \t%articles')
for key,val in categories:
    if val<200:
        break
    while len(key)<70:
        key+=' '
    print(f'{key} \t{val} \t\t{round(100*val/len(articles),3)}')    

Total number of categories used: 70482
number of articles by category
	mean: 7.03
	min: 1
	Q25: 1
	Q50: 1
	Q75: 4
	max: 5091
	Sh: 2.125



Category                  						nº articles 	%articles
Personalidades de Galicia sen imaxes                                   	5091 		3.287
Filmes en lingua inglesa                                               	3316 		2.141
Filmes dos Estados Unidos de América                                   	2816 		1.818
Topónimos galegos con etimoloxía                                       	1611 		1.04
Escritores de Galicia en lingua galega                                 	1435 		0.927
Alumnos da Universidade de Santiago de Compostela                      	1378 		0.89
Nados en ano descoñecido                                               	1371 		0.885
Personalidades sen imaxes                                              	1100 		0.71
Escritores de Galicia en lingua castelá                                	1076 		0.695
Personalidades de Galicia sen imaxes finados

# Basic features  
Language-agnostic preliminary analysis, so there is not misspelled words control or token lemmatization.  
It relies on three routines:  
* `clean_text`: Used to create the _Bag of Words_ (bow) for each article. Only alphabetical chars are allowed. The input is a sentence (list of tokens or string) and the output is a list of alphabetical tokens.
* `basic_feat`: the main routine. The input is the extracted text of each article. This is processed to get:
    * The number of sentences in the article, an `int`
    * A list with the number of tokens in each sentence of the document
    * A list with the number of tokens in the output of `clean_text` applied to each sentence; these are called _words_.
    * A dictionary with the bow of the document, as defined above.
    * A list with the sentences of the article text, after applied the `MAX_CHAR_TOKEN` filter and the `MIN_TOKENS` filter
* `basic_stats`: the input is a list of numerical values and returns a dictionary with the mean, standard deviation, maximum, minimun, informational entropy and th quantile values for 25%, 50% (median) and 75%




In [21]:
def thread_function(arts):
    bows=[]
    feats=[]
    sents=[]

    for art in arts:
        nsent,ntok,nword,bow,sentences=basic_feat(art.text)
        sents.append(sentences)
        bows.append(bow)
        numt=basic_stats(ntok)
        numw=basic_stats(nword)
        nums=basic_stats(list(bow.values()))
        feats.append([art.title,nsent,numt['mean'],numt['Sh'],numw['mean'],numw['Sh'],nums['Sh'],sum(list(bow.values()))/len(bow) if bow else 0])
    return (bows,feats,sents)

In [22]:
%%time

answer=vectorice(articles,thread_function,max_workers=256)

bows=[]
feats=[]
sents=[]

for b,f,s in answer:
    bows+=(b)
    feats+=(f)
    sents+=(s)

CPU times: user 11.6 s, sys: 2.69 s, total: 14.3 s
Wall time: 20.4 s


# Basic features
Data frame with basic features of each article:
* `key`: article title
* `nsent`: number of sentences in the article
* `mean_tok`: mean of number of tokens per sentence in article
* `Sh_tok`: Informational entropy of tokens per sentence in article, in _nats_
* `mean_word`: mean of number of alphabetical tokens (`clean_text` output) per sentence in article
* `Sh_word`: Informational entropy of alphabetical tokens per sentence in article
* `Sh_bow`: Informational entropy of article's Bag of Words (bow) 
* `IL`: Lexical index, defined as $\cfrac{\#(words~in~article)}{\#(unique~words)}$

In [23]:
feats=pd.DataFrame(feats,columns=[ 'key','nsent','mean_tok','Sh_tok','mean_word','Sh_word','Sh_bow','IL'])
feats.describe()

Unnamed: 0,nsent,mean_tok,Sh_tok,mean_word,Sh_word,Sh_bow,IL
count,154876.0,154876.0,154876.0,154876.0,154876.0,154876.0,154876.0
mean,18.590233,18.729475,1.987819,14.991323,1.92735,0.867765,1.762029
std,31.934735,6.558763,0.702668,6.160426,0.707963,0.241812,0.982644
min,1.0,4.270531,-0.0,1.247573,-0.0,-0.0,1.0
25%,5.0,14.0,1.386294,10.416667,1.386294,0.712587,1.427184
50%,9.0,18.666667,1.94591,14.833333,1.889159,0.868778,1.631944
75%,19.0,22.75,2.50529,18.953488,2.441015,1.018204,1.904412
max,1498.0,153.5,4.257493,72.5,4.129187,2.548299,96.608696


In [24]:
feats.sort_values('mean_tok')

Unnamed: 0,key,nsent,mean_tok,Sh_tok,mean_word,Sh_word,Sh_bow,IL
82859,Especies de Rhododendron,207,4.270531,0.612245,2.724638,0.859401,0.164576,2.506667
124032,A balada de Cable Hogue,18,4.277778,0.590842,2.000000,-0.000000,0.206192,1.894737
61806,Concellos de Guatemala,30,4.300000,0.531304,1.600000,0.865470,1.190076,3.000000
16936,Lista de músicos de jazz fusion,9,4.333333,0.636514,1.777778,1.060857,0.600166,1.454545
21875,Lista de concellos de Sevilla,26,4.346154,0.692908,2.730769,0.810150,1.242517,3.086957
...,...,...,...,...,...,...,...,...
67929,Jean Calas,4,80.250000,1.039721,72.500000,1.386294,0.928693,1.726190
122099,El Museo de Pontevedra,4,101.500000,1.386294,16.500000,1.386294,0.672995,1.434783
112357,Liga Galega de Fútbol Gaélico 2016/17,11,109.090909,2.145842,14.000000,2.145842,1.367729,3.019608
113513,Liga Galega de Fútbol Gaélico 2017/18,11,119.000000,2.271869,17.090909,2.098274,1.178731,2.984127


In [28]:
articles[82859].text

["Esta é unha listaxe de especies de 'Rhododendron', da familia Ericaceae.",
 'Dependendo da fonte, hai de 800 a 1,100 especies aceptadas.',
 "Rhododendron' ten oito subxéneros aceptados:",
 'A',
 "'Rhododendron aberconwayi' Cowan",
 "'Rhododendron adenanthum' M.Y.",
 'He',
 "'Rhododendron adenogynum' Diels",
 "'Rhododendron adenopodum' Franch.",
 "Rhododendron adenosum' Davidian",
 "'Rhododendron aganniphum' Balf. f.",
 'Kingdon-Ward',
 "'Rhododendron agastum' Balf. f.",
 'W.W.',
 'Sm.',
 "Rhododendron albertsenianum' Forrest",
 "'Rhododendron alutaceum' Balf. f.",
 'W.W.',
 'Sm.',
 "Rhododendron amandum' Cowan",
 "'Rhododendron ambiguum' Hemsl.",
 "Rhododendron amesiae' Rehder  E.H.",
 'Wilson',
 "'Rhododendron amundsenianum' Hand.",
 'Mazz.',
 "Rhododendron annae' Franch.",
 "Rhododendron anthopogon' D.",
 'Don',
 "'Rhododendron anthopogonoides' Maxim.",
 "Rhododendron anthosphaerum' Diels",
 "'Rhododendron aperantum' Balf. f.",
 'Kingdon-Ward',
 "'Rhododendron apricum' P.C.",
 'Tam

# Computing idf

[_idf_](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) has a yet long history in information retrieval and gives a way to estimate the relative _value_ of a word, a sentence or a document.
 The classical definition is $\mathrm{idf}(t, D) = \log \cfrac{N}{|\{d \in D: t \in d\}|}$; where $N$ is the number of documents, $D$ in the collection (in this context the number of extracted articles), and $|\{d \in D: t \in d\}|$ the number of documents, $d$, which contains the term $t$.


The function `get_tfidf_bows` gets a list with the [_Bag of Wods_](https://en.wikipedia.org/wiki/Bag-of-words_model) for each document as input and returns:
* `ndw` the _idf_ for each term in the collection. 
* `bow` the _Bag of Words_ for the entire collection. A dictionary where the term is the key. The keys in ndw and bow are the same
* `tfidf` the _tfidf_ value for each document in the collection: _bow·ndw_

All three are python dictionaries where the term is the key. 
Applying `tfidf` or `ndw` implies that any term not included in these dictionaries has 0 value.

Three filters could be applied to construct `nwd` and `tfidf`:
* `min_len` : minimum length of the term, so word of 1, 2 or 3 letters have a low probability of be significant.
* `min_docs` : minimum number of documents that include the term for the term to be included. It is supposed that this filter removes misspelled words and extremely exotic words.
* `todrop` : a set with terms to be excluded arbitrarily.

In [29]:
def get_tfidf_bows(bows,todrop={},min_len=4,min_docs=MIN_DOCS):
    
    bow={}
    ndw={}
    for b in bows:
        for key,val in b.items():
            if len(key)<min_len or key in todrop:
                continue
            bow[key]=bow.get(key,0)+val
            ndw[key]=ndw.get(key,0)+1
        
    tfidf={}
    ndw={key:val for key,val in ndw.items() if val>min_docs}
    total=len(bows)
    ndw={key:np.log(total/val) for key,val in ndw.items()}
    bow={key:bow[key] for key in ndw.keys()}
    total=sum(bow.values())
    bow={key:val/total for key,val in bow.items()}
    tfidf={key:val*bow[key] for key,val in ndw.items()}
    return tfidf,bow,ndw
    

In [30]:
%%time
tfidf,BOW,ndw=get_tfidf_bows(bows,todrop={},min_len=2)

CPU times: user 7.05 s, sys: 11.8 ms, total: 7.06 s
Wall time: 7.06 s


In [31]:
Counter(ndw).most_common(30)

[('autotróficos', 10.564085714610725),
 ('sostelos', 10.564085714610725),
 ('casiodoro', 10.564085714610725),
 ('aedificatoria', 10.564085714610725),
 ('baudelaire', 10.564085714610725),
 ('reprodutibilidade', 10.564085714610725),
 ('reinterpretan', 10.564085714610725),
 ('imitativas', 10.564085714610725),
 ('trivio', 10.564085714610725),
 ('poesis', 10.564085714610725),
 ('valorable', 10.564085714610725),
 ('atesouramento', 10.564085714610725),
 ('aprecialas', 10.564085714610725),
 ('ordenalas', 10.564085714610725),
 ('fixalas', 10.564085714610725),
 ('augatinta', 10.564085714610725),
 ('calcografía', 10.564085714610725),
 ('interválicas', 10.564085714610725),
 ('mordentes', 10.564085714610725),
 ('barriñas', 10.564085714610725),
 ('xices', 10.564085714610725),
 ('esgrafiado', 10.564085714610725),
 ('crisoelefantina', 10.564085714610725),
 ('rabuña', 10.564085714610725),
 ('sardónica', 10.564085714610725),
 ('damasquinado', 10.564085714610725),
 ('adquirimos', 10.564085714610725),
 ('

## tfidf sentences based
The goal of this work is sentence-based, so it seems plausible compute the idf over an all sentences basis, that is think each sentence as a document for idf calculation 

In [32]:
#only sentences with more than 3 tokens
sentences=[item for item in unravel(sents) if len(item.split())>3]
#only unique sentences
sentences=list(set(sentences))
len(sentences)

2813985

In [33]:
%%time
all_tfidf={}
all_bow={}
all_ndw={}
for s in sentences:
    tbow=Counter(clean_text(s))
    for key,val in tbow.items():
        all_bow[key]=all_bow.get(key,0)+val
        all_ndw[key]=all_ndw.get(key,0)+1
all_ndw={key:val for key,val in all_ndw.items() if val > MIN_DOCS and len(key)>2}
all_bow={key:val for key,val in all_bow.items() if key in all_ndw.keys()}
total=sum(list(all_bow.values()))
all_bow={key:val/total for key,val in all_bow.items()}
total=len(sentences)
all_ndw={key:np.log(total/val) for key,val in all_ndw.items()}
all_tfidf={key:val*all_bow[key] for key,val in all_ndw.items()}

CPU times: user 1min 22s, sys: 15.2 ms, total: 1min 22s
Wall time: 1min 22s


In [34]:
Counter(all_ndw).most_common(15)

[('nucleoplasmina', 13.463817825031969),
 ('acilglicerois', 13.463817825031969),
 ('glicéridos', 13.463817825031969),
 ('fitando', 13.463817825031969),
 ('veículos', 13.463817825031969),
 ('dischiuso', 13.463817825031969),
 ('xastraría', 13.463817825031969),
 ('snefru', 13.463817825031969),
 ('alastair', 13.463817825031969),
 ('frappé', 13.463817825031969),
 ('aliñaban', 13.463817825031969),
 ('axustábeis', 13.463817825031969),
 ('campocidade', 13.463817825031969),
 ('golpealas', 13.463817825031969),
 ('mivágur', 13.463817825031969)]

## value computation

So, there is four plausible computation schemas:
* Classical: $tfidf=ndw \cdot bow$ where $ndw$ is the _idf_ on a document basis and $bow$ the bag of words of the document 
* Alternative1: $tfidf=ndw \cdot BOW$ where $BOW$ is the _Bag of Words_ for the collection of documents. These $tfidf$ values are the same for all documents.
* Alternative2: the same computation that _alternative1_, but $ndw$ is computed over sentences basis, as explain above
* Alternative3: the same computation that _classical_,  but $ndw$ is computed over sentences basis, as explain above

Values per sentence are computated with the function `get_value`:
* `sent`: the sentence, as list of tokens or string
* `tfidf`: the tfidf to apply for computation
* `func`: the function to apply to get the sentence value, `sum` by default

In [35]:
def get_value(sent,tfidf,func=np.sum):
    if isinstance(sent,str):
        sent=sent.split()
    res=[tfidf[key] for key in clean_text(sent) if key in tfidf.keys()]
    return func(res) if res else 0

For easy comparison, values results are recorded in a dataframe. Columns are named with a prefix and a suffix. The suffix is related to the method of computation: _Classical_ ==> _class_; _Alternative1_ ==> _alt1_ and so on.
The prefix are:
* _mean_: mean sentence value of sentences in article
* _max_: maximun sentence value in article
* _Sh_ : informational entropy of sentences values
* _weight_ : summation of sentences values in article weighted by $\left ( 1+\cfrac{1}{\#sentences} \right )$, so the effect of article length is somehow moderated

In [36]:
%%time
#Classical
v=[]
for sent,bow in zip(sents,bows):
    total=sum(list(bow.values()))
    td={key:val*ndw[key]/total for key,val in bow.items() if key in ndw.keys()}
    z=[get_value(s,td) for s in sent]
    val=basic_stats(z)
    w=1+1/len(z)
    v.append((val['mean'],val['max'],val['Sh'],sum(z)*w))



values=pd.DataFrame(v,columns=['mean_class','max_class','Sh_class','weight_class'])

                

CPU times: user 1min 53s, sys: 13.7 ms, total: 1min 53s
Wall time: 1min 54s


In [37]:
%%time
#Alternative1
v=[]
for sent in sents:
    z=[get_value(s,tfidf) for s in sent]
    val=basic_stats(z)
    w=1+1/len(z)
    v.append((val['mean'],val['max'],val['Sh'],sum(z)*w))


values['mean_alt1'],values['max_alt1'],values['Sh_alt1'],values['weight_alt1']=transpose(v)


                

CPU times: user 1min 44s, sys: 37 ms, total: 1min 44s
Wall time: 1min 44s


In [38]:
%%time
#Alternative2
v=[]
for sent in sents:
    z=[get_value(s,all_tfidf) for s in sent]
    val=basic_stats(z)
    w=1+1/len(z)
    v.append((val['mean'],val['max'],val['Sh'],sum(z)*w))


values['mean_alt2'],values['max_alt2'],values['Sh_alt2'],values['weight_alt2']=transpose(v)


                

CPU times: user 1min 43s, sys: 18.8 ms, total: 1min 43s
Wall time: 1min 43s


In [39]:
%%time
#Alternative3
v=[]
for sent,bow in zip(sents,bows):
    total=sum(list(bow.values()))
    td={key:val*all_ndw[key]/total for key,val in bow.items() if key in all_ndw.keys()}
    z=[get_value(s,td) for s in sent]
    val=basic_stats(z)
    w=1+1/len(z)
    v.append((val['mean'],val['max'],val['Sh'],sum(z)*w))


values['mean_alt3'],values['max_alt3'],values['Sh_alt3'],values['weight_alt3']=transpose(v)


                

CPU times: user 1min 51s, sys: 15 ms, total: 1min 51s
Wall time: 1min 51s


In [40]:
cols=[item for item in values.columns if 'weight' in item]
values[cols].describe()

Unnamed: 0,weight_class,weight_alt1,weight_alt2,weight_alt3
count,154876.0,154876.0,154876.0,154876.0
mean,5.245123,0.496778,1.189642,8.538427
std,7.738198,0.983606,2.450661,11.525053
min,0.55018,0.002359,0.001524,0.955701
25%,3.169407,0.121113,0.254709,5.235987
50%,4.016377,0.228298,0.534771,6.566205
75%,5.481901,0.485443,1.15795,8.981218
max,1155.439433,51.721561,123.608828,1591.58937


As expected there is high correlation between _Classical_ and _Alternative3_ by one hand and between alternatives _1_ and _2_   for all computed values

In [41]:
values[cols].corr()

Unnamed: 0,weight_class,weight_alt1,weight_alt2,weight_alt3
weight_class,1.0,0.352714,0.325513,0.949617
weight_alt1,0.352714,1.0,0.994747,0.39164
weight_alt2,0.325513,0.994747,1.0,0.366499
weight_alt3,0.949617,0.39164,0.366499,1.0


## Value classification comparison
Let's compare the result of the value classification with the calculation schemes _Classical_ and _Alternative1_

In [42]:
for pref in ['mean','max','weight']:
    print(f'{pref.upper()}_ values')
    print('-'*65)
    res=[]
    for suf in ['class','alt1']:
        col=pref+'_'+suf
        indx=values.sort_values(col).index
        res.append((indx[:5],indx[-5:]))
    print(f'{"Minor values":>40}')
    print(f'\t\tClassical{"Alternative1":>60}')

    for c,a in zip(res[0][0],res[1][0]):
        print(f'{articles[c].title:<55}\t{articles[a].title}')
    print(f'\n{"Higher values":>40}')
    print(f'\t\tClassical{"Alternative1":>60}')
    for c,a in zip(res[0][1],res[1][1]):
        print(f'{articles[c].title:<55}\t{articles[a].title}')

    print('\n\n')
              
    

MEAN_ values
-----------------------------------------------------------------
                            Minor values
		Classical                                                Alternative1
Telmatoscopus                                          	Especies de Rhododendron
Sphenomorphus                                          	Sphenomorphus
Lista de raíces indoeuropeas                           	Lista de guitarristas solistas
Lista de siglas e acrónimos                            	Portal:Aviación/Efemérides destacadas/1 de agosto
Isabel Soto                                            	Telmatoscopus

                           Higher values
		Classical                                                Alternative1
Valentin Ivanov                                        	A Ciberirmandade da Fala
Especies de Rhododendron                               	Diaño (mitoloxía)
Jamaica, Land We Love                                  	Midwinterhoorn
Búfalo anano                                          

And the calculation scheme with more intuitive results seems to be _weight_values_ _Alternative1_