# Introduction

This notebook is intended to process wikipedia dumps, with the goal of extract the maximum number of complete and significative sentences. This set of sentences could be used for any purpose in [NLP](https://en.wikipedia.org/wiki/Natural_language_processing), but information is also extracted which would be meaningful for _wikipedia_ itself.

The developed example is about _Galipedia_, the galician wikipedia, but it could be easily adapted for other languages just because it is _language-agnostic_.

# Download data

From __[Wikimedia Downloads](https://dumps.wikimedia.org/mirrors.html)__

from the mirror _Academic Computer Club, Umeå University_ (Last 5 good XML dumps, 'other' datasets): glwiki-20221120-pages-articles.xml.bz2



In [1]:
!wget http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/glwiki/20221120/glwiki-20221120-pages-articles.xml.bz2


!bunzip2 glwiki-20221120-pages-articles.xml.bz2

--2022-12-08 22:20:53--  http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/glwiki/20221120/glwiki-20221120-pages-articles.xml.bz2
Resolving ftp.acc.umu.se (ftp.acc.umu.se)... 194.71.11.173, 194.71.11.165, 194.71.11.163, ...
Connecting to ftp.acc.umu.se (ftp.acc.umu.se)|194.71.11.173|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://gemmei.ftp.acc.umu.se/mirror/wikimedia.org/dumps/glwiki/20221120/glwiki-20221120-pages-articles.xml.bz2 [following]
--2022-12-08 22:20:54--  http://gemmei.ftp.acc.umu.se/mirror/wikimedia.org/dumps/glwiki/20221120/glwiki-20221120-pages-articles.xml.bz2
Resolving gemmei.ftp.acc.umu.se (gemmei.ftp.acc.umu.se)... 194.71.11.137, 2001:6b0:19::137
Connecting to gemmei.ftp.acc.umu.se (gemmei.ftp.acc.umu.se)|194.71.11.137|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 302586132 (289M) [application/x-bzip2]
Saving to: ‘glwiki-20221120-pages-articles.xml.bz2’


2022-12-08 22:21:02 (34.0 MB/s) - ‘glwiki-2022112

# Libraries & functions

In [1]:
import re,os,pickle
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from collections import Counter
from random import choice, sample
from pathlib import Path
from time import time


from collections import namedtuple
document=namedtuple('document',['title','category','user','text'])

In [2]:
def unravel(lst):
    '''Unravel a list of lists
    lst: a list or set or tuple of lists/sets/tuples
    returns all values in one list'''
    ulst=[]
    for item in lst:
        if type(item) in [list,set,tuple]:
            ulst+=unravel(item)
        else:
            ulst.append(item)
    return ulst


In [3]:
def vectorice(iterable_list,thread_function,max_workers=None):
    '''
    given an iterable list with data, and a thread_function for process it
    set up a Pool for vectorice the function and returns the bunches in answer.
    max_workers stands for the number of threads launched. 
    If max_workers is None (or not an integer) it is set to the number of CPUs detected
    '''
    from multiprocessing.pool import Pool
    
    max_workers=max_workers if isinstance(max_workers,int) else os.cpu_count()
    
    R=lambda x,y=max_workers: list(range(0,len(x),len(x)//y))
    
    lR=R(iterable_list)
    
    params=[iterable_list[lR[i]:lR[i+1]] for i in range(len(lR)-2)]
    params.append(iterable_list[lR[-2]:])
    
    pool=Pool()
    answer=pool.map(thread_function,params)
    del pool

    return answer
        

It could be difficult preserve the _language-agnostic_ nature for sentence tokenization. Two ways are plausible:
* import a sentence tokenizer from nltk or any other suitable library
* write a custom function, which can take into account especifities of the dump we work with

In [4]:

#import nltk
#sent_tok= nltk.sent_tokenize

def sent_tok(text,ends='[\:\?\!\*\#]'):
    if type(text)!=list:
        text=[text]
    res=[]
    for item in text:
        #preserve ellipsis
        item=item.replace('...','…')
        #preserve some common abreviatures
        item=item.replace('a.C.','aC.').replace('d.C.','dC.')
        #preserve acronyms
        ini0=0
        sent=''
        for span in re.finditer('[^A-Zªº]\. ',item):
            ini,fin=span.span()
            sent+=item[ini0:ini+1]+'.\n'
            ini0=fin
        
        res.append(re.sub(ends,'\n',sent).split('\n'))
       
    return ([item.strip() for item in unravel(res) if item.strip()])

In [5]:
def any_in(txt,pttrn=[':','.jpg','.png','*[']):
    '''returns True if any of pttr is in txt'''
    for pt in pttrn:
        if pt in txt:
            return True
    return False



The `get_links` function try to get information to remove reference patterns while preserves the text if it is a natural part of text.  
The `drop_rec` function removes tables and links which could be recursive and with unbalanced open and close tags.

In [6]:
def get_links(page,pat_open=r'\{\{',pat_close=r'\}\}'):
    
   
    
    lini=[item.span() for item in re.finditer(pat_open,page)]
    lfin=[item.span() for item in re.finditer(pat_close,page)]
    if len(lfin)==0 and len(lini):
        ini=lini[0][0] if len(lini) else 0
        fin=len(page)-1
        return [page[ini:fin]]
    
    chunks=[]
    if len(lini)!=len(lfin):


        indxf=0
        indxi=0
        while indxf<len(lfin) and indxi<len(lini):
            ini=lini[indxi][0]
            while indxi<len(lini) and lini[indxi][1]<lfin[indxf][0] :
                indxi+=1
            if indxi==len(lini):
                fin=lfin[-1][1]
            else:
                while indxf<len(lfin) and lfin[indxf][0]<lini[indxi][1]:
                    indxf+=1
                fin=lfin[indxf-1][1]
            chunks.append(page[ini:fin])
    else:
        posini=[]
        for indx,posfin in enumerate(lfin):
            posini+=[item for item in lini[indx:] if item[1]<posfin[0]]
            if len([item for item in lini[indx:] if item[1]<posfin[0]])==1:
                chunks.append(page[posini[0][0]:posfin[1]])
                posini=[]
            
    
    return sorted(chunks,key=lambda x: len(x), reverse=True)
    
    

In [7]:
def drop_rec(page,pat_open=r'\{\|',pat_close=r'\|\}'):
    
    pato=re.compile(pat_open)
    patc=re.compile(pat_close)

    ini0=ini=pato.search(page,0)
    fin0=fin=patc.search(page,0)
    if ini0==None and fin0==None:
        return page
    elif fin0==None:
        return page[:ini0.start()]
    elif ini0==None:
        while fin:
            fin0=fin
            fin=patc.search(page,fin.end())
            
        return page[min(fin0.end(),len(page)-1):]
 
    chunks=[]
    nest=0
   
    while fin:

        while ini!=None and ini.end()<fin0.start():
            nest+=1
            ini=pato.search(page,ini.end())

       
        if not ini:# or nest>1:

            while fin:# and nest>1:
                nest-=1
                fin0=fin
                fin=patc.search(page,fin.end())

        elif fin:       
            
            while fin and fin.start()<ini.end() :
                nest-=1                   
                fin0=fin
                fin=patc.search(page,fin.end())

            if fin:
                if nest<1:
                    chunks.append([ini0.start(),fin0.end()])
                    ini0=ini  
                fin0=fin

        else:
            fin=None
        
        nest=max(0,nest)

    chunks.append([ini0.start(),fin0.end()])

    if ini and ini.start()>fin0.end():
        chunks.append([ini.start(),len(page)-1])

    
        
    
    res=''
    if chunks[0][0] == 0:                                                                                          
        ini0=chunks[0][1]
        chunks.pop(0)
    else:
        ini0=0
    
    for ini,fin in chunks:
        res+=page[ini0:ini]
        ini0=fin
    if ini0<len(page)-1:
        res+=page[ini0:]
    return res

## Cleaning pages

The function `clean_page` is the core of this notebook. This function accepts a wiki page and process it. It returns 4 elements:
* `title`: page title, string
* `contributors`: list with creator username
* `category`: list with assigned categories
* `text`: clean text, cleaned as described below  

There are a number of language dependent patterns, in this notebook the  selected patterns work with _Galipedia_, but must be easy adapt them for other languages:
* `patt_category`: pattern to extract categories
* `pages_to_drop`: patterns to identify in the title internal pages of wikipedia , such as 'Help', 'Model', ..., which do not contains any significant text.
* `terminal_sections`: The wiki articles have a defined structure and there are sections located at the end of articles whithout any significant text, like 'Bibliography', 'Notes',... 
* `patt_citation`: pattern for extract textual citations and incorporate it into text


In [8]:
patt_category=r'\[\[Categoría:(.*?)\]\]'
pages_todrop=['Axuda:', 'Wikipedia:','MediaWiki:','Modelo:','Categoría:','Módulo:']
terminal_sections= ['Palmarés',  'Festividades',  'Partidos históricos.*?',  'Filmografía',  'Galería.*?',  'Notas',  'Véxase tamén',  
                 'Bibliografía',  'Outros artigos',  'Ligazóns externas']
patt_citation=r'\{\{cita ?\|(.*?\.)\}\}'

In [9]:
def clean_page(page):
    title=re.findall(r'<title>(.*?)</title>',page)
    contributors=re.findall(r'<username>(.*?)</username>',page)
    category=re.findall(patt_category,page)
    
    txt=re.findall(r'<text.*?>(.*?)</text>',page)
    if title:
        title=title[0]
        if any_in(title,pages_todrop):
            return '',[],[],[]
    else:
        return '',[],[],[]
    if txt:
        txt=txt[0]
    else:
        return '',[],[],[]
    
    pos=[]
    for pat in terminal_sections:
        pos+=[item.span()[0] for item in re.finditer(r'={2,3} {0,1}%s {0,1}={2,3}'%pat ,txt)]
                
    pos.sort()
    pos=pos[0] if len(pos) else len(txt)
    txt=txt[:pos]
    
    #remove latex ecuations
    txt=re.sub(r'[<|&lt;]math.*?/math[&gt;|>]',' ',txt)
    
    #Remove Boxes
    txt=re.sub(r'\{\{Start box\}\}.*?\{\{End box\}\}',' ',txt)
    
    #remove html divisions
    txt=re.sub(r'&lt; ?{0}.*?/{0}?&gt;'.format('div'),' ',txt)
    
    #remove graphs
    txt=re.sub(r'&lt; ?{0}.*?/{0}?&gt;'.format('graph'),' ',txt)
    
    #remove galeries
    txt=re.sub(r'&lt; ?{0}.*?/{0}?&gt;'.format('gallery'),' ',txt)

    
    #Remove tables, links and references, potentially recursive and unmatched
    #remove citations
    if '{{' in txt:
        txt=drop_rec(txt,r'\{\{',r'\}\}')
        

    #remove tables
    txt=txt.replace('&lt;table','&lt; {|').replace( '/table&gt;','|} &gt;')
    txt=txt.replace('&lt;TABLE','&lt; {|').replace( '/TABLE&gt;','|} &gt;')
    
    if any_in(txt,[r'{|',r'|}']):
            txt=drop_rec(txt,r'\{\|',r'\|\}')
 
    #remove links
    if '[[' in txt:
        for l in get_links(txt,r'\[\[',r'\]\]'):

            rpl=''
            if not any_in(l,list(':/')):
                m=l.strip('[]')
                rpl=m.split('|')[1]  if '|' in l else m

            txt=txt.replace(l,rpl)
        
   
    #remove special cases of references

    txt=re.sub(r'&lt; ?Ref.*?/ref?&gt;',' ',txt)
   
    #remove sections
    
    for tag in ['div','ref','nowiki','graph','timeline','center','Center','syntaxhighlight','sub','sup','span','time','small','big','gallery','imagemap']:
        txt=re.sub(r'&lt; ?{0}.*?/{0}?&gt;'.format(tag),' ',txt)
        txt=re.sub(r'&lt; ?{}.*?/&gt;'.format(tag),' ',txt)
        
    
    
    
    txt=re.sub(r'&lt;noinclude&gt;',' ',txt)
    txt=re.sub(r'{{nowrap.*?}}',' ',txt) 
    
    txt=re.sub(r'\[http.*?\]','',txt)
    
    txt=re.sub(r'\{.*?\|left','',txt)
    txt=re.sub(r'/{0,9}center {0,20}\|{0,9}','',txt)
    txt=re.sub(r'nbsp;',' ',txt)
    txt=re.sub(r'br ?/',' ',txt)
    
    
    txt=re.sub(r'\|{1,20}left','',txt)
    txt=re.sub(r'/{1,20}math','',txt)
    txt=re.sub(r'\\frac','',txt)
   
    
    #rid off misswrited numbers

    for p in re.findall(r'([0-9]+\. {0,9}[0-9]+)',txt):

        q=re.sub('\. {0,9}','',p)
        txt=re.sub(p,q,txt)
    
    
    
    
    
    
    #Ad hoc
    txt=re.sub(r'\|wid.*?top\|','',txt);
    txt=re.sub(r'\|\|.*?;\|','',txt);
    txt=re.sub(r'\|?rowspan.*?\|[(Clas)|(Des)].*?[á|\|]','',txt);
    txt=re.sub(r'\| ?colspan.*?[\.|\|]','',txt);
    txt=re.sub(r'colspan=.*?[!|\|]','',txt);
    txt=re.sub(r'rowspan=.*?[!|\|]','',txt);
    txt=re.sub(r'\|[\-| ]bg.*?\.','',txt);
    txt=re.sub(r'\|vh?align.*?\|','',txt);
    txt=re.sub(r'\|? ?style=.*?[\.|\|]','',txt);
    txt=re.sub(r'| ?align=|','',txt)
    txt=re.sub(r'{| ?class=wikitable.*?\|-','',txt)
    txt=re.sub(r'{| ?class=wikitable.*?!','',txt)
    txt=re.sub(r'\| vtop \| {1,2}\| width=50% vtop \|','',txt)
    
    #Tidy text for sent tokenize
    txt=re.sub(r'etc\.','etc…',txt)
    txt=re.sub(r' ?={2,20} ?','. ',txt)
    txt=re.sub(r'\([. ]*?\)','',txt)
    txt=re.sub(r'\. *?\.','. ',txt)
    txt=re.sub(r'\.{2,20}','. ',txt)
    txt=txt.replace('}',' ')
    txt=re.sub(r'\'{2,20}','\'',txt)
    txt=re.sub(r'&lt;.*?&gt;','',txt)
    txt=re.sub(r'&.*?;','',txt)
     
    
    txt=[item.strip() for item in sent_tok(txt)]
    
    return title, category, contributors, txt

In [10]:
alphabet='A-ZÁÂÉÊÍÎÓÔÚÛÜÑÇ'
alpha_text=lambda x: re.sub('[^{} -]'.format(alphabet+alphabet.lower()),'',x)
transpose=lambda x: list(zip(*x))

In [11]:
def clean_text(sent):
    alphabetized=lambda x: re.sub('[^{}]'.format(alphabet+alphabet.lower()),'',x)
    
    if isinstance(sent,str):
        sent=sent.split()
    
    sent_f=[alphabetized(i) for i in sent if alphabetized(i)]
    if not(sent_f):
        return None
    sent_f=[sent_f[0] if len(sent_f[0])<2 or sent[0].istitle() else '']+[i for i in sent_f[1:] if  i.islower()] 
    sent_f=[i.lower() for i in sent_f if i]
    return sent_f
    

def basic_feat(text):
    
    nsent=0
    ntok=[]
    nword=[]
    sentences=[]
    stream=[]
    
    
    
    for item in text:
        sent=[s for s in item.split() if len(s)<MAX_CHAR_TOKEN]
        
        if len(sent)<MIN_TOKENS :
            continue
        
        sent_f=clean_text(sent)
        if sent_f:
            nsent+=1
            ntok.append(len(sent))
            nword.append(len(sent_f))
                                                        
            stream+=sent_f
            sentences.append(' '.join(sent))
        
    return nsent,ntok,nword,Counter(stream),sentences         

# "valor" de una frase suma de los tf-idf de los tokens de la frase (promedio por token)
#value=lambda x: sum([tfidf[key]/len(x.split()) for key in x.split() if key in tfidf.keys()])

def basic_stats(vals):
    vals=np.array(vals)
    if vals.size >0:
        columns=['mean','std','min','max','Sh','Q25','Q50','Q75']
        Sh=np.array(list(Counter(vals).values()))
        Sh=Sh/Sh.sum()
        res=[vals.mean(),vals.std(),vals.min(),vals.max(),-(Sh*np.log(Sh)).sum()]
        res+=list(np.quantile(vals,[0.25,0.5,0.75]))
        return {key:val for key,val in zip(columns,res)}
    
    
    return {key:val for key,val in zip(columns,[0]*len(columns))}

# Parameters
* `path`: Path object pointing to xml dumps directory, this notebook and auxiliary files must be in the same folder.
* `MIN_TOKENS`: minimum number of tokens for a valid sentence
* `MIN_SENTS`: minumum number of valid sentences for a valid document
* `MAX_CHAR_TOKEN`: maximum number of characters in a valid token
* `MIN_DOCS`: minimum number of document frequency to be included in tfidf

In [12]:
path=Path.cwd()
MIN_TOKENS=4
MIN_SENTS=2
MAX_CHAR_TOKEN=50
MIN_DOCS=3

In [12]:
%%time
cpgl=' '.join(list(path.glob('*.xml'))[0].read_text(encoding='UTF8').split())

pages=(re.findall('<page>(.*?)</page>',cpgl))

with open('pages_wiki.pkl','wb') as fich:
    pickle.dump(pages,fich)

len(pages)

CPU times: user 29.6 s, sys: 9.64 s, total: 39.2 s
Wall time: 56.6 s


386308

In [13]:
with open('pages_wiki.pkl','rb') as fich:
    pages=pickle.load(fich)

len(pages)

386308

In [14]:
def thread_function(pages):
    docs=[]
    pgs=[]
    for indx,page in enumerate(pages):
        try:
            title,category,user,text=clean_page(page)
            #Filter internal pages
            if not title or not text:
                continue
           
            
            #Filter documentns by length
            test=[alpha_text(item).split() for item in text]
            
            test=[item for item in test if len(item)>MIN_TOKENS]

            if len(test)>MIN_SENTS:
                docs.append((title,category,user,text))
                pgs.append(page)
        except Exception as e:
            print(f'Exception: {e}\nFailure: {page}')
           
    return docs,pgs


In [17]:
%%time
articles=[]
selected_pages=[]
for batch in vectorice(iterable_list=pages,  thread_function=thread_function, max_workers=1000):
    d,p=(batch)
    articles+=d
    selected_pages+=p

CPU times: user 6.98 s, sys: 2.63 s, total: 9.61 s
Wall time: 41.6 s


In [18]:
with open('articles20221120_gl.pkl','wb') as fich:
    pickle.dump(articles,fich)
len(articles)

150261

In [19]:
with open('selected_pages.pkl','wb') as fich:
    pickle.dump(selected_pages,fich)
len(selected_pages)

150261

In [20]:
with open('articles20221120_gl.pkl','rb') as fich:
    articles=pickle.load(fich)
articles=[document(*item) for item in articles]
len(articles)


150261

# Contributors

In [49]:
users=Counter(unravel([item.user for item in articles])).most_common()

limit=0.75
print(f'Total number of users: {len(users)}')
print(f'Most active users ({100*limit}% of #articles):')
total=len(articles)
cum=0
print(f'{"User":<30}\t{"#articles"}\t{"%articles"}')
print('-'*60)
for user,val in users:
    print(f'{user:<30}\t{val:>8}\t{val/total:>8.3%}')
    cum+=val/total
    if cum>limit:
        break


Total number of users: 1205
Most active users (75.0% of #articles):
User                          	#articles	%articles
------------------------------------------------------------
InternetArchiveBot            	   26643	 17.731%
Breogan2008                   	   23128	 15.392%
BanjoBot 2.0                  	   11982	  7.974%
Breobot                       	    9267	  6.167%
Corribot                      	    8926	  5.940%
HombreDHojalata               	    6854	  4.561%
Chairego apc                  	    6073	  4.042%
Estevoaei                     	    6034	  4.016%
Zaosbot                       	    5428	  3.612%
Xanetas                       	    2843	  1.892%
Xas                           	    2626	  1.748%
Alfonso Márquez               	    1962	  1.306%
MAGHOI                        	    1851	  1.232%


If can be assumed that a user name that contains 'bot' in it identifies a bot...

In [50]:
bots={key:val for key,val in users if 'bot' in key or 'Bot' in key}
print(f'Total number of bots: {len(bots)}')
print(f'Total articles: {sum(bots.values())},  {100*sum(bots.values())/len(articles):0.3f}%')
sts=basic_stats(list(bots.values()))
print('number of articles by bot')
for key in ['mean','min','Q25','Q50','Q75','max','Sh']:
    print(f'\t{key}: {sts[key]}')
    
print('\n\n')
print('User name       \tnº articles \t%articles')
for key,val in Counter(bots).most_common():
    while len(key)<20:
        key+=' '
    print(f'{key} \t{val} \t\t{round(100*val/len(articles),3)}')


Total number of bots: 20
Total articles: 64574,  42.975%
number of articles by bot
	mean: 3228.7
	min: 1
	Q25: 5.75
	Q50: 21.5
	Q75: 2272.75
	max: 26643
	Sh: 2.6923109941417858



User name       	nº articles 	%articles
InternetArchiveBot   	26643 		17.731
BanjoBot 2.0         	11982 		7.974
Breobot              	9267 		6.167
Corribot             	8926 		5.94
Zaosbot              	5428 		3.612
Chairebot            	1221 		0.813
Aosbot               	819 		0.545
Addbot               	131 		0.087
BotDHojalata         	70 		0.047
EmausBot             	32 		0.021
Escarbot             	11 		0.007
Xqbot                	11 		0.007
KLBot2               	10 		0.007
Texvc2LaTeXBot       	9 		0.006
BanjoBot             	7 		0.005
MGA73bot             	2 		0.001
Prebot               	2 		0.001
TohaomgBot           	1 		0.001
Hector Bottai        	1 		0.001
Jembot               	1 		0.001


In [51]:
human={key:val for key,val in users if not ('bot' in key or 'Bot' in key)}
print(f'Total number of human contributors: {len(human)}')
sts=basic_stats(list(human.values()))
print('number of articles by human')
for key in ['mean','min','Q25','Q50','Q75','max','Sh']:
    print(f'\t{key}: {sts[key]}')

print('\n\n')
limit=0.90
print(f'Users which accounts for {100*limit} % of human articles')
print('User name                 \tnº articles \t%articles')
acum=0
total=sum(list(human.values()))
for key,val in Counter(human).most_common():
    if acum>limit:
        break
    acum+=val/total
    while len(key)<25:
        key+=' '
    print(f'{key} \t{val} \t\t{round(100*val/len(articles),3)}')
    

Total number of human contributors: 1185
number of articles by human
	mean: 70.98059071729958
	min: 1
	Q25: 1.0
	Q50: 1.0
	Q75: 3.0
	max: 23128
	Sh: 1.9893721175190062



Users which accounts for 90.0 % of human articles
User name                 	nº articles 	%articles
Breogan2008               	23128 		15.392
HombreDHojalata           	6854 		4.561
Chairego apc              	6073 		4.042
Estevoaei                 	6034 		4.016
Xanetas                   	2843 		1.892
Xas                       	2626 		1.748
Alfonso Márquez           	1962 		1.306
MAGHOI                    	1851 		1.232
Miguelferig               	1728 		1.15
RubenWGA                  	1640 		1.091
Beninho                   	1497 		0.996
Vitoriaogando             	1323 		0.88
Moedagalega               	1148 		0.764
Elisardojm                	1086 		0.723
Jglamela                  	1040 		0.692
Xosema                    	982 		0.654
HacheDous=0               	960 		0.639
Maria zaos                	917 		0.61
CommonsDelink

# Categories

In [52]:
categories=unravel([item.category for item in articles])
categories=Counter(categories).most_common()
print(f'Total number of categories used: {len(categories)}')
sts=basic_stats(list(dict(categories).values()))
print('number of articles by category')
for key in ['mean','min','Q25','Q50','Q75','max','Sh']:
    print(f'\t{key}: {sts[key]}')
print('\n\n')



print('Category                  \t\t\t\t\t\tnº articles \t%articles')
for key,val in categories:
    if val<200:
        break
    while len(key)<70:
        key+=' '
    print(f'{key} \t{val} \t\t{round(100*val/len(articles),3)}')    

Total number of categories used: 69849
number of articles by category
	mean: 6.983450013600767
	min: 1
	Q25: 1.0
	Q50: 1.0
	Q75: 4.0
	max: 5065
	Sh: 2.1225501000240916



Category                  						nº articles 	%articles
Personalidades de Galicia sen imaxes                                   	5065 		3.371
Filmes en lingua inglesa                                               	3203 		2.132
Filmes dos Estados Unidos de América                                   	2730 		1.817
Topónimos galegos con etimoloxía                                       	1594 		1.061
Escritores de Galicia en lingua galega                                 	1427 		0.95
Alumnos da Universidade de Santiago de Compostela                      	1371 		0.912
Nados en ano descoñecido                                               	1366 		0.909
Personalidades sen imaxes                                              	1085 		0.722
Escritores de Galicia en lingua castelá                                	1074 		0.715
Personalid

# Basic features  
Language-agnostic preliminary analysis, so there is not misspelled words control or token lemmatization.  
It relies on three routines:  
* `clean_text`: Used to create the _Bag of Words_ (bow) for each article. Only alphabetical chars are allowed. The input is a sentence (list of tokens or string) and the output is a list of alphabetical tokens.
* `basic_feat`: the main routine. The input is the extracted text of each article. This is processed to get:
    * The number of sentences in the article, an `int`
    * A list with the number of tokens in each sentence of the document
    * A list with the number of tokens in the output of `clean_text` applied to each sentence; these are called _words_.
    * A dictionary with the bow of the document, as defined above.
    * A list with the sentences of the article text, after applied the `MAX_CHAR_TOKEN` filter and the `MIN_TOKENS` filter
* `basic_stats`: the input is a list of numerical values and returns a dictionary with the mean, standard deviation, maximum, minimun, informational entropy and th quantile values for 25%, 50% (median) and 75%




In [53]:
def thread_function(arts):
    bows=[]
    feats=[]
    sents=[]

    for art in arts:
        nsent,ntok,nword,bow,sentences=basic_feat(art.text)
        sents.append(sentences)
        bows.append(bow)
        numt=basic_stats(ntok)
        numw=basic_stats(nword)
        nums=basic_stats(list(bow.values()))
        feats.append([art.title,nsent,numt['mean'],numt['Sh'],numw['mean'],numw['Sh'],nums['Sh'],sum(list(bow.values()))/len(bow) if bow else 0])
    return (bows,feats,sents)

In [54]:
%%time

answer=vectorice(articles,thread_function,max_workers=256)

bows=[]
feats=[]
sents=[]

for b,f,s in answer:
    bows+=(b)
    feats+=(f)
    sents+=(s)

CPU times: user 11.5 s, sys: 2.72 s, total: 14.2 s
Wall time: 26.1 s


# Basic features
Data frame with basic features of each article:
* `key`: article title
* `nsent`: number of sentences in the article
* `mean_tok`: mean of number of tokens per sentence in article
* `Sh_tok`: Informational entropy of tokens per sentence in article, in _nats_
* `mean_word`: mean of number of alphabetical tokens (`clean_text` output) per sentence in article
* `Sh_word`: Informational entropy of alphabetical tokens per sentence in article
* `Sh_bow`: Informational entropy of article's Bag of Words (bow) 
* `IL`: Lexical index, defined as $\cfrac{\#(words~in~article)}{\#(unique~words)}$

In [55]:
feats=pd.DataFrame(feats,columns=[ 'key','nsent','mean_tok','Sh_tok','mean_word','Sh_word','Sh_bow','IL'])
feats.describe()

Unnamed: 0,nsent,mean_tok,Sh_tok,mean_word,Sh_word,Sh_bow,IL
count,150261.0,150261.0,150261.0,150261.0,150261.0,150261.0,150261.0
mean,18.70311,19.160715,1.999124,15.365211,1.951485,0.867772,1.75518
std,33.389359,6.285748,0.705631,5.843242,0.704695,0.239403,0.954831
min,2.0,4.194444,-0.0,1.122449,-0.0,-0.0,1.0
25%,5.0,14.823529,1.386294,11.25,1.386294,0.715393,1.423077
50%,9.0,19.0,1.94591,15.2,1.906155,0.868114,1.628571
75%,19.0,22.875,2.51641,19.0,2.466577,1.016376,1.9
max,1507.0,101.5,4.283692,72.5,4.14555,2.51651,96.608696


In [56]:
feats.sort_values('nsent')

Unnamed: 0,key,nsent,mean_tok,Sh_tok,mean_word,Sh_word,Sh_bow,IL
47255,José María Balmón,2,16.500000,0.693147,12.500000,0.693147,0.198515,1.250000
47504,Edward Santana,2,20.000000,0.693147,15.000000,0.693147,0.456334,1.250000
70458,Edovius,2,22.500000,0.693147,21.500000,0.693147,0.545869,1.228571
60988,Sterling Beaumon,2,18.500000,0.693147,13.500000,0.693147,0.470236,1.173913
106703,Bolidophyceae,3,9.000000,1.098612,8.666667,1.098612,0.286836,1.083333
...,...,...,...,...,...,...,...,...
64604,Historia de Serbia,967,23.766287,3.845294,20.478800,3.732687,1.500926,4.462145
39,Ciencia,1023,21.548387,3.764184,20.116325,3.710465,1.564327,4.766968
93537,Eleccións municipais de 2015 en Galicia,1312,8.362043,2.184080,3.057927,1.573469,1.857073,10.872629
106532,Mártires do século XX en España,1346,5.564636,1.622353,1.943536,1.237220,1.608374,4.158983


# Computing idf

[_idf_](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) has a yet long history in information retrieval and gives a way to estimate the relative _value_ of a word, a sentence or a document.
 The classical definition is $\mathrm{idf}(t, D) = \log \cfrac{N}{|\{d \in D: t \in d\}|}$; where $N$ is the number of documents, $D$ in the collection (in this context the number of extracted articles), and $|\{d \in D: t \in d\}|$ the number of documents, $d$, which contains the term $t$.


The function `get_tfidf_bows` gets a list with the [_Bag of Wods_](https://en.wikipedia.org/wiki/Bag-of-words_model) for each document as input and returns:
* `ndw` the _idf_ for each term in the collection. 
* `bow` the _Bag of Words_ for the entire collection. A dictionary where the term is the key. The keys in ndw and bow are the same
* `tfidf` the _tfidf_ value for each document in the collection: _bow·ndw_

All three are python dictionaries where the term is the key. 
Applying `tfidf` or `ndw` implies that any term not included in these dictionaries has 0 value.

Three filters could be applied to construct `nwd` and `tfidf`:
* `min_len` : minimum length of the term, so word of 1, 2 or 3 letters have a low probability of be significant.
* `min_docs` : minimum number of documents that include the term for the term to be included. It is supposed that this filter removes misspelled words and extremely exotic words.
* `todrop` : a set with terms to be excluded arbitrarily.

In [57]:
def get_tfidf_bows(bows,todrop={},min_len=4,min_docs=MIN_DOCS):
    
    bow={}
    ndw={}
    for b in bows:
        for key,val in b.items():
            if len(key)<min_len or key in todrop:
                continue
            bow[key]=bow.get(key,0)+val
            ndw[key]=ndw.get(key,0)+1
        
    tfidf={}
    ndw={key:val for key,val in ndw.items() if val>min_docs}
    total=len(bows)
    ndw={key:np.log(total/val) for key,val in ndw.items()}
    bow={key:bow[key] for key in ndw.keys()}
    total=sum(bow.values())
    bow={key:val/total for key,val in bow.items()}
    tfidf={key:val*bow[key] for key,val in ndw.items()}
    return tfidf,bow,ndw
    

In [58]:
%%time
tfidf,BOW,ndw=get_tfidf_bows(bows,todrop={},min_len=2)

CPU times: user 7.06 s, sys: 21.4 ms, total: 7.08 s
Wall time: 7.08 s


In [62]:
Counter(ndw).most_common(15)

[('autotróficos', 10.533834699912223),
 ('sostelos', 10.533834699912223),
 ('casiodoro', 10.533834699912223),
 ('aedificatoria', 10.533834699912223),
 ('alberti', 10.533834699912223),
 ('baudelaire', 10.533834699912223),
 ('reprodutibilidade', 10.533834699912223),
 ('reinterpretan', 10.533834699912223),
 ('imitativas', 10.533834699912223),
 ('trivio', 10.533834699912223),
 ('poesis', 10.533834699912223),
 ('valorable', 10.533834699912223),
 ('aprecialas', 10.533834699912223),
 ('ordenalas', 10.533834699912223),
 ('fixalas', 10.533834699912223)]

## tfidf sentences based
The goal of this work is sentence-based, so it seems plausible compute the idf over an all sentences basis, that is think each sentence as a document for idf calculation 

In [63]:
#only sentences with more than 3 tokens
sentences=[item for item in unravel(sents) if len(item.split())>3]
#only unique sentences
sentences=list(set(sentences))
len(sentences)

2744914

In [64]:
%%time
all_tfidf={}
all_bow={}
all_ndw={}
for s in sentences:
    tbow=Counter(clean_text(s))
    for key,val in tbow.items():
        all_bow[key]=all_bow.get(key,0)+val
        all_ndw[key]=all_ndw.get(key,0)+1
all_ndw={key:val for key,val in all_ndw.items() if val > MIN_DOCS and len(key)>2}
all_bow={key:val for key,val in all_bow.items() if key in all_ndw.keys()}
total=sum(list(all_bow.values()))
all_bow={key:val/total for key,val in all_bow.items()}
total=len(sentences)
all_ndw={key:np.log(total/val) for key,val in all_ndw.items()}
all_tfidf={key:val*all_bow[key] for key,val in all_ndw.items()}

CPU times: user 2min 17s, sys: 0 ns, total: 2min 17s
Wall time: 2min 17s


In [65]:
Counter(all_ndw).most_common(15)

[('desvalorizarse', 13.438965941624746),
 ('desestiba', 13.438965941624746),
 ('precompresión', 13.438965941624746),
 ('bedeis', 13.438965941624746),
 ('acaramelado', 13.438965941624746),
 ('bicoca', 13.438965941624746),
 ('tomounas', 13.438965941624746),
 ('doñana', 13.438965941624746),
 ('penalizando', 13.438965941624746),
 ('damours', 13.438965941624746),
 ('attenboroughi', 13.438965941624746),
 ('contribuiço', 13.438965941624746),
 ('serpentean', 13.438965941624746),
 ('ferroquelatase', 13.438965941624746),
 ('fiscalizadora', 13.438965941624746)]

## value computation

So, there is four plausible computation schemas:
* Classical: $tfidf=ndw \cdot bow$ where $ndw$ is the _idf_ on a document basis and $bow$ the bag of words of the document 
* Alternative1: $tfidf=ndw \cdot BOW$ where $BOW$ is the _Bag of Words_ for the collection of documents. These $tfidf$ values are the same for all documents.
* Alternative2: the same computation that _alternative1_, but $ndw$ is computed over sentences basis, as explain above
* Alternative3: the same computation that _classical_,  but $ndw$ is computed over sentences basis, as explain above

Values per sentence are computated with the function `get_value`:
* `sent`: the sentence, as list of tokens or string
* `tfidf`: the tfidf to apply for computation
* `func`: the function to apply to get the sentence value, `sum` by default

In [66]:
def get_value(sent,tfidf,func=np.sum):
    if isinstance(sent,str):
        sent=sent.split()
    res=[tfidf[key] for key in clean_text(sent) if key in tfidf.keys()]
    return func(res) if res else 0

For easy comparison, values results are recorded in a dataframe. Columns are named with a prefix and a suffix. The suffix is related to the method of computation: _Classical_ ==> _class_; _Alternative1_ ==> _alt1_ and so on.
The prefix are:
* _mean_: mean sentence value of sentences in article
* _max_: maximun sentence value in article
* _Sh_ : informational entropy of sentences values
* _weight_ : summation of sentences values in article weighted by $\left( 1+\cfrac{1}{\#sentences} \right)$, so the effect of article length is somehow moderated

In [67]:
%%time
#Classical
v=[]
for sent,bow in zip(sents,bows):
    total=sum(list(bow.values()))
    td={key:val*ndw[key]/total for key,val in bow.items() if key in ndw.keys()}
    z=[get_value(s,td) for s in sent]
    val=basic_stats(z)
    w=1+1/len(z)
    v.append((val['mean'],val['max'],val['Sh'],sum(z)*w))



values=pd.DataFrame(v,columns=['mean_class','max_class','Sh_class','weight_class'])

                

CPU times: user 2min 48s, sys: 10.7 ms, total: 2min 48s
Wall time: 2min 48s


In [68]:
%%time
#Alternative1
v=[]
for sent in sents:
    z=[get_value(s,tfidf) for s in sent]
    val=basic_stats(z)
    w=1+1/len(z)
    v.append((val['mean'],val['max'],val['Sh'],sum(z)*w))


values['mean_alt1'],values['max_alt1'],values['Sh_alt1'],values['weight_alt1']=transpose(v)


                

CPU times: user 2min 40s, sys: 34.1 ms, total: 2min 40s
Wall time: 2min 40s


In [69]:
%%time
#Alternative2
v=[]
for sent in sents:
    z=[get_value(s,all_tfidf) for s in sent]
    val=basic_stats(z)
    w=1+1/len(z)
    v.append((val['mean'],val['max'],val['Sh'],sum(z)*w))


values['mean_alt2'],values['max_alt2'],values['Sh_alt2'],values['weight_alt2']=transpose(v)


                

CPU times: user 2min 39s, sys: 3.11 ms, total: 2min 39s
Wall time: 2min 39s


In [70]:
%%time
#Alternative3
v=[]
for sent,bow in zip(sents,bows):
    total=sum(list(bow.values()))
    td={key:val*all_ndw[key]/total for key,val in bow.items() if key in all_ndw.keys()}
    z=[get_value(s,td) for s in sent]
    val=basic_stats(z)
    w=1+1/len(z)
    v.append((val['mean'],val['max'],val['Sh'],sum(z)*w))


values['mean_alt3'],values['max_alt3'],values['Sh_alt3'],values['weight_alt3']=transpose(v)


                

CPU times: user 2min 48s, sys: 15.2 ms, total: 2min 48s
Wall time: 2min 48s


In [72]:
cols=[item for item in values.columns if 'weight' in item]
values[cols].describe()

Unnamed: 0,weight_class,weight_alt1,weight_alt2,weight_alt3
count,150261.0,150261.0,150261.0,150261.0
mean,5.094203,0.491009,1.206788,8.322778
std,7.208282,0.968784,2.476938,11.949575
min,0.653132,0.002347,0.001522,0.748882
25%,3.121288,0.119291,0.262656,5.194379
50%,3.925353,0.224685,0.534423,6.43838
75%,5.317236,0.479906,1.173499,8.679648
max,812.725555,50.177468,122.930273,1322.069086


As expected there is high correlation between _Classical_ and _Alternative3_ by one hand and between alternatives _1_ and _2_   for all computed values

In [73]:
values[cols].corr()

Unnamed: 0,weight_class,weight_alt1,weight_alt2,weight_alt3
weight_class,1.0,0.369648,0.345998,0.935061
weight_alt1,0.369648,1.0,0.995196,0.375781
weight_alt2,0.345998,0.995196,1.0,0.354982
weight_alt3,0.935061,0.375781,0.354982,1.0


## Value classification comparison
Let's compare the result of the value classification with the calculation schemes _Classical_ and _Alternative1_

In [74]:
for pref in ['mean','max','weight']:
    print(f'{pref.upper()}_ values')
    print('-'*65)
    res=[]
    for suf in ['class','alt1']:
        col=pref+'_'+suf
        indx=values.sort_values(col).index
        res.append((indx[:5],indx[-5:]))
    print(f'{"Minor values":>40}')
    print(f'\t\tClassical{"Alternative1":>60}')

    for c,a in zip(res[0][0],res[1][0]):
        print(f'{articles[c].title:<55}\t{articles[a].title}')
    print(f'\n{"Higher values":>40}')
    print(f'\t\tClassical{"Alternative1":>60}')
    for c,a in zip(res[0][1],res[1][1]):
        print(f'{articles[c].title:<55}\t{articles[a].title}')

    print('\n\n')
              
    

MEAN_ values
-----------------------------------------------------------------
                            Minor values
		Classical                                                Alternative1
Telmatoscopus                                          	Especies de Rhododendron
Especies de Rhododendron                               	Lista de xentilicios de concellos galegos
Lista de raíces indoeuropeas                           	Lista de guitarristas solistas
Lista de capítulos de O detective Conan                	Benthamia
Isabel Soto                                            	Cadros de xogadores de tempadas pasadas do Hockey Club Liceo

                           Higher values
		Classical                                                Alternative1
Hisham I                                               	Partido Socialista Obrero Español en Galicia
Paranarrador                                           	Midwinterhoorn
Búfalo anano                                           	Bolsa de estudos


And the calculation scheme with more intuitive results seems to be _weight_values_ _Alternative1_