<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#数据处理" data-toc-modified-id="数据处理-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data processing</a></span><ul class="toc-item"><li><span><a href="#数据概览" data-toc-modified-id="数据概览-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data overview</a></span></li><li><span><a href="#维基百科数据xml-parser" data-toc-modified-id="维基百科数据xml-parser-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Wikipedia XML parser</a></span></li><li><span><a href="#Pass-1:-收集标题和重定向" data-toc-modified-id="Pass-1:-收集标题和重定向-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Pass 1: Collecting titles and redirects</a></span></li><li><span><a href="#预处理和统计" data-toc-modified-id="预处理和统计-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Preprocessing and statistics</a></span></li><li><span><a href="#编号" data-toc-modified-id="编号-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Indexing</a></span></li><li><span><a href="#Pass2:-链接" data-toc-modified-id="Pass2:-链接-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Pass 2: Links</a></span><ul class="toc-item"><li><span><a href="#从page正文中提取链接" data-toc-modified-id="从page正文中提取链接-1.6.1"><span class="toc-item-num">1.6.1&nbsp;&nbsp;</span>Extracting links from page text</a></span></li></ul></li></ul></li><li><span><a href="#数据加载类" data-toc-modified-id="数据加载类-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data loading classes</a></span></li><li><span><a href="#数据库" data-toc-modified-id="数据库-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Database</a></span></li><li><span><a href="#附录" data-toc-modified-id="附录-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Appendix</a></span></li></ul></div>

In [None]:
DATA_PATH = r"D:\data" + "\\"

In [6]:
import pickle
import numpy as np
debugging=False

import os
import sys
sys.path.append("../common")

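# Data processing
## Data overview
We downloaded the English Wikipedia dump of April 1, 2020. The data is a single XML file. (The file path used in the code below points to the 20201201 dump.)
The first 50,000 characters are shown in the appendix.

The whole dataset is made up of pages. There are two kinds of page: redirect pages, whose text is not real content, and pages whose text holds the actual content.
Every page has a namespace. Namespace 0 marks an ordinary article; other namespaces cover templates, talk pages, categories and so on.
We only read pages in namespace 0.
Every page has a title. Titles are normally unique, but there are rare exceptions: in namespace 0 we found one title (Tirupattur) that appears twice.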
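
To make the page types concrete, here is a minimal sketch that classifies pages by namespace and by the presence of a `<redirect>` element. The `SAMPLE` string is hand-written for illustration and omits the `xmlns` attribute of the real dump; `classify_pages` is a hypothetical helper, not part of this notebook's pipeline.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of the dump (no xmlns, heavily shortened).
SAMPLE = """
<mediawiki>
  <page><title>AccessibleComputing</title><ns>0</ns>
    <redirect title="Computer accessibility" />
    <revision><text>#REDIRECT [[Computer accessibility]]</text></revision></page>
  <page><title>Anarchism</title><ns>0</ns>
    <revision><text>Anarchism is a [[political philosophy]].</text></revision></page>
  <page><title>Talk:Anarchism</title><ns>1</ns>
    <revision><text>Discussion.</text></revision></page>
</mediawiki>
"""


def classify_pages(xml_text):
    """Return (title, kind) pairs; kind is 'redirect', 'article' or 'other'."""
    result = []
    for page in ET.fromstring(xml_text).iter("page"):
        title = page.findtext("title")
        ns = int(page.findtext("ns"))
        if ns != 0:
            kind = "other"       # talk, template, category, ...
        elif page.find("redirect") is not None:
            kind = "redirect"    # text is only a pointer to another page
        else:
            kind = "article"     # text holds the real content
        result.append((title, kind))
    return result


for title, kind in classify_pages(SAMPLE):
    print(kind, title)
```

`ElementTree` loads the whole document into memory, so this only suits small samples; the full dump needs the streaming approach described next.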

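## Wikipedia XML parser
Because the file is huge, we use a streaming (SAX) parser.
This keeps parsing fast and memory use bounded, at the cost of somewhat more complicated code.
The code includes a few small tricks to avoid common pitfalls.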
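
The main pitfall is that a SAX parser may deliver the text of one element in several `characters()` calls. The sketch below shows the buffering pattern on a tiny in-memory document; `TitleCollector` and `collect_titles` are hypothetical names used only for this illustration.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Minimal handler that collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None  # list of chunks while inside <title>, else None

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []

    def characters(self, content):
        # May be called several times for one element, so only buffer here.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self._buf = None


def collect_titles(xml_text):
    handler = TitleCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


sample = (
    "<mediawiki><page><title>A &amp; B</title><ns>0</ns></page>"
    "<page><title>C</title><ns>0</ns></page></mediawiki>"
)
print(collect_titles(sample))  # ['A & B', 'C']
```

Without the join in `endElement`, a handler that simply assigned `content` to a field would keep only the last chunk of `A & B`.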

In [12]:
import xml.sax
# https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
file_path=DATA_PATH + r"enwiki-20201201-pages-articles-multistream.xml\enwiki-20201201-pages-articles-multistream.xml"

class AbortException(Exception):
    pass

class StreamHandler(xml.sax.handler.ContentHandler):

    def __init__(self, on_page=None, skip_n_pages=0):
        super().__init__()
        self.cnt = 0
        self.on_page = on_page
        self.skip_n_pages = skip_n_pages

        # characters() may be called before any startElement(), so initialize every field
        self.name = None
        self.title = None
        self.redirect = None
        self.ns = -1
        self.text = None
    
    def startElement(self, name, attrs):
        
        if self.cnt < self.skip_n_pages:
            return
        
        self.name = name
        
            
        if name == "redirect":
            assert attrs.getLength() == 1
            for n in attrs.getNames():
                self.redirect = attrs.getValue(n)
        elif name == "title":
            self.title_ = []
        elif name == "ns":
            self.ns_ = []
        elif name == "text":
            self.text_ = []
    


    def endElement(self, name):
        
        if self.cnt < self.skip_n_pages:
            if name == "page":
                if self.cnt % 10000 == 0:
                    print("\rpages: ", self.cnt, end="", flush=True)
                self.cnt += 1
            return
            
        
        if name == "title":
            self.title = "".join(self.title_)
        elif name == "ns":
            self.ns = int("".join(self.ns_))
        elif name == "text":
            self.text  = "".join(self.text_)
        elif name == "page":
            
            if self.on_page:
                self.on_page(self)                
                
            self.title = None
            self.redirect = None
            self.ns = -1
            self.text = None
            
            if self.cnt % 10000 == 0:
                print("\rpages: ", self.cnt, end="", flush=True)
                
            self.cnt += 1
            
        self.name = None

    def characters(self, content):
        if self.cnt < self.skip_n_pages:
            return
        
        # characters() may be invoked once per line or once per few hundred characters,
        # so we buffer the chunks and concatenate them ourselves in `endElement`
        if self.name == "title":
            self.title_.append(content)
        elif self.name == "ns":
            self.ns_.append(content)
        elif self.name == "text":
            self.text_.append(content)
            
            
def parse_wiki(on_page, skip_n_pages = 0):
    parser = xml.sax.make_parser()
    handler = StreamHandler(on_page, skip_n_pages)
    parser.setContentHandler(handler)
        
    with open(file_path, "r", encoding="utf-8") as f:
        try:
            parser.parse(f)
        except AbortException:
            pass
    print("\rpages: ", handler.cnt, end="", flush=True)
    print()


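## Pass 1: Collecting titles and redirects
* Collect the titles of all real (non-redirect) pages
* Record the mapping from each redirect page to its target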

In [3]:
# titles of real pages only (redirect pages are excluded)
page_titles = []
# set of real page titles seen so far, used to detect duplicates
page_titles_set = set()
# redirect page title -> target page title
redirect_map = {}

def get_page_titles():
    
    def on_page(self):
        
        # considering only articles
        if self.ns != 0:
            return
         
        # no title ?
        if not self.title:
            print("\nno title")
            print("title: ", self.title)
            print("len(page_titles)", len(page_titles))
            return
            
        if self.redirect is not None:
            redirect_map[self.title] = self.redirect            
            return
                
        # repeat title?        
        if self.title in page_titles_set:
            print("\nrepeat title")
            print("title: ", self.title)                    
            print("len(page_titles)", len(page_titles))
        else:
            page_titles_set.add(self.title)
            
        page_titles.append(self.title)
        
        if debugging:
            if self.cnt >10000:
                raise AbortException()
    
    parse_wiki(on_page)
    
    with open(DATA_PATH + "pass1.data", "wb") as f:
        pickle.dump([page_titles, redirect_map], f)

%timeit -r1 -n1 get_page_titles()

pages:  19340000
repeat title
title:  Tirupattur
len(page_titles) 5878898
pages:  20776844
38min 22s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Preprocessing and statistics

In [4]:
with open(DATA_PATH + "pass1.data", "rb") as f:
    page_titles, redirect_map = pickle.load(f)

def count():
    global page_titles
    
    print("real pages: ", len(page_titles))    
    max_len = 0
    max_title = None
    page_titles_ = []
    page_titles_set = set()
    for title in page_titles:
        
        stitle = title.strip()
        if stitle != title:
            print("strip:", stitle, title)
        
        if title in page_titles_set:
            print("repeat:", title)
        else:
            page_titles_set.add(title)
            page_titles_.append(title)
            
        if len(title) > max_len:
            max_len = len(title)
            max_title = title
        
    page_titles = page_titles_
    
    print("maxlen", max_len)
    print(max_title)
    
    max_len = 0
    for title in redirect_map:
        if len(title) > max_len:
            max_len = len(title)
            max_title = title
    print("maxlen", max_len)
    print(max_title)
    print("unique title real pages: ", len(page_titles))        

    with open(DATA_PATH + "pass1.5.data", "wb") as f:
        pickle.dump([page_titles, redirect_map], f)
        
count()

real pages:  6200809
repeat: Tirupattur
maxlen 250
Cneoridium dumosum (Nuttall) Hooker F. Collected March 26, 1960, at an Elevation of about 1450 Meters on Cerro Quemazón, 15 Miles South of Bahía de Los Angeles, Baja California, México, Apparently for a Southeastward Range Extension of Some 140 Miles
maxlen 251
Protocol Amending the Agreements, Conventions and Protocols on Narcotic Drugs concluded at The Hague on 23 January 1912, at Geneva on 11 February 1925 and 19 February 1925, and 13 July 1931, at Bangkok on 27 November 1931 and at Geneva on 26 June 1936
unique title real pages:  6200808


## Indexing

In [6]:
import pickle
import numpy as np

with open(DATA_PATH + "pass1.5.data", "rb") as f:
    page_titles, redirect_map = pickle.load(f)

page_title_indices = {}
for i, page_title in enumerate(page_titles):
    page_title_indices[page_title] = i
    
    
def get_index_from_title(title):
    
    iters = 0
    while True:
        if title in page_title_indices:
            return page_title_indices[title]
        
        # don't use str.capitalize(): it would also lowercase the rest of the title
        ctitle = title[:1].upper() + title[1:]
        if ctitle != title:
            if ctitle in page_title_indices:
                return page_title_indices[ctitle]
            
        if title in redirect_map:
            title_ = redirect_map[title]
            # this is not full loop detection, but it works in practice
            if title_ == title or title_ == ctitle:
                break
            title = title_
        elif ctitle in redirect_map:
            title_ = redirect_map[ctitle]
            if title_ == title or title_ == ctitle:
                break
            title = title_
        else:
            break
            
        iters += 1
        if iters >= 5:
            break

    # unresolved: unknown title, redirect loop, or redirect chain too long
    return None

def get_title_from_index(index):
    return page_titles[index]

In [7]:
print(get_index_from_title("Anarchism"))
print(get_title_from_index(get_index_from_title("Anarchism")))

0
Anarchism
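
As a variation on the lookup above, the sketch below tracks every title it has already visited, so it also catches redirect loops longer than one step. `resolve_title`, its parameters and the toy dictionaries are hypothetical and are not what the notebook uses.

```python
def resolve_title(title, index_of, redirects, max_hops=5):
    """Follow redirects until a real page is found; return its index or None."""
    seen = set()
    for _ in range(max_hops + 1):
        # Try the title as written, then with its first letter upper-cased.
        candidates = [title, title[:1].upper() + title[1:]]
        for cand in candidates:
            if cand in index_of:
                return index_of[cand]
        target = next((redirects[c] for c in candidates if c in redirects), None)
        if target is None or target in seen:
            return None  # unknown title or redirect loop
        seen.update(candidates)
        title = target
    return None  # redirect chain longer than max_hops


index_of = {"Anarchism": 0, "Computer accessibility": 1}
redirects = {
    "AccessibleComputing": "Computer accessibility",
    "X": "Y",
    "Y": "X",  # deliberate loop
}
print(resolve_title("anarchism", index_of, redirects))            # 0
print(resolve_title("AccessibleComputing", index_of, redirects))  # 1
print(resolve_title("X", index_of, redirects))                    # None
```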


## Pass 2: Links

### Extracting links from page text

In [8]:
# https://en.wikipedia.org/wiki/Wikipedia:Page_name
# a title does not contain any of #<>[\]|{}_
# a title has at most 255 characters (in fact 255 bytes)
# however we allow a larger limit (265) to leave room for a namespace prefix
# we assume at most 100 characters each for the section tag and for the link text

import re
title_pattern = re.compile(r'''\[\[:?([^#<>[\]|{}_]{1,265})(?:#[^[\]|]{0,100})?(?:\|[^[\]]{0,100})?\]\]''')
#title_pattern = re.compile(r'''\[\[:?([^#<>[\]|{}_]+)\]\]''')
print(title_pattern.findall("[[t1]]"))
print(title_pattern.findall("[[t2#tag]]"))
print(title_pattern.findall("[[t3|text]]"))
print(title_pattern.findall("[[t3#tag|text]]"))
print(title_pattern.findall("[[t5#tag|text ]] [[t6|text]]"))
print(title_pattern.findall("[[:t7#tag|text]]"))
print(title_pattern.findall("[[[[t8]]"))
print(title_pattern.findall("[[t_ [[t9]]"))
print(title_pattern.findall("[[t1|aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa]]"))
print(title_pattern.findall("[[t2|aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa]]"))
print(title_pattern.findall("[[t3 (a)]]"))

['t1']
['t2']
['t3']
['t3']
['t5', 't6']
['t7']
['t8']
['t9']
['t1']
[]
['t3 (a)']


In [9]:
# links of all pages, flattened into a single one-dimensional list
links=[]
# number of links of each page
page_n_links = []

# for debug
paring_page_title = None
paring_page_text = None
page_titles_set = set()

def get_page_links():
    
    global page_n_links
    global links
    def on_page(self):
                        
        global paring_page_title
        global paring_page_text
        
        # considering only articles
        if self.ns != 0:
            return
         
        # no title ?
        if not self.title:
            print("\nno title")
            print("title: ", self.title)
            print("len(page_titles)", len(page_titles))
            return
            
        if self.redirect is not None:
            return
                
        # repeat title?        
        if self.title in page_titles_set:
            print("\nrepeat title")
            print("title: ", self.title)                    
            print("len(page_titles)", len(page_titles))
            return
        else:
            page_titles_set.add(self.title)
            
        
        paring_page_title = self.title
        paring_page_text = self.text
               
        # check that this page lands at the same index as in pass 1
        expected_title = page_titles[len(page_n_links)]
        assert self.title == expected_title, f"{self.title} != {expected_title}"
            
        page_n_links.append(0)
        title_indices_set = set()
        
        for title in title_pattern.findall(self.text):
            index = get_index_from_title(title)
            if index is not None:
                if index not in title_indices_set:
                    title_indices_set.add(index)
                    links.append(index)
                    page_n_links[-1] += 1
                    
        if debugging:
            if self.cnt >10000:
                raise AbortException()
                    
    parse_wiki(on_page)
        
    # np.int was removed in NumPy 1.24; use an explicit 64-bit integer type
    links = np.array(links, dtype=np.int64)
    page_n_links = np.array(page_n_links, dtype=np.int64)
    np.savez_compressed(DATA_PATH + "pass2.npz", links=links, page_n_links=page_n_links)

%timeit -r1 -n1 get_page_links()

pages:  19340000
repeat title
title:  Tirupattur
len(page_titles) 6200808
pages:  20776844
48min 7s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
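
Since all links are stored in one flat array together with a per-page count, the links of a single page can be recovered from cumulative offsets. The sketch below shows one way to do that. `build_offsets` and `links_of` are hypothetical helpers, and the source does not show whether `wiki.WikiLinks` works this way.

```python
import numpy as np


def build_offsets(page_n_links):
    """offsets[i]:offsets[i + 1] is the slice of `links` belonging to page i."""
    offsets = np.zeros(len(page_n_links) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(page_n_links)
    return offsets


def links_of(page_index, links, offsets):
    """Return the target page indices linked from page `page_index`."""
    return links[offsets[page_index]:offsets[page_index + 1]]


# Toy data: page 0 links to pages 1 and 2, page 1 to page 0, page 2 to pages 0 and 1.
toy_links = np.array([1, 2, 0, 0, 1], dtype=np.int64)
toy_page_n_links = np.array([2, 1, 2], dtype=np.int64)
toy_offsets = build_offsets(toy_page_n_links)

print(toy_offsets)                          # [0 2 3 5]
print(links_of(2, toy_links, toy_offsets))  # [0 1]
```

Computing the offsets once costs one extra array of `n_pages + 1` integers and makes each lookup a constant-time slice.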


# Data loading classes

In [7]:
import wiki
import importlib
importlib.reload(wiki)
wiki_titles = wiki.WikiTitles(file = DATA_PATH + "pass1.5.data")
print(wiki_titles.get_index_from_title("Anarchism"))
print(wiki_titles.get_title_from_index(wiki_titles.get_index_from_title("Anarchism")))

0
Anarchism


In [8]:
wiki_links = wiki.WikiLinks(file = DATA_PATH + "pass2.npz", wiki_titles=wiki_titles)
print(wiki_links.get_links_from_title("PageRank"))

['Logarithmic scale', 'Algorithm', 'Google Search', 'Web page', 'Search engine', 'Larry Page', 'Google Patents', 'Network theory', 'Weighting', 'Hyperlink', 'Set (abstract data type)', 'World Wide Web', 'Link building', 'Webgraph', 'CNN', 'Mayo Clinic', 'Recursion', 'Backlink', 'HITS algorithm', 'Jon Kleinberg', 'Teoma', 'Ask.com', 'CLEVER project', 'TrustRank', 'Google Hummingbird', 'Eigenvalues and eigenvectors', 'Scientometrics', 'Thomas L. Saaty', 'Analytic hierarchy process', 'Cognitive model', 'Baidu', 'Robin Li', 'The New York Times', 'Forbes', 'Sergey Brin', 'Stanford University', 'Héctor García-Molina', 'Rajeev Motwani', 'Terry Winograd', 'Google', 'Software patent', 'Citation analysis', 'Eugene Garfield', 'Hyper Search', 'Massimo Marchiori', 'University of Padua', 'Probability distribution', 'Matt Cutts', 'Markov chain', 'URL', 'Adjacency matrix', 'Stochastic matrix', 'Eigenvector centrality', 'Eigengap', 'Expected value', 'Wikipedia', 'Link farm', 'Trade secret', 'Power iter

# Database

It is best to keep the files on an SSD; this can be about 10 times faster.

In [9]:
import sqlite3
import os
import wiki

def to_db():
    
    wiki_titles = wiki.WikiTitles(file = DATA_PATH + "pass1.5.data")
    if os.path.exists(DATA_PATH +  r"wiki_titles.db"):
        os.remove(DATA_PATH + r"wiki_titles.db")

    try:
        conn = sqlite3.connect(DATA_PATH + "wiki_titles.db")
        c = conn.cursor()
        
        c.execute("create table page_title_indices (title text unique primary key, idx integer)")
        for i, (title, index) in enumerate(wiki_titles.page_title_indices.items()):        
            c.execute("insert into page_title_indices values (?, ?)", (title, index))
            if i % 10000 == 0:
                conn.commit()
            if i % 100000 == 0:
                print("\r1:%d    "%i, end="")
        conn.commit()
        print("\r1:%d    "%i, end="\n")
        
        c.execute("create table page_titles (idx integer unique primary key, title text)")
        for i, title in enumerate(wiki_titles.page_titles):        
            c.execute("insert into page_titles values (?, ?)", (i, title))
            if i % 10000 == 0:
                conn.commit()
            if i % 100000 == 0:
                print("\r2:%d    "%i, end="")
        conn.commit()
        print("\r2:%d    "%i, end="\n")

        c.execute("create table redirect_map (key text unique primary key, value text)")
        for i, (key, value) in enumerate(wiki_titles.redirect_map.items()):
            c.execute("insert into redirect_map values (?, ?)", (key,value))
            if i % 10000 == 0:
                conn.commit()
            if i % 1000 == 0:
                print("\r3:%d    "%i, end="")
        conn.commit()
        print("\r3:%d    "%i, end="\n")
        
    finally:
        conn.close()
        
%timeit -r1 -n1 to_db()

3:9367000    16min 1s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
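
A possible alternative to the row-by-row inserts above is `executemany` inside a single transaction, which usually reduces per-statement overhead. This is a minimal sketch on an in-memory database, not a measured improvement for this dataset; `write_titles` is a hypothetical helper.

```python
import sqlite3


def write_titles(conn, page_titles):
    """Create and fill the page_titles table in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("create table page_titles (idx integer primary key, title text)")
        conn.executemany(
            "insert into page_titles values (?, ?)",
            enumerate(page_titles),  # yields (idx, title) pairs
        )


def demo():
    conn = sqlite3.connect(":memory:")
    try:
        write_titles(conn, ["Anarchism", "Autism", "Albedo"])
        return conn.execute("select title from page_titles where idx = 1").fetchone()
    finally:
        conn.close()


print(demo())  # ('Autism',)
```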


# Appendix

In [13]:
with open(file_path, "r", encoding="utf-8") as f:
    print(f.read(50000))

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.36.0-wmf.18</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <n