# Calculating Semantic Relatedness using Wikipedia

* **Armin Sajadi** - Faculty of Computer Science
* **Dr. Evangelos Milios** - Faculty of Computer Science
* **Dr. Vlado Kešelj** - Faculty of Computer Science

This is a simple, step-by-step explanation of how to calculate semantic relatedness using Wikipedia. We start with preprocessing and building the API, which is explained in the following paper:

* Armin Sajadi, Evangelos E. Milios, Vlado Keselj, "Vector Space Representation of Wikipedia Concepts", Submitted to NLDB 2017


### Public Resources
* Web service: (http://web.cs.dal.ca/~sajadi/wikisim/)
* Source Code: (https://github.com/asajadi/wikisim)




# Read Here First
### If you want to use our prepared datasets [Recommended]
[Using Prepared Tables](#Using-Prepared-Tables)

### If you want to experiment with another Wikipedia database dump
[Start From Preprocessing](#Preprocessing)

# Table of Contents

**[Preparing The New Wiki Dump](#Preparing-The-New-Wiki-Dump)**

**[Wikipedia Interface](#Wikipedia-Interface)**

**[Fast Pagerank Implementation](#Fast-Pagerank-Implementation)**

**[Calculating Semantic Relatedness](#Calculating-Semantic-Relatedness)**

**[A Simple Example](#A-Simple-Example)**

**[Calculating All The Embeddings](#Calculating-All-The-Embeddngs)**

**[Visualizing The Embeddings](#Visualizing-The-Embeddings)**



# Preparing The New Wiki Dump

## Preprocessing

**Note: You can skip the step-by-step processing section completely by running `bash preprocess`.**

**The steps below are intended to guide you in case one of the steps goes wrong!**


The first step is to download the Wikipedia database dumps and import them into MySQL. We preprocess the SQL dumps for four main reasons:

* The tables are huge and contain many columns and rows we do not use. Removing the unnecessary information, which includes unused columns (such as timestamps and view counts of pages or categories) and all information about talk pages, media files, and user draft pages, can dramatically decrease the size of the tables.

* Forming **synonym rings**. We extend the concept of a synonym ring to Wikipedia (similar to what is called a synset in WordNet). In Wikipedia, a redirect stands for equivalence, for example Car --> Automobile. But it is not always this easy, and you can find all sorts of weird redirects, like:

![](../resrc/sr.jpg)

   We iterate through the redirects and remove cycles, dangling redirects, and all chains. This process forms clusters of redirects around main pages. Then we go through all other tables (pagelinks and categorylinks) and replace any redirected page with its main article. The result is much neater and makes the rest of the process faster.


* We remove garbage: links to non-existing pages, self links, mismatching namespaces, and many other inconsistencies (you can find the details in the source code).

* We apply some strategic changes. For example, instead of the source id --> destination title format of pagelinks, we use source id --> destination id, which is faster and preferable for our case.
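
The redirect cleanup and link rewriting described above can be sketched in a few lines of Python. This is only an illustration on toy data, not the Java parser itself: the function names `resolve_redirects` and `normalize_links`, and the in-memory dictionaries and sets, are assumptions made for this example.

```python
def resolve_redirects(redirects, pages):
    """Map every redirect page to its main article.

    redirects: dict {source_id: target_id} (hypothetical toy input)
    pages:     set of ids of existing pages
    Chains are collapsed; cycles and dangling redirects are dropped.
    """
    resolved = {}
    for src in redirects:
        seen = {src}
        cur = redirects[src]
        # Follow the chain until we reach a page that is not a redirect,
        # or until we revisit a page (which means there is a cycle).
        while cur in redirects and cur not in seen:
            seen.add(cur)
            cur = redirects[cur]
        if cur in seen:        # cycle (including a page redirecting to itself)
            continue
        if cur not in pages:   # dangling redirect: the target does not exist
            continue
        resolved[src] = cur
    return resolved


def normalize_links(links, resolved, pages):
    """Rewrite (source_id, dest_id) links so they point at main articles."""
    clean = set()
    for src, dst in links:
        src = resolved.get(src, src)
        dst = resolved.get(dst, dst)
        if src == dst:                            # self link
            continue
        if src not in pages or dst not in pages:  # link to a non-existing page
            continue
        clean.add((src, dst))
    return clean


pages = {1, 2, 3, 4, 5, 6, 7}
redirects = {2: 1, 3: 2, 4: 5, 5: 4, 6: 99}   # chain, cycle, dangling
resolved = resolve_redirects(redirects, pages)
links = normalize_links([(7, 3), (7, 2), (1, 2), (7, 42), (7, 1)], resolved, pages)
```

In this toy input, pages 2 and 3 both end up in the synonym ring of page 1, while the 4/5 cycle and the dangling redirect from 6 are discarded.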

To complete this step, download and run the parser (written in Java) that prunes these files. You can run the following cells, but due to a known bug in IPython, you cannot see the bash progress messages until the job is finished. A better option is to simply run the scripts from bash and skip the remainder of this section. In this case, running each cell creates a script in the [preparation_scripts/] directory with the name given as the argument of `writefile` at the beginning of the cell. If you want to run the cell directly, **comment out the first line and uncomment the second line**.

## Downloading

Download the following files and decompress them (we assume the path to be ~/Downloads/wikidumps):

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pagelinks.sql.gz 

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-redirect.sql.gz

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-category.sql.gz

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz

### Or
**Use the following prepared scripts, which download the Wikipedia dumps to the default `~/Downloads/wikidumps` directory and decompress them:**


`bash download.sh`

`bash decompress.sh`
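
The contents of `download.sh` and `decompress.sh` are not reproduced here. As a rough equivalent of the manual steps, the following Python 3 sketch (standard library only) builds the five dump URLs listed above, downloads them, and decompresses a `.gz` file. The helper names `dump_urls`, `download_all` and `decompress` are illustrative and not part of the project.

```python
import gzip
import os
import shutil
import urllib.request

BASE_URL = "https://dumps.wikimedia.org/enwiki/latest/"
TABLES = ["page", "pagelinks", "redirect", "category", "categorylinks"]


def dump_urls(tables=TABLES):
    """Return the URLs of the compressed SQL dumps listed in this section."""
    return [BASE_URL + "enwiki-latest-%s.sql.gz" % t for t in tables]


def download_all(target_dir=os.path.expanduser("~/Downloads/wikidumps")):
    """Download every dump into target_dir (the files are large)."""
    os.makedirs(target_dir, exist_ok=True)
    paths = []
    for url in dump_urls():
        path = os.path.join(target_dir, url.rsplit("/", 1)[1])
        urllib.request.urlretrieve(url, path)
        paths.append(path)
    return paths


def decompress(gz_path):
    """Decompress foo.sql.gz to foo.sql in the same directory."""
    out_path = gz_path[:-3]  # strip the ".gz" suffix
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Calling `download_all()` followed by `decompress(path)` on each returned path should leave the `.sql` files in `~/Downloads/wikidumps`, which is the location the later steps assume.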


## Parsing Database dumps

The following Java program preprocesses (parses) the Wikipedia dumps and creates the processed tables (ending in `main.sql`) and several log files of the errors.

*Note*: you might need to recompile it (`javac ProcessSQLDumps.java`).

Run it with:

`java ProcessSQLDumps ~/Downloads/wikidumps`

### Preparing mysql
Running the following cell sets some variables in MySQL for maximum performance (if you have enough physical memory). Replace \$1 and \$2 with the actual user name and password, or run the script as:

`bash setupmysql.sh <user> <pass>`


## Actual Importing

```mysql -u <user> -p<pass> -e 'CREATE SCHEMA `enwikilast` DEFAULT CHARACTER SET binary;'```

`./importall  ~/Downloads/wikidumps last <user> <pass>`

This might take several hours 




# Using Prepared Tables
## Download
Download the following files and decompress them to a dir (we assume the path to be ~/Downloads/wikidumps)

cgm6.research.cs.dal.ca/~sajadi/wikisim/downloads/enwiki-20160305-page.main.tsv.gz

cgm6.research.cs.dal.ca/~sajadi/wikisim/downloads/enwiki-20160305-redirect.main.tsv.gz

cgm6.research.cs.dal.ca/~sajadi/wikisim/downloads/enwiki-20160305-pagelinks.main.tsv.gz

cgm6.research.cs.dal.ca/~sajadi/wikisim/downloads/enwiki-20160305-category.main.tsv.gz

cgm6.research.cs.dal.ca/~sajadi/wikisim/downloads/enwiki-20160305-categorylinks.main.tsv.gz

cgm6.research.cs.dal.ca/~sajadi/wikisim/downloads/enwiki-20160305-pagelinksorderedin.main.tsv.gz

cgm6.research.cs.dal.ca/~sajadi/wikisim/downloads/enwiki-20160305-pagelinksorderedout.main.tsv.gz

### Preparing mysql
Running the following cell sets some variables in MySQL for maximum performance (if you have enough physical memory). Replace \$1 and \$2 with the actual user name and password, or run the script as:

`bash setupmysql.sh <user> <pass>`


## Actual Importing
Run:

```mysql -u <user> -p<pass> -e 'CREATE SCHEMA `enwiki20160305` DEFAULT CHARACTER SET binary;'```

`./importall  ~/Downloads/wikidumps enwiki20160305 <user> <pass>`

This might take several hours 



# Wikipedia Interface
This is the main interface to the Wikipedia database. Given a page, it provides basic functions for retrieving its:

* id or title
* synonym ring
* linkage
* in or out neighborhood. 

**You might need to modify the user, password, and port number.**
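
To make the synonym ring lookup concrete without a running MySQL server, here is a minimal sketch that uses an in-memory SQLite database. The table and column names (`page`, `redirect`, `page_id`, `page_title`, `rd_from`, `rd_to`) follow the queries used in the interface; the four toy rows and the helper name `synonym_ring` are assumptions made for this example only.

```python
import sqlite3

# Toy stand-in for the preprocessed tables (hypothetical data).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
cur.execute("CREATE TABLE redirect (rd_from INTEGER, rd_to INTEGER)")
cur.executemany("INSERT INTO page VALUES (?, ?)",
                [(1, "United_States"), (2, "USA"), (3, "US"), (4, "Canada")])
cur.executemany("INSERT INTO redirect VALUES (?, ?)", [(2, 1), (3, 1)])


def synonym_ring(cur, wid):
    """Return the sorted titles of the main page and all pages redirecting to it."""
    # If wid is itself a redirect, switch to its main article first.
    row = cur.execute("SELECT rd_to FROM redirect WHERE rd_from = ?",
                      (wid,)).fetchone()
    if row is not None:
        wid = row[0]
    rows = cur.execute(
        """SELECT page_title FROM page WHERE page_id = ?
           UNION
           SELECT page_title FROM redirect INNER JOIN page
               ON redirect.rd_from = page.page_id
           WHERE redirect.rd_to = ?""", (wid, wid)).fetchall()
    return sorted(r[0] for r in rows)


ring = synonym_ring(cur, 2)   # start from the redirect page "USA"
```

Starting from a redirect or from the main page gives the same ring, because the id is resolved to the main article before the union query runs.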


In [5]:
%%writefile wikipedia.py 
"""A General Class to interact with Wiki datasets"""
# uncomment

import sys;
import os
import scipy as sp
import pandas as pd
import cPickle as pickle
import MySQLdb

from utils import * # uncomment

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi", "Evangelos Milios", "Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"


DISABLE_CACHE=False;
MAX_GRAPH_SIZE=1000000

DIR_IN=0;
DIR_OUT=1;
DIR_BOTH=2;
_db = MySQLdb.connect(host="127.0.0.1",port=3307,user='amaral',passwd="123456",db="enwiki20160305")
_cursor = _db.cursor()
#WIKI_SIZE = 10216236;
#WIKI_SIZE = 13670498; #2016
WIKI_SIZE = 5576365; #no redirect, 2016
def close():
    global _db, _cursor;
    if _cursor is not None: 
        _cursor.close();
        _db.close();
    _cursor=_db=None;
def reopen():
    global _db, _cursor;
    if _db is None:
        _db = MySQLdb.connect(host="127.0.0.1",port=3307,user='amaral',passwd="123456",db="enwiki20160305")
        _cursor = _db.cursor()
        
def load_table(tbname, limit=-1):
    """ Returns a list, containing a whole table     
    
    Args: 
        tbname: Table Name
    Returns: 
        The list of rows
    """
    if limit!=-1:
        q = """SELECT * FROM `%s` limit %s""" % (tbname, limit)
    else:
        q = """SELECT * FROM `%s`""" % (tbname,)
        
    _cursor.execute(q)
    rows = _cursor.fetchall();
    return rows
    
    
def id2title(wid):
    """ Returns the title for a given id

    Args: 
        wid: Wikipedia id       
    Returns: 
        The title of the page
    """
    title=None;

    _cursor.execute("""SELECT * FROM `page` where page_id = %s""", (wid,))
    row= _cursor.fetchone();
    if row is not None:
        title=row[2];          
    return title;

def ids2title(wids):
    """ Returns the titles for given list of wikipedia ids 

    Args: 
        wids: A list of Wikipedia ids          
    Returns: 
        The list of titles
    """

    wid_list = [str(wid) for wid in wids] ;
    order = ','.join(['page_id'] + wid_list) ;
    wid_str = ",".join(wid_list)
    query = "SELECT page_id, page_title FROM `page` where page_id in ({0})" \
    .format(wid_str, order);
    _cursor.execute(query);
    rows = _cursor.fetchall();
    rows_dict = dict(rows)
    titles = [rows_dict[wid] for wid in wids]
    return titles;

def encode_for_db(instr):
    if isinstance(instr, unicode):
        instr = instr.encode('utf-8')  
    return instr
        
def normalize_str(title):
    
    title = encode_for_db(title)
    title = title.replace(' ','_')
    return title
def title2id(title):
    """ Returns the id for a given title

    Args: 
        title: title of the page          
    Returns: 
        The Wikipedia id of the page (redirects are resolved to their target)
    """        
    wid=None;
    title = normalize_str(title)
    _cursor.execute("""SELECT * FROM `page` where page_title=%s and page_namespace=0""", (title,))
    row= _cursor.fetchone();
    if row is not None:
        wid = getredir_id(row[0]) if row[3] else row[0];
    return wid;

def is_ambiguous(wid):
    _cursor.execute("""SELECT * FROM `categorylinks` WHERE `categorylinks`.cl_from=%s and `categorylinks`.cl_to=19204864;""", (wid,))
    row= _cursor.fetchone();
    return not (row is None)    

def getredir_id(wid):
    """ Returns the target of a redirected page 

    Args:
        wid: wikipedia id of the page
    Returns:
        The id of the target page
    """
    rid=None

    _cursor.execute("""select * from redirect where rd_from=%s;""", (wid,));
    row= _cursor.fetchone();
    if row is not None:
        rid=row[1]
    return rid 

def resolveredir(wid):
    tid = getredir_id(wid);
    if tid is not None:
        wid = tid;    
    return wid

def getredir_title(wid):
    """ Returns the target title of a redirected page 

    Args:
        wid: wikipedia id of the page
    Returns:
        The title of the target page
    """
    
    title=None;
    _cursor.execute(""" select page_title from redirect INNER JOIN page
                  on redirect.rd_to = page.page_id 
                  where redirect.rd_from =%s;""", (wid,));
    row=_cursor.fetchone()
    if row is not  None:
        title=row[0];
    return title;

def synonymring_titles(wid):
    """ Returns the synonym ring of a page

    Example: synonymring_titles('USA')={('U.S.A', 'US', 'United_States_of_America', ...)}

    Args:
        wid: the wikipedia id
    Returns:
        all the titles in its synonym ring
    """
    wid = resolveredir(wid)
    _cursor.execute("""(select page_title from page where page_id=%s) union 
                 (select page_title from redirect INNER JOIN page
                    on redirect.rd_from = page.page_id 
                    where redirect.rd_to =%s);""", (wid,wid));
    rows=_cursor.fetchall();
    if rows:
        rows = tuple(r[0] for r in rows)
    return rows;


def anchor2concept(anchor):
    """ Returns the targets of an anchor text

    Args:
        anchor: anchor text
        
    Returns:
        The rows (page id, frequency) of the pages linked from this anchor text
    """
  
    anchor = encode_for_db(anchor)
        
    _cursor.execute("""select anchors.id, anchors.freq from anchors inner join page on anchors.id=page.page_id where anchors.anchor=%s;""", (anchor,))
    rows =_cursor.fetchall()
#     if rows:
#         rows = tuple(r[0] for r in rows)
    return rows


def id2anchor(wid):
    """ Returns the anchor texts that link to a page

    Args:
        wid: wikipedia id of the page
        
    Returns:
        The rows (anchor text, frequency) of the anchors pointing to the page
    """
    _cursor.execute("""select anchor , freq from anchors where id=%s""", (wid,))
    rows =_cursor.fetchall()
#     if rows:
#         rows = tuple(r[0] for r in rows)
    return rows


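def _getlinkedpages_query(id, direction):
    query="(SELECT {0} as lid FROM pagelinks where ({1} = {2}))"
    if direction == DIR_IN:
        query=query.format("pl_from","pl_to",id);
    elif direction == DIR_OUT:
        query=query.format("pl_to","pl_from",id);
    elif direction == DIR_BOTH:
        # both directions: union of the in-link and the out-link queries
        query=query.format("pl_from","pl_to",id) + " union " + query.format("pl_to","pl_from",id);
    return query;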

def getlinkedpages(wid,direction):
    """ Returns the linkage for a node

    Args:
        id: the wikipedia id
        direction: 0 for in, 1 for out, 2 for all
    Returns:
        The list of the ids of the linked pages
    """
    _cursor.execute(_getlinkedpages_query(wid, direction));
    rows =_cursor.fetchall()
    if rows:
        rows = tuple(r[0] for r in rows)
    return rows

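def e2i(*iters):
    # Builds an index over the ids of all the given iterables:
    # elist maps index -> id, edict maps id -> index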
    elist=[];
    edict=dict();
    last=0;    
    for wid in itertools.chain(*iters):
        if wid not in edict:
            edict[wid]=last;
            elist.append(wid);
            last +=1; 
    return elist, edict;

def getneighbors(wid, direction):
    """ Returns the neighborhood for a node

    Args:
        id: the wikipedia id
        direction: 0 for in, 1 for out, 2 for all
    Returns:
        The vector of ids, and the 2d array sparse representation of the graph, in the form of
        array([[row1,col1],[row2, col2]]). This form is flexible for general use or be converted to scipy.sparse 
        formats
    """
    log('[getneighbors started]\twid = %s, direction = %s', wid, direction)
    
    idsquery = """(select  {0} as lid) union {1}""".format(wid,_getlinkedpages_query(wid,direction));

    _cursor.execute(idsquery);


    rows = _cursor.fetchall();
    if len(rows)<2:
        log('[getneighbors]\tERROR: empty')
        return (), sp.array([])
    
    
    neighids = tuple(r[0] for r in rows);
    if len(neighids)>MAX_GRAPH_SIZE:
        log('[getneighbors]\tERROR: too big, %s neighbors', len(neighids))
        return (), sp.array([])

    
    id2row = dict(zip(neighids, range(len(neighids))))

    neighbquery=  """select lid,pl_to as n_l_to from
                     ({0}) a  inner join
                     pagelinks on lid=pl_from""".format(idsquery);

    links=_cursor.execute(neighbquery);

    links = _cursor.fetchall();
    
    #links = tuple((id2row(u), id2row(v)) for u, v in links if (u in id2row) and (v in id2row));
    links = sp.array([[id2row[u], id2row[v]] for u, v in links if (u in id2row) and (v in id2row)]);
    
    log('Graph extracted, %s nodes and %s links', len(neighids), len(links) )
    log('[getneighbors]\tfinished')
    return (neighids,links)

def deletefromcache(wid, direction):
    wid = resolveredir(wid)
    if direction in [DIR_IN, DIR_BOTH] : 
        query =    """delete from {0} where cache_id={1}""".format('pagelinksorderedin', wid) 
        _cursor.execute(query);
    if direction in [DIR_OUT, DIR_BOTH]: 
        query =    """delete from {0} where cache_id={1}""".format('pagelinksorderedout', wid) 
        _cursor.execute(query);
    
def clearcache():
    if DISABLE_CACHE:
        return;
    _cursor.execute("delete  from pagelinksorderedin");
    _cursor.execute("delete  from pagelinksorderedout");

def checkcache(wid, direction):
    log('[checkcache started]\twid = %s, direction = %s', wid, direction)
    if DISABLE_CACHE:
        log('[checkcache]\tDisabled')
        return None
    

    
    em=None
    
    if direction == DIR_IN: 
        tablename = 'pagelinksorderedin';
        colname = 'in_neighb'
    elif direction == DIR_OUT: 
        tablename = 'pagelinksorderedout';
        colname = 'out_neighb';
    query =    """select {0} from {1} where cache_id={2}""".format(colname, tablename, wid)
    _cursor.execute(query);
    row = _cursor.fetchone();
    if row is not None:
        values, index = pickle.loads(row[0])
        log('[checkcache]\tfound')
        if not index:        
            log('[checkcache]\tempty embedding')
        em=pd.Series(values, index=index)
    else:
        log('[checkcache]\tnot found')

    log('[checkcache]\tfinished')
    return em


def cachescores(wid, em, direction):
    log('[cachescores started]\twid = %s, direction = %s', wid, direction)
    if DISABLE_CACHE:
        log('[cachescores]\tDisabled')
        return

    if direction == DIR_IN: 
        tablename = 'pagelinksorderedin';
        colname = 'in_neighb'

    elif direction == DIR_OUT: 
        tablename = 'pagelinksorderedout';
        colname = 'out_neighb';
        
    idscstr = pickle.dumps((em.values.tolist(), em.index.values.tolist()), pickle.HIGHEST_PROTOCOL)
    _cursor.execute("""insert into %s values (%s,'%s');""" %(tablename, wid, _db.escape_string(idscstr)));
    
    
    log('cachescores finished')


Overwriting wikipedia.py


# Utils
Some small helper functions for reporting purposes. 

In [None]:
%%writefile utils.py 
"""Utility functions"""
# uncomment

import os
import re
import itertools
import scipy as sp
import pandas as pd
import datetime

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi", "Evangelos Milios", "Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"

def readds(url, usecols=None):    
    data = pd.read_table(url, header=None, usecols=usecols)
    return data

DISABLE_LOG=True;

def clearlog(logfile):
    with open(logfile, 'w'):
        pass;

def logres(outfile, instr, *params):
    outstr = instr % params;
    with open(outfile, 'a') as f:
        f.write("[%s]\t%s\n" % (str(datetime.datetime.now()) , outstr));          
        
def log(instr, *params):
    if DISABLE_LOG:
        return
    logres(logfile, instr, *params)
    
if not DISABLE_LOG:    
    outdir = '../out'    
    logfile=os.path.join(outdir, 'log.txt');
    if not os.path.exists(logfile):
        if not os.path.exists(outdir):
            os.makedirs(outdir)
        log('log created') 
        os.chmod(logfile, 0777)    
    
    
def timeformat(sec):
    return datetime.timedelta(seconds=sec)

def str2delta(dstr):
    r=re.match(('((?P<d>\d+) day(s?), )?(?P<h>\d+):(?P<m>\d+):(?P<s>\d*\.\d+|\d+)'),dstr)
    d,h,m,s=r.group('d'),r.group('h'),r.group('m'),r.group('s')
    d=int(d) if d is not None else 0
    h,m,s = int(h), int(m), float(s)    
    return datetime.timedelta(days=d, hours=h, minutes=m, seconds=s)

# Fast Pagerank Implementation

Here we have the actual implementation of PageRank. Two implementations are provided, both inspired by the sparse fast solutions given in **Cleve Moler**'s book, [*Experiments with MATLAB*](http://www.mathworks.com/moler/index_ncm.html). The power method is much faster, with enough precision for our task. Our benchmarks show that this implementation is faster than the networkx implementation by orders of magnitude.

The input is a 2D array in which each row is an edge of the graph, e.g. [[a,b], [c,d]], where a and b are node numbers. 
(The `reverse` argument controls whether the adjacency matrix is transposed: with `reverse=False` the matrix is transposed and the regular PageRank is calculated; with `reverse=True` it is left as is, which gives the reversed PageRank.)
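
As a dependency-free illustration of the power method on an edge list, here is a minimal pure-Python sketch. It is not the sparse implementation from `pagerank.py`: the function name `pagerank_power`, the tolerance, and the handling of dangling nodes (their mass is spread uniformly) are choices made for this example.

```python
def pagerank_power(edges, p=0.85, reverse=False, tol=1e-9, max_iter=200):
    """PageRank by power iteration; edges is a list of [a, b] meaning a -> b."""
    if reverse:
        # Reversed PageRank: run the same iteration on the flipped edges.
        edges = [(b, a) for a, b in edges]
    n = max(max(a, b) for a, b in edges) + 1
    out_links = [[] for _ in range(n)]
    for a, b in edges:
        out_links[a].append(b)

    x = [1.0 / n] * n
    for _ in range(max_iter):
        # Mass that jumps to a random node: (1 - p) of everything, plus the
        # remaining share held by nodes without outgoing links.
        dangling = sum(x[i] for i in range(n) if not out_links[i])
        base = ((1.0 - p) + p * dangling) / n
        new_x = [base] * n
        for i in range(n):
            if out_links[i]:
                share = p * x[i] / len(out_links[i])
                for j in out_links[i]:
                    new_x[j] += share
        diff = sum(abs(u - v) for u, v in zip(new_x, x))
        x = new_x
        if diff < tol:
            break
    return x


scores = pagerank_power([[0, 2], [1, 2]])                 # node 2 is linked by 0 and 1
rscores = pagerank_power([[0, 2], [1, 2]], reverse=True)  # same graph, edges flipped
```

With the regular PageRank, node 2 collects the highest score because both other nodes link to it; with `reverse=True` the ranking flips, since node 2 is then the one linking out.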

In [None]:
%%writefile pagerank.py 
"""Two implementations of PageRank.

Python implementations of the MATLAB originals in Cleve Moler, Experiments with MATLAB.
"""
# uncomment

import scipy as sp
import scipy.sparse as sprs
import scipy.spatial
import scipy.sparse.linalg 

from utils import * # uncomment

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi", "Evangelos Milios", "Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"


def create_csr(Z):
    """ Creates a csr representation from the 2d array representation 
        of a graph
    Args:
        Z: input graph in the form of a 2d array, such as [[2,0], [1,2], [2,1]]
    Returns:
        The adjacency matrix of the graph as a scipy.sparse csr_matrix
    
    Each row of the array is an edge of the graph [[a,b], [c,d]], a and b are the node numbers. 

    """   
    rows = Z[:,0];
    cols = Z[:,1];
    n = max(max(rows), max(cols))+1;
    G=sprs.csr_matrix((sp.ones(rows.shape),(rows,cols)), shape=(n,n));
    return G

def pagerank_sparse(G, p=0.85, personalize=None, reverse=False):
    """ Calculates pagerank given a csr graph
    
    Args:
        G: a csr graph.
        p: damping factor
        personalize: if not None, should be an array with the size of the nodes
                    containing probability distributions. It will be normalized automatically
        reverse: If true, returns the reversed-pagerank 
        
    Returns:
        Pagerank Scores for the nodes
     
    """
    log('[pagerank_sparse]\tstarted')

    if not reverse:
        G=G.T;

    n,n=G.shape
    c=sp.asarray(G.sum(axis=0)).reshape(-1)
    r=sp.asarray(G.sum(axis=1)).reshape(-1)

    k=c.nonzero()[0]

    D=sprs.csr_matrix((1/c[k],(k,k)),shape=(n,n))

    if personalize is None:
        e=sp.ones((n,1))
    else:
        e = personalize/sum(personalize);
        
    I=sprs.eye(n)
    X1 = sprs.linalg.spsolve((I - p*G.dot(D)), e);

    X1=X1/sum(X1)
    log('[pagerank_sparse]\tfinished')
    return X1
def pagerank_sparse_power(G, p=0.85, max_iter = 100, personalize=None, reverse=False):
    """ Calculates pagerank given a csr graph
    
    Args:
        G: a csr graph.
        p: damping factor
        max_iter: maximum number of iterations
        personalize: if not None, should be an array with the size of the nodes
                    containing probability distributions. It will be normalized automatically
        reverse: If true, returns the reversed-pagerank 
        
    Returns:
        Pagerank Scores for the nodes
     
    """
    log('[pagerank_sparse_power]\tstarted')
    
    if not reverse: 
        G=G.T;

    n,n=G.shape
    c=sp.asarray(G.sum(axis=0)).reshape(-1)
    r=sp.asarray(G.sum(axis=1)).reshape(-1)

    k=c.nonzero()[0]

    D=sprs.csr_matrix((1/c[k],(k,k)),shape=(n,n))

    if personalize is None:
        e=sp.ones((n,1))
    else:
        e = personalize/sum(personalize);
        
    z = (((1-p)*(c!=0) + (c==0))/n)[sp.newaxis,:]
    G = p*G.dot(D)
    x = e/n
    oldx = sp.zeros((n,1));
    
    iteration = 0
    
    while sp.linalg.norm(x-oldx) > 0.001:
        oldx = x
        x = G.dot(x) + e.dot(z.dot(x))
        iteration += 1
        if iteration >= max_iter:
            break;
    x = x/sum(x)
    
    log('# of iterations: %s, normdiff: %s', iteration, sp.linalg.norm(x-oldx))
    log('[pagerank_sparse_power]\tfinished')
    return x.reshape(-1) 



In [None]:
%%writefile embedding.py

from wikipedia import * # uncomment
from pagerank import * # uncomment
import gensim

#from utils import * # uncomment

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi", "Evangelos Milios", "Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"

_word2vec_model = None
def getword2vec_model():
    """ returns the word2vec model
    """
    
    return _word2vec_model

def conceptrep(wid, method ='rvspagerank', direction=DIR_BOTH, get_titles=True, cutoff=None):
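    """ Calculates the representation (embedding) of a concept 
    Arg:
        wid: Wikipedia id of the concept 
        method:
            rvspagerank: rvs-pagerank embedding
            word2vec : word2vec representation
        direction: 0 for in, 1 for out, 2 for all (rvspagerank only)
        get_titles: include titles in the embedding (rvspagerank only)
        cutoff: keep only the top cutoff dimensions, None for all (rvspagerank only)
    Returns:
        The representation of the concept        
    """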
        
    if method =='rvspagerank':
        return conceptrep_rvs(wid, direction, get_titles, cutoff)
    if 'word2vec' in method:
        return getword2vector(wid)


def concept_embedding(wid, direction):
    """ Calculates concept embedding to be used in relatedness
    
    Args:
        wid: wikipedia id
        direction: 0 for in, 1 for out, 2 for all
        
    Returns:
        A pandas Series of scores, indexed by the ids of the pages in the neighborhood
        
    """
    log('[concept_embedding started]\twid = %s, direction = %s', wid, direction)

    if direction == DIR_IN or direction==DIR_OUT:
        em = _concept_embedding_io(wid, direction)
    if direction == DIR_BOTH:
        em = _concept_embedding_both(wid, direction)
    log('[concept_embedding]\tfinished')
    return em
    
def _concept_embedding_io(wid, direction):
    wid = resolveredir(wid)
    cached_em = checkcache(wid, direction);
    if cached_em is not None:
        return cached_em;

    (ids, links) = getneighbors(wid, direction);
    if ids:
        scores = pagerank_sparse_power(create_csr(links), reverse=True)
        em = pd.Series(scores, index=ids) 
    else:
        em = pd.Series([], index=[])  
    cachescores(wid, em, direction);
    return em
            

def _concept_embedding_both(wid, direction):            
        in_em = _concept_embedding_io(wid, DIR_IN);
        out_em = _concept_embedding_io(wid, DIR_OUT )
        if (in_em is None) or (out_em is None):
            return None;
        return in_em.add(out_em, fill_value=0)/2

def conceptrep_rvs(wid, direction, get_titles=True, cutoff=None):
    """ Finds a representation for a concept
    
        Concept Representation is a vector of concepts with their score
    Arg:
        wid: Wikipedia id
        direction: 0 for in, 1 for out, 2 for all
        get_titles: include titles in the embedding (not needed for mere calculations)
        cutoff: the first top cutoff dimensions (None for all)
        
    Returns:
        A pandas Series indexed by id, holding the scores (or (title, score) 
        pairs if get_titles is set).
    """
    
    log('[conceptrep started]\twid = %s, direction = %s', wid, direction)
    
    em=concept_embedding(wid, direction);    
    if em.empty:
        return em;
    
    
    #ids = em.keys();
    
    if cutoff is not None:
        em = em.sort_values(ascending=False)
        em = em[:cutoff]
    if get_titles:
        em = pd.Series(zip(ids2title(em.index), em.values.tolist()), index=em.index)
    log ('[conceptrep]\tfinished')
    return em

def gensim_loadmodel(model_path):
    """ Loads the word2vec model 
    Arg:
        model_path: path to the model
    """
    global _word2vec_model
    log('[getsim_word2vec]\tloading: %s', model_path)
    _word2vec_model = gensim.models.Word2Vec.load(model_path)                
    log('[getsim_word2vec]\tloaded')
    return _word2vec_model
    
def getword2vector(wid):
    wid_s=str(wid)
    wid_s = 'id_'+ wid_s
    if wid_s not in _word2vec_model.vocab:
        return  pd.Series(sp.zeros(_word2vec_model.vector_size))
    return pd.Series(_word2vec_model[wid_s])
    


## Calculating Semantic Relatedness
The idea is to get the neighborhood graph of each concept, embed the graph into a vector, and then calculate the similarity of two concepts as the cosine similarity of their vectors. 

The process can be illustrated like this:
![](../resrc/alg.jpg)
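
The final step, comparing two embeddings, can be sketched in plain Python. The example assumes each embedding is a sparse vector stored as a dictionary from page id to score; dimensions missing from one vector count as zero, which mirrors aligning the two vectors with a fill value of 0. The function name `cosine_similarity` and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(em1, em2):
    """Cosine similarity of two sparse embeddings given as {page_id: score}."""
    if not em1 or not em2:
        return 0.0
    # Align the two vectors on the union of their dimensions (missing = 0).
    keys = set(em1) | set(em2)
    dot = sum(em1.get(k, 0.0) * em2.get(k, 0.0) for k in keys)
    norm1 = math.sqrt(sum(v * v for v in em1.values()))
    norm2 = math.sqrt(sum(v * v for v in em2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Two toy embeddings that share only dimension 2.
sim = cosine_similarity({1: 1.0, 2: 1.0}, {2: 1.0, 3: 1.0})
```

Two concepts whose neighborhoods share no pages get a similarity of 0, and identical embeddings get a similarity of 1.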

In [5]:
%%writefile calcsim.py 
"""Calculating Relatedness."""
# uncomment

from __future__ import division

from embedding import *
#from collections import defaultdict
import json
import math
from scipy import stats
from config import *
#from utils import * # uncomment

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi", "Evangelos Milios", "Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"

#constants


def _unify_ids_scores(*id_sc_tuple):
    uids, id2in = e2i(*(ids for ids, _ in id_sc_tuple));
    
    uscs=tuple();            
    for ids,scs in id_sc_tuple:
        scs_u=sp.zeros(len(id2in))
        scs_u[[id2in[wid] for wid in ids]] = scs;            
        uscs += (scs_u,)                
    return uids, uscs       

    
def getsim_word2vec(id1, id2):
    """ Calculates word2vec similarity between two concepts 
    Arg:
        id1, id2: the two concepts 
    Returns:
        The similarity score        
    """
    model  = getword2vec_model()
    if model is None:
        log('[getsim_word2vec]\tmodel not loaded')
        raise Exception('model not loaded, try gensim_loadmodel()')
        
    if id1 not in model.vocab:
        #print '%s,%s skipped, %s not in vocab ' % (id1, id2, id1)
        return 0
    if id2 not in model.vocab:
        #print '%s,%s skipped, %s not in vocab ' % (id1, id2, id2)
        return 0
    return model.similarity(id1, id2)


def getsim_wlm(id1, id2):
    """ Calculates wlm (ngd) similarity between two concepts 
    Arg:
        id1, id2: the two concepts 
    Returns:
        The similarity score        
    """
    in1 = set(getlinkedpages(id1, DIR_IN))
    in2 = set(getlinkedpages(id2, DIR_IN))
    f1 = len(in1)
    f2 = len(in2)
    f12=len(in1.intersection(in2))
    # check for empty sets first, so that we never take the log of zero
    if (f1==0) or (f2==0) or (f12==0):
        return 0;
    dist = (sp.log(max(f1,f2))-sp.log(f12))/(sp.log(WIKI_SIZE)-sp.log(min(f1,f2)));
    sim = 1-dist if dist <=1 else 0
    return sim

def getsim_cocit(id1, id2):
    """ Calculates co-citation similarity between two concepts 
    Arg:
        id1, id2: the two concepts 
    Returns:
        The similarity score        
    """
    in1 = set(getlinkedpages(id1, DIR_IN))
    in2 = set(getlinkedpages(id2, DIR_IN))
    f1 = len(in1)
    f2 = len(in2)
    if (f1==0) or (f2==0):
        return 0;
    
    f12=len(in1.intersection(in2))
    sim = (f12)/(f1+f2-f12);
    return sim


def getsim_coup(id1, id2):
    """ Calculates coupling similarity between two concepts 
    Arg:
        id1, id2: the two concepts 
    Returns:
        The similarity score        
    """
    in1 = set(getlinkedpages(id1, DIR_OUT))
    in2 = set(getlinkedpages(id2, DIR_OUT))
    f1 = len(in1)
    f2 = len(in2)
    if (f1==0) or (f2==0):
        return 0;
    
    f12=len(in1.intersection(in2))
    sim = (f12)/(f1+f2-f12);
    return sim

def getsim_ams(id1, id2):
    """ Calculates Amsler similarity between two concepts 
    Arg:
        id1, id2: the two concepts 
    Returns:
        The similarity score        
    """
    in1 = set(getlinkedpages(id1, DIR_IN))
    out1 = set(getlinkedpages(id1, DIR_OUT))
    link1 = in1.union(out1)
    
    in2 = set(getlinkedpages(id2, DIR_IN))
    out2 = set(getlinkedpages(id2, DIR_OUT))
    link2 = in2.union(out2)
    
    f1 = len(link1)
    f2 = len(link2)
    if (f1==0) or (f2==0):
        return 0;
    
    f12=len(link1.intersection(link2))
    sim = (f12)/(f1+f2-f12);
    return sim


def getsim_emb(id1,id2, direction):
    """ Calculates the similarity between two concepts
    Arg:
        id1, id2: the two concepts
        direction: 0 for in, 1 for out, 2 for all
        
    Returns:
        The similarity score
    """
    em1 = concept_embedding(id1, direction);
    em2 = concept_embedding(id2, direction);
    if em1.empty or em2.empty:
        return 0;
    
    em1, em2 = em1.align(em2, fill_value=0)
#     print em1
#     print em2
    return 1-sp.spatial.distance.cosine(em1.values,em2.values);

def getsim(id1,id2, method='rvspagerank', direction=DIR_BOTH, sim_method=None):
    """ Calculates well-known similarity metrics between two concepts 
    Arg:
        id1, id2: the two concepts 
        method:
            wlm: Wikipedia-Miner method
            cocit: cocitation
            coup: coupling
            ams: amsler
            rvspagerank: embedding based similarity (in our case, 
                 reversed-page rank method)
    Returns:
        The similarity score        
    """
    log('[getsim started]\tmethod = %s, direction = %s, id1=%s, id2=%s', method, direction, id1, id2)
    
    if method=='rvspagerank':
        sim = getsim_emb(id1,id2, direction)
    elif method=='wlm':
        sim = getsim_wlm(id1,id2)
    elif method=='cocit':
        sim = getsim_cocit(id1,id2)
    elif method=='coup':
        sim = getsim_coup(id1,id2)
    elif method=='ams':
        sim = getsim_ams(id1,id2)
    elif 'word2vec' in  method:
        sim = getsim_word2vec(id1, id2)
    elif sim_method is not None:    
        sim = sim_method(id1,id2)
    else:
        sim=None
    log('[getsim]\tfinished')
    return sim

ENTITY_TITLE = 0
ENTITY_ID = 1
ENTITY_ID_STR = 2
ENTITY_ID_ID_STR = 3
    
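def encode_entity(term1, term2, entity_encoding):
    """ Converts a pair of titles to the encoding selected by entity_encoding
    (see the ENTITY_* constants); a title that cannot be resolved stays None
    """
    if entity_encoding==ENTITY_TITLE:
        return term1, term2
    
    term1 = title2id(term1)
    term2 = title2id(term2)
    if (term1 is None) or (term2 is None):
        # keep None so that callers can detect unresolved titles
        return term1, term2
    if entity_encoding==ENTITY_ID_STR:
        term1 = str(term1)
        term2 = str(term2)
        
    if entity_encoding==ENTITY_ID_ID_STR:
        # str() avoids a TypeError when the id is an integer
        term1 = 'id_'+str(term1)
        term2 = 'id_'+str(term2)
    return term1, term2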
    
def getsim_file(infilename, outfilename, method='rvspagerank', direction=DIR_BOTH, sim_method=None, entity_encoding=ENTITY_ID):
    """ Batched (file) similarity.
    
    Args: 
        infilename: tsv file in the format of pair1    pair2   [goldstandard]
        outfilename: tsv file in the format of pair1    pair2   similarity
        direction: 0 for in, 1 for out, 2 for all
        entity_encoding: how the entity is represented in the dataset
                        ENTITY_TITLE = simple entity
                        ENTITY_ID = integer id
                        ENTITY_ID_STR = str id
                        ENTITY_ID_ID_STR = id_entityid
                        
    Returns:
        vector of scores, and Spearman's correlation if goldstandard is given
    """
    log('[getsim_file started]\t%s -> %s', infilename, outfilename)
    outfile = open(outfilename, 'w');
    dsdata=readds(infilename);
    gs=[];
    scores=[];
    spcorr=None;
    for row in dsdata.itertuples():   
        log('processing %s, %s', row[1], row[2])
        if (row[1]=='null') or (row[2]=='null'):
            continue;
        if len(row)>3: 
            gs.append(row[3]);
            
        term1, term2 = encode_entity(row[1], row[2], entity_encoding)
            
        if (term1 is None) or (term2 is None):
            sim=0;
        else:
            sim=getsim(term1, term2, method, direction, sim_method);
        outfile.write("\t".join([str(row[1]), str(row[2]), str(sim)])+'\n')
        scores.append(sim)
    outfile.close();
    if gs:
        spcorr = sp.stats.spearmanr(scores, gs);
    log('[getsim_file]\tfinished')
    return scores, spcorr

    

def getembed_file(infilename, outfilename, direction, get_titles=False, cutoff=None):
    """ Batched (file) concept representation.
    
    Args: 
        infilename: tsv file whose first column holds the page titles
        outfilename: tsv file in the format of title    embedding (as json)
        direction: 0 for in, 1 for out, 2 for all
        get_titles: include titles in the embedding (not needed for mere calculations)
        cutoff: the first top cutoff dimensions (None for all)        

    """
    
    log('[getembed_file started]\t%s -> %s', infilename, outfilename)
    outfile = open(outfilename, 'w');
    dsdata=readds(infilename, usecols=[0]);
    scores=[];
    for row in dsdata.itertuples():        
        wid = title2id(row[1])
        if wid is None:
            em=pd.Series();
        else:
            em=conceptrep(wid, method='rvspagerank', direction = direction, 
                          get_titles = get_titles, cutoff=cutoff)
        outfile.write(row[1]+"\t"+em.to_json()+"\n")
    outfile.close();
    log('[getembed_file]\tfinished')



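The link-overlap measures in `calcsim.py` above share one formula, the Jaccard index `f12 / (f1 + f2 - f12)`, and differ only in which link sets they compare. The following is a minimal, self-contained sketch on a toy graph. `OUT_LINKS`, `in_links`, and the page names are hypothetical stand-ins for `getlinkedpages` and real page ids, and treating cocitation as the overlap of incoming links is an assumption based on the usual definition.

```python
# Toy directed link graph: page -> set of pages it links to (hypothetical data).
OUT_LINKS = {
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "C": {"A"},
    "D": {"A", "B"},
    "E": set(),
}


def in_links(page):
    """Return the set of pages that link to `page`."""
    return {src for src, targets in OUT_LINKS.items() if page in targets}


def jaccard(s1, s2):
    """Size of the intersection over size of the union; 0 if either set is empty."""
    if not s1 or not s2:
        return 0.0
    common = len(s1 & s2)
    return common / (len(s1) + len(s2) - common)


def coupling(p1, p2):
    """Overlap of outgoing links (compare getsim_coup)."""
    return jaccard(OUT_LINKS[p1], OUT_LINKS[p2])


def cocitation(p1, p2):
    """Overlap of incoming links (assumed definition)."""
    return jaccard(in_links(p1), in_links(p2))


def amsler(p1, p2):
    """Overlap of all links, incoming and outgoing (compare getsim_ams)."""
    return jaccard(OUT_LINKS[p1] | in_links(p1), OUT_LINKS[p2] | in_links(p2))


print(coupling("A", "B"))    # A and B share 2 of 3 distinct outgoing links
print(cocitation("C", "D"))  # C and D are linked from exactly the same pages
print(amsler("A", "B"))
```

A page with no links on the compared side scores 0 against every other page, which matches the early `return 0` in the functions above.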


# A Simple Example

In [None]:
%load_ext autoreload
%autoreload 2
#%aimport calcsim

%aimport wikipedia

from wikipedia import * # uncomment
from calcsim import *   # uncomment
# Examples
reopen()
direction = DIR_IN

page_title1 = 'Abortion' 
print ('page_title: ', page_title1)

page_id1 = title2id(page_title1)
print ("id: ", page_id1)

sr1 = synonymring_titles(page_id1)
print ("synonym ring: %s\n " % str(sr1[:5]))

rep1=conceptrep(page_id1, method='rvspagerank', direction = direction,  get_titles=True, cutoff=5)
print ("Concept Representation:  %s\n" % rep1.to_json())

print ("\n")

page_title2 = 'Miscarriage' 
print ('page_title: ', page_title2)

page_id2 = title2id(page_title2)
print ("id: ", page_id2)

sr2 = synonymring_titles(page_id2)
print ("synonym ring: %s\n " % str(sr2[:5]))

rep2=conceptrep(page_id2, method='rvspagerank', direction = direction,  get_titles=True, cutoff=5)
print ("Concept Representation: %s\n" % rep2.to_json())



sim = getsim(page_id1, page_id2,'rvspagerank',DIR_IN)
print ("similarity", sim)
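For the `rvspagerank` method, the score printed above comes from `getsim_emb`, which aligns the two sparse embeddings on the union of their dimensions (missing entries become 0) and returns their cosine similarity. The sketch below reproduces that computation with plain dictionaries instead of pandas Series. The function name `cosine_similarity` and the sample vectors are hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(em1, em2):
    """Cosine similarity of two sparse vectors given as {dimension: weight} dicts."""
    if not em1 or not em2:
        return 0.0
    # Align on the union of dimensions, filling missing entries with 0.
    dims = sorted(set(em1) | set(em2))
    v1 = [em1.get(d, 0.0) for d in dims]
    v2 = [em2.get(d, 0.0) for d in dims]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical embeddings keyed by page id; they share only dimension 20.
em_a = {10: 0.6, 20: 0.8}
em_b = {20: 0.8, 30: 0.6}
print(cosine_similarity(em_a, em_b))  # approximately 0.64
```

Only the shared dimensions contribute to the dot product, so two concepts whose embeddings have no dimension in common get a similarity of 0.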



# Calculating All The Embeddngs
### Note: This step might take several days
This step can be safely skipped, in which case the cache fills gradually as embeddings are requested. If you have a heavy task ahead, however, it is worth investing the time to calculate the embeddings offline.

### First. Get a list of the page ids
```
SELECT page_id
INTO OUTFILE '~/backup/wikipedia/20160305/edited/enwiki-20160305-page.dumped.ssv'
FROM page
where page_namespace=0 and page_is_redirect=0 ;
```
### Second. Starting the precalculation

In [None]:
%%writefile preembed.py 
"""Pre calculation of the embeddings"""

from config import *

# %load_ext autoreload
# %autoreload

# %aimport calcsim
from calcsim import *
direction = DIR_IN;
dirstr = graphtype(direction)
#wid_fname  = os.path.join(home, 'backup/wikipedia/20160305/embed/enwiki-20160305-page.dumped.ssv')
wid_fname = os.path.join(home, 'backup/wikipedia/20160305/embed/enwiki-20160305-embeddings.'+dirstr+'.dead_2.ssv')

done_fname = os.path.join(home, 'backup/wikipedia/20160305/embed/enwiki-20160305-embeddings.'+dirstr+'.done.ssv')
dead_fname = os.path.join(home, 'backup/wikipedia/20160305/embed/enwiki-20160305-embeddings.'+dirstr+'.dead_3.ssv')
rewrite = True
lastwid = ""
if os.path.exists(done_fname):
    with open(done_fname) as done_f:
        for lastwid in done_f:
            pass
        if lastwid is not None:
            lastwid = lastwid.strip() 
            

wid_f = open(wid_fname)
done_f = open(done_fname, 'a')
dead_f = open(dead_fname, 'a')
    
if lastwid:
    for line in wid_f:
        if line.strip() == lastwid:
            break
    print("Continuing from %s" % lastwid)
else: 
    print("Fresh start")
    
for line in wid_f:
    wid = line.strip().split('\t')[0]
    if rewrite:
        deletefromcache(wid, direction)
    em = concept_embedding(wid, direction)
    if em.empty:
        count = str(len(getlinkedpages(wid, direction)))
        dead_f.write(wid+'\t'+id2title(wid)+'\t'+count+'\n')
    done_f.write(wid+'\n')
wid_f.close()
done_f.close()
dead_f.close()

print("done")
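Because the precalculation can run for days, `preembed.py` records each finished id in a "done" file and, on restart, skips ahead to the last recorded id. The sketch below isolates that resume pattern so it can be tried on small temporary files. The function `process_all`, the file names, and the `work` callback (a stand-in for `concept_embedding`) are hypothetical and not part of the project code.

```python
import os
import tempfile


def process_all(ids_path, done_path, work):
    """Call work(item_id) for each id in ids_path, resuming after the last id in done_path."""
    last_done = None
    if os.path.exists(done_path):
        with open(done_path) as done_f:
            for line in done_f:
                if line.strip():
                    last_done = line.strip()

    processed = []
    with open(ids_path) as ids_f, open(done_path, "a") as done_f:
        if last_done is not None:
            # Skip everything up to and including the last completed id.
            for line in ids_f:
                if line.strip() == last_done:
                    break
        for line in ids_f:
            item_id = line.strip()
            if not item_id:
                continue
            work(item_id)
            done_f.write(item_id + "\n")
            done_f.flush()  # keep the checkpoint current if the run is interrupted
            processed.append(item_id)
    return processed


# Demo with temporary files: pretend an earlier run stopped after id 2.
tmpdir = tempfile.mkdtemp()
ids_path = os.path.join(tmpdir, "ids.ssv")
done_path = os.path.join(tmpdir, "done.ssv")
with open(ids_path, "w") as f:
    f.write("1\n2\n3\n4\n")
with open(done_path, "w") as f:
    f.write("1\n2\n")

seen = []
resumed = process_all(ids_path, done_path, seen.append)
print(resumed)  # only the ids that were still pending
```

As in `preembed.py`, the resume step relies on the id list keeping the same order between runs. If the last recorded id no longer appears in the list, the skip loop consumes the whole file and nothing is processed.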

# Visualizing The Embeddings

In [None]:
%load_ext autoreload
%autoreload

from calcsim import *

import json
from IPython.display import Javascript

cre1 = conceptrep(title2id('Tehran'), method='rvspagerank', direction = DIR_OUT, get_titles=True, cutoff=5);
cre2 = conceptrep(title2id('Sanandaj'), method='rvspagerank', direction = DIR_OUT, get_titles=True, cutoff=5);


#runs arbitrary javascript, client-side
Javascript("""
           window.vizObj1={};window.vizObj2={};
           """.format(cre1.to_json(), cre2.to_json()))


In [None]:
%%javascript

require.config({
    paths: {
        d3:'//cgm6.research.cs.dal.ca/~sajadi/wikisim/js/d3',
        d3_cloud:'//cgm6.research.cs.dal.ca/~sajadi/wikisim/js/d3.layout.cloud',
        simple_draw:'//cgm6.research.cs.dal.ca/~sajadi/wikisim/js/simpledraw'

    }
});

In [None]:
%%javascript

function createWords(cp){

    var titles=[];
    var scores=[];

    for (var key in cp){ 
        if (cp.hasOwnProperty(key)) {
            titles.push(cp[key][0])
            scores.push(cp[key][1])
        }
    }
    var sum = scores.reduce(function(a, b) {return a + b;});
    var min = Math.min.apply(null, scores)
    var max = Math.max.apply(null, scores)
    
    scores=scores.map(function(a){return (a/sum)*90+20});
    var words=[];
    for (var i = 0; i<titles.length; i++) {
        words.push({"text":titles[i], "size": scores[i]})
    }
    return words;
}

var words1=createWords(window.vizObj1);
//element.text(JSON.stringify(words1));
var words2=createWords(window.vizObj2);
require(['d3','d3_cloud', 'simple_draw'], function(d3,d3_cloud, simple_draw){
    $("#chart1").remove();
    element.append("<div id='chart1' style='width:49%; height:500px; float:left; border-style:solid'> </div>");
    simpledraw(words1, '#chart1');
    
    $("#chart2").remove();
    element.append("<div id='chart2' style='width:49%; margin-left:2%; height:500px; float:left; border-style:solid'> </div>");
    simpledraw(words2, '#chart2');    
    
});    
    
