# Wikisim Wikify: Linking Text to Wikipedia

* **Armin Sajadi** (sajadi@cs.dal.ca)
* **Ryan Amaral**  (amaral@cs.dal.ca)

This is a simple, step-by-step explanation of calculating semantic relatedness using Wikipedia. We start by preprocessing and building the API, which is explained in the following papers:





# Read Here First

**Make sure you have followed the [setup process](../README.md#hosting) and have all the requirements before trying to run these scripts**



# Table of Contents

**[WSD Utils](#WSD-Utils)**

**[Coherence Module](#Coherence-Module)**

**[Testing Coherence](#Testing-Coherence)**

**[WSD Module](#WSD-Module)**

**[Generate Train Data Repository for WSD](#Genetrate-Train-Data-Repository-for-WSD)**

**[Train the LTR Model](#Train-the-LTR-Model)**

**[Mention Detection](#Mention-Detection)**

**[Generate Train Data Repository For Mention Detection](#Generate-Train-Data-Repository-For-Mention-Detection)**

**[Train SVC Model for Mention Detection](#Train-SVC-Model-for-Mention-Detection)**

**[Wikification API](#Wikification-API)**

**[Testing Wikification](#Testing-Wikification)**



## Most Frequently Used Data Structures and Terminology
**Mention Detection**: Finding strings that can potentially refer to a concept in Wikipedia

**WSD**: *Given the mentions*, finding the correct concept. In WSD we assume the mentions are given

**Wikification**: Mention Detection + WSD

**S**: segmented (tokenized) sentence [w1, ..., wn]

**M**: mentions [(m1, e1), ... , (mj, ej)] where mi is an index of S (S[mi] is the mention) and ei is the entity it refers to

**C**: Candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(cnm, pnm)]]
             where cij is the jth candidate for ith mention and pij is the relative frequency of cij
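
As a concrete illustration, the three structures could look as follows for the guitarist sentence used in the tests further down. This is only a sketch: the candidate ids are copied from the test output, the frequencies are rounded, and only the first two candidates per mention are shown (so the frequencies do not sum to 1 here).

```python
# Segmented sentence S: a list of tokens.
S = ["Three", "of", "the", "greatest", "guitarists", "started", "their",
     "career", "in", "a", "single", "band", ":", "Clapton", ",", "Beck",
     ",", "and", "Page", "."]

# Mentions M: (index into S, title of the entity the mention refers to).
M = [(13, "Eric_Clapton"), (15, "Jeff_Beck"), (18, "Jimmy_Page")]

# Candidates C: one list per mention; each entry is (concept id, relative frequency).
# Truncated to two candidates per mention, values rounded.
C = [
    [(635951, 0.418), (1112504, 0.234)],
    [(149681, 0.963), (1464020, 0.016)],
    [(106597, 0.441), (91270, 0.235)],
]

# S[mi] recovers the surface string of each mention.
mention_strings = [S[m] for m, _ in M]
print(mention_strings)
```

Note that the i-th entry of C always belongs to the i-th entry of M, so the two lists have the same length.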
             


# WSD Utils
Generating candidates, calculating measures, ...
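
The core of candidate generation (sort by frequency, keep the top `max_t`, normalize to relative frequencies) can be sketched without the Wikipedia back end. The function name `normalize_candidates` and the anchor counts below are hypothetical; the real `generate_candidates` obtains its counts from `anchor2concept` and additionally supports the `enforce` option, which this sketch leaves out.

```python
from __future__ import division, print_function


def normalize_candidates(anchor_counts, max_t=20):
    """Keep the max_t most frequent candidates and convert counts to relative frequencies."""
    # Fall back to a dummy candidate (id 0) when the anchor is unknown.
    clist = list(anchor_counts) or [(0, 1)]
    # Most frequent first, then truncate.
    clist = sorted(clist, key=lambda x: -x[1])[:max_t]
    total = sum(count for _, count in clist)
    return [(cid, count / total) for cid, count in clist]


# Hypothetical anchor counts: (concept id, number of times the anchor links to it).
counts = [(101, 6), (102, 3), (103, 1), (104, 10)]
normalized = normalize_candidates(counts, max_t=3)
print(normalized)
```

The frequencies are normalized over the truncated list, so they always sum to 1 regardless of how many candidates were dropped.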

In [2]:
%%writefile wsd_util.py 
"""A few general modules for disambiguation
"""
from __future__ import division

import os
import re
import sys
from itertools import chain
from itertools import product
from itertools import combinations
import unicodedata

dirname = os.path.dirname(__file__)
sys.path.insert(0,os.path.join(dirname, '..'))
from wikisim.config import *
from wikisim.calcsim import *
import requests
from requests.packages.urllib3 import Retry

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"

MODELDIR = os.path.join(dirname, "../models")

session = requests.Session()
http_retries = Retry(total=20,
                backoff_factor=.1)
http = requests.adapters.HTTPAdapter(max_retries=http_retries)
session.mount('http://', http)


def generate_candidates(S, M, max_t=20, enforce=False):
    """ Given a sentence list (S) and a mentions list (M), returns a list of candidates
        Inputs:
            S: segmented sentence [w1, ..., wn]
            M: mentions [(m1, e1), ... , (mj, ej)]
            max_t: maximum number of candidates per mention
            enforce: Makes sure the "correct" entity is among the candidates
        Outputs:
         Candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(cnm, pnm)]]
             where cij is the jth candidate for ith mention and pij is the relative frequency of cij
    
    """
    candslist=[]
    for m in M:
        
        clist = anchor2concept(S[m[0]])
        if not clist:
            clist=((0L,1L),)
        
        clist = sorted(clist, key=lambda x: -x[1])
        clist = clist[:max_t]
        
        smooth=0    
        if enforce:          
            wid = title2id(m[1])            
    #         if wid is None:
    #             raise Exception(m[1].encode('utf-8') + ' not found')
            
                        
            trg = [(i,(c,f)) for i,(c,f) in enumerate(clist) if c==wid]
            if not trg:
                trg=[(len(clist), (wid,0))]
                smooth=1

                
            if smooth==1 or trg[0][0]>=max_t: 
                if clist:
                    clist.pop()
                clist.append(trg[0][1])
            
        s = sum(c[1]+smooth for c in clist )        
        clist = [(c,float(f+smooth)/s) for c,f in clist ]
            
        candslist.append(clist)
    return  candslist 


def solr_escape(s):
    """
        Escape a string for solr
    """
    #ToDo: probably && and || need to be escaped as a whole, and also AND, OR, NOT are not included
    to_sub=re.escape(r'+-&&||!(){}[]^"~*?:\/')
    return re.sub('[%s]'%(to_sub,), r'\\\g<0>', s)

def solr_unescape(s):
    """
        Unescape a string that was escaped for solr
    """
    #ToDo: probably && and || need to be unescaped as a whole, and also AND, OR, NOT are not included
    to_sub=re.escape(r'+-&&||!(){}[]^"~*?:\/')
    return re.sub(r'\\([%s])'%(to_sub,), r'\g<1>', s)

def throw_unicodes(inputstr):
    '''This function "ideally" should prepare the text in the correct encoding
        which is utf-16, but I couldn't (cf. my encoding notes)
        so for now, just make everything ascii!
        Input: 
            A unicode string with any encoding
        Output: 
            Ascii encoded string
    '''
    if isinstance(inputstr, str):
        return inputstr
    log('[throw_unicodes]\t Encoded to ascii')
    return unicodedata.normalize('NFKD', inputstr).encode('ascii', 'ignore')


#Evaluation Methods
def get_tp(ids, gold_titles):
    """Returns the number of true positives in ids, compared to gold_titles;
        this function is used to evaluate WSD
       Inputs: gold_titles: The correct mentions, as (index, title) pairs
               ids: The given ids
       Outputs: returns a list [true_positives, total_number_of_ids]
    
    """
    tp=0
    for m,id2 in zip(gold_titles, ids):
        if title2id(m[1]) == id2:
            tp += 1
    return [tp, len(ids)]

def get_prec(tp_list):
    """Returns micro and macro precision
       Inputs: a list of (true_positive, total number) pairs
       Output: (micro_prec, macro_prec)
    """
    overall_tp = 0
    simple_count=0
    overall_count=0
    macro_prec = 0
    for tp, count in tp_list:
        if tp is None:
            continue
        simple_count +=1    
        overall_tp += tp
        overall_count += count
        macro_prec += float(tp)/count
        
    macro_prec = macro_prec/simple_count
    micro_prec = float(overall_tp)/overall_count
    
    return micro_prec, macro_prec

from difflib import SequenceMatcher
def strsimilar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def get_sentence_measures(S, M, S_gold, M_gold, wsd_measure=False):
    ''' Calculates the true positive/false positive/false negative counts for
        mention detection/wsd for a given sentence
        Input:
            S_gold: The correct tokenized sentence 
            M_gold: The correct mentions
            S: The given sentence to evaluate
            M: The given mentions to evaluate
            wsd_measure: if True, it returns the wikifying measures,
                        if False, returns the measures of the mention detection process
        Output:
            tp, fp, fn (precision/recall/F1 are derived from these in get_overall_measures)
            
    '''
    
    S_gold = [throw_unicodes(s.replace(' ','')) for s in S_gold]    
    Sgi=[]
    Mgi=[]
    last_index=0
    for s in S_gold:
        Sgi.append ([last_index, last_index+len(s)])
        last_index += len(s)
    Mgi = [Sgi[m[0]] for m in M_gold]    
    
    S = [throw_unicodes(s.replace(' ','')) for s in S]    
    Sj=[]
    last_index=0
    for s in S:
        Sj.append ([last_index, last_index+len(s)])
        last_index += len(s)
    Mj = [Sj[m[0]] for m in M]    
    
    i=0
    j=0
    tp=fp=fn=0
    
    
    while True:
        if i >= len(Mgi):
            fp += (len(Mj)-j)
            break
            
        if j >= len(Mj):
            fn += (len(Mgi)-i)
            break
            
        if (Mgi[i][1] <= Mj[j][0]) or ((Mgi[i][1] <= Mj[j][1]) and strsimilar(S_gold[M_gold[i][0]], S[M[j][0]])<0.5):
            fn += 1
            i += 1
            continue
            
        if  (Mgi[i][0] >= Mj[j][1]) or ((Mgi[i][0] <= Mj[j][1]) and strsimilar(S_gold[M_gold[i][0]], S[M[j][0]])<0.5):
            fp += 1
            j += 1
            continue

        if wsd_measure:
            if title2id(M_gold[i][1]) != title2id(M[j][1]):
                fp += 1
                i += 1
                j += 1
                continue
        #print "match:%s, %s (%s, %s) " % (Mgi[i], Mj[j], S_gold[M_gold[i][0]], S[M[j][0]])
        tp +=1
        i += 1
        j += 1
        
    return tp, fp, fn   
            
def get_overall_measures(tp_list):
    """Returns micro/macro measures, given a list of (tp, fp, fn)
       Inputs: a list of (tp, fp, fn) tuples
       Output: macro_prec, macro_rec, macro_f1, micro_prec, micro_rec, micro_f1
    """
    overall_tp = overall_fp = overall_fn = 0
    macro_prec = macro_rec = macro_f1 = 0;
    for tp, fp, fn in tp_list:

        overall_tp += tp
        overall_fp += fp
        overall_fn += fn

        prec = float(tp)/(tp+fp) if tp+fp > 0 else 0
        rec  = float(tp)/(tp+fn) if tp+fn > 0 else 0
        macro_prec += prec
        macro_rec  += rec
        macro_f1   += 2*(prec * rec)/(prec + rec) if (prec + rec)>0 else 0
        
    macro_prec = macro_prec/len(tp_list)
    macro_rec = macro_rec/len(tp_list)
    macro_f1 = macro_f1/len(tp_list)

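    # Guard against empty denominators, consistent with the macro measures above
    micro_prec = float(overall_tp) / (overall_tp + overall_fp) if (overall_tp + overall_fp) > 0 else 0
    micro_rec  = float(overall_tp) / (overall_tp + overall_fn) if (overall_tp + overall_fn) > 0 else 0
    micro_f1   = 2*(micro_prec * micro_rec)/(micro_prec + micro_rec) if (micro_prec + micro_rec) > 0 else 0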
    
    return macro_prec, macro_rec, macro_f1, micro_prec, micro_rec, micro_f1
     

        

Overwriting wsd_util.py


# Coherence Module
Calculates two types of coherence given a sentence
* Standard Coherence: Integer programming, maximizing the sum of the mutual similarities
* Key Entity Coherence: It is based on the similarity of an entity to the key entity, where the key entity is found through a complete ($O(n^2)$) search
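
The key-entity search can be sketched in isolation from the similarity back end. The sketch below tries every candidate of every mention as the key entity, resolves each other mention to the candidate most similar to that key, and keeps the assignment with the highest total similarity. The function `key_entity_search`, the toy concept names, and the similarity values are all hypothetical; the module itself looks similarities up in a precomputed matrix.

```python
from __future__ import print_function


def key_entity_search(candslist, sim):
    """Try every candidate as the key entity and keep the best-scoring assignment.

    candslist: one list of candidate ids per mention.
    sim: function (id, id) -> similarity.
    """
    best_resolved, best_score = None, float("-inf")
    for a, cands in enumerate(candslist):
        for key in cands:
            resolved, score = [], 0.0
            for i, others in enumerate(candslist):
                if i == a:
                    # The mention that owns the key entity resolves to the key itself.
                    resolved.append(key)
                    continue
                # Every other mention resolves to the candidate closest to the key.
                best = max(others, key=lambda c: sim(key, c))
                score += sim(key, best)
                resolved.append(best)
            if score > best_score:
                best_resolved, best_score = resolved, score
    return best_resolved, best_score


# Hypothetical symmetric similarities between toy concepts.
toy_sim = {
    frozenset(["clapton_musician", "beck_musician"]): 0.9,
    frozenset(["clapton_musician", "beck_beer"]): 0.1,
    frozenset(["clapton_town", "beck_musician"]): 0.2,
    frozenset(["clapton_town", "beck_beer"]): 0.3,
}


def sim(a, b):
    return toy_sim.get(frozenset([a, b]), 0.0)


cands = [["clapton_town", "clapton_musician"], ["beck_beer", "beck_musician"]]
resolved, score = key_entity_search(cands, sim)
print(resolved, score)
```

With a precomputed similarity matrix, the cost is dominated by the nested loops over all candidates, which is where the quadratic behaviour comes from.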


In [3]:
%%writefile coherence.py 
"""Different coherence (context, key-entity) calculation, and 
    disambiguation.
"""
from itertools import izip
from pulp import *
import random
from itertools import chain
from itertools import product
from itertools import combinations

from wsd_util import *

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"



def get_sim_matrix(candslist, method, direction):
    ''' Computes a matrix of pairwise similarities
        Input: 
            candslist: The list of candidates
            method: Similarity method
            direction: embedding direction
        Output:
            The similarity matrix
    '''
    concepts=  list(chain(*candslist))
    concepts=  list(set(c[0] for c in concepts))
    sims = pd.DataFrame(index=concepts, columns=concepts)
    for cands1,cands2 in combinations(candslist,2):
        for c1,c2 in product(cands1,cands2):
            sims[c1[0]][c2[0]]= sims[c2[0]][c1[0]] = getsim(encode_entity(c1[0], method, get_id=False),
                                                            encode_entity(c2[0], method, get_id=False),
                                                            method, direction)
    return sims     

# ILP Coherence
def disambiguate_ilp(C, method, direction):
    """ Disambiguate using ILP 
        Inputs: 
            C: Candidate List [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]
            method: Similarity method
            direction: embedding direction"""
    #C = [('a','b','c'), ('e', 'f', 'g'), ('h', 'i')]

    R1 = [zip([i]*len(c), range(len(c))) for i,c in enumerate(C)]

    #R1 = {r:str(r) for r in itertools.chain(*RI1)}
    #R1 = [[str(rij) for rij in ri] for ri in RI1]

    #RI1_flat = list(itertools.chain(*RI1))


    R2=[]
    for e in combinations(R1,2):
        # combinations/product are imported by name from itertools above
        R2 += [r for r in product(e[0], e[1]) ]        


    #R2 = {r:str(r) for r in RI2}


    simmatrix = get_sim_matrix(C, method, direction)
    
    S = {((u0,u1),(v0,v1)):simmatrix[C[u0][u1][0]][C[v0][v1][0]] for ((u0,u1),(v0,v1)) in R2}


    prob = LpProblem("wsd", LpMaximize)

    R=list(chain(*R1)) + R2
    R_vars = LpVariable.dicts("R",R,
                                lowBound = 0,
                                upBound = 1,
                                cat = LpInteger)
    prob += lpSum([S[r]*R_vars[r] for r in R2])


    i=0
    for ri in R1:
        prob += lpSum([R_vars[rij] for rij in ri])==1, ("R1 %s constraint")%i
        i += 1


    for r in R2:
        prob += lpSum([R_vars[r[0]],R_vars[r[1]],-2*R_vars[r]]) >=0, ("R_%s_%s constraint"%(r[0], r[1]))

    prob.solve() 
    ids    = [C[r[0]][r[1]][0] for r in list(chain(*R1)) if R_vars[r].value() == 1.0]
    titles = ids2title(ids)
    return ids, titles

# key-entity based methods
def evalkey(c, a, candslist, simmatrix):
    ''' Calculates the coherence, given a key entity
        Input: 
                c: a possible key-entity
                a: index of the mention whose candidate is considered to be key entity
                candslist: The list of candidates
                simmatrix: Similarity Matrix
        Output:
            resolved: the resolved entities
            score: Coherence, using this entity
    '''
    resolved=[]
    score=0;
    for i in  range(len(candslist)):
        if a==i:
            resolved.append(c[0])
            continue
        cands = candslist[i]
        vb = [(cj[0], simmatrix[c[0]][cj[0]])  for cj in cands]
        max_concept, max_sc = max(vb, key=lambda x: x[1])
        score += max_sc
        resolved.append(max_concept)
    return resolved,score

def key_quad(candslist,  method, direction):
    '''Disambiguate using search-based (quadratic) key-entity method
        Input:
            candslist: The list of candidates
            method: Similarity method
            direction: The graph direction
        Output: 
            Resolved entities, and resolved titles
    '''
    res_all=[]
    simmatrix = get_sim_matrix(candslist, method, direction)

    for i in range(len(candslist)):
        for j in range(len(candslist[i])):
            res_ij =  evalkey(candslist[i][j], i, candslist, simmatrix)
            res_all.append(res_ij)
    res, score = max(res_all, key=lambda x: x[1])
    #print("Score:", score)
    titles = ids2title(res)
    return res, titles


def disambiguate(C, method, direction, op_method):
    """ Disambiguate C list using a disambiguation method 
        Inputs:
            C: Candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(cnm, pnm)]]
            method: similarity method
            direction: embedding direction
            op_method: disambiguation method 
                        most important ones: ilp (integer linear programming), 
                                             keyq: Key Entity based method
        
    """
    if op_method == 'ilp':
        return disambiguate_ilp(C, method, direction)
    if op_method == 'keyq':
        return key_quad(C, method, direction)    
    return None

def disambiguate_driver(C, ws, method='rvspagerank', direction=DIR_BOTH, op_method="keyq"):
    """ Initiate the disambiguation by chunking the sentence 
        Disambiguate C list using a disambiguation method 
        Inputs:
            C: Candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]
            ws: Windows size for chunking
            method: similarity method
            direction: embedding type
            op_method: disambiguation method 
                        most important ones: ilp (integer linear programming), 
                                             keyq: Key Entity based method
        
    """
    #TODO: modify this chunking to an overlapping version
    if ws == 0: 
        return  disambiguate(C, method, direction, op_method)
    
    ids = []
    titles = []
    
    windows = [[start, min(start+ws, len(C))] for start in range(0,len(C),ws) ]
    last = len(windows)
    if last > 1 and windows[last-1][1]-windows[last-1][0]<2:
        windows[last-2][1] = len(C)
        windows.pop()
        
    for w in windows:
        chunk_c = C[w[0]:w[1]]
        
        chunk_ids, chunk_titles = disambiguate(chunk_c, method, direction, op_method)
        ids += chunk_ids
        titles += chunk_titles
    return ids, titles     


Overwriting coherence.py


# Testing the coherence module

In [4]:
"""Testing Coherence
"""
from coherence import *
# S=["Carlos", "met", "David", "and" , "Victoria", "in", "Madrid"]
# M=[[0, "Roberto_Carlos"], [2, "David_Beckham"], [4, "Victoria_Beckham"], [6, "Madrid"]]


S=["Three", "of", "the", "greatest", "guitarists", "started", "their", "career", "in", "a", "single", "band", ":", "Clapton", ",", "Beck", ",", "and", "Page", "."]
M=[[13, "Eric_Clapton"], [15, "Jeff_Beck"], [18, "Jimmy_Page"]]

# S=["Phoenix, Arizona"] 
# M=[[0, "Phoenix,_Arizona"]]

C = generate_candidates(S, M, max_t=5, enforce=True)
print "Candidates: ", C, "\n"


ids, titles = disambiguate_driver(C, ws=5, op_method='keyq')
print ids 
print titles 

Candidates:  [[(635951L, 0.41843971631205673), (1112504L, 0.23404255319148937), (2473742L, 0.1347517730496454), (10049L, 0.12056737588652482), (28222274L, 0.09219858156028368)], [(149681L, 0.9626168224299065), (1464020L, 0.01557632398753894), (11468545L, 0.01142263759086189), (18842308L, 0.009345794392523364), (105389L, 0.0010384215991692627)], [(106597L, 0.4411764705882353), (91270L, 0.23529411764705882), (1451666L, 0.16176470588235295), (1780004L, 0.14705882352941177), (102096L, 0.014705882352941176)]] 

[10049L, 105389L, 102096L]
['Eric_Clapton', 'Jeff_Beck', 'Jimmy_Page']


# VSM Coherence Module
Calculates two types of coherence given a sentence
* Context Coherence: It is based on the similarity of an entity to the other entities in its context
* VSM Key Entity Coherence: It is based on the similarity of an entity to the key entity, where the key entity is found using VSM operations
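
Context coherence can be sketched with plain Python lists. Each mention is summarized by the sum of its candidate vectors, the context of a mention is the sum of the summaries of all *other* mentions, and every candidate is scored by its cosine similarity to that context. The 2-dimensional embeddings and the function names below are hypothetical; the module obtains its vectors from `conceptrep` and works on NumPy arrays.

```python
from __future__ import division, print_function
import math


def cosine(u, v):
    """Cosine similarity; returns 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def context_scores(cand_vectors):
    """cand_vectors: one list of candidate vectors per mention.

    Returns, per mention, the similarity of each candidate to its context.
    """
    dim = len(cand_vectors[0][0])
    # Aggregate vector per mention: the sum of all of its candidate vectors.
    aggr = [[sum(v[d] for v in vecs) for d in range(dim)] for vecs in cand_vectors]
    scores = []
    for i, vecs in enumerate(cand_vectors):
        # Context of mention i: the aggregates of all the other mentions.
        context = [sum(aggr[j][d] for j in range(len(aggr)) if j != i)
                   for d in range(dim)]
        scores.append([cosine(context, v) for v in vecs])
    return scores


# Hypothetical embeddings: two mentions with two candidates each.
vectors = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [1.0, 0.0]],
]
scores = context_scores(vectors)
print(scores)
```

Returning 0 for zero vectors mirrors the module's handling of candidates that have no embedding.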


In [5]:
%%writefile vsmcoherence.py 
"""Different coherence (context, key-entity) calculation, and 
    disambiguation.
"""
from __future__ import division

from wsd_util import *
import numpy as np

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"


def get_candidate_representations(candslist, direction, method):
    '''returns an array of vector representations. 
       Inputs: 
           candslist: candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(cnm, pnm)]]
           direction: embedding direction
           method: similarity method
      Outputs
           cvec_arr: Candidate embeddings, a two dimensional array, each row 
                   is the representation of a candidate
           cveclist_bdrs: a list of pairs (beginning, end), to indicate where 
                   the embeddings of the candidates of each mention start and end. In other words
                   the embeddings of candidates [ci1...cik] in candslist are
                   cvec_arr[cveclist_bdrs[i][0]:cveclist_bdrs[i][1]] 
    '''
    
    cframelist=[]
    cveclist_bdrs = []
    ambig_count=0
    for cands in candslist:
        if len(cands)>1:
            ambig_count += 1
        cands_rep = [conceptrep(encode_entity(c[0], method, get_id=False), method=method, direction=direction, get_titles=False) for c in cands]
        cveclist_bdrs += [(len(cframelist), len(cframelist)+len(cands_rep))]
        cframelist += cands_rep
        
    cvec_fr = pd.concat(cframelist, join='outer', axis=1)
    cvec_fr.fillna(0, inplace=True)
    cvec_arr = cvec_fr.values.T
    return cvec_arr, cveclist_bdrs

def entity_to_context_scores(candslist, direction, method):
    ''' finds the similarity between each entity and its context representation
        Inputs:
            candslist: the list of candidates [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(cnm, pnm)]]
            direction: embedding direction
            method: similarity method
        Returns:
           cvec_arr: Candidate embeddings, a two dimensional array, each row 
                   is the representation of a candidate
           cveclist_bdrs: a list of pairs (beginning, end), to indicate where the 
                   representations of the candidates of each mention reside        
           cands_score_list: scores in the form of [[s11,...s1k],...[sn1,...snm]]
                    where sij is the similarity of c[i,j] to the context of the i-th mention
                    
            '''
    cvec_arr, cveclist_bdrs =  get_candidate_representations(candslist, direction, method)    
    
    aggr_cveclist = np.zeros(shape=(len(candslist),cvec_arr.shape[1]))
    for i in range(len(cveclist_bdrs)):
        b,e = cveclist_bdrs[i]
        aggr_cveclist[i]=cvec_arr[b:e].sum(axis=0)
    
    cands_score_list=[]        
    for i in range(len(candslist)):
        cands = candslist[i]
        b,e = cveclist_bdrs[i]
        cvec = cvec_arr[b:e]
        convec=aggr_cveclist[:i].sum(axis=0) + aggr_cveclist[i+1:].sum(axis=0)
        S=[]    
        for v in cvec:
            try:
                # We have zero vectors, so this can raise an exception
                # or return None                
                s = 1-sp.spatial.distance.cosine(convec, v);
            except:
                s=0                
            if np.isnan(s):
                s=0
            S.append(s)
        cands_score_list.append(S)

    return cvec_arr, cveclist_bdrs, cands_score_list

def key_criteria(cands_score):
    ''' helper function for find_key_concept: returns a score indicating how good a key is
        Input:
            scores for candidates [ci1, ..., cik] in the form of (i, [(ri1, si1), ..., (rik, sik)] ) 
            where (rij,sij) indicates that sij is the similarity of c[i][rij] to the context of the i-th mention
        Output:
            the relative margin between the best and the second-best score
            
    '''
    if len(cands_score[1])==0:
        return -float("inf")    
    if len(cands_score[1])==1 or cands_score[1][1][1]==0:
        return float("inf")
    
    return (cands_score[1][0][1]-cands_score[1][1][1]) / cands_score[1][1][1]

def find_key_concept(candslist, direction, method):
    ''' finds the key entity in the candidate list
        Inputs:
            candslist: the list of candidates [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(cnm, pnm)]]
            direction: embedding direction
            method: similarity method
        Returns:
            cvec_arr: Candidate embeddings, a two dimensional array, each row 
                    is the representation of a candidate
            cveclist_bdrs: a list of pairs (beginning, end), to indicate where the 
                    representations of the candidates of each mention reside
            key_concept: index of the mention for which one of the candidates is the key entity
            key_entity: candidate index for key_concept that is detected to be the key entity
            key_entity_vector: The embedding of the key entity
            '''
    cvec_arr, cveclist_bdrs, cands_score_list = entity_to_context_scores(candslist, direction, method);
    S=[sorted(enumerate(S), key=lambda x: -x[1]) for S in cands_score_list]
        
    key_concept, _ = max(enumerate(S), key=key_criteria)
    key_entity = S[key_concept][0][0]
    
    b,e = cveclist_bdrs[key_concept]
    
    key_entity_vector =  cvec_arr[b:e][key_entity]    
    return cvec_arr, cveclist_bdrs, key_concept, key_entity, key_entity_vector

def keyentity_candidate_scores(candslist, direction, method):
    '''returns entity scores using key-entity scoring 
       Inputs: 
           candslist: candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]
           direction: embedding direction
           method: similarity method
           
       Returns:
           Scores [[s11,...s1k],...[sn1,...s1m]] where sij is cij similarity to the key-entity
    '''
    
    cvec_arr, cveclist_bdrs, key_concept, key_entity, key_entity_vector = find_key_concept(candslist, direction, method)
    
    # Iterate 
    candslist_scores=[]
    for i in range(len(candslist)):
        cands = candslist[i]
        b,e = cveclist_bdrs[i]
        cvec = cvec_arr[b:e]
        cand_scores=[]

        for v in cvec:
            try:
                # We have zero vectors, so this can raise an exception
                # or return None                
                d = 1-sp.spatial.distance.cosine(key_entity_vector, v);
            except:
                d=0                
            if np.isnan(d):
                d=0
            
            cand_scores.append(d)    
        candslist_scores.append(cand_scores) 
    return candslist_scores



def coherence_scores_driver(C, ws=5, method='rvspagerank', direction=DIR_BOTH, op_method="keydisamb"):
    """ Assigns a score to every candidate 
        Inputs:
            C: Candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(cnm, pnm)]]
            ws: Window size for chunking
            method: similarity method
            direction: embedding type
            op_method: scoring method, either keydisamb (key entity) or entitycontext
        Output:
            Candidate Scores
        
    """
    windows = [[start, min(start+ws, len(C))] for start in range(0,len(C),ws) ]
    last = len(windows)
    if last > 1 and windows[last-1][1]-windows[last-1][0]<2:
        windows[last-2][1] = len(C)
        windows.pop()
    scores=[]    
    for w in windows:
        chunk_c = C[w[0]:w[1]]
        if op_method == 'keydisamb':
            scores += keyentity_candidate_scores(chunk_c, direction, method)
            
        if op_method == 'entitycontext':
            _, _, candslist_scores = entity_to_context_scores(chunk_c, direction, method);
            scores += candslist_scores
            
    return scores


Overwriting vsmcoherence.py


# Testing Coherence

In [6]:
"""Testing Coherence
"""
from vsmcoherence import *
# S=["Carlos", "met", "David", "and" , "Victoria", "in", "Madrid"]
# M=[[0, "Roberto_Carlos"], [2, "David_Beckham"], [4, "Victoria_Beckham"], [6, "Madrid"]]


S=["Three", "of", "the", "greatest", "guitarists", "started", "their", "career", "in", "a", "single", "band", ":", "Clapton", ",", "Beck", ",", "and", "Page", "."]
M=[[13, "Eric_Clapton"], [15, "Jeff_Beck"], [18, "Jimmy_Page"]]

# S=["Phoenix, Arizona"] 
# M=[[0, "Phoenix,_Arizona"]]

C = generate_candidates(S, M, max_t=5, enforce=False)
print "Candidates: ", C, "\n"


coh_scores = coherence_scores_driver(C, ws=5, method='rvspagerank', direction=DIR_BOTH, op_method="entitycontext")
print coh_scores

Candidates:  [[(635951L, 0.41843971631205673), (1112504L, 0.23404255319148937), (2473742L, 0.1347517730496454), (10049L, 0.12056737588652482), (28222274L, 0.09219858156028368)], [(149681L, 0.9625779625779626), (1464020L, 0.014553014553014554), (11468545L, 0.010395010395010396), (18842308L, 0.008316008316008316), (273869L, 0.004158004158004158)], [(106597L, 0.4027777777777778), (91270L, 0.20833333333333334), (1451666L, 0.1388888888888889), (1780004L, 0.125), (2165598L, 0.125)]] 

[[4.6521063918336658e-05, 3.7266464031904256e-05, 4.5843805666856419e-05, 0.01042156335153166, 0.0009723255356791638], [0.010091926514982474, 0.00012862557240578276, 2.1648392303119657e-05, 0.00019246023418373337, 0.00017654705162273299], [0.00013532853088216168, 9.8833055384495161e-05, 2.7683319119398142e-05, 7.2847110666685033e-05, 0.00010954076899327703]]


# WSD Module
This module assumes that the mentions are already detected. Given a tokenized sentence with mention markers, it 
tries to find the target entity in Wikipedia, using several measures:

* Popularity
* Context similarity coherence
* Key entity coherence
* String similarity between the mention and the candidate
* Textual context similarity
* Machine Learned Model
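
To see why popularity alone is not enough, consider a minimal sketch of a popularity-only baseline applied to the candidate list printed by the coherence test above (frequencies rounded). The helper `most_popular` is hypothetical and not part of the module; it simply picks the candidate with the highest relative frequency for each mention.

```python
from __future__ import print_function


def most_popular(candslist):
    """Pick, for every mention, the candidate with the highest relative frequency."""
    return [max(cands, key=lambda c: c[1])[0] for cands in candslist]


# Candidates copied (frequencies rounded) from the coherence test output above.
C = [
    [(635951, 0.418), (1112504, 0.234), (2473742, 0.135), (10049, 0.121), (28222274, 0.092)],
    [(149681, 0.963), (1464020, 0.016), (11468545, 0.011), (18842308, 0.009), (105389, 0.001)],
    [(106597, 0.441), (91270, 0.235), (1451666, 0.162), (1780004, 0.147), (102096, 0.015)],
]
# Ids returned by the 'keyq' coherence test above (Eric_Clapton, Jeff_Beck, Jimmy_Page).
key_entity_result = [10049, 105389, 102096]

popular = most_popular(C)
print(popular)
print(popular == key_entity_result)
```

For this sentence the most popular candidates differ from the key-entity result for all three mentions, which is the motivation for combining popularity with the coherence and context measures.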


In [7]:
%%writefile wsd.py 
"""Context-based disambiguation and also Learning-To-Rank combination
    of several features.
"""

from __future__ import division

from collections import Counter
import sys
from vsmcoherence import *
from sklearn.externals import joblib
#sys.path.insert(0,'..')

#from wikisim.calcsim import *
#from wsd.wsd import *
# My methods
#from senseembed_train_test.ipynb

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"


LTR_NROWS_S = 10000
LTR_NROWS_L = 50000
wsd_model_preprocessor_ = None
wsd_model_=None
def load_wsd_model(nrows):
    global wsd_model_preprocessor_, wsd_model_
    
    wsd_model_preprocessor_fn = os.path.join(MODELDIR, 'ltr_preprocessor.%s.pkl' %(nrows, ))
    if os.path.isfile(wsd_model_preprocessor_fn): 
        wsd_model_preprocessor_ = joblib.load(open(wsd_model_preprocessor_fn, 'rb'))    
        log("[load_wsd_model]\twsd_model_preprocessor file (%s) loaded" % (wsd_model_preprocessor_fn,))
    else:
        log("[load_wsd_model]\twsd_model_preprocessor file (%s) not found" % (wsd_model_preprocessor_fn,))


    wsd_model_fn_ = os.path.join(MODELDIR, 'ltr.%s.pkl'%(nrows,))
    if os.path.isfile(wsd_model_fn_): 
        wsd_model_ = joblib.load(open(wsd_model_fn_, 'rb'))    
        log("[load_wsd_model]\twsd_model file (%s) loaded" % (wsd_model_fn_,))
    else:
        log("[load_wsd_model]\twsd_model file (%s) not found" % (wsd_model_fn_,))


def get_context(anchor, eid, rows=50000):
    """Returns the context
       Inputs: 
           anchor: the anchor text
           eid: The id of the entity this anchor points to
           rows: maximum number of documents to return
       Output:
           The list of matching Solr documents (the window size is, I guess, 20)       
    """
    params={'wt':'json', 'rows':rows}
    anchor = solr_escape(anchor)
    
    q='anchor:"%s" AND entityid:%s' % (anchor, eid)
    params['q']=q
    
#     session = session.Session()
#     http_retries = Retry(total=20,
#                     backoff_factor=.1)
#     http = session.adapters.HTTPAdapter(max_retries=http_retries)
#     session.mount('http://localhost:8983/solr', http)
    
    r = session.get(qstr, params=params).json()
    if 'response' not in r: 
        log("[get_context]\t(terminating)\t%s" % (str(r),))
        sys.stdout.flush()
        os._exit(0)
        
    if not r:
        return []
    return r['response']['docs']

#from wsd
def word2vec_context_candidate_scores (S, M, candslist, ws=5):
    '''returns entity scores using the similarity with their context
       Inputs: 
           S: Sentence
           M: Mentions
           candslist: candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(cnm, pnm)]]
           ws: window size (number of words taken on each side of the mention)
       Returns:
           Scores [[s11,...s1k],...[sn1,...snm]] where sij is the similarity of cij to the word context of mention i
    '''
    
    candslist_scores=[]
    for i in range(len(candslist)):
        cands = candslist[i]
        pos = M[i][0]
        context = S[max(pos-ws,0):pos]+S[pos+1:pos+ws+1]
        context_vec = sp.zeros(getword2vec_model().vector_size)
        for c in context:
            context_vec += getword2vector(c).values
        cand_scores=[]

        for c in cands:
            try:
                # We have zero vectors, so this can raise an exception
                # or return None                
                cand_vector = getentity2vector(encode_entity(c[0],'word2vec', get_id=False))
                d = 1-sp.spatial.distance.cosine(context_vec, cand_vector);
            except:
                d=0                
            if np.isnan(d):
                d=0
            
            cand_scores.append(d)    
        candslist_scores.append(cand_scores) 

    return candslist_scores

#from wsd
def word2vec_context_disambiguate(S, M, candslist):
    '''Disambiguate a sentence using word-context similarity
       Inputs: 
           S: Sentence
           M: Mentions
           candslist: candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]
           
       Returns: 
           a list of entity ids and a list of titles
    '''
    
        
    candslist_scores = word2vec_context_candidate_scores (S, M, candslist)
                      
    # Pick the highest-scoring candidate for each mention
    true_entities = []
    for cands, cands_scores in zip(candslist, candslist_scores):
        max_index, max_value = max(enumerate(cands_scores), key= lambda x:x[1])
        true_entities.append(cands[max_index][0])

    titles = ids2title(true_entities)
    return true_entities, titles 



#from wikisim
def get_solr_count(s):
    """ Gets the number of documents in which the string occurs 
        NOTE: Multi words should be quoted
    Arg:
        s: the string (can contain AND, OR, ..)
    Returns:
        The number of documents
    """
    q='+text:(%s)'%(solr_escape(s),)
    qstr = 'http://localhost:8983/solr/enwiki20160305/select'
    params={'indent':'on', 'wt':'json', 'q':q, 'rows':0}
    r = session.get(qstr, params=params)
    D = r.json()['response']
    return D['numFound']



# Editing Ryan's code
def context_to_profile_sim(mention, context, candidates):
    """
    Description:
        Uses Solr to find the relevancy scores of the candidates based on the context.
    Args:
        mention: The mention as it appears in the text
        context: The words that surround the target word.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The score for each candidate in the same order as the candidates.
    """
    
    
    # put text in right format
    if not context:
        return [0]*len(candidates)
    context = solr_escape(context)
    mention = solr_escape(mention)
    
    filter_ids = " ".join(['id:' +  str(tid) for tid,_ in candidates])
        

    # select all the docs from Solr with the best scores, highest first.
    qst = 'http://localhost:8983/solr/enwiki20160305/select'
    #q='text:('+context+')^1 title:(' + mention+')^1.35'
    q='text:('+context+')'
    
    params={'fl':'id score', 'fq':filter_ids, 'indent':'on',
            'q':q, 'wt':'json','rows':len(candidates)}
    
    
    r = session.get(qst, params = params).json()['response']['docs']
    id_score_map=defaultdict(float, {long(ri['id']):ri['score'] for ri in r})
    id_score=[id_score_map[c] for c,_ in candidates]
    return id_score

# Important TODO
# This query is heavily skewed toward popularity; it would be better to replace spaces with AND.
# I don't like this implementation: instead of retrieving the documents and counting them here,
# it would be better to let Solr do the counting.
def context_to_context_sim(mention, context, candidates, rows=100):
    """
    Description:
        Uses the Solr context index to score the candidates: each candidate's score is the 
        number of retrieved contexts (at most `rows`) that belong to it.
    Args:
        mention: The mention as it appears in the text
        context: The words that surround the target word.
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
        rows: The maximum number of matching contexts to retrieve from Solr.
    Return:
        The score for each candidate in the same order as the candidates.
    """
    if not context:
        return [0]*len(candidates)
    
    # put text in right format
    context = solr_escape(context)
    mention = solr_escape(mention)
    
    filter_ids = " ".join(['entityid:' +  str(tid) for tid,_ in candidates])
    
    
    # select all the docs from Solr with the best scores, highest first.
    qstr = 'http://localhost:8983/solr/enwiki20160305_context/select'
    #q="_context_:(%s) entity:(%s)" % (context,mention)
    q="_context_:(%s) " % (context)
    
    params={'fl':'entityid', 'fq':filter_ids, 'indent':'on',
            'q':q,'wt':'json', 'rows':rows}
    r = session.get(qstr, params = params)
    cnt = Counter()
    
    for doc in r.json()['response']['docs']:
        cnt[long(doc['entityid'])] += 1
    
    id_score=[cnt[c] for c,_ in candidates]
    return id_score


def context_candidate_scores (S, M, candslist, ws=5, method='c2c', skip_current=1):
    '''Returns entity scores using context search
       Inputs: 
           S: Sentence
           M: Mentions
           candslist: candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]
           ws: window size (number of context words taken on each side of the mention)
           method: either 'c2p' for context to profile, or 'c2c' for context to context
           skip_current: 1 to exclude the current mention from the context, 0 to include it
       Returns:
           Scores [[s11,...s1k],...[sn1,...s1m]] where sij is the Solr-based context score of cij
    '''
    candslist_scores=[]
    for i in range(len(candslist)):
        cands = candslist[i]
        pos = M[i][0]
        mention=S[pos]
        context = S[max(pos-ws,0):pos]+S[pos+skip_current:pos+ws+1]
        context=" ".join(context)
        
        if method == 'c2p':
            cand_scores=context_to_profile_sim(mention, context, cands)
        elif method == 'c2c':
            cand_scores=context_to_context_sim(mention, context, cands)
        else:
            raise ValueError("unknown context method: %s" % (method,))
            
        candslist_scores.append(cand_scores) 

    return candslist_scores

def mention_to_title_sim(mention, candidates):
    """
    Description:
        Uses Solr to find the string similarity scores between the mention and the titles of the candidates.
    Args:
        mention: The mention as it appears in the text
        candidates: A list of candidates that each have the entity id and its frequency/popularity.
    Return:
        The score for each candidate in the same order as the candidates.
    """
    
    
    # put text in right format
    mention = solr_escape(mention)
    
    filter_ids = " ".join(['id:' +  str(tid) for tid,_ in candidates])
        

    # select all the docs from Solr with the best scores, highest first.
    qst = 'http://localhost:8983/solr/enwiki20160305/select'
    q='title:(' + mention+')'
    
    params={'fl':'id score', 'fq':filter_ids, 'indent':'on',
            'q':q, 'wt':'json','rows':len(candidates)}
    
    
    r = session.get(qst, params = params).json()['response']['docs']
    id_score_map=defaultdict(float, {long(ri['id']):ri['score'] for ri in r})
    id_score=[id_score_map[c] for c,_ in candidates]
    return id_score

def mention_candidate_score(S, M, candslist):
    return [mention_to_title_sim(S[m[0]], c) for m,c in zip(M,candslist) ]

def popularity_score(candslist):
    """Retrieves the popularity score from the candslist
    """
    scores=[[s for _, s in cands] for cands in candslist]
    return scores

def normalize(scores_list):
    """Normalize a matrix, row-wise
    """
    normalized_scoreslist=[]
    for scores in scores_list:
        smooth=0
        if 0 in scores:
            smooth=1
        sum_s = sum(s+smooth for s in scores )        
        n_scores = [float(s+smooth)/sum_s for s in scores]
        normalized_scoreslist.append(n_scores)
    return normalized_scoreslist
        
def normalize_minmax(scores_list):
    """Normalize a matrix, row-wise, using minmax technique
    """
    normalized_scoreslist=[]
    for scores in scores_list:
        scores_min = min(scores)        
        scores_max = max(scores)        
        if scores_min == scores_max:
            n_scores = [0]*len(scores)
        else:
            n_scores = [(float(s)-scores_min)/(scores_max-scores_min) for s in scores]
        normalized_scoreslist.append(n_scores)
    return normalized_scoreslist

def find_max(candslist,candslist_scores):
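    '''Disambiguate a sentence by picking the highest-scoring candidate for each mention
       Inputs: 
           candslist: candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]
           candslist_scores: scores [[s11,...s1k],...[sn1,...s1m]], in the same order as candslist
       Returns: 
           a list of entity ids and a list of titles
    '''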
            
    true_entities = []
    for cands, cands_scores in zip(candslist, candslist_scores):
        max_index, max_value = max(enumerate(cands_scores), key= lambda x:x[1])
        true_entities.append(cands[max_index][0])
    titles = ids2title(true_entities)
    return true_entities, titles        

#Delete, useless
def disambiguate_random(C):
    '''Disambiguate using the given order (which can be random)
        Input:
            C: Candlist
        Output:
            Disambiguated entities
    '''
    
    ids = [c[0][0] for c in C ]
    titles= ids2title(ids)
    return ids, titles

def get_scores(S, M, C, method):
    """ Score the candidates in C using the given scoring method 
        Inputs:
            S: Sentence
            M: Mentions
            C: Candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]
            method: scoring method, one of 'popularity', 'keydisamb', 'entitycontext', 
                    'mention2entity', 'context2context', 'context2profile' or 'learned'
        Returns:
            Scores [[s11,...s1k],...[sn1,...s1m]], or None if the method is not recognized
        
    """
    scores=None
    if method == 'popularity'  :
        scores = popularity_score(C)
    if method == 'keydisamb'  :
        scores = coherence_scores_driver(C, method='rvspagerank', direction=DIR_BOTH, op_method="keydisamb")
    if method == 'entitycontext'  :
        scores = coherence_scores_driver(C, method='rvspagerank', direction=DIR_BOTH, op_method="entitycontext")
    if method == 'mention2entity'  :
        scores = mention_candidate_score (S, M, C)
    if method == 'context2context'  :
        scores = context_candidate_scores (S, M, C, method='c2c')
    if method == 'context2profile'  :
        scores = context_candidate_scores (S, M, C, method='c2p')    
    if method == 'learned'  :
        scores = learned_scores (S, M, C)    
        
    #scores = normalize_minmax(scores)    
    return scores

def formated_scores(scores):
    """Only for pretty-printing
    """
    scores = [['{0:.2f}'.format(s) for s in cand_scores] for cand_scores in scores]
    return scores

def formated_all_scores(scores):
    """Only for pretty-printing
    """
    scores = [[tuple('{0:.2f}'.format(s) for s in sub_scores) for sub_scores in cand_scores] for cand_scores in scores]
    return scores

def get_all_scores(S, M, C):
    """Give all scores as different lists
        Inputs:
            S: segmented sentence [w1, ..., wn]
            M: mentions [m1, ... , mj]
            C: candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]

        Output:
            Scores, in this format [[(s111,.., s11t),...(s1k1,.., s1kt)],...[(sn11,.., sn1t),...(snm1,.., snmt)]]
            where sijk is the k-th score of candidate cij and t is the number of scoring methods
    """
    all_scores= [get_scores(S, M, C, method) for method in \
           ['popularity','keydisamb','entitycontext','mention2entity','context2context','context2profile']]
    return [zip(*s) for s in zip(*all_scores)]




def learned_scores (S, M, candslist):
    '''Returns entity scores using the learned (learning-to-rank) model
       Inputs: 
           S: Sentence
           M: Mentions
           candslist: candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]
       Returns:
           Scores [[s11,...s1k],...[sn1,...s1m]] where sij is the score predicted by the model for cij
    '''
    if (wsd_model_preprocessor_ is None) or (wsd_model_ is None):
        log('[learned_scores]\tmodel not loaded')
        raise Exception('model not loaded, try load_wsd_model()')
    
    all_scores=get_all_scores(S,M,candslist)
    return [wsd_model_.predict(wsd_model_preprocessor_.transform(cand_scores)) for cand_scores in all_scores] 

def wsd(S, M, C, method='learned'):
    '''Gets a sentence, mentions and candidate list, and returns the disambiguation
       Inputs: 
           S: Sentence
           M: Mentions
           C: candidate list [[(c11, p11),...(c1k, p1k)],...[(cn1, pn1),...(c1m, p1m)]]
           method: disambiguation method (see get_scores)
       Returns: 
           A disambiguated list in the form of (true_entities, titles)
    
    '''
    candslist_scores = get_scores(S, M, C, method)
    return find_max(C,candslist_scores)


Overwriting wsd.py
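
The reshaping done by `get_all_scores` and the selection done by `find_max` can be hard to follow from the code alone. The following is a minimal, self-contained sketch with made-up numbers (two scoring methods, two mentions). The entity ids, the scores and the helper name `argmax_candidates` are hypothetical, and neither Solr nor the trained model is involved.

```python
# Hypothetical scores for 2 mentions; mention 0 has 2 candidates, mention 1 has 3.
# Each scoring method returns one list of scores per mention.
popularity = [[0.9, 0.1], [0.5, 0.3, 0.2]]
context = [[0.2, 0.8], [0.1, 0.7, 0.2]]
all_scores = [popularity, context]

# zip(*all_scores) groups the per-method score lists by mention;
# the inner zip(*s) then groups the scores by candidate.
per_candidate = [list(zip(*s)) for s in zip(*all_scores)]
# per_candidate[i][j] is the tuple of all method scores for candidate j of mention i


def argmax_candidates(candslist, candslist_scores):
    """Return, for each mention, the id of its highest-scoring candidate."""
    chosen = []
    for cands, scores in zip(candslist, candslist_scores):
        best_index, _ = max(enumerate(scores), key=lambda x: x[1])
        chosen.append(cands[best_index][0])
    return chosen


# Hypothetical candidate list: (entity id, relative frequency)
C = [[(11, 0.9), (12, 0.1)], [(21, 0.5), (22, 0.3), (23, 0.2)]]
by_popularity = argmax_candidates(C, popularity)
by_context = argmax_candidates(C, context)
```

In this toy input the two methods disagree on both mentions, which is the situation the learned model is meant to arbitrate by looking at all scores of a candidate at once.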


# Genetrate Train Data Repository for WSD
Uses an already created dataset (wiki-mentions.30000.json), which contains 30000 Wikipedia paragraphs, to generate 
data for training the Learn2Rank model
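
As a rough illustration of the row layout written by the script below (entity id, query id, one column per score, then a 0/1 label), here is a small sketch that builds and parses one row. The helper names `make_row` and `parse_row` and all values are hypothetical; only the column order is taken from the script.

```python
def make_row(eid, qid, scores, is_true):
    """Serialize one candidate as: entity_id, query_id, score_1..score_n, label."""
    fields = [str(eid), str(qid)] + [str(s) for s in scores] + [str(int(is_true))]
    return "\t".join(fields)


def parse_row(line):
    """Inverse of make_row: returns (entity_id, query_id, scores, label)."""
    fields = line.rstrip("\n").split("\t")
    scores = [float(f) for f in fields[2:-1]]
    return int(fields[0]), int(fields[1]), scores, int(fields[-1])


# Hypothetical values: entity 12345, query 0, six scores, and it is the correct entity
row = make_row(12345, 0, [0.8, 0.1, 0.3, 2.5, 7, 1.2], True)
parsed = parse_row(row)
```

Note that the query id is incremented once per mention, not once per paragraph, so all candidates of one mention share the same query id.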

In [8]:
%%writefile gen_trainrep.py 
""" Create a train-set 
    entity_id, query_id, score1, score2, ..., scoren, true/false (is it a correct entity)
"""
from __future__ import division
from wsd import *

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 215, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"


sys.stdout.flush()

max_t=20
max_count=5000
#np.seterr(all='raise')

outdir = os.path.join(baseresdir, 'wsd')
outfile = os.path.join('../datasets/ner/trainrepository.%s.30000.tsv'%(max_count,))
if os.path.isfile(outfile): 
    sys.stderr.write(outfile + " already exists!\n")
    #sys.exit()

dsname = os.path.join('../datasets/ner/wiki-mentions.30000.json')

count = 0          
with open(dsname,'r') as ds, open(outfile,'w') as outf:
    qid=0
    for line in ds:                           
        js = json.loads(line.decode('utf-8').strip());
        S = js["text"]
        M = js["mentions"]
        count +=1        
        print "%s:\tS=%s\n\tM=%s" % (count, json.dumps(S, ensure_ascii=False).encode('utf-8'),json.dumps(M, ensure_ascii=False).encode('utf-8'))        
        C = generate_candidates(S, M, max_t=max_t, enforce=False)
        all_scores=get_all_scores(S,M,C)
        for i in  range(len(C)):
            m=M[i]
            cands = C[i]
            cand_scores = all_scores[i]
            wid = title2id(m[1]) 
            for (eid,_),scores in zip (cands, cand_scores):
                is_true_eid = (wid == eid)
                string_scores=[str(s) for s in scores]
                outf.write("\t".join([str(eid), str(qid)]+string_scores+[str(int(is_true_eid))])+"\n")
            qid += 1
        if count >= max_count:
            break
print "Done"             
        

        

Overwriting gen_trainrep.py


# Train the LTR Model

In [9]:
%%writefile train_ltr.py 
""" Train a LambdaMart (LTR) Method
"""
from __future__ import division
import pyltr
import pandas as pd
import os
from sklearn.preprocessing import MinMaxScaler
from sklearn.externals import joblib
from wsd import *

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 215, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"


#Columns = [entity_id, qid, score0, score1, ..., score5, label]
outdir = os.path.join(baseresdir, 'wikify')
tr_file_name = os.path.join('../datasets/ner/trainrepository.5000.30000.tsv')
nrows=50000
data=pd.read_table(tr_file_name, nrows=nrows, header=None)

# Can't shuffle straightforwardly: I should group by qid, then shuffle
# But I guess shuffling is done in the estimator
#data = data.sample(frac=1)


num_cols = len(data.columns)

grouped=data.groupby(1)
total_len=len(grouped)
group = grouped.filter(lambda x:x.iloc[0,1] >= 0 and x.iloc[0,1] < 0.6*total_len)

#Train Data
#The following line does the int-->float conversion, is it reliable? 
#Should I care later, while testing?
X_train = group.iloc[:,2:num_cols-1].as_matrix()

# Train the transformer and preprocess X_train
ltr_preprocessor = MinMaxScaler()
X_train=ltr_preprocessor.fit_transform(X_train)
ltr_preprocessor_fn = os.path.join('../model/tmp/ltr_preprocessor.%s.pkl' %(nrows,))
joblib.dump(ltr_preprocessor, open(ltr_preprocessor_fn, 'wb'))
####

y_train = group.iloc[:,num_cols-1].as_matrix()
qid_train = group.iloc[:,1].as_matrix()


#Validation Data
group=grouped.filter(lambda x:x.iloc[0,1] >= 0.6*total_len and x.iloc[0,1] < 0.8*total_len)
X_validate = group.iloc[:,2:num_cols-1].as_matrix()
X_validate = ltr_preprocessor.transform(X_validate)

y_validate = group.iloc[:,num_cols-1].as_matrix()
qid_validate = group.iloc[:,1].as_matrix()

#Test Data
group=grouped.filter(lambda x:x.iloc[0,1] >= 0.8*total_len and x.iloc[0,1] < 1.0*total_len)
X_test = group.iloc[:,2:num_cols-1].as_matrix()
X_test = ltr_preprocessor.transform(X_test)

y_test = group.iloc[:,num_cols-1].as_matrix()
qid_test = group.iloc[:,1].as_matrix()

monitor = pyltr.models.monitors.ValidationMonitor(
     X_validate, y_validate, qid_validate, metric=pyltr.metrics.NDCG(k=10), stop_after=250)
model = pyltr.models.LambdaMART(n_estimators=300, learning_rate=0.1, verbose = 1)
#lmart.fit(TX, TY, Tqid, monitor=monitor)
print "Training, sample_count: %s" % (nrows)

model.fit(X_train, y_train, qid_train, monitor=monitor)

metric = pyltr.metrics.NDCG(k=10)
Ts_pred = model.predict(X_test)
print 'Random ranking:', metric.calc_mean_random(qid_test, y_test)
print 'Our model:', metric.calc_mean(qid_test, y_test, Ts_pred)

model_file_name = os.path.join('../model/tmp/ltr.%s.pkl'%(nrows,))
joblib.dump(model, open(model_file_name, 'wb'))

print 'Model saved'

Overwriting train_ltr.py
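
The training script splits the data 60/20/20 by query id rather than by row, so that all candidates of one mention end up in the same partition. The following is a dependency-free sketch of that idea. The function name `split_by_query` and the toy rows are hypothetical, and it assumes, as the script does, that query ids run from 0 to n-1.

```python
def split_by_query(rows, train_frac=0.6, validate_frac=0.8):
    """Split rows into train/validation/test, keeping all rows of a query together.

    rows: list of (qid, features, label) tuples; qids are assumed to be 0..n-1.
    """
    n_queries = len(set(qid for qid, _, _ in rows))
    train, validate, test = [], [], []
    for row in rows:
        qid = row[0]
        if qid < train_frac * n_queries:
            train.append(row)
        elif qid < validate_frac * n_queries:
            validate.append(row)
        else:
            test.append(row)
    return train, validate, test


# Hypothetical data: 10 queries with 2 candidates each, the first one being correct
rows = [(qid, [0.1 * qid, 0.5], int(j == 0)) for qid in range(10) for j in range(2)]
train, validate, test = split_by_query(rows)
```

Splitting by row instead would leak candidates of the same mention into both the training and the test partitions, which makes ranking metrics such as NDCG hard to interpret.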


# Mention Detection

Contains our mention detection modules. We try two methods:
* CoreNLP
* Train an SVM on top of SolrTextTagger
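
Both methods end up producing the same (S, M) representation: a token list in which every detected mention is a single token, plus a list of mention positions. The sketch below shows the conversion from character spans to that representation. The function name `spans_to_sentence` and the plain `(start, end)` span format are simplifying assumptions made for illustration, not the tagger's actual response format.

```python
def spans_to_sentence(text, spans):
    """Convert character spans into the (S, M) format.

    text:  the original text
    spans: sorted, non-overlapping (start, end) character offsets of the mentions
           (a hypothetical, simplified input format)
    """
    S, M = [], []
    start = 0
    for begin, end in spans:
        S += text[start:begin].split()               # plain words before the mention
        M.append([len(S), 'UNKNOWN'])                # position in S; entity not known yet
        S.append(" ".join(text[begin:end].split()))  # the mention stays one token
        start = end
    S += text[start:].split()
    return S, M


text = "David was in Real Madrid last year"
spans = [(0, 5), (13, 24)]  # "David" and "Real Madrid"
S, M = spans_to_sentence(text, spans)
```

Keeping a multi-word mention as one element of S is what lets M refer to it with a single index.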

In [10]:
%%writefile mention_detection.py 
from __future__ import division
from sklearn.externals import joblib

from wsd import *

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 215, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"


#constants
CORE_NLP=0
LEARNED_MENTION=1

SVC_HP_NROWS_S, SVC_HP_CV_S = 10000,1
SVC_HR_NROWS_S, SVC_HR_CV_S = 10000,20


SVC_HP_NROWS_L, SVC_HP_CV_L = 50000,1
SVC_HR_NROWS_L, SVC_HR_CV_L = 50000,20


mention_model_preprocessor_=None
mention_model_=None

def load_mention_model(nrows, svc):
    global mention_model_preprocessor_, mention_model_
    mention_model_preprocessor_fn = os.path.join(MODELDIR, 'svc_preprocessor.%s.pkl' % (nrows,))
    if os.path.isfile(mention_model_preprocessor_fn): 
        log("[load_mention_model]\tmention_model_preprocessor file (%s) loaded" % (mention_model_preprocessor_fn,))
        mention_model_preprocessor_ = joblib.load(open(mention_model_preprocessor_fn, 'rb'))
    else:
        log("[load_mention_model]\tmention_model_preprocessor file (%s) not found" % (mention_model_preprocessor_fn,))


    mention_model_fn = os.path.join(MODELDIR, 'svc_mentions_unbalanced.%s.%s.pkl' % (nrows,svc))
    if os.path.isfile(mention_model_fn): 
        mention_model_ = joblib.load(open(mention_model_fn, 'rb'))    
        log("[load_mention_model]\tmention_model_ file (%s) loaded" % (mention_model_fn,))
    else:
        log("[load_mention_model]\tmention_model_ file (%s) not found" % (mention_model_fn,))
        

def tokenize_stanford(text):
    addr = 'http://localhost:9001'
    params={'annotators': 'tokenize', 'outputFormat': 'json'}
    r = session.post(addr, params=params, data=text.encode('utf-8'))    
    
    return [token['originalText'] for token in r.json()['tokens']]

def encode_solrtexttagger_result(text,tags):
    """ Convert the solrtexttagger output to our S,M format
        input:
            text: The original text
            tags: The result of the solrtexttagger
        output:
            S,M
            S: segmented sentence [w1, ..., wn]
            M: mentions [[m1, 'UNKNOWN'], ... , [mj, 'UNKNOWN']] where mi is an index of S
    """
    start=0
    termindex=0
    S=[]
    M=[]
    # pass 1, adjust partial mentions. 
    # approach one, expand (the other could be shrink)
    
    for tag in tags:
        assert text[tag[1]:tag[3]] == tag[5]
        seg = text[start:tag[1]]
        S += seg.strip().split()
        M.append([len(S),'UNKNOWN'])
        S += [" ".join(text[tag[1]:tag[3]].split())]
        start = tag[3]
        
    S += text[start:].strip().split()
    return S, M

def annotate_with_solrtagger(text):
    ''' Annotate a text using solrtexttagger
        Input: 
            text: The input text *must be unicode*
        Output:
            Annotated text
    '''
    addr = 'http://localhost:8983/solr/enwikianchors20160305/tag'
    params={'overlaps':'LONGEST_DOMINANT_RIGHT', 'tagsLimit':'5000', 'fl':'id','wt':'json','indent':'on','matchText':'true'}
    #text=solr_escape(text) Maybe not needed!
    r = session.post(addr, params=params, data=text.encode('utf-8'))    

    S,M = encode_solrtexttagger_result(text,r.json()['tags'])
    return S,M


def encode_corenlp_result(text,annotated):
    """ Convert the corenlp output to our S,M format
        input:
            text: The original text
            annotated: The result of corenlp
        output:
            S,M,P
            S: segmented sentence [w1, ..., wn]
            M: mentions [[m1, 'UNKNOWN'], ... , [mj, 'UNKNOWN']] where mi is an index of S
            P: [token, POS tag] pairs for all tokens
    """
    #****** Important ****
    #* The indices are not correct if it contains unicode, 
    #* in case you need to work with the indices, decode to utf-8
    #******
    S=[]
    M=[]
    P=[]
    # pass 1, adjust partial mentions. 
    # approach one, expand (the other could be shrink)
    
    for sentence in annotated['sentences']: 
        start=0
        
        for mention in sentence['entitymentions']:
            S += [token['originalText'] for token in sentence['tokens'][start:mention['tokenBegin']]]
            M.append([len(S),'UNKNOWN'])
            mentionstr = mention['text']
            S += [mentionstr]
            start = mention['tokenEnd']

        S += [token['originalText'] for token in sentence['tokens'][start:]]
        P += [[token['originalText'],token['pos']] for token in sentence['tokens']]
    return S, M, P

def annotate_with_corenlp(text):
    ''' Annotate a text using coreNLP
        Input: 
            text: The input text
        Output:
            Annotated text
    '''
    addr = 'http://localhost:9001'
    params={'annotators': 'entitymentions', 'outputFormat': 'json'}
    r = session.post(addr, params=params, data=text.encode('utf-8'))    

    
    S,M, P = encode_corenlp_result(text, r.json())
    return S,M,P

def solrtagger_pos(S,M,P):
    ''' Aligns the POS tags from corenlp to solrtagger's mentions
        Input:
            S: Sentence 
            M: Mentions
            P: [token, POS tag] pairs, from corenlp
        Output:
            Q: POS of solrtagger's mentions
    '''
    Q=[]
    j=0
    for i in range(len(M)):
        #m=tokenize_stanford(solr_unescape(S[M[i][0]]))  I skip escaping for now
        m=tokenize_stanford(S[M[i][0]]) 
        j_backup=j
        q=[]
        while j<len(P):
            if strsimilar(P[j][0], m[0])> .8:
                k=0
                while strsimilar(P[j][0], m[k])>0.8:
                    #q.append(P[j]) #good for debugging
                    q.append(P[j][1]) # keep only the POS tag
                    k=k+1
                    j=j+1
                    if j >= len(P) or k>=len(m):
                        break

                Q.append(" ".join(q))
                break
            j=j+1
        if not q:
            Q.append("OTHER")
            j=j_backup
    return Q

def get_mention_count(s):
    """
    Description:
        Returns the number of times that the given string appears as a mention in Wikipedia.
    Args:
        s: the mention string
    Return:
        The number of times the given string appears as a mention in Wikipedia
    """
    
    return sum(c for _,c in anchor2concept(s))  

def mention_prob(text):
    """
    Description:
        Returns the probability that the text is a mention in Wikipedia.
    Args:
        text: The candidate mention string
    Return:
        The probability that the text is a mention in Wikipedia.
    """
    
    total_mentions = get_mention_count(text)
    total_appearances = get_solr_count(text.replace(".", ""))
    if total_appearances == 0:
        return 0 # a mention never used probably is not a good link
    return float(total_mentions)/total_appearances

def get_mention_probs(S,M):
    return [mention_prob(S[m[0]]) for m in M]


def boil_down_candidate_score(scores_list):
    return [float(sum(scores))/len(scores) for scores in scores_list]

def mention_overlap(S1, M1, S2,M2):
    '''Calculates the overlap between two given sets of detected mentions
        Input:
            S1: Source Sentence
            M1: Source Mentions
            S2: Destination Sentence
            M2: Destination Mentions            
        Output: A 0/1 vector of length len(M1), each element shows whether M1[i] is also in M2
    '''
    is_detected = []
    for m1 in M1:
        found = 0
        for m2 in M2:
            if strsimilar(S1[m1[0]], S2[m2[0]])>0.8:
                found=1
        is_detected.append(found)
    return is_detected

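def detect_and_score_mentions(text, max_t=5):
    """Uses solrtagger to detect mentions, and scores them
        Inputs:
            text: Given text
            max_t: passed to generate_candidates when generating candidates for each mention
        Output:
            solr_S: segmented sentence
            solr_M: detected mentions
            Features, one tuple per mention: the average of each WSD score over the mention's 
            candidates, the mention probability, the overlap with corenlp (0/1) and the POS tags
    """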
    text = throw_unicodes(text)
    solr_S, solr_M = annotate_with_solrtagger(text)
    # max_t does not have to equal the number of candidates in wsd, it's just to 
    # get an average relevancy
    solr_C = generate_candidates(solr_S, solr_M, max_t=max_t, enforce=False)
    
    
    wsd_scores = [[sum(sc)/len(sc) for sc in get_scores(solr_S, solr_M, solr_C, method)] for method in \
               ['popularity','entitycontext','mention2entity','context2context','context2profile']]

    mention_scores=[]
    mention_scores.extend(wsd_scores)
    mention_scores.append(get_mention_probs(solr_S, solr_M))
    
    core_S, core_M, core_P = annotate_with_corenlp(text)
    overlap_with_corenlp = mention_overlap(solr_S, solr_M, core_S,core_M)
    mention_scores.append(overlap_with_corenlp)
    
    pos_list = solrtagger_pos(solr_S, solr_M,core_P)
    mention_scores.append(pos_list)
    
    return solr_S, solr_M, zip(*mention_scores)


def get_learned_mentions(text):
    if (mention_model_preprocessor_ is None) or (mention_model_ is None):
        log('[mention_models]\tmodel not loaded')
        raise Exception('model not loaded, try load_mention_model()')
        
    S_solr,M_solr,scores = detect_and_score_mentions(text)
    M_scores=[]
    for sc_vec in scores:
        # Unintuitive: When fitting, the first column was the mention_id, which was ignored!
        # And the preprocessor needs the exact column names!
        sc_frame = pd.DataFrame([sc_vec], columns=[str(i+1) for i in range(len(sc_vec))])
        X = mention_model_preprocessor_.transform(sc_frame)
        M_scores.append(mention_model_.predict(X))
    M = [m for m_s, m in zip(M_scores, M_solr) if m_s==1]
    return S_solr, M
    
def detect_mentions(text, mentionmethod=CORE_NLP):
    if mentionmethod == CORE_NLP:
        S, M, _ = annotate_with_corenlp(text)        
    elif mentionmethod == LEARNED_MENTION:
        S, M =  get_learned_mentions(text)
    else:
        raise ValueError("unknown mention detection method: %s" % (mentionmethod,))
    return S, M
    

Overwriting mention_detection.py
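
One of the features computed above is the mention probability: how often a string is used as a link anchor relative to how often it occurs at all. The sketch below reproduces only the arithmetic of `mention_prob`. The function name `link_probability` and the counts are made up, whereas the real module obtains them from the anchor table and from Solr.

```python
def link_probability(mention_count, document_count):
    """Ratio of anchor usages of a string to the number of documents containing it."""
    if document_count == 0:
        return 0.0  # a string that never occurs is probably not a good link
    return float(mention_count) / document_count


# Hypothetical counts: (times used as anchor text, documents containing the string)
counts = {"Real Madrid": (9000, 12000), "the": (50, 5000000), "xyzzy": (0, 0)}
probs = {s: link_probability(m, d) for s, (m, d) in counts.items()}
```

Frequent function words get a probability close to zero, which helps filter them out even when the tagger matches them. Since the numerator appears to count anchor occurrences while the denominator counts documents, the ratio may exceed 1 in practice.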


# Mention Detection Test

In [11]:
from mention_detection import *
text = "I want to Brazil to visit Romario, but David was in Real Madrid and I couldn't eat Kebab"
S, M = detect_mentions(text, mentionmethod = CORE_NLP)
for m in M:
    print S[m[0]]

Brazil
Romario
David
Real Madrid
Kebab


# Generate Train Data Repository For Mention Detection

Uses an already created dataset (wiki-mentions.30000.json), which contains 30000 Wikipedia paragraphs, to generate 
data for training the SVM model for mention detection
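
The label of each training row says whether a mention found by the tagger also appears among the gold mentions of the paragraph, using fuzzy string matching with a 0.8 threshold. The following sketch illustrates that labelling step. It uses `difflib.SequenceMatcher` as a stand-in for the project's `strsimilar` helper, which is an assumption, and the sentences and the function name `label_mentions` are hypothetical.

```python
from difflib import SequenceMatcher


def similar(a, b):
    """Stand-in for strsimilar (an assumption): similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def label_mentions(S1, M1, S2, M2, threshold=0.8):
    """For each mention in M1, 1 if a similar enough mention exists in M2, else 0."""
    gold = [S2[m[0]] for m in M2]
    return [int(any(similar(S1[m[0]], g) > threshold for g in gold)) for m in M1]


# Mentions detected by the tagger (hypothetical output)
S_detected = ["David", "was", "in", "Real Madrid", "last year"]
M_detected = [[0, "UNKNOWN"], [3, "UNKNOWN"], [4, "UNKNOWN"]]
# Gold mentions from the dataset (hypothetical)
S_gold = ["David", "was", "in", "Real Madrid", "last", "year"]
M_gold = [[3, "Real_Madrid_C.F."]]
labels = label_mentions(S_detected, M_detected, S_gold, M_gold)
```

Only the detected mention that matches a gold mention is labelled 1; the others become negative examples for the classifier.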

In [12]:
%%writefile gen_trainrep_for_mention.py 
""" Create a train-set 
    mention_id, score1, score2, ..., scoren, true/false (is it a correct mention)
"""
from __future__ import division
from mention_detection import *

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 215, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"



max_count=5000
skip_lines=0

outdir = os.path.join(baseresdir, 'wsd')
outfile = os.path.join('../datasets/ner/mentiontrainrepository.%s.30000.tsv'%(max_count,))
if os.path.isfile(outfile): 
    sys.stderr.write(outfile + " already exists!\n")
    sys.exit()

dsname = os.path.join('../datasets/ner/wiki-mentions.30000.json')

count = 0  
mention_id = 0
with open(dsname,'r') as ds, open(outfile,'w') as outf:
    for line in ds:                           
        count +=1  
        if count <= skip_lines:
            continue
            
        js = json.loads(line.decode('utf-8').strip());
        S = js["text"]
        M = js["mentions"]
        text= " ".join(S)
        print "%s:\tS=%s\n\tM=%s\ttext=%s" % (count, json.dumps(S, ensure_ascii=False).encode('utf-8'),json.dumps(M, ensure_ascii=False).encode('utf-8'),text.encode('utf-8'))        
        
        solr_S, solr_M, scores = detect_and_score_mentions(text)
        correct_mention = mention_overlap(solr_S, solr_M, S, M)
        for i in  range(len(solr_M)):
            string_scores=[str(s) for s in scores[i]]
            outf.write("\t".join([str(mention_id)] + string_scores+[str(correct_mention[i])])+"\n")
            mention_id += 1
        if count >= max_count:
            break
print "Done"             
        

        

Overwriting gen_trainrep_for_mention.py


# Train SVC Model for Mention Detection

In [13]:
%%writefile train_svc.py 
''' Trains an SVC for mention detection
'''
from mention_detection import *

import numpy as np
import os
import pandas as pd
import sklearn
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MinMaxScaler
from sklearn_pandas import gen_features
from sklearn.externals import joblib

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 215, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"

def downsample_negatives(X_train, y_train, frac=0.2):

    pos_index = y_train==1
    X_train_pos = X_train[pos_index,:]
    y_train_pos = y_train[pos_index]

    neg_index = y_train==0
    X_train_neg = X_train[neg_index,:]
    y_train_neg = y_train[neg_index]


    X_train_neg, y_train_neg = sklearn.utils.resample(X_train_neg, y_train_neg, 
                                                n_samples = int(frac*len(X_train_neg)), replace=False)    

    X_train_downsampled = np.vstack([X_train_pos, X_train_neg])
    y_train_downsampled = np.hstack([y_train_pos, y_train_neg])

    X_train_downsampled_shuffled, y_train_downsampled_shuffled = sklearn.utils.shuffle(X_train_downsampled, y_train_downsampled)
    return X_train_downsampled_shuffled, y_train_downsampled_shuffled

home = '/users/grad/sajadi'

tr_file_name = os.path.join('../datasets/ner/mentiontrainrepository.5000.30000.tsv')
pos_col=['8']
nrows=50000
data=pd.read_table(tr_file_name, header=None, nrows=nrows)
data.columns = [str(c) for c in data.columns]

# Shuffle, Shuffle and Shuffle!
data = data.sample(frac=1)


num_cols = len(data.columns)
X  = data.iloc[:,1:num_cols-1]
y  = data.iloc[:,num_cols-1]

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=42)

# Preprocess X_train: scale the first seven feature columns to [0, 1]
# and binarize the categorical column
feature_def = gen_features(
    columns=[[c] for c in X_train.columns[:7]],
    classes=[MinMaxScaler]
)

feature_def += ((pos_col, [LabelBinarizer()]),)

svc_preprocessor = DataFrameMapper(feature_def)
X_train = svc_preprocessor.fit_transform(X_train)
# Persist the fitted preprocessor so the same transformation can be reused later.
# Passing the file name lets joblib open and close the file itself.
svc_preprocessor_fn = '../model/tmp/svc_preprocessor.%s.pkl' % (nrows,)
joblib.dump(svc_preprocessor, svc_preprocessor_fn)
X_test = svc_preprocessor.transform(X_test)

# Downsampling the negatives did not improve the results, so it is disabled
#X_train, y_train = downsample_negatives(X_train, y_train)

# cv is the weight given to the positive class (1), not a cross-validation fold count
for cv in [1, 10, 20]:
    print "Training, sample_count: %s\tcv:%s" % (nrows, cv)
    clf = svm.SVC(kernel='linear', class_weight={1: cv})
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # (precision, recall, f-score, support) for the positive class
    measures = metrics.precision_recall_fscore_support(y_test, y_pred, average='binary')
    model_file_name = '../model/tmp/svc_mentions_unbalanced.%s.%s.pkl' % (nrows, cv)
    joblib.dump(clf, model_file_name)
    print "measures: ", measures
    sys.stdout.flush()

print 'Models saved'

Overwriting train_svc.py
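
The `measures` tuple printed by the training loop comes from `metrics.precision_recall_fscore_support` with `average='binary'`, so it reports precision, recall and F-score for the positive class only (the support element is `None` in that mode). As a rough illustration of what those numbers mean, and of why a heavier weight on class 1 tends to trade precision for recall, here is a minimal pure-Python sketch. The helper `binary_prf` and the toy labels are hypothetical and not part of the Wikisim code.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # float() keeps the division exact under Python 2 as well
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels (made up): 1 = "is a mention", 0 = "is not a mention"
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
# A cautious classifier predicts few positives ...
y_cautious = [1, 0, 0, 0, 0, 0, 0, 0]
# ... while one that favours the positive class predicts more of them
y_eager = [1, 1, 1, 1, 1, 0, 0, 0]

print("cautious: %s" % (binary_prf(y_true, y_cautious),))
print("eager:    %s" % (binary_prf(y_true, y_eager),))
```

In this toy case the cautious predictions have perfect precision but miss two of the three mentions, while the eager ones recover every mention at the cost of two false positives.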


# Wikification API

In [14]:
%%writefile wikify.py 
from __future__ import division
from mention_detection import *

__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__credits__ = ["Armin Sajadi"]
__license__ = "GPL"
__version__ = "1.0.1"
__maintainer__ = "Armin Sajadi"
__email__ = "sajadi@cs.dal.ca"
__status__ = "Development"

HIGH_PREC_SMALL = 1
HIGH_REC_SMALL = 2
HIGH_PREC_LARGE = 3
HIGH_REC_LARGE = 4
 

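def get_wikifify_params(opt):
    ''' Map a preset (high precision/recall, small/large training set) to
        the parameters of the trained models.
        Output:
            (svc_nrows, svc_cv, ltr_nrows)
    '''
    if opt == HIGH_PREC_SMALL:
        return SVC_HP_NROWS_S, SVC_HP_CV_S, LTR_NROWS_S

    if opt == HIGH_REC_SMALL:
        return SVC_HR_NROWS_S, SVC_HR_CV_S, LTR_NROWS_S

    if opt == HIGH_PREC_LARGE:
        return SVC_HP_NROWS_L, SVC_HP_CV_L, LTR_NROWS_L

    if opt == HIGH_REC_LARGE:
        return SVC_HR_NROWS_L, SVC_HR_CV_L, LTR_NROWS_L

    # Fail loudly instead of silently returning None
    raise ValueError("Unknown wikification option: %s" % (opt,))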


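def wikify_string(line, mentionmethod=CORE_NLP, max_t=20):
    ''' Detect the mentions in a string and disambiguate them
        Output:
            S, M: the segmented sentence and its mentions, where each
            mention's entity is filled in by the learned WSD model
    '''
    if not isinstance(line, unicode):
        line = line.decode('utf-8')

    S, M = detect_mentions(line, mentionmethod)
    C = generate_candidates(S, M, max_t=max_t, enforce=False)
    E = wsd(S, M, C, method='learned')
    # E[1] holds one selected entity per mention, in the order of M
    for m, e in zip(M, E[1]):
        m[1] = e
    return S, M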

def wikify_a_line(line, mentionmethod=CORE_NLP):
    ''' Annotate a single line
        Input:
            line: The given string
            mentionmethod: The mention detection method
        Output:
            Annotated sentence in which the mentions are hyperlinked to the Wikipedia concepts
    '''
    S, M = wikify_string(line, mentionmethod) 
    for m in M: 
        S[m[0]]="<a href=https://en.wikipedia.org/wiki/%s>%s</a>"  % (m[1], S[m[0]])
    S_reconcat = " ".join(S)
    return S_reconcat
            
def wikify_api(text, mentionmethod=CORE_NLP):
    ''' Annotate a multi-line text; the annotated lines are joined by <br> '''
    outlist = []
    for line in text.splitlines():
        outlist.append(wikify_a_line(line, mentionmethod))
    return "<br>".join(outlist).decode('utf-8')

def wikify_from_file_api(infilename, outfilename, mentionmethod=CORE_NLP):
    ''' Annotate infilename line by line and write the result to outfilename '''
    with open(infilename) as infile, open(outfilename, 'w') as outfile:
        for line in infile:
            wikified = wikify_api(line, mentionmethod)
            outfile.write(wikified + "\n")

            

Overwriting wikify.py
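
`wikify_a_line` relies on the `S`/`M` convention from the terminology section: `S` is the list of segments and each entry of `M` is a `[segment_index, entity]` pair. The stand-alone sketch below reproduces only the rendering step on hand-written data, so it runs without the trained models. `render_links` and `WIKI_URL` are hypothetical names. Unlike the original, the sketch quotes the `href` attribute, escapes the segment text, and leaves the input list unmodified; these are assumptions about what a browser-facing variant might want, not the behaviour of `wikify.py`.

```python
try:
    from html import escape   # Python 3
except ImportError:
    from cgi import escape    # Python 2 fallback

WIKI_URL = "https://en.wikipedia.org/wiki/"  # assumed base URL, as in wikify_a_line


def render_links(S, M):
    """Join the segments of S with spaces, hyperlinking the mentions listed in M."""
    out = list(S)  # copy, so the caller's S is left untouched
    for index, entity in M:
        out[index] = '<a href="%s%s">%s</a>' % (WIKI_URL, entity, escape(S[index]))
    return " ".join(out)


# Hand-written example in the shape produced by wikify_string
S = ['Lee Jun-fan', '()', 'known professionally', 'as', 'Bruce Lee']
M = [[0, 'Bruce_Lee'], [4, 'Bruce_Lee']]
print(render_links(S, M))
```

Segments that are not referenced by `M` pass through unchanged, which is why the empty `()` segment survives in the rendered line.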


# Testing Wikification

In [15]:
from wikify import *

svc_nrows, svc_cv, ltr_nrows = get_wikifify_params(HIGH_REC_LARGE)
load_mention_model(svc_nrows, svc_cv)

load_wsd_model(ltr_nrows)

# Alternative inputs tried during testing; uncomment one to use it
# text = "Lee Jun-fan known professionally as Bruce Lee"
# text = "I like Hall & Oats and Bruce Lee"
# text = 'I like Charles "Lucky" Luciano'
# text = 'I hate Senate of Serampore College (University) and Joze'
# text = "I went to US to eat  Kebab"
# text = " ".join(["Three", "of", "the", "greatest", "guitarists", "started", "their", "career", "in", "a", "single", "band", ":", "Clapton", ",", "Beck", ",", "and", "Page", "."])

text = "Lee Jun-fan (截拳道) known professionally as Bruce Lee"
text = text.decode('utf-8')
S1,M1 = detect_mentions(text, mentionmethod=LEARNED_MENTION)      
print S1
print M1
S2,M2 = wikify_string(text, mentionmethod=LEARNED_MENTION)
print S2
print M2
print wikify_api (text, LEARNED_MENTION)

['Lee Jun-fan', '()', 'known professionally', 'as', 'Bruce Lee']
[[0, 'UNKNOWN'], [4, 'UNKNOWN']]
['Lee Jun-fan', '()', 'known professionally', 'as', 'Bruce Lee']
[[0, 'Bruce_Lee'], [4, 'Bruce_Lee']]
<a href=https://en.wikipedia.org/wiki/Bruce_Lee>Lee Jun-fan</a> () known professionally as <a href=https://en.wikipedia.org/wiki/Bruce_Lee>Bruce Lee</a>
