# M-WePNaD task

In this notebook we'll take a first look at the data for the M-WePNaD task; we'll sample ten random training queries and develop a simple preprocessing pipeline.

In [1]:
from itertools import islice
import nltk
import numpy as np
import os
import pandas as pd
import re
import sys
import xml.etree.ElementTree as ET

In [2]:
# check how np is configured (do we have fast linear algebra?)
# NB: it's not that easy to find just how to interpret this output.
np.show_config()

blas_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/richard/anaconda/envs/mwepnad/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/richard/anaconda/envs/mwepnad/include']
blas_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/richard/anaconda/envs/mwepnad/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/richard/anaconda/envs/mwepnad/include']
lapack_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['/Users/richard/anaconda/envs/mwepnad/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/richard/anaconda/envs/mwepnad/include']
lapack_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthr

In [3]:
training_data_dir = os.path.join('..', '..', 'data', 'training_data', 'MWePNaDTraining')
exp_dir = os.path.join('..','..','exp')
src_dir = os.path.join('..')

In [4]:
def log(msg):
    print(msg)

In [5]:
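prng = np.random.RandomState(seed=42)
# NB: choice() samples with replacement by default, so the ten queries may
# contain duplicates; pass replace=False to draw ten distinct queries.
ten_queries = prng.choice(os.listdir(training_data_dir), 10)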

# A first peek into some of the data

In [6]:
walk_first_query = os.walk(os.path.join(training_data_dir, ten_queries[0]))

In [7]:
[i for i in islice(walk_first_query, 1, 3)]

[('../../data/training_data/MWePNaDTraining/paul_erhlich/001',
  [],
  ['001.txt', 'metadata.xml', 'SR001.htm']),
 ('../../data/training_data/MWePNaDTraining/paul_erhlich/002',
  [],
  ['002.txt', 'metadata.xml', 'SR002.htm'])]

In [8]:
first_query_dir = os.path.join(training_data_dir, ten_queries[0])
first_metadata_path = os.path.join(first_query_dir, os.listdir(first_query_dir)[0], 'metadata.xml')

In [9]:
# use a context manager so the file handle is closed again
with open(first_metadata_path) as f:
    print(f.read())

<?xml version="1.0" encoding="UTF-8"?>
<tns:Annotation_Corpus xmlns:tns="http://www.example.org/metadata-corpus" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.example.org/metadata-corpus metadata-corpus.xsd">
<tns:url>http://es.wikipedia.org/wiki/Paul_Ehrlich</tns:url>
<tns:language>ES</tns:language>
<tns:downloadDate>2013-07-05</tns:downloadDate>
<tns:annotator>Miguel Bernabé</tns:annotator>
</tns:Annotation_Corpus>


In [10]:
tree = ET.parse(first_metadata_path)
root = tree.getroot()

In [11]:
[i for i in root]

[<Element '{http://www.example.org/metadata-corpus}url' at 0x117b899f8>,
 <Element '{http://www.example.org/metadata-corpus}language' at 0x117b89a98>,
 <Element '{http://www.example.org/metadata-corpus}downloadDate' at 0x117b89ae8>,
 <Element '{http://www.example.org/metadata-corpus}annotator' at 0x117b89b38>]

In [12]:
root.tag

'{http://www.example.org/metadata-corpus}Annotation_Corpus'

In [13]:
language_e = root.find('{http://www.example.org/metadata-corpus}language')

In [14]:
root.find('{http://www.example.org/metadata-corpus}url').text

'http://es.wikipedia.org/wiki/Paul_Ehrlich'

In [15]:
language_e.text

'ES'
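Spelling out the full `{namespace}tag` string for every lookup gets verbose. As a minimal sketch, `find` and `findtext` also accept a prefix-to-URI mapping. The inline `SAMPLE_XML` string below is a shortened stand-in for the `metadata.xml` printed above, and the names `NS`, `sample_root`, `sample_language` and `sample_url` are only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the metadata.xml file shown above
SAMPLE_XML = """<tns:Annotation_Corpus xmlns:tns="http://www.example.org/metadata-corpus">
<tns:url>http://es.wikipedia.org/wiki/Paul_Ehrlich</tns:url>
<tns:language>ES</tns:language>
</tns:Annotation_Corpus>
"""

# Prefix-to-URI mapping; the prefix is our own choice, only the URI has to match
NS = {'tns': 'http://www.example.org/metadata-corpus'}

sample_root = ET.fromstring(SAMPLE_XML)
sample_language = sample_root.find('tns:language', NS).text
sample_url = sample_root.findtext('tns:url', namespaces=NS)
print(sample_language, sample_url)
```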

# Some useful ad-hoc functions and classes

In [16]:
# NB: fix annoying spelling mistake in gold standard and dirname in training corpus
def correct_paul_erhlich(query):
    return 'paul ehrlich' if query == 'paul erhlich' else query

In [17]:
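def load_gold_standard():
    # Each line holds three tab-separated fields: (query, doc id, cluster label)
    path = os.path.join(training_data_dir, '..', 'GoldStandardTraining.txt')
    try:
        f = open(path)
    except IOError as e:
        # Do not fail silently: callers iterate over the returned set
        log('Could not open gold standard: {}'.format(e))
        return frozenset()
    with f:
        return frozenset((correct_paul_erhlich(q), d, c) for q, d, c in (
                tuple(l.rstrip('\n').split('\t')) for l in f))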

In [18]:
gold_standard = load_gold_standard()

In [19]:
class Document:
    """
    Parse document dir

    Load data and metadata in memory

    Just meant as a minimal parse of the files on disk, not meant to hold many features.
    """
    XMLNS = '{http://www.example.org/metadata-corpus}'
    def __init__(self, dir_path, file_paths):
        self.path = dir_path
        self.file_paths = file_paths
        self._parse_metadata()
        self._get_text()
        
    def _parse_metadata(self):
        tree = ET.parse(os.path.join(self.path, 'metadata.xml'))
        root = tree.getroot()
        self.url_ = root.find('{}url'.format(Document.XMLNS)).text
        self.languages = frozenset(root.find('{}language'.format(Document.XMLNS))
                                   .text.strip().split(','))
         
    def _get_text(self):
        ids = [re.sub(r'\.txt$', '', p) for p in self.file_paths if p.endswith('.txt')]
        assert len(ids) == 1
        self.id = ids[0]
        with open(os.path.join(self.path,'{}.txt'.format(self.id))) as f:
            self.text = f.read()

In [20]:
class Query:
    """
    Parse query dir

    Load data and metadata in memory
    """
    def __init__(self, path):
        self.path = path
        self.query = correct_paul_erhlich(os.path.basename(path).replace('_', ' '))
        try:
            self._load()
        except Exception as e:
            log('Exception: {}'.format(e))
            raise ValueError('Could not parse everything in query dir')
    
    def _load(self):
        self.docs = []
        walker = os.walk(self.path)
        for path, _, files in walker:
            if 'metadata.xml' in files:
                self.docs.append(Document(path, files))

# A closer look at our first query

In [21]:
first_query = Query(first_query_dir)

In [22]:
first_query.path

'../../data/training_data/MWePNaDTraining/paul_erhlich'

In [23]:
first_query.query

'paul ehrlich'

In [24]:
s_docs_first_query = pd.Series([doc.languages for doc in first_query.docs])

In [25]:
s_docs_first_query.unique()

array([frozenset({'ES'}), frozenset({'EN'}), frozenset({'ES', 'EN'}),
       frozenset({'DE'}), frozenset({'DE', 'EN'})], dtype=object)

In [26]:
print(re.sub(r'\s+', ' ', first_query.docs[11].text.lower())[:250])

paul r. ehrlich (paulrehrlich) en twitter twitter consulta de búsqueda buscar cuenta verificada @ idioma: español bahasa indonesia bahasa melayu dansk deutsch english englishuk euskara filipino galego italiano lolcatz magyar nederlands norsk polski p


In [27]:
s_first_query_urls = pd.Series([doc.url_ for doc in first_query.docs])

In [28]:
s_first_query_urls.sample(5, random_state=prng)

18    http://www.biography.com/people/paul-ehrlich-9...
45         http://www.allergyasthmanyc.com/bio_paul.php
47    http://www.patheos.com/blogs/godandthemachine/...
89         http://en.wikiquote.org/wiki/Paul_R._Ehrlich
4     http://www.stanford.edu/group/CCB/cgi-bin/ccb/...
dtype: object

# Features for classification into 'NR' and 'relevant'

In [29]:
def compute_n_exact_name_matches(query, doc):
    """
    Compute a feature: the number of exact 'first_name last_name' matches in doc.text

    Accepts a Query and a Document object (in that order).
    Normally you would use this function with a Document that is in query.docs.

    Returns a non-negative integer
    """
    # NB: assumes the query consists of exactly two space-separated tokens
    first_name, last_name = query.query.split(' ')
    re_ = re.compile(re.escape(first_name) + r'\s+' + re.escape(last_name))
    # TODO: use URL, too
    return len(re_.findall(doc.text.lower()))

In [30]:
def compute_n_name_matches_with_optional_word_or_initial_in_between(query, doc):
    """
    Compute a feature: the number of name matches in doc.text, allowing at most
    one word or initial between first and last name (exact matches count, too)

    Accepts a Query and a Document object (in that order).
    Normally you would use this function with a Document that is in query.docs.

    Returns a non-negative integer
    """
    # NB: assumes the query consists of exactly two space-separated tokens
    first_name, last_name = query.query.split(' ')
    re_ = re.compile(re.escape(first_name) + r'\s+\w*\.?\s*' + re.escape(last_name))
    # TODO: use URL, too
    # TODO: compute other features, e.g., 1 / (1 + number_of_chars_between_first_and_last_name)
    # TODO: possibly preprocess and tokenise text / URL prior to computing features
    return len(re_.findall(doc.text.lower()))

In [31]:
compute_n_exact_name_matches(first_query, first_query.docs[11])

0

In [32]:
compute_n_name_matches_with_optional_word_or_initial_in_between(first_query, first_query.docs[11])

23
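Two of the TODOs above (a proximity-based feature and using the URL) could look roughly like the sketch below. The function names, the `max_gap` cut-off and the choice to take the best match per page are assumptions for illustration, not part of the pipeline in this notebook.

```python
import re

def name_proximity_score(first_name, last_name, text, max_gap=40):
    """Return the best 1 / (1 + gap) over all name matches, or 0.0 without a match.

    gap is the number of characters between first and last name, not counting
    the whitespace around them. max_gap is an arbitrary cut-off.
    """
    pattern = re.compile(
        re.escape(first_name) + r'\s+(.{0,' + str(max_gap) + r'}?)\s*'
        + re.escape(last_name),
        re.DOTALL)
    gaps = [len(m.group(1)) for m in pattern.finditer(text.lower())]
    return max((1.0 / (1 + gap) for gap in gaps), default=0.0)

def name_in_url(first_name, last_name, url):
    """True if first and last name occur in the URL, separated only by punctuation."""
    pattern = re.escape(first_name) + r'[\W_]+' + re.escape(last_name)
    return re.search(pattern, url.lower()) is not None

exact_score = name_proximity_score('paul', 'ehrlich', 'Paul Ehrlich was born in 1854')
initial_score = name_proximity_score('paul', 'ehrlich', 'Paul R. Ehrlich (paulrehrlich)')
print(exact_score, initial_score)
```

An exact mention scores 1.0, and every character between the two names lowers the score, so 'paul r. ehrlich' scores 1/3.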

Now, let's get an idea of how well these two features correlate with NR in the ground truth for our ten randomly sampled development queries. Or rather, to start with, just for our 'first_query'.

In [33]:
s_feature = pd.Series({
        doc.id : compute_n_name_matches_with_optional_word_or_initial_in_between(
                first_query, doc) for doc in first_query.docs})

In [34]:
s_gold_standard = pd.Series({a[1] : 1 if a[2] == 'NR' else 0 for a in gold_standard if a[0] == first_query.query})

In [35]:
# NB: the join_axes argument was removed in pandas 1.0; reindex on the gold standard index instead
df = pd.concat([s_gold_standard, s_feature], axis=1, keys=['NR', 'x']).reindex(s_gold_standard.index)

In [36]:
df.shape[0]

99

In [37]:
df.groupby('NR').agg('mean')

            x
NR
0    9.293478
1   12.428571


Interestingly, the mean number of name matches is higher for pages annotated as NR (12.4) than for the other pages (9.3). What reasons would annotators have for marking a page as NR? Let's look into some of the pages that do contain name matches but are annotated as 'NR' (not relevant).

In [38]:
df.loc[df['NR'] == 1, :]

    NR   x
9    1   0
22   1   5
24   1  43
27   1   3
29   1  28
33   1   3
66   1   5
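With only seven NR pages, the group mean is sensitive to a few pages with many matches. As a quick check, the sketch below recomputes the mean next to the median for the seven counts listed above; the variable names are only for illustration.

```python
from statistics import mean, median

# Match counts x for the seven pages annotated as NR, copied from the table above
nr_counts = [0, 5, 43, 3, 28, 3, 5]

nr_mean = mean(nr_counts)
nr_median = median(nr_counts)
# Share of all NR matches that comes from the two pages with the most matches
top_two_share = sum(sorted(nr_counts)[-2:]) / sum(nr_counts)
print(round(nr_mean, 2), nr_median, round(top_two_share, 2))
```

The median is much lower than the mean, and two pages account for most of the matches, so the higher NR mean should not be over-interpreted.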


In [39]:
print(re.sub(r'\n\n+', '\n\n', [doc for doc in first_query.docs if doc.id == '029'][0].text)[:500])

Perfiles: Paul Ehrlich | LinkedIn

	Inicio
	¿Qué es LinkedIn?
	Únete hoy
	Inicia sesión

Búsqueda por nombre
	
Nombre

	
Apellidos

Paul Ehrlich

	
25 de 42 perfiles
| Ver todos los perfiles en LinkedIn »

	
	
Ver el perfil completo

	

Paul
Ehrlich

	Cargo
	Chief Medical Officer, Cerner Corporation
	Información demográfica
	

Kansas City y alrededores, Missouri, Estados Unidos

 | 

Atención sanitaria y hospitalaria

	Actual:
	

Chief Medical Officer at Cerner Corporation

	Anterior:
	
VP and C


The above page seems to be a LinkedIn listing of public profiles that came up for the search 'Paul Ehrlich'.

In [40]:
print(re.sub(r'\n\n+', '\n\n', [doc for doc in first_query.docs if doc.id == '024'][0].text)[:500])

Paul Ehrlich MedChem Euro-PhD

Paul
Ehrlich European Medicinal Chemistry Ph.D. Network

The Doctorate Course in Pharmaceutical Chemistry at the University of Vienna is part of a European network, recently formed, which has the aim
 of  fostering the education and research training of
post-graduate students in Medicinal Chemistry towards PhD degree. In
particular the aim of the Paul Ehrlich MedChem Euro-PhD Network is to
provide an in-depth research training and mobility of PhD students in
the ar


The above page seems to refer mainly to an event / network that is named after one Paul Ehrlich.

# Some ideas for this task

As a first step, a nice experiment would be to see if we can correctly classify pages as 'NR' (binary classification). Because of the way this task will be evaluated, it is a good idea to put all of these pages together in a single cluster. We could engineer some features (see ideas in the TODOs above) and then train a simple classifier.
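As a rough illustration of "a simple classifier", the sketch below fits a decision stump (a single threshold on a single feature) in plain Python. The toy values and labels are made up and do not come from the corpus; `fit_stump` and `predict_stump` are hypothetical helper names.

```python
def fit_stump(values, labels):
    """Return (threshold, predict_nr_if_low, accuracy) with the best training accuracy.

    The rule predicts NR (label 1) when (value <= threshold) == predict_nr_if_low.
    """
    best = None
    for threshold in sorted(set(values)):
        for predict_nr_if_low in (True, False):
            predictions = [int((v <= threshold) == predict_nr_if_low) for v in values]
            accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
            if best is None or accuracy > best[2]:
                best = (threshold, predict_nr_if_low, accuracy)
    return best

def predict_stump(stump, value):
    threshold, predict_nr_if_low, _ = stump
    return int((value <= threshold) == predict_nr_if_low)

# Made-up toy data: one feature value per page, label 1 means NR
toy_values = [0, 1, 0, 9, 12, 7, 15, 2]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]

stump = fit_stump(toy_values, toy_labels)
print(stump, predict_stump(stump, 1), predict_stump(stump, 20))
```

For the first query only 7 of 99 pages are NR, so plain accuracy would reward always predicting 'relevant'; a criterion that accounts for class imbalance is probably a better choice on the real data.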

As a second step, we could try to find some generic way to tokenise text, regardless of language. So, no stemming or anything like that yet, just tokenisation. Then, an often used representation of documents could be a TF-IDF vector.
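A minimal sketch of such a language-independent tokeniser and a TF-IDF representation, using only the standard library. It assumes whitespace-delimited scripts (as in the ES, EN and DE pages seen so far) and uses the plain `tf * log(N / df)` weighting, which is just one of several common variants. The sample texts are made up.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r'\w+')

def tokenise(text):
    """Lowercase and split on anything that is not a letter, digit or underscore."""
    return TOKEN_RE.findall(text.lower())

def tfidf_vectors(texts):
    """Return one sparse {token: weight} dict per text."""
    token_lists = [tokenise(t) for t in texts]
    n_docs = len(token_lists)
    # Document frequency: in how many texts does each token occur?
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({tok: tf * math.log(n_docs / doc_freq[tok])
                        for tok, tf in counts.items()})
    return vectors

sample_texts = ['Paul Ehrlich, inmunólogo alemán',
                'Paul R. Ehrlich biologist',
                'Paul Ehrlich Institut']
sample_vectors = tfidf_vectors(sample_texts)
print(sample_vectors[0])
```

Tokens that occur in every document, such as the query name itself, get weight zero under this weighting.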

As a third step, we could calculate cosine distances between documents.
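For sparse `{token: weight}` dictionaries, cosine distance can be sketched as follows. Treating an all-zero vector as maximally distant is an assumption made here to avoid a division by zero.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse {token: weight} dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an empty or all-zero vector is maximally distant
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

same_direction = cosine_distance({'ehrlich': 1.0}, {'ehrlich': 2.0})
no_overlap = cosine_distance({'ehrlich': 1.0}, {'linkedin': 1.0})
print(same_direction, no_overlap)
```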

As a fourth step, we could use hierarchical agglomerative clustering, which has performed well in previous editions of WePS campaigns.
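To make the idea concrete, here is a naive average-link agglomerative clustering sketch in plain Python that stops merging once the closest pair of clusters is further apart than a threshold. The function name, the toy items and the threshold are made up for illustration; for real experiments a library implementation such as `scipy.cluster.hierarchy` would be the more practical choice.

```python
def agglomerative_clusters(items, distance, threshold):
    """Average-link clustering; returns a list of clusters of item indices."""
    clusters = [[i] for i in range(len(items))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters
        total = sum(distance(items[i], items[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters
        d, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        if d > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Made-up one-dimensional toy items with absolute difference as the distance
toy_items = [0, 1, 2, 10, 11]
toy_clusters = agglomerative_clusters(toy_items, lambda a, b: abs(a - b), threshold=3)
print(toy_clusters)
```

The stopping threshold determines the number of clusters and would have to be tuned on the training queries.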

If we want, we can use the simple idea from Berendsen et al. (ECIR 2012), where social media profiles were not included in the clustering. Instead, adding them as singleton clusters after clustering the rest of the pages boosted the score considerably.
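One possible way to wire this up is sketched below. It is not the method from the cited paper, only an illustration of the singleton idea: the domain list, the function names and the two social media URLs in the example are assumptions, and `cluster_fn` stands for whatever clustering routine is used on the remaining pages.

```python
from urllib.parse import urlparse

# Hypothetical list of social media domains; extend as needed
SOCIAL_DOMAINS = ('twitter.com', 'linkedin.com', 'facebook.com')

def is_social_profile(url):
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith('.' + d) for d in SOCIAL_DOMAINS)

def cluster_with_singletons(urls, cluster_fn):
    """Cluster the non-social pages with cluster_fn, then add each social page on its own.

    cluster_fn takes a list of page indices and returns a list of index lists.
    """
    social = {i for i, url in enumerate(urls) if is_social_profile(url)}
    rest = [i for i in range(len(urls)) if i not in social]
    clusters = cluster_fn(rest)
    return clusters + [[i] for i in sorted(social)]

example_urls = ['http://es.wikipedia.org/wiki/Paul_Ehrlich',
                'https://twitter.com/example_profile',        # made-up URL
                'http://en.wikiquote.org/wiki/Paul_R._Ehrlich',
                'http://www.linkedin.com/example_directory']  # made-up URL

# Stand-in clustering: put all remaining pages in one cluster
one_big_cluster = lambda indices: [indices] if indices else []
example_clusters = cluster_with_singletons(example_urls, one_big_cluster)
print(example_clusters)
```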

If we want to improve further, we could add custom tokenisation for each language.

We can also investigate how common it is that an individual will be referred to from pages with different languages. If it happens a lot, another way to improve scores would be to use some kind of translation machinery.
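A first estimate could come from the gold standard itself: for every cluster with more than one page, check whether its pages cover more than one language. The sketch below assumes `(doc id, cluster label)` pairs and a doc-id-to-languages mapping; the function name, the toy ids and the toy labels are made up.

```python
from collections import defaultdict

def share_of_multilingual_clusters(assignments, languages):
    """Fraction of multi-page clusters whose pages span more than one language.

    assignments: iterable of (doc_id, cluster_label) pairs; 'NR' pages are skipped.
    languages: dict mapping doc_id to a set of language codes.
    """
    langs_by_cluster = defaultdict(set)
    size_by_cluster = defaultdict(int)
    for doc_id, label in assignments:
        if label == 'NR':
            continue
        langs_by_cluster[label] |= languages[doc_id]
        size_by_cluster[label] += 1
    multi_page = [c for c, n in size_by_cluster.items() if n > 1]
    if not multi_page:
        return 0.0
    return sum(len(langs_by_cluster[c]) > 1 for c in multi_page) / len(multi_page)

# Made-up toy data
toy_assignments = [('001', '0'), ('002', '0'), ('003', '1'),
                   ('004', '1'), ('005', 'NR'), ('006', '2')]
toy_languages = {'001': {'ES'}, '002': {'EN'}, '003': {'DE'},
                 '004': {'DE'}, '005': {'EN'}, '006': {'EN'}}
multilingual_share = share_of_multilingual_clusters(toy_assignments, toy_languages)
print(multilingual_share)
```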