# Task 1: Using RLTK to perform Entity Resolution (ER)

<sub>Content of this notebook was prepared by Basel Shbita, and modified by Avijit Thawani (thawani@usc.edu) as part of the class <u>DSCI 558: Building Knowledge Graphs</u> at University of Southern California (USC).</sub>

The Record Linkage ToolKit ([RLTK](https://github.com/usc-isi-i2/rltk)) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity.

This notebook introduces some applied examples using RLTK. You can also find additional examples and use-cases in [RLTK's documentation](https://rltk.readthedocs.io/en/master/).

## Dataset analysis & RLTK components construction

In [1]:
!pip install rltk


### Task 1-1. Construct RLTK Datasets

First, you need to define what a single entry looks like for each type of record (one record class per dataset).

In [2]:
import rltk
import csv

# You can use this tokenizer in case you need to manipulate some data
tokenizer = rltk.tokenizer.crf_tokenizer.crf_tokenizer.CrfTokenizer()

In [4]:
'''
Feel free to add more columns here for use in record linkage.
'''

class GoodRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['ID']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['Title']

    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))

class NobleRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['ID']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['Title']
    
    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))

In [5]:
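dir_ = ''
good_file = dir_ + 'goodreads.csv'
noble_file = dir_ + 'barnes_and_nobles.csv'

# Open the files with an explicit encoding. Relying on the platform default
# (e.g. 'gbk' on some Windows setups) can raise a UnicodeDecodeError such as:
#   'gbk' codec can't decode byte 0x93 in position 3442: illegal multibyte sequence
good_handle = open(good_file, encoding='utf-8', errors='replace')
noble_handle = open(noble_file, encoding='utf-8', errors='replace')

ds1 = rltk.Dataset(rltk.CSVReader(good_handle), record_class=GoodRecord)
ds2 = rltk.Dataset(rltk.CSVReader(noble_handle), record_class=NobleRecord)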

The cell above shows how to load your CSV files into RLTK datasets. We can now inspect a few entries:

In [None]:
# print some entries
print(ds1.generate_dataframe().head(5))
print(ds2.generate_dataframe().head(5))

  id                                 name_string  \
0  0          Managing My Life: My Autobiography   
1  1     I Remember: Sketch for an Autobiography   
2  2              Betty Boothroyd: Autobiography   
3  3  Caddie, A Sydney Barmaid: An Autobiography   
4  4     Nureyev: An Autobiography With Pictures   

                                         name_tokens  
0             {My, Life, :, Managing, Autobiography}  
1   {I, an, Sketch, Autobiography, for, :, Remember}  
2               {:, Betty, Autobiography, Boothroyd}  
3  {Caddie, A, An, ,, Barmaid, Sydney, :, Autobio...  
4    {Nureyev, An, Pictures, With, :, Autobiography}  
  id                                        name_string  \
0  0          Pioneer Girl: The Annotated Autobiography   
1  1  American Sniper (Movie Tie-in Edition): The Au...   
2  2                     The Autobiography of Malcolm X   
3  3                           Assata: An Autobiography   
4  4                            Autobiography of a Yogi   

  

### Task 1-2. Blocking

First, we'll load the dev set, which is used to evaluate both blocking (Task 1-2) and entity linking (Task 1-3).

In [None]:
dev_set_file = dir_ + 'dev.csv'
dev = []
with open(dev_set_file, encoding='utf-8', errors="replace") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if len(row) <= 1:
            continue
        if line_count == 0:
            columns = row
            line_count += 1
        else:
            dev.append(row)
    print(f'Column names are: {", ".join(columns)}')
    print(f'Processed {len(dev)} lines.')

gt = rltk.GroundTruth()
for row in dev:    
    r1 = ds1.get_record(row[0])
    r2  = ds2.get_record(row[1])
    if row[-1] == '1':
        gt.add_positive(r1.raw_object['ID'], r2.raw_object['ID'])
    else:
        gt.add_negative(r1.raw_object['ID'], r2.raw_object['ID'])

rltk.Trial(gt)

Column names are: goodreads.ID, barnes_and_nobles.ID, label
Processed 297 lines.


<rltk.evaluation.trial.Trial at 0x16236e7a0>

Then, you can build your own blocking technique and evaluate it.

Hint:

- What is the total number of pairs without blocking?
- What is the number of pairs with blocking?
- After blocking, how many "correct" (matched) pairs from the dev set are still present?
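
The three hint questions can be answered with a few lines of plain Python. The following is a minimal sketch, not RLTK's own blocking API: the records, the ground-truth pairs, and the `block_key` function (lowercased first word of the title) are all made up for illustration, and a real blocking key would need to be chosen by looking at the data.

```python
from collections import defaultdict

# Toy (id, title) records and matches, made up for illustration only.
good = [('0', 'Managing My Life'), ('1', 'I Remember'), ('2', 'Betty Boothroyd')]
noble = [('0', 'Managing My Life: My Autobiography'),
         ('1', 'Betty Boothroyd: Autobiography'),
         ('2', 'Pioneer Girl')]
true_matches = {('0', '0'), ('2', '1')}  # (good id, noble id)

def block_key(title):
    """Hypothetical blocking key: the lowercased first word of the title."""
    words = title.lower().split()
    return words[0] if words else ''

def build_blocks(records):
    """Group record ids by their blocking key."""
    blocks = defaultdict(list)
    for rec_id, title in records:
        blocks[block_key(title)].append(rec_id)
    return blocks

def blocked_pairs(blocks1, blocks2):
    """Return the cross-dataset id pairs that share a blocking key."""
    pairs = set()
    for key in blocks1.keys() & blocks2.keys():
        for id1 in blocks1[key]:
            for id2 in blocks2[key]:
                pairs.add((id1, id2))
    return pairs

pairs = blocked_pairs(build_blocks(good), build_blocks(noble))
total_pairs = len(good) * len(noble)            # pairs without blocking
reduction_ratio = 1 - len(pairs) / total_pairs  # share of comparisons avoided
pair_completeness = len(pairs & true_matches) / len(true_matches)

print(f'pairs without blocking: {total_pairs}')
print(f'pairs with blocking:    {len(pairs)}')
print(f'reduction ratio:        {reduction_ratio:.3f}')
print(f'pair completeness:      {pair_completeness:.3f}')
```

A good blocking scheme keeps the reduction ratio high while losing as few true matches as possible, so both numbers are worth reporting together.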


### Task 1-3. Entity Linking

Here are 2 example functions for field (attribute) similarity:

In [None]:
def name_string_similarity_1(r1, r2):
    ''' Example dummy similarity function '''
    s1 = r1.name_string[:3]
    s2 = r2.name_string[:3]
    
    return rltk.jaro_winkler_similarity(s1, s2)
    
def name_string_similarity_2(r1, r2):
    ''' Example dummy similarity function '''
    s1 = r1.name_string
    s2 = r2.name_string
    
    if s1 == s2:
        return 1
    
    return 0
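
A token-based measure is another option for titles that differ only by a few words. The sketch below implements the Jaccard index in plain Python over token sets like the ones `name_tokens` produces. The function name `jaccard_similarity` and the second token set are assumptions made for this illustration, not part of the assignment.

```python
def jaccard_similarity(tokens1, tokens2):
    """Jaccard index: size of the intersection over size of the union."""
    set1, set2 = set(tokens1), set(tokens2)
    if not set1 and not set2:
        return 0.0  # convention chosen here for two empty inputs
    return len(set1 & set2) / len(set1 | set2)

# First set mirrors a tokenized title from the sample output; the second is made up.
a = {'Betty', 'Boothroyd', ':', 'Autobiography'}
b = {'Betty', 'Boothroyd', ':', 'The', 'Autobiography'}

score = jaccard_similarity(a, b)  # 4 shared tokens out of 5 distinct tokens
print(score)
```

Inside a field similarity function, this could be called as `jaccard_similarity(r1.name_tokens, r2.name_tokens)`.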

Here's how you can combine multiple similarity functions into a single weighted scoring function:

In [None]:
# threshold value to determine whether we are confident that the records match
MY_TRESH = 0.8 # this number is just an example, you need to change it

# entity linkage scoring function
def rule_based_method(r1, r2):
    score_1 = name_string_similarity_1(r1, r2)
    score_2 = name_string_similarity_2(r1, r2)
    
    total = 0.7 * score_1 + 0.3 * score_2
    
    # return two values: boolean if they match or not, float to determine confidence
    return total > MY_TRESH, total

Let's run some candidates using the ground truth:

In [None]:
trial = rltk.Trial(gt)
candidate_pairs = rltk.get_record_pairs(ds1, ds2, ground_truth=gt)
for r1, r2 in candidate_pairs:
    result, confidence = rule_based_method(r1, r2)
    trial.add_result(r1, r2, result, confidence)

Now let's evaluate our trial results:

In [None]:
trial.evaluate()
print('Trial statistics based on Ground-Truth from development set data:')
print(f'tp: {trial.true_positives:.06f} [{len(trial.true_positives_list)}]')
print(f'fp: {trial.false_positives:.06f} [{len(trial.false_positives_list)}]')
print(f'tn: {trial.true_negatives:.06f} [{len(trial.true_negatives_list)}]')
print(f'fn: {trial.false_negatives:.06f} [{len(trial.false_negatives_list)}]')

Trial statistics based on Ground-Truth from development set data:
tp: 0.597015 [40]
fp: 0.052174 [12]
tn: 0.947826 [218]
fn: 0.402985 [27]


In [None]:
trial.f_measure

0.6722689075630253
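
To see where these numbers come from, the metrics can be recomputed by hand from the bracketed counts in the trial output above. This is a plain-Python sketch rather than RLTK's implementation. The printed `tp` and `fp` fractions are consistent with the true positive rate (recall) and the false positive rate, respectively.

```python
# Counts copied from the bracketed values in the trial output.
tp, fp, tn, fn = 40, 12, 218, 27

precision = tp / (tp + fp)            # share of predicted matches that are correct
recall = tp / (tp + fn)               # share of true matches found (true positive rate)
false_positive_rate = fp / (fp + tn)  # share of true non-matches predicted as matches
f1 = 2 * precision * recall / (precision + recall)

print(f'precision: {precision:.6f}')
print(f'recall:    {recall:.6f}')
print(f'fpr:       {false_positive_rate:.6f}')
print(f'f1:        {f1:.6f}')
```

Recall is the weak spot for this example rule: 27 of the 67 true matches in the dev set are missed.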

### Save Test predictions
You will be evaluated on dev and test predictions, over a hidden ground truth.

In [None]:
test_set_file = dir_ + 'test.csv'
test = []
with open(test_set_file, encoding='utf-8', errors="replace") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if len(row) <= 1:
            continue
        if line_count == 0:
            columns = row
            line_count += 1
        else:
            test.append(row)
    print(f'Column names are: {", ".join(columns)}')
    print(f'Processed {len(test)} lines.')

Column names are: goodreads.ID, barnes_and_nobles.ID
Processed 100 lines.


In [None]:
predictions = []
for id1, id2 in test:
    r1 = ds1.get_record(id1)
    r2  = ds2.get_record(id2)
    result, confidence = rule_based_method(r1, r2)
    predictions.append((r1.id, r2.id, result, confidence))

In [None]:
len(predictions), len(ds1.generate_dataframe()), len(ds2.generate_dataframe())

(100, 3967, 3701)

In [None]:
with open(dir_ + 'predictions.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in predictions:
        writer.writerow(row)

# Task 2: Using RDFLib for Knowledge Representation

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information as graphs. RDFLib aims to be a Pythonic RDF API: a `Graph` is a Python collection of RDF subject, predicate, object triples.

This notebook introduces simple examples. You can also find additional information in the [official documentation](https://rdflib.readthedocs.io/en/stable/).

In [None]:
! pip install rdflib



In [None]:
from rdflib import Graph, URIRef, Literal, XSD, Namespace, RDF

Let's define some namespaces:

In [None]:
FOAF = Namespace('http://xmlns.com/foaf/0.1/')
MYNS = Namespace('http://dsci558.org/myfakenamespace#')

We can create a graph:

In [None]:
my_kg = Graph()
my_kg.bind('myns', MYNS)
my_kg.bind('foaf', FOAF)

Define a URI, then add a simple triple to the graph:

In [None]:
node_uri = URIRef(MYNS['dsci_558'])
my_kg.add((node_uri, RDF.type, MYNS['course']))

Add an additional triple (which describes the same subject, `node_uri`):

In [None]:
my_kg.add((node_uri, FOAF['name'], Literal('Building Knowledge Graphs')))

And now let's dump our graph triples into some `ttl` file:

In [None]:
my_kg.serialize(dir_ + 'sample_graph.ttl', format="turtle")

In [None]:
!head sample_graph.ttl

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix myns: <http://dsci558.org/myfakenamespace#> .

myns:dsci_558 a myns:course ;
    foaf:name "Building Knowledge Graphs" .

