<sub>Content of this notebook was prepared by Basel Shbita (shbita@usc.edu) as part of the class <u>DSCI 558: Building Knowledge Graphs</u> during Fall 2020 at University of Southern California (USC).</sub>

# Using RLTK to perform ER (i.e., Task 1) and Blocking (i.e., Task 2)

The Record Linkage ToolKit ([RLTK](https://rltk.readthedocs.io/en/master/)) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity.

This notebook introduces some applied examples using RLTK. You can find additional examples and use cases at the link above.

In [None]:
from hw03_tasks_1_2 import rltk, json, g_tokenizer, create_dataset, get_ground_truth

## Dataset analysis & RLTK components construction

First, you need to define what a single entry looks like for each type of record (one class per dataset). Similar code is already set up for you in the file `hw03_tasks_1_2.py` (the file you will submit).

In [None]:
# RLTK IMDB Record
class IMDBRecord(rltk.Record):
    ''' Record entry class for each of our IMDB records '''
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['url']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['name']

    @rltk.cached_property
    def name_tokens(self):
        global g_tokenizer
        return set(g_tokenizer.tokenize(self.name_string))

In [None]:
class AFIRecord(rltk.Record):
    ''' Record entry class for each of our AFI records '''
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['url']

    @rltk.cached_property
    def title_string(self):
        return self.raw_object['title']

    @rltk.cached_property
    def title_tokens(self):
        global g_tokenizer
        return set(g_tokenizer.tokenize(self.title_string))
    
    @rltk.cached_property
    def date_string(self):
        return self.raw_object.get('release_date', '')

You can load your JSON Lines files into RLTK using `rltk.JsonLinesReader`. This is already implemented for you, so we can just call `create_dataset`:

In [None]:
ds_imdb = create_dataset('imdb.jl', IMDBRecord)
ds_afi = create_dataset('afi.jl', AFIRecord)
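
For readers unfamiliar with the format, a JSON Lines file stores one JSON object per line. The following is a minimal standard-library sketch of reading such data. It does not use RLTK; the `read_json_lines` helper and the sample records are made up for illustration, and the real `imdb.jl` entries may carry other fields besides `url` and `name`.

```python
import io
import json

def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical sample data standing in for a file such as imdb.jl
sample = io.StringIO(
    '{"url": "https://example.org/title/1", "name": "Casablanca"}\n'
    '{"url": "https://example.org/title/2", "name": "Vertigo"}\n'
)

records = list(read_json_lines(sample))
print(len(records), records[0]["name"])
```

Each parsed dict plays the role of the `raw_object` that the record classes above read their fields from.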

And we can inspect a few entries:

In [None]:
ds_imdb.generate_dataframe().head(5)

In [None]:
ds_afi.generate_dataframe().head(5)

## Field (Attribute) Similarity (i.e., Task 1.1)

Here are 2 example functions for field (attribute) similarity:

In [None]:
def name_string_similarity_1(r_imdb, r_afi):
    ''' Example dummy similarity function: Jaro-Winkler on the first 3 characters '''
    s1 = r_imdb.name_string[:3]
    s2 = r_afi.title_string[:3]
    
    return rltk.jaro_winkler_similarity(s1, s2)
    
def name_string_similarity_2(r_imdb, r_afi):
    ''' Example dummy similarity function: exact string match (1 or 0) '''
    s1 = r_imdb.name_string
    s2 = r_afi.title_string
    
    if s1 == s2:
        return 1
    
    return 0
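
The record classes above also expose token sets (`name_tokens` and `title_tokens`), which the two example functions do not use. As a hedged illustration of a token-based field similarity, here is a plain-Python Jaccard sketch. The `simple_tokens` function is only a stand-in for `g_tokenizer` so that the example runs on its own; in the notebook you would pass `r_imdb.name_tokens` and `r_afi.title_tokens` to `jaccard` instead.

```python
import re

def simple_tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens (stand-in tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

a = simple_tokens("The Godfather: Part II")
b = simple_tokens("Godfather, The Part II")
c = simple_tokens("The Godfather")

print(jaccard(a, b))  # same tokens in a different order
print(jaccard(a, c))  # 2 shared tokens out of 4 distinct tokens
```

Because sets ignore order and duplicates, this kind of measure tolerates reordered titles, which an exact string comparison does not.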

## Entity Linking (i.e., Task 1.2)

Here's how you can combine multiple similarity functions into a single weighted scoring function:

In [None]:
# threshold above which we are confident that two records match
MY_TRESH = 0.8 # this number is just an example, you need to change it

# entity linkage scoring function
def rule_based_method(r_imdb, r_afi):
    score_1 = name_string_similarity_1(r_imdb, r_afi)
    score_2 = name_string_similarity_2(r_imdb, r_afi)
    
    total = 0.7 * score_1 + 0.3 * score_2
    
    # return two values: a boolean (match or not) and a float (the confidence score)
    return total > MY_TRESH, total
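
To see how the weights and the threshold interact, it can help to plug in a few scores by hand. The sketch below uses a hypothetical `weighted_match` helper (not part of RLTK or the homework file) with the same weights and example threshold as the rule above.

```python
def weighted_match(scores, weights, threshold):
    """Return (is_match, total) for a weighted sum of similarity scores."""
    total = sum(w * s for w, s in zip(weights, scores))
    return total > threshold, total

WEIGHTS = (0.7, 0.3)   # weights from the rule above
THRESHOLD = 0.8        # the example threshold from the rule above

# Prefixes agree perfectly but the full strings differ: 0.7 * 1 + 0.3 * 0
print(weighted_match((1.0, 0.0), WEIGHTS, THRESHOLD))
# Full strings are identical, so both scores are 1
print(weighted_match((1.0, 1.0), WEIGHTS, THRESHOLD))
```

With these example numbers, a pair whose strings are not exactly equal can score at most 0.7, which is below 0.8, so the rule effectively accepts exact matches only. This is one reason the weights and threshold need tuning against the development set.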

## EL Evaluation

RLTK has a built-in evaluation module for benchmarking. Let's load our development set and build a ground truth. The loader is already implemented, so let's just call it:

In [None]:
# load development set data
gt = get_ground_truth('imdb_afi_el.dev.json', ds_imdb, ds_afi)

We can generate additional negatives for our ground truth:

In [None]:
gt.generate_all_negatives(ds_imdb, ds_afi, range_in_gt=True)
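
Conceptually, generating negatives means treating every cross-dataset pair that is not a labeled positive as a non-match. The plain-Python sketch below illustrates that idea on made-up ids. It assumes that `range_in_gt=True` restricts the pairs to records whose ids already appear in the ground truth; that is an assumption about the behavior, not a copy of RLTK's implementation.

```python
from itertools import product

# Hypothetical labeled positive pairs: (imdb_id, afi_id)
positives = {("imdb1", "afiA"), ("imdb2", "afiB"), ("imdb3", "afiC")}

# Only ids that already appear in the ground truth are considered
imdb_ids = sorted({i for i, _ in positives})
afi_ids = sorted({a for _, a in positives})

# Every cross pair that is not a known positive is treated as a negative
negatives = [pair for pair in product(imdb_ids, afi_ids) if pair not in positives]

print(len(positives), len(negatives))
```

Note how quickly negatives outnumber positives: the number of cross pairs grows with the product of the two id sets, while the positives grow only linearly.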

Let's run our scoring function over the candidate pairs from the ground truth:

In [None]:
trial = rltk.Trial(gt)
candidate_pairs = rltk.get_record_pairs(ds_imdb, ds_afi, ground_truth=gt)
for r_imdb, r_afi in candidate_pairs:
    result, confidence = rule_based_method(r_imdb, r_afi)
    trial.add_result(r_imdb, r_afi, result, confidence)

Now let's evaluate our trial results:

In [None]:
trial.evaluate()
print('Trial statistics based on Ground-Truth from development set data:')
print(f'tp: {trial.true_positives:.06f} [{len(trial.true_positives_list)}]')
print(f'fp: {trial.false_positives:.06f} [{len(trial.false_positives_list)}]')
print(f'tn: {trial.true_negatives:.06f} [{len(trial.true_negatives_list)}]')
print(f'fn: {trial.false_negatives:.06f} [{len(trial.false_negatives_list)}]')
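
The four counts above are usually summarized as precision, recall and F1. Here is a minimal sketch computing them from raw counts in plain Python. The `precision_recall_f1` helper and the numbers are hypothetical; in the notebook, the counts would come from the lengths of the lists printed above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, e.g. len(trial.true_positives_list) and friends
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f'precision: {p:.2f}, recall: {r:.2f}, f1: {f1:.2f}')
```

Raising the match threshold typically trades recall for precision, so it is worth checking these numbers at several threshold values.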

# Using RDFLib for Knowledge Representation (i.e. Task 3)

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information as graphs. RDFLib aims to be a Pythonic RDF API: a Graph is a Python collection of RDF subject, predicate, object triples.

This notebook introduces simple examples. You can also find additional information in the [official documentation](https://rdflib.readthedocs.io/en/stable/).

In [None]:
from rdflib import Graph, URIRef, Literal, XSD, Namespace, RDF

Let's define some namespaces:

In [None]:
FOAF = Namespace('http://xmlns.com/foaf/0.1/')
MYNS = Namespace('http://dsci558.org/myfakenamespace#')

We can create a graph and bind the namespaces we defined:

In [None]:
my_kg = Graph()
my_kg.bind('myns', MYNS)
my_kg.bind('foaf', FOAF)

Define a URI, then add a simple triple to the graph:

In [None]:
node_uri = URIRef(MYNS['dsci558_production_company'])
my_kg.add((node_uri, RDF.type, MYNS['productionCompany']))

The triple that will be generated is: `myns:dsci558_production_company rdf:type myns:productionCompany`, where `myns` is the prefix bound to `http://dsci558.org/myfakenamespace#`.

Add an additional triple (which describes the same subject, `node_uri`):

In [None]:
my_kg.add((node_uri, FOAF['name'], Literal('DSCI 558 Production Company')))

The triple that will be generated is: `myns:dsci558_production_company foaf:name "DSCI 558 Production Company"`.
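
Prefixed names such as `foaf:name` are only shorthand: each one stands for a full IRI formed by concatenating the namespace and the local name. The sketch below shows that expansion in plain Python. The `expand` helper is hypothetical and is not part of RDFLib, which handles this for you through `Namespace` and `bind`.

```python
PREFIXES = {
    'myns': 'http://dsci558.org/myfakenamespace#',
    'foaf': 'http://xmlns.com/foaf/0.1/',
}

def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'foaf:name' into its full IRI."""
    prefix, _, local = prefixed_name.partition(':')
    return prefixes[prefix] + local

print(expand('myns:dsci558_production_company'))
print(expand('foaf:name'))
```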

Now let's serialize our graph triples to a Turtle (`ttl`) file:

In [None]:
my_kg.serialize('sample_graph.ttl', format="turtle")

Open the file `sample_graph.ttl` and inspect it to see how your two triples look in `ttl` syntax.