# <span style="color:blue">Advanced Databases and Information Systems Project I</span>
## Implementation of Join Algorithm for SPARQL Query Processing

**Group Members**:
* Omar Swelam os132@uni-freiburg.de
* Jumshaid Khan jk1308@uni-freiburg.de

**Submitted to**: 
Dr. Fang Wei-Kleiner

**Repository:** https://github.com/iamjumshaid/adbis-projects

**<span style="color:red">Note:</span>** We have added your email `fwei@informatik.uni-freiburg.de` as a collaborator to our private GitHub repository.
  
**Date:** 31.07.2023


In [None]:
import pandas as pd
import hashlib
import numpy
import os
import time

pd.set_option('display.max_rows', 200)

# Tasks on small dataset "watdiv100k.txt"

* **<span style="color:red">Note:</span>** The tasks are repeated on the larger dataset file "watdiv.10M.tar.bz2" at the end of this notebook.

## Task 1
The first task is to pre-process the data. The triples must be partitioned into relations using the
vertically partitioned approach: for each distinct property, set up a table with 'Subject' and
'Object' as columns. If the triple store contains n properties, n tables are constructed.
An optional step before the pre-processing is to build a dictionary of all strings occurring
in the triple store and to transform the string values into integers. Since comparing integers is
much faster than comparing strings, this optional step improves the efficiency of
the join algorithm.
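
The two steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper `encode_triples` and the toy triples are invented for the example, and the cells below take a different route (they extract the numeric part of each identifier instead of building a dictionary).

```python
from collections import defaultdict

def encode_triples(triples):
    """Dictionary-encode subjects/objects and partition the triples by property.

    Hypothetical helper for illustration; it is not part of the notebook's pipeline.
    """
    dictionary = {}             # string -> integer id, in order of first appearance
    tables = defaultdict(list)  # property -> list of (subject_id, object_id) rows
    for subject, prop, obj in triples:
        subject_id = dictionary.setdefault(subject, len(dictionary))
        object_id = dictionary.setdefault(obj, len(dictionary))
        tables[prop].append((subject_id, object_id))
    return dictionary, dict(tables)

# Toy triples, invented for the example.
toy_triples = [
    ("User0", "follows", "User1"),
    ("User1", "follows", "User2"),
    ("User0", "likes", "Product0"),
]
dictionary, tables = encode_triples(toy_triples)
print(dictionary)
print(tables)
```

Each distinct property ends up with its own two-column table, and the joins only ever compare integers.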

In [None]:
df = pd.read_csv('100k.txt', sep='\t', header=None, names=['Subject', 'Property','Object'])
df['Object'] = df.Object.str.rstrip(" .")
df.head(5)

In [None]:
df["Property"] = df.Property.str.split(":").apply(lambda x: x[1])
df["Object"] = df.Object.str.split(":").apply(lambda x: x[1] if len(x)>1 else x[0])
df["Subject"] = df.Subject.str.split(":").apply(lambda x: x[1])

In [None]:
# Vertical partitioning: group df by 'Property' and store one (Subject, Object) table per distinct
# property in the dictionary property_dicts, keyed by the property name.
# The first two rows of every group are printed as a quick sanity check.
# .copy() makes each table independent of df, so the later column assignments
# do not trigger pandas' SettingWithCopyWarning.

property_dicts = {}
for prop, df_part in df.groupby('Property'):
    print(prop)
    print(df_part.head(2))
    print("========================")
    property_dicts[prop] = df_part[["Subject","Object"]].copy()

In [None]:
# Find the properties whose 'Object' values are all numeric strings and convert those columns
# to integers. The names of these properties are collected in numeric_objects for later use.

numeric_objects = []
for k_prop in property_dicts.keys():
    if property_dicts[k_prop]['Object'].str.isnumeric().all():
        numeric_objects.append(k_prop)
        property_dicts[k_prop]['Object'] = property_dicts[k_prop]['Object'].astype('int64')

In [None]:
str(numeric_objects)

In [None]:
list_of_dfs = ['follows', 'friendOf', 'likes', 'hasReview']
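# Check which entity types (users, products, reviews, ...) occur in each column by looking at the
# first four characters of every value. If a column holds a single entity type, its strings can be
# replaced by their integer part without mixing up identifiers of different types.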

for prop in list_of_dfs:
    print(f"The {prop} has the following unique values:")
    print("Subject: " + str(property_dicts[prop]['Subject'].apply(lambda x: x[:4]).unique()))
    print("Object: " + str(property_dicts[prop]['Object'].apply(lambda x: x[:4]).unique()))

In [None]:
import re 

# we convert the string values to integers for the tables we will use in the join operations

def extract_integer_part(input_string):
    pattern = r'\d+'  # This regex pattern matches one or more digits in the string.
    match = re.search(pattern, input_string)
    if match:
        return int(match.group())  # Convert the matched substring to an integer.
    return None

In [None]:
for prop in list_of_dfs:
    property_dicts[prop]['Subject'] = property_dicts[prop]['Subject'].apply(extract_integer_part)
    property_dicts[prop]['Object'] = property_dicts[prop]['Object'].apply(extract_integer_part)

## Task 2

The second task is to design and implement hash join and sort-merge join algorithms for the query
evaluation. The running query can be expressed in SQL over the tables produced by the
vertically partitioned approach.
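
To make the join chain of the running query concrete (follows, then friendOf, then likes, then hasReview, each joined on the object of the previous table), here is a minimal sketch in plain Python. The helper `hash_join_pairs` and the toy tuples are assumptions made for illustration; it is not the pandas implementation used in the cells below.

```python
from collections import defaultdict

def hash_join_pairs(left, right, left_key, right_key):
    """Inner hash join of two lists of tuples (illustrative helper)."""
    # Build phase: index the left rows by their join key value.
    buckets = defaultdict(list)
    for row in left:
        buckets[row[left_key]].append(row)
    # Probe phase: look up every right row and emit the combined tuples,
    # leaving out the right join column because it repeats the left key.
    joined = []
    for row in right:
        rest = row[:right_key] + row[right_key + 1:]
        for match in buckets.get(row[right_key], []):
            joined.append(match + rest)
    return joined

# Toy (subject, object) tables with integer ids, invented for the example.
follows = [(1, 2), (3, 2)]
friend_of = [(2, 4)]
likes = [(4, 10), (4, 11)]
has_review = [(10, 100)]

step1 = hash_join_pairs(follows, friend_of, 1, 0)  # follows.Object = friendOf.Subject
step2 = hash_join_pairs(step1, likes, 2, 0)        # friendOf.Object = likes.Subject
step3 = hash_join_pairs(step2, has_review, 3, 0)   # likes.Object = hasReview.Subject
print(step3)
```

Every result tuple has the shape (user, follows, friendOf, likes, hasReview), which mirrors the column layout produced by the join cells in this notebook.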

**Hash join**

In [None]:
# A hash join has two phases. The build phase creates a hash table from the first table. The probe
# phase looks up every row of the second table in that hash table and compares the actual key
# values, so that keys that merely share a hash value (collisions) are not joined.

def build(table, join_key, hash_function):
    """Partition `table` into sub-tables, one per distinct hash value of the join key column."""
    # Grouping by an array of hash values leaves the input table unmodified.
    hash_keys = table[join_key].apply(hash_function).to_numpy()
    return {hash_key: bucket for hash_key, bucket in table.groupby(hash_keys)}


def probe(hash_table, join_key1, table2, join_key2, hash_function):
    """Join every row of `table2` with the matching rows of its bucket in `hash_table`.

    The columns of `table2` are appended to the matching rows with a '_2' suffix.
    """
    dfs_to_join = []
    for _, row in table2.iterrows():
        hash_key = hash_function(row[join_key2])
        if hash_key in hash_table:
            bucket = hash_table[hash_key]
            # Keep only the rows whose join key really equals the probed value.
            df_hash = bucket[bucket[join_key1] == row[join_key2]].copy()
            for col in row.index:
                df_hash[col + '_2'] = row[col]
            dfs_to_join.append(df_hash)
    if not dfs_to_join:
        # pd.concat raises a ValueError for an empty list.
        return pd.DataFrame()
    return pd.concat(dfs_to_join, axis=0)


def hash_join(table1, table2, join_key1, join_key2, hash_function=hash, join_type='inner'):
    """Join two tables on `join_key1 == join_key2`; `join_type` is 'inner', 'left', or 'right'."""
    if join_type == 'right':
        # A right join is evaluated as a left join with the two tables (and their keys) swapped.
        table1, table2 = table2, table1
        join_key1, join_key2 = join_key2, join_key1

    hash_table = build(table1, join_key1, hash_function)
    joint = probe(hash_table, join_key1, table2, join_key2, hash_function)
    if joint.empty:
        # Keep the expected columns even when no rows match.
        joint = pd.DataFrame(columns=list(table1.columns) + [col + '_2' for col in table2.columns])

    if join_type in ['right', 'left']:
        # Outer joins also keep the rows of table1 that found no partner.
        unmatched = table1[~table1[join_key1].isin(joint[join_key1])]
        return pd.concat([joint, unmatched], axis=0)

    return joint

In [None]:
# result from applying hash_join
start_time = time.time()
hashed_res_1 = hash_join(property_dicts['follows'], property_dicts['friendOf'],'Object','Subject')
hashed_res_1 = hashed_res_1.rename({'Subject':'User', 'Object':'follows', 'Object_2':'friendsOf'}, axis=1).drop('Subject_2', axis=1)
hashed_res_2 = hash_join(hashed_res_1, property_dicts['likes'],'friendsOf','Subject')
hashed_res_2 = hashed_res_2.rename({'Object_2':'likes'}, axis=1).drop('Subject_2', axis=1)
hashed_res_3 = hash_join(hashed_res_2, property_dicts['hasReview'],'likes','Subject')
hashed_result = hashed_res_3.rename({'Object_2':'hasReview'}, axis=1).drop('Subject_2', axis=1)
end_time = time.time()

hashed_result

In [None]:
print('time taken: %s seconds' % (end_time - start_time))

**Sort-merge join**
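
Before the pandas implementation, the core idea can be shown on plain lists of tuples. This is a minimal sketch for an inner join only; the helper `sort_merge_join_pairs` and the toy rows are assumptions made for illustration. The important detail is the handling of runs of equal keys: every row of the run on one side must be paired with every row of the run on the other side, including when a run ends at the last row of a table.

```python
def sort_merge_join_pairs(left, right, left_key, right_key):
    """Inner sort-merge join of two lists of tuples (illustrative helper)."""
    left = sorted(left, key=lambda row: row[left_key])
    right = sorted(right, key=lambda row: row[right_key])
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        key_left, key_right = left[i][left_key], right[j][right_key]
        if key_left < key_right:
            i += 1
        elif key_left > key_right:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            run_end = j
            while run_end < len(right) and right[run_end][right_key] == key_left:
                run_end += 1
            # Pair every left row of this key with the whole right run.
            while i < len(left) and left[i][left_key] == key_left:
                joined.extend(left[i] + right[k] for k in range(j, run_end))
                i += 1
            j = run_end
    return joined

# Toy rows, invented for the example: key 5 occurs twice on both sides.
left_rows = [(1, 5), (2, 5), (3, 7)]
right_rows = [(5, "a"), (5, "b"), (6, "c")]
pairs = sort_merge_join_pairs(left_rows, right_rows, 1, 0)
print(pairs)
```

The two runs of key 5 produce four result rows, while the keys 6 and 7 have no partner and are skipped.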

In [None]:
# Sort-merge join: both tables are sorted on their join keys and then scanned in parallel with two
# pointers. Apart from the cross product inside runs of equal keys, every row is visited once.

def sort_merge_join(table1, table2, join_key1, join_key2, join_type='inner'):
    """Join two tables on `join_key1 == join_key2`; `join_type` is 'inner', 'left', or 'right'.

    Columns of the first table get the suffix '_1', columns of the second table the suffix '_2'.
    """
    # A right join is evaluated as a left join with the two tables (and their keys) swapped.
    if join_type == 'right':
        table1, table2 = table2, table1
        join_key1, join_key2 = join_key2, join_key1
    keep_unmatched = join_type in ['right', 'left']

    # Sort both tables on their join keys.
    sorted_table1 = table1.sort_values(join_key1).add_suffix('_1')
    sorted_table2 = table2.sort_values(join_key2).add_suffix('_2')
    keys1 = sorted_table1[join_key1 + '_1'].to_numpy()
    keys2 = sorted_table2[join_key2 + '_2'].to_numpy()

    # pointer1 and pointer2 track the current position in each sorted table;
    # result collects the merged rows.
    pointer1 = 0
    pointer2 = 0
    result = []

    while pointer1 < len(keys1) and pointer2 < len(keys2):
        value1 = keys1[pointer1]
        value2 = keys2[pointer2]

        if value1 == value2:
            # Find the end of the run of rows that share this key in each table.
            end1 = pointer1
            while end1 < len(keys1) and keys1[end1] == value1:
                end1 += 1
            end2 = pointer2
            while end2 < len(keys2) and keys2[end2] == value1:
                end2 += 1
            # Every row of the run in table1 matches every row of the run in table2.
            for i in range(pointer1, end1):
                for j in range(pointer2, end2):
                    result.append(pd.concat([sorted_table1.iloc[i], sorted_table2.iloc[j]]))
            pointer1, pointer2 = end1, end2
        elif value1 < value2:
            # The current row of table1 has no partner; outer joins keep it with missing values.
            if keep_unmatched:
                result.append(sorted_table1.iloc[pointer1])
            pointer1 += 1
        else:
            pointer2 += 1

    # Rows of table1 that remain after table2 is exhausted have no partner either.
    if keep_unmatched:
        result.extend(sorted_table1.iloc[i] for i in range(pointer1, len(keys1)))

    columns = list(sorted_table1.columns) + list(sorted_table2.columns)
    return pd.DataFrame(result, columns=columns).reset_index(drop=True)



In [None]:
# result from applying sort_merge_join
# sort_merge_join suffixes the columns of its first input with '_1', so these suffixes are
# removed again after every step to keep the column names stable.
start_time = time.time()
merged_res_1 = sort_merge_join(property_dicts['follows'], property_dicts['friendOf'],'Object','Subject')
merged_res_1 = merged_res_1.rename({'Subject_1':'User', 'Object_1':'follows', 'Object_2':'friendsOf'}, axis=1).drop('Subject_2', axis=1)
merged_res_2 = sort_merge_join(merged_res_1, property_dicts['likes'],'friendsOf','Subject')
merged_res_2 = merged_res_2.rename({'User_1':'User', 'follows_1':'follows', 'friendsOf_1':'friendsOf', 'Object_2':'likes'}, axis=1).drop('Subject_2', axis=1)
merged_res_3 = sort_merge_join(merged_res_2, property_dicts['hasReview'],'likes','Subject')
merged_result = merged_res_3.rename({'User_1':'User', 'follows_1':'follows', 'friendsOf_1':'friendsOf', 'likes_1':'likes', 'Object_2':'hasReview'}, axis=1).drop('Subject_2', axis=1)
end_time = time.time()

merged_result

In [None]:
print('time taken: %s seconds' % (end_time - start_time))

## Task 3

The third task is to design and implement an improvement algorithm regarding the running time.
There is no restrictions on the approaches. Possible candidates are: use radix join algorithm, use
a different hash function or hashing scheme, partition the data before the join operation, or use
parallel sorting algorithms. Other options are for instance building indexes on the data before the
query evaluation.
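
One of the candidates named above, partitioning the data by a radix digit before joining, can be sketched in plain Python as follows. The helper names, the single partitioning level, and `num_buckets=4` are assumptions chosen for the example; the cells below implement their own pandas-based variant with several radix levels.

```python
from collections import defaultdict

def radix_digit(key, num_buckets, level):
    """Return the digit of `key` at position `level`, written in base `num_buckets`."""
    return key // (num_buckets ** level) % num_buckets

def radix_join_pairs(left, right, left_key, right_key, num_buckets=4):
    """Partition both inputs by the lowest radix digit, then hash-join bucket by bucket."""
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for row in left:
        left_parts[radix_digit(row[left_key], num_buckets, 0)].append(row)
    for row in right:
        right_parts[radix_digit(row[right_key], num_buckets, 0)].append(row)

    joined = []
    for digit, left_rows in left_parts.items():
        # Rows can only match inside the partition that has the same digit.
        index = defaultdict(list)
        for row in left_rows:
            index[row[left_key]].append(row)
        for row in right_parts.get(digit, []):
            joined.extend(match + row for match in index.get(row[right_key], []))
    return joined

# Toy rows, invented for the example.
left_rows = [(1, 9), (2, 13), (3, 6)]
right_rows = [(9, "x"), (13, "y"), (5, "z")]
radix_pairs = radix_join_pairs(left_rows, right_rows, 1, 0)
print(radix_pairs)
```

The keys 9, 13, and 5 all share the lowest digit 1 in base 4, so they land in the same partition, and the per-partition hash lookup on the full key is what separates them again.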

### Radix join not improved

In [None]:
def radix_hash_function(join_key, num_buckets, radix_level):
    # Extract the digit at position radix_level of the join key, written in base num_buckets.
    radix_value = join_key // (num_buckets ** radix_level) % num_buckets
    return radix_value

def radix_partition(table, join_key, num_buckets, radix_level):
    # Create one hash column per radix level.
    for r in range(radix_level):
        table[f'hash_key_{r}'] = table[join_key].apply(radix_hash_function, args=(num_buckets, r))

    # Group the rows by the combination of their radix digits.
    groups = table.groupby([f'hash_key_{r}' for r in range(radix_level)])

    # Return a dictionary that maps each combination of radix digits to the list of rows
    # (as dictionaries) that fall into that bucket.
    buckets = {}
    for group_key, group in groups:
        buckets[group_key] = group.to_dict('records')

    return buckets

# num_buckets=10 and radix_level=3 are assumed defaults: the calls below do not pass these
# arguments, and the values that were actually used are not recorded in this notebook.
def chained_radix_join(table1, table2, join_key1, join_key2, num_buckets=10, radix_level=3):
    # As in hash_join, the columns of table1 keep their names and the columns of table2 get the
    # suffix '_2'. Working on new tables keeps the hash columns out of the caller's tables.
    table1 = table1.copy()
    table2 = table2.add_suffix('_2')
    join_key2 = f"{join_key2}_2"
    columns = list(table1.columns) + list(table2.columns)
    buckets1 = radix_partition(table1, join_key1, num_buckets, radix_level)
    buckets2 = radix_partition(table2, join_key2, num_buckets, radix_level)

    hash_columns = [f'hash_key_{r}' for r in range(radix_level)]
    merged_tables = []
    # Only buckets with the same combination of radix digits can contain matching rows.
    for hash_value in buckets1.keys():
        if hash_value not in buckets2.keys():
            continue
        inner_buckets1 = buckets1[hash_value]
        inner_buckets2 = buckets2[hash_value]

        # Baseline: pair up the rows of the two buckets one by one. Rows in the same bucket
        # already agree on every radix level, so no further hash comparison is needed here.
        for inner_row1 in inner_buckets1:
            for inner_row2 in inner_buckets2:
                merged_row = pd.concat([pd.Series(inner_row1), pd.Series(inner_row2)], axis=0)
                merged_tables.append(merged_row.drop(hash_columns))

    if not merged_tables:
        return pd.DataFrame(columns=columns)

    result = pd.DataFrame(merged_tables)
    # Rows in the same bucket share their radix digits but not necessarily the full key,
    # so keep only the pairs whose join keys are equal.
    return result[result[join_key1] == result[join_key2]]



In [None]:
# result from applying chained_radix_join (baseline version)
start_time = time.time()
rad_res_1 = chained_radix_join(property_dicts['follows'], property_dicts['friendOf'],'Object','Subject')
rad_res_1 = rad_res_1.rename({'Subject':'User', 'Object':'follows', 'Object_2':'friendsOf'}, axis=1).drop('Subject_2', axis=1)
rad_res_2 = chained_radix_join(rad_res_1, property_dicts['likes'],'friendsOf','Subject')
rad_res_2 = rad_res_2.rename({'Object_2':'likes'}, axis=1).drop('Subject_2', axis=1)
rad_res_3 = chained_radix_join(rad_res_2, property_dicts['hasReview'],'likes','Subject')
rad_result = rad_res_3.rename({'Object_2':'hasReview'}, axis=1).drop('Subject_2', axis=1)
end_time = time.time()

rad_result

In [None]:
print('time taken: %s seconds' % (end_time - start_time))

### Radix join improved

In [None]:
# the faster version

def radix_hash_function(join_key, num_buckets, radix_level):
    # Extract the digit at position radix_level of the join key, written in base num_buckets.
    radix_value = join_key // (num_buckets ** radix_level) % num_buckets
    return radix_value

def radix_partition(table, join_key, num_buckets, radix_level):
    # Create one hash column per radix level.
    hash_columns = [f'hash_key_{r}' for r in range(radix_level)]
    for r in range(radix_level):
        table[f'hash_key_{r}'] = table[join_key].apply(radix_hash_function, args=(num_buckets, r))

    # Group the rows by the combination of their radix digits and index every bucket by these
    # digits, so that two buckets can be joined on their index.
    groups = table.groupby(hash_columns)

    buckets = {}
    for group_key, group in groups:
        buckets[group_key] = group.set_index(hash_columns)
    return buckets

# num_buckets=10 and radix_level=3 are assumed defaults: the calls below do not pass these
# arguments, and the values that were actually used are not recorded in this notebook.
def chained_radix_join(table1, table2, join_key1, join_key2, num_buckets=10, radix_level=3):
    # As in hash_join, the columns of table1 keep their names and the columns of table2 get the
    # suffix '_2'. Working on new tables keeps the hash columns out of the caller's tables.
    table1 = table1.copy()
    table2 = table2.add_suffix('_2')
    join_key2 = f"{join_key2}_2"
    columns = list(table1.columns) + list(table2.columns)
    buckets1 = radix_partition(table1, join_key1, num_buckets, radix_level)
    buckets2 = radix_partition(table2, join_key2, num_buckets, radix_level)

    merged_tables = []
    for hash_value in buckets1.keys():
        if hash_value not in buckets2.keys():
            continue
        # Instead of going through the buckets row by row, join the two buckets that share
        # this hash value with a single pandas join on their index.
        merged_tables.append(buckets1[hash_value].join(buckets2[hash_value]))

    if not merged_tables:
        return pd.DataFrame(columns=columns)

    result = pd.concat(merged_tables).reset_index(drop=True)
    # Rows in the same bucket share their radix digits but not necessarily the full key,
    # so keep only the pairs whose join keys are equal.
    return result[result[join_key1] == result[join_key2]]


In [None]:
# result from applying chained_radix_join (improved version)
start_time = time.time()
rad_res_1 = chained_radix_join(property_dicts['follows'], property_dicts['friendOf'],'Object','Subject')
rad_res_1 = rad_res_1.rename({'Subject':'User', 'Object':'follows', 'Object_2':'friendsOf'}, axis=1).drop('Subject_2', axis=1)
rad_res_2 = chained_radix_join(rad_res_1, property_dicts['likes'],'friendsOf','Subject')
rad_res_2 = rad_res_2.rename({'Object_2':'likes'}, axis=1).drop('Subject_2', axis=1)
rad_res_3 = chained_radix_join(rad_res_2, property_dicts['hasReview'],'likes','Subject')
rad_result = rad_res_3.rename({'Object_2':'hasReview'}, axis=1).drop('Subject_2', axis=1)
end_time = time.time()

rad_result

In [None]:
print('time taken: %s seconds' % (end_time - start_time))

# Task 1 with 'watdiv.10M.nt' Big Data File

In [None]:
# The file is read in batches of batch_size lines, so memory use stays bounded regardless of the
# file size. Only the four properties needed for the query are kept; each of them is appended to
# its own file, and all other properties are skipped to save memory and processing time.

file_path = 'watdiv.10M.nt'
batch_size = 100
file_path_follows = 'follows.txt'
file_path_friendOf = 'friendOf.txt'
file_path_likes = 'likes.txt'
file_path_hasReview = 'hasReview.txt'

# Output file for every property used in the query
output_paths = {
    'follows': file_path_follows,
    'friendOf': file_path_friendOf,
    'likes': file_path_likes,
    'hasReview': file_path_hasReview,
}

# Function to delete a file if it exists
def delete_if_exists(file_path):
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f"Deleted '{file_path}'.")
    else:
        print(f"'{file_path}' does not exist.")

# Delete output files left over from a previous run, because the loop below appends to them.
for path in output_paths.values():
    delete_if_exists(path)


# Read the file in chunks using pd.read_csv() and process each chunk.
# The 'Property' column does not need to be written, because every property has its own file.
for df_chunk in pd.read_csv(file_path, chunksize=batch_size, sep='\t', header=None, names=['Subject', 'Property', 'Object']):
    # Reduce the property URI to its local name (the part after the last '/' or '#').
    df_chunk['Property'] = df_chunk['Property'].str.extract(r'[#/]([^#/>]+)>\s*$', expand=False)
    for prop, path in output_paths.items():
        rows = df_chunk.loc[df_chunk['Property'] == prop, ['Subject', 'Object']]
        if len(rows):
            rows.to_csv(path, mode='a', index=False, header=False)


In [None]:
list_of_dfs = ['follows', 'friendOf', 'likes', 'hasReview']
property_dicts_big = {}

for prop in list_of_dfs:
    property_dicts_big[prop] = pd.read_csv(f'{prop}.txt', header=None, names=['Subject','Object'])
    property_dicts_big[prop]['Object'] = property_dicts_big[prop].Object.str.rstrip(" .")
    # expand=False makes str.extract return a Series instead of a one-column DataFrame.
    property_dicts_big[prop]['Object'] = property_dicts_big[prop]['Object'].str.extract(r'[#/]([^#/>]+)>\s*$', expand=False)
    property_dicts_big[prop]['Subject'] = property_dicts_big[prop]['Subject'].str.extract(r'[#/]([^#/>]+)>\s*$', expand=False)

property_dicts_big['follows']

In [None]:
# finding whether the property tables have unique elements types i.e. users, products, reviews, ...
# would be helpful to transform strings into integers and join them

for prop in list_of_dfs:
    print(f"The {prop} has the following unique values:")
    print("Subject: " + str(property_dicts_big[prop]['Subject'].apply(lambda x: x[:4]).unique()))
    print("Object: " + str(property_dicts_big[prop]['Object'].apply(lambda x: x[:4]).unique()))

In [None]:
import re 

def extract_integer_part(input_string):
    pattern = r'\d+'  # This regex pattern matches one or more digits in the string.
    match = re.search(pattern, input_string)
    if match:
        return int(match.group())  # Convert the matched substring to an integer.
    return None

In [None]:
for prop in list_of_dfs:
    property_dicts_big[prop]['Subject'] = property_dicts_big[prop]['Subject'].apply(extract_integer_part)
    property_dicts_big[prop]['Object'] = property_dicts_big[prop]['Object'].apply(extract_integer_part)

# Task 2 with 'watdiv.10M.nt' Big Data File

## Hash join

In [None]:
# result from applying hash_join
start_time = time.time()
hashed_res_1_big = hash_join(property_dicts_big['follows'], property_dicts_big['friendOf'],'Object','Subject')
hashed_res_1_big = hashed_res_1_big.rename({'Subject':'User', 'Object':'follows', 'Object_2':'friendsOf'}, axis=1).drop('Subject_2', axis=1)
hashed_res_2_big = hash_join(hashed_res_1_big, property_dicts_big['likes'],'friendsOf','Subject')
hashed_res_2_big = hashed_res_2_big.rename({'Object_2':'likes'}, axis=1).drop('Subject_2', axis=1)
hashed_res_3_big = hash_join(hashed_res_2_big, property_dicts_big['hasReview'],'likes','Subject')
hashed_result_big = hashed_res_3_big.rename({'Object_2':'hasReview'}, axis=1).drop('Subject_2', axis=1)
end_time = time.time()

hashed_result_big

In [None]:
print('time taken: %s seconds' % (end_time - start_time))

# Task 3 with 'watdiv.10M.nt' Big Data File

## Radix improved join

In [None]:
# result from applying chained_radix_join (improved version)
start_time = time.time()
rad_res_1_big = chained_radix_join(property_dicts_big['follows'], property_dicts_big['friendOf'],'Object','Subject')
rad_res_1_big = rad_res_1_big.rename({'Subject':'User', 'Object':'follows', 'Object_2':'friendsOf'}, axis=1).drop('Subject_2', axis=1)
rad_res_2_big = chained_radix_join(rad_res_1_big, property_dicts_big['likes'],'friendsOf','Subject')
rad_res_2_big = rad_res_2_big.rename({'Object_2':'likes'}, axis=1).drop('Subject_2', axis=1)
rad_res_3_big = chained_radix_join(rad_res_2_big, property_dicts_big['hasReview'],'likes','Subject')
rad_result_big = rad_res_3_big.rename({'Object_2':'hasReview'}, axis=1).drop('Subject_2', axis=1)
end_time = time.time()

rad_result_big

In [None]:
print('time taken: %s seconds' % (end_time - start_time))