This problem is the same as `merge` in Pandas and an inner join in SQL. One solution: loop through list1, and for each element of list1, loop through every element of list2. That's a double for-loop, so the complexity of such code would be **O(N^2)**.
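As a rough sketch of that double loop (the name `nested_loop_join` is only illustrative, and rows are assumed to be lists whose first element is the join key):

```python
def nested_loop_join(list1, list2):
    """Naive inner join on the first element of each row."""
    result = []
    for row1 in list1:          # N iterations
        for row2 in list2:      # N iterations for each row of list1
            if row1[0] == row2[0]:
                # key, then the remaining columns of both rows
                result.append([row1[0]] + row1[1:] + row2[1:])
    return result


print(nested_loop_join([[1, 'a'], [2, 'b']], [[2, 'x'], [3, 'y']]))
# prints [[2, 'b', 'x']]
```

Every pair of rows is compared once, which is where the N^2 comes from.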

Instead, we can build a dictionary from list1, then loop through the elements of list2 and look up each key in that dictionary. Dictionary lookups take constant time on average, so the complexity would be **O(N)**.

In [1]:
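def my_join(list1, list2):
    # Index list1 by its first element: key -> remaining columns
    dict1 = {item[0]: item[1:] for item in list1}
    result = []

    # Probe the index with each row of list2; keep only matching keys
    for item in list2:
        if item[0] in dict1:
            result.append([item[0]] + dict1[item[0]] + item[1:])

    return result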

In general, though, the dictionary-building part is done before running the join/merge query. That step is called **indexing**. A dictionary-type index is called a **hash table index**. Other index types are also used, such as the B-tree. Each has its advantages and disadvantages. We'll be looking at that tomorrow.
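To make the idea of indexing ahead of time concrete, here is a minimal sketch that separates the two steps. The helper names `build_index` and `join_with_index` are hypothetical, and keys are assumed to be unique:

```python
def build_index(rows):
    """Build a hash index: key (first element) -> remaining columns."""
    return {row[0]: row[1:] for row in rows}


def join_with_index(index, rows):
    """Probe a prebuilt index with each row; one dictionary lookup per row."""
    return [[row[0]] + index[row[0]] + row[1:]
            for row in rows if row[0] in index]


# Pay the cost of building the index once...
index = build_index([[1, 'a'], [2, 'b']])

# ...then reuse it for as many joins as needed.
print(join_with_index(index, [[2, 'x'], [3, 'y']]))  # prints [[2, 'b', 'x']]
print(join_with_index(index, [[1, 'z']]))            # prints [[1, 'a', 'z']]
```

Once the index exists, each join only has to loop over the other list.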

In [2]:
def inner_join(list1, list2):
    # Index both lists by their first element: key -> remaining columns
    dict1 = {item[0]: item[1:] for item in list1}
    dict2 = {item[0]: item[1:] for item in list2}

    result = []

    # Keep only the keys present in both indexes
    for key, value in dict1.items():
        if key in dict2:
            result.append([key] + value + dict2[key])

    return result

Below is a little routine to test the performance of the functions. You can use it to test your code. Try varying the lists, dictionaries, and list comprehensions and see how the performance changes. Of course, the timings will also depend on the speed of your computer.

In [5]:
import random
random.seed(127)
numbers = list(range(1000))  # a real list: random.shuffle cannot shuffle a Python 3 range
random.shuffle(numbers)
test1 = [[index] + [random.random() for _ in range(100)] for index in numbers]
random.shuffle(numbers)
test2 = [[index] + [random.random() for _ in range(100)] for index in numbers]

In [6]:
%timeit my_join(test1, test2)
%timeit inner_join(test1, test2)

100 loops, best of 3: 3.75 ms per loop
100 loops, best of 3: 4.98 ms per loop


For simplicity, the functions above assume the indexes are unique in each list, which, as we know, is usually not the case. We can change the function to handle duplicates by storing a list of lists for each dictionary key.

In [7]:
from collections import defaultdict
def my_join(list1, list2):
    dict1 = defaultdict(list)
    for item in list1:
        dict1[item[0]].append(item[1:])
    result = []
    for item in list2:
        if item[0] in dict1:
            for elem in dict1[item[0]]:
                result.append([item[0]] + elem + item[1:])
    return result

In [8]:
my_join([[1,2],[2,3],[2,20]], [[2,7],[1,4],[2,11]])

[[2, 3, 7], [2, 20, 7], [1, 2, 4], [2, 3, 11], [2, 20, 11]]

The complexity of the function above is still O(N) when the indexes are unique or mostly unique. The worst case is when all the index values in both lists are the same. In that case, the join produces N^2 rows, so there is no way to avoid O(N^2).
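A quick way to see that worst case is to count the output rows when every row shares one key. The sketch below uses a self-contained copy of the duplicate-aware join under the hypothetical name `join_with_duplicates`:

```python
from collections import defaultdict


def join_with_duplicates(list1, list2):
    """Inner join on the first element, allowing repeated keys."""
    index = defaultdict(list)
    for item in list1:
        index[item[0]].append(item[1:])
    result = []
    for item in list2:
        # .get avoids creating empty entries for keys missing from list1
        for elem in index.get(item[0], []):
            result.append([item[0]] + elem + item[1:])
    return result


n = 50
same_key = [[0, i] for i in range(n)]  # every row has the key 0
rows = join_with_duplicates(same_key, same_key)
print(len(rows))  # prints 2500, i.e. n * n
```

Since the result itself contains N^2 rows, just writing it out already takes O(N^2) time.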