**comprehensions: introduction**

In [1]:
old_list = [1, 2, 3, 4, 5]
new_list = []
for i in old_list:
    new_list.append(i * 2)
new_list


[2, 4, 6, 8, 10]

The loop above doubles each number in the list. This simple operation can also be done on a single line with a list comprehension.

In [2]:
[i * 2 for i in old_list]


[2, 4, 6, 8, 10]

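There are two ways of using an if-statement in a comprehension. You can filter items by placing an `if` after the `for` clause, or you can change the value that is emitted by placing an `if ... else ...` expression before it. The cells below show the filter, then the conditional expression, then both combined: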

In [3]:
[i for i in range(16) if i % 2 == 0]


[0, 2, 4, 6, 8, 10, 12, 14]

In [4]:
[i if i % 2 == 0 else i * 2 for i in range(16)]


[0, 2, 2, 6, 4, 10, 6, 14, 8, 18, 10, 22, 12, 26, 14, 30]

In [5]:
[i if i % 2 == 0 else i * 2 for i in range(16) if i % 3 == 0]


[0, 6, 6, 18, 12, 30]
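To make the difference between the two positions of `if` concrete, the sketch below expands the last comprehension into an explicit loop. The names `result` and `value` are introduced here for illustration only.

```python
# Loop equivalent of:
# [i if i % 2 == 0 else i * 2 for i in range(16) if i % 3 == 0]
result = []
for i in range(16):
    # The trailing `if` is a filter: it decides whether i is kept at all.
    if i % 3 == 0:
        # The leading `if ... else ...` is an expression: it decides the value.
        value = i if i % 2 == 0 else i * 2
        result.append(value)

print(result)  # [0, 6, 6, 18, 12, 30]
```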

In [6]:
old_list = 'abcde'
new_list = []
for i in range(len(old_list)):
    if i % 2 == 0:
        new_list.append(old_list[i])
new_list


['a', 'c', 'e']

It can be turned into a one-liner by using a comprehension and the enumerate function.

In [7]:
[char for idx, char in enumerate('abcde') if idx % 2 == 0]


['a', 'c', 'e']

In [8]:
old_list = 'abcde'
new_list = []
for i, c in enumerate(old_list):
    if i % 2 == 0:
        if c in 'aeuio':
            char = c.upper()
        else:
            char = c
        new_list.append(char)
new_list


['A', 'c', 'E']
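The loop above combines a filter with a conditional value, so it can also be written as a single comprehension. This is a sketch of one possible rewrite; the name `vowels` is chosen here and does not appear in the original cell.

```python
old_list = 'abcde'
vowels = 'aeuio'

# Keep characters at even positions; uppercase the ones that are vowels.
new_list = [
    c.upper() if c in vowels else c
    for i, c in enumerate(old_list)
    if i % 2 == 0
]

print(new_list)  # ['A', 'c', 'E']
```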

**comprehensions: nested**

In [9]:
for i in range(5):
    for j in range(i):
        print((i, j))


(1, 0)
(2, 0)
(2, 1)
(3, 0)
(3, 1)
(3, 2)
(4, 0)
(4, 1)
(4, 2)
(4, 3)


In [10]:
[(i, j) for i in range(5) for j in range(i)]


[(1, 0),
 (2, 0),
 (2, 1),
 (3, 0),
 (3, 1),
 (3, 2),
 (4, 0),
 (4, 1),
 (4, 2),
 (4, 3)]

In [11]:
for i in range(5):
    if i > 2:
        for j in range(i):
            if j < 2:
                print((i, j))


(3, 0)
(3, 1)
(4, 0)
(4, 1)


In [12]:
[(i, j) for i in range(5) if i > 2 for j in range(i) if j < 2]


[(3, 0), (3, 1), (4, 0), (4, 1)]
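The clauses of a nested comprehension are read left to right, in the same order as the nested loops they replace. A common use of this is flattening a list of lists; the sketch below uses a made-up `matrix` as input.

```python
matrix = [[1, 2, 3], [4, 5], [6]]

# Outer loop first (for row in matrix), inner loop second (for item in row).
flat = [item for row in matrix for item in row]

print(flat)  # [1, 2, 3, 4, 5, 6]
```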

**comprehensions: dict**

In [13]:
[c for i, c in enumerate('abceabce') if i < 5]


['a', 'b', 'c', 'e', 'a']

In [14]:
{c for i, c in enumerate('abceabce')}


{'a', 'b', 'c', 'e'}

In [15]:
tuple(c for i, c in enumerate('abceabce'))


('a', 'b', 'c', 'e', 'a', 'b', 'c', 'e')

In [16]:
{i: c for i, c in enumerate('abceabce') if i < 5}


{0: 'a', 1: 'b', 2: 'c', 3: 'e', 4: 'a'}

**comprehensions: unique**

In [17]:
{i: c for i, c in enumerate('abcdefa')}


{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'a'}

In [18]:
{c: i for i, c in enumerate('abcdefa')}


{'a': 6, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5}
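In the last cell the character 'a' appears twice in the input, and the later index (6) overwrites the earlier one (0) because dictionary keys are unique. If the first occurrence should win instead, one option is to iterate in reverse. A minimal sketch, with `text` and `first_seen` as names chosen here:

```python
text = 'abcdefa'

# Later assignments overwrite earlier ones, so iterating in reverse
# leaves the first occurrence of each character as the final value.
first_seen = {c: i for i, c in reversed(list(enumerate(text)))}

print(first_seen['a'])  # 0
```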

**comprehensions: unpack**

In [19]:
arr = [('a', 1), ('b', 2), ('c', 2)]
for idx, (char, i) in enumerate(arr):
    print(idx, char, i)


0 a 1
1 b 2
2 c 2


In [20]:
[{key: value, 'i': idx} for idx, (key, value) in enumerate(arr)]


[{'a': 1, 'i': 0}, {'b': 2, 'i': 1}, {'c': 2, 'i': 2}]

**comprehensions: zip**

In [21]:
d = {'a': 1, 'b': 2, 'c': 3}
[(k, v) for k, v in d.items()]


[('a', 1), ('b', 2), ('c', 3)]

Using .items() gives you access to both the keys and the values. You could use it, for example, to quickly double all the values in a dictionary:

In [22]:
d = {'a': 1, 'b': 2}
{k: v * 2 for k, v in d.items()}


{'a': 2, 'b': 4}

Another function to be aware of is zip. It allows you to "zip" lists together like a zipper.

In [23]:
[(a, b) for a, b in zip([1, 2, 3], [4, 5, 6])]


[(1, 4), (2, 5), (3, 6)]

In [24]:
[(a, b, c) for a, b, c in zip([1, 2, 3], [4, 5, 6], [7, 8, 9])]


[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
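Two details about zip are worth seeing in code: it combines naturally with a dict comprehension, and it stops at the shortest input. The lists below are made up for this sketch.

```python
keys = ['a', 'b', 'c']
values = [1, 2, 3]

# Pair each key with the value at the same position.
paired = {k: v for k, v in zip(keys, values)}
print(paired)  # {'a': 1, 'b': 2, 'c': 3}

# zip stops at the shortest input, so the extra 'c' is silently dropped here.
shorter = [(k, v) for k, v in zip(keys, [10, 20])]
print(shorter)  # [('a', 10), ('b', 20)]
```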


**lunr.py**

In [27]:
import pandas as pd

df = pd.read_csv("/content/clinc.csv").assign(idx=lambda d: d.index)
df.sample(10)


Unnamed: 0,text,label,idx
13369,what is my next day off,next_holiday,13369
9128,all right,yes,9128
5067,i really need to switch to a new insurance plan,insurance_change,5067
236,is it safe for me to visit malawi,travel_alert,236
11095,thanks for your response,thank_you,11095
10295,schedule an uber for 3 to go to the airport,uber,10295
10922,i must disconnect from my phone,sync_device,10922
20948,can you get a call started to martha,make_call,20948
23625,how can i root an android phone,oos,23625
10440,what would tomorrow's date be,date,10440


In [28]:
documents = df.to_dict(orient="records")


In [30]:
!pip install lunr


Collecting lunr
  Downloading lunr-0.7.0.post1-py3-none-any.whl (35 kB)
Installing collected packages: lunr
Successfully installed lunr-0.7.0.post1


The lunr function has three parameters.

ref is the key in the documents to be used as the reference.
fields is a sequence of keys to index from the documents.
documents is the list of dictionaries that represent the documents to be indexed.
This gives us an index variable that we can use to query our data.

In [31]:
from lunr import lunr

index = lunr(ref='idx', fields=('text',), documents=documents)


In [32]:
index.search('spanish')
# [{'ref': '4501', 'score': 7.801, 'match_data': <MatchData "spanish">},
#  {'ref': '3', 'score': 7.62, 'match_data': <MatchData "spanish">},
#  {'ref': '26', 'score': 7.62, 'match_data': <MatchData "spanish">},
#  ...
#  {'ref': '19726', 'score': 5.065, 'match_data': <MatchData "spanish">}]


[{'ref': '4501', 'score': 7.801, 'match_data': <MatchData "spanish">},
 {'ref': '3', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '26', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '27', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '28', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4526', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4529', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4556', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4573', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4575', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4576', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4585', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '5638', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19505', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19507', 'score': 

This index gives us a score as well as a reference to our original data. We can re-use this to get our original documents again.

In [33]:
[documents[int(i['ref'])] for i in index.search('spanish')]
# [{'text': "can you tell me how to say 'i do not speak much spanish', in spanish",
#  'label': 'translate',
#  'idx': 4501},
# {'text': 'how do you say fast in spanish', 'label': 'translate', 'idx': 3},
# {'text': 'what is dog in spanish', 'label': 'translate', 'idx': 26},
# ...
# {'text': 'please change your language setting to spanish now',
#  'label': 'change_language',
#  'idx': 19726}]


[{'text': "can you tell me how to say 'i do not speak much spanish', in spanish",
  'label': 'translate',
  'idx': 4501},
 {'text': 'how do you say fast in spanish', 'label': 'translate', 'idx': 3},
 {'text': 'what is dog in spanish', 'label': 'translate', 'idx': 26},
 {'text': 'how do you say dog in spanish', 'label': 'translate', 'idx': 27},
 {'text': 'dog in spanish', 'label': 'translate', 'idx': 28},
 {'text': 'how can i say not now in spanish',
  'label': 'translate',
  'idx': 4526},
 {'text': 'how do you say goodbye in spanish',
  'label': 'translate',
  'idx': 4529},
 {'text': 'what is spanish for hello', 'label': 'translate', 'idx': 4556},
 {'text': 'how do you say thank you in spanish',
  'label': 'translate',
  'idx': 4573},
 {'text': 'how can i say thank you in spanish',
  'label': 'translate',
  'idx': 4575},
 {'text': 'what is thank you in spanish', 'label': 'translate', 'idx': 4576},
 {'text': 'how do you say cat in spanish', 'label': 'translate', 'idx': 4585},
 {'text': 
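
Indexing `documents` with `int(i['ref'])` works here because `idx` was copied from the DataFrame's default index, so it matches each document's position in the list. If that is not guaranteed (for example after filtering rows), a lookup dictionary keyed on the reference is a safer option. The sketch below uses a two-item `documents` list and a hand-written `results` list that only mimics the shape of the search output shown earlier; it does not call lunr, and `by_ref` and `matches` are names chosen here.

```python
# Two documents whose idx values do not match their list positions.
documents = [
    {'text': 'what is dog in spanish', 'label': 'translate', 'idx': 26},
    {'text': 'all right', 'label': 'yes', 'idx': 9128},
]

# Hypothetical search results with the same 'ref'/'score' keys as the real output.
results = [{'ref': '26', 'score': 7.62}]

# Refs come back as strings, so key the lookup on str(idx).
by_ref = {str(doc['idx']): doc for doc in documents}
matches = [by_ref[r['ref']] for r in results]

print(matches[0]['text'])  # what is dog in spanish
```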

**Benchmark**
We were curious about the performance statistics, so we ran some comparisons.

We observe that Lunr is faster at retrieving data than either a list comprehension or a Pandas query, even if we re-use the index to fetch the original documents.

In [35]:
%timeit df.loc[lambda d: d['text'].str.contains("spanish")]
# 4.79 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit [d for d in documents if 'spanish' in d['text']]
# 1.86 ms ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit index.search('spanish')
# 304 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit [documents[int(i['ref'])] for i in index.search('spanish')]
# 309 µs ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


12.5 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.87 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
804 µs ± 220 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
715 µs ± 72.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
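
`%timeit` is an IPython magic, so it only works inside a notebook or IPython shell. Outside of one, the standard library `timeit` module can run a similar comparison. The sketch below is not how lunr is implemented; it uses synthetic documents and a toy inverted index (all names are made up here) only to illustrate why a prebuilt index can answer a query without scanning every document.

```python
import timeit

# Synthetic stand-in for the real dataset (made up for this sketch).
documents = [
    {
        'text': f'how do you say word{n} in spanish' if n % 50 == 0
        else f'utterance number {n}',
        'idx': n,
    }
    for n in range(5000)
]

def scan(query):
    """Linear scan: check every document on every query."""
    return [d for d in documents if query in d['text']]

# Toy inverted index: built once, maps each token to the idx values containing it.
inverted = {}
for d in documents:
    for token in set(d['text'].split()):
        inverted.setdefault(token, []).append(d['idx'])

def lookup(query):
    """Index lookup: one dictionary access, then fetch the matching documents."""
    return [documents[i] for i in inverted.get(query, [])]

# Both approaches should agree on this whole-word query.
assert scan('spanish') == lookup('spanish')

print('scan  :', timeit.timeit(lambda: scan('spanish'), number=100))
print('lookup:', timeit.timeit(lambda: lookup('spanish'), number=100))
```

Note that the scan matches substrings while the toy index matches whole tokens, so the two only agree for queries that are complete words.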


**Use case**: Lunr is great for smaller datasets that fit on a single machine and for rapid prototyping. It doesn't have great support for typos, and it won't scale well once your dataset grows bigger.