Interning: reusing objects on-demand
At startup, Python preloads (caches) a global list of integers in the range [-5, 256]

Any time an integer in that range is referenced, Python uses the cached version of that object

These cached integers are singletons. This is an optimization strategy: small integers show up often


In [1]:
a = 10
b = 10

In [2]:
id(a)

4520085728

In [3]:
id(b)

4520085728

In [4]:
a = 500
b = 500

In [5]:
id(a)

4556600464

In [6]:
id(b)

4556599472
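The cells above can be condensed into a small probe. This is a minimal sketch, assuming CPython; `is_shared` is a hypothetical helper name, not something from the standard library. It builds each integer twice at runtime and checks whether both results are the same object. Values inside [-5, 256] should be reported as shared; what happens outside that range is an implementation detail and may vary between versions.

```python
def is_shared(n):
    """Return True if two ints equal to n, built at runtime, are one object."""
    # Parse from text so neither value comes from a shared compile-time constant.
    x = int(str(n))
    y = int(str(n))
    return x is y


# Probe values on both sides of the [-5, 256] range.
results = {n: is_shared(n) for n in (-6, -5, 0, 256, 257)}
for n, shared in results.items():
    print(n, shared)
```

Parsing from a string is a deliberate choice here: two equal integer literals in the same piece of compiled code can end up sharing one constant, which would hide the effect of the cache.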

String Interning
Some strings are also automatically interned, but not all
As the Python code is compiled, identifiers are interned
- variable names
- function names
- class names
- etc

Some string literals may also be automatically interned
- string literals that look like identifiers
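To illustrate that automatic interning applies to literals and not to every string, here is a minimal sketch, assuming CPython behavior. A string assembled at runtime is equal to an identifier-like literal, but it is not guaranteed to be the same object until it is passed through `sys.intern`. The variable names are illustrative only.

```python
import sys

literal = 'hello'                # identifier-like string literal
built = ''.join(['hel', 'lo'])   # same characters, assembled at runtime

print(built == literal)          # equal by value
print(built is literal)          # implementation-dependent; commonly False in CPython

# After explicit interning, equal strings resolve to one shared object.
interned = sys.intern(built)
print(interned is sys.intern(literal))
```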


Why do this?
It's all about optimization

Python, both internally and in the code you write, deals with lots and lots of dictionary-type lookups on string keys, which means a lot of string equality testing.

Let's say we want to see if two strings are equal:

a = 'some'
b = 'some'

Using a == b, we need to compare the two strings character by character

But if we know that 'some' has been interned, then a and b are the same string if they both point to the same memory address

In that case we can use a is b instead, which only compares two memory addresses

This is much faster than a character-by-character comparison

In [7]:
# force intern
import sys

a = sys.intern('kaden cho')
b = sys.intern('kaden cho')

In [8]:
a is b

True

When should you do this?
- dealing with a large number of strings that could have repetition e.g. tokenizing a large corpus of text
- lots of string comparisons
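The tokenizing case can be sketched as follows. This is a minimal illustration, not a full tokenizer: `intern_tokens` is a hypothetical helper that splits on whitespace and interns every token, so repeated words share one string object instead of each occurrence holding its own copy.

```python
import sys


def intern_tokens(text):
    """Split text on whitespace and intern every token."""
    return [sys.intern(token) for token in text.split()]


tokens = intern_tokens('the cat saw the other cat')

# Repeated words now refer to the same object, so identity checks are valid.
print(tokens[0] is tokens[3])   # both are 'the'
print(tokens[1] is tokens[5])   # both are 'cat'
print(len(set(map(id, tokens))), 'distinct objects for', len(tokens), 'tokens')
```

Interning has a cost of its own (a lookup in the interned-strings table per call), so it tends to pay off only when the same strings recur many times or are compared often.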

In [9]:
a = 'hello'
b = 'hello'

In [10]:
print(id(a), id(b))

4556854024 4556854024


In [11]:
a = 'hello world'
b = 'hello world'

In [12]:
print(id(a), id(b))

4556507824 4556870768


In [13]:
def compare_using_equals(n):
    # a and b are equal but separate string objects,
    # so == has to compare their contents
    a = 'a long string that is not interned' * 200
    b = 'a long string that is not interned' * 200
    for _ in range(n):
        if a == b:
            pass

In [14]:
def compare_using_interning(n):
    # sys.intern returns the same object for equal strings,
    # so a and b can be compared by identity
    a = sys.intern('a long string that is not interned' * 200)
    b = sys.intern('a long string that is not interned' * 200)
    for _ in range(n):
        if a is b:
            pass

In [15]:
import time

In [16]:
start = time.perf_counter()
compare_using_equals(10_000_000)
end = time.perf_counter()
print('equality', end - start)

equality 2.617037210000035


In [17]:
start = time.perf_counter()
compare_using_interning(10_000_000)
end = time.perf_counter()
print('interning', end - start)

interning 0.3156129870003497
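The same comparison can also be measured with the standard `timeit` module, which times only the comparison expression. This is a sketch under assumptions: `make_string` is a hypothetical helper that builds the string at runtime so each call returns a new object, and the iteration count is kept small. The absolute numbers depend on the machine and Python version.

```python
import sys
import timeit


def make_string():
    """Build the long string at runtime so every call returns a new object."""
    return ' '.join(['a long string that is not interned'] * 200)


# Equal but distinct objects: == has to look at the contents.
a, b = make_string(), make_string()

# Interned copies: equal strings collapse to one object, so `is` is valid.
ia, ib = sys.intern(make_string()), sys.intern(make_string())

equality_time = timeit.timeit('a == b', globals={'a': a, 'b': b}, number=100_000)
identity_time = timeit.timeit('a is b', globals={'a': ia, 'b': ib}, number=100_000)

print('equality', equality_time)
print('identity', identity_time)
```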
