OPTIMIZATIONS INTERNING

Interning: Reusing objects on-demand.
At startup, Python pr-loads (cahes) a global list of integers in the range [-5, 256] = Singleton objects.
Any time an integer is referenced in that range, Python will use the cached version of that object.

Singleton objects are classes that can only be instanciated once. i.e. the original startup.

When we write
  a = 10
Python just has to point to the existing reference for 10.

If I write
  a = 257
Python does not use the global list and a new object is created every time.



In [1]:
[-5, 256]

[-5, 256]

In [3]:
a = 100
b = 10

In [4]:
id(a)

2275262223824

In [5]:
id(b)

2275262032464

In [6]:
a = 20

In [7]:
b = 20

In [8]:
id(a)

2275262032784

In [9]:
id(b)

2275262032784

In [10]:
a = -5
b = -5

In [11]:
print(id(a), id(b))

2275262031984 2275262031984


In [12]:
a is b

True

In [16]:
a = 256
b = 256
a is b

True

In [18]:
a = 257
b = 257
a is b

False

In [19]:
a = 10

In [20]:
b = int(10)

In [22]:
c = int('10')

In [25]:
d = int('1010', 2)

In [26]:
print(a, b, c, d)

10 10 10 10


In [27]:
print(id(a), id(b), id(c), id(d))

2275262032464 2275262032464 2275262032464 2275262032464


Some strings are also automatically interned - but not all!
As Python code is complied, identifiers are interned.
    variable names
    function names
    class names
    etc.

Some string literals may also be automatically interned:
    String literals that look like identifiers (e. g. 'hello_world')
    Although if it starts with a digit, even though that is not a valid identifier, it may still get interned. BUT, DON'T COUNT ON IT.

Why:

It's all about (speed and, possibly, memory) optimiation.

Python, both internally, and in code we write, deals with lots and lots of dictionary lookups, on string keys, which means a lot of string equality testing.

Let's say we want to seeif two equal:

    a = 'some_long_string'      b = 'some_long_string'

Using a == b, we need to compare the two strings 'character by character'.

But, if we know that 'some long string' has been interned, then a and b are the same thing if they both point to the same memory address.

In which case we can use a is b instead - which compares to integers (memor address).

This is much faster than the character by character comparision.

Not all strings are interned by Python.

But, I can force strings to be interned by using the sys.intern() method.


In [28]:
import sys

a = sys.intern( 'the quick brown fox')          a is b -> True
b = sys.intern( 'the quick brown fox')          much faster than a == b

When should I do this?
    DEALING WITH A LARGE NUMBER OF STRINGS THAT COULD HAVE A HIGH REPETITIOM.
            e.g. tokenizing a large corpus of text (NLP)
    lots of string comparisons.

In [30]:
a = 'hello'
b = 'hello'

In [31]:
print(id(a), id(b))

2275347157360 2275347157360


In [32]:
a = 'hello world'
b = 'hello world'

In [33]:
print(id(a), id(b))

2275351288496 2275351289648


In [34]:
a == b

True

In [35]:
a is b

False

In [36]:
a = 'hello world'
b = 'hello world'
a == b

True

In [37]:
a is b

False

In [38]:
a = '_this_is_a_long_string_that_could_be_used_as_an_identifier'

In [39]:
b = '_this_is_a_long_string_that_could_be_used_as_an_identifier'

In [40]:
a is b

True

In [41]:
import sys

In [42]:
a = sys.intern('hello world')

In [43]:
b = sys.intern('hello world')

In [44]:
c = 'hello world'

In [45]:
print(id(a), id(b), id(c))

2275351289264 2275351289264 2275344279600


In [46]:
a == b

True

In [47]:
a is b

True

In [49]:
def compare_using_equals(n):
    a = 'a long string that is not interned' * 200
    b = 'a long string that is not interned' * 200
    for i in range(n):
        if a == b:
            pass

In [50]:
def compare_using_equals(n):
    a = sys.intern('a long string that is not interned' * 200)
    b = sys.intern('a long string that is not interned' * 200)
    for i in range(n):
        if a is b:
            pass

In [51]:
import time

In [55]:
start = time.perf_counter()
compare_using_equals(10000000)
end = time.perf_counter()
print('equality', end-start)

equality 0.2517725000006976
