![](img/logo.png)

# String Interning in Python 3.6

In [1]:
import sys
sys.version

'3.6.5 (default, Mar 29 2018, 03:28:50) \n[GCC 5.4.0 20160609]'

In [2]:
s1 = 'parrot_is_dead'
s2 = 'parrot_is_dead'

In [3]:
s1 == s2

True

In [4]:
s1 is s2

True

In [5]:
id(s1), id(s2)

(139784549578800, 139784549578800)

`s1` and `s2` refer to **the same object** in memory. Sharing is safe here because strings are immutable: nothing can modify the object through one name and surprise the other.

This behavior is called **interning**. It works a bit like caching.

> In computer science, **string interning** is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a **string intern pool**.

[https://en.wikipedia.org/wiki/String_interning]
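
To make the definition concrete, here is a minimal sketch of an intern pool in pure Python. The `InternPool` class and its `intern` method are made up for illustration; CPython keeps its own pool internally in C, so this only mirrors the idea, not the real implementation.

```python
class InternPool:
    """Toy intern pool: keeps one canonical object per distinct string value."""

    def __init__(self):
        self._pool = {}

    def intern(self, s):
        # Return the stored copy if an equal string is already pooled,
        # otherwise store `s` itself and return it.
        return self._pool.setdefault(s, s)


pool = InternPool()

# Build two equal strings at runtime so they start out as separate objects.
a = ' '.join(['parrot', 'is', 'dead'])
b = ' '.join(['parrot', 'is', 'dead'])

a_interned = pool.intern(a)
b_interned = pool.intern(b)

print(a == b)                    # compares values
print(a_interned is b_interned)  # compares identities after pooling
```

The lookup costs a hash and a dictionary access each time a string is interned, which is the "more time when the string is created or interned" trade-off mentioned in the quote.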

In [6]:
s3 = 'parrot is dead'
s4 = 'parrot is dead'
assert s3 == s4
s3 is s4

False

`s3` and `s4` refer to **different objects** in memory. Why?

This is not a matter of length:

In [7]:
len(s1) == len(s2) == len(s3) == len(s4)

True

![](img/thinking.png)

## Implementation of string interning

[**`PyUnicode_InternInPlace`**](https://github.com/python/cpython/blob/master/Doc/c-api/unicode.rst)
```
.. c:function:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place.  The argument must be the address of a
   pointer variable pointing to a Python unicode string object.  If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it [...].
```


source-code: https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L15170

So it's all about making the pointer refer to a string object that's already there.

`PyUnicode_InternInPlace` is called by **`intern_string_constants`**, which *probably* is what we are looking for [[source](https://github.com/python/cpython/blob/b7e1eff8436f6e0c4aac440036092fcf96f82960/Objects/codeobject.c#L47)].

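It interns a string constant only if **`all_name_chars()`** returns 1, that is, only if the string consists entirely of ASCII letters, digits, and underscores.

It's implemented in the following way: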

```C
static int
all_name_chars(PyObject *o)
{
    const unsigned char *s, *e;

    if (!PyUnicode_IS_ASCII(o))
        return 0;

    s = PyUnicode_1BYTE_DATA(o);
    e = s + PyUnicode_GET_LENGTH(o);
    for (; s != e; s++) {
        if (!Py_ISALNUM(*s) && *s != '_')
            return 0;
    }
    return 1;
}
```
https://github.com/python/cpython/blob/b7e1eff8436f6e0c4aac440036092fcf96f82960/Objects/codeobject.c#L15
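
For experimenting from Python, the check can be approximated as below. This is a rough port written for this notebook, not something CPython exposes; the `NAME_CHARS` constant is an assumption of the sketch, and the function only predicts whether a string *constant* qualifies for automatic interning.

```python
import string

# Characters accepted by the C helper: ASCII letters, digits and underscore.
NAME_CHARS = frozenset(string.ascii_letters + string.digits + '_')


def all_name_chars(s):
    """Rough Python equivalent of CPython's all_name_chars()."""
    # Non-ASCII characters are not in NAME_CHARS, which mirrors the
    # PyUnicode_IS_ASCII early exit in the C version.
    return all(ch in NAME_CHARS for ch in s)


for candidate in ['parrot_is_dead', 'parrot is dead', 'a_bc', '##']:
    print(repr(candidate), all_name_chars(candidate))
```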






In [8]:
s1 = 'abc'
s2 = 'abc'
assert s1 == s2
s1 is s2

True

In [9]:
s1 = 'a bc'
s2 = 'a bc'
assert s1 == s2
s1 is s2

False

In [10]:
s1 = 'a_bc'
s2 = 'a_bc'
assert s1 == s2
s1 is s2

True

In [11]:
s1 = 'a '
s2 = 'a '
assert s1 == s2
s1 is s2

False

In [12]:
s1 = 'a '
s2 = 'a '
assert s1 == s2
s1 is s2

False

In [13]:
s1 = '##'
s2 = '##'
assert s1 == s2
s1 is s2

False

In [14]:
s1 = '!!'
s2 = '!!'
assert s1 == s2
s1 is s2

False

In [15]:
# But...

In [16]:
s1 = '#'
s2 = '#'
assert s1 == s2
s1 is s2

True

In [17]:
s1 = '!'
s2 = '!'
assert s1 == s2
s1 is s2

True

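One-character strings seem to be shared by default, whatever the character is. This does not go through `all_name_chars()`: CPython keeps a separate cache of single-character strings (at least for the Latin-1 range).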
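
A quick way to check this outside the notebook is to build the strings at runtime, so the compiler has no chance to share constants. This is a sketch that relies on CPython implementation details; other interpreters may behave differently.

```python
text = 'a#b'
n = 2  # kept in a variable so the strings below are built at runtime

# Single characters obtained in two different ways at runtime.
c1 = text[1]
c2 = chr(35)  # chr(35) == '#'
print(c1 == c2, c1 is c2)

# Two-character strings built at runtime are separate objects.
d1 = '#' * n
d2 = '#' * n
print(d1 == d2, d1 is d2)
```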

### Also...

Strings created at runtime by concatenation or `format` are not interned either:

In [18]:
s1 = 'a_parrot'
s1 += '_is_dead'
s2 = 'a_parrot_is_dead'
assert s1 == s2
s1 is s2

False

In [19]:
s1 = 'parrot{}'.format('_ded')
s2 = 'parrot_ded'
assert s1 == s2
s1 is s2

False
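
The same effect can be reproduced in a plain script. Identical literals compiled together may end up sharing one constant regardless of interning, so this sketch builds every string at runtime from a list (the `parts` name is just for illustration, and CPython behaviour is assumed).

```python
parts = ['parrot', '_ded']

# Three different ways to build the same value at runtime.
s1 = ''.join(parts)
s2 = '{}{}'.format(*parts)
s3 = parts[0] + parts[1]

print(s1 == s2 == s3)                 # same value everywhere
print(s1 is s2, s2 is s3, s1 is s3)   # identity of each pair
print(len({id(s1), id(s2), id(s3)}))  # number of distinct objects
```

Even though the value contains only name characters, none of the results is interned, because `intern_string_constants` only looks at constants stored in a code object.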

## Forcing string interning

In [20]:
from sys import intern

In [21]:
s1 = intern('abc')
s2 = intern('abc')
assert s1 == s2
s1 is s2
# no change, as expected

True

In [22]:
s1 = intern('a ')
s2 = intern('a ')
assert s1 == s2
s1 is s2

True

In [23]:
s1 = 'the_parrot'
s1 += '_is_dead'
s1 = intern(s1)

s2 = 'the_parrot_is_dead'  # already interned (no "special" chars)

assert s1 == s2
s1 is s2

True

In [24]:
s1 = intern('the parrot {}'.format('is ded :('))
s2 = intern('the parrot is ded :(')
assert s1 == s2
s1 is s2

True

### Use-cases

- Loading a big, string-heavy dataset into Python (e.g. from a huge CSV file).

- ...
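
A minimal sketch of the CSV use-case, with a tiny in-memory file standing in for a huge one. The column names and values are invented for the example; the point is only that repeated values collapse to one object each after `sys.intern`.

```python
import csv
import io
import sys

# Hypothetical in-memory CSV standing in for a large file.
data = io.StringIO(
    'city,temp\n'
    'Oslo,3\n'
    'Bergen,5\n'
    'Oslo,4\n'
    'Bergen,6\n'
)

reader = csv.reader(data)
next(reader)  # skip the header row
rows = list(reader)

raw_cities = [row[0] for row in rows]
interned_cities = [sys.intern(city) for city in raw_cities]

# Count distinct string objects (not distinct values) in each list.
print(len({id(c) for c in raw_cities}))
print(len({id(c) for c in interned_cities}))
```

Once the raw list is dropped, only one object per distinct city stays alive, which is where the memory saving comes from when a column has few distinct values and many rows.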

### Thanks!