# New string data type + upcoming Arrow support

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv")

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Explaining dtypes

In [4]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

<div style="font-size:120%">

> "You can assume that "object" dtype means you have string data ..."
    
</div>
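
The quotation marks are deliberate: `object` dtype only says that every element is some Python object. A minimal sketch, using made-up values, of a column that pandas reports as `object` although not every element is a string:

```python
import pandas as pd

# Illustrative values: a string, an integer and a missing value in one column
mixed = pd.Series(["a", 1, None])

# Mixed Python objects are stored with object dtype
print(mixed.dtype)

# The Python type of each element shows that "object" does not imply "string"
kinds = [type(value).__name__ for value in mixed]
print(kinds)
```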

## Dedicated "string" data type

Introduced in pandas 1.0 (as an experimental feature): https://pandas.pydata.org/docs/dev/whatsnew/v1.0.0.html#dedicated-string-data-type

In [5]:
df = df.convert_dtypes(convert_string=True, convert_integer=False, convert_floating=False)

In [6]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            string
Sex             string
Age            float64
SibSp            int64
Parch            int64
Ticket          string
Fare           float64
Cabin           string
Embarked        string
dtype: object

We have strings now!

Creating a Series with the dtype manually:

In [7]:
s = pd.Series(["a", "b", "c"], dtype="string")
s

0    a
1    b
2    c
dtype: string

In [8]:
s[0] = 10

ValueError: Cannot set non-string value '10' into a StringArray.

<div style="font-size:120%">

-> The implementation is almost exactly the same (it still stores Python strings in an object-dtype NumPy array), but the intent is much clearer!
    
</div>
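
A small sketch of what the clearer intent buys in practice, assuming only the public `"string"` dtype alias. The exception class raised on a bad assignment has differed between pandas releases, so the example catches both candidates instead of relying on one:

```python
import pandas as pd

s = pd.Series(["a", None, "c"], dtype="string")

# None is stored as a missing value rather than as a Python object
n_missing = int(s.isna().sum())

# Assigning a non-string value is refused instead of silently accepted
try:
    s[0] = 10
    rejected = False
except (TypeError, ValueError):
    rejected = True

print(n_missing, rejected)
```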

## Native string dtype using Apache Arrow

This is a work in progress (an initial version is expected to land in pandas 1.2 or 1.3); see https://github.com/pandas-dev/pandas/issues/35169

In [9]:
df = pd.read_csv("string_data.csv")  # pass nrows=1000 for a quick trial run
df.head()

Unnamed: 0,code
0,"P67,Y2,F50"
1,Y67
2,I18
3,"G94,D75,G12,K42,H91,L30,Z85,U87,X40"
4,S22


In [10]:
s = df["code"]

In [11]:
s_python = s.astype("string")

In [12]:
from pandas.core.arrays.string_arrow import ArrowStringDtype, ArrowStringArray
s_arrow = s.astype(ArrowStringDtype())

In [13]:
s_arrow.head()

0                             P67,Y2,F50
1                                    Y67
2                                    I18
3    G94,D75,G12,K42,H91,L30,Z85,U87,X40
4                                    S22
Name: code, dtype: arrow_string

**Better memory usage**

In [14]:
"{:.2f} MiB".format(s_python.memory_usage(deep=True) / 1024**2)

'658.35 MiB'

In [15]:
"{:.2f} MiB".format(s_arrow.memory_usage(deep=True) / 1024**2)

'152.90 MiB'
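
The import from `pandas.core.arrays.string_arrow` above is a private, in-development path. As a hedged sketch for trying the comparison on a small scale, the following assumes pandas 1.3 or later, where the storage backend can be chosen through the public `pd.StringDtype(storage=...)`. The data is a small synthetic stand-in for the `code` column, and the example falls back to Python storage when `pyarrow` is not installed:

```python
import pandas as pd

# Use Arrow-backed storage if pyarrow is available, otherwise fall back
try:
    import pyarrow  # noqa: F401
    storage = "pyarrow"
except ImportError:
    storage = "python"

# Small synthetic stand-in for the "code" column (values are illustrative)
s = pd.Series(["P67,Y2,F50", "Y67", "I18", "S22"] * 25_000, dtype=object)

s_python = s.astype(pd.StringDtype(storage="python"))
s_other = s.astype(pd.StringDtype(storage=storage))

# Same measurement as in the cells above, printed per storage backend
for label, ser in [("python", s_python), (storage, s_other)]:
    print(label, "{:.2f} MiB".format(ser.memory_usage(deep=True) / 1024**2))
```

The numbers depend on the data and on the pandas and pyarrow versions, so they will not match the figures measured on the 10 million row file.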

**Faster string operations**

Converting to lower case:

In [16]:
%timeit s_python.str.lower()

2.54 s ± 592 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
%timeit s_arrow.str.lower()

610 ms ± 72.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Equality check:

In [18]:
%timeit s_python == "A1"

1.14 s ± 150 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
%timeit s_arrow == "A1"

73.4 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Contains check:

In [20]:
%timeit s_python.str.contains("A1", regex=False)

1.83 s ± 184 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [21]:
%timeit s_arrow.str.contains("A1", regex=False)

275 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [22]:
s_python.str.contains("A1", regex=False).sum() == s_arrow.str.contains("A1", regex=False).sum()

True

**How does this work?**

- Apache Arrow has an efficient memory representation for variable-length strings + a growing library of computational kernels
- In pandas, we can optionally store a `pyarrow.array` of strings instead of an object-dtype numpy array
- BUT! setitem operations are less efficient: Arrow arrays are immutable, so updating a value means building a new array
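
The points above can be illustrated with a simplified, pure-Python sketch of the layout idea: all string bytes live in one contiguous data buffer, and an offsets buffer marks where each value starts and ends. This is not how pandas or pyarrow are implemented. The helper names `pack`, `get` and `set_value` are hypothetical, and the real Arrow format additionally carries a validity bitmap for missing values.

```python
from array import array

def pack(strings):
    """Pack strings into one UTF-8 data buffer plus an offsets buffer."""
    data = bytearray()
    offsets = array("i", [0])  # value i spans data[offsets[i]:offsets[i + 1]]
    for text in strings:
        data += text.encode("utf-8")
        offsets.append(len(data))
    return bytes(data), offsets

def get(data, offsets, i):
    """Read value i by slicing the shared data buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

def set_value(data, offsets, i, new):
    """Replace value i by unpacking everything and packing it again."""
    values = [get(data, offsets, j) for j in range(len(offsets) - 1)]
    values[i] = new
    return pack(values)

data, offsets = pack(["Y67", "I18", "P67,Y2,F50"])
print(data)                   # b'Y67I18P67,Y2,F50'
print(list(offsets))          # [0, 3, 6, 16]
print(get(data, offsets, 2))  # P67,Y2,F50

# Changing one value shifts every later offset, hence the costly setitem
data, offsets = set_value(data, offsets, 1, "G94,D75")
print(list(offsets))          # [0, 3, 10, 20]
```

Compared with one Python `str` object per element, the packed form has no per-value object overhead, which is the intuition behind the lower memory use, and kernels can scan a single buffer instead of chasing pointers.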

**Thanks to**
 
* CZI for funding this work
* Maarten Breddels and the Arrow team for implementing string kernels in Arrow
* Uwe Korn, Tom Augspurger and Simon Hawkins for the work integrating this in pandas

In [29]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Quite happy with my first major contribution to <a href="https://twitter.com/ApacheArrow?ref_src=twsrc%5Etfw">@ApacheArrow</a> which is a redo/upstreaming of the <a href="https://twitter.com/vaex_io?ref_src=twsrc%5Etfw">@vaex_io</a> string algorithms. From 2min12 → 8 seconds on half a billion strings (single-threaded). <a href="https://t.co/BSjjBgMSpt">pic.twitter.com/BSjjBgMSpt</a></p>&mdash; Maarten A. Breddels (@maartenbreddels) <a href="https://twitter.com/maartenbreddels/status/1278047178808799233?ref_src=twsrc%5Etfw">June 30, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

Generation of string data:

In [1]:
# copied from https://github.com/hmelberg/health-analytics-using-python/blob/master/4_Organizing_your_data_The_answer_is_half_long.ipynb

import numpy as np
import pandas as pd

def make_data(n, letters=26, numbers=100, seed=False):
    """
    Generate a dataframe with a column of random codes

    Args:
    n (int): The number of rows (events) to generate
    letters (int): The number of different letters to use
    numbers (int): The number of different numbers to use
    seed (bool): If True, seed the random generator for reproducible output

    Returns
    A dataframe with a column with one or more codes in the rows

    """
    # each code is assumed to consist of a letter and a number
    alphabet = list('abcdefghijklmnopqrstuvwxyz')
    letters = alphabet[:letters]

    # make random numbers same if seed is specified
    if seed:
        np.random.seed(0)

    # determine the number of codes to be drawn for each event
    n_codes = np.random.negative_binomial(1, p=0.3, size=n)
    # avoid zero (all events have to have at least one code)
    n_codes = n_codes + 1

    # for each event, randomly generate the number of codes specified by n_codes
    codes = []
    for i in n_codes:
        # int() truncates, so the numeric part ranges from 1 to numbers - 1
        diag = [np.random.choice(letters).upper() +
                str(int(np.random.uniform(low=1, high=numbers)))
                for _ in range(i)]

        code_string = ','.join(diag)
        codes.append(code_string)

    # create a dataframe based on the list
    df = pd.DataFrame(codes, columns=['code'])

    return df

In [2]:
df = make_data(10_000_000)

In [3]:
df.to_csv("string_data.csv", index=False)