# String Manipulation

In [1]:
import sys
import os
import random

import pandas as pd
import numpy as np
import pyarrow as pa
sys.path.append(os.path.abspath(".."))

from extras.utils import load_vehicle_data

## Loading Data

In [2]:
df = load_vehicle_data()
make = df.make

## Strings and objects

In [3]:
# The default type is string[pyarrow] as we set the engine to pyarrow
make

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: string[pyarrow]

In [4]:
# astype(str) returns the object dtype, the usual string representation in pandas 1.x
make.astype(str)

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

In [5]:
# Making a "full circle", converting to object and then back to string[pyarrow]
string_pa = pd.ArrowDtype(pa.string())
make.astype(str).astype(string_pa)

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: string[pyarrow]

The string[pyarrow] dtype generally uses less memory and is faster than object, but some older methods do not support it.

## Categorical Strings

In [6]:
# Converting to a categorical saves memory and improves performance, with the same caveat as string[pyarrow]
make.astype("category")

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: category
Categories (136, string[pyarrow]): [AM General, ASC Incorporated, Acura, Alfa Romeo, ..., Volvo, Wallace Environmental, Yugo, smart]
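
The memory saving can be checked directly. The following is a minimal sketch on made-up data (the `toy_make` series is an assumption, not the vehicle dataset) and uses plain object strings so that it runs without pyarrow:

```python
import pandas as pd

# Toy data: five make names repeated many times (hypothetical, not the real dataset)
toy_make = pd.Series(["Alfa Romeo", "Ferrari", "Dodge", "Subaru", "Ford"] * 2_000)

as_object = toy_make.astype(object)
as_category = toy_make.astype("category")

# deep=True also counts the Python string objects, not just the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The categorical stores each distinct string once plus a small integer code per row, so the gap grows with the number of repeated values.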

## The .str Accessor

Many methods on the .str accessor mirror the methods of built-in Python strings:

In [7]:
"Ford".lower()

'ford'

In [8]:
make.str.lower()

0        alfa romeo
1           ferrari
2             dodge
3             dodge
4            subaru
            ...    
41139        subaru
41140        subaru
41141        subaru
41142        subaru
41143        subaru
Name: make, Length: 41144, dtype: string[pyarrow]

In [9]:
"Alfa Romeo".find("A")

0

In [10]:
make.str.find("A")

0         0
1        -1
2        -1
3        -1
4        -1
         ..
41139    -1
41140    -1
41141    -1
41142    -1
41143    -1
Name: make, Length: 41144, dtype: int32[pyarrow]
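
One practical difference from built-in strings is the handling of missing values. The sketch below, on a small made-up series with object dtype (an assumption, to keep it independent of pyarrow), compares an explicit loop with the accessor:

```python
import pandas as pd

# Hypothetical toy series containing one missing value
toy_make = pd.Series(["Alfa Romeo", "Ferrari", None, "Acura"], dtype=object)

# Built-in str methods need an explicit loop and a guard for missing values
python_way = [m.find("A") if m is not None else None for m in toy_make]

# The .str accessor applies the method element-wise and keeps missing values missing
pandas_way = toy_make.str.find("A")

print(python_way)
print(pandas_way.tolist())
```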

## Searching

After pandas 2.0.2, the traditional call with an unnamed capture group no longer works on this series. The example below tries to extract the first non-alphabetic character (anything other than a letter or a space):

In [11]:
print(make.str.extract(r'([^a-z A-Z])'))

ValueError: pat='([^a-z A-Z])' must contain a symbolic group name.

This can be worked around by converting the series to object type or by using a named capture group:

In [12]:
print(make.str.extract(r'(?P<non_alpha>[^a-z A-Z])'))

      non_alpha
0          <NA>
1          <NA>
2          <NA>
3          <NA>
4          <NA>
...         ...
41139      <NA>
41140      <NA>
41141      <NA>
41142      <NA>
41143      <NA>

[41144 rows x 1 columns]
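
The other workaround, converting to object type, can be sketched as follows. The toy series is made up for illustration; on the real data the equivalent would presumably be `make.astype(object)` before calling `.str.extract`:

```python
import pandas as pd

# Hypothetical toy series stored with object dtype
toy_make = pd.Series(["Mercedes-Benz", "Ford", "Rolls-Royce", "Subaru"], dtype=object)

# With object dtype an unnamed capture group is accepted;
# the resulting column is labelled by the group position (0)
unnamed = toy_make.str.extract(r"([^a-z A-Z])")

# A named group works as well and gives the column a readable label
named = toy_make.str.extract(r"(?P<non_alpha>[^a-z A-Z])")

print(unnamed)
print(named)
```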


The result is not very helpful: it is a dataframe made up mostly of missing values, because most makes contain no non-alphabetic characters. Collapsing the result to a series with expand=False and chaining .value_counts gives a better view:

In [13]:
(
    make
    .str
    .extract(r'(?P<non_alpha>[^a-z A-Z])', expand=False)
    .value_counts()
)

non_alpha
-    1727
.      46
,       9
Name: count, dtype: int64[pyarrow]

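Similarly, to search for non-numeric characters, one could use the following. Because .str.extract returns only the first match, this counts the first non-numeric character of each make: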

In [14]:
(
    make
    .str
    .extract(r'(?P<non_numeric>[^0-9])', expand=False)
    .value_counts()
)

non_numeric
C    5336
M    4833
F    3686
B    2796
G    2691
D    2679
P    2589
S    2234
T    2159
V    2001
H    1803
A    1610
N    1471
J    1435
L    1241
I     860
K     618
O     462
R     392
E     167
s      38
W      32
Y       8
Q       3
Name: count, dtype: int64[pyarrow]
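
The counts above look like first letters because .str.extract stops at the first match in each string. A minimal sketch of the difference between .str.extract and .str.extractall, on a made-up object-dtype series (an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy series
toy = pd.Series(["Ford", "3M", "BMW"], dtype=object)

# extract: only the first non-numeric character of each string
first_only = toy.str.extract(r"(?P<non_numeric>[^0-9])", expand=False)

# extractall: one row per match, indexed by (original row, match number)
every_match = toy.str.extractall(r"(?P<non_numeric>[^0-9])")["non_numeric"]

print(first_only.tolist())
print(every_match.value_counts())
```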

## Splitting

In [15]:
age = pd.Series(
    ["0-10", "11-15", "11-15", "61-65", "46-50"],
    dtype=string_pa
    )
age

0     0-10
1    11-15
2    11-15
3    61-65
4    46-50
dtype: string[pyarrow]

In [16]:
age.str.split("-")

0     ['0' '10']
1    ['11' '15']
2    ['11' '15']
3    ['61' '65']
4    ['46' '50']
dtype: list<item: string>[pyarrow]

A series of lists is awkward to manipulate. Passing expand=True returns a dataframe with one column per piece instead. From there, .iloc can pull the first (or second) column as an age value, and .astype can convert the strings to integers.

In [17]:
age.str.split("-", expand=True)

    0   1
0   0  10
1  11  15
2  11  15
3  61  65
4  46  50


In [18]:
(
    age
    .str.split("-", expand=True)
    .iloc[:, 0]
    .astype("int8[pyarrow]")
)

0     0
1    11
2    11
3    61
4    46
Name: 0, dtype: int8[pyarrow]
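
If only one bound is needed, the dataframe can be skipped by indexing into the lists directly. This is a sketch on a made-up object-dtype series with NumPy's int8 (both assumptions; list indexing on pyarrow-backed list columns may behave differently):

```python
import pandas as pd

# Hypothetical toy series of age ranges
toy_age = pd.Series(["0-10", "11-15", "61-65"], dtype=object)

# .str[i] indexes into each list produced by split
lower = toy_age.str.split("-").str[0].astype("int8")
upper = toy_age.str.split("-").str[-1].astype("int8")

print(lower.tolist())
print(upper.tolist())
```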

Similarly, this can be done through other methods, shown here to obtain the second column:

In [19]:
# slice(-2) keeps the last two characters of each string.
# The same trick cannot extract the lower bound, because it has either one or two digits.
(
    age
    .str.slice(-2)
    .astype("int8[pyarrow]")
)

0    10
1    15
2    15
3    65
4    50
dtype: int8[pyarrow]

In [20]:
(
    age
    .str[-2:]
    .astype("int8[pyarrow]")
)

0    10
1    15
2    15
3    65
4    50
dtype: int8[pyarrow]
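
Fixed-width slicing breaks down when the number of digits varies. A regular expression with named groups handles both bounds regardless of width; the sketch below assumes a made-up object-dtype series and NumPy's int8:

```python
import pandas as pd

# Hypothetical toy series of age ranges
toy_age = pd.Series(["0-10", "11-15", "61-65"], dtype=object)

bounds = (
    toy_age
    # Each named group becomes a column: digits before and after the hyphen
    .str.extract(r"(?P<lower>\d+)-(?P<upper>\d+)")
    .astype("int8")
)

print(bounds)
```

The named groups also label the columns, which avoids a separate rename step.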

To obtain the midpoint of each range, average the two bounds:

In [21]:
(
    age
    .str.split("-", expand=True)
    .astype("int8[pyarrow]")
    .mean(axis="columns") # If not specified, the mean is calculated for each column
)

0     5.0
1    13.0
2    13.0
3    63.0
4    48.0
dtype: double[pyarrow]

To draw a random integer within each range (both bounds included), the following can be done:

In [22]:
def between(row):
    return random.randint(*row.values)

(
    age
    .str.split("-", expand=True)
    .astype(int)
    .apply(between, axis="columns")
)

0     2
1    11
2    12
3    65
4    46
dtype: int64

## Removing Apply

The method above can be made faster by avoiding apply. The following code splits the age column in two and renames the resulting columns after their bounds. It then converts them to integers, creates a column of random numbers (between 0 and 1), and computes an age as the lower bound plus the random number times the width of the range (upper minus lower). In the output shown, age remains fractional: the final cast to int8[pyarrow] did not take effect, and errors="ignore" returns the original values when a cast fails.

In [23]:
print(
    age
    .str.split("-", expand=True)
    .rename(columns={0:"lower", 1:"upper"})
    .astype("int8[pyarrow]")
    .assign(
        rand=np.random.rand(len(age)),
        age=lambda df_: (
            df_.lower + (df_.rand * (df_.upper - df_.lower))
        ).astype("int8[pyarrow]", errors="ignore")
    )
)

   lower  upper      rand        age
0      0     10  0.790828   7.908283
1     11     15  0.470926  12.883704
2     11     15  0.995712  14.982847
3     61     65  0.512752  63.051009
4     46     50  0.443301  47.773205
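
The age column in the output above is fractional. If whole-number ages are wanted, one option is to scale the random number by the width plus one and take the floor. This is only a sketch under assumptions: a made-up series, NumPy's int64 rather than int8[pyarrow], and a seeded generator for repeatability:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series of age ranges
toy_age = pd.Series(["0-10", "11-15", "61-65"], dtype=object)
rng = np.random.default_rng(42)  # seeded only to make the sketch repeatable

result = (
    toy_age
    .str.split("-", expand=True)
    .rename(columns={0: "lower", 1: "upper"})
    .astype("int64")
    .assign(
        rand=lambda df_: rng.random(len(df_)),
        # floor(rand * (width + 1)) is uniform over 0..width,
        # so adding lower covers every integer from lower to upper
        age=lambda df_: (
            df_.lower + np.floor(df_.rand * (df_.upper - df_.lower + 1))
        ).astype("int64"),
    )
)

print(result)
```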


## Optimising with NumPy

Vectorised NumPy operations speed this up further. Note that np.random.randint excludes the upper bound, whereas random.randint includes it.

In [24]:
(
    age
    .str.split("-", expand=True)
    .astype(int)
    .pipe(lambda df_: pd.Series(
        np.random.randint(df_.iloc[:,0], df_.iloc[:,1]),
        index=df_.index
        )
    )
)

0     6
1    11
2    14
3    62
4    49
dtype: int64
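
For draws that can also return the upper bound, NumPy's newer Generator API offers an endpoint flag. A minimal sketch, assuming a made-up series and an arbitrary seed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series of age ranges
toy_age = pd.Series(["0-10", "11-15", "61-65"], dtype=object)
rng = np.random.default_rng(0)  # the seed is arbitrary

bounds = toy_age.str.split("-", expand=True).astype("int64")
lower = bounds.iloc[:, 0].to_numpy()
upper = bounds.iloc[:, 1].to_numpy()

# endpoint=True makes the interval closed, so upper itself can be drawn;
# low and high are arrays, giving one draw per row
draws = pd.Series(rng.integers(lower, upper, endpoint=True), index=toy_age.index)

print(draws)
```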

### Benchmarking

In [25]:
age_100k = (
    age
    .sample(
        100_000,
        replace=True,
        random_state=42
    )
    .reset_index(drop=True)
)

In [28]:
%%timeit
# Timing apply method
(
    age_100k
    .str.split("-", expand=True)
    .astype("int8[pyarrow]")
    .apply(between, axis="columns")
)

11.7 s ± 1.21 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [29]:
%%timeit
(
    age_100k
    .str.split("-", expand=True)
    .rename(columns={0:"lower", 1:"upper"})
    .astype("int8[pyarrow]")
    .assign(
        rand=np.random.rand(len(age_100k)),
        age=lambda df_: (
            df_.lower + (df_.rand * (df_.upper - df_.lower))
        )
        .astype("int8[pyarrow]", errors="ignore")
    )
)

38 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
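
Outside a notebook, where %%timeit is unavailable, a rough comparison can be made with time.perf_counter. The sketch below is an assumption-laden stand-in (small made-up sample, NumPy int64, single run), so its timings are only indicative:

```python
import random
import time

import numpy as np
import pandas as pd


def between(row):
    # Inclusive random integer between the two values of the row
    return random.randint(*row.values)


# Hypothetical sample, much smaller than the 100k rows used above
toy_age = pd.Series(["0-10", "11-15", "61-65", "46-50"] * 500, dtype=object)
bounds = toy_age.str.split("-", expand=True).astype("int64")
lower = bounds.iloc[:, 0].to_numpy()
upper = bounds.iloc[:, 1].to_numpy()

start = time.perf_counter()
via_apply = bounds.apply(between, axis="columns")
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
# upper + 1 because np.random.randint excludes the upper bound
via_numpy = pd.Series(np.random.randint(lower, upper + 1), index=bounds.index)
numpy_seconds = time.perf_counter() - start

print(f"apply: {apply_seconds:.4f} s")
print(f"numpy: {numpy_seconds:.4f} s")
```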


## Replacing text

In [31]:
make.str.replace('A', 'Å')

0        Ålfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: string[pyarrow]

In [32]:
# Returns an empty series because == compares whole values, and no make is exactly 'A'
make[make == 'A']

Series([], Name: make, dtype: string[pyarrow])

In [34]:
# Series.replace also matches whole values by default, so nothing changes here
make.replace('A', 'Å')

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: string[pyarrow]

In [35]:
# With regex=True the pattern is matched inside each string, so the replacement works
make.replace('A', 'Å', regex=True)

0        Ålfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: string[pyarrow]
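
The three behaviours can be seen side by side on a small made-up series that does contain a value equal to 'A'. This sketch uses object dtype (an assumption, to avoid depending on pyarrow):

```python
import pandas as pd

# Hypothetical toy series; the second value is exactly "A"
toy = pd.Series(["Alfa Romeo", "A", "Acura"], dtype=object)

# Series.replace: replaces values that are equal to "A" as a whole
whole_value = toy.replace("A", "Å")

# Series.str.replace: replaces the substring "A" inside every string
substring = toy.str.replace("A", "Å")

# Series.replace with regex=True: the pattern is searched inside each string
with_regex = toy.replace("A", "Å", regex=True)

print(whole_value.tolist())
print(substring.tolist())
print(with_regex.tolist())
```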