# String Manipulation

String data is commonly used to hold free-form text, semi-structured text, categorical data, and data that should have another type (typically numeric or datetime). We will look at common operations on textual data.

## Strings and Objects

Before pandas 1.0, a series of strings always had the object type. This is unfortunate because object is also the type of a series that holds other Python values (such as lists, dictionaries, or instances of a custom class), and of a series with mixed types: a series containing both numbers and strings is also object. Since pandas 1.0 there is also a dedicated 'string' type.
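
To make the distinction concrete, here is a minimal sketch using small, made-up series (the values are illustrative and not taken from any dataset). It assumes a pandas version that supports the 'string' type (1.0 or later); the default type reported for a plain series of strings can differ between pandas versions, so only the mixed and explicitly converted series are checked.

```python
import pandas as pd

# Hypothetical values, used only for illustration.
names = pd.Series(["Huskies", "Knicks", "Stags"])
mixed = pd.Series(["Huskies", 42, 3.5])  # strings and numbers together

# Explicitly opt in to the dedicated string type.
typed = names.astype("string")

print(mixed.dtype)  # mixed Python values fall back to object
print(typed.dtype)  # the dedicated string type
```

Converting with .astype("string") makes the intent explicit, whereas an object series gives no guarantee about what it contains.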

### The .str Accessor

The object, 'string', and 'category' types have a .str accessor that provides string manipulation methods (for 'category', the categories must be strings). Most of these methods are modeled after the Python string methods, so if you are adept at those, many of the pandas variants should be second nature. The cells below download a sample dataset, load it, and then apply .str.lower, the pandas counterpart of the Python string method .lower:

In [1]:
import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nba_all_elo.csv"

response = requests.get(download_url)
response.raise_for_status()    # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")

Download ready.


In [2]:
import pandas as pd
nba = pd.read_csv("nba_all_elo.csv")
nba.head(2)

Unnamed: 0,gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,...,win_equiv,opp_id,opp_fran,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes
0,1,194611010TRH,NBA,0,1947,11/1/1946,1,0,TRH,Huskies,...,40.29483,NYK,Knicks,68,1300.0,1306.7233,H,L,0.640065,
1,1,194611010TRH,NBA,1,1947,11/1/1946,1,0,NYK,Knicks,...,41.70517,TRH,Huskies,66,1300.0,1293.2767,A,W,0.359935,


In [3]:
nba.fran_id.str.lower()

0           huskies
1            knicks
2             stags
3            knicks
4           falcons
            ...    
126309    cavaliers
126310     warriors
126311    cavaliers
126312    cavaliers
126313     warriors
Name: fran_id, Length: 126314, dtype: object
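
To see the parallel with the built-in method without downloading anything, here is a small sketch on a hypothetical three-element series. The values, including the missing entry, are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample values; None stands in for a missing entry.
names = pd.Series(["Huskies", "Knicks", None])

# Built-in method on a single Python string.
single = "Huskies".lower()

# Vectorized pandas counterpart, applied to every element.
lowered = names.str.lower()

print(single)
print(lowered)
```

Unlike calling .lower on each element yourself, the .str version passes missing values through instead of raising an error.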

In [4]:
nba.fran_id.str.capitalize()

0           Huskies
1            Knicks
2             Stags
3            Knicks
4           Falcons
            ...    
126309    Cavaliers
126310     Warriors
126311    Cavaliers
126312    Cavaliers
126313     Warriors
Name: fran_id, Length: 126314, dtype: object

In [5]:
(
    nba
    .fran_id
    .str
    .startswith('Hus')
)

0          True
1         False
2         False
3         False
4         False
          ...  
126309    False
126310    False
126311    False
126312    False
126313    False
Name: fran_id, Length: 126314, dtype: bool

In [6]:
(
    nba
    .fran_id
    .str
    .extract(r'([a-e])', expand=False)
)

0         e
1         c
2         a
3         c
4         a
         ..
126309    a
126310    a
126311    a
126312    a
126313    a
Name: fran_id, Length: 126314, dtype: object

### Searching

There are a few methods that leverage regular expressions to perform searching, replacing, and splitting.

To find the first non-alphabetic character in each value (ignoring spaces), you could use this code:

In [7]:
(
    nba
    .fran_id
    .str
    .extract(r'([^a-z A-Z])')
)

Unnamed: 0,0
0,
1,
2,
3,
4,
...,...
126309,
126310,
126311,
126312,


This returns a dataframe that has mostly missing values and by inspection is not very useful. If we collapse it into a series (with the parameter expand=False), we can chain the .value_counts method to view the count of non-missing values:

In [8]:
(
    nba
    .fran_id
    .str
    .extract(r'([^a-z A-Z])', expand=False)
    .value_counts()
)

Series([], Name: fran_id, dtype: int64)
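
The empty result suggests that the franchise names contain only letters and spaces. To see what the same pattern returns when there is something to find, here is a sketch on a hypothetical series that does include digits and punctuation (these values are made up for illustration):

```python
import pandas as pd

# Hypothetical names, chosen so that some contain non-alphabetic characters.
names = pd.Series(["76ers", "Knicks", "Trail Blazers", "Team-9"])

pattern = r"([^a-z A-Z])"  # one capture group: anything but a letter or space

# Default (expand=True): a dataframe with one column per capture group.
as_frame = names.str.extract(pattern)

# expand=False with a single group: a series holding the first match per row.
as_series = names.str.extract(pattern, expand=False)

# Missing values (rows with no match) are excluded from the counts.
counts = as_series.value_counts()
print(counts)
```

Note that .str.extract keeps only the first match in each string, so "Team-9" contributes the dash and not the digit.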

### Replacing Text

Both the series and the .str accessor have a .replace method, and their functionality overlaps. To replace single characters or other substrings, I typically use .str.replace; when I have complete replacements for many of the values, I use .replace.

If I wanted to replace a capital "H" with "Å" (the Unicode letter A with a ring above it), I could use this code:

In [9]:
nba.fran_id.str.replace('H', 'Å')

0           Åuskies
1            Knicks
2             Stags
3            Knicks
4           Falcons
            ...    
126309    Cavaliers
126310     Warriors
126311    Cavaliers
126312    Cavaliers
126313     Warriors
Name: fran_id, Length: 126314, dtype: object

You can use a dictionary to specify complete replacements. This is very explicit, but it becomes tedious when many values need the same change: if you had 20,000 numeric values containing dashes and wanted to strip the dashes from all of them, you would have to create a dictionary with an entry for every value. Here is a dictionary replacement:

In [10]:
nba.fran_id.replace({'Huskies': 'Åuskies', 'Knicks': 'Ånicks'})[0:2]

0    Åuskies
1     Ånicks
Name: fran_id, dtype: object

Alternatively, passing regex=True tells .replace to treat the pattern as a regular expression and to replace just the matching portion of each string:

In [11]:
nba.fran_id.replace('H|K', 'Å', regex=True)[0:2]

0    Åuskies
1     Ånicks
Name: fran_id, dtype: object

An important note:

> I use .str.replace to replace substrings, and .replace to replace mappings of complete strings.
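
The difference between the two methods can be sketched side by side. The series below are hypothetical, including the dash-separated numbers, and regex is passed explicitly because the default for .str.replace has changed between pandas versions.

```python
import pandas as pd

# Hypothetical data for illustration only.
teams = pd.Series(["Huskies", "Knicks", "Stags"])
codes = pd.Series(["555-0100", "555-0199"])

# Substring replacement: one call strips the dash from every value.
no_dashes = codes.str.replace("-", "", regex=False)

# Series.replace with a plain string only matches complete values,
# so "H" on its own matches nothing here.
unchanged = teams.replace("H", "Å")

# Mapping of complete values to their replacements.
mapped = teams.replace({"Huskies": "Åuskies"})

# Regular-expression substring replacement through the .str accessor.
regexed = teams.str.replace("H|K", "Å", regex=True)

print(no_dashes.tolist())
print(unchanged.tolist())
print(mapped.tolist())
print(regexed.tolist())
```

The dash case shows why .str.replace scales better for substrings: one call handles every value, with no dictionary to build.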