In [None]:
import pandas as pd
import numpy as np

# Vectorized String Operations

Text data shows up in most real-world datasets. Pandas provides vectorized string methods that clean and transform every element of a `Series` in a single call.

These functions are accessed via the `.str` attribute of a `Series` that contains strings.

## Introducing Pandas String Operations

Let's start with a simple `Series` to see how the `.str` accessor works.

In [None]:
data = pd.Series(['peter', 'Paul', 'MARY', 'gUIDO'])
data

In [None]:
# The .str accessor applies a string method such as upper() to every element.
# Calling data.upper() directly raises AttributeError, because Series has no upper() method.
data.str.upper()
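
As a minimal sketch of why the accessor matters, the cell below compares `.str.upper()` with a direct call on the `Series`. The `names` variable and its `None` entry are illustrative additions, not part of the data above.

```python
import pandas as pd

# Illustrative data: the same names as above, plus one missing value.
names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# The .str accessor works element-wise and passes missing values through.
upper_names = names.str.upper()

# A direct call fails: the Series object itself has no upper() method.
try:
    names.upper()
    direct_call_failed = False
except AttributeError:
    direct_call_failed = True

print(upper_names)
print(direct_call_failed)
```

The missing entry stays missing in the result instead of raising an error, which is usually what you want when cleaning real data.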

## Tables of Pandas String Methods

Pandas provides dozens of string methods. They generally mirror the built-in string methods in Python, but work on an entire `Series` at once.

### Methods for case conversion

In [None]:
data.str.lower()   # Convert to lowercase
data.str.upper()   # Convert to uppercase
data.str.capitalize() # Capitalize the first character

### Methods for splitting and joining

In [None]:
# The split() method splits each string into a list of substrings.
data_split = pd.Series(['a_b_c', 'd_e_f', 'g_h_i'])
data_split.str.split('_')
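
A short sketch of two related options, assuming the same underscore-separated values: `expand=True` spreads the pieces across columns, and `str.join()` glues a list of pieces back together. The variable names `codes`, `parts`, and `rejoined` are illustrative.

```python
import pandas as pd

codes = pd.Series(['a_b_c', 'd_e_f', 'g_h_i'])

# expand=True returns a DataFrame with one column per piece
# instead of a Series of lists.
parts = codes.str.split('_', expand=True)

# str.join() reverses the split: each list of pieces is joined
# with the given separator.
rejoined = codes.str.split('_').str.join('-')

print(parts)
print(rejoined)
```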

### Methods for finding and replacing

In [None]:
# The contains() method returns a boolean Series indicating whether each string contains a pattern.
# The match is case-sensitive by default, so 'peter' does not match 'P'.
data.str.contains('P')
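
The sketch below illustrates two commonly used parameters of `contains()`: `case=False` for case-insensitive matching and `na=False` so that missing values count as "no match". The `names` data, including its `None` entry, is an illustrative assumption.

```python
import pandas as pd

# Illustrative data with one missing value.
names = pd.Series(['peter', 'Paul', 'MARY', 'gUIDO', None])

# Case-sensitive match; na=False turns the missing value into False.
case_sensitive = names.str.contains('P', na=False)

# Case-insensitive match: both 'peter' and 'Paul' now qualify.
case_insensitive = names.str.contains('p', case=False, na=False)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```

Passing `na=False` keeps the result strictly boolean, which makes it safe to use directly as a row mask.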

In [None]:
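# The replace() method substitutes occurrences of a pattern.
# regex=False states explicitly that 'peter' is a literal string, not a regular expression.
data.str.replace('peter', 'Charlie', regex=False)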
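
To contrast literal and pattern-based replacement, here is a minimal sketch. The vowel-masking pattern is purely illustrative and not something the dataset requires.

```python
import pandas as pd

names = pd.Series(['peter', 'Paul', 'MARY', 'gUIDO'])

# Literal replacement: the pattern is treated as plain text.
literal = names.str.replace('peter', 'Charlie', regex=False)

# Regex replacement: every run of vowels (either case) becomes '*'.
masked = names.str.replace(r'[aeiouAEIOU]+', '*', regex=True)

print(literal.tolist())
print(masked.tolist())
```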

### Methods using regular expressions

Many methods also accept regular expressions, so a single call can match a whole family of strings rather than one fixed substring.

In [None]:
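# The extract() method returns one column per capture group in the regular expression.
# Rows that do not match (here 'c3') are filled with NaN.
data_regex = pd.Series(['a1', 'b2', 'c3'])
# A raw string (r'...') keeps Python from treating \d as an invalid escape sequence.
data_regex.str.extract(r'([ab])(\d)')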
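
Named capture groups make the output easier to read, because `extract()` uses the group names as column labels. The sketch below assumes the same three values; the group names `letter` and `digit` are illustrative choices.

```python
import pandas as pd

codes = pd.Series(['a1', 'b2', 'c3'])

# (?P<name>...) defines a named group; each name becomes a column label.
extracted = codes.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')

print(extracted)
```

The last row contains only missing values, because `'c3'` does not start with `a` or `b`.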

## Example: Recipe Database

Let's use these methods on a small, sample recipe dataset to see them in action.

In [None]:
recipes = pd.DataFrame({
    'name': ['Spaghetti Bolognese', 'Chicken Noodle Soup', 'Vegan Chili', 'Fish and Chips'],
    'cuisine': ['Italian', 'American', 'Mexican', 'British'],
    'ingredients': ['pasta, ground beef, tomatoes, onion', 
                    'chicken broth, chicken, noodles, carrots', 
                    'beans, tomatoes, onion, chili powder', 
                    'fish, potatoes, flour, oil']
})
recipes

Let's say we want to find all recipes whose name contains the word "Chicken". Note that the match is case-sensitive and looks only at the `name` column.

In [None]:
recipes[recipes['name'].str.contains('Chicken')]
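
A name-only, case-sensitive search can miss recipes. As a hedged sketch, the cell below rebuilds the relevant columns of the sample data and searches both `name` and `ingredients` without regard to case. The variable names `in_name`, `in_ingredients`, and `chicken_recipes` are illustrative.

```python
import pandas as pd

# The name and ingredients columns of the sample data above.
recipes = pd.DataFrame({
    'name': ['Spaghetti Bolognese', 'Chicken Noodle Soup',
             'Vegan Chili', 'Fish and Chips'],
    'ingredients': ['pasta, ground beef, tomatoes, onion',
                    'chicken broth, chicken, noodles, carrots',
                    'beans, tomatoes, onion, chili powder',
                    'fish, potatoes, flour, oil'],
})

# Build one boolean mask per column, ignoring case.
in_name = recipes['name'].str.contains('chicken', case=False)
in_ingredients = recipes['ingredients'].str.contains('chicken', case=False)

# Combine the masks with | (element-wise OR) and filter the rows.
chicken_recipes = recipes[in_name | in_ingredients]
print(chicken_recipes)
```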

We can also work with the ingredients column. Because each entry is a single string, `str.len()` returns its length in characters, not the number of ingredients.

In [None]:
recipes['ingredients'].str.len()

What if we want to create a new column that tells us how many ingredients each recipe has? We can use `str.split()` and then `str.len()` on the resulting lists.

In [None]:
recipes['ingredient_count'] = recipes['ingredients'].str.split(',').str.len()
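
As a cross-check, and assuming every ingredient is separated by exactly one comma, counting commas and adding one should give the same numbers as splitting. The sketch below uses a standalone copy of the ingredient strings; the variable names are illustrative.

```python
import pandas as pd

ingredients = pd.Series([
    'pasta, ground beef, tomatoes, onion',
    'chicken broth, chicken, noodles, carrots',
    'beans, tomatoes, onion, chili powder',
    'fish, potatoes, flour, oil',
])

# Approach from the text: split into lists, then take each list's length.
count_by_split = ingredients.str.split(',').str.len()

# Alternative: count the separators and add one.
count_by_commas = ingredients.str.count(',') + 1

print(count_by_split.tolist())
print(count_by_commas.tolist())
```

The comma-counting version avoids building intermediate lists, but it would overcount if a string ended with a trailing comma.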
recipes

Let's find all recipes that use tomatoes.

In [None]:
recipes[recipes['ingredients'].str.contains('tomatoes')]

We can even create a new `Series` that contains just the first ingredient of each recipe.

In [None]:
recipes['ingredients'].str.split(',').str.get(0)
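
Going one step further, a possible way to look at individual ingredients is to split, `explode()` the lists into one row per ingredient, and strip the leftover spaces. This is a sketch on a standalone copy of the data, indexed by recipe name for readability; `long_form` and `counts` are illustrative names.

```python
import pandas as pd

ingredients = pd.Series(
    ['pasta, ground beef, tomatoes, onion',
     'chicken broth, chicken, noodles, carrots',
     'beans, tomatoes, onion, chili powder',
     'fish, potatoes, flour, oil'],
    index=['Spaghetti Bolognese', 'Chicken Noodle Soup',
           'Vegan Chili', 'Fish and Chips'],
)

# Split on commas, put each ingredient on its own row,
# then strip the space that followed each comma.
long_form = ingredients.str.split(',').explode().str.strip()

# Count how many recipes use each ingredient.
counts = long_form.value_counts()

print(long_form)
print(counts)
```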