# Session 22: Vectorised String Operations

In [1]:
import pandas as pd
import numpy as np

## Vectorised Operations

### What are Vectorised Operations?

Vectorised operations are operations that are applied **element-wise** to an entire array or column **at once**, without using explicit loops.

Example using NumPy:

```python
a = np.array([1, 2, 3, 4])
a * 4
```

Output:

```
array([ 4,  8, 12, 16])
```

Here, multiplication by `4` is applied to **every element of the array in a single expression**. The loop still runs, but inside NumPy's compiled code rather than in Python.
This approach is:

* Faster
* More readable
* Optimized at a lower (C) level

The same concept is extended to **strings and dates in Pandas**, which is especially useful when working with columns in a DataFrame.
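
To make the contrast concrete, here is a minimal sketch that computes the same result with an explicit Python loop and with a vectorised expression. The variable names `looped` and `vectorised` are illustrative only.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Explicit loop: Python performs one multiplication per iteration
looped = np.array([x * 4 for x in a])

# Vectorised: a single expression; the iteration happens inside NumPy
vectorised = a * 4

# Both approaches give the same values
same = np.array_equal(looped, vectorised)
```

The results are identical; the difference is where the loop runs and how much code you have to write.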

---

## Vectorised String Operations

In real-world datasets, string data often appears in columns (e.g., names, categories, cities).
We usually need to perform the **same string operation on every row**.

### Problem with Vanilla Python Approach

Consider this example:

```python
s = ['cat', 'mat', None, 'rat']
[i.startswith('c') for i in s]
```

This code has two major problems:

1. **Error Handling Issue**
   It raises:

   ```
   AttributeError: 'NoneType' object has no attribute 'startswith'
   ```

   because `None` does not have string methods.

2. **Performance Issue**

   * Uses Python loops
   * Slow for large datasets
   * Not suitable for data analysis at scale
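
For comparison, a sketch of what the plain-Python version needs in order to survive the `None` entry: every element has to be type-checked by hand before a string method is called. The name `result` is illustrative.

```python
s = ['cat', 'mat', None, 'rat']

# Guard every element manually: only call startswith on real strings,
# and carry the missing value through as None
result = [i.startswith('c') if isinstance(i, str) else None for i in s]
```

This works, but the guard has to be repeated for every string operation, and the loop still runs in Python.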

---

### Pandas Solution: Vectorised String Operations

Pandas provides a **string accessor** called `.str` that allows vectorised string operations.

```python
s = pd.Series(['cat', 'mat', None, 'rat'])
s.str.startswith('c')
```

Output:

```
0     True
1    False
2     None
3    False
dtype: object
```

#### Why this is better:

* Propagates missing values (`None`, `NaN`) as missing instead of raising an error
* No explicit loops
* Faster and memory-efficient
* Designed specifically for column-wise operations
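
One practical consequence of missing values being propagated: the result may not be a clean boolean mask, and depending on the pandas version, using it directly for filtering can fail. A minimal sketch, assuming you want missing entries treated as non-matches, is to pass `na=False`:

```python
import pandas as pd

s = pd.Series(['cat', 'mat', None, 'rat'])

# na=False fills the result for missing entries with False,
# so the output is a plain boolean mask
mask = s.str.startswith('c', na=False)

# The mask can now be used safely for boolean indexing
filtered = s[mask]
```

Whether `False` is the right fill value depends on the analysis; `na=True` is also accepted if missing entries should be kept.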

---

### Common Vectorised String Methods

Some frequently used `.str` methods:

```python
s.str.lower()            # convert to lowercase
s.str.upper()            # convert to uppercase
s.str.len()              # length of each string
s.str.contains('a')      # check substring presence
s.str.replace('a', 'o')  # replace every 'a' with 'o'
s.str.startswith('c')    # check prefix
s.str.endswith('t')      # check suffix
```

These methods work safely even when missing values are present.
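
As a small illustration of that claim, the sketch below applies two of these methods to a Series containing `None`. The missing entry stays missing in the result rather than raising an error.

```python
import pandas as pd

s = pd.Series(['cat', 'mat', None, 'rat'])

# String-returning method: the missing entry is passed through
upper = s.str.upper()

# Number-returning method: the missing entry becomes a missing number
lengths = s.str.len()

# pd.isna detects the missing entry in both results
missing_in_upper = pd.isna(upper[2])
missing_in_lengths = pd.isna(lengths[2])
```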

In [3]:
a = np.array([1, 2, 3, 4])
a * 4

array([ 4,  8, 12, 16])

In [4]:
s = ['cat', 'mat', None, 'rat']
[i.startswith('c') for i in s]

AttributeError: 'NoneType' object has no attribute 'startswith'

In [None]:
s = pd.Series(['cat', 'mat', None, 'rat'])
# string accessor
s.str.startswith('c')
# fast, optimized, and a more robust technique

0     True
1    False
2     None
3    False
dtype: object

## Common Vectorised String Methods (Demonstrated)

Pandas provides a rich set of **vectorised string methods** via the `.str` accessor, allowing efficient string manipulation across entire columns.

---

### Case Conversion Methods

`lower()`, `upper()`, `capitalize()`, `title()`

```python
df['Name'].str.lower()
df['Name'].str.upper()
df['Name'].str.capitalize()
df['Name'].str.title()
```

**Explanation:**

* `lower()` → converts all characters to lowercase
* `upper()` → converts all characters to uppercase
* `capitalize()` → capitalizes the first character of the string and lowercases the rest
* `title()` → capitalizes the first character of each word

These operations are applied **element-wise** to every row of the column and safely handle missing values.
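
The difference between the four methods is easiest to see on a deliberately mixed-case value. The sample string below is made up for illustration and is not taken from the dataset as written.

```python
import pandas as pd

names = pd.Series(['braund, mr. OWEN harris', None])

lower = names.str.lower()         # 'braund, mr. owen harris'
upper = names.str.upper()         # 'BRAUND, MR. OWEN HARRIS'
capital = names.str.capitalize()  # 'Braund, mr. owen harris'
title = names.str.title()         # 'Braund, Mr. Owen Harris'
```

Note that `capitalize()` also lowercases `OWEN`, which matters if the original casing carries information.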

---

### Length of Strings (`len`)

Using `.str.len()` in a practical scenario:

```python
df['Name'][df['Name'].str.len() == 82].values[0]
```

**Output:**

```
'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'
```

**Explanation:**

* `.str.len()` computes the length of each string in the column
* Useful for:

  * Detecting outliers
  * Identifying unusually long or short values
  * Data quality checks
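
A sketch of the data-quality use case on a small made-up Series: find the longest value and flag empty strings. The sample values and variable names are illustrative.

```python
import pandas as pd

names = pd.Series(['Ali, Mr. Ahmed', 'Braund, Mr. Owen Harris', '', None])

# Length of each string; the missing entry stays missing
lengths = names.str.len()

# Longest value: compare against the maximum length
longest = names[lengths == lengths.max()].iloc[0]

# Empty strings are a common data-quality problem
empty = names[lengths == 0]
```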

---

### Removing Leading and Trailing Spaces (`strip`)

```python
df['Name'].str.strip()
```

**Explanation:**

* Removes unwanted whitespace from both ends of strings
* Very useful when cleaning raw or scraped data
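
The Titanic names happen to be clean, so the effect of `strip()` is not visible there. A minimal sketch on made-up messy values, also showing the one-sided variants `lstrip()` and `rstrip()`:

```python
import pandas as pd

raw = pd.Series(['  Braund ', 'Cumings\n', None])

both = raw.str.strip()    # whitespace removed from both ends
left = raw.str.lstrip()   # only leading whitespace removed
right = raw.str.rstrip()  # only trailing whitespace removed
```

Whitespace here includes newlines and tabs, not just spaces.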

---

## Case Study: Using `strip()` and `split()`

**Objective:**
Create three new columns — **LastName**, **Title**, and **FirstName** — from a single `Name` column.

---

### Extracting Last Name

```python
df['LastName'] = df['Name'].str.split(',').str.get(0)
df.head(2)
```

**Explanation:**

* Splits the string at `,`
* Retrieves the first part as `LastName`

---

### Extracting Title and First Name

```python
df['Name'].str.split(',').str.get(1).str.strip().str.split(' ', n=1, expand=True)
```

**Key Parameters Explained:**

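* `strip()` → removes leading and trailing whitespace (here, the space left after the comma)
* `n=1` → performs at most one split, so everything after the first space stays together as the first name
* `expand=True` → returns a DataFrame with one column per piece instead of a Series of lists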
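
To see what each parameter changes, here is a sketch on two made-up values that mimic the part of the name after the comma. The variable names are illustrative.

```python
import pandas as pd

rest = pd.Series(['Mr. Owen Harris', 'Miss. Laina'])

# No limit: every space is a split point, so the first name is broken up
unlimited = rest.str.split(' ')

# n=1: split only at the first space
limited = rest.str.split(' ', n=1)

# expand=True: one DataFrame column per piece
expanded = rest.str.split(' ', n=1, expand=True)
```

Without `expand=True` the result is a Series whose elements are lists, which cannot be assigned directly to two new columns.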

---

### Assigning to New Columns

```python
df[['Title', 'FirstName']] = (
    df['Name']
    .str.split(',')
    .str.get(1)
    .str.strip()
    .str.split(' ', n=1, expand=True)
)
```

This results in structured, analysis-ready columns.

---

### Replacing Values (`replace`)

```python
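# regex=False treats 'Ms.' and 'Mlle.' as literal text, so '.' is not a wildcard
df['Title'] = df['Title'].str.replace('Ms.', 'Miss.', regex=False)
df['Title'] = df['Title'].str.replace('Mlle.', 'Miss.', regex=False)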
```

**Explanation:**

* Standardizes inconsistent categories
* Helps reduce redundancy before analysis or modeling
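
One thing to watch with `str.replace` is whether the pattern is interpreted as a regular expression, since `.` is then a wildcard. The default for the `regex` parameter has differed between pandas versions, so the sketch below passes it explicitly. The value `'Msgr'` is made up to expose the difference, and `Series.replace` with a dictionary is shown as an alternative when whole values should be mapped.

```python
import pandas as pd

titles = pd.Series(['Ms.', 'Mlle.', 'Mrs.', 'Msgr'])

# Literal replacement: only the exact text 'Ms.' is replaced
literal = titles.str.replace('Ms.', 'Miss.', regex=False)

# Regex replacement: '.' matches any character, so 'Msg' also matches
pattern = titles.str.replace('Ms.', 'Miss.', regex=True)

# Whole-value mapping with Series.replace (not the .str accessor)
mapped = titles.replace({'Ms.': 'Miss.', 'Mlle.': 'Miss.'})
```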

---

## Filtering Using Vectorised String Methods

### Using `startswith()` and `endswith()`

```python
df[df['FirstName'].str.endswith('A')]
df[df['FirstName'].str.startswith('A')]
```

**Use case:**
Filtering names based on prefixes or suffixes.

---

### Using `isdigit()` and `isalpha()`

```python
df[df['FirstName'].str.isdigit()]
```

**Explanation:**

* `isdigit()` → checks if the string contains only digits
* `isalpha()` → checks if the string contains only letters
* Useful for detecting corrupted or invalid data
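
A short sketch on made-up values shows how strict both checks are: a single space or a mixed character is enough to make them return `False`.

```python
import pandas as pd

values = pd.Series(['Laina', 'Owen Harris', '12345', 'A1'])

# True only if every character is a digit
digits = values.str.isdigit()

# True only if every character is a letter (a space counts against it)
letters = values.str.isalpha()
```

This is why `isalpha()` returns `False` for most multi-word first names.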

---

## Slightly Advanced Filtering Techniques

### Substring Search (`contains`)

```python
df[df['FirstName'].str.contains("John", case=False)].head(2)
```

**Explanation:**

* `case=False` makes the search case-insensitive
* Ideal for text-based searches and keyword matching
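
Two further parameters of `contains` are often useful. The sketch below, on made-up values, uses `na=False` so that missing entries count as non-matches, and `regex=False` so that the search text is taken literally.

```python
import pandas as pd

first = pd.Series(['John Bradley', 'William john', 'Laina', None])

# Case-insensitive search; missing values are treated as "no match"
mask = first.str.contains('john', case=False, na=False)
hits = first[mask]

# By default the pattern is a regular expression, where '.' matches anything
abbrev = pd.Series(['Mr.', 'Mrs'])
as_regex = abbrev.str.contains('.')
as_literal = abbrev.str.contains('.', regex=False)
```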

---

### Using Regular Expressions (RegEx)

```python
df[df['LastName'].str.contains(r"^[aeiouAEIOU].+[aeiouAEIOU]$")]
```

**Explanation:**

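* Filters last names that **start and end with a vowel**: `^[aeiouAEIOU]` anchors a vowel at the start, `[aeiouAEIOU]$` anchors one at the end, and `.+` requires at least one character in between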
* RegEx provides high control and flexibility in filtering logic
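
To check what the pattern does and does not match, here is a sketch on a handful of last names. It also shows a variant using alternation (`|`) for names that start **or** end with a vowel. The variable names are illustrative.

```python
import pandas as pd

last = pd.Series(['Ali', 'Angle', 'Braund', 'Allen', 'Uruchurtu'])

# Vowel at the start AND at the end, with at least one character between
both_ends = last[last.str.contains(r'^[aeiouAEIOU].+[aeiouAEIOU]$')]

# Vowel at the start OR at the end
either_end = last[last.str.contains(r'^[aeiouAEIOU]|[aeiouAEIOU]$')]
```

`Allen` starts with a vowel but ends with a consonant, so it appears only in the second result.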

---

### String Slicing

```python
df['Name'].str[:4]
```

**Explanation:**

* Extracts the first four characters of each string
* Useful for:

  * Prefix analysis
  * Categorical feature creation
  * Pattern detection
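
Slicing through `.str` follows ordinary Python slice rules, including negative indices and strings shorter than the slice. A minimal sketch on made-up values, with `str.slice` shown as the method form of the same operation:

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Ali', None])

first4 = names.str[:4]          # first four characters (fewer if the string is shorter)
last3 = names.str[-3:]          # last three characters
same = names.str.slice(0, 4)    # method form of names.str[:4]
```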

---

## Key Takeaway

Vectorised string operations:

* Eliminate explicit loops
* Are fast and memory-efficient
* Handle missing values gracefully
* Provide fine-grained control over text data

They are essential for **data cleaning, feature engineering, and exploratory data analysis**.

In [7]:
# load the Titanic dataset
df = pd.read_csv("datasets/titanic.csv")
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [None]:
# lower/upper/capitalize/title (the notebook displays only the last expression, title())
df['Name'].str.lower()
df['Name'].str.upper()
df['Name'].str.capitalize()
df['Name'].str.title()

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [17]:
# using len in a practical scenario
df['Name'][df['Name'].str.len() == 82].values[0]

'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'

In [18]:
# strip
df['Name'].str.strip()

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [21]:
# case: make 3 new columns, segregating First Name, Last Name, and Title for each row
df['LastName'] = df['Name'].str.split(',').str.get(0)
df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,LastName
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings


In [29]:
# n=1 allows only one split, so a multi-word first name is not broken into further parts
# expand=True converts the output Series into a DataFrame: column 0 is the Title, column 1 the FirstName
df['Name'].str.split(',').str.get(1).str.strip().str.split(' ', n=1, expand=True)


Unnamed: 0,0,1
0,Mr.,Owen Harris
1,Mrs.,John Bradley (Florence Briggs Thayer)
2,Miss.,Laina
3,Mrs.,Jacques Heath (Lily May Peel)
4,Mr.,William Henry
...,...,...
886,Rev.,Juozas
887,Miss.,Margaret Edith
888,Miss.,"Catherine Helen ""Carrie"""
889,Mr.,Karl Howell


In [30]:
df[['Title', 'FirstName']] = df['Name'].str.split(',').str.get(1).str.strip().str.split(' ', n=1, expand=True)

In [31]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,LastName,Title,FirstName
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr.,Owen Harris
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs.,John Bradley (Florence Briggs Thayer)
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss.,Laina


In [32]:
# replace (regex=False treats '.' as a literal character, not a wildcard)
df['Title'] = df['Title'].str.replace('Ms.', 'Miss.', regex=False)
df['Title'] = df['Title'].str.replace('Mlle.', 'Miss.', regex=False)

In [33]:
df['Title'].value_counts()

Title
Mr.          517
Miss.        185
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Major.         2
Col.           2
Don.           1
Lady.          1
Mme.           1
Sir.           1
Capt.          1
the            1
Jonkheer.      1
Name: count, dtype: int64

In [None]:
# filtering using startswith/endswith (only the last expression's result is displayed)
df[df['FirstName'].str.endswith('A')]
df[df['FirstName'].str.startswith('A')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,LastName,Title,FirstName
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S,Andersson,Mr.,Anders Johan
22,23,1,3,"McGowan, Miss. Anna ""Annie""",female,15.0,0,0,330923,8.0292,,Q,McGowan,Miss.,"Anna ""Annie"""
35,36,0,1,"Holverson, Mr. Alexander Oskar",male,42.0,1,0,113789,52.0000,,S,Holverson,Mr.,Alexander Oskar
38,39,0,3,"Vander Planke, Miss. Augusta Maria",female,18.0,2,0,345764,18.0000,,S,Vander Planke,Miss.,Augusta Maria
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0000,B28,,Icard,Miss.,Amelie
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
842,843,1,1,"Serepeca, Miss. Augusta",female,30.0,0,0,113798,31.0000,,C,Serepeca,Miss.,Augusta
845,846,0,3,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.5500,,S,Abbing,Mr.,Anthony
866,867,1,2,"Duran y More, Miss. Asuncion",female,27.0,1,0,SC/PARIS 2149,13.8583,,C,Duran y More,Miss.,Asuncion
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C,Najib,Miss.,"Adele Kiamie ""Jane"""


In [38]:
# filtering using isdigit/isalpha
df[df['FirstName'].str.isdigit()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,LastName,Title,FirstName


In [41]:
# case: search for john (upper and lower case) using contains()
df[df['FirstName'].str.contains("John", case=False)].head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,LastName,Title,FirstName
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs.,John Bradley (Florence Briggs Thayer)
41,42,0,2,"Turpin, Mrs. William John Robert (Dorothy Ann ...",female,27.0,1,0,11668,21.0,,S,Turpin,Mrs.,William John Robert (Dorothy Ann Wonnacott)


In [57]:
# case: search for last names that both start and end with a vowel
df[df['LastName'].str.contains(r"^[aeiouAEIOU].+[aeiouAEIOU]$")]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,LastName,Title,FirstName
30,31,0,1,"Uruchurtu, Don. Manuel E",male,40.0,0,0,PC 17601,27.7208,,C,Uruchurtu,Don.,Manuel E
49,50,0,3,"Arnold-Franchi, Mrs. Josef (Josefine Franchi)",female,18.0,1,0,349237,17.8,,S,Arnold-Franchi,Mrs.,Josef (Josefine Franchi)
207,208,1,3,"Albimona, Mr. Nassef Cassem",male,26.0,0,0,2699,18.7875,,C,Albimona,Mr.,Nassef Cassem
210,211,0,3,"Ali, Mr. Ahmed",male,24.0,0,0,SOTON/O.Q. 3101311,7.05,,S,Ali,Mr.,Ahmed
353,354,0,3,"Arnold-Franchi, Mr. Josef",male,25.0,1,0,349237,17.8,,S,Arnold-Franchi,Mr.,Josef
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,Artagaveytia,Mr.,Ramon
518,519,1,2,"Angle, Mrs. William A (Florence ""Mary"" Agnes H...",female,36.0,1,0,226875,26.0,,S,Angle,Mrs.,"William A (Florence ""Mary"" Agnes Hughes)"
784,785,0,3,"Ali, Mr. William",male,25.0,0,0,SOTON/O.Q. 3101312,7.05,,S,Ali,Mr.,William
840,841,0,3,"Alhomaki, Mr. Ilmari Rudolf",male,20.0,0,0,SOTON/O2 3101287,7.925,,S,Alhomaki,Mr.,Ilmari Rudolf


In [58]:
# slicing
df['Name'].str[:4]

0      Brau
1      Cumi
2      Heik
3      Futr
4      Alle
       ... 
886    Mont
887    Grah
888    John
889    Behr
890    Dool
Name: Name, Length: 891, dtype: object