# Tutorial - String data in Python

### Strings as lists

A **string** is a sequence of **characters**. Like a list, a string is a sequence type in Python, so it supports many of the operations that work on lists, although, unlike a list, it cannot be modified in place. For instance, the function `len` gives you the number of characters of a string:

In [1]:
iese = 'IESE Business School'

In [2]:
len(iese)

20

Also, a **substring** can be extracted from a string just as a sublist is extracted from a list:

In [3]:
iese[5:]

'Business School'

In Python, strings are also **joined** just like lists, with the plus sign (`+`):

In [4]:
iese[:4] + ', A way to learn'

'IESE, A way to learn'

A difference between strings and lists is found in the use of `in`. For strings, it can be used to relate not only a character to a string, but also a substring to a string:

In [5]:
'IE' in iese

True
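
To make the contrast concrete, here is a short sketch comparing `in` on a list of characters and on a string. The name `letters` and the other variable names are introduced only for this illustration.

```python
iese = 'IESE Business School'

# A list holding the first four characters (illustrative name)
letters = list(iese[:4])

# On a list, `in` checks whether the item equals one of the elements
single_in_list = 'I' in letters
pair_in_list = 'IE' in letters      # no element of the list equals 'IE'

# On a string, `in` checks whether the substring occurs anywhere
pair_in_string = 'IE' in iese

print(letters)          # ['I', 'E', 'S', 'E']
print(single_in_list)   # True
print(pair_in_list)     # False
print(pair_in_string)   # True
```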

### Python string functions

Besides the operations shared with lists, Python has a collection of methods for manipulating strings. For instance, **conversion to lowercase**:

In [6]:
iese.lower()

'iese business school'

Conversion to uppercase is performed in a similar way, with `upper`. The typical "Find and Replace" operation of text editors is implemented in Python by the function `replace`:

In [7]:
iese.replace('IESE', 'Iese')

'Iese Business School'
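
As a small sketch of the two functions just mentioned, the next block converts to uppercase with `upper`, replaces the spaces with `replace`, and then checks that the original string is left untouched, since both return a new string. The variable names are illustrative.

```python
iese = 'IESE Business School'

shouting = iese.upper()               # new string, all capitals
underscored = iese.replace(' ', '_')  # new string, spaces replaced

print(shouting)     # IESE BUSINESS SCHOOL
print(underscored)  # IESE_Business_School
print(iese)         # IESE Business School (the original is unchanged)
```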

You can split a string with the function `split`. The split can be based on any separator. If no separator is specified, any run of whitespace (spaces, line breaks or tabs) is a separator.

In [8]:
iese.split()

['IESE', 'Business', 'School']

In [9]:
iese.split('i')

['IESE Bus', 'ness School']
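
The difference between the default behavior and an explicit separator shows up when the whitespace is irregular. The following sketch uses a made-up string, `messy`, with two consecutive spaces, a tab and a line break.

```python
messy = 'IESE  Business\tSchool\n'   # two spaces, a tab and a line break

# No separator: any run of whitespace splits, and no empty strings appear
by_whitespace = messy.split()

# Explicit single-space separator: every space splits, other whitespace is kept
by_space = messy.split(' ')

print(by_whitespace)  # ['IESE', 'Business', 'School']
print(by_space)       # ['IESE', '', 'Business\tSchool\n']
```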

The function `find` returns the lowest index in the string where a substring is found. With additional arguments `start` and `end`, the search can be restricted.

In [10]:
iese.find('Business')

5

`find` returns -1 if the substring is not found:

In [11]:
iese.find('Negocios')

-1
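
The optional arguments `start` and `end` mentioned above work like the limits of a slice. A minimal sketch, with illustrative variable names:

```python
iese = 'IESE Business School'

first = iese.find('s')             # search the whole string
after_first = iese.find('s', 8)    # start searching at index 8
in_window = iese.find('s', 8, 11)  # search only within iese[8:11], that is, 'ine'

print(first)        # 7
print(after_first)  # 11
print(in_window)    # -1
```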

Finally, the function `count` counts the number of occurrences of a substring within a string:

In [12]:
iese.count('s')

3

### Regular expressions

A **regular expression** is a pattern which describes a collection of strings. The module `re`, from the Python standard library, provides some functions which read a matching pattern as a regular expression. It is imported in the usual way:

In [13]:
import re

Among regular expressions, **character classes** are the simplest case. They are built by enclosing a collection of characters in square brackets. The square brackets indicate any one of the characters enclosed. For instance, `[0-9]` stands for any digit, `[a-z]` for any (ASCII) lowercase letter, and `[A-Z]` for any (ASCII) capital letter.

The following example uses the `re` function `sub`, which is an alternative to `replace`. Note that the string where the replacement is performed is the third argument.

In [14]:
re.sub('[a-z]', 'x', iese)

'IESE Bxxxxxxx Sxxxxx'

Character classes get more powerful when complemented with **quantifiers**. For instance, followed by a plus sign (`+`), a character class indicates a sequence of one or more characters of that class. So, `[0-9]+` indicates any sequence of digits, therefore any non-negative integer, and `[a-z]+` indicates any word in lower case. We can also specify the length of the sequence, as in `[A-Z]{2}`, or the minimum and maximum length, as in `[0-9]{1,3}`.

Next we see the effect of the quantifier on the preceding example:

In [15]:
re.sub('[a-z]+', 'x', iese)

'IESE Bx Sx'
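
The length quantifiers can be tried in the same way. The sketch below applies `re.sub` with `{2}`, `+` and `{1,3}`; the example sentence and the variable names are chosen only for illustration.

```python
import re

iese = 'IESE Business School'
sentence = 'I was born in 1954'

# Exactly two capitals in a row: 'IE' and 'SE' are each replaced
pairs = re.sub('[A-Z]{2}', 'x', iese)

# One or more digits: the whole number is matched at once
whole = re.sub('[0-9]+', 'x', sentence)

# Between one and three digits: '1954' is matched as '195' and then '4'
chunks = re.sub('[0-9]{1,3}', 'x', sentence)

print(pairs)   # xx Business School
print(whole)   # I was born in x
print(chunks)  # I was born in xx
```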

The regular expressions `\w` and `\W`, which stand for any character that can be part of a word (a letter, a digit or the underscore) and any other character, respectively, are very useful in practice. The following example uses `\W`, with a quantifier, to split by either white space or punctuation (or both). The pattern is written as a raw string (prefix `r`), so that Python passes the backslash to `re` untouched.

In [16]:
re.split(r'\W+', 'IESE: A way to learn, a mark to make')

['IESE', 'A', 'way', 'to', 'learn', 'a', 'mark', 'to', 'make']

The function `findall` does not have an equivalent among the plain Python string functions. It returns a list containing all the occurrences of a pattern. The following example shows how to transform a string into a list of words.

In [17]:
re.findall(r'\w+', iese)

['IESE', 'Business', 'School']
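
Splitting by `\W+` and collecting `\w+` look equivalent, but they differ when the string starts or ends with punctuation. The following sketch uses a variant of the slogan with a final period (an assumption made for the example); the patterns are written as raw strings, with the prefix `r`.

```python
import re

slogan = 'IESE: A way to learn, a mark to make.'

# Splitting leaves an empty string after the final period
pieces = re.split(r'\W+', slogan)

# Collecting the words directly does not
words = re.findall(r'\w+', slogan)

print(pieces)  # ['IESE', 'A', 'way', 'to', 'learn', 'a', 'mark', 'to', 'make', '']
print(words)   # ['IESE', 'A', 'way', 'to', 'learn', 'a', 'mark', 'to', 'make']
```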

### Pandas string functions

**Pandas string functions** are vectorized versions of the above methods. The syntax of these functions is typically `s.str.fname(args)`, where `s` is a Pandas series and `args` is a set of *ad hoc* arguments. They typically return a series of the same length as `s`, in which each term results from applying a string function to the corresponding term of `s`.

In the first example, I include an **empty string** and a **missing value** for diversity. The `None` entry, which stands for a missing value, could be replaced by `numpy.nan` here.

In [18]:
import pandas as pd

In [19]:
pres = pd.Series(['Donald Trump', 'Joe Biden', '', None])
pres

0    Donald Trump
1       Joe Biden
2                
3            None
dtype: object

The function `str.len` returns the length of every element of a string series. In this example, the **empty string** (`''`) has length zero. Since `None` has no length, a `NaN` value is returned as the fourth term. To accommodate it, Pandas returns the series with data type `float` (it would be `int` if all the terms had a length).

In [20]:
pres.str.len()

0    12.0
1     9.0
2     0.0
3     NaN
dtype: float64
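
If integer lengths are needed, one option is to deal with the missing values before calling `str.len`. This is only a sketch of one possible approach, using `dropna`; the variable names are illustrative.

```python
import pandas as pd

pres = pd.Series(['Donald Trump', 'Joe Biden', '', None])

lengths = pres.str.len()

# Count the terms for which no length could be computed
n_missing = int(lengths.isna().sum())

# Drop the missing values first, so that every remaining term has a length
lengths_complete = pres.dropna().str.len()

print(n_missing)                  # 1
print(lengths_complete.tolist())  # [12, 9, 0]
print(lengths_complete.dtype)     # expected to be an integer type, as no NaN is needed
```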

Compared with its plain Python counterpart, the Pandas version has an extra feature: it can deal with missing values. The same can be said of `str.lower`:

In [21]:
pres.str.lower()

0    donald trump
1       joe biden
2                
3            None
dtype: object

`str.replace` is a vectorized version of the string function `replace`. With `regex=True`, it admits a regular expression as the pattern to replace (that is, it works as `re.sub`). An example follows.

In [22]:
bio = pd.Series(['I was born in 1954', 'My phone is +34 932 534 200'])
bio

0             I was born in 1954
1    My phone is +34 932 534 200
dtype: object

In [23]:
bio.str.replace('[0-9]', 'x', regex=True)

0             I was born in xxxx
1    My phone is +xx xxx xxx xxx
dtype: object

The functions `str.split` and `str.contains` also take regular expressions with `regex=True`, while `str.findall` always reads its pattern as a regular expression. `str.split` and `str.findall` are just vectorized versions of the `re` functions explained above. `str.contains` detects the presence of a pattern in the terms of a string series. It returns a Boolean series indicating, term by term, whether the pattern occurs. In the following example, the occurrence of phone numbers is checked (you can extract them with `str.findall`, if you wish).

In [24]:
bio.str.contains(r'\+[0-9]{2} [0-9]{3} [0-9]{3}', regex=True)

0    False
1     True
dtype: bool
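
The extraction suggested above could look as follows. This is a sketch that assumes phone numbers written as a plus sign, two digits and three groups of three digits; the names `phone_pattern`, `phones` and `with_phone` are illustrative.

```python
import pandas as pd

bio = pd.Series(['I was born in 1954', 'My phone is +34 932 534 200'])

# Assumed phone format: '+', two digits, then three groups of three digits
phone_pattern = r'\+[0-9]{2} [0-9]{3} [0-9]{3} [0-9]{3}'

# One list of matches per term; an empty list where nothing matches
phones = bio.str.findall(phone_pattern)

# The Boolean series returned by str.contains can be used as a filter
with_phone = bio[bio.str.contains(phone_pattern, regex=True)]

print(phones.tolist())      # [[], ['+34 932 534 200']]
print(with_phone.tolist())  # ['My phone is +34 932 534 200']
```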

### Homework

1. Some symbols, like `+`, have a special role in regular expressions. You have to be careful when using them in patterns that are read as regular expressions, because they can give unexpected results. The solution is to write them with an **escape character**. Python, like many other languages, uses the backslash (`\`) as the escape character. For the string `price = 'The price is $20'`, try `re.sub('$', '', price)` and `re.sub(r'\$', '', price)` (the prefix `r` marks a raw string, in which Python leaves the backslash untouched). What is the difference?

2. Write a function which anonymizes credit card numbers, turning, for instance, '2875765488882745' into 'Credit card ****745'. Create a short series containing credit card numbers and check that your function does the job.

3. It is always useful in string data analysis to have a **wildcard** which can match any single character (letter, digit, whitespace, etc.), with the exception, by default, of a line break. The wildcard of Python regular expressions is the dot (`.`). For the series
`s = pd.Series(['Call me tomorrow. I will be ready.', 'Your taxi is here. It is still early'])`, what is the difference between `s.str.replace('. ', ' - ', regex=True)` and `s.str.replace('. ', ' - ', regex=False)`?