(text-intro)=
# Introduction to Text

This chapter covers how to use code to work with text as data, including opening files with text in, changing and cleaning text, regular expressions, and vectorised operations on text.

It has benefitted from the [Python String Format Cookbook](https://mkaz.blog/code/python-string-format-cookbook/) and Jake VanderPlas' [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html).

## An aside on encodings

Before we get to the good stuff, we need to talk about string encodings. Whether you're using code or a text editor (Notepad, Word, Pages, Visual Studio Code), every bit of text that you see on a computer will have an encoding behind the scenes that tells the computer how to display the underlying data. There is no such thing as 'plain' text: all text on computers is the result of an encoding. Oftentimes, a computer programme (email reader, Word, whatever) will guess the encoding and show you what it thinks the text should look like. But it doesn't always know, or get it right: *that is what is happening when you get an email or open a file full of weird symbols and question marks*. If a computer doesn't know whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), it simply cannot display it correctly and you get gibberish.

When it comes to encodings, there are just two things to remember: i) you should use UTF-8, which is the most widely used encoding of the Unicode standard and the international default; and ii) the Windows operating system tends to use either Latin 1 or Windows 1252 but (and this is good news) is moving to UTF-8.

[Unicode](https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.

Take special care when saving CSV files containing text on a Windows machine using Excel; unless you specify it, the text may not be saved in UTF-8. If your computer and you get confused enough about encodings and re-save a file with the wrong ones, you could lose data.

Hopefully you'll never have to worry about string encodings. But if you *do* see weird symbols appearing in your text, at least you'll know that there's an encoding problem and will know where to start Googling. You can find a much more in-depth explanation of text encodings [here](https://kunststube.net/encoding/).
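To see what an encoding problem looks like in practice, here is a minimal sketch (the example word is made up purely for illustration) that encodes some text to bytes using UTF-8 and then decodes those bytes with the wrong encoding, Latin 1:

```python
# Illustrative sketch: the word below is just an example
word = "café"

# Encode the text to bytes using UTF-8; 'é' takes up two bytes
utf8_bytes = word.encode("utf-8")
print(utf8_bytes)

# Decoding with the same encoding recovers the original text
recovered = utf8_bytes.decode("utf-8")
print(recovered)

# Decoding with the wrong encoding (Latin 1) produces gibberish
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```

The last line prints 'cafÃ©' rather than 'café', which is exactly the kind of weird symbol described above. When reading or writing files, Python's built-in `open` function accepts an `encoding=` keyword argument, eg `open(path, encoding="utf-8")` for some file location `path`, so you can be explicit rather than relying on a guess.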

## Strings

Note that there are many built-in functions for using strings in Python; you can find a comprehensive list [here](https://www.w3schools.com/python/python_ref_string.asp).

Strings are the basic data type for text in Python. They can be of any length. A string can be signalled by single quote marks or double quote marks like so:

`'text'`

or


`"text"`

Style guides tend to prefer the latter but some coders (ahem!) have a bad habit of using the former. We can put this into a variable like so:

In [1]:
var = "banana"

Now, if we check the type of the variable:

In [2]:
type(var)

str

We see that it is `str`, which is short for string.

Strings in Python can be indexed, so we can get certain characters out by using square brackets to say which positions we would like.

In [3]:
var[:3]

'ban'

The usual slicing tricks that apply to lists work for strings too, i.e. the positions you want to get can be retrieved using the `var[start:stop:step]` syntax. Here's an example of getting every other character from the string starting from the 2nd position.

In [4]:
var[1::2]

'aaa'

Note that strings, like tuples such as `(1, 2, 3)` but unlike lists such as `[1, 2, 3]`, are *immutable*. This means commands like `var[0] = "B"` will result in an error. If you want to change a single character, you will have to replace the entire string. In this example, the command to do that would be `var = "Banana"`.
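As a minimal sketch of what immutability means in practice, the snippet below tries to change a single character, catches the resulting error, and then builds a new string instead (the name `new_var` is just for illustration):

```python
var = "banana"

# Item assignment is not allowed because strings are immutable
try:
    var[0] = "B"
except TypeError as err:
    print(f"Error: {err}")

# Instead, build a new string from a new first character and a slice
new_var = "B" + var[1:]
print(new_var)
```

The original string is left untouched; `new_var` is a completely new string object.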

Like lists, you can find the length of a string using `len`:

In [5]:
len(var)

6

The `+` operator concatenates two or more strings:

In [6]:
second_word = "panther"
first_word = "black"
print(first_word + " " + second_word)

black panther


Note that we added a space so that the noun made sense. Another way of achieving the same end that scales to many words more efficiently (if you have them in a list) is:


In [7]:
" ".join([first_word, second_word])

'black panther'

Three useful functions to know about are `upper`, `lower`, and `title`. Let's see what they do:


In [8]:
var = "input TEXT"
var_list = [var.upper(), var.lower(), var.title()]
print(var_list)

['INPUT TEXT', 'input text', 'Input Text']


```{admonition} Exercise
Reverse the string `"gnirts desrever a si sihT"` using indexing operations.
```

While we're using `print()`, it has a few tricks. If we have a list, we can print out entries with a given separator:


In [9]:
print(*var_list, sep="; and \n")

INPUT TEXT; and 
input text; and 
Input Text


(We'll find out more about what '\n' does shortly.) To turn variables of other kinds into strings, use the `str()` function, for example

In [10]:
(
    "A boolean is either "
    + str(True)
    + " or "
    + str(False)
    + ", there are only "
    + str(2)
    + " options."
)

'A boolean is either True or False, there are only 2 options.'

In this example two boolean values and one integer were converted to strings. `str` generally makes an intelligent guess at how you'd like to convert your non-string type variable into a string type. You can pass a variable or a literal value to `str`.

### f-strings

The example above is quite verbose. Another way of combining strings with variables is via *f-strings*. A simple f-string looks like this:

In [11]:
variable = 15.32399
print(f"You scored {variable}")

You scored 15.32399


This is similar to calling `str` on variable and using `+` for concatenation but much shorter to write. You can add expressions to f-strings too:

In [12]:
print(f"You scored {variable**2}")

You scored 234.8246695201


This also works with functions; after all `**2` is just a function with its own special syntax.
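For instance, here is a small sketch with a function call and a method call inside the curly braces (the variable `name` is made up for this example):

```python
name = "ada lovelace"

# Both a method call and a built-in function are evaluated inside the braces
message = f"{name.title()} has {len(name)} characters"
print(message)
```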

In this example, the score number that came out had a lot of (probably) uninteresting decimal places. So how do we polish the printed output? You can pass more information to the f-string to get the output formatted just the way you want. Let's say we wanted two decimal places and a sign (although you always write `+` in the formatting, the sign comes out as + or - depending on the value):

In [13]:
print(f"You scored {variable:+.2f}")

You scored +15.32


There are a whole range of formatting options for numbers as shown in the following table:

| Number     	| Format  	| Output     	| Description                                   	|
|------------	|---------	|------------	|-----------------------------------------------	|
| 15.32347  	| {:.2f}  	| 15.32       	| Format float 2 decimal places                 	|
| 15.32347  	| {:+.2f} 	| +15.32      	| Format float 2 decimal places with sign       	|
| -1         	| {:+.2f} 	| -1.00      	| Format float 2 decimal places with sign       	|
| 15.32347    	| {:.0f}  	| 15          	| Format float with no decimal places           	|
| 3          	| {:0>2d} 	| 03         	| Pad number with zeros (left padding, width 2) 	|
| 3          	| {:*<4d} 	| 3***       	| Pad number with *’s (right padding, width 4)  	|
| 13         	| {:*<4d} 	| 13**       	| Pad number with *’s (right padding, width 4)  	|
| 1000000    	| {:,}    	| 1,000,000  	| Number format with comma separator            	|
| 0.25       	| {:.1%}  	| 25.0%     	| Format percentage                             	|
| 1000000000 	| {:.2e}  	| 1.00e+09   	| Exponent notation                             	|
| 12         	| {:10d}  	|            12 | Right aligned (default, width 10)             	|
| 12         	| {:<10d} 	| 12            | Left aligned (width 10)                       	|
| 12         	| {:^10d} 	|      12       | Center aligned (width 10)                     	|

As well as using this page interactively through the Colab and Binder links at the top of the page, or downloading this page and using it on your own computer, you can play around with some of these options over at [this link](https://www.python-utils.com/).
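If you'd like to check a few of the formats from the table yourself, the sketch below applies some of them inside f-strings. The dictionary `checks` and its descriptions are only there to organise the output; they are not part of any formatting API.

```python
number = 15.32347

# Each entry applies one format specification from the table
checks = {
    "two decimal places": f"{number:.2f}",
    "with sign": f"{number:+.2f}",
    "zero padded": f"{3:0>2d}",
    "star padded": f"{3:*<4d}",
    "comma separator": f"{1000000:,}",
    "percentage": f"{0.25:.1%}",
    "exponent": f"{1000000000:.2e}",
    "centred": f"{12:^10d}",
}

# Quote marks make the padding visible in the printed output
for description, formatted in checks.items():
    print(f"{description:>20}: '{formatted}'")
```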

### Special characters

Python has a string module that comes with some useful built-in strings and characters. For example

In [14]:
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

gives you all of the punctuation,

In [15]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

returns all of the basic letters in the 'ASCII' encoding (with `.ascii_lowercase` and `.ascii_uppercase` variants), and

In [16]:
string.digits

'0123456789'

gives you the numbers from 0 to 9. Finally, though less impressive visually, `string.whitespace` gives a string containing all of the different (there is more than one!) types of whitespace.

There are other special characters around; in fact, we already met the most famous of them: "\n" for new line. To actually print "\n" we have to 'escape' the backward slash by adding another backward slash:

In [17]:
print("Here is a \n new line")
print("Here is an \\n escaped new line ")

Here is a 
 new line
Here is an \n escaped new line 


The table below shows the most important escape sequences:

| Code 	| Result          	|
|------	|-----------------	|
| `\'`   	| Single Quote (useful if using `'` for strings)   	|
| `\"`      | Double Quote (useful if using `"` for strings)   	|
| `\\`   	| Backslash       	|
| `\n`   	| New Line        	|
| `\r`   	| Carriage Return 	|
| `\t`   	| Tab             	|

### Methods for Strings

Let's end this sub-section on strings with a comprehensive overview of all string methods, courtesy of the excellent [**rich**](https://github.com/willmcgugan/rich) package.

In [18]:
from rich import inspect

var_of_type_str = "string"
inspect(var_of_type_str, methods=True)

## Cleaning Text

You often want to make changes to the text you're working with. In this section, we'll look at the various options to do this.

### Replacing sub-strings

A common text task is to replace a substring within a longer string. Let's say you have a string variable `var`. You can use `var.replace(old_text, new_text)` to do this.


In [19]:
"Value is objective".replace("objective", "subjective")

'Value is subjective'

As with any variable of a specific type (here, string), this would also work with variables:

In [20]:
text = "Value is objective"
old_substr = "objective"
new_substr = "subjective"
text.replace(old_substr, new_substr)

'Value is subjective'

Note that `.replace` performs an exact replace and so is case-sensitive.
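Here is a short sketch of what case-sensitivity means for `.replace`, using a made-up sentence. It also shows the optional third argument to `.replace`, which limits the number of replacements made:

```python
text = "Value is objective. Objective value is rare."

# Only the lower case occurrence is replaced
exact = text.replace("objective", "subjective")
print(exact)

# One (blunt) way to ignore case is to lower-case the whole text first
lowered = text.lower().replace("objective", "subjective")
print(lowered)

# The optional third argument caps the number of replacements
limited = "a-b-c-d".replace("-", "+", 2)
print(limited)
```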

### Replacing characters with translate

A character is an individual entry within a string, like the 'l' in 'equilibrium'. You can always count the number of characters in a string variable called `var` by using `len(var)`. A very fast method for replacing individual characters in a string is `str.translate`. 

Replacing characters is extremely useful in certain situations, most commonly when you wish to remove all punctuation prior to doing other text analysis. You can use the built-in `string.punctuation` for this.

Let's see how to use it to remove all of the vowels from some text. With apologies to economist Lisa Cook, we'll use the abstract from {cite}`cook2011inventing` as the text we'll modify and we'll first create a dictionary of translations of vowels to nothing, i.e. `""`.

In [21]:
example_text = "Much recent work has focused on the influence of social capital on innovative outcomes. Little research has been done on disadvantaged groups who were often restricted from participation in social networks that provide information necessary for invention and innovation. Unique new data on African American inventors and patentees between 1843 and 1930 permit an empirical investigation of the relation between social capital and economic outcomes. I find that African Americans used both traditional, i.e., occupation-based, and nontraditional, i.e., civic, networks to maximize inventive output and that laws constraining social-capital formation are most negatively correlated with economically important inventive activity."
vowels = "aeiou"
translation_dict = {x: "" for x in vowels}
translation_dict

{'a': '', 'e': '', 'i': '', 'o': '', 'u': ''}

Now we turn our dictionary into a string translator and apply it to our text:


In [22]:
translator = str.maketrans(translation_dict)
example_text.translate(translator)

'Mch rcnt wrk hs fcsd n th nflnc f scl cptl n nnvtv tcms. Lttl rsrch hs bn dn n dsdvntgd grps wh wr ftn rstrctd frm prtcptn n scl ntwrks tht prvd nfrmtn ncssry fr nvntn nd nnvtn. Unq nw dt n Afrcn Amrcn nvntrs nd ptnts btwn 1843 nd 1930 prmt n mprcl nvstgtn f th rltn btwn scl cptl nd cnmc tcms. I fnd tht Afrcn Amrcns sd bth trdtnl, .., ccptn-bsd, nd nntrdtnl, .., cvc, ntwrks t mxmz nvntv tpt nd tht lws cnstrnng scl-cptl frmtn r mst ngtvly crrltd wth cnmclly mprtnt nvntv ctvty.'
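As well as accepting a dictionary, `str.maketrans` can take two equal-length strings (mapping each character in the first to the character at the same position in the second) and an optional third string of characters to delete. A minimal sketch, with a made-up example phrase:

```python
phrase = "banana bread"

# Two-string form: map each lower case vowel to its upper case version
upper_vowels = str.maketrans("aeiou", "AEIOU")
shouty = phrase.translate(upper_vowels)
print(shouty)

# Three-string form: characters in the third argument are deleted
remove_vowels = str.maketrans("", "", "aeiou")
stripped = phrase.translate(remove_vowels)
print(stripped)
```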

```{admonition} Exercise
Use `translate` to replace all punctuation from the following sentence with spaces: "The well-known story I told at the conferences [about hypochondria] in Boston, New York, Philadelphia,...and Richmond went as follows: It amused people who knew Tommy to hear this; however, it distressed Suzi when Tommy (1982--2019) asked, \"How can I find out who yelled, 'Fire!' in the theater?\" and then didn't wait to hear Missy give the answer---'Dick Tracy.'"
```

Generally, `str.translate` is very fast at replacing individual characters in strings. But you can also do it using a list comprehension and a `join` of the resulting list, like so:

In [23]:
"".join(
    [
        ch
        for ch in "Example. string. with- excess_ [punctuation]/,"
        if ch not in string.punctuation
    ]
)

'Example string with excess punctuation'

### Slugifying

A special case of string cleaning occurs when you are given text with lots of non-standard characters in, and spaces, and other symbols; and what you want is a clean string suitable for a filename or column heading in a dataframe. Remember that it's best practice to have filenames that don't have spaces in. Slugifying is the process of creating the latter from the former and we can use the [**slugify**](https://github.com/un33k/python-slugify) package to do it.

Here are some examples of slugifying text:

In [24]:
from slugify import slugify

txt = "the quick brown fox jumps over the lazy dog"
slugify(txt, stopwords=["the"])

'quick-brown-fox-jumps-over-lazy-dog'

In this very simple example, the words listed in the `stopwords=` keyword argument (a list), are removed and spaces are replaced by hyphens. Let's now see a more complicated example:


In [25]:
slugify("当我的信息改变时... àccêntæd tËXT  ")

'dang-wo-de-xin-xi-gai-bian-shi-accentaed-text'

Slugify converts text to Latin characters, while also removing accents and whitespace (of all kinds; the last whitespace is a tab). There's also a `replacements=` keyword argument that will replace specific strings with other strings using a list of lists format, eg `replacements=[['old_text', 'new_text']]`.

### Splitting strings

If you want to split a string at a certain position, there are two quick ways to do it. The first is to use indexing methods, which work well if you know at which position you want to split text, eg


In [26]:
"This is a sentence and we will split it at character 18"[:18]

'This is a sentence'

Next up we can use the built-in `split` function, which splits a string wherever a given sub-string occurs and returns the resulting pieces as a list:


In [27]:
"This is a sentence. And another sentence. And a third sentence".split(".")

['This is a sentence', ' And another sentence', ' And a third sentence']

Note that the character used to split the string is removed from the resulting list of strings. Let's see an example with a string used for splitting instead of a single character:


In [28]:
"This is a sentence. And another sentence. And a third sentence".split("sentence")

['This is a ', '. And another ', '. And a third ', '']

A useful extra function to know about is `splitlines()`, which splits a string at line breaks and returns the split parts as a list.
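As a quick sketch of `splitlines()` in action (the strings here are made up for illustration), note that the line break characters themselves are dropped from the result:

```python
poem = "At six o'clock\nwe were waiting\nfor coffee"

# Split at each line break; the "\n" characters are removed
lines = poem.splitlines()
print(lines)

# Combine with strip() to tidy up stray spaces around each line
messy = "  first line \n second line  "
tidy = [line.strip() for line in messy.splitlines()]
print(tidy)
```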

### count and find

Let's do some simple counting of words within text using `str.count`. Let's use the first verse of Elizabeth Bishop's sestina 'A Miracle for Breakfast' for our text.

In [29]:
text = "At six o'clock we were waiting for coffee, \n waiting for coffee and the charitable crumb \n that was going to be served from a certain balcony \n --like kings of old, or like a miracle. \n It was still dark. One foot of the sun \n steadied itself on a long ripple in the river."
word = "coffee"
print(f'The word "{word}" appears {text.count(word)} times.')

The word "coffee" appears 2 times.


Meanwhile, `find` returns the position where a particular word or character occurs.

In [30]:
text.find(word)

35

We can check this using the number we get and some string indexing:

In [31]:
text[text.find(word) : text.find(word) + len(word)]

'coffee'

But this isn't the only place where the word 'coffee' appears. If we want to find the last occurrence, it's

In [32]:
text.rfind(word)

57
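If you want *every* position rather than just the first or last, one approach is to call `find` repeatedly, using its optional second argument to say where the search should start. This is a sketch using a shortened version of the text; note also that `find` returns -1 when the sub-string does not appear at all.

```python
text = "waiting for coffee, waiting for coffee and the charitable crumb"
word = "coffee"

positions = []
start = text.find(word)
# find returns -1 once there are no more occurrences
while start != -1:
    positions.append(start)
    # Resume the search just after the occurrence we found
    start = text.find(word, start + len(word))

print(positions)

# A word that isn't in the text gives -1
missing = text.find("tea")
print(missing)
```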

## Regular expressions

Regex, aka regular expressions, provide a way to both search and change text. Their advantages are that they are concise, they run very quickly, they can be ported across languages (they are definitely not just a Python thing!), and they are very powerful. The disadvantage is that they are confusing and take some getting used to!

You can live code regex in a couple of places. The first is within Visual Studio Code itself. Do this by clicking the magnifying glass in the left-hand side panel of options. When the search strip appears, you can put a search term in. To the right of the text entry box, there are three buttons, one of which is a period (full stop) followed by an asterisk. This option allows the Visual Studio Code text search function to accept regular expressions. This will apply regex to all of the text in your current Visual Studio Code workspace.

Another approach is to head over to [https://regex101.com/](https://regex101.com/) or [https://regexr.com/](https://regexr.com/) and begin typing your regular expression there (regexr's cheat sheets and reference patterns are well worth checking out too). You will need to add some text in the box for the regex to be applied to.

Try either of the above with the regex `string \w+\s`. This matches any occurrence of the word 'string' that is followed by another word and then a whitespace. As an example, 'string cleaning ' would be picked up as a match when using this regex.

Within Python, the `re` library provides support for regular expressions. Let's try it:


In [33]:
import re

text = "It is true that string cleaning is a topic in this chapter. string editing is another."
re.findall(r"string \w+\s", text)

['string cleaning ', 'string editing ']

`re.findall` returns all matches. There are several useful search-like functions in `re` to be aware of that have a similar syntax of `re.function(regex, text)`. The table shows what they all do.


| Function     | What it does                                                                                  | Example of use                      | Output for given value of `text`                                      |
|--------------|-----------------------------------------------------------------------------------------------|-------------------------------------|-----------------------------------------------------------------------|
| `re.match`   | Returns a match object if there is a match at the beginning of the string, otherwise `None`.  | `re.match(r"string \w+\s", text)`   | `None`                                                                |
| `re.search`  | Returns a match object for the first match anywhere in the string, otherwise `None`.          | `re.search(r"string \w+\s", text)`  | `<re.Match object; span=(16, 32), match='string cleaning '>`          |
| `re.findall` | Returns all matches.                                                                          | `re.findall(r"string \w+\s", text)` | `['string cleaning ', 'string editing ']`                             |
| `re.split`   | Splits text wherever a match occurs.                                                          | `re.split(r"string \w+\s", text)`   | `['It is true that ', 'is a topic in this chapter. ', 'is another.']` |
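It's worth knowing that `re.match` and `re.search` return either a match object or `None`, rather than `True` or `False`. The sketch below shows how to pull information out of the match object and how to use the result in an `if` statement (the variable names are just for illustration):

```python
import re

text = "It is true that string cleaning is a topic in this chapter. string editing is another."
pattern = r"string \w+\s"

# No match at the very start of the text, so this is None
start_match = re.match(pattern, text)
print(start_match)

# The first match anywhere in the text
any_match = re.search(pattern, text)
print(any_match.group())  # the matched text
print(any_match.span())  # its start and end positions

# A match object counts as true, and None as false, in a condition
if re.search(pattern, text):
    print("Found at least one match")
```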


Another really handy regex function is `re.sub`, which substitutes one bit of text for another if it finds a match. Here's an example:

In [34]:
new_text = "new text here! "
re.sub(r"string \w+\s", new_text, text)

'It is true that new text here! is a topic in this chapter. new text here! is another.'

### Special Characters

So far, we've only seen a very simple application of regex involving a vanilla word, `string`, the code for another word `\w+` and the code for a whitespace `\s`. Let's take a more comprehensive look at the regex special characters:

| Character | Description                                            | Example Text                           | Example Regex         | Example Match Text  |
|-----------|--------------------------------------------------------|----------------------------------------|-----------------------|---------------------|
| \d        | One Unicode digit in any script                        | "file_93 is open"                      | `file_\d\d`           | "file_93"           |
| \w        | "word character": Unicode letter, digit, or underscore | "blah hello-world blah"                | `\w+-\w+`             | "hello-world"       |
| \s        | "whitespace character": any Unicode separator          | "these are some words with spaces"     | `words\swith\sspaces` | "words with spaces" |
| \D        | Non-digit character (opposite of \d)                   | "ABC 10323982328"                      | `\D\D\D`              | "ABC"               |
| \W        | Non-word character (opposite of \w)                    | "Once upon a time *"                   | `\W`                  | "*"                 |
| \S        | Non-whitespace character (opposite of \s)              | "y        "                            | `\S`                  | "y"                 |
| \Z        | End of string                                          | "End of a string"                      | `\w+\Z`               | "string"            |
| .         | Match any character except the newline                 | "ab=def"                               | `ab.def`              | "ab=def"            |


Note that whitespace characters include newlines, `\n`, and tabs, `\t`.
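A few of the special characters can be tried out directly with `re.findall` and `re.split`. This is just a sketch: the variable names are made up, and the last example text is invented to show tabs and newlines being treated as whitespace.

```python
import re

# Two digits after 'file_'
digits = re.findall(r"file_\d\d", "file_93 is open")

# Word characters either side of a hyphen
hyphenated = re.findall(r"\w+-\w+", "blah hello-world blah")

# Three non-digit characters in a row
non_digits = re.findall(r"\D\D\D", "ABC 10323982328")

# A word at the very end of the string
last_word = re.findall(r"\w+\Z", "End of a string")

# \s matches tabs and newlines as well as spaces
pieces = re.split(r"\s", "tab\tand\nnewline")

print(digits, hyphenated, non_digits, last_word, pieces, sep="\n")
```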

### Quantifiers

As well as these special characters, there are quantifiers which ask for more than one occurrence of a character. For example, in the above, `\D\D\D` asked for three non-digit characters, while `\d\d` asks for two digits. The next table shows all of the quantifiers.

| Quantifier | Role                                       | Example Text               | Example Regex | Example Match      |
|------------|--------------------------------------------|----------------------------|---------------|--------------------|
| {m}        | Exactly m repetitions                      | "936 and 42 are the codes" | `\d{3}`       | "936"              |
| {m,n}      | From m (default 0) to n (default infinity) | "Words up to four letters" | `\b\w{1,4}\b` | "up", "to", "four" |
| *          | 0 or more. Same as {,}                     | "42 is the code"           | `\d*\s`       | "42 " (including the trailing space) |
| +          | 1 or more. Same as {1,}                    | "4 323 hello"              | `\d+`         | "4", "323"         |
| ?          | Optional, so 0 or 1. Same as {,1}.         | "4 323 hello"              | `\d?\s`       | "4 " (including the trailing space) |
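To see the quantifiers in action, here is a sketch that runs some of the examples from the table through `re.findall` (the variable names are just for illustration). Notice that `\d?\s` picks up more than you might first expect: it also matches the final '3' of '323' together with the space that follows it.

```python
import re

# Exactly three digits
exactly_three = re.findall(r"\d{3}", "936 and 42 are the codes")

# Whole words of between one and four letters
short_words = re.findall(r"\b\w{1,4}\b", "Words up to four letters")

# One or more digits
one_or_more = re.findall(r"\d+", "4 323 hello")

# An optional digit followed by a whitespace
optional_digit = re.findall(r"\d?\s", "4 323 hello")

print(exactly_three, short_words, one_or_more, optional_digit, sep="\n")
```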

```{admonition} Exercise
Find a single regex that will pick out only the percentage numbers from both "Inflation in year 3 was 2 percent" and "Interest rates were as high as 12 percent". 
```

### Metacharacters

Now, as well as special characters and quantifiers, we can have meta-character matches. These are not characters *per se*, but starts, ends, and other bits of words. For example, `\b` matches strings at a word (`\w+`) boundary, so if we took the text "Three letter words only are captured" and ran `\b\w\w\w\b` we would return "are". `\B` matches strings not at word (`\w+`) boundaries so the text "Bricks" with `\B\w\w\B` applied would yield "ri". The next table contains some useful metacharacters.

| Metacharacter Sequence | Meaning                       | Example Regex | Example Match                                                                |
|------------------------|-------------------------------|--------------------|------------------------------------------------------------------------------|
| ^                      | Start of string or line       | `^abc`               | "abc" (appearing at start of string or line)                                 |
| $                      | End of string, or end of line | `xyz$`               | "xyz" (appearing at end of string or line)                                   |
| \b                     | Match string at word (\w+) boundary                 | `ing\b`              | "match**ing**" (matches ing if it is at the end of a word)                   |
| \B                     | Match string not at word (\w+) boundary              | `\Bing\B`            | "st**ing**er" (matches ing if it is not at the beginning or end of the word) |

Because so many characters have special meaning in regex, if you want to look for, say, a dollar sign or a dot, you need to escape the character first with a backward slash. So `\${1}\d+` would look for a single dollar sign followed by some digits and would pick up the '\$50' in 'she made \$50 dollars'.
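The word boundary and escaping examples above can be checked with a short sketch like the following. The third example text is made up, and `re.search` is used for the `\B` case because it returns only the first match.

```python
import re

# Three-letter words only
three_letters = re.findall(r"\b\w\w\w\b", "Three letter words only are captured")

# First pair of characters that is not at a word boundary
inner = re.search(r"\B\w\w\B", "Bricks").group()

# Words that end in 'ing'; 'stinger' is skipped as its 'ing' is mid-word
ing_words = re.findall(r"\w+ing\b", "matching stinger singing")

# An escaped dollar sign followed by digits
dollars = re.findall(r"\$\d+", "she made $50 dollars")

print(three_letters, inner, ing_words, dollars, sep="\n")
```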

```{admonition} Exercise
Find the regex that will pick out only the first instance of the word 'money' and any word subsequent to 'money' from the following: "money supply has grown considerably. money demand has not kept up.". 
```

### Ranges

You probably think you're done with regex, but not so fast! There are more metacharacters to come. This time, they will represent *ranges* of characters.

| Metacharacter Sequence | Description       | Example Expression | Example Match    |
|------------------------|---------------------------------------------------------|--------------------|-----------------------------------|
| \[characters\]           | The characters inside the brackets are part of a matching-character set  | `[abcd]`             | a, b, c, or d (one character at a time)     |
| \[^...\]   | Characters inside brackets are a non-matching set; a character not inside is a matching character. | `[^abcd]`            | Any occurrence of any character EXCEPT a, b, c, d. |
| \[character-character\]  | Any character in the range between two characters (inclusive) is part of the set  | `[a-z]`   | Any lowercase letter    |
| \[^character\]           | Any character that is not the listed character     | `[^A]`      | Any character EXCEPT capital A     |

Ranges have two more neat tricks. The first is that they can be concatenated. For example, `[a-c1-5]` would match any of a, b, c, 1, 2, 3, 4, 5. They can also be modified with a quantifier, so `[a-c0-2]{2}` would match "a0" and "ab".
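Here's a minimal sketch of ranges at work, using made-up example text:

```python
import re

# Concatenated ranges: any of a to c, or 1 to 5
concatenated = re.findall(r"[a-c1-5]", "a1 d6 c5")

# A range with a quantifier: exactly two characters from the set
pairs = re.findall(r"[a-c0-2]{2}", "a0 ab zz 12")

# A negated set: anything except a, b, c, or d
negated = re.findall(r"[^abcd]", "abcdef")

print(concatenated, pairs, negated, sep="\n")
```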


### Greedy versus lazy regexes

Buckle up, because this one is a bit tricky to grasp. Adding a `?` after a quantifier (such as `*` or `+`) will make it go from being 'greedy' to being 'lazy'. Greedy means that you will match the longest possible string that hits the condition. Lazy will mean that you get the shortest possible string matching the condition. It's easiest to demonstrate with an example:


In [35]:
test_string = "stackoverflow"
greedy_regex = "s.*o"
lazy_regex = "s.*?o"

print(f"The greedy match is {re.findall(greedy_regex, test_string)[0]}")
print(f"The lazy match is {re.findall(lazy_regex, test_string)[0]}")

The greedy match is stackoverflo
The lazy match is stacko


In the former (greedy) case, we get from an 's' all the way to the last 'o' within the same word. In the latter (lazy) case we just get everything between the start and first occurrence of an 'o'. 

### Matches versus capture groups

There is often a difference between what you might want to match and what you actually want to *grab* with your regex. Let's say, for example, we're parsing some text and we want any numbers that follow the format '$xx.xx', where the 'x' are digits but we don't want the dollar sign. To do this, we can create a *capture group* using parentheses. Here's an example:


In [36]:
text = "Product 1 was $45.34, while product 2 came in at $50.00 however it was assessed that the $4.66 difference did not make up for the higher quality of product 2."
re.findall(r"\$(\d{2}\.\d{2})", text)

['45.34', '50.00']

Let's pick apart the regex here. First, we asked for a literal dollar sign using `\$`. Next, we opened up a capture group with `(`. Then we said only give us the numbers that are 2 digits, a period, and another 2 digits (thus excluding \$4.66). Finally, we closed the capture group with `)`.

So while we specify a *match* using regex, we only want running the regex to return the *capture group*.

Let's see a more complicated example.

In [37]:
sal_r_per = r"\b([0-9]{1,6}(?:\.)?(?:[0-9]{1,2})?(?:\s?-\s?|\s?to\s?)[0-9]{1,6}(?:\.)?(?:[0-9]{1,2})?)(?:\s?per)\b"
text = "This job pays gbp 30500.00 to 35000 per year. Apply at number 100 per the below address."
re.findall(sal_r_per, text)

['30500.00 to 35000']

In this case, the regex first looks for up to 6 digits, then optionally a period, then optionally another couple of digits, then either a dash or 'to' using the '|' operator (which means or), followed by a similar number, followed by 'per'.

But the capture group is only the subset of the match that is the number range; we discard most of the rest. Note also that other numbers, even if they are followed by 'per', are not picked up. `(?:` begins a *non-capture group*, which matches but does not capture, so that although `(?:\s?per)` looks for " per" after a salary (with the space optional due to the second `?`), it does not get returned.
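The difference between the match and the capture group can be seen most directly with `re.search`, whose match object gives access to both. A sketch with a shortened version of the product text (the variable names are just for illustration):

```python
import re

text = "Product 1 was $45.34, while product 2 came in at $50.00."
result = re.search(r"\$(\d{2}\.\d{2})", text)

# Group 0 is the whole match, including the dollar sign
whole_match = result.group(0)
print(whole_match)

# Group 1 is just the first capture group
captured = result.group(1)
print(captured)

# With a non-capture group for the dollars, only the cents are returned
cents = re.findall(r"\$(?:\d{2})\.(\d{2})", text)
print(cents)
```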

```{admonition} Exercise
Find a regex that captures the wage range from "Salary Pay in range $9.00 - $12.02 but you must start at 8.00 - 8.30 every morning.". 
```

This has been a whirlwind tour of regexes. Although regex looks a lot like gobbledygook, it is a really useful tool to be able to deploy for more complex string cleaning and extraction tasks.

## Scaling up from a single string to a corpus

For this section, it's useful to be familiar with the **pandas** package, which is covered in the [Data Analysis Quickstart](data-quickstart) and [Working with Data](working-with-data) sections. This section will closely follow the treatment by Jake VanderPlas.

We've seen how to work with individual strings. But often we want to work with a group of strings, otherwise known as a corpus, that is a collection of texts. It could be a collection of words, sentences, paragraphs, or some domain-based grouping (eg job descriptions).

Fortunately, many of the methods that we have seen deployed on a single string can be straightforwardly scaled up to hundreds, thousands, or millions of strings using **pandas** or other tools. This scaling up is achieved via *vectorisation*, in analogy with going from a single value (a scalar) to multiple values in a list (a vector).

As a very minimal example, here is capitalisation of names vectorised using a list comprehension:


In [38]:
[name.capitalize() for name in ["ada", "adam", "elinor", "grace", "jean"]]

['Ada', 'Adam', 'Elinor', 'Grace', 'Jean']

A **pandas** series can be used in place of a list. Let's create the series first:

In [39]:
import pandas as pd

dfs = pd.Series(
    ["ada lovelace", "adam smith", "elinor ostrom", "grace hopper", "jean bartik"],
    dtype="string",
)
dfs

0     ada lovelace
1       adam smith
2    elinor ostrom
3     grace hopper
4      jean bartik
dtype: string

Now we use the syntax `series.str.function` to change the text series:


In [40]:
dfs.str.title()

0     Ada Lovelace
1       Adam Smith
2    Elinor Ostrom
3     Grace Hopper
4      Jean Bartik
dtype: string

If we had a dataframe and not a series, the syntax would change to refer just to the column of interest like so:

In [41]:
df = pd.DataFrame(dfs, columns=["names"])
df["names"].str.title()

0     Ada Lovelace
1       Adam Smith
2    Elinor Ostrom
3     Grace Hopper
4      Jean Bartik
Name: names, dtype: string

The table below shows a non-exhaustive list of the string methods that are available in **pandas**.

| Function (preceded by `.str.`) | What it does |
|-----------------------------|-------------------------|
| `len()` | Length of string. |
| `lower()` | Put string in lower case. |
| `upper()` | Put string in upper case. |
| `capitalize()` | Put the first character in upper case and the rest in lower case. |
| `swapcase()` | Swap cases in a string. |
| `translate()` | Returns a copy of the string in which each character has been mapped through a given translation table. |
| `ljust()` | Left pad a string (default is to pad with spaces) |
| `rjust()` | Right pad a string (default is to pad with spaces) |
| `center()` | Pad such that string appears in centre (default is to pad with spaces) |
| `zfill()` | Pad with zeros |
| `strip()` | Strip out leading and trailing whitespace |
| `rstrip()` | Strip out trailing whitespace |
| `lstrip()` | Strip out leading whitespace |
| `find()` | Return the lowest index in the data where a substring appears |
| `split()` | Split the string using a passed substring as the delimiter |
| `isupper()` | Check whether string is upper case |
| `isdigit()` | Check whether string is composed of digits |
| `islower()` | Check whether string is lower case |
| `startswith()` | Check whether string starts with a given sub-string |
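As a quick illustration of a few of the methods in this table, here is a minimal sketch. The series below are made up for the example (note the stray whitespace in the last name).

```python
import pandas as pd

# Hypothetical data: the last entry has leading and trailing whitespace
names = pd.Series(["ada lovelace", "adam smith", "  grace hopper "], dtype="string")

cleaned = names.str.strip()                   # remove surrounding whitespace
lengths = cleaned.str.len()                   # number of characters in each string
starts_with_ad = cleaned.str.startswith("ad") # boolean check on each element

# Pad short numeric codes with zeros on the left, to a width of 3
codes = pd.Series(["7", "42"], dtype="string")
padded = codes.str.zfill(3)

print(lengths)
print(starts_with_ad)
print(padded)
```

Each call returns a new series of the same length, so the results can be assigned straight to a dataframe column.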

Regular expressions can also be scaled up with **pandas**. The below table shows vectorised regular expressions.

| Function | What it does |
|-|----------------------------------|
| `match()` | Call `re.match()` on each element, returning a boolean. |
| `extract()` | Call `re.search()` on each element, returning the groups of the first match as strings. |
| `findall()` | Call `re.findall()` on each element |
| `replace()` | Replace occurrences of pattern with some other string |
| `contains()` | Call `re.search()` on each element, returning a boolean |
| `count()` | Count occurrences of pattern |
| `split()` | Equivalent to `str.split()`, but accepts regexes |
| `rsplit()` | Equivalent to `str.rsplit()`, but accepts regexes |


Let's see a couple of these in action. First, splitting on a given sub-string:

In [42]:
df["names"].str.split(" ")

0     [ada, lovelace]
1       [adam, smith]
2    [elinor, ostrom]
3     [grace, hopper]
4      [jean, bartik]
Name: names, dtype: object

It's fairly common that you want to split out strings and save the results to new columns in your dataframe. You can specify a (max) number of splits via the `n=` keyword argument, and you can get the results as columns by setting `expand=True`:

In [43]:
df["names"].str.split(" ", n=2, expand=True)

|   | 0 | 1 |
|---|---|---|
| 0 | ada | lovelace |
| 1 | adam | smith |
| 2 | elinor | ostrom |
| 3 | grace | hopper |
| 4 | jean | bartik |
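To actually save the split results as new columns, one option is to assign the expanded dataframe to a list of new column names. This is a minimal sketch: the column names `first` and `last` are made up for the example, and `n=1` is used so that each name is split at most once.

```python
import pandas as pd

# Small illustrative dataframe (not the one used in the rest of the chapter)
people = pd.DataFrame({"names": ["ada lovelace", "adam smith"]}, dtype="string")

# Split on the first space only and put the two pieces in new columns
people[["first", "last"]] = people["names"].str.split(" ", n=1, expand=True)

print(people)
```

Limiting the number of splits means that the number of resulting columns is known in advance, even if some names contain more than one space.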


```{admonition} Exercise
Using vectorised operations, create a new column with the index position where the first vowel occurs for each row in the `names` column.
```

Here's an example of using a regex function with **pandas**:

In [44]:
df["names"].str.extract(r"(\w+)", expand=False)

0       ada
1      adam
2    elinor
3     grace
4      jean
Name: names, dtype: string
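The other functions in the regex table follow the same pattern. Here is a minimal sketch of `contains()`, `count()`, `replace()`, and `findall()`, using a made-up three-name series rather than the dataframe above:

```python
import pandas as pd

# Hypothetical data for illustration
names = pd.Series(["ada lovelace", "adam smith", "elinor ostrom"], dtype="string")

# Does the string begin with "ad"? (boolean for each element)
begins_ad = names.str.contains(r"^ad")

# How many vowels does each string contain?
n_vowels = names.str.count(r"[aeiou]")

# Replace runs of whitespace with an underscore; regex=True treats the pattern as a regex
snake = names.str.replace(r"\s+", "_", regex=True)

# First letter of every word, returned as a list for each element
initials = names.str.findall(r"\b\w")

print(begins_ad)
print(n_vowels)
print(snake)
print(initials)
```

Passing `regex=True` to `replace()` explicitly is a deliberate choice here, since otherwise the pattern may be treated as a literal string.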

There are a few more vectorised string operations that are useful.

| Method | Description |
|-|-|
| `get()` | Index each element |
| `slice()` | Slice each element |
| `slice_replace()` | Replace slice in each element with passed value |
| `cat()` | Concatenate strings |
| `repeat()` | Repeat values |
| `normalize()` | Return Unicode form of string |
| `pad()` | Add whitespace to left, right, or both sides of strings |
| `wrap()` | Split long strings into lines with length less than a given width |
| `join()` | Join strings in each element of the Series with passed separator |
| `get_dummies()` | Extract dummy variables as a dataframe |


The `get()` and `slice()` methods give access to elements of the lists returned by `split()`. Here's an example that combines `split()` and `get()`:



In [45]:
df["names"].str.split().str.get(-1)

0    lovelace
1       smith
2      ostrom
3      hopper
4      bartik
Name: names, dtype: object
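These methods can be chained. As a minimal sketch (with a made-up two-name series), the code below combines `split()`, `get()`, `slice()`, and `cat()` to turn full names into a short form made of a first initial and a surname:

```python
import pandas as pd

# Hypothetical data for illustration
names = pd.Series(["ada lovelace", "adam smith"], dtype="string")

parts = names.str.split()  # each element becomes a list of words

# First character of the first word, in upper case
first_initial = parts.str.get(0).str.slice(0, 1).str.upper()
# Last word, in title case
surname = parts.str.get(-1).str.title()

# Concatenate the two series element by element with a separator
short_names = first_initial.str.cat(surname, sep=". ")

print(short_names)
```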

We already saw `get_dummies()` in the [Regression](regression) chapter, but it's worth revisiting it here with strings. If we have a column with tags split by a symbol, we can use this function to split it out. For example, let's create a dataframe with a single column that mixes subject and nationality tags:


In [46]:
df = pd.DataFrame(
    {
        "names": [
            "ada lovelace",
            "adam smith",
            "elinor ostrom",
            "grace hopper",
            "jean bartik",
        ],
        "tags": ["uk; cs", "uk; econ", "usa; econ", "usa; cs", "usa; cs"],
    }
)
df

|   | names | tags |
|---|---|---|
| 0 | ada lovelace | uk; cs |
| 1 | adam smith | uk; econ |
| 2 | elinor ostrom | usa; econ |
| 3 | grace hopper | usa; cs |
| 4 | jean bartik | usa; cs |


If we now use `str.get_dummies` and split on `"; "` (a semi-colon followed by a space, which is how the tags are separated) we can get a dataframe of dummies.

In [47]:
df["tags"].str.get_dummies("; ")

|   | cs | econ | uk | usa |
|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 1 |
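One thing to watch out for is that the separator passed to `get_dummies()` is matched literally, so any whitespace around it ends up in the column names. The minimal sketch below (using a made-up two-row series) compares the two separators and then joins the dummies back onto the original data:

```python
import pandas as pd

# Hypothetical data for illustration
tags = pd.Series(["uk; cs", "usa; econ"], name="tags")

# Splitting on ";" alone leaves a leading space on the second tag of each row
literal = tags.str.get_dummies(";")
# Splitting on "; " gives clean column names
clean = tags.str.get_dummies("; ")

print(list(literal.columns))
print(list(clean.columns))

# Join the dummy columns back onto the original column, matching on the index
combined = tags.to_frame().join(clean)
print(combined)
```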


## Reading Text In

### Text file

If you have just a plain text file, you can read it in like so:

```python
fname = 'book.txt'
with open(fname, encoding='utf-8') as f:
    text_of_book = f.read()
```
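An alternative is the standard library's `pathlib`, which can read a whole file in a single call. The sketch below is self-contained: it first writes a small temporary file (a made-up stand-in for 'book.txt') and then reads it back, passing the encoding explicitly both times.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for 'book.txt', created just for this example
    path = Path(tmp_dir) / "book.txt"
    path.write_text(
        "It was the best of times,\nit was the worst of times.\n", encoding="utf-8"
    )

    # Read the whole file into one string, being explicit about the encoding
    text_of_book = path.read_text(encoding="utf-8")

# Split the text into a list with one entry per line
lines = text_of_book.splitlines()
print(lines)
```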

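You can also read a text file into a **pandas** dataframe with one row per line. Recent versions of **pandas** raise an error if "\n" is passed as the delimiter to `pd.read_csv`, so one option is to split the text into lines yourself:

```python
with open('book.txt', encoding='utf-8') as f:
    df = pd.DataFrame({'text': f.read().splitlines()})  # one row per line of the file
```

In the above, `splitlines()` breaks the text wherever there is a new line ("\n"), but you could use `split()` with whatever delimiter you prefer.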

```{admonition} Exercise
Download the file 'smith_won.txt' from this book's GitHub repository using this [link](https://github.com/aeturrell/coding-for-economists/blob/main/data/smith_won.txt) (use right-click and save as). Then read the text in using **pandas**.
```

### CSV file

CSV files are already split into rows. By far the easiest way to read in CSV files is using **pandas**:

```python
df = pd.read_csv('book.csv')
```
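If the CSV file was not saved as UTF-8 (see the aside on encodings at the start of this chapter), you can tell `pd.read_csv` which encoding to use via its `encoding` argument. This is a minimal sketch: rather than relying on a file on disk, it builds some made-up Latin 1 encoded bytes in memory and reads them back.

```python
import io

import pandas as pd

# Hypothetical CSV content, encoded as Latin 1 rather than UTF-8
raw_bytes = "name,city\nZoë,Zürich\n".encode("latin-1")

# Tell pandas which encoding the bytes are in so the accents are decoded correctly
df_latin = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(df_latin)
```

With a real file you would pass the file name in place of the `io.BytesIO` object.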

Remember that **pandas** can read many other file types too.