<center><h1>Chapter 8 Text Data</h1></center>

In [1]:
import numpy as np
import pandas as pd

## 1. str object
### 1. Design intention of str object

The `str` object is an accessor defined on `Index` and `Series` that processes the text content of each element, and it carries a large number of methods. To process the text of a sequence, you therefore first obtain its `str` object. Python also has a built-in `str` type, and for convenience `pandas` mirrors its design for many methods, such as converting letters to uppercase:

In [2]:
var = 'abcd'
str.upper(var) # method of Python's built-in str type

'ABCD'

In [3]:
s = pd.Series(['abcd', 'efg', 'hi'])
s.str

<pandas.core.strings.accessor.StringMethods at 0x1488ea6db08>

In [4]:
s.str.upper() # upper method on the pandas str object

0    ABCD
1     EFG
2      HI
dtype: object

According to the `API` documentation, 31 of the 50 methods on the `pandas` `str` object have the same name and function as methods of Python's built-in `str` type, which makes the `str` object a powerful tool for batch processing of sequences.
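As a small illustration of this correspondence, the sketch below applies `upper` through the `str` object of both a `Series` and an `Index` and compares the result with the built-in `str.upper`. The `Index` labels and variable names are made up for this example.

```python
import pandas as pd

s = pd.Series(['abcd', 'efg', 'hi'])
idx = pd.Index(['alpha', 'beta'])  # illustrative labels, not from the text

# The same method name is available on the str object of a Series and of an Index
upper_s = s.str.upper()
upper_idx = idx.str.upper()

# Element-by-element equivalent using the built-in str type
upper_builtin = [str.upper(x) for x in s]

print(upper_s.tolist() == upper_builtin)
print(list(upper_idx))
```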

### 2. [] Indexer

The `str` object can be understood as applying string operations element-wise across the sequence. For example, on an ordinary string, `[]` retrieves the character at a given position:

In [5]:
var[0]

'a'

You can also get substrings by slicing:

In [6]:
var[-1: 0: -2]

'db'

The `[]` indexer on a `str` object provides exactly the same functionality, and returns a missing value when the position is out of range:

In [7]:
s.str[0]

0    a
1    e
2    h
dtype: object

In [8]:
s.str[-1: 0: -2]

0    db
1     g
2     i
dtype: object

In [9]:
s.str[2]

0      c
1      g
2    NaN
dtype: object
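The indexer also has method forms, `str.get` for a single position and `str.slice` for a range. The following is a minimal sketch on the same three strings; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series(['abcd', 'efg', 'hi'])

# str.get(i) behaves like s.str[i]: out-of-range positions become missing
third_char = s.str.get(2)

# str.slice(start, stop) behaves like s.str[start:stop]
first_two = s.str.slice(0, 2)

print(third_char)
print(first_two)
```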

### 3. string type

As mentioned in the previous chapter, the `string` type was introduced in `pandas` 1.0.0. The motivation was that all strings were previously stored in a `Series` of `object` type, whereas `object` should really be reserved for mixed types such as floats, strings, dictionaries, lists, and custom types. Strings therefore needed their own storage type, like the numeric types or `category`, and so the `string` type was introduced.

In general, `str` object methods give the same results on `object` and `string` sequences, but the two differ significantly in the following two respects:

First, the `str` attribute is intended for sequences in which every value is a string, but this is not strictly required. The actual requirement is that the sequence contains at least one iterable object, including but not limited to strings, dictionaries, and lists. For such iterable objects, the `str` object of a `string` sequence and that of an `object` sequence may return different results.

In [10]:
s = pd.Series([{1: 'temp_1', 2: 'temp_2'}, ['a', 'b'], 0.5, 'my_string'])
s.str[1]

0    temp_1
1         b
2       NaN
3         y
dtype: object

In [11]:
s.astype('string').str[1]

0    1
1    '
2    .
3    y
dtype: string

Except for the last string element, the first three elements return different values. When the sequence type is `object`, `[]` indexing is applied to each element as it is: the dictionary returns the value stored under key `1`, the `temp_1` string; the list returns its second element; the third element is not iterable, so a missing value is returned; and the fourth is ordinary `[]` indexing on a string. The `str` object of the `string` type instead first converts each element to its literal string form. For example, the dictionary becomes a string whose first character is `{`, so position 1 holds `1`; the list becomes `['a', 'b']`, so position 1 holds `'`; and `0.5` gives `.`. For the last element the representation is the same before and after conversion, so the result agrees with the `object` type.

Besides this difference in how some objects are converted, the other difference is that `string` is a `Nullable` type while `object` is not. For a `string` sequence, if a `str` method returns an integer or Boolean `Series`, its `dtype` is the `Nullable` type `Int64` or `boolean`, respectively. The `object` type returns `int`/`float` or `bool`/`object` instead, depending on whether missing values are present. String comparison behaves similarly: `string` returns the `Nullable` `boolean` type, while `object` returns `bool`.

In [12]:
s = pd.Series(['a'])
s.str.len()

0    1
dtype: int64

In [13]:
s.astype('string').str.len()

0    1
dtype: Int64

In [14]:
s == 'a'

0    True
dtype: bool

In [15]:
s.astype('string') == 'a'

0    True
dtype: boolean

In [16]:
s = pd.Series(['a', np.nan]) # with a missing value

In [17]:
s.str.len()

0    1.0
1    NaN
dtype: float64

In [18]:
s.astype('string').str.len()

0       1
1    <NA>
dtype: Int64

In [19]:
s == 'a'

0     True
1    False
dtype: bool

In [20]:
s.astype('string') == 'a'

0    True
1    <NA>
dtype: boolean

Finally, note that for a sequence whose elements are all numeric, the `str` attribute cannot be used directly, even if the type is `object` or `category`. To treat numbers as strings, use `astype` to convert to a `Series` of `string` type:

In [21]:
s = pd.Series([12, 345, 6789])
s.astype('string').str[1]

0    2
1    4
2    7
dtype: string
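The restriction on numeric sequences can be checked directly. In the sketch below, accessing `str` on an integer `Series` is expected to raise an `AttributeError`, which is caught so the example keeps running; the flag name `raised` is just for illustration.

```python
import pandas as pd

s = pd.Series([12, 345, 6789])

# Accessing .str on a numeric Series is rejected by pandas
try:
    s.str[1]
    raised = False
except AttributeError:
    raised = True

# Converting to the string type first makes the accessor available
second_digit = s.astype('string').str[1]

print(raised)
print(second_digit.tolist())
```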

## 2. Basics of regular expressions

The two tables in this section come from the regular expression project [learn-regex-zh](https://github.com/cdoco/learn-regex-zh), which is released under the `MIT` open source license. Only the basic usage of regular expressions is introduced here. Readers who want a systematic treatment can refer to the book [Regular Expressions Must Know](https://book.douban.com/subject/26285406/).

### 1. Matching of general characters

Regular expressions are a tool for matching content in a string from left to right according to a given pattern. For ordinary characters, they find where those characters occur. For demonstration, the `findall` function of Python's `re` module is used to return all non-overlapping matches of a pattern. The first parameter is the regular expression and the second is the string to search. For example, find `Apple` in the following string:

In [22]:
import re
re.findall(r'Apple', 'Apple! This Is an Apple!')

['Apple', 'Apple']

### 2. Metacharacter Basics
|Metacharacter | Description |
| :-----| ----: |
|. | Matches any character except newline|
|\[ \] | Character class, matches any character contained in the brackets|
|\[^ \] | Negated character class, matches any character not contained in the brackets|
|\* | Matches the previous subexpression zero or more times|
|\+ | Matches the previous subexpression one or more times|
|? | Matches the previous subexpression zero or one time|
|{n,m} | Curly braces, matches the previous character at least n times, but not more than m times|
|(xyz) | Character group, matches characters xyz in exact order|
|\| | Branch structure, matches the character before or after the symbol|
|\\ | Escape character, which can restore the original meaning of the metacharacter|
|^ | Matches the beginning of a line|
|$ | Matches the end of a line|

In [23]:
re.findall(r'.', 'abc')

['a', 'b', 'c']

In [24]:
re.findall(r'[ac]', 'abc')

['a', 'c']

In [25]:
re.findall(r'[^ac]', 'abc')

['b']

In [26]:
re.findall(r'[ab]{2}', 'aaaabbbb') # {n} matches exactly n times

['aa', 'aa', 'bb', 'bb']

In [27]:
re.findall(r'aaa|bbb', 'aaaabbbb')

['aaa', 'bbb']

In [28]:
re.findall(r'a\?|a\*', 'aa?a*a') # \? and \* match the literal characters ? and *

['a?', 'a*']

In [29]:
re.findall(r'a?.', 'abaacadaae')

['ab', 'aa', 'c', 'ad', 'aa', 'e']
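The cells above do not exercise `+`, `*`, `{n,m}`, `^`, `$`, or `()`. The following sketch fills in those rows of the table with small made-up strings; note that when a pattern contains one group, `re.findall` returns the content of the group rather than the whole match.

```python
import re

plus = re.findall(r'ab+', 'a ab abb')         # one or more b
star = re.findall(r'ab*', 'a ab abb')         # zero or more b
bounded = re.findall(r'a{2,3}', 'a aa aaaa')  # between 2 and 3 a's, greedy
head = re.findall(r'^[a-z]+', 'first second') # anchored at the beginning
tail = re.findall(r'[a-z]+$', 'first second') # anchored at the end
group = re.findall(r'(ab)c', 'abc abd abc')   # returns only the group

print(plus, star, bounded, head, tail, group)
```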

### 3. Abbreviated character set
In addition, regular expressions provide abbreviated character sets, each of which is equivalent to a set of characters:

|Abbreviation | Description |
| :-----| :---- |
|\\w | Matches all letters, numbers, and underscores: \[a-zA-Z0-9\_\] |
|\\W | Matches any character that is not a letter, number, or underscore: \[^\\w\]|
|\\d | Matches numbers: \[0-9\]|
|\\D | Matches non-numbers: \[^\\d\]|
|\\s | Matches whitespace characters: \[\\t\\n\\f\\r\\p{Z}\]|
|\\S | Matches non-whitespace characters: \[^\\s\]|
|\\B | Matches a position that is not a word boundary; it is a position, not a specific character|

In [30]:
re.findall(r'.s', 'Apple! This Is an Apple!')

['is', 'Is']

In [31]:
re.findall(r'\w{2}', '09 8? 7w c_ 9q p@')

['09', '7w', 'c_', '9q']

In [32]:
re.findall(r'\w\W\B', '09 8? 7w c_ 9q p@')

['8?', 'p@']

In [33]:
re.findall(r'.\s.', 'Constant dropping wears the stone.')

['t d', 'g w', 's t', 'e s']

In [34]:
re.findall(r'上海市(.{2,3}区)(.{2,3}路)(\d+号)', '上海市黄浦区方浜中路249号 上海市宝山区密山路5号')

[('黄浦区', '方浜中路', '249号'), ('宝山区', '密山路', '5号')]
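Because `\B` matches a position rather than a character, it can help to compare it with its counterpart `\b`, which matches at a word boundary. The sketch below uses a made-up string and records where each pattern matches.

```python
import re

text = 'cat concat'

# \b requires a word boundary before "cat": only the standalone word qualifies
at_boundary = [m.start() for m in re.finditer(r'\bcat', text)]

# \B requires that there is no word boundary before "cat": only the one inside "concat"
inside_word = [m.start() for m in re.finditer(r'\Bcat', text)]

print(at_boundary, inside_word)
```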

## 3. Five types of text processing operations
### 1. Split

`str.split` splits each string in a sequence. The first parameter is the delimiter, which may be a regular expression. Optional parameters include `n`, the maximum number of splits counted from left to right, and `expand`, which controls whether the result is expanded into multiple columns.

In [35]:
s = pd.Series(['上海市黄浦区方浜中路249号', '上海市宝山区密山路5号'])
s.str.split('[市区路]')

0    [上海, 黄浦, 方浜中, 249号]
1       [上海, 宝山, 密山, 5号]
dtype: object

In [36]:
s.str.split('[市区路]', n=2, expand=True)

    0   1          2
0  上海  黄浦  方浜中路249号
1  上海  宝山     密山路5号


A similar function is `str.rsplit`. The difference is that, when the `n` parameter is used, the maximum number of splits is counted from right to left. Note, however, that `rsplit` does not split on regular expressions: in the version used here the pattern is treated as a literal string, so no delimiter is found in the example below:

In [37]:
s.str.rsplit('[市区路]', n=2, expand=True)

                0
0  上海市黄浦区方浜中路249号
1     上海市宝山区密山路5号
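With a literal delimiter the direction of `n` is easy to see. The following is a minimal sketch on made-up strings that contrasts `str.split` and `str.rsplit` with `n=1`.

```python
import pandas as pd

s = pd.Series(['a_b_c_d', 'x_y'])

# One split counted from the left: the remainder stays on the right
from_left = s.str.split('_', n=1, expand=True)

# One split counted from the right: the remainder stays on the left
from_right = s.str.rsplit('_', n=1, expand=True)

print(from_left)
print(from_right)
```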


### 2. Merge

There are two functions for merging, `str.join` and `str.cat`. `str.join` joins the list of strings in each element of a `Series` with the given connector, and returns a missing value if the list contains a non-string element:

In [38]:
s = pd.Series([['a','b'], [1, 'a'], [['a', 'b'], 'c']])
s.str.join('-')

0    a-b
1    NaN
2    NaN
dtype: object

`str.cat` is used to merge two sequences. The main parameters are the connector `sep`, the connection form `join`, and the missing value replacement symbol `na_rep`. The connection form defaults to a left join with the index as the key.

In [39]:
s1 = pd.Series(['a','b'])
s2 = pd.Series(['cat','dog'])
s1.str.cat(s2,sep='-')

0    a-cat
1    b-dog
dtype: object

In [40]:
s2.index = [1, 2]
s1.str.cat(s2, sep='-', na_rep='?', join='outer')

0      a-?
1    b-cat
2    ?-dog
dtype: object
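For comparison with the outer join above, the sketch below keeps the default connection form on the same misaligned data, so only the index of `s1` survives. It also shows that calling `str.cat` without a second sequence concatenates all elements into one string.

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['cat', 'dog'], index=[1, 2])

# Default join is a left join on the index: only labels 0 and 1 remain
left_joined = s1.str.cat(s2, sep='-', na_rep='?')

# Without another sequence, the elements of s1 are joined into a single string
collapsed = s1.str.cat(sep=',')

print(left_joined)
print(collapsed)
```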

### 3. Matching

`str.contains` returns a Boolean sequence indicating whether each string contains the regular pattern:

In [41]:
s = pd.Series(['my cat', 'he is fat', 'railway station'])
s.str.contains(r'\s\wat') # raw string avoids invalid escape warnings

0     True
1     True
2    False
dtype: bool

`str.startswith` and `str.endswith` return Boolean sequences indicating whether each string starts or ends with the given pattern. Neither supports regular expressions:

In [42]:
s.str.startswith('my')

0     True
1    False
2    False
dtype: bool

In [43]:
s.str.endswith('t')

0     True
1     True
2    False
dtype: bool

To test the beginning or end of a string against a regular expression, you can use `str.match`, which returns a Boolean sequence indicating whether the beginning of each string matches the given pattern:

In [44]:
s.str.match('m|h')

0     True
1     True
2    False
dtype: bool

In [45]:
s.str[::-1].str.match('ta[fg]|n') # match after reversing each string

0    False
1     True
2     True
dtype: bool

Of course, the same can be achieved with `^` and `$` in the regular expression passed to `str.contains`:

In [46]:
s.str.contains('^[mh]')

0     True
1     True
2    False
dtype: bool

In [47]:
s.str.contains('[fg]at|n$')

0    False
1     True
2     True
dtype: bool
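The three regular-expression matchers differ only in how much of the string must match. The sketch below uses made-up strings to contrast `str.contains` (anywhere), `str.match` (from the beginning), and `str.fullmatch` (the entire string).

```python
import pandas as pd

s = pd.Series(['cat', 'catalog', 'my cat'])

anywhere = s.str.contains('cat')     # pattern may occur at any position
at_start = s.str.match('cat')        # pattern must match at the beginning
whole = s.str.fullmatch('cat')       # pattern must match the entire string

print(anywhere.tolist(), at_start.tolist(), whole.tolist())
```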

Besides the matching functions above, which return Boolean values, `str.find` and `str.rfind` return an index: the position of the first match when searching from the left and from the right, respectively, or -1 if nothing is found. Note that these two functions do not support regular expressions and only match literal substrings:

In [48]:
s = pd.Series(['This is an apple. That is not an apple.'])
s.str.find('apple')

0    11
dtype: int64

In [49]:
s.str.rfind('apple')

0    33
dtype: int64

### 4. Replacement

`str.replace` and `replace` are not the same function: `str.replace` replaces matched substrings within each string, so it is the one to use for string replacement.

In [50]:
s = pd.Series(['a_1_b','c_?'])
s.str.replace(r'\d|\?', 'new', regex=True)

0    a_new_b
1      c_new
dtype: object
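The difference between the two functions shows up as soon as the target is only part of an element. In this sketch on made-up data, `Series.replace` swaps elements that are equal to the target as a whole, while `str.replace` substitutes inside each string.

```python
import pandas as pd

s = pd.Series(['a', 'banana'])

# Series.replace: only elements equal to 'a' are replaced
whole_value = s.replace('a', 'X')

# str.replace: every occurrence of 'a' inside each string is replaced
substring = s.str.replace('a', 'X', regex=False)

print(whole_value.tolist())
print(substring.tolist())
```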

When different parts need different replacements, you can use capture groups (subgroups) and pass a custom replacement function that handles each group separately. Note that `group(k)` is the `k`-th matched subgroup, that is, the content matched by the `k`-th pair of parentheses:

In [51]:
s = pd.Series(['上海市黄浦区方浜中路249号',
                '上海市宝山区密山路5号',
                '北京市昌平区北农路2号'])
pat = r'(\w+市)(\w+区)(\w+路)(\d+号)'
city = {'上海市': 'Shanghai', '北京市': 'Beijing'}
district = {'昌平区': 'CP District',
            '黄浦区': 'HP District',
            '宝山区': 'BS District'}
road = {'方浜中路': 'Mid Fangbin Road',
        '密山路': 'Mishan Road',
        '北农路': 'Beinong Road'}
def my_func(m):
    str_city = city[m.group(1)]
    str_district = district[m.group(2)]
    str_road = road[m.group(3)]
    str_no = 'No. ' + m.group(4)[:-1]
    return ' '.join([str_city,
                     str_district,
                     str_road,
                     str_no])
s.str.replace(pat, my_func, regex=True)

0    Shanghai HP District Mid Fangbin Road No. 249
1           Shanghai BS District Mishan Road No. 5
2           Beijing CP District Beinong Road No. 2
dtype: object

The numeric identifiers here are not intuitive. You can use named subgroups to more clearly write out the meaning of the subgroups:

In [52]:
pat = r'(?P<市名>\w+市)(?P<区名>\w+区)(?P<路名>\w+路)(?P<编号>\d+号)'
def my_func(m):
    str_city = city[m.group('市名')]
    str_district = district[m.group('区名')]
    str_road = road[m.group('路名')]
    str_no = 'No. ' + m.group('编号')[:-1]
    return ' '.join([str_city,
                     str_district,
                     str_road,
                     str_no])
s.str.replace(pat, my_func, regex=True)

0    Shanghai HP District Mid Fangbin Road No. 249
1           Shanghai BS District Mishan Road No. 5
2           Beijing CP District Beinong Road No. 2
dtype: object

Although this looks somewhat complicated, in real data processing the dictionary mappings are usually built from data by code rather than typed by hand, which makes the actual code much simpler.
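One way such a mapping might be built is from a lookup table. The sketch below assumes a hypothetical `lookup` DataFrame with columns `cn` and `en` (neither appears in the original data) and uses it to translate only the city part of two shortened addresses.

```python
import pandas as pd

# Hypothetical lookup table; in practice it might be read from a file
lookup = pd.DataFrame({'cn': ['上海市', '北京市'],
                       'en': ['Shanghai', 'Beijing']})

# Build the dictionary mapping from the table instead of typing it by hand
city = lookup.set_index('cn')['en'].to_dict()

s = pd.Series(['上海市黄浦区', '北京市昌平区'])

# Replace the leading city name using the mapping; the rest is left untouched
translated = s.str.replace(r'^(\w+?市)',
                           lambda m: city[m.group(1)] + ' ',
                           regex=True)
print(translated.tolist())
```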

### 5. Extraction

Extraction can be seen as a matching operation that returns the matched content itself (rather than a Boolean value or an index position), or as a special kind of split. In the `str.split` example above, the delimiters are removed, which is often not what the user wants. In that case `str.extract` can be used instead:

In [53]:
pat = r'(\w+市)(\w+区)(\w+路)(\d+号)'
s.str.extract(pat)

     0    1     2     3
0  上海市  黄浦区  方浜中路  249号
1  上海市  宝山区   密山路    5号
2  北京市  昌平区   北农路    2号


By naming the subgroups, you can directly name the columns of the newly generated `DataFrame`:

In [54]:
pat = r'(?P<市名>\w+市)(?P<区名>\w+区)(?P<路名>\w+路)(?P<编号>\d+号)'
s.str.extract(pat)

    市名   区名    路名    编号
0  上海市  黄浦区  方浜中路  249号
1  上海市  宝山区   密山路    5号
2  北京市  昌平区   北农路    2号
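When an element does not match the pattern at all, `str.extract` still returns a row for it, filled with missing values. A minimal sketch on made-up strings:

```python
import pandas as pd

s = pd.Series(['A135T15', 'no match here'])

# Two capture groups give two columns; the second row has nothing to extract
extracted = s.str.extract(r'[AB](\d+)[TS](\d+)')

print(extracted)
```

This makes `extract` convenient for locating malformed rows, for example with `extracted.isna().any(axis=1)`.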


Unlike `str.extract`, which only returns the first match, `str.extractall` returns every match of the pattern. Multiple results for one element are stored under a multi-level index:

In [55]:
s = pd.Series(['A135T15,A26S5','B674S2,B25T6'], index = ['my_A','my_B'])
pat = r'[AB](\d+)[TS](\d+)'
s.str.extractall(pat)

              0   1
     match
my_A 0      135  15
     1       26   5
my_B 0      674   2
     1       25   6


In [56]:
pat_with_name = r'[AB](?P<name1>\d+)[TS](?P<name2>\d+)'
s.str.extractall(pat_with_name)

           name1 name2
     match
my_A 0       135    15
     1        26     5
my_B 0       674     2
     1        25     6


`str.findall` is similar to `str.extractall`. The difference is that `str.findall` stores all matches of an element in a list, while `str.extractall` uses a multi-level index in which each row holds exactly one match:

In [57]:
s.str.findall(pat)

my_A    [(135, 15), (26, 5)]
my_B     [(674, 2), (25, 6)]
dtype: object
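Since both functions find the same matches, they should agree on how many matches each element has. The sketch below counts matches per element in both ways; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['A135T15,A26S5', 'B674S2,B25T6'], index=['my_A', 'my_B'])
pat = r'[AB](\d+)[TS](\d+)'

# extractall: one row per match, so count the rows under each outer index label
n_from_extractall = s.str.extractall(pat).groupby(level=0).size()

# findall: one list per element, so take the length of each list
n_from_findall = s.str.findall(pat).str.len()

print(n_from_extractall.tolist(), n_from_findall.tolist())
```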

## 4. Common string functions

In addition to the five types of string operation functions introduced above, some other practical methods are defined on the `str` object, which are introduced here:

### 1. Letter type functions

The five functions `upper, lower, title, capitalize, swapcase` are mainly used for letter case conversion. It is easy to understand their functions from the following examples:

In [58]:
s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])
s.str.upper()

0                 LOWER
1              CAPITALS
2    THIS IS A SENTENCE
3              SWAPCASE
dtype: object

In [59]:
s.str.lower()

0                 lower
1              capitals
2    this is a sentence
3              swapcase
dtype: object

In [60]:
s.str.title()

0                 Lower
1              Capitals
2    This Is A Sentence
3              Swapcase
dtype: object

In [61]:
s.str.capitalize()

0                 Lower
1              Capitals
2    This is a sentence
3              Swapcase
dtype: object

In [62]:
s.str.swapcase()

0                 LOWER
1              capitals
2    THIS IS A SENTENCE
3              sWaPcAsE
dtype: object

### 2. Numerical functions

The method to focus on here is `pd.to_numeric`. Although it is not a method on the `str` object, it can quickly convert and filter numeric values stored as text. Its main parameters are `errors` and `downcast`, which control how non-numeric values are handled and which numeric type the result is converted to, respectively. `errors` has three options for values that cannot be converted, `raise, coerce, ignore`, which respectively raise an error, set the value to missing, and keep the original string. Note that recent `pandas` releases (2.2 and later) deprecate `errors='ignore'`, so the first cell below may emit a warning or fail on newer versions.

In [63]:
s = pd.Series(['1', '2.2', '2e', '??', '-2.1', '0'])
pd.to_numeric(s, errors='ignore')

0       1
1     2.2
2      2e
3      ??
4    -2.1
5       0
dtype: object

In [64]:
pd.to_numeric(s, errors='coerce')

0    1.0
1    2.2
2    NaN
3    NaN
4   -2.1
5    0.0
dtype: float64

When cleaning data, you can use the `coerce` setting to quickly view non-numeric rows:

In [65]:
s[pd.to_numeric(s, errors='coerce').isna()]

2    2e
3    ??
dtype: object
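The `downcast` parameter is not exercised in the cells above. The following sketch, on made-up values stored explicitly as `object` so that the behavior does not depend on the default string type, shows the result being converted to a smaller numeric type when the values allow it.

```python
import pandas as pd

ints = pd.Series(['1', '2', '3'], dtype=object)
floats = pd.Series(['1.5', '2.25'], dtype=object)

default_int = pd.to_numeric(ints)                      # int64 by default
small_int = pd.to_numeric(ints, downcast='integer')    # smallest integer type that fits
small_float = pd.to_numeric(floats, downcast='float')  # smallest float type that fits

print(default_int.dtype, small_int.dtype, small_float.dtype)
```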

### 3. Statistical functions

`count` and `len` return the number of occurrences of the regular pattern and the length of the string respectively:

In [66]:
s = pd.Series(['cat rat fat at', 'get feed sheet heat'])
s.str.count('[rf]at|ee')

0    2
1    2
dtype: int64

In [67]:
s.str.len()

0    14
1    19
dtype: int64

### 4. Format functions
Format functions fall into two categories: whitespace removal and padding. The first category has three functions, `strip, rstrip, lstrip`, which remove whitespace on both sides, on the right, and on the left, respectively. They are useful in data cleaning, especially when column names contain stray spaces.

In [68]:
my_index = pd.Index([' col1', 'col2 ', ' col3 '])
my_index.str.strip().str.len()

Int64Index([4, 4, 4], dtype='int64')

In [69]:
my_index.str.rstrip().str.len()

Int64Index([5, 4, 5], dtype='int64')

In [70]:
my_index.str.lstrip().str.len()

Int64Index([4, 5, 5], dtype='int64')
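Applied to a table, the column-name cleanup mentioned above takes one line. The sketch below assumes a small made-up DataFrame whose column names carry stray spaces.

```python
import pandas as pd

# Hypothetical table with leading/trailing spaces in its column names
df = pd.DataFrame({' col1': [1], 'col2 ': [2], ' col3 ': [3]})

# The str object on the column Index strips every name at once
df.columns = df.columns.str.strip()

print(list(df.columns))
```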

Of the padding functions, `pad` is the most flexible, as it allows you to select the string length, padding direction, and padding content:

In [71]:
s = pd.Series(['a','b','c'])
s.str.pad(5,'left','*')

0    ****a
1    ****b
2    ****c
dtype: object

In [72]:
s.str.pad(5,'right','*')

0    a****
1    b****
2    c****
dtype: object

In [73]:
s.str.pad(5,'both','*')

0    **a**
1    **b**
2    **c**
dtype: object

The above three situations can be equivalently completed using `rjust, ljust, center` respectively. It should be noted that `ljust` refers to right padding rather than left padding:

In [74]:
s.str.rjust(5, '*')

0    ****a
1    ****b
2    ****c
dtype: object

In [75]:
s.str.ljust(5, '*')

0    a****
1    b****
2    c****
dtype: object

In [76]:
s.str.center(5, '*')

0    **a**
1    **b**
2    **c**
dtype: object

When reading Excel files, numbers often need their leading zeros restored. For example, the stock code "000007" is read in as the value 7. Besides the left-padding functions above, `pandas` also provides `zfill` for this purpose:

In [77]:
s = pd.Series([7, 155, 303000]).astype('string')
s.str.pad(6,'left','0')

0    000007
1    000155
2    303000
dtype: string

In [78]:
s.str.rjust(6,'0')

0    000007
1    000155
2    303000
dtype: string

In [79]:
s.str.zfill(6)

0    000007
1    000155
2    303000
dtype: string
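Another option is to keep the leading zeros at read time by loading the column as text. The sketch below assumes a tiny in-memory CSV with a hypothetical `code` column standing in for the Excel file; `pd.read_excel` accepts a `dtype` argument as well.

```python
import io
import pandas as pd

raw = 'code\n000007\n000155\n303000\n'  # made-up stock codes

# Default parsing turns the codes into integers and drops the zeros
as_number = pd.read_csv(io.StringIO(raw))

# Reading the column as str preserves the original text
as_text = pd.read_csv(io.StringIO(raw), dtype={'code': str})

# If the zeros are already lost, zfill restores them
restored = as_number['code'].astype('string').str.zfill(6)

print(as_number['code'].tolist())
print(as_text['code'].tolist())
print(restored.tolist())
```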

## 5. Exercises
### Ex1: Housing information dataset
There is a housing information dataset as follows:

In [80]:
df = pd.read_excel('../data/house_info.xls', usecols=['floor','year','area','price'])
df.head(3)

       floor    year    area price
0   高层（共6层）  1986年建  58.23㎡  155万
1  中层（共20层）  2020年建     88㎡  155万
2  低层（共28层）  2010年建  89.33㎡  365万


1. Change the `year` column to integer year storage.
2. Replace the `floor` column with two columns `Level, Highest`, where the elements are the floor category (high floor, middle floor, low floor) of the `string` type and the highest floor number of the integer type.
3. Calculate the average price per square meter of the house `avg_price` and store it in the table in the format of `*** yuan/square meter`, where `***` is an integer.
### Ex2: "Game of Thrones" script data set
There is a Game of Thrones script data set as follows:

In [81]:
df = pd.read_csv('../data/script.csv')
df.head(3)

  Release Date    Season    Episode     Episode Title          Name                                           Sentence
0   2011-04-17  Season 1  Episode 1  Winter is Coming  waymar royce  What do you expect? They're savages. One lot s...
1   2011-04-17  Season 1  Episode 1  Winter is Coming          will  I've never seen wildlings do a thing like this...
2   2011-04-17  Season 1  Episode 1  Winter is Coming  waymar royce                             How close did you get?


1. Calculate the number of lines in each `Episode`.
2. Using spaces as word separators, find the five people with the highest average number of words per line.
3. If someone's lines contain question marks, the next person to speak is the answerer. If the previous person's lines contain $n$ question marks, it is considered that the answerer answered $n$ questions. Find the top five people who answered the most questions.