# Regular Expression
> Python RegEx

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- image: images/chart-preview.png

<br>

This notebook introduces how to use regular expressions in Python to extract useful information from text, such as text taken from documents or websites.

*Prerequisite:* https://www.youtube.com/watch?v=K8L6KVGG-7o

<br>

*Before starting this tutorial, please watch the video above so that you already understand:*

**1) What is the group method in regular expressions?**

**2) What is a raw string?**

**3) How to create a character set?**

**4) What is the function of quantifiers?**

<br>

***

## **Review**

***Here are the summary tables from the video:***

| Syntax    |  Meanings  |
| --- | --- |
|.    |  Any Character Except New Line  |
|\d   |  Digit (0-9)  |
|\D   |  Not a Digit (0-9) |
|\w   |  Word Character (a-z, A-Z, 0-9, _) |
|\W   |  Not a Word Character |
|\s   |  Whitespace (space, tab, newline) |
|\S   |  Not Whitespace (space, tab, newline) |

| Syntax    |  Meanings  |
| --- | --- |
|\b   |   Word Boundary|
|\B   | Not a Word Boundary|
|^    | Beginning of a String|
|$    | End of a String|
|[]   | Matches Characters in brackets|
|[^ ]  | Matches Characters NOT in brackets|
|\| (vertical bar) | Either Or|
|( )   | Group|

<br>

|Quantifiers|  Meanings  |
| --- | --- |
|*    | 0 or More|
|+     | 1 or More|
|?     | 0 or One|
|{3}   | Exact Number|
|{3,4} |Range of Numbers (Minimum, Maximum)|
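
As a quick refresher, here is a minimal sketch that combines a few entries from the tables above. The sample sentence and the variable names are made up for illustration and are not part of the video.

```python
import re

# hypothetical sample text, only for illustration
sample = "Order A12 shipped on 2021-12-05; order B7 shipped on 2021-12-19."

# \d{4} = exactly four digits, \d{2} = exactly two digits
dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample)

# \b = word boundary, [A-Z] = one uppercase letter, \d+ = one or more digits
order_ids = re.findall(r'\b[A-Z]\d+\b', sample)

print(dates)
print(order_ids)
```

Note that "Order" is not matched by the second pattern, because the uppercase letter must be followed directly by digits.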


## **Information Retrieval**

* ## **re**

Before we analyse any text, we first need to extract the relevant information and exclude everything irrelevant. This is not always straightforward, because the text may be mixed with other information, particularly when it is mined from online sources.

Below is an example of an entry extracted from [Historical GIS for Japan](http://maps.cga.harvard.edu/chgis/japan/). The information is spread over multiple rows, with each row giving a different piece of information. If we only want one piece of information, it is easy to copy it from a single entry, but the task becomes challenging once we have thousands of entries. This is why text mining can save us time and effort.

<br>

First of all, we have to **import the library**.

In [3]:
import re

In [2]:
lord_entry = """
name:	abemasaharu\n
vernacular name definition	kanji:	阿部正春\n
alternate vernacular name definition	hiragana:	あべまさはる\n
feature type definition	feature type:	feudal lord 大名 daimyo\n
date range definition	date range:	1664 to 1664\n
time slice definition	valid as:	time slice 年份\n
present location definition	present location:	岩槻市 iwatsukishi\n
point id definition	point id:	jp_dmy_40\n
data source definition	data source:	JP_CHGIS\n
feature type definition	coordinate type:	centroid\n
feature type definition	latitude:	35.93\n
feature type definition	longitude:	139.70\n
admin hierarchy definition	admin hierarchy: 武蔵国 musashi no kuni
"""

## **Name**

Here we can try to get the **kanji name** of the entry. 

From what we have learnt, we can use groups. The first group matches "kanji:" at a word boundary (\b) followed by one whitespace character (\s), and the second group matches everything after it up to the end of the line, regardless of length. Using pattern1, the name we need is in the second group.

We will use **re.compile()** to define our pattern, then use **findall()** to look for all matches.

In [None]:
# define our pattern

pattern1 = re.compile(r'(\bkanji:\s)(.*)')

match1 = pattern1.findall(lord_entry) # get all matches
match1 # print them out

[('kanji:\t', '阿部正春')]

We can then access the first element of the list with [0] (there is only one element) and the second element of the [tuple](https://www.w3schools.com/python/python_tuples.asp) with [1].

In [None]:
match1[0][1]

'阿部正春'
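
If we only expect one match, **re.search()** with **group()** is an alternative to indexing into the list returned by **findall()**. The following is a minimal sketch; `entry` is a shortened, hypothetical stand-in for `lord_entry`, so that the block runs on its own.

```python
import re

# shortened stand-in for lord_entry (a tab follows each label)
entry = "name:\tabemasaharu\nvernacular name definition\tkanji:\t阿部正春\n"

# search() returns the first match object, or None if nothing matches
m = re.search(r'(\bkanji:\s)(.*)', entry)

if m:
    label = m.group(1)       # first group: the keyword plus one whitespace character
    kanji_name = m.group(2)  # second group: the rest of the line
    print(repr(label), kanji_name)
```

Checking `if m:` first avoids an error when the keyword is missing from an entry.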

## **Alternative: Lookaround**

However, we can also use a **lookaround** from **re**. This means we use "kanji:" to locate what we are searching for (the text after the keyword), but we do not select "kanji: " itself because it is not important to us.

<br>

Be careful: **whitespace** might not be obvious, but Python also counts it as a character in the string, so we always need to account for it too.

<br>

***

Given the string **"foobarbarfoo"**:

<br>

**bar(?=bar)**     finds the 1st bar ("bar" which has "bar" after it)

**bar(?!bar)**     finds the 2nd bar ("bar" which does not have "bar" after it)

**(?<=foo)bar**    finds the 1st bar ("bar" which has "foo" before it)

**(?<!foo)bar**    finds the 2nd bar ("bar" which does not have "foo" before it)

<br>

They can also be combined:

**(?<=foo)bar(?=bar)**    finds the 1st bar ("bar" with "foo" before it and "bar" after it)

***
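
The lookaround rules listed above can be checked with a short sketch. It records the start position of every match in "foobarbarfoo", where the 1st bar starts at index 3 and the 2nd bar at index 6. The dictionary names are only for illustration.

```python
import re

s = "foobarbarfoo"

# each pattern from the list above, keyed by a short description
patterns = {
    "lookahead": r'bar(?=bar)',
    "negative lookahead": r'bar(?!bar)',
    "lookbehind": r'(?<=foo)bar',
    "negative lookbehind": r'(?<!foo)bar',
    "combined": r'(?<=foo)bar(?=bar)',
}

# start index of every match for each pattern
starts = {name: [m.start() for m in re.finditer(p, s)]
          for name, p in patterns.items()}

for name, positions in starts.items():
    print(name, positions)
```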


Here we use **(?<=text1)text2** to select text 2 by identifying text 1, where text 1 comes immediately before text 2 in the text.

In [None]:
pattern2 = re.compile(r'(?<=kanji:\s).*')

match2 = pattern2.findall(lord_entry)
match2

['阿部正春']
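
One limitation is worth knowing: the standard **re** module only accepts lookbehinds of a fixed width, so a pattern such as `(?<=kanji:\s+)` is rejected. When the amount of whitespace after the keyword may vary, a capture group is a simple workaround. This is a minimal sketch with a made-up input string.

```python
import re

# hypothetical input with two spaces after the label
entry = "kanji:  阿部正春"

try:
    # \s+ has no fixed width, so re refuses to compile this lookbehind
    re.compile(r'(?<=kanji:\s+).*')
    lookbehind_ok = True
except re.error as err:
    lookbehind_ok = False
    print("lookbehind rejected:", err)

# a capture group has no such restriction
m = re.search(r'kanji:\s+(.*)', entry)
name = m.group(1) if m else None
print(name)
```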

## **Coordinates**

Now we can try to get the latitude and longitude of the lord (for example, when we need them for making a map in **GIS**). Since we have already learnt the principle, the code we need is very similar.

* #### **Latitude**

In [4]:
# define our pattern

lat_pattern = re.compile(r'(?<=latitude:\s).*')

match = lat_pattern.findall(lord_entry)
match

['35.93']

We need to be careful here. Normally, when we think of coordinates, we expect a floating-point number. But what we get here (match) is a list. It will cause errors if we later use the list directly in geospatial operations, so always check the type.

In [6]:
type(match) # it is a list

list

In [7]:
type(match[0]) # we can get the first item of list to remove [], now it is a string

str

We need to further convert the string into a float using **float()**.

In [8]:
type(float(match[0]))

float

In [9]:
lat = float(match[0]) # save the final result to lat
lat

35.93

Now we get what we need! Let's do the same for longitude.

* #### **Longitude**

In [10]:
# define our pattern

lon_pattern = re.compile(r'(?<=longitude:\s).*')

match = lon_pattern.findall(lord_entry)
match # list

['139.70']

In [11]:
lon = float(match[0])
lon # float

139.7
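
Since the steps for latitude and longitude are identical, they could be wrapped in a small helper. The function name `extract_float` and the shortened `entry` string are hypothetical. This is only a sketch of one possible approach, and it uses a capture group instead of a lookbehind so that the amount of whitespace after the label does not matter.

```python
import re

def extract_float(label, text):
    """Return the number that follows 'label:' as a float, or None if it is absent."""
    # re.escape() keeps any special characters in the label literal
    m = re.search(rf'{re.escape(label)}:\s*(-?\d+(?:\.\d+)?)', text)
    return float(m.group(1)) if m else None

# shortened stand-in for lord_entry
entry = ("feature type definition\tlatitude:\t35.93\n"
         "feature type definition\tlongitude:\t139.70\n")

lat = extract_float("latitude", entry)
lon = extract_float("longitude", entry)
missing = extract_float("elevation", entry)  # label not present in the entry

print(lat, lon, missing)
```

Returning `None` for a missing field lets us detect incomplete entries instead of failing with an IndexError on `match[0]`.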

## **Chinese Characters**

Here is another short text, from Han Yu (韓愈). For Chinese characters, we can use Unicode ranges to select a specific type of character.

<br>

**The ranges of Unicode characters which are routinely used for Chinese and Japanese text are:**

* U+3040 - U+30FF: hiragana and katakana **(Japanese only)**

* U+3400 - U+4DBF: CJK unified ideographs extension A **(Chinese, Japanese, and Korean)**

* U+4E00 - U+9FFF: CJK unified ideographs **(Chinese, Japanese, and Korean)**

* U+F900 - U+FAFF: CJK compatibility ideographs **(Chinese, Japanese, and Korean)**

* U+FF66 - U+FF9F: half-width katakana **(Japanese only)**
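
The ranges above can be combined inside a single character set. The following sketch assumes a made-up sample string, and the variable names are only for illustration. It separates Japanese and Chinese script from romanised text, and then picks out the kana alone.

```python
import re

# ranges taken from the list above, written as regex escapes
kana = r'\u3040-\u30ff'
ideographs = r'\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff'

# one or more consecutive kana or ideograph characters
cjk_run = re.compile(f'[{kana}{ideographs}]+')

# hypothetical sample mixing romanised text, kanji and hiragana
sample = "abemasaharu 阿部正春 あべまさはる daimyo 大名"

runs = cjk_run.findall(sample)
kana_only = re.findall(f'[{kana}]+', sample)

print(runs)
print(kana_only)
```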

In [None]:
text = "或問諫議大夫陽城於愈：可以為有道之士乎哉？學廣而聞多，不求聞於人也，行古人之道，居於晉之鄙，晉之鄙人薰其德而善良者幾千人。大臣聞而薦之，天子以為諫議大夫。人皆以為華，陽子不色喜。居於位，五年矣，視其德如在野，彼豈以富貴移易其心哉！"

In [None]:
# now we are looking for every clause (each run of Chinese characters between punctuation marks)
pattern = re.compile(r'[\u4e00-\u9fff]+')

match = pattern.findall(text)
match

['或問諫議大夫陽城於愈',
 '可以為有道之士乎哉',
 '學廣而聞多',
 '不求聞於人也',
 '行古人之道',
 '居於晉之鄙',
 '晉之鄙人薰其德而善良者幾千人',
 '大臣聞而薦之',
 '天子以為諫議大夫',
 '人皆以為華',
 '陽子不色喜',
 '居於位',
 '五年矣',
 '視其德如在野',
 '彼豈以富貴移易其心哉']
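
Another way to get the clauses is to split on the punctuation marks with **re.split()** instead of matching the characters between them. This is a minimal sketch on a shortened sample. The set of punctuation marks is an assumption based on the text above and would need extending for other texts.

```python
import re

# shortened sample taken from the text above
sample = "學廣而聞多，不求聞於人也，行古人之道。"

# split on full-width punctuation and drop the empty string
# that the trailing punctuation mark leaves behind
clauses = [c for c in re.split(r'[，。：？！]', sample) if c]

print(clauses)
```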

We can also look for every character instead:

In [None]:
pattern = re.compile(r'[\u4e00-\u9fff]')

match = pattern.findall(text)
match[:5] # print first 5 characters only

['或', '問', '諫', '議', '大']

Here is another example entry from [清代檔案](https://mhdb.mh.sinica.edu.tw/document/index.php?searchAll=%E5%A4%A7%E5%AD%B8%E5%A3%AB&dbSelect=a%40a%40a%40a%40a%40a%40a%40a&limit=1) (Qing dynasty archives). Let's say we want to extract the time from the document.

In [1]:
text = """
撥給各種工匠銀乾隆01年8月
--內務府奏銷檔
第1筆

事由：撥給各種工匠銀

內文：雍正十三年四月起至 乾隆 元年五月給發匠役工價所用大制錢數目
郎中永保等文開恭畫坤寧宮神像需用外僱畫匠畫短工九十五工每工錢一百三十四文領去大制錢十二串七百三三十文
銀庫郎中邁格等據掌儀司郎中謨爾德等文開恭造坤寧宮祭祀所用鏨花銀香碟八個爵盤二個漏子一個格漏一個箸一雙匙三張小碟二十個鍾十一個大碗五個壺一把大小盤二十四個鑲銀裹楠木肉槽四個三鑲烏木箸二雙畫像上用掛釣三分亭子上用銀面葉一分需用外僱鏨花匠大器匠做短工七百九十一工四分五厘每工錢一百三十四文領去大制錢一百六串五十四文
...

時間：乾隆01年8月

官司：

官員：

微捲頁數：173-194

冊數：194

資料庫：內務府奏銷檔案
"""

We can also perform a quick retrieval using what we have just learnt.

In [None]:
pattern = re.compile(r'(?<=時間.).*')

match = pattern.findall(text)
match

['乾隆01年8月']
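
If we need the parts of the date separately, named groups are one option. The sketch below works on the single line that holds the date. The group names `era`, `year` and `month` are chosen only for illustration, and the pattern only handles dates written with Arabic digits, not Chinese numerals such as 元年五月.

```python
import re

line = "時間：乾隆01年8月"

# era: shortest run of Chinese characters directly before the year digits
date_pattern = re.compile(
    r'(?P<era>[\u4e00-\u9fff]+?)(?P<year>\d+)年(?P<month>\d+)月'
)

m = date_pattern.search(line)
if m:
    era = m.group("era")
    year = int(m.group("year"))    # '01' becomes 1
    month = int(m.group("month"))
    print(era, year, month)
```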

Combined with web scraping, which we will learn later, we can then easily get information for text analysis.

<br>
<br>
<br>

***

## **Additional information**

This notebook is provided for educational purposes. Feel free to report any issues on GitHub.

<br>

**Author:** [Ka Hei, Chow](https://www.linkedin.com/in/ka-hei-chow-231345188/)

**License:** The code in this notebook is licensed under the [Creative Commons Attribution 4.0 license](https://creativecommons.org/licenses/by/4.0/).

**Last modified:** December 2021

<br>

***

<br>

## **References:** 

https://github.com/CoreyMSchafer/code_snippets/blob/master/Python-Regular-Expressions/snippets.txt

https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups

https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters