# String Processing
### 陳俊宇
**March 1, 2023**

---

# String Basics
Strings can be created with single quotes (' ') or double quotes (" ").

In [1]:
string1 = "I am a student."
string1

'I am a student.'

In [2]:
string2 = "I am a 'NSYSU' student."
string2

"I am a 'NSYSU' student."

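To include a literal single or double quote in a string, escape it with a backslash (\), because single quotes (' ') and double quotes (" ") are special characters that delimit the string. The escape is required only when the quote is the same character that encloses the string, so in the example below, which is enclosed in double quotes, the backslash is optional.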

In [3]:
string3 = "I\'m a student."
string3

"I'm a student."

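As a small sketch of the escaping rule (the variable names are only illustrative), a quote needs a backslash only when it is the same character that delimits the string:

```python
# Illustrative names; not part of the original notebook.
double_quoted = "I'm a student."       # apostrophe needs no escape inside double quotes
single_quoted = 'I\'m a student.'      # apostrophe must be escaped inside single quotes
nested = 'I am a "NSYSU" student.'     # double quotes need no escape inside single quotes

print(double_quoted == single_quoted)  # both spellings produce the same string
print(nested)
```
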
Multiple strings can be stored in a list using [ ].

In [4]:
["a", "b", "c"]

['a', 'b', 'c']

They can also be stored in a tuple using ( ).

In [5]:
("a", "b", "c")

('a', 'b', 'c')

# String Conversion
## Upper and lower case

In [6]:
string1 = "I am a student"
string2 = ["I", "am", "a", "student"]

Convert an English string to uppercase

In [7]:
string1.upper()

'I AM A STUDENT'

+ **The method has to be called on a string object**
+ **So for a list, each element must be accessed individually**

In [8]:
[item.upper() for item in string2]

['I', 'AM', 'A', 'STUDENT']

Convert an English string to lowercase

In [9]:
string1.lower()

'i am a student'

+ **The method has to be called on a string object**
+ **So for a list, each element must be accessed individually**

In [10]:
[item.lower() for item in string2]

['i', 'am', 'a', 'student']

Capitalize the first letter of every word in an English string

In [11]:
string1.title()

'I Am A Student'

+ **The method has to be called on a string object**
+ **So for a list, each element must be accessed individually**

In [12]:
[item.title() for item in string2]

['I', 'Am', 'A', 'Student']

Capitalize the first letter of an English string and lowercase the rest

In [13]:
string1.capitalize() 

'I am a student'

+ **The method has to be called on a string object**
+ **So for a list, each element must be accessed individually**

In [14]:
[item.capitalize() for item in string2]

['I', 'Am', 'A', 'Student']

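## Encoding Conversion
+ In Python, text can be converted between encodings with the built-in str.encode() and bytes.decode() methods
+ Encode: .encode(" ") turns a str into bytes, with the encoding name between the quotes
+ Decode: .decode(" ") turns bytes back into a str, with the encoding name between the quotes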

**1. ASCII encoding**

In [15]:
text = "Hello, world!"
encoded_text = text.encode("ascii")          # encode
decoded_text = encoded_text.decode("ascii")  # decode
print(encoded_text)  # b'Hello, world!'
print(decoded_text)  # Hello, world!

b'Hello, world!'
Hello, world!


**2. UTF-8 encoding**

In [16]:
text = "你好，世界！"
encoded_text = text.encode("utf-8")
decoded_text = encoded_text.decode("utf-8")
print(encoded_text)  # b'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x81'
print(decoded_text)  # 你好，世界！

b'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x81'
你好，世界！


**3. Base64 encoding**

In [17]:
import base64

text = "Hello, world!"
encoded_text = base64.b64encode(text.encode("utf-8"))
decoded_text = base64.b64decode(encoded_text).decode("utf-8")
print(encoded_text)  # b'SGVsbG8sIHdvcmxkIQ=='
print(decoded_text)  # Hello, world!

b'SGVsbG8sIHdvcmxkIQ=='
Hello, world!


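Other encodings are available as well; the official Python documentation describes the supported encodings and the related functions in detail
+ [codecs: Codec registry and base classes (official Python documentation, codecs module)](https://docs.python.org/3/library/codecs.html)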
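The examples above do not show what happens when a character has no representation in the target encoding. A minimal sketch, using the standard `errors` argument of `str.encode()`:

```python
text = "你好"

# ASCII cannot represent Chinese characters, so strict encoding fails.
try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# errors="replace" substitutes "?" for each character that cannot be encoded.
replaced = text.encode("ascii", errors="replace")

print(failed)
print(replaced)
```
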

# Finding Start and End Positions

## Finding the position of a specific substring in a string
**Using index()**
+ find() is available only on string (str) objects

In [18]:
string4 = ["grape", "banana", "apple", "mango "]

Find only the first occurrence:

![image.png](attachment:image.png)

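+ On a list, index() looks up the position of an element equal to its argument, so it cannot search inside the strings of a list directly
```
Loop over every string in string4 and call index() on each one to find the position of "a".
If a string does not contain that letter, index() raises a ValueError,
so we use try-except to handle the error and skip that string.
```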

In [19]:
substring = "a"
positions = []
for s in string4:
    try:
        position = s.index(substring)
        positions.append(position)
    except ValueError:
        pass
print(positions)

[2, 1, 0, 1]


Find all positions:

In [20]:
substring = "a"
positions = []
for s in string4:
    for i in range(len(s)):                     # inner loop visits every character position in the string
        if s[i:i+len(substring)] == substring:
            positions.append(i)
    positions.append('')
print(positions)  

[2, '', 1, 3, 5, '', 0, '', 1, '']
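The same search can be sketched with `re.finditer()`, which keeps the positions for each string in their own list instead of using '' as a separator. `re.escape()` is used here on the assumption that the substring should be treated as literal text:

```python
import re

string4 = ["grape", "banana", "apple", "mango "]
substring = "a"

# One inner list of start positions per string.
positions = [
    [m.start() for m in re.finditer(re.escape(substring), s)]
    for s in string4
]
print(positions)
```
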


## Finding the parts of a string that match a rule
Find the positions of the digits in a string

In [21]:
str1 = "1 and 2 and 4 and 456 7"

1. Use a regular expression to match the digit pattern

In [22]:
import re

pattern = r"\d+"  # match one or more consecutive digits (the regular expression "\d+")
matches = re.finditer(pattern, str1)              # re.finditer() returns an iterator of match objects
positions = [match.start() for match in matches]  # match.start() gives the start position of each match
print(positions)

[0, 6, 12, 18, 22]


In [23]:
import re

pattern = r"\d+" 
matches = re.finditer(pattern, str1)

for match in matches:
    start_position = match.start()
    print(start_position)

0
6
12
18
22
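To get both the start and the end of each match, a match object also offers `end()` and `span()`. A short sketch on the same string; note that the end position is exclusive, as in slicing:

```python
import re

str1 = "1 and 2 and 4 and 456 7"

# span() returns (start, end) with an exclusive end index.
spans = [m.span() for m in re.finditer(r"\d+", str1)]
print(spans)

# Slicing with a span recovers the matched text.
pieces = [str1[start:end] for start, end in spans]
print(pieces)
```
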


2. Loop over the string and check each character

In [24]:
positions = []
for i, ch in enumerate(str1):
    if ch.isdigit():     # check each character with isdigit()
        positions.append(i)
print(positions)

[0, 6, 12, 18, 19, 20, 22]


Find the positions in a string that are not digits
+ Invert the regular-expression match

In [25]:
import re

str1 = "1 and 2 and 4 and 456 7"
pattern = r"\d+"  # match one or more consecutive digits

# Step 1: collect every index covered by a match (the whole span, not only its start)
digit_positions = {i for m in re.finditer(pattern, str1) for i in range(m.start(), m.end())}

# Step 2: check whether each position is covered by a match
# Step 3: a position that is not covered counts as an inverted (non-digit) match
positions = [i for i in range(len(str1)) if i not in digit_positions]

print(positions)

[1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 21]
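Another way to sketch this is to match non-digits directly: `\D` matches any single character that is not a digit, so the start of each `\D` match is a non-digit position.

```python
import re

str1 = "1 and 2 and 4 and 456 7"

# \D matches exactly one non-digit character per match.
positions = [m.start() for m in re.finditer(r"\D", str1)]
print(positions)
```
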


# String Concatenation

### 1. Using the plus (+) operator
**Concatenating two string sequences**

In [26]:
import string  # provides the sequence of letters

s = "string" + "".join(string.ascii_lowercase[:5])  # the first 5 characters of string.ascii_lowercase
print(s)

stringabcde


### 2. Using f-strings or string formatting

In [27]:
s = f"string{''.join(string.ascii_lowercase[:5])}"
print(s)

stringabcde


### 3. Using .join()

In [28]:
s = "".join(["string"] + list(string.ascii_lowercase[:5]))
print(s)

stringabcde


**Setting the separator between elements**
### 1. Using the plus (+) operator and .join()

In [29]:
str1 = "string"
letters = string.ascii_lowercase[:5]
separator = ": "
result = str1 + separator + separator.join(letters)
print(result)

string: a: b: c: d: e


### 2. Using f-strings or string formatting

In [30]:
str1 = "string"
letters = string.ascii_lowercase[:5]
separator = ": "
result = f"{str1}{separator}{separator.join(letters)}"
print(result)

string: a: b: c: d: e


**Concatenating three or more strings**

In [31]:
letters = string.ascii_lowercase  # get the sequence of lowercase letters

result = "".join([letters[0], letters[1], " is for", "...", "***"])
print(result)  # ab is for...***

ab is for...***


In [32]:
s = "".join(list(string.ascii_lowercase))
print(s)

abcdefghijklmnopqrstuvwxyz


In [33]:
letters = string.ascii_lowercase
separator = ", "
result = separator.join(letters)
print(result)

a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z


# Combining Repetition and Concatenation

In [34]:
string5 = ["apple", "banana", "grape", "mango"]

Repeat each string twice, then join the results

In [35]:
result = " ".join([s * 2 for s in string5])
print(result)

appleapple bananabanana grapegrape mangomango


In [36]:
result = " ".join([s + s  for s in string5])
print(result)

appleapple bananabanana grapegrape mangomango


Repeat every string in the list 1 to 4 times each

In [37]:
repetitions = range(1, 5)

result = " ".join([s * r for s in string5 for r in repetitions])
print(result)

apple appleapple appleappleapple appleappleappleapple banana bananabanana bananabananabanana bananabananabananabanana grape grapegrape grapegrapegrape grapegrapegrapegrape mango mangomango mangomangomango mangomangomangomango


Repeat the strings in the list 1, 2, 3, and 4 times respectively

In [38]:
r = 0
result = " ".join([s * (r := r + 1) for s in string5])
print(result)

apple bananabanana grapegrapegrape mangomangomangomango


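+ **The assignment-expression operator := (the "walrus" operator, Python 3.8+) assigns a value and returns it within a single expression**
+ In this example, r := r + 1 increments r by 1 and assigns the new value back to r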
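If the `:=` expression feels too implicit, the same result can be sketched with `enumerate()`, which supplies the counter directly. The `start=1` argument makes the count begin at 1:

```python
string5 = ["apple", "banana", "grape", "mango"]

# n runs 1, 2, 3, 4 alongside the strings.
result = " ".join(s * n for n, s in enumerate(string5, start=1))
print(result)
```
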

Concatenate and repeat

In [39]:
repetitions = range(1, 5)
string5 = ['01']
result = "ab" + " ab".join([s * r for s in string5 for r in repetitions])
print(result)

ab01 ab0101 ab010101 ab01010101


# Counting Matches in a String

In [40]:
import re

string5 = ["apple", "banana", "grape", "mango"]

Number of occurrences of a in each string

In [41]:
counts = [len(re.findall("a", s)) for s in string5]
print(counts)

[1, 3, 1, 1]


In [42]:
pattern = ["a"]

counts = [len(re.findall(p, s)) for s in string5 for p in pattern]
print(counts)

[1, 3, 1, 1]


Number of letters from a to e in each string

In [43]:
pattern = "[a-e]"  # regular-expression pattern that matches any single letter from a to e

counts = [len(re.findall(pattern, s)) for s in string5]
print(counts)

[2, 4, 2, 1]


Number of matches of each pattern in each string of the list

In [44]:
pattern = ["a", "n", "p", "m"]

counts = [len(re.findall(p, s)) for s in string5 for p in pattern]
print(counts)

[1, 0, 2, 0, 3, 2, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
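The flat list above does not say which count belongs to which string and pattern. One possible sketch keeps the labels in a nested dictionary; since these patterns are plain characters, `str.count()` is used in place of `re.findall()`:

```python
string5 = ["apple", "banana", "grape", "mango"]
pattern = ["a", "n", "p", "m"]

# Map each string to {pattern: count}.
counts = {s: {p: s.count(p) for p in pattern} for s in string5}
for s, c in counts.items():
    print(s, c)
```
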


# Testing Whether a String Contains a Substring
(Equivalent to checking whether len(re.findall(pattern, s)) is greater than 0)<br><br>
Whether each string contains a

In [45]:
# import re

pattern = "a"

# re.search() returns a match object if it finds a match; otherwise it returns None
result = [re.search(pattern, s) is not None for s in string5]
print(result)

[True, True, True, True]


Whether each string starts with a

In [46]:
result = [s.startswith("a") for s in string5]
print(result)

[True, False, False, False]


Whether each string ends with a

In [47]:
result = [s.endswith("a") for s in string5]
print(result)

[False, True, False, False]


Whether each string contains any one of the letters a, b, c, d

In [48]:
letters = ["a", "b", "c", "d"]

result = [any(letter in s for letter in letters) for s in string5]
print(result)

[True, True, True, True]


Whether each string in the list matches any of the given patterns

In [49]:
# import re
# string5 = ["apple", "banana", "grape", "mango"]

patterns = ["pp", "ba", "pe", "me"]

result = [any(re.search(pattern, s) is not None for pattern in patterns) for s in string5]
print(result)

[True, True, True, False]
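If each string should instead be tested only against the pattern in the same position (the first string against the first pattern, and so on), `zip()` can pair them up. This is a sketch of that element-wise reading:

```python
import re

string5 = ["apple", "banana", "grape", "mango"]
patterns = ["pp", "ba", "pe", "me"]

# Pair the i-th string with the i-th pattern.
result = [re.search(p, s) is not None for s, p in zip(string5, patterns)]
print(result)
```
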


# Extracting Matched Content from Strings
## Using the re module
## (1) re.match()

In [50]:
strings = ["the smooth", "the depth", "a chicken", "the parked", "the sun", "the huge", "the ball", "the woman", "a helps"]

In [51]:
# import re
# (a|the) matches "a" or "the"; ([^ ]+) matches one or more characters other than a space
pattern = r"(a|the) ([^ ]+)"

matches = [re.match(pattern, s) for s in strings]

# (full match, first group, second group), or None if no match was found.
result = [(match.group(0), match.group(1), match.group(2)) if match else None for match in matches]
print(result)

[('the smooth', 'the', 'smooth'), ('the depth', 'the', 'depth'), ('a chicken', 'a', 'chicken'), ('the parked', 'the', 'parked'), ('the sun', 'the', 'sun'), ('the huge', 'the', 'huge'), ('the ball', 'the', 'ball'), ('the woman', 'the', 'woman'), ('a helps', 'a', 'helps')]


## (2) re.search()

In [52]:
# import re
pattern = r"(a|the) ([^ ]+)"

matches = [re.search(pattern, s) for s in strings]
result = [(match.group(0), match.group(1), match.group(2)) if match else None for match in matches]
print(result)

[('the smooth', 'the', 'smooth'), ('the depth', 'the', 'depth'), ('a chicken', 'a', 'chicken'), ('the parked', 'the', 'parked'), ('the sun', 'the', 'sun'), ('the huge', 'the', 'huge'), ('the ball', 'the', 'ball'), ('the woman', 'the', 'woman'), ('a helps', 'a', 'helps')]


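### Difference between re.match() and re.search()
```
re.match(pattern, string) matches the pattern starting at the beginning of the string.

It returns a match object only if the pattern matches from the very start of the string.
If there is no match at the start, it returns None.
So re.match() can be used to check whether a string begins with a given pattern.

re.search(pattern, string) searches the whole string for the pattern.

It scans through the string and returns the first match it finds.
In other words, the match may be anywhere in the string, not only at the beginning.
If a match is found, it returns a match object; otherwise it returns None.
So re.search() can be used to look for a pattern at any position in the string.
```
+ The main difference: re.match() only matches at the beginning of the string, whereas re.search() looks for a match anywhere in the string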

In [53]:
# import re

pattern = r"abc"
string = "0abcdef"

match_result = re.match(pattern, string)
search_result = re.search(pattern, string)

print("Match Result: ", match_result)
print("Search Result:", search_result)

Match Result:  None
Search Result: <re.Match object; span=(1, 4), match='abc'>


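## (3) re.findall()
+ It returns all non-overlapping matches; when the pattern contains several groups, each match is returned as a tuple of those groups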

In [54]:
# import re

pattern = r"(a|the) ([^ ]+)"

matches = [re.findall(pattern, s) for s in strings]
result = [m[0] if m else None for m in matches]
print(result)

[('the', 'smooth'), ('the', 'depth'), ('a', 'chicken'), ('the', 'parked'), ('the', 'sun'), ('the', 'huge'), ('the', 'ball'), ('the', 'woman'), ('a', 'helps')]
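Because every string above contains only one match, the example does not show `re.findall()` returning several results. A sketch with a longer sentence:

```python
import re

pattern = r"(a|the) ([^ ]+)"
sentence = "the sun and the huge and the ball"

# With two groups in the pattern, each match is a (group 1, group 2) tuple.
matches = re.findall(pattern, sentence)
print(matches)
```
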


## Returning the indices of the list elements that contain the pattern

In [55]:
# import re

strings = ["apple", "banana", "grpe", "mango"]
pattern = r"a"

matches = [i for i, s in enumerate(strings) if re.search(pattern, s)]
print(matches)

[0, 1, 3]


+ re.search() is used to look for "a", and the index of every matching string is returned
+ The result [0, 1, 3] shows that every string except "grpe" contains "a"

# Computing the Length of a String

In [56]:
len("NSYSU")

5

In [57]:
len(["I", "am", "a", "student"])

4

In [58]:
my_list = ["I", "am", "a", "student"]
lengths = [len(item) for item in my_list]
print(lengths)

[1, 2, 1, 7]


![image.png](attachment:image.png)

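```
len() returns the number of elements in a container type (such as a string, list, or tuple).
None, however, is a special object that represents a missing or unknown value. It is not a container, so len() cannot compute its length
```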
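Calling `len(None)` raises a `TypeError`. When a list may contain `None` entries, one way to sketch a safe length computation is to check for `None` first (the list below is made-up data):

```python
items = ["NSYSU", None, "student"]  # hypothetical data with a missing value

# Keep None where there is no string to measure.
lengths = [len(item) if item is not None else None for item in items]
print(lengths)
```
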

# Sorting Strings

In [59]:
string = ["a","x","b","y","c","z"]
sorted_string = sorted(string)
result = "".join(sorted_string)
print(result)

abcxyz


+ sorted() returns a new, sorted list of strings

In [60]:
string = ["a","x","b","y","c","z"]
sorted_string = sorted(string, reverse=True)
result = "".join(sorted_string)
print(result)

zyxcba


+ Sorts in reverse order

In [61]:
x = ["a", "x", "b", "y", "c", "z"]

sorted_indices = sorted(range(len(x)), key=lambda i: x[i])
order = [i+1 for i in sorted_indices]

print(order)

[1, 3, 5, 2, 4, 6]


+ Returns the original (1-based) indices of the strings in sorted order
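`sorted()` compares strings by code point, so uppercase letters sort before lowercase ones. A sketch of case-insensitive sorting with the `key` argument (the word list is made up for illustration):

```python
words = ["apple", "Banana", "cherry"]  # hypothetical data with mixed case

default_order = sorted(words)                 # uppercase "B" sorts before lowercase "a"
ignore_case = sorted(words, key=str.lower)    # compare lowercase versions instead

print(default_order)
print(ignore_case)
```
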

# Padding a String Before or After

## (1) str.ljust(width, fillchar)
### Pads the right side of the string with the fillchar character until the string reaches the given width

In [62]:
string = "GOOD"
padded_string = string.ljust(10, '*')
print(padded_string) 

GOOD******


## (2) str.rjust(width, fillchar)
### Pads the left side of the string with the fillchar character until the string reaches the given width

In [63]:
string = "GOOD"
padded_string = string.rjust(10, '*')
print(padded_string)

******GOOD


## (3) str.center(width, fillchar)
### Pads both sides of the string evenly with the fillchar character until the string reaches the given width

In [64]:
string = "GOOD"
padded_string = string.center(10, '*')
print(padded_string) 

***GOOD***


When width is not greater than the length of the string, the original string is returned

In [65]:
string = "GOOD"
padded_string = string.center(3, '*')
print(padded_string) 

GOOD
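The same padding can also be sketched with the format specification mini-language inside an f-string, where `<`, `>`, and `^` mean left-aligned, right-aligned, and centered, and the character before them is the fill:

```python
string = "GOOD"

left = f"{string:*<10}"    # like ljust(10, '*')
right = f"{string:*>10}"   # like rjust(10, '*')
center = f"{string:*^10}"  # like center(10, '*') for this input
print(left, right, center)
```
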


# String Replacement (by Matched Content)
## (1) str.replace(old, new, count)
### Replaces every occurrence of the substring old in the string with new

In [66]:
string = "Hello, World!"
new_string = string.replace("o", "*")
print(new_string)

Hell*, W*rld!


### Replace only the first match

In [67]:
string = "Hello, World!"
new_string = string.replace("o", "*", 1)
print(new_string)

Hell*, World!


### Replace matched content in the strings of a list
+ A whole string

In [68]:
string5

['apple', 'banana', 'grape', 'mango']

In [69]:
old_item = "banana"
new_item = "orange"

new_list = [new_item if item == old_item else item for item in string5]
print(new_list)

['apple', 'orange', 'grape', 'mango']


+ Characters inside each string

In [70]:
new_list = []
for item in string5:
    new_item = item.replace("a", "*")
    new_list.append(new_item)
print(new_list)

['*pple', 'b*n*n*', 'gr*pe', 'm*ngo']


In [71]:
new_list = [item.replace("a", "*") for item in string5]
print(new_list)

['*pple', 'b*n*n*', 'gr*pe', 'm*ngo']


## (2) re.sub(pattern, repl, string, count=0, flags=0)
### Replaces the parts of a string that match a regular-expression pattern with the given content

In [72]:
# import re

string = "Hello, World!"
new_string = re.sub(r'o', '*', string, count=1)
print(new_string)

Hell*, World!


In [73]:
# import re

string = "Hello, World!"
new_string = re.sub(r'o', '*', string)
print(new_string)

Hell*, W*rld!


## (3) str.translate(table)
### Replaces characters through a translation table, which can be created with str.maketrans() and then applied to the string

In [74]:
string = "Hello, World!"
translation_table = str.maketrans("o", "*")
new_string = string.translate(translation_table)
print(new_string)

Hell*, W*rld!


+ To replace matched content in the strings of a list, use a loop as in the examples above
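`str.maketrans()` also accepts several characters at once, mapping each character of the first argument to the character at the same position in the second. A sketch applied to a list (the replacement characters are arbitrary):

```python
string5 = ["apple", "banana", "grape", "mango"]

# a -> 4, e -> 3, o -> 0 in a single pass over each string.
table = str.maketrans("aeo", "430")
new_list = [item.translate(table) for item in string5]
print(new_list)
```
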

# String Replacement (by Position)
Replace the character at a specific position

In [75]:
string7 = "NSYSU student"
position = 4
replacement = "X"

new_string = string7[:position] + replacement + string7[position+1:]
print(new_string)

NSYSX student


Extract the content at specific positions

In [76]:
start_positions = [1, 7]
end_positions = [5, 13]

new_strings = [string7[start-1:end] for start, end in zip(start_positions, end_positions)]
print(new_strings)

['NSYSU', 'student']


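**zip() pairs each start position with its end position; slicing then extracts the text in each range and stores it in a new list (the positions are 1-based, hence start-1)**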

In [77]:
new_string = [string7[i:] for i in range(len(string7))]
print(new_string)

['NSYSU student', 'SYSU student', 'YSU student', 'SU student', 'U student', ' student', 'student', 'tudent', 'udent', 'dent', 'ent', 'nt', 't']


**Replacing by position inside the strings of a list**

In [78]:
my_list = ["apple", "banana", "grape", "mango"]
new_list = [item[:2] + "X" + item[3:] if len(item) > 2 else item for item in my_list]
print(new_list)

['apXle', 'baXana', 'grXpe', 'maXgo']


In [79]:
my_list = ["apple", "banana", "grape", "mango"]
new_list = []
for item in my_list:
    if len(item) > 2:
        new_item = item[:2] + "X" + item[3:]
    else:
        new_item = item
    new_list.append(new_item)
print(new_list)

['apXle', 'baXana', 'grXpe', 'maXgo']


# Splitting Strings
## (1) split()

In [80]:
string8 = ["the smooth and the depth and a chicken and the parked", "the sun and the huge and the ball"]

In [81]:
pattern = " and "

new_strings = [s.split(pattern) for s in string8]
print(new_strings)

[['the smooth', 'the depth', 'a chicken', 'the parked'], ['the sun', 'the huge', 'the ball']]


## (2) re.split()

In [82]:
# import re
pattern = " and "

new_strings = [re.split(pattern, s) for s in string8]
print(new_strings)

[['the smooth', 'the depth', 'a chicken', 'the parked'], ['the sun', 'the huge', 'the ball']]


In [83]:
# import re

string = "a,b,c"
pattern = ","

new_strings = re.split(pattern, string)
print(new_strings)

['a', 'b', 'c']


In [84]:
# import re

string = "abc123def456xyz789"
pattern = "[a-z]+"  # match a run of one or more lowercase letters

new_strings = re.split(pattern, string)
print(new_strings)

['', '123', '456', '789']


+ [a-z] matches any lowercase letter.
+ \+ is a quantifier that matches the preceding pattern one or more times

In [85]:
# import re

string = "abc123def456xyz789"
pattern = "[0-9]+"

new_strings = re.split(pattern, string)
print(new_strings)

['abc', 'def', 'xyz', '']
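The empty strings appear because the input begins or ends with a separator. Two ways to sketch getting rid of them: filter the result of `re.split()`, or match the wanted pieces directly with `re.findall()`:

```python
import re

string = "abc123def456xyz789"

# Drop empty pieces after splitting on digit runs.
letters = [piece for piece in re.split("[0-9]+", string) if piece]

# Or collect the digit runs directly instead of splitting on letters.
digits = re.findall("[0-9]+", string)

print(letters)
print(digits)
```
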


# Extracting Words from a Sentence
### Find the first word

In [86]:
string9 = ["NSYSU student", "He saw a dog"] 

In [87]:
word_index = 1

first_words = [s.split()[word_index - 1] for s in string9]
print(first_words)

['NSYSU', 'He']


### Find all words from the second one onward

In [88]:
remaining_words = [s.split()[1:] for s in string9]
print(remaining_words)

[['student'], ['saw', 'a', 'dog']]


In [89]:
# import re

string0 = "qweasd.987654"
word_index = 1
separator = "."

words = re.split(re.escape(separator), string0) # treat "." as an ordinary character
result = words[word_index - 1]
print(result)

qweasd


+ re.escape() escapes the separator so that it is treated as ordinary characters rather than as regular-expression metacharacters with a special meaning
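To see why the escape matters, here is a sketch comparing the escaped and unescaped separator. An unescaped `.` matches any character, so every character becomes a split point:

```python
import re

string0 = "qweasd.987654"

escaped = re.split(re.escape("."), string0)  # split on a literal dot
unescaped = re.split(".", string0)           # "." matches every character

print(escaped)
print(unescaped)
```

For a fixed separator like this one, `string0.split(".")` gives the same result as the escaped version without any regular expression.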

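# Chinese Word Segmentation
```
Several libraries can be used for Chinese word segmentation; commonly used ones include jieba, pkuseg, and THULAC.
These libraries split Chinese text into words or chunks for later text processing and analysis
```
+ jieba: https://github.com/fxsjy/jieba
+ pkuseg: https://github.com/lancopku/pkuseg-python
+ THULAC: https://github.com/thunlp/THULAC-Python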

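### Additional notes:

Part-of-speech tagging:
+ [A compilation of part-of-speech tag codes for Chinese and English](https://blog.pulipuli.info/2017/11/fasttag-identify-part-of-speech-in.html)

Hamming distance, a measure of how much two strings differ:
+ [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance)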
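
As a rough sketch of the idea: the Hamming distance between two strings of equal length is the number of positions at which their characters differ. The function name below is only illustrative:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(a, b))

distance = hamming_distance("karolin", "kathrin")
print(distance)
```
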

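TF-IDF:
+ [TF-IDF (term frequency–inverse document frequency)](https://baike.baidu.com/item/tf-idf/8816134)

Deduplication algorithms:
+ [Text deduplication algorithms: Minhash/Simhash/Klongsent](https://zhuanlan.zhihu.com/p/43640234)

More discussion of word segmentation:
+ [If Chinese word segmentation is done poorly, human-machine natural-language interaction can hardly achieve a breakthrough](https://www.sohu.com/a/152768373_491255)
+ [Recommendations for seven excellent open-source Chinese word segmentation libraries, including HanLP](https://kknews.cc/code/o2j8e8o.html)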

---