# Python & Data - Week 6

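Problem: Parse & Transform

## This Week's Content

1. [Weekly Challenge](challenge5.html) | [ipynb file](challenge5.ipynb)

[Download everything as a zip](../week6.zip)

## Parse

1. Convert a raw format (CSV, XLS, JSON) into a form that Python can work with.
2. Convert plain text into numbers, dates, or other types so that Python can process it further.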
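The two steps above can be sketched with the standard library alone. This is a minimal illustration, assuming made-up sample data (`csv_text` and `json_text` are hypothetical stand-ins for real files); the dedicated libraries covered later do the same job with more features.

```python
import csv
import io
import json

# Hypothetical sample data standing in for real files
csv_text = "name,age\nSamuel,20\nMary,15\n"
json_text = '{"name": "John", "age": "13"}'

# Step 1: raw format -> Python objects (every CSV value is still a string)
rows = list(csv.DictReader(io.StringIO(csv_text)))
record = json.loads(json_text)

# Step 2: text -> numbers, so that arithmetic works
csv_ages = [int(row["age"]) for row in rows]
json_age = int(record["age"])

print(rows[0])
print(csv_ages, json_age)
```

Note that step 1 alone is not enough: after reading, `rows[0]["age"]` is still the string `"20"`, not the number 20.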

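### How?

1. Raw formats are handled by dedicated libraries (such as openpyxl and pandas), which will be introduced later; there is no need to write your own parser.
2. Converting text to numbers and text to dates is covered next:


#### Converting Text to Numbers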

In [104]:
ages = [
    "15",
    "20",
    "21",
    "21",
    "21",
    "10",
]

list(map(lambda age: int(age), ages)) # use int to convert the type

[15, 20, 21, 21, 21, 10]
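Real data often contains values that `int` cannot convert, in which case it raises `ValueError`. The following is one possible sketch of a tolerant conversion; `parse_int` and `messy_ages` are hypothetical names introduced here for illustration, not part of the original notebook.

```python
def parse_int(text, default=None):
    """Return text as an int, or default when it is not a valid integer."""
    try:
        return int(text.strip())
    except ValueError:
        return default

# Hypothetical messy input: extra spaces, a placeholder, and an empty string
messy_ages = ["15", " 20 ", "n/a", ""]
parsed_ages = [parse_int(age) for age in messy_ages]
print(parsed_ages)  # [15, 20, None, None]
```

Returning `None` keeps the list the same length as the input, so bad rows can be located afterwards.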

In [105]:
personal_info = [
    # age, height (in cm), weight (in kg)
    ("25", "175", "70"),
    ("20", "170", "60"),
    ("30", "180", "85"),
    ("10", "144", "45"),
    ("15", "156", "55"),
]

def calc_bmi(info):
    height_in_inches = info[1] * 0.39370078740157 # convert cm to inches
    weight_in_lb = info[2] * 2.20462262185 # convert kg to lb

    return (weight_in_lb / (height_in_inches * height_in_inches)) * 703

list(map(calc_bmi, personal_info))

TypeError: can't multiply sequence by non-int of type 'float'
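The `TypeError` above occurs because `info[1]` and `info[2]` are still strings, and Python cannot multiply a string by a float. A minimal sketch of one fix is to parse the fields with `float` before doing the arithmetic; `calc_bmi_parsed` is a hypothetical name chosen here so the failing version stays available for comparison.

```python
personal_info = [
    # age, height (in cm), weight (in kg)
    ("25", "175", "70"),
    ("20", "170", "60"),
    ("30", "180", "85"),
    ("10", "144", "45"),
    ("15", "156", "55"),
]

def calc_bmi_parsed(info):
    # Parse the text fields into numbers before doing any arithmetic
    height_cm = float(info[1])
    weight_kg = float(info[2])

    height_in_inches = height_cm * 0.39370078740157  # convert cm to inches
    weight_in_lb = weight_kg * 2.20462262185  # convert kg to lb

    return (weight_in_lb / (height_in_inches * height_in_inches)) * 703

bmi_values = [round(calc_bmi_parsed(info), 1) for info in personal_info]
print(bmi_values)
```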

#### Extracting Numbers from Text (Using Replace)

In [None]:
ages2 = [
    "age:20",
    "age:15",
    "age:13",
    "age:10",
    "age:30",
]

list(map(lambda age: int(age.replace("age:", "")), ages2))

[20, 15, 13, 10, 30]
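When the label in front of the number is not always the same, replacing a fixed prefix no longer works. One alternative, sketched here under the assumption that each item contains a single `:` separator, is `str.partition`; `age_after_colon` is a hypothetical helper name.

```python
ages2 = ["age:20", "age:15", "age:13", "age:10", "age:30"]

def age_after_colon(text):
    # partition splits on the first ":" into (before, separator, after)
    _label, _sep, value = text.partition(":")
    return int(value)

ages_from_partition = [age_after_colon(item) for item in ages2]
print(ages_from_partition)  # [20, 15, 13, 10, 30]
```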

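#### Extracting Numbers from Text (Using Regular Expressions)

Use `re.search(re_str, src_str)` to find content that matches a given pattern within a string.

`re_str` is the regular expression string  
`src_str` is the string to be searched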

In [None]:
import re
name_and_age = "Hello: 2"
matches = re.search("[0-9]+", name_and_age)

if matches is not None:
    hello_age = int(matches.group(0))
    print(hello_age)
    print("type of hello_age", type(hello_age))

name_and_ages = [
    "Samuel: 20",
    "Mary: 15",
    "John: 13",
    "Peter: 10",
    "Alex: 30",
]

def search_for_age(person):
    matches = re.search("[0-9]+", person)
    if matches is not None:
        return int(matches.group(0))

list(map(search_for_age, name_and_ages))

2
type of hello_age <class 'int'>


[20, 15, 13, 10, 30]

❓ What is the difference between `re.match` and `re.search`?  
💡 If you do not want to use regex, is there another way?
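Both questions can be explored with a short experiment. The regex-free variant below is only one possible approach, and it assumes the string contains a single run of ASCII digits: with an input such as `"A1 B2"` it would glue the digits together into 12.

```python
import re

text = "Samuel: 20"

# re.match only tries the pattern at the very start of the string
print(re.match("[0-9]+", text))   # None, because the string starts with "S"

# re.search scans forward until it finds the first match anywhere
found = re.search("[0-9]+", text)
print(found.group(0))             # 20

# A regex-free alternative: keep only the digit characters
digits_only = "".join(ch for ch in text if ch.isdigit())
print(int(digits_only))           # 20
```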

#### Converting Text to a Date Type

Use `datetime.strptime(src_str, format_str)` to parse a date string.

`src_str` is the string to be parsed  
`format_str` is the format string

In [None]:
from datetime import datetime

date = datetime.strptime("2月10日", '%m月%d日')

print(date)

new_date = date.replace(year = 2022) # change year from 1900 to 2022
print(new_date)

print(new_date > datetime.now()) # True only if new_date is still in the future; False means it is in the past
print(datetime(2022, 2, 22) > datetime.now()) # as of 2022-2-15, 2022-2-22 is in the future

1900-02-10 00:00:00
2022-02-10 00:00:00
False
True


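💡 What day of the week was February 15, 2022?  
Try looking it up in the [datetime docs](https://docs.python.org/3/library/datetime.html).

❗️ Watch out for time zone issues with `datetime`; we will look at them in more detail later if time permits.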
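As a small preview of the time zone issue, the sketch below contrasts a "naive" datetime (no time zone attached) with an "aware" one. The fixed UTC+8 offset is an assumption chosen purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: a fixed UTC+8 offset
utc_plus_8 = timezone(timedelta(hours=8))

naive = datetime(2022, 2, 15, 12, 0)                     # no time zone attached
aware = datetime(2022, 2, 15, 12, 0, tzinfo=utc_plus_8)  # explicit UTC+8

print(aware.astimezone(timezone.utc))  # 2022-02-15 04:00:00+00:00

try:
    naive < aware
except TypeError as error:
    # Python refuses to order naive and aware datetimes
    print("cannot compare:", error)
```

Everything produced by `strptime` and `datetime.now()` in this section is naive, so the comparisons here are only meaningful when all values refer to the same local time.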

In [None]:
today = datetime(2022, 2, 15)
print(today)

2022-02-15 00:00:00
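For the weekday question, the `datetime` module offers several methods. A minimal sketch using the same date:

```python
from datetime import datetime

today = datetime(2022, 2, 15)

print(today.weekday())     # 1  (Monday is 0, so 1 means Tuesday)
print(today.isoweekday())  # 2  (Monday is 1)
print(today.strftime("%A"))  # weekday name in the current locale, e.g. Tuesday
```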


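## Transform

Sometimes the raw data is not in the required format and must be converted before it can be processed or output.

### How?

#### Using Replace / Substitute

In `buggy_model_numbers`, some model numbers are written as M-0002 and others as M0002. They need to be converted to a single form before any further processing (or counting).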

In [None]:
buggy_model_numbers = [
    "M-0002",
    "T-0001",
    "M0002",
    "T0001",
    "J-0003"
]

list(map(lambda model_number: model_number.replace("-", ""), buggy_model_numbers))

['M0002', 'T0001', 'M0002', 'T0001', 'J0003']
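To see why the conversion matters for counting, here is a small sketch that tallies the model numbers before and after normalizing. Using `collections.Counter` is an illustrative choice made here, not something the original notebook prescribes.

```python
from collections import Counter

buggy_model_numbers = ["M-0002", "T-0001", "M0002", "T0001", "J-0003"]

# Without normalizing, "M-0002" and "M0002" are counted as different models
raw_counts = Counter(buggy_model_numbers)

# After removing the hyphen, both spellings collapse into one key
clean_counts = Counter(m.replace("-", "") for m in buggy_model_numbers)

print(len(raw_counts))    # 5
print(clean_counts)
```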

#### Using Regex to Replace and Restructure

In [110]:
# Letter part (exactly 1 character)
print("Letter part (exactly 1 character)")
print(re.match("[A-Z]", "M-0002"))
print(re.search("[A-Z]", "M-0002"))
print(re.match("[A-Z]", "MM-0002"))
print(re.search("[A-Z]", "MM-0002"), end="\n\n")

# First number only
print("First number only")
print(re.search("[0-9]", "M-0002"), end="\n\n")

# All number part (exactly 4 digits)
print("All number part (exactly 4 digits)")
print(re.match("[0-9]{4}", "02"))
print(re.match("[0-9]{4}", "002"))
print(re.match("[0-9]{4}", "0002"), end="\n\n")

Letter part (exactly 1 character)
<re.Match object; span=(0, 1), match='M'>
<re.Match object; span=(0, 1), match='M'>
<re.Match object; span=(0, 1), match='M'>
<re.Match object; span=(0, 1), match='M'>

First number only
<re.Match object; span=(2, 3), match='0'>

All number part (exactly 4 digits)
None
None
<re.Match object; span=(0, 4), match='0002'>



In [116]:
import re

buggy_model_numbers2 = [
    "M-0002",
    "t-0001",
    "J-0003",
]

def process_model_number(model_number):
    matches = re.match(r"([a-zA-Z])-([0-9]{4})", model_number)
    if matches is None:
        return model_number # no match: leave the value unchanged instead of crashing
    # (M)-(0002)
    prefix = matches.group(1) # first match group
    numbers = matches.group(2) # second match group
    return f'{prefix.upper()}-{numbers}'

# Transform to X-#### format
list(map(process_model_number, buggy_model_numbers2))

['M-0002', 'T-0001', 'J-0003']
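The same restructuring can also be written with `re.sub` and backreferences, which makes it easy to accept inputs with or without the hyphen. This is a sketch rather than the notebook's own method: `MODEL_PATTERN`, `normalize_model_number`, and the sample list are hypothetical, and returning `None` for unrecognized values is just one possible policy.

```python
import re

mixed_model_numbers = ["M-0002", "t0001", "J-0003", "unknown"]

# Hypothetical pattern: one letter, an optional hyphen, then exactly 4 digits
MODEL_PATTERN = re.compile(r"([a-zA-Z])-?([0-9]{4})")

def normalize_model_number(model_number):
    # fullmatch requires the whole string to fit the pattern
    if MODEL_PATTERN.fullmatch(model_number) is None:
        return None  # signal an unrecognized value instead of raising
    # \1 and \2 refer back to the two captured groups
    return MODEL_PATTERN.sub(r"\1-\2", model_number).upper()

normalized = [normalize_model_number(m) for m in mixed_model_numbers]
print(normalized)  # ['M-0002', 'T-0001', 'J-0003', None]
```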

[Regex Cheatsheet](https://www.keycdn.com/support/regex-cheatsheet)