# D03 Slicing in Pandas

## Purpose

Building on what we already know about slicing lists, slicing NumPy arrays, and column / row names, we will look at the various ways to slice in Pandas.


## Indexing with `iloc`

As with lists and two-dimensional arrays, you can slice a data frame by integer position using the `iloc` attribute.

In [1]:
import pandas as pd
d = pd.read_excel("../assets/hrm.xlsx", index_col="id")
d.iloc[:5, :5]

id,sex,yob,height,weight,date_exam
223,1,1975,1.69,65.0,2019-03-23
236,1,1971,1.67,65.0,2019-03-19
256,1,1970,1.7,59.0,2019-03-06
296,1,1982,1.71,70.0,2019-04-23
310,0,1980,1.45,42.0,2019-03-28


You can also slice with a list of positions. Note: positions in Pandas also start at 0.

In [2]:
d.iloc[[1, 2, 5], :5]

id,sex,yob,height,weight,date_exam
236,1,1971,1.67,65.0,2019-03-19
256,1,1970,1.7,59.0,2019-03-06
312,1,1963,1.7,65.0,2019-02-16


Using `iloc` on a Series works the same way.

In [3]:
d["yob"].iloc[:5]

id
223    1975
236    1971
256    1970
296    1982
310    1980
Name: yob, dtype: int64
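
The positional rules above can be tried on a small frame without the Excel file. This is a minimal sketch: the frame `toy` and its values are made up for illustration and are not part of the lesson's dataset.

```python
import pandas as pd

# Toy frame; the ids and values are invented for illustration only.
toy = pd.DataFrame(
    {"sex": [1, 1, 0, 0], "yob": [1975, 1971, 1980, 1986]},
    index=pd.Index([223, 236, 310, 339], name="id"),
)

first_two = toy.iloc[:2]        # positions 0 and 1; the stop position is excluded
picked = toy.iloc[[0, 3], [1]]  # rows at positions 0 and 3, column at position 1
last_yob = toy["yob"].iloc[-1]  # negative positions count from the end

print(first_two)
print(picked)
print(last_yob)
```

Note that `iloc` ignores the `id` labels entirely: position 0 is the row labelled 223.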

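## Slicing by name

### Slicing columns

Pandas lets you access a column as if it were an attribute. This only works when the column name is a valid Python identifier and does not clash with an existing data frame attribute or method.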

In [4]:
d.yob

id
223     1975
236     1971
256     1970
296     1982
310     1980
        ... 
4200    1983
4214    1972
4216    1961
4220    1962
4240    1984
Name: yob, Length: 330, dtype: int64

Alternatively, use bracket syntax, as when indexing a list. Personally, I prefer this form.

In [5]:
d["yob"]

id
223     1975
236     1971
256     1970
296     1982
310     1980
        ... 
4200    1983
4214    1972
4216    1961
4220    1962
4240    1984
Name: yob, Length: 330, dtype: int64

You can pass a list of column names. Pandas returns the columns in exactly the order given in the list.

In [6]:
d[["height", "weight"]]

id,height,weight
223,1.69,65.0
236,1.67,65.0
256,1.70,59.0
296,1.71,70.0
310,1.45,42.0
...,...,...
4200,1.57,47.0
4214,1.57,58.0
4216,1.45,40.0
4220,1.54,49.0


Slicing a single column returns a Series. To get a data frame (with one column) instead, put the column name inside a list.

In [7]:
d[["weight"]]

id,weight
223,65.0
236,65.0
256,59.0
296,70.0
310,42.0
...,...
4200,47.0
4214,58.0
4216,40.0
4220,49.0
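
The difference between the two return types is easy to check directly. The sketch below uses a made-up two-row frame (`toy` is an assumption for illustration, not the lesson's data).

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame(
    {"height": [1.69, 1.45], "weight": [65.0, 42.0]},
    index=pd.Index([223, 310], name="id"),
)

as_series = toy["weight"]               # a bare column name gives a Series
as_frame = toy[["weight"]]              # a one-element list gives a DataFrame
reordered = toy[["weight", "height"]]   # columns come back in the order listed

print(type(as_series).__name__)
print(type(as_frame).__name__, as_frame.shape)
print(list(reordered.columns))
```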


### Using `loc`

To slice rows by label, use `loc` and pass it the index values.

In [8]:
d.loc[[223, 310]]

id,sex,yob,height,weight,date_exam,endo_avail,eso_LA,hp_endo,hp_breath,hrm_avail,...,q_fssg_03_nangbung,q_fssg_04_xoanguc,q_fssg_05_metsauan,q_fssg_06_nongratsauan,q_fssg_07_hong,q_fssg_08_daylucan,q_fssg_09_nuotnghen,q_fssg_10_dichtraolen,q_fssg_11_onhieu,q_fssg_12_nongratcuixuong
223,1,1975,1.69,65.0,2019-03-23,1,2.0,0.0,,1,...,2.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
310,0,1980,1.45,42.0,2019-03-28,1,0.0,0.0,,1,...,2.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0


You can pass slicing information to `loc` much as you do with `iloc`; just use index labels and column names instead of integer positions.

In [9]:
d.loc[[223, 296, 310], ["sex", "height", "weight"]]

id,sex,height,weight
223,1,1.69,65.0
296,1,1.71,70.0
310,0,1.45,42.0


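Instead of lists of column names or index labels, you can use slice syntax, as with lists. Unlike list slicing, a `loc` slice is label-based and includes both endpoints. In the example below, the row slice `1:5` refers to the labels 1 through 5, not to positions; since no `id` in this data falls in that range (the first `id` is 223), the result has no rows.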

In [10]:
d.loc[1:5, "sex":"weight"]

id,sex,yob,height,weight

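
To see how label slices differ from positional slices, here is a small sketch on a made-up frame with a sorted integer index (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy frame with a sorted integer index; values are invented for illustration.
toy = pd.DataFrame(
    {
        "sex": [1, 1, 1, 0],
        "yob": [1975, 1971, 1970, 1980],
        "height": [1.69, 1.67, 1.70, 1.45],
    },
    index=pd.Index([223, 236, 256, 310], name="id"),
)

by_label = toy.loc[223:256, "sex":"yob"]  # labels: both endpoints are included
by_position = toy.iloc[0:2, 0:2]          # positions: the stop is excluded
no_match = toy.loc[1:5, "sex":"yob"]      # no label lies between 1 and 5

print(by_label)
print(by_position)
print(no_match.shape)
```

With a sorted index, a label slice whose bounds are absent simply selects whatever labels fall inside the range, which is why `no_match` is empty rather than an error.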

In some cases, we can combine `loc` and `iloc`.

In [11]:
d.loc[:, ["weight", "height"]].iloc[:5]

id,weight,height
223,65.0,1.69
236,65.0,1.67
256,59.0,1.7
296,70.0,1.71
310,42.0,1.45


Slicing a single row and a single column returns a single value.

In [12]:
d.iloc[0, 0]

1

## Slicing by condition

### Slicing rows (query)

As in NumPy, you can slice with a condition. Let's see what results the different syntaxes produce.

In [13]:
# Slicing directly on the data frame variable
# returns the sub data frame that meets the condition
d[d["sex"] == 0]

id,sex,yob,height,weight,date_exam,endo_avail,eso_LA,hp_endo,hp_breath,hrm_avail,...,q_fssg_03_nangbung,q_fssg_04_xoanguc,q_fssg_05_metsauan,q_fssg_06_nongratsauan,q_fssg_07_hong,q_fssg_08_daylucan,q_fssg_09_nuotnghen,q_fssg_10_dichtraolen,q_fssg_11_onhieu,q_fssg_12_nongratcuixuong
310,0,1980,1.45,42.0,2019-03-28,1,0.0,0.0,,1,...,2.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0
339,0,1986,1.58,46.0,2019-03-02,1,1.0,1.0,99.0,1,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,2.0,0.0
342,0,1967,1.55,55.0,2019-03-14,0,,,,1,...,3.0,0.0,3.0,0.0,2.0,0.0,2.0,1.0,4.0,0.0
347,0,1980,1.63,59.0,2019-02-26,1,0.0,,,1,...,2.0,0.0,1.0,0.0,3.0,2.0,3.0,2.0,1.0,0.0
369,0,1961,1.57,54.0,2019-03-26,1,0.0,0.0,0.0,1,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4200,0,1983,1.57,47.0,2019-03-01,0,,,,1,...,1.0,0.0,3.0,0.0,2.0,0.0,3.0,2.0,3.0,0.0
4214,0,1972,1.57,58.0,2019-04-06,1,0.0,1.0,1.0,1,...,1.0,0.0,1.0,0.0,2.0,0.0,0.0,1.0,2.0,0.0
4216,0,1961,1.45,40.0,2019-03-12,1,0.0,0.0,,1,...,0.0,0.0,0.0,0.0,1.0,0.0,3.0,1.0,3.0,0.0
4220,0,1962,1.54,49.0,2019-03-27,1,0.0,0.0,1.0,1,...,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0


In [14]:
# Slicing with loc returns the rows that meet the condition,
# restricted to the given column(s); a single column name yields a Series
d.loc[d["sex"] == 0, "height"]

id
310     1.45
339     1.58
342     1.55
347     1.63
369     1.57
        ... 
4200    1.57
4214    1.57
4216    1.45
4220    1.54
4240    1.63
Name: height, Length: 224, dtype: float64

In [15]:
# You can slice the column first, then slice the rows
d["height"][d["sex"] == 0]

id
310     1.45
339     1.58
342     1.55
347     1.63
369     1.57
        ... 
4200    1.57
4214    1.57
4216    1.45
4220    1.54
4240    1.63
Name: height, Length: 224, dtype: float64

To combine several conditions, use the operators `&` (for "AND"), `|` (for "OR"), and `~` (for "NOT"). In the example below, I replace the comparison operators with their equivalent Pandas methods (`eq()` for `==`, `gt()` for `>`). In most cases the results are the same.

In [16]:
d.loc[d["sex"].eq(0) & d["weight"].gt(50), "height"]

id
342     1.55
347     1.63
369     1.57
380     1.58
404     1.58
        ... 
4013    1.50
4044     NaN
4069    1.65
4214    1.57
4240    1.63
Name: height, Length: 129, dtype: float64
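
When you keep the comparison operators instead of `eq()` / `gt()`, each condition needs its own parentheses, because Python evaluates `&`, `|`, and `~` before `==` and `>`. A minimal sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Toy frame; the ids and values are invented for illustration only.
toy = pd.DataFrame(
    {
        "sex": [1, 0, 0, 0],
        "weight": [65.0, 42.0, 55.0, 59.0],
        "height": [1.69, 1.45, 1.55, 1.63],
    },
    index=pd.Index([223, 310, 342, 347], name="id"),
)

# AND: parentheses around each comparison are required.
with_operators = toy.loc[(toy["sex"] == 0) & (toy["weight"] > 50), "height"]

# The same filter written with the comparison methods.
with_methods = toy.loc[toy["sex"].eq(0) & toy["weight"].gt(50), "height"]

# NOT and OR: rows where sex is not 0, or weight is below 45.
not_zero_or_light = toy.loc[~(toy["sex"] == 0) | (toy["weight"] < 45), "height"]

print(with_operators)
print(not_zero_or_light)
```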

Instead of putting the condition inside `d[]` or `loc`, you can use the `query()` method.

In [17]:
d.query("sex == 0")["height"]

id
310     1.45
339     1.58
342     1.55
347     1.63
369     1.57
        ... 
4200    1.57
4214    1.57
4216    1.45
4220    1.54
4240    1.63
Name: height, Length: 224, dtype: float64

This approach is easier to read when there are several conditions.

In [18]:
d.query("(sex == 0) & (weight > 50)")["height"]

id
342     1.55
347     1.63
369     1.57
380     1.58
404     1.58
        ... 
4013    1.50
4044     NaN
4069    1.65
4214    1.57
4240    1.63
Name: height, Length: 129, dtype: float64
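
Inside a `query()` string you can also refer to a Python variable by prefixing it with `@`, and use `and` / `or` in place of `&` / `|`. The sketch below uses a made-up frame and a hypothetical threshold variable `min_weight`.

```python
import pandas as pd

# Toy frame; the ids and values are invented for illustration only.
toy = pd.DataFrame(
    {
        "sex": [1, 0, 0, 0],
        "weight": [65.0, 42.0, 55.0, 59.0],
        "height": [1.69, 1.45, 1.55, 1.63],
    },
    index=pd.Index([223, 310, 342, 347], name="id"),
)

min_weight = 50  # hypothetical threshold kept in a local variable

# "@min_weight" pulls the local variable into the query expression.
from_query = toy.query("sex == 0 and weight > @min_weight")["height"]

# The equivalent boolean-mask form, for comparison.
from_mask = toy.loc[(toy["sex"] == 0) & (toy["weight"] > min_weight), "height"]

print(from_query)
```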

### Slicing columns

We often need to slice columns whose names share a common prefix or suffix. This dataset has the score columns of the FSSG questionnaire. We can slice out just these columns to compute a total score, using the `filter()` method. With `like=`, `filter()` keeps the columns whose names contain the given substring.

A note for R users: `filter()` in `dplyr` slices **rows**, while `select()` slices **columns**. In Pandas, `filter()` on a data frame selects **columns** by name by default, so be careful not to confuse the two languages.

In [19]:
d.filter(like="q_fssg_").head(5)

id,q_fssg_01_nongrat,q_fssg_02_dayhoi,q_fssg_03_nangbung,q_fssg_04_xoanguc,q_fssg_05_metsauan,q_fssg_06_nongratsauan,q_fssg_07_hong,q_fssg_08_daylucan,q_fssg_09_nuotnghen,q_fssg_10_dichtraolen,q_fssg_11_onhieu,q_fssg_12_nongratcuixuong
223,2.0,2.0,2.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
236,0.0,2.0,3.0,0.0,0.0,0.0,3.0,0.0,0.0,2.0,1.0,0.0
256,2.0,3.0,0.0,0.0,0.0,0.0,3.0,0.0,2.0,3.0,4.0,0.0
296,2.0,2.0,2.0,1.0,0.0,1.0,2.0,1.0,1.0,2.0,2.0,0.0
310,2.0,4.0,2.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0
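
As a sketch of the "total score" idea mentioned above, the selected columns can be summed row by row. The frame below is made up and keeps only three of the questionnaire columns; treating the total as a plain row sum is an assumption for illustration, not a statement about how FSSG is officially scored.

```python
import pandas as pd

# Toy frame with three made-up score columns plus one unrelated column.
toy = pd.DataFrame(
    {
        "sex": [1, 0],
        "q_fssg_01_nongrat": [2.0, 2.0],
        "q_fssg_02_dayhoi": [2.0, 4.0],
        "q_fssg_03_nangbung": [2.0, 2.0],
    },
    index=pd.Index([223, 310], name="id"),
)

fssg = toy.filter(like="q_fssg_")            # names containing the substring
by_regex = toy.filter(regex=r"^q_fssg_\d+")  # names matching a regular expression

# Row-wise sum across the selected columns (missing values are skipped by default).
total = fssg.sum(axis=1)

print(list(fssg.columns))
print(total)
```

The `regex=` form is stricter than `like=` here, since it only accepts names that start with the prefix.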


## Modifying data

When modifying data, the two syntaxes Pandas recommends are `loc` and `iloc`. For example, to change every `0` in `sex` to `"Nữ"` (Vietnamese for "female"), write the following. Note that the column's dtype changes from `int64` to `object`, because it now mixes integers and strings.

In [20]:
d.loc[d["sex"] == 0, "sex"] = "Nữ"
d["sex"]

id
223      1
236      1
256      1
296      1
310     Nữ
        ..
4200    Nữ
4214    Nữ
4216    Nữ
4220    Nữ
4240    Nữ
Name: sex, Length: 330, dtype: object
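
Recent Pandas versions may warn or raise when a string is written into an integer column. The sketch below works around that by casting the column to `object` first; the cast and the frame `toy` are assumptions added for illustration, not part of the original lesson.

```python
import pandas as pd

# Toy frame; the ids and values are invented for illustration only.
toy = pd.DataFrame(
    {"sex": [1, 0, 0], "weight": [65.0, 42.0, 58.0]},
    index=pd.Index([223, 310, 4214], name="id"),
)

# Assumption: cast to object so the column may hold both ints and strings.
toy["sex"] = toy["sex"].astype(object)

# Row mask and target column go into a single loc call, so the assignment
# writes into toy itself rather than into a temporary copy.
toy.loc[toy["sex"] == 0, "sex"] = "Nữ"

print(toy["sex"].tolist())
```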

Pandas offers more efficient ways to replace data, which we will cover in the next lesson.

## Random sampling

You can randomly select a number of records (rows) from a data frame with the `sample()` method.

In [21]:
d.sample(n=5)

id,sex,yob,height,weight,date_exam,endo_avail,eso_LA,hp_endo,hp_breath,hrm_avail,...,q_fssg_03_nangbung,q_fssg_04_xoanguc,q_fssg_05_metsauan,q_fssg_06_nongratsauan,q_fssg_07_hong,q_fssg_08_daylucan,q_fssg_09_nuotnghen,q_fssg_10_dichtraolen,q_fssg_11_onhieu,q_fssg_12_nongratcuixuong
223,1,1975,1.69,65.0,2019-03-23,1,2.0,0.0,,1,...,2.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
3834,Nữ,1986,1.5,44.0,2019-04-05,1,0.0,0.0,,1,...,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,4.0,0.0
4214,Nữ,1972,1.57,58.0,2019-04-06,1,0.0,1.0,1.0,1,...,1.0,0.0,1.0,0.0,2.0,0.0,0.0,1.0,2.0,0.0
1299,Nữ,1956,1.58,50.0,2019-04-02,1,0.0,0.0,0.0,1,...,2.0,0.0,1.0,0.0,3.0,2.0,0.0,0.0,2.0,0.0
2471,1,1977,1.71,74.0,2019-04-23,1,1.0,0.0,1.0,1,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Set the `frac` argument instead of `n` to sample a given fraction of the rows. Here, 2% of 330 rows is rounded to 7 rows.

In [22]:
d.sample(frac=0.02)

id,sex,yob,height,weight,date_exam,endo_avail,eso_LA,hp_endo,hp_breath,hrm_avail,...,q_fssg_03_nangbung,q_fssg_04_xoanguc,q_fssg_05_metsauan,q_fssg_06_nongratsauan,q_fssg_07_hong,q_fssg_08_daylucan,q_fssg_09_nuotnghen,q_fssg_10_dichtraolen,q_fssg_11_onhieu,q_fssg_12_nongratcuixuong
1996,1,1988,1.7,68.0,2019-04-09,1,0.0,1.0,,1,...,2.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0,3.0,0.0
4014,1,1986,1.61,75.0,2019-02-16,1,0.0,0.0,1.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
3379,Nữ,1969,1.53,50.0,2019-04-11,1,0.0,0.0,,1,...,1.0,0.0,0.0,0.0,3.0,0.0,0.0,2.0,2.0,0.0
1177,Nữ,1978,1.5,48.0,2019-04-18,1,0.0,1.0,,1,...,1.0,0.0,0.0,1.0,3.0,2.0,3.0,2.0,2.0,0.0
951,1,1937,1.54,49.0,2019-02-28,1,1.0,0.0,,1,...,0.0,0.0,0.0,0.0,3.0,0.0,2.0,0.0,4.0,0.0
366,1,1985,1.63,68.0,2019-04-02,1,0.0,0.0,0.0,1,...,2.0,2.0,2.0,2.0,2.0,1.0,1.0,3.0,3.0,3.0
916,Nữ,1946,1.5,50.0,2019-03-09,1,1.0,0.0,0.0,1,...,0.0,0.0,0.0,0.0,2.0,0.0,2.0,3.0,2.0,0.0
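
Each call to `sample()` draws different rows. If you need the same sample every time, for example to make a notebook reproducible, you can pass `random_state`. A minimal sketch on a made-up frame (the seed value 42 is arbitrary):

```python
import pandas as pd

# Toy frame with ten made-up rows.
toy = pd.DataFrame(
    {"yob": range(1960, 1970)},
    index=pd.Index(range(100, 110), name="id"),
)

# The same seed gives the same rows on every call.
first_draw = toy.sample(n=3, random_state=42)
second_draw = toy.sample(n=3, random_state=42)

# frac=0.5 of ten rows gives five rows.
half = toy.sample(frac=0.5, random_state=42)

print(first_draw)
print(len(half))
```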


## Some other conditions

### `isin()`

When you want to keep the records whose value is one of several given values, you can use the `isin()` method.

In [23]:
d.loc[d["eso_LA"].isin([0, 1, 2]), ["sex", "yob"]]

id,sex,yob
223,1,1975
236,1,1971
296,1,1982
310,Nữ,1980
326,1,1975
...,...,...
4189,1,1989
4214,Nữ,1972
4216,Nữ,1961
4220,Nữ,1962


### `in` inside `query()`

Instead of `isin()`, you can use `query()` with the `in` keyword, as in Python.

In [24]:
d.query("(eso_LA in [0, 1, 2]) & (height > 1.60)")[["sex", "yob"]]

id,sex,yob
223,1,1975
236,1,1971
296,1,1982
326,1,1975
347,Nữ,1980
...,...,...
4066,1,1981
4069,1,1983
4089,1,1958
4189,1,1989
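
The two forms can be compared side by side, along with the negated version using `~`. The sketch below uses a made-up frame that includes one missing value; `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame; one eso_LA value is missing on purpose.
toy = pd.DataFrame(
    {"eso_LA": [2.0, 0.0, None, 3.0], "height": [1.69, 1.71, 1.45, 1.55]},
    index=pd.Index([223, 296, 310, 342], name="id"),
)

with_isin = toy.loc[toy["eso_LA"].isin([0, 1, 2])]
with_query = toy.query("eso_LA in [0, 1, 2]")

# "~" inverts the mask: rows whose value is NOT in the list,
# which here also includes the row with the missing value.
excluded = toy.loc[~toy["eso_LA"].isin([0, 1, 2])]

print(with_isin)
print(excluded)
```

A missing value is never "in" the list, so it ends up on the excluded side.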


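### Detecting duplicate records

To detect rows whose value in a given column repeats an earlier one, use the `duplicated()` method. By default, the first occurrence of each value is marked `False` and only the later repeats are marked `True`.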

In [25]:
d["yob"].duplicated()

id
223     False
236     False
256     False
296     False
310     False
        ...  
4200     True
4214     True
4216     True
4220     True
4240     True
Name: yob, Length: 330, dtype: bool

Then select these rows with ordinary slicing.

In [26]:
d.loc[d["yob"].duplicated()].iloc[:, :5]

id,sex,yob,height,weight,date_exam
326,1,1975,1.64,56.0,2019-03-08
347,Nữ,1980,1.63,59.0,2019-02-26
420,Nữ,1971,1.58,55.0,2019-02-16
432,Nữ,1965,1.50,47.0,2019-05-21
446,1,1981,1.70,70.0,2019-02-28
...,...,...,...,...,...
4200,Nữ,1983,1.57,47.0,2019-03-01
4214,Nữ,1972,1.57,58.0,2019-04-06
4216,Nữ,1961,1.45,40.0,2019-03-12
4220,Nữ,1962,1.54,49.0,2019-03-27
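
The `keep` argument controls which occurrences are flagged. The sketch below, on a made-up frame, contrasts the default with `keep=False`, which marks every member of a duplicated group, and shows `drop_duplicates()` for removing the repeats.

```python
import pandas as pd

# Toy frame; 1975 and 1980 each appear twice.
toy = pd.DataFrame(
    {"yob": [1975, 1971, 1980, 1975, 1980]},
    index=pd.Index([223, 236, 310, 326, 347], name="id"),
)

later_repeats = toy["yob"].duplicated()          # default keep="first"
all_members = toy["yob"].duplicated(keep=False)  # flag every repeated value

# Keep only the first row for each yob value.
deduplicated = toy.drop_duplicates(subset="yob")

print(later_repeats.tolist())
print(all_members.tolist())
print(deduplicated.index.tolist())
```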


---

[Previous lesson](./02_colindex.ipynb) - [Lesson list](../README.md) - [Next lesson](./04_replace.ipynb)