# Subset Observations (Rows)
Extract and view only a subset of the rows in a DataFrame.<br>

0. [Preparation](#0)
1. [Extract rows that meet logical criteria](#1)
2. [Remove duplicate rows (only considers columns)](#2)
3. [Select first n rows](#3)
4. [Select last n rows](#4)
5. [Random Sampling](#5)
 1. [By fraction](#5-1)
 2. [By the number](#5-2)
6. [Select rows by position (index)](#6)
7. [Select and order top/bottom n entries](#7)
 1. [Top n entries](#7-1)
 2. [Bottom n entries](#7-2)

## 0. Preparation - Create a DataFrame<a name="0"></a>

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame(
    {"a": [4, 5, 6, 6, np.nan, 4, 5],
     "b": [7, 8, np.nan, 9, 9, 7, 5],
     "c": [10, 11, 12, np.nan, 12, 10, 5]
    }, index = pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2), ('e', 3), ('e', 4), ('e', 5), ('e', 6)],
        names = ['n', 'v']
    )
)
df

n,v,a,b,c
d,1,4.0,7.0,10.0
d,2,5.0,8.0,11.0
e,2,6.0,,12.0
e,3,6.0,9.0,
e,4,,9.0,12.0
e,5,4.0,7.0,10.0
e,6,5.0,5.0,5.0


---
## 1. Extract rows that meet logical criteria.<a name='1'></a>
Extract the rows that satisfy a given condition.

### Extract rows whose value in column `a` is less than 7.
Rows where `a` is `NaN` are excluded, because a comparison with `NaN` evaluates to `False`.

In [3]:
df[df.a < 7]

n,v,a,b,c
d,1,4.0,7.0,10.0
d,2,5.0,8.0,11.0
e,2,6.0,,12.0
e,3,6.0,9.0,
e,5,4.0,7.0,10.0
e,6,5.0,5.0,5.0


In [4]:
df[df['a'] < 7]

n,v,a,b,c
d,1,4.0,7.0,10.0
d,2,5.0,8.0,11.0
e,2,6.0,,12.0
e,3,6.0,9.0,
e,5,4.0,7.0,10.0
e,6,5.0,5.0,5.0


### Extract rows whose value in column `c` is greater than or equal to 11.
The row with `NaN` in column `c` is excluded from the result.

In [5]:
df[df.c >= 11]

n,v,a,b,c
d,2,5.0,8.0,11.0
e,2,6.0,,12.0
e,4,,9.0,12.0


In [6]:
df[df["c"] >= 11]

n,v,a,b,c
d,2,5.0,8.0,11.0
e,2,6.0,,12.0
e,4,,9.0,12.0
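
Several conditions can be combined in one filter. The sketch below rebuilds the tutorial's data under the hypothetical name `df_demo` so that it runs on its own, then uses `&` (and) and `|` (or). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `<` and `>=`.

```python
import numpy as np
import pandas as pd

# Same values as the tutorial's DataFrame; `df_demo` is an illustrative name.
df_demo = pd.DataFrame(
    {"a": [4, 5, 6, 6, np.nan, 4, 5],
     "b": [7, 8, np.nan, 9, 9, 7, 5],
     "c": [10, 11, 12, np.nan, 12, 10, 5]},
    index=pd.MultiIndex.from_tuples(
        [("d", 1), ("d", 2), ("e", 2), ("e", 3), ("e", 4), ("e", 5), ("e", 6)],
        names=["n", "v"],
    ),
)

# AND: keep rows where both conditions hold.
both = df_demo[(df_demo["a"] < 7) & (df_demo["c"] >= 11)]

# OR: keep rows where at least one condition holds.
either = df_demo[(df_demo["a"] < 5) | (df_demo["c"] >= 12)]

# NaN never satisfies a comparison, so missing values are selected explicitly.
missing_a = df_demo[df_demo["a"].isna()]

print(both)
print(either)
print(missing_a)
```

Using Python's `and` / `or` instead of `&` / `|` raises a `ValueError`, because a boolean Series has no single truth value.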


---
## 2. Remove duplicate rows (only considers columns).<a name='2'></a>
Remove rows whose column values are duplicated; index labels are ignored when rows are compared.

In [7]:
df

n,v,a,b,c
d,1,4.0,7.0,10.0
d,2,5.0,8.0,11.0
e,2,6.0,,12.0
e,3,6.0,9.0,
e,4,,9.0,12.0
e,5,4.0,7.0,10.0
e,6,5.0,5.0,5.0


In [8]:
df = df.drop_duplicates(keep = 'last')
df

n,v,a,b,c
d,2,5.0,8.0,11.0
e,2,6.0,,12.0
e,3,6.0,9.0,
e,4,,9.0,12.0
e,5,4.0,7.0,10.0
e,6,5.0,5.0,5.0


#### `inplace=True` drops duplicates in place, but it is not recommended.
Reassigning the result, as in `df = df.drop_duplicates()`, is preferred over the following:
```python
df.drop_duplicates(inplace = True)
```
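
The `keep` and `subset` parameters control which duplicates are dropped. The following is a minimal sketch on a small hypothetical frame named `toy` (not part of the tutorial data), in which rows 0 and 2 are identical.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are exact duplicates.
toy = pd.DataFrame({"a": [4, 5, 4, 5], "b": [7, 8, 7, 9]})

first_kept = toy.drop_duplicates()            # default keep="first": drops row 2
last_kept = toy.drop_duplicates(keep="last")  # drops row 0 instead
none_kept = toy.drop_duplicates(keep=False)   # drops every duplicated row
by_a = toy.drop_duplicates(subset=["a"])      # compares column "a" only

# duplicated() returns the boolean mask that drop_duplicates() acts on.
mask = toy.duplicated()

print(first_kept, last_kept, none_kept, by_a, mask, sep="\n\n")
```

Inspecting `toy.duplicated()` first is a cheap way to check which rows would be removed before committing to the drop.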

---
## 3. Select first n rows.<a name="3"></a>
Select the first n rows.

In [9]:
df.head(3)

n,v,a,b,c
d,2,5.0,8.0,11.0
e,2,6.0,,12.0
e,3,6.0,9.0,


---
## 4. Select last n rows.<a name="4"></a>
Select the last n rows.

In [10]:
df.tail(2)

n,v,a,b,c
e,5,4.0,7.0,10.0
e,6,5.0,5.0,5.0
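
`head` and `tail` also accept a negative `n`, and `n` defaults to 5 when omitted. The sketch below shows these cases on a small hypothetical frame named `toy`.

```python
import pandas as pd

# Hypothetical five-row frame used only for illustration.
toy = pd.DataFrame({"a": [10, 20, 30, 40, 50]})

default_head = toy.head()         # n defaults to 5
all_but_last_two = toy.head(-2)   # negative n: everything except the last 2 rows
all_but_first_two = toy.tail(-2)  # negative n: everything except the first 2 rows
whole = toy.head(100)             # n larger than the frame returns all rows

print(all_but_last_two)
print(all_but_first_two)
```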


---
## 5. Random Sampling<a name="5"></a>
### A. By Fraction<a name="5-1"></a>
Randomly select a fraction (`frac`) of the rows. The rows returned differ from run to run, and `frac=1` returns every row in shuffled order.

In [11]:
df.sample(frac=0.5)

n,v,a,b,c
e,6,5.0,5.0,5.0
e,3,6.0,9.0,
e,5,4.0,7.0,10.0


In [12]:
df.sample(frac=0.5)

n,v,a,b,c
e,2,6.0,,12.0
e,4,,9.0,12.0
e,6,5.0,5.0,5.0


In [13]:
df.sample(frac=1)

n,v,a,b,c
d,2,5.0,8.0,11.0
e,6,5.0,5.0,5.0
e,5,4.0,7.0,10.0
e,2,6.0,,12.0
e,3,6.0,9.0,
e,4,,9.0,12.0


In [14]:
df.sample(frac=0.3)

n,v,a,b,c
e,5,4.0,7.0,10.0
e,2,6.0,,12.0


### B. By the Number<a name="5-2"></a>
Randomly select `n` rows.
- A `ValueError` is raised if `n` is greater than the number of rows (unless `replace=True` is passed).

In [15]:
df.sample(n=3)

n,v,a,b,c
e,2,6.0,,12.0
e,6,5.0,5.0,5.0
e,5,4.0,7.0,10.0
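
Because `sample` is random, repeated calls return different rows. The sketch below, on a hypothetical six-row frame named `toy`, shows how `random_state` makes a sample reproducible, how `frac=1` shuffles, and how `replace=True` allows `n` to exceed the number of rows. The seed value `0` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical six-row frame used only for illustration.
toy = pd.DataFrame({"a": range(6)})

# The same seed gives the same rows on every call.
s1 = toy.sample(n=3, random_state=0)
s2 = toy.sample(n=3, random_state=0)

# frac=1 returns all rows in a shuffled order.
shuffled = toy.sample(frac=1, random_state=0)

# Asking for more rows than exist fails without replacement.
error_message = None
try:
    toy.sample(n=10)
except ValueError as exc:
    error_message = str(exc)

# With replacement, rows may repeat, so n can exceed len(toy).
oversampled = toy.sample(n=10, replace=True, random_state=0)

print(s1.equals(s2))
print(error_message)
print(len(oversampled))
```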


---
## 6. Select rows by position (index)<a name="6"></a>
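Extract rows by their integer position.
- The position is unrelated to the index labels.
- Selection depends only on where a row actually sits (0-based), not on the index values displayed.
- A slice that runs past the end returns an empty result instead of raising an error.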

In [16]:
df.iloc[10:20]

n,v,a,b,c


In [17]:
df.iloc[1:2]

n,v,a,b,c
e,2,6.0,,12.0


In [18]:
df.iloc[:2]

n,v,a,b,c
d,2,5.0,8.0,11.0
e,2,6.0,,12.0


In [19]:
df.iloc[3:]

n,v,a,b,c
e,4,,9.0,12.0
e,5,4.0,7.0,10.0
e,6,5.0,5.0,5.0


In [20]:
df.iloc[-2:]

n,v,a,b,c
e,5,4.0,7.0,10.0
e,6,5.0,5.0,5.0
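
To make the difference between position and label concrete, the sketch below contrasts `iloc` (position) with `loc` (label) on a small hypothetical frame named `toy` that carries the same kind of `MultiIndex` as the tutorial data.

```python
import pandas as pd

# Hypothetical three-row frame with a two-level index (n, v).
toy = pd.DataFrame(
    {"a": [5, 6, 6]},
    index=pd.MultiIndex.from_tuples(
        [("d", 2), ("e", 2), ("e", 3)], names=["n", "v"]
    ),
)

by_position = toy.iloc[[0, 2]]   # first and third rows, whatever their labels
by_label = toy.loc[("e", 2), :]  # the row labelled ('e', 2), returned as a Series
by_level = toy.loc["e"]          # every row whose first index level is 'e'

# Out-of-range slices are empty; an out-of-range single position raises.
empty = toy.iloc[10:20]
raised = False
try:
    toy.iloc[10]
except IndexError:
    raised = True

print(by_position)
print(by_label)
print(by_level)
```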


---
## 7. Select and order top/bottom n entries<a name="7"></a>
Select the top or bottom n rows, ordered by the values in a column.

### A. Top n entries<a name="7-1"></a>
```python
df.nlargest(n, 'value')
```

In [21]:
df.nlargest(4, 'a')

n,v,a,b,c
e,2,6.0,,12.0
e,3,6.0,9.0,
d,2,5.0,8.0,11.0
e,6,5.0,5.0,5.0


### B. Bottom n entries<a name="7-2"></a>
```python
df.nsmallest(n, 'value')
```

In [22]:
df.nsmallest(2, 'a')

n,v,a,b,c
e,5,4.0,7.0,10.0
d,2,5.0,8.0,11.0
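
When values tie at the cutoff, `nlargest` and `nsmallest` keep the first occurrences by default. The sketch below, on a hypothetical frame named `toy`, shows the `keep="all"` option and a comparable result obtained with `sort_values` followed by `head`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two 6s, two 5s, and one missing value.
toy = pd.DataFrame({"a": [5, 6, 6, np.nan, 4, 5]}, index=list("pqrstu"))

# Default keep="first": of the two tied 5s, only the earlier one (p) is kept.
top3 = toy.nlargest(3, "a")

# keep="all" returns every row tied with the last selected value,
# so the result can have more than n rows.
top3_all = toy.nlargest(3, "a", keep="all")

# Sorting then taking the head gives the same values; NaN sorts last by default.
via_sort = toy.sort_values("a", ascending=False).head(3)

print(top3)
print(top3_all)
print(via_sort)
```

Rows with `NaN` in the ranking column never appear in the `nlargest` or `nsmallest` result.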
