> **Toggle solutions:** Click **Show solution** to expand the hidden answer. Works in JupyterLab, VS Code, and classic Notebook without extensions.

In [2]:
import pandas as pd
import numpy as np

## Labeled Array Model vs Relational Model (Concept)

**Key model differences to remember** (for all subsequent hands-on parts):

- **Index & Order (pandas):** `Series` has a first-class *Index*; **order is intrinsic and preserved**. Binary ops implicitly **align by labels**.
- **Keys & Order (relational):** Relations (tables) are sets/bags of tuples; **order is not part of the model**. Alignment is explicit via **JOIN** on keys.
- **Missing:** In pandas, `NaN`/`pd.NA` generally propagate through element-wise operations and are skipped by default in reductions; SQL uses `NULL` with three-valued logic.
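
To make the implicit alignment visible, the following sketch uses `Series.align` with `join='outer'`, which reindexes both operands to the union of their labels before any arithmetic happens. This is roughly what a plain `+` does behind the scenes. The names `left` and `right` are illustrative and not part of the notebook.

```python
import pandas as pd

left = pd.Series({'a': 1.0, 'b': 2.0})
right = pd.Series({'b': 10.0, 'c': 100.0})

# Reindex both operands to the union of labels (an outer join on the index).
left_al, right_al = left.align(right, join='outer')

# Adding the aligned operands gives the same result as plain `+`.
total = left_al + right_al
same_as_plus = total.equals(left + right)
```

After alignment both Series share the index `['a', 'b', 'c']`, and each one holds `NaN` where the label came only from the other side.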

The code below demonstrates **label alignment** in pandas using two Series with partially overlapping labels.


In [4]:
s1 = pd.Series({'a': 1.0, 'b': 2.0})
s2 = pd.Series({'b':10.0, 'c':100.0})

In [5]:
s1 + s2 # s1.add(s2)

a     NaN
b    12.0
c     NaN
dtype: float64

In [6]:
s1.add(s2, fill_value=0)

a      1.0
b     12.0
c    100.0
dtype: float64

**Exercise (Concept):**
1. Create two new Series with partially overlapping labels (e.g., labels `'x'`, `'y'`, `'z'` and `'y'`, `'z'`, `'w'`) and show how plain `+` and `.add(..., fill_value=0)` differ.
2. Explain in a sentence how pandas' implicit label alignment relates to a **FULL OUTER JOIN** in SQL.

<details>
<summary><b>Show solution</b></summary>

```python
import pandas as pd
x1 = pd.Series({'x': 1, 'y': 2})
x2 = pd.Series({'y': 10, 'w': 100})
plain = x1 + x2
filled = x1.add(x2, fill_value=0)
plain, filled
```

**Explanation.** Plain `+` aligns on the union of labels and leaves non-overlaps as `NaN`. `add(..., fill_value=0)` performs the same alignment but substitutes `0` for missing before computing — analogous to a FULL OUTER JOIN followed by arithmetic with NULL→0.

</details>

In [15]:
# Your code here
X = pd.Series([2,3,4],index=['x','y','z'])
Y = pd.Series([1,2,3],index=['y','z','w'])
X.add(Y, fill_value=0), X+Y

(w    3.0
 x    2.0
 y    4.0
 z    6.0
 dtype: float64,
 w    NaN
 x    NaN
 y    4.0
 z    6.0
 dtype: float64)

## Series Index

- The **Index** is a first-class axis of labels. It is not a data column but drives selection and alignment.
- Useful checks: `s.index.is_unique`, `s.index.dtype`, monotonicity flags.
- Avoid in-place edits of individual index labels; prefer replacing the whole `Index` or using helpers like `rename_axis`, `reindex`, `sort_index`.
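
As a small sketch of the checks and the replace-rather-than-mutate advice above, the example below inspects an index and then swaps in a whole new one. The Series `s_demo` and its labels are made up for illustration.

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30], index=['u03', 'u01', 'u02'], name='score')

unique = s_demo.index.is_unique                    # no duplicate labels
increasing = s_demo.index.is_monotonic_increasing  # labels are not sorted here

# Index objects are immutable, so assigning to one label raises TypeError.
try:
    s_demo.index[0] = 'u99'
    mutated = True
except TypeError:
    mutated = False

# Replace the whole Index instead; set_axis returns a new Series.
renamed = s_demo.set_axis(['a', 'b', 'c'], axis='index')
```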


In [13]:
s = pd.Series([10, 20, 30], index=['u01','u02','u03'], name='score')

In [17]:
s

u01    10
u02    20
u03    30
Name: score, dtype: int64

In [18]:
s.index

Index(['u01', 'u02', 'u03'], dtype='object')

In [24]:
s.rename_axis('user_id')

user_id
u01    10
u02    20
u03    30
Name: score, dtype: int64

In [20]:
s.index.is_unique

True

In [21]:
s.index.dtype

dtype('O')

In [22]:
s.sort_index(ascending=False)

u03    30
u02    20
u01    10
Name: score, dtype: int64

In [23]:
s.reindex(['u01','u02','u04'])

u01    10.0
u02    20.0
u04     NaN
Name: score, dtype: float64

**Exercise (Series Index):**
- Construct a `Series` with **duplicate** labels (e.g., two `'u02'`s). Show that `s.loc['u02']` returns multiple values.
- Then create a version where the index is unique (e.g., append a suffix) and demonstrate that `s.loc['u02_1']` returns a single scalar.


<details>
<summary><b>Show solution</b></summary>

```python
import pandas as pd
dup = pd.Series([10, 40], index=['u02','u02'])
sel_dup = dup.loc['u02']

uniq = pd.Series([10, 40], index=['u02_1','u02_2'])
sel_uniq = uniq.loc['u02_1']
type(sel_dup), sel_dup.tolist(), type(sel_uniq), sel_uniq
```

**Explanation.** `dup` has duplicate labels, so `loc['u02']` returns *both* rows (a Series). After making labels unique, `loc['u02_1']` returns a single scalar.

</details>

In [17]:
# Your code here
s = pd.Series([1,2,3], index=['u1','u2','u2'])
type(s.loc['u1']), type(s.loc['u2'])

(numpy.int64, pandas.core.series.Series)

## Label vs Position Indexing

- **Label-based**: `s.loc[label/list/slice]` — slices are **inclusive** of the stop label.
- **Position-based**: `s.iloc[pos/list/slice]` — slices are **exclusive** of the stop.
- Prefer explicit `.loc`/`.iloc` to avoid ambiguity, especially with integer-like indexes.
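
The ambiguity with integer-like indexes is easiest to see when labels and positions disagree. In this sketch the Series `n` (an illustrative name) has integer labels starting at 1, so label `1` and position `1` point at different elements.

```python
import pandas as pd

n = pd.Series(['p', 'q', 'r'], index=[1, 2, 3])

by_label = n.loc[1]      # label 1 is the first element
by_position = n.iloc[1]  # position 1 is the second element

label_slice = n.loc[1:2]      # inclusive of the stop label: labels 1 and 2
position_slice = n.iloc[1:2]  # exclusive of the stop: position 1 only
```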


In [18]:
grades = pd.Series([85, 92, 78, 96],
                   index=['Alice','Bob','Charlie','Diana'],
                   name='grades')

In [19]:
grades.loc['Alice']

np.int64(85)

In [20]:
grades.loc['Alice':'Charlie']

Alice      85
Bob        92
Charlie    78
Name: grades, dtype: int64

In [21]:
grades.iloc[0]

np.int64(85)

In [22]:
grades.iloc[0:2]

Alice    85
Bob      92
Name: grades, dtype: int64

**Exercise (Selection):**
- Select scores for `['Charlie','Diana']` using **labels**.
- Select the last two elements using **positions** only.
- Select all names whose label starts with `'A'` using a boolean mask on the **index**.


<details>
<summary><b>Show solution</b></summary>

```python
import pandas as pd
grades = pd.Series([85, 92, 78, 96],
                   index=['Alice','Bob','Charlie','Diana'])
part1 = grades.loc[['Charlie','Diana']]
part2 = grades.iloc[-2:]
part3 = grades.loc[grades.index.str.startswith('A')]
part1, part2, part3
```

**Explanation.** `loc` selects by labels, `iloc` by positions, and the index-based mask chooses labels beginning with `'A'`.

</details>

In [35]:
# Your code here
part1 = grades.loc[['Charlie', 'Diana']]               # by label
part2 = grades.iloc[-2:]                                # by position
part3 = grades.loc[grades.index.str.startswith('A')]   # boolean mask on the index
part1, part2, part3

(Charlie    78
 Diana      96
 Name: grades, dtype: int64,
 Charlie    78
 Diana      96
 Name: grades, dtype: int64,
 Alice    85
 Name: grades, dtype: int64)

## Boolean Filtering Idioms — `isin`, `between`

- Value filters: `s.isin([...])` and `s.between(lo, hi)` (inclusive by default).
- Remember to combine masks with `&`, `|`, and `~` and wrap each side with parentheses.
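
A minimal sketch of combining masks follows. The parentheses matter because `&`, `|`, and `~` bind more tightly than comparison operators in Python. The Series `vals` mirrors the values used in this section, and the variable names are illustrative.

```python
import pandas as pd

vals = pd.Series([1, 3, 7, 4, 9], index=list('abcde'))

# Values in [3, 7] that are also odd.
in_range_and_odd = vals[(vals.between(3, 7)) & (vals % 2 == 1)]

# Values outside [3, 7], or explicitly listed.
outside_or_listed = vals[(~vals.between(3, 7)) | (vals.isin([4]))]
```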


In [3]:
t = pd.Series([1, 3, 7, 4, 9], index=list('abcde'))

In [None]:
t.isin([3,9])

In [None]:
t[t.isin([3,9])]

In [None]:
t.between(3,7)

In [None]:
t[t.between(3,7)]

**Exercise (Filtering):**
- From `t`, select values **not** in `{3, 4}` using a boolean mask.
- Using `between`, select values in `[2, 8]` and then invert the mask.


<details>
<summary><b>Show solution</b></summary>

```python
import pandas as pd
t = pd.Series([1, 3, 7, 4, 9], index=list('abcde'))
ans1 = t[~t.isin([3,4])]
mask_between = t.between(2,8)
ans2 = t[~mask_between]
ans1, mask_between, ans2
```

**Explanation.** `isin` builds a membership mask; `~` negates it. `between(2,8)` is inclusive; negating selects outside the interval.

</details>

In [6]:
# Your code here
part1 = t[~t.isin([3,4])]
mask = ~t.between(2,8)
part2 = t[mask]
part1, mask, part2

(a    1
 c    7
 e    9
 dtype: int64,
 a     True
 b    False
 c    False
 d    False
 e     True
 dtype: bool,
 a    1
 e    9
 dtype: int64)

## Take by Labels or Positions

- By labels: `s.loc[[...]]`
- By positions: `s.take([pos,...])` (ignores labels; fast positional gather)
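
The difference between the two is clearest when the labels are integers that do not match the positions. The sketch below uses a made-up Series `g` with labels 100, 200, 300: `take` reads the numbers as positions, while `.loc` looks them up as labels and raises `KeyError` when they are absent.

```python
import pandas as pd

g = pd.Series([10, 20, 30], index=[100, 200, 300])

by_pos = g.take([2, 0])     # positions 2 and 0
by_lab = g.loc[[300, 100]]  # the same rows, requested by label

# Positions are not labels: .loc cannot find labels 2 and 0.
try:
    g.loc[[2, 0]]
    missing_raises = False
except KeyError:
    missing_raises = True
```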


In [7]:
u = pd.Series([10, 20, 30, 40, 50], index=list('ABCDE'))

In [9]:
u.loc[['A','E','C']]

A    10
E    50
C    30
dtype: int64

In [10]:
u.take([0, 4, 2])

A    10
E    50
C    30
dtype: int64

**Exercise (Take):**
- Produce the order `['B','B','E']` using positions **only** (i.e., `take`).
- Use labels to get `['E','A']` in that order.


<details>
<summary><b>Show solution</b></summary>

```python
import pandas as pd
u = pd.Series([10, 20, 30, 40, 50], index=list('ABCDE'))
pos_only = u.take([1,1,4])
labels_only = u.loc[['E','A']]
pos_only, labels_only
```

**Explanation.** `take` gathers by position (ignores labels), enabling duplicates; `loc` gathers by label in the specified order.

</details>

In [14]:
# Your code here
part1 = u.take([1,1,4])
part2 = u.loc[['E','A']]
part1, part2

(B    20
 B    20
 E    50
 dtype: int64,
 E    50
 A    10
 dtype: int64)

## Index Edits — Assign, Sort, Delete (Series-focused)

- Assign a new `Index`: `s.set_axis(new_index, axis='index')` (returns a copy).
- Sort by labels: `s.sort_index(...)`.
- Delete labels: `s.drop(labels, axis='index')`.
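
The sketch below shows that these edits return new objects and leave the original Series untouched. It also uses `errors='ignore'` on `drop` so that a label that does not exist is skipped instead of raising. The names `v0` and `'not_there'` are illustrative.

```python
import pandas as pd

v0 = pd.Series([5, 1, 9], index=['z', 'x', 'y'])

relabeled = v0.set_axis(['u', 'v', 'w'], axis='index')
ordered = v0.sort_index()
dropped = v0.drop(['x', 'not_there'], errors='ignore')

# The original keeps its labels and order.
original_labels = v0.index.tolist()
```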


In [15]:
v = pd.Series([5, 1, 9], index=['z','x','y'])

In [16]:
v

z    5
x    1
y    9
dtype: int64

In [17]:
v.sort_index()

x    1
y    9
z    5
dtype: int64

In [18]:
v.drop(['x'])

z    5
y    9
dtype: int64

In [19]:
v.set_axis(['u','v','w'], axis='index')

u    5
v    1
w    9
dtype: int64

**Exercise (Index Edits):**
- Given `v`, create a version sorted by index in reverse lexicographic order.
- Drop label `'z'` and verify the resulting index/order.


<details>
<summary><b>Show solution</b></summary>

```python
import pandas as pd
v = pd.Series([5, 1, 9], index=['z','x','y'])
rev = v.sort_index(ascending=False)
dropped = rev.drop(['z'])
rev.index.tolist(), dropped.index.tolist()
```

**Explanation.** Sorting by labels in descending order yields `['z','y','x']`; dropping `'z'` leaves `['y','x']`. Values remain associated with their labels.

</details>

In [22]:
# Your code here
rev = v.sort_index(ascending=False)
dropped = rev.drop(['z'])
rev.index.tolist(), dropped.index.tolist()

(['z', 'y', 'x'], ['y', 'x'])

## Find Element by Value — Vector Scan vs Indexed Lookup

- Vector scan: `s.eq(v)` -> mask -> labels/positions.
- Extremes: `s.idxmax()`, `s.idxmin()`.
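- Reverse map for value->label(s): `s.reset_index().groupby(0)['index'].apply(list)` (this relies on the default column names `'index'` and `0` that `reset_index()` produces for an unnamed Series with an unnamed index).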
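
One possible variant of the reverse map names the index and the values first, so the `groupby` does not depend on default column names. This is a sketch under that assumption; the names `'label'`, `'value'`, `scores`, and `reverse` are illustrative.

```python
import pandas as pd

scores = pd.Series([85, 92, 78, 92], index=['Alice', 'Bob', 'Charlie', 'Eve'])

# Give the index and the values explicit names before flattening to a table.
table = scores.rename_axis('label').rename('value').reset_index()

# Group rows by value and collect the labels of each group into a list.
reverse = table.groupby('value')['label'].apply(list)

labels_92 = reverse.loc[92]
labels_78 = reverse.loc[78]
```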


In [23]:
w = pd.Series([85, 92, 78, 92, 96], index=['Alice','Bob','Charlie','Eve','Diana'])

In [24]:
w.eq(92)

Alice      False
Bob         True
Charlie    False
Eve         True
Diana      False
dtype: bool

In [26]:
w.index[w.eq(92)].tolist()

['Bob', 'Eve']

In [27]:
w.idxmax()

'Diana'

In [28]:
w.idxmin()

'Charlie'

In [29]:
np.flatnonzero(w.to_numpy() == 92)

array([1, 3])

**Exercise (Find):**
- Build a reverse map (value -> list of labels) from `w` and use it to return the labels for value `92`.
- Find the **position indices** of all values greater than `90`.


<details>
<summary><b>Show solution</b></summary>

```python
import numpy as np, pandas as pd
w = pd.Series([85, 92, 78, 92, 96], index=['Alice','Bob','Charlie','Eve','Diana'])
rev = w.reset_index().groupby(0)['index'].apply(list)
labels_92 = rev.loc[92]
pos_gt90 = np.flatnonzero(w.to_numpy() > 90)
labels_92, pos_gt90
```

**Explanation.** Group by value to collect all labels (handles duplicates). Positions come from a NumPy condition over the values array.

</details>

In [41]:
# Your code here
rev = w.reset_index().groupby(0)['index'].apply(list)
rev.loc[92]
np.flatnonzero(w.to_numpy() > 90)

array([1, 3, 4])

## Missing Values — dtypes and `isna()`

- `np.nan` (float NaN) vs `pd.NA` (scalar for NA-aware dtypes like `'Int64'`, `'boolean'`, `'string'`).
- `s.isna()` / `s.notna()` for detection.
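
The two missing-value markers behave differently in comparisons and in dtype inference. The short sketch below contrasts them; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan_equal = (np.nan == np.nan)  # False: NaN never compares equal to itself
na_compare = (pd.NA == 1)       # pd.NA: the comparison result is unknown

float_s = pd.Series([1, 2, np.nan])               # integers are upcast to float64
int_s = pd.Series([1, 2, pd.NA], dtype='Int64')   # values stay integers

# isna() detects both kinds of missing value.
mask_float = float_s.isna()
mask_int = int_s.isna()
```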


In [43]:
a = pd.Series([1, 2, np.nan, 4])
b = pd.Series([1, pd.NA, 3], dtype='Int64')

In [44]:
a

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

In [45]:
a.isna()

0    False
1    False
2     True
3    False
dtype: bool

In [46]:
b

0       1
1    <NA>
2       3
dtype: Int64

In [47]:
b.dtype

Int64Dtype()

**Exercise (Missing — detect):**
- Create a `Series` mixing integers and `np.nan` and check its `dtype`.
- Create an NA-aware integer `Series` with `pd.NA` and verify the dtype remains `'Int64'`.


<details>
<summary><b>Show solution</b></summary>

```python
import numpy as np, pandas as pd
mix = pd.Series([1, 2, np.nan])
nullable = pd.Series([1, pd.NA, 3], dtype='Int64')
mix.dtype, nullable.dtype
```

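**Explanation.** `np.nan` is a float, so mixing it with integers upcasts the Series to `float64`. With `dtype='Int64'`, `pd.NA` marks the missing entry and the values stay integers.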

</details>

In [52]:
# Your code here
a.dtype
b = pd.Series([1,2,pd.NA], dtype='Int64')
b.dtype

Int64Dtype()

## Missing Values II — Fill or Interpolate

- `fillna(value)`, `ffill()`, `bfill(limit=...)`.
- Time-aware interpolation with `method='time'` requires a `DatetimeIndex`.
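
Time-aware interpolation only differs from the default when timestamps are unevenly spaced. The sketch below uses a made-up Series `gap` with a one-day step followed by a three-day step, so the two methods disagree on the missing value.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
gap = pd.Series([0.0, np.nan, 8.0], index=dates)

by_order = gap.interpolate(method='linear')  # treats points as equally spaced
by_time = gap.interpolate(method='time')     # weights by elapsed time

linear_fill = by_order.iloc[1]  # halfway between 0 and 8
time_fill = by_time.iloc[1]     # one day into a four-day span
```

Here the missing point sits one day into a four-day span, so `method='time'` fills in a quarter of the way between the neighbours, while `method='linear'` fills in the midpoint.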


In [53]:
m = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])
idx = pd.date_range('2024-01-01', periods=5, freq='D')
t = pd.Series([1.0, np.nan, 4.0, np.nan, 9.0], index=idx)

In [54]:
m.fillna(0)

0    1.0
1    0.0
2    0.0
3    4.0
4    0.0
dtype: float64

In [55]:
m.ffill()

0    1.0
1    1.0
2    1.0
3    4.0
4    4.0
dtype: float64

In [56]:
m.bfill(limit=1)

0    1.0
1    NaN
2    4.0
3    4.0
4    NaN
dtype: float64

In [57]:
t.interpolate(method='time')

2024-01-01    1.0
2024-01-02    2.5
2024-01-03    4.0
2024-01-04    6.5
2024-01-05    9.0
Freq: D, dtype: float64

**Exercise (Missing — fill/interpolate):**
- Forward-fill only the **first** `NaN` of each gap in `m` and leave the rest as `NaN` (hint: `limit`).
- Interpolate `t` using `method='index'` and compare briefly to `method='time'`.


<details>
<summary><b>Show solution</b></summary>

```python
import numpy as np, pandas as pd
m = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])
ffirst = m.ffill(limit=1)
idx = pd.date_range('2024-01-01', periods=5, freq='D')
t = pd.Series([1.0, np.nan, 4.0, np.nan, 9.0], index=idx)
interp_index = t.interpolate(method='index')
interp_time = t.interpolate(method='time')
ffirst, interp_index, interp_time
```

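**Explanation.** `limit=1` fills at most one consecutive `NaN` per gap, so the second `NaN` in the first gap stays missing. With evenly spaced daily timestamps, `method='index'` and `method='time'` give identical results.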

</details>

In [59]:
# Your code here
m.ffill(limit=1)
t.interpolate(method='index')  # 'index' interpolates against the index values; 'time' takes the actual elapsed time into account

2024-01-01    1.0
2024-01-02    2.5
2024-01-03    4.0
2024-01-04    6.5
2024-01-05    9.0
Freq: D, dtype: float64

## Missing Values III — Reductions with `NaN`

- Many reductions default to `skipna=True` (`sum`, `mean`, ...).
- Use `skipna=False` to demand complete data.
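
The sketch below contrasts the default with `skipna=False`, and adds `min_count` as one way to avoid getting `0.0` from a sum over nothing but missing values. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

r2 = pd.Series([1.0, np.nan, 3.0])

total_skip = r2.sum()                # NaN is skipped
total_strict = r2.sum(skipna=False)  # any NaN makes the result NaN

all_missing = pd.Series([np.nan, np.nan])
empty_sum = all_missing.sum()                    # nothing left to add
empty_sum_checked = all_missing.sum(min_count=1) # require at least one valid value
```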


In [60]:
r = pd.Series([1.0, np.nan, 3.0, np.nan])
bb = pd.Series([True, pd.NA, False], dtype='boolean')

In [61]:
r.sum()

np.float64(4.0)

In [62]:
r.mean()

np.float64(2.0)

In [63]:
r.mean(skipna=False)

np.float64(nan)

In [64]:
r.count()

np.int64(2)

In [65]:
bb.any()

np.True_

In [66]:
bb.all()

np.False_

**Exercise (Missing — reductions):**
- For `r`, compute `min` and `max` with and without `skipna=False` and comment on the difference.
- For `bb`, verify how `sum()` behaves and why.


<details>
<summary><b>Show solution</b></summary>

```python
import numpy as np, pandas as pd
r  = pd.Series([1.0, np.nan, 3.0, np.nan])
bb = pd.Series([True, pd.NA, False], dtype='boolean')
min_skip, max_skip = r.min(), r.max()
min_nos, max_nos = r.min(skipna=False), r.max(skipna=False)
b_sum = bb.sum()
(min_skip, max_skip, min_nos, max_nos, b_sum)
```

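**Explanation.** By default (`skipna=True`), `min` and `max` ignore `NaN` and return `1.0` and `3.0`; with `skipna=False`, any `NaN` makes the result `NaN`. `bb.sum()` counts `True` as 1 and skips `pd.NA`, so it returns `1`.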

</details>

In [69]:
# Your code here
r.min(skipna=False), r.min(skipna=True), r.max(skipna=False), r.max(skipna=True), bb.sum()

(np.float64(nan),
 np.float64(1.0),
 np.float64(nan),
 np.float64(3.0),
 np.int64(1))

## Missing Values IV — Alignment in Arithmetic

- Binary ops align on **labels**; non-overlapping labels introduce `NaN`.
- Method forms like `.add(other, fill_value=0)` can replace missing entries on the fly.
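
`fill_value` is available on the other arithmetic methods too, and it only substitutes where one side is missing. The sketch below, with made-up Series `p2` and `q2`, shows that a label missing on both sides still yields `NaN`, and that `1` is the neutral fill for multiplication.

```python
import numpy as np
import pandas as pd

p2 = pd.Series({'a': 1.0, 'b': np.nan})
q2 = pd.Series({'b': np.nan, 'c': 100.0})

# 'a' and 'c' are each missing on one side and get filled;
# 'b' is missing on both sides, so it stays NaN.
added = p2.add(q2, fill_value=0)

# For multiplication, 1 is the neutral element to fill with.
multiplied = p2.mul(q2, fill_value=1)
```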


In [70]:
p = pd.Series({'a': 1.0, 'b': 2.0})
q = pd.Series({'b':10.0, 'c':100.0})

In [71]:
p + q

a     NaN
b    12.0
c     NaN
dtype: float64

In [72]:
p.add(q, fill_value=0)

a      1.0
b     12.0
c    100.0
dtype: float64

**Exercise (Alignment):**
- Create two Series with disjoint labels and show that `+` yields all `NaN`.
- Then use `.add(..., fill_value=...)` to make a meaningful result.


<details>
<summary><b>Show solution</b></summary>

```python
import pandas as pd
pp = pd.Series({'x': 1.0})
qq = pd.Series({'y': 2.0})
plain = pp + qq
fixed = pp.add(qq, fill_value=0)
plain, fixed
```

**Explanation.** With disjoint labels, `+` yields all `NaN` after alignment. Using `fill_value=0` substitutes missing with 0 at compute time.

</details>

In [73]:
# Your code here
p.add(q, fill_value=0)

a      1.0
b     12.0
c    100.0
dtype: float64

## Sort, Rank, Top-k, Quantiles

In [74]:
srt = pd.Series([5, 2, 7, 4, 9], index=list('abcde'))

In [75]:
srt.sort_values(ascending=False)

e    9
c    7
a    5
d    4
b    2
dtype: int64

In [76]:
srt.sort_index()

a    5
b    2
c    7
d    4
e    9
dtype: int64

In [77]:
srt.rank(method='dense', ascending=False)

a    3.0
b    5.0
c    2.0
d    4.0
e    1.0
dtype: float64

In [78]:
srt.nlargest(2)

e    9
c    7
dtype: int64

In [79]:
srt.nsmallest(2)

b    2
d    4
dtype: int64

In [80]:
srt.quantile([0.25, 0.50, 0.75])

0.25    4.0
0.50    5.0
0.75    7.0
dtype: float64

**Exercise (Sort/Rank):**
- Get the labels of the **top-3** values in descending order.
- Compute the 10th and 90th percentiles.


<details>
<summary><b>Show solution</b></summary>

```python
import pandas as pd
srt = pd.Series([5, 2, 7, 4, 9], index=list('abcde'))
top3_labels = srt.nlargest(3).index.tolist()
q1090 = srt.quantile([0.10, 0.90])
top3_labels, q1090
```

**Explanation.** `nlargest` returns labels in value-descending order; `quantile` uses linear interpolation between neighbouring values by default.

</details>

In [99]:
# Your code here
# srt.rank(method='dense', ascending=False).sort_values()[:3].index
srt.nlargest(3).index.tolist(), srt.quantile([0.1,0.9])

(['e', 'c', 'a'],
 0.1    2.8
 0.9    8.2
 dtype: float64)

## Cumulative, Difference, Lag, Return

In [100]:
chg = pd.Series([100, 110, 121], index=['t0','t1','t2'])

In [101]:
chg.cumsum()

t0    100
t1    210
t2    331
dtype: int64

In [102]:
chg.shift(1)

t0      NaN
t1    100.0
t2    110.0
dtype: float64

In [103]:
chg.diff(1)

t0     NaN
t1    10.0
t2    11.0
dtype: float64

In [104]:
chg.pct_change(1)

t0    NaN
t1    0.1
t2    0.1
dtype: float64

In [105]:
chg.diff(2)

t0     NaN
t1     NaN
t2    21.0
dtype: float64

In [106]:
chg.shift(-1)

t0    110.0
t1    121.0
t2      NaN
dtype: float64

**Exercise (Cumulative/Change):**
- Compute a 2-step percentage change: `chg / chg.shift(2) - 1`.
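- Remove the leading `NaN` from `diff`, either by filling it afterwards (`fillna`) or by rebuilding the difference from `shift(fill_value=...)`, and note how this changes the meaning of the first entry.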


<details>
<summary><b>Show solution</b></summary>

```python
import pandas as pd
chg = pd.Series([100, 110, 121], index=['t0','t1','t2'])
pct2 = chg / chg.shift(2) - 1
diff_no_nan = chg.diff().fillna(0)
pct2, diff_no_nan
```

**Explanation.** The first two entries of the 2-step percent change are undefined and thus `NaN`. Filling after `diff` removes edge `NaN` but changes semantics; call this out when teaching.

</details>

In [117]:
# Your code here
chg / chg.shift(2) - 1  # 2-step percentage change (not displayed; only the last expression is shown)
chg.diff(1).fillna(0)

t0     0.0
t1    10.0
t2    11.0
dtype: float64