## Selecting Subsets of Data from DataFrames with `loc`

>In this chapter, we use the `loc` indexer to select subsets of data from DataFrames. The `loc` indexer selects data differently from *just the brackets* and has its own set of rules that we must learn.
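To make the contrast with *just the brackets* concrete, here is a minimal sketch. The three-row frame is a hypothetical stand-in built inline for illustration; it is not the dataset loaded later in the chapter.

```python
import pandas as pd

# Hypothetical three-row frame, small enough to check by eye.
df = pd.DataFrame(
    {"state": ["NY", "TX", "FL"], "age": [30, 2, 12]},
    index=["Jane", "Niko", "Aaron"],
)

by_brackets = df["state"]  # brackets with a single label select a column
by_loc = df.loc["Niko"]    # loc with a single label selects a row

print(by_brackets.tolist())  # ['NY', 'TX', 'FL']
print(by_loc["state"])       # TX
```

Both results are Series, but the first is indexed by the row labels and the second by the column labels.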

In [1]:
import pandas as pd
ps = pd.Series(['Ganga', 'Yamuna', 'Gomti', 'Koshi','Godavari','Kaveri'], index = ['a','b','c','d','e','f'])
ps

a       Ganga
b      Yamuna
c       Gomti
d       Koshi
e    Godavari
f      Kaveri
dtype: object

In [2]:
for i in ps: print(i)
for i in ps.items(): print(i)  # items() replaces iteritems(), which was removed in pandas 2.0

Ganga
Yamuna
Gomti
Koshi
Godavari
Kaveri
('a', 'Ganga')
('b', 'Yamuna')
('c', 'Gomti')
('d', 'Koshi')
('e', 'Godavari')
('f', 'Kaveri')


In [3]:
ps.loc['d']
ps.loc['c':'f']
ps.loc['a':'f':2]  # a c e

a       Ganga
c       Gomti
e    Godavari
dtype: object

## `loc` with slice notation

Review Python's slice notation, which selects subsets from core Python objects such as lists, tuples, and strings. Slice notation has three components: the **start**, **stop**, and **step**, separated by colons as `start:stop:step`. Every component is optional and has a default when omitted: start defaults to the beginning, stop to the end, and step to 1. Unlike a positional Python slice, a label slice passed to `loc` includes the stop label.
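The sketch below compares a plain Python slice with a `loc` label slice. It reuses the river Series from the first cell; the variable names `letters` and `rivers` are chosen here for illustration.

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e", "f"]
rivers = pd.Series(
    ["Ganga", "Yamuna", "Gomti", "Koshi", "Godavari", "Kaveri"], index=letters
)

# Plain Python slicing works on positions and excludes the stop position.
print(letters[1:4])  # ['b', 'c', 'd']
print(letters[::2])  # ['a', 'c', 'e']

# loc slicing works on labels and includes the stop label.
print(rivers.loc["b":"d"].tolist())     # ['Yamuna', 'Gomti', 'Koshi']
print(rivers.loc[::2].index.tolist())   # ['a', 'c', 'e']
```

The defaults behave the same way in both cases: an omitted start or stop means "from the beginning" or "to the end", and an omitted step means 1.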

In [4]:
import pandas as pd
df = pd.read_csv('input/pd-loc.csv', index_col=0)
df

           state  color    food  age  height  score
name
Jane          NY   blue   Steak   30     165    4.6
Niko          TX  green    Lamb    2      70    8.3
Aaron         FL    red   Mango   12     120    9.0
Penelope      AL  white   Apple    4      80    3.3
Dean          AK   gray  Cheese   32     180    1.8
Christina     TX  black   Melon   33     172    9.5
Cornelia      TX    red   Beans   69     150    2.2


In [5]:
df.loc['Niko']
df.loc['Niko', :]
df.loc['Niko', 'state']

'TX'

In [6]:
df.loc[['Niko']]
df.loc[['Niko'], :]
df.loc[['Niko'], ['state']]

      state
name
Niko     TX


In [7]:
df.loc[['Dean', 'Aaron']]
df.loc[['Dean', 'Aaron'], 'food']
df.loc[['Dean', 'Aaron'], ['age', 'state', 'score']]

       age  state  score
name
Dean    32     AK    1.8
Aaron   12     FL    9.0


In [8]:
df.loc['Niko':'Dean']
df.loc['Niko':'Dean':2]
df.loc['Niko':'Dean', ['state', 'color']]

          state  color
name
Niko         TX  green
Aaron        FL    red
Penelope     AL  white
Dean         AK   gray


In [9]:
df.loc[:, 'height']
df.loc[:'Dean', 'height':]
df.loc[['Penelope','Cornelia'], :]

          state  color   food  age  height  score
name
Penelope     AL  white  Apple    4      80    3.3
Cornelia     TX    red  Beans   69     150    2.2


## Summary of the `loc` indexer

* Primarily uses labels
* Selects rows and columns simultaneously with `df.loc[rows, cols]`
* Both row and column selections can be a:
    * single label
    * list of labels
    * slice of labels
    * boolean Series
* A comma separates row and column selections
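The summary lists a boolean Series as a valid selection, but none of the cells above show one. The sketch below rebuilds part of the table inline. This is an assumption made so the example runs without `input/pd-loc.csv`, and only three of the columns are reproduced.

```python
import pandas as pd

# Assumption: the frame is rebuilt inline from the table shown earlier,
# instead of being read from input/pd-loc.csv.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=pd.Index(
        ["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
        name="name",
    ),
)

over_30 = df["age"] > 30                    # boolean Series aligned on the index
older = df.loc[over_30, ["state", "age"]]   # boolean rows, list of column labels
print(older.index.tolist())                 # ['Dean', 'Christina', 'Cornelia']
```

The boolean Series keeps the rows where it is `True`, and the column selection works exactly as in the label-based cells above.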

## Selecting Subsets of Data from DataFrames with `iloc`

The `iloc` indexer is very similar to the `loc` indexer but only uses **integer location** to make its subset selections. The word `iloc` itself stands for integer location and can help remind you what it does.

## Simultaneous row and column subset selection

The `iloc` indexer is capable of making simultaneous row and column selections just like `loc`. Selection with `iloc` takes the following form, with a comma separating the row and column selections.

```python
df.iloc[rows, cols]
```

In [10]:
import pandas as pd
df = pd.read_csv('input/pd-loc.csv', index_col=0)
df

           state  color    food  age  height  score
name
Jane          NY   blue   Steak   30     165    4.6
Niko          TX  green    Lamb    2      70    8.3
Aaron         FL    red   Mango   12     120    9.0
Penelope      AL  white   Apple    4      80    3.3
Dean          AK   gray  Cheese   32     180    1.8
Christina     TX  black   Melon   33     172    9.5
Cornelia      TX    red   Beans   69     150    2.2


In [11]:
df.iloc[5]
df.iloc[2:]
df.iloc[::2]
df.iloc[3, 2]

'Apple'

In [12]:
df.iloc[[3], [2]]
df.iloc[[2, 3, 5], 4]
df.iloc[[2, 3, 5], [4]]
df.iloc[[2, 4], [0, -1]]

       state  score
name
Aaron     FL    9.0
Dean      AK    1.8


In [13]:
df.iloc[2, :]
df.iloc[:, [2, 4]]
df.iloc[2:4, [4, 2]]

          height   food
name
Aaron        120  Mango
Penelope      80  Apple


In [14]:
df.iloc[[2], :]
df.iloc[[5, 2, 4], 3:]
df.iloc[[-3, -1, -2], :]

           state  color    food  age  height  score
name
Dean          AK   gray  Cheese   32     180    1.8
Cornelia      TX    red   Beans   69     150    2.2
Christina     TX  black   Melon   33     172    9.5


## Summary of `iloc`

The `iloc` indexer is analogous to `loc` but only uses **integer location** for selection. The official pandas documentation refers to it as selection by **position**.

* Uses only integer location
* Selects rows and columns simultaneously with `df.iloc[rows, cols]`
* Selection can be a:
    * single integer
    * list of integers
    * slice of integers
* A comma separates row and column selections
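One practical difference between the two indexers is how slices end. The sketch below uses a small hypothetical frame built inline to show the same rows selected by position and by label.

```python
import pandas as pd

# Hypothetical four-row frame for illustration.
df = pd.DataFrame(
    {"state": ["NY", "TX", "FL", "AL"], "age": [30, 2, 12, 4]},
    index=["Jane", "Niko", "Aaron", "Penelope"],
)

by_position = df.iloc[1:3, [0]]               # positions 1 and 2; the stop position is excluded
by_label = df.loc["Niko":"Aaron", ["state"]]  # the stop label is included

print(by_position.equals(by_label))  # True
print(df.iloc[-1, -1])               # 4 (negative integers count from the end)
```

With `iloc`, the slice `1:3` follows Python's positional rules, so it stops before position 3. The equivalent `loc` slice names the last row it wants.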

## Exercises

Create the test score DataFrame by executing the cell below and use it for the following exercises.

In [15]:
import pandas as pd
test_dict = {'Corey':[63,75,88], 'Kevin':[48,98,92], 'Akshay': [87, 86, 85]}
df = pd.DataFrame(test_dict)
df

   Corey  Kevin  Akshay
0     63     48      87
1     75     98      86
2     88     92      85


Inspecting the DataFrame shows three things. First, each dictionary key becomes a column. Second, the rows are labeled by default with integer indices starting at 0. Third, the tabular layout is easy to read.
Each column and each row of a DataFrame is represented as a Series, which is a one-dimensional labeled array. Note that one-dimensional data can be held either in a Series or in a NumPy array; these are two distinct data types and are not interchangeable, although one can be converted to the other.
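As a quick check of the Series and NumPy array distinction, the sketch below rebuilds two of the columns inline. The variable names `column`, `row`, and `values` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Corey": [63, 75, 88], "Kevin": [48, 98, 92]})

column = df["Corey"]        # a column is a Series
row = df.iloc[0]            # a row is also a Series
values = column.to_numpy()  # explicit conversion to a NumPy array

print(type(column).__name__, type(row).__name__, type(values).__name__)
# Series Series ndarray
print(column.index.tolist())  # [0, 1, 2]
```

The Series carries its index labels with it, while the NumPy array holds only the values, so the conversion has to be requested explicitly.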

In [16]:
df = df.T                                    # Transpose DataFrame
df.columns = ['Quiz_1', 'Quiz_2', 'Quiz_3']  # Rename Columns
df

        Quiz_1  Quiz_2  Quiz_3
Corey       63      75      88
Kevin       48      98      92
Akshay      87      86      85


In [17]:
df.iloc[0]   
df.iloc[0,:]
df.iloc[0:2, 1:3]
df.iloc[[0,1], [1,2]]

       Quiz_2  Quiz_3
Corey      75      88
Kevin      98      92


In [18]:
# Access first column by name
df.Quiz_1
df['Quiz_1']

Corey     63
Kevin     48
Akshay    87
Name: Quiz_1, dtype: int64

In [19]:
df.loc[['Corey', 'Kevin'], ['Quiz_2', 'Quiz_3']]

       Quiz_2  Quiz_3
Corey      75      88
Kevin      98      92


In [20]:
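# Define new columns as the mean of the quiz columns
df['Quiz_4'] = [92, 95, 88]
quiz_cols = ['Quiz_1', 'Quiz_2', 'Quiz_3', 'Quiz_4']  # keep Quiz_Avg1 out of the second mean
df['Quiz_Avg1'] = df[quiz_cols].mean(axis=1)
df['Quiz_Avg2'] = df[quiz_cols].mean(axis=1, skipna=True)  # skipna=True is the default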
df

        Quiz_1  Quiz_2  Quiz_3  Quiz_4  Quiz_Avg1  Quiz_Avg2
Corey       63      75      88      92       79.5       79.5
Kevin       48      98      92      95      83.25      83.25
Akshay      87      86      85      88       86.5       86.5


## Concatenating and Finding the Mean with Null Values for Our Test Score Data

In [21]:
import numpy as np
# Create new DataFrame of one row
df_new = pd.DataFrame({'Quiz_1':[np.nan], 'Quiz_2':[np.nan], 
                       'Quiz_3': [np.nan],'Quiz_4':[71]}, index=['Adrian'])  # np.nan: the np.NaN alias was removed in NumPy 2.0
df = pd.concat([df, df_new])
df

        Quiz_1  Quiz_2  Quiz_3  Quiz_4  Quiz_Avg1  Quiz_Avg2
Corey     63.0    75.0    88.0      92       79.5       79.5
Kevin     48.0    98.0    92.0      95      83.25      83.25
Akshay    87.0    86.0    85.0      88       86.5       86.5
Adrian     NaN     NaN     NaN      71        NaN        NaN
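Adrian's average columns are empty in the output above because the averages were computed before his row was added. The sketch below shows how a recomputed row mean would treat his missing quizzes, using a hypothetical two-row frame named `scores` that holds only the quiz columns.

```python
import numpy as np
import pandas as pd

quiz_cols = ["Quiz_1", "Quiz_2", "Quiz_3", "Quiz_4"]
scores = pd.DataFrame(
    [[63, 75, 88, 92], [np.nan, np.nan, np.nan, 71]],
    index=["Corey", "Adrian"],
    columns=quiz_cols,
)

skipping = scores[quiz_cols].mean(axis=1)              # skipna=True is the default
strict = scores[quiz_cols].mean(axis=1, skipna=False)  # any NaN makes the result NaN

print(skipping["Adrian"])         # 71.0
print(pd.isna(strict["Adrian"]))  # True
print(strict["Corey"])            # 79.5
```

With the default `skipna=True`, Adrian's mean is based on his single recorded quiz. Passing `skipna=False` instead flags any row with a missing score.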
