

## Selecting Subsets of Data from DataFrames with `loc`

>In this chapter, we use the `loc` indexer to select subsets of data from DataFrames. The `loc` indexer selects data differently from *just the brackets* and has its own set of rules that we must learn.

In [1]:
import pandas as pd
ps = pd.Series(['Ganga', 'Yamuna', 'Gomti', 'Koshi','Godavari','Kaveri'], index = ['a','b','c','d','e','f'])
ps

a       Ganga
b      Yamuna
c       Gomti
d       Koshi
e    Godavari
f      Kaveri
dtype: object

In [2]:
for i in ps: print(i)             # iterate over the values only
for i in ps.items(): print(i)     # iterate over (index label, value) pairs

Ganga
Yamuna
Gomti
Koshi
Godavari
Kaveri
('a', 'Ganga')
('b', 'Yamuna')
('c', 'Gomti')
('d', 'Koshi')
('e', 'Godavari')
('f', 'Kaveri')


In [3]:
ps.loc['d']
ps.loc['c':'f']
ps.loc['a':'f':2]  # a c e

a       Ganga
c       Gomti
e    Godavari
dtype: object

## `loc` with slice notation

Let's review Python's slice notation, which selects subsets from core Python objects such as lists, tuples, and strings. Slice notation has three components: the **start**, **stop**, and **step**, separated by colons as `start:stop:step`. Every component is optional and takes a default value when omitted: the start defaults to the beginning, the stop to the end, and the step to 1.
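As a minimal sketch of these defaults, the cell below applies slice notation first to a plain Python list and then to a small Series through `loc`. The `letters` list and the Series `s` are made up for illustration. One difference to keep in mind: a label slice with `loc` includes the stop label, whereas a list slice excludes the stop position.

```python
import pandas as pd

letters = ['a', 'b', 'c', 'd', 'e', 'f']  # hypothetical example data

first_three = letters[:3]    # start omitted: begins at the first element
from_third = letters[2:]     # stop omitted: runs to the end
every_other = letters[::2]   # start and stop omitted, step of 2

# The same notation with labels inside loc; the stop label 'd' is included
s = pd.Series(range(6), index=letters)
b_to_d = s.loc['b':'d']
```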

In [4]:
import pandas as pd
df = pd.read_csv('input/sample_data.csv', index_col=0)
df


In [None]:
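df.loc['Niko']           # the row labeled 'Niko', as a Series
df.loc['Niko', :]        # same row, with an explicit all-columns slice
df.loc['Niko', 'state']  # a single value: row 'Niko', column 'state'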

In [None]:
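df.loc[['Niko']]             # a list keeps the result as a one-row DataFrame
df.loc[['Niko'], :]          # same, with an explicit all-columns slice
df.loc[['Niko'], ['state']]  # lists on both axes: a one-by-one DataFrame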

In [None]:
df.loc[['Dean', 'Aaron']]                             # two rows, all columns
df.loc[['Dean', 'Aaron'], 'food']                     # two rows of one column, as a Series
df.loc[['Dean', 'Aaron'], ['age', 'state', 'score']]  # two rows, three columns

In [None]:
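df.loc['Niko':'Dean']                      # rows from 'Niko' through 'Dean' (stop label included)
df.loc['Niko':'Dean':2]                    # every second row in that range
df.loc['Niko':'Dean', ['state', 'color']]  # same rows, two columns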

In [None]:
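df.loc[:, 'height']                  # all rows, one column, as a Series
df.loc[:'Dean', 'height':]           # rows up to and including 'Dean', columns from 'height' onward
df.loc[['Penelope', 'Cornelia'], :]  # two rows by label, all columns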

## Summary of the `loc` indexer

* Primarily uses labels
* Selects rows and columns simultaneously with `df.loc[rows, cols]`
* Both row and column selections can be a:
    * single label
    * list of labels
    * slice of labels
    * boolean Series
* A comma separates row and column selections
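The four selection types in this summary can be sketched on a small stand-in DataFrame. The row labels and column names echo the ones used in this chapter, but the values are invented here because the contents of `sample_data.csv` are not shown.

```python
import pandas as pd

# Hypothetical stand-in data; the values are invented for illustration
df = pd.DataFrame(
    {'state': ['TX', 'AK', 'FL'], 'age': [2, 32, 12], 'score': [8.3, 1.8, 9.0]},
    index=['Niko', 'Dean', 'Aaron'],
)

single = df.loc['Niko', 'state']                 # single labels: a scalar
by_list = df.loc[['Dean', 'Aaron'], ['age']]     # lists of labels: a DataFrame
by_slice = df.loc['Niko':'Dean', 'state':'age']  # label slices include the stop
by_mask = df.loc[df['age'] > 10, 'score']        # boolean Series selects rows
```

Each selection places the row part before the comma and the column part after it.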

In [None]:
import pandas as pd
bikes = pd.read_csv('input/bikes.csv')
bikes.head()
#bikes.tail()

## Changing display options

pandas lets you change how output is displayed on your screen. For instance, the default maximum number of columns displayed for a DataFrame is 20, meaning that if your DataFrame has more than 20 columns, only the first and last 10 columns are shown and the columns in between are hidden. This is problematic as many DataFrames have more than 20 columns.

### Get current option value with `get_option`

There are a few dozen display options you can control to change the visual representation of your DataFrame. It is not necessary to remember the option names as the official documentation provides descriptions for all [available options][1]. 

Let's first learn how to retrieve each option value with the `get_option` function. This is not a DataFrame method, but instead, a function that is accessed directly from `pd`.  Below are three of the most common options to change.

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html

### Use the `set_option` function to change an option value

To change an option's value, use the `set_option` function. You can set as many options as you would like at one time. Its usage is a bit strange: pass it the option name as a string and follow it immediately with the value you want to set it to. Continue this pattern of option name followed by new value to set as many options as you desire. Below, we set the maximum number of columns to 100 and the maximum number of rows to 4.

In [None]:
pd.get_option('display.max_columns')
pd.get_option('display.max_rows')
pd.get_option('display.max_colwidth')
pd.set_option('display.max_columns', 100, 'display.max_rows', 4)
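Because option changes persist for the rest of the session, it can help to read the value back and restore it afterwards. The sketch below assumes the session starts with default settings and uses `pd.reset_option`, which the text above does not cover, to return each option to its default.

```python
import pandas as pd

original_rows = pd.get_option('display.max_rows')  # remember the current value

# Set two options in one call: name, value, name, value
pd.set_option('display.max_columns', 100, 'display.max_rows', 4)
rows_after_set = pd.get_option('display.max_rows')

# reset_option restores the library default for a single option
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
rows_after_reset = pd.get_option('display.max_rows')
```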

# Selecting Subsets of Data from DataFrames with `iloc`

The `iloc` indexer is very similar to the `loc` indexer but only uses **integer location** to make its subset selections. The word `iloc` itself stands for integer location and can help remind you what it does.

## Simultaneous row and column subset selection

The `iloc` indexer is capable of making simultaneous row and column selections just like `loc`. Selection with `iloc` takes the following form, with a comma separating the row and column selections.

```python
df.iloc[rows, cols]
```

Let's read in some sample data and then begin making selections with integer location using `iloc`.

### What is integer location?

Integer location is the term used to reference a row or column by its position. The first row/column is referenced by the integer 0, and each subsequent row/column by the next integer. The last row/column is referenced by `n - 1`, where `n` is the number of rows/columns.
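To make the `0` to `n - 1` numbering concrete, here is a small sketch on a made-up DataFrame (the labels and values are invented for illustration). Note that the row labels play no part in the selection.

```python
import pandas as pd

# Hypothetical data; labels and values are invented for illustration
df = pd.DataFrame(
    {'age': [2, 32, 12], 'score': [8.3, 1.8, 9.0]},
    index=['Niko', 'Dean', 'Aaron'],
)

n_rows, n_cols = df.shape
first_row = df.iloc[0]              # integer location 0 is the first row
last_row = df.iloc[n_rows - 1]      # n - 1 is the last row
last_col = df.iloc[:, n_cols - 1]   # the same rule applies to columns
```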

### Select using a list for both rows and columns

Let's select rows with integer location 2 and 4 along with the first and last columns. It is possible to use negative integers in the same manner as Python lists. The integer location -1 refers to the last column below.

In [None]:
import pandas as pd
df = pd.read_csv('input/sample_data.csv', index_col=0)
df

In [None]:
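df.iloc[5]      # the row at integer location 5, as a Series
df.iloc[2:]     # rows from integer location 2 to the end
df.iloc[::2]    # every second row
df.iloc[3, 2]   # the single value at row 3, column 2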

In [None]:
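df.iloc[[3], [2]]          # lists on both axes: a one-by-one DataFrame
df.iloc[[2, 3, 5], 4]      # three rows of column 4, as a Series
df.iloc[[2, 3, 5], [4]]    # same selection, kept as a DataFrame
df.iloc[[2, 4], [0, -1]]   # rows 2 and 4, first and last columns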

In [None]:
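df.iloc[2, :]         # row 2, all columns, as a Series
df.iloc[:, [2, 4]]    # all rows, columns 2 and 4
df.iloc[2:4, [4, 2]]  # rows 2 and 3 (stop excluded), columns in the order given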

In [None]:
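df.iloc[[2], :]            # row 2 as a one-row DataFrame
df.iloc[[5, 2, 4], 3:]     # rows in the order given, columns from 3 onward
df.iloc[[-3, -1, -2], :]   # negative integers count from the end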

## Summary of `iloc`

The `iloc` indexer is analogous to `loc` but only uses **integer location** for selection. The official pandas documentation refers to it as selection by **position**.

* Uses only integer location
* Selects rows and columns simultaneously with `df.iloc[rows, cols]`
* Selection can be a:
    * single integer
    * list of integers
    * slice of integers
* A comma separates row and column selections
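The three selection types can be sketched on a small stand-in DataFrame (labels and values are invented for illustration). Unlike label slices with `loc`, integer slices with `iloc` exclude the stop position, as Python lists do.

```python
import pandas as pd

# Hypothetical stand-in data; the values are invented for illustration
df = pd.DataFrame(
    {'state': ['TX', 'AK', 'FL', 'NY'],
     'age': [2, 32, 12, 51],
     'score': [8.3, 1.8, 9.0, 4.5]},
    index=['Niko', 'Dean', 'Aaron', 'Penelope'],
)

single = df.iloc[3, 1]               # single integers: the value at row 3, column 1
by_list = df.iloc[[2, 0], [0, -1]]   # lists of integers; -1 is the last column
by_slice = df.iloc[1:3, :2]          # integer slices exclude the stop position
```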

## Exercises

Create the test score DataFrame by executing the cell below and use it for the following exercises.

In [None]:
import pandas as pd
test_dict = {'Corey':[63,75,88], 'Kevin':[48,98,92], 'Akshay': [87, 86, 85]}
df = pd.DataFrame(test_dict)
df

You can inspect the DataFrame. First, each dictionary key is listed as a column. Second, the rows are labeled with indices starting with 0 by default. Third, the visual layout is clear and legible.
Each column and row of a DataFrame is represented as a Series, which is a one-dimensional labeled array. Note that one-dimensional data can be held in either a Series or a NumPy array; however, they are two distinct data types and are not interchangeable.
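A quick way to see the distinction is to pull a column and a row out of the DataFrame and compare their types. This sketch rebuilds the same `test_dict` data; `to_numpy()` is used here to obtain the underlying array without the index labels.

```python
import numpy as np
import pandas as pd

test_dict = {'Corey': [63, 75, 88], 'Kevin': [48, 98, 92], 'Akshay': [87, 86, 85]}
df = pd.DataFrame(test_dict)

column = df['Corey']          # a column is a Series
row = df.loc[0]               # a row is also a Series, indexed by column name
values = column.to_numpy()    # the underlying data as a NumPy array, without labels
```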

In [None]:
df = df.T                                    # Transpose DataFrame
df.columns = ['Quiz_1', 'Quiz_2', 'Quiz_3']  # Rename Columns
df

In [None]:
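df.iloc[0]               # first row (Corey) as a Series
df.iloc[0, :]            # same row with an explicit column slice
df.iloc[0:2, 1:3]        # first two rows, Quiz_2 and Quiz_3
df.iloc[[0, 1], [1, 2]]  # same selection using lists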

In [None]:
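# Access the first column by name using attribute access
df.Quiz_1
df.Quiz_1.astype(float)  # cast the integer scores to float

# Bracket access returns the same column
df['Quiz_1']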

In [None]:
df.loc[['Corey', 'Kevin'], ['Quiz_2', 'Quiz_3']]

In [None]:
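# Add a fourth quiz, then define new columns as the row-wise mean of the quiz columns
df['Quiz_4'] = [92, 95, 88]
quiz_cols = ['Quiz_1', 'Quiz_2', 'Quiz_3', 'Quiz_4']
df['Quiz_Avg1'] = df[quiz_cols].mean(axis=1)
df['Quiz_Avg2'] = df[quiz_cols].mean(axis=1, skipna=True)  # skipna=True is the default
df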

## Concatenating and Finding the Mean with Null Values for Our Test Score Data

In [None]:
import numpy as np
# Create new DataFrame of one row
df_new = pd.DataFrame({'Quiz_1': [np.nan],
                       'Quiz_2': [np.nan],
                       'Quiz_3': [np.nan],
                       'Quiz_4': [71]}, index=['Adrian'])
df_new

In [None]:
# Concatenate the DataFrame with the new row, Adrian, and display the result
df = pd.concat([df, df_new])
df
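With Adrian's row added, the mean now has to cope with null values. The sketch below rebuilds the four quiz columns from the cells above (leaving out the average columns) and compares the default `skipna=True` with `skipna=False`. The names `scores`, `adrian`, `avg_skip`, and `avg_strict` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_2': [75, 98, 86],
     'Quiz_3': [88, 92, 85], 'Quiz_4': [92, 95, 88]},
    index=['Corey', 'Kevin', 'Akshay'],
)
adrian = pd.DataFrame(
    {'Quiz_1': [np.nan], 'Quiz_2': [np.nan], 'Quiz_3': [np.nan], 'Quiz_4': [71]},
    index=['Adrian'],
)
scores = pd.concat([scores, adrian])

avg_skip = scores.mean(axis=1)                   # default skipna=True ignores NaN
avg_strict = scores.mean(axis=1, skipna=False)   # any NaN in a row makes its mean NaN
```

With the default, Adrian's average is based on the single quiz he took; with `skipna=False`, it is reported as missing.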