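Selecting the data you need is a key step in any data analysis. In Excel, you mainly select data by clicking or dragging with the mouse.

In pandas, you can select a subset of interest by column name, by positional index, or by condition, and then filter, slice, analyze, or join it.

First, import pandas and take a look at the dataset: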

In [1]:
# Load the data from a public GitHub repository

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv')
# Show the first 5 rows
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Dataset overview:

::: {.callout-note}

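### Variables

This tutorial uses the Titanic dataset, stored as a CSV file.

PassengerId: the Id of each passenger.

Survived: whether the passenger survived. 0 means no and 1 means yes.

Pclass: one of three ticket classes: Class 1, Class 2, and Class 3.

Name: the passenger's name.

Sex: the passenger's sex.

Age: the passenger's age.

SibSp: the number of siblings or spouses aboard.

Parch: the number of parents or children aboard.

Ticket: the passenger's ticket number.

Fare: the fare paid.

Cabin: the passenger's cabin number.

Embarked: the port of embarkation.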

:::

In [9]:
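# Shape of the dataset: (891 rows, 12 columns)
df.shape
# Row index (row labels): RangeIndex(start=0, stop=891, step=1), i.e. start, stop (exclusive), step
df.index
# Column names (column labels)
df.columns
# Summary statistics for the numeric columns (non-null count, mean, standard deviation, min/max, quartiles)
df.describe()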

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
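
In a notebook cell only the last expression is displayed, which is why only the `describe()` table appears above. The sketch below prints each attribute explicitly. It uses a small hypothetical DataFrame named `toy` (not the Titanic file) so that it runs without a network connection.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

shape = toy.shape              # (number of rows, number of columns)
index = toy.index              # default RangeIndex starting at 0
columns = list(toy.columns)    # column labels
summary = toy.describe()       # by default only the numeric column "a" is summarized

print(shape)
print(index)
print(columns)
print(summary)
```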


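## Selecting rows

Method one: select rows by position with a slice. The general pattern is `start:stop`, where the row at `stop` is excluded. You can omit `start` when it is 0 and `stop` when it equals the number of rows.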

In [13]:
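# Select the first 3 rows (positions 0, 1, 2); the stop position 3 is excluded.
df[:3]  # equivalent to df[0:3]
# Select the 5th row (position 4) through the 8th row (position 7)
df[4:8]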

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


::: {.callout-warning}

Note that the first row of a DataFrame is at position 0, so the slice position of a row is its ordinal number minus 1.

:::
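
The half-open `start:stop` rule can be checked on a small example. This is a minimal sketch on a hypothetical five-row DataFrame named `toy`, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "age": [22, 38, 26, 35, 35]})

first_three = toy[:3]   # positions 0, 1, 2 (stop position 3 is excluded)
middle = toy[1:4]       # 2nd through 4th row: positions 1, 2, 3
last_two = toy[-2:]     # negative positions count from the end

print(first_three)
print(middle)
print(last_two)
```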


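### Selecting rows by position with .iloc

`.iloc` selects rows by integer position. Here the positions match the numbers in the leftmost column of the table, because the DataFrame uses the default index.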

In [15]:
df.iloc[[1, 4]]
# Equivalent to df.iloc[[1, 4], :]; the column part can be omitted when selecting rows.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Selecting rows by label with .loc

`.loc` selects rows by label. No row labels were set here, so the labels default to the row positions.

In [16]:
df.loc[[1, 4], :]
# Equivalent to df.loc[[1, 4]]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
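
With the default index, `.loc` and `.iloc` return the same rows, which can hide the difference between them. The sketch below uses a hypothetical DataFrame `toy` whose row labels (10, 20, 30) differ from the positions (0, 1, 2), so the two accessors need different arguments to reach the same rows.

```python
import pandas as pd

# Hypothetical toy data with custom row labels 10, 20, 30
toy = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"]}, index=[10, 20, 30])

first_by_position = toy.iloc[0]    # the row at position 0
first_by_label = toy.loc[10]       # the row whose label is 10

rows_by_position = toy.iloc[[0, 2]]  # positions 0 and 2
rows_by_label = toy.loc[[10, 30]]    # labels 10 and 30, the same two rows

print(rows_by_position)
print(rows_by_label)
```

In this example `toy.loc[0]` would raise a `KeyError`, because no row carries the label 0.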


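## Selecting columns

You can select one or more columns of a DataFrame by column position or by column name.

Suppose I want each passenger's name (a single column). Here `df` is the DataFrame and `Name` is the column name.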


In [19]:
# Select a single column directly by name:
df['Name']
# Attribute access returns the same column
df.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [20]:
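# Select a column with .loc or .iloc: rows go before the comma, columns after it
# A bare colon selects all rows; the column is Name:
df.loc[:, 'Name']
# Use .iloc to select the 4th column (position 3) by position
df.iloc[:, 3]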

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

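To select several columns, for example each passenger's survival status (`Name` and `Survived`), wrap the column names in an additional pair of square brackets. The result is still a DataFrame.

The code examples below use `.head()` to show the first 5 rows.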

In [30]:
# Method 1:
df[['Name','Survived']].head()

# Method 2:
# df.loc[:, ['column_name1', 'column_name2', ...]]
df.loc[:, ['Name','Survived']].head()
# df.iloc[:, [column_index1, column_index2]]
df.iloc[:, [3, 1]].head()

Unnamed: 0,Name,Survived
0,"Braund, Mr. Owen Harris",0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
2,"Heikkinen, Miss. Laina",1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
4,"Allen, Mr. William Henry",0


::: {.callout-tip}
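The inner square brackets define a [Python list of column names](https://docs.python.org/3/tutorial/datastructures.html#tut-morelists), while the outer square brackets select data from the DataFrame.

**With `.loc` and `.iloc`, you can select rows without naming any columns, but to select columns you must write `:, ` first to specify the rows.**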

:::
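
The number of brackets also decides the type of the result: a single column name returns a Series, while a list of names returns a DataFrame, even when the list holds only one name. A minimal sketch on a hypothetical DataFrame `toy`:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Name": ["Ann", "Bob"],
                    "Survived": [1, 0],
                    "Age": [30.0, 41.0]})

single = toy["Name"]                # one label: a Series
single_df = toy[["Name"]]           # list with one label: a one-column DataFrame
pair = toy[["Name", "Survived"]]    # list with two labels: a two-column DataFrame

print(type(single).__name__)
print(type(single_df).__name__)
print(pair.shape)
```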


Besides a list, you can use a label slice to select a run of adjacent columns, for example the passengers' basic information (Name, Sex, Age):

In [33]:
# Select several adjacent columns with a label slice
df.loc[:, "Name":"Age"].head()

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0
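
Note that the `Age` column is part of the result: a label slice with `.loc` includes its end label, whereas a position slice with `.iloc` excludes its stop position. The sketch below compares the two on a hypothetical DataFrame `toy` whose column names are chosen to resemble the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Name": ["Ann", "Bob"],
                    "Sex": ["female", "male"],
                    "Age": [30.0, 41.0],
                    "Fare": [7.25, 53.1]})

by_label = toy.loc[:, "Name":"Age"]   # end label "Age" is included
by_position = toy.iloc[:, 0:3]        # stop position 3 ("Fare") is excluded

print(list(by_label.columns))
print(list(by_position.columns))
```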


## Filtering rows by condition

Select the passengers under 18:

In [8]:
df[df["Age"] < 18].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.125,,Q
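
The expression inside the brackets, `df["Age"] < 18`, is itself a Series of `True`/`False` values, and only the rows marked `True` are kept. A missing age compares as `False`, so passengers with no recorded age are left out. The sketch below shows this on a hypothetical DataFrame `toy`.

```python
import pandas as pd

# Hypothetical toy data; one age is missing (NaN)
toy = pd.DataFrame({"Name": ["Ann", "Bob", "Cid", "Dee"],
                    "Age": [4.0, 35.0, float("nan"), 14.0]})

mask = toy["Age"] < 18     # boolean Series, one value per row
minors = toy[mask]         # keeps only the rows where mask is True
same = toy.loc[mask]       # .loc accepts the same mask

print(mask.tolist())
print(minors)
```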


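### Filtering rows by multiple conditions

Combine conditions with `&`, wrapping each one in parentheses: `df[(condition1) & (condition2)]`.

Select the adult (Age >= 18) male passengers: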

In [44]:
df[(df.Age >= 18) & (df.Sex == "male")]  # equivalent to df[(df["Age"] >= 18) & (df["Sex"] == "male")]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


DataFrames can be filtered in several ways; the most intuitive is [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-boolean).
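
Boolean conditions can also be combined with `|` (or), negated with `~`, or built with `Series.isin` to test membership in a list of values. The sketch below tries each variant on a hypothetical DataFrame `toy`. The parentheses around each comparison are needed because `&` and `|` bind more tightly than `<` or `==` in Python.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Name": ["Ann", "Bob", "Cid", "Dee"],
                    "Sex": ["female", "male", "male", "female"],
                    "Age": [4.0, 35.0, 20.0, 40.0],
                    "Pclass": [3, 1, 2, 3]})

# AND: both conditions must hold
adult_men = toy[(toy["Age"] >= 18) & (toy["Sex"] == "male")]
# OR: at least one condition must hold
child_or_first = toy[(toy["Age"] < 18) | (toy["Pclass"] == 1)]
# NOT: invert a condition
not_male = toy[~(toy["Sex"] == "male")]
# Membership: Pclass is 1 or 2
upper_classes = toy[toy["Pclass"].isin([1, 2])]

print(adult_men)
print(child_or_first)
print(not_male)
print(upper_classes)
```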

::: {.callout-tip}

These selection and filtering operations do not modify the original `df`. To keep a subset, assign it to a new variable.

:::
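
As a minimal sketch of keeping a subset, the example below stores the filtered rows of a hypothetical DataFrame `toy` in a new variable and then edits that subset. The call to `.copy()` is a precaution added here so that later edits to the subset are clearly separate from the original.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"],
                    "Age": [4.0, 35.0, 20.0]})

# Store the filtered rows in a new variable
adults = toy[toy["Age"] >= 18].copy()

# Editing the subset leaves the original untouched
adults["Age"] = adults["Age"] + 1

print(toy)
print(adults)
```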