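## Pandas Basic Commands Cheat Sheet
- Reference:
    1. [Cheat sheet](https://www.heywhale.com/mw/project/59e389b54663f7655c48f518)
- Abbreviations and library imports
    1. df --- any pandas DataFrame object
    2. s --- any pandas Series (one-dimensional labeled array) object
    3. pandas and numpy are the most basic and most central libraries for data analysis in Python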


In [94]:
import pandas as pd
import numpy as np

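### 1. Importing data

- Data can be imported in the following ways:

- pd.read_csv(filename) reads data from a CSV file
- pd.read_table(filename) reads data from delimited text (such as TSV)
- pd.read_excel(filename) reads data from an Excel file
- pd.read_sql(query, connection_object) reads data from a SQL table or database
- pd.read_json(json_string) reads data from a JSON string, a URL, or a file
- pd.read_html(url) parses the page at a URL and returns the tables it contains as a list of DataFrames
- pd.read_clipboard() reads data from the system clipboard
- pd.DataFrame(dict) builds a DataFrame from a Python dict; the keys become the column headers and the values become the column contents.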
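
The last item in the list can be sketched in a few lines. The dict contents and the names `data` and `people` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each key becomes a column header,
# each list becomes that column's values.
data = {'name': ['Ann', 'Bob'], 'age': [30, 25]}
people = pd.DataFrame(data)

print(people)
print(people.columns.tolist())  # the dict keys, in insertion order
```

All lists must have the same length; a scalar value (such as a single string) is broadcast to every row.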

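#### 1.1 pd.read_csv()

- Purpose: read data in CSV format
- Parameters:
    1. filepath_or_buffer: file path or URL (ftp is supported)
    2. sep: delimiter, default is ,
    3. header: row number of the line to use as the column names. By default the first line is used (header=0); if the file has no header line, set header=None
    4. names: column names, for example ['field1', 'field2', 'field3']
    5. index_col: the column(s) to use as row labels; it does for rows what header does for columns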

In [95]:
CSV_PATH = '../data/可视化数据集/iris.csv'
iris = pd.read_csv(CSV_PATH, sep=',')
iris[:5]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
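
The parameters listed for pd.read_csv() can be tried without a file on disk. The sketch below reads from in-memory text through io.StringIO; the CSV contents and the variable names are hypothetical and only serve to show how sep, header, names, and index_col interact.

```python
import io

import pandas as pd

# Hypothetical CSV text with a header line and ';' as the delimiter.
csv_text = "id;length;width\n1;5.1;3.5\n2;4.9;3.0\n"

# header=0 takes the column names from the first line;
# index_col='id' uses the 'id' column as the row labels.
with_header = pd.read_csv(io.StringIO(csv_text), sep=';', header=0, index_col='id')

# The same rows without a header line: header=None tells pandas that
# every line is data, and names supplies the column names.
no_header_text = "1;5.1;3.5\n2;4.9;3.0\n"
no_header = pd.read_csv(io.StringIO(no_header_text), sep=';', header=None,
                        names=['id', 'length', 'width'])

print(with_header)
print(no_header)
```

In the first frame `id` is the index, so only `length` and `width` remain as columns. In the second frame `id` is an ordinary column and the index is the default 0, 1, ... range.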


### 2. Exporting data

- [read_csv and to_csv parameters explained](https://blog.csdn.net/u010801439/article/details/80033341/)

- df.to_csv(filename) writes the DataFrame to a CSV file
- df.to_excel(filename) writes the DataFrame to an Excel file
- df.to_sql(table_name,connection_object) writes the DataFrame to a SQL table/database
- df.to_json(filename) writes the DataFrame to a JSON file

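#### 2.1 df.to_csv()

- Purpose: save a DataFrame as a CSV file
- Parameters:
    1. path_or_buf: output file path
    2. sep: delimiter
    3. na_rep: string written in place of missing values
    4. header: whether to write the column names, default True; header=False omits them
    5. index: whether to write the row index, default True
    6. columns: which columns to write (for example columns=['name'])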

In [96]:
# Write iris with ';' as the delimiter, '?' for missing values, and no header row
iris.to_csv('../to_data/iris_01.csv',
            sep=';', na_rep='?', header=False)
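
To see what these to_csv() parameters do without touching the file system, the output can be written to an in-memory buffer instead. This is a minimal sketch; the frame `demo` and the buffer names are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
demo = pd.DataFrame({'name': ['a', 'b'], 'score': [1.5, np.nan]})

# No header row, no row index, ';' as delimiter, '?' for missing values.
buffer = io.StringIO()
demo.to_csv(buffer, sep=';', na_rep='?', header=False, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)  # ['a;1.5', 'b;?']

# columns restricts the output to the listed columns.
names_only = io.StringIO()
demo.to_csv(names_only, columns=['name'], index=False)
name_lines = names_only.getvalue().splitlines()
print(name_lines)  # ['name', 'a', 'b']
```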

### 3. Creating test data

- pd.DataFrame() creates a DataFrame
    1. columns: a list that specifies the DataFrame's column names
- pd.Series() creates a Series
- Add a date index
    1. df.index = pd.date_range('2017/1/1', periods=df.shape[0])

In [97]:
# Create 5 rows by 5 columns of random data
pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'])

Unnamed: 0,a,b,c,d,e
0,0.198263,0.701095,0.419892,-0.159066,-0.17304
1,0.865532,0.320909,0.442497,-1.686126,-1.363119
2,-0.014149,0.802615,-0.815405,-1.067336,-1.562864
3,-0.127197,1.785863,-2.132177,-0.318557,0.896271
4,0.407926,0.320378,-1.093439,-0.122914,-1.334965


In [98]:
# Create a Series from an iterable
my_list = ['Lab109', 100, '大家好']
pd.Series(my_list)

0    Lab109
1       100
2       大家好
dtype: object

#### 3.1 Creating a date index

In [99]:
# Add a date index: one consecutive day per row

df_data = pd.DataFrame(np.random.randn(5, 4))
df_data.index = pd.date_range('2021/12/21', periods=df_data.shape[0])
df_data

Unnamed: 0,0,1,2,3
2021-12-21,0.207619,-1.162415,0.599829,0.616789
2021-12-22,-0.840163,-0.768702,1.694026,-1.734073
2021-12-23,-0.260813,-0.04458,-0.89838,0.517035
2021-12-24,0.408826,0.36032,-0.171546,-0.597426
2021-12-25,0.825498,-0.311212,0.806756,-0.040919


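### 4. Viewing and inspecting data

- df.head(n) shows the first n rows
- df.tail(n) shows the last n rows
- df.shape gives the number of rows and columns
- df.info() shows the DataFrame's index, dtypes, and memory usage
- df.describe() gives descriptive statistics for the numeric columns
- s.value_counts(dropna=False) counts how often each unique value occurs, including NaN
- df.apply(pd.Series.value_counts) counts how often each unique value occurs in every column of the DataFrame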

In [100]:
df = pd.DataFrame(np.random.randn(5, 5))

# Show the first n rows
df.head()  # defaults to the first 5 rows

Unnamed: 0,0,1,2,3,4
0,-0.648698,-0.910378,0.099395,-0.541676,-0.118321
1,-1.617232,0.677458,-1.838289,0.374203,0.041271
2,-0.266963,0.955691,1.163535,2.192054,-0.837062
3,0.654362,-0.917936,-0.688318,-0.244292,0.127459
4,0.22427,1.303762,0.764557,-1.162759,-1.683549


In [101]:
df.head(3)

Unnamed: 0,0,1,2,3,4
0,-0.648698,-0.910378,0.099395,-0.541676,-0.118321
1,-1.617232,0.677458,-1.838289,0.374203,0.041271
2,-0.266963,0.955691,1.163535,2.192054,-0.837062


In [102]:
# Show the last n rows
df.tail(4)

Unnamed: 0,0,1,2,3,4
1,-1.617232,0.677458,-1.838289,0.374203,0.041271
2,-0.266963,0.955691,1.163535,2.192054,-0.837062
3,0.654362,-0.917936,-0.688318,-0.244292,0.127459
4,0.22427,1.303762,0.764557,-1.162759,-1.683549


In [103]:
# Number of rows and columns
df.shape

(5, 5)

In [104]:
# Show the DataFrame's index, dtypes, and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       5 non-null      float64
 1   1       5 non-null      float64
 2   2       5 non-null      float64
 3   3       5 non-null      float64
 4   4       5 non-null      float64
dtypes: float64(5)
memory usage: 328.0 bytes


In [105]:
# Descriptive statistics for the numeric columns
# std is the standard deviation
df.describe()

Unnamed: 0,0,1,2,3,4
count,5.0,5.0,5.0,5.0,5.0
mean,-0.330852,0.221719,-0.099824,0.123506,-0.49404
std,0.871592,1.060388,1.199115,1.281987,0.765927
min,-1.617232,-0.917936,-1.838289,-1.162759,-1.683549
25%,-0.648698,-0.910378,-0.688318,-0.541676,-0.837062
50%,-0.266963,0.677458,0.099395,-0.244292,-0.118321
75%,0.22427,0.955691,0.764557,0.374203,0.041271
max,0.654362,1.303762,1.163535,2.192054,0.127459


In [106]:
s = pd.Series([1, 2, 3, 1, 1, 2, 4, np.nan, 5, 5, 5, 6, 7])
# Count how often each unique value occurs
# dropna=False includes NaN in the counts (the default, dropna=True, leaves it out)
s.value_counts(dropna=False)

1.0    3
5.0    3
2.0    2
3.0    1
4.0    1
NaN    1
6.0    1
7.0    1
dtype: int64
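
The effect of dropna is easiest to see side by side. A small sketch on a hypothetical Series `values`:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
values = pd.Series([1, 1, 2, np.nan])

# Default (dropna=True): NaN is not counted.
without_nan = values.value_counts()

# dropna=False: NaN gets its own row in the result.
with_nan = values.value_counts(dropna=False)

print(without_nan)
print(with_nan)
```

The counts in `without_nan` add up to 3 and those in `with_nan` add up to 4, the full length of the Series.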

In [107]:
# Count how often each unique value occurs in every column of the DataFrame
# apply takes a function; by default (axis=0) each column of df is passed to it as a Series
df.apply(pd.Series.value_counts)

Unnamed: 0,0,1,2,3,4
-1.838289,,,1.0,,
-1.683549,,,,,1.0
-1.617232,1.0,,,,
-1.162759,,,,1.0,
-0.917936,,1.0,,,
-0.910378,,1.0,,,
-0.837062,,,,,1.0
-0.688318,,,1.0,,
-0.648698,1.0,,,,
-0.541676,,,,1.0,


In [108]:
def func(s):
    print(s)
    return s


df_01 = pd.DataFrame([[1, 2, 3], [3, 4, 5]])
print(df_01)

df_01.apply(func)

   0  1  2
0  1  2  3
1  3  4  5
0    1
1    3
Name: 0, dtype: int64
0    2
1    4
Name: 1, dtype: int64
0    3
1    5
Name: 2, dtype: int64


Unnamed: 0,0,1,2
0,1,2,3
1,3,4,5
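
The printout above shows one Series per column, because apply works along axis=0 unless told otherwise. The sketch below contrasts that with axis=1, where each row is passed instead; the frame `demo` mirrors df_01 and the lambda is only an illustration.

```python
import pandas as pd

demo = pd.DataFrame([[1, 2, 3], [3, 4, 5]])

# axis=0 (default): the function receives each column as a Series.
col_sums = demo.apply(lambda s: s.sum())

# axis=1: the function receives each row as a Series.
row_sums = demo.apply(lambda s: s.sum(), axis=1)

print(col_sums.tolist())  # [4, 6, 8]
print(row_sums.tolist())  # [6, 12]
```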


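### 5. Selecting data

- df[col] # returns the selected column as a Series
- df[ [col1, col2] ] # returns the selected columns as a new DataFrame
- s.iloc[0] # select by position
- s.loc['index_one'] # select by index label
- df.iloc[0,:] # select the first row
- df.iloc[0,0] # select the first element of the first row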

In [109]:
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
df

Unnamed: 0,A,B,C
0,-0.338475,0.506773,0.581276
1,-0.361153,1.252743,0.012709
2,0.452931,1.223422,1.371463
3,1.04912,-2.574127,-0.053049
4,-0.088713,-0.553751,0.118154


In [110]:
df['A']  # returns the selected column as a Series

0   -0.338475
1   -0.361153
2    0.452931
3    1.049120
4   -0.088713
Name: A, dtype: float64

In [111]:
df[['A', 'B']]  # returns the selected columns as a new DataFrame

Unnamed: 0,A,B
0,-0.338475,0.506773
1,-0.361153,1.252743
2,0.452931,1.223422
3,1.04912,-2.574127
4,-0.088713,-0.553751


In [112]:
# Select by position; positions start at 0
df.iloc[0]

A   -0.338475
B    0.506773
C    0.581276
Name: 0, dtype: float64

In [113]:
df.iloc[0, :]

A   -0.338475
B    0.506773
C    0.581276
Name: 0, dtype: float64

In [114]:
# df.iloc selects by integer position, df.loc selects by index label
df.iloc[0, 0]

-0.3384752904815126

In [115]:
# A Series with string index labels
s = pd.Series(np.array(['I', 'Love', 'Data']), index=['a', 'b', 'c'])
s

a       I
b    Love
c    Data
dtype: object

In [116]:
# Select by index label
s.loc['a']

'I'
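
The difference between position-based and label-based selection shows up most clearly on a DataFrame whose index is not the default 0, 1, 2, ... range. A minimal sketch, with a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame with string row labels.
demo = pd.DataFrame({'A': [10, 20, 30], 'B': [1.5, 2.5, 3.5]},
                    index=['x', 'y', 'z'])

# Same cell, addressed by position and by label.
by_position = demo.iloc[0, 0]
by_label = demo.loc['x', 'A']

# Slices differ: iloc excludes the end position, loc includes the end label.
pos_slice = demo.iloc[0:1]       # only row 'x'
label_slice = demo.loc['x':'y']  # rows 'x' and 'y'

print(by_position, by_label)
print(pos_slice.index.tolist(), label_slice.index.tolist())
```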

### 6. Data cleaning

In [117]:
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,1.171894,-0.119856,-0.083624
1,-0.449261,-1.041448,-0.757457
2,-0.354133,-0.090132,-1.221225
3,-0.510058,0.350712,-0.821695
4,-1.100662,-1.460904,1.253864


In [118]:
# Rename the DataFrame's columns
df.columns = ['D', 'E', 'F']
df

Unnamed: 0,D,E,F
0,1.171894,-0.119856,-0.083624
1,-0.449261,-1.041448,-0.757457
2,-0.354133,-0.090132,-1.221225
3,-0.510058,0.350712,-0.821695
4,-1.100662,-1.460904,1.253864
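
Assigning to df.columns replaces every name at once and changes df in place. When only some columns need new names, DataFrame.rename with a mapping is an alternative; the sketch below uses a hypothetical frame `demo` and is not part of the original cheat sheet.

```python
import pandas as pd

demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Map old names to new ones; columns not in the mapping keep their names.
# rename returns a new DataFrame and leaves demo untouched.
renamed = demo.rename(columns={'A': 'D', 'C': 'F'})

print(renamed.columns.tolist())  # ['D', 'B', 'F']
print(demo.columns.tolist())     # ['A', 'B', 'C']
```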


In [120]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                   'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                   'C':'foo'})
df.columns = ['a', 'b', 'c']
df

Unnamed: 0,a,b,c
0,1.0,,foo
1,,4.0,foo
2,2.0,,foo
3,3.0,5.0,foo
4,6.0,9.0,foo
5,,,foo


In [121]:
# Check for missing values; returns a DataFrame of booleans (True, False) with the same shape
df.isnull()

Unnamed: 0,a,b,c
0,False,True,False
1,True,False,False
2,False,True,False
3,False,False,False
4,False,False,False
5,True,True,False


In [122]:
# Check for non-missing values; returns a DataFrame of booleans (True, False) with the same shape
df.notnull()

Unnamed: 0,a,b,c
0,True,False,True
1,False,True,True
2,True,False,True
3,True,True,True
4,True,True,True
5,False,False,True


In [123]:
# Drop the rows of the DataFrame that contain missing values
df.dropna()

Unnamed: 0,a,b,c
3,3.0,5.0,foo
4,6.0,9.0,foo


In [124]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                   'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                   'C':'foo'})
# Drop every column that contains a missing value (axis=0 means rows, axis=1 means columns)
df.dropna(axis=1)

Unnamed: 0,C
0,foo
1,foo
2,foo
3,foo
4,foo
5,foo


In [128]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                   'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                   'C':'foo'})
# thresh=n keeps a row (or, with axis=1, a column) only if it has at least n non-NA values
res = df.dropna(axis=1, thresh=1)
res

Unnamed: 0,A,B,C
0,1.0,,foo
1,,4.0,foo
2,2.0,,foo
3,3.0,5.0,foo
4,6.0,9.0,foo
5,,,foo
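
With axis=1 and thresh=1 every column survives, since each has at least one non-NA value. The effect of thresh is easier to see on rows. The sketch below rebuilds the same data under the hypothetical name `demo` and also shows the subset parameter, which limits the check to the listed columns.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1, np.nan, 2, 3, 6, np.nan],
                     'B': [np.nan, 4, np.nan, 5, 9, np.nan],
                     'C': 'foo'})

# Keep rows with at least 2 non-NA values: row 5 has only 'foo' and is dropped.
at_least_two = demo.dropna(thresh=2)

# Keep rows with at least 3 non-NA values: only the complete rows 3 and 4 remain.
at_least_three = demo.dropna(thresh=3)

# Only look at column A when deciding which rows to drop.
a_present = demo.dropna(subset=['A'])

print(at_least_two.index.tolist())    # [0, 1, 2, 3, 4]
print(at_least_three.index.tolist())  # [3, 4]
print(a_present.index.tolist())       # [0, 2, 3, 4]
```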
